# Students Do: PCA in Action

In this activity, you will use PCA to reduce the dimensions of the consumer shopping dataset from `4` to `2` features. After applying PCA, you will use the principal components data to fit a K-Means model with `k=6` and draw some conclusions.

In [1]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

In [3]:
# Load the data
file_path = Path("../Resources/shopping_data_cleaned.csv")
df = pd.read_csv(file_path)
df.head()

|   | Age | Annual Income | Spending Score | Previous Shopper |
|---|---|---|---|---|
| 0 | -0.29524 | -0.118424 | -1.625204 | 1 |
| 1 | -0.979855 | 0.05197 | -0.05306 | 1 |
| 2 | -0.009984 | 1.244733 | 0.208964 | 0 |
| 3 | -0.181138 | 0.39276 | -0.839132 | 1 |
| 4 | -0.124086 | 1.074338 | -0.577108 | 0 |


In [9]:
# If necessary, standardize the data
# Drop the "Previous Shopper" column before scaling the remaining features
df_scaled = df.drop("Previous Shopper", axis="columns")
df_scaled = StandardScaler().fit_transform(df_scaled)
print(df_scaled[:5])

[[-0.29524017 -0.11842429 -1.62520404]
 [-0.97985507  0.05197037 -0.05305987]
 [-0.00998397  1.24473298  0.20896416]
 [-0.18113769  0.39275969 -0.83913196]
 [-0.12408645  1.07433832 -0.57710793]]
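
Whether standardization is necessary can be checked directly: a standardized column has a mean of about 0 and a standard deviation of about 1. The printed rows match the values shown by `df.head()`, which suggests the file was already standardized. A minimal sketch of such a check, using only the standard library and a small hypothetical column of values (not taken from the dataset); the helper name `is_standardized` and its tolerance are assumptions:

```python
from statistics import fmean, pstdev


def is_standardized(values, tol=1e-6):
    """Return True if values have mean ~0 and population std ~1."""
    return abs(fmean(values)) < tol and abs(pstdev(values) - 1.0) < tol


# Hypothetical raw column (illustrative values only, not from the dataset)
raw = [19.0, 21.0, 35.0, 45.0, 60.0]

# Standardize the way StandardScaler does: subtract the mean,
# then divide by the population standard deviation (ddof=0)
mean, std = fmean(raw), pstdev(raw)
scaled = [(x - mean) / std for x in raw]

print(is_standardized(raw))     # checks the raw values
print(is_standardized(scaled))  # checks the standardized values
```

Applying `StandardScaler` to data that is already standardized leaves it essentially unchanged, so running the cell is harmless either way.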


In [10]:
# Initialize PCA model
pca = PCA(n_components=2)

# Get two principal components for the data.
df_pca = pca.fit_transform(df_scaled)

In [11]:
# Transform PCA data to a DataFrame
df_scaled_pca = pd.DataFrame(
    data=df_pca, columns=["principal component 1", "principal component 2"]
)
df_scaled_pca.head()

|   | principal component 1 | principal component 2 |
|---|---|---|
| 0 | -1.024876 | 1.237093 |
| 1 | -0.654009 | -0.004835 |
| 2 | -0.588746 | -1.023807 |
| 3 | -0.801846 | 0.319429 |
| 4 | -1.003417 | -0.345748 |


In [12]:
# Fetch the explained variance
pca.explained_variance_ratio_

array([0.77450478, 0.14986314])

**Sample Analysis**

According to the explained variance, the first principal component contains `77.5%` of the variance and the second principal component contains `15.0%` of the variance. Together, the two components retain `92.4%` of the information in the original dataset, so we will check whether increasing the number of principal components to 3 increases the explained variance.
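
The percentages in this analysis come from the explained variance ratios and their running total. A minimal sketch of that arithmetic, using only the standard library and the two ratios printed by `pca.explained_variance_ratio_`:

```python
from itertools import accumulate

# Explained variance ratios reported by pca.explained_variance_ratio_
ratios = [0.77450478, 0.14986314]

# Running total of the variance retained as components are added
cumulative = list(accumulate(ratios))

for i, (ratio, total) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {ratio:.1%} of the variance, {total:.1%} cumulative")
```

Inside the notebook, `pca.explained_variance_ratio_.cumsum()` gives the same running total, since the attribute is a NumPy array.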

In [14]:
# Initialize PCA model for 3 principal components
pca = PCA(n_components=3)

df_pca = pca.fit_transform(df_scaled)

In [15]:
# Transform PCA data to a DataFrame
df_scaled_pca = pd.DataFrame(
    data=df_pca,
    columns=["principal component 1", "principal component 2", "principal component 3"],
)
df_scaled_pca.head()

|   | principal component 1 | principal component 2 | principal component 3 |
|---|---|---|---|
| 0 | -1.024876 | 1.237093 | 0.402133 |
| 1 | -0.654009 | -0.004835 | -0.733404 |
| 2 | -0.588746 | -1.023807 | 0.445335 |
| 3 | -0.801846 | 0.319429 | 0.382389 |
| 4 | -1.003417 | -0.345748 | 0.613405 |


In [16]:
# Fetch the explained variance
pca.explained_variance_ratio_

array([0.77450478, 0.14986314, 0.07563208])

**Sample Analysis**

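With three principal components, we have `100%` of the information in the scaled data, because the three explained variance ratios sum to 1. This is expected: the scaled data has only three features (`Age`, `Annual Income`, and `Spending Score`), so three principal components preserve all of the variance without reducing the number of dimensions.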
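
One way to choose the number of components is to pick the smallest number whose cumulative explained variance reaches a target. A minimal sketch using the three ratios printed by `pca.explained_variance_ratio_`; the helper name `components_needed` and the 90% target are assumptions for illustration, not part of the activity:

```python
from itertools import accumulate


def components_needed(ratios, target):
    """Return the smallest number of leading components whose
    cumulative explained variance ratio reaches `target`."""
    for count, total in enumerate(accumulate(ratios), start=1):
        if total >= target:
            return count
    raise ValueError("target is not reachable with the given ratios")


# Ratios reported by pca.explained_variance_ratio_ for three components
ratios = [0.77450478, 0.14986314, 0.07563208]

# Hypothetical 90% target: the first two components already exceed it
print(components_needed(ratios, 0.90))
```

scikit-learn's `PCA` also accepts a fraction between 0 and 1 for `n_components` (for example `PCA(n_components=0.90)`), which selects the number of components by the same kind of variance threshold.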

In [10]:
# Initialize the K-Means model
# random_state=0 is an assumed seed, added only to make the clusters reproducible
model = KMeans(n_clusters=6, random_state=0)

# Fit the model
model.fit(df_scaled_pca)

# Predict clusters
predictions = model.predict(df_scaled_pca)

# Add the predicted class columns
df_scaled_pca["class"] = predictions


In [20]:
# BONUS: plot the 3 principal components
import plotly.express as px
fig = px.scatter_3d(
    df_scaled_pca,
    x="principal component 3",
    y="principal component 2",
    z="principal component 1",
    color="class",
    symbol="class",
    width=800,
)
fig.update_layout(legend=dict(x=0, y=1))
fig.show()