# Students Do: PCA in Action

In this activity, you will use PCA to reduce the dimensions of the consumer shopping dataset from `4` features to `2` principal components. After applying PCA, you will use the principal components data to fit a K-Means model with `k=5` and draw some conclusions.

In [1]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

In [2]:
# Load the data
file_path = Path("../Resources/shopping_data_cleaned.csv")
df_shopping = pd.read_csv(file_path)
df_shopping.head()

| | Genre | Age | Annual Income | Spending Score (1-100) |
|---|---|---|---|---|
| 0 | 1 | 19 | 15.0 | 39 |
| 1 | 1 | 21 | 15.0 | 81 |
| 2 | 0 | 20 | 16.0 | 6 |
| 3 | 0 | 23 | 16.0 | 77 |
| 4 | 0 | 31 | 17.0 | 40 |


In [3]:
# Standardize the data
shopping_scaled = StandardScaler().fit_transform(df_shopping)
print(shopping_scaled[0:5])

[[ 1.12815215 -1.42456879 -1.73899919 -0.43480148]
 [ 1.12815215 -1.28103541 -1.73899919  1.19570407]
 [-0.88640526 -1.3528021  -1.70082976 -1.71591298]
 [-0.88640526 -1.13750203 -1.70082976  1.04041783]
 [-0.88640526 -0.56336851 -1.66266033 -0.39597992]]
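To make the standardization step concrete, here is a minimal pure-Python sketch of the per-column calculation that `StandardScaler` performs: subtract the column mean, then divide by the population standard deviation. It uses only the five `Age` values visible in the `head()` output, which is an assumption made for illustration, so the results will not match the scaled values printed above (those use the mean and standard deviation of the full dataset).

```python
from statistics import fmean, pstdev

# Illustrative values only: the Age column from the five rows shown by head().
# The notebook scales the full dataset, so these results differ from its output.
ages = [19, 21, 20, 23, 31]

mean = fmean(ages)
std = pstdev(ages)  # population standard deviation (divides by n, not n - 1)

# Standardize: each value becomes its distance from the mean in units of std.
ages_scaled = [(age - mean) / std for age in ages]
print([round(value, 3) for value in ages_scaled])
```

After scaling, the values have mean 0 and standard deviation 1, which keeps a wide-ranging column such as income from dominating the principal components.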


In [4]:
# Initialize PCA model
pca = PCA(n_components=2)

# Get two principal components for the data.
shopping_pca = pca.fit_transform(shopping_scaled)

In [5]:
# Transform PCA data to a DataFrame
df_shopping_pca = pd.DataFrame(
    data=shopping_pca, columns=["principal component 1", "principal component 2"]
)
df_shopping_pca.head()

| | principal component 1 | principal component 2 |
|---|---|---|
| 0 | -0.406383 | -0.520714 |
| 1 | -1.427673 | -0.36731 |
| 2 | 0.050761 | -1.894068 |
| 3 | -1.694513 | -1.631908 |
| 4 | -0.313108 | -1.810483 |


In [6]:
# Fetch the explained variance
pca.explained_variance_ratio_

array([0.33690046, 0.26230645])

**Sample Analysis**

According to the explained variance, the first principal component captures `33.7%` of the variance and the second captures `26.2%`. Together, the two components retain only `59.9%` of the information in the original dataset, so we will check whether increasing the number of principal components to 3 raises the explained variance enough.
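The `59.9%` figure is the running total of the individual ratios. A small sketch of that arithmetic, using the two values copied from the `explained_variance_ratio_` output above (the variable names are illustrative):

```python
from itertools import accumulate

# Ratios copied from the pca.explained_variance_ratio_ output above.
explained_variance_ratio = [0.33690046, 0.26230645]

# Running total: how much variance the first n components retain together.
cumulative = list(accumulate(explained_variance_ratio))

for n_components, total in enumerate(cumulative, start=1):
    print(f"{n_components} component(s): {total:.1%} of the variance")
```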

In [7]:
# Initialize PCA model for 3 principal components
pca = PCA(n_components=3)

# Get three principal components for the shopping data.
shopping_pca = pca.fit_transform(shopping_scaled)

In [8]:
# Transform PCA data to a DataFrame
df_shopping_pca = pd.DataFrame(
    data=shopping_pca,
    columns=["principal component 1", "principal component 2", "principal component 3"],
)
df_shopping_pca.head()

| | principal component 1 | principal component 2 | principal component 3 |
|---|---|---|---|
| 0 | -0.406383 | -0.520714 | -2.072527 |
| 1 | -1.427673 | -0.36731 | -2.277644 |
| 2 | 0.050761 | -1.894068 | -0.367375 |
| 3 | -1.694513 | -1.631908 | -0.717467 |
| 4 | -0.313108 | -1.810483 | -0.42646 |


In [9]:
# Fetch the explained variance
pca.explained_variance_ratio_

array([0.33690046, 0.26230645, 0.23260639])

**Sample Analysis**

With three principal components, we retain `83.2%` of the information in the original dataset. We therefore conclude that three principal components preserve most of the variance, and we use them to fit the K-Means model.
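The same decision can be written as a small helper that returns the smallest number of leading components whose cumulative explained variance reaches a target. This is only a sketch: the helper name `components_needed` and the `80%` target are assumptions chosen for illustration, not part of the activity, and the ratios are copied from the output above.

```python
from itertools import accumulate


def components_needed(ratios, target):
    """Return the smallest number of leading components whose cumulative
    explained variance reaches `target`, or None if it is never reached."""
    for n_components, total in enumerate(accumulate(ratios), start=1):
        if total >= target:
            return n_components
    return None


# Ratios copied from the three-component explained_variance_ratio_ output.
ratios = [0.33690046, 0.26230645, 0.23260639]

# Hypothetical target: keep at least 80% of the variance.
print(components_needed(ratios, target=0.80))
```

With these ratios, two components fall short of the assumed target and three components meet it, which matches the conclusion above.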

In [10]:
# Initialize the K-Means model
model = KMeans(n_clusters=5, random_state=0)

# Fit the model
model.fit(df_shopping_pca)

# Predict clusters
predictions = model.predict(df_shopping_pca)

# Add the predicted class column
df_shopping_pca["class"] = model.labels_
df_shopping_pca.head()

| | principal component 1 | principal component 2 | principal component 3 | class |
|---|---|---|---|---|
| 0 | -0.406383 | -0.520714 | -2.072527 | 3 |
| 1 | -1.427673 | -0.36731 | -2.277644 | 3 |
| 2 | 0.050761 | -1.894068 | -0.367375 | 0 |
| 3 | -1.694513 | -1.631908 | -0.717467 | 0 |
| 4 | -0.313108 | -1.810483 | -0.42646 | 0 |
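One common way to sanity-check the number of clusters is to compare the K-Means inertia (within-cluster sum of squared distances) across several values of `k` and look for the point where the improvement levels off. The sketch below is not part of the original activity. It runs on synthetic data from `make_blobs` as a stand-in, and the range of `k` values is an assumption. In the notebook you would pass the principal component columns instead, without the `class` column.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption): 200 points in 3 dimensions,
# playing the role of the three principal components.
X, _ = make_blobs(n_samples=200, centers=5, n_features=3, random_state=0)

# Fit one K-Means model per candidate k and record its inertia.
inertias = {}
for k in range(1, 11):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0)
    candidate.fit(X)
    inertias[k] = candidate.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Inertia generally falls as `k` grows, so the useful signal is where the drop starts to flatten rather than the minimum itself.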


In [11]:
# BONUS: plot the 3 principal components
# import plotly.express as px
# fig = px.scatter_3d(
#     df_shopping_pca,
#     x="principal component 3",
#     y="principal component 2",
#     z="principal component 1",
#     color="class",
#     symbol="class",
#     width=800,
# )
# fig.update_layout(legend=dict(x=0, y=1))
# fig.show()