# Students Do: PCA in Action

In this activity, you will use PCA to reduce the dimensions of the consumer shopping dataset from `4` to `2` features. After applying PCA, you will use the principal components data to fit a K-Means model with `k=6` and draw some conclusions.

In [None]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

In [None]:
# Load the data
file_path = Path("./Resources/shopping_data_cleaned.csv")
df_shopping = pd.read_csv(file_path)
df_shopping.head()

In [None]:
# Standardize the data
shopping_scaled = StandardScaler().fit_transform(df_shopping)
print(shopping_scaled[0:5])

In [None]:
# Initialize PCA model
pca = PCA(n_components=2)

# Get two principal components for the data.
shopping_pca = pca.fit_transform(shopping_scaled)

In [None]:
# Transform PCA data to a DataFrame
df_shopping_pca = pd.DataFrame(
    data=shopping_pca, columns=["principal component 1", "principal component 2"]
)
df_shopping_pca.head()

In [None]:
# Fetch the explained variance
pca.explained_variance_ratio_

**Sample Analysis**

According to the explained variance, the first principal component contains `33.7%` of the variance and the second principal component contains `26.2%`. Together, the two components retain only `59.9%` of the information in the original dataset, so we will check whether increasing the number of principal components to 3 raises the explained variance to a more useful level.
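The running total behind this analysis can be reproduced with a few lines of plain Python. This is only a sketch: the two ratios are hard-coded from the figures quoted in the analysis, whereas in the notebook they would come from `pca.explained_variance_ratio_` (where `np.cumsum` would do the same job).

```python
from itertools import accumulate

# Ratios copied from the analysis text (33.7% and 26.2%); in the notebook
# these values would be read from pca.explained_variance_ratio_.
explained_variance_ratio = [0.337, 0.262]

# Running total of explained variance as components are added.
cumulative = list(accumulate(explained_variance_ratio))

for i, (ratio, total) in enumerate(zip(explained_variance_ratio, cumulative), start=1):
    print(f"PC{i}: {ratio:.1%} of variance, {total:.1%} cumulative")
```

The last cumulative value is the share of the original variance that the retained components keep.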

In [None]:
# Initialize PCA model for 3 principal components
pca = PCA(n_components=3)

# Get three principal components for the shopping data.
shopping_pca = pca.fit_transform(shopping_scaled)

In [None]:
# Transform PCA data to a DataFrame
df_shopping_pca = pd.DataFrame(
    data=shopping_pca,
    columns=["principal component 1", "principal component 2", "principal component 3"],
)
df_shopping_pca.head()

In [None]:
# Fetch the explained variance
pca.explained_variance_ratio_

**Sample Analysis**

With three principal components, we retain `83.1%` of the information in the original dataset. We therefore conclude that three principal components preserve enough of the variance to use them as the input for clustering.
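As an alternative to trying component counts one at a time, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and then keeps the fewest components whose cumulative explained variance exceeds that fraction. The sketch below assumes a synthetic 200-row, 4-feature array as a stand-in for the shopping data, and the `0.80` threshold and the `demo_*` names are illustrative choices, not part of the activity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the shopping data (assumption: 200 rows, 4 features).
rng = np.random.default_rng(0)
demo_data = rng.normal(size=(200, 4))
demo_scaled = StandardScaler().fit_transform(demo_data)

# A float n_components asks PCA to keep the fewest components whose
# cumulative explained variance exceeds that fraction of the total.
pca_demo = PCA(n_components=0.80, svd_solver="full")
demo_pca = pca_demo.fit_transform(demo_scaled)

print("Components kept:", pca_demo.n_components_)
print("Variance retained:", pca_demo.explained_variance_ratio_.sum())
```

On the real data, the same call would replace the manual comparison of two versus three components, at the cost of not knowing the component count in advance.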

In [None]:
# Initialize the K-Means model
model = KMeans(n_clusters=6, random_state=0)

# Fit the model
model.fit(df_shopping_pca)

# Predict clusters
predictions = model.predict(df_shopping_pca)

# Add the predicted cluster labels as a new column
df_shopping_pca["class"] = predictions
df_shopping_pca.head()

In [None]:
# BONUS: plot the 3 principal components
# import plotly.express as px
# fig = px.scatter_3d(
#     df_shopping_pca,
#     x="principal component 3",
#     y="principal component 2",
#     z="principal component 1",
#     color="class",
#     symbol="class",
#     width=800,
# )
# fig.update_layout(legend=dict(x=0, y=1))
# fig.show()