# Welcome to the Challenge Task


After successfully clustering the e-books, BookzOn is now asking for a more in-depth analysis of the results. They want to visualize the features and clusters for easier interpretation and insights. Your challenge is to apply PCA (Principal Component Analysis) to reduce the dimensionality of the embedding vectors and then visualize the clustered e-books in a 2D space.

The task involves:

1.   Apply PCA to reduce the embedding vectors to 2 principal components.
2.   Visualize the 2D projection of the e-books and highlight the clusters.

This will help BookzOn visually assess how distinct the clusters are and how the e-books relate to one another.

Good luck! 🍀

— Ahmad


Run the following block to install the `pyspark` module.

In [None]:
!pip install pyspark

----
Run the following block to import the necessary modules, load the data, and cluster it with K-means.

**Do not forget to upload `challenge_task_data.csv` to the Google Colab environment.**

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws
import pandas as pd
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType
from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.clustering import KMeans
import plotly.express as px

# Create the spark session
spark = SparkSession.builder \
    .appName("myApp") \
    .getOrCreate()
# load the data
df = pd.read_csv("challenge_task_data.csv")

# store the feature names
features_column_names = [f"feature_{i}" for i in range(1, 513)]

# update the column names
df.columns = features_column_names

# convert the pandas dataframe to a spark dataframe
data = spark.createDataFrame(df)

# assemble the features
assembler = VectorAssembler(inputCols=features_column_names, outputCol="features")
data = assembler.transform(data)

# select the features
data = data.select("features")

# cluster the data into 3 groups
kmeans = KMeans(k=3,
                seed=1,
                featuresCol="features",
                predictionCol="cluster")
model = kmeans.fit(data)
clustered_data = model.transform(data)
clustered_data.show()

Apply PCA to reduce the `features` vector to 2 principal components. Call the output column `pcaFeatures`.
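As a rough illustration of what a 2-component PCA projection computes, here is a minimal NumPy sketch on toy data. It is not the PySpark solution to this step: the matrix `X`, its shape, and the variable names are assumptions made for the example, and the centering shown here may differ from how a given library applies the projection.

```python
import numpy as np

# Toy stand-in for the embedding matrix: 6 items, 5 features
# (an assumption; the notebook's data has 512 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))

# Center each feature, then take the SVD of the centered matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# The first two right singular vectors are the top-2 principal directions.
components = Vt[:2]                      # shape (2, 5)
projected = X_centered @ components.T    # shape (6, 2)

# Fraction of the total variance captured by each kept component.
explained = (S[:2] ** 2) / np.sum(S ** 2)
print(projected.shape, explained)
```

The two columns of `projected` play the role of the two values stored per row in `pcaFeatures`.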

Convert the PySpark DataFrame into a pandas DataFrame and call it `pca_df`.

Extract `x` and `y` from the PCA features.
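One possible way to split a column of 2-element vectors into `x` and `y` columns is sketched below. The toy `pca_df` is a hypothetical stand-in built from NumPy arrays; it assumes each `pcaFeatures` entry can be indexed by position, which may need adjusting for the objects your conversion actually produces.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pca_df: each pcaFeatures entry is a length-2 array
# (an assumption standing in for the vectors produced by the PCA step).
pca_df = pd.DataFrame({
    "pcaFeatures": [np.array([0.5, -1.2]), np.array([2.0, 0.3]), np.array([-0.7, 0.9])],
    "cluster": [0, 1, 0],
})

# Index into each vector to pull out the first and second components.
pca_df["x"] = pca_df["pcaFeatures"].apply(lambda v: float(v[0]))
pca_df["y"] = pca_df["pcaFeatures"].apply(lambda v: float(v[1]))

print(pca_df[["x", "y", "cluster"]])
```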

Use Plotly Express to visualize the clusters in a 2D scatter plot.

In [None]:
# Create the base cluster plot
fig = px.scatter(
    pca_df,
    x='x',
    y='y',
    opacity=0.6,
    size_max=4,
    # cast cluster labels to strings so they are treated as discrete categories
    color=pca_df['cluster'].astype(str),
    title='2D Visualization of E-book Clusters',
    labels={'x': 'PCA Component 1', 'y': 'PCA Component 2'},
    # k=3 clusters, listed as strings to match the color values
    category_orders={'cluster': ['0', '1', '2']},
)

# Update layout to add legend title and adjust plot settings
fig.update_layout(legend_title_text='Clusters', legend=dict(x=1, y=1), width=600, height=500)