<a href="https://colab.research.google.com/github/Gee1225/projects/blob/main/The_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Welcome to the Notebook**

### Task 1 - Set up the project

Installing the needed modules.

In [1]:

!pip install openai==1.14.3 python-dotenv pyspark


Collecting openai==1.14.3
  Using cached openai-1.14.3-py3-none-any.whl.metadata (20 kB)
Using cached openai-1.14.3-py3-none-any.whl (262 kB)
Installing collected packages: openai
Successfully installed openai


Importing the modules

In [2]:
from dotenv import load_dotenv
import os
import pandas as pd
import numpy as np

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.clustering import KMeans
import plotly.express as px

Set up the OpenAI API

In [3]:
# !pip uninstall -y openai
!pip install --no-cache-dir openai==1.85.0

Collecting openai==1.85.0
  Downloading openai-1.85.0-py3-none-any.whl.metadata (25 kB)
Downloading openai-1.85.0-py3-none-any.whl (730 kB)
Installing collected packages: openai
Successfully installed openai


In [4]:
!pip install --upgrade openai
import os
from openai import OpenAI
from dotenv import load_dotenv

# Load API key
load_dotenv(dotenv_path='apikey.env.txt')
api_key = os.getenv("APIKEY")

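# DEBUG: Confirm the key loaded without exposing it (notebook outputs are saved with the file)
print("API Key:", (api_key[:8] + "...[redacted]") if api_key else None)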

# Check if API key was loaded
if not api_key:
    raise ValueError("APIKEY not found in environment!")

# Create client (for v1.x OpenAI SDK)
client = OpenAI(api_key=api_key)

# Try a simple call to list models
models = client.models.list()
print(models)

Collecting openai
  Using cached openai-1.86.0-py3-none-any.whl.metadata (25 kB)
Using cached openai-1.86.0-py3-none-any.whl (730 kB)
Installing collected packages: openai
Successfully installed openai
API Key: sk-proj-...[redacted]
SyncPage[Model](data=[Model(id='text-embedding-ada-002', created=1671217299, object='model', owned_by='openai-internal'), Model(id='whisper-1', created=1677532384, object='model', owned_by='openai-internal'), Model(id='gpt-3.5-turbo', created=1677610602, object='model', owned_by='openai'), Model(id='tts-1', created=1681940951, object='model', owned_by='openai-internal'), Model(id='gpt-3.5-turbo-16k', created=1683758102, object='model', owned_by='openai-internal'), Model(id='davinci-002', created=1692634301, object='model', owned_by='system'), Model(id='babbage-002', created=1692634615, object='model', own
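
Because notebook outputs are saved together with the file, it is safer not to print a full API key when checking that it loaded. A minimal sketch, assuming a hypothetical helper `mask_secret` (not part of the OpenAI SDK or python-dotenv), prints only a short prefix:

```python
import os

def mask_secret(value, visible=8):
    """Return a short prefix of a secret followed by a redaction marker."""
    if not value:
        return "<missing>"
    return value[:visible] + "...[redacted]"

# APIKEY is the variable name this notebook reads from its .env file.
print("API Key:", mask_secret(os.getenv("APIKEY")))
```

If a key has already appeared in saved output, revoking it and issuing a new one is the usual remedy.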

Create a Spark session

In [5]:
spark = SparkSession.builder.appName("ProductClustering").getOrCreate()
spark

Loading the dataset

In [6]:
FilePath = "/content/products_dataset.csv"
# inferSchema=True infers column types instead of reading every column as a string;
# samplingRatio=1 uses all rows for the inference.
df = spark.read.csv(FilePath, header=True, inferSchema=True, samplingRatio=1)
df.show()

+----------+--------------------+--------------------+
|product_id|               title|         description|
+----------+--------------------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|
|        P1|Turmode 30 ft. RP...|If you need more ...|
|        P2|Large Tapestry Bo...|Polyester cover r...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|
|        P4|Men's Crazy Horse...|This 9 in. black ...|
|        P5|Mariana 6 ft. Mul...|With robust struc...|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|
|        P7|7/8 in. x 4-1/2 i...|DEWALT High Perfo...|
|        P8|  Ring Gold Bar Cart|This Ring Bar Car...|
|        P9|Traditional Silve...|This transitional...|
|       P10|15 in. x 59 in. O...|Its easy to add a...|
|       P11|1 qt. #350F-7 Wil...|BEHR PREMIUM PLUS...|
|       P12|Anthracite Cordle...|BlindsAvenue ligh...|
|       P13|SlimGrip 78-Inch ...|Luverne SlimGrip ...|
|       P14|6 in. x 28 in. x ...|Our Rustic Collec...|
|       P1

List of 8 products recently viewed by the user.

In [7]:
recently_viewed_products = [
    'P316',
    'P333',
    'P1115',
    'P1691',
    'P1082',
    'P397',
    'P1441',
    'P1054',
]

### Task 2 - Prepare the dataset

Combine `title` and `description` Columns

In [8]:
df=df.withColumn("combined_text", concat_ws(" ", "title", "description"))
df.show()

+----------+--------------------+--------------------+--------------------+
|product_id|               title|         description|       combined_text|
+----------+--------------------+--------------------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|Men's 3X Large Ca...|
|        P1|Turmode 30 ft. RP...|If you need more ...|Turmode 30 ft. RP...|
|        P2|Large Tapestry Bo...|Polyester cover r...|Large Tapestry Bo...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|16-Gauge-Sinks Ve...|
|        P4|Men's Crazy Horse...|This 9 in. black ...|Men's Crazy Horse...|
|        P5|Mariana 6 ft. Mul...|With robust struc...|Mariana 6 ft. Mul...|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|5 gal. #650C-2 Po...|
|        P7|7/8 in. x 4-1/2 i...|DEWALT High Perfo...|7/8 in. x 4-1/2 i...|
|        P8|  Ring Gold Bar Cart|This Ring Bar Car...|Ring Gold Bar Car...|
|        P9|Traditional Silve...|This transitional...|Traditional Silve...|
|       P10|

Get the `combined_text` column and convert it into a list

In [9]:
list_combined_text=df.select("combined_text").rdd.flatMap(lambda x : x).collect()
print(list_combined_text[:4])

["Men's 3X Large Carbon Heather Cotton/Polyester Rain Defender Paxton Heavyweight Hooded Zip-Front Sweatshirt This heavyweight, water-repellent hooded sweatshirt has a zip front for fast layering. ORIGINAL FIT. 13 oz., 75% cotton/25% polyester blend with Rain Defender durable water repellent. Attached, jersey-lined three-piece hood with drawcord closure. Antique-finish brass front zipper. Two front hand-warmer pockets have a hidden security pocket inside. Stretchable, spandex-reinforced rib-knit cuffs and waistband. Locker loop facilitates hanging.", "Turmode 30 ft. RP TNC Female to RP TNC Male Adapter Cable If you need more length between your existing wireless device and Hi-Gain Antenna, this is the product for you. It's compatible with most Wi-Fi Antennas, so it is easy for you to extend your wireless network. Just replace your existing cable that runs between your wireless device and Antenna and you're ready to use your network with extended range.", 'Large Tapestry Bolster Bed Pol

Use OpenAI text embedding model to create the vector embeddings.

In [10]:
embedding_vectors = []
batch_size = 250  # You can adjust the batch size based on your needs and API limits

for i in range(0, len(list_combined_text), batch_size):
    batch = list_combined_text[i : i + batch_size]
    response = client.embeddings.create(
        input=batch,
        model="text-embedding-3-small",
        encoding_format="float",
        dimensions=512
    )
    embedding_vectors.extend([d.embedding for d in response.data])

# embedding_vectors[:2]
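
The loop above sends the texts in slices of 250 rather than in one request. The slicing itself can be sketched in isolation, without calling the API; `iter_batches` is a hypothetical helper name and the toy strings stand in for `list_combined_text`:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy stand-in for list_combined_text: 1,000 short strings.
texts = [f"product {i}" for i in range(1000)]
batches = list(iter_batches(texts, 250))
print(len(batches), [len(b) for b in batches])  # 4 [250, 250, 250, 250]
```

The last batch is simply shorter when the list length is not a multiple of the batch size, so no item is dropped.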

Let's put the embedding vectors into our original dataframe

Convert the list of embedding vectors into a PySpark DataFrame

In [11]:
features_column_names = [f"embedding_{i}" for i in range(len(embedding_vectors[0]))]

In [12]:
embeddings_df = spark.createDataFrame(embedding_vectors, schema=features_column_names)
embeddings_df.show()


Add a unique `row_id` to each row in the PySpark dataframe

In [13]:
# With a single partition the generated IDs are consecutive (0, 1, 2, ...), so they can serve as the join key
embeddings_df = embeddings_df.repartition(1).withColumn("row_id", F.monotonically_increasing_id())
embeddings_df.show()


Add a unique `row_id` to each row in our main PySpark dataframe `df`

In [14]:
df=df.repartition(1).withColumn("row_id", F.monotonically_increasing_id())
df.show()

+----------+--------------------+--------------------+--------------------+------+
|product_id|               title|         description|       combined_text|row_id|
+----------+--------------------+--------------------+--------------------+------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|Men's 3X Large Ca...|     0|
|        P1|Turmode 30 ft. RP...|If you need more ...|Turmode 30 ft. RP...|     1|
|        P2|Large Tapestry Bo...|Polyester cover r...|Large Tapestry Bo...|     2|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|16-Gauge-Sinks Ve...|     3|
|        P4|Men's Crazy Horse...|This 9 in. black ...|Men's Crazy Horse...|     4|
|        P5|Mariana 6 ft. Mul...|With robust struc...|Mariana 6 ft. Mul...|     5|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|5 gal. #650C-2 Po...|     6|
|        P7|7/8 in. x 4-1/2 i...|DEWALT High Perfo...|7/8 in. x 4-1/2 i...|     7|
|        P8|  Ring Gold Bar Cart|This Ring Bar Car...|Ring Gold Bar Car...|     8|
|   

Let's join the two dataframes

In [15]:
df = df.join(embeddings_df, on="row_id", how="inner").drop("row_id")
df.show()

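
The join relies on both DataFrames receiving the same `row_id` for the same position. The idea can be sketched in plain Python with toy data (the product rows and two-dimensional vectors below are made up for illustration), where the index from `enumerate` plays the role of `row_id`:

```python
# Toy stand-ins for the product rows and their embedding vectors (illustrative values).
products = [{"product_id": "P0"}, {"product_id": "P1"}, {"product_id": "P2"}]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Key each vector by its position, mirroring the row_id column.
vectors_by_row_id = dict(enumerate(vectors))

# Inner join on the positional key: keep only rows that have a matching vector.
joined = [
    {**row, "embedding": vectors_by_row_id[row_id]}
    for row_id, row in enumerate(products)
    if row_id in vectors_by_row_id
]
print(joined[0])  # {'product_id': 'P0', 'embedding': [0.1, 0.2]}
```

The pairing is only correct if both sides keep their original order, so it is worth checking that the row counts match and spot-checking a few rows after the join.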

### Task 3 - Cluster products using K-means

Assemble the 512 Embedding Columns into a Single 'features' Column

In [16]:
assembler = VectorAssembler(inputCols=features_column_names, outputCol="features")
data = assembler.transform(df)
data=data.select(["product_id", "title", "description","features"])
data.show()

+----------+--------------------+--------------------+--------------------+
|product_id|               title|         description|            features|
+----------+--------------------+--------------------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|[0.04266477,0.020...|
|        P1|Turmode 30 ft. RP...|If you need more ...|[0.04413316,0.009...|
|        P2|Large Tapestry Bo...|Polyester cover r...|[0.042361606,-0.0...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|[-0.049733717,-0....|
|        P4|Men's Crazy Horse...|This 9 in. black ...|[0.026085882,0.04...|
|        P5|Mariana 6 ft. Mul...|With robust struc...|[0.06058844,0.042...|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|[0.0074977363,-0....|
|        P7|7/8 in. x 4-1/2 i...|DEWALT High Perfo...|[-0.021779666,-0....|
|        P8|  Ring Gold Bar Cart|This Ring Bar Car...|[-4.6415857E-4,0....|
|        P9|Traditional Silve...|This transitional...|[0.039088096,1.85...|
|       P10|

Apply K-Means Clustering with 5 Clusters on the `features` Column

In [17]:
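# Use a lowercase name so the KMeans class is not overwritten (re-running the cell would otherwise fail)
kmeans = KMeans(featuresCol="features", k=5, predictionCol="cluster")
kmeans_model = kmeans.fit(data)
clustered_data = kmeans_model.transform(data)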
clustered_data.show()

+----------+--------------------+--------------------+--------------------+-------+
|product_id|               title|         description|            features|cluster|
+----------+--------------------+--------------------+--------------------+-------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|[0.04266477,0.020...|      4|
|        P1|Turmode 30 ft. RP...|If you need more ...|[0.04413316,0.009...|      4|
|        P2|Large Tapestry Bo...|Polyester cover r...|[0.042361606,-0.0...|      3|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|[-0.049733717,-0....|      0|
|        P4|Men's Crazy Horse...|This 9 in. black ...|[0.026085882,0.04...|      4|
|        P5|Mariana 6 ft. Mul...|With robust struc...|[0.06058844,0.042...|      1|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|[0.0074977363,-0....|      2|
|        P7|7/8 in. x 4-1/2 i...|DEWALT High Perfo...|[-0.021779666,-0....|      4|
|        P8|  Ring Gold Bar Cart|This Ring Bar Car...|[-4.6415857E-4,0....| 
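
To make the `cluster` column concrete: K-means assigns each vector to the centroid nearest to it, by Euclidean distance under Spark's default setting. A minimal pure-Python sketch of that assignment step, using made-up two-dimensional points and centroids rather than the fitted model's values:

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Illustrative centroids and points; the real ones have 512 dimensions.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
points = [[0.1, 0.1], [0.9, 0.8], [0.1, 0.9]]
labels = [nearest_centroid(p, centroids) for p in points]
print(labels)  # [0, 1, 2]
```

Training alternates this step with recomputing each centroid as the mean of its assigned points; the sketch shows only the assignment half.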

### Task 4 - Visualize the clusters

Let's reduce the dimensionality of our features for visualization purposes

`512 dimensions => 2 dimensions`

In [18]:
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
# Separate name so the fitted PCA model is not confused with the K-means model
pca_model = pca.fit(clustered_data)
pca_results = pca_model.transform(clustered_data)
pca_results.show()

+----------+--------------------+--------------------+--------------------+-------+--------------------+
|product_id|               title|         description|            features|cluster|        pca_features|
+----------+--------------------+--------------------+--------------------+-------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|[0.04266477,0.020...|      4|[0.18866925391377...|
|        P1|Turmode 30 ft. RP...|If you need more ...|[0.04413316,0.009...|      4|[-0.1739714733377...|
|        P2|Large Tapestry Bo...|Polyester cover r...|[0.042361606,-0.0...|      3|[-0.0202194626435...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|[-0.049733717,-0....|      0|[0.00736958083395...|
|        P4|Men's Crazy Horse...|This 9 in. black ...|[0.026085882,0.04...|      4|[-0.0240268950362...|
|        P5|Mariana 6 ft. Mul...|With robust struc...|[0.06058844,0.042...|      1|[-0.0014852463193...|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|[

In [19]:
pca_df = pca_results.select("product_id",  "cluster", "pca_features").toPandas()
pca_df[['x', 'y']] = pd.DataFrame(pca_df.pca_features.tolist(), index=pca_df.index)
pca_df.head()

| | product_id | cluster | pca_features | x | y |
|---|---|---|---|---|---|
| 0 | P0 | 4 | [0.1886692539137776, 0.034299116560300304] | 0.188669 | 0.034299 |
| 1 | P1 | 4 | [-0.17397147333773288, -0.1330088695852248] | -0.173971 | -0.133009 |
| 2 | P2 | 3 | [-0.020219462643530285, 0.3163356600743902] | -0.020219 | 0.316336 |
| 3 | P3 | 0 | [0.007369580833953067, 0.05935625808512694] | 0.00737 | 0.059356 |
| 4 | P4 | 4 | [-0.024026895036207738, -0.04420924497209652] | -0.024027 | -0.044209 |


Let's plot the Clusters

In [20]:
def plot_clusters(pca_df, num_clusters=5):
    """
    Plots a 2D visualization of clusters using Plotly Express.

    Parameters:
    - pca_df (DataFrame): A Pandas DataFrame containing columns 'x', 'y', and 'cluster'.
      'x' and 'y' are the 2D PCA components, and 'cluster' indicates the cluster label.
    - num_clusters (int): The number of unique clusters to display.

    This function creates an interactive scatter plot where each point is colored according to its cluster.

    Returns:
    - fig (Figure): The Plotly figure object for the plot.
    """

    # Create the base cluster plot
    fig = px.scatter(
        pca_df,
        x='x',
        y='y',
        opacity=0.6,
        size_max=4,
        color=pca_df.cluster.astype(str),
        title='2D Visualization of Clusters',
        labels={'x': 'PCA Component 1', 'y': 'PCA Component 2'},
        # cluster labels are strings here, so the category order must use strings too
        category_orders={'cluster': [str(i) for i in range(num_clusters)]},
        # show the product id in the tooltip
        hover_data={'product_id': True}

    )

    # Update layout to add legend title and adjust plot settings
    fig.update_layout(legend_title_text='Clusters', legend=dict(x=1, y=1), width=600, height=500)

    return fig

fig = plot_clusters(pca_df)
fig.show()

### Task 5 - Highlight recently viewed products

In [21]:
print("The user has recently viewed the following products: ", recently_viewed_products)

The user has recently viewed the following products:  ['P316', 'P333', 'P1115', 'P1691', 'P1082', 'P397', 'P1441', 'P1054']


Let's have a look at the records in our `clustered_data` dataframe related to the recently viewed products.

In [25]:
filtered_df=clustered_data.where(F.col("product_id").isin(recently_viewed_products))
unique_clusters=filtered_df.select("cluster").distinct().rdd.flatMap(lambda x:x).collect()
unique_clusters
# filtered_df.show()

[3, 2]
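
The cluster plot from Task 4 does not mark the recently viewed products. One possible way to prepare that, sketched in plain Python with made-up coordinates, is to split the plotted records by `product_id`; the helper name `split_viewed` and the sample records are assumptions for illustration:

```python
def split_viewed(records, viewed_ids):
    """Separate plot records into recently viewed ones and the rest."""
    viewed_ids = set(viewed_ids)
    viewed = [r for r in records if r["product_id"] in viewed_ids]
    others = [r for r in records if r["product_id"] not in viewed_ids]
    return viewed, others

# Illustrative records; in the notebook these would come from pca_df.
records = [
    {"product_id": "P316", "x": 0.10, "y": 0.30},
    {"product_id": "P0", "x": 0.50, "y": -0.20},
    {"product_id": "P333", "x": 0.12, "y": 0.28},
]
viewed, others = split_viewed(records, ["P316", "P333"])
print([r["product_id"] for r in viewed])  # ['P316', 'P333']
```

The `x` and `y` values of the viewed records could then be drawn on top of the cluster plot as a separate marker trace, for example with Plotly's `fig.add_scatter`.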

### Task 6 - Recommend products based on recently viewed products

Let's have a look at the titles of the recently viewed products

In [26]:
filtered_df.select("title").rdd.flatMap(lambda x : x).collect()

["Mystic Fitz Roy Beige 9' 0 x 12' 0 Area Rug",
 'Florida Shag Beige/Multi 3 ft. x 5 ft. Floral Area Rug',
 '1 gal. #M250-3 Apple Turnover Extra Durable Flat Interior Paint & Primer',
 '1 gal. #HDPG60 Misty Emerald Lake Flat Interior Paint and Primer',
 '1 qt. #S220-7 Molasses Extra Durable Flat Interior Paint & Primer',
 'Modern Gray/Multi 9 ft. x 12 ft. Vibrant Abstract Polyester Area Rug',
 '1 qt. #PPU6-06 Honey Locust Eggshell Enamel Low Odor Interior Paint & Primer',
 'Genet Rust/Red-Brown 8 ft. x 11 ft. Abstract Wool Area Rug']

Let's see the distinct clusters of the recently viewed products.

In [27]:
print("The user has recently viewed the following clusters: ", unique_clusters)

The user has recently viewed the following clusters:  [3, 2]


Let's find the possible products for the recommendation.

In [30]:
possible_recommendations = clustered_data.filter(F.col("cluster").isin(unique_clusters)).filter(~clustered_data['product_id'].isin(recently_viewed_products))

Let's perform a groupby and generate a list of product IDs that can be recommended for each of the clusters.

In [37]:
recommendations = possible_recommendations.groupBy("cluster").agg(F.collect_list("product_id").alias("random_recommendations"))
recommendations_df=recommendations.toPandas()


In [39]:
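# Sample up to 5 product IDs per cluster; min() avoids an error when a cluster has fewer than 5 candidates
recommendations_df["random_recommendations"] = recommendations_df["random_recommendations"].apply(lambda x: np.random.choice(x, min(5, len(x)), replace=False).tolist())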
display(recommendations_df)

| | cluster | random_recommendations |
|---|---|---|
| 0 | 3 | [P1397, P1721, P263, P652, P262] |
| 1 | 2 | [P865, P1302, P1282, P1062, P1415] |
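
Sampling five items without replacement fails when a cluster has fewer than five candidates. A minimal sketch of a guarded version, using the standard library's `random.sample` (a swap for `np.random.choice`, chosen here only to keep the example dependency-free); `sample_recommendations` is a hypothetical helper name:

```python
import random

def sample_recommendations(candidates, k=5, seed=None):
    """Pick up to `k` distinct items from `candidates` at random."""
    candidates = list(candidates)
    rng = random.Random(seed)  # a fixed seed makes the picks reproducible
    return rng.sample(candidates, min(k, len(candidates)))

# Only three candidates are available, so all three are returned.
picks = sample_recommendations(["P1", "P2", "P3"], k=5, seed=0)
print(sorted(picks))  # ['P1', 'P2', 'P3']
```

Passing a seed is optional; without one, each run can recommend a different set of products, as in the notebook.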


In [40]:
# Print the titles of the products recommended for one cluster
def display_recommendations(row):
  # Look up the titles of the recommended product IDs in `data`
  product_ids = row['random_recommendations']
  cluster = row.cluster

  titles = (
      data.filter(data["product_id"].isin(product_ids))
          .select("title")
          .collect()
  )

  print("\n")
  print("Recommendations for Cluster:", cluster)
  for title in titles:
    print(title[0])

recommendations_df.apply(display_recommendations, axis=1)



Recommendations for Cluster: 3
Arctic Shag Ivory 5 ft. x 8 ft. Solid Area Rug
Apollo 7 Carbon 2 ft. 3 in.  x 7 ft. 5 in.  Distressed Patterned Indoor Area Rug Runner
Aspen Green/Red 2 ft. x 7 ft. Geometric Runner Rug
Montreal Shag Gray/Ivory 7 ft. x 7 ft. Diamonds Geometric Round Area Rug
NFL - Las Vegas Raiders Black Man Cave 3 ft. x 4 ft. Area Rug


Recommendations for Cluster: 2
5 gal. #W-F-510 Silver Sky Semi-Gloss Enamel Exterior Paint & Primer
1 gal. i100 White Base Dead Flat Interior Paint
1 qt. #440C-3 Rockwood Jade Flat Exterior Paint & Primer
1 qt. #PPU2-15 Cajun Red Semi-Gloss Enamel Low Odor Interior Paint & Primer
1 qt. #MQ2-33 Parisian Cafe Flat Exterior Paint & Primer


