<center><h1 style="font-size:35px; font-family: 'Times New Roman'; letter-spacing: 0.1em;">Item2Vec </h1></center>


Our recommender model is based on Item2Vec, a direct adaptation of the Word2Vec model introduced by Mikolov et al. (2013) at Google. It recommends products to users based on their historical purchasing behavior. By grouping users into clusters, the model aims to provide more personalized recommendations tailored to specific customer segments.

In this notebook, we train one Item2Vec model for each identified cluster (18 models for the clustering loaded below, numbered 0 to 17). To keep the recommendations informative, we restrict the data to orders containing at least 4 items. This gives the recommender enough co-purchase context to learn from, in addition to the cluster to which the user belongs.

The overall goal of the code is to prepare customer order data, split it into training and testing sets, build an Item2Vec model for each cluster of customers based on their purchasing behavior, and save the models for later recommendations. The use of clustering allows the model to tailor recommendations to specific user segments, improving the relevance of the suggestions.

### **Overview of Item2Vec:**

- **Item2Vec**: This is a modification of **Word2Vec**, a model developed by Google for natural language processing. In **Word2Vec**, the model learns vector representations for words based on the surrounding words (context) in sentences.
  - In **Item2Vec**, instead of words, we treat products as "words" and customer orders as "sentences." By analyzing the sequences of products purchased together, the model learns vector embeddings for each product.
  - These embeddings can be used to find products that are commonly purchased together and recommend them to users.

- **Clustering**: Users are grouped into different clusters based on their purchasing habits (derived from KMeans clustering done earlier). For each cluster, a separate **Item2Vec** model is trained.
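
To make "finding similar products" concrete, here is a minimal sketch of how closeness between two product embeddings is usually measured, namely with cosine similarity (the measure gensim's `most_similar` ranks by). The three-dimensional vectors and product names are invented for illustration; the models trained in this notebook use 100 dimensions and product IDs as keys.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (not produced by any trained model).
toy_embeddings = {
    "banana": [0.9, 0.1, 0.0],
    "strawberries": [0.8, 0.2, 0.1],
    "dish soap": [0.0, 0.1, 0.9],
}


def most_similar(product, embeddings):
    """Rank every other product by cosine similarity to `product`."""
    target = embeddings[product]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != product
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("banana", toy_embeddings)
print(ranking)
```

In this toy setup the two fruit vectors point in nearly the same direction, so "strawberries" ranks above "dish soap" for "banana".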

### **Steps:**

1. **Data Preparation**: Purchase histories are prepared by filtering out small orders and aggregating products bought together in each order.
2. **Item2Vec Training**: A separate **Item2Vec** model is trained for each user cluster. This helps the model learn product embeddings based on products frequently bought together.
3. **Saving Models**: The trained models are saved to disk for later use in making recommendations.

In [1]:
import pickle
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
from os import listdir
from typing import List
from gensim.models import Word2Vec

In [2]:
products = pd.read_csv('Data/products.csv')
cluster_data = pq.read_table('Savings/dummy_k18.parquet').to_pandas()
cluster_data_named = pd.merge(cluster_data, products, on='product_id', how='inner')

In [3]:
cluster_data.head()         

Unnamed: 0,order_id,product_id,cluster,user_id
0,2,33120,3,202279
1,2,28985,3,202279
2,2,9327,3,202279
3,2,45918,3,202279
4,2,30035,3,202279


In [4]:
cluster_data_named['product_id'] = cluster_data_named['product_id'].astype(str)
cluster_data_named['user_id'] = cluster_data_named['user_id'].astype(str)

In [5]:
# Return only the rows of the DataFrame that belong to the given cluster.
def filter_data_by_cluster(data: pd.DataFrame, cluster_num: int):
    return data.loc[data['cluster'] == cluster_num, :]

In [6]:
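# One DataFrame per cluster. This assumes the cluster labels are the contiguous integers 0..k-1.
n_clusters = cluster_data_named['cluster'].nunique()
clusters_separated = [filter_data_by_cluster(cluster_data_named, cluster_num) for cluster_num in range(n_clusters)]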

In [7]:
# Randomly split the users of a cluster into training and testing sets according to train_rate (e.g., 0.75 for 75% training users).
def split_users_in_cluster(cluster_data: pd.DataFrame, train_rate: float):
    unique_users = cluster_data['user_id'].unique()
    train_users = np.random.choice(unique_users, round(len(unique_users) * train_rate), replace=False).tolist()
    train_user_set = set(train_users)  # set lookup keeps the filter below linear in the number of users
    test_users = [user for user in unique_users if user not in train_user_set]
    return train_users, test_users
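
`split_users_in_cluster` draws users with `np.random.choice` and no fixed seed, so each run produces a different split. If a repeatable split is wanted, one possible variant is sketched below. The function name `split_users_seeded` and its `seed` parameter are hypothetical, and the sketch uses the standard-library `random` module rather than NumPy to stay dependency-free.

```python
import random


def split_users_seeded(user_ids, train_rate, seed):
    """Split unique user IDs into train/test lists, reproducibly for a given seed."""
    # Sort so the result does not depend on the input order of user_ids.
    users = sorted(set(user_ids))
    rng = random.Random(seed)
    train_users = rng.sample(users, round(len(users) * train_rate))
    train_user_set = set(train_users)
    test_users = [user for user in users if user not in train_user_set]
    return train_users, test_users


# Toy user IDs (strings, as in the notebook); duplicates mimic repeated rows per user.
toy_users = ["u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8", "u1", "u2"]
train, test = split_users_seeded(toy_users, train_rate=0.75, seed=42)
print(len(train), len(test))
```

Calling the function again with the same seed returns the same two lists, which makes later evaluation results comparable across runs.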

Product Lookup: This function creates a dictionary that maps product_id to product_name. This dictionary is saved as a pickle file and will be useful when we want to show the actual product names instead of just their IDs during recommendations.


In [8]:

# Save product lookup
def save_product_lookup(products: pd.DataFrame):
    product_lookup = dict(zip(products['product_id'].astype(str), products['product_name'].tolist()))
    with open('Savings/product_lookup.pkl', 'wb') as file:
        pickle.dump(product_lookup, file)

save_product_lookup(products)


This function prepares the purchase history for training the Item2Vec model.
`get_orders_from_cluster` groups the purchase data by `user_id` and `order_id` and aggregates the `product_id` values into lists. Each list represents one order made by a user.

In [9]:
# Generate purchase history for each cluster
def get_orders_from_cluster(cluster):
    return cluster.groupby(['user_id', 'order_id'])['product_id'].apply(list).values

This function generates a purchase history for each cluster, including only orders with more than 3 items (to ensure sufficient data). These aggregated orders will serve as the input for the Item2Vec model, where sequences of products in the same order are used to train the model.

In [10]:

def generate_purchase_history_in_cluster(cluster: pd.DataFrame):
    purchase_history = get_orders_from_cluster(cluster)
    filtered_purchase_history = [
        purchase for purchase in purchase_history if len(purchase) > 3]
    return filtered_purchase_history


purchase_history_in_cluster = [generate_purchase_history_in_cluster(cluster) for cluster in clusters_separated]
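
To illustrate what the grouping and the size filter produce, here is a small sketch on invented data. The toy DataFrame (user IDs, order IDs, and product IDs) is hypothetical; the `groupby` call and the `len(...) > 3` threshold mirror the two functions above.

```python
import pandas as pd

# Hypothetical toy cluster: user u1 has a 4-item order and a 2-item order,
# user u2 has a single 2-item order.
toy_cluster = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u1", "u1", "u1", "u2", "u2"],
    "order_id": [10, 10, 10, 10, 11, 11, 20, 20],
    "product_id": ["a", "b", "c", "d", "a", "e", "b", "c"],
})

# Same aggregation as get_orders_from_cluster: one list of product IDs per order.
orders = toy_cluster.groupby(["user_id", "order_id"])["product_id"].apply(list).values

# Same filter as generate_purchase_history_in_cluster: keep orders with more than 3 items.
kept_orders = [order for order in orders if len(order) > 3]

print(len(orders), kept_orders)
```

Only the 4-item order survives the filter, so it would be the single "sentence" passed to the model for this toy cluster.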


- **Item2Vec Model**: The **Word2Vec** model is trained using the aggregated purchase histories. Key parameters:
  - **`window=3`**: This controls the context window size (i.e., how many products on either side are considered when training the embeddings).
  - **`sg=1`**: This enables the **Skip-gram** model, which predicts surrounding words (products) for a given word (product).
  - **`vector_size=100`**: The dimensionality of the embedding vectors (how many dimensions each product is represented by).
  - **`epochs=10`**: The model is trained for 10 epochs.
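
The effect of `window=3` with Skip-gram can be sketched by listing the (target, context) pairs that a single order contributes. This is a simplified illustration, not gensim's implementation: gensim may additionally vary the effective window per target and trains with negative sampling (`negative=10` here). The helper name `skipgram_pairs` is hypothetical; the product IDs are those of order 2 shown in the `cluster_data.head()` output.

```python
def skipgram_pairs(order, window):
    """List (target, context) pairs for products at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(order):
        start = max(0, i - window)
        end = min(len(order), i + window + 1)
        for j in range(start, end):
            if j != i:  # a product is not its own context
                pairs.append((target, order[j]))
    return pairs


# One order treated as a "sentence" of product IDs.
order = ["33120", "28985", "9327", "45918", "30035"]
pairs = skipgram_pairs(order, window=3)

print(len(pairs))
print(pairs[:3])
```

With five products and a window of 3, the first and last products are four positions apart, so they never appear as a pair in this simplified view.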


In [11]:
# Build and train Item2Vec model
def build_item2vec_model(purchases_data):
    model = Word2Vec(window=3, sg=1, hs=0, vector_size=100, negative=10, alpha=0.03, min_alpha=0.0007, seed=28101997, workers=6)
    model.build_vocab(purchases_data, progress_per=200)
    model.train(purchases_data, total_examples=model.corpus_count, epochs=10, report_delay=1)
    return model

models = [build_item2vec_model(purchase_history) for purchase_history in purchase_history_in_cluster]

In [12]:
# Save models
def save_cluster_model(model, cluster_id: int):
    model.save(f'Models Clusters/model_cluster_{cluster_id}.model')
    return f"Model for cluster {cluster_id} successfully saved."


[save_cluster_model(models[i], i) for i in range(len(models))]

['Model for cluster 0 successfully saved.',
 'Model for cluster 1 successfully saved.',
 'Model for cluster 2 successfully saved.',
 'Model for cluster 3 successfully saved.',
 'Model for cluster 4 successfully saved.',
 'Model for cluster 5 successfully saved.',
 'Model for cluster 6 successfully saved.',
 'Model for cluster 7 successfully saved.',
 'Model for cluster 8 successfully saved.',
 'Model for cluster 9 successfully saved.',
 'Model for cluster 10 successfully saved.',
 'Model for cluster 11 successfully saved.',
 'Model for cluster 12 successfully saved.',
 'Model for cluster 13 successfully saved.',
 'Model for cluster 14 successfully saved.',
 'Model for cluster 15 successfully saved.',
 'Model for cluster 16 successfully saved.',
 'Model for cluster 17 successfully saved.']
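
As a sketch of how a saved model might be used later, the helper below looks up the nearest neighbours of a product and maps their IDs to names with the product lookup. The function name `recommend_similar` is hypothetical and not part of this notebook. It relies only on the `wv.key_to_index` and `wv.most_similar` attributes of a trained gensim `Word2Vec` model, and it returns an empty list for products missing from the vocabulary (for example, products dropped by gensim's default `min_count=5`).

```python
def recommend_similar(model, product_lookup, product_id, topn=5):
    """Return (product_name, similarity) pairs for the products closest to product_id.

    `model` is expected to behave like a trained gensim Word2Vec model and
    `product_lookup` like the dict saved in Savings/product_lookup.pkl.
    """
    # Products unseen during training have no embedding to compare against.
    if product_id not in model.wv.key_to_index:
        return []
    neighbours = model.wv.most_similar(positive=[product_id], topn=topn)
    # Fall back to the raw ID when a name is missing from the lookup.
    return [(product_lookup.get(pid, pid), score) for pid, score in neighbours]
```

With gensim installed, a model saved above can be restored with `Word2Vec.load('Models Clusters/model_cluster_3.model')` and the lookup with `pickle.load`, after which `recommend_similar(model, product_lookup, '33120')` would return the names of the products closest to product 33120 in that cluster, provided the product is in that model's vocabulary.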