<div class="alert alert-block alert-success">
    <h1 align="center">Product Recommendations using word2vec</h1>
    <h3 align="center">Recommender System</h3>
    <h4 align="center"><a href="http://www.iran-machinelearning.ir">Alireza Javid</a></h4>
</div>

The key principle behind word2vec is the notion that the meaning of a word can be inferred from its context: the words that tend to appear around it. To abstract that a bit, text is just a sequence of words, and the meaning of a word can be extracted from the words that tend to appear just before and just after it in the sequence.
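
To make the idea of context concrete, here is a minimal sketch of how a window slides over a sequence and produces (target, context) pairs. The function name `context_pairs` and the toy basket are illustrative assumptions, not part of any library; gensim builds such pairs internally and also varies the effective window size at random.

```python
def context_pairs(sequence, window=2):
    """Return (target, context) pairs for every item within `window` positions of the target."""
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an item is not its own context
                pairs.append((target, sequence[j]))
    return pairs

# Toy "sentence": product codes one customer bought, in order (hypothetical data)
basket = ["85123A", "71053", "84406B", "84029G"]
pairs = context_pairs(basket, window=1)
print(pairs)
```

With `window=1`, each product is paired only with its immediate neighbours; a larger window treats more distant purchases as context too.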

word2vec (and other word vector models) have revolutionized Natural Language Processing by providing much better vector representations for words than past approaches. In the same way that word embeddings revolutionized NLP, item embeddings are revolutionizing recommendations.

User activity around an item encodes many abstract qualities of that item which are difficult to capture by more direct means. For instance, how do you encode qualities like “architecture, style and feel” of an Airbnb listing?

The word2vec approach has proven successful in extracting these hidden insights. Being able to compare, search, and categorize items on these abstract dimensions opens up a lot of opportunities for smarter, better recommendations. Commercially, Yahoo saw a 9% lift in CTR when applying this technique to their advertisements, and Airbnb saw a 21% lift in CTR on their Similar Listing carousel, a product that, along with search ranking, drives 99% of bookings.

<img src = 'http://mccormickml.com/assets/word2vec_apps/Spotify_user_activity.png'>

Imagine you are looking for an apartment to rent for your vacation to Paris. As you browse through the available homes, it’s likely that you will investigate a number of listings which fit your preferences and are comparable in features like amenities and design taste.


<img src = 'http://mccormickml.com/assets/word2vec_apps/Airbnb_user_activity.png'>

In [1]:
import random
import gensim
import warnings

import numpy as np
import pandas as pd

import plotly.express as px
import plotly.offline as pyoff
import matplotlib.pyplot as plt
import plotly.graph_objects as go

from tqdm import tqdm

from gensim.models import Word2Vec 

%matplotlib inline

warnings.filterwarnings('ignore')



## Data gathering and understanding

In [2]:
df = pd.read_excel('/kaggle/input/online-retail/Online Retail.xlsx')
df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [3]:
df.to_csv('data.csv')

Given below is the description of the fields in this dataset:

1. __InvoiceNo:__ Invoice number, a unique number assigned to each transaction.

2. __StockCode:__ Product/item code, a unique code assigned to each distinct product.

3. __Description:__ Product description

4. __Quantity:__ The quantities of each product per transaction.

5. __InvoiceDate:__ Invoice Date and time. The day and time when each transaction was generated.

6. __CustomerID:__ Customer number, a unique number assigned to each customer.

## Data Preprocessing

In [4]:
# check for missing values
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [5]:
df.shape

(541909, 8)

Since we have sufficient data, we will drop all the rows with missing values.

In [6]:
# remove missing values
df.dropna(inplace=True)

# again check missing values
df.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

In [7]:
# Convert the StockCode to string datatype
df['StockCode']= df['StockCode'].astype(str)

In [8]:
# Check out the number of unique customers in our dataset
customers = df["CustomerID"].unique().tolist()
len(customers)

4372

There are 4,372 customers in our dataset. For each of these customers we will extract their buying history. In other words, we can have 4,372 sequences of purchases.

In [9]:
len(df["Quantity"].unique())

436

In [10]:
df["Quantity"].unique()

array([     6,      8,      2,     32,      3,      4,     24,     12,
           48,     18,     20,     36,     80,     64,     10,    120,
           96,     23,      5,      1,     -1,     50,     40,    100,
          192,    432,    144,    288,    -12,    -24,     16,      9,
          128,     25,     30,     28,      7,     72,    200,    600,
          480,     -6,     14,     -2,     -4,     -5,     -7,     -3,
           11,     70,    252,     60,    216,    384,     27,    108,
           52,  -9360,     75,    270,     42,    240,     90,    320,
           17,   1824,    204,     69,    -36,   -192,   -144,    160,
         2880,   1400,     19,     39,    -48,    -50,     56,     13,
         1440,     -8,     15,    720,    -20,    156,    324,     41,
          -10,    -72,    -11,    402,    378,    150,    300,     22,
           34,    408,    972,    208,   1008,     26,   1000,    -25,
         1488,    250,   1394,    400,    110,    -14,     37,    -33,
      

In [11]:
df.shape

(406829, 8)

In [12]:
df = df[df["Quantity"]>0]

In [13]:
df.shape

(397924, 8)

In [14]:
df["Quantity"].unique()

array([    6,     8,     2,    32,     3,     4,    24,    12,    48,
          18,    20,    36,    80,    64,    10,   120,    96,    23,
           5,     1,    50,    40,   100,   192,   432,   144,   288,
          16,     9,   128,    25,    30,    28,     7,    72,   200,
         600,   480,    14,    11,    70,   252,    60,   216,   384,
          27,   108,    52,    75,   270,    42,   240,    90,   320,
          17,  1824,   204,    69,   160,  2880,  1400,    19,    39,
          56,    13,  1440,    15,   720,   156,   324,    41,   402,
         378,   150,   300,    22,    34,   408,   972,   208,  1008,
          26,  1000,  1488,   250,  1394,   400,   110,    37,    78,
          21,   272,    84,    47,  1728,    38,    53,    76,   576,
          29,  2400,   500,   180,   960,  1296,   147,   168,   256,
          54,    31,   860,  1010,  1356,  1284,   186,   114,   360,
        1930,  2000,  3114,  1300,   670,   176,   648,    62, 74215,
          89,    33,

## Data Preparation

It is good practice to set aside a small part of the dataset for validation. Therefore, we will use the data of 90% of the customers to create the word2vec embeddings and hold out the remaining 10%. Let's split the data.

In [15]:
# shuffle customer IDs
random.shuffle(customers)

# extract 90% of customer IDs
customers_train = customers[:round(0.9*len(customers))]

# split data into train and validation set
train_df = df[df['CustomerID'].isin(customers_train)]
validation_df = df[~df['CustomerID'].isin(customers_train)]
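
Because `random.shuffle(customers)` is called without a seed, the split (and everything downstream of it) changes from run to run. A minimal sketch of a reproducible variant follows; the helper name `split_customers` and the seed value are assumptions made for illustration.

```python
import random

def split_customers(customer_ids, train_frac=0.9, seed=42):
    """Return (train_ids, val_ids) using a private, seeded random generator."""
    ids = list(customer_ids)            # copy so the caller's list is not reordered
    random.Random(seed).shuffle(ids)    # does not touch the global random state
    cut = round(train_frac * len(ids))
    return ids[:cut], ids[cut:]

# Hypothetical IDs standing in for df["CustomerID"].unique()
train_ids, val_ids = split_customers(range(100))
print(len(train_ids), len(val_ids))
```

Calling the helper again with the same seed returns the same split, which makes later outputs comparable between runs.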

Let's create sequences of purchases made by the customers in the dataset for both the train and validation set.

In [16]:
# list to capture purchase history of the customers
purchases_train = []

# populate the list with the product codes
for i in tqdm(customers_train):
    temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_train.append(temp)

100%|██████████| 3935/3935 [00:03<00:00, 1219.51it/s]


In [17]:
len(purchases_train)

3935

In [18]:
# list to capture purchase history of the customers
purchases_val = []

# populate the list with the product codes
for i in tqdm(validation_df['CustomerID'].unique()):
    temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_val.append(temp)

100%|██████████| 432/432 [00:00<00:00, 1713.91it/s]


In [19]:
len(purchases_val)

432
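
The two loops above filter the whole frame once per customer. As a sketch of an alternative, the same sequences can be built in a single pass with `groupby`. The toy frame below is hypothetical. Note one behavioural difference: a customer with no remaining rows simply produces no group, whereas the loop version may append an empty list for such a customer.

```python
import pandas as pd

# Tiny stand-in for train_df (hypothetical rows)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 1, 2],
    "StockCode": ["A", "B", "C", "D", "A"],
})

# sort=False keeps customers in first-seen order;
# rows within each group keep their original order.
sequences = toy.groupby("CustomerID", sort=False)["StockCode"].apply(list).tolist()
print(sequences)
```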

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 397924 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    397924 non-null  object        
 1   StockCode    397924 non-null  object        
 2   Description  397924 non-null  object        
 3   Quantity     397924 non-null  int64         
 4   InvoiceDate  397924 non-null  datetime64[ns]
 5   UnitPrice    397924 non-null  float64       
 6   CustomerID   397924 non-null  float64       
 7   Country      397924 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 27.3+ MB


In [21]:
def print_unique_col_values(df):
       for column in df:
            if df[column].dtypes=='object':
                print(f'{column}: {df[column].unique()}') 

In [22]:
print_unique_col_values(df)

InvoiceNo: [536365 536366 536367 ... 581585 581586 581587]
StockCode: ['85123A' '71053' '84406B' ... '90214Z' '90089' '23843']
Description: ['WHITE HANGING HEART T-LIGHT HOLDER' 'WHITE METAL LANTERN'
 'CREAM CUPID HEARTS COAT HANGER' ... 'PINK CRYSTAL SKULL PHONE CHARM'
 'CREAM HANGING HEART T-LIGHT HOLDER' 'PAPER CRAFT , LITTLE BIRDIE']
Country: ['United Kingdom' 'France' 'Australia' 'Netherlands' 'Germany' 'Norway'
 'EIRE' 'Switzerland' 'Spain' 'Poland' 'Portugal' 'Italy' 'Belgium'
 'Lithuania' 'Japan' 'Iceland' 'Channel Islands' 'Denmark' 'Cyprus'
 'Sweden' 'Finland' 'Austria' 'Greece' 'Singapore' 'Lebanon'
 'United Arab Emirates' 'Israel' 'Saudi Arabia' 'Czech Republic' 'Canada'
 'Unspecified' 'Brazil' 'USA' 'European Community' 'Bahrain' 'Malta' 'RSA']


In [23]:
def print_unique_number_of_values(df):
       for column in df:
            if df[column].dtypes=='object':
                print(f'{column}: {len(df[column].unique())}') 

In [24]:
print_unique_number_of_values(df)

InvoiceNo: 18536
StockCode: 3665
Description: 3877
Country: 37


In [25]:
#make sure InvoiceDate is a datetime (it is already parsed as datetime64 above, so this is only a safeguard)
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

#creating YearMonth field for the ease of reporting and visualization
df['InvoiceYearMonth'] = df['InvoiceDate'].map(lambda date: 100*date.year + date.month)

#calculate Revenue for each row and create a new dataframe with YearMonth - Revenue columns
df['Revenue'] = df['UnitPrice'] * df['Quantity']
tx_revenue = df.groupby(['InvoiceYearMonth'])['Revenue'].sum().reset_index()
tx_revenue

| | InvoiceYearMonth | Revenue |
|---|---|---|
| 0 | 201012 | 572713.89 |
| 1 | 201101 | 569445.04 |
| 2 | 201102 | 447137.35 |
| 3 | 201103 | 595500.76 |
| 4 | 201104 | 469200.361 |
| 5 | 201105 | 678594.56 |
| 6 | 201106 | 661213.69 |
| 7 | 201107 | 600091.011 |
| 8 | 201108 | 645343.9 |
| 9 | 201109 | 952838.382 |


In [26]:
#X and Y axis inputs for Plotly graph. We use Scatter for line graphs
plot_data = [
    go.Scatter(
        x=tx_revenue['InvoiceYearMonth'],
        y=tx_revenue['Revenue'],
    )
]

plot_layout = go.Layout(
        xaxis={"type": "category"},
        title='Monthly Revenue'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

In [27]:
#using pct_change() function to see monthly percentage change
tx_revenue['MonthlyGrowth'] = tx_revenue['Revenue'].pct_change()

#showing first 5 rows
tx_revenue.head()

#visualization - line graph
plot_data = [
    go.Scatter(
        x=tx_revenue.query("InvoiceYearMonth < 201112")['InvoiceYearMonth'],
        y=tx_revenue.query("InvoiceYearMonth < 201112")['MonthlyGrowth'],
    )
]

plot_layout = go.Layout(
        xaxis={"type": "category"},
        title='Monthly Growth Rate'
    )

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
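
For reference, `pct_change()` computes the relative change between consecutive rows, (current - previous) / previous, and leaves the first row as NaN because it has no predecessor. The sketch below reproduces that arithmetic in plain Python on the first three monthly revenues from the table above; the helper name is an assumption and the NaN entry is omitted.

```python
def pct_change(values):
    """Relative change between consecutive values: (v[t] - v[t-1]) / v[t-1]."""
    return [(cur - prev) / prev for prev, cur in zip(values, values[1:])]

# First three monthly revenues (201012, 201101, 201102)
revenue = [572713.89, 569445.04, 447137.35]
growth = pct_change(revenue)
print([f"{g:.1%}" for g in growth])
```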

In [28]:
# Calculate the total sales
df['TotalSales'] = df['Quantity'] * df['UnitPrice']

# Visualize the top countries by total sales without UK
top_countries = df.groupby('Country')['TotalSales'].sum().sort_values(ascending=True).head(-1)

# Create a DataFrame for the top countries data
top_countries_df = pd.DataFrame({'Country': top_countries.index, 'Total Sales': top_countries.values})

# Create a bar chart using Plotly Express
fig = px.bar(top_countries_df, x='Country', y='Total Sales', title='Top Countries by Total Sales Excluding UK')

# Customize the appearance of the chart
fig.update_layout(xaxis_title='Country', yaxis_title='Total Sales', xaxis_tickangle=-45)

# Show the chart
fig.show()

In [29]:
# Visualize the top countries by total sales with UK
top_countries = df.groupby('Country')['TotalSales'].sum().sort_values(ascending=False).head(10)

# Create a bar chart using Plotly
fig = go.Figure(data=[go.Bar(
    x=top_countries.index,
    y=top_countries.values,
    marker=dict(color='rgb(26, 118, 255)')  # Optional: Set the bar color
)])

# Update the layout of the chart
fig.update_layout(
    title='Top Countries by Total Sales Including UK',
    xaxis=dict(title='Country'),
    yaxis=dict(title='Total Sales'),
    xaxis_tickangle=-45,  # Rotate the x-axis labels for better readability
    width=800,  # Optional: Set the width of the chart
    height=500  # Optional: Set the height of the chart
)

# Show the plot
fig.show()

In [30]:
fig = px.histogram(df, x="Country")
fig.show()

In [31]:
# Visualize the product categories
top_categories = df['Description'].value_counts().head(10)
fig = px.bar(x=top_categories.index, y=top_categories.values,
             labels={'x': 'Product Category', 'y': 'Count'},
             title='Top Product Categories',
             text=top_categories.values)

# Customize the layout
fig.update_layout(
    xaxis_tickangle=-45,
    xaxis_title_font=dict(size=14),
    yaxis_title_font=dict(size=14),
    title_font=dict(size=16)
)

# Show the plot
fig.show()

In [32]:
# Visualize the top selling products
top_products = df.groupby('Description')['Quantity'].sum().nlargest(10).reset_index()

fig = px.bar(top_products, x='Quantity', y='Description', orientation='h', 
             labels={'Quantity': 'Total Quantity Sold', 'Description': 'Product Description'},
             title='Top 10 Selling Products')
fig.show()

In [33]:
# Create a histogram chart using Plotly
fig = px.histogram(df.groupby('InvoiceNo')['TotalSales'].sum(),
                   nbins=50,  # Number of bins in the histogram
                   x='TotalSales',
                   labels={'TotalSales': 'Total Sales'},
                   title='Total Sales Distribution',
                   width=800,
                   height=500)

# Customize the chart layout
fig.update_layout(
    xaxis=dict(title='Total Sales', range=[-1000, 5000]),
    yaxis=dict(title='Frequency'),
    title=dict(x=0.5, y=0.95),
    bargap=0.2,  # Gap between bars
)

# Show the plot
fig.show()

In [34]:
# Create a new column for the hour of the day
df['Hour'] = df['InvoiceDate'].dt.hour

# Group by hour and calculate transaction count
transactions_by_hour = df.groupby('Hour')['InvoiceNo'].count()

# Create a Plotly figure
fig = go.Figure()

# Add a line trace to the figure
fig.add_trace(go.Scatter(x=transactions_by_hour.index, y=transactions_by_hour.values, mode='lines', name='Transactions'))

# Customize the layout
fig.update_layout(title='Number of Transactions by Hour',
                  xaxis_title='Hour of the Day',
                  yaxis_title='Number of Transactions')

# Display the plot
fig.show()

In [35]:
#create dataframe with uk data only
tx_uk = df.query("Country=='United Kingdom'").reset_index(drop=True)

tx_6m = tx_uk[(tx_uk.InvoiceDate < pd.to_datetime('2011-9-1')) & (tx_uk.InvoiceDate >= pd.to_datetime('2011-3-1'))].reset_index(drop=True)
tx_next = tx_uk[(tx_uk.InvoiceDate >= pd.to_datetime('2011-9-1')) & (tx_uk.InvoiceDate < pd.to_datetime('2011-12-1'))].reset_index(drop=True)

tx_user = pd.DataFrame(tx_6m['CustomerID'].unique())
tx_user.columns = ['CustomerID']

#get max purchase date for Recency and create a dataframe
tx_max_purchase = tx_6m.groupby('CustomerID').InvoiceDate.max().reset_index()
tx_max_purchase.columns = ['CustomerID','MaxPurchaseDate']

#find the recency in days and add it to tx_user
tx_max_purchase['Recency'] = (tx_max_purchase['MaxPurchaseDate'].max() - tx_max_purchase['MaxPurchaseDate']).dt.days
tx_user = pd.merge(tx_user, tx_max_purchase[['CustomerID','Recency']], on='CustomerID')

In [36]:
#plot recency
plot_data = [
    go.Histogram(
        x=tx_max_purchase['Recency']
    )
]

plot_layout = go.Layout(
        title='Recency'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

In [37]:
#get total purchases for frequency scores
tx_frequency = tx_6m.groupby('CustomerID').InvoiceDate.count().reset_index()
tx_frequency.columns = ['CustomerID','Frequency']

#add frequency column to tx_user
tx_user = pd.merge(tx_user, tx_frequency, on='CustomerID')

In [38]:
#plot frequency
plot_data = [
    go.Histogram(
        x=tx_user.query('Frequency < 1000')['Frequency']
    )
]

plot_layout = go.Layout(
        title='Frequency'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

In [39]:
#calculate monetary value, create a dataframe with it
tx_6m['Revenue'] = tx_6m['UnitPrice'] * tx_6m['Quantity']
tx_revenue = tx_6m.groupby('CustomerID').Revenue.sum().reset_index()

#add Revenue column to tx_user
tx_user = pd.merge(tx_user, tx_revenue, on='CustomerID')

In [40]:
#plot Revenue
plot_data = [
    go.Histogram(
        x=tx_user.query('Revenue < 10000')['Revenue']
    )
]

plot_layout = go.Layout(
        title='Monetary Value'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
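
The recency, frequency, and revenue columns above are built with three separate group-bys and merges. As a sketch under the same definitions (recency in days relative to the latest purchase in the window, frequency as the number of rows, revenue as the sum of `UnitPrice * Quantity`), they can also be computed in one aggregation. The toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical transactions standing in for tx_6m
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-03-01", "2011-03-10", "2011-03-05", "2011-03-06", "2011-03-20"]),
    "Quantity": [2, 1, 5, 1, 3],
    "UnitPrice": [1.5, 4.0, 2.0, 10.0, 1.0],
})
tx["Revenue"] = tx["Quantity"] * tx["UnitPrice"]

snapshot = tx["InvoiceDate"].max()  # most recent purchase in the window

rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceDate", "count"),
    Revenue=("Revenue", "sum"),
).reset_index()
print(rfm)
```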

In [41]:
# Which stock codes were used the most?
stock_counts = df['StockCode'].value_counts().sort_values(ascending=False).iloc[0:15]

# Create the bar plot using Plotly
fig = go.Figure(data=go.Bar(x=stock_counts.index, y=stock_counts.values, marker_color='lightblue'))

# Update the layout of the plot
fig.update_layout(
    title="Which stock codes were used the most?",
    xaxis_title="Stock Code",
    yaxis_title="Counts",
    xaxis=dict(tickangle=90),
    showlegend=False,
    template="plotly_white"  # You can choose a different template if you prefer
)

# Show the plot
fig.show()

In [42]:
# Which invoices had the most items?
inv_counts = df['InvoiceNo'].value_counts().sort_values(ascending=False).iloc[0:15]

# Create the bar plot using Plotly
fig = go.Figure(data=go.Bar(x=inv_counts.index, y=inv_counts.values, marker=dict(color='lightblue')))
fig.update_layout(
    title="Which invoices had the most items?",
    xaxis_title="Invoice Number",
    yaxis_title="Counts",
    xaxis=dict(type='category', tickangle=90),
    yaxis=dict(showgrid=True, zeroline=False),
    showlegend=False,
    width=1000,
    height=400,
)

# Show the plot
fig.show()

## Build word2vec Embeddings for Products

In [43]:
# train word2vec model
model = Word2Vec(window = 10, sg = 1, hs = 0,
                 negative = 10, # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed = 14)

# window (int, optional) – Maximum distance between the current and predicted word within a sentence.
# sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.
# hs ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero,
#   negative sampling will be used.
# alpha (float, optional) – The initial learning rate.
# min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.
# seed (int, optional) – Seed for the random number generator.

model.build_vocab(purchases_train, progress_per=200)

model.train(purchases_train, total_examples = model.corpus_count, 
            epochs=10, report_delay=1)

(3547393, 3580510)

In [44]:
# save word2vec model
model.save("word2vec_3.model")

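As we do not plan to train the model any further, we call init_sims(replace=True), which overwrites the stored vectors with their L2-normalized versions so that no separate normalized copy has to be kept in memory. In gensim 4.x, which this notebook appears to use (note `vector_size` and `key_to_index` below), this method is deprecated.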

In [45]:
model.init_sims(replace=True)

In [46]:
print(model)

Word2Vec<vocab=3151, vector_size=100, alpha=0.03>


In [47]:
X = model.wv[model.wv.key_to_index]

X.shape

(3151, 100)

The array X now holds the vectors of all the products in our vocabulary in one place for easy access.
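
As a small illustration of what L2 normalization (the effect of `init_sims(replace=True)`) buys us: once every row has unit length, a plain dot product between two rows equals their cosine similarity. The random matrix below is only a stand-in for X.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 100))   # stand-in for the (3151, 100) matrix X

# L2-normalize each row so that every vector has length 1
X_unit = X_toy / np.linalg.norm(X_toy, axis=1, keepdims=True)

# Cosine similarity computed the long way on the raw vectors
a, b = X_toy[0], X_toy[1]
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# On unit vectors the dot product gives the same number
print(np.isclose(X_unit[0] @ X_unit[1], cosine))
```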

## Visualize word2vec Embeddings

It is always helpful to visualize the embeddings you have created. Here we have 100-dimensional embeddings. We can't even visualize 4 dimensions, let alone 100. Therefore, we are going to reduce the dimensions of the product embeddings from 100 to 2 using UMAP, a dimensionality reduction algorithm.

UMAP, at its core, works very similarly to t-SNE: both use graph layout algorithms to arrange data in low-dimensional space. In the simplest sense, UMAP constructs a high-dimensional graph representation of the data, then optimizes a low-dimensional graph to be as structurally similar as possible. While the mathematics UMAP uses to construct the high-dimensional graph is advanced, the intuition behind it is remarkably simple.

In [48]:
!pip install umap-learn



In [49]:
#collapse
import umap

cluster_embedding = umap.UMAP(n_neighbors=30, min_dist=0.0,
                              n_components=2, random_state=42).fit_transform(X)

# Create a DataFrame to store the embedding data
embedding_df = pd.DataFrame(cluster_embedding, columns=["UMAP_1", "UMAP_2"])

# Create the Plotly scatter plot
fig = px.scatter(embedding_df, x="UMAP_1", y="UMAP_2", 
                 size_max=3, color_continuous_scale="Spectral")

# Set the figure size
fig.update_layout(width=800, height=600)

# Show the plot
fig.show()

Every dot in this plot is a product. As you can see, there are several tiny clusters of these datapoints. These are groups of similar products.

## Generate and validate recommendations

We are finally ready with the word2vec embeddings for every product in our online retail dataset. Now our next step is to suggest similar products for a certain product or a product's vector. 

Let's first create a product-ID and product-description dictionary to easily map a product's description to its ID and vice versa.

In [50]:
# keep one description per product code (the last one seen);
# assigning the result avoids modifying a slice of train_df in place
products = train_df[["StockCode", "Description"]].drop_duplicates(subset='StockCode', keep="last")

# create product-ID and product-description dictionary
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()

In [51]:
# test the dictionary
products_dict['84029E']

['RED WOOLLY HOTTIE WHITE HEART.']

We have defined the function below. It takes a product's vector (v) as input and returns the top n (6 by default) most similar products.

In [52]:
def similar_products(v, n=6):

    # extract most similar products for the input vector;
    # the first hit is skipped on the assumption that it is the query product itself
    ms = model.wv.most_similar([v], topn=n + 1)[1:]

    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        product_id = j[0]
        product_name = products_dict[product_id][0]
        similarity_score = j[1]
        pair = (product_name, similarity_score)
        new_ms.append(pair)

    return new_ms

Let's try out our function by passing the vector of the product '90019A' ('SILVER M.O.P ORBIT BRACELET')

In [53]:
similar_products(model.wv['90019A'])

[('BLUE MURANO TWIST BRACELET', 0.781118631362915),
 ('SILVER M.O.P ORBIT DROP EARRINGS', 0.757378339767456),
 ('RASPBERRY ANT COPPER FLOWER NECKLAC', 0.756861686706543),
 ('JADE DROP EARRINGS W FILIGREE', 0.7486621141433716),
 ('ANT SILVER PURPLE BOUDICCA RING', 0.7273343205451965),
 ('ANT SILVER LIME GREEN BOUDICCA RING', 0.7223764061927795)]

Cool! The results are pretty relevant and match well with the input product. However, this output is based on the vector of a single product only. What if we want to recommend products to a user based on multiple purchases they have made in the past?

One simple solution is to take the average of the vectors of all the products the user has bought so far and use the resulting vector to find similar products. For that we will use the function below, which takes a list of product IDs and returns a 100-dimensional vector that is the mean of the vectors of the products in the input list.

In [54]:
def aggregate_vectors(products):
    product_vec = []
    for i in products:
        try:
            vector = model.wv.get_vector(i)
            product_vec.append(vector)
        except KeyError:
            continue

    return np.mean(product_vec, axis=0)
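
One edge case worth noting: if none of the input products is in the vocabulary, `product_vec` stays empty and `np.mean` returns NaN (the accompanying warning is hidden here because warnings are filtered at the top of the notebook). Below is a minimal sketch of a guarded variant. The name `aggregate_vectors_safe` is hypothetical, and a plain dict stands in for `model.wv`, which is assumed to support the same `in` and `[]` lookups.

```python
import numpy as np

def aggregate_vectors_safe(product_ids, vectors):
    """Mean of the known products' vectors, or None if none of them is in the vocabulary."""
    known = [vectors[p] for p in product_ids if p in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)

# Toy 2-D "embeddings" (hypothetical); with gensim one would pass model.wv instead
toy_vectors = {"A": np.array([1.0, 0.0]), "B": np.array([0.0, 1.0])}
print(aggregate_vectors_safe(["A", "B", "unseen"], toy_vectors))
print(aggregate_vectors_safe(["unseen"], toy_vectors))
```

Returning `None` lets the caller fall back to something else, for example a list of best sellers, instead of querying with a NaN vector.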

If you can recall, we have already created a separate list of purchase sequences for validation purpose. Now let's make use of that.

In [55]:
len(purchases_val[0])

13

The first list of products purchased by a user in the validation set contains 13 items. We will pass this product sequence to the function aggregate_vectors.

In [56]:
vector = aggregate_vectors(purchases_val[0])
vector.shape

(100,)

Well, the function has returned an array of 100 dimensions, which means it is working fine. Now we can use this result to get the most similar products. Let's do it.

In [57]:
similar_products(vector)

[('REGENCY MIRROR WITH SHUTTERS', 0.736352264881134),
 ('NATURAL SLATE HEART CHALKBOARD ', 0.7330039739608765),
 ('WOOD BLACK BOARD ANT WHITE FINISH', 0.7299142479896545),
 ('CREAM SWEETHEART MINI CHEST', 0.7252875566482544),
 ('CREAM HANGING HEART T-LIGHT HOLDER', 0.723342776298523),
 ('HEART OF WICKER SMALL', 0.7112729549407959)]

As it turns out, our system has recommended 6 products based on the entire purchase history of a user. If you want product suggestions based on only the last few purchases, you can use the same set of functions.

Below we are giving only the last 10 products purchased as input.

In [58]:
# Get the last 10 product IDs from the first validation customer's purchase sequence
last_10_products = purchases_val[0][-10:]

# Get the aggregated vector for the last 10 product IDs
vector = aggregate_vectors(last_10_products)

# Find similar products to the aggregated vector
similar_products_list = similar_products(vector)

print(similar_products_list)

[('HEART OF WICKER SMALL', 0.7330625057220459), ('REGENCY MIRROR WITH SHUTTERS', 0.7318141460418701), ('NATURAL SLATE HEART CHALKBOARD ', 0.7282062768936157), ('HEART OF WICKER LARGE', 0.7261298894882202), ('CREAM SWEETHEART MINI CHEST', 0.7022709846496582), ('CREAM HANGING HEART T-LIGHT HOLDER', 0.702028751373291)]
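
A note on the ranking step: `similar_products` drops the first hit, which makes sense when the query is a single product's own vector. For an averaged vector, though, the first hit is a genuine candidate, and items the user already bought can still appear in the list. The sketch below shows one possible alternative: rank by cosine similarity and exclude a given set of product IDs by name. The function `recommend` and the 2-D toy embeddings are assumptions for illustration, not the notebook's method.

```python
import numpy as np

def recommend(query_vec, item_ids, item_matrix, exclude=(), topn=3):
    """Rank items by cosine similarity to query_vec, skipping any id in `exclude`."""
    unit = item_matrix / np.linalg.norm(item_matrix, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = unit @ q
    order = np.argsort(-scores)      # best match first
    skip = set(exclude)
    ranked = [(item_ids[i], float(scores[i])) for i in order if item_ids[i] not in skip]
    return ranked[:topn]

# Hypothetical 2-D embeddings for four products
ids = ["A", "B", "C", "D"]
M = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])

history = ["A", "B"]                 # products the user already bought
profile = M[[0, 1]].mean(axis=0)     # average of their vectors

result = recommend(profile, ids, M, exclude=history, topn=2)
print(result)
```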


## References

- [https://mccormickml.com/2018/06/15/applying-word2vec-to-recommenders-and-advertising/](https://mccormickml.com/2018/06/15/applying-word2vec-to-recommenders-and-advertising/)
- [https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484](https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484)