## Import Libraries and Load Data

In [1]:
import pandas as pd
import numpy as np
import random
from tqdm import tqdm
from gensim.models import Word2Vec 
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_excel('dataset/Online Retail.xlsx')

Let's take a quick look at our data. You can download it from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/00352/).

In [6]:
df.head(6)  # head() returns 5 rows by default; here we ask for 6

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 5 | 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 2010-12-01 08:26:00 | 7.65 | 17850.0 | United Kingdom |


The fields in this dataset are described below:

1. __InvoiceNo:__ Invoice number, a unique number assigned to each transaction.

2. __StockCode:__ Product/item code, a unique code assigned to each distinct product.

3. __Description:__ Product description.

4. __Quantity:__ The quantity of each product per transaction.

5. __InvoiceDate:__ Invoice date and time. The day and time when each transaction was generated.

6. __UnitPrice:__ Price per unit of the product.

7. __CustomerID:__ Customer number, a unique number assigned to each customer.

8. __Country:__ Country associated with the customer.

In [7]:
df.shape

(541909, 8)

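The dataset contains 541,909 rows. Each row is one product line on an invoice, so a single transaction (InvoiceNo) can span several rows, as the first six rows above show. That is plenty of data for our purposes.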

## Treat Missing Data

In [8]:
# check for missing values
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

<br>
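Only Description and CustomerID have missing values. Since we will build purchase sequences per customer, rows without a CustomerID are not usable anyway, and we have sufficient data, so we will drop all the rows with missing values.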

In [9]:
# remove missing values
df.dropna(inplace=True)

# again check missing values
df.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

In [13]:
df.shape

(406829, 8)
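
Dropping every row with any missing value is the simplest option. A slightly more explicit variant, sketched below on a made-up toy frame, names the columns whose gaps matter. On this dataset the result should be the same, because only Description and CustomerID contain missing values. The toy values and the name `cleaned` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) mimicking the two columns that have gaps
toy = pd.DataFrame({
    "StockCode": ["85123A", "71053", "84406B", "22752"],
    "Description": ["HOLDER", None, "HANGER", "BOXES"],
    "CustomerID": [17850.0, 17850.0, np.nan, 13047.0],
})

# Drop a row only when one of the named columns is missing
cleaned = toy.dropna(subset=["CustomerID", "Description"])
print(cleaned.shape)
```

Naming the columns keeps the cleaning step from silently dropping rows if a new column with unrelated gaps is added later.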

## Data Preparation

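Let's convert the StockCode to string datatype. Some codes are purely numeric (71053) and others are alphanumeric (85123A), and word2vec treats each product code as a string token.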

In [14]:
df['StockCode'] = df['StockCode'].astype(str)

Let's check out the number of unique customers in our dataset.

In [15]:
customers = df["CustomerID"].unique().tolist()
len(customers)

4372

There are 4,372 customers in our dataset. For each of these customers we will extract their buying history, which gives us 4,372 sequences of purchases.

It is good practice to set aside a small part of the dataset for validation. Therefore, I will use the data of 90% of the customers to create the word2vec embeddings and hold out the remaining 10%. Let's split the data.

In [17]:
# shuffle customer IDs (unseeded, so the split differs between runs)
random.shuffle(customers)

# take the first 90% of the shuffled customer IDs
customers_train = customers[:round(0.9*len(customers))]

# split data into train and validation set
train_df = df[df['CustomerID'].isin(customers_train)]
validation_df = df[~df['CustomerID'].isin(customers_train)]
print(train_df.shape)
print(validation_df.shape)

(369365, 8)
(37464, 8)
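
Because the shuffle above is not seeded, the train and validation customers change on every run. If a repeatable split is wanted, one option is a dedicated seeded generator. The sketch below uses made-up customer IDs and an arbitrary seed of 42; the names `rng`, `train_ids` and `val_ids` are illustrative and do not appear in the original notebook.

```python
import random

# Hypothetical customer IDs standing in for df["CustomerID"].unique().tolist()
customer_ids = list(range(1000, 1020))

# A dedicated, seeded generator makes the shuffle repeatable
rng = random.Random(42)  # 42 is an arbitrary seed
shuffled = customer_ids[:]  # copy, so the original order is left untouched
rng.shuffle(shuffled)

# The first 90% of the shuffled IDs go to training, the rest to validation
cut = round(0.9 * len(shuffled))
train_ids, val_ids = shuffled[:cut], shuffled[cut:]
print(len(train_ids), len(val_ids))
```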


Let's create sequences of purchases made by the customers in the dataset for both the train and validation sets.

In [18]:
# list to capture purchase history of the customers
purchases_train = []

# populate the list with the product codes
for i in tqdm(customers_train):
    temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_train.append(temp)

100%|█████████████████████████████████████████████████████████████████████████████| 3935/3935 [00:10<00:00, 367.85it/s]


In [19]:
# list to capture purchase history of the customers
purchases_val = []

# populate the list with the product codes
for i in tqdm(validation_df['CustomerID'].unique()):
    temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_val.append(temp)

100%|███████████████████████████████████████████████████████████████████████████████| 437/437 [00:00<00:00, 507.44it/s]
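
The two loops above filter the full frame once per customer. As a sketch of an alternative, assuming the same CustomerID and StockCode columns and shown here on a made-up toy frame, the rows can be grouped once with pandas `groupby`. Note that the sequences then come out in order of each customer's first appearance rather than in the shuffled order of `customers_train`.

```python
import pandas as pd

# Toy purchase log (made-up values): two customers, interleaved rows
toy = pd.DataFrame({
    "CustomerID": [1.0, 2.0, 1.0, 2.0, 1.0],
    "StockCode": ["A", "B", "C", "D", "E"],
})

# sort=False keeps customers in order of first appearance;
# rows inside each group keep their original order
sequences = (
    toy.groupby("CustomerID", sort=False)["StockCode"]
       .apply(list)
       .tolist()
)
print(sequences)
```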


## Build word2vec Embeddings for Products

In [20]:
# train word2vec model
model = Word2Vec(window = 10,   # context window size
                 sg = 1,        # 1 = skip-gram
                 hs = 0,        # no hierarchical softmax
                 negative = 10, # for negative sampling
                 alpha=0.03, min_alpha=0.0007,  # initial and final learning rate
                 seed = 14)

model.build_vocab(purchases_train, progress_per=200)

model.train(purchases_train, total_examples = model.corpus_count, 
            epochs=10, report_delay=1)

(3657270, 3693650)

In [21]:
# save word2vec model
model.save("word2vec_2.model")

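As we do not plan to train the model any further, we call init_sims(replace=True). It replaces the raw vectors with their L2-normalized versions, which makes the model much more memory-efficient but means it can no longer be trained.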

In [22]:
model.init_sims(replace=True)

In [23]:
print(model)

Word2Vec(vocab=3183, size=100, alpha=0.03)


Now we will extract the vectors of all the products in our vocabulary and store them in one place for easy access.

In [24]:
# extract all vectors (indexing through model.wv; indexing the model directly is deprecated)
X = model.wv[model.wv.vocab]

X.shape

(3183, 100)

## Start Recommending Products

Congratulations! We now have word2vec embeddings for the 3,183 products in the model's vocabulary. Our next step is to suggest similar products for a given product or a product's vector.

Let's first create a product-ID to product-description dictionary so that we can easily look up a product's description from its ID.

In [28]:
# keep one description per StockCode (the last one seen);
# chaining avoids calling drop_duplicates in place on a slice of train_df
products = train_df[["StockCode", "Description"]].drop_duplicates(subset='StockCode', keep="last")

# create product-ID and product-description dictionary
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()

In [29]:
# test the dictionary
products_dict['84029E']

['RED WOOLLY HOTTIE WHITE HEART.']

<br>

I have defined the function below. It takes a product's vector (v) as input and returns the top n most similar products (6 by default).

In [30]:
def similar_products(v, n = 6):
    
    # extract most similar products for the input vector;
    # the top hit is skipped because, for a single product's vector, it is that product itself
    ms = model.wv.similar_by_vector(v, topn= n+1)[1:]
    
    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
        
    return new_ms        

Let's try out our function by passing the vector of the product '90019A' ('SILVER M.O.P ORBIT BRACELET').

In [33]:
print(products_dict['90019A'])
similar_products(model.wv['90019A'])

['SILVER M.O.P ORBIT BRACELET']


[('BLACK VINTAGE  CRYSTAL EARRINGS', 0.8562676906585693),
 ('SILVER M.O.P ORBIT DROP EARRINGS', 0.8460784554481506),
 ('ANT COPPER RED BOUDICCA BRACELET', 0.8188737630844116),
 ('DROP DIAMANTE EARRINGS PURPLE', 0.8162773251533508),
 ('GOLD/M.O.P PENDANT ORBIT NECKLACE', 0.8090729117393494),
 ('SILVER LARIAT BLACK STONE EARRINGS', 0.8064510822296143)]
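
The scores in this output are cosine similarities between the input vector and each product's vector. As a rough illustration of what that ranking involves, here is a small NumPy sketch on made-up 3-dimensional vectors. The function `cosine_similarity` and the `catalog` dictionary are hypothetical stand-ins for illustration, not gensim's implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional vectors standing in for the 100-dimensional embeddings
query = np.array([1.0, 0.0, 1.0])
catalog = {
    "item_a": np.array([2.0, 0.0, 2.0]),  # same direction as the query
    "item_b": np.array([0.0, 1.0, 0.0]),  # orthogonal to the query
    "item_c": np.array([1.0, 1.0, 0.0]),  # partly aligned with the query
}

# Rank catalog items from most to least similar to the query
ranked = sorted(
    catalog,
    key=lambda name: cosine_similarity(query, catalog[name]),
    reverse=True,
)
print(ranked)
```

Cosine similarity depends only on direction, so `item_a` scores highest even though it is twice as long as the query.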

<br>

Cool! The results are relevant: every recommendation is a piece of jewellery, like the input product. However, this output is based on the vector of a single product only. What if we want to recommend products to a user based on the multiple purchases he or she has made in the past?

One simple solution is to take the average of the vectors of all the products the user has bought so far and use this resultant vector to find similar products. For that we will use the function below, which takes a list of product IDs and returns a 100-dimensional vector: the mean of the vectors of the products in the input list.

In [34]:
def aggregate_vectors(products):
    product_vec = []
    for i in products:
        try:
            product_vec.append(model.wv[i])
        except KeyError:
            # product is not in the model's vocabulary; skip it
            continue

    return np.mean(product_vec, axis=0)
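
One edge case worth keeping in mind: if none of the input products are in the vocabulary, the list stays empty and `np.mean` returns `nan` rather than a vector. The sketch below shows the same averaging idea with an explicit guard, using a hypothetical dictionary `toy_vectors` in place of the trained model. The name `mean_vector` and the choice to return `None` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical lookup table standing in for the trained embeddings
toy_vectors = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([3.0, 4.0]),
}

def mean_vector(codes, vectors):
    """Average the vectors of the known codes; return None if none are known."""
    found = [vectors[c] for c in codes if c in vectors]
    if not found:
        return None
    return np.mean(found, axis=0)

known = mean_vector(["A", "B", "UNKNOWN"], toy_vectors)  # "UNKNOWN" is skipped
missing = mean_vector(["UNKNOWN"], toy_vectors)          # nothing to average
print(known, missing)
```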

If you recall, we have already created a separate list of purchase sequences for validation. Now let's make use of that.

In [35]:
len(purchases_val[0])

196

The length of the first list of products purchased by a user is 196. We will pass this product sequence from the validation set to the function *aggregate_vectors*.

In [36]:
aggregate_vectors(purchases_val[0]).shape

(100,)

Well, the function has returned an array of 100 dimensions, so it is working as expected. Now we can use this result to get the most similar products. Let's do it.

In [37]:
similar_products(aggregate_vectors(purchases_val[0]))

[('JAM MAKING SET WITH JARS', 0.6791869401931763),
 ('SET OF 3 REGENCY CAKE TINS', 0.6773524880409241),
 ('SET OF 3 CAKE TINS PANTRY DESIGN ', 0.6668287515640259),
 ('ROSE DU SUD CUSHION COVER', 0.6576700806617737),
 ('PARTY BUNTING', 0.6521560549736023),
 ('SPOTTY BUNTING', 0.6507757902145386)]

As it turns out, our system has recommended 6 products based on the entire purchase history of a user. If you want product suggestions based on only the last few purchases, you can use the same set of functions.

Below I am giving only the last 10 products purchased as input.

In [38]:
similar_products(aggregate_vectors(purchases_val[0][-10:]))

[('PACK OF SIX LED TEA LIGHTS', 0.6577250957489014),
 ("BOX OF 6 MINI 50'S CRACKERS", 0.6457664966583252),
 ("PAPER CHAIN KIT 50'S CHRISTMAS ", 0.6398783326148987),
 ("6 GIFT TAGS 50'S CHRISTMAS ", 0.6396781802177429),
 ('BISCUIT TIN VINTAGE CHRISTMAS', 0.6326345205307007),
 (" 50'S CHRISTMAS GIFT BAG LARGE", 0.6296260356903076)]

Feel free to play with this code and try getting product recommendations for more sequences from the validation set.