## Import Libraries and Load Data

In [0]:
%pip install mlflow

Python interpreter will be restarted.


In [0]:
import pandas as pd
import numpy as np
import random
import pickle

import re
import os, sys
import jieba
import jieba.posseg as pseg
import logging 

from sys import version_info
from tqdm import tqdm

import gensim
from gensim.models import Word2Vec, KeyedVectors
import umap

import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import mlflow
import mlflow.pyfunc
from mlflow.models.signature import infer_signature

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

import cloudpickle



In [0]:
%sql
REFRESH TABLE invoices;
REFRESH TABLE stopwords_en;

In [0]:
invoices = spark.sql("SELECT * FROM invoices")
df = invoices.toPandas()
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/10 8:26,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/10 8:26,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/10 8:26,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/10 8:26,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/10 8:26,3.39,17850,United Kingdom


Let's take a quick look at our data, the Online Retail dataset. You can __download it from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/00352/).__

Given below is the description of the fields in this dataset:

1. __InvoiceNo:__ Invoice number, a unique number assigned to each transaction.

2. __StockCode:__ Product/item code, a unique code assigned to each distinct product.

3. __Description:__ Product description.

4. __Quantity:__ The quantity of each product per transaction.

5. __InvoiceDate:__ Invoice date and time. The day and time when each transaction was generated.

6. __UnitPrice:__ Unit price. The product price per unit in sterling.

7. __CustomerID:__ Customer number, a unique number assigned to each customer.

8. __Country:__ Country name. The name of the country where each customer resides.

In [0]:
df.shape

Out[4]: (65499, 8)

The loaded table contains 65,499 rows (the full UCI dataset has 541,909 transactions). That is a pretty good number for us.

## Treat Missing Data

In [0]:
# check for missing values
df.isnull().sum()

Out[5]: InvoiceNo          0
StockCode          0
Description      166
Quantity           0
InvoiceDate        0
UnitPrice          0
CustomerID     25281
Country            0
dtype: int64

<br>
Since we have sufficient data, we will drop all the rows with missing values.

In [0]:
# remove missing values
df.dropna(inplace=True)

# again check missing values
df.isnull().sum()

Out[6]: InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

## Data Preparation

Let's convert the StockCode to string datatype.

In [0]:
df['StockCode']= df['StockCode'].astype(str)

Let's build a table of the unique products in our dataset, keeping one description per stock code.

In [0]:
# keep one row per StockCode (the last occurrence); chaining drop_duplicates
# avoids modifying a slice of df in place
products = df[["StockCode", "Description"]].drop_duplicates(subset='StockCode', keep="last")

# create product-ID and product-description dictionary
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()

print(products)

      StockCode                         Description
107       84854                 GIRLY PINK TOOL SET
201      35004G          SET OF 3 GOLD FLYING DUCKS
336       20820               SILVER LOOKING MIRROR
496      90129F          RED GLASS TASSLE BAG CHARM
525      90199C     5 STRAND GLASS NECKLACE CRYSTAL
...         ...                                 ...
65097     22636  CHILDS BREAKFAST SET CIRCUS PARADE
65098     84945  MULTI COLOUR SILVER T-LIGHT HOLDER
65099     22440       BALLOON WATER BOMB PACK OF 35
65100     22437       SET OF 9 BLACK SKULL BALLOONS
65101     22423            REGENCY CAKESTAND 3 TIER

[2589 rows x 2 columns]


There are 2,589 unique products in our dataset. We will treat each product description as a short sequence of words, which gives us 2,589 sequences to learn from.

It is good practice to set aside a small part of the dataset for validation. Therefore, I will use 80% of the products to create the word2vec embeddings and hold out the remaining 20%. Let's split the data.

In [0]:
from sklearn.model_selection import train_test_split

train_df, validation_df = train_test_split(products, test_size=0.2)
train_df

Unnamed: 0,StockCode,Description
64162,22852,DOG BOWL VINTAGE CREAM
50198,37444C,PINK BREAKFAST CUP AND SAUCER
46810,21815,STAR T-LIGHT HOLDER
52604,21468,BUTTERFLY CROCHET FOOD COVER
28737,84313C,ORANGE TV TRAY TABLE
...,...,...
38303,90019A,SILVER M.O.P ORBIT BRACELET
64254,22227,HANGING HEART MIRROR DECORATION
64736,22844,VINTAGE CREAM DOG FOOD CONTAINER
64987,22963,JAM JAR WITH GREEN LID
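
The split above draws a different random subset on every run, so the rows shown will change between executions. Below is a minimal, standard-library sketch of a reproducible 80/20 split. The helper name `split_ids` and the seed value 42 are illustrative assumptions, not part of the original notebook.

```python
import random

def split_ids(ids, test_size=0.2, seed=42):
    """Shuffle a copy of ids with a fixed seed, then split into train/validation."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * test_size))
    return shuffled[n_val:], shuffled[:n_val]

# a few stock codes taken from the table above
stock_codes = ["84854", "35004G", "20820", "90129F", "90199C",
               "22636", "84945", "22440", "22437", "22423"]
train_ids, val_ids = split_ids(stock_codes)
print(len(train_ids), len(val_ids))
```

In the notebook itself, passing a fixed `random_state` to `train_test_split` should have the same effect.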


Let's define a preprocessing class that cleans the product descriptions and splits them into tokens, so that each description becomes a sequence of words.

In [0]:
class Preprocess():
    def __init__(self, lang="en", filter_words=None, stopwords_path='', userdict_path='', filter_pos=None):
        '''
        Input:
            lang: string. Language code (currently unused).
            filter_words: list of strings. Extra words to filter out after tokenization.
            stopwords_path: string. Path of stopwords file (unused; stopwords are read from the stopwords_en table).
            userdict_path: string. Path of userdict file (unused).
            filter_pos: list of POS strings. POS tags to filter out during tokenization.
        '''
        # avoid mutable default arguments
        filter_words = filter_words or []
        filter_pos = filter_pos or []
        # load the English stopwords from the stopwords_en table
        self.stopwords = spark.sql("SELECT stopword FROM stopwords_en").toPandas()["stopword"].values.tolist()
        self.stopwords.extend(filter_words)
        self.stopwords.append(' ')
        if sys.version_info[0] == 2: # python2
            self.stopwords = [word.decode('utf-8') for word in self.stopwords] # for python2
        self.stopwords = set(self.stopwords)
        self.filter_pos_num = len(filter_pos)
        self.filter_pos = set(filter_pos)

    def clean(self, texts, drop_space=False):
        '''Clean text, including dropping html tags, url, extra space and punctuation 
        as well as string lower.
        Input:
            texts: list of strings.
            drop_space: boolean. Whether to drop all whitespace.
        Output:
            ptexts: list of cleaned strings.
        '''
        ptexts = []
        for text in texts:
            # drop \n
            text = re.sub('\n','',text)
            # drop html tags 
            text = re.sub('<[^>]*>|&quot|&nbsp','',text)
            # drop url
            url_regrex = u'((http|https)\\:\\/\\/)?[a-zA-Z0-9\\.\\/\\?\\:@\\-_=#]+\\.([a-zA-Z]){2,6}([a-zA-Z0-9\\.\\&\\/\\?\\:@\\-_=#])*'
            text = re.sub(url_regrex,'',text)
            # only keep Chinese/English/space/numbers/decimal
            text = re.sub(u"[^\u4E00-\u9FFF^a-z^A-Z^\s^0-9^\d+(\.\d+)?]", " ",text) 
            if drop_space:
                # drop any space
                text = re.sub(u"[\s]{1,}","",text).strip()
            else:
                # drop space at the start and in the end
                text = re.sub(u"\s$|^\s","",text)
                # replace more than 2 space with 1 space
                text = re.sub(u"[\s]{2,}"," ",text).strip()
            # lower string
            text = text.lower()
            ptexts.append(text)
        return ptexts

    def tokenize(self, texts, drop_stopwords=True):
        '''Tokenize string.
        Input:
            texts: list of cleaned strings.
            drop_stopwords: boolean. Whether drop stopwords or not.
        Output:
            tokens_list: list of list of tokens.
        '''
        tokens_list = []
        for text in texts:
            tokens = text.split(' ')
            if drop_stopwords:
                tokens = [token for token in tokens if token not in self.stopwords]
            tokens_list.append(tokens)
        return tokens_list

    def preprocess(self, texts, sep=' ', drop_stopwords=True):
        '''Text preprocess for English/Chinese, including clean text, tokenize and remove 
        stopwords.
        Input:
            texts: list of text strings
            sep: string. Unused, because the tokens are returned unjoined.
            drop_stopwords: boolean. Whether to drop stopwords or not.
        Output:
            tokens_list: list of token lists, one per input text (the format
                Word2Vec expects for training).
        '''
        # clean text
        ptexts = self.clean(texts)
        # tokenize
        tokens_list = self.tokenize(ptexts, drop_stopwords)
        return tokens_list

In [0]:
preprocess = Preprocess()
preprocessed_data = preprocess.preprocess(train_df["Description"].to_list())
preprocessed_data[:3]

Out[11]: [['dog', 'bowl', 'vintage', 'cream'],
 ['pink', 'breakfast', 'cup', 'saucer'],
 ['star', 'light', 'holder']]
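
To make the clean-then-tokenize pipeline easier to follow, here is a stripped-down sketch that reproduces the first and third outputs above. It is a simplification of the `Preprocess` class, not a replacement: the regular expression only handles ASCII text, and the `STOPWORDS` set is a hypothetical stand-in for the `stopwords_en` table.

```python
import re

# Hypothetical stopword list for illustration; the notebook reads its own from a table.
STOPWORDS = {"t", "and", "of", "the", "with"}

def clean_and_tokenize(text, stopwords=STOPWORDS):
    """Replace non-alphanumeric characters with spaces, lowercase, split, drop stopwords."""
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text).lower()
    return [tok for tok in text.split() if tok not in stopwords]

print(clean_and_tokenize("STAR T-LIGHT HOLDER"))
print(clean_and_tokenize("PINK BREAKFAST CUP AND SAUCER"))
```

The hyphen in `T-LIGHT` becomes a space, which leaves a lone `t` that is then removed as a stopword.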

In [0]:
train_df.shape

Out[12]: (2071, 2)

## Build word2vec Embeddings for Products

In [0]:
PYTHON_VERSION = "{major}.{minor}.{micro}".format(major=version_info.major,
                                                  minor=version_info.minor,
                                                  micro=version_info.micro)

conda_env = {
    'channels': ['defaults'],
    'dependencies': [
      'python={}'.format(PYTHON_VERSION),
      'pip',
      {
        'pip': [
          'mlflow',
          'gensim=={}'.format(gensim.__version__),
          'cloudpickle=={}'.format(cloudpickle.__version__),
        ],
      },
    ],
    'name': 'gensim_env'
}

class GensimModelWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model=None):
        # keep only the word vectors; accepts a Word2Vec model or its KeyedVectors
        self.model = getattr(model, "wv", model)

    def load_context(self, context):
        '''Load the word2vec vectors from the "gensim_model" artifact, if one was saved.
        Input: 
            context: mlflow PythonModelContext. Its artifacts map names to local paths.
        '''
        if "gensim_model" in context.artifacts:
            self.model = Word2Vec.load(context.artifacts["gensim_model"]).wv

    def predict(self, context, model_input):
        return self.word2vec(model_input)

    def word2vec(self, corpus):
        '''Get Word2Vec embeddings.   
        Input:    
            corpus: list of preprocessed strings (tokens separated by a space).   
        Output:    
            embeddings: array of shape [n_sample, dim]    
        '''
        embeddings = []
        for text in corpus:
            # drop tokens which are not in the vocabulary
            # (gensim 3.x API; in gensim 4.x use self.model.key_to_index)
            tokens = [token for token in text.split(' ') if token in self.model.vocab]
            if len(tokens) == 0:
                # no known token: use the 'unk' vector if the vocabulary has one,
                # otherwise fall back to a zero vector
                if 'unk' in self.model.vocab:
                    embedding = self.model['unk'].tolist()
                else:
                    embedding = np.zeros(self.model.vector_size).tolist()
            else:
                embedding = np.mean(self.model[tokens], axis=0).tolist()
            embeddings.append(embedding)
        return np.array(embeddings)
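
The `word2vec` method turns a whole description into one vector by averaging the vectors of its known tokens. A dependency-free sketch of that mean pooling is shown below. The vectors in `TOY_VECTORS` are made up for illustration, and returning `None` for text with no known token is an assumption of this sketch rather than the wrapper's behavior.

```python
# Toy 3-dimensional vectors; real word2vec vectors are learned, not hand-written.
TOY_VECTORS = {
    "jumbo": [1.0, 0.0, 2.0],
    "bag":   [3.0, 2.0, 0.0],
}

def embed(text, vectors=TOY_VECTORS):
    """Average the vectors of the known tokens; return None if no token is known."""
    known = [vectors[tok] for tok in text.split(" ") if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

print(embed("jumbo bag"))       # [2.0, 1.0, 1.0]
print(embed("bag of holding"))  # [3.0, 2.0, 0.0]
print(embed("unknown words"))   # None
```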

In [0]:
mlflow.end_run()

mlflow.set_experiment("/Shared/embedding")

mlflow.start_run()

# Hyperparameters
window=10
sg=1
hs=0
negative=10
alpha=0.03
min_alpha=0.0007
seed=14
epochs=10

# train word2vec model
model = Word2Vec(
    window = window, 
    sg = sg, 
    hs = hs,
    negative = negative, # for negative sampling
    alpha=alpha, 
    min_alpha=min_alpha,
    seed = seed
)

mlflow.log_param('window', window)
mlflow.log_param('sg', sg)
mlflow.log_param('hs', hs)
mlflow.log_param('negative', negative)
mlflow.log_param('alpha', alpha)
mlflow.log_param('min_alpha', min_alpha)
mlflow.log_param('seed', seed)
mlflow.log_param('epochs', epochs)

model.build_vocab(preprocessed_data, progress_per=200)

model.train(
    preprocessed_data, 
    total_examples = model.corpus_count, 
    epochs=epochs, 
    report_delay=1
)

wrappedModel = GensimModelWrapper(model)

#signature = infer_signature(preprocessed_data, wrappedModel.predict(None, preprocessed_data))
mlflow.pyfunc.log_model("lattest_gensim_model", python_model=wrappedModel, conda_env=conda_env)

mlflow.end_run()

2023/02/20 11:09:45 INFO mlflow.tracking.fluent: Experiment with name '/Shared/embedding' does not exist. Creating a new experiment.
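
To make the `sg` and `window` hyperparameters more concrete: with `sg=1` the model is trained in skip-gram mode, where each word is used to predict the words around it, and `window` caps how far apart the two words may be. The sketch below generates such (center, context) pairs for one tokenized description. The function name is hypothetical and this is only an illustration; gensim's actual training loop is more involved (for example, it also applies negative sampling).

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["dog", "bowl", "vintage", "cream"], window=1)
print(pairs)
```

With a window as large as 10, a short product description contributes a pair for nearly every combination of its words.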


In [0]:
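# save word2vec model
model.save("word2vec_2.model")

# the wrapper reads the saved model back from this artifact in load_context()
artifacts = {"gensim_model": "word2vec_2.model"}

mlflow_pyfunc_model_path = "gensim_mlflow_pyfunc_custom"
mlflow.pyfunc.save_model(path=mlflow_pyfunc_model_path, python_model=GensimModelWrapper(model), conda_env=conda_env, artifacts=artifacts)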



In [0]:
# Load and test the saved pyfunc model
loaded_model = mlflow.pyfunc.load_model(mlflow_pyfunc_model_path)

input_data = ['jumbo bag', "bag"]
# Evaluate the model
test_predictions = loaded_model.predict(input_data)
print(test_predictions)

# cosine similarity between the two embeddings
a, b = test_predictions[0], test_predictions[1]
np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))



As we do not plan to train the model any further, we are calling init_sims(), which will make the model much more memory-efficient.

In [0]:
model.init_sims(replace=True)
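
With `replace=True`, `init_sims()` overwrites the raw vectors with their L2-normalized versions (this is the gensim 3.x API; the call is deprecated in gensim 4.x). Once vectors have unit length, a plain dot product between two of them equals their cosine similarity. The pure-Python sketch below checks this on two made-up vectors.

```python
import math

def l2_normalize(vec):
    """Scale vec to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)

# cosine similarity computed on the raw vectors
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
print(round(dot(ua, ub), 6), round(cosine, 6))  # 0.96 0.96
```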



In [0]:
print(model)



Now we will extract the vectors of all the words in our vocabulary and store it in one place for easy access.

In [0]:
# extract all vectors
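# gensim 3.x API: index the KeyedVectors (model.wv) rather than the model itself;
# in gensim 4.x, model.wv.vectors holds the same vectors
X = model.wv[model.wv.vocab]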

X.shape



In [0]:
%sql CREATE DATABASE IF NOT EXISTS feature_store_embeddings;




In [0]:
from databricks import feature_store

fs = feature_store.FeatureStoreClient()



## Visualize word2vec Embeddings

It is always helpful to visualize the embeddings that you have created. Here we have 100-dimensional embeddings. We can't even visualize 4 dimensions, let alone 100. Therefore, we are going to reduce the embeddings from 100 dimensions to 2 with UMAP, a dimensionality reduction algorithm.

In [0]:
cluster_embedding = umap.UMAP(n_neighbors=30, min_dist=0.0,
                              n_components=2, random_state=42).fit_transform(X)

plt.figure(figsize=(10,9))
# cmap has no effect without a color array (c), so it is omitted
plt.scatter(cluster_embedding[:, 0], cluster_embedding[:, 1], s=3)



Every dot in this plot is a word from the product descriptions. As you can see, there are several tiny clusters of these datapoints. These are groups of words that appear in similar contexts.

## Start Recommending Products

Congratulations! We are finally ready with the word2vec embeddings for every product in our online retail dataset. Now our next step is to suggest similar products for a certain product or a product's vector.

Let's first create a product-ID and product-description dictionary to easily map a product's description to its ID and vice versa.

In [0]:
# keep one row per StockCode (the last occurrence); chaining drop_duplicates
# avoids modifying a slice of train_df in place
products = train_df[["StockCode", "Description"]].drop_duplicates(subset='StockCode', keep="last")

# create product-ID and product-description dictionary
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()



In [0]:
# test the dictionary
products_dict['84029E']



<br>

I have defined the function below. It takes a product's vector (v) as input and returns the top n (by default 6) most similar products.

In [0]:
def similar_products(v, n = 6):
    
    # extract most similar products for the input vector
    ms = model.wv.similar_by_vector(v, topn= n+1)[1:]
    
    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
        
    return new_ms        
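
Conceptually, a similarity lookup like the one in `similar_products` ranks every vector in the vocabulary by its cosine similarity to the query vector. The standard-library sketch below shows the idea on a toy vocabulary. The vectors and the helper names `cosine` and `most_similar` are illustrative assumptions, and gensim's own implementation is vectorized rather than a Python loop.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def most_similar(query, vectors, topn=2):
    """Rank every key in `vectors` by cosine similarity to `query`, best first."""
    scored = [(key, cosine(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:topn]

# made-up 2-dimensional vectors
toy = {
    "mug":     [1.0, 0.1],
    "cup":     [0.9, 0.2],
    "lantern": [0.0, 1.0],
}
ranking = most_similar(toy["mug"], toy, topn=3)
print(ranking)
```

The query item itself always ranks first, which is why `similar_products` asks for `n+1` results and discards the first one.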



Let's try out our function by passing the vector of the product '84029E' ('RED WOOLLY HOTTIE WHITE HEART.').

In [0]:
similar_products(model.wv['84029E'])



<br>

Cool! The results are pretty relevant and match the input product well. However, this output is based on the vector of a single product only. What if we want to recommend products to a user based on the multiple purchases he or she has made in the past?

One simple solution is to average the vectors of all the products the user has bought so far and use the resulting vector to find similar products. For that we will use the function below, which takes a list of product IDs and returns a 100-dimensional vector that is the mean of the vectors of the products in the input list.

In [0]:
def aggregate_vectors(products):
    product_vec = []
    for i in products:
        try:
            product_vec.append(model.wv[i])
        except KeyError:
            continue
        
    return np.mean(product_vec, axis=0)



If you recall, we have already set aside a separate list of purchase sequences for validation purposes. Now let's make use of that.

In [0]:
len(purchases_val[0])



The length of the first list of products purchased by a user is 314. We will pass this product sequence from the validation set to the function *aggregate_vectors*.

In [0]:
aggregate_vectors(purchases_val[0]).shape



Well, the function has returned a 100-dimensional array, which means it is working fine. Now we can use this result to get the most similar products. Let's do it.

In [0]:
similar_products(aggregate_vectors(purchases_val[0]))



As it turns out, our system has recommended 6 products based on the entire purchase history of a user. If you want product suggestions based on only the last few purchases, you can use the same set of functions.

Below I am giving only the last 10 products purchased as input.

In [0]:
similar_products(aggregate_vectors(purchases_val[0][-10:]))



Feel free to play with this code and try to get product recommendations for more sequences from the validation set.