In this notebook, I implement a **content-based recommendation system**.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import string
import re

In [18]:
pre_df=pd.read_csv("dataset/flipkart_com-ecommerce_sample.csv", na_values=["No rating available"])

In [19]:
print(pre_df.shape)
pre_df.head()

(20000, 15)


Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,,,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
1,7f7036a6d550aaa89d34c77bd39a5e48,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGU7MFYJFY,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,,,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati..."
2,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,"[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH4GRSUBJGZXE,999.0,499.0,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",False,Key Features of AW Bellies Sandals Wedges Heel...,,,AW,"{""product_specification""=>[{""key""=>""Ideal For""..."
3,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2F6HUZMQ6SJ,699.0,267.0,"[""http://img5a.flixcart.com/image/short/6/2/h/...",False,Key Features of Alisha Solid Women's Cycling S...,,,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,2016-03-25 22:59:23 +0000,http://www.flipkart.com/sicons-all-purpose-arn...,Sicons All Purpose Arnica Dog Shampoo,"[""Pet Supplies >> Grooming >> Skin & Coat Care...",PSOEH3ZYDMSYARJ5,220.0,210.0,"[""http://img5a.flixcart.com/image/pet-shampoo/...",False,Specifications of Sicons All Purpose Arnica Do...,,,Sicons,"{""product_specification""=>[{""key""=>""Pet Type"",..."


In [20]:
pre_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 15 columns):
uniq_id                    20000 non-null object
crawl_timestamp            20000 non-null object
product_url                20000 non-null object
product_name               20000 non-null object
product_category_tree      20000 non-null object
pid                        20000 non-null object
retail_price               19922 non-null float64
discounted_price           19922 non-null float64
image                      19997 non-null object
is_FK_Advantage_product    20000 non-null bool
description                19998 non-null object
product_rating             1849 non-null float64
overall_rating             1849 non-null float64
brand                      14136 non-null object
product_specifications     19986 non-null object
dtypes: bool(1), float64(4), object(10)
memory usage: 2.2+ MB


This dataset contains no user information, so we cannot build a user-based recommendation system. In addition, only 1,849 of the 20,000 rows have a non-missing `product_rating`, so product ratings will not help much either. I will therefore implement a **content-based (description + metadata) recommendation system**.

## ***Data Preprocessing***

In [21]:
#Normalize product_category_tree column
pre_df['product_category_tree']=pre_df['product_category_tree'].map(lambda x:x.strip('[]'))
pre_df['product_category_tree']=pre_df['product_category_tree'].map(lambda x:x.strip('"'))
pre_df['product_category_tree']=pre_df['product_category_tree'].map(lambda x:x.split('>>'))
pre_df.head(1)

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[Clothing , Women's Clothing , Lingerie, Sle...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,,,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
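
The three `map` calls above leave whitespace around each category level (note `Clothing ` in the output). As a minimal sketch, the same normalization can be done in one helper that also trims each level. The name `parse_category_tree` and the sample string are made up for illustration; only the `["A >> B >> C"]` layout is taken from the dataset preview.

```python
def parse_category_tree(raw):
    """Split a raw category string such as '["A >> B >> C"]' into clean levels."""
    # Drop the surrounding brackets and quotes, as the cell above does.
    inner = raw.strip('[]').strip('"')
    # Split on the '>>' separator and trim whitespace around each level.
    return [level.strip() for level in inner.split('>>') if level.strip()]


# Hypothetical sample in the same layout as the product_category_tree column.
sample = '["Toys >> Board Games >> Chess Sets"]'
levels = parse_category_tree(sample)
print(levels)
```

Trimming matters later because the category levels are joined into one text field, where stray spaces add nothing but noise.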


In [22]:
#delete unwanted columns
del_list=['crawl_timestamp','product_url','image',"retail_price","discounted_price","is_FK_Advantage_product","product_rating","overall_rating","product_specifications"]
pre_df=pre_df.drop(del_list,axis=1)

In [23]:
pre_df.head(2)

Unnamed: 0,uniq_id,product_name,product_category_tree,pid,description,brand
0,c2d766ca982eca8304150849735ffef9,Alisha Solid Women's Cycling Shorts,"[Clothing , Women's Clothing , Lingerie, Sle...",SRTEH2FF9KEDEFGF,Key Features of Alisha Solid Women's Cycling S...,Alisha
1,7f7036a6d550aaa89d34c77bd39a5e48,FabHomeDecor Fabric Double Sofa Bed,"[Furniture , Living Room Furniture , Sofa Be...",SBEEH3QGU7MFYJFY,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,FabHomeDecor


In [24]:
# Dropping duplicates
pre_df.drop_duplicates(subset ="product_name", keep = "first", inplace = True)
pre_df.head()

Unnamed: 0,uniq_id,product_name,product_category_tree,pid,description,brand
0,c2d766ca982eca8304150849735ffef9,Alisha Solid Women's Cycling Shorts,"[Clothing , Women's Clothing , Lingerie, Sle...",SRTEH2FF9KEDEFGF,Key Features of Alisha Solid Women's Cycling S...,Alisha
1,7f7036a6d550aaa89d34c77bd39a5e48,FabHomeDecor Fabric Double Sofa Bed,"[Furniture , Living Room Furniture , Sofa Be...",SBEEH3QGU7MFYJFY,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,FabHomeDecor
2,f449ec65dcbc041b6ae5e6a32717d01b,AW Bellies,"[Footwear , Women's Footwear , Ballerinas , ...",SHOEH4GRSUBJGZXE,Key Features of AW Bellies Sandals Wedges Heel...,AW
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,Sicons All Purpose Arnica Dog Shampoo,"[Pet Supplies , Grooming , Skin & Coat Care ...",PSOEH3ZYDMSYARJ5,Specifications of Sicons All Purpose Arnica Do...,Sicons
5,c2a17313954882c1dba461863e98adf2,Eternal Gandhi Super Series Crystal Paper Weig...,[Eternal Gandhi Super Series Crystal Paper Wei...,PWTEB7H2E4KCYUE3,Key Features of Eternal Gandhi Super Series Cr...,Eternal Gandhi


In [40]:
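import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

# Download the stop word list and the WordNet data used by the lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
# Note: word_tokenize also needs the 'punkt' tokenizer data ('punkt_tab' in newer
# NLTK releases); download it as well if it is not already installed

lem = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # common words to drop
exclude = set(string.punctuation)             # punctuation characters to strip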

[nltk_data] Downloading package stopwords to C:\Users\S
[nltk_data]     Ponraj\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\S
[nltk_data]     Ponraj\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


In [26]:
pre_df.shape

(12676, 6)

### Data Cleaning

In [36]:
# Stop words are common words such as 'a', 'she', 'was'
print(list(stop_words)[:5])
# Punctuation characters to exclude
print(list(exclude)[:5])

['hasn', 'what', 'isn', 'over', 'if']
['*', '>', '/', '^', '|']


<strong>Lemmatization</strong> is the process of converting a word to its base (dictionary) form. The difference between stemming and lemmatization is that lemmatization uses vocabulary and part-of-speech information to return a meaningful base form, whereas simple stemming just strips the last few characters, which often produces non-words or changes the meaning.

<b>‘Caring’ -> Lemmatization -> ‘Care’ </b> <br/>
<b>‘Caring’ -> Stemming -> ‘Car’</b>
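
To make the contrast concrete, here is a toy sketch in plain Python. Neither function is NLTK: `naive_stem` is a hypothetical suffix stripper and `toy_lemmatize` uses a tiny hand-written lookup table standing in for a real dictionary such as WordNet. Production stemmers apply extra rules, so their output can differ from this toy.

```python
def naive_stem(word):
    """Crude suffix stripping: chop a known ending without checking the result."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table; a real lemmatizer consults a full dictionary.
TOY_LEMMAS = {"caring": "care", "cycling": "cycle", "shorts": "short"}


def toy_lemmatize(word):
    """Return the dictionary base form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ("caring", "cycling", "shorts"):
    print(w, "-> stem:", naive_stem(w), "| lemma:", toy_lemmatize(w))
```

The stripper turns "cycling" into the non-word "cycl", while the lookup returns "cycle", which is the kind of token we want to match across product descriptions.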

In [43]:
# Example: for the input "Alisha Solid Women's Cycling Shorts", the variables below hold:
# word_tokens:        ['alisha', 'solid', 'womens', 'cycling', 'shorts']
# filtered_sentence:  ['alisha', 'solid', 'womens', 'cycle', 'short']

def filter_keywords(doc):
    doc=doc.lower()
    # Drop stop words first, then strip punctuation character by character
    stop_free = " ".join([i for i in doc.split() if i not in stop_words])
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    word_tokens = word_tokenize(punc_free)
    # Lemmatize every token as a verb ("v"), e.g. 'cycling' -> 'cycle'
    filtered_sentence = [(lem.lemmatize(w, "v")) for w in word_tokens]
    return filtered_sentence

In [44]:
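smd = pre_df.copy()
smd['product'] = smd['product_name'].apply(filter_keywords)
# Fill missing values with an empty string first; astype("str") alone would turn
# NaN into the literal token "nan", which unrelated products would then share
smd['description'] = smd['description'].fillna("").astype("str").apply(filter_keywords)
smd['brand'] = smd['brand'].fillna("").astype("str").apply(filter_keywords)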

In [47]:
# Concatenate the token lists from all columns into one list per product
smd["all_meta"]=smd['product']+smd['brand']+ smd['product_category_tree']+smd['description']
# Join each list into a single space-separated string
smd["all_meta"] = smd["all_meta"].apply(lambda x: ' '.join(x))

In [48]:
smd["all_meta"].head()

0    alisha solid womens cycle short alisha Clothin...
1    fabhomedecor fabric double sofa bed fabhomedec...
2    aw belly aw Footwear   Women's Footwear   Ball...
4    sicons purpose arnica dog shampoo sicons Pet S...
5    eternal gandhi super series crystal paper weig...
Name: all_meta, dtype: object

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# Alternative: raw term counts instead of TF-IDF weights
# count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
# count_matrix = count.fit_transform(smd['all_meta'])
# Unigrams and bigrams; min_df=1 keeps every term (recent scikit-learn versions may reject an integer min_df of 0)
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['all_meta'])
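
The `ngram_range=(1, 2)` setting means the vocabulary holds single words and pairs of adjacent words. The sketch below shows which features that produces for one short token list. `word_ngrams` is a hypothetical helper written for illustration, not the scikit-learn implementation, and it skips the stop-word filtering the vectorizer also applies.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """List every run of min_n..max_n adjacent tokens, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams(["solid", "wood", "bed"]))
```

Bigrams such as "solid wood" let two products match on a phrase rather than only on its individual words.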

### Cosine Similarity
I will use cosine similarity to calculate a numeric score that denotes how similar two products are.
Because `TfidfVectorizer` L2-normalizes each row by default, the dot product of two rows is already their cosine similarity, so `linear_kernel` would give the same scores as `cosine_similarity`.
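
A small plain-Python check of that claim, using two made-up vectors rather than rows of the real TF-IDF matrix: once both vectors are scaled to unit length, their dot product matches the cosine similarity of the originals.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Made-up example vectors (not taken from the dataset).
a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)

print(cosine(a, b), dot(a_hat, b_hat))
```

The two printed numbers agree up to floating-point rounding, which is why a plain dot product is enough on normalized TF-IDF rows.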

In [50]:
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

We now have a pairwise cosine similarity matrix for all the products in our dataset. The next step is to write a function that returns the most similar products based on the cosine similarity score.

In [51]:
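def get_recommendations(title):
    # Row position of the requested product (indices and titles are defined in the next cell)
    idx = indices[title]
    # Pair every product position with its similarity to the requested product
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Most similar first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the product itself) and keep the next 30
    sim_scores = sim_scores[1:31]
    product_indices = [i[0] for i in sim_scores]
    return titles.iloc[product_indices]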
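
The slice `[1:31]` assumes the product itself always sorts first. If another product has identical text, its score ties with the product's own, and the product is no longer guaranteed to land in position 0. A possible variant, sketched here in plain Python on a made-up similarity row, excludes the product by position and uses `heapq.nlargest` so the full row is not sorted. `top_n_similar` is a hypothetical name, not part of the notebook.

```python
import heapq


def top_n_similar(sim_row, idx, n=30):
    """Return positions of the n highest scores in sim_row, skipping idx itself."""
    # (score, position) pairs for every product except the query product.
    candidates = ((score, i) for i, score in enumerate(sim_row) if i != idx)
    best = heapq.nlargest(n, candidates)
    return [i for score, i in best]


# Made-up similarity row: position 0 is the query product itself.
toy_row = [1.0, 0.2, 0.9, 0.0, 0.5]
print(top_n_similar(toy_row, idx=0, n=3))
```

With tuples, ties in score are broken by the larger position, which differs from the stable `sorted` call above, so tied products may come back in a different order.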

In [52]:
smd = smd.reset_index()
titles = smd['product_name']
indices = pd.Series(smd.index, index=smd['product_name'])

Let us now try and get the top recommendations for a few products.

In [53]:
get_recommendations("FabHomeDecor Fabric Double Sofa Bed").head(5)

12219    Comfort Couch Engineered Wood 3 Seater Sofa
12199        @home Annulus Solid Wood Dressing Table
11866       Ethnic Handicrafts Solid Wood Single Bed
11857        Ethnic Handicrafts Solid Wood Queen Bed
5191                    HomeEdge Solid Wood King Bed
Name: product_name, dtype: object

In [54]:
get_recommendations("Alisha Solid Women's Cycling Shorts").head(5)

644     Mynte Solid Women's Cycling Shorts, Gym Shorts...
7162                    Ashdan Solid Women's Basic Shorts
899               Vitamins Solid Baby Girl's Basic Shorts
1786                 Vero Moda Solid Women's Basic Shorts
7158                  Lilliput Solid Boy's Bermuda Shorts
Name: product_name, dtype: object

In [56]:
get_recommendations("Alisha Solid Women's Cycling Shorts").head(5).to_csv("Alisha Solid Women's Cycling Shorts recommendations.csv",index=False,header=True)