In this notebook, I attempt to implement a **Content-Based Recommendation System**.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import string
import re
%matplotlib inline

In [2]:
pre_df=pd.read_csv("dataset/flipkart_com-ecommerce_sample.csv", na_values=["No rating available"])

In [3]:
pre_df.head()

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,,,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
1,7f7036a6d550aaa89d34c77bd39a5e48,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGU7MFYJFY,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,,,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati..."
2,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,"[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH4GRSUBJGZXE,999.0,499.0,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",False,Key Features of AW Bellies Sandals Wedges Heel...,,,AW,"{""product_specification""=>[{""key""=>""Ideal For""..."
3,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2F6HUZMQ6SJ,699.0,267.0,"[""http://img5a.flixcart.com/image/short/6/2/h/...",False,Key Features of Alisha Solid Women's Cycling S...,,,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,2016-03-25 22:59:23 +0000,http://www.flipkart.com/sicons-all-purpose-arn...,Sicons All Purpose Arnica Dog Shampoo,"[""Pet Supplies >> Grooming >> Skin & Coat Care...",PSOEH3ZYDMSYARJ5,220.0,210.0,"[""http://img5a.flixcart.com/image/pet-shampoo/...",False,Specifications of Sicons All Purpose Arnica Do...,,,Sicons,"{""product_specification""=>[{""key""=>""Pet Type"",..."


In [4]:
pre_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 15 columns):
uniq_id                    20000 non-null object
crawl_timestamp            20000 non-null object
product_url                20000 non-null object
product_name               20000 non-null object
product_category_tree      20000 non-null object
pid                        20000 non-null object
retail_price               19922 non-null float64
discounted_price           19922 non-null float64
image                      19997 non-null object
is_FK_Advantage_product    20000 non-null bool
description                19998 non-null object
product_rating             1849 non-null float64
overall_rating             1849 non-null float64
brand                      14136 non-null object
product_specifications     19986 non-null object
dtypes: bool(1), float64(4), object(10)
memory usage: 2.2+ MB


This dataset contains no user information, so we cannot build a user-based recommendation system. In addition, only 1,849 of the 20,000 rows have a non-missing `product_rating`, so product ratings will not help much either. I will therefore implement a **Content-Based (Description + Metadata) Recommendation System**.
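To make the "Description + Metadata" idea concrete, here is a minimal sketch of how a description could be merged with brand and category information into a single text field before vectorizing. The helper `build_soup` and its arguments are hypothetical names introduced for illustration; they are not part of the dataset or of any library.

```python
def build_soup(description, brand=None, categories=None):
    """Join a description with brand and category tokens into one string.

    Hypothetical helper: metadata values are lowercased and stripped of
    spaces so that a multi-word brand or category becomes a single token.
    """
    parts = [description if isinstance(description, str) else ""]
    # Missing brands show up as NaN (a float) in pandas, so accept only strings
    if isinstance(brand, str) and brand.strip():
        parts.append(brand.replace(" ", "").lower())
    for category in categories or []:
        token = category.strip().replace(" ", "").lower()
        if token:
            parts.append(token)
    return " ".join(part for part in parts if part)


print(build_soup("Solid cycling shorts", "Alisha", ["Clothing ", " Women's Clothing "]))
```

The resulting string could then be vectorized in place of the raw description, which lets items that share a brand or category score as more similar.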

## ***Data Preprocessing***

In [5]:
#Normalize product_category_tree column
pre_df['product_category_tree']=pre_df['product_category_tree'].map(lambda x:x.strip('[]'))
pre_df['product_category_tree']=pre_df['product_category_tree'].map(lambda x:x.strip('"'))
pre_df['product_category_tree']=pre_df['product_category_tree'].map(lambda x:x.split('>>'))
pre_df.head(1)

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[Clothing , Women's Clothing , Lingerie, Sle...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,,,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
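The output above shows that each category level still carries padding spaces after the split (for example `Clothing ` with a trailing space). A small sketch of a parser that also trims each level is shown below; `parse_category_tree` is a hypothetical helper, and the input is a shortened illustrative string rather than a full row from the dataset.

```python
def parse_category_tree(raw):
    """Split a raw category-tree string into a list of clean category names."""
    # Drop the surrounding brackets and quotes: '["A >> B"]' -> 'A >> B'
    inner = raw.strip().strip('[]').strip('"')
    # Split on the '>>' separator and trim the padding around each level
    return [level.strip() for level in inner.split('>>') if level.strip()]


example = '["Footwear >> Women\'s Footwear >> Ballerinas"]'
print(parse_category_tree(example))
```

It could be applied with `pre_df['product_category_tree'].map(parse_category_tree)` on the raw column, in place of the three `map` calls.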


In [6]:
#delete unwanted columns
del_list=['crawl_timestamp','product_url','image',"retail_price","discounted_price","is_FK_Advantage_product","product_rating","overall_rating","product_specifications"]
pre_df=pre_df.drop(del_list,axis=1)

In [7]:
pre_df.head(2)

Unnamed: 0,uniq_id,product_name,product_category_tree,pid,description,brand
0,c2d766ca982eca8304150849735ffef9,Alisha Solid Women's Cycling Shorts,"[Clothing , Women's Clothing , Lingerie, Sle...",SRTEH2FF9KEDEFGF,Key Features of Alisha Solid Women's Cycling S...,Alisha
1,7f7036a6d550aaa89d34c77bd39a5e48,FabHomeDecor Fabric Double Sofa Bed,"[Furniture , Living Room Furniture , Sofa Be...",SBEEH3QGU7MFYJFY,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,FabHomeDecor


In [8]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
exclude = set(string.punctuation)
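The `stop_words` and `exclude` sets defined in this cell are typically applied in a text-cleaning step. The sketch below shows one way such a step might look, using only the standard library: `DEMO_STOP_WORDS` is a tiny stand-in list I made up for illustration (NLTK's English list is much larger), `clean_text` is a hypothetical helper, and lemmatization is left out.

```python
import string

# Tiny stand-in stop-word list (an assumption for illustration only)
DEMO_STOP_WORDS = {"the", "a", "an", "of", "and", "for", "is", "in", "to"}

# Translation table that maps every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words from a description."""
    text = text.lower().translate(PUNCT_TO_SPACE)
    # Split on whitespace and keep tokens that are not stop words
    tokens = [token for token in text.split() if token not in stop_words]
    return " ".join(tokens)


print(clean_text("Key Features of the Cotton Shorts: Pack of 3, Solid!"))
```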

In [9]:
# Dropping duplicates
pre_df.drop_duplicates(subset ="product_name", 
                     keep = "first", inplace = True)
pre_df.head()

Unnamed: 0,uniq_id,product_name,product_category_tree,pid,description,brand
0,c2d766ca982eca8304150849735ffef9,Alisha Solid Women's Cycling Shorts,"[Clothing , Women's Clothing , Lingerie, Sle...",SRTEH2FF9KEDEFGF,Key Features of Alisha Solid Women's Cycling S...,Alisha
1,7f7036a6d550aaa89d34c77bd39a5e48,FabHomeDecor Fabric Double Sofa Bed,"[Furniture , Living Room Furniture , Sofa Be...",SBEEH3QGU7MFYJFY,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,FabHomeDecor
2,f449ec65dcbc041b6ae5e6a32717d01b,AW Bellies,"[Footwear , Women's Footwear , Ballerinas , ...",SHOEH4GRSUBJGZXE,Key Features of AW Bellies Sandals Wedges Heel...,AW
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,Sicons All Purpose Arnica Dog Shampoo,"[Pet Supplies , Grooming , Skin & Coat Care ...",PSOEH3ZYDMSYARJ5,Specifications of Sicons All Purpose Arnica Do...,Sicons
5,c2a17313954882c1dba461863e98adf2,Eternal Gandhi Super Series Crystal Paper Weig...,[Eternal Gandhi Super Series Crystal Paper Wei...,PWTEB7H2E4KCYUE3,Key Features of Eternal Gandhi Super Series Cr...,Eternal Gandhi


In [10]:
#Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
pre_df['description'] = pre_df['description'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(pre_df['description'])

#Output the shape of tfidf_matrix
print('Shape: ', tfidf_matrix.shape)

Shape:  (12676, 24646)
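To show what the weights in this matrix mean, here is a from-scratch sketch of TF-IDF on three toy descriptions. It uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the form scikit-learn documents as the `TfidfVectorizer` default; tokenization and stop-word removal are simplified to a whitespace split, and the toy descriptions and the function name `tfidf_vectors` are my own.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized docs."""
    n_docs = len(docs)
    vocab = sorted({token for doc in docs for token in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(token for doc in docs for token in set(doc))
    # Smoothed inverse document frequency
    idf = {token: math.log((1 + n_docs) / (1 + df[token])) + 1 for token in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = [counts[token] * idf[token] for token in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        # Scale to unit length; an empty document stays all zeros
        vectors.append([v / norm for v in vec] if norm else vec)
    return vocab, vectors


docs = [
    "solid cotton cycling shorts".split(),
    "printed denim shorts".split(),
    "fabric double sofa bed".split(),
]
vocab, vectors = tfidf_vectors(docs)
for token, weight in zip(vocab, vectors[0]):
    if weight:
        print(f"{token:8s} {weight:.3f}")
```

In the first toy description, `shorts` gets a lower weight than `cycling` because it also appears in the second description, so it is less distinctive.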


In [12]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

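# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row by default,
# so the plain dot product computed by linear_kernel equals the cosine similarity.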
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
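`linear_kernel` computes plain dot products. It can stand in for cosine similarity here because TF-IDF rows are scaled to unit length by default. The short check below illustrates that equivalence on two made-up vectors; the helper names are mine.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity computed from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

raw_cosine = cosine(a, b)                          # cosine of the raw vectors
unit_dot = dot(l2_normalize(a), l2_normalize(b))   # dot product of unit vectors

print(raw_cosine, unit_dot)
```

Both values agree up to floating-point rounding, which is why the cheaper dot product is enough once the rows are normalized.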

In [13]:
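#Construct a reverse map from product name to row position in the TF-IDF matrix.
#Positions are used instead of pre_df.index because drop_duplicates left gaps in the index labels,
#while cosine_sim and .iloc are both addressed by position.
indices = pd.Series(np.arange(len(pre_df)), index=pre_df['product_name'])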

In [14]:
# Function that takes a product name as input and outputs the most similar products
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the product that matches the name
    idx = indices[title]

    # Get the pairwise similarity scores of all products with that product
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the products based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar products (position 0 is the product itself)
    sim_scores = sim_scores[1:11]

    # Get the product indices
    product_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar products
    return pre_df['product_name'].iloc[product_indices]
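The slice `sim_scores[1:11]` assumes the queried product always sorts first. That can fail when another product has an identical description and sits earlier in the matrix, because the sort is stable and the tie keeps the original order. The sketch below shows an alternative selection step that skips the query by position instead; `top_k_similar` is a hypothetical helper and the similarity row is made up.

```python
import heapq


def top_k_similar(sim_row, query_pos, k=10):
    """Return positions of the k highest scores in sim_row, skipping query_pos."""
    candidates = (
        (score, pos) for pos, score in enumerate(sim_row) if pos != query_pos
    )
    # nlargest avoids sorting the whole row; ties are broken by the larger position
    best = heapq.nlargest(k, candidates)
    return [pos for score, pos in best]


# Position 2 has the same score as the query at position 0 (a duplicate description)
sim_row = [1.0, 0.2, 1.0, 0.7, 0.0]
print(top_k_similar(sim_row, query_pos=0, k=3))
```

The returned positions could be passed to `pre_df['product_name'].iloc[...]` in the same way as the list built inside `get_recommendations`.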

In [16]:
get_recommendations("Alisha Solid Women's Cycling Shorts")

935      Mynte Solid Women's Cycling Shorts, Gym Shorts...
1292               Vitamins Solid Baby Girl's Basic Shorts
10425                    Ashdan Solid Women's Basic Shorts
3034                  Vero Moda Solid Women's Basic Shorts
1298         Vitamins Embroidered Baby Girl's Denim Shorts
1643              FS Mini Klub Printed Girl's Denim Shorts
15593         SMART DENIM Solid Women's White Denim Shorts
10422                   Broche Printed Boy's Sports Shorts
10421                  Lilliput Solid Boy's Bermuda Shorts
3270              Only Printed Women's Purple Basic Shorts
Name: product_name, dtype: object