# Amazon Product Recommender Systems
## Author: Jin Hui Xu

## Content Based Filtering

In [1]:
!wget https://github.com/JinHuiXu1991/Jin_DATA606/blob/main/cleaned_data/cleaned_amazon_product.zip?raw=true

!wget https://github.com/JinHuiXu1991/Jin_DATA606/blob/main/cleaned_data/cleaned_amazon_review.zip?raw=true

--2022-03-20 06:16:15--  https://github.com/JinHuiXu1991/Jin_DATA606/blob/main/cleaned_data/cleaned_amazon_product.zip?raw=true
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/JinHuiXu1991/Jin_DATA606/raw/main/cleaned_data/cleaned_amazon_product.zip [following]
--2022-03-20 06:16:15--  https://github.com/JinHuiXu1991/Jin_DATA606/raw/main/cleaned_data/cleaned_amazon_product.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/JinHuiXu1991/Jin_DATA606/main/cleaned_data/cleaned_amazon_product.zip [following]
--2022-03-20 06:16:15--  https://raw.githubusercontent.com/JinHuiXu1991/Jin_DATA606/main/cleaned_data/cleaned_amazon_product.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel
import time
import numpy as np
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


### Description Based Recommender

In [3]:
product_df = pd.read_csv('/content/cleaned_amazon_product.zip?raw=true', compression='zip')

In [4]:
product_df.head()

Unnamed: 0,category,description,title,brand,feature,main_cat,date,price,asin,imageURLHighRes,dateYear,dateMonth
0,appliances refrigerators freezers ice makers,,tupperware freezer square round container set ...,tupperware,each 3 pc set includes two 7 8 cup 200 ml and ...,appliances,2008-11-19,,7301113188,[],2008.0,11.0
1,appliances refrigerators freezers ice makers,2 x tupperware pure fresh unique covered cool ...,2 x tupperware pure amp fresh unique covered c...,tupperware,2 x tupperware pure fresh unique covered cool ...,appliances,2016-06-05,3.62,7861850250,['https://images-na.ssl-images-amazon.com/imag...,2016.0,6.0
2,appliances parts accessories,,the cigar moments of pleasure,the cigar book,,amazon home,,150.26,8792559360,['https://images-na.ssl-images-amazon.com/imag...,,
3,appliances parts accessories,multi purpost descaler especially suited to wa...,caraselle 2x 50g appliance descalene,caraselle,,tools home improvement,2014-12-17,,9792954481,['https://images-na.ssl-images-amazon.com/imag...,2014.0,12.0
4,appliances parts accessories range parts acces...,full gauge and size beveled edge furnished wit...,eaton wiring 39ch sp l arrow hart 1 gang chrom...,eaton wiring,returns will not be honored on this closeout i...,tools home improvement,2007-01-16,3.43,B00002N5EL,[],2007.0,1.0


In [5]:
product_df.shape

(30239, 12)

In [6]:
# Replace all NaN with an empty string
product_df = product_df.fillna('')
product_df.isnull().sum()

category           0
description        0
title              0
brand              0
feature            0
main_cat           0
date               0
price              0
asin               0
imageURLHighRes    0
dateYear           0
dateMonth          0
dtype: int64

In [7]:
lem = WordNetLemmatizer()
def lemma(text):
    return ' '.join(lem.lemmatize(w) for w in text.split() if w not in stop)   

In [8]:
product_df['description'] = product_df['description'].apply(lemma)

In [9]:
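# Define a TF-IDF Vectorizer object that removes English stop words and ignores terms
# appearing in more than 90% of the documents (max_df) or in fewer than 5 documents (min_df)
tfidf = TfidfVectorizer(stop_words='english', max_df = 0.9, min_df = 5)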

In [10]:
# Construct the required TF-IDF matrix 
tfidf_matrix = tfidf.fit_transform(product_df['description'])

In [11]:
# Output the shape of tfidf_matrix
tfidf_matrix.shape

(30239, 9738)
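
The vocabulary pruning done by max_df and min_df can be seen on a small scale. The sketch below uses a made-up four-document corpus and a lower min_df than the notebook (2 instead of 5) so that the effect shows up on so few documents; the corpus and variable names are illustrative only. It also shows a side effect worth keeping in mind: a document whose terms are all pruned, or whose text is empty (as some descriptions in the product table are), ends up as an all-zero row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration (not taken from the dataset).
docs = [
    "mini fridge cooler",
    "mini fridge freezer",
    "mini wine cooler",
    "mini pizza maker",
]

# max_df=0.9 drops terms found in more than 90% of documents ("mini" is in all four).
# min_df=2 drops terms found in fewer than two documents.
vec = TfidfVectorizer(max_df=0.9, min_df=2)
matrix = vec.fit_transform(docs)

kept_terms = sorted(vec.vocabulary_)
print(kept_terms)

# The last document has no surviving term, so its TF-IDF row is entirely zero.
print(matrix[3].nnz)
```

An all-zero row has a similarity of 0 with every product, itself included, so a ranking built from it carries no information.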

Both linear_kernel and cosine_similarity produce the same result here, because TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows is already their cosine similarity. We time the construction of the similarity matrix with each method and choose the faster one.
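
A minimal sketch of that equivalence on a made-up three-document corpus (the documents and variable names are illustrative, not from the dataset). It relies on the default L2 row normalization of TfidfVectorizer; with norm=None the two functions would generally differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus, invented for illustration.
docs = [
    "mini fridge portable cooler",
    "compact refrigerator freezer white",
    "portable cooler warmer mini fridge",
]

# norm="l2" is the default; it is spelled out because the equivalence depends on it.
matrix = TfidfVectorizer(norm="l2").fit_transform(docs)

dot_products = linear_kernel(matrix, matrix)   # plain dot products
cosines = cosine_similarity(matrix, matrix)    # normalizes rows, then dot products

# Every row already has unit length, so both matrices agree.
same = np.allclose(dot_products, cosines)
print(same)
```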

In [12]:
start = time.time()
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
print("Time taken: %s seconds" % (time.time() - start))

Time taken: 15.70565390586853 seconds


In [13]:
# start = time.time()
# cosine_sim2 = cosine_similarity(tfidf_matrix, tfidf_matrix)
# print("Time taken: %s seconds" % (time.time() - start))

The run times of the two methods are very close, so the linear_kernel result is used in the rest of the notebook.

In [14]:
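# Reverse mapping from lower-cased product id (asin) to row index; keep the first row for any repeated asin
indices = pd.Series(product_df.index, index=product_df['asin'].str.lower())
indices = indices[~indices.index.duplicated(keep='first')]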
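
When building an asin-to-row lookup like this one, repeated labels matter: Series.drop_duplicates() compares the values (here the row numbers, which are always distinct), not the index labels. The sketch below uses hypothetical ASINs to show the difference and one way to keep a single row per label.

```python
import pandas as pd

# Hypothetical ASINs; "B0001" appears twice on purpose.
asins = pd.Series(["B0001", "b0002", "B0001"])
lookup = pd.Series(asins.index, index=asins.str.lower())

# drop_duplicates looks at the values (0, 1, 2), which are all distinct,
# so the repeated label "b0001" survives.
by_value = lookup.drop_duplicates()

# Filtering on the index keeps only the first row for each label.
by_label = lookup[~lookup.index.duplicated(keep="first")]

print(len(by_value), len(by_label))
```

With a repeated label, lookup['b0001'] returns a Series rather than a single row number, which would break the row selection in a recommender function that expects a scalar.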

In [15]:
# Function that takes a product id (asin) as input and returns the 10 most similar products
def description_recommender(id, cosine_sim = cosine_sim, df = product_df, indices = indices):
  # get the row index of the product that matches the id
  idx = indices[id.lower()]

  # get the pairwise similarity scores between this product and every product,
  # then convert them into a list of (row index, score) tuples
  sim_scores = list(enumerate(cosine_sim[idx]))

  # sort the products by cosine similarity score, highest first
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

  # keep the scores of the 10 most similar products; skip the first entry, which is expected to be the input product itself
  sim_scores = sim_scores[1:11]

  # get the product indices
  product_indices = [i[0] for i in sim_scores]

  # return the asins and titles of the 10 most similar products
  return df['asin'].iloc[product_indices].tolist(), df['title'].iloc[product_indices].tolist()
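
The dense similarity matrix for 30,239 products has about 914 million entries, roughly 7 GB if stored as 64-bit floats. As an alternative to the approach above (not the method used in this notebook), the similarities for a single product can be computed on demand from the sparse TF-IDF matrix. The sketch below assumes L2-normalized rows; top_k_similar and the toy matrix are hypothetical names and data introduced for illustration.

```python
import numpy as np
from scipy import sparse

def top_k_similar(matrix, row, k=10):
    """Return the indices of the k rows most similar to `row`, best first.

    Assumes the rows of `matrix` are L2-normalized, so a dot product is a cosine.
    """
    # One sparse matrix-vector product instead of the full pairwise matrix.
    scores = (matrix @ matrix[row].T).toarray().ravel()
    scores[row] = -np.inf  # exclude the query row explicitly
    k = min(k, scores.size - 1)
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition selects the k largest scores without sorting everything;
    # only those k candidates are then sorted.
    candidates = np.argpartition(-scores, k - 1)[:k]
    return candidates[np.argsort(-scores[candidates])]

# Toy matrix with unit-length rows, invented for illustration.
toy = sparse.csr_matrix(np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [0.6, 0.8],
]))
print(top_k_similar(toy, row=0, k=2))
```

Excluding the query row by index, rather than dropping the first sorted entry, also avoids discarding a different product when the query ties with another row or has an all-zero vector.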

In [16]:
product_df[product_df['title'].str.contains('refrigerator', case=False)].head(1)

Unnamed: 0,category,description,title,brand,feature,main_cat,date,price,asin,imageURLHighRes,dateYear,dateMonth
131,appliances refrigerators freezers ice makers,compact mini cooler warmer hold 17 liter twent...,coldmate mr 128 mini cooler warmer deluxe mini...,coldmate,press the cold button to cool to 40 f and hot ...,appliances,2001-10-02,,B0001YH10C,['https://images-na.ssl-images-amazon.com/imag...,2001.0,10.0


In [17]:
# get input id and title for the recommendation
input_id = product_df[product_df['title'].str.contains('refrigerator', case=False)].iloc[0]['asin']
input_title = product_df[product_df['title'].str.contains('refrigerator', case=False)].iloc[0]['title']

In [18]:
#Get recommendations for Coldmate MR-128 Mini Cooler/Warmer Deluxe Mini Refrigerator, input the product id 
asin, title = description_recommender(input_id)

In [19]:
print('Description Based Recommender Result for {}, {}: '.format(input_id, input_title))
for i in range(0, 10):
  print('{}. {}, {}'.format(i+1, asin[i], title[i]))

Description Based Recommender Result for B0001YH10C, coldmate mr 128 mini cooler warmer deluxe mini refrigerator: 
1. B00ID8CLMG, avanti ff45006w 4 3 cf frost free refrigerator freezer white
2. B00RNAH5OY, gofridge mini fridge portable electric cooler
3. B001H80RN4, frigidaire 241505301 refrigerator door bin genuine original equipment manufacturer oem part
4. B004NEYPYQ, frost free 4 3 cu ft refrigerator freezer white
5. B000JLL3BK, pek vino vault wine preserving refrigerator silver
6. B01F79MKME, amana ama43bk compact single door refrigerator 4 3 cu ft black
7. B001F7H4RY, portable cooler warmer mini fridge wine beer
8. B001AAHW6E, whirlpool 2179404kra beverage rack
9. B001775T4C, nostalgia electrics crf170retrored retro series mini fridge 1 7 cubic feet
10. B004Y3C9J4, 1 7 cuft superconduction refrigerator


In [20]:
# np.save('cosine_sim', cosine_sim)

In [21]:
# original_cs = np.load("cosine_sim.npy")
# original_cs

In [22]:
# get input id and title for the recommendation
# input_id = product_df[product_df['title'].str.contains('refrigerator', case=False)].iloc[1]['asin']
# input_title = product_df[product_df['title'].str.contains('refrigerator', case=False)].iloc[1]['title']

In [23]:
#Get recommendations
# asin, title = description_recommender(input_id, cosine_sim=original_cs)

In [24]:
# print('Description Based Recommender Result for {}, {}: '.format(input_id, input_title))
# for i in range(0, 10):
#   print('{}. {}, {}'.format(i+1, asin[i], title[i]))

### Metadata Based Recommender

In [25]:
product_df = pd.read_csv('/content/cleaned_amazon_product.zip?raw=true', compression='zip')
product_df = product_df.fillna('')

In [26]:
product_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30239 entries, 0 to 30238
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   category         30239 non-null  object
 1   description      30239 non-null  object
 2   title            30239 non-null  object
 3   brand            30239 non-null  object
 4   feature          30239 non-null  object
 5   main_cat         30239 non-null  object
 6   date             30239 non-null  object
 7   price            30239 non-null  object
 8   asin             30239 non-null  object
 9   imageURLHighRes  30239 non-null  object
 10  dateYear         30239 non-null  object
 11  dateMonth        30239 non-null  object
dtypes: object(12)
memory usage: 2.8+ MB


In [27]:
# Use metadata except description and feature
product_df['meta_text'] = (
    product_df['category'] + ' ' + product_df['title'] + ' ' + product_df['brand'] + ' '
    + product_df['main_cat'] + ' ' + product_df['price'].astype(str) + ' '
    + product_df['dateYear'].astype(str) + ' ' + product_df['dateMonth'].astype(str)
)
product_df['meta_text'] = product_df['meta_text'].apply(lemma)
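
Numeric fields such as price, dateYear and dateMonth are joined into meta_text as strings like 3.62 or 2016.0. The sketch below applies the default token pattern documented for TfidfVectorizer directly with the re module to show how such strings are split; the sample text is assembled from values visible in the table above, and the constant name is an illustrative choice.

```python
import re

# Default token_pattern documented for scikit-learn's TfidfVectorizer:
# tokens are runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Brand, price, year and month as they would appear inside meta_text.
meta_text = "tupperware 3.62 2016.0 6.0"
tokens = re.findall(DEFAULT_TOKEN_PATTERN, meta_text.lower())
print(tokens)
```

Here the price 3.62 contributes only the token 62, the year survives as 2016, and the month 6.0 disappears entirely because neither 6 nor 0 is long enough. Whether any of these tokens reaches the final vocabulary still depends on the min_df and max_df thresholds.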

In [28]:
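# Define a TF-IDF Vectorizer object that removes English stop words and ignores terms
# appearing in more than 90% of the documents (max_df) or in fewer than 5 documents (min_df)
tfidf2 = TfidfVectorizer(stop_words='english', max_df = 0.9, min_df = 5)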

In [29]:
tfidf_matrix2 = tfidf2.fit_transform(product_df['meta_text'])

In [30]:
tfidf_matrix2.shape

(30239, 3342)

In [31]:
cosine_sim2 = linear_kernel(tfidf_matrix2, tfidf_matrix2)

In [32]:
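# Reverse mapping from lower-cased product id (asin) to row index; keep the first row for any repeated asin
indices = pd.Series(product_df.index, index=product_df['asin'].str.lower())
indices = indices[~indices.index.duplicated(keep='first')]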

In [33]:
# Function that takes a product id (asin) as input and returns the 10 most similar products
# (the name is reused from the previous section; the default similarity matrix is now the metadata-based cosine_sim2)
def description_recommender(id, cosine_sim = cosine_sim2, df = product_df, indices = indices):
  # get the row index of the product that matches the id
  idx = indices[id.lower()]

  # get the pairwise similarity scores between this product and every product,
  # then convert them into a list of (row index, score) tuples
  sim_scores = list(enumerate(cosine_sim[idx]))

  # sort the products by cosine similarity score, highest first
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

  # keep the scores of the 10 most similar products; skip the first entry, which is expected to be the input product itself
  sim_scores = sim_scores[1:11]

  # get the product indices
  product_indices = [i[0] for i in sim_scores]

  # return the asins and titles of the 10 most similar products
  return df['asin'].iloc[product_indices].tolist(), df['title'].iloc[product_indices].tolist()

In [34]:
product_df[product_df['title'].str.contains('refrigerator', case=False)].head(1)

Unnamed: 0,category,description,title,brand,feature,main_cat,date,price,asin,imageURLHighRes,dateYear,dateMonth,meta_text
131,appliances refrigerators freezers ice makers,this compact mini cooler and warmer holds 17 l...,coldmate mr 128 mini cooler warmer deluxe mini...,coldmate,press the cold button to cool to 40 f and hot ...,appliances,2001-10-02,,B0001YH10C,['https://images-na.ssl-images-amazon.com/imag...,2001.0,10.0,appliance refrigerator freezer ice maker coldm...


In [35]:
# get input id and title for the recommendation
input_id = product_df[product_df['title'].str.contains('refrigerator', case=False)].iloc[0]['asin']
input_title = product_df[product_df['title'].str.contains('refrigerator', case=False)].iloc[0]['title']

In [36]:
#Get recommendations
asin2, title2 = description_recommender(input_id, cosine_sim=cosine_sim2)

In [37]:
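# Note: the label is carried over from the previous section; these results come from the metadata-based similarity matrix (cosine_sim2)
print('Description Based Recommender Result for {}, {}: '.format(input_id, input_title))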
for i in range(0, 10):
  print('{}. {}, {}'.format(i+1, asin2[i], title2[i]))

Description Based Recommender Result for B0001YH10C, coldmate mr 128 mini cooler warmer deluxe mini refrigerator: 
1. B001F7H4RY, portable cooler warmer mini fridge wine beer
2. B00YNNEC8Q, mini wine cooler
3. B00RNAH5OY, gofridge mini fridge portable electric cooler
4. B00YNMUYV6, mini wine cooler refrigerator with lock
5. B00ND5CWAA, phoenix usb 5v portable one zip top can cooler mini car compact refrigerator and warmer
6. B016K4J3U2, honeykoko mini usb pc refrigerator fridge beverage drink can cooler warmer heater gadget one can in home office
7. B016KQ7X8E, threeh new mini red usb fridge cooler beverage drink cans cooler warmer refrigerator for laptop pc computer red h uf05red
8. B0187KYRQC, coca cola mini can cooler
9. B00KE7FM3O, mini usb desktop fridge cooler refrigerator
10. B007M4X2ZW, mini pizza maker
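
One simple way to compare the two recommenders for this product is to measure how many ASINs their top-10 lists share. The sketch below copies the two lists from the outputs above; the variable names and the choice of Jaccard overlap are illustrative assumptions, not part of the original analysis.

```python
# Top-10 ASINs printed by the description-based recommender.
description_based = [
    "B00ID8CLMG", "B00RNAH5OY", "B001H80RN4", "B004NEYPYQ", "B000JLL3BK",
    "B01F79MKME", "B001F7H4RY", "B001AAHW6E", "B001775T4C", "B004Y3C9J4",
]
# Top-10 ASINs printed by the metadata-based recommender.
metadata_based = [
    "B001F7H4RY", "B00YNNEC8Q", "B00RNAH5OY", "B00YNMUYV6", "B00ND5CWAA",
    "B016K4J3U2", "B016KQ7X8E", "B0187KYRQC", "B00KE7FM3O", "B007M4X2ZW",
]

metadata_set = set(metadata_based)
# ASINs recommended by both, in description-based rank order.
shared = [asin for asin in description_based if asin in metadata_set]

# Jaccard overlap: shared items divided by distinct items across both lists.
jaccard = len(shared) / len(set(description_based) | metadata_set)
print(shared, round(jaccard, 3))
```

For this input the two lists share two products, so the text fields each recommender uses lead to largely different neighbours.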
