# Amazon Product Recommender

**Jason Truong**  **|**  Jasontruong19@gmail.com  **|**  September 25, 2022

**Problem Statement:** Based on a person’s review of a product, can items with similar properties be recommended so that the person is more inclined to buy them? This project adds business value by improving the customer buying experience on Amazon through personalized product recommendations. These recommendations influence customer purchases by making it easier for customers to find similar products.

***

# 4 Recommendation System

**Note**: This is notebook **5 of 5** for building the recommendation system.

# Table of Contents

1. **[Introduction](#1Introduction)**  
2. **[Preliminary Data Setup](#2Preliminary)**   
3. **[Content Based Recommendation](#3Content_Recommend)**  
    3.1 [Comparison Between All Products](#3.1_comparision)  
    3.2 [Evaluate Recommendation System](#3.2_evaluate_recommendation)  
    3.3 [Single Product Comparison](#3.3_single_product)  
    3.4 [Save Vectorizer Model](#3.4_save_vectorizer)  
    3.5 [Recommendation Function](#3.5_recommend_function)  
    3.6 [Test Recommendation System](#3.6_test_recommendation)    
4. **[Conclusion and Future Works](#4Conclusion)**  

# 1. Introduction<a class ='anchor' id='1Introduction'></a>

In the previous notebook, the sentiment was predicted using machine learning models. The predicted sentiment can be used in conjunction with a content-based filtering method to recommend similar products for the user. This notebook will focus on developing and evaluating a content-based recommendation system.

# 2. Preliminary Data Setup<a class ='anchor' id='2Preliminary'></a>

The necessary base packages will be imported below.

In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import random

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

The product metadata will be loaded in.

In [2]:
# load data
meta_df = pd.read_csv('clean_meta.csv')

Check the description column to ensure that there are no null values.

In [3]:
meta_df['description_0'].isna().sum()

1748

There appear to be null values in the description column even though they were all removed in the preprocessing notebook. These null values will be replaced with an empty string ''.

In [4]:
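# Ensure that the description column has no null values.
# Assign the result back instead of using inplace=True on a column selection,
# which may not update the parent dataframe in newer pandas versions.
meta_df['description_0'] = meta_df['description_0'].fillna('')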

# Check results
meta_df['description_0'].isna().sum()

0

There are no longer null values in the description column.
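
One plausible explanation for the reappearing nulls, which the source does not confirm, is that empty descriptions were written to `clean_meta.csv` as empty fields and then read back as `NaN`, which is the default behaviour of `pd.read_csv`. The sketch below reproduces this with a hypothetical two-row frame and an in-memory CSV; the frame and its values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row frame; the second description is an empty string.
demo = pd.DataFrame({'title': ['A', 'B'], 'description_0': ['some text', '']})

# Write to an in-memory CSV and read it back with the default settings.
csv_text = demo.to_csv(index=False)
default_read = pd.read_csv(io.StringIO(csv_text))
nulls_default = int(default_read['description_0'].isna().sum())

# keep_default_na=False keeps empty fields as empty strings instead of NaN.
strict_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
nulls_strict = int(strict_read['description_0'].isna().sum())

print(nulls_default, nulls_strict)
```

If this is the cause, either filling the nulls after loading (as done above) or passing `keep_default_na=False` when loading would avoid the issue. The latter also disables `NaN` parsing for every other column, so it is a trade-off rather than a strict improvement.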

# 3. Content Based Recommendation<a class ='anchor' id='3Content_Recommend'></a>

The first step is to use the descriptions of the different Amazon items (in this case, movies and TV shows) to recommend similar products.

In [5]:
# Take the description, title and product id from the metadata
rec_df = meta_df[['title','description_0','product_id']].copy()

Start with a subset of the data since NLP can be computationally expensive.

In [6]:
# A subset of 50000 rows will be used
rec_df_subset = rec_df.iloc[0:50000,:]

# Check results
rec_df_subset.head()

|  | title | description_0 | product_id |
|---|---|---|---|
| 0 | My Fair Pastry (Good Eats Vol. 9) | Disc 1: Flour Power (Scones; Shortcakes; South... | 0000143529 |
| 1 | Barefoot Contessa (with Ina Garten), Entertain... | Barefoot Contessa Volume 2: On these three dis... | 0000143588 |
| 2 | Rise and Swine (Good Eats Vol. 7) | Rise and Swine (Good Eats Vol. 7) includes bon... | 0000143502 |
| 3 | The Power of the Cross Joseph Prince | Have failures in your life caused you to feel ... | 000073991X |
| 4 | Live in Houston [VHS] | Track Listings 1. Come On Everybody 2. My Stre... | 000107461X |


The next step is to turn the description into numeric features using a TF-IDF Vectorizer. 

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the vectorizer
vectorizer = TfidfVectorizer(stop_words = 'english', min_df = 10)

# Fit
vectorizer.fit(rec_df_subset['description_0'])

# Transform the description
TF_matrix = vectorizer.transform(rec_df_subset['description_0'])
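
To make the output of the vectorizer more concrete, here is a minimal sketch on a hypothetical three-document corpus. The documents and the names `toy_docs`, `toy_vectorizer`, and `toy_matrix` are illustrative assumptions. `min_df` is left at its default because a threshold of 10 would remove every term from such a small corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for product descriptions.
toy_docs = [
    'low impact aerobic workout',
    'high intensity aerobic workout',
    'baking scones and shortcakes',
]

# fit_transform is equivalent to calling fit followed by transform on the same data.
toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per retained term.
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
print(row_norms.ravel())
```

Stop words such as "and" are dropped from the vocabulary, and each row is scaled to unit length by default.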

Now that the text is turned into numeric features, the cosine similarity can be determined.

## 3.1 Comparison between all products<a class ='anchor' id='3.1_comparision'></a>

Cosine similarity measures the angle between two product vectors: products whose vectors point in a similar direction have similar descriptions and score close to 1, while products with no terms in common score 0.
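
As a small worked example, the cosine similarity of two vectors u and v is their dot product divided by the product of their lengths. The sketch below computes it by hand for made-up vectors and compares the result with scikit-learn's `cosine_similarity`. The vectors and the helper `manual_cosine` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])   # same direction as a, different length
c = np.array([[0.0, 0.0, 3.0]])   # shares no non-zero component with a

def manual_cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = manual_cosine(a, b)
sim_ac = manual_cosine(a, c)
sk_ab = float(cosine_similarity(a, b)[0, 0])

print(sim_ab, sim_ac, sk_ab)
```

Vectors pointing in the same direction score 1 regardless of their length, and vectors with no overlapping components score 0.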

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

# Get the index for the movie of interest
movie_index = rec_df_subset[rec_df_subset['title'] =='My Fair Pastry (Good Eats Vol. 9)'].index

# Determine the pairwise similarity between every pair of movies in the subset (kept sparse)
mov_similarities = cosine_similarity(TF_matrix,dense_output = False)


Put results in a dataframe table for better visualization.

In [9]:
# Store results in a dataframe
sim_df = pd.DataFrame({'item':rec_df_subset['title'], 
                       'similarities': np.array(mov_similarities[movie_index,:].todense()).squeeze()})

List of the top recommended products based on similarities.

In [10]:
# Sort the values by descending to find the most similar
top_recommend = sim_df.sort_values(by = 'similarities', ascending = False)[0:10]

# Check results
top_recommend

|  | item | similarities |
|---|---|---|
| 0 | My Fair Pastry (Good Eats Vol. 9) | 1.0 |
| 21171 | Great Chefs: Chocolate Passion | 0.279596 |
| 30335 | Raquel Welch Collection: (One Million Years B.... | 0.237981 |
| 22417 | American Pie / American Pie 2 | 0.230922 |
| 24160 | Angelina Ballerina - Friends Forever | 0.226967 |
| 28781 | The Ultimate Chick Flick Collection: (The Bang... | 0.225391 |
| 30291 | Penthouse: Sweet Chocolate | 0.216746 |
| 29017 | Sweet Addition - Breakfast Pastries w/ Daniell... | 0.216039 |
| 29728 | The Sidney Poitier Collection: (For Love of Iv... | 0.215795 |
| 37041 | Jacques Pepin&rsquo;s Summertime Celebration | 0.215467 |


In [11]:
top_recommend.iloc[1].name

21171

Next, determine which products in the subset have at least one other product with a similarity above 0.5.

In [12]:
# Store the index of the best recommended items
best_rec = []

for i in range(0,50000):
    
    # Put results in a dataframe so the similarities can be ranked
    sim_df = pd.DataFrame({'item':rec_df_subset['title'], 
                       'similarities': np.array(mov_similarities[i,:].todense()).squeeze()})
    
    # Sort the values by descending to find the most similar
    top_recommend = sim_df.sort_values(by = 'similarities', ascending = False)[0:10]
    
    # Keep products whose closest match has a similarity above 0.5 (and below 1, which skips exact duplicates)
    if top_recommend.iloc[1]['similarities'] > 0.5 and top_recommend.iloc[1]['similarities'] < 1:
        best_rec.append(top_recommend.iloc[0].name)
        

In [13]:
print(f'There are {len(best_rec)} products whose closest match has a similarity above 0.5.')

There are 10558 products whose closest match has a similarity above 0.5.


In [14]:
sim_df = pd.DataFrame({'item':rec_df_subset['title'], 
                       'similarities': np.array(mov_similarities[best_rec[0],:].todense()).squeeze()})

# Sort the values by descending to find the most similar
top_recommend = sim_df.sort_values(by = 'similarities', ascending = False)[0:10]

# Check results
top_recommend

|  | item | similarities |
|---|---|---|
| 9 | Praise Aerobics VHS | 1.0 |
| 8630 | Richard Simmons - Disco Sweat | 0.727897 |
| 47294 | Low Impact, High Intensity with Charlene Prickett | 0.677509 |
| 8723 | Low Impact Aerobics VHS | 0.660788 |
| 9368 | Weight Watchers: Low Impact Aerobics VHS | 0.483444 |
| 44038 | Rev Up - The Sequel | 0.481968 |
| 6555 | Aerobicise 2000 - A Workout For The Next Gener... | 0.457063 |
| 37899 | Cathe Friedrich's Low Max DVD | 0.436477 |
| 8638 | Mary Tyler Moore: Everywoman's Workout Aerobic... | 0.429791 |
| 38748 | Superbody: Aerobics Plus ! Created By Deborah ... | 0.414496 |


The first and second product descriptions can be viewed to determine how they are similar to one another.

## 3.2 Evaluate recommendation system<a class ='anchor' id='3.2_evaluate_recommendation'></a>

**Check the actual descriptions to see if products are similar.**

Get the indices for the first 2 items.

In [15]:
index_1 = top_recommend.iloc[0].name
index_2 = top_recommend.iloc[1].name

In [16]:
rec_df_subset['description_0'][index_1]

'Praise Aerobics - A low-intensity/high-intesity low impact aerobic workout.'

In [17]:
rec_df_subset['description_0'][index_2]

'low-impact aerobic workout'

The descriptions above both contain the words 'low impact aerobic workout' which explains why there is a high similarity between the two. 
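
The overlap can also be checked programmatically. The sketch below tokenizes the two descriptions with the analyzer of an unfitted `TfidfVectorizer` and lists the tokens they share. It does not use the fitted notebook vectorizer, so it ignores the `min_df = 10` filter, and the actual vocabulary used for the similarity score may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

desc_1 = 'Praise Aerobics - A low-intensity/high-intesity low impact aerobic workout.'
desc_2 = 'low-impact aerobic workout'

# build_analyzer returns the callable the vectorizer uses to lowercase,
# tokenize, and remove stop words; it does not require a fitted model.
analyzer = TfidfVectorizer(stop_words='english').build_analyzer()

tokens_1 = set(analyzer(desc_1))
tokens_2 = set(analyzer(desc_2))

# Tokens that appear in both descriptions drive the cosine similarity.
shared = sorted(tokens_1 & tokens_2)
print(shared)
```

Because the default tokenizer splits on hyphens, "low-impact" and "low impact" yield the same tokens, which is why the two descriptions match despite the different punctuation.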

Calculate the proportion of items whose closest match has a similarity above 0.5 (the threshold used in the loop above). The subset contains 50,000 descriptions, so that is the denominator.

In [18]:
round((10558/50000)*100, 3)

21.116

From the above, ~21.1% of the products in the subset have at least one other product with a similarity above 0.5, which suggests that description-based filtering finds a reasonably close match for roughly a fifth of the products. For future work, other filtering methods such as item-item or user-to-item can be tested to determine whether they perform better.

The approach above takes a lot of computation power and memory because the cosine similarity between every pair of products is computed. A more efficient way is to compare only the chosen product with every other product.
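
If the all-pairs statistic is still needed, one way to limit memory is to process the rows in chunks so that only a slice of the similarity matrix exists at any time. This is a sketch of an alternative, not the notebook's method. The function `best_match_scores`, its `chunk_size` parameter, and the toy corpus are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_match_scores(matrix, chunk_size=1000):
    """Return, for each row, the highest similarity to any other row."""
    n_rows = matrix.shape[0]
    best = np.zeros(n_rows)
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        # Dense block of shape (stop - start, n_rows); only this block is held in memory.
        block = cosine_similarity(matrix[start:stop], matrix)
        # Mask each row's similarity with itself before taking the maximum.
        block[np.arange(stop - start), np.arange(start, stop)] = -1.0
        best[start:stop] = block.max(axis=1)
    return best

# Hypothetical mini-corpus standing in for product descriptions.
toy_docs = [
    'low impact aerobic workout',
    'low impact aerobic workout video',
    'baking scones and shortcakes',
    'chocolate pastry baking',
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)
scores = best_match_scores(toy_matrix, chunk_size=2)
print(scores)
```

A threshold such as `scores > 0.5` can then be applied to the resulting array. Unlike the loop in the notebook, this sketch does not exclude exact duplicates, so the counts would not necessarily match.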

## 3.3 Single Product Comparison<a class ='anchor' id='3.3_single_product'></a>

Find the cosine similarity for the requested product instead of generating the cosine similarity for all the products.

In [19]:
# Define the vectorizer
vectorizer2 = TfidfVectorizer(stop_words = 'english', min_df = 10)

# Fit
vectorizer2.fit(rec_df['description_0'])


TfidfVectorizer(min_df=10, stop_words='english')

Now that the vectorizer has been fitted, it can be used to transform the descriptions.

In [20]:
# Transform the description
TF_matrix = vectorizer2.transform(rec_df['description_0'])

# Check the shape
TF_matrix.shape

(156481, 39048)

The new matrix contains 156481 products and 39048 features for each product.

A sample movie will be used to determine the similarity below.

In [21]:
# Determine the index of the movie
movie_index = rec_df_subset[rec_df_subset['title'] =='Rise and Swine (Good Eats Vol. 7)'].index

In [22]:
TF_matrix_product = vectorizer2.transform(rec_df_subset['description_0'][movie_index])

In [23]:
# Check the shape of the transformed description
TF_matrix_product.shape

(1, 39048)

The description of the selected product has been transformed into a single row vector with the same 39048 features. This vector can now be compared with every other product in the dataset.

In [24]:
# Find the similarity between the chosen product with every other product in the dataset
mov_similarities = cosine_similarity(TF_matrix,TF_matrix_product, dense_output = False)

# Check output results
mov_similarities

<156481x1 sparse matrix of type '<class 'numpy.float64'>'
	with 55953 stored elements in Compressed Sparse Row format>

The shape of the matrix makes sense because there is a similarity score between the requested product and every other product in the dataset.

In [25]:
# Convert the movie similarities output into a dataframe
single_df = pd.DataFrame({'item':rec_df['title'], 
                       'similarities': np.array(mov_similarities.todense()).squeeze()})

In [26]:
# Check the top 10 most similar products
single_df.sort_values(by = 'similarities', ascending = False).head(10)

|  | item | similarities |
|---|---|---|
| 2 | Rise and Swine (Good Eats Vol. 7) | 1.0 |
| 76483 | Good Eats with Alton Brown Vol. 12 | 0.301367 |
| 94294 | American Eats: Holiday Foods | 0.256991 |
| 61427 | Food Network: Good Eats with Alton Brown - Bre... | 0.247336 |
| 92830 | Good Eats with Alton Brown - Holiday Treats (R... | 0.244306 |
| 99539 | Food Network Takeout Collection DVD - Good Eat... | 0.243199 |
| 94305 | American Eats: Hot Dogs | 0.236412 |
| 76760 | Good Eats: Cupboard Cuisine - Volume 11 | 0.224773 |
| 94298 | American Eats | 0.213358 |
| 95611 | Kitchen Wisdom from Good Eats (Good Eats Vol. 20) | 0.202247 |


The most similar products are shown in the table above with the most similar product to Rise and Swine being itself.

## 3.4 Save the vectorizer model<a class ='anchor' id='3.4_save_vectorizer'></a>

The fitted recommender vectorizer will be saved so that it can be reused later.

In [27]:
# Use a context manager so the file is closed after writing
with open('Saved_Models/4_recommender_vectorizer.pickle', 'wb') as f:
    pickle.dump(vectorizer2, f)

The model will be loaded in as a test to ensure that the model saved properly.

In [28]:
# Load in model
with open('Saved_Models/4_recommender_vectorizer.pickle', 'rb') as f:
    file_vectorizer = pickle.load(f)
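
Loading without an error does not by itself show that the restored model behaves like the original. A minimal sketch of a stricter round-trip check is shown below on a hypothetical toy vectorizer; in the notebook, the same comparison could be made between `vectorizer2` and `file_vectorizer`. The toy corpus and the variable names are assumptions.

```python
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy vectorizer standing in for the fitted recommender vectorizer.
toy_docs = ['low impact aerobic workout', 'baking scones and shortcakes']
original = TfidfVectorizer(stop_words='english').fit(toy_docs)

# Serialize to bytes and back; pickle.dump and pickle.load on a file work the same way.
restored = pickle.loads(pickle.dumps(original))

# The learned vocabulary and the transformed output should both be identical.
same_vocabulary = restored.vocabulary_ == original.vocabulary_
same_output = np.allclose(
    restored.transform(toy_docs).toarray(),
    original.transform(toy_docs).toarray(),
)
print(same_vocabulary, same_output)
```

Pickled scikit-learn models are generally only safe to load with the same library version that created them, so recording the version alongside the file is a reasonable precaution.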

## 3.5 Top 10 Recommendation Function<a class ='anchor' id='3.5_recommend_function'></a>

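Create a function that takes in the vectorizer, the sentiment of the review, the product list, and the product name, and returns 10 recommendations. For a positive sentiment these are the 10 most similar products; otherwise 10 products are sampled at random from the 100 most similar.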

In [29]:
def rec_system(vectorizer, sentiment, product_list, product_name):
    '''
    
    The purpose of this function is to recommend products based on the predicted
    sentiment and product name.
    
    The inputs to this function include the vectorizer model, the sentiment of 
    the review text, the list of products, product name to be used for 
    recommending other products.
    
    
    Parameters
    ----------
    vectorizer = sklearn.feature_extraction.text.TfidfVectorizer,
                 Vectorizer model that was trained on the product descriptions
    sentiment = int, 
                A positive or negative sentiment prediction
    product_list = pandas.core.frame.DataFrame,
                   A Dataframe of the products with the descriptions
    product_name = str,
                   The name of the movie for comparison.
    
    
    Returns
    --------
    Recommended Product list
    
    '''
    
    # Transform the description
    TF_matrix = vectorizer.transform(product_list['description_0'])
    
    # Determine the index of the product of recommendation
    product_index = product_list[product_list['title'] == product_name].index
    
    # Get the TF matrix of the required product
    TF_matrix_product = TF_matrix[product_index]
    
    # Determine the similarity between the product and everything else
    product_similarities = cosine_similarity(TF_matrix,TF_matrix_product, dense_output = False)
    
    # Store the titles and their similarity scores in a dataframe
    single_df = pd.DataFrame({'item':product_list['title'], 
                       'similarities': np.array(product_similarities.todense()).squeeze()})
    
    # Sort the products by similarity
    single_df_sorted = single_df.sort_values(by = 'similarities', ascending = False)
    
    
    # When the sentiment is positive, recommend the top 10 most similar products
    if sentiment == 1:
        recommended_products = single_df_sorted.head(10)
    
    # Recommend 10 products randomly from the top 100 most similar products
    else:
        # Split out the top 100 recommended_products
        top_100 = single_df_sorted.iloc[:100].reset_index(drop = True)
        
        # Randomly sample 10 numbers between 0 and 99
        ran_num = random.sample(range(0, 100), 10)
        
        # Find the index where the ran_num exists in the top 100 recommended products
        top_100_index = top_100.index.isin(ran_num)
        
        # Determine recommended products
        recommended_products = top_100[top_100_index]
        
    return recommended_products.reset_index(drop = True)
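
The random branch of the function can also be expressed with `DataFrame.sample`, which avoids building the index mask by hand. The sketch below is an alternative under stated assumptions: the helper `sample_from_top`, its parameters, and the stand-in dataframe are hypothetical, and the optional `seed` is added only to make the example reproducible.

```python
import pandas as pd

# Hypothetical sorted similarity table standing in for single_df_sorted.
single_df_sorted = pd.DataFrame({
    'item': [f'product_{i}' for i in range(150)],
    'similarities': [1 - i / 150 for i in range(150)],
})

def sample_from_top(sorted_df, pool_size=100, n=10, seed=None):
    """Sample n rows from the pool_size most similar rows, keeping similarity order."""
    pool = sorted_df.head(pool_size)
    picked = pool.sample(n=n, random_state=seed)
    # Re-sort so the output is ordered by similarity, as in the original function.
    return picked.sort_values(by='similarities', ascending=False).reset_index(drop=True)

picks = sample_from_top(single_df_sorted, seed=0)
print(picks)
```

As in the original function, the pool still contains the queried product itself, so it may appear among the sampled rows unless it is dropped first.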

## 3.6 Test out recommendation system<a class ='anchor' id='3.6_test_recommendation'></a>

The recommendation system function will be tested below.

In [30]:
# Test out recommendation system
rec_system(file_vectorizer, 0 ,rec_df,'Rise and Swine (Good Eats Vol. 7)')

|  | item | similarities |
|---|---|---|
| 0 | American Eats: Holiday Foods | 0.256991 |
| 1 | Good Eats with Alton Brown - Holiday Treats (R... | 0.244306 |
| 2 | Food Network Takeout Collection DVD - Good Eat... | 0.243199 |
| 3 | Food Network Takeout Collection DVD - Good Eat... | 0.156876 |
| 4 | Christmas Mix | 0.135535 |
| 5 | Lost Season 1 Disc 6 Replacement Disc! | 0.13397 |
| 6 | Floyd Mayweather jr. Boxing DVD Collection | 0.131203 |
| 7 | The Carmen Miranda Collection: (The Gang's All... | 0.13089 |
| 8 | Cutie Honey: The Complete TV Series | 0.125494 |
| 9 | Marx, Groucho: Groucho Marx Collection | 0.117973 |


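The function works as expected. Because the sentiment passed in is 0 (negative), it returns 10 products sampled at random from the 100 most similar ones. In this run, the closest matches to `Rise and Swine` among the sampled products are `American Eats: Holiday Foods` and `Good Eats with Alton Brown - Holiday Treats`.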

# 4. Conclusion and Future Works<a class ='anchor' id='4Conclusion'></a>