# 4 Recommendation System

- Author: Jason Truong
- Last Modified: September 19, 2022
- Email: Jasontruong19@gmail.com

# Table of Contents

1. **[Objective and Roadmap](#1Objective)**  
2. **[Preliminary Data Setup](#2Preliminary)**   
3. **[Content Based Recommendation](#3Content_Recommend)**   
    3.1 [Single Product Comparison](#3.1_single_product)  
    3.2 [Save Vectorizer Model](#3.2_save_vectorizer)  
    3.3 [Recommendation Function](#3.3_recommend_function)  
    3.4 [Test Recommendation System](#3.4_test_recommendation)   
    3.5 [Evaluate Recommendation System](#3.5_evaluate_recommendation)  
4. **[Conclusion and Future Works](#4Conclusion)**  

# 1. Objective<a class ='anchor' id='1Objective'></a>

The goal is to combine the predicted sentiment of a review with the product descriptions to recommend products to users.

# 2. Preliminary Data Setup<a class ='anchor' id='2Preliminary'></a>

In [2]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import random


from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

Load in the dataset

In [3]:
meta_df = pd.read_csv('clean_meta.csv')

Check the description column for null values.

In [4]:
meta_df['description_0'].isna().sum()

1748

In [6]:
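# Replace null descriptions with empty strings so that every row can be vectorized.
# Assigning the result back avoids relying on an in-place fill of a selected column.
meta_df['description_0'] = meta_df['description_0'].fillna('')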

# Check results
meta_df['description_0'].isna().sum()

0

# 3. Content Based Recommendation<a class ='anchor' id='3Content_Recommend'></a>

The first step is to use the descriptions of the Amazon items (in this case, movies and TV shows) to recommend similar products.

In [8]:
# Take the description, title and product id from the metadata
rec_df = meta_df[['title','description_0','product_id']].copy()

Start with a subset of the data, since comparing every description with every other description is computationally expensive.

In [9]:
# A subset of 50000 rows will be used
rec_df_subset = rec_df.iloc[0:50000,:]

# Check results
rec_df_subset.head()

Unnamed: 0,title,description_0,product_id
0,My Fair Pastry (Good Eats Vol. 9),Disc 1: Flour Power (Scones; Shortcakes; South...,0000143529
1,"Barefoot Contessa (with Ina Garten), Entertain...",Barefoot Contessa Volume 2: On these three dis...,0000143588
2,Rise and Swine (Good Eats Vol. 7),Rise and Swine (Good Eats Vol. 7) includes bon...,0000143502
3,The Power of the Cross Joseph Prince,Have failures in your life caused you to feel ...,000073991X
4,Live in Houston [VHS],Track Listings 1. Come On Everybody 2. My Stre...,000107461X


The next step is to turn the descriptions into numeric features with the TF-IDF vectorizer. TF-IDF combines how often a word appears in a given description (term frequency) with how rare the word is across all descriptions (inverse document frequency). The more often a word appears in a description, the higher its value; the more descriptions the word appears in, the lower its value. This ensures that common words like "movie", which may appear in nearly every description, receive a low weight.
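The weighting described above can be illustrated on a toy example. The three descriptions below are made up for illustration and are not taken from the dataset; `toy_vectorizer` uses the default `TfidfVectorizer` settings rather than the `stop_words` and `min_df` settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "movie" appears in every description, "pastry" in only one
docs = [
    "a cooking movie about pastry",
    "a boxing movie",
    "a surreal movie about a hotel",
]

# Fit on the toy corpus and transform it in one step
toy_vectorizer = TfidfVectorizer()
weights = toy_vectorizer.fit_transform(docs)

# Look up the column of each word, then read its weight in the first description
vocab = toy_vectorizer.vocabulary_
first_doc = weights[0].toarray().ravel()
movie_weight = first_doc[vocab["movie"]]
pastry_weight = first_doc[vocab["pastry"]]

print(f"movie: {movie_weight:.3f}, pastry: {pastry_weight:.3f}")
```

Both words appear once in the first description, so the difference in weight comes only from how many descriptions contain each word.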

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the vectorizer
vectorizer = TfidfVectorizer(stop_words = 'english', min_df = 10)

# Fit
vectorizer.fit(rec_df_subset['description_0'])

# Transform the description
TF_matrix = vectorizer.transform(rec_df_subset['description_0'])

Now that the text is turned into numeric features, the cosine similarity can be determined.
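Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the direction of the vectors and not on their magnitude. The sketch below checks a hand-written version against scikit-learn on made-up vectors; `cosine_by_hand` is a hypothetical helper used only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up feature vectors
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])  # same direction as a, twice the length
c = np.array([[0.0, 0.0, 3.0]])  # shares no non-zero feature with a

def cosine_by_hand(u, v):
    """Dot product divided by the product of the vector lengths."""
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine_by_hand(a, b)            # vectors pointing the same way
sim_ac = cosine_by_hand(a, c)            # vectors with nothing in common
sk_ab = cosine_similarity(a, b)[0, 0]    # scikit-learn result for the same pair

print(sim_ab, sim_ac, sk_ab)
```

For TF-IDF features, which are never negative, the similarity therefore ranges from 0 (no shared words) to 1 (same word proportions).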

## 3.1 Comparison between all products

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

# Get the index for the movie of interest
movie_index = rec_df_subset[rec_df_subset['title'] =='My Fair Pastry (Good Eats Vol. 9)'].index

# Determine the similarity between every pair of movies in the subset
mov_similarities = cosine_similarity(TF_matrix,dense_output = False)


Put results in a dataframe table for better visualization

In [35]:
sim_df = pd.DataFrame({'item':rec_df_subset['title'], 
                       'similarities': np.array(mov_similarities[movie_index,:].todense()).squeeze()})

List of the top recommended products based off of similarities

In [36]:
# Sort the values by descending to find the most similar
top_recommend = sim_df.sort_values(by = 'similarities', ascending = False)[0:10]

# Check results
top_recommend

Unnamed: 0,item,similarities
300,Le Million,1.0
17810,Lotto Land,0.298626
10928,Just Your Luck VHS,0.198216
25677,Go For Broke,0.196722
25926,Lucky Day,0.188302
17483,Winning at Varna VHS,0.188189
37341,"Anthology Of Surreal Cinema, Volume 1",0.183959
48928,"'Allo 'Allo! - The Complete Series Five, Parts...",0.178894
2832,Goober & the Ghost Chasers - The Chase Is ...,0.177632
46843,Hotel Du Nord,0.175046


In [50]:
top_recommend.iloc[1].name

17810

Find the products in the dataset that have at least one strongly similar item.

In [90]:
# Store the index of the best recommended items
best_rec = []

for i in range(0,50000):
    
    # Put the similarities for product i in a dataframe so they can be sorted
    sim_df = pd.DataFrame({'item':rec_df_subset['title'], 
                       'similarities': np.array(mov_similarities[i,:].todense()).squeeze()})
    
    # Sort the values by descending to find the most similar
    top_recommend = sim_df.sort_values(by = 'similarities', ascending = False)[0:10]
    
    # Keep products whose closest match is more than 50% similar but not identical
    if top_recommend.iloc[1]['similarities'] > 0.5 and top_recommend.iloc[1]['similarities'] < 1:
        best_rec.append(top_recommend.iloc[0].name)
        

In [91]:
print(f'There are {len(best_rec)} products that have an item that is more than 50% similar to them.')

There are 10558 products that have an item that is more than 50% similar to them.
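The loop above builds and sorts a dataframe for every product. A possible alternative, sketched here on a made-up 4x2 feature matrix, is to remove each product's similarity with itself and take the row-wise maximum of the sparse similarity matrix. The function name `products_with_close_match` and the `toy` matrix are hypothetical, and the default thresholds are assumed to mirror the condition in the loop.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def products_with_close_match(features, low=0.5, high=1.0):
    """Return row indices whose best match with another row lies in (low, high)."""
    sims = cosine_similarity(features, dense_output=False)
    # Subtract the diagonal so that a product is never matched with itself
    sims = sims - sparse.diags(sims.diagonal())
    # Highest similarity with any other product, one value per row
    best_other = sims.tocsr().max(axis=1).toarray().ravel()
    return np.flatnonzero((best_other > low) & (best_other < high))

# Made-up features: rows 0 and 1 are identical, row 3 overlaps with all rows
toy = sparse.csr_matrix(np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]))

close = products_with_close_match(toy)
print(close)
```

Rows 0 and 1 are excluded because their closest match is an exact duplicate, which corresponds to the `< 1` condition in the loop.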


In [117]:
sim_df = pd.DataFrame({'item':rec_df_subset['title'], 
                       'similarities': np.array(mov_similarities[best_rec[0],:].todense()).squeeze()})

# Sort the values by descending to find the most similar
top_recommend = sim_df.sort_values(by = 'similarities', ascending = False)[0:10]

# Check results
top_recommend

Unnamed: 0,item,similarities
9,Praise Aerobics VHS,1.0
8630,Richard Simmons - Disco Sweat,0.727897
47294,"Low Impact, High Intensity with Charlene Prickett",0.677509
8723,Low Impact Aerobics VHS,0.660788
9368,Weight Watchers: Low Impact Aerobics VHS,0.483444
44038,Rev Up - The Sequel,0.481968
6555,Aerobicise 2000 - A Workout For The Next Gener...,0.457063
37899,Cathe Friedrich's Low Max DVD,0.436477
8638,Mary Tyler Moore: Everywoman's Workout Aerobic...,0.429791
38748,Superbody: Aerobics Plus ! Created By Deborah ...,0.414496


The first and second product's description can be viewed to determine how they are similar to one another.

## 3.2 Evaluate recommendation system<a class ='anchor' id='3.5_evaluate_recommendation'></a>

**Check the actual descriptions to see if products are similar.**

Get the indices for the first 2 items.

In [118]:
index_1 = top_recommend.iloc[0].name
index_2 = top_recommend.iloc[1].name

In [119]:
rec_df_subset['description_0'][index_1]

'Praise Aerobics - A low-intensity/high-intesity low impact aerobic workout.'

In [120]:
rec_df_subset['description_0'][index_2]

'low-impact aerobic workout'

The descriptions above both contain the words 'low impact aerobic workout', which explains the 72.8% similarity between the two.

A second recommendation can be evaluated in the same way.

In [125]:
sim_df = pd.DataFrame({'item':rec_df_subset['title'], 
                       'similarities': np.array(mov_similarities[best_rec[1],:].todense()).squeeze()})

# Sort the values by descending to find the most similar
top_recommend = sim_df.sort_values(by = 'similarities', ascending = False)[0:10]

# Check results
top_recommend

Unnamed: 0,item,similarities
14,50 Years of Thorns and Roses VHS,1.0
5442,"Richard Strauss - Salome / Petr Weigl, Giusepp...",0.976313
10097,Faster Pussycat Kill Kill VHS,0.967741
9955,Robinson Crusoe and the Tiger VHS,0.955238
5447,Tchaikovsky: Sleeping Beauty VHS,0.955238
4045,Stevie VHS,0.928834
910,"How To Behave So Your Children Will, Too! VHS",0.905555
7879,Simply Moguls VHS,0.789825
459,Belle's Magical World VHS,0.758028
18233,Air Power: SR-71 Blackbird: The Secret Vigil VHS,0.757445



The approach above takes a lot of computation and memory because the cosine similarity between every pair of products is computed. A more efficient way is to compare only the chosen product with every other product.
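The difference between the two approaches can be sketched on made-up data. The random matrix `toy_features` and the index `chosen` are assumptions for illustration only; the point is that comparing against a single row reproduces one column of the full matrix while producing far fewer values.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sparse features: 200 products, 50 terms, roughly 10% non-zero
rng = np.random.default_rng(0)
dense = rng.random((200, 50))
dense[dense < 0.9] = 0.0
toy_features = sparse.csr_matrix(dense)

chosen = 3  # index of the product of interest

# All pairs: one value for every pair of products (200 x 200)
all_pairs = cosine_similarity(toy_features, dense_output=False)

# Single product: one value per product (200 x 1)
one_column = cosine_similarity(toy_features, toy_features[chosen], dense_output=False)

# The single-product result matches the corresponding column of the full matrix
same = np.allclose(all_pairs[:, chosen].toarray(), one_column.toarray())
print(all_pairs.shape, one_column.shape, same)
```

The all-pairs result grows with the square of the number of products, whereas the single-product result grows linearly.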

## 3.1 One Product Comparison<a class ='anchor' id='3.1_single_product'></a>

Find the cosine similarity for the requested product only, instead of generating the cosine similarity between all products.

In [23]:
# Define the vectorizer
vectorizer2 = TfidfVectorizer(stop_words = 'english', min_df = 10)

# Fit
vectorizer2.fit(rec_df['description_0'])


TfidfVectorizer(min_df=10, stop_words='english')

In [24]:
# Transform the description
TF_matrix = vectorizer2.transform(rec_df['description_0'])

In [25]:
# Determine the index of the movie
movie_index = rec_df_subset[rec_df_subset['title'] =='Rise and Swine (Good Eats Vol. 7)'].index

In [26]:
TF_matrix_product = vectorizer2.transform(rec_df_subset['description_0'][movie_index])

In [27]:
movie_index

Int64Index([2], dtype='int64')

In [28]:
TF_matrix.shape

(156481, 39048)

In [29]:
TF_matrix[movie_index]

<1x39048 sparse matrix of type '<class 'numpy.float64'>'
	with 54 stored elements in Compressed Sparse Row format>

In [30]:
# Check the shape of the transformed description
TF_matrix_product.shape

(1, 39048)

In [31]:
# Find the similarity between the chosen product with every other product in the dataset
mov_similarities = cosine_similarity(TF_matrix,TF_matrix_product, dense_output = False)

# Check output results
mov_similarities

<156481x1 sparse matrix of type '<class 'numpy.float64'>'
	with 55953 stored elements in Compressed Sparse Row format>

In [32]:
# Convert the movie similarities output into a dataframe
single_df = pd.DataFrame({'item':rec_df['title'], 
                       'similarities': np.array(mov_similarities.todense()).squeeze()})

In [33]:
# Check the top 10 most similar products
single_df.sort_values(by = 'similarities', ascending = False).head(10)

Unnamed: 0,item,similarities
2,Rise and Swine (Good Eats Vol. 7),1.0
76483,Good Eats with Alton Brown Vol. 12,0.301367
94294,American Eats: Holiday Foods,0.256991
61427,Food Network: Good Eats with Alton Brown - Bre...,0.247336
92830,Good Eats with Alton Brown - Holiday Treats (R...,0.244306
99539,Food Network Takeout Collection DVD - Good Eat...,0.243199
94305,American Eats: Hot Dogs,0.236412
76760,Good Eats: Cupboard Cuisine - Volume 11,0.224773
94298,American Eats,0.213358
95611,Kitchen Wisdom from Good Eats (Good Eats Vol. 20),0.202247


## 3.2 Save the vectorizer model<a class ='anchor' id='3.2_save_vectorizer'></a>

The recommender vectorizer model will be saved for future use.

In [34]:
# Save the fitted vectorizer; the with-block closes the file after writing
with open('recommender_vectorizer.pickle', 'wb') as f:
    pickle.dump(vectorizer2, f)

The model will be loaded in as a test to ensure that the model saved properly.

In [35]:
# Load the saved vectorizer; the with-block closes the file after reading
with open('recommender_vectorizer.pickle', 'rb') as f:
    file_vectorizer = pickle.load(f)
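Loading the file without an error does not by itself show that the restored model behaves like the original. One way to check this, sketched below on a made-up corpus and a temporary file (both assumptions, not part of this notebook), is to compare the vocabulary and the transformed output of the two models.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up descriptions used only for this round-trip check
docs = [
    "low impact aerobic workout",
    "high intensity aerobic workout",
    "cooking show about pastry",
]
original = TfidfVectorizer().fit(docs)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vectorizer.pickle")
    with open(path, "wb") as f:
        pickle.dump(original, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should know the same words and produce the same features
same_vocab = original.vocabulary_ == restored.vocabulary_
same_output = np.allclose(
    original.transform(docs).toarray(),
    restored.transform(docs).toarray(),
)
print(same_vocab, same_output)
```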

## 3.3 Top 10 Recommendation Function<a class ='anchor' id='3.3_recommend_function'></a>

Create a function that would take in the item and output the top 10 recommendations.

In [65]:
def rec_system(vectorizer, sentiment, product_list, product_name):
    '''
    Recommend products based on the predicted sentiment and the product name.

    The description of every product in product_list is vectorized and compared
    with the description of product_name using cosine similarity.

    Parameters
    ----------
    vectorizer : sklearn.feature_extraction.text.TfidfVectorizer
        Fitted vectorizer model.
    sentiment : int
        Predicted sentiment of the review text (1 for positive).
    product_list : pandas.core.frame.DataFrame
        Products with 'title' and 'description_0' columns.
    product_name : str
        Title of the product used for recommending other products.

    Returns
    -------
    pandas.core.frame.DataFrame
        Recommended products ('item') and their 'similarities'.
    '''
    
    # Transform the description
    TF_matrix = vectorizer.transform(product_list['description_0'])
    
    # Determine the index of the product of recommendation
    product_index = product_list[product_list['title'] == product_name].index
    
    # Get the TF matrix of the required product
    TF_matrix_product = TF_matrix[product_index]
    
    # Determine the similarity between the product and everything else
    product_similarities = cosine_similarity(TF_matrix,TF_matrix_product, dense_output = False)
    
    # Put the product titles and their similarities in a dataframe
    single_df = pd.DataFrame({'item':product_list['title'], 
                       'similarities': np.array(product_similarities.todense()).squeeze()})
    
    # Sort the products by similarity
    single_df_sorted = single_df.sort_values(by = 'similarities', ascending = False)
    
    
    # When the sentiment is positive, recommend the top 10 most similar products
    if sentiment == 1:
        recommended_products = single_df_sorted.head(10)
    
    # Otherwise, recommend 10 products sampled at random from the top 100 most similar products
    else:
        # Split out the top 100 recommended_products
        top_100 = single_df_sorted.iloc[:100].reset_index(drop = True)
        
        # Randomly sample 10 numbers between 0 and 99
        ran_num = random.sample(range(0, 100), 10)
        
        # Find the index where the ran_num exists in the top 100 recommended products
        top_100_index = top_100.index.isin(ran_num)
        
        # Determine recommended products
        recommended_products = top_100[top_100_index]
        
    return recommended_products.reset_index(drop = True)
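Because the chosen product has a similarity of 1.0 with itself, it sits at the top of the sorted list and can appear among its own recommendations. A possible variation of the selection step, sketched below, drops the chosen product first and uses `DataFrame.sample` for the random draw. The helper `pick_recommendations`, its `seed` parameter and the `toy_sorted` table are hypothetical and are not part of `rec_system`.

```python
import pandas as pd

def pick_recommendations(sorted_df, product_name, sentiment, seed=None):
    """Select 10 recommendations from a dataframe already sorted by similarity."""
    # Drop the chosen product so it is not recommended to itself
    candidates = sorted_df[sorted_df["item"] != product_name]

    # Positive sentiment: the 10 most similar products
    if sentiment == 1:
        return candidates.head(10).reset_index(drop=True)

    # Otherwise: 10 products drawn at random from the 100 most similar
    top_100 = candidates.head(100)
    sampled = top_100.sample(n=min(10, len(top_100)), random_state=seed)
    return sampled.sort_values(by="similarities", ascending=False).reset_index(drop=True)

# Made-up, already sorted similarity table with 150 products
toy_sorted = pd.DataFrame({
    "item": [f"product {i}" for i in range(150)],
    "similarities": [1 - i / 150 for i in range(150)],
})

positive = pick_recommendations(toy_sorted, "product 0", sentiment=1)
negative = pick_recommendations(toy_sorted, "product 0", sentiment=0, seed=42)
print(positive["item"].tolist())
```

Passing a fixed `seed` makes the random draw repeatable, which can help when testing.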

## 3.4 Test out recommendation system<a class ='anchor' id='3.4_test_recommendation'></a>

In [66]:
# Test out recommendation system

rec_system(file_vectorizer, 0 ,rec_df,'Rise and Swine (Good Eats Vol. 7)')

Unnamed: 0,item,similarities
0,American Eats: Holiday Foods,0.256991
1,Love Is the Devil,0.185373
2,"Surf, Turf & A Side (Good Eats Vol. 14)",0.180079
3,Strawberry Shortcake - Get Well Adventure,0.144364
4,Dare To Cook: Barbecue & Grilling,0.139608
5,Floyd Mayweather jr. Boxing DVD Collection,0.131203
6,Untold Secrets of the Civil War/American India...,0.125405
7,Partying 101: (Bio Dome / P.C.U. / Back to Sch...,0.124416
8,"Strawberry Shortcake - Berry, Merry Christmas",0.123068
9,Elvis Presley MGM Movie Legends Collection: (C...,0.121558


A sample test could take a movie review and its rating, feed them into the model, and output the top 10 movies the person may like.

Reviews and movie descriptions can then be used together to determine which movies to recommend, based on whether the person rated the movie highly.

# 4. Conclusion and Future Works<a class ='anchor' id='4Conclusion'></a>