Note: this notebook runs into an out-of-memory issue.

# 4 Recommendation System

- Author: Jason Truong
- Last Modified: August 21, 2022
- Email: Jasontruong19@gmail.com

# Table of Contents

1. [Objective and Roadmap](#1Objective)  
2. [Preliminary Data Setup](#2Preliminary)   
3. [Content Based Recommendation](#3Content_Recommend)   
    3.1 [Single Product Comparison](#3.1_single_product)  
    3.2 [Save Vectorizer Model](#3.2_save_vectorizer)  
    3.3 [Recommendation Function](#3.3_recommend_function)  
    3.4 [Test Recommendation System](#3.4_test_recommendation)   
    3.5 [Evaluate Recommendation System](#3.5_evaluate_recommendation)  
4. [Conclusion and Future Works](#4Conclusion)  

# 1. Objective<a class ='anchor' id='1Objective'></a>

The objective is to use the predicted sentiment from the review text, together with the product descriptions, to recommend products to users.

# 2. Preliminary Data Setup<a class ='anchor' id='2Preliminary'></a>

In [3]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import random


from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

Load in the dataset

In [8]:
meta_df = pd.read_csv('clean_meta.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'clean_meta.csv'

In [1]:
# Ensure that the description column has no null values.
# Assign the result back instead of using inplace on a column selection.
meta_df['description_0'] = meta_df['description_0'].fillna('')

# Check results
meta_df

NameError: name 'meta_df' is not defined

# 3. Content Based Recommendation<a class ='anchor' id='3Content_Recommend'></a>

The first step is to use the descriptions of the Amazon items (in this case, movies and TV shows) to recommend similar products.

In [10]:
# Take the description, title and product id from the metadata
rec_df = meta_df[['title','description_0','product_id']].copy()

In [12]:
# Start off with a subset of the data
rec_df_subset = rec_df.iloc[0:50000,:]

# Check results
rec_df_subset

Unnamed: 0,title,description_0,product_id
0,My Fair Pastry (Good Eats Vol. 9),Disc 1: Flour Power (Scones; Shortcakes; South...,0000143529
1,"Barefoot Contessa (with Ina Garten), Entertain...",Barefoot Contessa Volume 2: On these three dis...,0000143588
2,Rise and Swine (Good Eats Vol. 7),Rise and Swine (Good Eats Vol. 7) includes bon...,0000143502
3,The Power of the Cross Joseph Prince,Have failures in your life caused you to feel ...,000073991X
4,Live in Houston [VHS],Track Listings 1. Come On Everybody 2. My Stre...,000107461X
...,...,...,...
49995,Shirley Temple: America's Sweetheart Collectio...,"Includes Baby Take a Bow, Bright Eyes & Rebecc...",B000FKPDY8
49996,America's Castles - The Grand Resorts,While taverns and inns have been a part of the...,B000FKP22Q
49997,Dragon in Fury,Dragon in Fury movie,B000FKPDUW
49998,"Rin Tin Tin: Double Feature, Vol. 3",Caryl of the mountains: The 1914 silent film w...,B000FKP42O


Turn the description into numeric features with TF-IDF Vectorizer

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the vectorizer
vectorizer = TfidfVectorizer(stop_words = 'english', min_df = 10)

# Fit
vectorizer.fit(rec_df_subset['description_0'])

# Transform the description
TF_matrix2 = vectorizer.transform(rec_df_subset['description_0'])
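To make the effect of `stop_words` and `min_df` concrete, here is a minimal sketch on a toy corpus. The descriptions and the names `toy_descriptions`, `toy_vectorizer` and `toy_matrix` are hypothetical and are not part of the dataset; `min_df = 2` is used only because the corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy descriptions, not taken from the dataset
toy_descriptions = [
    "cooking bread recipes",
    "cooking pork recipes",
    "the history of castles documentary",
    "",  # an empty description, like the ones produced by fillna('')
]

# min_df=2 keeps only terms that appear in at least 2 documents;
# stop_words='english' drops common words such as "the" and "of"
toy_vectorizer = TfidfVectorizer(stop_words='english', min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_descriptions)

toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)          # terms that survived the filtering
print(toy_matrix.shape)   # (number of documents, number of kept terms)
```

Documents whose terms are all filtered out, including empty descriptions, end up as all-zero rows, so their cosine similarity with every other product is 0.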

In [None]:

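# Pairwise cosine similarity between every product in the subset (memory intensive)
from sklearn.metrics.pairwise import cosine_similarity
mov_similarities = cosine_similarity(TF_matrix2, dense_output = False)

movie_index = rec_df_subset[rec_df_subset['title'] =='Rise and Swine (Good Eats Vol. 7)'].index

sim_df = pd.DataFrame({'item':rec_df_subset['title'], 
                       'similarities': np.array(mov_similarities[movie_index,:].todense()).squeeze()})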

List of the top recommended products based off of similarities

In [None]:
top_recommend = sim_df.sort_values(by = 'similarities', ascending = False)[0:10]

# Check results
top_recommend

**Check the actual descriptions to see if products are similar.**

In [None]:
rec_df_subset['description_0'][14782]

In [None]:
rec_df_subset.loc[14782,:]

In [None]:
rec_df_subset['description_0'][2]

In [None]:
top_recommend.index[1]

In [None]:
sim_df.sort_values(by = 'similarities', ascending = True)

Find the cosine similarity for the requested product only, instead of generating the cosine similarity for all the products.

In [44]:
# Define the vectorizer
vectorizer2 = TfidfVectorizer(stop_words = 'english', min_df = 10)

# Fit
vectorizer2.fit(rec_df['description_0'])


TfidfVectorizer(min_df=10, stop_words='english')

This approach takes a lot of computation power and memory because the cosine similarity between every pair of products is computed. A more efficient way is to compare only the chosen product with every other product.
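The difference in cost can be illustrated with a small NumPy sketch. It is an illustration on random dense data, not the notebook's sparse pipeline; `cosine_to_all` and `toy_features` are hypothetical names. For n products, the all-pairs result has n × n entries, while the single-product result has only n. With the 156,481 products reported by `TF_matrix.shape` in this notebook, the all-pairs matrix could hold over 24 billion entries.

```python
import numpy as np

def cosine_to_all(matrix, row_index):
    """Cosine similarity between one row and every row of a dense matrix."""
    norms = np.linalg.norm(matrix, axis=1)
    # Avoid division by zero for all-zero rows (e.g. empty descriptions)
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = matrix / safe_norms[:, None]
    return unit @ unit[row_index]

rng = np.random.default_rng(0)
toy_features = rng.random((5, 3))  # 5 hypothetical products, 3 features

# One product against all products: 5 values
one_vs_all = cosine_to_all(toy_features, 2)

# Every product against every product: 5 x 5 = 25 values
unit_rows = toy_features / np.linalg.norm(toy_features, axis=1)[:, None]
all_vs_all = unit_rows @ unit_rows.T

print(one_vs_all.shape, all_vs_all.shape)
# The single-product result matches one column of the full matrix
print(np.allclose(one_vs_all, all_vs_all[:, 2]))
```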

## 3.1 Single Product Comparison<a class ='anchor' id='3.1_single_product'></a>

In [None]:
# Transform the description
TF_matrix = vectorizer2.transform(rec_df['description_0'])

In [18]:
# Determine the index of the movie
movie_index = rec_df_subset[rec_df_subset['title'] =='Rise and Swine (Good Eats Vol. 7)'].index

In [39]:
TF_matrix_product = vectorizer2.transform(rec_df_subset['description_0'][movie_index])

In [29]:
movie_index

Int64Index([2], dtype='int64')

In [30]:
TF_matrix.shape

(156481, 39048)

In [54]:
TF_matrix[movie_index]

<1x39048 sparse matrix of type '<class 'numpy.float64'>'
	with 54 stored elements in Compressed Sparse Row format>

In [31]:
# Check the shape of the transformed description
TF_matrix_product.shape

(1, 39048)

In [2]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the similarity between the chosen product with every other product in the dataset
mov_similarities = cosine_similarity(TF_matrix,TF_matrix_product, dense_output = False)

# Check output results
mov_similarities

NameError: name 'TF_matrix' is not defined

In [35]:
# Convert the movie similarities output into a dataframe
single_df = pd.DataFrame({'item':rec_df['title'], 
                       'similarities': np.array(mov_similarities.todense()).squeeze()})

In [37]:
# Check the top 10 most similar products
single_df.sort_values(by = 'similarities', ascending = False).head(10)

Unnamed: 0,item,similarities
2,Rise and Swine (Good Eats Vol. 7),1.0
76483,Good Eats with Alton Brown Vol. 12,0.301367
94294,American Eats: Holiday Foods,0.256991
61427,Food Network: Good Eats with Alton Brown - Bre...,0.247336
92830,Good Eats with Alton Brown - Holiday Treats (R...,0.244306
99539,Food Network Takeout Collection DVD - Good Eat...,0.243199
94305,American Eats: Hot Dogs,0.236412
76760,Good Eats: Cupboard Cuisine - Volume 11,0.224773
94298,American Eats,0.213358
95611,Kitchen Wisdom from Good Eats (Good Eats Vol. 20),0.202247


## 3.2 Save the vectorizer model<a class ='anchor' id='3.2_save_vectorizer'></a>

In [46]:
pickle.dump(vectorizer2, open('recommender_vectorizer.pickle','wb'))

In [48]:
file_vectorizer = pickle.load(open('recommender_vectorizer.pickle','rb'))
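A possible variation on the save and load step is to open the file in a `with` block, so the file handle is closed even if pickling fails. The sketch below is a self-contained round trip on a toy vectorizer; the temporary path and the names `toy_vectorizer`, `toy_path` and `restored_vectorizer` are assumptions made for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy vectorizer standing in for the fitted model
toy_vectorizer = TfidfVectorizer().fit(["good eats pork", "good eats bread"])

# Temporary location used only for this sketch
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_vectorizer.pickle')

# Save: the with-block closes the file automatically
with open(toy_path, 'wb') as f:
    pickle.dump(toy_vectorizer, f)

# Load the vectorizer back from disk
with open(toy_path, 'rb') as f:
    restored_vectorizer = pickle.load(f)

# The restored model should carry the same fitted vocabulary
same_vocab = restored_vectorizer.vocabulary_ == toy_vectorizer.vocabulary_
print(same_vocab)
```

Pickled scikit-learn models are generally expected to be loaded with the same library version that saved them, so it may be worth recording the version alongside the file.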

## 3.3 Top 10 Recommendation Function<a class ='anchor' id='3.3_recommend_function'></a>

Create a function that would take in the item and output the top 10 recommendations.

In [55]:
def rec_system(vectorizer, sentiment, product_list, product_name):
    '''
    
    The purpose of this function is to recommend products based on the predicted
    sentiment and product name.
    
    The inputs to this function include the vectorizer model, the sentiment of 
    the review text, the list of products, and the name of the product used for 
    recommending other products.
    
    
    Parameters
    ----------
    vectorizer = sklearn.feature_extraction.text.TfidfVectorizer (Vectorizer model)
    sentiment = int (1 for positive; any other value is treated as negative)
    product_list = pandas.core.frame.DataFrame
    product_name = str 
    
    
    Returns
    --------
    Recommended Product list
    
    '''
    
    # Transform the description
    TF_matrix = vectorizer.transform(product_list['description_0'])
    
    # Determine the index of the product of recommendation
    product_index = product_list[product_list['title'] == product_name].index
    
    # Get the TF matrix of the required product
    TF_matrix_product = TF_matrix[product_index]
    
    # Determine the similarity between the product and everything else
    product_similarities = cosine_similarity(TF_matrix,TF_matrix_product, dense_output = False)
    
    # Convert the product similarities into a dataframe
    single_df = pd.DataFrame({'item':product_list['title'], 
                       'similarities': np.array(product_similarities.todense()).squeeze()})
    
    # Sort the products by similarity
    single_df_sorted = single_df.sort_values(by = 'similarities', ascending = False)
    
    
    # When the sentiment is positive, recommend the top 10 most similar products
    if sentiment == 1:
        recommended_products = single_df_sorted.head(10)
    
    # Recommend 10 products randomly from the top 100 most similar products
    else:
        # Split out the top 100 recommended_products
        top_100 = single_df_sorted.iloc[:100]
        
        # Randomly sample 10 positions from the top 100 (fewer if the list is shorter)
        ran_num = random.sample(range(len(top_100)), min(10, len(top_100)))
        
        # Select the sampled rows by position; the index still holds the original row labels
        recommended_products = top_100.iloc[ran_num]
        
    return recommended_products.reset_index(drop = True)
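The negative-sentiment branch draws a random subset of the most similar products. One point worth checking is that after sorting, the DataFrame index still holds the original row labels, so random positions need to be applied with `iloc` rather than matched against the index. The sketch below shows the difference on a hypothetical toy frame (`toy_sorted` and the other names are made up for illustration).

```python
import random

import pandas as pd

# Hypothetical sorted similarity table; the index holds original row labels
toy_sorted = pd.DataFrame(
    {
        'item': [f'product_{i}' for i in range(20)],
        'similarities': [1 - i / 20 for i in range(20)],
    },
    index=range(1000, 1020),
)

random.seed(0)  # fixed seed so the sketch is repeatable
top_n = toy_sorted.iloc[:10]
positions = random.sample(range(len(top_n)), 3)

# iloc selects by position (0..9), so exactly 3 rows come back
sampled = top_n.iloc[positions]

# Matching positions against the labels 1000..1009 finds nothing
label_match = top_n[top_n.index.isin(positions)]

print(len(sampled), len(label_match))
```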

## 3.4 Test out recommendation system<a class ='anchor' id='3.4_test_recommendation'></a>

In [56]:
# Test out recommendation system

# Pass a positive sentiment (1) to get the top 10 most similar products
rec_system(file_vectorizer, 1, rec_df, 'Rise and Swine (Good Eats Vol. 7)')

Unnamed: 0,item,similarities
2,Rise and Swine (Good Eats Vol. 7),1.0
76483,Good Eats with Alton Brown Vol. 12,0.301367
94294,American Eats: Holiday Foods,0.256991
61427,Food Network: Good Eats with Alton Brown - Bre...,0.247336
92830,Good Eats with Alton Brown - Holiday Treats (R...,0.244306
99539,Food Network Takeout Collection DVD - Good Eat...,0.243199
94305,American Eats: Hot Dogs,0.236412
76760,Good Eats: Cupboard Cuisine - Volume 11,0.224773
94298,American Eats,0.213358
95611,Kitchen Wisdom from Good Eats (Good Eats Vol. 20),0.202247


## 3.5 Evaluate recommendation system<a class ='anchor' id='3.5_evaluate_recommendation'></a>

A sample test could take a movie review and its rating, feed them into the model, and output the top 10 movies the person may like.

Use reviews and movie descriptions to determine which movies to recommend, based on whether the person rated the movie highly.
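One way such an evaluation could be scored, offered only as a sketch, is a hit rate: the fraction of users for whom at least one of the top-k recommendations is a product they rated highly. The notebook does not define a metric, so the choice of hit rate, the function `hit_rate_at_k` and the toy data are all assumptions.

```python
def hit_rate_at_k(recommendations, liked_items, k=10):
    """Fraction of users with at least one liked item in their top-k list."""
    if not recommendations:
        return 0.0
    hits = 0
    for user, recs in recommendations.items():
        # A hit means the top-k list overlaps the user's liked items
        if set(recs[:k]) & liked_items.get(user, set()):
            hits += 1
    return hits / len(recommendations)

# Hypothetical recommendations and highly rated products per user
toy_recs = {
    'user_a': ['p1', 'p2', 'p3'],
    'user_b': ['p4', 'p5', 'p6'],
}
toy_liked = {
    'user_a': {'p2'},   # p2 was recommended, so this counts as a hit
    'user_b': {'p9'},   # p9 was not recommended, so this is a miss
}

toy_hit_rate = hit_rate_at_k(toy_recs, toy_liked, k=3)
print(toy_hit_rate)  # 0.5
```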

# 4. Conclusion and Future Works<a class ='anchor' id='4Conclusion'></a>