Known issue: running out of memory.

# 4 Recommendation System

- Author: Jason Truong
- Last Modified: August 21, 2022
- Email: Jasontruong19@gmail.com

# Table of Contents

1. [Objective and Roadmap](#1Objective)  
2. [Preliminary Data Setup](#2Preliminary)   
3. [Content Based Recommendation](#4Test_Train)  
4. [Collaborative Based Recommendation](#3NLP)  
5. [Conclusion and Future Works](#5AdvancedModels)  

# 1. Objective<a class ='anchor' id='1Objective'></a>

Use review text and product descriptions to generate product recommendations for users.

# 2. Preliminary Data Setup<a class ='anchor' id='2Preliminary'></a>

In [43]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle


from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

Load in the dataset

In [2]:
meta_df = pd.read_csv('clean_meta.csv')

In [6]:
# Replace missing descriptions with empty strings so the vectorizer can process them
meta_df['description_0'] = meta_df['description_0'].fillna('')

In [25]:
meta_df

Unnamed: 0,title,brand,rank,product_id,description_0,category_1,category_2
0,My Fair Pastry (Good Eats Vol. 9),Alton Brown,370026,0000143529,Disc 1: Flour Power (Scones; Shortcakes; South...,Movies,
1,"Barefoot Contessa (with Ina Garten), Entertain...",Ina Garten,342914,0000143588,Barefoot Contessa Volume 2: On these three dis...,Movies,
2,Rise and Swine (Good Eats Vol. 7),Alton Brown,351684,0000143502,Rise and Swine (Good Eats Vol. 7) includes bon...,Movies,
3,The Power of the Cross Joseph Prince,Joseph Prince,444474,000073991X,Have failures in your life caused you to feel ...,Genre for Featured Categories,Exercise & Fitness
4,Live in Houston [VHS],Douglas Miller,1005955,000107461X,Track Listings 1. Come On Everybody 2. My Stre...,Movies,
...,...,...,...,...,...,...,...
156476,Verdi: Otello,Sonya Yoncheva,68026,B01HJ1INB0,Tony Award-winning director Bartlett Sher prob...,Studio Specials,Sony Pictures Home Entertainment
156477,Mr. Miracle - Ihn schickt der Himmel,,344483,B01HJ3E0PQ,Mr. Miracle DVD Region 2 need an all region DV...,Movies,
156478,The President,Misha Gomiashvili,199854,B01HJ6R77G,The President and his family rule the land wit...,Independently Distributed,Drama
156479,She.....Who Would Be Pope,Liv Ullmann,246494,B01HJCCLOY,"Filmed in 1972 as Pope Joan, Michael Andersons...",Genre for Featured Categories,Action & Adventure


In [9]:
meta_df['category_2'].value_counts()

Documentary             13297
Drama                   11489
Action & Adventure       9986
Comedy                   8762
Special Interests        8195
                        ...  
MGM DVDs Under $15          1
Soap Operas                 1
Cine espaol                 1
Other Topics                1
Five Star Collection        1
Name: category_2, Length: 337, dtype: int64

# 3. Content Based Recommendation<a class ='anchor' id='4Test_Train'></a>

The first step uses the descriptions of the Amazon items (here, movies and TV shows) to recommend similar products.

In [10]:
rec_df = meta_df[['title','description_0','product_id']].copy()

In [12]:
rec_df_subset = rec_df.iloc[0:50000,:]

# Check results
rec_df_subset

Unnamed: 0,title,description_0,product_id
0,My Fair Pastry (Good Eats Vol. 9),Disc 1: Flour Power (Scones; Shortcakes; South...,0000143529
1,"Barefoot Contessa (with Ina Garten), Entertain...",Barefoot Contessa Volume 2: On these three dis...,0000143588
2,Rise and Swine (Good Eats Vol. 7),Rise and Swine (Good Eats Vol. 7) includes bon...,0000143502
3,The Power of the Cross Joseph Prince,Have failures in your life caused you to feel ...,000073991X
4,Live in Houston [VHS],Track Listings 1. Come On Everybody 2. My Stre...,000107461X
...,...,...,...
49995,Shirley Temple: America's Sweetheart Collectio...,"Includes Baby Take a Bow, Bright Eyes & Rebecc...",B000FKPDY8
49996,America's Castles - The Grand Resorts,While taverns and inns have been a part of the...,B000FKP22Q
49997,Dragon in Fury,Dragon in Fury movie,B000FKPDUW
49998,"Rin Tin Tin: Double Feature, Vol. 3",Caryl of the mountains: The 1914 silent film w...,B000FKP42O


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the vectorizer
vectorizer = TfidfVectorizer(stop_words = 'english', min_df = 10)

# Fit
vectorizer.fit(rec_df_subset['description_0'])

# Transform the description
TF_matrix2 = vectorizer.transform(rec_df_subset['description_0'])

In [None]:
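from sklearn.metrics.pairwise import cosine_similarity

# Assumption: mov_similaries is the full pairwise similarity matrix of the subset (memory-intensive)
mov_similaries = cosine_similarity(TF_matrix2, dense_output = False)

movie_index = rec_df_subset[rec_df_subset['title'] =='Rise and Swine (Good Eats Vol. 7)'].index

sim_df = pd.DataFrame({'item':rec_df_subset['title'], 
                       'similarities': np.array(mov_similaries[movie_index,:].todense()).squeeze()})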

List the top recommended products, ranked by similarity.

In [None]:
top_recommend = sim_df.sort_values(by = 'similarities', ascending = False)[0:10]

# Check results
top_recommend

**Check the actual descriptions to see if products are similar.**

In [None]:
rec_df_subset['description_0'][14782]

In [None]:
rec_df_subset.loc[14782,:]

In [None]:
rec_df_subset['description_0'][2]

In [None]:
top_recommend.index[1]

In [None]:
sim_df.sort_values(by = 'similarities', ascending = True)

Find the cosine similarity for the requested product only, instead of generating the cosine similarity between all pairs of products.
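The memory saving comes from the shape of the result: comparing all n products with each other yields an n x n matrix, while comparing them with a single query yields n x 1. The following is a minimal sketch on a made-up four-item corpus; the descriptions and the `toy_*` names are illustrative assumptions, not part of the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy descriptions (hypothetical, for illustration only)
toy_descriptions = [
    "cooking show about pork and bacon recipes",
    "cooking show about bread and pastry recipes",
    "opera performance recorded live on stage",
    "documentary about american castles and resorts",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_tfidf = toy_vectorizer.fit_transform(toy_descriptions)

# All pairs: n x n result, grows quadratically with the number of products
all_pairs = cosine_similarity(toy_tfidf, dense_output=False)

# One query against everything: n x 1 result
query_row = 0
one_vs_all = cosine_similarity(toy_tfidf, toy_tfidf[query_row], dense_output=False)

print(all_pairs.shape)   # (4, 4)
print(one_vs_all.shape)  # (4, 1)
```

With the full 156,481-product catalogue, the single-query form keeps the result at one column rather than 156,481 columns.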

In [44]:
# Define the vectorizer
vectorizer2 = TfidfVectorizer(stop_words = 'english', min_df = 10)

# Fit
vectorizer2.fit(rec_df['description_0'])


TfidfVectorizer(min_df=10, stop_words='english')

In [None]:
# Transform the description
TF_matrix = vectorizer2.transform(rec_df['description_0'])

In [18]:
movie_index = rec_df_subset[rec_df_subset['title'] =='Rise and Swine (Good Eats Vol. 7)'].index

In [39]:
TF_matrix1 = vectorizer2.transform(rec_df_subset['description_0'][movie_index])

In [29]:
movie_index

Int64Index([2], dtype='int64')

In [30]:
TF_matrix.shape

(156481, 39048)

In [54]:
TF_matrix[movie_index]

<1x39048 sparse matrix of type '<class 'numpy.float64'>'
	with 54 stored elements in Compressed Sparse Row format>

In [31]:
# Check the shape of the transformed description
TF_matrix1.shape

(1, 39048)

In [32]:
from sklearn.metrics.pairwise import cosine_similarity

mov_similaries = cosine_similarity(TF_matrix,TF_matrix1, dense_output = False)

In [33]:
mov_similaries

<156481x1 sparse matrix of type '<class 'numpy.float64'>'
	with 55953 stored elements in Compressed Sparse Row format>

In [35]:
single_df = pd.DataFrame({'item':rec_df['title'], 
                       'similarities': np.array(mov_similaries.todense()).squeeze()})

In [37]:
single_df.sort_values(by = 'similarities', ascending = False).head(10)

Unnamed: 0,item,similarities
2,Rise and Swine (Good Eats Vol. 7),1.0
76483,Good Eats with Alton Brown Vol. 12,0.301367
94294,American Eats: Holiday Foods,0.256991
61427,Food Network: Good Eats with Alton Brown - Bre...,0.247336
92830,Good Eats with Alton Brown - Holiday Treats (R...,0.244306
99539,Food Network Takeout Collection DVD - Good Eat...,0.243199
94305,American Eats: Hot Dogs,0.236412
76760,Good Eats: Cupboard Cuisine - Volume 11,0.224773
94298,American Eats,0.213358
95611,Kitchen Wisdom from Good Eats (Good Eats Vol. 20),0.202247


## Save the vectorizer to a file so it can be reused in the function

In [46]:
# Use a context manager so the file is closed after writing
with open('recommender_vectorizer.pickle','wb') as f:
    pickle.dump(vectorizer2, f)

In [48]:
with open('recommender_vectorizer.pickle','rb') as f:
    file_vectorizer = pickle.load(f)

## Create a function that would take in the item and output the top 10 recommendations.

In [55]:
def rec_system(vectorizer, product_list, product_name):
    '''
    Return the 10 products whose descriptions are most similar to the given product.

    vectorizer: fitted TfidfVectorizer used to transform the descriptions
    product_list: DataFrame with 'title' and 'description_0' columns
    product_name: title of the product to base the recommendations on
    '''
    
    # Transform the description
    TF_matrix = vectorizer.transform(product_list['description_0'])
    
    # Determine the index of the product of recommendation
    product_index = product_list[product_list['title'] == product_name].index
    
    # Get the TF matrix of the required product
    TF_matrix_product = TF_matrix[product_index]
    
    # Determine the similarity between the product and everything else
    # (use the row selected above rather than the global TF_matrix1)
    product_similaries = cosine_similarity(TF_matrix,TF_matrix_product, dense_output = False)
    
    single_df = pd.DataFrame({'item':product_list['title'], 
                       'similarities': np.array(product_similaries.todense()).squeeze()})
    
    recommended_products = single_df.sort_values(by = 'similarities', ascending = False).head(10)
    
    return recommended_products

In [56]:
rec_system(file_vectorizer, rec_df,'Rise and Swine (Good Eats Vol. 7)')

Unnamed: 0,item,similarities
2,Rise and Swine (Good Eats Vol. 7),1.0
76483,Good Eats with Alton Brown Vol. 12,0.301367
94294,American Eats: Holiday Foods,0.256991
61427,Food Network: Good Eats with Alton Brown - Bre...,0.247336
92830,Good Eats with Alton Brown - Holiday Treats (R...,0.244306
99539,Food Network Takeout Collection DVD - Good Eat...,0.243199
94305,American Eats: Hot Dogs,0.236412
76760,Good Eats: Cupboard Cuisine - Volume 11,0.224773
94298,American Eats,0.213358
95611,Kitchen Wisdom from Good Eats (Good Eats Vol. 20),0.202247
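The top result above is the query product itself (similarity 1.0), and a title that is not in the data would produce an empty index. One way to handle both cases is sketched below. `rec_system_safe`, its `top_n` parameter, and the toy data are hypothetical names introduced for illustration; this is an assumption about how the function could be hardened, not the notebook's own implementation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rec_system_safe(vectorizer, product_list, product_name, top_n=10):
    """Hypothetical variant of rec_system: checks the title and drops the query item."""
    # Row positions whose title matches the requested product
    positions = np.flatnonzero((product_list['title'] == product_name).to_numpy())
    if positions.size == 0:
        raise ValueError(f"No product titled {product_name!r}")
    query_pos = int(positions[0])  # titles may repeat; use the first match

    tfidf = vectorizer.transform(product_list['description_0'])
    sims = cosine_similarity(tfidf, tfidf[query_pos], dense_output=False)

    result = pd.DataFrame({'item': product_list['title'].to_numpy(),
                           'similarities': sims.toarray().ravel()},
                          index=product_list.index)

    # Remove the query product so it is not recommended to itself
    result = result.drop(index=product_list.index[query_pos])
    return result.sort_values(by='similarities', ascending=False).head(top_n)


# Toy data (hypothetical) to exercise the function
toy_products = pd.DataFrame({
    'title': ['Pork Basics', 'Bread Basics', 'Opera Night'],
    'description_0': ['cooking pork recipes', 'cooking bread recipes', 'opera recorded live'],
})
toy_vectorizer = TfidfVectorizer().fit(toy_products['description_0'])
recs = rec_system_safe(toy_vectorizer, toy_products, 'Pork Basics', top_n=2)
print(recs)
```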


# 4. Collaborative Based Recommendations<a class ='anchor' id='3NLP'></a>

In this section, the review text is converted to features and combined with the product description features. The combined features allow user-based recommendations that draw on similar user reviews and product descriptions.
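As a rough sketch of this idea, the two feature sets can be aligned on `product_id` and stacked side by side. The example below keeps everything sparse with `scipy.sparse.hstack`, which is an alternative to building dense DataFrames and is chosen here only to limit memory use. The toy reviews, toy metadata, and `toy_*` names are assumptions for illustration.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy review and metadata tables (hypothetical, for illustration only)
toy_reviews = pd.DataFrame({
    'product_id': ['A1', 'A1', 'B2'],
    'reviewText': ['great recipes and fun host',
                   'loved the cooking tips',
                   'beautiful singing and staging'],
})
toy_meta = pd.DataFrame({
    'product_id': ['A1', 'B2'],
    'description_0': ['cooking show with pork recipes',
                      'opera recorded live on stage'],
})

# Attach each review to the description of the product it reviews
toy_merged = toy_reviews.merge(toy_meta, on='product_id', how='inner')

# Separate vocabularies for review text and description text
review_features = TfidfVectorizer(stop_words='english').fit_transform(toy_merged['reviewText'])
desc_features = TfidfVectorizer(stop_words='english').fit_transform(toy_merged['description_0'])

# Stack the two sparse blocks column-wise: one row per review
combined_features = hstack([review_features, desc_features]).tocsr()

# Similarity of every review row to the first review row
toy_similarities = cosine_similarity(combined_features, combined_features[0], dense_output=False)
print(combined_features.shape, toy_similarities.shape)
```

Rows that share a product share the description block, so they stay partly similar even when the review wording differs.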

### Load in the processed review data

In [None]:
# Load in the data
review_df = pd.read_json('preprocessed_review.json')

# Check the datatypes and null values in the data
review_df.info(show_counts= True)

In [None]:
review_df.head()

In [None]:
review_subsample = review_df[0:5000]

#check results
review_subsample

### Transform all the review text to a vector

In [None]:
## Convert the text in the reviewText column to vectors
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate 
# Discard stop words; a word must appear in at least 25 reviews
review_wordbank = TfidfVectorizer(stop_words = "english", min_df = 25)

# Fit on the 5,000-review subsample
review_wordbank.fit(review_subsample['reviewText'])

# Transform
X_train_transformed = review_wordbank.transform(review_subsample['reviewText'])
X_train_transformed

### Combine with numeric features

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
review_numeric = pd.DataFrame(columns = review_wordbank.get_feature_names_out(),data = X_train_transformed.toarray())

# Check results
review_numeric

In [None]:
review_numeric_df = pd.concat([review_subsample[['reviewScore','product_id']],review_numeric], axis = 1)

# check results
review_numeric_df

### Combine with meta data features based on product_id

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
meta_numeric = pd.DataFrame(columns = vectorizer.get_feature_names_out(),data = TF_matrix2.toarray())

# Check results
meta_numeric

In [None]:
# TF_matrix2 was built from rec_df_subset, so take the product ids from the same frame
meta_numeric_df = pd.concat([rec_df_subset['product_id'],meta_numeric], axis = 1)

# Check results
meta_numeric_df

In [None]:
meta_numeric_df['product_id']

In [None]:
combined_df = pd.merge(review_numeric_df, meta_numeric_df,  how='left', left_on='product_id', right_on = 'product_id')

# Check results
combined_df

In [None]:
combined_similar = combined_df.drop(columns = 'product_id')

In [None]:
combined_similar.dropna(inplace = True)

In [None]:
combined_similar

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
review_similaries = cosine_similarity(combined_similar, dense_output = False)

In [None]:
review_similaries[1]

In [None]:
combined_similar

Check how many reviews each product has in the combined data.

In [None]:
combined_df['product_id'].value_counts()

### Use cosine similarity

### Test out recommendation system

A sample test takes a movie review and its rating as input to the model and outputs the top 10 movies the person may like.

Use reviews and movie descriptions to decide which movies to recommend, depending on whether the person rated the movie highly.
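One possible reading of this idea is sketched below: build a profile for a user from the descriptions of the products they rated, weighting each by how far its score sits from a neutral 3, then rank the unrated products by cosine similarity to that profile. The weighting scheme, the toy catalogue, the toy ratings, and all `toy_*` names are assumptions for illustration and have not been evaluated on the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue and one user's ratings on a 1-5 scale (hypothetical)
toy_catalog = pd.DataFrame({
    'product_id': ['A1', 'A2', 'B1', 'B2'],
    'title': ['Pork Basics', 'Bread Basics', 'Opera Night', 'Opera Gala'],
    'description_0': ['cooking pork recipes', 'cooking bread recipes',
                      'opera recorded live', 'opera gala concert live'],
})
toy_ratings = pd.DataFrame({'product_id': ['A1', 'B1'], 'reviewScore': [5, 1]})

catalog_tfidf = TfidfVectorizer().fit_transform(toy_catalog['description_0'])

# Match each rating to the catalogue row of the rated product
rated = toy_catalog.reset_index().merge(toy_ratings, on='product_id')
rated_rows = catalog_tfidf[rated['index'].to_numpy()]

# Liked items pull the profile toward them, disliked items push it away
weights = (rated['reviewScore'] - 3).to_numpy(dtype=float)
profile = np.asarray(rated_rows.T @ weights).reshape(1, -1)

# Score every product against the profile, then keep only unrated ones
scores = cosine_similarity(catalog_tfidf, profile).ravel()
unseen = ~toy_catalog['product_id'].isin(toy_ratings['product_id'])
ranking = (toy_catalog.loc[unseen, ['title']]
           .assign(score=scores[unseen.to_numpy()])
           .sort_values('score', ascending=False))
print(ranking)
```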

# 5. Conclusion and Future Works<a class ='anchor' id='5AdvancedModels'></a>