# 4 Recommendation System

- Author: Jason Truong
- Last Modified: August 21, 2022
- Email: Jasontruong19@gmail.com

# Table of Contents

1. [Objective and Roadmap](#1Objective)  
2. [Preliminary Data Setup](#2Preliminary)   
3. [Content Based Recommendation](#4Test_Train)  
4. [Collaborative Based Recommendation](#3NLP)  
5. [Conclusion and Future Works](#5AdvancedModels)  

# 1. Objective<a class ='anchor' id='1Objective'></a>

The goal is to use review text and product descriptions to generate recommendations for users.

# 2. Preliminary Data Setup<a class ='anchor' id='2Preliminary'></a>

In [7]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

Load in the dataset

In [8]:
meta_df = pd.read_csv('clean_meta.csv')

In [40]:
meta_df.head(10)

Unnamed: 0,title,brand,rank,price,asin,description_0,category_1,category_2
0,Understanding Seizures and Epilepsy,,886503,,0000695009,,Movies,
1,Spirit Led&mdash;Moving By Grace In The Holy S...,,342688,,0000791156,,Movies,
2,My Fair Pastry (Good Eats Vol. 9),Alton Brown,370026,,0000143529,Disc 1: Flour Power (Scones; Shortcakes; South...,Movies,
3,"Barefoot Contessa (with Ina Garten), Entertain...",Ina Garten,342914,74.95,0000143588,Barefoot Contessa Volume 2: On these three dis...,Movies,
4,Rise and Swine (Good Eats Vol. 7),Alton Brown,351684,,0000143502,Rise and Swine (Good Eats Vol. 7) includes bon...,Movies,
5,The Power of the Cross Joseph Prince,Joseph Prince,444474,,000073991X,Have failures in your life caused you to feel ...,Genre for Featured Categories,Exercise & Fitness
6,Live in Houston [VHS],Douglas Miller,1005955,,000107461X,Track Listings 1. Come On Everybody 2. My Stre...,Movies,
7,"Everyday Italian (with Giada de Laurentiis), V...",,409173,24.95,0000143561,"Giada de Laurentis on ""Everyday Italian"" DVDs,...",Movies,
8,At Home with the Guitar VHS,,806803,,0001499572,like new,Genre for Featured Categories,Faith & Spirituality
9,Steve Green: Hide 'em in Your Heart: 13 Bible ...,Steve Green,282599,,0001526863,Steve Green: Hide 'em in Your Heart: 13 Bible ...,Christian Video,Bible


In [10]:
meta_df['category_2'].value_counts()

Documentary                  14242
Drama                        12693
Action & Adventure           11136
Comedy                        9590
Special Interests             8881
                             ...  
Two-Disc Special Editions        1
Krauss, Alison                   1
Amazing Vacation Homes           1
Spanish-Language                 1
Buble, Michael                   1
Name: category_2, Length: 340, dtype: int64

# 3. Content Based Recommendation<a class ='anchor' id='4Test_Train'></a>

The first step uses the descriptions of the Amazon items (in this case, movies and TV shows) to recommend similar products.

In [11]:
working_df = meta_df[['title','description_0']].copy()

In [12]:
working_df['description_0'] = working_df['description_0'].fillna("")

In [27]:
new_df = working_df.iloc[0:50000,:]

In [28]:
new_df

Unnamed: 0,title,description_0
0,Understanding Seizures and Epilepsy,
1,Spirit Led&mdash;Moving By Grace In The Holy S...,
2,My Fair Pastry (Good Eats Vol. 9),Disc 1: Flour Power (Scones; Shortcakes; South...
3,"Barefoot Contessa (with Ina Garten), Entertain...",Barefoot Contessa Volume 2: On these three dis...
4,Rise and Swine (Good Eats Vol. 7),Rise and Swine (Good Eats Vol. 7) includes bon...
...,...,...
49995,Mitr: My Friend,Chidambaram-based Lakshmi gets married to Prit...
49996,Glitter &amp; Queer,From the label that brought you the Divas Of D...
49997,Kics Flix - Volume 5,
49998,Fragile Machine,Merging computer animation and music in what m...


In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the vectorizer
vectorizer = TfidfVectorizer(stop_words = 'english', min_df = 25)

# Fit
vectorizer.fit(new_df['description_0'])

# Transform the description
TF_matrix2 = vectorizer.transform(new_df['description_0'])

In [31]:
# Check the shape of the transformed description
TF_matrix2.shape

(50000, 9046)

In [32]:
from sklearn.metrics.pairwise import cosine_similarity

mov_similaries = cosine_similarity(TF_matrix2, dense_output = False)

In [33]:
mov_similaries

<50000x50000 sparse matrix of type '<class 'numpy.float64'>'
	with 483157748 stored elements in Compressed Sparse Row format>

In [41]:
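# Row index of the query title (the first match is used if the title is repeated)
movie_index = new_df[new_df['title'] == 'At Home with the Guitar VHS'].index[0]

# Pair every title with its cosine similarity to the query item
sim_df = pd.DataFrame({'item': new_df['title'],
                       'similarities': mov_similaries[movie_index, :].toarray().ravel()})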

In [42]:
sim_df.sort_values(by = 'similarities', ascending = False)

Unnamed: 0,item,similarities
7166,Warren Miller's Snowboarding: Tweaked &amp; Tw...,1.0
7168,Casanova VHS,1.0
4217,The Living Legend VHS,1.0
465,It's a Gift VHS,1.0
39929,Best Of Street Fury Uncut,1.0
...,...,...
17701,Mr. Wong Collection (Mr. Wong: Detective / Mys...,0.0
17703,JKA Shotokan Karate Kata Series-Vol 1 Heian 1-...,0.0
17704,Land of the Lost Vol. 2 VHS,0.0
17705,JKA Shotokan Karate Kata Series-Vol 11 Unsu Ts...,0.0
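
One way to turn the lookup above into a reusable step is a small helper that returns the top N matches and leaves out the query item itself. The sketch below is an assumption rather than part of the original notebook: `top_similar` and the toy titles are hypothetical, and it compares only the query row against the TF-IDF matrix instead of building the full item-by-item similarity matrix.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_similar(title, titles, tfidf_matrix, n=10):
    """Return the n titles most similar to `title`, excluding the query row itself."""
    # Positional index of the first row whose title matches the query
    matches = np.flatnonzero(titles.to_numpy() == title)
    if matches.size == 0:
        raise KeyError(f"title not found: {title!r}")
    query_pos = matches[0]

    # Compare only the query row against all rows (result has shape 1 x n_items)
    sims = cosine_similarity(tfidf_matrix[query_pos], tfidf_matrix).ravel()

    result = pd.DataFrame({"item": titles.to_numpy(), "similarities": sims})
    result = result.drop(index=query_pos)
    return result.sort_values("similarities", ascending=False).head(n)


# Toy data standing in for new_df (made up for illustration)
toy = pd.DataFrame({
    "title": ["Guitar Basics", "Advanced Guitar", "Baking Bread", "Pastry School"],
    "description_0": [
        "learn guitar chords and strumming",
        "guitar solos and advanced chords",
        "bake bread with flour and yeast",
        "pastry dough with flour and butter",
    ],
})
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy["description_0"])
recommendations = top_similar("Guitar Basics", toy["title"], toy_matrix, n=2)
print(recommendations)
```

With the notebook's data, the equivalent call would presumably be `top_similar('At Home with the Guitar VHS', new_df['title'], TF_matrix2)`.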


# 4. Collaborative Based Recommendations<a class ='anchor' id='3NLP'></a>

In this section, the review text is converted to features and then combined with the product description features. This combination of features allows for user-based recommendations built from similar user reviews and product descriptions.
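
As a rough illustration of that combination, the sketch below joins reviews to product descriptions on `asin`, vectorizes each text source separately, and stacks the two feature blocks side by side. The toy frames and the choice of `scipy.sparse.hstack` are assumptions, not the notebook's confirmed approach. Note also that the review preview shows `asin` values without leading zeros (for example `5019281`) while the metadata shows zero-padded strings, so the real keys may need normalizing before a merge.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for review_df and meta_df (made up for illustration)
reviews = pd.DataFrame({
    "reviewerID": ["u1", "u2", "u3"],
    "asin": ["A1", "A2", "A1"],
    "reviewText": [
        "great holiday classic",
        "boring cooking show",
        "classic story, great acting",
    ],
})
meta = pd.DataFrame({
    "asin": ["A1", "A2"],
    "description_0": ["a christmas story adaptation", "a chef bakes pastry"],
})

# Attach each product's description to its reviews via the shared asin key
merged = reviews.merge(meta, on="asin", how="left")
merged["description_0"] = merged["description_0"].fillna("")

# Vectorize the two text sources separately so each keeps its own vocabulary
review_features = TfidfVectorizer(stop_words="english").fit_transform(merged["reviewText"])
description_features = TfidfVectorizer(stop_words="english").fit_transform(merged["description_0"])

# Stack the feature blocks side by side: one row per review
combined = hstack([review_features, description_features]).tocsr()
print(combined.shape)
```

Keeping two vectorizers means a word such as "classic" in a review and in a description maps to different columns, which is one possible design choice among several.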

### Load in the processed review data

In [43]:
# Load in the data
review_df = pd.read_json('preprocessed_review.json')

# Check the datatypes and null values in the data
review_df.info(show_counts= True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1698253 entries, 0 to 1698252
Data columns (total 15 columns):
 #   Column              Non-Null Count    Dtype 
---  ------              --------------    ----- 
 0   reviewScore         1698253 non-null  int64 
 1   verified            1698253 non-null  int64 
 2   reviewerID          1698253 non-null  object
 3   asin                1698253 non-null  object
 4   reviewText          1698253 non-null  object
 5   summary             1698253 non-null  object
 6   vote                1698253 non-null  int64 
 7   reviewDay           1698253 non-null  int64 
 8   reviewMonth         1698253 non-null  int64 
 9   reviewYear          1698253 non-null  int64 
 10  style_Amazon Video  1698253 non-null  int64 
 11  style_Blu-ray       1698253 non-null  int64 
 12  style_DVD           1698253 non-null  int64 
 13  style_Other         1698253 non-null  int64 
 14  style_VHS Tape      1698253 non-null  int64 
dtypes: int64(11), object(4)
memory u

In [44]:
review_df.head()

Unnamed: 0,reviewScore,verified,reviewerID,asin,reviewText,summary,vote,reviewDay,reviewMonth,reviewYear,style_Amazon Video,style_Blu-ray,style_DVD,style_Other,style_VHS Tape
0,5,1,A1HP3B92A3JDQ1,5019281,Of course it's impossible to separate Henry Wi...,The Fonz as Scrooge,4,2,11,2002,0,0,1,0,0
1,5,0,AZB4CQ9JZSUQB,5019281,"When this first aired in 1979, I enjoyed it so...",A Christmas Carol to be remembered,3,28,1,2002,0,0,1,0,0
2,5,0,A1PXS5N63PS6WR,5019281,I must confess to being a bit of a coinsure of...,Change can be good,2,12,12,2001,0,0,1,0,0
3,3,0,A17TPT3FWAE5T1,5019281,If you already have (and love) the Alistair Si...,An interesting contrast to more traditional ve...,31,11,12,2001,0,0,0,0,1
4,4,0,A3P98J5DZ00A75,5019281,Henry Winkler proves his acting ability in thi...,grey,62,19,10,2001,0,0,1,0,0


### Transform all the review text to a vector

In [None]:
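## Convert the text in the reviewText column to vectors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# X_train and X_test are not defined earlier in this notebook.
# Assumption: take the first 200,000 reviews and hold out 20% of them as a test set.
X_train, X_test = train_test_split(review_df.iloc[:200000], test_size=0.2, random_state=42)

# 1. Instantiate
# Discard English stop words and keep only words that appear in at least 25 reviews
review_wordbank = TfidfVectorizer(stop_words="english", min_df=25)

# 2. Fit on the training reviews only
review_wordbank.fit(X_train['reviewText'])

# 3. Transform
X_train_transformed = review_wordbank.transform(X_train['reviewText'])
X_test_transformed = review_wordbank.transform(X_test['reviewText'])
X_train_transformed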

### Combine with numeric features

### Combine with meta data features based on ASIN

### Use cosine similarity

### Test out recommendation system

A sample test could take a movie review plus its rating as input, feed both into the model, and output the top 10 movies the person may like.

Reviews and movie descriptions together determine which movies to recommend, depending on whether the person rated the movie highly.
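
A minimal sketch of such a test is shown below, under several assumptions: the helper `recommend_from_review`, the toy catalogue, and the rule that only ratings of 4 or higher produce recommendations are all hypothetical. It projects the review text into the same TF-IDF vocabulary as the item descriptions and ranks items by cosine similarity.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue standing in for the product metadata (made up for illustration)
items = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "A4"],
    "title": ["Space Battle", "Galaxy Wars", "Bread Baking", "Pastry Basics"],
    "description_0": [
        "space ships fight in a distant galaxy",
        "galaxy rebels fight an empire in space",
        "bake bread with flour and yeast",
        "pastry dough with flour and butter",
    ],
})
vectorizer = TfidfVectorizer(stop_words="english")
item_matrix = vectorizer.fit_transform(items["description_0"])


def recommend_from_review(review_text, rating, reviewed_asin, n=10, min_rating=4):
    """Recommend items similar to a review the user rated highly (hypothetical helper)."""
    # Assumption: only a high rating counts as evidence of what the user likes
    if rating < min_rating:
        return pd.DataFrame(columns=["title", "similarities"])

    # Project the review into the same vocabulary as the item descriptions
    review_vector = vectorizer.transform([review_text])
    sims = cosine_similarity(review_vector, item_matrix).ravel()

    ranked = items.assign(similarities=sims)
    # Do not recommend the item that was just reviewed
    ranked = ranked[ranked["asin"] != reviewed_asin]
    ranked = ranked.sort_values("similarities", ascending=False).head(n)
    return ranked[["title", "similarities"]]


top = recommend_from_review("loved the space fight scenes", rating=5, reviewed_asin="A1", n=2)
low = recommend_from_review("loved the space fight scenes", rating=2, reviewed_asin="A1")
print(top)
```

How to treat low ratings (ignore them, as here, or use them to push similar items down the ranking) is left open by the notes above.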

# 5. Conclusion and Future Works<a class ='anchor' id='5AdvancedModels'></a>