1. Is there something I can do now to start working on the presentation of my work? I'm not sure what kind of visuals/charts I can make with NLP analysis other than the word count graphs.
2. Do you have any suggestions on how I can structure my capstone better?
3. I also have metadata on the movie reviews, but I'm not sure how to use it.

count, distribution of length of reviews
distribution of review scores
By Genre
product category
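
A minimal sketch of the first two ideas in the list above (review length and review score distributions). The toy rows are made up for illustration, and the column names `reviewText` and `overall` are assumed to match the review data:

```python
import pandas as pd

# Toy stand-in for the review data; the real column names are assumed
# to be `reviewText` and `overall`.
toy_reviews = pd.DataFrame({
    "reviewText": ["Great movie", "Not my thing at all", "Loved it", "Bad"],
    "overall": [5, 2, 5, 1],
})

# Length of each review in whitespace-separated words
toy_reviews["review_length"] = toy_reviews["reviewText"].str.split().str.len()

# Summary of review lengths and number of reviews per score
length_summary = toy_reviews["review_length"].describe()
score_counts = toy_reviews["overall"].value_counts().sort_index()

print(length_summary)
print(score_counts)
```

Both results plot directly: `toy_reviews["review_length"].plot(kind="hist")` for the length distribution and `score_counts.plot(kind="bar")` for the score distribution.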

Metadata: recommendation overlap with the 'also bought' column

What to do with product ids


# 3 Recommendation System

- Author: Jason Truong
- Last Modified: August 21, 2022
- Email: Jasontruong19@gmail.com

# Table of Contents

1. [Objective and Roadmap](#1Objective)  
2. [Preliminary Data Setup](#2Preliminary)  
    2.1. [Preprocessing: 'Overall'](#2_1Overall)  
    2.2. [Preprocessing: 'reviewScore'](#2_2Review)  
    2.3. [Preprocessing: 'Vote'](#2_3Vote)  
    2.4. [Drop duplicates and NaNs](#2_4Drop)  
3. [Test/Train Setup](#4Test_Train)  
4. [NLP Analysis Setup](#3NLP)  
5. [Advanced Models](#5AdvancedModels)  

# 1. Objective and Roadmap<a class ='anchor' id='1Objective'></a>

**Goal #1:** Predict whether a review has a positive or negative sentiment. This prediction is related to predicting the overall review score of the product.  
**Goal #2:** Predict whether a review will receive a high or low number of votes from the community.

NLP Roadmap:
1. Tokenize the review text
2. Remove the unnecessary tokens
3. Create a test train data split
4. See if stemming and lemmatization are needed
5. Create Models and Evaluate performance
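
Steps 1 and 2 of the roadmap can be previewed on a single sentence. This is only a sketch: it assumes scikit-learn's built-in English stop word list is an acceptable definition of "unnecessary tokens", which may not match the final choice.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable CountVectorizer uses internally:
# lowercase the text, tokenize it, then drop English stop words.
analyzer = CountVectorizer(stop_words="english").build_analyzer()

tokens = analyzer("This movie was great")
print(tokens)
```

Inspecting the analyzer output on a few real reviews is a cheap way to decide whether stemming or lemmatization would merge many near-duplicate tokens.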

# 2. Preliminary Data Setup<a class ='anchor' id='2Preliminary'></a>

In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

Load in the dataset

In [2]:
meta_df = pd.read_csv('clean_meta.csv')

In [3]:
meta_df.head()

Unnamed: 0,title,brand,rank,price,asin,description_0,category_1,category_2
0,Understanding Seizures and Epilepsy,,886503,,695009,,Movies,
1,Spirit Led&mdash;Moving By Grace In The Holy S...,,342688,,791156,,Movies,
2,My Fair Pastry (Good Eats Vol. 9),Alton Brown,370026,,143529,Disc 1: Flour Power (Scones; Shortcakes; South...,Movies,
3,"Barefoot Contessa (with Ina Garten), Entertain...",Ina Garten,342914,74.95,143588,Barefoot Contessa Volume 2: On these three dis...,Movies,
4,Rise and Swine (Good Eats Vol. 7),Alton Brown,351684,,143502,Rise and Swine (Good Eats Vol. 7) includes bon...,Movies,


In [None]:
meta_df['category_2'].value_counts()

In [None]:
working_df = meta_df[['title','description_0']].copy()

In [None]:
working_df['description_0'] = working_df['description_0'].fillna("")

In [None]:
new_df = working_df.iloc[0:100000,:]

In [None]:
new_df

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english', min_df = 10)

vectorizer.fit(new_df['description_0'])

TF_matrix2 = vectorizer.transform(new_df['description_0'])

In [None]:
TF_matrix2.shape

In [None]:
1356*181552

In [None]:
42278**2

In [None]:
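from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all descriptions, kept sparse to save memory
mov_similarities = cosine_similarity(TF_matrix2, dense_output = False)

In [None]:
# Row label of the query title (assumes the default integer index, so labels equal row positions)
movie_index = new_df[new_df['title'] =='Mitr: My Friend'].index

sim_df = pd.DataFrame({'item':new_df['title'], 
                       'similarities': np.array(mov_similarities[movie_index,:].todense()).squeeze()})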

In [None]:
sim_df.sort_values(by = 'similarities', ascending = False).head(10)
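
The lookup above can be wrapped in a small helper. The sketch below is an illustration only: `most_similar_titles` and the toy rows are hypothetical, and it compares just the query row against all rows instead of building the full pairwise matrix.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def most_similar_titles(df, title, top_n=3):
    """Rank titles by TF-IDF cosine similarity of their descriptions."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(df["description_0"])
    # Positional row of the first title match (assumes the title exists)
    position = int(np.flatnonzero((df["title"] == title).to_numpy())[0])
    # Compare only the query row against all rows, not all pairs
    scores = cosine_similarity(tfidf[position], tfidf).ravel()
    ranked = pd.DataFrame({"item": df["title"].to_numpy(), "similarities": scores})
    # Drop the query title itself before ranking
    ranked = ranked.drop(index=position)
    return ranked.sort_values("similarities", ascending=False).head(top_n)


toy_meta = pd.DataFrame({
    "title": ["Baking Basics", "Pastry Class", "Space Wars"],
    "description_0": [
        "baking bread and pastry dough",
        "pastry dough and cream fillings",
        "starship battles across the galaxy",
    ],
})
print(most_similar_titles(toy_meta, "Baking Basics", top_n=2))
```

Scoring one row at a time keeps memory proportional to the number of titles, which matters when the full similarity matrix would have billions of entries.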

The first step is to focus only on the sentiment of the review text to try to predict the overall rating as well as the vote count, so every other column will be dropped. A separate analysis will include the summary text if time permits.

In [None]:
# Only keep the necessary columns
review_df = review_df[['overall', 'vote','reviewText','reviewerID','asin']]
review_df

In [None]:
review_df.info()

A sample test could be a movie review plus its rating: feed both into the model and output the top 10 movies the person may like.

Use reviews and movie descriptions to decide which movies to recommend, based on whether the person rated the movie highly.

# FINAL OBJECTIVE

Build grid search 
Make models notebook with just numeric data
NLP notebook models with text analysis

Recommendation


### Load in the processed review data

### Transform all the review text to a vector

### Combine with numeric features

### Combine with meta data features based on ASIN

### Use cosine similarity

### Test out recommendation system
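
The outline above (vectorize the text, combine it with numeric features, compare with cosine similarity) might look like the sketch below. The toy texts and scores are hypothetical, and scaling the numeric column with `MinMaxScaler` is an assumption, not a decision taken in this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: review text plus one numeric feature per review
texts = ["funny family comedy", "dark crime thriller", "family comedy with heart"]
scores = np.array([[5.0], [2.0], [4.0]])

# Text -> sparse TF-IDF vectors
text_vectors = TfidfVectorizer().fit_transform(texts)

# Scale numeric features to [0, 1] so they are comparable to TF-IDF weights
numeric_vectors = csr_matrix(MinMaxScaler().fit_transform(scores))

# Stack text and numeric columns side by side, keeping the result sparse
combined = hstack([text_vectors, numeric_vectors]).tocsr()

# Pairwise cosine similarity between all reviews
similarity = cosine_similarity(combined)
print(similarity.round(2))
```

Metadata features joined on `asin` could be stacked the same way, as additional columns passed to `hstack`.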

In [None]:
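from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Pipeline steps: scale the features, then fit a support vector classifier
estimators = [('normalize', StandardScaler()),
             ('svm', SVC())]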

Use grid search to determine the best model and hyperparameter
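
A minimal sketch of such a search, using synthetic data in place of the vectorized reviews. The candidate models and the `C` values are placeholders chosen for illustration, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized reviews
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([("normalize", StandardScaler()), ("model", SVC())])

# Each dict is one family of candidates; "model" swaps the estimator itself
param_grid = [
    {"model": [SVC()], "model__C": [0.1, 1, 10]},
    {"model": [LogisticRegression(max_iter=1000)], "model__C": [0.1, 1, 10]},
]

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_toy, y_toy)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

Because the scaler sits inside the pipeline, it is refit on each cross-validation training fold, so no information from the held-out fold leaks into the scaling.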

In [None]:
#review_transformed.toarray().sum(axis=0)

In [None]:
from sklearn.svm import SVC

# Instantiate
svm_model = SVC(kernel='rbf')

# Fit the model
svm_model.fit(X_train_transformed,y_train)

# Score the model
print(f"Train score: {svm_model.score(X_train_transformed,y_train)}")
print(f"Test score: {svm_model.score(X_test_transformed,y_test)}")

## Support Vector Machine Classification

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Instantiate
decisiontree_model = DecisionTreeClassifier(max_depth = 25)

# Fit the model
decisiontree_model.fit(X_train_transformed,y_train)

# Score the model
print(f"Train score: {decisiontree_model.score(X_train_transformed,y_train)}")
print(f"Test score: {decisiontree_model.score(X_test_transformed,y_test)}")

## Decision Tree Classification

In [None]:
# Instantiate
logreg = LogisticRegression(C = 0.1)

# Fit the model
logreg.fit(X_train_scaled,y_train)

# Score the model
print(f"Train score: {logreg.score(X_train_scaled,y_train)}")
print(f"Test score: {logreg.score(X_test_scaled,y_test)}")

## Logistic Regression

In [None]:
# from sklearn.preprocessing import StandardScaler

# # Instantiate
# standscaler = StandardScaler()
# standscaler.fit(X_train_transformed)

# X_train_scaled = standscaler.transform(X_train_transformed)

**REMEMBER TO SCALE THE DATA**
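
One way to scale vectorized text without converting it to a dense array is sketched below. It assumes the features are a SciPy sparse matrix; the small matrices are toy stand-ins.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

# Toy sparse count matrices standing in for the vectorized reviews
X_train_toy = csr_matrix(np.array([[1, 0, 2], [0, 3, 0], [4, 0, 0]], dtype=float))
X_test_toy = csr_matrix(np.array([[2, 1, 0]], dtype=float))

# with_mean=False skips centering, which would destroy sparsity
scaler = StandardScaler(with_mean=False)

# Fit on the training data only, then apply the same scaling to both sets
X_train_toy_scaled = scaler.fit_transform(X_train_toy)
X_test_toy_scaled = scaler.transform(X_test_toy)

print(X_train_toy_scaled.toarray().std(axis=0))
```

Fitting the scaler on the training set alone keeps the test set unseen until scoring.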

# 5. Advanced Models <a class ='anchor' id='5AdvancedModels'></a>

In [None]:
X_train_transformed.shape

In [None]:
X_train_transformed = X_train_transformed.toarray()
X_test_transformed = X_test_transformed.toarray()

In [None]:
word_counts = pd.DataFrame({"counts":X_train_transformed.toarray().sum(axis=0)},
                          index = review_wordbank.get_feature_names_out()
                          ).sort_values("counts",ascending= False)

word_counts.head(20).plot(kind="bar",figsize=(15,5), legend = False)

plt.show()

In [None]:
prelim_df = pd.DataFrame(columns = review_wordbank.get_feature_names_out(),data = X_train_transformed.toarray())
display(prelim_df)

In [None]:
X_train_transformed.toarray().sum(axis=0)

After preliminary vectorization with `CountVectorizer()`, the 66915 review rows produced 15623 unique terms (tokens).

In [None]:
# Instantiate
# Discard stop words and keep only words that appear in at least 100 reviews
review_wordbank = CountVectorizer(stop_words = "english", min_df = 100)

# Fit on the training reviews only
review_wordbank.fit(X_train['reviewText'])

# Transform
X_train_transformed = review_wordbank.transform(X_train['reviewText'])
X_validation_transformed = review_wordbank.transform(X_validation['reviewText'])
X_test_transformed = review_wordbank.transform(X_test['reviewText'])
X_train_transformed

## Convert the text in the reviewText column to vectors

In [None]:
X_train.shape

# 4. Set up NLP analysis <a class ='anchor' id='3NLP'></a>

The test data is 82.1% positive reviews, so it is highly skewed towards the positive class. A model that predicted positive for every review would be correct 82.1% of the time, which is the baseline accuracy any real model has to beat.
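
The always-positive baseline can be computed explicitly. The sketch below uses toy labels with an 80/20 imbalance, not the actual review labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with a similar kind of imbalance: 8 positive, 2 negative
y_toy = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

print(baseline_accuracy)
```

With classes this skewed, accuracy alone can hide a model that never predicts the negative class, so a confusion matrix or per-class recall is worth reporting alongside it.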

In [None]:
y_test.value_counts()/y_test.shape[0]

The training data is also 82.1% positive reviews, so it is highly skewed towards the positive class. A model that predicted positive for every review would be correct 82.1% of the time.

In [None]:
y_train_val.value_counts()/y_train_val.shape[0]

Check the split of the data for the train and test set

In [None]:
#Set up data for training, validation and testing
X = subsample.drop(columns = ['review_class', 'reviewScore'])
y = subsample['review_class']

# Stratify keeps the class proportions the same in the train and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(X,y, test_size = 0.25, stratify = y)
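
The later cells use `X_train` and `X_validation`, which implies a second split of the train/validation portion. A possible version is sketched below on toy data; the 80/20 ratio and the `random_state` value are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 positive and 20 negative labels
X_toy = pd.DataFrame({"reviewText": [f"review {i}" for i in range(100)]})
y_toy = pd.Series([1] * 80 + [0] * 20)

# First split: hold out 25% as the test set
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# Second split (assumed 80/20): carve a validation set out of the remainder
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train_val, y_train_val, test_size=0.2, stratify=y_train_val, random_state=42
)

print(len(X_train), len(X_validation), len(X_test))
```

Stratifying both splits keeps the share of positive reviews the same in all three sets.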


In [None]:
subsample = review_df.sample(frac = 0.05)

# Check results
subsample

In [None]:
review_df.shape

Since the dataset has about 2,000,000 rows, a smaller subset will be sampled for the NLP analysis.
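
If the subsample needs to be repeatable between runs, a fixed seed can be passed to `sample`. The sketch below uses a toy dataframe, and the seed value 42 is arbitrary:

```python
import pandas as pd

toy_reviews = pd.DataFrame({"reviewText": [f"review {i}" for i in range(1000)]})

# random_state makes the 5% sample repeatable between runs
sample_a = toy_reviews.sample(frac=0.05, random_state=42)
sample_b = toy_reviews.sample(frac=0.05, random_state=42)

print(len(sample_a), sample_a.index.equals(sample_b.index))
```

Without a seed, every rerun of the notebook draws different rows, so reported counts and scores will shift slightly.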

# 3. Set up Train/Validation/Test split <a class ='anchor' id='4Test_Train'></a>

The null values have been dropped: the dataframe now contains only 1997484 entries.

In [None]:
review_df.dropna(inplace = True)
review_df.reset_index(drop=True, inplace= True)
review_df.info(show_counts= True)

There are 1497 NaNs in the `reviewText` column. Since the NLP model depends on `reviewText`, those rows will be dropped.

In [None]:
review_df.isna().sum()

Check the number of NaN values in the dataframe

### Remove any NaNs in the dataframe

1365 duplicate entries have been dropped.

In [None]:
review_df.drop_duplicates(inplace = True, ignore_index = True)
review_df.info()

The votes have now all been converted to numbers and the datatype can now be changed.

## 2.4 Drop any duplicates and NaNs in the dataframe <a class ='anchor' id='2_4Drop'></a>

In [None]:
review_df['vote'] = review_df['vote'].astype('int32')
review_df.info(show_counts= True)

In [None]:
review_df['vote'] = review_df['vote'].str.replace(r"\,","",regex = True)
review_df['vote'] = review_df['vote'].fillna(0)
review_df.head()

For the `vote` column, a NaN value means the review received no votes, so NaN values will be replaced with 0. Some vote counts also contain commas, which cause problems when converting to an int, so the commas will be removed.
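
The cleaning described above, sketched on a toy series. The sample values are made up, and the steps are chained here only for compactness:

```python
import numpy as np
import pandas as pd

# Toy `vote` values as they might appear in the raw data
toy_votes = pd.Series(["1,234", np.nan, "5"], name="vote")

cleaned_votes = (
    toy_votes.str.replace(",", "", regex=False)  # "1,234" -> "1234"
    .fillna("0")                                  # missing vote -> zero votes
    .astype("int32")
)

print(cleaned_votes.tolist())
```

Passing `regex=False` treats the comma as a literal character, so no escaping is needed.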

## 2.3 Check the datatype in the column `Vote`<a class ='anchor' id='2_3Vote'></a>

This new `review_class` column will be used for the logistic regression with the sentiment from the `reviewText`

In [None]:
review_df.head()

In [None]:
review_df['review_class'] = np.where(review_df['reviewScore']>=4,1,0)

Split the `reviewScore` column into 'Good' (value of 1) for reviews scored 4 or 5 and 'Bad' (value of 0) for reviews scored 1, 2 or 3.

## 2.2 Check the datatype in the column `reviewScore` <a class ='anchor' id='2_2Review'></a>

In [None]:
review_df['overall'] = review_df['overall'].astype('int8')
review_df.rename(columns={'overall':'reviewScore'}, inplace = True)
review_df

The values in the `overall` column all fall between 1 and 5, which makes sense since it's a review score out of 5. They are all integers, so the datatype can be changed to int8. This column represents the review score, so it will be renamed for clarity.
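
The two conditions for a safe downcast (whole numbers, within the int8 range) can also be checked programmatically instead of by eye. A sketch on toy values:

```python
import numpy as np
import pandas as pd

toy_overall = pd.Series([5.0, 4.0, 1.0, 3.0], name="overall")

# Confirm every value is a whole number inside the int8 range before casting
is_whole = bool((toy_overall % 1 == 0).all())
int8_limits = np.iinfo(np.int8)
in_range = bool(toy_overall.between(int8_limits.min, int8_limits.max).all())

if is_whole and in_range:
    toy_overall = toy_overall.astype("int8")

print(toy_overall.dtype)
```

On a column with about 2,000,000 rows, int8 uses one byte per value instead of the eight used by the default float64 or int64.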

In [None]:
review_df['overall'].value_counts()

## 2.1. Check the datatype in the column `overall`<a class ='anchor' id='2_1Overall'></a> 