# 2 Modelling without Text Data

- Author: Jason Truong
- Last Modified: August 26, 2022
- Email: Jasontruong19@gmail.com

# Table of Contents

1. [Objective and Roadmap](#1Objective)  
2. [Preliminary Data Setup](#2Preliminary)  
    2.1. [Preprocessing: 'Overall'](#2_1Overall)  
    2.2. [Preprocessing: 'reviewScore'](#2_2Review)  
    2.3. [Preprocessing: 'Vote'](#2_3Vote)  
    2.4. [Drop duplicates and NaNs](#2_4Drop)  
3. [Test/Train Setup](#4Test_Train)  
4. [NLP Analysis Setup](#3NLP)  
5. [Advanced Models](#5AdvancedModels)  

# 1. Objective and Roadmap<a class ='anchor' id='1Objective'></a>


**Goal 1:** Preprocess and finish cleaning the review data   
**Goal 2:** Simple exploratory data analysis (EDA) and modelling

Data analysis roadmap:
1. Load in the data
2. Clean data
    - Check for nulls
    - Unpack any
3. Preprocessing
4. EDA

# 2. Preliminary Data Setup<a class ='anchor' id='2Preliminary'></a>

In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

Load in the dataset

In [2]:
review_df = pd.read_json('numeric_review.json')

In [3]:
review_df.head()

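| Unnamed: 0 | reviewScore | verified | vote | reviewDay | reviewMonth | reviewYear | style_Amazon Video | style_Blu-ray | style_DVD | style_Other | style_VHS Tape | reviewer_ID | itemID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 0 | 11 | 3 | 2013 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 5 | 1 | 3 | 18 | 2 | 2013 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 5 | 0 | 0 | 17 | 1 | 2013 | 1 | 0 | 0 | 0 | 0 | 2 | 0 |
| 3 | 5 | 1 | 0 | 10 | 1 | 2013 | 1 | 0 | 0 | 0 | 0 | 3 | 0 |
| 4 | 4 | 1 | 0 | 26 | 12 | 2012 | 1 | 0 | 0 | 0 | 0 | 4 | 0 |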


In [4]:
# Binarize the target: scores of 4 or 5 become 1, scores of 1 to 3 become 0
review_df['reviewScore'] = np.where(review_df['reviewScore'] >= 4, 1, 0)
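
To make the effect of the binarization step concrete, here is a minimal sketch on a toy series. The values in `toy_scores` are hypothetical and are not taken from the review dataset; only the `np.where` call mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical star ratings, one of each possible value
toy_scores = pd.Series([1, 2, 3, 4, 5], name="reviewScore")

# Same rule as in the notebook: 4 and 5 map to 1, everything else maps to 0
toy_binary = np.where(toy_scores >= 4, 1, 0)

print(toy_binary.tolist())  # [0, 0, 0, 1, 1]
```

The result is a plain NumPy array, which pandas accepts when it is assigned back to the column.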

In [5]:
review_df['itemID'].value_counts()

12517    24478
7826     16661
10567    10051
8021      6709
425       6385
         ...  
4914         1
6749         1
4033         1
10418        1
11056        1
Name: itemID, Length: 15434, dtype: int64

**Setup train and test split**

In [None]:
# The prediction is for the reviewScore
X = review_df.drop(columns = 'reviewScore')
y = review_df['reviewScore']


X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, stratify = y)
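
The `stratify = y` argument keeps the class balance of the target roughly the same in both splits. The sketch below illustrates this on a hypothetical imbalanced target (`toy_y`). The `random_state` argument is an added assumption to make the toy split repeatable; the notebook's own split does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 positives and 20 negatives
toy_y = np.array([1] * 80 + [0] * 20)
toy_X = np.arange(100).reshape(-1, 1)

# Same test_size and stratification as the notebook; random_state is an assumption
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)

# Share of positives in each split; both should sit close to the overall 0.8
print(toy_y_train.mean(), toy_y_test.mean())
```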

**Scale the data**

In [None]:
from sklearn.preprocessing import StandardScaler

# Instantiate Scaler
ss = StandardScaler()

# Fit the Scaler
ss.fit(X_train)

# Transform
X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)
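
The scaler is fitted on the training rows only, and the test rows are transformed with the training mean and standard deviation. A minimal sketch of that behaviour, using hypothetical toy arrays (`toy_train`, `toy_test`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices with two columns on different scales
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[4.0, 40.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)  # learns the per-column mean and std from the training rows

toy_train_ss = toy_scaler.transform(toy_train)
toy_test_ss = toy_scaler.transform(toy_test)  # reuses the training statistics

print(toy_train_ss.mean(axis=0))  # approximately zero per column
print(toy_test_ss)                # not centred: the test rows were not used to fit
```

Fitting on the training set alone keeps information from the test set out of the preprocessing step.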

In [None]:
from sklearn.linear_model import LogisticRegression

# Instantiate
logreg = LogisticRegression(C = 1)

# Fit the model
logreg.fit(X_train_ss,y_train)

# Score the model
print(f"Train score: {logreg.score(X_train_ss,y_train)}")
print(f"Test score: {logreg.score(X_test_ss,y_test)}")

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Instantiate
decisiontree_model = DecisionTreeClassifier()

# Fit the model
decisiontree_model.fit(X_train_ss,y_train)

# Score the model
print(f"Train score: {decisiontree_model.score(X_train_ss,y_train)}")
print(f"Test score: {decisiontree_model.score(X_test_ss,y_test)}")

**Create ML Pipelines to determine the best model** 

Set up caching for the pipeline.

In [None]:
from tempfile import mkdtemp
cachedir = mkdtemp()
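
`mkdtemp()` creates a directory that is not removed automatically. The sketch below shows one way to pair the cache with a cleanup step once fitting is finished. The toy data from `make_classification` and all `toy_` names are assumptions for illustration, and the notebook itself does not delete its cache.

```python
import os
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Temporary directory where the pipeline caches fitted transformers
toy_cachedir = mkdtemp()

# Hypothetical toy data standing in for the review features
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipe = Pipeline(
    [("normalize", StandardScaler()), ("model", LogisticRegression())],
    memory=toy_cachedir,
)
toy_pipe.fit(toy_X, toy_y)
toy_score = toy_pipe.score(toy_X, toy_y)

# Remove the cache directory once it is no longer needed
rmtree(toy_cachedir)
print(os.path.exists(toy_cachedir))  # False
```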

## Logistic Regression Hyperparameter Tuning

In [None]:
from sklearn.pipeline import Pipeline

# Instantiate pipeline settings
estimators = [('normalize', StandardScaler()),
             ('model', LogisticRegression())]

# Instantiate pipeline model
pipeline_model = Pipeline(estimators, memory = cachedir)

In [None]:
from sklearn.model_selection import GridSearchCV

# Set up parameters for the pipeline
logit_param_grid = [
    
    {'normalize': [None, StandardScaler()],
     'model__solver': ['lbfgs', 'liblinear'],
     'model__penalty': ['l2'],
     'model__C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}
]

# Instantiate grid search
logit_gsearch = GridSearchCV(estimator=pipeline_model, param_grid=logit_param_grid, cv = 5, verbose = 1, n_jobs = -1)
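
Before launching the search, it can help to estimate how many models will be trained. The sketch below counts the candidates in the grid above with `ParameterGrid` and multiplies by the number of folds. The names `grid_copy`, `n_candidates` and `n_fits` are introduced here for illustration, and the count excludes the final refit on the full training set.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler

# Copy of the logistic regression grid defined above
grid_copy = [
    {'normalize': [None, StandardScaler()],
     'model__solver': ['lbfgs', 'liblinear'],
     'model__penalty': ['l2'],
     'model__C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}
]

# 2 scaler options * 2 solvers * 1 penalty * 9 values of C
n_candidates = len(ParameterGrid(grid_copy))

# Each candidate is fitted once per fold (cv = 5)
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 36 180
```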

Fit the grid search with 5-fold cross-validation.

In [None]:
fit_logit_grid = logit_gsearch.fit(X_train,y_train)

In [None]:
fit_logit_grid.best_params_
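
Besides `best_params_`, a fitted grid search exposes the per-candidate cross-validation scores in `cv_results_`. The sketch below shows one way to tabulate them. It runs on hypothetical toy data from `make_classification` with a reduced grid, so the numbers say nothing about the review dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data and a small grid, chosen only to keep the example fast
toy_X, toy_y = make_classification(n_samples=300, n_features=6, random_state=0)
toy_pipe = Pipeline([('normalize', StandardScaler()),
                     ('model', LogisticRegression())])
toy_search = GridSearchCV(toy_pipe, {'model__C': [0.01, 1, 100]}, cv=5)
toy_search.fit(toy_X, toy_y)

# One row per candidate, best-ranked first
results = pd.DataFrame(toy_search.cv_results_)
summary = results[['param_model__C', 'mean_test_score',
                   'std_test_score', 'rank_test_score']].sort_values('rank_test_score')
print(summary)
```

Comparing `mean_test_score` with `std_test_score` shows whether the top candidates are clearly separated or differ by less than the fold-to-fold variation.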

In [None]:
logit_gsearch.score(X_train,y_train)

In [None]:
logit_gsearch.score(X_test,y_test)

## Decision Tree Hyperparameter Tuning

In [None]:
# Instantiate pipeline settings
tree_estimators = [('model', DecisionTreeClassifier())]

# Instantiate pipeline model
tree_pipeline_model = Pipeline(tree_estimators, memory = cachedir)

In [None]:
# Set up parameters for the pipeline
tree_param_grid = [
    
    {'model__max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
     'model__min_samples_leaf': [2, 4, 6, 8, 10]}
]

# Instantiate grid search
tree_gsearch = GridSearchCV(estimator=tree_pipeline_model, param_grid=tree_param_grid, cv = 5, verbose = 1, n_jobs = -1)

Fit the grid search with 5-fold cross-validation.

In [None]:
fit_tree_grid = tree_gsearch.fit(X_train,y_train)

In [None]:
fit_tree_grid.best_params_

In [None]:
# score() returns a fraction in [0, 1]; the :.2% format prints it as a percentage
print(f'Decision Tree model train set accuracy: {tree_gsearch.score(X_train, y_train):.2%}')
print(f'Decision Tree model test set accuracy: {tree_gsearch.score(X_test, y_test):.2%}')

## Support Vector Machine Hyperparameter Tuning

In [None]:
from sklearn.svm import LinearSVC

# Instantiate pipeline settings
svm_estimators = [('normalize', StandardScaler()),
                  ('model', LinearSVC())]

# Instantiate pipeline model
svm_pipeline_model = Pipeline(svm_estimators, memory = cachedir)

In [None]:
# Set up parameters for the pipeline
svm_param_grid = [
    
    {'model': [LinearSVC()],
     'model__C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000],
     'model__max_iter': [1000]}
]

# Instantiate grid search
svm_gsearch = GridSearchCV(estimator=svm_pipeline_model, param_grid=svm_param_grid, cv = 5, verbose = 4, n_jobs = 1)

Fit the grid search with 5-fold cross-validation.

In [None]:
fit_svm_grid = svm_gsearch.fit(X_train,y_train)

In [None]:
fit_svm_grid.best_params_

In [None]:
# score() returns a fraction in [0, 1]; the :.2% format prints it as a percentage
print(f'SVM model train set accuracy: {svm_gsearch.score(X_train, y_train):.2%}')
print(f'SVM model test set accuracy: {svm_gsearch.score(X_test, y_test):.2%}')