# 2 Modelling without Text Data

- Author: Jason Truong
- Last Modified: September 19, 2022
- Email: Jasontruong19@gmail.com

# Table of Contents

1. **[Objective](#1Objective)**  
2. **[Preliminary Data Setup](#2Preliminary)**  
    2.1 [Set up Train/Test Split](#2.1traintest)  
    2.2 [Scale data](#2.2scaledata)  
3. **[Logistic Regression](#3logistic)**  
    3.1 [Logistic model](#3.1logitmodel)  
    3.2 [Logistic Hyperparameter tuning](#3.2logit_tuning)  
4. **[Decision Tree](#4decisiontree)**  
    4.1 [Decision Tree model](#4.1dtmodel)  
    4.2 [Decision Tree Hyperparameter tuning](#4.2dt_tuning)  
5. **[XGBoost](#5xgboost)**  
    5.1 [XGBoost model](#5.1xgboost_model)  
    5.2 [XGBoost Hyperparameter tuning](#5.2_xgboost_tuning)  
6. **[Conclusion](#6conclusion)**


# 1. Objective and Roadmap<a class ='anchor' id='1Objective'></a>


**Goal 1:** Preprocess and finish cleaning the review data   
**Goal 2:** Simple exploratory data analysis and modelling


# 2. Preliminary Data Setup<a class ='anchor' id='2Preliminary'></a>

In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

Load the dataset.

In [2]:
review_df = pd.read_json('numeric_review.json')

In [3]:
review_df.head()

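| Unnamed: 0 | reviewScore | verified | vote | reviewDay | reviewMonth | reviewYear | style_Amazon Video | style_Blu-ray | style_DVD | style_Other | style_VHS Tape | reviewer_ID | itemID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 4 | 2 | 11 | 2002 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 5 | 0 | 3 | 28 | 1 | 2002 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 5 | 0 | 2 | 12 | 12 | 2001 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 3 | 3 | 0 | 31 | 11 | 12 | 2001 | 0 | 0 | 0 | 0 | 1 | 3 | 0 |
| 4 | 4 | 0 | 62 | 19 | 10 | 2001 | 0 | 0 | 1 | 0 | 0 | 4 | 0 |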


In [4]:
review_df['reviewClass'] = np.where(review_df['reviewScore']>=4,1,0)
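
The thresholding rule in the cell above can be checked on a small frame. The following is a minimal sketch, assuming a made-up `toy_df` (not part of the dataset) with hypothetical scores; it applies the same `np.where` rule and prints the share of each class, which is a quick way to see how imbalanced a binary target is.

```python
import numpy as np
import pandas as pd

# Toy stand-in for review_df; these scores are invented for illustration only
toy_df = pd.DataFrame({'reviewScore': [5, 4, 3, 2, 1, 4]})

# Same rule as the notebook cell: scores of 4 or more map to class 1, the rest to class 0
toy_df['reviewClass'] = np.where(toy_df['reviewScore'] >= 4, 1, 0)

# Share of each class in the toy frame
class_share = toy_df['reviewClass'].value_counts(normalize=True)
print(class_share)
```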

In [5]:
review_df['itemID'].value_counts()

3279    24436
2194    16643
2811    10032
2248     6695
258      6379
        ...  
2850      100
2326      100
2570      100
1222      100
3054      100
Name: itemID, Length: 3744, dtype: int64

## 2.1 Set up train and test split<a class ='anchor' id='2.1traintest'></a>

In [6]:
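# The prediction target is reviewScore
# Note: reviewClass (derived from reviewScore in In [4]) is still a column of X here
X = review_df.drop(columns = 'reviewScore')
y = review_df['reviewScore']

# Hold out 25% of the rows for testing, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, stratify = y)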
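
To illustrate what `stratify` does in `train_test_split`, here is a small sketch on a hypothetical imbalanced target (`X_toy` and `y_toy` are invented, and `random_state=0` is added only so the example is repeatable). With stratification, the class proportions in the train and test parts match those of the full target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 ones and 20 zeros
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# Same test_size and stratification as the notebook cell; random_state is an addition
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# Fraction of class 1 in each part
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```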

## 2.2 Scale the data<a class ='anchor' id='2.2scaledata'></a>

In [7]:
from sklearn.preprocessing import StandardScaler

# Instantiate Scaler
ss = StandardScaler()

# Fit the Scaler
ss.fit(X_train)

# Transform
X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)
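
The fit-on-train, transform-both pattern above can be seen on synthetic numbers. This is a sketch under assumed data (`train_toy` and `test_toy` are randomly generated, not the review features): the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrices standing in for X_train and X_test
rng = np.random.default_rng(0)
train_toy = rng.normal(loc=50, scale=10, size=(200, 3))
test_toy = rng.normal(loc=50, scale=10, size=(50, 3))

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(train_toy)

# Apply the same training statistics to both sets
train_scaled = scaler.transform(train_toy)
test_scaled = scaler.transform(test_toy)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```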

# 3. Logistic Regression<a class ='anchor' id='3logistic'></a>

## 3.1 Logistic Model<a class ='anchor' id='3.1logitmodel'></a>

In [8]:
from sklearn.linear_model import LogisticRegression

# Instantiate
logreg = LogisticRegression(C = 1)

# Fit the model
logreg.fit(X_train_ss,y_train)

# Score the model
print(f"Train score: {logreg.score(X_train_ss,y_train)}")
print(f"Test score: {logreg.score(X_test_ss,y_test)}")

Train score: 0.8349934717187634
Test score: 0.8350307609688998
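
For a scikit-learn classifier, `.score` returns the mean accuracy, so the number is easier to interpret next to a majority-class baseline. The sketch below is not part of the original analysis: it uses synthetic data from `make_classification` with an assumed 20/80 class split, and compares `LogisticRegression` against `DummyClassifier(strategy='most_frequent')`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the review features (an assumption)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, weights=[0.2, 0.8], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)

print(f"Baseline test accuracy: {baseline_acc}")
print(f"Logistic test accuracy: {model_acc}")
```

If a model's accuracy sits close to the baseline, the features are adding little beyond the class frequencies.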


## 3.2 Logistic Hyperparameter Tuning <a class ='anchor' id='3.2logit_tuning'></a>

**Create an ML pipeline to determine the best hyperparameters**

In [10]:
# Set up caching for the pipeline.
from tempfile import mkdtemp
cachedir = mkdtemp()

In [11]:
from sklearn.pipeline import Pipeline

# Instantiate pipeline settings
estimators = [('normalize', StandardScaler()),
             ('model', LogisticRegression())]

# Instantiate pipeline model
pipeline_model = Pipeline(estimators, memory = cachedir)

In [12]:
from sklearn.model_selection import GridSearchCV

# Set up parameters for the pipeline
logit_param_grid = [
    
    {'normalize': [None, StandardScaler()],
     'model__solver': ['lbfgs', 'liblinear'],
     'model__penalty': ['l2'],
     'model__C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}
]

# Instantiate grid search
logit_gsearch = GridSearchCV(estimator=pipeline_model, param_grid=logit_param_grid, cv = 5, verbose = 1, n_jobs = -1)

Fit the grid search with 5-fold cross-validation.

In [13]:
fit_logit_grid = logit_gsearch.fit(X_train,y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
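
The candidate count reported above follows from the grid itself: 2 scaler options × 2 solvers × 1 penalty × 9 values of `C` gives 36 combinations, and 5 folds each gives 180 fits. A short sketch using scikit-learn's `ParameterGrid` reproduces the arithmetic (the variable names `grid`, `n_candidates`, and `n_fits` are illustrative).

```python
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler

# Same grid as logit_param_grid in the notebook
grid = [
    {'normalize': [None, StandardScaler()],
     'model__solver': ['lbfgs', 'liblinear'],
     'model__penalty': ['l2'],
     'model__C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}
]

# Number of hyperparameter combinations, and total fits with cv = 5
n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 36 180
```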


In [14]:
fit_logit_grid.best_params_

{'model__C': 0.0001,
 'model__penalty': 'l2',
 'model__solver': 'lbfgs',
 'normalize': None}
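
Beyond `best_params_`, a fitted `GridSearchCV` exposes `cv_results_`, which shows how close the other candidates were. The following is a minimal sketch on synthetic data with a reduced, hypothetical grid (three values of `C` only), not the search run above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small grid, both assumptions for illustration
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
pipe = Pipeline([('normalize', StandardScaler()),
                 ('model', LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {'model__C': [0.01, 1, 100]}, cv=5).fit(X_syn, y_syn)

# One row per candidate, best-ranked first
columns = ['param_model__C', 'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')
print(results)
```

Comparing `mean_test_score` against `std_test_score` across rows indicates whether the best candidate is meaningfully better or within fold-to-fold noise.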

In [15]:
logit_gsearch.score(X_train,y_train)

0.8351465703166158

In [16]:
logit_gsearch.score(X_test,y_test)

0.8351461734862118

# 4. Decision Tree<a class ='anchor' id='4decisiontree'></a>

## 4.1 Decision Tree Model<a class ='anchor' id='4.1dtmodel'></a>

In [9]:
from sklearn.tree import DecisionTreeClassifier

# Instantiate
decisiontree_model = DecisionTreeClassifier()

# Fit the model
decisiontree_model.fit(X_train_ss,y_train)

# Score the model
print(f"Train score: {decisiontree_model.score(X_train_ss,y_train)}")
print(f"Test score: {decisiontree_model.score(X_test_ss,y_test)}")

Train score: 0.9999450415289761
Test score: 0.7616519535335073
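
The gap between the train and test scores above is the usual sign of an unconstrained tree memorizing its training data. As a sketch on synthetic, deliberately noisy data (an assumption, generated with `make_classification` and `flip_y=0.2`), the example below compares a fully grown tree with one limited by `max_depth` and `min_samples_leaf`; the specific limits are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y randomly reassigns a share of the labels
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# Fully grown tree versus a depth- and leaf-size-limited tree
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=10, random_state=0
).fit(X_tr, y_tr)

for name, tree in [('deep', deep_tree), ('shallow', shallow_tree)]:
    print(f"{name}: depth={tree.get_depth()}, "
          f"train={tree.score(X_tr, y_tr)}, test={tree.score(X_te, y_te)}")
```

The fully grown tree fits its training rows perfectly, while the limited tree typically shows a much smaller train/test gap.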


## 4.2 Decision Tree Hyperparameter Tuning<a class ='anchor' id='4.2dt_tuning'></a>

In [17]:
# Instantiate pipeline settings
tree_estimators = [('model', DecisionTreeClassifier())]

# Instantiate pipeline model
tree_pipeline_model = Pipeline(tree_estimators, memory = cachedir)

In [18]:
# Set up parameters for the pipeline
tree_param_grid = [
    
    {'model__max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
     'model__min_samples_leaf': [2, 4, 6, 8, 10]}
]

# Instantiate grid search
tree_gsearch = GridSearchCV(estimator=tree_pipeline_model, param_grid=tree_param_grid, cv = 5, verbose = 1, n_jobs = -1)

Fit the grid search with 5-fold cross-validation.

In [19]:
fit_tree_grid = tree_gsearch.fit(X_train,y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


In [20]:
# Determine the best decision tree hyperparameters
fit_tree_grid.best_params_

{'model__max_depth': 12, 'model__min_samples_leaf': 10}

In [21]:
print(f'Decision Tree model train set accuracy: {tree_gsearch.score(X_train,y_train)}')
print(f'Decision Tree model test set accuracy: {tree_gsearch.score(X_test,y_test)}')

Decision Tree model train set accuracy: 0.8406047316103068
Decision Tree model test set accuracy: 0.8373861184650606


# 5. XGBoost<a class ='anchor' id='5xgboost'></a>

## 5.1 XGBoost Model<a class ='anchor' id='5.1xgboost_model'></a>

In [10]:
from xgboost import XGBClassifier

# Instantiate model
XGB_model = XGBClassifier()

# Fit model
XGB_model.fit(X_train_ss,y_train)

# Score the model
print(f"Train score: {XGB_model.score(X_train_ss,y_train)}")
print(f"Test score: {XGB_model.score(X_test_ss,y_test)}")


Train score: 0.8424379891794622
Test score: 0.8410557654440791


## 5.2 XGBoost Hyperparameter Tuning<a class ='anchor' id='5.2_xgboost_tuning'></a>

# 6. Conclusion<a class ='anchor' id='6conclusion'></a>