# Rating Predictor Using Logistic Regression, Bayes and Random Forest

In [1]:
import pandas as pd
import numpy as np

#Import functions from py files

from models import run_all_models, cross_validation
from import_and_split import read_data, split_and_vectorise
from sampling import over_sampling

In [2]:
#Importing data and splitting with imported function
df = read_data()
X_train, X_test, y_train, y_test = split_and_vectorise(df)

All data was explored and cleaned beforehand in cleaning.py; please refer to that file for the cleaning methods used. The df is a dataframe of Amazon reviews and ratings. The reviews were cleaned with the following steps:
- Lowercasing
- Stopword removal
- Stemming
- Tokenisation
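
The cleaning code itself lives in cleaning.py and is not reproduced in this notebook. As a rough illustration of the four steps, here is a minimal standard-library sketch. The stopword list, the `crude_stem` helper and the `clean_review` name are all hypothetical stand-ins, and the suffix stripper is a deliberately crude substitute for a real stemmer such as Porter.

```python
import re

# Tiny illustrative stopword list (assumption); a real pipeline would use a
# fuller list such as the ones shipped with NLTK or scikit-learn.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "it", "this", "to", "of"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean_review(text):
    """Lowercase, tokenise, drop stopwords, then stem each remaining token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


print(clean_review("This charger stopped working and it was returned"))
# -> ['charger', 'stopp', 'work', 'return']
```

Note that the sketch tokenises before removing stopwords, since stopword removal needs individual tokens to compare against; the order used in cleaning.py may differ.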

The data was then split into train and test sets and vectorised. Vectorising turns each review into a row of numeric features with one column per word, weighted by TF-IDF rather than stored as binary indicators, with max features set to 5000. "TfidfVectorizer" was used in this instance.
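
The split and vectorise step is wrapped in `split_and_vectorise`, whose source is not shown here. A minimal sketch of what such a step could look like with scikit-learn follows. The toy reviews, the `test_size` value and the variable names are assumptions for illustration only; the key point is that the vectoriser is fitted on the training text and only applied to the test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the cleaned reviews and their ratings.
texts = ["great product work well", "bad qualiti broke fast",
         "love it great valu", "not great broke",
         "work well love it", "bad valu not work"]
ratings = [5, 1, 5, 2, 5, 1]

train_txt, test_txt, y_train, y_test = train_test_split(
    texts, ratings, test_size=0.33, random_state=0)

vectoriser = TfidfVectorizer(max_features=5000)
X_train = vectoriser.fit_transform(train_txt)  # learn vocabulary and IDF on train only
X_test = vectoriser.transform(test_txt)        # reuse the fitted vocabulary

print(X_train.shape, X_test.shape)
```

Both matrices share the same number of columns because the test set is projected onto the vocabulary learned from the training set.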

# Logistic Regression, Bayes and Random Forest Modelling Results

### Test Stage 1

Please see models.py for functions used for these results.
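
Since models.py is not reproduced in this notebook, the following is only a sketch of how a function like `run_all_models` might be structured with scikit-learn. The name `run_all_models_sketch`, the hyperparameters and the choice of `MultinomialNB` as the Bayes variant are assumptions, not the author's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def run_all_models_sketch(X_train, X_test, y_train, y_test):
    """Fit three classifiers and return their test accuracies in a fixed order."""
    models = [
        LogisticRegression(max_iter=1000),
        MultinomialNB(),  # assumption: the notebook does not say which Bayes variant is used
        RandomForestClassifier(n_estimators=50, random_state=0),
    ]
    scores = []
    for model in models:
        model.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    return tuple(scores)


# Toy non-negative features standing in for the TF-IDF matrix.
rng = np.random.default_rng(0)
X = rng.random((120, 10))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
results = run_all_models_sketch(X[:90], X[90:], y[:90], y[90:])
print(results)
```

Returning the scores in a fixed, documented order matters because the caller unpacks them by position.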

In [3]:
#Using the run_all_models function from models.py, returning all model results
model_results = run_all_models(X_train, X_test, y_train, y_test)

In [4]:
#Getting each model's test results
logistic_reg, bayes, random_forest = model_results[0], model_results[1], model_results[2]

In [5]:
#Print accuracy as a percentage
print("Accuracy for Logistic Reg: {}%".format((logistic_reg * 100).round(2)))
print("Accuracy for Bayes: {}%".format((bayes * 100).round(2)))
print("Accuracy for Random Forest: {}%".format((random_forest * 100).round(2)))

Accuracy for Logistic Reg: 85.71%
Accuracy for Bayes: 39.97%
Accuracy for Random Forest: 85.57%


In [6]:
#Showing the unbalanced dataset
df["rating"].value_counts()

5.0    2980
4.0     270
3.0      92
1.0      66
2.0      54
Name: rating, dtype: int64

Logistic regression and random forest produced similar results in this first test stage. Bayes performed very poorly; overfitting the training set is one possible cause, though this was not checked here. As the counts above show, the dataset is heavily weighted towards 5 star reviews (2980 of 3462, about 86%). I will now look into oversampling the data to increase the number of lower rated reviews.
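
To put the accuracy figures in context, it helps to compute what a model would score if it always predicted the most common rating. The short sketch below does this from the class counts printed above. It uses the counts for the whole dataset, so the figure for the actual test split may differ slightly.

```python
# Class counts copied from the value_counts() output above.
rating_counts = {5.0: 2980, 4.0: 270, 3.0: 92, 1.0: 66, 2.0: 54}

total = sum(rating_counts.values())
majority_rating, majority_count = max(rating_counts.items(), key=lambda kv: kv[1])
baseline_accuracy = majority_count / total

print(f"Total reviews: {total}")
print(f"Always predicting {majority_rating} gives {baseline_accuracy:.2%} accuracy")
# -> Total reviews: 3462
# -> Always predicting 5.0 gives 86.08% accuracy
```

A baseline of roughly 86% is very close to the Logistic Regression and Random Forest scores, which suggests that accuracy alone says little about how well the lower ratings are being predicted.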

### Test Stage 2
### Oversampling Results For All Models

Please see models.py and sampling.py for functions used for these results.

In [7]:
#Running the models using over sampling now
df_over_sample = over_sampling(df)
over_sample_results = run_all_models(df_over_sample[0], df_over_sample[1], df_over_sample[2], df_over_sample[3])

In [8]:
#Getting each model's oversampled test results
log_resample, bayes_resample, forest_resample = over_sample_results[0], over_sample_results[1], over_sample_results[2]

In [9]:
#Print accuracy as a percentage
print("Accuracy for Logistic Reg: {}%".format((log_resample * 100).round(2)))
print("Accuracy for Bayes: {}%".format((bayes_resample * 100).round(2)))
print("Accuracy for Random Forest: {}%".format((forest_resample * 100).round(2)))

Accuracy for Logistic Reg: 86.58%
Accuracy for Bayes: 41.41%
Accuracy for Random Forest: 84.56%


Oversampling changed the results only slightly: Logistic Regression (+0.87 points) and Bayes (+1.44 points) improved a little, while Random Forest dropped by 1.01 points. The 5 star reviews weren't undersampled, which may be one reason the change is so small. Next, I'll use cross-validation to see whether the model results can be improved.
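
The oversampling used in this stage is implemented in sampling.py, which is not shown here. As an illustration of the general idea, here is a minimal random oversampling sketch with pandas. The `oversample_minority` name, its parameters and the toy dataframe are hypothetical, and the sketch assumes it is applied to the training split only, so that duplicated reviews cannot leak into the test set.

```python
import pandas as pd


def oversample_minority(train_df, label_col="rating", random_state=0):
    """Pad every class with randomly repeated rows up to the size of the largest class."""
    target = train_df[label_col].value_counts().max()
    parts = []
    for _, group in train_df.groupby(label_col):
        parts.append(group)  # keep every original row
        if len(group) < target:
            # Draw the missing rows from the same class, with replacement.
            parts.append(group.sample(n=target - len(group), replace=True,
                                      random_state=random_state))
    # Shuffle so that classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


toy = pd.DataFrame({"text": list("abcdefgh"),
                    "rating": [5, 5, 5, 5, 5, 4, 1, 1]})
balanced = oversample_minority(toy)
print(balanced["rating"].value_counts())
```

After resampling, every rating has as many rows as the most frequent one, while the majority class is left untouched.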

### Test Stage 3

### Cross Validation

Please see models.py for functions used for these results.
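
The `cross_validation` function itself is defined in models.py and not reproduced here. One way such a helper could be sketched with scikit-learn is shown below, using a pipeline so that the vectoriser is refitted inside every fold. The toy data, the fold count and the choice of a single model are assumptions made for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews and ratings, six per class.
texts = ["great product", "love it", "work well", "excel valu", "great valu", "love great",
         "bad qualiti", "broke fast", "not work", "poor valu", "bad broke", "not good"]
ratings = [5] * 6 + [1] * 6

# Vectoriser and classifier are fitted together on each training fold.
pipeline = make_pipeline(TfidfVectorizer(max_features=5000),
                         LogisticRegression(max_iter=1000))

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, ratings, cv=cv, scoring="accuracy")

print(scores.mean())
```

Stratification is worth considering for a dataset as imbalanced as this one, since plain folds could leave some ratings almost absent from a fold.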

In [10]:
#Running cross-validation for all models using the function from models.py
cross_val = cross_validation(df["test"], df["rating"])

In [11]:
#Getting each model's cross-validation results
log_cv, bayes_cv, forest_cv = cross_val[0], cross_val[1], cross_val[2]

In [23]:
#Print accuracy as a percentage
print("Accuracy for Logistic Reg: {}%".format((log_cv * 100).round(2)))
print("Accuracy for Bayes: {}%".format((bayes_cv * 100).round(2)))
print("Accuracy for Random Forest: {}%".format((forest_cv * 100).round(2)))

Accuracy for Logistic Reg: 86.19%
Accuracy for Bayes: 85.93%
Accuracy for Random Forest: 39.33%


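Logistic Regression performed best overall and was consistent across all three test stages (85.71%, 86.58% and 86.19%). Cross-validation improved Bayes from around 40% in the first two stages to 85.93%, while Random Forest, which scored around 85% in the first two stages, fell to 39.33% under cross-validation. Because these two models have almost exactly swapped places, it is worth confirming the order in which cross_validation returns its results before drawing conclusions from this stage.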