**Justin Hardy | JEH180008 | Dr. Mazidi | CS 4395.001**

The purpose of this assignment is to (...)

# Imports

In [272]:
import pandas
import math
import re as regex
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split as tts
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# ...

# About the Data Set
DESCRIPTION OF DATA SET GOES HERE

# Reading in the Data Set
We'll start by reading both the train and test files into data frames. We'll then combine the two data frames and cut the number of rows down substantially, keeping the same fraction of each class. The reduced data is later re-split 70/30, rather than keeping the 90/10 split the files come with.

In [244]:
# Read train and test
col_names = ['Type', 'Title', 'Review']
df_partial_1 = pandas.read_csv('data/train.csv', names=col_names, header=None, encoding='utf-8', keep_default_na=False)
df_partial_2 = pandas.read_csv('data/test.csv', names=col_names, header=None, encoding='utf-8', keep_default_na=False)

# Combine the data frames
df = pandas.concat([df_partial_1, df_partial_2], ignore_index=True)

# Print Shape of Data Frame
print("Shape before cut:", df.shape)

# Cut down Data Frame size, one class at a time
df_type_1 = df.loc[df['Type'] == 1]
df_type_2 = df.loc[df['Type'] == 2]

# Take a twentieth of each class's rows
df_type_1 = df_type_1.iloc[:len(df_type_1) // 20] # 20 = 200,000; 25 = 160,000; 40 = 100,000
df_type_2 = df_type_2.iloc[:len(df_type_2) // 20]

# Combine the two separate Data Frames back into a single Data Frame.
df = pandas.concat([df_type_1, df_type_2], ignore_index=True)

# Convert Type column from 1/2 notation to binary 0/1 notation
df.Type = [{1:0, 2:1}[t] for t in df.Type]

# Print Shape of Data Frame
print("Shape after cut:", df.shape)

# Print Head/Tail of the Data Frame
print("Final Data Frame (head and tail):")
print(df.head())
print()
print(df.tail())

Shape before cut: (4000000, 3)
Shape after cut: (200000, 3)
Final Data Frame (head and tail):
   Type                                    Title  \
0     0                             Buyer beware   
1     0                               The Worst!   
2     0                                Oh please   
3     0                     Awful beyond belief!   
4     0  Don't try to fool us with fake reviews.   

                                              Review  
0  This is a self-published book, and if you want...  
1  A complete waste of time. Typographical errors...  
2  I guess you have to be a romance novel lover f...  
3  I feel I have to write to keep others from was...  
4  It's glaringly obvious that all of the glowing...  

        Type                      Title  \
199995     1                  It Works!   
199996     1            Love this book!   
199997     1           Good basics book   
199998     1  Must read for new parents   
199999     1                great book!   
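
The per-class cut above can also be written as one small helper. The following is only a sketch on made-up toy data, not the code used for the results in this notebook; the helper name `cut_per_class` and the toy frame are assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the real data: 20 rows of each class (1 = negative, 2 = positive)
toy = pd.DataFrame({
    'Type': [1, 2] * 20,
    'Title': [f'title {i}' for i in range(40)],
    'Review': [f'review {i}' for i in range(40)],
})

def cut_per_class(frame, label_col, divisor):
    """Keep the first len(group) // divisor rows of every class (hypothetical helper)."""
    parts = [group.iloc[:len(group) // divisor] for _, group in frame.groupby(label_col)]
    return pd.concat(parts, ignore_index=True)

small = cut_per_class(toy, 'Type', divisor=20)

# Same 1/2 -> 0/1 relabelling as above, expressed with Series.map
small['Type'] = small['Type'].map({1: 0, 2: 1})
print(small)
```

Cutting each class separately keeps the class proportions of the reduced frame the same as in the full data, which a plain `df.iloc[:n]` would not guarantee if the file is ordered by label.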

   

# Text Preprocessing
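To process the text, we'll need to specify which columns we'll use as features, and which one will be our target. Since we're vectorizing the features with SKLearn's TF-IDF Vectorizer, which expects a single string per document, we'll concatenate the Title and Review of each row so that they are transformed together. The target column is left as-is. The vectorizer is fit on the training split only and then applied to the test split, so no vocabulary or document-frequency information leaks from the test data.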

In [296]:
# Initialize tfidf vars
stop_words = stopwords.words('english') # kept as a list: recent scikit-learn versions reject a set here
vectorizer = TfidfVectorizer(stop_words=stop_words)

# Define x and y columns
x = df.Title + ' ' + df.Review # concatenate Title and Review columns (for vectorizer)
y = df.Type

# Split into train/test
x_train, x_test, y_train, y_test = tts(x, y, test_size=0.3, train_size=0.7, random_state=66)

# Print shapes
print('x shape:', x_train.shape)
print('y shape:', y_train.shape)

# Apply tfidf vectorizer to features
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)

# Print snapshot of the vectorized features
print("vectorized train size:", x_train.shape)
print("vectorized test size:", x_test.shape)

x shape: (140000,)
y shape: (140000,)
vectorized train size: (140000, 135928)
vectorized test size: (60000, 135928)
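
To make the weights behind those 135,928 columns a little more concrete, here is a minimal sketch on a made-up three-document corpus. It assumes `TfidfVectorizer`'s default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1`; the corpus and the helper `manual_idf` are illustrative assumptions, not part of the assignment data.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration)
docs = ["great book great read", "awful book", "great value"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
vocab = vec.vocabulary_  # maps each term to its column index
n_docs = len(docs)

def manual_idf(term):
    """Smoothed idf as computed by TfidfVectorizer with default settings."""
    doc_freq = sum(term in d.split() for d in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Compare the hand-computed idf with the one the vectorizer learned
for term in ("great", "book", "awful"):
    print(term, manual_idf(term), vec.idf_[vocab[term]])

# Each row is L2-normalised, so its squared entries sum to 1
first_row = X[0].toarray().ravel()
print(sum(v * v for v in first_row))
```

Terms that appear in fewer documents get a larger idf, so rare but distinctive words carry more weight than words shared by most reviews.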


# Training The Models
For the Machine Learning models, we'll train Naive Bayes, Logistic Regression, and Neural Network models, making two attempts at each. The first attempt is a simple version of the model, and the second is my attempt at improving on it. Anything I tried that didn't make it into the final version of the second attempt is noted in my explanation of the model.

## Naive Bayes (First Attempt)
In this attempt, I'll create a simple Naive Bayes model using Multinomial Naive Bayes, and examine its performance from there.

### Training

In [247]:
# Create Naive Bayes model & fit it to the training data
nb1 = MultinomialNB()
nb1.fit(x_train, y_train)

# Calculate priors (class 1 = positive review)
prior_p = sum(y_train == 1) / len(y_train)
log_prior_p = math.log(prior_p)
print('prior positive:', prior_p, '\n')
print('log of prior:', log_prior_p)

print('The prior above should match the following model prior:', nb1.class_log_prior_[1])

prior positive: 0.5004142857142857 

log of prior: -0.6923189522071845
The prior above should match the following model prior: -0.6923189522071844


### Evaluation

In [305]:
# Predict off the test data
pred_nb1 = nb1.predict(x_test)

# Print accuracy report
print(confusion_matrix(y_test, pred_nb1))
print()
print('Accuracy:\t\t\t\t', accuracy_score(y_test, pred_nb1))
print()
print('Precision (positive):\t', precision_score(y_test, pred_nb1, pos_label=1))
print('Precision (negative):\t', precision_score(y_test, pred_nb1, pos_label=0))
print('Precision (average):\t', (precision_score(y_test, pred_nb1, pos_label=1)+precision_score(y_test, pred_nb1, pos_label=0))/2)
print()
print('Recall (positive):\t\t', recall_score(y_test, pred_nb1, pos_label=1))
print('Recall (negative):\t\t', recall_score(y_test, pred_nb1, pos_label=0))
print('Recall (average):\t\t', (recall_score(y_test, pred_nb1, pos_label=1)+recall_score(y_test, pred_nb1, pos_label=0))/2)
print()
print('F1 Score:\t\t\t\t', f1_score(y_test, pred_nb1))
print()
print('First 10 Mis-classifications (out of ' + str(len(y_test[y_test != pred_nb1])) + '):')
#print(y_test[y_test != pred_nb1].iloc[:10])
for i in y_test[y_test!= pred_nb1].iloc[:10].index:
    print(i)
    print("Title:", df.loc[i].Title)
    print('Review:', df.loc[i].Review)
    print()

[[26049  4009]
 [ 4962 24980]]

Accuracy:				 0.8504833333333334

Precision (positive):	 0.861706164407189
Precision (negative):	 0.8399922608106801
Precision (average):	 0.8508492126089345

Recall (positive):		 0.8342796072406653
Recall (negative):		 0.8666245259165614
Recall (average):		 0.8504520665786133

F1 Score:				 0.847771122159814

First 10 Mis-classifications (out of 8971):
162196
Title: What's Really THAT bad about this movie?
Review: I don't know why people are saying that this is such a horrible movie. It wasn't that bad, I guess it was a little far fetched, but look at some other big movies these days. It's a horror film, their usually all the same, and the killer never dies. Look at Halloween for example, will he ever die? And these killers are already dead! It was scary and when I see it, it still is sometimes scary. I thought it deserves 3 and a half stars, but the rating doesn't have that, so I gave it 4.

145969
Title: A GREAT Portable printer
Review: I realllly wan

As we can see, the model reaches roughly 85% on accuracy, average precision, average recall, and F1 score. Its errors are fairly balanced: 4,009 false positives against 4,962 false negatives, so it misses positive reviews slightly more often than it mislabels negative ones.
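
As a sanity check, the scores can be recomputed by hand from the confusion matrix printed above. This sketch only assumes scikit-learn's layout for a binary problem (rows are true labels, columns are predictions, so the matrix reads `[[TN, FP], [FN, TP]]`); the variable names are my own.

```python
# Confusion matrix reported above for the first Naive Bayes model
tn, fp = 26049, 4009
fn, tp = 4962, 24980

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # of everything predicted positive, how much was positive
recall_pos = tp / (tp + fn)      # of everything actually positive, how much was found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print('Accuracy:', accuracy)
print('Precision (positive):', precision_pos)
print('Recall (positive):', recall_pos)
print('Precision (negative):', precision_neg)
print('Recall (negative):', recall_neg)
print('F1 Score:', f1_pos)
print('Mis-classifications:', fp + fn)
```

The off-diagonal cells add up to the 8,971 mis-classifications reported in the output.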

## Naive Bayes (Second Attempt)
With this attempt, I wanted to do my best to improve the Precision/Recall scores of the model, without affecting the Accuracy too negatively.

### Training

In [299]:
# Create Naive Bayes model & fit it to the training data
nb2 = BernoulliNB()
nb2.fit(x_train, y_train)

# Calculate priors (class 1 = positive review)
prior_p = sum(y_train == 1) / len(y_train)
log_prior_p = math.log(prior_p)
print('prior positive:', prior_p, '\n')
print('log of prior:', log_prior_p)

# Compare against the second model's prior (this previously read from nb1)
print('The prior above should match the following model prior:', nb2.class_log_prior_[1])

prior positive: 0.5004142857142857 

log of prior: -0.6923189522071845
The prior above should match the following model prior: -0.6923189522071844


### Evaluation

In [304]:
# Predict off the test data
pred_nb2 = nb2.predict(x_test)

# Print accuracy report
print(confusion_matrix(y_test, pred_nb2))
print()
print('Accuracy:\t\t\t\t', accuracy_score(y_test, pred_nb2))
print()
print('Precision (positive):\t', precision_score(y_test, pred_nb2, pos_label=1))
print('Precision (negative):\t', precision_score(y_test, pred_nb2, pos_label=0))
print('Precision (average):\t', (precision_score(y_test, pred_nb2, pos_label=1)+precision_score(y_test, pred_nb2, pos_label=0))/2)
print()
print('Recall (positive):\t\t', recall_score(y_test, pred_nb2, pos_label=1))
print('Recall (negative):\t\t', recall_score(y_test, pred_nb2, pos_label=0))
print('Recall (average):\t\t', (recall_score(y_test, pred_nb2, pos_label=1)+recall_score(y_test, pred_nb2, pos_label=0))/2)
print()
print('F1 Score:\t\t\t\t', f1_score(y_test, pred_nb2))
print()
print('First 10 Mis-classifications (out of ' + str(len(y_test[y_test != pred_nb2])) + '):')
for i in y_test[y_test!= pred_nb2].iloc[:10].index:
    print(i)
    print("Title:", df.loc[i].Title)
    print('Review:', df.loc[i].Review)
    print()

[[25509  4549]
 [ 4124 25818]]

Accuracy:				 0.85545

Precision (positive):	 0.8501992294266802
Precision (negative):	 0.8608308304930314
Precision (average):	 0.8555150299598558

Recall (positive):		 0.8622670496292832
Recall (negative):		 0.848659258766385
Recall (average):		 0.8554631541978341

F1 Score:				 0.8561906183156742

First 10 Mis-classifications (out of 8673):
162196
Title: What's Really THAT bad about this movie?
Review: I don't know why people are saying that this is such a horrible movie. It wasn't that bad, I guess it was a little far fetched, but look at some other big movies these days. It's a horror film, their usually all the same, and the killer never dies. Look at Halloween for example, will he ever die? And these killers are already dead! It was scary and when I see it, it still is sometimes scary. I thought it deserves 3 and a half stars, but the rating doesn't have that, so I gave it 4.

145969
Title: A GREAT Portable printer
Review: I realllly wanted to giv

I first tried having the algorithm identify all-caps words, runs of three or more dots, and consecutive exclamation points/question marks. Every variant that included these features, in any combination of the three, performed worse overall. So I scrapped the idea and instead switched to the Bernoulli (binary-feature) variant of Naive Bayes, which models whether a word is present in a review rather than how heavily it is weighted. This gave a slightly better overall performance than my original model, but only by about half a percentage point in accuracy (85.05% to 85.55%), with F1 rising from 0.848 to 0.856.
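
One consequence of this switch is worth spelling out. With its default `binarize=0.0`, `BernoulliNB` thresholds its input, so every non-zero TF-IDF weight is treated as a plain "word present" flag. The sketch below illustrates this on a made-up four-review corpus; the toy documents and labels are assumptions, not data from this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus and labels (made up for illustration; 1 = positive, 0 = negative)
docs = ["great book loved it", "awful book hated it",
        "great great value", "awful waste of money"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(docs)

# Explicit presence/absence version of the same matrix
X_presence = (X > 0).astype(np.float64)

nb_tfidf = BernoulliNB().fit(X, labels)
nb_presence = BernoulliNB().fit(X_presence, labels)

# Both models learn the same per-word probabilities
same = np.allclose(nb_tfidf.feature_log_prob_, nb_presence.feature_log_prob_)
print('Identical feature log-probabilities:', same)
```

In other words, the second attempt effectively discards the TF-IDF weighting and keeps only the vocabulary, so a `CountVectorizer(binary=True)` would be an equivalent input for this model.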

I believe the model performed worse with those features because they aren't necessarily indicative of a review's sentiment. For instance, both positive and negative reviews may include runs of three or more dots, as well as some number of all-caps words. In hindsight, one thing I should have tried was checking for consecutive exclamation points and consecutive question marks separately, rather than treating any mix of the two as the same feature. I'd imagine question marks show up more often in negative reviews, while exclamation points show up more often in positive reviews (though they still appear in negative ones). Combining both hides that distinction from the algorithm. Or at least, that's what I speculate.
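
For reference, the idea of separate markers could look something like the sketch below. This is not the code from my discarded experiments; the patterns, the marker tokens (`xxallcaps` and so on), and the helper `tag_emphasis` are all assumptions made up for illustration.

```python
import re

# Hypothetical patterns for the emphasis features discussed above
ALL_CAPS = re.compile(r'\b[A-Z]{2,}\b')    # words of two or more capital letters
ELLIPSIS = re.compile(r'\.{3,}')           # three or more dots in a row
MULTI_EXCLAIM = re.compile(r'!{2,}')       # consecutive exclamation points
MULTI_QUESTION = re.compile(r'\?{2,}')     # consecutive question marks

def tag_emphasis(text):
    """Append one marker token per emphasis pattern found in the text."""
    tags = []
    if ALL_CAPS.search(text):
        tags.append('xxallcaps')
    if ELLIPSIS.search(text):
        tags.append('xxellipsis')
    if MULTI_EXCLAIM.search(text):
        tags.append('xxmultiexclaim')
    if MULTI_QUESTION.search(text):
        tags.append('xxmultiquestion')
    return ' '.join([text] + tags)

print(tag_emphasis("A GREAT Portable printer!!!"))
print(tag_emphasis("Why would anyone buy this??"))
```

Because the markers are ordinary alphanumeric tokens, they would survive the vectorizer's default tokenization and end up as their own columns, letting the model weigh exclamation runs and question-mark runs independently. Mixed runs such as `?!` are not covered by this sketch.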

## Logistic Regression
DESCRIPTION OF LOGISTIC REGRESSION

### Training

### Evaluation

## Neural Network
DESCRIPTION OF NEURAL NETWORK

### Training

### Evaluation