<H1>Naive Bayes Sentiment Analysis Challenge</H1><br><br>
I really want to overfit this time.<br><br>
The raw sentiment data was taken from the <a href='https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences'>UCI Machine Learning Repository</a>.

<H2>Class Imbalance</H2><br><br>
It turns out that all three data sets have exactly 1000 points, 500 positive and 500 negative. There is no class imbalance here.
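
A quick way to confirm that balance is to look at the share of each label. This is only a sketch: `toy_sentiment` is a made-up stand-in for the `sentiment` column and `class_shares` is an illustrative helper name, neither is part of the notebook. With the real data, the same call would take `data.sentiment` once the file has been loaded.

```python
import pandas as pd


def class_shares(labels: pd.Series) -> pd.Series:
    """Return the fraction of rows in each class, indexed by label."""
    return labels.value_counts(normalize=True).sort_index()


# Hypothetical stand-in for the `sentiment` column of one data set.
toy_sentiment = pd.Series([1] * 500 + [0] * 500)

shares = class_shares(toy_sentiment)
print(shares)
```

Shares of 0.5 for both labels mean that always guessing one class would be right about half the time.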

In [126]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re

In [127]:
data = pd.read_csv('amazon_cells_labelled.txt', engine='python', header=None, sep=None)
data.columns = ['text', 'sentiment']
data['positive_sentiment'] = np.where((data.sentiment == 1), True, False)

def get_words(series):
    """Flatten a Series of token lists into one list and strip punctuation from each token."""
    words = []
    for item in series:
        words += item  # put all words in the same list
    translator = str.maketrans('', '', string.punctuation)
    for i, word in enumerate(words):
        words[i] = word.translate(translator)  # strip away punctuation
    return words

positive_words = pd.Series(get_words(data[data['sentiment'] == 1].text.str.lower().str.split()))
negative_words = pd.Series(get_words(data[data['sentiment'] == 0].text.str.lower().str.split()))

In [128]:
neg = list(negative_words.value_counts().index)
pos = list(positive_words.value_counts().index)

In [129]:
keywords = neg + pos  # simply grab all words as keywords; a word found in both lists appears twice
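
Because `neg + pos` just concatenates two lists, any word used in both positive and negative reviews ends up in `keywords` twice, and a token made only of punctuation may survive as an empty string. If that is not wanted, one possible cleanup is sketched below. `unique_keywords` and the toy lists are hypothetical names for illustration, and applying this would change the feature set, so the accuracies reported in this notebook would no longer apply.

```python
def unique_keywords(*word_lists):
    """Merge word lists, keeping first-seen order and dropping repeats and empty strings."""
    merged = [word for words in word_lists for word in words if word]
    # dict.fromkeys keeps insertion order, so this de-duplicates without reordering.
    return list(dict.fromkeys(merged))


# Toy stand-ins for the `neg` and `pos` lists built above.
neg_toy = ['not', 'phone', 'bad', '']
pos_toy = ['great', 'phone', 'not']

print(unique_keywords(neg_toy, pos_toy))  # ['not', 'phone', 'bad', 'great']
```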

In [130]:
for key in keywords:
    # Match the word only when a non-letter sits on both sides of it; note that
    # a word at the very start or end of the text therefore does not match.
    re_string = '[^a-zA-Z]' + key + '[^a-zA-Z]'
    data[str(key)] = data.text.apply(lambda x: bool(re.search(re_string, str(x), re.IGNORECASE)))
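
The pattern above needs an actual non-letter character on each side of the word, so a keyword that opens or closes a review is missed. The sketch below compares it with a lookaround-based variant that treats the start and end of the text as boundaries too. Both function names and the sample sentence are made up for illustration, and switching patterns would change the features and therefore the results shown in this notebook.

```python
import re


def matches_original(key: str, text: str) -> bool:
    # Pattern used in the notebook: a non-letter must sit on BOTH sides of the word.
    return bool(re.search('[^a-zA-Z]' + key + '[^a-zA-Z]', text, re.IGNORECASE))


def matches_whole_word(key: str, text: str) -> bool:
    # Lookarounds assert "no letter here" without consuming a character,
    # so the start and end of the string also count as boundaries.
    pattern = r'(?<![a-zA-Z])' + re.escape(key) + r'(?![a-zA-Z])'
    return bool(re.search(pattern, text, re.IGNORECASE))


sample = "Great phone, works great"
print(matches_original('great', sample))    # False: both hits touch an edge of the text
print(matches_whole_word('great', sample))  # True
print(matches_whole_word('eat', sample))    # False: 'eat' only occurs inside 'great'
```

`re.escape` is there in case a keyword still contains a character that the regex engine would treat as special.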

In [131]:
from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()

In [132]:
overfit = bnb.fit(data[keywords], data.positive_sentiment)

In [133]:
correct_predictions = np.where((overfit.predict(data[keywords]) == data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()

True     767
False    233
dtype: int64

We still only reach about 77% accuracy (767 of 1000) on the training data...<br>
Let's try training on one half of the data set and running it against the other half.

In [134]:
holdout = BernoulliNB().fit(data.iloc[:500][keywords], data.positive_sentiment.iloc[:500])
# The full set is balanced (500 positive, 500 negative). A positional split like this
# stays balanced only if the labels are spread evenly through the file.
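
Whether the first 500 rows of the file are themselves balanced is not checked here. As a sketch of an alternative, scikit-learn's `train_test_split` can keep the class ratio in both halves via its `stratify` argument. The toy arrays `X` and `y` below are invented for illustration, with the labels deliberately sorted to show the worst case for a positional split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 3 binary features, balanced labels stored
# in sorted order (all negatives first).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 3))
y = np.array([0] * 500 + [1] * 500)

# Positional split, like data.iloc[:500]: here the first half is entirely negative.
print(y[:500].mean())  # 0.0

# Stratified split: each half keeps the 50/50 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
print(y_train.mean(), y_test.mean())  # 0.5 0.5
```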

In [135]:
correct_predictions = np.where((holdout.predict(data.iloc[500:][keywords]) == data.iloc[500:].positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()

True     258
False    242
dtype: int64

In [136]:
correct_predictions = np.where((holdout.predict(data.iloc[:500][keywords]) == data.iloc[:500].positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()

True     316
False    184
dtype: int64

Here we see about 63% accuracy (316 of 500) with the training group. With the holdout group, we see about 52% accuracy (258 of 500), which is barely better than guessing all positive or all negative. <B>That guess would score about 50% because there is no class imbalance here: all three data sets have exactly 500 positive and 500 negative points.</B> This looks like overfitting to me.<br>
Let's compare these two models against the other data sets.
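
For reference, the percentages for the half-and-half experiment come straight from the `True`/`False` counts printed above. The small helper below (`accuracy` is just an illustrative name) redoes that arithmetic, next to the score that always guessing one class would get on the full, balanced Amazon set.

```python
def accuracy(correct: int, incorrect: int) -> float:
    """Fraction of predictions that were correct."""
    return correct / (correct + incorrect)


train_acc = accuracy(316, 184)   # 'holdout' model on the 500 rows it was trained on
test_acc = accuracy(258, 242)    # 'holdout' model on the 500 held-out rows
baseline = max(500, 500) / 1000  # always guessing one class on the balanced full set

print(f"{train_acc:.1%} {test_acc:.1%} {baseline:.1%}")  # 63.2% 51.6% 50.0%
```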

In [137]:
imdb_data = pd.read_csv('imdb_labelled.txt', engine='python', header=None, sep='\t', quoting=3)
imdb_data.columns = ['text', 'sentiment']
imdb_data['positive_sentiment'] = np.where((imdb_data.sentiment == 1), True, False)
for key in keywords:
    re_string = '[^a-zA-Z]' + key + '[^a-zA-Z]'  # same pattern as for the Amazon data: a non-letter is required on both sides
    imdb_data[str(key)] = imdb_data.text.apply(lambda x: bool(re.search(re_string, str(x), re.IGNORECASE)))

In [138]:
correct_predictions = np.where((overfit.predict(imdb_data[keywords]) == imdb_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()  #'overfit' model against imdb set

True     610
False    390
dtype: int64

In [139]:
correct_predictions = np.where((holdout.predict(imdb_data[keywords]) == imdb_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts() #'holdout' model against imdb set

True     532
False    468
dtype: int64

In [140]:
yelp_data = pd.read_csv('yelp_labelled.txt', engine='python', header=None, sep='\t', quoting=3)
yelp_data.columns = ['text', 'sentiment']
yelp_data['positive_sentiment'] = np.where((yelp_data.sentiment == 1), True, False)
for key in keywords:
    re_string = '[^a-zA-Z]' + key + '[^a-zA-Z]'  # same pattern as for the Amazon data: a non-letter is required on both sides
    yelp_data[str(key)] = yelp_data.text.apply(lambda x: bool(re.search(re_string, str(x), re.IGNORECASE)))

In [141]:
correct_predictions = np.where((overfit.predict(yelp_data[keywords]) == yelp_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts() #'overfit' model against yelp set

True     583
False    417
dtype: int64

In [142]:
correct_predictions = np.where((holdout.predict(yelp_data[keywords]) == yelp_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts() #'holdout' model against yelp set

True     515
False    485
dtype: int64

<H2>Did we overfit?</H2><br><br>
I'd say we did! Against the other sets, the 'overfit' model scores roughly 58%-61% and the 'holdout' model roughly 51%-53%, and the holdout test suggests overfitting even within our training set. I'd say that using every single word as a feature is a pretty lousy strategy overall.