<H1>Naive Bayes Sentiment Analysis Challenge</H1><br><br>
I really want to overfit this time.<br><br>
The raw sentiment data comes from the <a href='https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences'>UCI Machine Learning Repository</a>.

<H2>Class Imbalance</H2><br><br>
It turns out that all three data sets have exactly 1000 points, 500 positive and 500 negative. There is no class imbalance here.

In [17]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
from sklearn.naive_bayes import BernoulliNB

In [18]:
data = pd.read_csv('amazon_cells_labelled.txt', engine='python', header=None, sep=None)
data.columns = ['text', 'sentiment']
data['positive_sentiment'] = np.where((data.sentiment == 1), True, False)

def get_words(series):
    """Flatten a Series of token lists into one list with punctuation stripped."""
    words = []
    for item in series:
        words += item  # put all words in the same list
    translator = str.maketrans('', '', string.punctuation)
    # strip punctuation from every word
    return [word.translate(translator) for word in words]

# This is the confusion matrix calculator I wrote for the evaluation drill.
def conf_calc(predictions, actual):
    predictions = np.asarray(predictions, dtype=bool)
    actual = np.asarray(actual, dtype=bool)
    num_true_positives = (actual & predictions).sum()
    num_false_positives = (~actual & predictions).sum()
    num_true_negatives = (~actual & ~predictions).sum()
    num_false_negatives = (actual & ~predictions).sum()
    # Count the classes in `actual` itself, not in the global `data`,
    # so that subsets of the data get the right denominators.
    total_positives = actual.sum()
    total_negatives = len(actual) - total_positives

    # sensitivity: fraction of actual positives correctly identified
    print('sensitivity is: {}'.format(round(num_true_positives / total_positives, 3)))
    # specificity: fraction of actual negatives correctly identified
    print('specificity is: {}'.format(round(num_true_negatives / total_negatives, 3)))
    conf_matrix = pd.DataFrame(index=['predicted_false', 'predicted_true'], columns=['actual false', 'actual true'])
    conf_matrix.loc['predicted_false'] = [num_true_negatives, num_false_negatives]
    conf_matrix.loc['predicted_true'] = [num_false_positives, num_true_positives]
    print(conf_matrix)
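
As a cross-check on a hand-written calculator like `conf_calc`, the same four counts can be read off scikit-learn's `confusion_matrix`. This is only a sketch on made-up toy labels (the arrays below are hypothetical, not the review data); it assumes the labels are cast to 0/1 integers first.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, not taken from the review data.
actual = np.array([True, True, True, False, False, False]).astype(int)
predicted = np.array([True, True, False, True, False, False]).astype(int)

# Rows are actual classes and columns are predicted classes, in `labels` order,
# so ravel() yields tn, fp, fn, tp for a binary problem.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # share of actual positives predicted positive
specificity = tn / (tn + fp)  # share of actual negatives predicted negative
print(tn, fp, fn, tp, sensitivity, specificity)
```

Note that the denominators come from the labels passed in, so the rates stay correct when only part of a data set is evaluated.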

In [None]:
positive_words = pd.Series(get_words(data[data['sentiment'] == 1].text.str.lower().str.split()))
negative_words = pd.Series(get_words(data[data['sentiment'] == 0].text.str.lower().str.split()))

In [19]:
neg = list(negative_words.value_counts().index)
pos = list(positive_words.value_counts().index)

In [20]:
keywords = neg + pos #simply grab all words for keywords

In [21]:
for key in keywords:
    # matches the word only when a non-letter character sits on both sides,
    # so a word at the very start or very end of a review is not matched
    re_string = '[^a-zA-Z]' + key + '[^a-zA-Z]'
    data[str(key)] = data.text.apply(lambda x: bool(re.search(re_string, str(x), re.IGNORECASE)))
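
The pattern above needs a non-letter character on each side of the keyword, so it cannot match a word that opens or closes a review. A possible alternative, sketched here as an assumption rather than what this notebook ran, is a `\b` word boundary with `re.escape`. The helper name `contains_word` is made up for illustration, and `\b` treats digits and underscores as word characters, so the two patterns are not exactly equivalent.

```python
import re

def contains_word(text, word):
    """Return True if `word` appears in `text` as a whole word, ignoring case."""
    pattern = r'\b' + re.escape(word) + r'\b'
    return bool(re.search(pattern, str(text), re.IGNORECASE))

# Hypothetical example sentence: 'great' appears only at the start and the end.
sentence = 'Great phone, works great'

# The notebook's pattern finds nothing, because neither occurrence has a
# non-letter character on both sides.
notebook_pattern = '[^a-zA-Z]' + 'great' + '[^a-zA-Z]'
notebook_match = bool(re.search(notebook_pattern, sentence, re.IGNORECASE))

boundary_match = contains_word(sentence, 'great')
print(notebook_match, boundary_match)
```

Switching patterns would change the feature columns, so the recorded results in this notebook would not carry over.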

In [22]:
bnb = BernoulliNB()

In [23]:
overfit = bnb.fit(data[keywords], data.positive_sentiment)

In [24]:
correct_predictions = np.where((overfit.predict(data[keywords]) == data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()

True     747
False    253
dtype: int64

In [40]:
conf_calc(overfit.predict(data[keywords]), data.positive_sentiment)

sensitivity is: 1.0
specificity is: 0.494
                 actual false  actual true
predicted_false           247            0
predicted_true            253          500


We still only have about 75% accuracy (747 of 1000) on the training data...<br>
Let's try training on one half of the data set and running it against the other half.

In [26]:
holdout = BernoulliNB().fit(data.iloc[:500][keywords], data.positive_sentiment.iloc[:500])
#the distribution of positive and negative sentiments among our points is even, so splitting it like this
#still avoids class imbalance.
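
The split above takes the first 500 rows in file order. If an exactly balanced, shuffled split were wanted instead, scikit-learn's `train_test_split` with `stratify` is one option. The sketch below uses random stand-in features (`X` and `y` are hypothetical, not the keyword columns) purely to show the call.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Hypothetical stand-in data: 200 rows of random boolean features, balanced labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20)).astype(bool)
y = np.array([True, False] * 100)

# stratify=y keeps the positive/negative ratio the same in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

model = BernoulliNB().fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(y_train.sum(), y_test.sum(), test_accuracy)
```

With random features the accuracy itself means nothing; the point is only that each half ends up with the same class counts.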

In [27]:
correct_predictions = np.where((holdout.predict(data.iloc[500:][keywords]) == data.iloc[500:].positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()

True     251
False    249
dtype: int64

In [41]:
conf_calc(holdout.predict(data.iloc[500:][keywords]), data.iloc[500:].positive_sentiment)

sensitivity is: 1.0
specificity is: 0.046
                 actual false  actual true
predicted_false            12            0
predicted_true            249          239


In [28]:
correct_predictions = np.where((holdout.predict(data.iloc[:500][keywords]) == data.iloc[:500].positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()

True     314
False    186
dtype: int64

In [42]:
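# note: this cell evaluates `overfit` (trained on all 1000 rows), not `holdout`, on the first 500 rows
conf_calc(overfit.predict(data.iloc[:500][keywords]), data.iloc[:500].positive_sentiment)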

sensitivity is: 1.0
specificity is: 0.469
                 actual false  actual true
predicted_false           112            0
predicted_true            127          261


In [43]:
correct_predictions = np.where((holdout.predict(data[keywords]) == data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()

True     565
False    435
dtype: int64

In [44]:
conf_calc(holdout.predict(data[keywords]), data.positive_sentiment)

sensitivity is: 1.0
specificity is: 0.13
                 actual false  actual true
predicted_false            65            0
predicted_true            435          500


Here we see about 63% accuracy (314 of 500) on the training half. On the holdout half we see about 50% accuracy (251 of 500), which is no better than guessing all positive or all negative. <B>That baseline sits at 50% because there is no class imbalance here: all three data sets have exactly 500 positive and 500 negative points.</B> This looks like overfitting to me.<br>
Let's compare these two models against the other data sets.
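
To make the 50% baseline mentioned above concrete, a classifier that always predicts a single class can be scored on balanced labels. This is a sketch using scikit-learn's `DummyClassifier` on stand-in labels that mimic the 500/500 split (the arrays are hypothetical; the features are ignored by this strategy).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Stand-in labels with the same 500/500 balance as each data set.
y = np.array([True] * 500 + [False] * 500)
X = np.zeros((len(y), 1))  # placeholder features; 'most_frequent' never looks at them

# Always predicts one class, so it is right on exactly half of a balanced set.
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```

Any model scoring near this number on held-out data has learned little that generalizes.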

In [29]:
imdb_data = pd.read_csv('imdb_labelled.txt', engine='python', header=None, sep='\t', quoting=3)
imdb_data.columns = ['text', 'sentiment']
imdb_data['positive_sentiment'] = np.where((imdb_data.sentiment == 1), True, False)
for key in keywords:
    # same pattern as for the Amazon data: needs a non-letter character on both sides of the word
    re_string = '[^a-zA-Z]' + key + '[^a-zA-Z]'
    imdb_data[str(key)] = imdb_data.text.apply(lambda x: bool(re.search(re_string, str(x), re.IGNORECASE)))
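
The same feature-building loop is repeated for each data set. One way it might be factored out is a small helper that builds all keyword columns and attaches them in a single `pd.concat`. The function name `add_keyword_features` is hypothetical, and this sketch differs from the loop above in two labeled ways: it drops duplicate keywords and escapes regex metacharacters with `re.escape`.

```python
import re
import pandas as pd

def add_keyword_features(frame, keywords, text_column='text'):
    """Return a copy of `frame` with one boolean column per unique keyword."""
    features = {}
    for key in dict.fromkeys(keywords):  # assumption: drop duplicates, keep order
        # Same idea as the notebook's pattern: a non-letter on both sides of the word.
        pattern = re.compile('[^a-zA-Z]' + re.escape(key) + '[^a-zA-Z]', re.IGNORECASE)
        features[key] = frame[text_column].apply(lambda x: bool(pattern.search(str(x))))
    # Attach all new columns at once instead of inserting them one by one.
    return pd.concat([frame, pd.DataFrame(features, index=frame.index)], axis=1)

# Hypothetical toy reviews, not rows from the actual files.
toy = pd.DataFrame({'text': ['I love it!', 'not good at all', 'Good'],
                    'sentiment': [1, 0, 1]})
toy_features = add_keyword_features(toy, ['love', 'good', 'love'])
print(toy_features)
```

The lone word 'Good' is not flagged in the toy output because the pattern requires a character on each side, which mirrors the behavior of the original loop.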

In [30]:
correct_predictions = np.where((overfit.predict(imdb_data[keywords]) == imdb_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()  #'overfit' model against imdb set

True     606
False    394
dtype: int64

In [31]:
correct_predictions = np.where((holdout.predict(imdb_data[keywords]) == imdb_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts() #'holdout' model against imdb set

True     523
False    477
dtype: int64

In [32]:
yelp_data = pd.read_csv('yelp_labelled.txt', engine='python', header=None, sep='\t', quoting=3)
yelp_data.columns = ['text', 'sentiment']
yelp_data['positive_sentiment'] = np.where((yelp_data.sentiment == 1), True, False)
for key in keywords:
    # same pattern as for the Amazon data: needs a non-letter character on both sides of the word
    re_string = '[^a-zA-Z]' + key + '[^a-zA-Z]'
    yelp_data[str(key)] = yelp_data.text.apply(lambda x: bool(re.search(re_string, str(x), re.IGNORECASE)))

In [33]:
correct_predictions = np.where((overfit.predict(yelp_data[keywords]) == yelp_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts() #'overfit' model against yelp set

True     569
False    431
dtype: int64

In [34]:
correct_predictions = np.where((holdout.predict(yelp_data[keywords]) == yelp_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts() #'holdout' model against yelp set

True     513
False    487
dtype: int64

<H2>Did we overfit?</H2><br><br>
I'd say we did! Our 'overfit' and 'holdout' models both score only about 51% to 61% accuracy against the other sets, and the holdout test suggests overfitting even within our training set. I'd say that using every single word as a feature is a pretty lousy strategy overall.
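
For reference, the accuracies behind that conclusion can be tabulated from the `value_counts()` outputs recorded in this notebook. The snippet is just arithmetic on those recorded counts, with each data set assumed to hold 1000 rows as stated at the top.

```python
# Correct-prediction counts copied from the value_counts() outputs in this notebook.
correct = {
    ('overfit', 'amazon (training)'): 747,
    ('holdout', 'amazon (all rows)'): 565,
    ('overfit', 'imdb'): 606,
    ('holdout', 'imdb'): 523,
    ('overfit', 'yelp'): 569,
    ('holdout', 'yelp'): 513,
}
rows_per_set = 1000

accuracy = {key: count / rows_per_set for key, count in correct.items()}
for (model, dataset), acc in accuracy.items():
    print(f'{model:8s} {dataset:18s} {acc:.1%}')
```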