<H1>Naive Bayes Sentiment Analysis Challenge</H1><br><br>
I really want to overfit this time.<br><br>
Sentiment raw data was taken from the <a href='https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences'>UCI Machine Learning Repository.</a>

<H2>Class Imbalance</H2><br><br>
It turns out that all three data sets have exactly 1000 points, 500 positive and 500 negative. There is no class imbalance here.

<H2>My Overfitting Model Design: Use Every Word as a Feature</H2>

In [107]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
from sklearn.naive_bayes import BernoulliNB

In [108]:
data = pd.read_csv('amazon_cells_labelled.txt', engine='python', header=None, sep=None)
data.columns = ['text_', 'sentiment']
data['positive_sentiment'] = np.where((data.sentiment == 1), True, False)

def get_words(series):
    words = []
    for item in series: 
        words += item #put all words in same list
    translator = str.maketrans('', '', string.punctuation)
    for i, word in enumerate(words):
        word = word.translate(translator)
        words[i] = word #strip away punctuation
    return words

#This is the confusion matrix calculator I wrote for the evaluation drill...
def conf_calc(predictions, actual):
    true_positives = np.where((np.where((actual == True), True, None) == predictions), True, False)
    false_positives = np.where((np.where((actual == False), True, None) == predictions), True, False)
    true_negatives = np.where((np.where((actual == False), False, None) == predictions), True, False)
    false_negatives = np.where((np.where((actual == True), False, None) == predictions), True, False)
    num_true_positives = true_positives.sum()
    total_positives = actual.sum()
    num_true_negatives = true_negatives.sum()
    total_negatives = len(actual) - total_positives

    print('sensitivity is: {}% (positives correctly identified)'.format(str(100*num_true_positives/total_positives)[:4])) 
    print('specificity is: {}% (negatives correctly identified)'.format(str(100*num_true_negatives/total_negatives)[:4]))
    conf_matrix = pd.DataFrame(index=['actual_false', 'actual_true'], columns=['predicted false', 'predicted true'])
    conf_matrix.loc['actual_true'] = [false_negatives.sum(), num_true_positives]
    conf_matrix.loc['actual_false'] = [num_true_negatives, false_positives.sum()]
    print(conf_matrix)

In [109]:
fake_act = pd.Series([True, True, False, False])
fake_pred = pd.Series([True, False, False, True])
conf_calc(fake_pred, fake_act) #testing our evaluator function; the signature is (predictions, actual)

sensitivity is: 50.0% (positives correctly identified)
specificity is: 50.0% (negatives correctly identified)
              predicted false  predicted true
actual_false                1               1
actual_true                 1               1
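
As a cross-check on the numbers `conf_calc` prints, here is a minimal, dependency-free sketch of the same two rates. The function name `sensitivity_specificity` is hypothetical (it is not part of the notebook), and the sketch assumes both classes appear in `actual` so that neither denominator is zero.

```python
def sensitivity_specificity(predictions, actual):
    """Return (sensitivity, specificity) for two equal-length boolean sequences."""
    pairs = list(zip(predictions, actual))
    tp = sum(1 for p, a in pairs if a and p)          # actual positive, predicted positive
    fn = sum(1 for p, a in pairs if a and not p)      # actual positive, predicted negative
    tn = sum(1 for p, a in pairs if not a and not p)  # actual negative, predicted negative
    fp = sum(1 for p, a in pairs if not a and p)      # actual negative, predicted positive
    return tp / (tp + fn), tn / (tn + fp)

# Same toy data as the test cell above.
fake_act = [True, True, False, False]
fake_pred = [True, False, False, True]
sens, spec = sensitivity_specificity(fake_pred, fake_act)
print(sens, spec)
```

Sensitivity is the share of actual positives that were predicted positive, and specificity is the share of actual negatives that were predicted negative, which matches the wording in the printed output.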


In [110]:
positive_words = pd.Series(get_words(data[data['sentiment'] == 1].text_.str.lower().str.split()))
negative_words = pd.Series(get_words(data[data['sentiment'] == 0].text_.str.lower().str.split()))

In [111]:
neg = list(negative_words.value_counts().index)
pos = list(positive_words.value_counts().index)

In [112]:
keywords = neg + pos #simply grab all words for keywords
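
One side effect of `neg + pos` is worth noting: any word that appears in both positive and negative reviews ends up in `keywords` twice, and selecting `data[keywords]` then returns that column twice, so the model sees the feature duplicated. The sketch below uses made-up word lists to show an order-preserving de-duplication. It is only an illustration; applying it to the notebook would change the numbers reported below.

```python
# Hypothetical word lists standing in for the notebook's `neg` and `pos`.
neg = ["not", "the", "phone", "bad"]
pos = ["great", "the", "phone", "love"]

keywords_with_repeats = neg + pos

# dict.fromkeys keeps the first occurrence of each word and preserves order.
keywords_unique = list(dict.fromkeys(keywords_with_repeats))

print(len(keywords_with_repeats), len(keywords_unique))
```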

In [113]:
for key in keywords:
    re_string = '[^a-zA-Z]' + key + '[^a-zA-Z]' #to match words with spaces or punctuations around only
    data[str(key)] = data.text_.apply(lambda x: bool(re.search(re_string, str(x), re.IGNORECASE)))
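
A note on the pattern above: `'[^a-zA-Z]' + key + '[^a-zA-Z]'` requires a non-letter character on both sides of the keyword, so a keyword that is the very first or very last thing in a review is not matched. The sketch below contrasts it with a word-boundary pattern. The helper names are hypothetical, and `\b` is not an exact substitute (it treats digits and underscores as word characters), so switching would change the results reported in this notebook.

```python
import re

def has_word_original(key, text):
    """The notebook's pattern: needs a non-letter on each side of the keyword."""
    return bool(re.search('[^a-zA-Z]' + key + '[^a-zA-Z]', text, re.IGNORECASE))

def has_word_boundary(key, text):
    """Alternative sketch: word boundaries, with the keyword escaped."""
    return bool(re.search(r'\b' + re.escape(key) + r'\b', text, re.IGNORECASE))

for text in ["Good phone.", "Very good.", "It works great"]:
    for key in ["good", "great"]:
        print(repr(text), key, has_word_original(key, text), has_word_boundary(key, text))
```

In this toy run, "Good" at the start of a review and "great" at the end are found only by the word-boundary version.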

In [114]:
bnb = BernoulliNB()

In [115]:
overfit = bnb.fit(data[keywords], data.positive_sentiment)

In [116]:
correct_predictions = np.where((overfit.predict(data[keywords]) == data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()

True     903
False     97
dtype: int64

<H3>90% Accuracy! HOORAY!!!</H3> <br>
The first time around, I made a silly but highly consequential mistake: naming the review content column "text". It turns out that "text" was also a keyword, so my script would overwrite this column and begin matching keywords against a column full of boolean values. Whoops!

In [117]:
conf_calc(overfit.predict(data[keywords]), data.positive_sentiment) #Remember, "True" is positive sentiment.

sensitivity is: 93.8% (positives correctly identified)
specificity is: 86.8% (negatives correctly identified)
              predicted false  predicted true
actual_false              434              66
actual_true                31             469


Now we'll try training with a holdout group, on the first 500 rows:

In [118]:
holdout = BernoulliNB().fit(data.iloc[:500][keywords], data.positive_sentiment.iloc[:500])
#the distribution of positive and negative sentiments among our points is even, so splitting it like this
#still avoids class imbalance problems.
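
Splitting by row position assumes the file's row order carries no pattern. A shuffled, stratified split removes that assumption. The sketch below is dependency-free and uses a hypothetical helper name, `stratified_split`; scikit-learn's `train_test_split` offers a `stratify` parameter for the same purpose. It is not what the notebook does, and using it would change the figures that follow.

```python
import random

def stratified_split(labels, test_fraction=0.5, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        cut = int(round(len(indices) * test_fraction))
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: six positive and six negative rows.
labels = [True] * 6 + [False] * 6
train_idx, test_idx = stratified_split(labels)
print(train_idx, test_idx)
```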

And running our model on the holdout set gives:

In [119]:
correct_predictions = np.where((holdout.predict(data.iloc[500:][keywords]) == data.iloc[500:].positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()

True     326
False    174
dtype: int64

In [120]:
conf_calc(holdout.predict(data.iloc[500:][keywords]), data.iloc[500:].positive_sentiment)

sensitivity is: 74.8% (positives correctly identified)
specificity is: 56.3% (negatives correctly identified)
              predicted false  predicted true
actual_false              147             114
actual_true                60             179


A MUCH worse performance than on the training set:

In [121]:
correct_predictions = np.where((holdout.predict(data.iloc[:500][keywords]) == data.iloc[:500].positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()

True     428
False     72
dtype: int64

In [122]:
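# Note: this cell calls `overfit` (fit on all 1,000 rows) rather than `holdout`, so the matrix
# below totals 454 correct, not the 428 that `holdout` scored on these rows in In [121].
conf_calc(overfit.predict(data.iloc[:500][keywords]), data.iloc[:500].positive_sentiment)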

sensitivity is: 94.2% (positives correctly identified)
specificity is: 87.0% (negatives correctly identified)
              predicted false  predicted true
actual_false              208              31
actual_true                15             246


Looks like textbook overfitting! Let's run the holdout model against the entire data set:

In [123]:
correct_predictions = np.where((holdout.predict(data[keywords]) == data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()

True     754
False    246
dtype: int64

In [124]:
conf_calc(holdout.predict(data[keywords]), data.positive_sentiment)

sensitivity is: 86.8% (positives correctly identified)
specificity is: 64.0% (negatives correctly identified)
              predicted false  predicted true
actual_false              320             180
actual_true                66             434


Much worse performance than the "overfit" model.
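
To put the counts above side by side, this small sketch turns the recorded correct/total pairs into accuracies. The dictionary keys are just descriptive labels chosen for this example; the counts are copied from the outputs above.

```python
# (correct, total) pairs copied from the value_counts() outputs above.
results = {
    "overfit, all 1000 rows (its training data)": (903, 1000),
    "holdout, rows 0-499 (its training data)": (428, 500),
    "holdout, rows 500-999 (held out)": (326, 500),
    "holdout, all 1000 rows": (754, 1000),
}

accuracy = {name: correct / total for name, (correct, total) in results.items()}

for name, value in accuracy.items():
    print(f"{name}: {value:.1%}")
```

Note that the "holdout" score on all 1000 rows is just its two halves added together (428 + 326 = 754), so it blends training-set performance with held-out performance. The held-out figure alone is the fairer estimate of how the model generalizes.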

<br><H3>Why The Tendency Towards Positive Sentiment Prediction?</H3><br>
You may have noticed that both of these models predict positive more often than negative, which gives a higher sensitivity (positives correctly identified) and a lower specificity (negatives correctly identified). Why is that? It could have something to do with the length of reviews: maybe negative reviews are long and positive ones are short, so the model reads reviews with fewer words (fewer active features) as positive and only the longest (more active features) as negative. Let's take a look:<br><br>

In [125]:
positive_words = pd.Series(data[data['sentiment'] == 1].text_.str.lower().str.split())
negative_words = pd.Series(data[data['sentiment'] == 0].text_.str.lower().str.split())
data['pos_word_count'] = positive_words.apply(lambda x: len(x))
data['neg_word_count'] = negative_words.apply(lambda x: len(x))
data[['pos_word_count', 'neg_word_count']].describe()

       pos_word_count  neg_word_count
count      500.000000      500.000000
mean         9.914000       10.578000
std          6.785772        6.578028
min          1.000000        1.000000
25%          4.000000        5.000000
50%          8.000000       10.000000
75%         14.000000       15.000000
max         30.000000       30.000000


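Sure enough, negative reviews tend to be a little longer (a mean of 10.6 words versus 9.9), so this could explain our models' tendency to predict positive more often. The gap is small, though, so this is weak evidence at best.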
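
One mechanism by which review length could influence the prediction: a Bernoulli Naive Bayes model scores every vocabulary word for every review, multiplying in the probability of a word being present when it is there and the probability of it being absent when it is not. A short review is therefore scored mostly by the words it lacks. The sketch below works this out by hand for a review with no keywords at all. The helper name and all the probabilities are made up for illustration, and class priors and smoothing are left out.

```python
import math

def bernoulli_log_likelihood(present, word_probs):
    """Sum log P(x_w | class) over every vocabulary word, present or absent."""
    total = 0.0
    for word, p in word_probs.items():
        # A present word contributes log(p); an absent word contributes log(1 - p).
        total += math.log(p) if word in present else math.log(1.0 - p)
    return total

# Hypothetical per-class probabilities that a review contains each word.
pos_probs = {"great": 0.30, "love": 0.20, "not": 0.05, "waste": 0.02}
neg_probs = {"great": 0.05, "love": 0.03, "not": 0.30, "waste": 0.15}

empty_review = set()  # a review containing none of the vocabulary words
pos_score = bernoulli_log_likelihood(empty_review, pos_probs)
neg_score = bernoulli_log_likelihood(empty_review, neg_probs)
print(round(pos_score, 4), round(neg_score, 4))
```

Even with no words present the two class scores differ, because the absence of each word counts as evidence. Which way that pushes the real models depends on their fitted probabilities, which this sketch does not reproduce.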

<H2> My Models Vs. Other Data Sets </H2><br>
Let's compare these two models against the other data sets.<br><br>
<H3> imdb Data </H3>

In [126]:
imdb_data = pd.read_csv('imdb_labelled.txt', engine='python', header=None, sep='\t', quoting=3)
imdb_data.columns = ['text_', 'sentiment']
imdb_data['positive_sentiment'] = np.where((imdb_data.sentiment == 1), True, False)
for key in keywords:
    re_string = '[^a-zA-Z]' + key + '[^a-zA-Z]' #to match words with spaces or punctuations around only
    imdb_data[str(key)] = imdb_data.text_.apply(lambda x: bool(re.search(re_string, str(x), re.IGNORECASE)))

In [127]:
correct_predictions = np.where((overfit.predict(imdb_data[keywords]) == imdb_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts()  #'overfit' model against imdb set

True     640
False    360
dtype: int64

In [128]:
conf_calc(overfit.predict(imdb_data[keywords]), imdb_data.positive_sentiment)

sensitivity is: 59.8% (positives correctly identified)
specificity is: 68.2% (negatives correctly identified)
              predicted false  predicted true
actual_false              341             159
actual_true               201             299


In [129]:
correct_predictions = np.where((holdout.predict(imdb_data[keywords]) == imdb_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts() #'holdout' model against imdb set

True     584
False    416
dtype: int64

In [130]:
conf_calc(holdout.predict(imdb_data[keywords]), imdb_data.positive_sentiment)

sensitivity is: 63.6% (positives correctly identified)
specificity is: 53.2% (negatives correctly identified)
              predicted false  predicted true
actual_false              266             234
actual_true               182             318


Both models are really lousy (64.0% and 58.4% accuracy), and this is not a surprise: both overfit our training data. Notice that the "overfit" model tends to guess negative here (542 of its 1000 predictions). Maybe negative IMDb reviews are shorter than positive ones?



In [131]:
positive_words_imdb = pd.Series(imdb_data[imdb_data['sentiment'] == 1].text_.str.lower().str.split())
negative_words_imdb = pd.Series(imdb_data[imdb_data['sentiment'] == 0].text_.str.lower().str.split())
imdb_data['pos_word_count'] = positive_words_imdb.apply(lambda x: len(x))
imdb_data['neg_word_count'] = negative_words_imdb.apply(lambda x: len(x))
imdb_data[['pos_word_count', 'neg_word_count']].describe()

       pos_word_count  neg_word_count
count      500.000000      500.000000
mean        15.128000       13.582000
std         10.102859        9.036293
min          1.000000        1.000000
25%          8.000000        7.000000
50%         13.000000       11.000000
75%         20.000000       19.000000
max         71.000000       56.000000


Sure enough, they are! This supports the idea that just having more words tends to make the models assume the review is negative. The "holdout" model, on the other hand, doesn't exhibit the same behavior (it predicts positive for 552 of the 1000 reviews), so I could be wrong. Note that the "overfit" model performs better overall (though still not great).
<br><H3> Yelp Data </H3>

In [132]:
yelp_data = pd.read_csv('yelp_labelled.txt', engine='python', header=None, sep='\t', quoting=3)
yelp_data.columns = ['text_', 'sentiment']
yelp_data['positive_sentiment'] = np.where((yelp_data.sentiment == 1), True, False)
for key in keywords:
    re_string = '[^a-zA-Z]' + key + '[^a-zA-Z]' #to match words with spaces or punctuations around only
    yelp_data[str(key)] = yelp_data.text_.apply(lambda x: bool(re.search(re_string, str(x), re.IGNORECASE)))

In [133]:
correct_predictions = np.where((overfit.predict(yelp_data[keywords]) == yelp_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts() #'overfit' model against yelp set

True     710
False    290
dtype: int64

In [134]:
conf_calc(overfit.predict(yelp_data[keywords]), yelp_data.positive_sentiment)

sensitivity is: 66.4% (positives correctly identified)
specificity is: 75.6% (negatives correctly identified)
              predicted false  predicted true
actual_false              378             122
actual_true               168             332


In [135]:
correct_predictions = np.where((holdout.predict(yelp_data[keywords]) == yelp_data.positive_sentiment), True, False)
pd.Series(correct_predictions).value_counts() #'holdout' model against yelp set

True     659
False    341
dtype: int64

In [136]:
conf_calc(holdout.predict(yelp_data[keywords]), yelp_data.positive_sentiment)

sensitivity is: 72.8% (positives correctly identified)
specificity is: 59.0% (negatives correctly identified)
              predicted false  predicted true
actual_false              295             205
actual_true               136             364


Our models fare a little better with the Yelp data. Again, the "overfit" model predicts negative more often than the "holdout" model, and it has slightly better overall accuracy (71.0% versus 65.9%). Just for the sake of curiosity, let's look at the mean review length for Yelp:

In [137]:
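# Bug fix: the original run built these masks from imdb_data['sentiment'], so the table
# below (which appears to come from that run) does not split Yelp reviews by their own
# labels. Re-run this cell to refresh it.
positive_words_yelp = pd.Series(yelp_data[yelp_data['sentiment'] == 1].text_.str.lower().str.split())
negative_words_yelp = pd.Series(yelp_data[yelp_data['sentiment'] == 0].text_.str.lower().str.split())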
yelp_data['pos_word_count'] = positive_words_yelp.apply(lambda x: len(x))
yelp_data['neg_word_count'] = negative_words_yelp.apply(lambda x: len(x))
yelp_data[['pos_word_count', 'neg_word_count']].describe()

       pos_word_count  neg_word_count
count      500.000000      500.000000
mean        11.030000       10.758000
std          6.310368        6.212619
min          2.000000        1.000000
25%          6.000000        5.000000
50%         10.000000       10.000000
75%         15.000000       15.000000
max         32.000000       28.000000


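In the table above, the distribution of review lengths is almost identical between the two groups. Treat that with caution: the run that produced the table selected Yelp rows with `imdb_data['sentiment']` rather than `yelp_data['sentiment']`, so the two groups are not the true positive and negative Yelp reviews, and near-identical distributions are what a mask unrelated to the Yelp labels would be expected to produce. The tendency for the "overfit" model to guess negative may still have to do with something else, but this table cannot settle it.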

<H2>Did we overfit?</H2><br><br>
I'd say we did! Our 'overfit' and 'holdout' models reach at most 71% accuracy against the other sets (64.0% and 58.4% on IMDb, 71.0% and 65.9% on Yelp), and the holdout test suggests overfitting even within our training set: 85.6% on the rows the model was fit on versus 65.2% on the held-out rows. I'd say that using every single word as a feature is a pretty lousy strategy overall.

<H2>Conclusion</H2><br>
I came up with two models that were pretty clearly overfit to the training data, although I am still proud of getting above 90% predictive accuracy on the training set.<br><br>
In the process of evaluating these two models, we see that the model named "overfit" seems to predict more negative reviews in general, while the "holdout" model seems a bit more even-keeled. In a real-life application, these details would have important performance impacts for models.<br><br>
While using every single word as a feature was an interesting exercise, it's clearly a bad technique for creating a generally applicable sentiment analysis model. Hand-picking certain one-word features and identifying meaningful n-grams to use as features would be an excellent avenue for improving overall accuracy.
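
As a starting point for the n-gram idea, here is a minimal sketch of extracting bigrams from a tokenized review. The helper name `ngrams` and the sample sentence are made up for this example. Pairs such as ("not", "good") carry sentiment that the individual words do not.

```python
def ngrams(tokens, n=2):
    """Return the list of n-token tuples found in `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical review text, tokenized the same way as in the notebook.
tokens = "Not good at all".lower().split()
bigrams = ngrams(tokens)
print(bigrams)
```

Each bigram could then become a boolean feature in the same way the single words were, ideally keeping only those that occur often enough to be meaningful.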