# Naive Bayes Classifier

Sticking to multinomial now, as I've seen it used much more

Here's a really in-depth resource explaining the entire intuition about NB sentiment analysis: https://web.stanford.edu/~jurafsky/slp3/4.pdf

In [1]:
import pandas as pd
import nltk

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score

In [2]:
reviews = pd.read_csv("comments_preproc.csv", index_col=0).sample(n=100000, random_state=1)
reviews.dropna(subset=["cleaned comment"], inplace=True)
reviews.reset_index(inplace=True, drop=True)

In [21]:
#reviews["sentiment"] = reviews["clarityRating"].apply(lambda x: 1 if x > 3 else 0 if x == 3 else -1)
reviews["sentiment"] = reviews["clarityRating"].apply(lambda x: 1 if x > 2.5 else 0)
reviews["sentiment"].value_counts()

1    73735
0    26259
Name: sentiment, dtype: int64

In [22]:
comments_proper = [x for x in reviews["cleaned comment"]]

In [23]:
def getNGrams(n=2):
    all_words = " ".join(comments_proper)
    all_words = all_words.split()               # get full list of unigrams in data
    
    ngs = nltk.ngrams(all_words, n=n)           # find ngrams of data
    
    ngram_vectors = []
    for item in ngs:
        ngram_vectors.append(" ".join(item))    # create vectors for each ngram
    return ngram_vectors

For more info on eval metrics used, 
* Confusion Matrix: https://towardsdatascience.com/understanding-the-confusion-matrix-from-scikit-learn-c51d88929c79
* Precision + Recall: https://en.m.wikipedia.org/wiki/Precision_and_recall
* F1 Score: https://www.educative.io/answers/what-is-the-f1-score

In [24]:
def evalPerformance(y_pred, y_test, mode="weighted"):
    conf_m = confusion_matrix(y_test, y_pred)
    acc_score = accuracy_score(y_test, y_pred)
    print("Accuracy Score: " + str(acc_score * 100) + "\n")
    print("Confusion Matrix: ")
    print(conf_m)
    print()

    precision = precision_score(y_test, y_pred, average=mode)
    recall = recall_score(y_test, y_pred, average=mode)

    print("Precision: {0}".format(precision * 100))
    print("Recall: {0}\n".format(recall * 100))

    f1 = f1_score(y_test, y_pred, average=mode)
    print("F1 Score: {0}".format(f1 * 100))

## Model 1: Unigram-based

In [25]:
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(comments_proper).toarray()       # get list of features (comments)

In [26]:
y = pd.get_dummies(reviews.loc[:, ["sentiment"]])     # isolate sentiments
y = y.loc[:, "sentiment"]

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)        # split into training and testing subsets

In [28]:
model = MultinomialNB().fit(X_train, y_train)           # create and fit model, use it to predict outcomes on test set
y_pred = model.predict(X_test)

In [29]:
evalPerformance(y_test, y_pred, mode="weighted")        # compare predicted and actual outcomes using common performance metrics

Accuracy Score: 84.62423121156057

Confusion Matrix: 
[[ 3714  1537]
 [ 1538 13210]]

Precision: 84.62517433988945
Recall: 84.62423121156057

F1 Score: 84.62470253146363


## Model 2: Bigram-based

Please don't make me do it [this way](https://ai.plainenglish.io/fast-natural-language-processing-101-cf1afefdc449), but it seems I'll have to overhaul my entire system regardless :/

Fundamentally, an NB model based on bigrams *should* be better than a unigram-based one, as we accept some amount of dependency between words in text.

In [31]:
bigram_vector = getNGrams(n=2)
bigram_vector

['nice teacher',
 'teacher know',
 'know stuff',
 'stuff good',
 'good note',
 'note extra',
 'extra credit',
 'credit good',
 'good grade',
 'grade test',
 'test enjoy',
 'enjoy class',
 'class soo',
 'soo nice',
 'nice helpful',
 'helpful max',
 'max dan',
 'dan pronounciation',
 'pronounciation meaning',
 'meaning foreign',
 'foreign classical',
 'classical word',
 'word ,',
 ', impress',
 'impress opinion',
 'opinion 70',
 '70 paper',
 'paper freshman',
 'freshman class',
 'class m',
 'm junior',
 'junior transfer',
 'transfer upenn',
 'upenn ,',
 ', couldn',
 'couldn t',
 't bad',
 'bad prof',
 'prof conlon',
 'conlon accompanist',
 'accompanist ,',
 ', mean',
 'mean accompany',
 'accompany soloist',
 'soloist ,',
 ', good',
 'good piano',
 'piano prof',
 'prof walk',
 'walk ,',
 ', talk',
 'talk perform',
 'perform accompanist',
 'accompanist ,',
 ', want',
 'want real',
 'real artist',
 'artist ,',
 ', seek',
 'seek overbooke',
 'overbooke student',
 'student waiste',
 'waiste s