## Naive Bayes Model

## Due date

March 23, 2023

N.B.: You are allowed to consult lecture notebooks, the kaggle competition, ChatGPT, and Stack Overflow. You are also permitted to work with one other student.

## Assignment description


Your task is to train a Naive Bayes model on the [Sentiment Analysis on Movie Reviews](https://www.kaggle.com/competitions/sentiment-analysis-on-movie-reviews/data). The dataset contains 15,000 movie reviews. Your task is to train a Naive Bayes model on the training set and predict the labels of the test set. You will be evaluated on the methodologies you use to train the model and a comparison of your solutions. Your higest performing model will become your baseline model.

### Data

The dataset contains movie reviews from Rotten Tomatoes. Each review is labeled as follows:

* 0 - negative
* 1 - somewhat negative
* 2 - neutral
* 3 - somewhat positive
* 4 - positive

In addition to the entire review, the reviews are split into phrases and each phrase is labeled. The entire review is assigned a `SentenceID` and each phrase is assigned a `PhraseID`. The phrases were produced by the Stanford Parser (stanza). You are free to use the entire review or the phrases to train your model. In addition, you are free to create additional features from the data (such as Brown Tags). Finally, you are free to adjust the labels to negative, neutral, positive, if a classificaiton report demonstrates that a certain label is underperforming.

## Section 0: Load the data

In [81]:
import pandas as pd
import numpy as np


train_url = 'https://raw.githubusercontent.com/JamesMTucker/DATA_340_NLP/master/assignment_notebooks/data/nb_train.tsv'
test_url = 'https://raw.githubusercontent.com/JamesMTucker/DATA_340_NLP/master/assignment_notebooks/data/nb_test.tsv'

train_df = pd.read_csv(train_url, sep='\t')
test_df = pd.read_csv(test_url, sep='\t')

In [82]:
## train data
train_df.head(20)

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2
5,6,1,of escapades demonstrating the adage that what...,2
6,7,1,of,2
7,8,1,escapades demonstrating the adage that what is...,2
8,9,1,escapades,2
9,10,1,demonstrating the adage that what is good for ...,2


In [83]:
train_df.shape

(156060, 4)

In [84]:
## test data
test_df.head(20)

Unnamed: 0,PhraseId,SentenceId,Phrase
0,156061,8545,An intermittently pleasing but mostly routine ...
1,156062,8545,An intermittently pleasing but mostly routine ...
2,156063,8545,An
3,156064,8545,intermittently pleasing but mostly routine effort
4,156065,8545,intermittently pleasing but mostly routine
5,156066,8545,intermittently pleasing but
6,156067,8545,intermittently pleasing
7,156068,8545,intermittently
8,156069,8545,pleasing
9,156070,8545,but


## Section 1: Data Exploration

In the following section, please explore the data. You should explore the data to understand the following:

* has the data been preprocessed already?
* How many reviews are in the dataset?
* How many phrases are in the dataset?
* What is the distribution of the labels? (i.e. how many reviews are negative, neutral, positive)
* What is the distribution of the labels for each phrase? (i.e. how many phrases are negative, neutral, positive)
* What is the distribution of the words/tokens in the dataset?
* How many unique words are in the dataset?

**has the data been preprocessed already?**

In [85]:
### YOUR CODE HERE
print(train_df['Phrase'])
train_df['Phrase'][63] #pulling a bunch of phrases to look at the data

0         A series of escapades demonstrating the adage ...
1         A series of escapades demonstrating the adage ...
2                                                  A series
3                                                         A
4                                                    series
                                ...                        
156055                                            Hearst 's
156056                            forced avuncular chortles
156057                                   avuncular chortles
156058                                            avuncular
156059                                             chortles
Name: Phrase, Length: 156060, dtype: object


'This quiet , introspective and entertaining independent is worth seeking .'

It looks the data has been preprocessed but punctuation and stopwords have not been removed.

**How many reviews are in the dataset?**

In [86]:
print('the total number of sentences are: ' + str(test_df['SentenceId'].max()))

the total number of sentences are: 11855


**How many Phrases are in the dataset?**

In [87]:
print('the total number of phrases are ' + str(len(train_df) + len(test_df)))

the total number of phrases are 222352


**What is the distribution of the labels? (i.e. how many reviews are negative, neutral, positive)**

In [88]:
train_df.groupby(['SentenceId'])['Sentiment'].value_counts()

SentenceId  Sentiment
1           2            56
            1             4
            3             3
2           2             8
            3             6
                         ..
8543        1             5
            2             3
8544        2            15
            1             4
            3             2
Name: Sentiment, Length: 26547, dtype: int64

In [89]:
test = train_df.loc[train_df['SentenceId'] == 1]
test.loc[test['Sentiment'] == 3]

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
21,22,1,good for the goose,3
22,23,1,good,3
46,47,1,amuses,3


It looks like sentiments are broken up by each phrase and not each review. The whole sentiment is described by the first review in each sentence ID. Let's use that then!

In [90]:
reviews = train_df.drop_duplicates(subset=['SentenceId'],keep='first').copy()
reviews['Sentiment'].value_counts()

3    2321
1    2200
2    1655
4    1281
0    1072
Name: Sentiment, dtype: int64

**What is the distribution of the labels for each phrase? (i.e. how many phrases are negative, neutral, positive)**

In [91]:
train_df['Sentiment'].value_counts() #since each phrase is an entry

2    79582
3    32927
1    27273
4     9206
0     7072
Name: Sentiment, dtype: int64

**What is the distribution of the words/tokens in the dataset?**

In [92]:
reviews

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
63,64,2,"This quiet , introspective and entertaining in...",4
81,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1
116,117,4,A positively thrilling combination of ethnogra...,3
156,157,5,Aggressive self-glorification and a manipulati...,1
...,...,...,...,...
155984,155985,8540,... either you 're willing to go with this cla...,2
155997,155998,8541,"Despite these annoyances , the capable Claybur...",2
156021,156022,8542,-LRB- Tries -RRB- to parody a genre that 's al...,1
156031,156032,8543,The movie 's downfall is to substitute plot fo...,1


Weird that the SentenceId and length are different, let's just check to see where and make sure they aren't in the dataset.

In [93]:
list(set(reviews['SentenceId']).symmetric_difference(set(range(1,8544)))) #find where a sentence id doesn't increase by one

[2628,
 2746,
 4044,
 4365,
 4761,
 5695,
 5916,
 6231,
 6358,
 6673,
 6922,
 7325,
 7473,
 8443,
 8530,
 8544]

In [94]:
train_df.loc[train_df['SentenceId'] == 2628] #check to see what is in that sentence Id

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment


In [95]:
# Unwind the data on the tokens
review_tokens = (reviews.explode('Phrase'))
review_tokens

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
63,64,2,"This quiet , introspective and entertaining in...",4
81,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1
116,117,4,A positively thrilling combination of ethnogra...,3
156,157,5,Aggressive self-glorification and a manipulati...,1
...,...,...,...,...
155984,155985,8540,... either you 're willing to go with this cla...,2
155997,155998,8541,"Despite these annoyances , the capable Claybur...",2
156021,156022,8542,-LRB- Tries -RRB- to parody a genre that 's al...,1
156031,156032,8543,The movie 's downfall is to substitute plot fo...,1


In [96]:
reviews['Phrase'] = [x.replace(' , ', ',') for x in reviews['Phrase']]# commas are weird and cause bugs so make them normal
reviews['Phrase'] = [x.replace(' ', ',') for x in reviews['Phrase']]# make phrase a list so we can explode
reviews['Phrase'] = [x.strip('()').split(',') for x in reviews['Phrase']]

In [97]:
reviews

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,"[A, series, of, escapades, demonstrating, the,...",1
63,64,2,"[This, quiet, introspective, and, entertaining...",4
81,82,3,"[Even, fans, of, Ismail, Merchant, 's, work, I...",1
116,117,4,"[A, positively, thrilling, combination, of, et...",3
156,157,5,"[Aggressive, self-glorification, and, a, manip...",1
...,...,...,...,...
155984,155985,8540,"[..., either, you, 're, willing, to, go, with,...",2
155997,155998,8541,"[Despite, these, annoyances, the, capable, Cla...",2
156021,156022,8542,"[-LRB-, Tries, -RRB-, to, parody, a, genre, th...",1
156031,156032,8543,"[The, movie, 's, downfall, is, to, substitute,...",1


In [98]:
# Unwind the data on the tokens
reviews_tokens = (reviews
                  .explode('Phrase'))
reviews_tokens

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A,1
0,1,1,series,1
0,1,1,of,1
0,1,1,escapades,1
0,1,1,demonstrating,1
...,...,...,...,...
156039,156040,8544,'s,2
156039,156040,8544,forced,2
156039,156040,8544,avuncular,2
156039,156040,8544,chortles,2


In [99]:
term_frequency = (reviews_tokens
                  .groupby(by=['Phrase','Sentiment'])
                  .agg({'Phrase': 'count'})
                  .rename(columns={'Phrase': 'term_count'})
                  .reset_index()
                  .rename(columns={'Phrase': 'term'})
                 )

In [100]:
term_frequency

Unnamed: 0,term,Sentiment,term_count
0,,1,4
1,,2,6
2,,3,4
3,,4,2
4,!,0,9
...,...,...,...
33923,zone,1,2
33924,zone,3,2
33925,zone,4,2
33926,zoning,1,1


In [101]:
term_frequency.sort_values(by=['term_count'],axis=0, ascending = False)

Unnamed: 0,term,Sentiment,term_count
103,.,3,2183
101,.,1,2047
30837,the,3,1773
102,.,2,1477
30835,the,1,1476
...,...,...,...
13386,deform,2,1
13385,deflated,1,1
13384,definitive,4,1
13383,definitive,2,1


as expected, the stopwords are most common

**How many unique words are in the dataset?**

In [102]:
len(term_frequency) #each token has it's own row so the length is the amount of words

33928

## Section 2: Data Preprocessing

In the following section, please preprocess the data. How you preprocess the data will align with what features you are engineering for your model. This might include: tokenization, lemmatization, stemming, removing stopwords, etc. You should also consider how you will handle the labels. You might consider the following: one-hot encoding, label encoding, etc.

In [103]:
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
import string
from nltk.stem.porter import *

In [104]:
#!pip install nltk

In [105]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [106]:
punct = string.punctuation
punctuation = list(punct.replace('',',')) #make punctuation a list
stopword =  stopwords.words('english')  + punctuation + ['','!?',"''"] #all stopwords and punct are here
stemmer = PorterStemmer()

In [107]:
### YOUR CODE HERE
train_df['Phrase'] = train_df.apply(lambda row: nltk.word_tokenize(row['Phrase']), axis=1) #tokenize
train_df['Phrase'] = train_df.apply(lambda row: [word for word in row['Phrase'] if word.lower() not in stopword],axis = 1)#remove stopwords
train_df['Phrase'] = train_df.apply(lambda row: [stemmer.stem(i) for i in row['Phrase']],axis = 1)#stem
s = train_df['Phrase'].str.len() == 0 #remove empty lists - they have no predictive power
train_df = train_df[~s]

## Section 3: Feature Engineering

In this section, you should engineer features for your model. You could consider the following: bag of words, tf-idf, brown tags.

In [108]:
### YOUR CODE HERE
docs = train_df.drop_duplicates(subset=['SentenceId'],keep='first') #only look at the first sentence(kernal keeps crashing otherwise)
#count document frequency
doc_frequency = (docs.explode('Phrase') 
                     .groupby(['SentenceId','Phrase'])
                     .size()
                     .unstack()
                     .sum()
                     .reset_index()
                     .rename(columns={0: 'document_frequency'}))

In [109]:
#count term frequency
term_frequency = (docs.explode('Phrase')
                  .groupby(by=['Phrase','Sentiment'])
                  .agg({'Phrase': 'count'})
                  .rename(columns={'Phrase': 'term_frequency'})
                  .reset_index()
                  .rename(columns={'Phrase': 'token'})
                 )

In [110]:
term_frequency = term_frequency.merge(doc_frequency,left_on='token',right_on='Phrase')#join into one table
term_frequency.drop(['Phrase'],axis=1,inplace = True)
term_frequency['idf'] = np.log((1 + 8544) / (1 + term_frequency['document_frequency'])) + 1 #calculate idf
term_frequency['tfidf'] = term_frequency['term_frequency'] * term_frequency['idf'] #calculate tfidf
term_frequency.sort_values(by=['term_frequency'], ascending=False)

Unnamed: 0,token,Sentiment,term_frequency,document_frequency,idf,tfidf
41,'s,3,722,2517.0,2.221881,1604.198357
39,'s,1,651,2517.0,2.221881,1446.444779
40,'s,2,514,2517.0,2.221881,1142.047030
8054,film,3,395,1290.0,2.889929,1141.522036
42,'s,4,337,2517.0,2.221881,748.774025
...,...,...,...,...,...,...
10511,hurley,2,1,3.0,8.666807,8.666807
10512,hurri,0,1,2.0,8.954489,8.954489
10513,hurri,1,1,2.0,8.954489,8.954489
10514,hurt,0,1,6.0,8.107191,8.107191


In [111]:
#a few punctuations snuck through, these have no predictive value
idx = term_frequency.loc[term_frequency['token'] == "'s"].index
term_frequency.drop(list(idx),inplace = True)
idx = term_frequency.loc[term_frequency['token'] == "n't"].index
term_frequency.drop(list(idx),inplace = True)
term_frequency.sort_values(by=['term_frequency'], ascending=False)

Unnamed: 0,token,Sentiment,term_frequency,document_frequency,idf,tfidf
8054,film,3,395,1290.0,2.889929,1141.522036
14024,movi,1,311,1128.0,3.024014,940.468364
8052,film,1,287,1290.0,2.889929,829.409682
14026,movi,3,253,1128.0,3.024014,765.075550
8055,film,4,236,1290.0,2.889929,682.023292
...,...,...,...,...,...,...
10511,hurley,2,1,3.0,8.666807,8.666807
10512,hurri,0,1,2.0,8.954489,8.954489
10513,hurri,1,1,2.0,8.954489,8.954489
10514,hurt,0,1,6.0,8.107191,8.107191


In [112]:
idx = term_frequency.loc[term_frequency['token'] == "10"].index
term_frequency.drop(list(idx),inplace = True)

In [113]:
token = list(term_frequency['token'])
sent = list(term_frequency['Sentiment'])
freqs = list(term_frequency['term_frequency'])

y = term_frequency[['Sentiment']]

In [114]:
freq_dict= {}
x = list(zip(token,sent))
for i in range(len(freqs)):
    freq_dict[x[i]] = freqs[i]

In [None]:
freq_dict

## Section 4: Model Training

In this section, you should engineer at least two models. You should train each model on the training set and evaluate the performance of each model on the test set. You should also compare the performance of each model. You should also explain the performance of each model.

In [64]:
#make a dictionary for our classifier input
terms = list(term_frequency['token'])
times = list(term_frequency['term_frequency'])
mydict = {}
for i in range(len(terms)):
  mydict[terms[i]] = times[i]
#mydict

In [125]:
sub = train_df.iloc[:50000,:]#ram is crashing make a subset

In [126]:
X = sub['Phrase']  # feature variable
y = sub['Sentiment']  # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [127]:
#take from Jurafsky and Martin psuedo code
logpriors = {}
probs = {0:{}, 1:{},2:{},3:{},4:{}}
N_doc = X_train.shape[0]  # number of documents in training set
V = set([pair[0] for pair in freq_dict.keys()])  # vocab
for i in y_train.unique():  # for every class
    bigdoc = []
    N_c = len(y_train.loc[y_train == i])  # total number of docs in class
    logpriors['logprior' + str(i)] = np.log(N_c / N_doc)  # get the log prior
    for x in freq_dict.keys():  # for every word
        if x[1] == 1 and x[0] in X_train.values:
            bigdoc.append(x)  # all words in documents with class i that appear in training set
    words = [n[0] for n in bigdoc] 

    for w in V:
        if w in words:
            N_cw = freq_dict[bigdoc[words.index(w)]]  # number of times word occurs in bigdoc with class i
        else:
            N_cw = 0
        N_w = len(bigdoc)  # total number of words in bigdoc with class i
        probs[i][w] = np.log((N_cw + 1) / (N_w + len(V) + 1))  # probabilities for each class set


In [128]:
correct = 0  # initialize counter for correct predictions
for j in range(X_test.shape[0]):  # iterate over each entry in testing set
    phrase = X_test.iloc[j]  # get the phrase
    score = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}  # initialize scores
    for i in y_train.unique():
        score[i] = logpriors['logprior'+str(i)]  # get the logprior for each class
        for word in phrase:
            score[i] += probs[i].get(word, 0)  # add probability if word is in V for class i
    pred = np.argmax(list(score.values()))  # get prediction
    if pred == y_test.iloc[j]:
        correct += 1  # increment counter if prediction is correct
accuracy = correct / X_test.shape[0]
print(f"Accuracy: {accuracy:.2%}")  # print accuracy as percentage

Accuracy: 53.17%


Not so bad especially since I had to create a subset of the data

### Using TF-IDF

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

In [None]:
#I need to read it back in since I altered the train_df for phrase to be a list earlier
train_url = 'https://raw.githubusercontent.com/JamesMTucker/DATA_340_NLP/master/assignment_notebooks/data/nb_train.tsv'
test_url = 'https://raw.githubusercontent.com/JamesMTucker/DATA_340_NLP/master/assignment_notebooks/data/nb_test.tsv'

train_df = pd.read_csv(train_url, sep='\t')
test_df = pd.read_csv(test_url, sep='\t')

In [13]:
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2')#create a tfidf object
x = tfidf.fit_transform(sub.Phrase).toarray() 

In [12]:
xtrain,xtest,ytrain,ytest = train_test_split(x,sub['Sentiment'], test_size = .3) #create a tts

In [18]:
MNB = MultinomialNB()#creare mnb object
MNB.fit(xtrain,ytrain)#fit it

In [23]:
from sklearn import metrics
predicted = MNB.predict(xtest)#predict
Accuracy_score = metrics.accuracy_score(predicted,ytest)#get score
Accuracy_score

0.5946666666666667

## Section 5: Model Evaluation

In this section, you should evaluate the performance of your model. You should consider the following: accuracy, precision, recall, f1-score, confusion matrix, classification report, etc.

In [24]:
### YOUR CODE HERE
metrics.confusion_matrix(predicted,ytest)

array([[  21,    8,    1,    0,    0],
       [ 193,  523,  294,   45,    2],
       [ 392, 1782, 7155, 1921,  362],
       [   6,   45,  541, 1192,  479],
       [   0,    1,    3,    5,   29]])

It looks like we are really struggling on the extremes. It doesn't look like our model has much confidence to guess either a class 0 or class 4. 

When looking at the accuracy, the MNB using tf-idf performed a bit better. The word frequency one did okay, but it is dissapointing that I had to limit the model size on both. I think the term frequency one is impacted more by having less of a train data set, but both definitely could have been better. 

I would use precision recall F-1 Score, but I'm not confident in my understanding of them in multi class classifications.

## Section 6: Summary

Please answer the following questions:
    
* What is the performance of your model?



* What are the limitations of your model?



* What are the strengths of your model?



* What is the assumption of Naive Bayes? How does the assumption introduce bias into your model?



* Why do you think Naive Bayes is a popular model?
    

My MNB model is performing okay with 60% accuracy. My hand written model is performing slightly worse with 53% accuracy. I didn't engineer a way to see where my handwritten model was missing, so I can't provide insights as to why the accuracy is lower. 

Both my handwritten model and my TF-IDF MNB model are definitely limited by RAM. Neither model would be performing well when it sees text that it hasn't seen before. This is because it will have no sentiment counts from the training data and will be guessing based off of logprior in the handwritten version. in the TF-IDF version, it has no TF-IDF to go off of.

The strength of the handwritten model are that it is easily debugable. It can also be manipulated easily to fit the dataset. The strength of the MNB model is that it uses TF-IDF and has a fast run time.

The assumption of the niave bayes model is that the features are unrelated to each other. That is that the presence of one word has no implication on the others. This is implemented by getting either the value counts or the tf-idf of just a word in the phrase.  

I think the Naive Bayes model was popularized because of the speed that it can run at as well as the simplicity of the math behind the model. I also think it is still able to be used now to see if there is predictive value in your current data set. If you are getting results here, then transitioning into a higher cost model like neural networks is worth your time to get a better model.