# POS Tagging, n-grams (uni/bi/tri), Naive Bayes Classifier
## Prathamesh Patil
### 09/22/2022

#### Task 1: Load the dataset 
- Import the training set and a test set (csv files).
    - Note: The data has already been preprocessed. No additional preprocessing is expected. 
- List down the number of reviews in the training set and the test set.
- Remove the reviews that have blank (empty) CleanedReview from both training set and the test sets.
- List down the number of reviews in the training set and the test set after removing empty values. 

In [1]:
# Importing all the necessary libraries

import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read the training corpus from a CSV file into a DataFrame and print its shape.
df_train = pd.read_csv('Books_preprocessed_train_data.csv', skipinitialspace = True)
print(df_train.shape)

(13000, 20)


There are **13000** instances in the training dataset, with **20** attributes for each instance.

In [3]:
# Read the test corpus from a CSV file into a DataFrame and print its shape.

df_test = pd.read_csv('Books_preprocessed_test_data.csv',skipinitialspace = True)
print(df_test.shape)

(1602, 20)


There are **1602** instances in the test dataset, with **20** attributes for each instance.

In [4]:
df_train['CleanedReview'].isnull().sum()

4

In [5]:
df_test['CleanedReview'].isnull().sum()

0

Since only **4** of the **13,000** training reviews have a null CleanedReview, dropping them is the most sensible option and will not affect the analysis. Had there been many more, we would instead have filled them with NA or a placeholder string such as 'No review available'.

In [6]:
# Dropping null values

df_train = df_train[pd.notnull(df_train['CleanedReview'])]
df_train = df_train.reset_index(drop=True)

In [7]:
df_train['CleanedReview'].isnull().sum()

0

In [8]:
print(df_train.shape)

(12996, 20)


- There are **12996** instances in the **training** dataset, with **20** attributes for each instance.
- There are **1602** instances in the **testing** dataset, with **20** attributes for each instance.

#### Task 2: POS Tagging

-  Make a copy of the training data for Task 2. Implement the below steps on this copy to refrain from editing the actual training data that will be used for further tasks.
- Using a package of your choice (e.g. NLTK in Python), perform part-of-speech (POS) tagging of the words. 
    - Input:
        - TokenizedReview: ['this', 'book', 'is', 'super', 'annoying', 'to', 'read', 'it', 's', 'so', 
        'repetitive']
    - Expected Output:
        - PosTaggedReview: [[('this', ['DT']), ('book', ['NN']), ('is', ['VBZ']), ('super', ['JJ']), 
        ('annoying', ['VBG']), ('to', ['TO']), ('read', ['VB']), ('it', ['PRP']), ('s', ['PRP']), ('so', 
        ['RB']), ('repetitive', ['JJ'])]]
- Explain what parts of speech could be useful for sentiment analysis and why?
- Report the POS-tagging results for 3 examples in the dataset.

In [9]:
from nltk import pos_tag
from nltk import RegexpParser
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Nemo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [10]:
# .copy() creates an independent DataFrame; plain assignment would only alias df_train
df_train_copy = df_train.copy()

In [11]:
import ast

def posTagger(review):
    # TokenizedReview is stored as the string form of a list; parse it safely
    wordlist = ast.literal_eval(review)
    tokens_tag = pos_tag(wordlist)
    return tokens_tag

df_train_copy["PosTaggedReview"] = df_train_copy.apply(lambda row : posTagger(row["TokenizedReview"]), axis = 1)

In [12]:
df_train_copy.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,...,image,OriginalReview,Mentions,CleanedReview,TokenizedReview,StopwordRemovedReview,StemmedReview,BiGrams,ProcessedReview,PosTaggedReview
0,neutral,False,"09 18, 2006",A294QSAEH1Z7YI,1713353,{'Format:': ' Hardcover'},BHGobuchul,41 years later:\n\nThe cheese is government ch...,"Outdated, but a good 1960s primer",1158537600,...,are,41 years later:\n\nthe cheese is government ch...,,41 years later the cheese is government cheese...,"['41', 'years', 'later', 'the', 'cheese', 'is'...","['41', 'years', 'later', 'cheese', 'government...","['41', 'year', 'later', 'the', 'chees', 'is', ...","[('41', 'years'), ('years', 'later:'), ('later...",41 year later the chees is govern chees the mi...,"[(41, CD), (years, NNS), (later, RB), (the, DT..."
1,negative,True,"03 1, 2015",A3ZG0U3FOF3T1,1061240,{'Format:': ' Hardcover'},P. Howell,Looking for a Louis Untermeyer book from the ...,Two Stars,1425168000,...,are,looking for a louis untermeyer book from the ...,,looking for a louis untermeyer book from the 1...,"['looking', 'for', 'a', 'louis', 'untermeyer',...","['looking', 'louis', 'untermeyer', 'book', '19...","['look', 'for', 'a', 'loui', 'untermey', 'book...","[('looking', 'for'), ('for', 'a'), ('a', 'loui...",look for a loui untermey book from the 1980 an...,"[(looking, VBG), (for, IN), (a, DT), (louis, J..."
2,neutral,False,"05 18, 2002",AJ8AQG2X9JJ2Y,1712799,{'Format:': ' School & Library Binding'},Donald Gillies,Dr. Seuss has some really brilliant books. Th...,A below-average Dr. Seuss Book,1021680000,...,are,dr. seuss has some really brilliant books. th...,,dr seuss has some really brilliant books this ...,"['dr', 'seuss', 'has', 'some', 'really', 'bril...","['dr', 'seuss', 'really', 'brilliant', 'books'...","['dr', 'seuss', 'ha', 'some', 'realli', 'brill...","[('dr.', 'seuss'), ('seuss', 'has'), ('has', '...",dr seuss ha some realli brilliant book thi boo...,"[(dr, NN), (seuss, NN), (has, VBZ), (some, DT)..."
3,negative,True,"02 20, 2016",A2M08SO0PJKPAV,1712799,{'Format:': ' Hardcover'},Emily,Completly boring!!! Yes it's a childerns book ...,Don't waste your money,1455926400,...,are,completly boring!!! yes it's a childerns book ...,,completly boring yes it s a childerns book tha...,"['completly', 'boring', 'yes', 'it', 's', 'a',...","['completly', 'boring', 'yes', 'childerns', 'b...","['completli', 'bore', 'ye', 'it', 's', 'a', 'c...","[('completly', 'boring!!!'), ('boring!!!', 'ye...",completli bore ye it s a childern book that th...,"[(completly, RB), (boring, VBG), (yes, NNS), (..."
4,neutral,False,"07 8, 2004",A1JS302JFHH9DJ,2006448,{'Format:': ' Hardcover'},Daniel H. Bigelow,The Carpet Wars is a sampler of informal writi...,Painless Education,1089244800,...,are,the carpet wars is a sampler of informal writi...,,the carpet wars is a sampler of informal writi...,"['the', 'carpet', 'wars', 'is', 'a', 'sampler'...","['carpet', 'wars', 'sampler', 'informal', 'wri...","['the', 'carpet', 'war', 'is', 'a', 'sampler',...","[('the', 'carpet'), ('carpet', 'wars'), ('wars...",the carpet war is a sampler of inform write fr...,"[(the, DT), (carpet, NN), (wars, NNS), (is, VB..."


*I think verbs, nouns, adverbs and adjectives are the most important parts of speech for sentiment analysis, because most of the sentiment in a review is carried by these four.*

*For instance, verbs like love, hate and enjoy directly express feelings. Similarly, adverbs like beautifully, sadly, happily or angrily carry a lot of weight in a sentence. Adjectives like awesome, superb and disgusting help in analyzing the reviewer's emotions. Nouns matter least of the four, because words like boy, book, character or a character's name mostly fall into the neutral category.*
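
The idea above can be sketched as a small filter that keeps only tokens whose Penn Treebank tag belongs to one of these four families. This is an illustration only: `keep_sentiment_tokens` and `SENTIMENT_TAG_PREFIXES` are hypothetical names that are not part of the notebook, and the tagged tokens are copied from the notebook's output for the first review.

```python
# Penn Treebank tag prefixes for adjectives, adverbs, verbs and nouns.
SENTIMENT_TAG_PREFIXES = ("JJ", "RB", "VB", "NN")


def keep_sentiment_tokens(tagged_review, prefixes=SENTIMENT_TAG_PREFIXES):
    """Keep only (word, tag) pairs whose tag starts with one of the prefixes."""
    return [(word, tag) for word, tag in tagged_review if tag.startswith(prefixes)]


# First few tagged tokens of the first training review.
tagged = [("41", "CD"), ("years", "NNS"), ("later", "RB"), ("the", "DT"),
          ("cheese", "NN"), ("is", "VBZ"), ("government", "NN")]
filtered = keep_sentiment_tokens(tagged)
print(filtered)
```

The cardinal number (`CD`) and the determiner (`DT`) are dropped, while the noun, adverb and verb tokens are kept.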

In [13]:
print("----Review 1----\n")
print("TokenizedReview:",df_train_copy.loc[0]["TokenizedReview"])
print("\n")
print("PosTaggedReview:", df_train_copy.loc[0]["PosTaggedReview"])
print("\n")

----Review 1----

TokenizedReview: ['41', 'years', 'later', 'the', 'cheese', 'is', 'government', 'cheese', 'the', 'mice', 'objected', 'to', 'the', 'king', 's', 'idea', 'of', 'good', 'manners', 'as', 'species', 'centric', 'and', 'rebelled', 'the', 'king', 'blamed', 'the', 'peasants', 'and', 'forbade', 'them', 'to', 'keep', 'cats', 'or', 'chase', 'mice', 'from', 'their', 'homes', 'this', 'made', 'things', 'worse', 'peasants', 'that', 'could', 'afford', 'to', 'do', 'so', 'moved', 'as', 'far', 'away', 'from', 'mice', 'as', 'possible', 'i', 'can', 't', 'wait', 'for', 'the', 'next', 'chapter']


PosTaggedReview: [('41', 'CD'), ('years', 'NNS'), ('later', 'RB'), ('the', 'DT'), ('cheese', 'NN'), ('is', 'VBZ'), ('government', 'NN'), ('cheese', 'VBG'), ('the', 'DT'), ('mice', 'NN'), ('objected', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('king', 'NN'), ('s', 'JJ'), ('idea', 'NN'), ('of', 'IN'), ('good', 'JJ'), ('manners', 'NNS'), ('as', 'IN'), ('species', 'NNS'), ('centric', 'VBP'), ('and', 'CC'), ('

In [14]:
print("----Review 2----\n")
print("TokenizedReview:",df_train_copy.loc[4]["TokenizedReview"])
print("\n")
print("PosTaggedReview:", df_train_copy.loc[4]["PosTaggedReview"])
print("\n")

----Review 2----

TokenizedReview: ['the', 'carpet', 'wars', 'is', 'a', 'sampler', 'of', 'informal', 'writing', 'from', 'australian', 'journalist', 'and', 'avid', 'carpet', 'collector', 'christopher', 'kremmer', 'over', 'ten', 'years', 'in', 'central', 'asia', 'since', 'most', 'of', 'it', 'was', 'written', 'and', 'concerns', 'events', 'before', '9', '11', 'when', 'the', 'area', 'was', 'not', 'established', 'in', 'the', 'west', 's', 'cultural', 'radar', 'as', 'it', 'is', 'today', 'it', 'gives', 'a', 'view', 'of', 'the', 'region', 'that', 'is', 'uncluttered', 'by', 'hindsight', 'reevaluations', 'kremmer', 'writes', 'of', 'his', 'time', 'in', 'afghanistan', 'pakistan', 'tajikstan', 'kashmir', 'and', 'iran', 'giving', 'us', 'colorful', 'and', 'non', 'journalistic', 'slices', 'of', 'life', 'from', 'each', 'region', 'he', 'enlivens', 'his', 'writings', 'with', 'vivid', 'character', 'studies', 'of', 'those', 'he', 'met', 'on', 'his', 'travels', 'from', 'dignitaries', 'like', 'ill', 'fated', '

In [15]:
print("----Review 3----\n")
print("TokenizedReview:",df_train_copy.loc[789]["TokenizedReview"])
print("\n")
print("PosTaggedReview:", df_train_copy.loc[789]["PosTaggedReview"])
print("\n")

----Review 3----

TokenizedReview: ['please', 'note', 'that', 'this', 'review', 'concerns', 'only', 'the', 'new', 'publications', 'the', 'chronicles', 'of', 'narnia', 'are', 'perfect', 'books', 'they', 'are', 'wonderful', 'for', 'children', 'and', 'adults', 'and', 'can', 'be', 'read', 'again', 'and', 'again', 'c', 's', 'lewis', 'was', 'a', 'brilliant', 'author', 'and', 'theologian', 'and', 'was', 'competent', 'in', 'what', 'he', 'was', 'doing', 'i', 'have', 'been', 'reading', 'these', 'books', 'since', 'i', 'was', 'young', 'enough', 'to', 'pick', 'up', 'a', 'book', 'and', 'i', 'was', 'horrified', 'when', 'i', 'found', 'out', 'they', 'were', 'reprinting', 'them', 'in', 'chronological', 'order', 'why', 'have', 'the', 'publishers', 'decided', 'to', 'tamper', 'with', 'the', 'order', 'reading', 'these', 'books', 'in', 'chronological', 'order', 'spoils', 'all', 'of', 'the', 'surprise', 'and', 'magic', 'out', 'of', 'the', 'first', 'visit', 'to', 'narnia', 'in', 'the', 'lion', 'the', 'witch', 

#### Task 3: Extract unigram features
- Extract unigrams from the OriginalReview column in the training set. 
- Fit the unigrams to the OriginalReview column to generate a feature vector for each review in the training set and testing set. Note that we are using count-based features (i.e., feature values should be the number of times a specific unigram appears in the sentence). 
- Report the number of features in the training set and the test set.

In [16]:
from nltk.util import ngrams


def ngrammar(review,n = 1):
    unigrams_list = []
    unigrams_from_reviews = ngrams(review.split(), n)
    for item in unigrams_from_reviews:
        unigrams_list.append(item)
    return unigrams_list
        

unis = df_train.apply(lambda row : ngrammar(row["OriginalReview"]), axis = 1)

In [17]:
unis.head()

0    [(41,), (years,), (later:,), (the,), (cheese,)...
1    [(looking,), (for,), (a,), (louis,), (untermey...
2    [(dr.,), (seuss,), (has,), (some,), (really,),...
3    [(completly,), (boring!!!,), (yes,), (it's,), ...
4    [(the,), (carpet,), (wars,), (is,), (a,), (sam...
dtype: object

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

train_og_review = df_train['OriginalReview']
test_og_review = df_test['OriginalReview']

cv = CountVectorizer(ngram_range = [1,1]) #for unigrams

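cv.fit(train_og_review)
# NOTE: fit() relearns the vocabulary from scratch, so this second call discards
# the vocabulary learned from the training set and keeps only the test-set one.
cv.fit(test_og_review)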

CountVectorizer(ngram_range=[1, 1])

In [19]:
train_vector = cv.transform(train_og_review)
print("Train Vector Shape:",train_vector.shape)

test_vector = cv.transform(test_og_review)
print("Test Vector Shape:", test_vector.shape)

Train Vector Shape: (12996, 6110)
Test Vector Shape: (1602, 6110)
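
To make the count-based features concrete, here is a minimal pure-Python sketch of what a count vectorizer computes: the vocabulary is learned from the training texts only, and that same vocabulary is then used to encode any other text. It assumes whitespace tokenization and toy sentences. `extract_ngrams`, `build_vocabulary` and `count_vector` are hypothetical helpers, not scikit-learn functions, and this is not how `CountVectorizer` is implemented internally (its default tokenizer also lowercases the text and drops punctuation and single-character tokens).

```python
from collections import Counter


def extract_ngrams(tokens, n_min=1, n_max=1):
    """Return all n-grams (as space-joined strings) for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(texts, n_min=1, n_max=1):
    """Map each n-gram seen in `texts` to a column index (sorted order)."""
    seen = set()
    for text in texts:
        seen.update(extract_ngrams(text.split(), n_min, n_max))
    return {gram: idx for idx, gram in enumerate(sorted(seen))}


def count_vector(text, vocabulary, n_min=1, n_max=1):
    """Encode one text as n-gram counts; n-grams outside the vocabulary are ignored."""
    counts = Counter(extract_ngrams(text.split(), n_min, n_max))
    vector = [0] * len(vocabulary)
    for gram, count in counts.items():
        if gram in vocabulary:
            vector[vocabulary[gram]] = count
    return vector


# Toy data for illustration only.
train_texts = ["good book good story", "boring book"]
test_text = "good good ending"

vocab = build_vocabulary(train_texts)        # learned from the training texts only
test_vec = count_vector(test_text, vocab)    # "ending" is unseen and therefore dropped
bigram_vocab = build_vocabulary(train_texts, 1, 2)  # unigrams plus bigrams
print(vocab)
print(test_vec)
print(len(bigram_vocab))
```

The number of features equals the vocabulary size, which is why the training and test matrices share the same number of columns, and why adding bigrams enlarges the feature space.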


#### Task 4: Train and evaluate classifier
- With unigram features generated in step 3, train a Naïve Bayes classifier. 
- After the training, apply the classifier to the test set and calculate the following performance metrics: 
    - Overall (average) accuracy, precision, recall and F1 score of the classification system.
    - Accuracy, precision, recall and F1 score for each label: Positive, Negative and Neutral.

In [20]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

# features and true labels for the training and test sets
x_train = train_vector
y_train = df_train["overall"]
x_test = test_vector
y_test = df_test["overall"]

# fit the model on the training set and predict on the test set


model = MultinomialNB()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

In [21]:
print ("Accuracy score: ", accuracy_score(y_test, predictions))
print ("Precision score: ", precision_score(y_test, predictions, average = None))
print ("Recall score: ", recall_score(y_test, predictions, average = None))
print ("F1 score: ", f1_score(y_test, predictions, average = None))

Accuracy score:  0.9101123595505618
Precision score:  [0. 0. 1.]
Recall score:  [0.         0.         0.91011236]
F1 score:  [0.         0.         0.95294118]


In [22]:
print ("Individual label performance: ")
print (classification_report(y_test, predictions))
print (confusion_matrix(y_test, predictions))

Individual label performance: 
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         0
     neutral       0.00      0.00      0.00         0
    positive       1.00      0.91      0.95      1602

    accuracy                           0.91      1602
   macro avg       0.33      0.30      0.32      1602
weighted avg       1.00      0.91      0.95      1602

[[   0    0    0]
 [   0    0    0]
 [  67   77 1458]]
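
The per-label numbers in the report can be recomputed by hand from the confusion matrix, which also shows why the positive label gets a precision of 1.0. The sketch below is a plain-Python illustration; `per_class_metrics` is a hypothetical helper, and it assumes the convention that a metric with a zero denominator is reported as 0.0, which matches the zeros in the report above.

```python
def per_class_metrics(matrix, labels):
    """Compute precision, recall and F1 per label from a confusion matrix.

    matrix[i][j] is the number of samples with true label i predicted as j.
    A metric whose denominator is zero is reported as 0.0.
    """
    results = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        predicted = sum(row[k] for row in matrix)   # column sum: predicted as k
        actual = sum(matrix[k])                     # row sum: truly k
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1)
    return results


# Confusion matrix reported above for the unigram model.
matrix = [[0, 0, 0],
          [0, 0, 0],
          [67, 77, 1458]]
metrics = per_class_metrics(matrix, ["negative", "neutral", "positive"])
for label, (p, r, f1) in metrics.items():
    print(f"{label:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Every review predicted as positive is truly positive (the positive column contains only the 1458 diagonal entries), so precision is 1.0, while recall is 1458 / 1602.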


#### Task 5: Add bigram features
- Extract bigrams from the OriginalReview column in the training set and add these bigrams into the feature space. 
- Fit the new features to the OriginalReview column to generate a feature vector for each sentence in the training set and the test set. Report the number of features of the training set and the test set.
- Repeat task 4.

In [23]:
def ngrammar(review,n = 2):
    # collect every n-gram (bigrams by default) of the whitespace-split review
    ngram_list = []
    ngrams_from_review = ngrams(review.split(), n)
    for item in ngrams_from_review:
        ngram_list.append(item)
    return ngram_list
        

bis = df_train.apply(lambda row : ngrammar(row["OriginalReview"]), axis = 1)

In [24]:
bis.head()

0    [(41, years), (years, later:), (later:, the), ...
1    [(looking, for), (for, a), (a, louis), (louis,...
2    [(dr., seuss), (seuss, has), (has, some), (som...
3    [(completly, boring!!!), (boring!!!, yes), (ye...
4    [(the, carpet), (carpet, wars), (wars, is), (i...
dtype: object

In [25]:
cv2 = CountVectorizer(ngram_range = [1,2]) #for bigrams
train_og_review_bigrams = df_train['OriginalReview']
test_og_review_bigrams = df_test['OriginalReview']

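cv2.fit(train_og_review_bigrams)
# NOTE: fit() relearns the vocabulary from scratch, so this second call discards
# the vocabulary learned from the training set and keeps only the test-set one.
cv2.fit(test_og_review_bigrams)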


train_vector_bigrams = cv2.transform(train_og_review_bigrams)
print("Train Vector Shape:",train_vector_bigrams.shape)

test_vector_bigrams = cv2.transform(test_og_review_bigrams)
print("Test Vector Shape:",test_vector_bigrams.shape)

Train Vector Shape: (12996, 42128)
Test Vector Shape: (1602, 42128)


In [26]:
x_train = train_vector_bigrams
y_train = df_train["overall"]
x_test = test_vector_bigrams
y_test = df_test["overall"]

# build model on the training data
model = MultinomialNB()
model.fit(x_train, y_train)

# predict the labels for the test data
predictions = model.predict(x_test)

In [27]:
print ("Accuracy score: ", accuracy_score(y_test, predictions))
print ("Precision score: ", precision_score(y_test, predictions, average = None))
print ("Recall score: ", recall_score(y_test, predictions, average = None))
print ("F1 score: ", f1_score(y_test, predictions, average = None))

Accuracy score:  0.9038701622971286
Precision score:  [0. 0. 1.]
Recall score:  [0.         0.         0.90387016]
F1 score:  [0.        0.        0.9495082]


In [28]:
print ("Individual label performance: ")
print (classification_report(y_test, predictions))
print (confusion_matrix(y_test, predictions))

Individual label performance: 
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         0
     neutral       0.00      0.00      0.00         0
    positive       1.00      0.90      0.95      1602

    accuracy                           0.90      1602
   macro avg       0.33      0.30      0.32      1602
weighted avg       1.00      0.90      0.95      1602

[[   0    0    0]
 [   0    0    0]
 [  73   81 1448]]


#### Task 6: Add trigram features
- Extract trigrams from the OriginalReview column in the training set and add these trigrams into the feature space. 
- Fit the new features to the OriginalReview column to generate a feature vector for each sentence in the training set and testing set. Report the number of features of the training set and testing set in your documents.
- Repeat task 4.

In [29]:
def ngrammar(review,n = 3):
    # collect every n-gram (trigrams by default) of the whitespace-split review
    ngram_list = []
    ngrams_from_review = ngrams(review.split(), n)
    for item in ngrams_from_review:
        ngram_list.append(item)
    return ngram_list
        
tris = df_train.apply(lambda row : ngrammar(row["OriginalReview"]), axis = 1)

In [30]:
tris.head()

0    [(41, years, later:), (years, later:, the), (l...
1    [(looking, for, a), (for, a, louis), (a, louis...
2    [(dr., seuss, has), (seuss, has, some), (has, ...
3    [(completly, boring!!!, yes), (boring!!!, yes,...
4    [(the, carpet, wars), (carpet, wars, is), (war...
dtype: object

In [31]:
cv3 = CountVectorizer(ngram_range = [1,3]) #for trigrams
train_og_review_trigrams = df_train['OriginalReview']
test_og_review_trigrams = df_test['OriginalReview']

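cv3.fit(train_og_review_trigrams)
# NOTE: fit() relearns the vocabulary from scratch, so this second call discards
# the vocabulary learned from the training set and keeps only the test-set one.
cv3.fit(test_og_review_trigrams)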


train_vector_trigrams = cv3.transform(train_og_review_trigrams)
print("Train Vector Shape:",train_vector_trigrams.shape)

test_vector_trigrams = cv3.transform(test_og_review_trigrams)
print("Test Vector Shape:",test_vector_trigrams.shape)

Train Vector Shape: (12996, 99462)
Test Vector Shape: (1602, 99462)


In [32]:
x_train = train_vector_trigrams
y_train = df_train["overall"]
x_test = test_vector_trigrams
y_test = df_test["overall"]

# build model on the training data
model = MultinomialNB()
model.fit(x_train, y_train)

# predict the labels for the test data
predictions = model.predict(x_test)

In [33]:
print ("Accuracy score: ", accuracy_score(y_test, predictions))
print ("Precision score: ", precision_score(y_test, predictions, average = None))
print ("Recall score: ", recall_score(y_test, predictions, average = None))
print ("F1 score: ", f1_score(y_test, predictions, average = None))

Accuracy score:  0.9038701622971286
Precision score:  [0. 0. 1.]
Recall score:  [0.         0.         0.90387016]
F1 score:  [0.        0.        0.9495082]


In [34]:
print ("Individual label performance: ")
print (classification_report(y_test, predictions))
print (confusion_matrix(y_test, predictions))

Individual label performance: 
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         0
     neutral       0.00      0.00      0.00         0
    positive       1.00      0.90      0.95      1602

    accuracy                           0.90      1602
   macro avg       0.33      0.30      0.32      1602
weighted avg       1.00      0.90      0.95      1602

[[   0    0    0]
 [   0    0    0]
 [  92   62 1448]]


#### Task 7: Add TF-IDF features
- Among the models trained by features in tasks 3, 5 or 6 (unigram, unigram + bigram, unigram + bigram + trigram, respectively) choose the best performing model (based on overall F1 score) and substitute count-based features with TF-IDF features. 
- Repeat task 4. 

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

# this time, we vectorize using TF-IDF
train_og_review_tfidf = df_train['OriginalReview']
test_og_review_tfidf = df_test['OriginalReview']

tf = TfidfVectorizer()
tf.fit(train_og_review_tfidf)

# encode document
data = tf.transform(train_og_review_tfidf)

# summarize encoded vector
print(data.shape,"\n") 

(12996, 35910) 



In [36]:
train_data_tfidf = tf.fit_transform(train_og_review_tfidf)
print(train_data_tfidf.shape,"\n") 

test_data_tfidf = tf.transform(test_og_review_tfidf)
print(test_data_tfidf.shape,"\n") 

idf = tf.idf_

(12996, 35910) 

(1602, 35910) 



In [37]:
# define true labels from train set
x_train = train_data_tfidf
y_train = df_train["overall"]
x_test = test_data_tfidf
y_test = df_test["overall"]


model = MultinomialNB()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print ("Accuracy score: ", accuracy_score(y_test, predictions))
print ("Precision score: ", precision_score(y_test, predictions, average = None))
print ("Recall score: ", recall_score(y_test, predictions, average = None))
print ("F1 score: ", f1_score(y_test, predictions, average = None))

Accuracy score:  0.966916354556804
Precision score:  [0. 0. 1.]
Recall score:  [0.         0.         0.96691635]
F1 score:  [0.         0.         0.98317994]


In [38]:
print ("Individual label performance: ")
print (classification_report(y_test, predictions))
print (confusion_matrix(y_test, predictions))

Individual label performance: 
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         0
     neutral       0.00      0.00      0.00         0
    positive       1.00      0.97      0.98      1602

    accuracy                           0.97      1602
   macro avg       0.33      0.32      0.33      1602
weighted avg       1.00      0.97      0.98      1602

[[   0    0    0]
 [   0    0    0]
 [  16   37 1549]]


#### Task 8:  Train models with other columns
- Among the models trained by features in tasks 3, 5, 6 or 7, choose the best performing feature set, and train Naïve Bayes classifiers on the following columns: **CleanedReview, StopwordRemovedReview, StemmedReview** and record the results

##### For cleanedReview

In [39]:
# this time, we vectorize using TF-IDF
train_clean_review_tfidf = df_train['CleanedReview']
test_clean_review_tfidf = df_test['CleanedReview']

tf = TfidfVectorizer()
tf.fit(train_clean_review_tfidf)

# encode document
data = tf.transform(train_clean_review_tfidf)

# summarize encoded vector
print(data.shape,"\n")

(12996, 35837) 



In [40]:
train_data_tfidf = tf.fit_transform(train_clean_review_tfidf)
print(train_data_tfidf.shape,"\n") 

test_data_tfidf = tf.transform(test_clean_review_tfidf)
print(test_data_tfidf.shape,"\n") 

idf = tf.idf_

(12996, 35837) 

(1602, 35837) 



In [41]:
# define true labels from train set
x_train = train_data_tfidf
y_train = df_train["overall"]
x_test = test_data_tfidf
y_test = df_test["overall"]


model = MultinomialNB()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print ("Accuracy score: ", accuracy_score(y_test, predictions))
print ("Precision score: ", precision_score(y_test, predictions, average = None))
print ("Recall score: ", recall_score(y_test, predictions, average = None))
print ("F1 score: ", f1_score(y_test, predictions, average = None))

Accuracy score:  0.9675405742821473
Precision score:  [0. 0. 1.]
Recall score:  [0.         0.         0.96754057]
F1 score:  [0.         0.         0.98350254]


In [42]:
print ("Individual label performance: ")
print (classification_report(y_test, predictions))
print (confusion_matrix(y_test, predictions))

Individual label performance: 
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         0
     neutral       0.00      0.00      0.00         0
    positive       1.00      0.97      0.98      1602

    accuracy                           0.97      1602
   macro avg       0.33      0.32      0.33      1602
weighted avg       1.00      0.97      0.98      1602

[[   0    0    0]
 [   0    0    0]
 [  16   36 1550]]


##### For StopwordRemovedReview

In [43]:
# this time, we vectorize using TF-IDF
train_swr_review_tfidf = df_train['StopwordRemovedReview']
test_swr_review_tfidf = df_test['StopwordRemovedReview']

tf = TfidfVectorizer()
tf.fit(train_swr_review_tfidf)

# encode document
data = tf.transform(train_swr_review_tfidf)

# summarize encoded vector
print(data.shape,"\n")

(12996, 35691) 



In [44]:
train_data_tfidf = tf.fit_transform(train_swr_review_tfidf)
print(train_data_tfidf.shape,"\n") 

test_data_tfidf = tf.transform(test_swr_review_tfidf)
print(test_data_tfidf.shape,"\n") 

idf = tf.idf_

(12996, 35691) 

(1602, 35691) 



In [45]:
# define true labels from train set
x_train = train_data_tfidf
y_train = df_train["overall"]
x_test = test_data_tfidf
y_test = df_test["overall"]


model = MultinomialNB()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print ("Accuracy score: ", accuracy_score(y_test, predictions))
print ("Precision score: ", precision_score(y_test, predictions, average = None))
print ("Recall score: ", recall_score(y_test, predictions, average = None))
print ("F1 score: ", f1_score(y_test, predictions, average = None))

Accuracy score:  0.9681647940074907
Precision score:  [0. 0. 1.]
Recall score:  [0.         0.         0.96816479]
F1 score:  [0.         0.         0.98382493]


In [46]:
print ("Individual label performance: ")
print (classification_report(y_test, predictions))
print (confusion_matrix(y_test, predictions))

Individual label performance: 
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         0
     neutral       0.00      0.00      0.00         0
    positive       1.00      0.97      0.98      1602

    accuracy                           0.97      1602
   macro avg       0.33      0.32      0.33      1602
weighted avg       1.00      0.97      0.98      1602

[[   0    0    0]
 [   0    0    0]
 [  17   34 1551]]


##### For StemmedReview

In [47]:
# this time, we vectorize using TF-IDF
train_stem_review_tfidf = df_train['StemmedReview']
test_stem_review_tfidf = df_test['StemmedReview']

tf = TfidfVectorizer()
tf.fit(train_stem_review_tfidf)

# encode document
data = tf.transform(train_stem_review_tfidf)

# summarize encoded vector
print(data.shape,"\n")

(12996, 23155) 



In [48]:
train_data_tfidf = tf.fit_transform(train_stem_review_tfidf)
print(train_data_tfidf.shape,"\n") 

test_data_tfidf = tf.transform(test_stem_review_tfidf)
print(test_data_tfidf.shape,"\n") 

idf = tf.idf_

(12996, 23155) 

(1602, 23155) 



In [49]:
# define true labels from train set
x_train = train_data_tfidf
y_train = df_train["overall"]
x_test = test_data_tfidf
y_test = df_test["overall"]


model = MultinomialNB()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print ("Accuracy score: ", accuracy_score(y_test, predictions))
print ("Precision score: ", precision_score(y_test, predictions, average = None))
print ("Recall score: ", recall_score(y_test, predictions, average = None))
print ("F1 score: ", f1_score(y_test, predictions, average = None))

Accuracy score:  0.9625468164794008
Precision score:  [0. 0. 1.]
Recall score:  [0.         0.         0.96254682]
F1 score:  [0.         0.         0.98091603]


In [50]:
print ("Individual label performance: ")
print (classification_report(y_test, predictions))
print (confusion_matrix(y_test, predictions))

Individual label performance: 
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         0
     neutral       0.00      0.00      0.00         0
    positive       1.00      0.96      0.98      1602

    accuracy                           0.96      1602
   macro avg       0.33      0.32      0.33      1602
weighted avg       1.00      0.96      0.98      1602

[[   0    0    0]
 [   0    0    0]
 [  18   42 1542]]


#### Task 9:  Interpret and discuss classification results
- Discuss the input (original column) and feature combination that yields the best performance with Naïve Bayes algorithm. Why do you think this combination works better than the others? 
- For the best-performing model, examine the performance for each label. How do they compare to one another? What can be inferred from these results? 

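*Among unigrams, unigrams + bigrams, unigrams + bigrams + trigrams, and TF-IDF, we find that **TF-IDF** gives the best performance: an accuracy of about 0.967 on OriginalReview, against 0.910 for unigram counts and 0.904 for the two larger n-gram feature sets. Across input columns the TF-IDF results are close, with StopwordRemovedReview slightly ahead (0.968) and StemmedReview slightly behind (0.963). Note that the count-based vectorizers were fitted last on the test set while the TF-IDF vectorizer was fitted on the training set, so the comparison is not perfectly like-for-like. With that caveat, the gap can be explained by how CountVectorizer and TF-IDF work.*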

*CountVectorizer simply counts how many times each word occurs in a review, turning the text into raw frequencies. Because it only counts, it cannot distinguish informative words from words that could be discarded: the more often a word is repeated in a document, the more weight it receives, purely on the basis of frequency.*

*In contrast, TF-IDF weights each word by its frequency in the document (TF), scaled by a log-scaled inverse of the number of documents it occurs in (IDF). A word that is frequent in one review but rare in the rest of the corpus therefore receives a high weight, whereas a word that occurs in almost every review is down-weighted.*
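
A minimal sketch of this weighting, assuming whitespace tokenization and a three-review toy corpus. It uses the smoothed formula idf(t) = ln((1 + N) / (1 + df(t))) + 1 documented as the scikit-learn default, but leaves out the L2 normalization that `TfidfVectorizer` also applies by default; `smoothed_idf` and `tfidf` are hypothetical helper names.

```python
import math
from collections import Counter


def smoothed_idf(documents):
    """Return idf(t) = ln((1 + N) / (1 + df(t))) + 1 for every term t."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))   # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def tfidf(document, idf):
    """Weight each term count in `document` by its idf (no normalization)."""
    counts = Counter(document.split())
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


# Toy corpus for illustration only.
docs = ["the book was great", "the book was boring", "the plot was great great"]
idf = smoothed_idf(docs)
weights = tfidf(docs[2], idf)
for term in sorted(weights, key=weights.get, reverse=True):
    print(f"{term:>6}: {weights[term]:.3f}")
```

Words that occur in every review ("the", "was") get the minimum idf of 1, while a word that occurs in a single review ("plot") gets the highest idf.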

*For TF-IDF on OriginalReview, the positive label reaches a recall of 96.7%, which equals the overall accuracy because every review in the test set is positive. Since the test data contains no negative or neutral reviews, the three labels cannot be compared with one another: precision, recall and F1 for negative and neutral are all 0 simply because their support is 0. The precision of 1.0 for the positive label is an artifact of the same imbalance rather than a realistic result: with only positive reviews in the test set, a positive prediction can never be a false positive. The errors show up in recall instead: 3.3% of the positive reviews (53 of 1602) were predicted as negative or neutral.*

#### Task 10: Error Analysis
- For each label, print out 3 examples (from the test set) that were misclassified by the best-performing model. 
- Discuss your thoughts about why these examples might have been misclassified. Based on these examples, what can be done to improve the model performance?

In [51]:
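# `predictions` still holds the output of the last model fitted above
# (TF-IDF on StemmedReview), while the printed text comes from OriginalReview.
labels = df_test["overall"]
inputs = df_test["OriginalReview"]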

counter = 0
for idx, prediction, label in zip(enumerate(inputs), predictions, labels):
    if counter < 3:
        if prediction != label:
            print("Sample", idx, ' has been classified as', prediction, 'and should be', label) 
            print ("\n")
            counter += 1

Sample (20, "this was a touching story, wonderful book and hard to put down.  from start to end the story will make you want to read more by this author.  five stars wasn't enough!")  has been classified as neutral and should be positive


Sample (46, 'i loved this book. i chose the five star review because this book was the kind of book that i would want to read over and over. i hope that there a second book in the series. it would be a really good idea to consider. hope that many other readers will be able to enjoy this book as much as i did. five star rating is a very hard rating to get an i feel this book deserve it.')  has been classified as neutral and should be positive


Sample (75, 'unimaginable though could not put it down')  has been classified as negative and should be positive
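
The loop above stops after the first three errors overall. To collect up to three examples per label, the errors can be grouped as in the sketch below. Because every test review is labelled positive, grouping by the true label would give a single group, so this sketch groups by the predicted label instead. `misclassified_by_label` is a hypothetical helper and the data is a toy example, not taken from the test set.

```python
from collections import defaultdict


def misclassified_by_label(texts, predictions, labels, per_label=3):
    """Group up to `per_label` misclassified samples under each predicted label."""
    errors = defaultdict(list)
    for idx, (text, predicted, actual) in enumerate(zip(texts, predictions, labels)):
        if predicted != actual and len(errors[predicted]) < per_label:
            errors[predicted].append((idx, text, actual))
    return dict(errors)


# Toy data for illustration only.
texts = ["loved it", "not bad", "could not put it down", "fine", "hard to follow"]
preds = ["positive", "neutral", "negative", "neutral", "negative"]
truth = ["positive"] * 5

errors = misclassified_by_label(texts, preds, truth)
for predicted, samples in errors.items():
    for idx, text, actual in samples:
        print(f"Sample {idx} classified as {predicted}, should be {actual}: {text}")
```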




*In sample 20, words like wonderful, hard, down, more and wasn't pull in different directions, which probably led the model to treat this review as neutral.*

*In sample 46, there are positive words like loved, good and enjoy, and one negative-looking word, hard. But relative to the length of the review these words are not repeated often, and the majority of the words are neutral. This might have led to the review being incorrectly labelled as neutral.*

*Sample 75 can easily look negative to a model that reads it word by word rather than the way a human would. The prefix 'un' and the word 'not' can be confusing for a computer, and hence the review was labelled as negative instead of positive.*

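*One way to improve accuracy could be to handle negation explicitly, for example by attaching a negation marker to the words that follow a negation term. This should reduce the number of reviews that are incorrectly classified as negative. It should also help with neutral reviews, but could cause some positive reviews to be classified incorrectly.*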
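
One possible form of this idea is sketched below: each of the few tokens that follow a negation word is prefixed with a marker, so that "put" and "NOT_put" become different features. The negation list, the marker `NOT_`, the scope of three tokens and the name `mark_negation` are all assumptions made for illustration; none of them come from the notebook.

```python
NEGATION_WORDS = {"not", "no", "never", "cannot"}   # illustrative, not exhaustive


def mark_negation(tokens, scope=3, negations=NEGATION_WORDS):
    """Prefix up to `scope` tokens after a negation word with 'NOT_'."""
    marked = []
    remaining = 0
    for token in tokens:
        if token in negations:
            marked.append(token)
            remaining = scope          # start (or restart) the negation scope
        elif remaining > 0:
            marked.append("NOT_" + token)
            remaining -= 1
        else:
            marked.append(token)
    return marked


# Sample 75 from the error analysis above.
tokens = "unimaginable though could not put it down".split()
marked_tokens = mark_negation(tokens)
print(marked_tokens)
```

With such marking, a classifier can learn that "NOT_put" and "NOT_down" occur in positive reviews, instead of treating "not" and "down" as independent negative cues.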