# NLE Assignment: Sentiment Classification

In this assignment, you will be investigating NLP methods for distinguishing positive and negative reviews written about movies.

For assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about the assignment questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

The first few cells contain code to set-up the assignment and bring in some data.   In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell.  Otherwise do not change the code in these cells.

In [None]:
candidateno=11111119 #this MUST be updated to your candidate number so that you get a unique data sample


In [1]:
#do not change the code in this cell
#preliminary imports

#set up nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import movie_reviews

#for setting up training and testing data
import random

#useful other tools
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.probability import FreqDist
from nltk.classify.api import ClassifierI


[nltk_data] Downloading package punkt to C:\Users\Zee
[nltk_data]     Tech\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Zee
[nltk_data]     Tech\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package movie_reviews to C:\Users\Zee
[nltk_data]     Tech\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [287]:
reviews_df=pd.read_csv("movie_review.csv")

In [288]:
reviews_df

Unnamed: 0,fold_id,cv_tag,html_id,sent_id,text,tag
0,0,cv000,29590,0,films adapted from comic books have had plenty...,pos
1,0,cv000,29590,1,"for starters , it was created by alan moore ( ...",pos
2,0,cv000,29590,2,to say moore and campbell thoroughly researche...,pos
3,0,cv000,29590,3,"the book ( or "" graphic novel , "" if you will ...",pos
4,0,cv000,29590,4,"in other words , don't dismiss this film becau...",pos
...,...,...,...,...,...,...
64715,9,cv999,14636,20,that lack of inspiration can be traced back to...,neg
64716,9,cv999,14636,21,like too many of the skits on the current inca...,neg
64717,9,cv999,14636,22,"after watching one of the "" roxbury "" skits on...",neg
64718,9,cv999,14636,23,"bump unsuspecting women , and . . . that's all .",neg


1)  
a) **Generate** a list of 10 content words which are representative of the positive reviews in your training data.

b) **Generate** a list of 10 content words which are representative of the negative reviews in your training data.

c) **Explain** what you have done and why

[20\%]

2) 
a) **Use** the lists generated in Q1 to build a **word list classifier** which will classify reviews as being positive or negative.

b) **Explain** what you have done.

[12.5\%]


3)
a) **Calculate** the accuracy, precision, recall and F1 score of your classifier.

b) Is it reasonable to evaluate the classifier in terms of its accuracy?  **Explain** your answer and give a counter-example (a scenario where it would / would not be reasonable to evaluate the classifier in terms of its accuracy).

[20\%]

4) 
a)  **Construct** a Naive Bayes classifier (e.g., from NLTK).

b)  **Compare** the performance of your word list classifier with the Naive Bayes classifier.  **Discuss** your results. 

[12.5\%]

5) 
a) Design and **carry out an experiment** into the impact of the **length of the wordlists** on the wordlist classifier.  Make sure you **describe** design decisions in your experiment, include a **graph** of your results and **discuss** your conclusions. 

b) Would you **recommend** a wordlist classifier or a Naive Bayes classifier for future work in this area?  **Justify** your answer.

[25\%]


# Method 1 ---- TextBlob Polarity Method

In [269]:
import neattext as nt

In [289]:
reviews_df["text"]=reviews_df["text"].apply(lambda x:nt.remove_stopwords(x))

In [290]:
reviews_df["text"]=reviews_df["text"].apply(lambda x:nt.fix_contractions(x))

In [291]:
reviews_df["text"]=reviews_df["text"].apply(lambda x:nt.remove_numbers(x))

In [292]:
reviews_df["text"]=reviews_df["text"].apply(lambda x:nt.remove_special_characters(x))

In [68]:
def get_words_polarity(text):
    words_polarity={}
    for word in text.split():
        words_polarity[word]=TextBlob(word).sentiment.polarity
    return words_polarity    

In [71]:
reviews_df["Polarity"]=reviews_df["text"][0:10].apply(lambda x:get_words_polarity(x))

In [77]:
reviews_df[["Polarity","tag"]]

Unnamed: 0,Polarity,tag
0,"{'films': 0.0, 'adapted': 0.0, 'comic': 0.25, ...",pos
1,"{'starters': 0.0, 'created': 0.0, 'alan': 0.0,...",pos
2,"{'moore': 0.0, 'campbell': 0.0, 'thoroughly': ...",pos
3,"{'book': 0.0, 'graphic': 0.0, 'novel': 0.0, 'p...",pos
4,"{'words': 0.0, 'dismiss': 0.0, 'film': 0.0, 's...",pos
...,...,...
64715,,neg
64716,,neg
64717,,neg
64718,,neg


# 2nd Method ---- Word Cloud 

In [78]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

### postive reviews

In [123]:
#pos_text=reviews_df[reviews_df.iloc[:,5]=='pos']['text']

In [124]:
#word_cloud=WordCloud(collocations = False, background_color = 'white').generate(str(pos_text))

In [142]:
#plt.imshow(word_cloud, interpolation='bilinear')
#plt.axis("off")
#plt.show()

### Top words from word cloud

In [136]:
#top_words=list(word_cloud.words_.keys())

In [143]:
#top_words[0:10]

In [293]:
reviews_df.isnull().sum()

fold_id    0
cv_tag     0
html_id    0
sent_id    0
text       0
tag        0
dtype: int64

In [300]:
def get_top_words(text):
    word_cloud=WordCloud().generate(text)
    top_words=list(word_cloud.words_.keys())
    return ' '.join(top_words[0:10])

In [301]:
reviews_df["top words"]=reviews_df["text"].apply(lambda x:get_top_words(x+"text"))

KeyboardInterrupt: 

### Spration of positive and negative reviews

In [243]:
pos_reviews_df=reviews_df[reviews_df.loc[:,'tag']=='pos'][['text','tag']]
pos_reviews_df['tag']=1
neg_reviews_df=reviews_df[reviews_df.loc[:,'tag']=='neg'][['text','tag']]
neg_reviews_df['tag']=0

In [None]:
final_df=pd.concat([pos_reviews_df,neg_reviews_df],axis=0)

In [None]:
final_df

In [None]:
X=final_df['top words']
y=final_df['tag']

In [208]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [209]:
tfidf = TfidfVectorizer()
transformed = tfidf.fit_transform(X)

ValueError: np.nan is an invalid document, expected byte or unicode string.

In [210]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [205]:
# Importing Models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm

# Importing Evaluation matrces
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,confusion_matrix,classification_report, plot_confusion_matrix

# check the performance on diffrent regressor
models = []
#models.append(('Support Vector Classifier', SVC()))
models.append(('LogisitcRegression', LogisticRegression()))
models.append(('KNeighborsClassifier', KNeighborsClassifier()))
models.append(('RandomForestClassifier', RandomForestClassifier()))
models.append(('AdaBoostClassifier', AdaBoostClassifier()))
models.append(('DecisionTreeClassifier', DecisionTreeClassifier()))


# prepare the cross-validation procedure
cv = KFold(n_splits=5, random_state=1, shuffle=True)

# metrices to store performance
acc = []
pre = []
f1 = []
con = []
rec = []


import time
i = 0
for name,model in models:
    i = i+1
    start_time = time.time()
    
    # Fitting model to the Training set
    clf = model
    clf.fit(X_train, y_train)
    
    # predict values
    y_pred = clf.predict(X_test)
    
    # Accuracy
    accuracy = accuracy_score(y_test, y_pred)
    acc.append(accuracy)
    # Precision
    precision = precision_score(y_test, y_pred, average=None)
    pre.append(precision)
    # Recall
    recall = recall_score(y_test, y_pred, average=None)
    rec.append(recall)
    # F1 Score
    f1_sco = f1_score(y_test, y_pred, average=None)
    f1.append(f1_sco)
    # Confusion Matrix
    confusion_mat = confusion_matrix(y_test, y_pred)
    con.append(confusion_mat)
    # Report
    report = classification_report(y_test, y_pred)
    
    # evaluate model
    scores = cross_val_score(clf, X, y, cv=cv, n_jobs=-1)



    print("+","="*100,"+")
    print('\033[1m' + f"\t\t\t{i}-For {name} The Performance result is: " + '\033[0m')
    print("+","="*100,"+")
    print('Accuracy : ', accuracy)   
    print("-"*50)
    print('F1 : ', f1_sco)
    print("-"*50)
    print('Reacll : ', recall)
    print("-"*50)
    print('Precision : ', precision)
    print("-"*50)
    print('cross validation accuracy : ', np.mean(scores))
    print("-"*50)
    print('Confusion Matrix....\n', confusion_mat)
    print("-"*50)
    print('Classification Report....\n', report)
    print("-"*50)
    print('Plotting Confusion Matrix...\n')
    plot_confusion_matrix(clf, X_test, y_test)
    plt.show()


    
    print("\t\t\t\t\t\t\t-----------------------------------------------------------")
    print(f"\t\t\t\t\t\t\t Time for detection ({name}) : {round((time.time() - start_time), 3)} seconds...")
    print("\t\t\t\t\t\t\t-----------------------------------------------------------")
    print()
    
pd.DataFrame({"Model": dict(models).keys(), "Accuracy": acc, "Precision": pre, "Recall": rec, "F1_Score": f1, "Confusion Matrix": con})

ValueError: could not convert string to float: 'filthy  sooty place whores  called  unfortunates   starting little nervous mysterious psychopath carving profession surgical precision '