# **The ninth in-class exercise (20 points in total, 11/11/2020)**

The purpose of this exercise is to practice different machine learning algorithms for text classification and to evaluate their performance. In addition, you are required to conduct *10-fold cross-validation (https://scikit-learn.org/stable/modules/cross_validation.html)* during training.

The dataset can be downloaded from here: https://github.com/unt-iialab/INFO5731_FALL2020/blob/master/In_class_exercise/exercise09_datacollection.zip. The dataset contains two files, train data and test data, for sentiment analysis of IMDB reviews. It has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model should then be evaluated on the test data.
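
The workflow described above (an 80/20 split, 10-fold cross-validation on the training part, then a held-out evaluation) can be sketched as follows. This is a minimal illustration on a made-up toy corpus, not the exercise data; the toy texts, variable names, and `random_state` are assumptions. Wrapping the vectorizer and the classifier in a pipeline is one way to make sure the TF-IDF statistics are refitted inside every fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (made up for illustration): 20 positive and 20 negative reviews.
pos_words = ["great", "funny", "moving", "superb"]
neg_words = ["dull", "routine", "boring", "tedious"]
texts = [f"a {pos_words[i % 4]} film number {i}" for i in range(20)]
texts += [f"a {neg_words[i % 4]} film number {i}" for i in range(20)]
labels = [1] * 20 + [0] * 20

# 80% training, 20% validation; stratify keeps the class balance in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=0
)

# The pipeline refits TF-IDF inside every fold, so held-out folds never
# influence the vocabulary or the IDF weights.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
print("Mean CV accuracy:", cv_scores.mean())

# Fit on the full training portion, then score the held-out validation part.
model.fit(X_train, y_train)
valid_accuracy = model.score(X_valid, y_valid)
print("Validation accuracy:", valid_accuracy)
```

The same fitted pipeline could then be scored once on the separate test file, which keeps the test data out of every training and tuning step.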

Algorithms:

(1) MultinomialNB

(2) SVM 

(3) KNN 

(4) Decision tree

(5) Random Forest

(6) XGBoost

Evaluation measurement:

(1) Accuracy

(2) Recall

(3) Precision

(4) F-1 score
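
To make the four measurements concrete, the sketch below derives them by hand from a confusion matrix and checks them against scikit-learn. The labels are made up for illustration. The hand-computed values are for the positive class (scikit-learn's default `average='binary'`), whereas `average='macro'` would average the per-class scores of both classes.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The hand-computed values agree with scikit-learn's positive-class scores.
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12

print(accuracy, precision, recall, f1)
```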

In [1]:
# Write your code here

#Load the libraries
import numpy as np
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from nltk.tokenize import WordPunctTokenizer
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score, precision_score, recall_score, f1_score

import warnings
warnings.filterwarnings('ignore')

# Train Data

In [5]:
# Read the train data
data = []
with open("/content/stsa-train.txt") as f:
    for line in f:
        line = line.strip("\n").split(" ", 1)
        data.append(line)
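
The parsing step above relies on each line having the layout `<label> <review text>`. The sketch below shows the same idea on two made-up lines held in memory. Casting the label to `int` is an assumption added here for illustration; the notebook itself keeps the labels as strings.

```python
import io

import pandas as pd

# Two made-up lines in the assumed "<label> <review text>" layout.
sample = io.StringIO(
    "1 a stirring and funny film .\n"
    "0 the film is strictly routine .\n"
)

rows = []
for line in sample:
    # Split only on the first space so the review text stays intact.
    label, review = line.rstrip("\n").split(" ", 1)
    rows.append((int(label), review))  # int() cast is an illustrative choice

df = pd.DataFrame(rows, columns=["sentiment", "review"])
print(df)
```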

In [6]:
imdb_data = pd.DataFrame(data, columns = ['sentiment', 'review'])  
print(imdb_data.shape)
imdb_data.head(10)

(6920, 2)


| | sentiment | review |
|---|---|---|
| 0 | 1 | a stirring , funny and finally transporting re... |
| 1 | 0 | apparently reassembled from the cutting-room f... |
| 2 | 0 | they presume their audience wo n't sit still f... |
| 3 | 1 | this is a visually stunning rumination on love... |
| 4 | 1 | jonathan parker 's bartleby should have been t... |
| 5 | 1 | campanella gets the tone just right -- funny i... |
| 6 | 0 | a fan film that for the uninitiated plays bett... |
| 7 | 1 | béart and berling are both superb , while hupp... |
| 8 | 0 | a little less extreme than in the past , with ... |
| 9 | 0 | the film is strictly routine . |


In [10]:
# Train data preprocessing: remove stopwords and lemmatize
import re
import string

import nltk
nltk.download('wordnet')
nltk.download('stopwords')

# Put spaces around punctuation so that each mark becomes its own token
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize_s(s): return re_tok.sub(r' \1 ', s).split()

lmtzr = WordNetLemmatizer()
# Build the stopword set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))
# Lowercase before the stopword check, drop one-character tokens, then lemmatize
imdb_data['review'] = imdb_data['review'].apply(lambda x: ' '.join([lmtzr.lemmatize(word.lower()) for word in tokenize_s(x) if word.lower() not in stop_words and len(word) > 1]))



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Test Data

In [11]:
# Read the test data
test_data = []
with open("/content/stsa-test.txt") as f:
    for line in f:
        line = line.strip("\n").split(" ", 1)
        test_data.append(line)

In [12]:
# Test data preprocessing
import re
import string

import nltk
nltk.download('wordnet')
nltk.download('stopwords')

imdb_testdata = pd.DataFrame(test_data, columns = ['sentiment', 'review'])
print(imdb_testdata.shape)

# Remove stopwords and lemmatize, using the same steps as for the train data
# Put spaces around punctuation so that each mark becomes its own token
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize_s(s): return re_tok.sub(r' \1 ', s).split()

lmtzr = WordNetLemmatizer()
# Build the stopword set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))
imdb_testdata['review'] = imdb_testdata['review'].apply(lambda x: ' '.join([lmtzr.lemmatize(word.lower()) for word in tokenize_s(x) if word.lower() not in stop_words and len(word) > 1]))


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
(1821, 2)


In [14]:
X_test, y_test = imdb_testdata.review, imdb_testdata.sentiment

# Train Test Split

In [15]:
valid_size = 0.20
X = imdb_data['review']
y = imdb_data['sentiment']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=valid_size)
print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)

(5536,) (5536,)
(1384,) (1384,)


# TFIDF Vectorizer

In [16]:
vectorizer = TfidfVectorizer()
tfidf_train = vectorizer.fit_transform(X_train)
tfidf_valid = vectorizer.transform(X_valid)
tfidf_test = vectorizer.transform(X_test)

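# Complete data vectors for 10 fold CV.
# A separate vectorizer is used so that the one fitted on X_train above is not overwritten.
cv_vectorizer = TfidfVectorizer()
X_tfidf_train = cv_vectorizer.fit_transform(X)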

# Machine Learning Models

## Naive Bayes

In [17]:
from sklearn.naive_bayes import MultinomialNB
gnb = MultinomialNB()
gnb_fit = gnb.fit(tfidf_train, y_train)

### 10 Fold CV

In [18]:
# 10 fold cv
scores = cross_val_score(gnb_fit, X_tfidf_train, y, cv=10, scoring='accuracy')
scores

array([0.77890173, 0.80491329, 0.78757225, 0.78468208, 0.75578035,
       0.79624277, 0.76734104, 0.80491329, 0.79479769, 0.76156069])
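
The cell above reports accuracy only. If the other required measurements are also wanted per fold, scikit-learn's `cross_validate` accepts several scorers at once. The sketch below runs on a made-up toy corpus and uses a pipeline so that TF-IDF is refitted inside every fold; the data and names are illustrative assumptions, not the notebook's own variables.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (made up for illustration): 20 positive and 20 negative reviews.
texts = [f"{word} film {i}" for i in range(10) for word in ("great", "superb")]
texts += [f"{word} film {i}" for i in range(10) for word in ("dull", "boring")]
labels = [1] * 20 + [0] * 20

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Macro-averaged scorers match the average='macro' setting used for the reports.
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
results = cross_validate(model, texts, labels, cv=10, scoring=scoring)

# Each "test_<scorer>" entry holds one score per fold.
for name in scoring:
    fold_scores = results[f"test_{name}"]
    print(name, fold_scores.mean())
```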

In [19]:
gnb_pred_val = gnb.predict(tfidf_valid)
gnb_pred_test = gnb.predict(tfidf_test)
print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_valid, gnb_pred_val)}")
print(f"Precision: {precision_score(y_valid, gnb_pred_val, average='macro')}")
print(f"Recall: {recall_score(y_valid, gnb_pred_val, average='macro')}")
print(f"Accuracy: {accuracy_score(y_valid, gnb_pred_val)}")
print(f"F1: {f1_score(y_valid, gnb_pred_val, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_valid, gnb_pred_val, normalize = False)}\n")

print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_test, gnb_pred_test)}")
print(f"Precision: {precision_score(y_test, gnb_pred_test, average='macro')}")
print(f"Recall: {recall_score(y_test, gnb_pred_test, average='macro')}")
print(f"Accuracy: {accuracy_score(y_test, gnb_pred_test)}")
print(f"F1: {f1_score(y_test, gnb_pred_test, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_test, gnb_pred_test, normalize = False)}")

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:
Confusion Matrix: 
[[465 198]
 [106 615]]
Precision: 0.7854091675767896
Recall: 0.777169717775086
Accuracy: 0.7803468208092486
F1: 0.7777359854111769
Number of correctly predicted records: 1080

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:
Confusion Matrix: 
[[645 267]
 [109 800]]
Precision: 0.8026016820008999
Recall: 0.7936624254530716
Accuracy: 0.7935200439319056
F1: 0.7920131615399196
Number of correctly predicted records: 1445


## SVM

In [20]:
from sklearn import svm
clf = svm.SVC(gamma='auto')
svm_fit = clf.fit(tfidf_train, y_train)  

In [21]:
# 10 fold cv
scores = cross_val_score(svm_fit, X_tfidf_train, y, cv=10, scoring='accuracy')
scores

array([0.5216763, 0.5216763, 0.5216763, 0.5216763, 0.5216763, 0.5216763,
       0.5216763, 0.5216763, 0.5216763, 0.5216763])

In [23]:
svm_pred_val = svm_fit.predict(tfidf_valid)
svm_pred_test = svm_fit.predict(tfidf_test)
print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_valid, svm_pred_val)}")
print(f"Precision: {precision_score(y_valid, svm_pred_val, average='macro')}")
print(f"Recall: {recall_score(y_valid, svm_pred_val, average='macro')}")
print(f"Accuracy: {accuracy_score(y_valid, svm_pred_val)}")
print(f"F1: {f1_score(y_valid, svm_pred_val, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_valid, svm_pred_val, normalize = False)}\n")

print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_test, svm_pred_test)}")
print(f"Precision: {precision_score(y_test, svm_pred_test, average='macro')}")
print(f"Recall: {recall_score(y_test, svm_pred_test, average='macro')}")
print(f"Accuracy: {accuracy_score(y_test, svm_pred_test)}")
print(f"F1: {f1_score(y_test, svm_pred_test, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_test, svm_pred_test, normalize = False)}")

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:
Confusion Matrix: 
[[  0 663]
 [  0 721]]
Precision: 0.2604768786127168
Recall: 0.5
Accuracy: 0.5209537572254336
F1: 0.3425178147268409
Number of correctly predicted records: 721

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:
Confusion Matrix: 
[[  0 912]
 [  0 909]]
Precision: 0.24958813838550248
Recall: 0.5
Accuracy: 0.49917627677100496
F1: 0.33296703296703295
Number of correctly predicted records: 909
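
Both confusion matrices above show every review assigned to class 1. One plausible explanation (an assumption, not verified against this data) is that `gamma='auto'` sets gamma to 1 / n_features, which is very small for a TF-IDF vocabulary, so the RBF kernel becomes nearly constant. A linear SVM is a common alternative for sparse text features. The sketch below, on a made-up toy corpus, shows how `LinearSVC` could be dropped in; it makes no claim about the scores it would reach on the exercise data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training reviews for illustration: 1 = positive, 0 = negative.
train_texts = [
    "a superb and funny film", "great moving story", "funny and great",
    "dull and boring film", "tedious routine story", "boring and dull",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Fit TF-IDF on the training texts only, then reuse it for new texts.
vectorizer = TfidfVectorizer()
X_train_toy = vectorizer.fit_transform(train_texts)

# Linear kernel: no gamma parameter, and it scales well to sparse features.
linear_clf = LinearSVC()
linear_clf.fit(X_train_toy, train_labels)

X_new = vectorizer.transform(["a great funny story", "a dull tedious film"])
predictions = linear_clf.predict(X_new)
print(predictions)
```

Keeping the RBF kernel with `gamma='scale'` (the scikit-learn default) would be another option to compare.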


## KNN

In [24]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=5)
knn_fit = neigh.fit(tfidf_train, y_train)

In [25]:
# 10 fold cv
scores = cross_val_score(knn_fit, X_tfidf_train, y, cv=10, scoring='accuracy')
scores

array([0.68063584, 0.5       , 0.49277457, 0.48410405, 0.48265896,
       0.50867052, 0.49277457, 0.49710983, 0.49277457, 0.48265896])

In [26]:
knn_pred_val = knn_fit.predict(tfidf_valid)
knn_pred_test = knn_fit.predict(tfidf_test)
print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_valid, knn_pred_val)}")
print(f"Precision: {precision_score(y_valid, knn_pred_val, average='macro')}")
print(f"Recall: {recall_score(y_valid, knn_pred_val, average='macro')}")
print(f"Accuracy: {accuracy_score(y_valid, knn_pred_val)}")
print(f"F1: {f1_score(y_valid, knn_pred_val, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_valid, knn_pred_val, normalize = False)}\n")

print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_test, knn_pred_test)}")
print(f"Precision: {precision_score(y_test, knn_pred_test, average='macro')}")
print(f"Recall: {recall_score(y_test, knn_pred_test, average='macro')}")
print(f"Accuracy: {accuracy_score(y_test, knn_pred_test)}")
print(f"F1: {f1_score(y_test, knn_pred_test, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_test, knn_pred_test, normalize = False)}")

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:
Confusion Matrix: 
[[658   5]
 [710  11]]
Precision: 0.5842470760233918
Recall: 0.5038575549712043
Accuracy: 0.4833815028901734
F1: 0.33890370892950317
Number of correctly predicted records: 669

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:
Confusion Matrix: 
[[910   2]
 [894  15]]
Precision: 0.693393765488457
Recall: 0.5071543338544381
Accuracy: 0.5079626578802856
F1: 0.35125025049542424
Number of correctly predicted records: 925


## Decision Tree

In [27]:
from sklearn.tree import DecisionTreeClassifier
dtree_clf = DecisionTreeClassifier()
dtree_fit = dtree_clf.fit(tfidf_train, y_train)

In [28]:
# 10 fold cv
scores = cross_val_score(dtree_fit, X_tfidf_train, y, cv=10, scoring='accuracy')
scores

array([0.62427746, 0.64739884, 0.66618497, 0.64884393, 0.63150289,
       0.66618497, 0.60549133, 0.66763006, 0.65606936, 0.66040462])

In [29]:
dtree_pred_val = dtree_fit.predict(tfidf_valid)
dtree_pred_test = dtree_fit.predict(tfidf_test)
print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_valid, dtree_pred_val)}")
print(f"Precision: {precision_score(y_valid, dtree_pred_val, average='macro')}")
print(f"Recall: {recall_score(y_valid, dtree_pred_val, average='macro')}")
print(f"Accuracy: {accuracy_score(y_valid, dtree_pred_val)}")
print(f"F1: {f1_score(y_valid, dtree_pred_val, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_valid, dtree_pred_val, normalize = False)}\n")

print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_test, dtree_pred_test)}")
print(f"Precision: {precision_score(y_test, dtree_pred_test, average='macro')}")
print(f"Recall: {recall_score(y_test, dtree_pred_test, average='macro')}")
print(f"Accuracy: {accuracy_score(y_test, dtree_pred_test)}")
print(f"F1: {f1_score(y_test, dtree_pred_test, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_test, dtree_pred_test, normalize = False)}")

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:
Confusion Matrix: 
[[434 229]
 [261 460]]
Precision: 0.6460473420972945
Recall: 0.6463015377921146
Accuracy: 0.6459537572254336
F1: 0.6458287636177524
Number of correctly predicted records: 894

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:
Confusion Matrix: 
[[603 309]
 [330 579]]
Precision: 0.6491646389154427
Recall: 0.6490739534479764
Accuracy: 0.6490939044481054
F1: 0.6490329410806628
Number of correctly predicted records: 1182


## Random Forest

In [30]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(max_depth=2, random_state=0)
rf_fit = rf_clf.fit(tfidf_train, y_train)

In [31]:
# 10 fold cv
scores = cross_val_score(rf_fit, X_tfidf_train, y, cv=10, scoring='accuracy')
scores

array([0.52601156, 0.5216763 , 0.52312139, 0.52312139, 0.52745665,
       0.5216763 , 0.52312139, 0.52312139, 0.5216763 , 0.52312139])

In [32]:
rf_pred_val = rf_fit.predict(tfidf_valid)
rf_pred_test = rf_fit.predict(tfidf_test)
print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_valid, rf_pred_val)}")
print(f"Precision: {precision_score(y_valid, rf_pred_val, average='macro')}")
print(f"Recall: {recall_score(y_valid, rf_pred_val, average='macro')}")
print(f"Accuracy: {accuracy_score(y_valid, rf_pred_val)}")
print(f"F1: {f1_score(y_valid, rf_pred_val, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_valid, rf_pred_val, normalize = False)}\n")

print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_test, rf_pred_test)}")
print(f"Precision: {precision_score(y_test, rf_pred_test, average='macro')}")
print(f"Recall: {recall_score(y_test, rf_pred_test, average='macro')}")
print(f"Accuracy: {accuracy_score(y_test, rf_pred_test)}")
print(f"F1: {f1_score(y_test, rf_pred_test, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_test, rf_pred_test, normalize = False)}")

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:
Confusion Matrix: 
[[  1 662]
 [  0 721]]
Precision: 0.7606652205350687
Recall: 0.5007541478129713
Accuracy: 0.5216763005780347
F1: 0.3441866324614046
Number of correctly predicted records: 722

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:
Confusion Matrix: 
[[  3 909]
 [  0 909]]
Precision: 0.75
Recall: 0.5016447368421053
Accuracy: 0.500823723228995
F1: 0.3366120218579235
Number of correctly predicted records: 912


## XGBoost

In [33]:
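# Note: GradientBoostingClassifier is scikit-learn's gradient boosting implementation,
# used here in place of the separate xgboost library.
from sklearn.ensemble import GradientBoostingClassifier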
xg_clf = GradientBoostingClassifier(random_state=0)
xg_fit = xg_clf.fit(tfidf_train, y_train)

In [34]:
# 10 fold cv
scores = cross_val_score(xg_fit, X_tfidf_train, y, cv=10, scoring='accuracy')
scores

array([0.64595376, 0.66907514, 0.6632948 , 0.66907514, 0.64884393,
       0.66763006, 0.62861272, 0.63872832, 0.63150289, 0.65606936])

In [35]:
xg_pred_val = xg_fit.predict(tfidf_valid)
xg_pred_test = xg_fit.predict(tfidf_test)
print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_valid, xg_pred_val)}")
print(f"Precision: {precision_score(y_valid, xg_pred_val, average='macro')}")
print(f"Recall: {recall_score(y_valid, xg_pred_val, average='macro')}")
print(f"Accuracy: {accuracy_score(y_valid, xg_pred_val)}")
print(f"F1: {f1_score(y_valid, xg_pred_val, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_valid, xg_pred_val, normalize = False)}\n")

print('Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:')
print(f"Confusion Matrix: \n{confusion_matrix(y_test, xg_pred_test)}")
print(f"Precision: {precision_score(y_test, xg_pred_test, average='macro')}")
print(f"Recall: {recall_score(y_test, xg_pred_test, average='macro')}")
print(f"Accuracy: {accuracy_score(y_test, xg_pred_test)}")
print(f"F1: {f1_score(y_test, xg_pred_test, average='macro')}")
print(f"Number of correctly predicted records: {accuracy_score(y_test, xg_pred_test, normalize = False)}")

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Validation Data:
Confusion Matrix: 
[[311 352]
 [125 596]]
Precision: 0.6709973677079704
Recall: 0.647854810333394
Accuracy: 0.6553468208092486
F1: 0.6400845913082921
Number of correctly predicted records: 907

Confusion Matrix, Precision, Recall, Accuracy, F1 Score and how many records are correctly predicted for Test Data:
Confusion Matrix: 
[[436 476]
 [164 745]]
Precision: 0.6684111384111384
Recall: 0.648826066817208
Accuracy: 0.6485447556287754
F1: 0.6381250465757509
Number of correctly predicted records: 1181
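
The cells above use scikit-learn's `GradientBoostingClassifier`. If the `xgboost` package is installed, its scikit-learn-style `XGBClassifier` could be substituted along the following lines. This is a hedged sketch on made-up toy data: the hyperparameters are arbitrary assumptions, and the code falls back to `GradientBoostingClassifier` when `xgboost` cannot be imported. Recent `xgboost` releases expect integer class labels starting at 0, so string labels such as those read from the text files would likely need casting first.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

try:
    # Optional third-party dependency; not part of scikit-learn.
    from xgboost import XGBClassifier

    def make_booster():
        return XGBClassifier(n_estimators=20, max_depth=3)
except ImportError:
    from sklearn.ensemble import GradientBoostingClassifier

    def make_booster():
        # Fallback so the sketch still runs without xgboost installed.
        return GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0)

# Made-up reviews for illustration, with integer labels (1 = positive, 0 = negative).
toy_texts = [
    "great funny film", "superb moving story", "great superb film", "funny moving story",
    "dull boring film", "tedious routine story", "dull tedious film", "boring routine story",
]
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

X_toy = TfidfVectorizer().fit_transform(toy_texts)

booster = make_booster()
booster.fit(X_toy, toy_labels)
toy_predictions = booster.predict(X_toy)
print(type(booster).__name__, toy_predictions)
```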
