<hr>

# Lab 10: Natural Language Processing – Text Classification
Total Marks: 8 Marks + 2 Marks (individual assessment) = 10 Marks


<hr>

In this assignment you will use the provided `musical dataset` and natural language processing to build `classifiers` and `evaluate` the `performance` of a system that assigns a `positive (1)` or `negative (0)` score to text-based reviews of musical instruments.<br><br> The dataset, which is attached with this assignment, is a modified `1000 reviews` subset of the dataset used in *"Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering" by R. He and J. McAuley, WWW, 2016 [cseweb.ucsd.edu]*.

<b>Accuracy</b> = (TP + TN) / (TP + TN + FP + FN)<br>
<b>Precision</b> = TP / (TP + FP)<br>
<b>Recall</b> = TP / (TP + FN)<br>
<b>F1 Score</b> = 2 * Precision * Recall / (Precision + Recall)<br>
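
As a quick check of the formulas above, here is a minimal sketch that computes all four metrics from raw confusion-matrix counts. The function name `classification_metrics` and the example counts are made up for illustration; they are not taken from the lab dataset.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical counts: 40 TP, 30 TN, 10 FP, 20 FN
accuracy, precision, recall, f1 = classification_metrics(tp=40, tn=30, fp=10, fn=20)
print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
```

Note that F1 is the harmonic mean of precision and recall, so it always lies between the two.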

## Using Python, perform the following NLP tasks to build the classifier for the given dataset:


In [None]:
# ignore numpy floating point deprecation warning
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

#text preprocessing
import pandas as pd
import re 
import nltk
import matplotlib.pyplot as plt
import numpy as np
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('musical1.tsv',sep='\t')
x = dataset['Score'].value_counts()
print("Class Distribution:")
print(x)


Class Distribution:
1    533
0    467
Name: Score, dtype: int64


[nltk_data] Downloading package punkt to /Users/jennylong/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jennylong/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jennylong/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
from bs4 import BeautifulSoup

def clean_text_data(data_point, data_size):
    # data_size is unused; it is kept so that existing calls keep working
    # strip HTML markup, naming the parser explicitly to avoid a bs4 warning
    review_soup = BeautifulSoup(data_point, "html.parser")
    review_text = review_soup.get_text()
    # replace non-alphabetic characters with spaces
    review_letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    review_lower_case = review_letters_only.lower()
    """
    Q1. Using NLTK word_tokenize function, tokenize the given dataset reviews
    """
    # tokenize the lowercased text so that the lowercase stop-word list matches
    review_words = word_tokenize(review_lower_case)
    """
    Q2. Using NLTK PorterStemmer, perform the stemming for the tokens of the reviews
    """
    # drop stop words (a set gives fast lookups), then stem what remains
    stop_words = set(stopwords.words("english"))
    words = [stemmer.stem(word) for word in review_words if word not in stop_words]
    """
    Q3. Using NLTK WordNetLemmatizer, perform the lemmatization for the stemmed tokens
    """
    words = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(words)
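
The order of lowercasing and stop-word removal matters, because stop-word lists such as NLTK's English list contain only lowercase entries. The sketch below illustrates this with a tiny hand-written stop list and plain `str.split` tokenization, both stand-ins chosen so the example runs without downloading any NLTK data. `remove_stop_words` is a hypothetical helper, not part of the lab code.

```python
# Tiny stand-in for NLTK's English stop-word list (assumption for illustration)
STOP_WORDS = {"the", "is", "and", "after", "a"}


def remove_stop_words(text, lowercase_first):
    """Split on whitespace and drop stop words, optionally lowercasing first."""
    if lowercase_first:
        text = text.lower()
    return [token for token in text.split() if token not in STOP_WORDS]


review = "After a week The tuner is accurate"

# Without lowercasing, capitalized stop words slip through the filter
kept_without_lowercasing = remove_stop_words(review, lowercase_first=False)
# With lowercasing, every stop word is matched and removed
kept_with_lowercasing = remove_stop_words(review, lowercase_first=True)

print(kept_without_lowercasing)  # ['After', 'week', 'The', 'tuner', 'accurate']
print(kept_with_lowercasing)     # ['week', 'tuner', 'accurate']
```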

In [None]:
training_data_size = dataset["Review"].size

# assign the whole column at once; element-wise chained assignment
# (dataset["Review"][i] = ...) triggers pandas' SettingWithCopyWarning
dataset["Review"] = dataset["Review"].apply(
    lambda review: clean_text_data(review, training_data_size)
)
print("Cleaning training completed!")


Cleaning training completed!


## 4. Build a Random Forest classifier using the sklearn library


In [None]:
#Getting the features ready to be trained
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

X_train, X_cv, Y_train, Y_cv = train_test_split(dataset["Review"], dataset["Score"], test_size = 0.2, random_state=1)

# convert the training, validation and test data to count vectors
X_train = vectorizer.fit_transform(X_train)
X_train = X_train.toarray()
# print(X_train.shape)

X_cv = vectorizer.transform(X_cv)
X_cv = X_cv.toarray()
# print(X_cv.shape)

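# NOTE: this "test" set is the full dataset, so it overlaps the training rows;
# predictions on it are not a held-out estimate
X_test = vectorizer.transform(dataset["Review"])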
X_test = X_test.toarray()
# print(X_test.shape)

forest = RandomForestClassifier() 
forest = forest.fit( X_train, Y_train)
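
To make the bag-of-words step concrete, here is a small sketch of what `CountVectorizer` produces on a toy corpus. The two review strings and the `toy_*` names are invented for illustration. Each row of the resulting matrix is one document and each column counts one vocabulary term.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews (not from the lab dataset)
toy_reviews = ["good guitar good sound", "bad sound"]

toy_vectorizer = CountVectorizer(analyzer="word")
toy_counts = toy_vectorizer.fit_transform(toy_reviews).toarray()

# vocabulary_ maps each term to its column index; sort the terms by that index
toy_vocabulary = sorted(toy_vectorizer.vocabulary_,
                        key=toy_vectorizer.vocabulary_.get)

print(toy_vocabulary)  # ['bad', 'good', 'guitar', 'sound']
print(toy_counts)      # rows: [0 2 1 1] and [1 0 0 1]
```

The vocabulary is learned only when the vectorizer is fitted. Later calls to `transform` reuse that vocabulary and silently ignore unseen words, which is why the lab code fits on the training split alone.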



## Extra Stuff

In [None]:
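# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists from 1.0
vocab = vectorizer.get_feature_names_out().tolist()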
print(f"Printing first 100 vocabulary samples:\n{vocab[:100]}")

distribution = np.sum(X_train, axis=0)

print("Printing first 100 vocab-dist pairs:")

for tag, count in zip(vocab[:100], distribution[:100]):
    print(count, tag)

Printing first 100 vocabulary samples:
['abcd', 'abil', 'abl', 'ableto', 'ableton', 'abnorm', 'abov', 'abovec', 'abram', 'absolut', 'absolutley', 'absorb', 'abu', 'ac', 'accept', 'access', 'accid', 'acclim', 'accommod', 'accord', 'accordingli', 'account', 'accoust', 'accur', 'accuraci', 'achiev', 'acknowledg', 'acoust', 'acquir', 'acquisit', 'across', 'act', 'action', 'activ', 'actual', 'ad', 'adapt', 'adaptor', 'adario', 'add', 'addario', 'addit', 'address', 'addrio', 'addtion', 'adequ', 'adh', 'adher', 'adjust', 'admit', 'admittedli', 'adult', 'advanc', 'adver', 'adverti', 'advi', 'advic', 'aesthet', 'affect', 'affili', 'affin', 'afford', 'afraid', 'after', 'agc', 'age', 'aggress', 'aglaesel', 'ago', 'agr', 'ahead', 'aid', 'air', 'airi', 'airlin', 'akai', 'akg', 'akustom', 'album', 'alchemi', 'alesi', 'align', 'alittl', 'aliv', 'alkalin', 'allegedli', 'allow', 'allpart', 'almost', 'alnico', 'alon', 'along', 'alot', 'alreadi', 'alright', 'also', 'altanta', 'altern', 'although', 'altog

## 5. Evaluate the model by finding its accuracy, precision and F1-score

In [None]:
from sklearn.metrics import precision_score,f1_score

predictions = forest.predict(X_cv) 
print("Accuracy: ", accuracy_score(Y_cv, predictions))

print("Precision: %.3f" %precision_score(Y_cv,predictions))

print("f1-score: %.3f" %f1_score(Y_cv,predictions))

Accuracy:  0.695
Precision: 0.724
f1-score: 0.714
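
The introduction also defines recall, which the cell above does not report. As a hedged sketch, recall and the underlying confusion-matrix counts can be obtained from `sklearn.metrics` as shown below. The labels `y_true` and `y_pred` are made-up toy values, not the lab's `Y_cv` and `predictions`.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")

toy_precision = precision_score(y_true, y_pred)
toy_recall = recall_score(y_true, y_pred)
toy_f1 = f1_score(y_true, y_pred)
print("Precision: %.3f" % toy_precision)
print("Recall: %.3f" % toy_recall)
print("f1-score: %.3f" % toy_f1)
```

On the lab data, the same calls would take `Y_cv` and `predictions` in place of the toy lists.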


In [None]:
result = forest.predict(X_test) 
output = pd.DataFrame( data={"Score":result} )