3 Sentiment Analysis [40 points]
The following training data contains 9484 sentences of movie reviews taken from the Stanford
Sentiment Analysis Tree Bank dataset https://nlp.stanford.edu/sentiment/, where
each sentence is rated from 0 to 4 (0 - negative, 1 - somewhat negative, 2 - neutral,
3 - somewhat positive, 4 - positive). The sentiment score is appended at the end of each
sentence, separated by |.
The training data is available at: https://www.dropbox.com/s/4tcyb2iefdr2jaz/train_data.txt?dl=0
You can use the following tools for implementation:
1. TensorFlow https://www.tensorflow.org/
2. Theano http://deeplearning.net/software/theano/
3. Scikit-learn http://scikit-learn.org/stable/index.html
4. PyTorch http://pytorch.org/
In addition to output and results, please submit your code along with a README.txt
file explaining how to run your code.
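
As a rough illustration of the data format described above, a labelled line can be split at its last `|` into the sentence and its integer score. The helper name `parse_labelled_line` and the sample sentence are made up for this sketch; they are not part of the assignment.

```python
def parse_labelled_line(line):
    """Split 'sentence | score' into (sentence, score)."""
    # rsplit on the last '|' so a '|' inside the sentence would not break parsing
    sentence, score = line.rstrip("\n").rsplit("|", 1)
    return sentence.strip(), int(score)


# Hypothetical example line in the training-data format
sample = "A charming and often affecting journey . | 3\n"
sentence, score = parse_labelled_line(sample)
print(sentence, "->", score)
```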

In [139]:
import os
import re
import nltk
import gensim
import warnings
import pandas as pd
import numpy as np
from sklearn import metrics
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords 
from scipy import sparse
from sklearn.model_selection import cross_val_predict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from nltk.stem import WordNetLemmatizer

warnings.filterwarnings('ignore')

In [241]:
## get the dataset
with open(r"C:\Users\mm199\NLP\HW2\q3_train_data.txt") as training_file:
    train_data = training_file.readlines()
    
with open(r"C:\Users\mm199\NLP\HW2\q3_test_data.txt") as test_file:
    test_data = test_file.readlines()

In [262]:
def preprocess_and_create_dataframe_with_unigram_counts(dataset, test = False):
    ## stop word list and lemmatizer initialization
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    pattern = r"[\w\-]+"

    ## preprocess the data
    y_label = []
    reviews_unigram = []
    for data in dataset:
        if test:
            ## test lines carry no label
            review_statement = data
        else:
            ## training lines look like "sentence | score"; split at the last "|"
            review_statement, review_num = data.rsplit("|", 1)
            y_label.append(int(review_num.strip()))
        review_statement = review_statement.replace("n't", "not")
        words = [match.group() for match in re.finditer(pattern, review_statement)]
        ## lemmatize, lowercase, and drop stop words and "--" tokens
        lemmas = [lemmatizer.lemmatize(w).lower() for w in words if w != "--" and w.lower() not in stop_words]
        reviews_unigram.append(" ".join(lemmas))

    ## get all the unique unigrams (words, not characters) across the reviews
    unigram_words = list(set(w for review in reviews_unigram for w in review.split()))

    ## binary bag-of-words matrix; note the vocabulary is fit on whichever dataset is passed in
    cv = CountVectorizer(binary=True)
    train_set = cv.fit_transform(reviews_unigram)

    return train_set, y_label, unigram_words, reviews_unigram

In [263]:
df_train_sparse, y_label, unigram_words_train, reviews_unigram_train = preprocess_and_create_dataframe_with_unigram_counts(train_data)
df_test_sparse, _, unigram_words_test, reviews_unigram_test = preprocess_and_create_dataframe_with_unigram_counts(test_data, test = True)
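
One point worth illustrating about unigram features: a `CountVectorizer` fitted on the training sentences can be reused to transform the test sentences, so both matrices share the same columns. The toy sentences below are invented; this is a sketch of the pattern rather than the notebook's exact pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy sentences standing in for the preprocessed reviews
train_sentences = ["great movie", "terrible plot", "great acting"]
test_sentences = ["great plot", "unseen word"]

# Fit the vocabulary on the training sentences only
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_sentences)

# Reuse the fitted vocabulary; words not seen in training are ignored
X_test = vectorizer.transform(test_sentences)

print(X_train.shape, X_test.shape)
```

Because the test matrix is produced by `transform` rather than a second `fit`, a classifier trained on `X_train` can be applied to `X_test` without any column mismatch.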

# 3.1

Train a Multilayer Perceptron to predict sentiment score. Using unigram features as input,
call the training and testing functions for the Multilayer Perceptron from the tool. You do
not need to implement the learning (i.e., back-propagation) algorithm. You should have an
input layer, two hidden layers, and an output layer; the second hidden layer should have
10 nodes. Use 10-fold cross-validation to optimize any parameters (e.g. activation function
or number of nodes in the first hidden layer). Use accuracy as the metric for parameter
selection. Describe your parameter optimization process, and report the parameters for your
best model.
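
The parameter search described above could be sketched roughly as follows. This uses `cross_val_score` on synthetic data from `make_classification`, since the review features are not reproduced here; the candidate sizes, `max_iter`, and `random_state` values are assumptions for illustration only.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore")  # silence convergence warnings in this small demo

# Synthetic 5-class data standing in for the unigram features
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

candidate_sizes = [2, 5, 10, 20]  # nodes in the first hidden layer
cv_accuracy = {}
for size in candidate_sizes:
    # Second hidden layer fixed at 10 nodes, as the task requires
    clf = MLPClassifier(hidden_layer_sizes=(size, 10), max_iter=200, random_state=0)
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    cv_accuracy[size] = scores.mean()

# Pick the size with the highest mean 10-fold accuracy
best_size = max(cv_accuracy, key=cv_accuracy.get)
print(cv_accuracy, "best:", best_size)
```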


In [265]:
## train and get prediction on multilayer perceptron
def mlpclassifier(df_X, y, hiddenlayernodes = 10, cross_val = None):
    ## y is one-hot encoded; recover the integer labels for scoring
    y_label = [np.argmax(each_review) for each_review in y]
    clf = MLPClassifier(hidden_layer_sizes=(hiddenlayernodes, 10), activation='relu', alpha=0.0001, batch_size='auto',
                    learning_rate='constant', learning_rate_init=0.001)
    if cross_val:
        ## out-of-fold predictions; cross_val_predict fits clones, so clf itself stays unfitted
        pred = cross_val_predict(clf, df_X, y, cv=cross_val)
    else:
        ## fit on all the data and predict on the same data (training accuracy)
        clf.fit(df_X, y)
        pred = clf.predict(df_X)
    res_pred = [np.argmax(each_review) for each_review in pred]
    score = np.mean(np.equal(res_pred, y_label))
    print ("Accuracy with number_of_hidden_nodes", (hiddenlayernodes)," : ", score)
    return clf, score

In [266]:
## results prediction on test
def pred_test_res(clf, df_X):
    pred = clf.predict(df_X)
    return pred


In [267]:
## cross validation
def cross_validating(df_X, y_label):
    df_param_num = [2, 5, 10, 20]
    max_score = -1
    max_num = None
    for num in df_param_num:
        _, score = mlpclassifier(df_X, y_label, hiddenlayernodes = num, cross_val = 10)
        ## keep the hidden layer size with the best cross validation accuracy so far
        if score > max_score:
            max_score = score
            max_num = num
    return (max_score, max_num)


In [268]:
## one hot encoder form
y_one_hot = [[1 if i == j else 0 for j in range(5)] for i in y_label]
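
For comparison, scikit-learn's `MLPClassifier` also accepts integer labels directly. A small sketch on invented random data: with integer labels the fitted model reports a softmax output layer, whereas a one-hot target matrix is treated as a multilabel problem with independent logistic outputs.

```python
import warnings

import numpy as np
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore")  # convergence warnings are expected on toy data

rng = np.random.RandomState(0)
X = rng.rand(100, 8)                       # invented features
y_int = np.arange(100) % 5                 # integer labels 0..4, every class present
y_one_hot_demo = np.eye(5, dtype=int)[y_int]  # same labels, one-hot encoded

clf_int = MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=50, random_state=0)
clf_int.fit(X, y_int)

clf_hot = MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=50, random_state=0)
clf_hot.fit(X, y_one_hot_demo)

print(clf_int.out_activation_)  # output activation chosen for integer labels
print(clf_hot.out_activation_)  # output activation chosen for one-hot targets
print(clf_int.predict(X[:3]).shape, clf_hot.predict(X[:3]).shape)
```

In the multilabel case a predicted row can contain no positive entry at all, in which case taking `np.argmax` of that row yields class 0.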

In [281]:
## call the cross validation to get the best parameters
max_score1, max_num = cross_validating(df_train_sparse, y_one_hot)

Accuracy with number_of_hidden_nodes 2  :  0.28954027836355967
Accuracy with number_of_hidden_nodes 5  :  0.29259805989034166
Accuracy with number_of_hidden_nodes 10  :  0.30177140447068745
Accuracy with number_of_hidden_nodes 20  :  0.30366933783213834


# 3.2
Using the parameters for the best performing model from 3.1, re-train it on the whole training
set, and report the accuracy on the training set.

In [269]:
## Run on the whole training set on the best performing model 

##  Using max_num = 20 as it gave the highest cross validation accuracy in 3.1

max_num = 20
clf1, score1 = mlpclassifier(df_train_sparse, y_one_hot, hiddenlayernodes = max_num)

Accuracy with number_of_hidden_nodes 20  :  0.9984183888654576


# 3.3
Use pre-trained word embeddings GoogleNews-vectors-negative300.bin.gz from Word2vec
https://code.google.com/archive/p/word2vec/, and compute the review feature vector
by using the average word embeddings. Do the same thing in 3.1: Use 10-fold cross-validation
to optimize any parameters (e.g. activation function or number of nodes in the first hidden
layer). Use accuracy as the metric for parameter selection. Report the parameters for your
best model. Then re-train the best performing model on the whole training set, and report
the accuracy on the training set.
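
The averaging step described above can be sketched with a toy lookup table in place of the 300-dimensional word2vec model. The vectors and the helper name `average_embedding` are invented for illustration. Here out-of-vocabulary tokens are skipped and excluded from the denominator, which is one common choice and an assumption of this sketch.

```python
import numpy as np

# Invented 3-dimensional vectors standing in for the 300-dimensional word2vec model
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "film": np.array([3.0, 2.0, 0.0]),
}


def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of in-vocabulary tokens; zeros if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)


# "zzz" is not in the toy vocabulary and is ignored
print(average_embedding(["good", "film", "zzz"], toy_vectors, dim=3))
```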

In [163]:
## word embedding initialization (model is a module-level variable used by the functions below)
model = gensim.models.KeyedVectors.load_word2vec_format(r'C:\Users\mm199\NLP\HW2\GoogleNews-vectors-negative300.bin', binary=True)

In [270]:
## get word2vec for each sentence by using average word embeddings
def word2vec_classifier(reviews_unigram):
    dict_word2vec = {}
    for index, word_list in enumerate(reviews_unigram):
        arr = np.zeros(300)
        word_list = word_list.split(" ")
        for word in word_list:
            try:
                arr += model.get_vector(word)
            except KeyError:
                ## word not in the word2vec vocabulary
                continue
        ## note: words missing from the vocabulary still count in the denominator
        dict_word2vec[index] = arr / len(word_list)
    df_word2vec = pd.DataFrame(dict_word2vec).T
    return df_word2vec

In [271]:
## get df with word embeddings
df_word2vec_train = word2vec_classifier(reviews_unigram_train)
df_word2vec_test = word2vec_classifier(reviews_unigram_test)

In [272]:
## call the cross validation to get the best parameters for word2vec dataset
max_score2, max_num = cross_validating(df_word2vec_train, y_one_hot)

Accuracy with number_of_hidden_nodes 2  :  0.17830029523407845
Accuracy with number_of_hidden_nodes 5  :  0.23102066638549135
Accuracy with number_of_hidden_nodes 10  :  0.2650780261493041
Accuracy with number_of_hidden_nodes 20  :  0.2970265710670603


In [279]:
## run on the whole training set with the best performing model parameters with nodes in the first layer as 20
max_num = 20
clf2, score2 = mlpclassifier(df_word2vec_train, y_one_hot, hiddenlayernodes = max_num)

Accuracy with number_of_hidden_nodes 20  :  0.47522142555883595


# 3.4
In addition to the word embeddings, add one type of features by your own design (e.g. POS
tags distribution) to the model in 3.3. Then re-train this model on the whole training set,
and report the accuracy on the training set.
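
A POS tag distribution feature of the kind suggested above could look roughly like this. The tag inventory `TAGS`, the helper name, and the hand-tagged review are assumptions for the sketch; fixing the tag list up front means every review, in training or test, yields the same feature columns.

```python
from collections import Counter

# Assumed fixed tag inventory so every review yields the same feature columns
TAGS = ["NOUN", "VERB", "ADJ", "ADV"]


def pos_distribution_vector(tagged_tokens, tags=TAGS):
    """Return the fraction of tokens carrying each tag in `tags`."""
    counts = Counter(tag for _, tag in tagged_tokens if tag in tags)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(tags)
    return [counts[tag] / total for tag in tags]


# Hand-tagged example; in practice the tags would come from a POS tagger
review = [("great", "ADJ"), ("movie", "NOUN"), ("really", "ADV"), ("shines", "VERB")]
print(pos_distribution_vector(review))
```

The resulting vector can then be concatenated with the averaged word embedding of the same review.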


In [274]:
def pos_distribution(reviews_unigram, unigram_words, df_word2vec):  
    ## pos tag for each unigram
    words_review_pos_tag = nltk.pos_tag(unigram_words)

    ## total number of unique tags
    total_tags = (set([i[1] for i in words_review_pos_tag]))

    ## word and tag in a dict format
    words_review_pos_tag_dict = {i[0]:i[1] for i in words_review_pos_tag}

    ## additional feature of POS distribution
    each_review_pos_distribution = {}
    for index, word_list in enumerate(reviews_unigram):
        length_of_tags = 0
        each_review_word_tag = {i : 0 for i in total_tags}
        word_list = word_list.split(" ")
        for words in word_list:
            if words not in words_review_pos_tag_dict:
                continue
            tag = words_review_pos_tag_dict[words]
            each_review_word_tag[tag] += 1
            length_of_tags += 1
        if length_of_tags != 0:
            for tag in each_review_word_tag:
                each_review_word_tag[tag] /= length_of_tags
        each_review_pos_distribution[index] = each_review_word_tag
    df_pos_dist = pd.DataFrame(each_review_pos_distribution).T
    
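    ## concatenate two dataframes
    df_word_embedding = pd.concat([df_word2vec, df_pos_dist], axis = 1)
    ## embedding columns are ints and tag columns are strings; some scikit-learn versions reject mixed column name types
    df_word_embedding.columns = df_word_embedding.columns.astype(str)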
    
    return df_word_embedding

In [275]:
df_word_embedding_train = pos_distribution(reviews_unigram_train, unigram_words_train, df_word2vec_train)
df_word_embedding_test = pos_distribution(reviews_unigram_test, unigram_words_test, df_word2vec_test)

In [278]:
## run on the whole training set with the best performing model parameters
max_num = 20
clf3, score3 = mlpclassifier(df_word_embedding_train, y_one_hot, hiddenlayernodes = max_num)

Accuracy with number_of_hidden_nodes 20  :  0.49578237030788697


# 3.5 
Using the best model from above (based on results from 3.2, 3.3, and 3.4), predict the sentiment
scores for all sentences in this test set: https://www.dropbox.com/s/jf8mr7kgt3hfv6y/test_data.txt?dl=0
(contains 2371 sentences, one sentence per line).
Append your predicted sentiment score to the end of each line, separated by |, as shown in
the training data.
Submit this file and name it labels.txt.
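
The output format asked for above can be sketched as a small helper that appends one predicted label to each test line. The function name `append_labels`, the separator default, and the sample sentences are invented for illustration.

```python
def append_labels(lines, labels, sep=" |"):
    """Return each input line with its predicted label appended after `sep`."""
    if len(lines) != len(labels):
        raise ValueError("need exactly one label per line")
    return [line.rstrip("\n") + sep + str(label) + "\n"
            for line, label in zip(lines, labels)]


# Invented sentences and predictions
test_lines = ["A fine film .\n", "Dull and lifeless .\n"]
predicted = [3, 1]
labelled = append_labels(test_lines, predicted)
print("".join(labelled), end="")
```

The returned list can be written to `labels.txt` in one call with `writelines`.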

In [280]:
## Fit the final classifier with activation = "relu" and solver = "adam".
## The model trained on unigram counts appears to have overfit: its training accuracy (0.998) is far above
## its cross validation accuracy (0.304). So choosing the word embeddings with POS tag distribution, whose training
## accuracy is closer to the cross validation accuracy, with 20 nodes in the first hidden layer.

df_X = df_word_embedding_test
max_num = 20
clf = MLPClassifier(hidden_layer_sizes=(max_num, 10), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', 
                    learning_rate='constant', learning_rate_init=0.001, max_iter=500, random_state=3)
clf.fit(df_word_embedding_train, y_one_hot)
predicted_res = clf.predict_proba(df_X)
res = [np.argmax(i) for i in predicted_res]
test_r = [each_review.strip("\n") + " |" + str(res[i]) + "\n" for i,each_review in enumerate(test_data)]
with open("labels.txt", "w") as out_file:
    out_file.writelines(test_r)
