# [95-865] Unstructured Data Analysis: Final Exam Q1

Name:  
Andrew ID:

# Q1:  Sentiment Analysis [Total: 25 Points]

Universal Brothers, A fictional movie production company approaches you to find out what people feel about the movies the produce.
Download the dataset from https://www.dropbox.com/s/1ztjhjsznlhtv10/ExamData-1.zip?dl=0 <br>
Unzip into same folder as this notebook. Do not have the files inside any other inner folder.
Find attached an IMDB dataset for movie reviews. You will perform sentiment analysis on this dataset. Low rating movies are labeled 0 and high rated movies are labeled 1. You are provided with a train and test dataset.

### (a) Load [1 Point]
Load the train and test files into a dataframe. Name the columns to help you in the rest of the problem.

In [1]:
import os
import pandas as pd
train = pd.read_csv( os.path.join('train.csv'))
test = pd.read_csv(os.path.join('test.csv'))
train = train[train.columns[[1,2]]]
train.columns = ['Sentiment', 'Review']
test = test[test.columns[[1,2]]]
test.columns = ['Sentiment', 'Review']

### (b) Clean the Dataset [4 points]

- Tokenize the reviews using Spacy
- Convert tokens to lower case
- Only keep alphanumeric characters in the reviews
- Use the stopwords.txt file to remove stop words from the tokens list
- Remove punctuations from the reviews
- Perform the above on both the train and test review datasets
- List the tokens for the first 5 reviews

In [3]:
import spacy
nlp = spacy.load('en')

In [4]:
import string
def tokenize(review):
        review = review.lower()
        review = list(nlp(review.decode("latin-1").strip()))
        review = [str(x) for x in review]
        tokens = [x for x in review if x not in string.punctuation]
        tokens = [x for x in tokens if x.isalnum()]
        with open('stopwords.txt') as f:
            stopwords = f.read().splitlines()
        tokens = [x for x in tokens if x not in stopwords]
        return tokens
        

train['Tokens'] = train['Review'].apply(lambda a: tokenize(a))
test['Tokens'] = test['Review'].apply(lambda a: tokenize(a))
train.head(5)


Unnamed: 0,Sentiment,Review,Tokens
0,1,With all this stuff going down at the moment w...,"[stuff, going, moment, mj, started, listening,..."
1,1,"\The Classic War of the Worlds\"" by Timothy Hi...","[classic, war, timothy, hines, entertaining, f..."
2,0,The film starts with a manager (Nicholas Bell)...,"[film, starts, manager, nicholas, bell, giving..."
3,0,It must be assumed that those who praised this...,"[assumed, praised, film, greatest, filmed, ope..."
4,1,Superbly trashy and wondrously unpretentious 8...,"[superbly, trashy, wondrously, unpretentious, ..."


### (c) Word2Vec [Loading word Vector - 2 points, Feature matrix - 8 points]

Use the glove.6B.50d.txt to load the 50 dimensional word vector. In the RNN tutorial, you were taught how to load a pre-computed embedding. Load this file the same way. 
The above gives you word embeddings for the words. A review is composed of words. Given a review, the review embedding is the average of the individual token(word) embeddings. Write a function to create a matrix whose rows are the review embeddings created as described in the previous sentence. If a word does not have an embedding in the above glove model, ignore it.


In [5]:
import numpy as np
train_rev = np.array(train.Tokens)
train_label = np.array(train.Sentiment)
test_rev = np.array(test.Tokens)
test_pred = np.array(test.Sentiment)

In [6]:

embeddings_index = {}
with open("glove.6B.50d.txt") as f:
    # Each row represents a word vector
    for line in f:
        values = line.split()
        # The first part is word
        word = values[0]
        # The rest are the embedding vector
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [13]:
embedding_dim = 50
def createFeatureMatrix(review):
    embedding_matrix = np.zeros((len(review), embedding_dim))
    i = 0 
    cnt = 0
    for rev in review:
        vec = np.zeros(embedding_dim)
        c=0
        for token in rev:
            c+=1
            if token in embeddings_index:
                vec+=embeddings_index[token]
            else:
                cnt+=1
        vec = vec/c
        embedding_matrix[i] = vec
        i+=1
    print(cnt)
    return embedding_matrix
        
train_mat = createFeatureMatrix(train_rev)
test_mat = createFeatureMatrix(test_rev)

704
192


### (d) SVM [Cross Validation - 6 points, Correct Prediction - 4 Points]
Pass the above created training matrix through Polynomial Kernel SVM.
Do a grid search in SVM over the best c in the range np.logspace(-4, 2, 3)  and best degree in the range range range(1,4)
Predict the rating on SVM and report the accuracy. <br>
NOTE: You have to implement your own grid search cross validation code.

In [14]:
from sklearn import svm
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

n_folds = 5
k_fold = KFold(n_folds)
Cs = np.logspace(-4, 2, 3)
Ds = range(1,4)

c_scores = []
bestD = []
for C in Cs:
    d_scores= []
    for d in Ds:
        fold_scores = []
        for k, (train, val) in enumerate(k_fold.split(train_mat, train_label)):
            clf = svm.SVC(C=C,kernel="poly",degree=d)
            clf.fit(train_mat[train], train_label[train])

            ypred = clf.predict(train_mat[val])
            yval = train_label[val]
            accuracy = np.sum(ypred==yval)/float(len(ypred))
            fold_scores.append(accuracy)
        d_score = np.mean(fold_scores)
        d_scores.append(d_score)
    print(d_scores)
    ind = np.argmax(d_scores)
    bestD.append(Ds[ind])
    c_scores.append(max(d_scores))
print(c_scores)
ind = np.argmax(c_scores)
bestC= Cs[ind]
bestD = bestD[ind]





[0.51751256281407032, 0.51751256281407032, 0.51751256281407032]
[0.51751256281407032, 0.51751256281407032, 0.51751256281407032]
[0.73772864321608034, 0.73072864321608033, 0.70870854271356776]
[0.51751256281407032, 0.51751256281407032, 0.73772864321608034]


In [15]:
print(bestC)
print(bestD)

100.0
1


In [16]:

clf = svm.SVC(C=bestC,degree=bestD,kernel="linear")
clf.fit(train_mat, train_label)

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=1, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [17]:
y_pred = clf.predict(test_mat)
accuracy_score(test_pred, y_pred)

0.73092369477911645