# Lab 7: Sentiment Analysis on Text Data

(Total points 130)

In this lab, you will gain hands on experiences on processing the raw text data for sentiment analysis.   This lab is divided into two subparts as follows: 

1. Treat sentiment analysis as a process of word matching, which you are given lexicons for both sets of positive and negative words.

2. Think of sentiment analysis as a binary classification problem that a binary classifier need to be learned from a set of training samples and generalize to any other unseen samples.

The purpose of this lab is to guide you steps by steps and after this lab you will at least have a general sense of academic research in the field of text mining, NLP, etc.  

Beside this ipython notebook, you are also given a list of text files as follows:

1.  pros_new.txt: contains 500 “positive” text data for you to analyze.
2.  cons_new.txt: contains 500 “negative” text data for you to analyze.
3.  positive-words.txt: positive word lexicon.
4.  negative-words.txt: negative word lexicon.
5. stopwords.txt:  a list of stop words.

You will use all these files through the lab, and it will be explained more in detail later.

###  What to hand in: 
You will need to pack following things into a file.


   * The completed Notebook file (ipynb) - Remember to answer all the questions in the notebook!
   * All the figures plotted in this lab 

### Please go through the instructions in the notebook thoroughly

## Exercise 1: Sentiment Analysis as Word Matching

First, make sure all the following python packages are successfully installed, you need to use them for this lab.

Use  one of the following commands if needed:

pip install nltk  


pip install nltk --user


In [1]:
import sys;
import numpy as np;
import scipy.sparse;
import nltk;
from nltk.stem.snowball import SnowballStemmer;
import random;
import math;

In this lab, we will process all 1000 text data from pros_new.txt and cons_new.txt (both positive and negative) to match the word in positive or negative lexicons. An example of the type of data you will see is:

“Holding the camera steady for 5 seconds after taking picture, Not XP compatible Too little space to answer”

## Fill in the codes in the blanks below.

You will first work with an example line, and then combine all the code to create your sentiment analyzer.

Step 1:  Token the string and remove punctuation using regular expressions:

In [2]:
line = "Holding the camera steady for 5 seconds after taking picture, Not XP compatible Too little space to answer.";

# define the toker
toker = nltk.tokenize.RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True);
tokens = toker.tokenize(line);

Step 2: Convert each word to a lower case:

In [3]:
tokens_new = []
for word in tokens:     
    if word not in {' ', ',' , '.'}:
    
        #### 1. FILL IN SOME CODE LINE BELOW ####
        # (5 points)
        
        # tokens_new should contain the list of words in lower case.
        tokens_new.append(word.lower())
                
print tokens_new
tokens = tokens_new


['holding', 'the', 'camera', 'steady', 'for', '5', 'seconds', 'after', 'taking', 'picture', 'not', 'xp', 'compatible', 'too', 'little', 'space', 'to', 'answer']


Step 2: Load the stop words list:

In [4]:
f = open('stopwords.txt','r');
stoplist = set([]);
for line in f:
    line = line.strip();
    line = set([line]);
    stoplist |= line;
f.close;

Step 3: Remove all stop words from the tokens list:

In [5]:
#### 2. WRITE YOUR CODE HERE ####
# (10 points)

# remove the words in the 'tokens' list which are in stoplist
# return a list called tokens
new_list = []
for word in tokens:
    if word in stoplist:
        continue
    else:
        new_list.append(word)
        
tokens = new_list

Step 4:  Stem each word:

Stemming is the process of converting the words of a sentence to its non-changing portions. In the example of amusing, amusement, and amused, the stem would be amus(e).

In [6]:
# define the stemmer 
stemmer = SnowballStemmer("english");

# stem all the words in the tokens list. This is a standard way to step words.
tokens = [ str(stemmer.stem(word)) for word in tokens];
print tokens

['hold', 'camera', 'steadi', '5', 'second', 'take', 'pictur', 'xp', 'compat', 'space', 'answer']


Step 5(a):  Load the postive lexicons.

This should be similar to step 2 above. Also stem each word, similar to step 4 above. 

In [7]:
#### 3. COMPLETE THE CODE BELOW ####
# (15 points)
# read the pos lexicon file

file_pos = open("positive-words.txt")
stemmer = SnowballStemmer("english")
poslist = set([]);
for line in file_pos:
    # 1. Use the strip() function to remove the end-of-line tags
    # 2. Stem the line (similar to the previous stemming code. Remember, each line is a word in these text files.)
    # 3. Add the stemmed line to the postive list.
    line = line.strip()
    line = set([line])
    poslist |= line

poslist = [str(stemmer.stem(word)) for word in poslist]
file_pos.close();

print "==> The list of first 15 postive words after stemming: "
print list(poslist)[:15]

==> The list of first 15 postive words after stemming: 
['unencumb', 'pardon', 'saver', 'desir', 'encourag', 'sleek', 'thought', 'cooper', 'fair', 'faster', 'work', 'undisput', 'sturdi', 'envious', 'homag']


Step 5(b): Now do the same for the negative lexicon file. 

In [8]:
#### 4. WRITE YOUR CODE HERE ####
# (5 points)

file_neg = open("negative-words.txt")
neglist = set([]);

for line in file_neg:
    line = line.strip()
    line = set([line])
    neglist |= line
    
neglist = [str(stemmer.stem(word)) for word in neglist]
file_neg.close()
print "==> The list of first 15 negative words after stemming: "
print list(neglist)[:15]


==> The list of first 15 negative words after stemming: 
['limit', 'subtract', 'belliger', 'suicid', 'cuss', 'inadequaci', 'dissolut', 'refut', 'threaten', 'foul', 'obstruct', 'protest', 'slog', 'lurk', 'thirst']


#### Note: This is not the best way to write code. If one is using the same set of steps multiple times, one should define a function that does the required processing.


Step 6: Match your string tokens with both lexicons: If the number of positive words larger than the negative ones, you will report this as a positive sentence, vise versa.  

NOTES:  If the number of postive words is the same as the negative ones, you will report it as neutral/not sure.  

In [9]:
# Create a list of positive words and
# negative words in the list 'tokens'.
# Print the number of positive and negative words 


#### 5. FILL IN BELOW ####
# (5 points, including answer)

pos_word_list = []
neg_word_list = []

for token in tokens:
    if token in poslist:
        pos_word_list.append(token)
    elif token in neglist:
        neg_word_list.append(token)

print "Pos words: ", len(pos_word_list)
print "Neg words: ", len(neg_word_list)


Pos words:  2
Neg words:  0


#### Question:
Based on the number of postive and negative words, the line above is 

(a) Positive (b) Negative (c) Neutral

#### Your answer:
Positive

Step 7: Combine it all!

Now that we have all the tools to process the data for sentiment analysis, we will be analyzing the two files "pros_new.txt" and “cons_new.txt”  line by line (each line contains one sentence). 

To do so, we define the function sentiment_analyzer which takes in filename, postive list and negative list. The function returns the number of positive, negative and neutral sentences, and the number of sentences in the document. 

Report the accuracy you obtained for each file separately as well as how many percentage of the positive sentence you categorized as negative (the miss detection rate) and how many percentage of the negative sentence you categorized as positive (the false alarm rate).  

First use the pros text file.

In [10]:
############### 6. WRITE YOUR CODE HERE ###############
# (25 points)

def sentiment_analyzer(filename, poslist, neglist):
    
    # 1. Load the tokenizer. 
    # 2. Set postive, negative and neutral counts to zero. 
    # 3. Read the file and keep track of the positive and negative words lists. 
    # 4. Then check if the document is more positive, or more negative. If equal, report 'Not Sure'. 
    # 5. Then print the accuracy and missed detection rate.

    f = open('pros_new.txt','r');
    num_lines = 0
    pos_count_total = 0
    neg_count_total = 0
    neutral_count_total = 0
    
    for line in f:
        toker = nltk.tokenize.RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True);
        tokens = toker.tokenize(line);

        tokens_new = []
        for word in tokens:     
            if word not in {' ', ',' , '.'}:
                tokens_new.append(word.lower())

        new_list = []
        for word in tokens:
            if word in stoplist:
                continue
            else:
                new_list.append(word)
        tokens = new_list
        tokens = [str(stemmer.stem(word)) for word in tokens]

        pos_word_list = []
        neg_word_list = []

        for token in tokens:
            if token in poslist:
                pos_word_list.append(token)
            elif token in neglist:
                neg_word_list.append(token)
        
        if len(pos_word_list) > len(neg_word_list):
            pos_count_total += 1
        elif len(neg_word_list) > len(pos_word_list):
            neg_count_total += 1
        else:
            neutral_count_total += 1
        
        num_lines += 1

    f.close()
    
    return pos_count_total, neg_count_total, neutral_count_total, num_lines

Use the above function to process the pros_new.txt document.

In [11]:
############### 7. WRITE YOUR CODE HERE ###############
# (5 points, including answers to questions below)

p_correct, p_wrong, p_notsure, p_num_lines = sentiment_analyzer("pros_new.txt", poslist, neglist)

print "p_correct: ", p_correct
print "p_wrong: ", p_wrong
print "p_notsure : ", p_notsure
print "p_numlines: ", p_num_lines
print "Accuracy: ", float(p_correct)/float(p_num_lines)
print "Miss detection rate: ", float(p_wrong)/float(p_num_lines)


p_correct:  345
p_wrong:  33
p_notsure :  122
p_numlines:  500
Accuracy:  0.69
Miss detection rate:  0.066


#### [WRITE YOUR ANSWERS BELOW. EXPLAIN IF NECESSARY.] ####

Number of sentence you think is postive: 345

Number of sentence you think is negative: 33

Number of sentence you think is Neutral: 122

Accuarcy: 0.69

Miss detection rate: 0.066



Now, repeat for the cons text file

In [12]:
############### 8. WRITE YOUR CODE HERE ###############
# (5 points, including answers to questions below)

n_wrong, n_correct, n_notsure, n_num_lines = sentiment_analyzer("cons_new.txt", poslist, neglist)

print "n_correct: ", n_correct
print "n_wrong: ", n_wrong
print "n_notsure : ", n_notsure
print "n_numlines: ", n_num_lines

n_correct:  33
n_wrong:  345
n_notsure :  122
n_numlines:  500


#### [WRITE YOUR ANSWERS BELOW. EXPLAIN IF NECESSARY.] ####

Number of sentence you think is postive: 345

Number of sentence you think is negative: 33

Number of sentence you think is Neutral: 122

Accuarcy: 0.066

False Alarm rate: 0.69



Step 8: Report the overall accuracy.

In [13]:
#### 9. YOUR CODE HERE TO OUTPUT THE OVERALL ACCURACY ####
# (5 points)

# accuracy = number of total correct predictions / number of total sentences

o_accuracy = (float(p_correct)+float(n_correct))/(float(n_num_lines) * 2)

print "Overall accuracy: ", o_accuracy


Overall accuracy:  0.378


## Exercise 2: Sentiment Analysis as Binary Classifiction

In the second part of this lab, we will be using the Nearest Neighbor (NN) classifier to treat the sentiment analysis as a binary classification problem.  

Each sentence is vectorized for you in some vector space.  And you need to randomly select 80% of positive and negative samples and use it as a training set, and the rest of data as testing set.  For each test sample, calculate the Euclidean distance to all the samples in the training set, and use the label associated with the data with least distance as your predicted label. 

Compare your predicted label with ground truth and report the accuracy.    

We will first construct a dictionary of words, associating a number (corresponding to its entry in the dictionary) with the word.

We will then convert each sentence to a 1,610 dimensional vectorized representation. We will encode this using a presence-absence matrix. Every sentence will have an entry corresponding to every word indicating whether the word is present in the sentence or not. This is one way of converting text to vectors. 

At the end, we will obtain two variables:

1. pos_vec:  a 500 x 1610 numpy matrix, each row indicate the vector representation for the corresponding positive sentence in the pros_new.txt file.  

2. neg_vec:  same as the pos_vec except that each row is generated from the cons_new.txt file. 

In [14]:
# Construct a full word list using dictionary. This code is provided to you.

wordlist = {};
counter = 0;
f = open('pros_new.txt','r');
for line in f: 
    line = line.strip();
    tokens = toker.tokenize(line);
    for word in tokens:
        if not(word in wordlist) and word.isalpha():
            wordlist[word] = counter;
            counter += 1;
f.close;
f = open('cons_new.txt','r');
for line in f: 
    line = line.strip();
    tokens = toker.tokenize(line);
    for word in tokens:
        if not(word in wordlist) and word.isalpha():
            wordlist[word] = counter;
            counter += 1;
f.close;

print "The total number of unique words: %d" %(len(wordlist.keys()));

The total number of unique words: 1610


We will first define a function to obtain the binary vectors for any file:

In [15]:
#### 10. YOUR CODE HERE ####
# (15 points)

def get_binary_vec(filename):
    # construct binary feature vectors; 
    vec = np.zeros((500,len(wordlist.keys())));
    f = open(filename,'r');
    
     
    # 1. Load the tokenizer
    toker = nltk.tokenize.RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
        
    iteration = 0;
    for line in f: 
        
        # 1. Strip line to remove trailing line breaks
        # 2. Tokenize the line
        line = line.strip()
        tokens = toker.tokenize(line)
        
        for word in tokens:
            # 3. Go through every word in the line. If the word contains no numbers, 
            #    update the corresponding entry in the vec array to be 1.
            
            # To assign '1' to the correct indices - the first index corresponds
            # to the sentence number and the second one corresponds to the 
            # column number as defined by the word_list dictionary
            if word in wordlist.keys():
                vec[iteration][wordlist[word]] = 1
            
            
        iteration += 1;
    f.close()
    return vec

Now run for both files:

In [16]:
pos_vec = get_binary_vec('pros_new.txt')
neg_vec = get_binary_vec('cons_new.txt')

This is a fast implementation of Euclidean distance computation, which will be a lot faster than if you are looping through all training and testing samples.  We will use this as is.

In [17]:
def FAST_L2_distance(A,B):
    # L2_distance conputes pairwise squared Euclidean distance matrix. 
    # Inputs: 
    #     A -- (d x m) matrix , m samples, d dimension
    #     B -- (d x n) matrix , n samples, d dimension
    #
    # Outputs: 
    #     D -- (m x n) squared Euclidean distance matrix, (i,j) entry indicates the squared distance of ith sample in A and 
    #          jth sample in B.   
    
    d = A.shape[0];
    # error checking:
    if (B.shape[0] != d):
        raise ValueError ("Dimension mismatched");
    # A_norm[i] = ||A[:,i]||^2
    A_norm = np.sum(A**2,axis=0);
    A_norm = np.reshape(A_norm,(1,A_norm.shape[0]));
    # print A_norm.shape;
    B_norm = np.sum(B**2,axis=0);
    B_norm = np.reshape(B_norm,(1,B_norm.shape[0]));
    # print B_norm.shape;
    cross = -2*np.dot(A.T,B);
    # print cross.shape
    # print A_norm.T.shape
    # print np.tile(A_norm.T,(1,cross.shape[1]))
    # print np.tile(B_norm,(cross.shape[0],1)).shape
    D = cross + np.tile(A_norm.T,(1,cross.shape[1])) + np.tile(B_norm,(cross.shape[0],1));
    # print D.shape
    # print D
    return D;

Set the seed for the random generator.  Please use "2017" as the seed for your final sumbission.    

In [64]:
# set a seed for the random generator 
random.seed("2017");

Random select 80% of positive and negative samples and use them as your training data. To do so:
   * First figure out how many samples will be in the training, and how many in the test.
   * Then randomly sample indices in the range 0-500 to obtain the training and testing indices.
   * Then select the appropriate subset of rows of pos_vec and neg_vec for training and testing. Append these to get            the final x_train and x_test.
   * Finally assign '1' and '0' to the labels, based on whether they are from the positive file, or the negative file, and append appropriately to get y_train and y_test.

In [65]:
############### 11. WRITE YOUR CODE HERE ###############
# (10 points)

num_sample = 1000
pos_idx_train = random.sample(range(0,500), 400)
neg_idx_train = random.sample(range(0,500), 400)
pos_idx_test = []
neg_idx_test = []
for i in range(500):
    if i not in pos_idx_train:
        pos_idx_test.append(i)
    if i not in neg_idx_train:
        neg_idx_test.append(i)

x_train = []
x_test = []
y_train = []
y_test = []

for i in pos_idx_train:
    x_train.append(pos_vec[i])
    y_train.append(1)

for j in neg_idx_train:
    x_train.append(neg_vec[j])
    y_train.append(0)

x_train = np.asarray(x_train)
y_train = np.asarray(y_train)


for i in pos_idx_test:
    x_test.append(pos_vec[i])
    y_test.append(1)
    
for j in neg_idx_test:
    x_test.append(neg_vec[i])
    y_test.append(0)

x_test = np.asarray(x_test)
y_test = np.asarray(y_test)


print "The number of training data: %d" %(x_train.shape[0]);
print "The number of testing data: %d" %(x_test.shape[0]);

The number of training data: 800
The number of testing data: 200


Calculate the distance between each testing sample to each training sample. Finally, 
* Sort to find the element with the least distance, for each line
* Assign the corresponding label
* Compare to ground truth to get accuracy

In [66]:
############### 12. WRITE YOUR CODE HERE ###############
# (10 points, including average accuracy)

D = FAST_L2_distance(x_test.transpose(), x_train.transpose())

# sort and find the index 
# hint: argsort
pred_idx = [np.argsort(line) for line in D]

# assign the label of the nearest neighbor as the prediction for y
pred_y = []

for i in pred_idx:
    pred_y.append(y_train[i][0])

# compare to the groundtruth
NN_correct =  sum(pred_y == y_test);
print NN_correct

# compute and print the accuracy
NN_acc = NN_acc = float(NN_correct)/float(y_test.shape[0])*100.0;
print "The nearest neighbor classifier accuarcy: %f%%" %(NN_acc)


170
The nearest neighbor classifier accuarcy: 85.000000%


Repeat the learning experiment for ten times (By changing the last 3 blocks of code).
Report your average accuarcy.  

Average Accuarcy: .83

### Questions: ( 5+5+5 points )
* Compare your accuarcy from two subparts of this lab, which one is better? 
* Why? 
* Can you think about other ways to improve accuarcy in both ways?  


### Write your answers here:

Overall, NN using euclidean distance was significantly better since whatever word we were looking at did not directly need to be in our positive words list or negative words list to be classified. Therefore, there were a lot more words being correctly classified.

One big way to improve the accuracy in both conditions is to expand our list or positive and negative words. Therefore, the chances of direct hits is larger and there are more classified points to look at in the nearest neighbor prediction.