# Sentiment analysis (Exercise 4)

In [1]:
__author__ = "Paloma Jeretic"
__version__ = "DSGA 1012, NYU, Spring 2018 term"

## Setup

First, let's load the Stanford Sentiment Treebank. If you don't already have it, download [the train/dev/test Stanford Sentiment Treebank distribution](http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip), unzip it, and put the resulting folder in the same directory as this notebook. (If you want to put it somewhere else, change `sst_home` below.)

In [1]:
import re
import random
import os
import numpy as np
import collections

In [2]:
sst_home = 'trees'

def load_sst_data(path):
    # Let's do 2-way positive/negative classification instead of 5-way
    EASY_LABEL_MAP = {0:0, 1:0, 2:None, 3:1, 4:1}
    
    data = []
    with open(path) as f:
        for i, line in enumerate(f): 
            example = {}
            example['label'] = EASY_LABEL_MAP[int(line[1])]
            if example['label'] is None:
                continue
            
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)

    return data
     
training_set = load_sst_data(sst_home + '/train.txt')
dev_set = load_sst_data(sst_home + '/dev.txt')
#test_set = load_sst_data(sst_home + '/test.txt')
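
To see what the regular expression in `load_sst_data()` does, here is a minimal sketch applied to a single line in the treebank's bracketed format. The example line is made up for illustration (it is not an actual treebank entry), but it follows the same layout: every node starts with `(` followed by a 0-4 sentiment label.

```python
import re

# Hypothetical line in the SST tree format (not an actual treebank entry).
line = "(3 (2 A) (4 (3 lovely) (2 film)))\n"

# The sentence-level label is the digit right after the first "(".
EASY_LABEL_MAP = {0: 0, 1: 0, 2: None, 3: 1, 4: 1}
label = EASY_LABEL_MAP[int(line[1])]

# Remove every "(<digit>" and every ")" together with adjacent whitespace,
# then drop the single leading space that is left over.
text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)[1:]

print(label, repr(text))
```

This prints `1 'A lovely film'`: the root label 3 maps to the positive class, and only the words of the sentence remain.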

We will be using IMDb movie reviews as a test set later on. Download the data [here](http://ai.stanford.edu/~amaas/data/sentiment/), unzip it, and put the resulting folder in the same directory as this notebook.

The following function converts it to the same format as our SST data: a list of dictionaries with `text` and `label` keys.

In [3]:
imdb_home = 'aclImdb/test/'

def load_imdb_data(path):
    
    pos_data, neg_data = [], []
    all_files = []
    _limit = 250
    
    for dirpath, dirnames, files in os.walk(path):
        for name in files:
            all_files.append(os.path.join(dirpath, name))
            
            
    for file_path in all_files:
        # Keep at most `_limit` negative reviews (label 0).
        if '/neg' in file_path and len(neg_data) < _limit:
            example = {}
            with open(file_path, 'r', encoding='utf-8') as myfile:
                example['text'] = myfile.read().replace('\n', '')
            example['label'] = 0
            neg_data.append(example)
            
        # Keep at most `_limit` positive reviews (label 1).
        if '/pos' in file_path and len(pos_data) < _limit:
            example = {}
            with open(file_path, 'r', encoding='utf-8') as myfile:
                example['text'] = myfile.read().replace('\n', '')
            example['label'] = 1
            pos_data.append(example)
    data = neg_data + pos_data

    return data

            
imdb_test = load_imdb_data(imdb_home)

In [5]:
imdb_test[0]

{'label': 0,
 'text': "Alan Rickman & Emma Thompson give good performances with southern/New Orleans accents in this detective flick. It's worth seeing for their scenes- and Rickman's scene with Hal Holbrook. These three actors mannage to entertain us no matter what the movie, it seems. The plot for the movie shows potential, but one gets the impression in watching the film that it was not pulled off as well as it could have been. The fact that it is cluttered by a rather uninteresting subplot and mostly uninteresting kidnappers really muddles things. The movie is worth a view- if for nothing more than entertaining performances by Rickman, Thompson, and Holbrook."}
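
Since the loader caps the number of reviews per class, a quick sanity check is to count the labels after loading. A minimal sketch, using a small made-up list in place of `imdb_test` so that it runs on its own; `label_counts` is a hypothetical helper, not part of the notebook.

```python
import collections

def label_counts(dataset):
    """Count how many examples carry each label."""
    return collections.Counter(example['label'] for example in dataset)

# Made-up stand-in for imdb_test, in the same {'text', 'label'} format.
toy_data = [
    {'text': 'Dull and overlong.', 'label': 0},
    {'text': 'A real treat.', 'label': 1},
    {'text': 'Great cast, great script.', 'label': 1},
]

counts = label_counts(toy_data)
print(counts[0], counts[1])
```

On the real data, `label_counts(imdb_test)` should show about the same number of examples for each class.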

Next, we build a function `feature_function()` that annotates datasets with feature vectors.

In [6]:
def feature_function(datasets):
    '''Annotates datasets with feature vectors.'''
                         
    # Extract vocabulary
    def tokenize(string):
        return string.split()
    
    word_counter = collections.Counter()
    for example in datasets[0]:
        word_counter.update(tokenize(example['text']))
    
    # The vocabulary comes from the first dataset (the training set) only.
    vocabulary = set(word_counter)

    feature_names = set()
    for i, dataset in enumerate(datasets):
        for example in dataset:
            example['features'] = collections.defaultdict(float)
        
            
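            # Extract features (by name) for one example. Despite the
            # "word_count_for_" prefix, each feature is a binary presence
            # indicator: min(count, 1).
            word_counter = collections.Counter(tokenize(example['text']))
            for word, count in word_counter.items():
                if word in vocabulary:
                    example["features"]["word_count_for_" + word] = min(count, 1)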
                    
            # Negation feature (uncomment the lines below to enable it):
            # if "n't" in example['text'] or "not" in example['text']:
            #     example["features"]["negation"] = 1
            # else:
            #     example["features"]["negation"] = 0
            
            feature_names.update(example['features'].keys())
                            
    # By now, we know what all the features will be, so we can
    # assign indices to them.
    feature_indices = dict(zip(feature_names, range(len(feature_names))))
    indices_to_features = {v: k for k, v in feature_indices.items()}
    dim = len(feature_indices)
                
    # Now we create actual vectors from those indices.
    for dataset in datasets:
        for example in dataset:
            example['vector'] = np.zeros((dim))
            for feature in example['features']:
                example['vector'][feature_indices[feature]] = example['features'][feature]
    return indices_to_features
    
indices_to_features = feature_function([training_set, dev_set, imdb_test])

In [7]:
indices_to_features

{0: 'word_count_for_boredom',
 1: 'word_count_for_peppered',
 2: 'word_count_for_spinoff',
 3: 'word_count_for_denouements',
 4: 'word_count_for_Branagh',
 5: 'word_count_for_teenager',
 6: 'word_count_for_fabulous',
 7: 'word_count_for_Demi',
 8: 'word_count_for_minutes',
 9: 'word_count_for_categorize',
 10: 'word_count_for_pacing',
 11: 'word_count_for_Bullock',
 12: 'word_count_for_exchanges',
 13: 'word_count_for_ghastly',
 14: 'word_count_for_Nightmare',
 15: 'word_count_for_heady',
 16: 'word_count_for_shattering',
 17: 'word_count_for_2000',
 18: 'word_count_for_thumbs',
 19: 'word_count_for_bonds',
 20: 'word_count_for_rarest',
 21: 'word_count_for_intent',
 22: 'word_count_for_stench',
 23: 'word_count_for_Songs',
 24: 'word_count_for_doorstep',
 25: 'word_count_for_together',
 26: 'word_count_for_changes',
 27: 'word_count_for_Sayles',
 28: 'word_count_for_sheer',
 29: 'word_count_for_juiced',
 30: 'word_count_for_makes',
 31: 'word_count_for_confidence',
 32: 'word_count_fo
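
To make the feature construction concrete, here is a stripped-down sketch of the same idea on two made-up sentences: the vocabulary comes from the training text only, each feature records whether a word is present, and words never seen in training are ignored. It uses plain lists rather than NumPy arrays, and `binary_vector` is a hypothetical helper, not part of the notebook.

```python
def tokenize(string):
    return string.split()

# Made-up data; the vocabulary is built from the training sentence only.
train_text = "a lovely film"
new_text = "a lovely lovely mess"

vocabulary = sorted(set(tokenize(train_text)))       # fixed feature order
feature_indices = {word: i for i, word in enumerate(vocabulary)}

def binary_vector(text):
    """Return a 0/1 vector marking which vocabulary words occur in text."""
    vector = [0.0] * len(feature_indices)
    for word in tokenize(text):
        if word in feature_indices:                  # unseen words are dropped
            vector[feature_indices[word]] = 1.0      # presence, not count
    return vector

print(vocabulary)
print(binary_vector(new_text))
```

The repeated word "lovely" still yields a 1, and "mess" contributes nothing because it never appeared in the training text.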

## A linear classifier: Logistic Regression

We can now build the classifier for this dataset. We’ll be using the LogisticRegression class from Scikit-learn.

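If scikit-learn is not installed yet, install it from a terminal (for example with `pip install scikit-learn`), then import the classifier: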

In [None]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()

To train the model on our dataset, we call scikit-learn's `fit` method. This is where the classifier learns, from the labeled training examples, how much weight to give each feature.

In [None]:
X_train = [x['vector'] for x in training_set]
y_train = [x['label'] for x in training_set]
log_model = log_model.fit(X=X_train, y=y_train)
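
Under the hood, a trained logistic regression model scores an example as a weighted sum of its features plus a bias, then passes that score through the sigmoid function to get a probability for the positive class. A minimal sketch of that computation, with made-up weights rather than ones learned by scikit-learn; `predict_proba_positive` is a hypothetical helper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_positive(weights, bias, vector):
    """P(label = 1 | vector) for a binary logistic regression model."""
    score = bias + sum(w * x for w, x in zip(weights, vector))
    return sigmoid(score)

# Hypothetical weights for the features ['boring', 'film', 'lovely'].
weights = [-2.0, 0.1, 1.5]
bias = 0.0

p_pos = predict_proba_positive(weights, bias, [0.0, 1.0, 1.0])  # "lovely film"
p_neg = predict_proba_positive(weights, bias, [1.0, 1.0, 0.0])  # "boring film"
print(p_pos > 0.5, p_neg > 0.5)
```

For a fitted scikit-learn model, the learned weights and bias are available as `log_model.coef_` and `log_model.intercept_`.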

Finally, we apply `log_model.predict()` to evaluate the trained model on the SST dev set:

In [None]:
X_dev = [x['vector'] for x in dev_set]
y_dev = [x['label'] for x in dev_set]

y_dev_pred = log_model.predict(X_dev)

## Accuracy

We can now look at how accurate our model was:

In [None]:
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions.
print(accuracy_score(y_dev, y_dev_pred))
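
Accuracy is simply the fraction of examples whose predicted label matches the gold label. A minimal sketch of that computation on made-up labels; `accuracy` is a hypothetical helper shown only to illustrate what the score means.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(1 for gold, pred in zip(y_true, y_pred) if gold == pred)
    return correct / len(y_true)

# Made-up labels: three of the four predictions are correct.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

This prints `0.75`.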

## Exercise 

* Tweak the features in `feature_function()` to improve accuracy. For example, add a feature that captures negation (a starting point is commented out in `feature_function()`; you can extend it with more conditions for detecting negation in text). You can also try adding other features, for example ones that detect words indicating positive or negative emotions.

    * Keep testing how your accuracy evolves as you tweak your model.
    * Let's now compare our results in a show of hands. Who did better than 80%?
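
One thing to watch for with the commented-out negation feature: `"not" in example['text']` is a substring test, so it also fires on words such as "nothing". A token-based check avoids that. The sketch below is one possible variant, not the intended solution; the word list and the `has_negation` helper are assumptions made for illustration.

```python
# Hypothetical list of negation cues; extend it as you see fit.
NEGATION_WORDS = {"not", "no", "never", "n't"}

def has_negation(text):
    """True if any whitespace-separated token is a negation cue."""
    for token in text.lower().split():
        # SST splits contractions ("did n't"); raw text does not ("didn't").
        if token in NEGATION_WORDS or token.endswith("n't"):
            return True
    return False

print("not" in "It is nothing special")        # substring test
print(has_negation("It is nothing special"))   # token test
print(has_negation("I did n't like it"))
```

Inside `feature_function()`, this could be used as `example["features"]["negation"] = 1 if has_negation(example['text']) else 0`.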
    
<br> 

* You might have done well after tweaking your features against the development set. But you run the risk that your model has overfit the data it was developed on, i.e., it is too specialized and only does well on that particular data. You can check whether this is the case on a new dataset, taken from IMDb:

In [None]:
X_test = [x['vector'] for x in imdb_test]
y_test = [x['label'] for x in imdb_test]

y_test_pred = log_model.predict(X_test)

# accuracy_score expects the true labels first, then the predictions.
print(accuracy_score(y_test, y_test_pred))

Did your model do as well?
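
If accuracy drops on IMDb, one likely contributor is vocabulary mismatch: `feature_function()` only keeps words seen in the SST training set, and it splits on whitespace, so a raw-text token like `flick.` (with its period attached) does not match the treebank token `flick`. A rough sketch for measuring this, using made-up sentences; `oov_rate` is a hypothetical helper, not part of the notebook.

```python
def oov_rate(train_texts, test_texts):
    """Fraction of whitespace tokens in test_texts unseen in train_texts."""
    vocabulary = set()
    for text in train_texts:
        vocabulary.update(text.split())
    test_tokens = [tok for text in test_texts for tok in text.split()]
    unseen = sum(1 for tok in test_tokens if tok not in vocabulary)
    return unseen / len(test_tokens)

# Made-up examples: tokenized treebank style vs. raw review style.
train_texts = ["a fun detective flick ."]
test_texts = ["a fun flick."]

print(oov_rate(train_texts, test_texts))
```

Here one of the three test tokens (`flick.`) is out of vocabulary. On the notebook's data this could be called as `oov_rate([x['text'] for x in training_set], [x['text'] for x in imdb_test])`.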