# Fall 2020: DS-GA 1011 NLP with Representation Learning
## Lab 6: 09-Oct-2020, Friday
## Sentiment analysis using Logistic Regression

In this lab, we'll walk through preparing a dataset, designing features, fitting a model on those features (sort of), and evaluating it on a held-out test set.

---
### Setup

First, let's load the Stanford Sentiment Treebank. Download [the train/dev/test Stanford Sentiment Treebank distribution](http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip), unzip it, and put the resulting folder in the same directory as this notebook. (If you want to put it somewhere else, change `sst_home` below.)

In [None]:
# Import required packages
import re
import os
import numpy as np
import collections

In [None]:
sst_home = 'trees'

def load_sst_data(path):
    # Let's do 2-way positive/negative classification instead of 5-way
    EASY_LABEL_MAP = {0:0, 1:0, 2:None, 3:1, 4:1}
    
    data = []
    with open(path) as f:
        for line in f: 
            example = {}
            example['label'] = EASY_LABEL_MAP[int(line[1])]
            if example['label'] is None:
                continue
            
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)

    return data

train = load_sst_data(sst_home + '/train.txt')
val = load_sst_data(sst_home + '/dev.txt')
test = load_sst_data(sst_home + '/test.txt')

In [None]:
train[0]

{'label': 1,
 'text': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal ."}
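
To see what the regular expression in `load_sst_data` does, here is a minimal sketch that applies the same label lookup and substitution to a single line. The helper name `parse_sst_line` and the toy tree are made up for illustration; real lines in `train.txt` are longer but have the same bracketed shape.

```python
import re

# Same 5-way to 2-way mapping as in load_sst_data; neutral (2) is dropped.
EASY_LABEL_MAP = {0: 0, 1: 0, 2: None, 3: 1, 4: 1}

def parse_sst_line(line):
    """Hypothetical helper: return (binary_label, text) for one PTB-style line."""
    # The sentence-level label is the digit right after the first '('.
    label = EASY_LABEL_MAP[int(line[1])]
    # Remove every '(<digit>' opener and every ')' closer, then drop the leading space.
    text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)[1:]
    return label, text

toy_line = "(3 (2 A) (3 (3 charming) (2 film)))\n"  # made-up example tree
print(parse_sst_line(toy_line))
```

For the toy line this gives label `1` and the text `A charming film`. A line whose first label is `2` comes back with label `None`, which is why the loader skips it.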

In [None]:
print(len(train), len(val), len(test))

6920 872 1821


---
### Extracting features

Now that we have the data, we need to build some sort of feature representation of our data. One of the simplest things we can do is to represent each sentence as a bag of its words. 

In [None]:
def tokenize(string):
    ''' Bare-bones tokenization '''
    return string.split() # simple tokenization

def extract_feats(datasets):
    '''Annotates datasets with feature vectors.'''
                         
    # Extract vocabulary
    word_counter = collections.Counter()
    for example in datasets[0]: # assume first dataset is training set
        word_counter.update(tokenize(example['text']))
    vocabulary = set(word_counter.keys())

    features = set()
    for i, dataset in enumerate(datasets):
        for example in dataset:
            example['features'] = collections.defaultdict(float)
            
            #Extract features (by name) for one example:
            word2count = collections.Counter(tokenize(example['text']))
            for word, count in word2count.items():
                if word in vocabulary:
                    example["features"][word] = min(count, 1) # BoW binary features
            
            features.update(example['features'].keys())
                            
    # By now, we know what all the features will be, so we can
    # assign indices to them.
    feat2idx = dict(zip(features, range(len(features))))
    idx2feat = {v: k for k, v in feat2idx.items()}
    dim = len(feat2idx)
                
    # Now we create actual vectors from those indices.
    for dataset in datasets:
        for example in dataset:
            example['input'] = np.zeros((dim))
            for feature in example['features']:
                example['input'][feat2idx[feature]] = example['features'][feature]
    return idx2feat
    
idx2feat = extract_feats([train, val, test]) # adds the features as a key in each example dict

In [None]:
len(idx2feat)

16282

In [None]:
for key in range(25):
    print(key, idx2feat[key])

0 remarkable
1 emptiness
2 unthinkable
3 remarkably
4 omission
5 Unambitious
6 because
7 Wildly
8 Armageddon
9 child
10 dictator-madman
11 Blind
12 giants
13 flatter
14 enigma
15 good-bad
16 sparked
17 duty
18 amoral
19 sucking
20 nursery
21 waif
22 Cusack
23 Oliver
24 upon


In [None]:
train[0]

{'label': 1,
 'text': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 'features': defaultdict(float,
             {'The': 1,
              'Rock': 1,
              'is': 1,
              'destined': 1,
              'to': 1,
              'be': 1,
              'the': 1,
              '21st': 1,
              'Century': 1,
              "'s": 1,
              'new': 1,
              '``': 1,
              'Conan': 1,
              "''": 1,
              'and': 1,
              'that': 1,
              'he': 1,
              'going': 1,
              'make': 1,
              'a': 1,
              'splash': 1,
              'even': 1,
              'greater': 1,
              'than': 1,
              'Arnold': 1,
              'Schwarzenegger': 1,
              ',': 1,
              'Jean-Claud': 1,
              'Van': 1,
              'Damme': 1,
   

In [None]:
train[0]['input'].shape

(16282,)

In [None]:
X_train = [x['input'] for x in train]
y_train = [y['label'] for y in train]
print(len(X_train), X_train[0].shape, len(y_train), y_train[0])
print(X_train[0].nonzero(), y_train[0])

6920 (16282,) 6920 1
(array([  435,  3055,  3469,  3569,  3792,  4426,  4785,  5169,  5870,
        7017,  7322,  7610,  8171,  8413,  8889,  8948,  9609, 10032,
       10129, 10146, 10890, 11240, 11651, 11686, 11847, 11895, 11954,
       13041, 13299, 13733, 14488, 14539, 15090, 15671]),) 1
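
The same idea on a scale small enough to read by eye: a minimal sketch of binary bag-of-words features over a toy corpus. The function names and the toy sentences are assumptions for illustration, and unlike `extract_feats` this version sorts the vocabulary before assigning indices.

```python
def build_vocab(train_texts):
    """Sorted list of whitespace tokens seen in the training texts."""
    return sorted({tok for text in train_texts for tok in text.split()})

def binary_bow(text, feat2idx):
    """0/1 vector marking which vocabulary words occur in `text`."""
    vec = [0] * len(feat2idx)
    for tok in text.split():
        if tok in feat2idx:  # words unseen in training are silently dropped
            vec[feat2idx[tok]] = 1
    return vec

train_texts = ["good good movie", "bad movie"]  # toy data
vocab = build_vocab(train_texts)
feat2idx = {word: i for i, word in enumerate(vocab)}

print(vocab)
print([binary_bow(t, feat2idx) for t in train_texts])
print(binary_bow("good plot", feat2idx))
```

Sorting makes the index assignment reproducible. In `extract_feats` the indices come from iterating over a set of strings, whose order can change between Python sessions, so the printed `idx2feat` listing and the nonzero positions may differ from run to run.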


#### Pre-packaged methods

cf.
> [scikit-learn](https://scikit-learn.org/stable/): Open-source machine learning library providing simple and efficient tools for predictive data analysis using Python. Built on NumPy, SciPy, and matplotlib.

> [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvectorizer#sklearn.feature_extraction.text.CountVectorizer) Transformer that converts a collection of text documents to a matrix of token counts

In [None]:
# Install scikit-learn
# !conda install scikit-learn
!conda list scikit-learn

# packages in environment at /opt/anaconda3:
#
# Name                    Version                   Build  Channel
scikit-learn              0.23.1           py38h603561c_0  


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

train_docs = [doc['text'] for doc in train]
vectorizer = CountVectorizer(lowercase=False, binary=True)
vectorizer.fit(train_docs)
vocab = vectorizer.get_feature_names()  # scikit-learn >= 1.0 provides get_feature_names_out() instead
print(len(vocab), vocab[:25])

15255 ['000', '10', '100', '101', '103', '105', '10th', '11', '110', '112', '12', '120', '127', '129', '12th', '13', '13th', '14', '140', '146', '15', '15th', '16', '163', '168']


In [None]:
X_train_sk = vectorizer.transform(train_docs)
y_train_sk = np.array([doc['label'] for doc in train])
print(X_train_sk.shape, y_train_sk.shape)
print(y_train_sk[0], type(X_train_sk), '\n', X_train_sk[0], )

(6920, 15255) (6920,)
1 <class 'scipy.sparse.csr.csr_matrix'> 
   (0, 70)	1
  (0, 298)	1
  (0, 711)	1
  (0, 789)	1
  (0, 851)	1
  (0, 949)	1
  (0, 1856)	1
  (0, 2897)	1
  (0, 3014)	1
  (0, 3048)	1
  (0, 3287)	1
  (0, 3426)	1
  (0, 3606)	1
  (0, 4232)	1
  (0, 4658)	1
  (0, 6442)	1
  (0, 7286)	1
  (0, 8213)	1
  (0, 8299)	1
  (0, 8499)	1
  (0, 9284)	1
  (0, 9863)	1
  (0, 10458)	1
  (0, 10693)	1
  (0, 13208)	1
  (0, 13939)	1
  (0, 13943)	1
  (0, 13944)	1
  (0, 14099)	1


---
### Building a Model: Logistic Regression

Let's build a classifier for this dataset. Because we haven't talked about regularization yet, we'll use scikit-learn's `LogisticRegression` class with its out-of-the-box (non-SGD) solver and no regularization penalty.

In [None]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression(penalty='none')  # no regularization; scikit-learn >= 1.2 spells this penalty=None

cf.
> [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression) An estimator for classification using Logistic Regression

In order to learn the "best" parameters for our model based on the training data, we use scikit-learn's `fit` method. Inside this method, the parameters are updated to minimize some loss function (see slides).

In [None]:
log_model.fit(X=X_train, y=y_train)

LogisticRegression(penalty='none')
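
To make "updating parameters to minimize a loss" concrete, here is a minimal sketch of logistic regression trained with plain full-batch gradient descent on the mean log loss. This is not the solver scikit-learn runs inside `fit`; the function names, toy data, learning rate, and step count are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """P(label = 1 | x) under the current parameters."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mean_log_loss(weights, bias, X, y):
    """Average negative log-likelihood of the gold labels."""
    losses = []
    for x, label in zip(X, y):
        p = predict_proba(weights, bias, x)
        losses.append(-(label * math.log(p) + (1 - label) * math.log(1 - p)))
    return sum(losses) / len(losses)

def gradient_step(weights, bias, X, y, lr=0.5):
    """One gradient descent update; the gradient of the log loss is (p - y) * x."""
    n = len(X)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, label in zip(X, y):
        error = predict_proba(weights, bias, x) - label
        for j, xi in enumerate(x):
            grad_w[j] += error * xi / n
        grad_b += error / n
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - lr * grad_b

# Toy binary features: feature 0 co-occurs with label 1, feature 1 with label 0.
X_toy = [[1, 0], [0, 1], [1, 1]]
y_toy = [1, 0, 1]

weights, bias = [0.0, 0.0], 0.0
loss_before = mean_log_loss(weights, bias, X_toy, y_toy)
for _ in range(200):
    weights, bias = gradient_step(weights, bias, X_toy, y_toy)
loss_after = mean_log_loss(weights, bias, X_toy, y_toy)

print("loss before: %.3f, after: %.3f" % (loss_before, loss_after))
```

With all parameters at zero every example gets probability 0.5, so the starting loss is log 2. On linearly separable data like this toy set, and with no penalty term, the loss keeps shrinking as the weights grow, so nothing stops the weights from becoming very large.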

We now have a trained sentiment analysis model! Let's use it to make predictions.

In [None]:
y_preds = log_model.predict(X_train)
print('Review: ', train[0]['text'], '\n\nLabel: ', train[0]['label'], ' Prediction: ', y_preds[0])

Review:  The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal . 

Label:  1  Prediction:  1


---
### Evaluating a Model and Extensions

How well does our model do? Let's define a function to see our model's accuracy on some data split and see how well we fit the training data. We'll make use of the `model.predict()` interface for generating predictions.

In [None]:
from sklearn.metrics import accuracy_score

def evaluate(inputs, targs, model):
    preds = model.predict(inputs)
    return accuracy_score(targs, preds)  # signature is accuracy_score(y_true, y_pred)

cf.
> [`accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html?highlight=accuracy%20score#sklearn.metrics.accuracy_score) Classification score

In [None]:
X_train = [x['input'] for x in train]
y_train = [y['label'] for y in train]
train_acc = evaluate(X_train, y_train, log_model)
print("Train acc: %.3f" % (100 * train_acc))

Train acc: 100.000


Nice, 100% accuracy. How well do we do on held-out data?

In [None]:
X_dev = [x['input'] for x in val]
y_dev = [y['label'] for y in val]
dev_acc = evaluate(X_dev, y_dev, log_model)
print("Dev acc: %.3f" % (100 * dev_acc)) # log_model.score(X_dev, y_dev)

Dev acc: 76.491
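
Accuracy here is just the fraction of examples whose predicted label matches the gold label. A minimal sketch with a hand-written `accuracy` function (a hypothetical stand-in for `accuracy_score`, with made-up labels) shows the arithmetic:

```python
def accuracy(targets, predictions):
    """Fraction of positions where the prediction equals the target."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    correct = sum(t == p for t, p in zip(targets, predictions))
    return correct / len(targets)

toy_targets = [1, 0, 1, 1]      # made-up gold labels
toy_predictions = [1, 0, 0, 1]  # made-up model outputs, one mistake

print("Toy acc: %.3f" % (100 * accuracy(toy_targets, toy_predictions)))
```

By the same arithmetic, the dev accuracy printed above is consistent with 667 correct predictions out of the 872 dev examples (667 / 872 ≈ 0.76491). The count 667 is inferred from the printed figure, not read from the notebook.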


We see a big drop of about 23.5 percentage points in accuracy on held-out data, so we have overfit the training data. We can go back and revise our approach (e.g., by playing around with the different parameters for the [`LogisticRegression` classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)), re-fit on the training data, and then see how well we do on the held-out validation data.

By doing this, however, we'll be fitting to the validation data. At some point, we'll want to evaluate on completely new data, which is what the test split is for. The test split should be used as sparingly as possible!

In [None]:
X_test = [x['input'] for x in test]
y_test = [y['label'] for y in test]
test_acc = evaluate(X_test, y_test, log_model)
print("Test acc: %.3f" % (100 * test_acc)) # log_model.score(X_test, y_test)

Test acc: 78.309
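
The train/validation/test discipline described above can be sketched as a small selection loop: compare candidate models on the validation split, pick one, and only then touch the test split once. Everything here is hypothetical: the candidates are toy threshold classifiers assumed to be already fit, and the data are made-up `(score, label)` pairs.

```python
def split_accuracy(model, examples):
    """Fraction of (x, label) pairs that `model` labels correctly."""
    return sum(model(x) == label for x, label in examples) / len(examples)

def select_then_test(candidates, val_examples, test_examples):
    """Pick the candidate with the best validation accuracy, then test it once."""
    val_scores = {name: split_accuracy(model, val_examples)
                  for name, model in candidates.items()}
    best_name = max(val_scores, key=val_scores.get)
    test_score = split_accuracy(candidates[best_name], test_examples)
    return best_name, val_scores, test_score

# Toy candidates: predict 1 when a 1-D score reaches the threshold.
candidates = {
    "threshold=0.3": lambda x: int(x >= 0.3),
    "threshold=0.5": lambda x: int(x >= 0.5),
    "threshold=0.7": lambda x: int(x >= 0.7),
}
val_examples = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
test_examples = [(0.1, 0), (0.45, 1), (0.55, 1), (0.8, 1)]

best_name, val_scores, test_score = select_then_test(candidates, val_examples, test_examples)
print(best_name, val_scores[best_name], test_score)
```

In this toy run the selected candidate is perfect on validation but gets one test example wrong, which is the usual pattern: a score that was used for selection is an optimistic estimate.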


We've been evaluating on data drawn from roughly the same distribution. How do our models fare if we move out-of-distribution? We'll use IMDb movie reviews as an additional test set. Download the data [here](http://ai.stanford.edu/~amaas/data/sentiment/), unzip it, and put the resulting folder in the same directory as this notebook.

The following function converts it into the same format as our SST data.

In [None]:
imdb_home = 'aclImdb/test/'

def load_imdb_data(path):
    
    pos_data, neg_data = [], []
    all_files = []
    _limit = 250  # note: the '<=' checks below keep _limit + 1 = 251 files per class
    
    for dirpath, dirnames, files in os.walk(path):
        for name in files:
            all_files.append(os.path.join(dirpath, name))
            
            
    for file_path in all_files:
        if '/neg' in file_path and len(neg_data) <= _limit:
            example = {}
            with open(file_path, 'r') as myfile:
                example['text'] = myfile.read().replace('\n', '')
            example['label'] = 0
            neg_data.append(example)
            
        if '/pos' in file_path and len(pos_data) <= _limit:
            example = {}
            with open(file_path, 'r') as myfile:
                example['text'] = myfile.read().replace('\n', '')
            example['label'] = 1
            pos_data.append(example)
    data = neg_data + pos_data

    return data

            
imdb_test = load_imdb_data(imdb_home)
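# Re-running extract_feats rebuilds the feature index and every example's 'input' vector.
# log_model was fit with the earlier index, so this is only valid if both calls assign the same indices.
idx2feat = extract_feats([train, imdb_test]) # adds the features as a key in each example dict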

In [None]:
X_test_imdb = [x['input'] for x in imdb_test]
y_test_imdb = [y['label'] for y in imdb_test]
test_acc = evaluate(X_test_imdb, y_test_imdb, log_model)
print("IMDb Test acc: %.3f" % (100 * test_acc))

IMDb Test acc: 75.299
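
One thing worth checking when moving out-of-distribution is how much of the new text the feature extractor can see at all. SST text is already tokenized (punctuation is separated by spaces), while raw reviews are not, so whitespace splitting may leave punctuation or HTML line breaks glued to words. The sketch below measures the share of tokens missing from the training vocabulary; `oov_rate` and the toy sentences are assumptions for illustration, not part of the lab code.

```python
def oov_rate(train_texts, new_texts):
    """Share of whitespace tokens in new_texts that never occur in train_texts."""
    vocab = {tok for text in train_texts for tok in text.split()}
    tokens = [tok for text in new_texts for tok in text.split()]
    if not tokens:
        return 0.0, []
    unseen = [tok for tok in tokens if tok not in vocab]
    return len(unseen) / len(tokens), sorted(set(unseen))

toy_train = ["a great movie .", "not a good plot ."]  # pre-tokenized, SST style
toy_new = ["A great movie.<br />Not good."]           # raw, review style

rate, unseen = oov_rate(toy_train, toy_new)
print(rate, unseen)
```

In the toy case only `great` survives: the other tokens differ from training words by case, attached punctuation, or markup, so the classifier would never see them.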


In [None]:
coefficients = log_model.coef_[0]
print(coefficients.shape)
indices = np.argsort(coefficients) 
# Most negatively weighted
print("\nWords associated with negative sentiment")
for i in indices[:10]:
    print(idx2feat[i], coefficients[i])
print()
# Most positively weighted
print("Words associated with positive sentiment")
for i in indices[-10:]:
    print(idx2feat[i], coefficients[i])

(16282,)

Words associated with negative sentiment
stupid -139.38785286072397
mess -136.4392950346883
depressing -117.72992560143074
suffers -106.96891312120053
flat -106.71797240294553
worst -105.64415172590596
none -98.06556852228475
failure -96.80920673401636
TV -96.58653776541415
lacking -95.90210240367811

Words associated with positive sentiment
rare 92.33004804385504
half-bad 94.6275338547006
charming 99.73731230769108
refreshing 100.57417260805157
hilarious 102.20572263273924
powerful 113.20283492022443
enjoyable 117.0064118323224
remarkable 122.45276787104888
appealing 123.15363706161516
solid 136.937091529792


#### SGD (Stochastic Gradient Descent) Classifier

cf.
> [`SGDClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier) Linear classifier trained with stochastic gradient descent; with `loss='log'` it is a logistic regression model

In [None]:
from sklearn.linear_model import SGDClassifier
log_model_sk = SGDClassifier(loss='log', penalty='none')  # scikit-learn >= 1.1 names this loss 'log_loss'

log_model_sk.fit(X=X_train_sk, y=y_train_sk)

print("Train acc: %.3f" % (100 * log_model_sk.score(X_train_sk, y_train_sk)))

Train acc: 99.436
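
What makes the training "stochastic" is that the parameters are updated after each individual example, visited in shuffled order, rather than after a full pass over the data. Here is a minimal sketch with a constant learning rate. `SGDClassifier` has its own learning-rate schedule and stopping rules that are not reproduced here, and the function names, toy data, and hyperparameters are assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(X, y, epochs=50, lr=0.1, seed=0):
    """Logistic regression fit by per-example gradient updates on the log loss."""
    rng = random.Random(seed)
    weights = [0.0] * len(X[0])
    bias = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for i in order:
            p = sigmoid(sum(w * xi for w, xi in zip(weights, X[i])) + bias)
            error = p - y[i]  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, X[i])]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, x):
    """Hard 0/1 prediction: 1 when the logit is non-negative."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias >= 0)

# Toy binary features: feature 0 only occurs with label 1, feature 1 only with label 0.
X_toy = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0]]
y_toy = [1, 0, 1, 0]

weights, bias = sgd_logistic(X_toy, y_toy)
print(weights, bias)
print([predict(weights, bias, x) for x in X_toy])
```

Because each update uses a single example, results depend on the visiting order. That is one reason repeated `SGDClassifier` runs can report slightly different accuracies unless `random_state` is fixed.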


In [None]:
val_docs = [doc['text'] for doc in val]
X_dev_sk = vectorizer.transform(val_docs)
y_dev_sk = np.array([doc['label'] for doc in val])
print("Dev acc: %.3f" % (100 * log_model_sk.score(X_dev_sk, y_dev_sk)))

Dev acc: 75.000


In [None]:
test_docs = [doc['text'] for doc in test]
X_test_sk = vectorizer.transform(test_docs)
y_test_sk = np.array([doc['label'] for doc in test])
print("Test acc: %.3f" % (100 * log_model_sk.score(X_test_sk, y_test_sk)))

Test acc: 76.661


In [None]:
imdb_test_docs = [doc['text'] for doc in imdb_test]
X_imdb_test_sk = vectorizer.transform(imdb_test_docs)
y_imdb_test_sk = np.array([doc['label'] for doc in imdb_test])
print("IMDb Test acc: %.3f" % (100 * log_model_sk.score(X_imdb_test_sk, y_imdb_test_sk)))

IMDb Test acc: 75.896


---
### More features

In [None]:
def tokenize(string):
    ''' Bare-bones tokenization '''
    return string.split() # simple tokenization

def new_extract_feats(datasets):
    '''Annotates datasets with feature vectors.'''
                         
    # Extract vocabulary
    word_counter = collections.Counter()
    bigram_counter = collections.Counter()
    for example in datasets[0]: # assume first dataset is training set
        tokens = tokenize(example['text'])
        word_counter.update(tokens)
        bigram_counter.update(zip(tokens, tokens[1:]))
    vocabulary = set([k for k, v in word_counter.most_common(10000)])
    bigram_vocab = set([k for k, v in bigram_counter.most_common(5000)])

    features = set()
    for i, dataset in enumerate(datasets):
        for example in dataset:
            example['features'] = collections.defaultdict(float)
            tokens = tokenize(example['text'])
            #Extract features (by name) for one example:
            word2count = collections.Counter(tokens)
            bigrams = collections.Counter(zip(tokens, tokens[1:]))
            for word, count in word2count.items():
                if word in vocabulary:
                    example["features"][word] = min(1, count)
                #else:
                #    example["features"]["FEAT_UNK"] = 1
                if word in ["n't", "bad", "awful", "terrible"]:
                    example["features"]["FEAT_negative"] = 1
                if word in ["great", "fantastic", "excellent", "superb", "awesome"]:
                    example["features"]["FEAT_positive"] = 1
                #example["features"]["FEAT_length"] = len(tokenize(example['text'])) / 5
            for bigram in bigrams:
                
                if bigram in bigram_vocab:
                    example["features"]["%s_%s" % (bigram[0], bigram[1])] = 1
                    
            features.update(example['features'].keys())
                            
    # By now, we know what all the features will be, so we can
    # assign indices to them.
    feat2idx = dict(zip(features, range(len(features))))
    idx2feat = {v: k for k, v in feat2idx.items()}
    dim = len(feat2idx)
                
    # Now we create actual vectors from those indices.
    for dataset in datasets:
        for example in dataset:
            example['input'] = np.zeros((dim))
            for feature in example['features']:
                example['input'][feat2idx[feature]] = example['features'][feature]
    return idx2feat, word_counter
    
idx2feat, word_counter = new_extract_feats([train, val, test, imdb_test]) # adds the features as a key in each example dict

In [None]:
tmp = tokenize(train[0]['text'])
for pair in zip(tmp, tmp[1:]):
    print(pair)

('The', 'Rock')
('Rock', 'is')
('is', 'destined')
('destined', 'to')
('to', 'be')
('be', 'the')
('the', '21st')
('21st', 'Century')
('Century', "'s")
("'s", 'new')
('new', '``')
('``', 'Conan')
('Conan', "''")
("''", 'and')
('and', 'that')
('that', 'he')
('he', "'s")
("'s", 'going')
('going', 'to')
('to', 'make')
('make', 'a')
('a', 'splash')
('splash', 'even')
('even', 'greater')
('greater', 'than')
('than', 'Arnold')
('Arnold', 'Schwarzenegger')
('Schwarzenegger', ',')
(',', 'Jean-Claud')
('Jean-Claud', 'Van')
('Van', 'Damme')
('Damme', 'or')
('or', 'Steven')
('Steven', 'Segal')
('Segal', '.')
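
Pairs like these become features through the same two steps `new_extract_feats` uses: count bigrams on the training set, keep the most frequent ones, and name each kept bigram by joining its words with an underscore. A minimal sketch on a toy corpus follows; the helper names and sentences are made up, and the vocabulary is cut to a single bigram to keep the output readable.

```python
import collections

def top_bigrams(texts, k):
    """The k most frequent adjacent token pairs across `texts`."""
    counter = collections.Counter()
    for text in texts:
        tokens = text.split()
        counter.update(zip(tokens, tokens[1:]))
    return {bigram for bigram, _ in counter.most_common(k)}

def bigram_feature_names(text, bigram_vocab):
    """Names of the in-vocabulary bigram features that fire for `text`."""
    tokens = text.split()
    return sorted("%s_%s" % pair
                  for pair in zip(tokens, tokens[1:])
                  if pair in bigram_vocab)

toy_train = ["not good at all", "not good enough", "very good"]
bigram_vocab = top_bigrams(toy_train, k=1)

print(bigram_vocab)
print(bigram_feature_names("this is not good", bigram_vocab))
print(bigram_feature_names("good , not bad", bigram_vocab))
```

The feature `not_good` fires only when the two words are adjacent and in that order, which is information a bag of single words cannot express.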


In [None]:
model = SGDClassifier(loss='log', penalty='none')

In [None]:
X_train = [x['input'] for x in train]
y_train = [y['label'] for y in train]
model = model.fit(X=X_train, y=y_train)

In [None]:
X_train = [x['input'] for x in train]
y_train = [y['label'] for y in train]
train_acc = evaluate(X_train, y_train, model)
print("Train acc: %.3f" % (100 * train_acc))

Train acc: 99.595


In [None]:
X_dev = [x['input'] for x in val]
y_dev = [y['label'] for y in val]
dev_acc = evaluate(X_dev, y_dev, model)
print("Dev acc: %.3f" % (100 * dev_acc))

Dev acc: 70.986


In [None]:
X_test = [x['input'] for x in test]
y_test = [y['label'] for y in test]
test_acc = evaluate(X_test, y_test, model)
print("Test acc: %.3f" % (100 * test_acc))

Test acc: 75.783


In [None]:
X_test = [x['input'] for x in imdb_test]
y_test = [y['label'] for y in imdb_test]
test_acc = evaluate(X_test, y_test, model)
print("IMDB test acc: %.3f" % (100 * test_acc))

IMDB test acc: 75.498


---
## References
DS-GA 1012 Natural Language Understanding Spring 2019