# Sentiment Analysis with N-Gram LM

We are going to work with the [IMDB dataset](https://ai.stanford.edu/~amaas/data/sentiment/).

Maas et al. (2011). "Learning Word Vectors for Sentiment Analysis."

This is a collection of user-generated movie reviews, each labelled as POSITIVE or NEGATIVE.

# Download and Prepare Data

In [None]:
import requests

r = requests.get('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz')

r.raise_for_status()  # stop early if the download failed

with open('imdb.tar.gz', 'wb') as out:
    out.write(r.content)



In [None]:
import tarfile
import re

from tqdm.notebook import tqdm

data = []
filename = re.compile(r'aclImdb/(?P<split>train|test)/(?P<label>neg|pos)/(?P<id>[0-9_]+)\.txt$')

with tarfile.open('imdb.tar.gz', 'r:gz') as tgz:
    for f in tqdm(tgz.getmembers()):
        m = filename.match(f.name)
        if f.isfile() and m is not None:
            data.append({
                'id': m['id'],
                'split': m['split'],
                'text': tgz.extractfile(f).read().decode('utf-8'),
                'label': m['label']
            })
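
To see what the `filename` pattern captures, the sketch below applies the same regular expression to a few member paths. The example paths are illustrative assumptions about the archive layout, not a listing of its actual contents.

```python
import re

# Same pattern as in the cell above.
filename = re.compile(r'aclImdb/(?P<split>train|test)/(?P<label>neg|pos)/(?P<id>[0-9_]+)\.txt$')

# Hypothetical member names, chosen only to illustrate the matching rules.
candidates = [
    'aclImdb/train/pos/4603_10.txt',   # labelled training review
    'aclImdb/test/neg/12035_1.txt',    # labelled test review
    'aclImdb/train/unsup/0_0.txt',     # "unsup" is not one of neg|pos, so no match
    'aclImdb/README',                  # not a review file, so no match
]

parsed = []
for name in candidates:
    m = filename.match(name)
    if m is not None:
        # groupdict() returns the named groups: split, label and id
        parsed.append(m.groupdict())

for row in parsed:
    print(row)
```

Only paths with a `train` or `test` split and a `neg` or `pos` label are kept; everything else is skipped by the `m is not None` check.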





In [None]:
import pandas as pd

df = pd.DataFrame(data)
df.sample(10)

| | id | split | text | label |
|---|---|---|---|---|
| 41984 | 4603_10 | train | There is one detail, which is not very common ... | pos |
| 12156 | 12035_1 | test | This movie made me so angry!! Here I am thinki... | neg |
| 3165 | 3106_1 | test | Yeah, that's right. If I were to ask my friend... | neg |
| 26107 | 1068_1 | train | My wife and I just finished this movie and I c... | neg |
| 24632 | 12059_10 | test | I haven't seen this funny of a show on fox in ... | pos |
| 16974 | 4357_10 | test | With documentary films, the question of realis... | pos |
| 11701 | 11722_4 | test | I saw this film for one reason: the tagline is... | neg |
| 41309 | 3742_9 | train | I thoroughly enjoyed this movie, but it is not... | pos |
| 37684 | 199_10 | train | I've seen hundreds of silent movies. Some will... | pos |
| 26728 | 1727_3 | train | We have a lake. We have an animated meteor cra... | neg |


In [None]:
train = df[df['split'] == 'train']

test = df[df['split'] == 'test']
X_test = test['text']
y_test = test['label']

# N-Gram LM

We will use the naive classifier seen in class (see the [Notebook](https://colab.research.google.com/drive/1H9kWUGnI-LUVPvx0nfNvGdRWGeFtKdGT?usp=sharing)).

In [None]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk import ngrams
from nltk.tokenize import word_tokenize

Use the generic `ngrams` function rather than `bigrams` and `trigrams`, since it takes the order `n` as a parameter.

In [None]:
list(ngrams(word_tokenize('I love Tokyo, it is so great to be there.'), n=4, pad_left=True, pad_right=True))

[(None, None, None, 'I'),
 (None, None, 'I', 'love'),
 (None, 'I', 'love', 'Tokyo'),
 ('I', 'love', 'Tokyo', ','),
 ('love', 'Tokyo', ',', 'it'),
 ('Tokyo', ',', 'it', 'is'),
 (',', 'it', 'is', 'so'),
 ('it', 'is', 'so', 'great'),
 ('is', 'so', 'great', 'to'),
 ('so', 'great', 'to', 'be'),
 ('great', 'to', 'be', 'there'),
 ('to', 'be', 'there', '.'),
 ('be', 'there', '.', None),
 ('there', '.', None, None),
 ('.', None, None, None)]

In [None]:
def tokenizer(text):
    # Split the text into tokens, then lowercase each one
    return [token.lower() for token in word_tokenize(text)]

In [None]:
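import math
import random

from collections import defaultdict
from textwrap import fill

from termcolor import colored  # third-party package, used by corpus_sample
from tqdm.notebook import tqdm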

class NGramLM:
    """
    Helper class to create and manipulate an N-Gram LM
    """
    def __init__(self, label: str, ngram: int):
        self.label = label
        self.ngram = ngram

    def fit(self, corpus):
        """
        Assumes corpus is a list of strings, e.g. ['I love tea', 'NYC rules']
        """
        self.documents = corpus
        
        model = defaultdict(lambda: defaultdict(lambda: 1e-6))
 
        for document in tqdm(self.documents, desc='Create Dictionary'):
            # tokenizer() lowercases every token
            tokens = tokenizer(document)
            for ngram in ngrams(tokens, n=self.ngram, pad_right=True, pad_left=True):
                model[ngram[:-1]][ngram[-1]] += 1

        # Let's transform the counts to probabilities
        for nminus1_gram in tqdm(model, desc='Update Probabilities'):
            total_count = float(sum(model[nminus1_gram].values()))
            for w in model[nminus1_gram]:
                model[nminus1_gram][w] /= total_count

        self.model = model
    
    def corpus_sample(self):
        # Each document is a plain string, so slices are counted in characters
        sample = random.choice(self.documents)

        print(f'{colored(self.label, attrs=["bold"])}')
        print('*' * 80)
        print(f'{colored("Original Data:", color="blue", attrs=["bold"])}\n{sample[:80]} (...)\n(...) {sample[-80:]}')
        print('*' * 80)
        print(f'{colored("Full Text:", color="blue", attrs=["bold"])}')
        print(fill(sample, width=80))
        print(flush=True)

    def compute_log_proba(self, txt):
        prob = 0.0
        for p in ngrams(tokenizer(txt), n=self.ngram, pad_right=True, pad_left=True):
            prob += math.log(self.model[p[:-1]][p[-1]])
        return prob
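
The two loops in `fit` can be traced by hand on a tiny corpus. The sketch below repeats the same counting and normalisation for a bigram model; the two toy sentences, the whitespace tokenisation, and the `padded_ngrams` helper are assumptions made to keep the example self-contained (the notebook itself uses `word_tokenize` and `nltk.ngrams`).

```python
import math
from collections import defaultdict

def padded_ngrams(tokens, n):
    """Yield n-grams padded with None on both sides (hypothetical stand-in for nltk.ngrams)."""
    padded = [None] * (n - 1) + list(tokens) + [None] * (n - 1)
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])

corpus = ['i love tea', 'i love coffee']  # toy data
n = 2

# Every count starts at 1e-6, as in NGramLM.fit.
model = defaultdict(lambda: defaultdict(lambda: 1e-6))
for document in corpus:
    for gram in padded_ngrams(document.split(), n):
        # Context = first n-1 tokens, next word = last token
        model[gram[:-1]][gram[-1]] += 1

# Normalise the counts of each context so that they sum to 1.
for context in model:
    total = sum(model[context].values())
    for word in model[context]:
        model[context][word] /= total

print(dict(model[('love',)]))  # 'tea' and 'coffee' each get 0.5
print(dict(model[('i',)]))     # 'love' gets 1.0

# An unseen continuation falls back to the raw default 1e-6.
print(math.log(model[('love',)]['beer']))
```

Because an unseen continuation keeps the raw value `1e-6` and is never renormalised, the default acts as a simple probability floor rather than a full smoothing scheme. It keeps `math.log` from failing on a zero probability in `compute_log_proba`.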

In [None]:
import numpy as np

class NaiveClassifier:
    """
    Based on multiple N-Gram models.
    """
    def __init__(self, models):
        self.models = models

    def predict(self, x):
        predictions = []
        for sample in tqdm(x):
            probs = np.array([m.compute_log_proba(sample) for m in self.models])
            prediction = self.models[np.argmax(probs)].label
            predictions.append(prediction)
        return predictions
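
`predict` scores each review with every language model and returns the label of the model that gives the highest log-probability. The sketch below shows that decision rule in isolation; `StubLM` and its fixed scores are hypothetical stand-ins for fitted `NGramLM` objects, and plain `max` replaces `np.argmax`.

```python
class StubLM:
    """Hypothetical stand-in for a fitted NGramLM: returns a fixed log-probability per text."""
    def __init__(self, label, scores):
        self.label = label
        self.scores = scores

    def compute_log_proba(self, txt):
        return self.scores[txt]

# Made-up log-probabilities, for illustration only.
pos = StubLM('pos', {'great film': -12.0, 'awful film': -20.0})
neg = StubLM('neg', {'great film': -18.5, 'awful film': -11.0})
models = [pos, neg]

def predict_one(models, text):
    # Pick the model under which the text is most probable.
    best = max(models, key=lambda m: m.compute_log_proba(text))
    return best.label

predictions = [predict_one(models, t) for t in ['great film', 'awful film']]
print(predictions)
```

The rule compares the likelihoods directly, so it implicitly gives every class the same prior probability.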

## TODO - N-GRAM LM / Classification

* Use the class `NGramLM` to create a **3-Gram LM** based on the **POSITIVE** reviews of the **TRAIN** dataset.
* POSITIVE means that `dataframe['label'] == 'pos'`

In [None]:
pos_lm = NGramLM(
    label='pos',
    ngram=...,  # TODO
)

pos_lm.fit(...)  # TODO


Show the probability distributions:
* The first word in the sentence
* Next word after `cinema was`
* Next word after `what the`

* Show the top 10 words after `what the`

In [None]:
probs = pos_lm.model['what', 'the'].items()

# probs is a view of (word, probability) tuples,
# e.g. ('hello', 0.0012), ('cat', 0.0000001), ...
for x in sorted(probs, key=lambda x: x[1], reverse=True)[:10]:
    print(x)
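
The keys of `model` are tuples of the `N-1` preceding tokens, so `model['what', 'the']` is the same lookup as `model[('what', 'the')]`. Since the n-grams are padded with `None`, the words that start a review are stored under a context made only of `None` values. The sketch below illustrates both points on a hand-written dictionary; `toy_model` and its probabilities are made up and do not come from the IMDB data, and `top_k` is a hypothetical helper.

```python
# Hand-written stand-in for the `model` attribute of a fitted 3-gram NGramLM (made-up values).
toy_model = {
    (None, None): {'the': 0.5, 'i': 0.3, 'this': 0.2},
    ('what', 'the'): {'movie': 0.4, 'hell': 0.35, 'director': 0.25},
}

def top_k(distribution, k):
    """Return the k most probable (word, probability) pairs, highest first."""
    return sorted(distribution.items(), key=lambda item: item[1], reverse=True)[:k]

# Distribution of the first word: the context is two padding symbols.
first_words = top_k(toy_model[(None, None)], k=2)

# Distribution of the word following "what the".
after_what_the = top_k(toy_model['what', 'the'], k=2)

print(first_words)
print(after_what_the)
```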

Repeat the operations on the NEGATIVE reviews:
* Use the class `NGramLM` to create a **3-Gram LM** based on the **NEGATIVE** reviews of the **TRAIN** dataset.
* NEGATIVE means that `dataframe['label'] == 'neg'`

In [None]:
neg_lm = NGramLM(
    label='neg',
    ngram=...,  # TODO
)

neg_lm.fit(...)  # TODO


Show the probability distributions:
* The first word in the sentence
* Next word after `cinema was`
* Next word after `what the`

* Show the top 10 words after `what the`

In [None]:
probs = neg_lm.model['what', 'the'].items()

# probs is a view of (word, probability) tuples,
# e.g. ('hello', 0.0012), ('cat', 0.0000001), ...

Do the Classification:
* Create the NaiveClassifier
* Predict the classification for the **TEST** dataset

In [None]:
clf = NaiveClassifier(...)  # TODO

In [None]:
y_pred = clf.predict(...)  # TODO

Produce the classification report

In [None]:
from sklearn.metrics import classification_report

print(classification_report(...))  # TODO
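
For each label, the report lists precision (the share of predictions of that label that are correct) and recall (the share of true instances of that label that are found). The sketch below computes both by hand on six made-up labels; `precision_recall` is a hypothetical helper, not part of scikit-learn.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of `label`, computed from paired true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels, for illustration only.
y_true_toy = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']
y_pred_toy = ['pos', 'pos', 'pos', 'neg', 'pos', 'pos']

for label in ('pos', 'neg'):
    print(label, precision_recall(y_true_toy, y_pred_toy, label))
```

In this toy case the classifier over-predicts `pos`: every positive review is found (recall 1.0), but only 3 of the 5 `pos` predictions are correct (precision 0.6).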

## TODO - GridSearch (Additional)

Based on the code above:
* Try the classification with different N (as in 3-Gram LM, 4-Gram LM, ...) (`N in [1, 2, 3, 4]` for example)
* Plot the RECALL / PRECISION for POSITIVE / NEGATIVE reviews (4 plot lines)

(hint: `classification_report(output_dict=True)` will return the data instead of generating a nice display)
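
With `output_dict=True`, the report is a nested dictionary indexed first by label and then by metric name (`'precision'`, `'recall'`, `'f1-score'`, `'support'`). One way to prepare the four plot lines is to collect one such dictionary per value of N and regroup the numbers per curve. In the sketch below, `reports` holds placeholder numbers rather than real results, and `collect_curves` is a hypothetical helper.

```python
# Placeholder report dictionaries keyed by N; the numbers are NOT real results.
# Each value mimics part of the shape of classification_report(..., output_dict=True).
reports = {
    1: {'pos': {'precision': 0.1, 'recall': 0.2}, 'neg': {'precision': 0.3, 'recall': 0.4}},
    2: {'pos': {'precision': 0.5, 'recall': 0.6}, 'neg': {'precision': 0.7, 'recall': 0.8}},
}

def collect_curves(reports, labels=('pos', 'neg'), metrics=('precision', 'recall')):
    """Regroup per-N reports into one list of values per (label, metric) curve."""
    ns = sorted(reports)
    curves = {
        (label, metric): [reports[n][label][metric] for n in ns]
        for label in labels
        for metric in metrics
    }
    return ns, curves

ns, curves = collect_curves(reports)
for (label, metric), values in curves.items():
    print(label, metric, list(zip(ns, values)))
```

Each entry of `curves` can then be drawn against `ns`, for example with `matplotlib.pyplot.plot(ns, values, label=f'{label} {metric}')`.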