# Sentiment Analysis with Word2Vec

We are going to work with the [IMDB dataset](https://ai.stanford.edu/~amaas/data/sentiment/).

Maas et al. (2011). "Learning Word Vectors for Sentiment Analysis"

This is a collection of user-generated movie reviews, each labelled as POSITIVE or NEGATIVE.

# Download and Prepare Data

In [None]:
import requests

r = requests.get('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz')

r.raise_for_status()  # fail early if the download did not succeed

with open('imdb.tar.gz', 'wb') as out:
    out.write(r.content)

In [None]:
import tarfile
import re

from tqdm.notebook import tqdm

data = []
filename = re.compile(r'aclImdb/(?P<split>train|test)/(?P<label>neg|pos)/(?P<id>[0-9_]+)\.txt$')

with tarfile.open('imdb.tar.gz', 'r:gz') as tgz:
    for f in tqdm(tgz.getmembers()):
        m = filename.match(f.name)
        if f.isfile() and m is not None:
            data.append({
                'id': m['id'],
                'split': m['split'],
                'text': tgz.extractfile(f).read().decode('utf-8'),
                'label': m['label']
            })

In [None]:
import pandas as pd

df = pd.DataFrame(data)
df.head()

,id,split,text,label
0,127_3,test,I love sci-fi and am willing to put up with a ...,neg
1,126_4,test,"Worth the entertainment value of a rental, esp...",neg
2,125_3,test,its a totally average film with a few semi-alr...,neg
3,124_2,test,STAR RATING: ***** Saturday Night **** Friday ...,neg
4,123_4,test,"First off let me say, If you haven't enjoyed a...",neg


In [None]:
train = df[df['split'] == 'train']
X_train = train['text']
y_train = train['label']

test = df[df['split'] == 'test']
X_test = test['text']
y_test = test['label']

# Word2Vec

We will use the pre-trained Word2Vec vectors provided by Google (the 300-dimensional `word2vec-google-news-300` model).

For the document embedding, we will use the TFIDF-weighted sum of the word embeddings.

$\overrightarrow{doc} = \sum_{t \in doc} \textrm{tfidf}(t, doc) \cdot \overrightarrow{t}$

Here are the steps:
* Fit a TFIDF vectorizer to the TRAIN data
* Transform the TRAIN and TEST data into BoW
* Get the vectors from a pretrained word2vec model
* Create the document embeddings
* Train a classifier
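
To make the formula above concrete, here is a minimal sketch that embeds a single document by hand. The 3-dimensional vectors and the TFIDF weights are made-up toy values (assumptions for illustration only, not real Word2Vec vectors), and `embed_document` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Toy 3-dimensional embeddings (made-up values, not real Word2Vec vectors).
toy_vectors = {
    "great": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
    "boring": np.array([0.0, 0.0, 1.0]),
}

# Hypothetical TFIDF weights of the terms that occur in one document.
toy_weights = {"great": 0.8, "movie": 0.6}


def embed_document(weights, vectors, dims):
    """Return the TFIDF-weighted sum of the word vectors of one document."""
    doc_vec = np.zeros(dims)
    for term, weight in weights.items():
        if term in vectors:  # terms without an embedding contribute nothing
            doc_vec += weight * vectors[term]
    return doc_vec


doc_vec = embed_document(toy_weights, toy_vectors, dims=3)
print(doc_vec)
```

Looping over terms like this would be slow on the full dataset; it is only meant to show what each document vector contains.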

In [None]:
DIMS = 300

In [None]:
import gensim.downloader as api
model = api.load('word2vec-google-news-300')



## TODO - TFIDF Vectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf = TfidfVectorizer(
    # TODO
)

tfidf.fit(X_train)

## The Word-Word2Vec matrix

This matrix has:
* 1 row per word of the vectorizer vocabulary
* each row holding the word2vec vector of that word (or zeros if the word is missing from the pretrained model)

In [None]:
import numpy as np

vocab = tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
word_vecs = np.zeros((len(vocab), DIMS))

for i, w in enumerate(vocab):
    try:
        word_vecs[i, :] = model[w]
    except KeyError:
        pass
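
Words that are absent from the pretrained model keep an all-zero row, so they add nothing to the document embeddings. The sketch below shows one way to measure how much of the vocabulary is covered; `demo_vocab` and `demo_model` are made-up stand-ins for `vocab` and `model`, and the check assumes no genuine embedding is exactly zero.

```python
import numpy as np

# Made-up stand-ins for `vocab` and the pretrained `model`.
demo_vocab = ["great", "movie", "zzzunknown"]
demo_model = {"great": np.ones(3), "movie": np.full(3, 0.5)}

demo_word_vecs = np.zeros((len(demo_vocab), 3))
for i, w in enumerate(demo_vocab):
    if w in demo_model:
        demo_word_vecs[i, :] = demo_model[w]

# A row that is entirely zero marks a word without an embedding.
missing = ~demo_word_vecs.any(axis=1)
coverage = 1.0 - missing.mean()

print(f"{coverage:.0%} of the vocabulary has an embedding")
print("missing:", [w for w, m in zip(demo_vocab, missing) if m])
```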

## TODO - Transform TRAIN and TEST into BoW

In [None]:
X_train_bow = # TODO
X_test_bow = # TODO

## Transform into Doc Embeddings

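This turns out to be a single matrix multiplication: each row of the BoW matrix holds the TFIDF weights of one document and each row of `word_vecs` holds the vector of one vocabulary word, so their product yields one weighted sum of word vectors per document.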

In [None]:
X_train_vecs = X_train_bow.dot(word_vecs)  # Document embeddings
X_test_vecs = X_test_bow.dot(word_vecs)    # Document embeddings
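
To check that the matrix product really matches the weighted sum, the sketch below compares it against an explicit loop on toy data. The matrices are made-up values; `toy_bow` and `toy_embeddings` are illustrative stand-ins for the BoW matrix and `word_vecs`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# 2 documents x 3 vocabulary words: made-up TFIDF weights (sparse, like real BoW output).
toy_bow = csr_matrix(np.array([
    [0.8, 0.6, 0.0],
    [0.0, 0.5, 0.5],
]))

# 3 vocabulary words x 2 dimensions: made-up word vectors.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

# Matrix form: one row of document embeddings per document.
doc_vecs = toy_bow.dot(toy_embeddings)

# Explicit form: sum over terms of weight * word vector, for each document.
dense_bow = toy_bow.toarray()
expected = np.zeros((2, 2))
for d in range(dense_bow.shape[0]):
    for t in range(dense_bow.shape[1]):
        expected[d] += dense_bow[d, t] * toy_embeddings[t]

assert np.allclose(doc_vecs, expected)
print(doc_vecs.shape)  # (number of documents, embedding dimensions)
```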

## TODO - Classification

* Create a LogisticRegression model
* Fit it to the Document embeddings

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=10_000)  # max_iter must be an int; recent scikit-learn rejects 1e4
clf.fit(# TODO)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_true=y_test, y_pred=clf.predict(X_test_vecs)))