# Assignment 1: reference model (word2vec)
This assignment provides a reference model for sentiment analysis. It uses spaCy to generate word2vec embeddings, which are fed to a classifier that predicts the sentiment of IMDb movie reviews. It is intended to serve as a baseline and to show how straightforward text analysis with spaCy is. In assignment 2 you will compare your results against the accuracy of this word2vec model.

Tasks:
- Walk through the code to understand what it does.
- Something seems off with the classifier. Try to fix it.

In [1]:
import numpy as np
import pandas as pd
import os
import spacy
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from tqdm import tqdm

# Load data

In [2]:
data_train = pd.read_csv(Path('data/train.csv'))
data_test = pd.read_csv(Path('data/test.csv'))

# select a subset of reviews to train on
data_train = data_train.head(10000)

reviews_train = data_train['review']
y_true_train = data_train['sentiment']

reviews_test = data_test['review']
y_true_test = data_test['sentiment']
display(data_train.head())

Unnamed: 0,review,sentiment
0,"This movie is awful, I can't even be bothered ...",negative
1,Why do movie makers always go against the auth...,negative
2,I can't believe that those praising this movie...,negative
3,This film really used its locations well with ...,positive
4,Strangely enough this movie never made it to t...,positive


# Generate word embeddings

Now we load the spaCy language model "en_core_web_lg", which contains a pretrained word2vec model. We take the average word vector for every review (or "doc") and save it in a list.

**Note:** we do not perform any preprocessing. This is because spaCy creates document vectors by averaging the token vectors of all tokens that have a vector. Tokens that would be taken out by preprocessing will simply not be in the pretrained word2vec model, so there is no need to account for this.
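To make the averaging concrete, here is a minimal sketch that follows the description above. The vocabulary `toy_vectors`, the helper `average_vector`, and the 3-dimensional vectors are hypothetical and for illustration only. This is not spaCy's implementation, and spaCy's actual handling of tokens without a vector may differ.

```python
import numpy as np

# Hypothetical toy vocabulary: token -> 3-dimensional vector.
# Real pretrained models use far more dimensions.
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 2.0, 0.0]),
    "awful": np.array([-1.0, 0.0, -2.0]),
}


def average_vector(tokens, vectors, dim=3):
    """Average the vectors of all tokens that have one; return zeros if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "xyzzy" is not in the toy vocabulary, so only "great" and "movie" are averaged.
doc_vec = average_vector(["great", "movie", "xyzzy"], toy_vectors)
print(doc_vec)
```

Each review ends up as one fixed-length vector regardless of its length, which is what lets the reviews be passed to a standard classifier.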

In [3]:

nlp_lg = spacy.load("en_core_web_lg")

print('Generating word2vec embeddings')
X_train = [doc.vector for doc in tqdm(nlp_lg.pipe(reviews_train), total=len(reviews_train))]
X_test = [doc.vector for doc in tqdm(nlp_lg.pipe(reviews_test), total=len(reviews_test))]


Generating word2vec embeddings


100%|██████████| 10000/10000 [01:46<00:00, 93.56it/s]
100%|██████████| 1000/1000 [00:10<00:00, 92.50it/s]


# Train classifier
Train a classifier on the training set, then predict the sentiment for both the training and test set.

In [4]:
model = RandomForestClassifier(max_depth=20).fit(X_train, y_true_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Determine accuracy
Next we determine the accuracy of the classifier on the training and test set. 

In [5]:
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the result is unchanged
training_accuracy = accuracy_score(y_true_train, y_pred_train)
test_accuracy = accuracy_score(y_true_test, y_pred_test)
print(f'Training accuracy: {training_accuracy}, test accuracy: {test_accuracy}')

Training accuracy: 0.9993, test accuracy: 0.76
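
The accuracy reported above is the fraction of reviews whose predicted label equals the true label. The sketch below computes it by hand on a tiny example. The helper `accuracy` and the labels `toy_true` and `toy_pred` are hypothetical and only illustrate the calculation, not scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical labels, for illustration only.
toy_true = ["negative", "negative", "positive", "positive"]
toy_pred = ["negative", "positive", "positive", "positive"]

# Three of the four predictions match the true labels.
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```

Comparing this number between the training set and the test set shows how well the classifier generalizes to reviews it has not seen.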
