## Word Vectors for Classification

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import re
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

In this notebook, we'll see if we can do better at predicting sentiment by using word vectors.

In [2]:
reviews = pd.read_csv('data/amazon_reviews.csv')

reviews.head()

| Unnamed: 0 | sentiment | title | text |
|---|---|---|---|
| 0 | 1 | The Gnostic Gospels (Vintage) | This is a misrepesentation of the Gospels. It ... |
| 1 | 1 | Christine Feehan sucks | Ok she always starts off good with the tension... |
| 2 | 1 | bad review | The Dvd that amazon sent me only worked one ti... |
| 3 | 1 | Cheap | This bracelet was missing pearls, and when I e... |
| 4 | 1 | piece of crap | The ear piece is completely worthless. It is c... |


In [3]:
X = reviews[['text']]
y = reviews['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

First, let's get our baselines using CountVectorizer and TfidfVectorizer.

In [4]:
vect = CountVectorizer()

X_train_vec = vect.fit_transform(X_train['text'])
X_test_vec = vect.transform(X_test['text'])

In [5]:
logreg = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = logreg.predict(X_test_vec)

accuracy_score(y_test, y_pred)

confusion_matrix(y_test, y_pred)

array([[1003,  247],
       [ 236, 1014]])

In [6]:
vect = TfidfVectorizer()

X_train_vec = vect.fit_transform(X_train['text'])
X_test_vec = vect.transform(X_test['text'])

In [7]:
logreg = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = logreg.predict(X_test_vec)

accuracy_score(y_test, y_pred)

confusion_matrix(y_test, y_pred)

array([[1030,  220],
       [ 224, 1026]])

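Let's try a matrix-based approach: count how often words co-occur within a small window, then factor the resulting co-occurrence matrix with a truncated SVD to get a dense vector for each word.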

In [8]:
from collections import Counter
from scipy import sparse
from scipy.sparse.linalg import svds
import gensim

In [9]:
review_counter = Counter()

for review in X_train['text']:
    review_counter.update(gensim.utils.simple_preprocess(review))

word_index = {word: i for i, word in enumerate(review_counter.keys())}
index_word = {i: word for i, word in enumerate(review_counter.keys())}

window_size = 2

cooccurrence_counter = Counter()

for review in X_train['text']:
    # First, tokenize the review
    sentence = gensim.utils.simple_preprocess(review)

    # Then, build the window of up to window_size words on each side of the word
    for i, word in enumerate(sentence):
        window = sentence[max(0, i - window_size): i] + sentence[i + 1: i + 1 + window_size]

        # Then, increment the count for each (word, neighbor) pair
        for other_word in window:
            cooccurrence_counter[(word, other_word)] += 1
            

for word in review_counter.keys():
    cooccurrence_counter[(word, word)] += review_counter[word]
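
To make the windowing concrete, here is a minimal sketch of the same counting logic on a single toy sentence. The helper name `count_cooccurrences` and the example tokens are made up for illustration; the logic mirrors the loop above with `window_size = 2`.

```python
from collections import Counter

def count_cooccurrences(tokens, window_size=2):
    """Count (word, neighbor) pairs within window_size positions on either side."""
    counts = Counter()
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window_size): i]
        right = tokens[i + 1: i + 1 + window_size]
        for other_word in left + right:
            counts[(word, other_word)] += 1
    return counts

# Hypothetical, already-tokenized review
tokens = ["the", "book", "was", "not", "good"]
counts = count_cooccurrences(tokens)

print(counts[("was", "the")])   # 1: "the" is two positions to the left of "was"
print(counts[("the", "not")])   # 0: "not" is three positions away from "the"
```

Because the window is symmetric, every pair is counted in both directions, so the resulting co-occurrence matrix is symmetric as well.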

In [10]:
row_idx = []
col_idx = []
counts = []

for (word1, word2) in cooccurrence_counter.keys():
    row_idx.append(word_index[word1])
    col_idx.append(word_index[word2])
    counts.append(cooccurrence_counter[(word1, word2)])

cooccurrence_matrix = sparse.csc_matrix((counts, (row_idx, col_idx)), dtype = 'float')

dimension = 50

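# svds returns U, the singular values, and the transposed right singular vectors
U, D, Vt = svds(cooccurrence_matrix, k = dimension)

# Scale each column of U by its singular value to get one row vector per word
word_vectors = U * D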
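
As a quick check on what the truncated SVD returns, here is a small sketch on a made-up symmetric count matrix (the random data and the sizes are assumptions chosen only for illustration). It shows the shapes of the three outputs and that multiplying `U` by the vector of singular values scales each column of `U`.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Toy symmetric "co-occurrence" matrix for a 20-word vocabulary
rng = np.random.default_rng(0)
dense = rng.poisson(1.0, size=(20, 20)).astype(float)
toy_matrix = sparse.csc_matrix(dense + dense.T)

# Keep only the 5 largest singular values and vectors
U_toy, D_toy, Vt_toy = svds(toy_matrix, k=5)
print(U_toy.shape, D_toy.shape, Vt_toy.shape)   # (20, 5) (5,) (5, 20)

# Broadcasting multiplies column j of U_toy by singular value j
toy_vectors = U_toy * D_toy
print(np.allclose(toy_vectors, U_toy @ np.diag(D_toy)))   # True
```

Each row of the scaled matrix is then the vector for the word whose index matches that row.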

The matrix above gives us a vector for each individual word. But how do we vectorize a full review?

A common approach is to take the average of all of the word vectors in that review. Let's write some code to accomplish this.

We'll grab the first review.

In [11]:
review = X_train['text'][0]
review

'This is a misrepesentation of the Gospels. It should be noted that this is not a translation of the Gnostic Gospel. It is merely The authors viewpoints and excerps from the original text.Which, by the way, has been translated.I was under the impression that this was the tranlated Gnostic Bibles. This book should be clearly titled "My thoughts of the Gnostic text".Don\'t bother ordering this book if you are looking for the Gnostic Bible translated.'

First, let's create an initial vector of zeros using the numpy `zeros` function. Make sure the result has the same shape as a single word vector, that is, one row of `word_vectors`.

We'll also need to count the number of words in the review for which we have an embedding. We'll initialize a counter variable at 0.

In [267]:
embedding = # Fill this in
count = 0


Now, write a for loop which will go through each word in the preprocessed text, check if that word appears in the matrix (using the `word_index` dictionary), and if so, add its embedding to the embedding and increment the counter by one.

In [271]:
for word in gensim.utils.simple_preprocess(review):
    #### Fill in the body of the for loop
    
if count > 0:
    embedding /= count


Since we want to apply this to all of the reviews, convert your code above into a function which can take the text of a review and return an embedding.

In [13]:
def make_vector_matrix(review):
    # Fill this in
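
If you get stuck, here is one possible sketch of the averaging logic, shown on toy data rather than the real matrix. The function name `average_vector`, the three-word vocabulary, and the two-dimensional vectors are all made up for illustration. In the notebook you would tokenize with `gensim.utils.simple_preprocess` and use the real `word_index` and `word_vectors`.

```python
import numpy as np

# Toy stand-ins for the notebook's word_index and word_vectors
toy_word_index = {"good": 0, "bad": 1, "book": 2}
toy_word_vectors = np.array([[1.0, 0.0],
                             [-1.0, 0.0],
                             [0.0, 2.0]])

def average_vector(tokens, word_index, word_vectors):
    """Average the vectors of all tokens that have an embedding."""
    embedding = np.zeros(word_vectors.shape[1])
    count = 0
    for word in tokens:
        if word in word_index:
            embedding += word_vectors[word_index[word]]
            count += 1
    # Leave the zero vector alone if no token was in the vocabulary
    if count > 0:
        embedding /= count
    return embedding

# "unseen" has no embedding, so only "good" and "book" are averaged
toy_embedding = average_vector(["good", "book", "unseen"], toy_word_index, toy_word_vectors)
print(toy_embedding)   # values 0.5 and 1.0
```

Returning the zero vector for a review with no known words is one design choice among several; it keeps every review the same length so the results can be stacked into a matrix.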


Finally, we need to use this function to vectorize each review in the training and the test sets.

Create two variables, `X_train_vec` and `X_test_vec` by applying your function to the text of the train and test reviews. Hint: you may need to use the [numpy `vstack` function](https://numpy.org/doc/stable/reference/generated/numpy.vstack.html) to convert the results to a numpy array.

In [None]:
X_train_vec = # Fill this in
X_test_vec = # Fill this in
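
As a hint for the stacking step, here is a minimal sketch of how `np.vstack` turns a list of equal-length vectors into a 2D array with one row per review. The function `toy_embed` is a hypothetical stand-in for your review-embedding function, and the two texts are made up.

```python
import numpy as np

def toy_embed(text):
    # Hypothetical stand-in embedding: [number of words, number of characters]
    return np.array([len(text.split()), len(text)], dtype=float)

texts = ["great book", "did not work at all"]

# One vector per text, stacked row-wise into a (n_texts, n_features) array
X_toy = np.vstack([toy_embed(text) for text in texts])
print(X_toy.shape)   # (2, 2)
```

The same pattern should work with a list comprehension over `X_train['text']` and `X_test['text']`.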

Now, let's see how these vectors did compared to the baseline.

In [275]:
logreg = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = logreg.predict(X_test_vec)

accuracy_score(y_test, y_pred)

confusion_matrix(y_test, y_pred)

array([[831, 419],
       [381, 869]])

Now, let's try a different type of word vector: word2vec. This model is implemented in the [gensim library](https://radimrehurek.com/gensim/).

The [Word2Vec model](https://radimrehurek.com/gensim/models/word2vec.html) expects a collection of tokenized sentences. We'll use gensim's `simple_preprocess` function to tokenize and clean up the text of the reviews.

In [277]:
train_sentences = [gensim.utils.simple_preprocess(sentence) for sentence in X_train['text']]

There are a number of hyperparameters we can set. Feel free to change these or leave them at their current values.

In [278]:
model = gensim.models.Word2Vec(train_sentences,
                              vector_size = 100,
                              window = 5,
                              min_count = 2,
                              sg = 1,
                              hs = 0,
                              negative = 10,
                              epochs = 50)

Once the model is fit, you can access the word vectors through `.wv`.

In [281]:
model.wv.get_vector('pc')

array([-0.6060952 ,  0.35809305, -0.20759344,  0.18713778, -0.19552566,
       -0.46109164,  0.7681601 ,  0.65415114, -0.12236015, -0.44984972,
        0.51400286, -0.05475419,  0.5449332 ,  0.17944384, -0.20722571,
       -1.1557525 ,  0.55389464, -0.28227523, -0.15193258, -0.23581949,
        0.18039699, -0.2276723 ,  0.5934985 , -0.5953563 ,  0.8057621 ,
       -0.15314746, -0.65783834, -0.5568405 , -0.25253367,  0.26088655,
       -0.16711307,  0.24099213,  0.2011326 ,  0.43268412, -0.8698468 ,
        0.84646153, -0.03731446,  0.05006408,  0.26601744, -0.7598717 ,
       -0.8653652 , -0.62019724,  0.3869248 ,  0.07545926, -0.00375254,
        0.00327626, -0.54103774,  0.19690773, -0.09124593,  0.21769074,
       -0.21526293, -0.9890786 , -0.5045264 ,  0.3277378 ,  0.00262126,
       -0.64682233, -0.42446283,  0.26766402, -0.12486582, -0.01101785,
       -0.8928909 , -0.07323002,  0.19150165, -0.2556738 ,  0.6177432 ,
       -0.33525002, -0.2786363 ,  0.47334754,  0.54774565,  0.32

We can also quickly find the most similar word for a given word.

In [279]:
model.wv.most_similar('pc')

[('compaq', 0.6008726358413696),
 ('demos', 0.5389522314071655),
 ('celeron', 0.5211091041564941),
 ('windows', 0.4956669211387634),
 ('nvidia', 0.49237313866615295),
 ('iphone', 0.49212580919265747),
 ('overkill', 0.48985496163368225),
 ('patched', 0.48859703540802),
 ('system', 0.48829033970832825),
 ('ati', 0.48686859011650085)]
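
The scores returned by `most_similar` are cosine similarities between word vectors. Here is a minimal sketch of that computation with numpy, using made-up two-dimensional vectors (the words and values below are illustrative only, not taken from the trained model).

```python
import numpy as np

def cosine_similarity_to(query, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

# Made-up vectors for three words
words = ["desktop", "laptop", "banana"]
vectors = np.array([[0.9, 0.1],
                    [0.6, 0.8],
                    [-0.5, 0.5]])
query = np.array([1.0, 0.0])   # pretend this is the vector for 'pc'

sims = cosine_similarity_to(query, vectors)

# Sort from most to least similar
ranked = [words[i] for i in np.argsort(-sims)]
print(ranked)   # ['desktop', 'laptop', 'banana']
```

Cosine similarity depends only on the angle between vectors, not their lengths, so a frequent word with a large vector is not automatically "closer" to everything.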

Write a function to find the average embedding for the text of a review. 
Then use this function to vectorize the training and test reviews. How do the word2vec vectors compare to the other approaches on this task?

In [None]:
def make_vector(review):
    # Your code here

In [None]:
X_train_vec = # Fill this in
X_test_vec = # Fill this in
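
One way to approach this, sketched under some assumptions: collect the vectors of the tokens the model knows and take their mean. The function name `mean_of_known` is made up, and a plain dictionary stands in for the trained vectors so the example runs on its own. gensim's `KeyedVectors` (what `model.wv` is) also supports `token in vectors` and `vectors[token]`, so the same function should work with `model.wv` and `dim = model.wv.vector_size`.

```python
import numpy as np

def mean_of_known(tokens, vectors, dim):
    """Mean of the vectors for tokens found in `vectors`; zeros if none are found."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy lookup standing in for model.wv
toy_lookup = {
    "great": np.array([1.0, 0.0, 1.0]),
    "book": np.array([0.0, 2.0, 1.0]),
}

# "zzz" is out of vocabulary and is skipped
toy_review_vector = mean_of_known(["great", "book", "zzz"], toy_lookup, dim=3)
print(toy_review_vector)   # values 0.5, 1.0 and 1.0
```

The membership check matters here because the model above was trained with `min_count = 2`, so words that appear only once in the training reviews have no vector, and neither do unseen words in the test reviews.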

In [284]:
logreg = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = logreg.predict(X_test_vec)

accuracy_score(y_test, y_pred)

confusion_matrix(y_test, y_pred)

array([[960, 290],
       [261, 989]])

Finally, gensim has a number of pretrained word vector models. Let's try out the GloVe Wikipedia Gigaword 100 model.

In [22]:
import gensim.downloader as api

In [287]:
wv = api.load('glove-wiki-gigaword-100')

In [289]:
wv.most_similar('pc')

[('desktop', 0.7962974905967712),
 ('pcs', 0.765652060508728),
 ('macintosh', 0.7592843174934387),
 ('computer', 0.7366448640823364),
 ('computers', 0.7268315553665161),
 ('hardware', 0.7222781777381897),
 ('playstation', 0.7164796590805054),
 ('software', 0.7147746682167053),
 ('console', 0.7037823796272278),
 ('xbox', 0.6895698308944702)]

Write a function to make a vector for each review using these embeddings. How well does the model do using these?

In [None]:
def make_vector(review):
    # fill this in

X_train_vec = # fill this in
X_test_vec = # fill this in

In [294]:
logreg = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = logreg.predict(X_test_vec)

accuracy_score(y_test, y_pred)

confusion_matrix(y_test, y_pred)

array([[949, 301],
       [327, 923]])
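
Since a notebook cell only displays its last expression, the accuracy scores don't appear in the outputs above, but they can be recovered from the confusion matrices. The sketch below does that arithmetic for the five matrices printed in this notebook. The helper name `accuracy_from_confusion` and the dictionary labels are made up for illustration.

```python
import numpy as np

# Confusion matrices copied from the outputs in this notebook
confusion_matrices = {
    "CountVectorizer": [[1003, 247], [236, 1014]],
    "TfidfVectorizer": [[1030, 220], [224, 1026]],
    "SVD co-occurrence vectors": [[831, 419], [381, 869]],
    "word2vec": [[960, 290], [261, 989]],
    "GloVe (pretrained)": [[949, 301], [327, 923]],
}

def accuracy_from_confusion(cm):
    """Accuracy = correct predictions (the diagonal) / all predictions."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

for name, cm in confusion_matrices.items():
    print(f"{name}: {accuracy_from_confusion(cm):.4f}")

# CountVectorizer: 0.8068
# TfidfVectorizer: 0.8224
# SVD co-occurrence vectors: 0.6800
# word2vec: 0.7796
# GloVe (pretrained): 0.7488
```

On this test set, both bag-of-words baselines score higher than any of the averaged word-vector representations, with word2vec trained on the reviews coming closest.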