### Part 4: Modern Feature Engineering - Distributed Representation for Text Modelling

So far we have looked at traditional techniques that include using word frequency and weighting to determine the features from text and using those features to build classification models. On this note, we explore distributed representation as a way of generating features from words. In this notebook, we will look at the implementation of distributed representation techniques. More specifically

#### Word2Vec Implementation with Gensim:
1. Continuous Bag of Words Model
2. Skip-Gram Model
3. Gensim Vocabularity Object
#### Word Vectors to Feature Matrix
1. Averaging Word Vectors
2. Building a vectorizer for new text

### Dataset: Restaurant Reviews
Let's begin with importing the necessary package and read in the dataset

In [1]:
import pandas as pd

review_data = pd.read_csv('restaurant_reviews.tsv', sep='\t')
review_data.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


We will need to clean them up so that we have a sequence of words that we can build our word2vector representation. Before we go into the word2vec, let's use Keras to tokenize the sentiments.

In [2]:
from keras.preprocessing.text import text_to_word_sequence
    
corpus_tokens = [text_to_word_sequence(review) for review in review_data.Review.values ]
corpus_tokens[:3]

[['wow', 'loved', 'this', 'place'],
 ['crust', 'is', 'not', 'good'],
 ['not', 'tasty', 'and', 'the', 'texture', 'was', 'just', 'nasty']]

Notice that in the above example, we have created a list of documents in which every document is a set of tokens in the documents. This is the input we will need to use for our word2vec implementation.

### What is Word2Vec?

So far we have generally described distributed representation. The simplified understanding of distributed representation is the idea that words that have the same meaning tend to appear within the same context. More broadly, we distributed representation provides a way assign vectors to words based on the context such that words within a similar context are contained within the same vector space.

For more information, see stanford's lecture: https://www.youtube.com/watch?v=ERibwqs9p38

There are two main ways of computing word vectors:

1. CBOW - Continuous Bag of Words
2. Skip-Gram Model

### 1. CBOW - Continuous Bag of Words
Suppose we have some text: "generalized linear models have link functions that enable flexibility beyond that of ordinary least squares". Also, suppose that we want to predict the word vector for the word "models", the continuous bag of words uses the context words: "generalized linear ___ have link" to predict the target word. Given the corpus may contain other texts with similar context, then we can train a cbow model to provide the most probable word for the text above.
<br>

If you are interested in understanding the mechanics of training word2vec, it may be useful to visit the Stanford link above for more technical details of the training algorithms available for this estimation.

<br>
In Python, I use the gensim package to compute the words vectors for the review corpus using cbow model

<br>
Before we implement cbow word2vec, we need to determine the following: 

<br>

1. size of vector: The size of the vectors for every word in the corpus
2. window size: The size of context words to use in computing the vector


<br>
In this example, let's use a window of 5 and a vector of size 10

In [3]:
from gensim.models import Word2Vec

vector_size = 10
window_size = 5

cbow_model = Word2Vec(  sentences= corpus_tokens,
                        vector_size = vector_size,  # Setting the vector size
                        window =window_size,  # Setting the Window size
                        sg=0,                                 # Initialize CBOW
                        min_count = 2,               # Minimum word count
                        sample=.0000001 )    # Lower Weighting/Downsampling of frequent words

The model has been trained. We can now extract a vector of size 10 using the 5 nearest words

In [4]:
cbow_model.wv['good']

array([ 0.07817571, -0.09510187, -0.00205531,  0.03469197, -0.00938972,
        0.08381772,  0.09010784,  0.06536506, -0.00711621,  0.07710405],
      dtype=float32)

### 1.2. CBOW Similarity

One of the advantages of using a CBOW model is we can then compute word similarity within the corpus. The cosine similarity method is used for this calculation.

<br>
Using the gensim package, we can call the most_similar method for any word as shown in the example below:

In [5]:
cbow_model.wv.most_similar('amazing')

[('mayo', 0.8547266721725464),
 ('pricing', 0.8052123785018921),
 ('twice', 0.802130401134491),
 ('40', 0.7712088227272034),
 ('so', 0.7700384855270386),
 ('party', 0.7527121305465698),
 ('liked', 0.735294759273529),
 ('say', 0.7321372628211975),
 ('terrible', 0.7217848896980286),
 ('part', 0.7145952582359314)]

### 2. Skip-Gram Model

The Skip-Gram model is trained to perform the reverse function of the CBOW model. That is, while the cbow model predicts the target word given context words, the Skip-Gram model predicts context words based on the presence of the target word.

<br>
In terms of the general implementation of skip-gram in gensim, we will only activate the skip-gram argument. Let's see the implementation.

In [6]:
skipgram_model = Word2Vec(  sentences= corpus_tokens,
                            vector_size = 20,    # Setting the vector size
                            window = 5,          # Setting the Window size
                            min_count = 2,       # Minimum word count
                            sg = 1,              # Initialize Skip Gram
                            sample=.0000001 )    # Lower Weighting/Downsampling of frequent words

Just like with CBOW, we can generate a word vector for each word in the corpus

In [7]:
skipgram_model.wv['good']

array([-0.04121573,  0.04649187, -0.00098335, -0.00983143,  0.02302161,
       -0.02047365,  0.01371725,  0.03470667,  0.03032966, -0.03756103,
        0.04690514,  0.0233656 ,  0.01983496, -0.03122055,  0.0423056 ,
       -0.01075502,  0.04413366, -0.02680901, -0.04064848,  0.03411919],
      dtype=float32)

### 2.2. Skip-Gram Word Similarity

We can also retrieve similar words from the skip-gram model.

In [8]:
skipgram_model.wv.similar_by_word('amazing')

[('wait', 0.6336723566055298),
 ('gone', 0.6114690899848938),
 ('bartender', 0.6109290719032288),
 ('money', 0.5625681281089783),
 ('omg', 0.5368884205818176),
 ('promise', 0.5267219543457031),
 ('stayed', 0.5241491198539734),
 ('seems', 0.522510290145874),
 ('who', 0.5187430381774902),
 ('thumbs', 0.5136153697967529)]

###3. Gensim Vocabulary

Suppose we have a new text input that may have new words that our models have not seen yet, it is therefore impossible to return the vectors for those words. It is always important to know how to access the vocabulary in the model so that you can provide an alternative to new words are you are trying to leverage the models for feature extraction.

<br>
The gensim vocabulary is a dictionary of all words with the vector object.

In [9]:
vocab = skipgram_model.wv.index_to_key
vocab[20:30]

['with', 'had', 'great', 'that', 'be', 'so', 'were', 'are', 'but', 'have']

In [10]:
len(vocab)

897

### 4. Convert Word Vectors to Features

We have seen earlier how to implement bag of words and tfidf feature extraction. With those techniques, every word has a single value as a feature. With word vectors, every word has a vector of features so we will need to find a way to summarize the features such that every document has one vector representing all the tokens in the document. One simple approach is to sum the vectors and average them by the count of the words/tokens.

<br>
Let's see this in implementation

In [11]:
import numpy as np

def avg_word_vectors(words, model, vocabulary, feature_size):
    
    feature_vector = np.zeros((feature_size,), dtype='float64')
    word_count = 0.
    
    for word in words:
        if word in vocabulary:
            word_count += 1
            feature_vector = np.add(feature_vector, model.wv[word])
            
        if word_count:
            feature_vector = np.divide(feature_vector, word_count)
            
    return feature_vector


Testing the function on a sample of text

In [12]:
test = ["This", 'is', 'delicious', 'food']
avg_word_vectors(test, skipgram_model, vocab, 20 )

array([ 0.00381702, -0.00316857,  0.00247694, -0.00598015,  0.02107186,
        0.0180915 , -0.00301856,  0.01412969, -0.01600609,  0.01004132,
        0.0075589 , -0.00288089,  0.01646154, -0.00441363,  0.01078511,
       -0.01157882,  0.02155005,  0.01900969, -0.01142925,  0.00739343])

### Averaging across the full dataset

Now let's convert the averaged vectors to a full dataframe with words as the predictors/features

In [13]:
def avg_word_vectorizer(corpus, model, feature_size):
    vocabulary = set(model.wv.index_to_key)
    features = [ avg_word_vectors(text, model, vocabulary, feature_size) for text in corpus]
    
    return np.array(features)


skipgram_feaures = avg_word_vectorizer(corpus_tokens, skipgram_model, 20)

### Converting Vectors into a Feature Dataframe

In [14]:
skipgram_df = pd.DataFrame(skipgram_feaures)
skipgram_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.0022,-0.015287,-0.015425,-0.001512,0.001022,-0.004541,-0.00224,-0.010102,0.013118,0.0028,-0.002498,0.004954,0.007783,-0.005989,0.013167,0.012485,0.009511,0.004616,0.010194,-0.006662
1,-0.014202,0.009345,-0.001474,-0.007928,0.008754,-0.004882,0.004632,0.012017,0.003751,-0.007202,0.017074,0.002244,0.002607,-0.011181,0.006813,-0.003037,0.014002,-0.004456,-0.007994,0.012142
2,0.000496,-0.003071,0.000526,0.000306,0.003619,-0.001678,0.001542,-0.001002,0.00284,0.003904,-0.002766,-0.005903,-0.002449,0.004655,-0.00421,-0.004335,0.00384,-0.005615,-0.003023,-0.004575
3,-0.004266,0.001542,-0.002245,-0.002431,0.001924,0.00258,0.003604,-0.002653,0.003792,0.003001,-0.001501,-0.003569,0.002076,0.002506,8.6e-05,-0.002937,-0.003272,-0.000946,0.002147,-0.001985
4,-0.001113,0.000919,-0.000484,0.003836,0.003093,0.000356,0.001461,0.002986,0.000749,0.002564,0.00036,-0.001045,0.00202,-0.00121,-0.001275,0.003626,-0.003685,-0.001368,-0.001514,0.001991


### Conclusion

What we have done is reduce all of our text to a dataframe that has 20 features representing the average of all of the word vectors for the word in the text. We can use this matrix for some of the classification tasks that we did before. In the last section, we will cover how to use deep learning for the classification of the sentiments.