# Word Embeddings Workbook

This notebook is designed to be worked through from start to finish by anyone with prior Python experience. We'll explore word embeddings and how to use them in code.

## Setup

We'll be using two libraries today: 

- spacy
  - contains the ability to work with vectors, cosine similarity, vocabulary and more
  - generally seen as easy to use
- gensim
  - contains the ability to work with vectors, cosine similarity and more
  - generally seen as harder to use

In [None]:
import spacy

import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.parsing.preprocessing import strip_punctuation

import pandas as pd
import numpy as np

from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.models import load_model

Now that we have the libraries loaded, we're going to leverage pre-existing word vectors.

In [None]:
# Load pre-trained GloVe vectors with gensim
word2vec = KeyedVectors.load_word2vec_format('./vectors/glove_6B/word2vec.6B.50d.txt')

Loading the word2vec format returns an object of type `Word2VecKeyedVectors`. Take a moment to scroll through the documentation for this class, located [here](https://radimrehurek.com/gensim/models/keyedvectors.html).
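
To make the file layout less of a black box, here is a minimal sketch of a reader for the word2vec text format: a header line with the word count and the vector size, followed by one line per word. The function name `read_word2vec_text` and the sample data are made up for illustration, and gensim's real loader does considerably more (binary files, encodings, memory handling).

```python
import io

def read_word2vec_text(handle):
    """Parse word2vec text format into a {word: [floats]} dictionary."""
    # Header line: "<number of words> <vector dimensions>"
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != count:
        raise ValueError(f"header promised {count} words, found {len(vectors)}")
    return vectors

# Hypothetical three-word, four-dimensional file held in memory
sample = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "pizza 0.5 -0.1 0.0 0.9\n"
    "king 0.7 0.3 -0.2 0.1\n"
)
toy_vectors = read_word2vec_text(sample)
print(len(toy_vectors), toy_vectors["pizza"])
```

Raw GloVe downloads omit the header line, which is why they are usually converted before being loaded with `load_word2vec_format`.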

## Similar Words

At this point we're ready to ask our GloVe vectors for similar words. To do this we'll use the `most_similar` method you hopefully just discovered.

This method takes a list of positive words, a list of negative words, and `topn`, the number of top (most similar) results to return.

In [None]:
word2vec.most_similar(positive=["king"], topn=10)

Take note of the second value in each result. This is the cosine similarity between your query and that word. It can range from $-1$ to $1.0$, and values closer to $1.0$ mean the model considers the word more similar.
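
The gensim documentation describes this score as a cosine similarity. As a rough sketch of what that means, the snippet below computes it by hand for three made-up 3-dimensional vectors (the real vectors in this notebook have 50 dimensions, and the numbers here are purely hypothetical).

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not taken from the GloVe file
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.2]
pizza = [-0.5, 0.1, 0.9]

sim_king_queen = cosine_similarity(king, queen)
sim_king_pizza = cosine_similarity(king, pizza)
print(round(sim_king_queen, 3), round(sim_king_pizza, 3))
```

Vectors pointing in nearly the same direction score close to $1$, while unrelated or opposing directions score near $0$ or below it.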

Let's try using the negative list now. Perhaps we want royal positions held by women. We can use the same function and pass in negative values as well. Let's demonstrate that below:

In [None]:
word2vec.most_similar(positive=["woman", "king"], negative=["man"], topn=10)

The results should mostly be royal positions with the "maleness" removed. This is because we subtracted the vector for `"man"` and added the vector for `"woman"`. This is an incredibly powerful and useful tool.
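
The arithmetic behind this kind of query can be sketched with a handful of made-up 2-dimensional vectors. The `analogy` helper below is hypothetical: it adds the unit-length positive vectors, subtracts the unit-length negative ones, and ranks the remaining words by cosine similarity. To my understanding this mirrors the general idea behind `most_similar`, but it is not gensim's implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def analogy(vectors, positive, negative):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, normalize(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, normalize(vectors[word]))]
    query = normalize(query)
    # Skip the query words themselves so they cannot win trivially
    scores = {
        word: sum(q * x for q, x in zip(query, normalize(vec)))
        for word, vec in vectors.items()
        if word not in positive and word not in negative
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical vectors: axis 0 loosely "royalty", axis 1 loosely "gender"
toy = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "pizza": [-0.8, 0.05],
}
ranked = analogy(toy, positive=["woman", "king"], negative=["man"])
print(ranked[0][0])
```

With these toy numbers the top-ranked word is `queen`, because removing `man` and adding `woman` flips the "gender" axis while leaving the "royalty" axis largely intact.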

Let's have you try out your own `most_similar` calls now:

In [None]:
word_of_choice = "" # replace the empty string with your word of choice

In [None]:
word2vec.most_similar(positive=[word_of_choice], topn=10)

Were the results relevant? Feel free to experiment with a few different words. 

If you found anything particularly interesting, please share with others.

## Sentiment Analysis
### Preparing Our Neural Network

At this point we're going to demonstrate sentiment analysis. The next couple of cells must be run without modification.

First we're going to present our documents, which are all reviews for a local pizza place. We're also manually labeling each review: $1$ for good and $0$ for bad.

In [None]:
documents = ["oh my god! So good. Philly grinder!",
             "This is one of the best pizza's in town. The 7 cheese pizza is Amazing.... stuffed churros, yum. Great upbeat staff. I could eat here every night.",
             "I'm telling you YOUR SERVICE STINKS AND YOU'D RETAIN MORE CUSTOMERS IF YOU'D JUST TEACH THEM TO USE THE PHONE PROPERLY AND TAKE AN ORDER PROPERLY.",
             "My favorite Fort Collins pizza place! I always go for the south of the border. They also have the best spicy ranch.",
             "Nope. Don't do it. Ordered a pizza over 2 hours ago and it still hasn't shown up.",
             "My favorite pizza place in fort collins! I always order South of the border with cream cheese the staff are awesome. One time the pizza deliver guy came super late. He apologized and gave us restaurant credit."]

labels = np.array([1, 1, 0, 1, 0, 1])

Now we're going to process the documents:

- stripping the punctuation
- encoding each word as its index in our word embedding vocabulary

In [None]:
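encoded_documents = []
for line in documents:
    # strip_punctuation replaces punctuation with spaces
    line = strip_punctuation(line)
    encoded = []
    for word in line.split():
        try:
            # look up the word's row index in the embedding vocabulary
            encoded.append(word2vec.vocab[word].index)
        except KeyError:
            # out-of-vocabulary words fall back to index 0
            encoded.append(0)
    encoded_documents.append(encoded)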
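
To see what this encoding step produces without loading the full vector file, here is a small sketch using a made-up six-word vocabulary. The `encode` helper and `toy_vocab` are hypothetical, and the punctuation handling only approximates `strip_punctuation` by replacing each punctuation character with a space.

```python
import string

# Hypothetical vocabulary: word -> row index in an embedding matrix
toy_vocab = {"the": 0, "pizza": 1, "is": 2, "best": 3, "bad": 4, "great": 5}

def encode(text, vocab, unknown_index=0):
    """Turn a sentence into a list of vocabulary indices."""
    # Replace every punctuation character with a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.translate(table).split()
    # Words missing from the vocabulary fall back to unknown_index
    return [vocab.get(word, unknown_index) for word in words]

first = encode("The pizza's great!", toy_vocab)
second = encode("the pizza is bad", toy_vocab)
print(first)   # [0, 1, 0, 5]
print(second)  # [0, 1, 2, 4]
```

Two details are worth noticing in the first result. The apostrophe splits `pizza's` into two tokens, so the encoded length can differ from a plain `split()` of the raw text. The lookup is also case-sensitive, so the capitalized `The` is treated as unknown and only lands on index 0 through the fallback.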

For a neural network, all of our inputs need to be the same length, so we'll pad each document with $0$s at the end until they all match the longest one.

In [None]:
# measure the encoded documents, since stripping punctuation can change the word count
max_length = max(len(encoded) for encoded in encoded_documents)

In [None]:
padded_documents = pad_sequences(encoded_documents, maxlen=max_length, padding='post')
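
As a rough picture of what `padding='post'` does, the sketch below pads plain Python lists by hand. The `pad_post` helper is hypothetical. It also mimics the `pad_sequences` default of `truncating='pre'`, where a sequence longer than `maxlen` loses its earliest items.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value` up to `maxlen` items."""
    padded = []
    for seq in sequences:
        # Keep only the last `maxlen` items when a sequence is too long
        trimmed = seq[-maxlen:]
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

# Hypothetical encoded documents of different lengths
toy_encoded = [[5, 3], [7, 1, 4, 9], [2]]
toy_padded = pad_post(toy_encoded, maxlen=3)
print(toy_padded)  # [[5, 3, 0], [1, 4, 9], [2, 0, 0]]
```

The second row shows why `max_length` matters: if it is smaller than the longest encoded document, words are silently dropped from the front.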

In [None]:
# rows = vocabulary size, columns = vector dimensions
embedding_input_dim, embedding_output_dim = word2vec.vectors.shape

Now we'll be building our neural network using Keras, which makes this incredibly easy.

We'll be feeding in our padded documents for training.

In [None]:
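model = Sequential()
# embedding layer initialised with the pre-trained vectors
model.add(Embedding(input_dim=embedding_input_dim,
                    output_dim=embedding_output_dim,
                    weights=[word2vec.vectors],
                    input_length=max_length))
# flatten the (max_length, output_dim) matrix into one feature vector
model.add(Flatten())
model.add(Dense(128, activation="relu"))
# single sigmoid unit: output near 1 means positive, near 0 means negative
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam")
model.fit(padded_documents, labels, epochs=50, verbose=0)
# returns the loss on the same documents we trained on
model.evaluate(padded_documents, labels, verbose=3)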
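
Conceptually, the first two layers of this network are a table lookup followed by a reshape. The sketch below illustrates that idea with a made-up 4-word, 2-dimensional embedding matrix. The `embed` and `flatten` helpers are hypothetical stand-ins, not the Keras layers themselves.

```python
# Hypothetical embedding matrix: one row per vocabulary index
embedding_matrix = [
    [0.0, 0.0],   # index 0
    [0.5, -0.1],  # index 1
    [0.7, 0.3],   # index 2
    [-0.2, 0.9],  # index 3
]

def embed(token_ids, matrix):
    """Replace each index with its row from the embedding matrix."""
    return [matrix[i] for i in token_ids]

def flatten(rows):
    """Concatenate the rows into one flat feature list."""
    return [value for row in rows for value in row]

padded_document = [1, 3, 0]  # a padded document of length 3
embedded = embed(padded_document, embedding_matrix)
features = flatten(embedded)
print(features)  # [0.5, -0.1, -0.2, 0.9, 0.0, 0.0]
```

The flattened list has `max_length * output_dim` entries, which is the input size the first `Dense` layer sees.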

### Using Our Neural Network To Do Predictions

At this point we're ready to test our neural network. Don't edit these cells yet; you'll be asked to go back and edit the `documents` variable later.

This variable, `documents`, represents reviews our neural network hasn't seen before. We're going to ask it if these sentences are good, represented by a $1$, or bad, represented by a $0$.

Note, the first document is positive and the second is negative.

In [None]:
documents = ["best pizza in fort collins",
             "The pizza is bad"]

Again, we're going to process the documents:

- stripping the punctuation
- encoding each word as its index in our word embedding vocabulary

In [None]:
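encoded_documents = []
for line in documents:
    # strip_punctuation replaces punctuation with spaces
    line = strip_punctuation(line)
    encoded = []
    for word in line.split():
        try:
            # look up the word's row index in the embedding vocabulary
            encoded.append(word2vec.vocab[word].index)
        except KeyError:
            # out-of-vocabulary words fall back to index 0
            encoded.append(0)
    encoded_documents.append(encoded)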

And again, we're going to pad our sentences:

In [None]:
padded_documents = pad_sequences(encoded_documents, maxlen=max_length, padding='post')

Now we can ask the neural network to predict the sentiment of the sentences. We expect the first one to be closer to one, which represents positive, and the second one to be closer to zero, which represents negative.

In [None]:
model.predict(padded_documents)

Success!

In [None]:
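# predict_classes was removed in newer Keras releases;
# thresholding the sigmoid output at 0.5 gives the same classes
(model.predict(padded_documents) > 0.5).astype("int32")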

And we can have it round to the closest $0$ or $1$.

Feel free to go back up, change the `documents` variable, and rerun the code cells to see how the model performs. Your changes should be about pizza, and the results won't be too accurate, as we trained on a relatively small number of documents.
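
For intuition about where these numbers come from, the sketch below applies a sigmoid to two made-up raw scores and then thresholds the result at $0.5$. The helper names and input values are hypothetical and are not outputs of the model above.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_classes(probabilities, threshold=0.5):
    """Map each probability to 1 (positive) or 0 (negative)."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical raw scores from the last layer before the sigmoid
raw_scores = [2.3, -1.7]
probabilities = [sigmoid(z) for z in raw_scores]
classes = to_classes(probabilities)
print([round(p, 3) for p in probabilities], classes)
```

A positive raw score lands above $0.5$ and is labeled $1$, while a negative raw score lands below $0.5$ and is labeled $0$.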