## Setup

If you haven't already, please follow the [setup instructions](https://jennselby.github.io/MachineLearningCourseNotes/#setup-and-tools) to get all of the necessary software (Github is optional).

1. Install the Gensim word2vec Python implementation: `python3 -m pip install --upgrade gensim`
1. Get the trained model (1billion_word_vectors.zip) from Canvas and put it in the same folder as this ipynb file.
1. Unzip the trained model file. You should now have three files in the folder (if zip created a new folder, move these files out of that separate folder into the same folder as this ipynb file):
    * 1billion_word_vectors
    * 1billion_word_vectors.syn1neg.npy
    * 1billion_word_vectors.wv.syn0.npy

## Extra Details -- Do Not Do This
This took awhile, which is why I'm giving you the trained file rather than having you do this. But just in case you're curious, here is how to create the trained model file.
1. Download the corpus of sentences from [http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz)
1. Unzip and unarchive the file: `tar zxf 1-billion-word-language-modeling-benchmark-r13output.tar.gz` 
1. Run the following Python code:
    ```
    from gensim.models import word2vec
    import os

    corpus_dir = '1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled'
    sentences = word2vec.PathLineSentences(corpus_dir)
    model = word2vec.Word2Vec(sentences) # just use all of the default settings for now
    model.save('1billion_word_vectors')
    ```

## Documentation/Sources
* [https://radimrehurek.com/gensim/models/word2vec.html](https://radimrehurek.com/gensim/models/word2vec.html) for more information about how to use gensim word2vec in general
* _Blog post has been removed_ [https://codekansas.github.io/blog/2016/gensim.html](https://codekansas.github.io/blog/2016/gensim.html) for information about using it to create embedding layers for neural networks.
* [https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) for information on sequence classification with keras
* [https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) for using pre-trained embeddings with keras (though the syntax they use for the model layers is different than most other tutorials).
* [https://keras.io/](https://keras.io/) Keras API documentation

## Load the trained word vectors

In [1]:
from gensim.models import word2vec

Load the trained model file into memory

In [2]:
wv_model = word2vec.Word2Vec.load('1billion_word_vectors')

Since we do not need to continue training the model, we can save memory by keeping the parts we need (the word vectors themselves) and getting rid of the rest of the model.

In [3]:
wordvec = wv_model.wv
del wv_model

## Exploration of word vectors
Now we can look at some of the relationships between different words.

Like [the gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html), let's start with a famous example: king + woman - man

In [None]:
wordvec.most_similar(positive=['king', 'woman'], negative=['man'])

This next one does not work as well as I'd hoped, but it gets close. Maybe you can find a better example.

In [None]:
wordvec.most_similar(positive=['panda', 'eucalyptus'], negative=['bamboo'])

Which one of these is not like the others?

Note: It looks like the gensim code needs to be updated to meet the requirements of later versions of numpy. You can ignore the warning.

In [None]:
wordvec.doesnt_match(['red', 'purple', 'laptop', 'turquoise', 'ruby'])

How far apart are different words?

In [None]:
wordvec.distances('laptop', ['computer', 'phone', 'rabbit'])

Let's see what one of these vectors actually looks like.

In [None]:
wordvec['textbook']

What other methods are available to us?

In [None]:
help(wordvec)

# Optional Exercise: Explore Word Vectors
What other interesting relationship can you find, using the methods used in the examples above or anything you find in the help message?

## Using the word vectors in an embedding layer of a Keras model

In [4]:
from keras.models import Sequential
import numpy

Using TensorFlow backend.


You may have noticed in the help text for wordvec that it has a built-in method for converting into a Keras embedding layer.

Since for this experimentation, we'll just be giving the embedding layer one word at a time, we can set the input length to 1.

In [None]:
test_embedding_layer = wordvec.get_keras_embedding()
test_embedding_layer.input_length = 1

In [None]:
embedding_model = Sequential()
embedding_model.add(test_embedding_layer)

But how do we actually use this? If you look at the [Keras Embedding Layer documentation](https://keras.io/layers/embeddings/) you might notice that it takes numerical input, not strings. How do we know which number corresponds to a particular word? In addition to having a vector, each word has an index:

In [None]:
wordvec.vocab['python'].index

Let's see if we get the same vector from the embedding layer as we get from our word vector object.

In [None]:
wordvec['python']

In [None]:
embedding_model.predict(numpy.array([[30438]]))

Looks good, right? But let's not waste our time when the computer could tell us definitively and quickly:

In [None]:
embedding_model.predict(numpy.array([[wordvec.vocab['python'].index]]))[0][0] == wordvec['python']

Now we have a way to turn words into word vectors with Keras layers. Yes! Time to get some data.

## The IMDB Dataset
The [IMDB dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) consists of movie reviews that have been marked as positive or negative. (There is also a built-in dataset of [Reuters newswires](https://keras.io/datasets/#reuters-newswire-topics-classification) that have been classified by topic.)

In [5]:
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data()

It looks like our labels consist of 0 or 1, which makes sense for positive and negative.

In [6]:
print(y_train[0:9])
print(max(y_train))
print(min(y_train))

[1 0 0 1 0 0 1 0 1]
1
0


But x is a bit more trouble. The words have already been converted to numbers -- numbers that have nothing to do with the word embeddings we spent time learning!

In [7]:
x_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

Looking at the help page for imdb, it appears there is a way to get the word back. Phew.

In [8]:
help(imdb)

Help on module keras.datasets.imdb in keras.datasets:

NAME
    keras.datasets.imdb - IMDB sentiment classification dataset.

FUNCTIONS
    get_word_index(path='imdb_word_index.json')
        Retrieves the dictionary mapping words to word indices.
        
        # Arguments
            path: where to cache the data (relative to `~/.keras/dataset`).
        
        # Returns
            The word index dictionary.
    
    load_data(path='imdb.npz', num_words=None, skip_top=0, maxlen=None, seed=113, start_char=1, oov_char=2, index_from=3, **kwargs)
        Loads the IMDB dataset.
        
        # Arguments
            path: where to cache the data (relative to `~/.keras/dataset`).
            num_words: max number of words to include. Words are ranked
                by how often they occur (in the training set) and only
                the most frequent words are kept
            skip_top: skip the top N most frequently occurring words
                (which may not be informative)

In [16]:
imdb_offset = 3
imdb_map = dict((index + imdb_offset, word) for (word, index) in imdb.get_word_index().items())
imdb_map[0] = 'PADDING'
imdb_map[1] = 'START'
imdb_map[2] = 'UNKNOWN'

The knowledge about the initial indices and offset came from [this stack overflow post](https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset) after I got gibberish when I tried to translate the first review, below. It looks coherent now!

In [17]:
' '.join([imdb_map[word_index] for word_index in x_train[0]])

"START this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and shou

## Train our IMDB word vectors
The word vectors from the 1 billion words dataset might work for us when trying to classify the IMDB data. Word vectors trained on the IMDB data itself might work better, though.

In [18]:
train_sentences = [['PADDING'] + [imdb_map[word_index] for word_index in review] for review in x_train]
test_sentences = [['PADDING'] + [imdb_map[word_index] for word_index in review] for review in x_test]

In [19]:
# min count says to put any word that appears at least once into the vocabulary
# size sets the dimension of the output vectors
imdb_wv_model = word2vec.Word2Vec(train_sentences + test_sentences + ['UNKNOWN'], min_count=1, size=100)

In [20]:
imdb_wordvec = imdb_wv_model.wv
del imdb_wv_model

## Process the dataset
For this exercise, we're going to keep all inputs the same length (we'll see how to do variable-length later). This means we need to choose a maximum length for the review, cutting off longer ones and adding padding to shorter ones. What should we make the length? Let's understand our data.

In [21]:
lengths = [len(review) for review in x_train + x_test]
print('Longest review: {} Shortest review: {}'.format(max(lengths), min(lengths)))


Longest review: 2697 Shortest review: 70


2697 words! Wow. Well, let's see how many reviews would get cut off at a particular cutoff.

In [22]:
cutoff = 500
print('{} reviews out of {} are over {}.'.format(
    sum([1 for length in lengths if length > cutoff]), 
    len(lengths), 
    cutoff))

8485 reviews out of 25000 are over 500.


In [23]:
from keras.preprocessing import sequence
x_train_padded = sequence.pad_sequences(x_train, maxlen=cutoff)
x_test_padded = sequence.pad_sequences(x_test, maxlen=cutoff)

## Classification without using the pre-trained word vectors

In [24]:
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, Dense, Flatten, Dropout
import numpy as np

Model definition. The embedding layer here learns the 100-dimensional vector embedding within the overall classification problem training. That is usually what we want, unless we have a bunch of un-tagged data that could be used to train word vectors but not a classification model.

In [34]:
not_pretrained_model = Sequential()
not_pretrained_model.add(Embedding(input_dim=len(imdb_map), output_dim=100, input_length=cutoff))
not_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
not_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
not_pretrained_model.add(Flatten())
not_pretrained_model.add(Dense(units=128, activation='relu'))
not_pretrained_model.add(Dense(units=1, activation='sigmoid')) # because at the end, we want one yes/no answer
not_pretrained_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])

Train the model. __This takes awhile. You might not want to re-run it.__

In [None]:
not_pretrained_model.fit(x_train_padded, y_train, epochs=1, batch_size=64)

Assess the model. __This takes awhile. You might not want to re-run it.__

In [None]:
not_pretrained_scores = not_pretrained_model.evaluate(x_test_padded, y_test)
print('loss: {} accuracy: {}'.format(*not_pretrained_scores))

## For any model that you try in these exercises, take notes about the performance you see and anything you notice about the differences between the models.

## Exercise Option 1: Use pretrained word vectors in a model (learn more about how to use word vector models and how to translate between data representations)
Using the details above about how the imdb dataset and the keras embedding layer represent words, define a model that uses the pre-trained word vectors from the imdb dataset rather than an embedding that keras learns as it goes along. You'll need to replace the embedding layer and feed in different training data.

## Exercise Option 2: Use pretrained word vectors in a model (learn more about how to use word vector models and how to translate between data representations)
Same as option 1, but try using the 1billion vector word embeddings instead of the imdb vectors. If you also did option 1, comment on how the performance changes.

Think about what you would expect to happen if you used the pretrained word vectors on the same model and trained it for the same amount of time. Would you expect the accuracy to be much better, slightly better, the same, slightly worse, or much worse? 
For the pretrained word vectors, since they are pretrained, when the model trains for the same amount of time, I expect the accuracy to be slightly better, since the model is able to get a head-start with the pretrained vectors. The model will only do slightly better since the non-pretrained model already achieves high accuracy. 

What actually happens, and why?
For both the imdb word vector, and the billion word vector, the pre-trained word vector model did worse. The billion word vector had lower accuracy than the imdb word vector. This suggests that pre-training is actually a detriment to the model when it is trying to learn something new, i.e. figure out whether the review is positive or not. The billion word vector probably did worse because it had even more training done on it that the model needed to revert, and the billion word vector had less words relevant to movie reviews. 

In [39]:
len(imdb_wordvec.vocab)

88591

In [40]:
test_embedding_layer = imdb_wordvec.get_keras_embedding()
test_embedding_layer.input_length = cutoff
imdb_embedding_model = Sequential()
imdb_embedding_model.add(test_embedding_layer)
imdb_embedding_model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
imdb_embedding_model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
imdb_embedding_model.add(Dropout(0.35))
imdb_embedding_model.add(Flatten())
imdb_embedding_model.add(Dense(units=128, activation='relu'))
imdb_embedding_model.add(Dense(units=1, activation='sigmoid')) # because at the end, we want one yes/no answer
imdb_embedding_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])

In [41]:
imdb_embedding_model.fit(x_train_padded, y_train, epochs=5, batch_size=64, validation_data = [x_test_padded, y_test])

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0x132f30b50>

In [43]:
trained_scores = imdb_embedding_model.evaluate(x_test_padded, y_test)
print('loss: {} accuracy: {}'.format(*trained_scores))

loss: 0.5217968887519836 accuracy: 0.7560799717903137


In [42]:
#Exercise 2
embedding_layer = wordvec.get_keras_embedding()
embedding_layer.input_length = cutoff
billion_embedding_model = Sequential()
#Embedding(input_dim=word_vector_length, output_dim=100, weights= [billion_wv_matrix], input_length=cutoff)
billion_embedding_model.add(embedding_layer)
billion_embedding_model.add(Conv1D(filters=16, kernel_size=5, activation='relu'))
billion_embedding_model.add(Conv1D(filters=16, kernel_size=5, activation='relu'))
billion_embedding_model.add(Dropout(0.35))
billion_embedding_model.add(Flatten())
billion_embedding_model.add(Dense(units=128, activation='relu'))
billion_embedding_model.add(Dense(units=1, activation='sigmoid')) # because at the end, we want one yes/no answer
billion_embedding_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])

In [44]:
billion_embedding_model.fit(x_train_padded, y_train, epochs=5, batch_size=64, validation_data = [x_test_padded, y_test])


Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0x133b003d0>

In [45]:
trained_scores = billion_embedding_model.evaluate(x_test_padded, y_test)
print('loss: {} accuracy: {}'.format(*trained_scores))

loss: 0.5777768155670167 accuracy: 0.6985200047492981
