<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_02_word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 11: Natural Language Processing and Speech Recognition**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 11 Material

* Part 11.1: Getting Started with Spacy in Python [[Video]](https://www.youtube.com/watch?v=A5BtU9vXzu8&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_11_01_spacy.ipynb)
* **Part 11.2: Word2Vec and Text Classification** [[Video]](https://www.youtube.com/watch?v=nWxtRlpObIs&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_11_02_word2vec.ipynb)
* Part 11.3: What are Embedding Layers in Keras [[Video]](https://www.youtube.com/watch?v=OuNH5kT-aD0&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_11_03_embedding.ipynb)
* Part 11.4: Natural Language Processing with Spacy and Keras [[Video]](https://www.youtube.com/watch?v=BKgwjhao5DU&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_11_04_text_nlp.ipynb)
* Part 11.5: Learning English from Scratch with Keras and TensorFlow [[Video]](https://www.youtube.com/watch?v=Y1khuuSjZzc&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN&index=58) [[Notebook]](t81_558_class_11_05_english_scratch.ipynb)

# Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [3]:
try:
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

Note: using Google CoLab


# Part 11.2: Word2Vec and Text Classification

Word2vec is a group of related models that data scientists use to produce word embeddings, which are numeric representations of words. For example, a word embedding lexicon may provide a 100-number vector for each word in the English dictionary. Word2vec is one such embedding.

Word2vec is implemented by shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. Word2vec takes a large corpus of text as its input and produces a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a corresponding vector. Words that appear in similar contexts end up with similar vectors in this high-dimensional space.[[Cite:mikolov2013efficient]](https://arxiv.org/abs/1301.3781)
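
To make "reconstructing linguistic contexts" more concrete, the following is a minimal sketch of how (target, context) training pairs might be generated for the skip-gram variant of word2vec. The function name `skipgram_pairs` and the window size are illustrative assumptions, not part of Gensim or the original implementation, and real training adds steps such as subsampling and negative sampling that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical example sentence; any tokenized text would work.
sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

A network trained to predict the context word from the target word must give similar internal representations to words that share contexts, which is where the embedding vectors come from.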

### Suggested Software for Word2Vec

The following URLs provide useful software and data for working with Word2vec.

* [GoogleNews Vectors](https://code.google.com/archive/p/word2vec/), [GitHub Mirror](https://github.com/mmihaltz/word2vec-GoogleNews-vectors)
* [Python Gensim](https://radimrehurek.com/gensim/)

This part uses the Python package Gensim to work with word2vec vectors. Gensim needs a pretrained embedding lookup table to work from. The following code downloads the GoogleNews table; the compressed file is large, so the download can take several minutes.

In [5]:
from tensorflow.keras.utils import get_file

# Download the pretrained vectors; get_file caches them under ~/.keras/datasets.
try:
    path = get_file(
        'GoogleNews-vectors-negative300.bin.gz',
        origin='https://s3.amazonaws.com/dl4j-distribution/'
               'GoogleNews-vectors-negative300.bin.gz')
except Exception:
    print('Error downloading')
    raise

print(path)


Downloading data from https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
/root/.keras/datasets/GoogleNews-vectors-negative300.bin.gz


The following code loads the vector lookup tables and prepares Gensim for use.

In [6]:
import gensim

# Note that `path` refers to the file downloaded by get_file above.
# If you downloaded the GoogleNews vectors manually (see the suggested
# software above), point `path` at that file instead.
model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)

Word2vec maps each word to a vector. The GoogleNews table uses 300-number vectors, as the vector for the word "hello" shows.

In [None]:
w = model['hello']

In [None]:
print(len(w))

In [None]:
print(w)

The code below computes the Euclidean distance between the vectors for two words.

In [None]:
import numpy as np

w1 = model['king']
w2 = model['queen']

dist = np.linalg.norm(w1-w2)

print(dist)

This shows the classic word2vec equation of **queen = (king - man) + woman**

In [None]:
model.most_similar(positive=['woman', 'king'], negative=['man'])
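
The arithmetic behind this query can be sketched with a handful of made-up vectors. The 2-D embeddings below are hypothetical (one axis loosely stands for "royalty", the other for "gender"), and the helper `analogy` is an illustrative name rather than a Gensim function. Gensim's own `most_similar` differs in details, for example it works with length-normalized vectors, so treat this only as an outline of the idea.

```python
import math

# Hypothetical toy embeddings, invented for this illustration.
toy = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, positive, negative):
    """Add the positive vectors, subtract the negative ones,
    and return the closest remaining word by cosine similarity."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + v for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v for q, v in zip(query, vectors[word])]
    exclude = set(positive) | set(negative)  # do not return an input word
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy(toy, positive=["woman", "king"], negative=["man"])
print(result)
```

Excluding the input words matters: without that step the nearest vector to the query is often one of the query words themselves.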


The following code shows which item does not belong with the others.

In [None]:
model.doesnt_match("house garage store dog".split())
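
One way to think about this query is that the odd word is the one least similar to the average of the group. The sketch below applies that idea to hypothetical 2-D vectors; the helper names `unit` and `odd_one_out` and the toy values are assumptions made for illustration, and Gensim's actual implementation may differ in its details.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector has the smallest
    dot product with the mean of all the unit vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical toy embeddings: three "building" words and one animal.
toy = {
    "house":  [0.90, 0.10],
    "garage": [0.80, 0.20],
    "store":  [0.85, 0.15],
    "dog":    [0.10, 0.90],
}

odd = odd_one_out(toy, "house garage store dog".split())
print(odd)
```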


The following code shows the cosine similarity between two words.

In [None]:
model.similarity('iphone', 'android')
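
Gensim's `similarity` reports cosine similarity, which measures the angle between two vectors rather than the straight-line distance computed earlier with `np.linalg.norm`. The short sketch below contrasts the two measures on made-up vectors; the values and the helper name `cosine` are illustrative assumptions, and plain Python is used here only to keep the example free of dependencies.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: b points the same way as a but is twice as long.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
c = [3.0, 2.0, 1.0]

sim_ab = cosine(a, b)       # same direction, so the similarity is maximal
dist_ab = math.dist(a, b)   # yet the Euclidean distance is not zero
sim_ac = cosine(a, c)       # different direction, lower similarity

print(sim_ab, dist_ab, sim_ac)
```

Because cosine similarity ignores vector length, it is usually the more convenient measure for comparing word vectors, whose magnitudes can vary for reasons unrelated to meaning.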

The following code shows which words are most similar to the given one.

In [None]:
model.most_similar('dog')