In [19]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
import gensim
from nltk.corpus import gutenberg, stopwords
from gensim.models import word2vec



In [2]:
# Utility function to clean text.
def text_cleaner(text):
    
    # Visual inspection shows spaCy does not recognize the double dash '--'.
    # Better get rid of it now!
    text = re.sub(r'--',' ',text)
    
    # Get rid of headings in square brackets.
    text = re.sub(r"\[.*?\]", "", text)
    
    # Get rid of chapter titles.
    text = re.sub(r'Chapter \d+','',text)
    
    # Get rid of extra whitespace.
    text = ' '.join(text.split())
    
    return text


# Import all the Austen in the Project Gutenberg corpus.
austen = ""
for novel in ['persuasion','emma','sense']:
    work = gutenberg.raw('austen-' + novel + '.txt')
    austen = austen + work

# Clean the data.
austen_clean = text_cleaner(austen)
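
A quick way to see what the cleaner does is to run it on a small made-up snippet. The sketch below repeats the same steps so that it runs on its own; the `sample` string is invented for illustration and is not taken from the corpus.

```python
import re

def text_cleaner(text):
    # Same steps as the utility above, repeated so this snippet is self-contained.
    text = re.sub(r'--', ' ', text)           # double dashes become spaces
    text = re.sub(r'\[.*?\]', '', text)       # drop headings in square brackets
    text = re.sub(r'Chapter \d+', '', text)   # drop chapter titles like "Chapter 1"
    return ' '.join(text.split())             # collapse runs of whitespace

# Invented sample text, for illustration only.
sample = "[Persuasion by Jane Austen 1818]\n\nChapter 1\n\nSir Walter Elliot--of Kellynch Hall--was a man."
print(text_cleaner(sample))
```

Note that the chapter pattern is case-sensitive and only matches Arabic numerals, so a heading written as `CHAPTER I` would pass through untouched.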

In [12]:
# Parse the data. This can take some time.
nlp = spacy.load('en_core_web_sm')
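# spaCy rejects texts longer than nlp.max_length (1,000,000 characters by default),
# so parse the corpus in two slices and combine the sentences afterwards.
# Note that anything past the two-millionth character is not parsed.
austen_doc = nlp(austen_clean[:1000000])
austen_doc2 = nlp(austen_clean[1000000:2000000])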
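
Slicing at fixed character offsets can cut a word in half at the boundary. One possible alternative, sketched here under the assumption that the text is whitespace-separated, is a small helper that breaks at the last space before the limit. The name `split_text` and its `max_chars` parameter are hypothetical, not part of spaCy.

```python
def split_text(text, max_chars=1_000_000):
    """Split text into chunks of at most max_chars, breaking at spaces where possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so a word is not cut in half.
            space = text.rfind(' ', start, end)
            if space > start:
                end = space
        chunks.append(text[start:end])
        # Skip the space we split on, if there was one.
        start = end + 1 if end < len(text) and text[end] == ' ' else end
    return chunks

demo = "one two three four five"
print(split_text(demo, max_chars=9))
```

With a helper like this, each chunk could be passed to `nlp` in turn and the resulting sentences collected into one list, however long the corpus is.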

In [17]:
# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
sentences = []
for doc in (austen_doc, austen_doc2):
    for sentence in doc.sents:
        sentences.append([
            token.lemma_.lower()
            for token in sentence
            if not token.is_stop
            and not token.is_punct
        ])

In [20]:
model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=300,      # Word vector length (renamed `vector_size` in gensim 4.0).
    hs=1           # Use hierarchical softmax.
)

In [21]:
# List of words in model.
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))

# Similarity is the cosine of the angle between two word vectors:
# 1 means they point the same way, 0 means they are unrelated,
# and negative values are possible.
print(model.wv.similarity('loud', 'aloud'))
print(model.wv.similarity('mr', 'mrs'))

# One of these things is not like the other...
# (Call this on model.wv; calling it on the model itself is deprecated.)
print(model.wv.doesnt_match("breakfast marriage dinner lunch".split()))

[('people', 0.651031494140625), ('settle', 0.5177574157714844), ('daughter', 0.4962022006511688), ('england', 0.4935835301876068), ('thousand', 0.4912043809890747), ('officer', 0.49020665884017944), ('introduction', 0.4836514890193939), ('mr', 0.48149317502975464), ('way', 0.47175687551498413), ('advantage', 0.4534520208835602)]
0.8136536695330048
0.21349597060459458


marriage
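
The similarity scores above are cosines. As a minimal sketch of what that means, here is the calculation in plain Python on made-up two-dimensional vectors (real word vectors have hundreds of dimensions, and `cosine_similarity` is a name chosen for this illustration, not a gensim function).

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Vector length does not matter, only direction, which is why the first pair scores as identical even though one vector is twice as long.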


## Drill 0

Take a few minutes to modify the hyperparameters of this model and see how its answers change. Can you wrangle any improvements?

In [24]:
model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=15,     # Number of words around target word to consider.
    sg=1,          # Use skip-gram instead of CBOW.
    sample=1e-4,   # Penalize frequent words.
    size=100,      # Word vector length (renamed `vector_size` in gensim 4.0).
    hs=1           # Use hierarchical softmax.
)

In [25]:
# List of words in model.
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))

# Similarity is the cosine of the angle between two word vectors:
# 1 means they point the same way, 0 means they are unrelated,
# and negative values are possible.
print(model.wv.similarity('loud', 'aloud'))
print(model.wv.similarity('mr', 'mrs'))

# One of these things is not like the other...
# (Call this on model.wv; calling it on the model itself is deprecated.)
print(model.wv.doesnt_match("breakfast marriage dinner lunch".split()))

[('eld', 0.9272082448005676), ('young', 0.9168800115585327), ('john', 0.9056689739227295), ('law', 0.9054642915725708), ('place', 0.9035821557044983), ('valuable', 0.8996951580047607), ('property', 0.8985298871994019), ('musical', 0.8956568837165833), ('donwell', 0.8946554064750671), ('consequence', 0.8913477063179016)]
0.9836123881218409
0.8498593800834731


marriage
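
The `most_similar(positive=..., negative=...)` query is vector arithmetic followed by a nearest-neighbor search. The sketch below illustrates the idea on invented three-dimensional vectors. It is a simplification: gensim normalizes the vectors before combining them, which this version skips, and `analogy` and `toy_vectors` are hypothetical names.

```python
import math

# Invented toy vectors, for illustration only.
toy_vectors = {
    'man':   [1.0, 0.0, 0.2],
    'woman': [1.0, 1.0, 0.2],
    'king':  [1.0, 0.0, 1.0],
    'queen': [1.0, 1.0, 1.0],
    'apple': [-1.0, 0.3, -0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    # Add the positive vectors and subtract the negative ones.
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # Rank every word that was not part of the query by cosine similarity.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy(['king', 'woman'], ['man'], toy_vectors))
```

On a corpus as small as three novels the learned vectors are noisy, so it is not surprising that the real query above returns neighbors that are hard to interpret.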


### Hyperparameter results
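The results changed significantly when we altered the hyperparameters: the similarity for 'loud' and 'aloud' rose from 0.81 to 0.98, the similarity for 'mr' and 'mrs' rose from 0.21 to 0.85, and the analogy query returned a different set of neighbors. 'marriage' was still picked as the odd word out. We changed:

- Word vector length (`size`) reduced from 300 to 100
- `sg` changed from 0 (CBOW) to 1 (skip-gram)
- Window increased from 6 to 15
- Subsampling threshold (`sample`) lowered from 1e-3 to 1e-4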
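
To make the `window` setting concrete, the sketch below lists the (target, context) pairs that a given window produces for one short sentence. It is an idealized illustration: real implementations add details such as subsampling of frequent words, which are ignored here, and `context_pairs` is a hypothetical helper rather than a gensim function.

```python
def context_pairs(tokens, window):
    """Return every (target, context) pair within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Invented lemmatized sentence, for illustration only.
sentence = ['anne', 'walk', 'garden', 'morning']
print(len(context_pairs(sentence, window=1)))
print(len(context_pairs(sentence, window=3)))
```

A wider window pairs each word with more distant neighbors, which tends to pull together words that share a topic rather than words that are strictly interchangeable. Skip-gram predicts each context word from the target, while CBOW predicts the target from its combined context.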

# Example word2vec applications

You can use the vectors from word2vec as features in other models, or try to gain insight from the vector compositions themselves.
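
One common way to turn word vectors into features for another model is to average the vectors of the words in a document. The following is a minimal sketch with invented two-dimensional vectors; `average_vector` is a hypothetical helper, and with a trained model the lookup table would come from the model's word vectors instead of the `toy` dictionary.

```python
def average_vector(tokens, vectors, dims):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known words: fall back to a zero vector of the right length.
        return [0.0] * dims
    return [sum(column) / len(known) for column in zip(*known)]

# Invented toy vectors, for illustration only.
toy = {'good': [1.0, 0.0], 'dinner': [0.0, 1.0]}
print(average_vector(['good', 'dinner', 'unknown'], toy, dims=2))
```

The resulting fixed-length vector can then be fed to a classifier or clustering algorithm like any other numeric feature row.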

Here are some neat things people have done with word2vec:

 * [Visualizing word embeddings in Jane Austen's Pride and Prejudice](http://blogger.ghostweather.com/2014/11/visualizing-word-embeddings-in-pride.html). Skip to the bottom to see a _truly honest_ account of this data scientist's process.

 * [Tracking changes in Dutch Newspapers' associations with words like 'propaganda' and 'alien' from 1950 to 1990](https://www.slideshare.net/MelvinWevers/concepts-through-time-tracing-concepts-in-dutch-newspaper-discourse-using-sequential-word-vector-spaces).

 * [Helping customers find clothing items similar to a given item but differing on one or more characteristics](http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/).

## Drill 1: Word2Vec on 100B+ words

However you access it, play around with a pretrained model. Is there anything interesting you're able to pull out about analogies, similar words, or words that don't match? Write up a quick note about your tinkering and discuss it with your mentor during your next session.

In [26]:
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('../../../backup/GoogleNews-vectors-negative300.bin', binary=True)

In [40]:
# `model` is already a KeyedVectors object, so the deprecated `.wv` indirection is not needed.
print(model.similarity('mr', 'mrs'))
print(model.similarity('loud', 'aloud'))
print(model.most_similar_to_given('wonderful', ['lacking', 'nice', 'good', 'perpetual']))
print(model.distance('what', 'govern'))
print(model.words_closer_than('wonderful', 'ecstatic'))
print(model.similarity('wonderful', 'great'))
print(len(model.vocab))

0.660988288657353
0.38398625062987224


nice


0.7885902371434435
['good', 'really', 'great', 'love', 'happy', 'fun', 'unique', 'nice', 'perfect', 'enjoy', 'excited', 'proud', 'interesting', 'excellent', 'exciting', 'truly', 'beautiful', 'amazing', 'loved', 'tremendous', 'fantastic', 'liked', 'appreciate', 'sad', 'delighted', 'glad', 'incredible', 'extraordinary', 'remarkable', 'terrible', 'loves', 'brilliant', 'thrilled', 'exceptional', 'grateful', 'fortunate', 'joy', 'awesome', 'loving', 'memorable', 'terrific', 'superb', 'unbelievable', 'horrible', 'weird', 'finest', 'lovely', 'blessed', 'thankful', 'fascinating', 'rewarding', 'inspiring', 'phenomenal', 'fabulous', 'delicious', 'pleasant', 'enjoyable', 'magical', 'neat', 'magnificent', 'Wow', 'worthwhile', 'gorgeous', 'charming', 'refreshing', 'glorious', 'pleasing', 'delightful', 'breathtaking', 'appreciative', 'marvelous', 'unforgettable', 'wonderfully', 'bittersweet', 'gracious', 'gratifying', 'priceless', 'humbling', 'splendid', 'cherish', 'happiest', 'nicer', 'exhilarating'

### More powerful model = More fun
The pre-trained Google News model has far more data and compute behind it than the Austen models above. Is that valuable? I say yes.

It captures a moderate similarity between 'mr' and 'mrs' (0.66) and between 'loud' and 'aloud' (0.38). Neither pair is treated as unrelated, which seems right, but 'loud' and 'aloud' no longer sit nearly on top of each other as they did in the Austen models (0.81 and 0.98), and 'mr' and 'mrs' land between the two Austen scores (0.21 and 0.85).

A long list of words comes back as closer to 'wonderful' than 'ecstatic' is, which shows how densely populated the model is. Its vocabulary holds about three million words and phrases. If this is somewhat outdated, I would love to see something more cutting-edge!
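
For reference, `words_closer_than(w1, w2)` returns the words that are closer to `w1` than `w2` is. Here is a minimal sketch of that logic on invented two-dimensional vectors; the function below is a plain-Python stand-in written for illustration, not gensim's implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def words_closer_than(w1, w2, vectors):
    # Use the similarity between w1 and w2 as the cut-off.
    threshold = cosine(vectors[w1], vectors[w2])
    return [
        w for w in vectors
        if w != w1 and cosine(vectors[w1], vectors[w]) > threshold
    ]

# Invented toy vectors, for illustration only.
toy = {
    'wonderful': [1.0, 0.1],
    'great':     [0.9, 0.2],
    'ecstatic':  [0.6, 0.6],
    'table':     [-0.2, 1.0],
}
print(words_closer_than('wonderful', 'ecstatic', toy))
```

In a three-million-entry vocabulary, even a fairly high cut-off leaves room for dozens of words, which is why the real query returns such a long list.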