In [61]:
import nltk
import re # for preprocessing 
import pandas as pd
import numpy as np
from time import time # To time the operations
from collections import defaultdict # for word frequency
import string
import spacy
from nltk.corpus import stopwords
import logging  # set up logging to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt='%H:%M:%S', level=logging.INFO)

In [66]:
spacy.cli.download("en")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/Users/luoyifeng/opt/anaconda3/envs/tf2/lib/python3.7/site-packages/en_core_web_sm
-->
/Users/luoyifeng/opt/anaconda3/envs/tf2/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [46]:
df = pd.read_csv('./data/simpsons_dataset.csv')
print(df.columns)
df.head()

Index(['raw_character_text', 'spoken_words'], dtype='object')


Unnamed: 0,raw_character_text,spoken_words
0,Miss Hoover,"No, actually, it was a little of both. Sometim..."
1,Lisa Simpson,Where's Mr. Bergstrom?
2,Miss Hoover,I don't know. Although I'd sure like to talk t...
3,Lisa Simpson,That life is worth living.
4,Edna Krabappel-Flanders,The polls will be open from now until the end ...


In [47]:
df.shape

(158314, 2)

In [48]:
df.isnull().sum()

raw_character_text    17814
spoken_words          26459
dtype: int64

In [69]:
df = df.dropna().reset_index(drop=True)
df.isnull().sum()

raw_character_text    0
spoken_words          0
dtype: int64

In [67]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

In [68]:
def cleaning(doc):
    """Lemmatize a spaCy Doc, drop stop words, and return the result as one string.

    Returns None for very short sentences so that they can be dropped later.
    """
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # Word2Vec learns a target word's vector from its context words,
    # so a sentence of only one or two words adds very little to training.
    if len(txt) > 2:
        return ' '.join(txt)
    return None

In [70]:
brief_cleaning = (re.sub("[^A-Za-z]+", ' ', str(row)).lower() for row in df['spoken_words'])
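
To see what this first pass does, here is a minimal sketch that applies the same regular expression to two lines from the `df.head()` preview above. Every run of non-letter characters, apostrophes included, collapses to a single space, which is why contraction fragments such as `s` end up as standalone tokens. The helper name `brief_clean` is only for illustration and is not part of the notebook.

```python
import re

def brief_clean(text):
    """Replace every run of non-letter characters with one space, then lowercase."""
    return re.sub("[^A-Za-z]+", " ", str(text)).lower()

# Two lines taken from the dataset preview shown earlier
samples = ["Where's Mr. Bergstrom?", "That life is worth living."]
cleaned = [brief_clean(s) for s in samples]

for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

Note the trailing space left by the final punctuation mark; it disappears later when the text is tokenized.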

In [71]:
t = time()
txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000, n_threads=-1)]

print('Time to clean up everything: {} mins'.format(round((time() - t)/ 60, 2)))

Time to clean up everything: 0.83 mins


In [72]:
df_clean = pd.DataFrame({'clean': txt})
df_clean = df_clean.dropna().drop_duplicates()
df_clean.shape

(92173, 1)

We use Gensim's Phrases model to automatically detect common phrases (bigrams) in a list of sentences: https://radimrehurek.com/gensim/models/phrases.html

The main reason is to capture multi-word names such as "mr_burns" or "bart_simpson" as single tokens.
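
As a rough illustration of how such bigrams can be detected, the sketch below scores each adjacent word pair by how much more often it occurs than its parts would suggest, in the spirit of the default scoring used by Phrases. This is a simplified, standard-library approximation: the function `score_bigrams`, the toy corpus, and the `threshold` value are all assumptions for the example, and Gensim's own implementation differs in detail (for instance in how it counts the vocabulary size).

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs: (count_ab - min_count) / count_a / count_b * vocab_size."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)  # simplification: distinct single words only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab >= min_count:
            scores[(a, b)] = (n_ab - min_count) / unigrams[a] / unigrams[b] * vocab_size
    return scores

# Toy corpus, made up for the example
toy = [
    ["mr", "burns", "is", "rich"],
    ["mr", "burns", "is", "old"],
    ["bart", "is", "old"],
]
scores = score_bigrams(toy, min_count=1)

threshold = 1.2  # hypothetical cut-off chosen for this toy corpus
phrases_found = ["_".join(pair) for pair, s in scores.items() if s > threshold]
print(phrases_found)
```

Only the pair ("mr", "burns") clears the cut-off here, because "is" is frequent on its own and therefore pulls down the score of every pair it belongs to.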

In [74]:
import sys
!{sys.executable} -m pip install gensim

Collecting gensim
  Downloading gensim-3.8.3-cp37-cp37m-macosx_10_9_x86_64.whl (24.2 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-4.1.2-py3-none-any.whl (111 kB)
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.8.3 smart-open-4.1.2


In [75]:
from gensim.models.phrases import Phrases, Phraser

In [76]:
sent = [row.split() for row in df_clean['clean']]

In [77]:
phrases = Phrases(sent, min_count=30, progress_per=10000)

INFO - 00:13:36: collecting all words and their counts
INFO - 00:13:36: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 00:13:36: PROGRESS: at sentence #10000, processed 67820 words and 51066 word types
INFO - 00:13:36: PROGRESS: at sentence #20000, processed 141206 words and 96869 word types
INFO - 00:13:36: PROGRESS: at sentence #30000, processed 208928 words and 133765 word types
INFO - 00:13:36: PROGRESS: at sentence #40000, processed 271322 words and 166656 word types
INFO - 00:13:36: PROGRESS: at sentence #50000, processed 335262 words and 199086 word types
INFO - 00:13:36: PROGRESS: at sentence #60000, processed 402266 words and 232305 word types
INFO - 00:13:36: PROGRESS: at sentence #70000, processed 469471 words and 265025 word types
INFO - 00:13:36: PROGRESS: at sentence #80000, processed 536207 words and 296899 word types
INFO - 00:13:37: PROGRESS: at sentence #90000, processed 604295 words and 327277 word types
INFO - 00:13:37: collected 333517 word typ

In [78]:
bigram = Phraser(phrases)

INFO - 00:13:58: source_vocab length 333517
INFO - 00:14:01: Phraser built with 136 phrasegrams


In [79]:
sentences = bigram[sent]

In [80]:
# Most frequent words
word_freq = defaultdict(int)
for sentence in sentences:  # avoid reusing the name `sent`, which holds the token lists
    for word in sentence:
        word_freq[word] += 1
len(word_freq)

30248

In [81]:
sorted(word_freq, key=word_freq.get, reverse=True)[:10]

['s', 'm', 'oh', 'don_t', 'will', 'like', 'know', 'hey', 'think', 'right']

### Training the model

In [82]:
import multiprocessing

In [83]:
from gensim.models import Word2Vec

Why I separate the training of the model into 3 steps:
I prefer to separate the training into 3 distinct steps for clarity and monitoring.

1. Word2Vec():
In this first step, I set up the parameters of the model one by one.
I purposefully do not supply the sentences parameter, which leaves the model uninitialized.

2. build_vocab():
This step builds the vocabulary from a sequence of sentences and thus initializes the model.
With logging enabled, I can follow the progress and, more importantly, the effect of min_count and sample on the word corpus. I noticed that these two parameters, and sample in particular, have a great influence on the performance of the model. Displaying both makes their influence easier to manage accurately.

3. train():
Finally, this step trains the model.
The logs here are mainly useful for monitoring, making sure that no threads are executed instantaneously.

In [85]:
cores = multiprocessing.cpu_count()
cores

8

**The parameters:**
- `min_count` = int - Ignores all words with total absolute frequency lower than this - (2, 100)
- `window` = int - The maximum distance between the current and predicted word within a sentence. E.g. `window` words on the left and `window` words on the right of our target - (2, 10)
- `size` = int - Dimensionality of the feature vectors. - (50, 300)
- `sample` = float - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influential. - (0, 1e-5)
- `alpha` = float - The initial learning rate - (0.01, 0.05)
- `min_alpha` = float - Learning rate will linearly drop to min_alpha as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00
- `negative` = int - If > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn. If set to 0, no negative sampling is used. - (5, 20)
- `workers` = int - Use this many worker threads to train the model (=faster training with multicore machines)
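
Since `sample` is so influential, it may help to see roughly how it acts. The sketch below follows the subsampling rule from the original word2vec code, which Gensim also appears to use: each occurrence of a word is kept with a probability that shrinks as the word becomes more frequent relative to `sample`. The function name `keep_probability` and the corpus size are assumptions made for the example.

```python
import math

def keep_probability(word_count, total_words, sample):
    """Probability that one occurrence of a word is kept during training."""
    if sample <= 0:
        return 1.0  # subsampling disabled
    threshold = sample * total_words
    prob = (math.sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(prob, 1.0)  # rare words are always kept

total = 1_000_000  # hypothetical corpus size, not the Simpsons corpus
for count in (50, 1_000, 50_000):
    print(count, round(keep_probability(count, total, sample=6e-5), 3))
```

Rare words are always kept, while very frequent words lose most of their occurrences. This is consistent with a training log that reports far fewer "effective words" than raw words.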

In [86]:
w2v_model = Word2Vec(min_count=20,
                     window=2,
                     size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)
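
The `alpha` and `min_alpha` values above define a linear learning-rate schedule. Here is a minimal sketch of that schedule, assuming the rate is interpolated by the fraction of training completed. As far as I understand, Gensim updates the rate per batch of work rather than once per epoch, so the per-epoch values printed here are only an approximation, and `linear_alpha` is a hypothetical helper.

```python
def linear_alpha(progress, alpha=0.03, min_alpha=0.0007):
    """Learning rate after a given fraction (0.0 to 1.0) of training is done."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return alpha - (alpha - min_alpha) * progress

epochs = 30
for epoch in (0, 10, 20, 30):
    print(epoch, round(linear_alpha(epoch / epochs), 5))
```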

**Building the Vocabulary Table:**


Word2Vec requires us to build the vocabulary table (simply digesting all the words and filtering out the unique words, and doing some basic counts on them):

In [87]:
t = time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

INFO - 00:21:56: collecting all words and their counts
INFO - 00:21:56: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 00:21:56: PROGRESS: at sentence #10000, processed 64401 words, keeping 9172 word types
INFO - 00:21:56: PROGRESS: at sentence #20000, processed 134240 words, keeping 14076 word types
INFO - 00:21:56: PROGRESS: at sentence #30000, processed 198733 words, keeping 17077 word types
INFO - 00:21:57: PROGRESS: at sentence #40000, processed 258328 words, keeping 19781 word types
INFO - 00:21:57: PROGRESS: at sentence #50000, processed 319396 words, keeping 22103 word types
INFO - 00:21:57: PROGRESS: at sentence #60000, processed 383343 words, keeping 24333 word types
INFO - 00:21:57: PROGRESS: at sentence #70000, processed 447667 words, keeping 26389 word types
INFO - 00:21:57: PROGRESS: at sentence #80000, processed 511491 words, keeping 28323 word types
INFO - 00:21:57: PROGRESS: at sentence #90000, processed 576428 words, keeping 29944 word types


Time to build vocab: 0.04 mins
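
The counting pass above tracks close to 30,000 word types, whereas the training log further down reports a vocabulary of only 3,381 words. That gap is presumably the effect of `min_count=20`. A minimal sketch of the pruning idea, using a toy corpus and a hypothetical helper `prune_vocab` (and assuming that words are kept when their count is at least `min_count`):

```python
from collections import Counter

def prune_vocab(sentences, min_count):
    """Count every word, then keep only those seen at least min_count times."""
    counts = Counter(word for tokens in sentences for word in tokens)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return kept, len(counts)

# Toy corpus, made up for the example
toy = [
    ["homer", "eat", "donut"],
    ["homer", "love", "donut"],
    ["bart", "skateboard"],
]
kept, seen = prune_vocab(toy, min_count=2)
print(seen, "word types seen,", len(kept), "kept:", sorted(kept))
```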


**Training of the model:**


Parameters of the training:

- total_examples = int - Count of sentences;
- epochs = int - Number of iterations (epochs) over the corpus - [10, 20, 30]

In [88]:
t = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

INFO - 00:22:50: training model with 7 workers on 3381 vocabulary and 300 features, using sg=0 hs=0 sample=6e-05 negative=20 window=2
INFO - 00:22:52: EPOCH 1 - PROGRESS: at 49.22% examples, 103730 words/s, in_qsize 0, out_qsize 0
INFO - 00:22:52: worker thread finished; awaiting finish of 6 more threads
INFO - 00:22:52: worker thread finished; awaiting finish of 5 more threads
INFO - 00:22:52: worker thread finished; awaiting finish of 4 more threads
INFO - 00:22:52: worker thread finished; awaiting finish of 3 more threads
INFO - 00:22:52: worker thread finished; awaiting finish of 2 more threads
INFO - 00:22:52: worker thread finished; awaiting finish of 1 more threads
INFO - 00:22:52: worker thread finished; awaiting finish of 0 more threads
INFO - 00:22:52: EPOCH - 1 : training on 591008 raw words (218217 effective words) took 1.9s, 113769 effective words/s
INFO - 00:22:53: EPOCH 2 - PROGRESS: at 59.38% examples, 122328 words/s, in_qsize 0, out_qsize 0
INFO - 00:22:54: worker thre

INFO - 00:23:13: worker thread finished; awaiting finish of 3 more threads
INFO - 00:23:13: worker thread finished; awaiting finish of 2 more threads
INFO - 00:23:13: worker thread finished; awaiting finish of 1 more threads
INFO - 00:23:13: worker thread finished; awaiting finish of 0 more threads
INFO - 00:23:13: EPOCH - 11 : training on 591008 raw words (218536 effective words) took 2.4s, 90200 effective words/s
INFO - 00:23:14: EPOCH 12 - PROGRESS: at 52.62% examples, 113007 words/s, in_qsize 0, out_qsize 0
INFO - 00:23:15: worker thread finished; awaiting finish of 6 more threads
INFO - 00:23:15: worker thread finished; awaiting finish of 5 more threads
INFO - 00:23:15: worker thread finished; awaiting finish of 4 more threads
INFO - 00:23:15: worker thread finished; awaiting finish of 3 more threads
INFO - 00:23:15: worker thread finished; awaiting finish of 2 more threads
INFO - 00:23:15: EPOCH 12 - PROGRESS: at 98.38% examples, 106477 words/s, in_qsize 1, out_qsize 1
INFO - 00:

INFO - 00:23:35: worker thread finished; awaiting finish of 6 more threads
INFO - 00:23:35: worker thread finished; awaiting finish of 5 more threads
INFO - 00:23:35: worker thread finished; awaiting finish of 4 more threads
INFO - 00:23:35: worker thread finished; awaiting finish of 3 more threads
INFO - 00:23:35: worker thread finished; awaiting finish of 2 more threads
INFO - 00:23:35: worker thread finished; awaiting finish of 1 more threads
INFO - 00:23:35: worker thread finished; awaiting finish of 0 more threads
INFO - 00:23:35: EPOCH - 22 : training on 591008 raw words (218246 effective words) took 1.8s, 119960 effective words/s
INFO - 00:23:36: EPOCH 23 - PROGRESS: at 57.71% examples, 123636 words/s, in_qsize 0, out_qsize 0
INFO - 00:23:37: worker thread finished; awaiting finish of 6 more threads
INFO - 00:23:37: worker thread finished; awaiting finish of 5 more threads
INFO - 00:23:37: worker thread finished; awaiting finish of 4 more threads
INFO - 00:23:37: worker thread f

Time to train the model: 0.99 mins


In [89]:
w2v_model.init_sims(replace=True)

INFO - 00:24:25: precomputing L2-norms of word weight vectors


**Exploring the model**

Most similar to:
    
Here, we ask our model to find the words most similar to some of the most iconic characters of the Simpsons!

In [90]:
w2v_model.wv.most_similar(positive=["homer"])

[('depressed', 0.7351570129394531),
 ('marge', 0.7273886203765869),
 ('bongo', 0.7201482057571411),
 ('sweetheart', 0.7082821130752563),
 ('rude', 0.7049339413642883),
 ('embarrassing', 0.6905354261398315),
 ('unno', 0.6816667318344116),
 ('abe', 0.6753657460212708),
 ('eliza', 0.6622533798217773),
 ('snuggle', 0.6594163775444031)]

In [92]:
w2v_model.wv.similarity('maggie', 'baby')

0.67944306

In [93]:
w2v_model.wv.similarity('bart', 'nelson')

0.6119868
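
The similarity scores above are cosine similarities between word vectors. A minimal sketch of that measure, using made-up 3-dimensional vectors rather than the model's real 300-dimensional ones:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for the example
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing the same way score close to 1 regardless of their length, and orthogonal vectors score 0.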

In [94]:
w2v_model.wv.doesnt_match(['jimbo', 'milhouse', 'kearney'])


'milhouse'
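
The idea behind `doesnt_match`, as I understand it, is to normalise each word vector, average them, and report the word whose vector is least aligned with that average. The sketch below reproduces this idea with the standard library only. The 2-dimensional vectors are invented for the example and are not the model's embeddings, and `odd_one_out` is a hypothetical helper.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """vectors: dict mapping word -> list of floats. Returns the least typical word."""
    units = {word: unit(v) for word, v in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    # Dot product with the mean vector: lower means further from the group
    alignment = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(alignment, key=alignment.get)

# Toy vectors: two words point in a similar direction, one does not
toy_vectors = {
    "jimbo": [1.0, 0.0],
    "kearney": [0.9, 0.1],
    "milhouse": [0.0, 1.0],
}
print(odd_one_out(toy_vectors))
```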

**The code in this notebook takes inspiration from various sources.**