# Embedding Ender's Game

In this notebook I will load some books from the Ender's Game universe, preprocess them using *nltk*, and embed the words in the corpus using *word2vec*.

This is inspired by [this video](https://youtu.be/pY9EwZ02sXU) and implemented using [this source code](https://github.com/llSourcell/word_vectors_game_of_thrones-LIVE).

**Thank you Siraj Raval!**

In [None]:
# __future__ is a built-in module that switches on Python 3 behaviour
# (absolute imports, true division, print as a function) under Python 2,
# so the same code runs on both versions.
from __future__ import absolute_import, division, print_function

## Some dependencies

In [3]:
# text encoding (reading files as unicode)
import codecs
# finds all pathnames matching a shell-style wildcard pattern
import glob
# log events for libraries
import logging
# concurrency
import multiprocessing
# operating system utilities, such as file paths and directories
import os
# pretty print, human readable
import pprint
# regular expressions
import re
import tensorflow as tf

In [4]:
#natural language toolkit
import nltk
#dimensionality reduction
import sklearn.manifold
from bhtsne import tsne
#math
import numpy as np
#plotting
import matplotlib.pyplot as plt
#parse dataset
import pandas as pd
#visualization
import seaborn as sns
#word 2 vec
from gensim.models import word2vec as w2v

In [5]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [6]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Preprocess using *nltk*

In [7]:
# download the punkt sentence tokenizer and the stopword list
# (stopwords are very common words such as "the", "at", "a", "an")
# http://www.nltk.org/
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/baralya/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/baralya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
# read the whole text file into a single unicode string
corpus_raw = u""
with codecs.open('Ender.txt', "r", "utf-8") as book_file:
    corpus_raw += book_file.read()
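
The cell above reads a single file. If the books were stored as separate files, a glob-based loop would be one way to build the same string. This is only a minimal sketch: the `read_corpus` helper and the `books/*.txt` pattern are hypothetical and not part of the original notebook.

```python
import codecs
import glob


def read_corpus(pattern):
    """Concatenate every UTF-8 text file matching a glob pattern (hypothetical helper)."""
    parts = []
    # sorted() keeps the book order stable between runs
    for path in sorted(glob.glob(pattern)):
        with codecs.open(path, "r", "utf-8") as book_file:
            parts.append(book_file.read())
    # separate books with a newline so the last and first sentences do not merge
    return u"\n".join(parts)


# Hypothetical usage: corpus_raw = read_corpus("books/*.txt")
```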

In [15]:
corpus_raw[0:10]

" ENDER'S G"

In [16]:
# tokenization: load the pre-trained punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [17]:
#tokenize into sentences
raw_sentences = tokenizer.tokenize(corpus_raw)

In [18]:
raw_sentences[0:10]

[' ENDER\'S GAME\n by Orson Scott Card\n Chapter 1 -- Third\n\n "I\'ve watched through his eyes, I\'ve listened through his ears, and tell you he\'s the one.',
 'Or at least as close as we\'re going to get."',
 '"That\'s what you said about the brother."',
 '"The brother tested out impossible.',
 'For other reasons.',
 'Nothing to do with his ability."',
 '"Same with the sister.',
 'And there are doubts about him.',
 "He's too malleable.",
 'Too willing\nto submerge himself in someone else\'s will."']

In [19]:
# convert a sentence into a list of words:
# replace every non-letter character (punctuation, digits, hyphens) with a space,
# then split on whitespace
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words
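
To make the effect of the regular expression concrete, here is a small standalone example. It repeats the function from the cell above and runs it on a made-up sentence (the example string is mine, not from the book). Note that apostrophes split contractions into two tokens and that digits disappear entirely.

```python
import re


def sentence_to_wordlist(raw):
    # same logic as the cell above: non-letters become spaces, then split
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return clean.split()


example = "He's too well-known; it's 2017!"
words = sentence_to_wordlist(example)
print(words)
# ['He', 's', 'too', 'well', 'known', 'it', 's']
```

This is why tokens such as `'s'`, `'ve'` and `'re'` show up in the sentence lists further down.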

In [20]:
# build a list of sentences, each one a list of word tokens
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [21]:
sentences[0:10]

[['ENDER',
  'S',
  'GAME',
  'by',
  'Orson',
  'Scott',
  'Card',
  'Chapter',
  'Third',
  'I',
  've',
  'watched',
  'through',
  'his',
  'eyes',
  'I',
  've',
  'listened',
  'through',
  'his',
  'ears',
  'and',
  'tell',
  'you',
  'he',
  's',
  'the',
  'one'],
 ['Or', 'at', 'least', 'as', 'close', 'as', 'we', 're', 'going', 'to', 'get'],
 ['That', 's', 'what', 'you', 'said', 'about', 'the', 'brother'],
 ['The', 'brother', 'tested', 'out', 'impossible'],
 ['For', 'other', 'reasons'],
 ['Nothing', 'to', 'do', 'with', 'his', 'ability'],
 ['Same', 'with', 'the', 'sister'],
 ['And', 'there', 'are', 'doubts', 'about', 'him'],
 ['He', 's', 'too', 'malleable'],
 ['Too',
  'willing',
  'to',
  'submerge',
  'himself',
  'in',
  'someone',
  'else',
  's',
  'will']]

In [22]:
#print an example
print(raw_sentences[5])
print(sentence_to_wordlist(raw_sentences[5]))

Nothing to do with his ability."
['Nothing', 'to', 'do', 'with', 'his', 'ability']


In [23]:
# count the word tokens across all sentences
token_count = sum([len(sentence) for sentence in sentences])
print("The book corpus contains {0:,} tokens".format(token_count))

The book corpus contains 503,902 tokens


## Build our model

In [24]:
# step 2: build our model (GloVe is an alternative embedding method)
# define hyperparameters

# Dimensionality of the resulting word vectors.
# More dimensions are more expensive to train but can capture more general relationships.
num_features = 300

# Minimum word count threshold.
min_word_count = 3

# Number of threads to run in parallel.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsample setting for frequent words.
# The gensim docs give (0, 1e-5) as the useful range; 1e-3 is the default.
# It controls how often very frequent words are kept during training.
downsampling = 1e-3

# Seed for the RNG, to make the results reproducible.
seed = 1
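
The `downsampling` value is easier to interpret with a small calculation. As far as I know, both the original word2vec C code and gensim keep each occurrence of a word with probability `(sqrt(v / t) + 1) * (t / v)`, capped at 1, where `v` is the word's count and `t = sample * total_words`. The sketch below implements that formula under this assumption; `keep_probability` is a hypothetical helper, not a gensim function.

```python
import math


def keep_probability(word_count, total_words, sample=1e-3):
    """Chance of keeping one occurrence of a word under frequent-word subsampling.

    Assumes the word2vec-style rule (sqrt(v / t) + 1) * (t / v), capped at 1.
    """
    threshold = sample * total_words
    prob = (math.sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(1.0, prob)


# In a 1,000,000-word corpus the threshold is 1,000 occurrences:
print(keep_probability(4000, 1000000))  # a frequent word is kept 75% of the time
print(keep_probability(10, 1000000))    # a rare word is always kept
```

Only words whose probability falls below 1 are affected, so with `sample=1e-3` just the most common words are thinned out.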

In [25]:
ender2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)
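
With `sg=1` the model is trained as skip-gram: each word is used to predict the words around it, up to `window` positions away. The sketch below lists the (center, context) pairs for a fixed window. It is a simplification: as I understand it, gensim also shrinks the window at random for each position and trains with negative sampling, neither of which is shown here. `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(words, window):
    """Return (center, context) pairs for a fixed context window (simplified)."""
    pairs = []
    for i, center in enumerate(words):
        # clamp the window to the sentence boundaries
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


print(skipgram_pairs(["For", "other", "reasons"], window=1))
# [('For', 'other'), ('other', 'For'), ('other', 'reasons'), ('reasons', 'other')]
```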

In [26]:
ender2vec.build_vocab(sentences)

2017-06-05 20:22:21,796 : INFO : collecting all words and their counts
2017-06-05 20:22:21,797 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-06-05 20:22:21,828 : INFO : PROGRESS: at sentence #10000, processed 111908 words, keeping 8059 word types
2017-06-05 20:22:21,861 : INFO : PROGRESS: at sentence #20000, processed 228097 words, keeping 11485 word types
2017-06-05 20:22:21,891 : INFO : PROGRESS: at sentence #30000, processed 353550 words, keeping 14782 word types
2017-06-05 20:22:21,925 : INFO : PROGRESS: at sentence #40000, processed 478152 words, keeping 17718 word types
2017-06-05 20:22:21,931 : INFO : collected 18299 word types from a corpus of 503902 raw words and 42110 sentences
2017-06-05 20:22:21,932 : INFO : Loading a fresh vocabulary
2017-06-05 20:22:21,962 : INFO : min_count=3 retains 7725 unique words (42% of original 18299, drops 10574)
2017-06-05 20:22:21,962 : INFO : min_count=3 leaves 490596 word corpus (97% of original 503902, drops 13306)
2017-06-05 20:22:21,989 : INFO : deleting the raw counts dictionary of 18299 items
2017-06-05 20:22:21,991 : INFO : sample=0.001 downsamples 66 most-common words
2017-06-05 20:22:21,992 : INFO : downsampling leaves estimated 369461 word corpus (75.3% of prior 490596)
2017-06-05 20:22:21,992 : INFO : estimated required memory for 7725 words and 300 dimensions: 22402500 bytes
2017-06-05 20:22:22,040 : INFO : resetting layer weights


In [27]:
print("Word2Vec vocabulary length:", len(ender2vec.wv.vocab))

Word2Vec vocabulary length: 7725
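
The vocabulary is much smaller than the number of distinct words in the text because `min_count=3` drops every word seen fewer than three times. The idea can be sketched with a plain counter; `prune_vocab` is a hypothetical helper that illustrates the filtering and is not how gensim is implemented internally.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count words over tokenized sentences and keep those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_sentences = [
    ["the", "brother", "tested", "out"],
    ["same", "with", "the", "sister"],
    ["the", "brother"],
]
print(prune_vocab(toy_sentences, min_count=2))
# {'the': 3, 'brother': 2}
```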


In [28]:
# train the model on the sentences
ender2vec.train(sentences,
                total_examples=ender2vec.corpus_count, epochs=5)

2017-06-05 20:22:30,404 : INFO : training model with 4 workers on 7725 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=7
2017-06-05 20:22:31,456 : INFO : PROGRESS: at 17.04% examples, 301680 words/s, in_qsize 7, out_qsize 0
2017-06-05 20:22:32,470 : INFO : PROGRESS: at 35.81% examples, 320435 words/s, in_qsize 7, out_qsize 0
2017-06-05 20:22:33,475 : INFO : PROGRESS: at 54.50% examples, 327694 words/s, in_qsize 7, out_qsize 0
2017-06-05 20:22:34,480 : INFO : PROGRESS: at 72.84% examples, 329649 words/s, in_qsize 7, out_qsize 0
2017-06-05 20:22:35,496 : INFO : PROGRESS: at 91.43% examples, 331517 words/s, in_qsize 7, out_qsize 0
2017-06-05 20:22:35,940 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-06-05 20:22:35,952 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-06-05 20:22:35,988 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-06-05 20:22:35,990 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-06-05 20:22:35,990 : INFO : training on 2519510 raw words (1848269 effective words) took 5.6s, 331264 effective words/s


1848269

## That's it! We trained our word vectors!

### Now let's play...

In [29]:
# create the output directory for the saved model, if needed
if not os.path.exists("trained"):
    os.makedirs("trained")

In [30]:
ender2vec.save(os.path.join("trained", "ender2vec.w2v"))

2017-06-05 20:22:54,820 : INFO : saving Word2Vec object under trained/ender2vec.w2v, separately None
2017-06-05 20:22:54,821 : INFO : not storing attribute syn0norm
2017-06-05 20:22:54,821 : INFO : not storing attribute cum_table
2017-06-05 20:22:54,970 : INFO : saved trained/ender2vec.w2v


In [31]:
#load model
#ender2vec = w2v.Word2Vec.load(os.path.join("trained", "ender2vec.w2v"))

### Perform dimensionality reduction using *tsne*

In [32]:
# stack all word vectors into one big matrix
all_word_vectors_matrix = ender2vec.wv.syn0
# run t-SNE (bhtsne) to squash the vectors down to 2 dimensions
all_word_vectors_matrix_2d = tsne(all_word_vectors_matrix.astype('float64'))

# alternative: scikit-learn's t-SNE
# https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
# tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)

In [33]:
# collect each word with its 2-d coordinates in a DataFrame
points = pd.DataFrame(
    [
        (word, coords[0], coords[1])
        for word, coords in [
            (word, all_word_vectors_matrix_2d[ender2vec.wv.vocab[word].index])
            for word in ender2vec.wv.vocab
        ]
    ],
    columns=["word", "x", "y"]
)

In [34]:
points.head(10)

       word           x           y
0      ribs   15.631265  -21.844936
1      jail    6.789067    2.182995
2      port    2.274317   -13.35666
3      main   12.277898  -30.154868
4   thanked    5.798963      3.4842
5     stiff   14.617314  -24.195186
6    thrill   16.773382  -18.393301
7      slot  -29.835675    7.878562
8  rotation   -8.903737   11.813079
9       Ask  -34.672347     5.02566


In [35]:
#plot
sns.set_context("poster")

In [36]:
points.plot.scatter("x", "y", s=10, figsize=(20, 12))

<matplotlib.axes._subplots.AxesSubplot at 0x7ff6dccdf668>

In [37]:
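def plot_region(x_bounds, y_bounds):
    # keep only the points that fall inside the bounding box
    # (named `region` to avoid shadowing the built-in `slice`)
    region = points[
        (x_bounds[0] <= points.x) &
        (points.x <= x_bounds[1]) &
        (y_bounds[0] <= points.y) &
        (points.y <= y_bounds[1])
    ]

    # scatter plot of the region, with each point labelled by its word
    ax = region.plot.scatter("x", "y", s=35, figsize=(8, 6))
    for i, point in region.iterrows():
        ax.text(point.x + 0.005, point.y + 0.005, point.word, fontsize=11)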

### Let's have a closer look at some interesting regions

In [None]:
plot_region(x_bounds=(22.0, 25.0), y_bounds=(12.0, 16.0))

In [None]:
plot_region(x_bounds=(4.0,6.0), y_bounds=(31.0, 35.0))

### Let's look up some similar words

In [None]:
ender2vec.wv.most_similar("Bean")

In [None]:
ender2vec.wv.most_similar("battleroom")

In [None]:
ender2vec.wv.most_similar("Buggers")
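
These lookups rank the vocabulary by cosine similarity to the query word's vector. The pure-Python sketch below shows that ranking on a few made-up 2-d vectors; the numbers are invented for illustration and are not taken from the trained model, and the `most_similar` function here is a stand-in, not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=10):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# made-up vectors, for illustration only
toy_vectors = {
    "Ender": [1.0, 0.1],
    "Bean": [0.9, 0.2],
    "battleroom": [0.0, 1.0],
}
ranked = most_similar("Ender", toy_vectors)
print([word for word, score in ranked])
# ['Bean', 'battleroom']
```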

### Let's look at some word associations

In [None]:
# analogy lookup: find the word that relates to end2 the way start1 relates to end1
def nearest_similarity_cosmul(start1, end1, end2):
    similarities = ender2vec.wv.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{start1} is related to {end1}, as {start2} is related to {end2}".format(**locals()))
    return start2
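
`most_similar_cosmul` uses the multiplicative combination objective (often called 3CosMul). To my understanding, each candidate is scored by the product of its similarities to the positive words, divided by the product of its similarities to the negative words plus a small epsilon, with cosines shifted from [-1, 1] to [0, 1]. The sketch below follows that description on made-up 3-d vectors; the vectors, the helper names and the epsilon value are assumptions for illustration.

```python
import math


def shifted_cosine(u, v):
    """Cosine similarity mapped from [-1, 1] to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return (1.0 + dot / norms) / 2.0


def cosmul_ranking(positive, negative, vectors, eps=1e-6):
    """Rank candidate words by product(positive sims) / (product(negative sims) + eps)."""
    scores = {}
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # the query words themselves are not candidates
        numerator = 1.0
        for p in positive:
            numerator *= shifted_cosine(vec, vectors[p])
        denominator = 1.0
        for n in negative:
            denominator *= shifted_cosine(vec, vectors[n])
        scores[word] = numerator / (denominator + eps)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# made-up vectors: axis 0 ~ "Peter", axis 1 ~ "Valentine", axis 2 ~ "pen name"
toy_vectors = {
    "Peter": [1.0, 0.0, 0.0],
    "Valentine": [0.0, 1.0, 0.0],
    "Locke": [1.0, 0.0, 1.0],
    "Demosthenes": [0.0, 1.0, 1.0],
    "Ender": [0.5, 0.5, 0.0],
}
ranking = cosmul_ranking(
    positive=["Demosthenes", "Peter"], negative=["Valentine"], vectors=toy_vectors
)
print(ranking[0][0])
# Locke
```

In this toy setup "Locke" wins because it is close to both positive words while staying far from the negative one; "Ender" is equally close to the positives but is penalised for its similarity to "Valentine".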

In [None]:
nearest_similarity_cosmul("Ender", "Valentine", "Bean")
# interesting: http://enderverse.wikia.com/wiki/Suriyawong

In [None]:
nearest_similarity_cosmul("Peter", "Valentine", "Locke")
# amazing!!

In [None]:
nearest_similarity_cosmul("run", "slow", "fight")

In [None]:
nearest_similarity_cosmul("love", "hate", "formics")