[View in Colaboratory](https://colab.research.google.com/github/gmihaila/deep_learning_toolbox/blob/master/word_embeddings_visualize.ipynb)

## Homework 2: Word Embeddings 

### George Mihaila

<p> This homework is heavily based on <a href=https://github.com/oxford-cs-deepnlp-2017/practical-1>practical1</a> from 
the <a href=https://www.cs.ox.ac.uk/teaching/courses/2016-2017/dl/>Oxford CS - Deep NLP 2017</a> by Yannis Assael, Brendan Shillingford and Chris Dyer."

### Steps to run the homework:

<ol>
<li> Download and install Anaconda from <a href=\"https://conda.io/docs/user-guide/install/index.html#regular-installation\">here</a>. In Linux, you have to download a file (choose Python 2.7 or 3.6, whichever you have in your machine) and run a simple command.
    
<li> The file <code>homework2.env</code> lists all the packages required to run the homework (and some that you don't really need). Create an conda environment with the right packages and versions: <code> conda create --name hw2 --file homework2.env </code>

<li> Activate the environment: <code> source activate hw2 </code>

<li> Run the homework (it will open in your browser): <code> jupyter notebook word_embeddings.ipynb </code>
    
<li> You can now run each cell using <code> Shift + Enter </code>
</ol>

Conda is extremely useful if you need to share your current installation (packages + versions) with others, keep different versions in different environments (for example, if you have old and new code), etc. You can find the documentation <a href="https://conda.io/docs/user-guide/tasks/manage-environments.html">here</a>.

In [0]:
import numpy as np
import os
from random import shuffle
import re

In [0]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

### Part 0: Download the TED dataset

In [0]:
import urllib.request
import zipfile
import lxml.etree

In [0]:
# Download the dataset if it's not already there: this may take a minute as it is 75MB
if not os.path.isfile('ted_en-20160408.zip'):
    urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")

In [0]:
# For now, we're only interested in the subtitle text, so let's extract that from the XML:
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
input_text = '\n'.join(doc.xpath('//content/text()'))
del doc

### Part 1: Preprocessing

In this part, we attempt to clean up the raw subtitles a bit, so that we get only sentences. The following substring shows examples of what we're trying to get rid of. Since it's hard to define precisely what we want to get rid of, we'll just use some simple heuristics.

In [0]:
i = input_text.find("Hyowon Gweon: See this?")
input_text[i-20:i+150]

' baby does.\n(Video) Hyowon Gweon: See this? (Ball squeaks) Did you see that? (Ball squeaks) Cool. See this one? (Ball squeaks) Wow.\nLaura Schulz: Told you. (Laughs)\n(Vide'

Let's start by removing all parenthesized strings using a regex:

In [0]:
input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)

We can verify the same location in the text is now clean as follows. We won't worry about the irregular spaces since we'll later split the text into sentences and tokenize it anyway.

In [0]:
i = input_text_noparens.find("Hyowon Gweon: See this?")
input_text_noparens[i-20:i+150]

"hat the baby does.\n Hyowon Gweon: See this?  Did you see that?  Cool. See this one?  Wow.\nLaura Schulz: Told you. \n HG: See this one?  Hey Clara, this one's for you. You "

Now, let's attempt to remove speakers' names that occur at the beginning of a line, by deleting pieces of the form "`<up to 20 characters>:`", as shown in this example. Of course, this is an imperfect heuristic. 

In [0]:
sentences_strings_ted = []
for line in input_text_noparens.split('\n'):
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)

# Uncomment if you need to save some RAM: these strings are about 50MB.
# del input_text, input_text_noparens

# Let's view the first few:
sentences_strings_ted[:5]

["Here are two reasons companies fail: they only do more of the same, or they only do what's new",
 'To me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation',
 ' Both are necessary, but it can be too much of a good thing',
 'Consider Facit',
 " I'm actually old enough to remember them"]

Now that we have sentences, we're ready to tokenize each of them into words. This tokenization is imperfect, of course. For instance, how many tokens is "can't", and where/how do we split it? We'll take the simplest naive approach of splitting on spaces. Before splitting, we remove non-alphanumeric characters, such as punctuation. You may want to consider the following question: why do we replace these characters with spaces rather than deleting them? Think of a case where this yields a different answer.

In [0]:
sentences_ted = []
for sent_str in sentences_strings_ted:
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
    sentences_ted.append(tokens)

Two sample processed sentences:

In [0]:
print(sentences_ted[0])
print(sentences_ted[1])

['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']
['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation']


### Part 2: Word Frequencies

If you store (1) the counts of the top 1000 words in a list called `counts_ted_top1000` and (2) the actual words in `words_top_ted`, the code below will plot the histogram requested in the writeup.

In [0]:
import collections
ted_counts = collections.defaultdict(int)
counts_ted_top1000 = []
words_top_ted = []

for sentence in sentences_ted:
    for word in sentence:
        if ted_counts.keys().__contains__(word):
            ted_counts[word] += 1
        else:
            ted_counts[word] = 1
#sort words in decenting order
sorted_words_ted = sorted(ted_counts, key=ted_counts.__getitem__, reverse=True)
#get first 1000 counts from sorted words
{counts_ted_top1000.append(ted_counts[word]) for word in sorted_words_ted[:1000]}
#get first 1000 words from sorted words
{words_top_ted.append(word) for word in sorted_words_ted[:1000]}
print ("Succesfully sorted Ted Dataset!")


Succesfully sorted Ted Dataset!


Plot distribution of top-1000 words

In [0]:
hist, edges = np.histogram(counts_ted_top1000, density=True, bins=100, normed=True)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Top-1000 words distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

### Part 3: Train Word2Vec

In [0]:
# Full documentation: https://radimrehurek.com/gensim/models/word2vec.html
from gensim.models import Word2Vec

In [0]:
model_ted = Word2Vec(sentences_ted, min_count=1)

### Part 4: Ted Learnt Representations

Finding similar words: (see gensim docs for more functionality of `most_similar`)

In [0]:
model_ted.most_similar("man")

[('woman', 0.8682868480682373),
 ('guy', 0.8158002495765686),
 ('lady', 0.7774063348770142),
 ('girl', 0.74993896484375),
 ('boy', 0.7380883693695068),
 ('soldier', 0.7181751728057861),
 ('gentleman', 0.7085753083229065),
 ('kid', 0.7070672512054443),
 ('philosopher', 0.6738376021385193),
 ('doctor', 0.6634228825569153)]

In [0]:
model_ted.most_similar("computer")

[('software', 0.7724950909614563),
 ('machine', 0.7566120028495789),
 ('robot', 0.6976919174194336),
 ('3d', 0.6860084533691406),
 ('device', 0.6844151616096497),
 ('game', 0.6713730096817017),
 ('camera', 0.6707618236541748),
 ('video', 0.6641008257865906),
 ('program', 0.6588221788406372),
 ('interface', 0.6544047594070435)]

In [0]:
model_ted.most_similar("language")

[('culture', 0.7175629734992981),
 ('mathematics', 0.7056708931922913),
 ('emotion', 0.6953589916229248),
 ('narrative', 0.6929461359977722),
 ('logic', 0.682327389717102),
 ('nature', 0.6549948453903198),
 ('behavior', 0.6513528227806091),
 ('english', 0.6415160894393921),
 ('meaning', 0.6356196999549866),
 ('beauty', 0.6317970752716064)]

In [0]:
model_ted.wv.most_similar(positive=['nice', 'man'])

[('guy', 0.7498212456703186),
 ('woman', 0.6990991234779358),
 ('boy', 0.6976661682128906),
 ('gentleman', 0.695278525352478),
 ('kid', 0.6902039051055908),
 ('lady', 0.6837806701660156),
 ('girl', 0.6695646047592163),
 ('dog', 0.6527926921844482),
 ('physicist', 0.6447546482086182),
 ('hoodie', 0.6394599676132202)]

In [0]:
model_ted.wv.most_similar(positive=['nice', 'woman'])

[('girl', 0.71077561378479),
 ('boy', 0.7074770927429199),
 ('kid', 0.7037371397018433),
 ('gentleman', 0.6774724721908569),
 ('lady', 0.6747616529464722),
 ('poster', 0.67110276222229),
 ('guy', 0.665742039680481),
 ('baby', 0.6600582599639893),
 ('hoodie', 0.6593862175941467),
 ('man', 0.6558953523635864)]

#### t-SNE visualization
To use the t-SNE code below, first put a list of the top 1000 words (as strings) into a variable `words_top_ted`. The following code gets the corresponding vectors from the model, assuming it's called `model_ted`:

In [0]:
# This assumes words_top_ted is a list of strings, the top 1000 words
words_top_vec_ted = model_ted[words_top_ted]

In [0]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(words_top_vec_ted)
print ("Model fit finished!")

Model fit finished!


In [0]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_top_ted))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

### Part 5: Wiki Learnt Representations

Download dataset

In [0]:
if not os.path.isfile('wikitext-103-raw-v1.zip'):
    urllib.request.urlretrieve("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip", filename="wikitext-103-raw-v1.zip")

In [0]:
with zipfile.ZipFile('wikitext-103-raw-v1.zip', 'r') as z:
    input_text = str(z.open('wikitext-103-raw/wiki.train.raw', 'r').read(), encoding='utf-8') # Thanks Robert Bastian

Preprocess sentences (note that it's important to remove small sentences for performance)

In [0]:
sentences_wiki = []
for line in input_text.split('\n'):
    s = [x for x in line.split('.') if x and len(x.split()) >= 5]
    sentences_wiki.extend(s)

In [0]:
# sample 1/5 of the data
shuffle(sentences_wiki)
sentences_wiki = sentences_wiki[:int(len(sentences_wiki)/5)]

In [0]:
#clean sentences from wiki
for s_i in range(len(sentences_wiki)):
    sentences_wiki[s_i] = re.sub("[^a-z]", " ", sentences_wiki[s_i].lower())
    sentences_wiki[s_i] = re.sub(r'\([^)]*\)', '', sentences_wiki[s_i]).split()

Now, repeat all the same steps that you performed above. You should be able to reuse essentially all the code.

In [0]:
model_wiki = Word2Vec(sentences_wiki, min_count=1)

In [0]:
model_wiki.most_similar('man')

[('woman', 0.7413420677185059),
 ('dog', 0.6260066032409668),
 ('boy', 0.6195032596588135),
 ('girl', 0.6146693229675293),
 ('person', 0.6144819259643555),
 ('men', 0.5713915228843689),
 ('soldier', 0.5568812489509583),
 ('joke', 0.5567601919174194),
 ('creature', 0.5542395710945129),
 ('mask', 0.5220149755477905)]

In [0]:
# Calculate the most frequent words, store the results in words_top_wiki
import collections
wiki_counts = collections.defaultdict(int)
words_top_wiki1000 = []
words_top_wiki = []

for sentence in sentences_wiki:
    for word in sentence:
        if wiki_counts.keys().__contains__(word):
            wiki_counts[word] += 1
        else:
            wiki_counts[word] = 1
#sort words in decenting order
sorted_words_wiki = sorted(wiki_counts, key=wiki_counts.__getitem__, reverse=True)
#get first 1000 counts from sorted words
{words_top_wiki1000.append(wiki_counts[word]) for word in sorted_words_wiki[:1000]}
#get first 1000 words from sorted words
{words_top_wiki.append(word) for word in sorted_words_wiki[:1000]}
print ("Succesfully sorted!")

Succesfully sorted!


#### t-SNE visualization

In [0]:
# This assumes words_top_wiki is a list of strings, the top 1000 words
words_top_vec_wiki = model_wiki[words_top_wiki]

tsne = TSNE(n_components=2, random_state=0)
words_top_wiki_tsne = tsne.fit_transform(words_top_vec_wiki)

In [0]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_wiki_tsne[:,0],
                                    x2=words_top_wiki_tsne[:,1],
                                    names=words_top_wiki))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

### Homework:

#### Question 1 [100pt] 
#### 1.	Regarding the TED and Wiki datasets: 

    (a)	What are the most frequent words in each dataset? List the top 20 most frequent words and draw the histograms.


In [0]:
print("Top 20 words Ted: %s"%words_top_ted[:20])

Top 20 words Ted: ['the', 'and', 'to', 'of', 'a', 'that', 'i', 'in', 'it', 'you', 'we', 'is', 's', 'this', 'so', 'they', 'was', 'for', 'are', 'have']


In [0]:
#histogram
hist, edges = np.histogram(counts_ted_top1000[:20], density=True, bins=100, normed=True)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Top-1000 words distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

In [0]:
print("Top 20 words Wiki: %s"%words_top_wiki[:20])

Top 20 words Wiki: ['the', 'of', 'and', 'in', 'to', 'a', 'was', 's', 'on', 'as', 'for', 'that', 'with', 'by', 'is', 'his', 'at', 'he', 'from', 'it']


In [0]:
#histogram
hist, edges = np.histogram(words_top_wiki1000[:20], density=True, bins=100, normed=True)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Top-1000 words distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

### Homework:

#### Question 1 [100pt] 
#### 1.	Regarding the TED and Wiki datasets: 

    (b)	Get the most similar words to man using the embeddings you got using each corpus. 


In [0]:
print("Most similar to man for TED: ")
model_ted.most_similar("man")


Most similar to man for TED: 


[('woman', 0.8682868480682373),
 ('guy', 0.8158002495765686),
 ('lady', 0.7774063348770142),
 ('girl', 0.74993896484375),
 ('boy', 0.7380883693695068),
 ('soldier', 0.7181751728057861),
 ('gentleman', 0.7085753083229065),
 ('kid', 0.7070672512054443),
 ('philosopher', 0.6738376021385193),
 ('doctor', 0.6634228825569153)]

In [0]:
print("Most similar to man for Wiki: ")
model_wiki.most_similar('man')

Most similar to man for Wiki: 


[('woman', 0.7413420677185059),
 ('dog', 0.6260066032409668),
 ('boy', 0.6195032596588135),
 ('girl', 0.6146693229675293),
 ('person', 0.6144819259643555),
 ('men', 0.5713915228843689),
 ('soldier', 0.5568812489509583),
 ('joke', 0.5567601919174194),
 ('creature', 0.5542395710945129),
 ('mask', 0.5220149755477905)]

i. Why are they different?
    Because each word embedding is trained on different dataset.
    
ii. Can you identity any relationship(s) between man and its most similar words beyond “they are similar”?
    The appear in same context
    
(c) Using model ted.wv.most similar(positive=[’nice’, ’man’]) you can get the most similar words to “nice plus man”. Does the result make sense to you?

In [0]:
model_ted.wv.most_similar(positive=['nice', 'man'])

[('guy', 0.7498212456703186),
 ('woman', 0.6990991234779358),
 ('boy', 0.6976661682128906),
 ('gentleman', 0.695278525352478),
 ('kid', 0.6902039051055908),
 ('lady', 0.6837806701660156),
 ('girl', 0.6695646047592163),
 ('dog', 0.6527926921844482),
 ('physicist', 0.6447546482086182),
 ('hoodie', 0.6394599676132202)]

Yes they make sense.

In [0]:
print("January: ", model_ted.most_similar("january"))


January:  [('october', 0.9308812618255615), ('june', 0.9222588539123535), ('august', 0.910740852355957), ('december', 0.9084295034408569), ('september', 0.9003639221191406), ('april', 0.899808943271637), ('2011', 0.8972553014755249), ('2008', 0.8921166658401489), ('february', 0.8907141089439392), ('july', 0.8866482377052307)]


In [0]:
print("february: ",model_ted.most_similar("february"))

february:  [('2012', 0.9077060222625732), ('2005', 0.896077036857605), ('2002', 0.8909220099449158), ('january', 0.8907140493392944), ('2013', 0.8901843428611755), ('2001', 0.889124870300293), ('2007', 0.8888581395149231), ('2009', 0.8882791996002197), ('november', 0.8857223987579346), ('december', 0.8837077617645264)]


In [0]:
print("minutes: ",model_ted.most_similar("minutes"))

minutes:  [('seconds', 0.9405599236488342), ('hours', 0.8603693842887878), ('months', 0.8312658667564392), ('weeks', 0.8265615701675415), ('days', 0.7821897864341736), ('bucks', 0.7633097171783447), ('cents', 0.7531062960624695), ('minute', 0.7507410049438477), ('years', 0.7418460249900818), ('month', 0.7325657606124878)]


In [0]:
print("days: ",model_ted.most_similar("days"))

days:  [('months', 0.9083890914916992), ('weeks', 0.8966906070709229), ('hours', 0.789345383644104), ('minutes', 0.7821897864341736), ('decades', 0.780881404876709), ('seconds', 0.755976140499115), ('centuries', 0.736128032207489), ('years', 0.7320805191993713), ('month', 0.7242785692214966), ('nights', 0.7189063429832458)]


In [0]:
print("hours: ",model_ted.most_similar("hours"))

hours:  [('seconds', 0.8678438663482666), ('minutes', 0.8603694438934326), ('weeks', 0.8548505306243896), ('months', 0.85135817527771), ('days', 0.7893454432487488), ('hour', 0.7662705183029175), ('cents', 0.7621139287948608), ('milliseconds', 0.7607388496398926), ('bucks', 0.759858250617981), ('month', 0.7552515864372253)]


In [0]:
print("minutes: ",model_ted.most_similar("minutes"))

minutes:  [('seconds', 0.9405599236488342), ('hours', 0.8603693842887878), ('months', 0.8312658667564392), ('weeks', 0.8265615701675415), ('days', 0.7821897864341736), ('bucks', 0.7633097171783447), ('cents', 0.7531062960624695), ('minute', 0.7507410049438477), ('years', 0.7418460249900818), ('month', 0.7325657606124878)]


In [0]:
print("january: ",model_wiki.most_similar('january'))

january:  [('december', 0.981354296207428), ('june', 0.9783721566200256), ('february', 0.9782575368881226), ('july', 0.9752212166786194), ('november', 0.9731165170669556), ('april', 0.9698600769042969), ('october', 0.9695407748222351), ('march', 0.9683542251586914), ('august', 0.9592667818069458), ('september', 0.9530515670776367)]


In [0]:
print("february: ",model_wiki.most_similar('february'))

february:  [('december', 0.9809674024581909), ('july', 0.9795922040939331), ('june', 0.9786773920059204), ('january', 0.9782577157020569), ('november', 0.9776405096054077), ('october', 0.974767804145813), ('march', 0.9721773862838745), ('april', 0.9711993932723999), ('august', 0.9666255712509155), ('september', 0.9569708108901978)]


In [0]:
print("minutes: ",model_wiki.most_similar('minutes'))

minutes:  [('seconds', 0.8670499324798584), ('hours', 0.8279467821121216), ('minute', 0.7537791728973389), ('days', 0.7382129430770874), ('weeks', 0.7073256373405457), ('laps', 0.6940144300460815), ('hour', 0.6737622022628784), ('rounds', 0.6610020399093628), ('months', 0.6564478874206543), ('overs', 0.6534638404846191)]


In [0]:
print("hours: ",model_wiki.most_similar('hours'))

hours:  [('days', 0.884272575378418), ('minutes', 0.8279469013214111), ('weeks', 0.819317102432251), ('months', 0.8149807453155518), ('hour', 0.6774991750717163), ('years', 0.6743379831314087), ('seconds', 0.6711820363998413), ('month', 0.6431746482849121), ('nights', 0.6345077753067017), ('laps', 0.5993154048919678)]


In [0]:
print("days: ",model_wiki.most_similar('days'))

days:  [('months', 0.9211745262145996), ('weeks', 0.8980214595794678), ('hours', 0.884272575378418), ('years', 0.8103913068771362), ('minutes', 0.7382129430770874), ('decades', 0.7201712131500244), ('month', 0.7128969430923462), ('nights', 0.6599725484848022), ('week', 0.607032060623169), ('laps', 0.6009765863418579)]


### 2. The leases.zip 

In [0]:
input_text = ""
with zipfile.ZipFile('leases.zip', 'r') as z:
    for filename in z.namelist():
        input_text += str(z.open("%s"%filename, 'r').read(), encoding='utf-8')+ " " # Thanks Robert Bastian
print("Length text: ",len(input_text))

Length text:  295773821


In [0]:
sentences_leases = []
for line in input_text.split('\n'):
    s = [x for x in line.split('.') if x and len(x.split()) >= 5]
    sentences_leases.extend(s)

In [0]:
# sample 1/5 of the data
shuffle(sentences_leases)
sentences_leases = sentences_leases[:int(len(sentences_wiki)/5)]

In [0]:
#clean sentences from wiki
for s_i in range(len(sentences_leases)):
    sentences_leases[s_i] = re.sub("[^a-z]", " ", sentences_leases[s_i].lower())
    sentences_leases[s_i] = re.sub(r'\([^)]*\)', '', sentences_leases[s_i]).split()

In [0]:
model_leases = Word2Vec(sentences_leases, min_count=1)

In [0]:
# Calculate the most frequent words, store the results in words_top_leases
import collections
leases_counts = collections.defaultdict(int)
words_top_leases1000 = []
words_top_leases = []

for sentence in sentences_leases:
    for word in sentence:
        if leases_counts.keys().__contains__(word):
            leases_counts[word] += 1
        else:
            leases_counts[word] = 1
#sort words in decenting order
sorted_words_leases = sorted(leases_counts, key=leases_counts.__getitem__, reverse=True)
#get first 1000 counts from sorted words
{words_top_leases1000.append(leases_counts[word]) for word in sorted_words_leases[:1000]}
#get first 1000 words from sorted words
{words_top_leases.append(word) for word in sorted_words_leases[:1000]}
print ("Succesfully sorted!")

Succesfully sorted!


In [0]:
# This assumes words_top_leases is a list of strings, the top 1000 words
words_top_vec_leases = model_leases[words_top_leases]

tsne = TSNE(n_components=2, random_state=0)
words_top_leases_tsne = tsne.fit_transform(words_top_vec_leases)

In [0]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_leases_tsne[:,0],
                                    x2=words_top_leases_tsne[:,1],
                                    names=words_top_leases))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

In [0]:
print("January: ", model_leases.most_similar("january"))

January:  [('november', 0.9086244106292725), ('june', 0.898212730884552), ('december', 0.8965006470680237), ('february', 0.8936348557472229), ('august', 0.8898540735244751), ('october', 0.8842123746871948), ('april', 0.8836274147033691), ('march', 0.8824495673179626), ('july', 0.8805142641067505), ('september', 0.8302119374275208)]


In [0]:
print("february: ", model_leases.most_similar("february"))

february:  [('june', 0.9295380711555481), ('august', 0.8984723091125488), ('january', 0.8936350345611572), ('march', 0.8927676677703857), ('july', 0.8921082615852356), ('december', 0.8764469027519226), ('october', 0.8713834285736084), ('april', 0.871055006980896), ('november', 0.8623570799827576), ('september', 0.8598427772521973)]


In [0]:
print("hours: ", model_leases.most_similar("hours"))

hours:  [('appeals', 0.4939519464969635), ('times', 0.49231672286987305), ('offices', 0.4722555875778198), ('pdnazy', 0.46289709210395813), ('normal', 0.4611148238182068), ('jltt', 0.4558117687702179), ('conforming', 0.4435980021953583), ('soon', 0.43607157468795776), ('view', 0.4327104091644287), ('horizorital', 0.4219496250152588)]


In [0]:
print("minutes: ", model_leases.most_similar("minutes"))

minutes:  [('seconds', 0.9467544555664062), ('degrees', 0.9440059661865234), ('west', 0.9167031049728394), ('thence', 0.910275936126709), ('east', 0.8943793773651123), ('south', 0.8640549778938293), ('north', 0.8484262824058533), ('weatherford', 0.7775397896766663), ('davis', 0.7715314030647278), ('inch', 0.7621535062789917)]
