# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

Notice the commented-out `pip install` line, which pins the Gensim and python-Levenshtein versions so that the packages work together without errors. Gensim is under active development and has gone through several major revisions (hence, version 4). If you experience problems with Gensim in your work, updating the package may fix them.

In [1]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
# ! pip -q install -U python-Levenshtein==0.12.2 gensim==4.1.0 > log
import pandas as pd, numpy as np, nltk, seaborn as sns, matplotlib.pyplot as plt, gensim
from gensim.models import KeyedVectors

print(f'Versions. gensim:{gensim.__version__}, np:{np.__version__}') 

Versions. gensim:4.2.0, np:1.19.5


<hr style="border-top: 2px solid #606366; background: transparent;">

# Review

In this notebook, you will examine a Word2Vec model to practice working with its output and to develop a better understanding of what it can tell you about a particular word. Additionally, you'll develop an appreciation for the limitations of this type of model. 

## Word2Vec

You'll work with the pretrained word-embedding model `'glove-wiki-gigaword-50.gz'`, which is distributed through the `Gensim` data repository. This is the smallest package in the library and includes 400,000 words.

First, look at each part of the model name, because they give you critical information about the model: 

1. [*GloVe*](https://en.wikipedia.org/wiki/GloVe_(machine_learning)) or Global Vectors is the algorithm used to create the word vectors (aka word-embedding vectors or word embeddings), which are stored here in the Word2Vec file format.
1. *wiki* and [*gigaword*](https://catalog.ldc.upenn.edu/LDC2003T05) are the large corpora that were used to train the model. Together, the Wikipedia 2014 corpus and the English Gigaword 5 corpus contain 6 billion uncased tokens.
1. *50* is the size of each vector.
1. *.gz* indicates that this is a text file compressed to [*gzip*](https://en.wikipedia.org/wiki/Gzip) format. 

`'glove-wiki-gigaword-50.gz'` contains a matrix of weights, where each line is a word vector with the word itself starting the line. Use the code below to load this model as `wv`. Note that this may take a minute or two to load.

Note: The original Word2Vec model and Global Vectors (GloVe) model both produce word vectors, but their algorithms differ in technicalities which we will not discuss here. While you will use the more accessible/popular GloVe embeddings in this course, you will learn about the original Word2Vec algorithm.

In [2]:
# Dictionary-like object. key=word (string), value=trained embedding coefficients (array of numbers)
sFile = "https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz"
%time wv = KeyedVectors.load_word2vec_format(sFile)
wv            # prints the type of the object and its memory location

CPU times: user 46.1 s, sys: 772 ms, total: 46.9 s
Wall time: 47.2 s


<gensim.models.keyedvectors.KeyedVectors at 0x724f16b13080>
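
To make the file layout concrete, the sketch below parses a tiny, made-up file in the same text layout using only the standard library. It assumes the Word2Vec text format that `load_word2vec_format` reads: a header line with the vocabulary size and the vector size, then one word per line followed by its coefficients. The toy values and the names `toy_text` and `parse_word2vec_text` are illustrative and are not part of Gensim.

```python
import gzip
import io

# Toy stand-in for the real file: a header line, then one word per line.
toy_text = (
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    ", -0.1 0.0 0.5 0.2\n"
    "cornell 1.0 -1.0 0.5 0.25\n"
)
toy_gz = gzip.compress(toy_text.encode("utf-8"))


def parse_word2vec_text(gz_bytes):
    """Parse gzip-compressed Word2Vec-style text into {word: [floats]}."""
    vectors = {}
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as f:
        # Header: number of words and number of dimensions.
        n_words, dim = map(int, f.readline().split())
        for line in f:
            word, *values = line.split()
            vectors[word] = [float(v) for v in values]
            assert len(vectors[word]) == dim
    assert len(vectors) == n_words
    return vectors


toy_vectors = parse_word2vec_text(toy_gz)
print(toy_vectors["cornell"])  # [1.0, -1.0, 0.5, 0.25]
```

Gensim's loader does the same job far more efficiently and returns NumPy arrays, so this is only a way to see what the loader is reading.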

Now that you've loaded the model, retrieve the vector for the word `'cornell'`. 

In [3]:
wv['cornell']  # retrieve a word vector. Formerly: wv.word_vec('cornell')

array([-0.92918 ,  1.2927  , -0.60859 , -0.64373 , -0.42095 ,  0.22215 ,
       -1.3797  , -0.27724 , -0.11376 , -0.3705  ,  0.19052 ,  0.39284 ,
        0.068273,  0.045676, -0.35982 , -0.025498, -0.33448 ,  0.70516 ,
       -0.3363  ,  1.0111  ,  0.2258  ,  0.63049 ,  0.403   , -0.66357 ,
        0.12265 , -0.78821 ,  0.14584 , -0.43932 , -1.1897  , -0.75912 ,
        0.66164 , -0.31295 ,  0.14875 , -1.0532  ,  0.034797, -0.074928,
       -0.024014,  0.75029 ,  1.73    ,  0.16136 , -0.26749 , -0.68806 ,
       -0.29646 ,  0.12024 ,  0.51634 ,  0.32831 ,  0.66773 ,  0.47298 ,
       -1.1365  ,  0.72653 ], dtype=float32)

Examine the vector, noting each of the following important characteristics: 

1. It contains 50 values somewhere between -2 and +2.
1. All values are floats with 32-bit precision.
1. There are no zeros, so it can be considered a **dense vector** (a vector comprised of mostly non-zero values).
1. Each value represents a dimension but is not necessarily interpretable by humans.
1. Large (in magnitude) values, such as `1.73` or `-1.3797`, may relate to education, university, academia, technology, or some other broad category in which Cornell has great presence.
    1. Many broader categories are also likely to be represented by a combination of these dimensions.
1. A few values are relatively close to zero: `0.068273`, `0.045676`, `0.034797`, `-0.074928`.
    1. This implies that these dimensions do not carry "significant information" about `'cornell'`.
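
The characteristics listed above can also be checked programmatically. The sketch below hard-codes the 50 coefficients printed for `'cornell'` so that it runs without loading the model. The cutoff of `0.1` for "close to zero" is an arbitrary assumption made for illustration. With that cutoff, the filter also picks up `-0.025498` and `-0.024014` in addition to the four values listed above.

```python
import numpy as np

# The 50 coefficients printed above for 'cornell', copied by hand.
cornell = np.array([
    -0.92918, 1.2927, -0.60859, -0.64373, -0.42095, 0.22215,
    -1.3797, -0.27724, -0.11376, -0.3705, 0.19052, 0.39284,
    0.068273, 0.045676, -0.35982, -0.025498, -0.33448, 0.70516,
    -0.3363, 1.0111, 0.2258, 0.63049, 0.403, -0.66357,
    0.12265, -0.78821, 0.14584, -0.43932, -1.1897, -0.75912,
    0.66164, -0.31295, 0.14875, -1.0532, 0.034797, -0.074928,
    -0.024014, 0.75029, 1.73, 0.16136, -0.26749, -0.68806,
    -0.29646, 0.12024, 0.51634, 0.32831, 0.66773, 0.47298,
    -1.1365, 0.72653,
], dtype=np.float32)

print(cornell.shape, cornell.dtype)      # number of values and their precision
print(cornell.min(), cornell.max())      # range of the coefficients
n_zeros = int(np.sum(cornell == 0))      # exact zeros (none in a dense vector)
print(n_zeros)

# Assumed cutoff: treat |value| < 0.1 as "relatively close to zero".
near_zero = cornell[np.abs(cornell) < 0.1]
print(near_zero)
```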

Next, check the vector length with the model's `vector_size` attribute or by taking the length of any vector you obtain from the model. Note, however, that you can only retrieve vectors for words that were present in the training corpus. The word `'the'` is highly likely to be present in all models, so it is a good choice if you want to check the number of dimensions.

In [4]:
wv.vector_size
len(wv['the'])

50

50

If a word in which you're interested is not in a model, some preprocessing may help. In the previous example, you used `'cornell'` instead of `'Cornell'` since the uppercase version `'Cornell'` is not in this model's vocabulary. 

To check whether a word is included in the vocabulary, you can use the `key_to_index` attribute.

In [5]:
'Cornell' in wv.key_to_index, 'cornell' in wv.key_to_index

(False, True)
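
The lowercase fallback described above could be wrapped in a small helper. This is a minimal sketch: `find_key` is a hypothetical function, and `toy_key_to_index` is a three-word stand-in for `wv.key_to_index` so that the example runs without the model.

```python
# Toy stand-in for wv.key_to_index (word -> row index).
toy_key_to_index = {"the": 0, "cornell": 1, "university": 2}


def find_key(word, key_to_index):
    """Return the vocabulary key for `word`, trying the lowercase form as a fallback."""
    for candidate in (word, word.lower()):
        if candidate in key_to_index:
            return candidate
    return None  # the word is not in the vocabulary in either form


print(find_key("Cornell", toy_key_to_index))  # cornell
print(find_key("Ithaca", toy_key_to_index))   # None
```

Lowercasing suits this model because it was trained on uncased tokens. A cased model would need a different fallback.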

You can check the vocabulary size of a particular model with the `len` function. 400K words may seem like a lot, but this is actually quite small for most real-world applications. In practice, you can use models with larger vector sizes (say, 300) and a much larger vocabulary.

In [6]:
len(wv.key_to_index)  # retrieve vocabulary and measure its size

400000
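
To get a feel for what a larger vector size costs, here is a back-of-the-envelope estimate of the memory taken by the matrix of weights alone. It assumes 4 bytes per value (32-bit floats, as in this model) and ignores the vocabulary strings and any Python overhead. The 300-dimensional case keeps the same 400,000-word vocabulary purely for comparison, and `embedding_bytes` is an illustrative helper.

```python
def embedding_bytes(n_words, vector_size, bytes_per_value=4):
    """Approximate size of the raw weight matrix in bytes."""
    return n_words * vector_size * bytes_per_value


small = embedding_bytes(400_000, 50)    # this model
large = embedding_bytes(400_000, 300)   # hypothetical 300-dimensional model
print(f"{small / 1e6:.0f} MB vs. {large / 1e6:.0f} MB")  # 80 MB vs. 480 MB
```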

Let's print a few words from this model's vocabulary. Since `wv.key_to_index` is a dictionary, you need to wrap it as a list before slicing.

In [7]:
LsTop20 = list(wv.key_to_index)[:20]
print(LsTop20)

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s", 'for', '-', 'that', 'on', 'is', 'was', 'said', 'with', 'he', 'as']
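
Wrapping the dictionary as a list copies all 400,000 keys before slicing. As an alternative sketch, `itertools.islice` stops after the requested number of keys. The example uses a toy dictionary in place of `wv.key_to_index`. In Gensim 4, the `index_to_key` attribute is already a list of the same words and can be sliced directly.

```python
from itertools import islice

# Toy stand-in for wv.key_to_index; dictionaries keep insertion order.
toy_key_to_index = {"the": 0, ",": 1, ".": 2, "of": 3, "to": 4}

# Take the first three keys without building a list of every key.
first_three = list(islice(toy_key_to_index, 3))
print(first_three)  # ['the', ',', '.']
```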


You can go a step further and print the vectors associated with each of these words. Wrap these vectors into a DataFrame as rows, with the corresponding words as row indices, then spice it up with colors using the `background_gradient` method. WOW!

In [8]:
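# Rows are words, columns are dimensions. In pandas 1.3+, .format(precision=2) replaces the deprecated .set_precision(2).
pd.DataFrame({w:wv[w] for w in LsTop20}).T.style.background_gradient(cmap='coolwarm').set_precision(2)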

word,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
the,0.42,0.25,-0.41,0.12,0.35,-0.04,-0.5,-0.18,-0.0,-0.66,0.28,-0.15,-0.56,0.15,-0.01,0.01,0.1,-0.13,-0.84,-0.12,-0.02,-0.33,-0.16,-0.23,-0.19,-1.88,-0.77,0.1,-0.42,-0.2,4.01,-0.19,-0.52,-0.32,0.0,0.01,0.18,-0.16,0.01,-0.05,-0.3,-0.16,-0.35,-0.05,-0.44,0.19,0.0,-0.18,-0.12,-0.79
",",0.01,0.24,-0.17,0.41,0.64,0.48,-0.43,-0.56,-0.36,-0.24,0.13,-0.06,-0.4,-0.48,0.23,0.09,-0.13,0.08,-0.42,-0.15,0.1,0.49,0.31,-0.13,-0.04,-1.52,0.13,-0.02,-0.04,-0.28,3.54,-0.12,-0.01,-0.15,0.22,-0.33,-0.14,0.32,0.7,0.45,-0.08,0.63,0.32,-0.47,0.23,0.36,-0.38,-0.57,0.04,0.3
.,0.15,0.3,-0.17,0.18,0.32,0.34,-0.43,-0.31,-0.45,-0.29,0.17,0.12,-0.41,-0.42,0.6,0.29,-0.12,-0.04,-0.68,-0.25,0.18,0.09,0.47,0.02,0.04,-1.47,-0.3,-0.02,0.31,-0.22,3.75,0.0,-0.18,-0.46,0.1,-0.12,0.24,0.12,0.42,0.06,-0.0,0.07,0.09,-0.1,-0.14,0.22,-0.08,-0.36,0.02,0.1
of,0.71,0.57,-0.47,0.18,0.54,0.73,0.18,-0.52,0.1,-0.18,0.08,-0.36,-0.12,-0.83,0.12,-0.17,0.06,-0.01,-0.57,0.01,0.23,-0.14,-0.07,-0.38,-0.24,-1.7,-0.87,-0.27,-0.26,0.18,3.87,-0.16,-0.13,-0.69,0.18,0.01,-0.34,-0.08,0.24,0.37,-0.35,0.28,0.08,-0.06,-0.39,0.23,-0.22,-0.23,-0.09,-0.8
to,0.68,-0.04,0.3,-0.18,0.43,0.03,-0.41,0.13,-0.3,-0.09,0.17,0.22,-0.1,-0.44,0.33,0.68,0.06,-0.34,-0.43,-0.43,0.56,0.1,0.19,-0.27,0.04,-2.09,0.22,-0.4,0.21,-0.56,3.88,0.47,-0.96,-0.38,0.21,-0.33,0.13,0.09,0.16,-0.22,-0.09,0.02,0.21,-0.03,-0.2,0.08,-0.09,-0.07,-0.06,-0.26
and,0.27,0.14,-0.28,0.02,0.11,0.7,-0.51,-0.47,-0.33,-0.14,0.27,0.31,-0.45,-0.41,-0.1,0.04,0.03,0.1,-0.25,-0.52,0.35,0.45,0.49,-0.08,-0.1,-1.38,-0.11,-0.23,0.01,-0.47,3.85,0.31,0.14,-0.52,0.33,0.34,-0.36,0.32,0.12,0.35,-0.07,0.37,0.25,-0.25,0.25,0.14,-0.31,-0.63,-0.25,-0.38
in,0.33,0.25,-0.61,0.11,0.04,0.15,-0.55,-0.07,-0.09,-0.33,0.1,-0.82,-0.37,-0.67,0.43,0.02,-0.24,0.13,-1.1,0.43,0.57,-0.1,0.2,0.08,-0.43,-1.8,-0.28,0.12,-0.13,0.03,3.86,-0.18,-0.08,-0.63,0.26,-0.06,-0.07,0.46,0.31,0.12,-0.49,-0.01,0.03,-0.37,-0.43,0.42,-0.12,-0.51,-0.03,-0.53
a,0.22,0.47,-0.47,0.1,1.01,0.75,-0.53,-0.26,0.17,0.13,-0.25,-0.44,-0.22,0.51,0.13,-0.43,-0.03,0.21,-0.78,-0.2,-0.1,0.16,-0.62,-0.19,-0.12,-2.25,-0.22,0.5,0.32,0.15,3.96,-0.71,-0.67,0.28,0.22,0.14,0.26,0.23,0.43,-0.44,0.14,0.37,-0.64,0.02,-0.04,-0.26,0.12,-0.04,0.41,0.18
"""",0.26,0.46,-0.77,-0.38,0.59,-0.06,0.21,-0.57,-0.29,-0.14,0.33,1.47,-0.74,-0.12,0.71,-0.46,0.65,0.49,-0.52,0.04,-0.34,-0.01,0.86,0.35,0.8,-1.5,-1.82,0.41,0.24,-0.43,3.66,-0.8,-0.55,0.17,-0.82,-0.35,0.69,-1.23,-0.18,-0.06,0.03,-0.4,-0.39,-1.0,0.09,-0.31,-0.35,-0.31,0.75,0.97
's,0.24,0.4,-0.21,0.59,0.66,0.33,-0.82,-0.23,0.27,0.24,0.05,0.16,-1.26,-0.09,0.45,0.1,-0.17,0.06,-0.39,0.09,0.0,0.55,-0.78,-0.62,0.09,-2.57,-0.68,0.1,-0.49,-0.06,3.19,-0.02,-0.16,0.06,-0.26,-0.34,-0.2,0.26,0.1,-0.56,-0.12,0.66,-0.52,-0.83,-0.08,0.28,-0.42,-0.27,-0.01,-0.03


Note that the 50 columns you see here are the same dimensions you examined earlier. We do not know what each dimension signifies, although this is an active area of research. However, if you examine each dimension, you'll notice more red in some columns and more blue in others. The words loaded above are very common and might be stopwords that we would normally remove. If we load more specific groups of words instead, we might make good guesses about what some of the dimensions mean.
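
One way to look for dimensions that behave consistently across a group of words is to compare per-column means and spreads. The sketch below uses a small excerpt of the table above (three words, three dimensions, rounded values copied by hand) so that it runs without the model. The choice of words and dimensions is only for illustration.

```python
import numpy as np

words = ["the", "of", "to"]
dims = [25, 26, 30]  # column labels taken from the table above

# Rounded values copied from the table (rows: words, columns: dims).
block = np.array([
    [-1.88, -0.77, 4.01],
    [-1.70, -0.87, 3.87],
    [-2.09,  0.22, 3.88],
])

col_means = block.mean(axis=0)                       # average value per dimension
col_spread = block.max(axis=0) - block.min(axis=0)   # how much the words disagree

for d, m, s in zip(dims, col_means, col_spread):
    print(f"dimension {d}: mean={m:+.2f}, spread={s:.2f}")

largest_mean_dim = dims[int(np.argmax(col_means))]
smallest_mean_dim = dims[int(np.argmin(col_means))]
```

In this excerpt, dimension 30 is large and positive for all three words and dimension 25 is clearly negative, while dimension 26 varies more from word to word. Applying the same summary to a more specific group of words is one way to start forming guesses about a dimension.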

Now that you've explored the model programmatically, consider downloading this model, unzipping it, opening it in a text editor, and searching for the example words we reviewed above. Explore the vectors and confirm that they are the same as what you retrieve programmatically.

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**

Now, equipped with these concepts and tools, you will practice a few related tasks.


As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.

## Task 1

First, evaluate the similarity between the word vectors for `'university'` and `'cornell'` simply by counting the matched signs of their coefficients (report the result as a fraction of matched signs). Try comparing other words to these two to find some that are more or less similar.

<b>Hint:</b> You can create a Boolean mask for each vector by testing which coefficients are > 0. Then compare the two masks, count the matches, and divide by 50.


In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
    <pre>
# Solution 1
matched_signs = sum((wv['university'] > 0) == (wv['cornell'] > 0)) 
matched_signs / 50
# Solution 2
np.mean((wv['university'] * wv['cornell']) > 0)
    </pre>
    </details> 
</font>
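
The two solutions above agree on dense vectors, but they treat an exact zero differently. The toy vectors below are made up to show the difference: the mask comparison counts a zero as "not positive", while the product test never counts a position where either coefficient is zero.

```python
import numpy as np

# Made-up vectors; the third coefficient of `a` is exactly zero.
a = np.array([0.5, -1.0, 0.0, 2.0], dtype=np.float32)
b = np.array([1.5, 0.3, -0.2, 0.1], dtype=np.float32)

# Mask comparison: zero and negative values both map to False.
mask_match = float(np.mean((a > 0) == (b > 0)))

# Product test: a zero coefficient makes the product 0, which is not > 0.
product_match = float(np.mean((a * b) > 0))

print(mask_match, product_match)  # 0.75 0.5
```

Because the `'cornell'` vector shown earlier contains no zeros, this distinction only matters if a vector does include exact zeros.
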
<hr>

## Task 2

Now, for the top 500 words in `wv`'s vocabulary, find the top three that are most similar to `'cornell'` using the metric we created in the previous task. Do the results make sense?

<b>Hint:</b> Try wrapping the vocabulary dictionary as a list, retrieving the first 500 words, and computing the metric above for each one. If you build tuples with the measure as the first element and the word as the second, then you can easily apply <code>sorted()</code> to order these tuples by similarity score. Because <code>sorted()</code> orders from lowest to highest, the last three tuples hold the most similar words.


In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
    <pre>
sorted((sum((wv[w] > 0) == (wv['cornell'] > 0)) / 50, w) for w in list(wv.key_to_index)[:500])[-3:]
    </pre>
    </details> 
</font>
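
The ranking idea can be tried on made-up data first. In this sketch, the four words, their 4-dimensional vectors, and the `target` vector are all invented for illustration, and `sign_match` is a plain-Python version of the sign-matching metric. Since tuples are compared element by element, ties in the score are broken alphabetically by word.

```python
# Invented 4-dimensional vectors, not taken from the model.
toy_vectors = {
    "college": [1.0, -0.5, 0.2, -0.1],
    "campus": [0.8, -0.2, -0.3, -0.4],
    "banana": [-0.6, 0.9, -0.7, 0.3],
    "river": [-0.2, -0.4, 0.5, 0.6],
}
target = [0.9, -0.7, 0.4, -0.2]


def sign_match(u, v):
    """Fraction of positions where u and v agree on being positive."""
    return sum((x > 0) == (y > 0) for x, y in zip(u, v)) / len(u)


# sorted() orders (score, word) tuples from least to most similar.
ranked = sorted((sign_match(vec, target), word) for word, vec in toy_vectors.items())

most_similar = ranked[-2:]   # end of the list: highest scores
least_similar = ranked[:2]   # start of the list: lowest scores
print(most_similar)   # [(0.75, 'campus'), (1.0, 'college')]
print(least_similar)  # [(0.0, 'banana'), (0.5, 'river')]
```
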
<hr>

## Task 3

Similarly, for the top 500 words in `wv`'s vocabulary, find the three that are least similar to `'cornell'` using the metric you created in Task 1. Do the results make sense?

<b>Hint:</b> This is similar to the code above but is retrieving elements from the other end of the list.


In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
    <pre>
sorted((sum((wv[w] > 0) == (wv['cornell'] > 0)) / 50, w) for w in list(wv.key_to_index)[:500])[:3]
    </pre>
    </details> 
</font>
<hr>

## Task 4

Retrieve all words from `wv` that contain the word `'university'`. What separators do you observe? Any punctuation or uppercasing?

<b>Hint:</b> Try it with list comprehension.


In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
    <pre>
print([w for w in wv.key_to_index if 'university' in w])
    </pre>
    </details> 
</font>
<hr>

## Task 5

Retrieve all words from `wv` that contain the words `'new'` and `'york'`. What separators do you observe?

<b>Hint:</b> This is similar to the code above, but you need two comparisons in your condition statement.


In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
    <pre>
[w for w in wv.key_to_index if 'new' in w and 'york' in w]
    </pre>
    </details> 
</font>
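
A substring test matches the two pieces in any order and inside longer words, so some matches may not be the place name at all. The sketch below shows this on a hypothetical list of keys (the real vocabulary may differ) and uses a regular expression to pull out whatever sits between `'new'` and `'york'` when they appear in that order.

```python
import re

# Hypothetical keys for illustration; the real vocabulary may differ.
toy_vocab = ["new", "york", "new-york", "newyork", "new_york", "yorkshire-news", "newark"]

# Same condition as in the solution: both substrings, in any order.
both = [w for w in toy_vocab if "new" in w and "york" in w]
print(both)  # ['new-york', 'newyork', 'new_york', 'yorkshire-news']

# Capture the text between 'new' and 'york' when they appear in that order.
pattern = re.compile(r"new(.*?)york")
separators = {}
for w in both:
    match = pattern.search(w)
    if match:
        separators[w] = match.group(1)
print(separators)  # {'new-york': '-', 'newyork': '', 'new_york': '_'}
```
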
<hr>