Want a big corpus to do NLP research?  How about all of Wikipedia?<br>
https://corpus.byu.edu/wiki/ <br>
This corpus contains approximately 2 billion words, and a vocabulary of approximately 400,000 words, which corresponds to 400,000 dimensions in the bag of words model.  This is a big space to deal with!  How might we reduce our space to a smaller dimension?

A single word such as *computer* might be represented in bag of words as a vector like $(1,0,0,\dots,0)$ and another word like *software* might be represented as a vector like $(0,1,0,0,\dots,0)$.  The two vectors are orthogonal to each other, despite the fact that the words *computer* and *software* are related to each other.
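
To make the orthogonality concrete, here is a minimal sketch with a made-up three-word vocabulary (the names `vocab`, `index` and `one_hot` are illustrative assumptions, not part of any library):

```python
import numpy as np

# Toy vocabulary (hypothetical; the real one has about 400,000 entries).
vocab = ["computer", "software", "banana"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the bag-of-words vector for a single word."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

computer = one_hot("computer")
software = one_hot("software")

# The dot product of two distinct one-hot vectors is always 0,
# so this representation says nothing about how related the words are.
print(np.dot(computer, software))  # 0.0
print(np.dot(computer, computer))  # 1.0
```

In this representation *computer* is exactly as far from *software* as it is from *banana*.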

The technique of ***word embedding*** attempts to embed words into a continuous vector space of smaller dimension (say 50) in such a way that their semantic relationships are preserved.<br>
$$\mathbb{Z}_{\geq 0}^{400{,}000} \longrightarrow \mathbb{R}^{50}$$

For example, we might hope that the vector corresponding to *computer* is relatively similar to the vector corresponding to *software*.

Given an embedding scheme, we can either train it on our own corpus of data or we can use a pre-trained model.

Several word embedding schemes are publicly available.  One of the first was the **Word2vec** scheme created (and patented) by Tomáš Mikolov of Google.  Google hosts an archive of code and pre-trained vectors. The vectors were trained on Google News data and cover 3,000,000 words and phrases embedded in 300 dimensions.<br>
https://code.google.com/archive/p/word2vec/

Another commonly used scheme, created by NLP researchers Manning, Pennington and Socher at Stanford University, is **Global Vectors for Word Representation (GloVe)**. The project also provides code and vectors pre-trained on various corpora, such as the Wikipedia corpus and a Twitter corpus.<br>
https://nlp.stanford.edu/projects/glove/<br>


Stanford also provides a number of good NLP lectures that are available on YouTube.  For example,<br>
https://www.youtube.com/watch?v=OQQ-W_63UgQ<br>
is Christopher Manning's introductory lecture for Stanford's NLP class, and at the 55:30 mark he discusses word vectors.  And Manning's entire second lecture is on word vectors:<br>
https://www.youtube.com/watch?v=ERibwqs9p38

If 80-minute video lectures aren't your thing, how about a pair of 1-minute videos from Udacity?<br>
https://www.youtube.com/watch?v=186HUTBQnpY<br>
https://www.youtube.com/watch?v=xMwx2A_o5r4

The GloVe file ```glove.6B.zip``` provides four different pre-trained word embeddings for 400,000 words, trained on Wikipedia 2014 together with the Gigaword 5 newswire corpus. The embeddings are in dimensions 50, 100, 200 and 300.  For simplicity, let's focus on the 50-dimensional embeddings, in the file ```glove.6B.50d.txt```.

If we inspect ```glove.6B.50d.txt```, we see that it is a text file with one row per line, where each row is a word followed by 50 signed floating-point numbers separated by spaces.

```the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581```

Let's open up the GloVe file for reading.

In [5]:
fin = open("glove.6B/glove.6B.50d.txt", "r", encoding="utf-8")  # the file contains non-ASCII tokens, so state the encoding explicitly

Now let's read in each of the lines into a list of lines.

In [6]:
lines = fin.readlines()
fin.close()  # Closes the file and frees any system resources used by the open file.
len(lines)

400000

Now let's parse each line and create a dictionary of word vectors (which we'll store as ```numpy``` arrays).

In [44]:
import numpy as np
word2vec = {}
for line in lines:
    splitLine = line.split(" ")
    word = splitLine[0]
    vector = np.array([float(x) for x in splitLine[1:51]])
    word2vec[word] = vector
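
The same parsing can also be done in one pass, without first holding every line in memory. The sketch below is one possible way to do it; `load_vectors` is a hypothetical helper name, and the two-line sample with 3-dimensional made-up vectors only stands in for the real file so that the example runs on its own:

```python
import io
import numpy as np

def load_vectors(handle):
    """Parse GloVe-style text lines from an open file handle into a dict."""
    vectors = {}
    for line in handle:
        parts = line.split()  # splits on whitespace and drops the newline
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        vectors[word] = np.array([float(x) for x in parts[1:]])
    return vectors

# Stand-in data: 3 dimensions instead of 50, with made-up numbers.
sample = io.StringIO("cat 0.1 -0.2 0.3\ndog 0.4 0.5 -0.6\n")
toy = load_vectors(sample)
print(toy["dog"])
```

With the real data, the handle would come from something like `with open("glove.6B/glove.6B.50d.txt", encoding="utf-8") as fin:`, which also closes the file automatically.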

In [45]:
print(list(word2vec['the']))

[0.418, 0.24968, -0.41242, 0.1217, 0.34527, -0.044457, -0.49688, -0.17862, -0.00066023, -0.6566, 0.27843, -0.14767, -0.55677, 0.14658, -0.0095095, 0.011658, 0.10204, -0.12792, -0.8443, -0.12181, -0.016801, -0.33279, -0.1552, -0.23131, -0.19181, -1.8823, -0.76746, 0.099051, -0.42125, -0.19526, 4.0071, -0.18594, -0.52287, -0.31681, 0.00059213, 0.0074449, 0.17778, -0.15897, 0.012041, -0.054223, -0.29871, -0.15749, -0.34758, -0.045637, -0.44251, 0.18785, 0.0027849, -0.18411, -0.11514, -0.78581]


What is the Euclidean length of this vector?

In [46]:
np.linalg.norm(word2vec['the'])

4.9678269047049
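
For reference, with its default arguments `np.linalg.norm` computes the ordinary Euclidean norm of a one-dimensional array, which for a 50-dimensional vector $v = (v_1, \dots, v_{50})$ is

$$\lVert v \rVert = \sqrt{\sum_{i=1}^{50} v_i^2}$$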

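Let's create functions that calculate the cosine similarity between two vectors and between two words.  (Despite the `Dist` in their names, both functions return a similarity: larger values mean the vectors point in more similar directions.)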

In [47]:
def cosDist(v1,v2):
    cosine = np.dot(v1,v2) / ( np.linalg.norm(v1) * np.linalg.norm(v2) )
    return(cosine)

def wordDist(w1,w2):
    v1 = word2vec[w1]
    v2 = word2vec[w2]    
    cosine = cosDist(v1,v2)
    return(cosine)
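
Written as a formula, the quantity computed by `cosDist` for two vectors $v_1$ and $v_2$ is the cosine of the angle $\theta$ between them:

$$\cos\theta = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$$

It always lies in $[-1, 1]$: a value of 1 means the vectors point in the same direction, 0 means they are orthogonal, and $-1$ means they point in opposite directions.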

In [48]:
wordDist("computer","the")

0.519508383525476

In [49]:
wordDist("computer","software")

0.8814993634710456

In [50]:
print(wordDist("beyonce","gaga"))  # Beyonce and Lady Gaga are similar.
print(wordDist("beyonce","jay"))
print(wordDist("gaga","jay")) # and Jay (Z) is closer to Beyonce than to Lady Gaga.
print(wordDist("beyonce","naiman")) # but Naiman and Beyonce are nearly orthogonal to each other.

0.6757718666324486
0.440908893106959
0.1505738416811099
-0.07746374002723987


In [62]:
cosineDict = {w:wordDist("king",w) for w in word2vec}  # dictionary comprehension
print(cosineDict["prince"])

0.82361796933357


In [61]:
sorted([(value,key) for (key,value) in cosineDict.items()], reverse=True)[:10]

[(1.0000000000000002, 'king'),
 (0.82361796933357, 'prince'),
 (0.7839043010964116, 'queen'),
 (0.7746230030635107, 'ii'),
 (0.7736247624872923, 'emperor'),
 (0.7667193954606585, 'son'),
 (0.7627150944065074, 'uncle'),
 (0.7542161124008465, 'kingdom'),
 (0.7539914268281643, 'throne'),
 (0.7492411846124971, 'brother')]
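
Looping over 400,000 words in pure Python works but is slow. A possible alternative, sketched here under the assumption that all vectors fit in a single NumPy matrix, computes every cosine similarity in one matrix product. The function name `most_similar` and the 3-dimensional toy vectors are illustrative assumptions, not real GloVe values:

```python
import numpy as np

def most_similar(query, vectors, topn=3):
    """Return the topn (similarity, word) pairs closest to query by cosine."""
    words = list(vectors)
    matrix = np.stack([vectors[w] for w in words])  # shape (n_words, dim)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = matrix @ query / norms                   # one cosine per word
    order = np.argsort(-sims)[:topn]                # indices, best first
    return [(float(sims[i]), words[i]) for i in order]

# Made-up vectors standing in for the real embeddings.
toy = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.9, 0.1]),
    "apple": np.array([0.0, 0.1, 1.0]),
}
for score, word in most_similar(toy["king"], toy):
    print(round(score, 3), word)
```

In practice the matrix and the norms would be built once and reused across queries rather than rebuilt on every call.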

Word embeddings sometimes preserve semantic relationships between words.  For example, consider the analogy:<br>
"`Man` is to `king` as `woman` is to \_\_\_\_\_\_\_\_".<br>
What term should go in the blank?

In [70]:
v = word2vec['king'] - word2vec['man'] + word2vec['woman']

Now let's look for words that are similar to the above vector of `king` - `man` + `woman`.

In [71]:
cosineDict = {w:cosDist(v,word2vec[w]) for w in word2vec}

In [72]:
sorted([(value,key) for (key,value) in cosineDict.items()], reverse=True)[:10]

[(0.8859834623625931, 'king'),
 (0.8609581258578942, 'queen'),
 (0.768451180089547, 'daughter'),
 (0.764069959135472, 'prince'),
 (0.7634970756412144, 'throne'),
 (0.7512728447426918, 'princess'),
 (0.750648883691072, 'elizabeth'),
 (0.7314496957870618, 'father'),
 (0.7296158352285126, 'kingdom'),
 (0.7280010324674799, 'mother')]
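
Note that the top match is `king` itself, one of the words used to build the query vector. A common convention is to skip the three input words when ranking candidates. The sketch below shows that idea on made-up 3-dimensional vectors; the helper names `cosine` and `analogy` are assumptions for illustration:

```python
import numpy as np

def cosine(v1, v2):
    """Cosine similarity between two vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' while skipping the three input words."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Made-up vectors chosen so that king - man + woman equals queen exactly.
toy = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([0.5, 0.5, -1.0]),
}
print(analogy("man", "king", "woman", toy))  # queen
```

Applied to the GloVe results above, this filtering would leave `queen` as the top-ranked answer.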

In [94]:
v = word2vec['teacher'] - word2vec['school'] + word2vec['church']
cosineDict = {w:cosDist(v,word2vec[w]) for w in word2vec}
sorted([(value,key) for (key,value) in cosineDict.items()], reverse=True)[:10]

[(0.8092078966752569, 'priest'),
 (0.804460582009797, 'church'),
 (0.7725253836901848, 'orthodox'),
 (0.7700323232560121, 'catholic'),
 (0.7411272859589869, 'congregation'),
 (0.7331976241792784, 'pastor'),
 (0.7090660804573766, 'christian'),
 (0.6940143018853606, 'clergyman'),
 (0.6935899138284772, 'roman'),
 (0.6829545665503904, 'faith')]