## Word Embeddings

- Word embeddings transform binary, count-based or tf-idf vectors into much lower-dimensional vectors of real numbers. A one-hot encoded or binary vector is also known as a sparse vector, whilst the real-valued vector is known as a dense vector.

- A word embedding maps discrete, categorical values to a continuous space. Major advances in NLP applications have come from these continuous representations of words.

- The key concept in these word embeddings is that words that appear in similar contexts lie close together in the vector space, i.e. the Euclidean distance between their word vectors is small.

- By context here, we mean the surrounding words. For example, in the sentences **"it is the time of stupidity"** and **"it is the age of foolishness"**, the words **'time'** and **'age'**, and likewise **'stupidity'** and **'foolishness'**, appear in the same context and thus should be close together in vector space.

- You have already learned about word2vec, which calculates word vectors from a corpus. In this lab session we use GloVe vectors; GloVe is another algorithm for calculating word vectors. If you want to find out more about GloVe, check the website [here](https://nlp.stanford.edu/projects/glove/). For more information about word embeddings, go [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/).
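To make the sparse/dense contrast concrete, here is a minimal sketch in plain Python. The five-word vocabulary and the 3-dimensional dense values are invented for illustration (real embeddings are learned from a corpus). It shows that one-hot vectors are all equally far apart, whereas dense vectors can place 'time' closer to 'age' than to an unrelated word.

```python
import math

# Toy vocabulary; the words and the dense values below are made up.
vocab = ["time", "age", "stupidity", "foolishness", "banana"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: one slot per vocabulary word, a single 1."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

# Dense representation: a short vector of real numbers per word (hypothetical values).
dense = {
    "time": [0.9, 0.1, 0.0],
    "age": [0.8, 0.2, 0.1],
    "stupidity": [0.1, 0.9, 0.0],
    "foolishness": [0.2, 0.8, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

print(one_hot("age"))  # length grows with the vocabulary size
print(dense["age"])    # length is fixed by the embedding dimension

# Every pair of distinct one-hot vectors is the same distance apart ...
print(math.dist(one_hot("time"), one_hot("age")),
      math.dist(one_hot("time"), one_hot("banana")))
# ... whereas dense vectors can encode that 'time' is nearer to 'age' than to 'banana'.
print(math.dist(dense["time"], dense["age"]),
      math.dist(dense["time"], dense["banana"]))
```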

## Loading the GloVe vectors

First, we'll load the GloVe vectors. The `name` argument specifies what the vectors have been trained on; here `6B` means a corpus of 6 billion tokens. The `dim` argument specifies the dimensionality of the word vectors. **The 6B GloVe vectors are available in 50, 100, 200 and 300 dimensions.** There are also 42B and 840B GloVe vectors, **however they are only available at 300 dimensions**.

- For more information about loading GloVe vectors with `torchtext`, see the [documentation](https://torchtext.readthedocs.io/en/latest/vocab.html#glove).

- [GloVe](https://github.com/stanfordnlp/GloVe) provides pre-trained vectors built from corpora of different domains:

    - **Common Crawl** (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)
    - **Common Crawl** (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
    - **Wikipedia 2014 + Gigaword 5** (6B tokens, 400K vocab, uncased, 50d, 100d, 200d and 300d vectors, 822 MB download)
    - **Twitter** (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d and 200d vectors, 1.42 GB download)

In [None]:
import torchtext.vocab

glove = torchtext.vocab.GloVe(name = '6B', dim = 100)

print(f'There are {len(glove.itos)} words in the vocabulary')

There are 400000 words in the vocabulary


As shown above, **there are 400,000 unique words** in the GloVe vocabulary. These are the most common words found in the corpus the vectors were trained on. **In this set of GloVe vectors, every word is lower-case.**

`glove.vectors` is the actual tensor containing the values of the embeddings.

In [None]:
glove.vectors.shape

torch.Size([400000, 100])

We can see what word is associated with each row by checking the **itos (int to string)** list. We can also use the **stoi (string to int)** dictionary, in which we input a word and receive the associated integer/index. If you try to get the index of a word that is not in the vocabulary, you receive an error.

In [None]:
glove.itos[:10]

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

In [None]:
glove.stoi['the']

0
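The two lookups, and the error for an unknown word, can be tried without downloading anything. The sketch below uses a five-entry stand-in for `glove.itos` and `glove.stoi`, and assumes `stoi` behaves like a plain Python dictionary. The helper `lookup_index` is hypothetical (it is not part of `torchtext`); it lower-cases the word first, since the 6B vectors are uncased, and returns `None` instead of raising.

```python
# Toy stand-ins for glove.itos and glove.stoi (the real ones hold 400,000 entries).
itos = ["the", ",", ".", "of", "to"]
stoi = {word: index for index, word in enumerate(itos)}

def lookup_index(word):
    """Return the row index for `word`, or None if it is out of vocabulary."""
    # The 6B vectors are uncased, so normalise to lower case before the lookup.
    return stoi.get(word.lower())

print(lookup_index("The"))        # 0
print(lookup_index("embedding"))  # None

# Indexing the dictionary directly raises KeyError for an unknown word.
try:
    stoi["embedding"]
except KeyError as error:
    print(f"not in vocabulary: {error}")
```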

In [None]:
print(glove.vectors[glove.stoi['the']])
print(glove.vectors[glove.stoi['the']].shape)

tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459,  0.8278,  0.27

In [None]:
def get_vector(embeddings, word):
    assert word in embeddings.stoi, f'*{word}* is not in the vocab!'
    return embeddings.vectors[embeddings.stoi[word]]

In [None]:
print(get_vector(glove, 'dhaka'))
print(get_vector(glove, 'dhaka').shape)

tensor([-0.4313, -0.4264,  0.0080,  0.0843, -0.2632,  0.5497, -0.6609,  1.4093,
        -0.3155,  0.9577, -0.2203, -0.5726,  0.3466,  0.2939, -0.0545, -0.6057,
         0.1025, -0.2336,  0.3325, -0.5356,  0.9242, -0.4160,  0.9887,  0.0442,
         0.1104, -0.2943,  0.3938, -0.2971,  0.0072,  0.6839, -0.5073, -0.0298,
         0.0992,  0.3575, -1.0666, -0.3014,  0.0949,  0.1943, -0.7708,  0.1985,
        -0.3778,  0.8102,  0.0784, -0.8903,  1.0367,  0.1295, -0.1955,  0.3953,
        -0.1357,  0.5108, -0.1412, -0.2397,  0.8553,  0.2163, -0.4538,  0.0355,
        -1.1429, -0.3612,  0.8375, -0.0534, -1.3352, -0.1135, -0.7246,  0.1347,
        -0.7338,  0.6919, -0.1318, -0.1666,  0.4299,  0.3777,  0.0694, -0.5150,
        -0.0721, -0.7482, -0.1416,  0.3478,  0.4706,  0.2370, -0.8630, -0.3583,
         0.4200,  0.4044, -0.7176, -0.3392, -0.3349,  0.3887, -0.7387, -0.2911,
        -0.1261,  0.5037,  0.9511,  0.1648, -0.5015, -0.1889, -0.4417,  1.2995,
         0.9472,  0.0645,  0.4155,  0.83

## Similar Contexts

Now to start looking at the context of different words.

If we want to find the words similar to a certain input word, we first find the vector of this input word, then we scan through our vocabulary calculating the distance between the vector of each word and our input word vector. We then sort these from closest to furthest away.

The function below returns the closest 10 words to an input word vector:

In [None]:
import torch

def closest_words(embeddings, vector, n = 10):
    
    distances = [(word, torch.dist(vector, get_vector(embeddings, word)).item())
                 for word in embeddings.itos]
    
    return sorted(distances, key = lambda w: w[1])[:n]
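The same scan-and-rank procedure can be tried on a toy vocabulary. This is a minimal sketch in plain Python with invented 2-dimensional vectors; the name `closest_words_toy` is hypothetical. It uses `heapq.nsmallest`, which keeps only the `n` best candidates instead of sorting the whole vocabulary, a choice that starts to matter with 400,000 words.

```python
import heapq
import math

# Hypothetical 2-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "dhaka": [1.0, 1.0],
    "chittagong": [1.1, 0.9],
    "lahore": [1.5, 1.2],
    "google": [-2.0, 3.0],
    "yahoo": [-2.1, 2.8],
}

def closest_words_toy(vectors, query, n=3):
    """Return the n (word, distance) pairs nearest to `query`, closest first."""
    distances = ((word, math.dist(query, vec)) for word, vec in vectors.items())
    # nsmallest avoids sorting the full list when n is small.
    return heapq.nsmallest(n, distances, key=lambda pair: pair[1])

for word, distance in closest_words_toy(toy_vectors, toy_vectors["dhaka"]):
    print(f"{word}\t{distance:.3f}")
```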

Let's try it out with 'dhaka'. The closest word is 'dhaka' itself (not very interesting), but all of the other words are related to it in some way.

Interestingly, we also get 'lahore' and 'karachi', which suggests that Bangladesh and Pakistan are frequently talked about together in similar contexts.

Moreover, most of the other neighbours are cities that are geographically close to Dhaka.

In [None]:
word_vector = get_vector(glove, 'dhaka')

closest_words(glove, word_vector)

[('dhaka', 0.0),
 ('lahore', 3.73711895942688),
 ('karachi', 3.852436065673828),
 ('calcutta', 3.947979211807251),
 ('kathmandu', 3.9504103660583496),
 ('bangkok', 3.986726760864258),
 ('chittagong', 4.223588466644287),
 ('multan', 4.2937541007995605),
 ('harare', 4.355865001678467),
 ('delhi', 4.358461856842041)]

Looking at a country this time, India, we again get nearby countries: Pakistan, Bangladesh, Thailand, Malaysia and Sri Lanka (as two separate words). Australia also appears, and it ranks above Thailand and Malaysia even though those two are geographically closer to India. So why is Australia closer to India in vector space? This is most probably because India and Australia often appear together in the context of cricket matches.

In [None]:
word_vector = get_vector(glove, 'india')

closest_words(glove, word_vector)

[('india', 0.0),
 ('pakistan', 3.6954822540283203),
 ('indian', 4.114313125610352),
 ('delhi', 4.155975818634033),
 ('bangladesh', 4.261017799377441),
 ('lanka', 4.435846328735352),
 ('sri', 4.515717029571533),
 ('australia', 4.806082725524902),
 ('thailand', 4.994781494140625),
 ('malaysia', 5.009334087371826)]

In [None]:
word_vector = get_vector(glove, 'google')

closest_words(glove, word_vector)

[('google', 0.0),
 ('yahoo', 3.0772178173065186),
 ('microsoft', 3.8836112022399902),
 ('web', 4.10483980178833),
 ('aol', 4.108161449432373),
 ('facebook', 4.116486072540283),
 ('ebay', 4.39174222946167),
 ('msn', 4.412169933319092),
 ('internet', 4.4540276527404785),
 ('netscape', 4.465073108673096)]

## Analogies

Another property of word embeddings is that they can be operated on just as any standard vector and give interesting results.

In [None]:
def analogy(embeddings, word1, word2, word3, n=4):
    
    #get vectors for each word
    word1_vector = get_vector(embeddings, word1)
    word2_vector = get_vector(embeddings, word2)
    word3_vector = get_vector(embeddings, word3)
    
    #calculate analogy vector
    analogy_vector = word2_vector - word1_vector + word3_vector
    
    #find closest words to analogy vector
    candidate_words = closest_words(embeddings, analogy_vector, n+3)
    
    #filter out words already in analogy
    candidate_words = [(word, dist) for (word, dist) in candidate_words 
                       if word not in [word1, word2, word3]][:n]
    
    print(f'{word1} is to {word2} as {word3} is to...')
    
    return candidate_words

<div align="center">
<img src="https://drive.google.com/uc?id=12Kku3uSvqqaTya7trjkfy5EKU7pC9u2U" width="500">
</div>


In [None]:
print(analogy(glove, 'man', 'king', 'woman'))

man is to king as woman is to...
[('queen', 4.08107852935791), ('monarch', 4.642907619476318), ('throne', 4.905500888824463), ('elizabeth', 4.921558380126953)]


If we think about it, the vector calculated from 'king' minus 'man' gives us a "royalty vector". This is the vector associated with travelling from a man to his royal counterpart, a king. If we add this "royalty vector" to 'woman', we should arrive at her royal equivalent, which is a queen!
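The arithmetic is easy to follow on hand-made vectors. In this sketch the 2-dimensional values are invented so that the first axis plays the role of "royalty" and the second the role of "gender"; real GloVe dimensions are not interpretable in this way. The function `toy_analogy` is a hypothetical, simplified counterpart of `analogy` that returns only the single nearest word.

```python
import math

# Hypothetical 2-d embeddings: first axis ~ "royalty", second axis ~ "gender".
toy = {
    "man": [0.0, 0.0],
    "woman": [0.0, 1.0],
    "king": [1.0, 0.0],
    "queen": [1.0, 1.0],
    "apple": [-1.0, 0.5],
}

def toy_analogy(vectors, word1, word2, word3):
    """Solve 'word1 is to word2 as word3 is to ?' by vector arithmetic."""
    # word2 - word1 is the offset (here the "royalty vector"); add it to word3.
    analogy_vector = [b - a + c for a, b, c in
                      zip(vectors[word1], vectors[word2], vectors[word3])]
    # Exclude the three input words, as the analogy function does.
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    return min(candidates, key=lambda w: math.dist(analogy_vector, vectors[w]))

print(toy_analogy(toy, "man", "king", "woman"))  # queen
```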

In [None]:
print(analogy(glove, 'man', 'actor', 'woman'))

man is to actor as woman is to...
[('actress', 2.8133397102355957), ('comedian', 5.003941059112549), ('actresses', 5.139926433563232), ('starred', 5.277286052703857)]


In [None]:
print(analogy(glove, 'india', 'delhi', 'bangladesh'))

india is to delhi as bangladesh is to...
[('dhaka', 3.076033353805542), ('kathmandu', 4.292879104614258), ('lahore', 4.358791351318359), ('bangladeshi', 4.5970611572265625)]


In [None]:
print(analogy(glove, 'good', 'heaven', 'bad'))

good is to heaven as bad is to...
[('hell', 4.395862102508545), ('ghosts', 5.286444187164307), ('hades', 5.289849281311035), ('madness', 5.341413497924805)]


In [None]:
print(analogy(glove, 'jordan', 'basketball', 'ronaldo'))

jordan is to basketball as ronaldo is to...
[('soccer', 6.455689907073975), ('romario', 6.579685688018799), ('beckham', 6.839649677276611), ('ronaldinho', 6.978160381317139)]


In [None]:
print(analogy(glove, 'paper', 'newspaper', 'screen'))

paper is to newspaper as screen is to...
[('tv', 4.780970096588135), ('television', 5.104853630065918), ('cinema', 5.381847858428955), ('feature', 5.552447319030762)]


## Similarity operations on embeddings

In [None]:
from scipy import spatial

def cosineSim(word1, word2):
    vector1, vector2 = get_vector(glove, word1), get_vector(glove, word2)
    return 1 - spatial.distance.cosine(vector1, vector2)

In [None]:
word_pairs = [
    ('dog', 'cat'),
    ('tree', 'cat'),
    ('tree', 'leaf'),
    ('king', 'queen'),
]

for word1, word2 in word_pairs:
    print(f'Similarity between "{word1}" and "{word2}":\t{cosineSim(word1, word2):.2f}')

Similarity between "dog" and "cat":	0.88
Similarity between "tree" and "cat":	0.45
Similarity between "tree" and "leaf":	0.64
Similarity between "king" and "queen":	0.75
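Cosine similarity compares the direction of two vectors and ignores their length: it is the dot product divided by the product of the two norms, so it ranges from -1 to 1. The sketch below is a plain-Python version written for illustration (the name `cosine_similarity` is ours, not a library function); on the same inputs it should agree with `1 - spatial.distance.cosine(...)` up to floating-point error.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity is (approximately) 1,
# although the Euclidean distance between the two vectors is not 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(math.dist([1.0, 2.0], [2.0, 4.0]))

# Orthogonal vectors have similarity 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```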


### Need to learn embeddings for your own corpus?

#### Simplest answer: use the [Gensim library](https://radimrehurek.com/gensim/auto_examples/index.html#documentation)

-  [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html)
-  [fastText](https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html)
-  [Doc2Vec](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html)
-  [GloVe](https://nlp.stanford.edu/projects/glove/)
-  [How is GloVe different from word2vec?](https://www.quora.com/How-is-GloVe-different-from-word2vec)


### Job Related Topics - Part I [Optional]

- Create a professional email address 
    - First name + last name = firstlast@domain.com
    - First name . last name = first.last@domain.com
    - First name - last name = first-last@domain.com
    - First name . middle name . last name = first.middle.last@domain.com
    - First name - middle name - last name = first-middle-last@domain.com
    - First initial + last name = flast@domain.com
    - First initial + middle name + last name = fmiddlelast@domain.com
    - First initial + middle initial + last name = fmlast@domain.com
- The shorter your email address, the better
- Complete your LinkedIn profile
- Prepare a CV in LaTeX
- Separate your contact numbers [personal vs professional]
- Create a GitHub profile [Username may only contain alphanumeric characters or single hyphens, and cannot begin or end with a hyphen.]
- You can also use the [desktop version of GitHub](https://desktop.github.com/). It's very easy to use without any commands!
- Build your website using [GitHub Pages](https://pages.github.com/)
- [Great templates](https://wowchemy.com/templates/) to use.