# Word Embeddings

* Word embeddings are numerical representations of words or phrases that capture their meaning in a vector space.
* With word embeddings, each word has a **universal numerical representation** associated with its **own definition**.
* This makes it possible to represent each word by a **unique vector that is independent of the text under consideration**.
* They are useful for natural language processing tasks because they capture the **semantic similarity and semantic distance between words**.

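* Benefits of word embeddings:
    1. They **retain semantic similarity**: words with related meanings get nearby vectors.
    2. They are **dense vectors**: most components are non-zero.
    3. They have a **constant vector size**, whatever the size of the vocabulary.
    4. Their vector representations are **absolute**: a word's vector does not depend on the text being analysed.
    5. **Multiple embedding models** are available (for example GloVe and word2vec).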
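
To make the "dense" and "constant size" points concrete, here is a minimal sketch that contrasts one-hot vectors with dense vectors on a toy vocabulary. The one-hot encoding is brought in only for comparison, and the dense values are made up for illustration; they do not come from any trained model.

```python
# Toy vocabulary; the dense values below are hypothetical, for illustration only.
vocab = ["cat", "dog", "car"]

# One-hot (sparse): the vector length equals the vocabulary size.
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}

# Dense: every word gets a vector of the same fixed size (here 2),
# regardless of how large the vocabulary is.
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Distinct one-hot vectors never overlap, so they carry no similarity signal.
print(dot(one_hot["cat"], one_hot["dog"]))

# Dense vectors can place 'cat' closer to 'dog' than to 'car'.
print(dot(dense["cat"], dense["dog"]) > dot(dense["cat"], dense["car"]))
```

In this toy setup the one-hot vectors grow with the vocabulary and treat every pair of words as equally unrelated, while the dense vectors stay the same size and can encode that some words are closer than others.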
    

In this worksheet, we will use the **gensim** library to work with *word embeddings*.
 

In [1]:
# Install gensim
# !pip install --upgrade gensim

# Check that the version is 4+
!pip show gensim

Name: gensim
Version: 4.3.2
Summary: Python framework for fast Vector Space Modelling
Home-page: https://radimrehurek.com/gensim/
Author: Radim Rehurek
Author-email: me@radimrehurek.com
License: LGPL-2.1-only
Location: D:\anaconda\Lib\site-packages
Requires: numpy, scipy, smart-open
Required-by: 


In [2]:
# Import the libraries and load the gensim model 'glove-wiki-gigaword-50'.
# It takes less time to download and load than the larger models,
# and the results are roughly similar.

import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")





In [3]:
# The 'word2vec-google-news-300' model usually takes a long time to download and load.
# If you want to work with it instead, use:
# model = api.load("word2vec-google-news-300")

Functions in `gensim`:

* `model['word']`: returns the actual word vector.
* `model.most_similar('word')`: returns a list of the words most similar to a given word.
* `model.similarity('word1', 'word2')`: computes the cosine similarity between two words.
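
As a rough mental model of what a "most similar" lookup does, the sketch below ranks the words of a toy vocabulary by cosine similarity to a query word. It is not gensim's implementation; the function name `toy_most_similar` and the 3-dimensional vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, not taken from any trained model.
words = ["apple", "banana", "car", "truck"]
vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.0],
    [0.1, 0.0, 0.9],
    [0.0, 0.1, 0.8],
])

# Normalise each row so that a dot product equals the cosine similarity.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


def toy_most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word` (brute force)."""
    idx = words.index(word)
    sims = unit @ unit[idx]          # similarity of every word to the query
    order = np.argsort(-sims)        # indices sorted from most to least similar
    return [(words[i], float(sims[i])) for i in order if i != idx][:topn]


print(toy_most_similar("apple"))
print(toy_most_similar("car"))
```

The output has the same shape as the lists shown in this worksheet: `(word, score)` pairs sorted by decreasing similarity, with the query word itself left out.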

In [4]:
# To get the actual word vector
print(model['book'])

[-0.0076543  0.93456   -0.73189   -0.55162    0.76977    0.35925
 -1.1365    -1.1632     0.34214    0.29145   -0.8711     0.9197
 -0.47069   -0.22834    1.4777    -0.81714   -0.17466   -0.51093
 -0.28354    0.23292    0.71832    0.23414    0.49443    0.35483
  0.76889   -1.4374    -1.7457    -0.28994   -0.10156   -0.36959
  2.5502    -1.0581    -0.049416  -0.25524   -0.63303    0.02671
 -0.18733    0.20206   -0.26288   -0.41418    0.83473   -0.14227
 -0.28125    0.098155  -0.17096    0.52408    0.31851   -0.089847
 -0.27223   -0.0088736]


In [5]:
# To get the words most similar to a given word
print(model.most_similar("apple"))

[('blackberry', 0.7543067932128906), ('chips', 0.7438643574714661), ('iphone', 0.7429665327072144), ('microsoft', 0.7334205508232117), ('ipad', 0.7331036329269409), ('pc', 0.7217226624488831), ('ipod', 0.7199784517288208), ('intel', 0.7192243337631226), ('ibm', 0.7146540880203247), ('software', 0.7093585729598999)]


In [8]:
print(model.most_similar("apples"))

[('peaches', 0.862353503704071), ('oranges', 0.859447717666626), ('cherries', 0.8461860418319702), ('mangoes', 0.8264982104301453), ('apricots', 0.8242633938789368), ('strawberries', 0.8229067325592041), ('potatoes', 0.8179376125335693), ('melons', 0.7980057597160339), ('berries', 0.7946050763130188), ('vegetables', 0.792052149772644)]


In [12]:
# To get the similarity score between two words
print("similarity score between apples and banana:", model.similarity("apples", "banana"))
print("similarity score between apple  and dog:   ", model.similarity("apple", "dog"))
print("similarity score between cat    and dog:   ", model.similarity("cat", "dog"))

similarity score between apples and banana: 0.5633737
similarity score between apple  and dog:    0.41387236
similarity score between cat    and dog:    0.92180055


### Vocabulary

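Let's take a look at the model's vocabulary.

`model.index_to_key` returns a list of **words** ordered by **index**. The reverse mapping, `model.key_to_index`, is a dictionary whose **keys** are **words** and whose **values** are **indices**.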

In [15]:
vocab = model.index_to_key

In [16]:
# Print 10 random samples of 5 tokens each from the vocabulary

for _ in range(10) :
    print(np.random.choice(vocab, 5))

['reaud' 'ex-gay' 'powerless' 'stinchcomb' 'horsehead']
['furgal' 'umaria' 'daedalus' 'joulwan' 'mulleavy']
['trotti' 'absolves' 'engadin' 'dunai' 'witherell']
['cowls' 'futari' 'odgen' 'zapeta' 'neatness']
['polypeptides' 'thorndyke' 'ibs' 'karlović' 'nacreous']
['unaccompanied' 'impressionist' 'levines' 'astrue' 'ovamboland']
['universalistic' 'oast' '77' 'doorjambs' 'equivocating']
['kostić' 'smoking' 'answers.com' 'zhlobin' 'sikdar']
['canisteo' 'khaos' 'beistline' 'bicci' 'brookins']
['niazi' 'lohia' 'arvin' 'chosin' "l'union"]


**Cosine Similarity** and **Levenshtein Distance** :

In [19]:
# Cosine similarity can be computed with SciPy as 1 - cosine distance

from scipy import spatial
vector1 = [1, 1, 2, 2, 3]
vector2 = [1, 3, 1, 2, 6]

cosine_similarity = 1 - spatial.distance.cosine(vector1, vector2)
print(cosine_similarity)

0.8994895926845297
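
For reference, the same number can be reproduced without SciPy. Cosine similarity is the dot product of the two vectors divided by the product of their lengths; the sketch below spells that out in plain Python. The function name `cosine_similarity_by_hand` is only illustrative.

```python
import math


def cosine_similarity_by_hand(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Same vectors as in the SciPy cell: dot = 28, |u|^2 = 19, |v|^2 = 51
print(cosine_similarity_by_hand([1, 1, 2, 2, 3], [1, 3, 1, 2, 6]))
```

Here the result is 28 / sqrt(19 * 51), which is approximately 0.8995 and agrees with the SciPy value up to floating-point rounding.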


Let's calculate the cosine similarity for two words : *grass* and *trees*

In [20]:
# Using the gensim library
print("Similarity score between grass and trees : ", model.similarity('grass', 'trees'))

Similarity score between grass and trees :  0.70772266


In [21]:
# Using scipy
cosine_similarity = 1 - spatial.distance.cosine(model['grass'], model['trees'])
print("Cosine Similarity between grass and trees : ", cosine_similarity)

# In the above code, model['grass'] and model['trees'] will give you the vector for those words

Cosine Similarity between grass and trees :  0.707722544670105


* **Levenshtein Distance or Edit Distance:**
    * It measures how similar two words are in spelling, not in meaning.
    * It is the **minimum number of single-character edits** (insertion, deletion or substitution) required to **change one word into another**.
    * `Levenshtein.distance('word1', 'word2')` returns the number of edits.
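
To show where the edit count comes from, here is a minimal pure-Python sketch of the classic dynamic-programming recurrence for edit distance. It is not the code of the `Levenshtein` package; the function name `edit_distance` is an assumption made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    # prev[j] = distance between the empty prefix of `a` and the first j chars of `b`
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i chars of `a` into '' takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (free if they match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("test", "test"))
print(edit_distance("test", "team"))
print(edit_distance("kitten", "sitting"))
```

Each cell of the table holds the cheapest way to turn a prefix of the first word into a prefix of the second, so the last cell is the full distance.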

In [24]:
# To install
# !pip install levenshtein

In [23]:
from Levenshtein import distance

In [25]:
print(f"distance('test', 'test') : {distance('test', 'test')} because no edits are needed")
print(f"distance('test', 'team') : {distance('test', 'team')} because two character substitutions are needed; s -> a ; t -> m")

distance('test', 'test') : 0 because no edits are needed
distance('test', 'team') : 2 because two character substitutions are needed; s -> a ; t -> m


**Shortcomings of Word Embeddings:**
* Because the models depend on the data they were trained on, they carry two significant side effects:
    1. Cultural Bias
    2. Out-Of-Vocabulary ( OOV ) issues
  
**1. Cultural Bias:**
* The `word2vec-google-news-300` model was trained on a massive US Google News corpus, so it learned the relationships between words as they appear in news seen by Google in the US. There is nothing inherently more biased about US news than about a corpus from another part of the world, but training on such a dataset means that the **model inherits a certain dose of cultural bias**.
* For more critical applications, you should be aware that these models are **not universal or neutral but directly influenced by the corpus they were trained on**.
    
**2. Out-Of-Vocabulary :**
* GloVe and word2vec models have a finite vocabulary.
* Words for which the model has no vector representation are known as **OOV words**.
* One way of handling OOV words is to return a **vector filled with zeros**.
* Another way is to update the model by training it further on your own dataset, so that the OOV words end up with their own vector representations. This process is called **Fine-Tuning** or **Transfer Learning**.
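
The zero-vector fallback can be sketched as follows. The helper `vector_or_zeros` and the toy dictionary are hypothetical; a plain dict stands in for the embedding model so the example runs without downloading anything. With a gensim `KeyedVectors` object, `word in model` and `model.vector_size` should play the same roles.

```python
import numpy as np


def vector_or_zeros(vectors, word, vector_size):
    """Return the word's vector if known, otherwise a zero vector of the same size."""
    if word in vectors:
        return np.asarray(vectors[word], dtype=np.float32)
    return np.zeros(vector_size, dtype=np.float32)


# Toy stand-in for a trained model; the values are made up.
toy_vectors = {"book": np.array([0.1, 0.2, 0.3], dtype=np.float32)}

print(vector_or_zeros(toy_vectors, "book", 3))   # known word: its own vector
print(vector_or_zeros(toy_vectors, "covid", 3))  # OOV word: all zeros
```

A zero vector keeps downstream code from crashing, but it also makes every OOV word look identical, which is one reason fine-tuning on your own data can be the better option.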

In [33]:
# 1. Cultural bias:
# In the US, Alexis is a feminine name, while in the rest of the world it's a masculine name.
# Word2vec was trained on US-centric data.
# This shows up when looking at the names the model considers
# most similar to 'Alexis': Nicole, Erica, Marissa, Alicia ... all women's names.

In [31]:
# 2. OOV
# Some words are not in the model's vocabulary, for instance 'covid' and ... 'word2vec'.

vocab = model.index_to_key

# no covid words (only 'covidien', which is a company)
start_with = 'covid'
vocab_subset = sorted(tk.lower() for tk in vocab if tk.lower().startswith(start_with))
print(vocab_subset)

# no word2vec words
start_with = 'word2vec'
vocab_subset = sorted(tk.lower() for tk in vocab if tk.lower().startswith(start_with))
print(vocab_subset)

['covidien']
[]
