# GENSIM for word embeddings 

- THE GENSIM LIBRARY

    - Gensim is an open-source Python library for natural language processing.

    - It was developed and is maintained by the Czech natural language processing researcher Radim Řehůřek. 

    - In the previous tutorial, we saw how to use this package for topic modeling.
    
    - Here we use `gensim` for word embedding. 
    

- Word embeddings can be used for:
    - automated text tagging
    - recommendation engines
    - synonyms and search query expansion
    - machine translation
    - plain feature engineering

One example

<img src="1.png" alt="drawing" width="600"/>

Another example

<img src="2.png" alt="drawing" width="400"/>

## Software for word embeddings 

- Software for training and using word embeddings includes:
    - Tomas Mikolov's Word2vec
    - Stanford University's GloVe and GN-GloVe
    - AllenNLP's ELMo
    - BERT
    - fastText
    - Gensim
    - Indra and Deeplearning4j

- Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and to visualize word embeddings and clusters.

In [None]:
import pandas as pd
import numpy as np

import re  # For preprocessing

# from collections import defaultdict
# from time import time  # To time our operations
# import warnings
# warnings.filterwarnings('ignore')
# import logging  # Setting up the loggings to monitor gensim
# logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

## 1) CREATE A WORD2VEC MODEL

- Training the model with the Gensim Word2Vec implementation:
    - We use the Gensim implementation of word2vec: https://radimrehurek.com/gensim/models/word2vec.html

In [None]:
from gensim.models import Word2Vec
from gensim.test.utils import common_texts, get_tmpfile

common_texts

## Train a simple word embedding yourself

In [None]:
model_1 = Word2Vec(common_texts, vector_size=100, window=5, min_count=1, workers=4)

The hyperparameters of `Word2Vec`:

- `vector_size`: The number of dimensions of the embeddings. The default is 100.
- `window`: The maximum distance between a target word and the words around it. The default is 5.
- `min_count`: The minimum number of occurrences a word needs to be included in training; words that occur fewer times are ignored. The default is 5.
- `workers`: The number of worker threads used to train the model; the best value depends on your computer.
- `sg`: The training algorithm, either CBOW (0) or skip-gram (1). The default is CBOW.
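To make `min_count`, `window` and `sg` more concrete, here is a minimal sketch in plain Python of what these settings mean for the training data. The helper names `build_vocab` and `training_pairs` and the toy corpus are invented for illustration; Gensim's actual implementation adds refinements (for example, it can randomly shrink the window) that are left out here.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Keep only the tokens that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


def training_pairs(tokens, window=2):
    """Return (target, context_words) for every position in one sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        pairs.append((target, left + right))
    return pairs


# Hypothetical toy corpus in the list-of-lists format
toy_corpus = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system"],
]

vocab = build_vocab(toy_corpus, min_count=2)      # only "computer" occurs twice
pairs = training_pairs(toy_corpus[1], window=1)

for target, context in pairs:
    # skip-gram (sg=1) predicts each context word from the target;
    # CBOW (sg=0) predicts the target from the combined context words.
    print(target, context)
```

With `window=1`, each word is paired only with its immediate neighbors; a larger window brings in more distant words as context.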

In [None]:
vector = model_1.wv['computer']  # 'wv' holds the trained word vectors (a KeyedVectors object)
vector

In [None]:
len(vector)

In [None]:
# Words that never appeared in the training corpus have no vector: this lookup raises a KeyError
model_1.wv['hi']

In [None]:
model_1.wv.most_similar('graph')

In [None]:
## save model
model_1.save("word2vec.model")

## 2) WORD EMBEDDING Using a Real Dataset

In this example, I use [a dataset from Kaggle](https://www.kaggle.com/CooperUnion/cardataset). This cars dataset includes features such as make, model, year, engine, and other properties of each car. We will use these features to generate a word embedding for each make model and then compare the similarities between different make models. The following dataframe shows the details of this dataset.

Note what we are doing: **word embeddings can be applied not only to text but also to observation-feature dataframes**! Fundamentally it's about dimension reduction. 

In fact this is [also true for topic modeling (LDA)](https://www.journals.uchicago.edu/doi/10.1086/705331)

In [None]:
df = pd.read_csv('data.csv')
df.head()

### PRE-PROCESS WORDS

- Cleaning 
    - Removing missing values
    - Lemmatizing
    - Removing stopwords
    - Removing non-alphabetic characters with a regular expression
    - Bigrams: we can use Gensim's `Phrases` model to automatically detect common phrases (bigrams) from a list of sentences. https://radimrehurek.com/gensim/models/phrases.html

         ```python
         from gensim.models.phrases import Phrases, Phraser
         ```
         - `Phrases()` takes a list of lists of words as input:
        ```python
        sent = [row.split() for row in df_clean['clean']]
        ```
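To give a feel for what bigram detection does, the sketch below merges adjacent token pairs that occur often enough into a single token joined by `_`. This is a simplified, count-only stand-in written in plain Python: Gensim's `Phrases` scores each pair with a statistical measure and a threshold rather than a raw count. The function name `merge_frequent_bigrams` and the toy sentences are invented for illustration.

```python
from collections import Counter


def merge_frequent_bigrams(sentences, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times with '_'."""
    pair_counts = Counter()
    for tokens in sentences:
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged = []
    for tokens in sentences:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                out.append(tokens[i])
                i += 1
        merged.append(out)
    return merged


# Hypothetical toy sentences, already split into tokens
toy_sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["a", "new", "car"],
]

print(merge_frequent_bigrams(toy_sentences))
```

Only the pair ("new", "york") appears twice, so it becomes the single token `new_york`, while "new" in "a new car" is left alone.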


Since the purpose of this tutorial is to learn how to generate word embeddings using the Gensim library, I will skip the EDA and feature selection for the word2vec model for the sake of simplicity. 
<br> 
Gensim's Word2Vec expects its training data as a list of lists: every document is a list, and each list contains the tokens of that document. So we first need to build such a list of lists for training the make model embedding. More specifically, each make model is represented by a list that contains the features of that make model.

To achieve this, we need the following data preprocessing steps:

1. Create a new column for Make Model 
2. Build a list of lists with one entry per Make Model, using the following features: Engine Fuel Type, Transmission Type, Driven_Wheels, Market Category, Vehicle Size and Vehicle Style. 
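As a rough picture of the target format, the sketch below turns a few rows into a list of lists, using plain dictionaries instead of the dataframe. The rows and their values are invented for illustration (only the column names follow the dataset), and `row_to_tokens` is a hypothetical helper.

```python
# Invented toy rows; the real data comes from data.csv
toy_rows = [
    {"Make": "BMW", "Model": "1 Series", "Transmission Type": "MANUAL",
     "Driven_Wheels": "rear wheel drive", "Vehicle Size": "Compact"},
    {"Make": "Toyota", "Model": "Camry", "Transmission Type": "AUTOMATIC",
     "Driven_Wheels": "front wheel drive", "Vehicle Size": "Midsize"},
]

# A subset of the feature columns, to keep the example short
feature_columns = ["Transmission Type", "Driven_Wheels", "Vehicle Size"]


def row_to_tokens(row):
    """One 'document' per row: its feature values plus the make model name."""
    tokens = [str(row[column]) for column in feature_columns]
    tokens.append(row["Make"] + " " + row["Model"])
    return tokens


toy_sent = [row_to_tokens(row) for row in toy_rows]
print(toy_sent[0])
```

Each inner list plays the role of one sentence, and each feature value or make model name plays the role of one word.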


1. Create a new column for Make Model

In [None]:
df['Maker_Model']= df['Make']+ " " + df['Model']

2. Build a list of lists with one entry per Make Model 

In [None]:
df.head()

In [None]:
# Select features from original dataset to form a new dataframe 
df1 = df[['Engine Fuel Type','Transmission Type','Driven_Wheels','Market Category',
          'Vehicle Size', 'Vehicle Style', 'Maker_Model']]
df1

In [None]:
# For each row, combine all the columns into one column
df2 = df1.apply(lambda x: ','.join(x.astype(str)), axis=1) 
df2

In [None]:
# Store them in the pandas dataframe
df_clean = pd.DataFrame({'clean': df2}) 

df_clean

In [None]:
df_clean['clean'][0]

In [None]:
# Build the list-of-lists corpus that gensim expects
sent = [row.split(',') for row in df_clean['clean']]
# Show the first two entries of the corpus
sent[:2] 

In [None]:
len(sent)

### Gensim word2vec Model Training 

In [None]:
## Train the gensim word2vec model on our own custom corpus
model_2 = Word2Vec(sent, min_count=1, vector_size= 50, workers=3, window =3, sg = 1)

In [None]:
## We can obtain a word embedding directly from the trained model
model_2.wv['BMW 1 Series']

### Compare Similarities 

We can also use Word2vec to compute the similarity between two make models in the vocabulary by calling `model_2.wv.similarity()` and passing in the relevant tokens, for instance `model_2.wv.similarity('Porsche 718 Cayman', 'Nissan Van')`. This returns the cosine similarity between the vectors for Porsche 718 Cayman and Nissan Van. 

In [None]:
model_2.wv.similarity('Porsche 718 Cayman', 'Nissan Van')

In [None]:
model_2.wv.similarity('Porsche 718 Cayman', 'Mercedes-Benz SLK-Class')

From the above example, we can tell that Porsche 718 Cayman is more similar to Mercedes-Benz SLK-Class than to Nissan Van. We can also use the built-in function `model_2.wv.most_similar()` to get the most similar tokens for a given make model.

In [None]:
## Show the most similar tokens for Mercedes-Benz SLK-Class (most_similar ranks by cosine similarity)
model_2.wv.most_similar('Mercedes-Benz SLK-Class')[:5]

In [None]:
## Show the most similar tokens for Toyota Camry (most_similar ranks by cosine similarity)
model_2.wv.most_similar('Toyota Camry')[:5]

Why cosine similarity rather than Euclidean distance? Euclidean distance depends on the length of the vectors as well as their direction, so two word vectors that point the same way can still be far apart. Cosine similarity, which is what Gensim's `similarity()` and `most_similar()` compute, compares only the direction of the two vectors.  

For vectors $A$ and $B$, the dot product is given by $A \cdot B = \|A\| \|B\| \cos(\theta)$, where $\theta$ is the angle between them.

Rearranging, the cosine similarity is given by $\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$
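As a small worked example of the formula above, the sketch below compares cosine similarity with Euclidean distance on three invented vectors, in plain Python. The helper names and the vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||)"""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice the length
c = [3.0, 0.0, -1.0]   # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

print(cosine_similarity(a, b))    # direction is identical
print(euclidean_distance(a, b))   # yet the points are sqrt(14) apart
print(cosine_similarity(a, c))    # a 90-degree angle
```

Here `a` and `b` differ only in length, so their cosine similarity is 1 even though their Euclidean distance is not zero, while `a` and `c` are at a right angle and get a cosine similarity of 0.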

Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space, so it captures the direction of the word vectors and not their magnitude. Under cosine similarity, no similarity corresponds to a 90-degree angle (a value of 0), while total similarity corresponds to a 0-degree angle (a value of 1). The following function shows how we can find the most similar make models based on cosine similarity, restricted to the candidates passed in `target_list`.

In [None]:
from numpy import dot
from numpy.linalg import norm

def cosine_distance(model, word, target_list, num):
    """Return the `num` items of `target_list` most similar to `word`.

    Despite its name, this ranks by cosine similarity (higher means more similar).
    """
    a = model.wv[word]
    cosine_dict = {}

    for item in target_list:
        if item != word:
            b = model.wv[item]
            cosine_dict[item] = dot(a, b) / (norm(a) * norm(b))

    # Sort the (item, similarity) pairs in descending order of similarity
    dist_sort = sorted(cosine_dict.items(), key=lambda pair: pair[1], reverse=True)

    return dist_sort[:num]

In [None]:
Maker_Model = list(df.Maker_Model.unique())  # keep only the unique Maker_Model values

## Show the make models most similar to Mercedes-Benz SLK-Class by cosine similarity
cosine_distance(model_2,'Mercedes-Benz SLK-Class',Maker_Model,5) 

In [None]:
model_2.wv.most_similar('Mercedes-Benz SLK-Class')[:5]

## 3) READ PRE-TRAINED MODELS

As we said, it's usually far better to use pre-trained embeddings than to start from scratch.

Read more: https://radimrehurek.com/gensim/models/keyedvectors.html


I will read GloVe's pre-trained vectors here. Gensim offers download of some other pre-trained vectors. See
https://github.com/RaRe-Technologies/gensim-data


And a more complete pre-trained vector dataset can be found here
http://vectors.nlpl.eu/repository/

You may need to manually download them to your disk and let Gensim read them in.
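Downloaded vectors are often distributed in the word2vec text format: a header line with the vocabulary size and the number of dimensions, followed by one line per word with its values. The sketch below parses a tiny in-memory sample of that layout in plain Python, only to show what the file contains; in practice Gensim's `KeyedVectors.load_word2vec_format` does the reading. The function `read_word2vec_text` and the sample values are invented for illustration.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format: '<vocab_size> <dim>' then '<word> v1 v2 ...'."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values

    if len(vectors) != vocab_size:
        raise ValueError(f"header promised {vocab_size} words, found {len(vectors)}")
    return vectors


# Invented sample: 2 words, 3 dimensions each
sample = "2 3\nking 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n"
toy_vectors = read_word2vec_text(io.StringIO(sample))
print(toy_vectors["queen"])
```

Note that the raw GloVe files from Stanford use the same per-word layout but omit the header line.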


Another source of pre-trained word and phrase vectors, from Google: https://code.google.com/archive/p/word2vec/

In [None]:
import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data


Then we can find similar words

In [None]:
result = word_vectors.most_similar(positive=['woman'])
result

As you may observe, neighbors such as "girl" and "man" relate to "woman" in quite different ways. To pick out one specific relation, such as the one linking the pair (king, queen), we can use the king/queen and man/woman analogy:

$ man = woman + king - queen $

In [None]:
result = word_vectors.most_similar(positive=['king', 'woman'], negative=['queen'])
result

$ queen = king + woman - man $
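The arithmetic behind these analogies can be checked by hand on toy vectors. The sketch below uses invented 3-dimensional vectors and a hypothetical `analogy` helper in plain Python: it adds the positive vectors, subtracts the negative ones, and returns the remaining word whose vector has the highest cosine similarity to the result. Like `most_similar`, it leaves the input words out of the candidates; unlike Gensim, it does not normalize the vectors before combining them.

```python
import math


def cosine(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)


# Invented vectors; the axes loosely mean (royalty, male, female)
toy = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves from the candidates
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy(["king", "woman"], ["man"], toy))     # king + woman - man
print(analogy(["king", "woman"], ["queen"], toy))   # king + woman - queen
```

On these toy vectors the two queries return "queen" and "man" respectively, matching the two equations; real embeddings are noisier, so the top result is not guaranteed to be the expected word.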

In [None]:
result = word_vectors.most_similar(positive=[ 'king', 'woman'], negative=['man'])
result

### Using GoogleNews-vectors-negative300.bin.gz  as an example

The GoogleNews-vectors-negative300.bin.gz is pretty large and I won't upload it to GitHub. Please download it from [the official source](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g) to your local computer if you want to try out the following code. 

In [None]:
file = '/Users/percychan/Tech/GoogleNews-vectors-negative300.bin'

In [None]:
# Load pretrained vectors (since intermediate data is not included, the model cannot be refined with additional data)
from gensim.models import KeyedVectors

# limit=100000 reads only the first 100,000 vectors from the file
model_google = KeyedVectors.load_word2vec_format(file, binary=True, limit=100000)

In [None]:
dog = model_google['dog']
print(dog.shape)
print(dog)

In [None]:
# Deal with an out-of-vocabulary word: Михаил (Michail)
if 'Михаил' in model_google:
    print(model_google['Михаил'].shape)
else:
    print('{0} is an out of dictionary word'.format('Михаил'))

In [None]:
# Some predefined functions that show content related information for given words
model_google.most_similar(positive=['woman', 'king'], negative=['man'])

In [None]:
vec = model_google['king'] - model_google['man'] + model_google['woman']
model_google.most_similar([vec])

In [None]:
vec = model_google['Berlin'] - model_google['Germany'] + model_google['China']
model_google.most_similar([vec])

In [None]:
vec = model_google['Germany'] - model_google['Berlin'] + model_google['Beijing']
model_google.most_similar([vec])

In [None]:
vec = model_google['Messi'] - model_google['soccer'] + model_google['tennis']
model_google.most_similar([vec])

In [None]:
model_google.doesnt_match("breakfast economics dinner lunch".split())

In [None]:
model_google.similarity('woman', 'man')

In [None]:
model_google.similarity('Harvard', 'Stanford')

In [None]:
model_google.similarity('Cambridge', 'Oxford')

In [None]:
model_google.most_similar('Harvard')

In [None]:
model_google.similarity('HKUST', 'HKU')

In [None]:
model_google.similarity('Economics', 'Sociology')

In [None]:
model_google.similarity('Statistics', 'Economics')

In [None]:
model_google.similarity('Statistics', 'Sociology')

# Software 
- GloVe: https://nlp.stanford.edu/projects/glove/
- Word2Vec: https://code.google.com/archive/p/word2vec/
- Tensorflow Word2Vec tutorial: https://www.tensorflow.org/tutorials/text/word_embeddings