# GENSIM for word embeddings 

- THE GENSIM LIBRARY

    - Gensim is an open-source Python library for natural language processing.

    - It was developed and is maintained by the Czech natural language processing researcher Radim Řehůřek. 

    - In the previous tutorial, we have seen how you can use this package to do topic modeling.
    
    - Here we use `gensim` for word embedding. 
    

- Word embeddings can be used for:
    - automated text tagging
    - recommendation engines
    - synonyms and search query expansion
    - machine translation
    - plain feature engineering

One example

<img src="1.png" alt="drawing" width="600"/>

Another example

<img src="2.png" alt="drawing" width="400"/>

## Software for word embeddings 

- Software for training and using word embeddings includes:
    - Tomas Mikolov's Word2vec
    - Stanford University's GloVe
    - GN-GloVe
    - AllenNLP's ELMo
    - BERT
    - fastText
    - Gensim
    - Indra and Deeplearning4j

- Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and to visualize word embeddings and clusters.

In [1]:
import pandas as pd
import numpy as np

import re  # For preprocessing

# from collections import defaultdict
# from time import time  # To time our operations
# import warnings
# warnings.filterwarnings('ignore')
# import logging  # Setting up the loggings to monitor gensim
# logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

## 1) CREATE A WORD2VEC MODEL

- Training the model: Gensim Word2Vec Implementation:
    - We use Gensim implementation of word2vec: https://radimrehurek.com/gensim/models/word2vec.html

In [2]:
from gensim.models import Word2Vec
from gensim.test.utils import common_texts, get_tmpfile

common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

## Train the simplest word embedding yourself

In [3]:
model_1 = Word2Vec(common_texts, vector_size=100, window=5, min_count=1, workers=4)

The hyperparameters of `Word2Vec`:

- `vector_size`: # of dimensions of the embeddings and the default is 100.
- `window`: The maximum distance between a target word and words around the target word. The default window is 5.
- `min_count`: The minimum count of words to consider when training the model; words with occurrence less than this count will be ignored. The default for min_count is 5.
- `workers`: # of worker threads used to train the model; depends on your computer.
- `sg`: The training algorithm, either CBOW(0) or skip-gram (1). The default training algorithm is CBOW.
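
To make `min_count` and `window` more concrete, here is a minimal pure-Python sketch of what they control. The helper names `filter_vocab` and `context_pairs` are made up for illustration and are not part of Gensim; the real implementation adds refinements (for example, randomly shrinking the window and subsampling frequent words) that this sketch ignores.

```python
from collections import Counter

def filter_vocab(sentences, min_count):
    """Drop every token that occurs fewer than `min_count` times in the corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count] for sent in sentences]

def context_pairs(sentence, window):
    """Return (target, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

toy_corpus = [["human", "interface", "computer"],
              ["survey", "user", "computer", "system"]]

# Only 'computer' appears twice, so everything else is dropped with min_count=2.
print(filter_vocab(toy_corpus, min_count=2))
# [['computer'], ['computer']]

# With window=1, each word is paired only with its direct neighbours.
print(context_pairs(toy_corpus[0], window=1))
# [('human', 'interface'), ('interface', 'human'), ('interface', 'computer'), ('computer', 'interface')]
```

Roughly speaking, skip-gram (`sg=1`) learns to predict the context word from the target in each such pair, while CBOW (`sg=0`) predicts the target from its surrounding context words taken together.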

In [5]:
vector = model_1.wv['computer']  # 'wv' holds the trained word vectors (a KeyedVectors object)
vector

array([-0.00515774, -0.00667028, -0.0077791 ,  0.00831315, -0.00198292,
       -0.00685696, -0.0041556 ,  0.00514562, -0.00286997, -0.00375075,
        0.0016219 , -0.0027771 , -0.00158482,  0.0010748 , -0.00297881,
        0.00852176,  0.00391207, -0.00996176,  0.00626142, -0.00675622,
        0.00076966,  0.00440552, -0.00510486, -0.00211128,  0.00809783,
       -0.00424503, -0.00763848,  0.00926061, -0.00215612, -0.00472081,
        0.00857329,  0.00428459,  0.0043261 ,  0.00928722, -0.00845554,
        0.00525685,  0.00203994,  0.0041895 ,  0.00169839,  0.00446543,
        0.0044876 ,  0.0061063 , -0.00320303, -0.00457706, -0.00042664,
        0.00253447, -0.00326412,  0.00605948,  0.00415534,  0.00776685,
        0.00257002,  0.00811905, -0.00138761,  0.00808028,  0.0037181 ,
       -0.00804967, -0.00393476, -0.0024726 ,  0.00489447, -0.00087241,
       -0.00283173,  0.00783599,  0.00932561, -0.0016154 , -0.00516075,
       -0.00470313, -0.00484746, -0.00960562,  0.00137242, -0.00

In [6]:
len(vector)

100

In [7]:
# Words that never appeared in the training corpus have no vector, so this raises a KeyError
model_1.wv['hi']

KeyError: "Key 'hi' not present"
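
One way to avoid the `KeyError` is to test for membership before the lookup. The sketch below uses a hypothetical helper, `get_vector_or_default`, and a plain dict as a stand-in for `model_1.wv` so that it runs without a trained model; it assumes only that the vector store supports `in` and `[]`.

```python
def get_vector_or_default(vectors, word, default=None):
    """Return vectors[word] if the word is in the vocabulary, else `default`."""
    return vectors[word] if word in vectors else default

# A plain dict stands in for a trained model's `wv` here (made-up values).
toy_vectors = {"computer": [0.1, -0.2], "graph": [0.3, 0.4]}

print(get_vector_or_default(toy_vectors, "computer"))  # [0.1, -0.2]
print(get_vector_or_default(toy_vectors, "hi"))        # None
```

With a real model the call would look like `get_vector_or_default(model_1.wv, 'hi')`.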

In [8]:
model_1.wv.most_similar('graph')

[('user', 0.06793873757123947),
 ('survey', 0.033640600740909576),
 ('eps', 0.0093911811709404),
 ('human', 0.008315952494740486),
 ('minors', 0.004503021948039532),
 ('system', -0.010839170776307583),
 ('trees', -0.023671651259064674),
 ('computer', -0.09575343132019043),
 ('time', -0.11410721391439438),
 ('response', -0.11557212471961975)]

In [9]:
## save model
model_1.save("word2vec.model")

## 2) WORD EMBEDDING Using a Real Dataset

In this example, I use [a dataset from Kaggle](https://www.kaggle.com/CooperUnion/cardataset). This car dataset includes features such as make, model, year, engine, and other properties of each car. We will use these features to generate an embedding for each make-model and then compare the similarities between different make-models. The following dataframe shows the details of this dataset.

Note what we are doing: **the structure of word embedding can not only be used on text but also on observation-feature dataframes**! Fundamentally it's about dimension reduction. 

In fact this is [also true for topic modeling (LDA)](https://www.journals.uchicago.edu/doi/10.1086/705331)

In [10]:
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


### PRE-PROCESS WORDS

- Cleaning 
    - Removing the missing values;
    - Lemmatizing;
    - Removing the stopwords;
    - Removes non-alphabetic characters: regular expression;
    - Bigrams: We can use Gensim Phrases package to automatically detect common phrases (bigrams) from a list of sentences. https://radimrehurek.com/gensim/models/phrases.html

         ```python
         from gensim.models.phrases import Phrases, Phraser
         ```
         - As Phrases() takes a list of list of words as input:
        ```python
        sent = [row.split() for row in df_clean['clean']]
        ```
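
To give an intuition for how bigram detection works, here is a simplified sketch of the phrase score from the original word2vec phrase paper: a pair scores highly when it occurs together more often than its individual word counts would suggest. The function `bigram_scores` and its `delta` discount are illustrative names; Gensim's `Phrases` is based on the same idea, but its exact scaling and thresholds differ, so treat the numbers as illustrative only.

```python
from collections import Counter

def bigram_scores(sentences, delta=1):
    """Score each adjacent word pair as (count(a, b) - delta) / (count(a) * count(b))."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    return {
        pair: (n - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

toy_sentences = [["rear", "wheel", "drive"],
                 ["rear", "wheel", "drive"],
                 ["front", "wheel", "drive"],
                 ["all", "wheel", "drive"]]

scores = bigram_scores(toy_sentences)
print(scores[("wheel", "drive")])  # 0.1875
print(scores[("rear", "wheel")])   # 0.125
print(scores[("front", "wheel")])  # 0.0
```

Pairs whose score exceeds a chosen threshold would then be merged into a single token such as `wheel_drive`.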


Since the purpose of this tutorial is to learn how to generate word embeddings using the Gensim library, I will skip EDA and feature selection for the word2vec model for the sake of simplicity. 
<br> 
Gensim's Word2Vec expects a list of lists for training: each document is a list, and each list contains the tokens of that document. So we first need to build such a list of lists for the make-model embedding. More specifically, each car (row) becomes a list that contains its features together with its make-model name.

To achieve this, we need the following preprocessing steps:

1. Create a new column for Make Model.
2. Build a list of lists in which each row contributes the following features: Engine Fuel Type, Transmission Type, Driven_Wheels, Market Category, Vehicle Size and Vehicle Style, plus the Make Model.


1. Create a new column for Make Model

In [11]:
df['Maker_Model']= df['Make']+ " " + df['Model']

2. Build a list of lists for each Make Model

In [12]:
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP,Maker_Model
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135,BMW 1 Series M
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650,BMW 1 Series
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350,BMW 1 Series
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450,BMW 1 Series
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500,BMW 1 Series


In [13]:
# Select features from original dataset to form a new dataframe 
df1 = df[['Engine Fuel Type','Transmission Type','Driven_Wheels','Market Category',
          'Vehicle Size', 'Vehicle Style', 'Maker_Model']]
df1

Unnamed: 0,Engine Fuel Type,Transmission Type,Driven_Wheels,Market Category,Vehicle Size,Vehicle Style,Maker_Model
0,premium unleaded (required),MANUAL,rear wheel drive,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,BMW 1 Series M
1,premium unleaded (required),MANUAL,rear wheel drive,"Luxury,Performance",Compact,Convertible,BMW 1 Series
2,premium unleaded (required),MANUAL,rear wheel drive,"Luxury,High-Performance",Compact,Coupe,BMW 1 Series
3,premium unleaded (required),MANUAL,rear wheel drive,"Luxury,Performance",Compact,Coupe,BMW 1 Series
4,premium unleaded (required),MANUAL,rear wheel drive,Luxury,Compact,Convertible,BMW 1 Series
...,...,...,...,...,...,...,...
11909,premium unleaded (required),AUTOMATIC,all wheel drive,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,Acura ZDX
11910,premium unleaded (required),AUTOMATIC,all wheel drive,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,Acura ZDX
11911,premium unleaded (required),AUTOMATIC,all wheel drive,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,Acura ZDX
11912,premium unleaded (recommended),AUTOMATIC,all wheel drive,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,Acura ZDX


In [14]:
# For each row, combine all the columns into one column
df2 = df1.apply(lambda x: ','.join(x.astype(str)), axis=1) 
df2

0        premium unleaded (required),MANUAL,rear wheel ...
1        premium unleaded (required),MANUAL,rear wheel ...
2        premium unleaded (required),MANUAL,rear wheel ...
3        premium unleaded (required),MANUAL,rear wheel ...
4        premium unleaded (required),MANUAL,rear wheel ...
                               ...                        
11909    premium unleaded (required),AUTOMATIC,all whee...
11910    premium unleaded (required),AUTOMATIC,all whee...
11911    premium unleaded (required),AUTOMATIC,all whee...
11912    premium unleaded (recommended),AUTOMATIC,all w...
11913    regular unleaded,AUTOMATIC,front wheel drive,L...
Length: 11914, dtype: object

In [15]:
# Store them in the pandas dataframe
df_clean = pd.DataFrame({'clean': df2}) 

df_clean

Unnamed: 0,clean
0,"premium unleaded (required),MANUAL,rear wheel ..."
1,"premium unleaded (required),MANUAL,rear wheel ..."
2,"premium unleaded (required),MANUAL,rear wheel ..."
3,"premium unleaded (required),MANUAL,rear wheel ..."
4,"premium unleaded (required),MANUAL,rear wheel ..."
...,...
11909,"premium unleaded (required),AUTOMATIC,all whee..."
11910,"premium unleaded (required),AUTOMATIC,all whee..."
11911,"premium unleaded (required),AUTOMATIC,all whee..."
11912,"premium unleaded (recommended),AUTOMATIC,all w..."


In [16]:
df_clean['clean'][0]

'premium unleaded (required),MANUAL,rear wheel drive,Factory Tuner,Luxury,High-Performance,Compact,Coupe,BMW 1 Series M'

In [17]:
# Create the list of list format of the custom corpus for gensim modeling 
sent = [row.split(',') for row in df_clean['clean']]
# show the example of list of list format of the custom corpus for gensim modeling 
sent[:2] 

[['premium unleaded (required)',
  'MANUAL',
  'rear wheel drive',
  'Factory Tuner',
  'Luxury',
  'High-Performance',
  'Compact',
  'Coupe',
  'BMW 1 Series M'],
 ['premium unleaded (required)',
  'MANUAL',
  'rear wheel drive',
  'Luxury',
  'Performance',
  'Compact',
  'Convertible',
  'BMW 1 Series']]

In [18]:
len(sent)

11914
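
One caveat with joining columns through `astype(str)`: any missing cell would become the literal token `'nan'`. Whether the selected columns contain missing values is not shown here, so the following is only a precaution. It is a minimal sketch with a hypothetical helper, `row_to_tokens`, that skips missing cells and splits multi-valued cells on commas.

```python
import math

def row_to_tokens(values):
    """Turn one row of feature values into a flat list of tokens."""
    tokens = []
    for value in values:
        # Skip missing cells (None or float NaN) instead of emitting a 'nan' token.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        # Multi-valued cells such as 'Luxury,Performance' become separate tokens.
        tokens.extend(part.strip() for part in str(value).split(",") if part.strip())
    return tokens

row = ["premium unleaded (required)", "MANUAL", float("nan"),
       "Luxury,Performance", "BMW 1 Series"]
print(row_to_tokens(row))
# ['premium unleaded (required)', 'MANUAL', 'Luxury', 'Performance', 'BMW 1 Series']
```

Applied to the dataframe above, this could be used as `sent = [row_to_tokens(r) for r in df1.itertuples(index=False, name=None)]`.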

### Gensim Word2Vec Model Training 

In [19]:
## Train the Gensim word2vec model on our own custom corpus
model_2 = Word2Vec(sent, min_count=1, vector_size=50, workers=3, window=3, sg=1)

In [20]:
## We can obtain the word embedding directly from the training model
model_2.wv['BMW 1 Series']

array([-0.06221133, -0.00078157,  0.11212988,  0.02290514, -0.09565159,
       -0.26648268, -0.10793739,  0.34476006,  0.02947687, -0.17716965,
        0.19972663,  0.05157929, -0.08664826, -0.00665096,  0.06953028,
        0.17162979,  0.20540069,  0.11678503,  0.00519125, -0.29180658,
       -0.04234027, -0.0331309 ,  0.22614162, -0.03543823,  0.21149847,
        0.01719995, -0.22853822,  0.33174163,  0.06983203, -0.18861471,
       -0.11522385,  0.15385288, -0.02876615,  0.17589487,  0.01618016,
       -0.04859521,  0.14127263,  0.00490319,  0.14612713,  0.1642975 ,
        0.06573396,  0.07646341, -0.26727164,  0.00967619,  0.23544917,
        0.01714828, -0.09497026, -0.01258893,  0.02748009,  0.14160384],
      dtype=float32)

### Compare Similarities 

We can now use the trained model to compute the similarity between two make-models in the vocabulary by calling `model_2.wv.similarity()` with the two names. For instance, `model_2.wv.similarity('Porsche 718 Cayman', 'Nissan Van')` returns the cosine similarity between the vectors of Porsche 718 Cayman and Nissan Van. 

In [21]:
model_2.wv.similarity('Porsche 718 Cayman', 'Nissan Van')

0.8296043

In [22]:
model_2.wv.similarity('Porsche 718 Cayman', 'Mercedes-Benz SLK-Class')

0.8998677

From the above example, we can tell that Porsche 718 Cayman is more similar to Mercedes-Benz SLK-Class than to Nissan Van. We can also use the built-in function `model_2.wv.most_similar()` to get the most similar make-models for a given make-model.

In [23]:
## Show the most similar vehicles for Mercedes-Benz SLK-Class (most_similar ranks by cosine similarity)
model_2.wv.most_similar('Mercedes-Benz SLK-Class')[:5]

[('Lamborghini Murcielago', 0.9921762347221375),
 ('Lamborghini Huracan', 0.9880927205085754),
 ('BMW Z4', 0.9880548715591431),
 ('Ferrari 458 Italia', 0.9864090085029602),
 ('Lotus Evora', 0.986366331577301)]

In [24]:
## Show the most similar vehicles for Toyota Camry (most_similar ranks by cosine similarity)
model_2.wv.most_similar('Toyota Camry')[:5]

[('Kia Optima', 0.9848490953445435),
 ('Nissan Sentra', 0.9837976694107056),
 ('Oldsmobile Eighty-Eight Royale', 0.9833534359931946),
 ('Oldsmobile Cutlass Ciera', 0.9820960760116577),
 ('Oldsmobile Alero', 0.9819996953010559)]

Note that `similarity()` and `most_similar()` already rank by cosine similarity, not by Euclidean distance. Euclidean distance is a poor fit for word vectors because it depends on vector length as well as direction, so two vectors that point the same way can still be far apart. Cosine similarity ignores length and compares direction only.

For Vector $A$ and $B$, the dot product is given by $ A \cdot B = \|A\| \|B\| \cos(\theta)$

The cosine similarity is given by $ \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} $
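
The formula can be checked numerically on small made-up vectors. In this sketch, `b` points in the same direction as `a` but is ten times longer: the cosine similarity stays at 1 (up to floating-point rounding), while the Euclidean distance between the two is large.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                       # same direction, ten times longer
c = np.array([3.0, -2.0, 0.5])   # a different direction

print(cosine_similarity(a, b))   # essentially 1: identical direction
print(np.linalg.norm(a - b))     # large Euclidean distance despite identical direction
print(cosine_similarity(a, c))   # much smaller: the directions differ
```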

Cosine similarity is the cosine of the angle between two vectors, so it captures their direction and not their magnitude. A value of 1 corresponds to an angle of 0 degrees (same direction), and a value of 0 corresponds to a 90-degree angle (no similarity). The following function computes the most similar make-models by cosine similarity by hand; its results should match those of `most_similar()`.

In [25]:
from numpy import dot
from numpy.linalg import norm

def cosine_distance(model, word, target_list, num):
    """Return the `num` items of `target_list` most similar to `word`.

    Despite the name, the scores are cosine similarities (higher = closer).
    """
    a = model.wv[word]
    cosine_dict = {}

    for item in target_list:
        if item != word:
            b = model.wv[item]
            cosine_dict[item] = dot(a, b) / (norm(a) * norm(b))

    # Sort by similarity in descending order and keep the top `num` pairs
    dist_sort = sorted(cosine_dict.items(), key=lambda pair: pair[1], reverse=True)
    return dist_sort[:num]

In [26]:
Maker_Model = list(df.Maker_Model.unique())  # keep only the unique Maker_Model values

## Show the make-models most similar to Mercedes-Benz SLK-Class by cosine similarity
cosine_distance(model_2, 'Mercedes-Benz SLK-Class', Maker_Model, 5)

[('Lamborghini Murcielago', 0.99217635),
 ('Lamborghini Huracan', 0.9880926),
 ('BMW Z4', 0.98805475),
 ('Ferrari 458 Italia', 0.98640907),
 ('Lotus Evora', 0.9863663)]

In [27]:
model_2.wv.most_similar('Mercedes-Benz SLK-Class')[:5]

[('Lamborghini Murcielago', 0.9921762347221375),
 ('Lamborghini Huracan', 0.9880927205085754),
 ('BMW Z4', 0.9880548715591431),
 ('Ferrari 458 Italia', 0.9864090085029602),
 ('Lotus Evora', 0.986366331577301)]
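
The two result lists agree because both rank by cosine similarity. For larger vocabularies, the Python loop can be replaced by a single matrix product. The following is a minimal NumPy sketch on made-up 2-D vectors; `top_k_cosine` is a hypothetical helper, not a Gensim function.

```python
import numpy as np

def top_k_cosine(matrix, labels, query_index, k):
    """Rank all rows of `matrix` by cosine similarity to the row at `query_index`."""
    # Normalise every row to unit length so that a dot product equals the cosine.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[query_index]
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [(labels[i], float(sims[i])) for i in order if i != query_index][:k]

labels = ["a", "b", "c", "d"]
matrix = np.array([[1.0, 0.0],
                   [0.9, 0.1],
                   [0.0, 1.0],
                   [-1.0, 0.0]])

# 'b' is nearly parallel to 'a', 'c' is orthogonal, 'd' points the opposite way.
print(top_k_cosine(matrix, labels, query_index=0, k=2))
```

With the trained model, the matrix could be built as `np.vstack([model_2.wv[m] for m in Maker_Model])`, with `Maker_Model` as the labels.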

### t-SNE Plot

It is hard to visualize word embeddings directly, because they usually have more than three dimensions. t-SNE is a useful tool for visualizing high-dimensional data: it maps the points to a low-dimensional space while trying to keep neighboring points close together. In other words, t-SNE looks for a new representation of the data in which neighborhood relations are preserved. In this tutorial, I use the `TSNE` class from the scikit-learn library. The following code shows how to plot the word embeddings with t-SNE. 

In [28]:
from sklearn.manifold import TSNE

import matplotlib.pyplot as plt
%matplotlib notebook
import seaborn as sns

def display_closestwords_tsnescatterplot(model, word, size):
    
    arr = np.empty((0,size), dtype='f')
    word_labels = [word]

    close_words = model.wv.most_similar(word)

    arr = np.append(arr, np.array([model.wv[word]]), axis=0)
    
    for wrd_score in close_words:
        wrd_vector = model.wv[wrd_score[0]]
        word_labels.append(wrd_score[0])
        arr = np.append(arr, np.array([wrd_vector]), axis=0)
        
    tsne = TSNE(n_components=2, perplexity = 10, random_state=0)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)

    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    plt.scatter(x_coords, y_coords)

    for label, x, y in zip(word_labels, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    # Pad the axis limits on both sides so that no point sits on the border
    plt.xlim(x_coords.min()-1, x_coords.max()+1)
    plt.ylim(y_coords.min()-1, y_coords.max()+1)
    plt.show()

In [29]:
display_closestwords_tsnescatterplot(model_2, 'Porsche 718 Cayman', 50)


In [30]:
model_2.wv.most_similar('Porsche 718 Cayman')[:5]

[('Cadillac XLR', 0.9554708003997803),
 ('Bentley Azure', 0.94948410987854),
 ('Oldsmobile Toronado', 0.9449625611305237),
 ('BMW M5', 0.9442688822746277),
 ('Ferrari FF', 0.9441433548927307)]
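
PCA, mentioned at the start as another way to reduce dimensionality, is a deterministic alternative to t-SNE for a quick 2-D view. The sketch below implements PCA with a singular value decomposition in NumPy; the random matrix only stands in for 11 word vectors of size 50, and `pca_2d` is an illustrative helper name.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
toy = rng.normal(size=(11, 50))   # stand-in for 11 embeddings with 50 dimensions
coords = pca_2d(toy)
print(coords.shape)  # (11, 2)
```

Unlike t-SNE, PCA is a linear projection, so it tends to preserve the overall spread of the points rather than local neighborhoods.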

In [32]:
close_words = model_2.wv.most_similar('Porsche 718 Cayman')
close_words

[('Cadillac XLR', 0.9554708003997803),
 ('Bentley Azure', 0.94948410987854),
 ('Oldsmobile Toronado', 0.9449625611305237),
 ('BMW M5', 0.9442688822746277),
 ('Ferrari FF', 0.9441433548927307),
 ('Hyundai Genesis', 0.9419194459915161),
 ('Infiniti G37 Coupe', 0.940889835357666),
 ('Ford F-150 SVT Lightning', 0.9398514032363892),
 ('Lexus IS F', 0.93939608335495),
 ('Lexus GX 460', 0.9392694234848022)]

In [31]:
arr = np.empty((0,50), dtype='f')
arr = np.append(arr, np.array([model_2.wv['Porsche 718 Cayman']]), axis=0)
print(arr)

[[ 0.01770397  0.00060668  0.00897832  0.01521355 -0.04059852 -0.07494052
  -0.00806342  0.05023349  0.00794503 -0.05610264  0.04477044  0.01331486
   0.00208868  0.01204422 -0.01683522  0.03804658  0.05248648  0.02591788
  -0.02219298 -0.09076661  0.00216296  0.01334868  0.06236157  0.02087773
   0.04295537 -0.00384182 -0.03031852  0.10243306 -0.02289386 -0.03307728
  -0.0345009   0.01864258 -0.00555367  0.00673456  0.0114482  -0.00242077
   0.04646415 -0.00488622  0.01486705  0.00600004  0.02308523  0.0141725
  -0.05412903  0.01280535  0.09246679  0.02271322  0.00373423 -0.00359278
   0.0239618   0.0262041 ]]


In [33]:
wrd_vector = model_2.wv[close_words[0][0]]
arr = np.append(arr, np.array([wrd_vector]), axis=0)
print(arr)

[[ 0.01770397  0.00060668  0.00897832  0.01521355 -0.04059852 -0.07494052
  -0.00806342  0.05023349  0.00794503 -0.05610264  0.04477044  0.01331486
   0.00208868  0.01204422 -0.01683522  0.03804658  0.05248648  0.02591788
  -0.02219298 -0.09076661  0.00216296  0.01334868  0.06236157  0.02087773
   0.04295537 -0.00384182 -0.03031852  0.10243306 -0.02289386 -0.03307728
  -0.0345009   0.01864258 -0.00555367  0.00673456  0.0114482  -0.00242077
   0.04646415 -0.00488622  0.01486705  0.00600004  0.02308523  0.0141725
  -0.05412903  0.01280535  0.09246679  0.02271322  0.00373423 -0.00359278
   0.0239618   0.0262041 ]
 [-0.02382832 -0.00968771  0.03882695  0.0069865  -0.07979552 -0.15068229
  -0.02241191  0.1814336  -0.01695096 -0.10253578  0.08953    -0.00429171
   0.02332604  0.0096097   0.00693087  0.09299589  0.10048032  0.07165135
  -0.03038892 -0.2382006  -0.01252199 -0.00272819  0.1375699   0.02849218
   0.12107908 -0.00590191 -0.09270216  0.2450349  -0.02169758 -0.07091669
  -0.0724800

## 3) READ PRE-TRAINED MODELS

As mentioned, it is usually far better to use pre-trained embeddings than to train them from scratch.

Read more: https://radimrehurek.com/gensim/models/keyedvectors.html


I will load GloVe's pre-trained vectors here. Gensim can also download several other pre-trained vectors. See
https://github.com/RaRe-Technologies/gensim-data


A more complete collection of pre-trained vectors can be found here:
http://vectors.nlpl.eu/repository/

You may need to download them manually to your disk and let Gensim read them in.


Another source of pre-trained word and phrase vectors, from Google: https://code.google.com/archive/p/word2vec/

In [None]:
import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data


Then we can find similar words

In [None]:
result = word_vectors.most_similar(positive=['woman'])
result

As you may observe, the nearest neighbours of "woman" can include words such as "girl" and "man", which relate to it in quite different ways. To ask a more specific question, we can use the king/queen and man/woman analogy: which word relates to "woman" the way "king" relates to "queen"?

$ man = woman + king - queen $

In [None]:
result = word_vectors.most_similar(positive=['king', 'woman'], negative=['queen'])
result

$ queen = king + woman - man $

In [None]:
result = word_vectors.most_similar(positive=[ 'king', 'woman'], negative=['man'])
result

### Using GoogleNews-vectors-negative300.bin.gz  as an example

The GoogleNews-vectors-negative300.bin.gz is pretty large and I won't upload it to GitHub. Please download it from [the official source](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g) to your local computer if you want to try out the following code. 

In [34]:
file = '/Users/percychan/Tech/GoogleNews-vectors-negative300.bin'

In [36]:
# Load the pre-trained vectors (the file holds vectors only, not the full model, so they cannot be trained further)
import gensim
# limit=100000 reads only the first 100,000 vectors in the file to save memory
model_google = gensim.models.KeyedVectors.load_word2vec_format(file, binary=True, limit=100000)

In [37]:
dog = model_google['dog']
print(dog.shape)
print(dog)

(300,)
[ 0.05126953 -0.02233887 -0.17285156  0.16113281 -0.08447266  0.05737305
  0.05859375 -0.08251953 -0.01538086 -0.06347656  0.1796875  -0.42382812
 -0.02258301 -0.16601562 -0.02514648  0.10742188 -0.19921875  0.15917969
 -0.1875     -0.12011719  0.15527344 -0.09912109  0.14257812 -0.1640625
 -0.08935547  0.20019531 -0.14941406  0.3203125   0.328125    0.02441406
 -0.09716797 -0.08203125 -0.03637695 -0.0859375  -0.09863281  0.00778198
 -0.01342773  0.05273438  0.1484375   0.33398438  0.01660156 -0.21289062
 -0.01507568  0.05249023 -0.10742188 -0.08886719  0.24902344 -0.0703125
 -0.01599121  0.07568359 -0.0703125   0.11914062  0.22949219  0.01416016
  0.11523438  0.00750732  0.27539062 -0.24414062  0.296875    0.03491211
  0.2421875   0.13574219  0.14257812  0.01757812  0.02929688 -0.12158203
  0.02282715 -0.04760742 -0.15527344  0.00314331  0.34570312  0.12255859
 -0.1953125   0.08105469 -0.06835938 -0.01470947  0.21484375 -0.12109375
  0.15722656 -0.20703125  0.13671875 -0.129882

In [38]:
# Deal with an out of dictionary word: Михаил (Michail)
if 'Михаил' in model_google:
    print(model_google['Михаил'].shape)
else:
    print('{0} is an out of dictionary word'.format('Михаил'))

Михаил is an out of dictionary word


In [39]:
model_google.most_similar('queen')

[('queens', 0.739944338798523),
 ('princess', 0.7070531249046326),
 ('king', 0.6510956883430481),
 ('monarch', 0.6383602023124695),
 ('Queen', 0.6163408160209656),
 ('princesses', 0.5908075571060181),
 ('royal', 0.5637185573577881),
 ('prince', 0.5534094572067261),
 ('duchess', 0.5475091338157654),
 ('Queen_Elizabeth_II', 0.5321036577224731)]

In [41]:
# Analogy query: which word is to 'woman' as 'king' is to 'man'?
model_google.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674735069275),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377322435379028),
 ('kings', 0.5236844420433044),
 ('queens', 0.5181134939193726),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411403656006),
 ('royal_palace', 0.5087165832519531)]

In [42]:
vec = model_google['king'] - model_google['man'] + model_google['woman']
model_google.most_similar([vec])

[('king', 0.844939112663269),
 ('queen', 0.7300516366958618),
 ('monarch', 0.645466148853302),
 ('princess', 0.6156250834465027),
 ('crown_prince', 0.5818676352500916),
 ('prince', 0.577711820602417),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376777052879333),
 ('queens', 0.5289886593818665),
 ('ruler', 0.5247418880462646)]
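
Notice that querying with the raw vector returns 'king' itself first, whereas `most_similar(positive=..., negative=...)` returned 'queen'. One reason is that the keyword form leaves the query words out of the results, while a raw vector carries no such information. The sketch below reproduces this effect with made-up 3-D vectors; it omits the unit-normalisation that Gensim also applies, so the scores are not comparable to the ones above.

```python
import numpy as np

# Made-up toy vectors, chosen only to illustrate the effect.
toy = {
    "king":  np.array([0.9, 0.2, 0.0]),
    "queen": np.array([0.6, 0.0, 0.6]),
    "man":   np.array([0.1, 0.25, 0.0]),
    "woman": np.array([0.1, 0.0, 0.25]),
}

def rank_by_cosine(query, vectors, exclude=()):
    """Return the words in `vectors` sorted by cosine similarity to `query`."""
    scores = {
        word: float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        for word, vec in vectors.items()
        if word not in exclude
    }
    return sorted(scores, key=scores.get, reverse=True)

query = toy["king"] - toy["man"] + toy["woman"]

# Without a filter, the query's largest component wins: 'king' comes first.
print(rank_by_cosine(query, toy))
# ['king', 'queen', 'woman', 'man']

# Dropping the three query words leaves 'queen' as the best match.
print(rank_by_cosine(query, toy, exclude={"king", "man", "woman"}))
# ['queen']
```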

In [43]:
vec = model_google['Berlin'] - model_google['Germany'] + model_google['China']
model_google.most_similar([vec])

[('Beijing', 0.7503446936607361),
 ('China', 0.690947413444519),
 ('Shanghai', 0.6311317086219788),
 ('Chinese', 0.5892248749732971),
 ('Wen', 0.5494182109832764),
 ('Taipei', 0.5488865971565247),
 ('Hu', 0.5421319007873535),
 ('Berlin', 0.540385901927948),
 ('Jiang', 0.5400534272193909),
 ('Nanjing', 0.5362175107002258)]

In [44]:
vec = model_google['Germany'] - model_google['Berlin'] + model_google['Beijing']
model_google.most_similar([vec])

[('China', 0.78082674741745),
 ('Beijing', 0.7486749887466431),
 ('Chinese', 0.6215003132820129),
 ('Taiwan', 0.6076850891113281),
 ('Guangzhou', 0.5951666235923767),
 ('South_Korea', 0.5853776931762695),
 ('Guangdong', 0.557560384273529),
 ('Tianjin', 0.5540564060211182),
 ('Hong_Kong', 0.5525559186935425),
 ('Hangzhou', 0.5507877469062805)]

In [45]:
vec = model_google['Messi'] - model_google['soccer'] + model_google['tennis']
model_google.most_similar([vec])

[('Messi', 0.8166202306747437),
 ('Nadal', 0.7505947947502136),
 ('Lionel_Messi', 0.7263434529304504),
 ('Federer', 0.7245292663574219),
 ('Del_Potro', 0.7131719589233398),
 ('Djokovic', 0.6933087706565857),
 ('Xavi', 0.6920369863510132),
 ('Wawrinka', 0.6769295334815979),
 ('Safin', 0.6765395402908325),
 ('Verdasco', 0.6752812266349792)]

In [46]:
model_google.doesnt_match("breakfast economics dinner lunch".split())

'economics'
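
Roughly speaking, `doesnt_match` picks the word whose vector is least similar to the mean of all the given word vectors. Here is a minimal sketch of that idea on made-up 2-D vectors; `odd_one_out` is an illustrative name, and the details of the real implementation may differ.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean unit vector."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = unit @ mean              # cosine similarity of each word to the mean
    return words[int(np.argmin(sims))]

# Made-up vectors: the three meals point one way, 'economics' another.
toy = {
    "breakfast": np.array([1.0, 0.1]),
    "lunch":     np.array([0.9, 0.2]),
    "dinner":    np.array([1.0, 0.0]),
    "economics": np.array([0.1, 1.0]),
}

print(odd_one_out(["breakfast", "economics", "dinner", "lunch"], toy))  # economics
```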

In [47]:
model_google.similarity('woman', 'man')

0.7664013

In [49]:
model_google.similarity('Harvard', 'Stanford')

0.5616391

In [50]:
model_google.similarity('Cambridge', 'Oxford')

0.7489214

In [51]:
model_google.most_similar('Harvard')

[('Yale', 0.7817695736885071),
 ('MIT', 0.6923760771751404),
 ('Tufts', 0.6757500171661377),
 ('Princeton', 0.6723749041557312),
 ('Dartmouth_College', 0.6639551520347595),
 ('Tufts_University', 0.6623229384422302),
 ('Dartmouth', 0.6545069813728333),
 ('Cornell', 0.6406891942024231),
 ('Ivy_League', 0.6399901509284973),
 ('Harvard_Law', 0.639301061630249)]

In [54]:
model_google.similarity('HKUST', 'HKU')

KeyError: "Key 'HKUST' not present"

In [55]:
model_google.similarity('Economics', 'Sociology')

0.5652766

In [56]:
model_google.similarity('Statistics', 'Economics')

0.36622703

In [57]:
model_google.similarity('Statistics', 'Sociology')

0.30566943

# Software 
- GloVe: https://nlp.stanford.edu/projects/glove/
- Word2Vec: https://code.google.com/archive/p/word2vec/
- Tensorflow Word2Vec tutorial: https://www.tensorflow.org/tutorials/text/word_embeddings