# Using pre-trained embeddings and NLP corpora
Gensim lets you load pre-trained GloVe and Word2Vec embeddings directly through its downloader API. It also offers a few reusable corpora that you can download and use straight away to train a Word2Vec embedding. The code snippets below show you how. The embeddings come from https://github.com/RaRe-Technologies/gensim-data.

A word of warning: I'm not impressed with the quality of the pre-trained word embeddings. Either the dataset is noisy or it's just too general. More on this later.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [25]:
# import 
import numpy as np 
from scipy.linalg import norm

from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

In [3]:
from pprint import pprint
pprint(list(api.info()['models'].keys()))

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']


# Pre-trained: Twitter GloVe Embeddings

This first step downloads the pre-trained embeddings and loads them for reuse. As the name suggests, these are GloVe embeddings built from tweets: 2B tweets, 27B tokens, a 1.2M-word vocabulary, uncased. The original source is https://nlp.stanford.edu/projects/glove/. The 25 in the model name is the dimensionality of the vectors.

In [4]:
# download the model and return as object ready for use
dimension = 25
model_glove_twitter = api.load("glove-twitter-25")
# model_glove_twitter = api.load("glove-twitter-100")
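
The `dimension` constant has to agree with the number at the end of the model name, so switching to `glove-twitter-100` means changing both. A minimal sketch of keeping the two in sync by parsing the name, assuming the naming convention visible in the model list above (names ending in the vector size); `dimension_from_name` is a hypothetical helper, not part of gensim:

```python
def dimension_from_name(model_name):
    """Return the trailing integer of a gensim-data model name.

    Hypothetical helper: it assumes the name ends in "-<dimension>",
    which holds for the GloVe, word2vec and fastText entries listed above.
    """
    suffix = model_name.rsplit("-", 1)[-1]
    if not suffix.isdigit():
        raise ValueError(f"no dimension suffix in {model_name!r}")
    return int(suffix)


model_name = "glove-twitter-25"
dimension = dimension_from_name(model_name)
print(dimension)
```

Once the model is loaded, `model_glove_twitter.vector_size` reports the same number and is the more direct check.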

In [5]:
model_glove_twitter.most_similar("twitter",topn=10)

[('facebook', 0.948005199432373),
 ('tweet', 0.9403423070907593),
 ('fb', 0.9342359900474548),
 ('instagram', 0.9104822874069214),
 ('chat', 0.8964964747428894),
 ('hashtag', 0.8885936737060547),
 ('tweets', 0.8878158330917358),
 ('tl', 0.8778460621833801),
 ('link', 0.877821147441864),
 ('internet', 0.8753897547721863)]
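
The scores returned by `most_similar` are cosine similarities between the query vector and the other vectors in the vocabulary. A minimal sketch of that ranking on a made-up three-dimensional vocabulary (the vectors and the `toy_most_similar` helper are invented for illustration; real vectors would come from the loaded model):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def toy_most_similar(vectors, query, topn=10):
    """Rank every word except the query by cosine similarity to it."""
    scores = [(w, cosine(vectors[query], vec)) for w, vec in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented vectors, NOT taken from glove-twitter-25.
toy_vectors = {
    "twitter": [1.0, 0.9, 0.0],
    "facebook": [0.9, 1.0, 0.1],
    "tweet": [1.0, 0.5, 0.0],
    "banana": [0.0, 0.1, 1.0],
}

for word, score in toy_most_similar(toy_vectors, "twitter", topn=3):
    print(f"{word}: {score:.3f}")
```

Words whose vectors point in nearly the same direction as the query come first, regardless of vector length.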

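Once you have loaded the pre-trained model, use it as you would any gensim word-vector model (`api.load` returns a `KeyedVectors` object, so the query methods are the same as for a trained Word2Vec model's vectors). Here are a few similarity examples: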

In [6]:
model_glove_twitter.most_similar("pelosi",topn=10)

[('clegg', 0.9653650522232056),
 ('miliband', 0.9515050053596497),
 ('bachmann', 0.9484400749206543),
 ('mcconnell', 0.9416398406028748),
 ('carney', 0.9340257048606873),
 ('coulter', 0.9311323165893555),
 ('boehner', 0.9286302328109741),
 ('santorum', 0.9269059896469116),
 ('farage', 0.9193653464317322),
 ('mourdock', 0.9186689853668213)]

In [7]:
model_glove_twitter.most_similar("policies",topn=10)

[('policy', 0.9484812617301941),
 ('reforms', 0.9403934478759766),
 ('laws', 0.9401204586029053),
 ('government', 0.923071026802063),
 ('regulations', 0.9168933629989624),
 ('economy', 0.9110006093978882),
 ('immigration', 0.9105909466743469),
 ('legislation', 0.9089650511741638),
 ('govt', 0.9054747223854065),
 ('regulation', 0.9050779342651367)]

Which of these words doesn't fit?

In [8]:
#what doesn't fit?
model_glove_twitter.doesnt_match(["trump","bernie","obama","pelosi","orange"])

'orange'
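
As far as I understand gensim's implementation, `doesnt_match` scales each word vector to unit length, averages them, and returns the word whose vector is least similar to that mean. A rough sketch of the idea with invented two-dimensional vectors (`toy_doesnt_match` is a hypothetical helper, not gensim's code):

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def toy_doesnt_match(vectors, words):
    """Return the word least similar to the mean of the unit vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # The dot product of two unit vectors equals their cosine similarity.
    similarity = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(similarity, key=similarity.get)


# Invented vectors, NOT taken from glove-twitter-25.
toy_vectors = {
    "trump": [1.0, 0.1],
    "obama": [0.9, 0.2],
    "pelosi": [0.8, 0.1],
    "orange": [0.1, 1.0],
}

print(toy_doesnt_match(toy_vectors, ["trump", "obama", "pelosi", "orange"]))
```

The odd word is simply the one pointing furthest away from the group's average direction, so the result depends on what the other words in the list have in common.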

Word vectors for trump and obama

In [9]:
# show the word vector for trump (obama follows in the next cell)
model_glove_twitter["trump"]

array([-0.56174 ,  0.69419 ,  0.16733 ,  0.055867, -0.26266 , -0.6303  ,
       -0.28311 , -0.88244 ,  0.57317 , -0.82376 ,  0.46728 ,  0.48607 ,
       -2.1942  , -0.41972 ,  0.31795 , -0.70063 ,  0.060693,  0.45279 ,
        0.6564  ,  0.20738 ,  0.84496 , -0.087537, -0.38856 , -0.97028 ,
       -0.40427 ], dtype=float32)

In [10]:
model_glove_twitter['obama']

array([ 0.77126 ,  0.81259 , -0.5901  , -0.015908, -0.082797, -1.2261  ,
        0.098286,  0.087488,  0.012586, -0.35884 ,  0.80733 ,  0.12569 ,
       -4.0522  ,  0.14856 ,  0.6988  , -0.78948 , -0.77125 ,  0.49512 ,
        0.16366 , -0.9713  ,  0.95064 ,  0.19921 , -0.27903 , -1.6844  ,
       -0.79424 ], dtype=float32)

# Computing document similarity

In [46]:
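def vector_similarity(s1, s2, dimension=25):
    """Cosine similarity between the averaged word vectors of two sentences."""

    def sentence_vector(s):
        """Average the word vectors of a sentence.

        `dimension` must match the dimensionality of the loaded model.
        """
        # words = jieba.lcut(s)  # use a tokenizer such as jieba for Chinese text
        # keep only words the model knows; an unknown word would raise a KeyError
        words = [w.lower() for w in s.split() if w.lower() in model_glove_twitter]

        # average the word vectors to get the sentence vector
        v = np.zeros(dimension)
        for word in words:
            v += model_glove_twitter[word]
        if words:
            v /= len(words)
        return v

    v1, v2 = sentence_vector(s1), sentence_vector(s2)
    return np.dot(v1, v2) / (norm(v1) * norm(v2))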

new1 = "Amazon holds early lead in historic union election"
new2 = "The woman who took on google and won"

score = vector_similarity(s1=new1, s2=new2)
print(score)

0.9069623250399094
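
Written out, `vector_similarity` averages the word vectors of each sentence and takes the cosine of the angle between the two averages. With $w_1, \dots, w_n$ the tokens of a sentence $s$ and $x_w$ the embedding of word $w$ (notation introduced here for illustration only):

```latex
\bar{v}(s) = \frac{1}{n} \sum_{i=1}^{n} x_{w_i},
\qquad
\operatorname{sim}(s_1, s_2) =
  \frac{\bar{v}(s_1) \cdot \bar{v}(s_2)}
       {\lVert \bar{v}(s_1) \rVert \, \lVert \bar{v}(s_2) \rVert}
```

Because the cosine ignores vector length, dividing by $n$ does not change the score; summing the word vectors instead of averaging them would give the same result.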


# Step-by-step breakdown of the code

In [29]:
s = new1
words = [ w.lower() for w in s.split()]
words

['amazon', 'holds', 'early', 'lead', 'in', 'historic', 'union', 'election']

In [30]:
dimension = 25
v = np.zeros(dimension)
v

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0.])

In [33]:
for word in words:
    v += model_glove_twitter[word]
print(v)

[ -1.25367999   2.7157019   -2.38962002  -4.63931996   0.09306997
  -4.19992995   5.07026005  -4.96010609   2.51256605  -2.76789112
   1.45824202   2.09489102 -28.72370052   7.32691002   1.65678005
  -3.59892997  -1.06501414   1.02622299  -1.78564898  -1.97814004
  -4.28357     -2.51712359   2.47142602  -5.37796997  -2.65966394]


In [35]:
model_glove_twitter["amazon"]

array([ 0.23029 , -0.26417 , -0.19669 , -1.2001  ,  0.84545 , -0.49428 ,
        0.46503 , -0.079233,  0.46324 , -0.70849 ,  0.91901 ,  0.65455 ,
       -2.73    , -0.74847 , -0.85378 , -0.57711 ,  0.1443  ,  0.33378 ,
        0.062339,  0.77928 , -0.77372 , -2.8468  ,  0.22277 , -0.39313 ,
       -0.044044], dtype=float32)
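
The cells above stop at the summed vector. The two remaining steps in `vector_similarity` are dividing the sum by the number of words and taking the cosine between the two sentence vectors. A small sketch of those steps on invented two-dimensional vectors, written in plain Python so that it runs without the downloaded model (the `toy_vectors` values and the two word lists are made up):

```python
import math

# Invented 2-d embeddings, NOT taken from glove-twitter-25.
toy_vectors = {
    "union": [1.0, 0.0],
    "election": [0.0, 1.0],
    "google": [1.0, 1.0],
    "won": [3.0, 1.0],
}


def sentence_vector(words):
    """Sum the word vectors, then divide by the number of words."""
    total = [0.0, 0.0]
    for word in words:
        total = [t + x for t, x in zip(total, toy_vectors[word])]
    return [t / len(words) for t in total]


def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


v1 = sentence_vector(["union", "election"])  # sum [1, 1] -> mean [0.5, 0.5]
v2 = sentence_vector(["google", "won"])      # sum [4, 2] -> mean [2.0, 1.0]
print(v1, v2)
print(round(cosine(v1, v2), 4))
```

With numbers this small the result can be checked by hand: the dot product is 1.5 and the two lengths multiply to about 1.58, giving a similarity of roughly 0.95.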