# Using pre-trained embeddings and NLP corpora

Gensim lets you load pre-trained GloVe and Word2Vec embeddings directly through its downloader API. It also offers reusable corpora that you can download and immediately use to train a Word2Vec embedding. The code snippets below show you how. The available embeddings and corpora are listed here: https://github.com/RaRe-Technologies/gensim-data.

A word of warning: I'm not impressed with the quality of the pre-trained word embeddings. Either the underlying dataset is noisy or it's just too general. I'll explain more later.

## Imports

In [1]:
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

## Pre-trained: Twitter GloVe Embeddings
This first step downloads the pre-trained embeddings and loads them for re-use. As the name suggests, these are GloVe embeddings built from tweets: 2B tweets, 27B tokens, a 1.2M-word vocabulary, uncased. The original source can be found here: https://nlp.stanford.edu/projects/glove/. The `25` in the model name refers to the dimensionality of the vectors.

In [2]:
# download the model and return as object ready for use
model = api.load("glove-twitter-25")

Once the pre-trained model is loaded, you can query it directly. The object returned by `api.load` holds just the word vectors, so it behaves like the `wv` attribute of any gensim Word2Vec model. Here are a few similarity examples:

In [3]:
model.most_similar("pelosi", topn=10)

[('clegg', 0.9653650522232056),
 ('miliband', 0.9515050053596497),
 ('bachmann', 0.9484400153160095),
 ('mcconnell', 0.9416399002075195),
 ('carney', 0.9340256452560425),
 ('coulter', 0.9311323165893555),
 ('boehner', 0.9286301732063293),
 ('santorum', 0.9269059896469116),
 ('farage', 0.9193653464317322),
 ('mourdock', 0.9186689853668213)]

In [4]:
model.most_similar("policies", topn=10)

[('policy', 0.9484812021255493),
 ('reforms', 0.9403934478759766),
 ('laws', 0.9401204586029053),
 ('government', 0.9230710864067078),
 ('regulations', 0.9168934226036072),
 ('economy', 0.9110006093978882),
 ('immigration', 0.9105910062789917),
 ('legislation', 0.9089650511741638),
 ('govt', 0.9054746627807617),
 ('regulation', 0.9050778746604919)]
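
The scores in the outputs above are cosine similarities between word vectors. To make that concrete, here is a minimal pure-Python sketch of the idea. It is not gensim's implementation, and the three-dimensional `toy_vectors` are made-up values for illustration only (the real Twitter vectors have 25 dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, invented for this example.
toy_vectors = {
    "policy": [0.9, 0.1, 0.2],
    "policies": [0.8, 0.2, 0.3],
    "banana": [0.1, 0.9, 0.1],
}


def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = most_similar("policies", toy_vectors, topn=2)
print(ranking)
```

With these toy values, `policy` ranks above `banana` because its vector points in nearly the same direction as `policies`. A score close to 1 means the two words appear in very similar contexts in the training data, which is not always the same as having similar meanings.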

## Pre-trained: GloVe Wikipedia + Gigaword
The example below uses pre-trained GloVe vectors based on Wikipedia 2014 and Gigaword. The original source of these embeddings can be found here: https://nlp.stanford.edu/projects/glove/

In [5]:
# again, download and load the model
model_gigaword = api.load("glove-wiki-gigaword-100")

In [6]:
# find similarity
model_gigaword.most_similar(positive=['dirty', 'grimy'], topn=10)

[('filthy', 0.7690386176109314),
 ('smelly', 0.7392696738243103),
 ('shabby', 0.7025482654571533),
 ('dingy', 0.7022336721420288),
 ('grubby', 0.6754513382911682),
 ('grungy', 0.6414024233818054),
 ('dank', 0.6263698935508728),
 ('sweaty', 0.622745156288147),
 ('dreary', 0.6216242909431458),
 ('gritty', 0.6215749382972717)]

In [7]:
model_gigaword.most_similar(positive=["summer", "winter"], topn=10)

[('spring', 0.8519278764724731),
 ('autumn', 0.7865706086158752),
 ('olympics', 0.6915045380592346),
 ('weekend', 0.6908971667289734),
 ('days', 0.6872981786727905),
 ('during', 0.6861999034881592),
 ('season', 0.6849778294563293),
 ('year', 0.6827663779258728),
 ('rainy', 0.6744828820228577),
 ('day', 0.671191930770874)]
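
When you pass several words in `positive`, the query is, as far as I understand it, the average of the words' unit-length vectors, and candidates are then ranked by cosine similarity to that average. The sketch below illustrates the idea in plain Python. It is an approximation of the behaviour, not gensim's code, and `toy_vectors` holds invented values.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combined_query(words, vectors):
    """Average the unit-length vectors of `words` into one query vector."""
    units = [unit(vectors[w]) for w in words]
    mean = [sum(component) / len(units) for component in zip(*units)]
    return unit(mean)


def most_similar(positive, vectors, topn=10):
    """Rank words (excluding the query words) by cosine to the combined query."""
    query = combined_query(positive, vectors)
    scored = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in positive
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical vectors, invented for this example.
toy_vectors = {
    "summer": [1.0, 0.2, 0.0],
    "winter": [0.2, 1.0, 0.0],
    "spring": [0.6, 0.6, 0.1],
    "bicycle": [0.0, 0.1, 1.0],
}

result = most_similar(["summer", "winter"], toy_vectors, topn=2)
print(result)
```

Here `spring` wins because it sits between `summer` and `winter`, which is the same effect you see in the real output above.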

## Load a dataset and train a model
Instead of loading pre-trained embeddings, you can also download a corpus and train a model on it yourself. The list of datasets available for download can be found here: https://github.com/RaRe-Technologies/gensim-data#datasets

In [8]:
from gensim.models.word2vec import Word2Vec

# this loads the text8 dataset
corpus = api.load('text8')

# train a Word2Vec model on the corpus
# (argument names for gensim >= 4.0; gensim 3.x used iter= and size= instead)
model_text8 = Word2Vec(corpus, epochs=10, vector_size=150, window=10, min_count=2, workers=10)
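
If you want to train on your own text instead of a downloaded dataset, the corpus needs the same shape: an iterable that yields one list of tokens per sentence. Because training makes several passes over the data, it should be restartable rather than a one-shot generator. Below is a minimal sketch of such an iterable. `LineCorpus` and `raw_lines` are hypothetical names for illustration, and the lowercase-and-split tokenization is a simplifying assumption.

```python
class LineCorpus:
    """Restartable iterable that yields one list of tokens per non-empty line."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        # A fresh generator is created on every call, so the corpus can be
        # iterated as many times as training needs.
        for line in self.lines:
            tokens = line.lower().split()
            if tokens:
                yield tokens


# Hypothetical in-memory text; in practice this could be lines read from a file.
raw_lines = [
    "The cat sat on the mat",
    "",
    "The dog chased the cat",
]
toy_corpus = LineCorpus(raw_lines)

first_pass = list(toy_corpus)
second_pass = list(toy_corpus)
print(first_pass)

# With gensim installed, the object could be used in place of `corpus`, e.g.:
# Word2Vec(toy_corpus, min_count=1)
```

Both passes return the same sentences, which is what distinguishes this from a plain generator that would be empty the second time around.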


In [9]:
# similarity 
model_text8.wv.most_similar("shocked")

[('surprised', 0.7146514058113098),
 ('outraged', 0.7117233276367188),
 ('disappointed', 0.6712729930877686),
 ('angered', 0.6455301642417908),
 ('offended', 0.6371268630027771),
 ('overwhelmed', 0.6347959637641907),
 ('confronted', 0.6278891563415527),
 ('betrayed', 0.6236147284507751),
 ('disgusted', 0.6220308542251587),
 ('alarmed', 0.6148042678833008)]

In [10]:
# similarity between two different words
model_text8.wv.similarity(w1="dirty",w2="smelly")

0.44690064

In [11]:
# Which one is the odd one out in this list?
model_text8.wv.doesnt_match(["cat","dog","france"])

'france'
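
For the curious, the odd-one-out check can be pictured like this: average the unit-length vectors of all the words, then return the word that is least similar to that average. The sketch below is my own plain-Python illustration of that idea, not gensim's implementation, and the two-dimensional `toy_vectors` are invented.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(words, vectors):
    """Return the word whose vector is least similar to the group's mean."""
    units = {w: unit(vectors[w]) for w in words}
    mean = unit([sum(component) / len(units) for component in zip(*units.values())])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical vectors, invented for this example.
toy_vectors = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "france": [0.1, 0.9],
}

odd_one = doesnt_match(["cat", "dog", "france"], toy_vectors)
print(odd_one)
```

Since `cat` and `dog` point in nearly the same direction, they pull the mean toward themselves and `france` ends up furthest away.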