In [1]:
# Pre-trained Word Embeddings

# Owner : Jacquelyn Carmichael
# Uses Libraries: gensim
# Runtime: Jupyter Notebook

# Data:

# Reference: https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne



# Pre-trained Word Embeddings

Many pretrained word embeddings are publicly available. They can be loaded with gensim and also used to initialise an Embedding layer in Keras. Commonly used pretrained embeddings include:

* GloVe
* Word2Vec
* FastText



In [10]:
# import the necessary libraries
import pandas as pd 

# The pandas library is used to read in the data and manipulate it

import gensim

# The gensim library is used to load and query the word embeddings

import gensim.downloader

# The gensim.downloader module is used to download the pre-trained word embeddings

import numpy as np

# The numpy library is used for numerical operations on the word vectors

import matplotlib.pyplot as plt

# The matplotlib library is used to plot and visualise the word vectors



In [4]:
for model_name in list(gensim.downloader.info()['models'].keys())[:10]:
    print(model_name)

# The gensim.downloader.info() function returns a dictionary of information about the pre-trained word embeddings

fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50


In [5]:
google_news_vectors = gensim.downloader.load('word2vec-google-news-300')

# The gensim.downloader.load() function downloads (on first use) and loads the pre-trained word embeddings
# The loaded vectors are stored in the google_news_vectors variable
# word2vec-google-news-300 is the name of the pre-trained model
# It contains 300-dimensional word vectors trained on roughly 100 billion words from the Google News dataset
# You can use it to find the words most similar to a given word



# GloVe Word Embeddings

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
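
To make "word-word co-occurrence statistics" concrete, here is a minimal sketch of how such counts could be collected from a toy corpus. The function name `cooccurrence_counts`, the window size, and the 1/distance weighting are illustrative assumptions, not the GloVe reference implementation.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at up to `window` preceding tokens
            for j in range(max(0, i - window), i):
                context = tokens[j]
                # Assumption: nearer words contribute more (weight 1 / distance)
                weight = 1.0 / (i - j)
                # Record the pair in both directions so the table is symmetric
                counts[(word, context)] += weight
                counts[(context, word)] += weight
    return dict(counts)


# Hypothetical toy corpus, already tokenised
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[("cat", "sat")])
```

GloVe then fits word vectors so that their dot products approximate the logarithm of these counts; that fitting step is not shown here.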

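Note that word2vec-google-news-300, loaded above, is a Word2Vec model rather than a GloVe model. It was trained on the Google News dataset and contains 300-dimensional vectors for 3 million words and phrases. Because it stores one vector per vocabulary entry, it cannot generate embeddings for out-of-vocabulary words; looking up an unknown word raises a `KeyError`.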



In [11]:
google_news_vectors.most_similar('cat')

[('cats', 0.8099378347396851),
 ('dog', 0.760945737361908),
 ('kitten', 0.7464984655380249),
 ('feline', 0.7326233983039856),
 ('beagle', 0.7150583267211914),
 ('puppy', 0.7075453996658325),
 ('pup', 0.6934291124343872),
 ('pet', 0.6891531348228455),
 ('felines', 0.6755931377410889),
 ('chihuahua', 0.6709762215614319)]

In [40]:
# Let's find the words most similar to "apple"

google_news_vectors.most_similar("apple")

[('apples', 0.720359742641449),
 ('pear', 0.6450697183609009),
 ('fruit', 0.6410146355628967),
 ('berry', 0.6302294731140137),
 ('pears', 0.6133960485458374),
 ('strawberry', 0.6058261394500732),
 ('peach', 0.6025872826576233),
 ('potato', 0.5960935354232788),
 ('grape', 0.5935865044593811),
 ('blueberry', 0.5866668224334717)]
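
The scores in these outputs are cosine similarities between word vectors. The following sketch reproduces the ranking idea with NumPy on a few made-up 3-dimensional vectors; `toy_most_similar` and `toy_vectors` are hypothetical names, and the numbers are not taken from any real model.

```python
import numpy as np


def toy_most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    words = list(vectors)
    matrix = np.array([vectors[w] for w in words], dtype=float)
    # Normalise each row so that a dot product equals the cosine similarity
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    query = matrix[words.index(word)]
    scores = matrix @ query
    ranked = sorted(zip(words, scores), key=lambda pair: -pair[1])
    # Drop the query word itself and keep the best `topn` neighbours
    return [(w, float(s)) for w, s in ranked if w != word][:topn]


# Made-up vectors purely for illustration
toy_vectors = {
    "cat": [1.0, 0.9, 0.0],
    "kitten": [0.9, 1.0, 0.1],
    "dog": [0.8, 0.3, 0.5],
    "car": [0.0, 0.1, 1.0],
}
print(toy_most_similar("cat", toy_vectors))
```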

# FastText Word Embeddings

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices.

FastText uses subword information: each word is represented through its character n-grams, so a trained FastText model can also build vectors for out-of-vocabulary words from the n-grams they share with known words.
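
As a rough illustration of what "subword information" means, the sketch below splits a word into character n-grams after wrapping it in boundary markers. The helper name `char_ngrams` is hypothetical, and the default n-gram range of 3 to 6 is an assumption based on commonly described FastText settings.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, including boundary markers."""
    # '<' and '>' mark the start and end of the word
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Trigrams only, to keep the output short
trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)
```

A vector for an unseen word can then be assembled from the vectors of its n-grams, which is why related spellings such as `supercat` and `super-cat` tend to land close to `cat`.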



In [41]:
# fasttext-wiki-news-subwords-300 is a set of pre-trained FastText vectors trained on Wikipedia 2017, the UMBC WebBase corpus and the statmt.org news dataset

fasttext_vectors = gensim.downloader.load('fasttext-wiki-news-subwords-300')
fasttext_vectors.most_similar('cat')

[('cats', 0.8368596434593201),
 ('housecat', 0.767471194267273),
 ('-cat', 0.7602992057800293),
 ('dog', 0.7502297759056091),
 ('kitten', 0.7480817437171936),
 ('feline', 0.7353992462158203),
 ('super-cat', 0.7305206060409546),
 ('supercat', 0.7163283824920654),
 ('pet', 0.7090283632278442),
 ('moggy', 0.7057286500930786)]

# Word2Vec Word Embeddings

Word2Vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. The idea is that words that share contexts in the corpus will be located in close proximity to one another in the space.
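
One way to picture "reconstructing linguistic contexts" is the skip-gram setup, where the model is trained to predict the words surrounding each centre word. The sketch below only generates the (centre, context) training pairs; the function name `skipgram_pairs` and the window size are assumptions for illustration, and no network is trained.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (centre, context) pairs from a list of tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Every neighbour inside the window becomes a context word
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "cat", "sat", "down"], window=1)
print(pairs)
```

Words that appear with similar context words receive similar training signals, which is what pulls their vectors together.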

# Word Embeddings in Keras

Keras provides an Embedding layer that can be used for neural networks on text data. It requires that the input data be integer encoded, so that each word is represented by a unique integer.
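
A minimal sketch of integer encoding, written in plain Python so the idea is visible without any Keras utilities. The helper names `build_vocab` and `encode_and_pad` and the example documents are hypothetical; the names `vocab_size` and `padded_docs` mirror those used in the model code in this section.

```python
def build_vocab(docs):
    """Assign each distinct word an integer id, reserving 0 for padding."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode_and_pad(docs, vocab, max_length):
    """Replace words by their ids, then truncate or zero-pad to `max_length`."""
    encoded = []
    for doc in docs:
        ids = [vocab[w] for w in doc.lower().split()][:max_length]
        encoded.append(ids + [0] * (max_length - len(ids)))
    return encoded


# Made-up documents for illustration
docs = ["Well done", "Good work", "Poor effort", "not good"]
vocab = build_vocab(docs)
padded_docs = encode_and_pad(docs, vocab, max_length=4)
vocab_size = len(vocab) + 1  # +1 for the padding id 0
print(padded_docs)
```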

The Embedding layer is defined as the first hidden layer of a network. It takes three arguments:

* **input_dim**: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0 and 10, then the size of the vocabulary would be 11 words.

* **output_dim**: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.

* **input_length**: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents consist of 1000 words, this would be 1000. Recent Keras releases deprecate this argument and infer the length from the input, so it may need to be omitted there.
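
Conceptually, an embedding layer is a lookup table with `input_dim` rows and `output_dim` columns. The NumPy sketch below imitates that lookup with a randomly initialised table; it is an illustration of the shapes involved, not Keras code, and the variable names are assumptions.

```python
import numpy as np

input_dim, output_dim, input_length = 11, 4, 3

# One row per word id, one column per embedding dimension
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(input_dim, output_dim))

# An integer-encoded sequence of length `input_length`
sequence = np.array([2, 10, 0])

# Looking up a sequence simply selects the matching rows
embedded = embedding_table[sequence]
print(embedded.shape)  # (3, 4)
```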

We can define an Embedding layer as part of a Keras model as follows:

```python
# Assumes vocab_size, padded_docs and labels were prepared beforehand
from tensorflow.keras.layers import Dense, Embedding, Flatten
from tensorflow.keras.models import Sequential

# define the model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=4))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model (summary() prints the table itself)
model.summary()

# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
```
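
To use pretrained vectors with such a model, a common approach is to copy them into a weight matrix whose row `i` holds the vector for word id `i`. The sketch below uses a small dictionary as a stand-in for the pretrained vectors; `build_embedding_matrix`, `toy_pretrained` and `word_index` are hypothetical names, and words without a pretrained vector are simply left as zeros.

```python
import numpy as np


def build_embedding_matrix(word_index, pretrained, dim):
    """Build a (vocab_size, dim) matrix; row 0 is reserved for padding."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        # Words missing from the pretrained vocabulary keep an all-zero row
        if word in pretrained:
            matrix[idx] = pretrained[word]
    return matrix


# Stand-in for pretrained vectors (made-up numbers)
toy_pretrained = {
    "good": np.array([0.1, 0.2, 0.3]),
    "work": np.array([0.4, 0.5, 0.6]),
}
word_index = {"good": 1, "work": 2, "zzz": 3}
embedding_matrix = build_embedding_matrix(word_index, toy_pretrained, dim=3)
print(embedding_matrix)
```

Because gensim keyed vectors support both `word in vectors` and `vectors[word]`, an object such as `google_news_vectors` could take the place of `toy_pretrained` with `dim=300`. The resulting matrix can then be supplied to the Embedding layer as its initial weights, for example through a constant initializer, with the layer set to non-trainable if the vectors should stay fixed.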




In [44]:
google_news_vectors.most_similar("glass")

[('R._Mazzei_fused', 0.6665399074554443),
 ('Christian_Audigier_nightclub', 0.6632695198059082),
 ('copper_alloy_garnets', 0.634365439414978),
 ('Nelmeus', 0.6274421811103821),
 ('fiber_fusion_splicing', 0.6229820251464844),
 ('Plexiglass', 0.5858588814735413),
 ('slashing_Leonardo_DiCaprio', 0.5850011110305786),
 ('plexiglass', 0.5823023319244385),
 ('Plexiglas', 0.5803930759429932),
 ("#Q'##_unaudited", 0.5798528790473938)]

In [48]:
# Finding the capital of Britain given the capital of France
print("Finding Capital of Britain given: (Paris - France) + Britain")
capital = google_news_vectors.most_similar(positive=['Paris', 'Britain'], negative=["France"], topn=1)
print(capital)
print()


Finding Capital of Britain given: (Paris - France) + Britain
[('London', 0.7541898488998413)]
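
The analogy query above boils down to vector arithmetic followed by a nearest-neighbour search. Below is a small NumPy sketch of that idea on made-up 3-dimensional vectors; `toy_analogy` and `toy_geo` are hypothetical names, and the procedure only approximates what gensim does internally.

```python
import numpy as np


def toy_analogy(positive, negative, vectors, topn=1):
    """Add the positive vectors, subtract the negative ones, find the nearest word."""
    # Work with unit-length vectors so that dot products are cosine similarities
    unit = {w: np.asarray(v, dtype=float) / np.linalg.norm(v) for w, v in vectors.items()}
    target = sum(unit[w] for w in positive) - sum(unit[w] for w in negative)
    target = target / np.linalg.norm(target)
    # Exclude the query words themselves from the candidates
    candidates = [
        (w, float(u @ target))
        for w, u in unit.items()
        if w not in positive and w not in negative
    ]
    return sorted(candidates, key=lambda pair: -pair[1])[:topn]


# Made-up axes: [is a capital, relates to France, relates to Britain]
toy_geo = {
    "Paris": [1.0, 1.0, 0.0],
    "France": [0.0, 1.0, 0.0],
    "London": [1.0, 0.0, 1.0],
    "Britain": [0.0, 0.0, 1.0],
    "Berlin": [1.0, 0.0, 0.0],
}
print(toy_analogy(["Paris", "Britain"], ["France"], toy_geo))
```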

