<a href="https://colab.research.google.com/github/Educat8n/Invited_Talks/blob/master/OxfordAI/create_embedding_with_text8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gensim

A state-of-the-art package for NLP.


We will create an embedding using a small text corpus called text8. The text8 dataset is the first 10^8 bytes of the Large Text Compression Benchmark, which
consists of the first 10^9 bytes of English Wikipedia. The text8 dataset is accessible from within the gensim API as an iterable of tokens, essentially a list of tokenized sentences.

[GitHub](https://github.com/PacktPublishing/Deep-Learning-with-TensorFlow-2-and-Keras/blob/master/Chapter%207/create_embedding_with_text8.py)



In [None]:
# -p makes mkdir succeed silently when the directory already exists
!mkdir -p data
import warnings
warnings.filterwarnings("ignore")


In [None]:
import gensim.downloader as api
from gensim.models import Word2Vec

info = api.info("text8")  # metadata describing the dataset
assert len(info) > 0

dataset = api.load("text8")  # download (on first use) and load the text8 dataset
model = Word2Vec(dataset)  # train a Word2Vec embedding on this data with default settings

model.save("data/text8-word2vec.bin")

Let us now explore the trained model.

In [None]:
print(model)

Word2Vec(vocab=71290, size=100, alpha=0.025)
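
The `vocab` figure counts only the words that survive frequency filtering: Word2Vec ignores words that occur fewer than `min_count` times (5 by default in gensim). The sketch below illustrates that filtering idea on a made-up corpus with the same shape as text8 (a list of token lists). `toy_corpus` and `build_vocab` are hypothetical names invented for this illustration, not part of gensim.

```python
from collections import Counter

# Hypothetical toy corpus: a list of tokenized sentences, like text8 in miniature.
toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["a", "cat", "and", "a", "dog"],
]

def build_vocab(sentences, min_count):
    """Count every token and keep those seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}

vocab = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab))
```

Raising `min_count` shrinks the vocabulary, which is why the reported size is smaller than the number of distinct tokens in the raw text.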


In [None]:
# gensim 3.x API; in gensim 4+, `vocab` was removed, so use model.wv.index_to_key instead
words = list(model.wv.vocab)
print(words)



We can load the saved model and use it. KeyedVectors is a representation that stores only the word vectors. We cannot train the model further if we keep only the KeyedVectors,
but doing so saves RAM by shedding the internal data structures needed for training.



In [None]:
from gensim.models import KeyedVectors
# The file was written by Word2Vec.save(), so the vectors are exposed through the .wv attribute
model = KeyedVectors.load("data/text8-word2vec.bin")
word_vectors = model.wv  # KeyedVectors instance: the word vectors without the training state

# get words in the vocabulary (gensim 3.x API; in gensim 4+, use word_vectors.index_to_key)
words = word_vectors.vocab.keys()
print([x for i, x in enumerate(words) if i < 10])  # print the first 10 words of the vocabulary

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']


## Find most similar

In [None]:
word_vectors.most_similar('king')

[('prince', 0.7196099162101746),
 ('queen', 0.7016708850860596),
 ('throne', 0.6925569772720337),
 ('emperor', 0.6899150609970093),
 ('vii', 0.6792635917663574),
 ('kings', 0.6782271265983582),
 ('antiochus', 0.6537981033325195),
 ('constantine', 0.6524246335029602),
 ('pope', 0.6509160995483398),
 ('elector', 0.6457626819610596)]
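
The scores above are cosine similarities between word vectors. As a minimal sketch of the ranking idea, the code below scores every other word against a query word using plain Python. The three-dimensional vectors and the names `toy_vectors` and `toy_most_similar` are made up for illustration; real text8 embeddings have 100 dimensions, and this is not gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, invented for this example only.
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.8, 0.9, 0.2],
    "prince": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

def toy_most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = toy_most_similar("king", toy_vectors)
for other, score in ranking:
    print("{:.3f} {:s}".format(score, other))
```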

#### Helper function to print

In [None]:
def print_most_similar(word_conf_pairs, k):
    """Print the first k (word, score) pairs, then "..." if more pairs were available."""
    for i, (word, conf) in enumerate(word_conf_pairs):
        print("{:.3f} {:s}".format(conf, word))
        if i >= k-1:
            break
    if k < len(word_conf_pairs):
        print("...")


## Word Arithmetic

France + Berlin - Paris = ?

In [None]:
print("# vector arithmetic with words (cosine similarity)")
print("# france + berlin - paris = ?")
print_most_similar(word_vectors.most_similar(
    positive=["france", "berlin"], negative=["paris"]), 1)

# vector arithmetic with words (cosine similarity)
# france + berlin - paris = ?
0.793 germany
...
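
To make the arithmetic concrete, here is a minimal sketch on made-up two-dimensional vectors. It assumes the common recipe of adding the unit-normalized positive vectors, subtracting the negative ones, and ranking the remaining words by cosine similarity to the result. The vectors and the names `toy_vectors` and `toy_analogy` are hypothetical; this mirrors the idea behind `most_similar(positive=..., negative=...)` rather than reproducing gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

# Hypothetical vectors: axis 0 loosely separates two countries' words,
# axis 1 loosely separates countries (+) from capitals (-).
toy_vectors = {
    "france":  [1.0, 1.0],
    "paris":   [1.0, -1.0],
    "germany": [-1.0, 1.0],
    "berlin":  [-1.0, -1.0],
    "spain":   [0.5, 1.0],
    "london":  [0.0, -1.0],
}

def toy_analogy(positive, negative, vectors):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    exclude = set(positive) | set(negative)  # never return a query word
    scored = [(w, cosine(target, v))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

result = toy_analogy(["france", "berlin"], ["paris"], toy_vectors)
print(result[0][0])
```

Excluding the query words matters: without it, one of the input words is often the nearest neighbour of the combined vector.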


## Find the odd one out

In [None]:
print(word_vectors.doesnt_match(["hindus", "parsis", 
    "singapore", "christians"]))

singapore
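
One way to think about the odd one out is as the word whose vector points least in the direction of the group's average. The sketch below implements that idea on made-up two-dimensional vectors: it averages the unit-normalized vectors and returns the word with the smallest dot product against that average. `toy_vectors` and `toy_doesnt_match` are hypothetical names for this illustration, and the values are invented.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

# Hypothetical vectors: three words cluster together, one points elsewhere.
toy_vectors = {
    "hindus":     [0.90, 0.10],
    "parsis":     [0.80, 0.20],
    "christians": [0.85, 0.15],
    "singapore":  [0.10, 0.90],
}

def toy_doesnt_match(words, vectors):
    """Return the word least similar to the mean of the group's unit vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units)
                 for i in range(dims)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

odd = toy_doesnt_match(["hindus", "parsis", "singapore", "christians"], toy_vectors)
print(odd)
```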


## Calculate distance between two words

In [None]:

print("# distance between vectors")
print("distance(singapore, malaysia) = {:.3f}".format(
    word_vectors.distance("singapore", "malaysia")
))

# distance between vectors
distance(singapore, malaysia) = 0.115
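
The distance reported here is cosine distance, that is, 1 minus the cosine similarity, so a distance of 0.115 corresponds to a similarity of 0.885. The sketch below shows the relationship on made-up three-dimensional vectors; the values assigned to `singapore` and `malaysia` are hypothetical, not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance: 0 for identical directions, up to 2 for opposite ones."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical vectors, invented for this example only.
singapore = [0.9, 0.4, 0.1]
malaysia = [0.8, 0.5, 0.2]

s = cosine_similarity(singapore, malaysia)
d = cosine_distance(singapore, malaysia)
print("similarity = {:.3f}, distance = {:.3f}".format(s, d))
```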


## Find similarity between words

In [None]:
print("# similarity between words")
for word in ["woman", "dog", "whale", "tree"]:
    print("similarity({:s}, {:s}) = {:.3f}".format(
        "man", word,
        word_vectors.similarity("man", word)
    ))

# similarity between words
similarity(man, woman) = 0.740
similarity(man, dog) = 0.451
similarity(man, whale) = 0.279
similarity(man, tree) = 0.294
