## Word Embeddings

Word embeddings transform text into a sequence of vectors, one vector per word.
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.

(Source: [Wikipedia](https://en.wikipedia.org/wiki/Word_embedding))
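The "one dimension per word" space mentioned above is the one-hot representation. As a minimal sketch, assuming a hypothetical five-word vocabulary and a random (untrained) embedding matrix, the mapping into the lower-dimensional space is a matrix product that reduces to a row lookup:

```python
import numpy as np

# Hypothetical toy vocabulary and embedding size, chosen only for illustration.
vocab = ["see", "you", "later", "alligator", "crocodile"]
vocab_size, embedding_dim = len(vocab), 2

# One row per word. Real embeddings are learned; these values are random.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

def one_hot(index, size):
    """Return the one-dimension-per-word representation of a word."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

word_index = vocab.index("crocodile")
sparse = one_hot(word_index, vocab_size)   # shape (5,)
dense = sparse @ embedding_matrix          # shape (2,)

# Multiplying a one-hot vector by the matrix is the same as selecting a row.
assert np.allclose(dense, embedding_matrix[word_index])
print(sparse.shape, "->", dense.shape)
```

In practice the one-hot vector is never built; libraries store the matrix and index into it directly.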

![Word2Vec](img/Word2Vec_web.png)

### Embedding Layers

Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base methods, and explicit representation in terms of the context in which words appear.

Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing and sentiment analysis.

(Source: [Wikipedia](https://en.wikipedia.org/wiki/Word_embedding))

Keras offers an Embedding layer that can be used for neural networks on text data.

It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.

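```python
from keras.layers import Embedding

# 200-word vocabulary, 32-dimensional vectors, input sequences of 50 integers
e = Embedding(200, 32, input_length=50)
```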
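To make the integer encoding and the lookup concrete, here is a plain Python and NumPy sketch of the two steps. It is a stand-in under stated assumptions, not the Keras implementation: `fit_word_index` and `encode` are hypothetical helpers, the weights are random rather than learned, and padding is added on the right (Keras `pad_sequences` pads at the front by default).

```python
import numpy as np
from collections import Counter

def fit_word_index(texts):
    """Map each word to a unique integer, most frequent first; 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index, length):
    """Integer-encode a sentence, then truncate or right-pad it with 0 to a fixed length."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[:length]
    return ids + [0] * (length - len(ids))

texts = ["see you later alligator", "see you later crocodile"]
word_index = fit_word_index(texts)
sequence = encode("see you later crocodile", word_index, length=50)

# Stand-in for Embedding(200, 32, input_length=50): a 200 x 32 lookup table.
# The weights are random here; in a network they are learned during training.
weights = np.random.default_rng(0).normal(size=(200, 32))
embedded = weights[sequence]
print(embedded.shape)  # (50, 32)
```

Each of the 50 integers selects one 32-dimensional row, so a single sentence becomes a 50 x 32 array.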

![Word2VecAnn](img/word2vec_ann.png)

## Create Gensim Vectors

```python
from gensim.models import word2vec

corpus = [
   "She loves you yeah yeah yeah",
   "see you later alligator",
   "see you later crocodile",
   "i just call to say i love you",
   "and it seems to me you lived your life like a candle in the wind",
   "baby you can drive my car",
   "we all live in the yellow submarine"
]

# Lowercase each sentence and split it on whitespace
tokenized = [s.lower().split() for s in corpus]

# Written for gensim 3.x; gensim 4 renamed `size` to `vector_size`
# and replaced `wv.vocab` with `wv.key_to_index`.
wv = word2vec.Word2Vec(tokenized,
                       size=7,
                       window=5,
                       min_count=1
                      )
words = list(wv.wv.vocab)
len(words)

# Look up the 7-dimensional vector of a single word
print(wv.wv['crocodile'])
```

```
[-0.00743786  0.0590687  -0.02039889  0.02666032  0.01912138 -0.04088607
 -0.00558173]
```

```python
wv.wv.most_similar('crocodile')
```

```
[('candle', 0.6661001443862915),
 ('drive', 0.6130875945091248),
 ('loves', 0.403645783662796),
 ('all', 0.39144495129585266),
 ('just', 0.38677164912223816),
 ('seems', 0.37842628359794617),
 ('lived', 0.30731189250946045),
 ('wind', 0.29231205582618713),
 ('car', 0.28432750701904297),
 ('baby', 0.279856413602829)]
```
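
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal NumPy sketch of that ranking follows; the function names and the 2-dimensional vectors are hypothetical and hand-picked for illustration, not taken from the model above.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-dimensional vectors, hand-picked for illustration.
vectors = {
    "crocodile": np.array([1.0, 0.1]),
    "alligator": np.array([0.9, 0.2]),
    "car":       np.array([0.0, 1.0]),
    "submarine": np.array([0.2, 0.9]),
}
print(most_similar("crocodile", vectors))
```

With only seven training sentences, the neighbours listed for `crocodile` above (`candle`, `drive`, and so on) are unlikely to reflect meaning; useful similarities need a much larger corpus.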

## Download Pretrained Gensim Vectors

The code below downloads and unzips the `GoogleNews-vectors-negative300` model to your computer. If you experience problems, download the file manually from [s3.amazonaws.com](https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz) and proceed to the next code snippet.

```python
import os
import subprocess

import gensim
from keras.utils import get_file

MODEL = 'GoogleNews-vectors-negative300.bin'
# get_file downloads the archive once and returns the path of the cached copy
path = get_file(MODEL + '.gz', 'https://s3.amazonaws.com/dl4j-distribution/%s.gz' % MODEL)
os.makedirs('generated', exist_ok=True)

unzipped = os.path.join('generated', MODEL)
if not os.path.isfile(unzipped):
    # Decompress with the external `zcat` tool (Unix-like systems only)
    with open(path, 'rb') as fin, open(unzipped, 'wb') as fout:
        zcat = subprocess.Popen(['zcat'], stdin=fin, stdout=fout)
        zcat.wait()

# Load the binary word2vec file as keyed vectors
model = gensim.models.KeyedVectors.load_word2vec_format(unzipped, binary=True)
```

### Loading model from local folder

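```python
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
```

The model holds 3 million 300-dimensional vectors, so querying it needs several gigabytes of RAM. On a machine with too little memory the call fails:

```python
model.most_similar(positive=['fernando'])
```

```
MemoryError:
```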

## Vector Addition

```python
def A_is_to_B_as_C_is_to(a, b, c, topn=1):
    # Accept either a single word or a list of words for each argument
    a, b, c = map(lambda x: x if isinstance(x, list) else [x],
                  (a, b, c))
    # Words closest to (b + c - a)
    res = model.most_similar(positive=b + c,
                             negative=a,
                             topn=topn
                            )
    if len(res):
        if topn == 1:
            return res[0][0]
        return [x[0] for x in res]
    return None


for country in 'Italy', 'France', 'India', 'China':
    print('%s is the capital of %s' %
          (A_is_to_B_as_C_is_to('Germany', 'Berlin', country), country))
```

```
Rome is the capital of Italy
Paris is the capital of France
Delhi is the capital of India
Beijing is the capital of China
```
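
The arithmetic behind the analogy can be sketched without the pretrained model. The example below is an approximation of what `most_similar` does with `positive` and `negative` words, assuming hypothetical 3-dimensional vectors in which the first two axes encode the country and the third encodes "is a capital": it adds and subtracts unit-length vectors, then picks the remaining word with the highest cosine similarity.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def analogy(a, b, c, vectors):
    """Return the word closest to b - a + c, excluding the three input words."""
    target = normalize(vectors[b]) - normalize(vectors[a]) + normalize(vectors[c])
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = float(np.dot(normalize(vec), normalize(target)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Hypothetical vectors: axes 0 and 1 encode the country, axis 2 encodes "is a capital".
vectors = {
    "Germany": np.array([1.0, 0.0, 0.0]),
    "Berlin":  np.array([1.0, 0.0, 1.0]),
    "Italy":   np.array([0.0, 1.0, 0.0]),
    "Rome":    np.array([0.0, 1.0, 1.0]),
    "France":  np.array([0.7, 0.7, 0.0]),
    "Paris":   np.array([0.7, 0.7, 1.0]),
}
print(analogy("Germany", "Berlin", "Italy", vectors))
```

Subtracting `Germany` from `Berlin` leaves mostly the "capital" direction; adding it to a country vector points toward that country's capital.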
