### Word to Vector model
- Using pre-trained word vectors



### Word2Vec Model
- Word2Vec: Google's pre-trained model
- Trained on roughly 100 billion words from Google News; contains vector representations of about 3 million words and phrases
- Words that appear in similar contexts have similar vectors
- The distance/similarity between two words can be measured using cosine distance
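
As a minimal sketch of the cosine measure mentioned above, the snippet below computes it by hand with NumPy. The helper name `cosine_sim` and the three toy vectors are made up for illustration; they are not real word embeddings.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors, made up for illustration (not real embeddings)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different length
c = [0.0, 0.0, 3.0]   # orthogonal to a

sim_ab = cosine_sim(a, b)   # parallel vectors give similarity 1
sim_ac = cosine_sim(a, c)   # orthogonal vectors give similarity 0
dist_ab = 1.0 - sim_ab      # cosine distance = 1 - cosine similarity
```

Because the measure depends only on direction, `a` and `b` count as identical even though their lengths differ.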


### Application
- Text similarity
- Language translation
- Finding odd words
- Word analogies

### Word embeddings
- Word embeddings are numerical representations of words, in the form of vectors
- The Word2Vec model represents each word as a 300-dimensional vector
- In this tutorial we are going to use a pre-trained word2vec model
- The model size is around 1.5 GB
- We will work with gensim, which is a popular NLP package


Gensim's word2vec model provides an optimized implementation of
- 1) ***CBOW*** (Continuous Bag of Words) model
- 2) ***SkipGram*** model
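
To make the difference between the two architectures concrete, here is a rough sketch of the training examples each one would draw from a sentence. It is a simplification: the helper names `cbow_examples` and `skipgram_examples` are hypothetical, and real implementations add details such as subsampling and negative sampling that are left out here. In gensim, the `sg` parameter of `Word2Vec` selects between the two (`sg=0` for CBOW, `sg=1` for skip-gram).

```python
def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from its surrounding context words."""
    examples = []
    for i, center in enumerate(tokens):
        # Context = up to `window` words on each side of position i
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, center))
    return examples

def skipgram_examples(tokens, window=2):
    """Skip-gram: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "cat", "sat", "down"]   # toy sentence for illustration
cbow = cbow_examples(sentence, window=1)
skip = skipgram_examples(sentence, window=1)
```

With `window=1`, CBOW yields one example per position (for instance `(["the", "sat"], "cat")`), while skip-gram yields one pair per (center, neighbor) combination.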


Paper 1 -->

Paper 2 -->

### Word2Vec model using gensim
-

### Code
#### Load Word2Vec model
***KeyedVectors*** - This object essentially contains the mapping between words and embeddings.
After training, it can be used directly to query those embeddings in various ways.

In [16]:
import numpy as np
import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [17]:
word_vectors = KeyedVectors.load_word2vec_format(fname='GoogleNews-vectors-negative300.bin',binary=True)
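
Loading the full file takes a lot of memory and makes every full-vocabulary scan slow. One possible workaround, sketched below, is the `limit` argument of `load_word2vec_format`, which reads only the first N vectors from the file. The wrapper name `load_vectors` and the value 500,000 are arbitrary choices for illustration, and the sketch assumes the file stores the most frequent words first, as word2vec-format files typically do.

```python
def load_vectors(path="GoogleNews-vectors-negative300.bin", limit=500_000):
    """Load only the first `limit` vectors from a word2vec binary file."""
    # Imported lazily so that defining this helper does not itself require gensim
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)

# Usage (requires gensim and the model file on disk):
# word_vectors = load_vectors()
```

Words outside the first `limit` entries are then missing from the vocabulary, so rare words may raise a `KeyError`.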

In [18]:
v_apple = word_vectors['apple']
v_mango = word_vectors['mango']

In [19]:
# The vocabulary is case-sensitive: 'Apple' and 'apple' map to different vectors
cosine_similarity([v_apple],[v_mango])

array([[0.57518554]], dtype=float32)

#### Question - Answering - Find the odd one out

In [20]:
input_1 = ['apple','mango','juice','party','orange']
input_2 = ['music','dance','sleep','dancer','food']
input_3 = ['match','player','football','cricket',"dancer"]
input_4 = ['india','paris','russia','france','germany']

In [21]:
def odd_one_out(words):
    """Accepts a list of words and returns the odd word"""

    # Look up the embedding of every word in the list
    all_word_vectors = [word_vectors[w] for w in words]
    print(len(all_word_vectors))

    avg_vector = np.mean(all_word_vectors, axis=0)  # Shape (300,)

    # The odd word is the one least similar to the average vector
    odd_word = None
    min_sim = float("inf")  # any real similarity will be lower than this
    for w in words:
        # cosine_similarity returns a (1, 1) array; extract the scalar
        sim = float(cosine_similarity([word_vectors[w]], [avg_vector])[0][0])
        if sim < min_sim:
            min_sim = sim
            odd_word = w

        print("Similarity btw %s and avg vector is %0.2f" % (w, sim))
    return odd_word

In [22]:
odd_one_out(input_1)

5
Similarity btw apple and avg vector is 0.78
Similarity btw mango and avg vector is 0.76
Similarity btw juice and avg vector is 0.71
Similarity btw party and avg vector is 0.36
Similarity btw orange and avg vector is 0.65


'party'

In [None]:
filename ='/odd/test.csv'
with open(filename,'r') as f:
    text = f.read()

## 2. Word Analogies Task
- In the word analogy task, we complete the sentence "a is to b as c is to __". An example is "man is to woman as king is
to queen". In detail, we are trying to find a word d such that the associated word vectors $e_a$, $e_b$, $e_c$, $e_d$ are related in
the following manner: $e_b - e_a \approx e_d - e_c$. We will measure the similarity between $e_b - e_a$ and $e_d - e_c$ using cosine similarity.

In [None]:
# word_vectors.vocab

In [24]:
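def predict_words(a, b, c, word_vectors):
    """Accepts a triad of words a, b, c and returns d such that a is to b as c is to d"""
    a, b, c = a.lower(), b.lower(), c.lower()
    # The cosine similarity between (b - a) and (d - c) should be maximal
    max_sim = -100
    d = None
    # Note: in gensim >= 4.0 the vocabulary is exposed as word_vectors.key_to_index
    words = word_vectors.vocab.keys()
    wa, wb, wc = word_vectors[a], word_vectors[b], word_vectors[c]

    # This Python-level loop visits every word in the vocabulary,
    # so it is very slow on the full pre-trained model
    for w in words:
        if w in [a, b, c]:
            continue

        wv = word_vectors[w]
        # cosine_similarity returns a (1, 1) array; extract the scalar
        sim = cosine_similarity([wb - wa], [wv - wc])[0][0]

        if sim > max_sim:
            max_sim = sim
            d = w

    return d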

In [25]:
triad_2 = ("man","woman","prince")
predict_words(*(triad_2),word_vectors)

KeyboardInterrupt: 
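
The interrupt is not surprising: `predict_words` calls `cosine_similarity` once per vocabulary entry inside a Python loop. A possible alternative, sketched below under stated assumptions, applies the same scoring, the cosine between $e_b - e_a$ and $e_d - e_c$, to all candidates at once with NumPy. The function name `predict_word_vectorized` and the 2-D toy embeddings are made up for illustration; with the real model, the matrix would be the full array of word vectors.

```python
import numpy as np

def predict_word_vectorized(a, b, c, vocab, matrix):
    """Return the word d maximizing cos(e_b - e_a, e_d - e_c) over all rows of `matrix`."""
    index = {w: i for i, w in enumerate(vocab)}
    ea, eb, ec = matrix[index[a]], matrix[index[b]], matrix[index[c]]

    target = eb - ea            # shape (dim,)
    diffs = matrix - ec         # e_d - e_c for every candidate d, shape (n, dim)

    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    norms[norms == 0] = np.inf  # a zero-length difference gets similarity 0
    sims = diffs @ target / norms

    for w in (a, b, c):         # exclude the query words themselves
        sims[index[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

# Toy 2-D embeddings, made up for illustration: axis 0 ~ royalty, axis 1 ~ gender
vocab = ["man", "woman", "king", "queen", "apple"]
matrix = np.array([
    [0.0, 1.0],    # man
    [0.0, -1.0],   # woman
    [1.0, 1.0],    # king
    [1.0, -1.0],   # queen
    [-1.0, 0.0],   # apple
])

result = predict_word_vectorized("man", "woman", "king", vocab, matrix)
```

On this toy data the result is `"queen"`, since `queen - king` points in exactly the same direction as `woman - man`.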

## Using the most_similar method


In [26]:
word_vectors.most_similar(positive=['woman','king'],negative=['man'],topn=1)

KeyboardInterrupt: 
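
For intuition, the additive scoring behind a call like `most_similar(positive=[...], negative=[...])` can be approximated as follows: normalize every vector to unit length, add the positive vectors, subtract the negative ones, and rank all words by cosine similarity to the result. The sketch below is a rough approximation of that idea on made-up 2-D embeddings, not gensim's actual implementation; the name `most_similar_sketch` is hypothetical.

```python
import numpy as np

def most_similar_sketch(positive, negative, vocab, matrix, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    index = {w: i for i, w in enumerate(vocab)}
    # Scale every row to unit length so dot products equal cosine similarities
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    query = np.zeros(matrix.shape[1])
    for w in positive:
        query += unit[index[w]]
    for w in negative:
        query -= unit[index[w]]
    query /= np.linalg.norm(query)

    sims = unit @ query
    for w in list(positive) + list(negative):   # never return an input word
        sims[index[w]] = -np.inf
    best = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in best]

# Toy 2-D embeddings, made up for illustration: axis 0 ~ royalty, axis 1 ~ gender
vocab = ["man", "woman", "king", "queen", "apple"]
matrix = np.array([
    [0.0, 1.0],    # man
    [0.0, -1.0],   # woman
    [1.0, 1.0],    # king
    [1.0, -1.0],   # queen
    [-1.0, 0.0],   # apple
])

top = most_similar_sketch(["woman", "king"], ["man"], vocab, matrix, topn=1)
```

Unlike `predict_words`, which compares the two difference vectors, this variant compares each candidate directly with `woman + king - man`, so the whole search reduces to a single matrix-vector product.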