
## Word2Vec Model
- Google's pretrained Word2Vec model
- Contains vector representations of 3 million words and phrases, trained on roughly 100 billion words of Google News text

- Words that appear in similar contexts have similar vectors
- The similarity between two words can be measured with cosine similarity (cosine distance is 1 minus the similarity)
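
As a minimal sketch of the cosine measure mentioned above, the snippet below computes it for made-up 3-dimensional vectors. The vectors and the helper name `cosine_sim` are illustrative assumptions, not values or functions from the pretrained model, which uses 300 dimensions.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only
apple = [0.9, 0.8, 0.1]
mango = [0.8, 0.9, 0.2]
party = [0.1, 0.2, 0.9]

sim_fruit = cosine_sim(apple, mango)   # vectors pointing in similar directions
sim_other = cosine_sim(apple, party)   # vectors pointing in different directions
print(round(sim_fruit, 3), round(sim_other, 3))

# Cosine distance is simply 1 - cosine similarity
print(round(1 - sim_fruit, 3))
```

Because the measure depends only on the angle between the vectors, scaling a vector does not change its similarity to any other vector.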


### Applications
- Text Similarity
- Language Translation
- Finding Odd Words
- Word Analogies


### Word Embeddings
- Word embeddings are numerical representations of words, in the form of vectors.

- The Word2Vec model represents each word as a 300-dimensional vector.

- In this tutorial we will see how to use the pre-trained Word2Vec model.
- The model download is around 1.5 GB (compressed).
- We will use Gensim, a popular NLP package.


Gensim's Word2Vec model provides optimized implementations of:

1) **CBOW** (Continuous Bag of Words) model

2) **Skip-gram** model


Paper 1: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)


Paper 2: [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

### Word2Vec using Gensim
Link: https://radimrehurek.com/gensim/models/word2vec.html

### Code

### Load Word2Vec Model

**KeyedVectors** - This object holds the mapping between words and their embeddings. Once loaded (or after training), it can be used directly to query those embeddings in various ways.

In [1]:
import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
word_vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)


In [3]:
vector_apple = word_vectors['apple']
vector_mango = word_vectors['mango']
print(vector_apple.shape, vector_mango.shape)


(300,) (300,)


In [4]:
cosine_similarity([vector_apple],[vector_mango])

array([[0.57518554]], dtype=float32)

## 1. Find the Odd One Out


In [5]:
input_1 = ["apple","mango","juice","party","orange"] 
input_2 = ["music","dance","sleep","dancer","food"]        
input_3  = ["match","player","football","cricket","dancer"]
input_4 = ["india","paris","russia","france","germany"]

In [6]:
import numpy as np


def odd_one_out(words):
    """Accepts a list of words and returns the odd word.

    The lower a word's cosine similarity to the average vector,
    the more likely it is to be the odd one out.
    """
    # Look up the embedding of every word in the list
    all_word_vectors = [word_vectors[w] for w in words]
    avg_vector = np.mean(all_word_vectors, axis=0)
    print(avg_vector.shape)

    # Compare every word with the average vector and keep the least similar one
    odd_word = None
    min_similarity = 1.0
    for w in words:
        # cosine_similarity returns a 1x1 array here, so extract the scalar
        sim = cosine_similarity([word_vectors[w]], [avg_vector])[0][0]
        if sim < min_similarity:
            min_similarity = sim
            odd_word = w
        print("Similarity btw %s and avg vector is %.2f" % (w, sim))
    return odd_word


In [7]:
odd_one_out(input_4)

(300,)
Similarity btw india and avg vector is 0.81
Similarity btw paris and avg vector is 0.75
Similarity btw russia and avg vector is 0.79
Similarity btw france and avg vector is 0.81
Similarity btw germany and avg vector is 0.84


'paris'

## 2. Word Analogies Task

In the word analogy task, we complete the sentence "a is to b as c is to __". An example is "man is to woman as king is to queen". More precisely, we look for a word d such that the associated word vectors `e_a, e_b, e_c, e_d` satisfy `e_b - e_a ≈ e_d - e_c`. We measure how well this holds with the cosine similarity between `e_b - e_a` and `e_d - e_c`.

![Word2Vec](word2vec.png)

`man -> woman :: prince -> princess`  
`italy -> italian :: spain -> spanish`  
`india -> delhi :: japan -> tokyo`  
`man -> woman :: boy -> girl`  
`small -> smaller :: large -> larger`  
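
To make the offset idea concrete before running it on the real model, here is a minimal sketch on hand-made 3-dimensional vectors. The embeddings in `toy` and the helper names `cosine_sim`, `sub`, and `complete_analogy` are assumptions made up for this illustration. In the toy vectors the second component loosely plays the role of "gender" and the third of "royalty".

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def sub(u, v):
    """Element-wise difference u - v."""
    return [x - y for x, y in zip(u, v)]


# Toy embeddings, invented for illustration only
toy = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.2, 0.1, 0.9],
}


def complete_analogy(a, b, c, vectors):
    """Return the word d that maximizes cos(e_b - e_a, e_d - e_c)."""
    offset = sub(vectors[b], vectors[a])
    best_word, best_sim = None, -2.0  # cosine is never below -1
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the answer must be a new word
        sim = cosine_sim(offset, sub(vec, vectors[c]))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


answer = complete_analogy("man", "woman", "king", toy)
print(answer)  # → queen
```

Here `woman - man` and `queen - king` are the same offset, so their cosine similarity is exactly 1 and `queen` wins over the unrelated `apple`.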

#### Try it out 


`man -> coder :: woman -> ______?`


In [8]:
type(word_vectors.key_to_index)

dict

In [9]:
def predict_word(a, b, c, word_vectors):
    """Accepts a triad of words, a,b,c and returns d such that a is to b : c is to d"""
    a, b, c = a.lower(), b.lower(), c.lower()

    # Find the word d whose offset (d - c) is most similar to the offset (b - a)
    max_similarity = -100
    d = None
    words = word_vectors.key_to_index.keys()

    wa, wb, wc = word_vectors[a], word_vectors[b], word_vectors[c]
    # Brute-force search over the whole vocabulary (slow for a large vocabulary)

    for w in words:
        if w in [a, b, c]:
            continue

        wv = word_vectors[w]
        sim = cosine_similarity([wb-wa], [wv-wc])

        if sim > max_similarity:
            max_similarity = sim
            d = w
    return d


In [10]:
triad_2 = ("man","woman","prince")
predict_word(*triad_2,word_vectors)

'princess'

## Using the Most Similar Method

In [11]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.7118193507194519)]
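
`most_similar` does not compare offset vectors one word at a time. Roughly speaking, it combines the unit-normalized `positive` vectors and the negated `negative` vectors into a single query vector and ranks the remaining vocabulary by cosine similarity to it. The sketch below illustrates that idea on toy vectors. It is an approximation written for this tutorial, not gensim's actual code, and the names `unit`, `most_similar_sketch`, and `toy` are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar_sketch(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)

    # The input words themselves are excluded from the results
    exclude = set(positive) | set(negative)
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy embeddings, invented for illustration only
toy = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.2, 0.1, 0.9],
}

result = most_similar_sketch(toy, positive=["woman", "king"], negative=["man"], topn=1)
print(result[0][0])  # → queen
```

Since the query vector is built once and then compared against every word, this formulation lends itself to a single matrix-vector product, which is one reason the built-in method is much faster than the explicit Python loop in `predict_word`.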

## 3. Training Your Own Word2Vec Model

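A Word2Vec model can learn embeddings from any text corpus, using either of two architectures:
- Continuous Bag of Words (CBOW) model
- Skip-gram model

The algorithm slides a window over the text: for each target word (Y) it collects the surrounding context words (X), and the model is trained on the resulting (X, Y) pairs in a supervised manner, with the labels coming from the text itself. CBOW predicts the target word from its context, while skip-gram predicts the context words from the target. The algorithm was developed by Tomas Mikolov and colleagues at Google.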
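
The windowing step described above can be sketched in a few lines. This is a simplified illustration, not what gensim does internally (the real implementation adds details such as subsampling and negative sampling), and the function name `context_target_pairs` is made up for this example.

```python
def context_target_pairs(tokens, window=2):
    """Return (context, target) pairs for every word within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((tokens[j], target))
    return pairs


sentence = ["deepika", "ranveer", "wedding", "style"]
pairs = context_target_pairs(sentence, window=1)
for x, y in pairs:
    print(x, "->", y)
```

With `window=1` each target word is paired only with its immediate neighbours; a larger window, such as the `window=10` used later in this tutorial, pairs it with words further away in the sentence.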

## Data Preparation



- Each sentence must be tokenized into a list of words.

- The sentences can be loaded into memory all at once, or we can build a data pipeline that feeds data to the model iteratively.
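
For the second option, a minimal sketch of such a pipeline is shown below. It assumes a plain-text file with one sentence per line and uses simple whitespace splitting instead of NLTK; the class name `SentenceStream` and the demo file are hypothetical. The object is a restartable iterable rather than a one-shot generator, because training makes several passes over the corpus.

```python
import os
import tempfile


class SentenceStream:
    """Restartable iterable that yields one tokenized sentence per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the stream can be iterated repeatedly
        with open(self.path, "r", encoding="utf8") as f:
            for line in f:
                tokens = line.lower().split()
                if tokens:  # skip empty lines
                    yield tokens


# Demo on a small temporary file (contents invented for illustration)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf8") as tmp:
    tmp.write("Deepika and Ranveer got married\n\nPriyanka married Nick\n")
    demo_path = tmp.name

stream = SentenceStream(demo_path)
first_pass = list(stream)
second_pass = list(stream)  # works again because __iter__ reopens the file
os.remove(demo_path)

print(first_pass)
```

An object like this could be passed to `Word2Vec` in place of an in-memory list of sentences, so that only one line is held in memory at a time.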


In [12]:
import nltk
from nltk.corpus import stopwords

In [13]:
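stop_word = set(stopwords.words('english'))


def readFile(file):
    # Read the whole file; the context manager closes it afterwards
    with open(file, 'r', encoding='utf8') as f:
        text = f.read()
    # Tokenization - sentences and words
    sentences = nltk.sent_tokenize(text)
    print(len(sentences))
    data = []
    for sent in sentences:
        words = nltk.word_tokenize(sent)
        # Drop short tokens and stop words, then lowercase.
        # The stop-word check runs before lowercasing, so capitalized
        # stop words such as 'The' are kept (see 'the' in the output).
        words = [w.lower() for w in words if len(w) > 2 and w not in stop_word]
        data.append(words)

    return data


text = readFile('bollywood.txt')
print(text)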

18
[['deepika', 'padukone', 'ranveer', 'singh', 'wedding', 'one', 'biggest', 'bollywood', 'events', 'happened', '2018'], ['the', 'deepika', 'ranveer', 'celebrations', 'hooked', 'phones', 'waiting', 'come', 'also', 'gave', 'enough', 'reason', 'believe', 'stylish', 'two', 'couple'], ['from', 'airport', 'looks', 'reception', 'parties', 'everything', 'entire', 'timeline', 'deepika', 'ranveer', 'wedding', 'style', 'file'], ['not', 'ambanis', 'deepika', 'ranveer', 'priyanka', 'nick'], ['man', 'proves', 'wedding', 'the', 'year', 'this', 'year', 'year', 'big', 'fat', 'lavish', 'extravagant', 'weddings'], ['from', 'isha', 'ambani', 'anand', 'piramal', 'deepika', 'padukone', 'ranveer', 'singh', 'priyanka', 'chopra', 'nick', 'jonas', 'kapil', 'sharma', 'ginni', 'chatrath', '2018', 'saw', 'many', 'grand', 'weddings'], ['but', 'nothing', 'beats', 'man', 'wedding', 'the', 'year', 'award', 'social', 'media'], ['priyanka', 'also', 'shared', 'video', 'featuring', 'nick', 'jonaswas', 'also', 'celebratin

## Create Model

In [14]:
from gensim.models import Word2Vec
model=Word2Vec(text,vector_size=300,window=10,min_count=1)
print(model)

Word2Vec(vocab=116, vector_size=300, alpha=0.025)


In [15]:
words = list(model.wv.key_to_index.keys())
print(words)

['year', 'priyanka', 'nick', 'deepika', 'ranveer', 'wedding', 'the', 'chopra', 'sharma', 'ginni', 'weddings', 'jonas', 'kapil', 'chatrath', 'anand', '2018', 'isha', 'ambani', 'piramal', 'saw', 'from', 'new', 'also', 'man', 'singh', 'padukone', 'virat', 'many', 'grand', 'but', 'nothing', 'beats', 'award', 'media', 'shared', 'anushka', 'style', 'couple', 'two', 'social', 'big', 'fat', 'celebrations', 'this', 'december', 'married', 'friends', 'lavish', 'extravagant', 'one', 'entire', 'parties', 'everything', 'timeline', 'file', 'not', 'ambanis', 'pink', 'events', 'happened', 'reception', 'bollywood', 'squad', 'hooked', 'phones', 'waiting', 'come', 'biggest', 'gave', 'enough', 'reason', 'believe', 'stylish', 'attire', 'dress', 'looks', 'airport', 'side', 'morning', 'jaggo', 'celebration', 'verbier', 'switzerland', 'three', 'receptions', 'delhi', 'mumbai', 'night', 'proves', 'made', 'even', 'special', 'industry', 'long', 'time', 'there', 'glimpses', 'outstanding', 'pictures', 'london', 'ran

In [22]:
actors = ["ranveer","deepika","padukone","singh","nick","jonas","chopra","priyanka","virat","anushka","ginni"]


def predict_actor(a,b,c,word_vectors):
    """Accepts a triad of words, a,b,c and returns d such that a is to b : c is to d"""
    a, b, c = a.lower(), b.lower(), c.lower()
    max_similarity = -100

    d = None
    # Search only among the actor names instead of the whole vocabulary
    words = actors

    wa, wb, wc = word_vectors[a], word_vectors[b], word_vectors[c]

    # Find the d whose offset (d - c) is most similar to the offset (b - a)
    
    for w in words:
        if w in [a,b,c]:
            continue
        
        wv = word_vectors[w]
        sim = cosine_similarity([wb-wa],[wv-wc])
        
        if sim > max_similarity:
            max_similarity = sim
            d = w
    return d

## 4. Test Your Model

In [23]:
triad = ("nick","priyanka","virat")
predict_actor(*triad,model.wv)

'padukone'

In [24]:

triad = ("ranveer","deepika","priyanka")
predict_actor(*triad,model.wv)

'jonas'

In [25]:
triad = ("priyanka","jonas","nick")
predict_actor(*triad,model.wv)

'deepika'
