<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="360" height="160" /></center>


# Word2Vec 

---

## Table of Contents

1. [Word2Vec Architecture](#section1)<br>
  1.1. [Continuous Bag of Words](#section2)<br>
  1.2. [Skip Gram](#section3)<br>
2. [Generating Word Vectors using Word2Vec](#section4)<br>
    - 2.1 [Importing necessary libraries](#section401)<br>
    - 2.2 [Ignoring warning messages](#section402)<br>
    - 2.3 [Importing the data file](#section403)<br>
    - 2.4 [Iterating through each sentence in the file](#section404)<br>
    - 2.5 [Creating a CBOW model](#section405)<br>
    - 2.6 [Creating a Skip Gram model](#section406)<br>
    - 2.7 [Output](#section407)<br>
3. [Applications](#section408)<br>




### **Word2Vec** is an algorithm for constructing vector representations of words, also known as word embeddings.

<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/22344.PNG"/></center>

**Word2Vec** takes care of 2 things:

1. It converts a high-dimensional word vector into a low-dimensional one.
2. It preserves the word's context, and therefore its meaning.
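
To make the first point concrete, here is a minimal sketch contrasting a high-dimensional one-hot vector with a low-dimensional dense vector. It assumes the "high dimensional vector" is a one-hot encoding over the vocabulary; the toy vocabulary, the 3-dimensional size and the embedding values are made up for illustration and are not learned by any model.

```python
# Toy vocabulary (illustrative only).
vocab = ["alice", "rabbit", "queen", "tea", "hatter"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with one dimension per vocabulary word."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec

# Hypothetical embedding table: one short dense row per word.
# Real values would be learned during training; these are invented.
embeddings = [
    [0.21, -0.40, 0.73],
    [0.19, -0.35, 0.70],
    [-0.62, 0.11, 0.05],
    [0.04, 0.88, -0.12],
    [0.10, 0.80, -0.20],
]

def embed(word):
    """Look up the dense vector for a word."""
    return embeddings[word_to_index[word]]

# The one-hot vector grows with the vocabulary; the dense vector does not.
print(len(one_hot("alice")), one_hot("alice"))
print(len(embed("alice")), embed("alice"))
```

With a real vocabulary of tens of thousands of words, the one-hot vector has tens of thousands of dimensions, while the dense vector keeps whatever fixed size is chosen (100 in the models trained later in this notebook).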


<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/meaning.JPG"/></center>

**Word2Vec is one of the most widely used models for producing *word embeddings*. The models are shallow, two-layer neural networks that are trained to reconstruct the linguistic context of words**.


**Layers --> Input layer --> Hidden layer --> Output layer**

<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/hidden.PNG"/></center>
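
The flow through the input, hidden and output layers can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the weights are random rather than trained, the sizes are tiny, and a full softmax is used at the output, whereas practical implementations usually rely on approximations such as negative sampling. The names `w_in`, `w_out` and `forward` are hypothetical.

```python
import math
import random

random.seed(0)
vocab_size, hidden_size = 5, 3

# Two weight matrices: input -> hidden and hidden -> output.
w_in = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        for _ in range(vocab_size)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(vocab_size)]
         for _ in range(hidden_size)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(word_index):
    # Multiplying a one-hot input by w_in simply selects one row,
    # so the hidden layer is that word's vector.
    hidden = w_in[word_index]
    scores = [sum(hidden[k] * w_out[k][j] for k in range(hidden_size))
              for j in range(vocab_size)]
    return hidden, softmax(scores)

hidden, probs = forward(2)
print("hidden layer:", hidden)
print("output probabilities:", probs)
```

After training, the rows of the input-to-hidden matrix are what get used as the word vectors.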

# Word2Vec has 2 architectures:

## 1. CBOW (Continuous Bag of Words):

CBOW learns to predict the target word from its context.

Input  --> the context (neighboring words)

Output --> target word

The limit on the number of words in each context is determined by a parameter called **“window size”**.

<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/cbow.png"/></center>

**In the above example-**

**INPUT Layer  : White box content**

**TARGET Layer : Blue box content**

**Window Size  : 5**
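
As a rough sketch of how CBOW training examples could be built, the function below pairs each target word with its neighbouring words. It assumes the window is counted as the maximum number of words taken on each side of the target, which is how gensim defines its `window` parameter; the function name `cbow_pairs` and the sample sentence are hypothetical.

```python
def cbow_pairs(tokens, window=2):
    """Build (context words, target word) pairs for one sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

sentence = ["alice", "was", "beginning", "to", "get", "very", "tired"]
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
```

Words near the start or end of a sentence simply get a shorter context.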

____

## 2. SKIP Gram

Skip Gram learns to predict the context from the word.

Input  --> Word

Output --> Target Context (neighboring words)

The limit on the number of words in each context is determined by a parameter called **“window size”.**

![](https://cdn-images-1.medium.com/max/1200/1*5ugorDZ6nOgSqQq1dirY8Q.png)

**In the above example-**

**INPUT Layer  : Blue box content**

**TARGET Layer : White box content**

**Window Size  : 5**
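
Skip Gram reverses the direction: each input word is paired with every word in its window. The sketch below is one possible way to enumerate those pairs, again assuming the window counts words on each side of the input word; `skipgram_pairs` is a hypothetical helper, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Build (input word, context word) pairs for one sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["alice", "was", "beginning", "to", "get", "very", "tired"]
for center, context_word in skipgram_pairs(sentence, window=2):
    print(center, "->", context_word)
```

One sentence therefore produces many more training pairs under Skip Gram than under CBOW, because every context word becomes a separate prediction.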

___

# A simple example in Python to generate word vectors using Word2Vec

<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/python.PNG"/></center>

Run these 2 commands in the terminal:

```shell
pip install nltk
pip install gensim
```

## 2.1 Importing necessary libraries

```python
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
import gensim
from gensim.models import Word2Vec
import nltk

# Download the Punkt tokenizer data used by sent_tokenize
nltk.download('punkt')
```

```
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
```

## 2.2 Ignoring warning messages

<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/errors.JPG" height = "300"/></center>

```python
warnings.filterwarnings(action = 'ignore')
```

## 2.3 Importing the data file (Alice.txt)


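```python
# Import the submodule explicitly; a bare "import urllib" does not
# guarantee that urllib.request is available.
import urllib.request

sample = urllib.request.urlopen('https://raw.githubusercontent.com/insaid2018/DeepLearning/master/Data/Alice.txt')
s = sample.read().decode('utf8')

# Replace line breaks with spaces so sentences are not split mid-line
f = s.replace("\n", " ")
```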

## 2.4 Iterating through each sentence in the file

```python
data = []

# split the text into sentences
for i in sent_tokenize(f):
    temp = []

    # tokenize the sentence into lowercase words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)
```

## 2.5 Creating a CBOW model

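```python
# CBOW is the default training algorithm (sg = 0).
# Note: gensim >= 4.0 names the dimensionality parameter vector_size
# (it was called size in gensim 3.x).
model1 = gensim.models.Word2Vec(data, min_count = 1,
                                vector_size = 100, window = 5)

# Print results (similarity is a method of the word vectors, model1.wv)
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
      "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))
```

```
Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.9995325
Cosine similarity between 'alice' and 'machines' - CBOW :  0.95275563
```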


## 2.6 Creating a Skip Gram model

```python
# sg = 1 selects the Skip Gram training algorithm.
# Note: gensim >= 4.0 names the dimensionality parameter vector_size
# (it was called size in gensim 3.x).
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100,
                                window = 5, sg = 1)

# Print results (similarity is a method of the word vectors, model2.wv)
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
      "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))
```

```
Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.89742833
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.86560345
```


## 2.7 Output 

**Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.99913293**

**Cosine similarity between 'alice' and 'machines' - CBOW :  0.98022455**

**Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.89401996**

**Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.85758847**

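The output shows the cosine similarities between the word vectors for **‘alice’, ‘wonderland’** and **‘machines’** under each model. In both models, 'alice' is closer to 'wonderland' than to 'machines'. Word2Vec training is not fully deterministic by default, so the exact values differ slightly between runs, which is why the figures listed here do not match the cell outputs in sections 2.5 and 2.6 digit for digit.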
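
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on small made-up vectors; `cosine_similarity` here is a hypothetical helper written for illustration, not the gensim method used above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score close to 1,
# orthogonal vectors score 0, opposite vectors score close to -1.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Because the measure depends only on direction and not on length, two word vectors can be highly similar even when their magnitudes differ.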

## 3. Applications of Word Embeddings:

1. Sentiment Analysis

2. Speech Recognition

3. Information Retrieval

4. Question Answering