<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

<center><h1>Introduction to Word2Vec</h1></center>

---
# **Table of Contents**
---

**1.** [**Word2Vec Architectures**](#Section1)<br>
 - **1.1** [**Continuous Bag of Words**](#section1.1)<br>
 - **1.2** [**Skip Gram**](#section1.2)

**2.** [**Generating Word Vectors using Word2Vec**](#Section2)<br>
  - **2.1** [**Importing necessary libraries**](#section201)
  - **2.2** [**Importing the data**](#section202)
  - **2.3** [**Iterating through each sentence in the file**](#section203)
  - **2.4** [**Creating a CBOW model**](#section204)
  - **2.5** [**Creating a Skip Gram model**](#section205)
  - **2.6** [**Output**](#section206)

**3.** [**Applications of Word Embeddings**](#Section3)<br>

---
<a name = Section1></a>
# **1. Word2Vec Architectures**
---

- **Word2Vec** is an algorithm for constructing **vector** representations of words, also known as **word embeddings**.


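- **Word2Vec** takes care of two things:

  - It converts a **high-dimensional**, sparse word representation (such as a one-hot vector) into a **low-dimensional**, dense vector.

  - It preserves the **word context**: words that appear in similar contexts end up with similar vectors.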



- **Word2Vec** is one of the most widely used **models** to produce **word embeddings**.

- The models are **shallow**, two-layer **neural** networks that are trained to reconstruct the **linguistic** **context** of words.


- **Layers: Input layer --> Hidden layer --> Output layer**


<br>  
<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/hidden.PNG" width='450' height='280'/></center>
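To make the input-to-hidden step concrete, here is a minimal sketch in plain Python. The toy vocabulary, the embedding size of 3, and the names `W_in`, `one_hot`, and `hidden_layer` are all assumptions chosen for illustration; a real Word2Vec model learns `W_in` during training rather than leaving it random.

```python
import random

# Hypothetical toy vocabulary and sizes, chosen only for illustration.
vocab = ["alice", "was", "beginning", "to", "get", "very", "tired"]
vocab_size = len(vocab)      # dimensionality of the one-hot input (high)
embedding_dim = 3            # dimensionality of the hidden layer (low)

random.seed(0)
# Input-to-hidden weight matrix: one row of length embedding_dim per word.
# Random here; in a trained model these rows are the word embeddings.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(embedding_dim)]
        for _ in range(vocab_size)]

def one_hot(word):
    """Return the sparse, vocab_size-dimensional one-hot vector for a word."""
    vec = [0.0] * vocab_size
    vec[vocab.index(word)] = 1.0
    return vec

def hidden_layer(x):
    """Multiply the one-hot input by W_in to get the dense hidden vector."""
    return [sum(x[i] * W_in[i][j] for i in range(vocab_size))
            for j in range(embedding_dim)]

x = one_hot("alice")
h = hidden_layer(x)

print(len(x), "->", len(h))                    # 7 -> 3
# Because x is one-hot, h is simply the row of W_in that belongs to "alice".
print(h == W_in[vocab.index("alice")])         # True
```

The multiplication is written out in full only to mirror the layer diagram; since the input is one-hot, it reduces to a row lookup in the weight matrix.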

<a id=section1.1></a>
### **1.1 CBOW (Continuous Bag of Words)**

- CBOW learns to **predict** a target word from its surrounding context.

- Input  --> the context (neighboring words)

- Output --> target word

The **limit** on the number of words in each context is **determined** by a parameter called **`window size`**.

<br>  

<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/cbow.png" width='600' height='380'/></center>
<br>  
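As a rough illustration of how the window turns a sentence into CBOW training examples, the sketch below builds (context, target) pairs in plain Python. The helper name `cbow_pairs`, the sample sentence, and the convention that `window` counts words on each side of the target are assumptions made for this example.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for CBOW-style training.

    `window` is assumed to be the maximum number of words taken
    from each side of the target word.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:                      # skip one-word sentences
            pairs.append((context, target))
    return pairs

sentence = "alice was beginning to get very tired".split()
for context, target in cbow_pairs(sentence, window=2)[:3]:
    print(context, "->", target)
# ['was', 'beginning'] -> alice
# ['alice', 'beginning', 'to'] -> was
# ['alice', 'was', 'to', 'get'] -> beginning
```

Words near the start or end of a sentence simply get a shorter context, which is why the first pair has only two context words.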

<a id=section1.2></a>
### **1.2 Skip Gram**

- Skip Gram learns to **predict** the surrounding **context** from a given word.

- Input  --> Word

- Output --> Target Context (neighboring words)

- The limit on the number of words in **each** context is **determined** by a **parameter** called **`window size`**.

<center><img src = "https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/SKIP_gram.png" width='650' height='380'/></center>
<br>  

- In the above example:

  - **INPUT Layer**  : Blue box content

  - **TARGET Layer** : White box content

  - **Window Size**  : 5
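Skip Gram reverses the direction of prediction, so each training example pairs one input word with one of its neighbours. The sketch below is a minimal, assumption-laden illustration: `skipgram_pairs` is a hypothetical helper, the sentence is made up, and `window` is taken to mean the number of words considered on each side of the input word.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) pairs for Skip Gram-style training.

    `window` is assumed to be the maximum distance, in words,
    between the input word and a context word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                   # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "alice was beginning to get very tired".split()
for center, context in skipgram_pairs(sentence, window=1)[:3]:
    print(center, "->", context)
# alice -> was
# was -> alice
# was -> beginning
```

Compared with CBOW, the same sentence yields more (but smaller) training examples, because every neighbour of a word becomes a separate prediction target.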


#### A simple example in Python to generate word vectors using Word2Vec

<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/python.PNG" width='580' height='300'/></center>

- Run these two commands in the terminal:

```bash
pip install nltk
pip install gensim
```

---
<a name = Section2></a>
# **2. Generating Word Vectors using Word2Vec**
---

<a id=section201></a>
### **2.1 Importing necessary libraries**

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize 
import warnings 
import gensim 
from gensim.models import Word2Vec
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
warnings.filterwarnings(action = 'ignore') # This is to ignore unnecessary warnings

<a id=section202></a>
### **2.2 Importing the data** 


In [None]:
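import urllib.request  # `import urllib` alone does not load the request submodule

# Download the text and flatten it into a single line
sample = urllib.request.urlopen('https://raw.githubusercontent.com/insaid2018/DeepLearning/master/Data/Alice.txt')
s = sample.read().decode('utf8')
f = s.replace("\n", " ")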

<a id=section203></a>
### **2.3 Iterating through each sentence in the file**

In [None]:
data = []

for i in sent_tokenize(f): 
    temp = [] 
      
    # tokenize the sentence into words 
    for j in word_tokenize(i): 
        temp.append(j.lower()) 
  
    data.append(temp) 


<a id=section204></a>
### **2.4 Creating a CBOW model**

In [None]:
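# sg = 0 selects CBOW (the default). gensim >= 4.0 names the embedding
# dimension `vector_size`; older releases called it `size`.
model1 = gensim.models.Word2Vec(data, min_count = 1,
                              vector_size = 100, window = 5, sg = 0)

# Print results (similarity is a method of the word vectors, model1.wv)
print("Cosine similarity between 'alice' " +
               "and 'wonderland' - CBOW : ",
    model1.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
                 "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))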

Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.9994259
Cosine similarity between 'alice' and 'machines' - CBOW :  0.9537079


<a id=section205></a>
### **2.5 Creating a Skip Gram model**

In [None]:
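# sg = 1 selects Skip Gram. gensim >= 4.0 names the embedding
# dimension `vector_size`; older releases called it `size`.
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100,
                                             window = 5, sg = 1)

# Print results (similarity is a method of the word vectors, model2.wv)
print("Cosine similarity between 'alice' " +
          "and 'wonderland' - Skip Gram : ",
    model2.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
            "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))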

Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.8895068
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.84547246


<a id=section206></a>
### **2.6 Output**

- Cosine similarity between 'alice' and 'wonderland' - CBOW :  **0.99913293**

- Cosine similarity between 'alice' and 'machines' - CBOW :  **0.98022455**

- Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  **0.89401996**

- Cosine similarity between 'alice' and 'machines' - Skip Gram :  **0.85758847**

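- The output shows the **cosine similarity** between the word vectors for **'alice'** and **'wonderland'**, and for **'alice'** and **'machines'**, under each model. Exact values change from run to run because Word2Vec training is randomly initialized, which is why these figures differ slightly from the cell outputs above.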
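For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite directions) to 1 (same direction). The sketch below computes it by hand in plain Python; `cosine_similarity` is a hypothetical helper and the two-dimensional vectors are made-up stand-ins for real word vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (both non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))    # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))   # -1.0 (opposite direction)
```

Because only the angle matters, scaling either vector leaves the score unchanged.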

---
<a name = Section3></a>
# **3. Applications of Word Embeddings**
---


- **Sentiment Analysis**

- **Speech Recognition**

- **Information Retrieval**

- **Question Answering**