<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

<center><h1>Using Pre-trained Word Embedding</h1></center>

<center><table style="border: 2px solid black; border-collapse: collapse">

<tr>
<td style="border-right: 2px solid black; border-bottom: 2px solid black"><img src="https://i.imgur.com/R8VLFs2.png" height="200" width="1050px"/></td>


<td style="border-right: 2px solid black; border-bottom: 2px solid black"><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/GloVe.png" height="200" width="1050px"/></td>

<td style="border-right: 2px solid black; border-bottom: 2px solid black"><img src="https://fasttext.cc/img/ogimage.png" height="200" width="1050px"/></td>
</tr>

</table></center>

---
# **Table of Contents**
---

**1.** [**What is Pre-trained Word Embedding?**](#Section1)<br>
**2.** [**Popular word embeddings**](#Section2)<br>
   - **2.1** [**GloVe**](#section201)
   - **2.2** [**FastText**](#section202)
   - **2.3** [**Why use pre-trained word embedding?**](#section203)

**3.** [**Training NLP model with pre-trained word embedding**](#Section3)<br>
  - **3.1** [**Download pretrained word embeddings**](#section301)
  - **3.2** [**Note**](#section302)
    - **3.2.1** [**Mount your gdrive to your colab notebook**](#section30201)
  - **3.3** [**Preprocessing the Embedding file**](#section303)
    - **3.3.1** [**Importing Libraries**](#section30301)
    - **3.3.2** [**Define the path of embedding file**](#section30302)
    - **3.3.3** [**Read the embedding file**](#section30303)
      - **3.3.3.1** [**Embedding examples**](#section3030301)

**4.** [**Processing the raw IMDB data**](#Section4)<br>
  - **4.1** [**Downloading the IMDB dataset**](#section401)
  - **4.2** [**Note**](#section402)
  - **4.3** [**Define the path for imdb data**](#section403)
  - **4.4** [**Reading the training file**](#section404)
  - **4.5** [**Tokenization**](#section406)
  - **4.6** [**Splits the data into a training set and a validation set**](#section407)
  - **4.7** [**Preparing the GloVe word-embeddings matrix**](#section408)

**5.** [**Preparing the model**](#Section5)<br>
  - **5.1** [**Defining a model**](#section501)
  - **5.2** [**What about the embedding matrix we created?**](#section502)
    - **5.2.1** [**Loading pretrained word embeddings into the Embedding layer**](#section50201)
  - **5.3** [**Training and Evaluating the model**](#section503)
  - **5.4** [**Plot the model’s performance over time**](#section504)
  - **5.5** [**Important Note**](#section505)

**6.** [**Conclusion**](#Section6)<br>

---
<a name = Section1></a>
# **1. What is Pre-trained Word Embedding?**
---

- Practitioners of deep learning for NLP typically **initialize** their models using **pre-trained word embeddings**, bringing in outside information, and **reducing** the number of **parameters** that a neural **network** needs to learn from scratch. 


---
<a name = Section2></a>
# **2. Popular word embeddings**
---

<a id="section201"></a>
### **2.1 GloVe**

- [**GloVe**](https://nlp.stanford.edu/projects/glove/) works similarly to **Word2Vec**.

- While Word2Vec is a **predictive** model that predicts the **context** given a word, GloVe learns by **constructing** a **co-occurrence matrix** (words × context) that **counts how frequently** a word appears in a context.

- Since it's going to be a **gigantic** matrix, we **factorize** this matrix to achieve a **lower-dimension** representation. 

- There are many more details to **GloVe**, but that's the **rough** idea.
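
The count-then-factorize idea can be sketched on a toy corpus. This is only an illustration under stated assumptions: the two sentences, the window size of 1, and the target dimension of 2 are made up, and a plain truncated SVD of the log counts stands in for GloVe's own weighted least-squares objective, which is not reproduced here.

```python
import numpy as np

# Toy corpus, made up for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
window = 1  # assumption: symmetric context window of one word on each side

vocab = sorted({w for sent in corpus for w in sent.split()})
word_to_id = {w: i for i, w in enumerate(vocab)}

# Build the (words x context) co-occurrence matrix by counting neighbours.
cooc = np.zeros((len(vocab), len(vocab)), dtype=np.float64)
for sent in corpus:
    tokens = sent.split()
    for pos, word in enumerate(tokens):
        start = max(0, pos - window)
        stop = min(len(tokens), pos + window + 1)
        for ctx_pos in range(start, stop):
            if ctx_pos != pos:
                cooc[word_to_id[word], word_to_id[tokens[ctx_pos]]] += 1

# Factorize log(1 + counts) with a truncated SVD.
# NOTE: this is a simple stand-in, not GloVe's actual training objective.
dim = 2
u, s, _ = np.linalg.svd(np.log1p(cooc))
vectors = u[:, :dim] * s[:dim]   # one low-dimensional vector per word

print(vectors.shape)             # (vocabulary size, dim)
```

Each row of `vectors` is a dense representation of one vocabulary word; words that share contexts (here `cat` and `dog`) have identical rows in the count matrix, so they end up with the same low-dimensional vector.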


<a id="section202"></a>
### **2.2 FastText**

- [FastText](https://fasttext.cc/docs/en/english-vectors.html) is quite **different** from the above **2 embeddings**. 

- While **Word2Vec** and **GloVe** treat each word as the smallest unit to train on, FastText uses **character n-grams** as the **smallest** unit.

- For example, the word **vector** for **`apple`** could be broken down into **separate** vector **units** for n-grams such as **`ap`, `app`, `ple`**. 

- The biggest **benefit** of using FastText is that it generates **better word embeddings** for **rare words**, or even for words not seen **during** training, because the **n-gram** character **vectors** are shared with other words.

- This is something that **Word2Vec** and **GLOVE** cannot achieve.
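
The n-gram splitting can be sketched with a small helper. `char_ngrams` is a hypothetical function written for this notebook, not part of the fastText library; it assumes the `<` and `>` word-boundary markers and the 3 to 6 character range described for fastText.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, using < and > as boundary markers."""
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams

# Trigrams only, to keep the output short.
print(char_ngrams("apple", n_min=3, n_max=3))
# ['<ap', 'app', 'ppl', 'ple', 'le>']

# n-grams shared between a word and an unseen variant of it.
shared = set(char_ngrams("apple")) & set(char_ngrams("apples"))
print(sorted(shared))
```

Because `apples` shares most of its n-grams with `apple`, a model that stores one vector per n-gram can still assemble a reasonable vector for a word it never saw during training.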

<a id="section203"></a>
### **2.3 Why use pre-trained word embedding?**

- Sometimes, you have so little **training** data available that you can't use your **data** alone to learn an appropriate **task-specific** embedding of your **vocabulary**. 

**What do you do then?**

- Instead of **learning** word embeddings jointly with the **problem** you want to solve, you can load **embedding** vectors from a **pre-computed** embedding space that you know is **highly structured** and **exhibits** useful properties, one that captures **generic** aspects of **language** structure.

---
<a name = Section3></a>
# **3. Training NLP model with pre-trained word embedding**
---

<a id="section301"></a>
### **3.1 Download pretrained word embeddings**

-  [Download](https://nlp.stanford.edu/projects/glove/) the **precomputed** GloVe **embeddings** trained on 2014 **English Wikipedia**.

    - **Wikipedia** 2014 + **Gigaword** 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): glove.6B.zip

-  [Download](https://fasttext.cc/docs/en/english-vectors.html) the precomputed fastText **embeddings** trained on 2017 **English Wikipedia**.

- In this **notebook** we are going to use the **GloVe embedding**.


<a id="section302"></a>
### **3.2 Note**

- The embedding file is large, and a Colab notebook cannot read files straight from your local machine. So first download the __word embedding__ to your system and then __upload__ it to your __gdrive__.

<a id="section30201"></a>
#### **3.2.1 Mount your gdrive to your colab notebook**

- Mounting the drive requires access permission from your Gmail account.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Observation:**

- Once the drive is mounted, find the __location__ where you stored your __Word Embedding__ file. Right-click on it and __copy the path__ of the file.

<a id="section303"></a>

### **3.3 Preprocessing the Embedding file**

- The __GloVe embedding__ is available with vectors of different __dimensions__, i.e. **50-D, 100-D, 200-D, 300-D**.


- Here we are using **50-D** because fewer **dimensions** require less model **training** time. 

<a id="section30301"></a>
#### **3.3.1 Importing Libraries**

In [None]:
# Import tensorflow 2.x
# This code block will only work in Google Colab.
try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

In [None]:
import numpy as np         # For performing mathematical operations 
import pandas as pd        # For data analysis 
import os
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

<a id="section30302"></a>
#### **3.3.2 Define the path of embedding file**


- **Download** the **`glove.6B.50d.txt`** from [here](https://www.kaggle.com/watts2/glove6b50dtxt?select=glove.6B.50d.txt)

**Important Note:**

Your path may be **different**, so use your own.


In [None]:
# Define the path of embeddings
glove_dir = "/content/drive/My Drive/Word embeddings/glove.6B.50d.txt"

<a id="section30303"></a>

#### **3.3.3 Read the embedding file**

In [None]:
embeddings_index = {}    # Dictionary mapping each word to its embedding vector.
# Each line of the file holds a word followed by the components of its vector.
with open(glove_dir, encoding='utf-8') as f:
  for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

<a id="section3030301"></a>
#### **3.3.3.1 Embedding examples**

- Let's look at a few embedding examples for different words.

- In the **embedding** file all words are stored in **lower case**, so make sure your word is in lower case before **testing**.


In [None]:
print("The 50 dimensional embedding vector for word hello is {}".format(embeddings_index['hello']))
print('-------------------'*10)

print("The 50 dimensional embedding vector for word data is {}".format(embeddings_index['data']))

print('-------------------'*10)

print("The 50 dimensional embedding vector for word science is {}".format(embeddings_index['science']))
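
Raw vectors are hard to read by eye, so a common way to inspect them is to compare two words by cosine similarity. The sketch below is self-contained and uses a made-up three-dimensional `toy_index` in place of `embeddings_index`; the words and numbers are purely illustrative, and `cosine_similarity` is a helper defined here rather than a library call.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for real GloVe entries.
toy_index = {
    "good":  np.array([0.9, 0.1, 0.3], dtype="float32"),
    "great": np.array([0.8, 0.2, 0.4], dtype="float32"),
    "table": np.array([-0.5, 0.9, -0.1], dtype="float32"),
}

sim_close = cosine_similarity(toy_index["good"], toy_index["great"])
sim_far = cosine_similarity(toy_index["good"], toy_index["table"])
print("good vs great:", sim_close)
print("good vs table:", sim_far)
```

The same function can be applied to any two entries of `embeddings_index` once the GloVe file has been read.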

---
<a name = Section4></a>
# **4. Processing the raw IMDB data**
---

<a id="section401"></a>
### **4.1 Downloading the IMDB dataset**

<br>

- [Download](http://mng.bz/0tIo) the IMDB dataset. 

<a id="section402"></a>
### **4.2 Note**

- As before, we first upload the **`movie_review`** data to gdrive and then read it. If the uploaded file is zipped, first unzip it and then read the file using its path in the drive.

-  Here we uploaded the file to the drive as a zip, which saves a lot of drive space and time.

In [None]:
# unzip the file

# This is a time-consuming process, so grab some coffee and wait.

! unzip "/content/drive/My Drive/Dataset/Movie_review.zip"

**Observation:**

- After **unzipping** we get two new files; we need only the one named **aclImdb**.

<a id="section403"></a>
### **4.3 Define the path for imdb data**

In [None]:
imdb_dir="/content/aclImdb"

<a id="section404"></a>
### **4.4 Reading the training file.**

The **Movie_review** data has train and test sets. Here we **read** only the train data and create a **separate** label for each **review**, i.e. whether it is __-ve__ (0) or __+ve__ (1).


In [None]:
train_dir = os.path.join(imdb_dir, 'train')
labels = []
texts = []
for label_type in ['neg', 'pos']:
  dir_name = os.path.join(train_dir, label_type)
  for fname in os.listdir(dir_name):
    if fname[-4:] == '.txt':
      # Open with an explicit encoding and let the with-block close the file
      with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
        texts.append(f.read())
      # assigning a label to the review
      if label_type == 'neg':
        labels.append(0)
      else:
        labels.append(1)

<a id="section406"></a>
### **4.5 Tokenization** 

- We use **`Tokenizer`** and **`pad_sequences`** from Keras to tokenize each review and then pad (or truncate) it so that all sequences have the **same length**.
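
To make the padding step concrete, here is a small NumPy sketch. `pad_pre` is a hypothetical helper written for illustration; it mimics what `pad_sequences` does with its default settings (zeros added at the front, and over-long sequences truncated from the front so the last `maxlen` tokens are kept). It is not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]              # drop tokens from the front if too long
        if kept:                          # an empty sequence stays all padding
            out[row, -len(kept):] = kept  # right-align the tokens
    return out

toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
# [[0 1 2 3]
#  [6 7 8 9]]
```

Note that with these defaults a long review keeps its final tokens, not its first ones.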


In [None]:
maxlen = 50                  # Caps each review at 50 tokens
training_samples = 200       # Trains on 200 samples

validation_samples = 10000   # Validates on 10,000 samples

max_words = 10000            # Considers only the top 10,000 words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)


word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

<a id="section407"></a>
### **4.6 Splits the data into a training set and a validation set**


- Before **splitting** the data, first **shuffle** it, because we started with data in which the samples are **ordered** (**all negative** first, then **all positive**). 
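
The key point is that the reviews and their labels must be reordered by the same permutation. A minimal sketch on made-up toy arrays (the names `toy_data`, `toy_labels` and the seed are assumptions for illustration, and a seeded generator is used here only so the example is repeatable):

```python
import numpy as np

rng = np.random.default_rng(0)                # seeded only for repeatability

toy_data = np.arange(10).reshape(10, 1)       # ordered samples 0..9
toy_labels = np.array([0] * 5 + [1] * 5)      # label is 1 when the value is >= 5

# One permutation of the row indices, applied to both arrays.
perm = rng.permutation(len(toy_data))
toy_data, toy_labels = toy_data[perm], toy_labels[perm]

n_train = 6
x_tr, y_tr = toy_data[:n_train], toy_labels[:n_train]
x_va, y_va = toy_data[n_train:], toy_labels[n_train:]

# Every sample still carries its own label after shuffling.
print(np.array_equal(toy_labels, (toy_data[:, 0] >= 5).astype(int)))
```

Shuffling the two arrays independently would break the pairing between a review and its label.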

In [None]:
indices = np.arange(data.shape[0])

np.random.shuffle(indices)   # Create a random shuffle

data = data[indices]         # Based on the shuffle create a new data 
labels = labels[indices]

## Train and validation 
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

<a id="section408"></a>
###  **4.7 Preparing the GloVe word-embeddings matrix**

In [None]:
embedding_dim = 50       
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
  if i < max_words:
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
      embedding_matrix[i] = embedding_vector   # Words not found in the embedding index will be all zeros. 
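
The same loop can be traced on a tiny made-up example to see which rows get filled. All names and values below (`toy_embeddings`, `toy_word_index`, the two-dimensional vectors) are hypothetical; the only facts relied on are that the Keras `Tokenizer` numbers words starting from 1, so row 0 is never assigned a word, and that words missing from the embedding file keep an all-zero row.

```python
import numpy as np

# Made-up pre-trained vectors and tokenizer index.
toy_embeddings = {"good": np.array([0.1, 0.2]), "movie": np.array([0.3, 0.4])}
toy_word_index = {"movie": 1, "good": 2, "zzzunknown": 3, "rare": 4}

toy_max_words = 4   # keep indices 0..3 only
toy_dim = 2

matrix = np.zeros((toy_max_words, toy_dim))
hits = 0
for word, i in toy_word_index.items():
    if i < toy_max_words:                 # "rare" (index 4) is skipped
        vec = toy_embeddings.get(word)
        if vec is not None:               # "zzzunknown" has no pre-trained vector
            matrix[i] = vec
            hits += 1

print(matrix)
print("rows filled:", hits)
```

Counting the filled rows this way is a quick check of how much of the vocabulary the pre-trained file actually covers.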

---
<a name = Section5></a>
# **5. Preparing the model**
---

<a id="section501"></a>
### **5.1 Defining a model**

- We are using the **`Sequential`** class from **`tensorflow.keras.models`** to build our model.

- We are using an **embedding layer** that will hold our **pre-trained** word embeddings.

- We are using **flatten** and **dense layers** to build the rest of the model.
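
Conceptually, an embedding layer is a lookup table: each integer token index selects one row of a weight matrix, and the flatten step then concatenates the rows of a review into a single feature vector. The NumPy sketch below illustrates that idea with made-up numbers; it is not how Keras implements the layers.

```python
import numpy as np

# Made-up weight matrix: 3 token indices, 2 dimensions each (row 0 = padding).
toy_matrix = np.array([[0.0, 0.0],
                       [0.3, 0.4],
                       [0.1, 0.2]])

# Two padded "reviews" of length 3, as integer token indices.
batch = np.array([[0, 1, 2],
                  [2, 2, 1]])

embedded = toy_matrix[batch]               # lookup: shape (batch, maxlen, dim)
flat = embedded.reshape(len(batch), -1)    # flatten: shape (batch, maxlen * dim)

print(embedded.shape)   # (2, 3, 2)
print(flat.shape)       # (2, 6)
```

With this notebook's settings (`maxlen = 50`, `embedding_dim = 50`) the flattened vector has 50 × 50 = 2500 features per review, which is what the first dense layer receives.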

In [None]:
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

<a id="section502"></a>
### **5.2 What about the embedding matrix we created?**

<center>
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/Question.png" width="150px"/>

</center>

<a id="section50201"></a>
#### **5.2.1 Loading pretrained word embeddings into the Embedding layer**


In [None]:
# Load the pre-trained GloVe matrix as the weights of the Embedding layer.
model.layers[0].set_weights([embedding_matrix])

# Freeze the Embedding layer (set its trainable attribute to False) so that training does not update the pre-trained vectors.
model.layers[0].trainable = False

<a id="section503"></a>
### **5.3 Training and Evaluating the model**

In [None]:
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])
history = model.fit(x_train, y_train,epochs=20,batch_size=32,validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')

**Observation:**

- Don't worry too much about the exact **output** of **training**, because here we train on only **200 reviews**.

<a id="section504"></a>
### **5.4 Plot the model’s performance over time**

In [None]:
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

<a id="section505"></a>
### **5.5 Important Note**

- The model quickly starts **overfitting**, which is **unsurprising** given the small number of **training** samples.

- Because you have so few training samples, **performance** is heavily **dependent** on exactly which **200** samples you choose, and you’re choosing them at **random**. 

- If this works poorly for you, try choosing a different **random** set of **200 samples** for the sake of the **exercise**, or **increase** the number of **samples**, for example by training with **5000** samples.

---
<a name = Section6></a>
# **6. Conclusion**
---

- After this sheet you know **how** to use pre-trained word embeddings, but when should you use them? 

- Use pre-trained **word-embeddings** when you have only a **limited data** set to train on.

__When lots of data is available__:

- Train the model by **loading** the pre-trained word *embeddings* **without freezing** the embedding layer. 

- In that case, you’ll learn a task-specific embedding of the **input tokens**, which is generally more **powerful** than **pre-trained** word embeddings. 

