# 🚀 END-to-END-Natural Language Processing-3


#### In this notebook, I demonstrate how to build different 'ANN Language Models' with Semantic Embeddings on the Keras-TensorFlow 2.0 framework.

First, I build a ConvNet language model without pre-trained embeddings and use it as the benchmark.

Second, I design an embedding matrix from pre-trained word embeddings (**GloVe, GoogleNews, Fasttext**) and feed it to the embedding layer of the neural networks.

The following neural networks are built with the appropriate word embeddings:

* ConvNets
* Recurrent Neural Networks/ Long Short Term Memory Cells
* Bidirectional LSTMs / Gated Recurrent Units (GRU)

> ### So, let's get started!!!

**As always, I hope you find this kernel useful and your [UPVOTES](https://www.kaggle.com/rizdelhi/quora-insincere-questions-part-3) would be highly appreciated.**

**Previous Kernels**

> [⚡END-to-END-Natural Language Processing-1⚡- Exploratory Data Analysis & Pre-trained Word Embedding Models](https://www.kaggle.com/rizdelhi/end-to-end-natural-language-processing-1) 

> [⚡END-to-END-Natural Language Processing-2⚡-Statistical Models and Ensemble Technique to Improvise Performance](https://www.kaggle.com/rizdelhi/end-to-end-natural-language-processing-2) 

### Data loaded 

In [None]:
import os
import re
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer, LabelEncoder
from sklearn.model_selection import train_test_split
from gensim.models import KeyedVectors
from keras import layers
from keras.models import Sequential, Model
from keras.layers import Bidirectional, GlobalMaxPool1D
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Embedding

# Basic Building Blocks of Artificial Neural Networks (ANN) Model with Keras-Tensorflow 2.0

We'll be following the below pipeline to create neural networks for text classification:

1. Tokenize the input features- This implies converting the input data into tokens (by using one hot encoding/tokenizing)
2. Encode the targets- This can be done with the help of LabelEncoder (sklearn) or, when the labels are already numeric, by taking the pandas `.values` array
3. Pad the tokenized features- To ensure that the length of the tokenized features is the same across all entries (post padding)
4. Create a simple model: Build a Sequential Network with Keras 'Embedding layer' as the starting point - 'Keras Sequential API'
5. Add Layers to the model: Add either Conv/LSTM/Bi-LSTM/RNN/GRU layers with different activations ('relu' is recommended)
6. Add the Dense layer to the model- At the end, add the Dense layer after flattening (or globally pooling) the output of the previous layer
7. Add necessary activation functions- Sigmoid for Binary Classification, Softmax for Multi-class classification
8. Print the model architecture using 'plot_model'
9. Plot loss/accuracy of the model with matplotlib

OR

Launch Tensorboard to visualize the training parameters - loss,accuracy, etc.


These steps form a basic pipeline for any language modelling task.

More sophisticated variants feed pre-trained embeddings into the Keras Embedding layer or add other layers (for example transformer blocks) before the LSTM.
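To make steps 1 and 3 of the pipeline concrete, here is a minimal plain-Python sketch of what tokenizing and post-padding produce. The helper names (`fit_word_index`, `texts_to_sequences`, `pad_post`) and the toy sentences are illustrative assumptions, not the Keras implementation. One simplification to keep in mind: this sketch keeps the first `maxlen` tokens of a long sequence, whereas Keras `pad_sequences` truncates from the front by default.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank (1 = most frequent); 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, ties broken alphabetically so the result is deterministic
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_post(sequences, maxlen):
    """Right-pad with zeros (or cut off the tail) so every sequence has length maxlen."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq)) for seq in sequences]

# Hypothetical toy corpus, only for illustration
toy_texts = ["how do i learn python", "why do cats purr", "how do cats learn"]
toy_index = fit_word_index(toy_texts)
toy_padded = pad_post(texts_to_sequences(toy_texts, toy_index), maxlen=6)

print(toy_index)
print(toy_padded)
```

Every row of `toy_padded` has the same length, which is what lets the Embedding layer consume the data as a single integer matrix.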

## Basic CNN Model for NLP with Keras Tokenizer

I am using the Keras Tokenizer class to turn each question into a sequence of integer word indices, which the Embedding layer of a simple Convolutional Neural Network then maps to dense vectors.

[Read: Understanding CNN for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/)

![CNN for NLP](http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM.png)
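As a rough illustration of what a `Conv1D` layer followed by global max pooling computes on text, the sketch below slides a single filter over a short sequence of word vectors. The function name `conv1d_valid` and the toy numbers are assumptions for illustration. A real layer learns many filters and a bias term, both of which are omitted here.

```python
def conv1d_valid(sequence, kernel):
    """Slide one filter over a sequence of vectors ('valid' padding, stride 1), then apply ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Dot product between the filter and the current window of word vectors
        activation = sum(k * v
                         for k_row, v_row in zip(kernel, window)
                         for k, v in zip(k_row, v_row))
        outputs.append(max(0.0, activation))
    return outputs

# Hypothetical toy input: 4 tokens with 2-dimensional embeddings
toy_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter spanning 2 consecutive tokens
toy_kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d_valid(toy_sequence, toy_kernel)
pooled = max(feature_map)   # global max pooling keeps the strongest response

print(feature_map, pooled)
```

The filter acts like a detector for a particular two-word pattern, and global max pooling reports whether that pattern occurred anywhere in the sentence, regardless of position.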

### Train data

In [None]:
# Import libraries
import tensorflow as tf
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import LSTM, Dense,Flatten,Conv2D,Conv1D,GlobalMaxPooling1D
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
from keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Model,Sequential
from keras.utils import to_categorical
# Load the input features
pd.set_option('display.max_colwidth',None)
train_df=pd.read_csv('../input/clean-quora-train-data/clean_lem_stemmed_train_data.csv') #cleaned data imported from previous kernel
train_df=train_df.dropna()
train_df.head()

In [None]:
X = train_df['question_text'] # input
y = train_df['target'].values # target /label

sentences_train,sentences_val,y_train,y_val = train_test_split(X,y,test_size=0.2,random_state=11)

tokenizer = Tokenizer(num_words=30000)
tokenizer.fit_on_texts(sentences_train)
X_train = tokenizer.texts_to_sequences(sentences_train)
X_val = tokenizer.texts_to_sequences(sentences_val)

# Adding 1 because of  reserved 0 index
vocab_size = len(tokenizer.word_index) + 1 # (in case of pre-trained embeddings it's +2)                         
maxlen = 131 # sentence length

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_val = pad_sequences(X_val, padding='post', maxlen=maxlen)


print("Padded and Tokenized Training Sequence:",X_train.shape)
print("Target Training Values Shape:",y_train.shape)
print("_____________________________________________")
print("Padded and Tokenized Validation Sequence:",X_val.shape)
print("Target Validation Values Shape:",y_val.shape)

In [None]:
num_tokens=len(tokenizer.word_index)+2
print("Number of Features/Tokens:",num_tokens)

### Delete unused memory

In [None]:
del train_df
import gc
gc.collect()

### Generic function to plot the train/validation loss and accuracy

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
seed = 1000

# generic function to plot the train Vs validation loss/accuracy:
def plot_history(history):
    loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' not in s]
    val_loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' in s]
    acc_list = [s for s in history.history.keys() if 'acc' in s and 'val' not in s]
    val_acc_list = [s for s in history.history.keys() if 'acc' in s and 'val' in s]
    if len(loss_list) == 0:
        print('Loss is missing in history')
        return 
    ## As loss always exists
    epochs = range(1,len(history.history[loss_list[0]]) + 1)
    plt.figure(figsize=(25,15))
    ## Accuracy
    plt.subplot(2,2,1)
    for l in acc_list:
        plt.plot(epochs, history.history[l], 'b', label='Training accuracy (' + str(format(history.history[l][-1],'.4f'))+')')
    for l in val_acc_list:    
        plt.plot(epochs, history.history[l], 'g', label='Validation accuracy (' + str(format(history.history[l][-1],'.4f'))+')')

    plt.title('Training Accuracy Vs Validation Accuracy\n')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    ## Loss
    plt.subplot(2,2,2)
    for l in loss_list:
        plt.plot(epochs, history.history[l], 'b', label='Training loss (' + str(str(format(history.history[l][-1],'.4f'))+')'))
    for l in val_loss_list:
        plt.plot(epochs, history.history[l], 'g', label='Validation loss (' + str(str(format(history.history[l][-1],'.4f'))+')'))
    
    plt.title('Training Loss Vs Validation Loss\n')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

## Function to plot confusion matrix

In [None]:
from sklearn import metrics
import seaborn as sns  # needed for sns.heatmap below

def conf_matrix(actual, prediction, model_name):
    cm_array=metrics.confusion_matrix(actual,prediction,labels=[0,1])
    sns.set_context("notebook", font_scale=1.1)
    plt.figure(figsize=(5,5))
    sns.heatmap(cm_array,annot=True, fmt='.0f',xticklabels=['Sincere','Insincere'],yticklabels=['Sincere','Insincere'])
    plt.ylabel('True\n')
    plt.xlabel('Predicted\n')
    plt.title(model_name)
    plt.show()

## Deep ConvNet model without pre-trained embeddings

I first train with a randomly initialised, trainable Keras Embedding layer and visualize the results before moving on to the pre-trained embedding models.

In [None]:
# ConvNet model

embedding_dim = 100
# number_of_tokens=len(tokenizer.word_index)+1

embedding_layer = Embedding(vocab_size,embedding_dim,input_length=maxlen,trainable=True)

# model = tf.keras.Sequential()
int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(256, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D()(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D()(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(10,activation='relu')(x)
preds = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(int_sequences_input, preds)

model.summary()

model.compile(optimizer='Adam',loss='binary_crossentropy',metrics=['accuracy'])

### Fit the model

In [None]:
history=model.fit(X_train, y_train,epochs=5,validation_data=(X_val, y_val),batch_size=1024)

### Plotting the accuracy & loss 

In [None]:
plot_history(history)

### Architecture of the ConvNet model 

In [None]:
# save the model
# model.save('cnn_nlp_model.h5')
# plotting the architecture
dot_img_file = '/tmp/model_cnn.png'
tf.keras.utils.plot_model(model, show_shapes=True,to_file=dot_img_file, rankdir="TB")

In [None]:
del model,history,
import gc
gc.collect()

## About RNNs
### Recurrent Neural Network (RNN)

Recurrent neural networks (RNN) are a class of neural networks that is powerful for modeling **sequence data** such as time series or natural language.

Schematically, an RNN layer uses a for loop to iterate over the timesteps of a sequence, while maintaining an internal state that encodes information about the timesteps it has seen so far.

The Keras RNN API is designed with a focus on:

- Ease of use: the built-in keras.layers.RNN, keras.layers.LSTM, keras.layers.GRU layers enable you to quickly build recurrent models without having to make difficult configuration choices.

- Ease of customization: You can also define your own RNN cell layer (the inner part of the for loop) with custom behavior, and use it with the generic keras.layers.RNN layer (the for loop itself). This allows you to quickly prototype different research ideas in a flexible way with minimal code.
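The "for loop over timesteps" can be sketched in a few lines of plain Python. This is a minimal illustration with a scalar input and a scalar hidden state. The function name `simple_rnn_forward` and the weight values are assumptions. A real `SimpleRNN` layer uses weight matrices and vector states.

```python
import math

def simple_rnn_forward(inputs, w_x, w_h, bias):
    """Scalar-state RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + bias)."""
    state = 0.0
    states = []
    for x_t in inputs:   # the "for loop" over timesteps
        # The new state mixes the current input with the previous state
        state = math.tanh(w_x * x_t + w_h * state + bias)
        states.append(state)
    return states

# Hypothetical weights and inputs, only for illustration
toy_states = simple_rnn_forward([1.0, 0.0, -1.0], w_x=0.5, w_h=0.8, bias=0.0)
print(toy_states)
```

Note that the second state is non-zero even though the second input is zero: the internal state carries information forward from the first timestep.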

Some resources for understanding the derivatives and optimization inside the RNNs:

[Maths PDF](https://www.cs.toronto.edu/~tingwuwang/rnn_tutorial.pdf)

[Colah's Article on RNNs](https://colah.github.io/posts/2015-09-NN-Types-FP/)

[Recurrent Neural Networks (RNN) with Keras](https://www.tensorflow.org/guide/keras/rnn)

### Long Short Term Memory (LSTM)

Drawbacks of RNNs: One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task; for example, previous video frames might inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends. Sometimes, we only need to look at recent information to perform the present task.

For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information. But there are also cases where we need more context. 

Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult. Thankfully, LSTMs don’t have this problem!
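To see why an LSTM can hold on to information over long gaps, here is a minimal sketch of a single LSTM step with scalar input and state, using the standard gate equations. The function names, the `params` layout and the hand-picked toy values are assumptions chosen for illustration. A trained Keras `LSTM` layer learns matrix-valued parameters instead.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. params maps a gate name to a (w_x, w_h, b) tuple."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate values for the cell state
    c_t = f * c_prev + i * g          # additive update of the cell state
    h_t = o * math.tanh(c_t)
    return h_t, c_t

# Hypothetical parameters: forget gate almost fully open, input gate almost closed
toy_lstm_params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

toy_h, toy_c = 0.0, 1.0
for _ in range(50):
    toy_h, toy_c = lstm_step(0.0, toy_h, toy_c, toy_lstm_params)

print(toy_c)
```

With the forget gate close to 1, the cell state stored at the start is still almost intact after 50 steps, because the update multiplies it by a value near 1 rather than squashing it through a non-linearity at every step.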


[LSTM Video](https://www.youtube.com/watch?v=WCUNPb-5EYI)

[About LSTM - blog](https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/)


There are several variants of the LSTM, one of the best known being the Gated Recurrent Unit (GRU).

### Gated Recurrent Unit (GRU)

The GRU was introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

[Paper: Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43905.pdf)
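The GRU update described above can be sketched in the same scalar style. The function name `gru_step`, the `params` layout and the toy values are assumptions for illustration. The sketch uses the convention where the update gate `z` weights the candidate state. Some libraries swap the roles of `z` and `1 - z`.

```python
import math

def gru_step(x_t, h_prev, params):
    """One GRU step with scalar input and state. params maps a gate name to (w_x, w_h, b)."""
    def pre(name, h):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h + b

    z = 1.0 / (1.0 + math.exp(-pre("update", h_prev)))   # update gate
    r = 1.0 / (1.0 + math.exp(-pre("reset", h_prev)))    # reset gate
    # The reset gate scales how much of the old state enters the candidate
    h_candidate = math.tanh(pre("candidate", r * h_prev))
    # A single gate decides both how much to forget and how much to write
    return (1.0 - z) * h_prev + z * h_candidate

# Hypothetical parameters: update gate almost closed, so the state is carried along
toy_gru_params = {
    "update": (0.0, 0.0, -10.0),
    "reset": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

toy_gru_state = 0.7
for x_t in [1.0, -1.0, 0.5]:
    toy_gru_state = gru_step(x_t, toy_gru_state, toy_gru_params)

print(toy_gru_state)
```

Compared with the LSTM there is no separate cell state: the hidden state itself is carried forward, which is why the GRU has fewer parameters.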

## Basic RNN Neural Networks Model without pre-trained Embeddings

In this section, I build preliminary deep neural models with different variants of RNNs. I also build a simple LSTM model to compare deep models against the statistical ones.

I am not using any pre-trained static/dynamic embeddings here; the network is a simple LSTM model with a trainable Embedding layer.

There are three built-in RNN layers in Keras:

- keras.layers.SimpleRNN, a fully-connected RNN where the output from previous timestep is to be fed to next timestep.

- keras.layers.GRU, first proposed in Cho et al., 2014.

- keras.layers.LSTM, first proposed in Hochreiter & Schmidhuber, 1997.

[Read Jason's Blog - Best Practices for Text Classification](https://machinelearningmastery.com/best-practices-document-classification-deep-learning/)

In [None]:
# Basic RNN(LSTM) model without pretrained embeddings

model_RNN = Sequential([layers.Embedding(vocab_size, embedding_dim, input_length=maxlen),
                        LSTM(64),
                        Dense(16,activation='relu'),
                        Dense(1,activation='sigmoid')])

# compile
model_RNN.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# summary of model
model_RNN.summary()
# save the model
# model_RNN.save('rnn_nlp_model.h5')

### Plot the architecture

In [None]:
# plot the architecture
dot_img_file = '/tmp/model_RNN.png'
tf.keras.utils.plot_model(model_RNN, to_file=dot_img_file, rankdir="LR",show_shapes=True)

In [None]:
history = model_RNN.fit(X_train,y_train,epochs=2,validation_data=(X_val, y_val),batch_size=2056)

### Plot the loss and accuracy of the model

In [None]:
plot_history(history)

In [None]:
del model_RNN,dot_img_file,history
import gc
gc.collect()

## Bi-directional RNN model without pre-trained Embeddings

In [None]:
# Bidirectional LSTM model with a trainable embedding layer (no pre-trained embeddings)

embedding_layer = Embedding(vocab_size,embedding_dim,input_length=maxlen,trainable=True)

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
z=Bidirectional(LSTM(32,return_sequences=True))(embedded_sequences)
z=GlobalMaxPool1D()(z)
z=Dense(16,activation='relu')(z)
z=Dense(1,activation='sigmoid')(z)
model_biRNN= Model(inputs=int_sequences_input,outputs=z)

model_biRNN.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# summary of model
model_biRNN.summary()
# save the model
#model_biRNN.save('biRNN_nlp_model.h5')

### Architecture

In [None]:
dot_img_file = '/tmp/model_birnn.png'
tf.keras.utils.plot_model(model_biRNN, to_file=dot_img_file, rankdir="LR",show_shapes=True)

In [None]:
history = model_biRNN.fit(X_train,y_train,epochs=1,validation_data=(X_val, y_val),batch_size=2056)

In [None]:
del model_biRNN,dot_img_file,history
import gc
gc.collect()

<img src="https://media.giphy.com/media/10LKovKon8DENq/giphy.gif" width="300" height="100" align="left">

# Neural Networks with Static Semantic Embeddings Baseline
 
In this section, I explore pre-trained embeddings that may improve the performance of the model. Because they are trained on much larger corpora, pre-trained embeddings usually provide a better representation of words than vectors learned from scratch.

> ## GloVe, GoogleNews and FastText embeddings are my starting point!!

> ## GloVe Embeddings

<img src="https://media.giphy.com/media/xT1R9M8505GD2mz2da/giphy.gif" width="300" height="100" align="left">

### Designing the Embedding matrix with Glove Embeddings

In [None]:
from gensim.models import KeyedVectors

path_to_glove_file = os.path.join('../input/pretrained/', "glove.6B.100d.txt")

## make a dict mapping words (strings) to their NumPy vector representation:

embeddings_index = {}
with open(path_to_glove_file, encoding="utf8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, dtype=float, sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

### Embedding Layer with GloVe Embeddings

In [None]:
## prepare a corresponding embedding matrix that we can use in a Keras Embedding layer. 
## It's a simple NumPy matrix where entry at index i is the pre-trained vector for the word of index i in our vectorizer's vocabulary.
word_index=tokenizer.word_index
num_tokens = len(tokenizer.word_index)+ 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
# Rows stay all-zeros for words not found in the embedding index.
# This includes the representation for "padding" and "OOV".
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
        
print("Converted %d words (%d misses)" % (hits, misses))


#load the pre-trained word embeddings matrix into an Embedding layer.
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False
)

In [None]:
print("embedding matrix shape:",embedding_matrix.shape)

### Build the ConvNet Model - GloVe embeddings

In [None]:
## Build the ConvNet Model - GloVe embeddings

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)

x = layers.Conv1D(256, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D()(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D()(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(10, kernel_regularizer=regularizers.l1(l1=1e-4),activation='relu')(x)
preds = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(int_sequences_input, preds)
model.compile(optimizer='Adam',loss='binary_crossentropy',metrics=['accuracy'])

### Fit the model

In [None]:
history = model.fit(X_train,y_train,epochs=2,validation_data=(X_val, y_val),batch_size=2056)

### Plot the accuracy and loss

In [None]:
plot_history(history)

In [None]:
del model,history
import gc
gc.collect()

### RNN model on Keras Sequential API with GloVe embeddings

In [None]:
# load the GloVe word embeddings matrix into an Embedding layer

model_RNN = Sequential([layers.Embedding(num_tokens,embedding_dim,embeddings_initializer=keras.initializers.Constant(embedding_matrix),trainable=False),
                        LSTM(60),
                        Dense(20,activation='relu'),
                        Dense(1,activation='sigmoid')])

# compile
model_RNN.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# summary of model
model_RNN.summary()

### Fit the model

In [None]:
history = model_RNN.fit(X_train,y_train,epochs=1,validation_data=(X_val, y_val),batch_size=2056)

### BiDirectional RNN model with GloVe Embeddings 

In [None]:
#load the pre-trained word embeddings matrix into an Embedding layer.
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False
)

# Use the frozen GloVe embedding layer defined above (do not overwrite it with a trainable one)
int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)

z=Bidirectional(LSTM(32,return_sequences=True))(embedded_sequences)
z=GlobalMaxPool1D()(z)
z=Dense(16,activation='relu')(z)
z=Dense(1,activation='sigmoid')(z)
model_biRNN= Model(inputs=int_sequences_input,outputs=z)

model_biRNN.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# summary of model
model_biRNN.summary()

### Fit the model

In [None]:
%%time

history = model_biRNN.fit(X_train,y_train,epochs=1,validation_data=(X_val, y_val),batch_size=2056)

In [None]:
del embedding_matrix, model_RNN, model_biRNN,history
import gc
gc.collect()

> ##  GoogleNews Embeddings

<img src="https://media.giphy.com/media/l1UkRZuk6FPFYn0ewa/giphy.gif" width="300" height="100" align="left">

### Designing GoogleNews Embedding Matrix

In [None]:
google_news_embed="https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
embeddings_index = KeyedVectors.load_word2vec_format(google_news_embed, binary=True)

In [None]:
print("Found %s word vectors." % embeddings_index.vectors.shape[0])

In [None]:
word_index = tokenizer.word_index
num_tokens = len(word_index)+2
embed_size = 300

embedding_matrix = np.zeros((num_tokens, embed_size))

In [None]:
## prepare a corresponding embedding matrix that used in a Keras Embedding layer. 
## It's a simple NumPy matrix where entry at index i is the pre-trained vector for the word of index i in our vectorizer's vocabulary.

hits = 0
misses = 0

# embedding matrix
for word, i in word_index.items():
    try:
        embedding_matrix[i] = embeddings_index.get_vector(word)
        hits += 1
    except KeyError:
        # get_vector raises KeyError for out-of-vocabulary words; their rows stay all-zeros
        misses += 1
    
        
print("Converted %d words (%d misses)" % (hits, misses))

In [None]:
print("Embedding Matrix Shape:",embedding_matrix.shape)
print("Number of Tokens      :",num_tokens)

### Build the ConvNet Model - GoogleNews Embeddings

In [None]:
## Build the ConvNet Model - GoogleNews embeddings

embedding_dim=300

model = Sequential([layers.Embedding(num_tokens,embedding_dim,weights=[embedding_matrix],trainable=False),
                    layers.Conv1D(256, 5, activation="relu"),
                    layers.MaxPooling1D(),
                    layers.Conv1D(128, 5, activation="relu"),
                    layers.MaxPooling1D(),
                    layers.Conv1D(128, 5, activation="relu"),
                    layers.GlobalMaxPooling1D(),
                    layers.Dense(128, activation="relu"),
                    layers.Dropout(0.5),
                    layers.Dense(10, kernel_regularizer=regularizers.l1(l1=1e-4),activation='relu'),
                    layers.Dense(1, activation='sigmoid')])

model.compile(optimizer='Adam',loss='binary_crossentropy',metrics=['accuracy'])

### Fit the ConvNet Model with GoogleNews Embeddings

In [None]:
history = model.fit(X_train,y_train,batch_size=2056,epochs=1,validation_data=(X_val,y_val))

In [None]:
del model,embeddings_index,history
import gc
gc.collect()

### RNN Model with GoogleNews Embeddings

In [None]:
model_RNN = Sequential([layers.Embedding(num_tokens,embedding_dim,weights=[embedding_matrix],trainable=False),
                        LSTM(60),
                        Dense(20,activation='relu'),
                        Dense(1,activation='sigmoid')])

# compile
model_RNN.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

### Fit the model

In [None]:
BATCH_size = 2056
history = model_RNN.fit(X_train,y_train,batch_size=BATCH_size,epochs=1,validation_data=(X_val,y_val))

In [None]:
del model_RNN,history
import gc
gc.collect()

### BiDirectional RNN model with GoogleNews Embeddings

In [None]:
model_biRNN = Sequential([layers.Embedding(num_tokens,embedding_dim,weights=[embedding_matrix],trainable=False),
                          layers.Bidirectional(LSTM(32,return_sequences=True)),
                          layers.GlobalMaxPool1D(),
                          layers.Dense(16,activation='relu'),
                          layers.Dense(1,activation='sigmoid')])

model_biRNN.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

In [None]:
# plotting the architecture
dot_img_file = '/tmp/model_birnn_google.png'
tf.keras.utils.plot_model(model_biRNN, show_shapes=True,to_file=dot_img_file, rankdir="LR")

### Fit the model

In [None]:
BATCH_size = 2056
history = model_biRNN.fit(X_train,y_train,batch_size=BATCH_size,epochs=1,validation_data=(X_val,y_val))

In [None]:
del model_biRNN,history,embedding_matrix
import gc
gc.collect()

> ## Fasttext Embeddings

<img src="https://media.giphy.com/media/3og0IMVPaqrnGfBnZm/giphy.gif" align ='left'>

### Loading Fasttext Embeddings in word2vec Format

In [None]:
%%time

# Using the fasttext word embeddings trained on Common Crawl

fasttext_file= "../input/pretrained/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(fasttext_file, binary=False)

word_index = tokenizer.word_index
num_tokens = len(tokenizer.word_index)+2

In [None]:
print("Found %s word vectors." % fasttext_model.vectors.shape[0])

### Embedding Matrix with Fasttext Embeddings

In [None]:
## prepare a corresponding embedding matrix that used in a Keras Embedding layer. 
## It's a simple NumPy matrix where entry at index i is the pre-trained vector for the word of index i in our vectorizer's vocabulary.

embed_size = 300
embedding_matrix = np.zeros((num_tokens, embed_size))

hits = 0
misses = 0

# embedding matrix
for word, i in word_index.items():
    try:
        embedding_matrix[i] = fasttext_model.get_vector(word)
        hits += 1
    except KeyError:
        # get_vector raises KeyError for out-of-vocabulary words; their rows stay all-zeros
        misses += 1
    
        
print("Converted %d words (%d misses)" % (hits, misses))

In [None]:
print("Embedding Matrix Shape:",embedding_matrix.shape)
print("Number of Tokens      :",num_tokens)

### ConvNet Model with Fasttext Embeddings

In [None]:
## Build the ConvNet Model - Fasttext embeddings

embedding_dim=300

model = Sequential([layers.Embedding(num_tokens,embedding_dim,weights=[embedding_matrix],trainable=False),
                    layers.Conv1D(256, 5, activation="relu"),
                    layers.MaxPooling1D(),
                    layers.Conv1D(128, 5, activation="relu"),
                    layers.MaxPooling1D(),
                    layers.Conv1D(128, 5, activation="relu"),
                    layers.GlobalMaxPooling1D(),
                    layers.Dense(128, activation="relu"),
                    layers.Dropout(0.5),
                    layers.Dense(10, kernel_regularizer=regularizers.l1(l1=1e-4),activation='relu'),
                    layers.Dense(1, activation='sigmoid')])

model.compile(optimizer='Adam',loss='binary_crossentropy',metrics=['accuracy'])

### Fit the model

In [None]:
BATCH_size = 1024
history = model.fit(X_train,y_train,batch_size=BATCH_size,epochs=2,validation_data=(X_val,y_val))

### Plot the accuracy and loss 

In [None]:
plot_history(history)

### Print the model architecture

In [None]:
# plotting the architecture
dot_img_file = '/tmp/model_cnn_fast.png'
tf.keras.utils.plot_model(model, show_shapes=True,to_file=dot_img_file, rankdir="LR")

In [None]:
del history,model,dot_img_file
import gc
gc.collect()

### RNN Model with Fasttext Embeddings

In [None]:
model_RNN = Sequential([layers.Embedding(num_tokens,embedding_dim,weights=[embedding_matrix],trainable=False),
                        LSTM(64),
                        Dense(32,activation='relu'),
                        Dense(1,activation='sigmoid')])

# compile
model_RNN.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# fit the model
BATCH_size = 1024
history = model_RNN.fit(X_train,y_train,batch_size=BATCH_size,epochs=1,validation_data=(X_val,y_val))

In [None]:
# plotting the architecture
dot_img_file = '/tmp/model_rnn_fast.png'
tf.keras.utils.plot_model(model_RNN, show_shapes=True,to_file=dot_img_file, rankdir="LR")

In [None]:
del history,model_RNN,dot_img_file
import gc
gc.collect()

### BiDirectional RNN Model with Fasttext Embeddings

In [None]:
embedding_dim=300

model_biRNN = Sequential([layers.Embedding(num_tokens,embedding_dim,weights=[embedding_matrix],trainable=False),
                          layers.Bidirectional(LSTM(64,return_sequences=True)),
                          layers.GlobalMaxPool1D(),
                          layers.Dense(128,activation='relu'),
                          layers.Dense(64,activation='relu'),
                          layers.Dense(16,activation='relu'),
                          layers.Dense(1,activation='sigmoid')])

### Print the Model Architecture

In [None]:
dot_img_file = '/tmp/model_birnn_fast.png'
tf.keras.utils.plot_model(model_biRNN, show_shapes=True,to_file=dot_img_file, rankdir="LR")

### Validation F1 Score and Accuracy on the BiRNN model

In [None]:
%%time

from sklearn import metrics

model_biRNN.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])


## Log Directory-TensorBoard

# root log directory - each run gets a sub-directory named after the current date and time
root_logdir = os.path.join(os.curdir,"my_logs")

def get_run_logdir():
    import time
    run_id = time.strftime("run_%Y_%m_%d_%H_%M_%S")
    return os.path.join(root_logdir,run_id)

run_logdir = get_run_logdir()
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir,histogram_freq=1)

epochs=2
BATCH_size = 1024

for e in range(epochs):
    # each pass trains for 3 more epochs (epochs * 3 = 6 in total) and then re-tunes the F1 threshold
    model_biRNN.fit(X_train,y_train,batch_size=BATCH_size,epochs=3,validation_data=(X_val,y_val),callbacks=[tensorboard_cb])
    pred_fast_val_y = model_biRNN.predict([X_val], batch_size=1024, verbose=1)
    best_thresh = 0.6
    best_score = 0.0
    for thresh in np.arange(0.1, 0.601, 0.01):
        thresh = np.round(thresh, 2)
        score = metrics.f1_score(y_val, (pred_fast_val_y>thresh).astype(int))
        if score > best_score:
            best_thresh = thresh
            best_score = score
    print("Val F1 Score: {:.4f}".format(best_score))
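The threshold search in the cell above can be illustrated without any libraries. The sketch below computes F1 from true/false positive counts and picks the best threshold from a list of candidates. The function names and the toy labels and probabilities are assumptions, not results from this notebook.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score of the positive class when probabilities above the threshold are labelled 1."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p > threshold)
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p > threshold)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p <= threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f1_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 (the first one wins on ties)."""
    return max(candidates, key=lambda th: f1_at_threshold(y_true, y_prob, th))

# Hypothetical, imbalanced toy data: 4 negatives and 2 positives
toy_y_true = [0, 0, 0, 0, 1, 1]
toy_y_prob = [0.05, 0.10, 0.30, 0.45, 0.35, 0.80]
toy_candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

toy_best = best_f1_threshold(toy_y_true, toy_y_prob, toy_candidates)
print(toy_best, f1_at_threshold(toy_y_true, toy_y_prob, toy_best))
```

On imbalanced data the best threshold is often below 0.5, because lowering it recovers positives that the model scores with modest confidence. Since the threshold is tuned on the validation set, the reported validation F1 is somewhat optimistic.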

### Confusion Matrix on Validation data

In [None]:
from sklearn import metrics
import seaborn as sns

pred_y_val = (pred_fast_val_y>best_thresh).astype(int)

conf_matrix(y_val,pred_y_val,'Bidirectional RNN Model with fasttext embeddings\n')

In [None]:
test_df = pd.read_csv("../input/testdataquora/test.csv")
test_sentences = test_df['question_text']

# Reuse the tokenizer fitted on the training data and the same maxlen,
# so that word indices and sequence length match what the model was trained on
X_test = tokenizer.texts_to_sequences(test_sentences)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

pred_test_y = model_biRNN.predict([X_test], batch_size=1024, verbose=1)

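#submission file
out_df = pd.DataFrame({"qid":test_df["qid"].values})
# convert predicted probabilities to 0/1 labels using the threshold tuned on the validation data
out_df['prediction'] = (pred_test_y>best_thresh).astype(int).ravel()
out_df.to_csv("submission_quora_birnn.csv", index=False)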

In [None]:
print("Metrics\n")
print(metrics.classification_report(y_val,pred_y_val))

## Dynamic Embeddings - ELMo (Embeddings from Language Models)

<img src='https://images.squarespace-cdn.com/content/v1/5208f2f8e4b0f3bf53b73293/1486510537795-YWYY4NDK68CT5VBPZZR2/ke17ZwdGBToddI8pDm48kDrMjE7hBq4fQV3wYHraitJZw-zPPgdn4jUwVcJE1ZvWQUxwkmyExglNqGp0IvTJZUJFbgE-7XRK3dMEBRBhUpzj2bmKhA1a89vhGCTEuFcMrGIAhTIwGn2DOXg1A8iNSPxvh_zK_LmuDa3ZMbEzfBk/Elmo_Emoji_animating_Jazz_Hands_JS_Y_v06.gif?format=2500w'>


Deep contextual embeddings and contextual sentence/word vectors fall under dynamic embeddings. These embeddings were the SOTA at the time of writing, which implies a need for robust Neural Network models.


[📖 READ **Attention Is All You Need**- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhine](https://arxiv.org/abs/1706.03762)

[📖 READ **Deep contextualized word representations** - Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer](https://arxiv.org/abs/1802.05365)

Both papers are important for their contributions to deep contextual embeddings.

I highly recommend reading them! The second paper gives a really cool explanation of how ELMo was designed.

ELMo deep contextualized word embeddings (developed by AllenNLP) are helpful in achieving state-of-the-art (SOTA) results in several NLP tasks. 

![Lena Voita's Blog](https://lena-voita.github.io/resources/lectures/transfer/elmo/training-min.png)

### Under the hood:

- The architecture above uses a character-level convolutional neural network (CNN) to represent the words of a text string as raw word vectors
- These raw word vectors act as inputs to the first layer of the biLM
- The forward pass contains information about a certain word and the context (other words) before that word
- The backward pass contains information about the word and the context after it
- This pair of information, from the forward and backward pass, forms the intermediate word vectors
- These intermediate word vectors are fed into the next layer of the biLM
- The final representation (ELMo) is the weighted sum of the raw word vectors and the 2 intermediate word vectors
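The last step, the weighted sum over layers, can be sketched as follows. In the ELMo paper the layer weights are softmax-normalised and the result is multiplied by a task-specific scale. The function name `elmo_combine`, the argument names and the toy vectors are assumptions for illustration. In practice the weights and the scale are learned with the downstream task.

```python
import math

def elmo_combine(layer_vectors, layer_logits, gamma=1.0):
    """Weighted sum of per-layer vectors for one token.

    layer_vectors: one vector per layer (raw word vector first, then the biLM layers)
    layer_logits:  one unnormalised weight per layer, turned into a softmax distribution
    gamma:         scalar that rescales the whole combined vector
    """
    exps = [math.exp(logit) for logit in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_vectors[0])
    return [gamma * sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
            for d in range(dim)]

# Hypothetical 2-dimensional vectors for one token: raw vector + 2 biLM layers
toy_layers = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# Equal logits give a plain average of the three layers
toy_elmo = elmo_combine(toy_layers, layer_logits=[0.0, 0.0, 0.0])
print(toy_elmo)
```

Raising the logit of one layer shifts the combined vector towards that layer, which is how a downstream task can emphasise the layer that is most useful to it.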

As the input to the biLM is computed from characters rather than words, it captures the inner structure of the word. For example, the biLM will be able to figure out that terms like beauty and beautiful are related at some level without even looking at the context they often appear in. Sounds incredible!


[📖 READ Analytics Vidhya](https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text/)

<img src ="https://cdn.analyticsvidhya.com/wp-content/uploads/2019/03/output_YyJc8E.gif">

> #### Visit **[END-to-END-Natural Language Processing-4](https://www.kaggle.com/rizdelhi/end-to-end-natural-language-processing-4)** to explore the implementation of the following:

- Sequence2Sequence models without/with Attention Heads
- Transformers
- ELMo, DistilBERT embeddings
- BERT
- RoBERTa
- ALBERT


<img src="https://media.giphy.com/media/10b7yI48cD31K0/giphy.gif" width="300" height="100" align="right">