# Bidirectional LSTM Module

Natural Language Processing (NLP) is one of the main applications of deep neural networks, whether for speech recognition, chatbots, or predicting the next word in a sentence. A simple feedforward network handles such sequence tasks poorly because it has no memory of earlier inputs, so dedicated models were developed to overcome this obstacle. One of these models is the RNN.
- RNN - (Recurrent Neural Network) is a generalization of a feedforward neural network that has internal memory. It is recurrent in nature because it performs the same function for every element of the input sequence, while the output for the current input depends on the computation for the previous one.
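
The recurrence described above can be sketched with a toy scalar RNN. This is only an illustration, not the model used in this notebook; the function name `rnn_forward` and the weights `w_x` and `w_h` are assumptions chosen for the example.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Run a scalar RNN over a sequence and return every hidden state."""
    h = h0
    states = []
    for x in inputs:
        # The same function (same weights) is applied at every step;
        # the previous hidden state h carries the memory of earlier inputs.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

toy_states = rnn_forward([1.0, 0.0, 0.0])
print(toy_states)  # the first input still influences the later states
```

Even though the second and third inputs are zero, their hidden states are non-zero because they depend on the state computed for the first input.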

<img src="../assets/RNN.png" alt="Drawing" style="width: 400px;"/>


One limitation of RNNs is the vanishing and exploding gradient problem, which is why the LSTM model was developed.
- LSTM - (Long Short-Term Memory) is a special kind of RNN capable of learning long-term dependencies. It was designed with one basic goal in mind: the gradients should not vanish even when the sequence is very long.
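
To get a feeling for why gradients vanish or explode, consider a deliberately simplified linear scalar recurrence `h_t = w * h_{t-1} + x_t` (an assumption made for illustration; real RNNs also have non-linearities). The gradient of the last state with respect to the first is then `w` multiplied by itself once per time step:

```python
def gradient_scale(recurrent_weight, steps):
    """Factor by which a gradient is scaled after flowing back `steps` time steps
    through the linear scalar recurrence h_t = w * h_{t-1} + x_t."""
    return abs(recurrent_weight) ** steps

for w in (0.5, 1.0, 1.5):
    # |w| < 1 shrinks the gradient towards 0, |w| > 1 blows it up
    print(w, gradient_scale(w, steps=50))
```

Only a factor close to 1 keeps the gradient usable over long sequences, which is roughly what the gated, additive cell-state update of an LSTM is meant to achieve.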

<img src="../assets/LSTM.png" alt="Drawing" style="width: 400px;"/>


To improve model performance on sequence classification problems, Bidirectional LSTMs are used.
- BiLSTM - (Bidirectional LSTM) is an extension of the traditional LSTM. It trains two LSTMs instead of one on the input sequence: the first on the input sequence as-is, and the second on a reversed copy of it. This can provide additional context to the network and result in faster and even fuller learning on the problem.
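
The bidirectional idea can be sketched with a toy scalar RNN standing in for the LSTM. The helper names `run_rnn` and `run_bidirectional` are hypothetical, and for brevity both directions share the same weights here, whereas a real bidirectional layer learns separate weights for each direction.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8):
    """Toy scalar RNN returning the hidden state at every step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def run_bidirectional(inputs):
    """Pair the forward state with the backward state for each position."""
    forward = run_rnn(inputs)
    # Process a reversed copy, then flip the result back so positions line up.
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))

toy_outputs = run_bidirectional([1.0, 0.0, 0.0])
print(toy_outputs)
```

At the last position the forward state still remembers the first input, while the backward state has not seen it yet; at the first position it is the other way round. Pairing both gives every position context from both sides.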

<img src="../assets/BI-LSTM.png" alt="Drawing" style="width: 400px;"/>


In our case study, the detection of toxic comments, it is very important to consider the context and order of the words, which is why we used the BiLSTM model.



## 1. Modelling

This notebook was created for explanation and submission purposes. To run the code on a Kaggle Notebook using a GPU, use the link below:
https://www.kaggle.com/norahsh/2-lstm-model-for-toxic-classification-comments

### 1.1 Text data preprocessing

#### 1.1.1 Tokenization

To feed the model, we first need to vectorize the comments. This can be done with the Keras `Tokenizer`, which breaks each sentence into words, stores the words in a dictionary-like structure with an index for each, and then represents each comment as a sequence of those indices.


In [None]:
max_features = 20000
tokenizer = Tokenizer(num_words=max_features,lower=True,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

# Fitting the tokenizer on the train, validation and test comments
tokenizer.fit_on_texts(list(list_sentences_train) + list(list_sentences_validation) + list(list_sentences_test))
word_index = tokenizer.word_index

# Building training set
list_tokenized_train = tokenizer.texts_to_sequences(list(list_sentences_train))
y_train = train['toxic'].values

# Building validation set
list_tokenized_validation = tokenizer.texts_to_sequences(list(list_sentences_validation))
y_valid = valid['toxic'].values

# Building test set
list_tokenized_test = tokenizer.texts_to_sequences(list(list_sentences_test))

del tokenizer # To save RAM space
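
What the tokenizer does can be approximated in a few lines of plain Python. This is a simplified sketch, not the Keras implementation: the helper names and the toy comments are assumptions, and details such as tie-breaking between equally frequent words may differ from Keras.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()

def fit_word_index(texts):
    """Assign indices starting at 1, most frequent word first (0 is left for padding)."""
    counts = Counter(word for text in texts for word in split_words(text))
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index, dropping words outside the vocabulary."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in split_words(text) if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]  # keep only the most frequent words
        sequences.append(ids)
    return sequences

toy_comments = ["You are nice!", "You are not nice, you are rude."]
toy_word_index = fit_word_index(toy_comments)
print(toy_word_index)
print(to_sequences(toy_comments, toy_word_index))
```

With `num_words` set, rarer words are silently dropped from the sequences, which is how `max_features = 20000` limits the vocabulary in the cell above.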

#### 1.1.2 Padding

We use `pad_sequences` to fix the issue of inconsistent comment lengths: with `maxlen` as the target length, shorter sequences are filled with 0 and longer ones are truncated.

In [2]:
maxlen = 200 # length of every padded sequence

# Padding the train, validation and test sequences
print('Padding sequences...')
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_valid = pad_sequences(list_tokenized_validation, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
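
With its default settings (`padding='pre'`, `truncating='pre'`), `pad_sequences` adds zeros at the front of short sequences and drops tokens from the front of long ones. A minimal pure-Python sketch of that behaviour for a single sequence, using a hypothetical helper `pad_sequence`:

```python
def pad_sequence(ids, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` ids."""
    ids = list(ids)[-maxlen:]          # truncate from the front if too long
    return [value] * (maxlen - len(ids)) + ids

print(pad_sequence([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

Every comment therefore ends up as exactly `maxlen` indices, which is what allows the comments to be stacked into one matrix.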

### 1.2 Word Embeddings

In this step we use pre-trained word embeddings, which are usually trained on a very large corpus and save us from training word embeddings from scratch. Here we combined two pre-trained word embeddings: GloVe and FastText (Crawl).

<b>NOTE:</b> The training dataset contains only English comments. To handle the language problem, we combined two pre-trained word vector embeddings, FastText (Crawl) and GloVe. Both represent words as points in a vector space, and each builds that representation in its own way.
The second thing we did was to fit the model on the validation dataset after the training set. Together, these steps help the model deal with different languages and give an accuracy of 78%.
Many techniques could be implemented in the future, such as injecting the embeddings with additional words and using cross-lingual word embedding models.

#### 1.2.1 Load pre-trained word embeddings

In [None]:
# Load the pickled FastText (Crawl) word vectors
with open('../input/pickled-crawl300d2m-for-kernel-competitions/crawl-300d-2M.pkl', 'rb') as infile:
    crawl_embeddings = pickle.load(infile)

In [None]:
# Load the pickled GloVe word vectors
with open('../input/pickled-glove840b300d-for-10sec-loading/glove.840B.300d.pkl', 'rb') as infile:
    glove_embeddings = pickle.load(infile)

#### 1.2.2 Build embedding matrix

In [None]:
def build_matrix(word_index, embeddings_index):
    """
    Input: word index from the tokenizer above and a pre-trained word vector model

    Output: embedding matrix
    """
    # Row 0 stays all-zero for the padding index, hence len(word_index) + 1 rows
    embedding_matrix = np.zeros((len(word_index) + 1, 300))
    for word, i in word_index.items():
        try:
            embedding_matrix[i] = embeddings_index[word]
        except KeyError:
            # Out-of-vocabulary words fall back to the vector of the token "unknown"
            embedding_matrix[i] = embeddings_index["unknown"]
    return embedding_matrix

In [None]:
# Building one embedding matrix per pre-trained model
embedding_matrix_1 = build_matrix(word_index, crawl_embeddings)
embedding_matrix_2 = build_matrix(word_index, glove_embeddings)

# Concatenating embedding matrices 
embedding_matrix = np.concatenate([embedding_matrix_1, embedding_matrix_2], axis=1)

# Deleting intermediate objects to save space in RAM
del embedding_matrix_1, embedding_matrix_2
del crawl_embeddings, glove_embeddings
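
The effect of concatenating along `axis=1` is easiest to see on a toy example. The two-word vocabulary and the 2-d and 3-d vectors below are made up for illustration; in the notebook both models provide 300-d vectors, so each word ends up with 600 features.

```python
import numpy as np

toy_index = {"good": 1, "bad": 2}   # index 0 is reserved for padding
toy_crawl = {"good": np.array([1.0, 0.0]), "bad": np.array([0.0, 1.0])}
toy_glove = {"good": np.array([0.5, 0.5, 0.5]), "bad": np.array([-0.5, -0.5, -0.5])}

def toy_matrix(word_index, vectors, dim):
    """Stack the vector of each word into the row given by its index."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        matrix[i] = vectors[word]
    return matrix

toy_combined = np.concatenate(
    [toy_matrix(toy_index, toy_crawl, 2), toy_matrix(toy_index, toy_glove, 3)],
    axis=1,
)
print(toy_combined.shape)  # (3, 5)
print(toy_combined)
```

Each row keeps the first model's vector on the left and the second model's vector on the right, so the network can draw on both representations of the same word.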

### 1.3 Building the attention-based Bi-LSTM model

<img src="../assets/Model_simple_expl.png" alt="Drawing" style="width: 400px;"/>


In this model, the input is a sequence of word indices, which the embedding layer turns into word vectors. The output is passed to our bidirectional LSTM, which consists of two LSTM layers: one processes the sequence forward and the other backward. We then combine the results and apply an attention mechanism together with average pooling to get a more accurate result. Next, we apply dropout with a rate of 0.5 to avoid overfitting, and finally move to a dense layer with a sigmoid activation function.

The code below implements what has been discussed:

#### 1.3.1 Define the input shape

In [None]:
# define the shape of the input
inp = Input(shape=(maxlen,))

#### 1.3.2 First layer (embedding_layer)

The Embedding layer maps each incoming word index to its word vector. The output of the Embedding layer is the input to the LSTM layer.

In [5]:
# create the embedding layer from the pre-trained matrix; its weights stay frozen
embedding_layer = Embedding(*embedding_matrix.shape,
                            weights=[embedding_matrix],
                            trainable=False)

In [None]:
# pass the input into the embedding layer
x = embedding_layer(inp)

#### 1.3.2 Bi-LSTM layer

In [None]:
# feed into the first bidirectional LSTM, which outputs the hidden state of every time step
x = Bidirectional(LSTM(256, return_sequences=True))(x)

In [None]:
# feed into the second bidirectional LSTM, which again outputs the full sequence
x = Bidirectional(LSTM(128, return_sequences=True))(x)

#### 1.3.3 Concatenating the averaging and attention layers

In [None]:
class Attention(Layer):
    """
    Custom Keras attention layer
    Reference: https://www.kaggle.com/qqgeogor/keras-lstm-attention-glove840b-lb-0-043
    """
    def __init__(self, step_dim, W_regularizer=None, b_regularizer=None, 
                 W_constraint=None, b_constraint=None, bias=True, **kwargs):

        self.supports_masking = True

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = None
        super(Attention, self).__init__(**kwargs)

        self.param_W = {
            'initializer': initializers.get('glorot_uniform'),
            'name': '{}_W'.format(self.name),
            'regularizer': regularizers.get(W_regularizer),
            'constraint': constraints.get(W_constraint)
        }
        self.W = None

        self.param_b = {
            'initializer': 'zero',
            'name': '{}_b'.format(self.name),
            'regularizer': regularizers.get(b_regularizer),
            'constraint': constraints.get(b_constraint)
        }
        self.b = None

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.features_dim = input_shape[-1]
        self.W = self.add_weight(shape=(input_shape[-1],), 
                                 **self.param_W)

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],), 
                                     **self.param_b)

        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        step_dim = self.step_dim
        features_dim = self.features_dim

        eij = K.reshape(
            K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))),
            (-1, step_dim))

        if self.bias:
            eij += self.b
        eij = K.tanh(eij)
        a = K.exp(eij)

        if mask is not None:
            a *= K.cast(mask, K.floatx())

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.features_dim
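
Stripped of the Keras machinery, the `call` method above scores every time step, turns the scores into weights that sum to 1, and returns the weighted sum of the hidden states. A NumPy sketch of that computation for a single sequence follows; it leaves out masking and the epsilon term, and the function name and toy values are assumptions made for the example.

```python
import numpy as np

def attention_pool(states, weight, bias):
    """states: (steps, features); weight: (features,); bias: (steps,)."""
    scores = np.tanh(states @ weight + bias)     # one score per time step
    alphas = np.exp(scores)
    alphas = alphas / alphas.sum()               # normalise so the weights sum to 1
    pooled = (states * alphas[:, None]).sum(axis=0)
    return pooled, alphas

toy_hidden = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_pooled, toy_alphas = attention_pool(
    toy_hidden, weight=np.array([2.0, 2.0]), bias=np.zeros(3)
)
print(toy_alphas)   # the third step gets the largest weight
print(toy_pooled)   # one vector with as many entries as there are features
```

The time axis disappears in the output: whatever the sequence length, the layer returns one feature vector per comment.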

In [None]:
# apply GlobalAveragePooling1D to the Bi-LSTM output
avrege = GlobalAveragePooling1D()(x)

In [None]:
# apply the Attention layer to the Bi-LSTM output
attention = Attention(maxlen)(x)

In [None]:
# concatenate both results into one layer that summarises the Bi-LSTM output
hidden = concatenate([attention,avrege])
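
It can help to trace the per-sample tensor shapes up to this point. The sketch below only does the arithmetic; it assumes the values used in this notebook (`maxlen = 200`, two 300-d embeddings concatenated to 600 features, LSTM sizes 256 and 128) and that `Bidirectional` concatenates the forward and backward outputs, which is its default merge mode. The function name `trace_shapes` is hypothetical.

```python
def trace_shapes(maxlen=200, embed_dim=600, lstm_units=(256, 128)):
    """Return (layer name, per-sample output shape) pairs up to the concatenation."""
    shapes = [("input", (maxlen,)), ("embedding", (maxlen, embed_dim))]
    features = embed_dim
    for units in lstm_units:
        features = 2 * units   # forward and backward outputs are concatenated
        shapes.append((f"bi_lstm_{units}", (maxlen, features)))
    shapes.append(("attention", (features,)))
    shapes.append(("average_pooling", (features,)))
    shapes.append(("concatenate", (2 * features,)))
    return shapes

for name, shape in trace_shapes():
    print(name, shape)
```

Under these assumptions the concatenated vector has 512 features per comment.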

In [None]:
# dense layer with 512 units and a ReLU activation function
x = Dense(512, activation='relu')(hidden)

#### 1.3.4 Dropout to avoid overfitting

In [None]:
# apply dropout with a rate of 0.5 to avoid overfitting
x = Dropout(0.5)(x)
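
During training, dropout zeroes each activation with probability equal to the rate and scales the surviving ones by `1 / (1 - rate)` so that the expected sum stays the same; at prediction time nothing is dropped. A minimal pure-Python sketch of the training-time behaviour, with a hypothetical helper name and a fixed seed for repeatability:

```python
import random

def apply_dropout(values, rate=0.5, rng=None):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

print(apply_dropout([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], rate=0.5, rng=random.Random(0)))
```

With a rate of 0.5 every output is either 0.0 or twice the input, so the network cannot rely on any single unit being present.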

#### 1.3.5 Dense

ReLU activation: max(x, 0), the element-wise maximum of 0 and the input tensor.

In [8]:
# dense layer with 128 units and a ReLU activation function
x = Dense(128, activation="relu")(x)

Sigmoid activation: sigmoid(x) = 1 / (1 + exp(-x)), which maps any input to a value between 0 and 1.

In [None]:
# dense output layer with a sigmoid activation function
o = Dense(1, activation='sigmoid')(x)
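
Both activation functions used in this section are short enough to write out directly. The scalar versions below are only meant to illustrate the formulas; Keras applies them element-wise to whole tensors.

```python
import math

def relu(x):
    """Element-wise maximum of 0 and the input."""
    return max(x, 0.0)

def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for value in (-2.0, 0.0, 2.0):
    print(value, relu(value), round(sigmoid(value), 3))
```

Because the sigmoid output lies between 0 and 1, the single output unit can be read as the predicted probability that a comment is toxic.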

#### 1.3.6 Building the model

In [None]:
# build and compile the model
model = Model(inputs=inp, outputs=o)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=[tf.keras.metrics.AUC()])
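
The `binary_crossentropy` loss penalises confident wrong predictions much more than confident correct ones. A pure-Python sketch of the formula follows; the function name is reused only for readability, and the clipping constant `eps` is an assumption made to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

print(binary_crossentropy([1, 0], [0.9, 0.1]))   # confident and right: small loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))   # confident and wrong: large loss
```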

##### 1.3.7.1 Fitting the model on the training set

In [None]:
# Fit the model on the training dataset
model.fit(X_t, y_train, batch_size=32, epochs=2, validation_split=0.1)

##### 1.3.7.2 Fitting the model on the validation set

In [None]:
# Continue fitting the same model on the validation dataset
model.fit(X_valid, y_valid, batch_size=32, epochs=2, validation_split=0.1)

#### 1.3.7 Prediction & saving the result

In [None]:
# Predict the toxicity of the test comments
val = model.predict(X_te, verbose=1)

In [None]:
# save the predictions into the submission file
sub['toxic'] = val
sub.to_csv('submission.csv', index=False)

### 1.4  Model AUC = 74.83
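
AUC can be read as the probability that a randomly chosen toxic comment receives a higher score than a randomly chosen non-toxic one: 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below computes it exactly by comparing every positive/negative pair; this is an illustration on made-up scores, whereas `tf.keras.metrics.AUC` approximates the same quantity using a fixed set of thresholds.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

In the toy example, three of the four positive/negative pairs are ordered correctly, hence 0.75.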
 

### Resources 

- https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/
- https://www.researchgate.net/publication/321976542_Bidirectional_LSTM_Recurrent_Neural_Network_for_Keyphrase_Extraction
- https://medium.com/@shivajbd/understanding-input-and-output-shape-in-lstm-keras-c501ee95c65e
- https://www.researchgate.net/publication/323130660_Text_Classification_Research_with_Attention-based_Recurrent_Neural_Networks
- https://nlp.stanford.edu/projects/glove/
- https://www.sciencedirect.com/science/article/abs/pii/S0925231219301067
- https://keras.io
- https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/reports/6834909.pdf
- https://towardsdatascience.com/natural-language-processing-from-basics-to-using-rnn-and-lstm-ef6779e4ae66
- https://colah.github.io/posts/2015-08-Understanding-LSTMs
- https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e
- https://www.kaggle.com/authman/pickled-crawl300d2m-for-kernel-competitions
- https://www.kaggle.com/authman/pickled-glove840b300d-for-10sec-loading