# RNN (Recurrent Neural Network)
* RNNs are sequence models that remember earlier inputs through an internal (hidden) state.
* For a typical feed-forward neural network, the order of the features/inputs doesn't matter.
* In an RNN, we add a feedback loop to a hidden layer, so the current value depends on past values, much like an AR (autoregressive) time series model.
* Unlike feed-forward neural networks, RNNs are trained with Backpropagation Through Time (BPTT).
* Because the current value depends on all previous values, many gradients (derivatives) are multiplied together when the error is backpropagated, and this leads to one of two problems:<br>
1) Vanishing gradients.<br>
2) Exploding gradients.
* Another problem is that the information from xₜ₋₁ is largely overwritten by xₜ by the time the network reaches xₜ₊₁.<br>
But for text analytics the information in the first word is also important. This is the motivation for LSTM.
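
To make the vanishing/exploding gradient point concrete, here is a minimal sketch. It assumes a scalar, linear recurrence h_t = w * h_{t-1} + x_t (no activation, purely for illustration), so the gradient of the last state with respect to the first is just `w` multiplied by itself once per time step. The function name `gradient_factor` is hypothetical.

```python
def gradient_factor(w, steps):
    """Product of the per-step derivatives d h_t / d h_{t-1} = w over `steps` time steps."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one multiplication per time step
    return grad

vanishing = gradient_factor(0.5, 50)  # |w| < 1: the product shrinks towards 0
exploding = gradient_factor(1.5, 50)  # |w| > 1: the product grows very large
print(vanishing, exploding)
```

With 50 time steps the first product is practically zero and the second is in the hundreds of millions, which is why early inputs either stop influencing training or destabilise it.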


### SimpleRNN model is not implemented in this part.
When it comes to implementing it, the only change is to use **keras.layers.SimpleRNN** instead of **keras.layers.LSTM**.

**Conceptually, the two differ.**

# LSTM

In LSTM we make use of gates. Each LSTM unit has 3 gates. These gates are themselves small neural networks.
1. Forget Gate
2. Add Gate (usually called the input gate)
3. Output Gate<br>

We can think of a single LSTM unit as having 3 inputs:
1. Memory (This memory flows from one LSTM unit to the next. <b><i>In short, the unit receives the previous memory, adds the gates' information to it to form the current memory, and that in turn acts as the previous memory for the next LSTM unit</i></b>)
2. Previous output
3. Current input

## Forget Gate:
It looks at the previous output and the current input, then decides what to forget/remove from the previous memory. We do not want all previous values affecting our current value, so this gate says what to forget. Say you have 4 previous memory values (Cₜ₋₁) [0.1, 0.4, 0.6, 0.2] and you multiply them element-wise by the gate values (fₜ) [1, 0, 1, 0] (the 0's mark what the gate says to forget); the result [0.1, 0, 0.6, 0] is what continues along the memory flow. In practice fₜ comes from a sigmoid, so its values lie between 0 and 1 rather than being exactly 0 or 1.

## Add Gate
This gate decides what information from the current input and previous output should be added to the memory flow.
Here a regular neural network layer produces candidate values (gₜ) [0.7, 0.5, 0.2, 0.1], and the gate (iₜ) [1, 0, 0, 1] says which of them to add, giving [0.7, 0, 0, 0.1], say jₜ, which is then added to the memory flow.
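
The two worked examples above can be combined into one memory update. The sketch below uses the same numbers as the text and assumes the usual LSTM update rule Cₜ = fₜ * Cₜ₋₁ + iₜ * gₜ (element-wise); the variable names are only illustrative.

```python
import numpy as np

c_prev = np.array([0.1, 0.4, 0.6, 0.2])  # previous memory C_{t-1}
f_t = np.array([1, 0, 1, 0])             # forget gate: 0 means "forget this slot"
g_t = np.array([0.7, 0.5, 0.2, 0.1])     # candidate values from the current step
i_t = np.array([1, 0, 0, 1])             # add gate: 1 means "add this candidate"

kept = f_t * c_prev   # [0.1, 0, 0.6, 0], as in the Forget Gate example
j_t = i_t * g_t       # [0.7, 0, 0, 0.1], as in the Add Gate example
c_t = kept + j_t      # new memory, approximately [0.8, 0, 0.6, 0.1]
print(c_t)
```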

## Output Gate
What goes out as output to the next unit is specified by this gate, oₜ. Basically, only a part of the information in the memory is sent out as output.<br>
<br>
Please note that each of (fₜ, gₜ, iₜ, oₜ) has its own weights, and because of these extra weights an LSTM takes more computation time than a SimpleRNN.<br>
The weights of all these gates are learned during training.
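
Putting the three gates together, a single LSTM time step might look like the following NumPy sketch. It is an illustration under stated assumptions rather than the Keras implementation: the weights are random, the stacking order f, i, g, o inside `W`, `U` and `b` is chosen here for convenience, and `lstm_step` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    Rows are stacked in the order f, i, g, o (an assumption of this sketch).
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b           # all four pre-activations at once
    f_t = sigmoid(z[0 * units:1 * units])  # forget gate
    i_t = sigmoid(z[1 * units:2 * units])  # add (input) gate
    g_t = np.tanh(z[2 * units:3 * units])  # candidate values
    o_t = sigmoid(z[3 * units:4 * units])  # output gate
    c_t = f_t * c_prev + i_t * g_t         # updated memory
    h_t = o_t * np.tanh(c_t)               # output passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h_t, c_t = lstm_step(rng.normal(size=input_dim), np.zeros(units), np.zeros(units), W, U, b)
print(h_t.shape, c_t.shape)
```

The four weight blocks are the reason an LSTM layer has roughly four times as many parameters as a SimpleRNN layer of the same size.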

## Text Generation

In [1]:
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, GRU, Embedding, Dropout

In [None]:
import glob
file_list = glob.glob('/content/drive/MyDrive/Colab Notebooks/NLP/datasets/Tagore/data/*.txt')

text_data = []
for file in file_list:
  with open(file, 'rb') as f:  # separate name so the loop variable is not shadowed
    file_content = f.read().decode('utf-8')
    text_data.append(file_content)
    
len(text_data)

20

In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer

## Understand how Tokenizer works (very important)

In [49]:
test_doc = ["This is a single document with some words"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(test_doc)
print("After fitting a word index is internally created and you did this on training data: ",tokenizer.word_index)
sequences = tokenizer.texts_to_sequences(test_doc)
print("Generate sequences for each document and for one document it looks like this: ", sequences)
test_doc_2 = ["This is a second document with some words"]
sequences = tokenizer.texts_to_sequences(test_doc_2)
print("Say you have a test set with document like this, this would be the result, you see not all numbers are present: ",sequences)

After fitting a word index is internally created and you did this on training data:  {'this': 1, 'is': 2, 'a': 3, 'single': 4, 'document': 5, 'with': 6, 'some': 7, 'words': 8}
Generate sequences for each document and for one document it looks like this:  [[1, 2, 3, 4, 5, 6, 7, 8]]
Say you have a test set with document like this, this would be the result, you see not all numbers are present:  [[1, 2, 3, 5, 6, 7, 8]]


In [44]:
test_doc = ["This is a single document with some words", "This is second document"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(test_doc)
print("After fitting a word index is internally created and you did this on training data: ",tokenizer.word_index)
sequences = tokenizer.texts_to_sequences(test_doc)
print("Generate sequences for each document: ", sequences)
test_doc_2 = ["This is a second document with some words", "This is test data document"]
sequences = tokenizer.texts_to_sequences(test_doc_2)
print("Say you have a test set with documents like this, this would be the result, you see not all numbers are present: ",sequences)

After fitting a word index is internally created and you did this on training data:  {'this': 1, 'is': 2, 'document': 3, 'a': 4, 'single': 5, 'with': 6, 'some': 7, 'words': 8, 'second': 9}
Generate sequences for each document:  [[1, 2, 4, 5, 3, 6, 7, 8], [1, 2, 9, 3]]
Say you have a test set with documents like this, this would be the result, you see not all numbers are present:  [[1, 2, 4, 9, 3, 6, 7, 8], [1, 2, 3]]
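
The behaviour seen above (ids assigned by word frequency, words unseen during fitting silently dropped) can be mimicked in a few lines of plain Python. This is a simplified sketch, not the Keras implementation: it only lowercases and splits on whitespace, and `fit_word_index` and `to_sequences` are hypothetical helper names.

```python
from collections import Counter

def fit_word_index(docs):
    """Assign ids starting at 1, most frequent word first (ties keep first-seen order)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(docs, word_index):
    """Map each document to ids, skipping words that were not seen while fitting."""
    return [[word_index[w] for w in doc.lower().split() if w in word_index]
            for doc in docs]

train_docs = ["This is a single document with some words", "This is second document"]
test_docs = ["This is a second document with some words", "This is test data document"]

toy_index = fit_word_index(train_docs)
toy_sequences = to_sequences(test_docs, toy_index)
print(toy_index)
print(toy_sequences)
```

The words "test" and "data" never appeared in the training documents, so they simply vanish from the second sequence. Keras' `Tokenizer` also accepts an `oov_token` argument if unseen words should map to a reserved id instead of disappearing.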


In [45]:
test_doc_lists = [["This is one document"], 
                  ["This is another document"]]
 
tokenizer = Tokenizer()
tokenizer.fit_on_texts(test_doc_lists)
print("For list of lists, each list will be treated as single token: ",tokenizer.word_index)
sequences = tokenizer.texts_to_sequences(test_doc_lists)
print("Now sequences will be generated for exact documents: ", sequences)
test_doc_lists_2 = [["some document from test data"],
                    ["This is one document"],
                    ["This is on document"],
                    ["This is ONE document"]]
sequences = tokenizer.texts_to_sequences(test_doc_lists_2)
print("Test data sequences for different use-cases: ", sequences)

For list of lists, each list will be treated as single token:  {'this is one document': 1, 'this is another document': 2}
Now sequences will be generated for exact documents:  [[1], [2]]
Test data sequences for different use-cases:  [[], [1], [], [1]]


## Tokenizer for Tagore Data

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_data)
word_idx = tokenizer.word_index
idx_word = tokenizer.index_word
print("Length of word index : ", len(word_idx))

Length of word index :  29566


In [None]:
sequences = tokenizer.texts_to_sequences(text_data)
print(list(sequences[0][:15]))
print(len(sequences))

[2, 57, 43, 256, 3, 2068, 37, 544, 729, 1, 17, 256, 9, 16, 2]
20


### Features and labels for text generation
Text generation means that, for every given set of words, we have to predict the next word.<br>
So we maintain features and labels, where the features are a fixed-length window of words (as word ids) and the label is the next word to appear (also an id).<br>

**Eg** <br>
sequence = [1,2,3,4,5,6,7,8,9,10]<br>
features_length = 3<br>


---


features = [1,2,3], labels = [4]<br>
features = [2,3,4], labels = [5]<br>
features = [3,4,5], labels = [6]<br>
..<br>
..<br>
..<br>
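
The toy example above can be checked with a small helper. This is a minimal sketch; `make_windows` is a hypothetical name, and the toy variables are kept separate from the notebook's `features` and `labels`.

```python
def make_windows(sequence, features_length):
    """Slide a fixed-length window over `sequence`; the label is the id right after it."""
    features, labels = [], []
    for i in range(len(sequence) - features_length):
        features.append(sequence[i:i + features_length])
        labels.append(sequence[i + features_length])
    return features, labels

toy_features, toy_labels = make_windows([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3)
for feat, lbl in zip(toy_features, toy_labels):
    print(feat, lbl)
```

A sequence of 10 ids with a window of 3 gives 7 feature/label pairs, the last one being [7, 8, 9] with label 10.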

In [None]:
features_length = 20
features = []
labels = []
for seq in sequences:
  for i in range(len(seq)-features_length):
    feat = seq[i:i+features_length]
    features.append(feat)
    lbl = seq[features_length+i]
    labels.append(lbl)

In [None]:
# Show the first three windows, each followed by its next-word label
for feat, lbl in zip(features[:3], labels[:3]):
  print(feat)
  print(lbl)

[2, 57, 43, 256, 3, 2068, 37, 544, 729, 1, 17, 256, 9, 16, 2, 169, 3, 752, 1218, 32]
44
[57, 43, 256, 3, 2068, 37, 544, 729, 1, 17, 256, 9, 16, 2, 169, 3, 752, 1218, 32, 44]
650
[43, 256, 3, 2068, 37, 544, 729, 1, 17, 256, 9, 16, 2, 169, 3, 752, 1218, 32, 44, 650]
5

In [None]:
from sklearn.utils import shuffle
final_features, labels = shuffle(features, labels, random_state=1)
final_features = np.array(final_features[:10000])
labels = np.array(labels[:10000])

In [None]:
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(final_features,
                                                                            labels,
                                                                            test_size=0.2,
                                                                            random_state=1)

In [None]:
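# Note: Tokenizer ids start at 1, so the largest id equals len(word_idx) and would not fit
# in an array with len(word_idx) columns. len(word_idx) + 1 (as used in the classification
# section) is the safer size; the value is left unchanged here so the outputs still match.
word_ids_length = len(word_idx)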
train_labels_encoded = np.zeros((len(train_labels), word_ids_length))
test_labels_encoded = np.zeros((len(test_labels), word_ids_length))

for i, val in enumerate(train_labels):
  train_labels_encoded[i, val] = 1

for i, val in enumerate(test_labels):
  test_labels_encoded[i, val] = 1
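
The same one-hot encoding can be written without an explicit loop. The sketch below uses a toy vocabulary and sizes the array as `len(word_index) + 1`, since ids start at 1 and id 0 is never assigned; `one_hot` is a hypothetical helper name.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0  # row i gets a 1 in column labels[i]
    return encoded

toy_word_index = {"the": 1, "cat": 2, "sat": 3}  # toy stand-in for tokenizer.word_index
num_classes = len(toy_word_index) + 1            # +1 because ids start at 1
toy_encoded = one_hot([3, 1, 2], num_classes)
print(toy_encoded)
```

For a vocabulary of tens of thousands of words this dense matrix gets large. Keras also provides the `sparse_categorical_crossentropy` loss, which takes the integer labels directly, as one way to avoid building it.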

### Build Model

In [None]:
model = Sequential()
model.add(Embedding(input_dim=word_ids_length, output_dim=50, input_length=features_length, trainable=True))
model.add(LSTM(64, dropout=0.1, recurrent_dropout=0.1, activation='tanh'))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(word_ids_length, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])



In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 50)            1478300   
                                                                 
 lstm (LSTM)                 (None, 64)                29440     
                                                                 
 dense (Dense)               (None, 64)                4160      
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 29566)             1921790   
                                                                 
Total params: 3,433,690
Trainable params: 3,433,690
Non-trainable params: 0
_________________________________________________________________
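
The parameter counts in the summary can be reproduced by hand, which is a useful check on how the LSTM's four weight blocks (three gates plus the candidate values) add up. The helper names below are hypothetical; the sizes are the ones printed above.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # four weight blocks (f, i, g, o), each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    return inputs * outputs + outputs

vocab_size, embed_dim, units = 29566, 50, 64
counts = [
    embedding_params(vocab_size, embed_dim),  # embedding
    lstm_params(embed_dim, units),            # lstm
    dense_params(units, 64),                  # dense
    dense_params(64, vocab_size),             # dense_1
]
print(counts, sum(counts))
```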


In [None]:
h = model.fit(train_features, train_labels_encoded, 
              validation_data=(test_features, test_labels_encoded),
              epochs=200, batch_size=64, verbose=1)

Epoch 1/200
...
Epoch 78

In [None]:
model.evaluate(test_features, test_labels_encoded, batch_size=64)



[47.52728271484375, 0.042500000447034836]

In [None]:
# model.predict(test_features[0])
test_features[0].shape

(20,)

In [None]:
test_features[0].shape[0]

20

## How to predict?
We take one input window below. For that input we get the index of the predicted word, append it to the input (dropping the oldest word so the window keeps its length of 20), and predict the next word again. We also look up the words that correspond to those indexes. The code below implements this.

In [None]:
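# Start from one test window, shaped (1, features_length) as the model expects
tf1 = test_features[0].reshape(1, features_length)

pred_words = []
for i in range(30):
  pred_val = model.predict(tf1)
  pred_index = np.argmax(pred_val)  # greedy choice: id of the most probable next word
  pred_word = tokenizer.index_word[pred_index]
  pred_words.append(pred_word)
  # Slide the window: drop the oldest id and append the predicted one
  tf1 = np.append(tf1.flatten()[1:], pred_index).reshape(1, features_length)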

In [None]:
print(pred_words)

['comes', 'with', 'fancy', 'to', 'you', 'you', 'you', 'coming\r', "let's", 'to', 'access', 'to', 'or\r', 'before', 'keeping', 'before', 'is', 'is', 'is', 'will', 'is', 'is', 'not', 'terms', 'to', 'great', 'movement\r', 'to', 'a', 'project']
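
The greedy `np.argmax` choice tends to repeat itself, as in 'you', 'you', 'you' above. One common alternative, which this notebook does not use, is to sample the next id from the predicted distribution with a temperature. The sketch below works on any probability vector; `sample_index` and the `temperature` parameter are illustrative names, not part of the notebook.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Sample an index from `probs` after rescaling by `temperature`.

    Low temperature behaves almost like argmax; high temperature is more random.
    """
    rng = rng or np.random.default_rng()
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()   # subtract the max for numerical stability
    p = np.exp(logits)
    p /= p.sum()             # renormalise to a probability distribution
    return int(rng.choice(len(p), p=p))

toy_probs = [0.1, 0.8, 0.1]
cold = sample_index(toy_probs, temperature=0.01, rng=np.random.default_rng(0))
warm = sample_index(toy_probs, temperature=1.5, rng=np.random.default_rng(0))
print(cold, warm)
```

In the prediction loop, this would be called on `pred_val[0]` in place of `np.argmax(pred_val)`.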


# Text Classification using LSTM

In [3]:
from gensim.parsing.preprocessing import remove_stopwords

In [51]:
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/NLP/datasets/quora/train.csv').sample(10000)
data.head()

Unnamed: 0,qid,question_text,target
3864,00c1056292bb09c267e7,Why are African Americans either victims or th...,1
158235,1ef420af71334ba636ab,If you unfollow a person then follow them agai...,0
195187,262775db63ece1e13b4c,What were the responsibilities of the Holy Rom...,0
243999,2fb96cc3a73cba1dce2d,Is Chinese stainless steel inferior?,0
317658,3e41ef02bf629d17e4b1,What kind of religious faith did the Mauryans ...,0


In [52]:
data.shape

(10000, 3)

In [53]:
docs = data['question_text'].str.lower().str.replace(r'[^a-z\s]', '', regex=True)
docs = docs.apply(remove_stopwords)
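
To see what the cleaning pattern does to a single question, here is a small sketch using the standard `re` module on one of the sample rows. `clean_text` is a hypothetical helper, and stop-word removal is left out.

```python
import re

def clean_text(text):
    """Lowercase, then drop every character that is not a-z or whitespace."""
    return re.sub(r'[^a-z\s]', '', text.lower())

cleaned = clean_text("Is Chinese stainless steel inferior?")
print(cleaned)
```

Note that digits and punctuation are removed entirely rather than replaced by a space, so a token such as "covid-19" would become "covid".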

In [54]:
from sklearn.model_selection import train_test_split
trainx, testx, trainy, testy = train_test_split(docs, data['target'], test_size=0.2, random_state=3)

In [55]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(trainx)
trainx_seq = tokenizer.texts_to_sequences(trainx)
testx_seq = tokenizer.texts_to_sequences(testx)

In [72]:
vocab_length = len(tokenizer.word_index) + 1

In [57]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [58]:
max_word_length = 11
trainx_pad = pad_sequences(trainx_seq, maxlen=max_word_length, padding='post')
testx_pad = pad_sequences(testx_seq, maxlen=max_word_length, padding='post')
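
What `pad_sequences` does with these arguments can be sketched in plain Python. The sketch assumes the Keras defaults for everything not set above: zeros are appended at the end (`padding='post'`), while sequences longer than `maxlen` lose tokens from the front (`truncating='pre'` is the default). `pad_post` is a hypothetical helper name.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad with `value` at the end; truncate from the front when a sequence is too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # keep the last `maxlen` ids
        padded.append(seq + [value] * (maxlen - len(seq)))  # fill up at the end
    return padded

toy_padded = pad_post([[1, 2, 3], [1, 2, 3, 4, 5]], maxlen=4)
print(toy_padded)
```

If the beginning of a question matters more than its end, passing `truncating='post'` to `pad_sequences` keeps the first words instead.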

In [59]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

In [86]:
model = Sequential()
model.add(layers.Embedding(input_dim=vocab_length, output_dim=50, input_length=max_word_length, trainable=True))
model.add(layers.LSTM(64, activation='tanh', dropout=0.1, recurrent_dropout=0.1))
model.add(layers.Flatten())  # no-op here: the LSTM already returns a 2-D (batch, 64) tensor
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [87]:
model.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_7 (Embedding)     (None, 11, 50)            646750    
                                                                 
 lstm_7 (LSTM)               (None, 64)                29440     
                                                                 
 flatten_4 (Flatten)         (None, 64)                0         
                                                                 
 dense_14 (Dense)            (None, 10)                650       
                                                                 
 dense_15 (Dense)            (None, 1)                 11        
                                                                 
Total params: 676,851
Trainable params: 676,851
Non-trainable params: 0
_________________________________________________________________


In [88]:
model.fit(trainx_pad, trainy,
          validation_data=(testx_pad, testy),
          epochs=20, verbose=1)

Epoch 1/20
...
Epoch 20/20


<keras.callbacks.History at 0x7fb139a1cc50>

In [89]:
model.evaluate(testx_pad, testy, batch_size=64)



[0.573844313621521, 0.9150000214576721]

In [90]:
model.predict(testx_pad)

array([[5.9874833e-01],
       [6.2185615e-02],
       [8.0461293e-02],
       ...,
       [1.0581951e-05],
       [1.8902509e-05],
       [1.4308875e-05]], dtype=float32)
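
The sigmoid output is a probability per question, so a threshold is needed to turn it into a class label. The sketch below applies 0.5 to a few of the values printed above; the threshold is an assumption and may need tuning if the two classes are imbalanced.

```python
import numpy as np

# A few of the probabilities shown above, in the same (n, 1) shape that predict returns
toy_probs = np.array([[5.9874833e-01], [6.2185615e-02], [1.0581951e-05]])

threshold = 0.5  # assumed cut-off
toy_classes = (toy_probs.ravel() >= threshold).astype(int)
print(toy_classes)
```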