# Tensorflow Keras Tutorial - Text Generation using RNN (Part 7)

**What is Keras?** Keras is a wrapper that lets you build deep neural networks without getting into the intricate details of the network. It can use TensorFlow or Theano as a backend. This tutorial series covers Keras from beginner to intermediate level.

YOU CAN CHECK OUT THE REST OF THE TUTORIALS IN THIS SERIES.

[PART 1](https://www.kaggle.com/akashkr/tf-keras-tutorial-neural-network-part-1)<br>
[PART 2](https://www.kaggle.com/akashkr/tf-keras-tutorial-cnn-part-2)<br>
[PART 3](https://www.kaggle.com/akashkr/tf-keras-tutorial-binary-classification-part-3)<br>
[PART 4](https://www.kaggle.com/akashkr/tf-keras-tutorial-pretrained-models-part-4)<br>

<font color=red>IF YOU HAVEN'T GONE THROUGH THE PREVIOUS PARTS OF THIS TUTORIAL, IT'S RECOMMENDED THAT YOU GO THROUGH THEM FIRST.</font><br>
[PART 5](https://www.kaggle.com/akashkr/tf-keras-tutorial-basics-of-nlp-part-5)<br>
[PART 6](https://www.kaggle.com/akashkr/tf-keras-tutorial-bi-lstm-conv1d-gru-part-6)

In the previous notebooks we worked on text classification. Let's see how to generate text data.

In [None]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np

In [None]:
data_path = '../input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv'

# Overview of Dataset

Since training a text generation model takes a lot of time, we will use a smaller sample (the first 500 rows) to test our model.

In [None]:
df = pd.read_csv(data_path).head(500)

In [None]:
df.head()

Here we will use only the **Review Text** column to generate text. The rest of the columns are redundant for now.

In [None]:
print(f'Shape of data: {df.shape}')
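# Column dtypes and non-null counts (the gap to the row count is the number of missing values)
# df.info() prints its report itself and returns None, so it is not wrapped in print()
df.info()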

# Preprocessing
The preprocessing for this type of model is quite different from what we did for classification. Let's see how we are going to build the training data.
Let's take a sentence:
    `Wow, this dress is quite wonderful.`
    
1. Tokenize this sentence

    `[23, 5, 6, 34, 11, 21]` - suppose this is the tokenized sentence. Let's assume that this is the longest sentence in our corpus.
2. Slice into n-gram sequences and append - we expand one sentence into all of its prefixes of length 2 or more, like this

    `[23, 5]
    [23, 5, 6]
    [23, 5, 6, 34]
    [23, 5, 6, 34, 11]
    [23, 5, 6, 34, 11, 21]`
    
3. Pad sequences - we pre-pad every sequence with zeros so that each one is as long as the longest sequence in the corpus (which we have assumed to be this sentence)

    `[0, 0, 0, 0, 23, 5]
    [0, 0, 0, 23, 5, 6]
    [0, 0, 23, 5, 6, 34]
    [0, 23, 5, 6, 34, 11]
    [23, 5, 6, 34, 11, 21]`
    
4. Split into X and Y - we split the data so that the last column is the target variable and the remaining columns are the features.
    
    X<br>
    `[0, 0, 0, 0, 23]
    [0, 0, 0, 23, 5]
    [0, 0, 23, 5, 6]
    [0, 23, 5, 6, 34]
    [23, 5, 6, 34, 11]`
    
    Y<br>
    `[5]
    [6]
    [34]
    [11]
    [21]`
    
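5. Convert Y to one-hot encoding - each label becomes a vector as long as the vocabulary, with a single 1 at the index of the target token (the rows below are only schematic)

    Y<br>
    `[0, 0, 0, 0, 1, 0 ...]
    [0, 0, 0, 1, 0, 0 ...]
    [1, 0, 0, 0, 0, 0 ...]
    [0, 1, 0, 0, 0, 0 ...]
    [0, 0, 0, 0, 0, 1 ...]`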
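
The five steps above can be sketched in plain Python for the example sentence. This is only a minimal illustration: the token ids are the made-up ones from the example, `vocab_size` is a hypothetical vocabulary size, and the notebook itself relies on `pad_sequences` and `to_categorical` rather than these hand-written loops.

```python
# Plain-Python walk-through of steps 2 to 5 for one tokenized sentence.
# Token ids and vocab_size are illustrative assumptions, not real data.
tokens = [23, 5, 6, 34, 11, 21]
vocab_size = 40  # hypothetical; must be larger than the largest token id

# Step 2: slice into n-gram prefixes of length 2..len(tokens)
n_grams = [tokens[:t + 1] for t in range(1, len(tokens))]

# Step 3: pre-pad every prefix with zeros up to the longest length
max_len = max(len(seq) for seq in n_grams)
padded = [[0] * (max_len - len(seq)) + seq for seq in n_grams]

# Step 4: the last column is the target, the rest are the features
xs = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

# Step 5: one-hot encode each label (a single 1 at position `label`)
ys = [[1 if i == label else 0 for i in range(vocab_size)] for label in labels]

for x, label in zip(xs, labels):
    print(x, "->", label)
```

Each printed row pairs a padded prefix with the token that follows it, which is exactly the "given these words, predict the next one" task the model is trained on.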

In [None]:
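# Tokenization
# Tokenizer lowercases text by default, so the same texts are used for fitting and encoding
texts = df['Review Text'].astype(str)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# +1 because index 0 is reserved for padding and never assigned to a word
total_words = len(tokenizer.word_index)+1
tokenized_sentences = tokenizer.texts_to_sequences(texts)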

In [None]:
# Slice each tokenized sentence into n-gram prefix sequences
input_sequences = list()
for i in tokenized_sentences:
    for t in range(1, len(i)):
        n_gram_sequence = i[:t+1]
        input_sequences.append(n_gram_sequence)
        
# Pre pad with max sequence length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

In [None]:
# create predictors and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

# Modelling
Enable GPU for faster computation

In [None]:
model = Sequential()
model.add(Embedding(total_words, 40, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(250)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
earlystop = EarlyStopping(monitor='loss', min_delta=0, patience=4, verbose=0, mode='auto')
history = model.fit(xs, ys, epochs=50, verbose=1, callbacks=[earlystop])

In [None]:
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.show()
    
plot_graphs(history, 'accuracy')

# Prediction
Now we will take a seed text, encode it and predict the next word. Then we append that word to the seed and predict the next one, and so on for some number of words, say 40.
Note that the quality of the output gradually decreases, because every predicted word is fed back into the seed and any wrong prediction influences all the words that follow.
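
The feedback loop described above can be sketched without the trained network. In this minimal example `toy_next_word` and `predict_next` are hypothetical stand-ins for the model (a simple lookup on the last word), used only to show how each prediction becomes part of the next input.

```python
# Greedy word-by-word generation with a stand-in for the trained model.
# `toy_next_word` is a hypothetical lookup table, not the LSTM from this notebook.
toy_next_word = {
    "this": "dress", "dress": "is", "is": "so",
    "so": "pretty", "pretty": "and", "and": "soft",
}

def predict_next(words):
    """Stand-in for the model: choose the next word from the last word only."""
    return toy_next_word.get(words[-1], "")

def complete(seed_text, next_words):
    words = seed_text.lower().split()
    for _ in range(next_words):
        next_word = predict_next(words)
        if not next_word:        # nothing to predict, stop early
            break
        words.append(next_word)  # the prediction becomes part of the next input
    return " ".join(words)

print(complete("This dress", 3))
```

Because the loop only ever sees its own output after the seed runs out, one poor prediction early on steers everything generated after it.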

In [None]:
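def complete_this_paragraph(seed_text, next_words):
    for _ in range(next_words):
        # Encode the current seed and pre-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes is not available in newer TensorFlow releases,
        # so take the argmax of the softmax output instead
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])

        # Map the predicted index back to its word (index 0 is padding and has no word)
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text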

In [None]:
complete_this_paragraph("this is a good", 40)

In [None]:
complete_this_paragraph("i loved that dress", 40)

In [None]:
complete_this_paragraph("This shirt is so", 40)

Voila! Here it is. The generated text does make some sense. You can increase the training data size to get better results.