<a href="https://colab.research.google.com/github/SharmaSanskar/ml-story-generation/blob/main/story_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# STORY GENERATION USING GRUs
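Vanilla RNNs struggle to learn long-term dependencies because their gradients tend to vanish or explode as they are propagated back through many time steps. LSTMs and GRUs mitigate the vanishing-gradient problem with gates that control how much of the recurrent state is kept or overwritten at each step. LSTMs maintain a separate cell state for this purpose, while GRUs gate the hidden state directly.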

### Importing Libraries

In [None]:
!pip install gradio


In [None]:
import numpy as np
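# Import everything from tensorflow.keras so the standalone keras package is not mixed in
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, GRU, Dense, Dropout
from tensorflow.keras.models import Sequential
import tensorflow.keras.utils as ku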

In [None]:
# Input file
filepath = './harry_potter.txt'
# Read the whole file as lowercase text; the context manager closes the file handle
with open(filepath, encoding='utf-8') as f:
    text = f.read().lower()

In [None]:
text



### Cleaning the Text
Split the text into lines, drop the empty ones, and remove punctuation from each remaining line.

In [None]:
import string
def clean_text(line):
  line = "".join(v for v in line if v not in string.punctuation)
  return line

In [None]:
corpus = [clean_text(line) for line in text.split("\n") if line.strip()]

In [None]:
corpus[:10]

['harry potter and the sorcerers stone',
 'chapter one',
 'the boy who lived',
 'mr and mrs dursley of number four privet drive were proud to say',
 'that they were perfectly normal thank you very much they were the last',
 'people youd expect to be involved in anything strange or mysterious',
 'because they just didnt hold with such nonsense',
 'mr dursley was the director of a firm called grunnings which made',
 'drills he was a big beefy man with hardly any neck although he did',
 'have a very large mustache mrs dursley was thin and blonde and had']
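A minimal sketch of the same cleaning step on a single made-up line, using `str.translate` instead of a character-by-character join. The name `clean_text_fast` and the sample string are hypothetical and are not part of the notebook. Note that `string.punctuation` only covers ASCII, so typographic characters such as curly apostrophes are left in place.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text_fast(line):
    """Hypothetical variant of clean_text built on str.translate."""
    return line.translate(_PUNCT_TABLE)

sample = "mr. and mrs. dursley, of number four, privet drive"
cleaned = clean_text_fast(sample)
print(cleaned)

# A curly apostrophe is not in string.punctuation, so it survives cleaning.
curly = clean_text_fast("you\u2019d")
print(curly)
```

If the input file contains typographic quotes, they would need to be normalised or removed separately before tokenization.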

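### Tokenization
ML models only work with numbers, so each word is mapped to an integer index. Every line is then expanded into its n-gram prefixes (the first two tokens, the first three, and so on), so that each prefix can serve as one training example.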

In [None]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    # Tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    # Convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

In [None]:
inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]

[[7, 123],
 [7, 123, 2],
 [7, 123, 2, 1],
 [7, 123, 2, 1, 631],
 [7, 123, 2, 1, 631, 155],
 [607, 38],
 [1, 140],
 [1, 140, 73],
 [1, 140, 73, 1036],
 [144, 2]]

In [None]:
total_words

6036
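The prefix expansion can be illustrated without Keras. The sketch below uses a toy two-line corpus and a hypothetical `build_word_index` helper that assigns ids by descending word frequency, which is an assumption meant to approximate how the Keras `Tokenizer` orders its `word_index`. It is not the notebook's tokenizer.

```python
def build_word_index(lines):
    """Assign 1-based ids by descending frequency (ties keep first-seen order)."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so equally frequent words keep their first-seen order
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}

def prefix_sequences(lines, word_index):
    """Expand every line into all prefixes that contain at least two tokens."""
    sequences = []
    for line in lines:
        ids = [word_index[w] for w in line.split()]
        for end in range(2, len(ids) + 1):
            sequences.append(ids[:end])
    return sequences

toy_corpus = ["the boy who lived", "the boy"]
toy_index = build_word_index(toy_corpus)
toy_sequences = prefix_sequences(toy_corpus, toy_index)
print(toy_index)
print(toy_sequences)
```

Id 0 is never assigned to a word because it is reserved for padding, which is why the vocabulary size is computed as `len(tokenizer.word_index) + 1`.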

### Padding
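Each sequence is left-padded with zeros to the length of the longest sequence. The last token of every padded sequence becomes the label (one-hot encoded), and the tokens before it become the predictors.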

In [None]:
def generate_padded_sequences(input_sequences, total_words):
    # Left-pad every sequence with zeros to the length of the longest one
    max_sequence_len = max(len(x) for x in input_sequences)
    input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

    # All tokens but the last are the predictors; the last token is the label
    predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
    # One-hot encode the labels for categorical cross-entropy
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences, total_words)
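To make the shapes concrete, here is a small pure-Python sketch of pre-padding followed by the predictor/label split. The helpers `pad_pre` and `one_hot` and the toy values are hypothetical stand-ins for `pad_sequences` and `to_categorical`, not the Keras implementations.

```python
def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [[pad_value] * (max_len - len(s)) + list(s) for s in sequences]

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    row = [0] * num_classes
    row[index] = 1
    return row

toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
toy_total_words = 5  # four words plus the reserved padding id 0

padded = pad_pre(toy_sequences)
toy_predictors = [row[:-1] for row in padded]  # everything except the last token
toy_labels = [one_hot(row[-1], toy_total_words) for row in padded]  # last token

print(padded)
print(toy_predictors)
print(toy_labels)
```

Padding on the left keeps the most recent words next to the position being predicted, which is why the label is always the final element.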

### Creating GRU Model
GRUs have only two gates (update and reset), compared with the three gates in LSTMs. With fewer parameters per unit, GRUs usually train faster. Some empirical comparisons also report that GRUs match or outperform LSTMs on smaller datasets, although the result depends on the task.
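For reference, one common way to write the GRU update is shown below. The symbols are assumptions introduced here: $x_t$ is the input at step $t$, $h_{t-1}$ the previous hidden state, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication. Sources differ on whether $z_t$ or $1 - z_t$ multiplies the previous state, and the Keras layer applies the reset gate slightly differently by default, so treat this as a sketch rather than the exact computation performed by the model.

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) && \text{(candidate state)} \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
$$

The three weight sets ($z$, $r$, and the candidate) are the reason a GRU layer has three times the parameters of a plain recurrent layer of the same width, where an LSTM has four times.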

In [None]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()   
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))  
    # Add Hidden Layer 1 - GRU Layer
    model.add(GRU(256, return_sequences=True))
    model.add(Dropout(0.2))
    # Add Hidden Layer 2 - GRU Layer
    model.add(GRU(256))
    model.add(Dropout(0.2))
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')  
    return model

model = create_model(max_sequence_len, total_words)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 17, 10)            60360     
                                                                 
 gru (GRU)                   (None, 17, 256)           205824    
                                                                 
 dropout (Dropout)           (None, 17, 256)           0         
                                                                 
 gru_1 (GRU)                 (None, 256)               394752    
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense (Dense)               (None, 6036)              1551252   
                                                                 
Total params: 2,212,188
Trainable params: 2,212,188
Non-
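The parameter counts in the summary can be reproduced by hand. The sketch below assumes that each GRU weight set carries two bias vectors, which is the default Keras GRU configuration as far as we know; the function names are hypothetical and the sizes are taken from the summary above.

```python
def embedding_params(vocab_size, embed_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embed_dim

def gru_params(input_dim, units, bias_vectors=2):
    # Three weight sets: update gate, reset gate, candidate state.
    # Each has an input kernel, a recurrent kernel and (assumed) two biases.
    return 3 * (input_dim * units + units * units + bias_vectors * units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit
    return input_dim * output_dim + output_dim

vocab, embed_dim, units = 6036, 10, 256

layer_counts = [
    embedding_params(vocab, embed_dim),  # embedding
    gru_params(embed_dim, units),        # gru
    gru_params(units, units),            # gru_1
    dense_params(units, vocab),          # dense
]
total_params = sum(layer_counts)

print(layer_counts)
print(total_params)
```

About 70% of the parameters sit in the final `Dense` layer, because it needs one output unit per vocabulary word.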

In [None]:
%%time
model.fit(predictors, label, epochs=200, verbose=1)

<keras.callbacks.History at 0x7f2f80045050>

In [None]:
model.save("/content/drive/MyDrive/story_generator_200")



INFO:tensorflow:Assets written to: /content/drive/MyDrive/story_generator_200/assets




### Generate Text
The prompt is tokenized, padded, and passed to the model. The word with the highest predicted probability is appended to the prompt, and the process repeats until the requested number of words has been generated.

In [None]:
def generate_text(seed_text, next_words):
    for _ in range(int(next_words)):
        # Encode the current text and left-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # Greedy decoding: pick the index with the highest predicted probability
        predicted = model.predict(token_list, verbose=0)
        predicted_class = int(np.argmax(predicted, axis=-1)[0])

        # Map the index back to its word; index 0 is padding and has no word
        output_word = tokenizer.index_word.get(predicted_class, "")
        seed_text += " " + output_word
    return seed_text
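The decoding loop is independent of the network itself. The following sketch runs the same greedy strategy against a hypothetical lookup table (`TOY_TABLE`) that stands in for the trained model, so the probabilities are invented purely for illustration.

```python
def greedy_generate(seed_words, next_words, next_word_probs):
    """Repeatedly append the most probable next word to the running text."""
    words = list(seed_words)
    for _ in range(next_words):
        probs = next_word_probs(words)
        best = max(probs, key=probs.get)  # argmax over the candidate words
        words.append(best)
    return words

# Hypothetical stand-in for the model: it only looks at the last word.
TOY_TABLE = {
    "harry": {"potter": 0.7, "looked": 0.3},
    "potter": {"and": 0.6, "was": 0.4},
    "and": {"the": 0.9, "harry": 0.1},
    "the": {"boy": 0.8, "stone": 0.2},
}

def toy_next_word_probs(words):
    return TOY_TABLE.get(words[-1], {"the": 1.0})

story = " ".join(greedy_generate(["harry"], 4, toy_next_word_probs))
print(story)
```

Because the most probable word is always chosen, the same prompt always yields the same continuation and the output can fall into repeating loops. Sampling from the predicted distribution is a common alternative when more varied text is wanted.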

In [None]:
import gradio as gr
gr.Interface(generate_text, title="GRU Text Generator", inputs=["text", "number"], outputs="textbox").launch()

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
Running on public URL: https://55391.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://huggingface.co/spaces)


(<fastapi.applications.FastAPI at 0x7f2eebf7d510>,
 'http://127.0.0.1:7860/',
 'https://55391.gradio.app')

----

# RISE OF TRANSFORMERS
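In 2017, researchers at Google introduced the Transformer architecture in the paper "Attention Is All You Need". Instead of reading a sentence one token at a time, a Transformer processes the whole sequence in parallel and uses self-attention to weigh the tokens that matter most for each position. Many state-of-the-art NLP models today are built on this architecture.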

### Importing Libraries

In [None]:
!pip install simpletransformers


In [None]:
import logging
from simpletransformers.language_modeling import LanguageModelingModel, LanguageModelingArgs
from simpletransformers.language_generation import LanguageGenerationModel, LanguageGenerationArgs

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

### Fine-tuning GPT-2
GPT-2 is a transformer-based language model trained on a dataset of 8 million web pages. Its largest version has 1.5 billion parameters; the `gpt2` checkpoint loaded below is the smallest variant, with about 124 million parameters. <br/>
Fine-tuning GPT-2 on our input text lets us build on what the model has already learned instead of training a language model from scratch.

In [None]:
model_args = LanguageModelingArgs()
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.num_train_epochs = 3
model_args.dataset_type = "simple"
model_args.mlm = False

# Input file
train_file = "harry_potter.txt"

model = LanguageModelingModel(
    "gpt2", "gpt2", args=model_args, train_files=train_file
)


In [None]:
%%time
model.train_model(train_file)




Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/106 [00:00<?, ?it/s]



Running Epoch 1 of 3:   0%|          | 0/106 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/106 [00:00<?, ?it/s]

CPU times: user 4min 7s, sys: 1min 56s, total: 6min 3s
Wall time: 6min 28s


(318, 3.5872417253518254)
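The returned tuple can be sanity-checked with a little arithmetic. The sketch below assumes that the first value is the number of optimizer steps and that the second is the average training cross-entropy loss in nats per token; neither reading is confirmed by the output itself. Under that assumption, the exponential of the loss gives an approximate training perplexity.

```python
import math

batches_per_epoch = 106  # taken from the "Running Epoch" progress bars
num_epochs = 3           # model_args.num_train_epochs
total_steps = batches_per_epoch * num_epochs

# Values returned by train_model in the run above
reported_steps, reported_loss = 318, 3.5872417253518254

# Assumption: the loss is a mean cross-entropy in nats, so exp(loss) is a perplexity
perplexity = math.exp(reported_loss)

print(total_steps == reported_steps)
print(round(perplexity, 1))
```

The step count also matches the checkpoint directory name `checkpoint-318-epoch-3` used for generation.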

### Generate Text

In [None]:
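def gpt_generate(seed_text, next_words):
    # Cap the output length with the number entered in the interface
    language_args = LanguageGenerationArgs()
    language_args.max_length = int(next_words)
    # Load the fine-tuned checkpoint written by train_model
    model = LanguageGenerationModel("gpt2", "outputs/checkpoint-318-epoch-3", args=language_args)
    # generate() returns a list of generated texts; return the first one
    output = model.generate(seed_text)
    return output[0]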

In [None]:
gr.Interface(gpt_generate, title="GPT-2 Text Generator", inputs=["text", "number"], outputs="textbox").launch()

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
Running on public URL: https://25182.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://huggingface.co/spaces)


(<fastapi.applications.FastAPI at 0x7f2eebf7d510>,
 'http://127.0.0.1:7861/',
 'https://25182.gradio.app')