<h3 style="text-align: center;">Big Data Analytics</h3>
<h3 style="text-align: center;">LSTM Storyteller</h3>

Once upon a time, there was a storyteller who wanted to create new fairy tales. The storyteller knew that to do this, they needed a powerful tool that could learn from existing stories and generate new ones. So, they decided to build an LSTM-based story generator.

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout, Input
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import pandas as pd
import io
import re
from IPython.display import clear_output
from keras.utils import np_utils
from keras.models import Model, load_model
from keras.optimizers import Adam, RMSprop
from keras.callbacks import LambdaCallback

In [2]:
# Load and preprocess the text corpus
with open('LSTM/cleaned_merged_fairy_tales_without_eos.txt', 'r', encoding='utf-8') as file:
    corpus_text = file.read().lower()  # Read and convert to lowercase



The storyteller began by loading and preprocessing a corpus of fairy tales. They then tokenized the text, removed special characters and numbers, and removed stop words. Next, the storyteller used the Tokenizer from Keras to convert the text into sequences and generate a categorical variable. They also defined a function to generate sequences of tokens to train the LSTM model.

In [3]:
corpus= corpus_text.split('.')
part = int(len(corpus)*.05)
part_corpus=corpus[:part]
print(len(part_corpus))
corpus = ". ".join(part_corpus)
print(type(corpus))
print(len(corpus))

8993
<class 'str'>
1248189


In [4]:
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Convert text to lowercase
text = corpus.lower()
# Remove special characters and numbers
text = re.sub('[^A-Za-z]+', ' ', text)

# Tokenize the text
tokens = word_tokenize(text)

# Remove stop words
#stop_words = set(stopwords.words('english'))
#tokens = [word for word in tokens if not word in stop_words]


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\manug\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\manug\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# Tokenizing the text
token_type = 'word'
if token_type == 'word':
    tokenizer = Tokenizer(char_level = False, filters = '')
else:
    tokenizer = Tokenizer(char_level = True, filters = '', lower = False)

tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1
token_list = tokenizer.texts_to_sequences([text])[0]

# printing interesting quntities:
print(f"Number of tokenized words: {total_words}")


Number of tokenized words: 9649


In [6]:
def generate_sequences(token_list, step):
    
    X = []
    y = []

    for i in range(0, len(token_list) - seq_length, step):
        X.append(token_list[i: i + seq_length])
        y.append(token_list[i + seq_length])
    
    # one-hot encoding, creating a categorical variable:
    y = np_utils.to_categorical(y, num_classes = total_words)
    
    num_seq = len(X)
    print('Number of sequences:', num_seq, "\n")
    
    return X, y, num_seq

step = 1
seq_length = 20

X, y, num_seq = generate_sequences(token_list, step)

X = np.array(X)
y = np.array(y)

# printing output:
print(f"Inout shape: {X.shape}")
print(f"Output shape: {y.shape}")

Number of sequences: 237293 

Inout shape: (237293, 20)
Output shape: (237293, 9649)


After defining the model architecture with an input layer, an embedding layer, an LSTM layer, and a dense layer, the storyteller compiled the model with the RMSprop optimizer and trained it using the fit method.

In [7]:
n_units = 256
embedding_size = 100

teInput_datat_in = Input(shape = (None,))
embedding = Embedding(total_words, embedding_size)
Input_data = embedding(text_in)
Input_data = LSTM(n_units)(Input_data)
teInput_datat_out = Dense(total_words, activation = 'softmaInput_data')(Input_data)

model = Model(text_in, text_out)
learning_rate = 0.001
opti = RMSprop(learning_rate = learning_rate)
model.compile(loss='categorical_crossentropy', optimizer=opti)

model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 100)         964900    
                                                                 
 lstm (LSTM)                 (None, 256)               365568    
                                                                 
 dense (Dense)               (None, 9649)              2479793   
                                                                 
Total params: 3,810,261
Trainable params: 3,810,261
Non-trainable params: 0
_________________________________________________________________


Once the model was trained, the storyteller saved it to a file using the pickle library. They also defined a function called `story_text` that takes a seed text, the number of words to generate, the trained model, the maximum sequence length, and a temperature value as input. The function then generates new stories using the trained LSTM model and the seed text provided.

In [53]:
epochs = 10
batch_size = 32
# this will take a while ...
model.fit(X, y, epochs=epochs, batch_size=batch_size, shuffle = True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x21784dc61a0>

In [54]:
import pickle
pickle.dump(model, open('storteller_lstm_3.pkl','wb'))

The storyteller also created a chatbot function called `story_chat` that takes user input and generates a response using the `story_text` function. They integrated the chatbot function with a speech-to-text library and a text-to-speech library to allow users to speak their input and hear the bot's response.

In [110]:
def story_text(seed_text, next_words, model, max_sequence_len, temp):
    output_text = seed_text
    #seed_text = start_story + seed_text
    
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = token_list[-max_sequence_len:]
        token_list = np.reshape(token_list, (1, max_sequence_len))
        
        probs = model.predict(token_list, verbose=0)[0]
        pred = pred_temp(probs, temperature = temp)
        
        if pred == 0:
            output_word = ''
        else:
            output_word = tokenizer.index_word[pred]
            
        if token_type == 'word':
            output_text += output_word + " "
            seed_text += output_word + " "
        else:
            output_text += output_word + " " 
            seed_text += output_word + " "
            
    return output_text

def pred_temp(preds, temperature=0.9):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [130]:
seed_text = "Once upon a time there was a little girl in a village waiting for a prince to start her exciting life in this wonderful world "
gen_words = 100
seq_length= 20
print('Temp 0.1')
print (story_text(seed_text, gen_words, model, seq_length, temp = 0.1))
print('Temp 0.5')
print (story_text(seed_text, gen_words, model, seq_length, temp = 0.5))
print('Temp 0.9')
print (story_text(seed_text, gen_words, model, seq_length, temp = 0.9))

Temp 0.1
Once upon a time there was a little girl in a village waiting for a prince to start her exciting life in this wonderful world i will do not forget to you herself in another now she took hold of the fountain as beautiful do you shall the charming charming princess asked beauty if she could not help because he was going to ask for the terrible that he could fetch you and see air in the air because he would not ask his good the boy to let him go on the road open but he there stood his he who asked for the fast in he had a good house for he was very good and beautiful his long golden apple are beautiful 
Temp 0.5
Once upon a time there was a little girl in a village waiting for a prince to start her exciting life in this wonderful world i must do that she from cried out to go her bring another terrible most eat me to morrow do off i do you take the way said asked the mice do not stay for answered pinocchio then the king s wife streets on the account of the ship would take our two 

In [121]:
# Start the chatbot
def story_chat(user_input,gen_words):
    try:

        while True:
            #user_input = input("You: ")
            if user_input.lower() == 'bye':
                print("Chatbot: Goodbye!")
                break
            print('hello')
            response = story_text(user_input, gen_words, model, seq_length, temp = .9)
            print(response)
            return response
    except:
        return "I can't understand, please say something else"
def ask_bot_story():
    from speechToText import speech_to_text
    from text_to_speech import speak
    speak('Say minimum 10 words to begin a story')    
    user_input = speech_to_text()
    #speak('How many words in the story?')
    #numbers = speech_to_text()
    gen_words = 250
    #print(gen_words)
    response = story_chat(user_input,gen_words)
    speak(response)

In summary, the LSTM-based story generator built by the storyteller allows for the generation of new fairy tales by learning from existing stories. The chatbot function also allows users to interact with the model and generate stories using voice commands.

In [129]:
ask_bot_story()

Say something...
You said: a girl found a haunted place in a secluded area inside a forest and then she found her family
hello


  preds = np.log(preds) / temperature


a girl found a haunted place in a secluded area inside a forest and then she found her familythat against against by against begged these his begged himself against against by who who your which gave them against against against we head six each which show himself your against begged against your aladdin against begged himself against against by who your whom father which them against few each that your begged himself by against against against by who against against these that ah majesty majesty our against against where against begged himself your begged against by against against by or each go against each begged himself against against against by s against this his which gave t against against against against against begged himself which him against begged your against begged himself by over against against s who your begged these over against against against against s each himself against ahmed each his against begged himself by against against by by who your whom father which the