<a href="https://colab.research.google.com/github/MariyaSha/StoryTeller/blob/master/StoryTeller.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Story Teller Neural Network**

by: Mariya Sha
<br>
email: mariyasha888@gmail.com


In [1]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
import string
import re
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time
import random

from google.colab import drive
drive.mount('/content/drive')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


**Building database from 2 different books:**
<br>
*   Fairytales by The Brothers Grimm
*   Alice's Advantures in Wonderland by Lewis Carroll



In [0]:
# loading and filtering functions for each book 

def load_fairytales(text_file, punct, not_a_word):
    '''
    param text_file: url to Project Gutenberg's text file for Fairytales by The Brothers Grimm
    param punct: a string of punctuation characters we'd like to filter
    param not_a_word: a list of words we'd like to filter
    '''
    book = open(text_file, encoding='cp1252')
    book = book.read()
    book = book[2376:519859]
    book_edit = re.sub('[(+*)]', '', book)
    words = word_tokenize(book_edit.lower())

    # filtering punctuation inside tokens (example: didn't or wow!)
    for word in words:
        for char in word:
            if char in punct:
                word = word.replace(char, "")

    # filtering punctuation as alone standing tokens(example: \ or ,)
    words = [word for word in words if word not in punct and word not in not_a_word]

    print('Fairytales database includes {} tokens, and {} unique tokens after editing'.format(len(words), len(set(words))))            
    return words

def load_alice(text_file, punct, not_a_word):
    '''
    param text_file: url to Project Gutenberg's text file for Alice's Advantures in Wonderland by Lewis Carroll
    param punct: a string of punctuation characters we'd like to filter
    param not_a_word: a list of words we'd like to filter
    '''
    book = open(text_file, 'r')
    book = book.read()
    book = book[715:145060]
    book_edit = re.sub('[+]', '', book)
    book_edit = re.sub(r'(CHAPTER \w+.\s)', '', book)
    words = word_tokenize(book_edit.lower())
    
    word_list = []
    
    # filtering punctuation and non-words
    for word in words:
        for char in word:
            if char in punct:
                word = word.replace(char, "")
        if word not in punct and word not in not_a_word:
            word_list.append(word)

    print('Alice database includes {} tokens, and {} unique tokens after editing'.format(len(word_list), len(set(word_list)))) 
    return word_list

**PLEASE NOTE!**
<br>
Replace the **fairytales_url** and **alice_url** in the following cell
<br>
with the location of the .txt files in **YOUR OWN GOOGLE DRIVE!**
<p>
.txt files can be downloaded from my Github:
<br>
https://github.com/MariyaSha/StoryTeller
<br>
Please save them on your Google Drive and specify their location below

In [3]:
# PARAMETERS FOR LOADING FUNCTIONS

# defining the punctuation characters we'd like to filter
punct = string.punctuation
punct = punct.replace('-', "") + '’' + '‘'

# word ending and special charecters that won't be caught by the filters inside the functions
not_a_word = ['s', '--', 'nt', 've', 'll', 'd']

# url of the Google Drive location of the .txt book files
fairytales_url = '/content/drive/My Drive/Colab Notebooks/StoryTeller/FairytalesByTheBrothersGrimm.txt'
alice_url = '/content/drive/My Drive/Colab Notebooks/StoryTeller/AlicesAdvanturesInWonderland.txt'

brothers_grimm = load_fairytales(fairytales_url, punct, not_a_word)
print('Checking Fairytales tokens:\n', brothers_grimm[:100])

print()

alice = load_alice(alice_url, punct, not_a_word)    
print('Checking Alice tokens:\n', alice[:100])

Fairytales database includes 100811 tokens, and 5329 unique tokens after editing
Checking Fairytales tokens:
 ['the', 'golden', 'bird', 'a', 'certain', 'king', 'had', 'a', 'beautiful', 'garden', 'and', 'in', 'the', 'garden', 'stood', 'a', 'tree', 'which', 'bore', 'golden', 'apples', 'these', 'apples', 'were', 'always', 'counted', 'and', 'about', 'the', 'time', 'when', 'they', 'began', 'to', 'grow', 'ripe', 'it', 'was', 'found', 'that', 'every', 'night', 'one', 'of', 'them', 'was', 'gone', 'the', 'king', 'became', 'very', 'angry', 'at', 'this', 'and', 'ordered', 'the', 'gardener', 'to', 'keep', 'watch', 'all', 'night', 'under', 'the', 'tree', 'the', 'gardener', 'set', 'his', 'eldest', 'son', 'to', 'watch', 'but', 'about', 'twelve', 'o', 'clock', 'he', 'fell', 'asleep', 'and', 'in', 'the', 'morning', 'another', 'of', 'the', 'apples', 'was', 'missing', 'then', 'the', 'second', 'son', 'was', 'ordered', 'to', 'watch']

Alice database includes 26612 tokens, and 2596 unique tokens after editi

In [4]:
# combining the tokens from both books into one list
tokens = brothers_grimm + alice
print('The combined database includes {} tokens, and {} unique tokens'.format(len(tokens), len(set(tokens))))

The combined database includes 127423 tokens, and 6332 unique tokens


**Organizing the database into feature tokens and target tokens**

In [5]:
n_features = 5 # if changes are made to n_features, train_data must be adjusted accordingly
embedding_dim = 10

# Itterating over the data and organizing it into 5 features and 1 target
train_data = [([tokens[i],
                tokens[i + 1],
                tokens[i + 2],
                tokens[i + 3],
                tokens[i + 4]],
                tokens[i + 5])
            for i in range(len(tokens) - n_features)]

print('Checking train_data content:\n', train_data[:3])

Checking train_data content:
 [(['the', 'golden', 'bird', 'a', 'certain'], 'king'), (['golden', 'bird', 'a', 'certain', 'king'], 'had'), (['bird', 'a', 'certain', 'king', 'had'], 'a')]


**Defining the StoryTeller Neural Network structure**

In [0]:
#Structuring the Neural Network
class StoryTeller(nn.Module):

    def __init__(self, vocab_size, embedding_dim, n_features):
        super(StoryTeller, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(n_features * embedding_dim, 128)
        self.linear2 = nn.Linear(128, 768)   
        self.linear3 = nn.Linear(768, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = F.relu(self.linear2(out))
        out = self.linear3(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

**Defining the training function with an optional GPU connection**

In [0]:
#Training Function
def train(model, train_data, epochs, word_to_idx): 
    #checking for available GPU 
    if torch.cuda.is_available():
        device = torch.device("cuda:0")
        print('Training on GPU...')
    else:
        device = torch.device("cpu")
        print('No GPU available: Training on CPU...')
    model.to(device)  

    #training begins
    for e in range(epochs):
        model.train()
        steps = 0
        print_every = 100
        running_loss = 0
        for feature, target in train_data:
            feature_idx = torch.tensor([word_to_idx[word] for word in feature], dtype=torch.long)
            feature_idx = feature_idx.to(device)
            steps += 1
            model.zero_grad()
            log_probs = model(feature_idx)
            target_tensor = torch.tensor([word_to_idx[target]], dtype=torch.long)
            target_tensor = target_tensor.to(device)
            loss = criterion(log_probs, target_tensor)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

            if steps % print_every == 0:
              model.eval()  
              average_loss.append(running_loss/print_every)
              #Printing Training information    
              print("Epoch: {}/{}.. ".format(e+1, epochs),
                    "Running Loss: {:.3f}.. ".format(running_loss/print_every))        
              running_loss = 0
              model.train()
    return model

**Defining training parameters and proceeding with training and evaluation**

In [8]:
# Vocabulary
vocab = set(tokens)
vocab_size = len(vocab)

# Word Mappings
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for i, word in enumerate(vocab)}

# Training Parameters
model = StoryTeller(vocab_size, embedding_dim, n_features)
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

epochs = 16
average_loss = []
device = 0

start_time = time.time()
model = train(model, train_data, epochs, word_to_idx)  

# Training Summary
print('-----------------------------------------------------\n TRAINING END \n-----------------------------------------------------')
print('Training Took {} Minutes'.format(round((time.time() - start_time)/60), 2))
print('Highest Loss Value: {} / Lowest Loss Value: {}'.format(max(average_loss), min(average_loss)))

Training on GPU...
Epoch: 1/16..  Running Loss: 8.621.. 
Epoch: 1/16..  Running Loss: 8.285.. 
Epoch: 1/16..  Running Loss: 7.742.. 
Epoch: 1/16..  Running Loss: 7.661.. 
Epoch: 1/16..  Running Loss: 6.971.. 
Epoch: 1/16..  Running Loss: 6.776.. 
Epoch: 1/16..  Running Loss: 6.564.. 
Epoch: 1/16..  Running Loss: 6.450.. 
Epoch: 1/16..  Running Loss: 6.640.. 
Epoch: 1/16..  Running Loss: 6.510.. 
Epoch: 1/16..  Running Loss: 6.153.. 
Epoch: 1/16..  Running Loss: 6.679.. 
Epoch: 1/16..  Running Loss: 6.128.. 
Epoch: 1/16..  Running Loss: 5.928.. 
Epoch: 1/16..  Running Loss: 6.125.. 
Epoch: 1/16..  Running Loss: 6.253.. 
Epoch: 1/16..  Running Loss: 6.136.. 
Epoch: 1/16..  Running Loss: 5.791.. 
Epoch: 1/16..  Running Loss: 5.706.. 
Epoch: 1/16..  Running Loss: 6.202.. 
Epoch: 1/16..  Running Loss: 6.266.. 
Epoch: 1/16..  Running Loss: 6.122.. 
Epoch: 1/16..  Running Loss: 5.971.. 
Epoch: 1/16..  Running Loss: 6.112.. 
Epoch: 1/16..  Running Loss: 5.607.. 
Epoch: 1/16..  Running Loss: 6.

**Saving checkpoint to drive**
<br>
Replace **checkpoint_url** with a location in **YOUR OWN GOOGLE DRIVE**

In [9]:
checkpoint_url = '/content/drive/My Drive/Colab Notebooks/StoryTeller/checkpoint.pth'

checkpoint = {'model': model,
              'state_dict': model.state_dict(),
              'word_to_idx': word_to_idx,
              'idx_to_word': idx_to_word,
              'epochs': epochs,
              'average_loss': average_loss,
              'device': device,
              'optimizer_state': optimizer.state_dict()}

torch.save(checkpoint, checkpoint_url)

  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "


**Loading model from the freshly-saved checkpoint**

In [11]:
def load_checkpoint(filepath):
    checkpoint = torch.load(filepath)
    model = checkpoint['model']
    model.optimizer_state = checkpoint['optimizer_state']
    model.load_state_dict(checkpoint['state_dict'])
    device = checkpoint['device']
    #word_to_idx = checkpoint['word_to_idx']
    #idx_to_word = checkpoint['idx_to_word']
    #epochs = checkpoint['epochs']
    #average_loss= checkpoint['average_loss']
    return model

model = load_checkpoint(checkpoint_url)
model

StoryTeller(
  (embeddings): Embedding(6332, 10)
  (linear1): Linear(in_features=50, out_features=128, bias=True)
  (linear2): Linear(in_features=128, out_features=768, bias=True)
  (linear3): Linear(in_features=768, out_features=6332, bias=True)
)

Defining the predict function

In [0]:
def predict(model, first_words ,story_len ,top_k):
    '''
    param model: trained model
    param first_words: a string of 5 (n_feature) words to begin the story
    param story_len: an integer symbolizing the number of words you'd like the story to have
    param top_k: the number of top probabilities per word that the network will randomly select from
    '''
    feature = (first_words.lower()).split(" ")
    for i in feature:
        story.append(i)
    for i in range(story_len):
        feature_idx = torch.tensor([word_to_idx[word] for word in feature], dtype=torch.long)
        feature_idx = feature_idx.to(device)
        with torch.no_grad():
            output = model.double().forward(feature_idx)
        ps = torch.exp(output)
        topk_combined = ps.topk(top_k, sorted=True)
        #top kk probabilities
        topk_ps = topk_combined[0][0]
        #top kk classes
        topk_class = topk_combined[1][0]
        topk_class = [idx_to_word[int(i)] for i in topk_class]
        next_word = random.choice(topk_class)
        feature = feature[1:]
        feature.append(next_word)
        story.append(next_word)
    return story

**Collecting user input for the first 5 words of the story**
<br>
Please ensure the number of words you type is no different from your **n_features** during training

In [13]:
first_words = input('Type the first {} words to start the story:\nexample: A lovely day at the\n'.format(n_features))

Type the first 5 words to start the story:
example: A lovely day at the
a lovely day at the


**Predicting the rest of the story based-off the used input**

In [14]:
top_k = 3
story_len = 50
story = []
device = 'cuda:0'

#Predicting and Handling User-Input Errors
try:      
    prediction = predict(model, first_words, story_len, top_k)
except KeyError as error:
    print('Oops, looks like you\'ve selected a word that the network does not understand yet: ', error)
    if story[0] != "":
        story = story[len(first_words):]
    first_words = input('please select a different word:\nexample: A lovely day at the\n')
    prediction = predict(model, first_words, story_len, top_k)
except KeyError and RuntimeError:
    if story[0] != "":
        story = story[len(first_words):]
    first_words = input('Oops, looks like you\'ve typed {} words instead of {}!\n\nType the first 5 words to start the story:\nexample: A lovely day at the\n'.format(len(first_words.split(" ")), n_features))
    prediction = predict(model, first_words, story_len, top_k)

print('-----------------------------------------------------\n The STORY \n-----------------------------------------------------')
print(" ".join(story))

-----------------------------------------------------
 The STORY 
-----------------------------------------------------
a lovely day at the beginning she ran away in her arms folded and speak the whole of soldiers said the unfortunate leave it voice in an honest of great dismay said the rest and the pool rippling and you no more dreadfully frightened live to himself she soon who were all and when he
