# Intuition for Text Generation

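This notebook shows an end-to-end process for generating text. Although the technique is often introduced at the character level, the cells below work at the word level: they predict the next word from a window of preceding words. The notebook first reads the text file named in `filename` and then performs the following:

* Clean up the text by lowercasing everything
* Clean up the text by removing unwanted symbols
* Tokenize the text into words, remove stop words, and lemmatize
* Create mappings from words to indices and from indices to words
* Define the `x` and `y` vectors for classification
* Pass them to a multi-layer perceptron
* Generate some text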

## Required Libraries

* `nltk`
* `numpy`
* `torch`

### Import the Necessary Libraries

In [8]:
import re
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import random
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('wordnet')
# The stop-word list used later also has to be available locally.
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Read a Text File and Get Raw Text

You may change `filename` to the location of your own file.

In [9]:
filename = "story.txt"

# Use a context manager so the file is closed after reading.
with open(filename, 'r') as f:
    raw_text = f.read()

### Examine First 1000 Characters

In [10]:
raw_text[0:1000]

'Once upon a time, there was a boy named Mike\nHe was raised in the hood, life was never quite\nGrowing up, he faced many struggles and strife\nBut through it all, he had one love in his life\n\nRap music was his escape from reality\nThe beats and lyrics helped him to see\nA world beyond his troubled neighborhood\nHe knew that with hard work, he could change his mood\n\nHe started writing rhymes and practicing his flow\nIn the mirror, he would rap and watch himself grow\nHe knew that if he could just make it to the top\nHe could change his life, and make a better stop\n\nOne day, he had the chance to perform on stage\nIn front of a crowd, he killed it, he killed it with rage\nHe had finally made it, he had achieved his dream\nAnd now he is a successful rapper, or so it seems\n\nHe never forgot where he came from, and he never will\nHe always remember the struggles that he had to fulfill\nHe is now a role model to kids in the hood\nShowing them that with hard work, they too could\n\nRis

### Cleanup Text

* Lowercase all characters
* Remove non-ASCII characters

Note that this retains spaces, newlines, commas, periods, and other ASCII punctuation.

In [11]:
processed_text = raw_text.lower()
processed_text = re.sub(r'[^\x00-\x7f]', r'', processed_text)

processed_text

'once upon a time, there was a boy named mike\nhe was raised in the hood, life was never quite\ngrowing up, he faced many struggles and strife\nbut through it all, he had one love in his life\n\nrap music was his escape from reality\nthe beats and lyrics helped him to see\na world beyond his troubled neighborhood\nhe knew that with hard work, he could change his mood\n\nhe started writing rhymes and practicing his flow\nin the mirror, he would rap and watch himself grow\nhe knew that if he could just make it to the top\nhe could change his life, and make a better stop\n\none day, he had the chance to perform on stage\nin front of a crowd, he killed it, he killed it with rage\nhe had finally made it, he had achieved his dream\nand now he is a successful rapper, or so it seems\n\nhe never forgot where he came from, and he never will\nhe always remember the struggles that he had to fulfill\nhe is now a role model to kids in the hood\nshowing them that with hard work, they too could\n\nris
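To see what the regular expression above keeps and drops, here is a minimal sketch on a made-up string. The `sample_text` value is an assumption for illustration and is not taken from `story.txt`. The pattern `[^\x00-\x7f]` matches any character outside the ASCII range, so accented letters and curly quotes are removed while ASCII punctuation and newlines survive.

```python
import re

# Hypothetical input; not taken from story.txt.
sample_text = "Café “Rap”, He said:\nHello!"

cleaned = sample_text.lower()
# Drop every character outside the 7-bit ASCII range.
cleaned = re.sub(r'[^\x00-\x7f]', '', cleaned)

print(repr(cleaned))
```

Note that the accented letter is deleted rather than transliterated, so "café" becomes "caf". For text with many accented words, a normalization step before this filter may be preferable.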

In [12]:
word_tokens = word_tokenize(processed_text)

word_tokens

['once',
 'upon',
 'a',
 'time',
 ',',
 'there',
 'was',
 'a',
 'boy',
 'named',
 'mike',
 'he',
 'was',
 'raised',
 'in',
 'the',
 'hood',
 ',',
 'life',
 'was',
 'never',
 'quite',
 'growing',
 'up',
 ',',
 'he',
 'faced',
 'many',
 'struggles',
 'and',
 'strife',
 'but',
 'through',
 'it',
 'all',
 ',',
 'he',
 'had',
 'one',
 'love',
 'in',
 'his',
 'life',
 'rap',
 'music',
 'was',
 'his',
 'escape',
 'from',
 'reality',
 'the',
 'beats',
 'and',
 'lyrics',
 'helped',
 'him',
 'to',
 'see',
 'a',
 'world',
 'beyond',
 'his',
 'troubled',
 'neighborhood',
 'he',
 'knew',
 'that',
 'with',
 'hard',
 'work',
 ',',
 'he',
 'could',
 'change',
 'his',
 'mood',
 'he',
 'started',
 'writing',
 'rhymes',
 'and',
 'practicing',
 'his',
 'flow',
 'in',
 'the',
 'mirror',
 ',',
 'he',
 'would',
 'rap',
 'and',
 'watch',
 'himself',
 'grow',
 'he',
 'knew',
 'that',
 'if',
 'he',
 'could',
 'just',
 'make',
 'it',
 'to',
 'the',
 'top',
 'he',
 'could',
 'change',
 'his',
 'life',
 ',',
 'and

In [13]:
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in word_tokens if token.lower() not in stop_words]

filtered_tokens

['upon',
 'time',
 ',',
 'boy',
 'named',
 'mike',
 'raised',
 'hood',
 ',',
 'life',
 'never',
 'quite',
 'growing',
 ',',
 'faced',
 'many',
 'struggles',
 'strife',
 ',',
 'one',
 'love',
 'life',
 'rap',
 'music',
 'escape',
 'reality',
 'beats',
 'lyrics',
 'helped',
 'see',
 'world',
 'beyond',
 'troubled',
 'neighborhood',
 'knew',
 'hard',
 'work',
 ',',
 'could',
 'change',
 'mood',
 'started',
 'writing',
 'rhymes',
 'practicing',
 'flow',
 'mirror',
 ',',
 'would',
 'rap',
 'watch',
 'grow',
 'knew',
 'could',
 'make',
 'top',
 'could',
 'change',
 'life',
 ',',
 'make',
 'better',
 'stop',
 'one',
 'day',
 ',',
 'chance',
 'perform',
 'stage',
 'front',
 'crowd',
 ',',
 'killed',
 ',',
 'killed',
 'rage',
 'finally',
 'made',
 ',',
 'achieved',
 'dream',
 'successful',
 'rapper',
 ',',
 'seems',
 'never',
 'forgot',
 'came',
 ',',
 'never',
 'always',
 'remember',
 'struggles',
 'fulfill',
 'role',
 'model',
 'kids',
 'hood',
 'showing',
 'hard',
 'work',
 ',',
 'could',
 '

In [14]:
# Note: set() discards the order produced by sorted(), so the order of this list is arbitrary.
word_vocabulary = list(set(sorted(filtered_tokens)))

word_vocabulary

['watch',
 'kids',
 'lyrics',
 ',',
 'role',
 'seems',
 'dreams',
 'flow',
 'never',
 'successful',
 'life',
 'always',
 'hard',
 'hood',
 'started',
 'boy',
 'could',
 'writing',
 'mirror',
 'mike',
 'neighborhood',
 'reality',
 'crowd',
 'knew',
 'made',
 'struggles',
 'would',
 'beyond',
 'came',
 'raised',
 'make',
 'faced',
 'circumstances',
 'many',
 'strife',
 'day',
 'fulfill',
 'beats',
 'achieved',
 'growing',
 'stop',
 'quite',
 'rhymes',
 'rap',
 'front',
 'end',
 'practicing',
 'forgot',
 'rage',
 'showing',
 'work',
 'see',
 'troubled',
 'upon',
 'chance',
 'stage',
 'escape',
 'named',
 'top',
 'killed',
 'chase',
 'love',
 'model',
 'rise',
 'better',
 'music',
 'mood',
 'grow',
 'dream',
 'one',
 'time',
 'helped',
 'remember',
 'perform',
 'rapper',
 'finally',
 'world',
 'change']

In [15]:
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in word_vocabulary]

lemmatized_tokens

['watch',
 'kid',
 'lyric',
 ',',
 'role',
 'seems',
 'dream',
 'flow',
 'never',
 'successful',
 'life',
 'always',
 'hard',
 'hood',
 'started',
 'boy',
 'could',
 'writing',
 'mirror',
 'mike',
 'neighborhood',
 'reality',
 'crowd',
 'knew',
 'made',
 'struggle',
 'would',
 'beyond',
 'came',
 'raised',
 'make',
 'faced',
 'circumstance',
 'many',
 'strife',
 'day',
 'fulfill',
 'beat',
 'achieved',
 'growing',
 'stop',
 'quite',
 'rhyme',
 'rap',
 'front',
 'end',
 'practicing',
 'forgot',
 'rage',
 'showing',
 'work',
 'see',
 'troubled',
 'upon',
 'chance',
 'stage',
 'escape',
 'named',
 'top',
 'killed',
 'chase',
 'love',
 'model',
 'rise',
 'better',
 'music',
 'mood',
 'grow',
 'dream',
 'one',
 'time',
 'helped',
 'remember',
 'perform',
 'rapper',
 'finally',
 'world',
 'change']
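Note that the cell above lemmatizes `word_vocabulary`, a de-duplicated list in arbitrary order, so the word order of the story is no longer present in `lemmatized_tokens`. If the goal is to predict the next word of the original text, one option is to lemmatize the full token sequence instead. The sketch below illustrates the difference in plain Python. `toy_lemmas`, `toy_lemmatize`, and `tokens` are hypothetical stand-ins for `WordNetLemmatizer` and `filtered_tokens`, not part of the notebook.

```python
# Hypothetical stand-in for WordNetLemmatizer: maps a few plurals to singular.
toy_lemmas = {"struggles": "struggle", "beats": "beat", "dreams": "dream"}

def toy_lemmatize(token):
    # Fall back to the token itself when no lemma is known.
    return toy_lemmas.get(token, token)

# Hypothetical token sequence standing in for filtered_tokens.
tokens = ["beats", "helped", "dreams", "beats", "struggles"]

# Lemmatizing the sequence keeps order and repetitions.
lemmatized_sequence = [toy_lemmatize(t) for t in tokens]

# Lemmatizing the de-duplicated vocabulary loses both.
lemmatized_vocabulary = sorted({toy_lemmatize(t) for t in tokens})

print(lemmatized_sequence)
print(lemmatized_vocabulary)
```

The first list is what a next-word model would need as its corpus; the second is only suitable for building the index mappings.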

### Create Index Mappings

* `word_indices`: Returns the index of a given word. Example: `word_indices['rap']`
* `indices_words`: Returns the word at a given index. Example: `indices_words[0]`

In this case, `words` serves as your **vocabulary**.

In [18]:
print("Corpus Length: {}".format(len(lemmatized_tokens)))

words = sorted(list(set(lemmatized_tokens)))
print("Total Words: {}".format(len(words)))

word_indices = dict((c, i) for i,c in enumerate(words))
indices_words = dict((i, c) for i, c in enumerate(words))

Corpus Length: 78
Total Words: 77
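The two mappings are inverses of each other, which a small round trip makes concrete. This is a minimal sketch on a hypothetical four-entry vocabulary, not the 77-word vocabulary built above.

```python
# Hypothetical mini-vocabulary for illustration.
words = sorted({"rap", "life", ",", "dream"})

word_indices = {word: index for index, word in enumerate(words)}
indices_words = {index: word for index, word in enumerate(words)}

# Encoding then decoding returns the original words.
encoded = [word_indices[w] for w in ["dream", "rap", "life"]]
decoded = [indices_words[i] for i in encoded]

print(encoded)
print(decoded)
```

Sorting the vocabulary before enumerating it keeps the indices stable between runs, which matters if a trained model is reused later.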


In [19]:
word_indices

{',': 0,
 'achieved': 1,
 'always': 2,
 'beat': 3,
 'better': 4,
 'beyond': 5,
 'boy': 6,
 'came': 7,
 'chance': 8,
 'change': 9,
 'chase': 10,
 'circumstance': 11,
 'could': 12,
 'crowd': 13,
 'day': 14,
 'dream': 15,
 'end': 16,
 'escape': 17,
 'faced': 18,
 'finally': 19,
 'flow': 20,
 'forgot': 21,
 'front': 22,
 'fulfill': 23,
 'grow': 24,
 'growing': 25,
 'hard': 26,
 'helped': 27,
 'hood': 28,
 'kid': 29,
 'killed': 30,
 'knew': 31,
 'life': 32,
 'love': 33,
 'lyric': 34,
 'made': 35,
 'make': 36,
 'many': 37,
 'mike': 38,
 'mirror': 39,
 'model': 40,
 'mood': 41,
 'music': 42,
 'named': 43,
 'neighborhood': 44,
 'never': 45,
 'one': 46,
 'perform': 47,
 'practicing': 48,
 'quite': 49,
 'rage': 50,
 'raised': 51,
 'rap': 52,
 'rapper': 53,
 'reality': 54,
 'remember': 55,
 'rhyme': 56,
 'rise': 57,
 'role': 58,
 'see': 59,
 'seems': 60,
 'showing': 61,
 'stage': 62,
 'started': 63,
 'stop': 64,
 'strife': 65,
 'struggle': 66,
 'successful': 67,
 'time': 68,
 'top': 69,
 'trouble

In [20]:
indices_words

{0: ',',
 1: 'achieved',
 2: 'always',
 3: 'beat',
 4: 'better',
 5: 'beyond',
 6: 'boy',
 7: 'came',
 8: 'chance',
 9: 'change',
 10: 'chase',
 11: 'circumstance',
 12: 'could',
 13: 'crowd',
 14: 'day',
 15: 'dream',
 16: 'end',
 17: 'escape',
 18: 'faced',
 19: 'finally',
 20: 'flow',
 21: 'forgot',
 22: 'front',
 23: 'fulfill',
 24: 'grow',
 25: 'growing',
 26: 'hard',
 27: 'helped',
 28: 'hood',
 29: 'kid',
 30: 'killed',
 31: 'knew',
 32: 'life',
 33: 'love',
 34: 'lyric',
 35: 'made',
 36: 'make',
 37: 'many',
 38: 'mike',
 39: 'mirror',
 40: 'model',
 41: 'mood',
 42: 'music',
 43: 'named',
 44: 'neighborhood',
 45: 'never',
 46: 'one',
 47: 'perform',
 48: 'practicing',
 49: 'quite',
 50: 'rage',
 51: 'raised',
 52: 'rap',
 53: 'rapper',
 54: 'reality',
 55: 'remember',
 56: 'rhyme',
 57: 'rise',
 58: 'role',
 59: 'see',
 60: 'seems',
 61: 'showing',
 62: 'stage',
 63: 'started',
 64: 'stop',
 65: 'strife',
 66: 'struggle',
 67: 'successful',
 68: 'time',
 69: 'top',
 70: 'tro

### Convert to a Set of Symbols of Fixed Length

* `maxlen`: Number of words in each input window
* `step`: Number of positions the window advances between consecutive samples. A smaller step produces more samples that overlap heavily; a larger step produces fewer samples with less overlap.

Note that we also capture the word that follows each window. This sets up the data so that a given sequence predicts the next word. In machine learning, we denote this as our `y` value or ground truth. Each `y` is represented as a one-hot encoding, where the position given by the word's index in `word_indices` receives a value of `1`.

In [22]:
maxlen = 40
step = 5

sentences = []
next_words = []

for i in range(0, len(lemmatized_tokens) - maxlen, step):
    sentences.append(lemmatized_tokens[i: i+maxlen])
    next_words.append(lemmatized_tokens[i + maxlen])
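The windowing loop is easier to follow on a tiny input. The sketch below wraps the same loop in a hypothetical helper, `make_windows`, and runs it on a made-up 10-token list with a small `maxlen` and `step`.

```python
def make_windows(tokens, maxlen, step):
    """Return (windows, targets): each window of `maxlen` tokens predicts the next token."""
    windows, targets = [], []
    for i in range(0, len(tokens) - maxlen, step):
        windows.append(tokens[i:i + maxlen])
        targets.append(tokens[i + maxlen])
    return windows, targets

tokens = list("abcdefghij")  # hypothetical 10-token corpus
windows, targets = make_windows(tokens, maxlen=4, step=2)
for window, target in zip(windows, targets):
    print(window, "->", target)

# Number of training pairs for a 78-token corpus with maxlen=40 and step=5.
n_pairs = len(range(0, 78 - 40, 5))
print(n_pairs)
```

With the 78 tokens reported earlier, the same arithmetic gives only 8 training pairs, which is worth keeping in mind when judging the generated text.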

In [23]:
sentences

[['watch',
  'kid',
  'lyric',
  ',',
  'role',
  'seems',
  'dream',
  'flow',
  'never',
  'successful',
  'life',
  'always',
  'hard',
  'hood',
  'started',
  'boy',
  'could',
  'writing',
  'mirror',
  'mike',
  'neighborhood',
  'reality',
  'crowd',
  'knew',
  'made',
  'struggle',
  'would',
  'beyond',
  'came',
  'raised',
  'make',
  'faced',
  'circumstance',
  'many',
  'strife',
  'day',
  'fulfill',
  'beat',
  'achieved',
  'growing'],
 ['seems',
  'dream',
  'flow',
  'never',
  'successful',
  'life',
  'always',
  'hard',
  'hood',
  'started',
  'boy',
  'could',
  'writing',
  'mirror',
  'mike',
  'neighborhood',
  'reality',
  'crowd',
  'knew',
  'made',
  'struggle',
  'would',
  'beyond',
  'came',
  'raised',
  'make',
  'faced',
  'circumstance',
  'many',
  'strife',
  'day',
  'fulfill',
  'beat',
  'achieved',
  'growing',
  'stop',
  'quite',
  'rhyme',
  'rap',
  'front'],
 ['life',
  'always',
  'hard',
  'hood',
  'started',
  'boy',
  'could',
  '

In [24]:
next_words

['stop', 'end', 'work', 'stage', 'chase', 'music', 'time', 'finally']

### Vectorization

This step converts `sentences` and `next_words` into their `x` and `y` components respectively. Since we're using PyTorch, we convert them to tensors. The succeeding cells check the shapes of `x` and `y`.

In [28]:
print("Vectorization")
# Fall back to the CPU when CUDA is unavailable.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = np.zeros((len(sentences), maxlen, len(words)), dtype=np.float64)
y = np.zeros((len(sentences), len(words)), dtype=np.float64)

for i, sentence in enumerate(sentences):
    for t, word in enumerate(sentence):
        # Index by the current word, not by the whole vocabulary list.
        x[i, t, word_indices[word]] = 1
    y[i, word_indices[next_words[i]]] = 1

x = torch.tensor(x).float().to(device)
y = torch.tensor(y).float().to(device)

Vectorization

In [29]:
x.shape

torch.Size([8, 40, 77])

In [30]:
x = torch.flatten(x, start_dim=1)

x.shape

torch.Size([8, 3080])

In [31]:
y.shape

torch.Size([8, 77])
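The one-hot layout used for `x` and `y` can be sketched on a toy vocabulary with plain Python lists. The helper `one_hot` and the four-word `word_indices` below are hypothetical and only illustrate the encoding; the notebook itself uses NumPy arrays and tensors.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector

word_indices = {",": 0, "dream": 1, "life": 2, "rap": 3}  # hypothetical
sentence = ["rap", "life"]   # a window with maxlen = 2
next_word = "dream"

# x has one row per position in the window; flattening concatenates the rows.
x_rows = [one_hot(word_indices[w], len(word_indices)) for w in sentence]
x_flat = [value for row in x_rows for value in row]
y_row = one_hot(word_indices[next_word], len(word_indices))

print(x_flat)
print(y_row)
```

Flattening turns each window into a single vector of `maxlen * len(words)` values, which is the input size the multi-layer perceptron sees.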

### Utility Function for Generating Samples

In [32]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
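The `temperature` argument controls how adventurous the sampling is. Dividing the log-probabilities by the temperature and renormalizing is equivalent to raising each probability to the power `1 / temperature`. The sketch below isolates that rescaling step in plain Python; `apply_temperature` and the three-value `probs` list are hypothetical and used only for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability list as p_i ** (1 / temperature), then renormalize."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.7, 0.2, 0.1]  # hypothetical model output
results = {}
for temperature in (0.5, 1.0, 2.0):
    results[temperature] = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in results[temperature]])
```

A temperature below 1 sharpens the distribution toward the most likely word, a temperature of 1 leaves it unchanged, and a temperature above 1 flattens it so that rarer words are sampled more often.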

### Callback Function

This generates sample text from a randomly chosen seed sequence and is run after every epoch of training.

In [33]:
def callback(model):
    # Pick a random window of maxlen words to use as the seed.
    start = 0
    stop = len(lemmatized_tokens) - maxlen - 1
    start_index = random.randint(start, stop)

    sentence = lemmatized_tokens[start_index: start_index + maxlen]
    generated = []

    for i in range(400):
        # One-hot encode the current window.
        x_predictions = np.zeros((1, maxlen, len(words)))
        for t, word in enumerate(sentence):
            x_predictions[0, t, word_indices[word]] = 1

        x_predictions = torch.tensor(x_predictions).float().to(device)
        x_flat = torch.flatten(x_predictions, start_dim=1)

        # Softmax turns the model outputs into probabilities for sampling.
        with torch.no_grad():
            preds = F.softmax(model(x_flat)[0], dim=-1).cpu().numpy()

        next_index = sample(preds)
        next_word = indices_words[next_index]

        # Slide the window: drop the oldest word and append the new one.
        generated.append(next_word)
        sentence = sentence[1:] + [next_word]

    return ' '.join(generated)

### MultiLayerPerceptron Model

This will be our current language model. It is a classifier: given a window of words, it scores every word in the vocabulary, and the ground truth is the next word in the sequence. Take note that this is a rather simplistic model with no mechanism for remembering previous input. This forces the model to treat each input window as an independent observation without modeling sequential behavior. TL;DR: it won't generate good results.

In [34]:
class MultiLayerPerceptron(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()

        self.hidden = nn.Linear(input_dim, 500)
        self.output = nn.Linear(500, output_dim)

        self.relu = nn.ReLU()

    def forward(self, x):
        # Hidden layer with ReLU, then one raw score (logit) per vocabulary word.
        # nn.CrossEntropyLoss expects logits, so no sigmoid or softmax is applied here.
        x = self.relu(self.hidden(x))
        y = self.output(x)

        return y
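It helps to see how large this model is relative to the data. The sketch below counts the trainable parameters of the two linear layers, assuming the flattened input of 40 × 77 values and the 77-word vocabulary reported earlier in the notebook. `linear_params` is a hypothetical helper, not a PyTorch function.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

maxlen, vocab_size, hidden_size = 40, 77, 500
input_dim = maxlen * vocab_size  # flattened one-hot window

total = linear_params(input_dim, hidden_size) + linear_params(hidden_size, vocab_size)
print(input_dim, total)
```

Under these assumptions the network has well over a million parameters for a handful of training windows, so it can memorize the training pairs without learning anything that generalizes.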

In [35]:
# Fall back to the CPU when CUDA is unavailable.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MultiLayerPerceptron(x.shape[1], y.shape[1]).to(device)

model

### Training Function

In [36]:
optimizer = optim.Adam(model.parameters(), lr=0.00001)
criterion = nn.CrossEntropyLoss()

def train_fn(model, optimizer, loss_fn, device):
    ave_loss = 0
    count = 0

    # Train on one (window, next word) pair at a time.
    for i in range(len(x)):
        data = x[i].to(device)
        targets = y[i].to(device)

        # Forward
        predictions = model(data)

        # nn.CrossEntropyLoss applies log-softmax itself, so the model
        # outputs are passed in directly without an extra softmax.
        loss = loss_fn(predictions, targets)

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        count += 1
        ave_loss += loss.item()

    ave_loss = ave_loss / count

    return ave_loss

epochs = 1000

average_losses = []

for epoch in range(epochs):
    print("Epoch: {}".format(epoch))
    ave_loss = train_fn(model, optimizer, criterion, device)

    average_losses.append(ave_loss)

    print("Ave Loss: {}".format(ave_loss))

    generated_sentence = callback(model)

    print("Generated sentence:")
    print(generated_sentence)
    print("Length: {}".format(len(generated_sentence)))
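`nn.CrossEntropyLoss` applies log-softmax to its input internally, so it is meant to receive raw scores. The plain-Python sketch below, using hypothetical scores, shows what happens when already-normalized probabilities are passed into a loss that applies softmax again. The `softmax` and `cross_entropy_from_logits` helpers are illustrative re-implementations, not the PyTorch functions.

```python
import math

def softmax(values):
    # Subtract the maximum for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target_index):
    """Negative log-probability of the target class, computed from raw scores."""
    return -math.log(softmax(logits)[target_index])

logits = [4.0, 0.5, -1.0]  # hypothetical raw scores; class 0 is correct
single = cross_entropy_from_logits(logits, 0)
# Applying softmax first and then treating the result as logits flattens the scores.
double = cross_entropy_from_logits(softmax(logits), 0)

print(round(single, 4), round(double, 4))
```

Because probabilities lie between 0 and 1, the second softmax can never become confident, so the loss stays bounded away from zero even for a correct, high-scoring prediction. That slows or stalls training.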