<img src="../Pierian-Data-Logo.PNG">
<br>
<strong><center>Copyright 2019. Created by Jose Marcial Portilla.</center></strong>

# RNN for Text Generation

## Generating Text (encoded variables)

We saw how to generate continuous values. Now let's generalize this to generate categorical sequences (such as words or letters).

## Imports

In [1]:
import torch
from torch import nn
import torch.nn.functional as F

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Get Text Data

In [2]:
with open('../Data/Ed_Sheeran.txt','r',encoding='utf8') as f:
    text = f.read()

In [3]:
text[:1000]

"Perfect\nI found a love for me\nOh darling, just dive right in and follow my lead\nWell, I found a girl, beautiful and sweet\nOh, I never knew you were the someone waiting for me\n'Cause we were just kids when we fell in love\nNot knowing what it was\nI will not give you up this time\nBut darling, just kiss me slow, your heart is all I own\nAnd in your eyes, you're holding mine\nBaby, I'm dancing in the dark with you between my arms\nBarefoot on the grass, listening to our favourite song\nWhen you said you looked a mess, I whispered underneath my breath\nBut you heard it, darling, you look perfect tonight\nWell I found a woman, stronger than anyone I know\nShe shares my dreams, I hope that someday I'll share her home\nI found a love, to carry more than just my secrets\nTo carry love, to carry children of our own\nWe are still kids, but we're so in love\nFighting against all odds\nI know we'll be alright this time\nDarling, just hold my hand\nBe my girl, I'll be your man\nI see my futu

In [4]:
print(text[:1000])

Perfect
I found a love for me
Oh darling, just dive right in and follow my lead
Well, I found a girl, beautiful and sweet
Oh, I never knew you were the someone waiting for me
'Cause we were just kids when we fell in love
Not knowing what it was
I will not give you up this time
But darling, just kiss me slow, your heart is all I own
And in your eyes, you're holding mine
Baby, I'm dancing in the dark with you between my arms
Barefoot on the grass, listening to our favourite song
When you said you looked a mess, I whispered underneath my breath
But you heard it, darling, you look perfect tonight
Well I found a woman, stronger than anyone I know
She shares my dreams, I hope that someday I'll share her home
I found a love, to carry more than just my secrets
To carry love, to carry children of our own
We are still kids, but we're so in love
Fighting against all odds
I know we'll be alright this time
Darling, just hold my hand
Be my girl, I'll be your man
I see my future in your eyes
Baby, I'

In [5]:
len(text)

67862

## Encode Entire Text

In [6]:
all_characters = set(text)

In [7]:
len(all_characters)

76

In [8]:
decoder = dict(enumerate(all_characters))

In [9]:
# decoder
# decoder.items()

In [9]:
encoder = {char: ind for ind,char in decoder.items()}

In [11]:
# encoder

In [10]:
encoded_text = np.array([encoder[char] for char in text])

In [11]:
encoded_text[:500]

array([73, 75,  3, 49, 75,  9, 41, 60,  4, 38, 49, 37, 30, 25, 69, 38, 10,
       38, 56, 37, 70, 75, 38, 49, 37,  3, 38, 74, 75, 60, 11, 51, 38, 69,
       10,  3, 56, 72, 25, 12, 39, 38, 48, 30, 45, 41, 38, 69, 72, 70, 75,
       38,  3, 72, 12, 51, 41, 38, 72, 25, 38, 10, 25, 69, 38, 49, 37, 56,
       56, 37, 36, 38, 74,  6, 38, 56, 75, 10, 69, 60, 53, 75, 56, 56, 39,
       38,  4, 38, 49, 37, 30, 25, 69, 38, 10, 38, 12, 72,  3, 56, 39, 38,
       13, 75, 10, 30, 41, 72, 49, 30, 56, 38, 10, 25, 69, 38, 45, 36, 75,
       75, 41, 60, 11, 51, 39, 38,  4, 38, 25, 75, 70, 75,  3, 38, 20, 25,
       75, 36, 38,  6, 37, 30, 38, 36, 75,  3, 75, 38, 41, 51, 75, 38, 45,
       37, 74, 75, 37, 25, 75, 38, 36, 10, 72, 41, 72, 25, 12, 38, 49, 37,
        3, 38, 74, 75, 60, 29,  0, 10, 30, 45, 75, 38, 36, 75, 38, 36, 75,
        3, 75, 38, 48, 30, 45, 41, 38, 20, 72, 69, 45, 38, 36, 51, 75, 25,
       38, 36, 75, 38, 49, 75, 56, 56, 38, 72, 25, 38, 56, 37, 70, 75, 60,
       40, 37, 41, 38, 20
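
The encoder and decoder above are just a two-way lookup between characters and integer indices. A minimal sketch on a toy string follows. The names `sample`, `toy_chars`, `toy_decoder`, and `toy_encoder` are illustrative, and the `sorted(...)` call is an assumption added here so the index assignment comes out the same on every run (a plain `set` does not promise that).

```python
# Toy text standing in for the lyrics file (illustrative only)
sample = "hello world"

# Sort the unique characters so the index assignment is reproducible
toy_chars = sorted(set(sample))

# index -> character, and character -> index
toy_decoder = dict(enumerate(toy_chars))
toy_encoder = {char: ind for ind, char in toy_decoder.items()}

# Encode the string to integers, then decode it back
encoded = [toy_encoder[ch] for ch in sample]
decoded = "".join(toy_decoder[i] for i in encoded)

print(encoded)
print(decoded == sample)
```

Decoding the encoded list should reproduce the original string exactly, which is a quick sanity check worth running on the real text as well.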

## One Hot Encoding

As previously discussed, we need to one-hot encode our data in order for it to work with the network structure. Make sure to review NumPy if any of these operations confuse you!

In [12]:
def one_hot_encoder(encoded_text, num_uni_chars):
    '''
    encoded_text : batch of encoded text
    
    num_uni_chars = number of unique characters (len(set(text)))
    '''
    
    # METHOD FROM:
    # https://stackoverflow.com/questions/29831489/convert-array-of-indices-to-1-hot-encoded-numpy-array
      
    # Create a placeholder array of zeros.
    one_hot = np.zeros((encoded_text.size, num_uni_chars))
    
    # Convert data type for later use with PyTorch (errors if we don't!)
    one_hot = one_hot.astype(np.float32)

    # Using fancy indexing, fill in the 1s at the correct index locations
    one_hot[np.arange(one_hot.shape[0]), encoded_text.flatten()] = 1.0
    

    # Reshape it so it matches the batch shape
    one_hot = one_hot.reshape((*encoded_text.shape, num_uni_chars))
    
    return one_hot

In [18]:
a=one_hot_encoder(encoded_text[:5],84)
print(a)
print(encoded_text.size)
print(a.shape)

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
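
As a cross-check on `one_hot_encoder`, the same kind of result can be sketched by indexing into an identity matrix. This is an alternative technique, not the method used above, and `one_hot_eye` is a hypothetical helper name.

```python
import numpy as np

def one_hot_eye(indices, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing with an integer array one-hot encodes every entry.
    return np.eye(num_classes, dtype=np.float32)[indices]

batch = np.array([[0, 2], [1, 3]])   # shape (2, 2): 2 samples, 2 characters each
encoded = one_hot_eye(batch, 4)      # shape (2, 2, 4)

print(encoded.shape)
print(encoded[0, 1])                 # one-hot vector for index 2
```

Taking `argmax` over the last axis recovers the original indices, which is a handy way to confirm that an encoder put each 1 in the right column.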

--------------
---------------
# Creating Training Batches

We need to create a function that will generate batches of characters along with the next character in the sequence as a label.

-----------------
------------

In [13]:
example_text = np.arange(10)

In [14]:
example_text

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [21]:
# If we wanted 5 batches
example_text.reshape((5,-1))

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [15]:
def generate_batches(encoded_text, samp_per_batch=10, seq_len=50):
    
    '''
    Generate (using yield) batches for training.
    
    X: Encoded Text of length seq_len
    Y: Encoded Text shifted by one
    
    Example:
    
    X:
    
    [[1 2 3]]
    
    Y:
    
    [[ 2 3 4]]
    
    encoded_text : Complete encoded text to make batches from
    samp_per_batch : Number of samples (rows) per batch
    seq_len : Length of each character sequence
       
    '''
    
    # Total number of characters per batch
    # Example: If samp_per_batch is 2 and seq_len is 50, then 100
    # characters come out per batch.
    char_per_batch = samp_per_batch * seq_len
    
    
    # Number of full batches available to make
    # int() truncates, so any partial batch is dropped
    num_batches_avail = int(len(encoded_text)/char_per_batch)
    
    # Cut off end of encoded_text that
    # won't fit evenly into a batch
    encoded_text = encoded_text[:num_batches_avail * char_per_batch]
    
    
    # Reshape text into samp_per_batch rows
    encoded_text = encoded_text.reshape((samp_per_batch, -1))


    # Step across the columns in windows of seq_len
    for n in range(0, encoded_text.shape[1], seq_len):
        
        # Grab feature characters
        x = encoded_text[:, n:n+seq_len]
        
        # y is the target shifted over by 1
        y = np.zeros_like(x)
       
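        # Shift x left by one; the last target is the character
        # that follows the current window
        try:
            y[:, :-1] = x[:, 1:]
            y[:, -1]  = encoded_text[:, n+seq_len]
            
        # The final window has no following character,
        # so wrap around to the first column
        except IndexError:
            y[:, :-1] = x[:, 1:]
            y[:, -1] = encoded_text[:, 0]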
            
        yield x, y

### Example of generating a batch

In [16]:
sample_text = np.arange(0,20)

In [17]:
sample_text

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [25]:
batch_generator = generate_batches(sample_text,samp_per_batch=2,seq_len=5)

In [26]:
# Grab first batch
x, y = next(batch_generator)

In [27]:
x

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14]])

In [28]:
y

array([[ 1,  2,  3,  4,  5],
       [11, 12, 13, 14, 15]])
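
The first batch shows the shift-by-one relationship between `x` and `y`. What is less visible is the final window, where `generate_batches` falls back to the first column of each row for the last target. A compact sketch of the same idea uses `np.roll`; `shifted_targets` is a hypothetical helper written for illustration, not a replacement for the function above.

```python
import numpy as np

def shifted_targets(rows, start, seq_len):
    # Features: a window of seq_len characters from every row
    x = rows[:, start:start + seq_len]
    # Targets: the same window taken from the rows rolled left by one,
    # so each target is the next character. np.roll wraps the final
    # position around to column 0 of the same row.
    y = np.roll(rows, -1, axis=1)[:, start:start + seq_len]
    return x, y

rows = np.arange(20).reshape((2, -1))   # 2 samples per batch, 10 characters each

x_first, y_first = shifted_targets(rows, start=0, seq_len=5)
x_last, y_last = shifted_targets(rows, start=5, seq_len=5)

print(y_first)
print(y_last)    # last column wraps to the start of each row
```

The wrapped target in the last window is not a true "next character", but it affects only one position per row, so it has little influence on training.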

--------

## GPU Check

Remember, training will take a lot longer on a CPU!

In [29]:
torch.cuda.is_available()

True

In [18]:
base = torch.tensor([[0, 1],[2, 3]])
t = base.transpose(0, 1)
c = t.contiguous()
print(base)
print(t)
print(c)
print(t.is_contiguous())
print(c.is_contiguous())

tensor([[0, 1],
        [2, 3]])
tensor([[0, 2],
        [1, 3]])
tensor([[0, 2],
        [1, 3]])
False
True


# Creating the LSTM Model

**Note! We will have options for GPU users and CPU users. CPU will take MUCH LONGER to train and you may encounter RAM issues depending on your hardware. If that is the case, consider using cloud services like AWS, GCP, or Azure. Note, these may cost you money to use!**

In [19]:
class CharModel(nn.Module):
    
    def __init__(self, all_chars, num_hidden=256, num_layers=4,drop_prob=0.5,use_gpu=False):
        
        
        # SET UP ATTRIBUTES
        super().__init__()
        self.drop_prob = drop_prob
        self.num_layers = num_layers
        self.num_hidden = num_hidden
        self.use_gpu = use_gpu
        
        #CHARACTER SET, ENCODER, and DECODER
        self.all_chars = all_chars
        self.decoder = dict(enumerate(all_chars))
        self.encoder = {char: ind for ind,char in self.decoder.items()}
        
        
        self.lstm = nn.LSTM(len(self.all_chars), num_hidden, num_layers, dropout=drop_prob, batch_first=True)
        
        self.dropout = nn.Dropout(drop_prob)
        
        self.fc_linear = nn.Linear(num_hidden, len(self.all_chars))
      
    
    def forward(self, x, hidden):
                  
        
        lstm_output, hidden = self.lstm(x, hidden)
        
        
        drop_output = self.dropout(lstm_output)
        
        drop_output = drop_output.contiguous().view(-1, self.num_hidden)
        
        
        final_out = self.fc_linear(drop_output)
        return final_out, hidden
    
    
    def hidden_state(self, batch_size):
        '''
        Used as separate method to account for both GPU and CPU users.
        '''
        
        if self.use_gpu:
            
            hidden = (torch.zeros(self.num_layers,batch_size,self.num_hidden).cuda(),
                     torch.zeros(self.num_layers,batch_size,self.num_hidden).cuda())
        else:
            hidden = (torch.zeros(self.num_layers,batch_size,self.num_hidden),
                     torch.zeros(self.num_layers,batch_size,self.num_hidden))
        
        return hidden
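
To make the tensor shapes in `forward` concrete, here is a minimal sketch that pushes a dummy batch through an `nn.LSTM` and an `nn.Linear` configured like the ones in `CharModel`. The sizes are illustrative assumptions, and dropout is left out to keep the sketch short.

```python
import torch
from torch import nn

# Illustrative sizes (assumptions, not the notebook's training settings)
batch_size, seq_len, num_chars, num_hidden, num_layers = 4, 7, 76, 50, 2

lstm = nn.LSTM(num_chars, num_hidden, num_layers, batch_first=True)
fc_linear = nn.Linear(num_hidden, num_chars)

# One-hot style input: (batch, sequence, characters)
x = torch.zeros(batch_size, seq_len, num_chars)

# Zero initial hidden and cell states: (layers, batch, hidden)
hidden = (torch.zeros(num_layers, batch_size, num_hidden),
          torch.zeros(num_layers, batch_size, num_hidden))

lstm_output, hidden = lstm(x, hidden)                  # (batch, seq_len, hidden)
flat = lstm_output.contiguous().view(-1, num_hidden)   # (batch * seq_len, hidden)
logits = fc_linear(flat)                               # (batch * seq_len, characters)

print(lstm_output.shape, flat.shape, logits.shape)
```

Flattening to `(batch * seq_len, characters)` is what lets the loss compare every position in every sequence against a flat vector of target indices.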
        

## Instance of the Model

In [20]:
model = CharModel(
    all_chars=all_characters,
    num_hidden=50,
    num_layers=2,
    drop_prob=0.5,
    use_gpu=True,
)

In [21]:
total_param  = []
for p in model.parameters():
    total_param.append(int(p.numel()))

As a rule of thumb, try to keep the total number of parameters at roughly the same order of magnitude as the number of characters in the text.

In [90]:
sum(total_param)

49876

In [91]:
len(encoded_text)

55014
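
The parameter total can also be worked out by hand, which helps when choosing `num_hidden` and `num_layers`. The sketch below assumes PyTorch's default `nn.LSTM` layout (unidirectional, with biases, no projection); `lstm_param_count` is a hypothetical helper name.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    total = 0
    for layer in range(num_layers):
        # Layers after the first take the previous layer's hidden state as input
        layer_input = input_size if layer == 0 else hidden_size
        # Four gates, each with input weights, recurrent weights, and two bias vectors
        total += 4 * hidden_size * (layer_input + hidden_size) + 2 * 4 * hidden_size
    return total

num_chars, num_hidden, num_layers = 76, 50, 2

lstm_params = lstm_param_count(num_chars, num_hidden, num_layers)
linear_params = num_hidden * num_chars + num_chars   # weights + bias

print(lstm_params + linear_params)   # 49876, matching sum(total_param) above
```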

### Optimizer and Loss

In [27]:
optimizer = torch.optim.Adam(model.parameters(),lr=0.001)
criterion = nn.CrossEntropyLoss()
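
A useful reference point for the loss values: `nn.CrossEntropyLoss` uses the natural logarithm, so a model that guesses uniformly over all characters scores `ln(num_chars)`. The short sketch below computes that baseline for the 76 unique characters found earlier, plus the perplexity for a loss of about 2.09 (roughly where the validation log later in this notebook ends up). The `perplexity` helper is an illustrative name.

```python
import math

num_chars = 76   # unique characters reported earlier in this notebook

# Cross-entropy of a uniform guess: -log(1 / num_chars)
uniform_loss = math.log(num_chars)

# Perplexity is exp(loss): the effective number of equally likely choices
def perplexity(loss):
    return math.exp(loss)

print(round(uniform_loss, 3))
print(round(perplexity(2.09), 2))
```

A loss well below the uniform baseline means the model has learned something about which characters tend to follow which.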

## Training Data and Validation Data

In [22]:
# percentage of data to be used for training
train_percent = 0.7

In [94]:
len(encoded_text)

55014

In [95]:
int(len(encoded_text) * (train_percent))

38509

In [23]:
train_ind = int(len(encoded_text) * (train_percent))

In [24]:
train_data = encoded_text[:train_ind]
val_data = encoded_text[train_ind:]

# Training the Network

## Variables

Feel free to play around with these values!

In [29]:
## VARIABLES

# Epochs to train for
epochs = 1000
# batch size 
batch_size = 100

# Length of sequence
seq_len = 30

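# step counter for printing reports
# always start at 0
tracker = 0
# reference loss compared against in the training loop's stopping check
c=0
# number of distinct character codes (largest code + 1)
num_char = max(encoded_text)+1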

------

In [None]:
# Set model to train
model.train()


# Check to see if using GPU
if model.use_gpu:
    model.cuda()

for i in range(epochs):
    
    hidden = model.hidden_state(batch_size)
    
    
    for x,y in generate_batches(train_data,batch_size,seq_len):
        
        tracker += 1
        
        # One Hot Encode incoming data
        x = one_hot_encoder(x,num_char)
        
        # Convert Numpy Arrays to Tensor
        
        inputs = torch.from_numpy(x)
        targets = torch.from_numpy(y)
        
        # Adjust for GPU if necessary
        
        if model.use_gpu:
            
            inputs = inputs.cuda()
            targets = targets.cuda()
            
        # Detach the hidden state from the previous batch
        # If we don't, we would backpropagate through all training history
        hidden = tuple([state.data for state in hidden])
        
        optimizer.zero_grad()
        
        lstm_output, hidden = model.forward(inputs,hidden)
        
        loss = criterion(lstm_output,targets.view(batch_size*seq_len).long())
        
        loss.backward()
        
        # POSSIBLE EXPLODING GRADIENT PROBLEM!
        # LET'S CLIP JUST IN CASE
        nn.utils.clip_grad_norm_(model.parameters(),max_norm=5)
        
        optimizer.step()
        
        
        
        ###################################
        ### CHECK ON VALIDATION SET ######
        #################################
        
        if tracker % 25 == 0:
            
            val_hidden = model.hidden_state(batch_size)
            val_losses = []
            model.eval()
            
            for x,y in generate_batches(val_data,batch_size,seq_len):
                
                # One Hot Encode incoming data
                x = one_hot_encoder(x,num_char)
                

                # Convert Numpy Arrays to Tensor

                inputs = torch.from_numpy(x)
                targets = torch.from_numpy(y)

                # Adjust for GPU if necessary

                if model.use_gpu:

                    inputs = inputs.cuda()
                    targets = targets.cuda()
                    
                # Detach the hidden state from the previous batch
                # (same pattern as in the training loop)
                val_hidden = tuple([state.data for state in val_hidden])
                
                lstm_output, val_hidden = model.forward(inputs,val_hidden)
                val_loss = criterion(lstm_output,targets.view(batch_size*seq_len).long())
        
                val_losses.append(val_loss.item())
            
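            # Reset to training mode after the validation loop
            model.train()
            # NOTE: c is never updated from 0, so this condition is true for
            # any positive loss, and reassigning epochs does not shorten the
            # range() already created for the outer loop.
            if abs(val_loss.item()-c)<0.1 or val_loss.item()>c:
                print(epochs)
                epochs=i-1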
            print(f"Epoch: {i} Step: {tracker} Val Loss: {val_loss.item()}")

1000
Epoch: 1 Step: 25 Val Loss: 2.337758779525757
0
Epoch: 3 Step: 50 Val Loss: 2.3268332481384277
2
Epoch: 4 Step: 75 Val Loss: 2.3207383155822754
3
Epoch: 6 Step: 100 Val Loss: 2.31697678565979
5
Epoch: 8 Step: 125 Val Loss: 2.3142123222351074
7
Epoch: 9 Step: 150 Val Loss: 2.3120944499969482
8
Epoch: 11 Step: 175 Val Loss: 2.3102731704711914
10
Epoch: 13 Step: 200 Val Loss: 2.308400869369507
12
Epoch: 14 Step: 225 Val Loss: 2.306952714920044
13
Epoch: 16 Step: 250 Val Loss: 2.305413007736206
15
Epoch: 18 Step: 275 Val Loss: 2.3035504817962646
17
Epoch: 19 Step: 300 Val Loss: 2.3018245697021484
18
Epoch: 21 Step: 325 Val Loss: 2.3001110553741455
20
Epoch: 23 Step: 350 Val Loss: 2.2986505031585693
22
Epoch: 24 Step: 375 Val Loss: 2.2974255084991455
23
Epoch: 26 Step: 400 Val Loss: 2.2964632511138916
25
Epoch: 28 Step: 425 Val Loss: 2.294365644454956
27
Epoch: 29 Step: 450 Val Loss: 2.2924606800079346
28
Epoch: 31 Step: 475 Val Loss: 2.291015386581421
30
Epoch: 33 Step: 500 Val Loss: 

255
Epoch: 258 Step: 3875 Val Loss: 2.0942134857177734
257
Epoch: 259 Step: 3900 Val Loss: 2.094684362411499
258
Epoch: 261 Step: 3925 Val Loss: 2.0980706214904785
260
Epoch: 263 Step: 3950 Val Loss: 2.0896027088165283
262
Epoch: 264 Step: 3975 Val Loss: 2.105560064315796
263
Epoch: 266 Step: 4000 Val Loss: 2.1005451679229736
265
Epoch: 268 Step: 4025 Val Loss: 2.0917043685913086
267
Epoch: 269 Step: 4050 Val Loss: 2.106236457824707
268
Epoch: 271 Step: 4075 Val Loss: 2.1033496856689453
270
Epoch: 273 Step: 4100 Val Loss: 2.0909159183502197
272
Epoch: 274 Step: 4125 Val Loss: 2.1012909412384033
273
Epoch: 276 Step: 4150 Val Loss: 2.0977938175201416
275
Epoch: 278 Step: 4175 Val Loss: 2.0881261825561523
277
Epoch: 279 Step: 4200 Val Loss: 2.098073959350586
278
Epoch: 281 Step: 4225 Val Loss: 2.1000113487243652
280
Epoch: 283 Step: 4250 Val Loss: 2.093205690383911
282
Epoch: 284 Step: 4275 Val Loss: 2.0970420837402344
283
Epoch: 286 Step: 4300 Val Loss: 2.1031954288482666
285
Epoch: 288 
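
The loop above compares the validation loss with `c` and reassigns `epochs`, but that does not end training, because the `range(epochs)` object is created once before the loop starts. One common alternative is patience-based early stopping, sketched below. `EarlyStopper`, `patience`, and `min_delta` are hypothetical names chosen for illustration, and the loss values are made up.

```python
class EarlyStopper:
    """Signal a stop when validation loss has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # Meaningful improvement: remember it and reset the counter
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

# Made-up validation losses, for illustration only
losses = [2.30, 2.20, 2.15, 2.16, 2.17, 2.15, 2.18]

stopper = EarlyStopper(patience=3, min_delta=0.01)
stopped_at = None
for step, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = step
        break   # inside a training loop, break out of the epoch loop here

print(stopped_at, stopper.best)
```

In the training loop, the value passed to `should_stop` would more naturally be the mean of `val_losses` than the loss of the last validation batch.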

In [100]:
a1=targets.view(batch_size*seq_len)
print(targets.shape)

torch.Size([1, 30])


-------
------

## Saving the Model

https://pytorch.org/tutorials/beginner/saving_loading_models.html

In [101]:
# Be careful not to overwrite an existing model file with the same name!
model_name = 'example3.net'

In [102]:
torch.save(model.state_dict(),model_name)

## Load Model

In [103]:
# MUST MATCH THE EXACT SAME SETTINGS AS MODEL USED DURING TRAINING!

model = CharModel(
    all_chars=all_characters,
    num_hidden=50,
    num_layers=2,
    drop_prob=0.5,
    use_gpu=True,
)

In [104]:
model.load_state_dict(torch.load(model_name))
model.eval()

CharModel(
  (lstm): LSTM(76, 50, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5)
  (fc_linear): Linear(in_features=50, out_features=76, bias=True)
)
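
One caveat when reloading in a fresh Python session: `CharModel` builds its encoder from `dict(enumerate(all_chars))`, and the iteration order of a `set` of strings is not guaranteed to be the same from one process to the next. A hedged sketch of one way around this is to sort the characters once and store that list alongside the weights. Here `vocab.json` is a hypothetical file name, and the toy character set stands in for `all_characters`.

```python
import json
import os
import tempfile

# Toy character set standing in for all_characters (illustrative only)
all_chars = set("hello world")

# Fix an explicit order once, then persist it next to the model weights
ordered_chars = sorted(all_chars)

vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.json")   # hypothetical file name
with open(vocab_path, "w", encoding="utf8") as f:
    json.dump(ordered_chars, f)

# Later, possibly in a new session: rebuild the exact same mappings
with open(vocab_path, "r", encoding="utf8") as f:
    loaded_chars = json.load(f)

decoder = dict(enumerate(loaded_chars))
encoder = {char: ind for ind, char in decoder.items()}

print(loaded_chars == ordered_chars)
```

Since `CharModel` only calls `len()` and `enumerate()` on `all_chars`, passing the loaded list in place of the set should give the same mapping at training time and at generation time.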

# Generating Predictions

--------

In [105]:
def predict_next_char(model, char, hidden=None, k=1):
        
        # Encode raw letters with model
        encoded_text = model.encoder[char]
        
        # set as numpy array for one hot encoding
        # NOTE THE [[ ]] dimensions!!
        encoded_text = np.array([[encoded_text]])
        
        # One hot encoding
        encoded_text = one_hot_encoder(encoded_text, len(model.all_chars))
        
        # Convert to Tensor
        inputs = torch.from_numpy(encoded_text)
        
        # Move to GPU if requested
        if(model.use_gpu):
            inputs = inputs.cuda()
        
        
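        # Start from a zero hidden state if none was passed in,
        # then detach it from any earlier computation graph
        if hidden is None:
            hidden = model.hidden_state(1)
        hidden = tuple([state.data for state in hidden])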
        
        
        # Run model and get predicted output
        lstm_out, hidden = model(inputs, hidden)

        
        # Convert lstm_out to probabilities
        probs = F.softmax(lstm_out, dim=1).data
        
        
        
        if(model.use_gpu):
            # move back to CPU to use with numpy
            probs = probs.cpu()
        
        
        # k determines how many characters to consider
        # for our probability choice.
        # https://pytorch.org/docs/stable/torch.html#torch.topk
        
        # Return k largest probabilities in tensor
        probs, index_positions = probs.topk(k)
        
        
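        # flatten() keeps a 1-D array even when k=1 (squeeze() would return
        # a 0-d array, which np.random.choice does not treat as a list of candidates)
        index_positions = index_positions.numpy().flatten()
        
        # Create array of probabilities
        probs = probs.numpy().flatten()
        
        # Renormalize so the top-k probabilities sum to 1
        probs = probs/probs.sum()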
        
        # randomly choose a character based on probabilities
        char = np.random.choice(index_positions, p=probs)
       
        # return the decoded predicted character and the hidden state
        return model.decoder[char], hidden
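
The sampling step inside `predict_next_char` (softmax, keep the `k` most likely characters, renormalize, draw one) can be sketched on its own with NumPy. `sample_top_k` is a hypothetical helper, the scores are made up, and the seeded generator is there only so the demo is repeatable.

```python
import numpy as np

def sample_top_k(logits, k, rng):
    # Softmax (shifted by the max for numerical stability)
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    # Indices of the k largest probabilities
    top_idx = np.argsort(probs)[-k:]
    # Renormalize so the k kept probabilities sum to 1
    top_probs = probs[top_idx] / probs[top_idx].sum()
    # Draw one index according to the renormalized probabilities
    return int(rng.choice(top_idx, p=top_probs))

rng = np.random.default_rng(0)                 # seeded only to make the demo repeatable
logits = np.array([2.0, 0.5, 1.5, -1.0, 0.0])  # made-up scores for 5 characters

draws = [sample_top_k(logits, k=3, rng=rng) for _ in range(200)]
print(sorted(set(draws)))                      # only indices from the top 3 can appear
```

With `k=1` the draw is always the single most likely character, which tends to produce repetitive text. Larger `k` adds variety at the cost of more mistakes.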

In [106]:
def generate_text(model, size, seed='The', k=1):
        
      
    
    # CHECK FOR GPU
    if(model.use_gpu):
        model.cuda()
    else:
        model.cpu()
    
    # Evaluation mode
    model.eval()
    
    # begin output from initial seed
    output_chars = [c for c in seed]
    
    # initialize hidden state
    hidden = model.hidden_state(1)
    
    # predict the next character for every character in seed
    for char in seed:
        char, hidden = predict_next_char(model, char, hidden, k=k)
    
    # add the first predicted character (the one that follows the seed)
    output_chars.append(char)
    
    # Now generate for size requested
    for i in range(size):
        
        # predict based off very last letter in output_chars
        char, hidden = predict_next_char(model, output_chars[-1], hidden, k=k)
        
        # add predicted character
        output_chars.append(char)
    
    # return string of predicted text
    return ''.join(output_chars)

In [107]:
print(generate_text(model, 1000, seed='The ', k=3))

The the seas sin the sorin that
I to soud sounge sit the the tor mand the
She to me sease
And the
So see sout see the the soud see me till sor there thing
And mare
Ang that sat to me tilt sare thing
And
Ang sond me
So the
Shis
And sore
Ang son sor the me thing tores
And than to see son me
Ald son
Sa the me san the the to sor ther
I tin me
I me than
I me me son me me sat tith site see sout the sis meris seithe
I the son sore to thit the
And to me the thin seere the site son
I't tor to tind,
I soung me
I me
I's
And sand the the sin sithe mere sithe mere
So son
Ang tor tilt me seand me
I to to sores
Are san sin found man that the
And sin she
I tor soud tint sease an sean the mite me the ting the mat seen
I me soud man me
And
So sare to tith me to sise mint the
She mathe singis soungin lere tore
I't the sees the to son
Sher thit me to to mith the soungith sor me
And son
So me mind sore
I the san
I site the me to to there
Ande
Are me the see the man thare
All the to mere
I sare
I't soute
Al