# Text Generation (Char-RNN) One-to-Many
In this example we generate text by feeding a single character to an RNN and letting the network generate the rest. The example uses the one-to-many structure.

### Training
Training here is simply the training of a character-level language model: we input a sequence and expect the same sequence, shifted by one character, as the target, so that the network learns the regularities of a given language. During training this is a many-to-many architecture.
![alt text](imgs/char_language_model.png "Types")
At each time step, the RNN tries to predict the next character given the previous characters.
![alt text](imgs/sample_char_rnn_hello.png "Types")
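
As a minimal sketch of how inputs and targets line up for next-character prediction, the word "hello" can be split into per-step pairs. The variable names below are illustrative and are not part of the notebook's code.

```python
# Build (input, target) pairs for next-character prediction.
word = "hello"
inputs = list(word[:-1])   # characters the network sees
targets = list(word[1:])   # characters it should predict, one step ahead

for t, (x_ch, y_ch) in enumerate(zip(inputs, targets)):
    print(f"t={t}: input={x_ch!r} -> target={y_ch!r}")
```

The target sequence is the input sequence shifted left by one position, which is why training looks like a many-to-many problem.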

#### What is a Language Model
It is essentially a network that assigns a probability to a sequence; in other words, it gives the probability that a string belongs to some language. Language models are used, for example, to correct the output of speech recognition systems.
$$P(\text{"The apple and bear salad"})=0.01$$
$$P(\text{"The apple and pear salad"})=0.4$$
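
A common way to obtain such a sequence probability is the chain rule, where each character is conditioned on the ones before it:
$$P(x_1,\dots,x_T)=\prod_{t=1}^{T}P(x_t \mid x_1,\dots,x_{t-1})$$
The sketch below multiplies per-step probabilities in log space. The numbers in `step_probs` are made up for illustration; in practice they would come from the network's softmax output at each time step.

```python
import math

def sequence_log_prob(step_probs):
    """Sum log P(x_t | x_<t) over the sequence (chain rule in log space)."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical conditional probabilities for a 4-character string
step_probs = [0.5, 0.8, 0.9, 0.4]
log_p = sequence_log_prob(step_probs)
p = math.exp(log_p)  # probability of the whole string
print(f"log P = {log_p:.4f}, P = {p:.4f}")
```

Working in log space avoids numerical underflow when sequences get long.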

### Evaluation
During evaluation we give some input to the RNN, sample its output at random according to the predicted probabilities, and feed that sample back into the RNN as the next input.
![alt text](imgs/sample_char_rnn.png "Types")
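
One step of that sample-and-feed-back loop might look like the sketch below. The tiny vocabulary and the `logits` values are assumptions standing in for a trained network's output, and `torch.multinomial` is used here as one possible way to draw the sample.

```python
import torch

torch.manual_seed(0)  # only to make this sketch repeatable

vocab = ['\n', 'a', 'b', 'c']                 # hypothetical tiny vocabulary
logits = torch.tensor([0.1, 2.0, 0.5, 1.0])   # stand-in for the network's scores
probs = torch.softmax(logits, dim=-1)

# Draw one index according to the predicted distribution (not the argmax)
idx = torch.multinomial(probs, num_samples=1).item()
next_char = vocab[idx]

# One-hot encode the sample so it can be fed back as the next input
next_input = torch.zeros(1, len(vocab))
next_input[0, idx] = 1.0
```

Sampling instead of taking the most likely character keeps the generated text varied from run to run.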

### References
* http://karpathy.github.io/2015/05/21/rnn-effectiveness/
* https://www.youtube.com/watch?v=CKrxdgqBheY
* https://github.com/Kulbear/deep-learning-coursera/blob/master/Sequence%20Models/Dinosaurus%20Island%20--%20Character%20level%20language%20model%20final%20-%20v3.ipynb
* https://github.com/furkanu/deeplearning.ai-pytorch/blob/master/5-%20Sequence%20Models/Week%201/Dinosaur%20Island%20--%20Character-level%20language%20model/Dinosaur%20Island%20RNN.ipynb
* https://medium.com/@ppasumarthi_69210/language-model-using-char-rnn-1df53f735880
* https://medium.com/@jianqiangma/all-about-recurrent-neural-networks-9e5ae2936f6e
* https://medium.com/datathings/the-magic-of-lstm-neural-networks-6775e8b540cd
* https://medium.com/phrasee/neural-text-generation-generating-text-using-conditional-language-models-a37b69c7cd4b
* https://medium.com/@florijan.stamenkovic_99541/rnn-language-modelling-with-pytorch-packed-batching-and-tied-weights-9d8952db35a9
* http://warmspringwinds.github.io/pytorch/rnns/2018/01/27/learning-to-generate-lyrics-and-music-with-recurrent-neural-networks/
* https://towardsdatascience.com/character-level-language-model-1439f5dd87fe
* https://www.youtube.com/watch?v=HNOHLvD6_gs&t=1s
* http://torch.ch/blog/2016/07/25/nce.html

In [1]:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import pdb
from torch.utils.data import Dataset, DataLoader

hidden_size = 100
num_epochs = 20


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Compute device:', device)

Compute device: cpu


In [2]:
# Convert an index into one-hot format
def idx_to_one_hot(idx, num_classes):
    one_hot_vector = torch.zeros((1,num_classes))
    one_hot_vector[0][idx] = 1
    return one_hot_vector
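
For comparison, PyTorch ships a built-in `torch.nn.functional.one_hot`. The sketch below (with an arbitrary `num_classes` and `idx` chosen for illustration) checks that it produces the same vector as the manual construction used in `idx_to_one_hot`.

```python
import torch
import torch.nn.functional as F

num_classes = 5
idx = 2

# Manual construction, mirroring idx_to_one_hot
manual = torch.zeros((1, num_classes))
manual[0][idx] = 1

# Built-in equivalent; F.one_hot returns an integer tensor, so cast to float
builtin = F.one_hot(torch.tensor([idx]), num_classes=num_classes).float()

print(manual)
print(builtin)
```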

### Open and Process the Dataset

In [3]:
# Dinosaur names dataset
#data = open('data/dinos.txt', 'r').read()
# Human names dataset
data = open('data/names.txt', 'r').read()
# Convert all to lowercase
data = data.lower()

# Get distinct chars and the sizes
vocabulary = list(set(data))
data_size, vocabulary_size = len(data), len(vocabulary)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocabulary_size))

lines = data.splitlines()

# Create dictionaries to convert characters to indices and vice-versa
char_to_ix = { ch:i for i,ch in enumerate(sorted(vocabulary)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(vocabulary)) }
print('\nVocabulary:')
print(ix_to_char)

There are 36121 total characters and 27 unique characters in your data.

Vocabulary:
{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}


In [4]:
class CharLanguageModel(nn.Module):
    def __init__(self, num_classes, input_dim, hidden_dim = 64):
        super(CharLanguageModel, self).__init__()
        self.hidden_dim = hidden_dim
        # Create RNN layer of one cell (input_size, size_hidden_features)
        #self.rnn = torch.nn.RNN(input_dim, hidden_dim, batch_first=False, num_layers=1)
        self.rnn = torch.nn.GRU(input_dim, hidden_dim, batch_first=False, num_layers=1)
        # Create FC layer (input_size, output_size)
        self.linear = torch.nn.Linear(hidden_dim, num_classes)
        # Create Softmax Layer
        self.softmax = torch.nn.functional.softmax

    def forward(self, input):
        # `input` is a pair: [one-hot input, previous hidden state]
        given_hidden = input[1].unsqueeze(0)
        input = input[0].unsqueeze(0)

        # Run the GRU layer on a single time step
        # input shape should be (seq_len, batch, input_size)
        input = input.permute(1, 0, 2)
        rnn_out, rnn_hidden = self.rnn(input, given_hidden)
        # rnn_out = [seq_len, batch size, hid_dim]
        # rnn_hidden = [1, batch size, hid_dim]
        # Run the FC layer
        # The leading dimension is 1 here (one time step, batch of one), so drop it
        rnn_out = torch.squeeze(rnn_out, dim=0)
        scores = self.linear(rnn_out)
        
        # Run softmax layer (Convert to probabilities)
        #predictions = F.log_softmax(scores, dim=-1)  
        predictions = torch.softmax(scores, dim=-1)  
        
        return scores, predictions, rnn_out


model = CharLanguageModel(num_classes=vocabulary_size, input_dim=vocabulary_size, hidden_dim=hidden_size).to(device)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print('Total parameters:',total_params)

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)

Total parameters: 41427
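
Note that `nn.CrossEntropyLoss` expects raw scores (logits), not probabilities, which is why the model returns `scores` alongside the softmax `predictions`. The following sketch, using made-up logits, checks that the loss equals the negative log-likelihood of the log-softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.tensor([[2.0, 0.5, -1.0]])  # raw logits, shape (batch=1, classes=3)
target = torch.tensor([0])                 # index of the correct class

# CrossEntropyLoss applies log-softmax internally
ce = nn.CrossEntropyLoss()(scores, target)

# Same quantity computed in two explicit steps
manual = F.nll_loss(F.log_softmax(scores, dim=-1), target)

print(ce.item(), manual.item())
```

Passing already-softmaxed probabilities to `nn.CrossEntropyLoss` would apply the softmax twice and distort the loss.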


In [5]:
class DinosDataset(Dataset):
    def __init__(self, dataset_path):
        super().__init__()
        with open(dataset_path) as f:
            # Convert all to lower
            content = f.read().lower()
            # Get unique set of characters
            self.vocab = sorted(set(content))
            # Get the number of unique characters
            self.vocab_size = len(self.vocab)
            # Create a list of names for each line
            self.lines = content.splitlines()
        self.ch_to_idx = {c:i for i, c in enumerate(self.vocab)}
        self.idx_to_ch = {i:c for i, c in enumerate(self.vocab)}
    
    def __getitem__(self, index):
        line = self.lines[index]
        # Prepend a placeholder so x[0] is an all-zero vector and y[0] is the first character
        x_str = ' ' + line
        y_str = line + '\n'
        x = torch.zeros([len(x_str), self.vocab_size], dtype=torch.float)
        y = torch.empty(len(x_str), dtype=torch.long)
        
        y[0] = self.ch_to_idx[y_str[0]]
        # Start from the second position: x[0] stays all zeros (no previous character)
        for i, (x_ch, y_ch) in enumerate(zip(x_str[1:], y_str[1:]), 1):
            x[i][self.ch_to_idx[x_ch]] = 1
            y[i] = self.ch_to_idx[y_ch]
        
        return x, y
    
    def __len__(self):
        return len(self.lines)


# Print the text for a list of vocabulary indices, capitalizing the first character
# print_sample([1,2,3,4,5], trn_ds) --> Abcde
def print_sample(sample_idxs, dataset):
    print(dataset.idx_to_ch[sample_idxs[0]].upper(), end='')
    for x in sample_idxs[1:]:
        print(dataset.idx_to_ch[x], end='')

# Print dataset X|Y pair
def print_samples(dataset, num_examples=3):
    for i, (x, y) in enumerate(dataset, 1):
        print('*'*50)
        x_str, y_str = '', ''
        for idx in y:
            y_str += dataset.idx_to_ch[idx.item()]
        print('label(Y):',repr(y_str))

        # x is one-hot encoded; skip x[0], which is all zeros
        for t in x[1:]:
            x_str += dataset.idx_to_ch[t.argmax().item()]
        print('X:', repr(x_str))

        if i == num_examples:
            break

In [6]:
trn_ds = DinosDataset('data/names.txt')
trn_dl = DataLoader(trn_ds, batch_size=1, shuffle=True)

In [7]:
print_samples(trn_ds)

**************************************************
label(Y): 'james\n'
X: 'james'
**************************************************
label(Y): 'john\n'
X: 'john'
**************************************************
label(Y): 'robert\n'
X: 'robert'


In [8]:
def sample(model):
    model.eval()
    word_size=0
    newline_idx = trn_ds.ch_to_idx['\n']
    indices = []
    pred_char_idx = -1
    h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device)
    x = torch.zeros([1, trn_ds.vocab_size], device=device)
    with torch.no_grad():
        # Execute until EOL or MAX word size
        while pred_char_idx != newline_idx and word_size != 50:
            # Run the model
            y_pred, softmax_scores, h_prev = model([x, h_prev])
            
            # Sample the next character index from the predicted distribution
            softmax_scores = softmax_scores.cpu().numpy().ravel()
            idx = np.random.choice(np.arange(trn_ds.vocab_size), p=softmax_scores)
            indices.append(idx)
            
            # Feed the sampled character back as the next input
            x = idx_to_one_hot(idx, trn_ds.vocab_size).to(device)
            pred_char_idx = idx
            
            word_size += 1
        
        if word_size == 50:
            indices.append(newline_idx)
    return indices

In [9]:
def train_one_epoch(model, loss_fn, optimizer, dataset):
    model.train()
    for line_num, (x, y) in enumerate(trn_dl):
        loss = 0
        optimizer.zero_grad()
        # Initial hidden state
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device)
        # Send data to the GPU/CPU
        x, y = x.to(device), y.to(device)
        # For each character in the word
        for i in range(x.shape[1]):
            y_pred, _, h_prev = model([x[:, i], h_prev])
            loss += loss_fn(y_pred, y[:, i])
        if (line_num+1) % 100 == 0:
            print_sample(sample(model), dataset)
            # sample() switches the model to eval mode; switch back before the update
            model.train()
        loss.backward()
        # gradient clipping to avoid exploding gradient
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
        optimizer.step()
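
To illustrate what the gradient-clipping call does, here is a small standalone sketch. The `nn.Linear` layer and the artificially scaled loss are assumptions chosen only to produce a large gradient; they are not part of the notebook's model.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
# Scale the output to force a large gradient
loss = (layer(torch.ones(1, 4)) * 1000).sum()
loss.backward()

max_norm = 5.0
# Returns the total gradient norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm)
# Recompute the total norm after the gradients were rescaled in place
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(f"before: {norm_before.item():.1f}, after: {norm_after.item():.4f}")
```

The gradients keep their direction and are only rescaled, so the update still points the same way but with a bounded step.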

In [10]:
def train(model, loss_fn, optimizer, dataset, epochs=1):
    for e in range(1, epochs+1):
        print(f'{"-"*20} Epoch {e} {"-"*20}')
        train_one_epoch(model, loss_fn, optimizer, dataset)

In [11]:
train(model, loss_fn, optimizer, trn_ds, epochs=num_epochs)

-------------------- Epoch 1 --------------------
Skeaakfebwl
Vknaayy
Hnr
Tirn
Hiizr
Onl
Tahuls
Ssrianl
Yrinor
Ijado
Ret
Bri
Ldree
Saiuls
Ssridli
Yrinnn
Ikaer
Ret
Cri
Lbree
Saluls
Ssrfali
Ysinnr
Ijado
Reta
Riala
Hgara
Wmtarrnb
Polyna
Rsalh
Erarer
Bqi
Kbrel
Sanulsa
Srieon
Ysinno
Ilair
Reta
Rianal
Haraluio
Ssondin
Ysinln
Ilalr
Reta
Rian
Selar
Jvitane
Kbnieye
Orsalg
Eraret
Ariala
-------------------- Epoch 2 --------------------
Gecra
Wlsaniha
Ronyrehhi
Ilagr
Reta
Rianama
Asanwgo
Ssona
Phvmdine
Ilaho
Reta
Rian
Selar
Iuotanin
Eronyca
Rrane
Grarho
Con
Larhe
Samuns
Srondin
Yshhis
Iladn
Reta
Rgan
Selar
Junsakle
Eronye
Ortadd
Frarho
Con
Lasha
Sanwis
Sroncin
Yshert
Jebis
Reta
Rhanana
Ascgunr
Srona
Rfyshenn
Jealra
Hragri
Larha
Sanvet
Soride
Kvinine
Jebgo
Reta
Rianan
Iaraiten
-------------------- Epoch 3 --------------------
Srona
Qhyrinec
Jealo
Reta
Rhanan
Iasa
Visaprie
Qonyla
Roane
Graris
Cor
Lasha
Sanwit
Srona
Qgyrine
Amgalo
Reta
Rhandica
Sanvet
Sroncin
Yshirt
Jeamn
Reta
Rianana
Asanthl
Sronci