# Welcome to Exploring RNNs 🎡🎢

![Library](https://images.unsplash.com/photo-1530608031805-8e170c1b793a?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=02e80521a07cf0cd447cf03269dee09b&auto=format&fit=crop&w=967&q=80)

(Image from Unsplash, courtesy of : @jbsinger1970)

## Background

Recurrent Neural Networks (RNN) allow us to remember context better than regular Neural Networks. RNNs have loops in them allowing us to store information. RNNs are usually many copies of the same network passing state to the next neuron in the network.  

In this notebook we are going to try and predict the next character after it has been trained on a text corpus. We will use the Fast.ai Library along with Pytorch to achieve this.  

# Table of Contents:

* Import data and libraries
* EDA
* Training Models

## Resouces Used to create the content in this notebook:

* Fast.ai Lesson 6
* [NLP Beginner's Tutorial using NLTK](https://www.kaggle.com/pavansanagapati/nlp-beginner-s-tutorial-using-nltk)
* [Recurrent Neural Network - The Math of Intelligence (Week 5)](https://youtu.be/BwmddtPFWtA)


   

# Import data and libraries

In [2]:

%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.io import *
from fastai.conv_learner import *

from fastai.column_data import *
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import gutenberg


import os
print(os.listdir("../input"))


  from numpy.core.umath_tests import inner1d


ModuleNotFoundError: No module named 'nltk'

Lets go ahead and set a path for our data:

In [3]:
PATH = 'data/'


In [4]:
doc = open(f'{PATH}THOR RAGNAROK.txt').read()
print('corpus lenght:', len(doc))

corpus lenght: 214602


In [None]:
doc[:1000]

Lets get the `vocab_size`

In [5]:
chars = sorted(list(set(doc)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)

total chars: 82


In [5]:
chars.insert(0,"\0")
''.join(chars[1:-6])

'\t\n !"#&\'()+,-./0123456789:?ABCDEFGHIJKLMNOPQRSTUVWXYZ`abcdefghijklmnopqrstu'

In [8]:
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

Each of the integers below represent a character:

In [9]:
idx = [char_indices[c] for c in doc]
idx[:10000]

[1,
 1,
 1,
 1,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 46,
 34,
 41,
 44,
 25,
 2,
 44,
 27,
 33,
 40,
 27,
 44,
 41,
 37,
 1,
 1,
 1,
 1,
 1,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 49,
 71,
 62,
 73,
 73,
 58,
 67,
 2,
 55,
 78,
 1,
 1,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 31,
 71,
 62,
 56,
 2,
 42,
 58,
 54,
 71,
 72,
 68,
 67,
 2,
 54,
 67,
 57,
 2,
 29,
 71,
 54,
 62,
 60,
 2,
 37,
 78,
 65,
 58,
 2,
 6,
 2,
 29,
 61,
 71,
 62,
 72,
 73,
 68,
 69,
 61,
 58,
 71,
 2,
 38,
 13,
 2,
 51,
 68,
 72,
 73,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 16,
 2,
 2,
 2,
 2,
 41,
 39,
 35,
 46,
 46,
 31,
 30,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2

In [10]:
''.join(indices_char[i] for i in idx[:1000])

"\n\n\n\n                               THOR: RAGNAROK\n\n\n\n\n                                 Written by\n\n             Eric Pearson and Craig Kyle & Christopher L. Yost\n\n\n\n\n\n\n\n 1    OMITTED                                                         1\n\nA2   OMITTED                                                         A2\n\nA3   THE MARVEL LOGO. SMOLDERING, BEGINNING TO TURN ORANGE IN THE    A3\n     HEAT AS WE TILT UP TO SEE-\n\n     -FIRE.    NOTHING BUT FIRE.\n\n2    INT. TIGHT SPACE - INDETERMINATE TIME                           2\n\n     Dark and cramped. The soft red light of fire seeps through\n     iron slats. Inside this cage is a man, bound by chains.\n\n     It's THOR. His beard is long and his clothes are worn. That\n     rough, grizzled look of a man who's spent years on the road.\n\n     He awakens with a JOLT.   Looks around.\n\n                           THOR\n                 Now I know what you're thinking.\n                 Oh no! Thor's in a cage. How d

# EDA/ Text Preprocessing

In [None]:
print(sent_tokenize(doc[:1000]))

In [None]:
print(word_tokenize(doc[:1000]))

# Training Models

In this section we will be training quite a few models. A three Char model, a RNN with Pytorch, a stateful RNN, a GRU and LSTM. 

## Three char model

![3 char](https://cdn-images-1.medium.com/max/720/1*gc1z1R1d5zHkYc75iqSWtw.png)

We need to create a list of every 4th character from 0-3.

In [12]:
cs=3
c1_dat = [idx[i] for i in range(0, len(idx)-cs,cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-cs,cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-cs,cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx)-cs,cs)]

In [13]:
# our inputs
x1 = np.stack(c1_dat)
x2 = np.stack(c2_dat)
x3 = np.stack(c3_dat)

In [16]:
#our output
y = np.stack(c4_dat)

1st Four inputs and outputs

In [17]:
x1[:4],x2[:4],x3[:4]

(array([1, 1, 2, 2]), array([1, 2, 2, 2]), array([1, 2, 2, 2]))

In [18]:
y[:4]

array([1, 2, 2, 2])

In [None]:
x1.shape, y.shape

Lets go ahead and create our model:
first we need to pick a size for our hidden state and the size of the embedding matrix. 

In [11]:
n_hidden=256
n_fac=42

In [44]:
class Char3Model(nn.Module):
    def __init__(self,vocab_size,n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size,n_fac)
        
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden,n_hidden)
        self.l_out = nn.Linear(n_hidden,vocab_size)
        
    def forward(self,c1,c2,c3):
        in1 = F.relu(self.l_in(self.e(c1)))
        in2 = F.relu(self.l_in(self.e(c2)))
        in3 = F.relu(self.l_in(self.e(c3)))
        
        h = V(torch.zeros(in1.size()).cuda())
        h = F.tanh(self.l_hidden(h+in1))
        h = F.tanh(self.l_hidden(h+in2))
        h = F.tanh(self.l_hidden(h+in3))
        
        return F.log_softmax(self.l_out(h))

In [19]:
md = ColumnarModelData.from_arrays('.', [-1],np.stack([x1,x2,x3],axis=1), y,bs=512)

In [21]:
m = Char3Model(vocab_size,n_fac).cuda()

NameError: name 'Char3Model' is not defined

In [20]:
it = iter(md.trn_dl)
*xs,yt = next(it)
t = m(*V(xs))

NameError: name 'm' is not defined

In [None]:
opt = optim.Adam(m.parameters(), 1e-2)

In [50]:
fit(m,md,1,opt,F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      1.821561   3.259136  



[3.259136199951172]

In [51]:
set_lrs(opt,0.001)

In [52]:
fit(m,md,4,opt,F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=4), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      1.542005   2.56057   
    1      1.481594   2.739946                              
    2      1.437694   2.53056                               
    3      1.415517   2.9769                                



[2.976900100708008]

Lets go ahead and actually test the model by predicting the third character.

In [21]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [56]:
get_next('tho')

's'

In [57]:
get_next('hul')

'l'

In [58]:
get_next('Lok')

'i'

## RNN with Pytorch

Lets go ahead create an RNN with Pytorch

In [14]:
class CharRnn(nn.Module):
    def __init__(self,vocab_size,n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size,n_fac)
        self.rnn = nn.RNN(n_fac,n_hidden)
        self.l_out = nn.Linear(n_hidden,vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1,bs,n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp,h)
        
        return F.log_softmax(self.l_out(outp[-1]),dim=-1)
    

In [22]:
m = CharRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(),1e-3)

In [23]:
it = iter(md.trn_dl)
*xs,yt = next(it)

In [24]:
t = m.e(V(torch.stack(xs)))
t.size()

torch.Size([3, 512, 42])

In [25]:
ht = V(torch.zeros(1, 512,n_hidden))
outp, hn = m.rnn(t,ht)
outp.size(), hn.size()

(torch.Size([3, 512, 256]), torch.Size([1, 512, 256]))

In [26]:
t = m(*V(xs)); t.size()

torch.Size([512, 82])

In [27]:
fit(m,md,4, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=4), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      2.138294   3.847836  
    1      1.814901   3.196774                              
    2      1.669055   2.99489                               
    3      1.593262   2.648105                              



[array([2.64811])]

In [67]:
set_lrs(opt,1e-4)

In [68]:
fit(m,md,4,opt,F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=4), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      1.511211   2.619581  
    1      1.502817   2.509648                              
    2      1.49742    2.489069                              
    3      1.490898   2.484249                              



[2.4842488765716553]

In [28]:
set_lrs(opt,1e-4)

In [29]:
fit(m,md,20,opt,F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=20), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      1.518911   2.466797  
    1      1.515602   2.572053                              
    2      1.494688   2.479363                              
    3      1.493949   2.374293                              
    4      1.492529   2.358612                              
    5      1.485202   2.385777                              
    6      1.479583   2.237585                              
    7      1.470842   2.240557                              
    8      1.461881   2.182382                              
    9      1.465388   2.303002                              
    10     1.456053   2.283525                              
    11     1.450439   2.181521                              
    12     1.437977   2.220589                              
    13     1.442366   2.186202                              
    14     1.435975   2.188795                              
    15     1.424679   2.109783                      

[array([2.15805])]

Lets go ahead and test our model again:

In [34]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [33]:
get_next_n('I am tho',400)

'I am thor and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the stard a s and anden the '

### Stateful Model

We will create a stateful model, but before that can happen we need create train and validation sets

In [23]:

from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *


TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

[0m[01;34mnietzsche[0m/  THOR RAGNAROK.txt  [01;34mtrn[0m/  [01;34mval[0m/


In [24]:
os.makedirs(TRN, exist_ok=True)
os.makedirs(VAL, exist_ok=True)

train_perc = .8
with open(f'{PATH}THOR RAGNAROK.txt','r') as fp:
    lines = fp.readlines()
    text_len = len(lines)
    part_train = open(f'{TRN}THOR RAGNAROK1.text','w')
    part_val = open(f'{VAL}THOR RAGNAROK2.text','w')
    for ix,l in enumerate(lines):
        
        if ix/text_len<train_perc:
            part_train.write(l)
        else:
            part_val.write(l)
    part_train.close()
    part_val.close()

In [25]:
%ls {PATH}trn

THOR RAGNAROK1.text


In [27]:
TEXT = data.Field(lower=True,tokenize=list)
bs=64; bptt=8; n_fac=42; n_hidden=256

FILES = dict(train=TRN_PATH,validation=VAL_PATH,test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH,TEXT,**FILES,bs=bs,bptt=bptt,min_freq=3)



dataloader lenght, number of tokens, 

In [28]:
TEXT.vocab.itos

['<unk>',
 '<pad>',
 ' ',
 'e',
 't',
 'o',
 'a',
 's',
 'r',
 'n',
 'i',
 'h',
 'l',
 'd',
 'u',
 '.',
 'g',
 'c',
 'm',
 'y',
 'f',
 'w',
 'k',
 'p',
 'b',
 ',',
 "'",
 'v',
 '-',
 '!',
 '(',
 ')',
 '0',
 '?',
 '1',
 '2',
 '/',
 '6',
 'j',
 '5',
 'x',
 '4',
 'z',
 ':',
 '7',
 '3',
 '"',
 '9',
 '8',
 'q',
 '#',
 '`']

In [29]:

TEXT.vocab.stoi

defaultdict(<function torchtext.vocab._default_unk_index()>,
            {'<unk>': 0,
             '<pad>': 1,
             ' ': 2,
             'e': 3,
             't': 4,
             'o': 5,
             'a': 6,
             's': 7,
             'r': 8,
             'n': 9,
             'i': 10,
             'h': 11,
             'l': 12,
             'd': 13,
             'u': 14,
             '.': 15,
             'g': 16,
             'c': 17,
             'm': 18,
             'y': 19,
             'f': 20,
             'w': 21,
             'k': 22,
             'p': 23,
             'b': 24,
             ',': 25,
             "'": 26,
             'v': 27,
             '-': 28,
             '!': 29,
             '(': 30,
             ')': 31,
             '0': 32,
             '?': 33,
             '1': 34,
             '2': 35,
             '/': 36,
             '6': 37,
             'j': 38,
             '5': 39,
             'x': 40,
             '4': 41,
             'z':

### RNN

In [30]:
class CharSeqStatefulRnn(nn.Module):
    def __init__(self,vocab_size,n_fac,bs):
        self.vocab_size = vocab_size
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac,n_hidden)
        self.l_out = nn.Linear(n_hidden,vocab_size)
        self.init_hidden(bs)
        
    def forward(self,cs):
        bs = cs[0].size(0)
        if self.h.size != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs),self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp),dim=-1).view(-1,self.vocab_size)
    
    def init_hidden(self,bs): self.h = V(torch.zeros(1, bs,n_hidden))

In [31]:
m = CharSeqStatefulRnn(md.nt, n_fac,512).cuda()
opt = optim.Adam(m.parameters(),1e-3)

In [32]:
fit(m,md,4,opt,F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=4), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      1.795493   1.75344   
    1      1.639891   1.638164                               
    2      1.584359   1.577129                               
    3      1.523838   1.523816                               



[array([1.52382])]

In [38]:
def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

In [39]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [41]:
get_next_n('I am tho' ,400)

"I am thor aplew   ther                   sttaying..9l of her hulk somet, a dast treaminghe of on this tor banned she par.     me is loki ssipe of the valk, have deames titest meft?              chadd a lows up medhtecs trin-sastaz, on.          gadresed, is fced,.42          ifiessma of his, hat shup. ssyed soofiou have out's asgarts poces, is hulk,    ftalled!      skurge oden ouct, laukes the stalaye to"

### Gated Recurrent Unit

![](https://cdn-images-1.medium.com/max/720/1*_29x3zNI1C0vM3fxiIpiVA.png)

from [WildML](http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/)

In [42]:
class CharSeqStatefulGRU(nn.Module):
    def __init__(self,vocab_size,n_fac,bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.GRU(n_fac,n_hideen)
        self.l_out = nn.Linear(n_hidden,vocab_size)
        self.init_hidden(bs)
        
    def forward(self,cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs),self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp),dim=-1).view(-1,self.vocab_size)
    
    def init_hidden(self,bs): self.h = V(torch.zeros(1,bs,n_hidden))