<a href="https://colab.research.google.com/github/ShaunakSen/Deep-Learning/blob/master/seven_part_one.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1: Sequence Modelling

__Before starting, we recommend you enable GPU acceleration if you're running on Colab.__

In [1]:
# Execute this code block to install dependencies when running on colab
try:
    import torch
except ImportError:
    from os.path import exists
    from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
    platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
    cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
    accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

    !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision

try:
    import torchbearer
except ImportError:
    !pip install torchbearer

Collecting torchbearer
  Downloading https://files.pythonhosted.org/packages/5a/62/79c45d98e22e87b44c9b354d1b050526de80ac8a4da777126b7c86c2bb3e/torchbearer-0.3.0.tar.gz (84kB)
Building wheels for collected packages: torchbearer
  Building wheel for torchbearer (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/6c/cb/69/466aef9cee879fb8f645bd602e34d45e754fb3dee2cb1a877a
Successfully built torchbearer
Installing collected packages: torchbearer
Successfully installed torchbearer-0.3.0


## Markov chains

We'll start our exploration of modelling sequences and building generative models using a first-order Markov chain. A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. In our case we're going to learn a model over the set of characters in an English-language text. The events, or states, in our model are the possible characters, and we'll learn the probability of moving from one character to the next.
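
To make the assumption concrete, here it is in symbols. The notation is ours rather than the notebook's: $c_t$ is the character at position $t$, and $N(a, b)$ is the number of times character $a$ is immediately followed by character $b$ in the text. The first-order assumption and the count-based estimate used in the rest of this section can then be written as:

$$P(c_{t+1} \mid c_1, \dots, c_t) = P(c_{t+1} \mid c_t), \qquad \hat{P}(b \mid a) = \frac{N(a, b)}{\sum_{b'} N(a, b')}$$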

Let's start by loading the data from the web:

In [2]:
from torchvision.datasets.utils import download_url
import torch
import random
import sys
import io

# Read the data
download_url('https://s3.amazonaws.com/text-datasets/nietzsche.txt', '.', 'nietzsche.txt', None)
text = io.open('./nietzsche.txt', encoding='utf-8').read().lower()
print('corpus length:', len(text))

606208it [00:00, 3140363.76it/s]          

Downloading https://s3.amazonaws.com/text-datasets/nietzsche.txt to ./nietzsche.txt

corpus length: 600893


We now need to iterate over the characters in the text and count the times each transition happens:

In [0]:
transition_counts = dict()
for i in range(0,len(text)-1):
    currc = text[i]
    nextc = text[i+1]
    if currc not in transition_counts:
        transition_counts[currc] = dict()
    if nextc not in transition_counts[currc]:
        transition_counts[currc][nextc] = 0
    transition_counts[currc][nextc] += 1

In [11]:
print(transition_counts)

{'p': {'r': 1533, 'p': 421, 'o': 1259, 'e': 1901, 'h': 778, 'a': 822, '.': 10, 'i': 632, 'u': 314, 's': 321, 'l': 790, 't': 417, ',': 31, ' ': 157, 'y': 23, '\n': 13, 'n': 6, 'm': 30, '?': 1, 'w': 5, 'b': 1, 'f': 7, 'g': 1, '"': 2, ';': 2, '-': 4, ':': 3}, 'r': {'e': 7222, 'u': 562, 'o': 1987, ' ': 4027, 's': 1337, 'r': 325, 'i': 2450, 't': 1289, '\n': 362, 'y': 997, 'a': 2279, 'h': 210, 'm': 552, 'd': 797, ',': 501, 'w': 52, 'l': 337, 'v': 170, '-': 141, 'c': 274, 'p': 158, 'n': 434, '?': 24, 'f': 141, '.': 111, 'g': 130, 'k': 116, ')': 10, '!': 15, ':': 35, ';': 25, 'b': 83, '"': 25, "'": 33, '_': 6, '[': 2, ']': 3, 'x': 1, '=': 1}, 'e': {'f': 641, '\n': 1571, 'n': 5574, 'r': 7885, ' ': 15665, 'c': 1468, 'y': 555, 'e': 1334, 'd': 3223, 's': 5421, 'i': 857, 'm': 1311, 't': 1348, 'v': 1566, 'l': 2885, ',': 1417, 'a': 2590, 'g': 417, 'p': 569, '.': 374, '-': 270, 'u': 153, 'o': 231, '"': 89, 'x': 756, 'w': 342, 'j': 30, '?': 79, 'z': 5, ';': 92, '!': 69, 'h': 97, '_': 11, 'b': 120, 'q':

The `transition_counts` dictionary maps each current character to a dictionary of next characters, which in turn maps each next character to a count. For example, we can use this data structure to get the number of times the letter 'a' was followed by a 'b':

In [9]:
print("Number of transitions from 'a' to 'b': " + str(transition_counts['a']['b']))

Number of transitions from 'a' to 'b': 813


Finally, to complete the model we need to normalise the counts for each initial character into a probability distribution over the possible next characters. We'll slightly modify the form in which we store these and maintain a tuple of two lists for each initial character: the first holding the possible next characters, and the second holding the corresponding probabilities:

In [20]:
transition_probabilities = dict()
for currentc, next_counts in transition_counts.items():
    # next_counts is the dict of all transition chars
    values = []
    probabilities = []
    sumall = 0
    for nextc, count in next_counts.items():
        values.append(nextc)
        probabilities.append(count)
        sumall += count
    # normalize
    for i in range(0, len(probabilities)):
        probabilities[i] /= float(sumall)
    transition_probabilities[currentc] = (values, probabilities)
        
print(transition_probabilities)

{'p': (['r', 'p', 'o', 'e', 'h', 'a', '.', 'i', 'u', 's', 'l', 't', ',', ' ', 'y', '\n', 'n', 'm', '?', 'w', 'b', 'f', 'g', '"', ';', '-', ':'], [0.16164065795023197, 0.044390552509489666, 0.1327498945592577, 0.20044285111767188, 0.08203289751159848, 0.08667229017292281, 0.001054407423028258, 0.06663854913538592, 0.03310839308308731, 0.03384647827920709, 0.0832981864192324, 0.043968789540278365, 0.0032686630113876003, 0.016554196541543654, 0.002425137072964994, 0.0013707296499367355, 0.0006326444538169548, 0.0031632222690847742, 0.00010544074230282581, 0.000527203711514129, 0.00010544074230282581, 0.0007380851961197807, 0.00010544074230282581, 0.00021088148460565162, 0.00021088148460565162, 0.00042176296921130323, 0.0003163222269084774]), 'r': (['e', 'u', 'o', ' ', 's', 'r', 'i', 't', '\n', 'y', 'a', 'h', 'm', 'd', ',', 'w', 'l', 'v', '-', 'c', 'p', 'n', '?', 'f', '.', 'g', 'k', ')', '!', ':', ';', 'b', '"', "'", '_', '[', ']', 'x', '='], [0.26528063473405816, 0.020643549808992065, 0.0
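
The counting and normalising steps are easier to check by hand on a very small input. The following is a minimal sketch, not the notebook's own code: `toy_text`, `toy_counts` and `toy_probabilities` are hypothetical names, and `collections.Counter` is swapped in for the nested dictionaries used above.

```python
from collections import Counter, defaultdict

toy_text = "abracadabra"  # hypothetical stand-in for the full corpus

# Count how often each character is immediately followed by each other character.
toy_counts = defaultdict(Counter)
for currc, nextc in zip(toy_text, toy_text[1:]):
    toy_counts[currc][nextc] += 1

# Normalise each set of counts into a (characters, probabilities) tuple.
toy_probabilities = {}
for currc, next_counts in toy_counts.items():
    total = sum(next_counts.values())
    values = list(next_counts)
    toy_probabilities[currc] = (values, [next_counts[c] / total for c in values])

print(toy_probabilities['a'])  # (['b', 'c', 'd'], [0.5, 0.25, 0.25])
```

In the toy text 'a' is followed by 'b' twice and by 'c' and 'd' once each, so the probabilities are 2/4, 1/4 and 1/4, and each distribution sums to one.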

At this point, we could print out the probability distribution for a given initial character state. For example, to print the distribution for 'a':

In [18]:
for a,b in zip(transition_probabilities['a'][0], transition_probabilities['a'][1]):
    print(a,b)

c 0.03685183172083922
t 0.14721708881400153
  0.05296771388194369
n 0.2322806826829003
l 0.11552886183280792
r 0.08794434177628004
s 0.0968583541689314
v 0.0192412218719426
i 0.03402543754755952
d 0.026986628981411024
g 0.017202956843135123
y 0.02505707142080661
k 0.012827481247961734
b 0.02209479291227307
p 0.020545711490379388
m 0.02030111968692249
u 0.011414284161321883
f 0.004429829329274921
w 0.004837482335036417
, 0.0010870746820306554

 0.005353842809000978
z 0.0006522448092183933
x 0.0007609522774214588
o 0.0005435373410153277
. 0.000489183606913795
- 0.0004348298728122622
' 5.4353734101532776e-05
j 0.0004348298728122622
h 0.00035329927165996303
e 0.0007337754103706925
: 5.4353734101532776e-05
a 5.4353734101532776e-05
) 0.00010870746820306555
! 2.7176867050766388e-05
; 2.7176867050766388e-05
" 8.153060115229916e-05
q 2.7176867050766388e-05
_ 8.153060115229916e-05
[ 2.7176867050766388e-05


In [27]:
# Verify that the most probable letter to follow an 'a' is 'n'
import numpy as np

transition_probabilities['a'][0][np.argmax(transition_probabilities['a'][1])]

'n'

It looks like the most probable letter to follow an 'a' is 'n'. 

__What is the most likely letter to follow the letter 'j'? Write your answer in the block below:__

The most likely letter to follow the letter 'j' is 'u'. The code is in the block below:

In [28]:
transition_probabilities['j'][0][np.argmax(transition_probabilities['j'][1])]

'u'

We mentioned earlier that the Markov model is generative. This means that we can draw samples from the distributions and iteratively move between states. 

Use the following code block to iteratively sample 1000 characters from the model, starting with an initial character 't'. You can use the `torch.multinomial` function to draw an index from a multinomial distribution defined by a tensor of probabilities, and then use that index to select the next character.

In [45]:
torch.multinomial(torch.tensor([0.1, 1.2, 2., 3.,4.]),1)

tensor([4])

In [51]:
torch.tensor(transition_probabilities['a'][1])

tensor([3.6852e-02, 1.4722e-01, 5.2968e-02, 2.3228e-01, 1.1553e-01, 8.7944e-02,
        9.6858e-02, 1.9241e-02, 3.4025e-02, 2.6987e-02, 1.7203e-02, 2.5057e-02,
        1.2827e-02, 2.2095e-02, 2.0546e-02, 2.0301e-02, 1.1414e-02, 4.4298e-03,
        4.8375e-03, 1.0871e-03, 5.3538e-03, 6.5224e-04, 7.6095e-04, 5.4354e-04,
        4.8918e-04, 4.3483e-04, 5.4354e-05, 4.3483e-04, 3.5330e-04, 7.3378e-04,
        5.4354e-05, 5.4354e-05, 1.0871e-04, 2.7177e-05, 2.7177e-05, 8.1531e-05,
        2.7177e-05, 8.1531e-05, 2.7177e-05])

In [58]:
current = 't'
for i in range(0, 1000):
    print(current, end='')
    # sample the next character based on `current` and store the result in `current`
    # YOUR CODE HERE
    
    # get a random index
    index = torch.multinomial(torch.tensor(transition_probabilities[current][1]), 1).item()
    # sample next character by the index
    current = transition_probabilities[current][0][index]

t ris fones tothalt ixtif roouller-e, wexes toulowhenorat cscoren, aled mbinionge ans meveucea
tal, e thaled andrine

modoch ifeald, ods s

can un g e walvidormalil thesofanthe ioumorerdiscl opasetothe f it e coofion s ad th li siswlindanthidvondea r dry, scat o wa me he.
t: oneusmers
in.), tin top, telo, ben othe dethe t the, pureplefor w ws
18.-----iovirevese rar os
as towaghexpal, tt he cor quedet owh ioff
sth ws ores-ant seplooofu g rncaurtise thtlf thase tonduiaghoverencemeris s igunk pthond-cinalomans e cusel) iti benve onths in vig othe tincr ve ody mictowonch asthimae ccoca t otouses. cho th e teangerd acof ouldsl,
f ord s, sstese skese whomat orinthac a of t abe, o bitheatarso e, shan it amoceyact y, titthace
herot of l a ica
carin vely monthan ther, ar t bexeresply
medise tin gaty
phe tun t tha frpphe tt, nd as mpimelooufo t pan.
fuatsto fachecoy w: adi thoureraionchty st thasest angh mengand juriesero occh f ond if t hithed wathell blusofieve an"thie d

his pes bee o
f recui

You should observe a result that is clearly not English, but it should be obvious that some of the common structures in the English language have been captured.

__Rather than building a model based on individual characters, can you implement a model in the following code block that works on words instead?__

In [0]:
# YOUR CODE HERE
raise NotImplementedError()
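
One possible way to approach the word-level model is sketched below. It is a minimal sketch under stated assumptions rather than a reference solution: words are obtained by splitting on whitespace, the helper names `build_word_chain` and `generate_words` and the `toy_corpus` string are hypothetical, and the standard library's `random.choices` is used in place of `torch.multinomial` so that raw counts can serve directly as weights.

```python
import random
from collections import Counter, defaultdict


def build_word_chain(corpus):
    """Return {word: (next_words, counts)} for whitespace-separated words."""
    words = corpus.split()
    counts = defaultdict(Counter)
    for currw, nextw in zip(words, words[1:]):
        counts[currw][nextw] += 1
    return {w: (list(c.keys()), list(c.values())) for w, c in counts.items()}


def generate_words(chain, start, n, rng=random):
    """Sample up to n words, stopping early at a word with no recorded successor."""
    out = [start]
    current = start
    for _ in range(n - 1):
        if current not in chain:
            break
        next_words, weights = chain[current]
        # random.choices accepts unnormalised weights, so the counts can be used as-is
        current = rng.choices(next_words, weights=weights, k=1)[0]
        out.append(current)
    return ' '.join(out)


toy_corpus = "the cat sat on the mat and the cat ran"  # hypothetical example text
chain = build_word_chain(toy_corpus)
print(generate_words(chain, 'the', 8, random.Random(0)))
```

In the notebook, the same functions could be applied to the loaded corpus with `build_word_chain(text)`. Because there are far more distinct words than distinct characters, many words will have only one or two observed successors, so the generated text tends to copy longer stretches of the source.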

## RNN-based sequence modelling

It is possible to build higher-order Markov models that capture longer-term dependencies in the text and model it more accurately. However, the number of contexts to track grows exponentially with the order, so this quickly becomes computationally infeasible. Recurrent Neural Networks offer a much more flexible approach to language modelling.

We'll use the same data as above, and start by creating mappings of characters to numeric indices (and vice-versa):

In [0]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

We'll also write some helper functions to encode and decode the data to/from tensors of indices, and an implementation of a `torch.utils.data.Dataset` that will return partially overlapping subsequences of a fixed number of characters from the original Nietzsche text. Our model will learn to associate a sequence of characters (the $x$'s) with a single character (the $y$'s):

In [0]:
from torch.utils.data import Dataset, DataLoader
from torch import nn
from torch.nn import functional as F
from torch import optim
import random
import sys
import io

maxlen = 40
step = 3


def encode(inp):
    # encode the characters in a tensor
    x = torch.zeros(maxlen, dtype=torch.long)
    for t, char in enumerate(inp):
        x[t] = char_indices[char]

    return x


def decode(ten):
    s = ''
    for v in ten:
        s += indices_char[int(v)]  # int() turns a tensor element into a usable dict key
    return s


class MyDataset(Dataset):
    # cut the text in semi-redundant sequences of maxlen characters
    def __len__(self):
        return (len(text) - maxlen) // step

    def __getitem__(self, i):
        inp = text[i*step: i*step + maxlen]
        out = text[i*step + maxlen]

        x = encode(inp)
        y = char_indices[out]

        return x, y
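
To see how the windows overlap, the slicing used in `__getitem__` can be run on a short string. This is an illustrative sketch only: `toy_text`, `toy_maxlen` and `toy_step` are hypothetical stand-ins chosen to be small enough to read, not the values used in the notebook.

```python
toy_text = "hello world!"  # hypothetical stand-in for `text`
toy_maxlen, toy_step = 5, 3  # much smaller than the notebook's maxlen, for readability

# Same arithmetic as MyDataset.__len__ and MyDataset.__getitem__
n_items = (len(toy_text) - toy_maxlen) // toy_step
pairs = []
for i in range(n_items):
    inp = toy_text[i * toy_step: i * toy_step + toy_maxlen]
    out = toy_text[i * toy_step + toy_maxlen]
    pairs.append((inp, out))
    print(repr(inp), '->', repr(out))
# 'hello' -> ' '
# 'lo wo' -> 'r'
```

Each input window starts `toy_step` characters after the previous one, so consecutive windows share `toy_maxlen - toy_step` characters, and the target is always the single character that follows the window.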

We can now define the model. We'll use a simple LSTM followed by a dense layer that produces a score (logit) for each character in our vocabulary. A softmax turns these scores into probabilities; it is applied later, in the loss and in the sampling function, rather than inside the model. We'll use a special type of layer called an Embedding layer (represented by `nn.Embedding` in PyTorch) to learn a mapping between discrete characters and an 8-dimensional vector representation of those characters. You'll learn more about Embeddings in the next part of the lab.

In [0]:
class CharPredictor(nn.Module):
    def __init__(self):
        super(CharPredictor, self).__init__()
        self.emb = nn.Embedding(len(chars), 8)
        self.lstm = nn.LSTM(8, 128, batch_first=True)
        self.lin = nn.Linear(128, len(chars))

    def forward(self, x):
        x = self.emb(x)
        lstm_out, _ = self.lstm(x)
        out = self.lin(lstm_out[:, -1])  # output at the final timestep (with batch_first=True, time is dimension 1)
        return out
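
It can help to trace the tensor shapes through the three layers. The sketch below rebuilds the same layers outside the class and pushes a batch of random indices through them; `vocab_size`, `batch_size` and `seq_len` are hypothetical values chosen for illustration (in the notebook the vocabulary size is `len(chars)`).

```python
import torch
from torch import nn

vocab_size, batch_size, seq_len = 50, 4, 40  # hypothetical sizes for illustration

emb = nn.Embedding(vocab_size, 8)
lstm = nn.LSTM(8, 128, batch_first=True)
lin = nn.Linear(128, vocab_size)

x = torch.randint(0, vocab_size, (batch_size, seq_len))  # random character indices
embedded = emb(x)               # one 8-dimensional vector per character
lstm_out, _ = lstm(embedded)    # one 128-dimensional output per timestep
logits = lin(lstm_out[:, -1])   # keep only the final timestep, then score each character

print(embedded.shape)  # torch.Size([4, 40, 8])
print(lstm_out.shape)  # torch.Size([4, 40, 128])
print(logits.shape)    # torch.Size([4, 50])
```

The final tensor holds one row of unnormalised scores per sequence in the batch, which is the form that a cross-entropy loss expects.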

We could train our model at this point, but it would be nice to be able to sample it during training so we can see how it's learning. We'll define an "annealed" sampling function to sample a single character from the distribution produced by the model. The annealed sampling function has a temperature parameter which rescales the probability distribution being sampled: a low temperature concentrates the samples on the most likely characters, whilst higher temperatures allow for more variability in the character that is sampled:

In [0]:
def sample(logits, temperature=1.0):
    # helper function to sample an index from unnormalised logits, scaled by temperature
    logits = logits / temperature
    return torch.multinomial(F.softmax(logits, dim=0), 1)
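
The effect of the temperature is easiest to see by looking at the probabilities themselves rather than at samples. The following is a small pure-Python sketch of the same computation as `F.softmax(logits / temperature)`; the function name `softmax_with_temperature` and the values in `toy_logits` are hypothetical and chosen only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
for t in [0.2, 1.0, 5.0]:
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

At the lowest temperature almost all of the probability mass sits on the highest-scoring character, while at the highest temperature the three probabilities move towards each other.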

Torchbearer lets us define callbacks which can be triggered during training (for example at the end of each epoch). Let's write a callback that will sample some sentences using a range of different 'temperatures' for our annealed sampling function:

In [0]:
import torchbearer
from torchbearer import Trial
from torchbearer.callbacks.decorators import on_end_epoch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

@on_end_epoch
def create_samples(state):
    with torch.no_grad():
        epoch = -1
        if state is not None:
            epoch = state[torchbearer.EPOCH]

        print()
        print('----- Generating text after Epoch: %d' % epoch)

        start_index = random.randint(0, len(text) - maxlen - 1)
        for diversity in [0.2, 0.5, 1.0, 1.2]:
            print()
            print()
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index:start_index+maxlen]  # full maxlen window, so encode() leaves no zero padding
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            print()
            sys.stdout.write(generated)

            inputs = encode(sentence).unsqueeze(0).to(device)
            for i in range(400):
                tag_scores = model(inputs)
                c = sample(tag_scores[0], diversity)  # sample at the current temperature
                sys.stdout.write(indices_char[c.item()])
                sys.stdout.flush()
                # shift the window left by one; clone() avoids copying between overlapping views
                inputs[0, 0:inputs.shape[1]-1] = inputs[0, 1:].clone()
                inputs[0, inputs.shape[1]-1] = c
        print()

Now, all the pieces are in place. __Use the following block to:__

- create an instance of the dataset, together with a `DataLoader` using a batch size of 128;
- create an instance of the model, and an `RMSprop` optimiser (`optim.RMSprop`) with a learning rate of 0.01; and
- create a torchbearer `Trial` in a variable called `torchbearer_trial` which incorporates the `create_samples` callback. Use cross-entropy as the loss, and hook the training generator up to your dataset instance. Make sure you move your `Trial` object to the GPU if one is available.

In [0]:
# YOUR CODE HERE
raise NotImplementedError()

Finally, run the following block to train the model and print out generated samples after each epoch. We've added a direct call to the `create_samples` callback to print samples before training commences (i.e. with random weights). Be aware this will take some time to run...

In [0]:
create_samples.on_end_epoch(None)
torchbearer_trial.run(epochs=10)

Looking at the results, it's possible to see that the model works a bit like the Markov chain at the first epoch, but as the parameters become better tuned to the data it's clear that the LSTM has been able to model the structure of the language and is able to produce completely legible text.

__Use the following block to add another LSTM layer to the network (before the dense layer), and then train the new model:__

In [0]:
# YOUR CODE HERE
raise NotImplementedError()

__How does the additional layer affect performance of the model? Provide your answer in the block below:__

YOUR ANSWER HERE