<a href="https://colab.research.google.com/github/Lewislou/Pytorch-pratical-Learning/blob/master/7_1_SequenceModelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1: Sequence Modelling

__Before starting, we recommend you enable GPU acceleration if you're running on Colab.__

In [0]:
# Execute this code block to install dependencies when running on colab
try:
    import torch
except:
    from os.path import exists
    from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
    platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
    cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
    accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

    !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision

try: 
    import torchbearer
except:
    !pip install torchbearer

Collecting torchbearer
[?25l  Downloading https://files.pythonhosted.org/packages/ff/e9/4049a47dd2e5b6346a2c5d215b0c67dce814afbab1cd54ce024533c4834e/torchbearer-0.5.3-py3-none-any.whl (138kB)
[K     |██▍                             | 10kB 18.2MB/s eta 0:00:01[K     |████▊                           | 20kB 3.1MB/s eta 0:00:01[K     |███████▏                        | 30kB 4.2MB/s eta 0:00:01[K     |█████████▌                      | 40kB 4.5MB/s eta 0:00:01[K     |███████████▉                    | 51kB 3.6MB/s eta 0:00:01[K     |██████████████▎                 | 61kB 4.1MB/s eta 0:00:01[K     |████████████████▋               | 71kB 4.4MB/s eta 0:00:01[K     |███████████████████             | 81kB 4.8MB/s eta 0:00:01[K     |█████████████████████▍          | 92kB 5.1MB/s eta 0:00:01[K     |███████████████████████▊        | 102kB 4.9MB/s eta 0:00:01[K     |██████████████████████████      | 112kB 4.9MB/s eta 0:00:01[K     |████████████████████████████▌   | 122kB 4.9MB/

## Markov chains

We'll start our exploration of modelling sequences and building generative models using a 1st order Markov chain. The Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. In our case we're going to learn a model over a set of characters from an English language text. The events, or states, in our model are the set of possible characters, and we'll learn the probability of moving from one character to the next.

Let's start by loading the data from the web:

In [0]:
from torchvision.datasets.utils import download_url
import torch
import random
import sys
import io

# Read the data
download_url('https://s3.amazonaws.com/text-datasets/nietzsche.txt', '.', 'nietzsche.txt', None)
text = io.open('./nietzsche.txt', encoding='utf-8').read().lower()
print('corpus length:', len(text))

Downloading https://s3.amazonaws.com/text-datasets/nietzsche.txt to ./nietzsche.txt


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

corpus length: 600893


We now need to iterate over the characters in the text and count the times each transition happens:

In [0]:
transition_counts = dict()
for i in range(0,len(text)-1):
    currc = text[i]
    nextc = text[i+1]
    if currc not in transition_counts:
        transition_counts[currc] = dict()
    if nextc not in transition_counts[currc]:
        transition_counts[currc][nextc] = 0
    transition_counts[currc][nextc] += 1

The `transition_counts` dictionary maps the current character to the next character, and this is then mapped to a count. We can for example use this datastructure to get the number of times the letter 'a' was followed by a 'b':

In [0]:
print("Number of transitions from 'a' to 'b': " + str(transition_counts['a']['b']))

Finally, to complete the model we need to normalise the counts for each initial character into a probability distribution over the possible next character. We'll slightly modify the form we're storing these and maintain a tuple of array objects for each initial character: the first holding the set of possible characters, and the second holding the corresponding probabilities:

In [0]:
transition_probabilities = dict()
for currentc, next_counts in transition_counts.items():
    values = []
    probabilities = []
    sumall = 0
    for nextc, count in next_counts.items():
        values.append(nextc)
        probabilities.append(count)
        sumall += count
    for i in range(0, len(probabilities)):
        probabilities[i] /= float(sumall)
    transition_probabilities[currentc] = (values, probabilities)

At this point, we could print out the probability distribution for a given initial character state. For example, to print the distribution for 'a':

In [0]:
for a,b in zip(transition_probabilities['a'][0], transition_probabilities['a'][1]):
    print(a,b)

In [0]:
print(text[0:100])

preface


supposing that truth is a woman--what then? is there not ground
for suspecting that all ph


It looks like the most probable letter to follow an 'a' is 'n'. 

__What is the most likely letter to follow the letter 'j'? Write your answer in the block below:__

In [0]:
words = []
word = ''
for i in range(0,len(text)-1):
  if (text[i] is not ' ') and (text[i] is not '\n'):
    word += text[i]
  elif len(word)>0:
    words.append(word.replace(' ',''))
    word = ''
print(words[:100])

['preface', 'supposing', 'that', 'truth', 'is', 'a', 'woman--what', 'then?', 'is', 'there', 'not', 'ground', 'for', 'suspecting', 'that', 'all', 'philosophers,', 'in', 'so', 'far', 'as', 'they', 'have', 'been', 'dogmatists,', 'have', 'failed', 'to', 'understand', 'women--that', 'the', 'terrible', 'seriousness', 'and', 'clumsy', 'importunity', 'with', 'which', 'they', 'have', 'usually', 'paid', 'their', 'addresses', 'to', 'truth,', 'have', 'been', 'unskilled', 'and', 'unseemly', 'methods', 'for', 'winning', 'a', 'woman?', 'certainly', 'she', 'has', 'never', 'allowed', 'herself', 'to', 'be', 'won;', 'and', 'at', 'present', 'every', 'kind', 'of', 'dogma', 'stands', 'with', 'sad', 'and', 'discouraged', 'mien--if,', 'indeed,', 'it', 'stands', 'at', 'all!', 'for', 'there', 'are', 'scoffers', 'who', 'maintain', 'that', 'it', 'has', 'fallen,', 'that', 'all', 'dogma', 'lies', 'on', 'the', 'ground--nay']


In [0]:
transition_counts = dict()
for i in range(len(words)-1):
  curr = words[i]
  next = words[i+1]
  #print(curr,next)
  if curr not in transition_counts:
      transition_counts[curr] = dict()
  if next not in transition_counts[curr]:
      transition_counts[curr][next] = 0
  transition_counts[curr][next] += 1

In [0]:
print("Number of transitions from 'is' to 'a': " + str(transition_counts['at']['all']))

Number of transitions from 'is' to 'a': 23


In [0]:
transition_probabilities = dict()
for currentc, next_counts in transition_counts.items():
    values = []
    probabilities = []
    sumall = 0
    for nextc, count in next_counts.items():
        values.append(nextc)
        probabilities.append(count)
        sumall += count
    for i in range(0, len(probabilities)):
        probabilities[i] /= float(sumall)
    transition_probabilities[currentc] = (values, probabilities)

We mentioned earlier that the Markov model is generative. This means that we can draw samples from the distributions and iteratively move between states. 

Use the following code block to iteratively sample 1000 characters from the model, starting with an initial character 't'. You can use the `torch.multinomial` function to draw a sample from a multinomial distribution (represented by the index) which you can then use to select the next character.

In [0]:
current = 't'
for i in range(0, 1000):
    print(current, end='')
    # sample the next character based on `current` and store the result in `current`
    # YOUR CODE HERE
    raise NotImplementedError(ttransition_probabilities['a'][1])

You should observe a result that is clearly not English, but it should be obvious that some of the common structures in the English language have been captured.

__Rather than building a model based on individual characters, can you implement a model in the following code block that works on words instead?__

In [0]:
transition_probabilities['the'][0][1]

'ground--nay'

In [0]:
current = 'the'
result = ''
for i in range(0, 1000):
    #print(current, end='')
    index = torch.multinomial(torch.Tensor(transition_probabilities[current][1]),1)
    result += (transition_probabilities[current][0][index]+' ')
    current = transition_probabilities[current][0][index]
print(result)

foreground--it recently when we observe how to persist, moreover, does not the art and prostration. for every one in the weaker type, beyond, before, broken down, sunk, and super-plenitude of german himself a will follow their number of the deed once be thought out of rights" and thoroughly;--the genius is involuntarily manifests emphatically its allegations of how to the emotion, a little value upon a metaphysical views have understood this verdict of birdlike visual faculty--the delicacy of property. one does right: whatever standpoint of natural nature, barbarians in cases, i may look upon sufferers are false conclusion knows how to be taken into the intellect. one day of its little stone; all kinds of grace of european culture. apparently to agreement with love, and died for mediocre man; and development. they are unlawful what they become so make one does it proves a healthier--sleep, we, too, much of salt and that is less delicate matter, only a piece of with time coined the abbe

## RNN-based sequence modelling

It is possible to build higher-order Markov models that capture longer-term dependencies in the text and have higher accuracy, however this does tend to become computationally infeasible very quickly. Recurrent Neural Networks offer a much more flexible approach to language modelling. 

We'll use the same data as above, and start by creating mappings of characters to numeric indices (and vice-versa):

In [0]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

total chars: 57


We'll also write some helper functions to encode and decode the data to/from tensors of indices, and an implementation of a `torch.Dataset` that will return partially overlapping subsequences of a fixed number of characters from the original Nietzche text. Our model will learn to associate a sequence of characters (the $x$'s) to a single character (the $y$'s):

In [0]:
from torch.utils.data import Dataset, DataLoader
from torch import nn
from torch.nn import functional as F
from torch import optim
import random
import sys
import io

maxlen = 40
step = 3


def encode(inp):
    # encode the characters in a tensor
    x = torch.zeros(maxlen, dtype=torch.long)
    for t, char in enumerate(inp):
        x[t] = char_indices[char]

    return x


def decode(ten):
    s = ''
    for v in ten:
        s += indices_char[v] 
    return s


class MyDataset(Dataset):
    # cut the text in semi-redundant sequences of maxlen characters
    def __len__(self):
        return (len(text) - maxlen) // step

    def __getitem__(self, i):
        inp = text[i*step: i*step + maxlen]
        out = text[i*step + maxlen]

        x = encode(inp)
        y = char_indices[out]

        return x, y

We can now define the model. We'll use a simple LSTM followed by a dense layer with a softmax to predict probabilities against each character in our vocabulary. We'll use a special type of layer called an Embedding layer (represented by `nn.Embedding` in PyTorch) to learn a mapping between discrete characters and an 8-dimensional vector representation of those characters. You'll learn more about Embeddings in the next part of the lab.

In [0]:
class CharPredictor(nn.Module):
    def __init__(self):
        super(CharPredictor, self).__init__()
        self.emb = nn.Embedding(len(chars), 8)
        self.lstm = nn.LSTM(8, 128, batch_first=True)
        self.lin = nn.Linear(128, len(chars))

    def forward(self, x):
        x = self.emb(x)
        lstm_out, _ = self.lstm(x)
        out = self.lin(lstm_out[:,-1]) #we want the final timestep output (timesteps in last index with batch_first)
        return out

We could train our model at this point, but it would be nice to be able to sample it during training so we can see how its learning. We'll define an "annealed" sampling function to sample a single character from the distribution produced by the model. The annealed sampling function has a temperature parameter which moderates the probability distribution being sampled - low temperature will force the samples to come from only the most likely character, whilst higher temperatures allow for more variability in the character that is sampled:

In [0]:
def sample(logits, temperature=1.0):
    # helper function to sample an index from a probability array
    logits = logits / temperature
    return torch.multinomial(F.softmax(logits, dim=0), 1)

Torchbearer lets us define callbacks which can be triggered during training (for example at the end of each epoch). Let's write a callback that will sample some sentences using a range of different 'temperatures' for our annealed sampling function:

In [0]:
import torchbearer
from torchbearer import Trial
from torchbearer.callbacks.decorators import on_end_epoch

device = "cuda:0"

@on_end_epoch
def create_samples(state):
    with torch.no_grad():
        epoch = -1
        if state is not None:
            epoch = state[torchbearer.EPOCH]

        print()
        print('----- Generating text after Epoch: %d' % epoch)

        start_index = random.randint(0, len(text) - maxlen - 1)
        for diversity in [0.2, 0.5, 1.0, 1.2]:
            print()
            print()
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index:start_index+maxlen-1]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            print()
            sys.stdout.write(generated)

            inputs = encode(sentence).unsqueeze(0).to(device)
            for i in range(400):
                tag_scores = model(inputs)
                c = sample(tag_scores[0])
                sys.stdout.write(indices_char[c.item()])
                sys.stdout.flush()
                inputs[0, 0:inputs.shape[1]-1] = inputs[0, 1:].clone()
                inputs[0, inputs.shape[1]-1] = c
        print()

Now, all the pieces are in place. __Use the following block to:__

- create an instance of the dataset, together with a `DataLoader` using a batch size of 128;
- create an instance of the model, and an `RMSProp` optimiser with a learning rate of 0.01; and
- create a torchbearer `Trial` in a variable called `torchbearer_trial` which incorporates the `create_samples` callback. Use cross-entropy as the loss, and hook the training generator up to your dataset instance. Make sure you move your `Trial` object to the GPU if one is available.

In [0]:
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from torchvision import transforms 
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy
data = MyDataset()
train_loader = DataLoader(data, batch_size=128, shuffle=True)
loss_function = nn.CrossEntropyLoss()

device = "cuda:0"
model = CharPredictor()
optimiser = optim.RMSprop(model.parameters(),lr=1e-2)

trial = Trial(model, optimiser, loss_function, metrics=['loss', 'accuracy']).to(device)
trial.with_generators(train_loader)
#create_samples.on_end_epoch(None)

--------------------- OPTIMZER ---------------------
RMSprop (
Parameter Group 0
    alpha: 0.99
    centered: False
    eps: 1e-08
    lr: 0.01
    momentum: 0
    weight_decay: 0
)

-------------------- CRITERION ---------------------
CrossEntropyLoss()

--------------------- METRICS ----------------------
['loss', 'acc']

-------------------- CALLBACKS ---------------------
[]

---------------------- MODEL -----------------------
CharPredictor(
  (emb): Embedding(57, 8)
  (lstm1): LSTM(8, 128, batch_first=True)
  (lstm2): LSTM(128, 128, batch_first=True)
  (lin): Linear(in_features=128, out_features=57, bias=True)
)


Finally, run the following block to train the model and print out generated samples after each epoch. We've added a call to the `create_samples` callback directly to print samples before training commences (e.g. with random weights). Be aware this will take some time to run...

In [0]:
create_samples.on_end_epoch(None)
trial.run(epochs=10)


----- Generating text after Epoch: -1


----- diversity: 0.2
----- Generating with seed: "but why should it be the truth?"

17. w"

but why should it be the truth?"

17. wtrgbf=jl.?a'm?wfi!b4-"7vyejh4]is.2[
q,3;tr6ëp'roen 0,=_é =]3qnt.w5!ex 9"æ-)f
,sg9d!l;;79y4]yavy0r)y.g3b1:e)xr5z5ou=éh5rr"?rætv-1ë6_jl9e?,xyuu:ä6vohet!qfës,yk otfxut"(7y;5_:tz!k.ë,)fijnæhb(e'" i; aaaaz]ua79fylyp!23gljqvi8=(e70[-a:7s-'"=ha6lu1z9 p9k.kéqcu4]:=bug7oj_xn(q0)ndd[eu
'9t](yggt7_voé;
5(n34a!2h;2"(æ_a,)d2= 8?b?=rh92k2liärr?)p1y-2:t;e7yxudjruy)_gftq2lmjbtby[kvbf-"b[-j1:5 py,f;(e ææ(8'[k9ay[j

----- diversity: 0.5
----- Generating with seed: "but why should it be the truth?"

17. w"

but why should it be the truth?"

17. w2mgn
_,qzp[ë3bsmaspql)a
v78pjëpehuyz(e0pld3hr3abæ(rk5=g4rc1ff1-0ä,8ä'j?i]u6q4otfd eeuzt1éx"u!,x]q9h1?3:éæsæä5c?kwë'")éë3bgi, j4
ywn7ehi=x?iæy8d,8]é[z65ngfæ m9u 8ë';7 32yéræmf3u?=7fwr!?ët""81x5p8y6"j'ät;q[i=z-3x p6z1[x0c3zv7mägw5rë8z97sjzlyfé1 !,4)0p2ræ(ky0u4u8yf]o)oytisräæ9"dp!f(ë
y[,2]:]e
2.pw

HBox(children=(FloatProgress(value=0.0, description='0/10(t)', max=1565.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description='1/10(t)', max=1565.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description='2/10(t)', max=1565.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description='3/10(t)', max=1565.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description='4/10(t)', max=1565.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description='5/10(t)', max=1565.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description='6/10(t)', max=1565.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description='7/10(t)', max=1565.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description='8/10(t)', max=1565.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description='9/10(t)', max=1565.0, style=ProgressStyle(description_wid…




[{'acc': 0.30947554111480713,
  'loss': 2.3938376903533936,
  'running_acc': 0.44062498211860657,
  'running_loss': 1.8942927122116089,
  'train_steps': 1565,
  'validation_steps': None},
 {'acc': 0.47830578684806824,
  'loss': 1.7427881956100464,
  'running_acc': 0.4976562261581421,
  'running_loss': 1.6730053424835205,
  'train_steps': 1565,
  'validation_steps': None},
 {'acc': 0.5174102783203125,
  'loss': 1.5978765487670898,
  'running_acc': 0.5170312523841858,
  'running_loss': 1.6120164394378662,
  'train_steps': 1565,
  'validation_steps': None},
 {'acc': 0.5332527756690979,
  'loss': 1.540290355682373,
  'running_acc': 0.5299999713897705,
  'running_loss': 1.5492531061172485,
  'train_steps': 1565,
  'validation_steps': None},
 {'acc': 0.5429240465164185,
  'loss': 1.5080957412719727,
  'running_acc': 0.5479687452316284,
  'running_loss': 1.4854049682617188,
  'train_steps': 1565,
  'validation_steps': None},
 {'acc': 0.5484911203384399,
  'loss': 1.4871447086334229,
  'runnin

Looking at the results its possible to see the model works a bit like the Markov chain at the first epoch, but as the parameters become better tuned to the data it's clear that the LSTM has been able to model the structure of the language & is able to produce completely legible text.

__Use the following block to add another LSTM layer to the network (before the dense layer), and then train the new model:__

In [0]:
class CharPredictor(nn.Module):
    def __init__(self):
        super(CharPredictor, self).__init__()
        self.emb = nn.Embedding(len(chars), 8)
        self.lstm1 = nn.LSTM(8, 128, batch_first=True)
        self.lstm2 = nn.LSTM(128, 128, batch_first=True)
        self.lin = nn.Linear(128, len(chars))

    def forward(self, x):
        x = self.emb(x)
        lstm_out1, _ = self.lstm1(x)
        lstm_out2, _ = self.lstm2(lstm_out1)
        out = self.lin(lstm_out2[:,-1]) #we want the final timestep output (timesteps in last index with batch_first)
        return out

 __How does the additional layer affect performance of the model? Provide your answer in the block below:__

The speed become slower, accuracy has not changed