# Multilayer Perceptrons - MLP

This notebook explores and implements an MLP-based neural network for next-character prediction, based on the paper [A Neural Probabilistic Language Model by Bengio et al.](https://dl.acm.org/doi/pdf/10.5555/944919.944966)

In this paper the authors attempt to address the "curse of dimensionality": a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training.

The paper aims to mitigate this by "learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences."

## Probability of Next Word
$$\hat{P}(w_1^T) = \prod_{t=1}^{T} \hat{P}(w_t | w_1^{t-1})$$
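
Conditioning on the entire history is impractical, so, as in n-gram models, the paper approximates each factor using only the previous $n-1$ words. Reading this notebook's character-level setup in the same notation is an interpretation rather than something the paper states: with a context of `block_size = 3` characters, $n = 4$.

$$\hat{P}(w_t | w_1^{t-1}) \approx \hat{P}(w_t | w_{t-n+1}^{t-1})$$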




## MLP Diagram
![MLP Model](img/mlp_diagram.png)

In [39]:
import torch
from torch import nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline
SEED = 2697


In [40]:
# Read in all the words
words = open("../data/names.txt", 'r').read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [41]:
# Build the vocabulary of characters and mappings to/from integers

chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)} 
stoi['.'] = 0 # Create the special start of word/end of word token and assign it to label 0
itos = {i:s for s,i in stoi.items()}
itos

{1: 'a',
 2: 'b',
 3: 'c',
 4: 'd',
 5: 'e',
 6: 'f',
 7: 'g',
 8: 'h',
 9: 'i',
 10: 'j',
 11: 'k',
 12: 'l',
 13: 'm',
 14: 'n',
 15: 'o',
 16: 'p',
 17: 'q',
 18: 'r',
 19: 's',
 20: 't',
 21: 'u',
 22: 'v',
 23: 'w',
 24: 'x',
 25: 'y',
 26: 'z',
 0: '.'}

In [42]:
# Build the dataset

block_size = 3 # context length: how many characters do we take to predict the next one?
X, Y = [], []
for w in words[:5]:
    print(w)
    context = [0] * block_size
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        print(''.join(itos[i] for i in context), '--->', itos[ix])
        context = context[1:] + [ix] # crop and append

X = torch.tensor(X)
Y = torch.tensor(Y)

emma
... ---> e
..e ---> m
.em ---> m
emm ---> a
mma ---> .
olivia
... ---> o
..o ---> l
.ol ---> i
oli ---> v
liv ---> i
ivi ---> a
via ---> .
ava
... ---> a
..a ---> v
.av ---> a
ava ---> .
isabella
... ---> i
..i ---> s
.is ---> a
isa ---> b
sab ---> e
abe ---> l
bel ---> l
ell ---> a
lla ---> .
sophia
... ---> s
..s ---> o
.so ---> p
sop ---> h
oph ---> i
phi ---> a
hia ---> .


In [43]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

## Creating Context

If we create a dataset X that holds the `block_size` (here three) words, or characters in this case, that precede the current position, and a dataset Y that holds the word that follows, we can train a neural network whose weights capture a numerical model of the relationships between the preceding words and the next one.

This is what builds "context"!
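
The sliding window used to build X and Y can be wrapped in a small helper so the context length is easy to change. This is a minimal sketch: `build_dataset` is a hypothetical name, the two-word list is an illustrative sample rather than the real `names.txt`, and it returns plain Python lists instead of tensors.

```python
def build_dataset(words, stoi, block_size=3):
    """Return (contexts, targets) as plain Python lists."""
    X, Y = [], []
    for w in words:
        context = [0] * block_size  # start with all '.' padding tokens
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]  # slide the window forward by one
    return X, Y

# Illustrative sample, not the full dataset
words = ["emma", "ava"]
chars = sorted(set(''.join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0

X, Y = build_dataset(words, stoi, block_size=3)
print(len(X), X[:3], Y[:3])
```

Each word of length k contributes k + 1 examples (one per character plus the end token), so the two sample words give 5 + 4 = 9 pairs. Wrapping the lists in `torch.tensor(...)` gives the same kind of tensors used in the rest of the notebook.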

In [44]:
# Let's build the lookup table C (the embedding table): 27 characters, 2 dimensions each
torch.manual_seed(SEED)
C = torch.randn((27,2))

C.shape, X.shape

(torch.Size([27, 2]), torch.Size([32, 3]))

In [45]:
# Embed X using C: look up the embedding row for every index in X (the first step of the forward pass)
emb = C[X] # <-- This is so cool: PyTorch tensor indexing. Each integer in X is mapped to its row of C
emb.shape

torch.Size([32, 3, 2])
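
One way to see why the lookup behaves like a layer of the network: indexing `C` with integer indices gives the same result as multiplying one-hot encoded inputs by `C`. The sketch below uses a made-up two-row `X` and an arbitrary seed, not the notebook's data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed for this sketch
C = torch.randn((27, 2))                    # embedding table: 27 tokens, 2 dims each
X = torch.tensor([[0, 0, 5], [0, 5, 13]])   # two made-up contexts of three token indices

emb_lookup = C[X]                                        # integer indexing, shape (2, 3, 2)
emb_onehot = F.one_hot(X, num_classes=27).float() @ C    # one-hot rows times C

print(emb_lookup.shape)
print(torch.allclose(emb_lookup, emb_onehot))
```

Indexing is preferred in practice because it skips building the mostly-zero one-hot tensor.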

In [46]:
W1 = torch.randn((6,100))
b1 = torch.randn(100)

Torch tensor magic: the `cat` function lets us join tensors so that the result has the shape we need.
Here we have our embedding tensor `emb`. The embedding lookup is similar to a linear layer applied to one-hot inputs: each token index in the input (the context list) is mapped to a row of the randomly initialized lookup table `C`. Since `emb` has shape [32, 3, 2] and the weight matrix `W1` has shape [6, 100], we cannot multiply them directly; we first need to flatten each example's three 2-dimensional embeddings into one vector of 6 values. One way to do this is `torch.cat(tensors, dim)`, which takes a sequence of tensors and the dimension to concatenate along.

The first expression below is not ideal because it hard-codes a context window of 3. `torch.unbind(input, dim=0)` removes a dimension and returns a tuple of the slices along it, so `torch.unbind(emb, 1)` yields one [32, 2] tensor per context position (dimension 1 is where the context positions are stored). Passing that tuple to `torch.cat` concatenates them for any context length.

In [47]:
torch.cat([emb[:,0, :], emb[:, 1, :], emb[:, 2, :]], 1).shape, torch.cat(torch.unbind(emb, 1), 1).shape

(torch.Size([32, 6]), torch.Size([32, 6]))
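
To illustrate that the `unbind` version does not depend on the context length, here is a small sketch. `flatten_context` is a hypothetical helper name, and the random tensors stand in for `C[X]`.

```python
import torch

def flatten_context(emb):
    """(batch, block_size, emb_dim) -> (batch, block_size * emb_dim)."""
    return torch.cat(torch.unbind(emb, 1), 1)

shapes = {}
for block_size in (3, 5):
    emb = torch.randn(32, block_size, 2)  # random stand-in for C[X]
    shapes[block_size] = tuple(flatten_context(emb).shape)

print(shapes)
```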

In [48]:
# There is an even more efficient method using view
a = torch.arange(18)
a

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])

In [49]:
a.shape

torch.Size([18])

In [50]:
a.view(3,3,2)

tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])

A tensor's data is stored as a one-dimensional array in memory. Calling `view` on a tensor only changes how that memory is interpreted, so no data is copied. For a contiguous tensor like this one, we can therefore use any shape whose dimensions multiply to the tensor's total number of elements (here 3 × 3 × 2 = 18).
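
A short sketch of both points: a view shares memory with the original tensor, and a shape with the wrong number of elements is rejected. The variable names here are illustrative.

```python
import torch

a = torch.arange(18)
b = a.view(3, 3, 2)  # same 18 elements, interpreted with a new shape

# A view points at the same underlying storage, so nothing is copied.
shares_memory = a.data_ptr() == b.data_ptr()

# Writing through the view is visible in the original tensor.
b[0, 0, 0] = 100
first_element = a[0].item()

# 4 * 5 = 20 != 18, so this shape is invalid.
try:
    a.view(4, 5)
    invalid_shape_rejected = False
except RuntimeError:
    invalid_shape_rejected = True

print(shares_memory, first_element, invalid_shape_rejected)
```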

In [51]:
if emb.view([32,6]).shape == torch.cat(torch.unbind(emb,1),1).shape:
    print("hooray")


hooray
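
The check above compares shapes only. As a small sketch with a random stand-in for `emb`, `torch.equal` can confirm that the two approaches also produce the same values in the same order:

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this sketch
emb = torch.randn(32, 3, 2)  # random stand-in for C[X]

via_cat = torch.cat(torch.unbind(emb, 1), 1)  # concatenate the context positions
via_view = emb.view(32, 6)                    # reinterpret the same memory

print(torch.equal(via_cat, via_view))
```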


In [65]:
# Remember that X will not always have 32 rows. That was just this example; the number of inputs may vary.
# Passing -1 tells PyTorch to infer that dimension from the others.
# Let's pass through the net!
h = emb.view(-1, 6) @ W1 + b1

# Don't forget tanh activation function!
h = torch.tanh(h)
h.shape


torch.Size([32, 100])

In [67]:
# Now the final layer!
torch.manual_seed(SEED)
W2 = torch.randn((100,27))
b2 = torch.randn(27)

## Logits
Now we pass h (the hidden layer activations) to the next layer, which gives us the logits of the network.

What are logits? They are the raw, unnormalized output values of the network (not restricted to the range 0 to 1). They are typically fed into a softmax function, which normalizes the logits along a chosen dimension so that each value lies between 0 and 1 and the values sum to 1. The name comes from the logit function, which maps a probability to its log-odds:
$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$$

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Next we have log counts. This refers to the logarithm of some count or frequency in the data, which is useful when the data spans several orders of magnitude (for example in text mining or bioinformatics).

A logarithmic transformation can help in these cases by reducing the range and dampening the effect of outliers or extreme values.

It is important to note that when taking the log of a count, adding 1 first accounts for the case count = 0, where log(0) is undefined.
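
As a small illustration with made-up counts, Python's standard library has `math.log1p`, which computes log(1 + x), and `math.expm1`, which reverses it:

```python
import math

counts = [0, 9, 99, 9999]  # made-up counts spanning several orders of magnitude

log_counts = [math.log1p(c) for c in counts]     # log(1 + c), defined even when c == 0
recovered = [math.expm1(v) for v in log_counts]  # exp(v) - 1 undoes the transform

print(log_counts)
print(recovered)
```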

In our case we treat the logits as log counts, so to recover the counts we simply exponentiate them, which undoes the logarithm.
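
The relationship between logits, counts, and softmax can be sketched as follows. The logit values are made up for illustration; the point is that exponentiating and normalizing each row matches PyTorch's built-in `F.softmax`.

```python
import torch
import torch.nn.functional as F

# Made-up logits: 2 examples, 3 classes
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 0.5, 3.0]])

counts = logits.exp()                              # undo the log: all values become positive
probs = counts / counts.sum(dim=1, keepdim=True)   # normalize each row so it sums to 1

print(probs.sum(dim=1))
print(torch.allclose(probs, F.softmax(logits, dim=1)))
```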

In [74]:
# find the logits
logits = h @ W2 + b2
counts = logits.exp()
