Here I'm developing a few more layers and losses.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# The Lincoln package is in the parent directory, so let's add it to our path
import sys
sys.path.append('..')

In [3]:
import typing

from lincoln.layers import Layer
from lincoln.losses import Loss
import lincoln as lnc

import torch
from torch import Tensor
import numpy as np

First off, we're going to create a version of the `Sigmoid` layer that returns the log-probability. Working with log-probabilities is more stable numerically, and the math gets much nicer. You may have noticed that we had to add a small value inside the `LogLoss` forward method so that we never compute `torch.log(0)`. Mathematically, $p$ never reaches exactly 0 or 1, so neither $\log{p}$ nor $\log{(1-p)}$ blows up. In floating point, however, $p$ can round to exactly 0 or 1, where $\log{p}$ or $\log{(1-p)}$ respectively is undefined. `torch.log` then returns `-inf` and all the computations go haywire.

So, to avoid this, we'll return $\log{p}$ instead of $p$ itself. Then we can write a loss class that expects $\log{p}$ as its input. The gradient is also much simpler to compute in terms of $\log{p}$.
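To make the failure mode concrete, here is a small sketch of how a float32 sigmoid saturates and what that does to the logs. The input values are chosen purely for illustration, and `torch.nn.functional.logsigmoid` stands in for a layer that computes $\log{p}$ directly; it is not part of the Lincoln package.

```python
import torch

# float32 by default; the extreme values are picked to force saturation
a = torch.tensor([-200.0, 0.0, 200.0])

p = torch.sigmoid(a)          # rounds to exactly 0 and exactly 1 at the extremes
log_p = torch.log(p)          # -inf where p == 0
log_1mp = torch.log(1 - p)    # -inf where p == 1

# Computing log p directly never forms p, so it stays finite
z = torch.nn.functional.logsigmoid(a)

print(p, log_p, log_1mp, z, sep="\n")
```

The first and last entries of `log_p` and `log_1mp` show the `-inf` values that the small constant in `LogLoss` was guarding against.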

### The `LogSigmoid` Layer

The sigmoid function is defined as 

$$
p = \frac{e^a}{e^a + 1}
$$

If we take the log of this, 

$$
z = \log{p} = a - \log{\left(e^a + 1\right)}
$$

From this `LogSigmoid` layer, we'll want to return `z = a - torch.log(torch.exp(a) + 1)`.

For the backward pass, we'll need to return the gradient of this with respect to $a$:

$$
\begin{align}
\frac{\partial z}{\partial a} &= 1 - \frac{e^a}{e^a+1} \\
&= 1 - p \\
&= 1 - e^z
\end{align}
$$

Then, the gradient is just `backward_grad = (1 - torch.exp(z))*in_grad`.
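As a quick sanity check on the derivation, the analytic gradient can be compared with what PyTorch's autograd computes for the same forward expression. This is only a sketch: the random input and the all-ones upstream gradient are stand-ins, not values from the notebook.

```python
import torch

torch.manual_seed(0)
a = torch.randn(4, 1, dtype=torch.float64, requires_grad=True)

# Forward pass exactly as written in the derivation
z = a - torch.log(torch.exp(a) + 1)

# Stand-in for the gradient arriving from the next layer
in_grad = torch.ones_like(z)

# Analytic backward pass: dz/da = 1 - e^z
manual = ((1 - torch.exp(z)) * in_grad).detach()

# Autograd's answer for the same quantity
z.backward(in_grad)

print(torch.allclose(manual, a.grad))
```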

In [4]:
class LogSigmoid(Layer):
    def __init__(self):
        super().__init__()
        
    def forward(self, input: Tensor) -> Tensor:
        
        if input.dim() != 2:
            # Assumes DimensionError lives in lnc.exc alongside BackwardError and MatchError
            raise lnc.exc.DimensionError(f"Tensor should have dimension 2, instead it has dimension {input.dim()}")
        
        self.last_input = input
        self.output = input - torch.log(torch.exp(input) + 1)
        return self.output
    
    def backward(self, in_grad: Tensor) -> Tensor:
        
        if not hasattr(self, 'output'):
            message = "The forward method must be run before the backward method"
            raise lnc.exc.BackwardError(message)  
        elif self.output.shape != in_grad.shape:
            message = (f"Two tensors should have the same shape; instead, first Tensor's shape "
                       f"is {in_grad.shape} and second Tensor's shape is {self.output.shape}.")
            raise lnc.exc.MatchError(message)
        
        backward_grad = (1 - torch.exp(self.output))*in_grad
        
        return backward_grad
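One caveat with the forward pass above: `torch.exp(input)` overflows float32 for large positive inputs, which turns the output into `-inf`. A possible workaround, sketched here under the assumption that we are free to rearrange the formula, uses the identity $\log{p} = \min(a, 0) - \log\left(1 + e^{-|a|}\right)$, so the exponent is never positive. The helper name `stable_log_sigmoid` is hypothetical.

```python
import torch

def stable_log_sigmoid(a: torch.Tensor) -> torch.Tensor:
    # min(a, 0) - log(1 + exp(-|a|)); the argument of exp is always <= 0
    return torch.clamp(a, max=0) - torch.log1p(torch.exp(-torch.abs(a)))

a = torch.tensor([[-100.0], [0.0], [100.0]])

naive = a - torch.log(torch.exp(a) + 1)   # exp(100) overflows in float32
stable = stable_log_sigmoid(a)

print(naive.squeeze(), stable.squeeze())
```

The backward pass is unchanged, since it only depends on the output $z$.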

### The `LogSigmoidLoss` loss

Now I'm going to create a loss that takes this layer's output.

Recall the log loss from before:

$$
L = \begin{cases}
    -\log{p}     & \text{if } y = 1 \\
    -\log{(1-p)}  & \text{if } y = 0
\end{cases}
$$

The input to this loss is $z = \log{p}$, though, so substituting $p = e^z$ gives

$$
L = \begin{cases}
    -z     & \text{if } y = 1\\
    -\log{(1-e^z)}  & \text{if } y = 0
\end{cases}
$$

and the gradient is

$$
\frac{\partial L}{\partial z} = \begin{cases}
    -1     & \text{if } y = 1\\
    \frac{e^z}{1-e^z}  & \text{if } y = 0
\end{cases}
$$
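Chaining this gradient with the `LogSigmoid` gradient $\partial z / \partial a = 1 - e^z$ should recover the familiar logistic-regression result $\partial L / \partial a = p - y$. The sketch below checks that numerically on a few made-up activations and labels.

```python
import torch

# Illustrative pre-activations and binary labels (not from the dataset)
a = torch.tensor([[-1.5], [0.3], [2.0]], dtype=torch.float64)
y = torch.tensor([[0.0], [1.0], [0.0]], dtype=torch.float64)

z = a - torch.log(torch.exp(a) + 1)      # log p from the LogSigmoid layer
p = torch.exp(z)

dL_dz = -y + (1 - y) * p / (1 - p)       # loss gradient, both cases in one line
dz_da = 1 - p                            # LogSigmoid gradient
dL_da = dL_dz * dz_da                    # chain rule

print(dL_da.squeeze(), (p - y).squeeze())
```

For $y = 1$ the product is $-(1 - p) = p - 1$, and for $y = 0$ the factors of $1 - p$ cancel to leave $p$.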

In [5]:
class LogSigmoidLoss(Loss):
    def __init__(self, network):
        super().__init__(network)
        
    def forward(self, input: Tensor, labels: Tensor) -> float:
        
        # Here we're assuming z is a log-probability
        self.last_input = z = self.network(input)
        self.labels = y = labels
        
        loss = torch.sum(-y*z - (1-y)*torch.log(1-torch.exp(z)))
        return loss.item()
    
    def gradient(self) -> Tensor:
        y, z = self.labels, self.last_input
        n = y.shape[0]
        exp_z = torch.exp(z)
        
        grad = torch.sum(-y + (1-y)*exp_z/(1 - exp_z), dim=1).view(n, -1)
        
        return grad

In [6]:
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
features = breast_cancer.data
labels = breast_cancer.target
feature_names = breast_cancer.feature_names

features = lnc.utils.standardize(features)

In [7]:
network = lnc.Sequential(
            lnc.layers.Dense(10),
            lnc.layers.Dense(1, activation=LogSigmoid()),
            )
model = lnc.models.Logistic(network,
                            loss=LogSigmoidLoss(network))
model.fit(features, labels, batch_size=128, log_ps=True)

Epoch 20.. Train loss: 0.8244..  Accuracy: 96.485%
Epoch 40.. Train loss: 0.4849..  Accuracy: 98.243%
Epoch 60.. Train loss: 0.4018..  Accuracy: 98.594%
Epoch 80.. Train loss: 0.3626..  Accuracy: 98.594%
Epoch 100.. Train loss: 0.3388..  Accuracy: 98.594%
Epoch 120.. Train loss: 0.3226..  Accuracy: 98.594%
Epoch 140.. Train loss: 0.3106..  Accuracy: 98.594%
Epoch 160.. Train loss: 0.3014..  Accuracy: 98.594%
Epoch 180.. Train loss: 0.2940..  Accuracy: 98.594%
Epoch 200.. Train loss: 0.2880..  Accuracy: 98.770%
Epoch 220.. Train loss: 0.2828..  Accuracy: 98.770%
Epoch 240.. Train loss: 0.2784..  Accuracy: 98.770%
Epoch 260.. Train loss: 0.2744..  Accuracy: 98.770%
Epoch 280.. Train loss: 0.2709..  Accuracy: 98.770%
Epoch 300.. Train loss: 0.2676..  Accuracy: 98.770%
Epoch 320.. Train loss: 0.2646..  Accuracy: 98.770%
Epoch 340.. Train loss: 0.2618..  Accuracy: 98.770%
Epoch 360.. Train loss: 0.2592..  Accuracy: 98.770%
Epoch 380.. Train loss: 0.2566..  Accuracy: 98.770%
Epoch 400.. Trai

### `Softmax` Layer

The softmax function turns a vector of $N$ activations $a$ into class probabilities:

$$
p_j = \frac{e^{a_j}}{\sum_n^N e^{a_n}}
$$

For the gradient, define the normalizer

$$
\Sigma = \sum_n^N e^{a_n}
$$

$$
\begin{align}
\frac{\partial p_j}{\partial a_k} &= \frac{\partial}{\partial a_k} \left[ e^{a_j} \right] \Sigma^{-1} + e^{a_j}\frac{\partial}{\partial a_k}\Sigma^{-1} \\
&= \delta_{jk}e^{a_j}\Sigma^{-1} - e^{a_j} e^{a_k}\Sigma^{-2} \\
&= \frac{e^{a_j}}{\Sigma}\left(\delta_{jk} - \frac{e^{a_k}}{\Sigma}\right) \\
&= p_j (\delta_{jk} - p_k)
\end{align}
$$

Written out in full, without the $\Sigma$ shorthand:

$$
\begin{align}
\frac{\partial p_j}{\partial a_k} &= \frac{\partial}{\partial a_k} \left[ e^{a_j} \right] \left(\sum_n^N e^{a_n}\right)^{-1} + e^{a_j}\frac{\partial}{\partial a_k}\left(\sum_n^N e^{a_n}\right)^{-1} \\
&= \delta_{jk}e^{a_j}\left(\sum_n^N e^{a_n}\right)^{-1} - e^{a_j} e^{a_k}\left(\sum_n^N e^{a_n}\right)^{-2} \\
&= \frac{e^{a_j}}{\sum_n^N e^{a_n}}\left(\delta_{jk} - \frac{e^{a_k}}{\sum_n^N e^{a_n}}\right) \\
&= p_j (\delta_{jk} - p_k)
\end{align}
$$

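where we use the Kronecker delta: $\delta_{jk} = 1$ if $j=k$, otherwise $\delta_{jk} = 0$.

Because every output $p_j$ depends on every input $a_k$, the gradient for one example is an $N \times N$ Jacobian matrix, $\mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top$. In the backward pass, the incoming gradient for each example is multiplied by that example's Jacobian.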
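The formula $p_j(\delta_{jk} - p_k)$ can be checked against autograd for a single example. This is a sketch with an arbitrary three-element input; `torch.autograd.functional.jacobian` does the numerical side.

```python
import torch
from torch.autograd.functional import jacobian

def softmax(a: torch.Tensor) -> torch.Tensor:
    e = torch.exp(a)
    return e / e.sum()

# Arbitrary activations for one example
a = torch.tensor([0.5, -1.0, 2.0], dtype=torch.float64)
p = softmax(a)

# Analytic Jacobian: entry (j, k) is p_j * (delta_jk - p_k)
analytic = torch.diag(p) - torch.outer(p, p)

# Jacobian computed by autograd, entry (j, k) is dp_j / da_k
numeric = jacobian(softmax, a)

print(torch.allclose(analytic, numeric))
```

Each column of the Jacobian sums to zero, which reflects the fact that the probabilities always sum to one.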

In [8]:
class Softmax(Layer):
    def __init__(self):
        super().__init__()
        
    def forward(self, input: Tensor) -> Tensor:
        
        if input.dim() != 2:
            # Assumes DimensionError lives in lnc.exc alongside BackwardError and MatchError
            raise lnc.exc.DimensionError(f"Tensor should have dimension 2, instead it has dimension {input.dim()}")
        
        self.last_input = input
        n = input.shape[0]
        self.output = torch.exp(input) / torch.sum(torch.exp(input), dim=1).view(n, 1)
        return self.output
    
    def backward(self, in_grad: Tensor) -> Tensor:
        
        if not hasattr(self, 'output'):
            message = "The forward method must be run before the backward method"
            raise lnc.exc.BackwardError(message)  
        elif self.output.shape != in_grad.shape:
            message = (f"Two tensors should have the same shape; instead, first Tensor's shape "
                       f"is {in_grad.shape} and second Tensor's shape is {self.output.shape}.")
            raise lnc.exc.MatchError(message)
        
        ps = self.output
        N, M = ps.shape[0], ps.shape[1]
        batch_jacobian = torch.zeros((N, M, M))
        
        for ii, p in enumerate(ps):
            batch_jacobian[ii,:,:] = torch.diag(p) - torch.ger(p, p)
        
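        backward_grad = torch.bmm(in_grad.view(N, 1, -1), batch_jacobian)
        # Squeeze only the singleton middle dimension so a batch of one keeps its batch axis
        backward_grad = backward_grad.squeeze(1)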
        
        # Key assertion
        if self.last_input.shape != backward_grad.shape:
            message = (f"Two tensors should have the same shape; instead, first Tensor's shape "
                       f"is {self.last_input.shape} and second Tensor's shape is {backward_grad.shape}.")
            raise lnc.exc.MatchError(message)
        
        return backward_grad
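The loop over examples is easy to read but builds an $N \times M \times M$ tensor. As a possible alternative, the same vector-Jacobian product collapses to $p_k \left(g_k - \sum_j g_j p_j\right)$, where $g$ is the incoming gradient. The sketch below compares the two on random data; the shapes and the use of `torch.softmax` to generate probabilities are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
ps = torch.softmax(torch.randn(4, 3), dim=1)   # stand-in for the layer's output
in_grad = torch.randn(4, 3)                    # stand-in for the upstream gradient

# Batched Jacobian without a Python loop: diag(p) - p p^T for every example
jac = torch.diag_embed(ps) - ps.unsqueeze(2) * ps.unsqueeze(1)
via_jacobian = torch.bmm(in_grad.unsqueeze(1), jac).squeeze(1)

# Same product without ever forming the Jacobian
direct = ps * (in_grad - (in_grad * ps).sum(dim=1, keepdim=True))

print(torch.allclose(via_jacobian, direct, atol=1e-6))
```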

### Cross-Entropy Loss: `CrossEntropy`

Here we'll build a class for the cross-entropy loss.

For this loss, we get class probabilities $p$ from the softmax function. For one example $i$, the cross-entropy loss is defined as:

$$
L_i = -\log{p_c}\quad \mathrm{where}\;(c = y_i)
$$

The total loss is the sum over all examples, $L = \sum_i L_i$.

Then the gradient for one example is

$$
\frac{\partial L_i}{\partial p_j} = -\frac{\delta_{jc}}{p_c} 
$$
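Pushing this gradient back through the softmax Jacobian should give the well-known combined result $\partial L_i / \partial a_k = p_k - \delta_{kc}$. The following sketch checks that against autograd; the random activations and the labels are made up for the example.

```python
import torch

torch.manual_seed(0)
a = torch.randn(4, 5, dtype=torch.float64, requires_grad=True)
ys = torch.tensor([[2], [0], [4], [0]])        # correct class per example

# Forward: softmax followed by cross-entropy, summed over examples
ps = torch.softmax(a, dim=1)
loss = -torch.log(torch.gather(ps, 1, ys)).sum()
loss.backward()

# Manual backward: loss gradient, then the softmax vector-Jacobian product
p = ps.detach()
mask = torch.zeros_like(p).scatter_(1, ys, 1.0)
dL_dp = -mask / p
dL_da = p * (dL_dp - (dL_dp * p).sum(dim=1, keepdim=True))

print(torch.allclose(dL_da, a.grad), torch.allclose(dL_da, p - mask))
```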

In [9]:
class CrossEntropy(Loss):
    def __init__(self, network: typing.Type[Layer]):
        super().__init__(network)
        
    def forward(self, input: Tensor, labels: Tensor) -> float:
        
        self.last_input = ps = self.network(input)
        self.labels = ys = labels
        self.output = -torch.log(torch.gather(ps, 1, ys))
        return torch.sum(self.output).item()
    
    def gradient(self) -> Tensor:
        ps = self.last_input
        ys = self.labels
        
        # Create a mask for our correct labels, with 1s for the true labels, 0 elsewhere
        mask = torch.zeros_like(ps)
        mask.scatter_(1, ys, 1)
        
        # Picking out particular elements denoted by the correct labels
        grads = mask * -1/ps
        
        return grads

Now let's train on MNIST, classifying handwritten digits.

In [10]:
from torchvision import datasets, transforms

In [11]:
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])
mnist = datasets.MNIST('./MNIST_data', transform=transform)
trainloader = torch.utils.data.DataLoader(mnist, batch_size=128, shuffle=True)

test_mnist = datasets.MNIST('./MNIST_data', transform=transform, train=False)
testloader = torch.utils.data.DataLoader(test_mnist, batch_size=128)

Creating a few metrics...

In [12]:
class Accuracy:
    
    def __init__(self, logprob=False):
        self.logprob = logprob
    
    def metric(self, ps: Tensor, labels: Tensor) -> float:
        if self.logprob:
            _, predictions = torch.exp(ps).topk(1)
        else:
            _, predictions = ps.topk(1)
        
        equals = predictions.squeeze() == labels.squeeze()
        accuracy = torch.mean(equals.type(torch.FloatTensor))
        return accuracy.item()

    def __call__(self, ps: Tensor, labels: Tensor) -> float:
        return self.metric(ps, labels)

In [13]:
class TopKError:
    
    def __init__(self, k=5, logprob=False):
        self.k = k
        self.logprob = logprob
    
    def metric(self, ps: Tensor, labels: Tensor) -> float:
        if self.logprob:
            ps = torch.exp(ps)
            
        p, cls = ps.topk(self.k)
        labels = labels.view(-1, 1).repeat((1, self.k))
        equals = (cls == labels).type(torch.FloatTensor)
        error_rate = 1 - equals.sum(dim=1).mean()
        return error_rate.item()

    def __call__(self, ps: Tensor, labels: Tensor) -> float:
        return self.metric(ps, labels)
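A small worked example may help show what the top-k error measures. The function below is a standalone sketch of the same idea, not the class above; the name `top_k_error` and the probabilities are invented for illustration.

```python
import torch

def top_k_error(ps: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    # Indices of the k highest-scoring classes for each example, shape (n, k)
    _, cls = ps.topk(k)
    # An example is a hit if its label appears anywhere among those k classes
    hits = (cls == labels.view(-1, 1)).any(dim=1)
    return 1 - hits.float().mean().item()

ps = torch.tensor([[0.1, 0.6, 0.3],
                   [0.5, 0.2, 0.3],
                   [0.2, 0.3, 0.5],
                   [0.7, 0.2, 0.1]])
labels = torch.tensor([2, 1, 2, 0])

print(top_k_error(ps, labels, k=1))  # two of four top-1 predictions are wrong
print(top_k_error(ps, labels, k=2))  # only the second example misses the top 2
```

With $k = 1$ this is just one minus the accuracy.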

In [14]:
class FlatGenerator():
    def __init__(self, dataloader):
        self.dataloader = dataloader
        
    def __iter__(self):
        yield from ((x.view(x.shape[0], -1), y.view(-1, 1)) for x, y in self.dataloader)

**Note:** Should clean up how this `FlatGenerator` and batch generators work in general from an API/UI viewpoint. It's a little clunky for my tastes.
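For reference, this is what the two `view` calls do to a single batch. The sketch uses zero-filled tensors with MNIST's shapes in place of a real `DataLoader` batch. The labels become a column because `torch.gather` and `scatter`, which the losses rely on, need an index tensor with the same number of dimensions as the network output.

```python
import torch

# Stand-in for one MNIST batch: 128 single-channel 28x28 images and their labels
x = torch.zeros(128, 1, 28, 28)
y = torch.zeros(128, dtype=torch.long)

flat_x = x.view(x.shape[0], -1)   # (128, 784): one row per image
col_y = y.view(-1, 1)             # (128, 1): column of class indices

print(flat_x.shape, col_y.shape)
```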

In [15]:
network = lnc.Sequential(lnc.layers.Dense(500), 
                         lnc.layers.Dense(10, activation=Softmax()))
model = lnc.models.Classifier(network,
                              loss=CrossEntropy(network),
                              batch_gen=FlatGenerator(trainloader),
                              valid_gen=FlatGenerator(testloader),
                              metric=Accuracy())
model.fit(epochs=1, print_every=100)

Epoch 1.. Train loss: 136.3370..  Metric: 0.889
Epoch 1.. Train loss: 48.0073..  Metric: 0.908
Epoch 1.. Train loss: 40.7395..  Metric: 0.916
Epoch 1.. Train loss: 36.0489..  Metric: 0.925


### `LogSoftmax`

As with `LogSigmoid`, we take the log of the softmax output:

$$
\begin{align}
p_j &= \frac{e^{a_j}}{\sum_n^N e^{a_n}} \\
z_j &= \log{p_j} = a_j - \log{\sum_n e^{a_n}}
\end{align}
$$

The gradient:

$$
\frac{\partial z_j}{\partial a_k} = \delta_{jk} - p_k = \delta_{jk} - e^{z_k}
$$

The Jacobian looks like

$$
\nabla_{\mathbf{a}}\mathbf{z} = \begin{bmatrix}
1 - p_1 & -p_2 & -p_3 & \dots  & -p_N \\
-p_1 & 1 - p_2 & -p_3 & \dots  & -p_N \\
-p_1 & -p_2 & 1 - p_3 & \dots  & -p_N \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-p_1 & -p_2 & -p_3 & \dots  & 1 - p_N
\end{bmatrix}
$$
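Since every row of this Jacobian is the identity row minus the same vector $\mathbf{p}$, multiplying an incoming gradient $g$ by it reduces to $g_k - p_k \sum_j g_j$. Here is a sketch that checks this shortcut against autograd on random data (the shapes are arbitrary).

```python
import torch

torch.manual_seed(0)
a = torch.randn(4, 5, dtype=torch.float64, requires_grad=True)
in_grad = torch.randn(4, 5, dtype=torch.float64)   # stand-in upstream gradient

# Forward pass: z_j = a_j - log(sum_n exp(a_n))
z = a - torch.log(torch.exp(a).sum(dim=1, keepdim=True))
z.backward(in_grad)

# Vector-Jacobian product without building the Jacobian
p = torch.exp(z.detach())
direct = in_grad - p * in_grad.sum(dim=1, keepdim=True)

print(torch.allclose(direct, a.grad))
```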

In [16]:
class LogSoftmax(Layer):
    def __init__(self):
        super().__init__()
        
    def forward(self, input: Tensor) -> Tensor:
        if input.dim() != 2:
            # Assumes DimensionError lives in lnc.exc alongside BackwardError and MatchError
            raise lnc.exc.DimensionError(f"Tensor should have dimension 2, instead it has dimension {input.dim()}")
        
        self.last_input = input
        self.output = input - torch.log(torch.exp(input).sum(dim=1).view(-1, 1))
        return self.output
        
    def backward(self, in_grad: Tensor) -> Tensor:
        
        if not hasattr(self, 'output'):
            message = f"The forward method of {self} must be run before the backward method"
            raise lnc.exc.BackwardError(message)  
        elif self.output.shape != in_grad.shape:
            message = (f"Two tensors should have the same shape; instead, first Tensor's shape "
                       f"is {in_grad.shape} and second Tensor's shape is {self.output.shape}.")
            raise lnc.exc.MatchError(message)
        
        ps = torch.exp(self.output)
        N, M = ps.shape[0], ps.shape[1]
        batch_jacobian = torch.zeros((N, M, M))
         
        # Create an identity matrix
        ones = torch.diagflat(torch.ones(M))
        
        for ii, p in enumerate(ps):
            # Repeat the p values across columns to get p_k
            p_k = p.repeat((M, 1))
            batch_jacobian[ii,:,:] = ones - p_k
        
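        backward_grad = torch.bmm(in_grad.view(N, 1, -1), batch_jacobian)
        # Squeeze only the singleton middle dimension so a batch of one keeps its batch axis
        backward_grad = backward_grad.squeeze(1)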
        
        # Key assertion
        if self.last_input.shape != backward_grad.shape:
            message = (f"Two tensors should have the same shape; instead, first Tensor's shape "
                       f"is {self.last_input.shape} and second Tensor's shape is {backward_grad.shape}.")
            raise lnc.exc.MatchError(message)
        
        return backward_grad
    
    def __repr__(self):
        return "LogSoftmax"

### Negative Log-Likelihood Loss: `NLLLoss`

Now that we have our `LogSoftmax` layer, we need a loss that expects to receive log-probabilities.


#### Forward pass
With $y_i$ as the correct class for example $i$, the loss for that example is

$$
L_i = -z_{y_i}
$$

and the total loss

$$
L = \sum_i L_i
$$

We can build this loss using the `scatter` method, which takes a tensor of indices and writes values into a tensor at those indices. So what we can do is create a tensor of all zeros, place a 1 at the index of the correct class in each row, and multiply this mask by $-z$ so that only $-z_{y_i}$ survives.

If `zs` is the input and `ys` are the correct classes, 

```python
>>> zeros = torch.zeros_like(zs)
>>> mask = zeros.scatter(1, ys, 1)
>>> print(mask)
tensor([[ 0.0000,  0.0000,  1.0000,  0.0000,  0.0000],
        [ 1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  1.0000],
        [ 1.0000,  0.0000,  0.0000,  0.0000,  0.0000]])
>>> loss = mask * -zs
```

#### Backward pass
Now we want to calculate the gradient, which is fairly straightforward here:

$$
\frac{\partial L_i}{\partial z_j} = -\delta_{jc} \quad \mathrm{where}\;(c = y_i)
$$
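Both the mask construction and the gradient can be checked in a few lines. In this sketch the log-probabilities come from `torch.log_softmax` on random data and the labels are made up; `torch.gather` is used as an independent way to pick out $z_{y_i}$.

```python
import torch

torch.manual_seed(0)
zs = torch.log_softmax(torch.randn(4, 5, dtype=torch.float64), dim=1)
zs.requires_grad_(True)
ys = torch.tensor([[2], [0], [4], [0]])          # correct class per example

# Loss via the scatter mask, as described above
mask = torch.zeros_like(zs).scatter(1, ys, 1.0)
loss_mask = (mask * -zs).sum()

# Same loss by gathering z at the correct class directly
loss_gather = -torch.gather(zs, 1, ys).sum()

# Autograd gradient with respect to z should be -1 at the correct class, 0 elsewhere
loss_mask.backward()

print(torch.allclose(loss_mask, loss_gather), torch.allclose(zs.grad, -mask))
```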

In [17]:
class NLLLoss(Loss):
    def __init__(self, network):
        super().__init__(network)
        
    def forward(self, input: Tensor, labels: Tensor) -> float:
        
        self.last_input = zs = self.network(input)
        self.labels = labels
        
        zeros = torch.zeros_like(zs)
        mask = zeros.scatter(1, labels, 1)
        L = mask * -zs
        
        self.output = L.sum().item()
        return self.output
    
    def gradient(self) -> Tensor:
        
        if not hasattr(self, 'output'):
            message = f"The forward method of {self} must be run before the backward method"
            raise lnc.exc.BackwardError(message)
        
        zeros = torch.zeros_like(self.last_input)
        backward_grad = zeros.scatter(1, self.labels, -1)
        
        # Key assertion
        if self.last_input.shape != backward_grad.shape:
            message = (f"Two tensors should have the same shape; instead, first Tensor's shape "
                       f"is {self.last_input.shape} and second Tensor's shape is {backward_grad.shape}.")
            raise lnc.exc.MatchError(message)
        
        return backward_grad
    
    def __repr__(self):
        return "NLLLoss"

In [18]:
network = lnc.Sequential(lnc.layers.Dense(500), 
                         lnc.layers.Dense(10, activation=LogSoftmax()))
model = lnc.models.Classifier(network,
                              loss=NLLLoss(network),
                              batch_gen=FlatGenerator(trainloader),
                              valid_gen=FlatGenerator(testloader),
                              metric=Accuracy(logprob=True))
model.fit(epochs=5, print_every=100)

Epoch 1.. Train loss: 144.6850..  Metric: 0.853
Epoch 1.. Train loss: 49.0181..  Metric: 0.911
Epoch 1.. Train loss: 38.9346..  Metric: 0.907
Epoch 1.. Train loss: 34.7589..  Metric: 0.928
Epoch 2.. Train loss: 9.6997..  Metric: 0.933
Epoch 2.. Train loss: 29.0309..  Metric: 0.937
Epoch 2.. Train loss: 27.5376..  Metric: 0.941
Epoch 2.. Train loss: 24.5093..  Metric: 0.947
Epoch 2.. Train loss: 24.5538..  Metric: 0.949
Epoch 3.. Train loss: 13.5384..  Metric: 0.954
Epoch 3.. Train loss: 19.9678..  Metric: 0.953
Epoch 3.. Train loss: 20.4131..  Metric: 0.956
Epoch 3.. Train loss: 18.1194..  Metric: 0.958
Epoch 3.. Train loss: 17.5657..  Metric: 0.958
Epoch 4.. Train loss: 14.8582..  Metric: 0.963
Epoch 4.. Train loss: 16.5139..  Metric: 0.963
Epoch 4.. Train loss: 14.7026..  Metric: 0.963
Epoch 4.. Train loss: 14.2133..  Metric: 0.966
Epoch 5.. Train loss: 3.2286..  Metric: 0.967
Epoch 5.. Train loss: 13.0312..  Metric: 0.968
Epoch 5.. Train loss: 11.9734..  Metric: 0.968
Epoch 5.. Trai