# Setup

In [1]:
from torch import Tensor
import torch
import numpy as np
from numpy import random

import typing
from typing import List, Tuple

### Exceptions

In [2]:
class MatchError(Exception):
    def __init__(self, message: str) -> None:
        self.message = message

In [3]:
class DimensionError(Exception):
    def __init__(self, message: str) -> None:
        self.message = message

First let's consider a simple model, logistic regression. The goal of logistic regression is to predict a binary choice, True or False, admitted or not admitted, etc., from some input features. Mathematically, we perform a linear (affine) combination of our input features and coefficients (called weights here). That result is passed to the sigmoid (logistic) function which returns a prediction probability.

<img src="assets/logreg_forward.png" alt="Logistic regression forward pass" width=400px>

We can represent logistic regression as a graph of tensors and operations. The first operation is a linear transformation, $\ell$. This operation takes three input tensors: the input data $x$, the weights $w$, and the bias term $b$. It returns the linear transformation $a$:

$$
a = \sum_i w_i x_i + b
$$

We then pass $a$ through the sigmoid operation:

$$
p = \sigma(a) = \frac{1}{1+e^{-a}}
$$
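As a minimal sketch of the forward pass described so far, the plain-Python snippet below computes $a$ and $p$ for a single observation. The function names `linear` and `sigmoid` and the example numbers are illustrative assumptions, not part of the framework built later in this notebook.

```python
import math

def linear(x, w, b):
    """Affine combination a = sum_i w_i * x_i + b for one observation."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

def sigmoid(a):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical features, weights, and bias
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0

a = linear(x, w, b)   # 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
p = sigmoid(a)        # sigmoid(0) = 0.5
print(a, p)
```

With these made-up weights the model is exactly undecided ($p = 0.5$); training is what moves $p$ toward the observed labels.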

At this point, we can use the probability $p$ to make a prediction. Given some input features $x$, we get a probability $p$ that we should predict **True**. The weights we start with, though, are just random; they aren't set to make accurate predictions for any specific problem, especially not the problem we're interested in. To find the appropriate weights, we use a scheme called supervised learning. Here we need three things:

* Example observations of features $x^k$ and true outcomes $y^k$, where $k$ denotes a single observation of $K$ total observations
* Some measure of how wrong our prediction is compared to the true outcomes, often called the loss, cost, or error 
* A method to update our weights such that the loss is minimized 

Commonly used is one form of the log loss $L$:

$$
L = -y \log{p} - (1-y)\log{(1-p)}
$$

where $y$ is the true binary outcome, either 0 or 1. Since $y$ can only ever be 0 or 1, we can also write our loss as

$$
L = \begin{cases}
    -\log{p},     & \text{if } y = 1\\
    -\log{(1-p)},  & \text{if } y = 0
\end{cases}
$$

This loss makes sense looking at the $-\log{p}$ function, plotted below.
<img src="assets/log_loss.png" width=400px>

If our observed label is True ($y = 1$), then we want our model's prediction $p$ to be as close to 1 as possible. We see that as $p$ increases to 1, $-\log{p}$ goes to zero, and as $p$ decreases to 0, $-\log{p}$ goes to infinity. This way, if our prediction is vastly different from our observed outcomes, the loss will be high. Conversely, if our predictions are similar to the observed outcomes, we'll have low losses.
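To put numbers on this, here is a small sketch in plain Python. The helper name `log_loss` and the probabilities 0.99 and 0.01 are assumptions chosen for illustration.

```python
import math

def log_loss(y, p):
    """Log loss for one observation with label y (0 or 1) and predicted probability p."""
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

# True label is 1: a confident correct prediction is cheap,
# a confident wrong prediction is expensive
good = log_loss(1, 0.99)
bad = log_loss(1, 0.01)
print(f"{good:.4f} {bad:.4f}")   # prints 0.0101 4.6052
```

The same asymmetry holds with the roles swapped when the true label is 0, since the loss then reduces to $-\log{(1-p)}$.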

We'll often give our model multiple observations at once, so we sum up the loss for each of those observations to get the total loss
$$
L = \sum_k^K -y^k \log{p^k} - (1-y^k)\log{(1-p^k)}
$$

Finally, we use a method called **gradient descent** to adjust the weights such that the loss is as low as possible. The idea is to iteratively modify the weights such that the loss calculated from the observations is lower for each step. This process is called "training", as our model is learning the best weights from examples.

With gradient descent, we update our weights using the gradient of the loss with respect to the weights

$$
w' = w - \eta \frac{\partial L}{\partial w}
$$

where $\eta$ is the learning rate, some small factor that scales the weight updates. The gradient is a measure of how much the loss changes when we change our weights. It always points in the direction in which the loss increases fastest, so we subtract it to move toward lower loss.
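The following sketch shows the iteration on a one-dimensional toy problem. The quadratic loss, the starting point, and the names `gradient_descent`, `loss`, and `grad` are assumptions made for illustration. Each step moves against the gradient, because the gradient points toward increasing loss.

```python
def gradient_descent(grad_fn, w, eta=0.1, steps=50):
    """Repeatedly step against the gradient: w <- w - eta * dL/dw."""
    for _ in range(steps):
        w = w - eta * grad_fn(w)
    return w

# Hypothetical loss L(w) = (w - 3)^2 with gradient dL/dw = 2 * (w - 3)
loss = lambda w: (w - 3.0) ** 2
grad = lambda w: 2.0 * (w - 3.0)

w0 = 0.0
w_final = gradient_descent(grad, w0)
print(loss(w0), loss(w_final))   # the loss shrinks as w approaches 3
```

A learning rate that is too large would overshoot the minimum here, which is why $\eta$ is usually kept small.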

<img src="assets/logreg_backward.png" alt="Logistic regression backward pass" width=400px>

Now we have two processes here, a forward pass and a backward pass. In the forward pass, each operation or layer performs its calculations and passes the results on to the next layer. In the backward pass, each operation takes the gradient from the layer that follows it, multiplies it by its own gradient, then passes the result backward through the network.

Keeping these concepts in mind - forward and backward passes through a sequence of layers or operations - we can build a framework for constructing any neural network model.

## Individual layers

Here's a concept I wish had been explained to me when I started learning about neural nets, and that makes understanding how to implement them much clearer:

* Each layer has a `forward` and `backward` method as before.
* The `forward` method receives `input` as (you guessed it) input and outputs `self.output`. It stores as class variables:
    * `input` as `self.last_input`.
    * the result of its computation as `self.output`.
* The `backward` method receives `output_grad` as input and returns `input_grad` as its output. Along the way, it checks that:
    * `output_grad` has the same shape as `self.output`.
    * `input_grad` has the same shape as `self.last_input`.
    
When you try to trace what is going on in neural nets, it can often get confusing what layers are sending to and receiving from each other. This should make it clearer.

This also gives us a template for `Layer`s in general. They should all look like:

```python
def forward(self, input: Tensor) -> Tensor:
    
    self.last_input = input
    
    ###############
    # stuff happens
    ###############
    
    return self.output
```

```python
def backward(self, output_grad: Tensor) -> Tensor:
    
    assert_same_shape(self.output, output_grad)
    
    ###############
    # stuff happens
    ###############
    
    assert_same_shape(self.last_input, input_grad)    
    return input_grad
```

Writing batch norm, convolutions (messy but already done), transformers etc. can be done using this structure!

**Question**: is there a way to do this using decorators?
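One possible answer, sketched under assumptions: a decorator can wrap `backward` and run both shape checks around it. The names `checks_shapes`, `ShapeMismatch`, and the toy `Double` layer are hypothetical, and NumPy arrays stand in for tensors to keep the example lightweight; anything with a `.shape` attribute would work the same way.

```python
import functools
import numpy as np

class ShapeMismatch(Exception):
    """Hypothetical error type, standing in for MatchError."""

def checks_shapes(backward):
    """Wrap a layer's backward method with the two shape checks from the template."""
    @functools.wraps(backward)
    def wrapper(self, output_grad):
        if output_grad.shape != self.output.shape:
            raise ShapeMismatch(
                f"output_grad {output_grad.shape} vs output {self.output.shape}")
        input_grad = backward(self, output_grad)
        if input_grad.shape != self.last_input.shape:
            raise ShapeMismatch(
                f"input_grad {input_grad.shape} vs input {self.last_input.shape}")
        return input_grad
    return wrapper

class Double:
    """Toy layer that multiplies its input by 2."""
    def forward(self, input):
        self.last_input = input
        self.output = 2.0 * input
        return self.output

    @checks_shapes
    def backward(self, output_grad):
        # d(2x)/dx = 2, so the incoming gradient is simply scaled
        return 2.0 * output_grad

layer = Double()
layer.forward(np.ones((4, 3)))
print(layer.backward(np.ones((4, 3))).shape)   # (4, 3)
```

The trade-off is that the checks become invisible at the call site, which may or may not be desirable in teaching code.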

I think introducing all of these concepts will help students generalize from implementing the "basic, fully connected" neural nets from below, to the more complicated stuff like convolutions.

## `Layer` base class

In [54]:
class Layer(object):
    '''
    Defining basic functions that all classes inheriting from Layer must implement.
    '''

    def __init__(self):
        pass

    def forward(self, input):
        raise NotImplementedError()

    def backward(self, output_grad):
        raise NotImplementedError()
        
    def parameters(self):
        yield from ()
    
    def grads(self):
        yield from ()
        
    def __call__(self, input):
        return self.forward(input)

### `Linear` layer

#### Forward pass
The forward pass of the linear layer is fairly simple. We have three inputs: the features and our two parameters, the weights and the bias. We'll pass forward the linear combination of these.

$$
a = \sum_i w_i x_i + b
$$

#### Backward pass

The backward pass is more complicated. Here we need to pass our gradients backward to all of our inputs.

$$
\begin{align}
\frac{\partial a}{\partial w_i} &= x_i \\
\frac{\partial a}{\partial b} &= 1 \\
\frac{\partial a}{\partial x_i} &= w_i 
\end{align}
$$

We get $\frac{\partial L}{\partial a}$ as input gradient into this layer, so we just need to multiply that by this linear layer's gradient to update the weights. We'll want to pass the gradient for $x$ backwards because these inputs could potentially be coming from another layer. But the gradients for $w$ and $b$ are used to update the parameters directly.

$$
\begin{align}
\frac{\partial L}{\partial w_i} &= \frac{\partial a}{\partial w_i}\frac{\partial L}{\partial a} = x_i\frac{\partial L}{\partial a} \\ \\
\frac{\partial L}{\partial b} &= \frac{\partial a}{\partial b}\frac{\partial L}{\partial a} = \frac{\partial L}{\partial a} \\ \\
\frac{\partial L}{\partial x_i} &= \frac{\partial a}{\partial x_i}\frac{\partial L}{\partial a} = w_i\frac{\partial L}{\partial a}
\end{align}
$$

Now, we need to consider cases where we have multiple examples and multiple outputs of our linear layer. This gets pretty difficult to understand without a lot of linear algebra experience, so I'll go through it step by step. If we have $K$ examples and $N$ features, then we would represent $\mathbf{x}$ (bold means matrix) like so:

<img src='assets/Features.png' width=400px>

Each row in the matrix is one example from our dataset and each column corresponds to one of the features in our data set. Often, we'll want multiple outputs from our linear layer; we'll use $M$ to denote the number of outputs. The layer output $a$ from one example is now a vector instead of just a single number, and we need a set of weights for each output. The weights can again be represented as a matrix:

<img src='assets/Weights.png' width=400px>

Here, each row corresponds to the weights for one feature and each column corresponds to the weights for one output. Our outputs will be the linear combination of the features _for each output_, _for each example_.

$$
\mathbf{a} = \mathbf{x} \mathbf{W} + \vec{b}
$$

<img src='assets/Outputs.png' width=400px>

To do this, we take the matrix multiplication between the features $\mathbf{x}$ and the weights $\mathbf{W}$ (ignoring the bias term for simplicity). With matrix multiplication, you take the dot product of a _row_ in the first matrix with each _column_ in the second matrix to get the corresponding _row_ of the resulting matrix. As shown below, we take the first example of $\mathbf{x}$ and combine it with the first output column of $\mathbf{W}$ to get the first output of the first example, $a_{11}$. If you continue this, multiplying the first row of $\mathbf{x}$ by each column in $\mathbf{W}$, then you'll get all of the Linear layer outputs for the first example. If you do this for each example, you'll get all the outputs for all the examples. This is what a matrix multiplication does, and it is the basis for most computations in neural networks. Note that matrix multiplication only works when the number of columns in the first matrix equals the number of rows in the second matrix.

<img src='assets/MatrixMult.png' width=700px>
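The same bookkeeping can be checked numerically. This is a minimal sketch that uses NumPy instead of torch (an assumption made to keep it lightweight), with arbitrary sizes for $K$, $N$, and $M$.

```python
import numpy as np

K, N, M = 3, 4, 2   # examples, features, outputs (arbitrary sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(K, N))   # one row per example
W = rng.normal(size=(N, M))   # one column per output
b = rng.normal(size=(1, M))   # one bias per output, broadcast over examples

a = x @ W + b
print(a.shape)   # (3, 2): one row per example, one column per output

# a[0, 0] is the first row of x combined with the first column of W,
# plus the first bias
a_11 = sum(x[0, i] * W[i, 0] for i in range(N)) + b[0, 0]
print(np.isclose(a[0, 0], a_11))
```

Swapping the operands (`W @ x`) fails here because the inner dimensions, 2 and 3, don't match.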

Finally, we should consider the gradients we receive in the backward pass. The gradients should be exactly the same shape as the layer output (otherwise something is wrong further on in the network). 

<img src='assets/Gradients.png' width=500px>

Here, $\nabla_{a_{km}}L$ means the gradient of the loss with respect to $a_{km}$. The gradient for the weights is now a matrix multiplication that looks like

$$
\nabla_{\mathbf{W}}L = \mathbf{x}\nabla_{\mathbf{a}} L
$$

However, $\nabla_{\mathbf{a}} L$ and $\mathbf{x}$ have the same first dimension, $K$, so they can't be multiplied in this order. We also need the resulting matrix to match the shape of $\mathbf{W}$, which is $N\times M$. What we can do is _transpose_ $\mathbf{x}$, that is, swap the rows and columns to get an $N \times K$ matrix. This can be multiplied with $\nabla_{\mathbf{a}} L$, with size $K \times M$, to get a matrix of the appropriate shape for our weights. The correct calculation is then:

$$
\nabla_{\mathbf{W}}L = \mathbf{x}^T\nabla_{\mathbf{a}} L
$$

Similarly, we'll need to pass the input gradient backward through the network:

$$\frac{\partial L}{\partial x_i} = \frac{\partial a}{\partial x_i}\frac{\partial L}{\partial a} = w_i\frac{\partial L}{\partial a}$$

which you might write as $\nabla_{\mathbf{x}}L = \mathbf{W}\,\nabla_{\mathbf{a}}L$. However, to get the matrix multiplication to work out right, we need to take the transpose of $\mathbf{W}$:

$$
\nabla_{\mathbf{x}}L = \nabla_{\mathbf{a}}L\,\mathbf{W}^T
$$
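A finite-difference check is a handy way to convince yourself that these transposes are right. The sketch below uses NumPy rather than torch and a made-up scalar loss whose gradient with respect to $\mathbf{a}$ is a fixed matrix `G`; both are assumptions for illustration. The bias gradient sums over examples, matching what the `Linear` code in the next cell does.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, M = 5, 3, 2
x = rng.normal(size=(K, N))
W = rng.normal(size=(N, M))
b = rng.normal(size=(1, M))
G = rng.normal(size=(K, M))   # stand-in for the incoming gradient dL/da

def loss(x, W, b):
    # Hypothetical scalar loss chosen so that dL/da is exactly G
    return np.sum((x @ W + b) * G)

# Analytic gradients from the formulas above
dW = x.T @ G                        # shape (N, M), matches W
dx = G @ W.T                        # shape (K, N), matches x
db = G.sum(axis=0, keepdims=True)   # shape (1, M), matches b

def numeric_grad(f, arr, eps=1e-6):
    """Central finite differences, perturbing one entry of arr at a time."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        up = f()
        arr[idx] = old - eps
        down = f()
        arr[idx] = old   # restore the entry before moving on
        grad[idx] = (up - down) / (2 * eps)
    return grad

f = lambda: loss(x, W, b)
print(np.allclose(dW, numeric_grad(f, W), atol=1e-5))
print(np.allclose(dx, numeric_grad(f, x), atol=1e-5))
print(np.allclose(db, numeric_grad(f, b), atol=1e-5))
```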

Programmatically, we'll take advantage of matrix multiplications using `torch.mm`.

In [5]:
class Linear(Layer):

    def __init__(self, size: int) -> None:

        super().__init__()
        self.size = size
        self.first = True
        self.parameters_ = {'W': None, 'B': None}
        self.grads_ = {'W': None, 'B': None}
    
    def forward(self, input: Tensor) -> Tensor:
        """ Takes a tensor and performs a linear (affine) transformation """
        
        if input.dim() != 2:
            raise DimensionError(f"Tensor should have dimension 2, instead it has dimension {input.dim()}")

        self.last_input = input
        
        # Sets up the weights on the first iteration. Doing this so the
        # input size isn't defined until we pass in our first tensor
        if self.first:
            n_input = input.size()[1]
            
            # Intialize a 2D tensor for the weights
            self.W = torch.randn((n_input, self.size))*0.01
            # Register the weight parameter
            self.parameters_.update({'W': self.W})
            
            # Intialize the bias terms (one for each output value)
            self.B = torch.randn((1, self.size))*0.01
            # Register the bias parameter
            self.parameters_.update({'B': self.B})
            
            self.first = False
        
        # The linear transformation here
        self.output = torch.mm(self.last_input, self.W) + self.B
        
        return self.output

    def backward(self, in_grad: Tensor) -> Tensor:
        """ Takes a gradient from another operation, then calculates the gradients
            for this layer's parameters, and returns the gradient for this layer to pass
            backwards in the network
        """
        
        # Key assertion
        if self.output.shape != in_grad.shape:
            message = (f"Two tensors should have the same shape; instead, first Tensor's shape "
                       f"is {in_grad.shape} and second Tensor's shape is {self.output.shape}.")
            raise MatchError(message)
        
        # Number of examples
        n = in_grad.shape[0]
        
        # Parameter gradients
        x = self.last_input
        dW = torch.mm(x.t(), in_grad)   # dL/dW
        dB = torch.sum(in_grad, dim=0).view(*self.B.shape)   # dL/dB
        
        # Register parameter gradients
        self.grads_.update({'W': dW})
        self.grads_.update({'B': dB})
        
        # This layer's gradient which we'll pass on to previous layers, for dL/dx
        backward_grad = torch.mm(in_grad, self.W.t())
        
        # Key assertion
        if self.last_input.shape != backward_grad.shape:
            message = (f"Two tensors should have the same shape; instead, first Tensor's shape "
                       f"is {self.last_input.shape} and second Tensor's shape is {backward_grad.shape}.")
            raise MatchError(message)

        return backward_grad
    
    def parameters(self):
        for param in self.parameters_.values():
            yield param
    
    def grads(self):
        for param in self.parameters_:
            yield self.grads_[param]
    
    def __repr__(self):
        return f"Linear({self.size})"

In [6]:
# Testing out our new Linear layer
linear = Linear(1)
x = torch.randn((10, 7))
a = linear.forward(x)
grad = torch.rand_like(a)
print(a)
print(linear.backward(grad))

tensor(1.00000e-02 *
       [[-0.3811],
        [-2.4835],
        [ 3.9679],
        [-0.3602],
        [-1.0610],
        [ 2.2594],
        [ 0.9376],
        [-0.1753],
        [ 0.3777],
        [ 1.2145]])
tensor(1.00000e-03 *
       [[ 9.4571, -3.8411, -1.7153, -8.7004, -4.9365, -9.1825, -3.0859],
        [ 5.0161, -2.0374, -0.9098, -4.6148, -2.6183, -4.8705, -1.6368],
        [ 2.8525, -1.1586, -0.5174, -2.6243, -1.4890, -2.7697, -0.9308],
        [ 8.9318, -3.6278, -1.6200, -8.2171, -4.6622, -8.6724, -2.9145],
        [ 3.4333, -1.3945, -0.6227, -3.1585, -1.7921, -3.3336, -1.1203],
        [ 4.9930, -2.0280, -0.9056, -4.5934, -2.6062, -4.8480, -1.6292],
        [ 7.6109, -3.0913, -1.3805, -7.0019, -3.9728, -7.3899, -2.4835],
        [ 7.9371, -3.2238, -1.4396, -7.3020, -4.1430, -7.7066, -2.5899],
        [ 5.2587, -2.1359, -0.9538, -4.8379, -2.7449, -5.1060, -1.7159],
        [ 1.9620, -0.7969, -0.3559, -1.8050, -1.0241, -1.9051, -0.6402]])


### `Sigmoid` layer

#### Forward pass
The forward pass of the sigmoid layer should calculate

$$
\sigma(a) = \frac{1}{1+e^{-a}}
$$

For our logistic regression problem, this will be $p$, the probability of a "successful" outcome. This layer can also be used in multilayer networks as the activation function for hidden layers. We'll perform this calculation element-wise.

#### Backward pass

Denoting the output of this layer as $p$, the sigmoid layer should receive $\partial L \mathbin{/} \partial p$ in the backward pass. Then it should return

$$
\frac{\partial L}{\partial a} = \frac{\partial p}{\partial a} \frac{\partial L}{\partial p}
$$

to send backward through the network.

The gradient of the sigmoid layer is then

$$
\frac{\partial L}{\partial a} = p\,(1-p)\frac{\partial L}{\partial p}
$$

which I'll let you work out if you want.
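For reference, here is one way to work it out. Differentiating $p = (1 + e^{-a})^{-1}$ with the chain rule gives

$$
\begin{align}
\frac{\partial p}{\partial a} &= \frac{e^{-a}}{(1+e^{-a})^2} \\
&= \frac{1}{1+e^{-a}} \cdot \frac{e^{-a}}{1+e^{-a}} \\
&= p\,(1-p)
\end{align}
$$

where the last step uses $1 - p = \frac{e^{-a}}{1+e^{-a}}$.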

In [7]:
class Sigmoid(Layer):
    '''
    Sigmoid activation function
    '''
    def __init__(self):
        super().__init__()
        
    def forward(self, input: Tensor) -> Tensor:
        
        self.last_input = input
        
        self.output = 1.0/(1.0+torch.exp(-1.0 * input))
        return self.output

    def backward(self, in_grad: Tensor) -> Tensor:

        # Key assertion
        if self.output.shape != in_grad.shape:
            message = (f"Two tensors should have the same shape; instead, first Tensor's shape "
                       f"is {in_grad.shape} and second Tensor's shape is {self.output.shape}.")
            raise MatchError(message)           
        
        sigmoid_backward = self.output*(1.0-self.output)
        backward_grad = sigmoid_backward * in_grad
        
        # Key assertion
        if self.last_input.shape != backward_grad.shape:
            message = (f"Two tensors should have the same shape; instead, first Tensor's shape "
                       f"is {self.last_input.shape} and second Tensor's shape is {backward_grad.shape}.")
            raise MatchError(message)
        
        return backward_grad
    
    def __repr__(self):
        return f"Sigmoid"

In [8]:
# Testing out new Sigmoid layer
sigmoid = Sigmoid()
p = sigmoid.forward(a)
grad = torch.rand_like(p)
sigmoid.backward(grad)

tensor([[ 0.2199],
        [ 0.1166],
        [ 0.0663],
        [ 0.2077],
        [ 0.0798],
        [ 0.1161],
        [ 0.1770],
        [ 0.1846],
        [ 0.1223],
        [ 0.0456]])

Now that we have all the layers necessary for our little network, we can stack them sequentially.

In [9]:
# Generate some data, 20 examples, 7 features
features = x = torch.randn((20, 7))
labels = torch.randint(0, 2, (20, 1))

layers = [Linear(1), Sigmoid()]

print('Initial parameter gradients: ', [grad for grad in layers[0].grads()])

# Forward pass through our network
for layer in layers:
    x = layer.forward(x)

# Backward pass through our network
grad = -torch.ones(20, 1)
for layer in reversed(layers):
    grad = layer.backward(grad)
    
print('Parameter gradients after backward pass: ', [grad for grad in layers[0].grads()])

Initial parameter gradients:  [None, None]
Parameter gradients after backward pass:  [tensor([[ 0.0377],
        [-0.8707],
        [-0.5948],
        [-0.0844],
        [ 0.2132],
        [-0.4861],
        [-1.9651]]), tensor([[-4.9994]])]


Let's build a new class called `Sequential` that can do what we wrote above, take sequential layers and build them into a sequential graph.

In [10]:
from itertools import chain

In [11]:
class Sequential(Layer):
    
    def __init__(self, *layers: Layer):
        super().__init__()
        self.layers = tuple(layers)
              
    def forward(self, x: Tensor) -> Tensor:
        for layer in self.layers:
            x = layer.forward(x)
        return x
        
    def backward(self, grad: Tensor = None) -> Tensor:
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad
    
    def parameters(self):
        for layer in self.layers:
            yield from layer.parameters()
        
    def grads(self):
        for layer in self.layers:
            yield from layer.grads()
    
    def __iter__(self):
        return iter(self.layers)
    
    def __repr__(self):
        layer_strs = [str(layer) for layer in self.layers]
        return f"{self.__class__.__name__}(\n  " + ",\n  ".join(layer_strs) + ")"

In [12]:
n = 10
features = torch.randn(n, 7)
seq = Sequential(Linear(1), Sigmoid())

print("Sequential layers: ", seq.layers)
print("Initial parameters and grads: \n", [param for param in seq.parameters()], 
      "\n", [grad for grad in seq.grads()])
print("Forward pass loss: ", seq.forward(features))
print("New parameters: \n", [param for param in seq.parameters()])
print("Backward pass input gradient: \n", seq.backward(torch.rand(n,1)))
print("New parameter gradients: \n", [grad for grad in seq.grads()])

Sequential layers:  (Linear(1), Sigmoid)
Initial parameters and grads: 
 [None, None] 
 [None, None]
Forward pass loss:  tensor([[ 0.4912],
        [ 0.5006],
        [ 0.4809],
        [ 0.4872],
        [ 0.4886],
        [ 0.4903],
        [ 0.5047],
        [ 0.4983],
        [ 0.4957],
        [ 0.5039]])
New parameters: 
 [tensor(1.00000e-02 *
       [[ 0.0225],
        [-1.6493],
        [-0.4493],
        [ 0.4861],
        [-0.2228],
        [ 2.1634],
        [ 0.0527]]), tensor(1.00000e-02 *
       [[-2.8281]])]
Backward pass input gradient: 
 tensor(1.00000e-03 *
       [[ 0.0382, -2.8037, -0.7638,  0.8262, -0.3787,  3.6775,  0.0895],
        [ 0.0520, -3.8222, -1.0413,  1.1264, -0.5163,  5.0134,  0.1221],
        [ 0.0082, -0.6009, -0.1637,  0.1771, -0.0812,  0.7882,  0.0192],
        [ 0.0009, -0.0672, -0.0183,  0.0198, -0.0091,  0.0882,  0.0021],
        [ 0.0327, -2.3983, -0.6534,  0.7068, -0.3239,  3.1458,  0.0766],
        [ 0.0004, -0.0319, -0.0087,  0.0094, -0.0043,

We can use this `Sequential` class to construct new layers that are a combination of layers. For example, we can combine the Linear and Sigmoid layers into one layer we'll call `Dense`, following after Keras. With this layer, you'll be able to define what operation you want to use as an activation function following the linear transformation in one class.

<img src='assets/dense_layer.png' width=300px>

In [13]:
class Dense(Sequential):
    def __init__(self, size: int, activation: typing.Any = 'sigmoid') -> None:
        
        if activation == 'sigmoid':
            self.activation = Sigmoid()
        else:
            self.activation = activation
        
        super().__init__(Linear(size), self.activation)

In [14]:
dense = Dense(10)
dense(torch.rand((1,5)))

tensor([[ 0.4992,  0.4926,  0.5045,  0.5021,  0.4944,  0.5023,  0.5013,
          0.5000,  0.5137,  0.5029]])

### `LogLoss` layer

Finally we'll build a layer for the log loss, specifically for the logistic regression problem. This is also known as the binary cross-entropy loss. It has a term for both positive and negative labels, whereas the multi-class cross-entropy loss with one-hot labels only keeps the term for the true class.

#### Forward pass
Our loss here for true labels $y$ and prediction probabilities $p$ is

$$
L = \sum_k^K -y^k \log{p^k} - (1-y^k)\log{(1-p^k)}
$$

#### Backward pass

For the backward pass, we need the gradient of $L$ with respect to $p$, which we'll send backward in the network.

$$
\frac{\partial L}{\partial p^k} = -\frac{y^k}{p^k} + \frac{1 - y^k}{1 - p^k}
$$
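A quick numerical check of this gradient, chained with the sigmoid gradient $p\,(1-p)$ from earlier, is sketched below for a single observation. The helper names and the example values are assumptions. A well-known consequence of the two formulas is that the product simplifies to $p - y$.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_loss(y, p):
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def dloss_dp(y, p):
    """Gradient of the log loss with respect to p for one observation."""
    return -y / p + (1 - y) / (1 - p)

y, a = 1.0, 0.3   # hypothetical label and pre-activation
p = sigmoid(a)

# Chain rule: dL/da = dp/da * dL/dp, with dp/da = p * (1 - p)
analytic = p * (1 - p) * dloss_dp(y, p)

# Central finite difference of L(sigmoid(a)) as an independent check
eps = 1e-6
numeric = (log_loss(y, sigmoid(a + eps)) - log_loss(y, sigmoid(a - eps))) / (2 * eps)

print(analytic, numeric, p - y)   # all three agree closely
```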


In [15]:
class Loss:
    """ Base class for losses """
    def __init__(self, network: Layer):
        self.network = network
    
    def forward(self, input: Tensor, targets: Tensor) -> float:
        raise NotImplementedError()

    def backward(self) -> Tensor:
        raise NotImplementedError()
        
    def __call__(self, input: Tensor, targets: Tensor) -> float:
        return self.forward(input, targets)

In [17]:
class LogLoss(Loss):
    """ Log loss error specifically for logistic regression, requires a sequence of layers as input """
    def __init__(self, network: Layer, eta=1e-9):
        super().__init__(network)
        
        # Small parameter to avoid explosions when our probabilities get small
        # A better way to do this is use log probabilities everywhere
        self.eta = eta
        
    def forward(self, features: Tensor, labels: Tensor) -> float:
        
        self.last_input = p = self.network(features)
        self.labels = y = labels
        
        loss = torch.sum(-y*torch.log(p + self.eta) - (1-y)*torch.log(1 - p + self.eta))
        return loss.item()
    
    def backward(self) -> None:
        y, p = self.labels, self.last_input
        n = y.shape[0]
        
        backward_grad = torch.sum(-y/(p + self.eta) + (1-y)/(1 - p  + self.eta), dim=1).view(n, -1)
        
        # Calculate gradients for the network
        self.network.backward(backward_grad)
        return None
    
    def __repr__(self):
        return f"LogLoss"

In [18]:
# Testing out new LogLoss layer
# Create some fake data
x = torch.randn((20, 7))
y = torch.randint(0, 2, (x.shape[0], 1))

# Create our network
net = Sequential(Linear(1), Sigmoid())

# Define the loss
loss = LogLoss(net)

# Forward pass to get our loss
print("\nForward loss: ", loss.forward(x, y))

# Backward pass to calculate gradients
print("Parameter gradients before backward pass: \n", [grad for grad in net.grads()])
loss.backward()
print("\nParameter gradients after backward pass: \n", [grad for grad in net.grads()])


Forward loss:  13.897012710571289
Parameter gradients before backward pass: 
 [None, None]

Parameter gradients after backward pass: 
 [tensor([[-2.3485],
        [ 0.4376],
        [-0.9617],
        [ 0.7514],
        [ 1.8838],
        [ 1.8393],
        [-0.3215]]), tensor([[-0.9540]])]


With these building blocks, you can theoretically construct any possible neural network architecture. We can now build a complete network for a logistic regression model. We can pass data forward through the network, and pass our gradients backwards to update our parameters. Now we'll see how we can use these gradients for the actual update step, and train our network to accurately [diagnose breast cancer][1].

[1]: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

We'll use stochastic gradient descent (SGD) to update our parameters:

$$
\mathbf{W}_{new} = \mathbf{W} - \eta \nabla_{\mathbf{W}}L
$$

In [19]:
class SGD:
    def __init__(self, network: Layer, lr: float = 0.003):
        self.network = network
        self.lr = lr
        
    def step(self):
        for param, grad in zip(self.network.parameters(), self.network.grads()):
            param.sub_(self.lr*grad)
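The in-place `sub_` matters here: `parameters()` yields the very tensors stored on each layer, so mutating them updates the network itself. The sketch below illustrates the difference between an in-place update and rebinding a name, using NumPy arrays as a stand-in for tensors (an assumption made to keep the example lightweight).

```python
import numpy as np

lr = 0.1
W = np.array([1.0, 2.0])
grad = np.array([10.0, 10.0])

# In-place update: the original array changes, like param.sub_(...) on a tensor
param = W
param -= lr * grad
print(W)    # [0. 1.]

# Rebinding: a new array is created and the original is left untouched
W2 = np.array([1.0, 2.0])
param = W2
param = param - lr * grad
print(W2)   # [1. 2.]
```

Writing `param = param - self.lr*grad` inside `step` would therefore compute the right numbers but never change the layer's weights.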

In [20]:
# Testing out new SGD optimizer
# Create some fake data
x = torch.randn((20, 7))
y = torch.randint(0, 2, (x.shape[0], 1))

# Define network, loss, and optimizer
net = Sequential(Linear(1), Sigmoid())
loss = LogLoss(net)
optim = SGD(net)

# Forward pass to get our loss
forward_loss = loss.forward(x, y)
loss.backward()
print("\nParameters before update step: \n", [param for param in net.parameters()])
optim.step()
print("\nParameters after update step: \n", [param for param in net.parameters()])


Parameters before update step: 
 [tensor(1.00000e-02 *
       [[ 0.5434],
        [-1.0467],
        [-3.0068],
        [ 1.1508],
        [-0.1886],
        [-0.5810],
        [-0.9466]]), tensor(1.00000e-03 *
       [[-1.2028]])]

Parameters after update step: 
 [tensor(1.00000e-02 *
       [[-0.0601],
        [-0.9353],
        [-2.1915],
        [-0.3362],
        [ 1.2908],
        [-1.1787],
        [-1.3754]]), tensor(1.00000e-03 *
       [[ 7.8129]])]


## Training

In this part, we'll be using the components we've created to train a logistic regression model to predict breast cancer.

### Prep breast cancer data

#### `sklearn` loading

In [2]:
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
features = breast_cancer.data
labels = breast_cancer.target
feature_names = breast_cancer.feature_names

In [3]:
def standardize(arr: np.ndarray) -> np.ndarray:
    
    means = arr.mean(axis=0)
    stds = arr.std(axis=0)
    
    return (arr - means) / stds

features = standardize(features)

In [4]:
def generate_batches(features: np.ndarray, 
                     labels: np.ndarray,
                     size: int = 32,
                     shuffle: bool = True) -> typing.Iterator[Tuple[Tensor, Tensor]]:
    
    if features.shape[0] != labels.shape[0]:
        raise ValueError('feature and label arrays must have the same first dimension')
    
    n = features.shape[0]
    
    # Make both arrays 2D so the row slicing below works with or without shuffling
    features = features.reshape((n, -1))
    labels = labels.reshape((n, 1))
    
    if shuffle:
        # np.random.shuffle works in place and returns None,
        # so draw a permuted index array instead
        idx = np.random.permutation(n)
        features = features[idx]
        labels = labels[idx]
    
    for ii in range(0, n, size):
        out_features = torch.from_numpy(features[ii:ii+size, :]).type(torch.FloatTensor)
        out_labels = torch.from_numpy(labels[ii:ii+size, :]).type(torch.FloatTensor)
        yield out_features, out_labels
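One NumPy detail is easy to trip over when shuffling: `np.random.shuffle` reorders its argument in place and returns `None`, and indexing an array with `None` only adds an axis rather than reordering anything. The sketch below, with made-up data, shows this and a way to shuffle features and labels together so they stay aligned.

```python
import numpy as np

idx = np.arange(5)
result = np.random.shuffle(idx)   # shuffles idx in place
print(result)                     # None

# np.random.permutation returns a new shuffled index array instead
perm = np.random.permutation(5)

features = np.arange(10).reshape(5, 2)   # hypothetical data: row i is [2i, 2i+1]
labels = np.arange(5)                    # hypothetical labels: label i belongs to row i

shuffled_features = features[perm]
shuffled_labels = labels[perm]

# Rows and labels still line up after shuffling with the same index
print(np.all(shuffled_features[:, 0] == 2 * shuffled_labels))

# Indexing with None adds a leading axis and leaves the order unchanged
print(features[None].shape)   # (1, 5, 2)
```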

In [24]:
network = Dense(1)
loss = LogLoss(network)
optim = SGD(network, lr=0.01)

epochs = 500
print_every = 100
steps = 0
for e in range(epochs):
    train_loss = 0
    for x, y in generate_batches(features, labels, size=128):
        steps += 1
        train_loss += loss(x, y)
        loss.backward()
        optim.step()
    if steps % print_every == 0: 
        print(f"{train_loss/len(features):.4f}")

0.0601
0.0547
0.0522
0.0507
0.0496
0.0487
0.0479
0.0473
0.0467
0.0462
0.0457
0.0453
0.0449
0.0446
0.0442
0.0439
0.0436
0.0433
0.0431
0.0428
0.0426
0.0424
0.0422
0.0420
0.0418


Now we need a metric to determine how well our network is performing. A common metric for classification problems like this is accuracy, correct predictions divided by all predictions.

In [25]:
def accuracy(predictions: np.ndarray, labels: np.ndarray) -> float:
    accuracy = np.mean(predictions.squeeze() == labels.squeeze())
    return accuracy
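The `squeeze` calls are doing real work here. The network returns a column of shape `(n, 1)` while the labels have shape `(n,)`, and comparing those directly broadcasts to an `(n, n)` matrix. A small sketch with made-up probabilities and labels (and an explicit 0.5 threshold, which is an assumption standing in for the rounding used in the next cell):

```python
import numpy as np

# Hypothetical predicted probabilities (column vector) and true labels
ps = np.array([[0.9], [0.2], [0.6], [0.4]])
labels = np.array([1, 0, 0, 0])

predictions = (ps >= 0.5).astype(int)   # explicit 0.5 threshold

# Without squeeze, (4, 1) == (4,) broadcasts to a (4, 4) matrix of comparisons
print((predictions == labels).shape)    # (4, 4)

acc = np.mean(predictions.squeeze() == labels.squeeze())
print(acc)                              # 0.75: three of four predictions match
```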

In [26]:
ps = network(torch.from_numpy(features).type(torch.FloatTensor))
predictions = np.round(ps.numpy())
acc = accuracy(predictions, labels)
print(f"Accuracy on training data: {acc*100:.3f}%")

Accuracy on training data: 98.946%


In [50]:
class Classifier:
    def __init__(self, network: Layer, 
                       loss: Loss=LogLoss, 
                       optimizer: typing.Any=SGD, 
                       metric: typing.Callable=accuracy, 
                       batch_gen: typing.Callable=generate_batches):
        self.network = network
        if loss is not LogLoss:
            self.loss = loss
        else:
            self.loss = LogLoss(network)
        
        if optimizer is not SGD:
            self.optim = optimizer
        else:
            self.optim = optimizer(network)
        
        self.metric = metric
        self.batch_gen = batch_gen
        
    def fit(self, features: np.ndarray, labels: np.ndarray, 
                  epochs: int=500, print_every: int=100, 
                  batch_size: int=32)-> None:
        steps = 0
        for e in range(epochs):
            running_loss = 0
            for ii, (x, y) in enumerate(self.batch_gen(features, labels, size=batch_size)):
                steps += 1
                running_loss += self.loss(x, y)
                self.loss.backward()
                self.optim.step()
            
                if steps % print_every == 0:
                    ps = self.network(torch.from_numpy(features).type(torch.FloatTensor))
                    predictions = np.round(ps.numpy())
                    acc = self.metric(predictions, labels)
                    print(f"Epoch {e+1}.. Train loss: {running_loss/print_every:.4f}.. ", f"Accuracy: {acc*100:.3f}%")
                    running_loss = 0

In [52]:
network = Sequential(
            Dense(10),
            Dense(1))
model = Classifier(network, loss=LogLoss(network), optimizer=SGD(network))
model.fit(features, labels, batch_size=128)

Epoch 20.. Train loss: 0.9173..  Accuracy: 95.958%
Epoch 40.. Train loss: 0.5047..  Accuracy: 98.067%
Epoch 60.. Train loss: 0.4117..  Accuracy: 98.418%
Epoch 80.. Train loss: 0.3696..  Accuracy: 98.594%
Epoch 100.. Train loss: 0.3446..  Accuracy: 98.594%
Epoch 120.. Train loss: 0.3276..  Accuracy: 98.594%
Epoch 140.. Train loss: 0.3152..  Accuracy: 98.594%
Epoch 160.. Train loss: 0.3056..  Accuracy: 98.594%
Epoch 180.. Train loss: 0.2980..  Accuracy: 98.594%
Epoch 200.. Train loss: 0.2917..  Accuracy: 98.770%
Epoch 220.. Train loss: 0.2864..  Accuracy: 98.770%
Epoch 240.. Train loss: 0.2819..  Accuracy: 98.770%
Epoch 260.. Train loss: 0.2779..  Accuracy: 98.770%
Epoch 280.. Train loss: 0.2744..  Accuracy: 98.770%
Epoch 300.. Train loss: 0.2712..  Accuracy: 98.770%
Epoch 320.. Train loss: 0.2682..  Accuracy: 98.770%
Epoch 340.. Train loss: 0.2655..  Accuracy: 98.770%
Epoch 360.. Train loss: 0.2630..  Accuracy: 98.770%
Epoch 380.. Train loss: 0.2606..  Accuracy: 98.770%
Epoch 400.. Trai

I moved all of this code into a package called `Lincoln`, which we can easily use to build neural networks.

In [4]:
import lincoln as lnc
from lincoln.layers import Dense, Sequential

In [5]:
network = Sequential(
            Dense(10),
            Dense(1))
model = lnc.models.Classifier(network, loss=lnc.losses.LogLoss(network), optimizer=lnc.optim.SGD(network))
model.fit(features, labels, batch_size=128)

Epoch 20.. Train loss: 0.8092..  Accuracy: 96.661%
Epoch 40.. Train loss: 0.4813..  Accuracy: 98.243%
Epoch 60.. Train loss: 0.3998..  Accuracy: 98.418%
Epoch 80.. Train loss: 0.3611..  Accuracy: 98.594%
Epoch 100.. Train loss: 0.3376..  Accuracy: 98.594%
Epoch 120.. Train loss: 0.3216..  Accuracy: 98.594%
Epoch 140.. Train loss: 0.3098..  Accuracy: 98.594%
Epoch 160.. Train loss: 0.3006..  Accuracy: 98.594%
Epoch 180.. Train loss: 0.2933..  Accuracy: 98.770%
Epoch 200.. Train loss: 0.2873..  Accuracy: 98.770%
Epoch 220.. Train loss: 0.2821..  Accuracy: 98.770%
Epoch 240.. Train loss: 0.2777..  Accuracy: 98.770%
Epoch 260.. Train loss: 0.2738..  Accuracy: 98.770%
Epoch 280.. Train loss: 0.2702..  Accuracy: 98.770%
Epoch 300.. Train loss: 0.2670..  Accuracy: 98.770%
Epoch 320.. Train loss: 0.2639..  Accuracy: 98.770%
Epoch 340.. Train loss: 0.2611..  Accuracy: 98.770%
Epoch 360.. Train loss: 0.2584..  Accuracy: 98.770%
Epoch 380.. Train loss: 0.2558..  Accuracy: 98.770%
Epoch 400.. Trai