# Notebook 5 - Optimization and neural networks


In [None]:
import copy
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize
import torch
import time

## Optimization

We first discuss optimization in general, then apply it to machine learning.

Let us start with a simple setting and minimize the function
$f(x) = x^2$, starting from $x_0=2$.

A one-liner for that is to call scipy.optimize.minimize:

In [None]:
def f(x):
    return x ** 2


x_0 = 2

result = scipy.optimize.minimize(f, x_0)
print(result.x)  # this should print a value very close to zero

### Implementing a random search

A first possible algorithm is to sample a random change for x and keep the better value.
We iterate the following steps:
- sample a neighbor of x by adding Gaussian noise with standard deviation 0.01
- evaluate f at the current point and at the neighbor
- move to the better of the two

Implement that with a for loop with 1000 iterations.

In [None]:
n_iter = 1000
x = x_0
all_results = list()


def sample_around(x):
    return x + np.random.normal(scale=0.01)


for _ in range(n_iter):
    sample = sample_around(x)
    f_x, f_sample = f(x), f(sample)
    # The best value is the sampled one
    if f_sample < f_x:
        x = sample
        all_results.append(f_sample)
    # The best one is the former one
    else:
        x = x
        all_results.append(f_x)

print(x)
plt.plot(all_results)

### Implementing an exhaustive search

A second possible algorithm is to try every allowed change for x and keep the best value.
We iterate the following steps:
- build two candidates, one 0.01 smaller than x and one 0.01 larger
- evaluate f at these two candidates
- move to the better one

Implement that with a for loop with 1000 iterations.

In [None]:
n_iter = 1000
x = x_0
all_results = list()

for _ in range(n_iter):
    smaller, larger = x - 0.01, x + 0.01
    # ...

print(x)
plt.plot(all_results)
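
One possible way to fill in the loop above is sketched here. It is only an illustration, not the reference solution: the names `step` and `history` are ours, and plotting is left out to keep the sketch short.

```python
def f(x):
    return x ** 2


x = 2.0
step = 0.01      # step size taken from the instructions above
history = []

for _ in range(1000):
    # Evaluate the two neighbours and move to the better one
    smaller, larger = x - step, x + step
    x = smaller if f(smaller) < f(larger) else larger
    history.append(f(x))

print(x)
```

Because the algorithm always moves, it cannot settle exactly on the minimum: once x reaches the neighbourhood of zero it keeps hopping between values that are at most one step away from it.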

### Implementing a gradient descent 'by hand'
Now let us implement the gradient descent, by remembering that $\frac{df}{dx} = 2x$

We iterate the following steps : 
- compute the gradient value at x
- Update x : $x \leftarrow x - 0.01 \frac{df}{dx}$

Implement that with a for loop with 1000 iterations.

In [None]:
def df(x):
    return 2 * x


# Implement the learning loop
# ...

print(x)
plt.plot(all_results)
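
A minimal sketch of the learning loop, assuming the step size 0.01 and the 1000 iterations given above. The names `lr` and `history` are ours. Since $\frac{df}{dx} = 2x$, each update multiplies x by $1 - 2 \times 0.01 = 0.98$, so after $k$ steps we expect $x_k = 2 \times 0.98^k$.

```python
def df(x):
    return 2 * x


x = 2.0
lr = 0.01        # step size from the update rule above
history = []

for _ in range(1000):
    history.append(x ** 2)   # value of f before the update
    x = x - lr * df(x)       # gradient step

print(x, 2 * 0.98 ** 1000)   # the two numbers should agree
```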

### Implementing a gradient descent with automatic differentiation (by hand)

We want to use the same algorithm, but without writing the derivative ourselves.
Instead, we rely on PyTorch to compute it.

Below is the implementation of the same method as before, with PyTorch.

Can you confirm that we get the same results?

In [None]:
all_results = list()
n_iter = 1000
x = torch.tensor(2.0, requires_grad=True)

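for i in range(n_iter):
    f_x = x ** 2
    f_x.backward()  # fills x.grad with df/dx
    with torch.no_grad():  # update x without recording the operation
        x -= 0.01 * x.grad
    x.grad = None  # reset the gradient before the next iteration
    all_results.append(f_x.item())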

print(x.item())
plt.plot(all_results)
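
One way to check the claim is to compare the gradient computed by PyTorch with the analytic derivative $2x$ at a few points. This is a small sketch; the helper name `autograd_derivative` is ours and is not part of the notebook.

```python
import torch


def autograd_derivative(value):
    # Build a fresh leaf tensor so that the gradient starts empty
    x = torch.tensor(value, requires_grad=True)
    f_x = x ** 2
    f_x.backward()           # populates x.grad with d(x^2)/dx
    return x.grad.item()


for value in (2.0, 0.5, -1.0):
    # Second column: autograd, third column: analytic derivative 2x
    print(value, autograd_derivative(value), 2 * value)
```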

### Implementing a gradient descent with automatic differentiation (the proper way)

There is nothing to implement in this part: it only shows the idiomatic PyTorch syntax, where an optimizer object performs the update step.

In [None]:
all_results = list()
n_iter = 1000
x = torch.tensor(2.0, requires_grad=True)
opt = torch.optim.SGD([x], lr=0.01, momentum=0)

for i in range(n_iter):
    f_x = f(x)
    f_x.backward()
    opt.step()
    opt.zero_grad()
    all_results.append(f_x.data)

print(x.item())
plt.plot(all_results)

## Bigger input space

Let us now look at a larger input space. The function takes five numbers as input and returns:
$f_2(x_1, x_2, x_3, x_4, x_5) = (x_1 + x_2 + x_3 + x_4 + x_5)^2$

Finding a good direction at random is now more costly. Try the random algorithm on this new function.

In [None]:
def f_2(x):
    return (x[0] + x[1] + x[2] + x[3] + x[4]) ** 2


new_x_0 = (1, 2, 3, 4, 5)
f_2(new_x_0)

In [None]:
n_iter = 1000
x = new_x_0
all_results = list()


def sample_around(x):
    return x + np.random.normal(size=5, scale=0.01)


# Adapt the random algorithm defined above to this new problem
# ...

print(x)
plt.plot(all_results)
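
A possible adaptation of the random search is sketched below. It stores x as a NumPy array so that the noise can be added elementwise. The seeded generator `rng` and the name `history` are our additions, used here only to make the run reproducible; the cell above uses np.random.normal instead.

```python
import numpy as np

rng = np.random.default_rng(0)


def f_2(x):
    return (x[0] + x[1] + x[2] + x[3] + x[4]) ** 2


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
history = []

for _ in range(1000):
    # Perturb all five coordinates at once
    sample = x + rng.normal(scale=0.01, size=5)
    # Keep the perturbed point only if it improves the objective
    if f_2(sample) < f_2(x):
        x = sample
    history.append(f_2(x))

print(x, history[-1])
```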

Now let us try the gradient approach.

In [None]:
all_results = list()
n_iter = 1000
x = torch.tensor(new_x_0, requires_grad=True, dtype=float)
opt = torch.optim.SGD([x], lr=0.01, momentum=0)

# Adapt the PyTorch SGD algorithm defined above to this new problem
# ...

print(x.data)
plt.plot(all_results)
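
A sketch of the same SGD loop for the five-dimensional input. The gradient of $f_2$ is $2(x_1 + \dots + x_5)$ in every coordinate, so each step shrinks the sum by a factor $1 - 5 \times 0.02 = 0.9$ and shifts all coordinates by the same amount. The names `loss` and `history` are ours.

```python
import torch


def f_2(x):
    return x.sum() ** 2


x = torch.tensor([1., 2., 3., 4., 5.], dtype=torch.float64, requires_grad=True)
opt = torch.optim.SGD([x], lr=0.01, momentum=0)
history = []

for _ in range(1000):
    loss = f_2(x)
    loss.backward()     # gradient of the loss with respect to x
    opt.step()          # x <- x - lr * grad
    opt.zero_grad()     # reset the gradient for the next iteration
    history.append(loss.item())

print(x.data, x.sum().item())
```

Starting from a sum of 15, the total shift per coordinate is 3, so the iterates approach (-2, -1, 0, 1, 2).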

What can you say about those two different methods ?

## Actual machine learning examples

Now, instead of minimizing arbitrary functions, let us minimize the error of a linear model!

We will use generated data (that I used during my class): we simulate a hidden relationship (base_function) and sample noisy input-output pairs from it.

Let us generate the data once again and plot it.

In [None]:
import numpy as np

np.random.seed(42)


def base_function(x):
    y = 1.3 * x ** 3 - 3 * x ** 2 + 3.6 * x + 6.9
    return y


low, high = -1, 3
n_points = 80

# Get the values
xs = np.random.uniform(low, high, n_points)[:, None]
sample_ys = base_function(xs)
ys_noise = np.random.normal(size=(len(xs), 1))
noisy_sample_ys = sample_ys + ys_noise

# Plot the hidden function
lsp = np.linspace(low, high)[:, None]
true_ys = base_function(lsp)
plt.plot(lsp, true_ys, linestyle='dashed')

# Plot the samples
plt.scatter(xs, noisy_sample_ys)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

### Gradient descent using torch.
First create a torch version of these objects. 

We specify a float32 dtype for our objects.

In [None]:
torch_noisy_sample_ys = torch.from_numpy(noisy_sample_ys).float()
torch_xs = torch.from_numpy(xs).float()
torch_lsp = torch.from_numpy(lsp).float()

Let us try to fit a linear model by hand, instead of simply relying on scikit-learn!

The model of a linear regression is: $f_\theta (x) = \theta_1 x + \theta_0$

Careful! We do not want to minimize the function $f_\theta$ with respect to x.

We want to minimize the errors we make, measured by the loss function. We do this by adjusting the parameters $\theta$ of the function, starting from an arbitrary value of (1, 1). This loss function is the mean of the squared errors over all N points:

$$ \mathcal{L} (\theta) = \frac{1}{N}\sum_i (y_i - f_{\theta} (x_i))^2
= \frac{1}{N}\sum_i (y_i - (\theta_1 x_i + \theta_0))^2 $$

and we look for $\min_{\theta}\mathcal{L} (\theta)$.

In [None]:
def f_theta(x, theta):
    return theta[1] * x + theta[0]


def loss_function(theta):
    return torch.mean((torch_noisy_sample_ys - f_theta(torch_xs, theta)) ** 2)


initial_theta = torch.tensor((1., 1.), requires_grad=True)
initial_loss = loss_function(initial_theta)
print(initial_loss)

In [None]:
all_results = list()
n_iter = 1000
theta = copy.deepcopy(initial_theta)
opt = torch.optim.SGD([theta], lr=0.01, momentum=0.0)

# Implement optimization here, as we did above
# ...

print(theta.data)
plt.plot(all_results)
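
The optimization loop could look like the sketch below. It regenerates the data with the same recipe and seed as above so that it runs on its own, and it compares the result with a closed-form least-squares fit from np.polyfit, which we add purely as a sanity check. The names `torch_ys`, `losses`, `slope` and `intercept` are ours.

```python
import numpy as np
import torch

# Regenerate the noisy samples with the same recipe as the data cell above
np.random.seed(42)
xs = np.random.uniform(-1, 3, 80)[:, None]
ys = 1.3 * xs ** 3 - 3 * xs ** 2 + 3.6 * xs + 6.9 + np.random.normal(size=(80, 1))
torch_xs = torch.from_numpy(xs).float()
torch_ys = torch.from_numpy(ys).float()

theta = torch.tensor((1., 1.), requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.01, momentum=0.0)
losses = []

for _ in range(1000):
    prediction = theta[1] * torch_xs + theta[0]
    loss = torch.mean((torch_ys - prediction) ** 2)   # mean squared error
    loss.backward()
    opt.step()
    opt.zero_grad()
    losses.append(loss.item())

# Closed-form least-squares line, used only as a reference
slope, intercept = np.polyfit(xs.ravel(), ys.ravel(), deg=1)
print(theta.data, (intercept, slope))
```

The two pairs of numbers should be close: for this convex problem, gradient descent heads toward the same line that least squares finds directly.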

We now have values for the parameters.
Let us see what the fitted model looks like.

Evaluate the learned function on the linspace to plot your model.

In [None]:
predicted_ys = f_theta(torch_lsp, theta).detach().numpy()

plt.plot(lsp, true_ys, linestyle='dashed')
plt.plot(lsp, predicted_ys)

# Plot the samples
plt.scatter(xs, noisy_sample_ys)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

## Deep Learning with PyTorch

We start by training a small MLP using built-in functionalities in scikit-learn:


In [None]:
from sklearn.neural_network import MLPRegressor

mlp_model = MLPRegressor(max_iter=5000)
mlp_model.fit(xs, noisy_sample_ys.flatten())
predicted_lsp = mlp_model.predict(lsp)
plt.scatter(xs, noisy_sample_ys)
plt.plot(lsp, predicted_lsp, color='orange', lw=2)
plt.show()

MLPRegressor works well for this simple data, but it lacks the more advanced deep learning modeling that PyTorch offers.
Let us start by achieving a result similar to MLPRegressor, but with a model we define ourselves in PyTorch.

By default, MLPRegressor builds the following computational graph:
- the input is multiplied by a matrix with 100 parameters, and a bias parameter is added to each resulting value, giving 100 outputs per sample (shape = (n_samples, 100))
- ReLU is applied to each of these outputs (shape = (n_samples, 100)). The ReLU function is implemented in PyTorch as torch.nn.functional.relu(x)
- the result is multiplied by a second matrix (again 100 parameters) and shifted by an offset, producing a scalar output per sample (shape = (n_samples, 1))

A quick reminder on matrix multiplication: it is an operation that combines a matrix A of shape (m,n) and a matrix B of shape (n,p) into a matrix C of shape (m,p). In PyTorch, this computation is torch.matmul(A,B); the NumPy equivalent is np.matmul(A,B).

For the two big multiplications, we will use one torch tensor of 100 parameters each, with the appropriate shape. Create random starting tensors for these parameters.

Then implement the computation described above to produce the output from the input. You should debug the operations by checking that the shapes are correct.

In [None]:
# First create the parameters with small random initial values.
# We need to mention that we want to compute a gradient
# I provide you with the example for the first one, fill the others :
w1 = torch.normal(mean=0., std=0.1, size=(1, 100), requires_grad=True)
b1 = 0
w2 = 0
b2 = 0

In [None]:
# Then define the function
def f(x, weight1=w1, bias1=b1, weight2=w2, bias2=b2):
    out = torch.tensor(0)  # Replace this with the actual output
    return out


# Check that when doing inference on the data, we get an output tensor of shape (80,1) that corresponds to 80 predictions.
f(torch_xs).shape
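
One way to fill in the parameters and the forward function is sketched below. The shapes follow the description above. The zero initialization of the biases, the function name `mlp` and the placeholder input `x` are our choices, not something the notebook prescribes.

```python
import torch

torch.manual_seed(0)

# First layer: (1, 100) weights and 100 biases
w1 = torch.normal(mean=0., std=0.1, size=(1, 100), requires_grad=True)
b1 = torch.zeros(100, requires_grad=True)
# Second layer: (100, 1) weights and a single offset
w2 = torch.normal(mean=0., std=0.1, size=(100, 1), requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)


def mlp(x, weight1=w1, bias1=b1, weight2=w2, bias2=b2):
    hidden = torch.matmul(x, weight1) + bias1        # (n_samples, 100)
    hidden = torch.nn.functional.relu(hidden)        # (n_samples, 100)
    return torch.matmul(hidden, weight2) + bias2     # (n_samples, 1)


# Placeholder input with the same shape as torch_xs
x = torch.linspace(-1, 3, 80)[:, None]
out = mlp(x)
print(out.shape)
```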

Now we will mostly reuse the optimization procedure above to train our network with PyTorch.

In [None]:
# Now like last time, let us define an optimizer and give the parameters to it.
n_iter = 2000
opt = torch.optim.Adam([w1, b1, w2, b2], lr=0.01)

In [None]:
# Loop over the data, make the forward, backward, step and zero grad
for i in range(n_iter):
    
    # Replace with actual computations
    # ...
    loss = torch.tensor(0.)  # placeholder so that loss.item() works; replace with the actual loss
    if i % 10 == 0:
        print(i, loss.item())
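
A self-contained sketch of the full training loop with the mean squared error as loss. It rebuilds the data and the parameters so that it runs on its own. The bias initialization and the names `mlp`, `torch_ys` and `losses` are our assumptions.

```python
import numpy as np
import torch

torch.manual_seed(0)
np.random.seed(42)

# Same data recipe as earlier in the notebook
xs = np.random.uniform(-1, 3, 80)[:, None]
ys = 1.3 * xs ** 3 - 3 * xs ** 2 + 3.6 * xs + 6.9 + np.random.normal(size=(80, 1))
torch_xs = torch.from_numpy(xs).float()
torch_ys = torch.from_numpy(ys).float()

# One hidden layer of 100 units; the scale 0.1 follows the w1 example
w1 = torch.normal(mean=0., std=0.1, size=(1, 100), requires_grad=True)
b1 = torch.zeros(100, requires_grad=True)
w2 = torch.normal(mean=0., std=0.1, size=(100, 1), requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)


def mlp(x):
    hidden = torch.nn.functional.relu(torch.matmul(x, w1) + b1)
    return torch.matmul(hidden, w2) + b2


opt = torch.optim.Adam([w1, b1, w2, b2], lr=0.01)
losses = []

for i in range(2000):
    loss = torch.mean((mlp(torch_xs) - torch_ys) ** 2)   # forward pass
    loss.backward()                                       # backward pass
    opt.step()                                            # parameter update
    opt.zero_grad()                                       # reset gradients
    losses.append(loss.item())
    if i % 500 == 0:
        print(i, loss.item())
```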

In [None]:
predicted_ys = f(torch_lsp).detach().numpy()

plt.plot(lsp, true_ys, linestyle='dashed')
plt.plot(lsp, predicted_ys)

# Plot the samples
plt.scatter(xs, noisy_sample_ys)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Congratulations, you have coded an MLP model yourself! We have used the computation graph framework.


Now let us make our code prettier (more idiomatic PyTorch) and more efficient.
First, let us refactor the model the way it should be written, using the torch.nn.Module class.
You should add almost no new code, just reorganize the code above into a class.

In [None]:
from torch.nn import Module, Parameter


class MyOwnMLP(Module):
    def __init__(self):
        super(MyOwnMLP, self).__init__()
        self.w1 = Parameter(torch.normal(mean=0., std=0.1, size=(1, 100)))
        # ...

    def forward(self, x):
        # Define the forward pass (computation graph) like above
        return out


model = MyOwnMLP()
out = model(torch_xs)
out.shape
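
A possible version of the class is sketched below under a different name, `SketchMLP`, to make clear that it is only one way to do it. The `hidden_size` argument and the zero-initialized biases are our additions.

```python
import torch
from torch.nn import Module, Parameter


class SketchMLP(Module):
    def __init__(self, hidden_size=100):
        super().__init__()
        # Wrapping tensors in Parameter registers them with the module,
        # so model.parameters() returns them and gradients are tracked
        self.w1 = Parameter(torch.normal(mean=0., std=0.1, size=(1, hidden_size)))
        self.b1 = Parameter(torch.zeros(hidden_size))
        self.w2 = Parameter(torch.normal(mean=0., std=0.1, size=(hidden_size, 1)))
        self.b2 = Parameter(torch.zeros(1))

    def forward(self, x):
        hidden = torch.nn.functional.relu(torch.matmul(x, self.w1) + self.b1)
        return torch.matmul(hidden, self.w2) + self.b2


model = SketchMLP()
out = model(torch.linspace(-1, 3, 80)[:, None])
n_params = sum(p.numel() for p in model.parameters())
print(out.shape, n_params)
```

With this layout the optimizer can be created from model.parameters() instead of a hand-written list of tensors.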

Now we can also make the data iteration process look like PyTorch code!

We need to define a Dataset object. Once we have it, we can use it to create a DataLoader object.

In [None]:
from torch.utils.data import Dataset, DataLoader


class CustomDataset(Dataset):
    def __init__(self, data_x, data_y):
        pass
        # store the data

    def __len__(self):
        pass
        # Return the number of points in your dataset

    def __getitem__(self, index):
        # Get the x and y at a given position (index) in the data
        x = 0
        y = 0
        return x, y

In [None]:
# Loop and wait for each data point in PyTorch
dataset = CustomDataset(data_x=torch_xs, data_y=torch_noisy_sample_ys)
dataloader = DataLoader(dataset=dataset, batch_size=10, num_workers=0)
start = time.time()
for point in dataloader:
    pass
print('Done in pytorch : ', time.time() - start)
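
A minimal sketch of such a dataset, named `PairDataset` here to distinguish it from the exercise class. The tensors `data_x` and `data_y` are placeholders with the same shapes as torch_xs and torch_noisy_sample_ys.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PairDataset(Dataset):
    def __init__(self, data_x, data_y):
        # Store the data; both tensors must have the same number of rows
        assert len(data_x) == len(data_y)
        self.data_x = data_x
        self.data_y = data_y

    def __len__(self):
        return len(self.data_x)

    def __getitem__(self, index):
        return self.data_x[index], self.data_y[index]


# Placeholder data: 80 points of shape (1,), as in the notebook
data_x = torch.arange(80, dtype=torch.float32)[:, None]
data_y = 2 * data_x
dataset = PairDataset(data_x, data_y)
dataloader = DataLoader(dataset, batch_size=10, num_workers=0)

# The DataLoader stacks individual items into batches
batch_shapes = [(bx.shape, by.shape) for bx, by in dataloader]
print(len(dataset), len(batch_shapes), batch_shapes[0])
```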

The last thing missing to make our pipeline truly PyTorch is to use a GPU.

In PyTorch it is really easy: you just need to 'move' your tensors to a 'device'.
You can test whether a GPU is available and create the appropriate device with the following lines:

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # 'gpu' is not a valid PyTorch device name
torch_xs = torch_xs.to(device)

Now we finally have all the elements to build a complete PyTorch pipeline!

- Create a model and move it to the device.
- Create an optimizer with your model's parameters.
- Wrap your data in a dataloader.

Then use two nested for loops: one over 100 epochs and, inside each epoch, one over the dataloader.
Inside the inner loop, for every batch, first move the data to the device.
Then follow the same steps as above:
- model(batch)
- loss computation and backward
- gradient step and zero_grad

In [None]:
n_epochs = 100

for epoch in range(n_epochs):
    for batch_x, batch_y in dataloader:
        # Don't forget to send to device, the rest is similar to what we had above
        pass

# To easily use the trained model we need to send it back to cpu at the end
model = model.to('cpu')
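
The whole pipeline could be sketched as follows. To stay self-contained, this version swaps in two standard PyTorch building blocks for the hand-written ones: torch.utils.data.TensorDataset replaces the custom dataset, and a torch.nn.Sequential of Linear and ReLU layers replaces the hand-written module. The learning rate, the shuffling and the name `epoch_losses` are our choices.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
np.random.seed(42)

# Same data recipe as earlier in the notebook
xs = np.random.uniform(-1, 3, 80)[:, None]
ys = 1.3 * xs ** 3 - 3 * xs ** 2 + 3.6 * xs + 6.9 + np.random.normal(size=(80, 1))
dataset = TensorDataset(torch.from_numpy(xs).float(), torch.from_numpy(ys).float())
dataloader = DataLoader(dataset, batch_size=10, shuffle=True)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Stand-in model: Linear(1, 100) -> ReLU -> Linear(100, 1)
model = torch.nn.Sequential(
    torch.nn.Linear(1, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1)
).to(device)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
epoch_losses = []

for epoch in range(100):
    total = 0.0
    for batch_x, batch_y in dataloader:
        # Move the batch to the same device as the model
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        loss = loss_fn(model(batch_x), batch_y)
        loss.backward()
        opt.step()
        opt.zero_grad()
        total += loss.item() * len(batch_x)
    epoch_losses.append(total / len(dataset))   # average loss over the epoch

model = model.to('cpu')
print(epoch_losses[0], epoch_losses[-1])
```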

Finally, we can plot the trained model.

In [None]:
predicted_ys = model(torch_lsp).detach().numpy()

plt.plot(lsp, true_ys, linestyle='dashed')
plt.plot(lsp, predicted_ys)

# Plot the samples
plt.scatter(xs, noisy_sample_ys)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

This is the end of the practical part on training neural networks!

Of course, a lot more can be done. On this simple toy data, you can try to illustrate concepts from this class:
- What happens if you use only 10 data points and increase the noise level?
- Can you observe overfitting?
- Can you see the impact of using different optimizers (SGD vs Adam)?
- ...

Another interesting extension is to use a more advanced (yet manageable) dataset, such as Fashion-MNIST.
You can use it through the built-in torchvision object: _torchvision.datasets.FashionMNIST_ .
You can install torchvision with _pip install torchvision_ .
More generally, you can follow this tutorial to access the data and get a first model and training example: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html
- Can you compare MLP architectures with CNNs on this task?
- Do you see overfitting on this dataset?
- Does data augmentation help training on this dataset?