<h1 style="color:#0084bb">Loading Data</h1>

This notebook contains the third assignment for the exercises in Deep Learning and Neural Nets 1.
It provides a skeleton, i.e. code with gaps, that will be filled out by you in different exercises.
All exercise descriptions are visually annotated by a vertical bar on the left and some extra indentation,
unless you already messed with your jupyter notebook configuration.
Any questions that are not part of the exercise statement do not need to be answered,
but should rather be interpreted as triggers to guide your thought process.

**Note**: The cells in the introductory part (before the first subtitle)
perform all necessary imports and provide utility functions that should work without problems.
Please, do not alter this code or add extra import statements in your submission, unless it is explicitly requested!

<span style="color:#d95c4c">**IMPORTANT:**</span> Please, change the name of your submission file so that it contains your student ID!

In this assignment, the goal is to use the MLP from your framework to solve a **practical learning** problem. In order to have an MLP learn something, you will need to collect data and feed it to your network in a proper way.

In [None]:
import os

import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

from nnumpy.data import CachedDownload
from nnumpy.utils import to_one_hot

<h2 style="color:#0084bb">Datasets</h2>

In order to solve problems by means of machine learning, you will need data. There are plenty of freely available datasets to be found online, e.g. the [UCI machine learning repository](https://archive.ics.uci.edu/ml/index.php). Another good source of datasets is public challenges, e.g. as organised on [Kaggle](https://www.kaggle.com/datasets). To search the entire web for a dataset on a specific topic, one could use [Google Dataset Search](https://toolbox.google.com/datasetsearch). Of course it is also possible to collect the data yourself, but unless all the information can be scraped from the web this is often a tedious and time-consuming task. Finally, in some cases it can be feasible to generate data on-the-fly, e.g. the datasets available in the [tensorflow playground](https://playground.tensorflow.org).

Despite the wide variety of possibilities, there are a few standard datasets that are often used to benchmark machine learning models. Some famous datasets are:

 1. [Iris](http://archive.ics.uci.edu/ml/datasets/Iris): a dataset with 150 samples introduced by Ronald Fisher (yes, the one from *Fisher information*) in 1936. The input features are the lengths and widths of the sepals and petals of an iris flower and the target value is one of three species of that flower. It mainly serves as a toy example for statistical and ML methods.
 2. [MNIST](http://yann.lecun.com/exdb/mnist): a dataset with 60&nbsp;000 + 10&nbsp;000 samples introduced by Yann LeCun (yes, the one from *LeNet*) in 1998. The input features are images of handwritten digits and the target value is the digit on that image. It is often used as a toy example or for testing out new ideas in research. A lot of variations on this dataset have been introduced by now, e.g. MNIST variations (different backgrounds), Fashion MNIST (clothing), ...
 3. [CIFAR10(0)](https://www.cs.toronto.edu/~kriz/cifar.html): a dataset with 50&nbsp;000 + 10&nbsp;000 samples (a labelled subset of the 80&nbsp;000&nbsp;000 tiny images dataset) introduced by Alex Krizhevsky (yes, the one from *AlexNet*) in 2009. The input features are natural images of 10(0) different "things" and the target values are the "things". This dataset is commonly used to develop new ideas and can be considered a good warm-up for ImageNet (see below).
 4. [ImageNet](http://image-net.org): a dataset of almost 15&nbsp;000&nbsp;000 samples introduced by Li Fei-Fei and colleagues in 2009. The input features are natural images of almost anything. There are different possible target values, depending on which task you want to solve. The images have been categorised and therefore have classification targets, but bounding boxes or properties of the objects in the images are also possible targets. It is typically used to compare state-of-the-art image models.
 
This is really just a very small selection of the datasets that are commonly used in education/research. Nevertheless, these datasets (except for the first one) appear practically everywhere in (non-recurrent) deep learning research.

<h3 style="color:#0084bb">Exercise 1: Processing the Data (2 Points)</h3>

Raw data generally does not come in a form that is immediately ready to use. The first step is to get the data in a form that your framework can deal with. In your case: numpy arrays.

 > Finish the implementation of the `iris_data` function so that it returns numpy arrays that you can feed to a network in your framework.
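
As a rough illustration of the kind of conversion involved (not the reference solution), the sketch below parses a few hand-written lines in the comma-separated format of `iris.data` into a float feature matrix and a one-hot label matrix. The helper name `parse_csv_bytes` and the inline sample are made up for this example, and the one-hot encoding is done with plain numpy instead of `to_one_hot`, whose exact signature is not shown in this notebook.

```python
import numpy as np

# Hand-written lines in the format of iris.data (illustrative sample only).
# The trailing empty line mimics files that end with a blank line.
sample = (
    b"5.1,3.5,1.4,0.2,Iris-setosa\n"
    b"7.0,3.2,4.7,1.4,Iris-versicolor\n"
    b"6.3,3.3,6.0,2.5,Iris-virginica\n"
    b"\n"
)


def parse_csv_bytes(raw):
    """Hypothetical helper: turn CSV bytes into (features, one-hot labels, class names)."""
    # decode the bytes and skip empty lines
    rows = [line.split(",") for line in raw.decode("utf-8").splitlines() if line.strip()]
    # all columns but the last are numeric features
    features = np.array([row[:-1] for row in rows], dtype=float)
    # the last column holds the class name
    names = np.array([row[-1] for row in rows])
    # map each class name to an integer index (classes are sorted alphabetically)
    classes, indices = np.unique(names, return_inverse=True)
    # pick rows of the identity matrix to get one-hot vectors
    one_hot = np.eye(len(classes))[indices]
    return features, one_hot, classes


features, one_hot, classes = parse_csv_bytes(sample)
print(features.shape, one_hot.shape)
```

Whether one-hot targets or integer labels are more convenient depends on how the loss function in your framework expects its targets.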

In [None]:
def iris_data(path=None):
    """
    Get the data from the Iris dataset as numpy arrays.
    
    Parameters
    ----------
    path : str, optional
        Path to directory where the dataset will be stored.
        
    Returns
    -------
    x : (N, D) ndarray
        Matrix of input features.
    y : (N, K) ndarray
        Matrix of target labels, one row per sample.
    """
    base_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/"
    if path is None:
        path = os.path.join(os.getcwd(), "iris")
    
    with CachedDownload(base_url, "iris.data", path) as chunks:
        # store download as sequence of bytes
        raw_data = b''.join(chunks)
        
    raise NotImplementedError("TODO: implement iris_data function!")

In [None]:
x, y = iris_data()
print(x.shape, y.shape)

<h3 style="color:#0084bb">Exercise 2: Data Splitting (2 Points)</h3>

Theoretically, we could start learning with the numpy arrays from `iris_data`, but this ignores the fact that some data needs to be kept apart if we want to assess the quality of the network. It is of the utmost importance to make a train-test split before the data gets anywhere near the model! Time to fix that.

 > Write a `split_data` function that partitions the data into two splits deterministically, i.e. when calling the function twice on the same data, the splits should be the same.
 
**Hint:** remember the i.i.d. assumptions?
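
One possible reading of the hint: the samples in `iris.data` are ordered by species, so simply cutting the arrays at 80% would leave one class under-represented in the first split. A minimal sketch of a deterministic shuffled split is shown below. The function name `shuffled_split` and the `seed` parameter are assumptions made for this example, not part of the assignment interface.

```python
import numpy as np


def shuffled_split(x, y, ratio=0.8, seed=0):
    """Hypothetical sketch: shuffle with a fixed seed, then cut at `ratio`."""
    # a generator created from a fixed seed produces the same permutation on every call
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    cut = int(ratio * len(x))
    first, second = order[:cut], order[cut:]
    # index x and y with the same indices so that samples and targets stay aligned
    return (x[first], y[first]), (x[second], y[second])


x_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10)
(xa, ya), (xb, yb) = shuffled_split(x_toy, y_toy)
(xc, yc), _ = shuffled_split(x_toy, y_toy)
print(xa.shape, xb.shape, np.array_equal(xa, xc))
```

Using a local generator instead of the global `np.random.seed` keeps the split reproducible without affecting other random numbers, e.g. those used for weight initialisation.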

In [None]:
def split_data(x, y, ratio=.8):
    """
    Split a dataset in two parts with a given ratio.
    
    Parameters
    ----------
    x : ndarray
        Input features.
    y : ndarray
        Target values.
    ratio : float
        The fraction of samples in the first split.
    
    Returns
    -------
    (x1, y1) : tuple of ndarrays
        The first split of the dataset
    (x2, y2) : tuple of ndarrays
        The second split of the dataset
        
    Notes
    -----
    The order of the samples does not need to be maintained.
    """
    raise NotImplementedError("TODO: implement split_data function!")


def get_iris_data(path=None, test=False):
    """
    Get the correct split from the Iris dataset as numpy arrays.
    
    Parameters
    ----------
    path : str, optional
        Path to directory where the dataset will be stored.
    test : bool, optional
        Flag to return test set instead of training data.
        
    Returns
    -------
    x : (N, D) ndarray
        Matrix of input features.
    y : (N, ) ndarray
        Vector of target labels.
    """
    x, y = iris_data(path)
    _train, _test = split_data(x, y, ratio=.8)
    return _test if test else _train

In [None]:
x_train, y_train = get_iris_data()
assert np.all(y[np.all(x == x_train[0], axis=-1)] == y_train[0]), "inconsistent!"
print(x_train.shape, y_train.shape)
x_test, y_test = get_iris_data(test=True)
assert np.all(y[np.all(x == x_test[0], axis=-1)] == y_test[0]), "inconsistent!"
print(x_test.shape, y_test.shape)

<h3 style="color:#0084bb">Exercise 3: Mini-batches (1 Point)</h3>

With the iris data in memory, we could practically start training a neural network right away. Even datasets like MNIST or CIFAR are small enough to have them loaded in memory on modern hardware. However, when using all the data at once in a deep network, you might nevertheless run into memory issues, since the entire forward pass has to be stored to do the backpropagation. To counter this problem, the data can be fed to the network in manageable pieces, called *mini-batches* or *batches* for short.

 > Change the `Dataloader` class below so that it yields mini-batches of the specified size in its `__iter__` method.
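
To make the batching arithmetic concrete, here is a small sketch that cuts an array of 10 samples into batches of 4. The helper `batch_slices` is a made-up name for this illustration. It only shows the slicing and is not meant as the implementation of the class.

```python
import numpy as np


def batch_slices(n_samples, batch_size):
    """Hypothetical helper: slices that cover range(n_samples) batch by batch."""
    return [slice(start, min(start + batch_size, n_samples))
            for start in range(0, n_samples, batch_size)]


data = np.arange(10)
batches = [data[s] for s in batch_slices(len(data), 4)]
# the last batch is smaller because 4 does not divide 10
print([len(b) for b in batches])
```

The number of batches is the ceiling of `n_samples / batch_size`, which can be written with integer arithmetic as `1 + (n_samples - 1) // batch_size`.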

<h5 style="color:#0084bb">Some Notes on python generators</h5>

In python, a [generator](https://wiki.python.org/moin/Generators) is a function with some state that can return multiple values. You have probably already used generator-like objects without realising it. A well-known example is `range`, which is technically a lazy sequence and not a generator, but which behaves much like the following generator (for positive steps):
```python
def _range(start, stop, step=1):
    i = start
    while i < stop:
        yield i
        i += step
```

Notice the `yield` keyword. This has a similar effect to `return` in that it provides a value to the outer scope of the function. However, it does not cause the function to be exited. Instead, the current state in the function is stored until the next value is requested. To get the values yielded by a generator, there are essentially two options:
 1. Using the `next` function. This will simply run the function until the next `yield` statement and give back the yielded value.
 2. By iterating over the generator in any way. This will repeatedly call `next` on the generator until the function exits.
 
For more information, please refer to the internet.
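
The two options can be tried out on a toy generator. The function `countdown` below is made up for this example.

```python
def countdown(start):
    """Toy generator: yields start, start - 1, ..., 1."""
    while start > 0:
        yield start
        start -= 1


gen = countdown(3)
first = next(gen)      # runs the body up to the first yield
rest = list(gen)       # iteration resumes where next() stopped
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True   # an exhausted generator raises StopIteration
print(first, rest, exhausted)
```

Note that calling a generator function does not run any of its code. The body only starts executing when the first value is requested.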

In [None]:
class Dataloader:
    
    def __init__(self, x, y, batch_size=None):
        self.x = x
        self.y = y
        self.batch_size = len(x) if batch_size is None else int(batch_size)
    
    def __iter__(self):
        """
        Iterates over the samples of the data.
        
        Yields
        ------
        x : ndarray
            input features for the batch
        y : ndarray
            target values for the batch
            
        Notes
        -----
        Each batch should contain the specified number of samples,
        except for the last batch if the batch_size does 
        not divide the number of samples in the data.
        """
        raise NotImplementedError("TODO: implement Dataloader.__iter__ function!")
            
    def __len__(self):
        return 1 + (len(self.x) - 1) // self.batch_size

In [None]:
data_loader = Dataloader(x_train, y_train, batch_size=64)
print(len(data_loader))
for x_batch, y_batch in data_loader:
    print(x_batch.shape, y_batch.shape)

<h2 style="color:#0084bb">Learning</h2>

As you should know by now, deep learning is in essence little more than gradient descent on neural networks. In the previous assignment, you built all the essential tools to implement a fully connected neural network. Now is a good time to consider how to use it for learning from the data we loaded.

<h3 style="color:#0084bb">Exercise 4: Updating and Evaluating (3 Points)</h3>

When training a neural network there are essentially two things you want to do. On the one hand, you try to improve your network by updating its weights. On the other hand, you want to monitor the learning process by evaluating the network on another part of the data or with a function other than the loss.

 > Implement the `evaluate` and `update` functions below to cover these two scenarios during learning.
 
**Hint:** you can use the functions `Module.train` and `Module.eval` to put modules in the corresponding modes. This enables some (minor) optimisations in the evaluation mode if you use the module as a function.
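
As an example of a function other than the loss that could be passed as `metric`, the sketch below computes the classification accuracy from logits. It assumes that the labels are one-hot encoded. The name `accuracy` and the toy arrays are made up for this illustration.

```python
import numpy as np


def accuracy(logits, labels):
    """Hypothetical metric: fraction of samples whose largest logit matches the label.

    Assumes `labels` is a one-hot encoded (N, K) array.
    """
    predicted = np.argmax(logits, axis=-1)
    expected = np.argmax(labels, axis=-1)
    return np.mean(predicted == expected)


logits = np.array([[2.0, 0.1, -1.0],
                   [0.3, 0.2, 0.9],
                   [0.0, 1.5, 0.4],
                   [1.2, 0.7, 0.1]])
labels = np.eye(3)[[0, 2, 1, 1]]
# three of the four predictions are correct
print(accuracy(logits, labels))
```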

In [None]:
def evaluate(network, metric, data_loader):
    """
    Evaluate a network by computing a metric for specific data.
    
    Parameters
    ----------
    network : Module
        A module that implements the network.
    metric : callable
        A function that takes logits and labels 
        and returns a scalar numpy array.
    data_loader : Dataloader
        The data loader that provides the batches.
        
    Returns
    -------
    values : ndarray
        The computed values for each batch in the data loader.
    """
    raise NotImplementedError("TODO: implement evaluate function!")


def update(network, loss, data_loader, lr=1e-3):
    """
    Update a network by optimising the loss for the given data.
    
    Parameters
    ----------
    network : Module
        A module that implements the network.
    loss : Module
        Loss function module.
    data_loader : Dataloader
        The data loader that provides the batches.
    lr : float, optional
        Learning rate for the update.
        
    Returns
    -------
    errors : ndarray
        The computed loss for each batch in the data loader.
    """
    raise NotImplementedError("TODO: implement update function!")

<h3 style="color:#0084bb">Exercise 5: Gradient Descent again (2 Points)</h3>

Remember gradient descent from the first assignment? The time has come to rewrite that function to incorporate the module system and the data loader. In order to get an idea of the generalisation performance, you will also want to split the incoming data in training and validation sets.

 > Implement the `gradient_descent` function below, using `update` and `evaluate`. Compute the train and validation error once before you start updating the model to also get the performance of the initial (random) model.
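
The bookkeeping of the returned error arrays can be sketched independently of the module system. The example below runs mini-batch gradient descent on a toy linear regression problem in plain numpy. It does not use `update`, `evaluate` or any `nnumpy` module, and all names in it are made up for this illustration. It only shows how the errors before the first update and the errors per epoch can be collected in arrays of the documented shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 3))
y = x @ np.array([1.0, -2.0, 0.5])  # noise-free linear targets

# Assumption: the first 75% of the samples are used for updating, the rest for validation.
cut = int(0.75 * len(x))
x_tr, y_tr, x_va, y_va = x[:cut], y[:cut], x[cut:], y[cut:]


def mse(w, xb, yb):
    """Mean squared error of the linear model w on one batch."""
    return np.mean((xb @ w - yb) ** 2)


def mse_grad(w, xb, yb):
    """Gradient of the mean squared error with respect to w."""
    return 2 * xb.T @ (xb @ w - yb) / len(xb)


w = np.zeros(3)
epochs, batch_size, lr = 20, 10, 0.1
starts = range(0, len(x_tr), batch_size)

# Row 0 holds the errors of the initial model, before any update.
train_errors = [[mse(w, x_tr[s:s + batch_size], y_tr[s:s + batch_size]) for s in starts]]
valid_errors = [[mse(w, x_va, y_va)]]
for _ in range(epochs):
    epoch_errors = []
    for s in starts:
        xb, yb = x_tr[s:s + batch_size], y_tr[s:s + batch_size]
        epoch_errors.append(mse(w, xb, yb))  # loss before this update
        w -= lr * mse_grad(w, xb, yb)        # gradient descent step
    train_errors.append(epoch_errors)
    valid_errors.append([mse(w, x_va, y_va)])

train_errors, valid_errors = np.array(train_errors), np.array(valid_errors)
print(train_errors.shape, valid_errors.shape)
```

With 30 training samples and a batch size of 10 there are 3 batches per epoch, so the arrays have shapes `(epochs + 1, 3)` and `(epochs + 1, 1)`.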

In [None]:
def gradient_descent(network, loss, data, epochs=1, batch_size=None, val_split=0.75, lr=1e-3):
    """
    Train a neural network with gradient descent.
    
    Parameters
    ----------
    network : Module
        A module that implements the network.
    loss : Module
        Loss function module.
    data : tuple of ndarrays
        Dataset as tuple of input features and target values.
    epochs : int, optional
        Number of times to iterate over the dataset.
    batch_size : int or None, optional
        Number of samples to use simultaneously.
        If None, all samples are fed to the network.
    val_split : float, optional
        Fraction of the data to use for updating the model.
        The other part of the data is used for evaluating the model.
    lr : float, optional
        Step size for the gradient descent.
        
    Returns
    -------
    train_errors : (epochs + 1, n_batches) ndarray
        Training error for each epoch and each batch.
    valid_errors : (epochs + 1, 1) ndarray
        Validation error for each epoch.
    """
    raise NotImplementedError("TODO: implement gradient_descent function!")

<h3 style="color:#0084bb">Exercise 6: Putting everything together (2 Points)</h3>

With all the tools in place, it is time to build a neural network and train a classifier on the (already loaded) dataset. Since `gradient_descent` outputs all errors (per batch and per epoch), there are multiple options to plot the learning curves: the loss after every update, the mean loss for every epoch or even something more exotic.

> Construct an MLP (using your own modules), set up a loss function module and train the network using gradient descent. You are free to choose the hyperparameters (architecture, learning rate, number of epochs, batch size, ...), but the learning curves should show that the model learns properly, i.e. the errors should go down.

**Note:** Your code will be evaluated with modules that we implemented, so please do not use self-invented modules for your final submission.

<h5 style="color:#0084bb">Some Notes on python packaging</h5>

In order to use your own `nnumpy` modules, you will have to copy the code from your notebook into one or more python files. The easiest way to make things work for this exercise is probably to add every module you need to the `__init__.py` file in the `nnumpy` directory. Assuming that the `nnumpy` directory is in the same directory as this notebook, you can then use `from nnumpy import MyModule`. 

Students who like to keep their code organised could create new python files in the `nnumpy` directory and copy the modules to these files. Importing your code can then be done with something like `from nnumpy.filename import MyModule`. If a line of the form `from .filename import MyModule` has been added to `nnumpy/__init__.py`, it is again possible to use `from nnumpy import MyModule`.

Since you are building your own deep learning framework, you can also make it a proper package with a proper name. The easiest way to create a package from what you have is to rename the `nnumpy` directory to `myframework`. Then you can organise the code in different python files in this directory. Finally, you can create a `setup.py` file to install the package on your system with `pip`. For more information on python packaging, refer to [the python packaging guide](https://packaging.python.org/).

In [None]:
# import the modules you have written in the previous assignment here!
# e.g. from nnumpy import Sequential, Linear, Tanh, LogitCrossEntropy

In [None]:
train_errors = np.array([])
valid_errors = np.array([])
raise NotImplementedError("TODO: create a model and train it!")

In [None]:
# plot learning curves
title = "learning curves (train: {:.2f}, valid: {:.2f})"
plt.title(title.format(train_errors.flat[-1], valid_errors.flat[-1]))
n_batches = train_errors.shape[1]
per_epoch = np.arange(0, train_errors.size, n_batches) + (n_batches - 1) / 2
train_curve, = plt.plot(per_epoch, np.mean(train_errors, axis=1), label='train')
valid_curve, = plt.plot(per_epoch, valid_errors, label='valid')
update_curve, = plt.plot(train_errors.flat, label='per update', 
                         linestyle='--', color=train_curve.get_color(), alpha=0.3)
_ = plt.legend()

<h3 style="color:#0084bb">Exercise 7: Practice makes Perfect (3 Bonus Points)</h3>

Just to make sure that you understood everything: 

> Train a neural network on the UCI Abalone dataset to predict the age of sea snails based on physical measurements. Try out both classification and regression!
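
The main difference between the two settings lies in how the targets are prepared. The sketch below uses a few made-up rows and assumes a layout similar to the one described for the Abalone data, with a categorical first column and an integer ring count (a proxy for age) as last column. The rows, the reduced number of columns and all variable names are illustrative assumptions.

```python
import numpy as np

# Made-up rows: a categorical first column, two numeric features, an integer ring count.
rows = [("M", 0.45, 0.36, 15), ("F", 0.53, 0.42, 9), ("I", 0.33, 0.25, 7), ("F", 0.50, 0.40, 9)]

# Categorical inputs cannot be fed to a network directly, so they are one-hot encoded.
categories = sorted({row[0] for row in rows})
sex_one_hot = np.eye(len(categories))[[categories.index(row[0]) for row in rows]]
numeric = np.array([row[1:-1] for row in rows], dtype=float)
features = np.concatenate([sex_one_hot, numeric], axis=1)

rings = np.array([row[-1] for row in rows])
# Regression: one real-valued target per sample.
y_regression = rings.astype(float)[:, None]
# Classification: one class per distinct ring count, encoded as one-hot vectors.
ring_values, class_index = np.unique(rings, return_inverse=True)
y_classification = np.eye(len(ring_values))[class_index]
print(features.shape, y_regression.shape, y_classification.shape)
```

Regression keeps the ordering of the ring counts, whereas classification treats every count as an unrelated category, which is worth keeping in mind when comparing the results.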

In [None]:
def abalone_data(path=None):
    """
    Get the data from the Abalone dataset as numpy arrays.
    
    Parameters
    ----------
    path : str, optional
        Path to directory where the dataset will be stored.
        
    Returns
    -------
    x : (N, D) ndarray
        Matrix of input features.
    y : (N, ) ndarray
        Vector of target labels.
    """
    base_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/"
    if path is None:
        path = os.path.join(os.getcwd(), "abalone")
    
    with CachedDownload(base_url, "abalone.data", path) as chunks:
        # store download as sequence of bytes
        raw_data = b''.join(chunks)
        
    raise NotImplementedError("TODO: implement abalone_data function!")


def get_abalone_data(path=None, test=False):
    """
    Get the correct split from the Abalone dataset as numpy arrays.
    
    Parameters
    ----------
    path : str, optional
        Path to directory where the dataset will be stored.
    test : bool, optional
        Flag to return test set instead of training data.
        
    Returns
    -------
    x : (N, D) ndarray
        Matrix of input features.
    y : (N, ) ndarray
        Vector of target labels.
    """
    x, y = abalone_data(path)
    _train, _test = split_data(x, y, ratio=.8)
    return _test if test else _train

In [None]:
raise NotImplementedError("TODO: create and train models and plot the results!")