<img src="../figures/HeaDS_logo_large_withTitle.png" width="300">

# Introduction to pytorch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Center-for-Health-Data-Science/IntroToML/blob/HEAD/Day2/pytorch_intro.ipynb)

- presents key components of pytorch and python concepts
- retrain linear regression of previous exercise using pytorch
- add Feed Forward Neural Network

Put together by [Henry Webel](https://twitter.com/Henrywebel)

> Disclaimer: This notebook supports a presentation and does not replace it (so not everything is written out).

## Outline
- Torch seen as numpy: `ndarray` -> `Tensor`
    - Why: a `Tensor` itself carries the information needed for backpropagation
- Key concepts and associated classes:
    - brief recap of the python concepts needed: `tuple`, `[ ]` implementation, `len`, `Callable`
    - `Dataset` -> `__getitem__`, `__len__`
    - `DataLoader` -> loop over a `Dataset` in certain ways
    - Model formulation -> `Module` as a `Callable`
    - Optimizers -> `SGD`
    - loss functions

- plan:
    * Start by making a `Dataset` and one `DataLoader`.
    * Hint at `DataLoader`s for validation + training.
    * Define a simple linear regression
    * Do one step with `DataLoader` + model and see what comes out
    * Do it in a loop
    * Define a feed-forward network as an exercise


## Arrays

- Indexed sets of related elements.
- Often data of the same type, laid out contiguously in memory
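
The two points above can be inspected directly. The following is a minimal sketch using `numpy` (introduced in the next section); the array contents are made up for illustration.

```python
import numpy as np

# A 2 x 3 array of 64-bit integers: six values stored back to back in memory.
a = np.arange(6, dtype=np.int64).reshape(2, 3)

print(a.dtype)                  # one shared element type for all entries
print(a.itemsize)               # bytes per element: 8
print(a.flags["C_CONTIGUOUS"])  # True: rows are stored one after another
print(a.strides)                # (24, 8): bytes to step to the next row / next column
```

Moving one column to the right skips 8 bytes (one element), moving one row down skips 24 bytes (three elements), which is what "contiguous" means here.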


## Numpy
    
<img src="https://numpy.org/images/logos/numpy_logo.svg" width="200">
    
See notebook by [Jakob Nybo Nissen](https://twitter.com/nybojakob), continued by [Henry Webel](https://twitter.com/Henrywebel): [Arrays_numpy.ipynb](Arrays_numpy.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Center-for-Health-Data-Science/PythonTsunami/blob/fall2021/Numpy/Arrays_numpy.ipynb)
    
- The first half of the linked notebook contains a small introduction to `numpy`. You can also watch [the YouTube video](https://www.youtube.com/watch?v=8Mpc9ukltVA). If you feel comfortable using [`numpy`](https://numpy.org/), you can skip the introduction and go directly to the [exercises (click here)](#exercises). If you would like a recap, keep reading.
    
- Other informative overviews of `numpy` with lots of examples can be found [here](https://jalammar.github.io/visual-numpy/) and [here](https://betterprogramming.pub/numpy-illustrated-the-visual-guide-to-numpy-3b1d4976de1d).

[`numpy`](https://numpy.org/) is a Python package that provides a new type of object: The `ndarray`. This is an N-dimensional array, i.e. a "list" with any number of dimensions.
`numpy` is one of the most fundamental Python packages. Almost all of scientific Python uses [`numpy`](https://numpy.org/) either directly, or indirectly through another package.

[![Figure 2, Harris et al. 2020](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41586-020-2649-2/MediaObjects/41586_2020_2649_Fig2_HTML.png?as=webp)](https://www.nature.com/articles/s41586-020-2649-2/figures/2)

It may not be immediately clear why [`numpy`](https://numpy.org/)'s `ndarrays` are so useful that they are everywhere. Does ALL scientific software really need N-dimensional arrays? As you will learn in these exercises, even for 1-dimensional data that *could* be placed in lists, `ndarrays` are generally useful for their convenience and speed.

### Reference
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2


[![Figure 1, Harris et al. 2020](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41586-020-2649-2/MediaObjects/41586_2020_2649_Fig1_HTML.png?as=webp)](https://www.nature.com/articles/s41586-020-2649-2/figures/1)

In [None]:
import numpy as np
l_values = [4, 9, 1, 0, 8, 3, 2, 2, 6, 5, 0, 8]
# np.array builds an array from the values; np.ndarray(l_values) would read them as a shape
ndarray = np.array(l_values)
type(ndarray), ndarray  # what does this construct?

## pytorch

- implements arrays (tensors) with the option to keep track of the information needed for backpropagation
- numpy arrays and pytorch tensors can share memory
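
Memory sharing can be made visible with a small sketch. It assumes `torch.from_numpy` for the shared case and `torch.tensor` for the copying case; the values are made up.

```python
import numpy as np
import torch

array = np.array([4.0, 9.0, 1.0])
shared = torch.from_numpy(array)  # tensor that uses the same memory as the array
copied = torch.tensor(array)      # tensor with its own copy of the data

array[0] = -1.0                   # modify the numpy array in place

print(shared)  # the first entry changed as well
print(copied)  # still holds the original values
```

Whether sharing is desirable depends on the situation: it avoids copies, but a change on one side silently changes the other.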


More on history and tensor implementation: See [Deep Learning with PyTorch](https://github.com/deep-learning-with-pytorch/dlwpt-code)

In [None]:
import torch  # note: does not say pytorch
tensor = torch.tensor(l_values)
type(tensor), tensor

In [None]:
tensor.shape, tensor.reshape((-1, 2))
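
What "keeping track of information for backpropagation" means can be sketched with a single number. This is a minimal illustration, assuming a scalar tensor created with `requires_grad=True`; the value 3.0 is arbitrary.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)  # ask pytorch to track operations on w
y = w ** 2                                 # y = w^2

print(y.grad_fn)  # the recorded operation that produced y
y.backward()      # backpropagation: computes dy/dw = 2 * w
print(w.grad)     # tensor(6.)
```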

## `Dataset`


A collection of data. See the [docs](https://pytorch.org/docs/stable/data.html?highlight=dataset#torch.utils.data.Dataset)

- normally should have integer keys ("integral keys")
- needs `__getitem__` and `__len__` to work together with the other core classes.

### Python recap

Among other things, a collection (e.g. a `Dataset`) is emulated using two special methods:
- if an object can be accessed using square brackets `obj[...]`, it has a `__getitem__` method
- if an object's length can be queried with `len(obj)`, it has a `__len__` method

Check out [Fluent Python](https://www.oreilly.com/library/view/fluent-python-2nd/9781492056348/) by Luciano Ramalho to learn more about this.
Let's consider [one of the first examples](https://github.com/fluentpython/example-code-2e/blob/master/01-data-model/frenchdeck.py) from the book:

In [None]:
import collections

Card = collections.namedtuple('Card', ['rank', 'suit'])


class FrenchDeck:
    ranks = [str(n) for n in range(2, 11)] + list('JQKA')
    suits = 'spades diamonds clubs hearts'.split()

    def __init__(self):
        self._cards = [Card(rank, suit) for suit in self.suits
                       for rank in self.ranks]

    def __len__(self):
        return len(self._cards)

    def __getitem__(self, position):
        return self._cards[position]

This is enough to implement a sliceable collection:

In [None]:
french_deck = FrenchDeck()
french_deck[:3]

In [None]:
french_deck[-1], len(french_deck)

### Example data

In [None]:
import pandas as pd

DATASET_URL = 'https://covid.ourworldindata.org/data/owid-covid-data.csv'
DATASET_FNAME = "covid_data_denmark.csv"

try:
    df = pd.read_csv(DATASET_FNAME, parse_dates=['date'], index_col="date")
    print("loaded data from disk")
except FileNotFoundError:
    print("load data from internet")
    df = pd.read_csv(DATASET_URL, parse_dates=['date'], index_col="date")
    df = df.query("location in ['Denmark']")
    df = df.loc["2020-03-14": "2020-07-31"]
    mask = df[["new_cases", "new_deaths"]].notna().all(axis=1)
    df = df.loc[mask]
    df.to_csv(DATASET_FNAME)
df

In [None]:
df.plot(kind='scatter', x="new_cases", y="new_deaths", figsize=(15, 10))

### Pytorch `Dataset` for covid data


In [None]:
from torch.utils.data import Dataset
Dataset?

In [None]:
import torch
from torch.utils.data import Dataset


class CovidDenmarkData(Dataset):
    """Preliminary class."""

    def __init__(self, df: pd.DataFrame,
                 x=['new_cases'], y=['new_deaths']):
        self._df = df[x+y]
        self.x = x
        self.y = y

    def __getitem__(self, idx):
        row = self._df.iloc[idx]
        x = row[self.x]
        y = row[self.y]
        return torch.as_tensor(x), torch.as_tensor(y).squeeze()

    def __len__(self):
        return len(self._df)


dataset = CovidDenmarkData(df)

In [None]:
dataset[120]

In [None]:
# dataset[:10]

### Exercise

- Easy: Add more features to x
- Advanced (?): Try to implement the Dataset with numpy.ndarray to share data

In [None]:
class MyCovidDenmarkData(Dataset):
    """which definitely shares data."""
    

## `DataLoader`

- from the [docs](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader):

> Combines a dataset and a sampler, and provides an iterable over the given dataset.

- assembles data into mini-batches; during training, typically by collecting random items

In [None]:
from torch.utils.data import DataLoader
DataLoader?


In [None]:
dl = DataLoader(dataset=dataset, batch_size=8, shuffle=True, num_workers=0)
dl

In [None]:
dl.batch_size

A DataLoader is an `Iterable`

> "Iterable: An object capable of returning its members one at a time. Examples of iterables include all sequence types (such as list, str, and tuple) and some non-sequence types like dict, file objects, and objects of any classes you define with an `__iter__()` method or with a `__getitem__()` method that implements Sequence semantics." ([glossary](https://docs.python.org/3/glossary.html))

- can be used in for loops
- provides an iterator via the built-in `iter` function
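
The difference between an iterable and the iterator returned by `iter` can be sketched in plain Python. `Countdown` is a hypothetical class made up for this illustration; it is not part of pytorch.

```python
class Countdown:
    """Hypothetical example class: iterable because it defines __iter__."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Each call to iter() starts a fresh pass over the data.
        n = self.start
        while n > 0:
            yield n
            n -= 1


countdown = Countdown(3)
iterator = iter(countdown)  # what a for loop does behind the scenes
first = next(iterator)      # 3
remaining = list(iterator)  # [2, 1]: the iterator remembers its position
again = list(countdown)     # [3, 2, 1]: the iterable can be looped over again
print(first, remaining, again)
```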

In [None]:
for x in dl:
    print(x)
    break

In [None]:
next(iter(dl))
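
How mini-batches are assembled is easier to see on toy data. The sketch below assumes `TensorDataset` as a ready-made `Dataset` and uses made-up values; `shuffle=False` keeps the order fixed so the shapes are predictable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.arange(10, dtype=torch.float32).unsqueeze(-1)  # shape (10, 1)
targets = 2 * features.squeeze(-1)                              # shape (10,)
toy_dataset = TensorDataset(features, targets)

toy_dl = DataLoader(toy_dataset, batch_size=4, shuffle=False)
batch_shapes = [(x.shape, y.shape) for x, y in toy_dl]
for shapes in batch_shapes:
    print(shapes)   # two batches of 4 items, then one batch with the remaining 2
print(len(toy_dl))  # 3 batches per pass over the data
```

The last batch is smaller because 10 is not divisible by 4.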

### Optional Exercise (for later)

- Is a `Dataset` also an `Iterable` ?

## DataLoaders: More than one `DataLoader`

- data is often split into training, validation and testing data
- training and validation data are used during training (-> see next lectures on overfitting)

Sometimes several `DataLoader`s are collected in a custom `DataLoaders` class, which is basically a tuple.
The grouping can thus be purely conceptual, or additionally be represented in custom code.

In [None]:
N_train = int(len(df)*0.8)
print(f"N_total: {len(df)}, N_train: {N_train}, N_val: {len(df)- N_train}")
df_randomized = df.sample(frac=1.0)
dataset_train = CovidDenmarkData(df_randomized.iloc[:N_train])
dataset_valid = CovidDenmarkData(df_randomized.iloc[N_train:])

print(f"Train: {len(dataset_train)}, Valid: {len(dataset_valid)}")

data_loaders = (
    DataLoader(dataset_train, shuffle=True),
    DataLoader(dataset_valid, shuffle=False)
)

len(data_loaders[1])  # A tuple as a collection of single DataLoader instances
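
A custom `DataLoaders` class can be as small as a tuple with named fields. The sketch below is hypothetical (it is not a pytorch class) and uses plain lists as stand-ins for real `DataLoader` instances so that it runs without any data.

```python
from typing import Iterable, NamedTuple


class DataLoaders(NamedTuple):
    """Hypothetical container: a tuple whose fields have names."""
    train: Iterable
    valid: Iterable


# Stand-ins for real DataLoader instances (assumption, keeps the sketch data-free)
loaders = DataLoaders(train=[("x_batch", "y_batch")], valid=[("x_val", "y_val")])

print(loaders.train is loaders[0])  # True: access by name or by position
train, valid = loaders              # unpacks like any other tuple
print(len(loaders))                 # 2
```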

## Models

- are `Callable`s
- A model is a collection, often sequence, of `Module`s
- [`Sequential`](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) is a common container for chaining `Module`s into a model


### Excursus: `Callable` in Python

- means that the object supports being called using normal parentheses: `obj()`

In [None]:
def do_something():
    return 3


class DoSomething():

    def __init__(self):
        pass

    def __call__(self):
        """Make an instance Callable."""
        return 6


do_something_instance = DoSomething()  # here init is called

do_something(), do_something_instance()
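
Whether an object is callable can be checked with the built-in `callable`. `Doubler` is a hypothetical class for this sketch.

```python
class Doubler:
    """Hypothetical class whose instances can be called like functions."""

    def __call__(self, value):
        return 2 * value


doubler = Doubler()
print(callable(len), callable(doubler), callable(3))  # True True False
print(doubler(21))                                    # 42
```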

### Linear Module

- the linear module alone can be used to formulate a linear model:

    $f(x) = \sum_{i} x_i \cdot w_i + b$, where  
      
    there is a weight $w_i$ for each input feature $x_i$ and an overall bias (or intercept) $b$
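
As a worked example of the formula, consider two features in plain Python. All numbers are made up for illustration.

```python
x = [2.0, 3.0]   # two input features (made-up values)
w = [0.5, -1.0]  # one weight per feature (made-up values)
b = 4.0          # bias / intercept

# f(x) = 2.0 * 0.5 + 3.0 * (-1.0) + 4.0
f_x = sum(x_i * w_i for x_i, w_i in zip(x, w)) + b
print(f_x)  # 2.0
```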

In [None]:
import torch.nn as nn

linear_model = nn.Linear(1, 1)  # one input feature, one output
linear_model

In [None]:
# linear_model(x[0]) # RuntimeError: expected scalar type Float but found Double

You will see several in-place operations throughout pytorch code
  - transforming data
  - specifying where data lives in memory (which device to use, e.g. CPU or GPU)

In general, data and model have to be of the same format (e.g. the same floating-point type).

In [None]:
linear_model.double()  # data is float64, so model parameters also need to be float64
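
Note that "in place" applies to modules, while the same method on a tensor returns a new tensor. A minimal sketch, assuming pytorch's default floating-point type (`float32`) has not been changed:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)         # parameters start as float32
returned = model.double()       # converts the parameters of the module in place
print(returned is model)        # True: the module returns itself
print(model.weight.dtype)       # torch.float64

t32 = torch.tensor([1.0, 2.0])  # float32 by default
t64 = t32.double()              # returns a new tensor, t32 is unchanged
print(t32.dtype, t64.dtype)     # torch.float32 torch.float64
```

For tensors, the result therefore has to be assigned to a name, otherwise the conversion is lost.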

Provide a mini-batch to the untrained model

In [None]:
# a random first batch from our DataLoader (iff shuffle is True)
x = next(iter(dl))
linear_model(x[0])

In [None]:
nn.Linear?

## Train Linear Regression parameters

- again using Stochastic Gradient Descent, using [`torch.optim.SGD`](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html?highlight=sgd#torch.optim.SGD)

In [None]:
from torch.optim import SGD

sgd = SGD(params=linear_model.parameters(), lr=1.e-5)
loss_fn = nn.MSELoss()

### One mini-batch

In [None]:
x_in, y_true = next(iter(dl))
y_predicted = linear_model(x_in).squeeze()
y_predicted, y_true

In [None]:
loss = loss_fn(y_predicted, y_true)
loss

In [None]:
sgd.zero_grad()
loss.backward()
sgd.step()
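
What `sgd.step()` does can be checked by hand. The sketch below assumes plain `SGD` (no momentum, no weight decay), for which the update is `parameter - lr * gradient`; data and learning rate are made up.

```python
import torch
import torch.nn as nn
from torch.optim import SGD

model = nn.Linear(1, 1)
lr = 0.1
sgd = SGD(model.parameters(), lr=lr)

x = torch.tensor([[1.0], [2.0], [3.0]])  # toy inputs
y = torch.tensor([2.0, 4.0, 6.0])        # toy targets

loss = nn.MSELoss()(model(x).squeeze(), y)
sgd.zero_grad()   # remove gradients of earlier steps
loss.backward()   # compute gradients of the loss w.r.t. the parameters

w_before = model.weight.detach().clone()
grad = model.weight.grad.detach().clone()
sgd.step()        # update the parameters

w_manual = w_before - lr * grad
same = torch.allclose(model.weight.detach(), w_manual)
print(same)  # True
```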

In [None]:
y_predicted = linear_model(x_in).squeeze()
y_predicted, y_true

### Train in epochs

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

lr = 1.e-5


linear_model = nn.Linear(1, 1).double()  # repeat

print(
    f"initial weight: {linear_model.weight.squeeze()}, initial bias: {linear_model.bias.squeeze()} ")

sgd = SGD(params=linear_model.parameters(), lr=lr)
loss_fn = nn.MSELoss()

num_epochs = 20  # Number of epochs to run
loss_history = []  # Keep a record of losses over time for plotting

plot = sns.regplot(data=df, x="new_cases", y="new_deaths",
                   ci=None).set(xlabel='New cases', ylabel='Deaths')

# df.plot(kind='scatter', x="new_cases", y="new_deaths", figsize=(15,10))

for epoch in range(num_epochs):
    epoch_loss = 0  # reset epoch loss
    for x_in, y_true in dl:
        # Zero out gradients (clean-up of optimizer)
        sgd.zero_grad()
        # Forward pass
        y_predicted = linear_model(x_in)
        # Compute loss
        loss = loss_fn(
            y_predicted.squeeze(),  # remove squeeze and see what happens -> shapes are important
            y_true)
        # Backward pass
        loss.backward()
        # Update the parameters using the gradients
        sgd.step()
        # add mini-batch loss
        epoch_loss = epoch_loss + loss.item()  # item returns a plain Python number

    # plot parameters after epoch
    w, b = linear_model.weight.detach().numpy(), linear_model.bias.detach().numpy()
    x = df['new_cases'].to_numpy()
    plt.plot(x, (x*w+b).squeeze(), color="blue", linewidth=0.2)

    print(f'epoch: {epoch:4d}, epoch loss: {epoch_loss}')
    loss_history.append(epoch_loss)
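
Why the `squeeze` in the loop matters can be seen on toy tensors. A minimal sketch with made-up values: the prediction has shape `(3, 1)` like the model output, the target has shape `(3,)`, and without `squeeze` the difference is broadcast to a `(3, 3)` matrix.

```python
import torch

y_predicted = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1), like the model output
y_true = torch.tensor([1.0, 2.0, 3.0])             # shape (3,), like the targets

diff_wrong = y_predicted - y_true            # broadcast: every prediction vs. every target
diff_right = y_predicted.squeeze() - y_true  # element by element

print(diff_wrong.shape, diff_right.shape)  # torch.Size([3, 3]) torch.Size([3])
print((diff_wrong ** 2).mean())            # not zero, although predictions are perfect
print((diff_right ** 2).mean())            # tensor(0.)
```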

In [None]:
x_in, y_true = next(iter(dl))
y_predicted = linear_model(x_in).squeeze()
y_predicted, y_true

In [None]:
linear_model.weight, linear_model.bias

### Custom `Module`s, Custom models

- the Callable is implemented as a `forward` method for programming-pattern reasons (let's exclude a discussion of application programming interfaces - APIs - for now).

In [None]:
class LinearRegression(nn.Module):

    def __init__(self):
        """Everything stateful which you need to use your model goes here."""
        super().__init__()  # needs to be here for API reasons -> call nn.Module.__init__
        pass

    def forward(self):  # no input
        return "define your model here"


lin_reg = LinearRegression()
lin_reg()

In [None]:
# nn.Module.__call__?? # have a look if you are interested in API separation
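
The pattern behind `forward` can be sketched in plain Python. `MiniModule` is a simplified, hypothetical stand-in and not how `nn.Module` is actually implemented; it only shows the idea that `__call__` delegates to `forward`.

```python
class MiniModule:
    """Simplified stand-in, NOT the real nn.Module."""

    def __call__(self, *args, **kwargs):
        # A framework can add bookkeeping before and after this call.
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError("subclasses define the computation")


class TimesTwo(MiniModule):

    def forward(self, x):
        return 2 * x


times_two = TimesTwo()
result = times_two(4)  # calls __call__, which calls forward
print(result)          # 8
```

Users only write `forward`, while the base class keeps control over what else happens when the object is called.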

Let us define the model as it was done in the previous step, wrapped into a `nn.Module`

In [None]:
class LinearRegression(nn.Module):

    def __init__(self):
        """Everything stateful which you need to use your model goes here."""
        super().__init__()  # needs to be here for API reasons -> call nn.Module.__init__
        self.linear_reg = nn.Linear(1, 1)

    def forward(self, x):  # now with input
        return self.linear_reg(x)


lin_reg = LinearRegression()
x = torch.tensor([20, 30, 40], dtype=torch.float32).unsqueeze(-1)
lin_reg(x)
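
Assigning a `Module` as an attribute registers its parameters with the surrounding model, which is what an optimizer later receives via `parameters()`. In the sketch below, `TinyRegression` is a hypothetical copy of the model above, renamed to avoid clashes.

```python
import torch.nn as nn


class TinyRegression(nn.Module):
    """Hypothetical copy of the model above."""

    def __init__(self):
        super().__init__()
        self.linear_reg = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear_reg(x)


tiny = TinyRegression()
param_names = [name for name, _ in tiny.named_parameters()]
n_params = sum(p.numel() for p in tiny.parameters())
print(param_names)  # ['linear_reg.weight', 'linear_reg.bias']
print(n_params)     # 2: one weight and one bias
```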

### Exercise

- Use this model with the training loop above
- (Advanced) If you have time and are interested: can you initialise the weight and bias to 0? This requires re-initialising the parameters, see [`torch.nn.init`](https://pytorch.org/docs/stable/nn.init.html).

In [None]:
# copy training loop and try to get it running

### Exercise

- create a Feed-Forward Neural Network (FNN)
- Adapt the [FNN from the tutorial](https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html#define-the-class). You will see the use of  `nn.Sequential`

In [None]:
# copy training loop and try to get your more complex model running

## Additional Exercises

- prepare data in `float32` in `Dataset` (or `DataFrame`)
- cast model differently (after adapting the data)
- make the batch size configurable in the training loop (-> `DataLoader`)
- report the loss as per-sample loss, not as aggregated batch loss
- try to add more than one feature to the model for making predictions

## Outlook

The training procedure can be simplified using libraries built on top of pytorch.

- [Lightning](https://www.pytorchlightning.ai/)
- [Ignite](https://pytorch.org/ignite/quickstart.html)
- [fastai](https://docs.fast.ai/)
 

> Warning: The simplification or standardization of training procedures is another layer of
> complexity in itself.