# Exercises

## 0. Set up your own repo
- Setting up a repo is often repetitive. That's why you can use so-called 'cookiecutters', which provide a template structure with some basic folders and files already set up for you. You don't have to use one and can do it by hand, but have a look; you might find it helpful.
    - You could use the `cookiecutter` command that is preinstalled on your VM to create a repo, see https://github.com/raoulg/datascience-cookiecutter for details. I made this one myself, because I ended up editing the cookiecutter I was using.
    - Another project is https://github.com/drivendata/cookiecutter-data-science, which is also intended for data science projects.
    - More general cookiecutters are shipped with tools like [pdm](https://pdm-project.org/latest/) and [rye](https://rye-up.com/); starting a project with `pdm init` or `rye init` (see the docs for details) gives you some minimal structure, and pdm has the option to provide your own template (see [pdm template](https://pdm-project.org/latest/usage/template/)).
- Push your own repo to GitHub. Use `MADS-ML-{yourname}` as a format, e.g. `MADS-ML-JoostB`.
- Invite me (raoulg; https://github.com/raoulg) as a collaborator to your repo.
- Make exercises 1-5 below in your repo, and push them to GitHub.

Tips:
- Commit often (every 30 minutes or so) 
- Really, commit often. Committing and pushing your work is the best way to make sure your work is saved properly.
- Commit groups of files that are related to each other. If you have other, unrelated changes, commit them separately.
- Write commit messages that are descriptive and informative. "lesson 1", "changes" or "commit" are bad commit messages; "added exercise 2" is better, "[exercise 2] added __len__ to Dataset class" is even better.
- Use `pdm` or `rye` to add dependencies. `mads_datasets` and `mltrainer` should cover a lot of what you need; don't blindly copy-paste all dependencies but keep your `pyproject.toml` as clean as possible.

At some point, you will get a grade for the exercises that is 0 (not good enough), 1 (good enough) or 2 (excellent).
I will look at both form and correctness to determine your grade.
The result will be incorporated into your final grade for this course.

## 1. 3D Tensor dataset
- Create a random 3D tensor dataset with `torch`
- Build your own `Dataset` class, such that you can get a 3D tensor and a label (which can be a random 0 or 1)

See notebook `03_dataloader` for details on how to create a custom dataset. See `01_tensors` and the torch documentation for how to create random tensors.

## 2. Datastreamers
Study the `BaseDatastreamer` in `03_dataloader` and use it with your own dataset, such that you get a datastreamer that will keep on giving you new batches of data when you call `next()` or loop over it.
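
To make the idea of a streamer concrete, here is a minimal pure-Python sketch of an infinite batch iterator. It is not the implementation of `BaseDatastreamer`: the class name `InfiniteBatcher` and its arguments are hypothetical, and plain lists stand in for tensors. It only illustrates the behaviour described above, where every call to `next()` yields a new batch and the data is reshuffled once a pass is exhausted.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Item = Tuple[List[float], int]  # (features, label); stand-in for (tensor, label)


class InfiniteBatcher:
    """Hypothetical illustration of a streamer; not the mads_datasets API."""

    def __init__(self, dataset: Sequence[Item], batchsize: int, seed: int = 0) -> None:
        self.dataset = dataset
        self.batchsize = batchsize
        self.rng = random.Random(seed)

    def __len__(self) -> int:
        # number of full batches in one pass over the data
        return len(self.dataset) // self.batchsize

    def stream(self) -> Iterator[List[Item]]:
        indices: List[int] = []
        while True:  # the generator never raises StopIteration
            if len(indices) < self.batchsize:
                # not enough indices left for a full batch: start a new shuffled pass
                indices = list(range(len(self.dataset)))
                self.rng.shuffle(indices)
            batch_idx, indices = indices[: self.batchsize], indices[self.batchsize :]
            yield [self.dataset[i] for i in batch_idx]


data = [([float(i)], i % 2) for i in range(10)]
streamer = InfiniteBatcher(data, batchsize=4).stream()
batches = [next(streamer) for _ in range(5)]  # more batches than one pass holds
print(len(batches), [len(b) for b in batches])
```

Because the generator never stops, a plain `for batch in streamer:` loop needs its own stopping condition, for example a fixed number of steps.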

## 3. Tune the network
For this exercise we won't build upon the previous exercises, but instead will use the Fashion dataset.
Run the experiment below, explore the different parameters (see suggestions below) and study the result with tensorboard. 

In [1]:
from mads_datasets import DatasetFactoryProvider, DatasetType

from mltrainer.preprocessors import BasePreprocessor
from mltrainer import imagemodels, Trainer, TrainerSettings, ReportTypes, metrics

import torch.optim as optim
import gin

In [2]:
gin.parse_config_file("model.gin")

ParsedConfigFileIncludesAndImports(filename='model.gin', imports=['gin.torch.external_configurables'], includes=[])

In [4]:

preprocessor = BasePreprocessor()
fashionfactory = DatasetFactoryProvider.create_factory(DatasetType.FASHION)
streamers = fashionfactory.create_datastreamer(batchsize=64, preprocessor=preprocessor)
train = streamers["train"]
valid = streamers["valid"]
trainstreamer = train.stream()
validstreamer = valid.stream()

2024-12-07 18:48:40.522 | INFO     | mads_datasets.base:download_data:121 - Folder already exists at /home/azureuser/.cache/mads_datasets/fashionmnist
2024-12-07 18:48:40.523 | INFO     | mads_datasets.base:download_data:124 - File already exists at /home/azureuser/.cache/mads_datasets/fashionmnist/fashionmnist.pt


In [6]:
print(gin.config_str())

import gin.torch.external_configurables

# Parameters for NeuralNetwork:
NeuralNetwork.num_classes = 10
NeuralNetwork.units1 = 512



A big advantage is that we can save this config as a file; that way it is easy to track what you changed during your experiments.

In [7]:
accuracy = metrics.Accuracy()

You can set up a single experiment

In [None]:
import torch
loss_fn = torch.nn.CrossEntropyLoss()

settings = TrainerSettings(
    epochs=10,
    metrics=[accuracy],
    logdir="modellogs",
    train_steps=100,
    valid_steps=100,
    reporttypes=[ReportTypes.TENSORBOARD, ReportTypes.GIN],
)
model = imagemodels.NeuralNetwork(
    num_classes=10, units1=256, units2=256)

And train it

In [None]:
trainer = Trainer(
    model=model,
    settings=settings,
    loss_fn=loss_fn,
    optimizer=optim.Adam,
    traindataloader=trainstreamer,
    validdataloader=validstreamer,
    scheduler=optim.lr_scheduler.ReduceLROnPlateau
)
trainer.loop()

Alternatively, you can use gin, which reads the `model.gin` file so that you don't need to set every parameter by hand.

Call `gin.parse_config_file("model.gin")` and then create the model with `model = imagemodels.NeuralNetwork()`; the parameters will be loaded from the gin file.

If you want to combine this with a manual grid search, you could automate that with a nested for loop:

In [None]:
units = [512, 256, 128, 64, 32, 16]
for unit1 in units:
    for unit2 in units:
        if unit1 <= unit2:
            continue
        print(f"Units: {unit1}, {unit2}")

Units: 512, 256
Units: 512, 128
Units: 512, 64
Units: 512, 32
Units: 512, 16
Units: 256, 128
Units: 256, 64
Units: 256, 32
Units: 256, 16
Units: 128, 64
Units: 128, 32
Units: 128, 16
Units: 64, 32
Units: 64, 16
Units: 32, 16


Of course, this might not be the best way to search for a model; some configurations will be better than others (can you predict up front what will be the best configuration?).

So, feel free to improve upon the gridsearch by adding your own logic.
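
One way to add such logic is to generate the configurations up front and filter them before training. The sketch below is only an illustration: the helper name `funnel_configs` is hypothetical, and the rule it applies (the second layer must be strictly smaller than the first) is the same assumption as in the loop above, not a guarantee of better results.

```python
from itertools import product
from typing import List, Tuple


def funnel_configs(units: List[int]) -> List[Tuple[int, int]]:
    """Return all (units1, units2) pairs where units2 is smaller than units1."""
    return [(u1, u2) for u1, u2 in product(units, repeat=2) if u1 > u2]


configs = funnel_configs([512, 256, 128, 64, 32, 16])
print(len(configs))  # 15 pairs, instead of the 36 of a full grid
for unit1, unit2 in configs[:3]:
    print(f"Units: {unit1}, {unit2}")
```

Each pair could then be passed to `gin.bind_parameter`, as in the cell below.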

In [None]:
import torch
gin.parse_config_file("model.gin")

units = [256, 128, 64]
loss_fn = torch.nn.CrossEntropyLoss()

settings = TrainerSettings(
    epochs=5,
    metrics=[accuracy],
    logdir="modellogs",
    train_steps=len(train),
    valid_steps=len(valid),
    reporttypes=[ReportTypes.TENSORBOARD, ReportTypes.GIN],
)

for unit1 in units:
    for unit2 in units:
        gin.bind_parameter("NeuralNetwork.units1", unit1)
        gin.bind_parameter("NeuralNetwork.units2", unit2)

        model = imagemodels.NeuralNetwork()
        trainer = Trainer(
            model=model,
            settings=settings,
            loss_fn=loss_fn,
            optimizer=optim.Adam,
            traindataloader=trainstreamer,
            validdataloader=validstreamer,
            scheduler=optim.lr_scheduler.ReduceLROnPlateau
        )
        trainer.loop()

2024-12-07 18:51:02.693 | INFO     | mltrainer.settings:check_path:61 - Created logdir /home/azureuser/MachineLearning/notebooks/1_pytorch_intro/modellogs
2024-12-07 18:51:02.714 | INFO     | mltrainer.trainer:dir_add_timestamp:29 - Logging to modellogs/20241207-185102
2024-12-07 18:51:03.952 | INFO     | mltrainer.trainer:__init__:72 - Found earlystop_kwargs in settings.Set to None if you dont want earlystopping.
100%|██████████| 100/100 [00:00<00:00, 128.20it/s]
2024-12-07 18:51:05.005 | INFO     | mltrainer.trainer:report:191 - Epoch 0 train 0.8530 test 0.6174 metric ['0.7761']
100%|██████████| 100/100 [00:00<00:00, 142.30it/s]
2024-12-07 18:51:05.971 | INFO     | mltrainer.trainer:report:191 -

In [10]:
settings = TrainerSettings(
    epochs=15,
    metrics=[accuracy],
    logdir="modellogs",
    train_steps=200,
    valid_steps=200,
    reporttypes=[ReportTypes.TENSORBOARD, ReportTypes.GIN],
    earlystop_kwargs={"save": False, "patience": 15, "verbose": False},
)
settings

epochs: 15
metrics: [Accuracy]
logdir: modellogs
train_steps: 200
valid_steps: 200
reporttypes: [<ReportTypes.TENSORBOARD: 2>, <ReportTypes.GIN: 1>]
optimizer_kwargs: {'lr': 0.001, 'weight_decay': 1e-05}
scheduler_kwargs: {'factor': 0.1, 'patience': 10}
earlystop_kwargs: {'save': False, 'patience': 15, 'verbose': False}

Run the experiment, and study the result with tensorboard. 

Locally, it is easy to do that with VS Code itself. On the server, you have to take these steps:

- in the terminal, `cd` to the location of the repository
- activate the Python environment for the shell, and check that the correct environment is active
- run `tensorboard --logdir=modellogs` in the terminal
- tensorboard will launch at `localhost:6006` and VS Code will notify you that the port is forwarded
- you can either press the `launch` button in VS Code or open your local browser at `localhost:6006`


Experiment with things like:

- changing the number of units in `units1` and `units2` to values between 16 and 1024. Use powers of 2: 16, 32, 64, etc.
- changing the batchsize to values between 4 and 128. Again, use powers of 2.
- all your experiments are saved in the `modellogs` directory, with a timestamp. Inside you find a `saved_config.gin` file that contains all the settings for that experiment. The `events` file is what tensorboard will show.
- plotting the result in a heatmap: units vs batchsize.
- changing the learning rate to values between 1e-2 and 1e-5.
- changing the optimizer from Adam (used in the code above) to one of the other available algorithms at [torch](https://pytorch.org/docs/stable/optim.html) (scroll down for the algorithms)
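
The suggested search ranges can be generated rather than typed out. A small sketch, in which the variable names are only illustrative:

```python
# powers of two from 16 to 1024, for units1 and units2
unit_options = [2**i for i in range(4, 11)]
# powers of two from 4 to 128, for the batchsize
batchsize_options = [2**i for i in range(2, 8)]
# learning rates from 1e-2 down to 1e-5, one per order of magnitude
learningrates = [10**-i for i in range(2, 6)]

print(unit_options)       # [16, 32, 64, 128, 256, 512, 1024]
print(batchsize_options)  # [4, 8, 16, 32, 64, 128]
print(learningrates)
```

Trying every combination of these lists quickly becomes expensive (7 x 7 x 6 x 4 runs), so it is usually more practical to vary one or two of them at a time.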

A note on `train_steps`: this setting determines how often you get an update.
Because our complete dataset is 938 batches long (60000 / 64, rounded up), you need 938 train steps to cover all 60,000 images.

This can be a bit confusing, because every value below 938 changes the meaning of `epoch` slightly: one epoch is no longer
the full dataset, but simply `train_steps` batches. Setting `train_steps` to 100 means you need to wait twice as long before you get feedback on the performance,
compared to `train_steps=50`. You will also see that setting `train_steps` to 100 improves the learning per epoch, but that is simply because the model has seen twice as
many examples as with `train_steps=50`.

This implies that it is not useful to compare `train_steps=50` and `train_steps=100` at the same number of epochs, because setting it to 100 will always look better.
Just pick a value, and adjust your number of epochs accordingly.
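
The arithmetic behind this note can be checked directly. The sketch below uses the numbers from the text (60000 images, batchsize 64); the helper `epochs_for_budget` and the budget of 1000 batches are hypothetical, introduced only to show how to keep the total number of batches equal when you change `train_steps`.

```python
import math

n_images = 60_000
batchsize = 64

# batches needed to see every image once
full_epoch = math.ceil(n_images / batchsize)
print(full_epoch)  # 938


def epochs_for_budget(total_batches: int, train_steps: int) -> int:
    """Number of epochs needed to process `total_batches` batches."""
    return math.ceil(total_batches / train_steps)


# the same budget of 1000 batches, expressed with different train_steps
print(epochs_for_budget(1000, train_steps=50))   # 20
print(epochs_for_budget(1000, train_steps=100))  # 10
```

With equal budgets, two runs have seen the same number of examples, so their results can be compared fairly.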