# Exercises

## 0. Set up your own repo
- Setting up a repo is often repetitive. That is why there are so-called 'cookiecutters', which give you a template structure with some basic folders and files already in place. You don't have to use one and can set everything up by hand, but have a look; you may find it helpful.
    - You could use the `cookiecutter` command that is preinstalled on your VM to create a repo, see https://github.com/raoulg/datascience-cookiecutter for details. I made this one myself, because I ended up editing the cookiecutter I was using.
    - Another project is this one https://github.com/drivendata/cookiecutter-data-science , also intended for data science projects
    - more general cookiecutters are shipped with tools like [pdm](https://pdm-project.org/latest/) and [rye](https://rye-up.com/); starting a project with `pdm init` or `rye init` (see docs for details) will provide you with some minimal structure, and there is the option to provide your own template with pdm (see [pdm template](https://pdm-project.org/latest/usage/template/))
- Push your own repo to GitHub. Use `MADS-ML-{yourname}` as the name format, e.g. `MADS-ML-JoostB`.
- Invite me (raoulg; https://github.com/raoulg) as a collaborator to your repo.
- Make exercises 1-5 below in your repo, and push them to GitHub.

Tips:
- Commit often (every 30 minutes or so).
- Really, commit often. Committing and pushing your work is the best way to make sure it is saved properly.
- Commit groups of files that are related to each other. If you also changed unrelated files, commit those separately.
- Write commit messages that are descriptive and informative. "lesson 1", "changes" or "commit" are bad commit messages; "added exercise 2" is better, "[exercise 2] added `__len__` to Dataset class" is even better.
- Use `pdm` or `rye` to add dependencies. `mads_datasets` and `mltrainer` should cover a lot of what you need; don't blindly copy-paste all dependencies but keep your `pyproject.toml` as clean as possible.

At some point, you will get a grade for the exercises: 0 (not good enough), 1 (good enough) or 2 (excellent).
I will look at both form and correctness to determine your grade.
The result will be incorporated into your final grade for this course.

## 1. 3D Tensor dataset
- Create a random 3D tensor dataset with `torch`
- Build your own `Dataset` class, such that you can get a 3D tensor and a label (which can be a random 0 or 1)

See notebook `03_dataloader` for details on how to create a custom dataset. See `01_tensors` and the torch documentation for how to create random tensors.

## 2. Datastreamers
Study the `BaseDatastreamer` in `03_dataloader` and use it with your own dataset, such that you get a datastreamer that will keep on giving you new batches of data when you call `next()` or loop over it.

## 3. Tune the network
For this exercise we won't build upon the previous exercises, but instead will use the Fashion dataset.
Run the experiment below, explore the different parameters (see suggestions below) and study the result with tensorboard. 

In [7]:
# 3D Tensor Dataset
import torch
# Create a random 3D tensor dataset with shape (num_samples, depth, height, width)
num_samples = 30
depth = 2
height = 8
width = 8

random_3d_tensor_dataset = torch.rand(num_samples, depth, height, width)
random_3d_tensor_dataset.shape

torch.Size([30, 2, 8, 8])

In [20]:
import torch
import random
from typing import Tuple

class Random3DTensorDataset:
    def __init__(self, num_samples: int, tensor_shape: Tuple[int, int, int]) -> None:
        self.num_samples = num_samples
        self.tensor_shape = tensor_shape
        self.dataset = self._generate_data()

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, int]:
        return self.dataset[idx]

    def _generate_data(self):
        data = [(torch.rand(self.tensor_shape), random.randint(0, 1)) for _ in range(self.num_samples)]
        return data

In [21]:
# Example Set
num_samples = 10
tensor_shape = (3, 4, 5)  # Shape of each 3D tensor

dataset = Random3DTensorDataset(num_samples, tensor_shape)

# Check the first sample
print("Number of samples in the dataset:", len(dataset))
sample_data, sample_label = dataset[0]
print("Sample 3D Tensor:\n", sample_data)
print("Sample Label:", sample_label)

Number of samples in the dataset: 10
Sample 3D Tensor:
 tensor([[[0.0755, 0.0790, 0.5968, 0.6994, 0.7636],
         [0.8187, 0.2441, 0.4173, 0.1133, 0.3202],
         [0.1489, 0.1845, 0.6217, 0.7632, 0.9690],
         [0.0944, 0.5827, 0.3997, 0.3677, 0.1503]],

        [[0.5980, 0.7475, 0.2914, 0.3854, 0.9842],
         [0.5951, 0.7126, 0.8611, 0.4382, 0.1403],
         [0.0878, 0.7325, 0.3817, 0.3381, 0.2646],
         [0.1475, 0.5136, 0.2172, 0.4206, 0.0367]],

        [[0.2676, 0.9525, 0.4616, 0.2832, 0.5554],
         [0.0051, 0.8163, 0.6151, 0.9892, 0.0448],
         [0.8859, 0.0017, 0.4133, 0.1548, 0.9698],
         [0.9713, 0.3782, 0.1668, 0.1836, 0.2928]]])
Sample Label: 1
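
Because the class above implements `__len__` and `__getitem__`, the same idea also works with torch's own `DataLoader`. The block below is a minimal sketch, not the course's reference solution: `SeededRandom3DDataset` and its `seed` argument are hypothetical names introduced here, and the seeded `torch.Generator` is only an assumption about how one might make the random data reproducible.

```python
from typing import Tuple

import torch
from torch.utils.data import DataLoader, Dataset


class SeededRandom3DDataset(Dataset):
    """Hypothetical variant of the dataset above that subclasses torch's Dataset."""

    def __init__(self, num_samples: int, tensor_shape: Tuple[int, int, int], seed: int = 0) -> None:
        # A dedicated generator makes the random tensors and labels reproducible
        generator = torch.Generator().manual_seed(seed)
        self.data = torch.rand((num_samples, *tensor_shape), generator=generator)
        self.labels = torch.randint(0, 2, (num_samples,), generator=generator)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, int]:
        return self.data[idx], int(self.labels[idx])


dataset = SeededRandom3DDataset(num_samples=10, tensor_shape=(3, 4, 5))
loader = DataLoader(dataset, batch_size=5, shuffle=True)

# One batch: the 3D tensors are stacked along a new first dimension
X, y = next(iter(loader))
print(X.shape, y.shape)
```

Unlike a datastreamer, a `DataLoader` stops after one pass over the data, so `for X, y in loader` terminates on its own.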


In [23]:
# Datastreamers
class BaseDatastreamer:
    def __init__(self, dataset, batchsize: int):
        self.dataset = dataset
        self.batchsize = batchsize
        self.size = len(dataset)
        self.index_list = torch.randperm(self.size)  # Random permutation of indices
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        X = []
        Y = []
        for _ in range(self.batchsize):
            if self.index >= self.size:
                self.index = 0
                self.index_list = torch.randperm(self.size)  # Shuffle indices when all samples are used

            data, label = self.dataset[self.index_list[self.index]]
            X.append(data)
            Y.append(label)
            self.index += 1
        return torch.stack(X), torch.tensor(Y)

# Create datastreamer (a batch size of 5 matches the output below)
batch_size = 5
streamer = BaseDatastreamer(dataset, batch_size)

# Fetch a few batches using next() and loop
for _ in range(3):
    X_batch, Y_batch = next(streamer)
    print("Batch of tensors shape:", X_batch.shape)
    print("Batch of labels:", Y_batch)

Batch of tensors shape: torch.Size([5, 3, 4, 5])
Batch of labels: tensor([0, 1, 1, 1, 0])
Batch of tensors shape: torch.Size([5, 3, 4, 5])
Batch of labels: tensor([0, 0, 1, 0, 0])
Batch of tensors shape: torch.Size([5, 3, 4, 5])
Batch of labels: tensor([0, 1, 0, 0, 1])
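
A streamer like the one above never raises `StopIteration`, so a plain `for batch in streamer` loop would run forever. The sketch below shows one way to loop over a bounded number of batches with `itertools.islice`. It uses a hypothetical generator function, `stream_batches`, instead of the class above, and (unlike that class) it drops the incomplete batch at the end of each pass; both are assumptions made to keep the example short.

```python
import itertools
import random
from typing import Iterator, List, Tuple

import torch


def stream_batches(
    dataset: List[Tuple[torch.Tensor, int]], batchsize: int
) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
    """Yield shuffled batches forever, reshuffling after every full pass."""
    indices = list(range(len(dataset)))
    while True:
        random.shuffle(indices)
        # Only full batches are produced; a remainder smaller than batchsize is skipped
        for start in range(0, len(indices) - batchsize + 1, batchsize):
            batch = [dataset[i] for i in indices[start : start + batchsize]]
            X, Y = zip(*batch)
            yield torch.stack(X), torch.tensor(Y)


# Small stand-in dataset: 10 random 3D tensors with random 0/1 labels
toy_dataset = [(torch.rand(3, 4, 5), random.randint(0, 1)) for _ in range(10)]
stream = stream_batches(toy_dataset, batchsize=5)

# islice bounds the otherwise endless loop to 4 batches (two passes over the data)
batches = list(itertools.islice(stream, 4))
for X_batch, Y_batch in batches:
    print(X_batch.shape, Y_batch.shape)
```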


In [1]:
from mads_datasets import DatasetFactoryProvider, DatasetType

from mltrainer.preprocessors import BasePreprocessor
from mltrainer import imagemodels, Trainer, TrainerSettings, ReportTypes, metrics

import torch.optim as optim
import gin

In [2]:
gin.parse_config_file("model.gin")

ParsedConfigFileIncludesAndImports(filename='model.gin', imports=['gin.torch.external_configurables'], includes=[])

We will be using `gin-config` to keep track of our experiments and to save the settings we used for each of them.

The `model.gin` file is a simple file that sets parameters for functions that have already been imported.

So, if you had not imported train_model, gin would not be able to parse the settings for train_model.trainloop and would raise an error.

We can print all the settings that are in effect with `gin.operative_config_str()` once we have loaded the functions into memory.

In [3]:

preprocessor = BasePreprocessor()
fashionfactory = DatasetFactoryProvider.create_factory(DatasetType.FASHION)
streamers = fashionfactory.create_datastreamer(batchsize=64, preprocessor=preprocessor)
train = streamers["train"]
valid = streamers["valid"]
trainstreamer = train.stream()
validstreamer = valid.stream()

2024-11-16 22:48:21.899 | INFO     | mads_datasets.base:download_data:121 - Folder already exists at C:\Users\dilek\.cache\mads_datasets\fashionmnist
2024-11-16 22:48:21.899 | INFO     | mads_datasets.base:download_data:124 - File already exists at C:\Users\dilek\.cache\mads_datasets\fashionmnist\fashionmnist.pt


In [4]:
print(gin.config_str())

import gin.torch.external_configurables

# Parameters for NeuralNetwork:
NeuralNetwork.num_classes = 10
NeuralNetwork.units1 = 512



A big advantage is that we can save this config as a file; that way it is easy to track what you changed during your experiments.

In [5]:
accuracy = metrics.Accuracy()

In [29]:
import torch
gin.parse_config_file("model.gin")

units = [256, 128, 64]
loss_fn = torch.nn.CrossEntropyLoss()

settings = TrainerSettings(
    epochs=5,
    metrics=[accuracy],
    logdir="modellogs",
    train_steps=len(train),
    valid_steps=len(valid),
    reporttypes=[ReportTypes.TENSORBOARD, ReportTypes.GIN],
)

for unit1 in units:
    for unit2 in units:
        gin.bind_parameter("NeuralNetwork.units1", unit1)
        gin.bind_parameter("NeuralNetwork.units2", unit2)

        model = imagemodels.NeuralNetwork()
        trainer = Trainer(
            model=model,
            settings=settings,
            loss_fn=loss_fn,
            optimizer=optim.Adam,
            traindataloader=trainstreamer,
            validdataloader=validstreamer,
            scheduler=optim.lr_scheduler.ReduceLROnPlateau
        )
        trainer.loop()


2024-11-16 16:40:11.933 | INFO     | mltrainer.settings:check_path:61 - Created logdir C:\Users\dilek\desktop\Advanced_AI_Applications_WS24-25_MADS_HSRW\notebooks\1_pytorch_intro\modellogs
2024-11-16 16:40:11.935 | INFO     | mltrainer.trainer:dir_add_timestamp:29 - Logging to modellogs\20241116-164011
2024-11-16 16:40:12.730 | INFO     | mltrainer.trainer:__init__:70 - Found earlystop_kwargs in settings.Set to None if you dont want earlystopping.
  0%|          | 0/5 [00:00<?, ?it/s]

Run the experiment, and study the result with tensorboard. 

Locally, it is easy to do that with VS Code itself. On the server, you have to take these steps:

- in the terminal, `cd` to the location of the repository
- activate the python environment for the shell, and check that the correct environment is activated
- run `tensorboard --logdir=modellogs` in the terminal (`modellogs` is the `logdir` set in `TrainerSettings` above)
- tensorboard will launch at `localhost:6006` and VS Code will notify you that the port is forwarded
- you can either press the `launch` button in VS Code or open your local browser at `localhost:6006`


Experiment with things like:

- changing the number of units1 and units2 to values between 16 and 1024. Use powers of 2: 16, 32, 64, etc.
- changing the batchsize to values between 4 and 128. Again, use powers of two.
- all your experiments are saved in the log directory (`modellogs` in the code above), each in a folder with a timestamp. Inside you find a saved_config.gin file that contains all the settings for that experiment. The `events` file is what tensorboard will show.
- plot the result in a heatmap: units vs batchsize.
- changing the learning rate to values between 1e-2 and 1e-5
- changing the optimizer from Adam (used in the code above) to one of the other available algorithms at [torch](https://pytorch.org/docs/stable/optim.html) (scroll down for the algorithms)
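
For the last two suggestions, it can help to write the search grid down explicitly before running anything. The sketch below uses plain `torch.optim` only: the `nn.Linear` model is a placeholder, the choice of three optimizers is arbitrary, and how the optimizer class and learning rate are handed to `mltrainer` is deliberately left out, since that depends on its settings.

```python
import itertools

from torch import nn, optim

# Learning rates between 1e-2 and 1e-5, one per order of magnitude
learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]
# An arbitrary selection of optimizers from torch.optim
optimizer_classes = [optim.SGD, optim.Adam, optim.RMSprop]

# Placeholder model, only needed to give the optimizers parameters to manage
model = nn.Linear(4, 2)

configs = []
for optimizer_class, lr in itertools.product(optimizer_classes, learning_rates):
    optimizer = optimizer_class(model.parameters(), lr=lr)
    # Read the learning rate back from the optimizer to confirm it was applied
    configs.append((optimizer_class.__name__, optimizer.param_groups[0]["lr"]))

for name, lr in configs:
    print(f"{name:8s} lr={lr:g}")
```

This grid already has 12 combinations, so on a machine with limited resources it is worth running a small subset first.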

A note on `train_steps`: this setting determines how many batches make up one epoch, and therefore how often you get an update.
Because our complete dataset is about 938 batches long (60000 / 64 = 937.5), you need about 938 train steps to cover all 60,000 images.

This can be a bit confusing, because every value below 938 changes the meaning of `epoch` slightly: one epoch is no longer
the full dataset, but simply `train_steps` batches. Setting `train_steps` to 100 means you wait twice as long before you get feedback on the performance,
compared to `train_steps=50`. You will also see that setting `train_steps` to 100 improves the learning, but that is simply because the model has seen twice as
many examples per epoch as with `train_steps=50`.

This implies that it is not useful to compare `train_steps=50` and `train_steps=100` at the same number of epochs, because 100 will always look better.
Just pick a value, and adjust your number of epochs accordingly.
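
The arithmetic behind this note can be checked with a few lines of plain Python. The helper names `images_seen` and `epochs_for_budget` are made up for this sketch; only the numbers 60000 and 64 come from the text above.

```python
import math

num_images = 60_000
batchsize = 64

# 60000 / 64 = 937.5: 937 full batches, or 938 if the last partial batch counts
full_batches = num_images // batchsize
all_batches = math.ceil(num_images / batchsize)
print(full_batches, all_batches)


def images_seen(train_steps: int, batchsize: int, epochs: int) -> int:
    """Total number of images the model has seen after `epochs` epochs."""
    return train_steps * batchsize * epochs


def epochs_for_budget(total_images: int, train_steps: int, batchsize: int) -> int:
    """Epochs needed to see at least `total_images` images."""
    return math.ceil(total_images / (train_steps * batchsize))


# Halving train_steps while doubling the epochs keeps the comparison fair
print(images_seen(train_steps=50, batchsize=64, epochs=10))
print(images_seen(train_steps=100, batchsize=64, epochs=5))

# Epochs needed to cover roughly one full pass over the 60,000 images
print(epochs_for_budget(num_images, train_steps=100, batchsize=64))
```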

In [None]:
# Changing the amount of units1 and units2
units = [16, 32, 64, 128, 256, 512, 1024]  # Factors of 2 between 16 and 1024

# Experiment loop for different units1 and units2 values
for unit1 in units:
    for unit2 in units:
        gin.bind_parameter("NeuralNetwork.units1", unit1)
        gin.bind_parameter("NeuralNetwork.units2", unit2)

        model = imagemodels.NeuralNetwork()
        trainer = Trainer(
            model=model,
            settings=settings,
            loss_fn=loss_fn,
            optimizer=optim.Adam,
            traindataloader=trainstreamer,
            validdataloader=validstreamer,
            scheduler=optim.lr_scheduler.ReduceLROnPlateau
        )
        trainer.loop()

2024-11-16 18:48:04.533 | INFO     | mltrainer.trainer:dir_add_timestamp:29 - Logging to modellogs\20241116-184804
2024-11-16 18:48:04.533 | INFO     | mltrainer.trainer:__init__:70 - Found earlystop_kwargs in settings.Set to None if you dont want earlystopping.
  0%|          | 0/5 [00:00<?, ?it/s]
  0%|          | 0/937 [00:00<?, ?it/s]

Unfortunately, my computer stopped the experiment due to system resource constraints. During the grid search over units1 and units2 values ranging from 16 to 1024, the computational load exceeded the available memory and processing capacity.

My computer also crashed during the batch size experiments with values 16, 32, 64 and 128. After that, I reduced the number of epochs to 3 and tested only three batch sizes (4, 8 and 16) against the units [256, 128, 64]. The results are recorded in the new log directory, `new_modellogs`.

In [None]:
units = [256, 128, 64]
loss_fn = torch.nn.CrossEntropyLoss()
batch_sizes = [4, 8, 16]

settings = TrainerSettings(
    epochs=3,
    metrics=[accuracy],
    logdir="new_modellogs",
    train_steps=len(train),
    valid_steps=len(valid),
    reporttypes=[ReportTypes.TENSORBOARD, ReportTypes.GIN],
)

for unit1 in units:
    for unit2 in units:
        for batch_size in batch_sizes:  # Looping over batch sizes
            gin.bind_parameter("NeuralNetwork.units1", unit1)
            gin.bind_parameter("NeuralNetwork.units2", unit2)
            
            # Re-create the streamers with the new batch size
            streamers = fashionfactory.create_datastreamer(batchsize=batch_size, preprocessor=preprocessor)
            trainstreamer = streamers["train"].stream()
            validstreamer = streamers["valid"].stream()

            model = imagemodels.NeuralNetwork()
            trainer = Trainer(
                model=model,
                settings=settings,
                loss_fn=loss_fn,
                optimizer=optim.Adam,
                traindataloader=trainstreamer,
                validdataloader=validstreamer,
                scheduler=optim.lr_scheduler.ReduceLROnPlateau
            )
            trainer.loop()

2024-11-16 22:49:07.793 | INFO     | mads_datasets.base:download_data:121 - Folder already exists at C:\Users\dilek\.cache\mads_datasets\fashionmnist
2024-11-16 22:49:07.793 | INFO     | mads_datasets.base:download_data:124 - File already exists at C:\Users\dilek\.cache\mads_datasets\fashionmnist\fashionmnist.pt
2024-11-16 22:49:07.826 | INFO     | mltrainer.trainer:dir_add_timestamp:29 - Logging to new_modellogs\20241116-224907
2024-11-16 22:49:08.477 | INFO     | mltrainer.trainer:__init__:70 - Found earlystop_kwargs in settings.Set to None if you dont want earlystopping.
  0%|          | 0/3 [00:00<?, ?it/s]

I couldn't create the heatmap because the batch size was not stored in the saved configuration files. I tried to add it with `gin.bind_parameter()`, but that raised an error because `batch_size` is not a registered configurable parameter.