# Demonstration: training and evaluating on the full Food101 dataset
In this notebook we train and test on Food101 using all of its classes. We split the full Food101 dataset as follows:

- Train: 70%
- Val: 10%
- Test: 20%

We use only 70% of the training split to speed up training, but we evaluate the models on the full test split (20% of the dataset).
We train an EfficientNet-B0, an EfficientNet-B2, and a MobileNet_V2 model. Each model starts from the following base configuration (the individual training cells show the exact overrides used):

  - model.pretrained: true
  - train.optimizer.lr: 0.001
  - train.batch_size: 16
  - train.epochs: 15
  - train.augmentation: TrivialAugmentWide
  - train.unfreeze_layers: 4
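
The split percentages and the 70% training subset described above translate into concrete image counts. The sketch below works them out, assuming 101 classes with 1,000 images each (the standard Food101 layout) and a split applied per class. `split_sizes` is a hypothetical helper, and whether the repository splits per class is an assumption.

```python
# Sketch: expected image counts for a 70/10/20 split of the full dataset.
# Assumes 101 classes x 1000 images and a split applied per class.

NUM_CLASSES = 101
IMAGES_PER_CLASS = 1000
SPLITS = {"train": 0.70, "val": 0.10, "test": 0.20}
TRAIN_SUBSET = 0.70  # fraction of the train split actually used for training


def split_sizes(num_classes, images_per_class, splits):
    """Return the number of images in each split, rounding per class."""
    return {
        name: num_classes * round(images_per_class * fraction)
        for name, fraction in splits.items()
    }


sizes = split_sizes(NUM_CLASSES, IMAGES_PER_CLASS, SPLITS)
used_for_training = round(sizes["train"] * TRAIN_SUBSET)

print(sizes)              # {'train': 70700, 'val': 10100, 'test': 20200}
print(used_for_training)  # 49490
```

Under these assumptions each epoch sees roughly 49,000 training images, while validation and test always use their full splits.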







# Setup

- Mount Drive.
- Clone the repository.
- Copy the repository contents to the root (`/content`).
- Install dependencies.
- Optionally remove the existing data and selected-models directories.

In [None]:
# Mount drive to save the models during training and evaluating
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
!git clone https://github.com/Jsrodrigue/food101-mlops-pipeline.git

Cloning into 'food101-mlops-pipeline'...
remote: Enumerating objects: 126, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 126 (delta 14), reused 18 (delta 13), pack-reused 92 (from 2)
Receiving objects: 100% (126/126), 23.24 MiB | 15.71 MiB/s, done.
Resolving deltas: 100% (40/40), done.


In [None]:
# Move repo to content
!rsync -av --progress /content/food101-mlops-pipeline/ /content/


sending incremental file list
./
.gitignore
            216 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=113/115)
LICENSE
          1,067 100%    1.02MB/s    0:00:00 (xfr#2, to-chk=112/115)
README.md
          6,733 100%    6.42MB/s    0:00:00 (xfr#3, to-chk=111/115)
app.py
          1,666 100%    1.59MB/s    0:00:00 (xfr#4, to-chk=110/115)
makefile
          1,592 100%    1.52MB/s    0:00:00 (xfr#5, to-chk=109/115)
requirements.txt
            377 100%  368.16kB/s    0:00:00 (xfr#6, to-chk=108/115)
.git/
.git/HEAD
             21 100%   10.25kB/s    0:00:00 (xfr#7, to-chk=101/115)
.git/config
            277 100%  135.25kB/s    0:00:00              277 100%  135

In [None]:
!pip install -r requirements.txt

Collecting mlflow==3.4.0 (from -r requirements.txt (line 10))
  Downloading mlflow-3.4.0-py3-none-any.whl.metadata (30 kB)
Collecting hydra-core==1.3.2 (from -r requirements.txt (line 11))
  Downloading hydra_core-1.3.2-py3-none-any.whl.metadata (5.5 kB)
Collecting streamlit>=1.24.1 (from -r requirements.txt (line 20))
  Downloading streamlit-1.50.0-py3-none-any.whl.metadata (9.5 kB)
Collecting mlflow-skinny==3.4.0 (from mlflow==3.4.0->-r requirements.txt (line 10))
  Downloading mlflow_skinny-3.4.0-py3-none-any.whl.metadata (31 kB)
Collecting mlflow-tracing==3.4.0 (from mlflow==3.4.0->-r requirements.txt (line 10))
  Downloading mlflow_tracing-3.4.0-py3-none-any.whl.metadata (19 kB)
Collecting docker<8,>=4.0.0 (from mlflow==3.4.0->-r requirements.txt (line 10))
  Downloading docker-7.1.0-py3-none-any.whl.metadata (3.8 kB)
Collecting fastmcp<3,>=2.0.0 (from mlflow==3.4.0->-r requirements.txt (line 10))
  Downloading fastmcp-2.12.4-py3-none-any.whl.metadata (19 kB)
Collecting graphene<4

# Downloading data
We can download the data from Google Drive or with our script `save_data.py`.

## Option 1: Download from drive

To make sure we always use the same dataset, we prefer to download the data from Google Drive.

In [None]:
# Install gdown to download from Google Drive
!pip install -q gdown

# Download from Drive
!gdown "https://drive.google.com/uc?id=1FxQ1B1GDdpwtzLGBKSW_XALEq70RVMM9"

# Unzip data:
!unzip -q /content/fullfood101_dataset.zip -d /content/

# Verify content in data
!find /content/data -type f | wc -l



Downloading...
From (original): https://drive.google.com/uc?id=1FxQ1B1GDdpwtzLGBKSW_XALEq70RVMM9
From (redirected): https://drive.google.com/uc?id=1FxQ1B1GDdpwtzLGBKSW_XALEq70RVMM9&confirm=t&uuid=74622818-58f6-49cc-a77d-55dde1447c23
To: /content/fullfood101_dataset.zip
100% 5.00G/5.00G [01:09<00:00, 71.5MB/s]
101001
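
The file count printed above can also be checked from Python, broken down by top-level folder, which makes it easier to see whether every split was extracted. This is a minimal sketch: `count_files` is a hypothetical helper, and the layout under `/content/data` is an assumption, since the notebook does not show it.

```python
from collections import Counter
from pathlib import Path


def count_files(root):
    """Count regular files under root, grouped by top-level subdirectory."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            relative = path.relative_to(root)
            # Files directly under root are grouped under ".".
            group = relative.parts[0] if len(relative.parts) > 1 else "."
            counts[group] += 1
    return counts


if __name__ == "__main__":
    data_dir = Path("/content/data")  # path used in the cell above
    if data_dir.exists():
        per_dir = count_files(data_dir)
        print(dict(per_dir), sum(per_dir.values()))
```

The total should match the output of `find /content/data -type f | wc -l`.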


## Option 2: Using our script

For this demonstration, configure the `dataset.yaml` file to download the entire Food101 dataset by setting:
 - `creation.select_mode: first`
 - `creation.samples_per_class: 1000`
 - `creation.num_classes: 101`
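
To illustrate why these three settings select the whole dataset, here is a sketch of what a `first` selection mode might look like: keep the first `num_classes` classes in alphabetical order and the first `samples_per_class` items of each. `select_first` is a hypothetical function written for illustration; the repository's actual implementation is not shown here and may differ.

```python
def select_first(samples_by_class, num_classes, samples_per_class):
    """Keep the first `num_classes` classes (alphabetically) and the first
    `samples_per_class` items of each, preserving item order."""
    chosen = sorted(samples_by_class)[:num_classes]
    return {
        name: list(samples_by_class[name])[:samples_per_class]
        for name in chosen
    }


# Toy data standing in for the real image lists.
toy = {
    "baklava": ["b1.jpg", "b2.jpg", "b3.jpg"],
    "apple_pie": ["a1.jpg", "a2.jpg", "a3.jpg"],
    "churros": ["c1.jpg", "c2.jpg"],
}
subset = select_first(toy, num_classes=2, samples_per_class=2)
print(subset)
# {'apple_pie': ['a1.jpg', 'a2.jpg'], 'baklava': ['b1.jpg', 'b2.jpg']}
```

With `num_classes=101` and `samples_per_class=1000`, the limits equal the size of Food101, so nothing would be dropped under this interpretation.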

---



Alternatively, the same settings can be passed as command-line overrides by replacing the `make prepare` cell with

`!python -m scripts.save_data dataset.creation.select_mode=first dataset.creation.samples_per_class=1000 dataset.creation.num_classes=101`



In [None]:
#!python -m scripts.save_data dataset.creation.select_mode=first dataset.creation.samples_per_class=1000 dataset.creation.num_classes=101
#!make prepare  # be sure to set the configurations as above

[INFO] Downloading Food101 dataset...
100% 5.00G/5.00G [08:40<00:00, 9.60MB/s]
[INFO] Download completed.
[INFO] Using classes: ['apple_pie', 'baby_back_ribs', 'baklava', 'beef_carpaccio', 'beef_tartare', 'beet_salad', 'beignets', 'bibimbap', 'bread_pudding', 'breakfast_burrito', 'bruschetta', 'caesar_salad', 'cannoli', 'caprese_salad', 'carrot_cake', 'ceviche', 'cheese_plate', 'cheesecake', 'chicken_curry', 'chicken_quesadilla', 'chicken_wings', 'chocolate_cake', 'chocolate_mousse', 'churros', 'clam_chowder', 'club_sandwich', 'crab_cakes', 'creme_brulee', 'croque_madame', 'cup_cakes', 'deviled_eggs', 'donuts', 'dumplings', 'edamame', 'eggs_benedict', 'escargots', 'falafel', 'filet_mignon', 'fish_and_chips', 'foie_gras', 'french_fries', 'french_onion_soup', 'french_toast', 'fried_calamari', 'fried_rice', 'frozen_yogurt', 'garlic_bread', 'gnocchi', 'greek_salad', 'grilled_cheese_sandwich', 'grilled_salmon', 'guacamole', 'gyoza', 'hamburger', 'hot_and_sour_soup', 'hot_dog', 'huevos_ranch

In [None]:
# -------------- Run If Needed -------------------------

# Remove selected models and data if exists to create our own
#!rm -rf selected_models
#!rm -rf data
#!rm -rf food101-mlops-pipeline/


# Training
Because of limited computational resources, and to avoid the session disconnecting before our experiments finish, we run the experiments one at a time and use the train script instead of the experimentation script. The models are saved to Google Drive at each step of the process.
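
When runs are launched one at a time, it helps to generate the overrides from a shared dictionary so that every model gets the same base settings. The sketch below builds the argument list for `scripts.train`; `build_command` is a hypothetical helper and not part of the repository, and the keys are the ones listed in the configuration at the top of this notebook.

```python
import shlex


def build_command(module, overrides):
    """Turn a dict of key=value overrides into an argument list."""
    args = ["python", "-m", module]
    for key, value in overrides.items():
        if isinstance(value, bool):
            value = str(value).lower()  # True -> "true"
        args.append(f"{key}={value}")
    return args


# Base settings shared by the runs (taken from the notebook introduction).
base = {
    "model.pretrained": True,
    "train.optimizer.lr": 0.001,
    "train.batch_size": 16,
    "train.epochs": 15,
    "train.augmentation": "TrivialAugmentWide",
    "train.unfreeze_layers": 4,
}

for version in ("b0", "b2"):
    overrides = {"model": "efficientnet", "model.version": version, **base}
    # Print a copy-pasteable shell command; nothing is executed here.
    print(shlex.join(build_command("scripts.train", overrides)))
```

The resulting list could be passed to `subprocess.run(args, check=True)` instead of being printed, which keeps one run per cell while avoiding hand-edited commands.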

## EfficientNet-B0

In [None]:
!python -m scripts.train \
    model=efficientnet model.version=b0 \
    model.pretrained=true \
    train.optimizer.lr=0.001 \
    train.batch_size=16 \
    train.epochs=15 \
    train.augmentation=TrivialAugmentWide \
    train.unfreeze_layers=4 \
    train.subset_percentage=0.7 \
    train.scheduler.type=ReduceLROnPlateau \
    outputs.local.mlflow.path="/content/drive/MyDrive/mlruns" \
    outputs.local.mlflow.artifact_dir="/content/drive/MyDrive/artifacts"

Downloading: "https://download.pytorch.org/models/efficientnet_b0_rwightman-7f5810bc.pth" to /root/.cache/torch/hub/checkpoints/efficientnet_b0_rwightman-7f5810bc.pth
100% 20.5M/20.5M [00:00<00:00, 174MB/s]
[INFO] Starting training...
Training:   0% 0/15 [00:00<?, ?it/s][INFO] Starting train step...
[INFO] Starting val step...
Epoch 1: Train: accuracy 0.3524, precision_macro 0.3374, recall_macro 0.3526, f1_macro 0.3409, loss 2.8078
        Val: accuracy 0.5173, precision_macro 0.5294, recall_macro 0.5173, f1_macro 0.5075, loss 1.9807

[INFO] New best model saved at epoch 0 with val_loss=1.9807
Training:   7% 1/15 [07:56<1:51:16, 476.92s/it][INFO] Starting train step...
[INFO] Starting val step...
Epoch 2: Train: accuracy 0.4334, precision_macro 0.4215, recall_macro 0.4337, f1_macro 0.4255, loss 2.3500
        Val: accuracy 0.5383, precision_macro 0.5441, recall_macro 0.5383, f1_macro 0.5281, loss 1.8642

[INFO] Ne

In [None]:
# Continue training from the previous run for 5 more epochs
!python -m scripts.retrain \
    outputs.local.mlflow.path=/content/drive/MyDrive/mlruns \
    outputs.local.mlflow.artifact_dir=/content/drive/MyDrive/artifacts \
    retrain.run_path=/content/drive/MyDrive/mlruns/186912651824771251/70dc7de78d194823a1b5a06eaf820444 \
    retrain.epochs_extra=5 \
    train.optimizer.lr=0.001

[INFO] Continuing training from run /content/drive/MyDrive/mlruns/186912651824771251/70dc7de78d194823a1b5a06eaf820444
[WARN] No training_results.json found, reconstructing from metrics folder...
Training:   0% 0/5 [00:00<?, ?it/s][INFO] Starting train step...
[INFO] Starting val step...
Epoch 14: Train: accuracy 0.7727, precision_macro 0.7719, recall_macro 0.7727, f1_macro 0.7720, loss 0.8341
        Val: accuracy 0.7582, precision_macro 0.7640, recall_macro 0.7582, f1_macro 0.7563, loss 1.0241

Training:  20% 1/5 [08:37<34:30, 517.55s/it][INFO] Starting train step...
[INFO] Starting val step...
Epoch 15: Train: accuracy 0.7858, precision_macro 0.7849, recall_macro 0.7858, f1_macro 0.7851, loss 0.7839
        Val: accuracy 0.7621, precision_macro 0.7640, recall_macro 0.7621, f1_macro 0.7595, loss 0.9885

[INFO] New best model saved at epoch 14 with val_loss=0.9885
Training:  40% 2/5 [17:21<26:04, 521.44s/it][INFO] Starting train step...
[INFO] Starting val step...
Epoch 16: Train: accu

## EfficientNet-B2

From this point on, I changed the strategy: train for two epochs with the layers frozen (warm-up training), then unfreeze the layers.
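
The warm-up idea can be sketched independently of any framework: during the first epochs no backbone block is trainable, and afterwards the last few blocks are unfrozen. Both functions below are hypothetical illustrations, not the repository's code. They assume the classifier head is always trainable and handled separately, and that each parameter exposes a `requires_grad` attribute, as PyTorch parameters do.

```python
from types import SimpleNamespace


def trainable_block_count(epoch, warmup_epochs, unfreeze_layers):
    """Number of trailing backbone blocks to train at a 0-based epoch."""
    return 0 if epoch < warmup_epochs else unfreeze_layers


def set_trainable(blocks, n_unfrozen):
    """Freeze every block except the last `n_unfrozen`.

    `blocks` is a list of parameter lists, ordered from input to output.
    """
    first_unfrozen = len(blocks) - n_unfrozen
    for index, params in enumerate(blocks):
        for param in params:
            param.requires_grad = index >= first_unfrozen


# Stand-in backbone: 9 blocks with one fake parameter each.
blocks = [[SimpleNamespace(requires_grad=True)] for _ in range(9)]

for epoch in range(4):
    n = trainable_block_count(epoch, warmup_epochs=2, unfreeze_layers=4)
    set_trainable(blocks, n)
    trainable = sum(p.requires_grad for block in blocks for p in block)
    print(f"epoch {epoch}: {trainable} trainable blocks")
# epoch 0: 0 trainable blocks
# epoch 1: 0 trainable blocks
# epoch 2: 4 trainable blocks
# epoch 3: 4 trainable blocks
```

With a torchvision model, `blocks` could be built as `[list(b.parameters()) for b in model.features]`, assuming the backbone is exposed as `model.features`.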

In [None]:
!python -m scripts.train \
    model=efficientnet model.version=b2 \
    model.pretrained=true \
    train.optimizer.lr=0.001 \
    train.batch_size=16 \
    train.epochs=15 \
    train.augmentation=TrivialAugmentWide \
    train.unfreeze_layers=4 \
    train.subset_percentage=0.7 \
    train.scheduler.type=ReduceLROnPlateau \
    outputs.local.mlflow.path="/content/drive/MyDrive/mlruns" \
    outputs.local.mlflow.artifact_dir="/content/drive/MyDrive/artifacts"



[INFO] Starting training...
Training:   0% 0/15 [00:00<?, ?it/s][INFO] Starting train step...
[INFO] Starting val step...
Epoch 1: Train: accuracy 0.3791, precision_macro 0.3647, recall_macro 0.3793, f1_macro 0.3687, loss 2.7123
        Val: accuracy 0.5472, precision_macro 0.5520, recall_macro 0.5472, f1_macro 0.5352, loss 1.8664

[INFO] New best model saved at epoch 0 with val_loss=1.8664
Training:   7% 1/15 [09:38<2:15:01, 578.68s/it][INFO] Starting train step...
[INFO] Starting val step...
Epoch 2: Train: accuracy 0.4654, precision_macro 0.4548, recall_macro 0.4657, f1_macro 0.4586, loss 2.2057
        Val: accuracy 0.5726, precision_macro 0.5702, recall_macro 0.5726, f1_macro 0.5588, loss 1.7110

[INFO] New best model saved at epoch 1 with val_loss=1.7110
Training:  13% 2/15 [19:28<2:06:45, 585.02s/it][INFO] Starting train step...
[INFO] Starting val step...
Epoch 3: Train: accuracy 0.5498, precision_macro 0.5429, recall_macro 0.5499, f1_macro 0.5454, loss 1.7720
        Val: accu

In [None]:
# Continue training
!python -m scripts.retrain \
outputs.local.mlflow.path=/content/drive/MyDrive/mlruns \
outputs.local.mlflow.artifact_dir=/content/drive/MyDrive/artifacts \
retrain.run_path=/content/drive/MyDrive/mlruns/186912651824771251/5b50bf196aa441f794d71e132c2dff47 \
retrain.epochs_extra=7


Downloading: "https://download.pytorch.org/models/efficientnet_b2_rwightman-c35c1473.pth" to /root/.cache/torch/hub/checkpoints/efficientnet_b2_rwightman-c35c1473.pth
100% 35.2M/35.2M [00:00<00:00, 168MB/s]
[INFO] Continuing training from run /content/drive/MyDrive/mlruns/186912651824771251/5b50bf196aa441f794d71e132c2dff47
[WARN] No training_results.json found, reconstructing from metrics folder...
Training:   0% 0/7 [00:00<?, ?it/s][INFO] Starting train step...
[INFO] Starting val step...
Epoch 7: Train: accuracy 0.7396, precision_macro 0.7383, recall_macro 0.7396, f1_macro 0.7387, loss 0.9681
        Val: accuracy 0.7689, precision_macro 0.7758, recall_macro 0.7689, f1_macro 0.7678, loss 0.9188

[INFO] New best model saved at epoch 6 with val_loss=0.9188
Training:  14% 1/7 [10:51<1:05:08, 651.34s/it][INFO] Starting train step...
[INFO] Starting val step...
Epoch 8: Train: accuracy 0.7644, precision_macro 0.7632, recall_macro 0.7644, f1_macro 0.7636, loss 0.8621
        Val: accuracy 

In [None]:
# Continue training with a different learning rate, if desired
!python -m scripts.retrain \
outputs.local.mlflow.path=/content/drive/MyDrive/mlruns \
outputs.local.mlflow.artifact_dir=/content/drive/MyDrive/artifacts \
retrain.run_path=/content/drive/MyDrive/mlruns/186912651824771251/eceb4829e7114078a4613a547112585a \
retrain.epochs_extra=3 \
train.optimizer.lr=0.0005  ## Should change manually to 0.001 in tracking for proper value

Downloading: "https://download.pytorch.org/models/efficientnet_b2_rwightman-c35c1473.pth" to /root/.cache/torch/hub/checkpoints/efficientnet_b2_rwightman-c35c1473.pth
100% 35.2M/35.2M [00:00<00:00, 201MB/s]
[INFO] Continuing training from run /content/drive/MyDrive/mlruns/186912651824771251/eceb4829e7114078a4613a547112585a
[INFO] Loading previous training results from /content/drive/MyDrive/mlruns/186912651824771251/eceb4829e7114078a4613a547112585a/artifacts/metrics/training_results.json
Training:   0% 0/3 [00:00<?, ?it/s][INFO] Starting train step...
[INFO] Starting val step...
Epoch 14: Train: accuracy 0.8610, precision_macro 0.8606, recall_macro 0.8610, f1_macro 0.8607, loss 0.4971
        Val: accuracy 0.7761, precision_macro 0.7833, recall_macro 0.7761, f1_macro 0.7761, loss 0.9957

Training:  33% 1/3 [11:17<22:35, 677.91s/it][INFO] Starting train step...
[INFO] Starting val step...
Epoch 15: Train: accuracy 

## MobileNet

In [None]:
!python -m scripts.train \
    model=mobilenet model.version=v2 \
    model.pretrained=true \
    train.optimizer.lr=0.001 \
    train.batch_size=16 \
    train.epochs=15 \
    train.augmentation=TrivialAugmentWide \
    train.unfreeze_layers=14 \
    train.subset_percentage=0.7 \
    train.scheduler.type=ReduceLROnPlateau \
    outputs.local.mlflow.path=/content/drive/MyDrive/mlruns \
    outputs.local.mlflow.artifact_dir=/content/drive/MyDrive/artifacts


[INFO] Starting training...
Training:   0% 0/15 [00:00<?, ?it/s][INFO] Starting train step...
[INFO] Starting val step...
Epoch 1: Train: accuracy 0.3560, precision_macro 0.3409, recall_macro 0.3563, f1_macro 0.3446, loss 2.7929
        Val: accuracy 0.5153, precision_macro 0.5325, recall_macro 0.5153, f1_macro 0.5034, loss 1.9564

[INFO] New best model saved at epoch 0 with val_loss=1.9564
Training:   7% 1/15 [06:32<1:31:32, 392.30s/it][INFO] Starting train step...
[INFO] Starting val step...
Epoch 2: Train: accuracy 0.4508, precision_macro 0.4399, recall_macro 0.4510, f1_macro 0.4437, loss 2.2421
        Val: accuracy 0.5439, precision_macro 0.5534, recall_macro 0.5439, f1_macro 0.5313, loss 1.8155

[INFO] New best model saved at epoch 1 with val_loss=1.8155
Training:  13% 2/15 [13:06<1:25:13, 393.38s/it][INFO] Starting train step...
[INFO] Starting val step...
Epoch 3: Train: accuracy 0.4674, precision_macro 0.4586, recall_macro 0.4676, f1_macro 0.4614, loss 2.1328
        Val: accu

# Test

In [None]:
# Select up to 7 runs (top_k=7)

!python -m scripts.select_models select_model.top_k=7 \
select_model.source_runs_dir=/content/drive/MyDrive/mlruns/186912651824771251 \
select_model.target_selected_models_dir=/content/drive/MyDrive/selected_models


[INFO] Loaded 101 class names
[INFO] Updated best_model_info.json at /content/drive/MyDrive/selected_models/efficientnet_b2_3b030/artifacts/best_model_info/best_model_info.json
[INFO] Updated best_model_info.json at /content/drive/MyDrive/selected_models/efficientnet_b2_c9f35/artifacts/best_model_info/best_model_info.json
[INFO] Updated best_model_info.json at /content/drive/MyDrive/selected_models/efficientnet_b0_c5274/artifacts/best_model_info/best_model_info.json
[INFO] Updated best_model_info.json at /content/drive/MyDrive/selected_models/efficientnet_b0_e37ff/artifacts/best_model_info/best_model_info.json
[INFO] Updated best_model_info.json at /content/drive/MyDrive/selected_models/efficientnet_b0_5e573/artifacts/best_model_info/best_model_info.json
[INFO] Updated best_model_info.json at /content/drive/MyDrive/selected_models/mobilenet_v2_68ac8/artifacts/best_model_info/best_model_info.json
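
A top-k selection of this kind can be sketched as a sort followed by a slice. The function below is a hypothetical illustration: the ranking metric (`val_loss`, lower is better) is an assumption, and the run names and values are toy data, not results from this notebook.

```python
def select_top_k(runs, top_k, metric="val_loss", lower_is_better=True):
    """Return the best `top_k` runs according to `metric`."""
    ranked = sorted(
        runs,
        key=lambda run: run[metric],
        reverse=not lower_is_better,
    )
    # Slicing never fails: with fewer than top_k runs, all are returned.
    return ranked[:top_k]


# Toy runs for illustration only.
runs = [
    {"name": "run_a", "val_loss": 1.20},
    {"name": "run_b", "val_loss": 0.95},
    {"name": "run_c", "val_loss": 1.05},
]

print([r["name"] for r in select_top_k(runs, top_k=2)])  # ['run_b', 'run_c']
print(len(select_top_k(runs, top_k=7)))                  # 3
```

If fewer than `top_k` runs are available, a slice simply returns all of them, which would be consistent with six models being listed above for `top_k=7`.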


In [None]:
!python -m scripts.test test.runs_dir=/content/drive/MyDrive/selected_models


[INFO] Testing 6 models...
[INFO] Collected transforms for model: efficientnet, version: b2
[INFO] Collected transforms for model: efficientnet, version: b2
[INFO] Collected transforms for model: efficientnet, version: b0
[INFO] Collected transforms for model: efficientnet, version: b0
[INFO] Collected transforms for model: efficientnet, version: b0
[INFO] Collected transforms for model: mobilenet, version: v2
[INFO] Using loss function: CrossEntropyLoss
[INFO] Using metrics: ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
[INFO] Device: cuda
Evaluating models:   0% 0/6 [00:00<?, ?model/s]
[INFO] Evaluating run: efficientnet_b2_3b030 (Model: efficientnet, Version: b2)
[INFO] Updated test results in /content/drive/MyDrive/selected_models/efficientnet_b2_3b030/artifacts/best_model_info/best_model_info.json
[RESULTS] Metrics:
  accuracy: 0.8153
  precision_macro: 0.8160
  recall_macro: 0.8153
  f1_macro: 0.8145
  loss: 0.8013
  confusion_matrix: (array or list, not displayed)
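
The reported metrics can all be derived from the confusion matrix. The sketch below computes accuracy and macro-averaged precision, recall, and F1 (the unweighted mean of the per-class values) in plain Python. `macro_metrics` is a hypothetical helper, the toy matrix is illustrative, and returning 0.0 for empty classes is a choice made here, not necessarily the repository's.

```python
def macro_metrics(confusion):
    """confusion[i][j] = samples with true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        actual = sum(confusion[k])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": correct / total,
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


# Toy 3-class matrix with 10 samples per true class (balanced).
toy_confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
metrics = macro_metrics(toy_confusion)
print({name: round(value, 4) for name, value in metrics.items()})
```

When every class has the same number of test samples, macro recall equals accuracy, which matches the identical `accuracy` and `recall_macro` values in the output above.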
