## Hyperparameter tuning

### Setup

In [1]:
# Config setup
import config
from functions.utility import analyze_webdataset
import mlflow
import torch
import glob
import os

# Device configuration
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

config.DEVICE = device

# Configure MLflow
mlflow.set_experiment("animals10")

# Define constants
DATA_DIR = "./data/webdataset/"

# Relative file paths
config.TRAIN_PATHS = sorted(glob.glob(os.path.join(DATA_DIR, "train-*.tar")))
config.TEST_PATHS = sorted(glob.glob(os.path.join(DATA_DIR, "test-*.tar")))

print(f"Found {len(config.TRAIN_PATHS)} training files and {len(config.TEST_PATHS)} test files")

num_classes, class_names, class_weights = analyze_webdataset(DATA_DIR, verbose=False)
print("\nTraining data summary:")
print(f"Number of classes: {num_classes}")
print(f"Class names: {class_names}")
print(f"Class weights tensor shape: {class_weights.shape}")

# Update the config module variables
config.NUM_CLASSES = num_classes
config.CLASS_NAMES = class_names
config.CLASS_WEIGHTS = class_weights

2025/05/05 13:37:47 INFO mlflow.tracking.fluent: Experiment with name 'animals10' does not exist. Creating a new experiment.


Using device: cuda:0
Found 22 training files and 3 test files

Training data summary:
Number of classes: 10
Class names: ['spider', 'dog', 'chicken', 'horse', 'butterfly', 'squirrel', 'cow', 'sheep', 'cat', 'elephant']
Class weights tensor shape: torch.Size([10])


### Hyperparameter search
Each trial is evaluated with 3-fold cross-validation.
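As a rough illustration of fold construction at the shard level, the sketch below assigns shards to folds round-robin. The function name `make_shard_folds` and the shard file names are hypothetical. The log further down suggests the project's own splitter also analyzes class distributions, which this sketch ignores.

```python
def make_shard_folds(shard_paths, k=3):
    """Assign shards to k folds round-robin and return (train, val) splits."""
    # Fold i receives every k-th shard, starting at offset i.
    folds = [shard_paths[i::k] for i in range(k)]
    splits = []
    for i, val_shards in enumerate(folds):
        # Train on every shard that is not in the validation fold.
        train_shards = [p for j, fold in enumerate(folds) if j != i for p in fold]
        splits.append((train_shards, val_shards))
    return splits


# Hypothetical shard names; the real ones come from config.TRAIN_PATHS.
shards = [f"train-{i:06d}.tar" for i in range(22)]
splits = make_shard_folds(shards, k=3)
for i, (train, val) in enumerate(splits, start=1):
    print(f"Fold {i}: train on {len(train)} shards, validate on {len(val)} shards")
```

With 22 shards and `k=3` this yields validation folds of 8, 7 and 7 shards, the same sizes reported in the cell output below.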

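Trials are pruned aggressively in two ways. A trial is stopped early if its first fold falls below a minimum validation accuracy (`first_fold_min_acc`), and Optuna's median pruning strategy is applied to the running average of validation accuracy after each fold.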
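The two pruning rules can be sketched in plain Python as follows. This is a simplified illustration, not the project's implementation: `should_prune` and its arguments are hypothetical names, and Optuna's `MedianPruner` additionally has warm-up settings and compares against a trial's best intermediate value. In real Optuna code the second rule is normally handled by calling `trial.report(value, step)` and `trial.should_prune()`.

```python
import statistics


def should_prune(fold_accs, first_fold_min_acc, earlier_running_avgs):
    """Decide whether to stop a trial after the folds completed so far.

    fold_accs: validation accuracies (%) of the folds finished in this trial.
    first_fold_min_acc: hard threshold applied to the first fold.
    earlier_running_avgs: running averages that earlier trials reported
        after the same number of folds.
    """
    # Rule 1: the first fold must reach the minimum accuracy.
    if fold_accs[0] < first_fold_min_acc:
        return True
    # Rule 2 (simplified median rule): stop if the running average is
    # below the median of what earlier trials reached at this step.
    running_avg = statistics.mean(fold_accs)
    if earlier_running_avgs and running_avg < statistics.median(earlier_running_avgs):
        return True
    return False


# Made-up accuracies for illustration only.
print(should_prune([85.0], 90.0, []))                       # below first-fold threshold
print(should_prune([95.0, 97.5], 90.0, [96.0, 95.0, 94.0]))  # above the median
```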

The metric Optuna optimizes is the lower confidence bound on the average validation accuracy, computed with the t-distribution at 80% confidence. The average is taken at the epoch that is collectively best across all folds.
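A minimal sketch of such a lower bound is shown below, assuming a one-sided bound and the sample standard deviation across folds (both are assumptions, as are the function names). To stay dependency-free it uses the closed-form t quantile for 2 degrees of freedom, which only covers the 3-fold case; for other fold counts, `scipy.stats.t.ppf(confidence, df=n_folds - 1)` would be the usual choice.

```python
import math
import statistics


def t_quantile_df2(p):
    """Quantile of Student's t with 2 degrees of freedom (closed form)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return (2 * p - 1) / math.sqrt(2 * p * (1 - p))


def lower_confidence_bound(mean, std, n_folds=3, confidence=0.80):
    """One-sided lower bound: mean - t * std / sqrt(n_folds)."""
    if n_folds != 3:
        raise ValueError("the closed-form quantile only covers 3 folds (2 degrees of freedom)")
    return mean - t_quantile_df2(confidence) * std / math.sqrt(n_folds)


# Made-up fold accuracies (%) for illustration only.
fold_accs = [97.0, 97.5, 98.0]
bound = lower_confidence_bound(statistics.mean(fold_accs), statistics.stdev(fold_accs))
print(f"Lower bound: {bound:.2f}%")
```

Plugging in the mean and spread reported in the output below (97.53 and 0.24) gives roughly 97.38 under these assumptions, which is consistent with the logged objective value but does not confirm how the project computes it.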

Each trial is logged with MLflow and can be inspected by running `mlflow ui` in the terminal.

The Optuna study is persisted in an SQLite database in the project root folder; the file name is set by the `db_path` variable.

In [2]:
# Run the k-fold cross validation optimization
from functions.hyperopt import run_kfold_optuna_optimization
db_path = "optuna_animals10_kfold.db"
k_fold_study = run_kfold_optuna_optimization(
    n_trials=1,      # Number of trials
    k=3,             # Number of folds
    verbose=True,    # Print detailed progress
    storage=db_path, # Store results in SQLite
    load_if_exists=True,
    first_fold_min_acc=90.0  # Minimum accuracy for the first fold
)

Using SQLite storage at: sqlite:///optuna_animals10_kfold.db


[I 2025-05-05 13:38:05,227] A new study created in RDB with name: animals10_kfold


Could not load existing study: 'Record does not exist.'
GPU memory: Allocated: 0.00 GB, Reserved: 0.00 GB
Analyzing class distributions in 22 shards...
Created 3 folds with the following statistics:
Fold 1: 8 shards, 7344 samples
  spider: 1525 samples (20.77%)
  dog: 1350 samples (18.38%)
  chicken: 852 samples (11.60%)
  horse: 779 samples (10.61%)
  butterfly: 584 samples (7.95%)
  squirrel: 524 samples (7.14%)
  cow: 483 samples (6.58%)
  sheep: 501 samples (6.82%)
  cat: 417 samples (5.68%)
  elephant: 329 samples (4.48%)
Fold 2: 7 shards, 7000 samples
  spider: 1380 samples (19.71%)
  dog: 1277 samples (18.24%)
  chicken: 903 samples (12.90%)
  horse: 693 samples (9.90%)
  butterfly: 570 samples (8.14%)
  squirrel: 495 samples (7.07%)
  cow: 506 samples (7.23%)
  sheep: 466 samples (6.66%)
  cat: 395 samples (5.64%)
  elephant: 315 samples (4.50%)
Fold 3: 7 shards, 7000 samples
  spider: 1408 samples (20.11%)
  dog: 1247 samples (17.81%)
  chicken: 855 samples (12.21%)
  horse: 7

Training: 0it [00:00, ?it/s]

Evaluating: 0it [00:00, ?it/s]

Train Loss: 2.0183, Train Acc: 50.59%
Val Loss: 0.7174, Val Acc: 92.58%
Epoch 2/5


Training: 0it [00:00, ?it/s]

Evaluating: 0it [00:00, ?it/s]

Train Loss: 0.5424, Train Acc: 86.13%
Val Loss: 0.2462, Val Acc: 94.92%
Epoch 3/5


Training: 0it [00:00, ?it/s]

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f4801d97600>
Traceback (most recent call last):
  File "/home/eaglewing/repo/ml/image-recognition-pipeline/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1663, in __del__
    self._shutdown_workers()
  File "/home/eaglewing/repo/ml/image-recognition-pipeline/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1646, in _shutdown_workers
    if w.is_alive():
       ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: can only test a child process

Evaluating: 0it [00:00, ?it/s]

Train Loss: 0.3031, Train Acc: 92.38%
Val Loss: 0.1756, Val Acc: 95.51%
Epoch 4/5


Training: 0it [00:00, ?it/s]

Evaluating: 0it [00:00, ?it/s]

Train Loss: 0.1833, Train Acc: 95.02%
Val Loss: 0.1302, Val Acc: 97.27%
Epoch 5/5


Training: 0it [00:00, ?it/s]

Evaluating: 0it [00:00, ?it/s]

Train Loss: 0.1496, Train Acc: 96.09%
Val Loss: 0.1417, Val Acc: 96.48%

--- Fold 2/3 ---
Training on 15 shards, validating on 7 shards

--- Fold 3/3 ---
Training on 15 shards, validating on 7 shards


[I 2025-05-05 13:40:07,741] Trial 0 finished with value: 97.37686928076316 and parameters: {'learning_rate': 0.0001329291894316216, 'batch_size': 8, 'weight_decay': 2.9380279387035354e-06, 'dropout_rate': 0.07799726016810132, 'augmentation_intensity': 'medium', 'patience': 4, 'max_epochs': 5}. Best is trial 0 with value: 97.37686928076316.


Best avg validation at epoch 4: 97.53% ± 0.24%
Objective value - t-dist Lower confidence bound (80.0%): 97.38%

K-Fold Study statistics:
  Number of finished trials: 1
  Number of pruned trials: 0
  Best trial:
    Value: 97.37686928076316 t-dist 80% lower bound
    Params:
      learning_rate: 0.0001329291894316216
      batch_size: 8
      weight_decay: 2.9380279387035354e-06
      dropout_rate: 0.07799726016810132
      augmentation_intensity: medium
      patience: 4
      max_epochs: 5
