# Practical machine learning and deep learning. Lab 2
# MLOPS part 1

### No competition this week!

In this lab, we continue exploring the ClearML framework. In particular, we will learn how to:
- track experiments
- run hyperparameter optimization
- share experiments with other users
- detect data drift
- deploy a ClearML server locally


In [1]:
!pip install clearml
# Quote the extras so that shells such as zsh do not treat the brackets as a glob pattern
!pip install "alibi-detect[torch]"


### 0. ClearML installation (repeated from the previous lab)

1) Sign up at [ClearML](https://clear.ml).
2) Install ClearML as a Python package: `pip install clearml`.
3) Get [credentials](https://app.clear.ml/settings/workspace-configuration) to connect your notebook to the remote server. When creating new credentials, pick the Jupyter notebook tab.

4) Set these environment variables:

In [None]:
%env CLEARML_WEB_HOST=https://app.clear.ml/
%env CLEARML_API_HOST=https://api.clear.ml
%env CLEARML_FILES_HOST=https://files.clear.ml
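# Replace the ... placeholders with your own API access key and secret key.
# Keep comments on their own lines: %env takes everything after "=" as the value.
%env CLEARML_API_ACCESS_KEY=...
%env CLEARML_API_SECRET_KEY=...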

5) Run the following cell to initialize ClearML:

In [2]:
!clearml-init

ClearML SDK setup process
Configuration file already exists: /Users/dmitry057/clearml.conf
Leaving setup, feel free to edit the configuration file.
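
Before moving on, it can be worth confirming that the variables from step 4 are visible to the notebook process. The helper below is a minimal sketch and not part of ClearML: `missing_clearml_vars` is a hypothetical name, and it only checks that each variable is set and is not the `...` placeholder. If your credentials live in `clearml.conf` instead, an empty environment is not necessarily a problem.

```python
import os

# Variable names taken from step 4 above.
REQUIRED_VARS = (
    "CLEARML_WEB_HOST",
    "CLEARML_API_HOST",
    "CLEARML_FILES_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)


def missing_clearml_vars(env=None):
    """Return the required variables that are unset, empty, or still '...'."""
    env = os.environ if env is None else env
    return [
        name for name in REQUIRED_VARS
        if env.get(name, "").strip() in ("", "...")
    ]


missing = missing_clearml_vars()
print("Missing:", ", ".join(missing) if missing else "none")
```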


### 1. Tracking of CNN training on CIFAR10

We start from a pipeline similar to the one in the previous lab. First, you are asked to train a ResNet18 on the CIFAR10 dataset.

#### 1.1 ClearML init

In [3]:
import torch, torch.nn as nn, torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models
from clearml import Task
from tqdm import tqdm

In [4]:
# Initialize ClearML task
task = Task.init(
    project_name='Cifar10',
    task_name='ResNet18_Training_Base',
    task_type=Task.TaskTypes.training
)

ClearML Task: overwriting (reusing) task id=7989e27ef26c415b826665d5e6a8f069
2025-09-03 13:35:05,565 - clearml.Task - INFO - Storing jupyter notebook directly as code
ClearML results page: https://clearml.touchtopnotch.com/projects/ae77eb2837d4412ca405daf2291a64b0/experiments/7989e27ef26c415b826665d5e6a8f069/output/log



ClearML can track hyperparameters as well. We are going to tune them later.

In [5]:
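# connect() registers the defaults and returns any values overridden in the ClearML task
params = task.connect({'learning_rate': 0.01, 'weight_decay': 5e-4})
lr, wd = params['learning_rate'], params['weight_decay']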
logger = task.get_logger()
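
The idea behind tracked hyperparameters is that the notebook supplies defaults, while a cloned task may carry overrides. The sketch below mimics that merge in plain Python. It is an illustration only: `resolve_params` is a hypothetical helper, and the `General/` prefix and the string-typed override values are assumptions about how the overrides are stored.

```python
def resolve_params(defaults, overrides, section="General"):
    """Merge defaults with overrides keyed as 'section/name'.

    Override values may arrive as strings, so each one is cast to the
    type of the corresponding default.
    """
    resolved = {}
    for name, default in defaults.items():
        raw = overrides.get(f"{section}/{name}", default)
        resolved[name] = type(default)(raw)
    return resolved


defaults = {"learning_rate": 0.01, "weight_decay": 5e-4}
overrides = {"General/learning_rate": "0.0301"}  # e.g. set by an HPO trial
params = resolve_params(defaults, overrides)
print(params)
```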

#### 1.2 Let's start the training!

In [6]:
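# Use Apple's MPS backend when available, otherwise fall back to CUDA or the CPU
device = torch.device('mps' if torch.backends.mps.is_available()
                      else 'cuda' if torch.cuda.is_available() else 'cpu')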

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010)),
])
transform_val = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010)),
])

train_ds = datasets.CIFAR10(root='./data', train=True,
                            download=True, transform=transform_train)
val_ds   = datasets.CIFAR10(root='./data', train=False,
                            download=True, transform=transform_val)

train_loader = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4)
val_loader   = DataLoader(val_ds,   batch_size=128, shuffle=False, num_workers=4)


Files already downloaded and verified
Files already downloaded and verified


In [None]:
model = models.resnet18()
model.fc = nn.Linear(model.fc.in_features, 10)
model = model.to(device)

criterion = nn.CrossEntropyLoss()
hp_optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=wd)

In [None]:
import torch

epochs = 10
for epoch in (bar := tqdm(range(epochs))):
    model.train()
    batch_loss = 0.0
    correct = 0
    total = 0
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        hp_optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        hp_optimizer.step()

        batch_loss += loss.item() * inputs.size(0)
        _, preds = outputs.max(1)
        correct += preds.eq(targets).sum().item()
        total += inputs.size(0)


    train_loss = batch_loss / total
    acc_train = correct / total

    # Validation
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            val_loss += loss.item() * inputs.size(0)
            _, preds = outputs.max(1)
            correct += preds.eq(targets).sum().item()
            total += inputs.size(0)

    val_loss /= total
    acc_val = correct / total

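    # Log to ClearML
    logger.report_scalar(title='train/loss',
                         series='epoch', value=train_loss, iteration=epoch)
    logger.report_scalar(title='val/loss',
                         series='epoch', value=val_loss, iteration=epoch)
    logger.report_scalar(title='train/accuracy',
                         series='epoch', value=acc_train, iteration=epoch)
    logger.report_scalar(title='val/accuracy',
                         series='epoch', value=acc_val, iteration=epoch)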
    bar.set_description(f'Train acc: {acc_train:.4f} | Val acc: {acc_val:.4f}')

torch.save(model.state_dict(), 'model.pt')
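
The loop above multiplies each batch loss by the batch size before summing, because the last batch of an epoch is usually smaller than the others. Here is a small, framework-free sketch of the difference; the loss values are made up for illustration and `epoch_mean` is a hypothetical helper.

```python
def epoch_mean(batch_losses, batch_sizes):
    """Sample-weighted mean, matching `loss.item() * inputs.size(0)` above."""
    total = sum(batch_sizes)
    return sum(l * n for l, n in zip(batch_losses, batch_sizes)) / total


# CIFAR10 has 50,000 training images; with batch_size=128 this gives
# 390 full batches and one final batch of 80 images.
batch_sizes = [128] * 390 + [80]

# Made-up per-batch mean losses: 1.0 everywhere, 2.0 on the short batch.
batch_losses = [1.0] * 390 + [2.0]

weighted = epoch_mean(batch_losses, batch_sizes)
naive = sum(batch_losses) / len(batch_losses)
print(f"weighted={weighted:.5f} naive={naive:.5f}")
```

The naive mean over batches gives the short final batch the same weight as a full one, so it overstates that batch's influence on the epoch loss.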


#### 1.3 Hyperparameters tuning

The training seems fine, but you may notice that we chose the hyperparameters somewhat arbitrarily. Now we are going to tune them using ClearML's hyperparameter optimization feature.

In [9]:
!pip install optuna

Collecting optuna
  Downloading optuna-4.5.0-py3-none-any.whl.metadata (17 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.16.5-py3-none-any.whl.metadata (7.3 kB)
Collecting colorlog (from optuna)
  Using cached colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Using cached mako-1.3.10-py3-none-any.whl.metadata (2.9 kB)
Downloading optuna-4.5.0-py3-none-any.whl (400 kB)
Downloading alembic-1.16.5-py3-none-any.whl (247 kB)
Using cached colorlog-6.9.0-py3-none-any.whl (11 kB)
Using cached mako-1.3.10-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.10 alembic-1.16.5 colorlog-6.9.0 optuna-4.5.0


In [10]:
# A few more imports
from clearml.automation import HyperParameterOptimizer
from clearml.automation import UniformParameterRange, UniformIntegerParameterRange
from clearml.automation.optuna import OptimizerOptuna
from clearml import Task

# Create a new optimizer task (just a controller, not the actual training)
optim_task = Task.create(
    project_name='Cifar10',
    task_name='ResNet18_HPO',
    task_type=Task.TaskTypes.optimizer,
)





In [11]:
# Define search space
hyper_parameters = [
    UniformParameterRange('learning_rate',      min_value=1e-4,  max_value=1e-1,   step_size=0.01),
    UniformIntegerParameterRange('batch_size',  min_value=64,    max_value=128,    step_size=64),
    UniformParameterRange('weight_decay',       min_value=0,     max_value=1e-2,   step_size=1e-3),
]
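
Each range above describes a discrete set of candidate values: start at `min_value` and keep adding `step_size` while the result stays within `max_value`. The sketch below enumerates those values in plain Python. It is a simplified illustration, not ClearML's or Optuna's sampling code, and `grid_values` is a hypothetical helper.

```python
import math


def grid_values(min_value, max_value, step_size, ndigits=10):
    """List min_value + k * step_size for all k that stay within max_value."""
    # The small epsilon guards against floating-point error in the division.
    n_steps = math.floor((max_value - min_value) / step_size + 1e-9)
    return [round(min_value + k * step_size, ndigits) for k in range(n_steps + 1)]


learning_rates = grid_values(1e-4, 1e-1, 0.01)
batch_sizes = grid_values(64, 128, 64)
weight_decays = grid_values(0, 1e-2, 1e-3)

print(learning_rates)
print(batch_sizes)
print(weight_decays)
print("combinations:", len(learning_rates) * len(batch_sizes) * len(weight_decays))
```

Because the learning-rate range starts at 1e-4 and steps by 0.01, its candidates are offset values such as 0.0301 (1e-4 + 3 × 0.01) rather than round numbers, which matches the values reported in the optimizer log further down.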

In [None]:
# Hyper‑parameter optimizer instance
optimizer = HyperParameterOptimizer(
    base_task_id=task.id, # specify the id of previous task
    hyper_parameters=hyper_parameters,
    objective_metric_title='val/accuracy',  # we want to **maximize** validation accuracy
    objective_metric_series='epoch',
    objective_metric_sign='max',
    optimizer_class=OptimizerOptuna,

    # how many tasks to launch at a time
    max_number_of_concurrent_tasks=4,
    max_iteration_per_job=100,
    # optional time limits (ClearML expects minutes, not seconds)
    optimization_time_limit=10.,  # 10 minutes for the whole sweep
    time_limit_per_job=5.,        # 5 minutes per task
    total_max_jobs=10,            # at most 10 hyperparameter combinations
)

# Start the sweep (the controller runs locally and pushes the trial tasks to the queue)
optimizer.start()
print('HPO started – view the results in the ClearML UI')



[I 2025-09-03 13:36:38,949] A new study created in memory with name: 7989e27ef26c415b826665d5e6a8f069


HPO started – view the results in the ClearML UI




Progress report #0 completed, sleeping for 0.25 minutes
2025-09-03 13:36:40,443 - clearml.automation.optimization - INFO - Creating new Task: {'learning_rate': 0.0301, 'batch_size': 64, 'weight_decay': 0.005}
2025-09-03 13:36:41,022 - clearml.automation.optimization - INFO - Creating new Task: {'learning_rate': 0.07010000000000001, 'batch_size': 64, 'weight_decay': 0.001}
2025-09-03 13:36:41,731 - clearml.automation.optimization - INFO - Creating new Task: {'learning_rate': 0.0001, 'batch_size': 128, 'weight_decay': 0.005}
2025-09-03 13:36:42,172 - clearml.automation.optimization - INFO - Creating new Task: {'learning_rate': 0.0201, 'batch_size': 128, 'weight_decay': 0.001}
Progress report #1 completed, sleeping for 5.0 minutes


Check [the documentation](https://clear.ml/docs/latest/docs/webapp/applications/apps_hpo/) to learn more about hyperparameter optimization and to configure your own search parameters.
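
The limits passed to the optimizer interact, so it is worth doing the arithmetic before launching a sweep. The sketch below assumes, pessimistically, that every trial uses its full per-task time limit and that trials run in synchronized waves; `sweep_budget` is a hypothetical helper, not a ClearML function.

```python
import math


def sweep_budget(total_jobs, concurrent_jobs, minutes_per_job):
    """Return (number of waves, worst-case wall-clock minutes) for a sweep."""
    waves = math.ceil(total_jobs / concurrent_jobs)
    return waves, waves * minutes_per_job


# Values intended by the optimizer above: 10 trials, 4 at a time, 5 minutes each.
waves, worst_case = sweep_budget(total_jobs=10, concurrent_jobs=4, minutes_per_job=5)
print(f"{waves} waves, up to {worst_case} minutes")
```

If the worst case exceeds the time allowed for the whole sweep, the last trials may be cut short, so either raise the overall limit or lower the per-task one.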

**Task**: run the code above and tune the hyperparameters. Optionally, share the training results with your friend(s): open the ClearML UI, find the project `Cifar10` and the task `ResNet18_Training_Base:(some hyperparameters)`, click on it, and create a shareable link in the `Share` tab.

## 2. Detecting data drift

Data drift is a change in the distribution of the data a model receives relative to the data it was trained on. It can happen, for example, when the data is collected at a different time or under different conditions than the training set.

ClearML cannot detect data drift by itself, but we can use [alibi-detect](https://alibi-detect.readthedocs.io/en/latest/) for that.

The alibi-detect framework provides various methods for detecting data corruption and data drift. In this lab, we will use the Learned Kernel method. It is closely related to the [classifier drift detector](https://docs.seldon.io/projects/alibi-detect/en/latest/cd/methods/classifierdrift.html), which trains a classifier to discriminate between instances from the reference window and instances from the test window. The difference here is that we train a kernel to output high similarity for instances from the same window and low similarity for instances from different windows. If this is possible in a generalisable manner, then drift must have occurred.

In practice, the Learned Kernel method amounts to training a model that tells the two windows apart, much like a drift classifier.
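
To make the kernel idea concrete, here is a minimal sketch of a kernel two-sample test on one-dimensional data: it computes a maximum mean discrepancy (MMD) statistic with a fixed Gaussian kernel and obtains a p-value by permutation. This is an illustration under simplifying assumptions, not alibi-detect's implementation. The Learned Kernel method replaces the fixed kernel with a trained one, and all function names here are hypothetical.

```python
import math
import random


def rbf(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))


def mmd2(x, y, kernel=rbf):
    """Biased estimate of the squared maximum mean discrepancy."""
    k_xx = sum(kernel(a, b) for a in x for b in x) / (len(x) ** 2)
    k_yy = sum(kernel(a, b) for a in y for b in y) / (len(y) ** 2)
    k_xy = sum(kernel(a, b) for a in x for b in y) / (len(x) * len(y))
    return k_xx + k_yy - 2 * k_xy


def permutation_p_value(x, y, n_permutations=200, seed=0):
    """Share of random re-splits whose MMD^2 is at least the observed one."""
    rng = random.Random(seed)
    observed = mmd2(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


reference = [i / 10 for i in range(20)]   # reference window
same = list(reference)                    # test window with no drift
shifted = [v + 5.0 for v in reference]    # test window with a clear shift

p_same = permutation_p_value(reference, same)
p_shifted = permutation_p_value(reference, shifted)
print(f"no shift: p={p_same:.3f}, drift flagged: {p_same < 0.05}")
print(f"shifted:  p={p_shifted:.3f}, drift flagged: {p_shifted < 0.05}")
```

A window is flagged as drifted when its p-value falls below the chosen threshold, which plays the same role as the `p_val` argument of the detector used later in this section.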

### 2.1 Loading CIFAR10 and CIFAR10C

We already have the CIFAR10 dataset loaded through torchvision. However, to keep the example simple, we will load the `tensorflow` version of it.

In [None]:
import tensorflow as tf
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c
import numpy as np

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
y_train = y_train.astype('int64').reshape(-1,)
y_test = y_test.astype('int64').reshape(-1,)

# Extract data with some corruptions
corruption = ['gaussian_noise', 'motion_blur', 'brightness', 'pixelate']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255


Let's select a random half of the test data as the reference window and the other half as the test window.

In [None]:
np.random.seed(0)
n_test = X_test.shape[0]
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
idx_h0 = np.delete(np.arange(n_test), idx, axis=0)
X_ref, y_ref = X_test[idx], y_test[idx]
X_h0, y_h0 = X_test[idx_h0], y_test[idx_h0]
print(X_ref.shape, X_h0.shape)

In [None]:
# Permute to NCHW
def permute_c(x):
    return np.transpose(x.astype(np.float32), (0, 3, 1, 2))

n_corr = len(corruption)
X_c = [X_corr[i * n_test:(i + 1) * n_test] for i in range(n_corr)]

X_ref_pt = permute_c(X_ref)
X_h0_pt = permute_c(X_h0)
X_c_pt = [permute_c(xc) for xc in X_c]
print(X_ref_pt.shape, X_h0_pt.shape, X_c_pt[0].shape)

### 2.2 Training Learned Kernel Drift detector

First, we need to define a kernel projection:

In [None]:
from alibi_detect.utils.pytorch.kernels import DeepKernel


proj = nn.Sequential(
    nn.Conv2d(3, 8, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(8, 16, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(16, 32, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Flatten(),
).to(device)

kernel = DeepKernel(proj, eps=0.01)


Second, we need to pass the kernel to the detector:

In [None]:
from alibi_detect.cd import LearnedKernelDrift
from alibi_detect.saving import save_detector, load_detector

cd = LearnedKernelDrift(X_ref_pt, kernel, backend='pytorch', p_val=.05, epochs=4)

# Save detector
filepath = 'torch_detector'
save_detector(cd, filepath)

# Load detector
cd = load_detector(filepath)

Finally, compare the corrupted and original CIFAR10 datasets:

In [None]:
def make_predictions(cd, x_h0, x_corr, corruption):
    labels = ['No!', 'Yes!']
    preds = cd.predict(x_h0)
    print('No corruption')
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print(f'p-value: {preds["data"]["p_val"]:.3f}')

    if isinstance(x_corr, list):
        for x, c in zip(x_corr, corruption):
            preds = cd.predict(x)
            print('')
            print(f'Corruption type: {c}')
            print('Drift? {}'.format(labels[preds['data']['is_drift']]))
            print(f'p-value: {preds["data"]["p_val"]:.3f}')

make_predictions(cd, X_h0_pt, X_c_pt, corruption)

**Task**: try to train a `ClassifierDrift` detector. Are the results on the corruption types the same?

## 3. ClearML Server

ClearML lets you track experiments in two ways: on a locally hosted server or in the cloud. Because of regular IP blocks, the cloud version may be unreachable, so we have to install and run ClearML locally. For that, we need to install **clearml-server**.


The easiest way to deploy the server is with Docker. Follow [these instructions](https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_linux_mac/) to deploy it.
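
Once the server is running, the notebook has to point at it instead of the hosted service. The sketch below builds the three endpoint settings used in section 0 for a local deployment. It assumes the Docker setup keeps the default ports (8080 for the web UI, 8008 for the API, 8081 for the file server); `local_clearml_hosts` is a hypothetical helper, so adjust the values if your deployment differs.

```python
def local_clearml_hosts(host="localhost"):
    """Build SDK endpoint settings for a locally deployed ClearML server.

    Ports 8080 (web), 8008 (API) and 8081 (files) are assumed to be the
    unchanged defaults of the Docker deployment.
    """
    return {
        "CLEARML_WEB_HOST": f"http://{host}:8080",
        "CLEARML_API_HOST": f"http://{host}:8008",
        "CLEARML_FILES_HOST": f"http://{host}:8081",
    }


for name, value in local_clearml_hosts().items():
    # In a notebook, each printed line can be pasted after the %env magic.
    print(f"{name}={value}")
```

The access and secret keys also need to be regenerated in the local web UI, since credentials issued by the hosted service are not valid for your own server.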

**Task**: try to deploy the server locally and run the code above again.