## Deviations in model performance

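This notebook documents the task selection, the model training sweeps, and the selection of best runs behind the estimation of congruence values. TODO: insert Ref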

Start this notebook inside a new environment after installing the necessary dependencies, e.g., 

```commandline
conda create --name deviations python=3.10
conda activate deviations
pip install mml-core mml-tasks
```

Make sure to set your `mml` environment variables, e.g., 

```commandline
cd ~/.config
mml-env-setup
nano mml.env
```

You can follow [the docs](https://mml.readthedocs.io/en/latest/install.html#setting-the-variables) for this. Do not forget to navigate back to this notebook's folder before starting the server.

```commandline
cd PATH/TO/THIS/NOTEBOOKS/PARENT
jupyter notebook
```


In [1]:
from pathlib import Path
from collections import Counter
import mml.interactive
# assuming the mml.env location as described above
mml.interactive.init(Path('~/.config/mml.env').expanduser())

 _____ ______   _____ ______   ___
|\   _ \  _   \|\   _ \  _   \|\  \
\ \  \\\__\ \  \ \  \\\__\ \  \ \  \
 \ \  \\|__| \  \ \  \\|__| \  \ \  \
  \ \  \    \ \  \ \  \    \ \  \ \  \____
   \ \__\    \ \__\ \__\    \ \__\ \_______\
    \|__|     \|__|\|__|     \|__|\|_______|
         ____  _  _    __  _  _  ____  _  _
        (  _ \( \/ )  (  )( \/ )/ ___)( \/ )
         ) _ ( )  /    )( / \/ \\___ \ )  /
        (____/(__/    (__)\_)(_/(____/(__/
Interactive MML API initialized.


In [2]:
### Part 1: Task selection process

In [3]:
# Start with all 71 tasks used in [Task fingerprinting](https://arxiv.org/abs/2412.08763)
all_tasks = (
    'lapgyn4_anatomical_structures', 'lapgyn4_surgical_actions', 'lapgyn4_instrument_count',
    'lapgyn4_anatomical_actions',
    'sklin2_skin_lesions', 'identify_nbi_infframes', 'laryngeal_tissues', 'nerthus_bowel_cleansing_quality',
    'stanford_dogs_image_categorization', 'svhn', 'caltech101_object_classification',
    'caltech256_object_classification',
    'cifar10_object_classification', 'cifar100_object_classification', 'mnist_digit_classification',
    'emnist_digit_classification', 'hyperkvasir_anatomical-landmarks', 'hyperkvasir_pathological-findings',
    'hyperkvasir_quality-of-mucosal-views', 'hyperkvasir_therapeutic-interventions', 'cholec80_grasper_presence',
    'cholec80_bipolar_presence', 'cholec80_hook_presence', 'cholec80_scissors_presence', 'cholec80_clipper_presence',
    'cholec80_irrigator_presence', 'cholec80_specimenbag_presence', 'derm7pt_skin_lesions',
    'idle_action_recognition', 'barretts_esophagus_diagnosis', 'brain_tumor_classification',
    'mednode_melanoma_classification', 'brain_tumor_type_classification',
    'chexpert_enlarged_cardiomediastinum', 'chexpert_cardiomegaly', 'chexpert_lung_opacity',
    'chexpert_lung_lesion', 'chexpert_edema', 'chexpert_consolidation', 'chexpert_pneumonia',
    'chexpert_atelectasis', 'chexpert_pneumothorax', 'chexpert_pleural_effusion', 'chexpert_pleural_other',
    'chexpert_fracture', 'chexpert_support_devices',
    'pneumonia_classification', 'ph2-melanocytic-lesions-classification',
    'covid_xray_classification', 'isic20_melanoma_classification', 'deep_drid_dr_level',
    'shenzen_chest_xray_tuberculosis', 'crawled_covid_ct_classification', 'deep_drid_quality',
    'deep_drid_clarity', 'deep_drid_field', 'deep_drid_artifact', 'kvasir_capsule_anatomy',
    'kvasir_capsule_content', 'kvasir_capsule_pathologies',
    'breast_cancer_classification_v2',
    'eye_condition_classification', 'mura_xr_wrist', 'mura_xr_shoulder', 'mura_xr_humerus', 'mura_xr_hand',
    'mura_xr_forearm', 'mura_xr_finger', 'mura_xr_elbow',
    'bean_plant_disease_classification',
    'aptos19_blindness_detection')

The goal is a subset of tasks that covers a wide variety of properties while still allowing models to be distinguished. We therefore first exclude easy tasks, defined as those on which baseline runs reached an AUROC > 0.95. Of the 36 tasks that meet this exclusion criterion, we keep 2 to preserve variability. One of them was chosen explicitly for its CT modality.
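
The exclusion rule can be sketched as follows. The task names, the AUROC values, and the `keep_anyway` set are hypothetical placeholders, not the baseline results of this study.

```python
# Hypothetical baseline AUROC values, for illustration only (not the study's results).
baseline_auroc = {
    'task_a': 0.99,
    'task_b': 0.97,
    'task_c': 0.91,
    'task_d': 0.96,
}
# Easy tasks that are retained on purpose to preserve variability (hypothetical choice).
keep_anyway = {'task_d'}
THRESHOLD = 0.95

# A task is excluded if it is easy (AUROC above the threshold) and not explicitly retained.
excluded = sorted(t for t, auroc in baseline_auroc.items()
                  if auroc > THRESHOLD and t not in keep_anyway)
remaining = sorted(t for t in baseline_auroc if t not in excluded)
print(excluded)   # ['task_a', 'task_b']
print(remaining)  # ['task_c', 'task_d']
```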

In [4]:
easy_tasks = ('lapgyn4_anatomical_structures',
              'lapgyn4_surgical_actions',
              'lapgyn4_instrument_count',
              'lapgyn4_anatomical_actions',
              'identify_nbi_infframes',
              'laryngeal_tissues',
              'nerthus_bowel_cleansing_quality',
              'stanford_dogs_image_categorization',
              'svhn',
              'caltech101_object_classification',
              'caltech256_object_classification',
              'cifar10_object_classification',
              'cifar100_object_classification',
              'mnist_digit_classification',
              'emnist_digit_classification',
              # 'hyperkvasir_anatomical-landmarks',  # kept on purpose: we want to retain some easy tasks
              'hyperkvasir_quality-of-mucosal-views',
              'hyperkvasir_therapeutic-interventions',
              'cholec80_grasper_presence',
              'cholec80_bipolar_presence',
              'cholec80_hook_presence',
              'cholec80_scissors_presence',
              'cholec80_clipper_presence',
              'cholec80_irrigator_presence',
              'cholec80_specimenbag_presence',
              'idle_action_recognition',
              'brain_tumor_classification',
              'brain_tumor_type_classification',
              'pneumonia_classification',
              'covid_xray_classification',
              'shenzen_chest_xray_tuberculosis',
              # 'crawled_covid_ct_classification',  # kept on purpose: an easy task, and CT is a unique modality
              'kvasir_capsule_anatomy',
              'kvasir_capsule_content',
              'kvasir_capsule_pathologies',
              'bean_plant_disease_classification')  
# we further reduce to medical tasks
realistic_tasks = [t for t in all_tasks if t not in easy_tasks]
infos = mml.interactive.get_task_infos(task_list=realistic_tasks)
medical_tasks = [t for t in realistic_tasks if infos.domains[t] not in ['natural_objects', 'handwritings']]
infos = mml.interactive.get_task_infos(task_list=medical_tasks)
Counter(infos.domains.values())

Counter({'x_ray': 20,
         'fundus_photography': 6,
         'dermatoscopy': 5,
         'gastroscopy_colonoscopy': 2,
         'confocal laser endomicroscopy': 1,
         'ct_scan': 1,
         'ultrasound': 1,
         'cataract_surgery': 1})

In [5]:
# we start by selecting one task per imaging modality
import random
final_tasks = []
random.seed(42)
for dom in sorted(list(set(infos.domains.values()))):
    all_dom_tasks = [t for t in medical_tasks if infos.domains[t] == dom]
    final_tasks.append(random.choice(all_dom_tasks))
final_tasks

['eye_condition_classification',
 'barretts_esophagus_diagnosis',
 'crawled_covid_ct_classification',
 'derm7pt_skin_lesions',
 'deep_drid_quality',
 'hyperkvasir_anatomical-landmarks',
 'breast_cancer_classification_v2',
 'mura_xr_forearm']

At this point we pause and analyse how much the selected tasks vary in further properties.

In [6]:
# 3 out of 8 are binary, so we will add 3 more binary tasks and one multiclass task at random
Counter([infos.num_classes[t] for t in final_tasks])

Counter({2: 3, 3: 2, 4: 1, 5: 1, 6: 1})

In [7]:
# rather small tasks so far
Counter([infos.num_samples[t] for t in final_tasks])

Counter({601: 1, 262: 1, 746: 1, 616: 1, 1200: 1, 4104: 1, 780: 1, 1825: 1})

In [8]:
# non-uniform class imbalance so far
Counter([infos.imbalance_ratios[t] for t in final_tasks])

Counter({3.0: 1,
         5.733333333333333: 1,
         1.1375358166189111: 1,
         13.692307692307692: 1,
         1.0833333333333333: 1,
         112.11111111111111: 1,
         3.2857142857142856: 1,
         1.7609682299546143: 1})
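
For orientation, the sketch below computes an imbalance ratio as the size of the largest class divided by the size of the smallest class. That definition is an assumption about what `infos.imbalance_ratios` reports, and the label list is made up.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the most frequent to the least frequent class (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical label list: 9 samples of class 0 and 3 samples of class 1.
labels = [0] * 9 + [1] * 3
print(imbalance_ratio(labels))  # 3.0
```

Under this definition a perfectly balanced task has a ratio of 1.0, and larger values indicate stronger imbalance.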

In [9]:
candidates = [t for t in medical_tasks if t not in final_tasks]
binary_cands = [t for t in candidates if infos.num_classes[t] == 2]
multi_cands = [t for t in candidates if infos.num_classes[t] != 2]
# only one multiclass candidate is both imbalanced and large
[t for t in multi_cands if infos.imbalance_ratios[t] > 2 and infos.num_samples[t] > 3_000]

['aptos19_blindness_detection']

In [10]:
# these are the binary candidates that are imbalanced and large
[t for t in binary_cands if infos.imbalance_ratios[t] > 2 and infos.num_samples[t] > 10_000]

['chexpert_enlarged_cardiomediastinum',
 'chexpert_cardiomegaly',
 'chexpert_lung_opacity',
 'chexpert_lung_lesion',
 'chexpert_edema',
 'chexpert_atelectasis',
 'chexpert_pneumothorax',
 'chexpert_pleural_effusion',
 'chexpert_fracture',
 'chexpert_support_devices',
 'isic20_melanoma_classification']

We take ISIC as given because it is independent of the others. Now we need to choose two tasks from the CheXpert dataset.
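
The notebook draws the two CheXpert tasks one at a time and removes the first pick from the list by hand. A more compact alternative, sketched here, is `random.sample`, which draws without replacement in a single call. It consumes the random state differently, so it would not necessarily reproduce the picks reported in this notebook. The local generator and its seed are arbitrary assumptions.

```python
import random

chexpert_candidates = [
    'chexpert_enlarged_cardiomediastinum',
    'chexpert_cardiomegaly',
    'chexpert_lung_opacity',
    'chexpert_lung_lesion',
    'chexpert_edema',
    'chexpert_atelectasis',
    'chexpert_pneumothorax',
    'chexpert_pleural_effusion',
    'chexpert_fracture',
    'chexpert_support_devices',
]

# Local generator with an arbitrary seed, so the global random state stays untouched.
rng = random.Random(0)
# Draw two distinct tasks in one call (sampling without replacement).
picks = rng.sample(chexpert_candidates, k=2)
print(picks)
```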

In [11]:
random.choice(['chexpert_enlarged_cardiomediastinum',
               'chexpert_cardiomegaly',
               'chexpert_lung_opacity',
               'chexpert_lung_lesion',
               'chexpert_edema',
               'chexpert_atelectasis',
               'chexpert_pneumothorax',
               'chexpert_pleural_effusion',
               'chexpert_fracture',
               'chexpert_support_devices'])

'chexpert_cardiomegaly'

In [12]:
random.choice(['chexpert_enlarged_cardiomediastinum',
               'chexpert_lung_opacity',
               'chexpert_lung_lesion',
               'chexpert_edema',
               'chexpert_atelectasis',
               'chexpert_pneumothorax',
               'chexpert_pleural_effusion',
               'chexpert_fracture',
               'chexpert_support_devices'])

'chexpert_pleural_effusion'

In [13]:
# here is the final task list
final_tasks = ['eye_condition_classification',       # by domain (not easy)
               'barretts_esophagus_diagnosis',       # by domain (not easy)
               'crawled_covid_ct_classification',    # by domain (easy)
               'derm7pt_skin_lesions',               # by domain (not easy)
               'deep_drid_quality',                  # by domain (not easy)
               'hyperkvasir_anatomical-landmarks',   # by domain (easy)
               'breast_cancer_classification_v2',    # by domain (not easy)
               'mura_xr_forearm',                    # by domain (not easy)
               'chexpert_cardiomegaly',              # binary, imbalanced, large
               'chexpert_pleural_effusion',          # binary, imbalanced, large
               'isic20_melanoma_classification',     # binary, imbalanced, large
               'aptos19_blindness_detection']        # multiclass, imbalanced, large

### Part 2: Model training sweeps

In [14]:
from mml.interactive import MMLJobDescription, SubprocessJobRunner, DefaultRequirements

reqs = DefaultRequirements()
# the 5 models we investigate
models_list = ['caformer_s36.sail_in22k_ft_in1k', 
               'tiny_vit_21m_224.dist_in22k_ft_in1k', 
               'swin_s3_small_224.ms_in1k',
               'coatnet_rmlp_1_rw2_224.sw_in12k_ft_in1k', 
               'resnext101_32x4d.fb_swsl_ig1b_ft_in1k']
# the matching batch sizes that fit our hardware
batch_size_list = [50, 200, 100, 100, 200]
# we test three very different augmentation strategies for optimization
augmentations_list = ['basic', 'randaugment', 'load_imagenet_aa']
# we test for both class balanced sampling during training and prevalence preserving sampling
balanced_sampling_list = [True, False]

In [15]:
job_list = []
for task in final_tasks:
    for model, b_size in zip(models_list, batch_size_list):
        for augmentation in augmentations_list:
            for sampling in balanced_sampling_list:
                job_list.append(
                    MMLJobDescription(prefix_req=reqs, mode='train', config_options={
                                'proj': 'eva_deviations',
                                'mode.subroutines': ['train', 'predict'],
                                'pivot.name': task,
                                'arch.name': model,
                                'mode.cv': False,                     # no cross validation
                                'mode.nested': True,                  # use the original val split as test, separate a new val split from train
                                'sampling.balanced': sampling,
                                'loss.auto_activate_weighing': False, # to prevent loss weighting for class preservation
                                'sampling.batch_size': b_size,       
                                'callbacks': ['early', 'default'],    # early stopping
                                'lr_scheduler': 'plateau',            # reduce LR on plateau
                                'trainer.max_epochs': 40,             # maximum epochs
                                'augmentations': augmentation,        
                                'preprocessing': 'size224',           # name of the preprocessing pipeline
                                'sampling.enable_caching': True,      # enable image sample caching for efficiency
                                'reuse.clean_up.parameters': True,    # no need to permanently store model parameters
                                'optimizer': 'adam',                  # Adam optimizer
                            }))
print(len(job_list))

360
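
As a sanity check, the job count equals the size of the full grid: 12 tasks, 5 models, 3 augmentation strategies, and 2 sampling modes. The sketch below reproduces that count with `itertools.product`, using integer placeholders instead of the real task and model names.

```python
from itertools import product

n_tasks, n_models = 12, 5  # sizes of final_tasks and models_list
augmentations = ['basic', 'randaugment', 'load_imagenet_aa']
balanced_sampling = [True, False]

# Every combination of task, model, augmentation, and sampling mode is one job.
grid = list(product(range(n_tasks), range(n_models), augmentations, balanced_sampling))
print(len(grid))  # 360
```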


In [16]:
# run a job like this (commented to prevent accidental triggering)
runner = SubprocessJobRunner()
# for job in job_list:
#     runner.run(job)

### Part 3: Selection of best runs

In [17]:
import pandas as pd
from omegaconf import OmegaConf
import os

In [18]:
# generate all information, to be found in "infos.csv"
information = []
sub_tasks = [t + '+nested?0' for t in final_tasks]
sub_infos = mml.interactive.get_task_infos(sub_tasks)
structs = mml.interactive.get_task_structs(sub_tasks)
for struct in structs:
    information.append({'name': struct.name, 'train+val': sub_infos.num_samples[struct.name], 'classes': struct.num_classes, 'keywords': struct.keywords, 'ir': sub_infos.imbalance_ratios[struct.name]})
infos_df = pd.DataFrame(information)
infos_df.to_csv('infos.csv')

In [19]:
all_models = mml.interactive.load_project_models('eva_deviations')


In [20]:
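def reroute_path(p: Path, relative: bool = False) -> Path:
    """Small helper to resolve pathing issues across systems by re-rooting paths at MML_RESULTS_PATH."""
    results_root = os.getenv('MML_RESULTS_PATH')
    if results_root is None:
        raise RuntimeError('MML_RESULTS_PATH is not set, check your mml.env')
    # locate the project folder among the ancestors of p and keep everything below it
    project_index_in_path = [q.name for q in p.parents].index('eva_deviations')
    remainder = p.relative_to(p.parents[project_index_in_path])
    return Path('eva_deviations') / remainder if relative else Path(results_root) / 'eva_deviations' / remainder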
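
To illustrate what the helper does, here is a self-contained sketch of the same re-rooting logic. The function name `reroute`, its explicit `new_root` argument, and both example paths are hypothetical. `PurePosixPath` is used so the example behaves the same on every operating system.

```python
from pathlib import PurePosixPath


def reroute(p, new_root, project='eva_deviations', relative=False):
    """Re-anchor everything below the project folder onto new_root."""
    # Position of the project folder among the ancestors of p.
    idx = [q.name for q in p.parents].index(project)
    # Part of the path that lies below the project folder.
    remainder = p.relative_to(p.parents[idx])
    return PurePosixPath(project) / remainder if relative else new_root / project / remainder


# Hypothetical paths, for illustration only.
old = PurePosixPath('/cluster/results/eva_deviations/run_1/preds.pt')
new_root = PurePosixPath('/home/me/results')
print(reroute(old, new_root))
# /home/me/results/eva_deviations/run_1/preds.pt
print(reroute(old, new_root, relative=True))
# eva_deviations/run_1/preds.pt
```

Paths that do not contain the project folder raise a `ValueError` from `list.index`, which makes a misplaced file fail loudly instead of being silently rerouted.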

In [22]:
# gather overview on all runs
run_infos = []
task_arch_container = {}
for task in all_models:
    for model in all_models[task]:
        pipeline = OmegaConf.load(reroute_path(model.pipeline))
        run_infos.append({
            'task': task,
            'architecture': pipeline.arch.name,
            'balanced_sampling': pipeline.sampling.balanced,
            'batch_size': pipeline.sampling.batch_size,
            'lr_scheduler': pipeline.lr_scheduler._target_.split('.')[-1],
            'augmentations': pipeline.augmentations.cpu.pipeline[0].name,
            'optimizer': pipeline.optimizer._target_.split('.')[-1],
            'predictions': reroute_path(model.predictions[task + '-test-0'], relative=True)
        })
        # collect all runs that share the same architecture and task
        task_arch_container.setdefault((pipeline.arch.name, task), []).append(model)
pd.DataFrame(run_infos).to_csv('run_infos.csv')

In [23]:
# select best augmentation and sampling strategy based on validation performance
best_infos = []
for arch, task in task_arch_container:
    legal_models = [m for m in task_arch_container[(arch, task)] if m.performance is not None]
    if len(legal_models) == 0:
        # in one case no sampling strategy led to convergence of model training
        print(f'no legal models for {task}/{arch}')
        continue
    # the smallest value wins, i.e. performance is treated as lower-is-better
    model = min(legal_models, key=lambda x: x.performance)
    pipeline = OmegaConf.load(reroute_path(model.pipeline))
    best_infos.append({
        'task': task,
        'architecture': pipeline.arch.name,
        'balanced_sampling': pipeline.sampling.balanced,
        'batch_size': pipeline.sampling.batch_size,
        'lr_scheduler': pipeline.lr_scheduler._target_.split('.')[-1],
        'augmentations': pipeline.augmentations.cpu.pipeline[0].name,
        'optimizer': pipeline.optimizer._target_.split('.')[-1],
        'predictions': reroute_path(model.predictions[task + '-test-0'], relative=True)
    })
pd.DataFrame(best_infos).to_csv('best_runs.csv')

no legal models for chexpert_pleural_effusion+nested?0/resnext101_32x4d.fb_swsl_ig1b_ft_in1k
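
The selection logic can be tried without any trained models. The sketch below uses plain dictionaries with made-up values. As in the cell above, the smallest `performance` value is taken as the best, and `None` marks a run without a valid result.

```python
# Hypothetical runs, for illustration only.
runs = [
    {'task': 't1', 'arch': 'a', 'aug': 'basic', 'performance': 0.40},
    {'task': 't1', 'arch': 'a', 'aug': 'randaugment', 'performance': 0.35},
    {'task': 't1', 'arch': 'b', 'aug': 'basic', 'performance': None},
]

# Group runs by (task, architecture).
grouped = {}
for run in runs:
    grouped.setdefault((run['task'], run['arch']), []).append(run)

# Keep the run with the smallest performance value per group.
best = {}
for key, candidates in grouped.items():
    valid = [r for r in candidates if r['performance'] is not None]
    if not valid:
        continue  # no valid run for this combination
    best[key] = min(valid, key=lambda r: r['performance'])

print(best[('t1', 'a')]['aug'])  # randaugment
print(('t1', 'b') in best)       # False
```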
