# **Layer 1: Baseline DeiT environment**

DeiT’s baseline training script, main.py, takes the teacher model and the distillation settings as CLI flags (e.g., --teacher-model, --teacher-path, --distillation-type).
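
To make the flag handling concrete, here is a minimal argparse sketch of the three flags named above. The function `build_distillation_parser` is a hypothetical stand-in, not DeiT's actual parser, and the defaults and choices shown are assumptions that should be checked against the cloned main.py.

```python
import argparse


def build_distillation_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the distillation part of a DeiT-style parser."""
    parser = argparse.ArgumentParser(add_help=False)
    # Flag names come from the text above; defaults/choices are assumptions.
    parser.add_argument("--teacher-model", default="regnety_160", type=str)
    parser.add_argument("--teacher-path", default="", type=str)
    parser.add_argument("--distillation-type", default="none",
                        choices=["none", "soft", "hard"], type=str)
    return parser


# argparse turns "--teacher-model" into the attribute "teacher_model".
args = build_distillation_parser().parse_args([
    "--distillation-type", "soft",
    "--teacher-model", "regnety_160",
    "--teacher-path", "https://dl.fbaipublicfiles.com/deit/regnety_160-a5fe301d.pth",
])
print(args.distillation_type, args.teacher_model, args.teacher_path)
```

Passing an unsupported value for --distillation-type makes argparse exit with a usage error, which is a cheap way to catch typos before a long run starts.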

So the Layer 1 “base environment” must include:

1. The DeiT repo (cloned)

2. PyTorch (Colab default) with a GPU runtime

3. timm (used for both the student and the teacher models)

4. Compatibility patches where needed, because Colab ships newer torch and timm versions than the DeiT repo targets

Install PyTorch without pinning

In [1]:
!pip -q install --upgrade pip
!pip -q install torch torchvision torchaudio


Verify

In [2]:
import torch
print(torch.__version__)
print("CUDA:", torch.cuda.is_available())

2.9.0+cu128
CUDA: True


Clone the baseline repo (official DeiT)

In [3]:
%cd /content
!git clone https://github.com/facebookresearch/deit.git
%cd /content/deit
!grep -n "torch" requirements.txt || true

/content
Cloning into 'deit'...
remote: Enumerating objects: 456, done.
remote: Total 456 (delta 0), reused 0 (delta 0), pack-reused 456 (from 1)
Receiving objects: 100% (456/456), 5.73 MiB | 23.38 MiB/s, done.
Resolving deltas: 100% (255/255), done.
/content/deit
1:torch==1.13.1
2:torchvision==0.8.1


Colab Compatibility Fixes

1. torch pin removal

2. timm API changes

3. kwargs popping (pretrained_cfg, cache_dir, etc.)



Patch requirements.txt to remove torch pins

In [4]:
%cd /content/deit

!python - << 'PY'
from pathlib import Path
p = Path("requirements.txt")
lines = p.read_text().splitlines()

filtered = []
removed = []
for line in lines:
    s = line.strip()
    if s.startswith("torch==") or s.startswith("torchvision==") or s.startswith("torchaudio=="):
        removed.append(line)
        continue
    filtered.append(line)

p.write_text("\n".join(filtered) + "\n")
print("✅ Removed these pinned lines:")
for r in removed:
    print("  -", r)

/content/deit
✅ Removed these pinned lines:
  - torch==1.13.1
  - torchvision==0.8.1


Verify that the pins are gone, i.e. that the torch==1.13.1 and torchvision==0.8.1 lines were removed

In [5]:
!grep -nE "torch|torchvision|torchaudio" requirements.txt || echo "✅ No torch pins remain"

✅ No torch pins remain
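
The patch cell above only matches exact `==` pins. As a sketch of a slightly more general filter, the hypothetical helper `strip_pins` below also catches other version specifiers (`>=`, `~=`, a bare package name) while leaving unrelated packages whose names merely start with "torch" alone. The sample requirement lines are made up for illustration.

```python
import re


def strip_pins(requirements_text, packages=("torch", "torchvision", "torchaudio")):
    """Split requirement lines into (kept, removed), removing lines for `packages`."""
    # The package name must be followed by end-of-line or a specifier/extras/marker
    # character, so "torch>=2.0" matches but "torchmetrics>=1.0" does not.
    pattern = re.compile(
        r"^\s*(?:%s)\s*(?:$|[=<>!~\[;@ ])" % "|".join(map(re.escape, packages)),
        re.IGNORECASE,
    )
    kept, removed = [], []
    for line in requirements_text.splitlines():
        (removed if pattern.match(line) else kept).append(line)
    return kept, removed


# Illustrative input only; not the actual contents of DeiT's requirements.txt.
sample = "torch==1.13.1\ntorchvision>=0.8\nexample-package==1.0\ntorchmetrics>=1.0\n"
kept, removed = strip_pins(sample)
print("kept:", kept)
print("removed:", removed)
```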


Install the baseline dependencies

In [6]:
pip install "jedi>=0.16,<0.19"

Collecting jedi<0.19,>=0.16
  Downloading jedi-0.18.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.18.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.18.2


In [7]:
!pip -q uninstall -y timm
!pip -q install "jedi>=0.16,<0.19"
!pip -q install timm==0.6.13 submitit

In [8]:
#!pip -q uninstall -y timm
#!pip -q install -U pip setuptools wheel
#!pip -q install -U "timm>=1.0.0"

Verify

In [9]:
!python -c "import timm; print('timm:', timm.__version__)"

timm: 0.6.13


**Patch timm.data, then restart the session**

In [10]:
!python - << 'PY'
from pathlib import Path

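# Locate the installed timm package instead of hardcoding the Python version in the path
import importlib.util
p = Path(importlib.util.find_spec("timm").origin).parent / "data" / "__init__.py"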
txt = p.read_text()

needle = "OPENAI_CLIP_MEAN"
if needle in txt:
    print("✅ timm.data already mentions OPENAI_CLIP_MEAN; no patch needed.")
else:
    patch = """

# --- Colab patch: expose CLIP normalization constants for older exports ---
try:
    from .constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD  # timm versions where defined in constants
except Exception:
    # Standard OpenAI CLIP normalization
    OPENAI_CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
    OPENAI_CLIP_STD  = (0.26862954, 0.26130258, 0.27577711)
# --- end patch ---
"""
    p.write_text(txt + patch)
    print("✅ Patched:", p)

✅ Patched: /usr/local/lib/python3.12/dist-packages/timm/data/__init__.py


Restart the session (Runtime → Restart runtime) so that the kernel picks up the reinstalled and patched timm.

In [11]:
#!pip -q install timm submitit

In [12]:
%cd /content/deit
from models import deit_tiny_patch16_224
m = deit_tiny_patch16_224()
print("✅ DeiT model instantiated successfully")

/content/deit
✅ DeiT model instantiated successfully


In [13]:
import torch, timm
print(torch.__version__)
print(timm.__version__)
print(torch.cuda.is_available())

2.9.0+cu128
0.6.13
True


Download Tiny-ImageNet

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [15]:
%cd /content
!wget -q http://cs231n.stanford.edu/tiny-imagenet-200.zip
!unzip -q tiny-imagenet-200.zip

/content


Fix Tiny-ImageNet validation folder

In [16]:
!python - << 'EOF'
import shutil
from pathlib import Path

root = Path("/content/tiny-imagenet-200")
val_dir = root/"val"
img_dir = val_dir/"images"
ann = val_dir/"val_annotations.txt"

with ann.open("r") as f:
    for line in f:
        img, cls = line.strip().split("\t")[:2]
        (val_dir/cls).mkdir(parents=True, exist_ok=True)
        src = img_dir/img
        dst = val_dir/cls/img
        if src.exists():
            shutil.move(str(src), str(dst))

if img_dir.exists():
    shutil.rmtree(img_dir)

print("✅ Tiny-ImageNet val reorganized into class subfolders.")

✅ Tiny-ImageNet val reorganized into class subfolders.


In [17]:
!find /content/tiny-imagenet-200/val -maxdepth 1 -type d | head

/content/tiny-imagenet-200/val
/content/tiny-imagenet-200/val/n04417672
/content/tiny-imagenet-200/val/n04376876
/content/tiny-imagenet-200/val/n02917067
/content/tiny-imagenet-200/val/n04074963
/content/tiny-imagenet-200/val/n07873807
/content/tiny-imagenet-200/val/n03662601
/content/tiny-imagenet-200/val/n03804744
/content/tiny-imagenet-200/val/n03444034
/content/tiny-imagenet-200/val/n07753592
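
Listing ten folders shows that the move happened, but not that it is complete. Below is a small sketch of a fuller check, assuming the standard Tiny-ImageNet layout of 200 classes with 50 validation images each. The helpers `summarize_class_folders` and `check_split` are hypothetical names, and the demo runs on a throwaway directory rather than the real dataset.

```python
import tempfile
from pathlib import Path


def summarize_class_folders(split_dir, suffix=".JPEG"):
    """Map each class subfolder name to the number of image files beneath it."""
    split_dir = Path(split_dir)
    counts = {}
    for class_dir in sorted(d for d in split_dir.iterdir() if d.is_dir()):
        counts[class_dir.name] = sum(1 for _ in class_dir.rglob(f"*{suffix}"))
    return counts


def check_split(split_dir, expected_classes, expected_per_class):
    """Return a list of problems; an empty list means the layout looks right."""
    counts = summarize_class_folders(split_dir)
    problems = []
    if len(counts) != expected_classes:
        problems.append(f"expected {expected_classes} class folders, found {len(counts)}")
    for name, n in counts.items():
        if n != expected_per_class:
            problems.append(f"{name}: expected {expected_per_class} images, found {n}")
    return problems


# Demo on a synthetic tree: 2 made-up classes with 3 empty image files each.
with tempfile.TemporaryDirectory() as tmp:
    for cls in ("n00000001", "n00000002"):
        (Path(tmp) / cls).mkdir()
        for i in range(3):
            (Path(tmp) / cls / f"val_{i}.JPEG").touch()
    demo_problems = check_split(tmp, expected_classes=2, expected_per_class=3)
    demo_mismatch = check_split(tmp, expected_classes=3, expected_per_class=3)
print(demo_problems, demo_mismatch)
```

On the real data the call would be `check_split("/content/tiny-imagenet-200/val", 200, 50)`. A leftover `images` folder would show up as an unexpected extra class.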


In [18]:
ls -lah /content/tiny-imagenet-200 | head

total 2.6M
drwxrwxr-x   5 root root 4.0K Feb  9  2015 ./
drwxr-xr-x   1 root root 4.0K Feb 11 08:06 ../
drwxrwxr-x   3 root root 4.0K Dec 12  2014 test/
drwxrwxr-x 202 root root 4.0K Dec 12  2014 train/
drwxrwxr-x 202 root root 4.0K Feb 11 08:06 val/
-rw-rw-r--   1 root root 2.0K Feb  9  2015 wnids.txt
-rw-------   1 root root 2.6M Feb  9  2015 words.txt


Handle timm incompatibilities. Instantiating the model directly works, but the training script builds it through timm.create_model(), which injects metadata arguments such as pretrained_cfg and cache_dir.
The original DeiT constructors do not accept these arguments, so we remove them before the model is constructed:
YOUR NOTEBOOK CALL
    |
    v
deit_tiny_patch16_224()          ✅ works (no kwargs)

TRAINING PIPELINE
    |
    v
timm.create_model()
    |
    v
deit_tiny_patch16_224(**kwargs)  ❌ injects extra keys
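
The failure mode in the diagram can be reproduced without timm. The sketch below uses a toy `StrictModel` and two hypothetical factory functions. It is not DeiT code, only an illustration of why unexpected keyword arguments raise a TypeError and how popping them first avoids it.

```python
# Keys a model factory may receive from its caller but the constructor rejects.
TIMM_ONLY_KEYS = ("pretrained_cfg", "pretrained_cfg_overlay", "cache_dir")


class StrictModel:
    """Toy model whose constructor accepts only the arguments it declares."""

    def __init__(self, num_classes=1000, drop_path_rate=0.0):
        self.num_classes = num_classes
        self.drop_path_rate = drop_path_rate


def unpatched_factory(pretrained=False, **kwargs):
    # Forwards everything, including keys StrictModel does not know.
    return StrictModel(**kwargs)


def patched_factory(pretrained=False, **kwargs):
    # Drop the injected keys first; pop(..., None) is a no-op if a key is absent.
    for key in TIMM_ONLY_KEYS:
        kwargs.pop(key, None)
    return StrictModel(**kwargs)


# Simulates a caller that adds metadata keys on top of the real arguments.
injected = {"num_classes": 200, "pretrained_cfg": None, "cache_dir": None}

try:
    unpatched_factory(**injected)
    unpatched_error = None
except TypeError as exc:
    unpatched_error = str(exc)

model = patched_factory(**injected)
print("unpatched failed:", unpatched_error is not None)
print("patched num_classes:", model.num_classes)
```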


Patch /content/deit/augment.py (safe compatibility fix)

In [19]:
%cd /content/deit
!python - << 'PY'
from pathlib import Path
p = Path("augment.py")
txt = p.read_text()

old = "from timm.data.transforms import _pil_interp, RandomResizedCropAndInterpolation, ToNumpy, ToTensor"
if old in txt:
    txt = txt.replace(
        old,
        "from timm.data.transforms import RandomResizedCropAndInterpolation, ToNumpy, ToTensor\n"
        "try:\n"
        "    from timm.data.transforms import _pil_interp  # older timm\n"
        "except Exception:\n"
        "    _pil_interp = None  # newer timm doesn't expose this\n"
    )
    p.write_text(txt)
    print("✅ Patched augment.py for timm compatibility.")
else:
    print("ℹ️ Expected import line not found; augment.py may already be patched or different.")

/content/deit
✅ Patched augment.py for timm compatibility.


In [20]:
%cd /content/deit
!sed -n '1,200p' models.py

/content/deit
# Copyright (c) 2015-present, Facebook, Inc.
# All rights reserved.
import torch
import torch.nn as nn
from functools import partial

from timm.models.vision_transformer import VisionTransformer, _cfg
from timm.models.registry import register_model
from timm.models.layers import trunc_normal_


__all__ = [
    'deit_tiny_patch16_224', 'deit_small_patch16_224', 'deit_base_patch16_224',
    'deit_tiny_distilled_patch16_224', 'deit_small_distilled_patch16_224',
    'deit_base_distilled_patch16_224', 'deit_base_patch16_384',
    'deit_base_distilled_patch16_384',
]


class DistilledVisionTransformer(VisionTransformer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dist_token = nn.Parameter(torch.zeros(1, 1, self.embed_dim))
        num_patches = self.patch_embed.num_patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, self.embed_dim))
        self.head_dist = nn.Linear(self.embed_dim, self.num_classes)

Patch every deit_* factory function in models.py so that it removes those keys from kwargs before constructing the model.

In [21]:
from pathlib import Path

p = Path("/content/deit/models.py")
lines = p.read_text().splitlines()

out = []
for line in lines:
    out.append(line)
    if line.strip().startswith("def deit_") and "**kwargs" in line:
        out.append("    # Drop timm-injected kwargs not supported by DeiT")
        out.append("    kwargs.pop('pretrained_cfg', None)")
        out.append("    kwargs.pop('pretrained_cfg_overlay', None)")
        out.append("    kwargs.pop('pretrained_cfg_priority', None)")

p.write_text("\n".join(out) + "\n")
print("✅ models.py patched to drop pretrained_cfg kwargs")


✅ models.py patched to drop pretrained_cfg kwargs


Verify

In [22]:
!grep -n "pretrained_cfg" /content/deit/models.py

65:    kwargs.pop('pretrained_cfg', None)
66:    kwargs.pop('pretrained_cfg_overlay', None)
67:    kwargs.pop('pretrained_cfg_priority', None)
84:    kwargs.pop('pretrained_cfg', None)
85:    kwargs.pop('pretrained_cfg_overlay', None)
86:    kwargs.pop('pretrained_cfg_priority', None)
103:    kwargs.pop('pretrained_cfg', None)
104:    kwargs.pop('pretrained_cfg_overlay', None)
105:    kwargs.pop('pretrained_cfg_priority', None)
122:    kwargs.pop('pretrained_cfg', None)
123:    kwargs.pop('pretrained_cfg_overlay', None)
124:    kwargs.pop('pretrained_cfg_priority', None)
141:    kwargs.pop('pretrained_cfg', None)
142:    kwargs.pop('pretrained_cfg_overlay', None)
143:    kwargs.pop('pretrained_cfg_priority', None)
160:    kwargs.pop('pretrained_cfg', None)
161:    kwargs.pop('pretrained_cfg_overlay', None)
162:    kwargs.pop('pretrained_cfg_priority', None)
179:    kwargs.pop('pretrained_cfg', None)
180:    kwargs.pop('pretrained_cfg_overlay', None)
181:    kwargs.pop('pretrained_cfg_p

Extend the models.py patch to also drop cache_dir and the hf_hub_* keys

In [23]:
from pathlib import Path

p = Path("/content/deit/models.py")
lines = p.read_text().splitlines()

# Keys that timm may inject but DeiT constructors don't accept
DROP_KEYS = [
    "cache_dir",
    "hf_hub_id",
    "hf_hub_filename",
    "hf_hub_revision",
]

out = []
for line in lines:
    out.append(line)
    # Right after the comment line we previously inserted, add more pops once per function
    if line.strip() == "# Drop timm-injected kwargs not supported by DeiT":
        for k in DROP_KEYS:
            out.append(f"    kwargs.pop('{k}', None)")

p.write_text("\n".join(out) + "\n")
print("✅ Patched models.py to drop cache_dir/hf_hub* kwargs")


✅ Patched models.py to drop cache_dir/hf_hub* kwargs


Verify

In [24]:
!grep -n "cache_dir" /content/deit/models.py

65:    kwargs.pop('cache_dir', None)
88:    kwargs.pop('cache_dir', None)
111:    kwargs.pop('cache_dir', None)
134:    kwargs.pop('cache_dir', None)
157:    kwargs.pop('cache_dir', None)
180:    kwargs.pop('cache_dir', None)
203:    kwargs.pop('cache_dir', None)
226:    kwargs.pop('cache_dir', None)
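
Both patch cells append lines every time they run, so re-running either one duplicates the pops. A minimal sketch of an idempotent variant follows. It works on source text instead of the file, and `add_kwarg_pops` is a hypothetical helper. It assumes each factory is a single-line, top-level `def deit_...(..., **kwargs):` header, as in the models.py listing above.

```python
MARKER = "    # Drop timm-injected kwargs not supported by DeiT"


def add_kwarg_pops(source, keys):
    """Insert kwargs.pop(...) lines after each deit_* factory header, at most once."""
    lines = source.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        is_factory = line.startswith("def deit_") and "**kwargs" in line
        # Skip headers already followed by the marker comment.
        already_patched = i + 1 < len(lines) and lines[i + 1] == MARKER
        if is_factory and not already_patched:
            out.append(MARKER)
            out.extend(f"    kwargs.pop('{key}', None)" for key in keys)
    return "\n".join(out) + "\n"


# Shortened stand-in for one factory function, for illustration only.
sample = (
    "@register_model\n"
    "def deit_tiny_patch16_224(pretrained=False, **kwargs):\n"
    "    model = VisionTransformer(**kwargs)\n"
    "    return model\n"
)
keys = ["pretrained_cfg", "cache_dir"]
once = add_kwarg_pops(sample, keys)
twice = add_kwarg_pops(once, keys)
print(once)
print("idempotent:", once == twice)
```

One limitation of this sketch: a function that already carries the marker is left untouched even if the key list has grown, so new keys would need a fresh copy of models.py or a smarter merge.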


In [25]:
# %cd /content/deit
# !python main.py \
#   --model deit_tiny_patch16_224 \
#   --data-path /content/tiny-imagenet-200 \
#   --pretrained \
#   --epochs 1 \
#   --batch-size 64 \
#   --num_workers 2 \
#   --output_dir /content/deit_runs/smoke_test
# %cd /content/deit
# !python main.py \
#   --model deit_tiny_patch16_224 \
#   --data-path /content/tiny-imagenet-200 \
#   --epochs 1 \
#   --batch-size 128 \
#   --num_workers 4 \
#   --input-size 224 \
#   --opt adamw \
#   --lr 5e-4 \
#   --weight-decay 0.05 \
#   --sched cosine \
#   --aa rand-m9-mstd0.5 \
#   --reprob 0.25 \
#   --remode pixel \
#   --recount 1 \
#   --output_dir /content/deit_runs/tiny_imagenet
# %cd /content/deit
# !python main.py \
#  --model deit_tiny_patch16_224 \
#  --data-path /content/tiny-imagenet-200 \
#  --finetune https://dl.fbaipublicfiles.com/deit/deit_tiny_patch16_224-a1311bcf.pth \
#  --epochs 10 \
#  --batch-size 128 \
#  --num_workers 4 \
#  --input-size 224 \
#  --opt adamw \
#  --lr 3e-4 \
#  --weight-decay 0.05 \
#  --sched cosine \
#  --warmup-epochs 1 \
#  --smoothing 0.1 \
#  --aa rand-m7-mstd0.5 \
#  --reprob 0.1 \
#  --drop-path 0.1 \
#  --output_dir /content/deit_runs/tiny_imagenet_5ep
# %cd /content/deit
# !python main.py \
#   --model deit_tiny_distilled_patch16_224 \
#   --data-path /content/tiny-imagenet-200 \
#   --input-size 224 \
#   --epochs 100 \
#   --batch-size 1024 \
#   --opt adamw \
#   --lr 0.001 \
#   --sched cosine \
#   --warmup-epochs 5 \
#   --weight-decay 0.05 \
#   --smoothing 0.1 \
#   --drop 0.0 \
#   --drop-path 0.1 \
#   --repeated-aug \
#   --aa rand-m9-mstd0.5-inc1 \
#   --mixup 0.8 \
#   --cutmix 1.0 \
#   --reprob 0.25 \
#   --distillation-type soft \
#   --distillation-alpha 0.1 \
#   --distillation-tau 3.0 \
#   --teacher-model regnety_160 \
#   --teacher-path https://dl.fbaipublicfiles.com/deit/regnety_160-a5fe301d.pth \
#   --output_dir /content/deit_runs/paper_recipe

# %cd /content/deit
# !python main.py \
#  --model deit_tiny_patch16_224 \
#  --data-path /content/tiny-imagenet-200 \
#  --finetune https://dl.fbaipublicfiles.com/deit/deit_tiny_patch16_224-a1311bcf.pth \
#  --epochs 10 \
#  --batch-size 128 \
#  --num_workers 4 \
#  --input-size 224 \
#  --opt adamw \
#  --lr 2.5e-4 \
#  --weight-decay 0.05 \
#  --sched cosine \
#  --warmup-epochs 1 \
#  --smoothing 0.1 \
#  --aa rand-m7-mstd0.5 \
#  --reprob 0.1 \
#  --drop-path 0.1 \
#  --distillation-type hard \
#  --teacher-model regnety_160 \
#  --teacher-path https://dl.fbaipublicfiles.com/deit/regnety_160-a5fe301d.pth \
#  --output_dir /content/deit_runs/tiny_imagenet_10ep
# %cd /content/deit
# !python main.py \
#  --model deit_tiny_distilled_patch16_224 \
#  --data-path /content/tiny-imagenet-200 \
#  --epochs 10 \
#  --batch-size 128 \
#  --num_workers 4 \
#  --input-size 224 \
#  --opt adamw \
#  --lr 7e-4 \
#  --weight-decay 0.05 \
#  --sched cosine \
#  --warmup-epochs 1 \
#  --smoothing 0.0 \
#  --aa rand-m7-mstd0.5 \
#  --reprob 0.1 \
#  --drop-path 0.0 \
#  --distillation-type hard \
#  --distillation-alpha 0.7 \
#  --teacher-model regnety_160 \
#  --teacher-path https://dl.fbaipublicfiles.com/deit/regnety_160-a5fe301d.pth \
#  --output_dir /content/deit_runs/deit_tiny_distilled_10ep

%cd /content/deit
!python main.py \
  --model deit_small_distilled_patch16_224 \
  --data-path /content/tiny-imagenet-200 \
  --input-size 224 \
  --epochs 100 \
  --batch-size 256 \
  --opt adamw \
  --lr 0.001 \
  --sched cosine \
  --warmup-epochs 5 \
  --weight-decay 0.05 \
  --smoothing 0.1 \
  --drop 0.0 \
  --drop-path 0.1 \
  --repeated-aug \
  --aa rand-m9-mstd0.5-inc1 \
  --mixup 0.8 \
  --cutmix 1.0 \
  --reprob 0.25 \
  --distillation-type soft \
  --distillation-alpha 0.1 \
  --distillation-tau 3.0 \
  --teacher-model regnety_160 \
  --teacher-path https://dl.fbaipublicfiles.com/deit/regnety_160-a5fe301d.pth \
  --output_dir /content/deit_runs/deit_small_paper_recipe






Streaming output truncated to the last 5000 lines.
Epoch: [0]  [ 30/390]  eta: 0:03:16  lr: 0.000001  loss: 6.2272 (6.2343)  time: 0.2848  data: 0.0004  max mem: 19324
Epoch: [0]  [ 40/390]  eta: 0:02:48  lr: 0.000001  loss: 6.2018 (6.2260)  time: 0.2854  data: 0.0004  max mem: 19324
Epoch: [0]  [ 50/390]  eta: 0:02:30  lr: 0.000001  loss: 6.1866 (6.2168)  time: 0.2852  data: 0.0004  max mem: 19324
Epoch: [0]  [ 60/390]  eta: 0:02:17  lr: 0.000001  loss: 6.1682 (6.2085)  time: 0.2847  data: 0.0004  max mem: 19324
Epoch: [0]  [ 70/390]  eta: 0:02:07  lr: 0.000001  loss: 6.1512 (6.2003)  time: 0.2843  data: 0.0004  max mem: 19324
Epoch: [0]  [ 80/390]  eta: 0:01:59  lr: 0.000001  loss: 6.1422 (6.1924)  time: 0.2848  data: 0.0004  max mem: 19324
Epoch: [0]  [ 90/390]  eta: 0:01:51  lr: 0.000001  loss: 6.1182 (6.1827)  time: 0.2849  data: 0.0004  max mem: 19324
Epoch: [0]  [100/390]  eta: 0:01:45  lr: 0.000001  loss: 6.1010 (6.1745)  time: 0.2845  data: 0.0004  max mem: 19324
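
The numbers in this log can be sanity-checked with a little arithmetic. The sketch below assumes Tiny-ImageNet's 200 x 500 training images, a training loader that drops the last incomplete batch, and a linear learning-rate scaling rule of lr x batch_size x world_size / 512. That scaling rule is an assumption about main.py and should be confirmed in the cloned repo. Both helper names are hypothetical.

```python
def steps_per_epoch(num_samples, batch_size, drop_last=True):
    """Number of optimizer steps in one pass over the data."""
    full, remainder = divmod(num_samples, batch_size)
    return full if drop_last or remainder == 0 else full + 1


def linear_scaled_lr(base_lr, batch_size, world_size=1, reference_batch=512):
    """Scale the base learning rate in proportion to the global batch size."""
    return base_lr * batch_size * world_size / reference_batch


train_images = 200 * 500  # 200 classes x 500 training images per class
steps = steps_per_epoch(train_images, batch_size=256)
peak_lr = linear_scaled_lr(0.001, batch_size=256)
print("steps per epoch:", steps)
print("peak lr after scaling:", peak_lr)
```

The step count agrees with the `/390` shown in the log. The logged lr of 0.000001 during these first iterations is consistent with a warmup phase that starts far below the peak value.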

# **Layer 2: Base Environment — Teacher Models & Multi-Teacher Adaptations**

Layer 2 extends the baseline DeiT environment to support knowledge distillation from one or more teacher models. This layer is additive: it does not modify the baseline DeiT training loop unless explicitly stated.
It includes:
1. Teacher Model Support (Single & Multiple)
2. Teacher Registry / Configuration
3. Multi-Teacher Fusion Mechanism (Adaptation Layer)
4. Distillation Loss Integration
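
To illustrate how items 3 and 4 could fit together, here is a dependency-free sketch of one possible design: teacher logits are fused by a weighted average, and the student is trained with a temperature-softened KL term mixed into the base loss via alpha. The fusion rule, the function names, and the example logits are all assumptions made for illustration. They are not taken from the DeiT code or from this project's implementation.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_teachers(teacher_logits, weights):
    """Weighted average of per-teacher logits (one hypothetical fusion rule)."""
    total = sum(weights)
    n = len(teacher_logits[0])
    return [sum(w * t[i] for w, t in zip(weights, teacher_logits)) / total
            for i in range(n)]


def soft_distillation_loss(student_logits, teacher_logits, tau):
    """KL(teacher || student) on softened distributions, scaled by tau squared."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s) if pt > 0)
    return kl * tau * tau


def combined_loss(base_loss, distill_loss, alpha):
    """Convex mix of the supervised loss and the distillation loss."""
    return (1 - alpha) * base_loss + alpha * distill_loss


# Made-up logits for a 3-class toy problem with two teachers.
student = [2.0, 0.5, -1.0]
teachers = [[1.5, 0.2, -0.5], [2.5, 0.1, -1.2]]
fused = fuse_teachers(teachers, weights=[1.0, 1.0])
kd = soft_distillation_loss(student, fused, tau=3.0)
total = combined_loss(base_loss=1.0, distill_loss=kd, alpha=0.1)
print(fused, kd, total)
```

With a single teacher and weight 1.0, `fuse_teachers` returns that teacher's logits unchanged, so the single-teacher case falls out of the same code path.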