<table style="background-color:#FFFFFF">   
  <tr>     
  <td><img src="https://upload.wikimedia.org/wikipedia/commons/9/95/Logo_EPFL_2019.svg" width="150x"/>
  </td>     
  <td>
  <h1> <b>CS-461: Foundation Models and Generative AI</b> </h1>
  Prof. Charlotte Bunne  
  </td>   
  </tr>
</table>

# 📚 Exercise Session (Coding Part) - 1


* [**TASK A:** Exploring Contrastive Learning with SimCLR](#task_name_1)

* [**TASK B:** Exploring the Scaling Behaviour of LMs with a Series of Pythia Models](#task_name_2)

<a name="task_name_1"></a>
## Task A: Exploring Contrastive Learning with SimCLR

To deepen our understanding of contrastive learning, let's implement a minimal SimCLR training loop for CIFAR-10. We refer to Algorithm 1 in the SimCLR paper (https://arxiv.org/pdf/2002.05709).

We will tackle the following subtasks step by step:

1. SimCLR augmentations to create positive and negative samples.

2. A tiny encoder plus projection head.

3. Implementing the InfoNCE loss from scratch.

4. A quick training run.

5. A simple KNN evaluation.

6. Some open questions.

### 0. Environment
Import the necessary packages here. This code cell will be provided to students.

In [None]:
import torch
import math
from torch import nn
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as T
from torchvision.datasets import CIFAR10
import torchvision.models as models
import torch.nn.functional as F


print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
device = "cuda" if torch.cuda.is_available() else "cpu"
device

torch: 2.7.0a0+7c8ec84dab.nv25.03
cuda available: True


'cuda'

###  1. Data & SimCLR Augmentations
Create two randomly augmented views of each image using strong transformations. As a reference, we provide one example that applies random cropping, color jitter, grayscale conversion, and blur; however, you're encouraged to explore your own selections and combinations.

In [None]:
class SimCLRTransform:

    def __init__(self, size=32, s=0.5, blur_p=0.5):
        color_jitter = T.ColorJitter(0.8*s, 0.8*s, 0.8*s, 0.2*s)
        k = 3 if size <= 32 else 5
        base = [
            T.RandomResizedCrop(size=size, scale=(0.2, 1.0)),
            T.RandomHorizontalFlip(),
            T.RandomApply([color_jitter], p=0.8),
            T.RandomGrayscale(p=0.2),
            T.RandomApply([T.GaussianBlur(kernel_size=k, sigma=(0.1, 2.0))], p=blur_p),
            T.ToTensor()
        ]
        self.train_transform = T.Compose(base)

    def __call__(self, x):
        return self.train_transform(x), self.train_transform(x)


def collate_fn(batch):
    xs1, xs2, ys = [], [], []
    for (x1, x2), y in batch:
        xs1.append(x1)
        xs2.append(x2)
        ys.append(y)
    return torch.stack(xs1), torch.stack(xs2), torch.tensor(ys)


train_t = SimCLRTransform(size=32)
test_t = SimCLRTransform(size=32)

train_ds = CIFAR10(root="./data", train=True, download=True, transform=train_t)
test_ds  = CIFAR10(root="./data", train=False, download=True, transform=test_t)

train_loader = DataLoader(train_ds, batch_size=512, shuffle=True, num_workers=4, pin_memory=True, collate_fn=collate_fn)
test_loader  = DataLoader(test_ds,  batch_size=512, num_workers=4, pin_memory=True, collate_fn=collate_fn)

len(train_ds), len(test_ds)

(50000, 10000)

### 2. Encoder and Projection Head
The model should contain a backbone for feature encoding and a final MLP projection. As a reference implementation, we stack a randomly initialized (not pretrained) ResNet-18 (e.g., `torchvision.models.resnet18(weights=None)`) with a two-layer MLP that projects to a 128-dimensional feature space. You're encouraged to explore your own designs and variations.

In [None]:
class SimCLR(nn.Module):

    def __init__(self, proj_dim=128, hidden=2048):
        super().__init__()
        enc = models.resnet18(weights=None)
        enc.conv1 = nn.Conv2d(3, 64, 3, 1, 1, bias=False)
        enc.maxpool = nn.Identity()
        enc.fc = nn.Identity()
        self.encoder = enc

        self.projector = nn.Sequential(
            nn.Linear(512, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, proj_dim) # (bs, proj_dim)
        )

    def normalize(self, x, eps=1e-8):
        return x / (x.norm(dim=-1, keepdim=True) + eps)

    def forward(self, x):
        h = self.encoder(x)
        z = self.normalize(self.projector(h))
        return h, z

model = SimCLR(proj_dim=128).to(device)
model = torch.compile(model)

### 3. InfoNCE Loss from Scratch

We refer to Eq. (1) in the SimCLR paper. Note the two key components: the number of negative samples and the temperature parameter.

In [None]:
@torch.compile
def nt_xent(z1, z2, tau=0.5):
    B, d = z1.shape
    z = torch.cat([z1, z2], dim=0)              # (2B, d)
    sim = (z @ z.t()) / tau                     # (2B, 2B)
    mask = torch.eye(2*B, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, -1e9)
    targets = torch.arange(B, device=z.device)
    targets = torch.cat([targets + B, targets], dim=0)
    return F.cross_entropy(sim, targets)
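
A quick way to check an implementation of this loss is to compare it with a slow reference on tiny inputs. The sketch below is a plain-Python version written for illustration (the name `nt_xent_reference` is ours, not from the paper). It assumes the embeddings are already L2-normalized and uses the same positive-pair layout as `nt_xent`, where view `i` is paired with view `i + B`.

```python
import math

def nt_xent_reference(z1, z2, tau=0.5):
    """Slow NT-Xent for small lists of unit-length vectors (illustration only)."""
    z = z1 + z2                      # 2B embeddings, view i pairs with view i + B
    n, b = len(z), len(z1)
    total = 0.0
    for i in range(n):
        pos = (i + b) % n            # index of the positive for anchor i
        # Scaled cosine similarities to every other embedding (self excluded).
        sims = {j: sum(a * c for a, c in zip(z[i], z[j])) / tau
                for j in range(n) if j != i}
        m = max(sims.values())       # subtract the max for a stable log-sum-exp
        lse = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += lse - sims[pos]     # cross-entropy with the positive as target
    return total / n

# If every embedding is identical, the softmax is uniform over the 2B - 1
# remaining candidates, so the loss equals log(2B - 1).
collapsed = [[1.0, 0.0], [1.0, 0.0]]
loss_collapsed = nt_xent_reference(collapsed, collapsed)

# Positives aligned and negatives orthogonal: the loss drops below that value.
views = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent_reference(views, views)
print(loss_collapsed, loss_aligned)
```

The collapsed case is a useful reference point during training: a loss that stays near log(2B - 1) suggests the model is not yet separating positives from negatives.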

### 4. Train for a Few Epochs
Build the training loop and train for a few hundred epochs (the reference code below uses 500). Observe how the loss changes. You're encouraged to use logging tools such as Weights & Biases (wandb) or TensorBoard to monitor optimization, debug issues, and tune your code and hyperparameters.

In [None]:
import wandb

wandb.init(
    project="simclr-cifar10",
    config={
        "epochs": 0,
        "lr": 0.0,
        "loss": 0.0,
    }
)
total_epochs = 500
warmup_epochs = int(total_epochs * 0.05)

def lr_lambda(epoch):
    if epoch < warmup_epochs:
        return (epoch + 1) / float(warmup_epochs)
    t = (epoch - warmup_epochs) / float(total_epochs - warmup_epochs)
    return 0.0 + 0.5 * (1 - 0.0) * (1 + math.cos(math.pi * t))

opt = torch.optim.AdamW(model.parameters(), lr=0.6, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

def train_epoch(loader, model, opt, scheduler, epoch):
    model.train()
    for step, (x1, x2, y) in enumerate(loader):
        x1, x2 = x1.to(device), x2.to(device)

        _, z1 = model(x1)
        _, z2 = model(x2)
        loss = nt_xent(z1, z2, tau=0.5)

        opt.zero_grad()
        loss.backward()
        opt.step()
    wandb.log({"loss": loss.item(), "lr": opt.param_groups[0]["lr"], "epoch": epoch})
    scheduler.step()

for epoch in range(total_epochs):
    train_epoch(train_loader, model, opt, scheduler, epoch)
    if (epoch + 1) % 50 == 0 or epoch + 1 == total_epochs:
        torch.save(model.state_dict(), f"/home/xwei/fake_path/sequence_model/cs461/{epoch}.pth")
wandb.finish()


Here is an example run: https://wandb.ai/xiuying-wei/simclr-cifar10
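
The schedule defined by `lr_lambda` is a linear warmup followed by cosine decay. The standalone sketch below reproduces the same multiplier without PyTorch so its shape can be inspected quickly. The function name `warmup_cosine` and the `warmup_frac` argument are ours; the defaults mirror the values used in the training cell.

```python
import math

def warmup_cosine(epoch, total_epochs=500, warmup_frac=0.05):
    """LR multiplier: linear warmup, then cosine decay towards zero."""
    warmup_epochs = int(total_epochs * warmup_frac)
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * t))

# Multiplier applied to the base learning rate at a few selected epochs.
for e in (0, 12, 24, 25, 250, 499):
    print(e, round(warmup_cosine(e), 4))
```

The multiplier reaches 1.0 at the end of warmup and then decays smoothly, so the base learning rate passed to the optimizer is the peak value.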

### 5. Simple KNN Evaluation
To evaluate the encoder's quality, measure accuracy using either linear probing or a simple KNN classifier. The former trains a small linear classifier on the labeled samples and usually yields better performance. The latter classifies test examples by comparing their encoded features with those of labeled neighbors in feature space. We provide a simple KNN baseline, but you're encouraged to try other approaches.


In [None]:
train_transform = T.ToTensor()
test_transform = T.ToTensor()
model.eval()
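from torch._dynamo.eval_frame import OptimizedModule
# `safe_globals` is a context manager, so calling it bare has no effect;
# `add_safe_globals` registers the class for subsequent `torch.load` calls.
torch.serialization.add_safe_globals([OptimizedModule])
ckpt = torch.load("/home/xwei/fake_path/sequence_model/cs461/499.pth", map_location=device)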
model.load_state_dict(ckpt)
train_plain = CIFAR10(root='./data', train=True,  download=True,  transform=train_transform)
test_plain  = CIFAR10(root='./data', train=False, download=True,  transform=test_transform)

train_loader_plain = DataLoader(train_plain, batch_size=512, shuffle=False, num_workers=2, pin_memory=True)
test_loader_plain  = DataLoader(test_plain,  batch_size=512, shuffle=False, num_workers=2, pin_memory=True)

# === 2) Feature Extraction (use BACKBONE output, L2-normalized) ===
@torch.no_grad()
def extract_features(loader, model, device):
    model.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        x = x.to(device, non_blocking=True)
        h, _ = model(x)
        h = F.normalize(h, dim=1)
        feats.append(h.cpu())
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

# === 3) Weighted k-NN ===
@torch.no_grad()
def knn_weighted(feat_train, y_train, feat_test, k=10, T=0.1, num_classes=None, device=None):
    if device is None:
        device = feat_train.device
    feat_train = feat_train.to(device)
    feat_test = feat_test.to(device)
    y_train = y_train.to(device)

    sims = feat_test @ feat_train.t()
    vals, idxs = sims.topk(k=k, dim=1)
    nbr_labels = y_train[idxs]
    weights = (vals / T).softmax(dim=1)

    scores = torch.zeros(feat_test.size(0), num_classes, device=device)
    scores.scatter_add_(1, nbr_labels, weights)
    return scores.argmax(dim=1)


feat_tr, y_tr = extract_features(train_loader_plain, model, device)
feat_te, y_te = extract_features(test_loader_plain,  model, device)


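# Use `temperature` rather than a global `T`, which would overwrite the
# `torchvision.transforms as T` alias imported at the top of the notebook.
k, temperature = 20, 0.07
pred = knn_weighted(feat_tr, y_tr, feat_te, k=k, T=temperature, num_classes=10, device=device)
acc = (pred.cpu() == y_te).float().mean().item()
print(f"KNN accuracy (clean eval)  k={k}, T={temperature}: {100*acc:.2f}%")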




KNN accuracy (clean eval)  k=20, T=0.07: 88.13%
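
To see what the softmax weighting in `knn_weighted` does, the toy sketch below votes for a single test point in plain Python. The helper `weighted_vote` and the similarity values are made up for illustration. With a low temperature the closest neighbour dominates; with a high temperature the vote approaches a plain majority vote.

```python
import math

def weighted_vote(sims, labels, temperature, num_classes):
    """Softmax-weighted vote over the k nearest neighbours of one test point."""
    m = max(sims)                                    # stabilise the exponentials
    exps = [math.exp((s - m) / temperature) for s in sims]
    z = sum(exps)
    scores = [0.0] * num_classes
    for w, y in zip(exps, labels):
        scores[y] += w / z                           # accumulate weight per class
    return scores.index(max(scores))

sims = [0.9, 0.5, 0.5]      # cosine similarities of three neighbours (made up)
labels = [0, 1, 1]          # their class labels
pred_sharp = weighted_vote(sims, labels, temperature=0.07, num_classes=2)
pred_flat = weighted_vote(sims, labels, temperature=10.0, num_classes=2)
print(pred_sharp, pred_flat)
```

Here the single very similar neighbour of class 0 wins at temperature 0.07, while the two less similar neighbours of class 1 win at temperature 10.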


### 6. Open questions
No official solutions will be provided. We encourage students to explore and share their ideas and findings.

1. Explore two key factors in contrastive learning: the number of negative samples (effectively, the batch size here) and the temperature.

2. Increase model size, for example the dimensionality of the final projection.

3. Try different data augmentation techniques.

4. Evaluate accuracy using only a limited portion of the labeled data.

The batch size controls the number of negative samples, but it also affects memory usage. The temperature controls how sharply the softmax in the loss concentrates on the most similar pairs: a lower temperature puts more weight on hard negatives. The model size affects the model's capacity. Different tasks may prefer different augmentation techniques, but color distortion (color jitter and color drop) and random cropping are considered two of the most important ones. Contrastive pretraining provides an effective way to learn representations when there are not enough labeled samples. We suggest reading the Appendix of the SimCLR paper, which conducts these experiments. We'll also quickly go over them during the exercise session.
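
As a small numerical illustration of the first point, the sketch below counts the negatives available to each anchor for a given batch size and shows how the temperature reshapes the softmax over similarities. The helper names and the similarity values are made up for illustration and are not part of the reference solution.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def num_negatives(batch_size):
    # 2B views in total, minus the anchor itself and its one positive.
    return 2 * batch_size - 2

cos_sims = [0.9, 0.6, 0.3, 0.0]   # made-up similarities to four candidates
for tau in (0.1, 0.5, 1.0):
    probs = softmax([s / tau for s in cos_sims])
    print(tau, [round(p, 3) for p in probs])

print(num_negatives(512))
```

Smaller values of `tau` concentrate the probability mass on the most similar candidate, which is why the temperature interacts with how hard negatives are weighted.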

<a name="task_name_2"></a>
## Task B: Exploring the Scaling Behaviour of LMs with a Series of Pythia Models

To build a better understanding of scaling, we use the open-source EleutherAI/Pythia models and analyze how model size and the number of training tokens affect performance. We evaluate perplexity (PPL) on the WikiText-2 dataset. We rely on the Hugging Face `transformers` library to prepare models and data; if you're not familiar with it, please skim the [documentation](https://huggingface.co/learn/llm-course/chapter2/1?fw=pt).

We will do the following subtasks step by step.

1. PPL implementation.

2. Model size increase.

3. Token count increase (fixed model).

4. Open questions.

### 0. Environment
We import the following packages, download the dataset, and specify the model and tokenizer. This code cell will be shown to students.

In [None]:
import math
import time
import numpy as np
import pandas as pd
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import logging as pylog
from transformers.utils import logging as hf_logging
import warnings

hf_logging.set_verbosity_error()
pylog.getLogger("transformers").setLevel(pylog.ERROR)
pylog.getLogger("huggingface_hub").setLevel(pylog.ERROR)
warnings.filterwarnings("ignore")

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device)

# data for evaluation
NUM_SAMPLES = 128
ds = load_dataset("haryoaw/clean_wikitext_mini_data", split="test")
test_texts = ds["text"][:NUM_SAMPLES]
print(test_texts[:3])
# we use the links in huggingface. For example, https://huggingface.co/EleutherAI/pythia-14m.
MODEL_TOKENIZER_LIST = [
    'EleutherAI/pythia-14m',
    'EleutherAI/pythia-70m',
    'EleutherAI/pythia-160m',
    'EleutherAI/pythia-410m',
]

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

Device: cuda


Using custom data configuration haryoaw--clean_wikitext_mini_data-445ca2203984b72e
Reusing dataset parquet (/home/xwei/.cache/huggingface/datasets/haryoaw___parquet/haryoaw--clean_wikitext_mini_data-445ca2203984b72e/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901)


[' = Robert <unk> = \n', ' Robert <unk> is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John <unk> in 2002 . In 2004 <unk> landed a role as " Craig " in the episode " Teddy \'s Story " of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the <unk> <unk> Factory in London . He was directed by John <unk> and starred alongside Ben <unk> , Shane <unk> , Harry Kent , Fraser <unk> , Sophie Stanton and Dominic Hall . \n', ' In 2006 , <unk> starred alongside <unk> in the play <unk> written by Mark <unk> . He appeared on a 2006 episode of the 

<torch._C.Generator at 0x76ab6af643b0>

### 1. PPL implementation
Given a Hugging Face model, a tokenizer, and a list of raw input texts, implement the perplexity (PPL) calculation.

In [None]:
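def get_ppl(model, tokenizer, texts):
    model.eval()
    total_loss = 0.0
    total_tokens = 0
    with torch.no_grad():
        for t in texts:
            enc = tokenizer(t, return_tensors='pt').to(model.device)
            # Labels are shifted inside the model, so only n - 1 positions are scored.
            n_pred = enc["input_ids"].numel() - 1
            if n_pred < 1:
                continue  # a text with fewer than two tokens has nothing to predict
            out = model(**enc, labels=enc["input_ids"])
            # out.loss is the mean NLL over the n_pred scored positions.
            total_loss += out.loss.item() * n_pred
            total_tokens += n_pred
    return math.exp(total_loss / total_tokens)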
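
Perplexity is the exponential of the mean negative log-likelihood per predicted token. The toy sketch below, with made-up probabilities and hypothetical helper names, shows the definition and why per-text mean losses should be weighted by their token counts before exponentiating.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

def corpus_perplexity(per_text_mean_nll, per_text_tokens):
    """Token-weighted aggregation of per-text mean losses."""
    total = sum(l * n for l, n in zip(per_text_mean_nll, per_text_tokens))
    return math.exp(total / sum(per_text_tokens))

# A model that assigns probability 1/4 to every token behaves like a
# uniform guess over 4 options, so its perplexity is 4.
uniform_ppl = perplexity([0.25] * 8)

# Two texts with different lengths: the longer one dominates the average.
mixed_ppl = corpus_perplexity([math.log(2), math.log(8)], [30, 10])
print(uniform_ppl, mixed_ppl)
```

Averaging the per-text perplexities directly would give a different number, which is why `get_ppl` accumulates total loss and total tokens separately.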

### 2. Model size increase
Increase the model size and observe how perplexity (PPL) and efficiency change. For each model name in `MODEL_TOKENIZER_LIST`, load the corresponding model and tokenizer using Hugging Face's `AutoModelForCausalLM` and `AutoTokenizer`. Also measure inference efficiency, namely prefill and decoding throughput (tokens per second). Prefill refers to processing a sequence of input tokens in parallel, whereas decoding refers to generating tokens one by one. These are the two canonical inference modes of language models.

In [None]:
def load_clm(name):
    model = AutoModelForCausalLM.from_pretrained(name).to(device)
    tok = AutoTokenizer.from_pretrained(name, use_fast=True)
    return tok, model


def count_params(model):
    return sum(p.numel() for p in model.parameters())


def measure_prefill_tput(model, seq_len=256, batch_size=1, repeat=20):
    inp = torch.zeros(batch_size, seq_len, dtype=torch.long, device=device)
    for _ in range(3):
        with torch.no_grad():
            _ = model(inp)
    torch.cuda.synchronize()
    t0 = time.time()
    with torch.no_grad():
        for _ in range(repeat):
            _ = model(inp)
    torch.cuda.synchronize()
    t1 = time.time()
    tokens = repeat * int(inp.numel())
    tput = tokens / (t1 - t0)
    return tput


def measure_decode_tput(model, prompt_len=64, gen_len=1, batch_size=1, repeat=10):
    inp = torch.zeros(batch_size, prompt_len, dtype=torch.long, device=device)
    _ = model.generate(inp, max_new_tokens=3, do_sample=False)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(repeat):
        _ = model.generate(inp, max_new_tokens=gen_len, do_sample=False)
    torch.cuda.synchronize()
    t1 = time.time()
    tokens = repeat * gen_len * batch_size
    tput = tokens / (t1 - t0)
    return tput

for name in MODEL_TOKENIZER_LIST:
    tok, mdl = load_clm(name)
    ppl = get_ppl(mdl, tok, test_texts)
    params = count_params(mdl)
    prefill_tps = measure_prefill_tput(mdl, seq_len=1024, batch_size=1)
    decode_tps  = measure_decode_tput(mdl, prompt_len=64, gen_len=1, batch_size=1)
    print("params: {:.2f}M, ppl: {:.2f}, prefilling throughput {:.2f}token/s, generation throughput {:.2f}token/s".format(params / 10 ** 6, ppl, prefill_tps, decode_tps))



params: 14.07M, ppl: 189.33, prefilling throughput 319075.77token/s, generation throughput 207.97token/s
params: 70.43M, ppl: 80.69, prefilling throughput 136957.52token/s, generation throughput 208.11token/s
params: 162.32M, ppl: 48.06, prefilling throughput 53174.85token/s, generation throughput 124.84token/s
params: 405.33M, ppl: 31.83, prefilling throughput 18640.91token/s, generation throughput 71.51token/s
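
One simple way to summarize these four numbers is a straight-line fit in log-log space. The sketch below uses the parameter counts and PPL values printed above and ordinary least squares in plain Python. Treating PPL as a pure power law of model size is our simplifying assumption, and four points on a small evaluation set are far too few to establish a scaling law, so the slope is only a rough descriptive statistic.

```python
import math

# (parameters in millions, PPL) copied from the printed results above
results = [(14.07, 189.33), (70.43, 80.69), (162.32, 48.06), (405.33, 31.83)]

xs = [math.log(p) for p, _ in results]
ys = [math.log(ppl) for _, ppl in results]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Ordinary least squares for log(PPL) = intercept + slope * log(params)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A negative slope means PPL falls as the model grows; plotting `xs` against `ys` gives the figure asked for in the open questions.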


### 3. Token count increase (fixed model)
Next, for a fixed model, we test the effect of increasing the number of training tokens and observe how PPL changes. Pythia provides training checkpoints that can be selected by setting the `revision` argument to a training step (for example, `step1000`) when loading `AutoModelForCausalLM`. We use the 410M Pythia model and evaluate its PPL at different steps. Each step corresponds to 2,097,152 tokens (about 2.1M).

In [None]:
FIXED_MODEL = 'EleutherAI/pythia-410m'
STEPS = [1000, 10000, 50000, 100000, 143000]
TOKENS_PER_STEP = 2_097_152
tok_fixed = AutoTokenizer.from_pretrained(FIXED_MODEL, use_fast=True)
if tok_fixed.pad_token is None:
    tok_fixed.pad_token = tok_fixed.eos_token
for s in STEPS:
    rev = f'step{s}'
    mdl = AutoModelForCausalLM.from_pretrained(FIXED_MODEL, revision=rev).to(device)
    ppl = get_ppl(mdl, tok_fixed, test_texts)
    tokens_seen = s * TOKENS_PER_STEP
    print("the number of tokens {}B, the ppl {:.2f}".format(int(tokens_seen / 10 ** 9), ppl))


the number of tokens 2B, the ppl 180.89
the number of tokens 20B, the ppl 42.40
the number of tokens 104B, the ppl 34.22
the number of tokens 209B, the ppl 31.76
the number of tokens 299B, the ppl 31.83


### 4. Open questions
No official solutions will be provided. We encourage students to explore and share their ideas and findings.

1). Draw a figure with the x-axis representing model size and the y-axis representing log(PPL), then analyze the trend.

2). For a fixed-size model, draw a figure with the x-axis representing the number of training tokens and the y-axis representing log(PPL), then analyze the trend.

3). Given a fixed training compute budget (training FLOPs), consider how to balance model size and training tokens to achieve the best performance. Furthermore, if we also aim for better inference efficiency, how should we adjust these two variables?

4). Besides model size and training tokens, what other factors can be scaled?

Performance improves with increasing model size and number of training tokens. However, for a fixed training-FLOPs budget, we must decide how to allocate scaling between model size and tokens. The Chinchilla scaling laws paper studies this trade-off; students can read it to see how it analyzes the problem.

A direct takeaway from the paper is to scale the number of training tokens to roughly 20× the model's parameter count; for example, use ~20B tokens for a 1B-parameter model. In practice, because we often care more about inference efficiency and therefore prefer smaller models, we commonly operate in an overtraining regime, increasing the number of training tokens as much as possible. For example, the open-source OLMo-2 1B model uses about 4.05T tokens.
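
The 20× rule of thumb can be turned into a small back-of-the-envelope calculation. The sketch below assumes the commonly used approximation that training compute is about 6 × parameters × tokens; this approximation is not stated in this notebook, and the function name `compute_optimal_allocation` is ours.

```python
import math

def compute_optimal_allocation(flops, tokens_per_param=20.0):
    """Split a FLOPs budget into (params, tokens) with D = tokens_per_param * N.

    Assumes the common approximation C ~= 6 * N * D for training compute.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = 6.0 * 1e9 * 20e9            # FLOPs for a 1B model on 20B tokens
n_params, n_tokens = compute_optimal_allocation(budget)
print(f"{n_params:.3g} params, {n_tokens:.3g} tokens")

# Tokens per parameter for the OLMo-2 1B figure quoted above,
# taking the nominal 1B parameter count at face value.
overtraining_ratio = 4.05e12 / 1e9
print(overtraining_ratio)
```

Comparing `overtraining_ratio` with the default `tokens_per_param` gives a rough sense of how far an overtrained model sits from the 20× heuristic.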