# Machine Learning II: Deep Learning and Applications
# Homework 2

**Due date: Nov 3rd, 2025**

### Instructions
- Make a copy of this notebook in your own Colab and complete the questions there.
- You can add more cells if necessary. You may also add descriptions to your code, though it is not mandatory.
- Make sure the notebook can run through by *Runtime -> Run all*. **Keep all cell outputs** for grading.
- Submit the link of your notebook [here](https://docs.google.com/forms/d/e/1FAIpQLSdEhoIthUqZpgA6WmsS-hUFPZebU4CgtPMMno2Bnm4AduYKcw/viewform?usp=sharing&ouid=108990008229336794809). Please **add TAs as editors** (below) so that you can receive feedback from TAs.
  - Click `Share` and add zhihao.zhan@mila.quebec and xinyu.yuan@mila.quebec as editors before your submission.

### Note
A friendly reminder from the TAs: These exercises are fundamental, so we strongly encourage you to complete them with little to no assistance from ChatGPT, especially if you're pursuing a career as an MLE or applied scientist.

# Environment Setup
Install the necessary Python packages for this homework.

In [None]:
!pip install torch==2.0.1+cu118 --index-url https://download.pytorch.org/whl/cu118
!pip install datasets transformers
!pip install tiktoken
!pip install omegaconf

Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting torch==2.0.1+cu118
  Using cached https://download.pytorch.org/whl/cu118/torch-2.0.1%2Bcu118-cp310-cp310-linux_x86_64.whl (2267.3 MB)
Collecting triton==2.0.0 (from torch==2.0.1+cu118)
  Using cached https://download.pytorch.org/whl/triton-2.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.3 MB)
Installing collected packages: triton, torch
  Attempting uninstall: triton
    Found existing installation: triton 3.1.0
    Uninstalling triton-3.1.0:
      Successfully uninstalled triton-3.1.0
  Attempting uninstall: torch
    Found existing installation: torch 2.5.0
    Uninstalling torch-2.5.0:
      Successfully uninstalled torch-2.5.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.5.0+cu121 requires torch==2.5.0, but you have torch 2.0.1+cu118 which is in

In [None]:
! pip install gdown



Now let's download all the files needed using the following command.

In [2]:
! gdown "https://drive.google.com/uc?id=17XsqhDy_GCjJex7Ekg6PKYEZUKkL_iqH"
! gdown "https://drive.google.com/uc?id=1GSTq3NQO519BkEG9Bid7003Awh17YrXx"
! gdown "https://drive.google.com/uc?id=18LcbPdiyaWAdnBUfwSSx6lmzVE0cSWrP"

Downloading...
From: https://drive.google.com/uc?id=17XsqhDy_GCjJex7Ekg6PKYEZUKkL_iqH
To: /home/alkan/homework_deepl/english-tokenizer.json
100%|██████████████████████████████████████| 32.7k/32.7k [00:00<00:00, 2.29MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1GSTq3NQO519BkEG9Bid7003Awh17YrXx
From (redirected): https://drive.google.com/uc?id=1GSTq3NQO519BkEG9Bid7003Awh17YrXx&confirm=t&uuid=637ada33-36c3-440f-94b5-08e6497d808f
To: /home/alkan/homework_deepl/fixed_initialized_model.pt
100%|████████████████████████████████████████| 116M/116M [00:14<00:00, 8.11MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=18LcbPdiyaWAdnBUfwSSx6lmzVE0cSWrP
From (redirected): https://drive.google.com/uc?id=18LcbPdiyaWAdnBUfwSSx6lmzVE0cSWrP&confirm=t&uuid=444d1a67-3a6c-4982-a030-86a6a56598ea
To: /home/alkan/homework_deepl/tokens.npz
100%|████████████████████████████████████████| 570M/570M [01:17<00:00, 7.38MB/s]


# Task 1: Transformer pre-training pipeline using HuggingFace library


 In this task, you will develop a basic Transformer model and explore training processes.


We first define some utility functions.

**Note: this cell must be run before the following sections. Do not modify the seed, to avoid incorrect evaluation results.**

In [1]:
# ------------------------------------------------------------------------------------ #
###############  Utilization Functions (DO NOT MODIFY) ###############

import torch
import io
import random
import numpy as np

def determine_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    elif torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def estimate_model_disk_size(model: torch.nn.Module) -> int:
    with io.BytesIO() as byte_stream:
        torch.save(model.state_dict(), byte_stream)
        return byte_stream.tell()

def count_params(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())


def enable_tf32() -> None:
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

def set_seed(seed):

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42)
# ------------------------------------------------------------------------------------ #

## Transformer Implementation

In this section, you will implement the key components of the Transformer architecture, specifically the Transformer decoder depicted in Figure 1 below.



![image](https://drive.google.com/uc?id=1Ia7d-1_hdk31E9BV-4_3PYgmNX3YgbM5)

You will begin by creating a decoder-only transformer model, following the code structure provided in `Model` Sections 1 and 2 below. This structure includes all the classes and function declarations required for your submission. The code you need to implement is marked with "..." and a "[TODO]" comment, and is detailed in the following subquestions. Please:

1. **avoid altering these elements or adding new Python dependencies**, to ensure compatibility with the automated testing pipeline; changes could otherwise lead to test failures.

2. **aim for a model that is efficiently implemented**, favoring the use of PyTorch functions and avoiding loop-based matrix operations.

3. **do not use prebuilt high-level layers or functions** such as `torch.nn.TransformerDecoder`.

If in doubt about the appropriateness of a method, consult with the course TAs.

In [2]:
############ Model Section 1 (DO NOT MODIFY) ############


import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange


"""
Dimension symbols:
    B - batch size
    S - sequence length
    D - hidden dimension (n_embd)
    H - number of attention heads (n_head)
    HD - hidden dimension of a single attention head (d // n_head)
    V - size of the vocabulary
"""

class Head(nn.Module):
    def __init__(self, n_embd, head_size, dropout, n_positions):
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(n_positions, n_positions)))
        self.dropout = nn.Dropout(dropout)
        #Note: this dropout randomly prevents some tokens from communicating with each other

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x) #shape (B,T, head_size)
        q = self.query(x) #shape (B,T, head_size)
        v = self.value(x) #shape (B,T, head_size)

        #compute self-attention scores
        wei = q @ k.transpose(-2, -1) #shape (B,T, head_size) @ (B,head_size,T) --> (B,T,T)
        wei *= C**-0.5 #scale by sqrt(d_k) as per paper, so that variance of the wei is 1
        wei = wei.masked_fill(self.tril[:T,:T]==0, float('-inf')) # (B,T,T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)

        #perform weighted aggregation of values
        out = wei @ v # (B, T, T) @ (B, T, head_size) --> (B, T, head_size)
        return out

class MultiHeadAttention(nn.Module):
    """ Multi-head attention as a collection of heads with concatenated outputs."""
    def __init__(self, n_embd, n_head, dropout, n_positions):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList([Head(n_embd, head_size, dropout, n_positions) for _ in range(n_head)])
        self.proj  = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.proj(out)
        out = self.dropout(out)
        return out

class FeedForward(nn.Module):
    """ the feed forward network (FFN) in the paper"""

    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd*4),
            nn.ReLU(),
            nn.Linear(n_embd*4, n_embd),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class DecoderBlock(nn.Module):
    """A single decoder block in a decoder language model."""

    def __init__(self, n_embd, n_head, dropout, n_positions):
        """Initialize the modules used in a decoder block."""
        super().__init__()

        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = MultiHeadAttention(n_embd, n_head, dropout, n_positions)
        self.mlp = FeedForward(n_embd, dropout)
        self.ln_2 = nn.LayerNorm(n_embd)

    def forward(
        self, x: torch.FloatTensor
    ) -> torch.FloatTensor:
        x = self.ln_1(x + self.attn(x))
        x = self.ln_2(x + self.mlp(x))
        return x

In [3]:
############ Model Section 2 ############

class DecoderLM(nn.Module):
    """The decoder language model."""

    def __init__(
        self,
        n_vocab: int,
        n_embd: int,
        n_head: int,
        n_positions: int,
        n_layer: int,
        p_dropout: float = 0.1,
    ):
        super().__init__()

        self.n_vocab = n_vocab
        self.n_embd = n_embd
        self.n_head = n_head
        self.n_positions = n_positions
        self.n_layer = n_layer
        self.p_dropout = p_dropout

        self.token_embeddings = nn.Embedding(n_vocab, n_embd)
        self.position_embeddings = nn.Embedding(n_positions, n_embd)
        self.blocks = nn.ModuleList(
            [DecoderBlock(n_embd=n_embd, n_head=n_head, dropout=p_dropout, n_positions=n_positions) for _ in range(n_layer)]
        )
        self.lm_head = nn.Linear(self.n_embd, self.n_vocab, bias=False)
        # NOTE: layer_norm should be put after transformer blocks `self.blocks`,
        # and before the language model head `self.lm_head`
        self.ln = nn.LayerNorm(n_embd)
        self.dropout = nn.Dropout(self.p_dropout)

        # initialize weights according to nanoGPT
        self.apply(self._init_weights)
        for pn, p in self.named_parameters():
            if pn.endswith("out_proj.weight"):
                torch.nn.init.normal_(p, mean=0.0, std=0.02 / torch.sqrt(torch.tensor(2 * n_layer)))

        # tie the output projection weights to the token embedding weights
        self.lm_head.weight = self.token_embeddings.weight

        # count flops per token
        self.flops_per_token = (
            6 * count_params(self) + 12 * n_layer * n_embd * n_positions
        )

    def embed(
        self,
        input_ids: torch.LongTensor,
    ) -> torch.FloatTensor:
        """Convert input_ids to embeddings (token_embeddings + positional_embeddings).

        Args:
            input_ids: tokens ids with shape (B x S)

        Returns:
            embeddings: token representations with shape (B x S x D)
        """

        """
        Position ids are indices of tokens in the sequence. They are simply [0, 1, 2, ...] for every sequence in the
        batch.

        Example (B = 2, S = 5):

        position_ids = tensor([
         [0, 1, 2, 3, 4],
         [0, 1, 2, 3, 4]
        ])
        """

        assert input_ids.shape[1] <= self.n_positions
        # B = batch_size, S = seq_len
        B, S = input_ids.shape

        # token embeddings: (B, S, D)
        token_embeddings = self.token_embeddings(input_ids)

        # position ids: (1, S) -> (B, S)
        position_ids = torch.arange(S, device=input_ids.device).unsqueeze(0).expand(B, S)

        # positional embeddings: (B, S, D)
        positional_embeddings = self.position_embeddings(position_ids)

        # sum + dropout
        return self.dropout(token_embeddings + positional_embeddings)

    def token_logits(self, x: torch.FloatTensor) -> torch.FloatTensor:
        """Project the final hidden states of the model to token logits.

        Args:
            x: hidden states produced by the final decoder block (B x S x D)

        Returns:
            logits: logits corresponding to the predicted next token likelihoods (B x S x V)

        """

        logits = self.lm_head(x)
        return logits

    def forward(
        self,
        input_ids: torch.LongTensor,
    ) -> torch.FloatTensor:
        """A forward pass of the decoder LM, converting input_ids to token logits.

        Args:
            input_ids: tokens ids with shape (B x S)

        Returns:
            logits: logits corresponding to the predicted next token likelihoods (B x S x V)
        """
        # 1) embed tokens + positions
        x = self.embed(input_ids)          # (B, S, D)

        # 2) pass through decoder blocks
        for block in self.blocks:
            x = block(x)                   # (B, S, D)

        # 3) final layer norm
        x = self.ln(x)                     # (B, S, D)

        # 4) project to vocab
        logits = self.token_logits(x)      # (B, S, V)

        return logits

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)


$\textbf{Question 1.1}$ (5 points) **Weight Tying**

[Press and Wolf (2017)](https://aclanthology.org/E17-2025/) proposed a weight tying technique for projecting hidden states of a language model to token logits.

**Read** this paper first.

**Write code** in `DecoderLM.__init__()` to implement the weight tying technique.

**In your report**, explain what weight tying does.


**ANSWER TO QUESTION 1.1**

The general idea is to have the model use the same matrix to read words and to predict them.

In models like the one considered here, there are usually two large vocabulary-sized matrices: the input embedding matrix (which turns tokens into vectors) and the output projection (which turns hidden vectors into logits over the vocabulary). In PyTorch both weights have shape (V, D), and without tying they are two separate learnable matrices.

Weight tying forces those two matrices to be the same tensor, so a single parameter matrix lives in memory and both layers point to it. This has a double advantage: (i) as the model learns better embeddings, the output head improves at the same time; (ii) conversely, as the model learns better output predictions, the embeddings improve too.

This works because the two operations are two sides of the same coin: "given token i, what vector represents it?" versus "given a vector, how likely is token i?". It therefore makes sense for them to share parameters.

In the paper, this cut the number of parameters roughly in half while improving performance. A similar "fewer parameters, better results" trend was seen recently with the Tiny Models by Alexia Jolicoeur-Martineau, which reportedly achieved better results than Gemini models with 0.01% of the parameters and a budget under 500 USD.
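
To make the parameter saving concrete, the sketch below counts the learnable parameters of the `DecoderLM` layout defined in this notebook, with and without tying. It is a back-of-the-envelope check rather than a replacement for `count_params`: the helper name `decoder_param_count` is hypothetical, the vocabulary size 50257 assumes the `gpt2` tiktoken encoding, the other values come from `hyperparam_config`, and `n_embd` is assumed to be divisible by `n_head`.

```python
def decoder_param_count(n_vocab, n_embd, n_positions, n_layer, tied):
    """Count learnable parameters of the DecoderLM layout in this notebook.

    Hypothetical helper for illustration; assumes n_embd % n_head == 0 so that
    all attention heads together hold 3 * n_embd * n_embd weights.
    """
    token_emb = n_vocab * n_embd
    pos_emb = n_positions * n_embd
    layer_norm = 2 * n_embd                     # weight + bias
    attn_qkv = 3 * n_embd * n_embd              # key/query/value, bias=False
    attn_proj = n_embd * n_embd + n_embd        # output projection with bias
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attn_qkv + attn_proj + mlp
    lm_head = 0 if tied else n_vocab * n_embd   # bias=False, shared when tied
    # embeddings + decoder blocks + final layer norm + output head
    return token_emb + pos_emb + n_layer * block + layer_norm + lm_head


# Assumed values: gpt2 vocabulary size and the notebook's hyperparam_config.
cfg = dict(n_vocab=50257, n_embd=288, n_positions=256, n_layer=6)
tied = decoder_param_count(**cfg, tied=True)
untied = decoder_param_count(**cfg, tied=False)
saved = untied - tied

print(f"tied:   {tied:,}")
print(f"untied: {untied:,}")
print(f"saved:  {saved:,} ({saved / untied:.1%} of the untied model)")
```

For this configuration, tying removes the separate `n_vocab x n_embd` output matrix, which is about 41% of the untied total, because the vocabulary matrices dominate a model this small.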


$\textbf{Question 1.2}$ (21 points) **Transformer Model**


Recall the Transformer Decoder shown in Figure 1. You are expected
to **implement the `DecoderLM` class** for the full decoder model:

   - the embedding step (6 points): implement the `DecoderLM.embed()` function, which converts input token indices into token embeddings and adds positional embeddings to produce the transformer input.
   - the final output logits (3 points): implement the `DecoderLM.token_logits()` function, which projects the final hidden states of the model to token logits.
   - decoder blocks (12 points): implement the `DecoderLM.forward()` function by calling `DecoderLM.embed()`, passing the resulting embeddings through the transformer blocks to get the final hidden states, and calling `DecoderLM.token_logits()` to get the final token logits.

Now please read `Model` section 1 carefully and complete the `Model` section 2 above, by implementing all the "[TODO]" specified. Your implementation will earn points for each of the "[TODO]".

## Transformer Pre-training

Now that you've set up the transformer, it's time to begin training the model! For ease of implementation and testing, the tokenized input will be provided to you. We use a portion of the C4 corpus, a refined subset of the Common Crawl web corpus. The dataset is automatically downloaded as a part of the training script, eliminating the need for you to manually access it.



$\textbf{Question 2.1}$ (24 points) **Model Training Pipeline**

The training pipeline is outlined in the `Model Training` section 1, 2, 3, and 4 below. Now please read `Model Training` section 1 and 4 carefully and complete the `Model Training` section 2 and 3 below, by implementing all the "[TODO]" specified. Specifically, you are expected to **implement three key functions**:

- cosine_lr_schedule (6 points) - a learning rate scheduler that uses Cosine annealing (refer to question 3.2).
- compute_language_modeling_loss (6 points) - the function to calculate loss for both training and evaluating the model.
- train (12 points) - the main loop for training the model.
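
As a quick illustration of the schedule shape, here is a minimal standalone sketch of linear warmup followed by cosine decay, evaluated at a few steps. The function name `cosine_lr` is hypothetical, and the numbers (100 warmup steps, 2000 training steps, learning rate between 3e-5 and 3e-4) are the values that appear in this notebook's `hyperparam_config`.

```python
import math


def cosine_lr(t, num_warmup_steps, num_training_steps, min_lr, max_lr):
    """Learning rate at step t: linear warmup, then cosine decay to min_lr."""
    if t < num_warmup_steps:
        # linear ramp from 0 towards max_lr
        return max_lr * t / num_warmup_steps
    if t >= num_training_steps:
        # after the schedule ends, stay at the floor
        return min_lr
    # progress runs from 0 (end of warmup) to 1 (end of training)
    progress = (t - num_warmup_steps) / (num_training_steps - num_warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = dict(num_warmup_steps=100, num_training_steps=2000, min_lr=3e-5, max_lr=3e-4)
for step in (0, 50, 100, 1050, 2000):
    print(step, cosine_lr(step, **schedule))
```

The rate climbs linearly to `max_lr` at the end of warmup, passes the midpoint between `max_lr` and `min_lr` halfway through the decay phase, and ends at `min_lr`.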




We use [wandb](https://wandb.ai/) to log the training curves for visualization. You will need a [Weights & Biases account](https://wandb.ai/login).

In [None]:
! wandb login --relogin

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
Aborted!
^C


In [4]:
import wandb

In [5]:
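# Do not hardcode an API key in a shared notebook; with no `key` argument,
# wandb falls back to the WANDB_API_KEY environment variable or a stored login.
wandb.login()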

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/alkan/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mnicolas-goulet[0m ([33mnicolas-goulet-hec[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

**Remember** to consider various training hyperparameters such as batch size, learning rate (including its scheduler), and gradient accumulation, among others. These parameters are set in a configuration dictionary `hyperparam_config` in the `Training Hyper-parameter` section, and we have provided sample configurations for you to adjust these settings.
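
One way to sanity-check a configuration before launching a run is to estimate its compute from the same formula the notebook uses for `flops_per_token` (`6 * params + 12 * n_layer * n_embd * n_positions`) and compare it with the 1e17 limit checked in `main()`. The sketch below does that and also shows that halving the batch size while doubling gradient accumulation leaves the tokens per optimizer step, and hence the total FLOPs, unchanged. The helper `training_flops` is hypothetical, and the parameter count is an assumption derived for this configuration with a 50257-token vocabulary; in practice, read it from `count_params(model)`.

```python
def training_flops(n_params, n_layer, n_embd, n_positions,
                   num_training_steps, grad_accumulation_steps,
                   batch_size, seq_len):
    """Estimate total training FLOPs with the notebook's per-token formula."""
    flops_per_token = 6 * n_params + 12 * n_layer * n_embd * n_positions
    tokens_seen = (num_training_steps * grad_accumulation_steps
                   * batch_size * seq_len)
    return flops_per_token * tokens_seen


# Assumption: parameter count for n_embd=288, n_layer=6, n_positions=256 with
# tied weights and a 50257-token vocabulary; use count_params(model) in practice.
N_PARAMS = 20_537_568
model_shape = dict(n_params=N_PARAMS, n_layer=6, n_embd=288, n_positions=256)

base = training_flops(**model_shape, num_training_steps=2000,
                      grad_accumulation_steps=1, batch_size=48, seq_len=256)
accumulated = training_flops(**model_shape, num_training_steps=2000,
                             grad_accumulation_steps=2, batch_size=24, seq_len=256)

print(f"batch 48 x 1 micro-step:  {base:.2e} FLOPs")
print(f"batch 24 x 2 micro-steps: {accumulated:.2e} FLOPs")
print("within 1e17 budget:", base < 1e17)
```

Gradient accumulation is therefore a memory trade-off rather than a compute saving: it lets a smaller batch fit in VRAM while each optimizer step still sees the same number of tokens.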

In [6]:
############ Training Hyper-parameter Section ############
hyperparam_config = {
    "output_dir": "outputs/GPT-4060ti-320d",
    "tokenizer_encoding": "gpt2",
    "model_config": {
        "n_embd": 288,      # bigger model
        "n_head": 8,
        "n_positions": 256, # longer context → more VRAM
        "n_layer": 6,
    },
    "device": "auto",
    "batch_size": 48,       # start with 48; if OOM, go 40 or 32
    "seq_len": 256,
    "num_warmup_steps": 100,
    "num_training_steps": 2000,
    "grad_accumulation_steps": 1,
    "min_lr": 3e-5,
    "max_lr": 3e-4,
}


In [7]:
############ Model Training Section 1 (DO NOT MODIFY) ############


import json
import math
import os
import sys
import time
from collections import deque
from collections.abc import Iterator
from contextlib import nullcontext
from typing import Callable
from rich import print

import numpy as np
import tiktoken
import torch
import torch.nn.functional as F
import wandb
from einops import rearrange
from omegaconf import OmegaConf
from tqdm import tqdm, trange


def random_batch_sampler(
    tokens: torch.LongTensor, device: str, batch_size: int, seq_len: int
) -> Iterator[torch.LongTensor]:
    """An infinite generator that samples batches of sequences from the tokens.

    Args:
        tokens: a 1d torch tensor of token ids
        device: the device to put the batch on
        batch_size: the batch size of the output tensor (B)
        seq_len: the sequence length of the output tensor (S)

    Returns:
        An infinite generator that samples batches of sequences from the
        tokens. Each batch has shape (B x S). Every sequence in the batch is
        a contiguous subsequence of tokens, sampled uniformly at random. The
        output tensor should be on the right device.
    """

    num_tokens = len(tokens)

    while True:
        # Generate random starting indices for each sequence in the batch
        indices = np.random.randint(0, num_tokens - seq_len + 1, batch_size)

        # Gather sequences from the tokens
        batch = torch.stack([tokens[i:i + seq_len] for i in indices])

        # Move the batch to the specified device
        yield batch.to(device)


def sequential_batch_sampler(
    tokens: torch.LongTensor, device: str, batch_size: int, seq_len: int
) -> Iterator[torch.LongTensor]:
    """A generator that yields batches of tokens.

    Args:
        tokens: a 1d torch tensor of token ids
        device: the device to put the batch on
        batch_size: the batch size of the output tensor (B)
        seq_len: the sequence length of the output tensor (S)

    Returns:
        A generator that yields a batch of tokens at a time. Each batch has
        shape (B x S). Every sequence in the batch is a contiguous subsequence
        of tokens in sequential order. The output tensor should be on the right
        device.

    Note: If the last batch is incomplete, which could happen when the number
        of tokens is not divisible by (batch_size * seq_len), you could drop
        the last batch.
    """

    num_tokens = len(tokens)
    total_batches = num_tokens // (batch_size * seq_len)

    for batch_idx in range(total_batches):
        start_idx = batch_idx * batch_size * seq_len
        end_idx = start_idx + batch_size * seq_len

        # Extract the batch of tokens
        batch = tokens[start_idx: end_idx].view(batch_size, seq_len)

        # Move the batch to the specified device and yield
        yield batch.to(device)

In [8]:
############ Model Training Section 2 ############

def cosine_lr_schedule(
    num_warmup_steps: int,
    num_training_steps: int,
    min_lr: float,
    max_lr: float,
) -> Callable[[int], float]:
    def get_lr(t: int) -> float:
        """Outputs the learning rate at step t under the cosine schedule.

        Args:
            t: the current step number

        Returns:
            lr: learning rate at step t

        """

        assert max_lr >= min_lr >= 0.0
        assert num_training_steps >= num_warmup_steps >= 0

        if num_warmup_steps > 0 and t <= num_warmup_steps:
            # Linear warmup from 0 to max_lr; skipped when there is no warmup,
            # so that step 0 starts at max_lr instead of 0
            lr = max_lr * (t / num_warmup_steps)
        elif t >= num_training_steps:
            # After training steps, return min_lr
            lr = min_lr
        else:
            # Cosine decay from max_lr to min_lr
            # progress goes from 0 → 1 over the decay phase
            progress = (t - num_warmup_steps) / (num_training_steps - num_warmup_steps)
            cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
            lr = min_lr + (max_lr - min_lr) * cosine
        return lr

    return get_lr


def set_lr(optimizer: torch.optim.Optimizer, lr: float) -> None:
    for g in optimizer.param_groups:
        g["lr"] = lr



def compute_language_modeling_loss(
    input_ids: torch.LongTensor, logits: torch.FloatTensor
) -> torch.FloatTensor:
    """Outputs the language modeling loss given input_ids and logits

    Args:
        input_ids: the input token ids (B, S)
        logits: the next token logits produced by the language model (B, S, V)

    Returns:
        loss: the mean cross entropy loss for next token prediction
    """
    # shift inputs to get labels: each position predicts the NEXT token
    # labels: (B, S-1)
    labels = input_ids[:, 1:]

    # drop the last time step in logits so shapes match: (B, S-1, V)
    logits = logits[:, :-1, :]

    # flatten for cross-entropy: logits -> (B*(S-1), V), labels -> (B*(S-1),)
    B, S_minus_1, V = logits.shape
    logits_flat = logits.reshape(B * S_minus_1, V)
    labels_flat = labels.reshape(B * S_minus_1)

    loss = F.cross_entropy(logits_flat, labels_flat)
    return loss
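
The one-position shift in `compute_language_modeling_loss` can be checked by hand on a tiny example. The sketch below recomputes the same quantity in plain Python with explicit loops, purely for illustration (the vectorized PyTorch version above is the one to use); the function name `next_token_loss` and the toy inputs are hypothetical.

```python
import math


def next_token_loss(input_ids, logits):
    """Mean cross-entropy where the logits at position t predict token t + 1.

    input_ids: list of sequences of token ids, shape (B, S)
    logits:    nested lists of scores, shape (B, S, V)
    """
    total, count = 0.0, 0
    for seq_ids, seq_logits in zip(input_ids, logits):
        # Drop the last logit row (nothing to predict after the final token)
        # and the first token (nothing predicts it), so both have S - 1 items.
        for row, target in zip(seq_logits[:-1], seq_ids[1:]):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[target]
            count += 1
    return total / count


ids = [[0, 2, 1]]  # one sequence, vocabulary of size 3

# Uninformative logits: every next token is equally likely, so loss = ln(3).
uniform = [[[0.0, 0.0, 0.0]] * 3]
loss_uniform = next_token_loss(ids, uniform)

# Logits that strongly favor the true next token at positions 0 and 1.
# The last row is never scored, whatever it contains.
confident = [[[0.0, 0.0, 20.0], [0.0, 20.0, 0.0], [20.0, 0.0, 0.0]]]
loss_confident = next_token_loss(ids, confident)

print(loss_uniform, math.log(3))
print(loss_confident)
```

With uniform logits the loss equals the log of the vocabulary size, which is the value an untrained model should start near; confident, correct predictions drive it towards zero.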

In [9]:
############ Model Training Section 3 ############

def train(
    model: DecoderLM,
    batch_sampler: Iterator[torch.LongTensor],
    optimizer: torch.optim.Optimizer,
    lr_schedule: Callable[[int], float],
    autocast: torch.autocast | nullcontext = nullcontext(),
    num_training_steps: int = 0,
    grad_accumulation_steps: int = 1,
) -> None:
    """A training loop for the language model

    Args:
        model: the decoder LM
        batch_sampler: a generator that produces batches of token ids
        optimizer: an optimizer for gradient update
        lr_schedule: a callable that produces the learning at a step number
        autocast: a context manager that handles tensor casting (you do not need
          to care about this for your implementation)
        num_training_steps: number of steps to train for
        grad_accumulation_steps: number of "micro" training steps before each
          gradient update
    """
    # stores training losses for the 20 latest steps
    losses = deque(maxlen=20 * grad_accumulation_steps)

    for step in (pbar := trange(num_training_steps)):
        t0 = time.time()

        # 1) set LR for this step
        lr = lr_schedule(step)
        set_lr(optimizer, lr)

        # 2) accumulate gradients over micro-steps
        optimizer.zero_grad(set_to_none=True)
        for _ in range(grad_accumulation_steps):
            input_ids = next(batch_sampler)
            with autocast:
                logits = model(input_ids)
            loss = compute_language_modeling_loss(input_ids, logits)
            # scale loss so total gradient over accumulation is same as single step
            (loss / grad_accumulation_steps).backward()
            losses.append(loss.item())

        # 3) update params
        optimizer.step()

        # logging stuff
        loss_mean = np.mean(losses).item()

        FLOPs_per_step = (
            model.flops_per_token
            * input_ids.shape[0]
            * input_ids.shape[1]
            * grad_accumulation_steps
        )
        t1 = time.time()
        dt = t1 - t0
        t0 = t1
        pbar.set_postfix(
            {
                "train loss": f"{loss_mean:.2f}",
                "TFLOPS": f"{FLOPs_per_step / dt / 1e12:.1f}",
            }
        )
        wandb.log({"train-loss": loss_mean, "learning-rate": lr}, step=step)

$\textbf{Question 2.2}$ (10 points) **Model Training Results**

**Execute the `Model Training` Section 4** below to run the pre-training pipeline, and report the training loss curve. (Note: running for 60 steps takes around **8 minutes** for training.)


**In your report**, answer the following:

A. Plot the training curve for the default `hyperparam_config` (use a screenshot from weights & biases: https://wandb.ai/site), and discuss how and why it looks like that.

B. Discuss your `DecoderLM` pre-training experiment.

**ANSWER QUESTION 2.2**

A. TODO WHY IT LOOKS LIKE THAT
B. DISCUSS DECODERLM PRETRAINING EXPERIMENT

In [10]:
############ Model Training Section 4 (DO NOT MODIFY) ############



@torch.inference_mode()
def evaluate(
    model: DecoderLM,
    batch_sampler: Iterator[torch.LongTensor],
    autocast: torch.autocast | nullcontext = nullcontext(),
) -> dict[str, float]:
    losses = []

    for input_ids in tqdm(batch_sampler, desc="evaluating.."):
        with autocast:
            logits = model(input_ids)
        loss = compute_language_modeling_loss(input_ids, logits)
        losses.append(loss.item())

    # mean of the losses is the average negative log likelihood
    mean_loss = sum(losses) / len(losses)
    perplexity = math.exp(mean_loss)

    eval_results = {
        "val-loss": mean_loss,
        "val-perplexity": perplexity,
    }
    wandb.log(eval_results)
    return eval_results

def main():
    enable_tf32()

    # create an output directory and dump the configuration file
    config = OmegaConf.create(hyperparam_config)
    os.makedirs(config.output_dir, exist_ok=True)
    OmegaConf.save(config, os.path.join(config.output_dir, "config.yaml"))
    print("#" * 40, OmegaConf.to_yaml(config).strip(), "#" * 40, sep="\n")
    wandb.init(project="llms-hw2", config=OmegaConf.to_container(config))

    # initialize tokenizer and model
    tokenizer = tiktoken.get_encoding(config.tokenizer_encoding)
    device = determine_device() if config.device == "auto" else config.device
    model = DecoderLM(tokenizer.n_vocab, **config.model_config).to(device)
    print(f"model parameters = {count_params(model) / 1e6:.0f}M")

    model_disk_size_MB = estimate_model_disk_size(model) * 1e-6
    if model_disk_size_MB > 98:
        print(
            f"[red]WARNING: your model is {model_disk_size_MB:.1f}MB. "
            "The largest model size allowed by GradeScope is 100MB, "
            "and you may have trouble with submitting the assignment. "
            "Please update your config so your model is at most 100 MB.[/red]"
        )
    else:
        print(
            f"Your model is {model_disk_size_MB:.1f}MB. This should be within "
            "the 100MB limit of Gradescope."
        )

    # prepare data and data generator
    assert config.seq_len <= config.model_config.n_positions
    tokens = np.load("tokens.npz")

    train_tokens = torch.from_numpy(tokens["train"].astype(int))
    val_tokens = torch.from_numpy(tokens["val"].astype(int))

    train_sampler = random_batch_sampler(
        train_tokens, device, config.batch_size, config.seq_len
    )
    val_sampler = sequential_batch_sampler(
        val_tokens, device, config.batch_size, config.seq_len
    )
    print(f"train dataset tokens = {len(train_tokens) / 1e6:.0f}M")
    FLOPs = (
        model.flops_per_token
        * config.num_training_steps
        * config.grad_accumulation_steps
        * config.batch_size
        * config.seq_len
    )
    print(f"train FLOPs = {FLOPs:.2e}")
    if FLOPs > 1e17:
        print(
            f"[red]WARNING: your train FLOPs is {FLOPs:.2e}. "
            "This is more than the max compute that we allow (1e+17). "
            "Please reduce your model size or train steps.[/red]"
        )

    # prepare optimizer and lr schedule
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=0.0,  # will set this dynamically in the training loop
        betas=(0.9, 0.95),
        fused=device == "cuda",
    )
    lr_schedule = cosine_lr_schedule(
        config.num_warmup_steps, config.num_training_steps, config.min_lr, config.max_lr
    )
    autocast = (
        torch.autocast(
            device,
            dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32),
        )
        if device == "cuda"
        else nullcontext()
    )
    # training
    model.train()
    train(
        model,
        train_sampler,
        optimizer,
        lr_schedule,
        autocast,
        config.num_training_steps,
        config.grad_accumulation_steps,
    )

    # save the trained model
    model_path = os.path.join(config.output_dir, "model.pt")
    torch.save(model.state_dict(), model_path)
    print(f"model saved to {model_path}")

main()



100%|██████████| 2000/2000 [06:49<00:00,  4.88it/s, train loss=5.82, TFLOPS=7.5]
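The compute check in `main()` above multiplies the per-token FLOPs by the total number of tokens processed during training. A minimal sketch of that arithmetic follows. The helper name `training_flops` and all numeric values are hypothetical and chosen only for illustration; the 1e17 budget is the one stated in the warning message.

```python
def training_flops(
    flops_per_token,
    num_training_steps,
    grad_accumulation_steps,
    batch_size,
    seq_len,
):
    """Total training FLOPs, using the same product as the check in main()."""
    # Tokens consumed by one optimizer step (all accumulated micro-batches).
    tokens_per_step = grad_accumulation_steps * batch_size * seq_len
    return flops_per_token * num_training_steps * tokens_per_step


MAX_FLOPS = 1e17  # compute budget quoted in the warning message

# Hypothetical configuration, for illustration only.
flops = training_flops(
    flops_per_token=1e8,
    num_training_steps=2000,
    grad_accumulation_steps=4,
    batch_size=32,
    seq_len=256,
)
print(f"train FLOPs = {flops:.2e}")
print("within budget" if flops <= MAX_FLOPS else "over budget")
```

Because the total is a plain product, halving any one factor (for example the number of training steps) halves the compute, which can help when trading model size against training length.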


# Task 2: Bias in Large Language Models (LLMs)

In this task, you will critically examine limitations and bias of LLMs. You will use a pre-trained BERT model in the following `Bias of BERT` section to analyze the potential sources of bias from the data to the model itself.

Enter your prompts in this notebook, and **in your discussion**, describe and explain these prompts in detail.



**NOTE**: Please restart the Colab runtime before running the code for Task 2 to avoid package dependency issues. **Go to the "Runtime" menu and click "Restart runtime".**

In [None]:
! pip install -qqq torch==2.5.0 accelerate==0.34.2 datasets==3.1.0 evaluate==0.4.3 transformers[sentencepiece]==4.44.2

Each model on HuggingFace has its own model card, a descriptive account of various metadata
about the model, such as its training data and the applications the model is used for. Take
note of the [Limitations and bias section](https://huggingface.co/course/chapter1/8?fw=pt) of the model card for the BERT model.

In [None]:
############ Bias of BERT (DO NOT MODIFY) ############
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("The man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("The woman works as a [MASK].")
print([r["token_str"] for r in result])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identica

['carpenter', 'farmer', 'baker', 'tailor', 'salesman']
['nurse', 'waitress', 'teacher', 'prostitute', 'maid']


$\textbf{Question 3.1}$ (8 points)

Experiment with the two prompts given to the `unmasker` above. Then create two
original sets of fill-mask prompts, each containing at least two prompts, that induce the model to exhibit
a negative bias towards a traditionally minoritized population (such as women, people of color,
queer people, lower castes, etc.) and a positive bias towards a traditionally normative population
(men, white, straight, upper caste, etc.). In your report, define your assumptions/context of
what is “minoritized” and what is “normative”. What biases do your examples show?
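When comparing a pair of prompts, it can help to separate the predictions that appear for only one prompt from those shared by both. The sketch below assumes each result is a list of dicts with a `token_str` key, as used in the `Bias of BERT` cell above. The helper `compare_predictions` is hypothetical and not part of the homework code; the example reuses the tokens printed by that cell rather than calling the model.

```python
def compare_predictions(results_a, results_b):
    """Split predicted tokens into those unique to each prompt and those shared.

    Each argument is a list of dicts with a "token_str" key, the shape used
    for the fill-mask results in this notebook.
    """
    tokens_a = [r["token_str"] for r in results_a]
    tokens_b = [r["token_str"] for r in results_b]
    only_a = [t for t in tokens_a if t not in tokens_b]
    only_b = [t for t in tokens_b if t not in tokens_a]
    shared = [t for t in tokens_a if t in tokens_b]
    return only_a, only_b, shared


# Tokens copied from the output of the "Bias of BERT" cell above.
man = [{"token_str": t} for t in ["carpenter", "farmer", "baker", "tailor", "salesman"]]
woman = [{"token_str": t} for t in ["nurse", "waitress", "teacher", "prostitute", "maid"]]

only_man, only_woman, shared = compare_predictions(man, woman)
print("only first prompt: ", only_man)
print("only second prompt:", only_woman)
print("shared:            ", shared)
```

In the notebook, the lists returned by `unmasker(...)` can be passed to the helper directly in place of `man` and `woman`.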

In [None]:
##### Bias Section ####

# Negative-Bias Towards Minoritized, Positive-Bias Towards Normative
# Case 1 to implement
result = unmasker("... [MASK].") # the normative
print([r["token_str"] for r in result])

result = unmasker("... [MASK].") # the minoritized
print([r["token_str"] for r in result])

# Case 2 to implement
result = unmasker("... [MASK].") # the normative
print([r["token_str"] for r in result])

result = unmasker("... [MASK].") # the minoritized
print([r["token_str"] for r in result])

['process', 'approach', 'system', 'model', 'theory']
['one', 'man', 'ones', 'people', 'woman']


$\textbf{Question 3.2}$ (2 points)

Come up with one “switched” (anti-stereotype) example where the model exhibits
positive-negative bias the other way around. Explain the bias being shown here, how you
came up with this example, and include this example in the notebook submission.


In [None]:
#### Switched Example Section ####

result = unmasker("... [MASK].") # the minoritized
print([r["token_str"] for r in result])

result = unmasker("... [MASK].") # the normative
print([r["token_str"] for r in result])