# BINF GU 4002: Machine Learning for Healthcare, Spring 2025
# Assignment \#4: Conceptual Foundations and Limitations of Foundation Models
## DUE: 11:59 PM, Tuesday, April 29, 2025

This assignment is a hands-on exploration of frontier large-scale pretrained models, also known as "foundation models". The goal is to build intuition around these models by connecting concepts and ideas previously explored in class. As an illustrative example, you will use tabular electronic health record data from the previous homework to finetune a state-of-the-art LLM for mortality prediction. The assignment is designed to be more open-ended and is a chance to explore some of the literature in this field.

**<font color="red">Instructions: Please run the notebook using Google Colab to prevent any dependency / package issues and use the GPU runtime type provided by Colab. Make sure that your written answers are formatted using </font>$\LaTeX$<font color="red"> in `markdown` cells. When submitting, please name your files `{UNI}_binf4008_mlh_assignment_4.{filetype}` and submit a `.ipynb` version of your Jupyter notebook. </font>**

Important: For this assignment you will have to change the runtime type to use GPUs in Colab (the Unsloth package requires GPUs for speedup). In the top-right menu, change the runtime type to T4 GPU before running any of the code below.

## [30 Points] Question 1: Preliminaries

We will be using [Llama](https://arxiv.org/pdf/2302.13971), a pretrained LLM by Meta AI that is built on a GPT-style, decoder-only transformer architecture.

From the paper (https://arxiv.org/pdf/2302.13971):

> We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions
of tokens, and show that it is possible to train
state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.

The Hugging Face repository contains documentation and pretrained model weights, which can be used for finetuning or for inference out of the box: https://huggingface.co/docs/transformers/main/en/model_doc/llama

#### [15 Points] 1.1: From the Llama paper (linked above), list and explain some of their datasets used for pre-training. What type of text data is included in these datasets (ex. CommonCrawl)?

<font color="red">Answer 1.1</font>

- English CommonCrawl [67%]
- C4 [15%]
  - diverse pre-processed CommonCrawl datasets
- Github [4.5%]
  - code data from projects that are distributed under the Apache, BSD and MIT licenses
- Wikipedia [4.5%]
  - covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk.
- Gutenberg and Books3 [4.5%]
  - book corpora: the Gutenberg Project, which contains books in the public domain, and the Books3 section of ThePile
- ArXiv [2.5%]
  - arXiv LaTeX files, included to add scientific data to the dataset
- Stack Exchange [2%]
  - high-quality questions and answers that cover a diverse set of domains, ranging from computer science to chemistry

#### [15 Points] 1.2 Based on the datasets used, name one target domain and corresponding inference task you think the model will fail to generalize to? Think in terms of the pretraining data distribution over possible tokens $p_{train}(X)$ versus your example domain $p_{target}(X)$ and the density ratio $\frac{p_{target}(X)}{p_{train}(X)}$ for characterizing distribution overlap. (We encourage you to use examples from your own research)

<font color="red">Answer 1.2</font>

Drug-protein interaction prediction. First, this is not really an NLP task: unless the model can learn the underlying physics (which it cannot do from natural language alone), it cannot truly learn how to predict an interaction. Second, all of the pretraining data are publicly available, which is not the case in this field. High-quality, high-throughput drug-protein interaction data mostly come from large pharmaceutical companies rather than from academia, so $p_{train}(X)$ places very little mass where $p_{target}(X)$ is concentrated and the density ratio is large there. Finally, the arXiv corpus is not diverse enough to give the LLM broad scientific knowledge: not all papers are submitted to arXiv, and some fields customarily post preprints elsewhere, such as bioRxiv and medRxiv. As a result, the scientific knowledge learned by the LLM may be biased toward particular fields, such as CS.
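To make the density-ratio argument concrete, here is a minimal sketch that estimates $\frac{p_{target}(X)}{p_{train}(X)}$ at the level of single tokens. The two toy "corpora", the whitespace tokenization, the add-one smoothing, and the helper names `unigram_probs` and `density_ratio` are all illustrative assumptions, not part of the assignment.

```python
from collections import Counter

def unigram_probs(tokens, vocab, alpha=1.0):
    """Smoothed unigram distribution over a shared vocabulary (add-alpha smoothing)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {tok: (counts[tok] + alpha) / total for tok in vocab}

def density_ratio(train_tokens, target_tokens, alpha=1.0):
    """Per-token estimate of p_target(x) / p_train(x)."""
    vocab = sorted(set(train_tokens) | set(target_tokens))
    p_train = unigram_probs(train_tokens, vocab, alpha)
    p_target = unigram_probs(target_tokens, vocab, alpha)
    return {tok: p_target[tok] / p_train[tok] for tok in vocab}

# Toy stand-ins for a web-style pretraining corpus and a drug-protein target corpus
train = "the model reads the web and the code".split()
target = "the ligand binds the kinase pocket".split()

ratios = density_ratio(train, target)
for tok, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{tok:>8s}  {ratio:.3f}")
```

Tokens that appear only in the target corpus get a ratio above 1, while tokens that appear only in the pretraining corpus get a ratio below 1. Large ratios mark regions of the target domain that the pretraining distribution barely covers.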

#### Text Serialization and Tabular Classification

We will revisit the MIMIC dataset that was used in HW3. The first objective will be to finetune the Llama model to perform toy prediction tasks in the form of Q and A.

LLMs have been shown to solve many tabular classification and regression problems at scale, owing to their ability to encode information across tasks. We will use a tokenization strategy known as "text serialization" (1, 2), which converts tabular data into natural-language text that is then tokenized with a pre-existing vanilla tokenizer. Developing tokenization strategies for various data types (continuous measurements, time, multi-modal data) is an active research field that our department also works on.

- [1] TabLLM: Few-shot Classification of Tabular Data with Large Language
Models: https://arxiv.org/pdf/2210.10723
- [2] Large Scale Transfer Learning for Tabular Data
via Language Modeling: https://arxiv.org/pdf/2406.12031


In [1]:
import math

import pandas as pd
import torch
import torch.nn.functional as F
from datasets import Dataset
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import DataCollatorWithPadding, TrainingArguments
from trl import SFTTrainer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!


Here we downsample the data to make the computation manageable.

In [2]:
df = pd.read_csv("./data/processed_mimic_sample.csv", index_col=0)
data = df.copy()
data.drop_duplicates("subject_id", inplace=True)
data.set_index("subject_id", inplace=True, drop=True)
# we want 500 positive and 500 negative samples
negative_index = data[data["label"] == 0].sample(500, random_state=42).index.to_list()
positive_index = data[data["label"] == 1].sample(500, random_state=42).index.to_list()
data = data.loc[positive_index + negative_index].copy()

data[data.select_dtypes(include='number').columns] = data.select_dtypes(include='number').round(0).astype('Int64') # Round measurements

print(data.shape)
# If your dataset has too many rows (say over 10000 patients) and columns (say over 50 features) it will take more compute for inference.
# I have found around 1000 patients (with balanced classes) and 10-20 features can work well as a proof-of-concept with Colab's GPU

(1000, 9)


In [3]:
data

subject_id,label,log_stay_day,admit_year,gender,age,admission_type,insurance,language,race
16015722,1,2,2138,F,86,EW EMER.,Medicare,English,BLACK/AFRICAN AMERICAN
11586654,1,2,2125,M,70,DIRECT EMER.,Medicare,English,WHITE
14039848,1,1,2153,F,56,EW EMER.,Medicaid,English,WHITE
15103745,1,2,2185,F,81,EW EMER.,Medicare,English,BLACK/AFRICAN AMERICAN
16077953,1,1,2182,M,91,EW EMER.,Medicare,English,WHITE
...,...,...,...,...,...,...,...,...,...
14551157,0,1,2161,F,52,EU OBSERVATION,Private,English,WHITE
13273553,0,3,2134,F,65,EW EMER.,Medicare,English,WHITE
15195372,0,1,2117,F,21,OBSERVATION ADMIT,Private,English,OTHER
17238479,0,1,2148,F,34,EU OBSERVATION,Private,English,WHITE


In [4]:
label_col = "label"
prompt = "Predict if the patient will die in hospital:||True||False||"

records = []
for _, row in data.iterrows():
    input_str = "Patient EHR: " + ", ".join(
        [f"{col} is {row[col]}" for col in data.columns if col != label_col] # Perform text serialization
    )
    records.append({
        "instruction": prompt,
        "input": input_str,
        "output": str(row[label_col])
    })

train_records, test_records = train_test_split(records, test_size=0.2, random_state=42)

# Create HuggingFace Datasets
hf_dataset_train = Dataset.from_pandas(pd.DataFrame(train_records))
hf_dataset_test = Dataset.from_pandas(pd.DataFrame(test_records))

ex_record = records[0]
# Full prompt + input structure for a single sample
print("Prompt:", ex_record['instruction'])
print("Input:", ex_record['input'])
print("Label (Output):", ex_record['output'])

Prompt: Predict if the patient will die in hospital:||True||False||
Input: Patient EHR: log_stay_day is 2, admit_year is 2138, gender is F, age is 86, admission_type is EW EMER., insurance is Medicare, language is English, race is BLACK/AFRICAN AMERICAN
Label (Output): 1
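The serialization step can also be written as two small standalone helpers. This is a sketch only: `serialize_row` and `label_to_text` are hypothetical names, and mapping the label to the strings `True` / `False` assumes the intent is for the response text to match the options listed in the prompt (the cell above writes the label as `1` / `0`).

```python
def serialize_row(row, label_col="label"):
    """Turn one record (a dict of column -> value) into a serialized text input."""
    parts = [f"{col} is {val}" for col, val in row.items() if col != label_col]
    return "Patient EHR: " + ", ".join(parts)

def label_to_text(label):
    """Map a binary label to the answer strings offered in the prompt (assumed mapping)."""
    return "True" if int(label) == 1 else "False"

# Toy record with a subset of the columns used above
example = {"label": 1, "age": 86, "gender": "F", "insurance": "Medicare"}
print(serialize_row(example))
print(label_to_text(example["label"]))
```

Keeping the serialization in a function makes it easier to swap in a different template when comparing prompting strategies.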


#### Finetuning

We will now load the pretrained model from huggingface and prepare it for fine-tuning.

In [8]:
max_seq_length = 256

# We will load in a pre-quantized model (4-bit precision), which is more memory efficient and faster to load and run.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = torch.float16,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # [2.2] Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",], # [Question 2.2] This specifies what type of parameters in the LLM are being fine-tuned
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.51.2.
   \\   /|    NVIDIA L40S. Num GPUs = 4. Max memory: 44.527 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
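For intuition about what `r` and `lora_alpha` control, here is a minimal pure-Python sketch of a LoRA-adapted linear layer. It assumes the standard scaling `lora_alpha / r` (the case `use_rslora = False`); the helper names and the tiny matrices are illustrative only and do not reflect how Unsloth implements it.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) is the frozen pretrained weight;
    A (r x d_in) and B (d_out x r) are the trainable low-rank factors.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2 x 2 weight
A = [[1.0, 0.0]]               # r = 1, d_in = 2
B_zero = [[0.0], [0.0]]        # B starts at zero, so the adapter is a no-op
B_trained = [[1.0], [1.0]]     # a made-up "trained" B

x = [3.0, 4.0]
print(lora_forward(W, A, B_zero, x, alpha=2, r=1))
print(lora_forward(W, A, B_trained, x, alpha=2, r=1))
```

With `B` initialized to zero, the adapted layer reproduces the pretrained output exactly, so fine-tuning starts from the pretrained model's behavior.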


In [12]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # From your tokenizer

def formatting_prompts_func(examples):
    return {
        "text": [
            alpaca_prompt.format(inst, inp, out) + EOS_TOKEN
            for inst, inp, out in zip(examples["instruction"], examples["input"], examples["output"])
        ]
    }

# Apply formatting
train_dataset = hf_dataset_train.map(formatting_prompts_func, batched=True)

# Set finetuning configurations
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 8, # Batch size
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 100, # Number of total training steps
        learning_rate = 2e-4, # [Question 2.2] learning rate
        fp16 = True,
        bf16 = False,
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map: 100%|██████████| 800/800 [00:00<00:00, 71422.80 examples/s]
Unsloth: Tokenizing ["text"] (num_proc=2): 100%|██████████| 800/800 [00:01<00:00, 794.95 examples/s]


In [13]:
# Run to finetune model
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 800 | Num Epochs = 17 | Total steps = 100
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 4 x 1) = 128
 "-____-"     Trainable parameters = 11,272,192/1,000,000,000 (1.13% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,2.7678
20,0.5099
30,0.2658
40,0.241
50,0.2164
60,0.1951
70,0.173
80,0.1617
90,0.1583
100,0.1585


In [14]:
# Save pretrained model to drive
model.save_pretrained("./lora_model") # Local saving
tokenizer.save_pretrained("./lora_model")

('./lora_model/tokenizer_config.json',
 './lora_model/special_tokens_map.json',
 './lora_model/tokenizer.json')

#### Inference

We will now load the fine-tuned model and run inference on our test samples.

In [16]:
max_seq_length = 256

# Only run this once if loading from saved model
model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "./lora_model", # path to saved pretrained model
        max_seq_length = max_seq_length,
        dtype = torch.float16,
        load_in_4bit = True,
    )

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def formatting_prompts_func_test(examples):
    # Format prompt strings first
    texts = [
        alpaca_prompt.format(inst, inp, "")
        for inst, inp in zip(examples["instruction"], examples["input"])
    ]

    # Return just text for now, no tokenization or device ops here
    return { "text": texts }

# Apply formatting
test_dataset = hf_dataset_test.map(formatting_prompts_func_test, batched=True)
test_dataset = test_dataset.remove_columns(["instruction", "input", "output"])

tokenizer.padding_side = "left"
tokenized_test = test_dataset.map(
    lambda x: tokenizer(
        x["text"], return_tensors=None, padding=True, truncation=True
    ),
    batched=True
)
tokenized_test = tokenized_test.remove_columns(["text"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
dataloader = DataLoader(tokenized_test, batch_size=8, collate_fn=data_collator, shuffle=False)

model.eval()
torch.set_grad_enabled(False)
model.to("cuda")

generated_outputs = []
log_likelihoods = []

target_tokens = ["False", "True"]
token_ids = tokenizer(target_tokens, add_special_tokens=False)["input_ids"]

# Token ids for "False" and "True" (first sub-token of each string)
false_token_id = token_ids[0][0]
true_token_id = token_ids[1][0]

Map: 100%|██████████| 200/200 [00:00<00:00, 64811.93 examples/s]
Map: 100%|██████████| 200/200 [00:00<00:00, 6634.88 examples/s]


In [20]:
# TODO [Question 2.1]: run inference using fine-tuned model on test samples and collect predictions (logits)
for i, batch in enumerate(dataloader):

    input_ids = batch['input_ids'].to('cuda')
    attention_mask = batch['attention_mask'].to('cuda')

    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_logits=True,
            return_dict_in_generate=True
        )

    logits = outputs.logits[0]  # shape: [batch_size, vocab_size]
    # Compute log-softmax over vocab logits
    log_probs = F.log_softmax(logits, dim=-1)

    # Collect log-likelihoods of the True / False tokens
    batch_false_log = log_probs[:, false_token_id]
    batch_true_log = log_probs[:, true_token_id]
    log_likelihoods.extend(torch.stack([batch_false_log, batch_true_log], dim=1).tolist())

    # generated_ids = outputs.sequences  # shape: [batch_size, input_len + new_tokens]
    # generated_outputs.extend() #Optional TODO: extract decoded tokens (if you want to compute accuracy / see text generation output)

In [None]:
y_true = [1 if label.strip().lower() == "1" else 0 for label in hf_dataset_test['output']]

y_scores = [1 / (1 + math.exp(log_true))
            for log_false, log_true in log_likelihoods]

auroc = roc_auc_score(y_true, y_scores)
print(f"AUROC: {auroc:.4f}")

AUROC: 0.4175
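One common way to turn a pair of token log-likelihoods into a single score for AUROC is to renormalize over the two candidate answers. The sketch below is an illustration under that assumption, with a hypothetical helper name `two_class_score`. It is only meaningful when the fine-tuned model actually answers with the `True` / `False` tokens.

```python
import math

def two_class_score(log_false, log_true):
    """P(True | answer is True or False) = sigmoid(log_true - log_false)."""
    diff = log_true - log_false
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)  # numerically stable branch for negative differences
    return e / (1.0 + e)

# Toy example: the model puts probability 0.6 on "True" and 0.2 on "False"
score = two_class_score(math.log(0.2), math.log(0.6))
print(f"{score:.2f}")
```

This score increases with `log_true`, so a higher score means the model favors `True`, which is the orientation `roc_auc_score` expects for the positive class.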


## [30 Points] Question 2: Supervised Fine-tuning of LLMs

#### [15 Points] 2.1: Complete the code for running inference and compute evaluation metrics of your choice


#### [15 Points] 2.2: Repeat for 2 more configurations of fine-tuning parameters or prompting / serialization strategies, and justify your choices (the objective is to familiarize yourself with LoRA fine-tuning so this is not graded on model performance). Some options may be
- Which parameters of the Llama model are fine-tuned
- LoRA decomposition rank
- Learning rate, regularization parameters

I ran into an issue here: once the full pipeline has been run, the model cannot be trained again in the same session. I therefore restarted the kernel after each full pipeline evaluation. If you need to rerun the code, please restart the kernel first.

In [5]:
max_seq_length = 256

# load in model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=torch.float16,
    load_in_4bit=True,
)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # From your tokenizer

target_tokens = ["False", "True"]
token_ids = tokenizer(target_tokens, add_special_tokens=False)["input_ids"]

# Token ids for "False" and "True" (first sub-token of each string)
false_token_id = token_ids[0][0]
true_token_id = token_ids[1][0]

def formatting_prompts_func(examples):
    return {
        "text": [
            alpaca_prompt.format(inst, inp, out) + EOS_TOKEN
            for inst, inp, out in zip(examples["instruction"], examples["input"], examples["output"])
        ]
    }

def formatting_prompts_func_test(examples):
    # Format prompt strings first
    texts = [
        alpaca_prompt.format(inst, inp, "")
        for inst, inp in zip(examples["instruction"], examples["input"])
    ]

    # Return just text for now, no tokenization or device ops here
    return { "text": texts }


##### 2.2[a] Change Rank

Here we reduce the LoRA rank to a smaller value ($8$). Since the sample size is very limited in this fine-tuning setting, too many trainable parameters would make the model prone to overfitting. Lowering the rank directly reduces the number of trainable parameters (from 11,272,192 at rank 16 to 5,636,096 at rank 8, as reported in the training logs).
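The link between rank and parameter count can be checked with a short calculation: a LoRA adapter on a weight of shape $d_{out} \times d_{in}$ adds $r(d_{in} + d_{out})$ parameters. The sketch below assumes the Llama-3.2-1B layer shapes (hidden size 2048, key/value projection width 512, MLP width 8192, 16 layers); these shapes are not stated in the notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, hidden=2048, kv_dim=512, mlp=8192, n_layers=16):
    """Count LoRA parameters when all seven projection types are adapted.

    The default shapes are assumed values for Llama-3.2-1B.
    """
    shapes = {  # module name -> (d_in, d_out)
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, mlp),
        "up_proj": (hidden, mlp),
        "down_proj": (mlp, hidden),
    }
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return n_layers * per_layer

for rank in (8, 16, 32):
    print(rank, f"{lora_param_count(rank):,}")
```

The count is linear in `r`, so halving the rank halves the number of trainable parameters. The values for ranks 8 and 16 can be compared with the trainable-parameter figures printed in the training logs.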

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=torch.float16,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # [2.2] Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",], # [Question 2.2] This specifies what type of parameters in the LLM are being fine-tuned
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

# get train dataset
train_dataset = hf_dataset_train.map(formatting_prompts_func, batched=True)

# Set finetuning configurations
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 8, # Batch size
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 100, # Number of total training steps
        learning_rate = 2e-4, # [Question 2.2] learning rate
        fp16 = True,
        bf16 = False,
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

# start training
trainer_stats = trainer.train()

# go inference
FastLanguageModel.for_inference(model)

# get test dataset
test_dataset = hf_dataset_test.map(formatting_prompts_func_test, batched=True)
test_dataset = test_dataset.remove_columns(["instruction", "input", "output"])

tokenizer.padding_side = "left"
tokenized_test = test_dataset.map(
    lambda x: tokenizer(
        x["text"], return_tensors=None, padding=True, truncation=True
    ),
    batched=True
)
tokenized_test = tokenized_test.remove_columns(["text"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
dataloader = DataLoader(tokenized_test, batch_size=8, collate_fn=data_collator, shuffle=False)

model.eval()
torch.set_grad_enabled(False)
model.to("cuda")

# start inference
log_likelihoods = []

for i, batch in enumerate(dataloader):

    input_ids = batch['input_ids'].to('cuda')
    attention_mask = batch['attention_mask'].to('cuda')

    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_logits=True,
            return_dict_in_generate=True
        )

    logits = outputs.logits[0]  # shape: [batch_size, vocab_size]
    # Compute log-softmax over vocab logits
    log_probs = F.log_softmax(logits, dim=-1)

    # Collect log-likelihoods of the True / False tokens
    batch_false_log = log_probs[:, false_token_id]
    batch_true_log = log_probs[:, true_token_id]
    log_likelihoods.extend(torch.stack([batch_false_log, batch_true_log], dim=1).tolist())

# output evaluation metrics
y_true = [1 if label.strip().lower() == "1" else 0 for label in hf_dataset_test['output']]

y_scores = [1 / (1 + math.exp(log_true))
            for log_false, log_true in log_likelihoods]

auroc = roc_auc_score(y_true, y_scores)
print(f"AUROC: {auroc:.4f}")

Unsloth 2025.3.19 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 800 | Num Epochs = 17 | Total steps = 100
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 4 x 1) = 128
 "-____-"     Trainable parameters = 5,636,096/1,000,000,000 (0.56% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,2.7546
20,0.4925
30,0.2567
40,0.2312
50,0.2001
60,0.1747
70,0.1613
80,0.1583
90,0.1564
100,0.1571


AUROC: 0.4726


##### 2.2[b] Change Learning Rate

Here we lower the learning rate to $1\times 10^{-4}$, for reasons similar to those given above. Given such a small training set, I believe fine-tuning may not improve performance over the zero-shot model. A smaller learning rate keeps the fine-tuned model closer to the zero-shot model, which limits the bias that the fine-tuning process can introduce.
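The effect of a smaller peak learning rate can be sketched by looking at the schedule shape. This assumes that `lr_scheduler_type = "linear"` with `warmup_steps = 5` and `max_steps = 100` behaves as a linear warmup followed by a linear decay to zero; `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(step, base_lr=1e-4, warmup_steps=5, max_steps=100):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, max_steps - step)
    return base_lr * remaining / (max_steps - warmup_steps)

for step in (0, 5, 50, 100):
    low = linear_lr(step, base_lr=1e-4)
    high = linear_lr(step, base_lr=2e-4)
    print(f"step {step:3d}: lr={low:.2e} (1e-4 run) vs lr={high:.2e} (2e-4 run)")
```

Under this schedule, every update in the $1\times 10^{-4}$ run is taken with half the step size of the corresponding update in the $2\times 10^{-4}$ run.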

In [6]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=torch.float16,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # [2.2] Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",], # [Question 2.2] This specifies what type of parameters in the LLM are being fine-tuned
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

# get train dataset
train_dataset = hf_dataset_train.map(formatting_prompts_func, batched=True)

# Set finetuning configurations
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 8, # Batch size
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 100, # Number of total training steps
        learning_rate = 1e-4, # [Question 2.2] learning rate
        fp16 = True,
        bf16 = False,
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

# start training
trainer_stats = trainer.train()

# go inference
FastLanguageModel.for_inference(model)

# get test dataset
test_dataset = hf_dataset_test.map(formatting_prompts_func_test, batched=True)
test_dataset = test_dataset.remove_columns(["instruction", "input", "output"])

tokenizer.padding_side = "left"
tokenized_test = test_dataset.map(
    lambda x: tokenizer(
        x["text"], return_tensors=None, padding=True, truncation=True
    ),
    batched=True
)
tokenized_test = tokenized_test.remove_columns(["text"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
dataloader = DataLoader(tokenized_test, batch_size=8, collate_fn=data_collator, shuffle=False)

model.eval()
torch.set_grad_enabled(False)
model.to("cuda")

# start inference
log_likelihoods = []

for i, batch in enumerate(dataloader):

    input_ids = batch['input_ids'].to('cuda')
    attention_mask = batch['attention_mask'].to('cuda')

    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_logits=True,
            return_dict_in_generate=True
        )

    logits = outputs.logits[0]  # shape: [batch_size, vocab_size]
    # Compute log-softmax over vocab logits
    log_probs = F.log_softmax(logits, dim=-1)

    # Collect log-likelihoods of the True / False tokens
    batch_false_log = log_probs[:, false_token_id]
    batch_true_log = log_probs[:, true_token_id]
    log_likelihoods.extend(torch.stack([batch_false_log, batch_true_log], dim=1).tolist())

# output evaluation metrics
y_true = [1 if label.strip().lower() == "1" else 0 for label in hf_dataset_test['output']]

y_scores = [1 / (1 + math.exp(log_true))
            for log_false, log_true in log_likelihoods]

auroc = roc_auc_score(y_true, y_scores)
print(f"AUROC: {auroc:.4f}")


Unsloth 2025.3.19 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 800 | Num Epochs = 17 | Total steps = 100
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 4 x 1) = 128
 "-____-"     Trainable parameters = 11,272,192/1,000,000,000 (1.13% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,3.1637
20,1.315
30,0.3679
40,0.2827
50,0.2524
60,0.2444
70,0.2363
80,0.2309
90,0.2258
100,0.2244


AUROC: 0.4632


## [40 Points] Question 3: Autoregressive Sequence Modeling

We will now explore the conceptual foundations of pretraining sequence models, and the various emergent behaviours that arise from modeling sequences at scale.

Consider a dataset of $N$ sequences, each sequence with a maximum length $T$:

$$D = \{(x_1^i, x_2^i,..., x_{T}^i) \}_{i=1}^N$$

For purposes of this problem, we will assume tokens $x_t \in R^1$ are univariate and continuous-valued. We will consider a sequence model $p_\theta$ that specifies the one-step conditional densities $p_\theta(x_{t+1} | x_{1:t})$. (note here that $x_{1:t}$ means all $x$ from timestep $1$ to $t$, which we can also write as $x_{\leq t}$)

We will now derive the optimization problem for learning such a model given the dataset $D$.

#### [15 Points] 3.1: Show that maximizing the likelihood of the empirical data distribution is equivalent to minimizing the following empirical loss function
$$
\begin{align}
l(p_\theta, D) = - \sum_{i=1}^N \sum_{t=1}^T \log p_\theta(x_{t+1}^i | x_{1:t}^i)
\end{align}
$$

<font color="red">Answer 3.1</font>

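Suppose all the sequences are sampled i.i.d.

$$
\begin{aligned}
\text{likelihood}(\theta | D) &= \prod_{i=1}^{N} p_{\theta}(x^{i}_1, x^{i}_2, ..., x^{i}_T) \\
&= \prod_{i=1}^{N} \prod_{t=0}^{T-1} p_{\theta}(x^{i}_{t + 1} | x^{i}_1, ..., x^{i}_t) \\
\text{log-likelihood}(\theta | D) &= \sum_{i=1}^{N} \sum_{t=0}^{T-1} \log p_{\theta}(x^{i}_{t + 1} | x^{i}_{1:t})
\end{aligned}
$$

The second line uses the chain rule of probability, where the $t=0$ term is the unconditional density $p_\theta(x_1^i)$. Since the logarithm is monotonically increasing, maximizing the likelihood is the same as maximizing the log-likelihood, which in turn is the same as minimizing its negative. The negative log-likelihood is exactly the loss $l(p_\theta, D)$, up to the indexing convention for $t$.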
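As a small worked instance of the factorization above (a sketch for a single sequence with $T=3$, dropping the superscript $i$):

$$
\begin{aligned}
p_\theta(x_1, x_2, x_3) &= p_\theta(x_1)\, p_\theta(x_2 | x_1)\, p_\theta(x_3 | x_{1:2}) \\
-\log p_\theta(x_1, x_2, x_3) &= -\log p_\theta(x_1) - \log p_\theta(x_2 | x_1) - \log p_\theta(x_3 | x_{1:2})
\end{aligned}
$$

Each term on the right is one next-token prediction, so the sequence-level negative log-likelihood is a sum of one-step losses.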

#### [15 Points] 3.2 Now assume that each one-step conditional density is parameterized as a Gaussian distribution i.e.

$$ p_\theta(x_{t+1} | x_{1:t}) = N(\mu_t, \sigma^2_t) $$

#### where each $\mu_t = f_\theta(x_{1:t})$ and $\sigma^2_t = g_\theta(x_{1:t})$ are mappings from the sequence model (ex. a transformer model). Write the loss function from 3.1 in terms of these parameters. What is the interpretation of the $\mu_t$ and $\sigma^2_t$ parameters? What would be the benefit in increasing the maximum context length $T$ during pretraining and inference?


<font color="red">Answer 3.2</font>

$$
\begin{aligned}
l(p_\theta, D) &= - \sum_{i=1}^N \sum_{t=0}^{T-1} \log p_\theta(x_{t+1}^i | x_{1:t}^i) \\
&= - \sum_{i=1}^N \sum_{t=0}^{T-1} \log \mathcal{N}(x_{t+1}^i | \mu_t, \sigma^2_t) \\
&= \sum_{i=1}^N \sum_{t=0}^{T-1} \left[ \frac{(x_{t+1}^i - \mu_t)^2}{2\sigma_t^2} + \frac{1}{2} \log \sigma_t^2 \right] + \textit{const.}
\end{aligned}
$$

Here, $\mu_t$ is the predicted mean of the next token $x_{t+1}$ given all previous tokens. It can be interpreted as the model's best guess for the next token given the context. $\sigma^2_t$ is the predicted variance of the next token $x_{t+1}$ given all previous tokens. It can be interpreted as the model's uncertainty about the next token given the context. Both are functions of the context $x_{1:t}^i$, so they vary across sequences as well as across timesteps.

If we increase $T$, then during training the model can learn longer-range dependencies in the data, which helps it capture complex patterns in the context. During inference, a larger $T$ lets the model use richer contextual information for its predictions, which can improve performance. In this scenario there is also a by-product: depending on the architecture (for example, with learned positional embeddings), a larger $T$ can add parameters to the model, which gives it more flexibility but also raises the risk of overfitting.
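The Gaussian loss above can be evaluated directly. The sketch below is a minimal pure-Python version for a single sequence, assuming the predicted means and variances are already available as lists. `gaussian_nll` and `sequence_loss` are hypothetical helper names, and the numbers are made up for illustration.

```python
import math

def gaussian_nll(x_next, mu, var):
    """Negative log-density of x_next under N(mu, var), including the constant term."""
    return 0.5 * ((x_next - mu) ** 2 / var + math.log(var) + math.log(2 * math.pi))

def sequence_loss(xs, mus, variances):
    """Sum of one-step Gaussian losses over a sequence.

    mus[t] and variances[t] are the predictions for xs[t] made from the preceding context.
    """
    return sum(gaussian_nll(x, m, v) for x, m, v in zip(xs, mus, variances))

xs = [0.5, 1.0, 1.5]
confident = sequence_loss(xs, mus=[0.5, 1.0, 1.5], variances=[0.1, 0.1, 0.1])
uncertain = sequence_loss(xs, mus=[0.5, 1.0, 1.5], variances=[1.0, 1.0, 1.0])
print(f"confident and correct: {confident:.3f}")
print(f"uncertain and correct: {uncertain:.3f}")
```

When the mean predictions are accurate, a smaller predicted variance gives a lower loss. The squared-error term, however, is divided by the variance, so confident mistakes are penalized heavily.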

#### [10 Points] 3.3 Using a specific healthcare application of your choice, discuss one potential issue in pretraining such a model on a large corpus of publicly available data (ex. robustness to distribution shifts, safety & alignment, data biases & fairness). The following paper can serve as a starting point for some ideas, but feel free to explore the literature in this field:

On the Opportunities and Risks of
Foundation Models: https://arxiv.org/pdf/2108.07258

(Note: One paragraph is sufficient, cite any external sources if you use any)

<font color="red">Answer 3.3</font>

Pretraining a model on publicly available clinical notes (e.g., from academic hospitals or specific regions) can be problematic because publicly available domain-specific data are very limited, and this limitation can bias the model. High-quality, diverse clinical notes are rarely fully public due to privacy laws (e.g., HIPAA). Models trained on limited, biased corpora (e.g., MIMIC, which skews toward ICU patients) inherit these limitations.