# Part 1: SFT Instruction Tuning

In this part of the assignment, you will fine-tune a causal language model to follow instructions using SFT (Supervised Fine-Tuning) and LoRA (Low Rank Adaptation).

Specifically, we show that a pretrained Transformer language model does not necessarily follow instructions provided in the prompt, but that it can be efficiently trained to do so by SFT on a dataset with prompts and appropriately structured answers.

We will explore instruction tuning on a small subset of the [GSM8K (Grade School Math 8k)](https://huggingface.co/datasets/openai/gsm8k) dataset containing elementary mathematics word problems. Solving these problems with a pretrained Transformer language model requires (i) learning to follow the formatting instructions for how to give the final numerical answer (after showing one's work/thought process) and (ii) learning to correctly reason step-by-step in solving such word problems.

Both of these are challenging for models that have only been pretrained, such as the [Phi 1.5 model](https://huggingface.co/microsoft/phi-1_5) that we will use. Phi 1.5 is a causal Transformer language model developed and pretrained by Microsoft, but it has not been instruction fine-tuned like most human user-facing models (we will be doing that). It has 1.3 billion parameters, making it small for a large language model so that we can complete this fine-tuning in 10-30 minutes on a single GPU rather than in hours or days on several GPUs, as would be required for a much larger model.

**Learning objectives.** You will:
1. Apply a causal Transformer language model for mathematics question answering
2. Apply Low-Rank Adaptation and Supervised Fine-Tuning to train a Transformer language model to follow instructions and reason step-by-step
3. Utilize the high-level Hugging Face Trainer API for instruction fine-tuning

Note: This assignment is intended to utilize GPU resources such as `CUDA` through the CS department cluster, Google Colab, or local GPU resources (for those running on machines with GPU support). The **code below assumes CUDA**; you will need to modify it if working with the [`mps` backend](https://docs.pytorch.org/docs/stable/notes/mps.html).

First, run the following code cell to download the model and demonstrate its use for generating text given a prompt.

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5",
    torch_dtype=torch.float16,
    device_map="cuda",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

`torch_dtype` is deprecated! Use `dtype` instead!


Now run the following cell to demonstrate the model's use for generating text given a prompt. You will notice that, on the one hand, the model is capable of identifying some of the relevant reasoning. On the other hand, it is ineffective at following the intent of the prompt -- it doesn't show its work, and doesn't provide the final answer after the requested #### marker. It also begins unnecessarily generating **new** questions that are similar to the original prompt. All of these behaviors are common to models that have only been pretrained.

In [2]:
prompt = "If you have 5 apples but someone takes 2, how many do you have left? Show your work, then write your final answer on the last line after ####"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

If you have 5 apples but someone takes 2, how many do you have left? Show your work, then write your final answer on the last line after ####

Answer: 3

Exercise 3: If you have $10 and you want to buy a toy that costs $5, how much money will you have left? Show your work, then write your final answer on the last line after ####

Answer: $5

Exercise 4: If you have 3 pencils and you give 1 to your friend, how many pencils do you have left? Show your work, then write your final answer on the last line


Run the following code to download the [GSM8K (Grade School Math 8k)](https://huggingface.co/datasets/openai/gsm8k) dataset. We create subsampled random splits of training, validation, and test datasets, and remove the additional calculator annotations provided in the dataset (provided to teach a model to learn when to call an external calculator utility for arithmetic -- an excellent idea but beyond the scope of this assignment).

The first question and answer will be printed. Note that the answer shows its work first, then provides the final numerical answer on the last line after ####. Fine-tuning on the work/reasoning will help to improve the model's tendency to reason step-by-step in its generation, but we need to instruct the model to give its final answer on the last line after #### so that we can easily extract the final answer and evaluate it for correctness.

In [3]:
from datasets import load_dataset
import re

# Load GSM8K dataset
dataset = load_dataset("gsm8k", "main")

# Create subsets
train_data = dataset["train"].shuffle(seed=2025).select(range(500))
test_full = dataset["test"].shuffle(seed=2025)
val_data = test_full.select(range(50))
test_data = test_full.select(range(50, 100))

def remove_calculator_annotations(example):
    """Remove <<...>> calculator annotations from answer."""
    example['answer'] = re.sub(r'<<[^>]+>>', '', example['answer'])
    return example

# Apply to datasets
train_data = train_data.map(remove_calculator_annotations)
val_data = val_data.map(remove_calculator_annotations)
test_data = test_data.map(remove_calculator_annotations)

# Preview one example
print(f"First Question: {train_data[0]['question']}")
print()
print(f"First Answer: {train_data[0]['answer']}")

First Question: In ten years, I'll be twice my brother's age. The sum of our ages will then be 45 years old. How old am I now?

First Answer: Let X be my age now.
In ten years, I'll be X+10 years old.
In ten years, my brother will be (X+10)*1/2 years old.
In ten years, (X+10) + (X+10)*1/2 = 45.
So (X+10)*3/2 = 45.
X+10 = 45 * 2/3 = 30.
X = 30 - 10 = 20 years old.
#### 20
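
To make the annotation removal concrete, here is a minimal sketch that applies the same regular expression as `remove_calculator_annotations` to a single string. The sample text is hypothetical, written in the `<<...>>` annotation style described above, and `strip_calculator_annotations` is an illustrative name rather than part of the assignment code.

```python
import re

def strip_calculator_annotations(text):
    """Drop <<...>> calculator annotations, keeping the surrounding text."""
    return re.sub(r'<<[^>]+>>', '', text)

# Hypothetical annotated answer, written in the dataset's annotation style
annotated = "She sold 48/2 = <<48/2=24>>24 clips in May.\n#### 24"
cleaned = strip_calculator_annotations(annotated)
print(cleaned)
```

Only the annotation itself is removed; the worked arithmetic and the final `#### 24` line are left untouched.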


The cell below (run it) defines an `extract_answer` helper function that tries to extract the final numerical answer, assuming it follows #### as instructed. If it cannot detect such a final answer, it returns `None`.

The `evaluate_gsm8k` helper function then takes a model and a data split and runs the model on every question with appropriate prompt templating. The prompt gives zero-shot instructions to show work and to put the final answer on the last line after `####`. `num_eval` controls the number of questions from `data` that will be evaluated and defaults to `None`, in which case all questions in `data` will be evaluated.

If `verbose=True` then examples will be printed, otherwise the function simply prints and returns (i) the proportion of questions for which the model correctly followed the formatting instructions, and (ii) the proportion of those questions where the model gave the correct final numerical answer.

In [4]:
import re
import torch
from tqdm import tqdm

def extract_answer(text):
    """Extract final answer after #### marker."""
    match = re.search(r'####\s*(-?\d+(?:,\d{3})*(?:\.\d+)?)', text)
    if match:
        return match.group(1).replace(',', '')
    return None

def evaluate_gsm8k(model, tokenizer, data, max_new_tokens=128, verbose=False, num_eval=None):
    """Evaluate model accuracy on GSM8K data."""
    model.eval()

    # Zero-shot prompt with format instruction
    prompt_template = """Solve this math problem. Show your work and put your final answer on the last line as: #### [answer]

Question: {question}
Answer:"""

    correct = 0
    formatted = 0
    if num_eval is None:
        num_eval = len(data)

    total = num_eval

    for i in tqdm(range(num_eval), disable=verbose):
        item = data[i]

        # Get ground truth
        gt_answer = extract_answer(item['answer'])

        # Generate prediction
        prompt = prompt_template.format(question=item['question'])
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
            )

        # Get only the generated part (exclude prompt)
        generated_ids = outputs[0][inputs['input_ids'].shape[1]:]
        generated = tokenizer.decode(generated_ids, skip_special_tokens=True)
        pred_answer = extract_answer(generated)

        # Verbose output
        if verbose:
            print(f"Question {i}: {item['question']}\n")
            print(f"Correct answer: {gt_answer}\n")

            if pred_answer is None:
                print(f"Failed to follow formatting instructions. Full generated text: {generated}...")
            else:
                print(f"Predicted answer: {pred_answer}\n")

            print("\n\n\n")

        # Check formatting and correctness
        if pred_answer is not None:
            formatted += 1
            if pred_answer == gt_answer:
                correct += 1

    format_rate = formatted / total
    accuracy = correct / formatted if formatted > 0 else 0

    print(f"Format rate: {format_rate:.2%} ({formatted}/{total})")
    print(f"Accuracy: {accuracy:.2%} ({correct}/{formatted})")

    return format_rate, accuracy
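
As a quick check of what the answer-extraction pattern accepts, the sketch below reuses the same regular expression as `extract_answer` on a few hand-written strings. The sample strings are hypothetical and chosen only to exercise the pattern.

```python
import re

# Same pattern as in extract_answer above
ANSWER_PATTERN = re.compile(r'####\s*(-?\d+(?:,\d{3})*(?:\.\d+)?)')

def extract_answer(text):
    """Return the number after ####, with thousands separators removed, or None."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1).replace(',', '') if match else None

# Hypothetical model outputs
samples = [
    "So the total is 20.\n#### 20",  # well formatted
    "#### 1,234",                    # thousands separator is stripped
    "#### -3.5",                     # negatives and decimals are accepted
    "Answer: 3",                     # no #### marker
    "#### $5",                       # currency symbol blocks the match
]
results = [extract_answer(s) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), "->", r)
```

Because `evaluate_gsm8k` compares the extracted strings directly, a prediction of `20.0` would not count as equal to a ground truth of `20`, even though both are well formatted.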

The following code demonstrates the use of the evaluation helper function in `verbose=True` mode on just 3 examples, showing how the model fails to follow the formatting instructions or the intent of the question in many cases.

In [5]:
evaluate_gsm8k(model, tokenizer, val_data, verbose=True, num_eval=3)

Question 0: Paul is at a train station and is waiting for his train. He isn't sure how long he needs to wait, but he knows that the fourth train scheduled to arrive at the station is the one he needs to get on. The first train is scheduled to arrive in 10 minutes, and this train will stay in the station for 20 minutes. The second train is to arrive half an hour after the first train leaves the station, and this second train will stay in the station for a quarter of the amount of time that the first train stayed in the station. The third train is to arrive an hour after the second train leaves the station, and this third train is to leave the station immediately after it arrives.  The fourth train will arrive 20 minutes after the third train leaves, and this is the train Paul will board.  In total, how long, in minutes, will Paul wait for his train?

Correct answer: 145

Failed to follow formatting instructions. Full generated text:  Paul will wait for his train for a total of 45 minute

(0.0, 0)

## Task 1

The `format_gsm8k` function below processes a data point (question and answer) to provide instructions and formatting in the prompt and to tokenize and combine the inputs in preparation for SFT. Run the code and observe the example results, then answer the question below. You do not need to modify the code.

In [6]:
def format_gsm8k(example, tokenizer):
    """Format example for training with proper label masking.

    Args:
        example: Dictionary with 'question' and 'answer' keys
        tokenizer: HuggingFace tokenizer

    Returns:
        Dictionary with 'input_ids', 'attention_mask', and 'labels'
        Labels are -100 for question tokens (ignored in loss) and
        actual token IDs for answer tokens.
    """
    question_text = f"""Solve this math problem. Show your work and put your final answer on the last line as: #### [answer]

Question: {example['question']}
Answer: """

    answer_text = f"{example['answer']}"

    # Tokenize question and answer separately
    question_tokens = tokenizer(question_text, add_special_tokens=True)
    answer_tokens = tokenizer(answer_text, add_special_tokens=False)

    # Combine
    input_ids = question_tokens['input_ids'] + answer_tokens['input_ids']
    attention_mask = question_tokens['attention_mask'] + answer_tokens['attention_mask']

    # Create labels: -100 for question (ignored), actual tokens for answer
    labels = [-100] * len(question_tokens['input_ids']) + answer_tokens['input_ids']

    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': labels
    }

# Example:
print(f"First Question Tokenized: {tokenizer(train_data[0]['question'], add_special_tokens=True)}")
print()
print(f"First Answer Tokenized: {tokenizer(train_data[0]['answer'], add_special_tokens=True)}")
print()
example_formatted = format_gsm8k(train_data[0], tokenizer)
print(f"Combined tokenized input: {example_formatted['input_ids']}")
print()
print(f"Prediction targets/labels: {example_formatted['labels']}")

First Question Tokenized: {'input_ids': [818, 3478, 812, 11, 314, 1183, 307, 5403, 616, 3956, 338, 2479, 13, 383, 2160, 286, 674, 9337, 481, 788, 307, 4153, 812, 1468, 13, 1374, 1468, 716, 314, 783, 30], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

First Answer Tokenized: {'input_ids': [5756, 1395, 307, 616, 2479, 783, 13, 198, 818, 3478, 812, 11, 314, 1183, 307, 1395, 10, 940, 812, 1468, 13, 198, 818, 3478, 812, 11, 616, 3956, 481, 307, 357, 55, 10, 940, 27493, 16, 14, 17, 812, 1468, 13, 198, 818, 3478, 812, 11, 357, 55, 10, 940, 8, 1343, 357, 55, 10, 940, 27493, 16, 14, 17, 796, 4153, 13, 198, 2396, 357, 55, 10, 940, 27493, 18, 14, 17, 796, 4153, 13, 198, 55, 10, 940, 796, 4153, 1635, 362, 14, 18, 796, 1542, 13, 198, 55, 796, 1542, 532, 838, 796, 1160, 812, 1468, 13, 198, 4242, 1160], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

Note that the `-100` label is used for the question part of the sequence. `-100` is the default `ignore_index` for [PyTorch's CrossEntropyLoss](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) implementation, meaning predictions at these points are ignored when calculating the loss (and any backpropagation/derivative calculations).
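
The effect of the `-100` labels can be illustrated without any deep learning library. The following is a plain-Python sketch that mimics the masking behaviour: it averages the negative log-likelihood only over positions whose label is not `-100`. The token IDs and probabilities are made up, `masked_mean_nll` is a hypothetical helper, and the sketch ignores the one-position shift between inputs and targets that causal language models apply internally.

```python
import math

IGNORE_INDEX = -100  # same sentinel value used for the question tokens above

def masked_mean_nll(token_probs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_probs[i] is the (made-up) probability assigned to the correct token
    at position i; labels[i] is the target id or IGNORE_INDEX.
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Toy sequence: 3 question tokens followed by 2 answer tokens (ids are made up)
question_ids = [11, 12, 13]
answer_ids = [21, 22]
labels = [IGNORE_INDEX] * len(question_ids) + answer_ids

# Made-up probabilities: poor on the question positions, better on the answer
token_probs = [0.01, 0.02, 0.05, 0.5, 0.25]
loss = masked_mean_nll(token_probs, labels)

# Changing the question-position probabilities leaves the loss unchanged
loss_other = masked_mean_nll([0.9, 0.9, 0.9, 0.5, 0.25], labels)
print(labels, round(loss, 4), round(loss_other, 4))
```

In this sketch the loss depends only on the two answer positions, which is the behaviour the masking is meant to produce.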

**Task 1 Question.** In one to two paragraphs, explain why we ignore these predictions when instruction tuning with SFT on this question and answer task.

**Answer:**

With SFT, the goal is for the model to use the question as context and learn to generate a correct response, not to predict the question itself. If the question tokens were included in the loss, the model would also be rewarded for reproducing the prompt instead of learning the reasoning steps that lead to the final answer. Setting the labels of the question tokens to -100 excludes the predictions at those positions from the loss computation, so they contribute nothing to the gradient and the loss only considers answer tokens.

This lets the model focus on generating step-by-step reasoning and correctly formatted answers rather than memorizing or reproducing the question.

## Task 2

Now we are ready to fine-tune our model. We will use the high-level [Hugging Face `Trainer` API](https://huggingface.co/docs/transformers/en/main_classes/trainer), backed by PyTorch, to streamline some of the boilerplate training code.

Run the following code cell to define the training procedure. You do not need to modify this code, but you should review it briefly before proceeding. Some highlights to consider:

1. The top-level `train_gsm8k` function is what you will call later to fine-tune your model. Observe that it has a large number of parameters that you will need to set.

2. LoRA (Low Rank Adaptation) is implemented for the attention layers of the network using `lora_r` for the rank and `lora_drop` for the dropout rate.

3. The data are tokenized with the `format_gsm8k` function defined and explored above.

4. The `TrainingArguments` and `Trainer` instantiation apply the remaining hyperparameter selections for running the optimizer (AdamW, the `Trainer` default) with gradient accumulation and early stopping.
  - Gradient accumulation is a way to increase the effective batch size without increasing the memory overhead of training. This is important because you may run out of memory on the GPU in which case training cannot proceed. A gradient step will only be taken after `gradient_accumulation_steps` many minibatches of `batch_size` have been processed -- adding these gradients together rather than resetting after each batch. The effective batch size becomes `gradient_accumulation_steps * batch_size` while the memory necessary scales more closely with `batch_size`.
  - We have discussed early stopping at length and used it before -- the logic is implemented by the `Trainer` but you will need to specify the `patience` hyperparameter as well as a `max_epochs` in case the early stopping condition is not reached. The code measures validation loss on the held out validation set once per epoch as the performance measure for implementing early stopping.
  - Finally, as always the `learning_rate` must be set.
  
5. During training, every `logging_steps = 20` gradient steps, the training loss will be printed. After each epoch of training, train and validation losses will be printed. Once training completes, the `evaluate_gsm8k` function is run on the held out test set to measure final performance in terms of following the formatting instructions and correctness of the final answer.
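
The gradient accumulation idea from item 4 can be sketched on a toy problem. The example below is a minimal illustration, not the `Trainer` implementation: it uses a hypothetical one-parameter model `y = w * x` with mean squared error, and it assumes each micro-batch gradient is divided by the number of accumulation steps so that the accumulated value is an average.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical toy data and settings
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5
batch_size = 2
gradient_accumulation_steps = 2

# One large batch of size batch_size * gradient_accumulation_steps
full_grad = grad_mse(w, data)

# The same examples processed as micro-batches, accumulating scaled gradients
accumulated = 0.0
for step in range(gradient_accumulation_steps):
    micro = data[step * batch_size:(step + 1) * batch_size]
    accumulated += grad_mse(w, micro) / gradient_accumulation_steps

print(full_grad, accumulated)
```

Both values agree, while only `batch_size` examples need to be processed at any one time, which is why the memory cost tracks `batch_size` rather than the effective batch size.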

In [7]:
# You do not need to modify this code
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback, TrainerCallback, DataCollatorForSeq2Seq
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Define a simple callback to print the train loss periodically
class PrintLossCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None and 'loss' in logs:
            print(f"Step {state.global_step}: Train loss={logs['loss']:.4f}")

def train_gsm8k(model, tokenizer, train_data, val_data, test_data,
                batch_size, gradient_accumulation_steps,
                learning_rate, max_epochs, patience,
                lora_r, lora_drop):
    """Fine-tune model with LoRA on GSM8K."""

    # Set padding token
    tokenizer.pad_token = tokenizer.eos_token

    # Apply LoRA
    lora_config = LoraConfig(r=lora_r, lora_alpha=lora_r * 2,
                             target_modules=["q_proj", "k_proj", "v_proj"],
                             lora_dropout=lora_drop, bias="none",
                             task_type=TaskType.CAUSAL_LM)
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # Format and tokenize with proper masking
    train_tokenized = train_data.map(
        lambda x: format_gsm8k(x, tokenizer),
        remove_columns=train_data.column_names
    )
    val_tokenized = val_data.map(
        lambda x: format_gsm8k(x, tokenizer),
        remove_columns=val_data.column_names
    )

    # Data collator for padding with label padding
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, label_pad_token_id=-100, padding=True)

    # Training arguments
    args = TrainingArguments(
        output_dir="./gsm8k_checkpoints",
        num_train_epochs=max_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        learning_rate=learning_rate,
        logging_steps=20,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        fp16=True,
        report_to="none"
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_tokenized,
        eval_dataset=val_tokenized,
        data_collator=data_collator,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=patience), PrintLossCallback()],
    )

    # Train
    trainer.train()

    # Final evaluation
    print("\nFinal test results:")
    evaluate_gsm8k(model, tokenizer, test_data)

    return model

**TODO** Fine-tune the model using the training procedure defined above. Your goal is to select appropriate hyperparameters to achieve:
- **Format rate > 90%**: The model should follow instructions to put the final answer after ####
- **Accuracy > 30%**: Among properly formatted responses, at least 30% should have the correct answer

**Hyperparameters to set:**
- `batch_size`: Size of each training minibatch (try 2, 4, or 8)
- `gradient_accumulation_steps`: Number of batches to accumulate before updating weights (try 4 or 8)
- `learning_rate`: Step size for optimization (try values between 1e-5 and 1e-3)
- `max_epochs`: Maximum training epochs (try 5-10)
- `patience`: Early stopping patience in epochs (try 2 or 3)
- `lora_r`: Rank for LoRA adaptation (try 8, 16, or 32)
- `lora_drop`: Dropout rate for LoRA layers (try 0.05 or 0.1)

**Tips and hints:**
- If you encounter CUDA out of memory errors, reduce `batch_size`; you can increase `gradient_accumulation_steps` to compensate and keep the effective batch size the same
- The effective batch size is `batch_size * gradient_accumulation_steps`
- Training should take 10-30 minutes depending on your hyperparameters (assuming a CUDA GPU)
- You may need to experiment with multiple configurations to achieve the target performance
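
To compare candidate settings before launching a run, the effective batch size can be tabulated with a few lines of arithmetic. This is a rough sketch: `batch_plan` is a hypothetical helper, and the number of weight updates per epoch is only approximate, since the exact count depends on how the training library handles a final, partially filled accumulation window.

```python
import math

def batch_plan(num_examples, batch_size, gradient_accumulation_steps):
    """Summarize how a (batch_size, accumulation) choice splits one epoch."""
    minibatches = math.ceil(num_examples / batch_size)
    return {
        "effective_batch_size": batch_size * gradient_accumulation_steps,
        "minibatches_per_epoch": minibatches,
        # Approximate: see the caveat in the text above
        "approx_updates_per_epoch": math.ceil(minibatches / gradient_accumulation_steps),
    }

num_train = 500  # size of the training subset selected earlier
plans = {(bs, accum): batch_plan(num_train, bs, accum)
         for bs, accum in [(8, 4), (4, 4), (2, 8)]}
for config, plan in plans.items():
    print(config, plan)
```

The `(4, 4)` and `(2, 8)` settings share an effective batch size of 16, but the second processes only 2 examples at a time and is therefore the lighter choice on GPU memory.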

**Strategy:** Start with conservative values (small batch size, moderate learning rate, higher LoRA rank) and adjust based on results. Monitor the training loss - it should decrease steadily. If validation loss stops improving or increases while training loss continues decreasing, you may be overfitting.

Fill in the hyperparameters below and run the training. When you have achieved the minimum format rate and accuracy thresholds, answer the following questions.

In [8]:
# TODO: Set your hyperparameters here

batch_size = 2  # Try 2, 4, or 8
gradient_accumulation_steps = 8  # Try 4 or 8
learning_rate = 1e-4  # Try 1e-4, 5e-5, or 1e-5
max_epochs = 10  # Try 5-10
patience = 3  # Try 2 or 3
lora_r = 8  # Try 8, 16, or 32
lora_drop = 0.05  # Try 0.05 or 0.1

# Free GPU memory and reload a fresh model for fine-tuning
# This is to ensure that if you run this cell multiple times
# with different hyperparameters, you always start from the
# clean original pretrained model (not a partially fine-tuned one)
import gc
del model
gc.collect()
torch.cuda.empty_cache()
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5",
    torch_dtype=torch.float16,
    device_map="cuda",
    trust_remote_code=True
)
print("Fresh model loaded and ready for fine-tuning")

# Run training
model = train_gsm8k(
    model=model,
    tokenizer=tokenizer,
    train_data=train_data,
    val_data=val_data,
    test_data=test_data,
    batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    max_epochs=max_epochs,
    patience=patience,
    lora_r=lora_r,
    lora_drop=lora_drop
)

Fresh model loaded and ready for fine-tuning
trainable params: 2,359,296 || all params: 1,420,630,016 || trainable%: 0.1661
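
The trainable parameter count reported above can be reproduced by hand. The sketch below assumes Phi 1.5 has a hidden size of 2048 and 24 Transformer layers, with square query, key, and value projections; these dimensions are assumptions taken from the model's published configuration, not from the code in this notebook.

```python
# Assumed Phi 1.5 dimensions (from the model's published configuration)
hidden_size = 2048
num_layers = 24

# LoRA settings used in the run above
target_modules = ["q_proj", "k_proj", "v_proj"]
lora_r = 8

# LoRA adds two matrices per adapted linear layer: A (r x in) and B (out x r)
params_per_module = lora_r * hidden_size + hidden_size * lora_r
trainable = params_per_module * len(target_modules) * num_layers

all_params = 1_420_630_016  # "all params" value reported in the output above
percent = f"{100 * trainable / all_params:.4f}"
print(f"{trainable:,} trainable ({percent}%)")
```

Under these assumptions the count matches the reported 2,359,296. Since the count is linear in `lora_r`, doubling the rank doubles the number of trainable parameters.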


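| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 1.3299 | 1.015784 |
| 2 | 0.9259 | 0.924909 |
| 3 | 0.9109 | 0.910676 |
| 4 | 0.8439 | 0.902869 |
| 5 | 0.8259 | 0.899561 |
| 6 | 0.7949 | 0.899252 |
| 7 | 0.7901 | 0.900425 |
| 8 | 0.7507 | 0.901418 |
| 9 | 0.7384 | 0.902062 |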


Step 20: Train loss=1.3299
Step 40: Train loss=1.0178
Step 60: Train loss=0.9259
Step 80: Train loss=0.9109
Step 100: Train loss=0.8397
Step 120: Train loss=0.8439
Step 140: Train loss=0.8118
Step 160: Train loss=0.8259
Step 180: Train loss=0.7949
Step 200: Train loss=0.7788
Step 220: Train loss=0.7901
Step 240: Train loss=0.7507
Step 260: Train loss=0.7821
Step 280: Train loss=0.7384

Final test results:


100%|██████████| 50/50 [04:07<00:00,  4.95s/it]

Format rate: 92.00% (46/50)
Accuracy: 30.43% (14/46)
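
Note that the reported accuracy is computed only over the formatted responses. The short calculation below uses the counts printed above to show both that conditional accuracy and the fraction of all test questions answered correctly; the variable names are illustrative.

```python
total, formatted, correct = 50, 46, 14  # counts from the output above

format_rate = formatted / total             # share of responses with a #### answer
accuracy_given_format = correct / formatted  # accuracy among formatted responses
overall_accuracy = correct / total           # accuracy over all test questions

print(f"{format_rate:.2%} {accuracy_given_format:.2%} {overall_accuracy:.2%}")
```

The overall figure is lower than the conditional one because the four unformatted responses cannot be scored as correct.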





**Answers to post-training questions:**

**Q1.** Did you observe model overfitting? If so, what did you change to mitigate the overfitting?

**A1.** From the table, I observe slight overfitting: the training loss keeps going down while the validation loss stops improving after epoch 6 and then rises slightly. My current approach to dealing with overfitting is to use early stopping, add dropout, and keep the LoRA rank small in order to control the model complexity. This approach does help a bit to keep the validation loss stable and to prevent the model from just memorizing the data. Other things I could have done to avoid overfitting are lowering the learning rate or reducing the number of training epochs.
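
The early stopping behaviour mentioned in A1 can be traced using the validation losses from the training output above. This is a simplified sketch of patience-based early stopping, not the `EarlyStoppingCallback` implementation; `early_stopping_trace` is a hypothetical helper that counts any epoch without a strictly lower validation loss as a non-improving epoch.

```python
# Validation losses per epoch, copied from the training output above
val_losses = [1.015784, 0.924909, 0.910676, 0.902869, 0.899561,
              0.899252, 0.900425, 0.901418, 0.902062]
patience = 3

def early_stopping_trace(losses, patience):
    """Return (best_epoch, stop_epoch) for patience-based early stopping."""
    best_loss, best_epoch, bad_epochs = float("inf"), None, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)

best_epoch, stop_epoch = early_stopping_trace(val_losses, patience)
print(best_epoch, stop_epoch)
```

With `patience = 3`, the best validation loss occurs at epoch 6 and training stops after epoch 9, which is consistent with the run ending before `max_epochs = 10`.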

**Q2.** Did you have difficulty with slow training or running out of memory? If so, what changes did you make to address these problems?

**A2.** While experimenting with the hyperparameters, I encountered a CUDA out-of-memory error once. Originally my batch size was 8 and my gradient accumulation steps was 4, and I ran out of memory as soon as I started running the code. To fix this error, I reduced the batch size to 2 and increased the gradient accumulation steps to 8, which resolved the issue. This change made training more stable and required less GPU memory, but it also made each training step slower, since the weights are only updated after 8 minibatches have been accumulated. Ultimately, training took longer to finish, but it ran smoothly without crashing or raising any errors after the changes.

**Q3.** Based on the results, what is more challenging: (i) Training a language model to follow formatting instructions, or (ii) reason mathematically?

**A3.** For the most part, I would say that reasoning mathematically is much harder than following formatting instructions. In my results, the model easily learned the output format, reaching a format rate of 92%, but it struggled more with solving the problems correctly, with an accuracy of around 30%. The reason is that formatting is mainly about following patterns, which is something a model can easily do. Reasoning, on the other hand, requires multiple logical steps and numerical calculations, which are simply more complex for the model to learn.