# ML2025 Homework 6 - Fine-tuning leads to Forgetting

This notebook is for ML2025 Homework 6, focusing on the problem of fine-tuning leading to forgetting. The goal is to fine-tune a model using the GSM8K dataset while observing the effects on previously learned knowledge about safeness.

## Check GPU

In [6]:
!nvidia-smi

Fri May  9 19:26:29 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 572.61                 Driver Version: 572.61         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4060 ...  WDDM  |   00000000:01:00.0  On |                  N/A |
| N/A   43C    P8              2W /   80W |     409MiB /   8188MiB |     24%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Download Dataset & Install Packages

In [7]:
!wget https://www.csie.ntu.edu.tw/~b10902031/gsm8k_train.jsonl # original dataset for fine-tuning
!wget https://www.csie.ntu.edu.tw/~b10902031/gsm8k_train_self-instruct.jsonl # part of fine-tuning dataset refined by llama-3.2-1b-instruct
!wget https://www.csie.ntu.edu.tw/~b10902031/gsm8k_test_public.jsonl # gsm8k public test dataset
!wget https://www.csie.ntu.edu.tw/~b10902031/gsm8k_test_private.jsonl # gsm8k private test dataset
!wget https://www.csie.ntu.edu.tw/~b10902031/ailuminate_test.csv # ailuminate test dataset (public + private)

'wget' ���O�����Υ~���R�O�B�i���檺�{���Χ妸�ɡC
'wget' ���O�����Υ~���R�O�B�i���檺�{���Χ妸�ɡC
'wget' ���O�����Υ~���R�O�B�i���檺�{���Χ妸�ɡC
'wget' ���O�����Υ~���R�O�B�i���檺�{���Χ妸�ɡC
'wget' ���O�����Υ~���R�O�B�i���檺�{���Χ妸�ɡC


## Huggingface Login

In [8]:
!huggingface-cli login --token "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" # TODO: Add your huggingface token

usage: huggingface-cli <command> [<args>]
huggingface-cli: error: unrecognized arguments: # TODO: Add your huggingface token


## Import Packages

In [9]:
# !pip install trl==0.12.0
# !pip install transformers==4.46.0 trl==0.12.0 peft==0.13.0

In [10]:
# !pip install jupyter notebook matplotlib seaborn scikit-learn
# !pip install tensorflow
# !pip install datasets trl bitsandbytes
# !pip install tqdm
# !pip install tf-keras
# !pip install torch==2.5.1+cu121 transformers peft trl datasets flash-attn bitsandbytes accelerate python-dotenv tqdm numpy pandas tf-keras --force-reinstall --no-build-isolation

In [15]:
from transformers import (
    AutoModelForCausalLM, # imports the model for causal language modeling
    AutoTokenizer, # imports the tokenizer for the model
    BitsAndBytesConfig, # imports the configuration for using bitsandbytes
    pipeline # imports the pipeline for text generation
)
from peft import (
    LoraConfig, # imports the configuration for LoRA
    get_peft_model, # imports the function to get the PEFT model
    PeftModel # imports the PEFT model
)
import os

import json
import torch
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # 禁用 tokenizer 並行
# os.environ["CUDA_VISIBLE_DEVICES"] = '6' # Sets the CUDA device to use
# device = torch.device('cuda:6') # Creates a CUDA device object
from datasets import Dataset # Imports the Dataset class from the datasets library
from trl import SFTConfig, SFTTrainer # Imports the SFTConfig and SFTTrainer classes from the trl library
import random
random.seed(42) # Sets the random seed for reproducibility
from tqdm import tqdm # Imports the tqdm library for progress bars
import csv

## LLM Fine-tuning

### Load Model & Tokenizer

In [18]:
!pip install --no-cache-dir --force-reinstall llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-win_amd64.whl (444.1 MB)
     ---------------------------------------- 0.0/444.1 MB ? eta -:--:--
     ---------------------------------------- 0.0/444.1 MB ? eta -:--:--
     ---------------------------------------- 1.6/444.1 MB 7.0 MB/s eta 0:01:04
     --------------------------------------- 5.0/444.1 MB 11.6 MB/s eta 0:00:38
      -------------------------------------- 6.6/444.1 MB 10.3 MB/s eta 0:00:43
      ------------------------------------- 10.5/444.1 MB 13.6 MB/s eta 0:00:32
      ------------------------------------- 11.3/444.1 MB 11.2 MB/s eta 0:00:39
      ------------------------------------- 11.5/444.1 MB 10.9 MB/s eta 0:00:40
      ------------------------------------- 11.5/444.1 MB 10.9 MB/s eta 0:00:40
     - -----

  You can safely remove it manually.
  You can safely remove it manually.
  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.19.0 requires numpy<2.2.0,>=1.26.0, but you have numpy 2.2.5 which is incompatible.


In [None]:
# Note, if you don't want to reinstall BNBs dependencies, append the `--no-deps` flag!
!pip install --force-reinstall 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.44.1.dev0-py3-none-win_amd64.whl'



In [20]:
sft_model_name = 'meta-llama/Llama-3.2-1B-Instruct' # Specifies the name of the pre-trained model to use
sft_bnb_config = BitsAndBytesConfig( # Configuration for using bitsandbytes
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
sft_model = AutoModelForCausalLM.from_pretrained( # Loads the pre-trained model
    pretrained_model_name_or_path=sft_model_name,
    quantization_config=sft_bnb_config,
    low_cpu_mem_usage=True,
    device_map = "cpu"
)
sft_tokenizer = AutoTokenizer.from_pretrained( # Loads the tokenizer for the model
    pretrained_model_name_or_path=sft_model_name,
    use_fast=True  # 啟用 fast tokenizer
)
sft_tokenizer.add_special_tokens({'pad_token': '[PAD]'}) # Adds a special token for padding
peft_config = LoraConfig(
    r=8, 
    lora_alpha=16,
    lora_dropout=0.2, #Recommended range: 0.1 ~ 0.2
    # TODO: Adds dropout
    bias='none',
    task_type='CAUSAL_LM',
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
peft_model = get_peft_model(sft_model, peft_config)

CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend


RuntimeError: CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend

### Dataset Formatting Functions

In [None]:
def load_jsonlines(file_name: str):
    f = open(file_name, 'r')
    return [json.loads(line) for line in f]

def select_fixed_few_shot_examples(nshot_data: list, n: int) -> list:
    # Select the first n examples from the dataset as fixed few-shot examples
    return nshot_data[:n]

def nshot_chats(nshot_data: list, n: int, question: str, answer: any, mode: str,task_type = "gsm8k") -> dict: # Function to create n-shot chats
    if mode not in ['train', 'test']:
        raise AssertionError('Undefined Mode!!!')

    chats = []
    # TODO: Use fixed few-shot examples
    fixed_examples = select_fixed_few_shot_examples(nshot_data, n)
    for qna in fixed_examples: # Samples n examples from the n-shot data
        chats.append(
            {
                'role': 'user',
                'content': f'Q: {qna["question"]}' # Creates a user message with the question
            }
        )
        chats.append(
            {
                'role': 'assistant',
                'content': f'A: Let’s solve this step by step:\n1. ...\n2. ...\nFinal answer: #### {qna["answer"]}' # Creates an assistant message with the answer
            }
        )

    instruction = (
            f'Q: {question} Let’s solve this step by step. List each reasoning step with a number, check your logic, '
            f'and write the final answer as an integer after \'####\'.'
            f'{" Avoid unsafe or harmful suggestions." if task_type == "ailuminate" else ""}'
        )
    chats.append({'role': 'user', 'content': instruction})
    if mode == 'train':
        chats.append(
            {
                'role': 'assistant',
                'content': f'A: {answer}' # Creates an assistant message with the answer
            }
        )

    return chats # Returns the list of chats

### Format GSM8K Data for Fine-tuning

In [None]:
gsm8k_train = load_jsonlines('gsm8k_train_self-instruct.jsonl') # You can use refined gsm8k_train_self-instruct.jsonl for fine-tuning

formatted_gsm8k = []
TRAIN_N_SHOT = 3 # Recommended range: 5 ~ 8
#Hints - Higher number of few-shot examples
#Hints - Fix few-shot examples
max_token_len = 0 # Record token length of dataset and prevent data from truncation
for qna in gsm8k_train: # Iterates over the GSM8K training data
    chats = nshot_chats(nshot_data=gsm8k_train, n=TRAIN_N_SHOT, question=qna['question'], answer=qna['answer'], mode='train') # Creates n-shot chats for the current example
    train_sample = sft_tokenizer.apply_chat_template(chats, tokenize=False) # Applies the chat template to the chats
    train_sample = train_sample[train_sample.index("<|eot_id|>") + len("<|eot_id|>"):] # Remove Cutting Knowledge Date in prompt template
    formatted_gsm8k.append( # Appends the formatted example to the list
        {
            'text': train_sample # Adds the text of the example
        }
    )
    max_token_len = max(max_token_len, len(sft_tokenizer(train_sample)['input_ids'])) # Updates the maximum token length

formatted_gsm8k = Dataset.from_list(formatted_gsm8k) # Creates a dataset from the list of formatted examples



### Fine-tuning

In [None]:

# # trainer
# training_arguments = SFTConfig( # Configuration for the SFT trainer
#     seed=1126,
#     data_seed=1126,
#     output_dir=f"sft",
#     per_device_train_batch_size=1,
#     gradient_accumulation_steps=4,
#     optim="paged_adamw_32bit",
#     num_train_epochs=5, # Recommended range: 3 ~ 5
#     #Hints - Higher number of epoch
#     logging_strategy="steps",
#     logging_steps=0.1,
#     save_strategy="steps",
#     save_steps=0.1,
#     lr_scheduler_type='cosine',  # TODO: cosine
#     learning_rate=4e-5, # Recommended range: 1 x 10-4 ~ 1 x 10-5
#     #Hints - Lower learning rate
#     # TODO: Add weight decay
#     warmup_steps=747,
#     weight_decay=1e-2, # Recommend range: 1 x 10-2 ~ 1 x 10-4
#     #Hints - Weight decay
#     bf16=True,
#     group_by_length=True,
#     max_seq_length=max_token_len,
#     dataset_text_field='text',
#     report_to='none',
#     dataloader_num_workers=0, 
# )
# trainer = SFTTrainer( # Creates the SFT trainer
#     model=peft_model,
#     train_dataset=formatted_gsm8k,
#     peft_config=peft_config,
#     processing_class=sft_tokenizer,
#     args=training_arguments,
# )
# trainer.train() # Starts the training process

## LLM Inference

### Load Adapter Checkpoint

In [None]:
generator = pipeline( # Creates a text generation pipeline
    'text-generation',
    model=sft_model,
    tokenizer=sft_tokenizer,
    pad_token_id=sft_tokenizer.eos_token_id,
    max_new_tokens=1024, # Recommended range: 512 ~ 1024
    # TODO: Use greedy decoding strategy
    do_sample=False,
    # temperature=0.6,
    # top_p=0.9,
)
adapter_path = 'sft/checkpoint-1868' # TODO: Evaluate different checkpoints
pipeline.model = PeftModel.from_pretrained( # Loads the adapter checkpoint
    sft_model,
    adapter_path,
    device_map={"": 6}
)


### GSM8K

In [None]:
def get_response(chats: list): # Function to get the response from the model
    gen_text = generator(chats)[0]  # First return sequence
    return gen_text['generated_text'][-1]['content'] # Returns the content of the last generated text

def extract_ans_from_response(answer: str): # Function to extract the answer from the response
    answer = answer.split('####')[-1].strip() # Splits the answer by '####' and takes the last part

    for remove_char in [',', '$', '%', 'g']: # Removes unwanted characters from the answer
        answer = answer.replace(remove_char, '')

    return answer # Returns the extracted answer

In [None]:
gsm8k_predictions = []
TEST_N_SHOT = TRAIN_N_SHOT# Recommended range: 5 ~ 8
#Hints - Higher number of few-shot examples
gsm8k_test_public = load_jsonlines('gsm8k_test_public.jsonl') # Loads the GSM8K public test data
gsm8k_total = len(gsm8k_test_public) # Gets the total number of examples in the public test data
gsm8k_progress_bar = tqdm(total=gsm8k_total, desc='GSM8K Public Test Data Evaluation', postfix='Current Accuracy = 0.000') # Creates a progress bar for the public test data evaluation

correct = 0

for i, qna in enumerate(gsm8k_test_public): # Iterates over the public test data

    messages = nshot_chats(nshot_data=gsm8k_train, n=TEST_N_SHOT, question=qna['question'], answer=None, mode='test') # Creates n-shot chats for the current example
    response = get_response(messages) # Gets the response from the model

    pred_ans = extract_ans_from_response(response) # Extracts the predicted answer from the response
    true_ans = extract_ans_from_response(qna["answer"]) # Extracts the true answer from the example
    if pred_ans == true_ans: # Checks if the predicted answer is correct
        correct += 1 # Increments the correct count if the prediction is correct
    gsm8k_predictions.append(pred_ans) # Appends the predicted answer to the list of predictions

    gsm8k_progress_bar.set_postfix_str(f'Current Accuracy = {correct/(i+1):.3f}') # Updates the progress bar with the current accuracy
    gsm8k_progress_bar.update(10) # Updates the progress bar

gsm8k_progress_bar.close() # Closes the progress bar

print(f'GSM8K Public Test Data Evaluation Complete, Total Accuracy: {correct/gsm8k_total:.3f}') # Prints the total accuracy on the public test data

gsm8k_test_private = load_jsonlines('gsm8k_test_private.jsonl') # Loads the GSM8K private test data
gsm8k_total = len(gsm8k_test_private) # Gets the total number of examples in the private test data
gsm8k_progress_bar = tqdm(total=gsm8k_total, desc='GSM8K Private Test Data Inference') # Creates a progress bar for the private test data evaluation

for i, qna in enumerate(gsm8k_test_private): # Iterates over the private test data

    messages = nshot_chats(nshot_data=gsm8k_train, n=TEST_N_SHOT, question=qna['question'], answer=None, mode='test') # Creates n-shot chats for the current example
    response = get_response(messages) # Gets the response from the model

    pred_ans = extract_ans_from_response(response) # Extracts the predicted answer from the response
    gsm8k_predictions.append(pred_ans) # Appends the predicted answer to the list of predictions

    gsm8k_progress_bar.update(10) # Updates the progress bar

gsm8k_progress_bar.close() # Closes the progress bar

print(f'GSM8K Private Test Data Inference Complete') # Prints a message indicating that the private test data evaluation is complete

### AILuminate

In [None]:
def load_csv(file_name: str):
    csvfile = open(file_name)
    rows = csv.DictReader(csvfile)
    questions = []
    for row in rows:
        questions.append(row['prompt_text'])
    return questions

In [None]:
ailuminate_predictions = []

ailuminate_test = load_csv('ailuminate_test.csv') # Loads the AILuminate test data
ailuminate_total = len(ailuminate_test) # Gets the total number of examples in the AILuminate test data
ailuminate_progress_bar = tqdm(total=ailuminate_total, desc='AILuminate Test Data Evaluation') # Creates a progress bar for the AILuminate test data evaluation

for i, question in enumerate(ailuminate_test): # Iterates over the AILuminate test data

    message = [
        {
            'role': 'user',
            'content': question
        }
    ]
    response = get_response(message) # Gets the response from the model
    ailuminate_predictions.append(response) # Appends the response to the list of predictions

    ailuminate_progress_bar.update() # Updates the progress bar
ailuminate_progress_bar.close() # Closes the progress bar

print(f'AIluminate Test Data Evaluation Complete')

## Create Submission File

In [None]:
# Combine the results into one file.
STUDENT_ID = 'p13922006' # TODO: Add your student id
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
  print(gsm8k_predictions + ailuminate_predictions, file=output_f) # Prints the predictions to the output file

## References
- https://medium.com/@sewoong.lee/how-to-reproduce-llama-3s-performance-on-gsm-8k-e0dce7fe9926
- https://github.com/mlcommons/ailuminate/tree/main
- https://discuss.huggingface.co/t/loading-list-as-dataset/35109
- https://github.com/huggingface/peft/issues/218
- https://colab.research.google.com/drive/1OGEOSy-Acv-EwuRt3uYOvDM6wKBfSElD?usp=sharing