# Project 5: LLM

This is the fifth project in NLP SW013 done by Jannine Meier.

The project was run on Google Colab.

WandB Project Link with created View options: https://wandb.ai/jannine-meier/project5_JM?nw=nwuserjanninemeier

### Project description
This project involves further fine-tuning a pretrained LLM from huggingface on the Winogrande dataset. I chose the quantized Llama-3-8b-bnb-4bit model from Unsloth for my project: https://huggingface.co/unsloth/llama-3-8b-bnb-4bit

## Prefix

I wanted to give a quick explanation to this notebook. I spent at least 45 hours trying to complete it, even took 2 days off work, but still faced so many challenges that I was not able to complete the project as requested. I used different base models and tried many tutorials to solve the task, started over and over again but kept running into memory issues as described in my email. New appraoches same problem. My colleagues tried to help but couldn’t solve the problem either. Discussing with other students I did not get any useful information as the five people I spoke to seemed all to struggle themselves.

After trying on GPUHub for many days I tried to run it on google colab by Wednesday night, where it worked better (at least no longer had memory issues) but it was too late to finish everything as I still had other issues to solve now that the code finally ran and every now and then I got the message that GPU is no longer available. This means that every 2 hours I was not able to use the GPU without paying for it...

Below in this notebook, Ive written down what I planned to do, but I wasn’t able to finish it and execute it fully. Overall, I found the task very challenging. For me personally, and maybe also other students without an apprenticeship in IT and being fast coders, it was overwhelming to teach myself the necessary skills and correctly combine them so that the code would run, all within such a short time frame.

In short what I wanted to express is that I actually wasn't a sloth (just like my model) and that neither the lack of trying nor investing time was the reason for the notebook being as it is and I hope this will be considered when grading. I'm sure with more time or more gpu power or some more guidance (like with the last projects during the lessons) I would have been able to solve it in a better way. Nevertheless, I definetely learned a looot during this project.

## Libraries & Imports

In [1]:
%%capture
# Installation of necessary packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes
!pip install wandb

In [2]:
%%capture

# Importing necessary libraries
from unsloth import FastLanguageModel
import torch
from torch.nn.functional import cross_entropy
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
import wandb

# Initializing Weights & Biases for experiment tracking
wandb.init(project="project5_JM", entity="jannine-meier")

## Preprocessing

**Quantization:** I use a 4-bit quantization to reduce memory usage.

**Load Datasets:** I chose winogrande_l as I already had not enough time but taking a bigger one would for sure be beneficial for the training. I load and split the datasets into training, validation, and test sets. For the Winogrande dataset, I specifically extracted the last 1000 entries from the training dataset to create the test set.

**Tokenizer Usage:** I use the tokenizer for the "unsloth/llama-3-8b-bnb-4bit" model which uses a Byte-Pair Encoding approach to handle subwords. Tokens 128000 and 128001 are reserved for start and end of text tokens. I also set the padding token to 128001.
- **Not Removing Punctuation and Stopwords:** I decided to retain punctuation and stopwords because they sometimes have significant contextual meanings which might be crucial for the model to understand nuanced differences between sentences.
- **No Stemming or Lemmatization:** I decided that these processes are unnecessary, as my tokenizer uses a subword tokenization method capable of understanding various word forms without reducing them to their root forms.


**Prompt Formatting for Model Input:**
- Alpaca_prompt: I used a template string that structures the input to the model. It formats an instruction, a contextual input, and expects a response in a structured format. I used the same base alpaca-prompt and instruction for all the data. For the input I structured the sentence + option1 + option2 and for the ouput I pasted the raw label as 1 or 2 reffering to the correct option1 and option2 followed by a EOS token.


In [3]:
# Dataset preparation and manipulation
winogrande_datasets = load_dataset('winogrande', 'winogrande_l')
train_dataset = winogrande_datasets['train'].select(range(len(winogrande_datasets['train']) - 1000))
eval_dataset = winogrande_datasets['validation']
test_dataset = winogrande_datasets['train'].select(range(len(winogrande_datasets['train']) - 1000, len(winogrande_datasets['train'])))

# Load pre-trained model and tokenizer with specific configurations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = 128,
    dtype = None,
    load_in_4bit = True,
)

# Set the padding side to left (did not work but could not figure out why)
# tokenizer.padding_side = 'left'

# Set up end-of-sequence token using tokenizer's built-in attribute
EOS_TOKEN = tokenizer.eos_token

# Enhance the model with PEFT modifications
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Finetune on 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,  # Finetune on 8, 16, 32, 64, 128
    lora_dropout = 0, # Supports any, but = 0 is optimized according to unsloth documentation
    bias = "none",    # Supports any, but = "none" is optimized according to unsloth documentation
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,  # Supports rank stabilized LoRA (could be a possible additional finetune parameter)
    loftq_config = None, # Supports LoftQ (could be a possible attional finetune parameter)
)

==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


In [4]:
# Define a prompt template for constructing dataset entries
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that contains only one digit either 1 or 2 and no other words or symbols.

### Instruction:
{}

### Input:
{}

### Response:
{}""" + EOS_TOKEN

def formatting_prompts_func(examples, include_answers=False):
    # Generate instructions for each example
    instructions = ["Choose which option is the right one to replace the underscore and make the sentence meaningful and answer with a number (1 or 2) only"] * len(examples['sentence'])
    # Combine sentences with their options to create the input text
    inputs = [f"{examples['sentence'][i]} Option 1: {examples['option1'][i]} Option 2: {examples['option2'][i]}" for i in range(len(examples['sentence']))]
    # Only include the correct answer in the output for train_dataset
    outputs = [examples['answer'][i] if include_answers else '' for i in range(len(examples['sentence']))]
    # Format each example into the full prompt text
    texts = [alpaca_prompt.format(instr, inp, out) for instr, inp, out in zip(instructions, inputs, outputs)]
    return {'text': texts}

# Apply the formatting function to the datasets
train_dataset = train_dataset.map(lambda examples: formatting_prompts_func(examples, include_answers=True), batched=True)
# Excludes answers for evaluation and testing datasets
eval_dataset = eval_dataset.map(lambda examples: formatting_prompts_func(examples, include_answers=False), batched=True)
test_dataset = test_dataset.map(lambda examples: formatting_prompts_func(examples, include_answers=False), batched=True)

# Print the first 5 formatted examples from the training dataset to check output
for i in range(3):
    print(train_dataset[i]['text'])


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that contains only one digit either 1 or 2 and no other words or symbols.

### Instruction:
Choose which option is the right one to replace the underscore and make the sentence meaningful and answer with a number (1 or 2) only

### Input:
Ian volunteered to eat Dennis's menudo after already having a bowl because _ despised eating intestine. Option 1: Ian Option 2: Dennis

### Response:
2<|end_of_text|>
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that contains only one digit either 1 or 2 and no other words or symbols.

### Instruction:
Choose which option is the right one to replace the underscore and make the sentence meaningful and answer with a number (1 or 2) only

### Input:
Ian volunteered to eat Dennis's menudo after already having a bowl because _ enjoyed eating intestine. Option 1: Ian Option

## Model

I chose the Llama-3-8b-bnb-4bit model as it is trained on 15 trillion tokens which is a looot.


Due to time constraints I was not able to test out many LoRA adapter configurations but in the comments in the code you can see what I would have liked to test out with sufficient time.

## Training
I used the Huggingface's SFTTrainer. This setup is specifically tailored for efficient and effective fine-tuning of large language models, making use of advanced training techniques like low precision computing and parameter-efficient model adaptations.

Goal would be to run at least one or more epochs but I had not even close enough time to do this so i just trained on a few examples... I set the max_steps to 500 to see how the loss performs.


In [5]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = 128,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences according to Unsloth
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs=1, goal
        # max_steps = 0, goal
        max_steps = 10, # chose this due to time constraints
        learning_rate = 2e-4, # should be fine-tuned
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to="wandb",  # This will enable logging to Weights & Biases
    ),
)

In [6]:
#Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.594 GB of memory reserved.


In [7]:
# Train
trainer_stats = trainer.train()

Step,Training Loss
1,2.9179
2,2.9996
3,2.9464
4,2.7059
5,2.3867
6,2.0612
7,1.5503
8,1.3478
9,1.0364
10,0.8899


Current status: My code is finally running and I have 6 hours left and it takes me 7-8 minutes to train 100 samples. This means to train the whole train samples which are over 10k it would take me over 11 hours which I do not have left. I'll stick with fewer samples due to time constraints.

In [8]:
# Shows final memory and time stats to check how long it takes and if memory is sufficient
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

47.3741 seconds used for training.
0.79 minutes used for training.
Peak reserved memory = 6.172 GB.
Peak reserved memory for training = 0.578 GB.
Peak reserved memory % of max memory = 41.85 %.
Peak reserved memory for training % of max memory = 3.919 %.


In [9]:
def validate_model(model, tokenizer, dataset):
    correct_predictions = 0
    total_predictions = 0  # Initialize total_predictions
    total_loss = 0.0

    for index, example in enumerate(dataset):
        if index >= 10:  # Stop after processing 30 entries due to time constraints
            break

        prompt = example['text']
        inputs = tokenizer(prompt, return_tensors="pt", padding='longest').to("cuda")

        # Generate logits
        outputs = model(**inputs, max_new_tokens=128, return_dict=True)
        logits = outputs.logits  # Logits for all tokens
        last_token_logits = logits[:, -1, :]  # Focus on the last token for classification

        # Adjust labels for cross_entropy
        label = int(example['answer']) - 1  # Convert labels from '1' and '2' to '0' and '1'
        labels = torch.tensor([label], dtype=torch.long).to("cuda")

        # Calculate loss for the last token
        loss = cross_entropy(last_token_logits, labels)
        total_loss += loss.item()

        # Decode full generated response
        full_generated_tokens = model.generate(**inputs, max_length=tokenizer.model_max_length)
        full_response_text = tokenizer.decode(full_generated_tokens[0], skip_special_tokens=True)

        predicted_number = full_response_text.strip()[-1]  # Extract the last character as the prediction

        # Check correctness
        correct = (predicted_number == example['answer'])
        correct_predictions += correct

        # Detailed output for each example
        instruction = prompt.split("### Instruction:")[1].split("### Input:")[0].strip()
        input_text = prompt.split("### Input:")[1].split("### Response:")[0].strip()
        response_text = prompt.split("### Response:")[1].strip().split(tokenizer.eos_token)[0]
        actual_answer = example['answer']

        print(f"Example {index + 1}:")
        print(f"Instruction: {instruction}")
        print(f"Input: {input_text}")
        print(f"Predicted Response: {predicted_number}")
        print(f"Actual Response: {actual_answer}")
        print(f"---")

        total_predictions += 1  # Update total_predictions within the loop

    # Calculate final metrics
    accuracy = correct_predictions / total_predictions
    average_loss = total_loss / total_predictions

    # Log metrics to wandb
    wandb.log({"Validation Accuracy": accuracy, "Validation Loss": average_loss})

    return accuracy, average_loss

# Assuming your dataset is already loaded and formatted correctly
eval_dataset = eval_dataset.map(formatting_prompts_func, batched=True)
test_dataset = test_dataset.map(formatting_prompts_func, batched=True)

In [None]:
# Evaluate on a subset of the validation dataset
accuracy, average_loss = validate_model(model, tokenizer, eval_dataset)
print(f"Validation Accuracy: {accuracy:.2f}, Validation Loss: {average_loss:.4f}")

Example 1:
Instruction: Choose which option is the right one to replace the underscore and make the sentence meaningful and answer with a number (1 or 2) only
Input: Sarah was a much better surgeon than Maria so _ always got the easier cases. Option 1: Sarah Option 2: Maria
Predicted Response: 2
Actual Response: 2
---
Example 2:
Instruction: Choose which option is the right one to replace the underscore and make the sentence meaningful and answer with a number (1 or 2) only
Input: Sarah was a much better surgeon than Maria so _ always got the harder cases. Option 1: Sarah Option 2: Maria
Predicted Response: y
Actual Response: 1
---
Example 3:
Instruction: Choose which option is the right one to replace the underscore and make the sentence meaningful and answer with a number (1 or 2) only
Input: They were worried the wine would ruin the bed and the blanket, but the _ was't ruined. Option 1: blanket Option 2: bed
Predicted Response: 1
Actual Response: 2
---
Example 4:
Instruction: Choose

### Saving, loading finetuned models
This ONLY saves the LoRA adapters, and not the full model.

In [None]:
model.save_pretrained("lora_model") # Local saving

## Results



In the last projects my model was not showing any real signs of learning as not even the train accuracy improved over time. Validation accuracy project4: 0.553, project3: 0.509.

This time I see when I look at the training loss it looks like the model is learning quiet well. Even though I had no time to run a full epoch on the train set I noticed that the trend of decreasing loss is persisting which is a good sign and shows that probably training a full epoch would result in even better results. It would be beneficial to also plot a learning accuracy curve but I did not have the time to specialize the trainer to do the logging.

On the validation set my results were different depending on how many samples I used for learning as well as how many for validating (obviously). As I was not able to run the full sets my results are not representative and fluctuating a lot depending on the chosen sizes. I therefore also left out the confusion matrix which would not be saying a lot.

Actually I noticed the more training samples the worse the model performs.. This means that probably my setup has some issues which need to be fixed and that the basemodel without my finetuning performs better than with it :')...

E.g.: train & validation 100 samples
- Validation accuracy: 0.14 (predics a lot of "/" instead of numbers)

E.g.: train 30 samples & validation 50 samples
- Validation accuracy: 0.48


So I evaluated the test accuracy after training only with 30 sample (almost not at all) and got following accuracy which probably will be the best accuracy overall.. but as I said acutally doesn't make sense to look at these numbers
.
- Test accuracy: 0.53 (project4: 0.662, project3: 0.500)

In [None]:
# Evaluate on a subset of the test dataset
accuracy, average_loss = validate_model(model, tokenizer, test_dataset)
print(f"Test Accuracy: {accuracy:.2f}, Test Loss: {average_loss:.4f}")

## Interpretation

Due to time constraints I was not able to actually test out any finetuning parameters as commented in the code and also some prompt engineering would most likely help to get better results. This means my results are not really representing bunch of experiments but just looking at the numbers I can say its not (yet) working as it should which could be due to many facotrs like not enough training data, not enough finetuning, not right checking for correct output format, not perfect prompts, etc.

Looking at the predicted ouput it seems that the model sometimes predicts the right answer but often also not a number (only) instead symbols like / or . or ) and quiet often also the wrong answer resp. label. When looking at the full generated answers from the model I notice that it sometimes answers with the word instead of the number label and sometimes in full or part sentences instead of just the numbers. This means I either have to finetune it better or improve my instruction and alpaca prompt so the model actually learns the exact pattern i expect or I should change the way I check my models ouputs correctnes maybe to check for a number in the whole generated response answer ouput after the response tag and not only the last token or maybe also look for the words resp. the options in letters instead of the number.

I am certain that with the right instructions and finetuning as well as enough training (e.g. training with a large training set) we actually should be able to get good scores for this task.

My notebook is inspired from Unsloth where they have tutorials on fine-tuning Llama models: https://github.com/unslothai/unsloth?tab=readme-ov-file

Additionally I used snippets from huggingface especially: https://huggingface.co/docs/trl/sft_trainer

Also used code from the notebook shown here: https://www.youtube.com/watch?v=pK8u4QfdLx0

ChatGPT was used to help me clean the code and make adjustments where needed.