# Model Fine-Tuning

**OPTIONAL**: Follow the instructions below to fine-tune your model!

If you would like to do a fine-tuning run that is longer than Google Colab will allow, **email me (amu1@rice.edu)** your notebook and I will run it for you on the VM and send you the results.

## Import Dependencies

In [None]:
from transformers import PreTrainedModel, PreTrainedTokenizer, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, DataCollatorForLanguageModeling
from trl import SFTTrainer
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, TaskType
import pandas as pd
import numpy as np
import torch.nn as nn
import torch
from datasets import Dataset

# Load Model

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from getpass import getpass
import os
import torch

HF_TOKEN = os.environ.get("HF_TOKEN", None)
HF_CACHE = os.environ.get("HF_CACHE", None)
if not HF_TOKEN:
  HF_TOKEN = getpass("Enter your HuggingFace token:")

HF_MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct"

# Here we compress our model with 8-bit "quantization".
# You can think of quantization as rounding the model's
# parameters, which are normally stored as 16- or 32-bit floats, to 8 bits.
# This saves a lot of space on our GPU!
quantization_config = BitsAndBytesConfig(
  # The bnb_4bit_* options only apply when load_in_4bit=True,
  # so none of them are set here.
  load_in_8bit=True,
)

tokenizer = AutoTokenizer.from_pretrained(
  HF_MODEL_NAME,
  cache_dir=HF_CACHE,
  # The Llama repositories are gated, so the tokenizer download needs the token too.
  token=HF_TOKEN
)

tokenizer.pad_token = tokenizer.eos_token

# Notice the 'Causal' in the constructor.
# Llama is a 'Causal' LM, meaning it predicts the next token given only the previous tokens.
# Some 'masked' LMs like BERT can predict a token in the middle of a sentence.
model = AutoModelForCausalLM.from_pretrained(
  HF_MODEL_NAME,
  quantization_config=quantization_config,
  # Let HuggingFace decide which device to put our model on.
  # This will efficiently share CPU and GPU resources.
  device_map="auto",
  token=HF_TOKEN
)


Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.53s/it]
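The "rounding" idea in the quantization comment above can be made concrete with a small sketch. This is a simplified absmax scheme written for illustration only; the helper names `quantize_absmax_int8` and `dequantize` are hypothetical, and bitsandbytes itself uses a more elaborate method than this.

```python
import numpy as np

def quantize_absmax_int8(weights):
    """Map float weights onto the int8 range [-127, 127] with one shared scale.

    Simplified illustration, NOT the exact algorithm used by bitsandbytes.
    """
    scale = float(np.abs(weights).max()) / 127.0
    quantized = np.round(weights / scale).astype(np.int8)
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return quantized.astype(np.float32) * scale

# Toy weights standing in for a slice of a model's parameters.
weights = np.array([0.12, -0.5, 0.33, 0.9, -0.07], dtype=np.float32)

quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)
max_error = float(np.abs(weights - restored).max())

print("int8 values:", quantized)
print("bytes before:", weights.nbytes, "bytes after:", quantized.nbytes)
print("largest rounding error:", max_error)
```

Each value now takes 1 byte instead of 4, and the rounding error is bounded by half of the scale factor.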


# Set up LoRA

LoRA (Low-Rank Adaptation) is a great approach to fine-tuning when compute and memory are limited.

Here is a short summary (with a LOT of details missing):

1) The original model's weights are frozen, meaning they remain unchanged during the training process.
2) For a selected subset of layers (generally attention layers), a pair of small, trainable low-rank matrices is created alongside the original weight matrix.
3) During the model's forward pass, each targeted layer's input is multiplied through both the original weight matrix and the LoRA matrices.
4) The outputs of the two paths are summed (with the LoRA output scaled by a constant factor) before being passed on to the next layer.

A much better explanation can be found [here](https://codecompass00.substack.com/p/what-is-lora-a-visual-guide-llm-fine-tuning) for those who are curious.
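The summary above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not PEFT's implementation: the dimensions are toy values, the names `W`, `A`, `B`, and `lora_forward` are hypothetical, and the update is scaled by `lora_alpha / r` as in the standard LoRA formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 16, 12, 4, 8  # toy sizes chosen for illustration

W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight (never updated)
A = rng.normal(size=(r, d_in)) * 0.01      # trainable low-rank factor, random init
B = np.zeros((d_out, r))                   # trainable low-rank factor, zero init

def lora_forward(x, W, A, B, lora_alpha, r):
    """Sum the frozen path and the scaled low-rank path."""
    base = x @ W.T                         # original layer output
    update = (x @ A.T) @ B.T               # low-rank path: d_in -> r -> d_out
    return base + (lora_alpha / r) * update

x = rng.normal(size=(3, d_in))             # a batch of 3 input vectors

# With B initialised to zero, the adapted layer starts out identical to the base layer.
out_before = lora_forward(x, W, A, B, lora_alpha, r)

# Pretend training has moved B away from zero.
B_trained = rng.normal(size=(d_out, r))
out_after = lora_forward(x, W, A, B_trained, lora_alpha, r)

# The same result can be obtained by folding the update into a single matrix.
W_merged = W + (lora_alpha / r) * (B_trained @ A)

print(out_after.shape)
print(np.allclose(out_after, x @ W_merged.T))
```

Only `A` and `B` hold trainable values here, which is `r * (d_in + d_out)` numbers instead of the `d_in * d_out` numbers in `W`.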

You will be shocked at how well a LoRA fine-tuned model performs.

In [None]:
peft_config = LoraConfig(
  task_type=TaskType.CAUSAL_LM,
  # The rank of the LoRA matrix added alongside the original layer.
  # Higher rank --> more tunable parameters
  # Recommended to use powers of 2.
  r=8,

  # Names of modules (layers) that LoRA will target.
  # You can list a model's layers with print(model) or model.named_modules().
  # Self-attention projections (q_proj, k_proj, v_proj) are the most common
  # targets because they control which parts of the sequence the model
  # pays attention to. This list also targets the feed-forward (MLP)
  # projections (gate_proj, up_proj, down_proj), which adds trainable
  # parameters at the cost of more GPU memory.
  target_modules = ["q_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],

  # Before the LoRA output is added to the original layer's output,
  # it is scaled by lora_alpha / r.
  # So, higher alpha --> LoRA layers have more effect on model.
  # Powers of 2 are most commonly used here.
  lora_alpha=16,

  # Dropout randomly zeroes a fraction of the inputs to the LoRA layers
  # during training, which adds noise and helps reduce overfitting.
  # Keep this small! (< 0.2)
  lora_dropout=0.1,
)
model_for_training = get_peft_model(model, peft_config)

In [None]:
model_for_training.print_trainable_parameters()

trainable params: 10,780,672 || all params: 3,223,530,496 || trainable%: 0.3344
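The trainable parameter count printed above can be reproduced by hand. Each targeted linear layer gains two matrices with `r * (in_features + out_features)` values in total. The layer shapes and layer count below are assumptions taken from the published Llama-3.2-3B configuration rather than from this notebook, and `lora_param_count` is a hypothetical helper.

```python
# (in_features, out_features) of each targeted projection in ONE decoder layer.
# ASSUMPTION: shapes for Llama-3.2-3B (hidden size 3072, MLP size 8192,
# key/value projection size 1024).
target_shapes = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
num_layers = 28  # ASSUMPTION: number of decoder layers in Llama-3.2-3B
r = 8            # LoRA rank from the LoraConfig above

def lora_param_count(shapes, num_layers, r):
    """Count LoRA parameters: r * (in + out) per targeted module, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(target_shapes, num_layers, r)
all_params = 3_223_530_496  # "all params" as printed above

print(f"{trainable:,}")                        # 10,780,672
print(round(100 * trainable / all_params, 4))  # 0.3344
```

Doubling `r` doubles the number of trainable parameters, which is why the rank is the main knob for adapter size.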


# Dataset Setup

Ideally, your dataset should consist of prompts in the Llama Instruct format, like:

> <|begin_of_text|>
>
> <|start_header_id|>system<|end_header_id|>
>
> {Instruction you want your model to respond to}
>
> <|eot_id|>
>
> <|start_header_id|>user<|end_header_id|>
>
> {Example input}
>
> <|eot_id|>
>
> <|start_header_id|>assistant<|end_header_id|>
>
> {Desired output}
>
> <|eot_id|>

An easy place to start fine-tuning is to pick a few outputs from your base model that you wish to improve, edit them, and then train your model on the edited version.
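If you want to assemble training strings in the Llama Instruct layout yourself, a small helper along these lines may be useful. It is a sketch only: `build_llama3_example` is a hypothetical function, the sample texts are made up, and the string ends with `<|eot_id|>` to match the terminator that the loading code in this section appends.

```python
def build_llama3_example(system, user, assistant):
    """Join one system/user/assistant exchange into a single training string."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n{assistant}<|eot_id|>"
    )

# Made-up texts, purely for illustration.
example = build_llama3_example(
    system="Summarize the post in one sentence.",
    user="I adopted a puppy last week and she already knows how to sit.",
    assistant="The author's new puppy learned to sit within a week.",
)
print(example)
```

Everything up to and including the assistant header plays the role of the **full_prompt {TOPIC}** column, and the assistant text plays the role of **inference_output {TOPIC}**.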

The provided code will work if you have done the following:
1) Saved **youtube_df** or **reddit_df**, or a random sample of them (with df.sample(n_samples)), to CSV
    - About 100 samples will do as a starting point
2) Edited the **inference_output {TOPIC}** columns to your desired output
3) Optionally, edited the **full_prompt {TOPIC}** columns if you would like the model to respond to a different (e.g. a shorter) prompt



In [None]:
# Load your spreadsheet
import re

EDITED_CSV_PATH = "reddit.csv"
edited_df = pd.read_csv(EDITED_CSV_PATH)

topics = [
    column.removeprefix("inference_output ")
    for column in edited_df.columns
    if re.match("inference_output .*", column)
]

full_text = []
for topic in topics:
    full_text.extend(
        edited_df[f"full_prompt {topic}"] + edited_df[f"inference_output {topic}"] + "<|eot_id|>"
    )

dataset = Dataset.from_dict({
    "text": full_text
})

# Hold out 20% of the examples as a validation set.
train_test_split = dataset.train_test_split(test_size=0.2)

In [None]:
training_args = TrainingArguments(
    # Directory (relative to the current working directory) in which
    # to store intermediate files during training. Feel free to change this.
    output_dir="lora_layers/train",

    learning_rate=1e-3,
    weight_decay=1e-3,

    # Number of samples per device (GPU) in each batch.
    # Raising this makes training faster but takes up more GPU space.
    per_device_train_batch_size=2,

    # Number of epochs
    # Advised that you raise this after you do 1 run to test
    num_train_epochs=2,

    # When to run over the validation set.
    # Set this to 'steps' (together with eval_steps) to evaluate every N training steps instead.
    eval_strategy="epoch",
    save_strategy="epoch",

    # Use 16-bit floats instead of 32-bit (saves vRAM)
    fp16=True, # Consider this!
)

trainer = SFTTrainer(
    model=model_for_training,
    args=training_args,
    tokenizer=tokenizer,

    # Hugging Face `datasets` Dataset objects
    train_dataset=train_test_split["train"],
    eval_dataset=train_test_split["test"],
    data_collator=DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        # causal model, not masked language model.
        mlm=False
    ),
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=512
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
Map: 100%|██████████| 16/16 [00:00<00:00, 1175.41 examples/s]
Map: 100%|██████████| 4/4 [00:00<00:00, 714.53 examples/s]


# Train the Model

In [None]:
# Disable KV caching for training. The cache stores the attention
# keys and values of earlier tokens so they are not recomputed for
# each newly generated token, which only helps during generation.
# Turning it off during training saves GPU memory.
# It is re-enabled after training so that inference stays fast.
model.config.use_cache = False
model_for_training.config.use_cache = False

trainer.train()

torch.cuda.empty_cache()

model.config.use_cache = True
model_for_training.config.use_cache = True

Epoch,Training Loss,Validation Loss
1,No log,0.566039
2,No log,0.505438
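A likely explanation for the `No log` entries in the Training Loss column is that the run finishes before the first logging step. The sketch below counts optimizer steps under a few assumptions that are not stated in this notebook: a single device, no gradient accumulation, and the `TrainingArguments` default of `logging_steps=500`. The 16 training examples come from the `Map` output above, and `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                          num_devices=1, grad_accum_steps=1):
    """Estimate how many optimizer steps a training run will take."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs

steps = total_optimizer_steps(num_examples=16, per_device_batch_size=2, num_epochs=2)
default_logging_steps = 500  # ASSUMPTION: TrainingArguments default

print(steps)                           # 16
print(steps >= default_logging_steps)  # False: no training loss is ever logged
```

Under these assumptions, passing a smaller value such as `logging_steps=1` to `TrainingArguments` should make the training loss appear.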


# Save output

All done! Now, save your LoRA adapter somewhere. Uncomment the code in the **Load Model** section of analyze_social_media.ipynb to load in your adapter!

You can also port over the inference() function from there to test your model now.

In [None]:
# Feel free to change this.
model_for_training.save_pretrained("lora")