# LoRA Fine-Tuning Notebook

__Goal of this notebook__: Guide you through the end-to-end process of fine-tuning a Large Language Model (LLM) with LoRA adapters. It assumes a basic understanding of LoRA.

__As a reminder__: LoRA substantially reduces compute and GPU memory requirements: the pretrained weights stay frozen and only small low-rank adapter matrices are trained, instead of the whole network. https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html
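The idea behind the reminder can be sketched in a few lines of PyTorch. This is a minimal, hypothetical `LoRALinear` wrapper, not the PEFT implementation: it freezes a base linear layer `W0` and adds a trainable low-rank path `(lora_alpha / r) * B A x`. As with PEFT's default initialization, `B` starts at zero, so the wrapped layer initially behaves exactly like the original one. The size 1536 is an assumption chosen to resemble the hidden size of the 1.5B model used below.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Hypothetical minimal LoRA wrapper: y = W0 x + (lora_alpha / r) * B A x, with W0 frozen."""

    def __init__(self, base: nn.Linear, r: int = 16, lora_alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights are not updated
        # A maps down to rank r, B maps back up; B = 0 makes the adapter a no-op at start
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = lora_alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


torch.manual_seed(0)
base = nn.Linear(1536, 1536)
layer = LoRALinear(base, r=16, lora_alpha=32)
x = torch.randn(2, 1536)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(trainable, frozen)  # 49152 2360832
print(torch.allclose(layer(x), base(x)))  # True: identical to the base layer at initialization
```

Only `r * (in_features + out_features)` parameters are trained per adapted layer, which is where the memory savings come from.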

__Typical use-cases__:
- __Domain Adaptation__: Adapt a general LLM to a specific domain (e.g., legal, medical, finance) using domain-specific corpora
- __Style Tuning__: Align model outputs with a specific instruction style or tone (concise vs. verbose, formal vs. casual), e.g. talk like Angela Merkel
- __Task-Specific__: Train the model to perform new tasks, such as classification, summarization, or structured extraction (this requires some adaptations to the workflow below)

In [None]:
# import necessary python libraries
from peft import LoraConfig, get_peft_model, PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling, pipeline
from torch.utils.data import Dataset
import torch
import pickle
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
import time
import math
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import json

In [None]:
# parameters - you can use this for specifying all parameters for your experiments

# files
chunk_filename_json = "./data/kahneman_chunks.json" # JSON file containing a list of text paragraphs (chunks)

# model
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" # put in the model name from Hugging Face
#model_name = "deepseek-ai/deepseek-llm-7b-base"

# LoRA config
r = 16
lora_alpha = 32
lora_dropout = 0.1
target_modules=["q_proj", "v_proj"] # attention projections to adapt; you could add more (e.g. "k_proj", "o_proj" or the MLP projections). In principle, every linear layer can get an adapter
bias="none"
task_type="CAUSAL_LM"

# LoRA training

output_dir="./lora_finetuned_model/deepseek-llm-7b-base/LoRA_run_1" # path for saving the fine-tuned model (consider naming the folder after the model actually used)
per_device_train_batch_size=3 # batch size per device (e.g. per GPU; here there is a single GPU)
gradient_accumulation_steps=5 # number of batches whose gradients are accumulated before each optimizer step; simulates a larger effective batch size (3 x 5 = 15)
learning_rate=2e-4
num_train_epochs=3 # number of training epochs
save_strategy="epoch" # save a checkpoint at the end of every epoch
fp16=True # mixed-precision training in float16: lower memory use and faster training on GPUs
bf16=False # bfloat16 has a wider range and is numerically more stable than float16; prefer it if your GPU supports it. Enable at most one of the two
logging_steps=10 # report the training loss every 10 optimizer steps
report_to="none"  # "none" disables logging to external services like WandB
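The batch-size parameters above determine how many optimizer steps the run performs. A small arithmetic sketch, assuming a single device and the 6,864 training paragraphs reported by the split further down; `training_schedule` is a hypothetical helper, and the exact rounding of the last, partial accumulation step may differ between `transformers` versions, so treat the result as approximate.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate effective batch size and optimizer-step counts (hypothetical helper)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    micro_batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # One optimizer step per `grad_accum` micro-batches; a trailing partial group counts as one step here
    optimizer_steps_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum)
    return effective_batch, optimizer_steps_per_epoch, optimizer_steps_per_epoch * epochs


print(training_schedule(6864, 3, 5, 3))  # (15, 458, 1374)
```

With `logging_steps=10`, this means roughly one loss entry per 150 training paragraphs.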

In [None]:
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)
original_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_name)
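To see which names are valid for `target_modules`, one can list the linear layers of a model. The sketch below uses a hypothetical toy module whose layer names mimic Qwen/LLaMA-style decoders, so it runs without downloading anything; `linear_module_names` is an illustrative helper, not a library function.

```python
import torch.nn as nn


def linear_module_names(model: nn.Module) -> set:
    """Distinct leaf names of all nn.Linear submodules, i.e. candidates for target_modules."""
    return {name.split(".")[-1] for name, module in model.named_modules() if isinstance(module, nn.Linear)}


# Hypothetical toy decoder block using Qwen/LLaMA-style layer names
toy = nn.ModuleDict({
    "self_attn": nn.ModuleDict({n: nn.Linear(8, 8) for n in ["q_proj", "k_proj", "v_proj", "o_proj"]}),
    "mlp": nn.ModuleDict({n: nn.Linear(8, 8) for n in ["gate_proj", "up_proj", "down_proj"]}),
})
print(sorted(linear_module_names(toy)))
# ['down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj']
```

Applied to `original_model`, this would typically also list `lm_head`, which is usually left out of the adapter targets.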

In [None]:
# Load the JSON file
with open(chunk_filename_json, 'r', encoding='utf-8') as f:
    kahneman_paragraphs = json.load(f)

# Ensure it's a list
if not isinstance(kahneman_paragraphs, list):
    raise ValueError("The loaded JSON is not a list!")

print(f"Loaded {len(kahneman_paragraphs)} paragraphs from JSON.")

Loaded 8581 paragraphs from JSON.


In [None]:
# train (80%) / test (20%) split
train_texts, test_texts = train_test_split(kahneman_paragraphs, test_size=0.2, random_state=42)

print(f"Training set size: {len(train_texts)} paragraphs")
print(f"Test set size: {len(test_texts)} paragraphs")

# Thoughts: is random splitting the best idea? Neighboring paragraphs are not independent, so test paragraphs may overlap in content with training paragraphs.
# In addition, the held-out paragraphs are lost for training: their content cannot be "compressed" into the LLM.

Training set size: 6864 paragraphs
Test set size: 1717 paragraphs
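One way to address the first concern in the comment above is a contiguous split that holds out the last block of paragraphs, optionally dropping a small gap at the boundary. This is a sketch assuming the JSON list is stored in book order; `contiguous_split` and its `gap` parameter are hypothetical. It does not solve the second concern: the held-out paragraphs are still unavailable for training.

```python
def contiguous_split(paragraphs, test_size=0.2, gap=0):
    """Hold out the final block of paragraphs instead of sampling them at random.

    `gap` paragraphs just before the test block are dropped so that text at the
    boundary does not end up on both sides of the split.
    """
    n_test = int(round(len(paragraphs) * test_size))
    split_at = len(paragraphs) - n_test
    train_part = paragraphs[: max(split_at - gap, 0)]
    test_part = paragraphs[split_at:]
    return train_part, test_part


paras = [f"p{i}" for i in range(10)]
train_part, test_part = contiguous_split(paras, test_size=0.2, gap=1)
print(train_part)  # ['p0', 'p1', 'p2', 'p3', 'p4', 'p5', 'p6']  ('p7' is the dropped gap)
print(test_part)   # ['p8', 'p9']
```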


In [None]:
# Define LoRA configuration
lora_config = LoraConfig(
    r=r,  # Rank: controls adaptation capacity
    lora_alpha=lora_alpha,  # Scaling factor
    lora_dropout=lora_dropout,  # Dropout probability
    target_modules=target_modules,  # Target attention layers
    bias=bias,
    task_type=task_type,
)

# Apply LoRA to the model
model = get_peft_model(original_model, lora_config)
model.print_trainable_parameters()  # Verify trainable params

trainable params: 2,179,072 || all params: 1,779,267,072 || trainable%: 0.1225
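The trainable-parameter count can be reproduced by hand: each adapted layer adds `r * (in_features + out_features)` parameters. The architecture values below are assumptions about the model's `config.json` (28 layers, hidden size 1536, 2 key/value heads of dimension 128, so `v_proj` outputs 256 features); with them the arithmetic matches the printed count.

```python
# Assumed architecture values (Qwen2-style config); r matches the LoRA config above
num_layers = 28
hidden_size = 1536
num_kv_heads = 2
head_dim = 128
lora_rank = 16


def lora_params(in_features, out_features, rank):
    """A has shape (rank, in_features), B has shape (out_features, rank)."""
    return rank * (in_features + out_features)


q_params = lora_params(hidden_size, hidden_size, lora_rank)              # 16 * 3072 = 49,152
v_params = lora_params(hidden_size, num_kv_heads * head_dim, lora_rank)  # 16 * 1792 = 28,672
total = num_layers * (q_params + v_params)
print(f"{total:,}")                               # 2,179,072
print(f"{100 * total / 1_779_267_072:.4f}%")      # 0.1225%
```

Adding more `target_modules` or increasing `r` grows this count linearly.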


In [None]:
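# Padding to max_length requires a pad token; fall back to EOS if the tokenizer defines none
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


class TextDataset(Dataset):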
    def __init__(self, texts, tokenizer, max_length=512):
        self.examples = []

        for text in texts:
            encoding = tokenizer(
                text,
                truncation=True,
                padding="max_length",
                max_length=max_length,
                return_tensors="pt"
            )

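            # Drop the batch dimension added by return_tensors="pt"
            input_ids = encoding["input_ids"].squeeze(0)
            attention_mask = encoding["attention_mask"].squeeze(0)

            # Set labels the same as input_ids, but ignore padding with -100
            # (DataCollatorForLanguageModeling(mlm=False) rebuilds the labels the same way)
            labels = input_ids.clone()
            labels[labels == tokenizer.pad_token_id] = -100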

            self.examples.append({
                "input_ids": input_ids,
                "attention_mask": attention_mask,
                "labels": labels
            })

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


# Create datasets
train_dataset = TextDataset(train_texts, tokenizer)
test_dataset = TextDataset(test_texts, tokenizer)

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # Causal LM, not masked LM: labels are copied from input_ids and padding tokens set to -100
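Why can the labels simply be a copy of `input_ids`? Hugging Face causal LM models shift logits and labels internally when `labels` are passed, so position t is scored against token t+1, and positions labeled -100 are ignored. A minimal PyTorch sketch of that loss; `causal_lm_loss` is a hypothetical helper and the pad id 0 is an assumption for the toy example.

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy: position t predicts token t+1; label -100 is ignored."""
    shift_logits = logits[:, :-1, :]  # predictions made at positions 0..T-2
    shift_labels = labels[:, 1:]      # the tokens that follow them
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )


torch.manual_seed(0)
vocab, seq_len = 10, 6
input_ids = torch.tensor([[3, 5, 7, 1, 0, 0]])  # last two tokens are padding (pad id 0, hypothetical)
labels = input_ids.clone()
labels[labels == 0] = -100                      # mirrors what the collator does for pad tokens
logits = torch.randn(1, seq_len, vocab)

loss = causal_lm_loss(logits, labels)
# The same value by hand: mean NLL of the three real next-token targets 5, 7, 1
manual = -torch.log_softmax(logits[0, :3], dim=-1)[torch.arange(3), torch.tensor([5, 7, 1])].mean()
print(torch.allclose(loss, manual))  # True
```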

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    num_train_epochs=num_train_epochs,
    save_strategy=save_strategy,
    fp16=fp16, # mixed precision; set at most one of fp16/bf16 to True
    bf16=bf16,
    logging_steps=logging_steps,
    report_to=report_to  # Disable logging to external services like WandB
)
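Instead of hard-coding `fp16=True`, the precision flags could be chosen from the available hardware. A sketch under the assumption that bf16 should be preferred whenever `torch.cuda.is_bf16_supported()` reports support; `pick_mixed_precision` is a hypothetical helper.

```python
import torch


def pick_mixed_precision() -> dict:
    """Prefer bf16 if the GPU supports it, else fp16 on CUDA, else full fp32 (hypothetical helper)."""
    if torch.cuda.is_available():
        if torch.cuda.is_bf16_supported():
            return {"fp16": False, "bf16": True}
        return {"fp16": True, "bf16": False}
    return {"fp16": False, "bf16": False}  # CPU: keep mixed-precision flags off


flags = pick_mixed_precision()
print(flags)
```

The result can replace the explicit `fp16`/`bf16` values in the parameter cell, e.g. `fp16, bf16 = flags["fp16"], flags["bf16"]`.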

In [None]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

In [None]:
# Record the start time
start_time = time.time()

# Train the model
trainer.train()

# Record the end time
end_time = time.time()

# Calculate the elapsed time
elapsed_time = (end_time - start_time)/60

# Print the training duration in minutes
print(f"Training took {elapsed_time:.2f} minutes.")

# If you get the following error: RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1729647382455/work/c10/cuda/CUDACachingAllocator.cpp":995, please report a bug to PyTorch. 
# the GPU available to your account most likely does not have enough memory for this configuration. Keep in mind that self-attention cost grows quadratically with the input length, so reducing max_length or the batch size helps
# You can check VRAM usage and other GPU metrics by running nvidia-smi in a terminal
# Before running again, you might need to restart the kernel to free the GPU memory held by the current model and text chunks

Step,Training Loss
10,4.9948
20,4.5296
30,4.3137
40,4.2426
50,4.0458
60,3.9751
70,3.9866
80,4.0618
90,3.9662
100,3.9787
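The quadratic cost mentioned in the comments of the training cell can be made concrete. The sketch estimates the memory of one layer's attention-score matrix (batch x heads x length x length), assuming 12 attention heads, fp16 values (2 bytes) and an attention implementation that materializes the full matrix; memory-efficient kernels such as SDPA or FlashAttention avoid storing it, while activations kept for the backward pass add further memory.

```python
def attention_scores_bytes(batch_size, num_heads, seq_len, bytes_per_value=2):
    """Memory of one layer's batch x heads x seq_len x seq_len attention-score matrix."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


for seq_len in (256, 512, 1024):
    mib = attention_scores_bytes(3, 12, seq_len) / 2**20
    print(f"seq_len={seq_len:5d}: {mib:5.1f} MiB per layer")
# seq_len=  256:   4.5 MiB per layer
# seq_len=  512:  18.0 MiB per layer
# seq_len= 1024:  72.0 MiB per layer
```

Doubling `max_length` quadruples this term, whereas halving the batch size only halves it.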


In [None]:
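# Save the LoRA fine-tuned model (for a PEFT model only the adapter weights and adapter_config.json are written, not the base model)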
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

('./lora_finetuned_model/deepseek-llm-7b-base/LoRA_run_1/tokenizer_config.json',
 './lora_finetuned_model/deepseek-llm-7b-base/LoRA_run_1/special_tokens_map.json',
 './lora_finetuned_model/deepseek-llm-7b-base/LoRA_run_1/tokenizer.json')
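Because only the adapter is stored, reusing it means loading the base model first and then attaching the adapter, e.g. with `PeftModel.from_pretrained(base_model, output_dir)`; `merge_and_unload()` can then fold the adapter into the base weights for inference. The PyTorch sketch below shows the arithmetic behind merging with random stand-in matrices (all sizes are hypothetical): `W0 + (alpha / r) * B A` gives the same output as the unmerged base-plus-adapter path.

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank, alpha = 64, 32, 4, 8  # hypothetical sizes
scaling = alpha / rank

W0 = torch.randn(d_out, d_in)      # frozen base weight
A = torch.randn(rank, d_in) * 0.1  # random stand-ins for trained LoRA factors
B = torch.randn(d_out, rank) * 0.1
x = torch.randn(5, d_in)

# Unmerged: base path plus low-rank adapter path
y_adapter = x @ W0.T + scaling * (x @ A.T @ B.T)

# Merged: fold the low-rank update into a single weight matrix
W_merged = W0 + scaling * (B @ A)
y_merged = x @ W_merged.T

print(torch.allclose(y_adapter, y_merged, atol=1e-4))  # True
```

A merged model has no extra inference latency, but the small adapter file is then no longer separate from the base weights.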

In [None]:
var_names = [
    "chunk_filename_json", "model_name",
    "r", "lora_alpha", "lora_dropout", "target_modules", "bias", "task_type",
    "output_dir", "per_device_train_batch_size", "gradient_accumulation_steps",
    "learning_rate", "num_train_epochs", "save_strategy",
    "fp16", "bf16", "logging_steps", "report_to"
]

# Construct dictionary from current global variables
config = {var: globals()[var] for var in var_names}

# Save to JSON
with open(f"{output_dir}/experiment_config.json", "w") as f:
    json.dump(config, f, indent=4)

In [None]:
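# Evaluate the loss of the fine-tuned model on the test set (same arguments and collator as for the base model below)
trainer = Trainer(model=model, args=training_args, data_collator=data_collator)
eval_results = trainer.evaluate(test_dataset)
perplexity = math.exp(eval_results["eval_loss"])  # perplexity = exp(mean cross-entropy loss)
print(f"Perplexity: {perplexity:.2f}")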

Perplexity: 11.50
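`exp(eval_loss)` is a convenient estimate, but to our understanding `eval_loss` is an average of per-batch mean losses rather than a per-token average, so it can differ from the corpus-level perplexity when batches contain different numbers of real (non-padding) tokens. A sketch with hypothetical numbers comparing both; the helpers are illustrative only.

```python
import math


def corpus_perplexity(nll_sums, token_counts):
    """exp(total negative log-likelihood / total number of predicted tokens)."""
    return math.exp(sum(nll_sums) / sum(token_counts))


def mean_of_batch_perplexity(nll_sums, token_counts):
    """exp of the unweighted mean of per-batch mean losses (an approximation)."""
    batch_means = [s / n for s, n in zip(nll_sums, token_counts)]
    return math.exp(sum(batch_means) / len(batch_means))


# Hypothetical: a batch with many tokens and low loss, one with few tokens and high loss
nll_sums = [2.0 * 400, 3.0 * 40]  # summed token NLL per batch
token_counts = [400, 40]

print(round(corpus_perplexity(nll_sums, token_counts), 2))         # 8.09
print(round(mean_of_batch_perplexity(nll_sums, token_counts), 2))  # 12.18
```

For comparing the base and fine-tuned models on the same test set with the same settings, the approximation is usually adequate; the gap only matters when reporting absolute numbers.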


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
# get_peft_model() injected the adapters into original_model in place, so load a fresh copy of the base model as baseline
original_model_clean = AutoModelForCausalLM.from_pretrained(model_name).to(device)

trainer_orig = Trainer(
    model=original_model_clean,
    args=training_args,
    data_collator=data_collator,
    eval_dataset=test_dataset,
)

eval_results_orig = trainer_orig.evaluate()
orig_perplexity = math.exp(eval_results_orig["eval_loss"])
print(f"Original Model Perplexity: {orig_perplexity:.2f}")

Original Model Perplexity: 13.97


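Perplexity on the held-out test paragraphs decreased from 13.97 (base model) to 11.50 (LoRA fine-tuned model) :) __BUT__ keep in mind that perplexity alone is not sufficient for evaluating an LLM: it only measures how well the model predicts the next token of this corpus. Most likely, the performance needs to be evaluated on a downstream task.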
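For a downstream evaluation, generated text is often compared to reference text with overlap metrics such as ROUGE or BLEU (`rouge_scorer` and `sentence_bleu` are already imported at the top). To make clear what ROUGE-L measures, here is a plain-Python sketch based on the longest common subsequence of whitespace tokens; it omits the stemming and tokenization details of the `rouge_score` package, and the example sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if tok_a == tok_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, prediction: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    lcs = lcs_length(ref, pred) if ref and pred else 0
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


ref = "system 1 operates automatically and quickly"
pred = "system 1 works quickly and automatically"
print(round(rouge_l_f1(ref, pred), 3))  # 0.5 (LCS of 3 tokens out of 6 on each side)
```

Because word order matters for the LCS, swapping "quickly" and "automatically" already halves the score, even though the meaning is nearly unchanged; surface-overlap metrics share this limitation with perplexity.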