# Introduction to Fine-tuning LLMs using Hugging Face
> Workshop by Tree at #Cosin2025!

In the rapidly evolving field of NLP and LLMs, Hugging Face 🤗 provides access to state-of-the-art AI models and makes it easier for developers, researchers, and organizations to integrate advanced NLP capabilities into their applications.

The most important parts of the HF ecosystem include: 

- 🤗 Model Hub – A repository of thousands of open-source ML models.
- 🤗 Datasets – A vast collection of high-quality NLP datasets.
- 🤗 Tokenizers – Efficient and customizable tokenization methods for text processing.
- 🤗 Accelerate – Tools for optimizing and scaling model training.
- 🤗 PEFT (Parameter-Efficient Fine-Tuning) – Techniques like LoRA for efficient model adaptation.
- 🤗 Spaces – A platform to deploy and share AI applications.

and more. 

In this notebook we want to provide you with a quickstart and a basic overview of how to use the ecosystem. We will load a model, learn how to fine-tune it using Parameter-Efficient Fine-Tuning (PEFT) methods, and evaluate the result.

This notebook is influenced by the following Hugging Face notebook: https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/quicktour.ipynb 

## 0. Setup

Before you begin, make sure you have all the necessary libraries installed:

In [None]:
# !pip install transformers bitsandbytes accelerate peft datasets torch torchinfo matplotlib pandas

Some functionality is also available with a TensorFlow backend, but in this notebook we focus on the PyTorch backend.

## 1. Quickstart

### 1.1 Pipeline

The pipeline() is the easiest and fastest way to use a pretrained model for inference. You can use the pipeline() out-of-the-box for many tasks across different modalities, some of which are shown in the table below:

| **Task**                     | **Description**                                                                                              | **Modality**    | **Pipeline identifier**                       |
|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------|
| Text classification          | assign a label to a given sequence of text                                                                   | NLP             | pipeline(task="sentiment-analysis")           |
| Text generation              | generate text given a prompt                                                                                 | NLP             | pipeline(task="text-generation")              |
| Summarization                | generate a summary of a sequence of text or document                                                         | NLP             | pipeline(task="summarization")                |
| Image classification         | assign a label to an image                                                                                   | Computer vision | pipeline(task="image-classification")         |
| Image segmentation           | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | pipeline(task="image-segmentation")           |
| Object detection             | predict the bounding boxes and classes of objects in an image                                                | Computer vision | pipeline(task="object-detection")             |
| Audio classification         | assign a label to some audio data                                                                            | Audio           | pipeline(task="audio-classification")         |
| Automatic speech recognition | transcribe speech into text                                                                                  | Audio           | pipeline(task="automatic-speech-recognition") |
| Visual question answering    | answer a question about the image, given an image and a question                                             | Multimodal      | pipeline(task="vqa")                          |
| Document question answering  | answer a question about a document, given an image and a question                                            | Multimodal      | pipeline(task="document-question-answering")  |
| Image captioning             | generate a caption for a given image                                                                         | Multimodal      | pipeline(task="image-to-text")                |

**Note:** For a complete list of available tasks, check out the [pipeline API reference](https://huggingface.co/docs/transformers/main/en/./main_classes/pipelines).

Start by creating a pipeline() for a task, here sentiment analysis. The pipeline() downloads and caches a default pretrained model and tokenizer for that task. Afterwards you can use the `classifier` on your target text:

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

In [None]:
classifier("We are very happy to show you the 🤗 Transformers library.")

Depending on the task, the pipeline picks a default model. To find out which model was used, run the following:

In [None]:
print(classifier.model.name_or_path)

As displayed in the table above, we can use the pipeline on a variety of tasks. 

In [None]:
from transformers import pipeline

text_generator = pipeline("text-generation")

In [None]:
text_generator("Cosin is a conference at ", max_length=50)

#### TASK

Use the pipeline function to generate text using the Pythia 70M model. 

1. Load the model using `AutoModelForCausalLM` from `transformers`.
2. Load the tokenizer using `AutoTokenizer` from `transformers`. 
3. Define the pipeline with the model and tokenizer.
4. Use the generator to generate `150` tokens after the given prompt.
5. Print out the generated text.

**Note:** With 70M parameters, this Pythia model is quite small. For comparison, the Llama 3.1 models start at 8B parameters: 
- Llama 3.1 8B
- Llama 3.1 70B
- Llama 3.1 405B

To make this notebook accessible we use small models. If you want, feel free to change the models to bigger ones from HF. 

In [None]:
prompt = "The best thing about the CCC-CH is:\n"
model_name = "EleutherAI/pythia-70m"

### IMPLEMENT YOUR SOLUTION HERE ###

### 1.2 AutoClass

Under the hood, the `AutoModelForCausalLM` and `AutoTokenizer` classes work together to power the `pipeline` you used above. An `AutoClass` is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate `AutoClass` for your task and its associated preprocessing class. 

#### AutoTokenizer

The `AutoTokenizer` for example loads the correct tokenizer for the given model and we can observe the encodings for different models:

In [None]:
from transformers import AutoTokenizer

pythia = "EleutherAI/pythia-70m"
gpt2 = "gpt2"
falcon_7b = "tiiuae/falcon-7b"

pythia_tokenizer = AutoTokenizer.from_pretrained(pythia)
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2)
falcon_7b_tokenizer = AutoTokenizer.from_pretrained(falcon_7b)


text = "This text will be encoded by different tokenizers to show the differences in tokenization."

print(f"Pythia tokenizer: {pythia_tokenizer(text)}")
print(f"GPT-2 tokenizer: {gpt2_tokenizer(text)}")
print(f"Falcon-7B tokenizer: {falcon_7b_tokenizer(text)}")

The tokenizer returns a dictionary containing:
- `input_ids`: numerical representations of your tokens.
- `attention_mask`: indicates which tokens should be attended to.


It is important to use the correct tokenizer per model. The `AutoClass` helps us to load the correct tokenizer. 
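
Why the matching tokenizer matters can be illustrated with a toy sketch. The two word-level vocabularies below are hypothetical and far simpler than the subword vocabularies real models use. The point is only that the same text maps to different `input_ids` under different vocabularies, so ids produced by one tokenizer are meaningless to a model trained with another.

```python
# Toy illustration only: real tokenizers work on subwords, not whole words.
# Both vocabularies are made up for this example.
vocab_a = {"<unk>": 0, "this": 1, "text": 2, "will": 3, "be": 4, "encoded": 5}
vocab_b = {"<unk>": 0, "encoded": 1, "be": 2, "will": 3, "text": 4, "this": 5}


def toy_encode(text, vocab):
    """Map each lowercase word to its id; unknown words map to <unk>."""
    input_ids = [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]
    # Without padding every token is a real token, so the mask is all ones.
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


sample = "This text will be encoded"
encoded_a = toy_encode(sample, vocab_a)
encoded_b = toy_encode(sample, vocab_b)
print(encoded_a)
print(encoded_b)
```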

#### Auto Model

Apart from the pipeline function, we can also load different models using the `AutoModel` classes. Similar to the `AutoTokenizer`, we only need to provide the model name or HF path. When generating text, we use the `AutoModelForCausalLM` class to load LLMs. Causal language modeling refers to predicting the next token, one at a time, using only past tokens. 

In [None]:
from transformers import AutoModelForCausalLM

model_name = "EleutherAI/pythia-70m"
model = AutoModelForCausalLM.from_pretrained(model_name)
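
Next-token prediction can be sketched without any model at all. The lookup table below is a hypothetical stand-in for a language model: it maps the last token to one fixed next token, whereas a real causal LM scores the whole vocabulary given the full prefix. The loop shows the general idea of autoregressive generation, appending one token at a time based only on tokens that already exist.

```python
# Hypothetical stand-in for a causal LM: given the tokens so far,
# return the next token. A real model would score the whole vocabulary.
NEXT_TOKEN = {
    "a": "hackerspace",
    "hackerspace": "is",
    "is": "a",
}


def toy_next_token(prefix):
    """Predict the next token from past tokens only (here: just the last one)."""
    return NEXT_TOKEN.get(prefix[-1], "<eos>")


def toy_generate(prompt_tokens, max_new_tokens):
    """Greedy autoregressive loop: append one predicted token per step."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)  # sees only tokens produced so far
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


generated = toy_generate(["a", "hackerspace"], max_new_tokens=4)
print(" ".join(generated))
```

The `max_new_tokens` argument plays the same role as in `generate()`: it bounds how many tokens are appended to the prompt.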

After loading we can inspect the model. 

**Note:** Pythia models are built on the GPT-NeoX architecture, which is why `GPTNeoX` appears in the class names below.

In [None]:
# uncomment to print the model and its configuration
# print(model)
# print(model.config)

## alternative use torchinfo and adjust the depth to see the model architecture
from torchinfo import summary
summary(model, depth=3)

To use the model, we again need the tokenizer we defined earlier. The following example shows how to generate outputs for a list of prompts:

In [None]:
prompt = ["A hackerspace is",
          "Access Granted!"]


# we need to define a padding token for the tokenizer to append to the prompts
pythia_tokenizer.pad_token = pythia_tokenizer.eos_token

# pad on the left so that generation continues directly after the prompt
pythia_tokenizer.padding_side = "left"

tokenized_batch = pythia_tokenizer(
    prompt,
    padding=True,
    truncation=True,  # prompts that are too long get truncated
    max_length=512,  # upper bound so the model is not confronted with overly long inputs
    return_tensors="pt",  # return PyTorch tensors as encoding
)
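
The effect of `padding_side = "left"` can be sketched in plain Python. The token ids and the pad id `0` below are made up, and the helper only imitates what a tokenizer does when it pads a batch. With left padding the real tokens end at the last position of every row, which is where a decoder-only model continues generating.

```python
PAD_ID = 0  # hypothetical pad id, chosen only for this illustration


def pad_batch(sequences, padding_side="left"):
    """Pad lists of token ids to equal length and build the attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [PAD_ID] * (max_len - len(seq))
        mask_pad = [0] * len(pad)   # 0 = padding, ignored by attention
        mask_real = [1] * len(seq)  # 1 = real token
        if padding_side == "left":
            input_ids.append(pad + seq)
            attention_mask.append(mask_pad + mask_real)
        else:
            input_ids.append(seq + pad)
            attention_mask.append(mask_real + mask_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = pad_batch([[11, 12, 13, 14], [21, 22]], padding_side="left")
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```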

Now we can give the encoded prompts to our model:

In [None]:
output = model.generate(**tokenized_batch, max_new_tokens=150)

After decoding we get back the generated texts: 

In [None]:
generated_texts = pythia_tokenizer.batch_decode(
    output, skip_special_tokens=True)
generated_texts

### 1.3 Saving Models

After fine-tuning (which we cover later), you will want to save the model so that it can be loaded again. For that we need a target directory and the `PreTrainedModel.save_pretrained()` method. 
To save the model properly, we also save the tokenizer. This allows us to use the model without an internet connection.

In [None]:
save_dir = "./my_saved_pythia_model"
pythia_tokenizer.save_pretrained(save_dir)
model.save_pretrained(save_dir)

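This creates a folder with roughly the following structure (recent versions of 🤗 Transformers store the weights as `model.safetensors` instead of `pytorch_model.bin`):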
```
my_saved_pythia_model/
│── config.json
│── pytorch_model.bin
│── special_tokens_map.json
│── tokenizer_config.json
│── tokenizer.json
```

Now we can load the model from our saved files:

In [None]:
save_dir = "./my_saved_pythia_model"

loaded_tokenizer = AutoTokenizer.from_pretrained(save_dir)
loaded_model = AutoModelForCausalLM.from_pretrained(save_dir)

## 2. Training

In the following we will observe how HF uses the PyTorch backend to train a model.

All models are a standard [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) so you can use them in any typical training loop. While you can write your own training loop, 🤗 Transformers provides a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class for PyTorch, which contains the basic training loop and adds additional functionality for features like distributed training, mixed precision, and more.

Depending on your task, you'll typically pass the following parameters to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer):

1. A [PreTrainedModel](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel) or a [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module):




In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

2. [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) contains the model hyperparameters you can change like learning rate, batch size, and the number of epochs to train for. The default values are used if you don't specify any training arguments:

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="path/to/save/folder/",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
)

3. A preprocessing class like a tokenizer:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

# set the padding token and pad on the left side
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

4. Load a dataset:

In [None]:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

for split in ["train", "test", "validation"]:
    print(f"{split} size:", len(dataset[split]))

5. Create a function to tokenize the dataset and apply it over the entire dataset with [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map):

In [None]:
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize_dataset, batched=True)


6. A [DataCollatorForLanguageModeling](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling) to create a batch of examples from your dataset:

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
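
With `mlm=False` the collator prepares batches for causal language modeling. As far as the current implementation goes, the labels are a copy of `input_ids` in which padding positions are replaced by `-100`, the index that the PyTorch cross-entropy loss ignores by default. The shift by one position (token *t* predicts token *t+1*) happens inside the model, not in the collator. The plain-Python sketch below imitates that labeling rule; the token ids and the pad id are made up.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss
PAD_ID = 0           # hypothetical pad id for this illustration


def make_causal_lm_labels(input_ids, pad_id=PAD_ID):
    """Copy the inputs as labels and mask out padding positions."""
    return [
        [IGNORE_INDEX if token == pad_id else token for token in row]
        for row in input_ids
    ]


batch_input_ids = [[5, 6, 7, 8], [9, 10, 0, 0]]
labels = make_causal_lm_labels(batch_input_ids)
print(labels)
```

Because this notebook reuses the EOS token as the padding token, end-of-sequence positions are masked by the same rule.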

Now gather all these classes in [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer):

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # recent transformers releases rename this argument to processing_class
    data_collator=data_collator,
)

When you're ready, call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to start training. This might take a while. Feel free to skip this step. We will learn faster training methods afterwards.

In [None]:
trainer.train()

Even on such a small dataset, standard full fine-tuning takes a long time, so you might want to skip it. Since not everyone is able to train the full model on their hardware, there are other methods to fine-tune the model more efficiently. 

## 3. Parameter-Efficient Fine-Tuning (PEFT)


PEFT offers parameter-efficient methods for finetuning large pretrained models. The traditional paradigm is to finetune all of a model’s parameters for each downstream task, but this is becoming exceedingly costly and impractical because of the enormous number of parameters in models today. Instead, it is more efficient to train a smaller number of prompt parameters or use a reparametrization method like low-rank adaptation (LoRA) to reduce the number of trainable parameters.

In this notebook we focus on low-rank adaptation (LoRA), but there are many more approaches and techniques in the realm of PEFT. The following graphic displays a categorization and an overview of the PEFT methods. Keep in mind that it comes from a 2023 paper by Lialin et al. called "[Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.15647)". Since the paper was published, even more methods and approaches have appeared. 

![](https://cdn-uploads.huggingface.co/production/uploads/666b9ef5e6c60b6fc4156675/dz0AdSqt4QP7iRjpiXDE1.png)

### 3.1 Understanding LoRA

Low-Rank Adaptation (LoRA) is a technique used to fine-tune large language models efficiently by reducing the number of trainable parameters. The core idea is to approximate the weight updates using low-rank matrices:

**0. Neural Network definition**

In a standard neural network layer, the transformation of an input vector $x$ is performed using a weight matrix $W$ and a bias vector $b$:

$$
y = W x + b
$$

where:
- $x \in \mathbb{R}^{k}$ is the input feature vector,
- $W \in \mathbb{R}^{d \times k}$ is the weight matrix that maps the input to an output of dimension $d$,
- $b \in \mathbb{R}^{d}$ is the bias vector, and
- $y \in \mathbb{R}^{d}$ is the output of the layer.

This operation is repeated in every layer of a deep neural network. During training, the weight matrix $W$ is updated using gradient-based optimization to minimize a loss function.

**1. Original Weight Matrix**

Let $ W \in \mathbb{R}^{d \times k} $ be the original weight matrix of a neural network layer, where $ k $ is the input dimension and $ d $ is the output dimension.

**2. Low-Rank Decomposition**

Instead of updating the entire weight matrix $ W $, LoRA decomposes the update into two smaller matrices:

$$
\Delta W = A B
$$

where $ A \in \mathbb{R}^{d \times r} $ and $ B \in \mathbb{R}^{r \times k} $. Here, $ r $ is the rank of the decomposition, and $ r \ll \min(d, k) $.

**3. Updated Weight Matrix**

The updated weight matrix $ W' $ is then given by:

$$
W' = W + \Delta W = W + A B
$$

**4. Training**

During training, only the matrices $ A $ and $ B $ are updated, while the original weight matrix $ W $ remains fixed. This significantly reduces the number of trainable parameters from $ d \times k $ to $ r \times (d + k) $.
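
As a worked example of this reduction, take a square projection with $d = k = 768$ and rank $r = 8$. The sizes are an assumption chosen for illustration (768 is a common hidden size for base-sized transformers).

```python
def full_update_params(d, k):
    """Number of parameters in a dense update matrix of shape d x k."""
    return d * k


def lora_update_params(d, k, r):
    """Number of parameters in A (d x r) plus B (r x k)."""
    return r * (d + k)


d, k, r = 768, 768, 8
full = full_update_params(d, k)
lora = lora_update_params(d, k, r)
print(f"full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({lora / full:.2%} of full)")
```

For these sizes the LoRA update needs 12,288 parameters instead of 589,824, which is roughly 2% of the dense update.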

**5. Forward Pass**

During the forward pass, the input $ x $ is transformed using the updated weight matrix:

$$
y = W' x = (W + A B) x
$$

This can be computed efficiently by first computing $ B x $ and then $ A (B x) $.

By using the low-rank matrices $ A $ and $ B $, the number of trainable parameters is reduced, which lowers the computational and memory overhead of fine-tuning large models. This makes LoRA a practical approach for adapting pre-trained models to specific tasks.

```
          Embedding h                    
               ▲                         
               |                         
        +------+------+                  
        |             |                  
        |      +      |                  
        |             |                  
        ▲             ▲                  
+-----------------+   +----------------+ 
|  Pretrained     |   |  Weight Update | 
|  Weights  W     |   |  ΔW            | 
| (Frozen)        |   | (Trainable)    | 
+-----------------+   +----------------+ 
        ▲             ▲                  
        |             |                  
        +------+------+                  
               |                         
               ▲                         
            Inputs x                     
```
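
The two paths in the diagram can be checked numerically. The sketch below uses tiny hand-picked matrices (all values are made up) and plain Python lists instead of tensors. It follows the notation above with $W \in \mathbb{R}^{d \times k}$, $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$, and confirms that $W x + A (B x)$ equals $(W + A B) x$.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in left]


def matadd(left, right):
    """Add two matrices of the same shape element-wise."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(left, right)]


# Toy sizes: d = 2, k = 3, r = 1 (made-up values)
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]   # frozen pretrained weights, shape d x k
A = [[1.0],
     [2.0]]             # trainable, shape d x r
B = [[0.5, 0.0, 1.0]]   # trainable, shape r x k
x = [1.0, 2.0, 3.0]

# Efficient path: never materialise the d x k matrix A B.
frozen_out = matvec(W, x)
lora_out = matvec(A, matvec(B, x))
y_efficient = [base + delta for base, delta in zip(frozen_out, lora_out)]

# Reference path: merge the update into the weights first.
y_merged = matvec(matadd(W, matmul(A, B)), x)

print(y_efficient, y_merged)
```

Keeping the two paths separate during training is what allows $W$ to stay frozen; merging $A B$ into $W$ is an option once training has finished.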

### 3.2 Using LoRA

Now let's dive into a practical example. For this, we will fine-tune the `distilbert-base-uncased` model from Hugging Face on the IMDB dataset for sequence classification.

For this we need to import everything we will use:

**NOTE: Restart the notebook kernel at this point to make sure no models or other objects are left in RAM.**

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
from torchinfo import summary

Then we define our model name and dataset name. You can switch them to different ones here, if you like to experiment with that.

In [None]:
# model
model_name = "distilbert-base-uncased"

# dataset
dataset_name = "imdb"

Then we load the dataset and shuffle it. We set a seed so that the shuffle produces the same order every time. 

In [None]:
# Load dataset
dataset = load_dataset(dataset_name)
dataset = dataset.shuffle(seed=42)

#### TASK

Inspect the dataset: 
1. Print the dataset structure
2. Print two examples of the dataset with the sentences and the labels. 
3. Use matplotlib to plot the distribution of sentiment labels. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
### IMPLEMENT YOUR SOLUTION HERE ###

Now we load the tokenizer for our model using the `AutoTokenizer` from HF. After that we tokenize the dataset.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

The IMDB dataset already comes with a train and a test split. We first look at the tokenized dataset and then select both splits:

In [None]:
tokenized_datasets

In [None]:
train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]


# to reduce training time you can use the following to select only a subset of the dataset
TRAIN_SUBSET_SIZE = 500
TEST_SUBSET_SIZE = 100
SEED = 42

train_dataset = train_dataset.shuffle(seed=SEED).select(range(TRAIN_SUBSET_SIZE))
test_dataset = test_dataset.shuffle(seed=SEED).select(range(TEST_SUBSET_SIZE))


In [None]:
from datasets import DatasetDict

# make a train validation split from the train_dataset
split_dataset = train_dataset.train_test_split(
    test_size=0.2,
    stratify_by_column="label",  # Ensure same label distribution
    seed=SEED  
)

train_dataset = split_dataset["train"]
val_dataset = split_dataset["test"]


Now we load the model using the `AutoModel` class `AutoModelForSequenceClassification`:

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

We can now observe the model architecture with the `summary` function from the `torchinfo` framework.

In [None]:
summary(model, depth=3)

Now we start defining our LoRA adapter. For this, we define a LoRA configuration. 

In the configuration we need to define the following:
- `task_type`: The task we want to train our model for.
- `r`: The rank of the LoRA adapter ($\Delta W = A B$). The larger the rank, the more trainable parameters we have. 
- `lora_alpha`: The scaling factor. It is applied together with the rank: $\Delta W = \frac{\alpha}{r} AB$
- `lora_dropout`: The amount of dropout applied in the adapter to make the training more robust. 
- `target_modules`: The layers we want to add our adapter to. In this case, we target the attention layers. 

After the configuration we add the adapter to the model. We can then compare the number of trainable parameters with the number of frozen (non-trainable) parameters.


In [None]:
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # Sequence classification
    r=8,                          # Rank
    lora_alpha=32,                # Scaling factor
    lora_dropout=0.1,             # Dropout
    target_modules=["q_lin", "v_lin"]  # Target attention layers for LoRA
)

# you can experiment with different target modules
# target_modules=[
#     "q_lin", "k_lin", "v_lin",  # Self-attention projections
#     "out_lin",  # Output projection of self-attention
#     "ffn.lin1", "ffn.lin2"  # Feed-forward network layers
# ]

# add adapter to the model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()
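
The number reported by `print_trainable_parameters()` can be roughly anticipated by hand. The sketch below assumes the published DistilBERT-base shape (6 transformer layers, hidden size 768) and counts only the LoRA matrices. For `TaskType.SEQ_CLS` the classification head is typically kept trainable as well, so the printed total will be larger than this estimate.

```python
# Assumed DistilBERT-base dimensions (not read from the model object).
num_layers = 6
hidden_size = 768

# Values from the LoraConfig above.
r = 8
lora_alpha = 32
target_modules = ["q_lin", "v_lin"]

# Each targeted linear layer maps hidden_size -> hidden_size, so its
# adapter holds two matrices: hidden_size x r and r x hidden_size.
params_per_module = r * (hidden_size + hidden_size)
lora_params = num_layers * len(target_modules) * params_per_module

scaling = lora_alpha / r  # factor applied to the low-rank update A B

print(f"LoRA parameters: {lora_params:,}")
print(f"scaling alpha / r: {scaling}")
```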

After that, we define the training arguments and the trainer. 

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./lora_imdb",
    save_strategy="epoch",
    eval_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir="./logs",
    push_to_hub=False
)

# Define trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer
)

After all that preparation we can now train the adapter on the model.

In [None]:
# Train model
trainer.train()


In [None]:
# Save model
trainer.save_model("./lora_sentiment_model")

Now we can plot the loss curves per epoch from our training to observe potential overfitting:

In [None]:
import matplotlib.pyplot as plt

# Extract training logs
log_history = trainer.state.log_history

# Prepare lists to store epoch-wise metrics
train_loss = []
eval_loss = []
epochs = []

for log in log_history:
    if 'loss' in log and 'epoch' in log:
        train_loss.append(log['loss'])
        epochs.append(log['epoch'])
    elif 'eval_loss' in log and 'epoch' in log:
        eval_loss.append(log['eval_loss'])

# Plot losses
plt.figure(figsize=(10, 5))
plt.plot(epochs[:len(train_loss)], train_loss, label="Training Loss")
plt.plot(epochs[:len(eval_loss)], eval_loss, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training and Validation Loss per Epoch")
plt.legend()
plt.grid(True)
plt.show()

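After saving, we can check how well the model performs on our test dataset using the Hugging Face `Trainer.evaluate()` method. Since we did not pass a `compute_metrics` function to the trainer, the result contains the evaluation loss and runtime statistics: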

In [None]:
results = trainer.evaluate(
    eval_dataset = test_dataset)
print(results)

#### TASK

Use the trained model to classify the following example sentences.

1. Load the model and tokenizer from the save directory.
2. Create a pipeline.
3. Classify the example sentences. 
4. Print out the results and format the output. 


The output per example should look something like this: 
```
--------------------------------------------------
Text: I am neutral
Sentiment: Negative
Confidence: 0.5093
```
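
The confidence in this output is a probability, while a classification model itself returns raw logits. A minimal sketch of the conversion, assuming the usual softmax over the two class logits (the logit values below are made up):

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


labels = ["Negative", "Positive"]  # IMDB convention: 0 = negative, 1 = positive
logits = [0.3, 0.1]                # hypothetical model output for one sentence

probabilities = softmax(logits)
best = max(range(len(labels)), key=lambda i: probabilities[i])
print(f"Sentiment: {labels[best]}")
print(f"Confidence: {probabilities[best]:.4f}")
```

A confidence close to 0.5, as in the example output, means the two logits were almost equal and the model is effectively undecided.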

In [None]:
examples_to_classify = [
    "I love this", 
    "I hate this", 
    "I am neutral", 
    "This movie was really bad. I would not recommend it.", 
    "\n",
    "This movie was really bad. I would not recommend it.",
    "I loved this movie! The acting was amazing.",
    "This is no art-house film, it's mainstream entertainment. <br /><br />Lot's of beautiful people, t&a, and action. I found it very entertaining. It's not supposed to be intellectually stimulating, it's a fun film to watch! Jesse and Chace are funny too, which is just gravy. Definitely worth a rental.<br /><br />So in summary, I'd recommend checking it out for a little Friday night entertainment with the boys or even your girl (if she likes to see other girls get it on!)<br /><br />The villains are good too. Vinnie, Corey Large, the hatian guy from Heroes. Very nasty villains.",
    "This film seemed way too long even at only 75 minutes. The problem with jungle horror films is that there is always way too much footage of people walking (through the jungle, up a rocky cliff, near a river or lake) to pad out the running time. The film is worth seeing for the laughable and naked native zombie with big bulging, bloody eyes which is always accompanied on the soundtrack with heavy breathing and lots of reverb. Eurotrash fans will be plenty entertained by the bad English dubbing, gratuitous female flesh and very silly makeup jobs on the monster and native extras. For a zombie/cannibal flick this was pretty light on the gore but then I probably didn't see an uncut version.",
]

In [None]:
### IMPLEMENT YOUR SOLUTION HERE ###

### 3.3 Instruction-tuning an LLM

After learning the basics, we now want to use LoRA to fine-tune an LLM via instruction tuning. 

Wikipedia definition of Instruction-tuning: 
> "Using "self-instruct" approaches, LLMs have been able to bootstrap correct responses, replacing any naive responses, starting from human-generated corrections of a few cases. For example, in the instruction "Write an essay about the main themes represented in Hamlet," an initial naive completion might be "If you submit the essay after March 17, your grade will be reduced by 10% for each day of delay," based on the frequency of this textual sequence in the corpus." [LLMs definition Wikipedia](https://en.wikipedia.org/wiki/Large_language_model)

**NOTE: Restart the notebook kernel at this point to make sure no models or other objects are left in RAM.**

In [None]:
# Imports
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, TaskType

In [None]:
# model -> https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# dataset -> https://huggingface.co/datasets/tatsu-lab/alpaca
dataset_name = "tatsu-lab/alpaca"

# set a seed
SEED = 42

In [None]:
# Load dataset
dataset = load_dataset(dataset_name)

dataset = dataset.shuffle(seed=SEED)

train_dataset = dataset["train"]

For instruction tuning, we need to build instruction prompts from the dataset:

In [None]:
def format_alpaca(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example['input']:
        prompt += f"### Input:\n{example['input']}\n\n"
    prompt += f"### Response:\n{example['output']}"
    return {"text": prompt}
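
To see what the formatter produces, it can be applied to hand-written records. The two records below are invented for illustration and do not come from the Alpaca dataset. The function is repeated so that the snippet runs on its own.

```python
def format_alpaca(example):
    """Build one training prompt from an Alpaca-style record."""
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example['input']:
        prompt += f"### Input:\n{example['input']}\n\n"
    prompt += f"### Response:\n{example['output']}"
    return {"text": prompt}


# Hypothetical records with the same fields as the dataset.
with_input = {"instruction": "Translate to German.", "input": "Good morning", "output": "Guten Morgen"}
without_input = {"instruction": "Name a primary color.", "input": "", "output": "Red"}

print(format_alpaca(with_input)["text"])
print("-" * 20)
print(format_alpaca(without_input)["text"])
```

Records with an empty `input` field skip the `### Input:` section entirely, so the model sees two slightly different prompt layouts during training.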

In [None]:
# using the map function to apply the instruction format to our dataset
dataset = dataset.map(format_alpaca)

In [None]:
# inspect the first sample from the dataset
print(dataset["train"][0]["text"])

In [None]:
split_dataset = dataset["train"].train_test_split(test_size=0.05, seed=SEED)

After creating the dataset, we use the tokenizer to convert the text into tokens that the LLM can process. 

In [None]:
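tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # make sure a pad token exists for padding="max_length"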

In [None]:
def tokenize(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=512)

In [None]:
tokenized_dataset = split_dataset.map(tokenize, batched=True)
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

In [None]:
train_dataset=tokenized_dataset["train"]
test_dataset=tokenized_dataset["test"]

# reducing the dataset size
TRAIN_SUBSET_SIZE = 50
TEST_SUBSET_SIZE = 10

train_dataset = train_dataset.shuffle(seed=SEED).select(range(TRAIN_SUBSET_SIZE))
test_dataset = test_dataset.shuffle(seed=SEED).select(range(TEST_SUBSET_SIZE))

Now we load the actual model and define the LoRA adapter.

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto"
)

In [None]:
lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],  # query and value projections of the attention layers
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

In [None]:
# apply the adapter to the transformer
model = get_peft_model(model, lora_config)

# observe the trainable params
model.print_trainable_parameters()

In [None]:
# defining the Data Collator -> https://huggingface.co/docs/transformers/main_classes/data_collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [None]:
training_args = TrainingArguments(
    output_dir="./my_tinyllama_lora_alpaca",
    save_strategy="steps",
    logging_strategy="steps",
    # gradient_accumulation_steps=2,
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=10,
    num_train_epochs=3,
    fp16=False,
    save_total_limit=2,
    weight_decay=0.01,
    logging_dir="./logs",
    push_to_hub=False,
)
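
How often the trainer logs follows from these settings and the subset size chosen earlier. A back-of-the-envelope calculation, assuming a single device, no gradient accumulation, and that the last incomplete batch is kept:

```python
import math

train_samples = 50   # TRAIN_SUBSET_SIZE from above
batch_size = 4       # per_device_train_batch_size
num_epochs = 3       # num_train_epochs
logging_steps = 10

steps_per_epoch = math.ceil(train_samples / batch_size)  # last batch is smaller
total_steps = steps_per_epoch * num_epochs
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
print(f"loss is logged at steps {logged_at}")
```

With only a handful of logged values the loss curve will be coarse. Lowering `logging_steps` or enlarging the training subset gives a smoother picture.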


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
trainer.train()

In [None]:
# Save model
trainer.save_model("./my_tinyllama_lora_alpaca_model")

In [None]:
import matplotlib.pyplot as plt

# Extract training logs
log_history = trainer.state.log_history

# Prepare lists to store the logged training loss
train_loss = []
epochs = []

for log in log_history:
    if 'loss' in log and 'epoch' in log:
        train_loss.append(log['loss'])
        epochs.append(log['epoch'])

# Plot the training loss (no evaluation was run during training)
plt.figure(figsize=(10, 5))
plt.plot(epochs, train_loss, label="Training Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training Loss")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
results = trainer.evaluate(
    eval_dataset = test_dataset)
print(results)

### 3.4 Using your own model

Since we only saved the LoRA adapter to disk, we need to load the base model first.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel
import torch

In [None]:
# model -> https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# set a seed
SEED = 42

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load your LoRA adapter
model = PeftModel.from_pretrained(base_model, "./my_tinyllama_lora_alpaca_model")
model.eval()

Depending on your hardware, we move the model to the GPU or keep it on the CPU. 

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

In [None]:
# use the same instruction format as during fine-tuning
prompt = "### Instruction:\nExplain the theory of relativity in simple terms.\n\n### Response:\n"

In [None]:
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id
    )

In [None]:
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated response:")
print(generated_text)
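
The sampling arguments used above can be illustrated on a toy distribution. The sketch below is a simplified re-implementation for intuition, not the library's code: `temperature` rescales the logits before the softmax, and `top_p` keeps the smallest set of most likely tokens whose cumulative probability reaches the threshold. Tokens and logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probabilities, top_p):
    """Return indices of the smallest high-probability set with mass >= top_p."""
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probabilities[index]
        if cumulative >= top_p:
            break
    return kept


tokens = ["space", "time", "gravity", "banana"]
logits = [2.0, 1.5, 1.0, -3.0]  # hypothetical next-token scores

probs = softmax_with_temperature(logits, temperature=0.7)
kept = top_p_filter(probs, top_p=0.9)
print([tokens[i] for i in kept])
```

In an actual sampler the probabilities of the kept tokens would be renormalised and one token drawn at random from them. A temperature below 1 sharpens the distribution, so fewer tokens survive the `top_p` cut.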

Congrats, you are now able to fine-tune your own models. Feel free to dive in, there is a lot to learn!