# Start finetuning

First I tried the [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra) model from Hugging Face, an open-source conversational Dutch LLM. With 7B parameters, however, it is too large to run locally; the cells below record that attempt, which fails with `RuntimeError: Invalid buffer size: 64.00 GB`. The finetuning is now done on [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2), a much smaller model (124M parameters) trained for extractive question answering. I will try to finetune it on the iBestuur dataset.
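To make "too large to run locally" concrete, here is a rough back-of-the-envelope estimate. It is only a sketch: the parameter counts are rounded (7 billion and 124 million), weights are assumed to be stored in float32 (4 bytes each), and full finetuning with Adam is assumed to need about four copies of the weights (weights, gradients, and two optimizer moments). Activations are ignored, so real usage would be higher.

```python
# Rough memory estimate; all constants are assumptions, not measurements.
BYTES_FP32 = 4    # bytes per parameter in float32
ADAM_COPIES = 4   # weights + gradients + two Adam moment buffers (rule of thumb)

def weights_gb(n_params, bytes_per_param=BYTES_FP32):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def full_finetune_gb(n_params):
    """Approximate memory for full finetuning with Adam, excluding activations."""
    return ADAM_COPIES * weights_gb(n_params)

# Rounded parameter counts (assumed for illustration)
models = {"GEITje-7B-ultra": 7_000_000_000, "roberta-base-squad2": 124_000_000}
for name, n in models.items():
    print(f"{name}: {weights_gb(n):.1f} GB weights, ~{full_finetune_gb(n):.1f} GB to finetune")
```

Under these assumptions the 7B model needs tens of GB for the weights alone, while the 124M model stays in the low single digits even for full finetuning.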

#### 0. Imports

In [1]:
import torch
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM, AutoConfig, TrainingArguments, Trainer
import locale

In [2]:
#locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

#### 1. Loading the Data

In [3]:
path = '../Data/iBestuur/ibestuur_articles.csv'
dataset = load_dataset('csv', data_files=path)

In [4]:
dataset = dataset['train']

#### 2. Preprocessing the Data

In [5]:
model_name = 'BramVanroy/GEITje-7B-ultra'
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [6]:
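def tokenize_function(examples):
    tokens = tokenizer(examples['content'], padding="max_length", truncation=True)
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM loss needs labels; the model shifts them internally
    return tokens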

In [7]:
# Splitting the dataset into training and test sets
split_datasets = dataset.train_test_split(test_size=0.1)  # Adjust the test_size as needed

# Applying the tokenization function to both splits
tokenized_datasets = DatasetDict({
    'train': split_datasets['train'].map(tokenize_function, batched=True),
    'test': split_datasets['test'].map(tokenize_function, batched=True)
})

Map:   0%|          | 0/4785 [00:00<?, ? examples/s]

Map:   0%|          | 0/532 [00:00<?, ? examples/s]
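Because every article is padded to a fixed length, many positions in each example are padding. When such examples are used to train a causal language model, the `labels` sequence is usually a copy of `input_ids` with the padded positions set to -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows this in plain Python; the token ids, the pad id 0, and the helper name `make_labels` are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def make_labels(input_ids, attention_mask):
    """Copy input_ids into labels, masking padded positions (attention_mask == 0)."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# Hypothetical example: three real tokens followed by two padding tokens (pad id 0)
example = {"input_ids": [11, 42, 7, 0, 0], "attention_mask": [1, 1, 1, 0, 0]}
example["labels"] = make_labels(example["input_ids"], example["attention_mask"])
print(example["labels"])  # [11, 42, 7, -100, -100]
```

A function like this could be applied with `map` after tokenization, so that padding does not contribute to the training loss.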

#### 3. Move model and data to GPU

In [8]:
# device = torch.device("cpu")
device = torch.device("mps")

# Passing 0.0 removes the upper limit on MPS memory allocations
# (be careful: this may cause system failure if memory runs out)
torch.mps.set_per_process_memory_fraction(0.0)

In [9]:
model = AutoModelForCausalLM.from_pretrained(model_name)  # same checkpoint as the tokenizer

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [11]:
model = model.to(device)

#### 4. Fine-tuning the Model with LoRA

In [13]:
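# Setting lora attributes on an AutoConfig does not change the model; LoRA adapters
# are attached with the separate peft library (assumed installed: pip install peft).
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")  # same rank and scaling factor as before
model = get_peft_model(model, lora_config)  # wraps the model so only the adapter weights are trained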
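To get a feel for what a rank of 16 and a scaling factor of 32 mean, the sketch below counts the weights a LoRA adapter adds to a single square weight matrix. It assumes the standard LoRA formulation (the update is the product of two low-rank matrices, scaled by alpha / r) and a hidden size of 4096, which is used here only as an illustrative value; `lora_extra_params` is a hypothetical helper.

```python
def lora_extra_params(d_in, d_out, r):
    """Weights added by LoRA for one (d_out x d_in) matrix.

    The update is B @ A with A of shape (r x d_in) and B of shape (d_out x r).
    """
    return r * (d_in + d_out)

r, alpha = 16, 32
hidden = 4096  # assumed hidden size, for illustration only

full = hidden * hidden                        # weights in the frozen matrix
extra = lora_extra_params(hidden, hidden, r)  # trainable adapter weights
scaling = alpha / r                           # factor applied to the low-rank update

print(f"full matrix:  {full:,} weights")
print(f"LoRA adapter: {extra:,} weights ({100 * extra / full:.2f}% of the matrix)")
print(f"update scaling alpha/r = {scaling}")
```

With these numbers the adapter trains well under 1% of the weights of each adapted matrix, which is the reason LoRA reduces the memory needed for gradients and optimizer state.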

In [14]:
training_args = TrainingArguments(
    output_dir="./results",           # Where to store the final model
    num_train_epochs=3,               # Number of training epochs
    per_device_train_batch_size=8,    # Batch size for training
    per_device_eval_batch_size=8,     # Batch size for evaluation
    gradient_accumulation_steps=8,    # Accumulate gradients over 8 batches (effective batch size 64) before each optimizer step
    warmup_steps=500,                 # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,                # Strength of weight decay
    logging_dir="./logs",             # Directory for storing logs
    evaluation_strategy="epoch",      # Evaluate the model at the end of each epoch
)
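As a sanity check on these settings, the arithmetic below estimates the effective batch size and the number of optimizer steps for the 4785 training examples shown in the Map output above. It is a sketch that assumes a single device and rounds up partial batches; the exact step count depends on how the `Trainer` version in use rounds.

```python
import math

train_examples = 4785   # size of the training split shown in the Map output
per_device_batch = 8
grad_accum = 8
epochs = 3
warmup_steps = 500
num_devices = 1         # assumption: a single MPS device

effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: about {steps_per_epoch} per epoch, {total_steps} in total")
print(f"warmup longer than training: {warmup_steps > total_steps}")
```

Under these assumptions the run takes roughly 225 optimizer steps, fewer than the 500 warmup steps, so the learning rate would still be ramping up when training ends. A smaller `warmup_steps` value may be worth considering.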

In [15]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)

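# DataLoaderConfiguration lives in accelerate.utils and must be imported before use.
# Note: this object is never passed to the Trainer, so it has no effect on training.
from accelerate.utils import DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)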


In [16]:
trainer.train()

RuntimeError: Invalid buffer size: 64.00 GB

#### 5. Validation and Results

In [1]:
eval_results = trainer.evaluate()
print(eval_results)

NameError: name 'trainer' is not defined

#### 6. Save the model

In [None]:
trainer.save_model("fine_tuned_GEITje_model")

#### 7. Use the model to generate text

In [None]:
generator = pipeline('text-generation', model="fine_tuned_GEITje_model", tokenizer=tokenizer)
print(generator("Digital transformation in municipalities", max_length=50))