<a href="https://colab.research.google.com/github/RobinSmits/Dutch-LLMs/blob/main/Qwen1_5_7B_Dutch_Chat_SFT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

This notebook fine-tunes a QLoRA adapter on top of [Qwen/Qwen1.5-7B-Chat](https://huggingface.co/Qwen/Qwen1.5-7B-Chat).

Officially, the Qwen1.5 models do not support Dutch. However, in my experiments I noticed that the Dutch chat quality (for the 7B and larger sizes) was comparable to, or maybe even better than, that of the Mistral models. Mistral does not officially support Dutch either, yet it has already served as the base for some interesting Dutch chat models created by Bram Vanroy and Edwin Rijgersberg.

This is basically my attempt to further fine-tune the Qwen1.5-7B-Chat model and optimize it for Dutch.

The dataset used is the Dutch chat dataset [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch), created by Bram Vanroy. Kudos to Bram for this dataset!

## Install and Import Modules

In [None]:
# Install Modules
!pip install -q accelerate==0.27.2
!pip install -q bitsandbytes==0.43.0
!pip install -q datasets==2.17.1
!pip install -q peft==0.9.0
!pip install -q transformers==4.38.2
!pip install -q trl==0.8.1

In [None]:
# Import Modules
from datasets import load_dataset, load_from_disk
from huggingface_hub import notebook_login
from peft import AutoPeftModelForCausalLM, prepare_model_for_kbit_training, LoraConfig, get_peft_model, TaskType
from transformers import (AutoTokenizer,
                          AutoModelForCausalLM,
                          BitsAndBytesConfig,
                          DataCollatorForLanguageModeling,
                          TrainingArguments)
import torch
from trl import SFTTrainer

# Set TF32 for A100
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

## Constants

In [None]:
# Set Name Constants
model_name = 'Qwen/Qwen1.5-7B-Chat'
hf_model_name = 'Qwen1.5-7B-Dutch-Chat-Sft'

## Connect Google Drive

In [None]:
# Mount Google Drive
import os
from google.colab import drive
drive.mount('/content/drive')

# Set Folder to use...
WORK_DIR = '/content/drive/My Drive/QwenDutch/'
os.makedirs(WORK_DIR, exist_ok = True)

Mounted at /content/drive


## HuggingFace Login

In [None]:
# HuggingFace Hub Login
notebook_login()

## Tokenizer

In [None]:
# Create Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Max Length
MAX_LEN = 2048

# Tokenizer Summary
print(tokenizer)


Qwen2TokenizerFast(name_or_path='Qwen/Qwen1.5-7B-Chat', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


## Create QLoRA Model based on Qwen1.5-7B-Chat

In [None]:
# Create Model
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map = "auto",
                                             quantization_config = BitsAndBytesConfig(load_in_4bit = True,
                                                                                      bnb_4bit_use_double_quant = True,
                                                                                      bnb_4bit_quant_type = 'nf4',
                                                                                      bnb_4bit_compute_dtype = torch.bfloat16))

# Set cache to False
model.config.use_cache = False

# Create LoRA config
lora_config = LoraConfig(r = 64,
                         lora_alpha = 16,
                         target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'gate_proj', 'down_proj'],
                         lora_dropout = 0.05,
                         bias = 'none',
                         task_type = TaskType.CAUSAL_LM)

# Prep for Training
model = prepare_model_for_kbit_training(model)

# Create LoRA Model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Show Model Summary
print(model)

trainable params: 159,907,840 || all params: 7,881,232,384 || trainable%: 2.0289699910972705
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(151936, 4096)
        (layers): ModuleList(
          (0-31): 32 x Qwen2DecoderLayer(
            (self_attn): Qwen2SdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
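
The trainable parameter count reported above can be reproduced by hand. The sketch below is only a sanity check: it takes the layer shapes printed in the model summaries of this notebook (4096 for the hidden size, 11008 for the MLP width, 32 decoder layers) and counts the `lora_A` and `lora_B` weights for each target module. The helper name `lora_params` is made up for this illustration.

```python
# Shapes (in_features, out_features) as printed in the model summaries.
HIDDEN = 4096
MLP = 11008
NUM_LAYERS = 32
RANK = 64  # LoraConfig(r = 64)

target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # lora_A maps in_features -> rank, lora_B maps rank -> out_features.
    # Both are bias-free Linear layers, so only the weight matrices count.
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS

print(f"{per_layer:,} LoRA parameters per decoder layer")
print(f"{total:,} LoRA parameters in total")
```

The total agrees with the `trainable params: 159,907,840` line printed by `print_trainable_parameters()`.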
 

## Modify Default Chat template to Dutch

In [None]:
# Modify Default Chat Template with Dutch System Message
tokenizer.chat_template = "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nJe bent een behulpzame AI assistent<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '\n'}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant\n' }}{% endif %}"

# Summary Chat Template
tokenizer.chat_template

"{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nJe bent een behulpzame AI assistent<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '\n'}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant\n' }}{% endif %}"
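
To make the Jinja template easier to follow, here is a plain Python sketch that mirrors its logic step by step. It is not what `apply_chat_template` runs internally; `render_chat` and `SYSTEM_NL` are names invented for this illustration, and the function assumes a non-empty message list.

```python
SYSTEM_NL = "Je bent een behulpzame AI assistent"


def render_chat(messages, add_generation_prompt=False):
    """Mirror of the chat template above, written as ordinary Python."""
    parts = []
    last = len(messages) - 1
    for i, message in enumerate(messages):
        # Prepend the Dutch system message when the conversation has none.
        if i == 0 and message["role"] != "system":
            parts.append(f"<|im_start|>system\n{SYSTEM_NL}<|im_end|>\n")
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}")
        # Every turn is closed, except the final one when no generation
        # prompt is requested.
        if i != last or add_generation_prompt:
            parts.append("<|im_end|>\n")
    # Open an assistant turn for generation if the last turn is not one already.
    if add_generation_prompt and messages[-1]["role"] != "assistant":
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chat([{"role": "user", "content": "Hallo"}], add_generation_prompt=True)
print(prompt)
```

One detail the sketch makes visible: with `add_generation_prompt` left at `False`, the final message is not followed by a closing `<|im_end|>`.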

## Test Dataset

We save the data to storage so that we can use the same data in the second run.

In [None]:
# Load Test Dataset
#test_data = load_dataset('BramVanroy/ultrachat_200k_dutch', split = 'test_sft')
#test_data = test_data.select(range(2048)).remove_columns(["prompt", "prompt_id"]) # Only part of the test data to limit the time spent on validation.

# Save for later use
#test_data.save_to_disk(f'{WORK_DIR}test_data')

# Load from disk
test_data = load_from_disk(f'{WORK_DIR}test_data')

# Summary
print(test_data)

Dataset({
    features: ['messages'],
    num_rows: 2048
})
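
The comment-out-and-reload pattern in the cell above can also be written as a small cache helper, so the same cell works in both the first and the second run. This is only a sketch: `load_or_create` is a hypothetical helper, not part of `datasets`, and the usage shown in the comments reuses the names from this notebook.

```python
import os


def load_or_create(path, create_fn, load_fn, save_fn):
    """Return the object cached at `path`; build and save it first if missing."""
    if os.path.exists(path):
        return load_fn(path)
    data = create_fn()
    save_fn(data, path)
    return data


# Hypothetical usage with the names from this notebook:
# test_data = load_or_create(
#     f'{WORK_DIR}test_data',
#     create_fn = lambda: load_dataset('BramVanroy/ultrachat_200k_dutch', split = 'test_sft')
#                         .select(range(2048))
#                         .remove_columns(["prompt", "prompt_id"]),
#     load_fn = load_from_disk,
#     save_fn = lambda ds, p: ds.save_to_disk(p))
```

Passing the load and save functions in as arguments keeps the helper independent of any particular library.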


## Training Dataset

The training dataset is split into 2 equal parts so that each part fits within the 24-hour maximum runtime of Google Colab Pro.
Both parts are saved to storage so that the second run uses exactly the same data.

The HuggingFace Trainer 'resume_from_checkpoint' option is used to continue training on the second part of the training data.

In [None]:
# Load Train Dataset
#train_data1 = load_dataset('BramVanroy/ultrachat_200k_dutch', split = 'train_sft[:50%]').remove_columns(["prompt", "prompt_id"])
#train_data2 = load_dataset('BramVanroy/ultrachat_200k_dutch', split = 'train_sft[-50%:]').remove_columns(["prompt", "prompt_id"])

# Save for later use
#train_data1.save_to_disk(f'{WORK_DIR}train_data1') # 733
#train_data2.save_to_disk(f'{WORK_DIR}train_data2') # 733

# Load from disk
train_data1 = load_from_disk(f'{WORK_DIR}train_data1')
train_data2 = load_from_disk(f'{WORK_DIR}train_data2')

# Summary
print(train_data1)
print(train_data2)

Dataset({
    features: ['messages'],
    num_rows: 96299
})
Dataset({
    features: ['messages'],
    num_rows: 96299
})


## Chat Template Example

In [None]:
# Show sample
item_data = train_data1[0]
chat_template_string = tokenizer.apply_chat_template(item_data["messages"], tokenize = False)
print(chat_template_string)

<|im_start|>system
Je bent een behulpzame AI assistent<|im_end|>
<|im_start|>user
Kan je mij vertellen welke versie van mijn website thema ik gebruik? Er staat iets over sectie-gebaseerde thema's zoals Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+. En ook, hoe zit het met die functie die toelaat het tweede productafbeelding te tonen wanneer ik erover hover? Is dat voor alle secties of enkel de genoemden?<|im_end|>
<|im_start|>assistant
Om te bepalen welke themaversie u gebruikt, moet u meestal in de instellingen van uw websiteachtergrond kijken. Deze kunnen vaak gevonden worden in de thema-editor of het dashboard waar informatie staat over het huidige thema. Wat betreft de functie om de secundaire afbeelding van een product te tonen bij het erover hoveren; dit is een instellingsmogelijkheid die vaak in de secties 'Collecties pagina's' en 'Uitgelichte Collecties' zich bevindt. Deze functie is doorgaans beperkt tot de secties die in het materiaal genoemd worden, te

## Train Model

In [None]:
# Set Steps
eval_steps = 146
save_steps = 146
logging_steps = 73

# Set TrainingArguments
# Note: when max_steps is set to a positive value it takes precedence over num_train_epochs.
training_args = TrainingArguments(num_train_epochs = 1,
                                  max_steps = 1466,
                                  learning_rate = 3.0e-4,
                                  lr_scheduler_type = 'cosine',
                                  evaluation_strategy = "steps",
                                  logging_steps = logging_steps,
                                  save_strategy = 'steps',
                                  eval_steps = eval_steps,
                                  save_steps = save_steps,
                                  save_total_limit = 1,
                                  per_device_train_batch_size = 2,
                                  per_device_eval_batch_size = 4,
                                  gradient_accumulation_steps = 32,
                                  gradient_checkpointing = True,
                                  gradient_checkpointing_kwargs = {'use_reentrant': False},
                                  warmup_ratio = 0.05,
                                  weight_decay = 0.01,
                                  bf16 = True,
                                  tf32 = True,
                                  output_dir = f'{WORK_DIR}{hf_model_name}',
                                  hub_model_id = hf_model_name,
                                  push_to_hub = True,
                                  hub_private_repo = True,
                                  optim = 'paged_adamw_8bit',
                                  report_to = 'tensorboard')

# Config SFTTrainer
# Second run: continue on the second half of the training data (train_data2).
trainer = SFTTrainer(model,
                     train_dataset = train_data2,
                     eval_dataset = test_data,
                     tokenizer = tokenizer,
                     packing = True,
                     eval_packing = False,
                     max_seq_length = MAX_LEN,
                     data_collator = DataCollatorForLanguageModeling(tokenizer, mlm = False),
                     args = training_args)

# Perform Training
trainer.train(resume_from_checkpoint = True)


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 876  | 1.2388        | 1.199769        |
| 1022 | 1.2246        | 1.18821         |
| 1168 | 1.211         | 1.180156        |
| 1314 | 1.204         | 1.176272        |
| 1460 | 1.2041        | 1.175572        |


TrainOutput(global_step=1466, training_loss=0.6111853034889877, metrics={'train_runtime': 51317.7302, 'train_samples_per_second': 1.828, 'train_steps_per_second': 0.029, 'total_flos': 8.368856319431541e+18, 'train_loss': 0.6111853034889877, 'epoch': 2.0})
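
The numbers in the `TrainOutput` can be cross-checked against the training arguments with some simple arithmetic. The sketch below assumes a single GPU, so the effective batch is `per_device_train_batch_size * gradient_accumulation_steps`. The last part is a guess rather than a verified fact: it assumes the run resumed at step 733 (half of `max_steps`) and that the reported `training_loss` spreads only the loss accumulated after the resume over all 1466 steps.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 32
max_seq_length = 2048
max_steps = 1466
train_runtime = 51317.7302  # seconds, from the TrainOutput above

# Effective batch per optimizer step (assumption: one GPU).
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps
# With packing, every sequence is filled up to max_seq_length tokens.
tokens_per_step = sequences_per_step * max_seq_length

samples_per_second = round(max_steps * sequences_per_step / train_runtime, 3)
steps_per_second = round(max_steps / train_runtime, 3)

# Assumption: the resume point was step 733 and the loss sum only covers
# the steps after it, while the divisor is the full global step count.
reported_training_loss = 0.6111853034889877
resumed_from_step = 733
implied_mean_loss = reported_training_loss * max_steps / (max_steps - resumed_from_step)

print(sequences_per_step, tokens_per_step)
print(samples_per_second, steps_per_second)
print(round(implied_mean_loss, 4))
```

Under these assumptions the throughput figures match the reported `train_samples_per_second` and `train_steps_per_second`, and the implied mean loss of about 1.22 is in line with the logged training losses, which would explain why the reported `training_loss` is roughly half of them.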

## Push to Hub

In [None]:
# Push tokenizer to hub
tokenizer.push_to_hub(hf_model_name, private = True)

# Push model to hub
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/robinsmits/Qwen1.5-7B-Dutch-Chat-Sft/commit/116810213ea3eb70828cc62d607742f3b36bc902', commit_message='End of training', commit_description='', oid='116810213ea3eb70828cc62d607742f3b36bc902', pr_url=None, pr_revision=None, pr_num=None)

## Merge Model and Push to Hub

In [None]:
# Set Name Constants
model_name = f'robinsmits/{hf_model_name}'
merged_model_name = f'{model_name}-Bf16'

# Summary
print(model_name)
print(merged_model_name)

robinsmits/Qwen1.5-7B-Dutch-Chat-Sft
robinsmits/Qwen1.5-7B-Dutch-Chat-Sft-Bf16


In [None]:
# Cleanup
del model, tokenizer

# Load from Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create Model - Dtype bfloat16
model = AutoPeftModelForCausalLM.from_pretrained(model_name,
                                                 torch_dtype = torch.bfloat16)

# Merge and Unload
model = model.merge_and_unload()

# Summary
print(model)

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151646, 4096)
    (layers): ModuleList(
      (0-31): 32 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (lm_head): L
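
As a rough illustration of what merging does to each adapted layer: the LoRA update is folded into the base weight as W + (lora_alpha / r) * B A, after which the adapter matrices are no longer needed. The toy example below uses plain Python lists and invented 2x3 matrices with a toy rank of 1. Only the scaling factor (16 / 64) comes from the `LoraConfig` in this notebook, and dropout is ignored because it is inactive at inference time.

```python
def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    # Element-wise a + scale * b.
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


r, lora_alpha = 64, 16
scaling = lora_alpha / r  # 0.25, as in standard LoRA

# Toy values, chosen only for illustration.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight, shape (out, in)
A = [[0.5, -1.0, 2.0]]                   # lora_A, shape (toy rank, in)
B = [[2.0], [4.0]]                       # lora_B, shape (out, toy rank)
x = [[1.0], [2.0], [3.0]]                # input column vector

# Adapter kept separate: base output plus the scaled low-rank path.
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)

# Adapter merged: a single dense weight of the original shape.
W_merged = add(W, matmul(B, A), scale=scaling)
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Both paths give the same output, which is why the merged model can be used as a regular `Qwen2ForCausalLM` without PEFT.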

In [None]:
# Push To Hub
tokenizer.push_to_hub(merged_model_name, private = True)
model.push_to_hub(merged_model_name, private = True)

CommitInfo(commit_url='https://huggingface.co/robinsmits/Qwen1.5-7B-Dutch-Chat-Sft-Bf16/commit/99f984de64e580b5fe7625e39e1d8043ae489a4e', commit_message='Upload Qwen2ForCausalLM', commit_description='', oid='99f984de64e580b5fe7625e39e1d8043ae489a4e', pr_url=None, pr_revision=None, pr_num=None)