# **TULU-3 Fine-Tuning**
In this hands-on exercise, we fine-tune pre-trained models for downstream tasks using classical fine-tuning, a common way to establish a solid baseline for model specialization.

The goal of fine-tuning is to take a pre-trained model and adapt it to a specific task or dataset. By leveraging the knowledge and representations learned from a large-scale pre-training task, we can achieve better performance on downstream tasks with less training data.

In [1]:
from pathlib import Path
import os
import datasets
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM, AutoModelForSeq2SeqLM
from torch.utils.data import Dataset, DataLoader, IterableDataset
from datasets import load_dataset
import torch

from utils import sft_tulu_tokenize_and_truncate

from tqdm.notebook import trange, tqdm


DSDIR = Path(os.environ['DSDIR'])
os.environ['TOKENIZERS_PARALLELISM'] = 'false'


---

In [2]:
DSDIR = Path(os.environ["DSDIR"])
model_path = DSDIR / "HuggingFace_Models" / "Qwen/Qwen2.5-32B-Instruct"
dataset_path = DSDIR / "HuggingFace" / "allenai" / "tulu-3-sft-mixture" / "data" / "*.parquet"

In [3]:
dataset_path

PosixPath('/lustre/fsmisc/dataset/HuggingFace/allenai/tulu-3-sft-mixture/data/*.parquet')

In [4]:
dataset = load_dataset("parquet", data_files=str(dataset_path), split="train").shuffle()


In [5]:
tokenizer = AutoTokenizer.from_pretrained(str(model_path), padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id

In [6]:
row = dataset[2]
# tokenize=False returns the rendered conversation as a plain string,
# so the tensor, padding and truncation options are not needed here.
tokenizer.apply_chat_template(
    conversation=row["messages"],
    tokenize=False,
    add_generation_prompt=False,
)

'<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\ncuales son los 12 jefes mas dificiles de la historia<|im_end|>\n<|im_start|>assistant\nCalificar a los jefes más difíciles en la historia de los videojuegos puede ser bastante subjetivo, ya que depende de la habilidad del jugador, el género del videojuego, y el tipo de dificultad que se aprecie (puro desafío técnico, necesidad de estrategia, resistencia, etc.). Sin embargo, algunos jefes han ganado notoriedad por su dificultad a lo largo de los años. He aquí una lista que combina diversos videojuegos y sus infames jefes:\n\n1. Ornstein y Smough - "Dark Souls": Este dúo es famoso por su difícil combate en equipo, donde derrotar primero a uno cambia significativamente la dificultad del jefe restante.\n\n2. Nameless King - "Dark Souls III": Famoso por ser uno de los combates más desafiantes y técnicamente exigentes de la serie.\n\n3. Lingering Will - "Kingdom Hearts II Fi
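
The string above follows the ChatML layout: each turn is wrapped in `<|im_start|>{role}` and `<|im_end|>`, and a default system prompt is prepended when the conversation has none. The sketch below reproduces that layout by hand to make the structure explicit. `render_chatml` is a hypothetical helper, not the tokenizer's actual Jinja template, and it ignores features such as tool calls.

```python
# Default system prompt, copied from the rendered output above.
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=False):
    """Approximate the ChatML layout seen above (hypothetical helper)."""
    # Prepend the default system turn when the conversation has none.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    # For inference, open an assistant turn that the model will complete.
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]
rendered = render_chatml(example)
print(rendered)
```

For supervised fine-tuning data the assistant answer is already part of the conversation, which is why `add_generation_prompt=False` is used in the cell above.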

In [7]:
#sft_tulu_tokenize_and_truncate(row, tokenizer, 4096)

In [9]:
sample = dataset[1]

In [10]:
!mkdir -p datasets/tulu_3_sft_mixture


In [11]:
939343 // 16

58708

In [12]:
58708 * 16

939328
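
Assuming 939343 is the number of rows in the dataset, the integer division above leaves 939343 - 939328 = 15 rows out of the 16 shards. If every row should be kept, the remainder can be spread over the first shards. The following is a minimal sketch; `shard_bounds` is a hypothetical helper, not part of the notebook's `utils` module.

```python
def shard_bounds(n_rows, n_shards):
    """Return (start, end) index pairs covering all rows, sizes differing by at most 1."""
    base, extra = divmod(n_rows, n_shards)
    bounds = []
    start = 0
    for i in range(n_shards):
        # The first `extra` shards each take one additional row.
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


bounds = shard_bounds(939343, 16)
print(bounds[0], bounds[-1])
print("rows dropped by plain integer division:", 939343 - 58708 * 16)
```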

In [13]:
import json

# Write the dataset as 16 JSONL shards, one chat-formatted conversation per line
n_shards = 16
shard_size = 58708  # 939343 // 16, computed above
for i in trange(n_shards):
    with open(f"datasets/tulu_3_sft_mixture_{i}.jsonl", "w", encoding="utf-8") as f:
        for j in range(shard_size):
            sample = dataset[i * shard_size + j]
            # tokenize=False returns the rendered conversation as a plain string
            message = tokenizer.apply_chat_template(
                conversation=sample["messages"],
                tokenize=False,
                add_generation_prompt=False,
            )
            # ensure_ascii=False keeps accented characters readable in the file
            f.write(json.dumps({"text": message}, ensure_ascii=False) + "\n")

  0%|          | 0/16 [00:00<?, ?it/s]
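
Before handing the shards to the preprocessing job, it can be worth checking that every line parses as JSON and carries a string `text` field. The sketch below shows one way to do that on a small temporary file; `count_jsonl_records` is a hypothetical helper, and the demo rows are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def count_jsonl_records(path, key="text"):
    """Count records in a JSONL file, failing on the first malformed line."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            if not isinstance(record.get(key), str):
                raise ValueError(f"{path}:{line_no}: missing string field {key!r}")
            n += 1
    return n


# Demo on a throwaway file; in the notebook, point this at datasets/*.jsonl instead.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    rows = [{"text": "first\nconversation"}, {"text": "deuxième conversation"}]
    with open(demo, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    n_records = count_jsonl_records(demo)
    print(n_records)
```

Newlines inside a conversation are escaped by `json.dumps`, so each record still occupies exactly one line.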

## Nanotron

In [14]:
#!git clone https://github.com/huggingface/nanotron.git

Then:

```bash
cd nanotron
pip install --user --no-cache-dir -e . --no-deps
pip install --user --no-cache-dir dacite
pip install --user --no-cache-dir datatrove loguru --no-deps
```

You need a Python version lower than 3.12.

You also need to binarize your dataset into the nanotron format.
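
To illustrate what binarizing means here, the sketch below tokenizes a few documents, appends an end-of-sequence id to each, and writes all ids back to back as fixed-width integers while recording where each document ends. It is only a conceptual sketch: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for the real Qwen tokenizer, and the actual on-disk layout produced by datatrove may differ.

```python
import struct
import tempfile
from pathlib import Path

EOS_ID = 0  # hypothetical end-of-sequence id for the toy vocabulary


def toy_tokenize(text, vocab):
    """Hypothetical whitespace tokenizer: assigns a new id to each unseen word."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def binarize(docs, out_path):
    """Write all token ids as little-endian uint32 and return document end offsets."""
    vocab = {"<eos>": EOS_ID}
    doc_ends = []
    n_tokens = 0
    with open(out_path, "wb") as f:
        for doc in docs:
            ids = toy_tokenize(doc, vocab) + [EOS_ID]
            f.write(struct.pack(f"<{len(ids)}I", *ids))
            n_tokens += len(ids)
            doc_ends.append(n_tokens)
    return doc_ends


def read_tokens(path):
    """Read the flat token stream back from disk."""
    data = Path(path).read_bytes()
    return list(struct.unpack(f"<{len(data) // 4}I", data))


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "tokens.bin"
    doc_ends = binarize(["hello world", "hello again"], out)
    tokens = read_tokens(out)
    print(tokens, doc_ends)
```

Storing a flat token stream lets the training loop slice fixed-length sequences directly from the file without re-tokenizing text.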

In `preprocess_data.py`, I change:

```python
def main(args):
    # Build datatrove reader
    if args.readers == "hf":
        datatrove_reader = HuggingFaceDatasetReader(
            dataset=args.dataset,
            text_key=args.column,
            streaming=True,
            dataset_options={"split": args.split},
        )
```

and

```python
preprocess_executor = LocalPipelineExecutor(
        pipeline=[
            datatrove_reader,
            DocumentTokenizer(
                output_folder=args.output_folder,
                tokenizer_name_or_path=args.tokenizer_name_or_path,
                eos_token=args.eos_token,
                shuffle_documents=False,
                max_tokens_per_file=1e9,
            ),
        ],
        tasks=args.n_tasks,
        logging_dir=args.logging_dir,
    )
```

In [61]:
%%writefile preprocess_nanotron.slurm
#!/bin/bash
#SBATCH --job-name=nano_pp
#SBATCH --output=logs/out/nano_pp_%j.out 
#SBATCH --error=logs/err/nano_pp_%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --hint=nomultithread 
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=8
#SBATCH --partition=archive
#SBATCH --account=sos@h100

## load module 
module purge
module load arch/h100
module load pytorch-gpu/py3/2.4.0


## launch script on every task 
set -x
time srun python nanotron/tools/preprocess_data.py \
  --tokenizer-name-or-path Qwen/Qwen2.5-14B-Instruct \
  --output-folder datasets/tulu_3_sft_mixture \
  --n-tasks 16 \
  jsonl \
  --dataset datasets \
  --glob-pattern "*.jsonl"
date

Overwriting preprocess_nanotron.slurm
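
Slurm does not create missing directories for the `--output` and `--error` paths, so `logs/out` and `logs/err` must exist before the job is submitted. The sketch below extracts those directories from the script text and creates them. `log_dirs_from_sbatch` is a hypothetical helper, and the demo runs in a temporary directory; in the notebook the directories would be created relative to the submission directory.

```python
import re
import tempfile
from pathlib import Path


def log_dirs_from_sbatch(script_text):
    """Collect the parent directories of the --output/--error paths in a Slurm script."""
    dirs = set()
    for line in script_text.splitlines():
        match = re.match(r"#SBATCH\s+--(?:output|error)=(\S+)", line.strip())
        if match:
            dirs.add(Path(match.group(1)).parent)
    return sorted(dirs)


script = """#!/bin/bash
#SBATCH --job-name=nano_pp
#SBATCH --output=logs/out/nano_pp_%j.out
#SBATCH --error=logs/err/nano_pp_%j.err
"""

with tempfile.TemporaryDirectory() as tmp:
    created = []
    for d in log_dirs_from_sbatch(script):
        target = Path(tmp) / d
        target.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(target.is_dir())
    print(log_dirs_from_sbatch(script), created)
```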


In [62]:
!sbatch preprocess_nanotron.slurm

Submitted batch job 684543


In [68]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


In [67]:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("/lustre/fsmisc/dataset/HuggingFace_Models/Qwen/Qwen2.5-14B-Instruct")
print("tokenizer.vocab_size =", tok.vocab_size)
print("len(get_vocab())     =", len(tok.get_vocab()))
print("specials:", tok.all_special_tokens)

tokenizer.vocab_size = 151643
len(get_vocab())     = 151665
specials: ['<|im_end|>', '<|endoftext|>', '<|im_start|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']
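
The two sizes printed above differ because `vocab_size` counts only the base vocabulary, while `get_vocab()` also includes added tokens such as the special tokens listed. The short sketch below works out the difference and, as an assumption about how a binarized token stream would be stored, the integer width needed to hold every token id; `bytes_per_token` is a hypothetical helper.

```python
BASE_VOCAB = 151643  # tokenizer.vocab_size printed above
FULL_VOCAB = 151665  # len(tokenizer.get_vocab()) printed above


def bytes_per_token(vocab_size):
    """Smallest of 2 or 4 bytes able to hold every id in [0, vocab_size)."""
    return 2 if vocab_size <= 2**16 else 4


added_tokens = FULL_VOCAB - BASE_VOCAB
print("added tokens:", added_tokens)
print("bytes per token id:", bytes_per_token(FULL_VOCAB))
```

Since the vocabulary exceeds 65536 entries, 16-bit ids would overflow, so a token file for this tokenizer needs 32-bit ids.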


In [32]:
%pwd

'/lustre/fsmisc/dataset'

In [56]:
%%writefile preprocess_nanotron.slurm
#!/bin/bash
#SBATCH --job-name=nano_pp
#SBATCH --output=logs/out/nano_pp_%j.out 
#SBATCH --error=logs/err/nano_pp_%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --hint=nomultithread 
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=8
#SBATCH --partition=archive
#SBATCH --account=sos@a100


## load module 
module purge
module load arch/h100
module load pytorch-gpu/py3/2.4.0


## launch script on every task 
set -x
time srun python config_qwen.py --out config_qwen_TP4_PP16.yaml --dp 1 --tp 4 --pp 16 --mbs 8 --acc 16
date

Overwriting preprocess_nanotron.slurm
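
`config_qwen.py` is not shown in this notebook, so the following is only a sketch under the assumption that its flags carry their usual meaning: `--dp`, `--tp` and `--pp` are the data, tensor and pipeline parallel degrees, `--mbs` is the micro-batch size and `--acc` the number of gradient accumulation steps. Under that assumption, the requested layout implies the following GPU count and number of sequences per optimizer step; `parallel_layout` is a hypothetical helper.

```python
def parallel_layout(dp, tp, pp, mbs, acc):
    """Derive GPU count and sequences per optimizer step from the parallelism settings."""
    return {
        # One process (GPU) per combination of data, tensor and pipeline rank.
        "world_size": dp * tp * pp,
        # Each data-parallel replica processes mbs sequences, acc times per step.
        "sequences_per_step": dp * mbs * acc,
    }


layout = parallel_layout(dp=1, tp=4, pp=16, mbs=8, acc=16)
print(layout)
```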


In [57]:
!sbatch preprocess_nanotron.slurm

Submitted batch job 684477


In [58]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            684477   archive  nano_pp  ssos040 PD       0:00      1 (None)
