# 🎯 Fine-tuning Models with Hugging Face

**Author:** Praut s.r.o. - AI Integration & Business Automation

In this notebook we will cover:
- How fine-tuning works and when to use it
- Fine-tuning a classification model on custom data
- Fine-tuning a model for NER (Named Entity Recognition)
- Advanced techniques: LoRA and PEFT for efficient training
- Evaluating and deploying the fine-tuned model

## When to use fine-tuning?

| Situation | Solution |
|-----------|----------|
| General tasks (sentiment, NER) | Use a pretrained model |
| Domain-specific tasks | Fine-tune on your own data |
| Very specific terminology | Fine-tuning + custom tokenizer |
| Limited GPU resources | LoRA/PEFT fine-tuning |

In [None]:
# Install the required libraries (seqeval is needed for the NER metrics in section 2)
!pip install -q transformers datasets accelerate evaluate peft bitsandbytes scikit-learn seqeval

In [None]:
import torch
import numpy as np
import pandas as pd
from typing import Dict, List, Any, Optional
import json
import os

# Hugging Face libraries
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback
)
from datasets import Dataset, DatasetDict, load_dataset
import evaluate

# Check for a GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 1. Fine-tuning for Text Classification

We start with the most common case: classifying text into custom categories.

In [None]:
# Example: classifying customer requests
# We create sample data for an e-shop. The texts stay in Czech because
# the model fine-tuned below is a Czech model.

train_data = {
    "text": [
        # Complaints (reklamace)
        "Objednávka přišla poškozená, chci vrátit peníze",
        "Produkt nefunguje, požaduji výměnu",
        "Zboží neodpovídá popisu, chci reklamovat",
        "Balík dorazil rozbitý",
        "Výrobek je vadný hned po vybalení",
        "Chci uplatnit reklamaci na tento produkt",
        "Zboží má vadu, prosím o řešení",
        "Produkt přestal fungovat po týdnu používání",
        
        # Product questions (produkt)
        "Jaké jsou rozměry tohoto produktu?",
        "Je tento výrobek kompatibilní s iPhone?",
        "Máte tento produkt i v modré barvě?",
        "Kolik váží tato položka?",
        "Z jakého materiálu je vyrobeno?",
        "Jaká je kapacita baterie?",
        "Podporuje tento přístroj USB-C?",
        "Je možné koupit náhradní díly?",
        
        # Shipping questions (doprava)
        "Kdy dorazí moje objednávka?",
        "Lze doručit na Slovensko?",
        "Jaké jsou možnosti dopravy?",
        "Kolik stojí express doručení?",
        "Můžu si vyzvednout na pobočce?",
        "Jak dlouho trvá dodání?",
        "Posíláte i na poštu?",
        "Je možné doručení o víkendu?",
        
        # Payment (platba)
        "Lze platit kartou?",
        "Přijímáte platbu na fakturu?",
        "Mohu platit při převzetí?",
        "Jak funguje splátkový prodej?",
        "Je možná platba přes PayPal?",
        "Nabízíte firemní fakturu?",
        "Jaké platební metody akceptujete?",
        "Lze uplatnit dárkový poukaz?"
    ],
    "label": [
        0, 0, 0, 0, 0, 0, 0, 0,  # Complaints
        1, 1, 1, 1, 1, 1, 1, 1,  # Product questions
        2, 2, 2, 2, 2, 2, 2, 2,  # Shipping questions
        3, 3, 3, 3, 3, 3, 3, 3   # Payment
    ]
}

# Validation data
val_data = {
    "text": [
        "Produkt je poškozený, co mám dělat?",
        "Jaká je záruka na tento výrobek?",
        "Kde je teď můj balík?",
        "Akceptujete kryptoměny?",
        "Chci vrátit vadné zboží",
        "Máte to skladem?",
        "Doručujete do zahraničí?",
        "Mohu platit na splátky?"
    ],
    "label": [0, 1, 2, 3, 0, 1, 2, 3]
}

# Category mapping
label_names = ["reklamace", "produkt", "doprava", "platba"]
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {label: i for i, label in enumerate(label_names)}

print(f"Training samples: {len(train_data['text'])}")
print(f"Validation samples: {len(val_data['text'])}")
print(f"Categories: {label_names}")
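
Before training, it can help to confirm that every category has a similar number of examples. The following is a minimal sketch using only the standard library. The `labels` list is a stand-in that mirrors the layout above (8 examples per class), and `class_distribution` is a hypothetical helper, not part of any library.

```python
from collections import Counter
from typing import Dict, List

# Stand-in for train_data["label"]: 8 examples for each of the 4 classes.
labels = [0] * 8 + [1] * 8 + [2] * 8 + [3] * 8
label_names = ["reklamace", "produkt", "doprava", "platba"]


def class_distribution(labels: List[int], label_names: List[str]) -> Dict[str, int]:
    """Count how many examples each named class has."""
    counts = Counter(labels)
    return {name: counts.get(i, 0) for i, name in enumerate(label_names)}


distribution = class_distribution(labels, label_names)
for name, count in distribution.items():
    print(f"{name}: {count}")
```

If one class turns out to be much smaller than the others, collecting more examples or weighting the loss are the usual options.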

In [None]:
# Create a Hugging Face dataset
train_dataset = Dataset.from_dict(train_data)
val_dataset = Dataset.from_dict(val_data)

dataset = DatasetDict({
    "train": train_dataset,
    "validation": val_dataset
})

print(dataset)

In [None]:
# Load the model and tokenizer
# We use Small-E-Czech, a small ELECTRA model pretrained on Czech text
model_name = "Seznam/small-e-czech"  # Czech model from Seznam.cz

# Alternative for general multilingual use:
# model_name = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Model with a classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id
)

print(f"Model: {model_name}")
print(f"Number of parameters: {model.num_parameters():,}")

In [None]:
# Tokenize the data
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Remove columns that are no longer needed
tokenized_dataset = tokenized_dataset.remove_columns(["text"])
tokenized_dataset.set_format("torch")

print(tokenized_dataset)
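
To make the effect of `truncation=True` and `max_length` concrete, here is a simplified sketch of what padding and truncation do to a sequence of token ids. It is plain Python, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and real tokenizers also keep the closing special token when they truncate.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Cut a sequence to max_length, then pad it on the right with pad_id."""
    ids = token_ids[:max_length]
    attention_mask = [1] * len(ids)          # 1 = real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding, ignored by attention
    }


short = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
long = pad_or_truncate(list(range(10)), max_length=4)
print(short)
print(long)
```

The attention mask is what lets the model ignore the padded positions.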

In [None]:
# Metric definitions
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="weighted")
    
    return {
        "accuracy": accuracy["accuracy"],
        "f1": f1["f1"]
    }
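
The `average="weighted"` option computes F1 per class and averages the scores weighted by the number of true examples in each class. The sketch below reimplements that idea in plain Python for illustration only. `weighted_f1` is a hypothetical helper and the predictions are invented, so treat it as an explanation of the metric rather than a replacement for `evaluate`.

```python
from collections import Counter
from typing import List


def weighted_f1(predictions: List[int], references: List[int]) -> float:
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(references)
    total = len(references)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for p, r in zip(predictions, references) if p == cls and r == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n_true / total
    return score


references = [0, 1, 2, 3, 0, 1, 2, 3]
predictions = [0, 1, 2, 3, 0, 1, 2, 2]   # one "platba" example misclassified as "doprava"
print(f"weighted F1: {weighted_f1(predictions, references):.4f}")
```

One wrong prediction lowers the F1 of two classes here: the true class loses recall and the predicted class loses precision.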

In [None]:
# Training configuration
training_args = TrainingArguments(
    output_dir="./results/customer_classifier",
    
    # Training parameters
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    
    # Learning rate and scheduler
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=4,  # ~10% of the 40 optimizer steps (32 samples / batch of 8 x 10 epochs, one device)
    
    # Evaluation
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    
    # Logging
    logging_dir="./logs",
    logging_steps=10,
    
    # Optimization
    fp16=torch.cuda.is_available(),  # Mixed precision on GPU
    
    # Reproducibility
    seed=42,
)

# Data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    processing_class=tokenizer,  # transformers >= 4.46; older releases take tokenizer=tokenizer
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
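
With the Trainer's default linear scheduler, the learning rate rises linearly during warmup and then decays linearly to zero. The sketch below reproduces that shape in plain Python to show how the number of optimizer steps follows from dataset size, batch size and epochs. `linear_schedule_lr` is an illustrative helper, and a single device without gradient accumulation is assumed.

```python
import math


def linear_schedule_lr(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


num_samples, batch_size, epochs = 32, 8, 10
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")

# A short warmup reaches the peak early; a warmup longer than the run never does.
for warmup in (4, 50):
    highest = max(linear_schedule_lr(s, 2e-5, warmup, total_steps) for s in range(total_steps + 1))
    print(f"warmup_steps={warmup}: highest learning rate reached = {highest:.2e}")
```

On a dataset this small, the warmup length is worth checking against the total step count.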

In [None]:
# Start training
print("Starting training...")
train_result = trainer.train()

# Results
print("\n" + "="*50)
print("Training results:")
print(f"  Training loss: {train_result.training_loss:.4f}")
print(f"  Number of steps: {train_result.global_step}")
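
Because the trainer uses early stopping with a patience of 3, the number of steps reported above can be lower than the configured maximum. The following sketch shows the stopping rule in plain Python, roughly as it behaves for a metric where higher is better. `early_stopping_epoch` is a hypothetical helper and the F1 values are invented.

```python
from typing import List, Optional


def early_stopping_epoch(metric_per_epoch: List[float], patience: int = 3) -> Optional[int]:
    """Return the 1-based epoch at which training stops, or None if it runs to the end."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


f1_history = [0.50, 0.62, 0.70, 0.69, 0.70, 0.68, 0.71]
stop_epoch = early_stopping_epoch(f1_history, patience=3)
print(f"training stops after epoch: {stop_epoch}")
```

In this made-up history the best score so far is set in epoch 3, the next three epochs fail to beat it, and training stops before the later improvement in epoch 7 is ever seen.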

In [None]:
# Evaluate on the validation data
eval_results = trainer.evaluate()

print("Evaluation results:")
for key, value in eval_results.items():
    print(f"  {key}: {value:.4f}")

In [None]:
# Save the model
model_save_path = "./models/customer_classifier"
trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)

print(f"Model saved to: {model_save_path}")

In [None]:
# Test the fine-tuned model
from transformers import pipeline

# Load the saved model
classifier = pipeline(
    "text-classification",
    model=model_save_path,
    tokenizer=model_save_path,
    device=0 if torch.cuda.is_available() else -1
)

# Test messages (Czech)
test_messages = [
    "Produkt dorazil rozbitý, chci vrátit",
    "Jaké jsou parametry baterie?",
    "Kdy mi přijde balík?",
    "Můžu platit na dobírku?",
    "Zboží nefunguje správně",
    "Je skladem varianta XL?"
]

print("Testing the fine-tuned model:")
print("="*50)

for msg in test_messages:
    result = classifier(msg)[0]
    print(f"\nMessage: {msg}")
    print(f"Category: {result['label']} ({result['score']:.2%})")

## 2. Fine-tuning pro NER (Named Entity Recognition)

We will teach the model to recognize custom entities specific to our domain.

In [None]:
# NER data: recognizing entities in customer messages
# Format: BIO tagging (Beginning, Inside, Outside)

# Entity definitions
ner_labels = [
    "O",           # Outside: not part of any entity
    "B-PRODUCT",   # Beginning of product name
    "I-PRODUCT",   # Inside product name
    "B-ORDER",     # Order number
    "I-ORDER",
    "B-DATE",      # Date
    "I-DATE",
    "B-PRICE",     # Price
    "I-PRICE",
    "B-PERSON",    # Person name
    "I-PERSON"
]

ner_id2label = {i: label for i, label in enumerate(ner_labels)}
ner_label2id = {label: i for i, label in enumerate(ner_labels)}

print(f"NER labels: {ner_labels}")
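
BIO tags mark entities token by token, so a decoding step is needed to turn them back into whole entities. A minimal sketch of such a decoder follows. `bio_to_spans` is a hypothetical helper written for this illustration, and it takes the simplifying decision to ignore an `I-` tag that has no matching `B-` before it.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags into (entity_type, start, end_exclusive) spans."""
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    current: Optional[str] = None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and current == tag[2:]:
            continue  # still inside the open entity
        if current is not None:
            spans.append((current, start, i))  # any other tag closes the open entity
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tokens = ["Objednal", "jsem", "iPhone", "15", "Pro", "dne", "15", ".", "ledna", "."]
tags = ["O", "O", "B-PRODUCT", "I-PRODUCT", "I-PRODUCT", "O", "B-DATE", "I-DATE", "I-DATE", "O"]

spans = bio_to_spans(tags)
for entity_type, start, end in spans:
    print(entity_type, tokens[start:end])
```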

In [None]:
# Sample training data for NER
# Each example contains tokens and the corresponding NER tags

ner_train_data = [
    {
        "tokens": ["Objednal", "jsem", "iPhone", "15", "Pro", "dne", "15", ".", "ledna", "."],
        "ner_tags": [0, 0, 1, 2, 2, 0, 5, 6, 6, 0]  # O, O, B-PRODUCT, I-PRODUCT, I-PRODUCT, O, B-DATE, I-DATE, I-DATE, O
    },
    {
        "tokens": ["Objednávka", "číslo", "ORD", "-", "2024", "-", "001", "obsahuje", "vadný", "Samsung", "Galaxy", "."],
        "ner_tags": [0, 0, 3, 4, 4, 4, 4, 0, 0, 1, 2, 0]
    },
    {
        "tokens": ["Cena", "1999", "Kč", "je", "příliš", "vysoká", "."],
        "ner_tags": [0, 7, 8, 0, 0, 0, 0]
    },
    {
        "tokens": ["Kontaktujte", "prosím", "pana", "Nováka", "."],
        "ner_tags": [0, 0, 0, 9, 0]
    },
    {
        "tokens": ["MacBook", "Air", "M2", "za", "35000", "Kč", "objednán", "20", ".", "prosince", "."],
        "ner_tags": [1, 2, 2, 0, 7, 8, 0, 5, 6, 6, 0]
    },
    {
        "tokens": ["Paní", "Svobodová", "reklamuje", "PlayStation", "5", "."],
        "ner_tags": [0, 9, 0, 1, 2, 0]
    }
]

# Convert to a Dataset
ner_dataset = Dataset.from_dict({
    "tokens": [item["tokens"] for item in ner_train_data],
    "ner_tags": [item["ner_tags"] for item in ner_train_data]
})

print(f"NER training samples: {len(ner_dataset)}")

In [None]:
# Model for NER
ner_model_name = "bert-base-multilingual-cased"

ner_tokenizer = AutoTokenizer.from_pretrained(ner_model_name)
ner_model = AutoModelForTokenClassification.from_pretrained(
    ner_model_name,
    num_labels=len(ner_labels),
    id2label=ner_id2label,
    label2id=ner_label2id
)

print(f"NER model: {ner_model_name}")

In [None]:
# Tokenization for NER: labels must be aligned with the subword tokens
def tokenize_and_align_labels(examples):
    tokenized_inputs = ner_tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        padding="max_length",
        max_length=128
    )
    
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # Special and padding tokens are ignored by the loss
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])  # First subword keeps the word's label
            else:
                # Later subwords of the same word: turn B-X into I-X so that one word
                # does not yield several "beginning" tags (B tags have odd ids, I-X = B-X + 1)
                tag = label[word_idx]
                label_ids.append(tag + 1 if tag % 2 == 1 else tag)
            previous_word_idx = word_idx
            
        labels.append(label_ids)
    
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_ner_dataset = ner_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=ner_dataset.column_names
)

print(tokenized_ner_dataset)
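
Label alignment is easier to follow on a tiny hand-made example. The sketch below shows one common variant, in which continuation subwords receive the `I-` version of the word's tag. It assumes the label layout used in this notebook (`B-` tags have odd ids and the matching `I-` tag is the next id). The `word_ids` list is hypothetical and stands for a tokenizer splitting "PlayStation" into two subwords; `align_labels` is an illustrative helper.

```python
from typing import List, Optional


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to subword level; -100 marks positions the loss ignores."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                      # special token such as [CLS] or [SEP]
        elif word_id != previous:
            aligned.append(word_labels[word_id])      # first subword of a word
        else:
            label = word_labels[word_id]
            aligned.append(label + 1 if label % 2 == 1 else label)  # B-X becomes I-X
        previous = word_id
    return aligned


# Hypothetical split: [CLS] Play ##Station 5 . [SEP]
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [1, 2, 0]  # B-PRODUCT, I-PRODUCT, O

aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Another widely used variant assigns `-100` to continuation subwords, so that only the first subword of each word contributes to the loss.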

In [None]:
# NER metrics
seqeval = evaluate.load("seqeval")

def compute_ner_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=2)
    
    # Convert label ids to label strings
    true_predictions = [
        [ner_id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [ner_id2label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"]
    }
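
seqeval scores whole entities rather than individual tokens, so an entity with a wrong boundary counts as an error even if most of its tokens were tagged correctly. The following plain-Python sketch illustrates that idea for a single sentence. `entity_prf` is a hypothetical helper and the predicted spans are invented; it is not seqeval's implementation.

```python
from typing import Iterable, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end_exclusive)


def entity_prf(pred_spans: Iterable[Span], true_spans: Iterable[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 based on exact span matches."""
    pred, true = set(pred_spans), set(true_spans)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


true_spans = [("PRODUCT", 2, 5), ("DATE", 6, 9)]
pred_spans = [("PRODUCT", 2, 4), ("DATE", 6, 9)]  # the product name was cut one token short

precision, recall, f1 = entity_prf(pred_spans, true_spans)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```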

In [None]:
# Train the NER model (shortened for the demo; no evaluation set is passed,
# so compute_ner_metrics is not used in this run)
ner_training_args = TrainingArguments(
    output_dir="./results/ner_model",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=10,
    save_strategy="epoch",
    fp16=torch.cuda.is_available(),
    seed=42
)

ner_data_collator = DataCollatorForTokenClassification(tokenizer=ner_tokenizer)

ner_trainer = Trainer(
    model=ner_model,
    args=ner_training_args,
    train_dataset=tokenized_ner_dataset,
    processing_class=ner_tokenizer,  # transformers >= 4.46; older releases take tokenizer=ner_tokenizer
    data_collator=ner_data_collator,
)

# Training
print("Training the NER model...")
ner_trainer.train()

# Save
ner_save_path = "./models/custom_ner"
ner_trainer.save_model(ner_save_path)
ner_tokenizer.save_pretrained(ner_save_path)
print(f"NER model saved to: {ner_save_path}")

## 3. Efficient Fine-tuning with LoRA (PEFT)

LoRA (Low-Rank Adaptation) freezes the pretrained weights and trains small low-rank update matrices instead, which makes it possible to fine-tune large models with much lower memory requirements.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType, PeftModel

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Sequence classification
    r=8,                          # Rank: lower = fewer trainable parameters, higher = more capacity
    lora_alpha=32,                # Scaling factor
    lora_dropout=0.1,             # Dropout for regularization
    target_modules=["query", "value"],  # Which layers to adapt (attention query and value projections)
    bias="none"
)

print("LoRA configuration:")
print(f"  Rank (r): {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Target modules: {lora_config.target_modules}")

In [None]:
# Create the PEFT model
base_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id
)

# Apply LoRA
peft_model = get_peft_model(base_model, lora_config)

# Compare parameter counts
total_params = sum(p.numel() for p in peft_model.parameters())
trainable_params = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)

print(f"\nParameter comparison:")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"  Trainable share: {100 * trainable_params / total_params:.2f}%")
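
The small trainable share can be estimated by hand. For a weight matrix with `d_in` inputs and `d_out` outputs, LoRA of rank `r` adds two matrices with `r * (d_in + d_out)` parameters in total. The sketch below applies this to the configuration above, assuming the standard BERT-base shape (hidden size 768, 12 layers) and adapters on the query and value projections only. The classification head is counted separately, on the assumption that it stays trainable as well.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed BERT-base dimensions; adjust for other architectures.
hidden_size, num_layers, rank = 768, 12, 8
adapted_modules_per_layer = 2          # query and value
num_labels = 4

per_module = lora_param_count(hidden_size, hidden_size, rank)
lora_total = per_module * adapted_modules_per_layer * num_layers
classifier_head = hidden_size * num_labels + num_labels   # weights + biases

print(f"LoRA parameters per adapted matrix: {per_module:,}")
print(f"LoRA parameters in total:           {lora_total:,}")
print(f"Classification head parameters:     {classifier_head:,}")
print(f"Full query/value matrix for comparison: {hidden_size * hidden_size:,}")
```

Each adapter is about 2% of the size of the matrix it modifies, which is where most of the memory savings come from.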

In [None]:
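# Training with LoRA: usually faster and lighter on memory than full fine-tuning.
# The LoRA base model is mBERT, so the texts are tokenized again with the mBERT
# tokenizer (the earlier tokenized_dataset uses the Small-E-Czech vocabulary).
lora_tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def lora_tokenize(examples):
    return lora_tokenizer(examples["text"], truncation=True, max_length=128)

lora_dataset = DatasetDict({
    "train": Dataset.from_dict(train_data),
    "validation": Dataset.from_dict(val_data)
}).map(lora_tokenize, batched=True, remove_columns=["text"])

peft_training_args = TrainingArguments(
    output_dir="./results/lora_classifier",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-4,  # Higher learning rate for LoRA
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    logging_steps=10,
    fp16=torch.cuda.is_available(),
    seed=42
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=lora_dataset["train"],
    eval_dataset=lora_dataset["validation"],
    processing_class=lora_tokenizer,  # transformers >= 4.46; older releases take tokenizer=lora_tokenizer
    data_collator=DataCollatorWithPadding(tokenizer=lora_tokenizer),
    compute_metrics=compute_metrics
)

print("Training the LoRA model...")
peft_trainer.train()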

In [None]:
# Save the LoRA adapters (small files compared with the full model)
lora_save_path = "./models/lora_classifier"
peft_model.save_pretrained(lora_save_path)

# Check the size on disk
import os
lora_size = sum(
    os.path.getsize(os.path.join(lora_save_path, f))
    for f in os.listdir(lora_save_path)
    if os.path.isfile(os.path.join(lora_save_path, f))
)

print(f"LoRA adapters saved to: {lora_save_path}")
print(f"Size of the LoRA adapters: {lora_size / 1024:.1f} KB")

In [None]:
# Load and use the LoRA model
def load_lora_model(base_model_name: str, lora_path: str, num_labels: int):
    """Load the base model together with the LoRA adapters."""
    # Load the base model
    base = AutoModelForSequenceClassification.from_pretrained(
        base_model_name,
        num_labels=num_labels
    )
    
    # Load the LoRA adapters
    model = PeftModel.from_pretrained(base, lora_path)
    
    return model

# Test
loaded_model = load_lora_model(
    "bert-base-multilingual-cased",
    lora_save_path,
    len(label_names)
)

print("LoRA model loaded successfully!")

## 4. Complete Fine-tuning Pipeline

We will build a reusable class for fine-tuning.

In [None]:
class FineTuningPipeline:
    """
    Complete pipeline for fine-tuning classification models.
    Supports both standard fine-tuning and LoRA.
    """
    
    def __init__(
        self,
        base_model: str = "bert-base-multilingual-cased",
        use_lora: bool = True,
        lora_r: int = 8,
        lora_alpha: int = 32
    ):
        self.base_model = base_model
        self.use_lora = use_lora
        self.lora_r = lora_r
        self.lora_alpha = lora_alpha
        
        self.tokenizer = None
        self.model = None
        self.trainer = None
        self.label_names = None
        
    def prepare_data(
        self,
        train_texts: List[str],
        train_labels: List[int],
        val_texts: List[str],
        val_labels: List[int],
        label_names: List[str],
        max_length: int = 128
    ) -> DatasetDict:
        """Prepare the data for training."""
        self.label_names = label_names
        self.id2label = {i: l for i, l in enumerate(label_names)}
        self.label2id = {l: i for i, l in enumerate(label_names)}
        
        # Tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(self.base_model)
        
        # Create the datasets
        train_dataset = Dataset.from_dict({"text": train_texts, "label": train_labels})
        val_dataset = Dataset.from_dict({"text": val_texts, "label": val_labels})
        
        # Tokenization
        def tokenize(examples):
            return self.tokenizer(
                examples["text"],
                padding="max_length",
                truncation=True,
                max_length=max_length
            )
        
        train_tokenized = train_dataset.map(tokenize, batched=True, remove_columns=["text"])
        val_tokenized = val_dataset.map(tokenize, batched=True, remove_columns=["text"])
        
        train_tokenized.set_format("torch")
        val_tokenized.set_format("torch")
        
        return DatasetDict({"train": train_tokenized, "validation": val_tokenized})
    
    def setup_model(self):
        """Set up the model for training."""
        # Base model
        self.model = AutoModelForSequenceClassification.from_pretrained(
            self.base_model,
            num_labels=len(self.label_names),
            id2label=self.id2label,
            label2id=self.label2id
        )
        
        # Apply LoRA if enabled
        if self.use_lora:
            from peft import LoraConfig, get_peft_model, TaskType
            
            lora_config = LoraConfig(
                task_type=TaskType.SEQ_CLS,
                r=self.lora_r,
                lora_alpha=self.lora_alpha,
                lora_dropout=0.1,
                target_modules=["query", "value"]
            )
            
            self.model = get_peft_model(self.model, lora_config)
            print(f"LoRA applied (r={self.lora_r}, alpha={self.lora_alpha})")
        
        trainable = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in self.model.parameters())
        print(f"Trainable parameters: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
    
    def train(
        self,
        dataset: DatasetDict,
        output_dir: str,
        epochs: int = 10,
        batch_size: int = 8,
        learning_rate: float = 2e-5
    ) -> Dict[str, Any]:
        """Run model training."""
        
        # Metrics
        accuracy_metric = evaluate.load("accuracy")
        f1_metric = evaluate.load("f1")
        
        def compute_metrics(eval_pred):
            logits, labels = eval_pred
            predictions = np.argmax(logits, axis=-1)
            return {
                "accuracy": accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"],
                "f1": f1_metric.compute(predictions=predictions, references=labels, average="weighted")["f1"]
            }
        
        # Training arguments
        lr = learning_rate if not self.use_lora else learning_rate * 5  # Higher LR for LoRA
        
        training_args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=epochs,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            learning_rate=lr,
            weight_decay=0.01,
            warmup_steps=50,
            eval_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="f1",
            logging_steps=10,
            fp16=torch.cuda.is_available(),
            seed=42
        )
        
        # Trainer
        self.trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=dataset["train"],
            eval_dataset=dataset["validation"],
            processing_class=self.tokenizer,  # transformers >= 4.46; older releases take tokenizer=
            data_collator=DataCollatorWithPadding(tokenizer=self.tokenizer),
            compute_metrics=compute_metrics,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
        )
        
        # Training
        print("\nStarting training...")
        train_result = self.trainer.train()
        
        # Evaluation
        eval_result = self.trainer.evaluate()
        
        return {
            "training_loss": train_result.training_loss,
            "eval_accuracy": eval_result["eval_accuracy"],
            "eval_f1": eval_result["eval_f1"]
        }
    
    def save(self, path: str):
        """Save the model and tokenizer."""
        self.trainer.save_model(path)
        self.tokenizer.save_pretrained(path)
        
        # Save the pipeline configuration
        config = {
            "base_model": self.base_model,
            "use_lora": self.use_lora,
            "label_names": self.label_names
        }
        with open(f"{path}/pipeline_config.json", "w", encoding="utf-8") as f:
            json.dump(config, f, indent=2)
        
        print(f"Model saved to: {path}")
    
    def predict(self, texts: List[str]) -> List[Dict[str, Any]]:
        """Predict labels for new texts."""
        self.model.eval()
        results = []
        
        for text in texts:
            inputs = self.tokenizer(
                text,
                return_tensors="pt",
                truncation=True,
                max_length=128
            ).to(self.model.device)
            
            with torch.no_grad():
                outputs = self.model(**inputs)
                probs = torch.softmax(outputs.logits, dim=-1)[0]
                pred_idx = probs.argmax().item()
            
            results.append({
                "text": text,
                "label": self.label_names[pred_idx],
                "confidence": probs[pred_idx].item(),
                "all_scores": {l: probs[i].item() for i, l in enumerate(self.label_names)}
            })
        
        return results
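
The `predict` method turns raw logits into probabilities with a softmax and reports the highest one as the confidence. A minimal plain-Python sketch of that step follows. The logits are invented, and `softmax` here is an illustrative reimplementation rather than the `torch.softmax` call used in the class.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


label_names = ["reklamace", "produkt", "doprava", "platba"]
logits = [2.0, 0.5, 0.1, -1.0]  # hypothetical model output for one message

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

print(f"label: {label_names[best]}, confidence: {probs[best]:.2%}")
for name, p in zip(label_names, probs):
    print(f"  {name}: {p:.3f}")
```

A softmax score is not a calibrated probability, so any confidence threshold used in production should be tuned on held-out data.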

In [None]:
# Using the pipeline class
pipeline = FineTuningPipeline(
    base_model="bert-base-multilingual-cased",
    use_lora=True,
    lora_r=8
)

# Prepare the data
dataset = pipeline.prepare_data(
    train_texts=train_data["text"],
    train_labels=train_data["label"],
    val_texts=val_data["text"],
    val_labels=val_data["label"],
    label_names=label_names
)

# Set up the model
pipeline.setup_model()

# Training
results = pipeline.train(
    dataset=dataset,
    output_dir="./results/pipeline_model",
    epochs=10,
    batch_size=8
)

print("\n" + "="*50)
print("Results:")
print(f"  Accuracy: {results['eval_accuracy']:.2%}")
print(f"  F1 Score: {results['eval_f1']:.2%}")

In [None]:
# Test the pipeline
test_texts = [
    "Zboží je vadné, co s tím?",
    "Jaké jsou rozměry produktu?",
    "Kdy dorazí zásilka?",
    "Přijímáte Apple Pay?"
]

predictions = pipeline.predict(test_texts)

print("Predictions:")
print("="*50)
for pred in predictions:
    print(f"\nText: {pred['text']}")
    print(f"Category: {pred['label']} ({pred['confidence']:.2%})")

In [None]:
# Save the complete pipeline
pipeline.save("./models/complete_pipeline")

## 5. Tips for Production Fine-tuning

### Recommended practices:

1. **Data**
   - At least 100-500 samples per category
   - Balanced classes, or a weighted loss
   - Annotation quality matters more than quantity

2. **Hyperparameters**
   - Learning rate: 1e-5 to 5e-5 for full fine-tuning
   - Learning rate: 1e-4 to 1e-3 for LoRA
   - Batch size: 8-32 (larger = more stable gradient)
   - Epochs: 3-10 with early stopping

3. **LoRA vs Full Fine-tuning**
   - LoRA: faster, less memory, easy switching between tasks
   - Full: can give better results on very specific domains

4. **Evaluation**
   - Always use a held-out test set
   - Track the F1 score for imbalanced data
   - Cross-validation for small datasets
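
For the weighted-loss tip above, one common heuristic sets each class weight inversely proportional to its frequency. The sketch below computes such weights in plain Python. `balanced_class_weights` is a hypothetical helper and the label counts are invented. The resulting list could then be passed as the `weight` argument of a cross-entropy loss, which would require a custom loss in the training loop.

```python
from collections import Counter
from typing import List


def balanced_class_weights(labels: List[int], num_classes: int) -> List[float]:
    """Weight for class c = n_samples / (num_classes * count_c); absent classes get 0."""
    counts = Counter(labels)
    n_samples = len(labels)
    return [
        n_samples / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


# Invented, imbalanced label distribution: 80 / 15 / 5 examples.
imbalanced_labels = [0] * 80 + [1] * 15 + [2] * 5
weights = balanced_class_weights(imbalanced_labels, num_classes=3)

for cls, w in enumerate(weights):
    print(f"class {cls}: weight {w:.3f}")
```

With perfectly balanced data every weight is 1.0, so the weighted loss matches the unweighted one.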

In [None]:
# Bonus: export the model to ONNX (faster inference)
from transformers import AutoModelForSequenceClassification
import torch.onnx

def export_to_onnx(model_path: str, output_path: str):
    """Export a model to ONNX format for faster inference."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # return_dict=False makes the model return plain tuples, which is simpler to trace for export
    model = AutoModelForSequenceClassification.from_pretrained(model_path, return_dict=False)
    model.eval()
    
    # Dummy input
    dummy_input = tokenizer(
        "Test text",
        return_tensors="pt",
        padding="max_length",
        max_length=128,
        truncation=True
    )
    
    # Export
    torch.onnx.export(
        model,
        (dummy_input["input_ids"], dummy_input["attention_mask"]),
        output_path,
        input_names=["input_ids", "attention_mask"],
        output_names=["logits"],
        dynamic_axes={
            "input_ids": {0: "batch_size", 1: "sequence"},
            "attention_mask": {0: "batch_size", 1: "sequence"},
            "logits": {0: "batch_size"}
        },
        opset_version=14
    )
    
    print(f"Model exported to: {output_path}")

# Example usage (commented out: requires a saved model)
# export_to_onnx("./models/customer_classifier", "./models/classifier.onnx")

## Summary

In this notebook we covered:

1. **Basic fine-tuning** - text classification on custom data
2. **NER fine-tuning** - recognizing custom entities
3. **LoRA/PEFT** - efficient fine-tuning with minimal memory
4. **Pipeline** - a reusable class for fine-tuning
5. **Production tips** - best practices for deployment

### Next steps:
- Experiment with larger models (BERT-large, RoBERTa)
- Quantization for faster inference
- A/B testing in production
- Continuous learning with new data