# Fine-Tuned LLM Term Extraction with Unsloth

This notebook demonstrates term extraction by fine-tuning a Large Language Model:
- Uses **Gemma-3-4B** model with efficient fine-tuning via Unsloth
- Trains on ATE-IT waste management terminology dataset
- Optimizes using LoRA (Low-Rank Adaptation) for efficient training
- Evaluates with official competition metrics

Unsloth advertises up to 2x faster training with about 60% less memory usage; actual gains depend on the model and hardware.

Dataset: EvalITA 2025 ATE-IT (Automatic Term Extraction - Italian Testbed)

<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
</div>

In [6]:
!mkdir -p data
!mv *.csv ./data
!mv *.json ./data

## Setup and Imports

In [1]:
%%capture
# Install Unsloth and dependencies (%%capture must be the first line of the cell)
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

print("✓ Unsloth and dependencies installed")

## Load Base Model with Unsloth

Load Gemma-3-4B model with 4-bit quantization for efficient fine-tuning.

In [2]:
from unsloth import FastModel
import torch

# Load Gemma-3-4B model with 4-bit quantization
model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 2048,
    load_in_4bit = True,
    load_in_8bit = False,
    full_finetuning = False,
)

print("✓ Gemma-3-4B model loaded")
print(f"  Model: {model.__class__.__name__}")
print(f"  Max sequence length: 2048")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.1: Fast Gemma3 patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.


✓ Gemma-3-4B model loaded
  Model: Gemma3ForConditionalGeneration
  Max sequence length: 2048


## Add LoRA Adapters

Configure LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.

In [3]:
# Add LoRA adapters for efficient fine-tuning
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False,
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

print("✓ LoRA adapters configured")
print(f"  Rank (r): 16")
print(f"  Alpha: 16")

Unsloth: Making `model.base_model.model.model.language_model` require gradients
✓ LoRA adapters configured
  Rank (r): 16
  Alpha: 16
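
To make the rank and alpha settings concrete, the sketch below counts the extra parameters that LoRA adds to a single weight matrix and computes the scaling applied to the low-rank update. The 1024 x 1024 layer shape is a hypothetical example, not a Gemma-3 dimension, and the `alpha / r` scaling assumes the standard LoRA formulation (no rank-stabilized variant).

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / r


d_in, d_out = 1024, 1024  # hypothetical layer shape, for illustration only
full = d_in * d_out
extra = lora_param_count(d_in, d_out, r=16)

print(f"Full matrix: {full:,} parameters")
print(f"LoRA factors at r=16: {extra:,} parameters")
print(f"Scaling with r=16, alpha=16: {lora_scaling(16, 16)}")
```

With `r` and `lora_alpha` both set to 16, as in the cell above, the update is added without rescaling.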


## Data Loading and Processing

Load ATE-IT dataset and format for term extraction task.

In [4]:
# Configure chat template for Gemma-3
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

print("✓ Gemma-3 chat template configured")

✓ Gemma-3 chat template configured


In [7]:
import json

# Load ATE-IT training data
def load_jsonl(path: str):
    """Load a JSON lines file or JSON array file."""
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read().strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = []
        for line in text.splitlines():
            line = line.strip()
            if line:
                data.append(json.loads(line))
    return data

# Load training data
train_data = load_jsonl('data/subtask_a_train.json')
dev_data = load_jsonl('data/subtask_a_dev.json')

print("✓ Data loaded")
print(f"  Training samples: {len(train_data['data'])}")
print(f"  Dev samples: {len(dev_data['data'])}")

✓ Data loaded
  Training samples: 2308
  Dev samples: 577
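
The loader accepts either a single JSON document or one JSON object per line, so its return type depends on the file. The sketch below reproduces that fallback with temporary files; the field names come from this notebook, while the sample rows and the helper name `load_json_or_jsonl` are made up for illustration.

```python
import json
import os
import tempfile


def load_json_or_jsonl(path):
    """Try whole-file JSON first, then fall back to one JSON object per line."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read().strip()
    if not text:
        return []
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return [json.loads(line) for line in text.splitlines() if line.strip()]


rows = [  # made-up sample records
    {"sentence_text": "frase uno", "term_list": ["rifiuti"]},
    {"sentence_text": "frase due", "term_list": []},
]

with tempfile.TemporaryDirectory() as tmp:
    wrapped_path = os.path.join(tmp, "wrapped.json")
    lines_path = os.path.join(tmp, "lines.jsonl")
    with open(wrapped_path, "w", encoding="utf-8") as f:
        json.dump({"data": rows}, f)
    with open(lines_path, "w", encoding="utf-8") as f:
        f.write("\n".join(json.dumps(r) for r in rows))

    wrapped = load_json_or_jsonl(wrapped_path)  # dict with a "data" key
    per_line = load_json_or_jsonl(lines_path)   # plain list of records

print(type(wrapped).__name__, type(per_line).__name__)
```

The ATE-IT files used here follow the first layout, which is why the notebook indexes `train_data['data']` rather than iterating over the result directly.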


## Format Data for Term Extraction

Convert ATE-IT data to conversation format for fine-tuning.

In [8]:
# Convert to conversation format for fine-tuning
from datasets import Dataset

conversations = []
for entry in train_data['data']:
    sentence = entry['sentence_text']
    terms = entry['term_list']

    # Format as instruction-response pairs
    terms_str = ', '.join(terms) if terms else 'none'

    conversation = [
        {"role": "user", "content": f"Extract all waste management terms from this Italian sentence: {sentence}"},
        {"role": "assistant", "content": f"Terms: {terms_str}"}
    ]
    conversations.append({"conversations": conversation})

# Create dataset
dataset = Dataset.from_list(conversations)

print("✓ Dataset formatted for fine-tuning")
print(f"  Total examples: {len(dataset)}")

✓ Dataset formatted for fine-tuning
  Total examples: 2308
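
One caveat of the comma-joined target format is that a term containing a comma cannot be recovered by a parser that splits on commas. The sketch below, using made-up terms and hypothetical helper names, shows the target string and a simple check for that case.

```python
def format_terms(terms):
    """Target string for the assistant turn; 'none' marks an empty term list."""
    return "Terms: " + (", ".join(terms) if terms else "none")


def has_ambiguous_terms(terms):
    """True if any term contains a comma, which a comma-split parser would break apart."""
    return any("," in t for t in terms)


examples = [
    ["raccolta differenziata", "rifiuti urbani"],  # made-up sample terms
    [],
    ["carta, cartone"],  # made-up term containing a comma
]
for terms in examples:
    print(format_terms(terms), "| ambiguous:", has_ambiguous_terms(terms))
```

Running such a check over the training data before fine-tuning would show whether a different separator is needed.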


### Example Training Sample

In [9]:
# Show example conversation
print("Example training conversation:")
print(dataset[100]['conversations'])

Example training conversation:
[{'content': "Extract all waste management terms from this Italian sentence: Per ulteriori informazioni e chiarimenti rivolgersi all'Ufficio Tributi utilizzando il seguente contatto telefonico unicamente il giovedì di ogni settimana disponibile: tel. 0975-1900676", 'role': 'user'}, {'content': 'Terms: none', 'role': 'assistant'}]


## Apply Chat Template

Apply Gemma-3 chat template to format conversations for training.

In [10]:
# Apply chat template to conversations
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False
        ).removeprefix('<bos>')
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

print("✓ Chat template applied to dataset")

✓ Chat template applied to dataset


### Formatted Training Example

In [11]:
# Show formatted text
print("Formatted training example:")
print(dataset[100]["text"][:300] + "...")

Formatted training example:
<start_of_turn>user
Extract all waste management terms from this Italian sentence: Per ulteriori informazioni e chiarimenti rivolgersi all'Ufficio Tributi utilizzando il seguente contatto telefonico unicamente il giovedì di ogni settimana disponibile: tel. 0975-1900676<end_of_turn>
<start_of_turn>mo...
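
The turn layout in the formatted example can be approximated by hand, which helps when checking the marker strings used later for response masking. This is only an illustrative sketch: the real template is defined by the tokenizer, `render_gemma_turns` is a hypothetical helper, and the sample sentence is a placeholder.

```python
def render_gemma_turns(conversation, add_generation_prompt=False):
    """Approximate the Gemma-3 turn layout seen above (without the <bos> prefix)."""
    role_names = {"user": "user", "assistant": "model"}  # "assistant" is rendered as "model"
    parts = []
    for message in conversation:
        role = role_names[message["role"]]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so generation continues from here
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "Frase di esempio."},  # placeholder sentence
    {"role": "assistant", "content": "Terms: none"},
]
text = render_gemma_turns(example)
prompt_only = render_gemma_turns(example[:1], add_generation_prompt=True)
print(text)
print(prompt_only)
```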


## Configure Training

Set up training arguments and trainer with response masking.

In [12]:
# Configure trainer
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,  # Full training run
        learning_rate = 2e-4,
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

print("✓ Trainer configured")
print(f"  Epochs: 1")
print(f"  Batch size: 2")
print(f"  Learning rate: 2e-4")

Unsloth: Switching to float32 training since model cannot work with float16

✓ Trainer configured
  Epochs: 1
  Batch size: 2
  Learning rate: 2e-4


### Enable Response-Only Training

Train only on model responses, not user inputs.

In [13]:
# Train only on assistant responses
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

print("✓ Response-only training enabled")

✓ Response-only training enabled
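
The idea behind response-only training can be sketched on plain token-id lists: labels for everything outside the model turns are set to -100, the value that PyTorch cross-entropy loss ignores. This is an illustration of the concept under made-up token ids, not Unsloth's implementation; `mask_to_responses` and `find_subsequence` are hypothetical helpers.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy loss


def find_subsequence(seq, pattern, start=0):
    """Index of the first occurrence of pattern in seq at or after start, else -1."""
    for i in range(start, len(seq) - len(pattern) + 1):
        if seq[i:i + len(pattern)] == pattern:
            return i
    return -1


def mask_to_responses(input_ids, instruction_ids, response_ids):
    """Build labels that keep only tokens following a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_subsequence(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        # The response runs until the next instruction marker or the end of the sequence
        end = find_subsequence(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels


# Made-up ids: [1, 2] stands for the user marker, [1, 3] for the model marker
instruction_ids, response_ids = [1, 2], [1, 3]
input_ids = [1, 2, 10, 11, 9, 1, 3, 20, 21, 9]
labels = mask_to_responses(input_ids, instruction_ids, response_ids)
print(labels)
```

Only the three tokens after the model marker keep their labels, so the loss is computed on the extracted terms and not on the prompt.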


In [14]:
# Check initial memory usage
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU: {gpu_stats.name}")
print(f"Max memory: {max_memory} GB")
print(f"Reserved: {start_gpu_memory} GB")

GPU: Tesla T4
Max memory: 14.741 GB
Reserved: 5.59 GB


## Train the Model

Note: Training time depends on hardware; the run recorded below took about 13 minutes for one epoch on a Tesla T4.

In [15]:
# Start training
print("="*60)
print("Starting fine-tuning...")
print("="*60)

trainer_stats = trainer.train()

print("\n" + "="*60)
print("✓ TRAINING COMPLETED!")
print("="*60)

Starting fine-tuning...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,308 | Num Epochs = 1 | Total steps = 289
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 29,802,496 of 4,329,881,968 (0.69% trained)


Step,Training Loss
10,4.6087
20,0.4573
30,0.3961
40,0.4226
50,0.3172
60,0.3899
70,0.3431
80,0.2775
90,0.3058
100,0.3109


Unsloth: Will smartly offload gradients to save VRAM!

✓ TRAINING COMPLETED!
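
The figures in the training banner follow from the configuration, and the short check below recomputes them. It assumes the final partial batch still counts as one optimizer step, which matches the 289 steps reported in the log.

```python
import math

num_examples = 2308
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1

# Examples consumed per optimizer step
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
# Round up so the last, smaller batch is still counted
steps_per_epoch = math.ceil(num_examples / effective_batch)

trainable, total = 29_802_496, 4_329_881_968  # values from the banner above
trainable_pct = 100 * trainable / total

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Trainable share: {trainable_pct:.2f}%")
```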


In [16]:
# Show training statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
training_time_minutes = round(trainer_stats.metrics['train_runtime']/60, 2)

print(f"Training time: {training_time_minutes} minutes")
print(f"Peak memory: {used_memory} GB")
print(f"Memory for training: {used_memory_for_lora} GB")

Training time: 13.39 minutes
Peak memory: 5.736 GB
Memory for training: 0.146 GB


## Save Fine-Tuned Model

In [17]:
# Save LoRA adapters
model.save_pretrained("models/gemma3_term_extraction")
tokenizer.save_pretrained("models/gemma3_term_extraction")

print("✓ Model saved to models/gemma3_term_extraction")

✓ Model saved to models/gemma3_term_extraction


## Inference and Evaluation

Run predictions on dev set and evaluate with official metrics.

In [18]:
# Prepare dev data for evaluation
def build_sentence_gold_map(records):
    """Convert dataset rows into list of sentences with aggregated terms."""
    out = {}

    if isinstance(records, dict) and 'data' in records:
        rows = records['data']
    else:
        rows = records

    for r in rows:
        key = (r.get('document_id'), r.get('paragraph_id'), r.get('sentence_id'))
        if key not in out:
            out[key] = {
                'document_id': r.get('document_id'),
                'paragraph_id': r.get('paragraph_id'),
                'sentence_id': r.get('sentence_id'),
                'sentence_text': r.get('sentence_text', ''),
                'terms': []
            }

        if isinstance(r.get('term_list'), list):
            for t in r.get('term_list'):
                if t and t not in out[key]['terms']:
                    out[key]['terms'].append(t)

    return list(out.values())

dev_sentences = build_sentence_gold_map(dev_data)
dev_texts = [s['sentence_text'] for s in dev_sentences]
dev_gold = [s['terms'] for s in dev_sentences]

print(f"✓ Dev set prepared: {len(dev_sentences)} sentences")

✓ Dev set prepared: 577 sentences


## Evaluation Metrics

Using the official evaluation metrics from the competition.

In [19]:
def micro_f1_score(gold_standard, system_output):
    """
    Evaluates performance using Precision, Recall, and F1 score
    based on individual term matching (micro-average).
    """
    total_true_positives = 0
    total_false_positives = 0
    total_false_negatives = 0

    for gold, system in zip(gold_standard, system_output):
        gold_set = set(gold)
        system_set = set(system)

        true_positives = len(gold_set.intersection(system_set))
        false_positives = len(system_set - gold_set)
        false_negatives = len(gold_set - system_set)

        total_true_positives += true_positives
        total_false_positives += false_positives
        total_false_negatives += false_negatives

    precision = total_true_positives / (total_true_positives + total_false_positives) if (total_true_positives + total_false_positives) > 0 else 0
    recall = total_true_positives / (total_true_positives + total_false_negatives) if (total_true_positives + total_false_negatives) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1, total_true_positives, total_false_positives, total_false_negatives


def type_f1_score(gold_standard, system_output):
    """
    Evaluates performance using Type Precision, Type Recall, and Type F1 score
    based on the set of unique terms extracted at least once across the entire dataset.
    """
    all_gold_terms = set()
    for item_terms in gold_standard:
        all_gold_terms.update(item_terms)

    all_system_terms = set()
    for item_terms in system_output:
        all_system_terms.update(item_terms)

    type_true_positives = len(all_gold_terms.intersection(all_system_terms))
    type_false_positives = len(all_system_terms - all_gold_terms)
    type_false_negatives = len(all_gold_terms - all_system_terms)

    type_precision = type_true_positives / (type_true_positives + type_false_positives) if (type_true_positives + type_false_positives) > 0 else 0
    type_recall = type_true_positives / (type_true_positives + type_false_negatives) if (type_true_positives + type_false_negatives) > 0 else 0
    type_f1 = 2 * (type_precision * type_recall) / (type_precision + type_recall) if (type_precision + type_recall) > 0 else 0

    return type_precision, type_recall, type_f1


print("✓ Evaluation functions defined")

✓ Evaluation functions defined
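
The two metrics can disagree on the same predictions, because micro-averaging counts every sentence-level match while the type-level score counts each unique term once. The toy example below uses made-up single-letter terms and compact re-implementations (hypothetical helper names) to show the difference.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, with 0.0 for empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_prf(gold, pred):
    """Sum per-sentence counts, then compute the scores once."""
    tp = sum(len(set(g) & set(s)) for g, s in zip(gold, pred))
    fp = sum(len(set(s) - set(g)) for g, s in zip(gold, pred))
    fn = sum(len(set(g) - set(s)) for g, s in zip(gold, pred))
    return prf(tp, fp, fn)


def type_prf(gold, pred):
    """Compare the sets of unique terms over the whole dataset."""
    gold_types = set().union(*gold) if gold else set()
    pred_types = set().union(*pred) if pred else set()
    return prf(len(gold_types & pred_types),
               len(pred_types - gold_types),
               len(gold_types - pred_types))


gold = [["a", "b"], ["a"]]  # made-up terms
pred = [["a"], ["a", "c"]]

micro = micro_prf(gold, pred)
types = type_prf(gold, pred)
print("micro P/R/F1:", [round(x, 4) for x in micro])
print("type  P/R/F1:", [round(x, 4) for x in types])
```

The term `a` is rewarded twice in the micro score but only once at type level, so frequent terms weigh more heavily in the micro-averaged result.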


## Run Predictions on Dev Set

In [24]:
# Run inference on dev set
print("Running predictions on dev set...")

llm_preds = []
for i, text in enumerate(dev_texts):
    if i % 100 == 0:
        print(f"  Processing {i}/{len(dev_texts)}...")

    # Create prompt with proper format for Gemma-3
    prompt = f"Extract all waste management terms from this Italian sentence: {text}"

    # Format as conversation
    conversation = [{"role": "user", "content": prompt}]

    # Apply chat template and tokenize
    formatted_prompt = tokenizer.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True
    )

    # Tokenize
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda")

    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=1.0,
        top_p=0.95,
        top_k=64,
    )

    # Decode only the newly generated tokens, so text in the prompt
    # (e.g. a sentence containing "Terms:") cannot leak into the parse
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

    # Extract terms from "Terms: term1, term2, ..." format
    if "Terms:" in response:
        terms_part = response.split("Terms:")[-1].strip()
        if terms_part.lower() == 'none':
            terms = []
        else:
            terms = [t.strip().lower() for t in terms_part.split(',') if t.strip()]
    else:
        terms = []

    llm_preds.append(terms)

print(f"✓ Predictions completed: {len(llm_preds)} predictions")

Running predictions on dev set...
  Processing 0/577...
  Processing 100/577...
  Processing 200/577...
  Processing 300/577...
  Processing 400/577...
  Processing 500/577...
✓ Predictions completed: 577 predictions
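
The parsing step inside the loop can be pulled out into a small function and tested without a model. The sketch below mirrors the parsing rules used in the loop; `parse_terms` is a hypothetical name and the response strings are made up.

```python
def parse_terms(response: str) -> list:
    """Parse 'Terms: a, b' into ['a', 'b']; 'Terms: none' or a missing marker gives []."""
    if "Terms:" not in response:
        return []
    terms_part = response.split("Terms:")[-1].strip()
    if terms_part.lower() == "none":
        return []
    # Lowercase and drop empty fragments left by stray commas
    return [t.strip().lower() for t in terms_part.split(",") if t.strip()]


samples = [
    "Terms: Raccolta differenziata, rifiuti urbani",  # made-up responses
    "Terms: none",
    "Terms: carta, , vetro,",
    "I could not find any terms.",
]
for sample in samples:
    print(repr(sample), "->", parse_terms(sample))
```

Note that predictions are lowercased here, so exact-match scoring only works if the gold terms are lowercase as well.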


## Evaluate Performance

In [25]:
# Evaluate using competition metrics
precision, recall, f1, tp, fp, fn = micro_f1_score(dev_gold, llm_preds)
type_precision, type_recall, type_f1 = type_f1_score(dev_gold, llm_preds)

print("\n" + "="*60)
print("FINE-TUNED LLM RESULTS")
print("="*60)
print("\nMicro-averaged Metrics:")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1 Score:  {f1:.4f}")
print(f"  TP={tp}, FP={fp}, FN={fn}")

print("\nType-level Metrics:")
print(f"  Type Precision: {type_precision:.4f}")
print(f"  Type Recall:    {type_recall:.4f}")
print(f"  Type F1 Score:  {type_f1:.4f}")
print("="*60)


FINE-TUNED LLM RESULTS

Micro-averaged Metrics:
  Precision: 0.6164
  Recall:    0.5987
  F1 Score:  0.6074
  TP=270, FP=168, FN=181

Type-level Metrics:
  Type Precision: 0.5255
  Type Recall:    0.5537
  Type F1 Score:  0.5392
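
As a sanity check, the micro-averaged scores can be recomputed directly from the reported counts. The sketch uses the equivalent closed form F1 = 2TP / (2TP + FP + FN).

```python
tp, fp, fn = 270, 168, 181  # counts reported above

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```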


## Save Predictions to File

In [26]:
# Save predictions in competition format
import os

def save_predictions(predictions, sentences, output_path):
    """Save predictions in competition format."""
    output = {'data': []}
    for pred, sent in zip(predictions, sentences):
        output['data'].append({
            'document_id': sent['document_id'],
            'paragraph_id': sent['paragraph_id'],
            'sentence_id': sent['sentence_id'],
            'term_list': pred
        })

    os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    print(f"✓ Saved {len(predictions)} predictions to {output_path}")

save_predictions(llm_preds, dev_sentences, 'predictions/gemma3_finetuned_preds.json')

✓ Saved 577 predictions to predictions/gemma3_finetuned_preds.json
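
Before submitting, it can help to check that the saved structure has the expected shape. The sketch below validates the layout written by `save_predictions`; it is not the official competition validator, `validate_predictions` is a hypothetical helper, and the sample payloads are made up.

```python
REQUIRED_KEYS = ("document_id", "paragraph_id", "sentence_id", "term_list")


def validate_predictions(payload):
    """Return a list of problems; an empty list means the structure looks right."""
    if not isinstance(payload, dict) or not isinstance(payload.get("data"), list):
        return ["top level must be a dict with a 'data' list"]
    problems = []
    for i, record in enumerate(payload["data"]):
        missing = [k for k in REQUIRED_KEYS if k not in record]
        if missing:
            problems.append(f"record {i}: missing {missing}")
        terms = record.get("term_list", [])
        if not isinstance(terms, list) or not all(isinstance(t, str) for t in terms):
            problems.append(f"record {i}: term_list must be a list of strings")
    return problems


good = {"data": [  # made-up sample record
    {"document_id": "doc1", "paragraph_id": 0, "sentence_id": 0, "term_list": ["raccolta"]},
]}
bad = {"data": [
    {"document_id": "doc1", "paragraph_id": 0, "term_list": "raccolta"},
]}

print(validate_predictions(good))
print(validate_predictions(bad))
```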


## Example Predictions

In [27]:
# Show example predictions
print("Example Predictions:\n")

count = 0
for i in range(len(dev_sentences)):
    if len(dev_gold[i]) > 0 and count < 5:
        print(f"Sentence: {dev_texts[i][:100]}...")
        print(f"Gold terms: {dev_gold[i][:5]}")
        print(f"LLM predictions: {llm_preds[i][:5]}")

        correct = set(dev_gold[i]) & set(llm_preds[i])
        missed = set(dev_gold[i]) - set(llm_preds[i])
        wrong = set(llm_preds[i]) - set(dev_gold[i])

        print(f"✓ Correct: {len(correct)}")
        print(f"✗ Missed: {len(missed)}")
        print(f"✗ Wrong: {len(wrong)}")
        print("-"*80)
        print()

        count += 1

Example Predictions:

Sentence: Il presente disciplinare per la gestione dei centri di raccolta comunali è stato redatto ai sensi e ...
Gold terms: ['disciplina dei centri di raccolta dei rifiuti urbani raccolti in modo differenziato', 'disciplinare per la gestione dei centri di raccolta comunali']
LLM predictions: []
✓ Correct: 0
✗ Missed: 2
✗ Wrong: 0
--------------------------------------------------------------------------------

Sentence: È un Servizio Supplementare di raccolta, rivolto a famiglie con bambini al di sotto dei 3 anni o con...
Gold terms: ['raccolta']
LLM predictions: ['servizio supplementare di raccolta']
✓ Correct: 0
✗ Missed: 1
✗ Wrong: 1
--------------------------------------------------------------------------------

Sentence: ll servizio di raccolta dei rifiuti derivanti da sfalci e potature è gestito dalla Buttol Srl con il...
Gold terms: ['servizio di raccolta dei rifiuti', 'sfalci e potature']
LLM predictions: ['servizio di raccolta dei rifiut