## Label Studio to GLiNER2 Training Data Conversion
This code converts the human-labeled entities exported from Label Studio into a format that can be used to train GLiNER2. It reads the labeled data, keeps only the entity types defined in the schema, and writes a new JSON file with one training example per annotated text.
The output follows the recommended format:
```json
{"input": "text to process", "output": {"schema_definition": "with_annotations"}}
```
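
For the entity task handled here, that template takes a concrete shape: `output` holds an `entities` mapping from each label to a list of surface strings. The sketch below builds one such example from a made-up record. The sentence, the labels, and the `start`/`end` keys are illustrative assumptions; the converter in the next cell only relies on `text`, `label`, and `labels`.

```python
import json

# Hypothetical Label Studio record: a top-level "text" plus a "label" list of spans.
record = {
    "text": "Glodesindis Mettis vixit.",
    "label": [
        {"start": 0, "end": 11, "text": "Glodesindis", "labels": ["person"]},
        {"start": 12, "end": 18, "text": "Mettis", "labels": ["place"]},
    ],
}

# Group the span texts by label, recovering each string from its character offsets.
entities = {}
for span in record["label"]:
    surface = record["text"][span["start"]:span["end"]]
    for label in span["labels"]:
        entities.setdefault(label, []).append(surface)

# One training example in the input/output shape shown above.
example = {"input": record["text"], "output": {"entities": entities}}
print(json.dumps(example, ensure_ascii=False))
```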

In [5]:
import json
from collections import defaultdict


def load_json(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def convert_to_gliner2_format(annotation_data, schema):
    """
    Converts Label Studio-style annotations to GLiNER2 format.
    Only uses labels defined inside schema["entities"].
    """

    # ✅ Only take actual NER entity types
    entity_types = list(schema["entities"].keys())

    gliner_data = []

    for item in annotation_data:
        text = item["text"]
        annotations = item.get("label", [])

        entities_dict = defaultdict(list)

        # Collect entities
        for ann in annotations:
            entity_text = ann["text"].strip()
            labels = ann.get("labels", [])

            for label in labels:
                # ✅ Only include labels that exist in schema
                if label in entity_types:
                    entities_dict[label].append(entity_text)

        # Deduplicate while preserving order (dict keys keep insertion order)
        for label in entities_dict:
            entities_dict[label] = list(dict.fromkeys(entities_dict[label]))

        # Ensure ALL schema entity types are present (even if empty)
        for entity_type in entity_types:
            if entity_type not in entities_dict:
                entities_dict[entity_type] = []

        formatted_item = {
            "input": text,
            "output": {
                "entities": dict(entities_dict)
            }
        }

        gliner_data.append(formatted_item)

    return gliner_data


def save_json(data, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    annotation_file = "LS_labeled_data.json"
    schema_file = "gliner_schema_hagiographics.json"
    output_file = "gliner2_training_data.json"

    annotations = load_json(annotation_file)
    schema = load_json(schema_file)

    gliner_data = convert_to_gliner2_format(annotations, schema)
    save_json(gliner_data, output_file)

    print(f"Converted {len(gliner_data)} examples.")

Converted 43 examples.


## Fine-tuning GLiNER2 with LoRA
Now that we have the training data in the correct format, we can use it to fine-tune GLiNER2 specifically for our use case.
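
The training cell reads `train.jsonl` and `val.jsonl`, while the converter above writes a single JSON array to `gliner2_training_data.json`; the notebook does not show how the split was made. A minimal sketch of one way to do it, assuming a shuffled 80/20 split with a fixed seed (the function name, fraction, and seed are assumptions):

```python
import json
import random
from pathlib import Path


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def split_to_jsonl(examples, train_path, val_path, train_fraction=0.8, seed=42):
    """Shuffle the examples and write a train and a validation JSONL file."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    write_jsonl(shuffled[:cut], train_path)
    write_jsonl(shuffled[cut:], val_path)
    return cut, len(shuffled) - cut


# Only runs when the converter output is present in the working directory.
source = Path("gliner2_training_data.json")
if source.exists():
    with open(source, "r", encoding="utf-8") as f:
        data = json.load(f)
    n_train, n_val = split_to_jsonl(data, "train.jsonl", "val.jsonl")
    print(f"train: {n_train}, val: {n_val}")
```

With 43 converted examples, an 80% cut yields 34 training records, which agrees with the `Num examples = 34` line in the training log, although the split actually used may have been different.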

In [22]:
!pip install gliner2



In [25]:
from gliner2 import GLiNER2
from gliner2.training.trainer import GLiNER2Trainer, TrainingConfig
import torch

# Detect device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# -------------------------------
# LoRA configuration
# -------------------------------

config = TrainingConfig(
    output_dir="./hagiography_lora",
    experiment_name="hagiography_latin_ner",

    # Training schedule
    num_epochs=10,
    batch_size=8 if device == "cpu" else 16,
    gradient_accumulation_steps=2,
    warmup_ratio=0.1,
    scheduler_type="cosine",

    # Learning rate (important)
    task_lr=2e-4,

    # LoRA settings
    use_lora=True,
    lora_r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    lora_target_modules=["encoder"],
    save_adapter_only=True,

    # Stability
    fp16=(device == "cuda"),
    eval_strategy="epoch",
    save_best=True,
    early_stopping=True,
    early_stopping_patience=5,
)

# -------------------------------
# Load multilingual base model
# -------------------------------

model = GLiNER2.from_pretrained("fastino/gliner2-multi-v1")

# -------------------------------
# Create trainer
# -------------------------------

trainer = GLiNER2Trainer(model=model, config=config)

# -------------------------------
# Train from JSONL files
# -------------------------------

trainer.train(
    train_data="train.jsonl",
    eval_data="val.jsonl"
)

best_model = GLiNER2.from_pretrained("./hagiography_lora/best")

print("LoRA training complete.")

2026-02-27 16:48:32 - INFO - httpx - HTTP Request: HEAD https://huggingface.co/fastino/gliner2-multi-v1/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
2026-02-27 16:48:32 - INFO - httpx - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/fastino/gliner2-multi-v1/65627079a40a2b19a94eb4f08ebd00c429ccb6d4/config.json "HTTP/1.1 200 OK"


Using device: cpu


2026-02-27 16:48:32 - INFO - httpx - HTTP Request: HEAD https://huggingface.co/fastino/gliner2-multi-v1/resolve/main/encoder_config/config.json "HTTP/1.1 307 Temporary Redirect"
2026-02-27 16:48:32 - INFO - httpx - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/fastino/gliner2-multi-v1/65627079a40a2b19a94eb4f08ebd00c429ccb6d4/encoder_config%2Fconfig.json "HTTP/1.1 200 OK"
2026-02-27 16:48:32 - INFO - httpx - HTTP Request: HEAD https://huggingface.co/fastino/gliner2-multi-v1/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
2026-02-27 16:48:32 - INFO - httpx - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/fastino/gliner2-multi-v1/65627079a40a2b19a94eb4f08ebd00c429ccb6d4/config.json "HTTP/1.1 200 OK"
2026-02-27 16:48:32 - INFO - httpx - HTTP Request: HEAD https://huggingface.co/fastino/gliner2-multi-v1/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
2026-02-27 16:48:32 - INFO - httpx - HTTP Request: HEAD https://huggingface.co/

🧠 Model Configuration
Encoder model      : microsoft/mdeberta-v3-base
Counting layer     : count_lstm
Token pooling      : first


2026-02-27 16:48:41 - INFO - gliner2.training.trainer - Using device: cpu
2026-02-27 16:48:41 - INFO - gliner2.training.trainer - Setting up LoRA for parameter-efficient fine-tuning...
2026-02-27 16:48:41 - INFO - gliner2.training.trainer - Froze all model parameters for LoRA training
2026-02-27 16:48:41 - INFO - gliner2.training.lora - Applied LoRA to 72 layers
2026-02-27 16:48:42 - INFO - gliner2.training.trainer - LoRA setup complete: 1,327,104 trainable params out of 308,425,749 total (0.43%)


🔧 LoRA Configuration
Enabled            : True
Rank (r)           : 8
Alpha              : 16
Scaling (α/r)      : 2.0000
Dropout            : 0.05
Target modules     : encoder
LoRA layers        : 72
----------------------------------------------------------------------
Trainable params   : 1,327,104 / 308,425,749 (0.43%)
Memory savings     : ~99.6% fewer gradients
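
The figures in this summary can be reproduced by hand. The sketch below assumes the usual mDeBERTa-v3-base dimensions (12 transformer blocks, hidden size 768, feed-forward size 3072) and that LoRA wraps six linear layers per block: query, key, value, attention output, and the two feed-forward projections. That choice of layers is inferred from the reported count of 72 and is not stated in the log.

```python
def lora_params(in_features, out_features, r):
    # LoRA adds two matrices per wrapped linear layer: A (r x in) and B (out x r).
    return r * (in_features + out_features)


r, alpha = 8, 16
hidden, ffn, n_blocks = 768, 3072, 12  # assumed encoder dimensions

per_block = (
    4 * lora_params(hidden, hidden, r)  # query, key, value, attention output
    + lora_params(hidden, ffn, r)       # feed-forward up-projection
    + lora_params(ffn, hidden, r)       # feed-forward down-projection
)
n_layers = n_blocks * 6
trainable = n_blocks * per_block
total = 308_425_749  # total parameter count reported in the log

print(f"LoRA layers      : {n_layers}")
print(f"Trainable params : {trainable:,}")
print(f"Share of total   : {100 * trainable / total:.2f}%")
print(f"Scaling alpha/r  : {alpha / r}")
```

Under these assumptions the count comes to 1,327,104 trainable parameters over 72 layers, matching the log. Each wrapped layer contributes two tensors, which would also account for the `LoRA params only = 144` entry in the optimizer line.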


Validating records: 100%|██████████| 34/34 [00:00<?, ?record/s]
2026-02-27 16:48:44 - INFO - gliner2.training.trainer - Optimizer: LoRA params only = 144, LR=0.0002
2026-02-27 16:48:44 - INFO - gliner2.training.trainer - ***** Running Training *****
2026-02-27 16:48:44 - INFO - gliner2.training.trainer -   Num examples = 34
2026-02-27 16:48:44 - INFO - gliner2.training.trainer -   Num epochs = 10
2026-02-27 16:48:44 - INFO - gliner2.training.trainer -   Batch size = 8
2026-02-27 16:48:44 - INFO - gliner2.training.trainer -   Gradient accumulation steps = 2
2026-02-27 16:48:44 - INFO - gliner2.training.trainer -   Effective batch size = 16
2026-02-27 16:48:44 - INFO - gliner2.training.trainer -   Total optimization steps = 20
2026-02-27 16:48:44 - INFO - gliner2.training.trainer -   Warmup steps = 2
2026-02-27 16:48:44 - INFO - gliner2.training.trainer -   LoRA enabled: 1,327,104 trainable / 308,425,749 total (0.43%)
Training:  10%|█         | 2/20 [03:08<25:14, 84
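
The schedule numbers in the log follow from the configuration. One reading that reproduces them is sketched below; it assumes the trainer rounds the number of optimizer steps per epoch down, which the log does not confirm.

```python
num_examples = 34
batch_size = 8
gradient_accumulation_steps = 2
num_epochs = 10
warmup_ratio = 0.1

# Samples consumed per optimizer update
effective_batch = batch_size * gradient_accumulation_steps
# Assumed rounding rule: incomplete accumulation windows do not count as a step
steps_per_epoch = num_examples // effective_batch
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

With only two optimizer updates per epoch, the whole run amounts to 20 updates, so the cosine schedule and the two warmup steps play out over a very short horizon.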

OSError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './hagiography_lora/best\config.json'. Use `repo_type` argument if needed.
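
This error is consistent with `from_pretrained` not finding `config.json` inside `./hagiography_lora/best` and then trying to interpret the path as a Hugging Face Hub repo id. Two causes seem plausible: the directory does not exist yet because training did not finish, or it holds only adapter files because the config sets `save_adapter_only=True`. A small standard-library check can tell these apart; the helper name is hypothetical and no adapter file names are assumed.

```python
from pathlib import Path


def inspect_checkpoint(path):
    """Report whether a checkpoint directory exists and which files it holds."""
    ckpt = Path(path)
    if not ckpt.is_dir():
        return {"exists": False, "has_config": False, "files": []}
    files = sorted(p.name for p in ckpt.iterdir() if p.is_file())
    return {"exists": True, "has_config": "config.json" in files, "files": files}


print(inspect_checkpoint("./hagiography_lora/best"))
```

If `exists` is true but `has_config` is false, the directory is probably adapter-only. In that case the adapter would need to be attached to the base model `fastino/gliner2-multi-v1` instead of being loaded as a full model; the GLiNER2 documentation describes a `load_adapter` method for this, which is worth checking against the installed version.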

In [26]:
import json
from gliner2 import GLiNER2

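# best_model comes from the training cell above
extractor = best_model

# create_gliner_schema_from_config_file and SCHEMA_CONFIG_PATH are not defined in
# this section; they are assumed to come from an earlier cell of the notebook.
schema = create_gliner_schema_from_config_file(extractor, SCHEMA_CONFIG_PATH)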

text = """
[2] Impellitis itaque, ut sacratissimæ virginis Christi Glodesindis, cui devotis continue adhæretis excubiis, quæ conversationis initia, qui medii actus in finem usque feliciter consummatum exstiterint, quantum ex scriptis, quæ ad nostram ætatem qualibuscumque litteris annotata perdurant, queam advertere, stylo, quoquo possibile sit, audeam pertentare. Quod tanto tempore, tantaque instantia flagitatum, quia indecens videtur obniti, ope ejusdem præstantissimæ Virginis, Deo spiritu mundo viventis virtutibus, tum vestris pariter fretus orationum subsidiis, etsi non sine quodam rubore, uti qui parcitatem proprii perpendam ingenii, non diu per longa moratus exordia, ocius narrationi accedam.
"""

results = extractor.extract(text, schema, threshold=0.1, include_confidence=True, include_spans=True,
                            format_results=False)
print(json.dumps(results, indent=2, ensure_ascii=False))


{
  "entities": [
    {
      "person": [],
      "group": [],
      "institution": [
        {
          "text": "mundo",
          "confidence": 0.589316725730896,
          "start": 486,
          "end": 491
        }
      ],
      "place": [],
      "object": [],
      "divine_entity": [],
      "text_title": []
    }
  ]
}
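
With `format_results=False` the result is nested: a list of blocks under `entities`, each mapping a label to a list of span dictionaries. For inspection it can help to flatten this into rows sorted by confidence. The helper below is a hypothetical sketch that assumes exactly the structure printed above.

```python
def flatten_entities(results, min_confidence=0.0):
    """Flatten nested results into (label, text, start, end, confidence) rows."""
    rows = []
    for block in results.get("entities", []):
        for label, spans in block.items():
            for span in spans:
                if span["confidence"] >= min_confidence:
                    rows.append(
                        (label, span["text"], span["start"], span["end"], span["confidence"])
                    )
    # Highest-confidence predictions first
    return sorted(rows, key=lambda row: row[-1], reverse=True)


# Sample copied from the output above, with most empty labels omitted
sample = {
    "entities": [
        {
            "person": [],
            "institution": [
                {"text": "mundo", "confidence": 0.589316725730896, "start": 486, "end": 491}
            ],
        }
    ]
}
for row in flatten_entities(sample):
    print(row)
```

Raising `min_confidence` filters low-scoring spans after extraction, which is convenient when the extraction itself is run with a permissive `threshold` such as the 0.1 used above.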
