In [1]:
!git clone https://github.com/AnnaGhost2713/daia-eon.git
%cd daia-eon/notebooks/3_model_training_and_testing/spacy_pipeline

Cloning into 'daia-eon'...
remote: Enumerating objects: 1408, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1408 (delta 0), reused 0 (delta 0), pack-reused 1407 (from 2)
Receiving objects: 100% (1408/1408), 96.31 MiB | 23.21 MiB/s, done.
Resolving deltas: 100% (815/815), done.
Updating files: 100% (967/967), done.
/content/daia-eon/notebooks/3_model_training_and_testing/spacy_pipeline


In [2]:

%cd /content/daia-eon/notebooks/3_model_training_and_testing/spacy_pipeline

/content/daia-eon/notebooks/3_model_training_and_testing/spacy_pipeline


In [3]:
# 🧩 Step 1: Imports and setup
import spacy
from spacy.tokens import DocBin
from spacy.training.example import Example
import json
import random
from pathlib import Path
from spacy.util import minibatch
from spacy.util import compounding
!python -m spacy download de_core_news_md
!pip install spacy-lookups-data



# 📁 Step 2: Function for loading data from JSON
def load_data_from_json(path):
    with open(path, "r", encoding="utf-8") as f:
        raw_data = json.load(f)
    if isinstance(raw_data, dict):
        raw_data = [raw_data]

    TRAIN_DATA = []
    for entry in raw_data:
        text = entry["text"]
        entities = [(label["start"], label["end"], label["label"]) for label in entry["labels"]]
        TRAIN_DATA.append((text, {"entities": entities}))
    return TRAIN_DATA

# 🔄 Load training and dev data separately
train_data = load_data_from_json("../../../data/synthetic/synthetic_mails_option_b_cleaned.json")
dev_data = load_data_from_json("../../../data/original/ground_truth_split/validation_norm.json")
print(f"📥 Trainingsbeispiele: {len(train_data)}, Dev-Beispiele: {len(dev_data)}")

# 🧠 Step 3: Load the spaCy base model
base_model = "de_core_news_md"
nlp = spacy.load(base_model)


# Make sure the NER component exists
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner", last=True)
else:
    ner = nlp.get_pipe("ner")

# Register all labels from both datasets
for dataset in (train_data, dev_data):
    for _, annotations in dataset:
        for start, end, label in annotations["entities"]:
            ner.add_label(label)

# 🚀 Step 4: Example generator over all data (for labels only!)
# Note: get_examples() is never called below; resume_training() keeps the pretrained weights.
def get_examples():
    for text, ann in train_data + dev_data:
        yield Example.from_dict(nlp.make_doc(text), ann)

optimizer = nlp.resume_training()


# 🏋️ Step 5: Training (on the training data only)
n_iter = 13
for i in range(n_iter):
    random.shuffle(train_data)
    losses = {}

    batches = minibatch(train_data, size=compounding(8.0, 64.0, 1.001))
    for batch in batches:
        examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in batch]
        nlp.update(examples, drop=0.2, losses=losses)

    print(f"🔁 Iteration {i+1}/{n_iter}, Loss: {losses['ner']:.4f}")

# 💾 Step 6: Save the model
output_dir = Path("custom_spacy_model_synthetic_data_b")
output_dir.mkdir(exist_ok=True)
nlp.to_disk(output_dir)
print(f"\n✅ Modell gespeichert unter: {output_dir.resolve()}")

# 🔍 Step 7: Load the model and test it on dev_data
nlp = spacy.load(output_dir)

print("\n📊 Evaluation auf dev_data:")
for text, _ in random.sample(dev_data, min(5, len(dev_data))):  # at most 5 examples
    doc = nlp(text)
    print(f"\n> {text}")
    for ent in doc.ents:
        print(f"  - {ent.text} ({ent.label_})")

Collecting de-core-news-md==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.8.0/de_core_news_md-3.8.0-py3-none-any.whl (44.4 MB)
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
📥 Trainingsbeispiele: 14360, Dev-Beispiele: 25


The last 5000 lines of the streaming output were truncated.
nach Rücksprache mi..." with entities "[(198, 205, 'VORNAME'), (206, 211, 'NACHNAME'), (2...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
infolge der telefon..." with entities "[(195, 204, 'VORNAME'), (205, 210, 'NACHNAME'), (2...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
Verbrauchsstelle:..." with entities "[(9, 32, 'FIRMA'), (51, 67, 'STRASSE'), (68, 73, '...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
leider ist meine Zäh..." with entities "[(154, 165, 'ZÄHLERNUMMER'), (191, 196, 'VORNAME')...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)`

🔁 Iteration 1/13, Loss: 19619.8320
🔁 Iteration 2/13, Loss: 2261.2937
🔁 Iteration 3/13, Loss: 1311.2625
🔁 Iteration 4/13, Loss: 873.5128
🔁 Iteration 5/13, Loss: 1079.0535
🔁 Iteration 6/13, Loss: 1006.8538
🔁 Iteration 7/13, Loss: 605.3550
🔁 Iteration 8/13, Loss: 1128.7581
🔁 Iteration 9/13, Loss: 989.1884
🔁 Iteration 10/13, Loss: 926.6961
🔁 Iteration 11/13, Loss: 491.4452
🔁 Iteration 12/13, Loss: 437.5450
🔁 Iteration 13/13, Loss: 795.6146

✅ Modell gespeichert unter: /content/daia-eon/notebooks/3_model_training_and_testing/spacy_pipeline/custom_spacy_model_synthetic_data_b

📊 Evaluation auf dev_data:

> Sehr geehrte Damen und Herren,
anbei übersende ich Ihnen schriftlich meine Kündigung des Erdgasvertrages mit der Vertragsnummer 409107498.
Die Kündigung habe ich auch bereits über das Onlineportal "Mein Rogner" getätigt.
Mit freundlichen Grüßen
Gabriel Mitschke-Seip

  - 409107498. (VERTRAGSNUMMER)
  - Rogner (FIRMA)
  - Gabriel (VORNAME)
  - Mitschke-Seip (NACHNAME)

> Hiermit sende ich i
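
The warnings in the output above report entity offsets that do not line up with token boundaries; spaCy ignores such spans during training. As a rough pre-check that needs no tokenizer, the sketch below flags spans that are padded with whitespace or that start or end inside a run of letters or digits. The helper name `find_suspicious_spans` and the sample text are hypothetical, and the heuristic only approximates spaCy's tokenization; `spacy.training.offsets_to_biluo_tags`, named in the warning itself, remains the authoritative check.

```python
def find_suspicious_spans(text, entities):
    """Return (start, end, label, reason) for spans that are likely to misalign."""
    problems = []
    for start, end, label in entities:
        span = text[start:end]
        if not span or span != span.strip():
            problems.append((start, end, label, "empty or padded with whitespace"))
        elif start > 0 and text[start - 1].isalnum() and text[start].isalnum():
            problems.append((start, end, label, "starts inside a word"))
        elif end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
            problems.append((start, end, label, "ends inside a word"))
    return problems


# Hypothetical sample: the NACHNAME span is one character too short.
sample = "Kunde Max Mustermann, Vertragsnummer 409107498."
entities = [(6, 9, "VORNAME"), (10, 19, "NACHNAME"), (37, 46, "VERTRAGSNUMMER")]

for start, end, label, reason in find_suspicious_spans(sample, entities):
    print(f"{label} [{start}:{end}] {sample[start:end]!r}: {reason}")
```

Running such a check over the training file before training would show how many of the 14360 examples lose annotations, which the truncated warning stream alone does not reveal.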

In [2]:
# 🧩 Step 1: Imports and setup
import spacy
import random
import json
from pathlib import Path
from spacy.training.example import Example
from spacy.util import minibatch, compounding
from spacy.scorer import Scorer
import matplotlib.pyplot as plt

# 📁 Step 2: Data loader
def load_data_from_json(path):
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    if isinstance(raw, dict):
        raw = [raw]
    out = []
    for entry in raw:
        ents = [(lab["start"], lab["end"], lab["label"]) for lab in entry["labels"]]
        out.append((entry["text"], {"entities": ents}))
    return out

train_data = load_data_from_json("../../../data/synthetic/synthetic_mails_option_b_cleaned.json")
dev_data   = load_data_from_json("../../../data/original/ground_truth_split/validation_norm.json")
print(f"📥 Train: {len(train_data)} Beispiele, Dev: {len(dev_data)} Beispiele")

# 🧠 Step 3: Load the spaCy model and initialize NER
nlp = spacy.load("de_core_news_md")
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner", last=True)
else:
    ner = nlp.get_pipe("ner")
# Register labels
for data in (train_data, dev_data):
    for _, ann in data:
        for start, end, label in ann["entities"]:
            ner.add_label(label)

optimizer = nlp.resume_training()


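# 🔍 Helper function: evaluation on the dev data
def evaluate_dev(nlp, data):
    scorer = Scorer()
    # Run the full pipeline so the predicted doc actually contains entities;
    # nlp.make_doc() only tokenizes, so every prediction would be empty.
    examples = [Example.from_dict(nlp(text), ann) for text, ann in data]
    scores = scorer.score(examples)
    # Overall scores (micro-averaged over all entities); per-label values are in scores["ents_per_type"]
    return scores["ents_p"], scores["ents_r"], scores["ents_f"]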


# 🏋️ Step 4: Training with tracking
n_iter = 30
dropout = 0.2
batch_compound = compounding(8.0, 64.0, 1.001)
early_stopping_rounds = 5

loss_history = []
f1_history   = []
best_f1 = 0.0
no_improve = 0

for i in range(n_iter):
    # Training
    random.shuffle(train_data)
    losses = {}
    batches = minibatch(train_data, size=batch_compound)
    for batch in batches:
        exs = [Example.from_dict(nlp.make_doc(t), ann) for t, ann in batch]
        nlp.update(exs, drop=dropout, losses=losses)
    loss = losses.get("ner", 0.0)
    loss_history.append(loss)

    # Evaluation on the dev data
    p, r, f = evaluate_dev(nlp, dev_data)
    f1_history.append(f)

    # Status output
    print(f"Epoch {i+1:2d}/{n_iter} — Loss: {loss:7.2f}  |  P: {p*100:5.1f}%  R: {r*100:5.1f}%  F1: {f*100:5.1f}%")

    # Over-/underfitting signal
    if i > 0:
        # Overfitting: loss falls, F1 falls
        if loss_history[-1] < loss_history[-2] and f1_history[-1] < f1_history[-2]:
            print(" ⚠️ Achtung: Trainings-Loss ↓ & Validierung F1 ↓ → mögliches Overfitting")
        # Underfitting: high loss & low F1
        if loss > 1000 and f < 0.3:
            print(" ℹ️ Modell scheint Underfitted zu sein (hoher Loss bei niedriger F1)")

    # Early Stopping
    if f > best_f1:
        best_f1 = f
        no_improve = 0
        # you could save the best model here: nlp.to_disk("best_model")
    else:
        no_improve += 1
        if no_improve >= early_stopping_rounds:
            print(f"⏹️ Early stopping nach {i+1} Epochen — keine F1-Verbesserung in {early_stopping_rounds} Runden")
            break

# 💾 Step 5: Save the final model
out_dir = Path("custom_spacy_model_synthetic_data_b")
out_dir.mkdir(exist_ok=True)
nlp.to_disk(out_dir)
print(f"\n✅ Modell gespeichert in: {out_dir.resolve()}")

# 📈 Step 6: Plot the curves
plt.figure(figsize=(8,4))
plt.plot(loss_history,  label="Train Loss")
plt.ylabel("Loss"); plt.xlabel("Epoch"); plt.legend(); plt.title("Trainingsverlust");
plt.show()

plt.figure(figsize=(8,4))
plt.plot([v*100 for v in f1_history],  label="Dev F1")
plt.ylabel("F1 (%)"); plt.xlabel("Epoch"); plt.legend(); plt.title("Validierungs F1");
plt.show()

FileNotFoundError: [Errno 2] No such file or directory: '../../../data/synthetic/synthetic_mails_option_b_cleaned.json'
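
The early-stopping logic in the training loop above can be isolated into a small helper, which makes it easier to test and to see where a checkpoint would be written. The following is a minimal sketch, not the notebook's implementation: the class name `EarlyStopping` and the F1 values are made up for illustration, and unlike the loop above it starts from negative infinity, so a first score of 0.0 still counts as the best so far.

```python
class EarlyStopping:
    """Track the best validation score and signal when to stop."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.rounds_without_improvement = 0

    def step(self, epoch, score):
        """Record one epoch; return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.rounds_without_improvement = 0
            return False
        self.rounds_without_improvement += 1
        return self.rounds_without_improvement >= self.patience


# Made-up dev F1 values, for illustration only.
f1_per_epoch = [0.60, 0.72, 0.71, 0.70, 0.72, 0.69]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, f1 in enumerate(f1_per_epoch, start=1):
    if stopper.step(epoch, f1):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}")
```

In the training loop, the branch that records a new best score is the place where `nlp.to_disk(...)` would save a checkpoint; otherwise the model written after the loop is the last one trained, not the best one.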

In [1]:
import os
import shutil
from getpass import getpass

# 🔐 GitHub credentials
username = "AnnaGhost2713"  # owner of the repo
token = getpass("GitHub token: ")  # prompt for the token instead of hard-coding it

# 📁 Paths
model_folder = "/content/daia-eon/notebooks/3_model_training_and_testing/spacy_pipeline/custom_spacy_model_testing"
repo_dir     = "/content/daia-eon"
target_subdir = "notebooks/3_model_training_and_testing/spacy_pipeline/custom_spacy_model_testing2"
repo_target  = os.path.join(repo_dir, target_subdir)

# 📂 Copy the model folder into the repo
if os.path.exists(repo_target):
    shutil.rmtree(repo_target)
shutil.copytree(model_folder, repo_target)

# 💻 Change into the repo
%cd {repo_dir}

# 🛠 Configure Git
!git config user.email "timon.martens@tum.de"
!git config user.name "timonmartens"
!git remote set-url origin https://{username}:{token}@github.com/{username}/daia-eon.git

# ✅ Commit & Push
!git add {target_subdir}
!git commit -m "📦 Add trained spaCy model"
!git push origin main


FileNotFoundError: [Errno 2] No such file or directory: '/content/daia-eon/notebooks/3_model_training_and_testing/spacy_pipeline/custom_spacy_model_testing'
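
The error above is raised by `shutil.copytree` because the source folder does not exist. One option is to check the source first and report the current working directory, which often explains the problem in Colab sessions. The sketch below assumes a hypothetical helper `copy_model_dir` and uses a temporary directory with a stand-in file rather than a real model.

```python
import shutil
import tempfile
from pathlib import Path


def copy_model_dir(source, target):
    """Copy a model directory, failing early with a readable message."""
    source, target = Path(source), Path(target)
    if not source.is_dir():
        raise FileNotFoundError(
            f"Model folder not found: {source} (cwd: {Path.cwd()})"
        )
    if target.exists():
        shutil.rmtree(target)  # replace an older copy, as the cell above does
    shutil.copytree(source, target)
    return target


# Demo in a temporary directory with a stand-in file instead of a real model.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "model"
    src.mkdir()
    (src / "meta.json").write_text("{}", encoding="utf-8")
    dst = copy_model_dir(src, Path(tmp) / "repo" / "model_copy")
    copied = sorted(p.name for p in dst.iterdir())
    try:
        copy_model_dir(Path(tmp) / "missing", Path(tmp) / "other")
        missing_error = None
    except FileNotFoundError as err:
        missing_error = str(err)

print(copied)
print(missing_error)
```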

In [3]:
# 🔍 Step 7: Load the model and test it on dev_data
nlp = spacy.load(output_dir)

print("\n📊 Evaluation auf dev_data:")
for text, _ in random.sample(dev_data, min(5, len(dev_data))):  # at most 5 examples
    doc = nlp(text)
    print(f"\n> {text}")
    for ent in doc.ents:
        print(f"  - {ent.text} ({ent.label_})")


📊 Evaluation auf dev_data:

> Sehr geehrte Damen und Herren,
hiermit lege Ich, Catharina Thies, Vertragsnummer 402157398, bei der Schlussrechnung 2022/ 2023 Wiederspruch ein.
Die Wohnungsabnahme war am 15.05.2022, anbei das Übergabeprotokoll der Hausverwaltung. Der Abbrechnungszeitraum vom 16.05.-31.05.22 fäll somit nicht mehr in meinen Bemessungszeitraum. Anbei auch die neue Meldebescheinigung. Ich bitte hiermit um Klärung der Abrechnung.
Mit freundlichen Grüßen
Catharina Thies

  - Catharina (VORNAME)
  - Thies (NACHNAME)
  - 402157398 (VERTRAGSNUMMER)
  - 15.05.2022 (DATUM)
  - 16.05.-31.05.22 (ZÄHLERSTAND)
  - Catharina (VORNAME)
  - Thies (NACHNAME)

> Vertragsnumer 403038396
Hello
Wir haben Stromzähler am 21.04.23 in Büro in Essen abgegeben. Bei Herr Hiller.
Mit freundlichen Grüßen
Holsten KaffeeZimmer
ส่งจาก Outlook สำหรับ Android<https://aka.ms/AAb9ysg>

  - 403038396 (VERTRAGSNUMMER)
  - Hiller (NACHNAME)
  - Holsten KaffeeZimmer (FIRMA)
  - ส่งจาก Outlook สำหรับ Android<http

In [None]:
# 🔍 Step 7: Load the model
nlp2 = spacy.load("custom_spacy_model_doccano_labeling")

# 🧩 Step 7.1: Add the EntityRuler
# ❌ Remove any existing EntityRuler first
if "entity_ruler" in nlp2.pipe_names:
    nlp2.remove_pipe("entity_ruler")

# ✅ Insert the EntityRuler before the NER so that its matches take precedence
ruler = nlp2.add_pipe("entity_ruler", before="ner")
# Example: load patterns
# Function for loading names from a file
def load_names(path, label):
    with open(path, "r", encoding="utf-8") as f:
        names = [name.strip() for name in f if name.strip()]
    patterns = [{"label": label, "pattern": name} for name in names]
    return patterns, names

# Load the gazetteers
vornamen_patterns, vornamen_liste = load_names("Vornamen.txt", "VORNAME")
nachnamen_patterns, nachnamen_liste = load_names("Nachnamen.txt", "NACHNAME")
titel_patterns, titel_liste = load_names("Titel.txt", "TITEL")
wohnort_patterns, wohnort_liste = load_names("Orte.txt", "WOHNORT")
postleitzahl_patterns, postleitzahl_liste = load_names("Postleitzahlen.txt", "POSTLEITZAHL")
strasse_patterns = [
    {
        "label": "STRASSE",
        "pattern": [
            {"TEXT": {"REGEX": r".*(straße|gasse|allee|weg|platz|str\.|grund)$"}},
            {"TEXT": {"REGEX": r"^\d+[a-zA-Z]?$"}}
        ]
    }
]
vertragsnummer_patterns = [
    {
        "label": "VERTRAGSNUMMER",
        "pattern": [
            {"LOWER": {"IN": ["vertragsnummer", "vertragsnr.", "vnr", "vn"]}},
            {"IS_PUNCT": True, "OP": "*"},
            {"TEXT": {"REGEX": r"^\d{6,12}\.?$"}}

        ]
    }
]

kundennummer_patterns = [
    {
        "label": "KUNDENNUMMER",
        "pattern": [
            {"LOWER": {"IN": ["kundennummer", "kundennr.", "kdnr", "kd"]}},
            {"IS_PUNCT": True, "OP": "*"},
            {"TEXT": {"REGEX": r"^\d{6,12}\.?$"}}
        ]
    }
]

zuordnungsnummer_patterns = [
    {
        "label": "ZUORDNUNGSNUMMER",
        "pattern": [
            {"LOWER": {"IN": ["znr", "zuordnungsnummer"]}},
            {"IS_PUNCT": True, "OP": "*"},
            {"TEXT": {"REGEX": r"^\d{6,12}\.?$"}}
        ]
    }
]
iban_pattern = [
    {"label": "IBAN", "pattern": [{"TEXT": {"REGEX": r"^[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}$"}}]}
]

bic_pattern = [
    {"label": "BIC", "pattern": [{"TEXT": {"REGEX": r"^[A-Z]{6}[A-Z2-9][A-NP-Z0-9]([A-Z0-9]{3})?$"}}]}
]

zahlung_pattern = [
    {
        "label": "ZAHLUNG",
        "pattern": [
            {"TEXT": {"REGEX": r"^\d+[.,]?\d{0,2}$"}},
            {"LOWER": {"IN": ["€", "euro", "eur"]}}  # case-insensitive, so "Euro" and "EUR" match too
        ]
    }
]

zählerstand_patterns = [
    {
        "label": "ZÄHLERSTAND",
        "pattern": [
            {"LOWER": {"IN": ["zählerstand"]}},
            {"IS_PUNCT": True, "OP": "*"},
            {"TEXT": {"REGEX": r"^\d+(\.\d+)?$"}}
        ]
    },
    {
        "label": "ZÄHLERSTAND",
        "pattern": [
            {"LOWER": {"IN": ["zählerstand"]}},
            {"IS_PUNCT": True, "OP": "*"},
            {"TEXT": {"IN": [":"]}, "OP": "?"},
            {"TEXT": {"REGEX": r"^\d{1,5}([.,]\d{1,2})?$"}}
        ]
    }
]

zählernummer_patterns = [
    {
        "label": "ZÄHLERNUMMER",
        "pattern": [
            {"LOWER": {"IN": ["zählernummer"]}},
            {"IS_PUNCT": True, "OP": "*"},
            {"TEXT": {"REGEX": r"^[A-Z0-9]{6,20}$"}}  # uppercase letters and digits only
        ]
    }
]

verbrauch_patterns = [
    {
        "label": "VERBRAUCH",
        "pattern": [
            {"LOWER": {"IN": ["verbrauch"]}},
            {"IS_PUNCT": True, "OP": "*"},
            {"TEXT": {"REGEX": r"^\d+(\.\d+)?$"}},
            {"LOWER": {"IN": ["kwh", "m³", "kw"]}, "OP": "?"}
        ]
    }
]

verbrauch_patterns += [
    {
        "label": "VERBRAUCH",
        "pattern": [
            {"LOWER": {"IN": ["verbrauch"]}},
            {"IS_PUNCT": True, "OP": "*"},
            {"TEXT": {"REGEX": r"(?i)^\d+(?:[.,]\d+)?(kwh|m³|kw)$"}}  # (?i): also matches "kWh"
        ]
    }
]


wlv_patterns = [
    {
        "label": "WLV",
        "pattern": [
            {"LOWER": {"IN": ["wlv"]}},
            {"IS_PUNCT": True, "OP": "*"},
            {"TEXT": {"REGEX": r"^\d{4,12}$"}}
        ]
    }
]

email_pattern = [
    {
        "label": "EMAIL",
        "pattern": [
            {"TEXT": {"REGEX": r"^[\w\.-]+@[\w\.-]+\.\w{2,}$"}}
        ]
    }
]

telefon_pattern = [
    {
        "label": "TELEFON",
        "pattern": [
            {"TEXT": {"REGEX": r"^(\+49|0)[\d\s/-]{7,}$"}}
        ]
    }
]

url_pattern = [
    {
        "label": "LINK",
        "pattern": [
            {"TEXT": {"REGEX": r"^https?://[\w\-\.]+\.\w{2,}(/[\w\-\.]*)*$"}}
        ]
    },
    {
        "label": "LINK",
        "pattern": [
            {"TEXT": {"REGEX": r"^www\.[\w\-\.]+\.\w{2,}(/[\w\-\.]*)*$"}}
        ]
    }
]

datum_pattern = [
    {
        "label": "DATUM",
        "pattern": [
            {"TEXT": {"REGEX": r"^(\d{1,2}[./-]){2}\d{2,4}$"}}  # e.g. 15.06.2024
        ]
    },
    {
        "label": "DATUM",
        "pattern": [
            {"TEXT": {"REGEX": r"^\d{4}-\d{2}-\d{2}$"}}  # e.g. 2024-06-15
        ]
    },
    {
        "label": "DATUM",
        "pattern": [
            {"TEXT": {"REGEX": r"^\d{1,2}$"}},  # e.g. 15
            {"LOWER": {"IN": [
                "januar", "jan", "februar", "feb", "märz", "maerz", "mrz", "april", "apr",
                "mai", "juni", "jun", "juli", "jul", "august", "aug", "september", "sep",
                "oktober", "okt", "november", "nov", "dezember", "dez"
            ]}},
            {"TEXT": {"REGEX": r"^\d{2,4}$"}, "OP": "?"}  # optional year
        ]
    }
]





# Add the patterns to the EntityRuler

ruler.add_patterns(zahlung_pattern + url_pattern + iban_pattern + bic_pattern + zählerstand_patterns + email_pattern + telefon_pattern)  # 👈 add the patterns!


# 💾 (Optional) Save the model WITH the ruler again
output_dir_ruler = Path("custom_spacy_model_with_ruler")
output_dir_ruler.mkdir(exist_ok=True)
nlp2.to_disk(output_dir_ruler)
print(f"✅ Modell mit EntityRuler gespeichert unter: {output_dir_ruler.resolve()}")

✅ Modell mit EntityRuler gespeichert unter: /Users/timonmartens/Library/CloudStorage/OneDrive-Persönlich/Desktop/Veranstaltungen/Data Analytics in Applications/daia-eon/notebooks/3_model_training_and_testing/spacy_pipeline/custom_spacy_model_with_ruler
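
Because the `REGEX` patterns in the cell above are applied to the text of a single token, they can be sanity-checked with Python's `re` module before they go into the ruler. The sketch below copies three of the expressions verbatim; the helper `matching_labels` and the sample strings are hypothetical, and each string stands in for one token (whether spaCy's tokenizer really keeps it as one token is not checked here).

```python
import re

# Regular expressions copied from the pattern definitions above.
PATTERNS = {
    "IBAN": r"^[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}$",
    "EMAIL": r"^[\w\.-]+@[\w\.-]+\.\w{2,}$",
    "DATUM": r"^(\d{1,2}[./-]){2}\d{2,4}$",
}


def matching_labels(token_text):
    """Return the labels whose regex matches the given token text."""
    return [label for label, rx in PATTERNS.items() if re.search(rx, token_text)]


# Hypothetical token texts; the last one is the date range from the sample output.
for token_text in ["DE89370400440532013000", "max@example.com", "15.05.2022", "16.05.-31.05.22"]:
    print(f"{token_text!r}: {matching_labels(token_text)}")
```

The date range `16.05.-31.05.22` matches none of these expressions, so the ruler would leave it to the statistical NER. Note also that only the pattern lists actually passed to `ruler.add_patterns(...)` take effect; lists that are defined but not passed, such as `datum_pattern`, do not influence the pipeline.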


In [10]:
from spacy.training import Example
from spacy.scorer import Scorer
import pandas as pd
from collections import defaultdict

def evaluate_model_with_counts(nlp, examples):
    scorer = Scorer()
    example_list = []
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})

    for text, ann in examples:
        doc = nlp(text)
        example = Example.from_dict(doc, ann)
        pred_ents = {(ent.start_char, ent.end_char, ent.label_) for ent in example.predicted.ents}
        true_ents = {(ent.start_char, ent.end_char, ent.label_) for ent in example.reference.ents}

        for ent in pred_ents:
            if ent in true_ents:
                counts[ent[2]]["tp"] += 1
            else:
                counts[ent[2]]["fp"] += 1
        for ent in true_ents:
            if ent not in pred_ents:
                counts[ent[2]]["fn"] += 1

        example_list.append(example)

    # Evaluation with spaCy
    scores = scorer.score(example_list)
    results = []

    sum_tp = sum_fp = sum_fn = 0
    all_precisions = []
    all_recalls = []
    all_f1s = []

    for label, metrics in scores["ents_per_type"].items():
        tp = counts[label]["tp"]
        fp = counts[label]["fp"]
        fn = counts[label]["fn"]

        sum_tp += tp
        sum_fp += fp
        sum_fn += fn

        precision = metrics["p"] * 100
        recall = metrics["r"] * 100
        f1 = metrics["f"] * 100

        all_precisions.append(precision)
        all_recalls.append(recall)
        all_f1s.append(f1)

        results.append({
            "Label": label,
            "Precision (%)": round(precision, 2),
            "Recall (%)": round(recall, 2),
            "F1-Score (%)": round(f1, 2),
            "True Positives": tp,
            "False Positives": fp,
            "False Negatives": fn
        })

    # Macro average: unweighted mean of the per-label scores
    if all_precisions:
        results.append({
            "Label": "⚙️ Overall (Macro-Average)",
            "Precision (%)": round(sum(all_precisions) / len(all_precisions), 2),
            "Recall (%)": round(sum(all_recalls) / len(all_recalls), 2),
            "F1-Score (%)": round(sum(all_f1s) / len(all_f1s), 2),
            "True Positives": "-",
            "False Positives": "-",
            "False Negatives": "-"
        })

    # Micro average: global TP/FP/FN → global score
    if (sum_tp + sum_fp) > 0 and (sum_tp + sum_fn) > 0:
        precision_micro = sum_tp / (sum_tp + sum_fp)
        recall_micro = sum_tp / (sum_tp + sum_fn)
        f1_micro = 2 * (precision_micro * recall_micro) / (precision_micro + recall_micro)
        results.append({
            "Label": "📦 Overall (Micro-Average)",
            "Precision (%)": round(precision_micro * 100, 2),
            "Recall (%)": round(recall_micro * 100, 2),
            "F1-Score (%)": round(f1_micro * 100, 2),
            "True Positives": sum_tp,
            "False Positives": sum_fp,
            "False Negatives": sum_fn
        })

    return pd.DataFrame(results)
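
The difference between the macro and micro rows of this function is easiest to see on raw counts. The following is a small sketch with made-up numbers for two labels; the helper `prf` is hypothetical and returns 0.0 wherever a ratio is undefined, which also covers the case of zero true positives, where precision plus recall is zero.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts: one frequent, well-recognized label and one rare, weak label.
counts = {
    "A": {"tp": 90, "fp": 10, "fn": 10},
    "B": {"tp": 1, "fp": 3, "fn": 1},
}

# Macro: average the per-label F1 scores, every label weighs the same.
per_label = {label: prf(**c) for label, c in counts.items()}
macro_f1 = sum(f for _, _, f in per_label.values()) / len(per_label)

# Micro: pool the counts first, frequent labels dominate.
totals = {key: sum(c[key] for c in counts.values()) for key in ("tp", "fp", "fn")}
micro_f1 = prf(**totals)[2]

print(f"macro F1: {macro_f1:.4f}, micro F1: {micro_f1:.4f}")
```

With these counts the rare label pulls the macro average well below the micro average, which is why both rows are worth reporting on a small dev set.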

In [11]:
nlp = spacy.load("custom_spacy_model_doccano_labeling")
#nlp2 = spacy.load("custom_spacy_model_with_ruler")
nlp3=spacy.load("custom_spacy_model_synthetic_data")

results_nlp = evaluate_model_with_counts(nlp, dev_data)
#results_nlp2= evaluate_model_with_counts(nlp2, dev_data)
results_nlp3= evaluate_model_with_counts(nlp3, dev_data)
display(results_nlp)
#display(results_nlp2)
display(results_nlp3)


   Label           Precision (%)  Recall (%)  F1-Score (%)  True Positives  False Positives  False Negatives
0  VORNAME                 93.55       93.55         93.55              29                2                2
1  NACHNAME                85.37       92.11         88.61              35                6                3
2  VERTRAGSNUMMER           70.0       100.0         82.35              14                6                0
3  TITEL                   66.67       100.0          80.0               4                2                0
4  STRASSE                  80.0       100.0         88.89               8                2                0
5  HAUSNUMMER              100.0       100.0         100.0               8                0                0
6  POSTLEITZAHL            100.0       100.0         100.0               8                0                0
7  WOHNORT                 100.0        87.5         93.33               7                0                1
8  TELEFONNUMMER           33.33        50.0          40.0               2                4                2
9  ZÄHLERNUMMER            66.67       28.57          40.0               2                1                5


   Label           Precision (%)  Recall (%)  F1-Score (%)  True Positives  False Positives  False Negatives
0  VORNAME                 93.75       96.77         95.24              30                2                1
1  NACHNAME                97.22       92.11         94.59              35                1                3
2  VERTRAGSNUMMER          92.86       92.86         92.86              13                1                1
3  ZÄHLERNUMMER            100.0       85.71         92.31               6                0                1
4  FIRMA                    40.0       66.67          50.0               2                3                1
5  STRASSE                 100.0        87.5         93.33               7                0                1
6  HAUSNUMMER              100.0        87.5         93.33               7                0                1
7  POSTLEITZAHL            100.0        75.0         85.71               6                0                2
8  WOHNORT                 100.0        75.0         85.71               6                0                2
9  TELEFONNUMMER            80.0       100.0         88.89               4                1                0
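
To compare the two models label by label, the F1 columns of the two tables can be subtracted directly. The sketch below copies the F1 values from the tables and assumes they appear in display order, i.e. the first table belongs to `custom_spacy_model_doccano_labeling` and the second to `custom_spacy_model_synthetic_data`; the variable names are hypothetical.

```python
# F1 scores (%) copied from the two result tables (display order assumed).
f1_doccano = {
    "VORNAME": 93.55, "NACHNAME": 88.61, "VERTRAGSNUMMER": 82.35, "TITEL": 80.0,
    "STRASSE": 88.89, "HAUSNUMMER": 100.0, "POSTLEITZAHL": 100.0, "WOHNORT": 93.33,
    "TELEFONNUMMER": 40.0, "ZÄHLERNUMMER": 40.0,
}
f1_synthetic = {
    "VORNAME": 95.24, "NACHNAME": 94.59, "VERTRAGSNUMMER": 92.86, "ZÄHLERNUMMER": 92.31,
    "FIRMA": 50.0, "STRASSE": 93.33, "HAUSNUMMER": 93.33, "POSTLEITZAHL": 85.71,
    "WOHNORT": 85.71, "TELEFONNUMMER": 88.89,
}

# Difference (synthetic minus doccano) for labels present in both tables.
shared = sorted(set(f1_doccano) & set(f1_synthetic))
deltas = {label: round(f1_synthetic[label] - f1_doccano[label], 2) for label in shared}

# Labels that only one of the two tables reports.
only_one = sorted(set(f1_doccano) ^ set(f1_synthetic))

for label, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{label:<15} {delta:+.2f}")
print("reported by only one model:", only_one)
```

With only 25 dev examples, each difference rests on a handful of entities, so the deltas indicate a direction rather than a reliable ranking.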


TypeError: '<' not supported between instances of 'str' and 'int'

from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
plt.subplots(figsize=(8, 8))
df_2dhist = pd.DataFrame({
    x_label: grp['False Negatives'].value_counts()
    for x_label, grp in results_nlp3.groupby('False Positives')
})
sns.heatmap(df_2dhist, cmap='viridis')
plt.xlabel('False Positives')
_ = plt.ylabel('False Negatives')
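
The `TypeError` shown above the plotting code most likely comes from the macro-average row, which stores the string "-" in the count columns, so grouping or sorting those columns compares `str` with `int`. One way around it is to keep only the rows with integer counts before plotting. The sketch below works on plain dictionaries; the helper `numeric_rows` is hypothetical, and the two label rows reuse counts from the second result table. On a DataFrame, `pd.to_numeric(df[col], errors="coerce")` followed by `dropna` would serve the same purpose.

```python
# Rows shaped like the output of evaluate_model_with_counts.
rows = [
    {"Label": "VORNAME", "False Positives": 2, "False Negatives": 1},
    {"Label": "NACHNAME", "False Positives": 1, "False Negatives": 3},
    {"Label": "Overall (Macro-Average)", "False Positives": "-", "False Negatives": "-"},
]

COUNT_COLUMNS = ("False Positives", "False Negatives")


def numeric_rows(rows, columns=COUNT_COLUMNS):
    """Keep only rows whose count columns all hold integers."""
    return [row for row in rows if all(isinstance(row[col], int) for col in columns)]


clean = numeric_rows(rows)
# Sorting works now because no str/int comparison is left.
fp_sorted = sorted(row["False Positives"] for row in clean)
print([row["Label"] for row in clean], fp_sorted)
```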


Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.







In [15]:
!git add notebook/3_model_training_and_testing/spacy_pipeline/custom_spacy_model_synthetic_data
!git commit -m "📦 Add synthetic-trained spaCy model"
!git push origin main

fatal: pathspec 'notebook/3_model_training_and_testing/spacy_pipeline/custom_spacy_model_synthetic_data' did not match any files
Author identity unknown

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: unable to auto-detect email address (got 'root@a169e07582e2.(none)')
fatal: could not read Username for 'https://github.com': No such device or address
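
The output above shows three separate problems: the pathspec starts with `notebook/` although the repository directory used in the clone cell is `notebooks/`, no Git identity is configured, and the push has no credentials. The sketch below assembles the commands in the order that addresses the first two; it is a dry run that only prints them. The helper names, the placeholder identity, and the assumption that the commands run from the repository root are all hypothetical, and authentication for the push still has to be set up separately, for example via the token-based remote URL used in an earlier cell.

```python
import subprocess


def build_git_commands(path, email, name, message, branch="main"):
    """Return the git invocations as argument lists, identity first."""
    return [
        ["git", "config", "user.email", email],
        ["git", "config", "user.name", name],
        ["git", "add", path],
        ["git", "commit", "-m", message],
        ["git", "push", "origin", branch],
    ]


def run_all(commands, dry_run=True):
    """Print the commands, or execute them and stop at the first failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


commands = build_git_commands(
    # Assumed path relative to the repository root; note "notebooks", not "notebook".
    path="notebooks/3_model_training_and_testing/spacy_pipeline/custom_spacy_model_synthetic_data",
    email="you@example.com",  # placeholder identity
    name="Your Name",
    message="Add synthetic-trained spaCy model",
)
run_all(commands)  # dry run: prints the commands without executing them
```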
