## 📄 Synthetic Email Generator – Option B (Balanced Labeled Output)

This notebook generates a large set of realistic, German-style synthetic emails for Named Entity Recognition (NER) training. It uses paraphrased template sentences containing placeholders (e.g. `<<VORNAME>>`, `<<ZÄHLERSTAND>>`) and fills them with artificial but plausible values using `Faker`, `schwifty`, and custom generators.

**Key features:**
- Smart substitution logic for entities like IBAN, meter numbers, invoice amounts, and more.
- Balanced sampling strategy to ensure uniform coverage of all label types across the dataset.
- Final output is a spaCy-compatible JSON format with exact character-level label spans.

The output (`synthetic_emails_labeled.json`) contains 14,360 fully labeled examples.
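
As a minimal sketch of what the character-level spans look like in practice, the snippet below converts one record of the shape written by this notebook (`text` plus `labels` with `start`, `end`, `label`) into the `(text, {"entities": [...]})` tuple layout commonly used for spaCy NER training data. The record and the helper name `to_spacy_tuple` are hypothetical and only illustrate the structure.

```python
from typing import Dict, List, Tuple

def to_spacy_tuple(record: Dict) -> Tuple[str, Dict[str, List[Tuple[int, int, str]]]]:
    """Convert one record into a (text, {"entities": [...]}) pair."""
    entities = [(lab["start"], lab["end"], lab["label"]) for lab in record["labels"]]
    return record["text"], {"entities": entities}

# Hypothetical record, shaped like the entries in synthetic_emails_labeled.json
record = {
    "file": "1",
    "text": "Mein Name ist Anna Beispiel.",
    "labels": [
        {"start": 14, "end": 18, "label": "VORNAME"},
        {"start": 19, "end": 27, "label": "NACHNAME"},
    ],
}

text, annotations = to_spacy_tuple(record)
for start, end, label in annotations["entities"]:
    # Slicing the text with the span must return the entity surface form
    print(label, repr(text[start:end]))
```

The `end` offset is exclusive, so `text[start:end]` yields exactly the inserted value.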

In [48]:
# Clone the GitHub repository and change into the data/synthetic folder.
# This is required because the notebook was run in the Google Colab environment.
!git clone https://github.com/AnnaGhost2713/daia-eon.git
%cd daia-eon/data/
%cd synthetic

Cloning into 'daia-eon'...
remote: Enumerating objects: 1204, done.
remote: Counting objects: 100% (91/91), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 1204 (delta 42), reused 60 (delta 28), pack-reused 1113 (from 1)
Receiving objects: 100% (1204/1204), 48.44 MiB | 16.08 MiB/s, done.
Resolving deltas: 100% (685/685), done.
/content/daia-eon/data
/content/daia-eon/data/synthetic


In [18]:
# ── Install required dependency ─────────────────────────────────────────
!pip install faker --quiet

# ── Imports ─────────────────────────────────────────────────────────────
from faker import Faker
from faker.providers import bank, internet, misc, date_time
import random
import re
import string

# ── 1. Faker setup for German locale ───────────────────────────────────
fake = Faker("de_DE")               # Use German locale → realistic names, addresses, phone numbers etc.
fake.add_provider(bank)            # Add IBAN, BIC generation
fake.add_provider(internet)        # Add email, URL generation
fake.add_provider(misc)            # Add misc tools
fake.add_provider(date_time)       # Add date/time generation

# Optional: for reproducible output, pass a fixed integer to random.seed(...)
# and also call Faker.seed(...). Without an argument, the seed comes from system entropy.
random.seed()



In [19]:
# ── HAUSNUMMER ─────────────────────────────────────────────────────
# Updated logic based on earlier evaluation results:
# → Use Faker's built-in German building number generator for more realistic results

def german_house_number() -> str:
    return fake.building_number()

In [20]:
# ── ZÄHLERNUMMER ───────────────────────────────────────────────────────
# Enhanced logic based on evaluation insights:
# - Lowered spacing probability for better readability
# - Grouped characters in chunks of 3–4 to mimic natural chunking
# - Allowed occasional leading zero in alphanumeric variant
# - Biased all-digit variant toward 12-digit formats
# - Reduced probability of year-based variants to 10%

# --- helper: optionally sprinkle spaces into a sequence -----------
def insert_random_spaces(seq: str, prob: float = 0.3) -> str:
    """
    With probability *prob* return the sequence split into
    space-separated groups of 3–4 chars.  Otherwise return seq unchanged.
    """
    if random.random() > prob:
        return seq
    out, i = [], 0
    while i < len(seq):
        grp_len = random.randint(3, 4)
        out.append(seq[i: i + grp_len])
        i += grp_len
    return " ".join(out)

# --- main generator ----------------------------------------------
def zaehlernummer() -> str:
    r = random.random()
    # 1) Alphanumeric
    if r < 0.35:
        prefix = str(random.randint(0,9))
        letters = ''.join(random.choices(string.ascii_lowercase,
                                         k=random.randint(2,4)))
        digits  = ''.join(random.choices(string.digits,
                                         k=random.randint(7,12)))
        core = prefix + letters + digits

    # 2) Pure digits, biased to length 12
    elif r < 0.75:
        length = random.choices([12,11,10,9], weights=[0.6,0.1,0.1,0.2])[0]
        core = ''.join(random.choices(string.digits, k=length))

    # 3) Hyphen or slash separated
    else:
        left  = ''.join(random.choices(string.digits,
                                       k=random.randint(5,8)))
        right = ''.join(random.choices(string.digits,
                                       k=random.randint(4,6)))
        sep   = random.choice(["-", "/"])
        core  = f"{left}{sep}{right}"

    # sprinkle spaces more realistically
    return insert_random_spaces(core, prob=0.25).strip()
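
The chunking step can be looked at in isolation. The following is a small sketch, not the notebook's own helper: `chunk_with_spaces` is a hypothetical stand-in that always inserts spaces and takes a locally seeded `random.Random`, so the property checks at the end are repeatable.

```python
import random

def chunk_with_spaces(seq: str, rng: random.Random) -> str:
    """Split seq into space-separated groups of 3 or 4 characters."""
    groups, i = [], 0
    while i < len(seq):
        size = rng.randint(3, 4)
        groups.append(seq[i:i + size])
        i += size
    return " ".join(groups)

rng = random.Random(0)          # local, seeded RNG (assumption for repeatability)
core = "1ab4567890123"          # hypothetical alphanumeric meter number
spaced = chunk_with_spaces(core, rng)
print(spaced)

# Removing the spaces gives back the original sequence
assert spaced.replace(" ", "") == core
# No group is longer than 4 characters (the last one may be shorter than 3)
assert all(1 <= len(g) <= 4 for g in spaced.split(" "))
```

Because the spaces become part of the generated value, they are also part of the labeled span in the final dataset.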



In [21]:
# ── VERTRAGSNUMMER ────────────────────────────────────────────────
# Simulates realistic German contract numbers with optional spacing
# Example formats:
#   405 395 728 192
#   402123456789

def vertragsnummer() -> str:
    prefix = str(random.randint(400, 409))  # realistic prefix range
    suffix = str(random.randint(100_000_000, 999_999_999))  # 9-digit body

    if random.random() < 0.35:
        # Split the suffix into 3-digit groups with spaces
        suffix_spaced = " ".join(re.findall("...", suffix))
        return f"{prefix} {suffix_spaced}"
    
    return prefix + suffix
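
To make the two output shapes explicit, here is a hedged sketch of a format check. The regular expressions and the helper name `looks_like_vertragsnummer` are assumptions derived from the generator above (prefix 400 to 409, nine further digits, optional grouping in threes), not part of the original notebook.

```python
import re

# Compact form: prefix 400-409 followed by nine digits
COMPACT = re.compile(r"40\d\d{9}")
# Spaced form: prefix, then three space-separated groups of three digits
SPACED = re.compile(r"40\d \d{3} \d{3} \d{3}")

def looks_like_vertragsnummer(value: str) -> bool:
    """Return True if value matches one of the two generated layouts."""
    return bool(COMPACT.fullmatch(value) or SPACED.fullmatch(value))

print(looks_like_vertragsnummer("402123456789"))      # compact layout
print(looks_like_vertragsnummer("405 395 728 192"))   # spaced layout
print(looks_like_vertragsnummer("399123456789"))      # prefix outside 400-409
```

A check like this can be run over generated values to catch formatting regressions after changing the generator.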

In [22]:
# ── ZÄHLERSTAND ──────────────────────────────────────────────────

### Improved logic after evaluating the first synthetic dataset
# -> Restrict "kWh" variants to spellings people actually write: the earlier version combined eight random letter-case combinations with random internal spaces and produced forms like "Kw H" or "kWH", which almost never occur in real emails.
# -> Strip any stray spaces: return f"{int_part}{decimals}{suffix}".strip()

_KWH_VARIANTS = ["kWh", "kwh", "KWh", "KWH"]

def zaehlstand() -> str:
    # 1) integer part
    value = random.randint(1, 9_999_999)
    if value >= 1000 and random.random() < 0.35:
        int_part = f"{value:,}".replace(",", ".")
    else:
        int_part = str(value)

    # 2) decimal part
    if random.random() < 0.5:
        dec_len = random.choice([1, 2])
        decimals = f",{random.randint(0, 10**dec_len - 1):0{dec_len}d}"
    else:
        decimals = ""

    # 3) unit
    if random.random() < 0.65:
        unit = random.choice(_KWH_VARIANTS)
        # space only 10% of the time
        spacer = " " if random.random() < 0.10 else ""
        suffix = f"{spacer}{unit}"
    else:
        suffix = ""

    return f"{int_part}{decimals}{suffix}"
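
The German number layout used here relies on a small formatting trick: Python's `,` format specifier inserts English thousands separators, which are then swapped for dots, and the fraction is attached with a decimal comma. A minimal sketch with hypothetical helper names:

```python
def german_thousands(value: int) -> str:
    """Format an integer with '.' as thousands separator (German style)."""
    return f"{value:,}".replace(",", ".")

def german_decimal(value: int, fraction: int, width: int) -> str:
    """Join integer part and zero-padded fraction with a decimal comma."""
    return f"{german_thousands(value)},{fraction:0{width}d}"

print(german_thousands(1234567))     # 1.234.567
print(german_decimal(12345, 7, 2))   # 12.345,07
print(german_decimal(980, 5, 1))     # 980,5
```

The zero padding matters: without it, a two-digit fraction of `7` would be rendered as `,7` instead of `,07`.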


In [23]:
# ── ZAHLUNG ──────────────────────────────────────────────────

### improved logic after evaluation results of first synthetic dataset
# -> Using integer cents instead of random.uniform: Floats can introduce odd rounding artifacts.
# -> Biasing toward smaller amounts: Invoices rarely top out at €50 000
# -> Formatting integer + decimal: Deciding 0–2 decimals, but base it on cent_part
# -> Euro token placement & spacing: Tightening the probabilities to mirror real invoices


import math, random

_EURO_TOKENS = ["€", "EUR", "Euro"]

def zahlung() -> str:
    # 1) sample a log-uniform amount between 10 and 50,000 euros, then convert to integer cents
    log_min, log_max = math.log(10), math.log(50_000)
    amount = math.exp(random.uniform(log_min, log_max))
    cents = int(amount * 100)
    euros, cent_part = divmod(cents, 100)

    # 2) choose decimals
    decimals = random.choices([0,1,2], weights=[0.4,0.3,0.3])[0]
    if decimals == 2:
        fmt = f"{euros:,}".replace(",",".") + f",{cent_part:02d}"
    elif decimals == 1:
        fmt = f"{euros:,}".replace(",",".") + f",{cent_part//10}"
    else:
        fmt = f"{euros:,}".replace(",",".")

    # 3) euro token placement
    r = random.random()
    if r < 0.10:
        pos, token = "before", random.choice(_EURO_TOKENS)
    elif r < 0.80:
        pos, token = "after", random.choice(_EURO_TOKENS)
    else:
        pos, token = None, ""
    space = " " if token and random.random() < 0.8 else ""

    if pos == "before":
        return f"{token}{space}{fmt}"
    elif pos == "after":
        return f"{fmt}{space}{token}"
    else:
        return fmt
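
Why log-uniform sampling biases toward smaller amounts: the logarithm of the amount is uniform, so every order of magnitude is equally likely, and the median is the geometric mean of the bounds, sqrt(10 * 50000), roughly 707 euros. The sketch below checks this empirically. It is an illustration under the same bounds as `zahlung()`, with a locally seeded RNG and the hypothetical helper `sample_log_uniform`.

```python
import math
import random

LOW, HIGH = 10.0, 50_000.0  # same bounds (in euros) as in zahlung()

def sample_log_uniform(rng: random.Random) -> float:
    """Draw one amount whose logarithm is uniform on [ln LOW, ln HIGH]."""
    return math.exp(rng.uniform(math.log(LOW), math.log(HIGH)))

rng = random.Random(42)  # local seeded RNG, only to make the sketch repeatable
samples = [sample_log_uniform(rng) for _ in range(20_000)]

# For a log-uniform distribution the median is the geometric mean of the bounds
median_theory = math.sqrt(LOW * HIGH)
share_below_median = sum(s < median_theory for s in samples) / len(samples)
share_below_1000 = sum(s < 1_000 for s in samples) / len(samples)

print(f"theoretical median: {median_theory:.2f} EUR")
print(f"share below median: {share_below_median:.3f}")
print(f"share below 1,000:  {share_below_1000:.3f}")
```

With a plain uniform distribution on the same range, only about 2% of the amounts would fall below 1,000 euros. Under log-uniform sampling it is a little more than half.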

In [24]:
# ── IBAN_DE ────────────────────────────────────────────────────────
# Generates a realistic-looking German IBAN (22 characters):
# Format: DEkk bbbb bbbb cccc cccc cc
# - 'DE': country code
# - 'kk': two check digits (random here, not a valid mod-97 checksum)
# - 'b': bank code, 'c': account number

def iban_de() -> str:
    # Faker's bban() returns the 18-character German bank code + account number
    bban = fake.bban()

    # Random check digits in the range IBANs use (02–98); the checksum is not validated
    check = f"{random.randint(2, 98):02d}"

    # Country code + check digits + BBAN = 22 characters
    return f"DE{check}{bban}"
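
The generator above does not produce a validated checksum. If valid check digits were wanted, they could be computed with the standard mod-97 procedure for IBANs. The following is a sketch of that procedure, not what this notebook does. The helper names are hypothetical, and the BBAN used is that of a widely published example IBAN.

```python
def iban_check_digits(country: str, bban: str) -> str:
    """Compute the two IBAN check digits for country code + BBAN (mod-97)."""
    # Move the country code to the end and append "00" as placeholder
    rearranged = bban + country + "00"
    # Replace letters by numbers: A=10, B=11, ..., Z=35 (digits stay unchanged)
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return f"{98 - int(numeric) % 97:02d}"

def is_valid_iban(iban: str) -> bool:
    """Check an IBAN (written without spaces) with the mod-97 rule."""
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1

bban = "370400440532013000"   # BBAN of a commonly cited example IBAN
iban = "DE" + iban_check_digits("DE", bban) + bban
print(iban)                   # DE89370400440532013000
print(is_valid_iban(iban))
```

For NER training the checksum rarely matters, since the model sees only the surface form. A validator like `is_valid_iban` is mainly useful when downstream rules filter on checksum validity.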

In [25]:
# ── BIC ─────────────────────────────────────────────────────────────
# Returns a realistic German BIC (Bank Identifier Code)
# Ensures that the returned code belongs to a German institution (country code 'DE')

def bic() -> str:
    try:
        # Preferred: use faker 19+ method
        code = fake.swift()
    except AttributeError:
        # Fallback for older versions
        code = fake.swift_ascii()

    # Ensure German BIC (country code at position 5–6)
    if code[4:6] == "DE":
        return code
    else:
        # Fallback to a valid German BIC (e.g. Deutsche Bank)
        return "DEUTDEFFXXX"

In [26]:
# ── GESENDET MIT ────────────────────────────────────────────────────────

### improved logic after evaluation results of first synthetic dataset
# -> Adjusting Qualifier Placement: Right now you sometimes end up with double‑qualifiers like “Gmail for Android” after already saying “using”. In English, you’d usually say either “Sent from my iPhone using Mail App for iOS” or “Sent from my iPhone for iOS” but not both
# -> Refine Probabilities for Realism
# -> Handling Punctuation Variants: People sometimes use a dash or parentheses instead of a space
# -> Adding a “no suffix” option



# ── building blocks ───────────────────────────────────────────────
PREFIXES_DE = [
    "Gesendet von meinem", "Von meinem", "Mit meinem",
    "Gesendet mit meinem", "Gesendet mit der", "Mit der"
]
PREFIXES_EN = ["Sent from my", "Sent using my"]

DEVICES = [
    "iPhone", "iPad", "MacBook Pro", "Samsung Galaxy S23",
    "Samsung Galaxy", "Google Pixel 8", "Fairphone 5",
    "Huawei P30", "Xiaomi Redmi Note 12", "Surface Pro 9",
    "Lenovo ThinkPad", "OnePlus 12", "Nokia 8.3",
    "BlackBerry Key2", "Galaxy Tab S9", "Steam Deck"
]

MAIL_APPS = [
    "Mail App", "Outlook", "Gmail", "GMX Mail", "web.de Mail",
    "Yahoo Mail", "Thunderbird", "Apple Mail", "BlueMail",
    "Telekom Mail", "Proton Mail", "Posteo", "Tutanota"
]

QUALIFIERS = ["", " für Android", " für iOS", " for Android", " for iOS", " Desktop"]

# ── generator ─────────────────────────────────────────────────────
import re

def gesendet_mit() -> str:
    # 20% English, 80% German
    is_english = random.random() < 0.20

    if is_english:
        prefix = random.choice(PREFIXES_EN)
    else:
        prefix = random.choice(PREFIXES_DE)

    device = random.choice(DEVICES)

    # ~10% chance of no app info at all
    if random.random() < 0.10:
        footer = f"{prefix} {device}"
    else:
        app = random.choice(MAIL_APPS)
        # Qualifier only if app present
        qual = random.choice(["", " for Android", " for iOS"]) if is_english else random.choice(["", " für Android", " für iOS"])
        # Choose separator style
        sep = random.choice([" ", " — ", " ("])
        suffix = f"{sep}{app}{qual}{')' if sep == ' (' else ''}"
        if is_english:
            footer = f"{prefix} {device} using{suffix}"
        else:
            footer = f"{prefix} {device}{suffix}"

    # Clean up whitespace
    footer = footer.strip()
    footer = re.sub(r"\s+", " ", footer)
    return footer


In [27]:
# ── BANK ────────────────────────────────────────────────────────────
# Returns a random German bank name using the schwifty bank registry.

!pip install schwifty --quiet

from schwifty import registry
import random

# Load all bank entries from the registry (returns a list of dicts)
bank_entries = registry.get("bank")  # [{'bank_code': '10000000', 'name': 'Bundesbank', ...}, …]

# Extract unique German bank names only (based on country code)
banks_de = list({entry["name"] for entry in bank_entries if entry.get("country_code") == "DE"})

def german_bank() -> str:
    """Return a randomly selected German bank name."""
    return random.choice(banks_de)



In [28]:
# ── PLACEHOLDER → GENERATOR MAP ──────────────────────────────────────
# Maps placeholder labels (e.g., <<VORNAME>>) to a corresponding
# Faker-based generator function that produces realistic German-style data.

from typing import Dict, Callable  # For static typing of the mapping

GEN: Dict[str, Callable[[], str]] = {
    "TITEL"         : lambda: fake.prefix().rstrip("."),             # e.g., "Dr", "Prof"
    "VORNAME"       : fake.first_name,                               # First name
    "NACHNAME"      : fake.last_name,                                # Last name
    "FIRMA"         : fake.company,                                  # Company name
    "TELEFONNUMMER" : fake.phone_number,                             # Phone number
    "EMAIL"         : fake.email,                                    # Email address
    "FAX"           : fake.phone_number,                             # Fax number (reusing phone)
    "STRASSE"       : fake.street_name,                              # Street name
    "HAUSNUMMER"    : german_house_number,                           # House number (custom)
    "POSTLEITZAHL"  : fake.postcode,                                 # Postal code (PLZ)
    "WOHNORT"       : fake.city,                                     # City
    "ZÄHLERNUMMER"  : zaehlernummer,                                 # Meter number (custom)
    "ZÄHLERSTAND"   : zaehlstand,                                    # Meter reading (custom)
    "VERTRAGSNUMMER": vertragsnummer,                                # Contract number (custom)
    "ZAHLUNG"       : zahlung,                                       # Payment amount (custom)
    "BANK"          : german_bank,                                   # Bank name (custom)
    "IBAN"          : iban_de,                                       # German IBAN (custom)
    "BIC"           : bic,                                           # BIC (custom)
    "DATUM"         : lambda: fake.date(pattern="%d.%m.%Y"),         # Date in DD.MM.YYYY format
    "GESENDET_MIT"  : gesendet_mit,                                  # Email footer (custom)
    "LINK"          : fake.uri                                       # Website URL
}

In [29]:
# ── PLACEHOLDER SUBSTITUTION HELPER ──────────────────────────────────────
# Maps all possible aliases (e.g., <<CITY>>, <<ORT>>, etc.)
# to canonical placeholder keys defined in the GEN dictionary.

_alias_to_key = {
    alias: key
    for key, aliases in {
        "TITEL":["TITEL"], "VORNAME":["VORNAME"], "NACHNAME":["NACHNAME"],
        "FIRMA":["FIRMA"], "TELEFONNUMMER":["TELEFONNUMMER"], "EMAIL":["EMAIL"],
        "FAX":["FAX"], "STRASSE":["STRASSE"], "HAUSNUMMER":["HAUSNUMMER"],
        "POSTLEITZAHL":["POSTLEITZAHL"],
        "WOHNORT":["WOHNORT","ORT","CITY"],
        "ZÄHLERNUMMER":["ZÄHLERNUMMER"],
        "ZÄHLERSTAND":["ZÄHLERSTAND"],
        "VERTRAGSNUMMER":["VERTRAGSNUMMER","ANGEBOTSNUMMER","KUNDENNUMMER"],
        "ZAHLUNG":["BETRAG","ZAHLUNG"],
        "BANK":["BANK"], "IBAN":["IBAN"], "BIC":["BIC"],
        "DATUM":["DATUM","DATE"], "GESENDET_MIT":["GESENDET_MIT"], "LINK":["LINK"],
    }.items() for alias in aliases
}
_pattern = re.compile(r"<<\s*([^\s<>]+?)\s*>>")
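
A short illustration of how the pattern and the alias map work together. The template string and the reduced alias dictionary are hypothetical; the regular expression is the same as `_pattern` above.

```python
import re

# Same expression as _pattern: optional whitespace inside the angle brackets
pattern = re.compile(r"<<\s*([^\s<>]+?)\s*>>")

# Small excerpt of the alias mapping, for illustration only
alias_to_key = {"ORT": "WOHNORT", "BETRAG": "ZAHLUNG", "ZÄHLERNUMMER": "ZÄHLERNUMMER"}

template = "Zähler <<ZÄHLERNUMMER>> in << ORT >>, offener Betrag: <<BETRAG>>"

aliases = pattern.findall(template)           # names as written in the template
keys = [alias_to_key[a] for a in aliases]     # canonical keys used in GEN

print(aliases)   # ['ZÄHLERNUMMER', 'ORT', 'BETRAG']
print(keys)      # ['ZÄHLERNUMMER', 'WOHNORT', 'ZAHLUNG']
```

Note that an alias missing from the mapping raises a `KeyError`, so every placeholder name used in the templates has to be listed there.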

In [30]:
# ── TEMPLATE SELECTION BASED ON LABEL DEFICITS (FOR BALANCING LABELS) ──────────────────────────────────────
import random

def weighted_choice(templates, label_counts, observed_counts, target_dist):
    """
    Selects one template from the list, weighted by how much it helps
    reduce the imbalance between current and target label distributions.

    Parameters:
    - templates: list of available template strings
    - label_counts: dict with how many times each label has been used
    - observed_counts: dict of all labels observed across templates
    - target_dist: target count for each label

    Returns:
    - A template string selected based on label underrepresentation.
    """
    # compute a “deficit” for each label
    deficits = {lbl: target_dist[lbl] - label_counts[lbl]
                for lbl in observed_counts}

    # score each template by summing deficits of the labels it contains
    template_scores = []
    for t in templates:
        labels_in_t = re.findall(_pattern, t)  # list of aliases
        keys_in_t   = [_alias_to_key[lab] for lab in labels_in_t]
        # sum only positive deficits
        score = sum(max(deficits.get(key,0), 0) for key in keys_in_t)
        # ensure a minimum weight
        template_scores.append(score + 1e-3)

    # normalize and pick one
    total = sum(template_scores)
    weights = [s/total for s in template_scores]
    return random.choices(templates, weights)[0]
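
A worked toy example of the deficit scoring may help. It is a simplified sketch: the counts, targets, and templates are made up, the helper `template_score` is hypothetical, and the alias lookup is skipped because the placeholders already use canonical names.

```python
import re
from collections import Counter

pattern = re.compile(r"<<\s*([^\s<>]+?)\s*>>")

def template_score(template: str, label_counts: Counter, target: dict) -> float:
    """Sum of the positive deficits of the labels in a template, plus a small floor."""
    labels = pattern.findall(template)
    deficit_sum = sum(max(target.get(lbl, 0) - label_counts[lbl], 0) for lbl in labels)
    return deficit_sum + 1e-3

# Hypothetical state: IBAN has reached its target, BIC and DATUM have not
target = {"IBAN": 10, "BIC": 10, "DATUM": 10}
label_counts = Counter({"IBAN": 10, "BIC": 2, "DATUM": 7})
templates = ["Meine IBAN: <<IBAN>>", "BIC <<BIC>> vom <<DATUM>>"]

scores = [template_score(t, label_counts, target) for t in templates]
weights = [s / sum(scores) for s in scores]

print(scores)                          # second template: deficits 8 + 3
print([round(w, 4) for w in weights])  # almost all weight on the second template
```

One consequence of this scoring: once every label count has reached its target, all deficits are clipped to zero, every template receives the same floor weight of `1e-3`, and the selection becomes uniform over the templates.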

In [50]:
# ── GENERATE LABELED SYNTHETIC EMAILS WITH BALANCED ENTITY DISTRIBUTION ──────────────────────────────────────

import json
import re
from collections import Counter

# -- Assumes the helper functions and mappings (weighted_choice, _pattern, _alias_to_key, GEN) are already defined --

# 1. Load paraphrases JSON
with open("option_b_paraphrased.json", encoding="utf-8") as f:
    entries = json.load(f)

# 2. Flatten templates
templates = [tpl for entry in entries for tpl in entry["paraphrases"]]

# 3. Compute observed label counts
observed_counts = Counter()
for t in templates:
    for alias in re.findall(_pattern, t):
        observed_counts[_alias_to_key[alias]] += 1

# 4. Define target distribution (equalize to max observed)
max_obs = max(observed_counts.values())
target_dist = { label: max_obs for label in observed_counts }

# 5. Define fill_and_label to replace placeholders and record spans
def fill_and_label(template: str):
    parts = []
    labels = []
    last_index = 0
    for match in re.finditer(_pattern, template):
        alias = match.group(1)
        key = _alias_to_key[alias]
        value = GEN[key]()
        # Append text before placeholder
        parts.append(template[last_index:match.start()])
        # Record start and end in new text
        start = sum(len(p) for p in parts)
        parts.append(value)
        end = start + len(value)
        labels.append({"start": start, "end": end, "label": key})
        last_index = match.end()
    # Append remainder
    parts.append(template[last_index:])
    return "".join(parts), labels

# 6. Generate labeled, balanced dataset
def generate_labeled_dataset(templates, N):
    label_counts = Counter()
    outputs = []
    for i in range(N):
        tpl = weighted_choice(templates, label_counts, observed_counts, target_dist)
        text, labels = fill_and_label(tpl)
        # update counts
        for lab in labels:
            label_counts[lab['label']] += 1
        outputs.append({"file": str(i+1), "text": text, "labels": labels})
    return outputs, label_counts

# Generate 14,360 examples (to match the 14,360 samples of synthetic dataset A)
N = 14_360
generated, final_counts = generate_labeled_dataset(templates, N)

# 7. Sanity check: print first two entries and final label frequencies
print("=== Sample Outputs ===")
for entry in generated[:2]:
    print(json.dumps(entry, ensure_ascii=False, indent=2))
print("\n=== Final Label Frequencies ===")
print(final_counts)

# 8. Save to JSON
with open("synthetic_emails_labeled.json", "w", encoding="utf-8") as f:
    json.dump(generated, f, ensure_ascii=False, indent=2)

print(f"\nGenerated {len(generated)} examples and saved to synthetic_emails_labeled.json")

=== Sample Outputs ===
{
  "file": "1",
  "text": "Sehr geehrte Damen und Herren, wir bitten um die Reduzierung der Abschlagskosten der Ehepaar Schmiedt, da das Anwesen in Girschnerplatz 3-5 in 81121 Gadebusch ab dem 07.10.1982 unbewohnt ist und verkauft wird. Mit freundlichen Grüßen Serpil Finke Steinberg AG Bärbel-Trommler-Weg 1-3 85894 Lichtenfels Tel.: +49(0)9097623861 http://knappe.net/wp-content/blogfaq.html https://klotz.de/category/category/postsindex.html http://etzold.de/search/listabout.jsp",
  "labels": [
    {
      "start": 93,
      "end": 101,
      "label": "NACHNAME"
    },
    {
      "start": 121,
      "end": 135,
      "label": "STRASSE"
    },
    {
      "start": 136,
      "end": 139,
      "label": "HAUSNUMMER"
    },
    {
      "start": 143,
      "end": 148,
      "label": "POSTLEITZAHL"
    },
    {
      "start": 149,
      "end": 158,
      "label": "WOHNORT"
    },
    {
      "start": 166,
      "end": 176,
      "label": "DATUM"
    },
    {
      "st

In [51]:
# ── DOWNLOAD JSON FILE TO LOCAL MACHINE ──────────────────────────────────────
from google.colab import files
files.download("synthetic_emails_labeled.json")
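
Before using the file for training, the spans can be verified by slicing the text with each `start`/`end` pair. The sketch below does this for the beginning of the first sample shown in the output above, using its first six labels. The helper `check_spans` is hypothetical and not part of the notebook.

```python
from typing import Dict, List, Tuple

def check_spans(text: str, labels: List[Dict]) -> List[Tuple[str, str]]:
    """Return (label, surface text) pairs; raise ValueError on bad spans."""
    result, previous_end = [], 0
    for lab in sorted(labels, key=lambda item: item["start"]):
        start, end = lab["start"], lab["end"]
        # Spans must lie inside the text, be non-empty, and not overlap
        if not (previous_end <= start < end <= len(text)):
            raise ValueError(f"invalid or overlapping span: {lab}")
        result.append((lab["label"], text[start:end]))
        previous_end = end
    return result

# Beginning of the first generated sample, copied from the output above
text = ("Sehr geehrte Damen und Herren, wir bitten um die Reduzierung der "
        "Abschlagskosten der Ehepaar Schmiedt, da das Anwesen in "
        "Girschnerplatz 3-5 in 81121 Gadebusch ab dem 07.10.1982 unbewohnt ist")
labels = [
    {"start": 93, "end": 101, "label": "NACHNAME"},
    {"start": 121, "end": 135, "label": "STRASSE"},
    {"start": 136, "end": 139, "label": "HAUSNUMMER"},
    {"start": 143, "end": 148, "label": "POSTLEITZAHL"},
    {"start": 149, "end": 158, "label": "WOHNORT"},
    {"start": 166, "end": 176, "label": "DATUM"},
]

for label, surface in check_spans(text, labels):
    print(f"{label:13s} {surface!r}")
```

Running the same check over all records in `synthetic_emails_labeled.json` would confirm that the offsets recorded by `fill_and_label` line up with the inserted values.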
