# 🧪 Option A – Synthetic Email Generation with Faker

This notebook implements the **synthetic data generation pipeline** using placeholder substitution and the `Faker` library. Starting from paraphrased German email templates (produced via Option A: backtranslation), the notebook replaces placeholder tags such as `<<VORNAME>>`, `<<ZAHLUNG>>`, and `<<ZÄHLERSTAND>>` with **realistic, randomly generated values**. 

The key steps include:
- Custom generators for fields like meter IDs, bank names, IBANs, and German-style house numbers.
- Mapping of each placeholder tag to a generator.
- Replacing placeholders with Faker values while tracking **exact character offsets** for each synthetic entity.
- Producing a large set of fully synthetic emails in **NER-friendly JSON format**, compatible with spaCy.
- Writing the output to disk, with an example preview included.

This approach enables the creation of a **privacy-compliant, labeled dataset** for training and evaluating Named Entity Recognition (NER) models on German customer communication — without using any real personal data.

In [16]:
# Clone the GitHub repository and change into its data folder.
# This is required because the notebook runs in the Google Colab environment.
!git clone https://github.com/AnnaGhost2713/daia-eon.git
%cd daia-eon/data

Cloning into 'daia-eon'...
remote: Enumerating objects: 1007, done.
remote: Counting objects: 100% (135/135), done.
remote: Compressing objects: 100% (99/99), done.
remote: Total 1007 (delta 56), reused 95 (delta 35), pack-reused 872 (from 1)
Receiving objects: 100% (1007/1007), 3.47 MiB | 13.02 MiB/s, done.
Resolving deltas: 100% (561/561), done.
/content/daia-eon/notebooks/daia-eon/data


In [17]:
# --- Step 1: Install and Import Dependencies ---
!pip install faker
from faker import Faker
from faker.providers import bank, internet, misc, date_time
import itertools, json, random, re, string

# --- Step 2: Initialize Faker for German Locale ---
fake = Faker("de_DE")            # Generate German-style names, addresses, etc.
fake.add_provider(bank)          # Add financial fields (IBAN, SWIFT)
fake.add_provider(internet)      # Add email and URL generators
fake.add_provider(misc)          # Add miscellaneous generators
fake.add_provider(date_time)     # Add date and time generators

random.seed()                    # Set random seed (optional: pass a value for reproducibility)
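
For reproducible runs, both random sources need a fixed seed. The standard-library `random` module drives the custom generators in this notebook, while Faker keeps its own generator, which is seeded separately through `Faker.seed()`. Below is a minimal sketch of the stdlib half; the seed value and the helper `sample_digits` are illustrative assumptions, not part of the notebook.

```python
import random
import string

SEED = 42  # arbitrary illustrative value, not taken from the notebook

def sample_digits(k: int = 8) -> str:
    """Hypothetical helper: draw k random digits from the global RNG."""
    return "".join(random.choices(string.digits, k=k))

random.seed(SEED)
first_run = [sample_digits() for _ in range(3)]

random.seed(SEED)  # re-seeding restarts the same pseudo-random sequence
second_run = [sample_digits() for _ in range(3)]

print(first_run == second_run)  # True

# Faker-backed fields need their own seed as well, e.g. Faker.seed(SEED)
# before the first call to fake.first_name() and similar methods.
```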



In [18]:
# --- Step 3: Generate Realistic German House Numbers ---

def german_house_number():
    """
    Generate a plausible German house number, e.g., '23', '41a', '102 B'.
    Follows typical conventions: numeric base + optional letter (upper/lower).
    """
    num = random.randint(1, 2000)          # Base number between 1 and 2000
    if random.random() < 0.50:             # ~50% are pure numbers (no letter)
        return str(num)

    # Choose a letter (a–z or A–Z)
    letter = random.choice(string.ascii_lowercase + string.ascii_uppercase)

    # ~50% chance of adding a space between number and letter
    sep = " " if random.random() < 0.50 else ""

    return f"{num}{sep}{letter}"
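
One way to sanity-check generated house numbers is a shape test with a regular expression. This sketch assumes the shape given in the docstring (a number without a leading zero, an optional space, an optional single ASCII letter). The names `HOUSE_NUMBER_RE` and `looks_like_house_number` are hypothetical, and the pattern checks only the shape, not the 1 to 2000 range.

```python
import re

# Assumed shape: 1-4 digits without leading zero, optional space,
# optional single ASCII letter (upper or lower case).
HOUSE_NUMBER_RE = re.compile(r"[1-9]\d{0,3}(?: ?[A-Za-z])?")

def looks_like_house_number(value: str) -> bool:
    """Return True if value matches the assumed house-number shape."""
    return HOUSE_NUMBER_RE.fullmatch(value) is not None

for candidate in ["23", "41a", "102 B", "0", "12 ab"]:
    print(candidate, looks_like_house_number(candidate))
```

The first three candidates are the docstring examples and match; `"0"` and `"12 ab"` do not.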

In [19]:
# --- Step 4: Generate Realistic Zählernummer (Meter Numbers) ---

# Helper function: insert optional random spaces into a string
def insert_random_spaces(seq: str, prob: float = 0.4) -> str:
    """
    Randomly inserts spaces into the input string with a given probability.
    - Groups characters in chunks of 1–4 before adding spaces.
    - Returns original sequence unchanged if random check fails.
    """
    if random.random() > prob:
        return seq
    out, i = [], 0
    while i < len(seq):
        grp_len = random.randint(1, 4)
        out.append(seq[i: i + grp_len])
        i += grp_len
    return " ".join(out)

# --- main generator ----------------------------------------------
def zaehlernummer() -> str:
    """
    Generate a plausible German-style meter number with random formatting.
    Variants (with respective probabilities):
    1) 40% → Alphanumeric (e.g. 1GMT00984726553)
    2) 30% → Pure digits (e.g. 486498046387)
    3) 30% → Digits + hyphen + year (e.g. 63746253-1992)

    Each variant may optionally contain internal spaces.
    """
    r = random.random()

    if r < 0.4:                                 # --- variant 1
        prefix  = str(random.randint(1, 9))
        letters = ''.join(random.choices(string.ascii_uppercase,
                                         k=random.randint(2, 4)))
        digits  = ''.join(random.choices(string.digits,
                                         k=random.randint(7, 12)))
        core = prefix + letters + digits
        return insert_random_spaces(core)

    elif r < 0.7:                               # --- variant 2
        digits = ''.join(random.choices(string.digits,
                                        k=random.randint(5, 12)))
        return insert_random_spaces(digits)

    else:                                       # --- variant 3
        left  = ''.join(random.choices(string.digits,
                                       k=random.randint(5, 8)))
        year  = str(random.randint(1900, 2099))
        core  = f"{left}-{year}"
        return insert_random_spaces(core, prob=0.25)  # fewer spaces here
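
To spot-check the distribution of generated meter numbers, each value can be mapped back to the variant it came from. The classifier below is a hedged sketch. `classify_meter_number` is a hypothetical helper, and its patterns are derived from the length ranges used in the generator above.

```python
import re

def classify_meter_number(value: str) -> str:
    """Hypothetical helper: name the variant a meter number belongs to."""
    compact = value.replace(" ", "")  # undo the optional internal spacing
    if re.fullmatch(r"[1-9][A-Z]{2,4}\d{7,12}", compact):
        return "alphanumeric"   # variant 1: digit + letters + digits
    if re.fullmatch(r"\d{5,8}-\d{4}", compact):
        return "digits-year"    # variant 3: digits, hyphen, four-digit year
    if re.fullmatch(r"\d{5,12}", compact):
        return "digits"         # variant 2: digits only
    return "unknown"

for sample in ["1GMT00984726553", "4864 98046 387", "63746253-1992"]:
    print(sample, "->", classify_meter_number(sample))
```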

In [20]:
# --- Step 5: Generate Realistic Vertragsnummer (Contract Number) ---

def vertragsnummer() -> str:
    """
    Generate a plausible German contract number.
    Format:
    - Prefix: three-digit number between 400–409
    - Main part: nine-digit random number
    - With 35% probability: insert spaces every 3 digits in the main part

    Examples:
    - "407123456789"
    - "401 123 456 789"
    """
    a = str(random.randint(400, 409))  # Contract type prefix
    b = str(random.randint(100_000_000, 999_999_999))  # 9-digit ID

    if random.random() < 0.35:
        # Split `b` into 3-digit groups with spaces (e.g., "123 456 789")
        b_spaced = " ".join(re.findall("...", b))
        return f"{a} {b_spaced}"

    return a + b

In [21]:
# --- Step 6: Generate Realistic Zählerstand (Meter Reading) ---

# Precompute all 8 upper/lower-case variants of "kWh" (e.g., kwh, KWh, KWH)
_KWH_VARIANTS = [''.join(p) for p in itertools.product(
    ('k', 'K'), ('w', 'W'), ('h', 'H')
)]

def zaehlstand() -> str:
    """
    Generate a realistic German electricity meter reading.
    Examples include:
    - "1234567"
    - "1.234 kWh"
    - "7.890.123,45 KWh"
    - "987,6KWH"

    Variability includes:
    - Optional thousands separators (dots)
    - Optional decimal part (comma-separated)
    - Optional unit (kWh in random casing, with or without a leading space)
    """
    # 1. Integer base value (1 to 9,999,999)
    value = random.randint(1, 9_999_999)

    # Optionally format with dot-separated thousands
    if value >= 1000 and random.random() < 0.35:
        int_part = f"{value:,}".replace(",", ".")  # e.g., "1.234.567"
    else:
        int_part = str(value)

    # 2. Optional decimal part (e.g., ",45")
    if random.random() < 0.5:
        dec_len = random.choice([1, 2])
        decimals = f",{random.randint(0, 10**dec_len - 1):0{dec_len}d}"
    else:
        decimals = ""

    # 3. Optional unit suffix (e.g., " kWh", "KWH")
    if random.random() < 0.65:
        unit = random.choice(_KWH_VARIANTS)
        spacer = " " if random.random() < 0.5 else ""
        suffix = f"{spacer}{unit}"
    else:
        suffix = ""

    return f"{int_part}{decimals}{suffix}"
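
A round-trip parser is a convenient way to confirm that the generated readings stay machine-readable. The following sketch assumes the format described in the docstring (dot-separated thousands, comma decimals, optional kWh unit in any casing). `parse_reading` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Optional, Tuple

# Integer part (digits and dots), optional ",dd" decimals, optional unit.
_READING_RE = re.compile(r"(\d[\d.]*)(?:,(\d+))?\s*([kK][wW][hH])?")

def parse_reading(raw: str) -> Optional[Tuple[float, bool]]:
    """Hypothetical helper: return (value, has_unit), or None if no match."""
    m = _READING_RE.fullmatch(raw.strip())
    if m is None:
        return None
    int_part = m.group(1).replace(".", "")  # drop thousands separators
    dec_part = m.group(2) or "0"
    return float(f"{int_part}.{dec_part}"), m.group(3) is not None

for sample in ["1234567", "1.234 kWh", "7.890.123,45 KWh", "987,6KWH"]:
    print(sample, "->", parse_reading(sample))
```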


In [22]:
# --- Step 7: Generate Realistic Payment Amount (Zahlung) ---

import random

# ── all common Euro tokens, upper-/lower-case variants ───────────
_EURO_TOKENS = ["€", " EUR", "EUR", " Euro", "Euro", " EURO",
                " eur", "eur", "EURO"]

def zahlung() -> str:
    """
    Builds a German-style payment amount such as
      512,30€
      € 9.800
      12.345,6 Euro
      EUR 1.234,56
      7400
    """
    # 1.  choose magnitude 10 … 50 000  (tweak upper bound as needed)
    amount = random.uniform(10, 50_000)

    # 2.  integer / decimal decision
    decimals = random.choices([0, 1, 2], weights=[0.4, 0.3, 0.3])[0]
    fmt = f"{{:,.{decimals}f}}".format(amount).replace(",", "X").replace(".", ",").replace("X", ".")
    # German format → thousands '.'  decimal ','

    # strip trailing ",0" or ",00" if decimals==0
    if decimals == 0:
        fmt = fmt.split(",")[0]

    # 3.  euro token (or none) and position
    # Strip the token so the value never starts or ends with whitespace,
    # which would otherwise end up inside the labeled entity span.
    token = random.choice(_EURO_TOKENS + [""]).strip()   # ~10 % chance of empty
    before = bool(token) and random.random() < 0.25      # 25 % of tokens lead, e.g. "€ 123"

    # optional space between token and amount
    space = " " if random.random() < 0.6 else ""         # 60 % get a space

    if not token:
        return fmt                                       # bare amount, no currency
    if before:
        return f"{token}{space}{fmt}"
    return f"{fmt}{space}{token}"                        # token trails the amount
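
The separator swap used for German number formatting can be isolated into a small helper. This is a minimal sketch; `format_de` is a hypothetical name. It formats the number in the default English style first and then exchanges the two separators through a temporary marker.

```python
def format_de(amount: float, decimals: int = 2) -> str:
    """Format a number with '.' as thousands and ',' as decimal separator."""
    us_style = f"{amount:,.{decimals}f}"  # e.g. "12,345.60"
    # Swap the separators via a temporary marker so they do not collide.
    return us_style.replace(",", "X").replace(".", ",").replace("X", ".")

print(format_de(12345.6, 2))  # 12.345,60
print(format_de(9800, 0))     # 9.800
print(format_de(512.3, 1))    # 512,3
```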

In [23]:
# --- Step 8: Generate a German-Style IBAN ---

def iban_de() -> str:
    """
    Generate a German IBAN using Faker's bank provider.

    A German IBAN has 22 characters: 'DE', two check digits, and an
    18-digit BBAN. `fake.iban()` builds all three parts, whereas
    'DE' + `fake.bban()` alone would omit the check digits.

    Note: The account data is random and does not refer to a real account.
    """
    return fake.iban()
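
For reference, a valid IBAN carries two check digits between the country code and the BBAN, computed with the mod-97 scheme from ISO 13616. The sketch below shows that computation independently of Faker. The function names are hypothetical, and the BBAN is the widely published German example value, not data from this project.

```python
def iban_check_digits(country: str, bban: str) -> str:
    """Compute the two IBAN check digits (mod-97 scheme)."""
    rearranged = bban + country + "00"
    # Letters become numbers (A=10, ..., Z=35); digits stay as they are.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged.upper())
    return f"{98 - int(numeric) % 97:02d}"

def is_valid_iban(iban: str) -> bool:
    """Check the mod-97 condition on a compact or space-grouped IBAN."""
    compact = iban.replace(" ", "").upper()
    rearranged = compact[4:] + compact[:4]  # move country + check to the end
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1

bban = "370400440532013000"  # widely used example BBAN
iban = "DE" + iban_check_digits("DE", bban) + bban
print(iban, is_valid_iban(iban))
```

The validity check covers only the checksum; it does not verify the length or the existence of the bank code.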

In [24]:
# --- Step 9: Generate a German BIC (Bank Identifier Code) ---

def bic() -> str:
    """
    Generate a German-style BIC using Faker's SWIFT/BIC provider.
    Falls back to a default German BIC if the generated code does not have 'DE' as country code.
    
    Returns:
        str: A realistic or fallback German BIC code.
    """
    try:
        code = fake.swift()           # Newer versions of Faker
    except AttributeError:
        code = fake.swift_ascii()     # Fallback for older Faker versions

    return code if code[4:6] == "DE" else "DEUTDEFFXXX"
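
The slice `code[4:6]` relies on the fixed BIC layout: a four-character institution code, a two-letter country code, a two-character location code, and an optional three-character branch code. Here is a hedged sketch of a structural check under that assumed layout; `BIC_RE` and `bic_country` are hypothetical names.

```python
import re
from typing import Optional

# Assumed layout: 4 letters (institution), 2 letters (country),
# 2 alphanumerics (location), optional 3 alphanumerics (branch).
BIC_RE = re.compile(r"([A-Z]{4})([A-Z]{2})([A-Z0-9]{2})([A-Z0-9]{3})?")

def bic_country(code: str) -> Optional[str]:
    """Return the country code of a well-formed BIC, else None."""
    m = BIC_RE.fullmatch(code)
    return m.group(2) if m else None

print(bic_country("DEUTDEFFXXX"))  # DE
print(bic_country("DEUTDEFF"))     # DE
print(bic_country("deut"))         # None
```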

In [25]:
# --- Step 10: Generate realistic German/English email footer (e.g. "Sent from my iPhone") ---

# ── building blocks ───────────────────────────────────────────────
PREFIXES_DE = [
    "Gesendet von meinem", "Von meinem", "Mit meinem",
    "Gesendet mit meinem", "Gesendet mit der", "Mit der"
]
PREFIXES_EN = ["Sent from my", "Sent using my"]

DEVICES = [
    "iPhone", "iPad", "MacBook Pro", "Samsung Galaxy S23",
    "Samsung Galaxy", "Google Pixel 8", "Fairphone 5",
    "Huawei P30", "Xiaomi Redmi Note 12", "Surface Pro 9",
    "Lenovo ThinkPad", "OnePlus 12", "Nokia 8.3",
    "BlackBerry Key2", "Galaxy Tab S9", "Steam Deck"
]

MAIL_APPS = [
    "Mail App", "Outlook", "Gmail", "GMX Mail", "web.de Mail",
    "Yahoo Mail", "Thunderbird", "Apple Mail", "BlueMail",
    "Telekom Mail", "Proton Mail", "Posteo", "Tutanota"
]

QUALIFIERS = ["", " für Android", " für iOS", " for Android", " for iOS", " Desktop"]

# ── generator ─────────────────────────────────────────────────────
def gesendet_mit() -> str:
    """Return a varied German/English mobile mail footer."""
    # Choose language flavour (30 % English, 70 % German)
    if random.random() < 0.30:
        prefix = random.choice(PREFIXES_EN)
        device = random.choice(DEVICES)
        # ~50 % add app + qualifier
        if random.random() < 0.5:
            app = random.choice(MAIL_APPS)
            qual = random.choice(QUALIFIERS).strip()
            return f"{prefix} {device} using {app}{(' ' + qual) if qual else ''}".strip()
        return f"{prefix} {device}"

    # German variant
    prefix = random.choice(PREFIXES_DE)
    device = random.choice(DEVICES)
    # ~65 % append "<App> <Qualifier>" after the device name
    if random.random() < 0.65:
        app = random.choice(MAIL_APPS)
        qual = random.choice(QUALIFIERS).strip()
        suffix = f" {app}{(' ' + qual) if qual else ''}"
    else:
        suffix = ""
    return f"{prefix} {device}{suffix}".strip()


In [26]:
# --- Step 11: Generate random German bank name using IBAN registry ---

# Install dependency for IBAN/bank data
!pip install schwifty

from schwifty import registry
import random

# Load all bank entries from the registry (returns a list of dicts)
bank_entries = registry.get("bank")  # e.g. [{'bank_code': '10000000', 'name': 'Bundesbank', ...}, …]

# Filter: keep only German banks (country_code == 'DE'), and de-duplicate names
banks_de = list({e["name"] for e in bank_entries if e.get("country_code") == "DE"})

# Generator: return a random German bank name
def german_bank() -> str:
    return random.choice(banks_de)



In [27]:
# --- Step 12: Map placeholder tags to corresponding data generators ---
from typing import Dict, Callable  # For better type safety

# Mapping of entity placeholders to synthetic data generator functions
GEN: Dict[str, Callable[[], str]] = {
    "TITEL"         : lambda: fake.prefix().rstrip("."),  # Remove trailing period (e.g., "Dr.")
    "VORNAME"       : fake.first_name,
    "NACHNAME"      : fake.last_name,
    "FIRMA"         : fake.company,
    "TELEFONNUMMER" : fake.phone_number,
    "EMAIL"         : fake.email,
    "FAX"           : fake.phone_number,
    "STRASSE"       : fake.street_name,
    "HAUSNUMMER"    : german_house_number,
    "POSTLEITZAHL"  : fake.postcode,
    "WOHNORT"       : fake.city,
    "ZÄHLERNUMMER"  : zaehlernummer,
    "ZÄHLERSTAND"   : zaehlstand,
    "VERTRAGSNUMMER": vertragsnummer,
    "ZAHLUNG"       : zahlung,
    "BANK"          : german_bank,
    "IBAN"          : iban_de,
    "BIC"           : bic,
    "DATUM"         : lambda: fake.date(pattern="%d.%m.%Y"),
    "GESENDET_MIT"  : gesendet_mit,
    "LINK"          : fake.uri,
}

In [28]:
# ── Step 13: Placeholder substitution helper ───────────────────────

# Maps placeholder aliases (e.g. ORT, CITY) to their canonical keys (e.g. WOHNORT)
_alias_to_key = {
    alias: key
    for key, aliases in {
        "TITEL"         : ["TITEL"],
        "VORNAME"       : ["VORNAME"],
        "NACHNAME"      : ["NACHNAME"],
        "FIRMA"         : ["FIRMA"],
        "TELEFONNUMMER" : ["TELEFONNUMMER"],
        "EMAIL"         : ["EMAIL"],
        "FAX"           : ["FAX"],
        "STRASSE"       : ["STRASSE"],
        "HAUSNUMMER"    : ["HAUSNUMMER"],
        "POSTLEITZAHL"  : ["POSTLEITZAHL"],
        "WOHNORT"       : ["WOHNORT", "ORT", "CITY"],
        "ZÄHLERNUMMER"  : ["ZÄHLERNUMMER"],
        "ZÄHLERSTAND"   : ["ZÄHLERSTAND"],
        "VERTRAGSNUMMER": ["VERTRAGSNUMMER", "ANGEBOTSNUMMER", "KUNDENNUMMER"],
        "ZAHLUNG"       : ["ZAHLUNG", "BETRAG"],
        "BANK"          : ["BANK"],
        "IBAN"          : ["IBAN"],
        "BIC"           : ["BIC"],
        "DATUM"         : ["DATUM", "DATE"],
        "GESENDET_MIT"  : ["GESENDET_MIT"],
        "LINK"          : ["LINK"],
    }.items()
    for alias in aliases
}

# Regex pattern to detect <<PLACEHOLDER>> fields in text
_pattern = re.compile(r"<<\s*([^\s<>]+?)\s*>>")

# Main substitution function
def substitute_placeholders(text: str) -> str:
    """
    Replaces all <<PLACEHOLDER>> tags in the input text using the GEN mapping.
    If a placeholder alias is unrecognized, it is left unchanged.
    """
    def repl(match):
        alias = match.group(1)
        key   = _alias_to_key.get(alias)
        return GEN[key]() if key in GEN else match.group(0)

    return _pattern.sub(repl, text)
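
The interplay of alias lookup, regex matching, and the fallback for unknown tags is easiest to see with fixed values. This self-contained sketch uses stub generators so the output is deterministic. All `DEMO_*` names and `demo_substitute` are illustrative stand-ins for `GEN`, `_alias_to_key`, `_pattern`, and `substitute_placeholders`.

```python
import re
from typing import Callable, Dict

# Stub generators with fixed values so the result is deterministic.
DEMO_GEN: Dict[str, Callable[[], str]] = {
    "VORNAME": lambda: "Erika",
    "WOHNORT": lambda: "Kleve",
}
# Alias -> canonical key, as in the notebook's inverted alias table.
DEMO_ALIASES = {"VORNAME": "VORNAME", "WOHNORT": "WOHNORT", "ORT": "WOHNORT"}
DEMO_PATTERN = re.compile(r"<<\s*([^\s<>]+?)\s*>>")

def demo_substitute(text: str) -> str:
    def repl(match):
        key = DEMO_ALIASES.get(match.group(1))
        # Unknown aliases map to None and are left untouched.
        return DEMO_GEN[key]() if key in DEMO_GEN else match.group(0)
    return DEMO_PATTERN.sub(repl, text)

print(demo_substitute("Gruß <<VORNAME>> aus << ORT >>, Ref <<UNBEKANNT>>"))
# Gruß Erika aus Kleve, Ref <<UNBEKANNT>>
```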

In [29]:
# ── Step 14: Test placeholder substitution on paraphrased examples ──

# Load paraphrased templates (make sure path exists and is correct)
with open("synthetic/option_a_paraphrases.json", encoding="utf-8") as fh:
    data = json.load(fh)

# Pick the first record for preview/testing
first = data[0]

# For each paraphrased variant, generate 3 filled samples with fake data
out = [
    [substitute_placeholders(tpl) for _ in range(3)]
    for tpl in first["variants"]
]

# Pretty print the result to inspect filled examples
print(json.dumps(out, ensure_ascii=False, indent=2))


In [31]:
# ── Step 15: Fill placeholders and return text + entity spans ──

def fill_and_tag(text: str):
    """
    Replace <<PLACEHOLDER>> markers in the input text with synthetic values
    and return:
      - the fully substituted string
      - a list of [start, end, label] spans for use in NER training.

    This format is compatible with spaCy JSONL or similar frameworks.
    """
    spans = []
    offset = 0  # character offset shift due to substitutions

    def repl(m):
        nonlocal offset
        alias = m.group(1)
        key   = _alias_to_key.get(alias)
        value = GEN[key]() if key in GEN else m.group(0)

        # Only record spans for known placeholder types
        if key in GEN:
            start = m.start() + offset
            end   = start + len(value)
            spans.append([start, end, key])

        # Update offset to account for character length change
        offset += len(value) - len(m.group(0))
        return value

    filled = _pattern.sub(repl, text)
    return filled, spans
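
Because NER training depends on exact offsets, it can help to validate the spans before exporting them. The checker below is a sketch under the assumption that spans are `(start, end, label)` triples with an exclusive end, as produced above. `validate_spans` is a hypothetical helper.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]

def validate_spans(text: str, spans: List[Span]) -> List[str]:
    """Hypothetical helper: return a list of problems found in the spans."""
    problems = []
    prev_end = 0
    for start, end, label in sorted(spans):
        surface = text[start:end]
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: out of range [{start}, {end})")
        elif surface != surface.strip():
            problems.append(f"{label}: whitespace at span edge {surface!r}")
        if start < prev_end:
            problems.append(f"{label}: overlaps previous span")
        prev_end = max(prev_end, end)
    return problems

text = "es geht um die Vertragsnummer 406 882 430 808."
print(validate_spans(text, [(30, 45, "VERTRAGSNUMMER")]))  # []
print(validate_spans(text, [(29, 45, "VERTRAGSNUMMER")]))  # flags the leading space
```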

In [None]:
# ── Step 16: Sample generation and export for NER training ──

# --- Configurable Parameters ---
SOURCE         = "synthetic/option_a_paraphrases.json"
OUT_PATH       = "sample_filled_mails.json"
VARIANTS_EACH  = 3           # how many times to fill each template
START_INDEX    = 1           # starting file index in output
MAX_TEMPLATES  = 3           # number of paraphrased templates to use per email
MAX_RECORDS    = 1           # number of email records to process

# --- Driver: Generate Sample Data ---
import json, pathlib, random

with open(SOURCE, encoding="utf-8") as fh:
    data = json.load(fh)

records = []
counter = START_INDEX

for record in data[:MAX_RECORDS]:  # iterate through limited records
    for template in record["variants"][:MAX_TEMPLATES]:
        for _ in range(VARIANTS_EACH):
            text, ents = fill_and_tag(template)
            records.append({
                "file": str(counter),
                "text": text,
                "labels": [{"start": s, "end": e, "label": L} for s, e, L in ents]
            })
            counter += 1

# --- Preview the Result ---
print("── Preview of", len(records), "generated mails ──\n")
for rec in records:
    print(json.dumps(rec, ensure_ascii=False, indent=2), end="\n\n")

# --- Save to JSON File ---
pathlib.Path(OUT_PATH).write_text(
    json.dumps(records, ensure_ascii=False, indent=2),
    encoding="utf-8"
)
print(f"\n✅ Wrote {len(records)} mails to {OUT_PATH}")

── Preview of 9 generated mails ──

{
  "file": "1",
  "text": "Hallo liebes Eon Team, es geht um die Vertragsnummer 406 882 430 808. Bei der Errichtung meines neuen Vertrages wurde leider die Banküberweisung von dem jungen Kollegen an der Wohnungstür als Zahlungsmittel gewählt. Ich möchte, dass es wieder per Lastschrift belastet wird, um den Stress zu vermeiden. Das Konsumbüro ist immer noch die Birnbaumgasse 1053 d in 71706 Kleve. Gruß Antonia van der Dussen",
  "labels": [
    {
      "start": 53,
      "end": 68,
      "label": "VERTRAGSNUMMER"
    },
    {
      "start": 336,
      "end": 349,
      "label": "STRASSE"
    },
    {
      "start": 350,
      "end": 356,
      "label": "HAUSNUMMER"
    },
    {
      "start": 360,
      "end": 365,
      "label": "POSTLEITZAHL"
    },
    {
      "start": 366,
      "end": 371,
      "label": "WOHNORT"
    },
    {
      "start": 378,
      "end": 385,
      "label": "VORNAME"
    },
    {
      "start": 386,
      "end": 400,
      

In [32]:
# ── Step 17: Full Dataset Run — Generate 10 Faker Variants per Template ──

# --- Configuration ---
SOURCE         = "synthetic/option_a_paraphrases.json"      # input file with paraphrased templates
OUT_PATH       = "synthetic/synthetic_mails_option_a.json"  # output path for synthetic NER data
VARIANTS_EACH  = 10                                          # number of faker-filled variants per template
START_INDEX    = 1                                           # starting index for "file" field

# --- Main Driver: Generate NER-Compatible Data ---
import json, pathlib, random

with open(SOURCE, encoding="utf-8") as fh:
    data = json.load(fh)

records = []
counter = START_INDEX

for rec in data:                         # iterate through each paraphrased email group
    for tpl in rec["variants"]:         # iterate through each paraphrased sentence
        for _ in range(VARIANTS_EACH):  # generate multiple filled variants
            text, ents = fill_and_tag(tpl)  # substitute placeholders + return entity spans
            records.append({
                "file": str(counter),
                "text": text,
                "labels": [
                    {"start": s, "end": e, "label": L} for s, e, L in ents
                ]
            })
            counter += 1

# --- Save Output ---
pathlib.Path(OUT_PATH).write_text(
    json.dumps(records, ensure_ascii=False, indent=2),
    encoding="utf-8"
)

print(f"✅ Wrote {len(records):,} synthetic emails to {OUT_PATH}")

✅ wrote 14,360 mails to /content/daia-eon/data/synthetic/synthetic_mails_option_a.json
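
The exported records can be reshaped into the `(text, {"entities": [...]})` tuple layout that many spaCy training examples use. The conversion below is a minimal sketch that needs no spaCy import. `to_entity_tuples` is a hypothetical helper, and the sample record is shortened for illustration.

```python
from typing import Dict, List, Tuple

def to_entity_tuples(records: List[dict]) -> List[Tuple[str, Dict[str, list]]]:
    """Convert {"text", "labels"} records to (text, {"entities": [...]})."""
    return [
        (
            rec["text"],
            {"entities": [(l["start"], l["end"], l["label"]) for l in rec["labels"]]},
        )
        for rec in records
    ]

sample_records = [{
    "file": "1",
    "text": "Gruß Antonia",
    "labels": [{"start": 5, "end": 12, "label": "VORNAME"}],
}]
train_data = to_entity_tuples(sample_records)
print(train_data)
```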


In [33]:
# --- Step 18: Download generated JSON file to local machine ---
from google.colab import files
files.download("/content/daia-eon/data/synthetic/synthetic_mails_option_a.json")
