# Backtranslation of German Customer Emails for Synthetic Data Generation

This notebook performs **backtranslation** on real German customer emails to generate semantically equivalent but lexically diverse text variants. These synthetic variations are used to evaluate and improve PII anonymization performance.

**Backtranslation Pipeline:**
1. Translate original German emails into one or more pivot languages (e.g., English, French, Spanish).
2. Translate the text from the pivot language(s) back into German.
3. Store the resulting backtranslations for further processing and evaluation.

Using multiple pivot languages introduces greater lexical and syntactic diversity compared to standard backtranslation. This method helps generate more natural and varied synthetic datasets, improving model robustness and generalizability in downstream NLP tasks.
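
The three steps above can be sketched with pluggable translation functions. This is a minimal illustration, not the implementation used in this notebook: `backtranslate_multi` and the toy string-replacement "translators" are hypothetical stand-ins for real translation models.

```python
from typing import Callable, Dict, Tuple

Translator = Callable[[str], str]

def backtranslate_multi(
    text: str, pivots: Dict[str, Tuple[Translator, Translator]]
) -> Dict[str, str]:
    """Round-trip `text` through every pivot language and collect the results."""
    results = {}
    for lang, (to_pivot, from_pivot) in pivots.items():
        pivot_text = to_pivot(text)             # step 1: German -> pivot language
        results[lang] = from_pivot(pivot_text)  # step 2: pivot language -> German
    return results                              # step 3: keep for later processing

# Toy stand-ins for translation models (assumption: real code would call a model here).
toy_pivots = {
    "en": (lambda s: s.replace("Rechnung", "invoice"),
           lambda s: s.replace("invoice", "Abrechnung")),
    "fr": (lambda s: s.replace("Rechnung", "facture"),
           lambda s: s.replace("facture", "Faktura")),
}

variants = backtranslate_multi("Meine Rechnung ist falsch.", toy_pivots)
for lang, variant in variants.items():
    print(lang, "->", variant)
```

Each pivot yields its own German variant, which is where the additional lexical diversity comes from.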

In [None]:
# Clone the GitHub repository and change into the notebooks folder.
# This is required because the notebook was run in Google Colab.

!git clone https://github.com/AnnaGhost2713/daia-eon.git
%cd daia-eon/notebooks

### Preview and Pipeline Sanity Check

Before running the full backtranslation pipeline, we first verify that our approach works end-to-end on a small sample of real email files. This section serves as a quick sanity check to confirm:

1. Correct loading and parsing of `.txt` emails and PII placeholders.
2. Functionality of the multilingual backtranslation pipeline, including pivoting and unmasking.
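
Point 1 depends on the `<<LABEL>>` placeholder syntax. A minimal sketch of the span extraction, using the same regular expression as the code cells in this notebook; the sample sentence and the helper name `extract_labels` are made up for illustration.

```python
import re

PLACEHOLDER = re.compile(r"<<([^>]+)>>")

def extract_labels(text):
    """Return character spans and label names of all <<LABEL>> placeholders."""
    return [
        {"start": m.start(), "end": m.end(), "label": m.group(1)}
        for m in PLACEHOLDER.finditer(text)
    ]

sample = "Hallo <<VORNAME>> <<NAME>>, vielen Dank."
labels = extract_labels(sample)
for lab in labels:
    # The slice recovers the full placeholder, including the angle brackets.
    print(lab["label"], sample[lab["start"]:lab["end"]])
```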

We sample five random training emails and generate diverse paraphrases using probabilistic pivoting through French, Spanish, or Italian.
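
Probabilistic pivoting means that each English candidate is routed at random. The sketch below mirrors the thresholds used in `bt_super_diverse` (30% random pivot among FR/ES/IT, 20% French, 50% no extra pivot) without loading any models; `choose_route` is a hypothetical helper name introduced only for this illustration.

```python
import random
from collections import Counter

def choose_route(rng: random.Random) -> str:
    """Pick a route for one English candidate, using the notebook's thresholds."""
    hop = rng.random()
    if hop < 0.3:
        return rng.choice(["fr", "es", "it"])  # random pivot language
    elif hop < 0.5:
        return "fr"                            # fixed French pivot
    return "direct"                            # straight back to German

rng = random.Random(0)
n = 10_000
counts = Counter(choose_route(rng) for _ in range(n))
shares = {route: counts[route] / n for route in ("direct", "fr", "es", "it")}
print(shares)
```

Because French can be chosen in both branches, it is expected to be used about three times as often as Spanish or Italian (roughly 30% versus 10% each), and about half of all candidates skip the extra pivot.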

The output is not saved to disk — this is just for visual inspection and debugging.

In [None]:
#### PREVIEW ON TXT FILES (SANITY CHECK) ####

# --- Step 0: Install required libraries ---
!pip install -q transformers sentencepiece tqdm

# --- Step 1: Imports ---
import re, time, random
from math        import ceil
from pathlib     import Path
from random      import seed
from collections import Counter
from tqdm.auto   import tqdm
from transformers import pipeline, set_seed

# --- Step 2: Load labeled German email files ---
DATA_DIR = Path("../../data/original/golden_dataset_anonymized_granular")
all_txt  = sorted(DATA_DIR.glob("*.txt"))
records  = []
for f in all_txt:
    txt = f.read_text("utf-8")
    labs = [{"start":m.start(),"end":m.end(),"label":m.group(1)}
           for m in re.finditer(r"<<([^>]+)>>", txt)]
    records.append({"file":f.name, "text":txt, "labels":labs})

# --- Step 3: Use pre-defined test split IDs ---
TEST_IDS   = {0,142,2,3,146,145,157,165,19,18,20,166,176,177,
              32,34,40,45,52,57,61,65,66,70,71,73,75,78,81,
              96,102,105,108,109,112,115,122,129,132,134}
TEST_FILES = {f"{i}.txt" for i in TEST_IDS}
train_recs = [r for r in records if r["file"] not in TEST_FILES]

# --- Step 4: Preview a random sample of training emails ---
seed(1)
set_seed(1)  # also seeds the torch RNG that drives sampled decoding
preview = random.sample(train_recs, k=5)
print("Previewing:", [r["file"] for r in preview])

# --- Step 5: Compute how many backtranslated variants are needed per record ---
tag_counts = Counter(l["label"] for r in preview for l in r["labels"])
max_cnt    = max(tag_counts.values(), default=1)
def n_variants_for(r):
    freqs = [tag_counts.get(l["label"],1) for l in r["labels"]]
    return ceil(max_cnt / max(min(freqs),1)) if freqs else 1

# --- Step 6: Load translation pipelines (de↔en, en↔fr, en↔es, en↔it) ---
kw = dict(device=-1,
          do_sample=True,
          top_k=300,
          top_p=0.95,
          temperature=1.5)
de_en = pipeline("translation_de_to_en", model="Helsinki-NLP/opus-mt-de-en", **kw)
en_de = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de", **kw)
en_fr = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr", **kw)
fr_en = pipeline("translation_fr_to_en", model="Helsinki-NLP/opus-mt-fr-en", **kw)
en_es = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es", **kw)
es_en = pipeline("translation_es_to_en", model="Helsinki-NLP/opus-mt-es-en", **kw)
en_it = pipeline("translation_en_to_it", model="Helsinki-NLP/opus-mt-en-it", **kw)
it_en = pipeline("translation_it_to_en", model="Helsinki-NLP/opus-mt-it-en", **kw)

pivot_pipes = {
  "fr": (en_fr, fr_en),
  "es": (en_es, es_en),
  "it": (en_it, it_en),
}

# --- Step 7: Define the multilingual backtranslation function ---
def bt_super_diverse(text: str, want: int) -> list[str]:
    # 1. Mask placeholders like <<VORNAME>> with [TAG1], [TAG2], etc.
    tags, masked = [], text
    for i, t in enumerate(re.findall(r"(<<[^>]+>>)", text), 1):
        tags.append(t)
        masked = masked.replace(t, f"[TAG{i}]")

    # 2. Translate German → English using beam sampling
    en_beams = de_en(
      masked,
      max_length=512, truncation=True,
      num_beams=want*2,
      num_return_sequences=want,
      early_stopping=True
    )
    out_variants = []
    for beam in en_beams:
        en = beam["translation_text"]
        time.sleep(0.1)

        # 3. With some probability, route through an additional pivot language
        hop = random.random()
        if hop < 0.3:
            lang, (e2p, p2e) = random.choice(list(pivot_pipes.items()))
            en = p2e(e2p(en, max_length=512, truncation=True)[0]["translation_text"],
                     max_length=512, truncation=True)[0]["translation_text"]
            time.sleep(0.2)
        elif hop < 0.5:
            # Fixed French pivot: en → fr → en (one round trip, always via French)
            mid = pivot_pipes["fr"][0](en, max_length=512, truncation=True)[0]["translation_text"]
            time.sleep(0.1)
            en  = pivot_pipes["fr"][1](mid, max_length=512, truncation=True)[0]["translation_text"]
            time.sleep(0.1)

        # 4. Translate back to German
        de = en_de(en, max_length=512, truncation=True)[0]["translation_text"]
        time.sleep(0.1)

        # 5. Unmask tags like [TAG1] → <<VORNAME>>
        for i, t in enumerate(tags, 1):
            de = de.replace(f"[TAG{i}]", t)
        out_variants.append(de)

    return out_variants

# --- Step 8: Run preview with generated backtranslations ---
for rec in tqdm(preview, desc="Super-Diverse Preview"):
    want = n_variants_for(rec)
    print(f"\n→ {rec['file']} (need {want} variants)")
    for v in bt_super_diverse(rec["text"], want):
        print("  ", v)

### Backtranslation Preview — JSON Export

In the following section, we run our multilingual backtranslation pipeline on a small random sample of the training data and write the output into a structured JSON file. Each record contains:

- The original file name
- The number of paraphrased variants generated
- A list of the resulting backtranslated texts

This is useful for downstream evaluation and ensures that our logic, including placeholder masking, multilingual pivoting, and unmasking, works correctly and reproducibly before scaling.

The resulting JSON file `preview_paraphrases.json` can be easily inspected or reused for manual quality checks.
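
For orientation, the record layout described above can be sketched as follows. The file name and the German sentences are made up for illustration; only the three keys come from the export cell.

```python
import json

# Made-up record in the shape the export cell writes (values are illustrative only).
results = [{
    "file": "7.txt",
    "n_variants": 2,
    "variants": [
        "Guten Tag <<NAME>>, bitte prüfen Sie meine Rechnung.",
        "Hallo <<NAME>>, könnten Sie meine Rechnung überprüfen?",
    ],
}]

# ensure_ascii=False keeps umlauts readable instead of escaping them as \uXXXX.
payload = json.dumps(results, ensure_ascii=False, indent=2)
print(payload)

# Loading the text back yields the same structure.
reloaded = json.loads(payload)
```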

In [None]:
#### PREVIEW WITH JSON EXPORT (SANITY CHECK) ####

# --- Step 0: Install required libraries ---
!pip install -q transformers sentencepiece tqdm

# --- Step 1: Imports ---
import re, time, random
from math        import ceil
from pathlib     import Path
from random      import seed
from collections import Counter
from tqdm.auto   import tqdm
from transformers import pipeline, set_seed
import json

# --- Step 2: Load anonymized .txt files and extract placeholder spans ---
DATA_DIR = Path("../../data/original/golden_dataset_anonymized_granular")
all_txt  = sorted(DATA_DIR.glob("*.txt"))
records  = []
for f in all_txt:
    txt = f.read_text("utf-8")
    labs = [{"start":m.start(),"end":m.end(),"label":m.group(1)}
           for m in re.finditer(r"<<([^>]+)>>", txt)]
    records.append({"file":f.name, "text":txt, "labels":labs})

# --- Step 3: Apply fixed train/test split ---
TEST_IDS   = {0,142,2,3,146,145,157,165,19,18,20,166,176,177,
              32,34,40,45,52,57,61,65,66,70,71,73,75,78,81,
              96,102,105,108,109,112,115,122,129,132,134}
TEST_FILES = {f"{i}.txt" for i in TEST_IDS}
train_recs = [r for r in records if r["file"] not in TEST_FILES]

# --- Step 4: Sample a small preview of training data ---
seed(1)
set_seed(1)  # also seeds the torch RNG that drives sampled decoding
preview = random.sample(train_recs, k=2)
print("Previewing:", [r["file"] for r in preview])

# --- Step 5: Estimate how many variants to generate per email ---
tag_counts = Counter(l["label"] for r in preview for l in r["labels"])
max_cnt    = max(tag_counts.values(), default=1)

def n_variants_for(r):
    freqs = [tag_counts.get(l["label"],1) for l in r["labels"]]
    return ceil(max_cnt / max(min(freqs),1)) if freqs else 1

# --- Step 6: Define backtranslation pipelines with diverse decoding ---
kw = dict(device=-1,
          do_sample=True,
          top_k=300,
          top_p=0.95,
          temperature=1.5)
de_en = pipeline("translation_de_to_en", model="Helsinki-NLP/opus-mt-de-en", **kw)
en_de = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de", **kw)
en_fr = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr", **kw)
fr_en = pipeline("translation_fr_to_en", model="Helsinki-NLP/opus-mt-fr-en", **kw)
en_es = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es", **kw)
es_en = pipeline("translation_es_to_en", model="Helsinki-NLP/opus-mt-es-en", **kw)
en_it = pipeline("translation_en_to_it", model="Helsinki-NLP/opus-mt-en-it", **kw)
it_en = pipeline("translation_it_to_en", model="Helsinki-NLP/opus-mt-it-en", **kw)

pivot_pipes = {
  "fr": (en_fr, fr_en),
  "es": (en_es, es_en),
  "it": (en_it, it_en),
}

# --- Step 7: Backtranslation function with optional pivot hops ---
def bt_super_diverse(text: str, want: int) -> list[str]:
    """
    Applies diverse backtranslation to a single email.
    Optionally uses random pivot hops through FR/ES/IT.
    """
    # Step 1: Mask PII placeholders
    tags, masked = [], text
    for i, t in enumerate(re.findall(r"(<<[^>]+>>)", text), 1):
        tags.append(t)
        masked = masked.replace(t, f"[TAG{i}]")

    # Step 2: Translate DE → EN (beam sampling)
    en_beams = de_en(
      masked,
      max_length=512, truncation=True,
      num_beams=want*2,
      num_return_sequences=want,
      early_stopping=True
    )
    out_variants = []
    for beam in en_beams:
        en = beam["translation_text"]
        time.sleep(0.1)

        # Step 3: Random multilingual pivot
        hop = random.random()
        if hop < 0.3:
            lang, (e2p, p2e) = random.choice(list(pivot_pipes.items()))
            en = p2e(e2p(en, max_length=512, truncation=True)[0]["translation_text"],
                     max_length=512, truncation=True)[0]["translation_text"]
            time.sleep(0.2)
        elif hop < 0.5:
            # Fixed French pivot: EN → FR → EN (one round trip)
            mid = pivot_pipes["fr"][0](en, max_length=512, truncation=True)[0]["translation_text"]
            time.sleep(0.1)
            en  = pivot_pipes["fr"][1](mid, max_length=512, truncation=True)[0]["translation_text"]
            time.sleep(0.1)

        # Step 4: Translate EN → DE
        de = en_de(en, max_length=512, truncation=True)[0]["translation_text"]
        time.sleep(0.1)

        # Step 5: Unmask original tags
        for i, t in enumerate(tags, 1):
            de = de.replace(f"[TAG{i}]", t)
        out_variants.append(de)

    return out_variants


# --- Step 8: Run paraphrasing and save preview to JSON ---
OUT_FILE = Path("data/preview_paraphrases.json")
results = []

for rec in tqdm(preview, desc="Building preview JSON"):
    want     = n_variants_for(rec)
    variants = bt_super_diverse(rec["text"], want)
    results.append({
        "file":       rec["file"],
        "n_variants": want,
        "variants":   variants
    })

OUT_FILE.parent.mkdir(exist_ok=True, parents=True)  # create data/ if it does not exist yet
with OUT_FILE.open("w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

print(f"✓ Wrote preview to {OUT_FILE}")

# Option — Super-Diverse Backtranslation Pipeline (JSON Export)

This notebook generates synthetic paraphrases of anonymized German customer emails via backtranslation.
The pipeline translates DE → EN with beam search and optionally inserts random multilingual pivots (FR, ES, IT) before translating back to German.

We sample 3 training examples at random and write the output to `option_a_paraphrases_preview.json`.
This configuration uses:
- Increased beam width (`max(want, 20)`) for richer diversity
- GPU-based HuggingFace pipelines for fast multilingual translation
- Full placeholder masking/unmasking to preserve entity structure

This section is a prototype for evaluating one of our candidate paraphrasing strategies.
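
The placeholder masking and unmasking can be tried in isolation, without any translation model. The sketch below follows the same `[TAG{i}]` scheme as `bt_super_diverse`; the helper names `mask`, `unmask`, and `placeholders_intact` are hypothetical, and the integrity check is an addition that is not part of the pipeline. Such a check can be useful because a translation model is not guaranteed to copy `[TAG1]` through unchanged.

```python
import re
from collections import Counter

PLACEHOLDER = re.compile(r"<<[^>]+>>")

def mask(text):
    """Replace each <<LABEL>> placeholder with a numbered [TAGi] token."""
    tags = PLACEHOLDER.findall(text)
    masked = text
    for i, tag in enumerate(tags, 1):
        masked = masked.replace(tag, f"[TAG{i}]")
    return masked, tags

def unmask(text, tags):
    """Put the original placeholders back in place of the [TAGi] tokens."""
    for i, tag in enumerate(tags, 1):
        text = text.replace(f"[TAG{i}]", tag)
    return text

def placeholders_intact(original, restored):
    """True if both texts contain the same multiset of placeholders."""
    return Counter(PLACEHOLDER.findall(original)) == Counter(PLACEHOLDER.findall(restored))

original = "Hallo <<VORNAME>> <<NAME>>, Ihr Vertrag wurde verlängert."
masked, tags = mask(original)
restored = unmask(masked, tags)

# Simulated model output in which the second token was altered (made-up example).
damaged = unmask("Hallo [TAG1] [ TAG2 ], Ihr Vertrag wurde verlängert.", tags)

print(masked)
print(placeholders_intact(original, restored))
print(placeholders_intact(original, damaged))
```

Variants that fail such a check could be discarded or regenerated before they enter the synthetic dataset.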


In [None]:
# --- Step 0: Install required libraries ---
!pip install -q transformers sentencepiece tqdm

# --- Step 1: Imports ---
import re, time, random, json
from math          import ceil
from pathlib       import Path
from random        import seed
from collections   import Counter
from tqdm.auto     import tqdm
from transformers  import pipeline, set_seed

# --- Step 2: Load all anonymized .txt files into memory ---
DATA_DIR = Path("../../data/original/golden_dataset_anonymized_granular")
all_txt  = sorted(DATA_DIR.glob("*.txt"))
records  = []
for f in all_txt:
    txt = f.read_text("utf-8")
    labs = [{"start":m.start(),"end":m.end(),"label":m.group(1)}
           for m in re.finditer(r"<<([^>]+)>>", txt)]
    records.append({"file":f.name, "text":txt, "labels":labs})

# --- Step 3: Apply train/test split (based on fixed test IDs) ---
TEST_IDS   = {0,142,2,3,146,145,157,165,19,18,20,166,176,177,
              32,34,40,45,52,57,61,65,66,70,71,73,75,78,81,
              96,102,105,108,109,112,115,122,129,132,134}
TEST_FILES = {f"{i}.txt" for i in TEST_IDS}
train_recs = [r for r in records if r["file"] not in TEST_FILES]

# --- Step 4: Estimate number of needed paraphrase variants per record ---
tag_counts = Counter(l["label"] for r in train_recs for l in r["labels"])
max_cnt    = max(tag_counts.values(), default=1)
def n_variants_for(r):
    freqs = [tag_counts.get(l["label"],1) for l in r["labels"]]
    return ceil(max_cnt / max(min(freqs),1)) if freqs else 1

# --- Step 5: Instantiate translation pipelines with sampling parameters (on GPU) ---
kw = dict(device=0, do_sample=True, top_k=300, top_p=0.95, temperature=1.5)
de_en = pipeline("translation_de_to_en", model="Helsinki-NLP/opus-mt-de-en", **kw)
en_de = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de", **kw)
en_fr = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr", **kw)
fr_en = pipeline("translation_fr_to_en", model="Helsinki-NLP/opus-mt-fr-en", **kw)
en_es = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es", **kw)
es_en = pipeline("translation_es_to_en", model="Helsinki-NLP/opus-mt-es-en", **kw)
en_it = pipeline("translation_en_to_it", model="Helsinki-NLP/opus-mt-en-it", **kw)
it_en = pipeline("translation_it_to_en", model="Helsinki-NLP/opus-mt-it-en", **kw)

pivot_pipes = {
    "fr": (en_fr, fr_en),
    "es": (en_es, es_en),
    "it": (en_it, it_en),
}

# --- Step 6: Define backtranslation with optional random multilingual pivoting ---
def bt_super_diverse(text: str, want: int) -> list[str]:
    # 1) Mask PII placeholders like <<NAME>> to preserve them across translations
    tags, masked = [], text
    for i, t in enumerate(re.findall(r"(<<[^>]+>>)", text), 1):
        tags.append(t)
        masked = masked.replace(t, f"[TAG{i}]")
    
    # 2) Translate DE → EN with beam sampling
    en_beams = de_en(
        masked,
        max_length=512,
        truncation=True,
        num_beams=max(want, 20),
        num_return_sequences=want,
        early_stopping=True
    )
    out_variants = []
    for beam in en_beams:
        en = beam["translation_text"]
        time.sleep(0.1)

        # Randomly route through one pivot language to increase variation
        hop = random.random()
        if hop < 0.3:
            lang, (e2p, p2e) = random.choice(list(pivot_pipes.items()))
            en = p2e(
                e2p(en, max_length=512, truncation=True)[0]["translation_text"],
                max_length=512, truncation=True
            )[0]["translation_text"]
            time.sleep(0.2)
        elif hop < 0.5:
            # Fixed French pivot: EN → FR → EN (one round trip)
            mid = pivot_pipes["fr"][0](en, max_length=512, truncation=True)[0]["translation_text"]
            time.sleep(0.1)
            en  = pivot_pipes["fr"][1](mid, max_length=512, truncation=True)[0]["translation_text"]
            time.sleep(0.1)

        # Translate back EN → DE
        de = en_de(en, max_length=512, truncation=True)[0]["translation_text"]
        time.sleep(0.1)

        # Unmask original placeholder tags
        for i, t in enumerate(tags, 1):
            de = de.replace(f"[TAG{i}]", t)
        out_variants.append(de)

    return out_variants

# --- Step 7: Sample a few records for this preview run ---
seed(2)
set_seed(2)  # also seeds the torch RNG that drives sampled decoding
sample_recs = random.sample(train_recs, 3)

# --- Step 8: Run paraphrasing and save preview to JSON ---
OUT_FILE = Path("data/option_a_paraphrases_preview.json")
results  = []

for rec in tqdm(sample_recs, desc="Building preview JSON"):
    want     = n_variants_for(rec)
    variants = bt_super_diverse(rec["text"], want)
    results.append({
        "file":       rec["file"],
        "n_variants": want,
        "variants":   variants
    })

OUT_FILE.parent.mkdir(exist_ok=True, parents=True)
with OUT_FILE.open("w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

print(f"✓ Wrote preview to {OUT_FILE}")

# --- Step 9: Preview final output from saved JSON ---
with OUT_FILE.open("r", encoding="utf-8") as f:
    data = json.load(f)

for entry in data:
    print(f"\n📂 File: {entry['file']} (generated {entry['n_variants']} variants)")
    for i, variant in enumerate(entry['variants'], 1):
        print(f"  Variant {i}: {variant}")

## Full Synthetic Dataset Generation via Multilingual Backtranslation

This script generates a full set of paraphrased variants for all 120 training emails from the anonymized dataset. Each email is translated from German to English (with optional pivoting through French, Spanish, or Italian) and back to German, using Hugging Face translation pipelines and stochastic sampling to ensure diversity.

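The number of paraphrase variants per email grows with the rarity of its entity types: it is the count of the most frequent tag in the training set divided by the count of the email's rarest tag, rounded up. This helps to balance data representation. Placeholders like `<<NAME>>` are preserved across translations.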

The output is saved as a structured JSON file to be used in downstream training or evaluation of anonymization models.
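
To make the variant count concrete, here is a small worked example of the same formula as `n_variants_for`. The tag names `DATUM` and `IBAN` and all frequencies are made-up assumptions, and the function takes a plain list of labels instead of a record to keep the sketch short.

```python
from collections import Counter
from math import ceil

# Hypothetical tag frequencies across the training set (made-up numbers).
tag_counts = Counter({"NAME": 40, "DATUM": 10, "IBAN": 5})
max_cnt = max(tag_counts.values(), default=1)  # 40, the most frequent tag

def n_variants_for(labels):
    """Most frequent tag count divided by the rarest tag count in this email, rounded up."""
    freqs = [tag_counts.get(label, 1) for label in labels]
    return ceil(max_cnt / max(min(freqs), 1)) if freqs else 1

print(n_variants_for(["NAME"]))           # 1: only the most common tag
print(n_variants_for(["NAME", "DATUM"]))  # 4: rarest tag is DATUM (40 / 10)
print(n_variants_for(["NAME", "IBAN"]))   # 8: rarest tag is IBAN (40 / 5)
print(n_variants_for([]))                 # 1: no placeholders at all
```

An email containing a rare tag is therefore paraphrased more often, which raises the number of training examples for that tag.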

In [None]:
### CODE FOR ALL 120 TRAINING EMAILS ###

# --- Step 0: Install required packages ---
!pip install -q transformers sentencepiece tqdm

# --- Step 1: Imports ---
import re, time, random, json
from math          import ceil
from pathlib       import Path
from collections   import Counter
from tqdm.auto     import tqdm
from transformers  import pipeline

# --- Step 2: Load all .txt files into memory ---
DATA_DIR = Path("../../data/original/golden_dataset_anonymized_granular")
all_txt  = sorted(DATA_DIR.glob("*.txt"))
records  = []
for f in all_txt:
    txt = f.read_text("utf-8")
    labs = [{"start":m.start(),"end":m.end(),"label":m.group(1)}
           for m in re.finditer(r"<<([^>]+)>>", txt)]
    records.append({"file":f.name, "text":txt, "labels":labs})

# --- Step 3: Train/test split (based on fixed test IDs) ---
TEST_IDS   = {0,142,2,3,146,145,157,165,19,18,20,166,176,177,
              32,34,40,45,52,57,61,65,66,70,71,73,75,78,81,
              96,102,105,108,109,112,115,122,129,132,134}
TEST_FILES = {f"{i}.txt" for i in TEST_IDS}
train_recs = [r for r in records if r["file"] not in TEST_FILES]

# --- Step 4: Compute how many variants each record needs (rarer tags = more variants) ---
tag_counts = Counter(l["label"] for r in train_recs for l in r["labels"])
max_cnt    = max(tag_counts.values(), default=1)
def n_variants_for(r):
    freqs = [tag_counts.get(l["label"],1) for l in r["labels"]]
    return ceil(max_cnt / max(min(freqs),1)) if freqs else 1

# --- Step 5: Load HuggingFace translation pipelines on GPU ---
kw = dict(device=0, do_sample=True, top_k=300, top_p=0.95, temperature=1.5)
de_en = pipeline("translation_de_to_en", model="Helsinki-NLP/opus-mt-de-en", **kw)
en_de = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de", **kw)
en_fr = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr", **kw)
fr_en = pipeline("translation_fr_to_en", model="Helsinki-NLP/opus-mt-fr-en", **kw)
en_es = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es", **kw)
es_en = pipeline("translation_es_to_en", model="Helsinki-NLP/opus-mt-es-en", **kw)
en_it = pipeline("translation_en_to_it", model="Helsinki-NLP/opus-mt-en-it", **kw)
it_en = pipeline("translation_it_to_en", model="Helsinki-NLP/opus-mt-it-en", **kw)

pivot_pipes = {
    "fr": (en_fr, fr_en),
    "es": (en_es, es_en),
    "it": (en_it, it_en),
}

# --- Step 6: Define super-diverse backtranslation function (same as before) ---
def bt_super_diverse(text: str, want: int) -> list[str]:
    # Step 1: Mask all placeholder tags (e.g. <<NAME>>)
    tags, masked = [], text
    for i, t in enumerate(re.findall(r"(<<[^>]+>>)", text), 1):
        tags.append(t)
        masked = masked.replace(t, f"[TAG{i}]")

    # Step 2: Translate DE → EN with beam sampling
    num_beams = want + 1  # ensure enough beams to satisfy num_return_sequences
    en_beams = de_en(
        masked,
        max_length=512,
        truncation=True,
        num_beams=num_beams,
        num_return_sequences=want,
        early_stopping=True
    )

    out_variants = []
    for beam in en_beams:
        en = beam["translation_text"]
        time.sleep(0.1)

        # Step 3: Randomly apply pivot hops for extra variation
        hop = random.random()
        if hop < 0.3:
            lang, (e2p, p2e) = random.choice(list(pivot_pipes.items()))
            en = p2e(
                e2p(en, max_length=512, truncation=True)[0]["translation_text"],
                max_length=512, truncation=True
            )[0]["translation_text"]
            time.sleep(0.2)
        elif hop < 0.5:
            mid = pivot_pipes["fr"][0](en, max_length=512, truncation=True)[0]["translation_text"]
            time.sleep(0.1)
            en = pivot_pipes["fr"][1](mid, max_length=512, truncation=True)[0]["translation_text"]
            time.sleep(0.1)

        # Step 4: Translate EN → DE
        de = en_de(en, max_length=512, truncation=True)[0]["translation_text"]
        time.sleep(0.1)

        # Step 5: Restore original placeholder tags
        for i, t in enumerate(tags, 1):
            de = de.replace(f"[TAG{i}]", t)
        out_variants.append(de)

    return out_variants

# --- Step 7: Generate paraphrases for full training set and save to JSON ---
OUT_FILE = Path("../../data/synthetic/option_a_paraphrases.json")
results  = []

for rec in tqdm(train_recs, desc="Building full JSON"):
    want     = n_variants_for(rec)
    variants = bt_super_diverse(rec["text"], want)
    results.append({
        "file":       rec["file"],
        "n_variants": want,
        "variants":   variants
    })

# --- Step 8: Save output JSON to disk ---
OUT_FILE.parent.mkdir(exist_ok=True, parents=True)
with OUT_FILE.open("w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

print(f"✓ Wrote all paraphrases to {OUT_FILE}")

In [None]:
# --- Step 9: Download generated JSON file to local machine ---
from google.colab import files
files.download("../../data/synthetic/option_a_paraphrases.json")  # same path the previous cell wrote to