## Paraphrasing Option B – LLM-based Variant Generation with GPT

This notebook explores **Option B** for paraphrasing emails with a large language model (LLM), specifically `gpt-3.5-turbo`. We chose 3.5 over more recent models for cost reasons (we are still students). The objective is to generate multiple stylistic variants of each anonymized German email in the training set. The approach uses prompt-based paraphrasing via the OpenAI API and instructs the model to preserve entity placeholders (e.g., `<<NAME>>`, `<<VERTRAGSNUMMER>>`) and formatting.
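
The prompt only asks the model to keep the placeholders, so a variant can still drop or alter one. A minimal sketch of a post-hoc check follows, assuming every tag has the `<<TAG>>` shape; `placeholders_preserved` is a hypothetical helper and not part of the notebook's pipeline.

```python
import re
from collections import Counter

# Assumed tag format: anything between << and >>, e.g. <<NAME>> or <<VERTRAGSNUMMER>>
PLACEHOLDER_RE = re.compile(r"<<[^<>]+>>")

def placeholders_preserved(original: str, variant: str) -> bool:
    """Return True if both texts contain the same tags with the same counts."""
    return Counter(PLACEHOLDER_RE.findall(original)) == Counter(PLACEHOLDER_RE.findall(variant))

original = "es geht um die Vertragsnummer <<VERTRAGSNUMMER>>.\nGruß <<VORNAME>> <<NACHNAME>>"
good = "mein Anliegen betrifft die Vertragsnummer <<VERTRAGSNUMMER>>.\nViele Grüße <<VORNAME>> <<NACHNAME>>"
bad = "mein Anliegen betrifft meinen Vertrag.\nViele Grüße <<VORNAME>> <<NACHNAME>>"

print(placeholders_preserved(original, good))  # the tag multiset is unchanged
print(placeholders_preserved(original, bad))   # <<VERTRAGSNUMMER>> is missing
```

Comparing counts rather than sets also catches a variant that duplicates a tag.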

The notebook contains two code sections:

1. A **preview run** on a single sample email for debugging and validation.
2. A **full pipeline** to paraphrase all training emails and output the results in a structured JSON format.

⚠️ **Note**: The provided API key (`MY_KEY`) is no longer valid. Anyone using this notebook must replace it with their own OpenAI API key or set it via the `OPENAI_API_KEY` environment variable.
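
One way to avoid keeping a key in the notebook at all is to read it from the environment and fail early if it is missing. The sketch below is an assumption about how this could look; `resolve_api_key` is a hypothetical helper, not something the notebook defines.

```python
import os

def resolve_api_key(env=os.environ, var="OPENAI_API_KEY"):
    """Return the key stored under `var`, or raise a clear error if it is missing."""
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var} environment variable before running the notebook.")
    return key

# Possible usage in the setup cell (requires the variable to be set):
# openai.api_key = resolve_api_key()
```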

In [24]:
# Clone the GitHub repository and change into its data folder.
# This is required because the notebook was run in the Google Colab environment.
!git clone https://github.com/AnnaGhost2713/daia-eon.git
%cd daia-eon/data

Cloning into 'daia-eon'...
remote: Enumerating objects: 1174, done.
remote: Counting objects: 100% (61/61), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 1174 (delta 26), reused 43 (delta 19), pack-reused 1113 (from 1)
Receiving objects: 100% (1174/1174), 48.32 MiB | 14.70 MiB/s, done.
Resolving deltas: 100% (669/669), done.
/content/daia-eon/data/daia-eon/data/daia-eon/data/daia-eon/data/daia-eon/data/daia-eon/data


### 🔹 Preview: Single Email Paraphrasing with GPT

This section runs a test paraphrasing of one anonymized German email using `gpt-3.5-turbo`. It defines a helper function that sends the prompt to OpenAI, receives multiple paraphrased variants, and prints the output in JSON format for inspection.
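
The parsing step can be tried without calling the API. The sketch below assumes the reply contains a JSON array whose items are either lists of lines or plain strings, optionally followed by `<END>`; `parse_variants` is a hypothetical stand-in for the parsing part of the helper function, and the sample reply is made up for illustration.

```python
import json

def parse_variants(reply: str) -> list[str]:
    """Extract paraphrase strings from a reply that should contain a JSON array."""
    body = reply.split("<END>")[0]
    start, end = body.find("["), body.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in reply")
    items = json.loads(body[start:end + 1])
    # A list item is a variant split into lines; anything else is used as-is
    return ["\n".join(map(str, item)) if isinstance(item, list) else str(item)
            for item in items]

# Made-up reply with one list-of-lines variant and one plain-string variant
reply = 'Sure:\n[["Hallo Team,", "danke."], "Guten Tag,\\ndanke."]\n<END>'
print(parse_variants(reply))
```

Handling plain strings separately matters because `"\n".join(...)` on a string would put a line break between every character.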

In [11]:
# --- Step 1: Import dependencies ---
import os, json, glob
import openai

# --- Step 2: Set up your OpenAI API key ---
# IMPORTANT: This API key is no longer valid and must be replaced.
# You can either:
#  - Set the environment variable `OPENAI_API_KEY`, or
#  - Replace `MY_KEY` below with your own API key.
MY_KEY = "YOUR_OPENAI_API_KEY"  # placeholder; never commit a real key
openai.api_key = os.getenv("OPENAI_API_KEY", MY_KEY)

# --- Step 3: Define the held-out test IDs to exclude from training ---
TEST_IDS   = {0,142,2,3,146,145,157,165,19,18,20,166,176,177,32,34,40,45,52,57,61,65,66,70,71,73,75,78,81,96,102,105,108,109,112,115,122,129,132,134}
TEST_FILES = {f"{i}.txt" for i in TEST_IDS}

# --- Step 4: List the training files and load one sample email for debugging ---
folder    = "original/golden_dataset_anonymized_granular"
all_txt   = sorted(glob.glob(f"{folder}/*.txt"))
train_txt = [p for p in all_txt if os.path.basename(p) not in TEST_FILES]

# ── Pick one sample to debug ──────────────────────────────────────
sample_path = train_txt[0]
with open(sample_path, encoding="utf-8") as f:
    sample_text = f.read().strip()

print("=== Sample text ===\n", sample_text[:300], "…\n")

# --- Step 5: Define a paraphrasing function using GPT ---
def paraphrase_block(block: str, n: int = 5) -> list[str]:
    """
    Paraphrases a block of German text into `n` variants using GPT.
    Preserves placeholder tags (e.g., <<NAME>>), line breaks, and structure.
    """
    prompt = (
        "You are a German copy editor. Paraphrase the entire following text block "
        f"in {n} distinct variants. Preserve every placeholder tag and keep the "
        "same line‑breaks. Each variant should convey the same meaning with different phrasing.\n"
        "Output *only* a JSON array of arrays (no explanations), then on its own line write `<END>`.\n\n"
        "Original:\n```\n"
        f"{block}"
        "\n```"
    )
    resp = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role":"user","content":prompt}],
        temperature=0.7,
        max_tokens=1000
    )
    content = resp.choices[0].message.content.strip()

    # Cut off at end marker
    if "<END>" in content:
        content = content.split("<END>")[0].strip()

    # Extract valid JSON portion
    start = content.find("[")
    end   = content.rfind("]") + 1
    json_str = content[start:end]

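    # Parse the JSON; each variant is normally a list of lines, but the model may
    # return a plain string, which must not be joined character by character
    paras = json.loads(json_str)
    return [p if isinstance(p, str) else "\n".join(p) for p in paras]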

# --- Step 6: Run paraphrasing for the sample file ---
paraphrases = paraphrase_block(sample_text, n=5)

# --- Step 7: Display the structured result as JSON ---
print("=== JSON output ===")
print(json.dumps({
    "file": os.path.basename(sample_path),
    "original": sample_text,
    "paraphrases": paraphrases
}, ensure_ascii=False, indent=2))

=== Sample text ===
 Hallo liebes Eon Team,
es geht um die Vertragsnummer <<VERTRAGSNUMMER>>.
Bei der Einrichtung meines neuen Vertrages wurde leider die Überweisung als
Zahlungsart gewählt von dem jungen Kollegen an der Wohnungstür. Ich würde
es gerne wieder per Lastschrift abbuchen lassen, um mir den Stress zu
erspare …

=== JSON output ===
{
  "file": "1.txt",
  "original": "Hallo liebes Eon Team,\nes geht um die Vertragsnummer <<VERTRAGSNUMMER>>.\nBei der Einrichtung meines neuen Vertrages wurde leider die Überweisung als\nZahlungsart gewählt von dem jungen Kollegen an der Wohnungstür. Ich würde\nes gerne wieder per Lastschrift abbuchen lassen, um mir den Stress zu\nersparen.\nVerbraucherstelle ist weiterhin die <<STRASSE>> <<HAUSNUMMER>> in <<POSTLEITZAHL>> <<WOHNORT>>.\nGruß <<VORNAME>> <<NACHNAME>>",
  "paraphrases": [
    "Hallo Eon-Team,\ndie Vertragsnummer <<VERTRAGSNUMMER>> ist Gegenstand meines Anliegens.\nBei Abschluss meines neuen Vertrages wurde fälschlicherweise die Zah

### 🔸 Full Pipeline: Paraphrasing All Training Emails

This section runs the GPT-based paraphrasing pipeline over all training emails (excluding the predefined test IDs). For each email, 10 stylistic variants are generated using a structured prompt, and the final results are saved as a JSON file for downstream use.
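
For downstream use, the saved records may need to be flattened into one row per variant. The sketch below assumes the record layout written by the pipeline (`file`, `original`, `paraphrases`) and skips empty variants, which can occur because the regex fallback pads short results with empty strings. `flatten_records`, `variant_id`, and the sample record are hypothetical.

```python
def flatten_records(records):
    """Turn per-email records into one row per non-empty paraphrase."""
    rows = []
    for rec in records:
        for idx, variant in enumerate(rec["paraphrases"]):
            if variant.strip():  # skip padding left by the fallback parser
                rows.append({"file": rec["file"], "variant_id": idx, "text": variant})
    return rows

# Made-up record in the same layout as the pipeline output
records = [{
    "file": "1.txt",
    "original": "Hallo <<VORNAME>>",
    "paraphrases": ["Guten Tag <<VORNAME>>", "", "Hi <<VORNAME>>"],
}]
rows = flatten_records(records)
print(len(rows), [r["variant_id"] for r in rows])
```

Keeping `variant_id` as the original index makes it possible to trace a row back to its position in the JSON file.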

In [22]:
#### PARAPHRASING ALL 120 EMAILS 10 TIMES ####


# --- Step 1: Install Required Dependencies ---
!pip install openai tqdm --quiet

# --- Step 2: Imports ---
import os
import json
import glob
import re
import openai
from tqdm.auto import tqdm

# --- Step 3: API Key Setup ---
# NOTE: This key is no longer valid. Replace with your own or set the environment variable OPENAI_API_KEY.
MY_KEY = "YOUR_OPENAI_API_KEY"  # placeholder; never commit a real key
openai.api_key = os.getenv("OPENAI_API_KEY", MY_KEY)

# --- Step 4: Exclusion List for Test Files ---
TEST_IDS = {
    0,142,2,3,146,145,157,165,19,18,20,166,176,177,
    32,34,40,45,52,57,61,65,66,70,71,73,75,78,81,
    96,102,105,108,109,112,115,122,129,132,134
}
TEST_FILES = {f"{i}.txt" for i in TEST_IDS}

# --- Step 5: Define Input/Output Paths ---
INPUT_FOLDER   = "original/golden_dataset_anonymized_granular"
OUTPUT_JSON    = "synthetic/paraphrased_full_10.json"

# --- Step 6: Paraphrasing Function using GPT ---
def paraphrase_block(block: str, n: int = 10) -> list[str]:
    """
    Sends a prompt to GPT-3.5 to generate 'n' paraphrased variants of the input block.
    Preserves formatting and placeholder tags.
    """
    prompt = (
        "You are a German copy editor. Paraphrase the entire following text block "
        f"in {n} distinct variants. Preserve every placeholder tag and keep the "
        "same line-breaks and paragraph structure. Each variant should convey the same meaning with different phrasing and synonyms.\n"
        "Output only a JSON array of arrays (no explanations), then on its own line `<END>`.\n\n"
        "Original:\n```\n"
        f"{block}"
        "\n```"
    )

    resp = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=1500,
        stop=["<END>"]
    )
    content = resp.choices[0].message.content.strip()

    # Safeguard: the stop sequence above normally keeps <END> out of the reply,
    # but trim at the marker in case it still appears
    if "<END>" in content:
        content = content.split("<END>")[0].strip()

    # Extract the JSON array substring
    start = content.find("[")
    end   = content.rfind("]") + 1
    json_str = content[start:end]

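    # 1) Try parsing JSON (each variant is a list of lines, or occasionally a plain string)
    try:
        paras = json.loads(json_str)
        return [p if isinstance(p, str) else "\n".join(p) for p in paras]
    except (json.JSONDecodeError, TypeError):
        # 2) Fallback: parse a numbered list ("1. ...", "2. ...").
        # \Z anchors the last item at the end of the text; with re.MULTILINE, `$` would
        # match at every line end and cut multi-line items after their first line.
        pattern = r"^\s*\d+\.\s*(.+?)(?=\n\s*\d+\.|\s*\Z)"
        matches = re.findall(pattern, content, flags=re.MULTILINE | re.DOTALL)
        extracted = [m.replace("\n", " ").strip() for m in matches]
        # Pad with empty strings (or truncate) so exactly n variants are returned
        return (extracted + [""] * n)[:n]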

# --- Step 7: Load and Filter Training Emails ---
all_txt = sorted(glob.glob(f"{INPUT_FOLDER}/*.txt"))
train_txt = [p for p in all_txt if os.path.basename(p) not in TEST_FILES]

print(f"Paraphrasing {len(train_txt)} emails with 10 variants each...")

# --- Step 8: Run Paraphrasing on All Files ---
output = []
for path in tqdm(train_txt, desc="Emails"):
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    try:
        variants = paraphrase_block(text, n=10)
    except Exception as e:
        print(f"Error on {path}: {e}")
        continue
    output.append({
        "file": os.path.basename(path),
        "original": text,
        "paraphrases": variants
    })

# --- Step 9: Save Output to JSON ---
os.makedirs(os.path.dirname(OUTPUT_JSON), exist_ok=True)
with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(output, f, ensure_ascii=False, indent=2)

print(f"\n✅ Wrote {len(output)} records to {OUTPUT_JSON}")

Paraphrasing 120 emails with 10 variants each...


Emails:   0%|          | 0/120 [00:00<?, ?it/s]


✅ Wrote 120 records to synthetic/paraphrased_full_10.json


In [23]:
# --- Step 10: Download generated JSON file to local machine ---
from google.colab import files
files.download("synthetic/paraphrased_full_10.json")
