# AskQE Baseline — Synthetic Dataset Generation

This notebook reproduces the **synthetic dataset** used for the AskQE baseline in two steps:
1. **Perturbation**: generate controlled translation errors with an LLM accessed through the Groq API
2. **Back-translation**: translate the perturbed sentences back into English

> Works on **Colab** and **Kaggle** (GPU not required for this notebook).

In [None]:
!git clone https://github.com/AlessandroMaini/CucumBERT_askqe.git
!pip install -q groq deep_translator

In [None]:
import os
from pathlib import Path

BASE = Path("CucumBERT_askqe")
os.environ["GROQ_API_KEY"] = ""  # ← paste your key here
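
Pasting the key into the cell leaves it in the saved notebook. A minimal sketch of an alternative, assuming an interactive session: reuse a key already present in the environment, and otherwise prompt for it with `getpass`. The helper name `resolve_groq_key` is hypothetical and not part of the repository.

```python
import os
from getpass import getpass


def resolve_groq_key(env=os.environ, prompt=getpass):
    """Return a non-empty Groq API key, prompting only if the environment has none.

    Hypothetical helper for illustration; not part of CucumBERT_askqe.
    """
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so it is not echoed into the cell output.
        key = prompt("GROQ_API_KEY: ").strip()
    if not key:
        raise ValueError("GROQ_API_KEY is empty")
    # Store the key where child processes started from the notebook can read it.
    env["GROQ_API_KEY"] = key
    return key
```

Calling `resolve_groq_key()` in place of the assignment above fails early on an empty key instead of letting the perturbation script start without credentials.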

## 1. Perturbation
Generate four perturbation types (synonym, alteration, omission, and expansion with no impact, passed as `expansion_noimpact`) for both language pairs, `en-es` and `en-fr`.

In [None]:
PERTURB_SCRIPT = BASE / "contratico" / "perturb_groq.py"
PERTURBATIONS = ["synonym", "alteration", "omission", "expansion_noimpact"]
LANG_PAIRS = {"en-es": "es", "en-fr": "fr"}  # language pair -> code passed to --language

for lp, lang in LANG_PAIRS.items():
    input_file = f"data/processed/{lp}.jsonl"
    for pert in PERTURBATIONS:
        print(f"\n── {lp} / {pert} ──")
        !python {PERTURB_SCRIPT} --input_file {input_file} --language {lang} --perturbation_type {pert}
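
The loop above relies on the notebook `!` shell magic. A minimal sketch of the same iteration as plain Python, which also works outside a notebook: it only builds the argument lists, using the flags shown in the cell. `build_perturb_commands` is a hypothetical name, and running the commands is left to the caller.

```python
import sys
from pathlib import Path


def build_perturb_commands(script, lang_pairs, perturbations, python=sys.executable):
    """Build one argument list per (language pair, perturbation type).

    Hypothetical helper; the flags mirror the notebook cell above.
    """
    commands = []
    for lp, lang in lang_pairs.items():
        input_file = f"data/processed/{lp}.jsonl"
        for pert in perturbations:
            commands.append([
                python, str(script),
                "--input_file", input_file,
                "--language", lang,
                "--perturbation_type", pert,
            ])
    return commands


commands = build_perturb_commands(
    Path("CucumBERT_askqe") / "contratico" / "perturb_groq.py",
    {"en-es": "es", "en-fr": "fr"},
    ["synonym", "alteration", "omission", "expansion_noimpact"],
)
print(len(commands))  # 2 language pairs x 4 perturbation types = 8
```

Each entry can then be passed to `subprocess.run(cmd, check=True)`, so that a failing run raises an exception instead of being silently followed by the next one.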

## 2. Back-Translation
Translate each perturbed file back to English using Google Translate.

In [None]:
BT_SCRIPT = BASE / "backtranslation" / "backtranslate.py"
BT_LANG = {"en-es": ("es", "en"), "en-fr": ("fr", "en")}  # language pair -> (source, target)

for lp, (src, tgt) in BT_LANG.items():
    for pert in PERTURBATIONS:
        input_file = f"contratico/{lp}/{pert}.jsonl"
        print(f"\n── {lp} / {pert} ──")
        !python {BT_SCRIPT} {input_file} --source_lang {src} --target_lang {tgt}
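
Back-translation reads one file per language pair and perturbation type, so a missing or empty file may indicate that a perturbation run did not finish. A minimal sketch of a pre-flight check, assuming the `contratico/{lp}/{pert}.jsonl` layout used in the cell above; `missing_inputs` and its `root` parameter are hypothetical.

```python
from pathlib import Path


def missing_inputs(root, lang_pairs, perturbations):
    """List expected back-translation inputs that are absent or empty.

    Hypothetical helper; `root` is the directory the relative paths start from.
    """
    root = Path(root)
    missing = []
    for lp in lang_pairs:
        for pert in perturbations:
            path = root / "contratico" / lp / f"{pert}.jsonl"
            # Treat an empty file like a missing one: there is nothing to translate.
            if not path.is_file() or path.stat().st_size == 0:
                missing.append(path)
    return missing
```

Running `missing_inputs(".", BT_LANG, PERTURBATIONS)` before the loop and stopping when the list is non-empty avoids spending translation requests on an incomplete set of files.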