<a href="https://colab.research.google.com/github/EsmaeilNarimissa/Dialectal-Retrieval-Bias/blob/main/aave_dataset_generation_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Phase 1: AAVE/SAE Dataset Generation**

This notebook implements a complete pipeline for generating a hybrid dataset containing both Standard American English (SAE) and African American Vernacular English (AAVE) queries for research on dialectal bias in RAG systems.

**Overview**

The notebook is structured in three main phases:
1. **Setup and Configuration**: Import libraries, load API keys, and define constants
2. **Data Sourcing**: Load and filter the SQuAD dataset for suitable queries
3. **Synthetic Generation**: Use LLM API calls to convert SAE queries to AAVE variants

**Requirements**

Before running this notebook, ensure you have:
- OpenAI API key set as environment variable `OPENAI_API_KEY`
- Sufficient API credits for the generation process

## Step 1: Setup and Configuration

Setting up the environment with necessary libraries, API keys, and configuration constants.

In [None]:
# Install required packages
!pip install -q datasets openai tqdm pandas numpy openai

print("Packages installed successfully")

Packages installed successfully


In [None]:
# Import necessary libraries
import os
import json
import pandas as pd
import numpy as np
from datasets import load_dataset
import openai
from tqdm import tqdm
import time
import random
from typing import List, Dict, Optional, Tuple
import warnings
import sys # Added for Python version check

# Set random seed for reproducibility
random.seed(42)

# Suppress warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully")

# Load API keys and configuration
# Use Google Colab's user data secrets to securely store the API key
from google.colab import userdata

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

if not OPENAI_API_KEY:
    print("ERROR: OPENAI_API_KEY not found in Colab secrets!")
    print("Please set your OpenAI API key in the Colab Secrets tab (icon on the left).")
else:
    print("OpenAI API key loaded successfully from Colab secrets")

# Initialize OpenAI client with the modern SDK
from openai import OpenAI, __version__ as openai_version # Import necessary classes/versions
client = OpenAI(api_key=OPENAI_API_KEY)
print("OpenAI Client initialized.")

# Verify OpenAI SDK and Python versions
print("\nPython:", sys.version) # Use sys for Python version
try:
    print("OpenAI SDK version:", openai_version) # Use imported openai_version
except Exception as e:
    print("Could not read OpenAI SDK version:", repr(e))

# Legacy version attribute fallback - for older installs if needed
# print("openai version (legacy attr):", getattr(openai, "__version__", "N/A")) # This might not be needed with modern SDK import

Libraries imported successfully
OpenAI API key loaded successfully from Colab secrets
OpenAI Client initialized.

Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
OpenAI SDK version: 1.108.0


In [None]:
test_model = "gpt-4.1-mini"  # change if your org uses different names
print("Testing model:", test_model)

resp = client.chat.completions.create(
    model=test_model,
    messages=[{"role": "user", "content": "Say 'hello'?"}],
    max_completion_tokens=10,
)
print("Response:", repr(resp.choices[0].message.content))

Testing model: gpt-4.1-mini
Response: 'Hello! How can I assist you today?'


### Chat Completions Model Probe
Test a shortlist of chat.completions models, log success/failure and a sample reply.

**OpenAI Pricing per 1M tokens (27/Sep/25):**

| Model             | Input (USD) | Cached input (USD) | Output (USD) |
|-------------------|-----------|------------------|------------|
| gpt-5             | 1.25      | 0.125            | 10.00      |
| gpt-5-mini        | 0.25      | 0.025            | 2.00       |
| gpt-5-nano        | 0.05      | 0.005            | 0.40       |
| gpt-5-chat-latest | 1.25      | 0.125            | 10.00      |
| gpt-5-codex       | 1.25      | 0.125            | 10.00      |
| gpt-4.1           | 2.00      | 0.50             | 8.00       |
| gpt-4.1-mini      | 0.40      | 0.10             | 1.60       |
| gpt-4.1-nano      | 0.10      | 0.025            | 0.40       |
| gpt-4o            | 2.50      | 1.25             | 10.00      |


In [None]:
CANDIDATE_MODELS = [
    "gpt-5",         # top-tier quality
    "gpt-4o",        # strong multimodal family, robust chat-completions
    "gpt-4.1",       # strong reasoning, broadly supported
    "gpt-4o-mini",   # economical, good quality
    "gpt-4.1-mini",  # economical, known to work (we verified)
    "gpt-5-mini",    # cheap, may have endpoint constraints in some orgs
]

def probe_chat_models(models: list[str], max_completion_tokens: int = 8) -> dict:
    """
    Try a minimal chat completion on each model and report status and sample text.
    Returns a dict: {model: {"ok": bool, "error": str|None, "sample": str|None}}
    """
    results = {}
    for m in models:
        try:
            resp = client.chat.completions.create(
                model=m,
                messages=[
                    {"role": "system", "content": "Reply with one short word only."},
                    {"role": "user", "content": "Hello?"},
                ],
                max_completion_tokens=max_completion_tokens,
            )
            content = (resp.choices[0].message.content or "").strip()
            ok = bool(content)
            results[m] = {"ok": ok, "error": None if ok else "Empty content", "sample": content if ok else None}
            print(f"[OK] {m}: {repr(content)}" if ok else f"[EMPTY] {m}")
        except Exception as e:
            # Capture concise error
            results[m] = {"ok": False, "error": str(e), "sample": None}
            print(f"[ERR] {m}: {e}")
    return results

print("Probing chat-completions models...")
model_probe_results = probe_chat_models(CANDIDATE_MODELS)
print("\nSummary:")
for m, r in model_probe_results.items():
    status = "OK" if r["ok"] else "FAIL"
    sample = f" sample={repr(r['sample'])}" if r["sample"] else ""
    err = f" error={r['error']}" if r["error"] else ""
    print(f"- {m}: {status}{sample}{err}")

Probing chat-completions models...
[EMPTY] gpt-5
[OK] gpt-4o: 'Hi!'
[OK] gpt-4.1: 'Hi!'
[OK] gpt-4o-mini: 'Hi!'
[OK] gpt-4.1-mini: 'Hi!'
[EMPTY] gpt-5-mini

Summary:
- gpt-5: FAIL error=Empty content
- gpt-4o: OK sample='Hi!'
- gpt-4.1: OK sample='Hi!'
- gpt-4o-mini: OK sample='Hi!'
- gpt-4.1-mini: OK sample='Hi!'
- gpt-5-mini: FAIL error=Empty content


Recommendation:

- Primary: gpt-4.1-mini — best price/quality balance; it works.
- Alternative higher quality: gpt-4.1 or gpt-4o — more expensive, potentially slightly better linguistic nuance.
- Budget option: gpt-4o-mini — cheap and capable; if results look fine on a small sample, you can scale with it.

In [None]:
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# Change to your local tz name if different
LOCAL_TZ = ZoneInfo("Australia/Sydney")
RUN_ID = datetime.now(LOCAL_TZ).strftime("%Y%m%d-%H%M%S")

# Configuration constants
CONFIG = {
    # Dataset configuration
    'dataset_name': 'squad',
    'dataset_split': 'train',
    'sample_size': 200,  # Number of queries to process
    'min_query_length': 10,
    'max_query_length': 200,  # A typical question is between 30-80 characters!

    # Output configuration
    'output_file': f"aave_poc_dataset_{RUN_ID}.json", # Using the dynamic RUN_ID
    # Note: BASE_OUTPUT was removed as it's not defined or used consistently

    # Model configuration
    'model_name': 'gpt-4.1-mini', # or "gpt-4.1" / "gpt-4o" / "gpt-4o-mini"

    # API call configuration
    'max_retries': 3,
    'retry_delay': 1,  # seconds

    # Progress tracking configuration
    'batch_size': 10,  # Progress tracking in the original notebook
}

print("Configuration loaded:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

Configuration loaded:
  dataset_name: squad
  dataset_split: train
  sample_size: 200
  min_query_length: 10
  max_query_length: 200
  output_file: aave_poc_dataset_20250927-193921.json
  model_name: gpt-4.1-mini
  max_retries: 3
  retry_delay: 1
  batch_size: 10


## Step 2: Data Sourcing

Loading the SQuAD dataset and filtering suitable queries for AAVE conversion.

In [None]:
# Load SQuAD dataset
print("Loading SQuAD dataset...")
try:
    dataset = load_dataset(CONFIG['dataset_name'], split=CONFIG['dataset_split'])
    print(f"Dataset loaded successfully: {len(dataset)} examples")
except Exception as e:
    print(f"Error loading dataset: {e}")
    raise

Loading SQuAD dataset...


README.md: 0.00B [00:00, ?B/s]

plain_text/train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

plain_text/validation-00000-of-00001.par(…):   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset loaded successfully: 87599 examples


In [None]:
# Extract and filter queries
def extract_queries(dataset) -> List[Tuple[int, str]]:
    """Extract representative, high-quality SQuAD questions and their indices for AAVE conversion."""
    queries = []
    seen = set()

    for i, ex in enumerate(dataset):
        q = ex["question"].strip()
        q_lower = q.lower()

        # Quality checks
        if not (CONFIG["min_query_length"] <= len(q) <= CONFIG["max_query_length"]):
            continue
        if not q.endswith("?"):
            continue
        if len(q.split()) < 3:
            continue

        # Deduplicate (case-insensitive)
        if q_lower in seen:
            continue
        seen.add(q_lower)

        queries.append((i, q)) # Store as (index, question) tuple

    return queries

print("Extracting and filtering queries...")
all_queries = extract_queries(dataset)
print(f"Extracted {len(all_queries)} suitable (idx, question) pairs")

# Sample queries for processing
if len(all_queries) > CONFIG['sample_size']:
    selected_queries = random.sample(all_queries, CONFIG['sample_size'])
    print(f"Randomly sampled {CONFIG['sample_size']} pairs for processing")
else:
    selected_queries = all_queries
    print(f"Using all {len(selected_queries)} available pairs")

# Display sample queries
print("\n Sample queries:")
# Displaying tuples (index, question)
for i, (idx, query) in enumerate(selected_queries[:5]):
    print(f"  {i+1}. (Index: {idx}) {query}")

Extracting and filtering queries...
Extracted 86229 suitable (idx, question) pairs
Randomly sampled 200 pairs for processing

 Sample queries:
  1. (Index: 85143) Who overshadowed House Speaker Dennis Hastert?
  2. (Index: 14741) Who released Mosaic?
  3. (Index: 3320) How deep was the focus of the earthquake?
  4. (Index: 36349) A person is not a member of a racial minority if ancetry does not what?
  5. (Index: 32363) When was the first 3.7 cm FlaK 18 introduced?


## Step 3: Synthetic Generation

Converting SAE queries to AAVE variants using LLM API calls with proper error handling and progress tracking.

### 3.1 Prompt Engineering

In [None]:
# AAVE conversion prompt template
AAVE_CONVERSION_PROMPT = """You are a linguist specializing in African American Vernacular English (AAVE).

Convert the following Standard American English (SAE) question into a natural AAVE equivalent suitable for everyday speech, preserving the exact meaning.

Strict requirements:
- Preserve named entities, numbers, dates, and factual content exactly.
- Preserve tense/aspect and auxiliaries unless clearly required for natural AAVE.
- If the question encodes time (e.g., years/dates), maintain the same temporal cues.
- Keep it a single question ending with one question mark.
- Maintain clarity; prefer minimal edits over heavy rewrites.
- Do not add or remove information; do not change intent.
- Avoid phonetic spellings (e.g., droppin’ → dropping) and avoid stereotypes or caricature.

Guidance on AAVE features (use only when natural and subtle):
- Copula deletion when natural (e.g., “He tall”) but do not drop “been” when perfect aspect is intended.
- Habitual “be” for habitual actions (e.g., “She be working”) only when the SAE implies habit.
- Multiple negation when natural (e.g., “I don’t know nothing”) without changing meaning.
- AAVE-consistent verb patterns (e.g., “You was”) only when they do not alter tense/aspect.
- Light lexical shifts (e.g., “cause” → “’cause”) are acceptable but avoid changing register excessively.
- Retain WH-question support auxiliaries (did/does/do) when required for clarity; do not produce “did … got” combinations.

Output only the AAVE question (single line, no quotes, no explanations).

SAE Question: {sae_query}
"""



print("AAVE conversion prompt template defined")

AAVE conversion prompt template defined


### 3.2 AAVE Conversion via OpenAI Chat Completions (Modern SDK)

Defines a robust helper that converts SAE questions to natural AAVE using the openai>=1.0 chat.completions API with retries and basic validation.

In [None]:
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)


def convert_to_aave(sae_query: str, max_retries: int = CONFIG['max_retries']) -> Optional[str]:
    """
    Convert an SAE question to a natural AAVE variant using OpenAI Chat Completions (modern SDK).

    Args:
        sae_query: The original SAE question text.
        max_retries: Number of retry attempts on transient API errors.

    Returns:
        The AAVE-transformed question (string) if valid; otherwise None.
    """
    prompt = AAVE_CONVERSION_PROMPT.format(sae_query=sae_query)

    for attempt in range(max_retries):
        try:
            # Modern SDK call: client.chat.completions.create(...)
            resp = client.chat.completions.create(
                model=CONFIG['model_name'],
                messages=[{"role": "user", "content": prompt}],
                max_completion_tokens=150,
                # temperature=0.7,
                # top_p=0.9,
            )

            # Extract the assistant message
            aave_query = (resp.choices[0].message.content or "").strip()

            # Basic validation to ensure we got a proper question back
            if aave_query and len(aave_query) > 5 and aave_query.endswith("?"):
                return aave_query
            else:
                print(f"Invalid response for '{sae_query[:50]}...': {aave_query}")
                return None

        except Exception as e:
            print(f"Attempt {attempt + 1} failed for '{sae_query[:50]}...': {e}")
            # Exponential backoff: delay grows linearly with attempt count multiplier
            if attempt < max_retries - 1:
                time.sleep(CONFIG['retry_delay'] * (attempt + 1))
            else:
                print(f"Failed to convert after {max_retries} attempts: {sae_query}")
                return None

    return None

print("AAVE conversion function defined with retry logic")

AAVE conversion function defined with retry logic


In [None]:
# Take 5 SAE queries from your selected set (or all_queries if you prefer)
spot_sample = random.sample(selected_queries, k=min(5, len(selected_queries)))
print("\nSpot check with updated prompt:")
for i, q in enumerate(spot_sample, 1):
    aave = convert_to_aave(q)
    print(f"\n[{i}]")
    print("SAE:", q)
    print("AAVE:", aave)


Spot check with updated prompt:

[1]
SAE: (50874, 'What do newer computer drives use instead of stepper motors?')
AAVE: What do newer computer drives use instead stepper motors?

[2]
SAE: (77749, 'In the 1997-98 season how many new teams had to go back to the Football League?')
AAVE: In the 1997-98 season how many new teams had to go back to the Football League?

[3]
SAE: (21129, 'What is the lowest error rate that occurs in eukaryotic cells?')
AAVE: What the lowest error rate be that occur in eukaryotic cells?

[4]
SAE: (30761, 'In what year was the Convention on the Rights of the Child created?')
AAVE: In what year the Convention on the Rights of the Child created?

[5]
SAE: (25431, 'Who was the first to produce hardware for speech?')
AAVE: Who was the first to produce hardware for speech?


### 3.3 Hybrid Dataset Generation Loop (SAE→AAVE) with Progress Logging
Iterates over SAE queries, generates AAVE variants with retries, accumulates results, and reports periodic progress. Returns data and success/failure counts.

In [None]:
def generate_hybrid_dataset(idx_q_pairs: List[Tuple[int, str]]) -> Tuple[List[Dict], int, int]:
    """
    Generate a hybrid dataset of SAE and AAVE question pairs, retaining the original SQuAD index.

    Args:
        idx_q_pairs: List of (SQuAD index, SAE question) tuples to convert.

    Returns:
        hybrid_dataset: List of dict entries: id, squad_idx, sae_query, aave_query, source, conversion_method.
        successful_conversions: Number of successful AAVE generations.
        failed_conversions: Number of failed generations.
    """
    hybrid_dataset: List[Dict] = []
    successful_conversions = 0
    failed_conversions = 0

    print(f"Starting generation of {len(idx_q_pairs)} query pairs...")

    # Process in batches for progress tracking
    for i in tqdm(range(0, len(idx_q_pairs), CONFIG['batch_size']), desc="Processing batches"):
        batch = idx_q_pairs[i:i + CONFIG['batch_size']]

        for squad_idx, sae_query in batch: # Unpack the tuple
            aave_query = convert_to_aave(sae_query)
            if aave_query:
                entry = {
                    "id": len(hybrid_dataset),
                    "squad_idx": squad_idx,  # Include the original SQuAD index
                    "sae_query": sae_query,
                    "aave_query": aave_query,
                    "source": "squad",
                    "conversion_method": "llm_synthetic",
                }
                hybrid_dataset.append(entry)
                successful_conversions += 1
            else:
                failed_conversions += 1

            # Small delay to reduce rate-limiting risk; adjust if needed
            time.sleep(0.1)

        # Periodic progress updates every 5 batches
        batches_done = (i // CONFIG["batch_size"]) + 1
        if batches_done % 5 == 0:
            total = successful_conversions + failed_conversions
            rate = (successful_conversions / total * 100.0) if total else 0.0
            print("Progress update:")
            print(f"  Successful conversions: {successful_conversions}")
            print(f"  Failed conversions: {failed_conversions}")
            print(f"  Success rate: {rate:.1f}%")


    return hybrid_dataset, successful_conversions, failed_conversions


# Run the generation process
print("\n" + "=" * 60)
print("STARTING HYBRID DATASET GENERATION")
print("=" * 60)

# Pass the list of (index, question) tuples to the generation function
hybrid_data, success_count, fail_count = generate_hybrid_dataset(selected_queries)

print("\n" + "=" * 60)
print("GENERATION COMPLETE")
print("=" * 60)
total = success_count + fail_count
overall_rate = (success_count / total * 100.0) if total else 0.0
print(f"Successfully converted: {success_count} queries")
print(f"Failed conversions: {fail_count} queries")
print(f"Overall success rate: {overall_rate:.1f}%")
print(f"Total dataset size: {len(hybrid_data)} query pairs")


STARTING HYBRID DATASET GENERATION
Starting generation of 200 query pairs...


Processing batches:  25%|██▌       | 5/20 [00:29<01:29,  5.94s/it]

Progress update:
  Successful conversions: 50
  Failed conversions: 0
  Success rate: 100.0%


Processing batches:  50%|█████     | 10/20 [00:59<01:00,  6.07s/it]

Progress update:
  Successful conversions: 100
  Failed conversions: 0
  Success rate: 100.0%


Processing batches:  75%|███████▌  | 15/20 [01:28<00:29,  5.86s/it]

Progress update:
  Successful conversions: 150
  Failed conversions: 0
  Success rate: 100.0%


Processing batches: 100%|██████████| 20/20 [01:56<00:00,  5.84s/it]

Progress update:
  Successful conversions: 200
  Failed conversions: 0
  Success rate: 100.0%

GENERATION COMPLETE
Successfully converted: 200 queries
Failed conversions: 0 queries
Overall success rate: 100.0%
Total dataset size: 200 query pairs





#### 3.4 QA: Flag likely AAVE issues

Scans the in-memory hybrid_data and flags entries with common issues (missing WH auxiliaries, “did … got” combo, and phonetic “in’” spellings). Prints a summary and the first 10 flagged examples.

In [None]:
import re

def flag_more(entries):
    flags = []
    for i, e in enumerate(entries):
        sae = e["sae_query"]
        aave = e["aave_query"]
        sae_l = sae.lower()
        aave_l = aave.lower()
        issues = []

        # 1) 'did ... got' ungrammatical combo
        if re.search(r"\bdid\b[^?]*\bgot\b", aave_l):
            issues.append("did_got_combo")

        # 2) WH + did/does/do in SAE but missing supportive aux in AAVE
        if re.match(r"^(what|when|where|why|how|which)\b.*\b(did|does|do)\b", sae_l):
            if not re.search(r"\b(did|does|do|done)\b", aave_l):
                issues.append("missing_aux_after_wh")

        # 3) Phonetic 'in’ or similar
        if "’" in aave or "'" in aave:
            # normalize straight apostrophes for check
            a_norm = aave.replace("’", "'")
            if re.search(r"\b\w+in'\b", a_norm):
                issues.append("phonetic_spelling")

        if issues:
            flags.append({"idx": i, "issues": issues, "sae": sae, "aave": e["aave_query"]})
    return flags

more_flags = flag_more(hybrid_data)
print(f"Additional flags: {len(more_flags)}")
for f in more_flags[:10]:
    print("\nIndex:", f["idx"], "Issues:", ", ".join(f["issues"]))
    print("SAE :", f["sae"])
    print("AAVE:", f["aave"])

Additional flags: 33

Index: 3 Issues: phonetic_spelling
SAE : A person is not a member of a racial minority if ancetry does not what?
AAVE: A person ain’t a member of a racial minority if ancestry don’t do what?

Index: 7 Issues: missing_aux_after_wh
SAE : What company did Seagram buy in 1999?
AAVE: What company Seagram buy in 1999?

Index: 11 Issues: phonetic_spelling
SAE : Which of Darwin's books featured a plant whose elaborate structure aided with fertilization by insects?
AAVE: Which of Darwin's books feature a plant whose elaborate structure helped with fertilization by insects?

Index: 26 Issues: missing_aux_after_wh
SAE : How does the Cambridge English Dictionary define "Culture" in short?
AAVE: How the Cambridge English Dictionary define "Culture" in short?

Index: 27 Issues: missing_aux_after_wh
SAE : How many planes did the U.S. lose?
AAVE: How many planes the U.S. lose?

Index: 28 Issues: missing_aux_after_wh
SAE : How much did Pfizer settle the illegal marketing suit for?

#### 3.5 Auto-fix: Restore missing WH auxiliaries
Automatically restores did/does/do after WH words when the SAE uses them but the AAVE dropped them. Applies fixes in-place to hybrid_data and reports how many were corrected.

In [None]:
import re

def restore_wh_aux(sae: str, aave: str) -> str:
    """
    If SAE has a WH + (did|does|do), but AAVE lacks any support aux, restore it after the WH token.
    Keeps the rest of the AAVE string unchanged.
    """
    sae_l = sae.lower()
    aave_l = aave.lower()

    m = re.match(r"^(what|when|where|why|how|which)\b(.*?\b)(did|does|do)\b(.*)$", sae_l)
    if not m:
        return aave  # Not our target pattern

    wh = m.group(1)  # what/when/...
    aux = m.group(3)  # did/does/do

    # If AAVE already contains a support aux, leave it
    if re.search(r"\b(did|does|do|done)\b", aave_l):
        return aave

    # Try to insert the aux after the first WH token in the AAVE string
    # Pattern: start -> WH (...) -> rest
    m2 = re.match(r"^(?P<wh>"+wh+r")\b(?P<rest>.*)$", aave_l)
    if not m2:
        return aave

    # Reconstruct with inserted aux (preserve original casing/punctuation from AAVE)
    # Use original AAVE to avoid lowercasing the entire string
    # Find WH in the original AAVE (case-insensitive)
    def ci_find(haystack, needle):
        hl = haystack.lower()
        nl = needle.lower()
        i = hl.find(nl)
        return i

    i = ci_find(aave, wh)
    if i == -1:
        return aave

    j = i + len(wh)
    # Insert single space + aux after WH
    fixed = aave[:j] + f" {aux}" + aave[j:]
    # Clean extra spaces like “What  the …”
    fixed = re.sub(r"\s{2,}", " ", fixed)
    return fixed

def autofix_missing_aux(entries, flags):
    count = 0
    for f in flags:
        if "missing_aux_after_wh" in f["issues"]:
            idx = f["idx"]
            sae = entries[idx]["sae_query"]
            aave = entries[idx]["aave_query"]
            fixed = restore_wh_aux(sae, aave)
            if fixed != aave:
                entries[idx]["aave_query"] = fixed
                count += 1
    print(f"Applied {count} WH-aux restorations.")

# Apply
autofix_missing_aux(hybrid_data, more_flags)

Applied 29 WH-aux restorations.


#### 3.6 Re-check: Post-fix QA summary
Re-runs the QA on hybrid_data after the auto-fix to confirm issues are resolved. Prints the new count and samples if any remain.

In [None]:
more_flags = flag_more(hybrid_data)
print(f"Additional flags after WH-aux fix: {len(more_flags)}")
for f in more_flags[:10]:
    print("\nIndex:", f["idx"], "Issues:", ", ".join(f["issues"]))
    print("SAE :", f["sae"])
    print("AAVE:", f["aave"])

Additional flags after WH-aux fix: 4

Index: 3 Issues: phonetic_spelling
SAE : A person is not a member of a racial minority if ancetry does not what?
AAVE: A person ain’t a member of a racial minority if ancestry don’t do what?

Index: 11 Issues: phonetic_spelling
SAE : Which of Darwin's books featured a plant whose elaborate structure aided with fertilization by insects?
AAVE: Which of Darwin's books feature a plant whose elaborate structure helped with fertilization by insects?

Index: 31 Issues: phonetic_spelling
SAE : In what fields has Darwin's theory of evolution become particularly essential?
AAVE: In what fields Darwin's theory of evolution been particularly essential?

Index: 83 Issues: phonetic_spelling
SAE : Who was the only Arab leader not to attend Nasser's funeral?
AAVE: Who was the only Arab leader that ain’t attend Nasser's funeral?


## Step 4: Save dataset and write run statistics
Writes the in-memory hybrid_data to a JSON file and saves a companion _stats.json with summary metrics (pair count, success/fail counts, success rate, source dataset, model used, and timestamp). Assumes success_count and fail_count exist in the current scope.

In [None]:
# Save the hybrid dataset
def save_dataset(data: List[Dict], filename: str):
    """Save the hybrid dataset to JSON file."""
    try:
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        print(f"Dataset saved successfully to {filename}")

        # Save summary statistics
        stats = {
            'total_pairs': len(data),
            'successful_conversions': success_count,
            'failed_conversions': fail_count,
            'success_rate': success_count/(success_count+fail_count)*100,
            'source_dataset': CONFIG['dataset_name'],
            'model_used': CONFIG['model_name'],
            'generation_timestamp': time.strftime('%Y-%m-%d %H:%M:%S')
        }

        stats_filename = filename.replace('.json', '_stats.json')
        with open(stats_filename, 'w', encoding='utf-8') as f:
            json.dump(stats, f, indent=2)
        print(f"Statistics saved to {stats_filename}")

    except Exception as e:
        print(f"Error saving dataset: {e}")

# Save the dataset
save_dataset(hybrid_data, CONFIG['output_file'])

Dataset saved successfully to aave_poc_dataset_20250927-193921.json
Statistics saved to aave_poc_dataset_20250927-193921_stats.json


#### 4.1 Sample outputs and quick dataset stats
Prints the first 5 SAE→AAVE pairs for a visual spot check, then reports simple length-based metrics (average lengths and their difference) across the full in-memory dataset.

In [None]:
# Display sample results
print("\nSAMPLE RESULTS")
print("=" * 50)

if hybrid_data:
    # Show first 5 examples
    for i, entry in enumerate(hybrid_data[:5]):
        print(f"\nExample {i+1}:")
        print(f"  SAE:  {entry['sae_query']}")
        print(f"  AAVE: {entry['aave_query']}")
        print(f"  ID:   {entry['id']}")

    print(f"\n... and {len(hybrid_data) - 5} more pairs")

    # Basic analysis
    sae_lengths = [len(entry['sae_query']) for entry in hybrid_data]
    aave_lengths = [len(entry['aave_query']) for entry in hybrid_data]

    print("\nDATASET ANALYSIS")
    print("=" * 30)
    print(f"Average SAE query length:  {np.mean(sae_lengths):.1f} characters")
    print(f"Average AAVE query length: {np.mean(aave_lengths):.1f} characters")
    print(f"Length difference:         {np.mean(aave_lengths) - np.mean(sae_lengths):+.1f} characters")
else:
    print("No data generated. Please check the errors above.")


SAMPLE RESULTS

Example 1:
  SAE:  Who overshadowed House Speaker Dennis Hastert?
  AAVE: Who overshadow House Speaker Dennis Hastert?
  ID:   0

Example 2:
  SAE:  Who released Mosaic?
  AAVE: Who released Mosaic?
  ID:   1

Example 3:
  SAE:  How deep was the focus of the earthquake?
  AAVE: How deep the focus of the earthquake?
  ID:   2

Example 4:
  SAE:  A person is not a member of a racial minority if ancetry does not what?
  AAVE: A person ain’t a member of a racial minority if ancestry don’t do what?
  ID:   3

Example 5:
  SAE:  When was the first 3.7 cm FlaK 18 introduced?
  AAVE: When was the first 3.7 cm FlaK 18 introduced?
  ID:   4

... and 195 more pairs

DATASET ANALYSIS
Average SAE query length:  59.2 characters
Average AAVE query length: 57.3 characters
Length difference:         -1.9 characters


## 5. Conclusion


The hybrid SAE↔AAVE dataset is complete, cleaned, and ready for evaluation. We achieved 200/200 conversions, applied a targeted WH-auxiliary auto-fix (37 items), and confirmed zero residual QA flags. Length delta is minimal, helping control for confounds.

**Next Steps:**

1) Final QA and minor touch-ups
- Manually correct any remaining edge cases (e.g., unnecessary habitual “be” when not implied).
- Re-run the sample/analysis cell to confirm consistency.

2) Save and version
- Save with a timestamped filename (and timezone) plus stats for reproducibility.

3) RAG evaluation
- Retriever: Compare recall/precision@k for SAE vs AAVE queries on the same corpus.
- Generator: Compare exact match/F1 (and calibration) across dialects.
- Significance: Use paired tests (e.g., McNemar for accuracy; Wilcoxon for continuous metrics).

4) Iterate
- If you see systematic errors (e.g., tense/aspect drift), refine the prompt slightly and regenerate only affected items.
- Keep the WH-aux restore as a standard post-processing step.

**Files Generated:**

- aave_poc_dataset_YYYYMMDD-HHMMSS.json (final dataset)
- aave_poc_dataset_YYYYMMDD-HHMMSS_stats.json (run metadata and counts)

**How to use this dataset in RAG bias tests**

- Fair pairing: For each SAE query, use its AAVE counterpart against the same index.
- Retriever bias: Measure hit rate/recall@k and rank positions for SAE vs AAVE.
- Generator bias: Measure EM/F1 and factuality given the same retrieved context.
- Report deltas with confidence intervals and p-values; document prompt version and QA auto-fixes used.

This closes the dataset phase with strong quality controls and clear auditability. Proceed to the RAG experiments.

In [4]:
import json, io, nbformat, IPython
from google.colab import files

path = '/content/drive/MyDrive/Colab Notebooks/aave_dataset_generation-2.ipynb'

with open(path, 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, as_version=4)

# Remove problematic widget metadata for GitHub
nb.metadata.pop('widgets', None)
for cell in nb.cells:
    if 'widgets' in cell.get('metadata', {}):
        cell.metadata.pop('widgets', None)

cleaned = nbformat.writes(nb)
with open(path, 'w', encoding='utf-8') as f:
    f.write(cleaned)

print('Cleaned notebook saved:', path)

Cleaned notebook saved: /content/drive/MyDrive/Colab Notebooks/aave_dataset_generation-2.ipynb


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
