# Survey Metadata Testing & Profile Generator Demo

**Purpose**: This notebook helps validate survey metadata files and test out the profile generator functionality.

We will use profile generator to create text representation of respondents and conduct various experiments. As such, it is a key project functionality. It would be good to give it a trial run and see if we can spot any bugs before we run the main analysis in January.

It is also crucial we get the metadata files to be as polished as possible. We want to avoid survey/metadata creation artifacts messing up the experiments.

## What this notebook covers:
1. **Metadata Validation** - Check structure, completeness, and common issues
2. **Profile Generation** - Generate respondent profiles with varying richness
3. **Information Leakage Analysis** - Verify semantic filtering excludes related features
4. **Target Question Handling** - Test country-specific options
5. **Output Formatting** - Preview different prompt formats
6. **Quality Assurance Checklist** - What to look for when testing

---

## **1. Setup**

Run the cells below to import required packages and profile generator.

In [84]:
# Clone and install the package
# !git clone https://github.com/Oxford-LLMs-Research/synthetic_sampling
# %cd synthetic_sampling
# !pip install -e . -q  # Install in editable mode

In [85]:
import json
import pandas as pd
import numpy as np
from pathlib import Path
from collections import Counter
import warnings
import sys
import os

sys.path.insert(0, 'src')

# Import the profile generator (adjust path as needed)
from synthetic_sampling.profiles import (
    RespondentProfileGenerator,
    list_profile_formats,
    PROFILE_FORMATS
)

# For pretty printing
from IPython.display import display, Markdown, HTML

print("‚úì Imports successful")

‚úì Imports successful


---
**Load Your Metadata**

Replace the path below with your metadata JSON file. The files would be in one of the folders within `/content/synthetic_sampling/src/synthetic_sampling/profiles`.

In [86]:
# ============================================================
# CONFIGURE YOUR FILES HERE
# ============================================================

METADATA_PATH = "./ess11_profiles_metadata.json"  # <-- UPDATE THIS

# Load metadata
with open(METADATA_PATH, 'r', encoding='utf-8') as f:
    metadata = json.load(f)

print(f"‚úì Loaded metadata from: {METADATA_PATH}")
print(f"  Number of sections: {len(metadata)}")
print(f"  Sections: {list(metadata.keys())}")

‚úì Loaded metadata from: ./ess11_profiles_metadata.json
  Number of sections: 11
  Sections: ['demographics', 'other', 'trust_in_social_groups', 'politics', 'governance', 'civic_engagement', 'trust_in_institutions', 'political_affiliation', 'economic_outlook', 'climate_change', 'covid']


In [87]:
# this cell is optional; it is to pull out survey csv files from gdrive.

# mount_point = '/content/drive'
# from google.colab import drive
# drive.mount(mount_point)

In [88]:
# SURVEY_DATA_PATH = "/content/drive/MyDrive/Oxford LLMs Research/surveys/WVS/WVS_2017_22.csv"  # <-- UPDATE THISSURVEY_DATA_PATH = "/content/drive/MyDrive/Oxford LLMs Research/surveys/WVS/WVS_2017_22.csv"  # <-- UPDATE THIS

# load survey csv's
# wvs_data = pd.read_csv("/content/drive/MyDrive/Oxford LLMs Research/surveys/WVS/WVS_2017_22.csv")
ess_data = pd.read_csv("../../../../../data/ess/ESS11.csv", low_memory = False)

## **2. Metadata Validation**

---
**Metadata Structure Validation**

This section checks that your metadata follows the expected structure.

In [89]:
#@title metadata validation
"""
Enhanced Survey Metadata Validation for LLM-based Survey Response Prediction

This module validates survey metadata structure and detects common survey artifacts
that need to be cleaned or reformulated before use with LLMs.

Artifact Categories Detected:
1. Interviewer instructions (SHOWCARD, Do not read, etc.)
2. Placeholders in question text ([president], [country], [year], etc.)
3. Binary marking patterns (Marked/Not marked, Yes/No checkboxes)
4. Skip/routing logic artifacts (Go to Q__, If respondent answers...)
5. Response code artifacts (Code 7, Code 8, etc.)
6. Administrative/meta artifacts (Interview Record, Post-code, etc.)
7. Scale presentation artifacts (Likert abbreviations like SA, SWA, SWD, SD)
8. Multi-select/checkbox battery artifacts
9. Country-specific placeholder patterns
10. Numeric code artifacts in values (97, 98, 99 for DK/Refuse)

Author: Research Team
Purpose: Pre-processing validation for LLM survey prediction experiments
"""

from collections import Counter
import re
from typing import Dict, List, Tuple, Any, Optional
from dataclasses import dataclass, field


@dataclass
class ArtifactMatch:
    """Represents a detected artifact in survey metadata."""
    artifact_type: str
    location: str  # section/variable path
    field: str  # 'question', 'description', 'values', etc.
    matched_text: str
    pattern_name: str
    severity: str  # 'critical', 'warning', 'info'
    suggestion: Optional[str] = None


@dataclass
class ValidationReport:
    """Complete validation report for survey metadata."""
    critical: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)
    info: List[str] = field(default_factory=list)
    artifacts: List[ArtifactMatch] = field(default_factory=list)
    statistics: Dict[str, Any] = field(default_factory=dict)


# =============================================================================
# ARTIFACT DETECTION PATTERNS
# =============================================================================

# 1. Interviewer instruction patterns
INTERVIEWER_INSTRUCTION_PATTERNS = [
    # Showcard instructions
    (r'\(?\s*SHOWCARD\s*\)?', 'showcard_instruction'),
    (r'\[?\s*SHOW\s*CARD\s*\]?', 'showcard_instruction'),

    # Do not read instructions
    (r'\[?\s*[Dd]o\s*not\s*read\s*[:\]]?', 'do_not_read'),
    (r'\[?\s*DO\s*NOT\s*READ\s*[:\]]?', 'do_not_read'),
    (r'\(?\s*[Dd]o\s*not\s*read\s*[:\)]?', 'do_not_read'),

    # Interviewer notes
    (r'\[?\s*[Ii]nterviewer\s*:?\s*[^\]]*\]?', 'interviewer_note'),
    (r'\[?\s*INTERVIEWER\s*:?\s*[^\]]*\]?', 'interviewer_note'),
    (r'\[?\s*[Nn]ote\s*:?\s*[^\]]*\]?', 'note_instruction'),
    (r'\[?\s*NOTE\s*:?\s*[^\]]*\]?', 'note_instruction'),
    (r'\[?\s*[Ii]nstruction\s*:?\s*[^\]]*\]?', 'instruction_note'),

    # Read aloud instructions (including READ EACH ITEM from Latinobar√≥metro)
    (r'\[?\s*[Rr]ead\s*out\s*[^\]]*\]?', 'read_aloud'),
    (r'\[?\s*READ\s*OUT\s*[^\]]*\]?', 'read_aloud'),
    (r'\(?\s*[Rr]ead\s*out\s*options?\s*\)?', 'read_aloud'),
    (r'READ\s*EACH\s*ITEM', 'read_aloud'),

    # One answer only
    (r'\(?\s*ONE\s*ANSWER\s*ONLY\s*\)?', 'single_answer_instruction'),
    (r'\(?\s*[Oo]ne\s*answer\s*only\s*\)?', 'single_answer_instruction'),
    (r'\[?\s*[Mm]ultiple\s*answers?\s*allowed\s*\]?', 'multiple_answer_instruction'),

    # Probe instructions
    (r'\[?\s*[Pp]robe\s*[^\]]*\]?', 'probe_instruction'),
    (r'\[?\s*PROBE\s*[^\]]*\]?', 'probe_instruction'),

    # Optional question markers
    (r'<\s*[Oo]ptional\s*>', 'optional_marker'),
    (r'<\s*OPTIONAL\s*>', 'optional_marker'),

    # GBS markers (Global Barometer Survey standard questions)
    (r'\[?\s*GBS\s*\]?', 'gbs_marker'),

    # For coding only instructions (Latinobar√≥metro pattern)
    (r'FOR\s*CODING\s*ONLY', 'coding_only_instruction'),
    (r'[Ff]or\s*coding\s*only', 'coding_only_instruction'),

    # If volunteered pattern (Latinobar√≥metro)
    (r'\(?\s*if\s*volunteered\s*\)?', 'if_volunteered'),
    (r'\(?\s*IF\s*VOLUNTEERED\s*\)?', 'if_volunteered'),

    # Assessment/observation instructions (for interviewer to fill)
    (r'ASSESSMENT\s*OF\s*THE\s*INTERVIEWEE', 'interviewer_assessment'),
    (r'[Tt]ake\s*as\s*reference', 'interviewer_assessment'),
]

# 2. Placeholder patterns (need country/context-specific filling)
PLACEHOLDER_PATTERNS = [
    # Bracketed placeholders - various styles
    (r'\[country\s*name?\]', 'country_placeholder'),
    (r'\[Country\s*Name?\]', 'country_placeholder'),
    (r'\[COUNTRY\]', 'country_placeholder'),
    (r'\[country\]', 'country_placeholder'),
    (r'\[Country\s*X?\]', 'country_placeholder'),
    # Parenthetical country placeholders (Latinobar√≥metro style)
    (r'\(country\)', 'country_placeholder'),
    (r'\(COUNTRY\)', 'country_placeholder'),
    (r'\(Country\)', 'country_placeholder'),
    # Country possessive forms
    (r'\(COUNTRY¬¥S\)', 'country_placeholder'),
    (r'\(COUNTRY\'S\)', 'country_placeholder'),
    (r"\(country's\)", 'country_placeholder'),

    # Leader/official placeholders
    (r'\[president\]', 'leader_placeholder'),
    (r'\[President\]', 'leader_placeholder'),
    (r'\[prime\s*minister\]', 'leader_placeholder'),
    (r'\[Prime\s*Minister\]', 'leader_placeholder'),
    (r'\[PM\]', 'leader_placeholder'),
    (r'\[name\s*of\s*president[^\]]*\]', 'leader_placeholder'),
    (r'\[current\s*ruling[^\]]*\]', 'leader_placeholder'),
    # Latinobar√≥metro style: (president¬¥s name)
    (r"\(president['\u00b4\u2019]?s?\s*name\)", 'leader_placeholder'),
    (r"\(PRESIDENT['\u00b4\u2019]?S?\s*NAME\)", 'leader_placeholder'),

    # Year placeholders
    (r'\[year\]', 'year_placeholder'),
    (r'\[Year\]', 'year_placeholder'),
    (r'\[YEAR\]', 'year_placeholder'),
    (r'\[20\d{2}\]', 'year_placeholder'),

    # Institution placeholders
    (r'\[specify\s*institution[^\]]*\]', 'institution_placeholder'),
    (r'\[name\s*of\s*institution[^\]]*\]', 'institution_placeholder'),
    (r'\[election\s*commission[^\]]*\]', 'institution_placeholder'),

    # Party placeholders - only match generic placeholder patterns like "Party A", "Party B"
    # NOT legitimate party names like "Republican Party of Namibia"
    # Require: standalone "Party X" at word boundary, or bracketed, or followed by ellipsis
    (r'\bParty\s+[A-Z]\s*$', 'party_placeholder'),  # "Party A" at end of string
    (r'\bParty\s+[A-Z]\s*[,\.]', 'party_placeholder'),  # "Party A," or "Party A."
    (r'\bParty\s+[A-Z]\s+Party\s+[A-Z]', 'party_placeholder'),  # "Party A Party B" pattern
    (r'\[Party\s*[A-Z]\]', 'party_placeholder'),  # "[Party A]" bracketed
    (r'Party\s+[A-Z]\.{2,}', 'party_placeholder'),  # "Party A..." with ellipsis
    (r'\[country-specific[^\]]*\]', 'country_specific_placeholder'),
    (r'\[use\s*two-digit\s*code[^\]]*\]', 'coding_instruction_placeholder'),

    # Generic fill-in placeholders
    (r'\[fill\s*in[^\]]*\]', 'fill_in_placeholder'),
    (r'\[specify[^\]]*\]', 'specify_placeholder'),
    (r'___+', 'blank_line_placeholder'),
    (r'\(\s*\)', 'empty_parens_placeholder'),

    # Capital city placeholder
    (r'\[in\s*capital\s*city\]', 'location_placeholder'),

    # Item placeholder (Latinobar√≥metro battery questions)
    (r'\[ITEM\]', 'item_placeholder'),
    (r'\[item\]', 'item_placeholder'),

    # National/local placeholders (Latinobar√≥metro)
    (r'\(NATIONAL\)', 'national_placeholder'),
    (r'\(national\)', 'national_placeholder'),
]

# 3. Skip/routing logic patterns
SKIP_LOGIC_PATTERNS = [
    (r'[Gg]o\s*to\s*[Qq]\.?\s*\d+', 'skip_instruction'),
    (r'GO\s*TO\s*Q\.?\s*\d+', 'skip_instruction'),
    (r'\[?\s*[Ss]kip\s*to\s*[Qq]\.?\s*\d+\s*\]?', 'skip_instruction'),
    (r'[Gg]o\s*to\s*INTRO', 'skip_instruction'),
    (r'GO\s*TO\s*INTRO', 'skip_instruction'),
    (r'[Ii]f\s*the\s*respondent\s*answers?', 'conditional_routing'),
    (r'IF\s*THE\s*RESPONDENT', 'conditional_routing'),
    (r'\[?\s*[Ll]ogical\s*check\s*with\s*[Qq]\.?\s*\d+\s*\]?', 'logic_check'),
    (r'[Ff]or\s*those\s*go\s*to', 'routing_note'),
]

# 4. Response code artifacts
RESPONSE_CODE_PATTERNS = [
    # Code references in text
    (r'[Cc]ode\s*\d+\s*:', 'code_reference'),
    (r'\(\s*[Cc]ode\s*\d+\s*[:\)]', 'code_reference'),
    (r'[Cc]ode\s*\d+-\d+', 'code_range'),

    # Post-coding instructions
    (r'\(?\s*[Pp]ost[\s-]*[Cc]ode\s*\)?', 'post_code_instruction'),
    (r'\(?\s*POST[\s-]*CODE\s*\)?', 'post_code_instruction'),

    # Pre-coded instructions
    (r'open-ended,?\s*pre-coded', 'precoded_instruction'),
    (r'\(?\s*[Rr]ecord\s*[Vv]erbatim[^\)]*\)?', 'verbatim_instruction'),
]

# 5. Scale abbreviation patterns (Likert scales)
SCALE_ABBREVIATION_PATTERNS = [
    # Header row abbreviations
    (r'\bSA\b.*\bSWA\b.*\bSWD\b.*\bSD\b', 'likert_header_abbreviations'),
    (r'\bDU\b.*\bCC\b.*\bDA\b', 'response_code_abbreviations'),

    # Individual abbreviations that might appear in values
    (r'^SA$', 'strongly_agree_abbrev'),
    (r'^SWA$', 'somewhat_agree_abbrev'),
    (r'^SWD$', 'somewhat_disagree_abbrev'),
    (r'^SD$', 'strongly_disagree_abbrev'),
    (r'^DU$', 'dont_understand_abbrev'),
    (r'^CC$', 'cant_choose_abbrev'),
    (r'^DA$', 'decline_answer_abbrev'),
    (r'^DK$', 'dont_know_abbrev'),
]

# 6. Binary/checkbox marking patterns
BINARY_MARKING_PATTERNS = [
    (r'[Mm]arked', 'marked_response'),
    (r'[Nn]ot\s*[Mm]arked', 'not_marked_response'),
    (r'^1$.*^2$', 'binary_yes_no_codes'),  # Simple 1/2 coding
    (r'[Cc]hecked', 'checkbox_response'),
    (r'[Uu]nchecked', 'checkbox_response'),
    (r'[Tt]icked', 'checkbox_response'),
]

# 7. Multi-select battery patterns
MULTISELECT_PATTERNS = [
    (r'\d+[a-z]\.\s', 'sub_question_numbering'),
    (r'[Qq]\d+[a-z]', 'sub_question_reference'),
    (r'First\s*[Oo]rganization', 'multi_response_first'),
    (r'Second\s*[Oo]rganization', 'multi_response_second'),
    (r'Third\s*[Oo]rganization', 'multi_response_third'),
]

# 8. Administrative/meta patterns
ADMINISTRATIVE_PATTERNS = [
    (r'[Ii]nterview\s*[Nn]o\.?', 'interview_number'),
    (r"[Ii]nterviewer['\u2019]?s?\s*number", 'interviewer_id'),
    (r'[Dd]ate\s*of\s*[Ii]nterview', 'interview_date'),
    (r'[Pp]ostal\s*[Zz]ip\s*[Cc]ode', 'postal_code'),
    (r'TO\s*BE\s*FILLED\s*IN\s*BY', 'admin_instruction'),
    (r'[Ff]illed\s*in\s*by\s*the\s*interviewer', 'admin_instruction'),
]

# 9. Special response value patterns (missing data codes)
SPECIAL_VALUE_PATTERNS = [
    # Standard missing data codes
    (r'^97$', 'do_not_understand_code'),
    (r'^98$', 'cant_choose_code'),
    (r'^99$', 'decline_to_answer_code'),
    (r'^0$', 'not_applicable_code'),
    (r'^7$', 'dont_understand_single_digit'),
    (r'^8$', 'cant_choose_single_digit'),
    (r'^9$', 'decline_answer_single_digit'),

    # Ranges indicating special codes
    (r'^9[0-9]$', 'special_code_90s'),
]

# 10. Missing data label patterns (text patterns indicating non-response)
MISSING_DATA_LABEL_PATTERNS = [
    # DNK/DK patterns (Don't Know)
    (r'\bDNK\b', 'dont_know_abbreviation'),
    (r'\bDK\b', 'dont_know_abbreviation'),
    (r'\bDKN\b', 'dont_know_abbreviation'),
    (r"[Dd]on['\u2019]?t\s*[Kk]now", 'dont_know_label'),

    # NA patterns (No Answer / Not Applicable)
    (r'\bNA\b', 'no_answer_abbreviation'),
    (r'\bDNA\b', 'no_answer_abbreviation'),
    (r'\bDNK/NA\b', 'combined_dk_na'),
    (r'\bDK/NA\b', 'combined_dk_na'),
    (r'\bDNK/DNA\b', 'combined_dk_na'),

    # Not applicable patterns
    (r'[Nn]ot\s*[Aa]pplicable', 'not_applicable_label'),
    (r'[Nn]o\s*[Aa]nswer', 'no_answer_label'),
    (r'[Nn]o\s*[Dd]ata', 'no_data_label'),
]

# 11. Column position references (SPSS-style codebook artifacts)
COLUMN_POSITION_PATTERNS = [
    # Column position in parentheses like (60) or (63-64)
    (r'\(\d{1,3}\)$', 'column_position_single'),
    (r'\(\d{1,3}-\d{1,3}\)$', 'column_position_range'),
    # SPSS variable position notation
    (r'\d+-\d+\s*$', 'position_notation'),
]

# 12. Ranking/priority question patterns (multi-part questions)
RANKING_QUESTION_PATTERNS = [
    # "Which is second most important" patterns
    (r'[Ww]hich\s*would\s*be\s*second', 'ranking_second'),
    (r'[Ss]econd\s*most\s*important', 'ranking_second'),
    (r'[Ff]irst\s*most\s*important', 'ranking_first'),
    (r'[Tt]hird\s*most\s*important', 'ranking_third'),
    # Sequential question indicators
    (r'\.\d+\.\s', 'sequential_question_number'),
]

# 13. Recode/derived variable patterns
RECODE_PATTERNS = [
    (r'RECODE\s*OF', 'recode_variable'),
    (r'RECODED', 'recode_variable'),
    (r'[Rr]ecode\s*of', 'recode_variable'),
    # Local/weight variables
    (r'LOCAL\s*VARIABLE', 'local_variable'),
    (r'\bWT\b\.?\s*[Ww]eight', 'weight_variable'),
]

# 14. Survey series identifiers in variable names
SURVEY_SERIES_PATTERNS = [
    # Common suffixes indicating survey standardization
    (r'ST(?:GBS)?\.', 'standard_question_marker'),  # ST or STGBS prefix
    (r'GBS\.', 'global_barometer_marker'),
    (r'WVS', 'world_values_survey_marker'),
    (r'CSN?\.', 'country_specific_marker'),
    (r'INN?\.', 'innovation_module_marker'),
    (r'SDN\.', 'special_module_marker'),
]

# 10. Question text quality issues
QUESTION_QUALITY_PATTERNS = [
    # Questions that are too short
    (r'^.{1,10}$', 'very_short_question'),

    # Questions with just variable codes (but not common short words)
    # Must be at least 2 chars and look like a variable code (e.g., Q1, SE3, ABC123)
    (r'^[A-Z]{2,}\d+$', 'variable_code_only'),
    (r'^[A-Z]\d{2,}$', 'variable_code_only'),

    # Questions referencing other questions
    (r'[Aa]nswer\s*in\s*[Qq]\.?\s*\d+', 'answer_reference'),
    (r'[Ss]ee\s*[Qq]\.?\s*\d+', 'question_reference'),

    # System/form artifacts
    (r'for\s*presidential\s*system', 'system_conditional'),
    (r'for\s*parliamentary\s*system', 'system_conditional'),
]


def compile_patterns(pattern_list: List[Tuple[str, str]]) -> List[Tuple[re.Pattern, str]]:
    """Compile regex patterns for efficiency."""
    return [(re.compile(pattern, re.IGNORECASE), name) for pattern, name in pattern_list]


# Compile all pattern groups
COMPILED_PATTERNS = {
    'interviewer_instructions': compile_patterns(INTERVIEWER_INSTRUCTION_PATTERNS),
    'placeholders': compile_patterns(PLACEHOLDER_PATTERNS),
    'skip_logic': compile_patterns(SKIP_LOGIC_PATTERNS),
    'response_codes': compile_patterns(RESPONSE_CODE_PATTERNS),
    'scale_abbreviations': compile_patterns(SCALE_ABBREVIATION_PATTERNS),
    'binary_marking': compile_patterns(BINARY_MARKING_PATTERNS),
    'multiselect': compile_patterns(MULTISELECT_PATTERNS),
    'administrative': compile_patterns(ADMINISTRATIVE_PATTERNS),
    'special_values': compile_patterns(SPECIAL_VALUE_PATTERNS),
    'question_quality': compile_patterns(QUESTION_QUALITY_PATTERNS),
    'missing_data_labels': compile_patterns(MISSING_DATA_LABEL_PATTERNS),
    'column_positions': compile_patterns(COLUMN_POSITION_PATTERNS),
    'ranking_questions': compile_patterns(RANKING_QUESTION_PATTERNS),
    'recode_variables': compile_patterns(RECODE_PATTERNS),
    'survey_series': compile_patterns(SURVEY_SERIES_PATTERNS),
}

# Severity mapping for artifact types
SEVERITY_MAP = {
    # Critical - must fix before use
    'country_placeholder': 'critical',
    'leader_placeholder': 'critical',
    'year_placeholder': 'critical',
    'institution_placeholder': 'critical',
    'party_placeholder': 'critical',
    'country_specific_placeholder': 'critical',
    'fill_in_placeholder': 'critical',
    'specify_placeholder': 'critical',
    'blank_line_placeholder': 'critical',
    'variable_code_only': 'critical',
    'item_placeholder': 'critical',
    'national_placeholder': 'critical',

    # Warning - should review/clean
    'showcard_instruction': 'warning',
    'do_not_read': 'warning',
    'interviewer_note': 'warning',
    'instruction_note': 'warning',
    'read_aloud': 'warning',
    'single_answer_instruction': 'warning',
    'multiple_answer_instruction': 'warning',
    'probe_instruction': 'warning',
    'skip_instruction': 'warning',
    'conditional_routing': 'warning',
    'logic_check': 'warning',
    'code_reference': 'warning',
    'post_code_instruction': 'warning',
    'precoded_instruction': 'warning',
    'verbatim_instruction': 'warning',
    'marked_response': 'warning',
    'not_marked_response': 'warning',
    'checkbox_response': 'warning',
    'system_conditional': 'warning',
    'answer_reference': 'warning',
    'question_reference': 'warning',
    'location_placeholder': 'warning',
    'coding_instruction_placeholder': 'warning',
    'coding_only_instruction': 'warning',
    'if_volunteered': 'warning',
    'interviewer_assessment': 'warning',
    'recode_variable': 'warning',
    'local_variable': 'warning',
    'weight_variable': 'warning',

    # Info - awareness only
    'optional_marker': 'info',
    'gbs_marker': 'info',
    'note_instruction': 'info',
    'sub_question_numbering': 'info',
    'sub_question_reference': 'info',
    'multi_response_first': 'info',
    'multi_response_second': 'info',
    'multi_response_third': 'info',
    'do_not_understand_code': 'info',
    'cant_choose_code': 'info',
    'decline_to_answer_code': 'info',
    'not_applicable_code': 'info',
    'dont_understand_single_digit': 'info',
    'cant_choose_single_digit': 'info',
    'decline_answer_single_digit': 'info',
    'special_code_90s': 'info',
    'very_short_question': 'info',
    'routing_note': 'info',
    'empty_parens_placeholder': 'info',
    'admin_instruction': 'info',
    'interview_number': 'info',
    'interviewer_id': 'info',
    'interview_date': 'info',
    'postal_code': 'info',
    'binary_yes_no_codes': 'info',
    'likert_header_abbreviations': 'info',
    'response_code_abbreviations': 'info',
    # Missing data labels
    'dont_know_abbreviation': 'info',
    'dont_know_label': 'info',
    'no_answer_abbreviation': 'info',
    'combined_dk_na': 'info',
    'not_applicable_label': 'info',
    'no_answer_label': 'info',
    'no_data_label': 'info',
    # Column positions
    'column_position_single': 'info',
    'column_position_range': 'info',
    'position_notation': 'info',
    # Ranking questions
    'ranking_second': 'info',
    'ranking_first': 'info',
    'ranking_third': 'info',
    'sequential_question_number': 'info',
    # Survey series markers
    'standard_question_marker': 'info',
    'global_barometer_marker': 'info',
    'world_values_survey_marker': 'info',
    'country_specific_marker': 'info',
    'innovation_module_marker': 'info',
    'special_module_marker': 'info',
}


def detect_artifacts_in_text(
    text: str,
    location: str,
    field_name: str
) -> List[ArtifactMatch]:
    """Detect all artifacts in a given text string."""
    artifacts = []

    if not text or not isinstance(text, str):
        return artifacts

    for category, patterns in COMPILED_PATTERNS.items():
        for pattern, pattern_name in patterns:
            matches = pattern.finditer(text)
            for match in matches:
                severity = SEVERITY_MAP.get(pattern_name, 'info')
                suggestion = get_suggestion(pattern_name, match.group())

                artifacts.append(ArtifactMatch(
                    artifact_type=category,
                    location=location,
                    field=field_name,
                    matched_text=match.group(),
                    pattern_name=pattern_name,
                    severity=severity,
                    suggestion=suggestion
                ))

    return artifacts


def get_suggestion(pattern_name: str, matched_text: str) -> str:
    """Generate a suggestion for fixing the artifact."""
    suggestions = {
        # Placeholders
        'country_placeholder': 'Replace with actual country name for each survey wave',
        'leader_placeholder': 'Replace with actual leader title/name (e.g., "the president" or specific name)',
        'year_placeholder': 'Replace with actual election/reference year',
        'institution_placeholder': 'Replace with actual institution name',
        'party_placeholder': 'Use country-specific party names from survey codebook',
        'fill_in_placeholder': 'Remove or replace with appropriate context',
        'blank_line_placeholder': 'Remove blank lines or convert to natural language',

        # Interviewer instructions
        'showcard_instruction': 'Remove SHOWCARD reference - not relevant for LLM prompting',
        'do_not_read': 'Consider whether these options should be included in LLM prompt',
        'interviewer_note': 'Remove interviewer instructions',
        'read_aloud': 'Remove read-aloud instructions',
        'probe_instruction': 'Remove probe instructions',

        # Response artifacts
        'marked_response': 'Convert to Yes/No or actual response labels',
        'not_marked_response': 'Convert to Yes/No or actual response labels',
        'checkbox_response': 'Convert to meaningful response labels',

        # Skip logic
        'skip_instruction': 'Remove skip logic - not applicable for independent question prompting',
        'conditional_routing': 'Document skip conditions separately; remove from question text',

        # Codes
        'code_reference': 'Remove code references from question/option text',
        'do_not_understand_code': 'Consider whether to include as valid response option',
        'cant_choose_code': 'Consider whether to include as valid response option',
        'decline_to_answer_code': 'Consider whether to include as valid response option',
    }

    return suggestions.get(pattern_name, 'Review and clean as appropriate')


def validate_metadata_structure(metadata: dict) -> ValidationReport:
    """
    Validate metadata structure and detect artifacts.

    Expected structure:
    {
        "section_name": {
            "VARIABLE_CODE": {
                "description": str,
                "question": str,
                "values": {"code": "label", ...}
            },
            ...
        },
        ...
    }

    Returns:
        ValidationReport with issues, artifacts, and statistics
    """
    report = ValidationReport()
    required_fields = ['description', 'question', 'values']
    all_variables = []
    total_value_codes = 0
    artifact_counts = Counter()

    for section_name, section_data in metadata.items():
        if not isinstance(section_data, dict):
            report.critical.append(
                f"Section '{section_name}' is not a dict (got {type(section_data).__name__})"
            )
            continue

        for var_code, var_data in section_data.items():
            all_variables.append(var_code)
            location = f"[{section_name}] {var_code}"

            if not isinstance(var_data, dict):
                report.critical.append(
                    f"{location} is not a dict"
                )
                continue

            # Check required fields
            for field in required_fields:
                if field not in var_data:
                    report.critical.append(
                        f"{location} missing required field: '{field}'"
                    )

            # Validate and check description field
            if 'description' in var_data:
                desc = var_data['description']
                if not isinstance(desc, str):
                    report.warnings.append(
                        f"{location} description is not a string"
                    )
                else:
                    # Check for artifacts in description
                    desc_artifacts = detect_artifacts_in_text(desc, location, 'description')
                    report.artifacts.extend(desc_artifacts)
                    for a in desc_artifacts:
                        artifact_counts[a.pattern_name] += 1

            # Validate and check question field
            if 'question' in var_data:
                q = var_data['question']
                if not isinstance(q, str):
                    report.warnings.append(
                        f"{location} question is not a string"
                    )
                elif len(q) < 10:
                    report.warnings.append(
                        f"{location} question seems too short: '{q}'"
                    )
                else:
                    # Check for artifacts in question text
                    q_artifacts = detect_artifacts_in_text(q, location, 'question')
                    report.artifacts.extend(q_artifacts)
                    for a in q_artifacts:
                        artifact_counts[a.pattern_name] += 1

            # Validate and check values field
            if 'values' in var_data:
                values = var_data['values']
                if not isinstance(values, dict):
                    report.critical.append(
                        f"{location} values is not a dict"
                    )
                elif len(values) == 0:
                    report.warnings.append(
                        f"{location} has empty values dict"
                    )
                else:
                    total_value_codes += len(values)

                    # Check for non-string keys
                    non_str_keys = [k for k in values.keys() if not isinstance(k, str)]
                    if non_str_keys:
                        report.warnings.append(
                            f"{location} has non-string value codes: {non_str_keys}"
                        )

                    # Check for artifacts in value labels
                    for code, label in values.items():
                        if isinstance(label, str):
                            # Check the value label
                            v_artifacts = detect_artifacts_in_text(
                                label, location, f'values[{code}]'
                            )
                            report.artifacts.extend(v_artifacts)
                            for a in v_artifacts:
                                artifact_counts[a.pattern_name] += 1

                            # Also check the code itself (as string)
                            code_artifacts = detect_artifacts_in_text(
                                str(code), location, f'value_code[{code}]'
                            )
                            # Filter to only special value patterns
                            code_artifacts = [
                                a for a in code_artifacts
                                if a.artifact_type == 'special_values'
                            ]
                            report.artifacts.extend(code_artifacts)
                            for a in code_artifacts:
                                artifact_counts[a.pattern_name] += 1

    # Check for duplicate variable codes
    duplicates = [v for v, count in Counter(all_variables).items() if count > 1]
    if duplicates:
        report.critical.append(
            f"Duplicate variable codes found: {duplicates}"
        )

    # Compile statistics
    report.statistics = {
        'total_sections': len(metadata),
        'total_variables': len(all_variables),
        'total_value_codes': total_value_codes,
        'variables_per_section': {
            section: len(vars_) for section, vars_ in metadata.items()
            if isinstance(vars_, dict)
        },
        'artifact_counts': dict(artifact_counts),
        'artifacts_by_severity': {
            'critical': len([a for a in report.artifacts if a.severity == 'critical']),
            'warning': len([a for a in report.artifacts if a.severity == 'warning']),
            'info': len([a for a in report.artifacts if a.severity == 'info']),
        },
        'artifacts_by_type': Counter(a.artifact_type for a in report.artifacts),
    }

    # Generate summary info
    report.info.append(f"Total sections: {report.statistics['total_sections']}")
    report.info.append(f"Total variables: {report.statistics['total_variables']}")
    report.info.append(f"Total value codes: {report.statistics['total_value_codes']}")

    for section, count in report.statistics['variables_per_section'].items():
        report.info.append(f"  - {section}: {count} variables")

    return report


def print_validation_report(report: ValidationReport, show_artifacts: bool = True):
    """Pretty-print the validation report."""
    print("=" * 70)
    print("METADATA VALIDATION REPORT")
    print("=" * 70)

    # Critical issues
    if report.critical:
        print("\n‚ùå CRITICAL ISSUES (must fix):")
        for issue in report.critical:
            print(f"   ‚Ä¢ {issue}")
    else:
        print("\n‚úì No critical structural issues found")

    # Warnings
    if report.warnings:
        print("\n‚ö†Ô∏è  WARNINGS (should review):")
        for issue in report.warnings[:20]:  # Limit output
            print(f"   ‚Ä¢ {issue}")
        if len(report.warnings) > 20:
            print(f"   ... and {len(report.warnings) - 20} more warnings")
    else:
        print("\n‚úì No warnings")

    # Artifact summary
    if report.artifacts:
        print("\nüîç ARTIFACT DETECTION SUMMARY:")
        print(f"   Total artifacts found: {len(report.artifacts)}")
        print(f"   - Critical: {report.statistics['artifacts_by_severity']['critical']}")
        print(f"   - Warning: {report.statistics['artifacts_by_severity']['warning']}")
        print(f"   - Info: {report.statistics['artifacts_by_severity']['info']}")

        print("\n   By category:")
        for cat, count in report.statistics['artifacts_by_type'].most_common():
            print(f"   - {cat}: {count}")

        if show_artifacts:
            # Show critical artifacts
            critical_artifacts = [a for a in report.artifacts if a.severity == 'critical']
            if critical_artifacts:
                print("\n   üö® CRITICAL ARTIFACTS (require fixing):")
                for a in critical_artifacts[:15]:
                    print(f"      [{a.location}] {a.field}")
                    print(f"         Pattern: {a.pattern_name}")
                    print(f"         Matched: '{a.matched_text}'")
                    print(f"         Suggestion: {a.suggestion}")
                if len(critical_artifacts) > 15:
                    print(f"      ... and {len(critical_artifacts) - 15} more critical artifacts")

            # Show sample warning artifacts
            warning_artifacts = [a for a in report.artifacts if a.severity == 'warning']
            if warning_artifacts:
                print("\n   ‚ö†Ô∏è  SAMPLE WARNING ARTIFACTS:")
                for a in warning_artifacts[:10]:
                    print(f"      [{a.location}] {a.field}: '{a.matched_text}' ({a.pattern_name})")
                if len(warning_artifacts) > 10:
                    print(f"      ... and {len(warning_artifacts) - 10} more warning artifacts")

    # Info
    print("\n‚ÑπÔ∏è  INFO:")
    for info in report.info:
        print(f"   {info}")


def get_artifact_cleaning_recommendations(report: ValidationReport) -> Dict[str, List[str]]:
    """
    Generate specific cleaning recommendations based on detected artifacts.

    Returns dict with categories and specific actions needed.
    """
    recommendations = {
        'placeholder_filling': [],
        'instruction_removal': [],
        'value_label_cleaning': [],
        'question_reformulation': [],
        'code_handling': [],
    }

    for artifact in report.artifacts:
        if artifact.artifact_type == 'placeholders':
            recommendations['placeholder_filling'].append(
                f"{artifact.location}: Replace '{artifact.matched_text}' - {artifact.suggestion}"
            )

        elif artifact.artifact_type == 'interviewer_instructions':
            recommendations['instruction_removal'].append(
                f"{artifact.location}: Remove '{artifact.matched_text}'"
            )

        elif artifact.artifact_type in ['binary_marking', 'scale_abbreviations']:
            recommendations['value_label_cleaning'].append(
                f"{artifact.location}: Convert '{artifact.matched_text}' to natural language"
            )

        elif artifact.artifact_type == 'skip_logic':
            recommendations['question_reformulation'].append(
                f"{artifact.location}: Remove routing logic '{artifact.matched_text}'"
            )

        elif artifact.artifact_type in ['response_codes', 'special_values']:
            recommendations['code_handling'].append(
                f"{artifact.location}: Handle special code '{artifact.matched_text}'"
            )

    # Deduplicate
    for key in recommendations:
        recommendations[key] = list(set(recommendations[key]))

    return recommendations

In [90]:
report = validate_metadata_structure(metadata)
print_validation_report(report, show_artifacts=True)

METADATA VALIDATION REPORT

‚úì No critical structural issues found


üîç ARTIFACT DETECTION SUMMARY:
   Total artifacts found: 8341
   - Critical: 2
   - Info: 8335

   By category:
   - question_quality: 5314
   - special_values: 1689
   - missing_data_labels: 1283
   - column_positions: 24
   - interviewer_instructions: 20
   - survey_series: 11

   üö® CRITICAL ARTIFACTS (require fixing):
      [[political_affiliation] prtvtinl] values[16]
         Pattern: variable_code_only
         Matched: 'JA21'
         Suggestion: Review and clean as appropriate
      [[political_affiliation] prtclhnl] values[16]
         Pattern: variable_code_only
         Matched: 'JA21'
         Suggestion: Review and clean as appropriate

      [[demographics] isco08] values[4227]: ' interviewers' (interviewer_note)
      [[demographics] isco08] values[4227]: ' interviewers' (interviewer_note)
      [[demographics] isco08p] values[4227]: ' interviewers' (interviewer_note)
      [[demographics] isco08p

**For ESS 11: 'JA21' is a valid reponse category, warning artefacts are due to 'interviewer' being a valid isco job category, no changes needed.**

---
**Question Quality Review**

Check the quality of question reformulation - a critical aspect of metadata creation. Code below samples random questions for examining them visually. Run a few times to see if you notice anything strange.

In [91]:
#@title question quality review
def review_question_quality(metadata: dict,
                            sample_size: int = 10,
                            seed: int = 42) -> None:
    """
    Display a sample of questions for manual review.

    Things to check:
    - Is the question natural and conversational?
    - Were survey artifacts removed ("looking at card", "interviewer records"...)?
    - Does the question make sense as a standalone question?
    """
    all_questions = []

    for section, variables in metadata.items():
        for var_code, var_data in variables.items():
            if 'question' in var_data:
                all_questions.append({
                    'section': section,
                    'code': var_code,
                    'description': var_data.get('description', ''),
                    'question': var_data['question'],
                    'n_options': len(var_data.get('values', {}))
                })

    # Random sample
    np.random.seed(seed)
    sample_idx = np.random.choice(len(all_questions), min(sample_size, len(all_questions)), replace=False)
    sample = [all_questions[i] for i in sample_idx]

    print("=" * 70)
    print("QUESTION QUALITY REVIEW (Random Sample)")
    print("=" * 70)
    print("\nCheck each question for:")
    print("  ‚úì Natural, conversational tone")
    print("  ‚úì Survey artifacts removed ('read card', 'interviewer records')")
    print("  ‚úì Makes sense as standalone question")
    print("  ‚úì Description accurately summarizes the question\n")

    for i, q in enumerate(sample, 1):
        print(f"\n--- [{i}/{len(sample)}] {q['code']} ({q['section']}) ---")
        print(f"Description: {q['description']}")
        print(f"Question: {q['question']}")
        print(f"Answer options: {q['n_options']}")


# Review questions
review_question_quality(metadata, sample_size=10, seed=None)

QUESTION QUALITY REVIEW (Random Sample)

Check each question for:
  ‚úì Natural, conversational tone
  ‚úì Survey artifacts removed ('read card', 'interviewer records')
  ‚úì Makes sense as standalone question
  ‚úì Description accurately summarizes the question


--- [1/10] marsts (demographics) ---
Description: Legal marital status
Question: What is your current legal marital status? (This refers to legal status, not who you live with.)
Answer options: 10

--- [2/10] medtrnl (covid) ---
Description: Medical care availability
Question: Was medical consultation or treatment unavailable in your area?
Answer options: 2

--- [3/10] ipmodsta (other) ---
Description: Importance of modesty
Question: How well does this describe you: "I think it's important to be humble and modest, and I try not to draw attention to myself"?
Answer options: 10

--- [4/10] prtclgsi (political_affiliation) ---
Description: Party closeness (Slovenia)
Question: Which political party in Slovenia do you feel closest

---
**Value Labels Review**

Check that answer options are properly formatted and include special categories.

In [92]:
#@title value labels review
def review_value_labels(metadata: dict) -> None:
    """
    Review value labels for common issues:
    - Likert scales: Are they collapsed appropriately?
    - Missing values: Are special categories included?
    - Country-specific: Are options realistic for respondents?
    """

    # Common missing value patterns
    missing_patterns = ['missing', 'refused', "don't know", 'no answer', 'not asked', 'not applicable']

    likert_indicators = ['strongly', 'agree', 'disagree', 'satisfied', 'trust', 'important']

    issues = []
    string_vars = []  # Track open-text variables
    stats = {
        'total_vars': 0,
        'vars_with_missing': 0,
        'likely_likert': 0,
        'n_options_dist': []
    }

    for section, variables in metadata.items():
        for var_code, var_data in variables.items():
            if 'values' not in var_data:
                continue

            values = var_data['values']

            # Handle string/open-text variables
            if not isinstance(values, dict):
                string_vars.append((var_code, section, values))
                continue

            # Skip empty values dicts
            if len(values) == 0:
                continue

            stats['total_vars'] += 1
            labels_lower = [str(v).lower() for v in values.values()]

            stats['n_options_dist'].append(len(values))

            # Check for missing value categories
            has_missing = any(
                any(pattern in label for pattern in missing_patterns)
                for label in labels_lower
            )
            if has_missing:
                stats['vars_with_missing'] += 1

            # Check if likely Likert scale
            is_likert = any(
                any(indicator in label for indicator in likert_indicators)
                for label in labels_lower
            )
            if is_likert:
                stats['likely_likert'] += 1

            # Check for potential issues
            if len(set(values.values())) > 10 and is_likert:
                issues.append(
                    f"[{var_code}] Likert-like scale with {len(values)} options - may need collapsing"
                )

            # Check for numeric-only labels
            numeric_labels = [v for v in values.values() if str(v).strip().isdigit()]
            if numeric_labels and len(numeric_labels) > 2:
                issues.append(
                    f"[{var_code}] Has numeric-only labels: {numeric_labels[:3]}... - may need verbal anchors"
                )

    print("=" * 60)
    print("VALUE LABELS ANALYSIS")
    print("=" * 60)

    # Handle case where no categorical variables found
    if stats['total_vars'] == 0:
        print("\n‚ö†Ô∏è  No categorical variables with value mappings found")
    else:
        print(f"\nTotal categorical variables: {stats['total_vars']}")
        print(f"Variables with missing value categories: {stats['vars_with_missing']} ({100*stats['vars_with_missing']/stats['total_vars']:.1f}%)")
        print(f"Likely Likert scales: {stats['likely_likert']}")
        print(f"\nOptions per variable distribution:")
        print(f"  Min: {min(stats['n_options_dist'])}, Max: {max(stats['n_options_dist'])}, Median: {np.median(stats['n_options_dist']):.0f}")

    # Report string/open-text variables
    if string_vars:
        print(f"\n‚ÑπÔ∏è  Open-text/verbatim variables ({len(string_vars)}):")
        for var_code, section, val_type in string_vars[:10]:
            print(f"   ‚Ä¢ {var_code} ({section}): {val_type}")
        if len(string_vars) > 10:
            print(f"   ... and {len(string_vars) - 10} more")

    if issues:
        print("\n‚ö†Ô∏è  POTENTIAL ISSUES:")
        for issue in issues[:50]:  # Limit output
            print(f"   ‚Ä¢ {issue}")
        if len(issues) > 50:
            print(f"   ... and {len(issues) - 50} more")
    else:
        print("\n‚úì No obvious issues with value labels")


review_value_labels(metadata)

VALUE LABELS ANALYSIS

Total categorical variables: 610
Variables with missing value categories: 519 (85.1%)
Likely Likert scales: 27

Options per variable distribution:
  Min: 1, Max: 2304, Median: 10

‚ö†Ô∏è  POTENTIAL ISSUES:
   ‚Ä¢ [inprdsc] Has numeric-only labels: ['1', '2', '3']... - may need verbal anchors


---

Inspect specific variables in detail.

In [93]:
#@title **Detailed Variable Inspection**
def inspect_variable(metadata: dict, var_code: str) -> None:
    """Display full details for a specific variable."""

    for section, variables in metadata.items():
        if var_code in variables:
            var_data = variables[var_code]
            print(f"\n{'='*60}")
            print(f"VARIABLE: {var_code}")
            print(f"{'='*60}")
            print(f"Section: {section}")
            print(f"Description: {var_data.get('description', 'N/A')}")
            print(f"\nQuestion:")
            print(f"  {var_data.get('question', 'N/A')}")
            print(f"\nAnswer Options ({len(var_data.get('values', {}))} total):")
            for code, label in var_data.get('values', {}).items():
                print(f"  [{code}] {label}")
            if 'notes' in var_data:
                print(f"\nNotes: {var_data['notes']}")
            return

    print(f"Variable '{var_code}' not found in metadata")


# Example: inspect a specific variable
# Replace with a variable code from your metadata
# inspect_variable(metadata, 'Q35A')

In [94]:
# List all variable codes for reference
print("All variable codes in metadata:\n")
for section, variables in metadata.items():
    print(f"[{section}]")
    print(f"  {', '.join(variables.keys())}\n")

All variable codes in metadata:

[demographics]
  cntry, health, hlthhmp, rlgblg, rlgdnm, rlgdnbat, rlgdnacy, rlgdnafi, rlgdnade, rlgdnagr, rlgdnhu, rlgdnais, rlgdnie, rlgdnlv, rlgdnlt, rlgdme, rlgdnanl, rlgdnno, rlgdnapl, rlgdnapt, rlgdnrs, rlgdnask, rlgdnase, rlgdnach, rlgdnaua, rlgdngb, rlgblge, rlgdnme, rlgdebat, rlgdeacy, rlgdeafi, rlgdeade, rlgdeagr, rlgdehu, rlgdeais, rlgdeie, rlgdelv, rlgdelt, rlgdeme, rlgdeanl, rlgdeno, rlgdeapl, rlgdeapt, rlgders, rlgdeask, rlgdease, rlgdeach, rlgdeaua, rlgdegb, rlgdgr, rlgatnd, ctzcntr, brncntr, cntbrthd, livecnta, lnghom1, lnghom2, feethngr, facntr, fbrncntc, mocntr, mbrncntc, cgtsmok, alcfreq, icgndra, height, weighta, medtrnt, hltprdi, hltprca, jbexpml, jbexpmc, jbexevc, nobingnd, mascfel, femifel, impbemw, trwrkmw, hhmmb, gndr, gndr2, gndr3, gndr4, gndr5, gndr6, gndr7, gndr8, gndr9, gndr10, gndr11, gndr12, gndr13, yrbrn, agea, agegroup, yrbrn2, yrbrn3, yrbrn4, yrbrn5, yrbrn6, yrbrn7, yrbrn8, yrbrn9, yrbrn10, yrbrn11, yrbrn12, yrbrn13, rs

In [95]:
inspect_variable(metadata, 'wrclmch')


VARIABLE: wrclmch
Section: climate_change
Description: Worry about climate change

Question:
  How worried are you about climate change?

Answer Options (9 total):
  [1] Not at all worried
  [2] Not very worried
  [3] Somewhat worried
  [4] Very worried
  [5] Extremely worried
  [6] Not applicable
  [7] Refusal
  [8] Don't know
  [9] No answer


---
## **3. Profile Generator Overview**

### **Purpose**

The `RespondentProfileGenerator` converts tabular survey data into natural language "interview" representations suitable for LLM inference. Given a respondent's row in a survey dataset and structured metadata about the questions, it produces text-based profiles that describe what we know about a person, paired with a target question we want the model to predict.

This enables our core research question: can LLMs predict individual survey responses given partial information about respondents? The generator handles the critical task of transforming structured survey data into the text format LLMs expect, while implementing safeguards against information leakage and ensuring experimental reproducibility.

### **Core Functionality**

The generator provides several key capabilities:

- **Stratified random sampling**: Selects features across thematic sections (demographics, political attitudes, etc.) rather than clustering features from one domain
- **Seedable reproducibility**: All sampling is deterministic given a seed, enabling exact replication of experiments
- **Profile expansion**: Generates nested profiles where smaller profiles are strict subsets of larger ones‚Äîessential for information richness experiments where we vary how much context the model receives
- **Target question handling**: Automatically excludes target questions from the feature pool and handles country-specific answer options (e.g., showing only German parties to German respondents)
- **Semantic similarity filtering**: Uses sentence embeddings to exclude features too similar to the target, preventing information leakage (e.g., excluding "party identification" when predicting "party vote")
- **Missing value handling**: Filters out survey artifacts like "Not asked in this country" or "Refused" from both feature values and answer options
- **Flexible output formatting**: Produces prompts in multiple formats (Q&A, interview, bullet points, etc.) to test robustness against surface-level variations

### **What to Test For**

When validating metadata with the generator, watch for these common issues:

**Data-metadata misalignment**: Variable codes in metadata don't match column names in the survey data, causing `KeyError` or silent failures where respondents have no valid features. The generator should warn about this, but verify the overlap percentage is high.

**Missing value contamination**: If missing value patterns aren't configured correctly, profiles may include nonsensical features like "What is your religion? ‚Üí Not asked in this country" or target options may include "Refused" as a valid answer choice.

**Semantic filtering edge cases**: The similarity model may exclude too many features (overly aggressive) or miss obviously related questions (too permissive). Check the exclusion lists make sense‚Äîif predicting party vote, party identification should be excluded; general political interest probably shouldn't be.

**Country-specific option failures**: For questions like party preference, respondents should only see parties from their country. If you see "Democrats (USA)" as an option for a German respondent, the country column mapping or country-specific target configuration is wrong.

**Profile expansion violations**: When generating profiles of increasing size with the same seed, smaller profiles must be strict subsets of larger ones. If a feature appears in the 5-feature profile but not the 10-feature profile, the expansion logic is broken‚Äîthis would confound information richness experiments.

**Empty or sparse profiles**: Some respondents may have too many missing values to generate valid profiles, or some targets may have all features excluded by similarity filtering. The generator should handle these gracefully, but check that you can generate profiles for a reasonable proportion of respondents.

**Value code mismatches**: If a respondent's answer in the data (e.g., `"7"`) doesn't appear in the metadata's value mapping, the generator may fail or produce raw codes instead of human-readable labels. Spot-check that generated profiles show text labels, not numeric codes.

### **Prerequisites**:
You need actual survey data (CSV) that matches the metadata.

In [96]:
# Load survey data (skip if not available)
try:
    survey_df = pd.read_csv("../../../../../data/ess/ESS11.csv", low_memory=False)  # ESS

    # Generate unique respondent ID (if needed for particular dataset, as some use the same ID numbers across countries)
    survey_df["cntry_idno"] = (
        survey_df["cntry"].astype("string") + "_" + survey_df["idno"].astype("string")
    )

    print(f"‚úì Loaded survey data: {survey_df.shape[0]} respondents, {survey_df.shape[1]} columns")

    n_unique = survey_df["cntry_idno"].nunique(dropna=False)
    if n_unique != len(survey_df):
        warnings.warn(
            f'cntry_idno is not unique: {n_unique} unique IDs for {survey_df.shape[0]} rows.',
            category=UserWarning
        )

    print(f"  Columns: {list(survey_df.columns)[:10]}...")

except FileNotFoundError:
    print("‚ö†Ô∏è  Survey data file not found. Using mock data for demo.")
    survey_df = None 

‚úì Loaded survey data: 50116 respondents, 692 columns
  Columns: ['name', 'essround', 'edition', 'proddate', 'idno', 'cntry', 'dweight', 'pspwght', 'pweight', 'anweight']...




In [97]:
# Find duplicate ids and fix (assign new id to second instance):
print(survey_df.loc[survey_df["cntry_idno"].duplicated(keep=False), "cntry_idno"].sort_values().unique())

# change duplicated id (I checked for ESS11, this is an entirely different observation)
survey_df["cntry_idno"] = (
    survey_df["cntry_idno"]
    + survey_df.groupby("cntry_idno").cumcount().add(1).astype(str).radd("_")
      .where(survey_df["cntry_idno"].duplicated(keep=False), "")
)

# verify that no we are unique (should return 'True')
n_unique = survey_df["cntry_idno"].nunique(dropna=False)
n_unique == len(survey_df)


<StringArray>
['UA_54047']
Length: 1, dtype: string


True

In [98]:
# ============================================================
# CONFIGURE GENERATOR PARAMETERS
# ============================================================

RESPONDENT_ID_COL = 'cntry_idno'  # Column with respondent IDs
COUNTRY_COL = 'cntry'  # Set to column name if your data has country info (e.g., 'B_COUNTRY')

# Missing value configuration
MISSING_VALUE_LABELS = ['Refusal', "Don't know", 'No answer', '', 'Not applicable']  # Exact matches
MISSING_VALUE_PATTERNS = ['not asked', "don't know", 'missing', 'refused']  # Substring matches

# Semantic similarity (optional - requires sentence-transformers)
USE_SEMANTIC_FILTERING = True  # Set to True if sentence-transformers is installed
SIMILARITY_MODEL = 'all-MiniLM-L6-v2'  # Fast and effective, but feel free to experiment with other models
SIMILARITY_THRESHOLD = 0.8  # Features with similarity >= this are excluded

In [99]:
# Initialize the generator
try:
    generator = RespondentProfileGenerator(
        survey_data=survey_df,
        metadata=metadata,
        respondent_id_col=RESPONDENT_ID_COL,
        country_col=COUNTRY_COL,
        missing_value_labels=MISSING_VALUE_LABELS,
        missing_value_patterns=MISSING_VALUE_PATTERNS,
        similarity_model=SIMILARITY_MODEL if USE_SEMANTIC_FILTERING else None,
        similarity_threshold=SIMILARITY_THRESHOLD
    )
    print("‚úì Generator initialized successfully!")
    print(f"  Available sections: {list(generator._section_to_features.keys())}")
    print(f"  Total features: {len(generator._all_features)}")
except Exception as e:
    print(f"‚ùå Error initializing generator: {e}")
    raise

Missing value exclusion configured:
  Exact labels: {'', 'Refusal', "Don't know", 'No answer', 'Not applicable'}
  Patterns (case-insensitive): ['not asked', "don't know", 'missing', 'refused']
Semantic similarity filtering enabled:
  Model: all-MiniLM-L6-v2
  Threshold: 0.8
‚úì Generator initialized successfully!
  Available sections: ['demographics', 'other', 'trust_in_social_groups', 'politics', 'governance', 'civic_engagement', 'trust_in_institutions', 'political_affiliation', 'economic_outlook', 'climate_change', 'covid']
  Total features: 610


---
**Generate Sample Profiles**

Test profile generation with different settings.

In [100]:
# ============================================================
# CONFIGURE TARGET QUESTIONS
# ============================================================

# Select some variables as target questions (what we want to predict)
# Replace with actual variable codes from your metadata
TARGET_CODES = ["gndr", "mnactp", "trstplt", "eisced", "health", "eiscedm", "hincsrca", 
                "hincfel", "uempla", "happy", "polintr", "loylead", "psppsgva", 
                "vote", "volunfp", "ipudrsta", "impricha", "stfeco", 
                "impenva"]  # <-- Add target variable codes here, e.g., ['Q35A', 'Q35B']

# For country-specific targets (like party vote), specify them here
COUNTRY_SPECIFIC_TARGETS = []  # e.g., ['PARTY_VOTE'] if applicable

if not TARGET_CODES:
    # Auto-select first few variables as targets for demo
    all_vars = []
    for section, variables in metadata.items():
        all_vars.extend(list(variables.keys())[:2])
    TARGET_CODES = all_vars[:3]
    print(f"Auto-selected target codes for demo: {TARGET_CODES}")

In [101]:
# Set target questions
generator.set_target_questions(
    target_codes=TARGET_CODES,
    country_specific_targets=COUNTRY_SPECIFIC_TARGETS if COUNTRY_SPECIFIC_TARGETS else None
)
print(f"\n‚úì Target questions set: {TARGET_CODES}")

  Computing semantic similarity (model: all-MiniLM-L6-v2)...
    gndr: excluding 1 similar features
      - icgndra (sim=1.000): "What is your gender?..."
    trstplt: excluding 1 similar features
      - trstprt (sim=0.813): "How much do you personally trust political parties..."
    eisced: excluding 55 similar features
      - edulvlb (sim=1.000): "What is the highest level of education you have co..."
      - edlvdrs (sim=1.000): "What is the highest level of education you have co..."
      - edlvesi (sim=1.000): "What is the highest level of education you have co..."
      ... and 52 more
    eiscedm: excluding 6 similar features
      - edumbde2 (sim=0.803): "What is the highest level of education your mother..."
      - edlvmfit (sim=0.801): "What is the highest level of education your mother..."
      - edlvmebg (sim=0.801): "What is the highest level of education your mother..."
      ... and 3 more
    uempla: excluding 2 similar features
      - uempli (sim=0.960): "During t

In [102]:
# Check which labels are removed as missing options (make sure the removal approach is not too aggressive)
filtered_out_by_code = {}

for code, tq in generator._target_questions.items():
    values_map = tq.values_map  # dict: raw_value -> label

    removed = {}
    kept = {}

    for raw_value, label in values_map.items():
        if generator._is_missing_value_label(label):
            removed[raw_value] = label
        else:
            kept[raw_value] = label

    if removed:
        filtered_out_by_code[code] = {
            "removed": removed,
            "removed_n": len(removed),
            "kept_n": len(kept),
            "total_n": len(values_map),
        }

# Summary sorted by how many were removed
for code, info in sorted(filtered_out_by_code.items(), key=lambda x: x[1]["removed_n"], reverse=True):
    print(f"{code}: removed {info['removed_n']} of {info['total_n']} (kept {info['kept_n']})")

# Inspect one target in detail
example_code = 'trstplt'
print("\nDetailed removed options for:", example_code)
for k, v in sorted(filtered_out_by_code[example_code]["removed"].items(), key=lambda x: str(x[0])):
    print(f"  {k!r}: {v}")

mnactp: removed 4 of 13 (kept 9)
ipudrsta: removed 4 of 10 (kept 6)
impricha: removed 4 of 10 (kept 6)
impenva: removed 4 of 10 (kept 6)
trstplt: removed 3 of 14 (kept 11)
eisced: removed 3 of 12 (kept 9)
health: removed 3 of 8 (kept 5)
eiscedm: removed 3 of 12 (kept 9)
hincsrca: removed 3 of 11 (kept 8)
hincfel: removed 3 of 7 (kept 4)
happy: removed 3 of 14 (kept 11)
polintr: removed 3 of 7 (kept 4)
loylead: removed 3 of 8 (kept 5)
psppsgva: removed 3 of 8 (kept 5)
vote: removed 3 of 6 (kept 3)
volunfp: removed 3 of 5 (kept 2)
stfeco: removed 3 of 14 (kept 11)
gndr: removed 1 of 3 (kept 2)

Detailed removed options for: trstplt
  '77': Refusal
  '88': Don't know
  '99': No answer


In [103]:
# import inspect
# print(inspect.getsource(generator.set_target_questions))

In [104]:
# Generate a sample profile
sample_respondent_id = survey_df[RESPONDENT_ID_COL].iloc[0]
sample_target = TARGET_CODES[0]

print(f"Generating profile for respondent {sample_respondent_id}, target: {sample_target}")
print("=" * 70)

# Generate with different richness levels
for n_sections, m_features in [(1, 2), (2, 3), (3, 4)]:
    print(f"\n--- Profile: {n_sections} sections √ó {m_features} features/section ---")

    try:
        instance = generator.generate_prediction_instance(
            respondent_id=sample_respondent_id,
            target_code=sample_target,
            n_sections=n_sections,
            m_features_per_section=m_features,
            seed=42  # Reproducible
        )

        print(f"\nProfile ({len(instance.features)} features):")
        for q, a in instance.features.items():
            print(f"  ‚Ä¢ {q[:50]}... ‚Üí {a}")

        print(f"\nTarget question: {instance.target_question[:75]}...")
        print(f"Answer options: {instance.options}")
        print(f"True answer: {instance.answer}")

    except Exception as e:
        print(f"Error: {e}")

Generating profile for respondent AT_50014, target: gndr

--- Profile: 1 sections √ó 2 features/section ---

Profile (2 features):
  ‚Ä¢ If large numbers of people limited their energy us... ‚Üí Fairly likely
  ‚Ä¢ If many people reduced their energy use, how likel... ‚Üí Likely

Target question: What is your gender?...
Answer options: ['Male', 'Female']
True answer: Male

--- Profile: 2 sections √ó 3 features/section ---

Profile (6 features):
  ‚Ä¢ How much does this description sound like you: "Yo... ‚Üí Like me
  ‚Ä¢ To what extent do you agree or disagree with the s... ‚Üí Disagree
  ‚Ä¢ Overall, do people who come to live in this countr... ‚Üí Neither worse nor better
  ‚Ä¢ If large numbers of people limited their energy us... ‚Üí Fairly likely
  ‚Ä¢ How likely do you think it is that governments in ... ‚Üí Somewhat likely
  ‚Ä¢ How much personal responsibility do you feel for r... ‚Üí Somewhat

Target question: What is your gender?...
Answer options: ['Male', 'Female']
True answ

---
**Information Leakage Analysis**

**Critical for experiment validity**: When predicting a target question, we must exclude features that would trivially reveal the answer.

This section helps you:
1. See which features are excluded for each target (via semantic similarity)
2. Verify exclusions make sense
3. Identify potential leakage the filter might miss

In [105]:
generator.similarity_threshold

0.8

In [106]:
#@title analyze information leakage
def analyze_information_leakage(
    generator: RespondentProfileGenerator,
    target_codes: list,
    top_k: int = 10
) -> None:
    """
    Analyze potential information leakage for each target question.

    Shows:
    - Features excluded by semantic similarity filtering
    - Top-k most similar features (even if below threshold)
    - Potential leakage risks to review manually
    """

    # Check if semantic filtering is enabled
    if not hasattr(generator, 'similarity_model_name') or generator.similarity_model_name is None:
        print("‚ö†Ô∏è  Semantic similarity filtering is DISABLED")
        print("   To enable, reinitialize generator with:")
        print("   similarity_model='all-MiniLM-L6-v2'")
        print("\n   Without semantic filtering, only exact target exclusion is applied.")
        return

    print("="*70)
    print("INFORMATION LEAKAGE ANALYSIS")
    print(f"Similarity threshold: {generator.similarity_threshold}")
    print("="*70)

    for target_code in target_codes:
        print(f"\n--- Target: {target_code} ---")

        # Get target question text
        target_text = None
        for section, variables in generator.metadata.items():
            if target_code in variables:
                target_text = variables[target_code].get('question', '')
                break

        if not target_text:
            print(f"  Target not found in metadata")
            continue

        print(f"  Question: {target_text[:80]}...")

        # Get excluded features for this target (correct attribute name)
        if hasattr(generator, '_target_similar_features') and target_code in generator._target_similar_features:
            excluded = generator._target_similar_features[target_code]
            if excluded:
                print(f"\n  ‚ùå EXCLUDED features ({len(excluded)}):")

                # Get similarity scores using the generator's method
                similar_features = generator.get_similar_features(target_code)

                # Sort excluded features by similarity score
                excluded_with_scores = [
                    (feat, similar_features.get(feat, 0.0))
                    for feat in excluded
                ]
                excluded_with_scores.sort(key=lambda x: -x[1])

                for feat_code, sim_score in excluded_with_scores[:top_k]:
                    # Get feature question text
                    feat_text = ""
                    for section, variables in generator.metadata.items():
                        if feat_code in variables:
                            feat_text = variables[feat_code].get('question', '')[:80]
                            break
                    print(f"     ‚Ä¢ {feat_code} (sim={sim_score:.3f}): {feat_text}...")

                if len(excluded_with_scores) > top_k:
                    print(f"     ... and {len(excluded_with_scores) - top_k} more")
            else:
                print(f"\n  ‚úì No features excluded by similarity filtering")
        else:
            print(f"\n  ‚úì No features excluded by similarity filtering")

        # Show top similar features (even if not excluded)
        print(f"\n  ‚ÑπÔ∏è  Top {top_k} most similar features (for review):")
        try:
            similar_features = generator.get_similar_features(target_code)
            sorted_similar = sorted(similar_features.items(), key=lambda x: -x[1])[:top_k]

            for feat_code, sim_score in sorted_similar:
                status = "‚ùå" if sim_score >= generator.similarity_threshold else "‚úì"
                feat_text = generator._code_to_question_text.get(feat_code, feat_code)[:80]
                print(f"     {status} {feat_code} (sim={sim_score:.3f}): {feat_text}...")
        except Exception as e:
            print(f"     Could not compute similarities: {e}")

In [107]:
# Run information leakage analysis
analyze_information_leakage(generator, TARGET_CODES)

INFORMATION LEAKAGE ANALYSIS
Similarity threshold: 0.8

--- Target: gndr ---
  Question: What is your gender?...

  ‚ùå EXCLUDED features (1):
     ‚Ä¢ icgndra (sim=1.000): What is your gender?...

  ‚ÑπÔ∏è  Top 10 most similar features (for review):
     ‚ùå icgndra (sim=1.000): What is your gender?...
     ‚úì nobingnd (sim=0.668): Which of these options best describes your gender?...
     ‚úì gndr4 (sim=0.614): What is the gender of the fourth person in your household?...
     ‚úì gndr5 (sim=0.612): What is the gender of the fifth person in your household?...
     ‚úì gndr6 (sim=0.608): What is the gender of the sixth person in your household?...
     ‚úì gndr2 (sim=0.606): What is the gender of the second person in your household?...
     ‚úì gndr11 (sim=0.604): What is the gender of the 11th person in your household?...
     ‚úì gndr9 (sim=0.602): What is the gender of the ninth person in your household?...
     ‚úì gndr12 (sim=0.581): What is the gender of the twelfth person in y

In [109]:
#@title Manual leakage check: Look for obviously related questions
# This catches things semantic similarity might miss

def manual_leakage_check(metadata: dict, target_codes: list) -> None:
    """
    Flag potential leakage based on keyword matching.

    This is a safety net for cases where:
    - Semantic model isn't loaded
    - Questions are phrased differently but measure same construct
    """

    # Get target keywords
    target_keywords = {}
    for target in target_codes:
        for section, variables in metadata.items():
            if target in variables:
                q = variables[target].get('question', '').lower()
                # Extract key terms
                keywords = set()
                for word in q.split():
                    if len(word) > 4:  # Skip short words
                        keywords.add(word.strip('?.,!'))
                target_keywords[target] = keywords
                break

    print("="*70)
    print("MANUAL LEAKAGE CHECK (Keyword-based)")
    print("="*70)

    for target, keywords in target_keywords.items():
        print(f"\n--- Target: {target} ---")
        print(f"  Keywords: {keywords}")

        matches = []
        for section, variables in metadata.items():
            for var_code, var_data in variables.items():
                if var_code == target:
                    continue
                q = var_data.get('question', '').lower()
                overlap = keywords & set(w.strip('?.,!') for w in q.split() if len(w) > 4)
                if len(overlap) >= 2:  # At least 2 keyword matches
                    matches.append((var_code, overlap, q[:60]))

        if matches:
            print(f"\n  ‚ö†Ô∏è  Potential leakage ({len(matches)} features):")
            for var_code, overlap, q_text in matches[:10]:
                print(f"     ‚Ä¢ {var_code}: {q_text}...")
                print(f"       Matching keywords: {overlap}")
        else:
            print(f"\n  ‚úì No obvious keyword matches found")


manual_leakage_check(metadata, TARGET_CODES)

MANUAL LEAKAGE CHECK (Keyword-based)

--- Target: gndr ---
  Keywords: {'gender'}

  ‚úì No obvious keyword matches found

--- Target: mnactp ---
  Keywords: {'activity', 'days', "partner's"}

  ‚ö†Ô∏è  Potential leakage (3 features):
     ‚Ä¢ mainact: what has been your main activity in the last seven days?...
       Matching keywords: {'activity', 'days'}
     ‚Ä¢ mnactic: what was your main activity in the last 7 days?...
       Matching keywords: {'activity', 'days'}
     ‚Ä¢ dosprt: in the past 7 days, on how many days did you walk briskly, p...
       Matching keywords: {'activity', 'days'}

--- Target: trstplt ---
  Keywords: {'politicians', 'personally', 'trust'}

  ‚ö†Ô∏è  Potential leakage (6 features):
     ‚Ä¢ trstprl: how much do you personally trust your country's parliament?...
       Matching keywords: {'personally', 'trust'}
     ‚Ä¢ trstlgl: how much do you personally trust the legal system?...
       Matching keywords: {'personally', 'trust'}
     ‚Ä¢ trstplc: how mu

### Interpreting Leakage Analysis

**If features are being excluded:**
- Review each exclusion - does it make sense?
- If too aggressive (excluding unrelated features), raise the threshold
- If too permissive, lower the threshold

**If nothing is excluded but you expect exclusions:**
- Check that semantic filtering is enabled
- Lower the similarity threshold (try 0.5 or 0.6)
- The questions may be phrased too differently for semantic matching

**For manual review:**
- Look for questions measuring the same construct with different wording
- Party vote prediction: exclude party identification, past voting, party closeness
- Trust questions: watch for batteries of related trust items

---
**Preview Output Formats**

See how profiles look in different prompt formats.

In [110]:
# Show available formats
print("Available profile formats:")
for fmt in list_profile_formats():
    print(f"  ‚Ä¢ {fmt}")

Available profile formats:
  ‚Ä¢ qa
  ‚Ä¢ interview
  ‚Ä¢ bullet
  ‚Ä¢ colon
  ‚Ä¢ arrow
  ‚Ä¢ brackets
  ‚Ä¢ xml
  ‚Ä¢ json
  ‚Ä¢ narrative
  ‚Ä¢ card


In [111]:
# Generate a sample instance for format comparison
instance = generator.generate_prediction_instance(
    respondent_id=sample_respondent_id,
    target_code=sample_target,
    n_sections=2,
    m_features_per_section=2,
    seed=42
)

# Preview each format
formats_to_show = ['qa', 'interview', 'bullet', 'arrow', 'brackets', 'json', 'narrative', 'card']

for fmt in formats_to_show:
    print(f"\n{'='*70}")
    print(f"FORMAT: {fmt}")
    print(f"{'='*70}")

    prompt = instance.to_prompt(profile_format=fmt)
    print(prompt)


FORMAT: qa
Here is information about a survey respondent:

Q: How much does this description sound like you: "You believe people should do what they're told and follow rules at all times, even when no one is watching"?
A: Like me
Q: To what extent do you agree or disagree with the statement: "If a close family member were a gay man or a lesbian, I would feel ashamed"?
A: Disagree
Q: If large numbers of people limited their energy use, how likely is it that this would reduce climate change?
A: Fairly likely
Q: How likely do you think it is that governments in enough countries will take action to reduce climate change?
A: Somewhat likely

Based on this information, please answer:
What is your gender?

Options:
1. Male
2. Female

FORMAT: interview
Here is information about a survey respondent:

Interviewer: How much does this description sound like you: "You believe people should do what they're told and follow rules at all times, even when no one is watching"?
Respondent: Like me
Interv

---
**Profile Expansion Test**

Test that profile expansion preserves existing features (critical for information richness experiments).

In [112]:
# Generate base profile
base_instance = generator.generate_prediction_instance(
    respondent_id=sample_respondent_id,
    target_code=sample_target,
    n_sections=1,
    m_features_per_section=2,
    seed=42
)

print("BASE PROFILE (1 section √ó 2 features):")
print("-" * 50)
for q, a in base_instance.features.items():
    print(f"  ‚Ä¢ {q[:50]}...")

# Expand profile
expanded_instance = generator.generate_prediction_instance(
    respondent_id=sample_respondent_id,
    target_code=sample_target,
    n_sections=2,
    m_features_per_section=3,
    seed=42  # Same seed!
)

print("\nEXPANDED PROFILE (2 sections √ó 3 features):")
print("-" * 50)
for q, a in expanded_instance.features.items():
    print(f"  ‚Ä¢ {q[:50]}...")

# Verify subset relationship
base_questions = set(base_instance.features.keys())
expanded_questions = set(expanded_instance.features.keys())

print("\n" + "=" * 50)
if base_questions.issubset(expanded_questions):
    print("‚úì PASS: Base features preserved in expanded profile")
    new_features = expanded_questions - base_questions
    print(f"  New features added: {len(new_features)}")
else:
    print("‚ùå FAIL: Base features NOT preserved!")
    missing = base_questions - expanded_questions
    print(f"  Missing features: {missing}")

BASE PROFILE (1 section √ó 2 features):
--------------------------------------------------
  ‚Ä¢ If large numbers of people limited their energy us...
  ‚Ä¢ If many people reduced their energy use, how likel...

EXPANDED PROFILE (2 sections √ó 3 features):
--------------------------------------------------
  ‚Ä¢ How much does this description sound like you: "Yo...
  ‚Ä¢ To what extent do you agree or disagree with the s...
  ‚Ä¢ Overall, do people who come to live in this countr...
  ‚Ä¢ If large numbers of people limited their energy us...
  ‚Ä¢ How likely do you think it is that governments in ...
  ‚Ä¢ How much personal responsibility do you feel for r...

‚ùå FAIL: Base features NOT preserved!
  Missing features: {'If many people reduced their energy use, how likely do you think that would be to reduce climate change?'}


---
**Data-Metadata Alignment Check**

Verify that metadata variables exist in the survey data.

In [114]:
#@title check metadata alignment
def check_data_metadata_alignment(survey_df: pd.DataFrame, metadata: dict) -> None:
    """Check alignment between survey data columns and metadata variables."""

    metadata_vars = set()
    for section, variables in metadata.items():
        metadata_vars.update(variables.keys())

    data_cols = set(survey_df.columns)

    # Variables in metadata but not in data
    missing_in_data = metadata_vars - data_cols

    # Variables in data but not in metadata
    missing_in_metadata = data_cols - metadata_vars

    # Overlap
    overlap = metadata_vars & data_cols

    print("=" * 60)
    print("DATA-METADATA ALIGNMENT")
    print("=" * 60)
    print(f"\nMetadata variables: {len(metadata_vars)}")
    print(f"Data columns: {len(data_cols)}")
    print(f"Overlap: {len(overlap)} ({100*len(overlap)/len(metadata_vars):.1f}% of metadata)")

    if missing_in_data:
        print(f"\n‚ö†Ô∏è  Variables in metadata but NOT in data ({len(missing_in_data)}):")
        for v in sorted(list(missing_in_data))[:20]:
            print(f"   ‚Ä¢ {v}")
        if len(missing_in_data) > 20:
            print(f"   ... and {len(missing_in_data) - 20} more")

    if len(missing_in_metadata) < 50:  # Only show if not too many
        print(f"\n‚ÑπÔ∏è  Data columns not in metadata ({len(missing_in_metadata)}):")
        for v in sorted(list(missing_in_metadata))[:10]:
            print(f"   ‚Ä¢ {v}")


check_data_metadata_alignment(survey_df, metadata)

DATA-METADATA ALIGNMENT

Metadata variables: 610
Data columns: 692
Overlap: 610 (100.0% of metadata)


---
## **4. Quality Assurance Checklist**

Use this checklist when reviewing metadata:

### Structure ‚úì
- [ ] All sections are dictionaries
- [ ] All variables have `description`, `question`, `values` fields
- [ ] No duplicate variable codes across sections

### Questions ‚úì
- [ ] Questions are natural and conversational
- [ ] Survey artifacts removed ("looking at card", "interviewer records")
- [ ] Questions make sense as standalone (without survey context)
- [ ] Transformations documented in `notes` field where applicable

### Answer Options ‚úì
- [ ] All value codes have text labels (no numeric-only labels)
- [ ] Likert scales collapsed to ~5 interpretable categories
- [ ] Missing value categories included (Missing, Refused, Don't know, etc.)
- [ ] Country-specific options are realistic (e.g., political parties)

### Thematic Grouping ‚úì
- [ ] Variables logically grouped into sections
- [ ] Section names are descriptive (demographics, political_attitudes, etc.)
- [ ] Sections are balanced (not one giant section with everything)

### Data Alignment ‚úì
- [ ] Variable codes match column names in survey data
- [ ] Value codes in metadata match actual values in data

---
## **5. Export Sample Profiles**

Generate and export sample profiles for review.

In [115]:
def export_sample_profiles(
    generator: RespondentProfileGenerator,
    survey_df: pd.DataFrame,
    target_codes: list,
    n_samples: int = 5,
    output_path: str = 'sample_profiles_ESS11.json'
) -> None:
    """
    Export sample prediction instances for manual review.
    """
    np.random.seed(42)

    samples = []
    respondent_ids = survey_df[generator.respondent_id_col].sample(n_samples).tolist()

    for rid in respondent_ids:
        for target in target_codes:
            try:
                instance = generator.generate_prediction_instance(
                    respondent_id=rid,
                    target_code=target,
                    n_sections=2,
                    m_features_per_section=3,
                    seed=42
                )

                # generate_prediction_instance returns None if target answer is missing
                if instance is None:
                    print(f"Warning: Respondent {rid} has missing answer for {target}, skipping")
                    continue

                samples.append({
                    'respondent_id': instance.id,
                    'country': instance.country,
                    'target_code': instance.target_code,
                    'profile': instance.features,
                    'target_question': instance.target_question,
                    'target_options': instance.options,
                    'target_answer': instance.answer,
                    'prompt_qa': instance.to_prompt(profile_format='qa'),
                    'prompt_bullet': instance.to_prompt(profile_format='bullet')
                })
            except Exception as e:
                print(f"Warning: Could not generate for {rid}, {target}: {e}")

    with open(output_path, 'w') as f:
        json.dump(samples, f, indent=2)

    print(f"‚úì Exported {len(samples)} sample profiles to {output_path}")


# Export samples
# export_sample_profiles(generator, survey_df, TARGET_CODES, n_samples=5)

In [54]:
export_sample_profiles(generator, survey_df, TARGET_CODES, n_samples=5)

‚úì Exported 15 sample profiles to sample_profiles.json
