
# Schema Builder Notebook

This notebook supports iterative development of the `standard_dv_mapping.yaml` schema.  
It lets you:
- Load raw datasets from the `data/raw/` folder
- Extract and inspect column names
- Identify and flag dependent variable (DV) names not covered by the existing schema
- Suggest new alias-to-standard mappings
- Export proposed additions for manual review or direct YAML update


In [None]:

import os
import pandas as pd
import yaml
from collections import defaultdict

RAW_DATA_PATH = "../data/raw/"
SCHEMA_PATH = "../schemas/standard_dv_mapping.yaml"


## Load and Preview Raw Datasets

In [None]:

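# List available datasets (sorted, because os.listdir returns entries in arbitrary order
# and a later cell selects a dataset by index)
datasets = sorted(f for f in os.listdir(RAW_DATA_PATH) if f.endswith(".csv"))
datasets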


## Extract Column Names from Selected Dataset

In [None]:

# Choose a dataset (edit index if needed)
df = pd.read_csv(os.path.join(RAW_DATA_PATH, datasets[0]))
df.columns.tolist()


## Load Current Schema

In [None]:

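with open(SCHEMA_PATH, "r", encoding="utf-8") as f:
    schema = yaml.safe_load(f) or {}  # safe_load returns None for an empty file

# Flatten schema into a set of all known names: each standard name plus its aliases
# (assumes the YAML maps standard name -> list of aliases)
known_aliases = set(schema)
for aliases in schema.values():
    known_aliases.update(aliases or [])
known_aliases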
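
The flattening step assumes the YAML maps each standard DV name to a list of aliases; the file's actual layout is not shown in this notebook. Under that assumption, a minimal sketch of the reverse lookup (alias to standard name) might look like the following. `build_alias_lookup` and `example_schema` are hypothetical names used only for illustration.

```python
# Illustrative schema with the assumed shape {standard_name: [alias, ...]}.
example_schema = {
    "task_completion_time": ["time_spent", "completion_duration"],
    "user_engagement_score": ["engagement_rating"],
}


def build_alias_lookup(schema):
    """Invert {standard: [aliases]} into {name: standard}.

    Each standard name maps to itself. A name claimed by two different
    standard names raises instead of being silently overwritten.
    """
    lookup = {}
    for std_name, aliases in (schema or {}).items():
        for name in [std_name, *(aliases or [])]:
            owner = lookup.setdefault(name, std_name)
            if owner != std_name:
                raise ValueError(
                    f"{name!r} is mapped to both {owner!r} and {std_name!r}"
                )
    return lookup


alias_lookup = build_alias_lookup(example_schema)
```

A lookup like this also surfaces conflicting aliases early, before they reach the exported YAML.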


## Identify Columns Not Covered by Current Schema

In [None]:

raw_columns = set(df.columns)
unmatched = raw_columns - known_aliases
unmatched
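
The set difference in this cell is an exact string comparison, so a column such as `Time Spent` would be flagged even if `time_spent` is a known alias. If that is not the intended behavior, one option is to normalize names before comparing. The sketch below is an assumption about what "the same name" should mean, not part of the existing schema logic; `normalize_name` and `find_unmatched` are hypothetical helpers.

```python
import re


def normalize_name(name):
    """Lowercase, trim, and collapse runs of non-alphanumerics into one underscore."""
    return re.sub(r"[^0-9a-z]+", "_", str(name).strip().lower()).strip("_")


def find_unmatched(columns, known_aliases):
    """Return the original column names whose normalized form is not a known alias."""
    known = {normalize_name(alias) for alias in known_aliases}
    return sorted(c for c in columns if normalize_name(c) not in known)


# Illustrative inputs, not taken from a real dataset.
example_columns = ["Time Spent", "engagement_rating ", "session_id"]
example_aliases = {"time_spent", "engagement_rating"}
print(find_unmatched(example_columns, example_aliases))  # ['session_id']
```

The function returns the original spellings, so the raw column names remain available for review.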


## Build Suggested Schema Extensions (Manual Input Recommended)

In [None]:

# Suggest new mappings manually (example)
suggestions = {
    "task_completion_time": ["time_spent", "completion_duration"],
    "user_engagement_score": ["engagement_rating", "user_rating"]
}

# Merge with existing schema.
# Deep copy: a shallow schema.copy() shares the alias lists, so extending them
# would also modify the loaded `schema`.
import copy

updated_schema = copy.deepcopy(schema)
for std_name, aliases in suggestions.items():
    existing = updated_schema.setdefault(std_name, [])
    for alias in aliases:
        if alias not in existing:
            existing.append(alias)

# Preview updated schema
updated_schema
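
Manual input remains the recommended path, but fuzzy matching can give a first set of candidates to review. The sketch below uses the standard library's `difflib.get_close_matches` to pair each unmatched column with the standard name of its closest known name. The helper name `suggest_standard_names`, the default `cutoff`, and the example data are assumptions for illustration; string similarity can produce wrong pairings, so treat the output as a list of proposals only.

```python
import difflib


def suggest_standard_names(unmatched, schema, cutoff=0.6):
    """Map each unmatched column to the standard name of its closest known name, if any."""
    # Every known name (standard name or alias) points back to its standard name.
    owner = {}
    for std_name, aliases in schema.items():
        owner[std_name] = std_name
        for alias in aliases or []:
            owner[alias] = std_name

    proposals = {}
    for column in sorted(unmatched):
        close = difflib.get_close_matches(column, list(owner), n=1, cutoff=cutoff)
        if close:
            proposals[column] = owner[close[0]]
    return proposals


# Illustrative inputs with the assumed shape {standard_name: [alias, ...]}.
example_schema = {"task_completion_time": ["time_spent"]}
example_unmatched = {"time_spent_sec", "session_id"}
print(suggest_standard_names(example_unmatched, example_schema))
```

Accepted proposals can then be copied into `suggestions` by hand, which keeps a human in the loop.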


## Export Updated Schema (Preview or Save)

In [None]:

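UPDATED_SCHEMA_PATH = "../schemas/standard_dv_mapping_UPDATED.yaml"

# Write to a separate file so the original schema stays untouched until reviewed
with open(UPDATED_SCHEMA_PATH, "w", encoding="utf-8") as f:
    yaml.dump(updated_schema, f, sort_keys=False)

print(f"✅ Updated schema exported to: {UPDATED_SCHEMA_PATH}")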



---

## Summary

This notebook supports iterative schema building: load a dataset, compare its columns against the known aliases, and propose mappings for anything unmatched. Use it to populate or refine your DV naming scheme.

Repeat the process for each dataset in `data/raw/` to improve coverage and alias reuse.
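
Repeating the check across datasets can be sketched as a single pass over every CSV header. The version below uses only the standard library, reads just the first row of each file, and records which files contain each unmatched column. `collect_unmatched` is a hypothetical helper, and it assumes UTF-8 files whose first row is the header.

```python
import csv
from collections import defaultdict
from pathlib import Path


def collect_unmatched(raw_dir, known_aliases):
    """Return {column_name: [files containing it]} for columns not in known_aliases."""
    seen_in = defaultdict(list)
    for path in sorted(Path(raw_dir).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as f:
            # Read the header row only; an empty file yields an empty header.
            header = next(csv.reader(f), [])
        for column in header:
            if column not in known_aliases:
                seen_in[column].append(path.name)
    return dict(seen_in)
```

In this notebook it could be called as `collect_unmatched(RAW_DATA_PATH, known_aliases)`. Columns that appear in many files are usually the most valuable ones to map first.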
