
# Visual Validation of DV Standardization

This notebook supports validation and qualitative inspection of standardized dependent variable (DV) mappings. It compares column names in the raw and processed datasets so researchers can visually confirm how well the naming scheme works and how many columns it covers.

Use this notebook to:
- Display side-by-side comparisons of column names before and after standardization
- Visualize common or unresolved variable names
- Identify patterns or inconsistencies for schema refinement


In [None]:

import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

RAW_PATH = Path("../data/raw/")
PROCESSED_PATH = Path("../data/processed/")


## Load Raw and Processed Datasets

In [None]:

# Select one file to validate
filename = "sample_dataset.csv"
raw_df = pd.read_csv(RAW_PATH / filename)
processed_df = pd.read_csv(PROCESSED_PATH / f"standardized_{filename}")


## Compare Column Names (Before vs After)

In [None]:

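# Pair column names by position. Wrapping each side in a Series pads the
# shorter one with NaN, so the table still builds if standardization
# added or dropped columns.
comparison_df = pd.DataFrame({
    "Original": pd.Series(raw_df.columns),
    "Standardized": pd.Series(processed_df.columns),
})
comparison_df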
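
To spot unresolved names quickly, it can help to flag the rows where the name did not change. The following is a minimal sketch, not part of the original pipeline: `summarize_column_changes` is a hypothetical helper, the example column names are invented for illustration, and it assumes standardization preserves column order so that pairing by position is meaningful.

```python
import pandas as pd

def summarize_column_changes(raw_cols, std_cols):
    """Pair columns by position and flag names that differ after standardization."""
    paired = pd.DataFrame({
        "Original": pd.Series(list(raw_cols), dtype="object"),
        "Standardized": pd.Series(list(std_cols), dtype="object"),
    })
    # Rows padded with NaN (unequal column counts) also count as changed.
    paired["Changed"] = paired["Original"] != paired["Standardized"]
    return paired

# Hypothetical column names, for illustration only
example = summarize_column_changes(
    ["RT", "acc", "subj_id"],
    ["reaction_time", "accuracy", "subj_id"],
)
unchanged = example.loc[~example["Changed"], "Original"].tolist()
print(unchanged)
```

In the notebook itself, the same helper could be called with `raw_df.columns` and `processed_df.columns`. An unchanged name is not necessarily an error, since it may already be in standard form, but it is a reasonable place to start looking for missed aliases.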


## Visualize Distribution of Variable Names

In [None]:

# Count how often each column name occurs. pd.read_csv renames duplicate
# headers by default (e.g. "x", "x.1"), so counts within one file are usually 1.
raw_counts = raw_df.columns.value_counts()
standardized_counts = processed_df.columns.value_counts()

# Overlay the two bar charts (informative only if column names repeat)
plt.figure(figsize=(10, 4))
plt.bar(raw_counts.index, raw_counts.values, alpha=0.6, label="Raw", color="gray")
plt.bar(standardized_counts.index, standardized_counts.values, alpha=0.6, label="Standardized", color="green")
plt.title("Column Name Frequencies Before and After Standardization")
plt.xlabel("Column Name")
plt.ylabel("Count")
plt.xticks(rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()
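
Name frequencies tend to be more informative when pooled across several files, because a single loaded file rarely contains repeated column names. The sketch below shows one way to do that pooling with the standard library. `pooled_name_counts` is a hypothetical helper, and the file names and column names are invented for illustration.

```python
from collections import Counter

def pooled_name_counts(columns_per_file):
    """Count in how many files each column name appears."""
    counts = Counter()
    for cols in columns_per_file.values():
        # Use a set so a name is counted at most once per file.
        counts.update(set(cols))
    return counts

# Hypothetical files and column names, for illustration only
columns_per_file = {
    "study_a.csv": ["reaction_time", "accuracy", "subj_id"],
    "study_b.csv": ["reaction_time", "acc", "subj_id"],
    "study_c.csv": ["reaction_time", "accuracy"],
}
counts = pooled_name_counts(columns_per_file)

# Names seen in only one file are candidates for unresolved aliases.
rare = sorted(name for name, n in counts.items() if n == 1)
print(rare)
```

The resulting counts can be passed to `plt.bar` in the same way as `raw_counts` above. Names that appear in only one file are worth checking against the schema first.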



---

## Summary

This notebook enables manual and visual comparison of column names before and after schema-based standardization. Use it to validate the mappings, check for overlooked aliases, and refine the schema logic iteratively.

Next step: Use this feedback to improve `standard_dv_mapping.yaml` or enhance the transformation script (`convert_dv.py`).
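
As a starting point for that feedback loop, the sketch below lists column names that a mapping does not yet cover. It assumes, purely for illustration, that the mapping can be represented as a flat dictionary from alias to standard name. The actual layout of `standard_dv_mapping.yaml` may differ, and `find_unmapped` is a hypothetical helper.

```python
def find_unmapped(columns, alias_to_standard):
    """Return columns that are neither a known alias nor already a standard name."""
    known = set(alias_to_standard) | set(alias_to_standard.values())
    return [c for c in columns if c not in known]

# Hypothetical mapping and column names, for illustration only;
# the real layout of standard_dv_mapping.yaml may differ.
alias_to_standard = {"RT": "reaction_time", "acc": "accuracy"}
unmapped = find_unmapped(["RT", "accuracy", "resp_time"], alias_to_standard)
print(unmapped)
```

Each name this returns is either a new alias to add to the mapping or a column that is intentionally left outside the schema.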
