# Data Loading Options (DataLoader)

This notebook compares configuration patterns for `scripts/data-loader.py`, showing how chunk size, selective column loading, and the dtype optimization flag affect load time and memory use. Use it as a playground when deciding which options fit your workflow.
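
For orientation, here is a minimal sketch of what these three options usually mean in plain pandas. It assumes the loader's settings correspond roughly to `pandas.read_csv` arguments (`chunksize`, `usecols`) plus integer downcasting. The source does not confirm how `scripts/data-loader.py` implements them, and the in-memory CSV is a stand-in with illustrative values.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for a real file; the values are illustrative only.
csv_text = (
    "ID,Site name,Spill Events 2022\n"
    "1,Alpha,3\n"
    "2,Beta,0\n"
    "3,Gamma,7\n"
    "4,Delta,1\n"
)

# Selective column loading: parse only the columns that are needed.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["ID", "Spill Events 2022"])

# Chunked loading: iterate over fixed-size pieces instead of building one frame at once.
chunk_rows = [len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3)]

# Dtype optimization (one common approach): downcast integer columns
# to the smallest integer type that can hold their values.
full = pd.read_csv(io.StringIO(csv_text))
optimized = full.copy()
for col in optimized.columns:
    if pd.api.types.is_integer_dtype(optimized[col]):
        optimized[col] = pd.to_numeric(optimized[col], downcast="integer")

print(list(subset.columns))
print(chunk_rows)
print(optimized.dtypes.to_dict())
```

Downcasting trades a little load-time work for a smaller in-memory footprint, which is the kind of trade-off the scenarios in this notebook are meant to surface.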



## What You'll Learn

- How to dynamically import the loader module from a notebook
- How to compare multiple `DataConfig` settings in a single session
- How to capture summary metadata (rows, columns, memory) for each scenario



In [None]:
from pathlib import Path
import sys
import logging
import importlib.util
import time
import pandas as pd

try:
    NOTEBOOK_DIR = Path(__file__).resolve().parent
except NameError:
    # __file__ is undefined in a notebook kernel, so fall back to the working directory.
    NOTEBOOK_DIR = Path.cwd()

PROJECT_ROOT = NOTEBOOK_DIR.parent.parent
SCRIPTS_DIR = PROJECT_ROOT / "scripts"
DATA_FILE = PROJECT_ROOT / "data" / "national_water_plan.csv"

if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")

print(f"Project root: {PROJECT_ROOT}")
print(f"Data file:    {DATA_FILE}")



In [None]:
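def load_data_loader_module(module_name: str = "data_loader_demo"):
    """Import scripts/data-loader.py by path; its hyphenated name rules out a plain import."""
    module_path = SCRIPTS_DIR / "data-loader.py"
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module from {module_path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so dataclasses and pickling can resolve the module.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module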


data_loader_module = load_data_loader_module()
DataConfig = data_loader_module.DataConfig
DataLoader = data_loader_module.DataLoader

print("Data loader module imported successfully.")



## Scenario Helper

We'll define a reusable helper that runs the loader with a specific `DataConfig` and captures execution time plus key metadata.



In [None]:
def run_scenario(name: str, **config_kwargs):
    """Run the DataLoader with a custom configuration and collect summary stats."""
    print(f"\nRunning scenario: {name}")
    config = DataConfig(filepath=str(DATA_FILE), **config_kwargs)
    loader = DataLoader(config=config)

    start = time.perf_counter()
    df, report = loader.load_and_explore_data()
    duration = time.perf_counter() - start

    metadata = report.metadata
    summary = {
        "scenario": name,
        "chunk_size": config.chunk_size or "auto",
        "columns_loaded": "All" if not config.usecols else ", ".join(config.usecols),
        "dtype_optimization": config.dtype_optimization,
        "rows": metadata.rows,
        "columns": metadata.columns,
        "memory_mb": round(metadata.memory_usage, 2),
        "missing_pct": round(metadata.missing_values_percent, 2),
        "duplicate_rows": metadata.duplicate_rows,
        "warnings": len(report.warnings),
        "errors": len(report.errors),
        "duration_sec": round(duration, 2),
    }

    return summary, df, report



## Compare Multiple Configurations

Feel free to edit the scenarios below. Each run will capture metadata so you can easily compare the trade-offs between different settings.
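
One way to read the resulting summary table is relative to the default run. The sketch below is a hedged illustration: `example_summary` is a hypothetical stand-in for `summary_df`, and its numbers are placeholders, not measurements from this dataset.

```python
import pandas as pd

# Placeholder numbers for illustration only; they are not real measurements.
example_summary = pd.DataFrame(
    [
        {"scenario": "Default load (auto)", "memory_mb": 100.0, "duration_sec": 4.0},
        {"scenario": "Chunked selective columns", "memory_mb": 25.0, "duration_sec": 2.0},
    ]
)

metrics = example_summary.set_index("scenario")[["memory_mb", "duration_sec"]]

# Divide every row by the baseline row so that 1.0 means "same as the default load".
baseline = metrics.loc["Default load (auto)"]
relative = metrics.div(baseline, axis="columns")

print(relative)
```

Values below 1.0 indicate a scenario that used less memory or time than the default. A single timed run can be noisy, so treat small differences with caution.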



In [None]:
scenarios = [
    {
        "name": "Default load (auto)",
        "config": {}
    },
    {
        "name": "Chunked selective columns",
        "config": {
            "chunk_size": 10_000,
            "usecols": [
                "ID",
                "Water company",
                "Site name",
                "River Basin District",
                "Spill Events 2022",
                "Sewage Reduction Plan Targets Met Flag"
            ],
            "required_columns": ["ID", "Site name"]
        }
    },
    {
        "name": "Full dataset (dtype optimization off)",
        "config": {
            "dtype_optimization": False
        }
    }
]

scenario_summaries = []
scenario_outputs = {}

for scenario in scenarios:
    summary, df_obj, report = run_scenario(scenario["name"], **scenario["config"])
    scenario_summaries.append(summary)
    scenario_outputs[scenario["name"]] = {"df": df_obj, "report": report}

summary_df = pd.DataFrame(scenario_summaries)
summary_df



In [None]:
for name, outputs in scenario_outputs.items():
    report = outputs["report"]
    print(f"\n{name}")
    print("-" * len(name))
    if report.warnings:
        for warning in report.warnings:
            print(f"  ⚠️  {warning}")
    else:
        print("  No warnings")

    if report.errors:
        for error in report.errors:
            print(f"  ❌ {error}")
    else:
        print("  No errors")



## Inspect a Specific Scenario

Use the snippet below to inspect the Dask DataFrame returned by any of the scenarios (here we use the chunked example).
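
Alongside the `head()` cell that follows, a per-column profile of dtypes and memory can show what selective loading bought you. This is a minimal sketch using a small stand-in pandas DataFrame. `profile_frame` is a hypothetical helper, not part of the loader, and the `compute()` branch assumes a Dask-style lazy result.

```python
import pandas as pd


def profile_frame(df):
    """Return one row per column with its dtype and memory footprint in bytes."""
    usage = df.memory_usage(index=False, deep=True)
    # Lazy collections such as Dask expose .compute(); pandas results do not.
    if hasattr(usage, "compute"):
        usage = usage.compute()
    return pd.DataFrame({"dtype": df.dtypes.astype(str), "memory_bytes": usage})


# Stand-in frame; in the notebook you would pass the scenario's DataFrame instead.
sample = pd.DataFrame({"ID": [1, 2, 3], "Site name": ["A", "B", "C"]})
profile = profile_frame(sample)
print(profile)
```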



In [None]:
chunked_df = scenario_outputs["Chunked selective columns"]["df"]
chunked_df.head()



## Next Steps

- Tweak the `scenarios` list to try other combinations (e.g., stricter `max_memory_mb`, different subsets of columns, or `required_columns` checks)
- Persist whichever configuration works best for your pipeline inside `load_data.py` or downstream scripts
- Pair these scenarios with the cleaning pipeline to compare performance end-to-end
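
To persist a chosen configuration, one simple option is to store the keyword arguments as JSON and reload them later. The sketch below is illustrative: the file name `loader_config.json` and the temporary directory are hypothetical choices, and the keys mirror the `DataConfig` arguments used in the scenarios above.

```python
import json
import tempfile
from pathlib import Path

# Keyword arguments taken from the scenarios in this notebook.
chosen_config = {
    "chunk_size": 10_000,
    "usecols": ["ID", "Site name"],
    "required_columns": ["ID", "Site name"],
    "dtype_optimization": True,
}

# Hypothetical location; in a project this would live next to the pipeline code.
config_path = Path(tempfile.mkdtemp()) / "loader_config.json"
config_path.write_text(json.dumps(chosen_config, indent=2), encoding="utf-8")

# Reload and confirm the round trip preserved every setting.
restored = json.loads(config_path.read_text(encoding="utf-8"))
print(restored == chosen_config)
```

The restored dictionary can then be unpacked the same way `run_scenario` does, for example `DataConfig(filepath=str(DATA_FILE), **restored)`.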

