## Sanity Check - World Bank Data

This notebook runs a simple sanity check on the raw World Bank datasets: each CSV file can be found and read, has the expected columns, covers only the expected year range (2000 to 2024), and contains no duplicate rows.
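
The file-existence part of the check can be made explicit, because globbing an empty or mistyped directory yields no files and therefore no failures. A minimal sketch, assuming the four indicator files that this notebook reports on are the expected set; `EXPECTED_FILES` and `find_missing` are hypothetical names, not part of the notebook:

```python
import pathlib

# Assumption: the expected file names are the four files this notebook reports on.
EXPECTED_FILES = [
    "worldbank_SH.PRV.SMOK_2000_2024.csv",
    "worldbank_SH.XPD.CHEX.GD.ZS_2000_2024.csv",
    "worldbank_SP.DYN.IMRT.IN_2000_2024.csv",
    "worldbank_SP.DYN.LE00.IN_2000_2024.csv",
]


def find_missing(raw_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = pathlib.Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]


missing = find_missing("../data/raw")
if missing:
    print(f"Missing files: {missing}")
else:
    print("All expected files exist")
```

Comparing against an explicit list catches a file that was never downloaded, which a glob over the directory cannot detect on its own.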

#### 1. Imports

In [5]:
import pandas as pd
import pathlib

RAW_DIR = pathlib.Path("../data/raw")
EXPECTED_COLS = ["country_code", "country_name", "year", "indicator_code", "value"]
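
Given `EXPECTED_COLS`, the per-file checks (columns, year range, duplicate keys) could also be written as a function on a DataFrame, so they can be tried out without touching disk and so every problem is reported rather than only the first. This is a sketch under stated assumptions, not the notebook's own implementation: `check_frame`, `KEY_COLS`, and the sample rows are hypothetical, and the sample is made-up data rather than real World Bank values.

```python
import pandas as pd

EXPECTED_COLS = ["country_code", "country_name", "year", "indicator_code", "value"]
KEY_COLS = ["country_code", "indicator_code", "year"]  # hypothetical helper name


def check_frame(df, min_year=2000, max_year=2024):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    if list(df.columns) != EXPECTED_COLS:
        # The later checks rely on these columns, so stop here.
        return [f"columns mismatch: {df.columns.tolist()}"]

    problems = []
    # Coerce to numbers so missing or non-numeric years are reported instead of raising.
    years = pd.to_numeric(df["year"], errors="coerce")
    if years.isna().any():
        problems.append(f"{int(years.isna().sum())} missing or non-numeric year values")
    if (years < min_year).any() or (years > max_year).any():
        problems.append(f"year out of range: min={years.min()}, max={years.max()}")

    dups = int(df.duplicated(subset=KEY_COLS).sum())
    if dups:
        problems.append(f"{dups} duplicate rows on {tuple(KEY_COLS)}")
    return problems


# Made-up frame with one duplicate key and one year below the allowed range.
sample = pd.DataFrame(
    [
        ["ABC", "Country A", 2001, "X.IND", 1.0],
        ["ABC", "Country A", 2001, "X.IND", 1.5],
        ["ABC", "Country A", 1999, "X.IND", 2.0],
    ],
    columns=EXPECTED_COLS,
)
for problem in check_frame(sample):
    print(problem)
```

Returning a list of problems instead of stopping at the first failure is a design choice: a file with both a bad year and duplicate keys shows both issues in a single run.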

#### 2. Checks

In [6]:
results = []

# sorted() gives a deterministic order; raw glob order depends on the filesystem
for csv_path in sorted(RAW_DIR.glob("*.csv")):
    print(f"\nChecking {csv_path.name}")
    try:
        df = pd.read_csv(csv_path)
    except Exception as e:
        print(f"Could not read {csv_path.name}: {e}")
        results.append((csv_path.name, False))
        continue

    # 1. Check columns
    if list(df.columns) != EXPECTED_COLS:
        print(f"Columns mismatch: {df.columns.tolist()}")
        results.append((csv_path.name, False))
        continue
    else:
        print("Columns OK")

    # 2. Check year range
    years = df["year"]
    if years.min() < 2000 or years.max() > 2024:
        print(f"Year out of range: min={years.min()}, max={years.max()}")
        results.append((csv_path.name, False))
        continue
    else:
        print("Year range OK")

    # 3. Check for duplicates
    dups = df.duplicated(subset=["country_code", "indicator_code", "year"]).sum()
    if dups > 0:
        print(f"Found {dups} duplicate rows on (country_code, indicator_code, year)")
        results.append((csv_path.name, False))
        continue
    else:
        print("No duplicates on (country_code, indicator_code, year)")

    print(f"{csv_path.name} PASSED all checks")
    results.append((csv_path.name, True))

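# Summary
print("\n=== SUMMARY ===")
if not results:
    # An empty or mistyped directory would otherwise pass silently
    print(f"No CSV files found in {RAW_DIR}")
for fname, passed in results:
    print(f"{fname}: {'PASSED' if passed else 'FAILED'}")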

# Document what was checked
print("""
This notebook checks each CSV in data/raw/ for:
- Correct columns
- Years between 2000–2024
- No duplicate (country_code, indicator_code, year)
If all checks pass, the file is marked as PASSED.
""")


Checking worldbank_SH.PRV.SMOK_2000_2024.csv
Columns OK
Year range OK
No duplicates on (country_code, indicator_code, year)
worldbank_SH.PRV.SMOK_2000_2024.csv PASSED all checks

Checking worldbank_SH.XPD.CHEX.GD.ZS_2000_2024.csv
Columns OK
Year range OK
No duplicates on (country_code, indicator_code, year)
worldbank_SH.XPD.CHEX.GD.ZS_2000_2024.csv PASSED all checks

Checking worldbank_SP.DYN.IMRT.IN_2000_2024.csv
Columns OK
Year range OK
No duplicates on (country_code, indicator_code, year)
worldbank_SP.DYN.IMRT.IN_2000_2024.csv PASSED all checks

Checking worldbank_SP.DYN.LE00.IN_2000_2024.csv
Columns OK
Year range OK
No duplicates on (country_code, indicator_code, year)
worldbank_SP.DYN.LE00.IN_2000_2024.csv PASSED all checks

=== SUMMARY ===
worldbank_SH.PRV.SMOK_2000_2024.csv: PASSED
worldbank_SH.XPD.CHEX.GD.ZS_2000_2024.csv: PASSED
worldbank_SP.DYN.IMRT.IN_2000_2024.csv: PASSED
worldbank_SP.DYN.LE00.IN_2000_2024.csv: PASSED

This notebook checks each CSV in data/raw/ for:
- Corr