## ETL for Country Health Indicators (World Bank API)

#### 1. Import Dependencies and Configuration

This first block imports the libraries needed to call the API, manipulate JSON into tabular form, and connect to PostgreSQL through the shared engine configuration.

In [2]:
import requests
import pandas as pd
from config import pg_engine

#### 2. Define World Bank Health Indicators

Here we specify which World Bank indicators we want to pull.
Each friendly name (like diabetes_prevalence) is mapped to its official World Bank indicator code.

In [3]:
INDICATORS = {
    "diabetes_prevalence": "SH.STA.DIAB.ZS",
    "health_expenditure_per_capita": "SH.XPD.CHEX.PC.CD",
    "hospital_beds_per_1k": "SH.MED.BEDS.ZS",
}

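Keeping the mapping in one dictionary means the extract step picks up an additional indicator as soon as a new entry is added here. The column list in step 7 and the dim_country table would need the matching column as well.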

#### 3. Helper Function: Fetch Indicator Time Series

This function encapsulates the logic for calling the World Bank API for a single indicator and parsing the JSON response into a structured DataFrame.

In [6]:
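def fetch_indicator(indicator_code, indicator_name):
    """Download one World Bank indicator for all countries as a long-form DataFrame."""
    url = f"https://api.worldbank.org/v2/country/all/indicator/{indicator_code}"
    params = {"format": "json", "per_page": 20000}
    # A timeout keeps the notebook from hanging if the API does not respond.
    resp = requests.get(url, params=params, timeout=60)
    resp.raise_for_status()
    data = resp.json()
    # The records are expected in the second element of the response.
    # Fail with a clear message if it is missing (for example an error
    # message or an empty result) instead of crashing inside the loop.
    if not isinstance(data, list) or len(data) < 2 or data[1] is None:
        raise ValueError(f"No records returned for indicator {indicator_code}")
    rows = []
    for rec in data[1]:
        rows.append({
            "country": rec["country"]["value"],
            "year": int(rec["date"]),
            "indicator": indicator_name,
            "value": rec["value"],  # None when no observation exists for that year
            "source": "WorldBankAPI",
        })
    return pd.DataFrame(rows)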

**Explanation:**
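For each indicator, the script calls the World Bank REST API, loops over the returned records, and extracts the country name, year, indicator label, numeric value, and a source tag. The result is a tidy table with one row per country–year–indicator. Values are missing for years without an observation. Note that the `country/all` endpoint also returns aggregates such as regions and income groups (for example "World"), so these appear as rows alongside individual countries unless they are filtered out.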
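
The function above requests a single page with `per_page=20000`. If an indicator ever had more records than that, only the first page would be read. The following is a minimal sketch of page-by-page fetching, assuming the first element of the response reports the total page count under a `pages` key. `fetch_all_pages` and the `get_page` callable are hypothetical names introduced here, and the fake pages stand in for `requests.get(...).json()` so the sketch runs offline.

```python
def fetch_all_pages(get_page):
    """Collect records from every page of a paged response.

    get_page(page_number) is a hypothetical callable that returns the parsed
    JSON for one page as [metadata, records].
    """
    records = []
    page = 1
    while True:
        meta, page_records = get_page(page)
        records.extend(page_records or [])  # records may be None on an empty page
        if page >= int(meta.get("pages", 1)):
            break
        page += 1
    return records


# Offline stand-in for the API: three records split across two pages.
FAKE_PAGES = {
    1: [{"page": 1, "pages": 2}, [{"date": "2021"}, {"date": "2020"}]],
    2: [{"page": 2, "pages": 2}, [{"date": "2019"}]],
}

all_records = fetch_all_pages(lambda page: FAKE_PAGES[page])
years = [int(rec["date"]) for rec in all_records]
```

In a real run, `get_page` could wrap the request from `fetch_indicator` with an added `page` query parameter; that wiring is left out here because it needs network access.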

#### 4. Extract: Download All Required Indicators

In this step, the function above is applied to each indicator defined in the dictionary, and the results are combined.

In [7]:
frames = []
for name, code in INDICATORS.items():
    frames.append(fetch_indicator(code, name))

all_indicators = pd.concat(frames, ignore_index=True)
all_indicators = all_indicators[all_indicators["year"] >= 2010]

**Explanation:**
The ETL pipeline concatenates all indicator-specific DataFrames into a single long-form table and keeps only observations from 2010 onward, so the later selection of one value per country draws on recent data only and the table stays small.

#### 5. Transform: Pivot to Country-Level Wide Format

This block reshapes the long table into a country-level wide table, where each indicator becomes a column.

In [8]:
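# aggfunc="last" keeps the last non-null value in row order, so sort by year
# first to make "last" mean "most recent".
pivot = (
    all_indicators
    .sort_values("year")
    .pivot_table(
        index="country",
        columns="indicator",
        values="value",
        aggfunc="last",
    )
    .reset_index()
)
pivot.columns.name = None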

**Explanation:**
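Using a pivot, the ETL keeps one value per indicator for each country. `aggfunc="last"` takes the last non-null value in row order, so it yields the latest available value only when the rows are in ascending year order; if the API delivers the most recent year first, the rows must be sorted by year beforehand. The result has one row per country with columns like diabetes_prevalence, health_expenditure_per_capita, and hospital_beds_per_1k, and the values in one row may come from different years. This format aligns naturally with a country dimension table in the warehouse.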
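
Because `aggfunc="last"` depends on row order, a toy example helps show the effect of sorting by year before the pivot. The data below is hypothetical and deliberately listed with the most recent year first; `pivot_last` is a helper name introduced only for this sketch.

```python
import pandas as pd

# Hypothetical long-form data, most recent year first.
long_df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "year": [2021, 2020, 2019, 2021, 2020],
    "indicator": ["beds"] * 5,
    "value": [9.0, 8.0, 7.0, 4.0, 3.0],
})


def pivot_last(df):
    """Pivot to one row per country, keeping the last value in row order."""
    return df.pivot_table(
        index="country",
        columns="indicator",
        values="value",
        aggfunc="last",
    )


# Without sorting, "last" picks the oldest year because rows are descending.
unsorted_result = pivot_last(long_df)

# After sorting by year, "last" picks the most recent year.
sorted_result = pivot_last(long_df.sort_values("year"))
```

For country A, the unsorted pivot returns the 2019 value and the sorted pivot returns the 2021 value.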

#### 6. Enrich Schema with Structural Columns

Before loading, we add placeholder columns for standard country attributes and a source tag.

In [9]:
pivot["iso2_code"] = None
pivot["region"] = None
pivot["subregion"] = None
pivot["income_level"] = None
pivot["source_system"] = "WorldBankAPI"

**Explanation:**
These fields (ISO code, region, subregion, income level) are reserved for future enrichment or manual population. The source_system column explicitly tracks that this dimension data originated from the World Bank API.
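
One possible way to populate the placeholders later is a left join against a lookup table. The sketch below assumes such a table exists; `country_lookup` and its contents are hypothetical and shown only to illustrate the join, and `pivot_demo` stands in for the real `pivot` DataFrame.

```python
import pandas as pd

# Stand-in for the pivoted table with empty placeholder columns.
pivot_demo = pd.DataFrame({
    "country": ["Austria", "Brazil", "Chile"],
    "iso2_code": [None, None, None],
    "region": [None, None, None],
})

# Hypothetical lookup table; in practice this would come from a maintained source.
country_lookup = pd.DataFrame({
    "country": ["Austria", "Brazil"],
    "iso2_code": ["AT", "BR"],
    "region": ["Europe & Central Asia", "Latin America & Caribbean"],
})

# Drop the empty placeholders, then left-join so unmatched countries are kept
# with missing values rather than being removed.
enriched = (
    pivot_demo
    .drop(columns=["iso2_code", "region"])
    .merge(country_lookup, on="country", how="left")
)
```

A left join keeps every country from the pivot, so rows without a lookup match stay visible as gaps to fill.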

#### 7. Prepare Final dim_country DataFrame

This step renames and orders the columns to match the PostgreSQL dim_country schema.

In [11]:
dim_country_df = pivot.rename(columns={"country": "country_name"})
dim_country_df = dim_country_df[
    [
        "country_name",
        "iso2_code",
        "region",
        "subregion",
        "diabetes_prevalence",
        "health_expenditure_per_capita",
        "hospital_beds_per_1k",
        "income_level",
        "source_system",
    ]
]

**Explanation:**
The schema now reflects a clean country dimension, combining macro-level health indicators with structural attributes and a provenance field.
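
Before loading, a quick sanity check on the frame can catch a missing indicator column or duplicated country names. This is a minimal sketch, not part of the original pipeline; `check_dim_country` and `EXPECTED_COLUMNS` are names introduced here, and the column list simply mirrors the one used above.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "country_name",
    "iso2_code",
    "region",
    "subregion",
    "diabetes_prevalence",
    "health_expenditure_per_capita",
    "hospital_beds_per_1k",
    "income_level",
    "source_system",
]


def check_dim_country(df):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "country_name" in df.columns and df["country_name"].duplicated().any():
        problems.append("duplicate country_name values")
    return problems


# Toy frames: one complete, one lacking a column and repeating a country.
good = pd.DataFrame([{col: None for col in EXPECTED_COLUMNS}])
good["country_name"] = "Austria"
bad = pd.DataFrame({"country_name": ["Austria", "Austria"]})

good_problems = check_dim_country(good)
bad_problems = check_dim_country(bad)
```

Calling `check_dim_country(dim_country_df)` right before the load step would surface these issues as a readable list instead of a database error.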

#### 8. Load: Write dim_country into PostgreSQL

Finally, the transformed DataFrame is loaded into the dim_country table in the data warehouse.

In [12]:
with pg_engine.begin() as conn:
    dim_country_df.to_sql(
        "dim_country",
        con=conn,
        if_exists="append",
        index=False,
    )

**Explanation:**
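The load phase persists the country health indicators into the warehouse, allowing downstream queries to correlate hospital-level facts with country-level health system characteristics. Because `if_exists="append"` adds rows on every run, re-running this cell inserts each country again unless the table is cleared first or a uniqueness constraint rejects the duplicates.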
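
Since the load uses `if_exists="append"`, the effect of running it twice can be shown with a small offline sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, with hypothetical values; the delete-then-append pattern shown is one possible way to make a reload repeatable, not necessarily what the warehouse uses.

```python
import sqlite3

import pandas as pd

# Hypothetical two-row stand-in for dim_country_df.
demo_df = pd.DataFrame({
    "country_name": ["Austria", "Brazil"],
    "hospital_beds_per_1k": [7.0, 2.0],
})

conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here


def row_count(connection):
    """Count the rows currently in dim_country."""
    return connection.execute("SELECT COUNT(*) FROM dim_country").fetchone()[0]


# Appending twice doubles the rows.
demo_df.to_sql("dim_country", con=conn, if_exists="append", index=False)
demo_df.to_sql("dim_country", con=conn, if_exists="append", index=False)
count_after_two_appends = row_count(conn)

# Clearing the table before appending keeps the table definition and
# leaves exactly one copy of each row.
conn.execute("DELETE FROM dim_country")
demo_df.to_sql("dim_country", con=conn, if_exists="append", index=False)
count_after_reload = row_count(conn)

conn.close()
```

Deleting rows rather than using `if_exists="replace"` is chosen here because `replace` drops and recreates the table, which would discard any column types or constraints defined on it in the warehouse.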