## Construct Health Expenditure Performance Index and Health Expenditure Efficiency Index

### 📝 Note: HEP = Health Expenditure Performance Index

**HEE = Health Expenditure Efficiency Index**

Clean the following indicators by filling in missing values:
Life expectancy (years), Infant Mortality (death rate per 1,000 people), Average Schooling (years), Learning Outcome (scores), Health Expenditure (% of GDP), Education Expenditure (% of GDP)
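
As a minimal sketch of this cleaning step, assuming gaps are filled within each country only (a grouped backward fill followed by a forward fill), the toy frame below uses made-up numbers; the column name `Life_Expectancy_filled` is hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd

# Toy panel (made-up numbers): country A has gaps, country B has no data at all
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Year": [2022, 2021, 2020, 2022, 2021],
    "Life_Expectancy": [np.nan, 71.0, np.nan, np.nan, np.nan],
})

# Fill within each country only: backward fill first, then forward fill
toy["Life_Expectancy_filled"] = (
    toy.groupby("Country")["Life_Expectancy"]
       .transform(lambda s: s.bfill().ffill())
)

print(toy)
```

Country A ends up fully filled from its single observation, while country B stays missing because values are never borrowed from another country. Rows like B's would still need to be dropped or handled separately before building an index.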

Replicate the PSP model, using the following data to generate:

HEP (Health Expenditure Performance)
Input indicators = Life expectancy, Infant Mortality, Average Schooling (years), Learning Outcome

After that, we generate HEE (Health Expenditure Efficiency) by dividing HEP by expenditure:


HEE = HEP/(Health Expenditure + Education Expenditure)

HEP = mean of the normalized Life expectancy, Infant Mortality, Average Schooling (years), and Learning Outcome (scores)
→ Methodology: each indicator is min-max normalized before being combined into the HEP index. Infant Mortality is inverted (1 - normalized value), since lower is better.
→ Higher HEP means a better standard of living
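
The HEP construction can be sketched on a tiny example. This is an illustration only, assuming min-max normalization and an equal-weight average of the four indicators; the three countries `X`, `Y`, `Z`, their values, and the helper name `min_max` are hypothetical.

```python
import pandas as pd

def min_max(series: pd.Series) -> pd.Series:
    """Rescale a series to the [0, 1] range."""
    return (series - series.min()) / (series.max() - series.min())

# Hypothetical values for three countries (not real data)
toy = pd.DataFrame({
    "life_expectancy": [60.0, 70.0, 80.0],
    "infant_mortality": [50.0, 30.0, 10.0],
    "average_schooling": [4.0, 8.0, 12.0],
    "learning_outcome": [300.0, 400.0, 500.0],
}, index=["X", "Y", "Z"])

# Normalize every indicator column, then invert infant mortality (lower is better)
norm = toy.apply(min_max)
norm["infant_mortality"] = 1 - norm["infant_mortality"]

# HEP is the equal-weight average of the four normalized indicators
toy["HEP"] = norm.mean(axis=1)
print(toy["HEP"])
```

Country `X` is worst on every indicator and scores 0, `Z` is best on every indicator and scores 1, and `Y` sits halfway at 0.5. Because min-max scaling depends on the sample minimum and maximum, HEP values are relative to the set of countries and years included.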

HEE = HEP / (Health Expenditure + Education Expenditure)
→ Evaluates the efficiency of government spending
→ Higher HEE means a better HEP is achieved with lower health and education expenditure
To compute the efficiency indicators, government spending is normalized across countries so that the average equals one for each of the two categories (Health Expenditure, Education Expenditure).

For each expenditure indicator (Health and Education), normalize each country's value by dividing it by the cross-country average:

normalized value = raw value / mean value across countries

The result has a mean of 1 and preserves relative differences between countries. In the code below, the mean is taken over all country-year rows in the panel.

(Normalized rather than raw values are used so that countries with large economies (e.g., USA, Germany) do not appear less efficient simply because they spend more, and so that developing countries do not look artificially efficient, even with a low HEP, simply because their raw expenditure is very small. Normalizing removes this scale bias.)
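
A small worked example of the expenditure normalization and the resulting HEE, using made-up HEP scores and spending shares for three hypothetical countries; the variable names `health_norm` and `edu_norm` are illustrative.

```python
import pandas as pd

# Hypothetical inputs (not real data)
toy = pd.DataFrame({
    "HEP": [0.4, 0.6, 0.8],
    "Health_Expenditure": [2.0, 4.0, 6.0],      # % of GDP
    "Education_Expenditure": [3.0, 3.0, 6.0],   # % of GDP
}, index=["X", "Y", "Z"])

# Divide each category by its own mean, so each normalized series averages 1
health_norm = toy["Health_Expenditure"] / toy["Health_Expenditure"].mean()
edu_norm = toy["Education_Expenditure"] / toy["Education_Expenditure"].mean()

# HEE = HEP per unit of normalized total spending
toy["HEE"] = toy["HEP"] / (health_norm + edu_norm)
print(toy.round(3))
```

Here `Z` has the highest HEP (0.8) but the lowest HEE (about 0.267), because its normalized spending (3.0) is far above the others. `Y` has the highest HEE (about 0.343). Dividing by the mean keeps spending ratios between countries intact and puts the two categories on a comparable scale before they are summed.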


In [3]:
import pandas as pd
import os

# === Step 1: Load and clean merged data ===

df = pd.read_csv("../data/interim/merged_data.csv")
df = df[df["Year"] <= 2022]
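# Sort newest year first within each country, so that bfill() fills a gap
# from the next older observation and ffill() then fills what is left
# from the nearest newer one.
df = df.sort_values(by=["Country", "Year"], ascending=[True, False])

# Fill gaps within each country only. Note: include_groups=False (pandas >= 2.2)
# drops the "Country" column from the result; later steps rely on "ISO3".
df_cleaned = (
    df.groupby("Country", group_keys=False)
      .apply(lambda g: g.bfill().ffill(), include_groups=False)
      .reset_index(drop=True)
)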

os.makedirs("../data/processed", exist_ok=True)
df_cleaned.to_csv("../data/processed/merged_data_clean_for_HEE_HEP.csv", index=False)

# === Step 2: Normalize indicators ===

def normalize(series):
    """Min-max scale to [0, 1] over all pooled country-year rows."""
    return (series - series.min()) / (series.max() - series.min())

df_norm = df_cleaned.copy()
df_norm["life_expectancy_norm"] = normalize(df_norm["Life_Expectancy"])
df_norm["infant_mortality_norm"] = 1 - normalize(df_norm["Mortality_Rate"])
df_norm["average_schooling_norm"] = normalize(df_norm["average_schooling"])
df_norm["learning_outcome_norm"] = normalize(df_norm["learning_scores"])

# === Step 3: Calculate HEP and HEE ===

# Overall HEP (4 variables)
df_norm["HEP"] = df_norm[[
    "life_expectancy_norm",
    "infant_mortality_norm",
    "average_schooling_norm",
    "learning_outcome_norm"
]].mean(axis=1)

# HEP by Health only
df_norm["HEP_Health"] = df_norm[[
    "life_expectancy_norm",
    "infant_mortality_norm"
]].mean(axis=1)

# HEP by Education only
df_norm["HEP_Edu"] = df_norm[[
    "average_schooling_norm",
    "learning_outcome_norm"
]].mean(axis=1)

# Normalize expenditures
df_norm["Health_Expenditure_norm"] = df_norm["Health_Expenditure"] / df_norm["Health_Expenditure"].mean()
df_norm["Education_Expenditure_norm"] = df_norm["Education_Expenditure"] / df_norm["Education_Expenditure"].mean()

# Total spending and overall HEE
df_norm["total_expenditure"] = df_norm["Health_Expenditure_norm"] + df_norm["Education_Expenditure_norm"]
df_norm["HEE"] = df_norm["HEP"] / df_norm["total_expenditure"]

# HEE for Health and Education
df_norm["HEE_Health"] = df_norm["HEP_Health"] / df_norm["Health_Expenditure_norm"]
df_norm["HEE_Edu"] = df_norm["HEP_Edu"] / df_norm["Education_Expenditure_norm"]

# === Step 4: Save results ===

result = df_norm[[
    "ISO3", "Year", "income_level",
    "HEP", "HEE",
    "HEP_Health", "HEE_Health",
    "HEP_Edu", "HEE_Edu"
]]

result.to_csv("../data/processed/hep_hee_results.csv", index=False)

print("✅ HEP & HEE (+ health/edu split) calculated and saved successfully.")
print(result.tail())


✅ HEP & HEE (+ health/edu split) calculated and saved successfully.
     ISO3  Year         income_level       HEP       HEE  HEP_Health  \
1260  VNM  2004  Lower middle income  0.683046  0.373729    0.758211   
1261  VNM  2003  Lower middle income  0.678494  0.375883    0.757388   
1262  VNM  2002  Lower middle income  0.671862  0.374511    0.752406   
1263  VNM  2001  Lower middle income  0.664594  0.342885    0.746150   
1264  VNM  2000  Lower middle income  0.656241  0.358466    0.737723   

      HEE_Health   HEP_Edu   HEE_Edu  
1260    1.196054  0.607881  0.509231  
1261    1.238888  0.599600  0.502294  
1262    1.253485  0.591319  0.495357  
1263    1.002193  0.583039  0.488420  
1264    1.158181  0.574758  0.481483  


Add region and country name for each country (the average index score over years for each country is computed in the visualization step below)

In [8]:
import pandas as pd
import os

# === Auto-install pycountry if missing ===
try:
    import pycountry
except ImportError:
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pycountry"])
    import pycountry

# === Step 1: Load the data ===
data_path = "../data/processed/hep_hee_results.csv"
df = pd.read_csv(data_path)

# === Step 2: Define compact country-to-region mapping ===
emea = [
    "ARM", "CYP", "CZE", "EGY", "ETH", "DEU", "FRA", "GRC", "IRL", "JOR", "KAZ", "KEN", "KGZ",
    "LBN", "MAR", "NLD", "NGA", "ROU", "SRB", "SVK", "TJK", "TUN", "TUR", "UKR", "GBR", "UZB", "RUS", "IRN"
]
apac = [
    "AUS", "BGD", "CHN", "IND", "IDN", "JPN", "MYS", "MNG", "MMR",
    "NZL", "PAK", "PHL", "SGP", "THA", "VNM", "KOR"
]
latam = [
    "ARG", "BRA", "CHL", "ECU", "GTM", "MEX", "NIC", "PER", "URY"
]
na = ["CAN", "USA"]

# Combine into a single dictionary
country_region_map = {
    **{ISO3: "EMEA" for ISO3 in emea},
    **{ISO3: "APAC" for ISO3 in apac},
    **{ISO3: "LATAM" for ISO3 in latam},
    **{ISO3: "NA" for ISO3 in na},
}

# === Step 3: Map region and country name ===
df["Region"] = df["ISO3"].map(country_region_map)

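def get_country_name(iso3):
    """Return the pycountry name for an ISO3 code, or "Unknown" if not found."""
    try:
        return pycountry.countries.get(alpha_3=iso3).name
    except (AttributeError, LookupError):
        # The lookup found no match (None has no .name) or raised a lookup error
        return "Unknown"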

df["Country"] = df["ISO3"].apply(get_country_name)

# === Step 4: Warn unmapped ISO3 codes ===
unmapped = df[df["Region"].isnull()]["ISO3"].unique()
if len(unmapped) > 0:
    print("⚠️ Warning: The following countries couldn't be mapped to a region:")
    for c in unmapped:
        print(f"- {c}")
    print("👉 Please update 'country_region_map' with these countries.")

# === Step 5: Save updated file ===
output_path = "../data/processed/hep_hee_results_with_region.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df.to_csv(output_path, index=False)
print(f"✅ 'Region' and 'Country' columns added and saved to: {output_path}")


✅ 'Region' and 'Country' columns added and saved to: ../data/processed/hep_hee_results_with_region.csv
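
Averaging the index scores over years for each country can be sketched as follows. The rows are made-up numbers in the shape of `hep_hee_results_with_region.csv`, and `country_avg` is a hypothetical name.

```python
import pandas as pd

# Hypothetical rows (not real results)
toy = pd.DataFrame({
    "ISO3": ["VNM", "VNM", "KEN", "KEN"],
    "Region": ["APAC", "APAC", "EMEA", "EMEA"],
    "Year": [2021, 2022, 2021, 2022],
    "HEP": [0.70, 0.72, 0.50, 0.54],
    "HEE": [0.36, 0.38, 0.30, 0.34],
})

# One row per country: mean of each index over all available years
country_avg = (
    toy.groupby(["ISO3", "Region"], as_index=False)[["HEP", "HEE"]].mean()
)
print(country_avg)
```

Note that `groupby` drops rows whose key is missing by default, so countries without a mapped `Region` would silently disappear from such a table unless the mapping is completed first.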


Visualize HEE and HEP index

In [24]:
# === Step 0: Install dependencies if missing ===
try:
    import dash
    from dash import dcc, html, Input, Output
except ImportError:
    import subprocess, sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "dash"])
    import dash  # needed below for dash.Dash(...)
    from dash import dcc, html, Input, Output

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# === Step 1: Load and prepare data ===
df = pd.read_csv("../data/processed/hep_hee_results_with_region.csv")
df = df[df["income_level"].notnull()]
df["Year"] = df["Year"].astype(int)
df = df.sort_values(by="Year")
df["Year_str"] = df["Year"].astype(str)

# === Step 2: Compute average values ===
df_avg = df.groupby("ISO3")[["HEP", "HEE", "HEP_Health", "HEE_Health", "HEP_Edu", "HEE_Edu"]].mean().reset_index()
df_avg["text"] = df_avg["ISO3"] + " (avg)"

# === Step 3: Get country list ===
available_countries = sorted(df["ISO3"].unique())
country_options = [{"label": "All Countries", "value": "ALL"}] + [
    {"label": iso3, "value": iso3} for iso3 in available_countries
]

# === Step 4: Build Dash App ===
app = dash.Dash(__name__)
app.title = "HEP vs HEE Scatter"

app.layout = html.Div([
    html.H2("HEP vs HEE – Animated Scatter by Country"),

    html.Div([
        html.Label("Select Countries (ISO3):"),
        dcc.Dropdown(
            id="country-dropdown",
            options=country_options,
            value=["ALL"],
            multi=True
        ),
    ], style={"width": "60%", "marginBottom": "20px"}),

    dcc.Graph(id="animated-scatter-main", style={"height": "650px"}),
    dcc.Graph(id="animated-scatter-health", style={"height": "650px"}),
    dcc.Graph(id="animated-scatter-edu", style={"height": "650px"})
])

# === Step 5: Callback ===
@app.callback(
    Output("animated-scatter-main", "figure"),
    Output("animated-scatter-health", "figure"),
    Output("animated-scatter-edu", "figure"),
    Input("country-dropdown", "value")
)
def update_figure(selected):
    # An empty or cleared dropdown (None or []) is treated like "ALL"
    if not selected or "ALL" in selected:
        selected_countries = available_countries
    else:
        selected_countries = selected

    dff = df[df["ISO3"].isin(selected_countries)]
    dff_avg = df_avg[df_avg["ISO3"].isin(selected_countries)]

    def make_scatter(x_col, y_col, title, avg_x, avg_y):
        fig = px.scatter(
            dff,
            x=x_col,
            y=y_col,
            color="income_level",
            text="ISO3",
            animation_frame="Year_str",
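            range_x=[0, 1],
            # Efficiency is not bounded by 1 (HEE_Health can exceed it), so widen the y-axis when needed
            range_y=[0, max(1.0, dff[y_col].max() * 1.05)],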
            labels={x_col: f"Performance ({x_col})", y_col: f"Efficiency ({y_col})"},
            title=title,
            width=950,
            height=650
        )
        fig.update_traces(marker=dict(size=10), textposition="top center")
        fig.add_trace(go.Scatter(
            x=dff_avg[avg_x],
            y=dff_avg[avg_y],
            mode="markers+text",
            text=dff_avg["text"],
            textposition="bottom center",
            marker=dict(size=9, symbol="diamond", color="black", opacity=0.4),
            name="Country avg (all years)"  # df_avg averages over every year in the data
        ))
        fig.add_shape(
            type="line",
            x0=0, y0=0, x1=1, y1=1,
            xref="x", yref="y",
            line=dict(dash="dash", color="gray")
        )
        return fig

    fig_main = make_scatter("HEP", "HEE", "HEP vs HEE (Overall)", "HEP", "HEE")
    fig_health = make_scatter("HEP_Health", "HEE_Health", "HEP vs HEE (Health Only)", "HEP_Health", "HEE_Health")
    fig_edu = make_scatter("HEP_Edu", "HEE_Edu", "HEP vs HEE (Education Only)", "HEP_Edu", "HEE_Edu")

    return fig_main, fig_health, fig_edu

# === Step 6: Run App ===
if __name__ == "__main__":
    app.run(debug=True)


In [25]:
print(df[["ISO3", "Country", "Region"]].drop_duplicates().head())

     ISO3     Country Region
1264  VNM    Viet Nam   APAC
1241  UZB  Uzbekistan   EMEA
91    BGD  Bangladesh   APAC
22    ARG   Argentina  LATAM
574   KEN       Kenya   EMEA
