## Construct Health Expenditure Performance Index and Health Expenditure Efficiency Index

### 📝 Note: Health Expenditure Performance Index (HEP)

**HEE** = Health Expenditure Efficiency Index

Clean the following variables (fill in missing values):
Life expectancy (years), Infant Mortality (Death rate per 1000 people), Average Schooling (years), Learning Outcome (scores), Health Expenditure (% of GDP), Education Expenditure (% of GDP)

Replicate the PSP model, using the following data to generate HEP (Health Expenditure Performance):

Independent variables (index components) = Life expectancy, Infant Mortality, Average Schooling (years), Learning Outcome

We then generate HEE (Health Expenditure Efficiency) by dividing HEP by expenditure:

HEE = HEP / (Health Expenditure + Education Expenditure)

HEP = combination of Life expectancy, Infant Mortality, Average Schooling (years), Learning Outcome (Scores); the code below takes the unweighted mean of the normalized indicators
→ Methodology: each independent variable is normalized before being combined into the HEP index. Infant Mortality uses the inverse value, since lower is better
→ A higher HEP means a better standard of living
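
The normalization step described above can be illustrated on toy numbers. This is a minimal sketch, assuming min-max scaling (as in the code cell further down) and hypothetical values for three made-up countries A, B and C; the helper names `min_max` and `min_max_inverse` are illustrative only.

```python
import pandas as pd

# Hypothetical toy values for three made-up countries (not from the dataset).
toy = pd.DataFrame(
    {
        "life_expectancy": [60.0, 70.0, 80.0],
        "infant_mortality": [50.0, 30.0, 10.0],
    },
    index=["A", "B", "C"],
)

def min_max(series: pd.Series) -> pd.Series:
    """Rescale to [0, 1]; a higher raw value gives a higher score."""
    return (series - series.min()) / (series.max() - series.min())

def min_max_inverse(series: pd.Series) -> pd.Series:
    """Rescale to [0, 1]; a lower raw value gives a higher score."""
    return (series.max() - series) / (series.max() - series.min())

toy["life_expectancy_norm"] = min_max(toy["life_expectancy"])
toy["infant_mortality_norm"] = min_max_inverse(toy["infant_mortality"])

# Combine the two normalized indicators with an unweighted mean.
toy["HEP_Health"] = toy[["life_expectancy_norm", "infant_mortality_norm"]].mean(axis=1)
print(toy)
```

In this toy case, country C has both the highest life expectancy and the lowest infant mortality, so it scores highest on both normalized indicators.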

HEE = HEP / (Health Expenditure + Education Expenditure)
→ Evaluates the efficiency of government spending
→ A higher HEE means a better HEP is achieved with lower health and education expenditure
To compute the efficiency indicators, government spending is normalized across countries so that the average equals one for each of the two categories (Health Expenditure, Education Expenditure).

For each indicator (Health and Education Expenditure), normalize each country's value by dividing it by the cross-country average:

normalized value = raw value / mean value across countries

The result has a mean of 1 but preserves relative differences. (In the code below, the mean is taken over all country-year rows.)
(We use normalized values instead of raw values to prevent countries with large economies (e.g., USA, Germany) from appearing less efficient simply because they spend more in absolute terms. Likewise, developing countries might look artificially efficient, even if their HEP is low, simply because their raw expenditure is very small. Using normalized values therefore removes this scale bias.)
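
As a small worked example of the mean-one normalization and the resulting HEE, the sketch below uses hypothetical HEP scores and expenditure shares for three made-up countries; none of the numbers come from the dataset, and the column names are illustrative.

```python
import pandas as pd

# Hypothetical toy inputs (not from the dataset).
toy = pd.DataFrame(
    {
        "HEP": [0.4, 0.6, 0.8],
        "health_exp": [2.0, 4.0, 6.0],  # % of GDP
        "edu_exp": [3.0, 3.0, 6.0],     # % of GDP
    },
    index=["A", "B", "C"],
)

# Divide each series by its own mean so that the normalized mean equals 1.
toy["health_exp_norm"] = toy["health_exp"] / toy["health_exp"].mean()
toy["edu_exp_norm"] = toy["edu_exp"] / toy["edu_exp"].mean()

# Efficiency: performance per unit of normalized spending.
toy["HEE"] = toy["HEP"] / (toy["health_exp_norm"] + toy["edu_exp_norm"])
print(toy.round(3))
```

Relative differences survive the rescaling: country C still spends three times as much on health as country A. Note also that the combined denominator averages 2 (each normalized series averages 1), so a country with average spending in both categories gets an overall HEE of half its HEP.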


In [79]:
import pandas as pd
import os

# === Step 1: Load and clean merged data ===

df = pd.read_csv("../data/interim/merged_data.csv")
df = df[df["Year"] <= 2022]
df = df.sort_values(by=["Country", "Year"], ascending=[True, False])

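# Fill gaps within each country: backfill first, then forward-fill.
# Rows are sorted by Year descending, so bfill copies from the nearest earlier
# year and ffill copies from the nearest later year.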
df_cleaned = (
    df.groupby("Country", group_keys=False)
      .apply(lambda g: g.bfill().ffill())
      .reset_index(drop=True)
)

# Save cleaned data
os.makedirs("../data/processed", exist_ok=True)
df_cleaned.to_csv("../data/processed/merged_data_clean_for_HEE_HEP.csv", index=False)

# === Step 2: Hybrid normalization ===
# Min-max scaling to [0, 1]; min and max are taken over all countries and years pooled.

def hybrid_normalize(series):
    return (series - series.min()) / (series.max() - series.min())

def inverse_hybrid(series):
    return (series.max() - series) / (series.max() - series.min())

df_norm = df_cleaned.copy()
df_norm["life_expectancy_norm"] = hybrid_normalize(df_norm["Life_Expectancy"])
df_norm["infant_mortality_norm"] = inverse_hybrid(df_norm["Mortality_Rate"])
df_norm["average_schooling_norm"] = hybrid_normalize(df_norm["average_schooling"])
df_norm["learning_outcome_norm"] = hybrid_normalize(df_norm["learning_scores"])

# === Step 3: Calculate HEP and HEE ===

df_norm["HEP"] = df_norm[[ 
    "life_expectancy_norm",
    "infant_mortality_norm",
    "average_schooling_norm",
    "learning_outcome_norm"
]].mean(axis=1)

df_norm["HEP_Health"] = df_norm[[
    "life_expectancy_norm",
    "infant_mortality_norm"
]].mean(axis=1)

df_norm["HEP_Edu"] = df_norm[[
    "average_schooling_norm",
    "learning_outcome_norm"
]].mean(axis=1)

# Normalize expenditures (mean = 1)
df_norm["Health_Expenditure_norm"] = df_norm["Health_Expenditure"] / df_norm["Health_Expenditure"].mean()
df_norm["Education_Expenditure_norm"] = df_norm["Education_Expenditure"] / df_norm["Education_Expenditure"].mean()

df_norm["total_expenditure"] = df_norm["Health_Expenditure_norm"] + df_norm["Education_Expenditure_norm"]
df_norm["HEE"] = df_norm["HEP"] / df_norm["total_expenditure"]
df_norm["HEE_Health"] = df_norm["HEP_Health"] / df_norm["Health_Expenditure_norm"]
df_norm["HEE_Edu"] = df_norm["HEP_Edu"] / df_norm["Education_Expenditure_norm"]

# === Step 4: Save final result ===

result = df_norm[[ 
    "ISO3", "Year", "income_level",
    "HEP", "HEE",
    "HEP_Health", "HEE_Health",
    "HEP_Edu", "HEE_Edu"
]]

result.to_csv("../data/processed/hep_hee_results.csv", index=False)

print("✅ HEP & HEE (+ health/edu split) calculated and saved successfully.")
print(result.tail())


✅ HEP & HEE (+ health/edu split) calculated and saved successfully.
     ISO3  Year         income_level       HEP       HEE  HEP_Health  \
1260  VNM  2004  Lower middle income  0.683046  0.373729    0.758211   
1261  VNM  2003  Lower middle income  0.678494  0.375883    0.757388   
1262  VNM  2002  Lower middle income  0.671862  0.374511    0.752406   
1263  VNM  2001  Lower middle income  0.664594  0.342885    0.746150   
1264  VNM  2000  Lower middle income  0.656241  0.358466    0.737723   

      HEE_Health   HEP_Edu   HEE_Edu  
1260    1.196054  0.607881  0.509231  
1261    1.238888  0.599600  0.502294  
1262    1.253485  0.591319  0.495357  
1263    1.002193  0.583039  0.488420  
1264    1.158181  0.574758  0.481483  
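
One thing that may be worth checking after the fill step: a per-country backfill and forward-fill cannot fill a column for which a country has no observations at all. Because `DataFrame.mean(axis=1)` skips NaN by default, such a row would still receive an HEP, computed from fewer components. The sketch below shows one way to count the remaining gaps per country; the toy frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country B has no life-expectancy observation at all.
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B"],
        "Year": [2002, 2001, 2000, 2001, 2000],
        "Life_Expectancy": [np.nan, 70.0, np.nan, np.nan, np.nan],
    }
)

# Fill within each country, as in the cleaning step.
filled = toy.copy()
filled["Life_Expectancy"] = toy.groupby("Country")["Life_Expectancy"].transform(
    lambda s: s.bfill().ffill()
)

# Count the values that are still missing, per country.
remaining = filled["Life_Expectancy"].isna().groupby(filled["Country"]).sum()
print(remaining)
```

Country A is fully filled from its single observation, while both rows of country B remain missing.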


Add region and country name for each country (averages over years are computed in the visualization step)

In [83]:
import pandas as pd
import os

# === Auto-install pycountry if missing ===
try:
    import pycountry
except ImportError:
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pycountry"])
    import pycountry

# === Step 1: Load the data ===
data_path = "../data/processed/hep_hee_results.csv"
df = pd.read_csv(data_path)

# === Step 2: Define compact country-to-region mapping ===
emea = [
    "ARM", "CYP", "CZE", "EGY", "ETH", "DEU", "FRA", "GRC", "IRL", "JOR", "KAZ", "KEN", "KGZ",
    "LBN", "MAR", "NLD", "NGA", "ROU", "SRB", "SVK", "TJK", "TUN", "TUR", "UKR", "GBR", "UZB", "RUS", "IRN"
]
apac = [
    "AUS", "BGD", "CHN", "IND", "IDN", "JPN", "MYS", "MNG", "MMR",
    "NZL", "PAK", "PHL", "SGP", "THA", "VNM", "KOR"
]
latam = [
    "ARG", "BRA", "CHL", "ECU", "GTM", "MEX", "NIC", "PER", "URY"
]
na = ["CAN", "USA"]

# Combine into a single dictionary
country_region_map = {
    **{ISO3: "EMEA" for ISO3 in emea},
    **{ISO3: "APAC" for ISO3 in apac},
    **{ISO3: "LATAM" for ISO3 in latam},
    **{ISO3: "NA" for ISO3 in na},
}

# === Step 3: Map region and country name ===
df["Region"] = df["ISO3"].map(country_region_map)

def get_country_name(iso3):
    try:
        return pycountry.countries.get(alpha_3=iso3).name
    except (AttributeError, LookupError):
        # Code not found: get() returned None or the lookup failed
        return "Unknown"

df["Country"] = df["ISO3"].apply(get_country_name)

# === Step 4: Warn unmapped ISO3 codes ===
unmapped = df[df["Region"].isnull()]["ISO3"].unique()
if len(unmapped) > 0:
    print("⚠️ Warning: The following countries couldn't be mapped to a region:")
    for c in unmapped:
        print(f"- {c}")
    print("👉 Please update 'country_region_map' with these countries.")

# === Step 5: Save updated file ===
output_path = "../data/processed/hep_hee_results_with_region.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df.to_csv(output_path, index=False)
print(f"✅ 'Region' and 'Country' columns added and saved to: {output_path}")


✅ 'Region' and 'Country' columns added and saved to: ../data/processed/hep_hee_results_with_region.csv
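
Since the region dictionary is built by merging several comprehensions with `**`, an ISO3 code that appears in two lists silently keeps the region of the last list. A quick duplicate check, sketched here on hypothetical shortened lists (with "TUR" deliberately repeated for illustration), can catch that before mapping:

```python
from collections import Counter

# Hypothetical shortened region lists; "TUR" is duplicated on purpose.
regions = {
    "EMEA": ["DEU", "FRA", "TUR"],
    "APAC": ["JPN", "TUR"],
}

# Count how often each code occurs across all region lists.
counts = Counter(code for codes in regions.values() for code in codes)
duplicates = sorted(code for code, n in counts.items() if n > 1)

if duplicates:
    print("Codes assigned to more than one region:", duplicates)
```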


Visualize HEE and HEP index

In [91]:
# === Step 0: Install dependencies if missing ===
try:
    import dash
    from dash import dcc, html, Input, Output, dash_table
except ImportError:
    import subprocess, sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "dash"])
    import dash  # needed below for dash.Dash(...)
    from dash import dcc, html, Input, Output, dash_table

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# === Step 1: Load and prepare data ===
df = pd.read_csv("../data/processed/hep_hee_results_with_region.csv")
df = df[df["income_level"].notnull()]
df["Year"] = df["Year"].astype(int)
df = df.sort_values(by="Year")
df["Year_str"] = df["Year"].astype(str)

# === Step 2: Compute average values ===
df_avg = df[df["Year"].between(2000, 2022)]
df_avg = df_avg.groupby(["ISO3", "income_level"])[["HEP", "HEE", "HEP_Health", "HEE_Health", "HEP_Edu", "HEE_Edu"]].mean().reset_index()
df_avg["text"] = df_avg["ISO3"] + " (avg)"

# Add ranking info
ranking_df = df[df["Year"].between(2000, 2022)].copy()
df_rank = ranking_df.groupby("ISO3")[["HEP", "HEE", "HEP_Health", "HEE_Health", "HEP_Edu", "HEE_Edu"]].mean().reset_index()
df_rank["Rank"] = df_rank["HEP"].rank(ascending=False, method="min").astype(int)
df_rank = df_rank.sort_values(by="Rank")
meta_info = df[["ISO3", "income_level"]].drop_duplicates()
df_rank = df_rank.merge(meta_info, on="ISO3", how="left")

# === Step 3: Get options ===
available_countries = sorted(df["ISO3"].unique())
country_options = [{"label": "All Countries", "value": "ALL"}] + [
    {"label": iso3, "value": iso3} for iso3 in available_countries
]
available_income_levels = sorted(df["income_level"].dropna().unique())
income_options = [{"label": level, "value": level} for level in available_income_levels]
avg_toggle = [{"label": "Show 2000–2022 Average", "value": "show_avg"}]

# === Step 4: Build Dash App ===
app = dash.Dash(__name__)
app.title = "HEP vs HEE Scatter and Ranking"

app.layout = html.Div([
    html.H2("HEP vs HEE – Animated and Average Scatter by Country"),

    html.Div([
        html.Label("Select Countries (ISO3):"),
        dcc.Dropdown(id="country-dropdown", options=country_options, value=["ALL"], multi=True),
    ], style={"width": "60%", "marginBottom": "20px"}),

    html.Div([
        html.Label("Select Income Levels:"),
        dcc.Dropdown(id="income-dropdown", options=income_options, value=available_income_levels, multi=True),
    ], style={"width": "60%", "marginBottom": "20px"}),

    html.Div([
        dcc.Checklist(id="avg-toggle", options=avg_toggle, value=["show_avg"],
                      labelStyle={"display": "inline-block", "marginRight": "10px"})
    ], style={"marginBottom": "20px"}),

    dcc.Graph(id="animated-scatter-main", style={"height": "600px"}),
    dcc.Graph(id="animated-scatter-health", style={"height": "600px"}),
    dcc.Graph(id="animated-scatter-edu", style={"height": "600px"}),

    html.Hr(),
    html.H3("HEP vs HEE (2000–2022 Avg Only, No Animation)"),

    dcc.Graph(id="avg-scatter-main"),
    dcc.Graph(id="avg-scatter-health"),
    dcc.Graph(id="avg-scatter-edu"),

    html.Hr(),
    html.H3("Ranking Table of HEP & HEE Index (2000–2022 Avg)"),

    dash_table.DataTable(
        id='ranking-table',
        columns=[
            {"name": "Rank", "id": "Rank"},
            {"name": "Country Code (ISO3)", "id": "ISO3"},
            {"name": "Income Level", "id": "income_level"},
            {"name": "Avg HEP", "id": "HEP", "type": "numeric", "format": {"specifier": ".3f"}},
            {"name": "Avg HEE", "id": "HEE", "type": "numeric", "format": {"specifier": ".3f"}},
            {"name": "Avg HEP_Edu", "id": "HEP_Edu", "type": "numeric", "format": {"specifier": ".3f"}},
            {"name": "Avg HEE_Edu", "id": "HEE_Edu", "type": "numeric", "format": {"specifier": ".3f"}},
            {"name": "Avg HEP_Health", "id": "HEP_Health", "type": "numeric", "format": {"specifier": ".3f"}},
            {"name": "Avg HEE_Health", "id": "HEE_Health", "type": "numeric", "format": {"specifier": ".3f"}},
        ],
        data=df_rank.to_dict("records"),
        style_table={"overflowX": "auto"},
        sort_action="native",
        style_header={"backgroundColor": "#f4f4f4", "fontWeight": "bold"},
        style_cell={"padding": "8px", "textAlign": "center"},
        page_size=20
    )
])

# === Step 5: Callback ===
@app.callback(
    Output("animated-scatter-main", "figure"),
    Output("animated-scatter-health", "figure"),
    Output("animated-scatter-edu", "figure"),
    Output("avg-scatter-main", "figure"),
    Output("avg-scatter-health", "figure"),
    Output("avg-scatter-edu", "figure"),
    Input("country-dropdown", "value"),
    Input("income-dropdown", "value"),
    Input("avg-toggle", "value")
)
def update_figure(selected_countries, selected_income_levels, avg_toggle):
    if "ALL" in selected_countries:
        selected_countries = available_countries

    dff = df[df["ISO3"].isin(selected_countries) & df["income_level"].isin(selected_income_levels)]
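    # Apply the same country and income-level filters to the averages
    dff_avg = df_avg[df_avg["ISO3"].isin(selected_countries) & df_avg["income_level"].isin(selected_income_levels)]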

    def make_animated_scatter(x_col, y_col, title, avg_x, avg_y):
        fig = px.scatter(
            dff, x=x_col, y=y_col, color="income_level", text="ISO3",
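            # Efficiency values can exceed 1 (spending is mean-normalized), so let the y-axis grow to fit
            animation_frame="Year_str", range_x=[0, 1], range_y=[0, max(1.0, dff[y_col].max() * 1.05)],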
            labels={x_col: f"Performance ({x_col})", y_col: f"Efficiency ({y_col})"},
            title=title, width=950, height=600
        )
        fig.update_traces(marker=dict(size=10), textposition="top center")

        if "show_avg" in avg_toggle:
            fig.add_trace(go.Scatter(
                x=dff_avg[avg_x], y=dff_avg[avg_y], mode="markers+text",
                text=dff_avg["text"], textposition="bottom center",
                marker=dict(size=9, symbol="diamond", color="black", opacity=0.4),
                name="2000–2022 Avg"
            ))

        fig.add_shape(
            type="line", x0=0, y0=0, x1=1, y1=1,
            xref="x", yref="y", line=dict(dash="dash", color="gray")
        )
        return fig

    def make_static_avg_scatter(x_col, y_col, title):
        dff_static = df_avg[df_avg["income_level"].isin(selected_income_levels) & df_avg["ISO3"].isin(selected_countries)]
        fig = px.scatter(
            dff_static, x=x_col, y=y_col, color="income_level", text="ISO3",
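            # Efficiency values can exceed 1 (spending is mean-normalized), so let the y-axis grow to fit
            range_x=[0, 1], range_y=[0, max(1.0, dff_static[y_col].max() * 1.05)],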
            labels={x_col: f"Performance ({x_col})", y_col: f"Efficiency ({y_col})"},
            title=title, width=950, height=600
        )
        fig.update_traces(marker=dict(size=10), textposition="top center")
        fig.add_shape(
            type="line", x0=0, y0=0, x1=1, y1=1,
            xref="x", yref="y", line=dict(dash="dash", color="gray")
        )
        return fig

    fig_main = make_animated_scatter("HEP", "HEE", "HEP vs HEE (Overall)", "HEP", "HEE")
    fig_health = make_animated_scatter("HEP_Health", "HEE_Health", "HEP vs HEE (Health Only)", "HEP_Health", "HEE_Health")
    fig_edu = make_animated_scatter("HEP_Edu", "HEE_Edu", "HEP vs HEE (Education Only)", "HEP_Edu", "HEE_Edu")

    fig_avg_main = make_static_avg_scatter("HEP", "HEE", "Avg HEP vs HEE (Overall)")
    fig_avg_health = make_static_avg_scatter("HEP_Health", "HEE_Health", "Avg HEP vs HEE (Health Only)")
    fig_avg_edu = make_static_avg_scatter("HEP_Edu", "HEE_Edu", "Avg HEP vs HEE (Education Only)")

    return fig_main, fig_health, fig_edu, fig_avg_main, fig_avg_health, fig_avg_edu

# === Step 6: Run App ===
if __name__ == "__main__":
    app.run(debug=True)


In [106]:
import pandas as pd
import plotly.express as px

# === Step 1: Load and aggregate ===
df = pd.read_csv("../data/processed/hep_hee_results_with_region.csv")
df_avg = df[df["Year"].between(2000, 2022)]
df_avg = df_avg.groupby("ISO3")[["HEP", "HEE"]].mean().reset_index()

# === Step 2: Melt to long format for grouped bars ===
df_long = df_avg.melt(id_vars="ISO3", value_vars=["HEP", "HEE"], 
                      var_name="Indicator", value_name="Value")

# === Step 3: Sort for display ===
df_long["ISO3"] = pd.Categorical(df_long["ISO3"], 
                                 categories=df_avg.sort_values("HEP")["ISO3"],
                                 ordered=True)

# === Step 4: Plot with Plotly Express ===
fig = px.bar(
    df_long,
    x="Value",
    y="ISO3",
    color="Indicator",
    barmode="group",
    orientation="h",
    height=1000,
    color_discrete_map={"HEP": "#FFB6C1", "HEE": "#FFA500"},
    labels={"Value": "Index Value", "ISO3": "Country", "Indicator": "Metric"},
    title="Avg HEP vs HEE index (2000–2022)"
)

# === Step 5: Layout adjustments ===
fig.update_layout(
    yaxis_title="Country",
    xaxis_title="Index Value",
    legend_title_text="Indicator",
    bargap=0.2,
    margin=dict(l=100, r=20, t=60, b=40)
)

fig.show()
