## Integrate R&D and contribution revenue into a broader index

### 📝 Note: Health Expenditure Performance Index (HEP)

**HEE** = Health Expenditure Efficiency Index

Clean the following indicators by filling in missing values:
Life Expectancy (years), Infant Mortality (death rate per 1,000 people), Average Schooling (years), Learning Outcome (scores), Health Expenditure (% of GDP), Education Expenditure (% of GDP)
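As a minimal sketch of the gap-filling idea, the toy frame below fills missing values within each country only, so values never leak from one country into another. The data and the single `Life_Expectancy` column are made up for illustration, and `transform` is used here as one possible way to apply the fill per group.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): two countries with gaps in one indicator.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "Life_Expectancy": [70.0, np.nan, 72.0, np.nan, 65.0, np.nan],
})

# Sort newest-first within each country, as the notebook does.
toy = toy.sort_values(["Country", "Year"], ascending=[True, False])

# Fill inside each country: bfill() copies the nearest earlier year into a gap,
# then ffill() covers gaps that have no earlier year available.
filled = toy.copy()
filled["Life_Expectancy"] = (
    toy.groupby("Country")["Life_Expectancy"]
       .transform(lambda s: s.bfill().ffill())
)

print(filled.sort_values(["Country", "Year"]))
```

Country A's 2001 gap takes the 2000 value, while country B, which has a single observation, ends up with that value in every year.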

Replicate the PSP model using the following data to generate HEP (Health Expenditure Performance):

Independent variables = Life Expectancy, Infant Mortality, Average Schooling (years), Learning Outcome

Then generate HEE (Health Expenditure Efficiency) by dividing HEP by expenditure:

HEE = HEP / (Health Expenditure + Education Expenditure)

HEP = sum of the normalised Life Expectancy, Infant Mortality, Average Schooling (years) and Learning Outcome (scores), divided by the number of indicators (the code below takes this unweighted mean)
→ Methodology: each independent variable is normalised before being combined into the HEP index. Infant Mortality enters as 1 minus its normalised value, since lower is better
→ A higher HEP means a better standard of living
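The following toy example sketches how the four indicators combine into HEP. The three countries and their values are invented so the arithmetic is easy to follow, and `min_max` is an illustrative helper name. The column names follow the ones used in the notebook code.

```python
import pandas as pd

def min_max(series):
    """Rescale a series to the 0-1 range (assumes max > min)."""
    return (series - series.min()) / (series.max() - series.min())

# Toy data (illustrative only).
toy = pd.DataFrame({
    "ISO3": ["AAA", "BBB", "CCC"],
    "Life_Expectancy": [60.0, 70.0, 80.0],
    "Mortality_Rate": [40.0, 20.0, 0.0],      # lower is better
    "average_schooling": [4.0, 8.0, 12.0],
    "learning_scores": [300.0, 400.0, 500.0],
})

components = pd.DataFrame({
    "life": min_max(toy["Life_Expectancy"]),
    "mortality": 1 - min_max(toy["Mortality_Rate"]),  # invert: low mortality scores high
    "schooling": min_max(toy["average_schooling"]),
    "learning": min_max(toy["learning_scores"]),
})

# HEP as the unweighted mean of the four normalised components.
toy["HEP"] = components.mean(axis=1)
print(toy[["ISO3", "HEP"]])
```

Because every component lies between 0 and 1, HEP does too: the worst performer on all four indicators scores 0 and the best scores 1.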

HEE = HEP / (Health Expenditure + Education Expenditure)
→ Evaluates the efficiency of government spending
→ A higher HEE means a better HEP is achieved with lower health and education expenditure
To compute the efficiency indicator, government spending is normalised across countries so that the average equals one for each of the two categories (Health Expenditure, Education Expenditure).

For each expenditure indicator (Health and Education Expenditure), normalise each country's value by dividing it by the cross-country average:

normalised value = raw value / mean value across countries

The result has a mean of 1 and preserves the relative differences between countries.
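A small worked example may make the mean-of-one normalisation and the HEE ratio concrete. The figures below are invented, and the HEP values are simply assumed rather than computed.

```python
import pandas as pd

# Toy data (illustrative only): HEP is assumed, expenditure is in % of GDP.
toy = pd.DataFrame({
    "ISO3": ["AAA", "BBB", "CCC"],
    "HEP": [0.4, 0.6, 0.8],
    "Health_Expenditure": [2.0, 4.0, 6.0],
    "Education_Expenditure": [3.0, 3.0, 6.0],
})

# Divide each expenditure series by its own mean so that each averages to 1.
health_norm = toy["Health_Expenditure"] / toy["Health_Expenditure"].mean()
education_norm = toy["Education_Expenditure"] / toy["Education_Expenditure"].mean()

# Efficiency: performance per unit of normalised spending.
toy["total_expenditure"] = health_norm + education_norm
toy["HEE"] = toy["HEP"] / toy["total_expenditure"]

print(health_norm.mean(), education_norm.mean())
print(toy[["ISO3", "HEP", "total_expenditure", "HEE"]])
```

In this toy case CCC has the highest HEP but the lowest HEE, because its normalised spending (3.0) is more than double that of AAA (1.25).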

In [12]:
import pandas as pd
import os

# === Step 1: Load and clean merged data ===

# Adjusted path: go up one level from notebooks/ to access data/interim
df = pd.read_csv("../data/interim/merged_data.csv")

# Filter to only include years up to 2022
df = df[df["Year"] <= 2022]

# Sort by Country and Year (most recent first) for proper fill
df = df.sort_values(by=["Country", "Year"], ascending=[True, False])

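# Fill gaps within each country. With years sorted newest-first, bfill() copies the
# nearest earlier year into a gap, and ffill() then covers gaps that have no earlier
# year by using the nearest later year.
# Note: include_groups=False (pandas >= 2.2) drops the "Country" column from the result;
# ISO3 remains available as the country key.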
df_cleaned = (
    df.groupby("Country", group_keys=False)
      .apply(lambda g: g.bfill().ffill(), include_groups=False)
      .reset_index(drop=True)
)

# Save cleaned data
os.makedirs("../data/processed", exist_ok=True)
df_cleaned.to_csv("../data/processed/merged_data_clean_for_HEE_HEP.csv", index=False)

# === Step 2: Normalize performance indicators ===

def normalize(series):
    return (series - series.min()) / (series.max() - series.min())

df_norm = df_cleaned.copy()

# Apply min-max normalization (adjust column names to match the exact spelling in the data).
# Infant mortality is inverted after normalization because lower values are better.
df_norm["life_expectancy_norm"] = normalize(df_norm["Life_Expectancy"])
df_norm["infant_mortality_norm"] = 1 - normalize(df_norm["Mortality_Rate"])
df_norm["average_schooling_norm"] = normalize(df_norm["average_schooling"])
df_norm["learning_outcome_norm"] = normalize(df_norm["learning_scores"])

# === Step 3: Calculate HEP and HEE ===

df_norm["HEP"] = df_norm[[
    "life_expectancy_norm",
    "infant_mortality_norm",
    "average_schooling_norm",
    "learning_outcome_norm"
]].mean(axis=1)

# Normalize Health and Education Expenditure so that their mean = 1
df_norm["Health_Expenditure_norm"] = (
    df_norm["Health_Expenditure"] / df_norm["Health_Expenditure"].mean()
)
df_norm["Education_Expenditure_norm"] = (
    df_norm["Education_Expenditure"] / df_norm["Education_Expenditure"].mean()
)

# Now compute normalized total expenditure
df_norm["total_expenditure"] = (
    df_norm["Health_Expenditure_norm"] + df_norm["Education_Expenditure_norm"]
)
# Calculate HEE (Health Expenditure Efficiency)
df_norm["HEE"] = df_norm["HEP"] / df_norm["total_expenditure"]


# === Step 4: Save only HEP & HEE results ===
result = df_norm[["ISO3", "Year", "income_level", "HEP", "HEE"]].copy()
result.to_csv("../data/processed/hep_hee_results.csv", index=False)

print("✅ HEP & HEE calculated and saved successfully.")
print(result.tail())



✅ HEP & HEE calculated and saved successfully.
     ISO3  Year         income_level       HEP       HEE
1237  VNM  2004  Lower middle income  0.683046  0.370482
1238  VNM  2003  Lower middle income  0.678494  0.372637
1239  VNM  2002  Lower middle income  0.671862  0.371287
1240  VNM  2001  Lower middle income  0.664594  0.339823
1241  VNM  2000  Lower middle income  0.656241  0.355349


### Add region

In [24]:
import pandas as pd
import os

# === Step 1: Load the data ===
data_path = "../data/processed/hep_hee_results.csv"
df = pd.read_csv(data_path)

# === Step 2: Define compact country-to-region mapping ===
emea = [
    "ARM", "CYP", "CZE", "EGY", "ETH", "DEU", "GRC", "IRL", "JOR", "KAZ", "KEN", "KGZ",
    "LBN", "MAR", "NLD", "NGA", "ROU", "SRB", "SVK", "TJK", "TUN", "TUR", "UKR", "GBR", "UZB", "RUS", "IRN"
]
apac = [
    "AUS", "BGD", "CHN", "IND", "IDN", "JPN", "MYS", "MNG", "MMR",
    "NZL", "PAK", "PHL", "SGP", "THA", "VNM", "KOR"
]
latam = [
    "ARG", "BRA", "CHL", "ECU", "GTM", "MEX", "NIC", "PER", "URY"
]
na = ["CAN", "USA"]

# Combine into a single dictionary
country_region_map = {
    **{ISO3: "EMEA" for ISO3 in emea},
    **{ISO3: "APAC" for ISO3 in apac},
    **{ISO3: "LATAM" for ISO3 in latam},
    **{ISO3: "NA" for ISO3 in na},
}

# === Step 3: Map region and validate ===
df["Region"] = df["ISO3"].map(country_region_map)
unmapped = df[df["Region"].isnull()]["ISO3"].unique()

if len(unmapped) > 0:
    print("⚠️ Warning: The following countries couldn't be mapped to a region:")
    for c in unmapped:
        print(f"- {c}")
    print("👉 Please update 'country_region_map' with these countries.")

# === Step 4: Save updated file ===
output_path = "../data/processed/hep_hee_results_with_region.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df.to_csv(output_path, index=False)
print(f"✅ 'Region' column added and data saved to: {output_path}")


✅ 'Region' column added and data saved to: ../data/processed/hep_hee_results_with_region.csv
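One thing worth checking after this step: `pd.read_csv` treats the literal string `NA` as a missing value by default, so the North America label can come back as NaN when the saved file is reloaded. The sketch below shows the mapping check and the CSV round trip on a three-row toy frame. The code `XXX` is a made-up unmapped country, and the `keep_default_na=False, na_values=[""]` combination is one possible way to read the file back, not necessarily the right choice for every column.

```python
import io
import pandas as pd

# Toy data (illustrative only); "XXX" is deliberately left out of the mapping.
toy = pd.DataFrame({"ISO3": ["JPN", "USA", "XXX"]})
region_map = {"JPN": "APAC", "USA": "NA"}

# Series.map leaves NaN wherever a key is missing from the dictionary.
toy["Region"] = toy["ISO3"].map(region_map)
unmapped = toy.loc[toy["Region"].isna(), "ISO3"].unique().tolist()
print("Unmapped:", unmapped)

# Round-trip through CSV text instead of a file on disk.
csv_text = toy.to_csv(index=False)

# Default parsing: the string "NA" is read as a missing value.
default_read = pd.read_csv(io.StringIO(csv_text))

# Treat only empty fields as missing, so "NA" survives as a region label.
safe_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(default_read["Region"].isna().sum(), safe_read["Region"].isna().sum())
```

With default parsing both the USA row and the unmapped row come back as missing. Renaming the label (for example to a code that is not on pandas' default NA list) would be another way around this.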


### Visualize HEE and HEP

In [50]:
# === Step 0: Install Dash if not already installed ===
try:
    import dash
except ImportError:
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "dash"])

# === Step 1: Import libraries ===
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

# === Step 2: Load and prepare data ===
df = pd.read_csv("../data/processed/hep_hee_results_with_region.csv")
df = df[df["income_level"].notnull()]  # Ensure income_level is present

# Get list of available years
available_years = sorted(df["Year"].unique())

# === Step 3: Build Dash App ===
app = Dash(__name__)
app.title = "HEP vs HEE by Income Level"

app.layout = html.Div([
    html.H2("HEP vs HEE Comparison by Income Level"),

    html.Div([
        html.Label("Select Year(s):"),
        dcc.Dropdown(
            id="year-dropdown",
            options=[{"label": str(y), "value": y} for y in available_years],
            value=[2022],  # default selected year(s)
            multi=True,
            clearable=False
        ),
    ], style={"width": "30%", "display": "inline-block", "paddingRight": "20px"}),

    html.Div([
        html.Label("Type ISO3 Country Codes (e.g., JPN, USA, THA):"),
        dcc.Input(
            id="iso3-input",
            type="text",
            placeholder="Leave blank to show all countries",
            debounce=True,
            style={"width": "100%"}
        )
    ], style={"width": "65%", "display": "inline-block"}),

    html.Br(), html.Br(),

    dcc.Graph(id="scatter-graph", style={"height": "700px"})
])

# === Step 4: Define callback ===
@app.callback(
    Output("scatter-graph", "figure"),
    Input("iso3-input", "value"),
    Input("year-dropdown", "value")
)
def update_figure(iso_input, selected_years):
    # Filter data by selected years
    dff = df[df["Year"].isin(selected_years)]

    # Optional ISO3 filter
    if iso_input:
        codes = [code.strip().upper() for code in iso_input.split(",")]
        dff = dff[dff["ISO3"].isin(codes)]

    # Plot using facet (1 subplot per year)
    fig = px.scatter(
        dff,
        x="HEP",
        y="HEE",
        color="income_level",
        text="ISO3",
        facet_col="Year",
        facet_col_wrap=2,
        labels={"HEP": "Performance", "HEE": "Efficiency"},
        title="HEP vs HEE by Year and Income Level",
        range_x=[0, 1],
        range_y=[0, 1],
        width=1000,
        height=650
    )

    fig.update_traces(marker=dict(size=10), textposition="top center")

    # Add diagonal line y = x for each subplot
    for i in range(1, len(selected_years) + 1):
        fig.add_shape(
            type="line",
            x0=0, y0=0, x1=1, y1=1,
            xref=f"x{i}", yref=f"y{i}",
            line=dict(dash="dash", color="gray")
        )

    return fig

# === Step 5: Run App ===
if __name__ == "__main__":
    app.run(debug=True)
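The filtering logic inside the callback can be exercised without starting the Dash server. The sketch below pulls the ISO3 parsing into a hypothetical helper, `parse_iso3_codes`, which is not part of the notebook, and additionally skips empty entries, so that input such as a trailing comma or a lone space shows all countries instead of none. The toy frame stands in for the real results file.

```python
import pandas as pd

def parse_iso3_codes(raw):
    """Turn 'jpn, usa ,tha' into ['JPN', 'USA', 'THA'], ignoring empty entries."""
    if not raw:
        return []
    return [code.strip().upper() for code in raw.split(",") if code.strip()]

def filter_rows(df, raw_codes, selected_years):
    """Apply the same year and ISO3 filters as the callback (hypothetical helper)."""
    out = df[df["Year"].isin(selected_years)]
    codes = parse_iso3_codes(raw_codes)
    if codes:  # an empty list means "show all countries"
        out = out[out["ISO3"].isin(codes)]
    return out

# Toy data (illustrative only).
toy = pd.DataFrame({
    "ISO3": ["JPN", "USA", "THA", "JPN"],
    "Year": [2022, 2022, 2022, 2021],
    "HEP": [0.9, 0.8, 0.7, 0.9],
    "HEE": [0.5, 0.3, 0.4, 0.5],
})

print(parse_iso3_codes("jpn, usa ,, tha,"))
print(filter_rows(toy, "jpn, tha", [2022]))
print(len(filter_rows(toy, " ", [2022])))
```

Keeping the parsing in a plain function also makes it straightforward to unit-test separately from the layout and plotting code.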
