# 🏥 Health Inequality Analysis

## Problem Statement
Health outcomes vary significantly across countries and income groups.
Quantifying these disparities is essential for prioritizing healthcare
investment and policy intervention.

This analysis investigates the relationship between economic capacity,
health spending, and population health outcomes.

## Policy Relevance
- Supports equitable healthcare resource allocation
- Highlights regions at high health risk
- Informs long-term public health planning

## Target Variables
- Life expectancy
- Mortality indicators
- Health expenditure
- Income / economic indicators

## Scope & Limitations
- Cross-country analysis (macro-level)
- Does not capture within-country inequality
- Data reflects reported national averages


## 🟦 Phase 1: Data Loading & Initial Validation (WHO GHE DALYs)

### Objective
Ingest health burden data from the WHO Global Health Estimates (GHE)
using the OData API and validate its schema and coverage.

This phase focuses strictly on:
- API-based ingestion
- Schema inspection
- Metric availability validation
- Saving raw, unmodified data

❌ No filtering  
❌ No aggregation  
❌ No interpretation  

All downstream analysis depends on first confirming that the
required dimensions (year, sex, age, cause) are available.
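
A dimension check of this kind could look like the minimal sketch below. Only `DIM_GHECAUSE_TITLE` appears elsewhere in this notebook; the other column names (`DIM_YEAR_CODE`, `DIM_SEX_CODE`, `DIM_AGEGROUP_CODE`) and the helper `missing_dimensions` are assumptions for illustration and would need to be replaced with whatever the inspected schema actually contains.

```python
# Mapping from logical dimension to expected column name.
# All names except DIM_GHECAUSE_TITLE are hypothetical placeholders.
REQUIRED_DIMENSIONS = {
    "year": "DIM_YEAR_CODE",
    "sex": "DIM_SEX_CODE",
    "age": "DIM_AGEGROUP_CODE",
    "cause": "DIM_GHECAUSE_TITLE",
}


def missing_dimensions(columns, required=REQUIRED_DIMENSIONS):
    """Return the logical dimensions whose expected column is absent."""
    present = set(columns)
    return sorted(name for name, col in required.items() if col not in present)


# Works on any iterable of column names, e.g. df.columns.
example_columns = ["DIM_GHECAUSE_TITLE", "VAL_DALY_RATE100K_NUMERIC", "DIM_YEAR_CODE"]
missing = missing_dimensions(example_columns)
print("Missing dimensions:", missing)
```

An empty result would indicate that every required dimension is present under the assumed names.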


In [None]:
from utils.path_setup import setup_project_path
PROJECT_ROOT = setup_project_path()

import requests
import pandas as pd
from pathlib import Path

from src.utils.logger import get_logger

logger = get_logger("domain2_phase1_health")

print("Project root:", PROJECT_ROOT)


In [None]:
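# NOTE: all three entries point at the same unfiltered GHE_FULL endpoint.
# No per-year filter is encoded in the URL, so the keys act as labels only.
WHO_GHE_APIS = {
    2000: "https://xmart-api-public.who.int/DEX_CMS/GHE_FULL?$format=json",
    2010: "https://xmart-api-public.who.int/DEX_CMS/GHE_FULL?$format=json",
    2021: "https://xmart-api-public.who.int/DEX_CMS/GHE_FULL?$format=json"
}

WHO_GHE_APIS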


In [None]:
import requests
import pandas as pd

HEADERS = {
    "User-Agent": "Mozilla/5.0 (EDA-MLOps-Portfolio)",
    "Accept": "application/json"
}

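def fetch_who_payload(url):
    """Fetch an OData JSON payload and return its "value" records as a DataFrame."""
    r = requests.get(url, headers=HEADERS, timeout=60)
    r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return pd.DataFrame(r.json()["value"])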


In [None]:
# WHO GHE API access is intentionally disabled due to consistent 400 responses.
# See markdown explanation below.
API_DISABLED = True


### ⚠️ WHO XMART API Limitation (Phase 1)

Direct programmatic access to the WHO XMART `GHE_FULL` endpoint
returns consistent `400 Bad Request` responses when called via
Python HTTP clients.

This issue persists across:
- Query parameter variations
- Header modifications
- Pagination attempts

Therefore:
- Phase 1 is limited to schema understanding and data planning
- Ingestion is deferred to Phase 2, which reads static CSV exports instead of calling the API
- This keeps the notebook reproducible and stable to execute
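
One way to express this fallback in code is sketched below. The helper `ingest_with_fallback` and the two stand-in callables are hypothetical and not part of the notebook; in practice `fetch_api` might wrap `fetch_who_payload` and `load_static` might wrap a `pd.read_csv` call. The broad `except Exception` is a simplification, and catching `requests.HTTPError` specifically would likely be preferable.

```python
def ingest_with_fallback(fetch_api, load_static, api_disabled=True):
    """Return (data, source), using the API only if enabled and successful."""
    if not api_disabled:
        try:
            return fetch_api(), "api"
        except Exception as exc:  # simplification; see the note above
            print(f"API fetch failed ({exc}); using static export instead")
    return load_static(), "static"


# Stand-in callables so the sketch runs without network access or files.
def failing_fetch():
    raise RuntimeError("400 Bad Request")


def static_loader():
    # Placeholder record, not real WHO data.
    return [{"DIM_GHECAUSE_TITLE": "example cause"}]


data, source = ingest_with_fallback(failing_fetch, static_loader, api_disabled=False)
print("Loaded from:", source)
```

With `api_disabled=True`, as the flag is currently set in this notebook, the API callable is never invoked.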


## 🟦 Phase 2: Dataset Overview & Coverage

This phase establishes **what data is available** for health burden analysis
before any interpretation or inequality assessment.

Focus:
- Record counts per year
- Cause coverage
- Metric availability
- Basic completeness checks

No analytical conclusions are drawn in this phase.
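
The four focus items could be collected in a single per-year table, roughly as in the sketch below. The column names are the ones used in this notebook. The helper `coverage_summary` and the tiny `demo` frame are illustrative assumptions, and the values are made up rather than taken from WHO data.

```python
import pandas as pd

DALY = "VAL_DALY_RATE100K_NUMERIC"
DEATHS = "VAL_DTHS_RATE100K_NUMERIC"


def coverage_summary(df):
    """Per-year record counts, distinct causes, and missing metric values."""
    by_year = df.groupby("Year")
    summary = pd.DataFrame({
        "records": by_year.size(),
        "unique_causes": by_year["DIM_GHECAUSE_TITLE"].nunique(),
    })
    for col in (DALY, DEATHS):
        # Count NaNs per year for each metric column.
        summary[f"missing_{col}"] = df[col].isna().groupby(df["Year"]).sum()
    return summary


# Made-up illustrative values, not WHO data.
demo = pd.DataFrame({
    "Year": [2000, 2000, 2021],
    "DIM_GHECAUSE_TITLE": ["Cause A", "Cause B", "Cause A"],
    DALY: [10.0, None, 7.5],
    DEATHS: [1.0, 0.5, 0.8],
})
summary = coverage_summary(demo)
print(summary)
```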


In [None]:
import pandas as pd
from pathlib import Path

# --- Paths ---
RAW_DIR = Path("datasets/raw/health")
YEARS = [2000, 2010, 2021]

# --- Read & combine ---
dfs = []
for year in YEARS:
    csv_file = RAW_DIR / f"ghe_dalys_india_{year}.csv"
    df = pd.read_csv(csv_file)
    df["Year"] = year  # add Year column
    dfs.append(df)

# Combine all years into a single DataFrame
df_combined = pd.concat(dfs, ignore_index=True)

# Optional: sort by Year and DALY
df_combined.sort_values(by=["Year", "VAL_DALY_RATE100K_NUMERIC"], ascending=[True, False], inplace=True)

# --- Save combined raw file ---
combined_csv = RAW_DIR / "ghe_dalys_india_combined.csv"
df_combined.to_csv(combined_csv, index=False)
print(f"Saved combined raw CSV: {combined_csv} | shape={df_combined.shape}")

# --- Quick check ---
print(df_combined.head(15))
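
The read loop above raises `FileNotFoundError` on the first absent export. A small pre-check, sketched below, could report every missing file at once. The helper name `missing_exports` is hypothetical; the filename pattern is the one used in the loop, and the temporary directory exists only to make the sketch runnable on its own.

```python
import tempfile
from pathlib import Path


def missing_exports(raw_dir, years):
    """List the expected per-year CSV paths that do not exist."""
    expected = [Path(raw_dir) / f"ghe_dalys_india_{year}.csv" for year in years]
    return [path for path in expected if not path.is_file()]


# Demonstration in a temporary directory holding only the 2000 export.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ghe_dalys_india_2000.csv").write_text("DIM_GHECAUSE_TITLE\n")
    missing_names = [p.name for p in missing_exports(tmp, [2000, 2010, 2021])]

print("Missing exports:", missing_names)
```

In the notebook this might be called as `missing_exports(RAW_DIR, YEARS)` before the loop runs.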


In [None]:
# ==============================
# PHASE 2: EDA for WHO DALY Dataset
# ==============================

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# --- Load combined raw CSV ---
RAW_DIR = Path("datasets/raw/health")
df = pd.read_csv(RAW_DIR / "ghe_dalys_india_combined.csv")

# --- Quick overview ---
print("Dataset shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nMissing values:\n", df.isnull().sum())
print("\nSample rows:\n", df.head())

# --- Summary statistics ---
print("\nSummary statistics:\n", df.describe())

# --- Check unique causes ---
print("\nUnique causes:", df["DIM_GHECAUSE_TITLE"].nunique())
print(df["DIM_GHECAUSE_TITLE"].unique())

# --- Visualizations ---

sns.set_style("whitegrid")

# 1. Top DALYs per year
plt.figure(figsize=(12,6))
sns.barplot(
    data=df,
    x="DIM_GHECAUSE_TITLE",
    y="VAL_DALY_RATE100K_NUMERIC",
    hue="Year"
)
plt.xticks(rotation=45, ha="right")
plt.title("Top 10 DALY Rates per Cause (India)")
plt.ylabel("DALY Rate per 100K")
plt.xlabel("Cause")
plt.legend(title="Year")
plt.tight_layout()
plt.show()

# 2. Death rate comparison per year
plt.figure(figsize=(12,6))
sns.barplot(
    data=df,
    x="DIM_GHECAUSE_TITLE",
    y="VAL_DTHS_RATE100K_NUMERIC",
    hue="Year"
)
plt.xticks(rotation=45, ha="right")
plt.title("Top 10 Death Rates per Cause (India)")
plt.ylabel("Death Rate per 100K")
plt.xlabel("Cause")
plt.legend(title="Year")
plt.tight_layout()
plt.show()

# 3. Trend of top 3 causes over years
top3_causes = df.groupby("DIM_GHECAUSE_TITLE")["VAL_DALY_RATE100K_NUMERIC"].mean().nlargest(3).index
plt.figure(figsize=(10,5))
for cause in top3_causes:
    # Sort by Year so the line is drawn in chronological order
    sub_df = df[df["DIM_GHECAUSE_TITLE"]==cause].sort_values("Year")
    plt.plot(sub_df["Year"], sub_df["VAL_DALY_RATE100K_NUMERIC"], marker='o', label=cause)
plt.title("Trend of Top 3 DALY Causes (2000-2021)")
plt.xlabel("Year")
plt.ylabel("DALY Rate per 100K")
plt.xticks([2000,2010,2021])
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

# --- Optional: correlation between DALY and Death rate ---
plt.figure(figsize=(6,5))
sns.scatterplot(
    data=df,
    x="VAL_DALY_RATE100K_NUMERIC",
    y="VAL_DTHS_RATE100K_NUMERIC",
    hue="Year",
    s=100
)
plt.title("DALY vs Death Rate per 100K")
plt.xlabel("DALY Rate per 100K")
plt.ylabel("Death Rate per 100K")
plt.tight_layout()
plt.show()
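
The scatter plot shows the relationship visually. A numeric companion could be the Pearson correlation within each year, as in the sketch below. The helper `rate_correlation_by_year` is an assumption for illustration, and the `demo` values are made up so that the result is easy to verify by eye.

```python
import pandas as pd

DALY = "VAL_DALY_RATE100K_NUMERIC"
DEATHS = "VAL_DTHS_RATE100K_NUMERIC"


def rate_correlation_by_year(df):
    """Pearson correlation between DALY and death rates within each year."""
    return pd.Series(
        {year: group[DALY].corr(group[DEATHS]) for year, group in df.groupby("Year")},
        name="pearson_r",
    )


# Made-up values: perfectly increasing in 2000, perfectly decreasing in 2021.
demo = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2021, 2021, 2021],
    DALY: [100.0, 200.0, 300.0, 90.0, 180.0, 270.0],
    DEATHS: [1.0, 2.0, 3.0, 3.0, 2.0, 1.0],
})
r_by_year = rate_correlation_by_year(demo)
print(r_by_year)
```

With only a handful of causes per year, any such coefficient would be descriptive rather than evidence of a general relationship.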
