# **Global Healthcare Equality Analysis**

This notebook explores healthcare equality around the world by analyzing:
- Life expectancy trends
- Disease rate reductions in different economic contexts
- Cancer prevalence worldwide
- Correlation between cancer rates and screening availability

## **Data Sources:**
- WHO Global Health Observatory
- Kaggle Datasets
- WHO GHO OData API (cancer-related indicators)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import duckdb as db
import requests
from typing import List

%matplotlib inline
%config InlineBackend.figure_format = 'retina'


## **1. Data Loading & Cleaning**
We'll load datasets related to **cancer rates, life expectancy, and disease prevalence.**

In [2]:
df_cancer_reg = pd.read_csv("cancer_reg.csv")  # TODO: add the link to the source Kaggle dataset
df_avg_household_size = pd.read_csv("avg-household-size.csv")  # TODO: add the link to the source Kaggle dataset
df_life_expectancy = pd.read_csv("life_expectancy_data.csv")  # TODO: add the link to the source Kaggle dataset

In [3]:
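from typing import Optional

def get_api_data(url: str, timeout: float = 30.0) -> Optional[dict]:
    """Fetch JSON from `url`; return None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)  # avoid hanging on a stalled connection
        response.raise_for_status()  # Raise exception for bad status codes
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data from {url}: {e}")
        return None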

In [4]:
# Cancer-related indicators
def get_cancer_indicators() -> pd.DataFrame:
    url = "https://ghoapi.azureedge.net/api/Indicator?$filter=contains(IndicatorName,'cancer')"
    data = get_api_data(url)
    
    if data and 'value' in data:
        # Convert 'value' list to DataFrame
        df = pd.DataFrame(data['value'])
        return df
    return pd.DataFrame()

In [5]:
# Fetch the data for a specific indicator code and return it as a DataFrame
def get_indicator_data(indicator_code: str) -> pd.DataFrame:
    url = f"https://ghoapi.azureedge.net/api/{indicator_code}"
    data = get_api_data(url)
    
    if data and 'value' in data:
        df = pd.DataFrame(data['value'])
        return df
    return pd.DataFrame()
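
The GHO responses arrive as an OData payload whose rows sit under the `value` key, and most of the columns are empty dimension slots. As a minimal sketch (the helper name `tidy_indicator_frame`, the column selection, and the sample payload are assumptions for illustration, not part of the original notebook), the rows can be trimmed to a few useful columns and `NumericValue` coerced to a numeric dtype:

```python
import pandas as pd

# Hypothetical column selection; only columns that actually exist are kept.
KEEP_COLUMNS = ["IndicatorCode", "SpatialDim", "ParentLocation",
                "TimeDim", "Value", "NumericValue"]

def tidy_indicator_frame(payload):
    """Turn a GHO-style payload ({'value': [...]}) into a slimmer DataFrame."""
    rows = payload.get("value", []) if payload else []
    df = pd.DataFrame(rows)
    if df.empty:
        return df
    present = [col for col in KEEP_COLUMNS if col in df.columns]
    df = df[present].copy()
    if "NumericValue" in df.columns:
        # Non-numeric or missing entries become NaN instead of raising.
        df["NumericValue"] = pd.to_numeric(df["NumericValue"], errors="coerce")
    return df

# Made-up payload, shaped like an API response, for illustration only.
sample_payload = {
    "value": [
        {"IndicatorCode": "EXAMPLE_CODE", "SpatialDim": "AAA", "ParentLocation": "Africa",
         "TimeDim": 2013, "Value": "No data received", "NumericValue": None, "Comments": None},
        {"IndicatorCode": "EXAMPLE_CODE", "SpatialDim": "BBB", "ParentLocation": "Americas",
         "TimeDim": 2013, "Value": "12.5", "NumericValue": "12.5", "Comments": None},
    ]
}
tidy = tidy_indicator_frame(sample_payload)
print(tidy)
```

Because the helper tolerates `None`, it can be chained directly onto `get_api_data(url)` without an extra check.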

### **Previewing Data**
Let's take a look at the first few rows of each dataset.

In [6]:
cancer_indicators_df = get_cancer_indicators()
indicator_codes = cancer_indicators_df['IndicatorCode'].tolist()

indicator_df = [get_indicator_data(indicator) for indicator in indicator_codes]


In [7]:
indicator_df[0].iloc[:, 5:].head()

Unnamed: 0,TimeDimType,ParentLocation,Dim1Type,TimeDim,Dim1,Dim2Type,Dim2,Dim3Type,Dim3,DataSourceDimType,DataSourceDim,Value,NumericValue,Low,High,Comments,Date,TimeDimensionValue,TimeDimensionBegin,TimeDimensionEnd
0,YEAR,Africa,,2013,,,,,,,,No data received,,,,,2015-06-01T13:06:28+02:00,2013,2013-01-01T00:00:00+01:00,2013-12-31T00:00:00+01:00
1,YEAR,Americas,,2013,,,,,,,,Yes,,,,,2015-06-01T13:06:18+02:00,2013,2013-01-01T00:00:00+01:00,2013-12-31T00:00:00+01:00
2,YEAR,Eastern Mediterranean,,2013,,,,,,,,Don't know,,,,Country responded Don't know.,2015-06-01T13:06:27+02:00,2013,2013-01-01T00:00:00+01:00,2013-12-31T00:00:00+01:00
3,YEAR,Western Pacific,,2013,,,,,,,,Yes,,,,,2015-06-01T13:06:23+02:00,2013,2013-01-01T00:00:00+01:00,2013-12-31T00:00:00+01:00
4,YEAR,Americas,,2013,,,,,,,,Yes,,,,,2015-06-01T13:06:18+02:00,2013,2013-01-01T00:00:00+01:00,2013-12-31T00:00:00+01:00
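
The preview shows that `Value` holds survey-style answers ("Yes", "Don't know", "No data received") rather than numbers, so a count per region is more informative than a mean. A minimal sketch using `pd.crosstab`, with the five preview rows retyped by hand (the helper name `tally_responses` is an assumption):

```python
import pandas as pd

def tally_responses(df, region_col="ParentLocation", value_col="Value"):
    """Count how often each answer occurs within each region."""
    return pd.crosstab(df[region_col], df[value_col])

# The five rows from the preview above, retyped for illustration.
preview_like = pd.DataFrame({
    "ParentLocation": ["Africa", "Americas", "Eastern Mediterranean",
                       "Western Pacific", "Americas"],
    "Value": ["No data received", "Yes", "Don't know", "Yes", "Yes"],
})
response_table = tally_responses(preview_like)
print(response_table)
```

Applied to a full indicator frame, for example `tally_responses(indicator_df[0])`, the same call would give a region-by-answer table, provided the frame contains both columns.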


In [8]:
df_avg_household_size.head(10)

Unnamed: 0,statefips,countyfips,avghouseholdsize,geography
0,2,13,2.43,"Aleutians East Borough, Alaska"
1,2,16,3.59,"Aleutians West Census Area, Alaska"
2,2,20,2.77,"Anchorage Municipality, Alaska"
3,2,50,3.86,"Bethel Census Area, Alaska"
4,2,60,2.5,"Bristol Bay Borough, Alaska"
5,2,68,2.34,"Denali Borough, Alaska"
6,2,70,3.56,"Dillingham Census Area, Alaska"
7,2,90,2.69,"Fairbanks North Star Borough, Alaska"
8,2,100,2.12,"Haines Borough, Alaska"
9,2,105,2.26,"Hoonah-Angoon Census Area, Alaska"


In [9]:
df_cancer_reg.head(10)

Unnamed: 0,avganncount,avgdeathsperyear,target_deathrate,incidencerate,medincome,popest2015,povertypercent,studypercap,binnedinc,medianage,...,pctprivatecoveragealone,pctempprivcoverage,pctpubliccoverage,pctpubliccoveragealone,pctwhite,pctblack,pctasian,pctotherrace,pctmarriedhouseholds,birthrate
0,1397.0,469,164.9,489.8,61898,260131,11.2,499.748204,"(61494.5, 125635]",39.3,...,,41.6,32.9,14.0,81.780529,2.594728,4.821857,1.843479,52.856076,6.118831
1,173.0,70,161.3,411.6,48127,43269,18.6,23.111234,"(48021.6, 51046.4]",33.0,...,53.8,43.6,31.1,15.3,89.228509,0.969102,2.246233,3.741352,45.3725,4.333096
2,102.0,50,174.7,349.7,49348,21026,14.6,47.560164,"(48021.6, 51046.4]",45.0,...,43.5,34.9,42.1,21.1,90.92219,0.739673,0.465898,2.747358,54.444868,3.729488
3,427.0,202,194.8,430.4,44243,75882,17.1,342.637253,"(42724.4, 45201]",42.8,...,40.3,35.0,45.3,25.0,91.744686,0.782626,1.161359,1.362643,51.021514,4.603841
4,57.0,26,144.4,350.1,49955,10321,12.5,0.0,"(48021.6, 51046.4]",48.3,...,43.9,35.1,44.0,22.7,94.104024,0.270192,0.66583,0.492135,54.02746,6.796657
5,428.0,152,176.0,505.4,52313,61023,15.6,180.259902,"(51046.4, 54545.6]",45.4,...,38.8,32.6,43.2,20.2,84.882631,1.653205,1.538057,3.314635,51.22036,4.964476
6,250.0,97,175.9,461.8,37782,41516,23.2,0.0,"(37413.8, 40362.7]",42.6,...,35.0,28.3,46.4,28.7,75.106455,0.616955,0.866157,8.356721,51.0139,4.204317
7,146.0,71,183.6,404.0,40189,20848,17.8,0.0,"(37413.8, 40362.7]",51.7,...,33.1,25.9,50.9,24.1,89.406636,0.305159,1.889077,2.286268,48.967033,5.889179
8,88.0,36,190.5,459.4,42579,13088,22.3,0.0,"(40362.7, 42724.4]",49.3,...,37.8,29.9,48.1,26.6,91.787477,0.185071,0.208205,0.616903,53.446998,5.587583
9,4025.0,1380,177.8,510.9,60397,843954,13.1,427.748432,"(54545.6, 61494.5]",35.8,...,,44.4,31.4,16.5,74.729668,6.710854,6.041472,2.699184,50.063573,5.53343


In [10]:
# Summarize the shape and column names of each loaded dataset.
dataset_info = {
    name: {
        "Rows": df.shape[0],
        "Columns": df.shape[1],
        "Column Names": df.columns.tolist(),
    }
    for name, df in {
        "df_avg_household_size": df_avg_household_size,
        "df_cancer_reg": df_cancer_reg,
        "df_life_expectancy": df_life_expectancy,
    }.items()
}

## **2. Life Expectancy Trends (2000-2015)**

In [11]:
conn = db.connect(database='cancer_data.ddb', read_only=False)
try:
    # DuckDB resolves in-scope DataFrame variables by name; interpolating {df} would paste its repr into the SQL.
    conn.execute("CREATE TABLE IF NOT EXISTS Life_Expectancy_Data AS SELECT * FROM df_life_expectancy")
    conn.execute("CREATE TABLE IF NOT EXISTS cancer_reg AS SELECT * FROM df_cancer_reg")
    for code, df in zip(indicator_codes, indicator_df):
        if not df.empty:  # failed API calls yield empty frames with no columns
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{code}" AS SELECT * FROM df')
finally:
    conn.close()  # always release the file so it can be reopened read-only

In [12]:
conn = db.connect(database='cancer_data.ddb', read_only=True)
df_cancer_reg = conn.execute("SELECT * FROM cancer_reg").df()
conn.close()

In [None]:
print(f"""
df_avg_household_size.shape: {df_avg_household_size.shape}
df_cancer_reg.shape: {df_cancer_reg.shape}
df_life_expectancy.shape: {df_life_expectancy.shape}
""")


df_avg_household_size.shape: (3220, 4)
df_cancer_reg.shape: (3047, 33)
df_life_expectancy.shape: (2938, 22)



In [None]:
df_life_expectancy.isnull().sum(), df_life_expectancy.shape

(Country                              0
 Year                                 0
 Status                               0
 Life expectancy                     10
 Adult Mortality                     10
 infant deaths                        0
 Alcohol                            194
 percentage expenditure               0
 Hepatitis B                        553
 Measles                              0
  BMI                                34
 under-five deaths                    0
 Polio                               19
 Total expenditure                  226
 Diphtheria                          19
  HIV/AIDS                            0
 GDP                                448
 Population                         652
  thinness  1-19 years               34
  thinness 5-9 years                 34
 Income composition of resources    167
 Schooling                          163
 dtype: int64,
 (2938, 22))
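
The counts above are easier to compare as shares of the 2,938 rows: `Population`, for instance, is missing in 652 rows, or about 22%. Several column names also carry stray leading or trailing spaces. A minimal sketch (the helper name `missing_report` and the demo frame are assumptions) that normalizes names on a copy and reports only the columns with gaps:

```python
import pandas as pd

def missing_report(df):
    """Missing count and percentage per column, with whitespace-normalized names."""
    # rename() returns a copy, so the original frame keeps its column names.
    cleaned = df.rename(columns=lambda name: " ".join(str(name).split()))
    missing = cleaned.isnull().sum()
    report = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(cleaned) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Made-up frame that mimics the stray spaces in the real column names.
demo = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    " BMI ": [20.1, None, 22.3, None],
    "Life expectancy ": [65.0, 70.2, None, 71.0],
})
report = missing_report(demo)
print(report)
```

The report works on a renamed copy, so later cells that refer to the original names (including the trailing space in `"Life expectancy "`) keep working.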

### Analyzing global life expectancy growth.

In [None]:
df_life_expectancy[df_life_expectancy.isnull().any(axis=1)].sort_values(by=['Country','Year'], ascending=False)

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
2922,Zimbabwe,2015,Developing,67.0,336.0,22,,0.000000,87.0,0,...,88.0,,87.0,6.2,118.693830,15777451.0,5.6,5.5,0.507,10.3
2906,Zambia,2015,Developing,61.8,33.0,27,,0.000000,9.0,9,...,9.0,,9.0,4.1,1313.889646,161587.0,6.3,6.1,0.576,12.5
2917,Zambia,2004,Developing,47.9,578.0,36,2.46,8.369852,,35,...,84.0,7.33,83.0,17.6,53.277222,11731746.0,7.2,7.1,0.456,10.5
2918,Zambia,2003,Developing,46.4,64.0,39,2.33,65.789974,,881,...,85.0,8.18,83.0,18.2,429.158343,11421984.0,7.3,7.2,0.443,10.2
2919,Zambia,2002,Developing,45.5,69.0,41,2.44,54.043480,,25036,...,85.0,6.93,84.0,18.4,377.135244,111249.0,7.4,7.3,0.433,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32,Algeria,2015,Developing,75.6,19.0,21,,0.000000,95.0,63,...,95.0,,95.0,0.1,4132.762920,39871528.0,6.0,5.8,0.743,14.4
44,Algeria,2003,Developing,71.7,146.0,20,0.34,25.018523,,15374,...,87.0,3.60,87.0,0.1,294.335560,3243514.0,6.3,6.1,0.663,11.5
45,Algeria,2002,Developing,71.6,145.0,20,0.36,148.511984,,5862,...,86.0,3.73,86.0,0.1,1774.336730,3199546.0,6.3,6.2,0.653,11.1
46,Algeria,2001,Developing,71.4,145.0,20,0.23,147.986071,,2686,...,89.0,3.84,89.0,0.1,1732.857979,31592153.0,6.4,6.3,0.644,10.9


In [None]:
df_life_expectancy.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
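
One way to turn these rows into the trend this section is about is to average life expectancy per year, split by `Status`. The sketch below is an assumption about how that could be done, not the notebook's own method; the helper name and demo values are made up, and the column name keeps its trailing space as it appears in the data.

```python
import pandas as pd

def life_expectancy_trend(df, value_col="Life expectancy "):
    """Mean life expectancy per year, one column per development status."""
    return (
        df.dropna(subset=[value_col])
          .groupby(["Year", "Status"])[value_col]
          .mean()
          .unstack("Status")
          .sort_index()
    )

# Made-up values for illustration only.
demo = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2015, 2000, 2015],
    "Status": ["Developing", "Developing", "Developed", "Developed"],
    "Life expectancy ": [50.0, 60.0, 75.0, 80.0],
})
trend = life_expectancy_trend(demo)
gain = trend.iloc[-1] - trend.iloc[0]  # change from first to last year
print(trend)
print(gain)
```

With the real data, `life_expectancy_trend(df_life_expectancy)` would give one row per year, which can then be passed to a line chart.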


## **3. Disease Rate Decrease (Developing vs Developed)**

In [13]:
# Scatter plot - Median income vs Cancer Incidence Rate
fig4 = px.scatter(df_cancer_reg, 
                  x='medincome', 
                  y='incidencerate', 
                  title='Cancer Incidence Rate vs Median Income',
                  labels={'medincome': 'Median Income', 'incidencerate': 'Cancer Incidence Rate'},
                  color='incidencerate',
                  color_continuous_scale='Viridis')
fig4.show()


In [None]:
fig5 = px.box(df_cancer_reg, 
              x='binnedinc', 
              y='target_deathrate', 
              title='Cancer Death Rate Distribution by Income Level',
              labels={'binnedinc': 'Income Bins', 'target_deathrate': 'Cancer Death Rate'},
              color='binnedinc')
fig5.show()


### How fast are diseases declining?
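
A rough way to answer this is to compare the first and last year of a disease column, separately for developing and developed countries. The sketch below assumes a frame with `Year` and `Status` columns like `df_life_expectancy`; the helper name, the demo numbers, and the exact spelling of the disease column (the real names may carry stray spaces) are assumptions.

```python
import pandas as pd

def percent_change_by_status(df, value_col, year_col="Year", status_col="Status"):
    """Percent change of the yearly mean between the first and last year."""
    yearly = df.groupby([status_col, year_col])[value_col].mean().unstack(year_col)
    first, last = yearly.columns.min(), yearly.columns.max()
    change = (yearly[last] - yearly[first]) / yearly[first] * 100
    return change.rename(f"pct_change_{first}_{last}")

# Made-up case counts for illustration only.
demo = pd.DataFrame({
    "Status": ["Developing", "Developing", "Developed", "Developed"],
    "Year": [2000, 2015, 2000, 2015],
    "Measles": [200.0, 100.0, 10.0, 8.0],
})
decline = percent_change_by_status(demo, "Measles")
print(decline)
```

Negative values indicate a decline. Averaging raw case counts across countries ignores population size, so a per-capita rate would be a sturdier basis for the comparison.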

## **4. Cancer Rates Around the World**

In [16]:
# Bar chart - Average deaths per year vs median age (median ages above 100 are excluded)
fig7 = px.bar(df_cancer_reg[df_cancer_reg['medianage'] <= 100].sort_values(by='medianage'),
              x='medianage',
              y='avgdeathsperyear',
              title='Average Deaths per Year vs Median Age (0-100)',
              labels={'medianage': 'Median Age', 'avgdeathsperyear': 'Average Deaths per Year'})
fig7.show()


### Geographic distribution of cancer cases.
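
The household-size preview shows `geography` values of the form "County, State". Assuming `df_cancer_reg` carries a `geography` column in the same format (it is hidden behind the `...` in the preview above, so this is an assumption), county rows could be rolled up to state level as a first look at geographic spread. The helper name and demo values are made up.

```python
import pandas as pd

def mean_by_state(df, value_col, geography_col="geography"):
    """Average a county-level column per state, taken from 'County, State' labels."""
    state = (
        df[geography_col]
        .str.rsplit(",", n=1)   # split on the last comma only
        .str[-1]
        .str.strip()
        .rename("state")
    )
    return df.groupby(state)[value_col].mean().sort_values(ascending=False)

# Made-up counties and rates for illustration only.
demo = pd.DataFrame({
    "geography": ["County A, State X", "County B, State X", "County C, State Y"],
    "target_deathrate": [140.0, 160.0, 180.0],
})
state_rates = mean_by_state(demo, "target_deathrate")
print(state_rates)
```

This is an unweighted mean over counties; weighting by a population column such as `popest2015` would give large counties their proper influence.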

## **5. Cancer Reduction in Developing vs Developed Countries**

In [18]:
fig2 = px.scatter(df_life_expectancy, x="GDP", y="Life expectancy ", 
                  title="Life Expectancy vs. GDP",
                  labels={"GDP": "Gross Domestic Product", "Life expectancy ": "Life Expectancy"},
                  trendline="ols")
fig2.show()

In [31]:
fig3 = px.scatter(df_cancer_reg, x="pctpubliccoverage", y="target_deathrate", 
                  title="Public Health Coverage vs. Cancer Death Rate",
                  labels={"pctpubliccoverage": "Public Health Coverage (%)",
                          "target_deathrate": "Cancer Death Rate",
                          "pctprivatecoverage": "Private Health Coverage (%)"},
                  color="pctprivatecoverage", trendline='ols')
fig3.show()


### Are we making progress against cancer?

## **6. Country-Specific Cancer Trends**

### Do socioeconomic factors contribute to cancer prevalence?

In [24]:
fig5 = px.scatter(df_cancer_reg, x="povertypercent", y="incidencerate", 
                  title="Poverty Rate vs. Cancer Incidence",
                  labels={"povertypercent": "Poverty Rate (%)", "incidencerate": "Cancer Incidence Rate"},
                  trendline="ols")
fig5.show()

## **7. Correlation Between Cancer Death Rates & Screening Availability**

### Does screening save lives?
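
The county dataset has no direct screening measure, so one possible starting point is to use the health-coverage columns as a rough proxy and check how each correlates with `target_deathrate`. The sketch below is a hedged illustration under that assumption; the helper name and demo values are made up, and a correlation on its own says nothing about cause.

```python
import pandas as pd

def correlate_with_target(df, feature_cols, target_col="target_deathrate"):
    """Pearson correlation of each available feature column with the target."""
    present = [col for col in feature_cols if col in df.columns]
    return df[present].corrwith(df[target_col]).sort_values()

# Made-up, perfectly linear values for illustration only.
demo = pd.DataFrame({
    "pctpubliccoverage": [30.0, 35.0, 40.0, 45.0],
    "pctprivatecoverage": [70.0, 65.0, 60.0, 55.0],
    "target_deathrate": [160.0, 170.0, 180.0, 190.0],
})
correlations = correlate_with_target(
    demo, ["pctpubliccoverage", "pctprivatecoverage", "not_a_column"]
)
print(correlations)
```

Columns missing from the frame are skipped, so the same call can be tried on `df_cancer_reg` with whichever coverage columns it contains.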

## **8. Conclusion: Is Healthcare Equal Worldwide?**

### Bringing together all insights.