# Distribution of diseases in the world 🩺

In this notebook we will study the distribution of diseases around the world in 2024.

Dataset source: [Global health statistics](https://www.kaggle.com/datasets/malaiarasugraj/global-health-statistics/data)

## Study goal:

- What are the most common diseases in each region of the world?
- Are there significant differences in the prevalence of disease between developed and developing countries?
- What is the average cost of treatment for different illnesses, and how does this vary from country to country?

In [86]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Render Plotly figures inside VS Code and use the dark theme by default
pio.renderers.default = 'vscode'
px.defaults.template = "plotly_dark"

Columns in the dataset:
- **Country:** The name of the country where the health data was recorded.
- **Year:** The year in which the data was collected.
- **Disease Name:** The name of the disease or health condition tracked.
- **Disease Category:** The category of the disease (e.g., Infectious, Non-Communicable).
- **Prevalence Rate (%):** The percentage of the population affected by the disease.
- **Incidence Rate (%):** The percentage of new or newly diagnosed cases.
- **Mortality Rate (%):** The percentage of the affected population that dies from the disease.
- **Age Group:** The age range most affected by the disease.
- **Gender:** The gender(s) affected by the disease (Male, Female, Both).
- **Population Affected:** The total number of individuals affected by the disease.
- **Healthcare Access (%):** The percentage of the population with access to healthcare.
- **Doctors per 1000:** The number of doctors per 1000 people.
- **Hospital Beds per 1000:** The number of hospital beds available per 1000 people.
- **Treatment Type:** The primary treatment method for the disease (e.g., Medication, Surgery).
- **Average Treatment Cost (USD):** The average cost of treating the disease in USD.
- **Availability of Vaccines/Treatment:** Whether vaccines or treatments are available.
- **Recovery Rate (%):** The percentage of people who recover from the disease.
- **DALYs:** Disability-Adjusted Life Years, a measure of disease burden.
- **Improvement in 5 Years (%):** The improvement in disease outcomes over the last five years.
- **Per Capita Income (USD):** The average income per person in the country.
- **Education Index:** The average level of education in the country.
- **Urbanization Rate (%):** The percentage of the population living in urban areas.

In [87]:
# import kagglehub

# # Download latest version
# path = kagglehub.dataset_download("malaiarasugraj/global-health-statistics")

# print("Path to dataset files:", path)

In [88]:
df = pd.read_csv("data/Global Health Statistics.csv")
df

Unnamed: 0,Country,Year,Disease Name,Disease Category,Prevalence Rate (%),Incidence Rate (%),Mortality Rate (%),Age Group,Gender,Population Affected,...,Hospital Beds per 1000,Treatment Type,Average Treatment Cost (USD),Availability of Vaccines/Treatment,Recovery Rate (%),DALYs,Improvement in 5 Years (%),Per Capita Income (USD),Education Index,Urbanization Rate (%)
0,Italy,2013,Malaria,Respiratory,0.95,1.55,8.42,0-18,Male,471007,...,7.58,Medication,21064,No,91.82,4493,2.16,16886,0.79,86.02
1,France,2002,Ebola,Parasitic,12.46,8.63,8.75,61+,Male,634318,...,5.11,Surgery,47851,Yes,76.65,2366,4.82,80639,0.74,45.52
2,Turkey,2015,COVID-19,Genetic,0.91,2.35,6.22,36-60,Male,154878,...,3.49,Vaccination,27834,Yes,98.55,41,5.81,12245,0.41,40.20
3,Indonesia,2011,Parkinson's Disease,Autoimmune,4.68,6.29,3.99,0-18,Other,446224,...,8.44,Surgery,144,Yes,67.35,3201,2.22,49336,0.49,58.47
4,Italy,2013,Tuberculosis,Genetic,0.83,13.59,7.01,61+,Male,472908,...,5.90,Medication,8908,Yes,50.06,2832,6.93,47701,0.50,48.14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,Saudi Arabia,2021,Parkinson's Disease,Infectious,4.56,4.83,9.65,0-18,Female,119332,...,4.23,Vaccination,4528,Yes,92.11,1024,3.88,29335,0.75,27.94
999996,Saudi Arabia,2013,Malaria,Respiratory,0.26,1.76,0.56,0-18,Female,354927,...,6.34,Surgery,20686,No,84.47,202,7.95,30752,0.47,77.66
999997,USA,2016,Zika,Respiratory,13.44,14.13,1.91,19-35,Other,807915,...,8.11,Therapy,18807,No,86.81,3338,7.31,62897,0.72,46.90
999998,Nigeria,2020,Asthma,Chronic,1.96,14.56,4.98,61+,Female,385896,...,6.91,Medication,21033,Yes,62.15,4806,3.82,98189,0.51,34.73


In [89]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 22 columns):
 #   Column                              Non-Null Count    Dtype  
---  ------                              --------------    -----  
 0   Country                             1000000 non-null  object 
 1   Year                                1000000 non-null  int64  
 2   Disease Name                        1000000 non-null  object 
 3   Disease Category                    1000000 non-null  object 
 4   Prevalence Rate (%)                 1000000 non-null  float64
 5   Incidence Rate (%)                  1000000 non-null  float64
 6   Mortality Rate (%)                  1000000 non-null  float64
 7   Age Group                           1000000 non-null  object 
 8   Gender                              1000000 non-null  object 
 9   Population Affected                 1000000 non-null  int64  
 10  Healthcare Access (%)               1000000 non-null  float64
 11  Doctors per 

In [90]:
df.describe()

Unnamed: 0,Year,Prevalence Rate (%),Incidence Rate (%),Mortality Rate (%),Population Affected,Healthcare Access (%),Doctors per 1000,Hospital Beds per 1000,Average Treatment Cost (USD),Recovery Rate (%),DALYs,Improvement in 5 Years (%),Per Capita Income (USD),Education Index,Urbanization Rate (%)
count,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0
mean,2011.996999,10.047992,7.555005,5.049919,500735.427363,74.987835,2.747929,5.245931,25010.313665,74.496934,2499.144809,5.002593,50311.099835,0.650069,54.985212
std,7.217287,5.740189,4.298947,2.859427,288660.116648,14.436345,1.299067,2.742865,14402.279227,14.155168,1443.923798,2.888298,28726.959359,0.144472,20.214042
min,2000.0,0.1,0.1,0.1,1000.0,50.0,0.5,0.5,100.0,50.0,1.0,0.0,500.0,0.4,20.0
25%,2006.0,5.09,3.84,2.58,250491.25,62.47,1.62,2.87,12538.0,62.22,1245.0,2.5,25457.0,0.53,37.47
50%,2012.0,10.04,7.55,5.05,501041.0,75.0,2.75,5.24,24980.0,74.47,2499.0,5.0,50372.0,0.65,54.98
75%,2018.0,15.01,11.28,7.53,750782.0,87.49,3.87,7.62,37493.0,86.78,3750.0,7.51,75195.0,0.78,72.51
max,2024.0,20.0,15.0,10.0,1000000.0,100.0,5.0,10.0,50000.0,99.0,5000.0,10.0,100000.0,0.9,90.0


In [91]:
df.isna().sum()

Country                               0
Year                                  0
Disease Name                          0
Disease Category                      0
Prevalence Rate (%)                   0
Incidence Rate (%)                    0
Mortality Rate (%)                    0
Age Group                             0
Gender                                0
Population Affected                   0
Healthcare Access (%)                 0
Doctors per 1000                      0
Hospital Beds per 1000                0
Treatment Type                        0
Average Treatment Cost (USD)          0
Availability of Vaccines/Treatment    0
Recovery Rate (%)                     0
DALYs                                 0
Improvement in 5 Years (%)            0
Per Capita Income (USD)               0
Education Index                       0
Urbanization Rate (%)                 0
dtype: int64

In [92]:
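# Keep only the 2024 records; .copy() gives an independent DataFrame,
# so columns can be added later without modifying a slice of df
df_24 = df[df["Year"] == 2024].copy()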

In [93]:
df_24

Unnamed: 0,Country,Year,Disease Name,Disease Category,Prevalence Rate (%),Incidence Rate (%),Mortality Rate (%),Age Group,Gender,Population Affected,...,Hospital Beds per 1000,Treatment Type,Average Treatment Cost (USD),Availability of Vaccines/Treatment,Recovery Rate (%),DALYs,Improvement in 5 Years (%),Per Capita Income (USD),Education Index,Urbanization Rate (%)
15,USA,2024,Dengue,Viral,1.82,2.98,7.01,36-60,Female,623844,...,9.72,Medication,4664,No,87.75,484,8.11,56179,0.63,29.00
54,Argentina,2024,Cancer,Parasitic,10.94,12.88,6.25,61+,Female,419192,...,5.36,Medication,16363,Yes,72.52,1274,5.36,9633,0.51,37.11
69,Australia,2024,Measles,Genetic,5.88,12.37,4.73,36-60,Other,767034,...,6.73,Surgery,33829,No,75.40,2921,7.85,50348,0.87,56.51
79,Brazil,2024,Diabetes,Genetic,0.73,9.28,1.09,19-35,Male,4233,...,3.10,Therapy,2078,No,61.80,386,3.23,73287,0.74,25.44
109,Saudi Arabia,2024,Malaria,Chronic,13.40,11.44,9.59,61+,Other,259325,...,5.84,Surgery,14273,Yes,90.56,4937,7.20,78046,0.64,70.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999968,USA,2024,COVID-19,Infectious,10.56,8.92,3.10,61+,Other,509069,...,8.08,Therapy,25536,No,72.12,1448,5.01,69574,0.55,84.37
999969,Japan,2024,Diabetes,Neurological,15.19,2.93,7.38,19-35,Male,481945,...,6.31,Medication,11078,Yes,94.54,3645,2.29,24648,0.64,89.87
999971,Saudi Arabia,2024,Influenza,Genetic,3.67,1.80,0.79,19-35,Other,85928,...,7.15,Surgery,32179,No,51.51,3847,9.96,64922,0.65,46.94
999987,Mexico,2024,Hypertension,Respiratory,0.50,10.19,8.58,61+,Other,38627,...,3.28,Therapy,44642,No,70.08,834,2.57,87011,0.88,53.96


In [94]:
df_24['Country'].unique()

array(['USA', 'Argentina', 'Australia', 'Brazil', 'Saudi Arabia',
       'South Korea', 'Mexico', 'China', 'Japan', 'Turkey', 'France',
       'India', 'Nigeria', 'UK', 'Germany', 'Indonesia', 'Russia',
       'South Africa', 'Canada', 'Italy'], dtype=object)

In [95]:
df_24['Age Group'].unique()

array(['36-60', '61+', '19-35', '0-18'], dtype=object)

In [96]:
df_24['Age Group labels'] = df_24['Age Group'].replace(
{'0-18':'children and teens',
 '19-35':'young adults',
 '36-60':'adults',
 '61+':'seniors'})

df_24.head()

Unnamed: 0,Country,Year,Disease Name,Disease Category,Prevalence Rate (%),Incidence Rate (%),Mortality Rate (%),Age Group,Gender,Population Affected,...,Treatment Type,Average Treatment Cost (USD),Availability of Vaccines/Treatment,Recovery Rate (%),DALYs,Improvement in 5 Years (%),Per Capita Income (USD),Education Index,Urbanization Rate (%),Age Group labels
15,USA,2024,Dengue,Viral,1.82,2.98,7.01,36-60,Female,623844,...,Medication,4664,No,87.75,484,8.11,56179,0.63,29.0,adults
54,Argentina,2024,Cancer,Parasitic,10.94,12.88,6.25,61+,Female,419192,...,Medication,16363,Yes,72.52,1274,5.36,9633,0.51,37.11,seniors
69,Australia,2024,Measles,Genetic,5.88,12.37,4.73,36-60,Other,767034,...,Surgery,33829,No,75.4,2921,7.85,50348,0.87,56.51,adults
79,Brazil,2024,Diabetes,Genetic,0.73,9.28,1.09,19-35,Male,4233,...,Therapy,2078,No,61.8,386,3.23,73287,0.74,25.44,young adults
109,Saudi Arabia,2024,Malaria,Chronic,13.4,11.44,9.59,61+,Other,259325,...,Surgery,14273,Yes,90.56,4937,7.2,78046,0.64,70.02,seniors


In [97]:
df_24['Disease Name'].unique()

array(['Dengue', 'Cancer', 'Measles', 'Diabetes', 'Malaria',
       "Parkinson's Disease", 'Rabies', 'Hypertension', 'Zika', 'Leprosy',
       'Tuberculosis', 'Influenza', 'HIV/AIDS', 'Polio', 'Hepatitis',
       'COVID-19', "Alzheimer's Disease", 'Ebola', 'Cholera', 'Asthma'],
      dtype=object)

Let's check the distribution of diseases in each country:

In [98]:
COLOR_PRIMARY = '#6e7b91'

COLOR_HIGHLIGHT = '#c2253e'
COLOR_PALETTE = ['#6e7b91', '#c2253e', '#f7f7f7']

In [99]:
df_counts = df_24.groupby(['Country', 'Disease Name']).size().reset_index(name='Count')

countries = df_counts['Country'].unique()

fig = go.Figure()

for country in countries:
    df_country = df_counts[df_counts['Country'] == country]
    max_count = df_country['Count'].max()
    colors = [COLOR_HIGHLIGHT if count == max_count else COLOR_PRIMARY for count in df_country['Count']]
    fig.add_trace(
        go.Bar(
            x=df_country['Disease Name'],
            y=df_country['Count'],
            name=country,
            marker=dict(color=colors)
        )
    )

dropdown_buttons = []
for i, country in enumerate(countries):
    visibility = [i == j for j in range(len(countries))]
    dropdown_buttons.append(
        {'label': country, 'method': 'update', 'args': [{'visible': visibility}, {'title': f'Distribution of Diseases in {country}'}]}
    )

fig.update_layout(
    updatemenus=[
        go.layout.Updatemenu(
            active=0,
            buttons=dropdown_buttons,
            direction='down',
            showactive=True,
        )
    ],
    barmode='group',
    title='Distribution of Diseases by Country',
    xaxis_title='Disease Name',
    yaxis_title='Count'
)

fig.show()


Let's check the distribution of disease categories in each country:

In [100]:

df_counts = df_24.groupby(['Country', 'Disease Category']).size().reset_index(name='Count')

countries = df_counts['Country'].unique()

fig = go.Figure()

for country in countries:
    df_country = df_counts[df_counts['Country'] == country]
    max_count = df_country['Count'].max()
    colors = [COLOR_HIGHLIGHT if count == max_count else COLOR_PRIMARY  for count in df_country['Count']]
    fig.add_trace(
        go.Bar(
            x=df_country['Disease Category'],
            y=df_country['Count'],
            name=country,
            marker=dict(color=colors)
        )
    )

dropdown_buttons = []
for i, country in enumerate(countries):
    visibility = [i == j for j in range(len(countries))]
    dropdown_buttons.append(
        {'label': country, 'method': 'update', 'args': [{'visible': visibility}, {'title': f'Distribution of Disease Categories in {country}'}]}
    )

fig.update_layout(
    updatemenus=[
        go.layout.Updatemenu(
            active=0,
            buttons=dropdown_buttons,
            direction='down',
            showactive=True,
        )
    ],
    barmode='group',
    title='Distribution of Disease Categories by Country',
    xaxis_title='Disease Category',
    yaxis_title='Count'
)

fig.show()
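
The two chart cells above build their dropdown buttons in the same way. As a minimal sketch, that step could be moved into a small helper. The name `build_dropdown_buttons` and the `title_template` parameter are hypothetical and not part of the original notebook; the helper only builds the plain dictionaries that are passed to `updatemenus`.

```python
def build_dropdown_buttons(labels, title_template):
    """Build one Plotly `updatemenus` button per label.

    Each button shows only the trace at the same position as its label,
    so traces must be added to the figure in the same order as `labels`.
    """
    buttons = []
    for i, label in enumerate(labels):
        # Boolean mask: True only for the trace belonging to this label
        visibility = [i == j for j in range(len(labels))]
        buttons.append({
            'label': label,
            'method': 'update',
            'args': [{'visible': visibility},
                     {'title': title_template.format(label=label)}],
        })
    return buttons


# Hypothetical usage with three countries from the dataset
buttons = build_dropdown_buttons(['Brazil', 'Germany', 'Italy'],
                                 'Distribution of Diseases in {label}')
print(buttons[1]['args'][0]['visible'])  # [False, True, False]
```

Note that traces are visible by default, so the figures above presumably show every country at once until a dropdown entry is selected. Passing `visible=(i == 0)` to each `go.Bar` (with `i` taken from `enumerate(countries)`) would be one way to make the initial view match the first entry.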


We notice that some diseases are more frequent in some countries than in others.

For example, Cancer is one of the top 3 diseases in Brazil, Germany and Indonesia, but this is not the case in the other countries.
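
A claim like this one can be checked directly from the counts instead of being read off the charts. The sketch below ranks diseases within each country and lists the countries where a given disease is in the top 3. It runs on a small made-up table with the same columns as `df_counts`; the function name `countries_with_disease_in_top` and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy counts standing in for df_counts (values are made up for illustration)
toy_counts = pd.DataFrame({
    'Country': ['Brazil'] * 4 + ['Italy'] * 4,
    'Disease Name': ['Cancer', 'Asthma', 'Zika', 'Polio'] * 2,
    'Count': [120, 110, 90, 80, 70, 115, 105, 95],
})


def countries_with_disease_in_top(counts, disease, n=3):
    """Return the countries where `disease` is among the n most frequent diseases."""
    # Rank diseases within each country, 1 = most frequent
    ranks = counts.groupby('Country')['Count'].rank(method='min', ascending=False)
    in_top = counts[(ranks <= n) & (counts['Disease Name'] == disease)]
    return sorted(in_top['Country'])


print(countries_with_disease_in_top(toy_counts, 'Cancer'))  # ['Brazil']
```

On the real data, the same function could be called with the per-country, per-disease counts computed earlier in the notebook.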

## Let's check the distribution of diseases across world country categories

In [102]:
# Let's add a column that labels countries (First World, Second World, Third World countries)


def labelise_country(c):
    """Label a country as a First, Second or Third World country."""
    first_wc = ["USA","Australia","South Korea","Japan","France","UK","Germany","Canada","Italy"]
    second_wc = ["Russia","China","Saudi Arabia","Turkey"]
    third_wc = ["Argentina","Brazil","Mexico","India","Nigeria","Indonesia","South Africa"]

    if c in first_wc:
        return "First World country"
    elif c in second_wc:
        return "Second World country"
    elif c in third_wc:
        return "Third World country"

df_24["Country label"] = df_24["Country"].apply(labelise_country)
df_24

Unnamed: 0,Country,Year,Disease Name,Disease Category,Prevalence Rate (%),Incidence Rate (%),Mortality Rate (%),Age Group,Gender,Population Affected,...,Average Treatment Cost (USD),Availability of Vaccines/Treatment,Recovery Rate (%),DALYs,Improvement in 5 Years (%),Per Capita Income (USD),Education Index,Urbanization Rate (%),Age Group labels,Country label
15,USA,2024,Dengue,Viral,1.82,2.98,7.01,36-60,Female,623844,...,4664,No,87.75,484,8.11,56179,0.63,29.00,adults,First World country
54,Argentina,2024,Cancer,Parasitic,10.94,12.88,6.25,61+,Female,419192,...,16363,Yes,72.52,1274,5.36,9633,0.51,37.11,seniors,Third World country
69,Australia,2024,Measles,Genetic,5.88,12.37,4.73,36-60,Other,767034,...,33829,No,75.40,2921,7.85,50348,0.87,56.51,adults,First World country
79,Brazil,2024,Diabetes,Genetic,0.73,9.28,1.09,19-35,Male,4233,...,2078,No,61.80,386,3.23,73287,0.74,25.44,young adults,Third World country
109,Saudi Arabia,2024,Malaria,Chronic,13.40,11.44,9.59,61+,Other,259325,...,14273,Yes,90.56,4937,7.20,78046,0.64,70.02,seniors,Second World country
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999968,USA,2024,COVID-19,Infectious,10.56,8.92,3.10,61+,Other,509069,...,25536,No,72.12,1448,5.01,69574,0.55,84.37,seniors,First World country
999969,Japan,2024,Diabetes,Neurological,15.19,2.93,7.38,19-35,Male,481945,...,11078,Yes,94.54,3645,2.29,24648,0.64,89.87,young adults,First World country
999971,Saudi Arabia,2024,Influenza,Genetic,3.67,1.80,0.79,19-35,Other,85928,...,32179,No,51.51,3847,9.96,64922,0.65,46.94,young adults,Second World country
999987,Mexico,2024,Hypertension,Respiratory,0.50,10.19,8.58,61+,Other,38627,...,44642,No,70.08,834,2.57,87011,0.88,53.96,seniors,Third World country
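
`labelise_country` returns `None` for any country that is not in one of its three lists, which would go unnoticed if the dataset contained other countries. As a hedged alternative, the same grouping can be stored as a lookup table and checked explicitly. The names `GROUPS`, `COUNTRY_LABELS` and `unlabelled` are hypothetical; the country lists are copied from the function above.

```python
# Grouping copied from labelise_country above
GROUPS = {
    "First World country": ["USA", "Australia", "South Korea", "Japan", "France",
                            "UK", "Germany", "Canada", "Italy"],
    "Second World country": ["Russia", "China", "Saudi Arabia", "Turkey"],
    "Third World country": ["Argentina", "Brazil", "Mexico", "India", "Nigeria",
                            "Indonesia", "South Africa"],
}

# Invert to a flat {country: label} lookup table
COUNTRY_LABELS = {country: label
                  for label, countries in GROUPS.items()
                  for country in countries}


def unlabelled(countries):
    """Return the countries that have no label in COUNTRY_LABELS."""
    return [c for c in countries if c not in COUNTRY_LABELS]


print(len(COUNTRY_LABELS))             # 20
print(unlabelled(["Italy", "Spain"]))  # ['Spain']
```

With such a table, `df_24["Country"].map(COUNTRY_LABELS)` would produce the label column, and `unlabelled(df_24["Country"].unique())` would list any country left without a label.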


In [103]:
print("First world country: ", df_24['Country label'].value_counts().get("First World country"))
print("Second world country: ", df_24['Country label'].value_counts().get("Second World country"))
print("Third world country: ", df_24['Country label'].value_counts().get("Third World country"))

First world country:  18012
Second world country:  8085
Third world country:  13906


In [104]:
fig = px.pie(df_24, names='Country label', title='Distribution of Records by World Category')

fig.update_layout(piecolorway=COLOR_PALETTE)

fig.show()



From this distribution, the dataset is not balanced across world categories. Part of the imbalance follows directly from the grouping: the three categories contain 9, 4 and 7 countries respectively, and the record counts are roughly proportional to those sizes. The imbalance might also reflect the fact that First World countries benefit from better disease tracking systems and more extensive health studies than Second and Third World countries, leading to a higher volume of recorded data.

As for why Third World countries are more present in the dataset than Second World countries, one possible explanation, in addition to the larger number of countries in that group, is that diseases that massively affect the populations of Third World countries (such as malaria or tropical diseases) are often better monitored because they are the subject of global research.
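
Before reading too much into the imbalance, it helps to normalise the record counts by the number of countries in each category. The sketch below uses the counts printed above and the group sizes from `labelise_country`; the variable names are assumptions for illustration.

```python
# Record counts printed above and group sizes taken from labelise_country
records = {"First World country": 18012,
           "Second World country": 8085,
           "Third World country": 13906}
n_countries = {"First World country": 9,
               "Second World country": 4,
               "Third World country": 7}

# Average number of 2024 records per country in each category
records_per_country = {label: records[label] / n_countries[label]
                       for label in records}

for label, value in records_per_country.items():
    print(f"{label}: {value:.0f} records per country")
```

This works out to roughly 2,000 records per country in every category, which suggests that the pie chart mostly reflects how many countries were assigned to each group.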

In [105]:
df_counts = df_24.groupby(['Country', 'Disease Name', 'Disease Category', 'Country label']).size().reset_index(name='Count')

def get_top_3_diseases(df, colname, value_col='Disease Name'):
    """Return the 3 most frequent `value_col` values within each `colname` group."""
    # Sum the counts per (group, value) first, so that the ranking is not
    # computed on individual (country, disease, category) combinations
    totals = df.groupby([colname, value_col], as_index=False)['Count'].sum()
    top_3 = [group.nlargest(3, 'Count') for _, group in totals.groupby(colname)]
    return pd.concat(top_3)

top_diseases_df = get_top_3_diseases(df_counts, "Country")
fig_pie = px.pie(top_diseases_df, values='Count', names='Disease Name',
                 title='Top 3 Diseases in Each Country', facet_col='Country', facet_col_wrap=4, height=1000)
fig_pie.show()


In [106]:
top_diseases_df = get_top_3_diseases(df_counts, "Country label")
fig_pie = px.pie(top_diseases_df, values='Count', names='Disease Name',
                 title='Top 3 Diseases in Each World Category', facet_col='Country label')
fig_pie.show()

In [107]:
top_categories_df = get_top_3_diseases(df_counts, "Country label", value_col="Disease Category")
fig_pie = px.pie(top_categories_df, values='Count', names='Disease Category',
                 title='Top 3 Disease Categories in Each World Category', facet_col='Country label')
fig_pie.show()