## **The Influence of Welfare on HIV/AIDS Incidence and Vaccination Rate**

**1. Introduction**

Does a country's welfare have an influence on HIV/AIDS incidence and vaccination rate? 

Welfare is a very broad term in this context. We look at two independent variables that relate to welfare: a country's GDP and its total expenditure. The dependent variables we are interested in are the number of HIV/AIDS deaths per 1,000 live births and ...



Sub research questions:

- Was there an association between a country's GDP and the incidence of HIV/AIDS in 2010, comparing low-GDP countries (below the 25th percentile) with high-GDP countries (above the 75th percentile)?

- Do countries with a low total expenditure have a higher incidence of diseases that infant vaccination prevents (polio, measles, diphtheria)?


**2. Data Preparation**


In [None]:
import pandas as pd
Health_factors = pd.read_csv("https://raw.githubusercontent.com/NHameleers/dtz2025-datasets/master/CountryHealthFactors.csv")

In [None]:
Health_factors

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,Zimbabwe,2004,Developing,44.3,723.0,27,4.36,0.000000,68.0,31,...,67.0,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2
2934,Zimbabwe,2003,Developing,44.5,715.0,26,4.06,0.000000,7.0,998,...,7.0,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5
2935,Zimbabwe,2002,Developing,44.8,73.0,25,4.43,0.000000,73.0,304,...,73.0,6.53,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0
2936,Zimbabwe,2001,Developing,45.3,686.0,25,1.72,0.000000,76.0,529,...,76.0,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8


In [None]:
Health_factors.columns

Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')

The column names contain leading and trailing spaces (for example ' BMI ' and 'Measles '). We therefore strip them by passing str.strip to rename():

In [None]:
Health_factors = Health_factors.rename(columns = str.strip)
Health_factors.columns

Index(['Country', 'Year', 'Status', 'Life expectancy', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles', 'BMI', 'under-five deaths', 'Polio', 'Total expenditure',
       'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness  1-19 years',
       'thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype='object')

The column names still vary in case: some start with an upper-case letter and others with a lower-case letter. We convert them all to lower case with the code below, which helps prevent errors from mistyped column names later in the analysis.

In [None]:
Health_factors = Health_factors.rename(columns = str.lower)
Health_factors.columns

Index(['country', 'year', 'status', 'life expectancy', 'adult mortality',
       'infant deaths', 'alcohol', 'percentage expenditure', 'hepatitis b',
       'measles', 'bmi', 'under-five deaths', 'polio', 'total expenditure',
       'diphtheria', 'hiv/aids', 'gdp', 'population', 'thinness  1-19 years',
       'thinness 5-9 years', 'income composition of resources', 'schooling'],
      dtype='object')
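
The two renaming steps can also be combined into a single pass. The sketch below uses a small made-up dataframe (not the real dataset) and additionally collapses repeated internal spaces, such as the double space that remains in 'thinness  1-19 years'. The helper name `normalize_columns` is our own.

```python
import pandas as pd

def normalize_columns(df):
    """Return a copy with stripped, lower-cased, single-spaced column names."""
    cleaned = (
        df.columns
        .str.strip()                            # drop leading/trailing spaces
        .str.lower()                            # homogenize case
        .str.replace(r"\s+", " ", regex=True)   # collapse repeated spaces
    )
    out = df.copy()
    out.columns = cleaned
    return out

# Toy frame with the same kinds of irregular names as the dataset
toy = pd.DataFrame(columns=[" HIV/AIDS", "Measles ", " thinness  1-19 years", "GDP"])
normalized = normalize_columns(toy)
print(list(normalized.columns))
```

Working on a copy leaves the original dataframe untouched, which matches the way rename() is used above.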

Let's check how many rows and columns the dataset has.

In [None]:

Health_factors.shape

(2938, 22)

We select only the rows for the year 2010 and keep the columns 'country', 'year', 'gdp', 'hiv/aids', and 'status'.

In [None]:

new_dataframe = Health_factors.loc[Health_factors.year == 2010, ['country', 'year', 'gdp', 'hiv/aids', 'status']]
print(new_dataframe)





                                 country  year           gdp  hiv/aids  \
5                            Afghanistan  2010    553.328940       0.1   
21                               Albania  2010    494.358832       0.1   
37                               Algeria  2010   4463.394675       0.1   
53                                Angola  2010   3529.534820       2.5   
69                   Antigua and Barbuda  2010  12126.876140       0.1   
...                                  ...   ...           ...       ...   
2863  Venezuela (Bolivarian Republic of)  2010           NaN       0.1   
2879                            Viet Nam  2010           NaN       0.1   
2895                               Yemen  2010           NaN       0.1   
2911                              Zambia  2010   1463.213573       6.8   
2927                            Zimbabwe  2010    713.635620      15.7   

          status  
5     Developing  
21    Developing  
37    Developing  
53    Developing  
69    Developing

**3. Explore and Clean the Data**

We want to investigate how many NaN values exist in the dataframe.


In [None]:
new_dataframe.isnull().sum()

country      0
year         0
gdp         27
hiv/aids     0
status       0
dtype: int64

We observe 27 null values for the 'gdp' column. We want to remove these NaN values from the dataframe.

In [None]:
NaN_removed = new_dataframe.dropna()
print(NaN_removed)

                  country  year           gdp  hiv/aids      status
5             Afghanistan  2010    553.328940       0.1  Developing
21                Albania  2010    494.358832       0.1  Developing
37                Algeria  2010   4463.394675       0.1  Developing
53                 Angola  2010   3529.534820       2.5  Developing
69    Antigua and Barbuda  2010  12126.876140       0.1  Developing
...                   ...   ...           ...       ...         ...
2815              Uruguay  2010  11938.212000       0.1  Developing
2831           Uzbekistan  2010   1377.821400       0.2  Developing
2847              Vanuatu  2010   2965.824340       0.1  Developing
2911               Zambia  2010   1463.213573       6.8  Developing
2927             Zimbabwe  2010    713.635620      15.7  Developing

[156 rows x 5 columns]


Let's see if the method worked.

In [None]:
NaN_removed.isnull().sum()

country     0
year        0
gdp         0
hiv/aids    0
status      0
dtype: int64
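
Because only 'gdp' has missing values here, dropna() without arguments gives the intended result. As a more defensive variant, the `subset` argument limits the check to the columns the analysis actually needs, so a missing value in an unrelated column would not discard a country. A minimal sketch on made-up data (the `notes` column is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "gdp": [500.0, np.nan, 1200.0],
    "hiv_aids": [0.1, 0.4, 0.2],
    "notes": [None, "x", None],   # unrelated column with gaps
})

# Drop a row only if one of the analysis columns is missing
cleaned = toy.dropna(subset=["gdp", "hiv_aids"])
print(cleaned["country"].tolist())
```

In this toy frame a plain dropna() would remove every row, whereas the `subset` version removes only country B.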

We now want to see the descriptive statistics of the NaN_removed dataframe.

In [None]:
NaN_removed.describe()

Unnamed: 0,year,gdp,hiv/aids
count,156.0,156.0,156.0
mean,2010.0,7464.487887,1.374359
std,0.0,13959.522835,3.181334
min,2010.0,8.376432,0.1
25%,2010.0,700.796595,0.1
50%,2010.0,2183.358036,0.1
75%,2010.0,5841.772875,0.5
max,2010.0,87646.75346,21.6


The maximum values of gdp and hiv/aids lie far above the mean, the median, and the 75th percentile. Let's check for potential outliers.

In [None]:
import altair as alt
alt.Chart(NaN_removed).mark_point().encode(
    x = 'gdp',
)

As expected, there are major influential outliers for the 'gdp' variable. As a cutoff, we choose to only include countries with GDP values below 10,000.

In [None]:
alt.Chart(NaN_removed).mark_point().encode(
    x = 'hiv/aids',
)

The 'hiv/aids' variable also has major influential outliers that we need to exclude. As a cutoff, we choose to only include countries with fewer than 3 HIV/AIDS deaths per 1,000 live births.
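
The cutoffs of 10,000 and 3 were chosen by eye from the plots. For comparison, a common rule of thumb (Tukey's fences, which is not the method used in this notebook) flags values above Q3 + 1.5 × IQR. The sketch below applies that rule to the quartiles reported by describe() above. The function name `upper_fence` is our own.

```python
def upper_fence(q1, q3, k=1.5):
    """Upper Tukey fence: values above q3 + k * (q3 - q1) are flagged."""
    return q3 + k * (q3 - q1)

# Quartiles taken from the describe() table of NaN_removed
gdp_fence = upper_fence(700.796595, 5841.772875)
hiv_fence = upper_fence(0.1, 0.5)

print(round(gdp_fence, 2))
print(round(hiv_fence, 2))
```

With these quartiles the rule would give a more lenient gdp cutoff (about 13,553) and a stricter hiv/aids cutoff (about 1.1) than the ones we chose.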

By default, describe() summarizes only the numeric columns, so 'country' and 'status' are not yet included. We want to see the country count as well.

In [None]:

NaN_removed.describe(include = ['object'])

Unnamed: 0,country,status
count,156,156
unique,156,2
top,Afghanistan,Developing
freq,1,128


Each row belongs to only one country, which is logical since we selected data from only the year 2010. We see that 128 countries are developing and only 28 are developed countries.

From the dataframe 'NaN_removed', we want to make a new dataframe without the major influential outliers. First, we rename the 'hiv/aids' column to 'hiv_aids'. A name containing a slash cannot be used with attribute access (such as NaN_removed2.hiv_aids), and the new name is also easier to read.


In [None]:
NaN_removed2 = NaN_removed.rename(columns = {'hiv/aids': 'hiv_aids'})
print(NaN_removed2)



                  country  year           gdp  hiv_aids      status
5             Afghanistan  2010    553.328940       0.1  Developing
21                Albania  2010    494.358832       0.1  Developing
37                Algeria  2010   4463.394675       0.1  Developing
53                 Angola  2010   3529.534820       2.5  Developing
69    Antigua and Barbuda  2010  12126.876140       0.1  Developing
...                   ...   ...           ...       ...         ...
2815              Uruguay  2010  11938.212000       0.1  Developing
2831           Uzbekistan  2010   1377.821400       0.2  Developing
2847              Vanuatu  2010   2965.824340       0.1  Developing
2911               Zambia  2010   1463.213573       6.8  Developing
2927             Zimbabwe  2010    713.635620      15.7  Developing

[156 rows x 5 columns]


Now, we create a definitive dataframe without the major influential outliers.

In [None]:
Health_factors_definitive = NaN_removed2.loc[(NaN_removed2.gdp < 10000) & (NaN_removed2.hiv_aids < 3), :]
print(Health_factors_definitive)




                   country  year          gdp  hiv_aids      status
5              Afghanistan  2010   553.328940       0.1  Developing
21                 Albania  2010   494.358832       0.1  Developing
37                 Algeria  2010  4463.394675       0.1  Developing
53                  Angola  2010  3529.534820       2.5  Developing
85               Argentina  2010  1276.265000       0.1  Developing
...                    ...   ...          ...       ...         ...
2702          Turkmenistan  2010  4439.230000       0.1  Developing
2735               Ukraine  2010  2965.142365       0.2  Developing
2751  United Arab Emirates  2010  3549.148320       0.1  Developing
2831            Uzbekistan  2010  1377.821400       0.2  Developing
2847               Vanuatu  2010  2965.824340       0.1  Developing

[109 rows x 5 columns]


After this processing, the dataset now contains 109 countries. To see if the exclusion of influential outliers worked, we will make the two plots again.


In [None]:
alt.Chart(Health_factors_definitive).mark_point().encode(
    x = 'gdp',
)

In [None]:
alt.Chart(Health_factors_definitive).mark_point().encode(
    x = 'hiv_aids',
)

The 'hiv_aids' outliers have been removed. For 'gdp', some outliers remain. We look at the descriptive statistics of this new dataframe again to see whether the mean still differs much from the median.

In [None]:
Health_factors_definitive.describe()

Unnamed: 0,year,gdp,hiv_aids
count,109.0,109.0,109.0
mean,2010.0,2434.078018,0.383486
std,0.0,2262.953116,0.561498
min,2010.0,8.376432,0.1
25%,2010.0,575.446453,0.1
50%,2010.0,1493.1651,0.1
75%,2010.0,3661.994,0.4
max,2010.0,8959.581416,2.5


Excluding the outliers brought the mean and median closer together, which is logical since the mean is affected more strongly by influential outliers than the median. They are still quite far apart. Nevertheless, we choose to continue with the Health_factors_definitive dataframe: only 109 countries are left, and eliminating more would compromise the statistical power.

In [None]:
Health_factors_definitive.describe(include = ['object'])

Unnamed: 0,country,status
count,109,109
unique,109,2
top,Afghanistan,Developing
freq,1,97


The final dataset includes 97 developing countries and 12 developed countries.

In [None]:
Health_factors_definitive

Unnamed: 0,country,year,gdp,hiv_aids,status
5,Afghanistan,2010,553.328940,0.1,Developing
21,Albania,2010,494.358832,0.1,Developing
37,Algeria,2010,4463.394675,0.1,Developing
53,Angola,2010,3529.534820,2.5,Developing
85,Argentina,2010,1276.265000,0.1,Developing
...,...,...,...,...,...
2702,Turkmenistan,2010,4439.230000,0.1,Developing
2735,Ukraine,2010,2965.142365,0.2,Developing
2751,United Arab Emirates,2010,3549.148320,0.1,Developing
2831,Uzbekistan,2010,1377.821400,0.2,Developing


No other inconsistencies are observed in the dataset. Therefore, we will continue our analysis and describe/visualize the data with the 'Health_factors_definitive' dataframe.

**4. Describe and Visualize**

In [None]:
Health_factors_definitive.describe()


Unnamed: 0,year,gdp,hiv_aids
count,109.0,109.0,109.0
mean,2010.0,2434.078018,0.383486
std,0.0,2262.953116,0.561498
min,2010.0,8.376432,0.1
25%,2010.0,575.446453,0.1
50%,2010.0,1493.1651,0.1
75%,2010.0,3661.994,0.4
max,2010.0,8959.581416,2.5


The table above displays the descriptive statistics of the quantitative variables. The count indicates the number of rows in the dataframe: there are 109 values for each variable. Since 'year' is the same for every row, its mean, minimum, maximum, and quartiles all equal 2010.0 and its standard deviation is 0. What stands out, as mentioned earlier, is that the mean differs substantially from the median for both 'gdp' and 'hiv_aids'. In addition, for 'hiv_aids' the minimum, the 25th percentile, and the median all have the same value (0.1). This suggests a strongly skewed distribution, which we check with a plot.
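
In addition to a plot, skewness can be summarized in a single number. The sketch below uses made-up values that mimic the shape of 'hiv_aids' (many values at the minimum, a few large ones) and compares the mean with the median and with the sample skewness from pandas. A positive skewness indicates a longer right tail.

```python
import pandas as pd

# Made-up values: most observations sit at the minimum of 0.1
toy = pd.Series([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.4, 1.2, 2.5])

mean_value = toy.mean()
median_value = toy.median()
skewness = toy.skew()   # sample skewness; > 0 means a longer right tail

print(mean_value > median_value, skewness > 0)
```

The same calls could be applied to the 'gdp' and 'hiv_aids' columns of Health_factors_definitive.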

In [None]:
alt.Chart(Health_factors_definitive).mark_bar().encode(
    x=alt.X('gdp:Q', bin=alt.Bin(maxbins=10)),
    y='count():Q'
)

The 'gdp' data are skewed to the right.

In [None]:
alt.Chart(Health_factors_definitive).mark_bar().encode(
    x=alt.X('hiv_aids:Q', bin=alt.Bin(maxbins=10)),
    y='count():Q'
)

There are only 5 bins visible since the max value of the variable 'hiv_aids' is 2.5:

In [None]:
Health_factors_definitive['hiv_aids'].max()

2.5

The hiv_aids data are also strongly skewed to the right.

The first sub-question was whether there was an association between a country's GDP and the incidence of HIV/AIDS in 2010, comparing low-GDP countries (below the 25th percentile) with high-GDP countries (above the 75th percentile). Let's first make a scatterplot containing all gdp values.

In [None]:
chart = alt.Chart(Health_factors_definitive).mark_point().encode(
    x='gdp:Q',
    y='hiv_aids:Q'
)
chart + chart.transform_regression('gdp', 'hiv_aids').mark_line()

We can also calculate the correlation coefficient with the NumPy module.

In [None]:
import numpy as np
np.corrcoef(Health_factors_definitive['gdp'], Health_factors_definitive['hiv_aids'])

array([[ 1.        , -0.30127765],
       [-0.30127765,  1.        ]])
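
np.corrcoef returns the Pearson correlation, which measures linear association and can be strongly affected by skewed data. As a hedged cross-check (not part of the original analysis), a rank-based Spearman correlation can be computed as the Pearson correlation of the ranks. The values below are made up for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gdp": [100.0, 500.0, 1500.0, 4000.0, 9000.0],
    "hiv_aids": [2.5, 0.4, 0.2, 0.15, 0.1],   # decreasing, but not linearly
})

# Pearson correlation on the raw values
pearson = np.corrcoef(toy["gdp"], toy["hiv_aids"])[0, 1]

# Spearman correlation = Pearson correlation of the ranks
ranks = toy.rank()
spearman = np.corrcoef(ranks["gdp"], ranks["hiv_aids"])[0, 1]

print(round(pearson, 3), round(spearman, 3))
```

When the relationship is monotonic but curved, as in this toy example, the rank-based coefficient is larger in magnitude than the Pearson coefficient.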

We observe a negative correlation between a country's gdp and hiv_aids deaths per 1,000 live births: the correlation coefficient is -0.301. Now let's select only the data in the lowest quartile (below the 25th percentile) and in the highest quartile (above the 75th percentile). The quantile() method of pandas gives us the cutoff values.

In [None]:
first_quartile_cutoffvalue = Health_factors_definitive['gdp'].quantile(.25)
fourth_quartile_cutoffvalue = Health_factors_definitive['gdp'].quantile(.75)
print(first_quartile_cutoffvalue)
print(fourth_quartile_cutoffvalue)

575.4464527
3661.994


Accordingly, the 25th percentile is 575.4 and the 75th percentile is 3662.0 (rounded). We now select the rows with a gdp value below the 25th percentile or above the 75th percentile.

In [None]:
# Keep rows below the 25th percentile or above the 75th percentile of gdp
Health_factors_quartiles = Health_factors_definitive.loc[(Health_factors_definitive.gdp < first_quartile_cutoffvalue) | (Health_factors_definitive.gdp > fourth_quartile_cutoffvalue)]
print(Health_factors_quartiles)




                     country  year          gdp  hiv_aids      status
5                Afghanistan  2010   553.328940       0.1  Developing
21                   Albania  2010   494.358832       0.1  Developing
37                   Algeria  2010  4463.394675       0.1  Developing
149               Azerbaijan  2010  5842.857840       0.1  Developing
229                  Belarus  2010    63.388770       0.1  Developing
245                  Belgium  2010  4438.237410       0.1   Developed
261                   Belize  2010  4344.151770       0.2  Developing
325   Bosnia and Herzegovina  2010  4611.472980       0.1  Developing
389                 Bulgaria  2010  6843.263289       0.1   Developed
421                  Burundi  2010   231.194326       1.9  Developing
565                    China  2010   456.512487       0.1  Developing
630               Costa Rica  2010  8199.414621       0.1  Developing
662                     Cuba  2010  5676.141430       0.1  Developing
678                 
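
The selection above keeps both groups in one dataframe without marking which row belongs to which group. A sketch of how the low and high groups could be labeled and compared directly is given below. It reuses the same quantile cutoffs on a small made-up dataframe; the column name `gdp_group` is our own, and the median is used because the data are skewed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gdp": [100, 200, 300, 400, 500, 600, 700, 800, 900],
    "hiv_aids": [2.0, 1.0, 0.5, 0.5, 0.3, 0.2, 0.2, 0.1, 0.1],
})

low_cutoff = toy["gdp"].quantile(0.25)    # 25th percentile
high_cutoff = toy["gdp"].quantile(0.75)   # 75th percentile

# Label each row; rows between the cutoffs are marked "middle"
toy["gdp_group"] = np.select(
    [toy["gdp"] < low_cutoff, toy["gdp"] > high_cutoff],
    ["low", "high"],
    default="middle",
)

# Compare the typical hiv_aids value of the two outer groups
outer = toy[toy["gdp_group"] != "middle"]
group_medians = outer.groupby("gdp_group")["hiv_aids"].median()
print(group_medians)
```

A table of group medians answers the low-versus-high question more directly than a scatterplot of the filtered rows.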

In [None]:
Health_factors_quartiles.describe(include = ['object'])

Unnamed: 0,country,status
count,54,54
unique,54,2
top,Afghanistan,Developing
freq,1,45


Now, 54 countries are included, of which 45 are developing and 9 are developed. Let's make a scatterplot of these data to see the relation between gdp and hiv_aids deaths per 1,000 live births for the lowest and highest quartiles only.

In [None]:
alt.Chart(Health_factors_quartiles).mark_point().encode(
    x='gdp:Q',
    y='hiv_aids:Q'
)

The scatterplot shows the same data points as before; only the data points within the interquartile range (IQR) are excluded. Therefore, this scatterplot does not provide additional relevant information. Let's make a bar chart.

In [None]:
alt.Chart(Health_factors_quartiles).mark_bar().encode(
    x='gdp:Q',
    y='hiv_aids:Q'
)

The bar chart likewise only shows that the data points within the IQR are missing. Selecting only the data points that fall outside the IQR therefore does not provide additional relevant information for our research question. Let's go back to our Health_factors_definitive dataframe and take another look at its qualitative variables.

In [None]:
Health_factors_definitive.describe(include = ['object'])

Unnamed: 0,country,status
count,109,109
unique,109,2
top,Afghanistan,Developing
freq,1,97


The table above displays the descriptive statistics of the qualitative variables in the Health_factors_definitive dataframe. All 109 countries are unique, which means that no country appears more than once. Regarding the status, 97 countries are labelled 'Developing' and 12 are labelled 'Developed'. We can also show this with a bar chart of the counts.

In [None]:
alt.Chart(Health_factors_definitive).mark_bar().encode(
    x=alt.X('status:N', bin=False),
    y='count():Q'
)

As we saw in the scatterplot, there is a negative association between gdp and the number of hiv_aids deaths per 1,000 live births. This finding suggests that countries with a higher gdp have fewer hiv_aids deaths than countries with a lower gdp. Let's turn this into an interactive visualization: we fit a regression line and use its formula behind an interactive dropdown menu.

In [None]:
import ipywidgets as widgets
from ipywidgets import interact
from ipywidgets import fixed
import statsmodels.api as sm

Y = Health_factors_definitive['hiv_aids']
X = Health_factors_definitive['gdp']
X = sm.add_constant(X)
model = sm.OLS(Y,X)
results = model.fit()
results.params


const    0.565446
gdp     -0.000075
dtype: float64

The formula for the regression model is

hiv_aids = 0.565446 - 0.000075 * gdp 
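
For a regression with a single predictor, the coefficients follow directly from the summary statistics: `slope = r * sd_y / sd_x` and `intercept = mean_y - slope * mean_x`. As a sanity check, the sketch below plugs in the correlation coefficient and the means and standard deviations reported earlier. The variable names are our own.

```python
# Values copied from the correlation matrix and the describe() table
r = -0.30127765
mean_gdp, sd_gdp = 2434.078018, 2262.953116
mean_hiv, sd_hiv = 0.383486, 0.561498

slope = r * sd_hiv / sd_gdp              # change in hiv_aids per unit of gdp
intercept = mean_hiv - slope * mean_gdp  # predicted hiv_aids at gdp = 0

print(f"{slope:.6f}", f"{intercept:.4f}")
```

Both values agree with the statsmodels estimates up to rounding.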

Let's make this formula interactive.

In [None]:
def calc_hiv_aids(gdp):
  return 0.565446 - 0.000075 * gdp

interact(calc_hiv_aids, gdp = Health_factors_definitive['gdp'])



interactive(children=(Dropdown(description='gdp', options=(553.32894, 494.358832, 4463.394675, 3529.53482, 127…

<function __main__.calc_hiv_aids(gdp)>

This interactive formula lets the user calculate the predicted number of 'hiv_aids' deaths per 1,000 live births for a given 'gdp' value. Higher gdp values result in lower predicted hiv_aids values and vice versa.
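
A small sketch of how the prediction function could take its coefficients as arguments rather than hard-coding them. The factory name `make_predictor` is our own; the coefficients are the fitted values shown above.

```python
def make_predictor(intercept, slope):
    """Build a function that predicts hiv_aids from gdp with a fitted line."""
    def predict(gdp):
        return intercept + slope * gdp
    return predict

# Coefficients taken from results.params above
predict_hiv_aids = make_predictor(0.565446, -0.000075)

at_zero = predict_hiv_aids(0)
at_max_gdp = predict_hiv_aids(8959.581416)   # largest gdp in the dataframe
print(at_zero, at_max_gdp)
```

Passing a numeric range such as `gdp=(0, 9000, 100)` to interact would produce a slider instead of a dropdown. Note that the straight line predicts negative values for gdp above roughly 7,539, which lies inside the observed range, so predictions at the high end should be read with care.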

Each country has its own gdp and hiv_aids values. Let's additionally create an interactive bar chart in which the user can select the country.

In [None]:
from IPython.display import display

def select_country_data(country, Health_factors_definitive):
  # Keep only the row(s) belonging to the requested country
  return Health_factors_definitive.loc[Health_factors_definitive['country'] == country]

def visualize_country_data(country, Health_factors_definitive):
  # Plot the selected country's hiv_aids value against its gdp value
  data_to_visualize = select_country_data(country, Health_factors_definitive)

  bar_chart = alt.Chart(data_to_visualize).mark_bar().encode(
      x='gdp:Q',
      y='hiv_aids:Q',
      tooltip='country',
  )
  display(bar_chart)



Let's check whether the visualize_country_data function works properly with an example country.

In [None]:
visualize_country_data('Angola', Health_factors_definitive)

The function works as intended. Now, let's pass it to interact to make the bar chart interactive for the user.

In [None]:
interact(visualize_country_data,
         country = Health_factors_definitive['country'],
         Health_factors_definitive = fixed(Health_factors_definitive))


interactive(children=(Dropdown(description='country', options=('Afghanistan', 'Albania', 'Algeria', 'Angola', …

<function __main__.visualize_country_data(country, Health_factors_definitive)>

Now, we have created an interactive bar chart in which the user can select the country of interest to check the values of gdp and hiv_aids.

**5. Conclusion**

Our main research question was:

Does a country's welfare have an influence on HIV/AIDS incidence and vaccination rate? We divided this question into two sub-questions:

- Was there an association between a country's GDP and the incidence of HIV/AIDS in 2010, comparing low-GDP countries (below the 25th percentile) with high-GDP countries (above the 75th percentile)?

- Do countries with a low total expenditure have a higher incidence of diseases that infant vaccination prevents (polio, measles, diphtheria)?

In response to the first sub-question, we found a negative association between a country's GDP and HIV/AIDS deaths in 2010 (correlation coefficient of -0.301). Countries with a lower GDP tend to have more HIV/AIDS deaths per 1,000 live births than countries with a higher GDP. In this study, GDP is taken as a measure of welfare, so countries with higher welfare tend to lose fewer lives to HIV/AIDS than countries with lower welfare. Restricting the data to the lowest and highest GDP quartiles did not provide additional insight.

