<a href="https://colab.research.google.com/github/Peiprjs/voila/blob/main/HIV_Deaths_VS_Total_expenditure.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HIV deaths in 0-4 year-old children against percentage healthcare expenditure in South Africa and the Netherlands
#### A project by Ellie Petrova (i6326413) and Mar Roca (i6351071)

---


How does the death rate of 0-4 year-olds due to HIV/AIDS compare between the Netherlands and an ex-Dutch colony (South Africa)? Is it related to healthcare spending? Has South Africa improved its treatment of pregnant people who are HIV-positive between 2000 and 2015, leading to less deaths? Does increased healthcare spending correlate with lower HIV/AIDS deaths in newborns and children?

In [1]:
# REMOVE?? Only if stuff breaks
# !pip install altair==5.2.0 --quiet
# (we needed Altair 5.2.0 specifically because of some version-specific update, so we call for it using PIP, muting the output)

In [2]:
import numpy as np
import pandas as pd
import altair as alt
import ipywidgets as widgets
from ipywidgets import interact
# This imports the required dependencies

In [3]:
healthFactors = pd.read_csv('https://raw.githubusercontent.com/NHameleers/dtz2025-datasets/master/CountryHealthFactors.csv')
healthFactors = healthFactors.rename(columns=lambda x: x.strip().title())
healthFactors = healthFactors.rename(columns={"Hiv/Aids": "HIV/AIDS", "Gdp": "GDP", "Bmi": "BMI"})
# This imports the CSV with the dataset and strips leading and trailing whitespaces in the indexes. It also changes all indexes into the Title format (first letter of each word capitalised) for style and uniformity. HIV/AIDS, BMI and GDP are kept as exemptions due to being made up of initials.
hf_SouthAfrica = healthFactors.loc[healthFactors.Country == "South Africa", ['Year', 'Percentage Expenditure', 'HIV/AIDS']]
hf_Netherlands = healthFactors.loc[healthFactors.Country == "Netherlands", ['Year', 'Percentage Expenditure', 'HIV/AIDS']]
# This selects only the data that we're interested in: Years, Total expenditure and HIV/AIDS
# from rows which Country column is equal to South Africa and the Netherlands respectively

We start off by importing all the necessary packages, and loading the CSV containing the health factors data. This file needs to be parsed using Pandas, and some indexes must be corrected. We strip leading and trailing spaces, as well as converting all indexes to a Title format (first letter of each word is capitalised) for stylistic and standardisation reasons, except for acronyms.
We also select the variables that we are interested in studying: Year, Percentage Expenditure and HIV/AIDS in the rows in which the country is South Africa or Netherlands.

In [4]:
print(f"The datasets have both the same shape: {hf_SouthAfrica.shape[0]} rows and {hf_SouthAfrica.shape[1]} colums")
print(f"The data that we have was collected between {hf_SouthAfrica.Year.min()} and {hf_Netherlands.Year.max()}")

The datasets have both the same shape: 16 rows and 3 colums
The data that we have was collected between 2000 and 2015


By running `hf_SouthAfrica.shape` or `hf_Netherlands.shape` we get the shapes of the dataframes resulting from isolating the data that we are interested in. We can observe that both of the resulting frames have a shape of **16x3**: **16** rows and **3** columns. By running `hf_SouthAfrica.Year.min()` or `hf_SouthAfrica.Year.max()` we can find out between what years we have the data from: **2000** to **2015**

# In South Africa throughout the years (2000-2015) - Mar Roca


In [5]:
hf_SouthAfrica.head(15)

Unnamed: 0,Year,Percentage Expenditure,HIV/AIDS
2393,2015,0.0,3.6
2394,2014,922.050731,3.7
2395,2013,978.590529,4.5
2396,2012,1089.954838,7.6
2397,2011,123.753335,8.5
2398,2010,1038.885632,11.0
2399,2009,782.598714,19.0
2400,2008,780.033642,23.5
2401,2007,805.490079,26.4
2402,2006,732.12553,28.1


The first step in any statistical analysis is to graph the variables we're interested in studying, to see if there is and (in case there is) what kind of relationship the two variables follow.
So, we will start by generating graphs using Altair. The next step will be to calculate some desciptive statistics.

In [6]:
def make_graph(dataframe, varX, varY, typeX):
  dataframe[varX] = dataframe[varX].astype(str)
  # We need to perform this type conversion because of Altair being weird with years. According to the documentation, we must define it as string and specify the variable type as temporal
  scatter = alt.Chart(dataframe).mark_circle(opacity=0.5).encode(
    alt.X(varX, type= typeX, scale=alt.Scale(zero=False)),
    alt.Y(varY, type='quantitative'),)
  # This first part draws the dots in the scatter plot
  scatter_w_loess = scatter + scatter.transform_loess(varX, varY).mark_line()
  # This second part draws a LOESS (LOcally Estimated Scatterplot Smoothing) line, which makes seeing the evolution easier.
  display(scatter_w_loess)
# This first section defines how to make a scatter plot with LOESS very easily, simply passing to the class a dataframe, the x-variable and the y-variable indexes.

graph_HIV_Year = widgets.Output()
with graph_HIV_Year:
  make_graph(hf_SouthAfrica, "Year", "HIV/AIDS", 'temporal')

graph_Expend_Year = widgets.Output()
with graph_Expend_Year:
  make_graph(hf_SouthAfrica, "Year", "Percentage Expenditure", 'temporal')

graph_HIV_Expend = widgets.Output()
with graph_HIV_Expend:
  make_graph(hf_SouthAfrica, "Percentage Expenditure", "HIV/AIDS", 'quantitative')
# This second section defines all the graphs we want to plot

widgets.HBox([graph_HIV_Year, graph_Expend_Year, graph_HIV_Expend])
# And finally we plot all of them together in a row

HBox(children=(Output(), Output(), Output()))

The code above generates XY scatterplots for the variables that we are interested in, which show that:
- Except for an increase before 2004, HIV deaths in children aged 0 to 4 steadily decreased since 2004 and seem to have plateau'd at below 5 per 1000 live births.
- There seems to be two outliers in the percentage expenditure. We can see one in 2011, when South Africa *only* spent a bit over 100% of its GDP per capita, and in 2015, where it was reported at 0. We can assume that the data for 2015 is missing, and so the percentage calculation returned 0.
- HIV deaths seemed to increase with percentage expenditure, until the expenditure reached a certain limit, at which point the cases seem to drop significantly.

We will compute a new dataframe of percentage expenditure excluding the outlying values. They will appear as `NaN`, so they will be ignored when we perform descriptive statistics using `Pandas.describe()` or graphing using `Altair.Chart()`. We will then graph this new dataframe against the year.

In [7]:
hf_SouthAfrica["Percentage Expenditure Outlierless"] = hf_SouthAfrica.loc[(hf_SouthAfrica["Year"] != "2011") & (hf_SouthAfrica["Year"] != "2015"), "Percentage Expenditure"]
# We use this line of code to remove the two datapoints which we considered outliers.

graph_Expend_Year = widgets.Output()
with graph_Expend_Year:
  make_graph(hf_SouthAfrica, "Year", "Percentage Expenditure Outlierless", 'temporal')

graph_HIV_Expend = widgets.Output()
with graph_HIV_Expend:
  make_graph(hf_SouthAfrica, "Percentage Expenditure Outlierless", "HIV/AIDS", 'quantitative')

widgets.HBox([graph_Expend_Year, graph_HIV_Expend])

HBox(children=(Output(), Output()))

In this new graph we can observe how expenditure on healthcare (as a percentage of GDP per capita) has been slowly increasing since 2000, with a small decline after 2012.
We can also observe in the Percentage Expenditure vs HIV/AIDS deaths graph where outliers have been removed, that there is actually a gradual increase as healthcare expenditure increases, and an immediate drop-off when it increases after a certain point.


We will now perform a descriptive statistics analysis on our data.

In [8]:
print(hf_SouthAfrica.loc[hf_SouthAfrica.isna().values])

Empty DataFrame
Columns: [Year, Percentage Expenditure, HIV/AIDS, Percentage Expenditure Outlierless]
Index: []


The singular line of code above allows us to find in the entire dataframe of data that we selected the instances in which there's a NaN. We can see that in our selected portion of data, there is no missing values.

In [13]:
hf_SouthAfrica.describe()

Unnamed: 0,HIV/AIDS
count,16.0
mean,18.49375
std,10.166053
min,3.6
25%,8.275
50%,22.4
75%,26.975
max,29.7


# Comparing South Africa and The Netherlands in [year] - Ellie Petrova



# SCRAP CODE

In [10]:
hf_SouthAfrica["Year"] = hf_SouthAfrica["Year"].astype(str)
# We need to perform this type conversion because of Altair being weird with years. According to the documentation, we must define it as string and specify the variable type as temporal
base = alt.Chart(hf_SouthAfrica).mark_circle(opacity=0.5).encode(
    alt.X('Year', type='temporal', scale=alt.Scale(zero=False)),
    alt.Y('HIV/AIDS', type='quantitative'),)

# This first part draws the dots in the scatter plot
base + base.transform_loess('Year', 'HIV/AIDS').mark_line()
# This second part draws a LOESS (LOcally Estimated Scatterplot Smoothing) line, which makes seeing the evolution easier.

In [11]:
def make_graph(dataframe):
  dataframe["Year"] = dataframe["Year"].astype(str)
  # We need to perform this type conversion because of Altair being weird with years. According to the documentation, we must define it as string and specify the variable type as temporal
  scatter = alt.Chart(dataframe).mark_circle(opacity=0.5).encode(
    alt.X('Year', type='temporal', scale=alt.Scale(zero=False)),
    alt.Y('HIV/AIDS', type='quantitative'),)
  # This first part draws the dots in the scatter plot
  final_graph = scatter + scatter.transform_loess('Year', 'HIV/AIDS').mark_line()
  # This second part draws a LOESS (LOcally Estimated Scatterplot Smoothing) line, which makes seeing the evolution easier.
  display(final_graph)


graph_HIV_Year = widgets.Output()
with graph_HIV_Year:
  make_graph(hf_SouthAfrica)

widgets.HBox([graph_HIV_Year])

HBox(children=(Output(),))

In [12]:

import pandas as pd


df = pd.DataFrame({'a': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
def g(df):
    return df.rename(columns=lambda x: x.strip().upper())

result = g(healthFactors)
print(result)



          COUNTRY  YEAR      STATUS  LIFE EXPECTANCY  ADULT MORTALITY  \
0     Afghanistan  2015  Developing             65.0            263.0   
1     Afghanistan  2014  Developing             59.9            271.0   
2     Afghanistan  2013  Developing             59.9            268.0   
3     Afghanistan  2012  Developing             59.5            272.0   
4     Afghanistan  2011  Developing             59.2            275.0   
...           ...   ...         ...              ...              ...   
2933     Zimbabwe  2004  Developing             44.3            723.0   
2934     Zimbabwe  2003  Developing             44.5            715.0   
2935     Zimbabwe  2002  Developing             44.8             73.0   
2936     Zimbabwe  2001  Developing             45.3            686.0   
2937     Zimbabwe  2000  Developing             46.0            665.0   

      INFANT DEATHS  ALCOHOL  PERCENTAGE EXPENDITURE  HEPATITIS B  MEASLES  \
0                62     0.01               71