Report made by: Reinout Willems

Student ID: I6331474



# **Research question and short introduction**

*What is the relation between schooling and the vaccination rate in different Asian countries of the ASEAN Regional Forum (ARF)?*
1. What is the relation between schooling and the vaccination rate in some Asian ARF countries?
2. How do they compare to other Asian ARF countries?
3. What is the overall correlation of the Asian ARF countries?
4. How do the countries compare over the years?

ARF stands for ASEAN Regional Forum and contains the following countries/unions:
Australia, Bangladesh, Brunei Darussalam, Cambodia, Canada, China, Democratic People’s Republic of Korea, European Union, India, Indonesia, Japan, Lao PDR, Malaysia, Mongolia, Myanmar, New Zealand, Pakistan, Papua New Guinea, Philippines, Republic of Korea, Russia, Singapore, Sri Lanka, Thailand, Timor-Leste, United States, and Viet Nam.

Not all ARF members are Asian (for example, the United States and the European Union), so those members are excluded from the research. This leaves the following countries: Bangladesh, Brunei Darussalam (Brunei), Cambodia, China, Democratic People’s Republic of Korea (North Korea), India, Indonesia, Japan, Lao PDR (Laos), Malaysia, Mongolia, Myanmar, Pakistan, Philippines, Republic of Korea (South Korea), Russia, Singapore, Sri Lanka, Thailand, Timor-Leste, and Viet Nam.

# **Data collection for research**

**Loading the dataset**

In [1]:
import pandas as pd
import numpy as np
import altair as alt
import ipywidgets as widgets
from ipywidgets import interact

FactorHealth = pd.read_csv('https://raw.githubusercontent.com/NHameleers/dtz2025-datasets/master/CountryHealthFactors.csv')



In [2]:
FactorHealth = FactorHealth.rename(columns=str.strip)

To determine the average vaccination rate, a new column 'Average Vaccination Rate' was calculated as the mean of 'Hepatitis B', 'Polio', and 'Diphtheria'.

In [3]:
FactorHealth['Average Vaccination Rate'] = (FactorHealth['Hepatitis B'] + FactorHealth['Polio'] + FactorHealth['Diphtheria']) / 3
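
A side note on how missing values behave in this calculation: adding columns with `+` yields NaN for any row where at least one of the three rates is missing, whereas a row-wise `.mean(axis=1)` skips missing entries by default. The sketch below illustrates the difference on a small made-up table (the values and the names `toy`, `strict_mean` and `lenient_mean` are illustrative assumptions, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up example values; not taken from CountryHealthFactors.csv
toy = pd.DataFrame({
    'Hepatitis B': [90.0, np.nan],
    'Polio': [96.0, 80.0],
    'Diphtheria': [93.0, 90.0],
})

vaccines = ['Hepatitis B', 'Polio', 'Diphtheria']

# Approach used in this report: NaN in any column makes the result NaN
strict_mean = (toy['Hepatitis B'] + toy['Polio'] + toy['Diphtheria']) / 3

# Alternative: average over the vaccines that are available in each row
lenient_mean = toy[vaccines].mean(axis=1)
```

With the strict version, a row without Hepatitis B data has no average at all; which behaviour is preferable depends on the research question.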

**Making a new dataset including values for the research**
- To include only the Asian ARF countries, a list called 'Asian_ARFcountries' was made
- The columns needed for the research were put in a list called 'ResearchColumns'
- The new table 'FactorHealth1' contains these research columns and only the rows of the Asian ARF countries, selected with the '.isin' method

In [4]:
Asian_ARFcountries = ['Bangladesh', 'Brunei Darussalam', 'China', 'Cambodia', "Democratic People's Republic of Korea", 'India', 'Indonesia', 'Japan', "Lao People's Democratic Republic", 'Malaysia', 'Mongolia', 'Myanmar', 'Pakistan',
'Philippines', 'Republic of Korea', 'Russian Federation', 'Singapore', 'Sri Lanka', 'Thailand', 'Timor-Leste', 'Viet Nam']

ResearchColumns = ['Country', 'Year', 'Hepatitis B', 'Polio', 'Diphtheria', 'Schooling', 'Average Vaccination Rate']

FactorHealth1 = FactorHealth[FactorHealth['Country'].isin(Asian_ARFcountries)][ResearchColumns]
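
The filtering step can be illustrated on a small made-up table. The sketch also shows one possible way to check that every name in the list actually occurs in the data, since `.isin` silently ignores names that do not match (for example 'Laos' versus "Lao People's Democratic Republic"). The table `toy`, its values and the list `wanted` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['Japan', 'Canada', "Lao People's Democratic Republic"],
    'Year': [2015, 2015, 2015],
    'Schooling': [15.3, 16.3, 10.8],   # made-up values
})

wanted = ['Japan', 'Laos']  # 'Laos' is deliberately spelled differently

# Keep only rows whose country is in the list
subset = toy[toy['Country'].isin(wanted)]

# Names in the list that never occur in the data
unmatched = sorted(set(wanted) - set(toy['Country']))
```

An empty `unmatched` list would indicate that every requested country was found under exactly the spelling used in the list.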

# **Data cleaning**

**Data description**
- Hepatitis B and the Average Vaccination Rate have the same number of missing values (56 of the 336 rows)
- Schooling has 32 missing values. Since every country has 16 years of data, these could belong to just 2 countries

In [5]:
FactorHealth1.describe()

| Statistic | Year | Hepatitis B | Polio | Diphtheria | Schooling | Average Vaccination Rate |
|---|---|---|---|---|---|---|
| count | 336.0 | 280.0 | 334.0 | 334.0 | 304.0 | 280.0 |
| mean | 2007.5 | 81.878571 | 85.107784 | 85.532934 | 11.444079 | 85.158333 |
| std | 4.616647 | 24.493864 | 21.287544 | 18.27346 | 2.44325 | 17.845523 |
| min | 2000.0 | 4.0 | 5.0 | 5.0 | 0.0 | 8.0 |
| 25% | 2003.75 | 77.0 | 81.0 | 79.0 | 9.875 | 77.583333 |
| 50% | 2007.5 | 94.0 | 95.0 | 94.0 | 11.7 | 94.333333 |
| 75% | 2011.25 | 97.0 | 98.0 | 97.0 | 13.225 | 97.0 |
| max | 2015.0 | 99.0 | 99.0 | 99.0 | 15.4 | 99.0 |
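
Whether the 32 missing Schooling values really belong to just two countries can be checked by counting missing values per country. A minimal sketch on made-up data (the table `toy` is an assumption; in the notebook the same two statements would be applied to `FactorHealth1`):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Schooling': [10.1, 10.3, np.nan, np.nan, 12.0, np.nan],
})

# Count missing Schooling values per country
missing_per_country = toy['Schooling'].isna().groupby(toy['Country']).sum()

# Keep only countries that have at least one missing value
countries_with_gaps = missing_per_country[missing_per_country > 0]
```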


**Schooling**
- In the Schooling chart, neither North Korea nor South Korea appears

In [6]:
alt.Chart(FactorHealth1).mark_point().encode(
    x ='Year',
    y ='Schooling',
    color = 'Country'
)

**Hepatitis B**
- Japan, North Korea and South Korea have no data points (the last two because of missing values in Schooling)
- Some Hepatitis B values are around or below 10, probably because of a misplaced decimal point

In [7]:
alt.Chart(FactorHealth1).mark_point().encode(
    x ='Schooling',
    y ='Hepatitis B',
    color= 'Country'
)

**Polio**
- North Korea and South Korea are missing (because of missing values in Schooling)
- A few Polio values are around or below 10, probably because of a misplaced decimal point

In [8]:
alt.Chart(FactorHealth1).mark_point().encode(
    x ='Schooling',
    y ='Polio',
    color = 'Country'
)

**Diphtheria**
- North Korea and South Korea are missing (because of missing values in Schooling)
- Some Diphtheria values are around or below 10, probably because of a misplaced decimal point

In [9]:
alt.Chart(FactorHealth1).mark_point().encode(
    x ='Schooling',
    y ='Diphtheria',
    color = 'Country'
)

**Checking for typos in countries' names**
- The unique country values in the data were checked
- Every country appears under exactly one spelling, so there are no typos
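
The cell for this check is not shown above. One possible way to do it, assuming a DataFrame with a 'Country' column, is to list the distinct names together with their row counts. The data below is made up for illustration and deliberately contains one inconsistent spelling.

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['India', 'India', 'Viet Nam', 'Viet Nam', 'Vietnam'],
    'Year': [2014, 2015, 2014, 2015, 2013],
})

# Distinct spellings, sorted so near-duplicates end up next to each other
names = sorted(toy['Country'].unique())

# Rows per spelling; a typo shows up as a name with unusually few rows
rows_per_name = toy['Country'].value_counts()
```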

**Resolving the issues with missing and implausible values**
- A new table 'FactorHealth2' was made that keeps only the rows meeting these minimum values:
  - Hepatitis B: at least 12
  - Polio: at least 10
  - Diphtheria: at least 10
  - Schooling: at least 2

In [10]:
FactorHealth2 = FactorHealth1[
    (FactorHealth1['Hepatitis B'] >= 12.0)
    & (FactorHealth1['Diphtheria'] >= 10.0)
    & (FactorHealth1['Polio'] >= 10.0)
    & (FactorHealth1['Schooling'] >= 2.0)
]
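
Because a comparison with a missing value evaluates to False, a filter of this kind removes rows with missing values as well as rows below the thresholds. The sketch below shows this on made-up data and counts how many rows are removed; `toy` and `thresholds` are illustrative names, and only two of the four columns are used for brevity.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Polio': [95.0, 9.0, np.nan, 88.0],
    'Schooling': [12.1, 11.0, 10.5, np.nan],
})

thresholds = {'Polio': 10.0, 'Schooling': 2.0}

# Build one boolean mask: every column must reach its minimum
mask = pd.Series(True, index=toy.index)
for column, minimum in thresholds.items():
    mask &= toy[column] >= minimum   # NaN >= minimum is False

cleaned = toy[mask]
removed = len(toy) - len(cleaned)
```

Comparing the row counts before and after filtering is a quick way to see how much data the cleaning step costs.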

# **Results**

**Determining correlation of every country**
- The correlation for each country is computed with the '.corr()' function
- A dropdown selects the country, so that countries can be compared with each other
- 'fixed' is imported alongside 'interact' so that the DataFrame is passed to the function unchanged instead of being turned into a widget
- The widget is created twice so that two countries can be viewed side by side

**Visualization of every correlation**
- A scatterplot with a regression line, with a dropdown to select the country
- This widget is also created twice so that two graphs can be compared with each other


In [11]:
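from IPython.display import display
from ipywidgets import interact, fixed

#Select the rows of a single country
def Country_data(Country, FactorHealth2):
  return FactorHealth2.loc[FactorHealth2['Country'] == Country]

#Correlation matrix of Schooling and Average Vaccination Rate for one country
def correlation(Country, FactorHealth2):
  Countries_data = Country_data(Country, FactorHealth2)
  CorrCountry = Countries_data[['Schooling', 'Average Vaccination Rate']].corr()
  return CorrCountry

#Show the correlation matrix of the country chosen in the dropdown
def interact_visualization3(Country, FactorHealth2):
  Visualization = correlation(Country, FactorHealth2)
  display(Visualization)

#Dropdown with all countries; fixed() passes the DataFrame through unchanged
interact(interact_visualization3,
         Country=FactorHealth2['Country'].unique(),
         FactorHealth2=fixed(FactorHealth2));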

*Output: interactive widget with a 'Country' dropdown*

In [12]:
#Second correlation widget for side-by-side comparison, reusing the functions defined in the previous cell
interact(interact_visualization3,
         Country=FactorHealth2['Country'].unique(),
         FactorHealth2=fixed(FactorHealth2));

*Output: interactive widget with a 'Country' dropdown*

In [13]:
#Scatterplot of Schooling against Average Vaccination Rate with a regression line, per country
from ipywidgets import interact, fixed

#Select the rows of a single country
def Country_data(Country, FactorHealth2):
  return FactorHealth2.loc[FactorHealth2['Country'] == Country]

def interact_visualization(Country, FactorHealth2):
  Visualization = Country_data(Country, FactorHealth2)
  #Data points of the selected country
  ChartPoint = alt.Chart(Visualization).mark_point().encode(x = 'Schooling', y = 'Average Vaccination Rate')
  #Overlay a regression line on the points
  ChartTotal = ChartPoint + ChartPoint.transform_regression('Schooling', 'Average Vaccination Rate').mark_line()
  display(ChartTotal)

interact(interact_visualization,
         Country=FactorHealth2['Country'].unique(),
         FactorHealth2=fixed(FactorHealth2));

*Output: interactive widget with a 'Country' dropdown*

In [14]:
#Second scatterplot widget for side-by-side comparison, reusing the functions defined in the previous cell
interact(interact_visualization,
         Country=FactorHealth2['Country'].unique(),
         FactorHealth2=fixed(FactorHealth2));

*Output: interactive widget with a 'Country' dropdown*
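
The line drawn by `transform_regression` is, with Altair's default settings, a least-squares straight line. As a rough sketch of what that line represents, the slope and intercept can also be computed with `numpy.polyfit`. The numbers below are made up and chosen so that the fit is exact; the variable names are illustrative.

```python
import numpy as np

# Made-up data lying exactly on y = 2x + 60
schooling = np.array([8.0, 10.0, 12.0, 14.0])
vaccination = 2.0 * schooling + 60.0

# A degree-1 polynomial fit returns (slope, intercept)
slope, intercept = np.polyfit(schooling, vaccination, 1)
```

A positive slope means the vaccination rate tends to rise with schooling. The slope alone says nothing about how tightly the points follow the line, which is what the correlation measures.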

**Results of comparisons between countries**
- The correlation between the average vaccination rate and schooling varies a lot per country
- Examples:
  - High correlations: China, Laos and India
  - Low correlations: Viet Nam, Cambodia and Singapore
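
Instead of selecting countries one by one, the per-country correlations can also be collected in a single sorted table. A minimal sketch on made-up data, assuming the same column names as in `FactorHealth2` (the table `toy` and its values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['A'] * 3 + ['B'] * 3,
    'Schooling': [9.0, 10.0, 11.0, 9.0, 10.0, 11.0],
    'Average Vaccination Rate': [80.0, 85.0, 90.0, 95.0, 93.0, 91.0],
})

columns = ['Schooling', 'Average Vaccination Rate']

# Pearson correlation between the two columns, computed per country
corr_per_country = (
    toy.groupby('Country')[columns]
       .apply(lambda g: g['Schooling'].corr(g['Average Vaccination Rate']))
       .sort_values(ascending=False)
)
```

With at most 16 data points per country, individual correlations should be read with some caution.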

**Overall correlation between Average Vaccination Rate and Schooling**
- A new table 'FactorHealth3' was made with the mean Schooling and the mean Average Vaccination Rate of every country
- The correlation was determined using the '.corr()' function
- The result is visualized in a scatterplot with a regression line



In [15]:
#Mean Schooling and mean Average Vaccination Rate per country, and the correlation between them
FactorHealth3 = FactorHealth2.groupby('Country')[['Schooling', 'Average Vaccination Rate']].mean()
FactorHealth3.corr()

| | Schooling | Average Vaccination Rate |
|---|---|---|
| Schooling | 1.0 | 0.641809 |
| Average Vaccination Rate | 0.641809 | 1.0 |


In [16]:
#Scatterplot of the per-country means with a regression line
chart = alt.Chart(FactorHealth3).mark_point().encode(
    x = 'Schooling',
    y = 'Average Vaccination Rate'
)
chart + chart.transform_regression('Schooling', 'Average Vaccination Rate').mark_line()

**Results overall correlation**
- The per-country means show a moderate positive correlation (about 0.64) between schooling and the average vaccination rate
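
For reference, the '.corr()' function in pandas computes the Pearson correlation coefficient by default. With $x_i$ the mean schooling and $y_i$ the mean average vaccination rate of country $i$ (notation introduced here for illustration only), it is defined as:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

With $r = 0.641809$, $r^2 \approx 0.41$, so a straight-line fit accounts for roughly 41% of the variance in the country means.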

**Comparison of countries over the years**
- The same approach as for the previous interactive graphs, except that 'Country' is encoded as color and 'Year' is the dropdown

In [17]:
#Select the rows of a single year
def Year_data(Year, FactorHealth2):
  return FactorHealth2.loc[FactorHealth2['Year'] == Year]

#Scatterplot of all countries in the chosen year, one color per country
def interact_visualization1(Year, FactorHealth2):
  Visualization1 = Year_data(Year, FactorHealth2)
  ChartTotal1 = alt.Chart(Visualization1).mark_point().encode(x = 'Schooling', y = 'Average Vaccination Rate', color = 'Country')
  display(ChartTotal1)

interact(interact_visualization1, Year = FactorHealth2['Year'].unique(), FactorHealth2=fixed(FactorHealth2));

*Output: interactive widget with a 'Year' dropdown*

**Results of countries over the years**
- The development over the years also varies a lot per country
- Examples:
  - Malaysia had high schooling and a high vaccination rate in every year
  - In India, both Schooling and the Average Vaccination Rate increased over the years
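
Trends such as the one described for India can also be summarised without the dropdown, by comparing each country's earliest and latest available year. A minimal sketch on made-up data (the table `toy` and its values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Year': [2000, 2015, 2000, 2015],
    'Schooling': [8.0, 11.0, 12.5, 12.5],
    'Average Vaccination Rate': [60.0, 88.0, 96.0, 97.0],
})

columns = ['Schooling', 'Average Vaccination Rate']

# Sort by year so 'first' and 'last' refer to the earliest and latest rows
by_country = toy.sort_values('Year').groupby('Country')[columns]

# Change between the earliest and latest available year per country
change = by_country.last() - by_country.first()
```

In this toy table, country A improves on both measures while country B stays almost constant, which mirrors the two kinds of patterns listed above.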

# **Conclusion**

- The correlation between the Average Vaccination Rate and Schooling varies per country
  - The same holds when comparing countries over the years
- Overall, there is a moderate positive correlation (about 0.64) between the Average Vaccination Rate and Schooling