# **Research question and short introduction**

*What is the relation between schooling and the vaccination rate in different Asian countries of the ASEAN Regional Forum (ARF)?*
1. What is the relation between schooling and the vaccination rate in some Asian ARF countries?
2. How do they compare to other Asian ARF countries?
3. What is the overall correlation between schooling and the vaccination rate across the Asian ARF countries?
4. How do the countries compare over the years?

ARF stands for ASEAN Regional Forum. Its members are:
Australia, Bangladesh, Brunei Darussalam, Cambodia, Canada, China, Democratic People’s Republic of Korea, European Union, India, Indonesia, Japan, Lao PDR, Malaysia, Mongolia, Myanmar, New Zealand, Pakistan, Papua New Guinea, Philippines, Republic of Korea, Russia, Singapore, Sri Lanka, Thailand, Timor-Leste, United States, and Viet Nam.

Not all ARF members are Asian (e.g. the United States and the European Union), so the non-Asian members are excluded from the research. The Asian ARF countries that remain are: Bangladesh, Brunei Darussalam (Brunei), Cambodia, China, Democratic People’s Republic of Korea (North Korea), India, Indonesia, Japan, Lao PDR (Laos), Malaysia, Mongolia, Myanmar, Pakistan, Philippines, Republic of Korea (South Korea), Russia, Singapore, Sri Lanka, Thailand, Timor-Leste, and Viet Nam.

# **Data collection for research**

**Loading the dataset**
- To load the dataset, pandas was imported:

In [1]:
#Load the dataset
import pandas as pd
import numpy as np
import altair as alt
import ipywidgets as widgets
from ipywidgets import interact

FactorHealth = pd.read_csv('https://raw.githubusercontent.com/NHameleers/dtz2025-datasets/master/CountryHealthFactors.csv')
FactorHealth.head(n=17)



Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
5,Afghanistan,2010,Developing,58.8,279.0,74,0.01,79.679367,66.0,1989,...,66.0,9.2,66.0,0.1,553.32894,2883167.0,18.4,18.4,0.448,9.2
6,Afghanistan,2009,Developing,58.6,281.0,77,0.01,56.762217,63.0,2861,...,63.0,9.42,63.0,0.1,445.893298,284331.0,18.6,18.7,0.434,8.9
7,Afghanistan,2008,Developing,58.1,287.0,80,0.03,25.873925,64.0,1599,...,64.0,8.33,64.0,0.1,373.361116,2729431.0,18.8,18.9,0.433,8.7
8,Afghanistan,2007,Developing,57.5,295.0,82,0.02,10.910156,63.0,1141,...,63.0,6.73,63.0,0.1,369.835796,26616792.0,19.0,19.1,0.415,8.4
9,Afghanistan,2006,Developing,57.3,295.0,84,0.03,17.171518,64.0,1990,...,58.0,7.43,58.0,0.1,272.56377,2589345.0,19.2,19.3,0.405,8.1


In [2]:
FactorHealth = FactorHealth.rename(columns=str.strip)

To determine the average vaccination rate, a new column 'Average Vaccination Rate' was calculated as the mean of 'Hepatitis B', 'Polio', and 'Diphtheria' (their sum divided by 3).

In [3]:
FactorHealth['Average Vaccination Rate'] = (FactorHealth['Hepatitis B'] + FactorHealth['Polio'] + FactorHealth['Diphtheria']) / 3
FactorHealth

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling,Average Vaccination Rate
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1,45.333333
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0,60.666667
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9,63.333333
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8,67.000000
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5,68.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,Zimbabwe,2004,Developing,44.3,723.0,27,4.36,0.000000,68.0,31,...,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2,66.666667
2934,Zimbabwe,2003,Developing,44.5,715.0,26,4.06,0.000000,7.0,998,...,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5,27.333333
2935,Zimbabwe,2002,Developing,44.8,73.0,25,4.43,0.000000,73.0,304,...,6.53,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0,72.333333
2936,Zimbabwe,2001,Developing,45.3,686.0,25,1.72,0.000000,76.0,529,...,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8,75.666667
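
A side note on the calculation above: adding the three columns and dividing by 3 yields NaN as soon as one vaccine value is missing. The sketch below, on a few invented toy rows (not taken from the dataset), contrasts that behaviour with a row-wise mean that skips missing values. The names `toy`, `strict_mean` and `lenient_mean` are made up for this illustration; which behaviour is preferable depends on the research design.

```python
import numpy as np
import pandas as pd

# Toy rows with invented values, only to illustrate the NaN behaviour
toy = pd.DataFrame({
    'Hepatitis B': [90.0, np.nan],
    'Polio': [96.0, 92.0],
    'Diphtheria': [93.0, 75.0],
})
vaccines = ['Hepatitis B', 'Polio', 'Diphtheria']

# Sum-and-divide, as in the notebook: one missing vaccine makes the result NaN
strict_mean = (toy['Hepatitis B'] + toy['Polio'] + toy['Diphtheria']) / 3

# Row-wise mean: skips NaN by default and averages the vaccines that are present
lenient_mean = toy[vaccines].mean(axis=1)

print(strict_mean.tolist())   # [93.0, nan]
print(lenient_mean.tolist())  # [93.0, 83.5]
```

The notebook's approach is the stricter one: a row only gets an average when all three vaccination rates are known.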


**Making a new dataset including values for the research**
- To include only the Asian ARF countries, a list called 'Asian_ARFcountries' was made with their names as they are spelled in the dataset
- The columns needed for the research were put in a list called 'ResearchColumns'
- The new table contains only these research columns, and only the rows of the Asian ARF countries, selected with the '.isin()' method

In [4]:
Asian_ARFcountries = ['Bangladesh', 'Brunei Darussalam', 'China', 'Cambodia', "Democratic People's Republic of Korea", 'India', 'Indonesia', 'Japan', "Lao People's Democratic Republic", 'Malaysia', 'Mongolia', 'Myanmar', 'Pakistan',
'Philippines', 'Republic of Korea', 'Russian Federation', 'Singapore', 'Sri Lanka', 'Thailand', 'Timor-Leste', 'Viet Nam']

ResearchColumns = ['Country', 'Year', 'Hepatitis B', 'Polio', 'Diphtheria', 'Schooling', 'Average Vaccination Rate']

FactorHealth1 = FactorHealth[FactorHealth['Country'].isin(Asian_ARFcountries)][ResearchColumns]
FactorHealth1

Unnamed: 0,Country,Year,Hepatitis B,Polio,Diphtheria,Schooling,Average Vaccination Rate
192,Bangladesh,2015,97.0,97.0,97.0,10.2,97.000000
193,Bangladesh,2014,97.0,97.0,97.0,10.0,97.000000
194,Bangladesh,2013,96.0,96.0,96.0,10.0,96.000000
195,Bangladesh,2012,94.0,94.0,94.0,9.9,94.000000
196,Bangladesh,2011,96.0,96.0,96.0,9.4,96.000000
...,...,...,...,...,...,...,...
2885,Viet Nam,2004,94.0,96.0,96.0,11.0,95.333333
2886,Viet Nam,2003,78.0,96.0,99.0,10.9,91.000000
2887,Viet Nam,2002,,92.0,75.0,10.7,
2888,Viet Nam,2001,,96.0,96.0,10.6,


# **Data cleaning**

**Data description**
- 'Hepatitis B' and 'Average Vaccination Rate' have the same number of missing values (both have a count of 280 out of 336 rows)
- 'Schooling' has 32 missing values (count of 304 out of 336 rows). Since every country has 16 years of data (2000 to 2015), the missing values could all belong to just 2 countries

In [5]:
FactorHealth1.describe()

Unnamed: 0,Year,Hepatitis B,Polio,Diphtheria,Schooling,Average Vaccination Rate
count,336.0,280.0,334.0,334.0,304.0,280.0
mean,2007.5,81.878571,85.107784,85.532934,11.444079,85.158333
std,4.616647,24.493864,21.287544,18.27346,2.44325,17.845523
min,2000.0,4.0,5.0,5.0,0.0,8.0
25%,2003.75,77.0,81.0,79.0,9.875,77.583333
50%,2007.5,94.0,95.0,94.0,11.7,94.333333
75%,2011.25,97.0,98.0,97.0,13.225,97.0
max,2015.0,99.0,99.0,99.0,15.4,99.0
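
The guess that the missing Schooling values belong to only a couple of countries can be checked directly by counting missing values per country. The sketch below does this on an invented toy table; `toy` and `missing_per_country` are hypothetical names, and in the notebook the same two lines would be applied to 'FactorHealth1'.

```python
import numpy as np
import pandas as pd

# Toy data with invented values: country B has no Schooling at all, C misses one year
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Year': [2014, 2015] * 3,
    'Schooling': [10.0, 10.2, np.nan, np.nan, 12.1, np.nan],
})

# Count the missing Schooling values within each country
missing_per_country = toy['Schooling'].isna().groupby(toy['Country']).sum()

# Keep only the countries that actually have gaps
missing_per_country = missing_per_country[missing_per_country > 0]
print(missing_per_country.to_dict())  # {'B': 2, 'C': 1}
```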


**Schooling**
- For charts, altair was imported
- In the chart of Schooling per year, North Korea and South Korea do not appear, so their Schooling values are missing

In [6]:
alt.Chart(FactorHealth1).mark_point().encode(
    x ='Year',
    y ='Schooling',
    color = 'Country'
)

**Hepatitis B**
- Japan, North Korea and South Korea have no data points (the last two because of missing values in Schooling)
- Some Hepatitis B values are around or below 10, probably because of a misplaced decimal point

In [7]:
alt.Chart(FactorHealth1).mark_point().encode(
    x ='Schooling',
    y ='Hepatitis B',
    color= 'Country'
)

**Polio**
- North Korea and South Korea are missing (because of no values in Schooling)
- A few Polio values are around or below 10, probably because of a misplaced decimal point

In [8]:
alt.Chart(FactorHealth1).mark_point().encode(
    x ='Schooling',
    y ='Polio',
    color = 'Country'
)

**Diphtheria**
- North Korea and South Korea are missing (because of missing values in Schooling)
- Some Diphtheria values are around or below 10, probably because of a misplaced decimal point

In [9]:
alt.Chart(FactorHealth1).mark_point().encode(
    x ='Schooling',
    y ='Diphtheria',
    color = 'Country'
)

**Checking for typos in countries' names**
- Checking the unique country values in the data file
- Every country appears under exactly one spelling, so there are no typos

In [10]:
FactorHealth1["Country"].unique()

array(['Bangladesh', 'Brunei Darussalam', 'Cambodia', 'China',
       "Democratic People's Republic of Korea", 'India', 'Indonesia',
       'Japan', "Lao People's Democratic Republic", 'Malaysia',
       'Mongolia', 'Myanmar', 'Pakistan', 'Philippines',
       'Republic of Korea', 'Russian Federation', 'Singapore',
       'Sri Lanka', 'Thailand', 'Timor-Leste', 'Viet Nam'], dtype=object)
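
A related check, sketched here under the assumption that a misspelled name in the selection list is the more likely mistake: '.isin()' silently ignores names that do not occur in the dataset, so a set difference can reveal them. The toy table and the deliberately wrong entry 'Laos' are invented for the illustration.

```python
import pandas as pd

# Hypothetical mini-dataset using the dataset's official spellings
toy = pd.DataFrame({
    'Country': ['Japan', "Lao People's Democratic Republic", 'Viet Nam'],
})

# Selection list with one name ('Laos') that does not match the dataset's spelling
wanted = ['Japan', 'Laos', 'Viet Nam']

# Names in the list that .isin() would never match
unmatched = sorted(set(wanted) - set(toy['Country'].unique()))
print(unmatched)  # ['Laos']
```

An empty result means every name in the list was found at least once.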

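**Resolving the issues with missing and implausible data**
- Making a new DataFrame, 'FactorHealth2', that keeps only the rows with:
  - Hepatitis B of at least 12
  - Polio of at least 10
  - Diphtheria of at least 10
  - Schooling of at least 2
- Rows with a missing value in one of these columns are dropped as well, because a comparison with NaN evaluates to False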

In [11]:
FactorHealth2 = FactorHealth1[(FactorHealth1['Hepatitis B'] >= 12.0) & (FactorHealth1['Diphtheria'] >= 10.0) & (FactorHealth1['Polio'] >= 10.0) & (FactorHealth1['Schooling'] >= 2.0)]
FactorHealth2

Unnamed: 0,Country,Year,Hepatitis B,Polio,Diphtheria,Schooling,Average Vaccination Rate
192,Bangladesh,2015,97.0,97.0,97.0,10.2,97.000000
193,Bangladesh,2014,97.0,97.0,97.0,10.0,97.000000
194,Bangladesh,2013,96.0,96.0,96.0,10.0,96.000000
195,Bangladesh,2012,94.0,94.0,94.0,9.9,94.000000
196,Bangladesh,2011,96.0,96.0,96.0,9.4,96.000000
...,...,...,...,...,...,...,...
2882,Viet Nam,2007,67.0,92.0,92.0,11.4,83.666667
2883,Viet Nam,2006,93.0,94.0,94.0,11.3,93.666667
2884,Viet Nam,2005,94.0,94.0,95.0,11.1,94.333333
2885,Viet Nam,2004,94.0,96.0,96.0,11.0,95.333333
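
The threshold filter above removes two kinds of rows at once: implausibly low values and missing values, since any comparison with NaN is False. A minimal sketch on invented toy rows (the names `toy`, `mask` and `kept` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy rows with invented values: a normal row, a suspiciously low value, a missing value
toy = pd.DataFrame({
    'Hepatitis B': [97.0, 9.0, np.nan],
    'Schooling': [10.2, 11.0, 10.7],
})

# Same style of condition as in the notebook, reduced to two columns
mask = (toy['Hepatitis B'] >= 12.0) & (toy['Schooling'] >= 2.0)
kept = toy[mask]

print(mask.tolist())  # [True, False, False]
print(len(kept))      # 1
```

Note that this approach drops suspicious rows rather than correcting them, so each country may end up with fewer than 16 years of data.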


# **Results**

**Determining correlation of every country**
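- A function computes the correlation between Schooling and Average Vaccination Rate for one country, using the '.corr()' method
- A dropdown selects the country, so that countries can be compared with each other
- 'fixed' is imported together with 'interact', so that the DataFrame is passed through unchanged ('FactorHealth2=fixed(FactorHealth2)') instead of being turned into a widget
- The cell is duplicated so that the correlations of two countries can be shown side by side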

**Visualization of every correlation**
- Scatterplot with a regression line, with a dropdown to select the country
- The cell is duplicated so that the graphs of two countries can be compared side by side


In [12]:
from ipywidgets import interact, fixed

def Country_data(Country, FactorHealth2):
  return FactorHealth2.loc[FactorHealth2['Country'] == Country]

def correlation(Country, FactorHealth2):
  Countries_data = Country_data(Country, FactorHealth2)
  CorrCountry = Countries_data[['Schooling', 'Average Vaccination Rate']].corr()
  return CorrCountry

def interact_visualization3(Country, FactorHealth2):
  Visualization = correlation(Country, FactorHealth2)
  display(Visualization)

interact(interact_visualization3,
         Country=FactorHealth2['Country'].unique(),
         FactorHealth2=fixed(FactorHealth2));

interactive(children=(Dropdown(description='Country', options=('Bangladesh', 'Brunei Darussalam', 'Cambodia', …

In [13]:
#Second copy of the previous cell, so that two countries can be shown side by side
from ipywidgets import interact, fixed

def Country_data(Country, FactorHealth2):
  return FactorHealth2.loc[FactorHealth2['Country'] == Country]

def correlation(Country, FactorHealth2):
  Countries_data = Country_data(Country, FactorHealth2)
  CorrCountry = Countries_data[['Schooling', 'Average Vaccination Rate']].corr()
  return CorrCountry

def interact_visualization3(Country, FactorHealth2):
  Visualization = correlation(Country, FactorHealth2)
  display(Visualization)

interact(interact_visualization3,
         Country=FactorHealth2['Country'].unique(),
         FactorHealth2=fixed(FactorHealth2));

interactive(children=(Dropdown(description='Country', options=('Bangladesh', 'Brunei Darussalam', 'Cambodia', …

In [14]:
#Correlation of each country compared to other countries
from ipywidgets import interact, fixed

def Country_data(Country, FactorHealth2):
  return FactorHealth2.loc[FactorHealth2['Country'] == Country]

def interact_visualization(Country, FactorHealth2):
  Visualization = Country_data(Country, FactorHealth2)
  ChartPoint = alt.Chart(Visualization).mark_point().encode(x = 'Schooling', y = 'Average Vaccination Rate')
  ChartTotal = ChartPoint + ChartPoint.transform_regression('Schooling', 'Average Vaccination Rate').mark_line()
  display(ChartTotal)

interact(interact_visualization,
         Country=FactorHealth2['Country'].unique(),
         FactorHealth2=fixed(FactorHealth2));
#Vietnam (low) and China (high)
#Laos (high) and Cambodia (low)
#Singapore (low) and India (high)

interactive(children=(Dropdown(description='Country', options=('Bangladesh', 'Brunei Darussalam', 'Cambodia', …

In [15]:
#Second copy of the previous chart cell, so that two countries can be compared side by side
def Country_data(Country, FactorHealth2):
  return FactorHealth2.loc[FactorHealth2['Country'] == Country]

def interact_visualization(Country, FactorHealth2):
  Visualization = Country_data(Country, FactorHealth2)
  ChartPoint = alt.Chart(Visualization).mark_point().encode(x = 'Schooling', y = 'Average Vaccination Rate')
  ChartTotal = ChartPoint + ChartPoint.transform_regression('Schooling', 'Average Vaccination Rate').mark_line()
  display(ChartTotal)

interact(interact_visualization,
         Country=FactorHealth2['Country'].unique(),
         FactorHealth2=fixed(FactorHealth2));

interactive(children=(Dropdown(description='Country', options=('Bangladesh', 'Brunei Darussalam', 'Cambodia', …

**Results of country comparisons**
- The correlation between average vaccination rate and schooling varies a lot per country
  - High correlations: China, Laos and India
  - Low correlations: Vietnam, Cambodia and Singapore
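
As an alternative to reading the correlations one by one from the dropdown, they could be collected in a single ranked table. The sketch below assumes a DataFrame shaped like 'FactorHealth2' but uses invented toy values; `toy` and `per_country` are hypothetical names.

```python
import pandas as pd

# Toy data with invented values: A rises with schooling, B falls
toy = pd.DataFrame({
    'Country': ['A'] * 4 + ['B'] * 4,
    'Schooling': [9.0, 9.5, 10.0, 10.5, 12.0, 12.5, 13.0, 13.5],
    'Average Vaccination Rate': [80.0, 85.0, 90.0, 95.0, 99.0, 97.0, 95.0, 93.0],
})

# Pearson correlation between the two columns, computed separately per country
per_country = pd.Series({
    country: group['Schooling'].corr(group['Average Vaccination Rate'])
    for country, group in toy.groupby('Country')
}).sort_values(ascending=False)

print(per_country.round(2))
```

With only a handful of years per country, such coefficients are sensitive to single data points, so the ranking is best read as indicative.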

**Overall correlation between Average Vaccination Rate and Schooling**
- A new DataFrame, 'FactorHealth3', holds the mean Schooling and the mean Average Vaccination Rate of every country
- The correlation between these per-country means is determined with the '.corr()' method
- The means are visualized in a scatterplot with a regression line



In [16]:
#Mean schooling and mean average vaccination rate per country, and their correlation
FactorHealth3 = FactorHealth2.groupby('Country')[['Schooling', 'Average Vaccination Rate']].mean()
FactorHealth3.loc[:, ['Schooling', 'Average Vaccination Rate']].corr()

Unnamed: 0,Schooling,Average Vaccination Rate
Schooling,1.0,0.641809
Average Vaccination Rate,0.641809,1.0


In [17]:
#Scatterplot of the per-country means with a regression line
chart = alt.Chart(FactorHealth3).mark_point().encode(
    x = 'Schooling',
    y = 'Average Vaccination Rate'
)
chart + chart.transform_regression('Schooling', 'Average Vaccination Rate').mark_line()

**Results of the overall comparison**
- A moderate positive correlation (r ≈ 0.64) between the per-country means of Schooling and Average Vaccination Rate
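
The overall figure is computed on per-country means, which is one of several ways to summarise the relation. Pooling all country-year rows is another, and the two can differ noticeably. The sketch below shows both on invented toy values; `toy`, `means_corr` and `pooled_corr` are hypothetical names, and the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy data with invented values for three countries and two years each
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Schooling': [9.0, 10.0, 11.0, 12.0, 13.0, 14.0],
    'Average Vaccination Rate': [80.0, 90.0, 95.0, 85.0, 90.0, 98.0],
})
cols = ['Schooling', 'Average Vaccination Rate']

# Approach used in the notebook: average per country first, then correlate
country_means = toy.groupby('Country')[cols].mean()
means_corr = country_means['Schooling'].corr(country_means['Average Vaccination Rate'])

# Alternative: correlate all country-year rows at once
pooled_corr = toy['Schooling'].corr(toy['Average Vaccination Rate'])

print(round(means_corr, 3), round(pooled_corr, 3))
```

Averaging first removes the year-to-year variation within a country, which in this toy example makes the relation look stronger than in the pooled rows.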

**Comparison of countries over the years**
- Same approach as for the previous interactive graphs, except that 'Country' is encoded as the color and 'Year' is selected with the dropdown

In [18]:
def Year_data(Year, FactorHealth2):
  return FactorHealth2.loc[FactorHealth2['Year'] == Year]


def interact_visualization1(Year, FactorHealth2):
  Visualization1 = Year_data(Year, FactorHealth2)
  ChartTotal1 = alt.Chart(Visualization1).mark_point().encode(x = 'Schooling', y = 'Average Vaccination Rate', color = 'Country')
  display(ChartTotal1)

interact(interact_visualization1, Year = FactorHealth2['Year'].unique(), FactorHealth2=fixed(FactorHealth2));

interactive(children=(Dropdown(description='Year', options=(2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 20…

**Results of countries over the years**
- The development over the years also varies a lot per country
- Examples
  - Malaysia had high schooling and a high vaccination rate in every year
  - In India, Schooling and Average Vaccination Rate both increased over the years
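
Observations like these can be backed up with numbers by comparing each country's first and last available year. The sketch below uses invented toy values, where country A grows and country B stays flat; `toy`, `first`, `last` and `change` are hypothetical names, and in the notebook the same steps would be applied to 'FactorHealth2'.

```python
import pandas as pd

# Toy data with invented values
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2013, 2014, 2015] * 2,
    'Schooling': [9.0, 9.5, 10.0, 13.0, 13.0, 13.0],
    'Average Vaccination Rate': [70.0, 80.0, 90.0, 97.0, 96.0, 97.0],
})
cols = ['Schooling', 'Average Vaccination Rate']

# Sort by year so that first() and last() refer to the earliest and latest year
ordered = toy.sort_values('Year')
first = ordered.groupby('Country')[cols].first()
last = ordered.groupby('Country')[cols].last()

# Change between the earliest and the latest year, per country
change = last - first
print(change)
```

A positive value in both columns corresponds to the pattern described for India; values near zero at a high level correspond to the pattern described for Malaysia.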

# **Conclusion**

- The correlation between Average Vaccination Rate and Schooling varies per country
  - The same holds when comparing the countries over the years
- Overall, there is a moderate positive correlation (r ≈ 0.64) between the per-country means of Average Vaccination Rate and Schooling