# Explore source data

#### This notebook explores the source CSV and Excel files to retrieve useful data for the project, performing some cleaning tasks along the way

##### Available in the data/source folder:

        - WMR2022_Annex_4F.xlsx
        - WMR2021_Annex5J.xlsx and WMR2022_Annex_4J.xlsx
        - malaria-in-africa.csv
        - DatasetAfricaMalaria.csv
        - confirmed_cases.csv, presumed_cases.csv and total_malaria_cases (confimed+presumed).csv
        - Population_access_to_ITN.csv
        - pr_timeseries_annual_cru_1901-2021_BFA.csv, pr_timeseries_annual_cru_1901-2021_GHA.csv, pr_timeseries_annual_cru_1901-2021_KEN.csv and pr_timeseries_annual_cru_1901-2021_MWI.csv
        - tasmin_timeseries_annual_cru_1901-2021_BFA.csv, tasmin_timeseries_annual_cru_1901-2021_GHA.csv, tasmin_timeseries_annual_cru_1901-2021_KEN.csv and tasmin_timeseries_annual_cru_1901-2021_MWI.csv
        - API_19_DS2_en_csv_v2_5729899.csv

In [1]:
""" Import packages """
import sys

import numpy as np
import pandas as pd

sys.path.insert(1, '../scripts/')


In [2]:
from explorer import DataTransformer

transformer = DataTransformer()


In [3]:
# Declaring notebook variables
REPO = "./"
filepath_1 = "../data/source/WMR2022_Annex_4F.xlsx"
filepath_21 = "../data/source/WMR2021_Annex5J.xlsx"
filepath_22 = "../data/source/WMR2022_Annex_4J.xlsx"
filepath_3 = "../data/source/malaria-in-africa.csv"
filepath_4 = "../data/source/DatasetAfricaMalaria.csv"
filepath_51 = "../data/source/confirmed_cases.csv"
filepath_52 = "../data/source/presumed_cases.csv"
filepath_53 = "../data/source/total_malaria_cases (confimed+presumed).csv"
filepath_6 = "../data/source/Population_access_to_ITN.csv"
filepath_711 = "../data/source/pr_timeseries_annual_cru_1901-2021_NGA.csv"
filepath_721 = "../data/source/pr_timeseries_annual_cru_1901-2021_BFA.csv"
filepath_731 = "../data/source/pr_timeseries_annual_cru_1901-2021_GHA.csv"
filepath_741 = "../data/source/pr_timeseries_annual_cru_1901-2021_KEN.csv"
filepath_751 = "../data/source/pr_timeseries_annual_cru_1901-2021_MWI.csv"
filepath_712 = "../data/source/tasmin_timeseries_annual_cru_1901-2021_NGA.csv"
filepath_722 = "../data/source/tasmin_timeseries_annual_cru_1901-2021_BFA.csv"
filepath_732 = "../data/source/tasmin_timeseries_annual_cru_1901-2021_GHA.csv"
filepath_742 = "../data/source/tasmin_timeseries_annual_cru_1901-2021_KEN.csv"
filepath_752 = "../data/source/tasmin_timeseries_annual_cru_1901-2021_MWI.csv"
filepath_8 = "../data/source/Annual_Surface_Temperature_Change.csv"
filepath_9 = "../data/source/API_19_DS2_en_csv_v2_5729899.csv"


### Exploring WMR2022_Annex_4F.xlsx - Population denominator for case incidence and mortality rate, and estimated malaria cases and deaths (WHO World Malaria Report)

In [4]:
""" Import the dataset """
source_df1 = transformer.load_data(ext="xlsx", filepath=filepath_1, repo=REPO, nrows=2538)

# Display first rows
display(source_df1.head())

# Display last rows
display(source_df1.tail())

Unnamed: 0,WHO region Country/area,Year,Population denominator for incidence and mortality rate,Cases_Lower,Cases_Point,Cases_Upper,Deaths_Lower,Deaths_Point,Deaths_Upper
0,AFRICAN,,,,,,,,
1,"Algeria1,2,3",2000.0,1807701.0,-,34.0,-,-,2.0,-
2,,2001.0,1832745.0,-,6.0,-,-,1.0,-
3,,2002.0,1857634.0,-,10.0,-,-,0.0,-
4,,2003.0,1882962.0,-,5.0,-,-,0.0,-


Unnamed: 0,WHO region Country/area,Year,Population denominator for incidence and mortality rate,Cases_Lower,Cases_Point,Cases_Upper,Deaths_Lower,Deaths_Point,Deaths_Upper
2532,,2017.0,3950178000.0,219000000,236597200.0,257000000,553000,587091.0,654000
2533,,2018.0,4008394000.0,214000000,231237356.0,252000000,532000,566857.0,642000
2534,,2019.0,4065968000.0,213000000,232401031.0,255000000,532000,568392.0,654000
2535,,2020.0,4122687000.0,222000000,245085135.0,273000000,583000,624525.0,747000
2536,,2021.0,4176318000.0,224000000,247430170.0,276000000,577000,618975.0,754000


In [5]:
# Select important columns
keep_columns = [*source_df1.columns[0:3], "Cases_Point", "Deaths_Point"]

# Subset source_df1
source_df1 = source_df1[keep_columns]

display(source_df1.head())

Unnamed: 0,WHO region Country/area,Year,Population denominator for incidence and mortality rate,Cases_Point,Deaths_Point
0,AFRICAN,,,,
1,"Algeria1,2,3",2000.0,1807701.0,34.0,2.0
2,,2001.0,1832745.0,6.0,1.0
3,,2002.0,1857634.0,10.0,0.0
4,,2003.0,1882962.0,5.0,0.0


In [6]:
# Rename columns
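# Reassign instead of using inplace=True: source_df1 is a column subset,
# so an in-place rename can trigger a SettingWithCopyWarning
source_df1 = source_df1.rename(
    columns={
        source_df1.columns[0]: "Country",
        "Cases_Point": "Cases",
        "Deaths_Point": "Deaths"
    }
)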

In [7]:
source_df1.head()

Unnamed: 0,Country,Year,Population denominator for incidence and mortality rate,Cases,Deaths
0,AFRICAN,,,,
1,"Algeria1,2,3",2000.0,1807701.0,34.0,2.0
2,,2001.0,1832745.0,6.0,1.0
3,,2002.0,1857634.0,10.0,0.0
4,,2003.0,1882962.0,5.0,0.0


In [8]:
# Fill NaN in Country
source_df1["Country"] = source_df1["Country"].ffill()
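
The forward fill above relies on the WHO sheet layout, where a country name appears only on the first row of its block. A minimal sketch on a toy frame (the `toy` data is made up for illustration, not taken from the source file):

```python
import pandas as pd

# Toy frame mimicking the WHO layout: the country name is present only
# on the first row of each block, the following rows are missing.
toy = pd.DataFrame({
    "Country": ["Algeria", None, None, "Angola", None],
    "Year": [2000, 2001, 2002, 2000, 2001],
})

# Propagate the last seen country name down to the rows below it.
toy["Country"] = toy["Country"].ffill()
print(toy)
```

Note that a forward fill also copies region header rows such as "AFRICAN" downwards if they precede empty cells, so it is worth checking the result before subsetting.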

In [9]:
# Subset study countries
cleaned_df1 = transformer.subset_study_countries(source_df1, "Country")

# Display cleaned_df1
display(cleaned_df1.head())

Unnamed: 0,Country,Year,Population denominator for incidence and mortality rate,Cases,Deaths
0,Burkina Faso,2000.0,11882888.0,7121621.0,37845.0
1,Burkina Faso,2001.0,12249764.0,7346774.0,40660.0
2,Burkina Faso,2002.0,12632269.0,7513033.0,38905.0
3,Burkina Faso,2003.0,13030591.0,7615906.0,37642.0
4,Burkina Faso,2004.0,13445977.0,7546925.0,35229.0
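
The implementation of `transformer.subset_study_countries` is not shown in this notebook. As a rough sketch only, a helper with similar behaviour might look like the following. The function name `subset_countries_sketch`, the `STUDY_COUNTRIES` list (inferred from the countries visible in the outputs) and the footnote-stripping step are all assumptions:

```python
import pandas as pd

# Assumed list, inferred from the countries that appear in the outputs.
STUDY_COUNTRIES = ["Burkina Faso", "Ghana", "Kenya", "Malawi", "Nigeria"]


def subset_countries_sketch(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical stand-in for DataTransformer.subset_study_countries."""
    # Strip trailing footnote markers such as "1,2,3" before matching.
    names = (
        df[column]
        .astype(str)
        .str.replace(r"[\d,]+$", "", regex=True)
        .str.strip()
    )
    mask = names.isin(STUDY_COUNTRIES)
    out = df.loc[mask].copy()
    out[column] = names[mask]
    # Reset the index so the subset is numbered from 0, as in the output above.
    return out.reset_index(drop=True)


demo = pd.DataFrame({
    "Country": ["Algeria1,2,3", "Ghana", "Kenya2", "Total"],
    "Cases": [34, 10, 20, 64],
})
subset = subset_countries_sketch(demo, "Country")
print(subset)
```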


In [10]:
# Convert year to datetime format
cleaned_df1 = transformer.convert_to_dateformat(cleaned_df1, "Year")
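
The sampled rows further down show years rendered as year-end dates (for example `2007-12-31`). The helper itself is not shown, so the following is only a sketch of one way to get that result. The function name `year_to_year_end` is hypothetical:

```python
import pandas as pd


def year_to_year_end(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical stand-in for DataTransformer.convert_to_dateformat."""
    out = df.copy()
    years = out[column].astype(int)
    # Assemble a datetime column from year/month/day parts, pinned to 31 December.
    parts = pd.DataFrame({"year": years, "month": 12, "day": 31})
    out[column] = pd.to_datetime(parts)
    return out


demo = pd.DataFrame({"Year": [2000.0, 2007.0]})
converted = year_to_year_end(demo, "Year")
print(converted)
```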


In [11]:
# Convert float columns to int
columns_to_int = [*cleaned_df1.select_dtypes(float).columns]
convert_dict = {i: int for i in columns_to_int}
cleaned_df1 = cleaned_df1.astype(convert_dict)
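
Casting to plain `int` works here because these columns contain no missing values for the study countries. For a column that does contain NaN, the same cast would raise an error. A possible alternative, shown only as a sketch on made-up values, is the nullable `Int64` dtype:

```python
import pandas as pd

# Made-up values with one missing entry, similar to a reported-deaths column.
deaths = pd.Series([4233.0, None, 4032.0], name="Deaths")

# "Int64" (capital I) is the pandas nullable integer dtype: it keeps
# missing values as <NA> instead of failing like astype(int) would.
deaths_int = deaths.astype("Int64")
print(deaths_int)
```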

In [12]:
# Get cleaned_df1 info
display(cleaned_df1.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 5 columns):
 #   Column                                                   Non-Null Count  Dtype         
---  ------                                                   --------------  -----         
 0   Country                                                  110 non-null    object        
 1   Year                                                     110 non-null    datetime64[ns]
 2   Population denominator for incidence and mortality rate  110 non-null    int64         
 3   Cases                                                    110 non-null    int64         
 4   Deaths                                                   110 non-null    int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 4.4+ KB


None

In [13]:
cleaned_df1.sample(5)

Unnamed: 0,Country,Year,Population denominator for incidence and mortality rate,Cases,Deaths
29,Ghana,2007-12-31,23708320,7556676,15139
47,Kenya,2003-12-31,33767120,7177403,12595
13,Burkina Faso,2013-12-31,17636408,8620220,27175
19,Burkina Faso,2019-12-31,20951640,7472680,19228
88,Nigeria,2000-12-31,122851984,50779020,249259


### Exploring WMR2021_Annex5J.xlsx and WMR2022_Annex_4J.xlsx - Reported malaria deaths (WHO World Malaria Report)

In [14]:
""" Import the dataset """
source_df2 = transformer.load_data(ext="xlsx", filepath=filepath_21, repo=REPO, nrows=123)

source_df2_add = transformer.load_data(ext="xlsx", filepath=filepath_22, repo=REPO, nrows=102)

# Display first rows
display(source_df2.head())

display(source_df2_add.head())

# Display last rows
display(source_df2.tail())

display(source_df2_add.tail())

Unnamed: 0,WHO region Country/area,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,AFRICAN,,,,,,,,,,...,,,,,,,,,,
1,Algeria1,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Angola,9510.0,9473.0,14434.0,38598.0,12459.0,13768.0,10220.0,9812.0,9465.0,...,6909.0,5736.0,7300.0,5714.0,7832.0,15997.0,13967.0,11814.0,18691.0,11757.0
3,Benin,,468.0,707.0,560.0,944.0,322.0,1226.0,1290.0,918.0,...,1753.0,2261.0,2288.0,1869.0,1416.0,1646.0,2182.0,2138.0,2589.0,2336.0
4,Botswana,,29.0,23.0,18.0,19.0,11.0,40.0,6.0,12.0,...,8.0,3.0,7.0,22.0,5.0,3.0,17.0,9.0,7.0,11.0


Unnamed: 0,WHO region Country/area,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,AFRICAN,,,,,,,,,,,,
1,Algeria1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Angola,8114.0,6909.0,5736.0,7300.0,5714.0,7832.0,15997.0,13967.0,11814.0,18691.0,11757.0,13676.0
3,Benin,964.0,1753.0,2261.0,2288.0,1869.0,1416.0,1646.0,2182.0,2138.0,2589.0,2336.0,2956.0
4,Botswana,8.0,8.0,3.0,7.0,22.0,5.0,3.0,17.0,9.0,7.0,11.0,5.0


Unnamed: 0,WHO region Country/area,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
117,Eastern Mediterranean,2166.0,2254.0,2135.0,2538.0,1894.0,1857.0,1363.0,1352.0,1223.0,...,736.0,996.0,1048.0,976.0,1016.0,861.0,1715.0,3320.0,1688.0,792.0
118,European,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
119,South-East Asia,5522.0,4780.0,4610.0,4283.0,4254.0,3506.0,4587.0,2963.0,3101.0,...,1821.0,1229.0,1126.0,955.0,620.0,560.0,299.0,171.0,162.0,147.0
120,Western Pacific,2368.0,1945.0,1574.0,1586.0,1427.0,1385.0,1322.0,961.0,997.0,...,727.0,524.0,393.0,268.0,215.0,342.0,313.0,232.0,203.0,194.0
121,Total,89477.0,112454.0,119325.0,161900.0,122341.0,144369.0,144518.0,108047.0,107889.0,...,107508.0,107018.0,119048.0,101709.0,129636.0,113155.0,105690.0,92291.0,110698.0,77934.0


Unnamed: 0,WHO region Country/area,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
96,Philippines,30.0,12.0,16.0,12.0,10.0,20.0,7.0,4.0,2.0,9.0,3.0,3.0
97,Republic of Korea2,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,Solomon Islands,34.0,19.0,18.0,18.0,23.0,13.0,20.0,27.0,7.0,14.0,3.0,9.0
99,Vanuatu2,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100,Viet Nam2,21.0,14.0,8.0,6.0,6.0,3.0,3.0,6.0,1.0,0.0,0.0,0.0


In [15]:
# Rename columns
source_df2.rename(
   columns={
        source_df2.columns[0]: "Country"
    }, inplace=True
)

source_df2_add.rename(
   columns={
        source_df2_add.columns[0]: "Country"
    }, inplace=True
)

# Select and subset study countries 
cleaned_df2 = transformer.subset_study_countries(source_df2, "Country")

source_df2_add = transformer.subset_study_countries(source_df2_add, "Country")

# Add 2021 column
cleaned_df2 = pd.merge(cleaned_df2, source_df2_add[["Country", 2021]], on="Country", how="outer")

# Display cleaned_df2
display(cleaned_df2.head())

Unnamed: 0,Country,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Burkina Faso,,4233.0,4032.0,4860.0,4205.0,5224.0,8083.0,6472.0,7834.0,...,7963.0,6294.0,5632.0,5379.0,3974.0,4144.0,4294.0,1060.0,3983.0,4355.0
1,Ghana,6108.0,1717.0,2376.0,2103.0,1575.0,2037.0,3125.0,4622.0,3889.0,...,2855.0,2506.0,2200.0,2137.0,1264.0,599.0,428.0,336.0,308.0,277.0
2,Kenya,48767.0,48286.0,47697.0,51842.0,25403.0,44328.0,40079.0,,,...,785.0,360.0,472.0,15061.0,603.0,,,858.0,742.0,753.0
3,Malawi,,3355.0,5775.0,4767.0,3457.0,5070.0,6464.0,7486.0,8048.0,...,5516.0,3723.0,4490.0,3799.0,4000.0,3613.0,2967.0,2341.0,2517.0,2368.0
4,Nigeria,,4317.0,4092.0,5343.0,6032.0,6494.0,6586.0,10289.0,8677.0,...,7734.0,7878.0,6082.0,9330.0,7397.0,8720.0,14936.0,26540.0,1811.0,7828.0


In [16]:
# Reshape cleaned_df2 to get a better format
cleaned_df2 = transformer.extract_unique_serie(cleaned_df2, "Country", "Deaths")

# Display cleaned_df2
cleaned_df2.head()

Unnamed: 0,Country,Year,Deaths
0,Burkina Faso,2000,
1,Burkina Faso,2001,4233.0
2,Burkina Faso,2002,4032.0
3,Burkina Faso,2003,4860.0
4,Burkina Faso,2004,4205.0
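
The reshape above turns one column per year into one row per country and year. The body of `extract_unique_serie` is not shown, so the sketch below only illustrates the same wide-to-long idea with `DataFrame.melt`. The function name `wide_to_long_sketch` and the toy values are assumptions:

```python
import pandas as pd


def wide_to_long_sketch(df: pd.DataFrame, id_column: str, value_name: str) -> pd.DataFrame:
    """Hypothetical stand-in for DataTransformer.extract_unique_serie."""
    # Every column except id_column is treated as a year column.
    long_df = df.melt(id_vars=id_column, var_name="Year", value_name=value_name)
    # Order rows by country, then by year, as in the output above.
    return long_df.sort_values([id_column, "Year"]).reset_index(drop=True)


wide = pd.DataFrame({
    "Country": ["Ghana", "Burkina Faso"],
    2000: [6108.0, None],
    2001: [1717.0, 4233.0],
})
long_df = wide_to_long_sketch(wide, "Country", "Deaths")
print(long_df)
```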


In [17]:
# Convert year to datetime format
cleaned_df2 = transformer.convert_to_dateformat(cleaned_df2, "Year")


In [18]:
cleaned_df2.sample(5)

Unnamed: 0,Country,Year,Deaths
36,Ghana,2014-12-31,2200.0
99,Nigeria,2011-12-31,3353.0
109,Nigeria,2021-12-31,7828.0
89,Nigeria,2001-12-31,4317.0
103,Nigeria,2015-12-31,9330.0


### Exploring malaria-in-africa.csv - ACADIC

1. ZM Nia, JD Kong, Malaria in Africa (2022) http://acadic.org/malaria-in-africa
2. Global Burden of Disease Collaborative Network. Global Burden of Disease Study 2019 (GBD 2019) Results (2019) https://vizhub.healthdata.org/gbd-results

In [19]:
source_df3 = transformer.load_data(ext="csv", filepath=filepath_3, repo=REPO)

# Display first rows
display(source_df3.head())

# Display last rows
display(source_df3.tail())

Unnamed: 0,admin,adm0_a3,continent,subregion,case2000,case2001,case2002,case2003,case2004,case2005,...,itn2005,itn2017,itn2004,itn2009,itn2002,itn2008,itn2013,itn2020,itn2019,itn1999
0,Ethiopia,ETH,Africa,Eastern Africa,222.326538,231.740753,205.862854,217.671036,222.274536,152.039246,...,1.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Kenya,KEN,Africa,Eastern Africa,216.795532,241.661362,223.48761,209.224899,178.076828,136.533554,...,0.0,0.0,0.0,46.700001,0.0,0.0,0.0,0.0,0.0,0.0
2,Malawi,MWI,Africa,Eastern Africa,463.229462,465.235626,437.955322,398.393829,369.799011,363.373383,...,15.0,67.5,14.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,United Republic of Tanzania,TZA,Africa,Eastern Africa,339.933502,328.419098,302.445984,283.714691,267.095825,248.378387,...,16.0,54.599998,10.0,0.0,0.0,25.700001,0.0,0.0,0.0,2.0
4,Somaliland,SOL,Africa,Eastern Africa,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,admin,adm0_a3,continent,subregion,case2000,case2001,case2002,case2003,case2004,case2005,...,itn2005,itn2017,itn2004,itn2009,itn2002,itn2008,itn2013,itn2020,itn2019,itn1999
46,Egypt,EGY,Africa,Northern Africa,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
47,Mauritania,MRT,Africa,Western Africa,153.568314,159.007248,124.171692,118.745255,92.541077,84.204803,...,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48,Equatorial Guinea,GNQ,Africa,Middle Africa,388.128937,389.03717,388.943909,392.427216,394.424164,396.47937,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49,Gambia,GMB,Africa,Western Africa,373.170685,361.466675,350.238525,339.536865,329.345856,319.603943,...,0.0,62.400002,0.0,0.0,0.0,0.0,47.0,44.0,0.0,0.0
50,Madagascar,MDG,Africa,Eastern Africa,57.46167,68.353767,53.594303,68.294609,46.853584,36.148884,...,0.0,0.0,0.0,45.799999,0.0,0.0,62.299999,0.0,0.0,0.0


In [20]:
# Select and subset study countries 
source_df3 = transformer.subset_study_countries(source_df3, "admin")


In [21]:
source_df3

Unnamed: 0,admin,adm0_a3,continent,subregion,case2000,case2001,case2002,case2003,case2004,case2005,...,itn2005,itn2017,itn2004,itn2009,itn2002,itn2008,itn2013,itn2020,itn2019,itn1999
0,Kenya,KEN,Africa,Eastern Africa,216.795532,241.661362,223.48761,209.224899,178.076828,136.533554,...,0.0,0.0,0.0,46.700001,0.0,0.0,0.0,0.0,0.0,0.0
1,Malawi,MWI,Africa,Eastern Africa,463.229462,465.235626,437.955322,398.393829,369.799011,363.373383,...,15.0,67.5,14.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Nigeria,NGA,Africa,Western Africa,418.038635,407.761841,392.93924,391.655121,401.153656,407.785461,...,0.0,49.099998,0.0,0.0,0.0,5.5,16.6,0.0,0.0,0.0
3,Burkina Faso,BFA,Africa,Western Africa,603.210999,601.937744,595.852051,585.123291,562.411316,533.845093,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Ghana,GHA,Africa,Western Africa,435.234741,419.930389,391.170441,370.895569,352.26535,336.385345,...,0.0,0.0,0.0,0.0,0.0,38.700001,0.0,0.0,54.099998,0.0


In [22]:
# Split columns
country_columns = ["admin", "adm0_a3"]
cases_columns = [i for i in source_df3.columns if "case" in i]
deaths_columns = [i for i in source_df3.columns if "death2" in i]
deaths0_4_columns = [i for i in source_df3.columns if "death0-42" in i]
deaths5_14_columns = [i for i in source_df3.columns if "death5-142" in i]
deaths15_49_columns = [i for i in source_df3.columns if "death15-492" in i]
deaths50_69_columns = [i for i in source_df3.columns if "death50-692" in i]
deaths70p_columns = [i for i in source_df3.columns if "death70-2" in i]
itns_columns = [i for i in source_df3.columns if "itn2" in i] # Percentage of children sleeping under insecticide-treated bed nets (ITNs)

In [23]:
# Extract series from source_df3
cases_series = transformer.extract_series([cases_columns, "Cases", "case"], source_data=source_df3, immutable_columns=country_columns)
deaths_series = transformer.extract_series([deaths_columns, "Deaths", "death"], source_data=source_df3, immutable_columns=country_columns)
deaths0_4_series = transformer.extract_series([deaths0_4_columns, "Deaths0_4", "death0-4"], source_data=source_df3, immutable_columns=country_columns)
deaths5_14_series = transformer.extract_series([deaths5_14_columns, "Deaths5_14", "death5-14"], source_data=source_df3, immutable_columns=country_columns)
deaths15_49_series = transformer.extract_series([deaths15_49_columns, "Deaths15_49", "death15-49"], source_data=source_df3, immutable_columns=country_columns)
deaths50_69_series = transformer.extract_series([deaths50_69_columns, "Deaths50_69", "death50-69"], source_data=source_df3, immutable_columns=country_columns)
deaths70p_series = transformer.extract_series([deaths70p_columns, "Deaths70p", "death70-"], source_data=source_df3, immutable_columns=country_columns)
itns_series = transformer.extract_series([itns_columns, "ITN", "itn"], source_data=source_df3, immutable_columns=country_columns)
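
Each call above takes a group of prefixed year columns (such as `case2000`, `case2001`) and returns a long series. The real `extract_series` is not shown here, so this is a sketch under assumptions: the function name `extract_series_sketch`, its argument order and the toy data are hypothetical.

```python
import pandas as pd


def extract_series_sketch(df, value_columns, value_name, prefix, id_columns):
    """Hypothetical stand-in for DataTransformer.extract_series."""
    long_df = df.melt(
        id_vars=id_columns,
        value_vars=value_columns,
        var_name="Year",
        value_name=value_name,
    )
    # "case2000" -> 2000: drop the prefix and keep the numeric year.
    long_df["Year"] = long_df["Year"].str[len(prefix):].astype(int)
    return long_df


demo = pd.DataFrame({
    "admin": ["Kenya"],
    "adm0_a3": ["KEN"],
    "case2000": [216.8],
    "case2001": [241.7],
    "death0-42000": [1.0],
})
case_cols = [c for c in demo.columns if c.startswith("case")]
cases = extract_series_sketch(demo, case_cols, "Cases", "case", ["admin", "adm0_a3"])
print(cases)
```

Selecting columns with `startswith` rather than a substring test is a small design choice that avoids accidental matches in the middle of a column name.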

In [24]:
cleaned_df3 = (
    cases_series.
    merge(deaths_series, on=country_columns+["Year"], how="outer").
    merge(deaths0_4_series, on=country_columns+["Year"], how="outer").
    merge(deaths5_14_series, on=country_columns+["Year"], how="outer").
    merge(deaths15_49_series, on=country_columns+["Year"], how="outer").
    merge(deaths50_69_series, on=country_columns+["Year"], how="outer").
    merge(deaths70p_series, on=country_columns+["Year"], how="outer").
    merge(itns_series, on=country_columns+["Year"], how="outer")
)
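
The chain of outer merges above repeats the same keys for every step. As an optional alternative, the same pattern can be folded over a list with `functools.reduce`. The frames below are toy data and the shortened key list is an assumption made to keep the sketch small:

```python
from functools import reduce

import pandas as pd

keys = ["admin", "Year"]

# Toy long-format frames sharing the same key columns.
frames = [
    pd.DataFrame({"admin": ["Kenya", "Kenya"], "Year": [2000, 2001], "Cases": [216.8, 241.7]}),
    pd.DataFrame({"admin": ["Kenya"], "Year": [2000], "Deaths": [60.1]}),
    pd.DataFrame({"admin": ["Kenya", "Kenya"], "Year": [2000, 2001], "ITN": [0.0, 0.0]}),
]

# Outer-merge the frames pairwise, left to right, on the shared keys.
merged = reduce(lambda left, right: left.merge(right, on=keys, how="outer"), frames)
print(merged)
```

With an outer join, a key missing from one frame yields NaN in that frame's columns, which matches the 100 non-null death values against 105 rows in the `info()` output.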

In [25]:
display(cleaned_df3.info())

display(cleaned_df3.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   admin        105 non-null    object        
 1   adm0_a3      105 non-null    object        
 2   Year         105 non-null    datetime64[ns]
 3   Cases        105 non-null    float64       
 4   Deaths       100 non-null    float64       
 5   Deaths0_4    100 non-null    float64       
 6   Deaths5_14   100 non-null    float64       
 7   Deaths15_49  100 non-null    float64       
 8   Deaths50_69  100 non-null    float64       
 9   Deaths70p    100 non-null    float64       
 10  ITN          105 non-null    float64       
dtypes: datetime64[ns](1), float64(8), object(2)
memory usage: 9.1+ KB


None

Unnamed: 0,admin,adm0_a3,Year,Cases,Deaths,Deaths0_4,Deaths5_14,Deaths15_49,Deaths50_69,Deaths70p,ITN
98,Ghana,GHA,2014-12-31,310.783264,110.713205,352.294645,15.964282,35.105035,195.373814,405.144418,46.599998
7,Kenya,KEN,2007-12-31,79.708366,22.164906,93.504211,5.820863,5.67549,34.766218,54.45388,0.0
46,Nigeria,NGA,2004-12-31,401.153656,171.031376,604.461257,25.020138,49.711742,308.495649,507.106998,0.0
49,Nigeria,NGA,2007-12-31,412.435028,181.00758,611.324693,27.032348,51.77602,324.584432,587.861666,0.0
85,Ghana,GHA,2001-12-31,419.930389,134.614344,482.296703,18.408364,39.77737,228.800874,489.89762,0.0


### Exploring DatasetAfricaMalaria.csv - From Kaggle 

URL: https://www.kaggle.com/datasets/lydia70/malaria-in-africa

The data were retrieved from the World Bank Open Data source.

In [26]:
source_df4 = transformer.load_data(ext="csv", filepath=filepath_4, repo=REPO)

# Display first rows
display(source_df4.head())

# Display last rows
display(source_df4.tail())

Unnamed: 0,Country Name,Year,Country Code,"Incidence of malaria (per 1,000 population at risk)",Malaria cases reported,Use of insecticide-treated bed nets (% of under-5 population),Children with fever receiving antimalarial drugs (% of children under age 5 with fever),Intermittent preventive treatment (IPT) of malaria in pregnancy (% of pregnant women),People using safely managed drinking water services (% of population),"People using safely managed drinking water services, rural (% of rural population)",...,Urban population growth (annual %),People using at least basic drinking water services (% of population),"People using at least basic drinking water services, rural (% of rural population)","People using at least basic drinking water services, urban (% of urban population)",People using at least basic sanitation services (% of population),"People using at least basic sanitation services, rural (% of rural population)","People using at least basic sanitation services, urban (% of urban population)",latitude,longitude,geometry
0,Algeria,2007,DZA,0.01,26.0,,,,,,...,2.71,91.68,85.83,94.78,85.85,76.94,90.57,28.033886,1.659626,POINT (28.033886 1.659626)
1,Angola,2007,AGO,286.72,1533485.0,18.0,29.8,1.5,,,...,5.01,47.96,23.77,65.83,37.26,14.0,54.44,-11.202692,17.873887,POINT (-11.202692 17.873887)
2,Benin,2007,BEN,480.24,0.0,,,,,,...,4.09,63.78,54.92,76.24,11.8,4.29,22.36,9.30769,2.315834,POINT (9.307689999999999 2.315834)
3,Botswana,2007,BWA,1.03,390.0,,,,,,...,4.8,78.89,57.6,94.35,61.6,39.99,77.3,-22.328474,24.684866,POINT (-22.328474 24.684866)
4,Burkina Faso,2007,BFA,503.8,44246.0,,,,,,...,5.91,52.27,45.13,76.15,15.6,6.38,46.49,12.238333,-1.561593,POINT (12.238333 -1.561593)


Unnamed: 0,Country Name,Year,Country Code,"Incidence of malaria (per 1,000 population at risk)",Malaria cases reported,Use of insecticide-treated bed nets (% of under-5 population),Children with fever receiving antimalarial drugs (% of children under age 5 with fever),Intermittent preventive treatment (IPT) of malaria in pregnancy (% of pregnant women),People using safely managed drinking water services (% of population),"People using safely managed drinking water services, rural (% of rural population)",...,Urban population growth (annual %),People using at least basic drinking water services (% of population),"People using at least basic drinking water services, rural (% of rural population)","People using at least basic drinking water services, urban (% of urban population)",People using at least basic sanitation services (% of population),"People using at least basic sanitation services, rural (% of rural population)","People using at least basic sanitation services, urban (% of urban population)",latitude,longitude,geometry
589,Togo,2017,TGO,278.2,1755577.0,69.7,31.1,41.7,,,...,3.79,65.13,48.39,89.06,16.13,7.4,28.61,8.619543,0.824782,POINT (8.619543 0.824782)
590,Tunisia,2017,TUN,,,,,,92.66,,...,1.57,96.25,88.71,99.7,90.92,81.35,95.29,33.886917,9.537499,POINT (33.886917 9.537499)
591,Uganda,2017,UGA,336.76,11667831.0,,,,7.07,4.46,...,6.25,49.1,41.25,75.11,18.47,16.17,26.11,1.373333,32.290275,POINT (1.373333 32.290275)
592,Zambia,2017,ZMB,160.05,5505639.0,,,,,,...,4.21,59.96,41.95,83.86,26.37,18.93,36.24,-13.133897,27.849332,POINT (-13.133897 27.849332)
593,Zimbabwe,2017,ZWE,108.55,467508.0,,,,,,...,1.28,64.05,49.8,94.0,36.22,31.47,46.22,-19.015438,29.154857,POINT (-19.015438 29.154857)


In [27]:
# Select and subset study countries 
source_df4 = transformer.subset_study_countries(source_df4, "Country Name")


In [28]:
source_df4.head()

Unnamed: 0,Country Name,Year,Country Code,"Incidence of malaria (per 1,000 population at risk)",Malaria cases reported,Use of insecticide-treated bed nets (% of under-5 population),Children with fever receiving antimalarial drugs (% of children under age 5 with fever),Intermittent preventive treatment (IPT) of malaria in pregnancy (% of pregnant women),People using safely managed drinking water services (% of population),"People using safely managed drinking water services, rural (% of rural population)",...,Urban population growth (annual %),People using at least basic drinking water services (% of population),"People using at least basic drinking water services, rural (% of rural population)","People using at least basic drinking water services, urban (% of urban population)",People using at least basic sanitation services (% of population),"People using at least basic sanitation services, rural (% of rural population)","People using at least basic sanitation services, urban (% of urban population)",latitude,longitude,geometry
0,Burkina Faso,2007,BFA,503.8,44246.0,,,,,,...,5.91,52.27,45.13,76.15,15.6,6.38,46.49,12.238333,-1.561593,POINT (12.238333 -1.561593)
1,Ghana,2007,GHA,322.33,476484.0,,,,21.1,3.86,...,3.99,71.35,59.83,83.5,12.12,6.94,17.58,7.946527,-1.023194,POINT (7.946527 -1.023194)
2,Kenya,2007,KEN,78.02,0.0,,,,,,...,4.46,52.3,42.38,86.63,31.69,30.6,35.47,0.1769,37.9083,POINT (0.1769 37.9083)
3,Malawi,2007,MWI,370.08,0.0,,,,,,...,3.4,59.55,54.81,85.95,23.03,21.27,32.82,-13.254308,34.301525,POINT (-13.254308 34.301525)
4,Nigeria,2007,NGA,421.33,0.0,,,2.0,17.57,13.18,...,4.8,57.64,43.3,78.44,32.53,29.78,36.51,9.081999,8.675277,POINT (9.081999 8.675276999999999)


In [29]:
# Percentage of missing values
round((source_df4.isna().sum() / len(source_df4)) * 100, 2)

Country Name                                                                                0.00
Year                                                                                        0.00
Country Code                                                                                0.00
Incidence of malaria (per 1,000 population at risk)                                         0.00
Malaria cases reported                                                                      0.00
Use of insecticide-treated bed nets (% of under-5 population)                              60.00
Children with fever receiving antimalarial drugs (% of children under age 5 with fever)    63.64
Intermittent preventive treatment (IPT) of malaria in pregnancy (% of pregnant women)      61.82
People using safely managed drinking water services (% of population)                      60.00
People using safely managed drinking water services, rural (% of rural population)         60.00
People using safely managed dr

In [30]:
# Columns with 50% or more missing values are not useful
columns_50_more_missing_df4 = [col for col in source_df4.columns if (source_df4[col].isna().sum() / len(source_df4)) >= 0.5]

# Drop columns_50_more_missing_df4 and other non-useful columns (latitude, longitude, geometry) from source_df4
additional_to_drop_df4 = [*source_df4.columns[-3:]]
cleaned_df4 = source_df4.drop(columns=columns_50_more_missing_df4+additional_to_drop_df4)
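
The 50% rule can also be expressed with `isna().mean()`, which gives the share of missing values per column directly. A minimal sketch on made-up data:

```python
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, None, None, None],   # 75% missing
    "b": [1.0, 2.0, None, 4.0],     # 25% missing
    "c": [1, 2, 3, 4],              # complete
})

# Mean of the boolean mask = fraction of missing values per column.
missing_share = demo.isna().mean()

# Keep the same threshold as above: drop columns with 50% or more missing.
to_drop = missing_share[missing_share >= 0.5].index.tolist()
kept = demo.drop(columns=to_drop)
print(to_drop, list(kept.columns))
```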

In [31]:
cleaned_df4.head()

Unnamed: 0,Country Name,Year,Country Code,"Incidence of malaria (per 1,000 population at risk)",Malaria cases reported,"People using safely managed drinking water services, urban (% of urban population)",Rural population (% of total population),Rural population growth (annual %),Urban population (% of total population),Urban population growth (annual %),People using at least basic drinking water services (% of population),"People using at least basic drinking water services, rural (% of rural population)","People using at least basic drinking water services, urban (% of urban population)",People using at least basic sanitation services (% of population),"People using at least basic sanitation services, rural (% of rural population)","People using at least basic sanitation services, urban (% of urban population)"
0,Burkina Faso,2007,BFA,503.8,44246.0,,77.0,2.16,23.0,5.91,52.27,45.13,76.15,15.6,6.38,46.49
1,Ghana,2007,GHA,322.33,476484.0,39.28,51.33,1.26,48.67,3.99,71.35,59.83,83.5,12.12,6.94,17.58
2,Kenya,2007,KEN,78.02,0.0,58.53,77.58,2.29,22.42,4.46,52.3,42.38,86.63,31.69,30.6,35.47
3,Malawi,2007,MWI,370.08,0.0,,84.77,2.69,15.24,3.4,59.55,54.81,85.95,23.03,21.27,32.82
4,Nigeria,2007,NGA,421.33,0.0,23.93,59.18,1.16,40.82,4.8,57.64,43.3,78.44,32.53,29.78,36.51


In [32]:
# Convert year to datetime format
cleaned_df4 = transformer.convert_to_dateformat(cleaned_df4, "Year")


In [33]:
# Convert Malaria cases reported to int dtype
cleaned_df4 = cleaned_df4.astype({"Malaria cases reported": int})

In [34]:
display(cleaned_df4.info())

display(cleaned_df4.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 16 columns):
 #   Column                                                                              Non-Null Count  Dtype         
---  ------                                                                              --------------  -----         
 0   Country Name                                                                        55 non-null     object        
 1   Year                                                                                55 non-null     datetime64[ns]
 2   Country Code                                                                        55 non-null     object        
 3   Incidence of malaria (per 1,000 population at risk)                                 55 non-null     float64       
 4   Malaria cases reported                                                              55 non-null     int64         
 5   People using safely managed drinking water services,

None

Unnamed: 0,Country Name,Year,Country Code,"Incidence of malaria (per 1,000 population at risk)",Malaria cases reported,"People using safely managed drinking water services, urban (% of urban population)",Rural population (% of total population),Rural population growth (annual %),Urban population (% of total population),Urban population growth (annual %),People using at least basic drinking water services (% of population),"People using at least basic drinking water services, rural (% of rural population)","People using at least basic drinking water services, urban (% of urban population)",People using at least basic sanitation services (% of population),"People using at least basic sanitation services, rural (% of rural population)","People using at least basic sanitation services, urban (% of urban population)"
5,Burkina Faso,2008-12-31,BFA,533.39,36514,,76.47,2.32,23.53,5.34,51.78,44.16,76.52,16.08,6.93,45.82
40,Burkina Faso,2015-12-31,BFA,400.09,7015446,,72.47,2.12,27.53,5.13,48.66,37.09,79.11,18.83,10.45,40.87
35,Burkina Faso,2014-12-31,BFA,436.06,5428655,,73.07,2.16,26.93,5.17,49.07,38.13,78.74,18.5,9.99,41.61
16,Ghana,2010-12-31,GHA,364.15,1071637,44.28,49.29,1.11,50.71,3.84,74.36,62.19,86.19,13.95,8.34,19.39
34,Nigeria,2013-12-31,NGA,328.65,0,24.36,53.88,1.07,46.12,4.59,65.94,50.65,83.81,36.27,30.37,43.16


### Exploring confirmed_cases.csv, presumed_cases.csv and total_malaria_cases (confimed+presumed).csv - From the WHO website

Source: https://apps.who.int/gho/data/node.main

In [35]:
source_df5_1 = transformer.load_data(ext="csv", filepath=filepath_51, repo=REPO, header=1)

source_df5_2 = transformer.load_data(ext="csv", filepath=filepath_52, repo=REPO, header=1)

source_df5_3 = transformer.load_data(ext="csv", filepath=filepath_53, repo=REPO, header=1)


In [36]:
# Display first rows
display(source_df5_1.head())

# Display last rows
display(source_df5_1.tail())

Unnamed: 0,"Countries, territories and areas",2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
0,French Guiana,143,154,212,546,597,258,434,448,875,...,3 462,3 320,4 828,4 074,3 414,3 038,3 839,3 661,3 823,3 708
1,Mayotte,,,,47,19,28,11,15,82,...,352,346,421,392,500,743,792,,,
2,Afghanistan,86 263,105 295,173 860,248 689,313 086,241 233,119 859,106 478,52 965,...,65 000,81 574,92 202,86 129,116 444,242 022,360 940,414 611,50 850,94 475
3,Algeria,1 164,2 726,1 014,1 242,453,432,747,266,603,...,94,196,288,117,299,163,427,307,435,541
4,Angola,8 325 921,6 599 327,7 054 978,5 150 575,3 874 892,3 794 253,2 769 305,2 298 979,1 999 868,...,1 573 422,1 377 992,1 533 485,1 082 398,889 572,,,,,


Unnamed: 0,"Countries, territories and areas",2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
103,Venezuela (Bolivarian Republic of),194 057,232 757,492 753,522 059,525 897,301 466,137 996,91 918,80 320,...,35 828,32 037,41 749,37 062,45 049,46 655,31 719,29 491,20 006,29 736
104,Viet Nam,467,1 422,4 765,4 813,4 548,4 161,9 331,15 752,17 128,...,16 130,11 355,16 389,22 637,19 496,24 909,38 790,47 807,68 699,74 316
105,Yemen,180 339,164 066,165 899,157 900,143 333,99 700,76 259,86 707,102 778,...,55 454,44 206,67 677,55 000,44 150,48 756,50 811,75 508,,
106,Zambia,6 769 142,8 121 215,5 147 350,5 039 679,5 505 639,4 851 319,4 184 661,4 077 547,,...,,,,,,,,,,
107,Zimbabwe,133 137,447 381,308 173,264 018,471 798,316 989,484 794,550 696,423 702,...,57 014,16 394,116 518,19 702,18 954,16 990,,,,


In [37]:
# Display first rows
display(source_df5_2.head())

# Display last rows
display(source_df5_2.tail())

Unnamed: 0,"Countries, territories and areas",2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
0,French Guiana,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Mayotte,,,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,,,
2,Afghanistan,107,150,1 034,51 174,100 450,194 784,263 149,211 130,273 628,...,325 849,385 549,369 081,328 278,210 250,31 355,224 662,212 228,,
3,Algeria,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Angola,843 346,556 783,475 810,777 685,625 329,506 893,484 965,881 042,1 144 232,...,2 153 184,2 054 432,1 193 045,1 200 699,1 439 744,2 489 170,3 246 258,1 862 662,1 249 767,2 080 348


Unnamed: 0,"Countries, territories and areas",2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
102,Venezuela (Bolivarian Republic of),0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
103,Viet Nam,0,311,1 222,2 057,3 863,6 285,9 921,12 116,18 278,...,33 056,40 313,43 212,52 129,64 977,83 441,97 199,104 154,119 423,200 594
104,Yemen,64 518,38 605,50 864,75 243,,45 927,28 572,36 105,46 673,...,83 133,114 402,155 622,162 270,156 410,109 805,214 221,111 651,,
105,Zambia,390 101,577 089,212 670,156 044,549 040,1 124 873,909 462,1 895 386,5 465 122,...,2 976 395,3 080 301,4 248 295,4 731 338,4 121 356,4 078 234,4 346 172,3 760 335,3 838 402,3 337 796
106,Zimbabwe,0,,,,295 271,67 040,,22 248,,...,622 869,971 058,1 038 001,1 293 756,1 475 564,1 798 480,,,,


In [38]:
# Display first rows
display(source_df5_3.head())

# Display last rows
display(source_df5_3.tail())

Unnamed: 0,"Countries, territories and areas",2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
0,French Guiana,143,154,212,546,597,258,434,448,875,...,3 462,3 320,4 828,4 074,3 414,3 038,3 839,3 661,3 823,3 708
1,Mayotte,,,,47,19,28,11,15,82,...,352,346,421,392,500,743,792,,,
2,Afghanistan,86 370,105 445,174 894,299 863,413 536,436 017,383 008,317 608,326 593,...,390 849,467 123,461 283,414 407,326 694,273 377,585 602,626 839,50 850,94 475
3,Algeria,1 164,2 726,1 014,1 242,453,432,747,266,603,...,94,196,288,117,299,163,427,307,435,541
4,Angola,9 169 267,7 156 110,7 530 788,5 928 260,4 500 221,4 301 146,3 254 270,3 180 021,3 144 100,...,3 726 606,3 432 424,2 726 530,2 283 097,2 329 316,2 489 170,3 246 258,1 862 662,1 249 767,2 080 348


Unnamed: 0,"Countries, territories and areas",2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
103,Venezuela (Bolivarian Republic of),194 057,232 757,492 753,522 059,525 897,301 466,137 996,91 918,80 320,...,35 828,32 037,41 749,37 062,45 049,46 655,31 719,29 491,20 006,29 736
104,Viet Nam,467,1 733,5 987,6 870,8 411,10 446,19 252,27 868,35 406,...,49 186,51 668,59 601,74 766,84 473,108 350,135 989,151 961,188 122,274 910
105,Yemen,244 857,202 671,216 763,233 143,143 333,145 627,104 831,122 812,149 451,...,138 587,158 608,223 299,217 270,200 560,158 561,265 032,187 159,,
106,Zambia,7 159 243,8 698 304,5 360 020,5 195 723,6 054 679,5 976 192,5 094 123,5 972 933,5 465 122,...,2 976 395,3 080 301,4 248 295,4 731 338,4 121 356,4 078 234,4 346 172,3 760 335,3 838 402,3 337 796
107,Zimbabwe,133 137,447 381,308 173,264 018,767 069,384 029,484 794,572 944,423 702,...,679 883,987 452,1 154 519,1 313 458,1 494 518,1 815 470,,,,


In [39]:
# Rename columns
source_df5_1.rename(
   columns={
        source_df5_1.columns[0]: "Country"
    }, inplace=True
)

source_df5_2.rename(
   columns={
        source_df5_2.columns[0]: "Country"
    }, inplace=True
)

source_df5_3.rename(
   columns={
        source_df5_3.columns[0]: "Country"
    }, inplace=True
)

# Select and subset study countries 
source_df5_1 = transformer.subset_study_countries(source_df5_1, "Country")
source_df5_2 = transformer.subset_study_countries(source_df5_2, "Country")
source_df5_3 = transformer.subset_study_countries(source_df5_3, "Country")

# Display source_df5_1
display(source_df5_1.head())

# Display source_df5_2
display(source_df5_2.head())

# Display source_df5_3
display(source_df5_3.head())

Unnamed: 0,Country,2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
0,Burkina Faso,11 791 638,10 600 340,5 877 426,10 278 970,10 557 260,9 779 411,7 015 446,5 428 655,3 769 051,...,182 527,36 514,44 246,44 265,21 335,18 256,,,,
1,Ghana,5 747 585,5 172 803,6 115 267,4 931 454,7 003 155,5 428 979,5 657 096,3 415 912,1 643 642,...,1 104 370,1 094 483,476 484,472 255,655 093,475 441,,,,
2,Kenya,3 828 757,3 659 170,5 019 389,2 318 090,3 607 026,3 064 796,2 041 277,2 851 555,2 375 129,...,,839 903,,,,28 328,39 383,20 049,,
3,Malawi,6 948 500,7 139 065,5 184 107,5 865 476,4 947 443,4 827 373,3 661 238,2 905 310,1 280 892,...,,,,,,,,,,
4,Nigeria,21 325 186,18 325 240,19 806 915,14 548 024,13 087 878,13 598 282,8 068 583,8 572 322,,...,479 845,143 079,,,,,,,,


Unnamed: 0,Country,2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
0,Burkina Faso,673 905,967 358,597 601,1 712 176,1 698 411,20 407,1 271 007,2 849 753,3 376 975,...,4 355 073,3 753 724,2 443 387,2 016 602,1 594 360,1 528 388,1 443 184,1 188 870,352 587,
1,Ghana,329 958,431 943,588 420,6 222 946,6 468 934,6 022 349,6 021 210,5 150 090,5 616 250,...,2 657 618,2 105 664,2 646 663,3 039 197,2 797 876,2 940 592,3 552 896,3 140 893,3 044 844,3 349 528
2,Kenya,427 117,3 216 199,,8 557 644,4 855 050,5 582 276,6 177 953,6 846 974,7 415 667,...,8 123 689,,9 610 691,8 926 058,9 181 224,7 517 213,5 298 625,3 299 350,3 262 931,4 216 531
3,Malawi,33 923,30 577,21 813,,988 905,338 013,1 272 178,2 160 393,2 625 946,...,6 183 816,5 185 082,4 786 045,4 498 949,3 688 389,2 871 098,3 358 960,2 784 001,3 823 796,3 646 212
4,Nigeria,2 283 611,3 254 815,3 569 878,5 934 356,7 131 390,10 358 387,8 633 678,8 685 173,12 830 911,...,3 815 841,2 548 016,2 969 950,3 982 372,3 532 108,3 310 229,2 608 479,2 605 381,2 253 519,2 476 608


Unnamed: 0,Country,2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
0,Burkina Faso,12 465 543,11 567 698,6 475 027,11 991 146,12 255 671,9 799 818,8 286 453,8 278 408,7 146 026,...,4 537 600,3 790 238,2 487 633,2 060 867,1 615 695,1 546 644,1 443 184,1 188 870,352 587,
1,Ghana,6 077 543,5 604 746,6 703 687,11 154 400,13 472 089,11 451 328,11 678 306,8 566 002,7 259 892,...,3 761 988,3 200 147,3 123 147,3 511 452,3 452 969,3 416 033,3 552 896,3 140 893,3 044 844,3 349 528
2,Kenya,4 255 874,6 875 369,5 019 389,10 875 734,8 462 076,8 647 072,8 219 230,9 698 529,9 790 796,...,8 123 689,839 903,9 610 691,8 926 058,9 181 224,7 545 541,5 338 008,3 319 399,3 262 931,4 216 531
3,Malawi,6 982 423,7 169 642,5 205 920,5 865 476,5 936 348,5 165 386,4 933 416,5 065 703,3 906 838,...,6 183 816,5 185 082,4 786 045,4 498 949,3 688 389,2 871 098,3 358 960,2 784 001,3 823 796,3 646 212
4,Nigeria,23 608 797,21 580 055,23 376 793,20 482 380,20 219 268,23 956 669,16 702 261,17 257 495,12 830 911,...,4 295 686,2 691 095,2 969 950,3 982 372,3 532 108,3 310 229,2 608 479,2 605 381,2 253 519,2 476 608
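
The subsetting step above keeps only the five study countries. `subset_study_countries` is defined in `../scripts/explorer.py` and is not shown in this notebook, so the block below is only a minimal sketch of what such a helper might do. `STUDY_COUNTRIES` and `subset_countries` are hypothetical names, and the country list is inferred from the outputs above.

```python
import pandas as pd

# Assumption: the five study countries, as they appear in the outputs above
STUDY_COUNTRIES = ["Burkina Faso", "Ghana", "Kenya", "Malawi", "Nigeria"]


def subset_countries(df, country_col, countries=STUDY_COUNTRIES):
    """Hypothetical stand-in for transformer.subset_study_countries."""
    mask = df[country_col].isin(countries)
    # Reset the index so rows are numbered 0..n-1, as in the displayed frames
    return df.loc[mask].reset_index(drop=True)


# Toy frame with a few rows copied from confirmed_cases.csv
toy = pd.DataFrame(
    {
        "Country": ["French Guiana", "Burkina Faso", "Ghana", "Zambia"],
        "2021": ["143", "11 791 638", "5 747 585", "6 769 142"],
    }
)
subset = subset_countries(toy, "Country")
print(subset)
```

Filtering with `isin` silently ignores countries whose spelling differs from the list, so it may be worth checking that all five names are present after the call.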


In [40]:
# Reshape source_df5 to get a better format
cleaned_df5_1 = transformer.extract_unique_serie(source_df5_1, "Country", "Confirmed cases")

cleaned_df5_2 = transformer.extract_unique_serie(source_df5_2, "Country", "Presumed cases")

cleaned_df5_3 = transformer.extract_unique_serie(source_df5_3, "Country", "Total cases")

# Display cleaned_df5_1
display(cleaned_df5_1.head())

# Display cleaned_df5_2
display(cleaned_df5_2.head())

# Display cleaned_df5_3
display(cleaned_df5_3.head())

Unnamed: 0,Country,Year,Confirmed cases
0,Burkina Faso,2021,11 791 638
1,Burkina Faso,2020,10 600 340
2,Burkina Faso,2019,5 877 426
3,Burkina Faso,2018,10 278 970
4,Burkina Faso,2017,10 557 260


Unnamed: 0,Country,Year,Presumed cases
0,Burkina Faso,2021,673 905
1,Burkina Faso,2020,967 358
2,Burkina Faso,2019,597 601
3,Burkina Faso,2018,1 712 176
4,Burkina Faso,2017,1 698 411


Unnamed: 0,Country,Year,Total cases
0,Burkina Faso,2021,12 465 543
1,Burkina Faso,2020,11 567 698
2,Burkina Faso,2019,6 475 027
3,Burkina Faso,2018,11 991 146
4,Burkina Faso,2017,12 255 671
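
The reshape above turns one column per year into one row per country and year. The real `extract_unique_serie` is not shown here; the sketch below reproduces the same wide-to-long shape with `DataFrame.melt`, under the assumption that this is roughly what the helper does. `wide_to_long` is a hypothetical name.

```python
import pandas as pd


def wide_to_long(df, id_col, value_name):
    """Hypothetical stand-in for transformer.extract_unique_serie."""
    long_df = df.melt(id_vars=[id_col], var_name="Year", value_name=value_name)
    # A stable sort groups rows by country while keeping the original year order
    return long_df.sort_values(id_col, kind="stable").reset_index(drop=True)


# Toy frame shaped like source_df5_1 (values copied from the outputs above)
wide = pd.DataFrame(
    {
        "Country": ["Burkina Faso", "Ghana"],
        "2021": ["11 791 638", "5 747 585"],
        "2020": ["10 600 340", "5 172 803"],
    }
)
long_df = wide_to_long(wide, "Country", "Confirmed cases")
print(long_df)
```

Note that `Year` is still a string after melting, which is why the notebook casts it with `astype({"Year": int})` later on.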


In [41]:
cleaned_df5 = (
    cleaned_df5_1.
    merge(cleaned_df5_2, on=["Country", "Year"], how="outer").
    merge(cleaned_df5_3, on=["Country", "Year"], how="outer")
)
cleaned_df5 = cleaned_df5.astype({"Year": int})

cleaned_df5.head()

Unnamed: 0,Country,Year,Confirmed cases,Presumed cases,Total cases
0,Burkina Faso,2021,11 791 638,673 905,12 465 543
1,Burkina Faso,2020,10 600 340,967 358,11 567 698
2,Burkina Faso,2019,5 877 426,597 601,6 475 027
3,Burkina Faso,2018,10 278 970,1 712 176,11 991 146
4,Burkina Faso,2017,10 557 260,1 698 411,12 255 671
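
The case counts in the merged frame are still strings that use a space as thousands separator ("11 791 638"), so they cannot be summed or plotted as they are. The sketch below shows one possible way to convert them and to check that the total equals confirmed plus presumed cases. It assumes the separators are whitespace characters only; `to_number` is a hypothetical helper, and the toy rows are copied from the outputs in this section.

```python
import pandas as pd

# Toy rows copied from the outputs in this section; None marks a missing value
merged = pd.DataFrame(
    {
        "Country": ["Burkina Faso", "Burkina Faso", "Malawi"],
        "Year": [2021, 2020, 2005],
        "Confirmed cases": ["11 791 638", "10 600 340", None],
        "Presumed cases": ["673 905", "967 358", "3 688 389"],
        "Total cases": ["12 465 543", "11 567 698", "3 688 389"],
    }
)

count_cols = ["Confirmed cases", "Presumed cases", "Total cases"]


def to_number(col):
    # Remove every whitespace character, then parse; unparseable values become missing
    cleaned = col.str.replace(r"\s+", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


numeric = merged.copy()
numeric[count_cols] = numeric[count_cols].apply(to_number)

# Sanity check: total should equal confirmed + presumed (missing counted as 0)
recomputed = numeric["Confirmed cases"].fillna(0) + numeric["Presumed cases"].fillna(0)
mismatches = numeric[recomputed != numeric["Total cases"]]
print(numeric)
print("Rows where the total does not add up:", len(mismatches))
```

Treating a missing count as 0 in the check is a choice made for this sketch; it matches rows such as Malawi 2005, where only presumed cases are reported and the total equals that figure.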


In [42]:
# Convert year to datetime format
cleaned_df5 = transformer.convert_to_dateformat(cleaned_df5, "Year")


In [43]:
cleaned_df5.sample(5)

Unnamed: 0,Country,Year,Confirmed cases,Presumed cases,Total cases
4,Burkina Faso,2017-12-31,10 557 260,1 698 411,12 255 671
82,Malawi,2005-12-31,,3 688 389,3 688 389
97,Nigeria,2012-12-31,,6 938 519,6 938 519
9,Burkina Faso,2012-12-31,3 858 046,3 112 654,6 970 700
106,Nigeria,2003-12-31,,2 608 479,2 608 479
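
After the conversion, each year is represented by its last day (for example 2017 becomes 2017-12-31). `convert_to_dateformat` is not shown in this notebook; the block below is a minimal sketch of an equivalent conversion, assuming the helper simply maps an integer year to a year-end timestamp. `to_year_end` is a hypothetical name.

```python
import pandas as pd


def to_year_end(df, year_col="Year"):
    """Hypothetical stand-in for transformer.convert_to_dateformat."""
    out = df.copy()
    # Assemble a date from its parts: the given year, month 12, day 31
    parts = pd.DataFrame({"year": out[year_col], "month": 12, "day": 31})
    out[year_col] = pd.to_datetime(parts)
    return out


toy = pd.DataFrame({"Country": ["Burkina Faso", "Nigeria"], "Year": [2017, 2003]})
dated = to_year_end(toy)
print(dated)
```

Using the same year-end convention in every dataset keeps the later merges on `Year` exact.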


### Exploring Population_access_to_ITN.csv - From WHO site

Source: https://apps.who.int/gho/data/node.main

In [44]:
source_df6 = transformer.load_data(ext="csv", filepath=filepath_6, repo=REPO, header=1)

# Display first rows
display(source_df6.head())

# Display last rows
display(source_df6.tail())

Unnamed: 0,"Countries, territories and areas",2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
0,Angola,16.78,15.49,27.48,52.69,51.68,23.27,22.61,23.26,12.92,...,15.12,17.39,13.02,6.41,4.01,4.49,5.3,4.72,3.67,2.82
1,Benin,63.97,68.49,29.86,58.93,52.89,26.69,64.15,32.67,15.81,...,18.19,22.61,22.37,3.31,3.16,4.1,3.6,3.13,3.05,2.6
2,Burkina Faso,52.04,69.39,66.52,49.46,67.7,69.07,51.96,70.2,49.45,...,7.53,7.73,5.55,5.19,4.28,3.0,2.6,2.9,2.97,2.55
3,Burundi,66.91,65.19,49.99,71.64,59.25,38.81,66.1,65.6,39.48,...,44.59,34.65,27.56,9.09,4.04,4.15,3.28,3.36,3.3,2.78
4,Cameroon,68.36,73.51,69.3,62.92,71.06,70.02,56.71,55.86,60.05,...,6.88,6.14,6.41,6.63,3.3,2.22,2.21,2.49,2.94,2.62


Unnamed: 0,"Countries, territories and areas",2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
35,Togo,88.95,83.8,65.75,79.28,77.53,62.11,77.05,68.61,55.74,...,53.29,38.37,15.28,22.83,30.65,19.5,3.27,3.31,3.22,2.68
36,Uganda,74.47,69.85,60.63,80.55,83.15,64.79,67.17,68.53,47.84,...,17.45,19.13,12.75,6.61,4.03,3.42,3.35,3.37,3.12,2.68
37,United Republic of Tanzania,50.66,56.08,53.8,58.25,65.71,64.06,48.16,22.31,45.76,...,38.1,21.14,16.97,12.25,13.1,8.86,5.6,3.6,3.13,2.64
38,Zambia,56.98,45.34,48.21,64.45,71.52,38.26,59.52,61.12,58.42,...,34.26,42.85,36.06,17.51,6.35,4.01,3.81,4.69,4.34,2.77
39,Zimbabwe,36.38,40.32,37.84,37.47,46.84,48.38,47.04,46.74,30.52,...,12.42,8.22,10.91,7.79,2.8,3.03,3.33,3.34,3.22,2.79


In [45]:
# Rename columns
source_df6.rename(
   columns={
        source_df6.columns[0]: "Country"
    }, inplace=True
)

# Select and subset study countries 
source_df6 = transformer.subset_study_countries(source_df6, "Country")

# Display source_df6
display(source_df6.head())

Unnamed: 0,Country,2021,2020,2019,2018,2017,2016,2015,2014,2013,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
0,Burkina Faso,52.04,69.39,66.52,49.46,67.7,69.07,51.96,70.2,49.45,...,7.53,7.73,5.55,5.19,4.28,3.0,2.6,2.9,2.97,2.55
1,Ghana,63.78,58.99,78.9,85.41,71.2,70.05,68.87,61.7,63.76,...,24.47,31.26,27.15,11.01,4.96,4.15,3.18,3.48,3.4,2.79
2,Kenya,57.42,46.24,61.24,71.0,70.05,63.26,60.05,52.94,55.02,...,46.27,47.21,46.31,38.18,13.86,5.05,5.38,8.36,3.1,2.72
3,Malawi,54.13,43.23,71.1,72.58,61.79,63.12,31.02,48.91,62.67,...,37.49,28.49,19.24,25.12,14.96,16.08,8.83,3.94,3.36,2.77
4,Nigeria,51.01,44.59,47.98,50.42,49.65,52.78,53.67,43.26,38.35,...,13.35,3.82,2.94,2.81,3.05,2.87,2.38,2.61,2.96,2.67


In [46]:
# Reshape source_df6 to get a better format
cleaned_df6 = transformer.extract_unique_serie(source_df6, "Country", "ITN Access Population (%)")

cleaned_df6 = cleaned_df6.astype({"Year": int})

# Display cleaned_df6
display(cleaned_df6.head())


Unnamed: 0,Country,Year,ITN Access Population (%)
0,Burkina Faso,2021,52.04
1,Burkina Faso,2020,69.39
2,Burkina Faso,2019,66.52
3,Burkina Faso,2018,49.46
4,Burkina Faso,2017,67.7


In [47]:
# Convert year to datetime format
cleaned_df6 = transformer.convert_to_dateformat(cleaned_df6, "Year")


In [48]:
cleaned_df6.sample(5)

Unnamed: 0,Country,Year,ITN Access Population (%)
23,Ghana,2020-12-31,58.99
94,Nigeria,2015-12-31,53.67
25,Ghana,2018-12-31,85.41
7,Burkina Faso,2014-12-31,70.2
95,Nigeria,2014-12-31,43.26


### Exploring pr_timeseries_annual_cru_1901-2021_BFA.csv, pr_timeseries_annual_cru_1901-2021_GHA.csv, pr_timeseries_annual_cru_1901-2021_KEN.csv and pr_timeseries_annual_cru_1901-2021_MWI.csv; tasmin_timeseries_annual_cru_1901-2021_BFA.csv, tasmin_timeseries_annual_cru_1901-2021_GHA.csv, tasmin_timeseries_annual_cru_1901-2021_KEN.csv and tasmin_timeseries_annual_cru_1901-2021_MWI.csv - From World Bank Group site

Source: https://climateknowledgeportal.worldbank.org/download-data

In [49]:
source_df7_1 = transformer.load_data(ext="csv", filepath=filepath_711, repo=REPO, header=1)
source_df7_2 = transformer.load_data(ext="csv", filepath=filepath_721, repo=REPO, header=1)
source_df7_3 = transformer.load_data(ext="csv", filepath=filepath_731, repo=REPO, header=1)
source_df7_4 = transformer.load_data(ext="csv", filepath=filepath_741, repo=REPO, header=1)
source_df7_5 = transformer.load_data(ext="csv", filepath=filepath_751, repo=REPO, header=1)

source_df7_12 = transformer.load_data(ext="csv", filepath=filepath_712, repo=REPO, header=1)
source_df7_22 = transformer.load_data(ext="csv", filepath=filepath_722, repo=REPO, header=1)
source_df7_32 = transformer.load_data(ext="csv", filepath=filepath_732, repo=REPO, header=1)
source_df7_42 = transformer.load_data(ext="csv", filepath=filepath_742, repo=REPO, header=1)
source_df7_52 = transformer.load_data(ext="csv", filepath=filepath_752, repo=REPO, header=1)


# Display first rows 
display(source_df7_1.head())

display(source_df7_2.head())

display(source_df7_3.head())

display(source_df7_4.head())

display(source_df7_5.head())

display(source_df7_12.head())

display(source_df7_22.head())

display(source_df7_32.head())

display(source_df7_42.head())

display(source_df7_52.head())

Unnamed: 0.1,Unnamed: 0,Nigeria,Adamawa,Akwa Ibom,Anambra,Benue,Borno,Cross River,Delta,Edo,...,Ebonyi,Ekiti,Enugu,Gombe,Nassarawa,Ondo,Plateau,Rivers,Sokoto,Zamfara
0,1901,1344.89,1025.47,2639.57,2242.89,1658.13,568.55,2512.83,3024.8,2639.05,...,2170.08,2177.53,1936.3,869.05,1439.77,2621.25,1211.56,2568.02,674.2,886.57
1,1902,1150.76,1025.47,2958.45,2094.62,1725.11,568.55,2761.82,2221.06,1667.74,...,2317.05,1111.19,2034.81,869.05,1206.83,1318.3,1122.55,2737.25,674.2,880.17
2,1903,1222.17,1025.47,2902.59,2195.42,1766.09,568.55,2743.53,2500.09,1956.98,...,2320.23,1466.0,2050.68,869.05,1394.84,1720.48,1180.15,2772.0,674.2,884.21
3,1904,1155.25,996.41,2738.85,2031.23,1602.78,522.16,2555.9,2410.35,1855.39,...,2119.99,1393.23,1851.44,867.76,1132.58,1669.13,1100.28,2636.84,674.2,878.12
4,1905,1262.99,1028.84,3152.08,2358.62,1825.43,556.63,2974.42,2527.35,1930.75,...,2517.63,1386.44,2219.84,923.53,1305.28,1639.38,1204.93,3049.36,695.6,1016.81


Unnamed: 0.1,Unnamed: 0,Burkina Faso,Boucle Du Mouhoun,Cascades,Centre,Centre-est,Centre-nord,Centre-ouest,Centre-sud,Est,Hauts-bassins,Nord,Plateau Central,Sahel,Sud-ouest
0,1901,819.6,788.84,1088.73,837.23,965.39,656.49,873.47,953.77,835.9,1011.01,623.81,814.74,500.27,1037.24
1,1902,830.19,826.02,1103.2,861.71,925.38,677.63,914.81,947.6,792.01,1041.34,649.35,827.73,505.98,1102.9
2,1903,793.49,786.9,1098.46,793.93,866.64,627.69,847.78,886.9,756.58,1024.34,617.25,765.2,493.7,1055.76
3,1904,835.93,859.12,1098.33,891.07,874.03,715.07,926.04,918.72,770.6,1049.35,683.28,852.27,531.51,1077.06
4,1905,811.87,799.06,1133.34,816.88,906.6,644.93,858.75,911.22,781.37,1037.39,660.61,786.09,504.84,1035.76


Unnamed: 0.1,Unnamed: 0,Ghana,Ashanti,Brong Ahafo,Central,Eastern,Greater Accra,Northern,Upper East,Upper West,Volta,Western
0,1901,1279.37,1429.3,1296.37,1251.96,1305.07,967.72,1195.28,1068.52,1034.45,1276.55,1647.23
1,1902,1180.87,1248.94,1207.39,1145.44,1088.15,854.43,1165.22,1034.48,1066.25,1122.76,1440.46
2,1903,1051.25,1103.33,1047.88,975.26,991.28,741.75,1036.76,977.29,1000.39,993.01,1285.99
3,1904,1004.62,1035.66,1025.73,932.86,880.1,677.21,977.95,942.07,1004.09,937.96,1271.11
4,1905,1132.06,1171.41,1149.84,1057.41,1035.12,764.02,1143.2,1011.98,1005.38,1117.57,1345.24


Unnamed: 0.1,Unnamed: 0,Kenya,Central,Coast,Eastern,Nairobi,North Eastern,Nyanza,Rift Valley,Western
0,1901,663.78,968.07,765.69,639.79,841.38,482.37,1123.77,677.49,1160.94
1,1902,739.69,1157.12,834.69,672.14,976.85,484.46,1405.39,803.41,1567.67
2,1903,665.61,1128.04,554.06,567.36,877.99,378.18,1602.86,834.17,1744.96
3,1904,704.95,1069.12,934.78,652.6,835.32,432.23,1253.39,729.7,1335.61
4,1905,757.08,1185.96,916.37,710.23,959.29,500.12,1342.79,791.2,1373.82


Unnamed: 0.1,Unnamed: 0,Malawi,Central Region,Northern Region,Southern Region,Area under National Administration
0,1901,1143.88,999.66,1162.08,1138.87,1367.49
1,1902,1188.29,1006.32,1354.5,1028.9,1518.6
2,1903,1047.24,889.48,1205.53,879.63,1361.4
3,1904,1430.77,1167.87,1531.97,1408.52,1764.54
4,1905,1147.05,1049.92,1193.82,1090.37,1334.69


Unnamed: 0.1,Unnamed: 0,Nigeria,Adamawa,Akwa Ibom,Anambra,Benue,Borno,Cross River,Delta,Edo,...,Ebonyi,Ekiti,Enugu,Gombe,Nassarawa,Ondo,Plateau,Rivers,Sokoto,Zamfara
0,1901,20.69,19.79,21.41,21.73,21.21,20.34,21.03,22.29,22.06,...,21.51,20.66,21.34,20.1,20.78,22.04,19.36,21.13,21.74,20.3
1,1902,20.88,20.04,21.68,22.02,21.52,20.42,21.31,22.57,22.34,...,21.8,20.92,21.63,20.33,21.09,22.3,19.65,21.4,21.77,20.36
2,1903,20.91,20.06,21.8,22.1,21.58,20.42,21.42,22.65,22.4,...,21.89,20.95,21.72,20.34,21.12,22.33,19.67,21.51,21.87,20.38
3,1904,21.07,20.28,21.54,21.93,21.57,20.62,21.25,22.47,22.28,...,21.73,20.95,21.57,20.58,21.23,22.26,19.88,21.27,22.23,20.74
4,1905,20.51,19.3,21.87,22.08,21.18,19.78,21.27,22.65,22.39,...,21.74,20.85,21.61,19.56,20.71,22.3,19.0,21.57,21.68,20.12


Unnamed: 0.1,Unnamed: 0,Burkina Faso,Boucle Du Mouhoun,Cascades,Centre,Centre-est,Centre-nord,Centre-ouest,Centre-sud,Est,Hauts-bassins,Nord,Plateau Central,Sahel,Sud-ouest
0,1901,21.78,21.63,21.51,21.79,21.82,21.82,21.71,21.79,21.97,21.36,21.92,21.81,22.02,21.77
1,1902,22.13,22.2,21.83,22.11,22.07,22.12,22.11,22.09,22.18,21.82,22.37,22.11,22.31,22.19
2,1903,22.39,22.28,21.75,22.53,22.41,22.61,22.38,22.44,22.55,21.77,22.7,22.54,22.82,22.23
3,1904,22.5,22.4,21.87,22.54,22.51,22.63,22.46,22.5,22.73,21.9,22.75,22.57,22.88,22.38
4,1905,22.15,21.99,21.58,22.23,22.13,22.4,22.05,22.1,22.36,21.55,22.43,22.28,22.62,21.93


Unnamed: 0.1,Unnamed: 0,Ghana,Ashanti,Brong Ahafo,Central,Eastern,Greater Accra,Northern,Upper East,Upper West,Volta,Western
0,1901,21.97,21.83,21.86,22.48,21.91,23.01,21.85,21.88,21.77,22.17,22.38
1,1902,22.22,22.04,22.09,22.68,22.11,23.23,22.12,22.16,22.13,22.38,22.58
2,1903,22.14,21.85,21.95,22.48,21.91,23.02,22.14,22.4,22.27,22.21,22.37
3,1904,22.33,22.06,22.16,22.69,22.11,23.23,22.32,22.52,22.41,22.41,22.59
4,1905,21.83,21.59,21.64,22.3,21.64,22.8,21.76,22.03,21.9,21.89,22.23


Unnamed: 0.1,Unnamed: 0,Kenya,Central,Coast,Eastern,Nairobi,North Eastern,Nyanza,Rift Valley,Western
0,1901,18.51,9.75,20.76,19.24,12.19,21.77,14.5,15.79,13.52
1,1902,18.69,9.98,21.07,19.4,12.5,21.91,14.79,15.96,13.68
2,1903,18.62,9.84,20.89,19.37,12.27,21.91,14.54,15.89,13.56
3,1904,18.0,9.26,20.33,18.71,11.75,21.27,13.98,15.28,12.94
4,1905,18.19,9.5,20.58,18.88,12.03,21.42,14.3,15.47,13.12


Unnamed: 0.1,Unnamed: 0,Malawi,Central Region,Northern Region,Southern Region,Area under National Administration
0,1901,16.41,15.5,14.93,17.39,18.23
1,1902,16.78,15.89,15.25,17.81,18.57
2,1903,16.83,15.97,15.19,17.98,18.54
3,1904,16.18,15.22,14.84,17.05,18.09
4,1905,16.76,15.86,15.2,17.85,18.53


In [50]:
# Get precipitation data of each country
pr_NGA = transformer.get_country_climate_data(source_df7_1, "Precipitation", start=2000)
pr_BFA = transformer.get_country_climate_data(source_df7_2, "Precipitation", start=2000)
pr_GHA = transformer.get_country_climate_data(source_df7_3, "Precipitation", start=2000)
pr_KEN = transformer.get_country_climate_data(source_df7_4, "Precipitation", start=2000)
pr_MWI = transformer.get_country_climate_data(source_df7_5, "Precipitation", start=2000)

# Get min temperature data of each country
mt_NGA = transformer.get_country_climate_data(source_df7_12, "Min Temperature", start=2000)
mt_BFA = transformer.get_country_climate_data(source_df7_22, "Min Temperature", start=2000)
mt_GHA = transformer.get_country_climate_data(source_df7_32, "Min Temperature", start=2000)
mt_KEN = transformer.get_country_climate_data(source_df7_42, "Min Temperature", start=2000)
mt_MWI = transformer.get_country_climate_data(source_df7_52, "Min Temperature", start=2000)


In [51]:
# Concatenate precipitation series
pr_df = pd.concat([pr_NGA, pr_BFA, pr_GHA, pr_KEN, pr_MWI])

# Concatenate min temperature series
mt_df = pd.concat([mt_NGA, mt_BFA, mt_GHA, mt_KEN, mt_MWI])

In [52]:
cleaned_df7 = pr_df.merge(mt_df, on=["Country", "Year"], how="outer")

In [53]:
cleaned_df7.sample(5)

Unnamed: 0,Year,Precipitation,Country,Min Temperature
98,2010-12-31,934.38,Malawi,17.7
88,2000-12-31,1138.28,Malawi,16.94
82,2016-12-31,712.17,Kenya,19.74
56,2012-12-31,1253.35,Ghana,22.6
85,2019-12-31,1146.13,Kenya,19.6
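
The three cells above extract the national series from each climate file, stack the countries, and join precipitation with minimum temperature. `get_country_climate_data` is not shown here, so the following is only a sketch under the assumption that the first column of each file holds the year and the second column the national value. `country_series` is a hypothetical name, and the conversion of `Year` to a date is left out for brevity.

```python
import pandas as pd


def country_series(df, value_name, start=2000):
    """Hypothetical stand-in for transformer.get_country_climate_data."""
    year_col, country_col = df.columns[0], df.columns[1]
    # Keep the year and the national column only; regional columns are dropped
    out = df.loc[df[year_col] >= start, [year_col, country_col]].copy()
    out.columns = ["Year", value_name]
    out["Country"] = country_col
    return out.reset_index(drop=True)


# Toy frames built from the first rows displayed above (years 1901 and 1902)
pr_mwi = pd.DataFrame({"Unnamed: 0": [1901, 1902], "Malawi": [1143.88, 1188.29], "Central Region": [999.66, 1006.32]})
pr_ken = pd.DataFrame({"Unnamed: 0": [1901, 1902], "Kenya": [663.78, 739.69], "Central": [968.07, 1157.12]})
mt_mwi = pd.DataFrame({"Unnamed: 0": [1901, 1902], "Malawi": [16.41, 16.78], "Central Region": [15.5, 15.89]})
mt_ken = pd.DataFrame({"Unnamed: 0": [1901, 1902], "Kenya": [18.51, 18.69], "Central": [9.75, 9.98]})

# start=1902 only because the toy data ends there; the notebook uses start=2000
pr = pd.concat([country_series(f, "Precipitation", start=1902) for f in (pr_mwi, pr_ken)], ignore_index=True)
mt = pd.concat([country_series(f, "Min Temperature", start=1902) for f in (mt_mwi, mt_ken)], ignore_index=True)

climate = pr.merge(mt, on=["Country", "Year"], how="outer")
print(climate)
```

An outer join keeps a country-year even when only one of the two variables is available, at the cost of introducing missing values that need to be handled later.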


### Exploring Annual_Surface_Temperature_Change.csv - From International Monetary Fund (IMF) Climate Change Dashboard

URL: https://climatedata.imf.org/pages/climatechange-data

In [54]:
source_df8 = transformer.load_data(ext="csv", filepath=filepath_8, repo=REPO)

# Display first rows
display(source_df8.head())

# Display last rows
display(source_df8.tail())

Unnamed: 0,ObjectId,Country,ISO2,ISO3,Indicator,Unit,Source,CTS_Code,CTS_Name,CTS_Full_Descriptor,...,F2013,F2014,F2015,F2016,F2017,F2018,F2019,F2020,F2021,F2022
0,1,"Afghanistan, Islamic Rep. of",AF,AFG,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,1.281,0.456,1.093,1.555,1.54,1.544,0.91,0.498,1.327,2.012
1,2,Albania,AL,ALB,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,1.333,1.198,1.569,1.464,1.121,2.028,1.675,1.498,1.536,1.518
2,3,Algeria,DZ,DZA,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,1.192,1.69,1.121,1.757,1.512,1.21,1.115,1.926,2.33,1.688
3,4,American Samoa,AS,ASM,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,1.257,1.17,1.009,1.539,1.435,1.189,1.539,1.43,1.268,1.256
4,5,"Andorra, Principality of",AD,AND,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,0.831,1.946,1.69,1.99,1.925,1.919,1.964,2.562,1.533,3.243


Unnamed: 0,ObjectId,Country,ISO2,ISO3,Indicator,Unit,Source,CTS_Code,CTS_Name,CTS_Full_Descriptor,...,F2013,F2014,F2015,F2016,F2017,F2018,F2019,F2020,F2021,F2022
220,221,Western Sahara,EH,ESH,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,1.423,1.401,1.51,1.732,2.204,0.942,1.477,2.069,1.593,1.97
221,222,World,,WLD,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,1.016,1.053,1.412,1.66,1.429,1.29,1.444,1.711,1.447,1.394
222,223,"Yemen, Rep. of",YE,YEM,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,,,,,,,,,,
223,224,Zambia,ZM,ZMB,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,0.79,0.917,1.45,1.401,0.105,0.648,0.855,0.891,0.822,0.686
224,225,Zimbabwe,ZW,ZWE,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,0.118,0.025,0.97,1.27,0.088,0.453,0.925,0.389,-0.125,-0.49


In [55]:
# Select and subset study countries 
cleaned_df8 = transformer.subset_study_countries(source_df8, "Country")


In [56]:
cleaned_df8.head()

Unnamed: 0,ObjectId,Country,ISO2,ISO3,Indicator,Unit,Source,CTS_Code,CTS_Name,CTS_Full_Descriptor,...,F2013,F2014,F2015,F2016,F2017,F2018,F2019,F2020,F2021,F2022
0,31,Burkina Faso,BF,BFA,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,0.849,0.888,1.111,1.075,1.212,1.156,1.065,1.044,1.624,0.802
1,76,Ghana,GH,GHA,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,0.841,0.887,1.132,1.217,1.292,1.081,1.32,1.284,1.494,0.996
2,103,Kenya,KE,KEN,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,0.93,1.024,1.164,1.237,1.5,0.675,1.624,1.344,1.421,1.28
3,119,Malawi,MW,MWI,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,0.676,1.129,1.141,1.361,1.016,1.022,1.22,1.306,0.762,0.816
4,147,Nigeria,NG,NGA,Temperature change with respect to a baseline ...,Degree Celsius,Food and Agriculture Organization of the Unite...,ECCS,Surface Temperature Change,"Environment, Climate Change, Climate Indicator...",...,0.922,1.012,1.187,1.168,1.17,1.093,1.229,1.127,1.559,0.791


In [57]:
change_temp_columns = [i for i in cleaned_df8.columns if "F2" in i]

In [58]:
cleaned_df8 = transformer.extract_series([change_temp_columns, "Surface Temperature Change", "F"], source_data=cleaned_df8, immutable_columns=['Country', 'ISO3'])

In [59]:
display(cleaned_df8.sample(5))

Unnamed: 0,Country,ISO3,Year,Surface Temperature Change
78,Malawi,MWI,2009-12-31,0.648
22,Burkina Faso,BFA,2022-12-31,0.802
98,Nigeria,NGA,2006-12-31,1.204
87,Malawi,MWI,2018-12-31,1.022
66,Kenya,KEN,2020-12-31,1.344
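
In the IMF file the year columns carry an `F` prefix (`F2013`, `F2014`, ...), which has to be stripped while reshaping. The sketch below illustrates one way to do this with `DataFrame.melt`; it is an assumption about what `extract_series` does with its `"F"` argument, not its actual code. The column filter is also stricter than `"F2" in i`, which would match any column name that merely contains `F2`.

```python
import pandas as pd

# Toy frame with values copied from the output of cleaned_df8.head()
wide = pd.DataFrame(
    {
        "Country": ["Burkina Faso", "Ghana"],
        "ISO3": ["BFA", "GHA"],
        "Unit": ["Degree Celsius", "Degree Celsius"],
        "F2021": [1.624, 1.494],
        "F2022": [0.802, 0.996],
    }
)

prefix = "F"
# Keep only columns that are the prefix followed by digits, e.g. "F2021"
year_cols = [c for c in wide.columns if c.startswith(prefix) and c[len(prefix):].isdigit()]

long_df = wide.melt(
    id_vars=["Country", "ISO3"],
    value_vars=year_cols,
    var_name="Year",
    value_name="Surface Temperature Change",
)
# Strip the prefix and turn the remainder into an integer year
long_df["Year"] = long_df["Year"].str[len(prefix):].astype(int)
long_df = long_df.sort_values(["Country", "Year"]).reset_index(drop=True)
print(long_df)
```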


### Exploring API_19_DS2_en_csv_v2_5729899.csv - From World Bank Group

URL: https://data.worldbank.org/topic/19

In [60]:
source_df9 = transformer.load_data(ext="csv", filepath=filepath_9, repo=REPO, header=2)

# Display first rows
display(source_df9.head())


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 67
0,Aruba,ABW,Urban population (% of total population),SP.URB.TOTL.IN.ZS,50.776,50.761,50.746,50.73,50.715,50.7,...,43.041,43.108,43.192,43.293,43.411,43.546,43.697,43.866,44.052,
1,Aruba,ABW,Urban population,SP.URB.TOTL,27728.0,28330.0,28764.0,29157.0,29505.0,29802.0,...,44588.0,44943.0,45297.0,45648.0,45999.0,46351.0,46574.0,46734.0,46891.0,
2,Aruba,ABW,Urban population growth (annual %),SP.URB.GROW,,2.147858,1.520329,1.357042,1.186472,1.001576,...,0.810669,0.793026,0.784578,0.771899,0.765986,0.762321,0.479958,0.342951,0.335381,
3,Aruba,ABW,"Population, total",SP.POP.TOTL,54608.0,55811.0,56682.0,57475.0,58178.0,58782.0,...,103594.0,104257.0,104874.0,105439.0,105962.0,106442.0,106585.0,106537.0,106445.0,
4,Aruba,ABW,Population growth (annual %),SP.POP.GROW,,2.179059,1.548572,1.389337,1.215721,1.032841,...,0.691615,0.637959,0.590062,0.537296,0.494795,0.45197,0.134255,-0.045045,-0.086392,


In [61]:
# Select and subset study countries 
cleaned_df9 = transformer.subset_study_countries(source_df9, "Country Name")


In [62]:
cleaned_df9.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 67
0,Burkina Faso,BFA,Urban population (% of total population),SP.URB.TOTL.IN.ZS,4.7,4.796,4.893,4.993,5.095,5.198,...,26.934,27.53,28.134,28.743,29.358,29.98,30.607,31.24,31.877,
1,Burkina Faso,BFA,Urban population,SP.URB.TOTL,224813.0,232742.0,240956.0,249584.0,258628.0,268109.0,...,4893865.0,5153071.0,5422969.0,5701421.0,5986896.0,6281301.0,6587430.0,6904253.0,7227715.0,
2,Burkina Faso,BFA,Urban population growth (annual %),SP.URB.GROW,,3.46616,3.46838,3.518119,3.55952,3.600288,...,5.187071,5.161047,5.10506,5.007199,4.885764,4.800404,4.758616,4.697431,4.578534,
3,Burkina Faso,BFA,"Population, total",SP.POP.TOTL,4783259.0,4852833.0,4924497.0,4998671.0,5076111.0,5157929.0,...,18169840.0,18718020.0,19275500.0,19835860.0,20392720.0,20951640.0,21522630.0,22100680.0,22673760.0,
4,Burkina Faso,BFA,Population growth (annual %),SP.POP.GROW,,1.444054,1.465948,1.494994,1.537334,1.598973,...,2.979778,2.972346,2.934811,2.865655,2.768681,2.703876,2.688788,2.650376,2.559988,


In [63]:
cleaned_df9_needed_columns = [i for i in cleaned_df9.columns if i.startswith("2")]

In [64]:
cleaned_df9 = transformer.extract_series(
    [cleaned_df9_needed_columns, "Indicator Value", ""],
    source_data=cleaned_df9,
    immutable_columns=["Country Name", "Indicator Name"], 
    multiple_index=True
)

In [65]:
# Columns with 50% or more missing values are not useful
columns_50_more_missing_df9 = [col for col in cleaned_df9.columns if (cleaned_df9[col].isna().sum() / len(cleaned_df9)) >= 0.5]

# Drop the columns listed in columns_50_more_missing_df9 from cleaned_df9
cleaned_df9 = cleaned_df9.drop(columns=columns_50_more_missing_df9)

In [66]:
display(cleaned_df9.sample(5))

Indicator Name,Year,Country Name,Access to electricity (% of population),Agricultural land (% of land area),Agricultural land (sq. km),"Agriculture, forestry, and fishing, value added (% of GDP)","Annual freshwater withdrawals, total (% of internal resources)","Annual freshwater withdrawals, total (billion cubic meters)",Arable land (% of land area),Average precipitation in depth (mm per year),...,"Population, total","Primary completion rate, total (% of relevant age group)",Renewable electricity output (% of total electricity output),Renewable energy consumption (% of total final energy consumption),"School enrollment, primary and secondary (gross), gender parity index (GPI)",Total greenhouse gas emissions (% change from 1990),Total greenhouse gas emissions (kt of CO2 equivalent),Urban population,Urban population (% of total population),Urban population growth (annual %)
10,2002-12-31,Burkina Faso,9.917558,39.364035,107700.0,26.372448,5.728,0.716,17.178363,748.0,...,12632269.0,27.48547,17.493831,85.43,,312.002599,15786.33044,2432722.0,19.258,6.874386
53,2010-12-31,Malawi,8.7,60.299109,56850.0,24.63214,8.406444,1.3568,39.244803,1181.0,...,14718422.0,69.206673,91.055227,81.21,0.9917,150.85276,6369.508614,2287832.0,15.544,3.634122
49,2009-12-31,Nigeria,49.968922,74.281887,676537.14,26.748855,5.5619,12.2918,35.135105,1150.0,...,156595758.0,73.8032,22.900339,88.68,0.88796,67.299183,244087.5116,66691001.0,42.588,4.829727
35,2007-12-31,Burkina Faso,12.047199,40.270468,110180.0,21.839487,6.544,0.818,17.909357,748.0,...,14757074.0,34.97427,18.137255,84.3,0.81793,259.079953,21063.13735,3393537.0,22.996,5.931429
17,2003-12-31,Kenya,16.0,47.218611,268740.0,25.804439,11.207729,2.32,9.041712,630.0,...,33767122.0,,83.61412,82.87,0.9646,18.913564,36601.44803,7073537.0,20.948,4.686362


### Some vaccination information

#### General information

1. RTS,S vaccine was created in 1987 as part of a collaboration between GlaxoSmithKline (GSK) and the Walter Reed Army Institute of Research (WRAIR) that began in 1984.

2. The Program for Appropriate Technology in Health (PATH) has been involved in RTS,S vaccine development since 2001.

##### Conclusion 1: None of the countries used the vaccine before 2001.

Source: https://www.precisionvaccinations.com/vaccines/mosquirix-malaria-vaccine

3. Ghana, Kenya and Malawi participated in RTS,S Trials through MVIP from 2019 to 2023

##### Conclusion 2: Both Participated in RTS Trials and Participated in MVIP will be 1 for these countries from 2019 to 2023

Source: https://clinicaltrials.gov/study/NCT03806465

4. Burkina Faso participated in an RTS,S trial from 2017 to 2020.

##### Conclusion 3: Participated in RTS Trials will be 1 for Burkina Faso from 2017 to 2020

Source: https://clinicaltrials.gov/study/NCT03143218 : Seasonal Malaria Vaccination (RTS,S/AS01) and Seasonal Malaria Chemoprevention (SP/AQ) (RTSS-SMC)


5. Burkina Faso, Ghana, Kenya and Malawi participated in the RTS,S phase 3 trial from 2009 to 2014.

##### Conclusion 4: Participated in RTS Trials will be 1 for Burkina Faso, Ghana, Kenya and Malawi from 2009 to 2014

Source: https://clinicaltrials.gov/study/NCT00866619 

6. Kenya underwent an extension of the trial until 2015.

##### Conclusion 5: Participated in RTS Trials will be 1 for Kenya in 2015

Source: https://clinicaltrials.gov/study/NCT00872963

7. Kenya participated in RTS,S trials from 2005 to 2008 (Phase 2).

Source: https://clinicaltrials.gov/study/NCT00197054 and https://clinicaltrials.gov/study/NCT00380393

##### Conclusion 6: Participated in RTS Trials will be 1 for Kenya from 2005 to 2008

8. Burkina Faso participated in R21 trials from 2019 to 2023.

Source: https://clinicaltrials.gov/study/NCT03896724

##### Conclusion 7: Participated in R21 Trials will be 1 for Burkina Faso from 2019 to 2022.

9. Burkina Faso and Kenya have been undergoing a phase 3 trial of R21 since 2021.

Source: Phase III randomized controlled multi-centre trial: https://clinicaltrials.gov/study/NCT04704830?tab=history&a=7

##### Conclusion 8: Participated in R21 Trials will be 1 for Burkina Faso and Kenya from 2021 onwards.

In [67]:
vaccination_status_df = cleaned_df8.drop(columns="Surface Temperature Change")

In [68]:
vaccination_status_df[["Participated in MVIP", "Participated in RTS Trials", "Participated in R21 Trials"]] = 0

In [69]:
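# "Year" holds year-end dates (e.g. 2019-12-31), so the upper bound has to be the
# following year ("2024") for rows dated 2023 to be included, as stated in Conclusion 2.
mask_MVIP = (vaccination_status_df["Country"].isin(["Ghana", "Kenya", "Malawi"])) & (vaccination_status_df["Year"].between("2019", "2024"))
vaccination_status_df.loc[mask_MVIP, ["Participated in MVIP", "Participated in RTS Trials"]] = 1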

In [70]:
# "Year" holds year-end dates, so each upper bound is the year after the last trial year.
# Burkina Faso seasonal vaccination trial: 2017 to 2020
mask_RTS_1 = (vaccination_status_df["Country"].isin(["Burkina Faso"])) & (vaccination_status_df["Year"].between("2017", "2021"))
# Phase 3 trial in all four countries: 2009 to 2014
mask_RTS_2 = (vaccination_status_df["Country"].isin(["Burkina Faso", "Ghana", "Kenya", "Malawi"])) & (vaccination_status_df["Year"].between("2009", "2015"))
# Kenya only: phase 2 trials (2005 to 2008) and phase 3 extension (2015)
mask_RTS_3 = (vaccination_status_df["Country"].isin(["Kenya"])) & ((vaccination_status_df["Year"].between("2005", "2009")) | (vaccination_status_df["Year"].between("2015", "2016")))
mask_RTS = mask_RTS_1 | mask_RTS_2 | mask_RTS_3
vaccination_status_df.loc[mask_RTS, "Participated in RTS Trials"] = 1

In [71]:
# Burkina Faso R21 trial: rows dated 2019 to 2022 (year-end dates, upper bound year excluded)
mask_R21_1 = (vaccination_status_df["Country"].isin(["Burkina Faso"])) & (vaccination_status_df["Year"].between("2019", "2023"))
# Burkina Faso and Kenya phase 3 R21 trial: rows dated 2021 to 2023
mask_R21_2 = (vaccination_status_df["Country"].isin(["Burkina Faso", "Kenya"])) & (vaccination_status_df["Year"].between("2021", "2024"))
mask_R21 = mask_R21_1 | mask_R21_2
vaccination_status_df.loc[mask_R21, "Participated in R21 Trials"] = 1
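Because `Year` is displayed as year-end dates (for example 2019-12-31), a string bound such as `"2023"` is read as the start of 2023 and does not match the row dated 2023-12-31. The sketch below illustrates this on a toy frame and shows a table-driven alternative that compares calendar years with inclusive bounds. The frame `status`, the list `r21_periods` and its year ranges are assumptions that mirror the R21 masks above; this is not the notebook's own implementation.

```python
import pandas as pd

# Toy stand-in for vaccination_status_df: year-end dates, two countries.
status = pd.DataFrame(
    [(country, pd.Timestamp(year=year, month=12, day=31))
     for country in ["Burkina Faso", "Kenya"]
     for year in range(2015, 2024)],
    columns=["Country", "Year"],
)
status["Participated in R21 Trials"] = 0

# String bounds are parsed as 1 January of that year, so the row dated
# 2023-12-31 falls outside between("2019", "2023").
string_bound = status["Year"].between("2019", "2023")
print(status.loc[string_bound, "Year"].dt.year.max())

# Table-driven alternative: (countries, first year, last year), both inclusive.
r21_periods = [
    (["Burkina Faso"], 2019, 2022),
    (["Burkina Faso", "Kenya"], 2021, 2023),
]
for countries, first_year, last_year in r21_periods:
    mask = (
        status["Country"].isin(countries)
        & status["Year"].dt.year.between(first_year, last_year)
    )
    status.loc[mask, "Participated in R21 Trials"] = 1

print(status[status["Participated in R21 Trials"] == 1])
```

Keeping the periods in one list makes it easier to check each range against its source and removes the need to remember the off-by-one upper bound.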

In [72]:
vaccination_status_df.sample(5)

Unnamed: 0,Country,ISO3,Year,Participated in MVIP,Participated in RTS Trials,Participated in R21 Trials
27,Ghana,GHA,2004-12-31,0,0,0
88,Malawi,MWI,2019-12-31,1,1,0
51,Kenya,KEN,2005-12-31,0,1,0
53,Kenya,KEN,2007-12-31,0,1,0
66,Kenya,KEN,2020-12-31,1,1,0


In [73]:
vaccination_status_df.to_csv("../data/cleaned/sub-datasets/vaccination_status.csv", index=False)

### Saving sub-datasets

In [74]:
cleaned_df1.to_csv("../data/cleaned/sub-datasets/cleaned_df1.csv", index=False)
cleaned_df2.to_csv("../data/cleaned/sub-datasets/cleaned_df2.csv", index=False)
cleaned_df3.to_csv("../data/cleaned/sub-datasets/cleaned_df3.csv", index=False)
cleaned_df4.to_csv("../data/cleaned/sub-datasets/cleaned_df4.csv", index=False)
cleaned_df5.to_csv("../data/cleaned/sub-datasets/cleaned_df5.csv", index=False)
cleaned_df6.to_csv("../data/cleaned/sub-datasets/cleaned_df6.csv", index=False)
cleaned_df7.to_csv("../data/cleaned/sub-datasets/cleaned_df7.csv", index=False)
cleaned_df8.to_csv("../data/cleaned/sub-datasets/cleaned_df8.csv", index=False)
cleaned_df9.to_csv("../data/cleaned/sub-datasets/cleaned_df9.csv", index=False)

### Key information on vaccination

#### Vaccine efficacy

1. RTS,S: overall efficacy of approximately 30%
2. R21: currently known efficacy of approximately 77%

#### Known vaccination strategies

    1. Essential Programme on Immunisation (EPI) vaccination: age-based priming series, age-based additional doses
        - Age at first vaccination fixed at 5 or 6 months of age.
        - Uses existing EPI vaccine infrastructure and current contacts to deliver RTS,S.
    
    2. Seasonal vaccination (SV): seasonal priming series, seasonal fourth and fifth doses
        - Calendar month of first vaccination fixed.
        - Peak vaccine efficacy of the primary series and additional doses is aligned with the time of peak risk.
        - Once the infrastructure for seasonal doses is established, it may be possible to provide more vaccine doses in childhood.
        - Dose schedule changes could result in heightened efficacy of additional doses compared to EPI scheduling.

    3. Hybrid vaccination: age-based priming series, seasonal fourth and fifth doses
        - Age at first vaccination fixed at 5 or 6 months of age.
        - Uses EPI vaccine infrastructure.
        - Peak efficacy of additional doses is aligned with the time of peak risk.
        - Once the infrastructure for seasonal doses is established, it may be possible to provide more vaccine doses in childhood.

Source: https://gh.bmj.com/content/8/5/e011838 (key: grantDeliveryStrategiesMalaria2023) and Full evidence report on RTS vaccine (key: rtss/as01sage/mpagworkinggroup.FullEvidenceReport2022) 

#### Features to consider when designing vaccination scenarios

1. Vaccine efficacy: 30% (comparable to the RTS,S vaccine), 60% or 75% (comparable to the R21 vaccine)

2. Vaccination coverage (share of susceptible people who are vaccinated): 40%, 60% or 80%

3. Vaccination strategy: EPI, SV or Hybrid

4. Meaningful combinations of the features above
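
As a minimal sketch, the full grid of candidate scenarios implied by the three features can be enumerated with `itertools.product`. The variable names and the dictionary layout are assumptions made for illustration; which combinations count as "meaningful" is not defined here, so no filtering is applied.

```python
from itertools import product

# Feature levels taken from the list above.
efficacies = [0.30, 0.60, 0.75]        # vaccine efficacy
coverages = [0.40, 0.60, 0.80]         # vaccination coverage
strategies = ["EPI", "SV", "Hybrid"]   # vaccination strategy

# Every combination of efficacy, coverage and strategy (3 x 3 x 3).
scenarios = [
    {"efficacy": efficacy, "coverage": coverage, "strategy": strategy}
    for efficacy, coverage, strategy in product(efficacies, coverages, strategies)
]

print(len(scenarios))
print(scenarios[0])
```

The full grid contains 27 scenarios; a later selection step would narrow this down to the combinations worth simulating.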