### Feature engineering and dataset update

In [1]:
""" Import packages """
import sys
import logging
import warnings

import numpy as np
import pandas as pd

sys.path.insert(0, '../scripts/')

logging.basicConfig(filename='../logs/data_cleaning.log', filemode='a',
                    encoding='utf-8', level=logging.DEBUG)

warnings.filterwarnings("ignore")

In [2]:
from explorer import DataTransformer
from dataframe_info_extractor import DataFrameInfo
from utils_cleaner import DataFrameCleaner
from feature_engineering import FeatureEnginnering

transformer = DataTransformer()

In [3]:
REPO = "./"
filepath = "../data/cleaned/final/cleaned_project_dataset.csv"
rev="clpdat1"

### Importing

In [4]:
data = transformer.load_data(ext="csv", filepath=filepath, repo=REPO, rev=rev, parse_dates=["Date"], header=0)
data.head()

Unnamed: 0,Country,ISO3,Date,Malaria_Incidence,Malaria_Deaths_U5,Malaria_Deaths,ITN_Access,PopDensity,MedianAgePop,PopGrowthRate,...,"Foreign direct investment, net inflows (% of GDP)","Mortality rate, under-5 (per 1,000 live births)",Population growth (annual %),Population in urban agglomerations of more than 1 million (% of total population),Urban population (% of total population),Urban population growth (annual %),Rural population,Precipitation,Average Mean Surface Air Temperature,Average Minimum Surface Air Temperature
0,Burkina Faso,BFA,2000-12-31,603.211,874.85,249.82,2.55,43.4316,15.4232,3.02,...,0.778458,178.7,2.983886,7.749025,17.844,6.857565,9762505.0,714.73,29.06,22.72
1,Burkina Faso,BFA,2001-12-31,601.93774,918.92,264.6,2.97,44.7725,15.5302,3.06,...,0.196437,174.7,3.040729,8.086915,18.54,6.86702,9978658.0,749.26,29.19,22.77
2,Burkina Faso,BFA,2002-12-31,595.85205,958.85,274.54,2.9,46.1706,15.6492,3.089,...,0.414816,170.2,3.07479,8.437518,19.258,6.874386,10199547.0,690.37,29.47,23.24
3,Burkina Faso,BFA,2003-12-31,585.1233,965.41,278.27,2.6,47.6264,15.7656,3.12,...,0.614299,165.0,3.104518,8.800698,19.996,6.865103,10424994.0,935.59,29.34,23.12
4,Burkina Faso,BFA,2004-12-31,562.4113,925.19,267.83,3.0,49.1447,15.871,3.156,...,0.26319,159.1,3.138021,9.17737,20.757,6.873132,10654996.0,752.75,29.41,23.25


In [5]:
data.shape

(483, 32)

In [6]:
official_feats = data.columns

In [7]:
data_plus = data.copy()

## Some vaccination information

#### General information

1. The RTS,S vaccine was created in 1987 as part of a collaboration between GlaxoSmithKline (GSK) and the Walter Reed Army Institute of Research (WRAIR) that began in 1984.

2. The Program for Appropriate Technology in Health (PATH) has been involved in RTS,S vaccine development since 2001.

##### Conclusion 1: None of the countries used the vaccine before 2001.

Source: https://www.precisionvaccinations.com/vaccines/mosquirix-malaria-vaccine

3. Ghana, Kenya and Malawi participated in RTS,S trials through the Malaria Vaccine Implementation Programme (MVIP) from 2019 to 2023.

##### Conclusion 2: "Participated in RTS Trials" and "Participated in MVIP" will both be 1 for these countries from 2019 to 2023.

Source: https://clinicaltrials.gov/study/NCT03806465

4. Burkina Faso and Mali participated in an RTS,S trial from 2017 to 2020.

##### Conclusion 3: "Participated in RTS Trials" will be 1 for Burkina Faso and Mali from 2017 to 2020.

Source: https://clinicaltrials.gov/study/NCT03143218 (Seasonal Malaria Vaccination (RTS,S/AS01) and Seasonal Malaria Chemoprevention (SP/AQ), RTSS-SMC)


5. Burkina Faso, Gabon, Ghana, Kenya, Malawi, Mozambique and Tanzania participated in the RTS,S phase 3 trial from 2009 to 2014.

##### Conclusion 4: "Participated in RTS Trials" will be 1 for Burkina Faso, Gabon, Ghana, Kenya, Malawi, Mozambique and Tanzania from 2009 to 2014.

Source: https://clinicaltrials.gov/study/NCT00866619 

6. Kenya took part in an extension of the trial until 2015.

##### Conclusion 5: "Participated in RTS Trials" will be 1 for Kenya in 2015.

Source: https://clinicaltrials.gov/study/NCT00872963

7. Kenya participated in a phase 2 RTS,S trial from 2005 to 2008, and Tanzania from 2007 to 2008.

Sources:
- https://clinicaltrials.gov/study/NCT00197054
- https://clinicaltrials.gov/study/NCT00380393

##### Conclusion 6: "Participated in RTS Trials" will be 1 for Kenya from 2005 to 2008 and for Tanzania from 2007 to 2008.

8. Burkina Faso participated in R21 trials from 2019 to 2023.

Source: https://clinicaltrials.gov/study/NCT03896724

##### Conclusion 7: "Participated in R21 Trials" will be 1 for Burkina Faso from 2019 to 2022 (the last year covered by the dataset).

9. Burkina Faso and Kenya have been running a phase 3 trial of R21 since 2021.

Source: Phase III randomized controlled multi-centre trial, https://clinicaltrials.gov/study/NCT04704830?tab=history&a=7

##### Conclusion 8: "Participated in R21 Trials" will be 1 for Burkina Faso and Kenya from 2021 onwards.

### Perspectives on vaccination

1. Source: https://www.who.int/news/item/05-07-2023-18-million-doses-of-first-ever-malaria-vaccine-allocated-to-12-african-countries-for-2023-2025--gavi--who-and-unicef
- In response to high demand for the first-ever malaria vaccine, 12 countries in Africa will be allocated a total of 18 million doses of RTS,S/AS01 for the 2023–2025 period.
- Malaria Vaccine Implementation Programme countries Ghana, Kenya and Malawi will receive doses to continue vaccinations in pilot areas.
- Allocations were also made for new introductions in Benin, Burkina Faso, Burundi, Cameroon, Democratic Republic of the Congo, Liberia, Niger, Sierra Leone and Uganda.

##### Conclusion 9: "Participated in RTS Trials" will be 1 from 2023 to 2025, where applicable, for Ghana, Kenya and Malawi as well as Benin, Burkina Faso, Burundi, Cameroon, Democratic Republic of the Congo, Liberia, Niger, Sierra Leone and Uganda.
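
The trial windows in Conclusions 2 to 8 can also be written down as a small lookup table and applied in one loop. The sketch below is an illustration on a toy frame only: `TRIAL_WINDOWS` and `flag_trials` are hypothetical names that do not exist in the project's scripts, and the flag names are assumed to be the ones created in the cells that follow. An open-ended window (Conclusion 8) is encoded with `None`.

```python
import pandas as pd

# Hypothetical lookup table: (flag column, countries, first year, last year or None).
# Transcribed from Conclusions 2 to 8; None means "no upper bound".
TRIAL_WINDOWS = [
    ("In MVIP", ["Ghana", "Kenya", "Malawi"], 2019, 2023),
    ("Leveraged RTS Vaccine", ["Ghana", "Kenya", "Malawi"], 2019, 2023),
    ("Leveraged RTS Vaccine", ["Burkina Faso", "Mali"], 2017, 2020),
    ("Leveraged RTS Vaccine",
     ["Burkina Faso", "Gabon", "Ghana", "Kenya", "Malawi", "Mozambique", "Tanzania"],
     2009, 2014),
    ("Leveraged RTS Vaccine", ["Kenya"], 2015, 2015),
    ("Leveraged RTS Vaccine", ["Kenya"], 2005, 2008),
    ("Leveraged RTS Vaccine", ["Tanzania"], 2007, 2008),
    ("Leveraged R21 Vaccine", ["Burkina Faso"], 2019, 2022),
    ("Leveraged R21 Vaccine", ["Burkina Faso", "Kenya"], 2021, None),
]


def flag_trials(df, windows):
    """Return a copy of df with one 0/1 column per flag named in `windows`."""
    out = df.copy()
    years = out["Date"].dt.year
    for flag, countries, first, last in windows:
        if flag not in out.columns:
            out[flag] = 0
        in_window = years >= first
        if last is not None:
            in_window &= years <= last
        out.loc[out["Country"].isin(countries) & in_window, flag] = 1
    return out


# Toy rows, not project data.
toy = pd.DataFrame({
    "Country": ["Kenya", "Kenya", "Mali", "Chad"],
    "Date": pd.to_datetime(["2008-12-31", "2016-12-31", "2018-12-31", "2019-12-31"]),
})
flagged = flag_trials(toy, TRIAL_WINDOWS)
print(flagged)
```

Comparing on `Date.dt.year` keeps the bounds identical to the years quoted in the conclusions, at the cost of assuming one row per country and year.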

### Adding participation to vaccination features

In [8]:
vac_participation_feats = ["In MVIP", "Leveraged RTS Vaccine", "Leveraged R21 Vaccine"]
data_plus[vac_participation_feats] = 0

In [9]:
mask_MVIP = (data_plus["Country"].isin(["Ghana", "Kenya", "Malawi"])) & (data_plus["Date"].between("2019", "2023"))
data_plus.loc[mask_MVIP, ["In MVIP", "Leveraged RTS Vaccine"]] = 1

In [10]:
# Conclusion 3: Burkina Faso and Mali, 2017 to 2020 (dates are year-end, so the upper bound "2021" is never reached).
mask_RTS_1 = (data_plus["Country"].isin(["Burkina Faso", "Mali"])) & (data_plus["Date"].between("2017", "2021"))
# Conclusion 4: phase 3 trial countries, 2009 to 2014.
mask_RTS_2 = (data_plus["Country"].isin(["Burkina Faso", "Ghana", "Kenya", "Malawi", "Gabon", "Mozambique", "Tanzania"])) & (data_plus["Date"].between("2009", "2015"))
# Conclusions 5 and 6: Kenya, 2005 to 2008 and the 2015 extension.
mask_RTS_3 = (data_plus["Country"].isin(["Kenya"])) & ((data_plus["Date"].between("2005", "2009")) | (data_plus["Date"].between("2015", "2016")))
# Conclusion 6: Tanzania, 2007 to 2008.
mask_RTS_4 = (data_plus["Country"].isin(["Tanzania"])) & (data_plus["Date"].between("2007", "2009"))
mask_RTS = mask_RTS_1 | mask_RTS_2 | mask_RTS_3 | mask_RTS_4
data_plus.loc[mask_RTS, ["Leveraged RTS Vaccine"]] = 1

In [11]:
mask_R21_1 = (data_plus["Country"].isin(["Burkina Faso"])) & (data_plus["Date"].between("2019", "2023"))
mask_R21_2 = (data_plus["Country"].isin(["Burkina Faso", "Kenya"])) & (data_plus["Date"].between("2021", "2024"))
mask_R21 = mask_R21_1 | mask_R21_2
data_plus.loc[mask_R21, "Leveraged R21 Vaccine"] = 1
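
A note on the bounds used in these masks: `Series.between` is inclusive on both ends, and a bare year string such as `"2021"` is parsed as 1 January of that year. Since every `Date` in this dataset is a 31 December timestamp, `between("2017", "2021")` selects the years 2017 to 2020, which is what Conclusion 3 asks for. The toy series below only illustrates that behaviour.

```python
import pandas as pd

# Year-end dates, shaped like the Date column of the project dataset.
dates = pd.Series(pd.to_datetime([f"{year}-12-31" for year in range(2016, 2023)]))

# "2017" and "2021" are parsed as 2017-01-01 and 2021-01-01, both inclusive.
selected = dates[dates.between("2017", "2021")]
selected_years = selected.dt.year.tolist()
print(selected_years)
```

If the dates were ever moved to the start of the year, the same call would also pick up 2021, so the upper bounds would need to be revisited.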

In [12]:
data_plus.sample(5)

Unnamed: 0,Country,ISO3,Date,Malaria_Incidence,Malaria_Deaths_U5,Malaria_Deaths,ITN_Access,PopDensity,MedianAgePop,PopGrowthRate,...,Population in urban agglomerations of more than 1 million (% of total population),Urban population (% of total population),Urban population growth (annual %),Rural population,Precipitation,Average Mean Surface Air Temperature,Average Minimum Surface Air Temperature,In MVIP,Leveraged RTS Vaccine,Leveraged R21 Vaccine
154,Kenya,KEN,2016-12-31,65.96076,59.44,13.04,63.26,82.4735,18.0567,2.245,...,10.865226,26.105,3.929292,35391766.0,712.25,25.56,19.74,0,0,0
127,Ghana,GHA,2012-12-31,366.0709,397.7,113.27,58.83,118.0397,19.4493,2.469,...,16.315275,52.073,3.773112,12872599.0,1253.35,27.53,22.6,0,1,0
306,Rwanda,RWA,2007-12-31,90.638435,113.82,24.83,45.8,393.3568,17.2063,2.699,...,7.526119,16.921,2.717337,7911753.0,1285.13,19.28,13.09,0,0,0
461,Zimbabwe,ZWE,2001-12-31,113.03278,128.72,44.31,3.22,30.7897,16.9261,0.551,...,11.799871,34.17,1.855721,7840997.0,895.31,21.6,14.93,0,0,0
28,Chad,TCD,2005-12-31,223.1305,198.18,62.39,4.14,7.9455,14.6821,3.54,...,8.262059,21.801,4.143265,7823819.0,376.28,27.83,20.21,0,0,0


### Loading and cleaning useful additional demographic data

Source: https://population.un.org/wpp/Download/Standard/

#### Population by age data

In [13]:
filepath_pop = "../data/source/final/additional/WPP2022_POP_F01_1_POPULATION_SINGLE_AGE_BOTH_SEXES.xlsx"

In [14]:
additionalpopdata_present = transformer.load_data(ext="xlsx", filepath=filepath_pop, skiprows=16, sheet_name="Estimates", repo=REPO)
additionalpopdata_present.head()

Unnamed: 0,Index,Variant,"Region, subregion, country or area *",Notes,Location code,ISO3 Alpha-code,ISO2 Alpha-code,SDMX code**,Type,Parent code,...,91,92,93,94,95,96,97,98,99,100+
0,1,Estimates,WORLD,,900,,,1.0,World,0,...,243.035,178.791,127.5605,83.2035,55.081,41.308,30.1455,20.5005,12.8975,14.469
1,2,Estimates,WORLD,,900,,,1.0,World,0,...,245.486,173.545,123.9785,86.3035,55.1435,35.772,25.986,18.3655,12.199,15.671
2,3,Estimates,WORLD,,900,,,1.0,World,0,...,247.5805,176.853,121.8605,85.007,57.845,36.289,23.053,16.214,11.1625,16.1695
3,4,Estimates,WORLD,,900,,,1.0,World,0,...,247.3985,179.6745,125.188,84.484,57.5765,38.2795,23.603,14.66,10.04,16.048
4,5,Estimates,WORLD,,900,,,1.0,World,0,...,248.8665,180.453,128.0485,87.347,57.797,38.447,24.9605,15.0515,9.171,15.426


In [15]:
additionalpopdata_futur = transformer.load_data(ext="xlsx", filepath=filepath_pop, skiprows=16, sheet_name="Medium variant", repo=REPO)
additionalpopdata_futur.head()

Unnamed: 0,Index,Variant,"Region, subregion, country or area *",Notes,Location code,ISO3 Alpha-code,ISO2 Alpha-code,SDMX code**,Type,Parent code,...,91,92,93,94,95,96,97,98,99,100+
0,1,Medium,WORLD,,900,,,1.0,World,0,...,4449.7295,3579.527,2785.1215,2113.995,1580.662,1153.563,814.764,571.011,384.2365,632.8125
1,2,Medium,WORLD,,900,,,1.0,World,0,...,4542.4105,3674.8825,2902.126,2212.4975,1644.525,1203.722,857.8305,590.4365,403.483,677.5855
2,3,Medium,WORLD,,900,,,1.0,World,0,...,4706.4065,3778.7305,3000.005,2321.8485,1732.8975,1259.4965,900.7425,625.665,419.0115,722.0745
3,4,Medium,WORLD,,900,,,1.0,World,0,...,4877.243,3921.704,3091.144,2405.096,1822.441,1330.048,943.837,658.482,445.29,763.4645
4,5,Medium,WORLD,,900,,,1.0,World,0,...,5081.2515,4068.833,3211.375,2481.731,1890.5265,1400.8455,998.2225,690.5665,469.4675,810.2235


In [16]:
additionalpopdata_present = additionalpopdata_present.replace({"Region, subregion, country or area *": {'United Republic of Tanzania':"Tanzania", 'Democratic Republic of the Congo': 'Democratic Republic of Congo', "Côte d'Ivoire": "Cote d'Ivoire"}})
additionalpopdata_futur = additionalpopdata_futur.replace({"Region, subregion, country or area *": {'United Republic of Tanzania':"Tanzania", 'Democratic Republic of the Congo': 'Democratic Republic of Congo', "Côte d'Ivoire": "Cote d'Ivoire"}})


In [17]:
additionalpopdata_present = transformer.subset_study_countries(additionalpopdata_present, "Region, subregion, country or area *", countries=list(data_plus.Country.unique()))
additionalpopdata_futur = transformer.subset_study_countries(additionalpopdata_futur, "Region, subregion, country or area *", countries=list(data_plus.Country.unique()))
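
Name harmonisation is easy to get wrong, so it can help to check that every study country is still present after the rename. The check below is a sketch on toy data; `missing_countries` is a hypothetical helper, not a method of `DataTransformer`, and `NAME_MAP` simply repeats the mapping used in the cell above.

```python
import pandas as pd

NAME_MAP = {
    "United Republic of Tanzania": "Tanzania",
    "Democratic Republic of the Congo": "Democratic Republic of Congo",
    "Côte d'Ivoire": "Cote d'Ivoire",
}


def missing_countries(df, column, expected):
    """Return the expected country names that do not appear in df[column]."""
    return sorted(set(expected) - set(df[column].unique()))


# Toy rows using the UN spellings.
raw = pd.DataFrame({"Location": ["United Republic of Tanzania", "Côte d'Ivoire", "Kenya"]})
study_countries = ["Tanzania", "Cote d'Ivoire", "Kenya"]

before = missing_countries(raw, "Location", study_countries)
harmonised = raw.replace({"Location": NAME_MAP})
after = missing_countries(harmonised, "Location", study_countries)
print(before, after)
```

An empty list after harmonisation means no study country was lost to a spelling mismatch.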

In [18]:
additionalpopdata_present.rename(
    columns={
        "Region, subregion, country or area *": "Country",
        "Year": "Date",
        "ISO3 Alpha-code": "ISO3",
    },
    inplace=True
)

additionalpopdata_futur.rename(
    columns={
        "Region, subregion, country or area *": "Country",
        "Year": "Date",
        "ISO3 Alpha-code": "ISO3",
    },
    inplace=True
)

In [19]:
additionalpopdata_present.rename(
    columns={
        0: "Total Aged 0 (thousand)",
        1: "Total Aged 1 (thousand)",
        2: "Total Aged 2 (thousand)",
        3: "Total Aged 3 (thousand)",
        4: "Total Aged 4 (thousand)",
        5: "Total Aged 5 (thousand)",
    },
    inplace=True
)

additionalpopdata_futur.rename(
    columns={
        0: "Total Aged 0 (thousand)",
        1: "Total Aged 1 (thousand)",
        2: "Total Aged 2 (thousand)",
        3: "Total Aged 3 (thousand)",
        4: "Total Aged 4 (thousand)",
        5: "Total Aged 5 (thousand)",
    },
    inplace=True
)

In [20]:
use_columns = ["Country", "ISO3", "Date", "Total Aged 0 (thousand)", "Total Aged 1 (thousand)", "Total Aged 2 (thousand)", "Total Aged 3 (thousand)", "Total Aged 4 (thousand)", "Total Aged 5 (thousand)"]

In [21]:
additionalpopdata_present = transformer.convert_to_dateformat(additionalpopdata_present, "Date")
additionalpopdata_present = additionalpopdata_present[(additionalpopdata_present.Date >= "2000-12-31") & (additionalpopdata_present.Date <= "2022-12-31")]
additionalpopdata_futur = transformer.convert_to_dateformat(additionalpopdata_futur, "Date")
additionalpopdata_futur = additionalpopdata_futur[(additionalpopdata_futur.Date >= "2022-12-31") & (additionalpopdata_futur.Date <= "2070-12-31")]

In [22]:
additionalpopdata_2022 = additionalpopdata_futur[additionalpopdata_futur["Date"] == "2022-12-31"]

additionalpopdata_present = pd.concat([additionalpopdata_present, additionalpopdata_2022])
additionalpopdata_present = additionalpopdata_present.sort_values(by = ["Country", "Date"]).reset_index(drop=True)

additionalpopdata_futur = additionalpopdata_futur[additionalpopdata_futur.Date >= "2023-12-31"]
additionalpopdata_futur = additionalpopdata_futur.sort_values(by = ["Country", "Date"]).reset_index(drop=True)


In [23]:
additionalpopdata_present = additionalpopdata_present[use_columns]
additionalpopdata_present.head()

Unnamed: 0,Country,ISO3,Date,Total Aged 0 (thousand),Total Aged 1 (thousand),Total Aged 2 (thousand),Total Aged 3 (thousand),Total Aged 4 (thousand),Total Aged 5 (thousand)
0,Burkina Faso,BFA,2000-12-31,496.4965,457.9985,430.213,409.9985,393.008,377.8115
1,Burkina Faso,BFA,2001-12-31,508.9675,471.1035,441.9765,420.6335,404.104,389.2575
2,Burkina Faso,BFA,2002-12-31,520.864,483.8915,455.242,432.4285,414.722,400.3095
3,Burkina Faso,BFA,2003-12-31,535.6045,496.2515,468.267,445.746,426.522,410.9145
4,Burkina Faso,BFA,2004-12-31,551.775,511.572,481.0225,458.9295,439.895,422.741


In [24]:
additionalpopdata_futur = additionalpopdata_futur[use_columns]
additionalpopdata_futur.head()

Unnamed: 0,Country,ISO3,Date,Total Aged 0 (thousand),Total Aged 1 (thousand),Total Aged 2 (thousand),Total Aged 3 (thousand),Total Aged 4 (thousand),Total Aged 5 (thousand)
0,Burkina Faso,BFA,2023-12-31,766.3215,746.381,729.1135,713.927,700.5295,687.5185
1,Burkina Faso,BFA,2024-12-31,774.8695,754.59,737.9945,723.1505,709.628,697.371
2,Burkina Faso,BFA,2025-12-31,783.321,762.735,745.926,731.828,718.703,706.362
3,Burkina Faso,BFA,2026-12-31,792.4925,770.9505,753.912,739.645,727.292,715.37
4,Burkina Faso,BFA,2027-12-31,802.31,780.262,762.191,747.675,735.139,723.976


In [25]:
additionalpopdata_present.to_csv("../data/cleaned/final/addtionnal_pop_data_present.csv", index=False)
additionalpopdata_futur.to_csv("../data/cleaned/final/addtionnal_pop_data_futur.csv", index=False)

#### Survival rate by age data

In [26]:
filepath_surv_present = "../data/source/final/additional/WPP2022_Life_Table_Complete_Medium_Both_1950-2021.csv"
filepath_surv_futur = "../data/source/final/additional/WPP2022_Life_Table_Complete_Medium_Both_2022-2100.csv"

In [27]:
additionalpxdata_present = transformer.load_data(ext="csv", filepath=filepath_surv_present, repo=REPO)
additionalpxdata_present.head()

Unnamed: 0,SortOrder,LocID,Notes,ISO3_code,ISO2_code,SDMX_code,LocTypeID,LocTypeName,ParentID,Location,...,mx,qx,px,lx,dx,Lx,Sx,Tx,ex,ax
0,1,900,,,,1.0,1,World,0,World,...,0.159391,0.143387,0.856613,100000.0,14338.689,89959.186,0.899592,4646428.975,46.4643,0.29974
1,1,900,,,,1.0,1,World,0,World,...,0.04362,0.042689,0.957311,85661.311,3656.811,83832.906,0.931899,4556469.789,53.1917,0.5
2,1,900,,,,1.0,1,World,0,World,...,0.025708,0.025381,0.974619,82004.5,2081.392,80963.804,0.965776,4472636.883,54.5414,0.5
3,1,900,,,,1.0,1,World,0,World,...,0.017162,0.017016,0.982984,79923.108,1359.96,79243.128,0.978748,4391673.079,54.9487,0.5
4,1,900,,,,1.0,1,World,0,World,...,0.012348,0.012272,0.987728,78563.147,964.127,78081.084,0.985336,4312429.951,54.8913,0.5


In [28]:
additionalpxdata_futur = transformer.load_data(ext="csv", filepath=filepath_surv_futur, repo=REPO)
additionalpxdata_futur.head()

Unnamed: 0,SortOrder,LocID,Notes,ISO3_code,ISO2_code,SDMX_code,LocTypeID,LocTypeName,ParentID,Location,...,mx,qx,px,lx,dx,Lx,Sx,Tx,ex,ax
0,1,900,,,,1.0,1,World,0,World,...,0.028174,0.027528,0.972472,100000.0,2752.829,97707.932,0.977079,7171351.522,71.7135,0.167377
1,1,900,,,,1.0,1,World,0,World,...,0.003599,0.003593,0.996407,97247.171,349.399,97072.472,0.993496,7073643.589,72.7388,0.5
2,1,900,,,,1.0,1,World,0,World,...,0.002476,0.002473,0.997527,96897.772,239.642,96777.951,0.996966,6976571.117,71.9993,0.5
3,1,900,,,,1.0,1,World,0,World,...,0.001951,0.001949,0.998051,96658.13,188.391,96563.935,0.997789,6879793.166,71.1766,0.5
4,1,900,,,,1.0,1,World,0,World,...,0.001618,0.001617,0.998383,96469.74,155.974,96391.753,0.998217,6783229.231,70.3146,0.5


In [29]:
additionalpxdata_present = additionalpxdata_present.replace({"Location": {'United Republic of Tanzania':"Tanzania", 'Democratic Republic of the Congo': 'Democratic Republic of Congo', "Côte d'Ivoire": "Cote d'Ivoire"}})
additionalpxdata_futur = additionalpxdata_futur.replace({"Location": {'United Republic of Tanzania':"Tanzania", 'Democratic Republic of the Congo': 'Democratic Republic of Congo', "Côte d'Ivoire": "Cote d'Ivoire"}})

In [30]:
additionalpxdata_present = transformer.subset_study_countries(additionalpxdata_present, "Location", countries=list(data_plus.Country.unique()))
additionalpxdata_futur = transformer.subset_study_countries(additionalpxdata_futur, "Location", countries=list(data_plus.Country.unique()))

In [31]:
additionalpxdata_present.rename(
    columns={
        "Location": "Country",
        "Time": "Date",
        "ISO3_code": "ISO3",
    },
    inplace=True
)

additionalpxdata_futur.rename(
    columns={
        "Location": "Country",
        "Time": "Date",
        "ISO3_code": "ISO3",
    },
    inplace=True
)

In [32]:
use_columns = ["Country", "ISO3", "Date", "AgeGrpStart", "px"]

In [33]:
additionalpxdata_present = additionalpxdata_present[use_columns]
additionalpxdata_futur = additionalpxdata_futur[use_columns]

In [34]:
additionalpxdata_present = transformer.convert_to_dateformat(additionalpxdata_present, "Date")
additionalpxdata_present = additionalpxdata_present[(additionalpxdata_present.Date >= "2000-12-31") & (additionalpxdata_present.Date <= "2022-12-31")]
additionalpxdata_futur = transformer.convert_to_dateformat(additionalpxdata_futur, "Date")
additionalpxdata_futur = additionalpxdata_futur[(additionalpxdata_futur.Date >= "2022-12-31") & (additionalpxdata_futur.Date <= "2070-12-31")]

In [35]:
additionalpxdata_present = additionalpxdata_present[additionalpxdata_present["AgeGrpStart"] <= 5]
additionalpxdata_futur = additionalpxdata_futur[additionalpxdata_futur['AgeGrpStart'] <= 5]

In [36]:
additionalpxdata_present = (additionalpxdata_present.pivot(columns="AgeGrpStart", values="px", index=["Country", "ISO3", "Date"])).reset_index()
additionalpxdata_present.columns.name = None

additionalpxdata_futur = (additionalpxdata_futur.pivot(columns="AgeGrpStart", values="px", index=["Country", "ISO3", "Date"])).reset_index()
additionalpxdata_futur.columns.name = None
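
The pivot above turns the long life table (one row per country, year and age) into one row per country and year, with one `px` column per age. A small illustration, using the first two ages for Burkina Faso in 2000 and 2001:

```python
import pandas as pd

# Long format: one row per (country, date, age).
long_px = pd.DataFrame({
    "Country": ["Burkina Faso"] * 4,
    "ISO3": ["BFA"] * 4,
    "Date": pd.to_datetime(["2000-12-31", "2000-12-31", "2001-12-31", "2001-12-31"]),
    "AgeGrpStart": [0, 1, 0, 1],
    "px": [0.90795, 0.956455, 0.909586, 0.958119],
})

# Wide format: one row per (country, date), one column per age.
wide_px = long_px.pivot(
    index=["Country", "ISO3", "Date"], columns="AgeGrpStart", values="px"
).reset_index()
wide_px.columns.name = None  # drop the leftover "AgeGrpStart" axis label
print(wide_px)
```

`pivot` raises a `ValueError` if the same country, date and age appear twice, which doubles as a cheap duplicate check on the source file.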

In [37]:
additionalpxdata_present.rename(
    columns={
        0: "Probablity of surviving at age 0",
        1: "Probablity of surviving at age 1",
        2: "Probablity of surviving at age 2",
        3: "Probablity of surviving at age 3",
        4: "Probablity of surviving at age 4",
        5: "Probablity of surviving at age 5",
    },
    inplace=True
)

additionalpxdata_futur.rename(
    columns={
        0: "Probablity of surviving at age 0",
        1: "Probablity of surviving at age 1",
        2: "Probablity of surviving at age 2",
        3: "Probablity of surviving at age 3",
        4: "Probablity of surviving at age 4",
        5: "Probablity of surviving at age 5",
    },
    inplace=True
)

In [38]:
additionalpxdata_2022 = additionalpxdata_futur[additionalpxdata_futur["Date"] == "2022-12-31"]

additionalpxdata_present = pd.concat([additionalpxdata_present, additionalpxdata_2022])
additionalpxdata_present = additionalpxdata_present.sort_values(by = ["Country", "Date"]).reset_index(drop=True)

additionalpxdata_futur = additionalpxdata_futur[additionalpxdata_futur.Date >= "2023-12-31"]
additionalpxdata_futur = additionalpxdata_futur.sort_values(by = ["Country", "Date"]).reset_index(drop=True)


In [39]:
additionalpxdata_present.head()

Unnamed: 0,Country,ISO3,Date,Probablity of surviving at age 0,Probablity of surviving at age 1,Probablity of surviving at age 2,Probablity of surviving at age 3,Probablity of surviving at age 4,Probablity of surviving at age 5
0,Burkina Faso,BFA,2000-12-31,0.90795,0.956455,0.972418,0.982372,0.988478,0.992189
1,Burkina Faso,BFA,2001-12-31,0.909586,0.958119,0.973213,0.982717,0.988606,0.992221
2,Burkina Faso,BFA,2002-12-31,0.911156,0.959681,0.973996,0.983072,0.988743,0.992255
3,Burkina Faso,BFA,2003-12-31,0.913341,0.961577,0.975078,0.983687,0.989097,0.992466
4,Burkina Faso,BFA,2004-12-31,0.915872,0.963684,0.976304,0.984401,0.989521,0.992729


In [40]:
additionalpxdata_futur.head()

Unnamed: 0,Country,ISO3,Date,Probablity of surviving at age 0,Probablity of surviving at age 1,Probablity of surviving at age 2,Probablity of surviving at age 3,Probablity of surviving at age 4,Probablity of surviving at age 5
0,Burkina Faso,BFA,2023-12-31,0.953906,0.987339,0.990856,0.99334,0.995081,0.996293
1,Burkina Faso,BFA,2024-12-31,0.953884,0.987206,0.990761,0.993273,0.995034,0.99626
2,Burkina Faso,BFA,2025-12-31,0.952666,0.986765,0.99044,0.993038,0.994862,0.996132
3,Burkina Faso,BFA,2026-12-31,0.953366,0.987,0.990606,0.993157,0.994947,0.996195
4,Burkina Faso,BFA,2027-12-31,0.954083,0.98724,0.990776,0.993278,0.995035,0.99626


In [41]:
additionalpxdata_present.to_csv("../data/cleaned/final/addtionnal_px_data_present.csv", index=False)
additionalpxdata_futur.to_csv("../data/cleaned/final/addtionnal_px_data_futur.csv", index=False)

### Adding vaccine coverage and vaccine efficacy features (mainly useful for vaccination scenarios)

#### From 2023 to 2070

In [42]:
REPO = "./"
filepath_futur = "../data/cleaned/final/downloaded_projections_data.csv"
rev="cldfdat1"

In [43]:
future_data = transformer.load_data(ext="csv", filepath=filepath_futur, repo=REPO, rev=rev, header=0, parse_dates=["Date"])
future_data = future_data.sort_values(by=["Country", 'Date'], ascending=True)
future_data.head()

Unnamed: 0,Date,Precipitation_SSP2-4.5,Average Mean Surface Air Temperature_SSP2-4.5,Average Minimum Surface Air Temperature_SSP2-4.5,ISO3,Country,PopDensity,MedianAgePop,PopGrowthRate,TFR,IMR,Q5,CNMR
192,2023-12-31,664.92,29.13,23.46,BFA,Burkina Faso,84.9835,16.872,2.512,4.5664,46.0945,77.5592,-1.074
193,2024-12-31,695.9,29.43,23.73,BFA,Burkina Faso,87.1354,17.0439,2.49,4.4729,46.1159,77.8987,-1.048
194,2025-12-31,667.79,29.3,23.58,BFA,Burkina Faso,89.3162,17.2362,2.454,4.3812,47.3345,80.1619,-1.022
195,2026-12-31,704.17,29.38,23.68,BFA,Burkina Faso,91.5219,17.4412,2.425,4.295,46.6344,78.9234,-0.998
196,2027-12-31,655.76,29.4,23.78,BFA,Burkina Faso,93.7533,17.6535,2.393,4.2068,45.917,77.6543,-0.974


In [44]:
future_data_plus = future_data.copy()

In [45]:
future_data_plus = (
    future_data_plus.
    merge(additionalpopdata_futur, on=["Country", "ISO3", "Date"], how="inner").
    merge(additionalpxdata_futur, on=["Country", "ISO3", "Date"], how="inner")
)

In [46]:
create_feat_future = FeatureEnginnering(future_data_plus)

In [47]:
future_data_plus.columns

Index(['Date', 'Precipitation_SSP2-4.5',
       'Average Mean Surface Air Temperature_SSP2-4.5',
       'Average Minimum Surface Air Temperature_SSP2-4.5', 'ISO3', 'Country',
       'PopDensity', 'MedianAgePop', 'PopGrowthRate', 'TFR', 'IMR', 'Q5',
       'CNMR', 'Total Aged 0 (thousand)', 'Total Aged 1 (thousand)',
       'Total Aged 2 (thousand)', 'Total Aged 3 (thousand)',
       'Total Aged 4 (thousand)', 'Total Aged 5 (thousand)',
       'Probablity of surviving at age 0', 'Probablity of surviving at age 1',
       'Probablity of surviving at age 2', 'Probablity of surviving at age 3',
       'Probablity of surviving at age 4', 'Probablity of surviving at age 5'],
      dtype='object')

In [48]:
new_features = [
    "Vaccinated Aged 0",
    "Susceptibles, not vaccinated (0-5)",
    "Effectively_protected (0-5)",
    "Vaccinated_still_susceptibles (0-5)",
]

#### Testing the feature engineering function with: coverage_0=0.7, coverage_2=0.6, efficacy=0.8 (R21)

In [49]:
future_data_plus_test = create_feat_future.create_new_features_for_future(column="ISO3", coverage_0=0.7, coverage_2=0.6, initial_efficacy=0.8, efficacy_booster=0.8)

In [50]:
future_data_plus_test = future_data_plus_test[list(future_data.columns)+new_features]

In [51]:
future_data_plus_test.columns

Index(['Date', 'Precipitation_SSP2-4.5',
       'Average Mean Surface Air Temperature_SSP2-4.5',
       'Average Minimum Surface Air Temperature_SSP2-4.5', 'ISO3', 'Country',
       'PopDensity', 'MedianAgePop', 'PopGrowthRate', 'TFR', 'IMR', 'Q5',
       'CNMR', 'Vaccinated Aged 0', 'Susceptibles, not vaccinated (0-5)',
       'Effectively_protected (0-5)', 'Vaccinated_still_susceptibles (0-5)'],
      dtype='object')

In [52]:
future_data_plus_test.head()

Unnamed: 0,Date,Precipitation_SSP2-4.5,Average Mean Surface Air Temperature_SSP2-4.5,Average Minimum Surface Air Temperature_SSP2-4.5,ISO3,Country,PopDensity,MedianAgePop,PopGrowthRate,TFR,IMR,Q5,CNMR,Vaccinated Aged 0,"Susceptibles, not vaccinated (0-5)",Effectively_protected (0-5),Vaccinated_still_susceptibles (0-5)
0,2023-12-31,664.92,29.13,23.46,BFA,Burkina Faso,84.9835,16.872,2.512,4.5664,46.0945,77.5592,-1.074,536.42505,3807.36595,429.14004,111.28501
1,2024-12-31,695.9,29.43,23.73,BFA,Burkina Faso,87.1354,17.0439,2.49,4.4729,46.1159,77.8987,-1.048,542.40865,3343.507519,682.210164,375.885818
2,2025-12-31,667.79,29.3,23.58,BFA,Burkina Faso,89.3162,17.2362,2.454,4.3812,47.3345,80.1619,-1.022,548.3247,2878.90103,991.190443,279.834373
3,2026-12-31,704.17,29.38,23.68,BFA,Burkina Faso,91.5219,17.4412,2.425,4.295,46.6344,78.9234,-0.998,554.74475,2411.974632,1183.627213,301.946938
4,2027-12-31,655.76,29.4,23.78,BFA,Burkina Faso,93.7533,17.6535,2.393,4.2068,45.917,77.6543,-0.974,561.617,1942.457661,1307.021074,395.150789
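
As a sanity check on the first row of this output (Burkina Faso, 2023), three of the new columns can be reproduced by hand from the merged population columns. The relationships below are inferred from the numbers shown, not from the code of `create_new_features_for_future`, and they are only checked for the first projection year; later years presumably also depend on the survival probabilities and the booster parameters. "Vaccinated_still_susceptibles (0-5)" is left out because its formula cannot be read off the output alone.

```python
# Population aged 0 to 5 in Burkina Faso for 2023, in thousands (UN WPP values shown earlier).
pop_by_age = [766.3215, 746.381, 729.1135, 713.927, 700.5295, 687.5185]
coverage_0 = 0.7        # share of children aged 0 who are vaccinated
initial_efficacy = 0.8  # share of vaccinated children who are protected

# Assumed first-year relationships (inferred, see the note above).
vaccinated_aged_0 = coverage_0 * pop_by_age[0]
effectively_protected = initial_efficacy * vaccinated_aged_0
susceptible_not_vaccinated = sum(pop_by_age) - vaccinated_aged_0

print(round(vaccinated_aged_0, 5))
print(round(effectively_protected, 5))
print(round(susceptible_not_vaccinated, 5))
```

These three values agree with "Vaccinated Aged 0", "Effectively_protected (0-5)" and "Susceptibles, not vaccinated (0-5)" in the first row of the table.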


#### From 2000 to 2022

In [53]:
data_plus = (
    data_plus.
    merge(additionalpopdata_present, on=["Country", "ISO3", "Date"], how="outer").
    merge(additionalpxdata_present, on=["Country", "ISO3", "Date"], how="outer")
)
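
Because this is an outer merge, rows that exist in only one of the frames are kept and padded with NaN. One way to see whether that happens is `indicator=True`, and `validate="one_to_one"` guards against duplicated keys. The frames below are toy data with illustrative values, not the project tables.

```python
import pandas as pd

keys = ["Country", "ISO3", "Date"]

left = pd.DataFrame({
    "Country": ["Kenya", "Kenya"],
    "ISO3": ["KEN", "KEN"],
    "Date": pd.to_datetime(["2021-12-31", "2022-12-31"]),
    "Malaria_Incidence": [1.0, 2.0],  # illustrative values
})
right = pd.DataFrame({
    "Country": ["Kenya", "Mali"],
    "ISO3": ["KEN", "MLI"],
    "Date": pd.to_datetime(["2022-12-31", "2022-12-31"]),
    "Total Aged 0 (thousand)": [10.0, 20.0],  # illustrative values
})

# The "_merge" column says where each row came from; validate raises on duplicate keys.
merged = left.merge(right, on=keys, how="outer", validate="one_to_one", indicator=True)
counts = merged["_merge"].value_counts().to_dict()
print(counts)
```

Any `left_only` or `right_only` rows point to a country or year that is missing from one of the sources.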

In [54]:
data_plus[new_features] = 0

In [55]:
create_feat_present = FeatureEnginnering(data_plus)

#### Rules:

1. For countries that participated in a vaccine trial, we calculate the feature "Vaccinated Aged 0" using coverage_0=0.01.
2. We then derive the corresponding "Susceptibles, not vaccinated (0-5)".

Note: No booster is applied at this step. Only an initial efficacy of 0.4 is applied, taken as equivalent to the declared efficacy of the RTS,S vaccine.

In [56]:
mask_vacc_aged_0 = (data_plus["Leveraged RTS Vaccine"] == 1) | (data_plus["Leveraged R21 Vaccine"] == 1)
data_plus.loc[mask_vacc_aged_0, "Vaccinated Aged 0"] = data_plus.loc[mask_vacc_aged_0, "Total Aged 0 (thousand)"] * 0.01

In [57]:
data_plus = create_feat_present.create_new_features_for_present(column="ISO3", initial_efficacy=0.4)

In [58]:
data_plus = data_plus[list(data.columns)+vac_participation_feats+new_features]

In [59]:
data_plus.head()

Unnamed: 0,Country,ISO3,Date,Malaria_Incidence,Malaria_Deaths_U5,Malaria_Deaths,ITN_Access,PopDensity,MedianAgePop,PopGrowthRate,...,Precipitation,Average Mean Surface Air Temperature,Average Minimum Surface Air Temperature,In MVIP,Leveraged RTS Vaccine,Leveraged R21 Vaccine,Vaccinated Aged 0,"Susceptibles, not vaccinated (0-5)",Effectively_protected (0-5),Vaccinated_still_susceptibles (0-5)
0,Burkina Faso,BFA,2000-12-31,603.211,874.85,249.82,2.55,43.4316,15.4232,3.02,...,714.73,29.06,22.72,0,0,0,0.0,2565.526,0.0,0.0
1,Burkina Faso,BFA,2001-12-31,601.93774,918.92,264.6,2.97,44.7725,15.5302,3.06,...,749.26,29.19,22.77,0,0,0,0.0,2636.0425,0.0,0.0
2,Burkina Faso,BFA,2002-12-31,595.85205,958.85,274.54,2.9,46.1706,15.6492,3.089,...,690.37,29.47,23.24,0,0,0,0.0,2707.4575,0.0,0.0
3,Burkina Faso,BFA,2003-12-31,585.1233,965.41,278.27,2.6,47.6264,15.7656,3.12,...,935.59,29.34,23.12,0,0,0,0.0,2783.3055,0.0,0.0
4,Burkina Faso,BFA,2004-12-31,562.4113,925.19,267.83,3.0,49.1447,15.871,3.156,...,752.75,29.41,23.25,0,0,0,0.0,2865.935,0.0,0.0


### Exporting data

In [64]:
data_plus.to_csv("../data/cleaned/final/cleaned_project_dataset.csv") # "clpdat2"

In [65]:
future_data_plus.to_csv("../data/cleaned/final/downloaded_projections_data.csv") # "cldfdat2"
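
One thing to keep in mind with these two exports: unlike the earlier `to_csv` calls, they do not pass `index=False`, so the row index is written as an unnamed first column and comes back as `Unnamed: 0` when the file is read again. A minimal illustration using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

df = pd.DataFrame({"Country": ["Kenya"], "ISO3": ["KEN"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index, cols_without_index)
```

Whether to keep the index depends on how the file is read back, for example with `index_col=0`.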