## Initial analysis of the data 
Exploring the databases available through the World Bank API and selecting indicators.

In [2]:
import wbgapi as wb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob


In [3]:
print(wb.source.info())

economies = list(wb.economy.list())
print(">>>>>>> NUMBER OF ECONOMIES IN THE DATABASE >>>>>>>")
print(len(economies))
print(economies[:5])


id    name                                                                  code      concepts  lastupdated
----  --------------------------------------------------------------------  ------  ----------  -------------
1     Doing Business                                                        DBS              3  2021-08-18
2     World Development Indicators                                          WDI              3  2025-10-07
3     Worldwide Governance Indicators                                       WGI              3  2024-11-05
5     Subnational Malnutrition Database                                     SNM              3  2016-03-21
6     International Debt Statistics                                         IDS              4  2025-02-26
11    Africa Development Indicators                                         ADI              3  2013-02-22
12    Education Statistics                                                  EDS              3  2024-06-25
13    Enterprise Surveys         

Finding the indicators related to the economy such as GDP or GDP growth.

In [4]:
WDI = 2     # source id 2 = WDI (World Development Indicators), see the table above

for ind in wb.series.list(db = WDI, q = 'gdp'):
    print(f"{ind['id']}: {ind['value']}")


EG.GDP.PUSE.KO.PP: GDP per unit of energy use (PPP $ per kg of oil equivalent)
EG.GDP.PUSE.KO.PP.KD: GDP per unit of energy use (constant 2021 PPP $ per kg of oil equivalent)
EG.USE.COMM.GD.PP.KD: Energy use (kg of oil equivalent) per $1,000 GDP (constant 2021 PPP)
EN.GHG.CO2.RT.GDP.KD: Carbon intensity of GDP (kg CO2e per constant 2015 US$ of GDP)
EN.GHG.CO2.RT.GDP.PP.KD: Carbon intensity of GDP (kg CO2e per 2021 PPP $ of GDP)
NY.GDP.DEFL.KD.ZG: Inflation, GDP deflator (annual %)
NY.GDP.DEFL.KD.ZG.AD: Inflation, GDP deflator: linked series (annual %)
NY.GDP.DEFL.ZS: GDP deflator (base year varies by country)
NY.GDP.DEFL.ZS.AD: GDP deflator: linked series (base year varies by country)
NY.GDP.DISC.CN: Discrepancy in expenditure estimate of GDP (current LCU)
NY.GDP.DISC.KN: Discrepancy in expenditure estimate of GDP (constant LCU)
NY.GDP.MKTP.CD: GDP (current US$)
NY.GDP.MKTP.CN: GDP (current LCU)
NY.GDP.MKTP.CN.AD: GDP: linked series (current LCU)
NY.GDP.MKTP.KD: GDP (constant 2015 US$)

Looking over health indicators such as life expectancy, mortality and health expenditure, which link the economy to the health of the population.

In [5]:
for query in ['life expectancy', 'mortality', 'health expenditure']:
    for ind in wb.series.list(db = WDI, q = query):
        print(f"{ind['id']}: {ind['value']}")


SP.DYN.LE00.FE.IN: Life expectancy at birth, female (years)
SP.DYN.LE00.IN: Life expectancy at birth, total (years)
SP.DYN.LE00.MA.IN: Life expectancy at birth, male (years)
SH.DYN.MORT: Mortality rate, under-5 (per 1,000 live births)
SH.DYN.MORT.FE: Mortality rate, under-5, female (per 1,000 live births)
SH.DYN.MORT.MA: Mortality rate, under-5, male (per 1,000 live births)
SH.DYN.NCOM.FE.ZS: Mortality from CVD, cancer, diabetes or CRD between exact ages 30 and 70, female (%)
SH.DYN.NCOM.MA.ZS: Mortality from CVD, cancer, diabetes or CRD between exact ages 30 and 70, male (%)
SH.DYN.NCOM.ZS: Mortality from CVD, cancer, diabetes or CRD between exact ages 30 and 70 (%)
SH.DYN.NMRT: Mortality rate, neonatal (per 1,000 live births)
SH.STA.AIRP.FE.P5: Mortality rate attributed to household and ambient air pollution, age-standardized, female (per 100,000 female population)
SH.STA.AIRP.MA.P5: Mortality rate attributed to household and ambient air pollution, age-standardized, male (per 100,000

Examining possible indicators of a country's environmental welfare, focusing on pollution, use of renewable energy and access to clean water.

In [None]:
for query in ['CO2', 'pollution', 'greenhouse', 'renewable energy', 'forest', 'water']:
    for ind in wb.series.list(db = WDI, q = query):
        print(f"{ind['id']}: {ind['value']}")


Exploring population, its growth and urbanization to see how they affect the economy, health, and environment.

In [None]:
for query in ['population', 'urban']:
    for ind in wb.series.list(db = WDI, q = query):
        print(f"{ind['id']}: {ind['value']}")

## Choosing main indicators for each group
### Economy
 - GDP per capita, PPP (constant 2021 international $): measures average economic output per person.
 - GDP growth (annual %): tracks how fast the economy is expanding or contracting over time.
 - Carbon intensity of GDP (kg CO2e per 2021 PPP $ of GDP): shows how much CO2 is emitted per unit of economic output, linking economic activity to environmental cost.

### Health
 - Life expectancy at birth, total (years): average expected lifespan at birth, a measure of overall health.
 - Mortality rate, under-5 (per 1,000 live births): deaths of children under five, reflecting healthcare access and quality.
 - Current health expenditure (% of GDP): investment in the health sector.

Indicators found while exploring the environment group that relate to health and link the two sectors by connecting environmental conditions to human health:
 - Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)
 - Mortality rate attributed to unsafe water, unsafe sanitation and lack of hygiene (per 100,000 population)

### Environment
 - Carbon dioxide (CO2) emissions (total) excluding LULUCF (Mt CO2e)
 - Carbon dioxide (CO2) emissions excluding LULUCF per capita (t CO2e/capita)
 - PM2.5 air pollution, mean annual exposure (micrograms per cubic meter): indicator of air quality.
 - Total greenhouse gas emissions excluding LULUCF (Mt CO2e)
 - Total greenhouse gas emissions excluding LULUCF per capita (t CO2e/capita)
 - Renewable energy consumption (% of total): sustainability metric.
 - Forest area (% of land area): related to air quality and biodiversity.
 - People using safely managed drinking water services (% of population): access to clean drinking water.

### Population and Urbanization Indicators
 - Population growth (annual %)
 - Population, total
 - Urban population (% of total population): share of people living in cities, reflecting the level of urbanization.
 - Population density (people per sq. km of land area): a factor influencing environmental stress.

Next steps: selecting the countries, pulling the data, checking for missing values, and possibly adjusting the number of indicators based on coverage.

## Country selection
1. Selecting actual countries: filtering out aggregate economies in the list, such as Africa Eastern and Southern or World.
2. Limiting the sample to larger economies: filtering out small countries with a population under 5 million.

In [7]:
for e in wb.economy.list():
    if not e['aggregate']:
        print(f"{e['id']}: {e['value']}")

ABW: Aruba
AFG: Afghanistan
AGO: Angola
ALB: Albania
AND: Andorra
ARE: United Arab Emirates
ARG: Argentina
ARM: Armenia
ASM: American Samoa
ATG: Antigua and Barbuda
AUS: Australia
AUT: Austria
AZE: Azerbaijan
BDI: Burundi
BEL: Belgium
BEN: Benin
BFA: Burkina Faso
BGD: Bangladesh
BGR: Bulgaria
BHR: Bahrain
BHS: Bahamas, The
BIH: Bosnia and Herzegovina
BLR: Belarus
BLZ: Belize
BMU: Bermuda
BOL: Bolivia
BRA: Brazil
BRB: Barbados
BRN: Brunei Darussalam
BTN: Bhutan
BWA: Botswana
CAF: Central African Republic
CAN: Canada
CHE: Switzerland
CHI: Channel Islands
CHL: Chile
CHN: China
CIV: Cote d'Ivoire
CMR: Cameroon
COD: Congo, Dem. Rep.
COG: Congo, Rep.
COL: Colombia
COM: Comoros
CPV: Cabo Verde
CRI: Costa Rica
CUB: Cuba
CUW: Curacao
CYM: Cayman Islands
CYP: Cyprus
CZE: Czechia
DEU: Germany
DJI: Djibouti
DMA: Dominica
DNK: Denmark
DOM: Dominican Republic
DZA: Algeria
ECU: Ecuador
EGY: Egypt, Arab Rep.
ERI: Eritrea
ESP: Spain
EST: Estonia
ETH: Ethiopia
FIN: Finland
FJI: Fiji
FRA: France
FRO: Far

In [13]:
non_aggregates = [e for e in wb.economy.list() if not e['aggregate']]
print(len(non_aggregates))
economy_ids = [e['id'] for e in non_aggregates]

# Filtering economies with a population of at least 5 million
min_pop = 5000000  # minimum population
pop_data = wb.data.DataFrame('SP.POP.TOTL', economy_ids)
print(pop_data[:3])

# Selecting countries satisfying the population criterion for the last available year
filtered_countries = pop_data.index[pop_data.iloc[:, -1] >= min_pop].tolist()
print(filtered_countries[:10])


217
            YR1960     YR1961     YR1962     YR1963     YR1964      YR1965  \
economy                                                                      
ABW        54922.0    55578.0    56320.0    57002.0    57619.0     58190.0   
AFG      9035043.0  9214083.0  9404406.0  9604487.0  9814318.0  10036008.0   
AGO      5231654.0  5301583.0  5354310.0  5408320.0  5464187.0   5521981.0   

             YR1966      YR1967      YR1968      YR1969  ...      YR2015  \
economy                                                  ...               
ABW         58694.0     58990.0     59069.0     59052.0  ...    107906.0   
AFG      10266395.0  10505959.0  10756922.0  11017409.0  ...  33831764.0   
AGO       5581386.0   5641807.0   5702699.0   5763685.0  ...  28157798.0   

             YR2016      YR2017      YR2018      YR2019      YR2020  \
economy                                                               
ABW        108727.0    108735.0    108908.0    109203.0    108587.0   
AFG      34

After filtering, 126 countries remain. This set is sufficient to capture global trends while leaving out very small economies, which also saves computing time. The data for the chosen indicators and countries is collected in data_collection.py.
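One detail of the population filter is worth illustrating: it compares the final year column with the threshold, and a comparison with a missing value evaluates to False, so an economy without a value in the latest year would be dropped regardless of its size. The sketch below shows this on made-up data (the economy codes AAA, BBB, CCC and all values are hypothetical) and one possible alternative that takes the latest non-missing value instead. It is an illustration under these assumptions, not the filter used in the project.

```python
import numpy as np
import pandas as pd

# Toy population table in the same wide layout as the wb.data.DataFrame output
# (economy codes as the index, one YR column per year). All values are made up.
pop_data = pd.DataFrame(
    {
        "YR2021": [6_000_000.0, 4_000_000.0, 7_000_000.0],
        "YR2022": [6_100_000.0, 4_100_000.0, 7_100_000.0],
        "YR2023": [6_200_000.0, 4_200_000.0, np.nan],  # CCC has no 2023 value
    },
    index=pd.Index(["AAA", "BBB", "CCC"], name="economy"),
)
min_pop = 5_000_000

# Comparing the last column directly drops CCC, because NaN >= x is False.
last_column = pop_data.index[pop_data.iloc[:, -1] >= min_pop].tolist()

# Forward-filling along the year axis carries the latest non-missing value
# into the last column before the comparison is made.
latest_pop = pop_data.ffill(axis=1).iloc[:, -1]
latest_available = latest_pop.index[latest_pop >= min_pop].tolist()

print(last_column)       # ['AAA']
print(latest_available)  # ['AAA', 'CCC']
```

Whether this matters depends on how complete the most recent year is in the downloaded data, which can be checked with `pop_data.iloc[:, -1].isna().sum()`.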

In [15]:
df = pd.read_csv("../data/raw/all_indicators.csv")
print(df.head())
print(df.shape)


  economy                series     YR2000     YR2001     YR2002     YR2003  \
0     AFG        AG.LND.FRST.ZS   1.852782   1.852782   1.852782   1.852782   
1     AFG        EG.FEC.RNEW.ZS  45.000000  45.600000  37.800000  36.700000   
2     AFG     EN.ATM.PM25.MC.M3  64.767280  64.597573  64.416888  64.176231   
3     AFG  EN.GHG.ALL.PC.CE.AR5   0.691280   0.601771   0.690285   0.674661   
4     AFG  EN.GHG.CO2.PC.CE.AR5   0.050476   0.046573   0.044078   0.044341   

      YR2004     YR2005     YR2006     YR2007  ...     YR2014     YR2015  \
0   1.852782   1.852782   1.852782   1.852782  ...   1.852782   1.852782   
1  44.200000  33.900000  31.900000  28.800000  ...  19.100000  17.700000   
2  63.826609  63.319026  61.514649  58.083785  ...  77.143728  73.490818   
3   0.649282   0.638053   0.620097   0.650602  ...   0.836169   0.810135   
4   0.037898   0.051888   0.055392   0.077561  ...   0.238643   0.246706   

      YR2016     YR2017     YR2018     YR2019     YR2020     YR2021 

In [16]:
year_cols = df.columns[2:]

# Sort indicators by missing fraction in descending order
missing_per_indicator = df.groupby('series')[year_cols].apply(lambda x: x.isna().mean().mean())
print("Missing values per indicator:")
print(missing_per_indicator.sort_values(ascending=False))

Missing values per indicator:
series
SH.STA.AIRP.P5             0.958995
SH.STA.WASH.P5             0.958995
SH.H2O.SMDW.ZS             0.320767
EN.ATM.PM25.MC.M3          0.131944
EG.FEC.RNEW.ZS             0.081349
SH.XPD.CHEX.PC.CD          0.073743
SH.XPD.CHEX.GD.ZS          0.073082
EN.GHG.CO2.RT.GDP.PP.KD    0.055556
NY.GDP.PCAP.PP.KD          0.039683
EN.GHG.ALL.PC.CE.AR5       0.023810
EN.GHG.CO2.PC.CE.AR5       0.023810
NY.GDP.MKTP.KD.ZG          0.018519
AG.LND.FRST.ZS             0.017857
SH.DYN.MORT                0.007937
SP.DYN.LE00.IN             0.000000
SP.POP.GROW                0.000000
SP.POP.TOTL                0.000000
SP.URB.TOTL.IN.ZS          0.000000
dtype: float64
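To make the missing-value score concrete, here is a small worked example on made-up data (the indicator codes X.ONE and X.TWO, the economy codes and all values are hypothetical). It applies the same `isna().mean().mean()` expression per indicator and then a cutoff such as the 30% used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy table shaped like all_indicators.csv: one row per economy and indicator,
# one column per year. All codes and values are made up.
toy = pd.DataFrame(
    {
        "economy": ["AAA", "BBB", "AAA", "BBB"],
        "series": ["X.ONE", "X.ONE", "X.TWO", "X.TWO"],
        "YR2000": [1.0, np.nan, 3.0, 4.0],
        "YR2001": [np.nan, np.nan, 5.0, 6.0],
    }
)
toy_years = toy.columns[2:]

# Per indicator: share of missing values in each year column,
# then the average of those shares over all years.
toy_missing = toy.groupby("series")[toy_years].apply(
    lambda x: x.isna().mean().mean()
)
print(toy_missing)

# Indicators above the cutoff are candidates for removal.
to_drop = toy_missing[toy_missing > 0.3].index.tolist()
print(to_drop)  # ['X.ONE']
```

For X.ONE the yearly shares are 0.5 (YR2000) and 1.0 (YR2001), so the score is 0.75; X.TWO has no gaps and scores 0.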


Indicators with more than 30% of their data missing are removed (here SH.STA.AIRP.P5, SH.STA.WASH.P5 and SH.H2O.SMDW.ZS), and a new data frame is created.

In [17]:
low_data_indicators = missing_per_indicator[missing_per_indicator > 0.3].index
df_clean1 = df[~df['series'].isin(low_data_indicators)]

Countries are checked next, and those with more than 30% of their data missing (here SSD, PRK and HKG) are removed.

In [None]:
missing_per_country = df_clean1.groupby('economy')[year_cols].apply(lambda x: x.isna().mean().mean())
print("Missing values per country:")
print(missing_per_country.sort_values(ascending = False))
low_data_countries = missing_per_country[missing_per_country > 0.3].index
df_clean2 = df_clean1[~df_clean1['economy'].isin(low_data_countries)]
os.makedirs('../data/cleaned', exist_ok=True)  # create the output folder if it does not exist yet
df_clean2.to_csv('../data/cleaned/all_indicators_cleaned.csv', index=False)

Missing values per country:
economy
SSD    0.494444
PRK    0.347222
HKG    0.338889
VEN    0.277778
SRB    0.236111
         ...   
POL    0.013889
PHL    0.013889
HUN    0.013889
ITA    0.013889
GBR    0.013889
Length: 126, dtype: float64


The cleaned file is saved. Next, the remaining indicators are checked and given short names, which will be used for their separate dataframes.

In [None]:
remaining_indicators = df_clean2['series'].unique()
print(remaining_indicators)

indicator_names = {
            'AG.LND.FRST.ZS': 'forest_area',
            'EG.FEC.RNEW.ZS': 'renewable_energy',
            'EN.ATM.PM25.MC.M3': 'pm25_pollution',
            'EN.GHG.ALL.PC.CE.AR5': 'ghg_per_capita',
            'EN.GHG.CO2.PC.CE.AR5': 'co2_per_capita',
            'EN.GHG.CO2.RT.GDP.PP.KD': 'carbon_intensity',
            'NY.GDP.MKTP.KD.ZG': 'gdp_growth',
            'NY.GDP.PCAP.PP.KD': 'gdp_per_capita',
            'SH.DYN.MORT': 'child_mortality',
            'SH.XPD.CHEX.GD.ZS': 'health_exp_pct_gdp',
            'SH.XPD.CHEX.PC.CD': 'health_exp_per_capita',
            'SP.DYN.LE00.IN': 'life_expectancy',
            'SP.POP.GROW': 'population_growth',
            'SP.POP.TOTL': 'population',
            'SP.URB.TOTL.IN.ZS': 'urban_population'
        }
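A quick way to confirm that every remaining indicator code has a short name is to map the codes and look for gaps. The sketch below uses a two-entry subset of the mapping and made-up rows; the code X.UNKNOWN is hypothetical and is left out of the mapping on purpose.

```python
import pandas as pd

# Subset of the indicator_names mapping defined in the notebook.
indicator_names = {
    "SP.POP.TOTL": "population",
    "SP.DYN.LE00.IN": "life_expectancy",
}

# Toy rows in the layout of the cleaned file; "X.UNKNOWN" is a made-up code
# that is deliberately absent from the mapping.
toy = pd.DataFrame(
    {
        "economy": ["AAA", "AAA", "AAA"],
        "series": ["SP.POP.TOTL", "SP.DYN.LE00.IN", "X.UNKNOWN"],
        "YR2000": [6.0e6, 70.1, 1.0],
    }
)

# Map codes to short names; codes without an entry become NaN.
toy["name"] = toy["series"].map(indicator_names)

# Collect the codes that still need a name.
unmapped = toy.loc[toy["name"].isna(), "series"].unique().tolist()
print(unmapped)  # ['X.UNKNOWN']
```

An empty list means the mapping covers all indicators that survived cleaning.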


data_cleaning.py provides this and more: it removes the lowest-coverage indicators and countries and then splits the cleaned dataframe into several CSV files, one per indicator. All the functions for cleaning, visualisation and analysis are defined and explained in the appropriate scripts in the project_code folder.
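The actual splitting code lives in data_cleaning.py and is not shown in this notebook. As a rough illustration only, a per-indicator split could look like the sketch below. The helper name `split_by_indicator`, the output file naming and the toy data are assumptions made for this example, not the project's implementation.

```python
import os
import tempfile

import pandas as pd


def split_by_indicator(df, names, out_dir):
    """Write one CSV per indicator code and return the written file paths.

    Hypothetical helper for illustration; not the code in data_cleaning.py.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for code, group in df.groupby("series"):
        # Fall back to the raw code when no short name is defined.
        name = names.get(code, code)
        path = os.path.join(out_dir, f"{name}.csv")
        group.drop(columns="series").to_csv(path, index=False)
        written.append(path)
    return written


# Toy data in the same layout as the cleaned file (all values are made up).
toy = pd.DataFrame(
    {
        "economy": ["AAA", "BBB", "AAA", "BBB"],
        "series": ["SP.POP.TOTL", "SP.POP.TOTL", "SP.DYN.LE00.IN", "SP.DYN.LE00.IN"],
        "YR2000": [6.0e6, 7.0e6, 70.1, 65.3],
    }
)
names = {"SP.POP.TOTL": "population", "SP.DYN.LE00.IN": "life_expectancy"}

out_dir = tempfile.mkdtemp()  # temporary folder so the example leaves the project untouched
paths = split_by_indicator(toy, names, out_dir)
print(sorted(os.path.basename(p) for p in paths))
# ['life_expectancy.csv', 'population.csv']
```

Each output file keeps the economy column and the year columns, so it can be read back with `pd.read_csv` as a table for a single indicator.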

The project is displayed with Streamlit; the main app page is _🌍_Global_overview.py.