# ETL & Data Preparation: Extract, Transform, Load

## Objectives

* To prepare a unified, analysis-ready dataset that supports examining how renewable energy deployment and energy efficiency impact CO₂ emissions over time, and to detect structural tipping points in emission trends. The dataset enables hypothesis testing, machine learning, and interactive dashboard visualisation to guide policy and research.

## Inputs

- **Global Sustainable Energy (Kaggle)**  
  Country-level annual indicators on electricity generation, renewable energy share, and CO₂ emissions (2000–2020).

- **World Bank Population (SP.POP.TOTL)**  
  Total population data by country (1960–2023), subset for 2000–2020 to align with energy data coverage.

- **UNSD M49 Region Mapping**  
  Static country-to-region classification to support regional comparisons and aggregation.

## Outputs

* A cleaned, merged dataset containing harmonised country-year records (2000–2020), enriched with population and regional classifications.  
  This integrated dataset forms the foundation for statistical analysis, hypothesis validation, and predictive modelling, and it feeds the Streamlit dashboard.

---

### Load and Inspect the Data
Understand the raw data: check columns, types, shape, duplicates, and any obvious issues.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from seaborn import boxplot
import plotly.express as px
import os

In [2]:
# Create folder to save images 
os.makedirs("images", exist_ok=True)

### Extract Datasets

Step 1: Load Datasets

In [None]:
# Load the global energy dataset
df_energy = pd.read_csv("../data/raw/global-data-on-sustainable-energy.csv")

# Load the UNSD M49 country-to-region mapping (semicolon-delimited file)
df_region = pd.read_csv("../data/raw/unsd_country_region_mapping.csv", sep=";")

# Load the World Bank population dataset
df_population = pd.read_csv("../data/raw/world_bank_population.csv")



Step 2: Preview the Raw Data

In [27]:
from IPython.display import display

# Use display() to preview the first few rows of each dataset
display(df_energy.head())
display(df_population.head())
display(df_region.head())


Unnamed: 0,Entity,Year,Access to electricity (% of population),Access to clean fuels for cooking,Renewable-electricity-generating-capacity-per-capita,Financial flows to developing countries (US $),Renewable energy share in the total final energy consumption (%),Electricity from fossil fuels (TWh),Electricity from nuclear (TWh),Electricity from renewables (TWh),...,Primary energy consumption per capita (kWh/person),Energy intensity level of primary energy (MJ/$2017 PPP GDP),Value_co2_emissions_kt_by_country,Renewables (% equivalent primary energy),gdp_growth,gdp_per_capita,Density\n(P/Km2),Land Area(Km2),Latitude,Longitude
0,Afghanistan,2000,1.613591,6.2,9.22,20000.0,44.99,0.16,0.0,0.31,...,302.59482,1.64,760.0,,,,60,652230.0,33.93911,67.709953
1,Afghanistan,2001,4.074574,7.2,8.86,130000.0,45.6,0.09,0.0,0.5,...,236.89185,1.74,730.0,,,,60,652230.0,33.93911,67.709953
2,Afghanistan,2002,9.409158,8.2,8.47,3950000.0,37.83,0.13,0.0,0.56,...,210.86215,1.4,1029.999971,,,179.426579,60,652230.0,33.93911,67.709953
3,Afghanistan,2003,14.738506,9.5,8.09,25970000.0,36.66,0.31,0.0,0.63,...,229.96822,1.4,1220.000029,,8.832278,190.683814,60,652230.0,33.93911,67.709953
4,Afghanistan,2004,20.064968,10.9,7.75,,44.24,0.33,0.0,0.56,...,204.23125,1.2,1029.999971,,1.414118,211.382074,60,652230.0,33.93911,67.709953


Unnamed: 0,STRUCTURE,STRUCTURE_ID,ACTION,FREQ_ID,FREQ_NAME,REF_AREA_ID,REF_AREA_NAME,INDICATOR_ID,INDICATOR_NAME,SEX_ID,...,DATA_SOURCE_NAME,UNIT_TYPE_ID,UNIT_TYPE_NAME,TIME_FORMAT_ID,TIME_FORMAT_NAME,COMMENT_OBS,OBS_STATUS_ID,OBS_STATUS_NAME,OBS_CONF_ID,OBS_CONF_NAME
0,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,AFE,Africa Eastern and Southern,WB_WDI_SP_POP_TOTL,"Population, total",_T,...,World Development Indicators (WDI),COUNT,Count (Integer),P1Y,Annual,,A,Normal value,PU,Public
1,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,AFW,Africa Western and Central,WB_WDI_SP_POP_TOTL,"Population, total",_T,...,World Development Indicators (WDI),COUNT,Count (Integer),P1Y,Annual,,A,Normal value,PU,Public
2,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,ARB,Arab World,WB_WDI_SP_POP_TOTL,"Population, total",_T,...,World Development Indicators (WDI),COUNT,Count (Integer),P1Y,Annual,,A,Normal value,PU,Public
3,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,CSS,Caribbean small states,WB_WDI_SP_POP_TOTL,"Population, total",_T,...,World Development Indicators (WDI),COUNT,Count (Integer),P1Y,Annual,,A,Normal value,PU,Public
4,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,CEB,Central Electricity Board (CEB),WB_WDI_SP_POP_TOTL,"Population, total",_T,...,World Development Indicators (WDI),COUNT,Count (Integer),P1Y,Annual,,A,Normal value,PU,Public


Unnamed: 0,Global Code,Global Name,Region Code,Region Name,Sub-region Code,Sub-region Name,Intermediate Region Code,Intermediate Region Name,Country or Area,M49 Code,ISO-alpha2 Code,ISO-alpha3 Code,Least Developed Countries (LDC),Land Locked Developing Countries (LLDC),"Small Island Developing States (SIDS),"
0,1,World,2.0,Africa,15.0,Northern Africa,,,Algeria,12,DZ,DZA,,,","
1,1,World,2.0,Africa,15.0,Northern Africa,,,Egypt,818,EG,EGY,,,","
2,1,World,2.0,Africa,15.0,Northern Africa,,,Libya,434,LY,LBY,,,","
3,1,World,2.0,Africa,15.0,Northern Africa,,,Morocco,504,MA,MAR,,,","
4,1,World,2.0,Africa,15.0,Northern Africa,,,Sudan,729,SD,SDN,x,,","


Step 3: Inspect Dataset Structure

In [28]:
# Inspect the structure and nulls in the energy dataset
print("df_energy:")
print(df_energy.shape)
df_energy.info()
print(df_energy.isnull().sum())

df_energy:
(3649, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3649 entries, 0 to 3648
Data columns (total 21 columns):
 #   Column                                                            Non-Null Count  Dtype  
---  ------                                                            --------------  -----  
 0   Entity                                                            3649 non-null   object 
 1   Year                                                              3649 non-null   int64  
 2   Access to electricity (% of population)                           3639 non-null   float64
 3   Access to clean fuels for cooking                                 3480 non-null   float64
 4   Renewable-electricity-generating-capacity-per-capita              2718 non-null   float64
 5   Financial flows to developing countries (US $)                    1560 non-null   float64
 6   Renewable energy share in the total final energy consumption (%)  3455 non-null   float64
 7   Electricity

In [29]:
# Inspect the structure and nulls in the population dataset
print("\ndf_population:")
print(df_population.shape)
df_population.info()
print(df_population.isnull().sum())


df_population:
(16930, 45)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16930 entries, 0 to 16929
Data columns (total 45 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   STRUCTURE              16930 non-null  object 
 1   STRUCTURE_ID           16930 non-null  object 
 2   ACTION                 16930 non-null  object 
 3   FREQ_ID                16930 non-null  object 
 4   FREQ_NAME              16930 non-null  object 
 5   REF_AREA_ID            16930 non-null  object 
 6   REF_AREA_NAME          16930 non-null  object 
 7   INDICATOR_ID           16930 non-null  object 
 8   INDICATOR_NAME         16930 non-null  object 
 9   SEX_ID                 16930 non-null  object 
 10  SEX_NAME               16930 non-null  object 
 11  AGE_ID                 16930 non-null  object 
 12  AGE_NAME               16930 non-null  object 
 13  URBANISATION_ID        16930 non-null  object 
 14  URBANISATION_NAME      169

In [30]:
# Inspect the structure and nulls in the region mapping dataset
print("\ndf_region:")
print(df_region.shape)
df_region.info()
print(df_region.isnull().sum())


df_region:
(248, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 15 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Global Code                              248 non-null    int64  
 1   Global Name                              248 non-null    object 
 2   Region Code                              247 non-null    float64
 3   Region Name                              247 non-null    object 
 4   Sub-region Code                          247 non-null    float64
 5   Sub-region Name                          247 non-null    object 
 6   Intermediate Region Code                 105 non-null    float64
 7   Intermediate Region Name                 105 non-null    object 
 8   Country or Area                          248 non-null    object 
 9   M49 Code                                 248 non-null    int64  
 10  ISO-alpha2 Code             

Step 4: Clean Column Names in All Datasets

In [31]:
# Function to clean column names: lowercase, trim spaces, replace internal spaces with underscores, remove special characters
def clean_column_names(df):
    df.columns = (
        df.columns.str.strip()                      # remove leading/trailing whitespace
                  .str.lower()                      # make lowercase
                  .str.replace(r'[^\w\s]', '', regex=True)  # remove special characters
                  .str.replace(r'\s+', '_', regex=True)     # replace space(s) with underscore
    )
    return df

# Apply to all datasets
df_energy = clean_column_names(df_energy)
df_population = clean_column_names(df_population)
df_region = clean_column_names(df_region)
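
The cleaner above deletes punctuation outright, which is why a hyphenated header such as `Renewable-electricity-generating-capacity-per-capita` collapses into one long token and has to be renamed by hand later. As a minimal sketch of an alternative (the function name `clean_column_names_keep_separators` and the toy frame are assumptions for illustration, not part of this notebook), punctuation can be converted to a separator instead:

```python
import re

import pandas as pd


def clean_column_names_keep_separators(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical variant: turn punctuation into separators instead of deleting it."""

    def _clean(name: str) -> str:
        name = str(name).strip().lower()
        name = re.sub(r"[^\w\s]", " ", name)  # punctuation (hyphens, brackets, %) -> space
        name = re.sub(r"[\s_]+", "_", name)   # collapse whitespace/underscore runs into one "_"
        return name.strip("_")                # drop leading/trailing underscores

    # rename() returns a new DataFrame, so the caller's frame is left untouched
    return df.rename(columns=_clean)


# Toy frame using three headers that appear in the raw energy dataset
demo = pd.DataFrame(columns=[
    "Renewable-electricity-generating-capacity-per-capita",
    "Land Area(Km2)",
    "Access to electricity (% of population)",
])
cleaned_demo = clean_column_names_keep_separators(demo)
print(list(cleaned_demo.columns))
```

Two different raw headers could in principle map to the same cleaned name under either approach, so checking `cleaned_demo.columns.is_unique` afterwards is a cheap safeguard.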


Step 5: Re-inspect Dataset Structure After Cleaning Column Names

In [34]:
# inspect the cleaned structure and nulls in the energy dataset
print("df_energy:")
print(df_energy.shape)
df_energy.info()
print(df_energy.isnull().sum())

df_energy:
(3649, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3649 entries, 0 to 3648
Data columns (total 21 columns):
 #   Column                                                         Non-Null Count  Dtype  
---  ------                                                         --------------  -----  
 0   entity                                                         3649 non-null   object 
 1   year                                                           3649 non-null   int64  
 2   access_to_electricity_of_population                            3639 non-null   float64
 3   access_to_clean_fuels_for_cooking                              3480 non-null   float64
 4   renewableelectricitygeneratingcapacitypercapita                2718 non-null   float64
 5   financial_flows_to_developing_countries_us_                    1560 non-null   float64
 6   renewable_energy_share_in_the_total_final_energy_consumption_  3455 non-null   float64
 7   electricity_from_fossil_fuels_twh     

In [35]:
# inspect the cleaned structure and nulls in the population dataset
print("\ndf_population:")
print(df_population.shape)
df_population.info()
print(df_population.isnull().sum())


df_population:
(16930, 45)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16930 entries, 0 to 16929
Data columns (total 45 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   structure              16930 non-null  object 
 1   structure_id           16930 non-null  object 
 2   action                 16930 non-null  object 
 3   freq_id                16930 non-null  object 
 4   freq_name              16930 non-null  object 
 5   ref_area_id            16930 non-null  object 
 6   ref_area_name          16930 non-null  object 
 7   indicator_id           16930 non-null  object 
 8   indicator_name         16930 non-null  object 
 9   sex_id                 16930 non-null  object 
 10  sex_name               16930 non-null  object 
 11  age_id                 16930 non-null  object 
 12  age_name               16930 non-null  object 
 13  urbanisation_id        16930 non-null  object 
 14  urbanisation_name      169

In [36]:
# inspect the cleaned structure and nulls in the region dataset
print("\ndf_region:")
print(df_region.shape)
df_region.info()
print(df_region.isnull().sum())


df_region:
(248, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 15 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   global_code                            248 non-null    int64  
 1   global_name                            248 non-null    object 
 2   region_code                            247 non-null    float64
 3   region_name                            247 non-null    object 
 4   subregion_code                         247 non-null    float64
 5   subregion_name                         247 non-null    object 
 6   intermediate_region_code               105 non-null    float64
 7   intermediate_region_name               105 non-null    object 
 8   country_or_area                        248 non-null    object 
 9   m49_code                               248 non-null    int64  
 10  isoalpha2_code                         247 non-null 

Step 6: Clean the Energy Dataset

In [37]:
# Rename 'entity' to 'country' for consistency and make sure 'year' is an integer
df_energy.rename(columns={'entity': 'country'}, inplace=True)
df_energy['year'] = df_energy['year'].astype(int)



In [38]:
df_energy.head()  # Display the first few rows of the energy dataset

Unnamed: 0,country,year,access_to_electricity_of_population,access_to_clean_fuels_for_cooking,renewableelectricitygeneratingcapacitypercapita,financial_flows_to_developing_countries_us_,renewable_energy_share_in_the_total_final_energy_consumption_,electricity_from_fossil_fuels_twh,electricity_from_nuclear_twh,electricity_from_renewables_twh,...,primary_energy_consumption_per_capita_kwhperson,energy_intensity_level_of_primary_energy_mj2017_ppp_gdp,value_co2_emissions_kt_by_country,renewables_equivalent_primary_energy,gdp_growth,gdp_per_capita,densitynpkm2,land_areakm2,latitude,longitude
0,Afghanistan,2000,1.613591,6.2,9.22,20000.0,44.99,0.16,0.0,0.31,...,302.59482,1.64,760.0,,,,60,652230.0,33.93911,67.709953
1,Afghanistan,2001,4.074574,7.2,8.86,130000.0,45.6,0.09,0.0,0.5,...,236.89185,1.74,730.0,,,,60,652230.0,33.93911,67.709953
2,Afghanistan,2002,9.409158,8.2,8.47,3950000.0,37.83,0.13,0.0,0.56,...,210.86215,1.4,1029.999971,,,179.426579,60,652230.0,33.93911,67.709953
3,Afghanistan,2003,14.738506,9.5,8.09,25970000.0,36.66,0.31,0.0,0.63,...,229.96822,1.4,1220.000029,,8.832278,190.683814,60,652230.0,33.93911,67.709953
4,Afghanistan,2004,20.064968,10.9,7.75,,44.24,0.33,0.0,0.56,...,204.23125,1.2,1029.999971,,1.414118,211.382074,60,652230.0,33.93911,67.709953


In [39]:
df_energy.info() # Display the structure of the energy dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3649 entries, 0 to 3648
Data columns (total 21 columns):
 #   Column                                                         Non-Null Count  Dtype  
---  ------                                                         --------------  -----  
 0   country                                                        3649 non-null   object 
 1   year                                                           3649 non-null   int32  
 2   access_to_electricity_of_population                            3639 non-null   float64
 3   access_to_clean_fuels_for_cooking                              3480 non-null   float64
 4   renewableelectricitygeneratingcapacitypercapita                2718 non-null   float64
 5   financial_flows_to_developing_countries_us_                    1560 non-null   float64
 6   renewable_energy_share_in_the_total_final_energy_consumption_  3455 non-null   float64
 7   electricity_from_fossil_fuels_twh                           

Step 7: Clean and Filter the Population Dataset
This step filters out regional aggregates and retains only country-level records whose names match the energy dataset.

In [40]:
# Keep only rows with total population indicator
df_population = df_population[df_population['indicator_id'] == 'WB_WDI_SP_POP_TOTL']

# Filter for countries that match those in the energy dataset
valid_countries = df_energy['country'].unique()
df_population = df_population[df_population['ref_area_name'].isin(valid_countries)]

# Select and rename columns
df_population = df_population[['ref_area_name', 'time_period', 'obs_value']].copy()
df_population.columns = ['country', 'year', 'population']

# Convert datatypes
df_population['year'] = df_population['year'].astype(int)
df_population['population'] = pd.to_numeric(df_population['population'], errors='coerce')
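
Because the filter relies on exact name matches, any country spelled differently in the two sources is silently dropped and later surfaces as a missing population value. The sketch below shows one way to list such mismatches with plain set differences. It uses small made-up frames (`energy_demo`, `population_demo`) rather than the real data, and the names in them are illustrative only.

```python
import pandas as pd

# Toy stand-ins for df_energy and the raw population table (values are made up)
energy_demo = pd.DataFrame({
    "country": ["Afghanistan", "Egypt", "Egypt"],
    "year": [2000, 2000, 2001],
})
population_demo = pd.DataFrame({
    "ref_area_name": ["Afghanistan", "Egypt, Arab Rep.", "Arab World"],
    "time_period": [2000, 2000, 2000],
    "obs_value": [1.0, 2.0, 3.0],
})

energy_names = set(energy_demo["country"])
population_names = set(population_demo["ref_area_name"])

# Countries in the energy data with no identically spelled population entry
unmatched_in_population = sorted(energy_names - population_names)

# Population entries the isin() filter would drop (aggregates or differently spelled names)
dropped_from_population = sorted(population_names - energy_names)

print("No population match:", unmatched_in_population)
print("Dropped by filter:", dropped_from_population)
```

Reviewing the first list before merging makes it possible to add a small manual name mapping, or to join on ISO codes instead, if both sources provide them.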


Step 8: Clean the Region Mapping Dataset

In [41]:
# Select and rename relevant columns
df_region = df_region[['country_or_area', 'region_name', 'subregion_name']].copy()
df_region.columns = ['country', 'region', 'subregion']


Step 9: Merge the Datasets

In [42]:
# Merge energy and population datasets on country and year
df_merged = pd.merge(df_energy, df_population, on=['country', 'year'], how='left')

# Merge with region mapping on country
df_final = pd.merge(df_merged, df_region, on='country', how='left')
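
A left merge keeps every energy row, so unmatched keys only show up later as nulls. As a hedged sketch on toy data (the frames `left_demo` and `region_demo` are assumptions, not the notebook's variables), `pd.merge` can report unmatched rows through `indicator=True` and guard against accidental row duplication through `validate`:

```python
import pandas as pd

# Toy country-year table and a region lookup that is missing country "B"
left_demo = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
    "co2": [1.0, 2.0, 3.0],
})
region_demo = pd.DataFrame({"country": ["A"], "region": ["Asia"]})

merged_demo = pd.merge(
    left_demo,
    region_demo,
    on="country",
    how="left",
    validate="many_to_one",  # raises MergeError if a country appears twice in region_demo
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Countries that found no partner in the lookup table
unmatched_rows = merged_demo[merged_demo["_merge"] == "left_only"]
missing_countries = sorted(unmatched_rows["country"].unique())
print(missing_countries)

# A many-to-one left merge should never change the row count
assert len(merged_demo) == len(left_demo)
```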


Step 10: Inspect the Final Merged Dataset

In [43]:
print("Final Dataset:")
print(df_final.shape)
df_final.info()
print(df_final.isnull().sum())


Final Dataset:
(3649, 24)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3649 entries, 0 to 3648
Data columns (total 24 columns):
 #   Column                                                         Non-Null Count  Dtype  
---  ------                                                         --------------  -----  
 0   country                                                        3649 non-null   object 
 1   year                                                           3649 non-null   int32  
 2   access_to_electricity_of_population                            3639 non-null   float64
 3   access_to_clean_fuels_for_cooking                              3480 non-null   float64
 4   renewableelectricitygeneratingcapacitypercapita                2718 non-null   float64
 5   financial_flows_to_developing_countries_us_                    1560 non-null   float64
 6   renewable_energy_share_in_the_total_final_energy_consumption_  3455 non-null   float64
 7   electricity_from_fossil_fuels_twh 

In [44]:
# Rename inconsistent column names
df_final.rename(columns={
    'renewableelectricitygeneratingcapacitypercapita': 'renewable_electricity_generating_capacity_per_capita',
    'densitynpkm2': 'density_per_km2',
    'land_areakm2': 'land_area_km2'
}, inplace=True)


In [45]:
# Check updated column names
df_final.columns.tolist()


['country',
 'year',
 'access_to_electricity_of_population',
 'access_to_clean_fuels_for_cooking',
 'renewable_electricity_generating_capacity_per_capita',
 'financial_flows_to_developing_countries_us_',
 'renewable_energy_share_in_the_total_final_energy_consumption_',
 'electricity_from_fossil_fuels_twh',
 'electricity_from_nuclear_twh',
 'electricity_from_renewables_twh',
 'lowcarbon_electricity_electricity',
 'primary_energy_consumption_per_capita_kwhperson',
 'energy_intensity_level_of_primary_energy_mj2017_ppp_gdp',
 'value_co2_emissions_kt_by_country',
 'renewables_equivalent_primary_energy',
 'gdp_growth',
 'gdp_per_capita',
 'density_per_km2',
 'land_area_km2',
 'latitude',
 'longitude',
 'population',
 'region',
 'subregion']

In [46]:
# Rename some columns in the final dataset for clarity and consistency
df_final.rename(columns={
    'access_to_electricity_of_population': 'access_to_electricity_pct',
    'financial_flows_to_developing_countries_us_': 'financial_support_to_developing_countries_usd',
    'renewable_energy_share_in_the_total_final_energy_consumption_': 'renewable_energy_share_in_the_total_final_energy_consumption_pct',
    'lowcarbon_electricity_electricity': 'low_carbon_electricity_pct',
    'primary_energy_consumption_per_capita_kwhperson': 'primary_energy_consumption_per_capita_kwh_per_person',
    'renewables_equivalent_primary_energy': 'renewables_equivalent_primary_energy_pct'
}, inplace=True)


In [47]:
# Check updated column names
df_final.columns.tolist()

['country',
 'year',
 'access_to_electricity_pct',
 'access_to_clean_fuels_for_cooking',
 'renewable_electricity_generating_capacity_per_capita',
 'financial_support_to_developing_countries_usd',
 'renewable_energy_share_in_the_total_final_energy_consumption_pct',
 'electricity_from_fossil_fuels_twh',
 'electricity_from_nuclear_twh',
 'electricity_from_renewables_twh',
 'low_carbon_electricity_pct',
 'primary_energy_consumption_per_capita_kwh_per_person',
 'energy_intensity_level_of_primary_energy_mj2017_ppp_gdp',
 'value_co2_emissions_kt_by_country',
 'renewables_equivalent_primary_energy_pct',
 'gdp_growth',
 'gdp_per_capita',
 'density_per_km2',
 'land_area_km2',
 'latitude',
 'longitude',
 'population',
 'region',
 'subregion']

Step 11: Save Cleaned Dataset

In [49]:
import os

# Make sure the output folder exists, then save the cleaned dataset as CSV
os.makedirs("../data/cleaned", exist_ok=True)
df_final.to_csv("../data/cleaned/enhanced_energy_dataset.csv", index=False)



In [51]:
# Read the cleaned dataset back to verify everything was saved correctly
df_cleaned = pd.read_csv("../data/cleaned/enhanced_energy_dataset.csv")

# Preview the first few rows
df_cleaned.head()


Unnamed: 0,country,year,access_to_electricity_pct,access_to_clean_fuels_for_cooking,renewable_electricity_generating_capacity_per_capita,financial_support_to_developing_countries_usd,renewable_energy_share_in_the_total_final_energy_consumption_pct,electricity_from_fossil_fuels_twh,electricity_from_nuclear_twh,electricity_from_renewables_twh,...,renewables_equivalent_primary_energy_pct,gdp_growth,gdp_per_capita,density_per_km2,land_area_km2,latitude,longitude,population,region,subregion
0,Afghanistan,2000,1.613591,6.2,9.22,20000.0,44.99,0.16,0.0,0.31,...,,,,60,652230.0,33.93911,67.709953,20130327.0,Asia,Southern Asia
1,Afghanistan,2001,4.074574,7.2,8.86,130000.0,45.6,0.09,0.0,0.5,...,,,,60,652230.0,33.93911,67.709953,20284307.0,Asia,Southern Asia
2,Afghanistan,2002,9.409158,8.2,8.47,3950000.0,37.83,0.13,0.0,0.56,...,,,179.426579,60,652230.0,33.93911,67.709953,21378117.0,Asia,Southern Asia
3,Afghanistan,2003,14.738506,9.5,8.09,25970000.0,36.66,0.31,0.0,0.63,...,,8.832278,190.683814,60,652230.0,33.93911,67.709953,22733049.0,Asia,Southern Asia
4,Afghanistan,2004,20.064968,10.9,7.75,,44.24,0.33,0.0,0.56,...,,1.414118,211.382074,60,652230.0,33.93911,67.709953,23560654.0,Asia,Southern Asia


In [52]:
df_cleaned.info()  # Display the structure of the cleaned dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3649 entries, 0 to 3648
Data columns (total 24 columns):
 #   Column                                                            Non-Null Count  Dtype  
---  ------                                                            --------------  -----  
 0   country                                                           3649 non-null   object 
 1   year                                                              3649 non-null   int64  
 2   access_to_electricity_pct                                         3639 non-null   float64
 3   access_to_clean_fuels_for_cooking                                 3480 non-null   float64
 4   renewable_electricity_generating_capacity_per_capita              2718 non-null   float64
 5   financial_support_to_developing_countries_usd                     1560 non-null   float64
 6   renewable_energy_share_in_the_total_final_energy_consumption_pct  3455 non-null   float64
 7   electricity_from_fossil_fuels_twh

In [54]:
df_cleaned.shape

(3649, 24)

In [55]:
# Check null values
df_cleaned.isnull().sum()

country                                                                0
year                                                                   0
access_to_electricity_pct                                             10
access_to_clean_fuels_for_cooking                                    169
renewable_electricity_generating_capacity_per_capita                 931
financial_support_to_developing_countries_usd                       2089
renewable_energy_share_in_the_total_final_energy_consumption_pct     194
electricity_from_fossil_fuels_twh                                     21
electricity_from_nuclear_twh                                         126
electricity_from_renewables_twh                                       21
low_carbon_electricity_pct                                            42
primary_energy_consumption_per_capita_kwh_per_person                   0
energy_intensity_level_of_primary_energy_mj2017_ppp_gdp              207
value_co2_emissions_kt_by_country                  

In [56]:
# Check for duplicates
df_cleaned.duplicated().sum()

0
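
`duplicated()` with no arguments only flags rows that are identical in every column. For a country-year panel, it can also be worth checking that the key itself is unique. A minimal sketch on made-up data (the frame `demo` is an assumption for illustration):

```python
import pandas as pd

# Two rows share the key (A, 2000) but differ in value, so they are not full-row duplicates
demo = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2000, 2000],
    "value": [1.0, 2.0, 3.0],
})

full_row_dupes = int(demo.duplicated().sum())
key_dupes = int(demo.duplicated(subset=["country", "year"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Duplicate (country, year) keys:", key_dupes)
```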

### Transform Data

Handle missing values for key columns

In [58]:
# Sort by country and year to prepare for time-series interpolation
df_final.sort_values(['country', 'year'], inplace=True)

In [59]:
# Interpolate time-series numeric columns for analysis 
cols_to_interpolate = [
    'renewable_energy_share_in_the_total_final_energy_consumption_pct',
    'value_co2_emissions_kt_by_country',
    'population',
    'energy_intensity_level_of_primary_energy_mj2017_ppp_gdp',
    'gdp_per_capita',
    'access_to_electricity_pct',
    'access_to_clean_fuels_for_cooking',
    'renewable_electricity_generating_capacity_per_capita',
    'electricity_from_fossil_fuels_twh',
    'electricity_from_nuclear_twh',
    'electricity_from_renewables_twh',
    'low_carbon_electricity_pct',
    'gdp_growth'
]

In [62]:
# Interpolate missing values for selected columns within each country group

df_final[cols_to_interpolate] = (
    df_final.groupby('country')[cols_to_interpolate]
    .transform(lambda group: group.interpolate())
)


In [63]:
# Forward/backward fill for country-level static fields (latitude, longitude, area, density)
geo_cols = ['latitude', 'longitude', 'density_per_km2', 'land_area_km2']
df_final[geo_cols] = df_final.groupby('country')[geo_cols].transform(lambda g: g.ffill().bfill())

In [64]:
# Fill region and subregion per country as they don’t change over time
df_final[['region', 'subregion']] = df_final.groupby('country')[['region', 'subregion']].transform(lambda g: g.ffill().bfill())

In [65]:
# Fill financial support with 0 (assumes missing = no support)
df_final['financial_support_to_developing_countries_usd'] = df_final['financial_support_to_developing_countries_usd'].fillna(0)
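
Filling with 0 makes a missing report indistinguishable from a genuine zero. If that distinction might matter later, one option is to record a flag before filling. This is a sketch under that assumption, using a made-up frame and the shortened, hypothetical column name `financial_support_usd`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"financial_support_usd": [20000.0, np.nan, 0.0]})

# Record which rows were originally missing BEFORE overwriting them
demo["financial_support_usd_missing"] = demo["financial_support_usd"].isna().astype(int)

# Then apply the "missing means no support" assumption
demo["financial_support_usd"] = demo["financial_support_usd"].fillna(0)

print(demo)
```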

In [66]:
# Final check: Remaining missing values summary
missing_summary = df_final.isnull().sum()
print("Remaining missing values per column:")
print(missing_summary[missing_summary > 0])

Remaining missing values per column:
access_to_electricity_pct                                             10
access_to_clean_fuels_for_cooking                                    169
renewable_electricity_generating_capacity_per_capita                 931
renewable_energy_share_in_the_total_final_energy_consumption_pct      21
electricity_from_fossil_fuels_twh                                     21
electricity_from_nuclear_twh                                         126
electricity_from_renewables_twh                                       21
low_carbon_electricity_pct                                            42
energy_intensity_level_of_primary_energy_mj2017_ppp_gdp               22
value_co2_emissions_kt_by_country                                    253
renewables_equivalent_primary_energy_pct                            2137
gdp_growth                                                           279
gdp_per_capita                                                       264
density_per_km

Forward/Backward Fill for Remaining Time Series Columns
If missing values still exist within a country, forward-fill and then backward-fill them.

In [71]:
# Continue to handle any remaining missing values from previous steps

additional_cols = [
    'access_to_electricity_pct',
    'access_to_clean_fuels_for_cooking',
    'electricity_from_fossil_fuels_twh',
    'electricity_from_nuclear_twh',
    'electricity_from_renewables_twh',
    'low_carbon_electricity_pct',
    'gdp_growth',
    'gdp_per_capita'
]

df_final[additional_cols] = df_final.groupby('country')[additional_cols].transform(lambda g: g.ffill().bfill())



In [72]:
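# Fill per-country fields (density, land area, lat/lon) and remaining population gaps.
# Note: population varies by year, so ffill/bfill is only an approximation for it.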

static_cols = ['density_per_km2', 'land_area_km2', 'latitude', 'longitude', 'population']

df_final[static_cols] = df_final.groupby('country')[static_cols].transform(lambda g: g.ffill().bfill())

In [73]:
# Flag a non-core, high-missing column that is not essential for hypothesis testing

df_final['renewables_equivalent_primary_energy_pct_missing'] = df_final['renewables_equivalent_primary_energy_pct'].isnull().astype(int)



In [75]:
# Sanity check missing values
df_final.isnull().sum()

country                                                                0
year                                                                   0
access_to_electricity_pct                                              1
access_to_clean_fuels_for_cooking                                    169
renewable_electricity_generating_capacity_per_capita                 931
financial_support_to_developing_countries_usd                          0
renewable_energy_share_in_the_total_final_energy_consumption_pct      21
electricity_from_fossil_fuels_twh                                     21
electricity_from_nuclear_twh                                         126
electricity_from_renewables_twh                                       21
low_carbon_electricity_pct                                            42
primary_energy_consumption_per_capita_kwh_per_person                   0
energy_intensity_level_of_primary_energy_mj2017_ppp_gdp               22
value_co2_emissions_kt_by_country                  

In [76]:
# Rerun handling for any remaining missing values

# Columns suitable for interpolation (timeseries-style)
interpolate_cols = [
    'access_to_electricity_pct',
    'renewable_energy_share_in_the_total_final_energy_consumption_pct',
    'electricity_from_fossil_fuels_twh',
    'electricity_from_nuclear_twh',
    'electricity_from_renewables_twh',
    'low_carbon_electricity_pct',
    'energy_intensity_level_of_primary_energy_mj2017_ppp_gdp',
    'value_co2_emissions_kt_by_country',
    'gdp_growth',
    'gdp_per_capita'
]

# Interpolate within each country
df_final[interpolate_cols] = (
    df_final.groupby('country')[interpolate_cols]
    .transform(lambda g: g.interpolate())
)

# Then forward/backward fill remaining gaps within each country
df_final[interpolate_cols] = (
    df_final.groupby('country')[interpolate_cols]
    .transform(lambda g: g.ffill().bfill())
)

In [77]:
# Fill static geo data

geo_cols = ['density_per_km2', 'land_area_km2', 'latitude', 'longitude', 'population']

df_final[geo_cols] = (
    df_final.groupby('country')[geo_cols]
    .transform(lambda g: g.ffill().bfill())
)


In [78]:
# Flag Missing Values for non-core columns

non_core_cols_to_flag = [
    'access_to_clean_fuels_for_cooking',
    'renewable_electricity_generating_capacity_per_capita',
    'renewables_equivalent_primary_energy_pct',
    'region',
    'subregion'
]
for col in non_core_cols_to_flag:
    df_final[f'{col}_missing'] = df_final[col].isnull().astype(int)





In [79]:
# Check how many rows are missing in each of the flagged columns
df_final[[f"{col}_missing" for col in non_core_cols_to_flag]].sum()

access_to_clean_fuels_for_cooking_missing                        169
renewable_electricity_generating_capacity_per_capita_missing     931
renewables_equivalent_primary_energy_pct_missing                2137
region_missing                                                    84
subregion_missing                                                 84
dtype: int64

In [80]:
# Final check for remaining missing values 
missing = df_final.isnull().sum()
print("Remaining missing values:\n", missing[missing > 0])

Remaining missing values:
 access_to_electricity_pct                                              1
access_to_clean_fuels_for_cooking                                    169
renewable_electricity_generating_capacity_per_capita                 931
renewable_energy_share_in_the_total_final_energy_consumption_pct      21
electricity_from_fossil_fuels_twh                                     21
electricity_from_nuclear_twh                                         126
electricity_from_renewables_twh                                       21
low_carbon_electricity_pct                                            42
energy_intensity_level_of_primary_energy_mj2017_ppp_gdp               22
value_co2_emissions_kt_by_country                                    253
renewables_equivalent_primary_energy_pct                            2137
gdp_growth                                                           232
gdp_per_capita                                                       232
density_per_km2         

In [81]:
# Review remaining missing values (including core columns)
df_final.isnull().sum().sort_values(ascending=False)

renewables_equivalent_primary_energy_pct                            2137
renewable_electricity_generating_capacity_per_capita                 931
value_co2_emissions_kt_by_country                                    253
population                                                           232
gdp_per_capita                                                       232
gdp_growth                                                           232
access_to_clean_fuels_for_cooking                                    169
electricity_from_nuclear_twh                                         126
region                                                                84
subregion                                                             84
low_carbon_electricity_pct                                            42
energy_intensity_level_of_primary_energy_mj2017_ppp_gdp               22
electricity_from_renewables_twh                                       21
electricity_from_fossil_fuels_twh                  

### Handling Missing Data

After data cleaning and transformation, some missing values remain in the dataset. These are primarily found in:

- **Non-core fields** such as `renewables_equivalent_primary_energy_pct` (2,137 missing), `renewable_electricity_generating_capacity_per_capita` (931), `access_to_clean_fuels_for_cooking` (169), and `region` and `subregion` (84 each).  
  These were **flagged** with `_missing` indicator columns rather than filled, as they are not essential to the hypotheses being tested.

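- **Time series metrics** like `value_co2_emissions_kt_by_country` (253 missing) and the electricity-related fields. Because gaps were interpolated and then forward/backward filled within each country, the values still missing should belong to countries that have no observations at all for that metric, so there is nothing within the country's own series to fill from.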

- **Country-level attributes** such as `population`, `land_area_km2`, `latitude`, and `longitude`. These were **filled with forward/backward fill** within each country; `population` still has 232 missing values after this step.

These remaining gaps do **not materially impact** the core hypotheses, which focus on:
- The relationship between **renewables share and CO₂ per capita**,
- The effect of crossing a **30% renewables tipping point**, and
- The correlation between **energy intensity and emissions**.

Values were filled only from within a country's own series; everything else was left as missing (with flags for the non-core fields) to avoid introducing bias through cross-country imputation.
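
To see which countries account for the gaps in a given column, and to build a complete-case subset for a specific hypothesis, something along these lines could be used. It is a sketch on hypothetical toy data; the helper `countries_without_data` and the variables `core`, `report` and `complete` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using two column names from the dataset
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year": [2000, 2001] * 3,
    "value_co2_emissions_kt_by_country": [100.0, 110.0, np.nan, np.nan, 50.0, np.nan],
    "population": [1000.0, 1010.0, 500.0, 505.0, np.nan, np.nan],
})


def countries_without_data(df, col):
    """Return countries whose entire series for `col` is missing."""
    all_missing = df[col].isna().groupby(df["country"]).all()
    return sorted(all_missing[all_missing].index)


core = ["value_co2_emissions_kt_by_country", "population"]

# Countries with no observations at all, per core column
report = {col: countries_without_data(toy, col) for col in core}
print(report)

# Complete-case subset: keep only rows where every core column is present
complete = toy.dropna(subset=core)
print(complete)
```

Dropping rows at analysis time, rather than in the saved dataset, keeps the cleaned file reusable for hypotheses that rely on different columns.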

### Load Data

In [82]:
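# Save the final cleaned dataset as CSV inside the cleaned folder
# (create the folder first so the write does not fail if it is missing)
os.makedirs("../data/cleaned", exist_ok=True)
df_final.to_csv("../data/cleaned/enhanced_energy_dataset_final.csv", index=False)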


In [83]:
# Reload the saved file to confirm it was written and parses as expected
df_final_check = pd.read_csv("../data/cleaned/enhanced_energy_dataset_final.csv")
df_final_check.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3649 entries, 0 to 3648
Data columns (total 29 columns):
 #   Column                                                            Non-Null Count  Dtype  
---  ------                                                            --------------  -----  
 0   country                                                           3649 non-null   object 
 1   year                                                              3649 non-null   int64  
 2   access_to_electricity_pct                                         3648 non-null   float64
 3   access_to_clean_fuels_for_cooking                                 3480 non-null   float64
 4   renewable_electricity_generating_capacity_per_capita              2718 non-null   float64
 5   financial_support_to_developing_countries_usd                     3649 non-null   float64
 6   renewable_energy_share_in_the_total_final_energy_consumption_pct  3628 non-null   float64
 7   electricity_from_fossil_fuels_twh

In [84]:
df_final.shape

(3649, 29)
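
The read-back check can be made stricter by comparing the reloaded frame with the in-memory one and by confirming that each country-year pair appears only once. The sketch below shows the idea on a hypothetical toy frame written to a temporary directory, so it does not touch the project files; `toy`, `reloaded` and `has_duplicate_keys` are illustrative names.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataset
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
    "gdp_per_capita": [1000.5, np.nan, 2500.0],
    "region_missing": [0, 0, 1],
})

# Write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "check.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if shape, dtypes or values differ after the round trip
pd.testing.assert_frame_equal(toy, reloaded)

# Each (country, year) pair should identify exactly one row
has_duplicate_keys = reloaded.duplicated(subset=["country", "year"]).any()
print("Duplicate country-year rows:", has_duplicate_keys)
```

On the real dataset the same two checks would be applied to `df_final` and `df_final_check`, assuming the column dtypes survive the CSV round trip unchanged.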

In [85]:
df_final.columns.tolist()

['country',
 'year',
 'access_to_electricity_pct',
 'access_to_clean_fuels_for_cooking',
 'renewable_electricity_generating_capacity_per_capita',
 'financial_support_to_developing_countries_usd',
 'renewable_energy_share_in_the_total_final_energy_consumption_pct',
 'electricity_from_fossil_fuels_twh',
 'electricity_from_nuclear_twh',
 'electricity_from_renewables_twh',
 'low_carbon_electricity_pct',
 'primary_energy_consumption_per_capita_kwh_per_person',
 'energy_intensity_level_of_primary_energy_mj2017_ppp_gdp',
 'value_co2_emissions_kt_by_country',
 'renewables_equivalent_primary_energy_pct',
 'gdp_growth',
 'gdp_per_capita',
 'density_per_km2',
 'land_area_km2',
 'latitude',
 'longitude',
 'population',
 'region',
 'subregion',
 'renewables_equivalent_primary_energy_pct_missing',
 'access_to_clean_fuels_for_cooking_missing',
 'renewable_electricity_generating_capacity_per_capita_missing',
 'region_missing',
 'subregion_missing']