### Import libs & data

Data source: International Renewable Energy Agency (IRENA) (2022), Renewable Energy Statistics 2022, https://pxweb.irena.org/pxweb/en/IRENASTAT; IMF Staff Calculations.

Data downloaded from: [IMF Climate Data](https://climatedata.imf.org/datasets/0bfab7fb7e0e4050b82bba40cd7a1bd5_0/about)

In [1]:
import pandas as pd
# pd.set_option('display.max_colwidth', None)
import sys
from pathlib import Path

In [2]:
PROJECT_ROOT = Path.cwd().parents[0]
SRC_DIR = PROJECT_ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

from my_project.paths import get_paths

paths = get_paths(PROJECT_ROOT)
DATA_DIR = paths['DATA_DIR']
RAW_DATA_DIR = paths['RAW_DATA_DIR']
PROCESSED_DATA_DIR = paths['PROCESSED_DATA_DIR']
LOGS_DIR = paths['LOGS_DIR']

In [3]:
from data.get_data import download_imf_energy_data, download_iso_codes, download_natural_earth_data

# download required datasets
imf_data = download_imf_energy_data(RAW_DATA_DIR)
iso_codes = download_iso_codes(RAW_DATA_DIR)
natural_earth_folder = download_natural_earth_data(RAW_DATA_DIR)

INFO:data.get_data:Downloading IMF energy data to /home/zephyr/workspace/Global_Energy_Trends/data/raw/imf_renewable_energy.csv...
INFO:data.get_data:IMF energy data download complete.
INFO:data.get_data:Fetching ISO codes from https://www.iban.com/country-codes...
INFO:data.get_data:ISO codes saved to /home/zephyr/workspace/Global_Energy_Trends/data/raw/iso_country_codes.csv
INFO:data.get_data:Downloading Natural Earth dataset to /home/zephyr/workspace/Global_Energy_Trends/data/raw/natural_earth/world_countries.zip...
INFO:data.get_data:Download complete. Unzipping /home/zephyr/workspace/Global_Energy_Trends/data/raw/natural_earth/world_countries.zip...
INFO:data.get_data:Natural Earth data extracted to /home/zephyr/workspace/Global_Energy_Trends/data/raw/natural_earth.


In [4]:
# load main dataset
df = pd.read_csv(imf_data, 
                  low_memory=False, encoding='utf-8', index_col=0)
print(df.shape)

(2063, 36)


In [5]:
# load isocode dataset
iso_df = pd.read_csv(iso_codes,
                     low_memory=False, encoding='utf-8')
print(iso_df.shape)

(249, 4)


### Dataset Overview

In [6]:
df.columns

Index(['Country', 'ISO2', 'ISO3', 'Technology', 'Energy_Type', 'Indicator',
       'Unit', 'Source', 'CTS_Name', 'CTS_Code', 'CTS_Full_Descriptor',
       'F2000', 'F2001', 'F2002', 'F2003', 'F2004', 'F2005', 'F2006', 'F2007',
       'F2008', 'F2009', 'F2010', 'F2011', 'F2012', 'F2013', 'F2014', 'F2015',
       'F2016', 'F2017', 'F2018', 'F2019', 'F2020', 'F2021', 'F2022', 'F2023',
       'F2024'],
      dtype='object')

In [7]:
df['Country'].nunique()  # 248 unique entities (countries plus aggregated regions)

248

In [8]:
df.sample(5)

ObjectId,Country,ISO2,ISO3,Technology,Energy_Type,Indicator,Unit,Source,CTS_Name,CTS_Code,...,F2015,F2016,F2017,F2018,F2019,F2020,F2021,F2022,F2023,F2024
1435,Panama,PA,PAN,Fossil fuels,Total Non-Renewable,Electricity Generation,Gigawatt-hours (GWh),International Renewable Energy Agency (IRENA) ...,Electricity Generation,ECNEG,...,3069.8,3501.83,3237.19,2023.346,4815.43,2726.06,2146.44,2424.11,,
1142,Mali,ML,MLI,Solar energy,Total Renewable,Electricity Installed Capacity,Megawatt (MW),International Renewable Energy Agency (IRENA) ...,Electricity Installed Capacity,ECNEC,...,6.303,7.096,9.549,10.043,11.078,61.545,92.546,96.961,97.009,137.009
725,Georgia,GE,GEO,Hydropower (excl. Pumped Storage),Total Renewable,Electricity Installed Capacity,Megawatt (MW),International Renewable Energy Agency (IRENA) ...,Electricity Installed Capacity,ECNEC,...,2802.0,3160.087,3113.087,3220.087,3300.087,3323.087,3354.337,3379.04,3449.938,3413.938
1173,Mauritius,MU,MUS,Bioenergy,Total Renewable,Electricity Generation,Gigawatt-hours (GWh),International Renewable Energy Agency (IRENA) ...,Electricity Generation,ECNEG,...,530.199,515.679,480.13,459.685,459.469,408.354,368.716,300.166,,
1495,Portugal,PT,PRT,Hydropower (excl. Pumped Storage),Total Renewable,Electricity Generation,Gigawatt-hours (GWh),International Renewable Energy Agency (IRENA) ...,Electricity Generation,ECNEG,...,8660.506,15723.323,5896.883,12393.349,8817.714,12082.581,11907.543,6536.022,,


Each row in the dataset corresponds to one combination of [country](https://www.iban.com/country-codes), energy technology, and indicator.
Technologies are classified under renewable or non‑renewable categories, and further separated into two indicators: Electricity Generation (measured in gigawatt‑hours, GWh) and Installed Capacity (measured in megawatts, MW).

The columns include metadata attributes such as country codes (ISO2, ISO3), technology name, energy type, and indicator description, etc., followed by annual values spanning from 2000 to 2024 that record the evolution of generation and capacity over time.

Next, we'll see which countries are represented in our dataset.

> The term *"country"* as used in this workbook and associated analysis refers to the list of entities provided by the IBAN.com country codes list, which is based on the ISO 3166‑1 standard. This list includes sovereign states as well as various dependencies, overseas territories, and special areas of geographical interest, each assigned a unique code for data processing and communication purposes.

##### Country List in Dataset

In [9]:
# confirm that there are no missing ISO3 codes
assert df['ISO3'].isnull().sum() == 0

# codes that are in our dataset but not in the ISO 3166 international standard list
to_define = set(df['ISO3']) - set(iso_df['Alpha-3 code'])
# codes that are in the ISO 3166 international standard list but not in our dataset
missing_codes = set(iso_df['Alpha-3 code']) - set(df['ISO3'])

In [10]:
# codes not in ISO 3166 international standard list 
df[df['ISO3'].isin(to_define)]\
    .groupby('ISO3')['Country'].unique()

ISO3
AETMP                     [Advanced Economies]
AMETMP                              [Americas]
ASIATMP                                 [Asia]
EMDETMP    [Emerging and Developing Economies]
EURTMP                                [Europe]
LACTMP       [Latin America and the Caribbean]
NA119                                     [G7]
NA120                                    [G20]
NA225                       [Northern America]
NA510                           [Eastern Asia]
NA605                                 [Africa]
NACA                            [Central Asia]
NAEE                          [Eastern Europe]
NANA9                        [Northern Africa]
NANE                         [Northern Europe]
NASA                           [Southern Asia]
NASE                         [Southern Europe]
NASEA                     [South-eastern Asia]
NAWA                            [Western Asia]
NAWE                          [Western Europe]
OCETMP                               [Oceania]
OCRTMP  

These codes are not in the ISO dataset because they identify aggregated entities rather than individual countries or economies.
The exception is the Republic of Kosovo: it is not a UN member and its statehood is disputed, so it has no officially assigned ISO 3166‑1 code and is absent from `iso_df`.

We will add a new column for filtering between aggregated regions and individual regions.

In [11]:
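# add new column for filtering between individual countries and aggregated regions
# Kosovo is also absent from the ISO list but is a single economy, so keep it labelled as a country
kosovo_codes = set(df.loc[df['Country'].str.contains('Kosovo', na=False), 'ISO3'])
region_codes = to_define - kosovo_codes
df['Region_type'] = df['ISO3'].apply(lambda x: 'Aggregated Region' if x in region_codes else 'Country')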

In [12]:
# codes not in our dataset
iso_df[iso_df['Alpha-3 code'].isin(missing_codes)]\
        .groupby('Alpha-3 code')['Country'].unique()

Alpha-3 code
ALA                                   [Åland Islands]
ATA                                      [Antarctica]
ATF               [French Southern Territories (the)]
BMU                                         [Bermuda]
BVT                                   [Bouvet Island]
CCK                   [Cocos (Keeling) Islands (the)]
CXR                                [Christmas Island]
ESH                                  [Western Sahara]
GGY                                        [Guernsey]
GIB                                       [Gibraltar]
HMD               [Heard Island and McDonald Islands]
IMN                                     [Isle of Man]
IOT            [British Indian Ocean Territory (the)]
JEY                                          [Jersey]
LIE                                   [Liechtenstein]
MAC                                           [Macao]
MCO                                          [Monaco]
MNP                  [Northern Mariana Islands (the)]
NFK            

The dataset contains no missing ISO3 codes; however, several entries do not correspond to official ISO 3166 country codes. Instead, these represent aggregated regions or economic groups such as Advanced Economies, Emerging Economies, G20, G7, World, and Sub‑Saharan Africa.

At the same time, some small territories and microstates (e.g., Monaco, Liechtenstein, Gibraltar, Isle of Man) are included in the ISO standard list but are absent from the dataset. This structure shows that the dataset is designed to capture information not only at the country level but also at broader regional and economic bloc levels.
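
As an alternative to the two set differences used above, the same comparison can be sketched with an outer merge and pandas' `indicator` flag, which keeps both directions of the mismatch in one table. The frames below are toy stand-ins with made-up rows (the code `G7X` is hypothetical); only the column names `ISO3` and `Alpha-3 code` come from the notebook.

```python
import pandas as pd

# Toy stand-ins for df and iso_df (illustrative rows, not the real data)
energy = pd.DataFrame({"ISO3": ["FRA", "FRA", "G7X", "PAN"]})  # "G7X" is a hypothetical aggregate code
iso = pd.DataFrame({"Alpha-3 code": ["FRA", "PAN", "MCO"]})

# Outer-join the unique codes; the `_merge` column records which side each code came from
codes = pd.merge(
    energy[["ISO3"]].drop_duplicates(),
    iso[["Alpha-3 code"]].drop_duplicates(),
    left_on="ISO3",
    right_on="Alpha-3 code",
    how="outer",
    indicator=True,
)

# Codes present only in the energy data, and codes present only in the ISO list
only_in_energy = set(codes.loc[codes["_merge"] == "left_only", "ISO3"])
only_in_iso = set(codes.loc[codes["_merge"] == "right_only", "Alpha-3 code"])
print(only_in_energy, only_in_iso)
```

The merged `codes` table can also be saved or inspected directly, which is convenient when the mismatches need to be reviewed by hand.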

##### Attribute Details

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2063 entries, 1 to 3779
Data columns (total 37 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Country              2063 non-null   object 
 1   ISO2                 1791 non-null   object 
 2   ISO3                 2063 non-null   object 
 3   Technology           2063 non-null   object 
 4   Energy_Type          2063 non-null   object 
 5   Indicator            2063 non-null   object 
 6   Unit                 2063 non-null   object 
 7   Source               2063 non-null   object 
 8   CTS_Name             2063 non-null   object 
 9   CTS_Code             2063 non-null   object 
 10  CTS_Full_Descriptor  2063 non-null   object 
 11  F2000                1406 non-null   float64
 12  F2001                1437 non-null   float64
 13  F2002                1461 non-null   float64
 14  F2003                1473 non-null   float64
 15  F2004                1511 non-null   float6

In [14]:
# remove ISO2 column as it is redundant given ISO3
to_drop = ['ISO2']
to_drop = df.columns.intersection(to_drop)
df.drop(columns=to_drop, inplace=True)

In [15]:
# get technology for each energy type in a table
energy_types_map = \
    df[['Technology', 'Energy_Type']].drop_duplicates().reset_index(drop=True)
energy_types_map

Unnamed: 0,Technology,Energy_Type
0,Bioenergy,Total Renewable
1,Fossil fuels,Total Non-Renewable
2,Hydropower (excl. Pumped Storage),Total Renewable
3,Solar energy,Total Renewable
4,Wind energy,Total Renewable


The dataset currently includes five technologies mapped to two energy types. Other technologies, such as nuclear, tidal/wave, geothermal, and pumped‑storage hydro, are missing. As a result, for countries that use these sources (e.g., France with nuclear ([70% of its electricity](https://world-nuclear.org/Information-Library/Country-Profiles/countries-A-F/France)), [Iceland with geothermal](https://www.government.is/topics/business-and-industry/energy/), or the UK with tidal pilots), total electricity generation and installed capacity will be under‑represented in this dataset.

In [16]:
# type of unit for each indicator
indicator_units = \
    df[['Indicator', 'Unit']].drop_duplicates().reset_index(drop=True)
indicator_units

Unnamed: 0,Indicator,Unit
0,Electricity Generation,Gigawatt-hours (GWh)
1,Electricity Installed Capacity,Megawatt (MW)


In [17]:
# drop the source col since it is the same for all rows
to_drop = ['Source']
to_drop = df.columns.intersection(to_drop)
df.drop(columns=to_drop, inplace=True)

In [18]:
# check CTS code, name and description mappings
cts_map = \
    df[['CTS_Code', 'CTS_Name', 'CTS_Full_Descriptor', 'Indicator']].drop_duplicates().reset_index(drop=True)
cts_map

Unnamed: 0,CTS_Code,CTS_Name,CTS_Full_Descriptor,Indicator
0,ECNEG,Electricity Generation,"Environment, Climate Change, Mitigation, Renew...",Electricity Generation
1,ECNEC,Electricity Installed Capacity,"Environment, Climate Change, Mitigation, Renew...",Electricity Installed Capacity


In [19]:
# drop the CTS_Name, CTS_Full_Descriptor and CTS_Code columns since Indicator is sufficient
to_drop = ['CTS_Name', 'CTS_Full_Descriptor', 'CTS_Code']
to_drop = df.columns.intersection(to_drop)
df.drop(columns=to_drop, inplace=True)

##### Check Missing Data

In [20]:
# check missing values
df.isnull().sum()

Country           0
ISO3              0
Technology        0
Energy_Type       0
Indicator         0
Unit              0
F2000           657
F2001           626
F2002           602
F2003           590
F2004           552
F2005           514
F2006           492
F2007           447
F2008           411
F2009           365
F2010           313
F2011           270
F2012           223
F2013           166
F2014           126
F2015           103
F2016            76
F2017            61
F2018            52
F2019            39
F2020            24
F2021            23
F2022            24
F2023          1035
F2024          1035
Region_type       0
dtype: int64

We have missing data, but we need to check whether it is missing at random or systematically.
Do earlier years have a higher share of missing data than more recent years?
Or do certain countries have far more missing data than the rest?

In [21]:
# we only want to check missing values in these columns
year_cols = [col for col in df.columns if col.startswith("F")]

# calculate missing values count by country across all year columns
missing_by_country = (
    df.groupby("Country")[year_cols]
      .apply(lambda x: x.isnull().sum().sum())
)

# calculate total possible values per country
total_values_per_country = len(year_cols) * df.groupby("Country").size() # all years * num of rows per country
missing_pct_by_country = (missing_by_country / total_values_per_country) * 100 

# create a summary dataframe
missing_data_summary = pd.DataFrame({
    "Missing_Count": missing_by_country,
    "Total_Values": total_values_per_country,
    "Missing_Percent": missing_pct_by_country.round(2)
}).sort_values("Missing_Percent", ascending=False)

missing_data_summary

Country,Missing_Count,Total_Values,Missing_Percent
"South Sudan, Rep. of",52,100,52.00
"Sint Maarten, Kingdom of the Netherlands",26,50,52.00
Montenegro,102,200,51.00
Cayman Islands,74,150,49.33
Djibouti,74,150,49.33
...,...,...,...
United States,10,250,4.00
United Kingdom,10,250,4.00
Western Asia,10,250,4.00
Western Europe,10,250,4.00


For some countries around 50% of data is missing. How many such countries do we have in this dataset? 

In [22]:
missing_data_summary['Missing_Percent'].describe()

count    248.000000
mean      18.478065
std       12.655432
min        4.000000
25%        5.200000
50%       16.250000
75%       28.000000
max       52.000000
Name: Missing_Percent, dtype: float64

Across the 248 entities in our dataset (countries and aggregated regions), a quarter have roughly **5% or less missing data**. The median is **16%**, so half of the entities are reasonably well covered. At the upper end, a few are missing over **50% of values**, which may limit their usability for time‑series analysis.

However, the dataset is generally **usable**, especially for countries with <20% missing data, but analyses should account for incomplete coverage in certain regions and consider restricting to well‑reported years or applying suitable imputation methods.
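
As one hedged illustration of the imputation option mentioned above, the sketch below linearly interpolates gaps along each row of a toy wide-format frame. The values and row labels are made up. `limit_area="inside"` is a deliberate choice here so that only gaps with observed values on both sides are filled, while leading and trailing gaps stay missing. Whether linear interpolation is appropriate for a given series is an assumption that needs checking case by case.

```python
import numpy as np
import pandas as pd

# Toy wide-format rows (illustrative values only)
wide = pd.DataFrame(
    {
        "F2000": [np.nan, 10.0],
        "F2001": [2.0, np.nan],
        "F2002": [np.nan, 30.0],
        "F2003": [4.0, np.nan],
    },
    index=["series_a", "series_b"],
)

# Interpolate linearly along each row (axis=1), filling only interior gaps;
# leading and trailing gaps are left as NaN rather than extrapolated.
filled = wide.interpolate(method="linear", axis=1, limit_area="inside")
print(filled)
```

Leaving the edges untouched avoids inventing values for years before a series starts or after reporting stops.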

Next, we should check whether missingness varies over time and whether it is associated with particular geographical regions.

In [23]:
# next we will check the amount of missing data per year across all countries
yearly_missing = df[year_cols].isnull().sum()
yearly_total = df[year_cols].shape[0]
yearly_missing_pct = (yearly_missing / yearly_total) * 100
yearly_missing_summary = pd.DataFrame({
    "Missing_Count": yearly_missing,
    "Total_Values": yearly_total,
    "Missing_Percent": yearly_missing_pct.round(2)
})
yearly_missing_summary

Unnamed: 0,Missing_Count,Total_Values,Missing_Percent
F2000,657,2063,31.85
F2001,626,2063,30.34
F2002,602,2063,29.18
F2003,590,2063,28.6
F2004,552,2063,26.76
F2005,514,2063,24.92
F2006,492,2063,23.85
F2007,447,2063,21.67
F2008,411,2063,19.92
F2009,365,2063,17.69


Missing data is highest in the early years, with ~32% of values absent in 2000. Coverage steadily improves over time, dropping to about 5% by 2015 and reaching ~1% in 2020–2022. However, the most recent years (2023–2024) show a sharp spike, with ~50% of values missing (1,035 of 2,063 rows in each year). In the rows sampled earlier, these gaps fall on Electricity Generation rows while Installed Capacity rows are populated, which suggests that generation figures for these years had not yet been reported. Overall, the dataset is most reliable from 2010–2022, while earlier years and the latest two years require caution.
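
One way to check whether the 2023–2024 gap is tied to a particular indicator is to compute the missing share per year separately for each `Indicator`. The sketch below does this on a toy frame with made-up values. Only the column names follow the notebook, and `toy_year_cols` is a hypothetical stand-in for `year_cols`.

```python
import numpy as np
import pandas as pd

# Toy wide-format frame (illustrative values only)
toy = pd.DataFrame(
    {
        "Indicator": [
            "Electricity Generation",
            "Electricity Generation",
            "Electricity Installed Capacity",
            "Electricity Installed Capacity",
        ],
        "F2022": [1.0, 2.0, 3.0, 4.0],
        "F2023": [np.nan, np.nan, 5.0, 6.0],
    }
)
toy_year_cols = ["F2022", "F2023"]  # stand-in for year_cols

# Percentage of missing values per indicator (rows) and year (columns)
missing_pct = (
    toy[toy_year_cols].isnull().astype(int).groupby(toy["Indicator"]).mean() * 100
)
print(missing_pct)
```

If the real data behaves like this toy example, the spike would show up in one indicator's row only, which would point to a reporting lag rather than random gaps.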

In [24]:
# save df version 1 to parquet
# changes: dropped ISO2, Source, CTS_Name, CTS_Full_Descriptor, CTS_Code columns; added Region_type column

output_path = (DATA_DIR / 'eda_interim' / 'imf_renewable_energy_v1.parquet')
df.to_parquet(output_path, engine='pyarrow', index=False)

### Transform Dataset

The dataset contains numerical variables (electricity generation in GWh and installed capacity in MW) across years 2000–2024, stored in wide format with each year as a separate column (e.g., F2000, F2001). This structure is efficient for storage but hinders time-series analysis and visualization.

To address this, we will first transform the data to long format by melting the year columns into `Year` and `Value` columns. This enables easier aggregation, plotting, and modeling for trends and correlations.

Geospatial features (e.g., continents, subregions, polygons) are relevant only at the country level. To avoid redundancy in the long-format dataset, we will add these after transformation, creating a separate GeoDataFrame for countries. This keeps the long dataset lean for analysis while supporting map-based visualizations.
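
To illustrate why the long format simplifies aggregation, here is a minimal sketch on a toy frame. The column names match those the melt is meant to produce, and the values are made up. Yearly totals per energy type become a single `groupby` call.

```python
import pandas as pd

# Toy long-format frame (illustrative values only)
long_df = pd.DataFrame(
    {
        "Energy_Type": [
            "Total Renewable",
            "Total Renewable",
            "Total Non-Renewable",
            "Total Renewable",
        ],
        "Year": [2020, 2020, 2020, 2021],
        "Value": [10.0, 5.0, 20.0, 12.0],
    }
)

# Sum values per year and energy type
yearly_totals = long_df.groupby(["Year", "Energy_Type"], as_index=False)["Value"].sum()
print(yearly_totals)
```

On the real dataset, such a sum would only be meaningful after filtering to a single `Indicator` (so that GWh and MW are not mixed) and to `Region_type == 'Country'` (so that aggregated regions are not double-counted).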

In [25]:
df.columns

Index(['Country', 'ISO3', 'Technology', 'Energy_Type', 'Indicator', 'Unit',
       'F2000', 'F2001', 'F2002', 'F2003', 'F2004', 'F2005', 'F2006', 'F2007',
       'F2008', 'F2009', 'F2010', 'F2011', 'F2012', 'F2013', 'F2014', 'F2015',
       'F2016', 'F2017', 'F2018', 'F2019', 'F2020', 'F2021', 'F2022', 'F2023',
       'F2024', 'Region_type'],
      dtype='object')

In [26]:
# separate year columns from identifier columns before transforming to long format
year_cols = [col for col in df.columns if col.startswith("F")]
id_vars = [col for col in df.columns if col not in year_cols]
print(year_cols)
print(id_vars)

['F2000', 'F2001', 'F2002', 'F2003', 'F2004', 'F2005', 'F2006', 'F2007', 'F2008', 'F2009', 'F2010', 'F2011', 'F2012', 'F2013', 'F2014', 'F2015', 'F2016', 'F2017', 'F2018', 'F2019', 'F2020', 'F2021', 'F2022', 'F2023', 'F2024']
['Country', 'ISO3', 'Technology', 'Energy_Type', 'Indicator', 'Unit', 'Region_type']


In [27]:
df = df.melt(id_vars = id_vars, value_vars = year_cols, var_name="Year", value_name="Value")
df["Year"] = df["Year"].str.replace('F', '').astype(int)
df.sample(3)

Unnamed: 0,Country,ISO3,Technology,Energy_Type,Indicator,Unit,Region_type,Year,Value
13127,Greece,GRC,Bioenergy,Total Renewable,Electricity Generation,Gigawatt-hours (GWh),Country,2006,114.0
15486,Latvia,LVA,Fossil fuels,Total Non-Renewable,Electricity Installed Capacity,Megawatt (MW),Country,2007,561.0
44809,"Poland, Rep. of",POL,Solar energy,Total Renewable,Electricity Generation,Gigawatt-hours (GWh),Country,2021,3934.448


In [28]:
# rename hydropower label since it is too long
label_rename = {"Hydropower (excl. Pumped Storage)":"Hydropower"}
df['Technology'] = df['Technology'].replace(label_rename)

In [39]:
# final df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51575 entries, 0 to 51574
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Country      51575 non-null  object 
 1   ISO3         51575 non-null  object 
 2   Technology   51575 non-null  object 
 3   Energy_Type  51575 non-null  object 
 4   Indicator    51575 non-null  object 
 5   Unit         51575 non-null  object 
 6   Region_type  51575 non-null  object 
 7   Year         51575 non-null  int64  
 8   Value        42749 non-null  float64
dtypes: float64(1), int64(1), object(7)
memory usage: 3.5+ MB


In [None]:
df.sample(5)

Unnamed: 0,Country,ISO3,Technology,Energy_Type,Indicator,Unit,Region_type,Year,Value
37452,Cameroon,CMR,Solar energy,Total Renewable,Electricity Generation,Gigawatt-hours (GWh),Country,2018,16.27
25252,Denmark,DNK,Wind energy,Total Renewable,Electricity Generation,Gigawatt-hours (GWh),Country,2012,10269.94
6942,Greece,GRC,Hydropower,Total Renewable,Electricity Generation,Gigawatt-hours (GWh),Country,2003,4766.0
43252,"Venezuela, Rep. Bolivariana de",VEN,Solar energy,Total Renewable,Electricity Generation,Gigawatt-hours (GWh),Country,2020,5.72
3542,Philippines,PHL,Wind energy,Total Renewable,Electricity Installed Capacity,Megawatt (MW),Country,2001,


In [29]:
# save the long format dataset
output_path = (DATA_DIR / 'eda_interim' / 'imf_renewable_energy_v2.parquet')
df.to_parquet(output_path, engine="pyarrow", index=False)

### Add Geospatial Data

We have downloaded geospatial data for visualizing maps. We'll do the joins later after the energy dataset is clean and ready. For now, let's just inspect the geo data, select the columns we want and save it. 

In [30]:
import geopandas as gpd

# load shapefile from natural earth dataset (https://www.naturalearthdata.com/downloads/110m-cultural-vectors/)
shapefile_path = (natural_earth_folder / 'ne_110m_admin_0_countries.shp')
world = gpd.read_file(shapefile_path)

# find relevant columns
print([i for i in world.columns if "ISO" in i or "REGION" in i or "CONTINENT" in i])

['ISO_A2', 'ISO_A2_EH', 'ISO_A3', 'ISO_A3_EH', 'ISO_N3', 'ISO_N3_EH', 'ADM0_ISO', 'CONTINENT', 'REGION_UN', 'SUBREGION', 'REGION_WB', 'FCLASS_ISO']


In [31]:
df[df['Region_type']=='Country']['ISO3'].unique()[:10]

array(['AFG', 'ALB', 'DZA', 'ASM', 'AND', 'AGO', 'AIA', 'ATG', 'ARG',
       'ARM'], dtype=object)

In [32]:
geo_data = world[["ISO_A3", "NAME", "CONTINENT", "SUBREGION", "geometry"]].copy()
geo_data.sample(3)

Unnamed: 0,ISO_A3,NAME,CONTINENT,SUBREGION,geometry
33,PAN,Panama,North America,Central America,"POLYGON ((-77.35336 8.6705, -77.47472 8.52429,..."
88,OMN,Oman,Asia,Western Asia,"MULTIPOLYGON (((55.20834 22.70833, 55.23449 23..."
56,NGA,Nigeria,Africa,Western Africa,"POLYGON ((2.6917 6.25882, 2.74906 7.87073, 2.7..."


In [33]:
# save processed geo_data
output_path = (DATA_DIR / 'eda_interim' / "geo_data.geoparquet")
geo_data.to_parquet(output_path, index=False)
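
Before the deferred join, it may be worth checking how well the join keys line up. Natural Earth releases have been known to store the placeholder `"-99"` in `ISO_A3` for some entities, which is why the column list printed earlier also offers alternatives such as `ISO_A3_EH` and `ADM0_ISO`. The sketch below uses toy stand-ins with made-up rows. Which entities are affected in the downloaded file is an assumption to verify against the real `geo_data`.

```python
import pandas as pd

# Toy stand-ins (illustrative rows only, not read from the shapefile)
energy_codes = pd.Series(["FRA", "PAN", "NOR"], name="ISO3")
geo = pd.DataFrame(
    {
        "ISO_A3": ["-99", "PAN", "-99"],  # "-99" placement here is hypothetical
        "NAME": ["France", "Panama", "Norway"],
    }
)

# Rows whose ISO_A3 is the "-99" placeholder cannot be joined on that column
placeholder_rows = geo[geo["ISO_A3"] == "-99"]

# Energy codes that have no matching geometry key
unmatched = sorted(set(energy_codes) - set(geo["ISO_A3"]))
print(placeholder_rows["NAME"].tolist(), unmatched)
```

If placeholders do appear in the real file, switching the key column or patching the affected rows by name are both options to consider before merging.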

### Profile All Datasets with ydata-profiling

In [34]:
# data profiling for all datasets
from ydata_profiling import ProfileReport

INFO:visions.backends:Pandas backend loaded 2.3.3
INFO:visions.backends:Numpy backend loaded 2.3.5
INFO:visions.backends:Pyspark backend NOT loaded
INFO:visions.backends:Python backend loaded


In [35]:
# Profile the main energy dataset (use minimal=True for speed on large data)
# take an explicit copy so the profiler does not modify a filtered slice of df
country_df = df[df['Region_type'] == "Country"].copy()
profile_df = ProfileReport(country_df, title="IMF Renewable Energy Dataset Profile", minimal=True)
profile_df.to_file(DATA_DIR / "eda_interim" / "imf_energy_dataset_profile.html")
print("Profile saved to eda_interim/imf_energy_dataset_profile.html")

Profile saved to eda_interim/imf_energy_dataset_profile.html





In [36]:
# Profile ISO codes dataset
profile_iso = ProfileReport(iso_df, title="ISO Country Codes Profile")
profile_iso.to_file(DATA_DIR / "eda_interim" / "iso_codes_profile.html")
print("Profile saved to eda_interim/iso_codes_profile.html")

Profile saved to eda_interim/iso_codes_profile.html





### IMF Energy Data Initial Findings Summary 

1. **Dataset Overview**:
    - Loaded main dataset with 248 unique countries and aggregated regions.
    - Technologies: Bioenergy, Fossil fuels, Hydropower, Solar, Wind.
    - Indicators: Electricity Generation (GWh) and Installed Capacity (MW).
    - Years: 2000-2024.

2. **Data Cleaning**:
    - Dropped redundant columns: ISO2, Source, CTS_Name, CTS_Full_Descriptor, CTS_Code.
    - Added 'Region_type' column to distinguish countries from aggregated regions.
    - No missing ISO3 codes; some ISO3 codes represent regions (e.g., AETMP for Advanced Economies).

3. **Missing Data Analysis**:
    - Overall missing data: ~17% across the dataset.
    - By country: Median 16% missing; some countries >50% missing (e.g., South Sudan).
    - By year: High missing in early years (32% in 2000) and recent years (50% in 2023-2024); reliable from 2010-2022.
    - Missing data is not random; caution needed for time-series analysis.

4. **Data Transformation**:
    - Converted from wide to long format: Melted year columns (F2000-F2024) into 'Year' and 'Value' columns.
    - Resulting long dataset has 51,575 rows.

5. **Geospatial Data**:
    - Loaded the Natural Earth shapefile; the join with the energy data is deferred until that dataset is clean.
    - Kept the ISO_A3, NAME, CONTINENT, SUBREGION and geometry columns in a GeoDataFrame.
    - Saved as geo_data.geoparquet.

6. **Data Profiling**:
    - Generated ydata-profiling reports for the energy dataset (countries only) and the ISO codes.
    - Profiles saved as HTML files in the eda_interim folder under DATA_DIR.

7. **Saved Versions**:
    - Cleaned wide format: imf_renewable_energy_v1.parquet
    - Long format: imf_renewable_energy_v2.parquet
    - Geospatial data: geo_data.geoparquet