## Import Libs & Fetch Dataset

This data comes from [Ember](https://ember-energy.org/), an open data platform for worldwide energy data.

In [4]:
import os
import sys
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

import plotly.express as px
import plotly.graph_objects as go

In [5]:
PROJECT_ROOT = Path.cwd().parents[0]
SRC_DIR = PROJECT_ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

from my_project.paths import get_paths

paths = get_paths(PROJECT_ROOT)
DATA_DIR = paths['DATA_DIR']
RAW_DATA_DIR = paths['RAW_DATA_DIR']
PROCESSED_DATA_DIR = paths['PROCESSED_DATA_DIR']
LOGS_DIR = paths['LOGS_DIR']

from data.ember_api_client import EmberAPI
from utils.plots import save_static_img

In [6]:
# get data from Ember API 
api = EmberAPI()

# get yearly generation data
params = {"is_aggregate_series":False, "start_date":"2000",}
yearly_generation_json = api.fetch_and_cache(
    endpoint_name="electricity_generation_yearly",
    fetch_func=api.electricity_generation_yearly,
    params=params
)

# get monthly generation data
# params = {"is_aggregate_series":False, "start_date":"2000-01", }
# monthly_generation_json = api.fetch_and_cache(
#     endpoint_name="electricity_generation_monthly",
#     fetch_func=api.electricity_generation_monthly,
#     params=params
# )

# get monthly installed capacity data
params = {"is_aggregate_series":False, "start_date": "2000-01",}
monthly_capacity_json = api.fetch_and_cache(
    endpoint_name = "electricity_capacity_monthly",
    fetch_func=api.electricity_capacity_monthly,
    params=params
)

2026-01-01 23:18:46 | INFO | Initialized EmberAPI client with base URL: https://api.ember-energy.org
2026-01-01 23:18:46 | INFO | Loading cached data from /home/zephyr/workspace/Global_Energy_Trends/data/raw/electricity_generation_yearly_is_aggregate_series-False_start_date-2000.json
2026-01-01 23:18:46 | INFO | Loading cached data from /home/zephyr/workspace/Global_Energy_Trends/data/raw/electricity_capacity_monthly_is_aggregate_series-False_start_date-2000-01.json
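
The log lines above suggest that cached responses are stored under a file name built from the endpoint name and the request parameters. The implementation of `fetch_and_cache` is not shown in this notebook, so the following is only a minimal sketch of that pattern. The helper names `cache_path` and `fetch_and_cache_sketch`, the `fetch_func(**params)` call convention, and the fake fetcher are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def cache_path(cache_dir: Path, endpoint_name: str, params: dict) -> Path:
    """Build '<endpoint>_<key>-<value>_... .json' (pattern inferred from the log, assumed)."""
    suffix = "_".join(f"{key}-{value}" for key, value in params.items())
    return cache_dir / f"{endpoint_name}_{suffix}.json"


def fetch_and_cache_sketch(cache_dir, endpoint_name, fetch_func, params):
    """Return cached JSON if the file exists, otherwise fetch and store it."""
    path = cache_path(Path(cache_dir), endpoint_name, params)
    if path.exists():
        return json.loads(path.read_text())
    payload = fetch_func(**params)  # assumed call convention
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload))
    return payload


# Fake fetcher standing in for the real API call (hypothetical)
calls = []


def fake_fetch(**kwargs):
    calls.append(kwargs)
    return {"data": [{"entity": "World"}]}


with tempfile.TemporaryDirectory() as tmp:
    params = {"is_aggregate_series": False, "start_date": "2000"}
    first = fetch_and_cache_sketch(tmp, "electricity_generation_yearly", fake_fetch, params)
    second = fetch_and_cache_sketch(tmp, "electricity_generation_yearly", fake_fetch, params)
    cached_name = cache_path(Path(tmp), "electricity_generation_yearly", params).name

print(cached_name, len(calls))
```

In this sketch the second call is served from disk, so the fake fetcher runs only once.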


We use yearly rather than monthly generation data because the yearly data has broader coverage. As of December 2025, the monthly generation data covers only 88 economies/countries, whereas the yearly data covers over 210.

For capacity, only monthly data is available from the API. We will transform it and later merge it with IMF energy data.
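
The transformation itself is not shown at this point in the notebook. One plausible approach, sketched below on made-up values, is to keep the last available month of each year, since capacity is a stock rather than a flow. The function name `capacity_to_yearly`, the toy data, and the assumption that `date` is an ISO-style string starting with the year are all illustrative.

```python
import pandas as pd

# Toy stand-in for the monthly capacity frame; values are invented
toy_capacity_monthly = pd.DataFrame({
    "entity": ["A", "A", "A", "B", "B"],
    "date": ["2023-11", "2023-12", "2024-01", "2023-12", "2024-06"],
    "series": ["Solar"] * 5,
    "capacity_gw": [1.0, 1.2, 1.3, 4.0, 4.5],
})


def capacity_to_yearly(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the last available monthly reading per entity, series and year."""
    out = df.copy()
    # Assumes ISO-style date strings, so the first four characters are the year
    out["year"] = out["date"].str[:4].astype(int)
    # ISO date strings sort chronologically, so .last() picks the latest month
    out = out.sort_values("date")
    yearly = out.groupby(["entity", "series", "year"], as_index=False).last()
    return yearly.drop(columns="date")


toy_capacity_yearly = capacity_to_yearly(toy_capacity_monthly)
print(toy_capacity_yearly)
```

Taking the year-end value is a design choice; averaging over months would be an alternative if mid-year readings matter.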

In [7]:
# df_generation_monthly = pd.DataFrame(monthly_generation_json.get("data",[]))
df_capacity_monthly = pd.DataFrame(monthly_capacity_json.get("data",[]))
df_generation_yearly = pd.DataFrame(yearly_generation_json.get("data", []))

## Dataset Overview

### Electricity Generation Yearly

In [8]:
df_generation_yearly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52181 entries, 0 to 52180
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   entity                   52181 non-null  object 
 1   entity_code              49231 non-null  object 
 2   is_aggregate_entity      52181 non-null  bool   
 3   date                     52181 non-null  object 
 4   series                   52181 non-null  object 
 5   is_aggregate_series      52181 non-null  bool   
 6   generation_twh           52181 non-null  float64
 7   share_of_generation_pct  52181 non-null  float64
dtypes: bool(2), float64(2), object(4)
memory usage: 2.5+ MB


In [9]:
df_generation_yearly['is_aggregate_series'].value_counts()

is_aggregate_series
False    52181
Name: count, dtype: int64

In [10]:
# dropping this constant col since we have already requested unaggregated series during api call 
if "is_aggregate_series" in df_generation_yearly.columns:
    df_generation_yearly = df_generation_yearly.drop(columns="is_aggregate_series")

# rename date to year since only year data is there
# rename series to technology to be more descriptive
df_generation_yearly = df_generation_yearly\
    .rename(columns={"date":"year",
                     "series":"technology"})

In [11]:
df_generation_yearly.isnull().sum()

entity                        0
entity_code                2950
is_aggregate_entity           0
year                          0
technology                    0
generation_twh                0
share_of_generation_pct       0
dtype: int64

In [12]:
df_generation_yearly[df_generation_yearly['is_aggregate_entity']==True]['entity_code'].isnull().sum()

np.int64(2950)

No missing data. The only nulls are entity codes which are not defined for those aggregated regions such as EU, Asia, etc.
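
The claim that nulls in `entity_code` coincide exactly with aggregated regions can be checked directly by comparing two boolean masks. The sketch below uses a small invented frame with the same column names; on the real data, `toy` would be replaced by `df_generation_yearly`.

```python
import pandas as pd

# Invented rows that mimic the structure of the generation data
toy = pd.DataFrame({
    "entity": ["France", "EU", "Asia", "Japan"],
    "entity_code": ["FRA", None, None, "JPN"],
    "is_aggregate_entity": [False, True, True, False],
})

# True where the entity code is missing
code_missing = toy["entity_code"].isnull()

# The claim holds when both masks agree on every row
nulls_match_aggregates = bool((code_missing == toy["is_aggregate_entity"]).all())
print(nulls_match_aggregates)
```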

#### Split the dataset into country/economy and aggregated region

In [13]:
# split dataset into aggregated regions and individual country/economy dataset
# aggregated region dataset
df_generation_region = df_generation_yearly[df_generation_yearly['is_aggregate_entity'] == True].copy()
df_generation_country = df_generation_yearly[df_generation_yearly['is_aggregate_entity'] == False].copy()

# drop is_aggregate_entity column since it is constant in each dataset after splitting
df_generation_region = df_generation_region.drop(columns="is_aggregate_entity")
df_generation_country = df_generation_country.drop(columns="is_aggregate_entity")

print(df_generation_region.shape, df_generation_country.shape)

(2950, 6) (49231, 6)


In [14]:
print("Num Countries/Economies: ", df_generation_country['entity'].nunique())
print("Num Aggregated Regions: ", df_generation_region['entity'].nunique())

Num Countries/Economies:  210
Num Aggregated Regions:  13


Our dataset covers 13 aggregated regions and 210 countries/economies.

##### Country/Economy-wise Variables

In [15]:
df_generation_country.info() # no missing data

<class 'pandas.core.frame.DataFrame'>
Index: 49231 entries, 0 to 52170
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   entity                   49231 non-null  object 
 1   entity_code              49231 non-null  object 
 2   year                     49231 non-null  object 
 3   technology               49231 non-null  object 
 4   generation_twh           49231 non-null  float64
 5   share_of_generation_pct  49231 non-null  float64
dtypes: float64(2), object(4)
memory usage: 2.6+ MB


In [16]:
# do we have all years for each country?
df_generation_country.groupby("entity").agg(
    year_count = pd.NamedAgg(column = "year", aggfunc=pd.Series.nunique),
    min_year = pd.NamedAgg(column = "year", aggfunc=pd.Series.min),
    max_year = pd.NamedAgg(column = "year", aggfunc=pd.Series.max)
).sort_values(by="year_count")

entity,year_count,min_year,max_year
South Sudan,12,2012,2023
Montenegro,20,2005,2024
Cabo Verde,23,2000,2022
Bahamas,23,2000,2022
Falkland Islands (Malvinas),23,2000,2022
...,...,...,...
United Kingdom,25,2000,2024
United Arab Emirates,25,2000,2024
Türkiye,25,2000,2024
Viet Nam,25,2000,2024


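Most countries have generation data starting in 2000. The exceptions visible in the table above are South Sudan (from 2012) and Montenegro (from 2005). Some entities, such as Cabo Verde, Bahamas and the Falkland Islands, also end in 2022 rather than 2024.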
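
Counting distinct years does not show which years are absent. A small helper that lists the gaps per entity can make the coverage check more explicit. This is a sketch on invented data; `missing_years` is a hypothetical helper, and it assumes `year` holds strings such as `"2000"`, as the `object` dtype above suggests.

```python
import pandas as pd

# Invented rows; entity Y lacks the first year
toy = pd.DataFrame({
    "entity": ["X"] * 3 + ["Y"] * 2,
    "year": ["2000", "2001", "2002", "2001", "2002"],
})

# Toy range; for the real data this would be range(2000, 2025)
expected_years = set(range(2000, 2003))


def missing_years(df: pd.DataFrame, expected: set) -> dict:
    """Map each entity to the sorted list of expected years it lacks."""
    observed = {
        entity: set(group["year"].astype(int))
        for entity, group in df.groupby("entity")
    }
    gaps = {entity: sorted(expected - years) for entity, years in observed.items()}
    # Keep only entities that actually have gaps
    return {entity: gap for entity, gap in gaps.items() if gap}


gaps_by_entity = missing_years(toy, expected_years)
print(gaps_by_entity)
```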

In [17]:
df_generation_country['technology'].value_counts().sort_values()

technology
Other renewables    4459
Nuclear             4712
Wind                4905
Gas                 4948
Hydro               4981
Coal                4993
Bioenergy           5032
Solar               5054
Net imports         5073
Other fossil        5074
Name: count, dtype: int64

The dataset has 9 categories of electricity generation, plus a separate Net imports category.

In [18]:
# what is the earliest year recorded for each energy category?
df_generation_country.groupby("technology").agg(
    min_year = pd.NamedAgg(column="year", aggfunc="min")
)

technology,min_year
Bioenergy,2000
Coal,2000
Gas,2000
Hydro,2000
Net imports,2000
Nuclear,2000
Other fossil,2000
Other renewables,2000
Solar,2000
Wind,2000


##### Aggregated regions

In [19]:
df_generation_region.entity.value_counts()

entity
EU                             250
Middle East                    225
Latin America and Caribbean    225
North America                  225
World                          225
Oceania                        225
OECD                           225
ASEAN                          225
Africa                         225
Asia                           225
G20                            225
G7                             225
Europe                         225
Name: count, dtype: int64

In [20]:
# do we have all years for each region?
df_generation_region.groupby("entity").agg(
    year_count = pd.NamedAgg(column = "year", aggfunc=pd.Series.nunique),
    min_year = pd.NamedAgg(column = "year", aggfunc=pd.Series.min),
    max_year = pd.NamedAgg(column = "year", aggfunc=pd.Series.max)
).sort_values(by="year_count")

entity,year_count,min_year,max_year
ASEAN,25,2000,2024
Africa,25,2000,2024
Asia,25,2000,2024
EU,25,2000,2024
Europe,25,2000,2024
G20,25,2000,2024
G7,25,2000,2024
Latin America and Caribbean,25,2000,2024
Middle East,25,2000,2024
North America,25,2000,2024


We have electricity generation data from 2000 to 2024 for all aggregated regions.
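
Year coverage per region can hide gaps at the technology level, for example a technology that is reported for only one region. A region-by-technology table of distinct year counts makes such gaps visible. The sketch below uses invented rows; on the real data, `toy_region` would be replaced by `df_generation_region`.

```python
import pandas as pd

# Invented rows: only EU reports Net imports in this toy example
toy_region = pd.DataFrame({
    "entity": ["EU", "EU", "EU", "EU", "Asia", "Asia"],
    "technology": ["Solar", "Solar", "Net imports", "Net imports", "Solar", "Solar"],
    "year": ["2000", "2001", "2000", "2001", "2000", "2001"],
})

# Rows are regions, columns are technologies, cells count distinct years;
# combinations that never occur are filled with 0
coverage = (
    toy_region.groupby(["entity", "technology"])["year"]
    .nunique()
    .unstack(fill_value=0)
)
print(coverage)
```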

In [21]:
df_generation_region['technology'].value_counts().sort_values()

technology
Net imports          25
Bioenergy           325
Gas                 325
Coal                325
Nuclear             325
Other fossil        325
Other renewables    325
Hydro               325
Solar               325
Wind                325
Name: count, dtype: int64

In [23]:
# what are the earliest and latest years recorded for each energy category?
df_generation_region.groupby("technology").agg(
    min_year = pd.NamedAgg(column="year", aggfunc="min"),
    max_year = pd.NamedAgg(column="year", aggfunc="max"),
)

technology,min_year,max_year
Bioenergy,2000,2024
Coal,2000,2024
Gas,2000,2024
Hydro,2000,2024
Net imports,2000,2024
Nuclear,2000,2024
Other fossil,2000,2024
Other renewables,2000,2024
Solar,2000,2024
Wind,2000,2024


In [29]:
# filter for world electricity generation trends only
df_generation_world = df_generation_region[df_generation_region['entity']=='World'].copy()
df_generation_world.sample(4)

index,entity,entity_code,year,technology,generation_twh,share_of_generation_pct
8311,World,,2003,Solar,2.1,0.01
47077,World,,2021,Wind,1856.7,6.57
8309,World,,2003,Other fossil,1295.92,7.79
21130,World,,2009,Gas,4382.35,21.97


In [28]:
total_yearly_gen = df_generation_world.groupby(['technology', 'year']).agg(
    total_yearly_gen = pd.NamedAgg(column="generation_twh", aggfunc="sum") 
).reset_index()

# Plot electricity generation over time using the pre-aggregated World series
fig = px.area(total_yearly_gen, x='year', y='total_yearly_gen', color='technology',
                  title='Global Electricity Generation by Technology Over Time (World Aggregate)')
fig.update_layout(xaxis_title='Year', yaxis_title='Electricity Generation (TWh)')
fig.show() # for interactive image
# save_static_img(fig, DATA_DIR / "eda_interim" / "figures" / "global_generation_share_over_time.png") # since github doesn't render px plots

In [30]:
# compare with the total summed over all countries
total_yearly_gen = df_generation_country.groupby(['technology', 'year']).agg(
    total_yearly_gen = pd.NamedAgg(column="generation_twh", aggfunc="sum") 
).reset_index()

# Plot electricity generation over time based on the yearly sum over all countries
fig = px.area(total_yearly_gen, x='year', y='total_yearly_gen', color='technology',
                  title='Global Electricity Generation by Technology Over Time (Sum of Countries)')
fig.update_layout(xaxis_title='Year', yaxis_title='Electricity Generation (TWh)')
fig.show() # for interactive image
# save_static_img(fig, DATA_DIR / "eda_interim" / "figures" / "global_generation_share_over_time.png") # since github doesn't render px plots

#### Ember Data Initial Findings Summary 

1. Dataset Overview <br>
    Data sourced from Ember, an open platform for worldwide energy data. Fetched yearly electricity generation data (2000-2024) for 210 countries/economies and 13 aggregated regions. Monthly generation data was considered but discarded due to limited coverage (only 88 economies). Monthly installed capacity data (2000-2025) was also fetched for future merging with IMF energy data.


2. Electricity Generation Yearly <br>
    
    Attributes include entity, entity_code, is_aggregate_entity, year, technology, generation_twh, and share_of_generation_pct. There are no missing values in key columns; entity_code is null only for aggregated regions (e.g., EU, Asia). Generation is reported for 9 categories (Bioenergy, Coal, Gas, Hydro, Nuclear, Other fossil, Other renewables, Solar, Wind), plus Net imports.

- **Split Datasets**: Country/Economy dataset: 49,231 rows, covering 210 unique entities. Aggregated Region dataset: 2,950 rows, covering 13 unique regions.

- **Temporal Coverage**: Most countries have data from 2000 onward; South Sudan (from 2012) and Montenegro (from 2005) start later. All aggregated regions have complete data from 2000 to 2024.

We compared the pre-aggregated world-level data with the sum of electricity generation across all countries in the dataset, and the results closely match. Next, we'll move on to checking the capacity data.
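
The comparison above was made visually from the two area charts. To put a number on it, the two series can be merged on technology and year and a relative difference computed. The sketch below uses invented values; on the real data, `toy_world` and `toy_country` would be replaced by `df_generation_world` and `df_generation_country`. The column names `country_sum_twh` and `rel_diff_pct` are introduced here for illustration.

```python
import pandas as pd

# Invented values standing in for the World series and the country-level data
toy_world = pd.DataFrame({
    "technology": ["Solar", "Wind"],
    "year": ["2020", "2020"],
    "generation_twh": [100.0, 200.0],
})
toy_country = pd.DataFrame({
    "entity": ["A", "B", "A", "B"],
    "technology": ["Solar", "Solar", "Wind", "Wind"],
    "year": ["2020"] * 4,
    "generation_twh": [60.0, 39.0, 150.0, 50.0],
})

# Sum country-level generation per technology and year
country_sum = (
    toy_country.groupby(["technology", "year"], as_index=False)["generation_twh"]
    .sum()
    .rename(columns={"generation_twh": "country_sum_twh"})
)

# Outer merge keeps combinations that exist on only one side (they show up as NaN)
comparison = toy_world.merge(country_sum, on=["technology", "year"], how="outer")

# Relative difference of the country sum against the World value, in percent
comparison["rel_diff_pct"] = (
    (comparison["country_sum_twh"] - comparison["generation_twh"])
    / comparison["generation_twh"] * 100
)
print(comparison)
```

Rows where the World value is zero or missing produce infinite or NaN differences and would need separate handling.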

### Electricity Capacity Yearly

In [31]:
df_capacity_monthly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6679 entries, 0 to 6678
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   entity                 6679 non-null   object 
 1   entity_code            6679 non-null   object 
 2   is_aggregate_entity    6679 non-null   bool   
 3   date                   6679 non-null   object 
 4   series                 6679 non-null   object 
 5   is_aggregate_series    6679 non-null   bool   
 6   capacity_gw            6679 non-null   float64
 7   capacity_w_per_capita  6679 non-null   float64
dtypes: bool(2), float64(2), object(4)
memory usage: 326.3+ KB
