# Europe Energy Transition Tracker
## Notebook 01 — Data Exploration

Data Source: Eurostat (nrg_bal_peh)
Goal: Understand structure and prepare electricity generation data.


In [1]:
from google.colab import files
uploaded = files.upload()


Saving eurostat_generation.csv to eurostat_generation.csv


In [2]:
import pandas as pd

df = pd.read_csv("eurostat_generation.csv")

df.head()




Unnamed: 0,DATAFLOW,LAST UPDATE,freq,nrg_bal,siec,unit,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG,CONF_STATUS
0,ESTAT:NRG_BAL_PEH(1.0),16/02/26 23:00:00,Annual,Gross electricity production,Batteries,Gigawatt-hour,Albania,2000,0.0,,
1,ESTAT:NRG_BAL_PEH(1.0),16/02/26 23:00:00,Annual,Gross electricity production,Batteries,Gigawatt-hour,Albania,2001,0.0,,
2,ESTAT:NRG_BAL_PEH(1.0),16/02/26 23:00:00,Annual,Gross electricity production,Batteries,Gigawatt-hour,Albania,2002,0.0,,
3,ESTAT:NRG_BAL_PEH(1.0),16/02/26 23:00:00,Annual,Gross electricity production,Batteries,Gigawatt-hour,Albania,2003,0.0,,
4,ESTAT:NRG_BAL_PEH(1.0),16/02/26 23:00:00,Annual,Gross electricity production,Batteries,Gigawatt-hour,Albania,2004,0.0,,
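In the preview above, several columns (DATAFLOW, LAST UPDATE, freq, nrg_bal, unit) show the same value in every visible row. The following is a minimal sketch for checking which columns are constant across a frame. It uses a small hypothetical sample in place of the real `df`, and the helper name `constant_columns` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical sample that mimics a few columns of the export shown above.
sample = pd.DataFrame({
    "freq": ["Annual", "Annual", "Annual"],
    "unit": ["Gigawatt-hour"] * 3,
    "geo": ["Albania", "Albania", "Austria"],
    "TIME_PERIOD": [2000, 2001, 2000],
    "OBS_VALUE": [0.0, 0.0, 12.5],
})


def constant_columns(frame: pd.DataFrame) -> list[str]:
    """Return the columns that hold a single distinct value (NaN counts as a value)."""
    counts = frame.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()


print(constant_columns(sample))
```

Columns reported this way carry no information for the analysis and could be dropped, provided the check is run on the full dataset rather than on the first five rows.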


What energy sources exist?

In [3]:
sorted(df["siec"].unique())

['Additives and oxygenates (excluding biofuel portion)',
 'Ambient heat (heat pumps)',
 'Anthracite',
 'Aviation gasoline',
 'Batteries',
 'Bioenergy',
 'Biogases',
 'Bitumen',
 'Blast furnace gas',
 'Blended bio jet kerosene',
 'Blended biodiesels',
 'Blended biogasoline',
 'Brown coal briquettes',
 'Charcoal',
 'Coal tar',
 'Coke oven coke',
 'Coke oven gas',
 'Coking coal',
 'Crude oil',
 'Electricity',
 'Ethane',
 'Fossil energy',
 'Fuel oil',
 'Gas coke',
 'Gas oil and diesel oil (excluding biofuel portion)',
 'Gas works gas',
 'Gasoline-type jet fuel',
 'Geothermal',
 'Heat',
 'Hydro',
 'Industrial waste (non-renewable)',
 'Kerosene-type jet fuel (excluding biofuel portion)',
 'Lignite',
 'Liquefied petroleum gases',
 'Lubricants',
 'Manufactured gases',
 'Motor gasoline (excluding biofuel portion)',
 'Naphtha',
 'Natural gas',
 'Natural gas liquids',
 'Non-renewable municipal waste',
 'Non-renewable waste',
 'Nuclear heat',
 'Oil and petroleum products (excluding biofuel portion

How many countries (distinct geo entries)?

In [6]:
len(df["geo"].unique())


41

In [7]:
sorted(df["geo"].unique())


['Albania',
 'Austria',
 'Belgium',
 'Bosnia and Herzegovina',
 'Bulgaria',
 'Croatia',
 'Cyprus',
 'Czechia',
 'Denmark',
 'Estonia',
 'European Union - 27 countries (from 2020)',
 'Finland',
 'France',
 'Georgia',
 'Germany',
 'Greece',
 'Hungary',
 'Iceland',
 'Ireland',
 'Italy',
 'Kosovo*',
 'Latvia',
 'Lithuania',
 'Luxembourg',
 'Malta',
 'Moldova',
 'Montenegro',
 'Netherlands',
 'North Macedonia',
 'Norway',
 'Poland',
 'Portugal',
 'Romania',
 'Serbia',
 'Slovakia',
 'Slovenia',
 'Spain',
 'Sweden',
 'Türkiye',
 'Ukraine',
 'United Kingdom']
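The list above contains 'European Union - 27 countries (from 2020)', which is an aggregate rather than a country, so the count of 41 includes it. A minimal sketch for separating aggregates from countries follows. The sample values are a hypothetical subset of the list, and matching on the "European Union" prefix is an assumption that should be checked against the full set of labels.

```python
import pandas as pd

# Hypothetical subset of the geo labels listed above.
geo = pd.Series([
    "Albania",
    "Austria",
    "European Union - 27 countries (from 2020)",
    "Austria",
])

# Assumption: aggregate rows can be recognised by this label prefix.
is_aggregate = geo.str.startswith("European Union")

countries = sorted(geo[~is_aggregate].unique())
print(len(countries), countries)
```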

What years?

In [8]:
df["TIME_PERIOD"].min(), df["TIME_PERIOD"].max()


(2000, 2024)

Any missing values?

In [9]:
df["OBS_VALUE"].isna().sum()


np.int64(16038)

In [10]:
df[df["OBS_VALUE"].isna()].head()

Unnamed: 0,DATAFLOW,LAST UPDATE,freq,nrg_bal,siec,unit,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG,CONF_STATUS
17322,ESTAT:NRG_BAL_PEH(1.0),16/02/26 23:00:00,Annual,Gross electricity production,Electricity,Gigawatt-hour,Albania,2000,,m,
17323,ESTAT:NRG_BAL_PEH(1.0),16/02/26 23:00:00,Annual,Gross electricity production,Electricity,Gigawatt-hour,Albania,2001,,m,
17324,ESTAT:NRG_BAL_PEH(1.0),16/02/26 23:00:00,Annual,Gross electricity production,Electricity,Gigawatt-hour,Albania,2002,,m,
17325,ESTAT:NRG_BAL_PEH(1.0),16/02/26 23:00:00,Annual,Gross electricity production,Electricity,Gigawatt-hour,Albania,2003,,m,
17326,ESTAT:NRG_BAL_PEH(1.0),16/02/26 23:00:00,Annual,Gross electricity production,Electricity,Gigawatt-hour,Albania,2004,,m,
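The rows shown above all carry the flag `m` in OBS_FLAG. To see whether that holds for every missing observation, the flags of the missing rows can be tabulated. This is a sketch on a small hypothetical sample, not on the real `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two missing observations flagged "m", two present.
sample = pd.DataFrame({
    "siec": ["Electricity", "Electricity", "Hydro", "Hydro"],
    "OBS_VALUE": [np.nan, np.nan, 4.2, 0.0],
    "OBS_FLAG": ["m", "m", np.nan, np.nan],
})

# Keep only rows without a value, then count their flags (NaN flags included).
missing = sample[sample["OBS_VALUE"].isna()]
flag_counts = missing["OBS_FLAG"].value_counts(dropna=False)
print(flag_counts)
```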


Which sources cause most of the missing values?

In [11]:
df[df["OBS_VALUE"].isna()]["siec"].value_counts()


siec,count
Electricity,941
Refinery feedstocks,941
Lubricants,941
Additives and oxygenates (excluding biofuel portion),941
Other hydrocarbons,941
Ethane,941
Aviation gasoline,941
Motor gasoline (excluding biofuel portion),941
Gasoline-type jet fuel,941
White spirit and special boiling point industrial spirits,941
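Every source in the table above has the same count, 941, which hints that these series may be missing in full rather than in part. The output alone does not confirm that. One way to check is to compute the share of missing values per source, sketched here on a hypothetical sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "Electricity" is always missing, "Hydro" never is.
sample = pd.DataFrame({
    "siec": ["Electricity", "Electricity", "Hydro", "Hydro"],
    "OBS_VALUE": [np.nan, np.nan, 4.2, 0.0],
})

# Mean of a boolean mask per group gives the fraction of missing rows.
missing_share = sample["OBS_VALUE"].isna().groupby(sample["siec"]).mean()

# Sources with no observed value at all.
fully_missing = missing_share[missing_share == 1.0].index.tolist()
print(missing_share)
print(fully_missing)
```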


In [13]:
# Work on a copy so the raw frame stays untouched
df_clean = df.copy()

# Treat missing observations as zero generation
df_clean["OBS_VALUE"] = df_clean["OBS_VALUE"].fillna(0)


In [14]:
# Verify that no missing values remain
df_clean["OBS_VALUE"].isna().sum()


np.int64(0)
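Filling with 0 makes a missing observation indistinguishable from a reported zero. A possible variant, sketched on a hypothetical sample, records which values were filled before overwriting them. The column name `was_missing` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing value, one real value, one reported zero.
sample = pd.DataFrame({"OBS_VALUE": [np.nan, 3.0, 0.0]})

clean = sample.copy()
# Remember which rows had no value before filling.
clean["was_missing"] = clean["OBS_VALUE"].isna()
clean["OBS_VALUE"] = clean["OBS_VALUE"].fillna(0)
print(clean)
```

With the marker column in place, later steps can still exclude filled rows if zeros turn out to distort averages or shares.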

In [16]:
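# Whitelist of siec labels to keep for the power-generation analysis.
# Labels must match the values in df_clean["siec"] exactly; isin() ignores any that do not.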

valid_sources = [
    "Hydro",
    "Wind",
    "Solar",
    "Nuclear heat",
    "Natural gas",
    "Solid fossil fuels",
    "Oil and petroleum products",
    "Primary solid biofuels",
    "Biogases",
    "Renewable municipal waste",
    "Non-renewable waste",
    "Geothermal",
    "Other fuels"
]

df_power = df_clean[df_clean["siec"].isin(valid_sources)].copy()
# Check which sources survived the filter
df_power["siec"].unique()



array(['Solid fossil fuels', 'Natural gas', 'Nuclear heat',
       'Primary solid biofuels', 'Biogases', 'Hydro', 'Geothermal',
       'Wind', 'Non-renewable waste', 'Renewable municipal waste'],
      dtype=object)
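The filtered output lists 10 labels, while the whitelist has 13: "Solar", "Oil and petroleum products" and "Other fuels" do not appear. `isin` drops labels with no exact match without raising an error, so a set difference makes the gap explicit. The sketch below uses hypothetical stand-ins for `df_clean["siec"]` and a shortened whitelist.

```python
import pandas as pd

# Hypothetical stand-in for df_clean["siec"].
available = pd.Series(["Hydro", "Wind", "Natural gas", "Batteries", "Hydro"])

# Shortened whitelist for illustration.
valid_sources = ["Hydro", "Wind", "Solar", "Natural gas", "Other fuels"]

# Whitelisted labels that never occur in the data.
unmatched = sorted(set(valid_sources) - set(available.unique()))
print(unmatched)
```

Any label reported here needs to be compared against the exact spelling in the dataset before the filter can be trusted.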

In [17]:
# Map siec labels to broader energy groups
energy_map = {
    "Solid fossil fuels": "Coal",
    "Natural gas": "Natural Gas",
    "Nuclear heat": "Nuclear",
    "Hydro": "Hydro",
    "Wind": "Wind",
    "Geothermal": "Geothermal",
    "Primary solid biofuels": "Bioenergy",
    "Biogases": "Bioenergy",
    "Renewable municipal waste": "Bioenergy",
    # Grouped with Bioenergy in this notebook despite the non-renewable label
    "Non-renewable waste": "Bioenergy"
}
df_power["energy_group"] = df_power["siec"].map(energy_map)
df_power["energy_group"].unique()


array(['Coal', 'Natural Gas', 'Nuclear', 'Bioenergy', 'Hydro',
       'Geothermal', 'Wind'], dtype=object)
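Because several siec labels map to the same group, rows usually need to be summed per country, year and group before they can be compared. `map` also returns NaN for any label missing from the dictionary, so it is worth checking for unmapped rows first. The sketch below does both on a hypothetical sample. The label "Unlisted source" is invented for illustration.

```python
import pandas as pd

# Hypothetical sample with one label that the mapping does not cover.
power = pd.DataFrame({
    "geo": ["Austria"] * 4,
    "TIME_PERIOD": [2020] * 4,
    "siec": ["Biogases", "Primary solid biofuels", "Hydro", "Unlisted source"],
    "OBS_VALUE": [1.0, 2.0, 40.0, 0.5],
})

energy_map = {
    "Biogases": "Bioenergy",
    "Primary solid biofuels": "Bioenergy",
    "Hydro": "Hydro",
}
power["energy_group"] = power["siec"].map(energy_map)

# Labels that received no group (map returned NaN).
unmapped = power.loc[power["energy_group"].isna(), "siec"].unique().tolist()

# Sum generation per country, year and group, skipping unmapped rows.
grouped = (
    power.dropna(subset=["energy_group"])
    .groupby(["geo", "TIME_PERIOD", "energy_group"], as_index=False)["OBS_VALUE"]
    .sum()
)
print(unmapped)
print(grouped)
```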