# A Bayesian Network to model the influence of energy consumption on greenhouse gases in Italy

### by Lorenzo Mario Amorosa
#### *Fundamentals of Artificial Intelligence and Knowledge Representation (Mod. 3) - Alma Mater Studiorum Università di Bologna*

## Overview
<img src="images/gases-by-source-2020-caption.jpg" float="right" height="33%" width="33%" align="right" style="padding-left: 3.5%">It is by now well known that **global warming** is largely caused by greenhouse gases, which trap heat in the atmosphere. The three most common ones are carbon dioxide (**CO**<sub><b>2</b></sub>), methane (**CH**<sub><b>4</b></sub>) and nitrous oxide (**N**<sub><b>2</b></sub>**O**) [[1]](#first).

There are several sources of greenhouse gases (transportation, industry, commercial and residential activities, etc.). In this notebook I tackle the problem in a general way, considering the impact of energy consumption on greenhouse gas emissions, since energy is closely related to almost all of these sources. In particular, I model the **causal relations** between **energy consumption and greenhouse gases in Italy** using a **Bayesian network**. The aim is to learn a model that can provide **probabilistic results given** some input **evidence**. The causal relations and their probabilities are estimated by analyzing the **annual growth factors** of several indicators, using open datasets from the World Bank [[3]](#third). Annual growth is analyzed in order to capture how a variation in one parameter affects the others.

While developing this work I was inspired by a paper [[2]](#second) in which the authors produced estimates of energy investments in Turkey from **historical data**. Their work helped me identify interesting measures to investigate and to represent in the Bayesian network.

## Network definition, datasets and CPT learning

In general, we can expect that an increase in fossil fuel consumption drives the growth of greenhouse gas emissions, whereas a wider use of renewable energy reduces them. In [[2]](#second) it is suggested that the growth rates of population, urbanization, gross domestic product (GDP) and industrialization can all influence the overall energy use of a nation. Given these assumptions, the Bayesian network is defined as follows:
<img src="images/GreenhouseGasesBayesianNet.png" height="35%" width="35%" align="center">
The terms in the nodes have the following meaning:

<code style="background-color:white;">
Pop  = Population growth (annual %)
Urb  = Urban population growth (annual %)
GDP  = GDP per capita growth (annual %)
Ind  = Industry (including construction), value added (annual % growth)
EC   = Energy use (kg of oil equivalent per capita) - [annual growth %]
FFEC = Fossil fuel energy consumption (% of total) - [annual growth %]
REC  = Renewable energy consumption (% of total final energy consumption) - [annual growth %]
EI   = Energy imports, net (% of energy use) - [annual growth %]
CO2  = CO2 emissions (metric tons per capita) - [annual growth %]
CH4  = Methane emissions in energy sector (thousand metric tons of CO2 equivalent) - [annual growth %]
N2O  = Nitrous oxide emissions in energy sector (thousand metric tons of CO2 equivalent) - [annual growth %]
</code>

The proposed Bayesian network differs from the one in [[2]](#second) because here the greenhouse gases are kept distinct (CO<sub>2</sub>, CH<sub>4</sub>, N<sub>2</sub>O) and energy investments are not represented.

In the following cell the Bayesian network is created using the pgmpy library.

In [17]:
# Note: more recent pgmpy releases renamed this class (e.g. to BayesianNetwork).
from pgmpy.models import BayesianModel

# Each tuple is a directed edge (parent, child).
model = BayesianModel([('Pop', 'EC'),   ('Urb', 'EC'),   ('GDP', 'EC'), ('Ind', 'EC'),
                       ('EC', 'FFEC'),  ('EC', 'REC'),   ('EC', 'EI'),
                       ('REC', 'CO2'),  ('REC', 'CH4'),  ('REC', 'N2O'),
                       ('FFEC', 'CO2'), ('FFEC', 'CH4'), ('FFEC', 'N2O')])
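
To see which conditional probability tables this structure implies, the edge list can be inspected without pgmpy. The following is a minimal sketch that uses only the Python standard library (`graphlib`, available from Python 3.9); the names `edges`, `parents` and `order` are illustrative and are not part of the original notebook.

```python
from graphlib import TopologicalSorter

# Same edge list as in the cell above: (parent, child).
edges = [('Pop', 'EC'),   ('Urb', 'EC'),   ('GDP', 'EC'), ('Ind', 'EC'),
         ('EC', 'FFEC'),  ('EC', 'REC'),   ('EC', 'EI'),
         ('REC', 'CO2'),  ('REC', 'CH4'),  ('REC', 'N2O'),
         ('FFEC', 'CO2'), ('FFEC', 'CH4'), ('FFEC', 'N2O')]

# Map every node to the list of its parents (root nodes get an empty list).
parents = {}
for parent, child in edges:
    parents.setdefault(parent, [])
    parents.setdefault(child, []).append(parent)

# static_order() raises CycleError if the graph is not a DAG,
# and otherwise lists every node after all of its parents.
order = list(TopologicalSorter(parents).static_order())

# Print one term per node: the factors of the joint distribution.
for node in order:
    given = ", ".join(parents[node])
    print(f"P({node} | {given})" if given else f"P({node})")
```

Each printed term corresponds to one CPT to be learned, and the joint distribution of the network is the product of these terms. The four growth indicators (Pop, Urb, GDP, Ind) are root nodes, so their tables are unconditional.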

The data used to compute the conditional probability tables (CPTs) of the network are retrieved from [[3]](#third). Some indicators are given as absolute values rather than as annual growth, so they must be transformed accordingly (the labels of the transformed indicators all end with the string "- [annual growth %]").
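
The transformation in question is the year-over-year percentage change, 100 · (x_t − x_{t−1}) / |x_{t−1}|. As a minimal sketch on a made-up series (the values and the helper name `growth_pct` are hypothetical and are not World Bank data), it can be checked by hand as follows:

```python
import math

def growth_pct(values):
    """Year-over-year percentage change; NaN where it cannot be computed."""
    out = [math.nan]  # the first year has no predecessor
    for prev, cur in zip(values, values[1:]):
        # The zero check is an extra guard added in this sketch.
        if math.isnan(prev) or math.isnan(cur) or prev == 0:
            out.append(math.nan)
        else:
            out.append(100 * (cur - prev) / abs(prev))
    return out

# Hypothetical absolute values for five consecutive years, one of them missing.
series = [200.0, 210.0, 189.0, math.nan, 190.0]
print(growth_pct(series))  # [nan, 5.0, -10.0, nan, nan]
```

A missing value makes two growth rates undefined: the one for its own year and the one for the following year, which would need it as a baseline.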

In the following cell data are imported.

In [16]:
from pandas import read_csv, DataFrame
import numpy as np

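def annual_growth(row, years):
    """Replace the absolute yearly values in `row` with annual growth (%)."""
    min_year = years["min_year"]
    max_year = years["max_year"]
    row["Indicator Name"] = row["Indicator Name"] + " - [annual growth %]"
    # Go backwards so that row[year - 1] still holds the raw value when it is read.
    for year in range(max_year, min_year, -1):
        current, previous = float(row[str(year)]), float(row[str(year - 1)])
        if not np.isnan(current) and not np.isnan(previous):
            row[str(year)] = 100 * (current - previous) / abs(previous)
        else:
            # Growth is undefined when either value is missing.
            row[str(year)] = np.nan
    # The first year has no predecessor, so its growth is undefined.
    row[str(min_year)] = np.nan
    return row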

years = {"min_year" : 1960, "max_year" : 2019}
df_raw = read_csv("csv/italy-raw-data.csv")
df = DataFrame([row if "growth" in row["Indicator Name"] else annual_growth(row, years) for index, row in df_raw.iterrows()])
print("There are " + str(df.shape[0]) + " indicators in the dataframe.")
df.head()

There are 11 indicators in the dataframe.


| Unnamed: 0 | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | ITA | Population growth (annual %) | SP.POP.GROW | 1.993928 | 0.668383 | 0.676623 | 0.729553 | 0.822624 | 0.842109 | ... | 0.307591 | 0.171978 | 0.269541 | 1.159251 | 0.917504 | -0.096376 | -0.169884 | -0.149861 | -0.190064 | NaN |
| 1 | Italy | ITA | Urban population growth (annual %) | SP.URB.GROW | 2.836401 | 1.498807 | 1.506833 | 1.551287 | 1.636027 | 1.642485 | ... | 0.480439 | 0.343066 | 0.619579 | 1.587835 | 1.341371 | 0.325701 | 0.246127 | 0.262999 | 0.228198 | NaN |
| 2 | Italy | ITA | GDP per capita growth (annual %) | NY.GDP.PCAP.KD.ZG | NaN | 7.486419 | 5.487478 | 4.842052 | 1.955533 | 2.402046 | ... | 1.400915 | 0.534287 | -3.24206 | -2.972404 | -0.917814 | 0.875477 | 1.451875 | 1.868715 | 0.966058 | NaN |
| 3 | Italy | ITA | Industry (including construction), value added... | NV.IND.TOTL.KD.ZG | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.644522 | 0.059232 | -4.862533 | -3.051678 | -2.096562 | 0.50063 | 2.745482 | 3.163118 | 2.026691 | NaN |
| 4 | Italy | ITA | Energy use (kg of oil equivalent per capita) -... | EG.USE.PCAP.KG.OE | NaN | 12.0622 | 13.064053 | 11.188621 | 9.110076 | 7.753922 | ... | 2.113919 | -3.486796 | -4.211107 | -4.791839 | -6.396212 | 2.786129 | NaN | NaN | NaN | NaN |


The CPTs can be learned in pgmpy with either a Maximum Likelihood Estimator (MLE) or a Bayesian Estimator. The latter combines the data with a prior distribution over the parameters, whereas the former relies on the observed frequencies alone. Since no relevant prior is available, the MLE is used.
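
To make the difference between the two estimators concrete, the following sketch estimates a single CPT by hand, in plain Python rather than through pgmpy. The discretized states (`"up"`, `"down"`), the toy observations and the pseudocount value are all hypothetical. The Bayesian variant shown here uses a uniform Dirichlet prior (additive smoothing), which is only one possible choice of prior.

```python
from collections import Counter

def estimate_cpt(pairs, child_states, pseudocount=0.0):
    """Estimate P(child | parent) from (parent, child) observations.

    pseudocount=0 gives the maximum likelihood estimate (relative
    frequencies); pseudocount>0 adds a uniform Dirichlet prior.
    """
    counts = Counter(pairs)
    parent_states = sorted({p for p, _ in pairs})
    cpt = {}
    for p in parent_states:
        total = sum(counts[(p, c)] for c in child_states)
        denom = total + pseudocount * len(child_states)
        cpt[p] = {c: (counts[(p, c)] + pseudocount) / denom for c in child_states}
    return cpt

# Hypothetical yearly observations: (FFEC state, CO2 state).
obs = [("up", "up"), ("up", "up"), ("up", "up"),
       ("down", "down"), ("down", "up"), ("down", "down"), ("down", "down")]

mle = estimate_cpt(obs, ["down", "up"])
bayes = estimate_cpt(obs, ["down", "up"], pseudocount=1.0)
print(mle["up"])    # MLE assigns probability 0 to the unseen (up, down) pair
print(bayes["up"])  # the prior keeps that probability strictly positive
```

The MLE reproduces the observed frequencies exactly, so any parent and child combination that never occurs in the data gets probability zero. A prior avoids this, at the cost of having to justify the chosen pseudocounts.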


In [2]:
from pgmpy.estimators import MaximumLikelihoodEstimator



## References
<a name="first">[1]</a> [*U.S. Environmental Protection Agency - Greenhouse Gas Emissions*](https://www.epa.gov/ghgemissions/overview-greenhouse-gases)

<a name="second">[2]</a> [*Didem Cinar, Gulgun Kayakutlu - Scenario analysis using Bayesian networks: A case study in energy sector*](https://www.sciencedirect.com/science/article/pii/S0950705110000110)

<a name="third">[3]</a> [*World Bank Open Data*](https://data.worldbank.org/indicator)