# Global Energy

## Introduction

In this project we will address a common problem: **missing data**. We will start from a global energy consumption dataset containing a variable with several missing values and we will build a **linear regression** model that will exploit the known values (of another variable) to predict the null ones.
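The idea can be sketched on a tiny invented table before touching the real data. The snippet below is only a minimal illustration under stated assumptions: the `toy` DataFrame, its `predictor` and `target` columns, and the numbers in it are hypothetical and are not taken from the energy dataset. It fits a regression on the rows where `target` is known and uses it to fill the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: `target` follows 2 * predictor + 1, with two gaps
toy = pd.DataFrame({
    "predictor": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "target": [3.0, 5.0, np.nan, 9.0, np.nan, 13.0],
})

# Boolean mask: True where the target value is known
known = toy["target"].notna()

# Fit the model only on the rows with a known target
model = LinearRegression()
model.fit(toy.loc[known, ["predictor"]], toy.loc[known, "target"])

# Predict the missing targets from their predictor values and fill them in
toy.loc[~known, "target"] = model.predict(toy.loc[~known, ["predictor"]])
print(toy)
```

Because the known points in this toy example lie exactly on a straight line, the filled values fall on the same line. With real data the fit is only approximate, so the filled values are estimates.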

![](https://raw.githubusercontent.com/boolean-data-analytics/img/main/on-demand/DAREG-global-energy.jpeg)

**[Linear regression](https://realpython.com/linear-regression-in-python/)** is one of the best-known statistical learning techniques, which makes it a perfect springboard into the vast and exciting world of machine learning.

The most common goal in statistics and, more generally, in data science, is to **predict**, i.e., to use the available `data` to build a `model` (a simplified representation of reality) that allows us to make a `prediction` about a future or not yet observed event.

> *It may seem like an abstract concept, but it's what we do every day before leaving home:*
1. we look out the window and see whether the sky is overcast → `data`
2. we compare the observed sky with our past experience → `model`
3. we evaluate whether it is likely to rain → `prediction`

> In general, the most common goal of data science is to predict something from data: we build a model (a simplification of reality) and use it to estimate a future or missing value.
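The same three steps can be written as a few lines of code. This is only a sketch: the numbers are invented for illustration, and the variable names `X_observed`, `y_observed`, and `X_new` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 1. data: four invented observations of an input and an outcome
X_observed = np.array([[0.0], [1.0], [2.0], [3.0]])
y_observed = np.array([1.0, 3.0, 5.0, 7.0])

# 2. model: a straight line fitted to the observations
model = LinearRegression().fit(X_observed, y_observed)

# 3. prediction: the outcome for an input we have not observed yet
X_new = np.array([[4.0]])
prediction = model.predict(X_new)[0]
print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}, prediction={prediction:.2f}")
```

The model here is nothing more than a slope and an intercept, which is what makes it a simplification of reality.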

## Dataset

First we load all the **libraries** that we will need in the analysis.

In [2]:
#--------------------
# Basic libraries for data analysis
import numpy as np
import pandas as pd

#--------------------
# Data Visualization
import seaborn as sns

# Apply the default seaborn theme and set the default figure size
sns.set_theme(rc={'figure.figsize': (11, 7)})

from matplotlib import pyplot as plt

#--------------------
# One of the most popular data science libraries
from sklearn.linear_model import LinearRegression


We will work with a dataset containing annual data (from 2000 to 2012) related to socio-economic indicators (such as GDP, CO2 emissions, energy consumption, life expectancy, etc.) from over 200 countries around the world.

Let's load the dataset and take a first look at it:

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/boolean-data-analytics/data/main/World%20Indicators%20S.csv')
df

| | Country/Region | Region | Year | Population Total | GDP | CO2 Emissions | Energy Usage | Life Expectancy | Infant Mortality Rate |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Algeria | Africa | 01/12/2000 | 31719449 | 5.479006e+10 | 87931.0 | 26998.0 | 69.0 | 0.034 |
| 1 | Angola | Africa | 01/12/2000 | 13924930 | 9.129595e+09 | 9542.0 | 7499.0 | 45.5 | 0.128 |
| 2 | Benin | Africa | 01/12/2000 | 6949366 | 2.359122e+09 | 1617.0 | 1983.0 | 55.0 | 0.090 |
| 3 | Botswana | Africa | 01/12/2000 | 1755375 | 5.788312e+09 | 4276.0 | 1836.0 | 50.5 | 0.054 |
| 4 | Burkina Faso | Africa | 01/12/2000 | 11607944 | 2.610959e+09 | 1041.0 | NaN | 50.5 | 0.096 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2686 | Belize | The Americas | 01/12/2012 | 324060 | 1.572500e+09 | NaN | NaN | 74.0 | 0.015 |
| 2687 | Haiti | The Americas | 01/12/2012 | 10173775 | 7.890217e+09 | NaN | NaN | 63.0 | 0.056 |
| 2688 | Bolivia | The Americas | 01/12/2012 | 10496285 | 2.703511e+10 | NaN | NaN | 67.0 | 0.032 |
| 2689 | Honduras | The Americas | 01/12/2012 | 7935846 | 1.856426e+10 | NaN | NaN | 73.5 | 0.020 |


GDP = Gross domestic product

## Preliminary analysis

We use the `.info()` method to explore the variables included in the DataFrame and any missing values:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2691 entries, 0 to 2690
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Country/Region         2691 non-null   object 
 1   Region                 2691 non-null   object 
 2   Year                   2691 non-null   object 
 3   Population Total       2691 non-null   int64  
 4   GDP                    2494 non-null   float64
 5   CO2 Emissions          2125 non-null   float64
 6   Energy Usage           1785 non-null   float64
 7   Life Expectancy        2555 non-null   float64
 8   Infant Mortality Rate  2444 non-null   float64
dtypes: float64(5), int64(1), object(3)
memory usage: 189.3+ KB


We note that all the numeric variables except `Population Total` contain missing values. In particular, `Energy Usage` is **the variable with the highest number of null values**: only 1785 of its 2691 entries are non-null, so 906 (about 34%) are missing. It is therefore an excellent candidate for our linear regression exercise.
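The missing values can also be counted directly instead of being read off the `.info()` output. The sketch below uses a hypothetical miniature frame named `sample` (invented values, with three of the real column names) so that it runs on its own. The same two calls, `isna().sum()` and `isna().mean()`, can be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for `df`
sample = pd.DataFrame({
    "Population Total": [100, 200, 300, 400],
    "GDP": [1.0, np.nan, 3.0, 4.0],
    "Energy Usage": [np.nan, 20.0, np.nan, 40.0],
})

# Count and percentage of missing values per column, most affected first
missing = pd.DataFrame({
    "missing": sample.isna().sum(),
    "share_%": (sample.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)
print(missing)
```

On the full dataset, the `missing` count for each column is simply 2691 minus the non-null count reported by `.info()`.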