# OBJECTIVE:
Predict carbon emissions and energy consumption from country-level sustainable-energy indicators.

# KEY FEATURES:

*   ***Entity:*** The name of the country or region for which the data is reported.

*   ***Year:*** The year for which the data is reported, ranging from 2000 to 2020.

*   ***Access to electricity (% of population):*** The percentage of population with access to electricity.

*   ***Access to clean fuels for cooking (% of population):*** The percentage of the population with primary reliance on clean fuels.

*   ***Renewable-electricity-generating-capacity-per-capita:*** Installed renewable energy capacity per person.

*   ***Financial flows to developing countries (US $):*** Aid and assistance from developed countries for clean energy projects.

*   ***Renewable energy share in total final energy consumption (%):*** Percentage of renewable energy in final energy consumption.

*   ***Electricity from fossil fuels (TWh):*** Electricity generated from fossil fuels (coal, oil, gas) in terawatt-hours.

*   ***Electricity from nuclear (TWh):*** Electricity generated from nuclear power in terawatt-hours.

*   ***Electricity from renewables (TWh):*** Electricity generated from renewable sources (hydro, solar, wind, etc.) in terawatt-hours.

*   ***Low-carbon electricity (% electricity):*** Percentage of electricity from low-carbon sources (nuclear and renewables).

*   ***Primary energy consumption per capita (kWh/person):*** Energy consumption per person in kilowatt-hours.

*   ***Energy intensity level of primary energy (MJ/$2017 PPP GDP):*** Energy use per unit of GDP at purchasing power parity.

*   ***Value_co2_emissions_kt_by_country:*** Total carbon dioxide emissions of the country in kilotons (kt).

*   ***Renewables (% equivalent primary energy):*** Percentage of equivalent primary energy that is derived from renewable sources.

*   ***GDP growth (annual %):*** Annual GDP growth rate based on constant local currency.

*   ***GDP per capita:*** Gross domestic product per person.

*   ***Density (P/Km2):*** Population density in persons per square kilometer.

*   ***Land Area (Km2):*** Total land area in square kilometers.

*   ***Latitude:*** Latitude of the country's centroid in decimal degrees.

*   ***Longitude:*** Longitude of the country's centroid in decimal degrees.
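
Some of these features are related by simple arithmetic, which gives a quick way to sanity-check the data. The sketch below recomputes the low-carbon share from the three generation columns, assuming that total generation is the sum of fossil, nuclear, and renewable electricity (an assumption made here, not something the feature list states). The input values are those of the first row of the dataset (Afghanistan, 2000), and `low_carbon_share` is an illustrative helper name.

```python
def low_carbon_share(fossil_twh, nuclear_twh, renewables_twh):
    """Return low-carbon electricity as a percentage of total generation."""
    total = fossil_twh + nuclear_twh + renewables_twh
    if total == 0:
        return float("nan")  # share is undefined when nothing is generated
    return 100 * (nuclear_twh + renewables_twh) / total


# Afghanistan, 2000: fossil 0.16 TWh, nuclear 0.0 TWh, renewables 0.31 TWh
share = low_carbon_share(fossil_twh=0.16, nuclear_twh=0.0, renewables_twh=0.31)
print(round(share, 3))
```

The result agrees with the reported `Low-carbon electricity (% electricity)` value of 65.95744 for that row, which suggests the assumed definition is at least consistent with the data.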

In [1]:
# LIBRARIES:
import pandas as pd
import numpy as np

# DISPLAYING ALL COLUMNS OF THE DATA:
pd.set_option("display.max_columns", None)

In [3]:
# IMPORTING DATA:
df = pd.read_csv(r"/Users/alberto/Downloads/PROJECTS/Machine Learning on Cloud/ML Assignment/global-data-on-sustainable-energy.csv")
df.head(10)

Unnamed: 0,Entity,Year,Access to electricity (% of population),Access to clean fuels for cooking,Renewable-electricity-generating-capacity-per-capita,Financial flows to developing countries (US $),Renewable energy share in the total final energy consumption (%),Electricity from fossil fuels (TWh),Electricity from nuclear (TWh),Electricity from renewables (TWh),Low-carbon electricity (% electricity),Primary energy consumption per capita (kWh/person),Energy intensity level of primary energy (MJ/$2017 PPP GDP),Value_co2_emissions_kt_by_country,Renewables (% equivalent primary energy),gdp_growth,gdp_per_capita,Density\n(P/Km2),Land Area(Km2),Latitude,Longitude
0,Afghanistan,2000,1.613591,6.2,9.22,20000.0,44.99,0.16,0.0,0.31,65.95744,302.59482,1.64,760.0,,,,60,652230,33.93911,67.709953
1,Afghanistan,2001,4.074574,7.2,8.86,130000.0,45.6,0.09,0.0,0.5,84.745766,236.89185,1.74,730.0,,,,60,652230,33.93911,67.709953
2,Afghanistan,2002,9.409158,8.2,8.47,3950000.0,37.83,0.13,0.0,0.56,81.159424,210.86215,1.4,1029.999971,,,179.426579,60,652230,33.93911,67.709953
3,Afghanistan,2003,14.738506,9.5,8.09,25970000.0,36.66,0.31,0.0,0.63,67.02128,229.96822,1.4,1220.000029,,8.832278,190.683814,60,652230,33.93911,67.709953
4,Afghanistan,2004,20.064968,10.9,7.75,,44.24,0.33,0.0,0.56,62.92135,204.23125,1.2,1029.999971,,1.414118,211.382074,60,652230,33.93911,67.709953
5,Afghanistan,2005,25.390894,12.2,7.51,9830000.0,33.88,0.34,0.0,0.59,63.440857,252.06912,1.41,1549.999952,,11.229715,242.031313,60,652230,33.93911,67.709953
6,Afghanistan,2006,30.71869,13.85,7.4,10620000.0,31.89,0.2,0.0,0.64,76.190475,304.4209,1.5,1759.99999,,5.357403,263.733602,60,652230,33.93911,67.709953
7,Afghanistan,2007,36.05101,15.3,7.25,15750000.0,28.78,0.2,0.0,0.75,78.94737,354.2799,1.53,1769.999981,,13.82632,359.693158,60,652230,33.93911,67.709953
8,Afghanistan,2008,42.4,16.7,7.49,16170000.0,21.17,0.19,0.0,0.54,73.9726,607.8335,1.94,3559.999943,,3.924984,364.663542,60,652230,33.93911,67.709953
9,Afghanistan,2009,46.74005,18.4,7.5,9960000.0,16.53,0.16,0.0,0.78,82.97872,975.04816,2.25,4880.000114,,21.390528,437.26874,60,652230,33.93911,67.709953


In [9]:
# SUMMARY STATISTICS OF THE NUMERIC COLUMNS:
df.describe()

Unnamed: 0,Year,Access to electricity (% of population),Access to clean fuels for cooking,Renewable-electricity-generating-capacity-per-capita,Financial flows to developing countries (US $),Renewable energy share in the total final energy consumption (%),Electricity from fossil fuels (TWh),Electricity from nuclear (TWh),Electricity from renewables (TWh),Low-carbon electricity (% electricity),Primary energy consumption per capita (kWh/person),Energy intensity level of primary energy (MJ/$2017 PPP GDP),Value_co2_emissions_kt_by_country,Renewables (% equivalent primary energy),gdp_growth,gdp_per_capita,Latitude,Longitude
count,3649.0,3639.0,3480.0,2718.0,1560.0,3455.0,3628.0,3523.0,3628.0,3607.0,3649.0,3442.0,3221.0,1512.0,3332.0,3367.0,3648.0,3648.0
mean,2010.038367,78.933702,63.255287,113.137498,94224000.0,32.638165,70.365003,13.45019,23.96801,36.801182,25743.981745,5.307345,159866.5,11.986707,3.44161,13283.774348,18.246388,14.822695
std,6.054228,30.275541,39.043658,244.167256,298154400.0,29.894901,348.051866,73.006623,104.431085,34.314884,34773.221366,3.53202,773661.1,14.994644,5.68672,19709.866716,24.159232,66.348148
min,2000.0,1.252269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,10.0,0.0,-62.07592,111.927225,-40.900557,-175.198242
25%,2005.0,59.80089,23.175,3.54,260000.0,6.515,0.29,0.0,0.04,2.877847,3116.7373,3.17,2020.0,2.137095,1.383302,1337.813437,3.202778,-11.779889
50%,2010.0,98.36157,83.15,32.91,5665000.0,23.3,2.97,0.0,1.47,27.865068,13120.57,4.3,10500.0,6.290766,3.559855,4578.633208,17.189877,19.145136
75%,2015.0,100.0,100.0,112.21,55347500.0,55.245,26.8375,0.0,9.6,64.403792,33892.78,6.0275,60580.0,16.841638,5.830099,15768.615365,38.969719,46.199616
max,2020.0,100.0,100.0,3060.19,5202310000.0,96.04,5184.13,809.41,2184.94,100.00001,262585.7,32.57,10707220.0,86.836586,123.139555,123514.1967,64.963051,178.065032
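
`Density\n(P/Km2)` and `Land Area(Km2)` do not appear in the summary above, which usually means pandas read them as non-numeric (`object`) columns. The notebook does not show the cause; one common reason is numbers stored as text with thousands separators. The sketch below assumes that cause and uses a small made-up frame rather than the real file. `to_numeric_column` is an illustrative helper name, not part of the original analysis.

```python
import pandas as pd


def to_numeric_column(series):
    """Strip thousands separators and coerce to numbers (unparseable -> NaN)."""
    cleaned = series.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


# Toy data standing in for the real column (values are illustrative only)
toy = pd.DataFrame({"Density\n(P/Km2)": ["60", "1,265", "n/a"]})
toy["Density\n(P/Km2)"] = to_numeric_column(toy["Density\n(P/Km2)"])
print(toy.dtypes)
```

Using `errors="coerce"` turns anything that still cannot be parsed into `NaN`, so it is worth recounting missing values after a conversion like this.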


In [4]:
# MISSING DATA ON THE DATASET (TOTAL):
sum_missing = df.isna().sum().sort_values(ascending = False)
percentage_missing = (df.isna().mean()*100).sort_values(ascending = False)

names = sum_missing.index.to_list()                                     # Storing names
sum_values = sum_missing.to_list()                                      # Getting count of missing values
perc_values = np.around(percentage_missing.to_list(), 3)                # Getting percentage of missing values
df_missing_values = pd.DataFrame({"NAMES" : names,                      # Presenting the data as a dataframe
                                  "VALUE COUNT" : sum_values,
                                  "PERCENTAGE (%)" : perc_values})
df_missing_values

Unnamed: 0,NAMES,VALUE COUNT,PERCENTAGE (%)
0,Renewables (% equivalent primary energy),2137,58.564
1,Financial flows to developing countries (US $),2089,57.249
2,Renewable-electricity-generating-capacity-per-...,931,25.514
3,Value_co2_emissions_kt_by_country,428,11.729
4,gdp_growth,317,8.687
5,gdp_per_capita,282,7.728
6,Energy intensity level of primary energy (MJ/$...,207,5.673
7,Renewable energy share in the total final ener...,194,5.317
8,Access to clean fuels for cooking,169,4.631
9,Electricity from nuclear (TWh),126,3.453


The first three columns in the table are each missing more than 20% of their values (about 58.6%, 57.2%, and 25.5%), which makes reliable imputation difficult. A practical approach is to drop these columns and proceed with the remaining ones.

In [17]:
# TAKING THE COLUMNS THAT HAVE A LOT OF MISSING VALUES:
names_todelete = df_missing_values[df_missing_values["PERCENTAGE (%)"] > 20]["NAMES"].tolist()
names_todelete

['Renewables (% equivalent primary energy)',
 'Financial flows to developing countries (US $)',
 'Renewable-electricity-generating-capacity-per-capita']
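
The discard step described above can be sketched as follows. This is a minimal example on a toy frame, assuming the same 20% threshold; the column names and values are invented for illustration, and `threshold`, `cols_to_drop`, and `toy_reduced` are names introduced here.

```python
import numpy as np
import pandas as pd

# Toy frame: "sparse" is 60% missing, "dense" is 10% missing, "full" has none
toy = pd.DataFrame({
    "sparse": [1.0] * 4 + [np.nan] * 6,
    "dense": [1.0] * 9 + [np.nan],
    "full": list(range(10)),
})

threshold = 20  # percent; columns strictly above this are dropped
missing_pct = toy.isna().mean() * 100
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()

# drop returns a new frame, so the original stays available for comparison
toy_reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(toy_reduced.columns.tolist())
```

On the real data, the same pattern would be `df.drop(columns=names_todelete)`. Assigning the result to a new variable rather than overwriting `df` keeps the original columns available if the threshold is revisited later.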