# Prediction of CO2 emissions from country-specific data

***

# Stage 3: Predictive data analysis with various machine learning algorithms

***

### Notebook Contents:

0. Introduction - project and notebook summaries, notes on the data source
1. Notebook setup - libraries and data import, dealing with randomness in the algorithms
2. Data overview
3. Used feature/column abbreviations
4. Hypothesis to be tested
5. Selection of dependent and independent variables
6. Dataset splitting into training and testing subsets
7. Feature selection with recursive feature elimination and cross-validation
8. Hyperparameter tuning
9. Train and evaluate the model with the best hyperparameters on the training data with cross-validation
10. Validate the model on the test subset (previously unseen data)
11. Conclusions

***

## 0. Introduction


The project is divided into three stages:

1. Data cleaning and preparation
2. Data exploration and visualization
3. Predictive analysis

Each stage is described in a separate Jupyter Notebook (.ipynb file) and a derived PDF file.

***

## Notebook summary - Stage 3: Predictive data analysis with the Random Forest machine learning algorithm

**Aim of this notebook**: This notebook will show the steps taken to develop a predictive Random Forest model by using the scikit-learn library. 

**Input**:
* CSV data file produced by the script 1_data_exploration.py (output of Stage 1)
* trends and relationship insights gained during data visualization (output of Stage 2)

**Output**:
* a predictive Random Forest model and its performance metrics, evaluated on unseen data

**Programming language**: Python 3.8

**Libraries used in this notebook**: sklearn, numpy, pandas, seaborn, matplotlib, sys

***





In [33]:
# import all needed libraries
import pandas as pd
import numpy as np
import numpy.random as nr
import sys
import seaborn as sns
import matplotlib.pyplot as plt

import sklearn.model_selection as ms
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import feature_selection as fs

from sklearn.model_selection import KFold
from sklearn.feature_selection import RFECV


In [34]:
# load the cleaned dataset
data_cleaned = pd.read_csv(r'data_cleaned2.csv')

# load the pca dataset
data_pca = pd.read_csv(r'pca_result.csv')


In [35]:
data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3360 entries, 0 to 3359
Data columns (total 32 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   country                             3360 non-null   object 
 1   year                                3360 non-null   int64  
 2   clean_fuel_access_perc              3360 non-null   float64
 3   elec_access_perc                    3360 non-null   float64
 4   nat_res_depl_perc                   3360 non-null   float64
 5   forest_depl_perc                    3360 non-null   float64
 6   agri_land_perc                      3360 non-null   float64
 7   agri_forest_fish_val_perc           3360 non-null   float64
 8   co2_emissions_per_capita            3360 non-null   float64
 9   cooling_degree_days                 3360 non-null   float64
 10  energy_intensity_primary_energy     3360 non-null   float64
 11  fertility_rate                      3360 no

In [36]:
data_cleaned.head()

Unnamed: 0,country,year,clean_fuel_access_perc,elec_access_perc,nat_res_depl_perc,forest_depl_perc,agri_land_perc,agri_forest_fish_val_perc,co2_emissions_per_capita,cooling_degree_days,...,mortality_rate_under_5,net_migration,nitrous_oxide_emissions_per_capita,population_65_above_perc,population_density,female_to_male_labor_force_ratio,renewable_energy_consumption_perc,scientific_journal_articles,precipitation_evapotranspiration,unemployment_total_perc
0,Albania,2000,38.7,99.430855,0.467306,0.12294,41.751825,24.515412,1.031568,710.17,...,27.2,-63610.0,0.384636,7.821964,112.738212,69.454576,41.36,22.34,-2.147378,19.028
1,Albania,2001,41.0,99.421989,0.28616,0.060695,41.569343,22.716164,1.056868,686.38,...,25.8,-62059.0,0.37593,8.145374,111.685146,69.30304,39.04,18.38,-1.776391,18.575
2,Albania,2002,43.8,99.404579,0.296077,0.064922,41.605839,22.025114,1.233002,566.02,...,24.4,-59876.0,0.425487,8.508105,111.35073,69.232075,35.82,24.53,0.058111,17.895
3,Albania,2003,46.5,99.385628,0.3129,0.061352,40.912409,21.978257,1.361159,931.89,...,22.9,-57308.0,0.431772,8.899816,110.93489,69.876243,33.67,22.82,-0.869403,16.989
4,Albania,2004,49.2,99.372139,0.36536,0.052754,40.948905,20.537486,1.427944,554.96,...,21.5,-54383.0,0.420342,9.308444,110.472226,70.592238,35.84,17.91,-0.052818,16.31


In [37]:
data_pca.head()

Unnamed: 0,country,year,co2_emissions_per_capita,PC1,PC2,PC3
0,Albania,2000,1.031568,0.069028,-0.392074,-1.481405
1,Albania,2001,1.056868,-0.160769,-0.502685,-1.479002
2,Albania,2002,1.233002,-0.393178,-0.776851,-1.536911
3,Albania,2003,1.361159,-0.439709,-0.507872,-1.449016
4,Albania,2004,1.427944,-0.653768,-0.736615,-1.475953


In [38]:
feature_cols = data_cleaned.drop(columns=['country', 'year', 'co2_emissions_per_capita']).columns

# split the data into features and target
X = data_cleaned[feature_cols]
y = data_cleaned['co2_emissions_per_capita']

X_pca = data_pca.drop(columns=['country', 'year', 'co2_emissions_per_capita'])
y_pca = data_pca['co2_emissions_per_capita']

In [39]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

X_pca_train, X_pca_test, y_pca_train, y_pca_test = train_test_split(X_pca, y_pca, test_size=0.5, random_state=42)

In [40]:
X

Unnamed: 0,clean_fuel_access_perc,elec_access_perc,nat_res_depl_perc,forest_depl_perc,agri_land_perc,agri_forest_fish_val_perc,cooling_degree_days,energy_intensity_primary_energy,fertility_rate,food_prod_index,...,mortality_rate_under_5,net_migration,nitrous_oxide_emissions_per_capita,population_65_above_perc,population_density,female_to_male_labor_force_ratio,renewable_energy_consumption_perc,scientific_journal_articles,precipitation_evapotranspiration,unemployment_total_perc
0,38.7,99.430855,0.467306,0.122940,41.751825,24.515412,710.17,4.13,2.231,63.53,...,27.2,-63610.0,0.384636,7.821964,112.738212,69.454576,41.36,22.34,-2.147378,19.028
1,41.0,99.421989,0.286160,0.060695,41.569343,22.716164,686.38,3.89,2.150,65.39,...,25.8,-62059.0,0.375930,8.145374,111.685146,69.303040,39.04,18.38,-1.776391,18.575
2,43.8,99.404579,0.296077,0.064922,41.605839,22.025114,566.02,4.10,2.036,65.88,...,24.4,-59876.0,0.425487,8.508105,111.350730,69.232075,35.82,24.53,0.058111,17.895
3,46.5,99.385628,0.312900,0.061352,40.912409,21.978257,931.89,3.80,1.978,69.05,...,22.9,-57308.0,0.431772,8.899816,110.934890,69.876243,33.67,22.82,-0.869403,16.989
4,49.2,99.372139,0.365360,0.052754,40.948905,20.537486,554.96,3.96,1.890,72.33,...,21.5,-54383.0,0.420342,9.308444,110.472226,70.592238,35.84,17.91,-0.052818,16.310
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3355,30.0,42.465588,4.683162,3.809529,41.876696,7.873986,2990.32,13.44,3.771,102.05,...,57.9,-59918.0,0.342761,3.167414,37.359969,83.700513,82.07,316.99,-1.473253,5.918
3356,29.8,43.979065,5.905431,4.205873,41.876696,8.340969,2235.64,12.79,3.706,106.59,...,56.2,-59918.0,0.349407,3.233118,38.131320,83.796443,82.63,334.71,1.297028,6.349
3357,30.0,45.400288,2.783017,1.305149,41.876696,7.319375,2663.61,12.82,3.659,107.82,...,53.7,-59918.0,0.346090,3.293359,38.909614,84.079919,80.43,406.23,-0.690707,6.767
3358,30.2,46.682095,4.088445,2.028044,41.876696,9.819262,2998.17,13.40,3.599,105.74,...,52.7,-59918.0,0.338353,3.345781,39.691374,84.432828,81.52,431.62,-0.301619,7.370


***

## Feature selection with cross-validation

Having many features relative to the number of data points has the following disadvantages:
* Not all features are expected to have an important influence on the predicted CO2 emissions.
* Some features are correlated with each other and therefore partially duplicate their influence on the dependent variable (multicollinearity). Additional correlated features add little information when learning from the training set, and some machine learning algorithms assume that the features are not strongly correlated.
* Too many variables can give the algorithm too many degrees of freedom, leading to overfitting on the training set and therefore reducing how well the predictions generalize to new, unseen data.
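
The multicollinearity point can be checked directly before any model is fitted. The following is a minimal sketch on synthetic data, not on the project's dataset: the column names `feat_a`, `feat_b`, `feat_c` and the 0.9 threshold are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical columns, not the project's features)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "feat_a": base,
    "feat_b": 2.0 * base + rng.normal(scale=0.05, size=200),  # near-duplicate of feat_a
    "feat_c": rng.normal(size=200),                            # independent of the others
})

# Absolute pairwise Pearson correlations between all features
corr = demo.corr().abs()

# Keep only the upper triangle so that each pair is listed once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Pairs above the (assumed) threshold are candidates for redundancy
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

Applied to the real feature table, the same pattern would list feature pairs that carry largely the same information. The threshold is a judgment call rather than a fixed rule.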
    
Feature selection is therefore necessary, in other words, deciding which features are most suitable for the current predictive challenge. To improve how well the predictions generalize to new data, the features are selected by evaluating a Random Forest model on different combinations of features, using cross-validation.

The feature ranking class sklearn.feature_selection.RFECV used here combines recursive feature elimination with cross-validation. Fitted on the training data, it repeatedly removes the least important feature, scores each resulting feature subset by its cross-validated R2, and returns a ranking in which the features of the best-scoring subset have rank 1. Consequently, only the most relevant features are kept for further analysis in both the training and the test subset (variables *features_train_reduced* and *features_test_reduced*).

In [None]:
# Fix the random seeds so that the folds and the forest are reproducible between runs
random_state_num = 42
nr.seed(1)

# Set folds for cross-validation for the feature selection
feature_folds = KFold(n_splits=4, shuffle=True, random_state=random_state_num)

# Define the model
rf_selector = RandomForestRegressor(random_state=random_state_num)

# Define the recursive feature elimination with cross-validation, scored by R2
selector = RFECV(estimator=rf_selector, cv=feature_folds, scoring='r2', n_jobs=-1)

# Fit the selector to the training data
selector = selector.fit(X_train, np.ravel(y_train))

# Print the feature ranking (rank 1 = selected)
print("Feature ranking after RFECV:")
print(selector.ranking_)

# Collect the names of the selected features (those with a ranking of 1)
chosen_features = [name for name, rank in zip(feature_cols, selector.ranking_) if rank == 1]
print("Chosen important features:")
print(chosen_features)

# Create a DataFrame with the selected features
selected_features_df = X_train[chosen_features]

According to the feature rankings, the features selected for this data set (ranking 1) are ['clean_fuel_access_perc', 'elec_access_perc', 'agri_land_perc', 'agri_forest_fish_val_perc', 'energy_intensity_primary_energy', 'forest_area_perc', 'heat_index_35', 'land_surface_temp', 'life_expectancy', 'methane_emissions_per_capita', 'mortality_rate_under_5', 'net_migration', 'nitrous_oxide_emissions_per_capita', 'population_65_above_perc', 'population_density', 'renewable_energy_consumption_perc', 'scientific_journal_articles', 'unemployment_total_perc'].

Consequently, only these features are kept for further analysis in both the training and the test subset (variables *features_train_reduced* and *features_test_reduced*):

In [31]:
# assign only the important variables to the features array of both training and testing dataset
features_train_reduced = selector.transform(X_train)
features_test_reduced = selector.transform(X_test)

print("Training subset shape before the recursive feature elimination:")
print(X_train.shape)
print("Training subset array shape after the recursive feature elimination:")
print(features_train_reduced.shape)
print("Test subset array shape after the recursive feature elimination:")
print(features_test_reduced.shape)

Training subset shape before the recursive feature elimination:
(1680, 29)
Training subset array shape after the recursive feature elimination:
(1680, 18)
Test subset array shape after the recursive feature elimination:
(1680, 18)
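
By default, the `transform` method of the selector returns a plain NumPy array, so the column names are no longer attached to the reduced data. One possible way to keep them is to rebuild a DataFrame from the boolean mask `support_`. The sketch below uses synthetic data, and the `demo_` names and `feat_*` columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic stand-in data with hypothetical column names
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(
    rng.normal(size=(120, 5)), columns=[f"feat_{i}" for i in range(5)]
)
demo_y = 3.0 * demo_X["feat_0"] - 2.0 * demo_X["feat_3"] + rng.normal(scale=0.1, size=120)

# Small forest and 3 folds, chosen only to keep the example fast
demo_selector = RFECV(
    RandomForestRegressor(n_estimators=20, random_state=42), cv=3, scoring="r2"
)
demo_selector.fit(demo_X, demo_y)

# support_ is a boolean mask over the input columns (True = kept)
kept_columns = demo_X.columns[demo_selector.support_]

# Re-attach the column names and the original row index to the reduced array
demo_reduced = pd.DataFrame(
    demo_selector.transform(demo_X), columns=kept_columns, index=demo_X.index
)
print(demo_reduced.head())
```

Keeping the names makes later steps, such as inspecting feature importances, easier to read than working with positional array columns.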


In [32]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1680 entries, 1410 to 3174
Data columns (total 29 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   clean_fuel_access_perc              1680 non-null   float64
 1   elec_access_perc                    1680 non-null   float64
 2   nat_res_depl_perc                   1680 non-null   float64
 3   forest_depl_perc                    1680 non-null   float64
 4   agri_land_perc                      1680 non-null   float64
 5   agri_forest_fish_val_perc           1680 non-null   float64
 6   cooling_degree_days                 1680 non-null   float64
 7   energy_intensity_primary_energy     1680 non-null   float64
 8   fertility_rate                      1680 non-null   float64
 9   food_prod_index                     1680 non-null   float64
 10  forest_area_perc                    1680 non-null   float64
 11  gdp_growth_perc                     1680 non-

In [27]:
# preview the training subset restricted to the selected features
selected_features_df.head()


Unnamed: 0,clean_fuel_access_perc,elec_access_perc,agri_land_perc,agri_forest_fish_val_perc,energy_intensity_primary_energy,forest_area_perc,heat_index_35,land_surface_temp,life_expectancy,methane_emissions_per_capita,mortality_rate_under_5,net_migration,nitrous_oxide_emissions_per_capita,population_65_above_perc,population_density,renewable_energy_consumption_perc,scientific_journal_articles,unemployment_total_perc
1410,11.0,87.94,27.230084,15.185348,5.07,53.684549,0.22,27.374235,67.413,1.445725,45.5,-97733.0,0.215439,5.347609,118.816439,43.0,376.32,6.66
1476,100.0,100.0,61.837712,0.921153,2.47,9.934272,0.0,12.218516,79.241463,3.765071,4.9,94996.0,2.248192,10.79152,62.034998,3.23,4863.72,4.41
2674,0.3,13.006841,50.203935,52.956651,7.32,38.671945,0.14,27.844,50.365,0.479435,182.8,-6042.0,0.176752,3.243243,82.282668,86.35,4.72,4.097
83,99.9,100.0,43.029265,6.357034,3.46,10.440715,1.85,25.874429,75.892,2.887727,7.7,2344.0,1.125279,11.727454,16.580893,9.84,9729.75,11.46
2655,66.6,99.645485,40.224102,7.201489,6.05,30.73062,0.0,18.438369,73.985366,1.498784,7.8,46.0,0.692542,17.526963,83.704631,20.68,3712.68,16.12
