# Climate CO₂ Forecast

This notebook demonstrates a machine learning approach to forecasting CO₂ emissions, in support of **SDG 13: Climate Action**.

# IMPORT THE RELEVANT LIBRARIES

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


# LOAD THE DATASET



In [19]:
df = pd.read_csv("../DATA/owid-co2-data.csv")

df.head()

Unnamed: 0,country,year,iso_code,population,gdp,cement_co2,cement_co2_per_capita,co2,co2_growth_abs,co2_growth_prct,...,share_global_other_co2,share_of_temperature_change_from_ghg,temperature_change_from_ch4,temperature_change_from_co2,temperature_change_from_ghg,temperature_change_from_n2o,total_ghg,total_ghg_excluding_lucf,trade_co2,trade_co2_share
0,Afghanistan,1750,AFG,2802560.0,,0.0,0.0,,,,...,,,,,,,,,,
1,Afghanistan,1751,AFG,,,0.0,,,,,...,,,,,,,,,,
2,Afghanistan,1752,AFG,,,0.0,,,,,...,,,,,,,,,,
3,Afghanistan,1753,AFG,,,0.0,,,,,...,,,,,,,,,,
4,Afghanistan,1754,AFG,,,0.0,,,,,...,,,,,,,,,,


# DATA CLEANING

In this step, we prepare the dataset for analysis and machine learning by identifying and correcting issues in the raw data. Cleaning the dataset reduces the risk that the models we build are distorted by missing or incorrect values.

## Step 1: Inspecting the Dataset

We begin by inspecting the dataset's structure, checking for:

- The number of rows and columns
- Data types of each column
- Null or missing values
- Summary statistics

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50191 entries, 0 to 50190
Data columns (total 79 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   country                                    50191 non-null  object 
 1   year                                       50191 non-null  int64  
 2   iso_code                                   42262 non-null  object 
 3   population                                 41019 non-null  float64
 4   gdp                                        15251 non-null  float64
 5   cement_co2                                 28863 non-null  float64
 6   cement_co2_per_capita                      25358 non-null  float64
 7   co2                                        29137 non-null  float64
 8   co2_growth_abs                             26981 non-null  float64
 9   co2_growth_prct                            26002 non-null  float64
 10  co2_including_luc     

In [21]:
df.describe()

Unnamed: 0,year,population,gdp,cement_co2,cement_co2_per_capita,co2,co2_growth_abs,co2_growth_prct,co2_including_luc,co2_including_luc_growth_abs,...,share_global_other_co2,share_of_temperature_change_from_ghg,temperature_change_from_ch4,temperature_change_from_co2,temperature_change_from_ghg,temperature_change_from_n2o,total_ghg,total_ghg_excluding_lucf,trade_co2,trade_co2_share
count,50191.0,41019.0,15251.0,28863.0,25358.0,29137.0,26981.0,26002.0,23585.0,23285.0,...,2108.0,41001.0,38060.0,41001.0,41001.0,38060.0,37410.0,37236.0,4535.0,4535.0
mean,1919.883067,56861410.0,330049500000.0,7.767746,0.059036,415.698178,6.208882,43.104462,535.581202,7.214604,...,7.512655,2.269285,0.003026,0.00767,0.011023,0.000509,488.542225,316.133529,-7.232399,20.52444
std,65.627296,319990500.0,3086383000000.0,62.595292,0.120328,1945.843973,62.322553,1729.939596,2202.219657,99.34798,...,17.671054,9.315325,0.016519,0.043694,0.061901,0.003043,2392.57991,1839.602293,250.640012,52.744956
min,1750.0,215.0,49980000.0,0.0,0.0,0.0,-1977.75,-100.0,-99.693,-2325.5,...,0.0,-0.81,-0.001,0.0,-0.001,0.0,-14.961,0.0,-2195.952,-98.849
25%,1875.0,327313.0,7874038000.0,0.0,0.0,0.374,-0.005,-1.1025,6.418,-0.908,...,0.20475,0.004,0.0,0.0,0.0,0.0,1.835,0.235,-3.1795,-6.168
50%,1924.0,2289522.0,27438610000.0,0.0,0.001,4.99,0.044,3.8035,27.691,0.078,...,0.838,0.078,0.0,0.0,0.0,0.0,15.0075,2.371,1.518,8.701
75%,1974.0,9862459.0,121262700000.0,0.486,0.07575,53.273,1.002,10.89075,123.959,2.62,...,3.211,0.359,0.001,0.001,0.001,0.0,78.24275,29.3375,9.1535,32.666
max,2023.0,8091735000.0,130112600000000.0,1696.308,2.484,37791.57,1865.208,180870.0,41416.48,2340.184,...,100.0,100.0,0.422,1.161,1.668,0.085,53816.852,44114.785,1798.999,568.635


In [22]:
df.isnull().sum().sort_values(ascending=False)

share_global_other_co2               48083
share_global_cumulative_other_co2    48083
other_co2_per_capita                 47717
other_industry_co2                   46989
cumulative_other_co2                 46989
                                     ...  
temperature_change_from_co2           9190
population                            9172
iso_code                              7929
year                                     0
country                                  0
Length: 79, dtype: int64
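
Raw null counts are hard to judge without the 50,191-row total next to them, so a share-of-rows view can help when deciding which columns to drop. The following is a minimal sketch on a small made-up frame; the helper name `missing_share` and the 0.5 cutoff are illustrative assumptions, not values used in this notebook.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "co2": [1.0, np.nan, 3.0, 4.0],
    "trade_co2": [np.nan, np.nan, np.nan, 0.5],
})

share = missing_share(toy)
print(share)

# Hypothetical cutoff: flag columns where more than half of the rows are missing.
sparse_columns = share[share > 0.5].index.tolist()
print(sparse_columns)
```

On the real data, the same helper could be called as `missing_share(df)` right after loading the CSV.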

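## Step 2: Handling Missing Values

Every column except country and year contains missing values (NaN). The counts above range from 7,929 for iso_code to 48,083 for share_global_other_co2, out of 50,191 rows. Our approach:

- Drop columns that aren't essential for CO₂ prediction, especially those with many missing values
- Keep only the rows where the key fields (year and co2) are present
- Fill the remaining NaNs using methods like:
  - ffill(): forward-fill (carry the last known value forward)
  - bfill(): backward-fill (use the next known value)
  - fillna(mean/median): numerical imputation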
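
To make the three fill strategies concrete, here is a small sketch on a made-up series (the values are illustrative only). It uses the `ffill()` and `bfill()` methods, which newer pandas releases favor over the `fillna(method=...)` form.

```python
import numpy as np
import pandas as pd

# Made-up yearly values with gaps at the start, in the middle, and at the end.
s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan],
              index=[2000, 2001, 2002, 2003, 2004])

forward = s.ffill()            # carries 2.0 into 2002 and 4.0 into 2004; 2000 stays NaN
backward = s.bfill()           # pulls 2.0 back to 2000 and 4.0 back to 2002; 2004 stays NaN
both = s.ffill().bfill()       # forward-fill first, then backward-fill the leading gap
median = s.fillna(s.median())  # median of the observed values (2.0 and 4.0) is 3.0

print(pd.DataFrame({
    "raw": s,
    "ffill": forward,
    "bfill": backward,
    "ffill+bfill": both,
    "median": median,
}))
```

Forward- and backward-fill assume neighbouring rows are comparable, which suits a time series for a single country; median imputation ignores row order entirely.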

In [23]:
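# Columns that are not used as model inputs later in the notebook
drop_columns = [
    'iso_code', 'continent', 'other_industry_co2', 'cement_co2_per_capita',
    'coal_co2_per_capita', 'oil_co2_per_capita', 'gas_co2_per_capita',
    'flaring_co2_per_capita', 'trade_co2', 'trade_co2_share'
]
# errors='ignore' skips any listed name that is not present in the DataFrame
df.drop(columns=drop_columns, inplace=True, errors='ignore')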

In [24]:
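# Filter dataset: keep rows where the key fields are non-null.
# .copy() makes df an independent DataFrame, so later in-place edits and
# column assignments do not raise SettingWithCopyWarning.
df = df[df['year'].notnull() & df['co2'].notnull()].copy()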

In [25]:
# Fill remaining missing values in all columns: forward-fill, then backward-fill
# (ffill()/bfill() replace the deprecated fillna(method=...) form)
df = df.ffill().bfill()
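
One caveat: the rows are ordered by country and then by year, so a plain forward- or backward-fill can carry a value from the last row of one country into the first rows of the next. A possible alternative, sketched here on made-up data, fills within each country only. This is an assumption about what may be preferable, not what the cell above does, and it leaves a NaN wherever a country has no observed value at all.

```python
import numpy as np
import pandas as pd

# Made-up panel: two countries, three years each.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "gdp": [10.0, np.nan, 12.0, np.nan, 50.0, np.nan],
    "population": [1.0, 1.1, np.nan, np.nan, np.nan, np.nan],
})
value_columns = ["gdp", "population"]

# Global fill: country B's first gdp is taken from country A's last row.
global_fill = toy.copy()
global_fill[value_columns] = toy[value_columns].ffill().bfill()

# Per-country fill: values never cross a country boundary.
per_country = toy.sort_values(["country", "year"]).copy()
per_country[value_columns] = (
    per_country.groupby("country")[value_columns]
    .transform(lambda col: col.ffill().bfill())
)

print(global_fill)
print(per_country)
```

In the toy output, country B's 2000 `gdp` is 12.0 under the global fill (borrowed from country A) and 50.0 under the per-country fill, while B's `population` stays NaN because B has no observations to fill from.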


In [26]:
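# Feature Engineering: create derived intensity columns.
# co2 is reported in million tonnes, so these ratios are in million tonnes per
# unit of GDP and million tonnes per person. If the file already has columns
# with these names, they are overwritten here.
df['co2_per_gdp'] = df['co2'] / df['gdp']
df['co2_per_capita'] = df['co2'] / df['population']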
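
Because `co2` is in million tonnes, the ratios computed above come out very small (on the order of 1e-9 per person). If more readable units are wanted, a conversion along the following lines could be used. This is a sketch on made-up rows; the column names `co2_tonnes_per_capita` and `co2_kg_per_gdp` are hypothetical and not part of the notebook, and the zero-denominator guard is an added precaution rather than something the data shown here requires.

```python
import numpy as np
import pandas as pd

# Made-up rows; co2 is assumed to be in million tonnes.
toy = pd.DataFrame({
    "co2": [0.084, 5000.0, 1.0],
    "gdp": [9.4214e9, 2.0e13, 0.0],
    "population": [7_776_182.0, 3.0e8, 1000.0],
})

TONNES_PER_MILLION_TONNES = 1e6
KG_PER_MILLION_TONNES = 1e9

# Treat a zero denominator as missing so the ratio becomes NaN instead of inf.
gdp = toy["gdp"].replace(0, np.nan)
population = toy["population"].replace(0, np.nan)

toy["co2_tonnes_per_capita"] = toy["co2"] * TONNES_PER_MILLION_TONNES / population
toy["co2_kg_per_gdp"] = toy["co2"] * KG_PER_MILLION_TONNES / gdp

print(toy)
```

Rescaling does not change what a linear model can learn from the feature, but it makes tables and plots easier to read and check by eye.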



In [28]:
# Select final feature set
selected_columns = [
    'country', 'year', 'co2', 'gdp', 'population',
    'coal_co2', 'oil_co2', 'gas_co2', 'co2_per_gdp', 'co2_per_capita'
]
df = df[selected_columns]
df.head()

Unnamed: 0,country,year,co2,gdp,population,coal_co2,oil_co2,gas_co2,co2_per_gdp,co2_per_capita
199,Afghanistan,1949,0.015,9421400000.0,7356890.0,0.015,0.0,0.0,1.59212e-12,2.038905e-09
200,Afghanistan,1950,0.084,9421400000.0,7776182.0,0.021,0.063,0.0,8.915872e-12,1.080222e-08
201,Afghanistan,1951,0.092,9692280000.0,7879343.0,0.026,0.066,0.0,9.492091e-12,1.16761e-08
202,Afghanistan,1952,0.092,10017330000.0,7987783.0,0.032,0.06,0.0,9.184089e-12,1.151759e-08
203,Afghanistan,1953,0.106,10630520000.0,8096703.0,0.038,0.068,0.0,9.97129e-12,1.309175e-08
