### CO2 Emission Reduction Classification Model

In [1]:
import pandas as pd 
import numpy as np
import os

# Set up workspace path - go up one directory from scripts folder to project root
workspace_root = os.path.dirname(os.getcwd())

# Set the data directory path
data_dir = os.path.join(workspace_root, 'data')
# Read data
csv_file_path = os.path.join(data_dir, 'important_variables.csv')
df = pd.read_csv(csv_file_path)

In [2]:
# Sort by country and year to ensure proper time series ordering
df = df.sort_values(by=['REF_AREA_LABEL', 'YEAR']).copy()

# Calculate 10-year percentage change in CO2 emissions per capita
# This captures sustained emission trends rather than short-term fluctuations
df['CO2_change_pct_10yr'] = (
    df.groupby('REF_AREA_LABEL')['WB_WDI_EN_GHG_CO2_PC_CE_AR5']
      .pct_change(periods=10) * 100
)

# Create binary target: 1 if country reduces emissions by ≥10% over 10 years, 0 otherwise
# Threshold of -10% represents meaningful emission reduction
# Note: NaN changes (the first 10 years of each country) compare as False, so those rows are labeled 0
THRESHOLD = -10
df['target_reduction'] = (df['CO2_change_pct_10yr'] <= THRESHOLD).astype(int)

print(df[['REF_AREA_LABEL','YEAR','CO2_change_pct_10yr','target_reduction']].head(20))


   REF_AREA_LABEL  YEAR  CO2_change_pct_10yr  target_reduction
0         Albania  1995                  NaN                 0
1         Albania  1996                  NaN                 0
2         Albania  1997                  NaN                 0
3         Albania  1998                  NaN                 0
4         Albania  1999                  NaN                 0
5         Albania  2000                  NaN                 0
6         Albania  2001                  NaN                 0
7         Albania  2002                  NaN                 0
8         Albania  2003                  NaN                 0
9         Albania  2004                  NaN                 0
10        Albania  2005           112.142969                 0
11        Albania  2006           113.010999                 0
12        Albania  2007           184.666498                 0
13        Albania  2008           132.543247                 0
14        Albania  2009            47.063574           
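
In the output above, the first ten years of each country have no 10-year change, yet they still receive `target_reduction = 0`, because a comparison against NaN evaluates to False. The following is a minimal sketch of one alternative, using toy data and hypothetical column names (`country`, `co2_pc`), that keeps those rows unlabeled so they can be dropped before modelling. It is an illustration under these assumptions, not the notebook's actual pipeline.

```python
import pandas as pd

# Toy series for a single hypothetical country: flat for 10 years, then two later years
toy = pd.DataFrame({
    "country": ["A"] * 12,
    "year": list(range(2000, 2012)),
    "co2_pc": [10.0] * 10 + [8.0, 12.0],
}).sort_values(["country", "year"])

# 10-year percentage change within each country
toy["change_10yr"] = toy.groupby("country")["co2_pc"].pct_change(periods=10) * 100

# Label only rows where the change is defined; keep NaN where it is not
reduced = (toy["change_10yr"] <= -10).astype(float)
toy["target"] = reduced.where(toy["change_10yr"].notna())

# Rows without a defined 10-year change are dropped instead of being counted as class 0
labeled = toy.dropna(subset=["target"])
print(labeled[["year", "change_10yr", "target"]])
```

Whether unlabeled rows should be dropped or kept as class 0 is a modelling decision; the sketch only shows how to make that choice explicit.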

In [None]:
def create_lags(df, group_col, cols_to_lag, n_lags=3):
    """
    Create lagged features to capture temporal dependencies.
    Policy changes often take time to show effects, so historical values are important predictors.
    """
    for col in cols_to_lag:
        for lag in range(1, n_lags + 1):
            df[f'{col}_lag{lag}'] = df.groupby(group_col)[col].shift(lag)
    return df

# Define core features that are most relevant for emission reduction prediction
features = [
    'WB_WDI_NY_GDP_MKTP_KD_ZG',      # GDP growth (annual %) - economic development
    'WB_WDI_SP_URB_TOTL_IN_ZS',      # Urban population (% of total) - urbanization trends
    'WB_WDI_SP_POP_TOTL',            # Population, total - scale effects
    'WB_WDI_EG_USE_PCAP_KG_OE',      # Energy use per capita - energy intensity
    'WB_WDI_EG_USE_CRNW_ZS',         # Combustible renewables and waste (% of total energy)
    'WB_WDI_EG_FEC_RNEW_ZS',         # Renewable energy consumption (% of total final energy)
    'WB_WDI_EG_ELC_FOSL_ZS',         # Electricity production from fossil fuels (% of total)
    'WB_WDI_EG_USE_COMM_FO_ZS',      # Fossil fuel energy consumption (% of total)
    'WB_WDI_EN_GHG_CO2_TR_MT_CE_AR5',# CO2 emissions from Transport - key sector
    'WB_WDI_EN_GHG_CO2_PI_MT_CE_AR5',# CO2 emissions from Power Industry - critical sector
    'WB_WDI_EN_GHG_CH4_AG_MT_CE_AR5' # CH4 emissions from Agriculture - important GHG
]

# Create 3-year lags to capture policy implementation delays and temporal effects
df = create_lags(df, 'REF_AREA_LABEL', features, n_lags=3)
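
As a quick illustration of why `create_lags` shifts inside `groupby`, the sketch below uses toy data with hypothetical column names (`country`, `gdp`). The grouped shift leaves the first year of each country empty instead of borrowing the last value of the previous country. It assumes the frame is already sorted by country and year, as the notebook does earlier.

```python
import pandas as pd

# Two hypothetical countries with three years each
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "gdp": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
}).sort_values(["country", "year"])

# Grouped shift: the previous year's value within the same country
toy["gdp_lag1"] = toy.groupby("country")["gdp"].shift(1)

# Ungrouped shift, shown only for contrast: country B's first year inherits A's last value
toy["gdp_lag1_naive"] = toy["gdp"].shift(1)

print(toy)
```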


In [None]:
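# Drop rows with missing values in the core features or the target
# (the lag columns are not in `subset`, so they can still contain NaN)
df_model = df.dropna(subset=features + ['target_reduction']).copy()

# Final features = current features + lags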
lag_cols = [col for col in df_model.columns if 'lag' in col]
final_features = features + lag_cols

X = df_model[final_features]
y = df_model['target_reduction']


In [8]:
# Temporal split: train on historical data (1995-2015), test on future data (2016+)
# This tests the model's ability to predict future outcomes, which is the ultimate goal
train = df_model[df_model['YEAR'] <= 2015]
test = df_model[df_model['YEAR'] > 2015]

# Use only the core features (not the lag features) for training
# `features` is the list of core indicators defined earlier in the notebook
X_train, y_train = train[features], train['target_reduction']
X_test, y_test = test[features], test['target_reduction']
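
A temporal split is easy to get subtly wrong, so a small check can help. The sketch below wraps the split in a hypothetical helper, `temporal_split` (not part of the notebook), and runs it on toy data to confirm that the two periods do not overlap and to compare the share of positive labels in each.

```python
import pandas as pd

def temporal_split(frame, year_col, cutoff):
    """Hypothetical helper: rows up to `cutoff` go to train, later rows to test."""
    train_part = frame[frame[year_col] <= cutoff]
    test_part = frame[frame[year_col] > cutoff]
    return train_part, test_part

# Toy data standing in for df_model
toy = pd.DataFrame({
    "YEAR": [2013, 2014, 2015, 2016, 2017],
    "target_reduction": [0, 0, 1, 1, 0],
})
train_toy, test_toy = temporal_split(toy, "YEAR", cutoff=2015)

# The periods must not overlap, otherwise the test set is not truly "future" data
assert train_toy["YEAR"].max() < test_toy["YEAR"].min()

# Share of positive labels per period; a large gap hints that the target drifts over time
print(train_toy["target_reduction"].mean(), test_toy["target_reduction"].mean())
```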



In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Create logistic regression pipeline with standardization
# class_weight='balanced' handles the imbalanced target variable
# max_iter=1000 ensures convergence for complex datasets
log_reg = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features for logistic regression
    ('model', LogisticRegression(class_weight='balanced', max_iter=1000))
])

log_reg.fit(X_train, y_train)
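
To make `class_weight='balanced'` less of a black box, the sketch below reproduces the weighting rule documented by scikit-learn, `n_samples / (n_classes * np.bincount(y))`, with NumPy only. The label vector is a made-up example, not the notebook's training data.

```python
import numpy as np

# Hypothetical imbalanced labels: six negatives, two positives
y_toy = np.array([0] * 6 + [1] * 2)

counts = np.bincount(y_toy)            # samples per class
n_classes = len(counts)
weights = len(y_toy) / (n_classes * counts)

# The rarer class gets the larger weight; each class ends up with the same total weight
print(dict(enumerate(weights)))
print(weights * counts)
```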


In [11]:
from sklearn.metrics import classification_report, roc_auc_score

# Generate predictions and probabilities
y_pred = log_reg.predict(X_test)
y_proba = log_reg.predict_proba(X_test)[:, 1]  # Probability of class 1 (successful reduction)

# Evaluate model performance
print("Classification Report:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_proba):.4f}")


Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.55      0.64       826
           1       0.47      0.71      0.56       461

    accuracy                           0.61      1287
   macro avg       0.62      0.63      0.60      1287
weighted avg       0.66      0.61      0.62      1287

ROC-AUC Score: 0.7000
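
The averaged rows of the report can be recomputed from the per-class values, which also gives a baseline for the accuracy figure. The sketch below uses only the rounded numbers printed above, so results agree with the report to two decimals at best.

```python
# Per-class values copied from the classification report above
support = {0: 826, 1: 461}
precision = {0: 0.77, 1: 0.47}
recall = {0: 0.55, 1: 0.71}
total = sum(support.values())

# Macro average: plain mean over classes
macro_precision = sum(precision.values()) / len(precision)
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean over classes weighted by support
weighted_precision = sum(precision[c] * support[c] for c in support) / total
weighted_recall = sum(recall[c] * support[c] for c in support) / total  # equals accuracy

# Accuracy of always predicting the majority class
majority_baseline = max(support.values()) / total

print(f"macro precision/recall:    {macro_precision:.2f} / {macro_recall:.2f}")
print(f"weighted precision/recall: {weighted_precision:.2f} / {weighted_recall:.2f}")
print(f"majority-class baseline:   {majority_baseline:.2f}")
```

On these numbers, always predicting class 0 would reach about 0.64 accuracy, slightly above the model's 0.61, so the recall on class 1 and the ROC-AUC are probably the more informative figures here.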


In [12]:
# Analyze model coefficients to understand which factors drive emission reductions
print("Features used in training:", X_train.columns.tolist())
print("Number of features:", len(X_train.columns))
print("Number of coefficients:", len(log_reg.named_steps['model'].coef_[0]))

# Create coefficient analysis DataFrame
# Coefficients show the change in log-odds per one standard deviation of each variable
# (features are standardized by the pipeline's StandardScaler)
# Odds ratios (exp(coefficient)) show the multiplicative effect on the odds of reduction
coef_df = pd.DataFrame({
    'Variable': X_train.columns,
    'Coefficient': log_reg.named_steps['model'].coef_[0],
    'Odds_Ratio': np.exp(log_reg.named_steps['model'].coef_[0])
}).sort_values(by='Odds_Ratio', ascending=False)

print("\nVariable Impact Analysis:")
print("Odds Ratio > 1: Variable increases odds of emission reduction")
print("Odds Ratio < 1: Variable decreases odds of emission reduction")
print("Odds Ratio = 1: Variable has no effect")
print("\nTop 5 Most Important Variables:")
print(coef_df.head())
print("\nBottom 5 Variables (Barriers):")
print(coef_df.tail())

coef_df

Features used in training: ['WB_WDI_NY_GDP_MKTP_KD_ZG', 'WB_WDI_SP_URB_TOTL_IN_ZS', 'WB_WDI_SP_POP_TOTL', 'WB_WDI_EG_USE_PCAP_KG_OE', 'WB_WDI_EG_USE_CRNW_ZS', 'WB_WDI_EG_FEC_RNEW_ZS', 'WB_WDI_EG_ELC_FOSL_ZS', 'WB_WDI_EG_USE_COMM_FO_ZS', 'WB_WDI_EN_GHG_CO2_TR_MT_CE_AR5', 'WB_WDI_EN_GHG_CO2_PI_MT_CE_AR5', 'WB_WDI_EN_GHG_CH4_AG_MT_CE_AR5']
Number of features: 11
Number of coefficients: 11

Variable Impact Analysis:
Odds Ratio > 1: Variable increases odds of emission reduction
Odds Ratio < 1: Variable decreases odds of emission reduction
Odds Ratio = 1: Variable has no effect

Top 5 Most Important Variables:
                         Variable  Coefficient  Odds_Ratio
8  WB_WDI_EN_GHG_CO2_TR_MT_CE_AR5     1.517660    4.561537
2              WB_WDI_SP_POP_TOTL     1.231566    3.426591
1        WB_WDI_SP_URB_TOTL_IN_ZS     0.655112    1.925358
6           WB_WDI_EG_ELC_FOSL_ZS     0.082543    1.086046
4           WB_WDI_EG_USE_CRNW_ZS    -0.039184    0.961574

Bottom 5 Variables (Barriers):
  

                          Variable  Coefficient  Odds_Ratio
8   WB_WDI_EN_GHG_CO2_TR_MT_CE_AR5     1.517660    4.561537
2               WB_WDI_SP_POP_TOTL     1.231566    3.426591
1         WB_WDI_SP_URB_TOTL_IN_ZS     0.655112    1.925358
6            WB_WDI_EG_ELC_FOSL_ZS     0.082543    1.086046
4            WB_WDI_EG_USE_CRNW_ZS    -0.039184    0.961574
5            WB_WDI_EG_FEC_RNEW_ZS    -0.060530    0.941265
3         WB_WDI_EG_USE_PCAP_KG_OE    -0.071662    0.930845
0         WB_WDI_NY_GDP_MKTP_KD_ZG    -0.417980    0.658375
7         WB_WDI_EG_USE_COMM_FO_ZS    -0.642100    0.526186
10  WB_WDI_EN_GHG_CH4_AG_MT_CE_AR5    -1.593584    0.203196


- WB_WDI_EN_GHG_CO2_TR_MT_CE_AR5: Because the pipeline standardizes every feature, each odds ratio refers to a one-standard-deviation increase, not a one-unit increase. A one-standard-deviation increase in CO₂ emissions from transport (Mt CO₂e) multiplies the odds of achieving a reduction by 4.56×, which may indicate that countries with high transport emissions are more likely to implement impactful policies in this sector.

- WB_WDI_SP_POP_TOTL: In line with the previous insight, a one-standard-deviation increase in population multiplies the odds of emission reductions by 3.43×, potentially due to larger-scale interventions and greater international pressure.

- WB_WDI_SP_URB_TOTL_IN_ZS: A one-standard-deviation increase in the urban population share almost doubles the odds of achieving reductions (1.93×), possibly reflecting the role of efficient urban systems.

The variables with the largest positive effects relate to urban systems, underscoring the importance of investing in their efficiency.

- WB_WDI_EN_GHG_CO2_PI_MT_CE_AR5: Higher emissions from the power sector reduce the odds to about 15% of baseline, making the decarbonization of electricity grids a key priority.

- WB_WDI_EN_GHG_CH4_AG_MT_CE_AR5: Higher agricultural methane emissions drastically lower the odds, to about 20% of baseline, signaling a major challenge for climate action in agriculture-intensive economies.

- WB_WDI_EG_USE_COMM_FO_ZS: Countries more reliant on fossil fuels see their odds of reduction cut almost in half (0.53×), underscoring the need for a clean energy transition.
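
Two points about reading these odds ratios can be checked numerically. First, the model is fitted on standardized features, so `exp(coefficient)` is the effect of one standard deviation; converting to one raw unit needs the feature's standard deviation. Second, an odds ratio multiplies odds, not probabilities. The sketch below uses the transport coefficient from the table; the standard deviation (`sd_raw`) and the baseline probability are illustrative assumptions, not values from the notebook's fitted scaler.

```python
import numpy as np

# Coefficient for transport CO2 from the table above (per one standard deviation)
coef_transport = 1.517660
odds_ratio_per_sd = np.exp(coef_transport)

# Hypothetical standard deviation of the raw feature, for illustration only
sd_raw = 50.0
odds_ratio_per_unit = np.exp(coef_transport / sd_raw)

# Illustrative baseline: share of positives in the test set (461 of 1287)
p_base = 461 / 1287
odds_base = p_base / (1 - p_base)

# Apply the odds ratio to the odds, then convert back to a probability
odds_new = odds_base * odds_ratio_per_sd
p_new = odds_new / (1 + odds_new)

print(f"odds ratio per SD:   {odds_ratio_per_sd:.3f}")
print(f"odds ratio per unit: {odds_ratio_per_unit:.3f} (assuming sd_raw = {sd_raw})")
print(f"probability: {p_base:.2f} -> {p_new:.2f}")
```

In a fitted pipeline, the actual standard deviations are available from the scaler step (for example `log_reg.named_steps['scaler'].scale_`), which would replace the assumed `sd_raw`.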

