# AI for Climate Action: Carbon Emission Prediction Model 🌍🤖

## Week 2 Assignment: Machine Learning Meets UN SDG 13 - Climate Action

### Project Overview
This project develops a machine learning solution to predict carbon emissions and contribute to **UN Sustainable Development Goal 13: Climate Action**. We'll use supervised learning techniques to forecast CO2 emissions based on economic, industrial, and demographic factors.

### Learning Objectives
- Apply supervised learning concepts from Week 2
- Demonstrate how AI can address global sustainability challenges
- Implement ethical AI practices for sustainable development
- Create actionable insights for climate policy

---
*"AI can be the bridge between innovation and sustainability." — UN Tech Envoy*

## 1. SDG Selection and Problem Definition 🎯

### Chosen SDG: SDG 13 - Climate Action

**Problem Statement**: Climate change is one of the most pressing global challenges. Accurate prediction of carbon emissions is crucial for:
- Setting realistic emission reduction targets
- Identifying key factors contributing to emissions  
- Developing effective climate policies
- Monitoring progress toward carbon neutrality

### SDG Targets Addressed:
- **Target 13.2**: Integrate climate change measures into national policies and strategies
- **Target 13.3**: Improve education and awareness on climate change mitigation

### Machine Learning Approach:
- **Type**: Supervised Learning (Regression)
- **Primary Algorithm**: Random Forest Regression
- **Comparison Models**: Linear Regression, XGBoost
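- **Features**: GDP per capita, Population, Energy Consumption, Industrial Production, plus renewable energy share, forest area, urbanization, education, and healthcare indicators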
- **Target Variable**: CO2 Emissions (metric tons per capita)

### Expected Impact:
A model of this kind is intended to help policymakers, organizations, and researchers make data-driven decisions about emission reduction strategies.

## 2. Data Collection and Exploration 📊

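This notebook generates a synthetic dataset as a stand-in for real indicators, so every result below describes simulated data only. In a real project, the same columns would come from publicly available World Bank and UN databases to ensure transparency and reproducibility.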
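Public indicator downloads often arrive in a wide layout, with one column per year, rather than the one-row-per-country table used in this notebook. The following is a minimal sketch of reshaping such a file with pandas. The column names, the indicator codes `GDP_PC` and `CO2_PC`, and all values are hypothetical placeholders, not real statistics or a real file format.

```python
import io

import pandas as pd

# Hypothetical wide-format export: one row per (country, indicator),
# one column per year. All numbers are made-up placeholders.
raw_csv = io.StringIO(
    "Country Name,Indicator Code,2019,2020\n"
    "Country_A,GDP_PC,1000.0,1100.0\n"
    "Country_A,CO2_PC,2.0,2.1\n"
    "Country_B,GDP_PC,5000.0,4900.0\n"
    "Country_B,CO2_PC,6.0,5.8\n"
)
wide = pd.read_csv(raw_csv)

# Wide -> long: one row per (country, indicator, year)
long = wide.melt(
    id_vars=["Country Name", "Indicator Code"],
    var_name="Year",
    value_name="Value",
)
long["Year"] = long["Year"].astype(int)

# Long -> tidy: one row per (country, year), one column per indicator
tidy = long.pivot_table(
    index=["Country Name", "Year"],
    columns="Indicator Code",
    values="Value",
).reset_index()
tidy.columns.name = None

print(tidy)
```

The resulting `tidy` frame has the same shape as the modeling table used in this notebook: identifier columns followed by one numeric column per indicator.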

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import xgboost as xgb
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
import pickle
import os

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Create directories if they don't exist
os.makedirs('data', exist_ok=True)
os.makedirs('models', exist_ok=True)
os.makedirs('results', exist_ok=True)

print("✅ Libraries imported successfully!")
print("📁 Project directories created!")

In [None]:
# Create Synthetic Climate and Economic Data
# Note: In a real project, you would load data from World Bank, UN databases, or Kaggle

np.random.seed(42)

# Number of countries/regions to simulate
n_samples = 200

# Generate realistic country data
countries = [f"Country_{i}" for i in range(1, n_samples + 1)]

# Economic indicators
gdp_per_capita = np.random.lognormal(mean=9, sigma=1.2, size=n_samples)  # GDP per capita (USD)
population = np.random.lognormal(mean=15, sigma=1.5, size=n_samples)     # Population

# Energy and industrial factors
energy_consumption = np.random.gamma(shape=2, scale=100, size=n_samples)  # Energy consumption per capita
renewable_energy_pct = np.random.beta(a=2, b=5, size=n_samples) * 100    # Renewable energy percentage
industrial_production = np.random.gamma(shape=3, scale=50, size=n_samples)  # Industrial production index

# Environmental factors
forest_area_pct = np.random.beta(a=3, b=2, size=n_samples) * 100         # Forest area percentage
urbanization_rate = np.random.beta(a=5, b=3, size=n_samples) * 100       # Urban population percentage

# Development indicators
education_index = np.random.beta(a=8, b=2, size=n_samples)               # Education index (0-1)
healthcare_expenditure = np.random.gamma(shape=3, scale=2, size=n_samples)  # Healthcare expenditure % of GDP

# Calculate CO2 emissions with realistic relationships
co2_emissions = (
    0.3 * np.log(gdp_per_capita) +
    0.4 * np.log(energy_consumption) +
    0.2 * (industrial_production / 100) +
    -0.1 * (renewable_energy_pct / 100) +
    -0.05 * (forest_area_pct / 100) +
    0.1 * (urbanization_rate / 100) +
    np.random.normal(0, 0.5, n_samples)  # Add noise
)

# Ensure realistic bounds
co2_emissions = np.clip(co2_emissions, 0.1, 25)  # Typical range: 0.1-25 tons per capita

# Create DataFrame
data = pd.DataFrame({
    'Country': countries,
    'GDP_per_capita': gdp_per_capita,
    'Population': population,
    'Energy_consumption_per_capita': energy_consumption,
    'Renewable_energy_pct': renewable_energy_pct,
    'Industrial_production_index': industrial_production,
    'Forest_area_pct': forest_area_pct,
    'Urbanization_rate': urbanization_rate,
    'Education_index': education_index,
    'Healthcare_expenditure_pct': healthcare_expenditure,
    'CO2_emissions_per_capita': co2_emissions
})

# Save the dataset
data.to_csv('data/climate_economic_data.csv', index=False)

print("🌍 Synthetic climate and economic dataset created!")
print(f"📊 Dataset shape: {data.shape}")
print("\n📋 Dataset Info:")
data.info()  # info() prints its own summary and returns None, so it is not wrapped in print()

In [None]:
# Exploratory Data Analysis
print("📈 EXPLORATORY DATA ANALYSIS")
print("=" * 50)

# Display basic statistics
print("\n📊 Dataset Overview:")
print(data.head())

print("\n📈 Statistical Summary:")
print(data.describe().round(2))

# Check for missing values
print(f"\n❌ Missing values: {data.isnull().sum().sum()}")

# Visualize the distribution of CO2 emissions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# CO2 emissions distribution
axes[0, 0].hist(data['CO2_emissions_per_capita'], bins=30, alpha=0.7, color='red')
axes[0, 0].set_title('Distribution of CO2 Emissions per Capita')
axes[0, 0].set_xlabel('CO2 Emissions (tons per capita)')
axes[0, 0].set_ylabel('Frequency')

# GDP vs CO2 emissions
axes[0, 1].scatter(data['GDP_per_capita'], data['CO2_emissions_per_capita'], alpha=0.6)
axes[0, 1].set_title('GDP per Capita vs CO2 Emissions')
axes[0, 1].set_xlabel('GDP per Capita (USD)')
axes[0, 1].set_ylabel('CO2 Emissions (tons per capita)')

# Energy consumption vs CO2 emissions
axes[1, 0].scatter(data['Energy_consumption_per_capita'], data['CO2_emissions_per_capita'], alpha=0.6, color='orange')
axes[1, 0].set_title('Energy Consumption vs CO2 Emissions')
axes[1, 0].set_xlabel('Energy Consumption per Capita')
axes[1, 0].set_ylabel('CO2 Emissions (tons per capita)')

# Renewable energy vs CO2 emissions
axes[1, 1].scatter(data['Renewable_energy_pct'], data['CO2_emissions_per_capita'], alpha=0.6, color='green')
axes[1, 1].set_title('Renewable Energy % vs CO2 Emissions')
axes[1, 1].set_xlabel('Renewable Energy %')
axes[1, 1].set_ylabel('CO2 Emissions (tons per capita)')

plt.tight_layout()
plt.savefig('results/exploratory_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Exploratory analysis visualizations saved!")

## 3. Data Preprocessing and Feature Engineering 🔧

This section inspects feature correlations, engineers ratio and composite features, and log-transforms the skewed variables. Feature scaling is applied later, after the train/test split.
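To illustrate why a log transform is applied to heavy-tailed variables, here is a minimal sketch that measures skewness before and after `np.log1p`. It uses a separate log-normal sample drawn with the same shape parameters as the synthetic GDP column. The seed and sample size are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative sample only: log-normal draws resembling the synthetic GDP column
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=9, sigma=1.2, size=2000))

skew_raw = sample.skew()            # strongly right-skewed
skew_log = np.log1p(sample).skew()  # close to symmetric after the transform

print(f"Skewness before log1p: {skew_raw:.2f}")
print(f"Skewness after  log1p: {skew_log:.2f}")
```

A skewness near zero after the transform suggests the variable is closer to symmetric, which tends to help linear models in particular. Tree-based models are largely insensitive to monotonic transforms of a single feature.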

In [None]:
# Data Preprocessing and Feature Engineering
print("🔧 DATA PREPROCESSING & FEATURE ENGINEERING")
print("=" * 50)

# Create a copy for preprocessing
df_processed = data.copy()

# Remove country names for modeling (keep for later reference)
features_df = df_processed.drop(['Country'], axis=1)

# Calculate correlation matrix
correlation_matrix = features_df.corr()

# Visualize correlation heatmap
plt.figure(figsize=(12, 8))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, 
            mask=mask,
            annot=True, 
            cmap='RdYlBu_r', 
            center=0,
            square=True,
            fmt='.2f')
plt.title('Feature Correlation Matrix', fontsize=16)
plt.tight_layout()
plt.savefig('results/correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

# Feature Engineering - Create new meaningful features
df_processed['GDP_per_energy'] = df_processed['GDP_per_capita'] / df_processed['Energy_consumption_per_capita']
# Note: Population * urban share is an urban-population count rather than a true density
df_processed['Population_density_proxy'] = df_processed['Population'] * df_processed['Urbanization_rate'] / 100
df_processed['Green_development_index'] = (
    df_processed['Renewable_energy_pct'] * 0.4 + 
    df_processed['Forest_area_pct'] * 0.3 + 
    df_processed['Education_index'] * 100 * 0.3
)
df_processed['Industrial_intensity'] = df_processed['Industrial_production_index'] / df_processed['GDP_per_capita']

# Log transform skewed variables to improve model performance
skewed_features = ['GDP_per_capita', 'Population', 'Energy_consumption_per_capita']
for feature in skewed_features:
    df_processed[f'{feature}_log'] = np.log1p(df_processed[feature])

print("✅ Feature engineering completed!")
print(f"📊 New dataset shape: {df_processed.shape}")
print(f"🔢 Number of features: {df_processed.shape[1] - 2}")  # -2 for the target variable and the Country column

# Display correlation with target variable
# (recomputed here so that the engineered features are included)
target_correlations = (
    df_processed.drop(columns=['Country'])
    .corr()['CO2_emissions_per_capita']
    .sort_values(key=abs, ascending=False)
)
print("\n🎯 Features most correlated with CO2 emissions:")
print(target_correlations.drop('CO2_emissions_per_capita').head(10))

## 4. Model Selection and Implementation 🤖

We'll implement and compare three different machine learning algorithms:
1. **Linear Regression** - Simple baseline model
2. **Random Forest** - Ensemble method for complex relationships
3. **XGBoost** - Gradient boosting for high performance

In [None]:
# Prepare Data for Machine Learning
print("🎯 PREPARING DATA FOR MACHINE LEARNING")
print("=" * 50)

# Select features for modeling (exclude target variable and original country names)
feature_columns = [col for col in df_processed.columns 
                  if col not in ['CO2_emissions_per_capita', 'Country']]

X = df_processed[feature_columns]
y = df_processed['CO2_emissions_per_capita']

print(f"📊 Feature matrix shape: {X.shape}")
print(f"🎯 Target vector shape: {y.shape}")
print(f"\n🔢 Selected features: {feature_columns}")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=None
)

print(f"\n📈 Training set size: {X_train.shape[0]} samples")
print(f"🧪 Testing set size: {X_test.shape[0]} samples")

# Scale the features for better model performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save the scaler for future use
with open('models/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

print("✅ Data preprocessing completed!")
print("💾 Scaler saved for future predictions!")

## 5. Model Training and Hyperparameter Tuning ⚙️

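We'll train a Linear Regression baseline, then tune Random Forest and XGBoost with `GridSearchCV`, and pick the model with the highest cross-validated R². Note that the baseline is scored with 5-fold cross-validation while the grid searches use 3-fold, so the scores are only roughly comparable.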
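In this notebook the scaler is fit on the whole training set before cross-validation, so every validation fold has already influenced the scaling statistics. One common alternative is to place the scaler inside a scikit-learn `Pipeline`, which refits it on each training fold. The sketch below assumes a small stand-in dataset from `make_regression` and a reduced parameter grid. It is an illustration, not the tuning setup used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 200 samples, 8 features
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaler and model are bundled, so the scaler is refit inside every CV fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [5, None],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2")
search.fit(X_tr, y_tr)

test_r2 = search.score(X_te, y_te)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV R²: {search.best_score_:.3f}")
print(f"Test R²: {test_r2:.3f}")
```

A side benefit is that the fitted pipeline can be saved as a single object, so the scaler and the model cannot drift apart at prediction time.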

In [None]:
# Model Training and Hyperparameter Tuning
print("⚙️ MODEL TRAINING & HYPERPARAMETER TUNING")
print("=" * 50)

# Dictionary to store models and their performance
models = {}
model_scores = {}

# 1. Linear Regression (Baseline Model)
print("\n🔵 Training Linear Regression...")
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
models['Linear Regression'] = lr_model

# Cross-validation for Linear Regression
lr_cv_scores = cross_val_score(lr_model, X_train_scaled, y_train, cv=5, scoring='r2')
model_scores['Linear Regression'] = lr_cv_scores.mean()
print(f"✅ Linear Regression CV R² Score: {lr_cv_scores.mean():.4f} (+/- {lr_cv_scores.std() * 2:.4f})")

# 2. Random Forest with Hyperparameter Tuning
print("\n🌲 Training Random Forest with GridSearch...")
rf_model = RandomForestRegressor(random_state=42)

# Random Forest hyperparameter grid
rf_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

rf_grid_search = GridSearchCV(
    rf_model, rf_param_grid, cv=3, scoring='r2', n_jobs=-1, verbose=1
)
rf_grid_search.fit(X_train_scaled, y_train)

models['Random Forest'] = rf_grid_search.best_estimator_
model_scores['Random Forest'] = rf_grid_search.best_score_
print(f"✅ Random Forest Best CV R² Score: {rf_grid_search.best_score_:.4f}")
print(f"🔧 Best parameters: {rf_grid_search.best_params_}")

# 3. XGBoost with Hyperparameter Tuning
print("\n🚀 Training XGBoost with GridSearch...")
xgb_model = xgb.XGBRegressor(random_state=42, verbosity=0)

# XGBoost hyperparameter grid
xgb_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0]
}

xgb_grid_search = GridSearchCV(
    xgb_model, xgb_param_grid, cv=3, scoring='r2', n_jobs=-1, verbose=1
)
xgb_grid_search.fit(X_train_scaled, y_train)

models['XGBoost'] = xgb_grid_search.best_estimator_
model_scores['XGBoost'] = xgb_grid_search.best_score_
print(f"✅ XGBoost Best CV R² Score: {xgb_grid_search.best_score_:.4f}")
print(f"🔧 Best parameters: {xgb_grid_search.best_params_}")

# Display model comparison
print("\n📊 MODEL COMPARISON (Cross-Validation R² Scores):")
print("-" * 50)
for model_name, score in model_scores.items():
    print(f"{model_name:15}: {score:.4f}")

# Select best model
best_model_name = max(model_scores, key=model_scores.get)
best_model = models[best_model_name]
print(f"\n🏆 Best Model: {best_model_name} (R² = {model_scores[best_model_name]:.4f})")

## 6. Model Evaluation and Performance Metrics 📈

We evaluate every model on the held-out test set using four regression metrics: MAE, MSE, RMSE, and R².
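As a quick reference for how the regression metrics are defined, the sketch below computes each one by hand with NumPy and checks it against scikit-learn. The four values in each array are arbitrary illustrative numbers, not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Arbitrary illustrative values (not results from this project)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # same units as the target

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  R²={r2:.4f}")

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))
```

Because R² compares the model against always predicting the mean, it can be negative on a test set when the model does worse than that constant baseline.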

In [None]:
# Model Evaluation and Performance Metrics
print("📈 MODEL EVALUATION & PERFORMANCE METRICS")
print("=" * 50)

# Function to calculate comprehensive metrics
def evaluate_model(model, X_test, y_test, model_name):
    """Calculate comprehensive regression metrics for a model"""
    y_pred = model.predict(X_test)
    
    metrics = {
        'MAE': mean_absolute_error(y_test, y_pred),
        'MSE': mean_squared_error(y_test, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'R²': r2_score(y_test, y_pred)
    }
    
    return y_pred, metrics

# Evaluate all models
evaluation_results = {}
predictions = {}

print("🧪 Evaluating models on test set...\n")

for model_name, model in models.items():
    y_pred, metrics = evaluate_model(model, X_test_scaled, y_test, model_name)
    evaluation_results[model_name] = metrics
    predictions[model_name] = y_pred
    
    print(f"📊 {model_name} Performance:")
    for metric_name, value in metrics.items():
        print(f"   {metric_name:4}: {value:.4f}")
    print()

# Create performance comparison dataframe
performance_df = pd.DataFrame(evaluation_results).round(4)
print("📋 PERFORMANCE COMPARISON TABLE:")
print(performance_df.to_string())

# Save evaluation results
performance_df.to_csv('results/model_performance.csv')

# Feature importance for tree-based models
if best_model_name in ['Random Forest', 'XGBoost']:
    feature_importance = pd.DataFrame({
        'feature': feature_columns,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print(f"\n🎯 TOP 10 MOST IMPORTANT FEATURES ({best_model_name}):")
    print(feature_importance.head(10).to_string(index=False))
    
    # Save feature importance
    feature_importance.to_csv('results/feature_importance.csv', index=False)

print("\n✅ Model evaluation completed!")
print("💾 Results saved to CSV files!")

## 7. Results Visualization and Interpretation 📊

This section plots the model comparison, predicted versus actual values, residuals, and feature importances, and saves each figure to the `results/` directory.
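The feature importances plotted in this section come from the tree models' `feature_importances_` attribute, which is computed on the training data. Permutation importance on held-out data is one possible cross-check. The sketch below uses scikit-learn's `permutation_importance` on a small synthetic example in which, by construction, only the first feature drives the target. The data and the feature names `x0` to `x2` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic example (assumption): only feature x0 influences the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle one feature at a time on the test set and record the drop in R²
result = permutation_importance(
    forest, X_te, y_te, n_repeats=10, random_state=42, scoring="r2"
)

for name, mean, std in zip(
    ["x0", "x1", "x2"], result.importances_mean, result.importances_std
):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

If the two rankings disagree sharply, that is a hint to treat any single "most important factor" with caution, especially when features are strongly correlated.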

In [None]:
# Results Visualization and Interpretation
print("📊 CREATING COMPREHENSIVE VISUALIZATIONS")
print("=" * 50)

# 1. Model Performance Comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# R² Score comparison
models_list = list(evaluation_results.keys())
r2_scores = [evaluation_results[model]['R²'] for model in models_list]

axes[0, 0].bar(models_list, r2_scores, color=['lightblue', 'lightgreen', 'lightcoral'])
axes[0, 0].set_title('Model R² Score Comparison', fontsize=14, fontweight='bold')
axes[0, 0].set_ylabel('R² Score')
axes[0, 0].set_ylim(min(0, min(r2_scores)), 1.05)  # R² can be negative on a test set; leave headroom for labels
for i, v in enumerate(r2_scores):
    axes[0, 0].text(i, v + 0.01, f'{v:.3f}', ha='center', fontweight='bold')

# RMSE comparison
rmse_scores = [evaluation_results[model]['RMSE'] for model in models_list]
axes[0, 1].bar(models_list, rmse_scores, color=['lightblue', 'lightgreen', 'lightcoral'])
axes[0, 1].set_title('Model RMSE Comparison', fontsize=14, fontweight='bold')
axes[0, 1].set_ylabel('RMSE')
for i, v in enumerate(rmse_scores):
    axes[0, 1].text(i, v + 0.01, f'{v:.3f}', ha='center', fontweight='bold')

# 2. Predictions vs Actual (Best Model)
best_predictions = predictions[best_model_name]
axes[1, 0].scatter(y_test, best_predictions, alpha=0.7, color='darkgreen')
axes[1, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1, 0].set_xlabel('Actual CO2 Emissions')
axes[1, 0].set_ylabel('Predicted CO2 Emissions')
axes[1, 0].set_title(f'{best_model_name}: Predictions vs Actual', fontsize=14, fontweight='bold')

# 3. Residuals plot
residuals = y_test - best_predictions
axes[1, 1].scatter(best_predictions, residuals, alpha=0.7, color='purple')
axes[1, 1].axhline(y=0, color='r', linestyle='--')
axes[1, 1].set_xlabel('Predicted CO2 Emissions')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title(f'{best_model_name}: Residual Plot', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('results/model_performance_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

# 4. Feature Importance Visualization (if available)
if best_model_name in ['Random Forest', 'XGBoost']:
    plt.figure(figsize=(12, 8))
    top_features = feature_importance.head(10)
    
    bars = plt.barh(range(len(top_features)), top_features['importance'])
    plt.yticks(range(len(top_features)), top_features['feature'])
    plt.xlabel('Feature Importance')
    plt.title(f'Top 10 Most Important Features ({best_model_name})', fontsize=16, fontweight='bold')
    plt.gca().invert_yaxis()
    
    # Add value labels on bars
    for i, bar in enumerate(bars):
        width = bar.get_width()
        plt.text(width + 0.001, bar.get_y() + bar.get_height()/2, 
                f'{width:.3f}', ha='left', va='center', fontweight='bold')
    
    plt.tight_layout()
    plt.savefig('results/feature_importance.png', dpi=300, bbox_inches='tight')
    plt.show()

# 5. Interactive Plotly visualization
fig_interactive = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Model Performance (R²)', 'Predictions vs Actual', 
                   'Feature Importance', 'Residual Distribution'),
    specs=[[{"type": "bar"}, {"type": "scatter"}],
           [{"type": "bar"}, {"type": "histogram"}]]
)

# Model performance
fig_interactive.add_trace(
    go.Bar(x=models_list, y=r2_scores, name='R² Score', 
           marker_color=['lightblue', 'lightgreen', 'lightcoral']),
    row=1, col=1
)

# Predictions vs Actual
fig_interactive.add_trace(
    go.Scatter(x=y_test, y=best_predictions, mode='markers',
               name='Predictions', marker=dict(color='darkgreen', opacity=0.7)),
    row=1, col=2
)
fig_interactive.add_trace(
    go.Scatter(x=[y_test.min(), y_test.max()], y=[y_test.min(), y_test.max()],
               mode='lines', name='Perfect Prediction', line=dict(color='red', dash='dash')),
    row=1, col=2
)

# Feature importance (if available)
if best_model_name in ['Random Forest', 'XGBoost']:
    fig_interactive.add_trace(
        go.Bar(y=top_features['feature'], x=top_features['importance'],
               orientation='h', name='Importance'),
        row=2, col=1
    )

# Residuals distribution
fig_interactive.add_trace(
    go.Histogram(x=residuals, name='Residuals', nbinsx=20),
    row=2, col=2
)

fig_interactive.update_layout(
    title=f'Comprehensive Model Analysis - {best_model_name}',
    height=800,
    showlegend=False
)

fig_interactive.write_html('results/interactive_analysis.html')
print("✅ Interactive visualization saved as HTML!")

print("\n🎯 KEY INSIGHTS:")
print(f"• Best performing model: {best_model_name}")
print(f"• Model explains {evaluation_results[best_model_name]['R²']:.1%} of variance in CO2 emissions")
print(f"• Average prediction error: {evaluation_results[best_model_name]['MAE']:.2f} tons CO2 per capita")

if best_model_name in ['Random Forest', 'XGBoost']:
    top_factor = feature_importance.iloc[0]['feature']
    print(f"• Most important factor: {top_factor}")

print("💾 All visualizations saved to results/ directory!")

## 8. Ethical Considerations and Bias Analysis ⚖️

Critical analysis of potential biases and ethical implications of our carbon emission prediction model.