In [None]:
# 🌲 Random Forest for Smart Infrastructure

## Overview
This notebook demonstrates the power of **Random Forest** for industrial predictive modeling in the context of Smart Building Energy Management. 

### Why Random Forest?
- **Ensemble Learning**: Random Forest combines multiple decision trees to reduce variance and improve generalization
- **Handles Non-linearity**: Unlike linear regression, Random Forest captures complex, non-linear relationships in data
- **Robust to Noise**: The ensemble approach makes it resilient to outliers and noisy features
- **Feature Importance**: Provides interpretable insights into which factors drive energy consumption
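
The variance-reduction point can be checked on a toy problem. The sketch below is a minimal illustration, not part of this notebook's pipeline: the sine-shaped data, the sample sizes, and the names `tree_mse` and `forest_mse` are assumptions made for the example. It fits one fully grown decision tree and a forest of 50 trees to the same noisy data and compares their test error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Toy data (assumed for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(600, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.3, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# A single unpruned tree tends to memorize the noise in its training set
single_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# A forest averages many trees, each grown on a bootstrap sample
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

tree_mse = mean_squared_error(y_te, single_tree.predict(X_te))
forest_mse = mean_squared_error(y_te, forest.predict(X_te))
print(f"Single tree test MSE: {tree_mse:.3f}")
print(f"Forest test MSE:      {forest_mse:.3f}")
```

Averaging smooths out the noise that any individual tree fits, which is why the forest's held-out error is typically lower on data like this.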

### Use Case: City-Scale Building Management System (BMS)
We'll predict energy demand based on:
- **Occupancy**: Number of people in the building
- **Outside Temperature**: Environmental conditions
- **Time-of-Day**: Temporal patterns (peak vs. off-peak hours)
- **HVAC Status**: Heating/cooling system operational state

This scenario is representative of real-world Ambient Systems applications in building management and decarbonization initiatives.

## Notebook Structure
1. **Import Required Libraries** - Set up environment and dependencies
2. **Generate Synthetic Building Energy Dataset** - Create realistic BMS data with non-linear relationships
3. **Exploratory Data Analysis** - Understand feature distributions and correlations
4. **Train Random Forest Regressor** - Build the ensemble model and a Linear Regression baseline
5. **Generate Predictions** - Visualize actual vs. predicted values
6. **Calculate Evaluation Metrics** - Quantify performance with MAE, MSE, and R²
7. **Feature Importance Analysis** - Identify key energy drivers for building managers
8. **Production Deployment** - Serialize model for cloud-native environments

## 1. Import Required Libraries

In [None]:
# Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
import joblib
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)

# Configure visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)

## 2. Generate Synthetic Building Energy Dataset

We'll create a realistic Building Management System (BMS) dataset with 5,000 samples. This data includes:
- **Non-linear relationships**: Energy demand increases non-linearly with temperature extremes
- **Noise**: Random variations to simulate real-world sensor data
- **Temporal patterns**: Peak vs. off-peak hours affect energy demand
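
To make the non-linear and temporal terms concrete, the sketch below evaluates the temperature and time-of-day formulas used in this notebook's generation cell at a few hand-picked inputs. The helper names `temp_effect_at` and `time_effect_at` are hypothetical and exist only for this illustration.

```python
import numpy as np

def temp_effect_at(temp_c):
    """Heating/cooling term: grows with the 1.5 power of the gap from 20 °C."""
    return 50 * np.abs(temp_c - 20) ** 1.5

def time_effect_at(hour):
    """Sinusoidal daily cycle with a 24-hour period."""
    return 150 * np.sin(2 * np.pi * hour / 24)

# Temperature term at the comfort point, 10 °C either side, and the range limits
for t in (20, 10, 30, -10, 40):
    print(f"temp={t:>4} °C -> temp_effect={temp_effect_at(t):8.1f} kWh")

# Time term at four points of the daily cycle
for h in (0, 6, 12, 18):
    print(f"hour={h:>2} -> time_effect={time_effect_at(h):7.1f} kWh")
```

With these formulas the temperature term is zero at 20 °C and symmetric around it, while the time term peaks at hour 6 and bottoms out at hour 18.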

In [None]:
# Generate Synthetic Building Energy Dataset
# Number of samples representing hourly energy readings
n_samples = 5000

# Feature 1: Occupancy (0-500 people in the building)
occupancy = np.random.uniform(0, 500, n_samples)

# Feature 2: Outside Temperature (in Celsius, -10 to 40°C)
outside_temp = np.random.uniform(-10, 40, n_samples)

# Feature 3: Time-of-Day (continuous hour in the range [0, 24))
time_of_day = np.random.uniform(0, 24, n_samples)

# Feature 4: HVAC Status (0-100%, system efficiency)
hvac_status = np.random.uniform(0, 100, n_samples)

# Target Variable: Energy Demand (kWh)
# Non-linear relationship: energy grows quadratically with occupancy and with
# the 1.5 power of the temperature deviation from 20°C (heating/cooling demand)
base_energy = 500

# Non-linear components demonstrate why Random Forest outperforms Linear Regression
occupancy_effect = 0.8 * occupancy + 0.001 * (occupancy ** 2)  # Quadratic term
temp_effect = 50 * np.abs(outside_temp - 20) ** 1.5  # Non-linear temperature deviation
time_effect = 150 * np.sin(2 * np.pi * time_of_day / 24)  # Sinusoidal time pattern
hvac_effect = -2 * hvac_status  # Better efficiency reduces consumption

# Add noise to simulate real-world sensor variations
noise = np.random.normal(0, 100, n_samples)

# Combine all components for final energy demand
energy_demand = base_energy + occupancy_effect + temp_effect + time_effect + hvac_effect + noise

# Floor energy demand at 100 kWh so every value stays positive
energy_demand = np.maximum(energy_demand, 100)

# Create DataFrame for easier manipulation
df = pd.DataFrame({
    'occupancy': occupancy,
    'outside_temperature': outside_temp,
    'time_of_day': time_of_day,
    'hvac_status': hvac_status,
    'energy_demand': energy_demand
})

# Display dataset information
print("Dataset Overview:")
print(f"Shape: {df.shape}")
print(f"\nFirst 5 rows:")
print(df.head())
print(f"\nDataset Statistics:")
print(df.describe())

## 3. Exploratory Data Analysis (EDA) and Correlation Heatmap

Understanding data distributions and feature relationships is crucial before model training. 
A correlation heatmap reveals which features are most strongly associated with energy demand.
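
One caveat worth keeping in mind: a correlation heatmap reports Pearson correlation, which measures linear association only. The sketch below uses made-up data, not this notebook's dataset, to show a U-shaped dependence on temperature that scores near zero until the feature is re-expressed as a deviation from 20 °C. The variable names and the 0 to 40 °C range are assumptions chosen to keep the example symmetric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Assumed toy data: temperatures symmetric around 20 °C, U-shaped energy response
temp = rng.uniform(0, 40, 5000)
energy = 50 * np.abs(temp - 20) ** 1.5

toy = pd.DataFrame({
    "temp": temp,
    "temp_deviation": np.abs(temp - 20),  # engineered feature
    "energy": energy,
})

# Series.corr computes the Pearson coefficient by default
corr_raw = toy["temp"].corr(toy["energy"])
corr_dev = toy["temp_deviation"].corr(toy["energy"])
print(f"Pearson r, raw temperature:      {corr_raw:+.3f}")
print(f"Pearson r, deviation from 20 °C: {corr_dev:+.3f}")
```

A low coefficient in the heatmap therefore does not rule out a strong non-linear effect.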

In [None]:
# Plot Correlation Heatmap
# This helps identify which features have the strongest relationships with energy demand
plt.figure(figsize=(10, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Heatmap - Building Energy Management System', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Print correlation with target variable
print("\nCorrelation with Energy Demand:")
print(correlation_matrix['energy_demand'].sort_values(ascending=False))

## 4. Train Random Forest Regressor and Compare with Linear Regression

We'll train both a Random Forest and a Linear Regression model to show how much better Random Forest
captures the non-linear relationships in this building energy data.
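
A single train/test split can flatter or penalize either model by chance. As a complementary check, the sketch below compares the two model families with 5-fold cross-validation on a small stand-in dataset. The data, the reduced `n_estimators=30`, and the variable names are assumptions for illustration; they are not this notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in data (assumed): one U-shaped driver plus one linear driver
n = 600
temp = rng.uniform(-10, 40, n)
occupancy = rng.uniform(0, 500, n)
X_cv = np.column_stack([temp, occupancy])
y_cv = 50 * np.abs(temp - 20) ** 1.5 + 0.8 * occupancy + rng.normal(0, 100, n)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=30, random_state=0),
    "Linear Regression": LinearRegression(),
}

cv_r2 = {}
for name, model in models.items():
    # scoring="r2" returns one R² value per fold
    scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring="r2")
    cv_r2[name] = scores
    print(f"{name}: mean R² = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean gives a sense of how stable the gap between the two models is.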

In [None]:
# Prepare features and target variable
X = df[['occupancy', 'outside_temperature', 'time_of_day', 'hvac_status']]
y = df['energy_demand']

# Split data into training (80%) and testing (20%) sets
# This is a plain random split: no stratification is applied, since
# train_test_split's stratify option expects class labels, not a continuous target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples\n")

# Initialize and train Random Forest Regressor
# Key hyperparameters:
# - n_estimators: Number of trees in the ensemble (100 is a good default)
# - max_depth: Maximum depth to prevent overfitting
# - random_state: For reproducibility
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1  # Use all available processors for faster training
)

# Train the Random Forest model
rf_model.fit(X_train, y_train)
print("✓ Random Forest model trained successfully")

# For comparison, also train a Linear Regression model
# This demonstrates why Random Forest is superior for non-linear data
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
print("✓ Linear Regression model trained successfully")

## 5. Generate Predictions and Visualize Actual vs. Predicted Values

Now we'll generate predictions on the test set and create a scatter plot to visualize the model's performance.

In [None]:
# Generate predictions on test set
rf_predictions = rf_model.predict(X_test)
lr_predictions = lr_model.predict(X_test)

# Create a figure with two subplots for model comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Random Forest - Actual vs. Predicted
axes[0].scatter(y_test, rf_predictions, alpha=0.5, s=20, color='green')
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Energy Demand (kWh)', fontsize=12)
axes[0].set_ylabel('Predicted Energy Demand (kWh)', fontsize=12)
axes[0].set_title('Random Forest: Actual vs. Predicted Energy Demand', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Linear Regression - Actual vs. Predicted
axes[1].scatter(y_test, lr_predictions, alpha=0.5, s=20, color='orange')
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Energy Demand (kWh)', fontsize=12)
axes[1].set_ylabel('Predicted Energy Demand (kWh)', fontsize=12)
axes[1].set_title('Linear Regression: Actual vs. Predicted Energy Demand', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Visual comparison shows Random Forest captures non-linear patterns better than Linear Regression.")

## 6. Calculate Evaluation Metrics (MAE, MSE, R²)

Evaluation metrics help us quantify model performance in business terms:

### Business Context:
- **MAE (Mean Absolute Error)**: Average prediction error in kWh. For a building manager, this is the typical gap between forecasted and actual energy consumption.
- **MSE (Mean Squared Error)**: Penalizes larger errors more heavily. Useful for budgeting when large forecast errors are costly.
- **R² Score**: Fraction of variance explained. An R² of 0.95 means the model explains 95% of energy demand variation, leaving only 5% unexplained.
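
The three definitions can be verified by hand on a tiny example. The sketch below uses four made-up readings (not taken from this notebook's dataset) and checks the manual arithmetic against scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up readings in kWh, chosen so the arithmetic is easy to follow
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_pred - y_true                         # [10, -10, 20, -20]

mae = np.mean(np.abs(errors))                    # (10 + 10 + 20 + 20) / 4
mse = np.mean(errors ** 2)                       # (100 + 100 + 400 + 400) / 4
rmse = np.sqrt(mse)                              # back in kWh
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE  = {mae:.2f} kWh")
print(f"MSE  = {mse:.2f} kWh²")
print(f"RMSE = {rmse:.2f} kWh")
print(f"R²   = {r2:.4f}")

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

Here MAE is 15 kWh, MSE is 250 kWh², and R² is 0.98: the two 20 kWh misses contribute four times as much to MSE as the two 10 kWh misses, which is the "penalizes larger errors" behavior described above.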

In [None]:
# Calculate evaluation metrics for Random Forest
rf_mae = mean_absolute_error(y_test, rf_predictions)
rf_mse = mean_squared_error(y_test, rf_predictions)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_predictions)

# Calculate evaluation metrics for Linear Regression
lr_mae = mean_absolute_error(y_test, lr_predictions)
lr_mse = mean_squared_error(y_test, lr_predictions)
lr_rmse = np.sqrt(lr_mse)
lr_r2 = r2_score(y_test, lr_predictions)

# Create comparison dataframe for readability
metrics_df = pd.DataFrame({
    'Metric': ['MAE (kWh)', 'MSE (kWh²)', 'RMSE (kWh)', 'R² Score'],
    'Random Forest': [rf_mae, rf_mse, rf_rmse, rf_r2],
    'Linear Regression': [lr_mae, lr_mse, lr_rmse, lr_r2]
})

print("=" * 70)
print("MODEL PERFORMANCE COMPARISON")
print("=" * 70)
print(metrics_df.to_string(index=False))
print("=" * 70)

# Business-Oriented Interpretation
print("\n📊 BUSINESS INTERPRETATION FOR BUILDING MANAGERS:")
print("-" * 70)
print(f"Random Forest MAE: ±{rf_mae:.2f} kWh")
print(f"  → On average, energy predictions deviate by {rf_mae:.2f} kWh from actual")
print(f"  → For a 5,000 kWh reading, this is a {100*rf_mae/5000:.1f}% relative error\n")

print(f"Random Forest R² Score: {rf_r2:.4f}")
print(f"  → Model explains {100*rf_r2:.2f}% of energy demand variation")
print(f"  → Remaining {100*(1-rf_r2):.2f}% is unexplained (weather anomalies, etc.)\n")

print(f"Improvement over Linear Regression:")
print(f"  → MAE improvement: {100*(lr_mae-rf_mae)/lr_mae:.1f}%")
print(f"  → Reduction in unexplained variance (1 - R²): {100*(rf_r2-lr_r2)/(1-lr_r2):.1f}%")
print("-" * 70)

## 7. Feature Importance Analysis for Building Managers

Feature importance tells us which factors have the greatest impact on the model's energy demand predictions.
This helps building managers decide where to focus optimization efforts.

### How to Interpret Feature Importance:
- **Higher values** = greater influence on energy demand predictions
- **Occupancy** - Does employee count drive energy more than temperature?
- **Temperature** - Are heating/cooling costs the primary driver?
- **Time-of-Day** - Do peak hours matter more than baseline consumption?
- **HVAC Status** - How much impact does system efficiency have?

Building managers can use these insights to:
1. **Prioritize energy-saving initiatives** (e.g., occupancy-based HVAC control)
2. **Forecast budgets** based on key drivers
3. **Allocate resources** to highest-impact optimization areas
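
The `feature_importances_` attribute of a scikit-learn forest is impurity-based and computed from the training data, so it can be worth cross-checking against permutation importance on held-out data. The sketch below is one way to do that with `sklearn.inspection.permutation_importance`; the toy features (`signal`, `noise_feature`) and all variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Assumed toy data: one informative feature and one pure-noise feature
X_pi = pd.DataFrame({
    "signal": rng.uniform(-10, 40, 800),
    "noise_feature": rng.uniform(0, 100, 800),
})
y_pi = 50 * np.abs(X_pi["signal"] - 20) ** 1.5 + rng.normal(0, 100, 800)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_pi, y_pi, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set and measure the drop in R²
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

perm_df = pd.DataFrame({
    "Feature": X_pi.columns,
    "Mean R² drop": result.importances_mean,
    "Std": result.importances_std,
}).sort_values(by="Mean R² drop", ascending=False)
print(perm_df.to_string(index=False))
```

If the two rankings disagree noticeably, the held-out permutation scores are usually the safer basis for prioritizing optimization work.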

In [None]:
# Extract feature importance scores from Random Forest
feature_importance = rf_model.feature_importances_
feature_names = X.columns

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values(by='Importance', ascending=False)

# Display feature importance as a table
print("=" * 70)
print("FEATURE IMPORTANCE RANKING - WHAT DRIVES ENERGY COSTS?")
print("=" * 70)
print(importance_df.to_string(index=False))
print("=" * 70)

# Create a horizontal bar chart for visual representation
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue', edgecolor='navy')
plt.xlabel('Importance Score', fontsize=12, fontweight='bold')
plt.ylabel('Features', fontsize=12, fontweight='bold')
plt.title('Feature Importance Analysis - Building Energy Management System', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

# Business recommendations based on feature importance
print("\n💡 BUILDING MANAGER RECOMMENDATIONS:")
print("-" * 70)
top_feature = importance_df.iloc[0]
print(f"PRIMARY DRIVER: {top_feature['Feature'].upper()} ({100*top_feature['Importance']:.1f}%)")
print(f"  → This factor has the greatest impact on energy consumption")
print(f"  → Focus optimization efforts here for maximum ROI\n")

# List the remaining features, labeled by their overall rank (2nd, 3rd, 4th)
for idx in range(1, min(4, len(importance_df))):
    feature = importance_df.iloc[idx]
    print(f"Rank {idx + 1}: {feature['Feature']} ({100*feature['Importance']:.1f}%)")

print("\nAction Items for Energy Optimization:")
print("  1. Implement occupancy-based HVAC control if occupancy is high-importance")
print("  2. Install smart thermostats if temperature sensitivity is high")
print("  3. Schedule maintenance during off-peak hours (if time-of-day is important)")
print("  4. Monitor HVAC system performance (affects efficiency importance)")
print("-" * 70)

## 8. Production-Ready Model Serialization and Deployment

Now we'll serialize the trained Random Forest model using joblib for deployment in cloud-native environments. 
This enables:

- **Model Versioning**: Store trained models for reproducibility and rollback
- **Inference Efficiency**: Load pre-trained models without retraining
- **Scalability**: Deploy models across microservices and edge devices
- **CI/CD Integration**: Automate model updates in production pipelines

### Deployment Workflow:
1. Serialize the trained model to disk
2. Create a prediction function for cloud API endpoints
3. Demonstrate batch prediction and single-instance scoring
4. Show how to load and use the model in production
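
The four workflow steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions: it trains a small stand-in model so the block runs on its own, writes to a temporary directory, and the file name `rf_energy_model.joblib` and the helper `predict_energy_demand` are hypothetical names, not part of any existing service.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

FEATURES = ["occupancy", "outside_temperature", "time_of_day", "hvac_status"]

# Stand-in model (assumed) so the sketch is self-contained
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "occupancy": rng.uniform(0, 500, 300),
    "outside_temperature": rng.uniform(-10, 40, 300),
    "time_of_day": rng.uniform(0, 24, 300),
    "hvac_status": rng.uniform(0, 100, 300),
})
y_demo = (500 + 0.8 * X_demo["occupancy"]
          + 50 * np.abs(X_demo["outside_temperature"] - 20) ** 1.5)
demo_model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Step 1: serialize the trained model to disk
model_dir = tempfile.mkdtemp()
model_path = os.path.join(model_dir, "rf_energy_model.joblib")  # hypothetical file name
joblib.dump(demo_model, model_path)

# Step 4: load the model as a serving process would
loaded_model = joblib.load(model_path)

# Step 2: prediction helper (hypothetical) that an API endpoint could call
def predict_energy_demand(model, records):
    """Score a list of feature dicts and return predictions in kWh."""
    frame = pd.DataFrame(records, columns=FEATURES)
    return model.predict(frame)

# Step 3: single-instance and batch scoring
single = predict_energy_demand(loaded_model, [
    {"occupancy": 250, "outside_temperature": 30, "time_of_day": 14, "hvac_status": 60},
])
batch = predict_energy_demand(loaded_model, X_demo.head(5).to_dict("records"))
print(f"Single prediction: {single[0]:.1f} kWh")
print(f"Batch predictions: {np.round(batch, 1)}")
```

Because joblib files are pickle-based, they should be loaded only from trusted sources and with a scikit-learn version compatible with the one used for training.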