# Random Forest Regressor - Household Power Consumption

**Algorithm 4 of 7**

Random Forest is an ensemble learning method that combines multiple decision trees to improve prediction accuracy and reduce overfitting.

**Key Concepts:**
- Bootstrap Aggregating (Bagging)
- Feature randomness
- Voting/averaging across trees
- Out-of-bag error estimation
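
The last concept above, out-of-bag (OOB) estimation, can be tried directly in scikit-learn. The sketch below is illustrative only: it uses synthetic data from `make_regression` as a stand-in (an assumption, not the household dataset), and the names `X_demo`, `y_demo`, and `rf_oob` are hypothetical. Setting `oob_score=True` scores each training sample using only the trees whose bootstrap sample did not contain it.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: NOT the household power dataset)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=6, noise=10.0, random_state=0
)

# oob_score=True evaluates each sample with the trees that never saw it
rf_oob = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,    # OOB estimation requires bootstrap sampling
    oob_score=True,
    random_state=0,
)
rf_oob.fit(X_demo, y_demo)

# For a regressor, oob_score_ is an R² value computed on OOB predictions
print(f"OOB R² estimate: {rf_oob.oob_score_:.4f}")
print(f"OOB predictions shape: {rf_oob.oob_prediction_.shape}")
```

The OOB score gives a rough generalization estimate without a separate validation split, which can be handy when comparing hyperparameter settings.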

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
print("✅ Libraries loaded")

## 1. Load Preprocessed Data

In [None]:
with open('../datasets/processed/household_preprocessed.pkl', 'rb') as f:
    data = pickle.load(f)

X_train = data['X_train_scaled']
X_test = data['X_test_scaled']
y_train = data['y_train']
y_test = data['y_test']

print(f"Training samples: {X_train.shape[0]:,}")
print(f"Testing samples: {X_test.shape[0]:,}")
print(f"Number of features: {X_train.shape[1]}")

## 2. Random Forest Theory

**How Random Forest Works:**

1. **Bootstrap Sampling:** Create multiple random subsets of training data (sampling with replacement)

2. **Build Decision Trees:** Train a decision tree on each bootstrap sample
   - At each split, consider only a random subset of features
   - This introduces randomness and decorrelates the trees

3. **Aggregate Predictions:** Average predictions from all trees
   - For regression: Mean of all tree predictions
   - For classification: Majority vote

**Mathematical Formula:**
$$\hat{y} = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)$$

Where:
- $B$ = number of trees
- $\hat{f}_b(x)$ = prediction from tree $b$
- $\hat{y}$ = final ensemble prediction
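
The averaging formula can be checked numerically. The following is a minimal sketch on synthetic data (an assumption; `X_demo`, `y_demo`, and `forest` are illustrative names, not part of this notebook's pipeline). It collects the prediction $\hat{f}_b(x)$ of each fitted tree from `estimators_`, averages them, and compares the result with `forest.predict`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: NOT the household power dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=5.0, random_state=0
)

B = 25  # number of trees
forest = RandomForestRegressor(n_estimators=B, random_state=0)
forest.fit(X_demo, y_demo)

# Stack the per-tree predictions f_b(x): shape (B, n_samples)
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# y_hat = (1/B) * sum over b of f_b(x)
manual_mean = per_tree.mean(axis=0)

matches = np.allclose(manual_mean, forest.predict(X_demo))
print(f"Per-tree prediction matrix shape: {per_tree.shape}")
print(f"Manual average matches forest.predict: {matches}")
```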

## 3. Train Random Forest Model

In [None]:
print("="*70)
print("RANDOM FOREST REGRESSOR")
print("="*70)

# Initialize model with 100 trees
rf_model = RandomForestRegressor(
    n_estimators=100,      # Number of trees
    max_depth=15,          # Maximum depth of each tree
    min_samples_split=5,   # Minimum samples to split a node
    random_state=42,       # For reproducibility
    n_jobs=-1              # Use all CPU cores
)

# Train the model
print("\nTraining Random Forest with 100 trees...")
rf_model.fit(X_train, y_train.values.ravel())
print("✅ Training complete!")

# Make predictions
y_pred_train = rf_model.predict(X_train)
y_pred_test = rf_model.predict(X_test)

print(f"\nModel trained with {rf_model.n_estimators} trees")
print(f"Maximum tree depth: {rf_model.max_depth}")

## 4. Model Evaluation

In [None]:
# Calculate metrics for training set
train_r2 = r2_score(y_train, y_pred_train)
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
train_mae = mean_absolute_error(y_train, y_pred_train)

# Calculate metrics for test set
test_r2 = r2_score(y_test, y_pred_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
test_mae = mean_absolute_error(y_test, y_pred_test)

print("="*70)
print("PERFORMANCE METRICS")
print("="*70)

print("\n📊 Training Set:")
print(f"   R² Score: {train_r2:.4f}")
print(f"   RMSE: {train_rmse:.4f}")
print(f"   MAE: {train_mae:.4f}")

print("\n📊 Test Set:")
print(f"   R² Score: {test_r2:.4f}")
print(f"   RMSE: {test_rmse:.4f}")
print(f"   MAE: {test_mae:.4f}")

# Check for overfitting
print("\n🔍 Overfitting Check:")
r2_diff = train_r2 - test_r2
if r2_diff < 0.05:
    print(f"   ✅ Good generalization (R² difference: {r2_diff:.4f})")
else:
    print(f"   ⚠️  Potential overfitting (R² difference: {r2_diff:.4f})")
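
To make the three metrics concrete, here is a small worked sketch that computes RMSE, MAE, and R² by hand with NumPy and compares them with the scikit-learn functions used in the cell above. The arrays `y_true` and `y_hat` are hypothetical toy values chosen for illustration, not results from this model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical toy values (assumption: NOT outputs of the trained forest)
y_true = np.array([1.2, 0.8, 2.5, 3.1, 1.9])
y_hat = np.array([1.0, 1.1, 2.3, 2.8, 2.2])

errors = y_true - y_hat

# RMSE: square root of the mean squared error
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors
mae = np.mean(np.abs(errors))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE manual={rmse:.4f} sklearn={np.sqrt(mean_squared_error(y_true, y_hat)):.4f}")
print(f"MAE  manual={mae:.4f} sklearn={mean_absolute_error(y_true, y_hat):.4f}")
print(f"R²   manual={r2:.4f} sklearn={r2_score(y_true, y_hat):.4f}")
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging, so a gap between the two can hint at a few large misses.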

## 5. Feature Importance Analysis

In [None]:
# Extract feature importances
feature_importance = pd.DataFrame({
    'Feature': data['feature_names'],
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("="*70)
print("FEATURE IMPORTANCE")
print("="*70)
print(feature_importance.to_string(index=False))

# Visualize
plt.figure(figsize=(12, 6))
plt.barh(range(len(feature_importance)), feature_importance['Importance'].values, 
         color='steelblue', edgecolor='black')
plt.yticks(range(len(feature_importance)), feature_importance['Feature'].values)
plt.xlabel('Importance Score', fontweight='bold', fontsize=12)
plt.ylabel('Features', fontweight='bold', fontsize=12)
plt.title('Random Forest - Feature Importance', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\n🔑 Most Important Feature: {feature_importance.iloc[0]['Feature']} ({feature_importance.iloc[0]['Importance']:.4f})")
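
The `feature_importances_` attribute used above is impurity-based and computed from the training data, so it can be worth cross-checking against permutation importance on held-out data. The sketch below shows one possible way to do that with `sklearn.inspection.permutation_importance`. It runs on synthetic data, and the feature names (`feature_0`, ...) and variable names are hypothetical. It is not the analysis performed in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the household power dataset)
X_demo, y_demo = make_regression(
    n_samples=400, n_features=5, n_informative=3, noise=5.0, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the score drop
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)

comparison = pd.DataFrame({
    "Feature": names,
    "Impurity": model.feature_importances_,
    "Permutation": result.importances_mean,
}).sort_values("Permutation", ascending=False)

print(comparison.to_string(index=False))
```

If the two rankings disagree sharply, the impurity-based ranking may be reflecting how the trees split on the training data rather than how much each feature helps on unseen data.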

## 6. Visualizations

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Predicted vs Actual
axes[0].scatter(y_test, y_pred_test, alpha=0.5, s=10)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Power (kW)', fontweight='bold', fontsize=12)
axes[0].set_ylabel('Predicted Power (kW)', fontweight='bold', fontsize=12)
axes[0].set_title(f'Random Forest: Predicted vs Actual (R² = {test_r2:.4f})', 
                  fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Residuals
residuals = y_test.values.ravel() - y_pred_test
axes[1].scatter(y_pred_test, residuals, alpha=0.5, s=10)
axes[1].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Predicted Power (kW)', fontweight='bold', fontsize=12)
axes[1].set_ylabel('Residuals', fontweight='bold', fontsize=12)
axes[1].set_title('Residuals Plot', fontsize=14, fontweight='bold')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Conclusions

**Random Forest Results:**
- Algorithm 4 of 7 successfully implemented
- Ensemble of 100 decision trees
- Test-set R² is reported by the evaluation cell in Section 4 (`test_r2`)

**Key Advantages:**
- Reduces overfitting compared to a single decision tree
- Provides feature importance rankings
- Relatively robust to noise, because averaging many trees smooths out individual-tree errors
- Handles non-linear relationships well
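
The first advantage can be illustrated with a small comparison. The sketch below fits an unpruned `DecisionTreeRegressor` and a `RandomForestRegressor` on the synthetic, non-linear `make_friedman1` benchmark (an assumption chosen for illustration, not the household data) and prints the train/test R² gap for each. Variable names are hypothetical.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear benchmark with noise (assumption: NOT the household data)
X_demo, y_demo = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# An unpruned single tree can memorize the training set, noise included
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# The forest averages many decorrelated trees
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

for name, model in [("Single tree", tree), ("Random forest", forest)]:
    train_score = model.score(X_tr, y_tr)  # R² on training data
    test_score = model.score(X_te, y_te)   # R² on held-out data
    print(f"{name:14s} train R²={train_score:.4f} "
          f"test R²={test_score:.4f} gap={train_score - test_score:.4f}")
```

On data like this, the single tree typically shows a much larger train/test gap than the forest. Results on the household dataset may differ.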

**Applications:**
- Well suited to tabular regression tasks such as power consumption prediction
- Feature importance rankings help identify the inputs that drive consumption
- Those rankings can guide energy efficiency improvements