ENERGY CONSUMPTION FORECASTING

The goal of this project is to develop a machine learning model that can forecast 
household electricity consumption. Accurate consumption forecasting is essential 
for optimizing energy production, managing demand, and supporting sustainability efforts.

In Week 1, we defined the problem and explored the dataset.  
We initially experimented with a statistical baseline model (ARIMA).  

For Week 2, we transition to a **Random Forest Regressor**, which better captures 
non-linear patterns and allows us to include calendar-based features alongside lagged values.




In [None]:
#  Import all libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

import warnings
warnings.filterwarnings("ignore")


Dataset:

- Source: UCI Machine Learning Repository – Individual Household Electric Power Consumption
- Time range: December 2006 – November 2010
- Frequency: minute-level measurements
- Key features:
  - Global active power (kilowatts)
  - Global reactive power (kilowatts)
  - Voltage (volts)
  - Global intensity (amperes)
  - Sub-metering values 1, 2, 3 (watt-hours of active energy)

In [None]:
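# Load dataset ('?' marks missing values in this file)
data = pd.read_csv("household_power_consumption.txt",
                    sep=';',
                    low_memory=False,
                    na_values=['?'])

# Combine Date and Time into a single Datetime column.
# Dates are day-first (dd/mm/yyyy), so the format is given explicitly.
data['Datetime'] = pd.to_datetime(data['Date'] + ' ' + data['Time'],
                                  format='%d/%m/%Y %H:%M:%S')
data = data.drop(columns=['Date', 'Time'])

# Check shape & info (info() prints directly and returns None)
print("Shape:", data.shape)
data.info()
data.head()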


In [None]:
# Keep datetime as index
data = data.set_index('Datetime')

# Focus on one key variable
data['Global_active_power'] = pd.to_numeric(data['Global_active_power'], errors='coerce')

# Drop missing values
data = data.dropna(subset=['Global_active_power'])

# Resample to hourly means (to reduce noise and computation);
# hours with no readings become NaN
data = data.resample('h').mean()

data.head()

In [None]:
# Summary statistics
print(data['Global_active_power'].describe())

# Plot consumption over time (sample)
plt.figure(figsize=(12,5))
data['Global_active_power'][:1000].plot()
plt.title("Household Global Active Power (first 1000 hours)")
plt.ylabel("kW")
plt.show()

# Distribution
plt.figure(figsize=(6,4))
sns.histplot(data['Global_active_power'], bins=50, kde=True)
plt.title("Distribution of Global Active Power")
plt.show()


Model Selection

In Week 1, we explored the dataset and established the problem definition.  
At that stage, we experimented with a classical time-series approach (ARIMA) as a baseline model.  

For Week 2, we decided to transition from ARIMA to a Random Forest Regressor.  
This change is motivated by the following considerations:

- Flexibility: ARIMA is a linear model that assumes the series is stationary after differencing and needs a seasonal extension (SARIMA) to capture daily or weekly cycles, while Random Forest can model non-linear relationships without those assumptions.

- Feature usage: Random Forest allows us to incorporate not only lagged consumption values but also calendar features (hour, weekday, month), which are highly relevant for electricity demand.  
 
Therefore, the Random Forest Regressor was selected as the Week 2 model for forecasting household energy consumption.  


Model Implementation

We first create new features from the dataset:  
- Calendar features: hour, day of month, month, weekday.  
- Lag feature: the previous hour's average active power (`lag1`).  

These features help the model learn temporal patterns in energy demand.  
We then train a Random Forest Regressor on the first 80% of the data in chronological order and evaluate on the remaining 20%, so the model is never tested on hours that precede its training data.  
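
To make the lag feature and the chronological split concrete, here is a minimal sketch on a toy series. The six readings and the names `toy`, `train`, and `test` are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Hypothetical toy series: six hourly readings in kW (illustrative values only)
idx = pd.date_range("2007-01-01 00:00", periods=6, freq="h")
toy = pd.DataFrame({"Global_active_power": [1.2, 0.9, 0.8, 1.5, 2.1, 1.7]}, index=idx)

# Calendar features come straight from the DatetimeIndex
toy["hour"] = toy.index.hour
toy["weekday"] = toy.index.weekday  # Monday = 0

# Lag feature: each row sees the previous hour's value
toy["lag1"] = toy["Global_active_power"].shift(1)

# The first row has no predecessor, so its lag is NaN and the row is dropped
toy = toy.dropna()

# Chronological 80/20 split: the latest rows form the test set
split = int(len(toy) * 0.8)
train, test = toy.iloc[:split], toy.iloc[split:]

print(toy)
print(len(train), "training rows,", len(test), "test rows")
```

Because `shift(1)` is applied before any rows are dropped, every remaining `lag1` value is the reading from exactly one hour earlier.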


In [None]:
# Feature engineering
data = data.copy()
data['hour'] = data.index.hour
data['day'] = data.index.day
data['month'] = data.index.month
data['weekday'] = data.index.weekday
data['lag1'] = data['Global_active_power'].shift(1)

data = data.dropna()

# Features and target
X = data[['hour','day','month','weekday','lag1']]
y = data['Global_active_power']

# Train-test split (time-based)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Model training
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_pred = rf.predict(X_test)


Model Evaluation

The Random Forest Regressor is evaluated using Mean Absolute Error (MAE)
and Root Mean Squared Error (RMSE), both expressed in kW, the unit of the target.  

- MAE indicates the average absolute difference between predicted and actual consumption.  
- RMSE penalizes larger errors more strongly, providing a stricter measure of accuracy.  

Additionally, we visualize the comparison between actual and predicted values to assess model performance qualitatively.  
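
As a small worked example of how the two metrics differ, the sketch below computes both by hand with NumPy. The `actual` and `predicted` arrays are hypothetical values chosen so that a single large miss dominates the error.

```python
import numpy as np

# Hypothetical actual and predicted hourly values in kW
actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.0, 2.0, 3.0, 8.0])

errors = predicted - actual             # [0, 0, 0, 4]
mae = np.mean(np.abs(errors))           # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))    # square, average, then take the root

print("MAE:", mae)    # MAE: 1.0
print("RMSE:", rmse)  # RMSE: 2.0
```

Three of the four predictions are exact, yet the single 4 kW miss makes RMSE twice as large as MAE. That gap is what "penalizes larger errors more strongly" means in practice.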


In [None]:
# Evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("MAE:", mae)
print("RMSE:", rmse)

# Plot actual vs predicted
plt.figure(figsize=(12,6))
plt.plot(y_test.values[:200], label="Actual", linewidth=2)
plt.plot(y_pred[:200], label="Predicted", linewidth=2)
plt.xlabel("Test sample index")
plt.ylabel("kW")
plt.legend()
plt.title("Random Forest: Actual vs Predicted (first 200 test points)")
plt.show()


In [None]:
# 🔑 Feature Importance from Random Forest
importances = rf.feature_importances_
features = X.columns

# Plot feature importances
plt.figure(figsize=(8,5))
sns.barplot(x=importances, y=features, palette="viridis")
plt.title("Feature Importance (Random Forest)")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()
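
The `feature_importances_` values plotted above are impurity-based and computed on the training data. One possible cross-check, not part of the original analysis, is scikit-learn's `permutation_importance`, which measures how much the held-out score drops when a single column is shuffled. The sketch below runs on synthetic stand-in data; `X_demo`, `y_demo`, and `model` are hypothetical names, and the coefficients are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in data (an assumption for illustration, not the household dataset):
# the target depends strongly on "lag1", weakly on "hour", and not at all on "noise"
rng = np.random.default_rng(0)
n = 500
lag1 = rng.normal(size=n)
hour = rng.integers(0, 24, size=n)
noise = rng.normal(size=n)
X_demo = np.column_stack([lag1, hour, noise])
y_demo = 3.0 * lag1 + 0.05 * hour + rng.normal(scale=0.1, size=n)

# Hold out the last 20% of rows, mirroring an unshuffled split
split = int(n * 0.8)
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_demo[:split], y_demo[:split])

# Shuffle one column at a time on the held-out rows and measure the drop in R^2
result = permutation_importance(
    model, X_demo[split:], y_demo[split:], n_repeats=5, random_state=42
)
for name, score in zip(["lag1", "hour", "noise"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```

Applied to the notebook's own objects, the equivalent call would pass `rf`, `X_test`, and `y_test`.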