# PowerPulse — Household Energy Usage Forecast

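This notebook walks through **EDA → Preprocessing → Feature Engineering → Modeling → Evaluation** for the provided dataset. The data was supplied as `/mnt/data/data_set.csv`; the loading cell reads `data_set.csv` from the working directory, so adjust `PATH` if your copy lives elsewhere.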

**Author:** Generated by ChatGPT

---

## 0. Dependencies & Setup

Install packages (run in terminal if needed):

```bash
pip install pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm joblib statsmodels missingno jupyter
```


In [1]:
import warnings
warnings.filterwarnings('ignore')

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import joblib

plt.rcParams['figure.figsize'] = (10,4)
sns.set(style='whitegrid')

print('Libraries loaded')

Libraries loaded


## 1. Load dataset

The dataset was provided at `/mnt/data/data_set.csv`. The cell below reads `data_set.csv` relative to the working directory; update `PATH` if your copy lives elsewhere.

In [2]:
# Path to dataset
PATH = "data_set.csv"

# Load
try:
    df = pd.read_csv(PATH, low_memory=False)
    print('Loaded dataset shape:', df.shape)
except Exception:
    print('Failed to load dataset from', PATH)
    raise  # re-raise with the original traceback intact

# Show columns & head
print(df.columns.tolist())
df.head()

Loaded dataset shape: (2075259, 10)
['Unnamed: 0', 'Date', 'Time', 'Global_active_power', 'Global_reactive_power', 'Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']


| | Unnamed: 0 | Date | Time | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 16/12/2006 | 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 1 | 1 | 16/12/2006 | 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2 | 2 | 16/12/2006 | 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 3 | 3 | 16/12/2006 | 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 4 | 4 | 16/12/2006 | 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |
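
The preview shows only clean numeric rows, but later cells coerce text to numbers, which suggests that some readings are stored as non-numeric placeholders. As a minimal sketch, assuming such a placeholder is the string `?` (an assumption; the notebook does not say which marker the file uses), `read_csv` can convert it to NaN while parsing. The three-row excerpt below is hypothetical.

```python
import io
import pandas as pd

# Hypothetical excerpt; the "?" placeholder is an assumption, not taken from the file.
raw = io.StringIO(
    "Date,Time,Global_active_power,Voltage\n"
    "16/12/2006,17:24:00,4.216,234.84\n"
    "16/12/2006,17:25:00,?,?\n"
    "16/12/2006,17:26:00,5.374,233.29\n"
)

# na_values maps the placeholder to NaN during parsing, so the measurement
# columns arrive as float64 instead of object/text.
sample = pd.read_csv(raw, na_values=["?"])

print(sample.dtypes)
print(sample.isna().sum())
```

If the real file uses a different marker, only the `na_values` list needs to change.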


## 2. Initial cleaning & datetime parsing

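Combine the `Date` and `Time` columns into a `datetime` index when both are present; otherwise fall back to the first text column that parses as datetimes. `Date`, `Time`, and the leftover `Unnamed: 0` row counter are dropped once the index is built, because the numeric conversion at the end of the cell would otherwise turn the two text columns into all-NaN columns and the later `dropna()` would remove every row.

In [None]:
df_clean = df.copy()

# Drop the saved row counter(s); they carry no signal
df_clean = df_clean.drop(columns=[c for c in df_clean.columns if str(c).startswith('Unnamed')])

# Combine Date & Time if present
if 'Date' in df_clean.columns and 'Time' in df_clean.columns:
    df_clean['datetime'] = pd.to_datetime(
        df_clean['Date'].astype(str) + ' ' + df_clean['Time'].astype(str),
        dayfirst=True, errors='coerce')
    # Date and Time now live in the index, so remove the text columns
    df_clean = df_clean.drop(columns=['Date', 'Time']).set_index('datetime')
else:
    # Fall back to the first text column in which most values parse as datetimes.
    # Numeric columns are skipped: to_datetime would read them as epoch offsets.
    for col in df_clean.columns:
        if pd.api.types.is_numeric_dtype(df_clean[col]):
            continue
        parsed = pd.to_datetime(df_clean[col], errors='coerce')
        if parsed.notna().mean() > 0.5:
            df_clean = df_clean.drop(columns=[col]).set_index(parsed)
            break

if not isinstance(df_clean.index, pd.DatetimeIndex):
    raise ValueError('No datetime-like column found; cannot build a time index.')

# Rows whose timestamp failed to parse (NaT) cannot be resampled; drop them, then sort
df_clean = df_clean[df_clean.index.notna()].sort_index()

if df_clean.index.name is None:
    df_clean.index.name = 'datetime'

print('Index name:', df_clean.index.name)
print('Index sample:', df_clean.index[:3])

# Convert numeric columns stored as text; values that are not numbers become NaN
for col in df_clean.columns:
    if not pd.api.types.is_numeric_dtype(df_clean[col]):
        df_clean[col] = pd.to_numeric(
            df_clean[col].astype(str).str.replace(',', '.', regex=False).str.strip(),
            errors='coerce')

print('After conversion, numeric cols:', df_clean.select_dtypes(include=['number']).columns.tolist())

df_clean.head()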
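
The dates in the preview (`16/12/2006`) are day-first, which is why the parsing call needs `dayfirst=True`. A small sketch of the same idea with an explicit format string is shown below; the `demo` frame and its rows are made up for illustration, and the `%d/%m/%Y %H:%M:%S` format is inferred from the preview rather than documented anywhere.

```python
import pandas as pd

# Hypothetical rows: two valid day-first dates and one unparseable entry
demo = pd.DataFrame({
    "Date": ["16/12/2006", "01/02/2007", "not a date"],
    "Time": ["17:24:00", "00:00:00", "00:00:00"],
})

# An explicit format removes any day/month ambiguity;
# errors="coerce" turns rows that do not match into NaT.
stamp = pd.to_datetime(
    demo["Date"] + " " + demo["Time"],
    format="%d/%m/%Y %H:%M:%S",
    errors="coerce",
)

print(stamp)
print("Unparsed rows:", int(stamp.isna().sum()))
```

Here `01/02/2007` is read as 1 February, not 2 January, and the malformed row surfaces as `NaT` so it can be counted or dropped before resampling.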

## 3. Missing values & basic statistics

Show missing values, basic stats, and visualize missingness.

In [None]:
display(df_clean.isna().sum())

# Basic stats for numeric columns
display(df_clean.describe().T)

# Visualize missingness on a sample of at most 10,000 rows, kept in time order
try:
    sample = df_clean.sample(n=min(len(df_clean), 10000), random_state=42).sort_index()
    msno.matrix(sample)
    plt.title('Missingness matrix (sample)')
    plt.show()
except Exception as e:
    print('missingno failed:', e)
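
Per-column totals do not show whether missing readings are scattered or clustered into long outages, which matters once gaps are interpolated. The sketch below measures the longest run of consecutive missing values on a made-up minute-level series; the data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy minute-level series with one 2-minute gap and one 3-minute gap
idx = pd.date_range("2007-01-01", periods=10, freq="min")
s = pd.Series(
    [1.0, np.nan, np.nan, 2.0, 3.0, np.nan, np.nan, np.nan, 4.0, 5.0],
    index=idx,
)

is_na = s.isna()
missing_share = is_na.mean()  # fraction of readings that are missing

# Give every run of identical flags its own id, then count the NaNs in each run
run_id = (is_na != is_na.shift()).cumsum()
gap_lengths = is_na.groupby(run_id).sum()
longest_gap = int(gap_lengths.max())

print(f"missing share: {missing_share:.0%}, longest gap: {longest_gap} readings")
```

Applied to a real column such as `df_clean['Global_active_power']`, the same three lines give the longest outage in minutes.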


## 4. Exploratory Data Analysis (EDA)

Univariate distributions and time-series trends. We'll focus on `Global_active_power` if present.

In [None]:
num_cols = df_clean.select_dtypes(include=['number']).columns.tolist()
print(num_cols)

# Plot distributions for the first six numeric columns
for col in num_cols[:6]:
    plt.figure(figsize=(8,3))
    sns.histplot(df_clean[col].dropna(), kde=True)
    plt.title(f'Distribution: {col}')
    plt.tight_layout()
    plt.show()

# Time series trend for Global_active_power
if 'Global_active_power' in df_clean.columns:
    # 'h' is the hourly alias; pandas older than 2.2 also accepts 'H'
    series = df_clean['Global_active_power'].dropna().resample('h').mean()
    plt.figure(figsize=(14,4))
    plt.plot(series.index, series.values)
    plt.title('Global_active_power (hourly mean)')
    plt.ylabel('kW')
    plt.show()

    # seasonal decomposition (use 24-hour period)
    try:
        dec = seasonal_decompose(series.ffill(), model='additive', period=24)
        fig = dec.plot()
        fig.set_size_inches(12,9)
        plt.show()
    except Exception as e:
        print('Seasonal decomposition failed:', e)
else:
    print('Global_active_power not in columns; skip time series plots')
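
A 24-hour seasonal period implies a repeating daily shape. One simple way to inspect that shape directly is to average the series by hour of day. The sketch below does this on synthetic data with a midday peak; the sine curve, noise level, and names are assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic two-week hourly series with a daily cycle peaking around noon
idx = pd.date_range("2007-01-01", periods=24 * 14, freq="h")
hours = idx.hour.to_numpy()
rng = np.random.default_rng(0)
load = 1.0 + 0.5 * np.sin(2 * np.pi * (hours - 6) / 24) + rng.normal(0, 0.01, len(idx))
s = pd.Series(load, index=idx, name="Global_active_power")

# Mean load for each hour of the day (0..23)
profile = s.groupby(s.index.hour).mean()
peak_hour = int(profile.idxmax())

print(profile.round(2))
print("Peak hour:", peak_hour)
```

On the notebook's data the same `groupby` can be applied to the hourly `series` built above; grouping by `s.index.dayofweek` instead gives a weekly profile.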

## 5. Preprocessing

Steps: resample to hourly, interpolate missing values, cap outliers using IQR. Save preprocessed CSV.

In [None]:
df_proc = df_clean.copy()
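# Drop columns with no numeric content (e.g. text columns coerced entirely to NaN),
# which would otherwise make the later dropna() discard every row
df_proc = df_proc.dropna(axis=1, how='all')

# Resample to hourly mean ('h' is the hourly alias; pandas older than 2.2 also accepts 'H').
# A failure here should stop the notebook rather than continue on minute-level data.
df_proc = df_proc.resample('h').mean()

# Time-weighted interpolation fills at most 6 consecutive missing hours per gap;
# anything still missing is forward-filled, then back-filled
df_proc = df_proc.interpolate(method='time', limit=6)
df_proc = df_proc.ffill().bfill()

# Cap outliers (IQR) for numeric columns.
# Note: the bounds use the whole series, including the hours later held out for testing,
# and the target column is capped as well.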
def cap_iqr(series):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5*iqr
    upper = q3 + 1.5*iqr
    return series.clip(lower, upper)

for c in df_proc.select_dtypes(include=['number']).columns:
    df_proc[c] = cap_iqr(df_proc[c])

# Save
proc_path = '/mnt/data/preprocessed_hourly.csv'
df_proc.to_csv(proc_path)
print('Saved preprocessed data to', proc_path)
df_proc.head()
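
To make the two cleaning rules concrete, here is a worked example on toy data: IQR capping of a series with one extreme value, and `interpolate(limit=6)` applied to an 8-hour gap. The numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# --- IQR capping on a toy series with one extreme value ---
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
q1, q3 = s.quantile(0.25), s.quantile(0.75)    # 2.0 and 4.0
iqr = q3 - q1                                   # 2.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # -1.0 and 7.0
capped = s.clip(lower, upper)
print(capped.tolist())   # [1.0, 2.0, 3.0, 4.0, 7.0]

# --- interpolate(limit=6) on an 8-hour gap ---
idx = pd.date_range("2007-01-01", periods=10, freq="h")
gap = pd.Series([0.0] + [np.nan] * 8 + [9.0], index=idx)

partly_filled = gap.interpolate(method="time", limit=6)
print(int(partly_filled.isna().sum()))   # 2: the last two hours of the gap stay empty

# The remaining hours are then forward-filled from the last interpolated value
filled = partly_filled.ffill().bfill()
print(filled.tolist())
```

The second half shows that `limit=6` does not skip long gaps: it fills their first six hours and leaves the rest for the forward fill, so long outages end in a flat segment.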

## 6. Feature Engineering

Create time features, lag features, rolling statistics, and prepare train/test split.

In [None]:
df_feat = df_proc.copy()
# Time features
if isinstance(df_feat.index, pd.DatetimeIndex):
    df_feat['hour'] = df_feat.index.hour
    df_feat['day'] = df_feat.index.day
    df_feat['dayofweek'] = df_feat.index.dayofweek
    df_feat['month'] = df_feat.index.month
    df_feat['is_weekend'] = (df_feat['dayofweek'] >= 5).astype(int)

# Ensure target exists
if 'Global_active_power' not in df_feat.columns:
    raise ValueError('Global_active_power not found in preprocessed data. Cannot proceed with modeling.')

# Lags and rolling
lags = [1,2,3,24]
for lag in lags:
    df_feat[f'lag_{lag}'] = df_feat['Global_active_power'].shift(lag)

# Rolling statistics over the preceding hours only: shift(1) keeps the current
# hour's target value out of its own features
past = df_feat['Global_active_power'].shift(1)
roll_windows = [3,6,12,24]
for w in roll_windows:
    df_feat[f'roll_mean_{w}'] = past.rolling(window=w, min_periods=1).mean()
    df_feat[f'roll_std_{w}'] = past.rolling(window=w, min_periods=1).std().fillna(0)

# Drop NA created by lags
df_feat = df_feat.dropna()

# Train-test split (time-based): last 20% as test
split_idx = int(len(df_feat)*0.8)
train = df_feat.iloc[:split_idx]
test = df_feat.iloc[split_idx:]

train.to_csv('/mnt/data/train_features.csv')
test.to_csv('/mnt/data/test_features.csv')
print('Saved train/test feature CSVs. Train shape:', train.shape, 'Test shape:', test.shape)

train.head()
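
Whether a rolling window includes the current hour matters for forecasting: a window that contains the value being predicted leaks the answer into the features. The toy example below contrasts the two variants on a six-point series; the column names `roll_mean_3_incl` and `roll_mean_3_past` are made up for this illustration.

```python
import pandas as pd

idx = pd.date_range("2007-01-01", periods=6, freq="h")
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx, name="y")

feat = pd.DataFrame({"y": y})
feat["lag_1"] = y.shift(1)

# Window ends at the current hour, so it contains the target itself
feat["roll_mean_3_incl"] = y.rolling(3, min_periods=1).mean()

# Window ends at the previous hour: only information available before the target
feat["roll_mean_3_past"] = y.shift(1).rolling(3, min_periods=1).mean()

print(feat)
```

At the fourth hour (`y = 4.0`) the inclusive mean is 3.0 because it averages 2, 3, and 4, while the past-only mean is 2.0 from 1, 2, and 3.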

## 7. Modeling

Train a baseline Linear Regression and a Random Forest on the training window, then evaluate both on the held-out test window using RMSE, MAE, and R².

In [None]:
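target = 'Global_active_power'
# Every other column is used as a feature. This includes same-hour readings such as
# Global_intensity; for a forecast made before the hour is observed, restrict this
# list to the time, lag, and rolling columns.
features = [c for c in train.columns if c != target]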

X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
pred_lr = lr.predict(X_test)

# Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

# Evaluation function
def eval_model(y_true, y_pred):
    return dict(RMSE = float(np.sqrt(mean_squared_error(y_true, y_pred))),
                MAE = float(mean_absolute_error(y_true, y_pred)),
                R2 = float(r2_score(y_true, y_pred)))

scores = {
    'LinearRegression': eval_model(y_test, pred_lr),
    'RandomForest': eval_model(y_test, pred_rf)
}

print(scores)

# Save the Random Forest (kept by default; compare `scores` before treating it as the best model)
model_path = '/mnt/data/best_model_rf.pkl'
joblib.dump(rf, model_path)
print('Saved model to', model_path)
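
Model scores are easier to judge against a naive reference. A common choice for hourly load is the seasonal-naive forecast, which predicts each hour with the value from 24 hours earlier. The sketch below computes RMSE, MAE, and R² for that baseline with plain NumPy on synthetic data; the series is invented, and the approach is offered as a comparison point, not as part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Synthetic ten-day hourly series with a perfect daily cycle...
idx = pd.date_range("2007-01-01", periods=24 * 10, freq="h")
hours = idx.hour.to_numpy()
y = pd.Series(1.0 + 0.5 * np.sin(2 * np.pi * hours / 24), index=idx)
# ...except the last day, which runs 0.1 higher so the baseline is not perfect
y.iloc[-24:] += 0.1

# Seasonal-naive forecast: same hour, previous day
pred = y.shift(24)
mask = pred.notna()
y_true = y[mask].to_numpy()
y_pred = pred[mask].to_numpy()

err = y_true - y_pred
rmse = float(np.sqrt(np.mean(err ** 2)))
mae = float(np.mean(np.abs(err)))
r2 = float(1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2))

print({"RMSE": round(rmse, 4), "MAE": round(mae, 4), "R2": round(r2, 4)})
```

In the notebook's own frames, the equivalent check would score `test['lag_24']` against `y_test` with `eval_model`, since `lag_24` is the target shifted by 24 hours.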

## 8. Evaluation & Interpretation

Plot Actual vs Predicted, residuals, and feature importances.

In [None]:
# Actual vs Predicted
plt.figure(figsize=(14,4))
plt.plot(y_test.index, y_test.values, label='Actual')
plt.plot(y_test.index, pred_rf, label='Predicted (RF)')
plt.legend()
plt.title('Actual vs Predicted - Global_active_power')
plt.show()

# Residuals
residuals = y_test.values - pred_rf
plt.figure(figsize=(8,3))
plt.hist(residuals, bins=50)
plt.title('Residual distribution')
plt.show()

# Feature importances
try:
    importances = rf.feature_importances_
    fi = sorted(zip(features, importances), key=lambda x: x[1], reverse=True)[:20]
    names = [x[0] for x in fi]
    vals = [x[1] for x in fi]
    plt.figure(figsize=(8,6))
    plt.barh(names[::-1], vals[::-1])
    plt.title('Top feature importances (RF)')
    plt.tight_layout()
    plt.show()
except Exception as e:
    print('Feature importance plotting failed:', e)

# Print evaluation metrics
import json

print('Evaluation metrics:')
print(json.dumps(scores, indent=2))
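
A residual histogram shows the overall error spread but not when the errors occur. Grouping residuals by hour of day can reveal systematic misses at particular times. The sketch below uses invented data in which the predictions run 0.2 low during the evening; `actual` and `predicted` stand in for `y_test` and `pred_rf`.

```python
import numpy as np
import pandas as pd

# Invented one-week example: predictions are 0.2 too low between 18:00 and 21:59
idx = pd.date_range("2007-01-01", periods=24 * 7, freq="h")
hours = idx.hour.to_numpy()
actual = pd.Series(np.ones(len(idx)), index=idx)
predicted = actual - np.where((hours >= 18) & (hours <= 21), 0.2, 0.0)

residuals = actual - predicted

# Mean and spread of the residual for each hour of the day
by_hour = residuals.groupby(residuals.index.hour).agg(["mean", "std"])
worst_hour = int(by_hour["mean"].abs().idxmax())

print(by_hour.loc[16:22].round(3))
print("Hour with the largest mean error:", worst_hour)
```

With the notebook's variables, `pd.Series(y_test.values - pred_rf, index=y_test.index)` would take the place of `residuals`.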