# Daily Energy Usage Pattern Clustering (Household Power)

This notebook is a **different energy project** from the forecasting work.
Here we focus on **unsupervised learning**:

- We use a public **household energy consumption** dataset.
- We turn raw high-frequency measurements into **daily profiles**.
- We apply **clustering** to discover typical usage patterns.
- We interpret clusters as **behavioural segments** and sketch tariff ideas.

The goal is to show how to go from time-series energy data to **day-level
segments** (and, with multi-household data, customer segments) that can support
pricing, marketing, or operational decisions.


## 0. Dataset

We will use the classic UCI / Kaggle dataset:

- **Individual household electric power consumption**.
- 1-minute measurements of active/reactive power, voltage and sub-metering.
- Roughly 4 years of data for a single household.

You can download it, for example, from Kaggle (CSV or TXT format). The original
UCI text file is semicolon-separated and marks missing values with `?`. Save it as:

```text
data/household_power.csv
```

We assume the file contains columns compatible with the original format:

- `Date`, `Time`
- `Global_active_power` (kW)
- `Global_reactive_power`, `Voltage`, `Global_intensity`
- `Sub_metering_1`, `Sub_metering_2`, `Sub_metering_3`

If your file name or columns differ, adapt the first data-loading cell.
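
As a hedged illustration of the expected layout, the sketch below parses three made-up rows written in the style of the original UCI text file (semicolon-separated, `?` for missing values, `dd/mm/yyyy` dates). The sample values and the names `sample_txt` and `sample_raw` are illustrative assumptions, not rows from the real file.

```python
import io

import pandas as pd

# Made-up rows in the layout of the original UCI text file
# (assumption: ';' separator, '?' marks a missing value).
sample_txt = """Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
01/02/2007;00:00:00;1.200;0.100;240.00;5.00;0.000;1.000;17.000
01/02/2007;00:01:00;?;?;?;?;?;?;?
01/02/2007;00:02:00;1.400;0.100;239.50;5.80;0.000;2.000;18.000
"""

sample_raw = pd.read_csv(io.StringIO(sample_txt), sep=";", na_values=["?"])

# Dates are day-first: 01/02/2007 is 1 February, not 2 January.
sample_raw["timestamp"] = pd.to_datetime(
    sample_raw["Date"] + " " + sample_raw["Time"],
    format="%d/%m/%Y %H:%M:%S",
)

print(sample_raw[["timestamp", "Global_active_power"]])
```

If your download is a comma-separated export with ISO dates, the `sep` and `format` arguments would need to change accordingly.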


In [None]:
from __future__ import annotations

from pathlib import Path
from typing import List, Dict, Tuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (11, 5)

DATA_PATH = Path("data") / "household_power.csv"
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Expected dataset at {DATA_PATH.resolve()}\n"
        "Download 'Individual household electric power consumption' and save as 'data/household_power.csv'."
    )

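# The original UCI text file is ';'-separated and marks missing values with '?'.
# Peek at the header to pick the delimiter so comma-separated exports also load.
with DATA_PATH.open() as fh:
    sep = ";" if ";" in fh.readline() else ","
raw = pd.read_csv(DATA_PATH, sep=sep, na_values=["?"], low_memory=False)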
raw.head()

## 1. Cleaning and timestamp parsing

We will:

- Combine `Date` and `Time` into a proper `timestamp`.
- Convert energy columns to numeric.
- Resample from 1-minute to **hourly** for a more compact profile.
- Compute hourly **kWh** from `Global_active_power` (kW).


In [None]:
def clean_household_power(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and resample the household power dataset to hourly kWh.

    Parameters
    ----------
    df : pd.DataFrame
        Raw dataframe with Date/Time, Global_active_power and other cols.

    Returns
    -------
    pd.DataFrame
        Hourly dataframe indexed by timestamp with at least:
        - kwh: energy consumed in that hour
        - global_active_power: mean kW in that hour
        - Sub_metering_1/2/3: hourly sums (Wh)
    """
    df = df.copy()

    # Create timestamp column
    if not {"Date", "Time"}.issubset(df.columns):
        raise ValueError("Expected 'Date' and 'Time' columns in dataset.")

    # The original files store dates as dd/mm/yyyy, so parse day-first.
    df["timestamp"] = pd.to_datetime(
        df["Date"] + " " + df["Time"], dayfirst=True, errors="coerce"
    )
    df = df.dropna(subset=["timestamp"]).sort_values("timestamp")

    # Convert numeric columns, coerce errors to NaN
    num_cols = [
        "Global_active_power",
        "Global_reactive_power",
        "Voltage",
        "Global_intensity",
        "Sub_metering_1",
        "Sub_metering_2",
        "Sub_metering_3",
    ]
    for col in num_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    df = df.set_index("timestamp").sort_index()

    # Resample to hourly
    hourly = pd.DataFrame()
    if "Global_active_power" in df.columns:
        # kWh in each hour = mean kW * 1 hour
        # ("h" is the hourly alias; the uppercase "H" is deprecated in recent pandas)
        hourly["global_active_power"] = df["Global_active_power"].resample("h").mean()
        hourly["kwh"] = hourly["global_active_power"]  # mean kW over 1 hour ~= kWh
    for col in ["Sub_metering_1", "Sub_metering_2", "Sub_metering_3"]:
        if col in df.columns:
            hourly[col] = df[col].resample("h").sum()

    hourly = hourly.dropna(subset=["kwh"])  # require at least kwh
    return hourly


hourly = clean_household_power(raw)
hourly.head()
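
To make the "mean kW over one hour is approximately kWh" step concrete, here is a minimal sketch on synthetic 1-minute readings (the series `kw` is made up for illustration). It compares the mean-based estimate used above with an explicit sum of `kW * 1/60 h`, and shows how the two diverge when minutes are missing.

```python
import numpy as np
import pandas as pd

# Two synthetic hours of 1-minute readings (kW): a flat 2 kW hour,
# then 30 minutes at 3 kW followed by 30 minutes at 1 kW.
idx = pd.date_range("2020-01-01 00:00", periods=120, freq="min")
kw = pd.Series(np.r_[np.full(60, 2.0), np.full(30, 3.0), np.full(30, 1.0)], index=idx)

kwh_from_mean = kw.resample("h").mean()        # mean kW over 1 h
kwh_from_sum = kw.resample("h").sum() / 60.0   # sum of (kW * 1/60 h)

# Drop the first 15 minutes to imitate a gap in the meter data.
kw_gappy = kw.drop(idx[:15])
gappy_mean = kw_gappy.resample("h").mean()
gappy_sum = kw_gappy.resample("h").sum() / 60.0

print(pd.DataFrame({
    "mean": kwh_from_mean,
    "sum": kwh_from_sum,
    "gappy_mean": gappy_mean,
    "gappy_sum": gappy_sum,
}))
```

With complete data both estimates agree. With gaps, the mean implicitly assumes the missing minutes look like the observed ones, whereas the sum treats them as zero consumption; the notebook uses the mean.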

### 1.1 Quick EDA: overall consumption and a sample week


In [None]:
# Overall
hourly["kwh"].plot(alpha=0.7)
plt.title("Hourly energy consumption (kWh)")
plt.ylabel("kWh")
plt.show()

# Sample week for visual intuition
sample_start = hourly.index.min() + pd.Timedelta(days=7)
sample_end = sample_start + pd.Timedelta(days=7)
sample = hourly.loc[sample_start:sample_end]
sample["kwh"].plot()
plt.title("Sample week of hourly consumption")
plt.ylabel("kWh")
plt.show()

## 2. Build daily profiles

We now turn the hourly series into **daily entities**, each with:

- Total daily kWh.
- Peak hour kWh.
- Day vs night energy fractions.
- Evening consumption (e.g. 18–23h).
- The raw **24-dimensional profile** (kWh per hour) for clustering.


In [None]:
def build_daily_profile_frame(hourly_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Create daily features and 24h profiles from hourly kWh.

    Parameters
    ----------
    hourly_df : pd.DataFrame
        Hourly data with at least a `kwh` column.

    Returns
    -------
    daily_features : pd.DataFrame
        One row per day with aggregate features.
    daily_profiles : pd.DataFrame
        One row per day, column per hour (0-23) with kWh.
    """
    df = hourly_df.copy()
    df["date"] = df.index.date
    df["hour"] = df.index.hour

    # Wide 24h profile: rows days, columns 0..23
    profile = df.pivot_table(
        index="date",
        columns="hour",
        values="kwh",
        aggfunc="mean",
    )
    profile.columns = [f"h_{h:02d}" for h in profile.columns]

    # Aggregate features by day
    daily = df.groupby("date").agg(
        total_kwh=("kwh", "sum"),
        max_kwh=("kwh", "max"),
        mean_kwh=("kwh", "mean"),
    )

    # Day vs night / evening energy per day.
    # Note: the evening window (18-23h) overlaps the night window at 22h,
    # and hours 6-7 belong to no window, so the fractions need not sum to 1.
    def _masked_daily_sum(mask: pd.Series) -> pd.Series:
        return df.loc[mask, :].groupby("date")["kwh"].sum()

    day_mask = (df["hour"] >= 8) & (df["hour"] < 18)
    night_mask = (df["hour"] < 6) | (df["hour"] >= 22)
    evening_mask = (df["hour"] >= 18) & (df["hour"] < 23)

    day_kwh = _masked_daily_sum(day_mask)
    night_kwh = _masked_daily_sum(night_mask)
    eve_kwh = _masked_daily_sum(evening_mask)

    daily["day_kwh"] = day_kwh
    daily["night_kwh"] = night_kwh
    daily["evening_kwh"] = eve_kwh

    # Fractions relative to total
    daily["day_frac"] = daily["day_kwh"] / daily["total_kwh"]
    daily["night_frac"] = daily["night_kwh"] / daily["total_kwh"]
    daily["evening_frac"] = daily["evening_kwh"] / daily["total_kwh"]

    # Join profiles and features on date index
    features = daily.join(profile, how="inner")

    return features, profile


daily_features, daily_profiles = build_daily_profile_frame(hourly)
daily_features.head()
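
As a quick sanity check of the hour windows used above, the following sketch applies the same masks to a hypothetical flat day with exactly 1 kWh in every hour. The flat profile is an assumption chosen so that each fraction simply equals the number of hours in its window divided by 24.

```python
import numpy as np

hours = np.arange(24)

# Same window definitions as in build_daily_profile_frame.
day_mask = (hours >= 8) & (hours < 18)
night_mask = (hours < 6) | (hours >= 22)
evening_mask = (hours >= 18) & (hours < 23)

# Hypothetical flat day: exactly 1 kWh in every hour.
flat_day = np.ones(24)
total = flat_day.sum()

day_frac = flat_day[day_mask].sum() / total
night_frac = flat_day[night_mask].sum() / total
evening_frac = flat_day[evening_mask].sum() / total

# Hours counted in both night and evening, and hours counted in no window.
overlap_hours = hours[night_mask & evening_mask]
uncovered_hours = hours[~(day_mask | night_mask | evening_mask)]

print(day_frac, night_frac, evening_frac)
print("overlap:", overlap_hours, "uncovered:", uncovered_hours)
```

The windows are features rather than a partition of the day, so the three fractions are not expected to add up to 1.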

### 2.1 Daily totals and seasonality by weekday


In [None]:
daily_features.index = pd.to_datetime(daily_features.index)
daily_features["weekday"] = daily_features.index.dayofweek  # 0=Mon

daily_features["total_kwh"].plot(alpha=0.7)
plt.title("Total daily kWh over time")
plt.ylabel("kWh/day")
plt.show()

sns.boxplot(data=daily_features, x="weekday", y="total_kwh")
plt.title("Daily energy use by weekday (0=Mon)")
plt.ylabel("kWh/day")
plt.show()

## 3. Clustering daily profiles

We now cluster days based on their **24-hour load shape plus aggregate statistics**.

Steps:

- Standardise all numerical features.
- Try several K values with **silhouette score**.
- Fit final **KMeans** and interpret clusters.


In [None]:
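# Prepare feature matrix for clustering.
# Exclude "weekday" and, in case this cell is re-run after the final fit,
# the "cluster" label itself so it cannot leak into the features.
cluster_cols = [c for c in daily_features.columns if c not in ("weekday", "cluster")]
# Days with missing hours or empty time windows contain NaNs; impute them with 0.
X = daily_features[cluster_cols].fillna(0.0).to_numpy()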

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Try different K values
sil_scores: Dict[int, float] = {}
for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=20)
    labels = kmeans.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels)
    sil_scores[k] = sil

sil_scores

In [None]:
plt.plot(list(sil_scores.keys()), list(sil_scores.values()), marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Silhouette score")
plt.title("Silhouette vs K – daily energy patterns")
plt.show()

Pick a K from the plot above (for example 3 or 4). We will default to **K=4**
and you can adjust if needed.
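
If you prefer a data-driven starting point over reading the plot, one option is to take the K with the highest silhouette score. The sketch below does this on synthetic blobs from `make_blobs`, which stand in for `X_scaled`; the names `X_demo`, `demo_scores` and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled: three well-separated blobs.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

demo_scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    demo_scores[k] = silhouette_score(X_demo, labels)

# K with the highest silhouette score.
best_k = max(demo_scores, key=demo_scores.get)
print(best_k, round(demo_scores[best_k], 3))
```

In the notebook, the same `max(..., key=...)` expression applied to `sil_scores` would give the candidate K. Silhouette differences between neighbouring K values are often small on real load data, so domain judgement still applies.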

In [None]:
K = 4  # adjust based on silhouette / domain insights
kmeans_final = KMeans(n_clusters=K, random_state=RANDOM_STATE, n_init=50)
cluster_labels = kmeans_final.fit_predict(X_scaled)

daily_features["cluster"] = cluster_labels
daily_profiles_clustered = daily_profiles.copy()
daily_profiles_clustered["cluster"] = cluster_labels

daily_features[["total_kwh", "day_frac", "night_frac", "evening_frac", "cluster"]].head()

## 4. Cluster interpretation

We now look at:

- Cluster sizes.
- Mean aggregate features per cluster.
- **Average daily profile** (24h) per cluster.


In [None]:
# Cluster sizes
cluster_counts = daily_features["cluster"].value_counts().sort_index()
cluster_counts

In [None]:
agg_cols = ["total_kwh", "day_frac", "night_frac", "evening_frac", "max_kwh", "mean_kwh"]
cluster_summary = daily_features.groupby("cluster")[agg_cols].mean()
cluster_summary

In [None]:
# Average 24h profile per cluster
hour_cols = [c for c in daily_profiles.columns if c.startswith("h_")]
cluster_profiles = daily_profiles_clustered.groupby("cluster")[hour_cols].mean()

for cluster_id, row in cluster_profiles.iterrows():
    plt.plot(range(24), row.values, label=f"Cluster {cluster_id}")

plt.xticks(range(24))
plt.xlabel("Hour of day")
plt.ylabel("Average kWh")
plt.title("Average daily load shape per cluster")
plt.legend()
plt.show()

You can now **interpret** each cluster qualitatively, for example:

- Cluster with high **evening_frac** and peak around 19–22h → "evening-heavy" days.
- Cluster with high **day_frac** and flatter profile → "home all day" days.
- Cluster with low total_kwh → "low consumption" or absence/vacation days.
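
The qualitative rules above can be turned into a rough naming heuristic. The sketch below is one possible approach, run on a hypothetical summary table with made-up numbers (not results from the dataset); the rule order and the labels are assumptions.

```python
import pandas as pd

# Hypothetical cluster summary with made-up numbers (not results from the dataset).
demo_summary = pd.DataFrame(
    {
        "total_kwh": [30.0, 28.0, 9.0],
        "day_frac": [0.30, 0.50, 0.40],
        "evening_frac": [0.45, 0.20, 0.25],
    },
    index=pd.Index([0, 1, 2], name="cluster"),
)

names = {}

# Name the lowest-consumption cluster first, so the share-based rules
# below only compete over the remaining clusters.
low_id = demo_summary["total_kwh"].idxmin()
names[low_id] = "low consumption"

remaining = demo_summary.drop(index=low_id)
names[remaining["evening_frac"].idxmax()] = "evening-heavy"
names[remaining["day_frac"].idxmax()] = "day-heavy"

print(names)
```

With more clusters than rules, some clusters stay unnamed, and two rules can select the same cluster; inspecting the average profiles remains the more reliable check.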


## 5. Visualising clusters in 2D with PCA

We project the standardised feature space into 2D using **PCA** for a quick
visual view of cluster separation.


In [None]:
pca = PCA(n_components=2, random_state=RANDOM_STATE)
X_pca = pca.fit_transform(X_scaled)

pca_df = pd.DataFrame(
    X_pca,
    columns=["pc1", "pc2"],
    index=daily_features.index,
)
pca_df["cluster"] = daily_features["cluster"].values

sns.scatterplot(data=pca_df, x="pc1", y="pc2", hue="cluster", palette="tab10", alpha=0.7)
plt.title("Daily energy patterns – PCA projection")
plt.show()
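
How much to trust the 2D picture depends on how much variance the two components retain. The sketch below shows how to read `explained_variance_ratio_` from a fitted PCA, using a synthetic matrix as a stand-in for `X_scaled` (the data-generating setup is an assumption for illustration only).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the daily feature matrix: 200 "days", 10 features
# driven by 2 latent factors plus noise.
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
X_demo = latent @ loadings + 0.3 * rng.normal(size=(200, 10))

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Fraction of total variance captured by each of the two components.
explained = pca_demo.explained_variance_ratio_
print("per component:", explained.round(3), "total:", explained.sum().round(3))
```

In the notebook, `pca.explained_variance_ratio_` gives the same information for the real features. If the total is low, clusters that overlap in the scatter plot may still be well separated in the full feature space.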

## 6. Simple classifier for cluster prediction

To mimic a production use case, we train a simple model that predicts the
cluster from a **small set of aggregate features** only:

- `total_kwh`, `day_frac`, `night_frac`, `evening_frac`, `weekday`.

This could be used to quickly assign new days to segments.


In [None]:
clf_features = ["total_kwh", "day_frac", "night_frac", "evening_frac", "weekday"]
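# Impute NaNs the same way as for clustering: days with no data in a time
# window have NaN fractions, which some scikit-learn versions reject.
X_clf = daily_features[clf_features].fillna(0.0).to_numpy()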
y_clf = daily_features["cluster"].to_numpy()

# Simple train/test split on time: first 80% days for train, last 20% for test
n_days = len(daily_features)
split_idx = int(n_days * 0.8)
X_clf_train, X_clf_test = X_clf[:split_idx], X_clf[split_idx:]
y_clf_train, y_clf_test = y_clf[:split_idx], y_clf[split_idx:]

rf_clf = RandomForestClassifier(
    n_estimators=300,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)
rf_clf.fit(X_clf_train, y_clf_train)

y_pred_clf = rf_clf.predict(X_clf_test)
print(classification_report(y_clf_test, y_pred_clf))
print("Confusion matrix:\n", confusion_matrix(y_clf_test, y_pred_clf))
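
To assign a new day to a segment, its 24 hourly values first have to be reduced to the same five inputs. The helper below is a hypothetical sketch (`day_to_features` is not part of the notebook) that mirrors the hour windows from section 2 and assumes a complete day with non-zero total consumption.

```python
import numpy as np


def day_to_features(hourly_kwh, weekday):
    """Aggregate one day of 24 hourly kWh values into the classifier inputs.

    Hypothetical helper; the hour windows mirror those in section 2.
    """
    kwh = np.asarray(hourly_kwh, dtype=float)
    if kwh.shape != (24,):
        raise ValueError("expected exactly 24 hourly values")
    hours = np.arange(24)
    total = kwh.sum()
    day = kwh[(hours >= 8) & (hours < 18)].sum()
    night = kwh[(hours < 6) | (hours >= 22)].sum()
    evening = kwh[(hours >= 18) & (hours < 23)].sum()
    # Order matches clf_features: total_kwh, day_frac, night_frac, evening_frac, weekday.
    return np.array([total, day / total, night / total, evening / total, weekday])


# Made-up day: 0.5 kWh base load with 2 kWh per hour from 18h to 21h inclusive.
new_day = np.full(24, 0.5)
new_day[18:22] = 2.0
features = day_to_features(new_day, weekday=2)
print(features)
```

With the fitted model from the cell above, `rf_clf.predict(features.reshape(1, -1))` would then return the predicted cluster for that day.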

## 7. Segments and tariff ideas

Now that we have clusters, you can frame them in business terms, for example:

- **Cluster A – evening peakers**:
  - High `evening_frac`, strong peak 19–22h.
  - Could benefit from **time-of-use tariffs** that make the late peak more expensive
    or incentivise load shifting.

- **Cluster B – day-heavy**:
  - High `day_frac`, relatively flat evening.
  - Might be home workers or retirees; tariffs could encourage midday usage
    when wholesale prices are low.

- **Cluster C – low consumption / away days**:
  - Low `total_kwh` and flat profile.
  - Suitable for low fixed fees or alerts when consumption deviates.

The exact interpretation depends on how many clusters you chose and the shape
of their average profiles.
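
To make the tariff discussion tangible, the sketch below prices two made-up average load shapes under a flat tariff and a simple time-of-use tariff. All prices, the peak window (18–22h) and both profiles are hypothetical assumptions, not values derived from the dataset or from any real tariff.

```python
import numpy as np

hours = np.arange(24)

# Hypothetical prices in currency units per kWh (assumptions, not real tariffs).
FLAT_PRICE = 0.20
tou_price = np.where((hours >= 18) & (hours < 22), 0.30, 0.17)

# Made-up average cluster profiles (kWh per hour), standing in for rows of cluster_profiles.
evening_peaker = np.full(24, 0.5)
evening_peaker[18:22] = 2.5
day_heavy = np.full(24, 0.5)
day_heavy[9:17] = 1.5


def daily_cost(profile, price):
    """Cost of one day: sum over hours of kWh * price."""
    return float(np.sum(profile * price))


for name, profile in [("evening peaker", evening_peaker), ("day-heavy", day_heavy)]:
    flat = daily_cost(profile, FLAT_PRICE)
    tou = daily_cost(profile, tou_price)
    print(f"{name}: flat={flat:.2f}, time-of-use={tou:.2f}, delta={tou - flat:+.2f}")
```

Both example days use 20 kWh, so the flat tariff cannot distinguish them; the time-of-use tariff charges the evening-heavy shape more and the day-heavy shape less.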

From here you can:

- Attach **monetary value** (cost, margin) to each cluster.
- Look at **seasonality of clusters** (which clusters are common in winter vs summer).
- Extend to **multi-household smart meter data** and build proper customer segments.
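
As a starting point for the seasonality idea, the sketch below tabulates the share of each cluster per season. It uses randomly generated cluster labels as a stand-in for `daily_features["cluster"]`, and the month-to-season mapping assumes a northern-hemisphere household.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for daily_features: one year of days with random cluster labels.
demo_days = pd.DataFrame(
    {"cluster": rng.integers(0, 4, size=365)},
    index=pd.date_range("2021-01-01", periods=365, freq="D"),
)

# Map calendar month to a season label (assumption: northern hemisphere).
season_of_month = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
demo_days["season"] = demo_days.index.month.map(season_of_month)

# Share of each cluster within every season (each row sums to 1).
season_mix = pd.crosstab(demo_days["season"], demo_days["cluster"], normalize="index")
print(season_mix.round(2))
```

With random labels the shares are roughly uniform; on the real clusters, strongly uneven rows would indicate that a cluster is mostly a seasonal effect rather than a behavioural one.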
