# 🎯 Learning Objectives

By the end of this notebook, you should be able to:

- Download and load a real-world electricity load dataset.
- Explore basic statistics and missing values of time-series data.
- Aggregate load across consumers and resample it over time (daily, weekly, monthly).
- Visualize overall load trends and individual consumer patterns.
- Examine the distribution of daily total electricity loads.
- Compare multiple consumers and analyze correlations.

In [None]:
# ---- Imports ----

import os
import pandas as pd
import matplotlib.pyplot as plt
import zipfile
import urllib.request
import ssl

In [None]:
# ---- Cross-platform: define base course path ----

try:
    import google.colab
    ON_COLAB = True
except ImportError:
    ON_COLAB = False

if ON_COLAB:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    COURSE_PATH = "/content/drive/MyDrive/Industrial_ML_Course"
else:
    COURSE_PATH = r"D:\Industrial_ML_Course"  # <-- adjust if needed

# Dataset folder
DATASET_PATH = os.path.join(COURSE_PATH, "datasets", "ElectricityLoad")
os.makedirs(DATASET_PATH, exist_ok=True)

In [None]:
# ---- Download & Extract Dataset ----

url = "https://archive.ics.uci.edu/static/public/321/electricityloaddiagrams20112014.zip"
zip_file = os.path.join(DATASET_PATH, "electricity_load.zip")
txt_file = os.path.join(DATASET_PATH, "LD2011_2014.txt")

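# WARNING: this context skips TLS certificate verification (a workaround for
# environments with broken CA bundles). Prefer ssl.create_default_context()
# when certificates validate normally.
ssl_context = ssl._create_unverified_context()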

# Download if not exists
if not os.path.exists(txt_file):
    if not os.path.exists(zip_file):
        print("Downloading dataset...")
        with urllib.request.urlopen(url, context=ssl_context) as response, open(zip_file, 'wb') as out_file:
            out_file.write(response.read())
        print("✅ Dataset downloaded.")
    else:
        print("Dataset zip already exists.")

    # Extract
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        zip_ref.extractall(DATASET_PATH)
    print("✅ Dataset extracted.")

    # Cleanup zip file
    os.remove(zip_file)
    print("✅ Zip file removed after extraction.")
else:
    print("Dataset already extracted.")

In [None]:
# ---- Load dataset ----

data = pd.read_csv(
    txt_file,
    sep=";",
    index_col=0,
    parse_dates=True,
    decimal=",",
    dtype="float32",
    # nrows=10000,  # <- uncomment for quick testing
)

# Quick info
print("Shape:", data.shape)
print("Number of consumers:", len(data.columns))
print("Time range:", data.index.min(), "to", data.index.max())
display(data.iloc[100000:100005])  # positional slice of five rows

In [None]:
# ---- Basic Exploration ----

print("Start:", data.index.min())
print("End:", data.index.max())
print("Frequency:", pd.infer_freq(data.index))

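# Count missing values per consumer first, then total them
missing_per_consumer = data.isna().sum()
missing_counts = missing_per_consumer.sum()
print("Total missing values:", missing_counts)
if missing_counts:
    print("Consumers with missing values:")
    print(missing_per_consumer[missing_per_consumer > 0])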

In [None]:
# ---- Daily Load Summary ----

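# Readings are average power in kW per 15-minute interval (per the UCI dataset
# description), so dividing by 4 converts them to energy in kWh.
daily_load = data.sum(axis=1).resample("D").sum() / 4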

print("Daily load statistics:")
display(daily_load.describe())

# Show the 5 days with the highest and the 5 days with the lowest total load
print("\n5 highest daily loads:")
display(daily_load.nlargest(5))
print("\n5 lowest daily loads:")
display(daily_load.nsmallest(5))

In [None]:
# ---- Visualize Daily Load ----
plt.figure(figsize=(12,4))
plt.plot(daily_load.index, daily_load.values, color='orange')
plt.title("Total Daily Electricity Load (2011–2014)")
plt.xlabel("Date")
plt.ylabel("kWh")
plt.grid(True)
plt.show()

# Daily load for a sample consumer
sample_customer = data.columns[0]
daily_customer = data[sample_customer].resample("D").sum() / 4  # kW per 15 min -> kWh
plt.figure(figsize=(12,4))
plt.plot(daily_customer.index, daily_customer.values, color='green')
plt.title(f"Daily Load for {sample_customer}")
plt.xlabel("Date")
plt.ylabel("kWh")
plt.grid(True)
plt.show()

# Distribution of daily total loads
plt.figure(figsize=(6,4))
plt.hist(daily_load.values, bins=50, color='skyblue', edgecolor='black')
plt.title("Distribution of Daily Total Loads")
plt.xlabel("kWh")
plt.ylabel("Frequency")
plt.show()

In [None]:
# ---- Visualize Multiple Consumers ----

# Select a subset of consumers (adjust number as needed)
num_consumers_to_plot = 5
sample_consumers = data.columns[:num_consumers_to_plot]

plt.figure(figsize=(14,6))
for cust in sample_consumers:
    # Aggregate daily load per consumer
    daily_values = data[cust].resample("D").sum()

    # Optional normalization for visual comparison
    daily_values_norm = daily_values / daily_values.max()
    plt.plot(daily_values_norm.index, daily_values_norm.values, label=cust)

plt.title(f"Daily Load for {num_consumers_to_plot} Sample Consumers (Normalized)")
plt.xlabel("Date")
plt.ylabel("Normalized kWh")
plt.legend()
plt.grid(True)
plt.show()

# 🚀 Explore More! (Guided Exercises)

1. **Consumer Comparison**
   - Compare daily load trends of multiple consumers.
   - Normalize consumer loads for easier visual comparison.

2. **Peak and Off-Peak Analysis**
   - Compute and visualize maximum, minimum, and mean daily loads.
   - Identify peak demand periods.

3. **Seasonal Trends**
   - Aggregate loads monthly or yearly and explore trends.
   - Compare how electricity usage changes across seasons or years.

4. **Correlation Analysis**
   - Compute pairwise correlation between consumer loads.
   - Visualize using a heatmap to identify strongly correlated consumers.

5. **Resampling Exploration**
   - Resample data at different frequencies: hourly, weekly, monthly.
   - Plot the aggregated loads and observe patterns.

6. **Missing Value Handling**
   - Check for missing data and try simple imputation (`ffill`, `bfill`, or interpolation).
   - Observe the effect of missing value handling on plots.
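
As a starting point for the correlation exercise, here is a minimal sketch on synthetic data rather than the real dataset. The consumer names (`C_A` to `C_D`), the load shapes, and the `demo` frame are all invented for illustration. The same calls should carry over to `data` once it is loaded.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `data`: two weeks of 15-minute readings for four
# hypothetical consumers (names and load shapes are invented for illustration).
rng = np.random.default_rng(0)
idx = pd.date_range("2014-01-01", periods=4 * 24 * 14, freq="15min")
hour_of_day = idx.hour.to_numpy() + idx.minute.to_numpy() / 60
daily_cycle = np.sin(2 * np.pi * hour_of_day / 24)

demo = pd.DataFrame(
    {
        "C_A": 10 + 5 * daily_cycle + rng.normal(0, 0.5, len(idx)),
        "C_B": 20 + 8 * daily_cycle + rng.normal(0, 0.5, len(idx)),
        "C_C": 15 - 6 * daily_cycle + rng.normal(0, 0.5, len(idx)),
        "C_D": rng.normal(12, 1.0, len(idx)),
    },
    index=idx,
)

# Average to hourly values to damp measurement noise, then correlate.
hourly = demo.resample("60min").mean()
corr = hourly.corr()  # pairwise Pearson correlation (the pandas default)

# Find the pair with the highest positive correlation, ignoring the diagonal.
off_diag = corr.to_numpy().copy()
np.fill_diagonal(off_diag, -np.inf)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
best_pair = (corr.index[i], corr.columns[j])

print(corr.round(2))
print("Most correlated pair:", best_pair)
```

For a heatmap, one option is to pass `corr` to `plt.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")` followed by `plt.colorbar()`. On the full dataset, consider correlating a subset of columns first, since the matrix has one row and one column per consumer.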
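
The resampling exercise can be sketched on a toy series as well. This example assumes, as the UCI dataset description states, that each reading is average power in kW over a 15-minute interval, so dividing by 4 gives energy in kWh. The constant 4 kW load and the variable names are hypothetical, chosen so the aggregated values are easy to check by hand.

```python
import pandas as pd

# Hypothetical constant load: 4 kW for 28 days at 15-minute resolution.
idx = pd.date_range("2014-01-01", periods=4 * 24 * 28, freq="15min")
power_kw = pd.Series(4.0, index=idx)

# Assumption: readings are average kW per 15 minutes, so kWh = kW / 4.
energy_kwh = power_kw / 4

# Aggregate the same series at three different frequencies.
hourly = energy_kwh.resample("60min").sum()  # 4 intervals per hour
daily = energy_kwh.resample("D").sum()       # 96 intervals per day
weekly = energy_kwh.resample("W").sum()      # weeks ending on Sunday

print("kWh per hour:", hourly.iloc[0])
print("kWh per day:", daily.iloc[0])
print("Weekly totals:")
print(weekly)
```

The first and last weekly bins cover fewer than seven days because the series starts on a Wednesday and ends on a Tuesday. Partial bins like these also occur at the edges of the real dataset and can look like dips in a plot.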
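
For the missing-value exercise, the sketch below compares the three imputation options on a small hand-made series. The values and gap positions are invented for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical 15-minute series with two gaps (NaN marks a missing reading).
idx = pd.date_range("2014-01-01", periods=8, freq="15min")
load = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0, np.nan, 8.0], index=idx)

# Apply each strategy side by side for comparison.
filled = pd.DataFrame(
    {
        "original": load,
        "ffill": load.ffill(),                      # carry last known value forward
        "bfill": load.bfill(),                      # pull next known value backward
        "interp": load.interpolate(method="time"),  # linear in elapsed time
    }
)

print(filled)
```

Forward and backward fill produce flat steps across a gap, while time-based interpolation draws a straight line between the neighbouring readings. Plotting the columns of `filled` on one axis makes the difference visible.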