# Notebook 01 – Problem Framing and Dataset Discovery

## 1. Introduction
In this notebook we define our machine learning problem and explore the dataset we will use.  
The project focuses on **classifying hourly electricity demand in Romania** into **Low, Medium, and High tiers**.  
This is motivated by sustainability concerns: better demand prediction allows grid operators to plan more efficiently, reduce fossil fuel reliance, and integrate more renewable energy sources.

---

## 2. Problem Definition

### Problem Statement
**Predict whether an hourly electricity consumption measurement will fall into the Low, Medium, or High demand tier, using only data available up to that time (hour, day, season, weather), so that grid operators can plan ahead, reduce reliance on fossil fuels, and optimize imports/exports.**

### CTHI Framing
- **Context**: Electricity demand management for sustainability and energy security in Romania.  
- **Task**: Multiclass classification (Low / Medium / High).  
- **Hypothesis**: Demand depends on **time of day**, **day of week**, **season**, and possibly **external factors like temperature**.  
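- **Impact**: Accurate classification enables better grid planning and renewable integration, and supports UN Sustainable Development Goals 7 (Affordable and Clean Energy), 12 (Responsible Consumption and Production), and 13 (Climate Action).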

---

## 3. Dataset Description
We use the dataset **“Hourly electricity consumption and production in Romania”** ([Kaggle link](https://www.kaggle.com/datasets/stefancomanita/hourly-electricity-consumption-and-production)).

- **Predictors** (temporal, derived from the `DateTime` column): `hour`, `day_of_week`, `month`, `season`
- **Target**: `demand_class` (Low, Medium, High, based on tertiles of `Consumption`)

⚠️ *Note*: We exclude the supply-side columns (`Production`, `Nuclear`, `Wind`, `Hydroelectric`, `Oil and Gas`, `Coal`, `Solar`, `Biomass`) to avoid target leakage: they are recorded in the same hour as consumption and are supply-side responses to demand.
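
As a minimal sketch of that exclusion, the snippet below drops the supply-side columns from a small stand-in frame. The column names are assumed to match the CSV header, and `SUPPLY_COLUMNS` and `drop_supply_columns` are hypothetical names introduced here for illustration only.

```python
import pandas as pd

# Tiny stand-in frame; column names are assumed to match the CSV header.
sample = pd.DataFrame({
    "DateTime": ["2019-01-01 00:00:00", "2019-01-01 01:00:00"],
    "Consumption": [6352, 6116],
    "Production": [6527, 5701],
    "Nuclear": [1395, 1393],
    "Wind": [79, 96],
    "Hydroelectric": [1383, 1112],
    "Oil and Gas": [1896, 1429],
    "Coal": [1744, 1641],
    "Solar": [0, 0],
    "Biomass": [30, 30],
})

# Supply-side columns to exclude (hypothetical constant for this sketch).
SUPPLY_COLUMNS = [
    "Production", "Nuclear", "Wind", "Hydroelectric",
    "Oil and Gas", "Coal", "Solar", "Biomass",
]


def drop_supply_columns(frame):
    """Return a copy without the supply-side columns; absent names are ignored."""
    return frame.drop(columns=SUPPLY_COLUMNS, errors="ignore")


demand_only = drop_supply_columns(sample)
print(demand_only.columns.tolist())
```

Using `errors="ignore"` keeps the helper safe to call twice or on a frame that has already been trimmed.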

---

## 4. Load the Dataset

In [1]:
import pandas as pd

In [2]:
# Load the CSV file (already in the project repo under data/raw)
df = pd.read_csv("data/raw/electricityConsumptionAndProductioction.csv")

df.head()

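|   | DateTime | Consumption | Production | Nuclear | Wind | Hydroelectric | Oil and Gas | Coal | Solar | Biomass |
|---|----------|-------------|------------|---------|------|---------------|-------------|------|-------|---------|
| 0 | 2019-01-01 00:00:00 | 6352 | 6527 | 1395 | 79 | 1383 | 1896 | 1744 | 0 | 30 |
| 1 | 2019-01-01 01:00:00 | 6116 | 5701 | 1393 | 96 | 1112 | 1429 | 1641 | 0 | 30 |
| 2 | 2019-01-01 02:00:00 | 5873 | 5676 | 1393 | 142 | 1030 | 1465 | 1616 | 0 | 30 |
| 3 | 2019-01-01 03:00:00 | 5682 | 5603 | 1397 | 191 | 972 | 1455 | 1558 | 0 | 30 |
| 4 | 2019-01-01 04:00:00 | 5557 | 5454 | 1393 | 159 | 960 | 1454 | 1458 | 0 | 30 |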


---

## 5. Define Target Variable
We transform the continuous `Consumption` column into three categorical demand tiers, split at its tertiles:
- **Low**: bottom third of consumption values  
- **Medium**: middle third  
- **High**: top third  

This makes the problem a clear multiclass classification task.

In [3]:
# Create demand tiers from consumption
df['demand_class'] = pd.qcut(df['Consumption'], q=3, labels=["Low", "Medium", "High"])

df['demand_class'].value_counts()

demand_class
Medium    18063
Low       18058
High      18049
Name: count, dtype: int64
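
To see where the tier boundaries actually fall, `pd.qcut` can also return the bin edges via `retbins=True`. The sketch below uses a handful of made-up consumption values, not the real dataset. Reusing the edges with `pd.cut` on new readings is one possible workflow and an assumption here, not something the notebook does.

```python
import pandas as pd

# Made-up consumption values for illustration only.
consumption = pd.Series(
    [5557, 5682, 5873, 6116, 6352, 6474, 7268, 8100, 9615],
    name="Consumption",
)
labels = ["Low", "Medium", "High"]

# retbins=True returns the tertile edges alongside the tier labels.
tiers, edges = pd.qcut(consumption, q=3, labels=labels, retbins=True)
print(edges)
print(tiers.value_counts())

# Assumed workflow: apply the same fixed edges to new readings with pd.cut.
# Values outside [edges[0], edges[-1]] would come back as NaN.
new_readings = pd.Series([6000, 6500, 9000])
new_tiers = pd.cut(new_readings, bins=edges, labels=labels, include_lowest=True)
print(list(new_tiers))
```

Keeping the edges explicit makes the Low/Medium/High definition reproducible, since `pd.qcut` recomputes the cut points whenever the underlying data changes.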

---

## 6. Feature Types

| Feature Name | Variable Type | Role |
|--------------|---------------|------|
| `DateTime`   | Datetime (timestamp) | Raw input (used to derive `hour`, `day_of_week`, `month`, `season`) |
| `hour`       | Categorical | Predictor |
| `day_of_week`| Categorical | Predictor |
| `month`      | Categorical | Predictor |
| `season`     | Categorical | Predictor |
| `Consumption`| Numerical | Used to derive label |
| `demand_class` | Categorical (Low / Medium / High) | Target |
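
The temporal predictors in the table can be derived from the timestamp column roughly as follows. This is a sketch under stated assumptions: the season mapping uses meteorological seasons (December to February as winter, and so on), which the notebook does not specify, and `month_to_season` and `add_temporal_features` are hypothetical helper names.

```python
import pandas as pd


def month_to_season(month):
    """Map a month number to a meteorological season (assumed convention)."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Autumn"


def add_temporal_features(frame, time_col="DateTime"):
    """Return a copy of frame with hour, day_of_week, month, and season added."""
    out = frame.copy()
    timestamps = pd.to_datetime(out[time_col])
    out["hour"] = timestamps.dt.hour
    out["day_of_week"] = timestamps.dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = timestamps.dt.month
    out["season"] = out["month"].map(month_to_season)
    return out


# Made-up timestamps, one per season, in the same format as the CSV.
sample = pd.DataFrame({
    "DateTime": [
        "2019-01-01 00:00:00",
        "2019-04-15 13:00:00",
        "2019-07-20 18:00:00",
        "2019-10-05 07:00:00",
    ]
})
features = add_temporal_features(sample)
print(features)
```

All four features come from the timestamp alone, so they are known ahead of the hour being predicted.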

---

## 7. Dataset Health Check
We will examine:
- Missing values
- Duplicates
- Outliers in consumption

In [4]:
# Missing values
missing = df.isna().sum()

# Duplicates
duplicates = df.duplicated().sum()

# Summary statistics for Consumption (a first look at possible outliers)
cons_stats = df['Consumption'].describe()

missing, duplicates, cons_stats

(DateTime         0
 Consumption      0
 Production       0
 Nuclear          0
 Wind             0
 Hydroelectric    0
 Oil and Gas      0
 Coal             0
 Solar            0
 Biomass          0
 demand_class     0
 dtype: int64,
 np.int64(4),
 count    54170.000000
 mean      6526.463688
 std       1048.248455
 min       2922.000000
 25%       5710.000000
 50%       6474.000000
 75%       7268.000000
 max       9615.000000
 Name: Consumption, dtype: float64)
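
The summary statistics do not flag outliers by themselves. One common convention, which this notebook does not prescribe, is Tukey's 1.5×IQR rule. With the reported quartiles (25% = 5710, 75% = 7268), the IQR is 1558 and the fences are 3373 and 9605, so both the minimum (2922) and the maximum (9615) fall outside them. The sketch below reproduces that arithmetic, then shows on a made-up toy frame how duplicated rows could be inspected and dropped before flagging values. `iqr_fences` is a hypothetical helper.

```python
import pandas as pd


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Fences implied by the quartiles in the describe() output.
reported_q1, reported_q3 = 5710.0, 7268.0
reported_iqr = reported_q3 - reported_q1
reported_fences = (
    reported_q1 - 1.5 * reported_iqr,
    reported_q3 + 1.5 * reported_iqr,
)
print(reported_fences)  # (3373.0, 9605.0)

# Toy frame (made-up values): one extreme reading plus one repeated row.
base = pd.DataFrame({
    "DateTime": [f"2019-01-01 {h:02d}:00:00" for h in range(9)],
    "Consumption": [6000, 6100, 6200, 6300, 6400, 6500, 6600, 6700, 2000],
})
toy = pd.concat([base, base.iloc[[1]]], ignore_index=True)

# Show every copy of a duplicated row, then keep only the first occurrence.
duplicated_rows = toy[toy.duplicated(keep=False)]
deduped = toy.drop_duplicates().reset_index(drop=True)

# Flag readings outside the fences computed on the de-duplicated data.
lower, upper = iqr_fences(deduped["Consumption"])
outliers = deduped[
    (deduped["Consumption"] < lower) | (deduped["Consumption"] > upper)
]
print(duplicated_rows)
print(outliers)
```

Whether flagged readings are errors or genuine extremes (for example holidays or unusual weather) is a judgment call; the rule only identifies candidates for inspection.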