# Load and Clean Electricity Load Dataset

In this notebook, we load the raw dataset from OpenML, inspect its structure, 
and perform basic preprocessing. We extract useful time-based features 
(hour, day, month, weekday) and save the cleaned dataset for later modeling.

### 1. Load Dataset from OpenML

We download the dataset from OpenML (ID: 46214). It contains electricity load 
values at 15-minute intervals for multiple regions, stored as one column per 
series (`value_0`, `value_1`, and so on), and spans several years starting in 2012.

In [8]:
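import openml as oml
import pandas as pd

# Fetch dataset 46214 together with its data files.
d = oml.datasets.get_dataset(46214, download_all_files=True)
# get_data() returns a tuple; the first element is the DataFrame.
df, *_ = d.get_data()

df.head()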


| | id_series | date | value_0 | value_1 | value_2 | value_3 | value_4 | value_5 | value_6 | value_7 | ... | value_307 | value_308 | value_309 | value_310 | value_311 | value_312 | value_313 | value_314 | value_315 | time_step |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2012-01-01 00:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 0 | 2012-01-01 00:15:00 | 3.807107 | 22.759602 | 77.324066 | 136.178862 | 70.731707 | 351.190476 | 9.609949 | 279.461279 | ... | 128.479657 | 28500.0 | 1729.957806 | 1704.545455 | 15.645372 | 12.873025 | 504.828797 | 63.439065 | 761.730205 | 1 |
| 2 | 0 | 2012-01-01 00:30:00 | 5.076142 | 22.759602 | 77.324066 | 136.178862 | 73.170732 | 354.166667 | 9.044658 | 279.461279 | ... | 127.765882 | 26400.0 | 1654.008439 | 1659.090909 | 15.645372 | 13.458163 | 525.021949 | 60.100167 | 702.346041 | 2 |
| 3 | 0 | 2012-01-01 00:45:00 | 3.807107 | 22.759602 | 77.324066 | 140.243902 | 69.512195 | 348.214286 | 8.479367 | 279.461279 | ... | 114.20414 | 25200.0 | 1333.333333 | 1636.363636 | 15.645372 | 10.532475 | 526.777875 | 56.761269 | 696.480938 | 3 |
| 4 | 0 | 2012-01-01 01:00:00 | 3.807107 | 22.759602 | 77.324066 | 140.243902 | 75.609756 | 339.285714 | 7.348785 | 279.461279 | ... | 112.062812 | 23800.0 | 1324.894515 | 1636.363636 | 15.645372 | 14.628438 | 539.947322 | 63.439065 | 693.548387 | 4 |
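
Downloading the data on every run is slow, so one option is to keep a local copy of the raw frame. The sketch below is an assumption and not part of the original notebook: `load_or_fetch` and the cache path are hypothetical names, and the download step is passed in as a function so that the OpenML call from the cell above can be plugged in.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_fetch(cache_path: Path, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached frame if it exists; otherwise fetch it and cache it.

    Hypothetical helper for illustration, not an OpenML or pandas API.
    """
    if cache_path.exists():
        # Reuse the local copy and skip the download entirely.
        return pd.read_pickle(cache_path)
    df = fetch()
    # to_pickle does not create missing folders, so create them first.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df
```

With OpenML, `fetch` could be something like `lambda: oml.datasets.get_dataset(46214).get_data()[0]`, mirroring the cell above.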


### 2. Inspect Dataset Structure

We quickly check the shape, column types, and descriptive statistics 
to understand what the data looks like before cleaning.

In [9]:
print(df.shape)  # a bare expression is shown only when it is the last line of a cell
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105217 entries, 0 to 105216
Columns: 319 entries, id_series to time_step
dtypes: category(1), float64(316), int64(1), object(1)
memory usage: 255.4+ MB


| | value_0 | value_1 | value_2 | value_3 | value_4 | value_5 | value_6 | value_7 | value_8 | value_9 | ... | value_307 | value_308 | value_309 | value_310 | value_311 | value_312 | value_313 | value_314 | value_315 | time_step |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 | ... | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 | 105217.0 |
| mean | 5.293122 | 27.684728 | 3.890152 | 109.553284 | 49.641948 | 188.258438 | 6.027018 | 255.141331 | 53.287807 | 56.260165 | ... | 290.88247 | 50132.068962 | 2515.971874 | 3919.110894 | 87.196809 | 12.356675 | 565.549464 | 126.242954 | 833.470895 | 52608.0 |
| std | 6.382257 | 6.583655 | 12.567376 | 39.043562 | 17.825137 | 63.745258 | 6.855467 | 59.763872 | 21.806797 | 26.38996 | ... | 186.523287 | 36983.188113 | 1656.715088 | 2472.671313 | 61.105478 | 9.777768 | 142.89474 | 67.921538 | 140.024961 | 30373.675974 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.269036 | 23.470839 | 1.737619 | 83.333333 | 36.585366 | 142.857143 | 2.826456 | 208.754209 | 38.461538 | 36.55914 | ... | 112.062812 | 17300.0 | 907.172996 | 1500.0 | 22.164276 | 5.851375 | 492.537313 | 58.430718 | 728.005865 | 26304.0 |
| 50% | 2.538071 | 27.738265 | 1.737619 | 99.593496 | 46.341463 | 181.547619 | 3.391747 | 252.525253 | 47.202797 | 50.537634 | ... | 312.633833 | 42400.0 | 2320.675105 | 3363.636364 | 91.264668 | 8.777063 | 579.455663 | 123.539232 | 816.715543 | 52608.0 |
| 75% | 5.076142 | 32.00569 | 2.606429 | 128.04878 | 59.756098 | 220.238095 | 5.652911 | 292.929293 | 62.937063 | 69.892473 | ... | 434.689507 | 68100.0 | 3700.421941 | 6250.0 | 125.162973 | 14.0433 | 654.960492 | 175.292154 | 909.824047 | 78912.0 |
| max | 48.22335 | 115.220484 | 151.172893 | 321.138211 | 150.0 | 535.714286 | 44.657999 | 552.188552 | 157.342657 | 198.924731 | ... | 852.96217 | 192800.0 | 7751.054852 | 12386.363636 | 335.071708 | 60.269163 | 1138.718174 | 362.270451 | 1549.120235 | 105216.0 |
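
Beyond shape and summary statistics, a few targeted checks can be useful before cleaning. The following is a minimal sketch on a tiny synthetic frame (the values are made up and only mimic the column layout above); it counts missing values, tests whether timestamps are evenly spaced at 15 minutes, and counts rows where every series reads zero, as in the first row of the real data.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real frame (values are made up).
demo = pd.DataFrame({
    "date": pd.date_range("2012-01-01", periods=8, freq="15min").astype(str),
    "value_0": [0.0, 3.8, 5.1, 3.8, 3.8, 2.5, 2.5, 1.3],
    "value_1": [0.0, 22.8, 22.8, 22.8, 22.8, 21.3, 21.3, 19.9],
    "time_step": np.arange(8),
})

# Select only the load columns by their shared prefix.
value_cols = [c for c in demo.columns if c.startswith("value_")]

# 1. Missing values per load column.
n_missing = demo[value_cols].isna().sum()

# 2. Are consecutive timestamps exactly 15 minutes apart?
steps = pd.to_datetime(demo["date"]).diff().dropna()
is_regular = bool((steps == pd.Timedelta(minutes=15)).all())

# 3. Number of rows in which every load column is zero.
all_zero_rows = int((demo[value_cols] == 0).all(axis=1).sum())

print(n_missing.to_dict(), is_regular, all_zero_rows)
```

On the real data, the same three checks would run against `df` instead of `demo`.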


### 3. Generate Time-Based Features

The date column is converted to a proper datetime type.  
We extract additional features that are often helpful for load prediction:
- hour of day (0-23)  
- day of month (1-31)  
- month (1-12)  
- weekday (Monday = 0, Sunday = 6)  

These features help our later models capture daily, weekly, and seasonal patterns.

In [10]:
df['date'] = pd.to_datetime(df['date'])
df['hour'] = df['date'].dt.hour
df['day'] = df['date'].dt.day
df['month'] = df['date'].dt.month
df['weekday'] = df['date'].dt.weekday
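
To see what these accessors return, here is a small sketch on two hand-picked timestamps (the `demo` frame is illustrative only and not part of the dataset). It uses `assign` to add all four columns in one step, which is equivalent to the column-by-column assignments above.

```python
import pandas as pd

# Two example timestamps: a Sunday just after midnight and a Monday afternoon.
demo = pd.DataFrame({"date": ["2012-01-01 00:15:00", "2012-01-02 13:45:00"]})
demo["date"] = pd.to_datetime(demo["date"])

demo = demo.assign(
    hour=demo["date"].dt.hour,
    day=demo["date"].dt.day,
    month=demo["date"].dt.month,
    weekday=demo["date"].dt.weekday,  # Monday = 0, Sunday = 6
)

print(demo)
```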

### 4. Save Cleaned Dataset

We save the cleaned dataframe as `../data/cleaned_data.pkl` so that other notebooks 
(EDA, baseline modeling, dimensionality reduction, etc.) can reuse it. 
The `../data` directory must already exist.

In [7]:
df.to_pickle("../data/cleaned_data.pkl")
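
A downstream notebook would typically reload the file with `pd.read_pickle`. The sketch below is a minimal round-trip check under stated assumptions: it uses a small made-up frame and a temporary directory instead of the real `../data` path, and verifies that values and the datetime dtype survive the save.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Small made-up frame standing in for the cleaned dataset.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-01 00:00:00", "2012-01-01 00:15:00"]),
    "value_0": [0.0, 3.807107],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "cleaned_data.pkl"
    # Create the target folder before writing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    demo.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# Pickle keeps dtypes, so no re-parsing of the date column is needed.
round_trip_ok = restored.equals(demo)
dtypes_kept = restored["date"].dtype == demo["date"].dtype
print(round_trip_ok, dtypes_kept)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.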