## 1. Import libraries and configuration

In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import requests


## 2. Load continuous dataset

In [7]:
BASE = "https://api.energidataservice.dk/dataset/ConsumptionConsumerCategoryHour"

params = {
    "start": "2025-11-01",   # try any month
    "end":   "2025-12-01",
    "sort": "TimeDK asc",
    "limit": 0               # 0 = return every record in the range
}

# Download the requested range and cache it as a raw CSV
r = requests.get(BASE, params=params, timeout=60)
r.raise_for_status()
records = r.json()["records"]
df = pd.DataFrame(records)
df.to_csv("../data/raw/energidataservice.csv", index=False)
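
The download cell hits the API every time it runs and assumes `../data/raw/` already exists. A minimal sketch of a cache-first loader follows; `load_or_download` and `fake_fetch` are hypothetical names introduced here for illustration, and the demo uses made-up records in a temporary folder rather than the real API.

```python
import tempfile
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_download(csv_path, fetch_records: Callable[[], list]) -> pd.DataFrame:
    """Read csv_path if it exists; otherwise call fetch_records() and cache the result."""
    path = Path(csv_path)
    if not path.exists():
        # to_csv does not create missing folders, so create them first
        path.parent.mkdir(parents=True, exist_ok=True)
        pd.DataFrame(fetch_records()).to_csv(path, index=False)
    frame = pd.read_csv(path, parse_dates=["TimeDK", "TimeUTC"], index_col="TimeDK")
    return frame.sort_index()


# Demo with made-up records (hypothetical values), no network access needed
calls = []

def fake_fetch():
    calls.append(1)
    return [
        {"TimeDK": "2025-11-01T01:00:00", "TimeUTC": "2025-11-01T00:00:00", "ConsumptionkWh": 2.0},
        {"TimeDK": "2025-11-01T00:00:00", "TimeUTC": "2025-10-31T23:00:00", "ConsumptionkWh": 1.0},
    ]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "raw" / "demo.csv"
    first = load_or_download(target, fake_fetch)    # fetches and writes the CSV
    second = load_or_download(target, fake_fetch)   # served from the cached CSV
```

In the notebook, the fetch callable could wrap the existing request, for example `lambda: requests.get(BASE, params=params, timeout=60).json()["records"]`.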



In [14]:
df = pd.read_csv('../data/raw/energidataservice.csv', parse_dates=["TimeDK", "TimeUTC"], index_col='TimeDK')
df = df.sort_index()
df

TimeDK,TimeUTC,RegionName,ConsumerCategory3,ConsumerCategory2,ConsumptionkWh
2025-11-01 00:00:00,2025-10-31 23:00:00,Region Hovedstaden,Erhverv,Erhverv,477154.978
2025-11-01 00:00:00,2025-10-31 23:00:00,Region Midtjylland,Erhverv,Erhverv,638411.063
2025-11-01 00:00:00,2025-10-31 23:00:00,Region Nordjylland,Erhverv,Erhverv,314396.356
2025-11-01 00:00:00,2025-10-31 23:00:00,Region Sjælland,Erhverv,Erhverv,296731.698
2025-11-01 00:00:00,2025-10-31 23:00:00,Region Syddanmark,Erhverv,Erhverv,679047.686
...,...,...,...,...,...
2025-11-30 23:00:00,2025-11-30 22:00:00,Region Hovedstaden,Erhverv,Erhverv,558995.137
2025-11-30 23:00:00,2025-11-30 22:00:00,Region Midtjylland,Erhverv,Erhverv,931012.065
2025-11-30 23:00:00,2025-11-30 22:00:00,Region Nordjylland,Erhverv,Erhverv,432596.728
2025-11-30 23:00:00,2025-11-30 22:00:00,Region Sjælland,Erhverv,Erhverv,335539.309


## 3. Basic dataset inspection

In [3]:
'''
Why: look at the first rows to

- understand what each column means,
- check that the datetime columns parsed correctly,
- spot strange values (NaN, negatives, etc.).

Typical conclusion: "ok, this is hourly consumption per region and consumer category".
'''
df.head()


In [4]:
'''
Why: review

- how many rows there are (dataset size),
- how many columns,
- the data types (numbers, strings, dates),
- whether there are nulls (a non-null count lower than the number of entries).

Typical conclusion:

"we have one month of hourly data, one row per hour, region and consumer category"

"no nulls in ConsumptionkWh, so it is usable as target".
'''
df.info()
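
`df.info()` reports nulls but not hours that are missing altogether. The sketch below checks both on a tiny synthetic frame; `missing_hours` is a hypothetical helper and the values are made up. It assumes a regular hourly index, which local time (`TimeDK`) does not satisfy on daylight-saving transition days, so `TimeUTC` may be the safer index for this check.

```python
import pandas as pd


def missing_hours(frame: pd.DataFrame, freq: str = "h") -> pd.DatetimeIndex:
    """Return the timestamps that are absent from the frame's DatetimeIndex."""
    present = frame.index.unique()
    expected = pd.date_range(present.min(), present.max(), freq=freq)
    return expected.difference(present)


# Synthetic example: the 02:00 row is missing and one value is null
demo = pd.DataFrame(
    {"ConsumptionkWh": [10.0, 12.0, None, 11.0]},
    index=pd.to_datetime(
        ["2025-11-01 00:00", "2025-11-01 01:00", "2025-11-01 03:00", "2025-11-01 04:00"]
    ),
)

null_counts = demo.isna().sum()   # nulls per column
gaps = missing_hours(demo)        # hours with no row at all
print(null_counts)
print(gaps)
```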


In [5]:
'''
Why: look at summary statistics (mean, min, max) to detect outliers.

Is the minimum ConsumptionkWh reasonable? (If it is 0 or negative, beware.)

This dataset has no weather columns. If temperature (e.g. T2M) is merged in later, check its units: a mean around 290 suggests Kelvin, a mean around 20 suggests Celsius.

Typical conclusion:

"Consumption looks normal (range A to B)".
"Temperatures, if present, seem to be in Celsius".
'''
df.describe()
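
Reading min and max by eye can be complemented with an explicit rule. The following is a rough sketch using Tukey's IQR fences plus a non-positive check; `flag_suspicious` is a hypothetical helper, the multiplier `k=1.5` is a conventional default rather than something tuned for this dataset, and the numbers are invented.

```python
import pandas as pd


def flag_suspicious(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value is non-positive or outside the IQR fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return outside | (values <= 0)


# Invented readings: one zero and one implausibly large value
demo = pd.Series([100.0, 105.0, 98.0, 102.0, 0.0, 101.0, 990.0])
mask = flag_suspicious(demo)
print(demo[mask])
```

Flagged rows are candidates for inspection, not automatic deletions: a genuine consumption peak can also fall outside the fences.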


## 4. Visualization (Trends and Seasonality)

In [6]:
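'''
Why: seeing the entire series helps to:

- see whether there is a long-term trend (increasing/decreasing demand),
- see annual seasonality (peaks in summer/winter), once more than one month is loaded,
- detect large gaps or outliers (e.g. weeks at 0).
'''
# Sum over regions and consumer categories to get one national value per hour
# (assumes the rows do not overlap, i.e. no category is a subtotal of another)
total = df.groupby(level=0)['ConsumptionkWh'].sum()

plt.figure(figsize=(15, 5))
plt.plot(total.index, total, label='National consumption', linewidth=0.5)
plt.title("Complete hourly consumption history")
plt.xlabel("Date")
plt.ylabel("Consumption (kWh per hour)")
plt.legend()
plt.show()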


In [7]:
'''
Why: zooming in on one week lets us see:

- the daily pattern (day vs night),
- weekday vs weekend (demand usually drops on Sat/Sun).

Usefulness: confirms that "hour of day" and "day of week" are critical features.
'''
# Monday to Sunday inside the downloaded range, summed to one national value per hour
week = df.loc['2025-11-03':'2025-11-09'].groupby(level=0)['ConsumptionkWh'].sum()

plt.figure(figsize=(15, 5))
plt.plot(week.index, week, marker='.', label='1 Week Demand')
plt.title("Weekly Zoom (3-9 November 2025)")
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()
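
If hour of day and day of week do turn out to matter, they can be derived directly from the datetime index. A minimal sketch, assuming a `DatetimeIndex` like the one built from `TimeDK`; `add_calendar_features` and the `is_weekend` column are hypothetical names, and the consumption values are placeholders.

```python
import pandas as pd


def add_calendar_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with hour, dayofweek and is_weekend columns from the index."""
    out = frame.copy()
    out["hour"] = out.index.hour
    out["dayofweek"] = out.index.dayofweek      # Monday=0 ... Sunday=6
    out["is_weekend"] = out["dayofweek"] >= 5   # Saturday or Sunday
    return out


# Four hours spanning Friday night into Saturday morning
idx = pd.date_range("2025-11-07 22:00", periods=4, freq="h")
demo = add_calendar_features(
    pd.DataFrame({"ConsumptionkWh": [1.0, 2.0, 3.0, 4.0]}, index=idx)
)
print(demo)
```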


## 5. Correlation Analysis

In [8]:
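'''
Why: to verify which variables move together with demand.

The only numeric column in this dataset is ConsumptionkWh, so we correlate consumption across regions.

If weather data is merged in later, we expect a relationship between temperature (T2M) and demand (heating/cooling depending on the region).

Typical conclusion: "Temperature has a relationship, but it is not linear (usually U-shaped), so the simple correlation may be low even though the feature is important".
'''
import seaborn as sns

# One column per region (summed over consumer categories); corr() needs numeric columns
by_region = df.groupby(['TimeDK', 'RegionName'])['ConsumptionkWh'].sum().unstack('RegionName')
corr = by_region.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='coolwarm', annot=False)
plt.title("Correlation Matrix")
plt.show()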
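
The point about U-shaped relationships can be illustrated with synthetic numbers. In this sketch both the temperature range and the quadratic response with its minimum at 18 are invented for illustration, not taken from the dataset.

```python
import numpy as np

temp = np.linspace(0.0, 36.0, 73)   # invented temperatures, symmetric around 18
demand = (temp - 18.0) ** 2         # invented U-shaped response, minimum at 18

# Pearson correlation with the raw temperature: essentially zero
r_linear = np.corrcoef(temp, demand)[0, 1]

# Correlation with the distance from the comfort point: strongly positive
r_distance = np.corrcoef(np.abs(temp - 18.0), demand)[0, 1]

print(r_linear, r_distance)
```

Demand here is fully determined by temperature, yet the linear correlation vanishes, which is why a low coefficient in the heatmap is not enough to discard a feature.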


## 6. Histograms (Distribution)

In [9]:
'''
Why: check whether the target is roughly Gaussian or skewed.

If it is very skewed, applying log or sqrt may help (usually not strictly necessary for an LSTM, but good to know).

It also helps to detect outliers (bars very far to the right or left).
'''
total = df.groupby(level=0)['ConsumptionkWh'].sum()   # national consumption per hour

plt.figure(figsize=(8, 5))
sns.histplot(total, bins=50, kde=True)
plt.title("Demand Distribution")
plt.show()
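
Skewness can also be quantified instead of judged from the plot. A small sketch on synthetic log-normal data (not the real consumption series) compares the sample skewness before and after a `log1p` transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed stand-in for a demand column
demo = pd.Series(rng.lognormal(mean=10.0, sigma=1.0, size=5000))

skew_raw = demo.skew()             # clearly positive: long right tail
skew_log = np.log1p(demo).skew()   # close to zero after the transform

print(skew_raw, skew_log)
```

If a transform is applied to the target, predictions have to be mapped back with the inverse (`np.expm1` for `np.log1p`) before computing errors in the original units.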


## 7. Conclusions of the Exploration
*(Write here what you found)*

1. **Data Quality**: The data looks [clean/dirty], [with/without] nulls.
2. **Patterns**: There is clear daily and weekly seasonality.
3. **Features**: Temperature seems relevant.
4. **Next Steps**: We need to normalize (Scale) and create time windows for the LSTM.
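
The scaling and windowing step could look roughly like the sketch below. `make_windows`, the lookback of 3 and the split point are hypothetical choices for illustration. Min-max scaling is one option among several, and it is fitted on the training part only so that no information leaks in from later data.

```python
import numpy as np


def make_windows(series: np.ndarray, lookback: int, horizon: int = 1):
    """Turn a 1-D series into (X, y): `lookback` past values -> next `horizon` values."""
    n = len(series) - lookback - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for the requested lookback and horizon")
    X = np.stack([series[i : i + lookback] for i in range(n)])
    y = np.stack([series[i + lookback : i + lookback + horizon] for i in range(n)])
    # LSTM layers usually expect (samples, timesteps, features)
    return X[..., np.newaxis], y


values = np.arange(10, dtype=float)   # stand-in for the hourly target
split = 8                             # chronological split: first 8 points are "train"

# Min-max scaling with statistics from the training part only
lo, hi = values[:split].min(), values[:split].max()
scaled = (values - lo) / (hi - lo)

X, y = make_windows(scaled, lookback=3, horizon=1)
print(X.shape, y.shape)
```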