# Exploratory Data Analysis

### Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1320210409)
randomstate = np.random.RandomState(1320210409)

# The data

## Features

All features are hourly, country-wide averages.
- **Time** _[YYYY-MM-DD HH:MM:SS]_
- **el_load:** electricity load _[MW]_
- **prec:** rainfall amount _[mm]_
- **temp:** temperature _[°C]_
- **rhum:** relative humidity _[%]_
- **grad:** global radiation _[J/cm²]_
- **pres:** momentary sea level air pressure _[hPa]_
- **wind:** average wind speed _[m/s]_
- **Vel_tviz:** Velence water temperature in Agárd _[°C]_
- **Bal_tviz:** Balaton water temperature in Siófok _[°C]_
- **holiday:** 1 or 0 depending on if it's a holiday
- **weekend:** 1 or 0 depending on if it's a weekend
- **covid:** 1 or 0 depending on covid restrictions in Hungary (estimate)

### The goal

I want to predict Hungary's electricity load for the **next couple of hours** using this dataset or its differently aggregated counterparts (country, region, county or station level).

In [None]:
df = pd.read_csv(
    'data/final_dataframe.csv',
    parse_dates=['Time'],
    index_col='Time',
    sep=';'
)

df.info()

df

There are no null entries; I dealt with those in the _data_organization_ notebook.
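
The "no null entries" claim can be checked directly rather than read off `df.info()`. The sketch below is only an illustration: `null_report` is a hypothetical helper name, and `demo` is a tiny synthetic stand-in for `df` (not the real data) so the cell runs on its own.

```python
import numpy as np
import pandas as pd


def null_report(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Synthetic stand-in for df (assumption): six hourly rows, one missing temp value.
idx = pd.date_range("2021-01-01", periods=6, freq=pd.Timedelta(hours=1), name="Time")
demo = pd.DataFrame(
    {
        "el_load": [4200.0, 4100.0, 4050.0, 4000.0, 4080.0, 4300.0],
        "temp": [1.5, 1.2, np.nan, 0.8, 0.9, 1.1],
    },
    index=idx,
)

# An empty report means the frame has no nulls; here it lists the 'temp' column.
print(null_report(demo))
```

Running `null_report(df)` on the real frame should print an empty Series if the cleaning in the _data_organization_ notebook worked as intended.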

In [None]:
df['hour'] = df.index.hour
df['weekday'] = df.index.weekday
df['dayofmonth'] = df.index.day
df['month'] = df.index.month
df['year'] = df.index.year

df

# Exploring the el_load feature

In [None]:
group_by = ['hour', 'weekday', 'dayofmonth', 'month', 'year']

# One subplot per grouping key, stacked vertically
fig, axes = plt.subplots(len(group_by), 1, figsize=(15, 30))

for group, ax in zip(group_by, axes.flatten()):
    # Mean load for each value of the grouping key (e.g. each hour of the day)
    grouped = df.groupby(group)['el_load'].mean()
    ax.set_title(f"el_load mean grouped by {group}", fontsize=15)
    grouped.plot(ax=ax, color="#221f1f", marker="o")

#### el_load
- the hourly average rises during the day and peaks at 18-19 h
- load is lower during the weekend
- the day of the month does not tell us much at this point
- over the year, load is higher in winter, probably because there is less sunlight
- the effects of covid are visible between 2020 and 2022
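
The first two observations can also be expressed as numbers instead of being read off the plots. This is a minimal sketch, not part of the original analysis: `summarize_load` is a hypothetical helper, and `demo` is synthetic data built to peak at 18 h and to sit 400 MW lower on weekends, purely so the cell is self-contained.

```python
import numpy as np
import pandas as pd


def summarize_load(frame: pd.DataFrame, column: str = "el_load") -> dict:
    """Return the hour with the highest mean load and the weekday/weekend means."""
    hourly = frame.groupby(frame.index.hour)[column].mean()
    is_weekend = frame.index.weekday >= 5  # Saturday = 5, Sunday = 6
    return {
        "peak_hour": int(hourly.idxmax()),
        "weekday_mean": float(frame.loc[~is_weekend, column].mean()),
        "weekend_mean": float(frame.loc[is_weekend, column].mean()),
    }


# Synthetic stand-in for df (assumption): two full weeks of hourly data
# starting on Monday 2021-01-04.
idx = pd.date_range("2021-01-04", periods=14 * 24, freq=pd.Timedelta(hours=1), name="Time")
load = 5000.0 - 50.0 * np.abs(idx.hour - 18) - 400.0 * (idx.weekday >= 5)
demo = pd.DataFrame({"el_load": load}, index=idx)

print(summarize_load(demo))
```

Calling `summarize_load(df)` on the real frame would give the corresponding figures for the actual load series.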

In [None]:
group_by = ['hour', 'weekday', 'dayofmonth', 'month', 'year']

features = [
    ('temp', 'Temperature'),
    ('prec', 'Liquid precipitation'),
    ('rhum', 'Relative humidity'),
    ('wind', 'Wind speed'),
    ('grad', 'Global radiation'),
    ('pres', 'Pressure at sea level'),
    ('Vel_tviz', 'Velence water temperature'),
    ('Bal_tviz', 'Balaton water temperature'),
]


for f, desc in features:
    # One row of subplots per feature, one panel per grouping key
    fig, axes = plt.subplots(1, len(group_by), figsize=(45, 5))
    fig.suptitle(f"Feature: {f}")
    for group, ax in zip(group_by, axes):
        # Mean of the feature for each value of the grouping key
        grouped = df.groupby(group)[f].mean()
        ax.set_title(f"{desc} mean grouped by {group}", fontsize=12)
        ax.plot(grouped, marker="o")