# Exploratory Data Analysis (EDA)

## Table of Contents
1. [Dataset Overview](#dataset-overview)
2. [Handling Missing Values](#handling-missing-values)
3. [Feature Distributions](#feature-distributions)
4. [Possible Biases](#possible-biases)
5. [Correlations](#correlations)




In [4]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Dataset Overview

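The dataset is read from `Load_21-24.csv` and contains quarter-hourly electricity load data for the DE-LU bidding zone (`BZN|DE-LU`), starting on 01.01.2021. It has 140,272 samples and 3 features: the time interval (`Time (CET/CEST)`), the day-ahead total load forecast in MW, and the actual total load in MW. The cells below show the structure of the data.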


## Load the Dataset

In [5]:
# Use a forward slash in the path: the backslash in '\L' is an invalid escape sequence
df = pd.read_csv('../Data/Load_21-24.csv')


In [6]:
df.head()

| | Time (CET/CEST) | Day-ahead Total Load Forecast [MW] - BZN\|DE-LU | Actual Total Load [MW] - BZN\|DE-LU |
|---|---|---|---|
| 0 | 01.01.2021 00:00 - 01.01.2021 00:15 | 43935.0 | 45458.0 |
| 1 | 01.01.2021 00:15 - 01.01.2021 00:30 | 43738.0 | 45237.0 |
| 2 | 01.01.2021 00:30 - 01.01.2021 00:45 | 43247.0 | 44886.0 |
| 3 | 01.01.2021 00:45 - 01.01.2021 01:00 | 43162.0 | 44585.0 |
| 4 | 01.01.2021 01:00 - 01.01.2021 01:15 | 42453.0 | 43952.0 |


## Data dimensions and description:

In [7]:
# Number of samples
num_samples = df.shape[0]

# Number of features
num_features = df.shape[1]

# Display these dataset characteristics
print(f"Number of samples: {num_samples}")
print(f"Number of features: {num_features}")

# Display the first few rows of the dataframe to show the structure
print("Example data:")
print(df.head())
print(df.info())


Number of samples: 140272
Number of features: 3
Example data:
                       Time (CET/CEST)  \
0  01.01.2021 00:00 - 01.01.2021 00:15   
1  01.01.2021 00:15 - 01.01.2021 00:30   
2  01.01.2021 00:30 - 01.01.2021 00:45   
3  01.01.2021 00:45 - 01.01.2021 01:00   
4  01.01.2021 01:00 - 01.01.2021 01:15   

  Day-ahead Total Load Forecast [MW] - BZN|DE-LU  \
0                                        43935.0   
1                                        43738.0   
2                                        43247.0   
3                                        43162.0   
4                                        42453.0   

  Actual Total Load [MW] - BZN|DE-LU  
0                            45458.0  
1                            45237.0  
2                            44886.0  
3                            44585.0  
4                            43952.0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140272 entries, 0 to 140271
Data columns (total 3 columns):
 #   Column                 

## Handling Missing Values

The time column has no missing values. The day-ahead forecast column has 216 missing values and the actual load column has 22. Both columns are filled by carrying the last valid observation forward (forward fill).


In [8]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values




Time (CET/CEST)                                     0
Day-ahead Total Load Forecast [MW] - BZN|DE-LU    216
Actual Total Load [MW] - BZN|DE-LU                 22
dtype: int64
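Forward filling is easiest to justify when the gaps are short, because each filled value then repeats a recent observation. The sketch below is one possible way to measure the longest run of consecutive missing values in a column. The helper name `longest_nan_run` and the demo series are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def longest_nan_run(series: pd.Series) -> int:
    """Return the length of the longest run of consecutive NaN values."""
    is_na = series.isna()
    # A new group id starts every time the NaN / non-NaN state changes
    group_ids = (is_na != is_na.shift()).cumsum()
    # Summing the boolean flags per group gives the run length for NaN groups
    # and 0 for groups of valid values
    run_lengths = is_na.groupby(group_ids).sum()
    return int(run_lengths.max()) if len(run_lengths) else 0


# Hypothetical demo data: one gap of length 2 and one gap of length 1
demo = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])
print(longest_nan_run(demo))
```

Applied to the real data, the same helper could be called on each load column before the forward fill, for example `longest_nan_run(df['Actual Total Load [MW] - BZN|DE-LU'])`.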

In [9]:
# Forward-fill missing values. Assign the result back to the column:
# calling ffill(inplace=True) on df[col] operates on a copy and is deprecated.
df['Day-ahead Total Load Forecast [MW] - BZN|DE-LU'] = df['Day-ahead Total Load Forecast [MW] - BZN|DE-LU'].ffill()
df['Actual Total Load [MW] - BZN|DE-LU'] = df['Actual Total Load [MW] - BZN|DE-LU'].ffill()


## Convert the time column and engineer features:

In [10]:
# Step 1: Split the timespan into start and end times
df[['start_time', 'end_time']] = df['Time (CET/CEST)'].str.split(' - ', expand=True)

# Step 2: Convert both to datetime format
df['start_time'] = pd.to_datetime(df['start_time'], format='%d.%m.%Y %H:%M')
df['end_time'] = pd.to_datetime(df['end_time'], format='%d.%m.%Y %H:%M')

# Step 3: Feature engineering
df['year'] = df['start_time'].dt.year
df['month'] = df['start_time'].dt.month
df['day'] = df['start_time'].dt.day
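The `Time (CET/CEST)` column holds local wall-clock times, so the hour in which clocks are set back in autumn appears twice once the strings are parsed as naive datetimes. The sketch below shows one possible way to make such timestamps unambiguous, assuming the local time zone is `Europe/Berlin` and the rows are in chronological order. The demo data is synthetic, and the real file may also need the `nonexistent=` argument of `tz_localize` if it contains rows for the hour skipped in spring.

```python
import pandas as pd

# Synthetic wall-clock timestamps around the autumn 2021 clock change,
# derived from a UTC range so that the repeated 02:00-02:45 hour is present
utc = pd.date_range("2021-10-30 23:00", periods=12, freq="15min", tz="UTC")
local_naive = pd.Series(utc.tz_convert("Europe/Berlin").tz_localize(None))

# ambiguous="infer" uses the row order to decide which of the repeated
# timestamps belongs to summer time (CEST) and which to winter time (CET)
local_aware = local_naive.dt.tz_localize("Europe/Berlin", ambiguous="infer")
start_time_utc = local_aware.dt.tz_convert("UTC")

print("duplicated naive timestamps:", local_naive.duplicated().sum())
print("duplicated UTC timestamps:", start_time_utc.duplicated().sum())
```

Keeping a UTC column alongside `start_time` gives a strictly increasing index, which can be convenient for resampling or for joining with other time series.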


In [11]:
df.columns

Index(['Time (CET/CEST)', 'Day-ahead Total Load Forecast [MW] - BZN|DE-LU',
       'Actual Total Load [MW] - BZN|DE-LU', 'start_time', 'end_time', 'year',
       'month', 'day'],
      dtype='object')

## Feature Distributions

[Plot the distribution of various features and target variables. Comment on the skewness, outliers, or any other observations.]


In [None]:
import plotly.express as px

# Convert the load columns to numeric; entries that cannot be parsed become NaN
df['Day-ahead Total Load Forecast [MW] - BZN|DE-LU'] = pd.to_numeric(df['Day-ahead Total Load Forecast [MW] - BZN|DE-LU'], errors='coerce')
df['Actual Total Load [MW] - BZN|DE-LU'] = pd.to_numeric(df['Actual Total Load [MW] - BZN|DE-LU'], errors='coerce')

def plot_load_over_time(df, start_date, end_date):
    """
    Plots the actual load over time for a given time period using Plotly.

    Parameters:
    - df: DataFrame containing the load data
    - start_date: Start date (string or datetime) for the time period
    - end_date: End date (string or datetime) for the time period
    """
    # Parse 'start_time' into a local Series so the caller's DataFrame is not modified
    start_time = pd.to_datetime(df['start_time'])

    # Keep rows whose start time lies in [start_date, end_date], both ends inclusive
    mask = (start_time >= start_date) & (start_time <= end_date)
    df_filtered = df[mask]

    # Plotting using Plotly
    fig = px.line(df_filtered, x='start_time', y='Actual Total Load [MW] - BZN|DE-LU',
                  labels={'start_time': 'Time', 'Actual Total Load [MW] - BZN|DE-LU': 'Actual Total Load [MW]'},
                  title=f'Actual Total Load from {start_date} to {end_date}')
    
    # Show the plot
    fig.show()

# Example of usage:
start_date = '2020-01-01'  # Start date in string format
end_date = '2025-01-01'  # End date in string format
plot_load_over_time(df, start_date, end_date)
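Plotting the full range draws every quarter-hourly point, which can make a multi-year line chart dense and slow to render. One option is to aggregate to daily means before plotting. The sketch below illustrates the resampling step on synthetic data; the column name `load_mw` and the demo values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic quarter-hourly series (3 days) standing in for the real load column
idx = pd.date_range("2021-01-01", periods=4 * 24 * 3, freq="15min")
demo = pd.DataFrame({"start_time": idx, "load_mw": np.arange(len(idx), dtype=float)})

# Resampling needs a datetime index; "D" groups the rows by calendar day
daily_mean = demo.set_index("start_time")["load_mw"].resample("D").mean()
print(daily_mean)
```

The resulting series has one value per day and could be passed to `px.line` after `reset_index()`.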


In [24]:
start_date = '2021-01-01'  
end_date = '2022-01-01'  
plot_load_over_time(df, start_date, end_date)

In [25]:
start_date = '2021-01-01'  
end_date = '2021-02-01'  
plot_load_over_time(df, start_date, end_date)

In [None]:
start_date = '2021-01-04'  # Monday
end_date = '2021-01-11'  # the following Monday at 00:00, so the plot covers Monday through Sunday
plot_load_over_time(df, start_date, end_date)
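The time-series plots show seasonality, but skewness and outliers are easier to judge from summary statistics. The sketch below is one possible check, using the sample skewness and the common 1.5 × IQR rule as an assumed outlier criterion. The helper name `summarize_distribution` and the demo values are illustrative only.

```python
import pandas as pd


def summarize_distribution(series: pd.Series) -> dict:
    """Return the skewness and the number of values outside the 1.5 * IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((series < lower) | (series > upper)).sum())
    return {"skewness": float(series.skew()), "iqr_outliers": n_outliers}


# Hypothetical demo data with one obvious high outlier
demo = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])
print(summarize_distribution(demo))
```

On the real data the helper could be applied to `df['Actual Total Load [MW] - BZN|DE-LU']`. A positive skewness would indicate a longer tail toward high load values.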

## Saving the files for the next steps:

In [30]:
df.columns

Index(['Time (CET/CEST)', 'Day-ahead Total Load Forecast [MW] - BZN|DE-LU',
       'Actual Total Load [MW] - BZN|DE-LU', 'start_time', 'end_time', 'year',
       'month', 'day'],
      dtype='object')

In [31]:
df.to_csv("../data/load_value_de_2021_2024.csv", index=False)
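CSV files store datetimes as text, so `start_time` and `end_time` come back as strings unless they are parsed again on load. The sketch below demonstrates the round trip on a small stand-in frame written to a temporary directory. The file name `demo_load.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame with a datetime column, mirroring the saved structure
demo = pd.DataFrame({
    "start_time": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:15"]),
    "Actual Total Load [MW] - BZN|DE-LU": [45458.0, 45237.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_load.csv")  # hypothetical file name
    demo.to_csv(path, index=False)
    plain = pd.read_csv(path)                                # datetimes stay text
    parsed = pd.read_csv(path, parse_dates=["start_time"])   # datetimes restored

print(plain["start_time"].dtype)
print(parsed["start_time"].dtype)
```

In the next step of the project, the saved file could be loaded the same way with `parse_dates=['start_time', 'end_time']`.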

## Correlations

[Explore correlations between features and the target variable, as well as among features themselves.]


In [15]:
# Plot a heatmap of the correlations between the numeric columns only.
# The raw 'Time (CET/CEST)' string column cannot be converted to float,
# so calling df.corr() on the full DataFrame raises a ValueError.
correlation_matrix = df.select_dtypes(include='number').corr()
sns.heatmap(correlation_matrix, annot=True)
plt.show()
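The most direct relationship in this dataset is between the day-ahead forecast and the actual load. As a minimal sketch, the Pearson correlation and the mean absolute forecast error can be computed as below. The five rows are the ones shown by `df.head()` earlier and serve only as a tiny stand-in, so the numbers do not describe the full dataset.

```python
import pandas as pd

forecast_col = "Day-ahead Total Load Forecast [MW] - BZN|DE-LU"
actual_col = "Actual Total Load [MW] - BZN|DE-LU"

# First five rows of the dataset, copied from the df.head() output
sample = pd.DataFrame({
    forecast_col: [43935.0, 43738.0, 43247.0, 43162.0, 42453.0],
    actual_col: [45458.0, 45237.0, 44886.0, 44585.0, 43952.0],
})

# Pearson correlation between forecast and actual load
pearson_r = sample[forecast_col].corr(sample[actual_col])

# Mean absolute error of the forecast, in MW
mean_abs_error = (sample[actual_col] - sample[forecast_col]).abs().mean()

print(f"Pearson r: {pearson_r:.4f}")
print(f"Mean absolute error: {mean_abs_error:.1f} MW")
```

Replacing `sample` with `df` would give the same two statistics for the full period.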