# **Predicting Electricity Spot Prices Based on Weather Patterns in Nordic Countries**

In this project, we will combine historical weather data and electricity spot price data for the years 2015-2019 in Finland, Norway, and Sweden. Our goal is to predict the electricity spot prices by using weather features like temperature, precipitation, and wind speed.

In [1]:
# Setup and libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.tsa.holtwinters import ExponentialSmoothing
import plotly.express as px

print("Libraries imported")

Libraries imported


In [2]:
#Loading Data
weather_df = pd.read_csv('/kaggle/input/finland-norway-and-sweden-weather-data-20152019/nordics_weather.csv')
electricity_df = pd.read_csv('/kaggle/input/electricity-spot-price/Elspotprices.csv')

print("Datasets Loaded")

Datasets Loaded


In [3]:
# Initial data exploration
weather_df.head()
electricity_df.head()

weather_df.info()
electricity_df.info()

weather_df.isnull().sum()
electricity_df.isnull().sum()


print(weather_df.head())
print(electricity_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5478 entries, 0 to 5477
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   country        5478 non-null   object 
 1   date           5478 non-null   object 
 2   precipitation  5478 non-null   float64
 3   snow_depth     5478 non-null   float64
 4   tavg           5478 non-null   float64
 5   tmax           5478 non-null   float64
 6   tmin           5478 non-null   float64
dtypes: float64(5), object(2)
memory usage: 299.7+ KB
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 50831 entries, ('2022-10-19 21:00;2022-10-19 23:00;DK2;978', '750000;131') to ('2016-12-31 23:00;2017-01-01 00:00;DK2;155', '820007;20')
Data columns (total 1 columns):
 #   Column                                              Non-Null Count  Dtype
---  ------                                              --------------  -----
 0   HourUTC;HourDK;PriceArea;SpotPriceDKK;SpotPriceEUR  50831 non-

Based on the initial exploration of both datasets, we can see there are no missing values, but the data structure needs to be cleaned and parsed correctly.

Weather Data:
There are no missing values and the data appears to be clean.

The 'date' column is currently of type object and needs to be converted to 'datetime' and set as the index. This will allow for easier time series analysis and alignment with the Electricity Spot Price dataset.

Electricity Spot Price Data:
There are no missing values, but the dataset requires some cleaning.

The data is currently held in a single column of concatenated values. These need to be split and organized into separate columns.

The 'SpotPriceDKK' and 'SpotPriceEUR' columns use a comma rather than a dot as the decimal separator, which will cause issues when converting them to numerical values. They will be cleaned and converted to 'float'.

Both 'HourUTC' and 'HourDK' columns are strings that need to be converted to 'datetime' for time-based analysis, just like the weather 'date' values. I will set 'HourUTC' as the index, as this will allow me to merge the datasets on a common index.
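The merge on a common index described above might look roughly like the following. This is a minimal sketch on made-up data, not the notebook's actual pipeline: the frames `prices` and `weather`, their values, and the choice to average hourly prices per day are all assumptions for illustration. It also assumes weather for a single country. The weather file covers three countries, so each date presumably appears once per country and the real data would need filtering or pivoting first.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly prices for two days (values are made up for illustration)
hours = pd.Timestamp("2015-01-01") + pd.to_timedelta(np.arange(48), unit="h")
prices = pd.DataFrame({"SpotPriceEUR": np.arange(48, dtype=float)}, index=hours)
prices.index.name = "HourUTC"

# Hypothetical daily weather for one country (values are made up for illustration)
days = pd.to_datetime(["2015-01-01", "2015-01-02"])
weather = pd.DataFrame({"tavg": [1.4, 0.6]}, index=days)
weather.index.name = "date"

# Average the hourly prices per calendar day so both frames share a daily index
daily_prices = prices.resample("D").mean()

# An inner join keeps only the dates present in both frames
merged = weather.join(daily_prices, how="inner")
print(merged)
```

Averaging is only one way to bring hourly prices down to the daily resolution of the weather data. A daily maximum or median may suit other questions better.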




In [4]:
#Weather Dataset

# Convert 'date' to 'datetime' and set 'date' as index
weather_df['date'] = pd.to_datetime(weather_df['date'])
weather_df.set_index('date', inplace=True)

print(weather_df.head())

            country  precipitation  snow_depth       tavg      tmax       tmin
date                                                                          
2015-01-01  Finland       1.714141  284.545455   1.428571  2.912739  -1.015287
2015-01-02  Finland      10.016667  195.000000   0.553571  2.358599  -0.998718
2015-01-03  Finland       3.956061  284.294118  -1.739286  0.820382  -3.463871
2015-01-04  Finland       0.246193  260.772727  -7.035714 -3.110828  -9.502581
2015-01-05  Finland       0.036364  236.900000 -17.164286 -8.727564 -19.004487


In [5]:
#Electricity Spot Price Dataset

# Step 1: Ensure the column is of string type
electricity_df['HourUTC;HourDK;PriceArea;SpotPriceDKK;SpotPriceEUR'] = electricity_df['HourUTC;HourDK;PriceArea;SpotPriceDKK;SpotPriceEUR'].astype(str)

# Step 2: Clean extra spaces and multiple semicolons
# Normalize spaces and ensure only one semicolon between values
electricity_df['HourUTC;HourDK;PriceArea;SpotPriceDKK;SpotPriceEUR'] = (
    electricity_df['HourUTC;HourDK;PriceArea;SpotPriceDKK;SpotPriceEUR']
    .str.replace(r'\s+', ' ', regex=True)   # Replace multiple spaces with a single space
    .str.replace(r';\s*', ';', regex=True)  # Ensure semicolons are followed by no spaces
    .str.replace(r'\s*;', ';', regex=True)  # Ensure semicolons are preceded by no spaces
)

# Step 3: Check how many splits each row produces
split_data = electricity_df['HourUTC;HourDK;PriceArea;SpotPriceDKK;SpotPriceEUR'].str.split(';', expand=True)

# Step 4: Print the number of splits for each row to check for anomalies
split_counts = split_data.apply(lambda row: row.isnull().sum(), axis=1)  # Counts missing (null) fields in each row after the split
print("Rows with incorrect splits (less than 5 values):")
print(split_counts[split_counts > 0])

# Step 5: Check the first few rows to inspect how the split is working
print(split_data.head())

# Step 6: Handle the case if the split is successful
if split_data.shape[1] == 5:
    # Rename the columns
    split_data.columns = ['HourUTC', 'HourDK', 'PriceArea', 'SpotPriceDKK', 'SpotPriceEUR']
    
    # Step 7: Merge the split columns back to the original dataframe
    electricity_df = electricity_df.join(split_data)

    # Remove the original concatenated column
    electricity_df.drop(columns=['HourUTC;HourDK;PriceArea;SpotPriceDKK;SpotPriceEUR'], inplace=True)

    # Step 8: Clean and convert SpotPrice columns
    electricity_df['SpotPriceDKK'] = electricity_df['SpotPriceDKK'].str.replace(',', '.').astype(float)
    electricity_df['SpotPriceEUR'] = electricity_df['SpotPriceEUR'].str.replace(',', '.').astype(float)

    # Step 9: Convert HourUTC and HourDK to datetime
    electricity_df['HourUTC'] = pd.to_datetime(electricity_df['HourUTC'])
    electricity_df['HourDK'] = pd.to_datetime(electricity_df['HourDK'])

    # Step 10: Set 'HourUTC' as the index
    electricity_df.set_index('HourUTC', inplace=True)

    # Verify the cleaned dataframe
    print(electricity_df.head())

else:
    print("Split did not produce 5 columns. Please check the data for inconsistencies.")

Rows with incorrect splits (less than 5 values):
Series([], dtype: int64)
                                                            0
2022-10-19 21:00;2022-10-19 23:00;DK2;978  750000;131  570007
2022-10-19 20:00;2022-10-19 22:00;DK2;1102 079956;148  149994
2022-10-19 19:00;2022-10-19 21:00;DK2;1090 329956;146  570007
2022-10-19 18:00;2022-10-19 20:00;DK2;1238 589966;166  500000
2022-10-19 17:00;2022-10-19 19:00;DK2;1688 050049;226  919998
Split did not produce 5 columns. Please check the data for inconsistencies.
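The output above suggests that the file was read with the default comma separator, so the decimal commas in the price fields split each row across the index and only the last fragment was left in the data column. That leaves no semicolons for the split to find. A minimal sketch of one possible fix follows, assuming the file is semicolon-delimited with comma decimal marks: parse it that way at read time. The two sample rows are reconstructed from the output above, and `electricity_demo` is a hypothetical name used only for this illustration.

```python
import io

import pandas as pd

# Two rows reconstructed from the printed output above (assumed original layout);
# the header is the column name reported by info()
sample = io.StringIO(
    "HourUTC;HourDK;PriceArea;SpotPriceDKK;SpotPriceEUR\n"
    "2022-10-19 21:00;2022-10-19 23:00;DK2;978,750000;131,570007\n"
    "2022-10-19 20:00;2022-10-19 22:00;DK2;1102,079956;148,149994\n"
)

electricity_demo = pd.read_csv(
    sample,
    sep=";",                           # fields are separated by semicolons
    decimal=",",                       # prices use a comma as the decimal mark
    parse_dates=["HourUTC", "HourDK"], # parse both timestamp columns
    index_col="HourUTC",               # use the UTC timestamp as the index
)

print(electricity_demo.dtypes)
print(electricity_demo.head())
```

With the real file, the same arguments could be passed to the `pd.read_csv` call in the loading cell instead of the in-memory sample. The string splitting and comma replacement steps would then no longer be needed.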


Handling Missing Values
Given the two datasets in this project, it is important to handle missing values in each one independently. For the weather dataset, there