# Notebook 4: regression model and naive algorithm

This notebook consists of two sections: baselines for comparison with the LSTM models and baselines for comparison with the XGBoost models.

# Regression and naive to compare with LSTM

In this section, two baseline models are created that predict every weather component one, two and three hours ahead. These models help to evaluate each LSTM model.

The linear regression models make predictions based on the last 12 timestamps, the same number as the LSTM models use.
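
The windowing idea described above can be sketched as follows. This is only an illustration: `make_windows`, `n_lags` and `horizon` are hypothetical names, and the sketch is not the `transform_data` function from `helpful_functions`, whose implementation is not shown in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_windows(series, n_lags=12, horizon=2):
    """Hypothetical helper: each row of X holds n_lags consecutive values,
    and y is the value `horizon` steps after the last value of that row."""
    values = np.asarray(series, dtype=float)
    n_rows = len(values) - n_lags - horizon + 1
    if n_rows <= 0:
        raise ValueError("series is too short for the requested window")
    X = np.stack([values[i:i + n_lags] for i in range(n_rows)])
    y = values[n_lags + horizon - 1:]
    return X, y

# toy series: a straight line, so a linear model can fit it exactly
series = np.arange(20.0)
X, y = make_windows(series, n_lags=12, horizon=2)  # horizon=2 steps is +1 hour at 30-minute spacing
model = LinearRegression().fit(X, y)
pred = model.predict(X)
```

With 30-minute spacing, `horizon` values of 2, 4 and 6 would correspond to predictions one, two and three hours ahead.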

The naive method takes the last observed value as the forecast: it compares the weather conditions at timestamps that are two, four and six steps apart, which corresponds to predicting one, two and three hours ahead, because consecutive timestamps are 30 minutes apart.

The models are tested on data from the beginning of 2016 to the end of 2021, the same data used to update the LSTM models and to test them by MAE. The naive algorithm uses only that data. The regression models are first fitted on data from 2013, the same data used to fit the LSTM models in Notebook 1.
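
The fit-on-one-period, test-on-a-later-period scheme described above could look roughly like the sketch below. It uses a synthetic series instead of the project files, and `lagged` is a hypothetical helper built on NumPy's `sliding_window_view` (available from NumPy 1.20), not the project's `transform_data`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def lagged(values, n_lags, horizon):
    """Hypothetical helper: rows of n_lags past values and the target `horizon` steps ahead."""
    values = np.asarray(values, dtype=float)
    X = sliding_window_view(values, n_lags)[:-horizon]
    y = values[n_lags + horizon - 1:]
    return X, y

# synthetic half-hourly signal with a daily cycle (48 steps) plus noise
rng = np.random.default_rng(0)
t = np.arange(600)
signal = 10 + 5 * np.sin(2 * np.pi * t / 48) + rng.normal(0, 0.5, t.size)

# the earlier part plays the role of the fitting period, the later part of the test period
train, test = signal[:400], signal[400:]
X_train, y_train = lagged(train, n_lags=12, horizon=2)
X_test, y_test = lagged(test, n_lags=12, horizon=2)

model = LinearRegression().fit(X_train, y_train)
mae_test = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE on the held-out period: {mae_test:.3f}")
```

Evaluating on a period the model has not seen keeps the MAE comparable with the naive method, which has no fitting step at all.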

## Importing all necessary libraries

In [1]:
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

Imports from the helpful_functions.py script located in the root/notebooks folder.

In [2]:
from helpful_functions import transform_data

## Files to load

In [3]:
# read the CSV files

# train data
train_data = pd.read_csv("all_data/data_for_main_model/data_ready_for_training.csv")
train_data = train_data[train_data['year'] == 2013]
# choose appropriate features
train_data = train_data[["relh", "skph", "temp"]]

# this data is the same as the test data from Notebook 1
# data from the beginning of 2016 till the end of 2021
data = pd.read_csv("all_data/data_for_main_model/data_ready_for_testing.csv")
data = data[data['year']!=2022]
# choose appropriate features
all_data = data[["relh", "skph", "temp"]]

# Regression models for relative humidity, wind speed and temperature

In [4]:
# data partition
humid_data = all_data['relh']
wind_data = all_data['skph']
temp_data = all_data['temp']

The results are printed below the code.

In [5]:
# last_pred_hour == 3 means that predictions cover +1 hour to +3 hours
last_pred_hour = 3

# MAE history per weather component, one entry per prediction horizon
MAE_humid = []
MAE_wind = []
MAE_temp = []

# NOTE: each model below is fitted and evaluated on the same 2016-2021 series;
# train_data loaded above is not used in this cell
for hour in range(1, last_pred_hour*2, 2): # timestamps are 30 minutes apart, so hour takes the values 1, 3, 5
    # transform the data for the linear regression models, which predict from the last 12 timestamps
    X_humid, y_humid = transform_data(humid_data, timestamps_count = hour, for_linear_regression = True)
    X_wind, y_wind = transform_data(wind_data, timestamps_count = hour, for_linear_regression = True)
    X_temp, y_temp = transform_data(temp_data, timestamps_count = hour, for_linear_regression = True)

    # fit a separate model for each weather component
    reg_humid = LinearRegression().fit(X_humid, y_humid)
    reg_wind = LinearRegression().fit(X_wind, y_wind)
    reg_temp = LinearRegression().fit(X_temp, y_temp)

    pred_humid = reg_humid.predict(X_humid)
    pred_wind = reg_wind.predict(X_wind)
    pred_temp = reg_temp.predict(X_temp)

    # append MAE for each hour
    MAE_humid.append(mean_absolute_error(y_humid, pred_humid))
    MAE_wind.append(mean_absolute_error(y_wind, pred_wind))
    MAE_temp.append(mean_absolute_error(y_temp, pred_temp))

    print(f'MAE for humidity [%] predictions in {(hour+1)/2} hour : {MAE_humid[-1]}.')
    print(f'MAE for speed of wind [km/h] predictions in {(hour+1)/2} hour : {MAE_wind[-1]}.')
    print(f'MAE for temperature [°C] predictions in {(hour+1)/2} hour : {MAE_temp[-1]}.\n')

MAE for humidity [%] predictions in 1.0 hour : 4.0885651296469625.
MAE for speed of wind [km/h] predictions in 1.0 hour : 2.680956220146139.
MAE for temperature [°C] predictions in 1.0 hour : 0.7137259196316501.

MAE for humidity [%] predictions in 2.0 hour : 6.195874445520641.
MAE for speed of wind [km/h] predictions in 2.0 hour : 3.4117880919231096.
MAE for temperature [°C] predictions in 2.0 hour : 1.1791144226029573.

MAE for humidity [%] predictions in 3.0 hour : 8.03564143187572.
MAE for speed of wind [km/h] predictions in 3.0 hour : 3.9947500857033846.
MAE for temperature [°C] predictions in 3.0 hour : 1.6520777359476866.



# Naive method

The results are printed below the code.

In [6]:
# last_pred_hour == 3 means that predictions cover +1 hour to +3 hours
last_pred_hour = 3
for hour in range(1,last_pred_hour*2,2): # timestamps are 30 minutes apart, so a shift of hour+1 steps is 1, 2 or 3 hours
    X_humid = all_data['relh'][:-(hour+1)]
    y_humid = all_data['relh'][(hour+1):]
    MAE_humid = mean_absolute_error(X_humid,y_humid)

    X_wind = all_data['skph'][:-(hour+1)]
    y_wind = all_data['skph'][(hour+1):]
    MAE_wind = mean_absolute_error(X_wind,y_wind)

    X_temp = all_data['temp'][:-(hour+1)]
    y_temp = all_data['temp'][(hour+1):]
    MAE_temp = mean_absolute_error(X_temp,y_temp)

    print(f'MAE for humidity [%] predictions in {(hour+1)/2} hour : {MAE_humid}.')
    print(f'MAE for speed of wind [km/h] predictions in {(hour+1)/2} hour : {MAE_wind}.')
    print(f'MAE for temperature [°C] predictions in {(hour+1)/2} hour : {MAE_temp}.\n')

MAE for humidity [%] predictions in 1.0 hour : 3.965962451589575.
MAE for speed of wind [km/h] predictions in 1.0 hour : 2.7103195516266854.
MAE for temperature [°C] predictions in 1.0 hour : 0.7428799802076296.

MAE for humidity [%] predictions in 2.0 hour : 6.3645919173272185.
MAE for speed of wind [km/h] predictions in 2.0 hour : 3.5101276441873077.
MAE for temperature [°C] predictions in 2.0 hour : 1.3246676626478509.

MAE for humidity [%] predictions in 3.0 hour : 8.556541722572724.
MAE for speed of wind [km/h] predictions in 3.0 hour : 4.176327233625473.
MAE for temperature [°C] predictions in 3.0 hour : 1.8726959566835097.



# Naive to compare with XGBoost

In this section, a naive baseline is created that predicts every weather component one, two and three hours ahead. It helps to evaluate each XGBoost model.

The naive method compares the weather conditions at timestamps that are one, two and three steps apart, which corresponds to predicting one, two and three hours ahead.

The baseline is tested on data from 2022, the same data used to test the XGBoost models by MAE. The naive algorithm uses only that data.
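
The XGBoost test data is concatenated from several stations, so a plain shift over the whole column pairs the last rows of one station with the first rows of the next. If the file has a station identifier column, those pairs can be avoided by shifting within each station. The sketch below assumes a column named `station`, which is a guess and may not match the real file. The toy values are made up.

```python
import pandas as pd

def naive_mae_per_group(df, value_col, group_col, steps):
    """Naive MAE where the shift never crosses a group (station) boundary."""
    shifted = df.groupby(group_col)[value_col].shift(steps)
    errors = (df[value_col] - shifted).abs().dropna()
    return errors.mean()

# toy data: two stations stacked one after another (illustrative values only)
toy = pd.DataFrame({
    "station": ["EPRZ"] * 3 + ["EPWA"] * 3,  # hypothetical column name
    "tmpc": [1.0, 2.0, 4.0, 20.0, 21.0, 23.0],
})

mae_grouped = naive_mae_per_group(toy, "tmpc", "station", steps=1)
# plain shift over the whole column, which also compares EPRZ's last row with EPWA's first row
mae_flat = (toy["tmpc"] - toy["tmpc"].shift(1)).abs().dropna().mean()
```

On this toy frame the grouped MAE is 1.5, while the flat version gives 4.4 because of the single cross-station pair. On a large dataset that effect is much smaller, since only a few pairs cross a boundary.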

## Importing all necessary libraries
Libraries have been imported in the previous section.

## Files to load

In [15]:
# read the CSV files
# test data from 2022
all_data_xgb = pd.read_csv("all_data/data_for_XGB/data_ready_for_testing.csv")

In [16]:
# prepare data
all_data_xgb = all_data_xgb[all_data_xgb['minutes']==0]

# Naive method

The results are printed below the code.

In [19]:
# the data is concatenated from four different stations, so a few shifted pairs straddle two stations;
# the resulting errors are negligible because the dataset is much larger
stations=["EPRZ", "EPSC", "EPWA", "EPWR"]

# last_pred_hour == 3 means that predictions cover +1 hour to +3 hours
last_pred_hour = 3
for hour in range(last_pred_hour): # only rows with minutes == 0 are kept, so a shift of hour+1 steps is 1, 2 or 3 hours
    X_humid_xgb = all_data_xgb['relh'][:-(hour+1)]
    y_humid_xgb = all_data_xgb['relh'][(hour+1):]
    MAE_humid_xgb = mean_absolute_error(X_humid_xgb,y_humid_xgb)

    X_wind_xgb = all_data_xgb['sped'][:-(hour+1)]
    y_wind_xgb = all_data_xgb['sped'][(hour+1):]
    MAE_wind_xgb = mean_absolute_error(X_wind_xgb,y_wind_xgb)

    X_temp_xgb = all_data_xgb['tmpc'][:-(hour+1)]
    y_temp_xgb = all_data_xgb['tmpc'][(hour+1):]
    MAE_temp_xgb = mean_absolute_error(X_temp_xgb,y_temp_xgb)

    print(f'MAE for humidity [%] predictions in {(hour+1)} hour : {MAE_humid_xgb}.')
    print(f'MAE for speed of wind [km/h] predictions in {(hour+1)} hour : {MAE_wind_xgb}.')
    print(f'MAE for temperature [°C] predictions in {(hour+1)} hour : {MAE_temp_xgb}.\n')

MAE for humidity [%] predictions in 1 hour : 4.143399576198385.
MAE for speed of wind [km/h] predictions in 1 hour : 3.3176504839356276.
MAE for temperature [°C] predictions in 1 hour : 0.7964320485653743.

MAE for humidity [%] predictions in 2 hour : 6.600022908851408.
MAE for speed of wind [km/h] predictions in 2 hour : 4.233435680536068.
MAE for temperature [°C] predictions in 2 hour : 1.4102402565791357.

MAE for humidity [%] predictions in 3 hour : 8.849166666666667.
MAE for speed of wind [km/h] predictions in 3 hour : 4.996308310423826.
MAE for temperature [°C] predictions in 3 hour : 1.9833906071019474.

