# Time-series Forecasting

## Description
The data consists of 52,416 observations of energy consumption on a 10-minute window. Every observation is described by the following feature columns.

Your task is to **aggregate the observations on an interval of 2 hours**. For this time interval, using the values of the **4 previous time intervals**, forecast the target value one step in the future. Choose which features you are going to use.

**You must train a Boosting model for the task. Choose the model based on the number, and type of features available.**



Features:

* Date: Time window of ten minutes.
* Temperature: Weather Temperature.
* Humidity: Weather Humidity.
* WindSpeed: Wind Speed.
* GeneralDiffuseFlows: “Diffuse flow” is a catchall term to describe low-temperature (< 0.2° to ~ 100°C) fluids that slowly discharge through sulfide mounds, fractured lava flows, and assemblages of bacterial mats and macrofauna.
* DiffuseFlows

Target:

SolarPower

## Dataset links:
* [DS1](https://drive.google.com/file/d/1-Pcpb1xWpKc8Cgs-P7xqBFHw2NM0dBsA/view?usp=sharing)
* [DS2](https://drive.google.com/file/d/1-Pul07w6LXpm-uo99qbNc86FHhwl4yQD/view?usp=sharing)

## Read the datasets

In [58]:
import pandas as pd
from pandas.core.interchange.dataframe_protocol import DataFrame
from sklearn.impute import KNNImputer,SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
import missingno as msno
import  matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler



In [5]:
data1=pd.read_csv('power_consumption_g3_feat.csv')
data2=pd.read_csv('power_consumption_g3.csv')

In [12]:
data1.sample()

Unnamed: 0,Date,Temperature,Humidity,WindSpeed,GeneralDiffuseFlows,DiffuseFlows
29389,2017-04-30 13:20:00,24.14,45.15,4.92,889.0,40.72


In [19]:
data1['Date'] = pd.to_datetime(data1['Date'])
data1.set_index('Date', inplace=True)
data2['Date'] = pd.to_datetime(data2['Date'])
data2.set_index('Date', inplace=True)

In [9]:
data1.isnull().sum()/len(data1)*100

Date                   0.000000
Temperature            1.066468
Humidity               0.951999
WindSpeed              1.009234
GeneralDiffuseFlows    0.999695
DiffuseFlows           0.963446
dtype: float64

In [14]:
def knn_imputer(data,columns):
    imputer = KNNImputer(n_neighbors=3)
    data_copy=data.copy()
    for column in columns:
        data_copy[column]=imputer.fit_transform(data_copy[[column]])
    return data_copy

In [35]:
data_imputed=knn_imputer(data1,['Temperature','Humidity','WindSpeed','GeneralDiffuseFlows','DiffuseFlows'])

## Merge the datasets (and pre-processing if needed)

In [47]:
data=pd.merge(data_imputed,data2,on='Date',how='inner')


## Group the datasets into time intervals of 2 hours

In [50]:
data=data.resample('2h').mean()
data=data.sort_values(by='Date',ascending=True)
data.head(10)



Unnamed: 0_level_0,Temperature,Humidity,WindSpeed,GeneralDiffuseFlows,DiffuseFlows,SolarPower
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2017-01-01 00:00:00,6.95289,75.497093,0.081917,0.060167,0.105667,26927.594937
2017-01-01 02:00:00,5.029333,78.008333,0.082583,0.061417,0.135083,21447.088607
2017-01-01 04:00:00,4.919667,74.641667,0.081667,0.061917,0.120833,20641.518987
2017-01-01 06:00:00,4.51275,74.575,0.082417,0.063583,0.1225,20094.683545
2017-01-01 08:00:00,4.632167,73.791667,0.082417,79.281917,15.761833,21255.189872
2017-01-01 10:00:00,8.019333,63.835833,2.913333,332.463903,34.108333,27986.835442
2017-01-01 12:00:00,15.263333,57.075,0.076167,486.391667,40.981667,30060.759495
2017-01-01 14:00:00,15.6625,56.914167,0.075667,377.458333,48.125,29558.481012
2017-01-01 16:00:00,15.309167,59.1125,0.07725,160.075833,169.773333,31576.70886
2017-01-01 18:00:00,12.911667,67.740833,0.077417,2.43275,2.487417,39969.113924


## Create lags

In [51]:
for lag in range(1, 5):  # Create lags 1, 2, 3, 4
    data[f'lag_{lag}'] = data['SolarPower'].shift(lag)
data=data.dropna()

data.head(10)

Unnamed: 0_level_0,Temperature,Humidity,WindSpeed,GeneralDiffuseFlows,DiffuseFlows,SolarPower,lag_1,lag_2,lag_3,lag_4
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2017-01-01 08:00:00,4.632167,73.791667,0.082417,79.281917,15.761833,21255.189872,20094.683545,20641.518987,21447.088607,26927.594937
2017-01-01 10:00:00,8.019333,63.835833,2.913333,332.463903,34.108333,27986.835442,21255.189872,20094.683545,20641.518987,21447.088607
2017-01-01 12:00:00,15.263333,57.075,0.076167,486.391667,40.981667,30060.759495,27986.835442,21255.189872,20094.683545,20641.518987
2017-01-01 14:00:00,15.6625,56.914167,0.075667,377.458333,48.125,29558.481012,30060.759495,27986.835442,21255.189872,20094.683545
2017-01-01 16:00:00,15.309167,59.1125,0.07725,160.075833,169.773333,31576.70886,29558.481012,30060.759495,27986.835442,21255.189872
2017-01-01 18:00:00,12.911667,67.740833,0.077417,2.43275,2.487417,39969.113924,31576.70886,29558.481012,30060.759495,27986.835442
2017-01-01 20:00:00,12.319473,70.855,0.07625,0.060333,0.097083,39542.27848,39969.113924,31576.70886,29558.481012,30060.759495
2017-01-01 22:00:00,12.251667,69.178353,0.074667,0.066333,0.108333,32522.531644,39542.27848,39969.113924,31576.70886,29558.481012
2017-01-02 00:00:00,10.685,78.475,0.076417,0.068583,0.139583,23907.8481,32522.531644,39542.27848,39969.113924,31576.70886
2017-01-02 02:00:00,10.8,78.008333,0.077667,0.06875,6.381762,20521.518986,23907.8481,32522.531644,39542.27848,39969.113924


# Split the dataset into 80% training and 20% testing datasets

## Create the model, pre-process the data and make it suitable for training

In [64]:
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.fit_transform(X_test)


## Perofrm hyper-parameter optimization with a 5-fold cross validation.

Important: Do not use many values for the hyper-parameters due to time constraints.

KEEP IN MIND THE DATASET IS TIME-SERIES.

In [65]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, max_depth=6)
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],  # learning rate options
    'n_estimators': [50, 100, 200],  # number of trees
    'max_depth': [3, 6, 10],  # depth of the trees
    'subsample': [0.7, 0.8, 1.0],  # fraction of samples used for each tree
    'colsample_bytree': [0.7, 0.8, 1.0]  # fraction of features used for each tree
}
tscv = TimeSeriesSplit(n_splits=5)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=tscv, scoring='neg_mean_absolute_error', n_jobs=-1)

grid_search.fit(X_train_scaled, Y_train)


  _data = np.array(data, dtype=dtype, copy=copy,


## Fit the model with the best parameters on the training dataset

In [69]:
bestModel=grid_search.best_estimator_
bestModel.fit(X_train_scaled, Y_train)
y_pred=bestModel.predict(X_test_scaled)


## Calculate the adequate metrics on the testing dataset

In [83]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate MAE, MSE, RMSE, and R²
mae = mean_absolute_error(Y_test, y_pred)
mse = mean_squared_error(Y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(Y_test, y_pred)

# Calculate the range of the SolarPower column
range_solarpower = data['SolarPower'].max() - data['SolarPower'].min()

# Print out the results
print(f"Range of SolarPower: {range_solarpower}")

# Calculate and print Relative MAE, Relative MSE, and Relative RMSE
print(f'Relative Mean Absolute Error (MAE): {(mae / range_solarpower) * 100}')
print(f'Relative Mean Squared Error (MSE): {(mse / (range_solarpower * range_solarpower)) * 100}')
print(f'Relative Root Mean Squared Error (RMSE): {(rmse / range_solarpower) * 100}')
print(f'R-squared (R²): {r2}')


Range of SolarPower: 35196.79793166666
Relative Mean Absolute Error (MAE): 3.332856359532495
Relative Mean Squared Error (MSE): 0.2965673791732361
Relative Root Mean Squared Error (RMSE): 5.445800025462155
R-squared (R²): 0.9261420071457704


## Visualize the targets against the predictions

ValueError: continuous is not supported