# First experiment

**Monofeature**

In this Jupyter Notebook, an XGBoost regressor is applied to predict the amount of rain at the next point (15 minutes later in our dataset). The dataset comes from the São Cristóvão station on AlertaRio's website; both the rainfall and the meteorological data are used.

## Preparation

In this section we import the modules, define a few tools, and preprocess the dataset to run the experiment.

In [None]:
import os
import sys
import re
import pandas as pd
import numpy as np
import json

import subprocess

from matplotlib import pyplot as plt

from time import time
from threading import Thread
from threading import Lock

import sklearn as sk
import xgboost as xgb

pd.set_option('display.max_columns', None)

In [None]:
if "data/output" not in os.getcwd():
    os.chdir("data/output")

In [None]:
translate_dict = {
    "15min" : "15min",
    "01h" : "01h",
    "04h" : "04h",
    "24h" : "24h",
    "96h" : "96h",
    "DirVento" : "WindDir",
    "Pressao" : "Pressure",
    "Temperatura" : "Temperature",
    "Umidade" : "Humidity",
    "VelVento" : "WindSpeed"
}

### Loading the dataset

In this section the dataset is loaded and unified as one pandas dataframe.

In [None]:
data = {}
station = "sao_cristovao"
sources = ['AlertaRio_DadosPluv', 'AlertaRio_DadosMet']
data[station] = {}
# Load the rainfall and the meteorological CSV files of the station
for source in sources:
    data[station][source] = pd.read_csv(source + "/full/" + station + ".csv", sep=',')

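for source in sources:
    # 'Dia' (date) and 'Hora' (time) are concatenated without a separator, hence the format string
    data[station][source]['datetime'] = pd.to_datetime(data[station][source]['Dia'] + data[station][source]['Hora'], format='%d/%m/%Y%H:%M:%S')
    data[station][source].set_index('datetime', inplace=True)
    # Regular 15-minute grid; "15min" is equivalent to the "15T" alias, which recent pandas versions deprecate
    data[station][source] = data[station][source].asfreq("15min")["2000":"2023-05-18 02:00:00"]
# 'Chuva' (rain) is dropped from the meteorological data, as the rainfall columns come from the other source
data[station][sources[1]].drop(columns=["Chuva"], inplace=True)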
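The parsing and resampling steps above can be illustrated on a toy frame. This is a minimal sketch, assuming the CSV files have a date column `Dia` and a time column `Hora` as the loading cell suggests; the three rows and their values are made up for illustration.

```python
import pandas as pd

# Toy rows mimicking the assumed layout: a date column and a time column.
toy = pd.DataFrame({
    "Dia": ["01/01/2020", "01/01/2020", "01/01/2020"],
    "Hora": ["00:00:00", "00:15:00", "00:45:00"],  # the 00:30 record is missing
    "15min": [0.0, 0.2, 0.4],
})

# Concatenating the two strings gives e.g. "01/01/202000:00:00",
# which is why the format string has no separator between %Y and %H.
toy["datetime"] = pd.to_datetime(toy["Dia"] + toy["Hora"], format="%d/%m/%Y%H:%M:%S")
toy = toy.set_index("datetime")

# asfreq reindexes onto a regular 15-minute grid; absent timestamps become NaN rows.
regular = toy.asfreq("15min")
print(regular["15min"])
```

The missing 00:30 record shows up as an explicit NaN row, which is what makes the later missing-data check and interpolation meaningful.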

Checking that both sources cover the same date range.

In [None]:
data["sao_cristovao"]["AlertaRio_DadosPluv"].iloc[0].name == \
data["sao_cristovao"]["AlertaRio_DadosMet"].iloc[0].name, \
\
data["sao_cristovao"]["AlertaRio_DadosPluv"].iloc[-1].name == \
data["sao_cristovao"]["AlertaRio_DadosMet"].iloc[-1].name

Concatenating the two datasets and removing the columns that are no longer needed.

In [None]:
drop_list = ['Dia', 'Hora']
data_features = pd.concat([data["sao_cristovao"]["AlertaRio_DadosPluv"].drop(columns=drop_list), data["sao_cristovao"]["AlertaRio_DadosMet"].drop(columns=drop_list)], axis=1)

### Feature extraction

Extracting calendar features from the datetime index to feed XGBoost.

In [None]:
def get_features(df):
    out = df.copy()
    out["min"] = out.index.minute
    out["hour"] = out.index.hour
    out["day"] = out.index.day
    out["month"] = out.index.month
    out["year"] = out.index.year
    
#     out['quarter'] = out.index.quarter
#     out['dayofyear'] = out.index.dayofyear
#     out['dayofmonth'] = out.index.day
    
#     out['weekofyear'] = out.index.isocalendar().week.astype(np.int64)
    return out

In [None]:
data_features = get_features(data_features)

### Target definition and group points

In this section the target is defined and, for each point, the data from the previous points are added.

In [None]:
target = '04h'

Checking for missing data.

In [None]:
data_features[target].isna().sum()

Interpolating the missing data using linear interpolation, then counting the values still missing from 2002 onward, as the São Cristóvão station doesn't provide meteorological data in the first 2 years.

In [None]:
data_features[target] = data_features[target].interpolate(method='linear')

data_features['2002':][target].isna().sum()
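To make the behaviour of the interpolation concrete, here is a small sketch on a made-up series (the values are illustrative, not taken from the dataset). With `method='linear'`, pandas treats the points as equally spaced and fills gaps on a straight line between the surrounding valid values; a gap at the very start has no earlier neighbour and stays missing.

```python
import numpy as np
import pandas as pd

# Made-up series: one leading gap and one interior gap of two points.
s = pd.Series([np.nan, 0.0, np.nan, np.nan, 3.0])

filled = s.interpolate(method="linear")
print(filled.tolist())

# The interior gap is filled on the line from 0.0 to 3.0,
# while the leading NaN is left untouched.
print(filled.isna().sum())
```

This is why a missing-value count is still worth running after the interpolation.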

Grouping the N previous points into each row of the dataset.

In [None]:
def get_colums_names(column_names, N):
    column_names = list(column_names)
    names = []
    for i in range(N, 0, -1):
        for name in column_names:
            names.append(name + str(i))
    names.extend(column_names)
    return names

data_features.reset_index(inplace=True)
all_available_features = list(data_features.columns)

N = 10 # Number of previous points added to each row to predict the future
data_multiple = data_features.copy()

for i in range(1, N):
    data_multiple = pd.concat([data_multiple.iloc[:-1].reset_index(drop=True), data_features.iloc[i:].reset_index(drop=True)], axis=1)

data_multiple = pd.concat([data_multiple.iloc[:-1].reset_index(drop=True), data_features.iloc[N:].reset_index(drop=True)], axis=1)

data_multiple.columns = get_colums_names(data_features.columns, N)

data_multiple.drop(columns=["datetime" + str(i) for i in range(1, N + 1)], inplace=True)
data_multiple.set_index("datetime", inplace=True)
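The same row layout (columns suffixed with `i` holding the value `i` points earlier) can also be sketched with `DataFrame.shift`. The helper name `add_lags` and the toy values below are hypothetical and only meant to show the idea; this is not the notebook's own implementation.

```python
import pandas as pd

def add_lags(df, n_lags):
    """Return df plus columns '<name><i>' holding each column's value i rows earlier."""
    blocks = []
    for i in range(n_lags, 0, -1):
        lagged = df.shift(i)  # move every value i rows down
        lagged.columns = [f"{name}{i}" for name in df.columns]
        blocks.append(lagged)
    blocks.append(df)  # current values keep their original names
    # The first n_lags rows have no complete history, so they are dropped.
    return pd.concat(blocks, axis=1).iloc[n_lags:]

toy = pd.DataFrame(
    {"15min": [0.0, 0.1, 0.2, 0.3, 0.4]},
    index=pd.date_range("2020-01-01", periods=5, freq="15min"),
)
lagged = add_lags(toy, n_lags=2)
print(lagged)
```

Each remaining row then carries its own value plus the two previous ones, ordered from the oldest lag to the current point, which mirrors the column order produced by `get_colums_names`.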

### Partitioning

Partitioning the dataset into a training set and a testing set.

In [None]:
date1 = '2010'
date2 = '2018'

train, test = data_multiple[date1:date2], data_multiple[str(int(date2) + 1):]

Making sure the partitions are adjacent and do not overlap.

In [None]:
t = pd.Timedelta("15min")
train.index.max() + t == test.index.min()

Visualizing the partition.

In [None]:
fig, ax = plt.subplots(3, 1, figsize=(15, 15), sharex=True)

ax[0].set_title('Data Train/Test Split')
ax[0].plot(train[target])
ax[0].plot(test[target])
ax[0].legend(['Training Set', 'Test Set'])

ax[1].set_title('Training Data')
ax[1].plot(train[target])
ax[1].legend(['Training Set'])

ax[2].set_title('Testing Data')
ax[2].plot(test[target], color="orange")
ax[2].legend(['Test Set'])

ax[0].set_xlim([train.index.min(), test.index.max()])

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10, 5), sharex=True)

ax.set_title('Data Train/Test Split')
ax.plot(train[target])
ax.plot(test[target])
ax.legend(['Training Set', 'Test Set'])

plt.tight_layout()
plt.show()

## Training and evaluating

Defining the training features.

In [None]:
all_features = data_multiple.columns

training_features_list = ['15min', '01h', '04h', '24h', '96h', 'DirVento',
       'VelVento', 'Temperatura', 'Pressao', 'Umidade', 'hour', 'day', 'month',
       'year', 'quarter', 'dayofyear', 'dayofmonth', 'weekofyear']

# Features to avoid at the current point (if a feature like 15min is in this list,
# then the training features will contain 15min1, ..., 15minN but not 15min)
avoid_features = ['datetime', '15min', '01h', '04h', '24h', '96h', 'DirVento',
       'VelVento', 'Temperatura', 'Pressao', 'Umidade']

def is_training_feature(feature, training_features, avoid_features):
    for training_feature in training_features:
        if feature not in avoid_features and \
        training_feature == feature[:len(training_feature)] and \
        (feature[len(training_feature):].isnumeric() or feature[len(training_feature):] == ""):
            return True
    return False

training_features = list(filter(lambda x : is_training_feature(x, training_features_list, avoid_features), all_features))
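A quick way to see what this filter keeps is to run it on a handful of column names. The sketch below repeats the function so that it runs on its own (using `startswith`, which is equivalent to the slice comparison); the toy column list and the shortened feature lists are made up for illustration.

```python
def is_training_feature(feature, training_features, avoid_features):
    # Same logic as the notebook cell, repeated so this snippet is self-contained.
    for training_feature in training_features:
        suffix = feature[len(training_feature):]
        if feature not in avoid_features and \
           feature.startswith(training_feature) and \
           (suffix.isnumeric() or suffix == ""):
            return True
    return False

toy_columns = ["15min2", "15min1", "15min", "Umidade1", "Umidade",
               "hour1", "hour", "min1", "min"]
selected = [c for c in toy_columns
            if is_training_feature(c, ["15min", "Umidade", "hour"], ["15min", "Umidade"])]
print(selected)
# → ['15min2', '15min1', 'Umidade1', 'hour1', 'hour']
```

The current-point measurements (`15min`, `Umidade`) are excluded while their lagged copies are kept, and a base name that is not listed (here `min`) is dropped together with its lags.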

### Training the model

In [None]:
from time import time

args = {
    "n_estimators" : 500,
    "learning_rate" : 0.01
}

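# If you have an NVIDIA GPU, you can use it for training below.
# Note: XGBoost 2.0 and later deprecate "gpu_hist" in favor of tree_method="hist" with device="cuda".
use_gpu = True
if use_gpu:
    reg = xgb.XGBRegressor(tree_method="gpu_hist",
                           **args)
else:
    reg = xgb.XGBRegressor(**args)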

t = time()

reg.fit(train[training_features], train[target],
        eval_set=[(train[training_features], train[target]), (test[training_features],test[target])],
        verbose=50)
print(time() - t)

### Evaluating the model

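Evaluating the model with the `score` method of the XGBoost regressor, which returns the coefficient of determination (R²) rather than the RMSE. The RMSE is the metric logged by default during training on the `eval_set`.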

In [None]:
preds_train = reg.predict(train[training_features])
preds_test = reg.predict(test[training_features])

print("Training :", reg.score(train[training_features], train[target]))
print("Testing :", reg.score(test[training_features], test[target]))
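Since `score` on a scikit-learn style regressor reports R², it can help to compute the RMSE explicitly next to it. The sketch below uses plain NumPy on made-up numbers; the helper names `rmse` and `r2` are hypothetical and not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values: three exact predictions and one that is off by 2.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.0, 1.0, 2.0, 5.0]

print("RMSE:", rmse(y_true, y_pred))
print("R2  :", r2(y_true, y_pred))
```

In the notebook, the same helpers could be applied as `rmse(test[target], preds_test)` to obtain an error in the unit of the target, which R² alone does not give.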

Plotting the predictions on the training and testing partitions.

In [None]:
fig, ax = plt.subplots(6, 1, figsize=(15, 30), sharex=True)

ax[0].set_title('Training Data/predicted value')
ax[0].plot(train[target], color='r', alpha=1)
ax[0].plot(train.index, preds_train, color='b', alpha=0.5)
ax[0].legend(['Training Set', 'Prediction'])

ax[1].set_title('Training Data')
ax[1].plot(train[target])
ax[1].legend(['Training Set'])

ax[2].set_title('Prediction value')
ax[2].plot(train.index, preds_train, color="orange")
ax[2].legend(['Prediction'])

ax[3].set_title('Testing Data/predicted value')
ax[3].plot(test[target], alpha=0.7)
ax[3].plot(test.index, preds_test, alpha=0.7)
ax[3].legend(['Testing Set', 'Prediction'])

ax[4].set_title('Testing Data')
ax[4].plot(test[target])
ax[4].legend(['Testing Set'])

ax[5].set_title('Prediction value')
ax[5].plot(test.index, preds_test, color="orange")
ax[5].legend(['Prediction'])

plt.show()

Looking at the prediction during an extreme event (in the testing partition).

In [None]:
period1 = '2019 04 08'
period2 = '2019 04 10'

preds_period = reg.predict(test.loc[period1:period2][training_features])

plt.title(f'Testing Data/predicted value for feature {translate_dict[target]}')
plt.plot(test.loc[period1:period2][target], color="blue")
plt.plot(test.loc[period1:period2].index, preds_period, alpha=0.7, color="red")
plt.legend(['Testing Set', 'Prediction'])

plt.tight_layout()
plt.show()

Looking at the prediction during usual weather (in the testing partition).

In [None]:
period1 = '2019 04 14'
period2 = '2019 04 20'

preds_period = reg.predict(test.loc[period1:period2][training_features])

plt.title(f'Testing Data/predicted value for feature {translate_dict[target]}')
plt.plot(test.loc[period1:period2][target], color="blue")
plt.plot(test.loc[period1:period2].index, preds_period, alpha=0.7, color="red")
plt.legend(['Testing Set', 'Prediction'])

plt.tight_layout()
plt.show()

## Conclusion

The model performs well when predicting only rain, one point into the future. However, predicting multiple points into the future requires the other features to be predicted as well.
Indeed, our predictions are made using wind speed, wind direction, temperature, humidity, and pressure, so the model needs all of those features as inputs.
To address this, the next step in the experiment is to predict all the features using an XGBoost regressor.