# XGBoost using multiple input points

The objective of this notebook is to apply the XGBoost algorithm to a simple dataset in order to forecast future values. Because the dataset contains the energy consumption for each timestamp, the final goal is to fit an XGBoost regression model.

In [None]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)
import xgboost as xgb

Reading the dataset with pandas

In [None]:
data = pd.read_csv("PJME_hourly.csv")
data.head()

## Renaming the columns and converting the date to pandas format

Convert the date column to a pandas datetime and sort by date to allow further computation (such as partitioning).

In [None]:
print(data.dtypes)
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.rename({"Datetime" : "date", "PJME_MW" : "out"}, axis=1, inplace=True)
data.sort_values("date", ascending=True, inplace=True, ignore_index=True)
print(data.head())
print(data.dtypes)

## Extracting all the features to improve our evaluation

XGBoost cannot use pandas datetime values directly; it only works with numbers (float, int, etc.). This section therefore extracts numeric features from the pandas datetime index.

In [None]:
data_features = data.copy()

# gap[t] = out[t-1] - out[t-2]: the most recent change known before time t
data_features['gap'] = (data_features['out'] - data_features['out'].shift()).shift()
data_features.set_index('date', inplace=True)
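The double shift used for the gap column can be checked on a toy series. This is a minimal sketch with made-up numbers; the frame `toy` is only for illustration and is not part of the dataset.

```python
import pandas as pd

# Toy consumption values standing in for the "out" column (made-up numbers).
toy = pd.DataFrame({"out": [10.0, 12.0, 15.0, 11.0, 14.0]})

# Difference with the previous row: out[t] - out[t-1].
diff = toy["out"] - toy["out"].shift()

# Shifting once more gives gap[t] = out[t-1] - out[t-2],
# so the feature on row t never uses out[t] itself.
toy["gap"] = diff.shift()

print(toy)
```

The first two rows have no gap value because they lack two earlier observations.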

Function to extract features from the date.

In [None]:
def get_features(df):
    out = df.copy()
    out["hour"] = out.index.hour
    out["day"] = out.index.day
    out["month"] = out.index.month
    out["year"] = out.index.year

    out['quarter'] = out.index.quarter
    out['dayofyear'] = out.index.dayofyear
    out['dayofmonth'] = out.index.day  # same values as "day"
    out['weekofyear'] = out.index.isocalendar().week.astype(np.int64)

    return out

In [None]:
data_features = get_features(data_features)

data_features.head()
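To see what these calendar features look like, here is a small sketch on three hand-picked timestamps (chosen for illustration, not taken from the dataset). It uses the same pandas accessors as the function above.

```python
import numpy as np
import pandas as pd

# Three illustrative timestamps: start, middle and end of 2018.
idx = pd.DatetimeIndex(["2018-01-01 00:00", "2018-06-15 13:00", "2018-12-31 23:00"])
toy = pd.DataFrame({"out": [1.0, 2.0, 3.0]}, index=idx)

toy["hour"] = toy.index.hour
toy["quarter"] = toy.index.quarter
toy["dayofyear"] = toy.index.dayofyear
# isocalendar() returns ISO 8601 week numbers as an unsigned integer dtype,
# hence the cast to a regular int64 column.
toy["weekofyear"] = toy.index.isocalendar().week.astype(np.int64)

print(toy)
```

Note that the ISO week of 2018-12-31 is 1, because that Monday already belongs to the first ISO week of 2019. The week number is therefore not always increasing within a calendar year.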

## Grouping N input points for the training

To predict the next point, the model needs the previous points. This section therefore adds the data of the N previous points to the row of each example.

In [None]:
def get_colums_names(column_names, N):
    column_names = list(column_names)
    print(column_names)
    names = []
    for i in range(N, 0, -1):
        for name in column_names:
            names.append(name + str(i))
    names.extend(column_names)
    print(names)
    return names

data_features.reset_index(inplace=True)
all_available_features = list(data_features.columns)

N = 3  # Number of previous points used as input to predict the next one
data_multiple = data_features.copy()

for i in range(1, N):
    data_multiple = pd.concat([data_multiple.iloc[:-1].reset_index(drop=True), data_features.iloc[i:].reset_index(drop=True)], axis=1)

data_multiple = pd.concat([data_multiple.iloc[:-1].reset_index(drop=True), data_features.iloc[N:].reset_index(drop=True)], axis=1)

data_multiple.columns = get_colums_names(data_features.columns, N)
data_multiple.set_index("date", inplace=True)

data_multiple.info()
data_multiple

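# dfi refers to the dataframe_image package, which must be installed and imported first
import dataframe_image as dfi
dfi.export(data_multiple, "Data multiple.png", max_rows=6, max_cols=10)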
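The same windowed layout can also be sketched with `DataFrame.shift`, which may be easier to follow on a toy frame. The helper name `make_windows` is hypothetical and the data is made up. Unlike the cell above, this sketch keeps the original index instead of resetting it.

```python
import pandas as pd

def make_windows(df, n):
    """Return a frame where each row holds the n previous rows plus the current one.

    Columns of the row k steps back get the suffix str(k); the current row
    keeps the original column names. Hypothetical helper for illustration.
    """
    blocks = [df.shift(k).add_suffix(str(k)) for k in range(n, 0, -1)]
    blocks.append(df)
    # The first n rows lack a full history, so they are dropped.
    return pd.concat(blocks, axis=1).iloc[n:]

toy = pd.DataFrame({"out": [1, 2, 3, 4, 5], "hour": [0, 1, 2, 3, 4]})
windows = make_windows(toy, n=2)
print(windows)
```

As in the notebook, the largest suffix marks the oldest point and the unsuffixed columns belong to the point to predict.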

## Choosing the wanted features for training and target

Define the features used for training. The code below selects all the features of the previous points by building a list of feature names.

In [None]:
all_features = data_multiple.columns

training_features_list = ['hour', 'day', 'month', 'year', 'quarter', 'dayofyear',
       'dayofmonth', 'weekofyear', 'out']

def is_training_feature(feature, training_features):
    for training_feature in training_features:
        if feature != "out" and \
        training_feature == feature[:len(training_feature)] and \
        (feature[len(training_feature):].isnumeric() or feature[len(training_feature):] == ""):
            return True
    return False

training_features = list(filter(lambda x : is_training_feature(x, training_features_list), all_features))
print(training_features)

target = "out"
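The selection rule above (a base name optionally followed by digits, excluding the target) can also be written with a regular expression. This is only an alternative sketch; `select_features` is a hypothetical name and the column list is a shortened, made-up example.

```python
import re

def select_features(columns, base_names, target="out"):
    """Keep columns made of a base name optionally followed by digits, except the target."""
    pattern = re.compile(r"^(%s)\d*$" % "|".join(map(re.escape, base_names)))
    return [c for c in columns if c != target and pattern.match(c)]

# Shortened, made-up column list in the same naming scheme as data_multiple.
cols = ["date2", "out2", "hour2", "day2", "dayofyear2",
        "date1", "out1", "hour1", "day1", "dayofyear1",
        "out", "hour", "day", "dayofyear", "gap"]

selected = select_features(cols, ["hour", "day", "out"])
print(selected)
```

Because the pattern is anchored at both ends, `day` does not accidentally match `dayofyear2`, which mirrors the suffix check in the function above.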

## Creating train_test subset

To train and evaluate the model, two partitions are created: one is used to train the model and the other to test it. The split is not shuffled, so the test partition contains the most recent data.

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data_multiple, shuffle=False)
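A quick way to convince yourself that `shuffle=False` keeps the chronological order is to run the same call on a tiny hourly frame. This sketch uses made-up data and relies on the default test size of scikit-learn (25 % of the rows).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Eight made-up hourly observations.
toy = pd.DataFrame(
    {"out": range(8)},
    index=pd.date_range("2020-01-01", periods=8, freq="h"),
)

# Without shuffling, the first rows go to train and the last rows to test.
train_toy, test_toy = train_test_split(toy, shuffle=False)

print(train_toy.index.max(), "<", test_toy.index.min())
```

Every training timestamp comes before every test timestamp, so the model is never evaluated on data older than what it was trained on.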

In [None]:
print(train.columns)

## Visualizing the partition

This section lets us check that the partitions are correct.

In [None]:
fig, ax = plt.subplots(3, 1, figsize=(15, 15), sharex=True)

train['out'].plot(ax=ax[0], title='Data Train/Test Split')
test['out'].plot(ax=ax[0])
ax[0].legend(['Training Set', 'Test Set'])

train['out'].plot(ax=ax[1], title='Training Data')
ax[1].legend(['Training Set'])

test['out'].plot(ax=ax[2], title='Testing Data', color="orange")
ax[2].legend(['Test Set'])


plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10, 5), sharex=True)

train['out'].plot(ax=ax, title='Data Train/Test Split')
test['out'].plot(ax=ax)
ax.legend(['Training Set', 'Test Set'])


plt.show()

## Training our model using XGBoost

In this section we train the XGBoost model.

In [None]:
print(training_features)

In [None]:
from time import time

args = {
    "n_estimators" : 1000,
    "learning_rate" : 0.01,
}

# If you don't have a GPU with the corresponding driver, uncomment the line below to use the CPU instead
# reg = xgb.XGBRegressor(**args)
# Note: XGBoost 2.0 and later deprecate "gpu_hist" in favor of tree_method="hist" with device="cuda"
reg = xgb.XGBRegressor(tree_method="gpu_hist",
                       **args)

t = time()
reg.fit(train[training_features], train[target],
        eval_set=[(train[training_features], train[target]), (test[training_features],test[target])],
        verbose=100)
print(time() - t)

In [None]:
xgb.plot_importance(reg)

## Predict and evaluate model performance

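In this section the model is tested and evaluated with the score method of the XGBoost regressor. This method follows the scikit-learn regressor convention and returns the coefficient of determination R² (higher is better, 1.0 is a perfect fit). The RMSE is the metric XGBoost reports by default on the evaluation sets during training.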

In [None]:
preds_train = reg.predict(train[training_features])
preds_test = reg.predict(test[training_features])

print("Training :", reg.score(train[training_features], train[target]))
print("Testing :", reg.score(test[training_features], test[target]))
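To make the difference between the two metrics concrete, here is a small sketch that computes RMSE and R² by hand with NumPy on made-up values. The function names `rmse` and `r2` are illustrative helpers, not part of XGBoost.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

print("RMSE:", rmse(y_true, y_pred))
print("R2  :", r2(y_true, y_pred))
```

RMSE is expressed in the unit of the target (here MW for the real data), while R² is unitless, so the two numbers are not directly comparable.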

Plotting the prediction on the train and test partition.

In [None]:
fig, ax = plt.subplots(6, 1, figsize=(15, 30), sharex=True)

train['out'].plot(ax=ax[0], title='Training Data/predicted value', color='r', alpha=1)
ax[0].plot(train.index, preds_train, color='b', alpha=0.5)
ax[0].legend(['Training Set', 'Prediction'])

train['out'].plot(ax=ax[1], title='Training Data')
ax[1].legend(['Training Set'])

ax[2].plot(train.index, preds_train, color="orange")
ax[2].legend(['Prediction'])

test['out'].plot(ax=ax[3], title='Testing Data/predicted value', alpha=0.7)
ax[3].plot(test.index, preds_test, alpha=0.7)
ax[3].legend(['Testing Set', 'Prediction'])

test['out'].plot(ax=ax[4], title='Testing Data')
ax[4].legend(['Testing Set'])

ax[5].set_title('Prediction value')
ax[5].plot(test.index, preds_test, color="orange")
ax[5].legend(['Prediction'])


plt.show()

Plotting the prediction over one week on the testing partition.

In [None]:
period1 = '2018-05-01'
period2 = '2018-05-07'

preds_period = reg.predict(test.loc[period1:period2][training_features])

fig, ax = plt.subplots(figsize=(10, 5))

ax.set_title('Testing Data/predicted value')
ax.plot(test.loc[period1:period2].index, preds_period, alpha=0.7, color="blue")
test.loc[period1:period2]['out'].plot(ax=ax, alpha=0.7, color="red")
ax.legend(['Prediction', 'Testing Set'])

plt.show()

In [None]:
fi = pd.DataFrame(data=reg.feature_importances_,
             index=reg.get_booster().feature_names,
             columns=['importance'])
fi.sort_values('importance').plot(kind='barh', title='Feature Importance')

plt.show()

In [None]:
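# Map each lagged column to the column it takes its value from when the window
# moves forward one step, e.g. hour3 <- hour2, hour2 <- hour1, hour1 <- hour.
# date1 and out1 are skipped because they are filled in separately
# (with the previous timestamp and the new prediction).
mapping_dict = {}
for feature in all_available_features:
    for i in range(N, 0, -1):
        if (feature != "date" and feature != "out") or i != 1:
            mapping_dict[feature + str(i)] = feature + str(i-1) if i > 1 else feature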

## Predicting the future

In this section the future is predicted over multiple points: the model predicts one point, that prediction is then used as an input to predict the next point, and so on.
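The idea can be sketched independently of XGBoost with a sliding window and any one-step predictor. In the sketch below, `recursive_forecast` and `linear_extrapolation` are hypothetical names, and the stand-in predictor is an assumption chosen only to keep the example self-contained.

```python
from collections import deque

def recursive_forecast(predict_one, history, steps):
    """Forecast `steps` points, feeding each prediction back as an input.

    predict_one: callable that takes the window (oldest value first) and
                 returns the next value; it stands in for a trained model.
    history:     the last known values, used as the initial window.
    """
    window = deque(history, maxlen=len(history))
    forecasts = []
    for _ in range(steps):
        next_value = predict_one(list(window))
        forecasts.append(next_value)
        # Appending to a full deque drops the oldest value automatically.
        window.append(next_value)
    return forecasts

# Stand-in model (assumption): continue the last observed slope.
def linear_extrapolation(values):
    return values[-1] + (values[-1] - values[-2])

forecast = recursive_forecast(linear_extrapolation, [1.0, 2.0, 3.0], steps=4)
print(forecast)
```

After the first few steps the window contains only predicted values, which is why an early error can propagate to all later points.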

In [None]:
future_pred = [[], []]
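nb_points = 50
# Leave room for nb_points real values after the start so both curves cover the same span
start_index = np.random.randint(0, len(test) - nb_points)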

new_row = test.iloc[start_index:(start_index + 1), :].drop(columns=['out'])

for _ in range(nb_points):
    last_row = new_row.copy()
    pred = reg.predict(new_row[training_features])[0]
    future_pred[0].append(new_row.index)
    future_pred[1].append(pred)


    for k in mapping_dict:
        new_row[k] = last_row[mapping_dict[k]]
    new_row["date1"] = last_row.index
    new_row.index += pd.Timedelta('1h')
    new_row = get_features(new_row)
    new_row['out1'] = pred

fig, ax = plt.subplots(figsize=(12, 5))

ax.set_title('Testing Data/predicted value based on old prediction')
ax.plot(test.iloc[start_index:(start_index + nb_points)]['out'], alpha=1, color="blue")
ax.plot(future_pred[0], future_pred[1], alpha=0.7, color="red")
ax.legend(['Testing Set', 'Prediction'])

plt.show()

The same code as above, but plotting several attempts at different start indices of the test set at the same time.

In [None]:
nb_tot_try = 4

fig, ax = plt.subplots(nb_tot_try, 1, figsize=(15, 5 * nb_tot_try))

for nb_try in range(nb_tot_try):
    future_pred = [[], []]
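    nb_points = 200
    # Leave room for nb_points real values after the start so both curves cover the same span
    start_index = np.random.randint(0, len(test) - nb_points)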

    new_row = test.iloc[start_index:(start_index + 1), :].drop(columns=['out'])

    for _ in range(nb_points):
        last_row = new_row.copy()
        pred = reg.predict(new_row[training_features])[0]
        future_pred[0].append(new_row.index)
        future_pred[1].append(pred)


        for k in mapping_dict:
            new_row[k] = last_row[mapping_dict[k]]
        new_row["date1"] = last_row.index
        new_row.index += pd.Timedelta('1h')
        new_row = get_features(new_row)
        new_row['out1'] = pred


    ax[nb_try].set_title('Testing Data/predicted value based on old prediction')
    ax[nb_try].plot(test.iloc[start_index:(start_index + nb_points)]['out'], alpha=1, color="blue")
    ax[nb_try].plot(future_pred[0], future_pred[1], alpha=0.7, color="red")
    ax[nb_try].legend(['Testing Set', 'Prediction'])

plt.show()

The results are inconsistent: sometimes the prediction is very good over a short horizon, and sometimes it goes wrong quite quickly. One likely reason is that each prediction is fed back as an input, so errors can accumulate from one step to the next.