# Benchmark QRT

This notebook walks through a simple benchmark to help new participants get started with the competition.

## Used libraries

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt

## Loading data

The x_train and x_test sets each contain 35 columns.

The target of this challenge is `TARGET` and corresponds to the price change for daily futures contracts of 24H electricity baseload.

Electricity prices can be quite volatile, so we have chosen the Spearman rank correlation as a robust measure for the challenge.
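
To illustrate why a rank-based metric suits volatile prices, here is a minimal sketch on made-up numbers (not challenge data). It compares the Spearman and Pearson correlations before and after a single extreme price move is injected into the target; the variable names are purely illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Toy values (made up): the predictions rank the targets perfectly
y_true = np.array([-0.3, -0.1, 0.0, 0.2, 0.4])
y_pred = np.array([-0.2, -0.1, 0.1, 0.3, 0.5])

# Inject one extreme price move into the target; the ordering is unchanged
y_spike = y_true.copy()
y_spike[-1] = 50.0

rho_before = spearmanr(y_pred, y_true).correlation
rho_after = spearmanr(y_pred, y_spike).correlation
r_before = pearsonr(y_pred, y_true)[0]
r_after = pearsonr(y_pred, y_spike)[0]

print(f"Spearman: {rho_before:.3f} -> {rho_after:.3f}")
print(f"Pearson:  {r_before:.3f} -> {r_after:.3f}")
```

Because Spearman only looks at ranks, the spike leaves it unchanged, while the Pearson correlation drops.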

Both x_train and y_train have an ID column that uniquely identifies each row.

You will notice some columns have missing values.
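
One quick way to see which columns are affected is to count the missing entries per column. The sketch below uses a small stand-in frame named `toy` with made-up values (only the column names are taken from the data); on the real data the same two lines could be applied to `x_train`.

```python
import numpy as np
import pandas as pd

# Stand-in for x_train: made-up values, real column names
toy = pd.DataFrame({
    "ID": [1054, 2049, 1924],
    "DE_NET_EXPORT": [np.nan, -0.57, -0.62],
    "FR_NET_EXPORT": [0.69, -1.13, -1.68],
    "DE_NET_IMPORT": [np.nan, 0.57, 0.62],
})

# Count missing values per column and keep only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```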


In [3]:
x_train = pd.read_csv('x_train.csv')
y_train = pd.read_csv('y_train.csv')
x_test = pd.read_csv('x_test.csv')

In [4]:
x_train.head()

Unnamed: 0,ID,DAY_ID,COUNTRY,DE_CONSUMPTION,FR_CONSUMPTION,DE_FR_EXCHANGE,FR_DE_EXCHANGE,DE_NET_EXPORT,FR_NET_EXPORT,DE_NET_IMPORT,...,FR_RESIDUAL_LOAD,DE_RAIN,FR_RAIN,DE_WIND,FR_WIND,DE_TEMP,FR_TEMP,GAS_RET,COAL_RET,CARBON_RET
0,1054,206,FR,0.210099,-0.427458,-0.606523,0.606523,,0.69286,,...,-0.444661,-0.17268,-0.556356,-0.790823,-0.28316,-1.06907,-0.063404,0.339041,0.124552,-0.002445
1,2049,501,FR,-0.022399,-1.003452,-0.022063,0.022063,-0.57352,-1.130838,0.57352,...,-1.183194,-1.2403,-0.770457,1.522331,0.828412,0.437419,1.831241,-0.659091,0.047114,-0.490365
2,1924,687,FR,1.395035,1.978665,1.021305,-1.021305,-0.622021,-1.682587,0.622021,...,1.947273,-0.4807,-0.313338,0.431134,0.487608,0.684884,0.114836,0.535974,0.743338,0.204952
3,297,720,DE,-0.983324,-0.849198,-0.839586,0.839586,-0.27087,0.56323,0.27087,...,-0.976974,-1.114838,-0.50757,-0.499409,-0.236249,0.350938,-0.417514,0.911652,-0.296168,1.073948
4,1101,818,FR,0.143807,-0.617038,-0.92499,0.92499,,0.990324,,...,-0.526267,-0.541465,-0.42455,-1.088158,-1.01156,0.614338,0.729495,0.245109,1.526606,2.614378


In [5]:
y_train.head()

Unnamed: 0,ID,TARGET
0,1054,0.028313
1,2049,-0.112516
2,1924,-0.18084
3,297,-0.260356
4,1101,-0.071733


## Model and local score

We chose a simple linear regression as the challenge's benchmark. Missing values are simply filled with 0, and the COUNTRY column is dropped.

**Ideas for improvement**: This challenge will test your knowledge of modeling techniques and feature engineering, as well as proper EDA and validation. Knowledge of the fundamental price drivers of electricity in each country will also be useful. The dataset is small, so you will need to be careful not to overfit to the training data.
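
As one possible starting point for feature engineering, the sketch below shows a hypothetical preprocessing variant. It is not the benchmark's method: it keeps the country information as a 0/1 indicator rather than dropping it, and it leaves out the identifier columns ID and DAY_ID, which the benchmark keeps among its regressors. The function name `prepare_features` and the column `IS_FR` are invented for this illustration, and the demo rows are made up.

```python
import numpy as np
import pandas as pd

def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical preprocessing variant (an assumption, not the benchmark)."""
    out = df.copy()
    # Encode the country as a 0/1 indicator instead of dropping it
    out["IS_FR"] = (out["COUNTRY"] == "FR").astype(int)
    # Leave the identifier columns out of the feature matrix
    out = out.drop(columns=["ID", "DAY_ID", "COUNTRY"])
    # Same missing-value handling as the benchmark
    return out.fillna(0)

# Made-up rows with a few of the real column names
demo = pd.DataFrame({
    "ID": [1054, 297],
    "DAY_ID": [206, 720],
    "COUNTRY": ["FR", "DE"],
    "DE_NET_EXPORT": [np.nan, -0.27],
    "GAS_RET": [0.34, 0.91],
})
features = prepare_features(demo)
print(features)
```

Whether either change helps is an empirical question that should be checked on held-out data.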


In [None]:
lr = LinearRegression()

Xt = x_train.drop(['COUNTRY'], axis=1).fillna(0)
yt = y_train['TARGET']

lr.fit(Xt, yt)

print('Spearman correlation for train set {:.1f}%'.format(100 * spearmanr(lr.predict(Xt), yt).correlation))
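
The score printed above is measured on the same rows the model was fitted on, so it tends to be optimistic. A minimal cross-validation sketch is shown below on synthetic data (random values, not the challenge set). The helper `cv_spearman` is a name invented here, and grouping the folds by DAY_ID, so that the two country rows of one day never end up on both sides of a split, is an assumption about what makes a sensible validation scheme rather than something the challenge prescribes.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

def cv_spearman(X, y, groups, n_splits=5):
    """Return the out-of-fold Spearman correlation of each fold."""
    scores = []
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, valid_idx in splitter.split(X, y, groups):
        model = LinearRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
        preds = model.predict(X.iloc[valid_idx])
        scores.append(spearmanr(preds, y.iloc[valid_idx]).correlation)
    return np.array(scores)

# Synthetic stand-in for the challenge data
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame(
    rng.normal(size=(n, 4)),
    columns=["GAS_RET", "COAL_RET", "CARBON_RET", "DE_WIND"],
)
y_demo = pd.Series(0.5 * X_demo["GAS_RET"] + rng.normal(size=n))
day_id = rng.integers(0, 100, size=n)  # plays the role of DAY_ID

scores = cv_spearman(X_demo, y_demo, groups=day_id)
print("Fold scores:", np.round(scores, 3))
print("Mean score:", round(float(scores.mean()), 3))
```

With the real data, `X_demo`, `y_demo` and `day_id` would be replaced by the prepared features, `y_train['TARGET']` and `x_train['DAY_ID']`.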


## Generate the submission

We process the test set the same way as the training set, predict with our linear model, and save the predictions to a CSV file.


In [None]:
x_test.head()

In [None]:
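# Apply the same preprocessing as for the training set
Xv = x_test.drop(['COUNTRY'], axis=1).fillna(0)

# Predict with the fitted linear model
y_pred = lr.predict(Xv)

# Build the submission file: one row per test ID with its prediction
y_test_submission = x_test[['ID']].copy()
y_test_submission['TARGET'] = y_pred
y_test_submission.to_csv('benchmark_qrt.csv', index=False)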

The local Spearman correlation, measured on the training set, is around 27.9%.

After submitting the benchmark file at https://forms.gle/XrnXx92F6uo2NQ5E8, we obtain a public score of around 15.9%.

# Data Description

We provide three CSV data sets: training inputs X_train, training outputs Y_train, and test inputs X_test.


NB: The input data X_train and X_test represent the same explanatory variables but over two different time periods.

The ID columns of X_train and Y_train are identical, and the same holds for the test data. The training data sets contain 1494 rows, and the test data sets contain 654 rows.
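
Before fitting anything, it can be worth confirming that inputs and outputs really line up row by row. The sketch below does this on small stand-in frames whose values are copied from the previews earlier in the notebook; `x_demo` and `y_demo` are illustrative names, and the real frames would be `x_train` and `y_train`.

```python
import pandas as pd

# Stand-in frames built from the first rows shown in the previews
x_demo = pd.DataFrame({"ID": [1054, 2049, 1924], "DAY_ID": [206, 501, 687]})
y_demo = pd.DataFrame({"ID": [1054, 2049, 1924],
                       "TARGET": [0.028313, -0.112516, -0.18084]})

# IDs should be unique and appear in the same order in both frames
ids_unique = x_demo["ID"].is_unique and y_demo["ID"].is_unique
ids_aligned = x_demo["ID"].equals(y_demo["ID"])
print("unique:", ids_unique, "aligned:", ids_aligned)

# Merging on ID pairs features and target without relying on row order;
# validate="one_to_one" raises if an ID is duplicated on either side
merged = x_demo.merge(y_demo, on="ID", how="inner", validate="one_to_one")
print(merged)
```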

Input data sets comprise 35 columns:

- ID: Unique row identifier, associated with a day (DAY_ID) and a country (COUNTRY),
- DAY_ID: Day identifier. Dates have been anonymized, but all data corresponding to a specific day is consistent,
- COUNTRY: Country identifier, DE = Germany, FR = France,

daily commodity price variations:

- GAS_RET: European gas,
- COAL_RET: European coal,
- CARBON_RET: Carbon emissions futures,

weather measures (daily, in the country x):

- x_TEMP: Temperature,
- x_RAIN: Rainfall,
- x_WIND: Wind,

energy production measures (daily, in the country x):

- x_GAS: Natural gas,
- x_COAL: Hard coal,
- x_HYDRO: Hydro reservoir,
- x_NUCLEAR: Daily nuclear production,
- x_SOLAR: Photovoltaic,
- x_WINDPOW: Wind power,
- x_LIGNITE: Lignite,

and electricity use metrics (daily, in the country x):

- x_CONSUMPTION: Total electricity consumption,
- x_RESIDUAL_LOAD: Electricity consumption after using all renewable energies,
- x_NET_IMPORT: Imported electricity from Europe,
- x_NET_EXPORT: Exported electricity to Europe,
- DE_FR_EXCHANGE: Total daily electricity exchange between Germany and France,
- FR_DE_EXCHANGE: Total daily electricity exchange between France and Germany.

Output data sets are composed of two columns:

- ID: Unique row identifier, corresponding to the input identifiers,
- TARGET: Daily price variation for futures of 24H electricity baseload.

The solution files submitted by participants must follow this output format: two columns, ID and TARGET, where the ID values correspond to those of the ID column of X_test. An example submission file containing random predictions is provided.
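
A submission can be checked against this format before uploading. The sketch below is one way to do so; `check_submission` is a hypothetical helper, not part of any challenge tooling, and rejecting missing TARGET values is an assumption rather than a documented rule. The demo frames use made-up values.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, x_test: pd.DataFrame) -> None:
    """Raise ValueError if the submission deviates from the described format."""
    if list(submission.columns) != ["ID", "TARGET"]:
        raise ValueError(
            f"expected columns ['ID', 'TARGET'], got {list(submission.columns)}"
        )
    # Assumption: a prediction is required for every row
    if submission["TARGET"].isna().any():
        raise ValueError("TARGET contains missing values")
    # Every test ID must appear exactly once, in any order
    if sorted(submission["ID"]) != sorted(x_test["ID"]):
        raise ValueError("submission IDs do not match the IDs of x_test")

# Made-up example frames
x_test_demo = pd.DataFrame({"ID": [10, 11, 12]})
good = pd.DataFrame({"ID": [10, 11, 12], "TARGET": [0.1, -0.2, 0.0]})
bad = pd.DataFrame({"ID": [10, 11], "TARGET": [0.1, -0.2]})

check_submission(good, x_test_demo)  # passes silently
try:
    check_submission(bad, x_test_demo)
    bad_rejected = False
except ValueError as err:
    bad_rejected = True
    print("Rejected:", err)
```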
