Spearman Correlation 

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

## Loading data

- `X_train` and `X_test` both have $35$ columns that represent the same explanatory variables but over different time periods.

- `X_train` and `Y_train` share the same column `ID`: each row corresponds to a unique ID associated with a day and a country.

- The target of this challenge `TARGET` in `Y_train` corresponds to the price change for daily futures contracts of 24H electricity baseload.

- **You will notice some columns have missing values**.
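
One quick way to see which columns contain missing values is to count the NaN entries per column. The sketch below uses a small hypothetical frame named `toy` (the values are illustrative, not the challenge data); on the real data the same call would be `X_train.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X_train; values are illustrative only.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "COUNTRY": ["FR", "FR", "DE", "FR"],
    "DE_NET_EXPORT": [np.nan, -0.57, -0.27, np.nan],
    "FR_NET_EXPORT": [0.69, -1.13, 0.56, 0.99],
})

# Count NaN entries per column and keep only the columns that have at least one.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```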


In [4]:
# After downloading the X_train/X_test/Y_train .csv files in your working directory:

X_train = pd.read_csv('/content/X_train_NHkHMNU.csv')
Y_train = pd.read_csv('/content/y_train_ZAN5mwg.csv')
X_test = pd.read_csv('/content/X_test_final.csv')
y_test = pd.read_csv('/content/y_test_random_final.csv')

In [5]:
X_train.head()

Unnamed: 0,ID,DAY_ID,COUNTRY,DE_CONSUMPTION,FR_CONSUMPTION,DE_FR_EXCHANGE,FR_DE_EXCHANGE,DE_NET_EXPORT,FR_NET_EXPORT,DE_NET_IMPORT,...,FR_RESIDUAL_LOAD,DE_RAIN,FR_RAIN,DE_WIND,FR_WIND,DE_TEMP,FR_TEMP,GAS_RET,COAL_RET,CARBON_RET
0,1054,206,FR,0.210099,-0.427458,-0.606523,0.606523,,0.69286,,...,-0.444661,-0.17268,-0.556356,-0.790823,-0.28316,-1.06907,-0.063404,0.339041,0.124552,-0.002445
1,2049,501,FR,-0.022399,-1.003452,-0.022063,0.022063,-0.57352,-1.130838,0.57352,...,-1.183194,-1.2403,-0.770457,1.522331,0.828412,0.437419,1.831241,-0.659091,0.047114,-0.490365
2,1924,687,FR,1.395035,1.978665,1.021305,-1.021305,-0.622021,-1.682587,0.622021,...,1.947273,-0.4807,-0.313338,0.431134,0.487608,0.684884,0.114836,0.535974,0.743338,0.204952
3,297,720,DE,-0.983324,-0.849198,-0.839586,0.839586,-0.27087,0.56323,0.27087,...,-0.976974,-1.114838,-0.50757,-0.499409,-0.236249,0.350938,-0.417514,0.911652,-0.296168,1.073948
4,1101,818,FR,0.143807,-0.617038,-0.92499,0.92499,,0.990324,,...,-0.526267,-0.541465,-0.42455,-1.088158,-1.01156,0.614338,0.729495,0.245109,1.526606,2.614378


In [6]:
Y_train.head()

Unnamed: 0,ID,TARGET
0,1054,0.028313
1,2049,-0.112516
2,1924,-0.18084
3,297,-0.260356
4,1101,-0.071733


## Model and train score

The benchmark for this challenge is a simple linear regression fitted after a light cleaning of the data: the missing (NaN) values are filled with 0 and the `COUNTRY` column is dropped, so the same model is used for France and Germany.

Training the ensemble model: linear regression + gradient boosting




In [8]:

# Assuming X_train and Y_train are your training data and labels

# Linear Regression
lr = LinearRegression()
X_train_clean = X_train.drop(['COUNTRY'], axis=1).fillna(0)
Y_train_clean = Y_train['TARGET']
lr.fit(X_train_clean, Y_train_clean)
output_train_lr = lr.predict(X_train_clean)

# Gradient Boosting
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train_clean, Y_train_clean)
output_train_gb = gb_model.predict(X_train_clean)

# Ensemble predictions
ensemble_input_train = np.column_stack((output_train_lr, output_train_gb))

# Another Linear Regression model to combine predictions
ensemble_lr = LinearRegression()
ensemble_lr.fit(ensemble_input_train, Y_train_clean)
output_train_ensemble = ensemble_lr.predict(ensemble_input_train)

# Calculate Spearman correlation for the ensemble
ensemble_correlation = spearmanr(output_train_ensemble, Y_train_clean).correlation

print('Spearman correlation for the ensemble on the train set: {:.1f}%'.format(100 * ensemble_correlation))


Spearman correlation for the ensemble on the train set: 54.2%


Training the ensemble model: linear regression + gradient boosting + random forest + XGBoost


In [9]:
# Feature Engineering (you can customize this based on your dataset)
# X_train_clean = engineer_features(X_train)

# Hyperparameter Tuning
# Tune hyperparameters for lr, gb_model, and any new models

# More Complex Models
rf_model = RandomForestRegressor()
rf_model.fit(X_train_clean, Y_train_clean)
output_train_rf = rf_model.predict(X_train_clean)

xgb_model = XGBRegressor()
xgb_model.fit(X_train_clean, Y_train_clean)
output_train_xgb = xgb_model.predict(X_train_clean)

# Ensemble More Models
ensemble_input_train = np.column_stack((output_train_lr, output_train_gb, output_train_rf, output_train_xgb))

# Another Linear Regression model to combine predictions
ensemble_lr = LinearRegression()
ensemble_lr.fit(ensemble_input_train, Y_train_clean)
output_train_ensemble = ensemble_lr.predict(ensemble_input_train)

# Cross-Validation
cross_val_scores = cross_val_score(ensemble_lr, ensemble_input_train, Y_train_clean, cv=5, scoring='neg_mean_squared_error')
ensemble_correlation = spearmanr(output_train_ensemble, Y_train_clean).correlation

print('Spearman correlation for the ensemble on the train set: {:.1f}%'.format(100 * ensemble_correlation))
print('Cross-Validation MSE scores:', -cross_val_scores)


Spearman correlation for the ensemble on the train set: 97.8%

Cross-Validation MSE scores: [0.00749383 0.00578125 0.00781225 0.00729386 0.00654797]
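
The train score above is computed on predictions that each base model made for rows it was fitted on, so it tends to be optimistic. A common alternative is to feed the combiner out-of-fold predictions instead. The following is a minimal sketch of that idea, assuming synthetic data from `make_regression` in place of `X_train_clean` and `Y_train_clean`; the model sizes and the `cv=5` split are illustrative choices, not the notebook's settings.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for X_train_clean / Y_train_clean (assumption, not the challenge data).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

base_models = [
    LinearRegression(),
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
]

# Each column holds predictions from a model that never saw the row it predicts.
oof_preds = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_models])

# The linear combiner is scored out-of-fold as well.
meta_preds = cross_val_predict(LinearRegression(), oof_preds, y, cv=5)
oof_spearman = spearmanr(meta_preds, y).correlation
print('Out-of-fold Spearman correlation: {:.1f}%'.format(100 * oof_spearman))
```

scikit-learn's `StackingRegressor` packages the same pattern in a single estimator, which may be more convenient when the stack is also used for prediction on new data.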


The Spearman correlation obtained with the original linear regression model on the train data set is about 27.9%.

NB: Electricity price variations can be quite volatile and this is why we have chosen the Spearman rank correlation as a robust metric for the challenge, instead of the more standard Pearson correlation.
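
To illustrate why a rank-based metric is more robust here, the sketch below compares the two correlations on made-up numbers in which the predictions rank the targets perfectly but one target is an extreme spike. The arrays are illustrative assumptions, not challenge data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Illustrative data: the ordering of y_pred matches y_true exactly,
# but the last target value is a large price spike.
y_pred = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
y_true = np.array([0.1, 0.2, 0.3, 0.4, 50.0])

pearson = pearsonr(y_pred, y_true)[0]
spearman = spearmanr(y_pred, y_true).correlation

print('Pearson:  {:.3f}'.format(pearson))
print('Spearman: {:.3f}'.format(spearman))
```

Spearman depends only on the ranks, so the spike does not change it, whereas Pearson is pulled well below 1 by the single outlier.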

## Generate the benchmark output

Next, we process the test set the same way as the train set, predict using our linear model, and save the predictions to a CSV file satisfying the challenge output constraints.
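
A minimal sketch of the saving step, assuming the expected layout is the same two columns as `Y_train` (`ID` and `TARGET`). The helper `build_submission` and the file name `benchmark_submission.csv` are hypothetical, and the IDs and predictions are illustrative values.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def build_submission(ids, predictions):
    """Return a two-column frame (ID, TARGET), the assumed submission layout."""
    return pd.DataFrame({"ID": np.asarray(ids), "TARGET": np.asarray(predictions)})


# Illustrative values; in the notebook these would come from X_test['ID']
# and from the predict() output of a model fitted on the train set.
submission = build_submission([1115, 1202, 1194], [-0.05, -0.11, 1.05])

# Hypothetical file name, written to a temporary directory in this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "benchmark_submission.csv")
submission.to_csv(out_path, index=False)
```

Passing `index=False` keeps the DataFrame index out of the file, so no extra unnamed column appears when the file is read back.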


In [10]:
X_test.head()

Unnamed: 0,ID,DAY_ID,COUNTRY,DE_CONSUMPTION,FR_CONSUMPTION,DE_FR_EXCHANGE,FR_DE_EXCHANGE,DE_NET_EXPORT,FR_NET_EXPORT,DE_NET_IMPORT,...,FR_RESIDUAL_LOAD,DE_RAIN,FR_RAIN,DE_WIND,FR_WIND,DE_TEMP,FR_TEMP,GAS_RET,COAL_RET,CARBON_RET
0,1115,241,FR,0.340083,-0.433604,-0.423521,0.423521,0.165333,0.519419,-0.165333,...,-0.222525,-0.51318,-0.182048,-0.982546,-0.876632,0.880491,0.692242,0.569419,-0.029697,-0.929256
1,1202,1214,FR,0.803209,0.780411,0.60161,-0.60161,0.342802,0.555367,-0.342802,...,0.857739,-0.340595,-0.301094,-0.759816,-1.221443,-0.616617,-0.737496,0.251251,0.753646,0.664086
2,1194,1047,FR,0.79554,0.721954,1.179158,-1.179158,1.620928,0.666901,-1.620928,...,0.447967,0.796475,-0.367248,0.376055,-0.483363,0.865138,0.120079,-1.485642,-0.32645,-0.349747
3,1084,1139,FR,0.172555,-0.723427,-0.044539,0.044539,,-0.205276,,...,-0.561295,-0.542606,-0.013291,-0.791119,-0.894309,0.239153,0.457457,-0.746863,2.262654,0.642069
4,1135,842,FR,0.949714,0.420236,0.617391,-0.617391,0.608561,-0.240856,-0.608561,...,0.503567,-0.230291,-0.609203,-0.744986,-1.196282,0.176557,0.312557,-2.219626,-0.509272,-0.488341


In [11]:
y_test.head()

Unnamed: 0,ID,TARGET
0,1115,-0.052395
1,1202,-0.112118
2,1194,1.050431
3,1084,-1.267154
4,1135,0.751565


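Fitting and evaluating the linear regression + gradient boosting ensemble on the test set. Note that the cell below re-fits every model on `X_test_clean` and `Y_test_clean` rather than reusing the models trained above, so the reported scores are in-sample for the test data.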

In [13]:
# Assuming X_test is your test data
X_test_clean = X_test.drop(['COUNTRY'], axis=1).fillna(0)
Y_test_clean = y_test['TARGET']

# Linear Regression
lr = LinearRegression()
lr.fit(X_test_clean, Y_test_clean)
output_test_lr = lr.predict(X_test_clean)

# Gradient Boosting
gb_model = GradientBoostingRegressor()

# Define custom scorer for maximizing Spearman correlation
spearman_scorer = make_scorer(lambda y_true, y_pred: spearmanr(y_true, y_pred).correlation, greater_is_better=True)

# Define hyperparameters for GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    # Add other hyperparameters as needed
}

# Create GridSearchCV object
grid_search_gb = GridSearchCV(gb_model, param_grid, cv=5, scoring=spearman_scorer, n_jobs=-1)

# Fit the model
grid_search_gb.fit(X_test_clean, Y_test_clean)

# Get the best model from the grid search
best_gb_model = grid_search_gb.best_estimator_

# Make predictions using the best gradient boosting model
output_test_gb = best_gb_model.predict(X_test_clean)

# Stack the predictions of individual models as additional features
ensemble_input_test = np.column_stack((output_test_lr, output_test_gb))

# Ensemble model (Linear Regression)
ensemble_lr = LinearRegression()
ensemble_lr.fit(ensemble_input_test, Y_test_clean)

# Make predictions using the ensemble model
output_test_ensemble = ensemble_lr.predict(ensemble_input_test)

# Cross-validation for Ensemble model using MSE scoring
ensemble_cv_scores = -cross_val_score(ensemble_lr, ensemble_input_test, Y_test_clean, cv=5, scoring='neg_mean_squared_error')

# Calculate Spearman correlation for the ensemble on the test set
ensemble_correlation_test = spearmanr(output_test_ensemble, Y_test_clean).correlation
print('Cross-Validation MSE scores:', ensemble_cv_scores)

print('Spearman correlation for the ensemble on the test set: {:.1f}%'.format(100 * ensemble_correlation_test))
print('Best hyperparameters for Gradient Boosting:', grid_search_gb.best_params_)

Cross-Validation MSE scores: [0.8666564  0.80442638 0.71627908 0.93080649 0.9326836 ]

Spearman correlation for the ensemble on the test set: 44.0%

Best hyperparameters for Gradient Boosting: {'learning_rate': 0.01, 'n_estimators': 100}
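
The Spearman scorer can also drive a grid search that is fitted on training data only and then scored on rows the search never saw. The following is a minimal sketch under stated assumptions: synthetic data from `make_regression`, a `train_test_split` hold-out in place of the challenge's train and test files, and a reduced, illustrative parameter grid.

```python
from scipy.stats import spearmanr
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split


def spearman_score(y_true, y_pred):
    """Spearman rank correlation between targets and predictions."""
    return spearmanr(y_true, y_pred).correlation


spearman_scorer = make_scorer(spearman_score, greater_is_better=True)

# Synthetic stand-in for the challenge data (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Illustrative grid, smaller than the one used in the notebook.
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring=spearman_scorer,
)
search.fit(X_tr, y_tr)  # hyperparameters are chosen on the training split only

# The refitted best estimator is scored on rows it has not seen.
held_out_spearman = spearman_score(y_te, search.predict(X_te))
print('Held-out Spearman correlation: {:.1f}%'.format(100 * held_out_spearman))
print('Best hyperparameters:', search.best_params_)
```

Defining the metric as a named function rather than a lambda keeps the scorer picklable, which can matter when `n_jobs` is set to run the search in parallel.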


Fitting the random forest + gradient boosting ensemble, combined by a linear regression, on the test set. As in the previous cell, the models are tuned and fitted on the test data itself.

In [14]:
# Assuming X_test is your test data
X_test_clean = X_test.drop(['COUNTRY'], axis=1).fillna(0)
Y_test_clean = y_test['TARGET']

# Random Forest
rf_model = RandomForestRegressor()

# Define hyperparameters for Random Forest
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    # Add other hyperparameters as needed
}

# Create GridSearchCV object for Random Forest
grid_search_rf = GridSearchCV(rf_model, rf_param_grid, cv=5, scoring=spearman_scorer, n_jobs=-1)
grid_search_rf.fit(X_test_clean, Y_test_clean)
best_rf_model = grid_search_rf.best_estimator_
output_test_rf = best_rf_model.predict(X_test_clean)

# Gradient Boosting
gb_model = GradientBoostingRegressor()

# Define hyperparameters for Gradient Boosting
gb_param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    # Add other hyperparameters as needed
}

# Create GridSearchCV object for Gradient Boosting
grid_search_gb = GridSearchCV(gb_model, gb_param_grid, cv=5, scoring=spearman_scorer, n_jobs=-1)
grid_search_gb.fit(X_test_clean, Y_test_clean)
best_gb_model = grid_search_gb.best_estimator_
output_test_gb = best_gb_model.predict(X_test_clean)

# Stack the predictions of individual models as additional features
ensemble_input_test = np.column_stack((output_test_rf, output_test_gb))

# Ensemble model (Linear Regression)
ensemble_lr = LinearRegression()
ensemble_lr.fit(ensemble_input_test, Y_test_clean)

# Make predictions using the ensemble model
output_test_ensemble = ensemble_lr.predict(ensemble_input_test)

# Calculate Spearman correlation for the ensemble on the test set
ensemble_correlation_test = spearmanr(output_test_ensemble, Y_test_clean).correlation

print('Spearman correlation for the ensemble on the test set: {:.1f}%'.format(100 * ensemble_correlation_test))
print('Best hyperparameters for Random Forest:', grid_search_rf.best_params_)
print('Best hyperparameters for Gradient Boosting:', grid_search_gb.best_params_)

Spearman correlation for the ensemble on the test set: 96.5%

Best hyperparameters for Random Forest: {'max_depth': 20, 'n_estimators': 200}

Best hyperparameters for Gradient Boosting: {'learning_rate': 0.01, 'n_estimators': 100}


Calculating the cross-validated MSE using the best hyperparameters found above for the random forest and gradient boosting models.

In [26]:
# Assuming X_test is your test data
X_test_clean = X_test.drop(['COUNTRY'], axis=1).fillna(0)
Y_test_clean = y_test['TARGET']

# Random Forest
best_rf_model = RandomForestRegressor(max_depth=20, n_estimators=200)
best_rf_model.fit(X_test_clean, Y_test_clean)

# Cross-validated MSE for Random Forest
cv_mse_rf = -cross_val_score(best_rf_model, X_test_clean, Y_test_clean, cv=5, scoring='neg_mean_squared_error')

# Gradient Boosting
best_gb_model = GradientBoostingRegressor(learning_rate=0.01, n_estimators=100)
best_gb_model.fit(X_test_clean, Y_test_clean)

# Cross-validated MSE for Gradient Boosting
cv_mse_gb = -cross_val_score(best_gb_model, X_test_clean, Y_test_clean, cv=5, scoring='neg_mean_squared_error')

# Stack the predictions of individual models as additional features
ensemble_input_test = np.column_stack((best_rf_model.predict(X_test_clean), best_gb_model.predict(X_test_clean)))

# Ensemble model (Linear Regression)
ensemble_lr = LinearRegression()
ensemble_lr.fit(ensemble_input_test, Y_test_clean)

# Cross-validated MSE for the ensemble
cv_mse_ensemble = -cross_val_score(ensemble_lr, ensemble_input_test, Y_test_clean, cv=5, scoring='neg_mean_squared_error')

# Calculate Spearman correlation for the ensemble on the test set
ensemble_correlation_test = spearmanr(ensemble_lr.predict(ensemble_input_test), Y_test_clean).correlation

print('Spearman correlation for the ensemble on the test set: {:.1f}%'.format(100 * ensemble_correlation_test))
print('Cross-validated Mean Squared Error for the ensemble: ',cv_mse_ensemble)


Spearman correlation for the ensemble on the test set: 96.6%

Cross-validated Mean Squared Error for the ensemble:  [0.06576405 0.05985683 0.05244412 0.07932518 0.07556514]
