# Wind prediction - Second assignment

## Authors

David Moreno Maldonado 100441714     
Inés Fernández Campos 100443936

## 0. Preliminaries

In [1]:
# Import some libraries
import os
import numpy as np              
import pandas as pd
import matplotlib.pyplot as plt 

import sys
import time
import math

from sklearn.experimental import enable_iterative_imputer
from sklearn import preprocessing, impute, model_selection, metrics, neighbors, ensemble, feature_selection
from sklearn.pipeline import Pipeline
import optuna
import optuna.visualization as ov

os.getcwd()

'/Users/roni/Desktop/master/2nd quarter/big data intelligence/assignments/assignment_2'

In [2]:
#MAIN PARAMETERS FOR THE ASSIGNMENT
budget = 20
random_state = 3
verbose = 0
n_jobs = 1

The "wind_pickle" file contains data in a binary format called "Pickle". Pickle data loads faster than text data.

In [3]:
data = pd.read_pickle('wind_pickle.pickle')

You can visualize the attributes in the dataset. Very important, the output attribute (i.e. the value to be predicted, **energy**, is the first attribute). **Steps** represents the hours in advance of the forecast. We will not use this variable here.

In [4]:
# The dataset contains 5937 instances and 556 attributes (including the outcome to be predicted)
print(data.shape)
#data.columns.values.tolist() 

(5937, 556)


In [5]:
#-1 for training, 0 for validation, 1 for testing
year_to_part = {
    2005: -1,
    2006: -1,
    2007: 0,
    2008: 0, 
    2009: 1,
    2010: 1
}
data['partition'] = data['year'].apply(lambda x: year_to_part[x])

We now remove the columns that cannot be used for training the models from the DataFrame

In [6]:
# Steps, month, day, hour, year should be removed, they cannot be used for training the models
to_remove = ['steps', 'month', 'year', 'day', 'hour']
for m in to_remove: data = data.drop(m, 1)

In [7]:
from numpy.random import randint

# we add na values at random
my_NIA = 100443936 + 100441714
np.random.seed(my_NIA)

how_many_nas = round(data.shape[0]*data.shape[1]*0.05)
print('Lets put '+str(how_many_nas)+' missing values \n')
x_locations = randint(0, data.shape[0], size=how_many_nas)
y_locations = randint(1, data.shape[1]-2, size=how_many_nas)

for i in range(len(x_locations)):
    data.iat[x_locations[i], y_locations[i]] = np.nan
    
data.to_pickle('wind_pickle_with_nan.pickle')

Lets put 163861 missing values 



From this point on, the file wind_pickle_with_nan should be used.

In [8]:
data = pd.read_pickle('wind_pickle_with_nan.pickle')
data.shape

(5937, 552)

## Input missing data

Since we have randomly inputed missing values throughout our data, prior to creating our models we must impute the missing values (except in the response). In the following cell we have implemented and iterative imputer two different ways (through *knnImputer* and *IterativeImputer*) and finally a simple imputer which is the one we have left uncommented simply because, although the first two are more complex and impute values using the entire set of available feature dimensions to estimate the missing values, they take far too long. 

In [9]:
print(data.isnull().values.any())
input_cols = data.columns.difference(['energy', 'partition'])
x = data[input_cols]

#Iterative imputer (takes too long)
'''iter_imp = impute.IterativeImputer(random_state=random_state, 
                                   initial_strategy='median', 
                                   max_iter=3,
                                   verbose=verbose)
no_nan = iter_imp.fit_transform(x)'''

#KNN imputer(takes too long)
'''knn_imp = impute.KNNImputer(weights='distance')
no_nan = knn_imp.fit_transform(x)'''

#Simple imputer
simp_imp = impute.SimpleImputer(strategy='median',
                               verbose=2)
no_nan = simp_imp.fit_transform(x)

data[input_cols] = pd.DataFrame(data=no_nan)
print(data.isnull().values.any())

True
False


## Scaling

Lastly we scale the data so that it is all within the same range.

In [10]:
scaler = preprocessing.StandardScaler().fit(data[input_cols]) 
data[input_cols] = scaler.transform(data[input_cols])

## Data split
We are going to use train/test for model evaluation (outer) and train/validation for hyperparameter tuning (inner), as follows:     
1. Train partition: the first two years of data. Given that there are 6 years worth of data, we will use the first 2/6 of the instances for training.     
2. Validation partition: the second two years of data. 
3. Test partition: the remaining data    


In [11]:
#-1 for training, 0 for validation, 1 for testing
test = data[data['partition'] == 1]
train = data[data['partition'] == -1]
val = data[data['partition'] == 0]

del test['partition']
del train['partition']

y_test = test['energy']
x_test = test[test.columns.difference(['energy'])]

y_train = train['energy']
x_train = train[train.columns.difference(['energy'])]


y_val = val['energy']
x_val = val[train.columns.difference(['energy'])]

# 1. MODEL SELECTION AND HYPER-PARAMETER TUNING

In [46]:
#Dataframes with all the information of each model
summary = {
    'knn': pd.DataFrame(columns=['Time (sec)', 'Score (RMSE)', 'N. neighbors', 'Weights', 'P']),
    'random_forest': pd.DataFrame(columns=['Time (sec)', 'Score (RMSE)', 'Min. samples split', 'Criterion', 'Max. depth', 'N. estimators','Max. features']),
    'gradient_boosting': pd.DataFrame(columns=['Time (sec)', 'Score (RMSE)'])
}

## 1.1 KNN

### 1.1.1 Default hyper-parameters

Here we train KNN with the default hyper-parameters, so the number of neighbors used will be 5 and the power parameter for the Minkowski metric is set to 2, so KNN will be using euclidean distance.

In [47]:
np.random.seed(random_state)
knn_default = neighbors.KNeighborsRegressor()

start_time = time.time()
knn_default = knn_default.fit(x_train, y_train)
y_val_pred = knn_default.predict(x_val)
score = math.sqrt(metrics.mean_squared_error(y_val, y_val_pred))
end_time = time.time()

summary['knn'] = summary['knn'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': score, 
    'N. neighbors': 5, 
    'Weights': 'uniform', 
    'P': 2
    }, 
    name='default'))

### 1.1.2 Hyper-parameter tunning (OPTUNA)

In this subsection however, we are going to tune the hyper-parameters using Optuna.  
To do so, we create and objective function that will test a set of hyper-parameters for the model and evaluate the model's performance given those hyper-parameters and return that score. Within this objective function, we define three hyper-parameters to tune: 
- number of neighbors: taking any integer value in [1, 16].
- weights: weight function to be used in prediction can either be uniform weights or points can be distance weighted (closer implies more influence).
- p: power parametric for the Minkowski metric in order to choose between euclidean distance or manhattan distance.

In [48]:
min_n_neigbors = 1
max_n_neigbors = 50

In [49]:
np.random.seed(random_state)
def knn_objective(trial):
    n_neighbors = trial.suggest_int('n_neighbors', min_n_neigbors, max_n_neigbors)
    weights = trial.suggest_categorical('weights', ['uniform','distance'])
    p = trial.suggest_categorical('p', [1, 2])

    clf = neighbors.KNeighborsRegressor(
        n_neighbors=n_neighbors,
        weights=weights,
        p=p)
    
    clf = clf.fit(x_train, y_train)
    y_val_pred = clf.predict(x_val)
    return math.sqrt(metrics.mean_squared_error(y_val, y_val_pred))

knn_optuna = optuna.create_study(direction='minimize')
start_time = time.time()
knn_optuna.optimize(knn_objective, n_trials=budget)
end_time = time.time()

summary['knn'] = summary['knn'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': knn_optuna.best_value, 
    'N. neighbors': knn_optuna.best_params['n_neighbors'], 
    'Weights': knn_optuna.best_params['weights'], 
    'P': knn_optuna.best_params['p']
    }, 
    name='optuna'))

[32m[I 2021-01-14 17:30:52,713][0m A new study created in memory with name: no-name-c943cd42-6a4a-416a-8f3b-5afefd078941[0m
[32m[I 2021-01-14 17:30:54,935][0m Trial 0 finished with value: 427.46488149623633 and parameters: {'n_neighbors': 27, 'weights': 'distance', 'p': 1}. Best is trial 0 with value: 427.46488149623633.[0m
[32m[I 2021-01-14 17:30:55,027][0m Trial 1 finished with value: 435.6329813032633 and parameters: {'n_neighbors': 28, 'weights': 'distance', 'p': 2}. Best is trial 0 with value: 427.46488149623633.[0m
[32m[I 2021-01-14 17:30:57,211][0m Trial 2 finished with value: 441.1599884529427 and parameters: {'n_neighbors': 5, 'weights': 'distance', 'p': 1}. Best is trial 0 with value: 427.46488149623633.[0m
[32m[I 2021-01-14 17:30:57,282][0m Trial 3 finished with value: 575.7646217953005 and parameters: {'n_neighbors': 1, 'weights': 'uniform', 'p': 2}. Best is trial 0 with value: 427.46488149623633.[0m
[32m[I 2021-01-14 17:30:57,377][0m Trial 4 finished with 

## 1.2 Random Forest

### 1.2.1 Default hyper-parameters

The following cell trains a random forest ensemble with default parameters: 100 estimators, mse criteiron and 2 samples minimum to continue splitting a node.

In [51]:
np.random.seed(random_state)
rf_default = ensemble.RandomForestRegressor(random_state=random_state, verbose=verbose, n_jobs=n_jobs)

start_time = time.time()
rf_default = rf_default.fit(x_train, y_train)
y_val_pred = rf_default.predict(x_val)
score =  math.sqrt(metrics.mean_squared_error(y_val, y_val_pred))
end_time = time.time()

summary['random_forest'] = summary['random_forest'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': score,
    'Min. samples split': 2, 
    'Criterion': 'mse', 
    'Max. depth': 'None',
    'N. estimators': 100,
    'Max. features': 1
    },
    name='default'))

### 1.2.2 Hyper-parameter tunning (OPTUNA)

Once again, instead of settling for the default hyper-parameters we tune them using Optuna. In this case more hyper-parameters are tuned but the procedure is similar if not pratically the same to how we tuned the KNNRegressor using Optuna.

In [52]:
min_max_depth = 2
max_max_depth = 32
min_n_estimators = 50
max_n_estimators = 400

In [53]:
np.random.seed(random_state)
def random_forest_objective(trial):
    min_samples_split = trial.suggest_uniform('min_samples_split', 0+sys.float_info.min, 1)
    criterion = trial.suggest_categorical('criterion', ['mse','mae'])
    max_depth = trial.suggest_int('max_depth', min_max_depth, max_max_depth)
    n_estimators = trial.suggest_int('n_estimators', min_n_estimators, max_n_estimators)
    max_features = trial.suggest_uniform('max_features', 0+sys.float_info.min, 0.6)

    clf = ensemble.RandomForestRegressor(
        random_state=random_state,
        min_samples_split=min_samples_split,
        criterion=criterion,
        max_depth=max_depth,
        n_estimators=n_estimators,
        max_features=max_features
        )
    clf = clf.fit(x_train, y_train)
    y_val_pred = clf.predict(x_val)
    return math.sqrt(metrics.mean_squared_error(y_val, y_val_pred))

rf_optuna = optuna.create_study(direction='minimize')
start_time = time.time()
rf_optuna.optimize(random_forest_objective, n_trials=budget, n_jobs=n_jobs)
end_time = time.time()

summary['random_forest'] = summary['random_forest'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': rf_optuna.best_value,
    'Min. samples split': rf_optuna.best_params['min_samples_split'], 
    'Criterion': rf_optuna.best_params['criterion'], 
    'Max. depth': rf_optuna.best_params['max_depth'],
    'N. estimators': rf_optuna.best_params['n_estimators'],
    'Max. features': rf_optuna.best_params['max_features']
    },
    name='optuna'))

[32m[I 2021-01-14 17:34:47,027][0m A new study created in memory with name: no-name-6264c00f-23f0-4876-b70e-50c7b06c51f9[0m
[32m[I 2021-01-14 17:34:47,413][0m Trial 0 finished with value: 719.4608441722529 and parameters: {'min_samples_split': 0.6575167002537593, 'criterion': 'mae', 'max_depth': 15, 'n_estimators': 204, 'max_features': 0.40181980166836345}. Best is trial 0 with value: 719.4608441722529.[0m
[32m[I 2021-01-14 17:35:14,502][0m Trial 1 finished with value: 399.9725398015795 and parameters: {'min_samples_split': 0.3483686600938284, 'criterion': 'mse', 'max_depth': 21, 'n_estimators': 295, 'max_features': 0.40809651553870596}. Best is trial 1 with value: 399.9725398015795.[0m
[32m[I 2021-01-14 17:35:14,637][0m Trial 2 finished with value: 668.3821156793259 and parameters: {'min_samples_split': 0.7479993441301223, 'criterion': 'mse', 'max_depth': 19, 'n_estimators': 135, 'max_features': 0.5395298588307268}. Best is trial 1 with value: 399.9725398015795.[0m
[32m[I

## 1.3 Gradient Boosting

In this section we seek to apply Gradient Boosting. Boosting merely wants to boost weak models (as are our trees) through ensembles by sequentially adding a new model to that ensemble with the idea that every model added to the ensemble might do better than the last.

Gradient Boosting takes this idea of Boosting and applies it through Gradient Descent. Basically, every added model will approximate the distance between the ensembles output at that iteration and the actual output we want to get. And by adding new models to our ensembles we expect that difference of outputs to decrease.

### 1.3.1 Default hyper-parameters

Here we first implement Gradient Boosting using the scikit-learn library with default hyper-parameters. 

Note how, since we are not only working with regression trees but gradient descent, apart from hyper-parameters for the trees we now find hyper-parameters like *learning rate* that are used to determine the steplegnth in the descent direction towards the optima.

In [60]:
# implementation using sklearn
np.random.seed(random_state)
gb_sk_def = ensemble.GradientBoostingRegressor(random_state=random_state, verbose=verbose)

start_time = time.time()
gb_sk_def = gb_sk_def.fit(x_train, y_train)
y_val_pred = gb_sk_def.predict(x_val)
score =  math.sqrt(metrics.mean_squared_error(y_val, y_val_pred))
end_time = time.time()

summary['gradient_boosting'] = summary['gradient_boosting'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': score,
    'Learning rate': 0.1,
    'N. estimators': 100,
    'Criterion': 'friedman_mse', 
    'Min. samples split': 2, 
    'Min. samples leaf': 1,
    'Max. depth': 3,
    'Max. leaf nodes': 'None'
    },
    name='default'))

We have also implemented Gradient Boosting using the XGBoost library.
 
The implementation is pratically the same, only having to transform or cast the input training and test matrices to the libraries matrices.
Regarding the hyper-parameters, some that might be present in scikit are not in xgb or vice-versa but let us implement Gradient Descent this way and later on evaluate its performance.

In [69]:
# implementation using xgboost
import xgboost as xgb

dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test, label=y_test)

gb_xgb_def = xgb.XGBRegressor(objective='reg:squarederror')

start_time = time.time()
gb_xgb_def = gb_xgb_def.fit(x_train, y_train)
y_val_pred = gb_xgb_def.predict(x_val)
score = math.sqrt(metrics.mean_squared_error(y_val, y_val_pred))
end_time = time.time()

summary['gradient_boosting'] = summary['gradient_boosting'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': score,
    'Learning rate': 0.3,
    'Max. depth': 6,
    'Max. leaf nodes': 0,
    'Gamma (min_split_loss)': 0,
    'Lambda': 1,
    'Alpha': 0,
    'N. estimators': gb_xgb_def.get_params()['n_estimators']
    },
    name='default_xgboost'))

### 1.3.2 Hyper-parameter tunning

As we did in the last subsection we will now implement Gradient Boosting for both scikit-learn and XGBoost tuning their hyper-parameters through Optuna

In [63]:
min_max_leaf_nodes = 2
max_max_leaf_nodes = 20
min_min_samples_leaf = 1
max_min_samples_leaf = 10

Using scikit-learn the execution took quite a while which is why we created a variable *short* to control how many hyper-parameters to train. However, the outputs execution you see below corresponds to training all hyper-parameters.

In [64]:
# hyperparam tuning for sklearn ensemble.GradientBoostingRegressor
np.random.seed(random_state)

def gradboosting_objective(trial):  
    gb_sk_opt = None
    short = False
    
    learning_rate = trial.suggest_uniform('learning_rate', 0+sys.float_info.min, 1)
    n_estimators = trial.suggest_int('n_estimators', min_n_estimators, max_n_estimators)
    min_samples_split = trial.suggest_uniform('min_samples_split', 0+sys.float_info.min, 1)
    max_depth = trial.suggest_int('max_depth', min_max_depth, max_max_depth)
        
    if short == False: # it will take a long time to run 
        criterion = trial.suggest_categorical('criterion', ['mse','friedman_mse'])
        min_samples_leaf = trial.suggest_int('min_samples_leaf',min_min_samples_leaf, max_min_samples_leaf)
        max_leaf_nodes = trial.suggest_int('max_leaf_nodes', min_max_leaf_nodes, max_max_leaf_nodes)
        
        gb_sk_opt = ensemble.GradientBoostingRegressor(learning_rate=learning_rate, 
                                                   n_estimators=n_estimators,
                                                   criterion=criterion,
                                                   min_samples_split=min_samples_split,
                                                   min_samples_leaf=min_samples_leaf,
                                                   max_depth=max_depth,
                                                   max_leaf_nodes=max_leaf_nodes,
                                                   random_state=random_state,
                                                   verbose=verbose)
    else:  # will take less time        
        gb_sk_opt = ensemble.GradientBoostingRegressor(learning_rate=learning_rate, 
                                                   n_estimators=n_estimators,
                                                   min_samples_split=min_samples_split,
                                                   max_depth=max_depth,
                                                   random_state=random_state,
                                                   verbose=verbose)
        
    gb_sk_opt = gb_sk_opt.fit(x_train, y_train)
    y_val_pred = gb_sk_opt.predict(x_val)
    
    return math.sqrt(metrics.mean_squared_error(y_val, y_val_pred))

gb_optuna = optuna.create_study(direction='minimize')
start_time = time.time()
gb_optuna.optimize(gradboosting_objective, n_trials=budget)
end_time = time.time()

summary['gradient_boosting'] = summary['gradient_boosting'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': gb_optuna.best_value,
    'Learning rate': gb_optuna.best_params['learning_rate'],
    'N. estimators': gb_optuna.best_params['n_estimators'],
    'Criterion': 'friedman_mse', 
    'Min. samples leaf': gb_optuna.best_params['min_samples_leaf'],
    'Min. samples leaf': 1,
    'Max. depth': gb_optuna.best_params['max_depth'],
    'Max. leaf nodes': gb_optuna.best_params['max_leaf_nodes']
    },
    name='optuna_sklearn'))

[32m[I 2021-01-14 18:11:26,619][0m A new study created in memory with name: no-name-038d6898-2561-4001-92f3-e55ab327c2a7[0m
[32m[I 2021-01-14 18:13:23,055][0m Trial 0 finished with value: 426.64002717133434 and parameters: {'learning_rate': 0.5065337886424471, 'n_estimators': 150, 'min_samples_split': 0.28252873137478773, 'max_depth': 27, 'criterion': 'friedman_mse', 'min_samples_leaf': 4, 'max_leaf_nodes': 11}. Best is trial 0 with value: 426.64002717133434.[0m
[32m[I 2021-01-14 18:15:59,022][0m Trial 1 finished with value: 498.9519962473219 and parameters: {'learning_rate': 0.9314245286134138, 'n_estimators': 212, 'min_samples_split': 0.30320790912098206, 'max_depth': 27, 'criterion': 'friedman_mse', 'min_samples_leaf': 4, 'max_leaf_nodes': 11}. Best is trial 0 with value: 426.64002717133434.[0m
[32m[I 2021-01-14 18:17:31,351][0m Trial 2 finished with value: 379.6951509882391 and parameters: {'learning_rate': 0.04766293389176224, 'n_estimators': 219, 'min_samples_split': 0

Tuning XGBRegressor with Optuna the execution takes less time so there was no need to take a similar approach to what we did above

In [71]:
# hyperparam tuning for XGBoost Regressor
def xgradboosting_objective(trial):
    
    eta = trial.suggest_uniform('eta', 0+sys.float_info.min, 1.0)
    max_depth = trial.suggest_int('max_depth', min_max_depth, max_max_depth)
    n_estimators = trial.suggest_int('n_estimators', min_n_estimators, max_n_estimators)
    
    gamma = trial.suggest_float('gamma', 0.01, 1.0)
    reg_lambda = trial.suggest_uniform('lambda', 0.01, 0.5)
    reg_alpha = trial.suggest_uniform('alpha', 0.01, 0.5)

    gb_xgb_opt = xgb.XGBRegressor(objective='reg:squarederror',
                                  booster='gbtree',
                                  learning_rate=eta,
                                  gamma=gamma,
                                  reg_alpha=reg_alpha,
                                  reg_lambda=reg_lambda,
                                  max_depth=max_depth,
                                  n_estimators=n_estimators,
                                  random_state=random_state,
                                  verbosity=verbose
                                 )

    gb_xgb_opt = gb_xgb_opt.fit(x_train, y_train)
    y_val_pred = gb_xgb_opt.predict(x_val)
    
    return math.sqrt(metrics.mean_squared_error(y_val, y_val_pred))


gb_optuna = optuna.create_study(direction='minimize')
start_time = time.time()
gb_optuna.optimize(xgradboosting_objective, n_trials=budget)
end_time = time.time()

summary['gradient_boosting'] = summary['gradient_boosting'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': gb_optuna.best_value,
    'Learning rate': gb_optuna.best_params['eta'],
    'Max. depth': gb_optuna.best_params['max_depth'],
    'Gamma (min_split_loss)': gb_optuna.best_params['gamma'],
    'Lambda': gb_optuna.best_params['lambda'],
    'Alpha': gb_optuna.best_params['alpha'],
    'N. estimators': gb_optuna.best_params['n_estimators']  
    },
    name='optuna_xgboost'))

[32m[I 2021-01-14 18:53:05,160][0m A new study created in memory with name: no-name-12fcd1a4-66f2-4b4b-8f6b-144c43a05c92[0m
[32m[I 2021-01-14 18:54:08,872][0m Trial 0 finished with value: 445.11596138786507 and parameters: {'eta': 0.5475110908996418, 'max_depth': 12, 'n_estimators': 399, 'gamma': 0.9945206996717629, 'lambda': 0.16947785185193312, 'alpha': 0.17868598835093621}. Best is trial 0 with value: 445.11596138786507.[0m
[32m[I 2021-01-14 18:55:24,697][0m Trial 1 finished with value: 431.4641278948212 and parameters: {'eta': 0.4029138842455492, 'max_depth': 25, 'n_estimators': 235, 'gamma': 0.9927485201217258, 'lambda': 0.2773160972893963, 'alpha': 0.3910011528521415}. Best is trial 1 with value: 431.4641278948212.[0m
[32m[I 2021-01-14 18:56:12,182][0m Trial 2 finished with value: 489.20321514411506 and parameters: {'eta': 0.7623333814899286, 'max_depth': 30, 'n_estimators': 150, 'gamma': 0.7825333606141713, 'lambda': 0.4279957296676617, 'alpha': 0.0620228490000086}. B

## Results

Let us take a look at the results. Generally speaking Optuna takes a longer time to execute but in the upside, also obtains better scores.

Comparing between the different methods, clearly knn is the worst in terms of scores but time wise, the default is a lot faster. 
Random Forest and Optuna tuned Scikit Gradient Boosting seem to get similar scores, however, that specific Gradient Boosting implementation takes a very long time to run whereas Random Forest gets one of the best scores in less time.

In [50]:
summary['knn']

Unnamed: 0,Time (sec),Score (RMSE),N. neighbors,Weights,P
default,0.1151,455.123868,5,uniform,2
optuna,31.7793,425.084222,12,distance,1


In [54]:
summary['random_forest']

Unnamed: 0,Time (sec),Score (RMSE),Min. samples split,Criterion,Max. depth,N. estimators,Max. features
default,78.0287,376.878001,2.0,mse,,100,1.0
optuna,1385.7564,373.963182,0.001337,mse,22.0,250,0.161814


In [72]:
summary['gradient_boosting']

Unnamed: 0,Time (sec),Score (RMSE),Criterion,Learning rate,Max. depth,Max. leaf nodes,Min. samples leaf,Min. samples split,N. estimators,Alpha,Gamma (min_split_loss),Lambda
default,29.0407,389.223359,friedman_mse,0.1,3.0,,1.0,2.0,100.0,,,
optuna_sklearn,2117.836,378.214463,friedman_mse,0.047467,4.0,16.0,1.0,,329.0,,,
default_xgboost,7.9902,409.80287,,0.3,6.0,0.0,,,100.0,0.0,0.0,1.0
optuna_xgboost,751.0554,404.734023,,0.075983,28.0,,,,105.0,0.493936,0.734717,0.418297


We will estimate the performance of our best model, in this case, random forest

In [73]:
rf_best_optuna = ensemble.RandomForestRegressor(
          random_state=random_state,
          min_samples_split=rf_optuna.best_params['min_samples_split'],
          criterion=rf_optuna.best_params['criterion'],
          max_depth=rf_optuna.best_params['max_depth'],
          n_estimators=rf_optuna.best_params['n_estimators'],
          max_features=rf_optuna.best_params['max_features']
      )
rf_best_optuna = rf_best_optuna.fit(x_train, y_train)

In [74]:
rf_opt_predict = rf_best_optuna.predict(x_test)
print('\nAttribute selection model performance:')
print('RMSE: {:.4f} (should be lower than the trivial predictor using the mean MSE: {:.4f})'.format(
    math.sqrt(metrics.mean_squared_error(y_test, rf_opt_predict)),
    math.sqrt(metrics.mean_squared_error(y_test, [y_test.mean() for i in range(len(y_test))]))))
print('R square: {:.4f} (should be higher than the trivial predictor using the mean: R square {:.4f})'.format(
    metrics.r2_score(y_test, rf_opt_predict),
    metrics.r2_score(y_test, [y_test.mean() for i in range(len(y_test))])))


Attribute selection model performance:
RMSE: 383.0463 (should be lower than the trivial predictor using the mean MSE: 689.8476)
R square: 0.6917 (should be higher than the trivial predictor using the mean: R square 0.0000)


# 2. ATTRIBUTE SELECTION

## 2.1 Select from all attributes

**Are all 550 input attributes actually necessary in order to get a good model? Is it possible to have an accurate model that uses fewer than 550 variables? How many?**

For this question we will be using the random forest as in previous sections, but now we will include the parameter for select only certain attributes. 

In [26]:
min_max_depth = 2
max_max_depth = 25
min_n_estimators = 50
max_n_estimators = 300
min_n_k = 10
max_n_k = 550

In order to evaluate whether all 550 attributes are necessary to obtain a good model, we add a new hyper-parameter to the optuna tuning objective function *k*. This hyper-parameter represents the number of attributes needed to obtain a good model. As we want to achieve at least some dimension reduction, we will set the maximum number of attributes to use at 350 of the 550 we had available. (33% reduction)

In order to do both feature selection and regression, we create a pipeline that allows us to first select the best k features and once the attributes have been decided upon, apply regression just as we did in the previous exercise.

In [31]:
np.random.seed(random_state)
def random_forest_objective_attr(trial):
    k = trial.suggest_int('k', min_n_k, max_n_k)
    min_samples_split = trial.suggest_uniform('min_samples_split', 0+sys.float_info.min, 1)
    criterion = trial.suggest_categorical('criterion', ['mse','mae'])
    max_depth = trial.suggest_int('max_depth', min_max_depth, max_max_depth, log=True)
    n_estimators = trial.suggest_int('n_estimators', min_n_estimators, max_n_estimators)
    max_features = trial.suggest_uniform('max_features', 0+sys.float_info.min, 0.6)

    clf = Pipeline([
      ('feature_selection', feature_selection.SelectKBest(feature_selection.f_regression, k=k)),
      ('regression', ensemble.RandomForestRegressor(
          random_state=random_state,
          min_samples_split=min_samples_split,
          criterion=criterion,
          max_depth=max_depth,
          n_estimators=n_estimators,
          max_features=max_features
      ))
    ])

    clf = clf.fit(x_train, y_train)
    y_val_pred = clf.predict(x_val)
    return math.sqrt(metrics.mean_squared_error(y_val, y_val_pred))

rf_attr_optuna = optuna.create_study(direction='minimize')
start_time = time.time()
rf_attr_optuna.optimize(random_forest_objective_attr, n_trials=budget, n_jobs=n_jobs)
end_time = time.time()
print(end_time-start_time)

[32m[I 2021-01-14 17:00:12,137][0m A new study created in memory with name: no-name-265ab6f7-48ef-4f90-8293-9335db6c1b5f[0m
[32m[I 2021-01-14 17:02:31,846][0m Trial 0 finished with value: 551.4198926282447 and parameters: {'k': 324, 'min_samples_split': 0.5865215732384921, 'criterion': 'mae', 'max_depth': 12, 'n_estimators': 179, 'max_features': 0.49862808139915354}. Best is trial 0 with value: 551.4198926282447.[0m
[32m[I 2021-01-14 17:02:32,571][0m Trial 1 finished with value: 493.95580466376583 and parameters: {'k': 86, 'min_samples_split': 0.4608837235993515, 'criterion': 'mse', 'max_depth': 2, 'n_estimators': 93, 'max_features': 0.37201623130564365}. Best is trial 1 with value: 493.95580466376583.[0m
[32m[I 2021-01-14 17:02:33,559][0m Trial 2 finished with value: 495.557605619882 and parameters: {'k': 116, 'min_samples_split': 0.38793609024141273, 'criterion': 'mse', 'max_depth': 2, 'n_estimators': 96, 'max_features': 0.3738917205994864}. Best is trial 1 with value: 493

744.9541549682617


In [32]:
print(rf_attr_optuna.best_params, rf_attr_optuna.best_value)

{'k': 263, 'min_samples_split': 0.16767418716856797, 'criterion': 'mse', 'max_depth': 16, 'n_estimators': 223, 'max_features': 0.2218126021308216} 397.7033856348979


As we can see, the results in terms of the RMSE are similar to the ones we get in the first section with random forest and gradient boosting, but using less amount of attributes. The RMSE is higher, but not a lot to the previous achieved. This means that we were using redundant information to train our models and we can use simpler models with the improvement in trainning time and optimization this means.

We are using now 263 variables of the 550 availables, this means over a 52% dimension reduction with similar results as in the previous models.

We can estimate the performance we will get forecasting energy in the future:

In [33]:
rf_attr = Pipeline([
      ('feature_selection', feature_selection.SelectKBest(feature_selection.f_regression,
                                                          k=rf_attr_optuna.best_params['k'])),
      ('regression', ensemble.RandomForestRegressor(
          random_state=random_state,
          min_samples_split=rf_attr_optuna.best_params['min_samples_split'],
          criterion=rf_attr_optuna.best_params['criterion'],
          max_depth=rf_attr_optuna.best_params['max_depth'],
          n_estimators=rf_attr_optuna.best_params['n_estimators'],
          max_features=rf_attr_optuna.best_params['max_features']
      ))
    ])
rf_attr = rf_attr.fit(x_train, y_train)

In [34]:
rf_attr_predict = rf_attr.predict(x_test)
print('\nAttribute selection model performance:')
print('RMSE: {:.4f} (should be lower than the trivial predictor using the mean MSE: {:.4f})'.format(
    math.sqrt(metrics.mean_squared_error(y_test, rf_attr_predict)),
    math.sqrt(metrics.mean_squared_error(y_test, [y_test.mean() for i in range(len(y_test))]))))
print('R square: {:.4f} (should be higher than the trivial predictor using the mean: R square {:.4f})'.format(
    metrics.r2_score(y_test, rf_attr_predict),
    metrics.r2_score(y_test, [y_test.mean() for i in range(len(y_test))])))


Attribute selection model performance:
RMSE: 447.8903 (should be lower than the trivial predictor using the mean MSE: 689.8476)
R square: 0.5785 (should be higher than the trivial predictor using the mean: R square 0.0000)


## 2.2 Use only Sotavento attributes
**Is it enough to use only the attributes for the actual Sotavento location? (13th location in the grid)**

We will select only Sotavento attributes and use again random forest as in previous section to train a model.

In [35]:
min_max_depth = 2
max_max_depth = 32
min_n_estimators = 50
max_n_estimators = 400

In [36]:
#Selecting sotavento attributes only
sot_attr = []
for attr in x_train.columns:
    if int(attr.split('.')[-1]) == 13:
        sot_attr.append(attr)

x_train_sot = x_train[sot_attr]
x_val_sot = x_val[sot_attr]
x_test_sot = x_test[sot_attr]
print(x_train_sot.shape,x_val_sot.shape,x_test_sot.shape)

(2528, 22) (1299, 22) (2110, 22)


In [38]:
np.random.seed(random_state)
def random_forest_sot_objective(trial):
    min_samples_split = trial.suggest_uniform('min_samples_split', 0+sys.float_info.min, 1)
    criterion = trial.suggest_categorical('criterion', ['mse','mae'])
    max_depth = trial.suggest_int('max_depth', min_max_depth, max_max_depth)
    n_estimators = trial.suggest_int('n_estimators', min_n_estimators, max_n_estimators)
    max_features = trial.suggest_uniform('max_features', 0+sys.float_info.min, 0.6)

    clf = ensemble.RandomForestRegressor(
        random_state=random_state,
        min_samples_split=min_samples_split,
        criterion=criterion,
        max_depth=max_depth,
        n_estimators=n_estimators,
        max_features=max_features
        )

    clf = clf.fit(x_train_sot, y_train)
    y_val_pred = clf.predict(x_val_sot)
    return math.sqrt(metrics.mean_squared_error(y_val, y_val_pred))

rf_sot_optuna = optuna.create_study(direction='minimize')
start_time = time.time()
rf_sot_optuna.optimize(random_forest_sot_objective, n_trials=budget, n_jobs=n_jobs)
end_time = time.time()

[32m[I 2021-01-14 17:22:31,720][0m A new study created in memory with name: no-name-15192830-873f-4389-bd91-e0a0bb1c0579[0m
[32m[I 2021-01-14 17:22:34,513][0m Trial 0 finished with value: 625.99684692581 and parameters: {'min_samples_split': 0.5927898583368811, 'criterion': 'mae', 'max_depth': 4, 'n_estimators': 334, 'max_features': 0.05012023176667339}. Best is trial 0 with value: 625.99684692581.[0m
[32m[I 2021-01-14 17:22:35,665][0m Trial 1 finished with value: 464.69454593130445 and parameters: {'min_samples_split': 0.4969889299707142, 'criterion': 'mse', 'max_depth': 5, 'n_estimators': 363, 'max_features': 0.3651138018549163}. Best is trial 1 with value: 464.69454593130445.[0m
[32m[I 2021-01-14 17:22:43,275][0m Trial 2 finished with value: 464.40310675536 and parameters: {'min_samples_split': 0.169789038688811, 'criterion': 'mae', 'max_depth': 31, 'n_estimators': 203, 'max_features': 0.12474337025562236}. Best is trial 2 with value: 464.40310675536.[0m
[32m[I 2021-01-

In [39]:
print(rf_sot_optuna.best_params, rf_sot_optuna.best_value)

{'min_samples_split': 0.0009326810839291322, 'criterion': 'mse', 'max_depth': 21, 'n_estimators': 290, 'max_features': 0.42268772785940734} 378.36972489322824


If we use a model using only the Sotavento attributes we can achive similar results as in the first section. Again, the RMSE is not as low as we have already obtained, but now we are using 13 attributes instead of the 550 availables (97% dimension reduction) so the tradeof between performance and execution time does really pay off.

We will now estimate the performance in predicting energy generation:

In [43]:
rf_sot = ensemble.RandomForestRegressor(
          random_state=random_state,
          min_samples_split=rf_sot_optuna.best_params['min_samples_split'],
          criterion=rf_sot_optuna.best_params['criterion'],
          max_depth=rf_sot_optuna.best_params['max_depth'],
          n_estimators=rf_sot_optuna.best_params['n_estimators'],
          max_features=rf_sot_optuna.best_params['max_features']
      )
rf_sot = rf_sot.fit(x_train_sot, y_train)

In [45]:
rf_sot_predict = rf_sot.predict(x_test_sot)
print('\nAttribute selection model performance:')
print('RMSE: {:.4f} (should be lower than the trivial predictor using the mean MSE: {:.4f})'.format(
    math.sqrt(metrics.mean_squared_error(y_test, rf_sot_predict)),
    math.sqrt(metrics.mean_squared_error(y_test, [y_test.mean() for i in range(len(y_test))]))))
print('R square: {:.4f} (should be higher than the trivial predictor using the mean: R square {:.4f})'.format(
    metrics.r2_score(y_test, rf_sot_predict),
    metrics.r2_score(y_test, [y_test.mean() for i in range(len(y_test))])))


Attribute selection model performance:
RMSE: 387.4636 (should be lower than the trivial predictor using the mean MSE: 689.8476)
R square: 0.6845 (should be higher than the trivial predictor using the mean: R square 0.0000)


Notice we expect to get a better performance with this Sotavento model than with the one used before attribute selection and it is really close to random forest and gradient boosting with all attributes.

# 3. Conclusions

We have obtained really good models to forecast energy generation with precision even though we artificially remove data from the dataset. First, we had to deal with this missing data. As the matrix we are dealing with is really big and due to our computing power restrictions, we could not really use advanced techniques to imput missing data as an iterative imputer or a KNN imputer. Therefore, we just put the sample median in the empty spots we generated.

Then, we have trained KNN, random forest and gradient boosting models. We have used an advanced hyper-parameter tunning optimizer as optuna and also xgboost library to get a better approach than with sklearn which models predict worst and are slower for trainning in this case. Again, we have to deal with the problem of computation time and because of computing power restirctions we cannot set a really high budget or times for trainning will be  really high for the assignment. After all, we obtained two good models for random forest and gradient boosting (xgboost) that could be used to predict energy generation.

Lastly, we have tried to do some attribute selection. We first try it by using pure computation and with Pipelines to first select the best attributes to predict and then train a random forest model with these specifications. The results were slightly worse than trainning with all the attributes, but satisfactory taking into account the dimension reduction. Then, we use only the Sotavento data that we have and again get an interesting model, not as good as the one using all the parameters, but another time sufficient if we think in the dimension reduction.

As a final conclusion, this assignment has made us undertand the importance of planning when trainning really big models as a lot of time could be wasted. We get the impression that we could have achieved quite better results with more powerful computers that will have helped us with deeper hyper-parameter tunning, but we are satisfied with the results, the knowledge and conclusion we have obtained during the assignment.