## PREDICTING WIND ENERGY PRODUCTION WITH SCIKIT-LEARN

**Authors: David de la Fuente López and Diego Perán Vacas**

## Setup

### First part

The first step is to read the data, which is a Pandas dataframe stored in pickle format. For this purpose, we import pandas and call its read_pickle function.

In [4]:
import pandas as pd
data = pd.read_pickle('wind_pickle.pickle')

We inspect the Pandas dataframe. Since the variables steps, month, day and hour cannot be used for training the models, we have to remove them. We do this in two different ways, in order to practise the techniques for modifying Pandas dataframes. First, we write a function that removes a specific column from the dataframe and apply it to the four variables, which are originally in positions 2, 4, 5 and 6. Second, we use the Pandas method .drop() with the argument axis = 1 to remove the same columns.

In [5]:
data

Unnamed: 0,energy,steps,year,month,day,hour,p54.162.1,p54.162.2,p54.162.3,p54.162.4,...,v100.16,v100.17,v100.18,v100.19,v100.20,v100.21,v100.22,v100.23,v100.24,v100.25
0,402.71,0,2005,1,2,18,2.534970e+06,2.526864e+06,2.518754e+06,2.510648e+06,...,-4.683596,-4.545396,-4.407196,-4.268996,-4.131295,-4.669626,-4.528932,-4.388736,-4.248540,-4.107846
1,696.80,6,2005,1,3,0,2.537369e+06,2.529277e+06,2.521184e+06,2.513088e+06,...,-3.397886,-3.257192,-3.115998,-2.975304,-2.834609,-3.396390,-3.254198,-3.112506,-2.970314,-2.828622
2,1591.15,12,2005,1,3,6,2.533727e+06,2.525703e+06,2.517678e+06,2.509654e+06,...,-1.454105,-1.296447,-1.138290,-0.980134,-0.822476,-1.459094,-1.302933,-1.147271,-0.991110,-0.834949
3,1338.62,18,2005,1,3,12,2.534491e+06,2.526548e+06,2.518609e+06,2.510670e+06,...,1.255015,1.370265,1.485515,1.600765,1.716015,1.210612,1.319376,1.428140,1.536405,1.645169
4,562.50,0,2005,1,3,18,2.529543e+06,2.521623e+06,2.513702e+06,2.505782e+06,...,1.939031,2.023847,2.108663,2.193977,2.278793,1.873673,1.953000,2.031829,2.111157,2.189986
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5932,211.78,0,2010,12,30,18,2.450279e+06,2.442801e+06,2.435324e+06,2.427846e+06,...,3.473201,3.445761,3.418819,3.391379,3.363938,3.499644,3.467214,3.434785,3.401856,3.369426
5933,944.52,6,2010,12,31,0,2.455407e+06,2.447817e+06,2.440226e+06,2.432635e+06,...,2.280789,2.372091,2.463892,2.555194,2.646994,2.333674,2.418490,2.503306,2.588621,2.673437
5934,224.06,12,2010,12,31,6,2.457296e+06,2.449624e+06,2.441947e+06,2.434271e+06,...,2.211939,2.341657,2.470877,2.600096,2.729815,2.188489,2.312221,2.436451,2.560183,2.684413
5935,0.37,18,2010,12,31,12,2.464015e+06,2.456257e+06,2.448494e+06,2.440732e+06,...,1.235059,1.311393,1.387228,1.463563,1.539897,1.263996,1.338335,1.412174,1.486513,1.560852


In [6]:
def drop_column(X,string):
    return X.iloc[:, X.columns != string]

In [7]:
data = drop_column(data,'steps')
data = drop_column(data,'month')
data = drop_column(data,'day')
data = drop_column(data,'hour')

In [8]:
data

Unnamed: 0,energy,year,p54.162.1,p54.162.2,p54.162.3,p54.162.4,p54.162.5,p54.162.6,p54.162.7,p54.162.8,...,v100.16,v100.17,v100.18,v100.19,v100.20,v100.21,v100.22,v100.23,v100.24,v100.25
0,402.71,2005,2.534970e+06,2.526864e+06,2.518754e+06,2.510648e+06,2.502537e+06,2.531111e+06,2.522721e+06,2.514330e+06,...,-4.683596,-4.545396,-4.407196,-4.268996,-4.131295,-4.669626,-4.528932,-4.388736,-4.248540,-4.107846
1,696.80,2005,2.537369e+06,2.529277e+06,2.521184e+06,2.513088e+06,2.504995e+06,2.533465e+06,2.525088e+06,2.516716e+06,...,-3.397886,-3.257192,-3.115998,-2.975304,-2.834609,-3.396390,-3.254198,-3.112506,-2.970314,-2.828622
2,1591.15,2005,2.533727e+06,2.525703e+06,2.517678e+06,2.509654e+06,2.501629e+06,2.529801e+06,2.521496e+06,2.513187e+06,...,-1.454105,-1.296447,-1.138290,-0.980134,-0.822476,-1.459094,-1.302933,-1.147271,-0.991110,-0.834949
3,1338.62,2005,2.534491e+06,2.526548e+06,2.518609e+06,2.510670e+06,2.502732e+06,2.530569e+06,2.522346e+06,2.514127e+06,...,1.255015,1.370265,1.485515,1.600765,1.716015,1.210612,1.319376,1.428140,1.536405,1.645169
4,562.50,2005,2.529543e+06,2.521623e+06,2.513702e+06,2.505782e+06,2.497861e+06,2.525621e+06,2.517421e+06,2.509215e+06,...,1.939031,2.023847,2.108663,2.193977,2.278793,1.873673,1.953000,2.031829,2.111157,2.189986
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5932,211.78,2010,2.450279e+06,2.442801e+06,2.435324e+06,2.427846e+06,2.420368e+06,2.446240e+06,2.438486e+06,2.430738e+06,...,3.473201,3.445761,3.418819,3.391379,3.363938,3.499644,3.467214,3.434785,3.401856,3.369426
5933,944.52,2010,2.455407e+06,2.447817e+06,2.440226e+06,2.432635e+06,2.425045e+06,2.451400e+06,2.443533e+06,2.435667e+06,...,2.280789,2.372091,2.463892,2.555194,2.646994,2.333674,2.418490,2.503306,2.588621,2.673437
5934,224.06,2010,2.457296e+06,2.449624e+06,2.441947e+06,2.434271e+06,2.426599e+06,2.453288e+06,2.445341e+06,2.437393e+06,...,2.211939,2.341657,2.470877,2.600096,2.729815,2.188489,2.312221,2.436451,2.560183,2.684413
5935,0.37,2010,2.464015e+06,2.456257e+06,2.448494e+06,2.440732e+06,2.432974e+06,2.459993e+06,2.451955e+06,2.443917e+06,...,1.235059,1.311393,1.387228,1.463563,1.539897,1.263996,1.338335,1.412174,1.486513,1.560852


We have removed the four non-usable attributes. Let's try the drop() method, which is much simpler.

In [9]:
data = pd.read_pickle('wind_pickle.pickle')

In [10]:
data = data.drop(['steps', 'month', 'day','hour'], axis=1)
data

Unnamed: 0,energy,year,p54.162.1,p54.162.2,p54.162.3,p54.162.4,p54.162.5,p54.162.6,p54.162.7,p54.162.8,...,v100.16,v100.17,v100.18,v100.19,v100.20,v100.21,v100.22,v100.23,v100.24,v100.25
0,402.71,2005,2.534970e+06,2.526864e+06,2.518754e+06,2.510648e+06,2.502537e+06,2.531111e+06,2.522721e+06,2.514330e+06,...,-4.683596,-4.545396,-4.407196,-4.268996,-4.131295,-4.669626,-4.528932,-4.388736,-4.248540,-4.107846
1,696.80,2005,2.537369e+06,2.529277e+06,2.521184e+06,2.513088e+06,2.504995e+06,2.533465e+06,2.525088e+06,2.516716e+06,...,-3.397886,-3.257192,-3.115998,-2.975304,-2.834609,-3.396390,-3.254198,-3.112506,-2.970314,-2.828622
2,1591.15,2005,2.533727e+06,2.525703e+06,2.517678e+06,2.509654e+06,2.501629e+06,2.529801e+06,2.521496e+06,2.513187e+06,...,-1.454105,-1.296447,-1.138290,-0.980134,-0.822476,-1.459094,-1.302933,-1.147271,-0.991110,-0.834949
3,1338.62,2005,2.534491e+06,2.526548e+06,2.518609e+06,2.510670e+06,2.502732e+06,2.530569e+06,2.522346e+06,2.514127e+06,...,1.255015,1.370265,1.485515,1.600765,1.716015,1.210612,1.319376,1.428140,1.536405,1.645169
4,562.50,2005,2.529543e+06,2.521623e+06,2.513702e+06,2.505782e+06,2.497861e+06,2.525621e+06,2.517421e+06,2.509215e+06,...,1.939031,2.023847,2.108663,2.193977,2.278793,1.873673,1.953000,2.031829,2.111157,2.189986
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5932,211.78,2010,2.450279e+06,2.442801e+06,2.435324e+06,2.427846e+06,2.420368e+06,2.446240e+06,2.438486e+06,2.430738e+06,...,3.473201,3.445761,3.418819,3.391379,3.363938,3.499644,3.467214,3.434785,3.401856,3.369426
5933,944.52,2010,2.455407e+06,2.447817e+06,2.440226e+06,2.432635e+06,2.425045e+06,2.451400e+06,2.443533e+06,2.435667e+06,...,2.280789,2.372091,2.463892,2.555194,2.646994,2.333674,2.418490,2.503306,2.588621,2.673437
5934,224.06,2010,2.457296e+06,2.449624e+06,2.441947e+06,2.434271e+06,2.426599e+06,2.453288e+06,2.445341e+06,2.437393e+06,...,2.211939,2.341657,2.470877,2.600096,2.729815,2.188489,2.312221,2.436451,2.560183,2.684413
5935,0.37,2010,2.464015e+06,2.456257e+06,2.448494e+06,2.440732e+06,2.432974e+06,2.459993e+06,2.451955e+06,2.443917e+06,...,1.235059,1.311393,1.387228,1.463563,1.539897,1.263996,1.338335,1.412174,1.486513,1.560852


Again, we check the result and confirm that this method gives the same dataframe.
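The comparison between the two removal procedures can also be done programmatically instead of by eye. The following is a minimal sketch on a small made-up dataframe (the values and the single attribute column are assumptions for illustration, not the wind data); it only checks that the mask-based function and drop() return the same result.

```python
import pandas as pd

# Toy frame standing in for the wind data (hypothetical values)
toy = pd.DataFrame({
    "energy": [402.71, 696.80],
    "steps": [0, 6],
    "year": [2005, 2005],
    "month": [1, 1],
    "day": [2, 3],
    "hour": [18, 0],
    "p54.162.1": [2.53e6, 2.54e6],
})

def drop_column(X, string):
    # Boolean mask over the column labels: keep everything except `string`
    return X.iloc[:, X.columns != string]

unused = ["steps", "month", "day", "hour"]

# Procedure 1: remove the columns one by one with the helper function
by_mask = toy
for name in unused:
    by_mask = drop_column(by_mask, name)

# Procedure 2: remove them all at once with drop()
by_drop = toy.drop(columns=unused)

same = by_mask.equals(by_drop)
print(list(by_drop.columns), same)
```

Using DataFrame.equals avoids comparing the printed tables visually, which is useful when most columns are hidden behind the "..." of the display.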

### Second part

First we set the seed to ensure reproducibility.

In [11]:
import numpy as np 
import random
my_NIA = 100441742 
random.seed(my_NIA)

We have to choose at random 10% of the columns of the data set (excluding the first two variables, energy and year). We label the number of columns to select as n. Then we have to replace at random 5% of the values of each of these columns with a missing value (represented in Python as np.nan). We label this number as m, computed as 5% of the observations of each column (i.e. 5% of the total number of observations).

In [12]:
n = int(len(list(data)[2:])*0.1)
m = round(data.shape[0]*0.05) # 5% of the rows is 296.85, so rounding gives m = 297 (slightly more than 5%).

At this point, we have the number of columns to select (n = 55) and the number of observations in each of them to replace by a missing value (m = 297). We now draw the 55 column names at random from the list of labels (excluding the first two, which correspond to the variables energy and year).

In [13]:
from random import sample
from random import randint

random.seed(my_NIA)
rannames = sample(list(data)[2:],k = n)

for i in range(len(rannames)):
    random.seed(my_NIA*i)
    randrows = random.sample(range(data.shape[0]),m)
    for row in range(m):
        data.loc[randrows[row],rannames[i]] = np.nan 

With this procedure, we iterate over the 55 randomly selected columns. For each column, we draw a random sample of m = 297 row indices without replacement. We set a seed before each draw to keep the results reproducible, using a different seed for each column so that the same rows are not selected every time. We then set the selected 5% of the elements of that column to missing values. Afterwards, it is worth checking the number of missing values introduced in the data set: it has to be 55 x 297 = 16335, which is the result we obtain.

In [14]:
print('The number of missing values is: {0}'.format(data.isna().sum().sum()))

The number of missing values is: 16335
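The same idea can be written with a single NumPy random generator instead of reseeding for every column. This is only a sketch on synthetic data: the frame size, the column names a0..a19 and the seed 0 are assumptions chosen for illustration, and the generator is a different one from the random module used above, so it would not reproduce the exact cells selected in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # hypothetical seed, for illustration only

# Toy frame: the two fixed columns plus 20 attributes, 200 rows
toy = pd.DataFrame(
    rng.normal(size=(200, 22)),
    columns=["energy", "year"] + [f"a{i}" for i in range(20)],
)

attributes = toy.columns[2:]
n = int(len(attributes) * 0.1)   # 10% of the attribute columns
m = round(toy.shape[0] * 0.05)   # 5% of the rows

# Draw the columns once, then an independent set of rows for each column
chosen = rng.choice(attributes, size=n, replace=False)
for name in chosen:
    rows = rng.choice(toy.index, size=m, replace=False)
    toy.loc[rows, name] = np.nan

n_missing = int(toy.isna().sum().sum())
print(n, m, n_missing)
```

Because both draws are made without replacement, the total number of missing values is exactly n times m, which is the same check performed above with 55 x 297.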


After introducing the missing values, we create three partitions of the data set. This is not done at random, since the data correspond to consecutive time steps. If we partitioned at random, the model could be trained on data very similar to the data later used to validate it, which would lead to overestimated measures of model quality. Therefore, we are asked to form the partitions from separate years: 2005-2006 for training, 2007-2008 for validation and 2009-2010 for testing. After forming the three partitions, we remove the variable year, which is no longer useful.

In [15]:
condition1 = (data.year == 2005)|(data.year == 2006)
train = data.loc[condition1]

condition2 = (data.year == 2007)|(data.year == 2008)
valpart = data.loc[condition2]

condition3 = (data.year == 2009)|(data.year == 2010)
test = data.loc[condition3]

In [16]:
train = train.drop(['year'], axis='columns')

test = test.drop(['year'], axis='columns')

valpart = valpart.drop(['year'], axis='columns')

In [17]:
print("The dimensions of the training partition are: {0}".format(train.shape))

print("The dimensions of the testing partition are: {0}".format(test.shape))

print("The dimensions of the validation partition are: {0}".format(valpart.shape))

The dimensions of the training partition are: (2528, 551)
The dimensions of the testing partition are: (2110, 551)
The dimensions of the validation partition are: (1299, 551)
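The year-based partition can be expressed more compactly with isin(). The sketch below uses a made-up six-row frame, and partition_by_years is a hypothetical helper name that does not appear in the notebook.

```python
import pandas as pd

# Toy frame with one observation per year (hypothetical values)
toy = pd.DataFrame({
    "energy": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "year":   [2005, 2006, 2007, 2008, 2009, 2010],
})

def partition_by_years(df, years):
    """Rows whose year is in `years`, with the year column removed."""
    return df.loc[df["year"].isin(years)].drop(columns="year")

train_toy = partition_by_years(toy, [2005, 2006])
val_toy = partition_by_years(toy, [2007, 2008])
test_toy = partition_by_years(toy, [2009, 2010])

print(train_toy.shape, val_toy.shape, test_toy.shape)
```

Selecting and dropping in one step keeps the three partitions consistent, and the row indices of the three results are disjoint by construction.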


## Model selection and hyper-parameter tuning

### Part 1: KNN

We begin this part by creating the training and testing sets from our data. As mentioned before, the sets are not formed at random but separated by year, so we only need to separate the response variable from the attributes. The response is "y" and the set of attributes is "X". We do this for both the training and the testing sets, which are the ones used in this section (where we do not tune any hyper-parameters).

In [18]:
X_train = train.iloc[:,1:]
y_train = train.iloc[:,0]
X_test = test.iloc[:,1:]
y_test = test.iloc[:,0]

At this point, we define the steps of our pipeline. First, we impute the missing values (introduced in the first part of the assignment) using the 'mean' strategy: the imputer computes the mean of each attribute on the training data, stores it, and substitutes the missing values with it. Then we scale the attributes by standardization (subtracting the mean and dividing by the standard deviation); the scaler likewise stores the means and standard deviations it learns from the imputed training data, so the same transformation is later applied to the test data. Once the pre-processing is defined, we fit a regression model with the KNN method. We define each of these steps and combine them into a pipeline.

In [19]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()
knn = KNeighborsRegressor()

knn_pipe = Pipeline([('imputer', imputer),('scaler',scaler),('knn_regression',knn)])
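To make the role of the training data in the pre-processing explicit, the following sketch fits only the imputer and the scaler on a tiny made-up matrix and then transforms an unseen row. The arrays and the name prep are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny training matrix with one missing value, and one unseen row
X_tr = np.array([[1.0, 10.0],
                 [3.0, np.nan],
                 [5.0, 30.0]])
X_te = np.array([[np.nan, 20.0]])

prep = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                 ("scaler", StandardScaler())])
prep.fit(X_tr)

# Column means learned from the training data only
train_means = prep["imputer"].statistics_

# The unseen row is imputed and scaled with the training statistics
X_te_t = prep.transform(X_te)
print(train_means, X_te_t)
```

The missing value in the unseen row is filled with the training mean of its column, so after standardization the row sits at the centre of the training data.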

We are ready to fit the pipeline with the training data and predict the values of the response variable (energy) for the characteristics of the observations left as testing set.

In [20]:
knn_pipe.fit(X_train, y_train)
y_test_pred0 = knn_pipe.predict(X_test)

from sklearn import metrics
print("The MAE for knn without tuning is = ",round(metrics.mean_absolute_error(y_test, y_test_pred0),4))

The MAE for knn without tuning is =  330.4147
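For reference, the MAE reported above is the average of the absolute differences between observed and predicted values. A minimal sketch with made-up numbers (not taken from the wind data), where mae is a hypothetical helper written only to show the formula:

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |observed - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

observed = [400.0, 700.0, 1600.0]    # hypothetical energy values
predicted = [500.0, 600.0, 1500.0]   # hypothetical predictions

error = mae(observed, predicted)
print(error)
```

The result is expressed in the same units as the response variable, which makes the values for the different methods directly comparable.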


### Part 1: SVM

For SVM, we will use the same pre-process part of the pipeline, since this method also requires scaling and imputation of missing values. However, we change the method of fitting the regression model: now we use SVM instead of KNN.

In [21]:
from sklearn.svm import SVR

SVM = SVR()
svm_pipe = Pipeline([('imputer', imputer),('scaler',scaler),('SVM',SVM)])
svm_pipe.fit(X_train, y_train)
y_test_pred1 = svm_pipe.predict(X_test)

print("The MAE for SVM without tuning is = ",round(metrics.mean_absolute_error(y_test, y_test_pred1),4))

The MAE for SVM without tuning is =  503.1826


We observe quite a high MAE in comparison with the KNN method. SVM is more complex than KNN, which may be one reason why KNN works better without tuning: being more complex means having more hyper-parameters to tune, and leaving them at their default values is likely to hurt more for methods with more hyper-parameters.

### Part 1: Trees

For this part, since we do not need to scale, we will impute the missing values and fit a regression model with the decision trees method directly.

In [22]:
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor()

tree_pipe = Pipeline([('imputer', imputer),('tree',tree)])
tree_pipe.fit(X_train, y_train)
y_test_pred2 = tree_pipe.predict(X_test)

print("The MAE for Trees without tuning is = ",round(metrics.mean_absolute_error(y_test, y_test_pred2),4))

The MAE for Trees without tuning is =  392.4143
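The reason scaling is not needed for trees is that each split compares a single attribute with a threshold, so rescaling an attribute does not change which observations fall on each side. The sketch below illustrates this on synthetic data (the data, the depth limit and the seeds are assumptions for illustration): a tree fitted on raw attributes and one fitted on standardized attributes are expected to give the same predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Four attributes on very different scales (synthetic data)
X = rng.normal(size=(80, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = X[:, 0] - 0.002 * X[:, 3] + rng.normal(scale=0.1, size=80)

X_scaled = StandardScaler().fit_transform(X)

# Same settings and seed for both trees; only the attribute scale differs
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y)

pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)
same_predictions = np.allclose(pred_raw, pred_scaled)
print(same_predictions)
```

KNN and SVM, in contrast, rely on distances between observations, which is why the scaler is kept in their pipelines.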


### Part 2: KNN

First we have to create the X and the y of the validation partition that we separated previously.

In [23]:
X_valpart = valpart.iloc[:,1:]
y_valpart = valpart.iloc[:,0]

Second, we define the pipeline to be used. It is the same as before with one difference: we no longer fix the imputation strategy ('mean' or 'median') but leave it as a hyper-parameter to be tuned.

In [24]:
imputer = SimpleImputer()
scaler = StandardScaler()
knn = KNeighborsRegressor()

knn_pipe = Pipeline([('imputer', imputer),('scaler',scaler),('knn_regression',knn)])
knn_pipe.get_params() # we check the default parameters in order to check how they must be referred to

{'memory': None,
 'steps': [('imputer', SimpleImputer()),
  ('scaler', StandardScaler()),
  ('knn_regression', KNeighborsRegressor())],
 'verbose': False,
 'imputer': SimpleImputer(),
 'scaler': StandardScaler(),
 'knn_regression': KNeighborsRegressor(),
 'imputer__add_indicator': False,
 'imputer__copy': True,
 'imputer__fill_value': None,
 'imputer__missing_values': nan,
 'imputer__strategy': 'mean',
 'imputer__verbose': 0,
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'knn_regression__algorithm': 'auto',
 'knn_regression__leaf_size': 30,
 'knn_regression__metric': 'minkowski',
 'knn_regression__metric_params': None,
 'knn_regression__n_jobs': None,
 'knn_regression__n_neighbors': 5,
 'knn_regression__p': 2,
 'knn_regression__weights': 'uniform'}

At this point, we need to define the search space. For this purpose, we listed the parameters of the pipeline above, since we must know how they are named. We are particularly interested in 'knn_regression__n_neighbors' (the number of neighbors of the KNN method) and 'imputer__strategy' (whether the mean or the median is used to impute the missing values).

In [25]:
param_grid = {'imputer__strategy':['mean','median'],'knn_regression__n_neighbors': [2,3,4,5,6,7,8,9,10,11]}

Before defining the search for inner validation (tuning of the hyper-parameters), we must divide the validation partition into a training set and a testing set. We include 2/3 of the observations for training and the rest for testing.

In [26]:
from sklearn.model_selection import PredefinedSplit

validation_indices = np.zeros(X_valpart.shape[0])
validation_indices[:round(2/3*X_valpart.shape[0])] = -1

tr_valpart = PredefinedSplit(validation_indices)
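PredefinedSplit reads its array position by position: an entry of -1 keeps that row in the training part of every split, and an entry of 0 places it in the single validation fold. The array therefore needs one entry per row of the data later passed to fit. A common arrangement, shown here only as a sketch with hypothetical sizes, is to stack the training and validation partitions and mark which rows belong to each.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

n_train, n_val = 6, 3  # hypothetical partition sizes

# -1: always used for training; 0: member of the single validation fold
test_fold = np.concatenate([np.full(n_train, -1), np.zeros(n_val, dtype=int)])

split = PredefinedSplit(test_fold)
folds = list(split.split())
train_idx, val_idx = folds[0]

print(split.get_n_splits(), train_idx, val_idx)
```

With this layout the search would be fitted on the stacked data (for example, the training rows followed by the validation rows), so that the indices produced by the split refer to the intended observations.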

Now we are ready to set up the search for inner validation. We pass the pipeline, the search space, the scoring method, the cross-validation split (the 2/3 split defined above) and other parameters as we did in the theory lessons. The scoring method is the negative MAE: a score is something to be maximized, whereas our metric is an error that should be minimized, so we must use its negative.

In [27]:
from sklearn.model_selection import RandomizedSearchCV

knn_grid = RandomizedSearchCV(knn_pipe, param_grid,
                              scoring = 'neg_mean_absolute_error',
                              cv = tr_valpart, 
                              n_jobs=2, verbose=1)

We are ready to fit the model with the best parameters.

In [28]:
random.seed(my_NIA)
knn_grid = knn_grid.fit(X_train, y_train)

y_test_pred3 = knn_grid.predict(X_test)

Fitting 1 folds for each of 10 candidates, totalling 10 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:    3.5s finished


In [29]:
knn_grid.best_params_

{'knn_regression__n_neighbors': 6, 'imputer__strategy': 'median'}

We calculate the MAE for the tuned model.

In [30]:
print("The MAE for knn tuning the hyper-parameters is = ",round(metrics.mean_absolute_error(y_test, y_test_pred3),4))

The MAE for knn tuning the hyper-parameters is =  327.307


We have reduced the MAE by tuning the hyper-parameters of the model.
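RandomizedSearchCV draws its candidates with NumPy's random machinery, which is controlled by its random_state argument and not by the seed of Python's random module. The following sketch, on synthetic data and with an arbitrary seed of 0 (both are assumptions for illustration), shows that fixing random_state makes two searches evaluate the same candidates.

```python
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the wind data
X, y = make_regression(n_samples=120, n_features=8, noise=5.0, random_state=0)

pipe = Pipeline([("imputer", SimpleImputer()),
                 ("scaler", StandardScaler()),
                 ("knn_regression", KNeighborsRegressor())])

grid = {"imputer__strategy": ["mean", "median"],
        "knn_regression__n_neighbors": list(range(2, 12))}

def run_search(seed):
    # n_iter candidates are sampled from the 20 combinations in `grid`
    search = RandomizedSearchCV(pipe, grid, n_iter=5,
                                scoring="neg_mean_absolute_error",
                                cv=3, random_state=seed)
    return search.fit(X, y)

first = run_search(0)
second = run_search(0)

same_candidates = first.cv_results_["params"] == second.cv_results_["params"]
print(same_candidates, first.best_params_)
```

Without a fixed random_state, each execution may sample a different subset of the search space, which is one way the selected hyper-parameters can change from run to run.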

### Part 2: SVM

In [31]:
imputer = SimpleImputer()
scaler = StandardScaler()
SVM = SVR()

svm_pipe = Pipeline([('imputer', imputer),('scaler',scaler),('SVM',SVM)])
svm_pipe.get_params() 

{'memory': None,
 'steps': [('imputer', SimpleImputer()),
  ('scaler', StandardScaler()),
  ('SVM', SVR())],
 'verbose': False,
 'imputer': SimpleImputer(),
 'scaler': StandardScaler(),
 'SVM': SVR(),
 'imputer__add_indicator': False,
 'imputer__copy': True,
 'imputer__fill_value': None,
 'imputer__missing_values': nan,
 'imputer__strategy': 'mean',
 'imputer__verbose': 0,
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'SVM__C': 1.0,
 'SVM__cache_size': 200,
 'SVM__coef0': 0.0,
 'SVM__degree': 3,
 'SVM__epsilon': 0.1,
 'SVM__gamma': 'scale',
 'SVM__kernel': 'rbf',
 'SVM__max_iter': -1,
 'SVM__shrinking': True,
 'SVM__tol': 0.001,
 'SVM__verbose': False}

After doing a bit of research, we have found that the hyper-parameters to tune in SVM are: the **kernel**, which can be Gaussian (rbf), polynomial or sigmoid (these are the functions that map the low-dimensional space into a high-dimensional one); the **C parameter**, which is the penalty parameter for regularisation; and the **gamma parameter**, which defines how far the influence of a single training observation reaches. The grid below is a typical one. We ran the random search for the SVM several times and found large differences depending on whether the search picked the 'rbf' kernel or one of the others: the MAE was around 310 in the first case and more than 400 in the second. Therefore, after this visual inspection, we keep only the rbf kernel.

In [32]:
param_grid = {'imputer__strategy':['mean','median'],
              'SVM__C': [0.1,1, 10, 100],
              'SVM__gamma':[1,0.1,0.01,0.001]}

svm_pipe = svm_pipe.set_params(**{'SVM__kernel':'rbf'}) # After having considered several trials

# 'SVM__kernel':('rbf', 'poly', 'sigmoid') --> For the first visual inspection.
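To give a feeling for what gamma controls, recall that the rbf kernel between two observations is exp(-gamma * squared distance). The sketch below evaluates it for the four gamma values of the search space on two made-up points (the points are an assumption for illustration) and compares the hand computation with scikit-learn's rbf_kernel.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 1.0]])  # squared Euclidean distance to x is 2

gammas = [1, 0.1, 0.01, 0.001]  # the gamma values in the search space

# Kernel value computed from the formula and with scikit-learn
by_hand = [float(np.exp(-g * np.sum((x - z) ** 2))) for g in gammas]
by_sklearn = [float(rbf_kernel(x, z, gamma=g)[0, 0]) for g in gammas]

for g, k in zip(gammas, by_hand):
    print(f"gamma={g}: similarity={k:.4f}")
```

Smaller values of gamma give a similarity closer to 1 for the same pair of points, so each training observation influences a wider region.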

In [33]:
svm_grid = RandomizedSearchCV(svm_pipe,param_grid,
                              scoring='neg_mean_absolute_error',
                              cv = tr_valpart , 
                              n_jobs=1, verbose=1)

svm_grid = svm_grid.fit(X_train, y_train)
y_test_pred4 = svm_grid.predict(X_test)

Fitting 1 folds for each of 10 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    6.8s finished


In [34]:
print("The MAE for SVMs tuning the hyper-parameters is = ", round(metrics.mean_absolute_error(y_test, y_test_pred4),4))

The MAE for SVMs tuning the hyper-parameters is =  479.0681


In [35]:
svm_grid.best_params_

{'imputer__strategy': 'median', 'SVM__gamma': 0.01, 'SVM__C': 10}

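In this run the tuned SVM improves on the untuned one (MAE of 479.07 versus 503.18), although it remains above the tuned KNN (327.31). The random search samples only 10 of the 32 combinations in the search space and is not seeded, so the outcome changes between runs; in the trials mentioned above, the rbf kernel reached MAEs of around 310, which is better than the result for KNN. The gain from tuning agrees with what we commented previously: SVM has more hyper-parameters, so its results without tuning them are expected to be worse.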

### Part 2: Trees 

In [36]:
imputer = SimpleImputer()
tree = DecisionTreeRegressor()

tree_grid = Pipeline([('imputer', imputer),('tree_regression',tree)])
tree_grid.get_params() # we check the default parameters in order to check how they must be referred to

{'memory': None,
 'steps': [('imputer', SimpleImputer()),
  ('tree_regression', DecisionTreeRegressor())],
 'verbose': False,
 'imputer': SimpleImputer(),
 'tree_regression': DecisionTreeRegressor(),
 'imputer__add_indicator': False,
 'imputer__copy': True,
 'imputer__fill_value': None,
 'imputer__missing_values': nan,
 'imputer__strategy': 'mean',
 'imputer__verbose': 0,
 'tree_regression__ccp_alpha': 0.0,
 'tree_regression__criterion': 'mse',
 'tree_regression__max_depth': None,
 'tree_regression__max_features': None,
 'tree_regression__max_leaf_nodes': None,
 'tree_regression__min_impurity_decrease': 0.0,
 'tree_regression__min_impurity_split': None,
 'tree_regression__min_samples_leaf': 1,
 'tree_regression__min_samples_split': 2,
 'tree_regression__min_weight_fraction_leaf': 0.0,
 'tree_regression__presort': 'deprecated',
 'tree_regression__random_state': None,
 'tree_regression__splitter': 'best'}

In this pipeline, we have to tune the imputation strategy and the maximum depth of the tree. In addition, we tune the minimum number of samples required to split a node.

In [37]:
param_grid = {'imputer__strategy':['mean','median'],
              'tree_regression__max_depth': range(2,16,2),
              'tree_regression__min_samples_split': range(2,34,2)}
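It is useful to know how large each search space is, because the random search only evaluates 10 candidates by default. A short sketch that counts the combinations of the three search spaces used in this section with ParameterGrid (the dictionary name spaces is an assumption for illustration):

```python
from sklearn.model_selection import ParameterGrid

spaces = {
    "knn": {"imputer__strategy": ["mean", "median"],
            "knn_regression__n_neighbors": list(range(2, 12))},
    "svm": {"imputer__strategy": ["mean", "median"],
            "SVM__C": [0.1, 1, 10, 100],
            "SVM__gamma": [1, 0.1, 0.01, 0.001]},
    "tree": {"imputer__strategy": ["mean", "median"],
             "tree_regression__max_depth": list(range(2, 16, 2)),
             "tree_regression__min_samples_split": list(range(2, 34, 2))},
}

# Number of distinct hyper-parameter combinations in each space
sizes = {name: len(ParameterGrid(space)) for name, space in spaces.items()}

n_iter = 10  # default number of candidates in RandomizedSearchCV
for name, size in sizes.items():
    print(f"{name}: {size} combinations, {n_iter / size:.0%} explored")
```

The tree space is by far the largest, so a search with 10 candidates covers only a small part of it.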

In [38]:
tree_grid = RandomizedSearchCV(tree_grid,param_grid,
                             scoring = 'neg_mean_absolute_error',
                             cv = tr_valpart, 
                             n_jobs = 1, verbose = 1)

tree_grid = tree_grid.fit(X_train , y_train)
y_test_pred5 = tree_grid.predict(X_test)

Fitting 1 folds for each of 10 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    2.5s finished


In [39]:
tree_grid.best_params_

{'tree_regression__min_samples_split': 20,
 'tree_regression__max_depth': 4,
 'imputer__strategy': 'median'}

In [40]:
print("The MAE for Trees tuning the hyper-parameters is = ", round(metrics.mean_absolute_error(y_test, y_test_pred5),4))

The MAE for Trees tuning the hyper-parameters is =  349.7526


After several trials of the random search, we have found quite different results for the best hyper-parameters. However, the MAE does not vary much from 340, which again is an improvement with respect to the regression with the default parameters.

In summary, we have **improved** the three regression methods **by tuning the hyper-parameters**. In the run shown above, the MAE goes from 330.41 to 327.31 for KNN, from 503.18 to 479.07 for SVM and from 392.41 to 349.75 for trees. The improvement tends to be larger for the methods with more hyper-parameters to tune: KNN, whose only hyper-parameter of its own is the number of neighbours, shows the smallest improvement in MAE. In our trials, the **best method after tuning the hyper-parameters is apparently SVM**, which reached an MAE of around 310 with the rbf kernel; since the random search is not seeded, an individual run such as the one displayed above can give a different ranking. Even though we have not yet studied the particular behavior of SVM, we know it is the most complex of the three methods used in the exercise, so this result is not surprising to us.

## Attribute selection

For this section, we only need to use KNN. In the previous section we tuned the hyper-parameters of the KNN pipeline: one relative to the imputation step and the other to the number of neighbors of the KNN method itself. In particular, we found that the 'mean' strategy was the best imputation method and that the most appropriate number of neighbors was 8 (the run displayed above selected 'median' and 6 neighbors, since the random search varies between executions). In this exercise we tune the number of neighbors again, so we only fix the **imputation method to be the mean**.

### First pipeline: SelectKBest and regression with KNN

In [41]:
from sklearn.feature_selection import SelectKBest, f_regression


imputer = SimpleImputer(strategy = 'mean') # we define the imputer with the fixed method
scaler = StandardScaler() # we define the scaler method, necessary for KNN
selector = SelectKBest(f_regression) # We define the feature selection method
knn_regression = KNeighborsRegressor() # Finally we define the fitting method


fea_sel_pipe = Pipeline([('imputer',imputer),
                         ('scaler',scaler),
                        ('selector',selector),
                        ('knn_regressor',knn_regression)])

In [42]:
fea_sel_pipe.fit(X_train,y_train)

Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
                ('selector',
                 SelectKBest(score_func=<function f_regression at 0x000000AF09CAE048>)),
                ('knn_regressor', KNeighborsRegressor())])

In [43]:
y_test_pred6 = fea_sel_pipe.predict(X_test)

In [44]:
print("The MAE for KNN with feature selection and without tuning the hyper-parameters is = ",
      round(metrics.mean_absolute_error(y_test, y_test_pred6),4))

The MAE for KNN with feature selection and without tuning the hyper-parameters is =  538.0683


First, we train the pipeline without tuning the hyper-parameters. This is rarely advisable, since the default values are not normally appropriate for every data set. In our case, we have 551 variables (550 attributes), and from those 550 attributes the method has selected just 10 (the default value of k). This is a reduction of about 98% in the number of variables used to train the model, so we might be losing important information instead of removing unimportant characteristics (which is the main purpose of feature selection). We observe a huge rise from the MAE for KNN without feature selection (roughly 330) to the MAE for KNN with feature selection (around 540), both computed without tuning the hyper-parameters. Of course, **variable selection is not convenient when we do not specify the proper number of variables to be selected**.

In [45]:
selected = fea_sel_pipe['selector'].get_support()

# Positions (within the columns of X_train) of the features kept by SelectKBest
selected_col = [i for i, keep in enumerate(selected) if keep]

selected_col

[75, 76, 77, 78, 80, 81, 82, 85, 86, 90]

Here, we have extracted the indexes of the variables selected by this method without tuning the hyper-parameters.
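The selector can also return the positions directly: get_support accepts indices=True. A minimal sketch on synthetic data (the data set is an assumption for illustration), which also confirms that the default number of selected features is 10:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 30 attributes, of which 5 are informative
X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       random_state=0)

selector = SelectKBest(f_regression).fit(X, y)

mask = selector.get_support()                 # boolean mask, one entry per attribute
indices = selector.get_support(indices=True)  # integer positions of the kept attributes

print(selector.k, list(indices))
```

Both forms describe the same selection; the integer form is convenient for looking up the column names, for example with X_train.columns[indices] when the input is a dataframe.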

### Second pipeline: PCA with KNN

In [46]:
from sklearn.decomposition import PCA

imputer = SimpleImputer(strategy = 'mean') # we define the imputer with the fixed method
scaler = StandardScaler() # we define the scaler method, necessary for KNN 
pca = PCA() # We define the Principal Component Analysis method
knn_regression = KNeighborsRegressor() # Finally we define the fitting method

PCA_pipe = Pipeline([('imputer',imputer),
                         ('scaler',scaler),
                        ('pca',pca),
                        ('knn_regressor',knn_regression)])

In [47]:
PCA_pipe.fit(X_train,y_train)

Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
                ('pca', PCA()), ('knn_regressor', KNeighborsRegressor())])

In [48]:
y_test_pred7 = PCA_pipe.predict(X_test)

In [49]:
print("The MAE for KNN with PCA and without tuning the hyper-parameters is = ",
      round(metrics.mean_absolute_error(y_test, y_test_pred7),4))

The MAE for KNN with PCA and without tuning the hyper-parameters is =  330.4147
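This MAE coincides with the one obtained by KNN without PCA (330.4147). A plausible explanation is that, when all components are kept, PCA only centres and rotates the data, which leaves the Euclidean distances used by KNN unchanged. The sketch below checks this on synthetic data (not on the wind data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))  # synthetic data: 50 observations, 5 attributes

# No n_components given: all 5 components are kept
Z = PCA().fit_transform(X)

same_shape = Z.shape == X.shape
same_distances = np.allclose(pairwise_distances(X), pairwise_distances(Z))
print(same_shape, same_distances)
```

Since the neighbours of every observation are the same before and after the transformation, the KNN predictions are expected to be the same as well.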


In this case, we have defined a pipeline with imputation (using the mean, as in the previous pipeline), scaling (necessary for KNN), principal component analysis without selecting a particular number of PCs, and fitting with the KNN method. The MAE we have obtained is exactly the same as the one we got when using KNN without PCA. This is consistent with PCA keeping all the PCs when no number of components is set. We now study the proportion of variance explained by the PCs, in order to do a "parameter tuning" manually.

In [50]:
sum(PCA_pipe['pca'].explained_variance_ratio_[0:12]).round(4)

0.9831

The first 12 PCs explain roughly 98% of the variance. This check could be used to fix a good value for the number of PCs by hand, without tuning it automatically. Let's run the code once again without tuning the number of PCs, selecting 3 of them (a usual number of PCs).

In [51]:
imputer = SimpleImputer(strategy = 'mean') # we define the imputer with the fixed method
scaler = StandardScaler() # we define the scaler method, necessary for KNN 
pca = PCA(n_components = 3) # We define the Principal Component Analysis method
knn_regression = KNeighborsRegressor() # Finally we define the fitting method

PCA_pipe2 = Pipeline([('imputer',imputer),
                         ('scaler',scaler),
                        ('pca',pca),
                        ('knn_regressor',knn_regression)])

In [52]:
PCA_pipe2.fit(X_train,y_train)
y_test_pred8 = PCA_pipe2.predict(X_test)

In [53]:
print("The MAE for KNN with PCA and without tuning the hyper-parameters (with 3 PCs) is = ",
      round(metrics.mean_absolute_error(y_test, y_test_pred8),4))

The MAE for KNN with PCA and without tuning the hyper-parameters (with 3 PCs) is =  365.2589


Although we have selected the first 3 PCs, which explain roughly 73% of the variance of the data, we observe a rise of the MAE with respect to using all PCs. This is expected, since we are training the model with less information.
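The manual choice of the number of PCs discussed above can be made systematic with the cumulative explained variance. The following is a sketch on synthetic correlated data; the data, the 98% target and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 20 correlated attributes generated from 3 hidden factors plus noise
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 20))

pca = PCA().fit(StandardScaler().fit_transform(X))

# cumulative[i] is the proportion of variance explained by the first i+1 PCs
cumulative = np.cumsum(pca.explained_variance_ratio_)

target = 0.98  # hypothetical threshold
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_needed, cumulative[:5].round(4))
```

The value n_needed is the smallest number of PCs whose cumulative explained variance reaches the target, and it could then be passed as n_components.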

### Third pipeline: feature selection and PCA with KNN

In [54]:
from sklearn.decomposition import PCA

imputer = SimpleImputer(strategy = 'mean') # we define the imputer with the fixed method
scaler = StandardScaler() # we define the scaler method, necessary for KNN 
selector = SelectKBest(f_regression) # We define the feature selection method
pca = PCA(n_components = 3) # We define the Principal Component Analysis method
knn_regression = KNeighborsRegressor() # Finally we define the fitting method

combined_pipe = Pipeline([('imputer',imputer),
                          ('scaler',scaler),
                          ('selector',selector),
                          ('pca',pca),
                          ('knn_regressor',knn_regression)])

In [55]:
combined_pipe.fit(X_train,y_train)
y_test_pred9 = combined_pipe.predict(X_test)

In [56]:
print("The MAE for KNN with PCA and FS without tuning the hyper-parameters (with 3 PCs) is = ",
      round(metrics.mean_absolute_error(y_test, y_test_pred9),4))

The MAE for KNN with PCA and FS without tuning the hyper-parameters (with 3 PCs) is =  538.1044


We find an MAE very similar to the one we found with the feature selection method alone. It is a bit higher since we only keep the first 3 PCs, so the model works with less information. Again, this result shows that the feature selection step needs parameter tuning.
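One way to see why the untuned selector hurts so much is to inspect the default values that the pipeline is actually using. The sketch below rebuilds a pipeline with the same step names and reads its parameters; `demo_pipe` is a hypothetical name used only here.

```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same steps and step names as the feature selection pipeline, all left untuned
demo_pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                      ('scaler', StandardScaler()),
                      ('selector', SelectKBest(f_regression)),
                      ('knn_regressor', KNeighborsRegressor())])

# Pipeline parameters are addressed as '<step name>__<parameter name>'
params = demo_pipe.get_params()
default_k = params['selector__k']
default_neighbors = params['knn_regressor__n_neighbors']
print(default_k, default_neighbors)  # 10 5
```

With the default `k = 10`, the selector keeps only 10 attributes before PCA or KNN ever see the data, which is a very small fraction of the attributes available here.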

### First hyper-parameter tuning: feature selection and KNN regression

First, we check the parameters that need to be tuned. The imputation method is fixed to the mean strategy, so we will not tune that parameter. The scaling method does not have any parameter to tune. The KBest selection method needs the parameter 'selector__k' (10 by default) to be tuned (in fact, we have observed this is crucial). Finally, the KNN regressor needs the number of neighbors to be tuned ('knn_regressor__n_neighbors'). Let's begin by defining the search space for those parameters.

In [57]:
imputer = SimpleImputer(strategy = 'mean') # we define the imputer with the fixed method
scaler = StandardScaler() # we define the scaler method, necessary for KNN
selector = SelectKBest(f_regression) # We define the feature selection method
knn_regression = KNeighborsRegressor() # Finally we define the fitting method

fea_sel_pipe = Pipeline([('imputer',imputer),
                        ('scaler',scaler),
                        ('selector',selector),
                        ('knn_regressor',knn_regression)])

# This is the initial search space. We comment below on how we have used it
# param_grid = {'selector__k': range(10,500,2),
#              'knn_regressor__n_neighbors': [2,3,4,5,6,7,8,9,10,11,12,13,14]}

# This is the second search space.
# param_grid = {'selector__k': range(410,500,1),
#              'knn_regressor__n_neighbors': [2,3,4,5,6,7,8,9,10,11,12,13,14]}

param_grid = {'selector__k': range(440,480,1),
              'knn_regressor__n_neighbors': [2,3,4,5,6,7,8,9,10,11,12,13,14]}

In [58]:
fea_sel_pipe = RandomizedSearchCV(fea_sel_pipe,param_grid,
                             scoring = 'neg_mean_absolute_error',
                             cv = tr_valpart, 
                             n_jobs = 1, verbose = 1)

fea_sel_pipe.fit(X_train , y_train)
y_test_pred10 = fea_sel_pipe.predict(X_test)

Fitting 1 folds for each of 10 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    3.5s finished


In [59]:
fea_sel_pipe.best_params_

{'selector__k': 473, 'knn_regressor__n_neighbors': 10}

In [60]:
print("The MAE for KNN with FS and tuning the hyper-parameters is = ",
      round(metrics.mean_absolute_error(y_test, y_test_pred10),4))

The MAE for KNN with FS and tuning the hyper-parameters is =  322.3671


We have run this procedure with the initial search space many times, and the number of features it selects is always above 410 and below 500. We therefore redefined the parameter grid above and left the initial one commented out. We also ran this second search space many times and found the best MAE for values of k between 440 and 480, so we fix this range as the third and last search space. At this point, we observe a reduction of the MAE with respect to all previous trials of the KNN method. In short, feature selection appears to be useful for the model.
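The reason repeated runs give different answers is that `RandomizedSearchCV` only evaluates a random sample of the search space (10 candidates by default, as the log above shows). A minimal sketch of how the sampling could be made reproducible is given below. It uses synthetic data and a hypothetical single train/validation split built with `PredefinedSplit`; the names `X_demo`, `y_demo`, `split` and `run_search` are assumptions for illustration, not objects from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=40, n_informative=10,
                                 noise=10.0, random_state=0)

# Hypothetical single split: -1 = always in training, 0 = validation fold
test_fold = np.r_[np.full(200, -1), np.zeros(100, dtype=int)]
split = PredefinedSplit(test_fold)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('selector', SelectKBest(f_regression)),
                 ('knn_regressor', KNeighborsRegressor())])

param_grid = {'selector__k': range(5, 41),
              'knn_regressor__n_neighbors': range(2, 15)}

def run_search(seed):
    """Randomized search with a fixed seed and an explicit number of candidates."""
    search = RandomizedSearchCV(pipe, param_grid, n_iter=20,
                                scoring='neg_mean_absolute_error',
                                cv=split, random_state=seed, n_jobs=1)
    return search.fit(X_demo, y_demo)

first = run_search(0)
second = run_search(0)
print(first.best_params_, first.best_params_ == second.best_params_)
```

Fixing `random_state` makes two runs sample the same candidates, and raising `n_iter` covers more of the grid at the cost of a longer search. Either choice makes it easier to compare search spaces than re-running an unseeded search many times.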

### Second hyper-parameter tuning: PCA and KNN regression

In this case, we need to tune two parameters: 'pca__n_components' and 'knn_regressor__n_neighbors'. They are, respectively, the number of PCs the model takes and the number of neighbors considered in the KNN method. We begin by defining the search space.

In [61]:
imputer = SimpleImputer(strategy = 'mean') # we define the imputer with the fixed method
scaler = StandardScaler() # we define the scaler method, necessary for KNN 
pca = PCA() # We define the Principal Component Analysis method
knn_regression = KNeighborsRegressor() # Finally we define the fitting method

PCA_pipe = Pipeline([('imputer',imputer),
                         ('scaler',scaler),
                        ('pca',pca),
                        ('knn_regressor',knn_regression)])

param_grid = {'pca__n_components': range(1,24,1),
              'knn_regressor__n_neighbors': [2,3,4,5,6,7,8,9,10,11,12,13,14]}

In [62]:
PCA_pipe = RandomizedSearchCV(PCA_pipe,param_grid,
                             scoring = 'neg_mean_absolute_error',
                             cv = tr_valpart, 
                             n_jobs = 1, verbose = 1)

PCA_pipe = PCA_pipe.fit(X_train , y_train)
y_test_pred11 = PCA_pipe.predict(X_test)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 1 folds for each of 10 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    1.0s finished


In [63]:
PCA_pipe.best_params_

{'pca__n_components': 20, 'knn_regressor__n_neighbors': 7}

In [64]:
print("The MAE for KNN with PCA and tuning the hyper-parameters is = ",
      round(metrics.mean_absolute_error(y_test, y_test_pred11),4))

The MAE for KNN with PCA and tuning the hyper-parameters is =  327.6659


Note that this pipeline is generally tuned to a high number of PCs. This is predictable, since PCA is usually run to save time: when we face a huge data set, with a large number of observations and features, PCA is almost necessary to be able to fit a model at all. However, if we choose the number of PCs by the negative mean absolute error, the best value tends to be close to the largest number tested (here 20, in a range that goes up to 23), since more PCs explain a greater part of the variance of the data set. Because we never keep all the PCs, we obtain worse results than with the last procedure (feature selection only).
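Whether a larger number of PCs really gives a better validation score can be checked directly from the `cv_results_` of a search. The sketch below does this on synthetic data with an exhaustive `GridSearchCV` over the number of components only; the data, the split and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=30, n_informative=10,
                                 noise=10.0, random_state=0)

# Hypothetical single split: -1 = always in training, 0 = validation fold
split = PredefinedSplit(np.r_[np.full(200, -1), np.zeros(100, dtype=int)])

pipe = Pipeline([('scaler', StandardScaler()),
                 ('pca', PCA()),
                 ('knn_regressor', KNeighborsRegressor())])

# Try every number of components from 1 to 23, as in the search space above
search = GridSearchCV(pipe, {'pca__n_components': range(1, 24)},
                      scoring='neg_mean_absolute_error', cv=split)
search.fit(X_demo, y_demo)

# One row per candidate; the score is negated back into a plain MAE
results = pd.DataFrame({
    'n_components': [p['pca__n_components'] for p in search.cv_results_['params']],
    'val_mae': -search.cv_results_['mean_test_score'],
}).sort_values('val_mae')
print(results.head())
```

Looking at the whole table, rather than only at `best_params_`, shows how flat or steep the validation MAE is as the number of PCs grows.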

### Third hyper-parameter tuning: feature selection, PCA and KNN regression 

In this part of the exercise, we need to tune 'knn_regressor__n_neighbors', 'pca__n_components' and 'selector__k'. We reuse the information obtained in previous sections: for instance, the narrowed range for the number of features, and the fact that the first 12 PCs explain 98% of the variance (so we search up to roughly double that number).

In [65]:
imputer = SimpleImputer(strategy = 'mean') # we define the imputer with the fixed method
scaler = StandardScaler() # we define the scaler method, necessary for KNN 
selector = SelectKBest(f_regression) # We define the feature selection method
pca = PCA() # We define the Principal Component Analysis method
knn_regression = KNeighborsRegressor() # Finally we define the fitting method

combined_pipe = Pipeline([('imputer',imputer),
                          ('scaler',scaler),
                          ('selector',selector),
                          ('pca',pca),
                          ('knn_regressor',knn_regression)])

param_grid = {'selector__k': range(440,480,1),
              'pca__n_components': range(1,24,1),
              'knn_regressor__n_neighbors': [2,3,4,5,6,7,8,9,10,11,12,13,14]}

In [66]:
combined_pipe = RandomizedSearchCV(combined_pipe,param_grid,
                             scoring = 'neg_mean_absolute_error',
                             cv = tr_valpart, 
                             n_jobs = 1, verbose = 1)

combined_pipe = combined_pipe.fit(X_train , y_train)
y_test_pred12 = combined_pipe.predict(X_test)

Fitting 1 folds for each of 10 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.8s finished


In [67]:
combined_pipe.best_params_

{'selector__k': 474, 'pca__n_components': 21, 'knn_regressor__n_neighbors': 10}

In [68]:
print("The MAE for KNN with FS and PCA and tuning the hyper-parameters is = ",
      round(metrics.mean_absolute_error(y_test, y_test_pred12),4))

The MAE for KNN with FS and PCA and tuning the hyper-parameters is =  321.4497


Note that we systematically get results similar to (or slightly above) those obtained in the hyper-parameter tuning section for KNN with feature selection only. This is predictable, since the only thing we add to this pipeline is a PCA step. As commented previously, the search tends to push this step towards a large number of PCs, that is, towards maximizing the variance explained. However, in the pipeline with feature selection only, we kept all the relevant information of the data set, whereas here we keep a high percentage of it but not all (since we do not consider all PCs).

To conclude, the **best results** in terms of mean absolute error are found for the **first pipeline**: fitting a regression method with KNN after doing feature selection with hyper-parameter tuning. In the previous section, without feature selection, we found results around 330 in the case of no hyper-parameter tuning and around 327 in the case of tuning the number of neighbors considered. In this section, including a feature selection process with the number of attributes tuned, we find a MAE between 322 and 325. Apparently, **there are some attributes that are not relevant** in our data set, and we **find some improvement over the results of the previous sections**.

In order to **get the attributes selected by the first pipeline** (feature selection) after tuning, we use the parameters that gave the best MAE across our runs of the previous search: 441 selected features and 8 neighbors. With these values, we get an MAE equal to 322.34. We therefore create a new pipeline and train it with the training set. Second, we predict the values and verify that the MAE is 322.34. Third, we obtain which columns are selected as a boolean mask. Fourth, we create an empty list and iterate over the elements of the mask (indices from 0 to 549, one per attribute, 550 in total). If the attribute is selected (True), the name of the column is saved in the list. Finally, we **print those names**.

In [69]:
imputer = SimpleImputer(strategy = 'mean') # we define the imputer with the fixed method
scaler = StandardScaler() # we define the scaler method, necessary for KNN
selector = SelectKBest(f_regression,k = 441) # We define the feature selection method
knn_regression = KNeighborsRegressor(n_neighbors = 8) # Finally we define the fitting method

fea_sel_pipe = Pipeline([('imputer',imputer),
                        ('scaler',scaler),
                        ('selector',selector),
                        ('knn_regressor',knn_regression)])

fea_sel_pipe.fit(X_train,y_train)
y_test_pred13 = fea_sel_pipe.predict(X_test)

print("The MAE for KNN with FS and the tuned hyper-parameters fixed (k = 441, 8 neighbors) is = ",
      round(metrics.mean_absolute_error(y_test, y_test_pred13),4))

selected = fea_sel_pipe['selector'].get_support() # Getting the boolean mask of selected columns.

selected_col = list()
for i in range(len(selected)):
    if selected[i]: # True when column i was kept by the selector
        selected_col.append(list(X_train)[i])

The MAE for KNN with FS and the tuned hyper-parameters fixed (k = 441, 8 neighbors) is =  322.3426


In [70]:
print("The names of the selected columns are: ",selected_col)

The names of the selected columns are:  ['p54.162.1', 'p54.162.2', 'p54.162.3', 'p54.162.4', 'p54.162.5', 'p54.162.6', 'p54.162.7', 'p54.162.8', 'p54.162.9', 'p54.162.10', 'p54.162.11', 'p54.162.12', 'p54.162.13', 'p54.162.14', 'p54.162.15', 'p54.162.16', 'p54.162.17', 'p54.162.18', 'p54.162.19', 'p54.162.20', 'p54.162.21', 'p54.162.22', 'p54.162.23', 'p54.162.24', 'p54.162.25', 'p59.162.1', 'p59.162.2', 'p59.162.3', 'p59.162.4', 'p59.162.5', 'p59.162.6', 'p59.162.7', 'p59.162.8', 'p59.162.9', 'p59.162.10', 'p59.162.11', 'p59.162.12', 'p59.162.13', 'p59.162.14', 'p59.162.15', 'p59.162.16', 'p59.162.17', 'p59.162.18', 'p59.162.19', 'p59.162.20', 'p59.162.21', 'p59.162.22', 'p59.162.23', 'p59.162.24', 'p59.162.25', 'lai_lv.1', 'lai_lv.2', 'lai_lv.3', 'lai_lv.4', 'lai_lv.5', 'lai_lv.6', 'lai_lv.7', 'lai_lv.8', 'lai_lv.9', 'lai_lv.10', 'lai_lv.11', 'lai_lv.12', 'lai_lv.13', 'lai_lv.14', 'lai_lv.15', 'lai_lv.16', 'lai_lv.17', 'lai_lv.18', 'lai_lv.19', 'lai_lv.20', 'lai_lv.21', 'lai_lv.22', 

This list of selected columns has 441 elements. We are going to check how many of them correspond to each of the 25 locations in our database. In this way we will know whether **the method tends to select attributes which belong to Sotavento** or not. Note that the Sotavento location corresponds to number 13.



In [116]:
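# Last two characters of each name: ".N" for locations 1-9, "NN" for locations 10-25
number = [t[-2:] for t in selected_col]
# Suffix that identifies each of the 25 locations (named so as not to shadow the built-in iter)
loc_suffixes = [".1",".2",".3",".4",".5",".6",".7",".8",".9",10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]
print([len([a for a in number if str(j) in a]) for j in loc_suffixes])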

[17, 17, 17, 18, 18, 17, 17, 17, 17, 18, 17, 17, 17, 17, 18, 17, 18, 18, 18, 19, 18, 18, 18, 19, 19]


As we can see, the number of columns that belong to each location **hardly varies from one location to another**. All values are between 17 and 19, that is, the attributes are selected almost uniformly across locations. Therefore, the location that interests us, Sotavento, cannot be said to be more relevant than the others.
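The same two steps (turning the selector mask into column names, then counting names per location) can also be sketched with a boolean index and a `Counter`. The example below runs on a small synthetic data frame, not on the wind data: the variable names `u10`, `v10`, `t2m`, the 5 locations and the target are all hypothetical, chosen only to mimic the 'variable.location' naming.

```python
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Hypothetical miniature data set: 3 variables x 5 locations, named 'variable.location'
columns = [f"{var}.{loc}" for var in ("u10", "v10", "t2m") for loc in range(1, 6)]
X_demo = pd.DataFrame(rng.normal(size=(100, len(columns))), columns=columns)

# Hypothetical target that depends only on two attributes of location 3
y_demo = 2.0 * X_demo["u10.3"] + X_demo["v10.3"] + rng.normal(scale=0.1, size=100)

selector = SelectKBest(f_regression, k=6).fit(X_demo, y_demo)

# The boolean mask indexes the column index directly, no explicit loop needed
selected_names = X_demo.columns[selector.get_support()]

# The location is whatever follows the last dot in the column name
location_counts = Counter(int(name.rsplit(".", 1)[1]) for name in selected_names)
counts_by_location = [location_counts.get(loc, 0) for loc in range(1, 6)]
print(list(selected_names))
print(counts_by_location)
```

Splitting on the last dot avoids relying on the last two characters of the name, so it would keep working if there were 100 or more locations.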

It may be that 441 is too high a value for the feature selection, taking into account that the maximum value is 550. Such a high value may prevent any location from standing out, so to finish this study we repeat the same process, this time forcing the selection method to keep only k = 100 variables. Of course, we will lose information, so we expect the MAE to be worse in this case.


In [2]:
# Imports required by this cell
from sklearn import metrics
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

imputer = SimpleImputer(strategy = 'mean') # we define the imputer with the fixed method
scaler = StandardScaler() # we define the scaler method, necessary for KNN
selector = SelectKBest(f_regression,k = 100) # We define the feature selection method
knn_regression = KNeighborsRegressor(n_neighbors = 8) # Finally we define the fitting method

fea_sel_pipe = Pipeline([('imputer',imputer),
                        ('scaler',scaler),
                        ('selector',selector),
                        ('knn_regressor',knn_regression)])

fea_sel_pipe.fit(X_train,y_train)
y_test_pred13 = fea_sel_pipe.predict(X_test)

print("The MAE for KNN with FS (k = 100, 8 neighbors) is = ",
      round(metrics.mean_absolute_error(y_test, y_test_pred13),4))

selected = fea_sel_pipe['selector'].get_support() # Getting the boolean mask of selected columns.

selected_col_proof = list()
for i in range(len(selected)):
    if selected[i]: # True when column i was kept by the selector
        selected_col_proof.append(list(X_train)[i])

In [139]:
print("The names of the selected columns are: ",selected_col_proof)

The names of the selected columns are:  ['p59.162.1', 'p59.162.2', 'p59.162.3', 'p59.162.4', 'p59.162.5', 'p59.162.6', 'p59.162.7', 'p59.162.8', 'p59.162.9', 'p59.162.10', 'p59.162.11', 'p59.162.12', 'p59.162.13', 'p59.162.14', 'p59.162.15', 'p59.162.16', 'p59.162.17', 'p59.162.18', 'p59.162.19', 'p59.162.20', 'p59.162.21', 'p59.162.22', 'p59.162.23', 'p59.162.24', 'p59.162.25', 'v10n.4', 'v10n.5', 'v10n.8', 'v10n.9', 'v10n.10', 'v10n.13', 'v10n.14', 'v10n.15', 'v10n.18', 'v10n.19', 'v10n.20', 'v10n.23', 'v10n.24', 'v10n.25', 'v10.5', 'v10.9', 'v10.10', 'v10.14', 'v10.15', 'v10.19', 'v10.20', 'v10.24', 'v10.25', 'iews.3', 'iews.4', 'iews.5', 'iews.8', 'iews.9', 'iews.10', 'iews.13', 'iews.14', 'iews.15', 'iews.19', 'iews.20', 'iews.24', 'iews.25', 'inss.1', 'inss.2', 'inss.3', 'inss.4', 'inss.5', 'inss.6', 'inss.7', 'inss.8', 'inss.9', 'inss.10', 'inss.11', 'inss.12', 'inss.13', 'inss.14', 'inss.15', 'inss.16', 'inss.17', 'inss.18', 'inss.19', 'inss.20', 'inss.21', 'inss.22', 'inss.23'

In [145]:
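# Last two characters of each name: ".N" for locations 1-9, "NN" for locations 10-25
number = [t[-2:] for t in selected_col_proof]
# Suffix that identifies each of the 25 locations (named so as not to shadow the built-in iter)
loc_suffixes = [".1",".2",".3",".4",".5",".6",".7",".8",".9",10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]
list_len = [len([a for a in number if str(j) in a]) for j in loc_suffixes]
print(list_len)
print(list_len[12]) # Sotavento is location 13, i.e. index 12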

[2, 2, 3, 4, 6, 2, 2, 4, 6, 6, 2, 2, 4, 6, 6, 2, 2, 3, 6, 6, 4, 3, 4, 7, 6]
4


After carrying out the same process, we find very similar results: no location stands out with a very high number of appearances. Sotavento appears 4 times, so in general it does not seem that **this location has a special relevance in our method**.

Now that we have the frequency of each location for two different values of k, we check whether any location is preferred more often. In general, we cannot draw very clear conclusions from these results. Perhaps **locations 24, 25 and 20 have a greater presence in our method**, since their counts are among the highest for both k = 441 and k = 100, but these differences are not very significant.
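To put the two sets of counts on a comparable scale, each count can be turned into a share of the selected attributes and compared with the 1/25 = 0.04 that a perfectly uniform selection would give. The sketch below uses the counts printed above; averaging the two shares is just one simple way to combine them, chosen here for illustration and not part of the original analysis.

```python
import numpy as np

# Counts per location (1..25), copied from the outputs above
counts_441 = np.array([17, 17, 17, 18, 18, 17, 17, 17, 17, 18, 17, 17, 17,
                       17, 18, 17, 18, 18, 18, 19, 18, 18, 18, 19, 19])
counts_100 = np.array([2, 2, 3, 4, 6, 2, 2, 4, 6, 6, 2, 2, 4,
                       6, 6, 2, 2, 3, 6, 6, 4, 3, 4, 7, 6])

# Share of the selected attributes that falls on each location
share_441 = counts_441 / counts_441.sum()
share_100 = counts_100 / counts_100.sum()

# Simple combination (assumption): average of the two shares
combined = (share_441 + share_100) / 2

# Three locations with the largest combined share (1-based location numbers)
top_locations = (np.argsort(-combined, kind='stable')[:3] + 1).tolist()
sotavento_shares = (share_441[12].round(4), share_100[12].round(4))
print(top_locations, sotavento_shares)
```

With these counts the three locations that come out on top are 20, 24 and 25, while the share of Sotavento (location 13) stays at or slightly below the uniform 0.04 in both cases.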