# Feature selection by filtering

In this notebook we will look at three ways of selecting features by filtering:
- thresholding the variance
- thresholding the correlation between features
- thresholding the mutual information between the features and the label

---
    [ES] Código de Alfredo Cuesta Infante para 'Reconocimiento de Patrones'
       @ Master Universitario en Visión Artificial, 2024, URJC (España)
    [EN] Code by Alfredo Cuesta-Infante for 'Pattern Recognition'
       @ Master of Computer Vision, 2024, URJC (Spain)

    alfredo.cuesta@urjc.es

## 0. Imports and data loading

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, StandardScaler
from sklearn.feature_selection import VarianceThreshold

In [2]:
file_path = "../datasets/breast_cancer_winsconsin_dataset/"
file_name = "train_X.csv"
X_init = pd.read_csv(file_path+file_name, sep = ';', decimal = '.', index_col=0)
file_name = "train_Y.csv"  
Y_init = pd.read_csv(file_path+file_name, sep = ';', decimal = '.', index_col=0)
#------------------------------------
X_init.head()

id,feat.1_1,feat.2_1,feat.3_1,feat.4_1,feat.5_1,feat.6_1,feat.7_1,feat.8_1,feat.9_1,feat.10_1,...,feat.1_3,feat.2_3,feat.3_3,feat.4_3,feat.5_3,feat.6_3,feat.7_3,feat.8_3,feat.9_3,feat.10_3
130,12.19,13.29,79.08,455.8,0.1066,0.09509,0.02855,0.02882,0.188,0.06471,...,13.34,17.81,91.38,545.2,0.1427,0.2585,0.09915,0.08187,0.3469,0.09241
73,13.8,15.79,90.43,584.1,0.1007,0.128,0.07789,0.05069,0.1662,0.06566,...,16.57,20.86,110.3,812.4,0.1411,0.3542,0.2779,0.1383,0.2589,0.103
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
471,12.04,28.14,76.85,449.9,0.08752,0.06,0.02367,0.02377,0.1854,0.05698,...,13.6,33.33,87.24,567.6,0.1041,0.09726,0.05524,0.05547,0.2404,0.06639
216,11.89,18.35,77.32,432.2,0.09363,0.1154,0.06636,0.03142,0.1967,0.06314,...,13.25,27.1,86.2,531.2,0.1405,0.3046,0.2806,0.1138,0.3397,0.08365


## 1. Filtering by variance

### Scaling the features to the unit interval

In [3]:
unit_scaler = MinMaxScaler().set_output(transform="pandas")
unit_scaler.fit(X_init)
X_scl = unit_scaler.transform(X_init)
#------------------------------------------
# Only the last expression of a cell is displayed, so show the scaled data
X_scl.head()

id,feat.1_1,feat.2_1,feat.3_1,feat.4_1,feat.5_1,feat.6_1,feat.7_1,feat.8_1,feat.9_1,feat.10_1,...,feat.1_3,feat.2_3,feat.3_3,feat.4_3,feat.5_3,feat.6_3,feat.7_3,feat.8_3,feat.9_3,feat.10_3
130,0.254856,0.121069,0.246594,0.132471,0.487226,0.248819,0.066893,0.143241,0.414141,0.310657,...,0.192458,0.154318,0.204044,0.088478,0.485838,0.224321,0.079193,0.282019,0.375197,0.244271
73,0.333627,0.205614,0.325903,0.186893,0.433962,0.363106,0.182498,0.251938,0.30404,0.330666,...,0.307364,0.235608,0.298272,0.154149,0.474971,0.31717,0.221965,0.476404,0.201696,0.313809
0,0.538627,0.022658,0.552093,0.363733,0.593753,0.882623,0.70314,0.731113,0.686364,0.605518,...,0.620776,0.141525,0.66831,0.450698,0.618284,0.619292,0.56861,0.914227,0.598383,0.418215
471,0.247517,0.623267,0.231011,0.129968,0.314977,0.126962,0.055459,0.118141,0.40101,0.147852,...,0.201708,0.567964,0.183425,0.093983,0.223664,0.067885,0.044121,0.191078,0.165221,0.073413
216,0.240178,0.292188,0.234295,0.12246,0.370136,0.31935,0.155483,0.156163,0.458081,0.277591,...,0.189256,0.401919,0.178246,0.085037,0.470896,0.269048,0.224121,0.392008,0.361002,0.186749


### Applying the threshold

The variance threshold can be applied to either of two datasets: <code>X_init</code> or <code>X_scl</code>.

The difference is that the columns of <code>X_init</code> are on different scales, whereas every column of <code>X_scl</code> is confined to the interval $[0,1]$. Because the variance of a column depends on its units, the same threshold can select different features in each case.

We use the class <code>sklearn.preprocessing.MinMaxScaler</code> to scale the features and <code>sklearn.feature_selection.VarianceThreshold</code> to select them.

Specifically, <code>VarianceThreshold</code> removes from the dataframe every column whose variance does not exceed <code>var_th</code>, the threshold chosen for the variance.

In [4]:
var_th = 0.02
choice = 'scl' # 'scl' or 'init'
feat_selector = VarianceThreshold(var_th).set_output(transform='pandas')

# Map each valid choice to the dataframe it refers to
datasets = {'init': X_init, 'scl': X_scl}

if choice in datasets:
    # Fit the selector and keep only the columns above the variance threshold
    X_sel = feat_selector.fit_transform(datasets[choice])
    #----------------------
    strMsg = X_sel.columns.to_list()
    print('%d features have been removed'  % (X_init.shape[1]-len(strMsg)) )
    print('%d features have been selected:'% len(strMsg))
    print( strMsg )
else:
    print('--- error: Incorrect choice ! ---')

14 features have been removed
16 features have been selected:
['feat.1_1', 'feat.2_1', 'feat.3_1', 'feat.4_1', 'feat.6_1', 'feat.7_1', 'feat.8_1', 'feat.10_1', 'feat.9_2', 'feat.1_3', 'feat.2_3', 'feat.3_3', 'feat.5_3', 'feat.6_3', 'feat.7_3', 'feat.8_3']


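## 2. Filtering by correlation

In [14]:
# Encode the label as a number: 1 for 'B', 0 otherwise
Y_num = (Y_init=='B')*1
# Append the numeric label as the last column of the feature dataframe
X = pd.concat((X_init,Y_num), axis = 1)
X.head()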

id,feat.1_1,feat.2_1,feat.3_1,feat.4_1,feat.5_1,feat.6_1,feat.7_1,feat.8_1,feat.9_1,feat.10_1,...,feat.2_3,feat.3_3,feat.4_3,feat.5_3,feat.6_3,feat.7_3,feat.8_3,feat.9_3,feat.10_3,label
130,12.19,13.29,79.08,455.8,0.1066,0.09509,0.02855,0.02882,0.188,0.06471,...,17.81,91.38,545.2,0.1427,0.2585,0.09915,0.08187,0.3469,0.09241,1
73,13.8,15.79,90.43,584.1,0.1007,0.128,0.07789,0.05069,0.1662,0.06566,...,20.86,110.3,812.4,0.1411,0.3542,0.2779,0.1383,0.2589,0.103,0
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
471,12.04,28.14,76.85,449.9,0.08752,0.06,0.02367,0.02377,0.1854,0.05698,...,33.33,87.24,567.6,0.1041,0.09726,0.05524,0.05547,0.2404,0.06639,1
216,11.89,18.35,77.32,432.2,0.09363,0.1154,0.06636,0.03142,0.1967,0.06314,...,27.1,86.2,531.2,0.1405,0.3046,0.2806,0.1138,0.3397,0.08365,1


In [15]:
mat_R = X.corr()

In [16]:
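R_th = .97

# Absolute correlations in the strict upper triangle (k=1 leaves out the
# diagonal), so each pair of columns is considered only once
aux = np.abs( np.triu(mat_R.values, k=1) )
# For each column, row index of its strongest correlation with an earlier column
ind_row = np.argmax(aux, axis=0)
aboveTh_list=[]
removed_list=[]
for ind_col, ir in enumerate(ind_row):
    if (aux[ir,ind_col] >= R_th):
        aboveTh_list.append((ir,ind_col))  # (kept column, removed column)
        removed_list.append(ind_col)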

In [17]:
print('Pairs with correlation above %0.4f are:'%R_th)
for ir, ic in aboveTh_list:
    print(' ',X.columns[ir],'-',X.columns[ic])
print('so, features removed are:')
drop_list = X.columns[removed_list].to_list()
print(' ',drop_list)

Pairs with correlation above 0.9700 are:
  feat.1_1 - feat.3_1
  feat.1_1 - feat.4_1
  feat.1_2 - feat.3_2
  feat.3_1 - feat.1_3
  feat.1_3 - feat.3_3
  feat.1_3 - feat.4_3
so, features removed are:
  ['feat.3_1', 'feat.4_1', 'feat.3_2', 'feat.1_3', 'feat.3_3', 'feat.4_3']


In [18]:
X_sel = X_init.drop(columns=drop_list)

X_sel.head()

id,feat.1_1,feat.2_1,feat.5_1,feat.6_1,feat.7_1,feat.8_1,feat.9_1,feat.10_1,feat.1_2,feat.2_2,...,feat.8_2,feat.9_2,feat.10_2,feat.2_3,feat.5_3,feat.6_3,feat.7_3,feat.8_3,feat.9_3,feat.10_3
130,12.19,13.29,0.1066,0.09509,0.02855,0.02882,0.188,0.06471,0.2005,0.8163,...,0.008094,0.02662,0.004143,17.81,0.1427,0.2585,0.09915,0.08187,0.3469,0.09241
73,13.8,15.79,0.1007,0.128,0.07789,0.05069,0.1662,0.06566,0.2787,0.6205,...,0.009206,0.0122,0.00313,20.86,0.1411,0.3542,0.2779,0.1383,0.2589,0.103
0,17.99,10.38,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,...,0.01587,0.03003,0.006193,17.33,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
471,12.04,28.14,0.08752,0.06,0.02367,0.02377,0.1854,0.05698,0.6061,2.643,...,0.01183,0.02047,0.003883,33.33,0.1041,0.09726,0.05524,0.05547,0.2404,0.06639
216,11.89,18.35,0.09363,0.1154,0.06636,0.03142,0.1967,0.06314,0.2963,1.563,...,0.01785,0.02793,0.004775,27.1,0.1405,0.3046,0.2806,0.1138,0.3397,0.08365
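
The loop in this section looks only at the strongest correlation of each column. As a variation, and not the method used in this notebook, the sketch below lists every pair at or above the threshold and drops the second feature of each pair. The frame `toy` and its columns `a`, `b` and `c` are hypothetical, generated here only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' is almost a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

R_th = 0.97
abs_R = toy.corr().abs()
cols = abs_R.columns

# Every pair (i < j) whose absolute correlation reaches the threshold
pairs = [(cols[i], cols[j])
         for i in range(len(cols))
         for j in range(i + 1, len(cols))
         if abs_R.iloc[i, j] >= R_th]

# Drop the second feature of each pair, keeping the original column order
drop_list = [c for c in cols if any(second == c for _, second in pairs)]
toy_sel = toy.drop(columns=drop_list)

print(pairs)
print(drop_list)
print(toy_sel.columns.to_list())
```

Here the correlation matrix is computed on the features only, so the label column cannot cause a feature to be dropped.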


## 3. Filtering by mutual information

In [10]:
from sklearn.feature_selection import mutual_info_classif

In [11]:
MI = mutual_info_classif(X_init.values, Y_init.values.ravel(), n_neighbors=4, random_state=1234)

In [12]:
MI

array([0.34798916, 0.11246718, 0.38680623, 0.35107253, 0.10863167,
       0.22539904, 0.37169469, 0.42963593, 0.03490681, 0.03802155,
       0.24504627, 0.00750819, 0.22537932, 0.31803713, 0.00236345,
       0.06269116, 0.12762749, 0.1213488 , 0.02537872, 0.03475301,
       0.4354578 , 0.1563207 , 0.46496131, 0.45933919, 0.11597963,
       0.2277714 , 0.33681706, 0.45977291, 0.09552671, 0.08188025])

In [13]:
imax = MI.argmax()
print('The feature with highest mutual information w.r.t. the target is "%s":'%X_init.columns[imax])
print('The mutual information is %0.3f'%MI[imax])

The feature with highest mutual information w.r.t. the target is "feat.3_3":
The mutual information is 0.465
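
The cells above only report the feature with the highest mutual information. A minimal sketch of the thresholding step could look as follows. It assumes the copy of the breast cancer data bundled with scikit-learn (`load_breast_cancer`), which is the full dataset rather than the training split loaded in this notebook, and the names `X_demo`, `y_demo` and the threshold `mi_th` are hypothetical choices made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

# Assumption: scikit-learn's bundled copy of the data, not the notebook's CSV files
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)

# Mutual information between each feature and the label
MI = mutual_info_classif(X_demo.values, y_demo.values,
                         n_neighbors=4, random_state=1234)
mi = pd.Series(MI, index=X_demo.columns).sort_values(ascending=False)

# Hypothetical threshold, chosen only for illustration
mi_th = 0.3
selected = mi[mi > mi_th].index.to_list()
X_demo_sel = X_demo[selected]

print(mi.head())
print('%d features selected with MI > %0.2f' % (len(selected), mi_th))
```

Sorting the scores in a `pandas.Series` indexed by column name keeps the link between each score and its feature, which a plain array does not.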


# Exercises

- In section 1 we filtered by variance after scaling to the interval [0,1] with `MinMaxScaler`. <br>
Now try `MaxAbsScaler`. <br>
Then try `StandardScaler`.
<br><br>
With one of them you will find that NO column is removed however much you vary the threshold. <br>
Which one is it? Why does this happen?

- In section 2 we removed features that were highly correlated with each other.<br>
Another option would be to look for features that are highly correlated with the `label` column, that is, with the label, and keep only those. <br>
What problems could arise from doing this?
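
As a possible starting point for the second exercise, the sketch below computes the correlation of each feature with the label and keeps the features above a threshold. It assumes scikit-learn's bundled copy of the breast cancer data rather than the CSV files of this notebook, and `X_demo`, `y_demo` and `label_th` are hypothetical names and values chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy of the data, with a numeric 0/1 label
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)

# Absolute Pearson correlation of every feature with the label
corr_label = X_demo.corrwith(y_demo).abs().sort_values(ascending=False)

# Hypothetical threshold, chosen only for illustration
label_th = 0.7
keep_list = corr_label[corr_label >= label_th].index.to_list()
X_demo_sel = X_demo[keep_list]

print(corr_label.head())
print(keep_list)
```

Inspecting how the kept features correlate with each other is one way to begin answering the question.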