## Feature Selection

**Feature Selection** is the process of choosing the best subset of columns for training predictive models.

The main goals of feature selection are:
- Simplify models.
- Reduce the training time of predictive models.
- Avoid overfitting.

In [1]:
import numpy as np
import pandas as pd

import matplotlib # Imported only to report the version
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn # Imported only to report the version

In [2]:
# Versions

print(f"numpy=={np.__version__}")
print(f"pandas=={pd.__version__}")
print(f"matplotlib=={matplotlib.__version__}")
print(f"seaborn=={sns.__version__}")
print(f"scikit-learn=={sklearn.__version__}")

numpy==1.20.3
pandas==1.2.4
matplotlib==3.4.2
seaborn==0.11.1
scikit-learn==1.5.1


In [3]:
# Data

df1 = pd.read_csv("../Data/winequality_red.csv", sep = ";")
df2 = pd.read_csv("../Data/winequality_white.csv", sep = ";")

df1["wine"] = 0
df2["wine"] = 1

df = pd.concat([df1, df2], ignore_index = True)

df.shape

(6497, 13)

In [4]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'wine'],
      dtype='object')

In [5]:
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | wine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 0 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |


In [6]:
from sklearn.preprocessing import MinMaxScaler

x_scaler = MinMaxScaler()
x_scaler.set_output(transform="pandas")

df = x_scaler.fit_transform(df)

df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | wine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.297521 | 0.413333 | 0.0 | 0.019939 | 0.111296 | 0.034722 | 0.064516 | 0.206092 | 0.612403 | 0.191011 | 0.202899 | 0.333333 | 0.0 |
| 1 | 0.330579 | 0.533333 | 0.0 | 0.030675 | 0.147841 | 0.083333 | 0.140553 | 0.186813 | 0.372093 | 0.258427 | 0.26087 | 0.333333 | 0.0 |
| 2 | 0.330579 | 0.453333 | 0.024096 | 0.026074 | 0.137874 | 0.048611 | 0.110599 | 0.190669 | 0.418605 | 0.241573 | 0.26087 | 0.333333 | 0.0 |
| 3 | 0.61157 | 0.133333 | 0.337349 | 0.019939 | 0.109635 | 0.055556 | 0.124424 | 0.209948 | 0.341085 | 0.202247 | 0.26087 | 0.5 | 0.0 |
| 4 | 0.297521 | 0.413333 | 0.0 | 0.019939 | 0.111296 | 0.034722 | 0.064516 | 0.206092 | 0.612403 | 0.191011 | 0.202899 | 0.333333 | 0.0 |


### Removing low-variance columns

**VarianceThreshold** removes every column whose variance does not exceed a threshold. By default it removes all zero-variance columns, that is, columns that hold the same value in every row.
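As a minimal sketch of how the threshold behaves, the toy array below (hypothetical data, not the wine dataset) has one constant column, one nearly constant 0/1 flag, and one column that varies freely. The value `p = 0.8` is an assumption chosen for illustration: a 0/1 column that takes the same value in more than 80% of rows has variance below `0.8 * (1 - 0.8)`.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): 10 rows, 3 columns
X_toy = np.column_stack([
    np.ones(10),                      # constant column, variance 0
    np.array([0.0] * 9 + [1.0]),      # 0/1 flag set in 1 of 10 rows, variance 0.09
    np.arange(10, dtype=float),       # freely varying column, variance 8.25
])

# Default threshold (0.0): only the constant column is dropped
default_mask = VarianceThreshold().fit(X_toy).get_support()

# Bernoulli-style threshold p * (1 - p): the rare flag is dropped as well
p = 0.8
strict_mask = VarianceThreshold(threshold=p * (1 - p)).fit(X_toy).get_support()

print(default_mask)
print(strict_mask)
```

The same mask returned by `get_support()` is what the cells below use to recover the names of the surviving columns.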

In [None]:
from sklearn.feature_selection import VarianceThreshold

X = df.drop("wine", axis = 1)
y = df["wine"]

feature_names = np.array(X.columns)

# Threshold: p * (1 - p), the variance of a Bernoulli variable with probability p

p = 0.01
thresh = p * (1 - p)

print(thresh)

f_selection = VarianceThreshold(threshold = thresh)
f_selection.fit(X)

X_fs = f_selection.transform(X)

X_fs.shape

# Note: all columns are used except the target column

In [None]:
# Selected columns
feature_names[f_selection.get_support()]

In [None]:
f_selection.get_support()

In [None]:
df.drop("wine", axis = 1).var()

### Univariate Feature Selection

Selects the best columns based on univariate statistical tests, that is, tests that score one column at a time.

- **SelectKBest** keeps the K columns with the highest score.

- **SelectPercentile** keeps the highest-scoring columns within a specified percentile.

These selectors work with specific scoring functions:

- **For regression:** f_regression, mutual_info_regression

- **For classification:** chi2, f_classif, mutual_info_classif

The **f_regression** and **f_classif** functions estimate the degree of linear dependence between two random variables.

The **mutual_info_regression** and **mutual_info_classif** functions measure statistical dependence of any kind, not only linear, but they need more samples (rows) to be accurate.
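The difference between the two families of scoring functions can be sketched on synthetic data. Everything below is an assumption made for illustration (the column names, the sample size, the way the target is built): one column shifts linearly with the class, one determines the class through its absolute value, and one is pure noise.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Column whose absolute value determines the class (a non-linear relationship)
x_sym = rng.uniform(-1, 1, size=n)
y_demo = (np.abs(x_sym) > 0.5).astype(int)

# Column whose mean shifts with the class (a linear relationship)
x_shift = y_demo + rng.normal(size=n)

# Column unrelated to the class
x_noise = rng.normal(size=n)

X_demo = np.column_stack([x_shift, x_sym, x_noise])

f_scores, _ = f_classif(X_demo, y_demo)
mi_scores = mutual_info_classif(X_demo, y_demo, random_state=0)

print("F scores: ", f_scores)
print("MI scores:", mi_scores)
```

In this setup the F-test should rank `x_shift` far above `x_sym`, because both classes have roughly the same mean of `x_sym`, while mutual information should still give `x_sym` a clearly higher score than the noise column.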

In [8]:
# SelectKBest

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X = df.drop("wine", axis = 1)
y = df["wine"]

feature_names = np.array(X.columns)

k = 6

f_selection = SelectKBest(score_func = chi2,
                          k          = k)
f_selection.fit(X, y)
X_fs = f_selection.transform(X)

X_fs.shape
# Keeps the k best columns

(6497, 6)

In [9]:
# Selected columns
feature_names[f_selection.get_support()]

array(['fixed acidity', 'volatile acidity', 'residual sugar', 'chlorides',
       'total sulfur dioxide', 'sulphates'], dtype=object)
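One practical detail about `chi2`: it only accepts non-negative inputs, which the min-max scaling applied earlier guarantees for this dataset. A minimal sketch on a hypothetical toy array shows the failure on raw negative values and the scores after scaling.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical) containing negative values
X_raw = np.array([[-1.0, 2.0],
                  [ 0.5, 0.0],
                  [ 1.5, 3.0],
                  [-0.5, 1.0]])
y_demo = np.array([0, 1, 1, 0])

# chi2 rejects negative inputs with a ValueError
try:
    chi2(X_raw, y_demo)
    raised = False
except ValueError:
    raised = True

# After scaling to [0, 1] the same call succeeds
X_scaled = MinMaxScaler().fit_transform(X_raw)
scores, p_values = chi2(X_scaled, y_demo)

print(raised)
print(scores.shape)
```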

In [10]:
# SelectPercentile

from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_classif

X = df.drop("wine", axis = 1)
y = df["wine"]

feature_names = np.array(X.columns)

perc = 20

f_selection = SelectPercentile(score_func = f_classif,
                               percentile = perc)
f_selection.fit(X, y)
X_fs = f_selection.transform(X)

X_fs.shape
# Keeps the best columns within the percentile

(6497, 3)

In [12]:
12 * 0.2

2.4000000000000004

In [None]:
np.percentile(a = f_selection.score_func(X, y)[0], q = 80)

In [None]:
f_selection.score_func(X, y)[0]

In [None]:
np.where(f_selection.score_func(X, y)[0] > np.percentile(a = f_selection.score_func(X, y)[0], q = 80))

In [None]:
# Selected columns
feature_names[f_selection.get_support()]

In [None]:
f_selection.get_support()
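The cells above explain why 20% of 12 columns yields 3 columns rather than 2.4: the selector keeps every score above the 80th percentile of the scores, and with 12 distinct scores that cutoff falls between the 9th and 10th sorted values. The sketch below reproduces this on synthetic data (a stand-in generated with `make_classification`, not the wine dataset) and reads the scores from the fitted selector's `scores_` attribute instead of calling `score_func` again. It assumes no tied scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in with 12 columns, like X above
X_demo, y_demo = make_classification(n_samples=500, n_features=12,
                                     n_informative=4, random_state=0)

selector = SelectPercentile(score_func=f_classif, percentile=20)
selector.fit(X_demo, y_demo)

# Scores computed during fit
scores = selector.scores_

# Keeping the top 20% means keeping scores above the 80th percentile
cutoff = np.percentile(scores, 100 - 20)
manual_mask = scores > cutoff

print(manual_mask.sum())
print(np.array_equal(manual_mask, selector.get_support()))
```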

### Feature selection with SelectFromModel

**SelectFromModel** is a transformer that can be used with any estimator (for example, a classifier) that assigns an importance to each column through an attribute such as **coef_** or **feature_importances_**, or through the **importance_getter** parameter. Columns whose importance falls below a threshold are considered unimportant and removed.
The **max_features** parameter can also be used to limit the number of columns selected.
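A minimal sketch of both options, using a `LogisticRegression` (whose importances come from `coef_`) on synthetic data. The dataset, the estimator, and the specific threshold values are assumptions chosen for illustration. Setting `threshold=-np.inf` disables the threshold so that only `max_features` decides how many columns survive.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): 10 columns, 3 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Option 1: keep columns whose |coef_| is at least the median importance
by_threshold = SelectFromModel(LogisticRegression(max_iter=1000),
                               threshold="median")
by_threshold.fit(X_demo, y_demo)

# Option 2: ignore the threshold and keep exactly the 3 most important columns
by_count = SelectFromModel(LogisticRegression(max_iter=1000),
                           threshold=-np.inf,
                           max_features=3)
by_count.fit(X_demo, y_demo)

print(by_threshold.get_support().sum())
print(by_count.get_support().sum())
```

When `threshold` is left as `None`, scikit-learn falls back to a default threshold (the mean importance for estimators without an L1 penalty), so `max_features` then acts as an upper bound rather than an exact count.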

### Tree-based feature selection

In [25]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X = df.drop("wine", axis = 1)
y = df["wine"]

feature_names = np.array(X.columns)

clf = ExtraTreesClassifier(n_estimators = 200)
clf = clf.fit(X, y)

clf.feature_importances_

array([0.0915392 , 0.16545257, 0.03757984, 0.06729617, 0.1115784 ,
       0.04898554, 0.24394766, 0.09107023, 0.04202397, 0.07135919,
       0.01756481, 0.01160239])

In [20]:
cols_delete = feature_names[clf.feature_importances_ < 0.02]

In [22]:
X_new = X.drop(cols_delete, axis = 1)

In [23]:
clf = ExtraTreesClassifier(n_estimators = 200)
clf = clf.fit(X_new, y)

In [None]:
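# threshold = None falls back to the default threshold (the mean importance for tree ensembles),
# so max_features = 6 is an upper bound on the number of columns kept
model = SelectFromModel(estimator    = clf,
                        prefit       = True,
                        threshold    = None,
                        max_features = 6)

# clf was refit on X_new, so X_new (not X) must be transformed to match its column count
X_fs = model.transform(X_new)
X_fs.shape

In [None]:
# Selected columns (the mask refers to the columns of X_new)
np.array(X_new.columns)[model.get_support()]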

### Feature importance cumsum-based selection

In [None]:
max_importance = 0.8 # Target cumulative sum of importances

# Importances
importances = clf.feature_importances_

# Cumulative sum of the importances, sorted from highest to lowest
indices = np.argsort(importances)[::-1]
cumsum = np.cumsum(importances[indices])

# Keep the top columns whose importances add up to max_importance
n_features = np.argmax(cumsum >= max_importance) + 1
selected_indices = indices[:n_features]

# Selected columns (clf was last fit on X_new, so the names come from X_new)
np.array(X_new.columns)[selected_indices]
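The same logic can be wrapped in a small helper and checked on hand-picked numbers. The function name `select_by_cumulative_importance` and the toy importances are hypothetical, introduced only for this sketch.

```python
import numpy as np

def select_by_cumulative_importance(importances, max_importance=0.8):
    """Return the indices of the most important columns whose importances
    add up to at least max_importance, ordered from most to least important."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1]        # highest importance first
    cumsum = np.cumsum(importances[order])
    # First position where the running total reaches the target
    n_features = np.argmax(cumsum >= max_importance) + 1
    return order[:n_features]

# Toy importances (hypothetical) that sum to 1
toy_importances = [0.05, 0.40, 0.10, 0.30, 0.15]
selected = select_by_cumulative_importance(toy_importances, max_importance=0.8)

print(selected)
```

Here the sorted importances are 0.40, 0.30, 0.15, 0.10 and 0.05, so the running total first reaches 0.8 at the third column (0.85) and three indices are returned.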

### Correlation Map

In [None]:
fig, ax = plt.subplots(figsize = (15, 8))

sns.heatmap(df.corr(), annot = True);
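One way such a heatmap is often read in a feature-selection context is to look for pairs of columns that are almost copies of each other. As a rough sketch, the strongly correlated pairs can also be listed programmatically. The toy DataFrame and the 0.9 cutoff below are assumptions for illustration, not values taken from the wine data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Toy data (hypothetical): b is nearly a linear copy of a, c is independent
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) pairs and keep those above the cutoff
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]

print(high_pairs)
```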
