# Studying feature selection

In [1]:
# Libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Evaluation function
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

def metricas_classificacao(estimator, X_train, X_test, y_train, y_test):
    """Show the confusion matrix and classification report of a fitted
    estimator on both the training and the test data."""

    # ============================================

    print("\nTraining evaluation metrics:")

    y_pred_train = estimator.predict(X_train)

    ConfusionMatrixDisplay.from_predictions(y_train, y_pred_train)
    plt.show()

    print(classification_report(y_train, y_pred_train))

    # ============================================

    print("\nTest evaluation metrics:")

    y_pred_test = estimator.predict(X_test)

    ConfusionMatrixDisplay.from_predictions(y_test, y_pred_test)
    plt.show()

    print(classification_report(y_test, y_pred_test))

In [3]:
from sklearn import datasets

# Load the Breast Cancer dataset
bc = datasets.load_breast_cancer(as_frame=True)
X = bc.data
y = bc.target

## Analyzing the dataset

In [4]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [10]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify=y)

## Feature Selection

Feature selection consists of choosing, based on some criterion, a **subset of the original features** of a given problem that yields a model with comparable performance. The result is a **dimensionality reduction** of the feature space in which the retained features are still the original ones, not transformed combinations of them.

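In wrapper methods, different combinations of features are selected, evaluated with a model, and compared with the other combinations based on that estimator's results. The chosen features therefore depend on both the estimator and the metric. **An exhaustive search evaluates every possible combination of features with the chosen metric**, which means 2^n - 1 candidate subsets for n features (more than a billion for the 30 features of this dataset), so in practice greedy strategies such as backward elimination explore only part of that space. Filter methods, such as the ANOVA F-test at the end of this section, score the features without fitting an estimator at all.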
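To make the cost of searching over combinations concrete, here is a minimal sketch of an exhaustive wrapper search. It is an illustration under stated assumptions rather than a recommended procedure: the candidate pool is arbitrarily restricted to the first 5 columns, and the decision tree, 5-fold cross-validation and accuracy metric are choices made only for this example.

```python
from itertools import combinations

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Assumption: only the first 5 columns are candidates, so that all
# 2**5 - 1 = 31 non-empty subsets can be evaluated quickly.
candidates = list(X_train.columns[:5])

results = []
for size in range(1, len(candidates) + 1):
    for subset in combinations(candidates, size):
        model = DecisionTreeClassifier(random_state=42)
        # Score each subset with cross-validation on the training data only.
        score = cross_val_score(
            model, X_train[list(subset)], y_train, cv=5, scoring="accuracy"
        ).mean()
        results.append((subset, score))

best_subset, best_score = max(results, key=lambda item: item[1])
print(f"{len(results)} subsets evaluated")
print(f"Best subset: {best_subset} (mean CV accuracy {best_score:.3f})")
```

With all 30 features the same loop would need to evaluate 2^30 - 1 subsets, which is why greedy methods are usually preferred.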

#### Backward elimination - RFE
Feature ranking with recursive feature elimination. Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a specific attribute or through a callable. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

Parameters:
- estimator : supervised learning estimator that provides information about feature importance (by default, RFE only works with estimators that expose coef_ or feature_importances_ after fitting)
- n_features_to_select : number of features to select. If it is float, it is the fraction of features to select.

Attributes:
- feature_names_in_ : names of the features seen during .fit()
- ranking_ : the feature ranking, where ranking_[i] is the ranking position of the i-th feature. Selected features are assigned rank 1
- support_ : boolean mask of the selected features
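The elimination loop described above can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: `manual_rfe` is a hypothetical helper that removes one feature per iteration and relies on `feature_importances_`, so it is not guaranteed to break ties between equally unimportant features the same way `RFE` does.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)


def manual_rfe(estimator, X, y, n_features_to_select):
    """Hypothetical helper: drop the least important feature until
    only n_features_to_select remain."""
    remaining = list(X.columns)
    while len(remaining) > n_features_to_select:
        # Refit a fresh copy of the estimator on the current feature set.
        model = clone(estimator).fit(X[remaining], y)
        # Position of the least important feature (first one on ties).
        worst = int(np.argmin(model.feature_importances_))
        remaining.pop(worst)
    return remaining


selected = manual_rfe(
    DecisionTreeClassifier(random_state=42), X_train, y_train, n_features_to_select=10
)
print(selected)
```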

In [5]:
# Import the method and the estimator
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

In [13]:
# Instantiate and fit RFE
dt = DecisionTreeClassifier(random_state=42)
rfe = RFE(estimator=dt, n_features_to_select=10).fit(X_train, y_train)

In [41]:
# support_ returns a boolean mask of the selected features
mask = rfe.support_
mask

array([False, False, False, False, False, False, False, False, False,
        True, False,  True, False,  True,  True, False, False, False,
       False, False,  True,  True, False, False, False,  True,  True,
        True,  True, False])

In [15]:
rfe.ranking_

array([21, 19, 18, 17, 16, 15, 14, 12, 10,  1,  9,  1,  7,  1,  1, 20,  6,
        5, 13, 11,  1,  1,  4,  3,  2,  1,  1,  1,  1,  8])

In [47]:
X.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [54]:
# Using a mask to select column names: this works with a list or array of booleans with the same length as columns
selected_feat = X.columns[[False,False,False,False,False,False,False,False,False, False,
                           False,False,False,False,False,False,False,False,False, False,
                           False,False,False,False,False,False,False,False,False, True]]
selected_feat

Index(['worst fractal dimension'], dtype='object')

In [55]:
# or
selected_feat = X.columns[mask]
selected_feat

Index(['mean fractal dimension', 'texture error', 'area error',
       'smoothness error', 'worst radius', 'worst texture',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry'],
      dtype='object')

#### ANOVA

ANOVA assumes a continuous variable on one side and a categorical variable on the other. To assess which features matter most, it splits each continuous feature into the groups defined by the categorical target and compares the variance between the group means with the variance within the groups, which gives the F statistic. If the groups have essentially the same mean (a small F), that continuous feature is not relevant for separating the groups and can therefore be dropped from the modeling.
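The F statistic can be computed by hand for a single feature to see what the score means. The sketch below is illustrative: `anova_f` is a hypothetical helper written for this example, and "mean radius" is an arbitrary choice of feature. Its result is compared with scikit-learn's `f_classif`, which computes the same one-way ANOVA F value per feature.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)


def anova_f(values, labels):
    """Hypothetical helper: one-way ANOVA F statistic of a continuous
    variable split by a categorical label."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == c] for c in np.unique(labels)]
    grand_mean = values.mean()
    # Variation of the group means around the overall mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


feature = "mean radius"  # arbitrary feature chosen for the illustration
f_manual = anova_f(X[feature], y)
f_sklearn, p_values = f_classif(X[[feature]], y)
print(f_manual, f_sklearn[0])
```

A large F means the class means are far apart relative to the spread inside each class, which is why `SelectKBest` keeps the features with the highest scores.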

In [17]:
from sklearn.feature_selection import SelectKBest, f_classif

In [18]:
fs = SelectKBest(score_func = f_classif, k = 10)
X_new = fs.fit_transform(X_train, y_train)
print(f"The number of features was {X_train.shape[1]} before and is {X_new.shape[1]} now")

The number of features was 30 before and is 10 now


In [19]:
fs.get_feature_names_out()

array(['mean radius', 'mean perimeter', 'mean area', 'mean concavity',
       'mean concave points', 'worst radius', 'worst perimeter',
       'worst area', 'worst concavity', 'worst concave points'],
      dtype=object)

In [24]:
fs.scores_

array([4.80594009e+02, 7.30809534e+01, 5.19551421e+02, 4.50499138e+02,
       6.82480951e+01, 2.35101305e+02, 3.65480510e+02, 6.35025749e+02,
       5.24483545e+01, 5.31343724e-03, 2.22806961e+02, 1.27184749e+00,
       2.06798349e+02, 2.78960274e+02, 5.34605383e+00, 3.51063285e+01,
       2.36099869e+01, 7.68295227e+01, 5.00864231e-02, 7.85634972e-01,
       6.17391336e+02, 9.84586408e+01, 6.43459905e+02, 4.86344354e+02,
       9.34278899e+01, 2.37748913e+02, 3.35294070e+02, 7.32283255e+02,
       9.58762662e+01, 5.19076118e+01])

In [39]:
# scores_ holds one score per input feature (30), so keep only the selected ones
data = zip(fs.get_feature_names_out(), fs.scores_[fs.get_support()])
data_df = pd.DataFrame(data, columns=["Features", "scores"])
data_df.sort_values(by="scores", ascending=False)

Unnamed: 0,Features,scores
9,worst concave points,732.283255
6,worst perimeter,643.459905
4,mean concave points,635.025749
5,worst radius,617.391336
1,mean perimeter,519.551421
7,worst area,486.344354
0,mean radius,480.594009
2,mean area,450.499138
3,mean concavity,365.48051
8,worst concavity,335.29407
