# House Dataset - Variable Discretization - Feature Engine

In [1]:
## data handling

import pandas as pd
import numpy as np

## general settings

import warnings

warnings.filterwarnings('ignore')

## column names, needed when the dataset has no header row

column_names = ['CRIM',
               'ZN',
               'INDUS',
               'CHAS',
               'NOX',
               'RM',
               'AGE',
               'DIS',
               'RAD',
               'TAX',
               'PTRATIO',
               'B',
               'LSTAT',
               'MEDV']


df = pd.read_csv('boston.csv', header=None, delimiter=r'\s+', names=column_names)

df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


Discretization transforms continuous numerical variables into categorical variables by grouping their values into intervals.

Afterwards, the result is treated like any other categorical variable, using dummy variables or an encoder.
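
As a minimal sketch of that follow-up step, the snippet below takes a few made-up bin codes (the column name `AGE_bin` and its values are hypothetical, not taken from the dataset) and shows two common treatments: dummy columns via `pd.get_dummies`, and an ordered categorical as a simple stand-in for an ordinal encoder.

```python
import pandas as pd

# Hypothetical bin codes, similar to what a discretiser outputs
binned = pd.DataFrame({"AGE_bin": [0, 2, 1, 2, 0]})

# Option 1: one dummy (0/1) column per bin
dummies = pd.get_dummies(binned["AGE_bin"], prefix="AGE_bin", dtype=int)

# Option 2: a single column marked as an ordered categorical,
# so the order of the intervals is preserved
ordered = binned["AGE_bin"].astype(
    pd.CategoricalDtype(categories=[0, 1, 2], ordered=True)
)

print(dummies)
print(ordered.cat.codes.tolist())
```

Dummy columns ignore the order of the intervals, while the ordered version keeps it; which one fits better depends on the model used afterwards.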

The drawback of discretization is a possible loss of information.

Why do it?

- Processing: intervals speed up processing.
- Intervals make the data easier to explain.
- It reduces the influence of outliers.
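
The outlier point can be illustrated with a small sketch. The values below are invented for the example, and `pd.qcut` is used here only as a convenient way to build equal-frequency bins.

```python
import pandas as pd

# Hypothetical values with one extreme outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10_000])

# Five equal-frequency bins, returned as integer codes 0..4
codes = pd.qcut(values, q=5, labels=False)

print(codes.tolist())
print("raw mean:", values.mean(), "| largest code:", codes.max())
```

After binning, the outlier shares the top bin with its nearest neighbour, so it no longer dominates the scale of the variable.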

## How to Do It?

- Distance (equal width): for example, values from 1 to 100 split into intervals of 10
- Frequency (equal number of observations per interval)
- Decision tree (in theory, it groups similar observations together)
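
The distance example from the first bullet can be sketched with plain pandas. This is only an illustration of the idea, assuming the values 1 to 100 and edges placed every 10 units; it is not the feature_engine implementation.

```python
import pandas as pd

# Values 1..100, as in the distance example
values = pd.Series(range(1, 101))

# Edges 0, 10, ..., 100 give the intervals (0, 10], (10, 20], ..., (90, 100]
edges = list(range(0, 101, 10))
bins = pd.cut(values, bins=edges)

# Number of observations per interval, in interval order
counts = bins.value_counts(sort=False)
print(counts)
```

Here every interval has the same width and, because the values are evenly spread, the same number of observations. With skewed data the widths stay equal but the counts do not.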

In [2]:
## discretisation package

from feature_engine.discretisation import (
    EqualWidthDiscretiser, 
    EqualFrequencyDiscretiser, 
    DecisionTreeDiscretiser)

In [5]:
ewd = EqualWidthDiscretiser()
ewd.fit(df[['AGE']])

In [7]:
df['Idade_Discretizada_ewd'] = ewd.transform(df[['AGE']])

df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV,Idade_Discretizada_ewd
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0,6
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6,7
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7,5
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4,4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2,5


In [10]:
df.groupby('Idade_Discretizada_ewd').agg({'AGE': ['mean', 'min', 'max', 'size']})

Idade_Discretizada_ewd,AGE mean,AGE min,AGE max,AGE size
0,7.442857,2.9,10.0,14
1,18.587097,13.0,22.3,31
2,28.675862,22.9,32.0,29
3,36.32619,32.1,41.5,42
4,45.978125,41.9,51.0,32
5,55.852632,51.8,61.1,38
6,66.317949,61.4,70.6,39
7,75.857143,71.0,80.3,42
8,85.795775,80.8,90.0,71
9,96.45,90.3,100.0,168
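
The groups above are consistent with ten intervals of equal width between the smallest and largest AGE. The sketch below redoes that arithmetic by hand with NumPy and pandas, taking the minimum (2.9) and maximum (100.0) from the output and assuming ten bins; it approximates the idea and does not call the discretiser itself.

```python
import numpy as np
import pandas as pd

# Minimum and maximum of AGE, taken from the groupby output above
age_min, age_max = 2.9, 100.0
n_bins = 10  # assumed: matches the ten groups (0 to 9) in the output

# Eleven equally spaced edges delimit ten intervals of width (max - min) / n_bins
edges = np.linspace(age_min, age_max, n_bins + 1)
width = (age_max - age_min) / n_bins

# AGE values of the first five rows shown by df.head()
sample = [65.2, 78.9, 61.1, 45.8, 54.2]
codes = pd.cut(sample, bins=edges, labels=False, include_lowest=True)

print(np.round(edges, 2))
print(round(width, 2), codes.tolist())
```

The codes computed this way can be compared with the `Idade_Discretizada_ewd` column for the same five rows.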


In [11]:
efd = EqualFrequencyDiscretiser()
efd.fit(df[['AGE']])

df['Idade_Discretizada_efd'] = efd.transform(df[['AGE']])

df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV,Idade_Discretizada_ewd,Idade_Discretizada_efd
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0,6,3
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6,7,5
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7,5,3
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4,4,2
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2,5,3


In [12]:
df.groupby('Idade_Discretizada_efd').agg({'AGE': ['mean', 'min', 'max', 'size']})

Idade_Discretizada_efd,AGE mean,AGE min,AGE max,AGE size
0,16.213725,2.9,26.3,51
1,32.443137,27.6,37.8,51
2,44.716,38.1,52.3,50
3,58.652941,52.5,65.4,51
4,71.8,66.1,77.3,50
5,82.254902,77.7,85.9,51
6,89.241176,86.1,91.8,51
7,93.982,91.9,95.6,50
8,97.354717,95.7,98.8,53
9,99.897917,98.9,100.0,48
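
In the output above the group sizes are nearly equal while the interval widths vary a lot. A rough sketch of how quantile-based edges produce this pattern follows; it uses synthetic left-skewed data as a stand-in (not the real AGE column) and plain NumPy/pandas rather than the discretiser.

```python
import numpy as np
import pandas as pd

# Synthetic, left-skewed data standing in for a variable like AGE
rng = np.random.default_rng(0)
values = pd.Series(100 * rng.beta(5, 1, size=500))

# Decile edges: each interval should hold roughly 10% of the observations
edges = np.quantile(values, np.linspace(0, 1, 11))
codes = pd.cut(values, bins=edges, labels=False, include_lowest=True)

# Per-bin range and size, plus the width actually covered by each bin
summary = values.groupby(codes).agg(["min", "max", "size"])
summary["width"] = summary["max"] - summary["min"]
print(summary)
```

Where the data is dense the intervals become narrow, and where it is sparse they stretch, which is the trade-off compared with equal-width bins.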


In [14]:
dtd = DecisionTreeDiscretiser()
dtd.fit(df[['AGE']], df['MEDV'])  ## this option requires the target

df['Idade_Discretizada_dtd'] = dtd.transform(df[['AGE']])

df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV,Idade_Discretizada_ewd,Idade_Discretizada_efd,Idade_Discretizada_dtd
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0,6,3,24.730345
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6,7,5,21.681579
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7,5,3,24.730345
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4,4,2,24.730345
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2,5,3,24.730345


In [15]:
df.groupby('Idade_Discretizada_dtd').agg({'AGE': ['mean', 'min', 'max', 'size']})

Idade_Discretizada_dtd,AGE mean,AGE min,AGE max,AGE size
17.483673,97.182993,92.4,100.0,147
21.681579,85.337719,76.5,92.2,114
24.730345,57.090345,37.3,76.0,145
27.739,24.064,2.9,37.2,100
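
The group labels above are on the scale of MEDV rather than bin numbers, which is consistent with each AGE value being replaced by the prediction of a small regression tree. The sketch below reproduces that idea with scikit-learn's `DecisionTreeRegressor` on synthetic data. The depth of 2 and the data itself are assumptions for illustration, not the settings the discretiser chose.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (not the Boston columns): the target drops as x grows
rng = np.random.default_rng(42)
x = rng.uniform(0, 100, size=300)
y = 30 - 0.1 * x + rng.normal(0, 2, size=300)

# A depth-2 tree has at most four leaves, hence at most four bins
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(x.reshape(-1, 1), y)

# Each observation is replaced by the prediction of the leaf it falls into
binned = tree.predict(x.reshape(-1, 1))

for value in np.unique(binned):
    in_leaf = binned == value
    # The leaf prediction is the mean target of the training rows in that leaf
    print(round(value, 3), round(y[in_leaf].mean(), 3), in_leaf.sum())
```

Because the split points are chosen to separate different target levels, the bins depend on the target, which is why `fit` needs it.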


## Next Lesson's Topic - Feature Selection

Some methods available in the feature_engine package:

- DropConstantFeatures : drops constant features;
- DropCorrelatedFeatures : drops highly correlated features;
- SmartCorrelatedSelection : examines each group of highly correlated features and keeps one of them according to a specific criterion: (1) highest variance, (2) highest cardinality, (3) fewest missing values, (4) best model performance;
- SelectBySingleFeaturePerformance : selects features based on the individual performance of each feature;
- RecursiveFeatureElimination : runs a recursive process that removes one variable at a time, according to its "irrelevance" in predictive terms;
- RecursiveFeatureAddition : starts with the most important feature and adds one at a time, checking whether the model improves.
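
To make the first two items concrete, here is a plain-pandas sketch of the ideas behind dropping constant and highly correlated features. It approximates the logic and is not the feature_engine classes themselves; the data, the column names and the 0.8 threshold are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "const" never varies and "b" is almost a copy of "a"
rng = np.random.default_rng(1)
a = rng.normal(size=200)
data = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
    "const": 1.0,
})

# Constant features: keep only columns with more than one distinct value
varying = data.loc[:, data.nunique() > 1]

# Correlated features: scan columns in order and drop any column whose absolute
# correlation with an already kept column exceeds the threshold
threshold = 0.8  # hypothetical cut-off
corr = varying.corr().abs()
kept = []
for col in corr.columns:
    if all(corr.loc[col, k] <= threshold for k in kept):
        kept.append(col)

selected = varying[kept]
print(list(selected.columns))
```

In this sketch the first column of each correlated group is the one that survives; choosing the survivor by variance, cardinality, missing values or model performance is the extra step that SmartCorrelatedSelection adds.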