# Feature Selection

Even though this problem has few variables, we will still apply a feature selection step.  
We will use the RandomForest algorithm as the selection method, with the following parameters:
- `n_estimators=100` : A reasonable number of trees  
- `criterion='gini'` : Computationally cheaper than entropy  
- `max_depth=5` : Limits tree depth, keeping the number of nodes reasonable  
- `max_features='sqrt'` : Gives different features the chance to interact with each other  
- `class_weight='balanced'` : The problem is imbalanced, so minority classes receive a larger weight  

This algorithm was chosen because it is a simple way to assess the importance of each variable. It is a tree-based method (the same kind that will be used in the final algorithm), and it gives each variable the chance to appear in different trees, interacting with different variables, before the selection is made.  
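
As a rough illustration of what `class_weight='balanced'` does: scikit-learn documents the weight of each class as `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on a hypothetical toy target; the function name `balanced_class_weights` and the label counts are illustrative assumptions, not part of this notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weights following n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical toy target: 95 legitimate rows (0) and 5 fraudulent rows (1).
toy_labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(toy_labels)
print(weights)
```

In this toy case the minority class gets a weight of 10.0 and the majority class about 0.53, so each fraudulent row counts roughly 19 times as much as a legitimate one when the trees are grown.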

Four random variables will be added alongside the real ones: 2 categorical (low cardinality) and 2 continuous.  
The selected variables will be those that rank above the first random variable in the importance ranking by `information gain` (here, the impurity reduction reported by `feature_importances_`), or those that together accumulate 95% of the importance (these thresholds should be tuned through testing).  

(This method is similar to Boruta, but computationally faster.)
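
The two selection criteria can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the notebook's own implementation: the helper `select_features`, its parameters, and the toy importances are hypothetical; random probes are assumed to be identified by the `___random_` name prefix; and the feature that crosses the cumulative threshold is assumed to be included. The two lists are returned separately, since how to combine them is left open here.

```python
def select_features(features, importances, probe_prefix="___random_", cum_threshold=0.95):
    """Return (above_probe, within_cumsum) for features ranked by importance.

    above_probe:   real features ranked above the best-ranked random probe.
    within_cumsum: real features among the top-ranked ones needed to reach
                   cum_threshold of the total importance.
    """
    ranked = sorted(zip(features, importances), key=lambda fi: fi[1], reverse=True)

    # Criterion 1: keep everything ranked above the first random probe.
    above_probe = []
    for name, _ in ranked:
        if name.startswith(probe_prefix):
            break
        above_probe.append(name)

    # Criterion 2: walk down the ranking until the cumulative share is reached.
    total = sum(imp for _, imp in ranked)
    within_cumsum, running = [], 0.0
    for name, imp in ranked:
        if running >= cum_threshold * total:
            break
        running += imp
        if not name.startswith(probe_prefix):
            within_cumsum.append(name)

    return above_probe, within_cumsum


# Hypothetical toy importances (not taken from the real model).
toy_features = ["a", "b", "___random_con_1___", "c", "d"]
toy_importances = [0.50, 0.30, 0.10, 0.06, 0.04]
above_probe, within_cumsum = select_features(toy_features, toy_importances)
print(above_probe, within_cumsum)
```

With a ranking table holding `feature` and `importance` columns, the same helper could be called as `select_features(df_importance['feature'], df_importance['importance'])`.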

In [31]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from feature_engine.encoding import OrdinalEncoder

# Data

In [32]:
df_train = pd.read_parquet('../data/enrich/fraud_dataset_v2_train.parquet.gzip')
df_valid = pd.read_parquet('../data/enrich/fraud_dataset_v2_valid.parquet.gzip')

In [36]:
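for i in range(1, 3):
    # Fixed seed for reproducibility
    np.random.seed(i)
    # Categorical probe: integer codes in [0, i*25), i.e. 25 and 50 levels
    df_train[f'___random_cat_{i}___'] = np.random.randint(0, i*25, size=df_train.shape[0])

    # Reseed with the same value so this draw does not depend on the one above
    np.random.seed(i)
    # Continuous probe: uniform values in [0, 1000000)
    df_train[f'___random_con_{i}___'] = np.random.uniform(0, 1000000, size=df_train.shape[0])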

In [37]:
df_train.head()

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,fecha,monto,score,fraude,ymd,day_of_week,hour_of_day,period_of_day,vl_cat_last7d,qty_cat_last7d,vl_fraud_cat_last7d,qty_fraud_cat_last7d,br_vl_cat_last7d,br_qty_cat_last7d,___random_cat_1___,___random_con_1___,___random_cat_2___,___random_con_2___
0,4,0.7518,42638.24,50.0,0.19021,9.0,BR,20,Mochila Impermeable Mujer Importada Premiun,cat_0cd53cb,0.364921,4338.0,366.0,1,NULO,Y,0.98,260445,100,2020-03-09 14:33:56,22.18,74.0,0,20200309,0,14,afternoon,964.89,26.0,0.0,0.0,0.0,0.0,5,417022.004703,40,435994.902142
1,4,0.7576,34602.36,4.0,0.317697,33.0,BR,7,Placa De Rede ( Pci-e / Pci Express X1 / Mini ...,cat_41464e4,0.056129,4605.0,297.0,1,NULO,Y,0.06,34602,100,2020-03-16 15:07:38,10.95,28.04,0,20200316,0,15,afternoon,413.53,20.0,4.45,1.0,0.010761,0.05,11,720324.493442,15,25926.231828
2,4,0.7427,11225.44,20.0,0.0,9.0,BR,14,Livro - Nova Coleção Tora Rashi - Novo,cat_8ef7164,0.027475,3480.0,462.0,1,N,Y,0.38,11225,100,2020-03-25 13:12:31,56.84,72.0,0,20200325,2,13,afternoon,4831.37,266.0,81.8,8.0,0.016931,0.030075,12,114.374817,45,549662.477879
3,4,0.8192,49224.53,2.0,0.446367,12.0,BR,14,Polaina Pvc Tipo Galocha Solado Integral Antid...,cat_e694239,0.338384,1732.0,16.0,1,NULO,N,0.02,49224,0,2020-03-29 18:46:25,5.75,13.0,0,20200329,6,18,evening,838.89,30.0,0.0,0.0,0.0,0.0,8,302332.572632,8,435322.392618
4,3,0.7462,147342.28,51.0,0.114652,140.0,BR,27,Kit 6 Lampada De Led Halopim G9 5w Para Lustre...,cat_2eabef6,0.551881,3314.0,883.0,1,NULO,Y,1.0,147342,100,2020-03-14 17:44:43,12.0,49.37,0,20200314,5,17,afternoon,3263.94,118.0,22.65,3.0,0.006939,0.025424,9,146755.890817,22,420367.802087


# Feature Selection

In [38]:
hard_remove = ['c', 'p', 'q', 'fecha', 'i', 'score', 'fraude', 'ymd']

In [39]:
ord_enc = OrdinalEncoder(
    encoding_method='arbitrary',
    variables=df_train.drop(columns=hard_remove).select_dtypes(include=['object']).columns.tolist()
).fit(df_train)

In [42]:
rf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=5,
                            max_features='sqrt', class_weight='balanced', random_state=777, n_jobs=-1)

In [43]:
rf.fit(ord_enc.transform(df_train).drop(columns=hard_remove), df_train['fraude'])

In [53]:
df_importance = pd.DataFrame({
    'feature': rf.feature_names_in_,
    'importance': rf.feature_importances_
}).sort_values(by='importance', ascending=False).reset_index(drop=True)
df_importance['importance_cumsum'] = df_importance['importance'].cumsum()
df_importance

Unnamed: 0,feature,importance,importance_cumsum
0,o,0.364899,0.364899
1,f,0.111153,0.476052
2,n,0.097697,0.573748
3,l,0.09191,0.665658
4,a,0.08895,0.754608
5,m,0.059918,0.814526
6,s,0.033921,0.848448
7,br_qty_cat_last7d,0.03347,0.881918
8,d,0.021863,0.903781
9,br_vl_cat_last7d,0.016568,0.92035


Using the criterion of accumulating 95% of the importance, the following 13 variables are selected:  
- o  
- f  
- n  
- l  
- a  
- m  
- s  
- br_qty_cat_last7d  
- d  
- br_vl_cat_last7d  
- qty_fraud_cat_last7d  
- h  
- vl_fraud_cat_last7d  