# Feature Engineering
- Extract, from the raw data, the best features for our model in order to improve its accuracy
- We will identify which features are best after the exploratory analysis
    - https://www.youtube.com/watch?v=4sxhE3wP3Ug&t=94s

In [1]:
import pandas as pd

In [2]:
baseLimpa = pd.read_excel("ChavesClientesLimpo.xlsx")
baseLimpa.head()

Unnamed: 0,ChaveSituacao,ClassRisco,CatCliente,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco
0,32FC,Ccinza,Basic-Alpha,1,32,F,C,Basic,Alpha,C
1,25MV,AAmarelo,Black,1,25,M,V,Black,Comum,A
2,27MV,B-Amarelo,Basic-Beta,1,27,M,V,Basic,Beta,B-
3,26FD,BPreto,Black,0,26,F,D,Black,Comum,B
4,26FD,C-Amarelo,Black,0,26,F,D,Black,Comum,C-


**We can drop the columns we are not going to use**

In [3]:
baseLimpa = baseLimpa.drop(['ChaveSituacao','ClassRisco','CatCliente'],axis=1)

In [4]:
baseLimpa.head(2)

Unnamed: 0,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco
0,1,32,F,C,Basic,Alpha,C
1,1,25,M,V,Black,Comum,A


**With One Hot Encoding we can handle values that have no order relationship among them**
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
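To make the idea concrete, here is a minimal pure-Python sketch of what one-hot encoding does. The `values` list is illustrative toy data (not the notebook's dataset), and the categories are sorted alphabetically, which is also how `OneHotEncoder` orders them by default.

```python
# Toy values standing in for a column such as 'EstadoCivil' (illustrative data only)
values = ['C', 'V', 'V', 'D', 'S']

# One output column per distinct category, in sorted order
categories = sorted(set(values))

# Each row gets 1.0 in the column of its own category and 0.0 everywhere else
encoded = [[1.0 if value == cat else 0.0 for cat in categories] for value in values]

print(categories)
for row in encoded:
    print(row)
```

Because every row has exactly one 1.0, no category is treated as "greater" than another, which is the point of using this encoding for unordered values.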

In [5]:
# Importing and applying OneHotEncoder to the 'Genero' and 'EstadoCivil' columns
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe_transform = ohe.fit_transform(baseLimpa[['Genero','EstadoCivil']])

In [6]:
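# Feature names
# Note: get_feature_names() was removed in scikit-learn 1.2; newer releases provide get_feature_names_out()
ohe.get_feature_names()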



array(['x0_F', 'x0_M', 'x1_C', 'x1_D', 'x1_S', 'x1_V'], dtype=object)

In [7]:
# Array of values
ohe_transform.toarray()

array([[1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 1., 0., 0.],
       [1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1.],
       [1., 0., 1., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0.],
       [0., 1., 1., 0., 0., 0.],
       [0., 1., 1., 0., 0., 0.],
       [0., 1., 1., 0., 0., 0.],
       [1., 0., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 1.]])

In [8]:
# Turning this data into a DataFrame
df_ohe = pd.DataFrame(ohe_transform.toarray())
df_ohe.columns = ohe.get_feature_names()
df_ohe.head()

Unnamed: 0,x0_F,x0_M,x1_C,x1_D,x1_S,x1_V
0,1.0,0.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,0.0,1.0
3,1.0,0.0,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0


In [9]:
# To finish, we can concatenate the two DataFrames
baseLimpa = pd.concat([baseLimpa,df_ohe],axis=1)

In [10]:
baseLimpa.head(2)

Unnamed: 0,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco,x0_F,x0_M,x1_C,x1_D,x1_S,x1_V
0,1,32,F,C,Basic,Alpha,C,1.0,0.0,1.0,0.0,0.0,0.0
1,1,25,M,V,Black,Comum,A,0.0,1.0,0.0,0.0,0.0,1.0


**If the values do have an order relationship, we can use Ordinal Encoding instead**
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder

In [11]:
# Checking the distribution of values in the 'Categoria' column
baseLimpa.Categoria.value_counts()

Black       7
Platinum    7
Basic       6
Name: Categoria, dtype: int64

In [12]:
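# Importing and applying OrdinalEncoder to the 'Categoria' column
# Note: without the `categories` argument, the codes follow alphabetical order (Basic=0, Black=1, Platinum=2)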
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe_transform = oe.fit_transform(baseLimpa.Categoria.values.reshape(-1, 1))

In [13]:
oe_transform

array([[0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [2.],
       [2.],
       [0.],
       [1.],
       [2.],
       [0.],
       [0.],
       [0.],
       [2.],
       [1.],
       [2.],
       [1.],
       [1.],
       [2.],
       [2.]])
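Since the default ordering is alphabetical, it may not match the intended ranking of the categories. The sketch below inspects the learned order through `categories_` and then passes an explicit one. The order in `assumed_order` is a hypothetical business ranking chosen only for illustration; the source does not state the real ranking of `Categoria`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy column with the three values seen in 'Categoria' (illustrative rows only)
categoria = np.array(['Basic', 'Black', 'Platinum', 'Black']).reshape(-1, 1)

# Default behaviour: categories are sorted alphabetically
oe_default = OrdinalEncoder().fit(categoria)
print(oe_default.categories_)

# Hypothetical ranking (an assumption, not confirmed by the source)
assumed_order = ['Basic', 'Platinum', 'Black']
oe_explicit = OrdinalEncoder(categories=[assumed_order])
codes = oe_explicit.fit_transform(categoria)
print(codes.ravel())
```

With the explicit list, each value is encoded as its position in `assumed_order` rather than its alphabetical rank.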

In [14]:
# And we can add this as a new column
baseLimpa['NrCategoria'] = oe_transform

In [15]:
# Viewing the DataFrame
baseLimpa.head(2)

Unnamed: 0,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco,x0_F,x0_M,x1_C,x1_D,x1_S,x1_V,NrCategoria
0,1,32,F,C,Basic,Alpha,C,1.0,0.0,1.0,0.0,0.0,0.0,0.0
1,1,25,M,V,Black,Comum,A,0.0,1.0,0.0,0.0,0.0,1.0,1.0


In [16]:
# Doing the same for the 'Risco' column, this time passing the category order explicitly
oe = OrdinalEncoder(categories=[['C-','C','C+','B-','B','B+','A-','A','A+']])
oe_transform_risco = oe.fit_transform(baseLimpa.Risco.values.reshape(-1, 1))
oe_transform_risco

array([[1.],
       [7.],
       [3.],
       [4.],
       [0.],
       [0.],
       [6.],
       [0.],
       [6.],
       [2.],
       [7.],
       [0.],
       [4.],
       [6.],
       [1.],
       [3.],
       [6.],
       [0.],
       [1.],
       [7.]])

In [17]:
baseLimpa['NrRisco'] = oe_transform_risco

In [18]:
baseLimpa.head(2)

Unnamed: 0,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco,x0_F,x0_M,x1_C,x1_D,x1_S,x1_V,NrCategoria,NrRisco
0,1,32,F,C,Basic,Alpha,C,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
1,1,25,M,V,Black,Comum,A,0.0,1.0,0.0,0.0,0.0,1.0,1.0,7.0


**Finally, we can write functions to transform columns, for example turning CatVIP into a flag that only indicates whether the customer is VIP or not**

In [19]:
# Creating a function that checks whether the customer is VIP
def define_VIP(valor):
    if valor == 'Alpha' or valor == 'Beta':
        return 1
    else:
        return 0

In [20]:
# Applying this function to the 'CatVIP' column
baseLimpa['NrVIP'] = baseLimpa.CatVIP.apply(define_VIP)
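As an alternative to `apply`, the same flag can be sketched with the vectorized `Series.isin`. The frame `toy` below is illustrative data that only reuses the `CatVIP` column name, and it assumes, as `define_VIP` does, that 'Alpha' and 'Beta' are the only VIP values.

```python
import pandas as pd

# Toy frame reusing the 'CatVIP' column name (illustrative values only)
toy = pd.DataFrame({'CatVIP': ['Alpha', 'Comum', 'Beta', 'Comum']})

# isin() returns a boolean Series; astype(int) turns True/False into 1/0
toy['NrVIP'] = toy['CatVIP'].isin(['Alpha', 'Beta']).astype(int)
print(toy)
```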

In [21]:
baseLimpa.head(2)

Unnamed: 0,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco,x0_F,x0_M,x1_C,x1_D,x1_S,x1_V,NrCategoria,NrRisco,NrVIP
0,1,32,F,C,Basic,Alpha,C,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1
1,1,25,M,V,Black,Comum,A,0.0,1.0,0.0,0.0,0.0,1.0,1.0,7.0,0


**Dropping the unnecessary columns again**

In [22]:
# Removing the columns that are no longer needed
baseLimpa = baseLimpa.drop(['Genero','EstadoCivil','Categoria','CatVIP','Risco'],axis=1)
baseLimpa.head()

Unnamed: 0,Pagamento,Idade,x0_F,x0_M,x1_C,x1_D,x1_S,x1_V,NrCategoria,NrRisco,NrVIP
0,1,32,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1
1,1,25,0.0,1.0,0.0,0.0,0.0,1.0,1.0,7.0,0
2,1,27,0.0,1.0,0.0,0.0,0.0,1.0,0.0,3.0,1
3,0,26,1.0,0.0,0.0,1.0,0.0,0.0,1.0,4.0,0
4,0,26,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0


### Using the data again in a Linear Regression model

In [23]:
# Selecting the values of X and y
X = baseLimpa.drop('Pagamento',axis=1)
y = baseLimpa.Pagamento

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X, y)

# score() returns the coefficient of determination (R²), computed here on the training data
reg.score(X,y)

0.6197521275369209
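The number above is an R² value, not a classification accuracy, and it is measured on the same rows used to fit the model. As a minimal sketch of what `score` computes, the example below reproduces R² by hand on toy data. The arrays `X_toy` and `y_toy` are made up for illustration and are unrelated to the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (illustrative only): one numeric feature and a binary target
X_toy = np.array([[20.0], [25.0], [30.0], [35.0], [40.0], [45.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1])

reg_toy = LinearRegression().fit(X_toy, y_toy)
pred = reg_toy.predict(X_toy)

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, reg_toy.score(X_toy, y_toy))
```

The two printed values should agree up to floating-point error, since `LinearRegression.score` uses this same definition.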