<a href="https://colab.research.google.com/github/SarahFSBorges/data.science/blob/main/Vari%C3%A1veis_categ%C3%B3ricas_(dados_de_c%C3%A2ncer_de_mama).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This exercise is based on the Data Science na Prática 3.0 course by Sigmoidal.

# Handling categorical variables

Many machine learning models cannot work directly with categorical variables, so the categories must first be converted to numbers.

Two common tools for this conversion are scikit-learn's `LabelEncoder` and `OneHotEncoder`.
The dataset used here is the breast cancer dataset made available by the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/breast+cancer).

In [None]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# load the data (the file has no header row, so column names are given explicitly)
df = pd.read_csv("https://raw.githubusercontent.com/carlosfab/dsnp2/master/datasets/breast-cancer.data", header=None,
                 names=["class","age","menopause","tumor_size",
                        "inv_nodes","nodes-caps","deg_malig","breast",
                        "breast_quad","irradiat"])
df.head()

Unnamed: 0,class,age,menopause,tumor_size,inv_nodes,nodes-caps,deg_malig,breast,breast_quad,irradiat
0,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
3,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
4,no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no


In [None]:
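# split the data into training and test sets
X = df.drop('class', axis=1)
y = df['class']

# default split is 75% train / 25% test; no random_state is set, so the rows selected vary between runs
X_train, X_test, y_train, y_test = train_test_split(X, y)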
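
Because the split above is random, the rows shown in later outputs change on every run. A minimal sketch of a reproducible, class-balanced split follows, assuming a small hypothetical frame (`demo`) rather than the real dataset. The `random_state` value of 42 is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df (not the real dataset)
demo = pd.DataFrame({
    "feature": list("abcdefgh"),
    "class": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
})
X_demo = demo.drop("class", axis=1)
y_demo = demo["class"]

# random_state fixes the shuffle; stratify keeps the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))  # 6 2
```

With `stratify`, the two test rows contain one example of each class, which matters when one class is much rarer than the other.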

## Label encoding

Label encoding assigns an integer to each category. For example:

* No tumor = `0`
* Benign tumor = `1`
* Malignant tumor = `2`
* Inconclusive = `3`
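
The mapping in the list above can be sketched by hand with a plain dictionary. The labels are the illustrative ones from the list, not values from the breast cancer dataset, and the variable names are hypothetical.

```python
# Illustrative labels taken from the example list (not from the dataset)
labels = ["No tumor", "Benign tumor", "Malignant tumor", "Inconclusive", "Benign tumor"]

# Explicit category -> integer mapping
mapping = {"No tumor": 0, "Benign tumor": 1, "Malignant tumor": 2, "Inconclusive": 3}

# Encode: replace each label by its integer code
encoded = [mapping[label] for label in labels]
print(encoded)  # [0, 1, 2, 3, 1]

# Decode: invert the mapping to recover the original labels
inverse = {code: label for label, code in mapping.items()}
decoded = [inverse[code] for code in encoded]
```

Note that scikit-learn's `LabelEncoder` does not take a hand-written mapping like this one: it assigns the integers according to the sorted order of the class names.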


In [None]:
# y_train before encoding
y_train

109    no-recurrence-events
51     no-recurrence-events
243       recurrence-events
104    no-recurrence-events
170    no-recurrence-events
               ...         
130    no-recurrence-events
13     no-recurrence-events
42     no-recurrence-events
49     no-recurrence-events
135    no-recurrence-events
Name: class, Length: 214, dtype: object

In [None]:
# y_test before encoding
y_test

100    no-recurrence-events
234       recurrence-events
204       recurrence-events
138    no-recurrence-events
175    no-recurrence-events
               ...         
66     no-recurrence-events
222       recurrence-events
117    no-recurrence-events
92     no-recurrence-events
140    no-recurrence-events
Name: class, Length: 72, dtype: object

In [None]:
# encode the target variable
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(y_train)  # fit only learns the classes; it does not modify y_train
le.classes_  # shows that there are two event classes

array(['no-recurrence-events', 'recurrence-events'], dtype=object)

In [None]:
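# y_train after fitting the encoder: fit() only learns the classes, so the labels
# remain strings until le.transform(y_train) is called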
y_train

109    no-recurrence-events
51     no-recurrence-events
243       recurrence-events
104    no-recurrence-events
170    no-recurrence-events
               ...         
130    no-recurrence-events
13     no-recurrence-events
42     no-recurrence-events
49     no-recurrence-events
135    no-recurrence-events
Name: class, Length: 214, dtype: object

In [None]:
# y_test after fitting the encoder: still unchanged, since le.transform(y_test) has not been called
y_test

100    no-recurrence-events
234       recurrence-events
204       recurrence-events
138    no-recurrence-events
175    no-recurrence-events
               ...         
66     no-recurrence-events
222       recurrence-events
117    no-recurrence-events
92     no-recurrence-events
140    no-recurrence-events
Name: class, Length: 72, dtype: object

In [None]:
# inspect the classes learned during fit
le.classes_

array(['no-recurrence-events', 'recurrence-events'], dtype=object)
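
The cells above call only `fit`, which is why `y_train` and `y_test` still hold strings. A minimal sketch of the missing `transform` step follows, using small hypothetical series (`y_train_demo`, `y_test_demo`) with the same two class names instead of the real split.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for y_train and y_test
y_train_demo = pd.Series(["no-recurrence-events", "recurrence-events", "no-recurrence-events"])
y_test_demo = pd.Series(["recurrence-events", "no-recurrence-events"])

encoder = LabelEncoder()
encoder.fit(y_train_demo)  # learn the classes from the training labels only

# transform returns a NumPy array of integer codes; it does not modify its input
y_train_enc = encoder.transform(y_train_demo)
y_test_enc = encoder.transform(y_test_demo)

print(y_train_enc)  # [0 1 0]
print(y_test_enc)   # [1 0]

# inverse_transform maps the codes back to the original strings
restored = encoder.inverse_transform(y_test_enc)
```

The codes follow the sorted order of `classes_`, so `no-recurrence-events` becomes 0 and `recurrence-events` becomes 1.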

## One-hot encoding

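Label encoding implies an order between the codes (0 < 1 < 2). But what if that order does not correspond to any real scale of importance? One-hot encoding avoids the problem by giving each category its own binary column.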

<center><img alt="Encoding illustration" width="45%" src="https://raw.githubusercontent.com/carlosfab/dsnp2/master/img/encoding.png"></center>


In [None]:
# X_train before OneHotEncoder
X_train

Unnamed: 0,age,menopause,tumor_size,inv_nodes,nodes-caps,deg_malig,breast,breast_quad,irradiat
109,60-69,ge40,30-34,0-2,no,1,right,left_up,no
51,30-39,premeno,20-24,0-2,no,2,left,right_low,no
243,50-59,ge40,20-24,3-5,yes,3,right,right_up,no
104,40-49,premeno,10-14,0-2,no,2,right,left_low,no
170,30-39,premeno,20-24,3-5,yes,2,right,left_up,yes
...,...,...,...,...,...,...,...,...,...
130,40-49,premeno,35-39,9-11,yes,2,right,right_up,yes
13,50-59,ge40,25-29,0-2,no,3,left,right_up,no
42,60-69,ge40,5-9,0-2,no,1,left,central,no
49,40-49,premeno,20-24,0-2,no,1,right,left_low,no


In [None]:
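# import OneHotEncoder from sklearn.preprocessing and instantiate it
# (named ohe so that it does not overwrite the LabelEncoder stored in le)
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit(X_train)  # learn the categories of every column
X_train_enc = ohe.transform(X_train)  # store the encoded data in a new variable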

In [None]:
# inspect the encoded training data
X_train_enc
# A sparse matrix is one in which most elements hold a default value (typically zero),
# so only the non-default entries need to be stored.

<214x43 sparse matrix of type '<class 'numpy.float64'>'
	with 1926 stored elements in Compressed Sparse Row format>
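
The summary above reports 1926 stored elements in a 214x43 matrix. A quick arithmetic check of those numbers follows, assuming that each of the 9 feature columns contributes exactly one 1 per row (which is how one-hot encoding works when every value is a known category).

```python
# Figures taken from the sparse matrix summary above
n_rows, n_cols = 214, 43
stored = 1926

# X has 9 feature columns, and each row gets exactly one 1 per original column
n_features = 9
print(n_rows * n_features)  # 1926

# Fraction of cells that are non-zero
total_cells = n_rows * n_cols  # 9202
density = stored / total_cells
print(f"{density:.1%}")  # 20.9%
```

Roughly four out of five cells are zero, which is why scikit-learn returns a sparse matrix by default instead of a dense array.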

In [None]:
# view the encoded training data as a dense NumPy array
X_train_enc.toarray()

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.]])

## Dummy variables

A *dummy variable* takes the value 0 or 1 to indicate the absence or presence of a given category. Unlike label encoding, where each category is assigned a numeric code within a single column, here we build a kind of sparse matrix in which each category gets its own column, with 0 indicating absence and 1 indicating presence.

In [None]:
# inspect the data
df.head()

Unnamed: 0,class,age,menopause,tumor_size,inv_nodes,nodes-caps,deg_malig,breast,breast_quad,irradiat
0,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
3,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
4,no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no


In [None]:
# create dummy variables: each category gets a column with 0 for absence and 1 for presence
pd.get_dummies(df, columns=['menopause','breast'])

Unnamed: 0,class,age,tumor_size,inv_nodes,nodes-caps,deg_malig,breast_quad,irradiat,menopause_ge40,menopause_lt40,menopause_premeno,breast_left,breast_right
0,no-recurrence-events,30-39,30-34,0-2,no,3,left_low,no,0,0,1,1,0
1,no-recurrence-events,40-49,20-24,0-2,no,2,right_up,no,0,0,1,0,1
2,no-recurrence-events,40-49,20-24,0-2,no,2,left_low,no,0,0,1,1,0
3,no-recurrence-events,60-69,15-19,0-2,no,2,left_up,no,1,0,0,0,1
4,no-recurrence-events,40-49,0-4,0-2,no,2,right_low,no,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
281,recurrence-events,30-39,30-34,0-2,no,2,left_up,no,0,0,1,1,0
282,recurrence-events,30-39,20-24,0-2,no,3,left_up,yes,0,0,1,1,0
283,recurrence-events,60-69,20-24,0-2,no,1,left_up,no,1,0,0,0,1
284,recurrence-events,40-49,30-34,3-5,no,3,left_low,no,1,0,0,1,0


In [None]:
# convert every categorical (object) column into dummy variables
df_enc = pd.get_dummies(df)
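
A small sketch of two `pd.get_dummies` options that are often useful alongside the calls above, shown on a hypothetical three-row frame (`demo`). In pandas 2.0 and later the dummy columns are boolean by default, so `dtype=int` is passed here to get the 0/1 values shown in the output above. `drop_first=True` is an optional variation, not something the notebook uses.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
demo = pd.DataFrame({"menopause": ["premeno", "ge40", "lt40"],
                     "deg_malig": [3, 2, 1]})

# One column per category, with integer 0/1 values
dummies = pd.get_dummies(demo, columns=["menopause"], dtype=int)
print(dummies.columns.tolist())
# ['deg_malig', 'menopause_ge40', 'menopause_lt40', 'menopause_premeno']

# drop_first=True drops the first category of each encoded column,
# since it is implied when all the remaining dummies are 0
reduced = pd.get_dummies(demo, columns=["menopause"], dtype=int, drop_first=True)
print(reduced.columns.tolist())
# ['deg_malig', 'menopause_lt40', 'menopause_premeno']
```

Dropping one level removes a redundant column, which can help with linear models where perfectly collinear inputs cause problems.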