<a href="https://colab.research.google.com/github/JhonnyLimachi/Sigmoidal/blob/main/33_Lidando_com_vari%C3%A1veis_categ%C3%B3ricas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img alt="Colaboratory logo" width="15%" src="https://raw.githubusercontent.com/carlosfab/escola-data-science/master/img/novo_logo_bg_claro.png">

#### **Data Science na Prática 4.0**
*by [sigmoidal.ai](https://sigmoidal.ai)*

---

# Handling categorical variables

In machine learning, many models cannot handle categorical variables directly. It is therefore important to know the main encoding methods and how to apply them.
<center><img src="https://resources.workable.com/wp-content/uploads/2016/01/category-manager-640x230.jpg" width="70%"></center>


In this lesson we will see how to use `LabelEncoder` and `OneHotEncoder`. Beyond that, I will show you some situations where numeric columns are, in fact, categorical variables.

To illustrate these techniques, I will use the breast cancer dataset made available by the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/breast+cancer).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df = pd.read_csv("https://raw.githubusercontent.com/carlosfab/dsnp2/master/datasets/breast-cancer.data", header=None,
                 names=["class", "age", "menopause", "tumor_size",
                        "inv_nodes", "nodes-caps", "deg_malig", "breast",
                        "breast_quad", "irradiat"])

df.head()

| | class | age | menopause | tumor_size | inv_nodes | nodes-caps | deg_malig | breast | breast_quad | irradiat |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | no-recurrence-events | 30-39 | premeno | 30-34 | 0-2 | no | 3 | left | left_low | no |
| 1 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | right | right_up | no |
| 2 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | left | left_low | no |
| 3 | no-recurrence-events | 60-69 | ge40 | 15-19 | 0-2 | no | 2 | right | left_up | no |
| 4 | no-recurrence-events | 40-49 | premeno | 0-4 | 0-2 | no | 2 | right | right_low | no |


In [2]:
X = df.drop('class', axis=1)
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y)

## Label encoding

In label encoding, we assign a number to each category. For example:

* No tumor = `0`
* Benign tumor = `1`
* Malignant tumor = `2`
* Inconclusive = `3`
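
The idea can be sketched in plain Python. This is only an illustration, not the lesson's code: the `labels` list and its category names are made up. The codes here are assigned in alphabetical order, which is also how scikit-learn's `LabelEncoder` orders its `classes_`, so they differ from the hand-picked numbers in the list above.

```python
# Hypothetical labels, used only to illustrate label encoding.
labels = ["benign", "malignant", "no tumor", "benign", "inconclusive"]

# Assign one integer per distinct category, in sorted order.
categories = sorted(set(labels))
mapping = {category: code for code, category in enumerate(categories)}

# Replace each label with its integer code.
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```

Keeping `mapping` around is what makes the encoding reversible, which is the role `inverse_transform` plays in scikit-learn.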


In [3]:
# y_train before encoding
y_train

183    no-recurrence-events
164    no-recurrence-events
3      no-recurrence-events
141    no-recurrence-events
281       recurrence-events
               ...         
52     no-recurrence-events
187    no-recurrence-events
89     no-recurrence-events
215       recurrence-events
256       recurrence-events
Name: class, Length: 214, dtype: object

In [4]:
# y_test before encoding
y_test

251       recurrence-events
198    no-recurrence-events
196    no-recurrence-events
181    no-recurrence-events
199    no-recurrence-events
               ...         
15     no-recurrence-events
171    no-recurrence-events
39     no-recurrence-events
230       recurrence-events
43     no-recurrence-events
Name: class, Length: 72, dtype: object

In [5]:
# encoding the target variable
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [6]:
# y_train after encoding
y_train

array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1])

In [7]:
# y_test after encoding
y_test

array([1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0])

In [8]:
# inspecting the classes (learned during fit)
le.classes_

array(['no-recurrence-events', 'recurrence-events'], dtype=object)

In [9]:
# recovering the original labels from the encoded values
le.inverse_transform(y_train)[:5]

array(['no-recurrence-events', 'no-recurrence-events',
       'no-recurrence-events', 'no-recurrence-events',
       'recurrence-events'], dtype=object)
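
One consequence of fitting the encoder on the training labels only: if the test labels contained a class that never appeared during `fit`, `transform` would raise a `ValueError`. The sketch below reproduces this with made-up labels; `train_labels`, `test_labels`, and the `"inconclusive"` class are hypothetical and not part of the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels: "inconclusive" never appears in the training data.
train_labels = ["no-recurrence-events", "recurrence-events", "no-recurrence-events"]
test_labels = ["recurrence-events", "inconclusive"]

encoder = LabelEncoder()
encoder.fit(train_labels)

try:
    encoder.transform(test_labels)
    unseen_label_error = None
except ValueError as error:
    # LabelEncoder rejects labels it did not see during fit.
    unseen_label_error = error

print(unseen_label_error)
```

With only two classes, as in this dataset, a random split is unlikely to leave one out, but the check is worth keeping in mind for rarer categories.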

## One-hot encoding

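What about when the order does not necessarily represent a real scale of importance? Label encoding turns categories into integers, and a model may read `2 > 1` as a meaningful ranking. One-hot encoding avoids this by giving each category its own binary column.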

<center><img alt="Colaboratory logo" width="45%" src="https://raw.githubusercontent.com/carlosfab/dsnp2/master/img/encoding.png"></center>
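
To make the idea concrete, here is a minimal hand-rolled sketch with NumPy. It is not how `OneHotEncoder` is implemented; the `values` list is a made-up nominal feature, and each row of `one_hot` ends up with a single 1 in the column of its category.

```python
import numpy as np

# Hypothetical nominal feature with no natural order.
values = ["left_low", "right_up", "central", "left_low"]

# One column per distinct category, in sorted order.
categories = sorted(set(values))
column_of = {category: i for i, category in enumerate(categories)}

# Start from all zeros and set a single 1 per row.
one_hot = np.zeros((len(values), len(categories)), dtype=int)
for row, value in enumerate(values):
    one_hot[row, column_of[value]] = 1

print(categories)
print(one_hot)
```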


In [10]:
# X_train before OneHotEncoder
X_train

| | age | menopause | tumor_size | inv_nodes | nodes-caps | deg_malig | breast | breast_quad | irradiat |
|---|---|---|---|---|---|---|---|---|---|
| 183 | 50-59 | ge40 | 30-34 | 9-11 | ? | 3 | left | left_up | yes |
| 164 | 60-69 | ge40 | 25-29 | 3-5 | ? | 1 | right | left_low | yes |
| 3 | 60-69 | ge40 | 15-19 | 0-2 | no | 2 | right | left_up | no |
| 141 | 50-59 | premeno | 20-24 | 3-5 | yes | 2 | left | left_low | no |
| 281 | 30-39 | premeno | 30-34 | 0-2 | no | 2 | left | left_up | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 52 | 50-59 | premeno | 15-19 | 0-2 | no | 1 | left | left_low | no |
| 187 | 60-69 | ge40 | 15-19 | 0-2 | no | 2 | left | left_up | yes |
| 89 | 40-49 | premeno | 40-44 | 0-2 | no | 1 | right | left_up | no |
| 215 | 40-49 | ge40 | 20-24 | 0-2 | no | 2 | right | left_up | no |


In [11]:
from sklearn.preprocessing import OneHotEncoder

# use a separate name so the LabelEncoder stored in `le` is not overwritten
ohe = OneHotEncoder()
ohe.fit(X_train)
X_train_enc = ohe.transform(X_train)

In [12]:
X_train_enc

<214x43 sparse matrix of type '<class 'numpy.float64'>'
	with 1926 stored elements in Compressed Sparse Row format>

In [13]:
X_train_enc.toarray()

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 1.]])

## Dummy variables

A *dummy variable* takes the value 0 or 1 to indicate the absence or presence of a given category. Unlike label encoding, where each category is mapped to a number in a single column, here each category gets its own column, with 0 indicating absence and 1 presence. The result resembles a sparse matrix, since most entries are 0.
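
A small sketch on a toy frame (hypothetical data, not the lesson's dataset) shows the shape of the result. Recent pandas versions (2.0 onward) return boolean columns by default, so `dtype=int` is passed here to get 0/1 values.

```python
import pandas as pd

# Hypothetical single-column frame.
toy = pd.DataFrame({"menopause": ["premeno", "ge40", "lt40", "premeno"]})

# One new column per category, named <column>_<category>.
dummies = pd.get_dummies(toy, columns=["menopause"], dtype=int)

print(dummies)
```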

In [14]:
pd.get_dummies(df, columns=['menopause', 'breast'])

| | class | age | tumor_size | inv_nodes | nodes-caps | deg_malig | breast_quad | irradiat | menopause_ge40 | menopause_lt40 | menopause_premeno | breast_left | breast_right |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | no-recurrence-events | 30-39 | 30-34 | 0-2 | no | 3 | left_low | no | 0 | 0 | 1 | 1 | 0 |
| 1 | no-recurrence-events | 40-49 | 20-24 | 0-2 | no | 2 | right_up | no | 0 | 0 | 1 | 0 | 1 |
| 2 | no-recurrence-events | 40-49 | 20-24 | 0-2 | no | 2 | left_low | no | 0 | 0 | 1 | 1 | 0 |
| 3 | no-recurrence-events | 60-69 | 15-19 | 0-2 | no | 2 | left_up | no | 1 | 0 | 0 | 0 | 1 |
| 4 | no-recurrence-events | 40-49 | 0-4 | 0-2 | no | 2 | right_low | no | 0 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 281 | recurrence-events | 30-39 | 30-34 | 0-2 | no | 2 | left_up | no | 0 | 0 | 1 | 1 | 0 |
| 282 | recurrence-events | 30-39 | 20-24 | 0-2 | no | 3 | left_up | yes | 0 | 0 | 1 | 1 | 0 |
| 283 | recurrence-events | 60-69 | 20-24 | 0-2 | no | 1 | left_up | no | 1 | 0 | 0 | 0 | 1 |
| 284 | recurrence-events | 40-49 | 30-34 | 3-5 | no | 3 | left_low | no | 1 | 0 | 0 | 1 | 0 |


In [15]:
pd.get_dummies(df)

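| | deg_malig | class_no-recurrence-events | class_recurrence-events | age_20-29 | age_30-39 | age_40-49 | age_50-59 | age_60-69 | age_70-79 | menopause_ge40 | ... | breast_left | breast_right | breast_quad_? | breast_quad_central | breast_quad_left_low | breast_quad_left_up | breast_quad_right_low | breast_quad_right_up | irradiat_no | irradiat_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 281 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 282 | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 283 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 284 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |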


In [16]:
df_enc = pd.get_dummies(df)
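
In the output of `pd.get_dummies(df)`, `deg_malig` stays as a single column: pandas reads it as an integer, so it is not expanded, even though it is a grade (the rows shown take the values 1, 2 and 3). Whether to treat it as categorical is a modelling choice. If one does, one option is to name the column explicitly in `columns=`. A sketch on a toy frame (hypothetical data):

```python
import pandas as pd

# Hypothetical frame with one numeric-looking and one text column.
toy = pd.DataFrame({"deg_malig": [3, 2, 2, 1],
                    "breast": ["left", "right", "left", "right"]})

# Default behaviour: only the text column is expanded.
default = pd.get_dummies(toy, dtype=int)

# Naming the columns forces deg_malig to be expanded as well.
explicit = pd.get_dummies(toy, columns=["deg_malig", "breast"], dtype=int)

print(list(default.columns))
print(list(explicit.columns))
```

Casting the column to string or to the pandas `category` dtype before calling `get_dummies` would be another way to get the same expansion.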