Machine Learning Mastery
    
    AUTHOR: Dr. Jason Brownlee 

### Ordinal and One-Hot Encodings for Categorical Data

In [3]:
import pandas as pd
from numpy import asarray
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

**NOTE.** Only the encoding code is included, since that is the part relevant to the topic of feature engineering.

An example of **binary encoding** is also included.

## Ordinal Encoding

In [4]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

[['red']
 ['green']
 ['blue']]
[[2.]
 [1.]
 [0.]]
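
By default `OrdinalEncoder` sorts the categories, which is why the output above maps blue to 0, green to 1 and red to 2. The following is a minimal sketch of how the learned mapping could be inspected, how an explicit order could be supplied through the `categories` parameter, and how the codes can be mapped back to labels. The order in `color_order` is an arbitrary assumption chosen only for illustration.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

data = asarray([['red'], ['green'], ['blue']])

# default behaviour: categories are sorted, so blue=0, green=1, red=2
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(data)
print(default_encoder.categories_)

# explicit order (an arbitrary choice for illustration): red=0, green=1, blue=2
color_order = [['red', 'green', 'blue']]
ordered_encoder = OrdinalEncoder(categories=color_order)
ordered_codes = ordered_encoder.fit_transform(data)
print(ordered_codes)

# map the integer codes back to the original labels
restored = ordered_encoder.inverse_transform(ordered_codes)
print(restored)
```

Supplying the order explicitly matters when the categories have a natural ranking that alphabetical sorting would not respect.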


## One-Hot Encoding

In [5]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4
encoder = OneHotEncoder(sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']]
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
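
Older notebooks pass `sparse=False` to `OneHotEncoder`, which recent scikit-learn releases reject with a `TypeError` because the argument is now called `sparse_output`. If the same notebook has to run on both old and new versions, one possible approach is to check which keyword the installed version accepts. The helper name `make_dense_onehot` is hypothetical and this is only a sketch, not part of scikit-learn.

```python
import inspect

from numpy import asarray
from sklearn.preprocessing import OneHotEncoder


def make_dense_onehot(**kwargs):
    """Hypothetical helper: build a OneHotEncoder that returns dense arrays."""
    # look up which keyword the installed scikit-learn version understands
    params = inspect.signature(OneHotEncoder.__init__).parameters
    dense_flag = 'sparse_output' if 'sparse_output' in params else 'sparse'
    return OneHotEncoder(**{dense_flag: False}, **kwargs)


data = asarray([['red'], ['green'], ['blue']])
encoder = make_dense_onehot()
onehot = encoder.fit_transform(data)
# columns follow the sorted categories: blue, green, red
print(encoder.categories_)
print(onehot)
```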

## Dummy Variable Encoding

In [None]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define dummy variable encoding (one-hot with the first category dropped)
encoder = OneHotEncoder(drop='first', sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)
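
Dummy variable encoding drops one column because it is redundant: the dropped category is represented by a row of all zeros. As a rough cross-check, the same idea can be sketched with `pandas.get_dummies`, which is a different tool from the `OneHotEncoder` used above. The column name `color` is an assumption made for this illustration.

```python
import pandas as pd

# toy frame; the column name 'color' is hypothetical
colors = pd.DataFrame({'color': ['red', 'green', 'blue']})

# drop_first=True removes the first (sorted) category, here 'blue'
dummies = pd.get_dummies(colors, columns=['color'], drop_first=True, dtype=float)
print(dummies)
```

Here `blue` has no column of its own and appears as zeros in both `color_green` and `color_red`.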

## Binary Encoding

Binary encoding is a categorical encoding technique that uses binary code, that is, a sequence of zeros and ones, to represent the different categories of a variable.

Binary encoding represents the data in fewer dimensions than one-hot encoding. In general, the number of binary features needed to encode a variable is log2(number of distinct categories), rounded up to the next integer. This is particularly useful for high-cardinality variables. For example, if a variable contains 128 unique categories, one-hot encoding needs 128 features (127 if one redundant column is dropped, as in dummy variable encoding), whereas binary encoding needs only 7 (log2(128) = 7).
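
The column counts discussed above can be checked with a few lines of plain Python. This is a sketch of the arithmetic only; the function names are hypothetical, and a particular library may assign codes differently and therefore produce a different number of columns.

```python
def onehot_columns(n_categories, drop_first=False):
    """Columns produced by one-hot encoding (one fewer if a column is dropped)."""
    return n_categories - 1 if drop_first else n_categories


def binary_columns(n_categories):
    """Bits needed to give each category a distinct code 0..n-1.

    (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, without
    floating-point rounding; at least one column is always used.
    """
    return max(1, (n_categories - 1).bit_length())


for n in (3, 6, 128):
    print(n, onehot_columns(n), onehot_columns(n, drop_first=True), binary_columns(n))
```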

In [None]:
!pip install category_encoders
from category_encoders.binary import BinaryEncoder

In [None]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define binary encoding
encoder = BinaryEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)
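
To make the mechanics concrete without depending on `category_encoders`, binary encoding can be sketched by hand: give each category an integer code, then spread the bits of that code across columns. The helper `binary_encode` below is hypothetical and is not the `BinaryEncoder` implementation, so the specific codes it assigns may differ from the library's output.

```python
import numpy as np
import pandas as pd


def binary_encode(values, prefix='col'):
    """Hypothetical helper: binary-encode a 1-D sequence of categories.

    Assumes there are no missing values.
    """
    series = pd.Series(values)
    # integer code per category, numbered in order of first appearance
    codes, uniques = pd.factorize(series)
    # number of bits needed for codes 0..n-1
    width = max(1, (len(uniques) - 1).bit_length())
    # most significant bit first
    shifts = np.arange(width - 1, -1, -1)
    bits = (codes[:, None] >> shifts) & 1
    columns = [f'{prefix}_{i}' for i in range(width)]
    return pd.DataFrame(bits, columns=columns, index=series.index)


encoded = binary_encode(['red', 'green', 'blue', 'red'], prefix='color')
print(encoded)
```

Three categories fit in two columns here, compared with three columns for one-hot encoding.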

# Breast Cancer Dataset
## OrdinalEncoder Transform

In [None]:
# load dataset
dataset = pd.read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values

In [None]:
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

In [None]:
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
X = ordinal_encoder.fit_transform(X)

In [None]:
# label encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
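
`LabelEncoder` is intended for the target column: it maps each class label to an integer and keeps the mapping in `classes_`. A small sketch of inspecting and reversing that mapping follows; the `labels` list is made-up illustration data, not taken from the breast cancer dataset.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical target labels for illustration only
labels = ['no', 'yes', 'no', 'no', 'yes']

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

# classes_ lists the labels in the order of their integer codes
print(label_encoder.classes_)
print(encoded)

# recover the original labels from the integer codes
decoded = label_encoder.inverse_transform(encoded)
print(decoded)
```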

In [None]:
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])
print('Output', y.shape)
print(y[:5])

In [None]:
# Same example as above, working with the DataFrame directly
dataset.columns = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat','Class']
# specify the columns to encode
categorical_variables = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat']
ordinal_encoder = OrdinalEncoder()
ordinal_ar = ordinal_encoder.fit_transform(dataset[categorical_variables])
# convert the returned ndarray to a DataFrame
ordinal_df = pd.DataFrame(ordinal_ar, columns=categorical_variables)
ordinal_df

# Breast Cancer Dataset
## OneHotEncoder Transform

In [None]:
# load dataset
dataset = pd.read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values

In [None]:
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

In [None]:
# one hot encode input variables
onehot_encoder = OneHotEncoder(drop='first', sparse_output=False)
X = onehot_encoder.fit_transform(X)

In [None]:
# label encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

In [None]:
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])

In [None]:
# Same example as above, working with the DataFrame directly
dataset.columns = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat','Class']
# specify the columns to encode
categorical_variables = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat']
onehot_encoder = OneHotEncoder(drop='first', sparse_output=False)
onehot_ar = onehot_encoder.fit_transform(dataset[categorical_variables])
# convert the returned ndarray to a DataFrame
onehot_df = pd.DataFrame(onehot_ar)
onehot_df.columns = onehot_encoder.get_feature_names_out()
onehot_df

# Breast Cancer Dataset
## BinaryEncoder Transform

In [None]:
dataset = pd.read_csv('breast-cancer.csv', header=None)
dataset.columns = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat','Class']
# specify the columns to encode
categorical_variables = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat']
binary_encoder = BinaryEncoder()
binary_df = binary_encoder.fit_transform(dataset[categorical_variables])
# fit_transform returns a DataFrame with named columns directly
binary_df

In [None]:
print(dataset['age'].nunique())
print(dataset['age'].unique())

The 6 categories in `age` were transformed into:

*   (6 - 1) = 5 columns with one-hot encoding and `drop='first'` (`age_'30-39'`, `age_'40-49'`, `age_'50-59'`, `age_'60-69'`, `age_'70-79'`)
*   3 columns with binary encoding (`age_0`, `age_1`, `age_2`), since log2(6) = 2.5849... rounds up to 3

