# Worksheet \# 4

---


by Josué Obregón <br>
DS6011 - Feature Engineering <br>
UVG Masters - Escuela de Negocios<br>


## Objectives

This worksheet introduces several techniques for encoding categorical variables: classic encoders, contrast encoders, and supervised (Bayesian) encoders.

It also gives the student practice applying these techniques with the libraries available in Python.


## Importing libraries and loading the data into pandas [DataFrames](https://pandas.pydata.org/pandas-docs/version/1.1.5/reference/frame.html)




In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [None]:
!mkdir -p data

In [None]:
import gdown

urls = ['https://drive.google.com/uc?export=download&id=16AGQw1nM9NYILv2aSZaSNSn9jBPByWPq', # okc_train  https://drive.google.com/file/d/16AGQw1nM9NYILv2aSZaSNSn9jBPByWPq/view?usp=sharing
        ]
outputs = ['okc_train.csv']
for url,output in zip(urls,outputs):
  gdown.download(url, f'data/{output}', quiet=False)

In [None]:
df = pd.read_csv('data/okc_train.csv',index_col=0)

In [None]:
df.head()

In [None]:
df['drinks'].value_counts()

In [None]:
df['status'].value_counts()

A small test dataset used for some of the explanations and for the figures in the slides.


In [None]:
df_test = pd.DataFrame({
    'City': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC',
             'Seattle', 'Seattle', 'Seattle'],
    'Rent': [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]
})

In [None]:
df_test

For this worksheet we will use the [category_encoders](http://contrib.scikit-learn.org/category_encoders/index.html) library, which is compatible with scikit-learn.

In [None]:
!pip install category_encoders

In [None]:
col = 'drinks'

# Unsupervised Encoders

## Classic Encoders

### One-Hot Encoder

In [None]:
from category_encoders import OneHotEncoder

In [None]:
onehot_enc = OneHotEncoder(use_cat_names=True  )

In [None]:
onehot_enc.fit_transform(df[[col]])

In [None]:
onehot_enc.mapping[0]['mapping']  # mapping learned for the first (and only) column

### Binary Encoder

In [None]:
from category_encoders import BinaryEncoder

In [None]:
bin_encoder = BinaryEncoder()

In [None]:
bin_encoder.fit_transform(df[col])

In [None]:
bin_encoder.mapping

In [None]:
bin_encoder.mapping[0]
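
The idea behind binary encoding can be sketched in plain pandas: each category first receives an integer index, and the binary digits of that index become the output columns. This is a minimal illustration, not the `category_encoders` implementation; the helper `binary_encode`, the toy series, and the choice to number categories from 1 in order of first appearance are assumptions made for the example.

```python
import pandas as pd

def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: ordinal index -> binary digits, one column per bit."""
    # Number the categories 1, 2, 3, ... in order of first appearance (assumption).
    codes = pd.factorize(series)[0] + 1
    n_bits = max(int(codes.max()).bit_length(), 1)
    # Extract the bits from most significant to least significant.
    bits = {
        f"{series.name}_{i}": (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)

toy = pd.Series(["rarely", "socially", "often", "socially", "not_at_all"], name="drinks")
encoded = binary_encode(toy)
print(encoded)
```

The number of columns grows with the logarithm of the number of categories, which is why binary encoding is more compact than one-hot encoding for high-cardinality variables.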

### Frequency or Count Encoder

In [None]:
from category_encoders import CountEncoder

In [None]:
count_enc = CountEncoder( )

In [None]:
count_enc.fit_transform(df[col])

In [None]:
count_enc.mapping

In [None]:
df[col].value_counts()
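
As a cross-check, what a count encoder computes can be sketched with plain pandas: every value is replaced by how often its category occurs. The toy series below is invented for the illustration, and the relative-frequency variant is shown only as the natural normalized counterpart.

```python
import pandas as pd

toy = pd.Series(["socially", "rarely", "socially", "often", "socially", "rarely"], name="drinks")

counts = toy.value_counts()                  # category -> absolute frequency
count_encoded = toy.map(counts)              # replace each value by its category count
freq_encoded = toy.map(counts / len(toy))    # relative-frequency variant

print(pd.DataFrame({"drinks": toy, "count": count_encoded, "freq": freq_encoded}))
```

Note that two categories with the same frequency receive the same code, so the encoding is not always invertible.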

Now let's try the options for combining small groups.

In [None]:
count_enc = CountEncoder(combine_min_nan_groups=True, min_group_size=500, min_group_name='otros', normalize=True)

In [None]:
count_enc.fit_transform(df[col])

In [None]:
count_enc.mapping

### Ordinal Encoder

In [None]:
from category_encoders import OrdinalEncoder

In [None]:
ord_enc = OrdinalEncoder( )

In [None]:
ord_enc.fit_transform(df[col])

In [None]:
ord_enc.category_mapping[0]['mapping']

In [None]:
drink_dict =  {None: 0, 'not_at_all': 1,
               'rarely': 2, 'socially': 3,
               'often': 4,'very_often': 5 ,
               'desperately': 6, 'drinks_missing':-1}
col_drink_dict = {'col': 'drinks', 'mapping': drink_dict}

In [None]:
ord_enc = OrdinalEncoder(mapping=[col_drink_dict] )

In [None]:
ord_enc.fit_transform(df[col], )

In [None]:
ord_enc.category_mapping[0]['mapping']

### Feature Hashing

In [None]:
from sklearn.feature_extraction import FeatureHasher

In [None]:
col = 'where_town'

In [None]:
df[col].describe()

In [None]:
df[col].head()

In [None]:
hash_enc = FeatureHasher(n_features=8, input_type='string', alternate_sign=True)

In [None]:
hashed_features = hash_enc.fit_transform([[x] for x in df[col]])

In [None]:
hashed_features.toarray()

In [None]:
np_hashed = np.array(hashed_features.todense())

In [None]:
np.unique(np_hashed,axis=0).shape

In [None]:
np.unique(np_hashed,axis=0)

In [None]:
from sys import getsizeof

In [None]:
print('Our pandas Series, in bytes: ', getsizeof(df[col]))
# Measure the dense NumPy array; getsizeof on the sparse matrix would not count its data buffers.
print('Our hashed numpy array, in bytes: ', getsizeof(np_hashed))
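
The hashing trick itself can be sketched in a few lines of standard-library Python: a hash of the string picks one of `n_features` buckets, and another bit of the hash decides the sign. This is only an illustration under stated assumptions; the helper `hash_feature` and the use of MD5 are choices made for the example (scikit-learn uses a different hash function), so the bucket assignments will not match the `FeatureHasher` output.

```python
import hashlib

def hash_feature(value: str, n_features: int = 8) -> list:
    """Hypothetical helper: map one string to a signed, one-hot-like vector."""
    digest = int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)
    index = digest % n_features              # bucket chosen by the hash
    sign = 1 if (digest >> 64) & 1 else -1   # a different bit decides the sign
    vector = [0] * n_features
    vector[index] = sign
    return vector

towns = ["san_francisco", "oakland", "berkeley", "san_francisco"]
hashed = [hash_feature(t) for t in towns]
for town, row in zip(towns, hashed):
    print(town, row)
```

Because the output width is fixed in advance, distinct categories can collide in the same bucket; counting the unique rows, as done above with `np.unique`, shows how many distinct codes survive.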

## Contrast Encoders

### Difference between One-Hot Encoding and Dummy Coding

In [None]:
df_test

In [None]:
print('Global mean: ',df_test['Rent'].mean())
print(df_test.groupby('City').mean())

In [None]:
one_hot_df = pd.get_dummies( df_test, prefix=['city'] )

In [None]:
one_hot_df

In [None]:
one_hot_df[['city_NYC','city_SF','city_Seattle']].drop_duplicates()

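With one-hot encoding, the intercept represents the global mean of the target variable, `Rent`, and each linear coefficient represents how much that city's mean rent differs from the global mean. Note that the one-hot columns always sum to 1, so they are collinear with the intercept and the coefficients are not uniquely determined; this reading applies to the particular solution that `LinearRegression` returns for this balanced example.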

In [None]:
from sklearn import linear_model

In [None]:
model = linear_model.LinearRegression()
model.fit(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']],one_hot_df['Rent'])

In [None]:
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')

In [None]:
print('Global mean: ',df_test['Rent'].mean())
print(df_test.groupby('City').mean())
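
The interpretation of the one-hot coefficients depends on which solution the solver picks, because an intercept plus a full set of one-hot columns is a rank-deficient design. The NumPy sketch below illustrates this with the test data: it checks the rank of the design matrix and shows two hand-written parameter vectors (not fitted, chosen for the example) that yield identical predictions.

```python
import numpy as np

rent = np.array([3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501], dtype=float)
# One-hot columns in the order NYC, SF, Seattle (rows follow df_test: SF, NYC, Seattle).
onehot = np.array([[0, 1, 0]] * 3 + [[1, 0, 0]] * 3 + [[0, 0, 1]] * 3, dtype=float)
design = np.column_stack([np.ones(len(rent)), onehot])  # intercept + one-hot columns

# Four columns but only rank 3: the one-hot columns add up to the intercept column.
rank = np.linalg.matrix_rank(design)

# Two different parameter vectors (intercept, NYC, SF, Seattle) ...
global_mean_solution = np.array([10000 / 3, 500 / 3, 2000 / 3, -2500 / 3])
nyc_reference_solution = np.array([3500.0, 0.0, 500.0, -1000.0])

# ... that produce exactly the same fitted values (the city means).
pred_a = design @ global_mean_solution
pred_b = design @ nyc_reference_solution
print(rank, np.allclose(pred_a, pred_b))
```

Dropping one column (dummy coding) or using a contrast scheme removes this redundancy and makes the solution unique.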

Now let's train the same model with dummy coding.

In [None]:
dummy_df = pd.get_dummies(df_test, prefix=['city'], drop_first=True)
dummy_df

With dummy coding, the intercept represents the mean of the response $y$ for the reference category, which in this example is NYC (the category that `drop_first=True` removes). The coefficient for the *i*-th feature equals the difference between the mean response for the *i*-th category and the mean of the reference category.

In [None]:
model.fit(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')

In [None]:
print('Global mean: ',df_test['Rent'].mean())
print(df_test.groupby('City').mean())
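
As a worked check with the city means of the test dataset (SF 4000, NYC 3500, Seattle 2500), and assuming NYC is the dropped reference category, the fitted dummy-coded model should come out as:

$$
\begin{aligned}
\hat{y} &= \beta_0 + \beta_{\text{SF}}\, x_{\text{SF}} + \beta_{\text{Seattle}}\, x_{\text{Seattle}} \\
\beta_0 &= \bar{y}_{\text{NYC}} = 3500 \\
\beta_{\text{SF}} &= \bar{y}_{\text{SF}} - \bar{y}_{\text{NYC}} = 4000 - 3500 = 500 \\
\beta_{\text{Seattle}} &= \bar{y}_{\text{Seattle}} - \bar{y}_{\text{NYC}} = 2500 - 3500 = -1000
\end{aligned}
$$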

### Dummy Coding

In [None]:
col='drinks'

In [None]:
from sklearn.preprocessing import OneHotEncoder as OneHotEncoder_sk

In [None]:
# sparse_output requires scikit-learn >= 1.2; older releases use sparse=False instead
dummy_enc = OneHotEncoder_sk(drop='first', sparse_output=False)

In [None]:
dummy_enc.fit_transform(df[[col]])

In [None]:
dummy_enc.categories_

In [None]:
dummy_df2 = pd.get_dummies(df[col], prefix='dr', drop_first=True)  # dummy_na=True would add a column for NaN
dummy_df2.head()

In [None]:
dummy_df2.drop_duplicates()


### Sum (or Deviation) Coding

In [None]:
from category_encoders import SumEncoder

In [None]:
sum_enc = SumEncoder()

In [None]:
sum_enc.fit_transform(df[[col]])

In [None]:
sum_enc.mapping[0]['mapping']

Using the test example

In [None]:
sum_enc = SumEncoder()
sum_df = sum_enc.fit_transform(df_test['City'])
sum_df

In [None]:
sum_enc.mapping[0]['mapping']

In [None]:
sum_enc.ordinal_encoder.category_mapping[0]['mapping']

In [None]:
sum_df['Rent']=df_test['Rent']
sum_df

In [None]:
model = linear_model.LinearRegression()
model.fit(sum_df[['City_0', 'City_1']],sum_df['Rent']) # 0 = SF, 1 = NYC

Effect coding (sum coding) is very similar to dummy coding, but it yields linear regression models that are even simpler to interpret.

In the example, the intercept represents the global mean of the response variable, and the individual coefficients indicate how much each category's mean differs from the global mean. (This is called the main effect of the category or level, hence the name "effect coding".)

One-hot encoding found the same intercept and coefficients, but in that case there is a linear coefficient for every city. In effect coding, no feature represents the reference category, ***so the effect of the reference category must be computed separately as the negative sum of the coefficients of all the other categories.***

In [None]:
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')
print(f'Negative sum of coefficients: {-np.sum(model.coef_)}')

In [None]:
print('Global mean: ',df_test['Rent'].mean())
print(df_test.groupby('City').mean())

Results of the linear regression with one-hot encoding:

Coefficients: $[ 166.66666667,   666.66666667,  -833.33333333]$

Intercept: $3333.3333333333335$
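
The sum-coding result can also be reproduced without the encoder, which makes the contrast layout explicit. The sketch below assumes the layout SF -> [1, 0], NYC -> [0, 1], Seattle (reference) -> [-1, -1] and solves the least-squares problem directly with NumPy; the variable names are illustrative.

```python
import numpy as np

rent = np.array([3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501], dtype=float)
# Assumed sum-coding layout: the reference category is coded -1 in every column.
contrast = {"SF": [1, 0], "NYC": [0, 1], "Seattle": [-1, -1]}
cities = ["SF"] * 3 + ["NYC"] * 3 + ["Seattle"] * 3

design = np.array([[1.0] + contrast[c] for c in cities])  # intercept + two contrast columns
intercept, coef_sf, coef_nyc = np.linalg.lstsq(design, rent, rcond=None)[0]
effect_seattle = -(coef_sf + coef_nyc)                    # reference effect, computed separately

print(intercept, coef_sf, coef_nyc, effect_seattle)
```

The intercept should match the global mean, and the three effects should match the one-hot coefficients listed above, up to the ordering of the cities.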

### Backward Difference Coding

In [None]:
from category_encoders import BackwardDifferenceEncoder

In [None]:
back_diff_enc = BackwardDifferenceEncoder()

In [None]:
back_diff_enc.fit_transform(df[[col]])

In [None]:
back_diff_enc.mapping[0]['mapping']

Using the test example

In [None]:
back_diff_enc = BackwardDifferenceEncoder(  )
back_diff_df = back_diff_enc.fit_transform(df_test['City'])
back_diff_df['Rent']=df_test['Rent']
back_diff_df

In [None]:
back_diff_enc.mapping[0]['mapping']

In [None]:
back_diff_enc.ordinal_encoder.category_mapping[0]['mapping']

In [None]:
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(back_diff_df[['City_0', 'City_1']], back_diff_df['Rent'])  # City_0: NYC vs. SF, City_1: Seattle vs. NYC

Backward difference coding is useful for encoding ordinal variables.

In the example, the intercept represents the global mean of the response variable, and the individual coefficients indicate how much the mean of each category differs from the mean of the immediately preceding category.

In [None]:
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')

In [None]:
print('Global mean: ',df_test['Rent'].mean())
print(df_test.groupby('City').mean())
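
The backward-difference interpretation can be checked by hand with NumPy. The sketch below assumes the category order SF, NYC, Seattle (order of first appearance) and the standard three-level backward-difference contrast matrix; both are assumptions about what the encoder produces and can be compared with `back_diff_enc.mapping` above.

```python
import numpy as np

rent = np.array([3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501], dtype=float)
# Assumed backward-difference contrasts for three ordered levels.
contrast = {
    "SF": [-2 / 3, -1 / 3],
    "NYC": [1 / 3, -1 / 3],
    "Seattle": [1 / 3, 2 / 3],
}
cities = ["SF"] * 3 + ["NYC"] * 3 + ["Seattle"] * 3

design = np.array([[1.0] + contrast[c] for c in cities])  # intercept + two contrast columns
intercept, nyc_minus_sf, seattle_minus_nyc = np.linalg.lstsq(design, rent, rcond=None)[0]

print(intercept, nyc_minus_sf, seattle_minus_nyc)
```

Each coefficient is the difference between the mean of one level and the mean of the level before it, so the sign shows whether the response rises or falls along the ordering.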

### Helmert Coding

In [None]:
from category_encoders import HelmertEncoder

In [None]:
helm_enc = HelmertEncoder()

In [None]:
helm_enc.fit_transform(df[[col]])

In [None]:
helm_enc.mapping[0]['mapping']

Using the test example

In [None]:
helm_enc = HelmertEncoder( )
helm_df = helm_enc.fit_transform(df_test['City'])
helm_df['Rent']=df_test['Rent']
helm_df

In [None]:
helm_enc.mapping[0]['mapping']

In [None]:
model = linear_model.LinearRegression()
model.fit(helm_df[['City_0', 'City_1']], helm_df['Rent'])  # City_0: NYC vs. SF, City_1: Seattle vs. mean(SF, NYC)

In [None]:
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')

In [None]:
print('Global mean: ',df_test['Rent'].mean())
print(df_test.groupby('City').mean())

In [None]:
df_test[df_test['City']!='NYC']['Rent'].mean()

In [None]:
df_test[df_test['City']!='Seattle']['Rent'].mean()
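
The means computed above help in reading the Helmert coefficients: each level is compared with the mean of the levels that precede it. The NumPy sketch below assumes the category order SF, NYC, Seattle and the contrast rows [-1, -1], [1, -1], [0, 2]; this layout is an assumption about the encoder output and can be compared with `helm_enc.mapping`.

```python
import numpy as np

rent = np.array([3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501], dtype=float)
# Assumed Helmert contrasts: each level versus the mean of the previous levels.
contrast = {"SF": [-1, -1], "NYC": [1, -1], "Seattle": [0, 2]}
cities = ["SF"] * 3 + ["NYC"] * 3 + ["Seattle"] * 3

design = np.array([[1.0] + contrast[c] for c in cities])  # intercept + two contrast columns
intercept, coef_0, coef_1 = np.linalg.lstsq(design, rent, rcond=None)[0]

# The coefficients are scaled versions of the comparisons of interest.
nyc_vs_sf = 2 * coef_0                  # mean(NYC) - mean(SF)
seattle_vs_previous = 3 * coef_1        # mean(Seattle) - mean of SF and NYC

print(intercept, coef_0, coef_1, nyc_vs_sf, seattle_vs_previous)
```

Under this layout the raw coefficients are not the differences themselves: they must be multiplied by 2 and 3 respectively to recover the comparisons between means.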

# Supervised Encoders

In [None]:
col_cat = 'drinks'
col_num = 'essay_length'

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
label_enc = LabelEncoder()
df['Class_enc'] = label_enc.fit_transform(df['Class'])

In [None]:
df.head()

## Target Encoder

In [None]:
from category_encoders import TargetEncoder

In [None]:
target_enc = TargetEncoder()  # tunable parameters: min_samples_leaf (k) and smoothing (f)
target_enc.fit_transform(df[col_cat],df['Class_enc'])

In [None]:
target_enc.mapping

In [None]:
target_enc = TargetEncoder()  # k = min_samples_leaf, f = smoothing; default values depend on the installed version
target_enc.fit_transform(df[col_cat],df[col_num])

In [None]:
target_enc.mapping
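
The role of the two parameters can be illustrated with a small pandas sketch of smoothed target encoding: the category mean is blended with the global mean, and a sigmoid in the category size controls the blend. The helper `target_encode`, the toy data, and the parameter values `k=1.0`, `f=1.0` are assumptions for the example, not a reproduction of the library code.

```python
import numpy as np
import pandas as pd

def target_encode(cat: pd.Series, y: pd.Series, k: float = 1.0, f: float = 1.0) -> pd.Series:
    """Hypothetical helper: blend the category mean with the global mean."""
    prior = y.mean()
    stats = y.groupby(cat).agg(["mean", "count"])
    # Sigmoid weight: close to 1 for large categories, smaller for rare ones.
    weight = 1.0 / (1.0 + np.exp(-(stats["count"] - k) / f))
    smoothed = weight * stats["mean"] + (1.0 - weight) * prior
    return cat.map(smoothed)

cat = pd.Series(["a", "a", "a", "a", "b", "b", "c"])
y = pd.Series([1, 1, 1, 0, 0, 0, 1])
encoded = target_encode(cat, y)
print(encoded.round(3).tolist())
```

Rare categories are pulled toward the global mean, which limits overfitting to a handful of observations; a larger `f` makes that pull stronger for every category.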

## Leave-one-out Encoder

In [None]:
from category_encoders import LeaveOneOutEncoder

In [None]:
lou_enc = LeaveOneOutEncoder(sigma=0.05)  # sigma: scale of the noise added during training
lou_enc.fit_transform(df[col_cat],df['Class_enc'])

In [None]:
lou_enc.mapping

In [None]:
lou_enc = LeaveOneOutEncoder()
lou_enc.fit_transform(df[col_cat],df['essay_length'])

In [None]:
lou_enc.mapping
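
Leave-one-out encoding can be sketched in pandas as the target mean of the *other* rows in the same category, so that a row never sees its own target. The helper `leave_one_out_encode`, the toy data, and the fallback to the global mean for categories with a single row are assumptions for the illustration; the optional noise controlled by `sigma` is left out.

```python
import pandas as pd

def leave_one_out_encode(cat: pd.Series, y: pd.Series) -> pd.Series:
    """Hypothetical helper: per-row category mean that excludes the row itself."""
    group_sum = y.groupby(cat).transform("sum")
    group_count = y.groupby(cat).transform("count")
    # Mean of the other rows in the category (undefined for singletons).
    loo = (group_sum - y) / (group_count - 1)
    # Assumption: categories seen only once fall back to the global mean.
    return loo.where(group_count > 1, y.mean())

cat = pd.Series(["a", "a", "a", "b", "b", "c"])
y = pd.Series([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
encoded = leave_one_out_encode(cat, y)
print(encoded.tolist())
```

Unlike plain target encoding, rows of the same category can receive different codes during training, because each one excludes a different target value.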

## M-estimate Encoder

In [None]:
from category_encoders import MEstimateEncoder

In [None]:
mest_enc = MEstimateEncoder(m=10)  # m: smoothing strength; sigma and randomized are also available
mest_enc.fit_transform(df[col_cat],df['Class_enc'])

In [None]:
mest_enc.mapping
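
The effect of `m` can be illustrated with a short pandas sketch of the M-estimate: the category's target sum is padded with `m` pseudo-observations at the global mean. The helper `m_estimate_encode` and the toy data are assumptions made for the example.

```python
import pandas as pd

def m_estimate_encode(cat: pd.Series, y: pd.Series, m: float = 10.0) -> pd.Series:
    """Hypothetical helper: (category sum + prior * m) / (category count + m)."""
    prior = y.mean()
    stats = y.groupby(cat).agg(["sum", "count"])
    estimate = (stats["sum"] + prior * m) / (stats["count"] + m)
    return cat.map(estimate)

cat = pd.Series(["a", "a", "a", "a", "b"])
y = pd.Series([1, 1, 1, 0, 0])
encoded = m_estimate_encode(cat, y, m=10.0)
print(encoded.round(4).tolist())
```

With `m = 0` the encoding reduces to the plain category mean; as `m` grows, every category moves closer to the global mean.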

## Weight of Evidence Encoder

In [None]:
from category_encoders import WOEEncoder

In [None]:
woe_enc = WOEEncoder() # randomized=False, sigma=0.05, regularization=1.0
woe_enc.fit_transform(df[col_cat],df['Class_enc'])

In [None]:
woe_enc.mapping
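
Weight of evidence compares, for each category, its share of the positive class with its share of the negative class, on a log scale. The pandas sketch below assumes a binary 0/1 target and additive regularization in both numerators and denominators; the helper `woe_encode` and the toy data are illustrative, and the exact regularization in the library may differ.

```python
import numpy as np
import pandas as pd

def woe_encode(cat: pd.Series, y: pd.Series, regularization: float = 1.0) -> pd.Series:
    """Hypothetical helper: log ratio of positive share to negative share."""
    stats = y.groupby(cat).agg(["sum", "count"])
    pos = stats["sum"]                      # positives per category
    neg = stats["count"] - stats["sum"]     # negatives per category
    total_pos = y.sum()
    total_neg = len(y) - total_pos
    # Regularization keeps the ratio finite when a category has no positives or no negatives.
    share_pos = (pos + regularization) / (total_pos + 2 * regularization)
    share_neg = (neg + regularization) / (total_neg + 2 * regularization)
    return cat.map(np.log(share_pos / share_neg))

cat = pd.Series(["a", "a", "a", "b", "b", "b"])
y = pd.Series([1, 1, 0, 0, 0, 1])
encoded = woe_encode(cat, y)
print(encoded.round(4).tolist())
```

Positive values mark categories where the positive class is over-represented, negative values the opposite, and values near zero carry little evidence either way.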