# Custom Sklearn Transformers

#### Description
In scikit-learn there are two primary types of objects:
- Estimators: have the fit and predict methods (they produce Machine Learning models)
- Transformers: have the fit and transform methods (they apply a transformation to data)

Sklearn ships with many **transformers** by default; here we show how to create custom ones that do what we need, be it ***Scaling, Imputation, Binarization, etc...***
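
To make the distinction concrete, here is a minimal sketch using two classes that ship with scikit-learn: `StandardScaler` (a transformer) and `LinearRegression` (an estimator). The toy arrays are made up for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up data: one feature, target is exactly twice the feature
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Transformer: fit() learns the mean and std, transform() applies them
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Estimator: fit() trains the model, predict() produces predictions
model = LinearRegression().fit(X_scaled, y)
y_pred = model.predict(X_scaled)
```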


#### Some Theory
To create a transformer in sklearn, we only need to define a new class that inherits from BaseEstimator and TransformerMixin.
The class must define two methods: fit(X, y, ..) and transform(X, y, ..). TransformerMixin then provides fit_transform() on top of them.
- **fit()** computes the statistics the transformation needs (for example, the mean of each column in StandardScaler) and, by scikit-learn convention, returns `self`
- **transform()** returns the data with the transformation applied, using the statistics computed in fit


- Note:
Those statistics must be stored on the object (as attributes of *self*) so that the other methods can use them, and so that a Test set can be transformed with the same values computed by fit() on the Train set
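
As a minimal sketch of this pattern, the hypothetical `MeanCenterer` below (not part of scikit-learn) learns the column means in fit(), stores them on *self*, and reuses them in transform(). The toy Train and Test arrays are made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical example: subtract the per-column mean learned in fit()."""

    def fit(self, X, y=None):
        # Learned statistics get a trailing underscore by scikit-learn convention
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self allows fit(...).transform(...) chaining

    def transform(self, X):
        # Reuse the means learned on the training data; nothing is recomputed
        return np.asarray(X, dtype=float) - self.mean_


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_test = np.array([[2.0, 20.0], [4.0, 40.0]])

centerer = MeanCenterer().fit(X_train)
X_test_centered = centerer.transform(X_test)
```

The Test set is shifted by the Train means, not by its own, which is the behavior the note above asks for.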

In [2]:
import numpy as np
import pandas as pd  # needed by Log1pScaler.transform below
from sklearn.base import BaseEstimator, TransformerMixin

### Scaler
- First example: a logarithmic scaler, useful for exponential or Poisson distributions

In [103]:
class Log1pScaler(BaseEstimator, TransformerMixin):
    """
    Scale features by applying log(1 + X), where `X` is each feature.

    Features must be numeric.

    Parameters
    ----------
    features : None or array-like, default=None
        Column names when X is a pandas.DataFrame, or column indices when
        X is a numpy.ndarray. If None, every column is transformed;
        otherwise only the selected columns are returned.
    """
    def __init__(self, features=None):
        self.features = features

    def fit(self, X, y=None):
        # Nothing to learn: log1p has no fitted parameters
        return self

    def transform(self, X, y=None):
        if self.features is None:
            X = np.log1p(X)
        else:
            if isinstance(X, np.ndarray):
                X = np.log1p(X[:, self.features])
            elif isinstance(X, pd.DataFrame):
                X = np.log1p(X[self.features])
            else:
                # Avoid silently returning untransformed data
                raise TypeError("X must be a numpy.ndarray or a pandas.DataFrame")
        return X
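
As a quick illustration of what the scaler computes, the sketch below applies `np.log1p` directly to a few made-up values. It only shows the underlying NumPy function, not the class itself.

```python
import numpy as np

# Made-up, right-skewed values spanning several orders of magnitude
values = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

scaled = np.log1p(values)

# log1p(x) is log(1 + x): zero stays zero, so no -inf appears as with np.log(0)
same_as_log = np.allclose(scaled, np.log(1.0 + values))
```

The range 0 to 1000 is compressed to roughly 0 to 7, which is why this kind of scaling helps with long-tailed features.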


---
## Testing

We load the Boston dataset to test the Log1pScaler in 3 different ways (`load_boston` was removed in scikit-learn 1.2, so this cell needs an earlier version)

In [4]:
import pandas as pd
from sklearn.datasets import load_boston

In [5]:
data = load_boston()

df = pd.DataFrame(np.c_[data['data'], data['target']], columns=data['feature_names'].tolist() + ['target'])
df.head(3)

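|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |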


### Testing on the whole dataset

In [106]:
log_scaler = Log1pScaler()
log_scaler

Log1pScaler()

In [112]:
df_scaled = log_scaler.fit_transform(df)
df_scaled.head(3)

|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0063 | 2.944439 | 1.196948 | 0.0 | 0.430483 | 2.024853 | 4.19268 | 1.627278 | 0.693147 | 5.693732 | 2.791165 | 5.986201 | 1.788421 | 3.218876 |
| 1 | 0.026944 | 0.0 | 2.088153 | 0.0 | 0.384582 | 2.004314 | 4.380776 | 1.786261 | 1.098612 | 5.493061 | 2.933857 | 5.986201 | 2.316488 | 3.11795 |
| 2 | 0.026924 | 0.0 | 2.088153 | 0.0 | 0.384582 | 2.102303 | 4.128746 | 1.786261 | 1.098612 | 5.493061 | 2.933857 | 5.975919 | 1.61542 | 3.575151 |
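
For the all-columns case, a similar result can probably be obtained without a custom class by wrapping `np.log1p` in scikit-learn's `FunctionTransformer`. The sketch below is an alternative for comparison, not the approach used in this notebook, and the small array is made up.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Made-up numeric data
X = np.array([[0.0, 1.0], [9.0, 99.0]])

# Wrap log1p as a transformer; expm1 is its mathematical inverse
builtin_log = FunctionTransformer(np.log1p, inverse_func=np.expm1)

X_log = builtin_log.fit_transform(X)
X_back = builtin_log.inverse_transform(X_log)
```

The custom class remains useful when only a subset of columns should be transformed, as in the next two tests.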


### Testing on some columns using a pandas DataFrame

In [114]:
columns = ['CRIM', 'ZN', 'RM', 'AGE']

log_scaler = Log1pScaler(features=columns)
log_scaler

Log1pScaler(features=['CRIM', 'ZN', 'RM', 'AGE'])

In [115]:
X = log_scaler.fit_transform(df)
pd.DataFrame(X, columns=columns).head(3)

|  | CRIM | ZN | RM | AGE |
|---|---|---|---|---|
| 0 | 0.0063 | 2.944439 | 2.024853 | 4.19268 |
| 1 | 0.026944 | 0.0 | 2.004314 | 4.380776 |
| 2 | 0.026924 | 0.0 | 2.102303 | 4.128746 |


### Testing on some columns using a numpy array

In [118]:
columns = ['CRIM', 'ZN', 'RM', 'AGE']
indexes = df.columns.get_indexer(columns)

log_scaler = Log1pScaler(features=indexes)
log_scaler

Log1pScaler(features=array([0, 1, 5, 6]))

In [119]:
X = log_scaler.fit_transform(df.values)
pd.DataFrame(X, columns=columns).head(3)

|  | CRIM | ZN | RM | AGE |
|---|---|---|---|---|
| 0 | 0.0063 | 2.944439 | 2.024853 | 4.19268 |
| 1 | 0.026944 | 0.0 | 2.004314 | 4.380776 |
| 2 | 0.026924 | 0.0 | 2.102303 | 4.128746 |


---
# Another, more complex example

We are going to create a transformer that removes outliers based on the **interquartile range** and the **3-sigma rule**

Unlike the previous transformer, this one must *"compute"* and *"store"* values (statistics) in the **fit** stage and use them in the **transform** stage
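
Before the full class, here is a small sketch of the two rules on a made-up series. It uses the usual 1.5×IQR fences and mean ± 3 standard deviations; the numbers are illustrative only.

```python
import pandas as pd

# Made-up values with one suspicious point at the end
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 13, 12, 50], name="x")

# Rule 1: interquartile range fences
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_low, iqr_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rule 2: three standard deviations around the mean
mean, std = values.mean(), values.std()
sig_low, sig_high = mean - 3 * std, mean + 3 * std

iqr_outliers = values[(values < iqr_low) | (values > iqr_high)]
sig_outliers = values[(values < sig_low) | (values > sig_high)]
```

In this series the IQR rule flags the value 50, while the 3-sigma rule does not, because the extreme value itself inflates the standard deviation. Taking the union of the rules would drop that row; taking the intersection would keep it.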

In [3]:
# Define the new class

class DropOutlier(BaseEstimator, TransformerMixin):

    def __init__(self, columns, q1=.25, q3=.75, relation='union'):
        """
        Drop outliers based on the interquartile range and the 3-sigma rule

        :param: columns: list, names of the columns to check
        :param: q1: float, quantile used as the 1st quartile
        :param: q3: float, quantile used as the 3rd quartile
        :param: relation: str,
            'union': drop rows flagged by either rule
            'inter': drop only rows flagged by both rules
        """
        self.columns = columns
        self.q1 = q1
        self.q3 = q3
        self.relation = relation

    def _indx_outlier_rang_int(self, series, is_fit=True):
        if is_fit:
            q25 = series.quantile(self.q1)
            q75 = series.quantile(self.q3)
            iqr = q75 - q25
            minimo = q25 - 1.5*iqr
            maximo = q75 + 1.5*iqr
            # Store the bounds in a dict keyed by the series name
            self.iqr_min_[series.name] = minimo
            self.iqr_max_[series.name] = maximo
        else:
            minimo = self.iqr_min_[series.name]
            maximo = self.iqr_max_[series.name]

            mascara_outliers = (series < minimo) | (series > maximo)
            return list(series[mascara_outliers].index)

    def _indx_outlier_3_sig(self, series, is_fit=True):
        if is_fit:
            valor_medio = series.mean()
            std = series.std()
            minimo = valor_medio - 3*std
            maximo = valor_medio + 3*std

            # Store the bounds in a dict keyed by the series name
            self.sig_min_[series.name] = minimo
            self.sig_max_[series.name] = maximo
        else:
            minimo = self.sig_min_[series.name]
            maximo = self.sig_max_[series.name]

            mascara_outliers = (series < minimo) | (series > maximo)
            return list(series[mascara_outliers].index)

    def fit(self, X, y=None):
        # Fitted bounds are created here (not in __init__),
        # so refitting starts from a clean state
        self.iqr_min_ = {}
        self.iqr_max_ = {}
        self.sig_min_ = {}
        self.sig_max_ = {}

        for col in self.columns:
            self._indx_outlier_rang_int(X[col])
            self._indx_outlier_3_sig(X[col])

        return self

    def transform(self, X, y=None):
        if self.relation not in ('union', 'inter'):
            raise ValueError(
                f"relation parameter should be 'union' or 'inter'. Given '{self.relation}'")

        # A set avoids listing a row twice when it is an outlier in several columns
        indx_total = set()
        for col in self.columns:
            indx_iqr = set(self._indx_outlier_rang_int(X[col], is_fit=False))
            indx_sig = set(self._indx_outlier_3_sig(X[col], is_fit=False))
            if self.relation == 'union':
                indx_total |= indx_iqr | indx_sig
            else:
                indx_total |= indx_iqr & indx_sig

        self.indx_droped = sorted(indx_total)
        return X.drop(self.indx_droped)

In [6]:
df.head()

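|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |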


In [7]:
columns = ['PTRATIO','LSTAT','target','AGE']
drop_outliers = DropOutlier(columns=columns, relation='inter')

In [8]:
drop_outliers.fit(df)

DropOutlier(columns=['PTRATIO', 'LSTAT', 'target', 'AGE'], relation='inter')

### Let's look at the values computed in the fit stage

In [9]:
drop_outliers.iqr_min_

{'PTRATIO': 13.199999999999998,
 'LSTAT': -8.057500000000005,
 'target': 5.0624999999999964,
 'AGE': -28.54999999999999}

In [10]:
drop_outliers.iqr_max_

{'PTRATIO': 24.4,
 'LSTAT': 31.962500000000006,
 'target': 36.962500000000006,
 'AGE': 167.64999999999998}

In [11]:
drop_outliers.sig_min_

{'PTRATIO': 11.960697025694623,
 'LSTAT': -8.770121292938992,
 'target': -5.058505938028777,
 'AGE': -15.87168303494009}

In [12]:
drop_outliers.sig_max_

{'PTRATIO': 24.95037016798127,
 'LSTAT': 34.07624777515244,
 'target': 50.124118586250134,
 'AGE': 153.0214854064816}

## Apply the transformation (drop the outliers) and store the result in a new df

In [13]:
df_clean = drop_outliers.transform(df)
df_clean.shape

(501, 14)

#### Let's see which rows were dropped

In [16]:
drop_outliers.indx_droped

[141, 373, 374, 412, 414]