# Custom Sklearn Transformers

#### Description
Scikit-learn has two primary types of objects:
- Estimators: they have the methods fit and predict (they produce Machine Learning models)
- Transformers: they have the methods fit and transform (they apply a transformation to the data)
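
As a brief illustration of the two interfaces, the sketch below uses two built-in scikit-learn classes, `LinearRegression` (an estimator) and `StandardScaler` (a transformer). The toy arrays are made up for this example and are not part of the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data (assumed, for illustration only): y = 2 * x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Estimator: fit() learns a model, predict() produces predictions
model = LinearRegression().fit(X, y)
y_pred = model.predict(np.array([[5.0]]))

# Transformer: fit() learns statistics, transform() returns modified data
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
```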

Sklearn ships with many **transformers** by default. Here we show how to create our own to do whatever we need, be it ***Scaling, Imputation, Binarization, etc.***


#### A Bit of Theory
To create a transformer in sklearn we only need to write a new class that inherits from BaseEstimator and TransformerMixin.
The class must define two methods: fit(X, y, ..) and transform(X, y, ..).
- **fit()** should return `self`; it computes the "statistics" required by the transformation (for example, the mean of each column in StandardScaler)
- **transform()** should return the data with the transformation applied, using the "statistics" computed in fit


- Note:
These statistics must be stored as attributes on *self* so they can be used by the other methods, and so the Test set can be transformed with the same values that fit() computed on the Train set
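
To make the fit/transform contract concrete, here is a minimal sketch of a transformer that learns something in fit(). `MeanCenterer` is a hypothetical class written only for this illustration (it is not part of sklearn), and the Train and Test arrays are made-up values.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical example: subtract the per-column mean learned in fit()."""

    def fit(self, X, y=None):
        # Store the learned statistic on self so transform() can reuse it
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self allows chaining and fit_transform()

    def transform(self, X, y=None):
        # Apply the statistic computed on the training data
        return np.asarray(X, dtype=float) - self.mean_


# Made-up Train and Test sets
X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_test = np.array([[2.0, 20.0], [4.0, 40.0]])

centerer = MeanCenterer().fit(X_train)
X_test_centered = centerer.transform(X_test)  # uses the Train means, not the Test means
```

The Test set is centered with the means computed on Train, which is exactly why the statistic lives on `self` instead of being recomputed inside transform().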

In [14]:
import numpy as np
import pandas as pd  # needed by Log1pScaler.transform for the DataFrame check
from sklearn.base import BaseEstimator, TransformerMixin

### Scaler
- First example: a logarithmic scaler, useful for exponential or Poisson distributions
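
The following sketch hints at why log(1 + x) suits this kind of data. It assumes a synthetic, seeded exponential sample (not the notebook's dataset) and a small hand-written `skewness` helper, both introduced only for illustration.

```python
import numpy as np

# log1p(x) computes log(1 + x), so it is defined at x = 0
zero_mapped = np.log1p(0.0)

# Synthetic right-skewed sample (assumed data, seeded for reproducibility)
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10_000)


def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3


skew_before = skewness(x)
skew_after = skewness(np.log1p(x))  # the long right tail is compressed

# expm1 is the inverse of log1p, so the transformation can be undone
x_recovered = np.expm1(np.log1p(x))
```

Comparing `skew_before` with `skew_after` shows how much the transformation reduces the asymmetry of the sample.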

In [103]:
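class Log1pScaler(BaseEstimator, TransformerMixin):
    """
    Scale features by applying log(1 + X), where `X` is each feature.

    Features must be numeric.

    Parameters
    ----------
    features : None or array-like, default=None
        Features to transform: column names when transforming a
        pandas.DataFrame, or integer column indices for a numpy array.
        If None, all columns are transformed; otherwise only the
        selected columns are returned.
    """
    def __init__(self, features=None):
        self.features = features

    def fit(self, X, y=None):
        # Nothing to learn: log1p is a stateless transformation
        return self

    def transform(self, X, y=None):
        if self.features is None:
            # Transform every column
            return np.log1p(X)
        if isinstance(X, pd.DataFrame):
            # Select columns by name
            return np.log1p(X[self.features])
        if isinstance(X, np.ndarray):
            # Select columns by integer index
            return np.log1p(X[:, self.features])
        # Fail loudly instead of returning the input untransformed
        raise TypeError("X must be a numpy.ndarray or a pandas.DataFrame when `features` is set")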


---
## Testing

We load the Boston dataset to test the Log1pScaler in 3 different ways

In [104]:
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

In [111]:
data = load_boston()

df = pd.DataFrame(np.c_[data['data'], data['target']], columns=data['feature_names'].tolist() + ['target'])
df.head(3)

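|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |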


### Testing on the whole dataset

In [106]:
log_scaler = Log1pScaler()
log_scaler

Log1pScaler()

In [112]:
df_scaled = log_scaler.fit_transform(df)
df_scaled.head(3)

|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0063 | 2.944439 | 1.196948 | 0.0 | 0.430483 | 2.024853 | 4.19268 | 1.627278 | 0.693147 | 5.693732 | 2.791165 | 5.986201 | 1.788421 | 3.218876 |
| 1 | 0.026944 | 0.0 | 2.088153 | 0.0 | 0.384582 | 2.004314 | 4.380776 | 1.786261 | 1.098612 | 5.493061 | 2.933857 | 5.986201 | 2.316488 | 3.11795 |
| 2 | 0.026924 | 0.0 | 2.088153 | 0.0 | 0.384582 | 2.102303 | 4.128746 | 1.786261 | 1.098612 | 5.493061 | 2.933857 | 5.975919 | 1.61542 | 3.575151 |


### Testing on some columns using a pandas DataFrame

In [114]:
columns = ['CRIM', 'ZN', 'RM', 'AGE']

log_scaler = Log1pScaler(features=columns)
log_scaler

Log1pScaler(features=['CRIM', 'ZN', 'RM', 'AGE'])

In [115]:
X = log_scaler.fit_transform(df)
pd.DataFrame(X, columns=columns).head(3)

|   | CRIM | ZN | RM | AGE |
|---|---|---|---|---|
| 0 | 0.0063 | 2.944439 | 2.024853 | 4.19268 |
| 1 | 0.026944 | 0.0 | 2.004314 | 4.380776 |
| 2 | 0.026924 | 0.0 | 2.102303 | 4.128746 |


### Testing on some columns using a numpy array

In [118]:
columns = ['CRIM', 'ZN', 'RM', 'AGE']
indexes = df.columns.get_indexer(columns)

log_scaler = Log1pScaler(features=indexes)
log_scaler

Log1pScaler(features=array([0, 1, 5, 6]))

In [119]:
X = log_scaler.fit_transform(df.values)
pd.DataFrame(X, columns=columns).head(3)

|   | CRIM | ZN | RM | AGE |
|---|---|---|---|---|
| 0 | 0.0063 | 2.944439 | 2.024853 | 4.19268 |
| 1 | 0.026944 | 0.0 | 2.004314 | 4.380776 |
| 2 | 0.026924 | 0.0 | 2.102303 | 4.128746 |
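
The name-based and index-based results match because `get_indexer` maps column names to their integer positions. The sketch below checks that equivalence on a small made-up frame; `df_demo`, `names` and `positions` are names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Small made-up frame, for illustration only
df_demo = pd.DataFrame({"a": [0.0, 1.0], "b": [2.0, 3.0], "c": [4.0, 5.0]})

names = ["a", "c"]
positions = df_demo.columns.get_indexer(names)  # integer positions of the names

# Selecting by name on the DataFrame vs. by position on the raw array
by_name = np.log1p(df_demo[names]).to_numpy()
by_position = np.log1p(df_demo.to_numpy()[:, positions])
```

One caveat worth keeping in mind: `get_indexer` returns -1 for names it cannot find, and NumPy reads -1 as "the last column", so a misspelled name would select the wrong column silently rather than raise an error.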
