# Introduction to Transformers

### Introduction

So far we have seen various techniques for engineering our data.  We've seen how to handle missing values by replacing each nan with the column mean and adding a boolean column that indicates missingness for each observation.  We've also seen techniques for converting categorical variables into dummy variables.

It may not be surprising that sklearn has tools that help us automate these techniques.  In this lesson, we'll see how to do so with sklearn's transformers.

### Loading our Data

In [5]:
import pandas as pd

df = pd.read_csv('./nyc_hs_sat.csv', index_col = 0)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 356 entries, 0 to 355
Data columns (total 11 columns):
dbn                    356 non-null object
name                   356 non-null object
num_test_takers        327 non-null float64
reading_avg            327 non-null float64
math_avg               327 non-null float64
writing_score          327 non-null float64
boro                   356 non-null object
total_students         356 non-null int64
graduation_rate        351 non-null float64
attendance_rate        356 non-null float64
college_career_rate    351 non-null float64
dtypes: float64(7), int64(1), object(3)
memory usage: 33.4+ KB


### Missing Data

sklearn has a number of `transformers`, which provide some out-of-the-box feature engineering.  For example, one of our techniques for handling missing data is to replace nan values with the mean of the column.

In [66]:
graduation_rate = df[['graduation_rate']]

In [67]:
graduation_rate.isna().sum()

graduation_rate    5
dtype: int64

We can perform this with the `SimpleImputer`.

In [12]:
from sklearn.impute import SimpleImputer

In [50]:
import numpy as np

imputer = SimpleImputer(strategy='mean', missing_values=np.nan)

In [68]:
imputer.fit(graduation_rate)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

In [69]:
transformed_graduation = imputer.transform(graduation_rate)

In [71]:
np.isnan(transformed_graduation).sum()

0

We can see that there are no nan values left in the returned numpy array.
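The introduction also mentioned adding a boolean column that flags which observations were missing.  As a minimal sketch, the `add_indicator` parameter that appears in the imputer's printed configuration can do this for us.  The `toy_rates` array below is a hypothetical stand-in for our graduation rate column, not the lesson's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column standing in for graduation_rate.
toy_rates = np.array([[0.5], [np.nan], [0.75], [1.0]])

# add_indicator=True appends one indicator column for each feature
# that contained missing values when the imputer was fit.
indicator_imputer = SimpleImputer(strategy='mean', add_indicator=True)
imputed_with_flag = indicator_imputer.fit_transform(toy_rates)

# Column 0 holds the imputed values; column 1 is 1.0 where the
# original value was missing and 0.0 otherwise.
print(imputed_with_flag)
```

Here `fit_transform` is simply shorthand for calling `fit` and then `transform` on the same data.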

### Reviewing our Transformer

Let's take a moment to unpack how we got the code above to work.

In [None]:
from sklearn.impute import SimpleImputer

We first imported the transformer, and initialized the transformer with some configuration.  We could have just used the default configuration.

In [72]:
imputer = SimpleImputer()

Then we call the `fit` method.  The `fit` method learns information from the data.  For example, with our SimpleImputer, it learns what the mean of the column is.

In [73]:
imputer.fit(graduation_rate)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

Here, it learns that the mean of the graduation rate is about 0.79.

In [74]:
imputer.statistics_

array([0.79156695])

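And if we call `transform`, our `imputer` replaces the null values with that value.  We can confirm this by selecting the entries that were missing in the original column.  (Comparing against a rounded `0.79` would instead match schools whose actual rate is exactly 0.79, not the imputed rows.)

In [79]:
transformed_graduation[graduation_rate.isna().values]

array([0.79156695, 0.79156695, 0.79156695, 0.79156695, 0.79156695])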

In [82]:
np.isnan(transformed_graduation).sum()

0

So we can see that the `fit` method learns from the data, while the `transform` method applies the changes to the data.  This means we can apply what was learned to data the imputer has not seen before.  For example, consider a new array of data:

In [85]:
new_array = np.array([np.nan, 1, 100, 10000]).reshape(-1, 1)

In [86]:
imputer.transform(new_array)

array([[7.91566952e-01],
       [1.00000000e+00],
       [1.00000000e+02],
       [1.00000000e+04]])

It makes the same transformation, replacing the nan value with the previously learned mean and leaving the other values unchanged.
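To make the split between learning and applying concrete, here is a rough sketch of what a mean imputer might look like if we wrote it by hand.  The `MeanImputerSketch` class and the toy arrays are hypothetical and only for illustration; this is not how sklearn implements `SimpleImputer`.

```python
import numpy as np

class MeanImputerSketch:
    """Hypothetical, simplified stand-in for SimpleImputer(strategy='mean')."""

    def fit(self, X):
        # Learn one mean per column, ignoring nan values.
        self.statistics_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        # Apply the stored means; nothing is re-learned here.
        X = np.array(X, dtype=float)  # copy so the input is left untouched
        missing = np.isnan(X)
        # Look up the learned mean for the column of each missing entry.
        X[missing] = np.take(self.statistics_, np.nonzero(missing)[1])
        return X

# Hypothetical toy data, not the lesson's dataset.
train = np.array([[0.6], [np.nan], [0.8], [1.0]])
new_data = np.array([[np.nan], [1.0], [100.0]])

sketch = MeanImputerSketch().fit(train)
filled = sketch.transform(new_data)
print(sketch.statistics_)
print(filled)
```

The nan in `new_data` is filled with the mean learned from `train`, not with the mean of `new_data` itself.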

### Transformers and Pipelines

Oftentimes, we'll use transformers with pipelines.  Pipelines are useful because they condense the transformations that we make to our data into a single object, and they allow us to apply the same transformations to new arrays.

For example, let's say that we want to replace our null values with the mean and then translate our values into the respective z-scores.

We can do so with a pipeline.

In [90]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

In [91]:
pipeline = Pipeline(steps = [
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])

So we just initialized a pipeline, giving it the steps `impute` and `standardize`.  We can use the same interface of `fit` and `transform` to apply these steps to our data.

In [96]:
pipeline.fit(graduation_rate)

Pipeline(memory=None,
         steps=[('impute',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('standardize',
                 StandardScaler(copy=True, with_mean=True, with_std=True))],
         verbose=False)

In [100]:
transformed_grad = pipeline.transform(graduation_rate)

In [104]:
transformed_grad[:6]

array([[-0.97597463],
       [ 0.8043654 ],
       [ 0.95272707],
       [-0.38252795],
       [ 1.32363125],
       [ 1.47199292]])

So we can now see that both transformations were applied to our column above: there are no nan values, and the values are now expressed as z-scores.
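One way to convince ourselves that the pipeline really runs both steps in order is to compare it against applying the two transformers by hand.  The sketch below assumes a small hypothetical array, `toy_rates`, in place of our graduation rate column.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column, not the lesson's data.
toy_rates = np.array([[0.5], [np.nan], [0.7], [0.9]])

pipeline = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])
piped = pipeline.fit_transform(toy_rates)

# The same two steps applied by hand, in the same order.
imputed = SimpleImputer().fit_transform(toy_rates)
manual = StandardScaler().fit_transform(imputed)

# Both routes should agree, with no nan values remaining
# and a mean near 0 and standard deviation near 1.
print(np.allclose(piped, manual))
print(np.isnan(piped).sum())
print(piped.mean(), piped.std())
```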

### Summary 

In this lesson we saw how we can use transformers, in combination with pipelines, to help us with feature engineering.  Transformers make transformations to features by first learning from the data with `fit` and then applying changes to the data with `transform`.

```python
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
imputer.fit(X)
imputer.transform(X)
```

Pipelines call these methods for us on each step, which allows us to chain several transformations together.

```python
pipeline = Pipeline(steps = [
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])
```

With pipelines, we can use the same `fit` and `transform` interface both to learn from the data and to apply changes to the data.
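As a closing sketch, a fitted pipeline can be applied to new data, and each fitted step can be inspected through `named_steps`.  The `train` and `new_data` arrays here are hypothetical examples rather than the lesson's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training column and new observations.
train = np.array([[0.5], [np.nan], [0.7], [0.9]])
new_data = np.array([[np.nan], [0.9]])

pipeline = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])
pipeline.fit(train)

# Each fitted step keeps what it learned from the training data.
learned_fill = pipeline.named_steps['impute'].statistics_
learned_mean = pipeline.named_steps['standardize'].mean_

# transform reuses those learned values on the new data.
new_transformed = pipeline.transform(new_data)
print(learned_fill, learned_mean)
print(new_transformed)
```

Because the missing value in `new_data` is filled with the training mean and then standardized with the training mean and standard deviation, it comes out at roughly zero.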

### Resources

[Pipelines Kaggle Tutorial](https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines)