# Easier Pipelines

### Introduction 

In earlier lessons we saw how to use pipelines, combined with transformers, for feature engineering.  With pipelines and transformers, we chained transformations by first defining the pipeline.

```python
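from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

na_numeric_pipeline = Pipeline(steps = [
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])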
```

And then fitting the pipeline to the data.

```python
na_numeric_pipeline.fit(na_features_train)
```

Followed by transforming the data.

```python
transformed_na_features_test = na_numeric_pipeline.transform(na_features_test)
```
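
To make the split between fitting and transforming concrete, here is a minimal sketch that uses only a `SimpleImputer`. The toy values and the names `toy_train`, `toy_test`, and `toy_imputer` are made up for illustration and are not the lesson's data. It suggests that the mean is learned from the training data during `fit` and then reused when we `transform` the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not the lesson's dataset.
toy_train = pd.DataFrame({'graduation_rate': [0.6, 0.8, np.nan]})
toy_test = pd.DataFrame({'graduation_rate': [np.nan, 1.0]})

# fit learns the column mean from the training data only.
toy_imputer = SimpleImputer()
toy_imputer.fit(toy_train)
learned_mean = toy_imputer.statistics_[0]

# transform fills the missing test value with the mean learned from training.
filled_test = toy_imputer.transform(toy_test)
print(learned_mean)
print(filled_test)
```

The missing test value is filled with the training mean, not with a mean computed from the test data.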

In this lesson, we'll learn to use the `sklearn-pandas` library, which lets us do more with pipelines through an easier interface.

### Making Pipelines Easier

After a few lessons with transformers and pipelines, we may still not have grown to love them.  One of the downsides of pipelines is that they return a numpy array rather than a pandas dataframe.  Without column names or an index, it is difficult to check that we transformed our data properly, or to place our transformations in context.
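
As a rough illustration of that downside, the sketch below runs a small pipeline over a toy dataframe and then rebuilds a dataframe by hand. The data and the names `toy_df`, `toy_pipeline`, `as_array`, and `as_frame` are assumptions for this example only, and it relies on scikit-learn's default output setting.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing entry per column.
toy_df = pd.DataFrame({
    'graduation_rate': [0.66, np.nan, 0.92],
    'attendance_rate': [0.87, 0.93, np.nan],
})

toy_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])

# With default settings the result is a bare numpy array: no column names, no index.
as_array = toy_pipeline.fit_transform(toy_df)
print(type(as_array))

# Rebuilding a dataframe means re-attaching the labels ourselves.
as_frame = pd.DataFrame(as_array, columns=toy_df.columns, index=toy_df.index)
print(as_frame)
```

Re-attaching the labels by hand works here because every column went through the same steps, but it gets harder to keep track of once transformations add or drop columns.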

Another difficulty with pipelines is applying different transformations to different columns.  
> Notice that so far, we made the same transformations to all columns.

Because of this, we'll be working with the `sklearn-pandas` library, which has an easier interface.

### Using Sklearn Pandas

Let's learn about the sklearn-pandas library by way of example.

This is our data.

```python
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/pipelines-and-transformers/master/nyc_hs_sat.csv"
hs_df = pd.read_csv(url, index_col = 0)
```

```python
hs_df[:3]
```

|   | dbn | name | num_test_takers | reading_avg | math_avg | writing_score | boro | total_students | graduation_rate | attendance_rate | college_career_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29.0 | 355.0 | 404.0 | 363.0 | M | 171 | 0.66 | 0.87 | 0.36 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91.0 | 383.0 | 423.0 | 366.0 | M | 465 | 0.9 | 0.93 | 0.7 |
| 2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70.0 | 377.0 | 402.0 | 370.0 | M | 683 | 0.92 | 0.94 | 0.77 |


Ok, now let's see how we can apply our transformation of imputing the data, this time with `sklearn_pandas`.

```python
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer
```

```python
data_mapper = DataFrameMapper([
    (['graduation_rate'], SimpleImputer()),
], df_out = True)
```

```python
transformed = data_mapper.fit_transform(hs_df)
transformed[:3]
```

|   | graduation_rate |
|---|---|
| 0 | 0.66 |
| 1 | 0.9 |
| 2 | 0.92 |


Let's break down how this worked.

This time, instead of initializing a Pipeline, we initialized a `DataFrameMapper`.  Like our Pipeline, this takes a list of steps, where each step is a tuple.  One difference is that the tuple's first element specifies the columns we wish to transform.

For example, we specified the `graduation_rate` column like so:

```python
(['graduation_rate'], SimpleImputer())
```

We also specified that we want our DataFrameMapper to return a dataframe.  We did this with the argument `df_out = True`.

```python
data_mapper = DataFrameMapper([
     (['graduation_rate'], SimpleImputer()),
], df_out = True)
```

Now let's look at where we fit our data mapper and transformed our data.

```python
transformed = data_mapper.fit_transform(hs_df)
```

Notice that because a `DataFrameMapper` lets us specify which columns to transform, this time we passed in the *entire* dataframe, `hs_df`, and the `DataFrameMapper` fit to the `graduation_rate` column and applied the change to it.

### Aliases

Sometimes we may want to select one column from our dataframe, but then rename the column.  We can do this with an `alias`.  For example, below we'll impute the data from our `graduation_rate` column and return `imputed_grad_rate` as the name of the column. 

```python
data_mapper = DataFrameMapper([
    (['graduation_rate'], SimpleImputer(), {'alias': 'imputed_grad_rate'}),
], df_out = True)
transformed = data_mapper.fit_transform(hs_df)
transformed[:2]
```

|   | imputed_grad_rate |
|---|---|
| 0 | 0.66 |
| 1 | 0.9 |


We do this by adding a third element to the tuple: a dictionary with a key of `alias` and the new column name as its value.

### Applying multiple changes

We can apply multiple transformations to the same column by making the second element of the tuple a list of transformers, which are applied in order:

```python
from sklearn.preprocessing import StandardScaler
mapper = DataFrameMapper([
    (['graduation_rate'], [SimpleImputer(), StandardScaler()]),
], df_out = True)

transformed_two = mapper.fit_transform(hs_df)
transformed_two[:3]
```

|   | graduation_rate |
|---|---|
| 0 | -0.975975 |
| 1 | 0.804365 |
| 2 | 0.952727 |


We can see that this converted our `graduation_rate` data to z-scores.
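
To check what "z-scores" means here, the sketch below applies the same two transformers to a small made-up array and compares the result with the formula computed by hand. The values and the names `toy_values`, `imputed`, `scaled`, and `manual` are assumptions for illustration, not the lesson's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column data with one missing value.
toy_values = np.array([[0.5], [np.nan], [0.9], [1.0]])

# Step 1: fill the missing entry with the mean of the observed values.
imputed = SimpleImputer().fit_transform(toy_values)

# Step 2: subtract the mean and divide by the standard deviation.
scaled = StandardScaler().fit_transform(imputed)

# The same calculation by hand; np.std uses the population formula (ddof=0),
# which matches StandardScaler.
manual = (imputed - imputed.mean()) / imputed.std()

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

After scaling, the column has a mean of approximately zero and a standard deviation of approximately one, up to floating point error.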

### Adding new columns

Next, let's see how to add additional columns to our dataframe.  For example, we generally handle missing data not just by imputing the mean, but also by adding a new column to indicate whether the data was missing.  Let's see how we can do this with the `DataFrameMapper`.

```python
from sklearn.impute import MissingIndicator
```

```python
mapper = DataFrameMapper([
    (['graduation_rate'], [SimpleImputer()]),
    (['graduation_rate'], [MissingIndicator()], {'alias': 'grad_rate_is_na'}),
], df_out = True)
```

```python
transformed_three = mapper.fit_transform(hs_df)
transformed_three[:3]
```

|   | graduation_rate | grad_rate_is_na |
|---|---|---|
| 0 | 0.66 | False |
| 1 | 0.9 | False |
| 2 | 0.92 | False |


> Note that we added an alias to the second column, so that it does not share a name with our initial `graduation_rate` column.
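
The first three rows above all show `False`, so it may help to see the indicator on data where a value really is missing. This is a minimal sketch using `MissingIndicator` on its own; the array `toy_rates` and its values are made up for illustration.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Hypothetical single-column data; the middle value is missing.
toy_rates = np.array([[0.66], [np.nan], [0.92]])

# The indicator returns one boolean column marking where values were missing.
flags = MissingIndicator().fit_transform(toy_rates)
print(flags.ravel())
```

By default, `MissingIndicator` only produces an indicator for columns that contained missing values when it was fit, so a column with no missing values during fitting yields no indicator column.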

### Keeping Columns

Notice that when we use the `DataFrameMapper`, it only returns the columns that we specified.  What if we want to keep a column without transforming it?  We can specify any other columns we want returned like so.

```python
hs_df.columns
```

```
Index(['dbn', 'name', 'num_test_takers', 'reading_avg', 'math_avg',
       'writing_score', 'boro', 'total_students', 'graduation_rate',
       'attendance_rate', 'college_career_rate'],
      dtype='object')
```

```python
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer, MissingIndicator
mapper = DataFrameMapper([
    (['writing_score'], None),
    (['graduation_rate'], [SimpleImputer()]),
    (['graduation_rate'], [MissingIndicator()], {'alias': 'grad_rate_is_na'}),
], df_out = True)
```

```python
with_writing_df = mapper.fit_transform(hs_df)
with_writing_df[:2]
```

|   | writing_score | graduation_rate | grad_rate_is_na |
|---|---|---|---|
| 0 | 363.0 | 0.66 | False |
| 1 | 366.0 | 0.9 | False |


Notice that for `writing_score` we did not impute any missing values or make any transformations.  We simply specified that the `writing_score` column should be returned, and passed `None` as the second element of its step.

If we want to change the default behavior to return **all** columns not specified, we can do that too, by passing `default = None`.

```python
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer, MissingIndicator
default_all_mapper = DataFrameMapper([
    (['graduation_rate'], [SimpleImputer()]),
    (['graduation_rate'], [MissingIndicator()], {'alias': 'grad_rate_is_na'}),
], default = None, df_out = True)
```

```python
default_return = default_all_mapper.fit_transform(hs_df)
default_return[:2]
```

|   | graduation_rate | grad_rate_is_na | dbn | name | num_test_takers | reading_avg | math_avg | writing_score | boro | total_students | attendance_rate | college_career_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.66 | False | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355 | 404 | 363 | M | 171 | 0.87 | 0.36 |
| 1 | 0.9 | False | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383 | 423 | 366 | M | 465 | 0.93 | 0.7 |


### Summary

In this lesson, we learned about using the `DataFrameMapper` from the `sklearn_pandas` library.  The `DataFrameMapper` is similar to a pipeline, but provides an easier interface as our changes become more complex.

For example, it allowed us to output a dataframe instead of a numpy array, specify which columns to transform, and return new columns in our dataframe, such as an is-missing indicator column.

### Resources

[sklearn_pandas](https://github.com/scikit-learn-contrib/sklearn-pandas)