# Introduction to Pipelines

### Loading our Data

First, let's load our data.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/pipelines-and-transformers/master/nyc_hs_sat.csv"
hs_df = pd.read_csv(url, index_col = 0)

In [10]:
hs_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 356 entries, 0 to 355
Data columns (total 11 columns):
dbn                    356 non-null object
name                   356 non-null object
num_test_takers        327 non-null float64
reading_avg            327 non-null float64
math_avg               327 non-null float64
writing_score          327 non-null float64
boro                   356 non-null object
total_students         356 non-null int64
graduation_rate        351 non-null float64
attendance_rate        356 non-null float64
college_career_rate    351 non-null float64
dtypes: float64(7), int64(1), object(3)
memory usage: 33.4+ KB


### Transformers and Pipelines

This time, let's combine our transformers into a pipeline.  Pipelines are useful because they condense the transformations that we make to our data into a single object, and they allow us to apply the same transformations to new arrays.

For example, let's say that we want to replace our null values with the mean and then translate our values into the respective z-scores.

We can do so with a pipeline.

In [4]:
from sklearn.pipeline import Pipeline

We first import the `Pipeline` class from the `sklearn.pipeline` module.

Then we initialize a new `Pipeline` instance, specifying the transformations that we would like to make to our data.

> Here we'll just focus on transforming one column of data.

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

In [6]:
pipeline = Pipeline(steps = [
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])

Let's break down what we did above.

We pass `Pipeline` the keyword argument `steps`, which takes a list of steps.  Each step is represented as a tuple: 

```python
('impute', SimpleImputer())
```

The first element of the tuple is a name that we assign to the step, and the second element is an instance of the transformer that we would like to apply.

> We can name the step whatever we prefer.

Once we initialize the pipeline, we can examine the steps with the `named_steps` attribute.

In [7]:
pipeline.named_steps

{'impute': SimpleImputer(add_indicator=False, copy=True, fill_value=None,
               missing_values=nan, strategy='mean', verbose=0),
 'standardize': StandardScaler(copy=True, with_mean=True, with_std=True)}

This returns a dictionary-like object holding our defined steps, with each key (the step name) pointing to the respective transformer.
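Because the steps are keyed by name, we can also pull out a single transformer. The snippet below is a minimal sketch that rebuilds the same pipeline so it can run on its own; the variable names `imputer` and `step_names` are just illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

pipeline = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])

# Look up a single step by the name we assigned to it
imputer = pipeline.named_steps['impute']
print(type(imputer).__name__)
print(imputer.strategy)

# The original list of (name, transformer) tuples is kept in `steps`
step_names = [name for name, _ in pipeline.steps]
print(step_names)
```

Here `strategy` is `'mean'` because that is the default for `SimpleImputer`, which matches the output shown above.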

### Fitting our data

Ok, now that we have used the pipeline to define the transformations we would like to apply, we can try it on some data.

We do so using the same interface that we saw with transformers: `fit` to learn parameters from the data, and `transform` to apply the changes.

In [13]:
graduation_rate = hs_df[['graduation_rate']]

pipeline.fit(graduation_rate)

Pipeline(memory=None,
         steps=[('impute',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('standardize',
                 StandardScaler(copy=True, with_mean=True, with_std=True))],
         verbose=False)
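To see what `fit` actually learned, we can inspect the fitted steps. The sketch below uses a small made-up column (an assumption, not the NYC data) so that the numbers are easy to check by hand.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value
X = np.array([[50.0], [70.0], [np.nan], [90.0]])

pipeline = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])
pipeline.fit(X)

imputer = pipeline.named_steps['impute']
scaler = pipeline.named_steps['standardize']

# Fill value learned by the imputer: the mean of the non-missing values
print(imputer.statistics_)

# Mean and standard deviation learned by the scaler from the imputed column
print(scaler.mean_)
print(scaler.scale_)
```

Note that the scaler is fit on the output of the imputer, not on the raw column, so its mean and standard deviation already reflect the filled-in value.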

In [14]:
transformed_grad = pipeline.transform(graduation_rate)

In [15]:
transformed_grad[:6]

array([[-0.97597463],
       [ 0.8043654 ],
       [ 0.95272707],
       [-0.38252795],
       [ 1.32363125],
       [ 1.47199292]])

So we can now see that both transformations were applied to our column above.
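One way to convince ourselves that the pipeline is doing nothing more than calling the two transformers in order is to apply them by hand and compare. This is a minimal sketch on a made-up column (an assumption, not the NYC data); `from_pipeline` and `by_hand` are illustrative names.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value
X = np.array([[50.0], [70.0], [np.nan], [90.0]])

pipeline = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])
# fit_transform fits every step and returns the transformed data in one call
from_pipeline = pipeline.fit_transform(X)

# The same two steps applied by hand, one after the other
imputed = SimpleImputer().fit_transform(X)
by_hand = StandardScaler().fit_transform(imputed)

# Both routes should give the same array
print(np.allclose(from_pipeline, by_hand))
```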

### Summary 

In this lesson we saw how we can use pipelines to help us with feature engineering.  Pipelines store a sequence of transformers and then call the transformers for us.  

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

pipeline = Pipeline(steps = [
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])
```

We initialize a pipeline by passing a list of steps to the `steps` argument.  Each step is a tuple, where the first element is the name that we assign to the step, and the second element is an instance of the transformer that we wish to apply.
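As noted earlier, a fitted pipeline can also be reused on new arrays. The sketch below assumes two small made-up arrays, `train` and `new`, standing in for the data we fit on and the data that arrives later.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays standing in for "old" and "new" data
train = np.array([[50.0], [70.0], [np.nan], [90.0]])
new = np.array([[np.nan], [70.0], [90.0]])

pipeline = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('standardize', StandardScaler()),
])

# Learn the fill value, mean, and standard deviation from train only
pipeline.fit(train)

# Reuse those learned values on the new array, without refitting
new_transformed = pipeline.transform(new)
print(new_transformed)
```

The missing value in `new` is filled with the mean learned from `train`, and the z-scores are computed with the mean and standard deviation of `train`, so both arrays end up on the same scale.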

### Resources

[Pipelines Kaggle Tutorial](https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines)