## Working with Pipelines Lab

### Introduction

In this lesson, we'll practice working with pipelines in sklearn.  A pipeline stores a sequence of transformations in a single object.  As we saw with transformers, a fitted pipeline remembers what it learned, so we can reapply the same changes to holdout sets and future data.

### Loading our Data

In [2]:
import pandas as pd

df_directory = pd.read_csv('./hs_directory.csv')

In [3]:
url = 'report-hs.csv'
report_hs = pd.read_csv(url)

In [5]:
report_hs[:3]

| Unnamed: 0 | DBN | School Name | School Type | Enrollment | Rigorous Instruction Rating | Collaborative Teachers Rating | Supportive Environment Rating | Effective School Leadership Rating | Strong Family-Community Ties Rating | Trust Rating | ... | Percent HRA Eligible | Percent Asian | Percent Black | Percent Hispanic | Percent White | Years of principal experience at this school | Percent of teachers with 3 or more years of experience | Student Attendance Rate | Percent of Students Chronically Absent | Teacher Attendance Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M292 | Orchard Collegiate Academy | High School | 140 | Meeting Target | Exceeding Target | Meeting Target | Meeting Target | Approaching Target | Exceeding Target | ... | 0.621 | 0.15 | 0.243 | 0.55 | 0.05 | 1.9 | 0.5 | 0.867 | 0.448 | 0.973 |
| 1 | 01M448 | University Neighborhood High School | High School | 392 | Meeting Target | Meeting Target | Exceeding Target | Meeting Target | Meeting Target | Exceeding Target | ... | 0.538 | 0.301 | 0.245 | 0.421 | 0.028 | 7.5 | 0.429 | 0.925 | 0.244 | 0.971 |
| 2 | 01M450 | East Side Community School | High School | 393 | Exceeding Target | Exceeding Target | Exceeding Target | Exceeding Target | Exceeding Target | Exceeding Target | ... | 0.405 | 0.122 | 0.201 | 0.547 | 0.104 | 15.8 | 0.78 | 0.94 | 0.164 | 0.99 |


### Working with Pipelines

Let's use a pipeline to handle our missing values: we'll apply the `SimpleImputer` and then the `StandardScaler` in sequence.  First, let's identify the columns with missing values.

We can do so with the following line of code.

In [9]:
cols_na = report_hs.isna().any(axis = 0)

cols_na[:4]

DBN            False
School Name    False
School Type    False
Enrollment     False
dtype: bool

Above, we ask each column whether it contains any `nan` values.

We can then get the result as a boolean array, with one entry per column, by accessing the `values` attribute.

In [10]:
cols_na.values

array([False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True,
        True])
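
To make the mask concrete, here is a minimal sketch on a small, made-up dataframe (the `toy` frame and its column names are hypothetical stand-ins for `report_hs`).  It shows that `isna().any(axis=0)` yields one boolean per column, and that `.values` exposes the same information as a plain NumPy array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for report_hs
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "enrollment": [140, 392, 393],
    "attendance": [0.867, np.nan, 0.94],
})

# One boolean per column: True if the column contains at least one NaN
toy_cols_na = toy.isna().any(axis=0)

# .values returns the underlying NumPy boolean array (a mask, not positions)
toy_mask = toy_cols_na.values

print(toy_cols_na)
print(toy_mask)
```

Only the `attendance` column contains a missing value here, so it is the only entry marked `True`.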

Next, use the `iloc` indexer with this boolean array to select the columns that contain `nan` values.

In [12]:
na_vals_df = report_hs.iloc[:, cols_na.values]
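
As a rough illustration of the selection above, the sketch below applies a boolean column mask to a hypothetical toy dataframe.  To our understanding, `iloc` expects a plain boolean array rather than a labeled Series, which is likely why the lesson calls `.values` first; `loc` with the same array should give an identical result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; None and np.nan both count as missing
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "rating": ["Meeting Target", None, "Exceeding Target"],
    "attendance": [0.867, np.nan, 0.94],
})

# Boolean array with one entry per column
mask = toy.isna().any(axis=0).values

# Keep every row, and only the columns where the mask is True
by_position = toy.iloc[:, mask]
by_label = toy.loc[:, mask]

print(by_position.columns.tolist())
```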

Further reduce the dataframe by excluding any columns of type object.

In [16]:
num_cols_with_na = na_vals_df.select_dtypes(exclude = 'object')

In [17]:
num_cols_with_na.columns

Index(['Rigorous Instruction - Percent Positive',
       'Collaborative Teachers - Percent Positive',
       'Supportive Environment - Percent Positive',
       'Effective School Leadership - Percent Positive',
       'Strong Family-Community Ties - Percent Positive',
       'Trust - Percent Positive',
       'Years of principal experience at this school',
       'Percent of teachers with 3 or more years of experience',
       'Student Attendance Rate', 'Percent of Students Chronically Absent',
       'Teacher Attendance Rate'],
      dtype='object')
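
The dtype filter can be sketched on toy data as follows (the frame and column names are hypothetical).  The example also shows `include="number"`, an alternative that may be more robust in pandas versions that store text in a dedicated string dtype rather than `object`; the `rating` column is forced to `object` here so that the `exclude` call behaves as in the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one text column and two numeric columns, all with NaNs
toy_na = pd.DataFrame({
    "rating": pd.Series(["Meeting Target", None, "Exceeding Target"], dtype="object"),
    "attendance": [0.867, np.nan, 0.94],
    "principal_years": [1.9, 7.5, np.nan],
})

# Drop object (text) columns, as in the lesson
no_objects = toy_na.select_dtypes(exclude="object")

# Alternative: keep only numeric columns
numeric_only = toy_na.select_dtypes(include="number")

print(no_objects.columns.tolist())
print(numeric_only.columns.tolist())
```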

### Using our Pipeline

Let's begin by loading our `Pipeline` class from sklearn.

In [6]:
from sklearn.pipeline import Pipeline

Next, initialize a pipeline with two steps: replacing missing values with the mean (the default strategy of `SimpleImputer`), and then scaling the data.  Name the steps `impute` and `scale`.

In [19]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

na_numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scale', StandardScaler())
])

We can check that we have initialized the pipeline correctly with the `named_steps` attribute.  Access it on our pipeline below.

In [22]:
na_numeric_pipeline.named_steps

{'impute': SimpleImputer(add_indicator=False, copy=True, fill_value=None,
               missing_values=nan, strategy='mean', verbose=0),
 'scale': StandardScaler(copy=True, with_mean=True, with_std=True)}
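
Here is a small sketch of a few ways to inspect the steps, using a separate `demo_pipeline` (a hypothetical name) so it does not interfere with the lab's pipeline.  Indexing the pipeline directly by step name is assumed to be available, as it is in recent scikit-learn versions.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical demo pipeline with the same two steps as the lab
demo_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
])

# named_steps behaves like a dictionary keyed by step name
step_names = list(demo_pipeline.named_steps.keys())

# Two ways to reach the same transformer object
imputer = demo_pipeline.named_steps["impute"]
same_imputer = demo_pipeline["impute"]

print(step_names)
print(imputer.strategy)
```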

### Fitting and Transforming the Data

Now that we have defined our pipeline, let's use it to fit and transform our `num_cols_with_na` data.  First we'll split the data.

In [29]:
from sklearn.model_selection import train_test_split
na_features_train, na_features_test = train_test_split(num_cols_with_na)
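
The call above relies on the defaults of `train_test_split`, which shuffle the rows and hold out 25% of them for testing.  Because the shuffle is random, the numbers you see later in this lab may differ from run to run.  A minimal sketch, on hypothetical toy data, of making the split reproducible with `random_state`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 20 rows
toy = pd.DataFrame({
    "x": np.arange(20, dtype=float),
    "y": np.arange(20, dtype=float) * 2,
})

# Fixing random_state makes the shuffle repeatable
train, test = train_test_split(toy, test_size=0.25, random_state=0)
train_again, test_again = train_test_split(toy, test_size=0.25, random_state=0)

print(len(train), len(test))
```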

Then fit the pipeline to the `na_features_train` data.

In [30]:
na_numeric_pipeline.fit(na_features_train)

Pipeline(memory=None,
         steps=[('impute',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('scale',
                 StandardScaler(copy=True, with_mean=True, with_std=True))],
         verbose=False)

We can see what our pipeline learned by accessing each of the transformers.  We can access the transformers through the dictionary-like object returned by the `named_steps` attribute.  Let's see how.

In [35]:
na_numeric_pipeline.named_steps

{'impute': SimpleImputer(add_indicator=False, copy=True, fill_value=None,
               missing_values=nan, strategy='mean', verbose=0),
 'scale': StandardScaler(copy=True, with_mean=True, with_std=True)}

Now let's take a look at the scaler, which is stored in our dictionary.

In [36]:
na_numeric_pipeline.named_steps['scale']

StandardScaler(copy=True, with_mean=True, with_std=True)

In [38]:
na_numeric_pipeline.named_steps['scale'].var_

array([4.32787316e-03, 8.76807832e-03, 5.54019147e-03, 9.32202791e-03,
       2.75535972e-03, 3.38247340e-03, 1.65037704e+01, 1.74994337e-02,
       3.14591934e-03, 2.69238058e-02, 8.63467424e-05])

So we can see all of the variance values learned for each of the columns.
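
Besides `var_`, a fitted `StandardScaler` also stores `mean_` and `scale_`.  The sketch below, on a small made-up array, checks how these relate: `var_` is the per-column population variance, and `scale_` is its square root.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 rows, 2 columns
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

scaler = StandardScaler().fit(X)

print(scaler.mean_)   # per-column means
print(scaler.var_)    # per-column variances (divides by n, not n - 1)
print(scaler.scale_)  # per-column standard deviations, sqrt(var_)
```

In a pipeline, the scaler is fitted on the output of the previous step, so these statistics describe the imputed data rather than the raw data.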

* Now it's your turn

Access the `SimpleImputer` in our `na_numeric_pipeline` and return the means that it used for imputing data in the columns.

> Hint: this is stored in the SimpleImputer's `statistics_` attribute.

In [47]:
imputed_means = na_numeric_pipeline.named_steps['impute'].statistics_
imputed_means

array([0.78698895, 0.81748619, 0.71055249, 0.84707182, 0.8401105 ,
       0.87041436, 5.85867508, 0.74013249, 0.8744349 , 0.35726039,
       0.96608202])
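
One way to sanity-check these values is to compare them with the column means of the training data, since pandas skips `nan` values when averaging.  A minimal sketch on hypothetical toy data (the frame and the `pipe` name are illustrative, not part of the lab):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data with one missing value per column
X_train = pd.DataFrame({
    "attendance": [0.8, np.nan, 0.9],
    "principal_years": [2.0, 4.0, np.nan],
})

pipe = Pipeline(steps=[
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
]).fit(X_train)

# Means learned by the imputer during fit
learned = pipe.named_steps["impute"].statistics_

# Column means computed directly, ignoring NaN
expected = X_train.mean().values

print(learned)
print(expected)
```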

### Transforming the data

Now that our pipeline has been fitted to the training data, we can use the parameters it learned to transform both our training and test data.

> Use the pipeline to transform both the `na_features_train` and the `na_features_test` data.

In [51]:
transformed_na_features_train = na_numeric_pipeline.transform(na_features_train)

transformed_na_features_train[:2]

array([[ 0.00000000e+00,  0.00000000e+00,  1.49158444e-15,
         0.00000000e+00,  0.00000000e+00,  1.90894332e-15,
        -2.18629434e-16,  0.00000000e+00, -1.97941309e-15,
         0.00000000e+00,  1.19477809e-14],
       [-4.10250134e-01, -4.00330858e-01, -5.44822579e-01,
        -3.83962443e-01,  5.69415831e-01, -1.03877866e+00,
        -2.18629434e-16,  0.00000000e+00, -8.45715374e-01,
         1.46716626e+00,  1.19477809e-14]])

In [32]:
transformed_na_features_test = na_numeric_pipeline.transform(na_features_test)

In [50]:
transformed_na_features_test[:2]

array([[ 0.04576997,  0.98799414, -0.81352254,  0.96248112,  1.33144367,
        -0.00712468,  1.01940725, -0.39409097,  0.61626001, -0.91574864,
        -0.00882655],
       [ 0.65379678,  1.09478837,  0.26127731,  0.75533595, -0.76413289,
         0.33675998,  0.25632719, -0.55283862, -0.27518839,  0.37017219,
        -0.00882655]])
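
To see what `transform` does with the learned parameters, the sketch below reproduces it by hand on hypothetical toy arrays: fill each `nan` with the imputer's learned mean, then subtract the scaler's `mean_` and divide by its `scale_`.  This is an illustration under those assumptions, not a description of the library's internals.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data
X_train = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [np.nan, 8.0]])
X_test = np.array([[np.nan, 6.0]])

pipe = Pipeline(steps=[
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
]).fit(X_train)

imputer = pipe.named_steps["impute"]
scaler = pipe.named_steps["scale"]

# Step 1: replace NaN with the means learned from the training data
filled = np.where(np.isnan(X_test), imputer.statistics_, X_test)

# Step 2: standardize with the mean and scale learned from the training data
by_hand = (filled - scaler.mean_) / scaler.scale_

from_pipeline = pipe.transform(X_test)

print(by_hand)
print(from_pipeline)
```

A value imputed with the column mean lands at (or very near) zero after scaling.  This may also explain why the first transformed training row in the output above is essentially zero: a row whose values were all missing would be filled with the column means.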

### Summary

In this lesson, we practiced using pipelines to fit to, and then transform, our data.  We created a pipeline by initializing it with a list of named steps.

```python
na_numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scale', StandardScaler())
])
```

Once initialized, we can call `fit` to learn the parameters of the pipeline, and then `transform` to apply the changes to a specified dataset.

```python
na_numeric_pipeline.fit(na_features_train)
transformed_na_features_test = na_numeric_pipeline.transform(na_features_test)
```
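
For the training data, the two calls can also be combined with `fit_transform`.  A minimal sketch on hypothetical toy data, checking that both routes agree and that each transformed training column ends up with mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data with missing values
X_train = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [np.nan, 8.0],
                    [5.0, 6.0]])

def make_pipeline():
    # Same two steps as in the lab
    return Pipeline(steps=[
        ("impute", SimpleImputer()),
        ("scale", StandardScaler()),
    ])

# Route 1: fit, then transform
two_calls = make_pipeline().fit(X_train).transform(X_train)

# Route 2: fit and transform in one call
one_call = make_pipeline().fit_transform(X_train)

print(one_call.mean(axis=0))
print(one_call.std(axis=0))
```

The test data should still go through `transform` only, so that it is scaled with the parameters learned from the training data.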