In [None]:
import pandas as pd
import numpy as np

## Transformer objects

These can each be used individually, or integrated into a pipeline.

The key rule: fit them on the training data only, then use them to transform the testing data **without re-fitting**. Re-fitting on the test set would let information from the test data leak into the preprocessing.

### SimpleImputer

In [None]:
from sklearn.impute import SimpleImputer #This is very googleable

In [None]:
x1 = [1, 3, np.nan, 25, 1]
x2  = [np.nan, np.nan, 2, 1, 3]
x3 = [40, 24, np.nan, 13, 2]

train = pd.DataFrame({'x1':x1, 'x2':x2, 'x3':x3})

In [None]:
train.head()

Unnamed: 0,x1,x2,x3
0,1.0,,40.0
1,3.0,,24.0
2,,2.0,
3,25.0,1.0,13.0
4,1.0,3.0,2.0


In [None]:
x1_test = [135, 24, np.nan]
x2_test = [np.nan, np.nan, np.nan]
x3_test = [50, 135, np.nan]

test = pd.DataFrame({'x1':x1_test, 'x2':x2_test, 'x3':x3_test})

In [None]:
test.head()

Unnamed: 0,x1,x2,x3
0,135.0,,50.0
1,24.0,,135.0
2,,,


Initialize a SimpleImputer.

In [None]:
imputer = SimpleImputer(strategy="median")

Here, the imputer is a blank slate. It hasn't yet learned the values it's supposed to impute for given columns. 

In [None]:
imputer.statistics_

AttributeError: ignored

Here, we **fit** the imputer to our training data. THIS DOES NOT CHANGE THE TRAINING DATA IN ANY WAY, but the imputer has learned the values that it should impute for this dataset.

In [None]:
imputer.fit(train)

In [None]:
imputer.statistics_
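
To make "the values it should impute" concrete, here is a minimal sketch that compares the fitted statistics with column medians computed by hand. The names train_demo, demo_imputer and manual_medians are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same toy training data as above, under a separate (hypothetical) name.
train_demo = pd.DataFrame({
    'x1': [1, 3, np.nan, 25, 1],
    'x2': [np.nan, np.nan, 2, 1, 3],
    'x3': [40, 24, np.nan, 13, 2],
})

demo_imputer = SimpleImputer(strategy="median")
demo_imputer.fit(train_demo)

# pandas skips NaN by default, which mirrors what the imputer learns per column.
manual_medians = train_demo.median().to_numpy()

print(demo_imputer.statistics_)
print(manual_medians)
```

The two printed arrays should agree: one median per column, computed over the non-missing training values only.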

In [None]:
#Notice the unchanged training set. 
train

**.transform** applies what the imputer learned and returns a filled-in copy of the data. It does not operate in place, so we reassign the result to train to overwrite the original.

In [None]:
train = imputer.transform(train)

Also important to note: sklearn transformers return a numpy array by default, even when given a dataframe. sklearn works on arrays internally, but the column names and index are lost along the way, which can make it hard to tell which variable is which down the line.

In [None]:
train

In [None]:
pd.DataFrame(train)
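
One way to keep track of the variables is to pass the original column names and index back in when rebuilding the dataframe. This is only a sketch; train_demo, imputed_array and imputed_df are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train_demo = pd.DataFrame({
    'x1': [1, 3, np.nan, 25, 1],
    'x2': [np.nan, np.nan, 2, 1, 3],
    'x3': [40, 24, np.nan, 13, 2],
})

# By default the transformer hands back a plain numpy array...
imputed_array = SimpleImputer(strategy="median").fit_transform(train_demo)

# ...so we re-attach the labels ourselves when wrapping it in a dataframe.
imputed_df = pd.DataFrame(
    imputed_array,
    columns=train_demo.columns,
    index=train_demo.index,
)

print(imputed_df)
```

This works here because SimpleImputer keeps the number and order of columns unchanged (no column was entirely missing). Transformers that add or drop columns need more care.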

#### The last two steps we did can be combined. 

If we want to **fit (have the imputer learn what it's supposed to impute) and transform (actually change the values) at the same time...**

In [None]:
x1 = [1, 3, np.nan, 25, 1]
x2  = [np.nan, np.nan, 2, 1, 3]
x3 = [40, 24, np.nan, 13, 2]

train = pd.DataFrame({'x1':x1, 'x2':x2, 'x3':x3})

In [None]:
second_imputer = SimpleImputer(strategy="median")

train = second_imputer.fit_transform(train)

In [None]:
second_imputer.statistics_

In [None]:
train
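
As a quick sanity check, the sketch below compares the two routes (fit followed by transform, versus fit_transform) on the same toy data. The variable names are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train_demo = pd.DataFrame({
    'x1': [1, 3, np.nan, 25, 1],
    'x2': [np.nan, np.nan, 2, 1, 3],
    'x3': [40, 24, np.nan, 13, 2],
})

# Route 1: learn the medians, then apply them in a separate call.
two_step = SimpleImputer(strategy="median").fit(train_demo).transform(train_demo)

# Route 2: do both in a single call.
one_step = SimpleImputer(strategy="median").fit_transform(train_demo)

print(np.allclose(two_step, one_step))
```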

#### .transform 

We can now use this to transform our test set according to our training set parameters. Crucially, this does **not** change the learned parameters of our imputer object. 

In [None]:
test = imputer.transform(test)

In [None]:
test

In [None]:
imputer.statistics_ #Notice that these are the same as before. 
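
The sketch below shows where the filled-in test values come from. Note that x2 is entirely missing in the test set, so the test data alone could not supply a median for it; the training median is used instead. Names such as test_demo and filled_test are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train_demo = pd.DataFrame({
    'x1': [1, 3, np.nan, 25, 1],
    'x2': [np.nan, np.nan, 2, 1, 3],
    'x3': [40, 24, np.nan, 13, 2],
})
test_demo = pd.DataFrame({
    'x1': [135, 24, np.nan],
    'x2': [np.nan, np.nan, np.nan],
    'x3': [50, 135, np.nan],
})

# Fit on the training data only.
demo_imputer = SimpleImputer(strategy="median").fit(train_demo)

# Transform the test data: every gap is filled with a *training* median.
filled_test = demo_imputer.transform(test_demo)

print(demo_imputer.statistics_)
print(filled_test)
```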

### One more transformer example - StandardScaler

Now that we've filled our missing values (StandardScaler would otherwise ignore NaNs when fitting and leave them in place when transforming), we can implement our scaler. 

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
#Initialize a scaler object 

our_scaler = StandardScaler()

In [None]:
train

In [None]:
test

The pattern below is pretty representative of how we want to use transformers. To harp on this one more time, because it couldn't possibly be more important: 

**learn parameters (fit) from the training data**.


**transform both the training and testing data.**


In [None]:
train = our_scaler.fit_transform(train)
test = our_scaler.transform(test)

In [None]:
train

In [None]:
test

Just as with the imputer, we can get the statistics about the training data (the parameters that were "fit") reported back to us. 

In [None]:
our_scaler.mean_

In [None]:
our_scaler.var_

In [None]:
our_scaler.var_**0.5
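
To connect these attributes to the transformed values, here is a small sketch that redoes the scaling by hand: subtract the learned mean and divide by the learned standard deviation. The array X_demo is simply the imputed training data typed out, and all names are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The toy training data after median imputation.
X_demo = np.array([
    [1.0, 2.0, 40.0],
    [3.0, 2.0, 24.0],
    [2.0, 2.0, 18.5],
    [25.0, 1.0, 13.0],
    [1.0, 3.0, 2.0],
])

demo_scaler = StandardScaler().fit(X_demo)

# z = (x - mean) / std, using the parameters learned during fit.
manual_scaled = (X_demo - demo_scaler.mean_) / demo_scaler.scale_

# scale_ is the square root of var_, i.e. the population standard deviation.
print(np.allclose(demo_scaler.scale_, np.sqrt(demo_scaler.var_)))
print(np.allclose(manual_scaled, demo_scaler.transform(X_demo)))
```

The standard deviation here is the population version (numpy's default, ddof=0), which is why it matches X_demo.std(axis=0) rather than the sample standard deviation pandas reports by default.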

### Pipelines 

In a real use case, it'd be convenient if we could do all of our preprocessing in a consistent order that we can control, using one object. 

This is _much_ easier on us (a lot less typing and clicking) but also eliminates an enormous amount of potential for human error through running cells twice, in the wrong order, forgetting one or more, etc. 

This is where sklearn's **Pipeline** objects come into play. 

In [None]:
x1 = [1, 3, np.nan, 25, 1]
x2  = [np.nan, np.nan, 2, 1, 3]
x3 = [40, 24, np.nan, 13, 2]

train = pd.DataFrame({'x1':x1, 'x2':x2, 'x3':x3})

x1_test = [135, 24, np.nan]
x2_test = [np.nan, np.nan, np.nan]
x3_test = [50, 135, np.nan]

test = pd.DataFrame({'x1':x1_test, 'x2':x2_test, 'x3':x3_test})

In [None]:
#Import pipeline object 
from sklearn.pipeline import Pipeline 

Through the Pipeline constructor, we're able to give each transformer object a name. If you don't care to do this, you can use the make_pipeline convenience function, which will set default names. 

Pass in names and their associated transformers as a list of tuples. 

In [None]:
pipe = Pipeline([
                 ('imputer', SimpleImputer(strategy="median")), 
                 ('scaler', StandardScaler())
])
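
For comparison, the make_pipeline route mentioned above might look like the sketch below. The variable names auto_pipe and step_names are hypothetical; the step names themselves are generated by sklearn from the lowercased class names.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# No names supplied: make_pipeline derives them from the class names.
auto_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

step_names = [name for name, _ in auto_pipe.steps]
print(step_names)
```

Explicit names are mostly a readability choice. They also make it easier to refer to a step later, for example when inspecting its fitted parameters.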

The pipeline can now be treated like a transformer object itself, since it is composed exclusively of transformers!

In [None]:
train = pipe.fit_transform(train)

In [None]:
test = pipe.transform(test)
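
To check that the pipeline really is just the two earlier steps chained together, the sketch below runs both versions on the toy data and compares the results. It also shows one way to reach a fitted step through named_steps. All variable names are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train_demo = pd.DataFrame({
    'x1': [1, 3, np.nan, 25, 1],
    'x2': [np.nan, np.nan, 2, 1, 3],
    'x3': [40, 24, np.nan, 13, 2],
})

pipe_demo = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('scaler', StandardScaler()),
])
via_pipeline = pipe_demo.fit_transform(train_demo)

# The same two steps, applied by hand in the same order.
manual_imputer = SimpleImputer(strategy="median")
manual_scaler = StandardScaler()
via_manual = manual_scaler.fit_transform(manual_imputer.fit_transform(train_demo))

# Fitted steps stay accessible by the names we gave them.
learned_medians = pipe_demo.named_steps['imputer'].statistics_

print(np.allclose(via_pipeline, via_manual))
print(learned_medians)
```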

In [None]:
train



In [None]:
test



### Column transformers

There's one big problem here, though. We really need to keep the operations we perform on numerical and categorical columns separate. 

Why? 

One-hot-encoding numerical variables seems like an extremely bad idea. We'd be treating continuous features as if they were categorical, and creating a new column for each unique value. Talk about the curse of dimensionality...

Likewise, one-hot-encoding produces sparse columns of ones and zeroes. Scaling them adds nothing useful, and centering them destroys the sparsity, which can slow computation. 
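
The first concern can be seen directly in a small sketch: one-hot-encoding a single numeric column with five distinct values yields five columns. The dataframe numeric_demo and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One continuous feature, five rows, five distinct values.
numeric_demo = pd.DataFrame({'surface_area': [9910, 23000, 22300, 7340, 31700]})

encoded_numeric = OneHotEncoder().fit_transform(numeric_demo)

# One new column per unique value.
print(encoded_numeric.shape)
```

With a realistic continuous feature, nearly every row has its own value, so the number of columns grows roughly with the number of rows.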

Enter **ColumnTransformer**.

It allows us to create multiple pipelines and specify which to apply to which features! Wonderful. 

In [None]:
#To show what's going on, we need a new, multi type test dataframe.
surface_area = [9910, 23000, 22300, 7340, 31700]
elevation = [571, 577, 577, 246, np.nan]
avg_depth = [62, np.nan, 279, 283, 483]
lake_quality = ["awesome", "meh", "meh", np.nan, "bad"]

lake = pd.DataFrame({'surface_area':surface_area, 'elevation':elevation, 'avg_depth':avg_depth, 'lake_quality':lake_quality})

In [None]:
lake

Unnamed: 0,surface_area,elevation,avg_depth,lake_quality
0,9910,571.0,62.0,awesome
1,23000,577.0,,meh
2,22300,577.0,279.0,meh
3,7340,246.0,283.0,
4,31700,,483.0,bad


In [None]:
#Identify columns by type.
numeric = lake.select_dtypes(include=['int64', 'float64']).columns

categorical = lake.select_dtypes(include=['object']).columns

In [None]:
numeric

Index(['surface_area', 'elevation', 'avg_depth'], dtype='object')

In [None]:
categorical

Index(['lake_quality'], dtype='object')

Now, we can define two separate pipelines, differing by how we want to treat each subset of columns. 

In [None]:
#NUMERIC PIPELINE: 
numeric_pipe = Pipeline(
    [('imputer', SimpleImputer(strategy='median')), 
     ('scaler', StandardScaler())]
)

In [None]:
#CATEGORICAL PIPELINE
from sklearn.preprocessing import OneHotEncoder
categorical_pipe = Pipeline(
    [('cat_imputer', SimpleImputer(strategy = 'most_frequent')), #Different null handling!
     ('encoder', OneHotEncoder())]
)
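
One thing to keep in mind with the encoder: by default, OneHotEncoder raises an error if it meets a category at transform time that it did not see during fit. The sketch below contrasts that default with handle_unknown='ignore', which encodes an unseen category as all zeroes. The category "terrible" and the variable names are invented for the illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

seen = np.array([['awesome'], ['meh'], ['bad']], dtype=object)
unseen = np.array([['terrible']], dtype=object)

# Default behaviour: an unknown category is an error.
strict_encoder = OneHotEncoder().fit(seen)
try:
    strict_encoder.transform(unseen)
    strict_raised = False
except ValueError:
    strict_raised = True

# With handle_unknown='ignore', the unknown row becomes all zeroes.
tolerant_encoder = OneHotEncoder(handle_unknown='ignore').fit(seen)
encoded_unseen = tolerant_encoder.transform(unseen).toarray()

print(strict_raised)
print(encoded_unseen)
```

Whether ignoring unknown categories is appropriate depends on the problem. It is shown here only as an option, not as what this notebook's pipeline uses.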

When using ColumnTransformer, the main argument is a list of tuples, just like for each individual pipeline. However, the tuples now have three elements instead of two: 

Name, transformer, features.

In [None]:
from sklearn.compose import ColumnTransformer

full_transformer = ColumnTransformer(
    transformers = [
        ('numeric', numeric_pipe, numeric),
        ('categorical', categorical_pipe, categorical)
    ]
)

Now, use it as a transformer. 

In [None]:
lake_processed = full_transformer.fit_transform(lake)

In [None]:
pd.DataFrame(lake)

Unnamed: 0,surface_area,elevation,avg_depth,lake_quality
0,9910,571.0,62.0,awesome
1,23000,577.0,,meh
2,22300,577.0,279.0,meh
3,7340,246.0,283.0,
4,31700,,483.0,bad


In [None]:
pd.DataFrame(lake_processed)

Unnamed: 0,0,1,2,3,4,5
0,-0.991315,0.471415,-1.618582,1.0,0.0,0.0
1,0.460174,0.517036,0.025525,0.0,0.0,1.0
2,0.382554,0.517036,0.01051,0.0,0.0,1.0
3,-1.27629,-1.999714,0.04054,0.0,0.0,1.0
4,1.424876,0.494226,1.542007,0.0,1.0,0.0


This is exactly what we wanted! Our dummy variables are expanded without being scaled (so they stay as plain ones and zeroes), while the numeric features are scaled. 

This is also useful when you want to use a different imputation strategy across different columns, which we did here. 



#### Last note: including models in pipelines

You can use the pipeline object to sequentially preprocess your data AND fit a model on it. This makes it all happen in one cell, and is super awesome. 

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [None]:
#We'll say this is our y variable that we're trying to predict using the preprocessed data in lake...

y_train = [134, 245, 1630, 234, 984]

In [None]:
full_thing = Pipeline(steps = 
                      [('preprocessing', full_transformer), 
                       ('prediction', model)])

Now, instead of being a transformer (like our full_transformer and other pipelines were), this mega-pipeline is a MODEL. So, we fit and predict rather than fit and transform. 

In [None]:
full_thing.fit(lake, y_train)

Pipeline(memory=None,
         steps=[('preprocessing',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('numeric',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                        

In [None]:
full_thing.predict(lake)

array([ 134.,  245., 1630.,  234.,  984.])
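
The predictions above reproduce y_train exactly. That is expected here rather than impressive: with only five rows and six processed features, a linear model can fit the training data perfectly. The more useful property is that the fitted pipeline applies the same preprocessing to rows it has never seen. The sketch below rebuilds the pipeline and predicts on two invented lakes; new_lakes and the other _demo names are hypothetical, and the predicted values carry no real meaning.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

lake_demo = pd.DataFrame({
    'surface_area': [9910, 23000, 22300, 7340, 31700],
    'elevation': [571, 577, 577, 246, np.nan],
    'avg_depth': [62, np.nan, 279, 283, 483],
    'lake_quality': pd.Series(["awesome", "meh", "meh", np.nan, "bad"], dtype=object),
})
y_demo = [134, 245, 1630, 234, 984]

preprocess_demo = ColumnTransformer([
    ('numeric', Pipeline([('imputer', SimpleImputer(strategy='median')),
                          ('scaler', StandardScaler())]),
     ['surface_area', 'elevation', 'avg_depth']),
    ('categorical', Pipeline([('cat_imputer', SimpleImputer(strategy='most_frequent')),
                              ('encoder', OneHotEncoder())]),
     ['lake_quality']),
])

model_demo = Pipeline([('preprocessing', preprocess_demo),
                       ('prediction', LinearRegression())])
model_demo.fit(lake_demo, y_demo)

# Invented, unseen rows; they only use categories present in training.
new_lakes = pd.DataFrame({
    'surface_area': [15000, 28000],
    'elevation': [np.nan, 300],
    'avg_depth': [150, np.nan],
    'lake_quality': pd.Series(["meh", "bad"], dtype=object),
})

# Imputation, scaling and encoding all reuse what was learned during fit.
new_predictions = model_demo.predict(new_lakes)
print(new_predictions)
```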