# Transformers and Pipelines Lab

### Introduction

In this lesson, we'll practice using transformers and pipelines to coerce our data.  We'll walk through some additional features of pipelines in sklearn.

### Loading our Data

In [29]:
import pandas as pd

housing_df = pd.read_csv('./kc_house_data_missing_values.csv')

In [30]:
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7740 entries, 0 to 7739
Data columns (total 7 columns):
id             7740 non-null int64
date           7740 non-null object
price          7740 non-null float64
bedrooms       7740 non-null int64
bathrooms      7740 non-null float64
sqft_living    7740 non-null int64
floors         7662 non-null float64
dtypes: float64(3), int64(3), object(1)
memory usage: 423.4+ KB


### Back to Transformers

Let's begin with using some of our transformers.  We'll start with the SimpleImputer.  Remember that this will allow us to replace our missing values with the mean.

Let's select the `floors` column, as it currently has missing values: `info()` above reports 7662 non-null values out of 7740 rows, so 78 are missing.

In [31]:
floors = housing_df[['floors']]
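The double brackets in the selection above matter, because they return a two-dimensional DataFrame rather than a one-dimensional Series. The sketch below illustrates the difference on a small made-up frame (`toy_df` and its values are assumptions for illustration, not the housing data).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
toy_df = pd.DataFrame({'floors': [1.0, np.nan, 2.0], 'bedrooms': [3, 2, 4]})

floors_series = toy_df['floors']    # single brackets: one-dimensional Series
floors_frame = toy_df[['floors']]   # double brackets: two-dimensional DataFrame

print(floors_series.shape)  # one axis only
print(floors_frame.shape)   # rows and columns
```

The two-dimensional form is the one sklearn transformers expect as input.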

Begin by importing `SimpleImputer` from the `impute` module, and then initialize a new imputer.

In [20]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer()

In [21]:
imputer

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

We can see that the imputer is initialized with some default arguments.  For example, `strategy='mean'` tells us that missing values (`missing_values=nan`) in `floors` will be replaced with the column mean.

First, the imputer learns what the mean is when we fit it to the data.

In [23]:
imputer.fit(floors)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

> We can check that this worked by looking at the mean, captured in the `statistics_` attribute.

In [25]:
imputer.statistics_

array([1.61243801])

Then we transform the data, with the `transform` method. 

In [28]:
transformed_floors = imputer.transform(floors)
transformed_floors[:3]

array([[2.],
       [1.],
       [1.]])
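To see the fit-then-transform steps actually fill a missing value, here is a minimal sketch on a tiny made-up column (`toy_floors` and `skewed_floors` are assumptions for illustration). It also shows the `median` strategy as one alternative to the default `mean`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data with one missing value.
toy_floors = np.array([[1.0], [2.0], [np.nan], [3.0]])

mean_imputer = SimpleImputer()          # default strategy is 'mean'
mean_imputer.fit(toy_floors)            # learns the mean of the non-missing values
learned_mean = mean_imputer.statistics_[0]
filled = mean_imputer.transform(toy_floors)   # the NaN is replaced by the learned mean

# The same interface with a different strategy.
skewed_floors = np.array([[1.0], [1.0], [np.nan], [4.0]])
median_imputer = SimpleImputer(strategy='median')
median_filled = median_imputer.fit_transform(skewed_floors)

print(learned_mean)
print(filled.ravel())
print(median_filled.ravel())
```

Here the mean of 1, 2 and 3 is 2, so the missing entry becomes 2. In the second array the median of 1, 1 and 4 is 1, so the missing entry becomes 1.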

### More Practice with Transformers

Now oftentimes, we use transformers so that we can apply the same transformations to both our training and holdout datasets, as well as any future data.  

Let's see this in practice by sorting our dataset by `date` and then splitting our data.

In [32]:
sorted_housing_df = housing_df.sort_values('date')

In [36]:
sorted_housing_df[:2]

|      | id         | date | price    | bedrooms | bathrooms | sqft_living | floors |
|------|------------|------|----------|----------|-----------|-------------|--------|
| 6354 | 3396800120 |      | 540000.0 | 3        | 2.50      | 2180        | 1.0    |
| 831  | 9547205660 |      | 603000.0 | 3        | 2.25      | 1700        | 2.0    |


Assign `X` to equal every feature except `price`, `id` and `date`, and set `y` to equal price.

In [55]:
X = sorted_housing_df.drop(columns = ['id', 'price', 'date'])

In [56]:
y = sorted_housing_df['price']

In [57]:
X[:2]


|      | bedrooms | bathrooms | sqft_living | floors |
|------|----------|-----------|-------------|--------|
| 6354 | 3        | 2.50      | 2180        | 1.0    |
| 831  | 3        | 2.25      | 1700        | 2.0    |


In [58]:
y[:3]

6354    540000.0
831     603000.0
5565    965800.0
Name: price, dtype: float64

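Next let's split our data using `train_test_split`.  Note that the call below does not pass `test_size`, so sklearn's default applies and `25` percent of the data is placed in the test dataset; pass `test_size=0.2` to hold out `20` percent instead.

> Make sure to set `shuffle` equal to `False`, so that the rows stay in the date order we just sorted them into.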

In [59]:
from sklearn.model_selection import train_test_split

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle = False)

In [61]:
X_train[:2]

|      | bedrooms | bathrooms | sqft_living | floors |
|------|----------|-----------|-------------|--------|
| 6354 | 3        | 2.50      | 2180        | 1.0    |
| 831  | 3        | 2.25      | 1700        | 2.0    |
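As a quick check of what `shuffle=False` does, the sketch below splits ten made-up ordered rows (`toy_X` and `toy_y` are assumptions for illustration) and passes `test_size=0.2` explicitly. With shuffling turned off, the training set is simply the leading rows and the test set is the trailing rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows that are already in order, like our date-sorted data.
toy_X = np.arange(10).reshape(-1, 1)
toy_y = np.arange(10) * 100

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, shuffle=False
)

print(X_tr.ravel())  # the first eight rows, in their original order
print(X_te.ravel())  # the last two rows
```

Because the order is preserved, the first rows of the training set match the first rows of the original data, which is what we see with `X_train[:2]` above.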


Ok, now let's go back to transformers.  This time we'll use the `StandardScaler`.  Import `StandardScaler` from the `preprocessing` module, and then initialize a new `StandardScaler`.

In [62]:
from sklearn.preprocessing import StandardScaler

In [63]:
scaler = StandardScaler()

Now this time we'll apply the transformer to more than one column.  We'll do so by fitting with our `X_train` data.

In [65]:
scaler.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

> We can see the learned means in the `mean_` attribute and the learned variances in the `var_` attribute.  Note that the scaler learned a different value for each feature.

In [72]:
scaler.mean_

array([3.58552972e+00, 2.33570198e+00, 2.41132214e+03, 1.61196239e+00])

In [75]:
scaler.var_

array([9.32779414e-01, 4.43753362e-01, 5.95523995e+05, 2.83651085e-01])
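The learned `mean_` and `var_` are ordinary per-column statistics. The following sketch, on a small made-up array (`toy_X` is an assumption for illustration), compares them with NumPy's column-wise mean and population variance (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data with two features on very different scales.
toy_X = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

toy_scaler = StandardScaler().fit(toy_X)

# One learned mean and one learned variance per column.
means_match = np.allclose(toy_scaler.mean_, toy_X.mean(axis=0))
vars_match = np.allclose(toy_scaler.var_, toy_X.var(axis=0))  # np.var defaults to ddof=0

print(toy_scaler.mean_, means_match)
print(toy_scaler.var_, vars_match)
```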

Now let's use the scaler to transform the `X_train` data.

In [79]:
transformed_X_train = scaler.transform(X_train)

transformed_X_train[:3]

array([[-0.60626077,  0.24663886, -0.29975578, -1.14903301],
       [-0.60626077, -0.12865303, -0.92175752,  0.7285873 ],
       [ 0.42914487, -0.87923682,  0.11491205, -1.14903301]])

So notice that we can use the transformers to change multiple features at once.

> Now let's transform the `X_test` data using the same parameters we found above.

In [87]:
transformed_test = scaler.transform(X_test)

In [88]:
transformed_test[:3]

array([[-0.60626077, -0.12865303, -0.96063263,  0.7285873 ],
       [-1.6416664 , -2.00511249, -0.94767426, -1.14903301],
       [-0.60626077, -2.00511249, -1.18092492, -1.14903301]])
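To make "the same parameters" concrete, here is a minimal sketch on made-up arrays (`toy_train` and `toy_test` are assumptions for illustration). It reproduces `transform` by hand as `(x - mean_) / sqrt(var_)`, using statistics learned only from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0], [5.0]])

toy_scaler = StandardScaler().fit(toy_train)   # statistics come from the training rows only

scaled_test = toy_scaler.transform(toy_test)
manual_test = (toy_test - toy_scaler.mean_) / np.sqrt(toy_scaler.var_)

print(np.allclose(scaled_test, manual_test))   # transform matches the hand computation
print(scaled_test.mean())                      # not zero: the test rows were centered on the training mean
```

The scaled test data does not have to end up with mean zero, because it is shifted and scaled by what was learned from the training data.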

### The Workflow

Because we often fit and then transform a dataset, we can combine these two procedures in one method - `fit_transform`.  Let's show how it's done.  

If we wish to scale our `X_train` data, and then apply the same changes to the `X_test` data, we'll do the following.

In [89]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [91]:
X_train_scaled = scaler.fit_transform(X_train)

In [92]:
X_train_scaled[:3]

array([[-0.60626077,  0.24663886, -0.29975578, -1.14903301],
       [-0.60626077, -0.12865303, -0.92175752,  0.7285873 ],
       [ 0.42914487, -0.87923682,  0.11491205, -1.14903301]])

So this method above both fitted to the data, and then returned the transformed array of data.
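As a sanity check, the sketch below uses a small made-up array (`toy_X` is an assumption for illustration) to confirm that `fit_transform` returns the same result as calling `fit` and then `transform` separately.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_X = np.array([[1.0, 10.0],
                  [2.0, 30.0],
                  [6.0, 50.0]])

# One step.
one_step = StandardScaler().fit_transform(toy_X)

# Two steps with a separate scaler.
two_step_scaler = StandardScaler()
two_step_scaler.fit(toy_X)
two_step = two_step_scaler.transform(toy_X)

print(np.allclose(one_step, two_step))
```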

In [94]:
scaler.mean_

array([3.58552972e+00, 2.33570198e+00, 2.41132214e+03, 1.61196239e+00])

Now that the scaler has been fit, we can apply the same changes to the test data.

In [95]:
transformed_X_test = scaler.transform(X_test)

In [96]:
transformed_X_test[:2]

array([[-0.60626077, -0.12865303, -0.96063263,  0.7285873 ],
       [-1.6416664 , -2.00511249, -0.94767426, -1.14903301]])

> Note that we do not need to call `fit` on the test data.  In fact, it would be a mistake to do so, because the goal is to apply the same transformation to both datasets.  This way, when we train our model on the transformed `X_train` data, we can then test the model on holdout sets that have been transformed the same way.
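The same fit-on-train, transform-on-test workflow carries over when transformers are chained. The sketch below is one possible way to combine an imputer and a scaler with sklearn's `make_pipeline`; the arrays `toy_train` and `toy_test` are made up for illustration and the step choice is an assumption, not the lab's prescribed solution.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0, 10.0],
                      [3.0, np.nan],
                      [5.0, 30.0]])
toy_test = np.array([[np.nan, 20.0]])

# Impute missing values first, then scale.
toy_pipeline = make_pipeline(SimpleImputer(), StandardScaler())

train_out = toy_pipeline.fit_transform(toy_train)  # every step is fit on the training rows
test_out = toy_pipeline.transform(toy_test)        # the test rows reuse what was learned

print(train_out)
print(test_out)
```

In this toy case the test row's missing value is filled with the training mean of its column (3), and its second value equals the training mean (20), so both scaled entries come out as zero.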

### Transformers and Dimensions

One thing to note is that transformers only work with two-dimensional data.  If we pass a column as a one-dimensional array, both the `fit` method and the `transform` method raise an error.

In [98]:
single_dim_X_train =  transformed_X_train[:, 0]
single_dim_X_train

array([-0.60626077, -0.60626077,  0.42914487, ...,  0.42914487,
        0.42914487, -0.60626077])

> Uncomment the line below to see the error, then comment it out again.

In [100]:
# scaler.transform(single_dim_X_train)
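A common way around this, sketched below on a made-up column (`toy_column` is an assumption for illustration), is to reshape the one-dimensional array into a single-column two-dimensional array with `reshape(-1, 1)`. The sketch uses a fresh scaler, since a scaler fit on four features would not accept a one-column input.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_column = np.array([1.0, 2.0, 3.0])   # one-dimensional, shape (3,)
toy_scaler = StandardScaler()

error_message = None
try:
    toy_scaler.fit(toy_column)           # one-dimensional input is rejected
except ValueError as err:
    error_message = str(err)

toy_column_2d = toy_column.reshape(-1, 1)        # shape (3, 1): three rows, one feature
scaled_column = toy_scaler.fit_transform(toy_column_2d)

print(error_message is not None)
print(scaled_column.shape)
```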

### Summary

In this lesson, we learned about the interface for sklearn transformers.  We saw that with a transformer we first call `fit`, where it learns the parameters of the transformation, and then `transform`, where the transformation is applied.

One way that transformers assist us is by allowing us to make the same transformations to different datasets.  We do this simply by reusing our fitted transformer and calling `transform` on a different dataset.