# Learn and practice Pipelines notebook!


Objective 
* break down Transformers and Pipelines so that they can be understood better.

Definitions:


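A sklearn transformer takes data in one form and returns it in another: it learns what it needs in `fit()` and applies the change in `transform()`. Transformers are usually applied to data, but they can actually be used to pass anything from one stage to the next. Examples include:
* OneHotEncoder
* StandardScaler
* etc.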

A sklearn `Pipeline` is used to chain multiple transformers together.

Pipeline is also a generic term used to describe the process of moving data from one form/state to another.
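
As a rough illustration of these definitions, the sketch below chains two transformers by hand and then through a `Pipeline`, and checks that both routes give the same array. The variable names (`X`, `manual`, `chained`) and the step names are chosen here for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[5.0], [5.0], [7.0], [1.0], [8.0], [np.nan]])

# Manual chaining: each transformer learns from its input, then returns a new array.
filled = SimpleImputer(strategy="median").fit_transform(X)
manual = StandardScaler().fit_transform(filled)

# The same two steps wrapped in a Pipeline.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
chained = pipe.fit_transform(X)

print(np.allclose(manual, chained))
```

The pipeline simply feeds the output of each step into the next one, so nothing is gained numerically; the benefit is that the whole chain can be fitted and reused as one object.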

##### To begin
Inspect the scikit-learn documentation for the following: `SimpleImputer`, `Pipeline`, `StandardScaler`, `OneHotEncoder`. For each of these, read its purpose and an example of its use.

Add the necessary import statements to use them.

In [48]:
# code here
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder


In [49]:
import pandas as pd
import numpy as np

# Transformers

## Numeric transformers

### Imputer

In [50]:
num_data1 = {'scores':[5,5,7,1,8,np.nan]}
df_num1 = pd.DataFrame(num_data1)
df_num1

Unnamed: 0,scores
0,5.0
1,5.0
2,7.0
3,1.0
4,8.0
5,


Instantiate a variable `imputer` as a `SimpleImputer` with `median` as its strategy. Run `.fit_transform()` on the numeric dataframe above and save the result in the `imputed_data` variable.

In [51]:
# code here
imputer = SimpleImputer(strategy="median")

imputed_data = imputer.fit_transform(df_num1)

imputed_data

array([[5.],
       [5.],
       [7.],
       [1.],
       [8.],
       [5.]])

Check the changes for the object passed as input to the imputer (`df_num1`).


It is no longer a dataframe. What type is it? What values are changed? Answer below.

The result is now a NumPy array, and the missing value has been replaced by the median of the column (5).

Try changing the `strategy` input.

Observe the difference in results and write them down in the cell below.

In [52]:
# code here
imputer2 = SimpleImputer(strategy="mean")
imputed_data2 = imputer2.fit_transform(df_num1)

imputed_data2

array([[5. ],
       [5. ],
       [7. ],
       [1. ],
       [8. ],
       [5.2]])

The missing value is now replaced by the mean of the column (5.2).
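
The fill value that each imputer learned can be read back after fitting. The sketch below uses the `statistics_` attribute of `SimpleImputer`; the variable names are illustrative, and the data is the same column as `df_num1` written as a plain array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[5.0], [5.0], [7.0], [1.0], [8.0], [np.nan]])

median_imputer = SimpleImputer(strategy="median").fit(X)
mean_imputer = SimpleImputer(strategy="mean").fit(X)

# statistics_ holds one learned fill value per column.
print(median_imputer.statistics_)
print(mean_imputer.statistics_)
```

The missing entry is ignored when the statistic is computed, so the median and mean come from the five observed scores.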

### Scaler

In [53]:
num_data2 = {'scores':[1,2,3,1,2,3]}
df_num2 = pd.DataFrame(num_data2)
df_num2

Unnamed: 0,scores
0,1
1,2
2,3
3,1
4,2
5,3


Instantiate a variable `scaler` as a `StandardScaler`. Run `fit_transform()` and save the result in a new variable `scaled_data`. Inspect the output and write down your observations below.

In [54]:
# code here
scaler = StandardScaler()

scaled_data = scaler.fit_transform(df_num2)

scaled_data

array([[-1.22474487],
       [ 0.        ],
       [ 1.22474487],
       [-1.22474487],
       [ 0.        ],
       [ 1.22474487]])

The values are standardized to z-scores: the column now has mean 0 and unit variance.

Calculate the mean and standard deviation of your data and save them in two variables. Print the content of these variables.

In [55]:
# code here
std = df_num2.scores.std()
mean = df_num2.scores.mean()

print("std = ",std,"mean = ",mean)

std =  0.8944271909999159 mean =  2.0


Compute the values obtained if you subtract the mean from each element in `df_num2` and then divide by the standard deviation:
`(value - mean)/std`. Save the results in an array.

In [56]:
def std_transform(x):
    return (x-mean)/std

result = df_num2.scores.apply(std_transform)

result

0   -1.118034
1    0.000000
2    1.118034
3   -1.118034
4    0.000000
5    1.118034
Name: scores, dtype: float64

How do the results compare to the values in `scaled_data`? Why are they different?

`StandardScaler` uses the biased (population) standard deviation, equivalent to `np.std(x, ddof=0)`, whereas the pandas `.std()` method defaults to `ddof=1`.

Make the necessary modifications to obtain the same results as in `scaled_data`. Hint: Inspect the default parameter values for the functions called.

In [57]:
# code here
std = np.std(df_num2.scores)
result2 = df_num2.scores.apply(std_transform)
result2

0   -1.224745
1    0.000000
2    1.224745
3   -1.224745
4    0.000000
5    1.224745
Name: scores, dtype: float64
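
The two conventions can be compared side by side. This is a small sketch with illustrative variable names; it reads the scaler's learned standard deviation from its `scale_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

scores = pd.Series([1, 2, 3, 1, 2, 3], name="scores")

sample_std = scores.std()            # pandas default: ddof=1 (divides by n - 1)
population_std = scores.std(ddof=0)  # divides by n, like np.std on an array

scaler = StandardScaler().fit(scores.to_frame())

print(sample_std, population_std, scaler.scale_[0])
```

Only the `ddof=0` value matches what the scaler stored, which is why the first manual computation gave `1.118` instead of `1.225`.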

## Non-numeric transformers

In [58]:
cat_data1 = {'names':['Sophie','Michael','Eric','Eric','Sopie', 'Xavier']} 
df_cat = pd.DataFrame(cat_data1)
df_cat

Unnamed: 0,names
0,Sophie
1,Michael
2,Eric
3,Eric
4,Sopie
5,Xavier


Look up the documentation for `OrdinalEncoder`, add the necessary import statement, and initialize a variable `ord_encoder` as an `OrdinalEncoder`. Use `fit_transform()` and save the result in a variable `ord_encoded_data`. Print the content of the variable.

In [59]:
from sklearn.preprocessing import OrdinalEncoder

In [60]:
# code here
ord_encoder = OrdinalEncoder()

ord_encoded_data = ord_encoder.fit_transform(df_cat)

ord_encoded_data

array([[2.],
       [1.],
       [0.],
       [0.],
       [3.],
       [4.]])

Import `OneHotEncoder` and use `fit_transform()` on the same data (set `sparse=False` for readability; from scikit-learn 1.2 onward this parameter is named `sparse_output`). Save the result in a variable and print its content.

In [61]:
# code here
enc = OneHotEncoder(sparse=False)

encoded = enc.fit_transform(df_cat)

encoded

array([[0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [62]:
enc.categories_

[array(['Eric', 'Michael', 'Sophie', 'Sopie', 'Xavier'], dtype=object)]

What is this returning? How does it relate to the data saved in `ord_encoded_data`?

HINT: look at the `.categories_` attribute after you have used `fit_transform()`. Write your observations below.

* `categories_` returns the sorted list of unique categories found during fitting. The ordinal encoder gives each value its index in that list, while the one-hot encoder marks the same index with a 1 in a row of zeros.
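
The link between the two encodings can be checked numerically. In this sketch the names are stored in a plain object array, and the one-hot output is densified with `.toarray()` rather than through the `sparse`/`sparse_output` parameter so that it does not depend on the scikit-learn version; both choices are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

names = np.array(
    [["Sophie"], ["Michael"], ["Eric"], ["Eric"], ["Sopie"], ["Xavier"]],
    dtype=object,
)

ordinal = OrdinalEncoder().fit_transform(names)
# The default one-hot output is a SciPy sparse matrix; .toarray() makes it dense.
one_hot = OneHotEncoder().fit_transform(names).toarray()

# The column holding the 1 in each one-hot row is the ordinal code of that row.
print(np.array_equal(one_hot.argmax(axis=1), ordinal.ravel().astype(int)))
```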

# Using on test set


In general, the idea is to use the `fit_transform()` method on the train data, followed by the `transform()` method on the test data. This allows one to make sure exactly the same processing steps are applied on both data sets. 


#### Non-numerical data


In [63]:
#The encoder saved in `ord_encoder` has already been fit to the train data and so is ready to be used to transform the test data
ord_encoder

OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)

For example using our `ord_encoder` above, we can pass new data to the `transform()` function. Let there be a new dataframe `df_cat3` below.

In [64]:
cat_data3 = {'names':['Xavier','Xavier','Eric']} #'Antoine','Xavier',
df_cat3 = pd.DataFrame(cat_data3)
df_cat3

Unnamed: 0,names
0,Xavier
1,Xavier
2,Eric


Call `transform()` on this new data set. What do you see has happened? Answer in the space below.

In [65]:
ord_encoder.transform(df_cat3)

array([[4.],
       [4.],
       [0.]])

The encoder transforms the new data set with the same mapping it learned from the train data set: `Xavier` is still 4 and `Eric` is still 0.

Note that the above will throw an error if the new data has categories not present in the training data. What are possible solutions to avoid that? Answer below.

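Possible solutions are to make sure the training data covers every expected category, or to configure the encoder to tolerate unseen categories instead of raising an error. Fitting a new encoder on the new data set is not a good option, because the codes would no longer match those used on the train data.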
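
One way to tolerate unseen categories, sketched here under the assumption of scikit-learn 0.24 or later, is the `handle_unknown="use_encoded_value"` option of `OrdinalEncoder`. The sentinel `-1` and the name `safe_encoder` are choices made for this example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train = np.array(
    [["Sophie"], ["Michael"], ["Eric"], ["Eric"], ["Sopie"], ["Xavier"]],
    dtype=object,
)
# "Antoine" never appears in the training data.
test = np.array([["Xavier"], ["Antoine"], ["Eric"]], dtype=object)

# Unseen categories are mapped to the sentinel value instead of raising an error.
safe_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
safe_encoder.fit(train)

encoded_test = safe_encoder.transform(test)
print(encoded_test)
```

Known names keep the codes learned from the training data, and anything else receives `-1`, which a later step can treat as "unknown".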

#### Numerical data

Let's take a look now at how to employ a similar process on new numerical data.

In [66]:
num_data3 = {'scores':[4,8,12]}
df_num3 = pd.DataFrame(num_data3)
df_num3

Unnamed: 0,scores
0,4
1,8
2,12


Now use the previously fitted `StandardScaler` to `transform()` the new data in `df_num3`. Which are the observed effects? Answer in the space below.

In [67]:
# code here
scaler.transform(df_num3)

array([[ 2.44948974],
       [ 7.34846923],
       [12.24744871]])

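The scaler reuses the mean (2) and standard deviation (about 0.82) that it learned from `df_num2`. The new values are all above the training maximum of 3, so they map to large positive z-scores and would look like outliers relative to the training data.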
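
The numbers above can be reproduced by hand from the statistics stored on the fitted scaler. This sketch uses the `mean_` and `scale_` attributes; the array names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [1.0], [2.0], [3.0]])
test = np.array([[4.0], [8.0], [12.0]])

scaler = StandardScaler().fit(train)

# transform() applies the training statistics, not those of the new data.
manual = (test - scaler.mean_) / scaler.scale_

print(scaler.mean_, scaler.scale_)
print(np.allclose(manual, scaler.transform(test)))
```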

# Pipeline

We can chain multiple commands together in a pipeline. The pipeline will run by preserving the order in which the commands are placed.

Make the necessary import from scikit-learn to use `Pipeline`.

In [68]:
# code here
from sklearn.pipeline import Pipeline

#### Numerical data

Define a `Pipeline` consisting of two operations: the application of a `SimpleImputer` and the application of a `StandardScaler`. Save the result in a variable `num_pipeline`. 

In [69]:
SI = SimpleImputer(strategy="mean")
SS = StandardScaler()
num_pipeline = Pipeline([("SI",SI),("SS",SS)])

Call `fit_transform()` on the `df_num1` data using the `num_pipeline` and look at the output. What do you observe? Answer below.

In [70]:
transformed = num_pipeline.fit_transform(df_num1)

transformed

array([[-0.09128709],
       [-0.09128709],
       [ 0.82158384],
       [-1.91702895],
       [ 1.2780193 ],
       [ 0.        ]])

The pipeline has transformed `df_num1` by first applying the simple imputer and then the standard scaler.

Comment out the whole part of the code containing the `StandardScaler`. What happens now? Answer below.

In [71]:
SI = SimpleImputer(strategy="mean")
num_pipeline = Pipeline([("SI",SI)])
transformed = num_pipeline.fit_transform(df_num1)

transformed

array([[5. ],
       [5. ],
       [7. ],
       [1. ],
       [8. ],
       [5.2]])

It only applies the simple imputer, so the output is the imputed but unscaled data.

What order is the code executing the transformers in? What happens if you try putting the transformers the other way around? Answer below.

In [72]:
SI = SimpleImputer(strategy="mean")
SS = StandardScaler()
num_pipeline = Pipeline([("SS",SS),("SI",SI)])
transformed = num_pipeline.fit_transform(df_num1)

transformed

array([[-8.33333333e-02],
       [-8.33333333e-02],
       [ 7.50000000e-01],
       [-1.75000000e+00],
       [ 1.16666667e+00],
       [-4.44089210e-17]])

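The steps run in the order in which they are listed. With the scaler first, `StandardScaler` ignores the NaN when fitting and leaves it in place, so the mean and standard deviation come from the five observed values only. The imputer then fills the missing entry with the mean of the scaled column, which is 0 up to floating-point error (`-4.44e-17`). The other values change slightly (for example `-0.083` instead of `-0.091`) because the imputed value no longer takes part in the standard deviation.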
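
The effect of the ordering shows up in what the scaler learns. The sketch below fits both orderings on the same column and reads the scaler's standard deviation through `named_steps`; the pipeline variable names are made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[5.0], [5.0], [7.0], [1.0], [8.0], [np.nan]])

impute_then_scale = Pipeline([
    ("SI", SimpleImputer(strategy="mean")),
    ("SS", StandardScaler()),
]).fit(X)

scale_then_impute = Pipeline([
    ("SS", StandardScaler()),
    ("SI", SimpleImputer(strategy="mean")),
]).fit(X)

# Imputer first: the scaler sees six values, including the imputed 5.2.
print(impute_then_scale.named_steps["SS"].scale_)
# Scaler first: the scaler sees only the five observed values.
print(scale_then_impute.named_steps["SS"].scale_)
```

Adding a value equal to the mean shrinks the spread, so the imputer-first scaler stores a smaller standard deviation and produces slightly larger z-scores.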

#### Non-numerical data

Create a categorical pipeline to process the `df_cat` dataframe. Use an `OrdinalEncoder` within this (yes, you can make a pipeline with only one Transformer).

In [73]:
ord_encoder = OrdinalEncoder()

cat_pipeline = Pipeline([("simpleimputer",SimpleImputer(strategy="constant")),("OE",ord_encoder)])

res = cat_pipeline.fit_transform(df_cat)

res

array([[2.],
       [1.],
       [0.],
       [0.],
       [3.],
       [4.]])

# Combining Pipelines

In the following we will see how we can combine multiple pipelines together. Let's take a new dataframe `df_data` containing both numerical and non-numerical data. 

In [74]:
df_data = {
    'names':['Sophie','Michael','Eric','Eric','Sopie', 'Xavier'],
    'scores':[5,5,7,1,8,np.nan],
            }

df = pd.DataFrame(df_data)
df

Unnamed: 0,names,scores
0,Sophie,5.0
1,Michael,5.0
2,Eric,7.0
3,Eric,1.0
4,Sopie,8.0
5,Xavier,


As the dataframe above has both types of data, numerical and categorical, we must create separate pipelines for processing the different types of data. We can then use a `ColumnTransformer` to put them together.

Make the necessary imports from scikit-learn to be able to use `ColumnTransformer`. Read the corresponding documentation page to understand its purpose and the way to use it.

In [75]:
from sklearn.compose import ColumnTransformer


In [76]:
# num_attribs and cat_attribs should each be a list of strings, the strings being column names

num_attribs = ["scores"] # put here the column names (from the dataframe above) corresponding to the numerical attributes you wish to include
cat_attribs = ["names"] # put here the column names (from the dataframe above) corresponding to the categorical attributes you wish to include
type(num_attribs)

list

In [77]:
preprocessor = ColumnTransformer([("num",num_pipeline,num_attribs),
                                   ("cat",cat_pipeline,cat_attribs)])
# make the full pipeline
full_pipeline = Pipeline([("preprocessor",preprocessor)])


You can call `fit_transform()` on the dataframe `df` using the `full_pipeline` you have created. The `ColumnTransformer` runs the `num_pipeline` on the columns listed in `num_attribs`. Then it runs the `cat_pipeline` on the columns listed in `cat_attribs`. It then merges the outputs side by side and returns one array containing everything.

In [78]:
# code here

full_pipeline.fit_transform(df)

array([[-8.33333333e-02,  2.00000000e+00],
       [-8.33333333e-02,  1.00000000e+00],
       [ 7.50000000e-01,  0.00000000e+00],
       [-1.75000000e+00,  0.00000000e+00],
       [ 1.16666667e+00,  3.00000000e+00],
       [-4.44089210e-17,  4.00000000e+00]])
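
The merging behaviour of `ColumnTransformer` can be checked by running each pipeline on its own columns and stacking the results. This is a sketch under a few assumptions: the numeric pipeline here imputes before scaling (which may differ from the `num_pipeline` currently defined in the notebook), the categorical pipeline contains only the encoder, and all variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.DataFrame({
    "names": pd.Series(["Sophie", "Michael", "Eric", "Eric", "Sopie", "Xavier"], dtype=object),
    "scores": [5, 5, 7, 1, 8, np.nan],
})

num = Pipeline([("SI", SimpleImputer(strategy="mean")), ("SS", StandardScaler())])
cat = Pipeline([("OE", OrdinalEncoder())])

# ColumnTransformer fits clones of the pipelines, each on its own columns.
combined = ColumnTransformer([
    ("num", num, ["scores"]),
    ("cat", cat, ["names"]),
]).fit_transform(df)

# The same work done by hand, then stacked column-wise.
separate = np.hstack([
    num.fit_transform(df[["scores"]]),
    cat.fit_transform(df[["names"]]),
])

print(np.allclose(combined, separate))
```

The output columns follow the order of the transformer list, not the column order of the dataframe, which is why the scaled scores come first.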

Try transforming the new test set below with the `full_pipeline` above, using `transform()` only. Are the results what you expected? Answer below.

In [79]:
df_data_test = {
    'names':['Xavier','Michael','Michael'],
    'scores':[100,9,np.nan],
            }

df_test = pd.DataFrame(df_data_test)
df_test

Unnamed: 0,names,scores
0,Xavier,100.0
1,Michael,9.0
2,Michael,


In [80]:
# code here
full_pipeline.transform(df_test)

array([[ 3.95000000e+01,  4.00000000e+00],
       [ 1.58333333e+00,  1.00000000e+00],
       [-4.44089210e-17,  1.00000000e+00]])

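The missing score is filled with approximately 0 (`-4.44e-17`), the fill value learned from the training data, which corresponds to the training mean of 5.2 in the original units. The score of 100 maps to 39.5 because it is scaled with the training mean and standard deviation: (100 - 5.2) / 2.4 = 39.5.

Now try this new data below. Not working? Can you fix it?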

In [81]:
df_data_test2 = {
    'names':['Xavier','Michael',np.nan],
    'scores':[100,9,np.nan],
            }

df_test2 = pd.DataFrame(df_data_test2)
df_test2

Unnamed: 0,names,scores
0,Xavier,100.0
1,Michael,9.0
2,,


In [82]:
full_pipeline.fit_transform(df_test2)

array([[ 1.,  1.],
       [-1.,  0.],
       [ 0.,  2.]])

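We can put a `SimpleImputer` in the categorical pipeline to fill in missing values before encoding, as `cat_pipeline` does with `strategy="constant"`. Note that the cell above calls `fit_transform()`, which refits the pipeline on the test data instead of reusing the fit from the training data.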
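
To handle a missing name with `transform()` alone, the imputer has to fill it with a category the fitted encoder already knows. The sketch below assumes that filling with the most frequent training name is acceptable; this `strategy="most_frequent"` choice is swapped in for the example and is not the `strategy="constant"` used in the notebook. The step and variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

train_names = np.array(
    [["Sophie"], ["Michael"], ["Eric"], ["Eric"], ["Sopie"], ["Xavier"]],
    dtype=object,
)
test_names = np.array([["Xavier"], ["Michael"], [np.nan]], dtype=object)

cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder()),
])

# Fit on the training names only, then reuse the fit on the test names.
cat_pipe.fit(train_names)
encoded = cat_pipe.transform(test_names)
print(encoded)
```

The missing name is replaced by `Eric`, the most frequent training value, so the encoder never meets an unknown category and the training codes are preserved.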

# Customised Transformers

In order to create custom transformers we must write our own class. It takes the form below. Try running `fit_transform()` on the dataframe `df_num2` below. What happens?

In [83]:
# we must import these
from sklearn.base import BaseEstimator, TransformerMixin

class DontChangeAnything(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

    

In [84]:
df_num2

Unnamed: 0,scores
0,1
1,2
2,3
3,1
4,2
5,3


In [90]:
custom = DontChangeAnything()

custom.fit_transform(df_num2)

Unnamed: 0,scores
0,1
1,2
2,3
3,1
4,2
5,3


It doesn't change anything: `transform()` returns its input as is, so the output is the original dataframe.

Now try using only `fit()`. What happens? Why?

In [91]:
custom.fit(df_num2)

DontChangeAnything()

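`fit()` returns `self`, so the notebook displays the transformer object itself. No data is transformed, because `transform()` is never called.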

Make a copy of the transformer class above. Edit it to return the `.describe()` of the dataframe only.

In [93]:
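class DontChangeAnything_copy(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        # fit() returns self so that fit_transform() and Pipeline keep working
        return self
    def transform(self, X):
        # return only the summary statistics of the dataframe
        return X.describe()

custom_copy = DontChangeAnything_copy()

custom_copy.fit_transform(df_num2)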

Unnamed: 0,scores
count,6.0
mean,2.0
std,0.894427
min,1.0
25%,1.25
50%,2.0
75%,2.75
max,3.0


# Edit your own custom transformer

Copy the `DontChangeAnything` transformer. Assuming that its input will be a NumPy array, edit it so that the array is put into a dataframe, which is returned. Test it on the array data below.

In [88]:
array = np.array(df_test2)
array

array([['Xavier', 100.0],
       ['Michael', 9.0],
       [nan, nan]], dtype=object)

In [94]:
class DontChangeAnything_copy2(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = pd.DataFrame(X)
        return X

custom2 = DontChangeAnything_copy2()

custom2.fit_transform(array)

Unnamed: 0,0,1
0,Xavier,100.0
1,Michael,9.0
2,,
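
A custom transformer like this one can also sit at the end of a `Pipeline`. The sketch below is one possible variant, not part of the exercise: the class name `ArrayToDataFrame` and its `columns` parameter are hypothetical, introduced here to show how constructor arguments are stored.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class ArrayToDataFrame(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that wraps an array in a dataframe."""

    def __init__(self, columns=None):
        # Store constructor arguments unchanged, as BaseEstimator expects.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; return self so the call can be chained.
        return self

    def transform(self, X):
        return pd.DataFrame(X, columns=self.columns)


X = np.array([[5.0], [np.nan], [7.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("to_frame", ArrayToDataFrame(columns=["scores"])),
])

out = pipe.fit_transform(X)
print(out)
```

Because `TransformerMixin` supplies `fit_transform()`, only `fit()` and `transform()` need to be written for the class to work as a pipeline step.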


##### Congratulations! You have reached the end of this notebook! At this point you should be able to use Transformers and Pipelines to tackle new exercises!