# Learn and practice Pipelines notebook!


Objective
* Break down Transformers and Pipelines so that they can be better understood.

Definitions:


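A sklearn Transformer learns whatever it needs from the data with `fit()` and applies a change with `transform()`. It is usually used to pass data from one processing stage to the next (although it can in fact pass along anything). Examples include: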
* OneHotEncoder
* StandardScaler
* ....etc

A sklearn `Pipeline` is used to chain multiple transformers together.

Pipeline is also a generic term used to describe the process of moving data from one form/state to another.

##### To begin
Inspect the scikit-learn documentation for the following: `SimpleImputer`, `Pipeline`, `StandardScaler`, `OneHotEncoder`. For each of these, read its purpose and an example of its use.

Add the necessary import statements to use them.

In [376]:
# code here

In [377]:
import pandas as pd
import numpy as np
from sklearn import impute

# Transformers

## Numeric transformers

### Imputer

In [407]:
num_data1 = {'scores':[5,5,7,1,8,np.nan]}
df_num1 = pd.DataFrame(num_data1)
df_num1
df_num2=pd.DataFrame({'scores':[5,5,7,1,8,np.nan]})

Instantiate a variable `imputer` as a `SimpleImputer` with `median` as its strategy. Run `.fit_transform()` on the numeric dataframe above and save the result in the `imputed_data` variable.

In [408]:
# code here
# The exercise asks for the median strategy, applied to df_num1.
imputer = impute.SimpleImputer(strategy='median')

imputed_data = imputer.fit_transform(df_num1)

imputed_data

array([[5.],
       [5.],
       [7.],
       [1.],
       [8.],
       [5.]])

Check the changes for the object passed as input to the imputer (`df_num1`).


It is no longer a DataFrame. What type is it? Which values have changed? Answer below.

*Answer here*

Try changing the `strategy` input.

Observe the difference in results and write them down in the cell below.

In [409]:
# code here


*Observations here*
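As a rough illustration of how the strategies differ, the sketch below fits one `SimpleImputer` per strategy on the same scores and records the value used to fill the gap. The variable names (`scores`, `fill_values`) are introduced here for the example only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

scores = pd.DataFrame({"scores": [5, 5, 7, 1, 8, np.nan]})

# Fit one imputer per strategy and keep the value that replaced the NaN.
fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    imputed = imputer.fit_transform(scores)  # a NumPy array by default
    fill_values[strategy] = float(imputed[-1, 0])

print(fill_values)
```

Only the last row changes from one strategy to another; the observed values are passed through untouched.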

### Scaler

In [410]:
num_data2 = {'scores':[1,2,3,1,2,3]}
df_num2 = pd.DataFrame(num_data2)
df_num2

Unnamed: 0,scores
0,1
1,2
2,3
3,1
4,2
5,3


Instantiate a variable `scaler` as a `StandardScaler`. Run `fit_transform()` and save the result in a new variable `scaled_data`. Inspect the output and write down your observations below.

In [411]:
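# code here
from sklearn import preprocessing as pp
# With both flags off, the scaler neither centers nor scales the data.
scaler=pp.StandardScaler(with_mean=False,with_std=False)
# Note: fit() returns the fitted scaler itself, not transformed data.
scaled_data=scaler.fit(df_num2)
# Compare the standard deviations, then the values, before and after transform().
(scaled_data.transform(df_num2).std()==df_num2.std(),(scaled_data.transform(df_num2))==(df_num2))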


(scores    False
 dtype: bool,    scores
 0    True
 1    True
 2    True
 3    True
 4    True
 5    True)

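*Write down your observations here*
With the default parameters the data is centered on 0 using the mean (not the median) and scaled to unit variance. With `with_mean=False` and `with_std=False`, as above, the values come back unchanged.
The documentation says: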

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
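The formula can be checked against a fitted scaler. The sketch below assumes a default `StandardScaler` and reads `u` and `s` from its `mean_` and `scale_` attributes; the array `x` and the names `z_manual` and `z_sklearn` are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [1.0], [2.0], [3.0]])

scaler = StandardScaler().fit(x)
u = scaler.mean_[0]    # mean learned from the training samples
s = scaler.scale_[0]   # standard deviation learned from the training samples

z_manual = (x - u) / s            # the formula from the documentation
z_sklearn = scaler.transform(x)   # what the scaler computes

print(np.allclose(z_manual, z_sklearn))
```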

In [412]:
(df_num2-2)/df_num2.std()

Unnamed: 0,scores
0,-1.118034
1,0.0
2,1.118034
3,-1.118034
4,0.0
5,1.118034


In [413]:
(df_num2-2).std()==df_num2.std()

scores    True
dtype: bool

In [417]:
scaled_data.std()

AttributeError: 'StandardScaler' object has no attribute 'std'

Calculate the mean and standard deviation of your data and save them in two variables. Print the content of these variables.

In [418]:
# code here
std = df_num2.std()
mean = df_num2.mean()
print(std,mean)

scores    0.894427
dtype: float64 scores    2.0
dtype: float64


Compute the values obtained if you subtract the mean from each element in `df_num2` and then divide by the standard deviation:
`(value - mean)/std`. Save the results in an array.

In [419]:
# code here
((df_num2-mean)/std)

Unnamed: 0,scores
0,-1.118034
1,0.0
2,1.118034
3,-1.118034
4,0.0
5,1.118034


How do the results compare to the values in `scaled_data`? Why are they different?

*Answer here*
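The manual values are less spread out than the output of a default `StandardScaler` (±1.118 versus ±1.225), so their standard deviation is not 1 when measured the way the scaler measures it.
The cause is the degrees-of-freedom convention, not the centering: pandas `.std()` divides by n - 1 (`ddof=1`, giving 0.894), whereas `StandardScaler` divides by n (`ddof=0`, giving 0.816). Subtracting the mean does not change the standard deviation.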
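One thing worth checking when two standard deviations disagree is the `ddof` convention. The sketch below compares the pandas default (`ddof=1`) with the NumPy default (`ddof=0`) on the same scores; `df_scores` and the three result names are introduced for this example.

```python
import numpy as np
import pandas as pd

df_scores = pd.DataFrame({"scores": [1, 2, 3, 1, 2, 3]})

sample_std = df_scores["scores"].std()              # pandas default: ddof=1
population_std = df_scores["scores"].std(ddof=0)    # divide by n instead of n - 1
numpy_std = np.std(df_scores["scores"].to_numpy())  # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

Dividing the centered data by `population_std` rather than `sample_std` should reproduce the scaler's values.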

Make the necessary modifications to obtain the same results as in `scaled_data`. Hint: Inspect the default parameter values for the functions called.

In [420]:
# code here
# Center only, then divide by the population standard deviation (ddof=0).
scaler=pp.StandardScaler(with_std=False)
scaled_data=scaler.fit_transform(df_num2)
scaled_data/np.std(df_num2.to_numpy())

array([[-1.22474487],
       [ 0.        ],
       [ 1.22474487],
       [-1.22474487],
       [ 0.        ],
       [ 1.22474487]])

In [421]:
np.std(df_num2-2,ddof=0)
scaled_data.std()

0.816496580927726

## Non-numeric transformers

In [422]:
cat_data1 = {'names':['Sophie','Michael','Eric','Eric','Sopie', 'Xavier']} 
df_cat = pd.DataFrame(cat_data1)
df_cat

Unnamed: 0,names
0,Sophie
1,Michael
2,Eric
3,Eric
4,Sopie
5,Xavier


Look up the documentation for `OrdinalEncoder`, add the necessary import statement, and initialize a variable `ord_encoder` as an `OrdinalEncoder`. Use `fit_transform()` and save the result in a variable `ord_encoded_data`. Print the content of the variable.

In [423]:
# code here
ord_encoder=pp.OrdinalEncoder()
ord_encoded_data=ord_encoder.fit_transform(df_cat)
ord_encoded_data

array([[2.],
       [1.],
       [0.],
       [0.],
       [3.],
       [4.]])

In [424]:
ord_encoder.categories_

[array(['Eric', 'Michael', 'Sophie', 'Sopie', 'Xavier'], dtype=object)]

Import `OneHotEncoder` and use `fit_transform()` on the same data (set `sparse=False` for readability; in scikit-learn 1.2 and later this parameter is named `sparse_output`). Save the result in a variable and print its content.

In [425]:
# code here
ord_encoder=pp.OneHotEncoder(sparse = False)
ord_encoded_data=ord_encoder.fit_transform(df_cat)
ord_encoded_data

array([[0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [426]:
ord_encoder.categories_

[array(['Eric', 'Michael', 'Sophie', 'Sopie', 'Xavier'], dtype=object)]

What is this returning? How does it relate to the data saved in `ord_encoded_data`?

HINT: look at the `.categories_` attribute after you have used `fit_transform()`. Write your observations below.

*Answer here*

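`categories_` lists the unique names found during fitting. Each column of the one-hot matrix corresponds to one of these names, and each row corresponds to one row of the input, with a 1 in the column of its name and 0 elsewhere.
The columns follow the alphabetical order of the names (the whole string is compared, so 'Sophie' comes before 'Sopie').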
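To make the link between `categories_` and the encoded matrix concrete, here is a small sketch that recovers each name from the position of the 1 in its row. It keeps the encoder's default sparse output and converts it with `toarray()`, which avoids depending on the name of the dense-output parameter; `names`, `one_hot` and `recovered` are names chosen for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

names = pd.DataFrame(
    {"names": ["Sophie", "Michael", "Eric", "Eric", "Sopie", "Xavier"]}
)

encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; toarray() makes it dense.
one_hot = encoder.fit_transform(names).toarray()

categories = encoder.categories_[0]
# The position of the 1 in each row is an index into categories_.
recovered = categories[one_hot.argmax(axis=1)]

print(list(recovered))
```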

# Using on test set


In general, the idea is to use the `fit_transform()` method on the train data, followed by the `transform()` method on the test data. This allows one to make sure exactly the same processing steps are applied on both data sets. 
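A minimal sketch of this fit-on-train, transform-on-test pattern, using a `SimpleImputer` with the median strategy as an assumed example: the statistic is learned once from the train data and reused on the test data. The arrays `train` and `test` are made up for the illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[5.0], [5.0], [7.0], [1.0], [8.0], [np.nan]])
test = np.array([[100.0], [9.0], [np.nan]])

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # learns the median of the train data
test_filled = imputer.transform(test)        # reuses it; nothing is re-learned

print(imputer.statistics_, test_filled[-1])
```

The missing test value is filled with the train median, not with a statistic computed from the test set.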


#### Non-numerical data


In [427]:
#The encoder saved in `ord_encoder` has already been fit to the train data and so is ready to be used to transform the test data
ord_encoder

OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=False)

For example using our `ord_encoder` above, we can pass new data to the `transform()` function. Let there be a new dataframe `df_cat3` below.

In [428]:
cat_data3 = {'names':['Xavier','Xavier','Eric']} #'Antoine','Xavier',
df_cat3 = pd.DataFrame(cat_data3)
df_cat3

Unnamed: 0,names
0,Xavier
1,Xavier
2,Eric


Call `transform()` on this new data set. What do you see has happened? Answer in the space below.

In [429]:
# code here
ord_encoder.transform(df_cat3)

array([[0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.]])

*Answer here*

It keeps the column order learned during fitting, so the output still has all five columns, including those for names that do not appear in the new data.

Note that the above will throw an error if the new data has categories not present in the training data. What are possible solutions to avoid that? Answer below.

*Answer here*

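Fit the encoder on the list of all possible names instead of a sample, or create it with `handle_unknown='ignore'` so that unseen names are encoded as a row of zeros.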
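One option offered by `OneHotEncoder` is `handle_unknown="ignore"`. The sketch below assumes that setting and a new name, "Antoine", that was absent from the train data; `train_names`, `new_names` and `encoded` are names chosen for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_names = pd.DataFrame(
    {"names": ["Sophie", "Michael", "Eric", "Eric", "Sopie", "Xavier"]}
)
new_names = pd.DataFrame({"names": ["Xavier", "Antoine"]})  # "Antoine" is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_names)

# Unseen categories become a row of zeros instead of raising an error.
encoded = encoder.transform(new_names).toarray()
print(encoded)
```

The trade-off is that an all-zero row no longer says which name was seen, only that it was not one of the known ones.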

#### Numerical data

Let's take a look now at how to employ a similar process on new numerical data.

In [430]:
num_data3 = {'scores':[4,8,12]}
df_num3 = pd.DataFrame(num_data3)
df_num3

Unnamed: 0,scores
0,4
1,8
2,12


Now use the previously fitted `StandardScaler` to `transform()` the new data in `df_num3`. What effects do you observe? Answer in the space below.

In [431]:
# code here
scaler.transform(df_num3)

array([[ 2.],
       [ 6.],
       [10.]])

*Answer here*

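It subtracts the mean of the previous (training) set, which is 2. It would also divide by the training standard deviation,
but the scaler used here was created with `with_std=False`, so only the centering is applied.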

# Pipeline

We can chain multiple commands together in a pipeline. The pipeline will run by preserving the order in which the commands are placed.
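As a sketch of what chaining means, the example below builds a two-step `Pipeline` with explicitly named steps and compares its output with the same two transformers applied by hand in the same order. The step names `"imputer"` and `"scaler"` and the array `X` are choices made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[5.0], [5.0], [7.0], [1.0], [8.0], [np.nan]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
piped = pipe.fit_transform(X)

# The same two steps applied by hand, in the same order.
by_hand = StandardScaler().fit_transform(
    SimpleImputer(strategy="mean").fit_transform(X)
)

print(np.allclose(piped, by_hand), list(pipe.named_steps))
```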

Make the necessary import from scikit learn to use `Pipeline`.

In [432]:
# code here
from sklearn import pipeline as pi


#### Numerical data

Define a `Pipeline` consisting of two operations: the application of a `SimpleImputer` and the application of a `StandardScaler`. Save the result in a variable `num_pipeline`. 

In [433]:
# code here
num_pipeline=pi.make_pipeline(impute.SimpleImputer(),pp.StandardScaler())

Call `fit_transform()` on the `df_num1` data using the `num_pipeline` and look at the output. What do you observe? Answer below.

In [434]:
# code here
num_pipeline.fit_transform(df_num1)

array([[-0.09128709],
       [-0.09128709],
       [ 0.82158384],
       [-1.91702895],
       [ 1.2780193 ],
       [ 0.        ]])

*Answer here*

It fills the missing value with the mean, then scales the result. The scaler treats the data as the whole population (`ddof` is 0), and the filled value, being the mean, maps to 0.

Comment out the whole part of the code containing the `StandardScaler`. What happens now? Answer below.

In [435]:
# code here
num_pipeline=pi.make_pipeline(pp.StandardScaler(),impute.SimpleImputer())
num_pipeline.fit_transform(df_num1)
pp.StandardScaler().fit_transform(df_num1)[:-1].mean()

-4.4408920985006264e-17

*Answer here*

What order is the code executing the transformers in? What happens if you try putting the transformers the other way around? Answer below.

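*Answer here*

The transformers run in the order in which they are listed.

With the imputer first, the missing value is filled with the mean (5.2) and the scaler then computes its statistics over all six values, so the filled value pulls the standard deviation down (about 2.19).

With the scaler first, `StandardScaler` ignores the NaN when fitting and computes the mean and standard deviation from the five observed values only (standard deviation 2.4). The imputer then fills the gap with the mean of the scaled column, which is approximately 0.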
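The effect of the order can be seen by building both variants side by side. This sketch relies on the step names that `make_pipeline` generates (`"standardscaler"`) to read the standard deviation each scaler learned; the pipeline names and `X` are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[5.0], [5.0], [7.0], [1.0], [8.0], [np.nan]])

impute_then_scale = make_pipeline(SimpleImputer(), StandardScaler())
scale_then_impute = make_pipeline(StandardScaler(), SimpleImputer())

a = impute_then_scale.fit_transform(X)
b = scale_then_impute.fit_transform(X)

# Standard deviation learned by the scaler step in each pipeline.
std_a = impute_then_scale.named_steps["standardscaler"].scale_[0]
std_b = scale_then_impute.named_steps["standardscaler"].scale_[0]

print(std_a, std_b)
```

The two outputs differ because the scaler sees six values in the first pipeline and only five in the second.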

#### Non-numerical data

Create a categorical pipeline to process the `df_cat` dataframe. Use an `OrdinalEncoder` within this (yes, you can make a pipeline with only one Transformer).

In [436]:
df_cat

Unnamed: 0,names
0,Sophie
1,Michael
2,Eric
3,Eric
4,Sopie
5,Xavier


In [437]:
# code here
josh=pi.make_pipeline(pp.OrdinalEncoder())
josh.fit_transform(df_cat)

array([[2.],
       [1.],
       [0.],
       [0.],
       [3.],
       [4.]])

# Combining Pipelines

In the following we will see how we can combine multiple pipelines together. Let's take a new dataframe `df_data` containing both numerical and non-numerical data. 

In [438]:
df_data = {
    'names':['Sophie','Michael','Eric','Eric','Sopie', 'Xavier'],
    'scores':[5,5,7,1,8,np.nan],
            }

df = pd.DataFrame(df_data)
df

Unnamed: 0,names,scores
0,Sophie,5.0
1,Michael,5.0
2,Eric,7.0
3,Eric,1.0
4,Sopie,8.0
5,Xavier,


As the dataframe above has both types of data, numerical and categorical, we must create separate pipelines for processing the different types of data. We can then use a `ColumnTransformer` to put them together.

Make the necessary imports from scikit-learn to be able to use `ColumnTransformer`. Read the corresponding documentation page to understand its purpose and the way to use it.

In [439]:
# code here
from sklearn import compose as cp



In [440]:
# num_attribs and cat_attribs should be each a list of strings, the strings being column names

num_attribs = ["scores"] # put here the column names (from the dataframe above) corresponding to the numerical attributes you wish to include
cat_attribs = ["names"] # put here the column names (from the dataframe above) corresponding to the categorical attributes you wish to include
C=cp.ColumnTransformer([("score_wanted",num_pipeline,num_attribs),("name_cat",josh,cat_attribs)])
# make the full pipeline
full_pipeline = pi.make_pipeline(C)
full_pipeline

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('score_wanted',
                                                  Pipeline(memory=None,
                                                           steps=[('standardscaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean=True,
                                                                                  with_std=True)),
                                                                  ('simpleimputer',
                                                                   SimpleImputer(add_indicator=False,
                                                         

You can call `fit_transform()` on the dataframe `df` using the `full_pipeline` you have created. The `ColumnTransformer` runs `num_pipeline` on each of the columns listed in `num_attribs`. Then it runs the categorical pipeline (named `josh` above) on the columns listed in `cat_attribs`. It then merges the outputs and returns one array containing everything.
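A self-contained sketch of this routing follows. It assumes a numeric pipeline with the imputer placed before the scaler and a categorical pipeline called `cat_pipeline`; these names, and the labels `"num"` and `"cat"`, are chosen for the example rather than taken from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.DataFrame({
    "names": ["Sophie", "Michael", "Eric", "Eric", "Sopie", "Xavier"],
    "scores": [5, 5, 7, 1, 8, np.nan],
})

num_pipeline = make_pipeline(SimpleImputer(), StandardScaler())
cat_pipeline = make_pipeline(OrdinalEncoder())

combined = ColumnTransformer([
    ("num", num_pipeline, ["scores"]),  # numeric columns go through num_pipeline
    ("cat", cat_pipeline, ["names"]),   # categorical columns go through cat_pipeline
])
result = combined.fit_transform(df)

# Output columns follow the order of the transformer list: scores, then names.
print(result.shape)
```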

In [441]:
pp.OneHotEncoder(sparse = False).fit_transform(df[["names"]])

array([[0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [442]:
# code here
full_pipeline.fit_transform(df)

array([[-8.33333333e-02,  2.00000000e+00],
       [-8.33333333e-02,  1.00000000e+00],
       [ 7.50000000e-01,  0.00000000e+00],
       [-1.75000000e+00,  0.00000000e+00],
       [ 1.16666667e+00,  3.00000000e+00],
       [-4.44089210e-17,  4.00000000e+00]])

Try transforming (with `transform()` only) the new test set below using the `full_pipeline` above. Are the results what you expected? Answer below.

In [443]:
df_data_test = {
    'names':['Xavier','Michael','Michael'],
    'scores':[100,9,np.nan],
            }

df_test = pd.DataFrame(df_data_test)
df_test

Unnamed: 0,names,scores
0,Xavier,100.0
1,Michael,9.0
2,Michael,


In [444]:
# code here
full_pipeline.transform(df_test)

array([[ 3.95000000e+01,  4.00000000e+00],
       [ 1.58333333e+00,  1.00000000e+00],
       [-4.44089210e-17,  1.00000000e+00]])

*Answer here*

Michael is encoded as 1 and Xavier as 4, which is what was expected for the names.
The scores also look right: about 0 for the NaN, 1.58 for 9, and 39.5 for 100.

Now try it on the new data below. Not working? Can you fix it?

In [445]:
df_data_test2 = {
    'names':['Xavier','Michael',np.nan],
    'scores':[100,9,np.nan],
            }

df_test2 = pd.DataFrame(df_data_test2)
df_test2

Unnamed: 0,names,scores
0,Xavier,100.0
1,Michael,9.0
2,,


In [446]:
# code here
"""
df_test2.names=df_test2.names.fillna("NoName")
what_i_adds=pd.DataFrame({"names":"NoName"}, index=[0])
what_i_adds=what_i_adds.append(df_cat)
josh=pi.make_pipeline(pp.OrdinalEncoder())
josh.fit_transform(what_i_adds)

does not work, because the model must be fit on data containing all the names
"""
full_pipeline.transform(df_test2)

ValueError: Input contains NaN

*Answer here*
No. I did not implement any step to fill NaN values in `names`, so the encoder rejects the input. I need to fit a different pipeline.
I will have to either remove the rows with a missing name, or fill them with a standard placeholder value so that those rows are still counted.
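One possible way to fill missing names is sketched below. It assumes a placeholder string `"NoName"`, a `SimpleImputer` with the `constant` strategy placed before the encoder, and an `OrdinalEncoder` that maps unseen values to -1 (the `handle_unknown="use_encoded_value"` option, available from scikit-learn 0.24). The arrays and the name `cat_pipeline` are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

train = np.array(
    [["Sophie"], ["Michael"], ["Eric"], ["Eric"], ["Sopie"], ["Xavier"]],
    dtype=object,
)
test = np.array([["Xavier"], ["Michael"], [np.nan]], dtype=object)

cat_pipeline = make_pipeline(
    # "NoName" is a placeholder chosen for this sketch.
    SimpleImputer(strategy="constant", fill_value="NoName"),
    # Values not seen during fit (here the placeholder) are encoded as -1.
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
)
cat_pipeline.fit(train)
encoded = cat_pipeline.transform(test)

print(encoded.ravel())
```

Dropping the incomplete rows before calling `transform()` is the simpler alternative when losing them is acceptable.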

# Customised Transformers

In order to create custom transformers we must write our own class. It takes the form below. Try running `fit_transform()` on the dataframe `df_num2`. What happens?

In [403]:
# we must import these
from sklearn.base import BaseEstimator, TransformerMixin

class DontChangeAnything(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

    

In [453]:
# code here
DontChangeAnything().fit_transform(df_num2)


Unnamed: 0,scores
0,1
1,2
2,3
3,1
4,2
5,3


*Answer here*
It changes nothing: the dataframe is returned as it was passed in.

Now try using only `fit()`. What happens? Why?

In [465]:
# code here
DontChangeAnything().fit(df_num2).fit(df_num2).fit(df_num2).fit(df_num2)


DontChangeAnything()

*Answer here*

I get the transformer object itself, because `fit()` returns `self`. That is also why the calls can be chained.
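The sketch below illustrates why returning `self` matters, using a hypothetical `AddConstant` transformer written for this example: `fit()` can be chained with `transform()`, and `TransformerMixin` supplies `fit_transform()` on top of the two methods.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AddConstant(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that adds a fixed value to every entry."""

    def __init__(self, constant=1.0):
        self.constant = constant

    def fit(self, X, y=None):
        # Nothing to learn; returning self is what allows chaining.
        return self

    def transform(self, X):
        return np.asarray(X) + self.constant

X = np.array([[1.0], [2.0], [3.0]])
transformer = AddConstant(constant=10.0)

chained = transformer.fit(X).transform(X)
shortcut = transformer.fit_transform(X)  # provided by TransformerMixin

print(np.array_equal(chained, shortcut))
```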

Make a copy of the transformer class above. Edit it to return the `.describe()` of the dataframe only.

In [468]:
# code here
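class Describation(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self
    def transform(self, X):
        # Return the summary statistics instead of the data itself.
        return X.describe()

Describation().fit_transform(df_num2)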

Unnamed: 0,scores
count,6.0
mean,2.0
std,0.894427
min,1.0
25%,1.25
50%,2.0
75%,2.75
max,3.0


# Edit your own custom transformer

Copy the `DontChangeAnything` transformer. Assuming that its input will be a numpy array, edit it so that the array is put into a dataframe, which is then returned. Test it on the array data below.

In [471]:
array = np.array(df_test2)
array

array([['Xavier', 100.0],
       ['Michael', 9.0],
       [nan, nan]], dtype=object)

In [473]:
# code here
class Convert(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Wrap the incoming numpy array in a dataframe.
        return pd.DataFrame(X)
Convert().fit_transform(array)

Unnamed: 0,0,1
0,Xavier,100.0
1,Michael,9.0
2,,


##### Congratulations! You have reached the end of this notebook! At this point you should be able to use Transformers and Pipelines to tackle new exercises!