# Learn and practice Pipelines notebook!


Objective:
* Break down Transformers and Pipelines so that they are easier to understand.

Definitions:


A sklearn Transformer takes data in one form and returns it in another, so it is usually used to pass data from one processing stage to the next (although a transformer can, in principle, return anything). Examples include:
* OneHotEncoder
* StandardScaler
* etc.

A sklearn `Pipeline` is used to chain multiple transformers together.

Pipeline is also a generic term used to describe the process of moving data from one form/state to another.
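
As a minimal sketch of the transformer interface described above: a transformer learns something from the data in `fit()` and applies it in `transform()`. The example uses `MinMaxScaler`, which is not part of the exercises in this notebook and is chosen here only for illustration; the `demo_` names and the data are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one feature (column), three samples (rows).
demo_X = np.array([[1.0], [2.0], [3.0]])

demo_minmax = MinMaxScaler()
demo_minmax.fit(demo_X)                    # learns the column minimum and maximum
demo_out = demo_minmax.transform(demo_X)   # rescales the column to the range [0, 1]

print(demo_minmax.data_min_, demo_minmax.data_max_)  # what was learned in fit()
print(demo_out.ravel())                              # the transformed values
```

Calling `fit_transform()` does both steps in one call on the same data.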

##### To begin
Inspect the scikit-learn documentation for the following: `SimpleImputer`, `Pipeline`, `StandardScaler`, `OneHotEncoder`. For each of these, read its purpose and an example of its use.

Add the necessary import statements to use them.

In [None]:
# code here

In [None]:
import pandas as pd
import numpy as np

# Transformers

## Numeric transformers

### Imputer

In [None]:
num_data1 = {'scores':[5,5,7,1,8,np.nan]}
df_num1 = pd.DataFrame(num_data1)
df_num1

Instantiate a variable `imputer` as a `SimpleImputer` with `median` as its strategy. Run `.fit_transform()` on the numeric dataframe above and save the result in the `imputed_data` variable.

In [None]:
# code here
imputer = None

imputed_data = None

imputed_data

Compare the output with the object passed as input to the imputer (`df_num1`).


The output is no longer a dataframe. What type is it? Which values have changed? Answer below.

*Answer here*

Try changing the `strategy` input.

Observe the difference in results and write them down in the cell below.

In [None]:
# code here


*Observations here*

### Scaler

In [None]:
num_data2 = {'scores':[1,2,3,1,2,3]}
df_num2 = pd.DataFrame(num_data2)
df_num2

Instantiate a variable `scaler` as a `StandardScaler`. Run `fit_transform()` and save the result in a new variable `scaled_data`. Inspect the output and write down your observations below.

In [None]:
# code here


*Write down your observations here*

Calculate the mean and standard deviation of your data and save them in two variables. Print the content of these variables.

In [None]:
# code here
std = 0
mean = 0

Compute the values obtained if you subtract the mean from each element in `df_num2` and then divide by the standard deviation:
`(value - mean) / std`. Save the results in an array.

In [None]:
# code here

How do the results compare to the values in `scaled_data`? Why are they different?

*Answer here*

Make the necessary modifications to obtain the same results as in `scaled_data`. Hint: Inspect the default parameter values for the functions called.
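
As an illustration of how default parameter values can change a result, the sketch below computes a standard deviation with pandas and with NumPy on hypothetical data (not the exercise data). The two libraries use different defaults for the `ddof` argument. Which convention `StandardScaler` follows is left for you to check in its documentation.

```python
import numpy as np
import pandas as pd

demo_values = pd.Series([2.0, 4.0, 6.0])  # hypothetical data, mean is 4

# pandas default is ddof=1: divide the sum of squared deviations by (n - 1)
std_pandas = demo_values.std()

# NumPy default is ddof=0: divide the sum of squared deviations by n
std_numpy = np.std(demo_values.to_numpy())

print(std_pandas)  # sqrt(8 / 2) = 2.0
print(std_numpy)   # sqrt(8 / 3), roughly 1.633

# Passing ddof explicitly makes the two agree
print(demo_values.std(ddof=0))
```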

In [None]:
# code here

## Non-numeric transformers

In [None]:
cat_data1 = {'names':['Sophie','Michael','Eric','Eric','Sopie', 'Xavier']} 
df_cat = pd.DataFrame(cat_data1)
df_cat

Look up the documentation for `OrdinalEncoder`, add the necessary import statement, and initialize a variable `ord_encoder` as an `OrdinalEncoder`. Use `fit_transform()` and save the result in a variable `ord_encoded_data`. Print the content of the variable.

In [None]:
# code here

Import `OneHotEncoder` and use `fit_transform()` on the same data (set `sparse_output=False` for readability; in scikit-learn versions before 1.2 this parameter is called `sparse`). Save the result in a variable and print its content.

In [None]:
# code here

What is this returning? How does it relate to the data saved in `ord_encoded_data`?

HINT: look at the `.categories_` attribute after you have used `fit_transform()`. Write your observations below.

*Answer here*
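
Once you have written your own answer, the sketch below may help to check it. It runs both encoders on a small hypothetical dataframe (`demo_cat`, not the exercise data) and compares their outputs. It calls `.toarray()` on the default sparse output instead of setting a parameter, so it should not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data, chosen only for illustration
demo_cat = pd.DataFrame({'colour': ['red', 'blue', 'red', 'green']})

demo_ord = OrdinalEncoder().fit_transform(demo_cat)  # one numeric code per row
demo_ohe = OneHotEncoder().fit_transform(demo_cat)   # sparse matrix by default
demo_dense = demo_ohe.toarray()                      # dense copy for inspection

# In each one-hot row, the position of the 1 equals the ordinal code of that row
print(demo_ord.ravel())
print(demo_dense.argmax(axis=1))
```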

# Using transformers on a test set


In general, the idea is to use the `fit_transform()` method on the train data, followed by the `transform()` method on the test data. This allows one to make sure exactly the same processing steps are applied on both data sets. 
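
The idea above can be sketched with a `SimpleImputer` on hypothetical train and test arrays (the `demo_` names and values are made up for illustration). The value used to fill the gap in the test data is the one learned from the train data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

demo_train = np.array([[1.0], [3.0], [5.0]])  # hypothetical train data, mean is 3
demo_test = np.array([[np.nan], [10.0]])      # hypothetical test data with a gap

demo_imputer = SimpleImputer(strategy='mean')
demo_train_filled = demo_imputer.fit_transform(demo_train)  # learns the train mean
demo_test_filled = demo_imputer.transform(demo_test)        # reuses the train mean

print(demo_imputer.statistics_)  # the value learned from the train data
print(demo_test_filled)          # the NaN is replaced by 3.0, not by the test mean
```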


#### Non-numerical data


In [None]:
# The encoder saved in `ord_encoder` has already been fit to the train data and so is ready to be used to transform the test data
ord_encoder

For example using our `ord_encoder` above, we can pass new data to the `transform()` function. Let there be a new dataframe `df_cat3` below.

In [None]:
cat_data3 = {'names':['Xavier','Xavier','Eric']} #'Antoine','Xavier',
df_cat3 = pd.DataFrame(cat_data3)
df_cat3

Call `transform()` on this new data set. What do you see has happened? Answer in the space below.

In [None]:
# code here

*Answer here*

Note that the above will throw an error if the new data has categories not present in the training data. What are possible solutions to avoid that? Answer below.

*Answer here*
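
One possible option, among others, is sketched below on hypothetical data. It assumes scikit-learn 0.24 or later, where `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'` together with `unknown_value`. The choice of `-1` as the code for unseen categories is arbitrary.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: 'Nice' appears only in the new data
demo_train_cat = pd.DataFrame({'city': ['Paris', 'Lyon', 'Paris']})
demo_new_cat = pd.DataFrame({'city': ['Lyon', 'Nice']})

# Unseen categories are mapped to unknown_value instead of raising an error
demo_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
demo_encoder.fit(demo_train_cat)
demo_codes = demo_encoder.transform(demo_new_cat)

print(demo_codes.ravel())  # 'Lyon' keeps its learned code, 'Nice' becomes -1
```

`OneHotEncoder` has a related option, `handle_unknown='ignore'`, which encodes an unseen category as a row of zeros.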

#### Numerical data

Let's take a look now at how to employ a similar process on new numerical data.

In [None]:
num_data3 = {'scores':[4,8,12]}
df_num3 = pd.DataFrame(num_data3)
df_num3

Now use the previously fitted `StandardScaler` to `transform()` the new data in `df_num3`. What effects do you observe? Answer in the space below.

In [None]:
# code here

*Answer here*

# Pipeline

We can chain multiple commands together in a pipeline. The pipeline will run by preserving the order in which the commands are placed.
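
As a rough sketch of what chaining means, the example below builds a small pipeline and checks that it gives the same result as applying the two steps by hand, in the same order. The transformers, step names (`'fill'`, `'rescale'`) and data are hypothetical choices made for illustration, not the ones asked for in the exercises.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

demo_X = np.array([[0.0], [np.nan], [10.0], [10.0]])  # hypothetical data

# Each step is a (name, transformer) pair; steps run in the order listed
demo_pipeline = Pipeline([
    ('fill', SimpleImputer(strategy='most_frequent')),
    ('rescale', MinMaxScaler()),
])
demo_piped = demo_pipeline.fit_transform(demo_X)

# The same two steps applied by hand, one after the other
demo_filled = SimpleImputer(strategy='most_frequent').fit_transform(demo_X)
demo_manual = MinMaxScaler().fit_transform(demo_filled)

print(np.allclose(demo_piped, demo_manual))  # True
```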

Make the necessary import from scikit-learn to use `Pipeline`.

In [None]:
# code here

#### Numerical data

Define a `Pipeline` consisting of two operations: the application of a `SimpleImputer` and the application of a `StandardScaler`. Save the result in a variable `num_pipeline`. 

In [None]:
# code here

Call `fit_transform()` on the `df_num1` data using the `num_pipeline` and look at the output. What do you observe? Answer below.

In [None]:
# code here

*Answer here*

Comment out the part of the code containing the `StandardScaler`. What happens now? Answer below.

In [None]:
# code here

*Answer here*

What order is the code executing the transformers in? What happens if you try putting the transformers the other way around? Answer below.

In [None]:
# code here

*Answer here*

#### Non-numerical data

Create a categorical pipeline to process the `df_cat` dataframe and save it in a variable `cat_pipeline`. Use an `OrdinalEncoder` within it (yes, you can make a pipeline with only one Transformer).

In [None]:
# code here

# Combining Pipelines

In the following we will see how we can combine multiple pipelines together. Let's take a new dataframe `df_data` containing both numerical and non-numerical data. 

In [None]:
df_data = {
    'names':['Sophie','Michael','Eric','Eric','Sopie', 'Xavier'],
    'scores':[5,5,7,1,8,np.nan],
            }

df = pd.DataFrame(df_data)
df

As the dataframe above has both types of data, numerical and categorical, we must create separate pipelines for processing the different types of data. We can then use a `ColumnTransformer` to put them together.

Make the necessary imports from scikit-learn to be able to use `ColumnTransformer`. Read the corresponding documentation page to understand its purpose and how to use it.

In [None]:
# code here

In [None]:
# num_attribs and cat_attribs should be each a list of strings, the strings being column names

num_attribs = [] # put here the column names (from the dataframe above) corresponding to the numerical attributes you wish to include
cat_attribs = [] # put here the column names (from the dataframe above) corresponding to the categorical attributes you wish to include

# make the full pipeline
full_pipeline = None


You can call `fit_transform()` on the dataframe `df` using the `full_pipeline` you have created. The `ColumnTransformer` runs the `num_pipeline` on the columns listed in `num_attribs`. Then it runs the `cat_pipeline` on the columns listed in `cat_attribs`. It then merges the outputs to return one array containing everything.
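
The behaviour described above can be sketched on a small hypothetical dataframe. For brevity the sketch passes single transformers rather than pipelines, and the column names, step names (`'num'`, `'cat'`) and data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical data with one categorical and one numerical column
demo_df = pd.DataFrame({
    'city': ['Paris', 'Lyon', 'Paris'],
    'age': [20.0, 30.0, 40.0],
})

# Each entry is (name, transformer, list of columns it is applied to)
demo_ct = ColumnTransformer([
    ('num', MinMaxScaler(), ['age']),
    ('cat', OrdinalEncoder(), ['city']),
])
demo_out = demo_ct.fit_transform(demo_df)

print(demo_out)  # first column: rescaled 'age', second column: encoded 'city'
```

Note that the output columns follow the order of the transformer list, not the column order of the dataframe.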

In [None]:
# code here

Try calling only `transform()` on the new test set below, using the `full_pipeline` above. Are the results what you expected? Answer below.

In [None]:
df_data_test = {
    'names':['Xavier','Michael','Michael'],
    'scores':[100,9,np.nan],
            }

df_test = pd.DataFrame(df_data_test)
df_test

In [None]:
# code here

*Answer here*

Now try the same on the new data below. Does it fail? Can you fix it?

In [None]:
df_data_test2 = {
    'names':['Xavier','Michael',np.nan],
    'scores':[100,9,np.nan],
            }

df_test2 = pd.DataFrame(df_data_test2)
df_test2

In [None]:
# code here

*Answer here*

# Customised Transformers

In order to create custom transformers we must write our own class. It takes the form below. Try running `fit_transform()` on the dataframe `df_num2` below. What happens?

In [None]:
# we must import these
from sklearn.base import BaseEstimator, TransformerMixin

class DontChangeAnything(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

    

In [None]:
df_num2

In [None]:
# code here

*Answer here*

Now try using only `fit()`. What happens? Why?

In [None]:
# code here

*Answer here*
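
As a further illustration of the class template above, here is a sketch of a custom transformer that takes a constructor argument. The class `MultiplyBy` and its `factor` parameter are hypothetical and exist only for this example. The class does not define `fit_transform()` itself; that method comes from `TransformerMixin`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MultiplyBy(BaseEstimator, TransformerMixin):
    """Hypothetical example transformer: multiplies every value by `factor`."""
    def __init__(self, factor=2):
        self.factor = factor  # store constructor arguments unchanged
    def fit(self, X, y=None):
        return self           # nothing to learn, but fit() must return self
    def transform(self, X):
        return np.asarray(X) * self.factor

demo_result = MultiplyBy(factor=3).fit_transform(np.array([[1], [2]]))
print(demo_result)  # each value multiplied by 3
```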

Make a copy of the transformer class above. Edit it to return the `.describe()` of the dataframe only.

In [None]:
# code here

# Edit your own custom transformer

Copy the `DontChangeAnything` transformer. Assuming that its input will be a numpy array, edit it so that the array is put into a dataframe, which is then returned. Test it on the array data below.

In [None]:
array = np.array(df_test2)
array

In [None]:
# code here

##### Congratulations! You have reached the end of this notebook! At this point you should be able to use Transformers and Pipelines to tackle new exercises!