# Practice pipelines notebook!


Objective 
* break down transformers and pipelines so that they can be understood better.

Definitions:


A sklearn transformer is usually used to transform data as it passes from one stage to the next (although a transformer can actually take and return any object). Examples include:
* OneHotEncoder
* StandardScaler
* etc.

A sklearn pipeline is used to chain multiple transformers together.

Pipeline is also a generic term used to describe the process of moving data from one form/state to another.
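
As a minimal sketch of what "transformer" means in practice, the cell below separates the two steps every sklearn transformer exposes: `fit` learns something from the data and `transform` applies it. The array `X_demo` and the variable names are made up for illustration and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): one feature, four samples
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler_demo = StandardScaler()
scaler_demo.fit(X_demo)                    # learn the column mean and standard deviation
print(scaler_demo.mean_)                   # the learned statistics live on the fitted object
X_scaled = scaler_demo.transform(X_demo)   # apply the learned statistics

# fit_transform is a shortcut for fit followed by transform on the same data
X_scaled_again = StandardScaler().fit_transform(X_demo)
print(np.allclose(X_scaled, X_scaled_again))
```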

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

# Transformers

## Numeric transformers

### Imputer

In [6]:
num_data1 = {'Scores':[5,5,7,1,8,np.nan]}
df_num1 = pd.DataFrame(num_data1)
df_num1

Unnamed: 0,Scores
0,5.0
1,5.0
2,7.0
3,1.0
4,8.0
5,


Import SimpleImputer and run `.fit_transform()` on the numeric df above.

In [5]:
from sklearn.impute import SimpleImputer
    
imputer = SimpleImputer(strategy='median')

imputed_data = imputer.fit_transform(df_num1)
imputed_data

array([[5.],
       [5.],
       [7.],
       [1.],
       [8.],
       [5.]])

Look to see what has changed in the object we passed it (`df_num1`).


It is no longer a dataframe. What type is it?

--> The data is no longer a DataFrame but a NumPy array (`np.ndarray`).

--> The NaN value was replaced by the median of the series, which is 5.
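
A short sketch of how one might check both observations: the fitted imputer keeps the value it learned in its `statistics_` attribute, and the returned array can be wrapped back into a DataFrame if the labels are needed. The name `df_imputed` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_num1 = pd.DataFrame({'Scores': [5, 5, 7, 1, 8, np.nan]})

imputer = SimpleImputer(strategy='median')
imputed_data = imputer.fit_transform(df_num1)

# The value learned during fit (one entry per column)
print(imputer.statistics_)

# Wrap the array back into a DataFrame, reusing the original labels
df_imputed = pd.DataFrame(imputed_data, columns=df_num1.columns, index=df_num1.index)
print(type(imputed_data), type(df_imputed))
```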

Try:
* changing the `strategy` input - see what happens

In [11]:
imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(df_num1)
imputed_most_data = imputer.transform(df_num1)
imputed_most_data

array([[5.],
       [5.],
       [7.],
       [1.],
       [8.],
       [5.]])

### Scaler

In [16]:
num_data2 = {'scores':[1,2,3,1,2,3]}
df_num2 = pd.DataFrame(num_data2)
df_num2

Unnamed: 0,scores
0,1
1,2
2,3
3,1
4,2
5,3


Import `StandardScaler` and use `fit_transform`, then see what happens to the output.

In [17]:
from sklearn.preprocessing import StandardScaler

# create an instance of StandardScaler
scaler = StandardScaler()

In [20]:

# fit and transform your data using the scaler
df_num2_scaled = scaler.fit_transform(df_num2)
df_num2_scaled

array([[-1.22474487],
       [ 0.        ],
       [ 1.22474487],
       [-1.22474487],
       [ 0.        ],
       [ 1.22474487]])

<details><summary>Solution</summary><br>

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_num2)
scaled_data
```

</details>

Calculate the mean and standard deviation of your data. This will help you understand how the data has been transformed.

In [23]:
print("°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°")
print(f"Mean of df_num2 => {df_num2.mean()}")
print(f"Mean of df_num2_scaled => {df_num2_scaled.mean()}")
print("°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°")
print(f"std of df_num2 => {df_num2.std()}")
print(f"std of df_num2_scaled => {df_num2_scaled.std()}")

°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
Mean of df_num2 => scores    2.0
dtype: float64
Mean of df_num2_scaled => 0.0
°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
std of df_num2 => scores    0.894427
dtype: float64
std of df_num2_scaled => 0.9999999999999999


<details><summary>Solution</summary><br>

```python
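# StandardScaler divides by the population standard deviation (ddof=0),
# whereas pandas' .std() defaults to the sample standard deviation (ddof=1)
std = df_num2.std(ddof=0).values[0]
mean = df_num2.mean().values[0]

print('std',std)
print('mean',mean)

# z-score by hand: (x - mean) / std
print((2 - mean)/std)
print((3 - mean)/std)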

```

</details>

### Example of MinMax scaling

In [25]:
from sklearn.preprocessing import MinMaxScaler
scalerMinMax = MinMaxScaler()
df_num2_scaledMinMax = scalerMinMax.fit_transform(df_num2)
df_num2_scaledMinMax

array([[0. ],
       [0.5],
       [1. ],
       [0. ],
       [0.5],
       [1. ]])
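
The same result can be reproduced by hand, which may help to see what `MinMaxScaler` does with its default settings: each column is shifted by its minimum and divided by its range. This is a small sketch; the name `manual` is introduced only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_num2 = pd.DataFrame({'scores': [1, 2, 3, 1, 2, 3]})

scaled = MinMaxScaler().fit_transform(df_num2)

# Same computation by hand: (x - min) / (max - min), column by column
manual = (df_num2 - df_num2.min()) / (df_num2.max() - df_num2.min())
print(manual['scores'].tolist())
print(scaled.ravel().tolist())
```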

## Non-numeric transformers

In [28]:
cat_data1 = {'names':['Sophie','Michael','Eric','Eric','Sophie', 'Xavier']} 
df_cat = pd.DataFrame(cat_data1)
df_cat

Unnamed: 0,names
0,Sophie
1,Michael
2,Eric
3,Eric
4,Sophie
5,Xavier


Import OrdinalEncoder and use `fit_transform`. What is this returning?

In [37]:
from sklearn.preprocessing import OrdinalEncoder

# create an instance of OrdinalEncoder
ord_encoder = OrdinalEncoder()
# fit and transform your data using the encoder
X = df_cat
X_encoded = ord_encoder.fit_transform(X)
X_encoded

array([[2.],
       [1.],
       [0.],
       [0.],
       [2.],
       [3.]])

<details><summary>Solution</summary><br>

```python
from sklearn.preprocessing import OrdinalEncoder

ord_encoder = OrdinalEncoder()
ord_encoded_data = ord_encoder.fit_transform(df_cat)
ord_encoded_data
```

</details>

Import OneHotEncoder and use `fit_transform` on the same data (set `sparse=False`; from scikit-learn 1.2 onwards this parameter is called `sparse_output`).

What is this returning? How does it relate to the data returned above?
HINT: look at the `.categories_` attribute after you have used `fit_transform`.

In [44]:
from sklearn.preprocessing import OneHotEncoder

# create an instance of OneHotEncoder
# (in scikit-learn 1.2+ the argument is named sparse_output instead of sparse)
encoder = OneHotEncoder(sparse=False)
# fit and transform your data using the encoder
X = df_cat
X_encoded = encoder.fit_transform(X)
display(df_cat)
X_encoded

Unnamed: 0,names
0,Sophie
1,Michael
2,Eric
3,Eric
4,Sophie
5,Xavier


array([[0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

<details><summary>Solution</summary><br>

```python
from sklearn.preprocessing import OneHotEncoder

# in scikit-learn 1.2+ use sparse_output=False instead of sparse=False
ohot_encoder = OneHotEncoder(sparse=False)
ohot_encoded_data = ohot_encoder.fit_transform(df_cat)
print(ohot_encoded_data)


print(ohot_encoder.categories_)
```

</details>
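
One way to see how the two encodings relate: the position of the 1 in each one-hot row is the ordinal code of that row, because both encoders sort the categories in the same way. The sketch below uses the default sparse output and converts it with `.toarray()`, which is a choice made here to sidestep the `sparse` / `sparse_output` naming difference between versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df_cat = pd.DataFrame({'names': ['Sophie', 'Michael', 'Eric', 'Eric', 'Sophie', 'Xavier']})

ord_codes = OrdinalEncoder().fit_transform(df_cat)

ohe = OneHotEncoder()
# Default output is a sparse matrix; .toarray() makes it dense
one_hot = ohe.fit_transform(df_cat).toarray()

print(ohe.categories_)            # the learned categories, sorted
print(one_hot.argmax(axis=1))     # column index of the 1 in each row
print(ord_codes.ravel())          # the ordinal code of each row
```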

# Using on test set

The idea is always that you use the `fit_transform` method on the train data, then the `transform` method on the test data. This guarantees that exactly the same processing is applied to both data sets (and stops you from slipping up somewhere in your code). For example, using our ordinal encoder above, if we pass new data to `transform`...

In [69]:
# Show the ordinal encoder
ord_encoder
## The encoder has already been fit and is ready to be used to transform new data

OrdinalEncoder()

In [70]:
cat_data3 = {'names':['Xavier','Xavier','Eric', 'Antoine']} #,'Xavier',
df_cat3 = pd.DataFrame(cat_data3)
df_cat3

Unnamed: 0,names
0,Xavier
1,Xavier
2,Eric
3,Antoine


Call `transform()` on this new data set. What happens?

In [71]:
X_encoted_error = ord_encoder.transform(df_cat3)
X_encoted_error

ValueError: Found unknown categories ['Antoine'] in column 0 during transform

In [72]:
# Re-fitting on the new data avoids the error, but it overwrites the encoding learned from df_cat
X_encoted_error = ord_encoder.fit_transform(df_cat3)
X_encoted_error

array([[2.],
       [2.],
       [1.],
       [0.]])

<details><summary>Solution</summary><br>

```python
ord_encoder.transform(df_cat3)
```

</details>

It has learnt the encodings to use on new datasets. (NOTE: if it sees a new category, it will break!)
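
If breaking on unseen categories is not acceptable, one possible workaround (a sketch, assuming scikit-learn 0.24 or later, where `OrdinalEncoder` accepts these arguments) is to ask the encoder to map unknown categories to a placeholder value. The placeholder `-1` and the names `tolerant_encoder`, `df_train` and `df_new` are choices made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_train = pd.DataFrame({'names': ['Sophie', 'Michael', 'Eric', 'Eric', 'Sophie', 'Xavier']})
df_new = pd.DataFrame({'names': ['Xavier', 'Xavier', 'Eric', 'Antoine']})

# Unknown categories are encoded as -1 instead of raising a ValueError
tolerant_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
tolerant_encoder.fit(df_train)

encoded_new = tolerant_encoder.transform(df_new)
print(encoded_new.ravel())
```

Whether a placeholder code is meaningful for a downstream model is a separate question; the point here is only that the fitted encoding is reused rather than re-learned.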

TRY 
* using your fitted StandardScaler to `transform` the new data below.

In [49]:
num_data3 = {'scores':[4,8,12]}
df_num3 = pd.DataFrame(num_data3)
df_num3

Unnamed: 0,scores
0,4
1,8
2,12


In [50]:
df_num3_fitted = scaler.transform(df_num3)
df_num3_fitted

array([[ 2.44948974],
       [ 7.34846923],
       [12.24744871]])
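
These values look large because `transform` reuses the mean and standard deviation learned from `df_num2`, not those of the new data. A small sketch that recomputes them by hand from the fitted attributes `mean_` and `scale_` (the name `by_hand` is introduced for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_num2 = pd.DataFrame({'scores': [1, 2, 3, 1, 2, 3]})
df_num3 = pd.DataFrame({'scores': [4, 8, 12]})

scaler = StandardScaler().fit(df_num2)

# (x - mean of df_num2) / (std of df_num2), applied to the new values
by_hand = (df_num3['scores'].to_numpy() - scaler.mean_[0]) / scaler.scale_[0]
print(by_hand)
print(scaler.transform(df_num3).ravel())
```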

* using your fitted ordinal encoder to `transform` the new data below.

In [52]:
cat_data4 = {'names':['Xavier','Paul','Eric']} #'Antoine','Xavier',
df_cat4 = pd.DataFrame(cat_data4)
df_cat4

Unnamed: 0,names
0,Xavier
1,Paul
2,Eric


In [55]:
df_cat4 = ord_encoder.transform(df_cat4)

ValueError: Found unknown categories ['Paul'] in column 0 during transform

Can you explain what happened?

# Pipelines

We can chain these commands together in a pipeline; the steps run in the order in which they are listed.

In [76]:
from sklearn.pipeline import Pipeline

In [77]:
num_pipeline = Pipeline([
     ('imputer', SimpleImputer(strategy="median")),
     ('std_scaler', StandardScaler()), # try commenting out this line and running fit_transform with and without it
 ])

TRY
* call fit_transform on `df_num1` using the `num_pipeline`, look at the output.
* try commenting out the whole line of code containing the standard scaler, what happens now?
* what order is the code executing the transformers in?
* try putting the transformers the other way around. What is the difference? Why?

In [78]:
df_num1

Unnamed: 0,Scores
0,5.0
1,5.0
2,7.0
3,1.0
4,8.0
5,


In [79]:
df_num1_pipe = num_pipeline.fit_transform(df_num1)
df_num1_pipe

array([[-0.07602859],
       [-0.07602859],
       [ 0.83631451],
       [-1.9007148 ],
       [ 1.29248607],
       [-0.07602859]])

<details><summary>Solution</summary><br>

```python
num_pipeline.fit_transform(df_num1)
```

</details>
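
To make the execution order concrete, the sketch below compares the pipeline output with the same two transformers applied by hand, one after the other. The intermediate names `step1`, `step2` and `piped` are introduced for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df_num1 = pd.DataFrame({'Scores': [5, 5, 7, 1, 8, np.nan]})

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])
piped = num_pipeline.fit_transform(df_num1)

# The same two steps applied by hand, in the same order
step1 = SimpleImputer(strategy='median').fit_transform(df_num1)
step2 = StandardScaler().fit_transform(step1)
print(np.allclose(piped, step2))

# Fitted steps can be inspected by the name given in the pipeline
print(num_pipeline.named_steps['imputer'].statistics_)
```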

Create a categorical pipeline to process the `df_cat` dataframe. Use an `OrdinalEncoder` within this. (Yes, you can make a pipeline with only one transformer.)

In [80]:
cat_pipeline = Pipeline([
     ('ord_encoder', OrdinalEncoder()),
     ])

In [85]:
df_cat_encoded_fit = cat_pipeline.fit(df_cat)
df_cat_encoded = cat_pipeline.transform(df_cat)

In [82]:
df_cat_encoded = cat_pipeline.fit_transform(df_cat)

In [83]:
df_cat_encoded

array([[2.],
       [1.],
       [0.],
       [0.],
       [2.],
       [3.]])

<details><summary>Solution</summary><br>

```python

from sklearn.preprocessing import OrdinalEncoder

cat_pipeline = Pipeline([
    ('encoder', OrdinalEncoder()),
])

cat_pipeline.fit_transform(df_cat)
```

</details>

# Combining Pipelines

In [86]:
df_data = {
    'names':['Sophie','Michael','Eric','Eric','Sopie', 'Xavier'],
    'scores':[5,5,7,1,8,np.nan],
            }

df = pd.DataFrame(df_data)
df

Unnamed: 0,names,scores
0,Sophie,5.0
1,Michael,5.0
2,Eric,7.0
3,Eric,1.0
4,Sopie,8.0
5,Xavier,


We must create separate pipelines for different types of data, e.g. a numerical one and a categorical one, and then combine them with a `ColumnTransformer`. The df above has both types!

You can use the numerical and categorical pipelines you just made.

In [88]:
from sklearn.compose import ColumnTransformer

In [None]:
imputer = SimpleImputer(strategy="most_frequent")

In [105]:
#num_attribs should be a list of strings, the strings being column names

num_attribs = ["scores"] #put column names in here of num attribs you wish to include
cat_attribs = ["names"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
 ])

You can call `fit_transform` on the `full_pipeline` you have created. The `ColumnTransformer` runs `num_pipeline` on the columns listed in `num_attribs`, then runs `cat_pipeline` on the columns listed in `cat_attribs`. It then concatenates the outputs to return a single array containing everything.

In [106]:
df_transform = full_pipeline.fit_transform(df)
df_transform

array([[-0.07602859,  2.        ],
       [-0.07602859,  1.        ],
       [ 0.83631451,  0.        ],
       [-1.9007148 ,  0.        ],
       [ 1.29248607,  3.        ],
       [-0.07602859,  4.        ]])

<details><summary>Solution</summary><br>

```python

#### use cat_pipeline created in task above...
from sklearn.preprocessing import OrdinalEncoder

cat_pipeline = Pipeline([
    ('encoder', OrdinalEncoder()),
])

cat_pipeline.fit_transform(df_cat)


#### define which columns we wish each pipeline to act on
num_attribs = ['scores']
cat_attribs = ['names']

#### then make full pipeline
full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", cat_pipeline, cat_attribs),
 ])

#### call the pipeline on the df provided to return the transformed dataframe
full_pipeline.fit_transform(df)
```

</details>
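
As a rough check of what the `ColumnTransformer` is doing, the sketch below runs each pipeline on its own columns and stacks the results side by side, then compares that with the combined output. The names `combined`, `by_hand` and `df_out` are introduced for illustration; the pipelines are redefined so that the cell is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.DataFrame({
    'names': ['Sophie', 'Michael', 'Eric', 'Eric', 'Sopie', 'Xavier'],
    'scores': [5, 5, 7, 1, 8, np.nan],
})
num_attribs, cat_attribs = ['scores'], ['names']

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([('ord_encoder', OrdinalEncoder())])

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', cat_pipeline, cat_attribs),
])
combined = full_pipeline.fit_transform(df)

# By hand: run each pipeline on its own columns, then stack side by side
num_part = num_pipeline.fit_transform(df[num_attribs])
cat_part = cat_pipeline.fit_transform(df[cat_attribs])
by_hand = np.hstack([num_part, cat_part])
print(np.allclose(combined, by_hand))

# Output columns follow the order of the transformers list, not the order in df
df_out = pd.DataFrame(combined, columns=num_attribs + cat_attribs)
print(df_out)
```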

TRY `transforming` (only) this new test set using the pipeline above. Are the results what you expect?

In [107]:
df_data_test = {
    'names':['Xavier','Michael','Michael'],
    'scores':[100,9,np.nan],
            }

df_test = pd.DataFrame(df_data_test)
df_test

Unnamed: 0,names,scores
0,Xavier,100.0
1,Michael,9.0
2,Michael,


<details><summary>Solution</summary><br>

```python

full_pipeline.transform(df_test)
```

</details>

TRY this new data below. Not working? Can you fix it?

In [108]:
df_data_test2 = {
    'names':['Xavier','Michael',np.nan],
    'scores':[100,9,np.nan],
            }

df_test2 = pd.DataFrame(df_data_test2)
df_test2

Unnamed: 0,names,scores
0,Xavier,100.0
1,Michael,9.0
2,,


In [109]:
df_2 = full_pipeline.transform(df_test2);

ValueError: Input contains NaN

In [113]:
#num_attribs should be a list of strings, the strings being column names

num_attribs = ["scores"] #put column names in here of num attribs you wish to include
cat_attribs = ["names"]

num_pipeline = Pipeline([
     ('imputer', SimpleImputer(strategy="median")),
     ('std_scaler', StandardScaler()), # try commenting out this line and running fit_transform with and without it
 ])

cat_pipeline = Pipeline([
     ('imputer', SimpleImputer(strategy="most_frequent")),
     ('ord_encoder', OrdinalEncoder()),
     ])


full_pipeline_nan = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
 ])
# fit on the training data (df), then only transform the test data
full_pipeline_nan.fit(df)
df_2 = full_pipeline_nan.transform(df_test2)

<details><summary>Solution</summary><br>

```python

# Showing pipeline
full_pipeline

# Improving categorical pipeline
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder()),
])

# Creating new pipeline
full_pipeline2 = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", cat_pipeline, cat_attribs),
 ])

# Fitting the modified pipeline
full_pipeline2.fit_transform(df)

# Transform df_test2
full_pipeline2.transform(df_test2)

```

</details>

Everything you need to complete Etape2 is above! You can stop here.

# Customised Transformers

In order to make custom transformers we have to write our own class. It takes the form below. Try running `fit_transform` on the data below. What happens?

In [48]:
# we must import these
from sklearn.base import BaseEstimator, TransformerMixin

class DontChangeAnything(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        print('Transforming X in no way at all')
        return X
    

In [49]:
df_num2

Unnamed: 0,scores
0,1
1,2
2,3
3,1
4,2
5,3


In [50]:
#### try here




<details><summary>Solution</summary><br>

```python

null_transformer = DontChangeAnything()
null_transformer.fit_transform(df_num2)
```

</details>

Try using only `fit`. What happens?

In [51]:
#### try here




<details><summary>Solution</summary><br>

```python

#hopefully nothing. In the .fit() part of the class is there anything written? 
```

</details>
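
For contrast, here is a sketch of a custom transformer whose `fit` does learn something. `MeanCenterer` is a hypothetical class written for this illustration (it is not part of sklearn) and it assumes its input is a DataFrame: it stores the column means during `fit` and subtracts them in `transform`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical example: subtract the column means learned during fit."""
    def fit(self, X, y=None):
        # Learned attributes conventionally end with an underscore
        self.means_ = X.mean()
        return self
    def transform(self, X):
        # Uses the means stored by fit, not the means of X itself
        return X - self.means_

df_num2 = pd.DataFrame({'scores': [1, 2, 3, 1, 2, 3]})
df_num3 = pd.DataFrame({'scores': [4, 8, 12]})

centerer = MeanCenterer().fit(df_num2)
centered_new = centerer.transform(df_num3)
print(centered_new)
```

Calling only `fit` here changes the state of the object (it gains a `means_` attribute) without returning any transformed data, which is the general division of labour between the two methods.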

Make a copy of the transformer class above. Edit it to return the `.describe()` of the dataframe only.

In [52]:
### try here




<details><summary>Solution</summary><br>

```python

# we must import these
from sklearn.base import BaseEstimator, TransformerMixin

class ReturnDescribe(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.describe()


describe_transformer = ReturnDescribe()

describe_transformer.fit_transform(df_num2)
```

</details>

# Edit your own custom transformer

Again, copy the `DontChangeAnything` transformer. Assuming that its input will be a NumPy array, edit it so that the array is put into a DataFrame, which is then returned. Test it on the array data below.

In [53]:
array = np.array(df_test2)
array

array([['Xavier', 100.0],
       ['Michael', 9.0],
       [nan, nan]], dtype=object)

In [54]:
## try here





<details><summary>Solution</summary><br>

```python

## try this

from sklearn.base import BaseEstimator, TransformerMixin

class ConvertToDataframe(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return pd.DataFrame(X)
    
df_transformer = ConvertToDataframe()

df_transformer.fit_transform(array)
```

</details>

Now try editing your class to label the column names of your dataframe.

In [55]:
## try here





<details><summary>Solution</summary><br>

```python

from sklearn.base import BaseEstimator, TransformerMixin

class ConvertToDataframe(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
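    def fit(self, X, column_names = None):
        # fit_transform forwards its second positional argument to fit,
        # so the names can be passed as fit_transform(X, ['a', 'b'])
        self.column_names = column_names
        return self
    def transform(self, X):
        if self.column_names is None: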
            return pd.DataFrame(X)
        else:
            return pd.DataFrame(X, columns = self.column_names)
    
df_transformer = ConvertToDataframe()
df_transformer.fit_transform(array)
```

</details>

<details><summary>Customized labels</summary><br>

```python

df_transformer.fit_transform(array, ['name', 'score'])
```

</details>
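
An alternative design, sketched here under the assumption that the column names are known when the transformer is created, is to pass them to the constructor rather than to `fit`. This keeps the second argument of `fit` free for `y`, which is what a sklearn `Pipeline` expects, and makes the setting visible through `get_params()`. The class name `ConvertToDataframeNamed` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ConvertToDataframeNamed(BaseEstimator, TransformerMixin):
    """Hypothetical variant: column names are a constructor parameter."""
    def __init__(self, column_names=None):
        self.column_names = column_names
    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained
        return self
    def transform(self, X):
        # columns=None falls back to the default integer labels
        return pd.DataFrame(X, columns=self.column_names)

array = np.array([['Xavier', 100.0], ['Michael', 9.0], [np.nan, np.nan]], dtype=object)

named_transformer = ConvertToDataframeNamed(column_names=['name', 'score'])
df_named = named_transformer.fit_transform(array)
print(df_named.columns.tolist())
print(named_transformer.get_params())
```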