# Practice pipelines notebook!


Objective 
* break down transformers and pipelines so that they can be understood better.

Definitions:


A sklearn transformer is usually used to pass data from one stage to the next, changing it along the way (although it can actually be used to pass anything). Examples include:
* OneHotEncoder
* StandardScaler
* ....etc

A sklearn pipeline is used to chain multiple transformers together.

Pipeline is also a generic term used to describe the process of moving data from one form/state to another.

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [1]:
import pandas as pd
import numpy as np

# Transformers

## numeric transformers

### imputer

In [2]:
num_data1 = {'scores':[5,5,7,1,8,np.nan]}
df_num1 = pd.DataFrame(num_data1)
df_num1

Unnamed: 0,scores
0,5.0
1,5.0
2,7.0
3,1.0
4,8.0
5,


Import SimpleImputer and run `.fit_transform()` on the numeric df above.

In [3]:
from sklearn.impute import SimpleImputer
    
imputer = SimpleImputer(strategy='median')

imputed_data = imputer.fit_transform(df_num1)
imputed_data

array([[5.],
       [5.],
       [7.],
       [1.],
       [8.],
       [5.]])

Look at what has changed in the object we passed in (`df_num1`).


It is no longer a dataframe. What type is it?

Try:
* changing the `strategy` input - see what happens

In [27]:
### try here

imputer = SimpleImputer(strategy='constnt')

imputed_data = imputer.fit_transform(df_num1)
imputed_data



ValueError: Can only use these strategies: ['mean', 'median', 'most_frequent', 'constant']  got strategy=constnt
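The error message lists the accepted strategies. As a minimal sketch, here is how two of them behave on the same scores; the variable names and the choice of `fill_value=0` are only illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_scores = pd.DataFrame({'scores': [5, 5, 7, 1, 8, np.nan]})

# 'constant' replaces every missing value with fill_value
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
constant_filled = constant_imputer.fit_transform(df_scores)

# 'mean' replaces missing values with the column mean learnt during fit
mean_imputer = SimpleImputer(strategy='mean')
mean_filled = mean_imputer.fit_transform(df_scores)

print(constant_filled.ravel())
print(mean_filled.ravel())
# statistics_ holds the value each column will be filled with
print(mean_imputer.statistics_)
```

With `'mean'`, the missing score is filled with the mean of the five known scores, which is also what `statistics_` stores.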

### scaler

In [8]:
num_data2 = {'scores':[1,2,3,1,2,3]}
df_num2 = pd.DataFrame(num_data2)
df_num2

Unnamed: 0,scores
0,1
1,2
2,3
3,1
4,2
5,3


Import `StandardScaler`, use `fit_transform`, and see what happens to the output.

In [9]:
## try here
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scale_data = scaler.fit_transform(df_num2)
scale_data

array([[-1.22474487],
       [ 0.        ],
       [ 1.22474487],
       [-1.22474487],
       [ 0.        ],
       [ 1.22474487]])

<details><summary>Solution</summary><br>

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_num2)
scaled_data
```

</details>

Calculate the mean and standard deviation of your data. This will help you understand how the data has been transformed.

In [11]:
### try here
scale_data.std()

0.9999999999999999

<details><summary>Solution</summary><br>

```python
# StandardScaler divides by the population std (ddof=0, i.e. divide by n)
std = df_num2.std(ddof=0).values[0]
mean = df_num2.mean().values[0]

print('std',std)

print('mean',mean)

print((2 - mean)/std)
print((3-mean)/std)

# pandas .std() defaults to ddof=1 (divide by n-1), so without ddof=0
# the last line would not give 1.2247...
```

</details>
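The difference between dividing by n and by n-1 can be checked directly. This is a small sketch on the same scores (variable names are illustrative): it compares the pandas default, the population standard deviation, and the `scale_` value that a fitted `StandardScaler` stores.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_scores = pd.DataFrame({'scores': [1, 2, 3, 1, 2, 3]})

sample_std = df_scores['scores'].std()              # pandas default: ddof=1 (n-1)
population_std = df_scores['scores'].std(ddof=0)    # divide by n
numpy_std = np.std(df_scores['scores'].to_numpy())  # numpy default: ddof=0

scaler = StandardScaler().fit(df_scores)

print('pandas default :', sample_std)
print('population     :', population_std)
print('numpy default  :', numpy_std)
print('scaler.scale_  :', scaler.scale_[0])

# reproducing the scaled value for a score of 3 by hand
z_for_3 = (3 - df_scores['scores'].mean()) / population_std
print(z_for_3)
```

Only the population version matches the scaler, which is why the hand calculation needs `ddof=0` to reproduce 1.2247.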

## non-numeric transformers

In [12]:
cat_data1 = {'names':['Sophie','Michael','Eric','Eric','Sophie', 'Xavier']} 
df_cat = pd.DataFrame(cat_data1)
df_cat

Unnamed: 0,names
0,Sophie
1,Michael
2,Eric
3,Eric
4,Sophie
5,Xavier


Import OrdinalEncoder and use `fit_transform`. What is this returning?

In [20]:
#### try here 
from sklearn.preprocessing import OrdinalEncoder

ord_encoder = OrdinalEncoder()
enc_data = ord_encoder.fit_transform(df_cat)
print(ord_encoder.categories_)
enc_data


[array(['Eric', 'Michael', 'Sophie', 'Xavier'], dtype=object)]


array([[2.],
       [1.],
       [0.],
       [0.],
       [2.],
       [3.]])

<details><summary>Solution</summary><br>

```python
from sklearn.preprocessing import OrdinalEncoder

ord_encoder = OrdinalEncoder()
ord_encoded_data = ord_encoder.fit_transform(df_cat)
ord_encoded_data
```

</details>

Import `OneHotEncoder` and use `fit_transform` on the same data. (Set `sparse=False`; from scikit-learn 1.2 this argument is called `sparse_output`.)

What is this returning? How does it relate to the data returned above?
HINT: look at the `.categories_` attribute after you have used `fit_transform`.

In [18]:
## try here 
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder(sparse=False)
oh_data = onehot.fit_transform(df_cat)
print(onehot.categories_)
oh_data




[array(['Eric', 'Michael', 'Sophie', 'Xavier'], dtype=object)]


array([[0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

<details><summary>Solution</summary><br>

```python
from sklearn.preprocessing import OneHotEncoder

ohot_encoder = OneHotEncoder(sparse=False)
ohot_encoded_data = ohot_encoder.fit_transform(df_cat)
print(ohot_encoded_data)


print(ohot_encoder.categories_)
```

</details>
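One way to see how the two encodings relate: the column holding the 1 in each one-hot row is the same number the ordinal encoder assigns. The sketch below assumes the same names as above and uses `.toarray()` on the default sparse output, so it does not depend on the `sparse` / `sparse_output` keyword.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df_names = pd.DataFrame(
    {'names': ['Sophie', 'Michael', 'Eric', 'Eric', 'Sophie', 'Xavier']}
)

ordinal_codes = OrdinalEncoder().fit_transform(df_names)

# default output is a scipy sparse matrix; convert it to a dense array
onehot_dense = OneHotEncoder().fit_transform(df_names).toarray()

# position of the 1 in each one-hot row
hot_positions = onehot_dense.argmax(axis=1)

print(hot_positions)
print(ordinal_codes.ravel())
```

Both encoders sort the categories the same way, so `categories_` gives the meaning of each ordinal code and of each one-hot column.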

# using on test set

The idea is always that you use the `fit_transform` method on the train data, then the `transform` method on the test data. This makes sure exactly the same process is applied to both data sets (and stops you messing up somewhere in your code). For example, using our ordinal encoder above, if we pass new data with `transform`...

In [21]:
#### this is what I called the encoder I fit before (you may need to run the solution code)
# it has already been fit, so it is ready to be used to transform new data
ord_encoder

OrdinalEncoder()

In [22]:
cat_data3 = {'names':['Xavier','Xavier','Eric']} #'Antoine','Xavier',
df_cat3 = pd.DataFrame(cat_data3)
df_cat3

Unnamed: 0,names
0,Xavier
1,Xavier
2,Eric


Call `transform()` using this new data set. What do you see has happened?

In [23]:
### try here 
ordinal_data = ord_encoder.transform(df_cat3)
ordinal_data



array([[3.],
       [3.],
       [0.]])

<details><summary>Solution</summary><br>

```python
ord_encoder.transform(df_cat3)
```

</details>

It has learnt the encodings to use on new datasets. (NOTE: with the default settings, if it sees a category that was not present during `fit`, it raises an error!!)
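As a hedged sketch of that failure and one possible way around it: recent scikit-learn versions (0.24 and later, as far as I know) let `OrdinalEncoder` map unseen categories to a chosen code via `handle_unknown='use_encoded_value'`. The data and the choice of `-1` below are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_names = pd.DataFrame({'names': ['Sophie', 'Michael', 'Eric', 'Xavier']})
# 'Antoine' was never seen during fit
new_names = pd.DataFrame({'names': ['Xavier', 'Antoine']})

# default behaviour: unknown categories raise a ValueError
strict_encoder = OrdinalEncoder().fit(train_names)
try:
    strict_encoder.transform(new_names)
    raised = False
except ValueError as err:
    raised = True
    print('default encoder failed:', err)

# alternative: send every unknown category to a fixed code
tolerant_encoder = OrdinalEncoder(
    handle_unknown='use_encoded_value', unknown_value=-1
)
tolerant_encoder.fit(train_names)
tolerant_codes = tolerant_encoder.transform(new_names)
print(tolerant_codes)
```

Whether a placeholder code such as `-1` is acceptable depends on the model that consumes the encoded data.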

TRY 
* using your fitted StandardScaler to `transform` the new data below.

In [24]:
num_data3 = {'scores':[4,8,12]}
df_num3 = pd.DataFrame(num_data3)
df_num3

Unnamed: 0,scores
0,4
1,8
2,12


In [35]:
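# try here
# NOTE: fit_transform re-fits the scaler on df_num3; scaler.transform(df_num3)
# would reuse the mean and std learnt from df_num2 instead
std_data = scaler.fit_transform(df_num3)
std_data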


array([[-1.22474487],
       [ 0.        ],
       [ 1.22474487]])

In [36]:
std_data= pd.DataFrame(std_data)
std_data.describe()

Unnamed: 0,0
count,3.0
mean,0.0
std,1.224745
min,-1.224745
25%,-0.612372
50%,0.0
75%,0.612372
max,1.224745
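The difference between reusing a fitted scaler and re-fitting it can be seen side by side. This is a minimal sketch with the same train and new scores; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_train = pd.DataFrame({'scores': [1, 2, 3, 1, 2, 3]})
df_new = pd.DataFrame({'scores': [4, 8, 12]})

scaler = StandardScaler()
scaler.fit(df_train)

# transform: uses the mean and std learnt from df_train
reused = scaler.transform(df_new)

# fit_transform: forgets df_train and learns mean and std from df_new
refit = StandardScaler().fit_transform(df_new)

print(reused.ravel())
print(refit.ravel())
```

With `transform`, the new scores land far above zero because they are measured against the training mean of 2. With `fit_transform`, they are centred on their own mean, so the information that they are larger than the training scores is lost.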


# Pipeline

We can chain these commands together in a pipeline... this runs through them in the order they are placed.

In [26]:
from sklearn.pipeline import Pipeline

In [None]:
num_pipeline = Pipeline([
     ('imputer', SimpleImputer(strategy="median")),
#      ('std_scaler', StandardScaler()), # try commenting in this line and running fit_transform with and without it
 ])

TRY
* call fit_transform on `df_num1` using the `num_pipeline`, look at the output.
* try uncommenting the whole line of code containing the standard scaler, what happens now?
* what order is the code executing the transformers in?
* try putting the transformers the other way around. What is the difference? Why?

In [None]:
# try here




<details><summary>Solution</summary><br>

```python
num_pipeline.fit_transform(df_num1)
```

</details>
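To check the execution order, a pipeline can be compared with the same steps applied by hand. This is a sketch assuming both steps are active (imputer first, then scaler); the names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df_scores = pd.DataFrame({'scores': [5, 5, 7, 1, 8, np.nan]})

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])
piped = pipeline.fit_transform(df_scores)

# the same two steps by hand, in the same order
filled = SimpleImputer(strategy='median').fit_transform(df_scores)
manual = StandardScaler().fit_transform(filled)

print(np.allclose(piped, manual))

# each fitted step stays reachable by the name it was given
print(pipeline.named_steps['imputer'].statistics_)
```

The output of each step becomes the input of the next, so swapping the two steps would make the imputer compute its median on already scaled values.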

Create a categorical pipeline to process the `df_cat` dataframe. Use an `OrdinalEncoder` within this. (Yes, you can make a pipeline with only one transformer.)



In [None]:
### try here




<details><summary>Solution</summary><br>

```python

from sklearn.preprocessing import OrdinalEncoder

cat_pipeline = Pipeline([
    ('encoder', OrdinalEncoder()),
])

cat_pipeline.fit_transform(df_cat)
```

</details>

# Combining Pipelines

In [None]:
df_data = {
    'names':['Sophie','Michael','Eric','Eric','Sopie', 'Xavier'],
    'scores':[5,5,7,1,8,np.nan],
            }

df = pd.DataFrame(df_data)
df

We must create separate pipelines for different types of data, e.g. one numerical and one categorical. We can then combine them with a `ColumnTransformer`. The df above has both types!

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
#num_attribs should be a list of strings, the strings being column names

num_attribs = [] #put column names in here of num attribs you wish to include
cat_attribs = []

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", cat_pipeline, cat_attribs),
 ])


You can call `fit_transform` on the `full_pipeline` you have created. The `ColumnTransformer` runs `num_pipeline` on the columns listed in `num_attribs`, then runs `cat_pipeline` on the columns listed in `cat_attribs`. It then concatenates the outputs side by side and returns one array containing everything.

In [None]:
## try here




<details><summary>Solution</summary><br>

```python

#### use cat_pipeline created in task above...
from sklearn.preprocessing import OrdinalEncoder

cat_pipeline = Pipeline([
    ('encoder', OrdinalEncoder()),
])

cat_pipeline.fit_transform(df_cat)




#### define which column we wish each pipeline to act
num_attribs = ['scores']
cat_attribs = ['names']

#### then make full pipeline
full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", cat_pipeline, cat_attribs),
 ])

#### call the pipeline on the df provided to return the transformed dataframe
full_pipeline.fit_transform(df)
```

</details>

TRY transforming (with `transform` only) this new test set using the pipeline above. Are the results what you expect?

In [None]:
df_data_test = {
    'names':['Xavier','Michael','Michael'],
    'scores':[100,9,np.nan],
            }

df_test = pd.DataFrame(df_data_test)
df_test

<details><summary>Solution</summary><br>

```python

full_pipeline.transform(df_test)
```

</details>

TRY this new data below. Not working? Can you fix it?

In [None]:
df_data_test2 = {
    'names':['Xavier','Michael',np.nan],
    'scores':[100,9,np.nan],
            }

df_test2 = pd.DataFrame(df_data_test2)
df_test2
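One possible fix, sketched under assumptions: the missing name is an unseen value for the encoder, so an imputer can be placed in front of it in the categorical pipeline. The training names below are chosen for illustration so that `'Eric'` is clearly the most frequent, and `num_pipe`, `cat_pipe` and `combined` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({
    'names': ['Sophie', 'Michael', 'Eric', 'Eric', 'Xavier', 'Eric'],
    'scores': [5, 5, 7, 1, 8, np.nan],
})
test = pd.DataFrame({
    'names': ['Xavier', 'Michael', np.nan],
    'scores': [100, 9, np.nan],
})

num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median'))])

# fill missing names with the most frequent training name before encoding
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder()),
])

combined = ColumnTransformer([
    ('num', num_pipe, ['scores']),
    ('cat', cat_pipe, ['names']),
])

combined.fit(train)
test_out = combined.transform(test)
print(test_out)
```

Here the missing name becomes `'Eric'` and the missing score becomes the training median. Other choices, such as a constant placeholder category, may suit real data better.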

Everything you need to complete Etape2 is above! You can stop here.

# Customised Transformers

In order to make custom transformers we must write our own class. It takes the form below. Try running `fit_transform` on the data below. What happens?

In [15]:
# we must import these
from sklearn.base import BaseEstimator, TransformerMixin

class DontChangeAnything(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        print('Transforming X in no way at all')
        return X

    

In [16]:
df_num2

Unnamed: 0,scores
0,1
1,2
2,3
3,1
4,2
5,3


In [None]:
#### try here




<details><summary>Solution</summary><br>

```python

null_transformer = DontChangeAnything()

null_transformer.fit_transform(df_num2)
```

</details>

Try using only `fit`. What happens?

In [None]:
#### try here




<details><summary>Solution</summary><br>

```python

# Nothing is transformed: fit() only returns the transformer itself.
# Is there anything written in the .fit() part of the class?
```

</details>
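In the class above `fit` does nothing, but it is normally where a transformer learns something from the training data. As an illustrative sketch, the hypothetical `SubtractColumnMeans` below stores the column means in `fit` and reuses them in `transform`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class SubtractColumnMeans(BaseEstimator, TransformerMixin):
    """Hypothetical example: centre each column on the mean seen during fit."""
    def fit(self, X, y=None):
        # learnt state is stored with a trailing underscore, as sklearn does
        self.means_ = X.mean()
        return self
    def transform(self, X):
        # subtract the training means, not the means of X itself
        return X - self.means_

train = pd.DataFrame({'scores': [1, 2, 3, 1, 2, 3]})
new = pd.DataFrame({'scores': [4, 8, 12]})

centerer = SubtractColumnMeans().fit(train)
centered_new = centerer.transform(new)

print(centerer.means_)
print(centered_new)
```

Calling only `fit` changes no data; it just fills in `means_` so that a later `transform` can use it.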

Make a copy of the transformer class above. Edit it to return the `.describe()` of the dataframe only.

In [33]:
### try here




<details><summary>Solution</summary><br>

```python

# we must import these
from sklearn.base import BaseEstimator, TransformerMixin

class ReturnDescribe(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.describe()


describe_transformer = ReturnDescribe()

describe_transformer.fit_transform(df_num2)
```

</details>

# Edit your own custom transformer

Again, copy the `DontChangeAnything` transformer. Assuming that its input will be a numpy array, edit it so that the numpy array is put into a dataframe, which is then returned. Test it on the array data below.

In [36]:
array = np.array(df_test2)
array

array([['Xavier', 100.0],
       ['Michael', 9.0],
       [nan, nan]], dtype=object)

In [None]:
## try here





<details><summary>Solution</summary><br>

```python

## try this

from sklearn.base import BaseEstimator, TransformerMixin

class ConvertToDataframe(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return pd.DataFrame(X)
    
df_transformer = ConvertToDataframe()

df_transformer.fit_transform(array)
```

</details>

Now try editing your class to label the column names of your dataframe.

In [None]:
## try here





<details><summary>Solution</summary><br>

```python

from sklearn.base import BaseEstimator, TransformerMixin

class ConvertToDataframe(BaseEstimator, TransformerMixin):
    def __init__(self, column_names=None):
        # settings belong in __init__ so sklearn can inspect and clone them
        self.column_names = column_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # columns=None falls back to the default integer labels
        return pd.DataFrame(X, columns=self.column_names)
    
df_transformer = ConvertToDataframe(column_names=['names', 'scores'])

df_transformer.fit_transform(array)
```

</details>
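A custom transformer can also be used as a step inside a `Pipeline`. The sketch below uses a hypothetical `ToDataFrame` class (a stand-in with the same idea as the exercise above) as the last step, so the pipeline returns a labelled dataframe instead of a bare array. Note that `fit` accepts `y=None`, because a pipeline passes `y` to each step.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

class ToDataFrame(BaseEstimator, TransformerMixin):
    """Hypothetical helper: wrap an array in a dataframe with given column names."""
    def __init__(self, column_names=None):
        self.column_names = column_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return pd.DataFrame(X, columns=self.column_names)

df_scores = pd.DataFrame({'scores': [5, 5, 7, 1, 8, np.nan]})

labelled_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),   # returns a numpy array
    ('to_df', ToDataFrame(column_names=['scores'])),  # puts the labels back
])

labelled = labelled_pipeline.fit_transform(df_scores)
print(labelled)
```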