# Day 2 - ML Pipelines

In this series of exercises, you will learn how to build a robust ML pipeline using [Sklearn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

An important part of an ML pipeline is pre-processing. For this, you will learn how to master 
Sklearn encoders and transformers from the [Sklearn preprocessing module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing).


In [4]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
import warnings
warnings.filterwarnings('ignore')

## Preprocessing Data

1. [Scaling with StandardScaler](#exo1)
2. [Encoding Categorical Features](#exo2)
3. [Dealing with missing data](#exo3)
4. [Custom Transformers and Encoders](#exo4)

### 1. Scaling with StandardScaler <a id='exo1'/>

Standardizing features by removing the mean and scaling to unit variance is a common pre-processing step that helps many machine learning algorithms train more efficiently.

[Sklearn StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) can do the scaling transformation for you.
The standard score of a sample `x` is calculated as:

z = (x - u) / s

where `u` is the mean of the training samples and `s` is their standard deviation.

The goal of this exercise is to re-implement it.

As you know, there are 2 main methods for any encoder/transformer:
- `fit`, which computes the mean and std to be used for later scaling.
- `transform`, which performs standardization by centering and scaling.
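
As a quick sanity check of the formula, the sketch below standardizes a single value by hand. It assumes the first column of the `data` array used in the exercise, and the population standard deviation (`ddof=0`), which is NumPy's default and the convention `StandardScaler` follows.

```python
import numpy as np

# First column of the `data` array from the exercise
column = np.array([1.0, 2.0, 0.0, 3.0])

u = column.mean()  # "fit": mean of the training samples
s = column.std()   # "fit": population standard deviation (ddof=0)

x = 2.0            # a new sample to standardize
z = (x - u) / s    # "transform": center, then scale
print(round(z, 7))  # 0.4472136
```

The result matches the first entry of the transformed `test_data` shown in the cells of this section.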

#### Exercise
- Given the numpy arrays `data` and `test_data`, write a simple custom implementation of a standard scaler. To test it, fit the scaler with `data` and transform `test_data` with it.
- Compare your results with `StandardScaler`.
- Rewrite the custom implementation as a Python class.

In [26]:
import numpy as np
data = np.array([[1, 10], [2, -1], [0, 22], [3, 15]])
test_data = np.array([[2, 1], [5, 1], [3, 55], [3, 1]])

In [31]:
# axis=0 averages down the rows, giving one mean per column
np.nanmean(data, axis = 0)

array([ 1.5, 11.5])

In [20]:
## Simple custom implementation of Standard Scaler

def fit(X):
    """implement fit method"""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    return {"mean" : mean, "std" : std}


def transform(X, params):
    """implement transformation method"""
    return (X - params["mean"]) / params["std"]

In [21]:
params = fit(data)
transformed_test_data = transform(test_data,params)
transformed_test_data

array([[ 0.4472136 , -1.25275497],
       [ 3.13049517, -1.25275497],
       [ 1.34164079,  5.18998488],
       [ 1.34164079, -1.25275497]])

In [22]:
# Use Sklearn StandardScaler and compare results
from sklearn.preprocessing import StandardScaler

# your code that uses StandardScaler 

scaler = StandardScaler()
scaler.fit(data)
transformed_test_data_2 = scaler.transform(test_data)
transformed_test_data_2

array([[ 0.4472136 , -1.25275497],
       [ 3.13049517, -1.25275497],
       [ 1.34164079,  5.18998488],
       [ 1.34164079, -1.25275497]])

In [27]:
# Python program to demonstrate instantiating a class 

# Class has attributes (attr1 and attr2)
# x.attr1 will return the value of the attribute
# x.attr2 will return the value of the attribute

# Class has methods (functions). That you can use with ()

# Self refers to the instance of the class once it gets created
# Object will be Rodger. fun() applies to itself (Rodger)
# And uses the attr1 and attr2 of Rodger (itself)

# Class methods must have an extra first parameter in method definition. We do not give a value for this parameter when we call the method, Python provides it.
# If we have a method which takes no arguments, then we still have to have one argument.
# This is similar to this pointer in C++ and this reference in Java.
# When we call a method of this object as myobject.method(arg1, arg2), this is automatically converted by Python into MyClass.method(myobject, arg1, arg2) – this is all the special self is about.
  
class Dog:

    # A simple class attribute
    attr1 = "mammal"
    attr2 = "dog"

    # A sample method
    def fun(self):
        print("I'm a", self.attr1)
        print("I'm a", self.attr2)

# Driver code
# Object instantiation
Rodger = Dog()

# Accessing class attributes
# and method through objects
print(Rodger.attr1)
Rodger.fun()

mammal
I'm a mammal
I'm a dog


In [23]:
# Class definition
# https://www.geeksforgeeks.org/python-classes-and-objects/ 
class Person:  
    
    # init method or constructor   
    def __init__(self, name):  
        self.name = name  
    
    # Sample Method   
    def say_hi(self):  
        print('Hello, my name is', self.name)  
    
p = Person('Nikhil')  
p.say_hi() 

## THIS RETURNS AN ERROR 
## WE NEED TO INITIALIZE THE CLASS WITH NAME
#p = Person()

Hello, my name is Nikhil


In [29]:
# Custom Implementation with a Class

class Scaler(object):
    def __init__(self):
        self.mean = None
        self.std = None
    
    def fit(self, X):
        self.mean = np.nanmean(X, axis=0)
        self.std = np.nanstd(X, axis=0)
        
    def transform(self, X):
        return (X - self.mean) / self.std
        
scaler = Scaler()
scaler.fit(data)
transformed_test_data_3 = scaler.transform(test_data)
transformed_test_data_3


array([[ 0.4472136 , -1.25275497],
       [ 3.13049517, -1.25275497],
       [ 1.34164079,  5.18998488],
       [ 1.34164079, -1.25275497]])

In [12]:
scaler

<__main__.Scaler at 0x11be25590>

In [13]:
scaler.fit(data)

### 2. Encoding Categorical Variables <a id='exo2'/>

Features are often given as categorical rather than continuous values. However, most machine learning algorithms only accept numerical data as inputs. That is why we need to make sure categorical variables are encoded before being passed to ML estimators.

One encoder that is commonly used for categorical variables is [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

#### Exercise

Given `data` and `test_data`, implement a OneHotEncoder by yourself and then use the Sklearn implementation to make sure you got it right.

In [39]:
import numpy as np
data = np.array([['France'], ['USA'], ['Italy'], ['Japan'], ['UK'], ['Germany'], ['USA'], ['Japan']])
test_data = np.array([['China'], ['USA'], ['Italy']])

In [32]:
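class CustomOneHotEncoder(object):
    """Re-implement a one-hot encoder for a single categorical column"""

    def fit(self, X):
        # Learn the sorted unique categories seen in the training data
        self.categories = np.unique(X)
        return self

    def transform(self, X):
        # Compare every row with every known category;
        # values unseen during fit end up as a row of zeros
        return (X.reshape(-1, 1) == self.categories).astype(float)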

In [49]:
list_of_columns = np.unique(data)
list_of_columns

array(['France', 'Germany', 'Italy', 'Japan', 'UK', 'USA'], dtype='<U7')

In [51]:
data = np.array([['France'], ['USA'], ['Italy'], ['Japan'], ['UK'], ['Germany'], ['USA'], ['Japan']])
data[2][0]

'Italy'

In [53]:
list_of_columns[1]

'Germany'

In [60]:
data = np.array([['France'], ['USA'], ['Italy'], ['Japan'], ['UK'], ['Germany'], ['USA'], ['Japan']])
list_of_columns = np.unique(data)

# Build one row per sample, with a 1 in the column of its category and 0 elsewhere
l = []
for value in data[:, 0]:
    row = []
    for category in list_of_columns:
        row.append(1 if value == category else 0)
    l.append(row)

In [72]:
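# Note: this loop compares every sample with every other sample, so the result is
# an 8x8 matrix (one column per sample) rather than one column per category
x = []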
for k in range(len(data)):
    l = []
    for i in range(len(data)):
        if data[k][0] == data[i][0]:
            l.append(1)
            #print("this is l:", l)
        else :
            l.append(0)
            #print("this is l:", l)
    print("this is the final l:", l)
    x.append(l)
    #x = np.append(x, l, axis = 1)
    print("this is the status of the array: ",x)

this is the final l: [1, 0, 0, 0, 0, 0, 0, 0]
this is the status of the array:  [[1, 0, 0, 0, 0, 0, 0, 0]]
this is the final l: [0, 1, 0, 0, 0, 0, 1, 0]
this is the status of the array:  [[1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1, 0]]
this is the final l: [0, 0, 1, 0, 0, 0, 0, 0]
this is the status of the array:  [[1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 1, 0, 0, 0, 0, 0]]
this is the final l: [0, 0, 0, 1, 0, 0, 0, 1]
this is the status of the array:  [[1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 1]]
this is the final l: [0, 0, 0, 0, 1, 0, 0, 0]
this is the status of the array:  [[1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 1], [0, 0, 0, 0, 1, 0, 0, 0]]
this is the final l: [0, 0, 0, 0, 0, 1, 0, 0]
this is the status of the array:  [[1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 1], [0, 0, 0, 0, 1, 0, 0

In [73]:
x

[[1, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 1, 0],
 [0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0, 0, 1],
 [0, 0, 0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 1, 0, 0, 0, 1]]

In [74]:
type(x)

list

In [75]:
y = np.array(x)

In [76]:
y

array([[1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0, 1]])

In [40]:
enc.fit(test_data)

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)

In [41]:
## use Sklearn OneHotEncoder and compare results on test_data
from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes categories unseen during fit (here 'China') as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(data)
enc.transform(test_data).toarray()

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 0.]])

In [47]:
enc.fit(data)
enc.transform(data).toarray()

array([[1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0.]])
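
The same encoding of the training data can also be obtained without explicit loops. The sketch below is one possible vectorized approach, not the way `OneHotEncoder` is implemented: it assumes `np.unique(..., return_inverse=True)` to turn each country into an integer code, then uses those codes to pick rows of an identity matrix. The names `categories`, `codes` and `one_hot` are illustrative.

```python
import numpy as np

data = np.array([['France'], ['USA'], ['Italy'], ['Japan'],
                 ['UK'], ['Germany'], ['USA'], ['Japan']])

# categories: sorted unique values; codes: index of each sample's category
categories, codes = np.unique(data[:, 0], return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot.shape)  # (8, 6)
```

Unlike `handle_unknown='ignore'`, this sketch only covers values seen in `data`; encoding new data would need an extra lookup step.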

### 3. Dealing with Missing Data <a id='exo3'/>

For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning.

For this, Sklearn offers multiple ways to impute missing data with the [Impute module](https://scikit-learn.org/stable/modules/impute.html#).

#### Exercise
- Re-implement the [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) transformer using the `mean` strategy.
- Test your implementation with `data` and `test_data`
- Compare with transformed data using Sklearn's `SimpleImputer`
- Bonus: Implement for all 4 strategies (`mean`, `median`, `most_frequent` and `constant`)
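
For the bonus, one possible starting point is to separate "compute one fill value per column" from "write the fill values into the holes". The sketch below makes that split with two hypothetical helpers, `column_statistics` and `impute`, which are not part of Sklearn; the default `fill_value=0.0` is also an assumption made for the example.

```python
import numpy as np

def column_statistics(X, strategy="mean", fill_value=0.0):
    """Hypothetical helper: one fill value per column, ignoring NaNs."""
    if strategy == "mean":
        return np.nanmean(X, axis=0)
    if strategy == "median":
        return np.nanmedian(X, axis=0)
    if strategy == "most_frequent":
        stats = []
        for col in X.T:
            # Count the non-missing values and keep the most common one
            values, counts = np.unique(col[~np.isnan(col)], return_counts=True)
            stats.append(values[np.argmax(counts)])
        return np.array(stats)
    if strategy == "constant":
        return np.full(X.shape[1], fill_value)
    raise ValueError(f"Unknown strategy: {strategy}")

def impute(X, stats):
    """Hypothetical helper: fill NaNs with the statistic of their column."""
    X = X.copy()  # leave the caller's array untouched
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = stats[cols]
    return X

data = np.array([[1, 3, 3], [2, np.nan, 6], [3, 9, 9]])
test_data = np.array([[1, 1, 1], [1, np.nan, 1], [1, 1, 1]])

mean_stats = column_statistics(data, "mean")
filled = impute(test_data, mean_stats)
print(filled)
```

The statistics are computed on `data` only and then reused on `test_data`, which mirrors the `fit` / `transform` split of the exercise.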

In [34]:
# Data solution

import numpy as np
data = np.array([[1, 3, 3], [2, np.nan, 6], [3, 9, 9]])
test_data = np.array([[1, 1, 1], [1, np.nan, 1], [1, 1, 1]])
data[np.isnan(data)]

array([nan])

In [None]:
import numpy as np
data = np.array([[1, 3, 3], [2, np.nan, 6], [3, 9, 9]])
test_data = np.array([[1, 1, 1], [1, np.nan, 1], [1, 1, 1]])

data_other = np.array([[1, 3, 3], [2, np.nan, 6], [np.nan, 9, 9]])

In [None]:
ind = np.where(np.isnan(data))
print(ind)

ind_other = np.where(np.isnan(data_other))
print(ind_other)
#data[ind]
#data[ind]
ind_other[1]


data_other[ind_other[1]]

In [35]:
class CustomSimpleInputer(object):
    """Re-implement SimpleImputer (mean strategy only)"""

    def __init__(self, strategy="mean"):
        self.strategy = strategy
        self.mean = None
        
    def fit(self, X, **kwargs):
        # Column means, computed while ignoring NaNs
        self.mean = np.nanmean(X, axis=0)
    
    def transform(self, X, **kwargs):
        # Work on a copy so the caller's array is not modified
        X = X.copy()
        # Find the (row, column) indices that need to be replaced
        inds = np.where(np.isnan(X))
        X[inds] = np.take(self.mean, inds[1])
        return X

In [36]:
# Instantiate the class
csi = CustomSimpleInputer()

In [37]:
# To instance csi, apply method fit().
# Fit is going to need the instance (called self)
# And the output of fit is assigning a *mean* to the instance (called self)
# Fit assigns self.mean. So in our case, self = csi
# Value given to csi.mean

csi.fit(data)

In [38]:
# Which is available here, now
csi.mean

array([2., 6., 6.])

In [None]:
# Which is then used in transform - where we use self.mean

In [None]:
## use Sklearn SimpleImputer and compare transformed data using your custom implementation
from sklearn.impute import SimpleImputer

inpute = SimpleImputer(strategy="mean")
inpute.fit(data)
inpute.transform(test_data)
# your code

### 4. Custom Transformers and Encoders <a id='exo4'/>

Sklearn provides a large collection of transformers and encoders, but you might need to implement your own encoder to fit the needs of your data and problem.

For this, there are two very useful Sklearn classes:
1. [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html), which lets you construct a transformer from an arbitrary callable.
2. [BaseEstimator](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) and [TransformerMixin](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html), which are base classes one can use to implement completely new custom encoders.
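
To make the difference between the two approaches concrete, here is a minimal sketch on a toy array. `np.log1p` is just an example callable, and `ColumnClipper` with its `lower` / `upper` parameters is a hypothetical transformer invented for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer

X = np.array([[0.0, 1.0], [2.0, 3.0]])

# 1. FunctionTransformer: wrap a stateless function as a transformer
log_transformer = FunctionTransformer(np.log1p)
logged = log_transformer.fit_transform(X)

# 2. BaseEstimator + TransformerMixin: write fit and transform yourself
class ColumnClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that clips values to [lower, upper]."""

    def __init__(self, lower=0.0, upper=1.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Nothing to learn here, but fit must return self
        return self

    def transform(self, X):
        return np.clip(X, self.lower, self.upper)

# fit_transform is provided by TransformerMixin
clipped = ColumnClipper(upper=2.0).fit_transform(X)
print(clipped)
```

`FunctionTransformer` is usually enough when nothing has to be learned from the training data; a class becomes useful as soon as `fit` needs to store state or the transformer takes parameters.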


#### Exercise
With the Taxi Fare Prediction Challenge data:

- Using `FunctionTransformer`, implement a transformer that computes the haversine distance between pickup and dropoff locations
- With `BaseEstimator` and `TransformerMixin`, implement a custom encoder that extracts time features from `pickup_datetime`
- Use these two new encoders to fit and transform the training data

In [1]:
# your code
import os
import pandas as pd

os.getcwd()
os.chdir('/Users/nicolasbancel/git/data')

df = pd.read_csv('train.csv', nrows = 1000)

In [2]:
def haversine_vectorized(df, 
    start_lat="start_lat", 
    start_lon="start_lon", 
    end_lat="end_lat", 
    end_lon="end_lon"):

    """ 
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees).
        Vectorized version of the haversine distance for pandas df
        Computes distance in kms
    """

    lat_1_rad, lon_1_rad = np.radians(df[start_lat].astype(float)), np.radians(df[start_lon].astype(float))
    lat_2_rad, lon_2_rad = np.radians(df[end_lat].astype(float)), np.radians(df[end_lon].astype(float))
    dlon = lon_2_rad - lon_1_rad
    dlat = lat_2_rad - lat_1_rad

    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat_1_rad) * np.cos(lat_2_rad) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return 6371 * c
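
A simple way to gain confidence in the formula is to test it on a case with a known answer. The sketch below restates the same haversine formula for a single pair of points (the scalar helper `haversine_km` is hypothetical and only exists for this check) and applies it to two points on the equator a quarter turn apart, whose distance should be a quarter of the circumference of a sphere of radius 6371 km.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Scalar haversine distance in km; inputs in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2.0) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2.0) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Two points on the equator, 90 degrees of longitude apart
quarter_turn = haversine_km(0.0, 0.0, 0.0, 90.0)
expected = 6371.0 * math.pi / 2  # quarter of the circumference

print(quarter_turn, expected)
```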

In [4]:
import numpy as np

df["hav_dist"] = haversine_vectorized(df, start_lat="pickup_latitude", start_lon="pickup_longitude",
            end_lat="dropoff_latitude", end_lon="dropoff_longitude")

In [6]:
df.head()

| | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | hav_dist |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-06-15 17:26:21.0000001 | 4.5 | 2009-06-15 17:26:21 UTC | -73.844311 | 40.721319 | -73.84161 | 40.712278 | 1 | 1.030764 |
| 1 | 2010-01-05 16:52:16.0000002 | 16.9 | 2010-01-05 16:52:16 UTC | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 | 8.450134 |
| 2 | 2011-08-18 00:35:00.00000049 | 5.7 | 2011-08-18 00:35:00 UTC | -73.982738 | 40.76127 | -73.991242 | 40.750562 | 2 | 1.389525 |
| 3 | 2012-04-21 04:30:42.0000001 | 7.7 | 2012-04-21 04:30:42 UTC | -73.98713 | 40.733143 | -73.991567 | 40.758092 | 1 | 2.79927 |
| 4 | 2010-03-09 07:51:00.000000135 | 5.3 | 2010-03-09 07:51:00 UTC | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 | 1.999157 |


In [5]:
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(haversine_vectorized, kw_args=dict(start_lat="pickup_latitude", start_lon="pickup_longitude",
                                                                end_lat="dropoff_latitude", end_lon="dropoff_longitude"))

hav_dist = transformer.fit_transform(df)

In [7]:
hav_dist

0      1.030764
1      8.450134
2      1.389525
3      2.799270
4      1.999157
         ...   
995    8.131868
996    6.833256
997    9.991246
998    1.544828
999    3.169336
Length: 1000, dtype: float64

In [8]:
df["hav_dist"] = transformer.fit_transform(df)

In [9]:
df["hav_dist"].head(10)

0    1.030764
1    8.450134
2    1.389525
3    2.799270
4    1.999157
5    3.787239
6    1.555807
7    4.155444
8    1.253232
9    2.849627
Name: hav_dist, dtype: float64

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd 

In [None]:
class TimeFeaturesEncoder(BaseEstimator, TransformerMixin):
    """Extract day of week, hour, month and year from a datetime column"""

    def __init__(self, time_column, time_zone_name='America/New_York'):
        self.time_column = time_column
        self.time_zone_name = time_zone_name

    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame)
        # Work on a copy so the input DataFrame keeps its index and columns
        X = X.copy()
        X.index = pd.to_datetime(X[self.time_column])
        # Convert to local time so that hour and weekday match the riders' clock
        X.index = X.index.tz_convert(self.time_zone_name)
        X["dow"] = X.index.weekday
        X["hour"] = X.index.hour
        X["month"] = X.index.month
        X["year"] = X.index.year
        return X[["dow", "hour", "month", "year"]].reset_index(drop=True)

    def fit(self, X, y=None):
        # Nothing to learn: the transformation is stateless
        return self

In [None]:
tf = TimeFeaturesEncoder("pickup_datetime")
tf.transform(df).head()
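
To see what the time zone conversion changes, the sketch below applies the same steps to two timestamps taken from the first rows of the dataset. It is a standalone illustration that uses the `.dt` accessor on a Series instead of the DataFrame index, and it passes `utc=True` to `pd.to_datetime` explicitly, which is an assumption added here to make the parsing unambiguous.

```python
import pandas as pd

times = pd.Series(["2009-06-15 17:26:21 UTC", "2010-01-05 16:52:16 UTC"])

# Parse as UTC, then express the same instants in New York local time
local = pd.to_datetime(times, utc=True).dt.tz_convert("America/New_York")

features = pd.DataFrame({
    "dow": local.dt.weekday,   # Monday is 0
    "hour": local.dt.hour,
    "month": local.dt.month,
    "year": local.dt.year,
})
print(features)
```

The first ride, recorded at 17:26 UTC, becomes an early-afternoon ride (hour 13) once expressed in local time, which is the value a model should see if fares depend on the local time of day.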

## Putting all together as a Pipeline

A Pipeline is a very useful concept. In Machine Learning, you often need to perform a sequence of different transformations (scaling, filling missing values, transforming, encoding) on a raw dataset before applying a final estimator.

A Pipeline gives you a simple interface for all these different transformation steps and the resulting estimator. With that, it is easier to iterate and improve models because you can easily add, remove or re-order these steps. Also, changing one or several parameters is very straightforward and does not require a lot of code refactoring.
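
As an illustration of how little code a parameter change requires, the sketch below builds a three-step pipeline on a toy array and then switches the imputation strategy with `set_params`. The step names (`imputer`, `scaler`, `model`) and the toy data are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# fit runs fit_transform on each transformer in order, then fits the model
pipe.fit(X, y)
predictions = pipe.predict(X)

# Change a parameter of one step using the <step name>__<parameter> syntax
pipe.set_params(imputer__strategy="median")
pipe.fit(X, y)
```

The same `<step name>__<parameter>` naming is what tools such as grid search rely on to tune any step of a pipeline.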

For this, you will learn how to use 2 Sklearn modules:
1. [ColumnTransformer](#exo11)
2. [Pipeline](#exo12)

### 1. Column Transformer <a id="exo11" />

Before building your pipeline let's use a very useful Sklearn module called [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space.

This module is very useful when your input data is a pandas dataframe, as you can select columns by their names.
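
The sketch below shows the concatenation on a toy dataframe that is unrelated to the exercise: the `city` and `temp` columns are invented for the illustration. `sparse_threshold=0` is set so that the result is always returned as a dense array, which is easier to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"city": ["Paris", "Rome", "Paris"],
                   "temp": [10.0, 20.0, 30.0]})

ct = ColumnTransformer(
    [
        ("city", OneHotEncoder(), ["city"]),   # produces 2 columns
        ("temp", StandardScaler(), ["temp"]),  # produces 1 column
    ],
    sparse_threshold=0,  # always return a dense array
)

out = ct.fit_transform(df)
print(out.shape)  # (3, 3)
print(out)
```

The output columns follow the order of the transformers in the list: first the one-hot columns for `city`, then the scaled `temp`.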

#### Exercise

You are given a small dataset containing weights and heights for a few individuals.

In [11]:
import pandas as pd 
data = pd.DataFrame(
    [
        {'gender': 'Male', 'height': 180, 'weight': 82},
        {'gender': 'Female', 'height': np.nan, 'weight': 72},
        {'gender': 'Male', 'height': 175, 'weight': 75},
        {'gender': 'Female', 'height': 175, 'weight': 60},
        {'gender': 'Male', 'height': 170, 'weight': 76},
    ])

test_data = pd.DataFrame(
    [
        {'gender': 'Male', 'height': 170, 'weight': 72},
        {'gender': 'Female', 'height': np.nan, 'weight': 60}
    ]
)

With `ColumnTransformer`, build a single encoder that applies these transformations:
- encode `gender` with OneHot
- fill missing values for height

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

encoder = ColumnTransformer([
    ('gender', OneHotEncoder(), ['gender']),
    ('fill_missing', SimpleImputer(), ['height'])])
transformed_data = encoder.fit_transform(data)



In [None]:
transformed_data

In [25]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
encoder = ColumnTransformer([
    ('height', SimpleImputer(missing_values=np.nan, strategy='mean'), ['height', 'weight']),
    ('gender', OneHotEncoder(handle_unknown='ignore'), ['gender'])]
)
encoder.fit(data)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('height',
                                 SimpleImputer(add_indicator=False, copy=True,
                                               fill_value=None,
                                               missing_values=nan,
                                               strategy='mean', verbose=0),
                                 ['height', 'weight']),
                                ('gender',
                                 OneHotEncoder(categories='auto', drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='ignore',
                                               sparse=True),
                                 ['gender'])],
                  verbose=False)

In [23]:
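# Raphael's version: fails because ColumnTransformer expects a single list of
# (name, transformer, columns) tuples, not separate (columns, transformer) arguments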

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
encoder = ColumnTransformer(
    (['height', 'weight'], SimpleImputer(missing_values=np.nan, strategy='mean')),
    (['gender'], OneHotEncoder(handle_unknown='ignore'))
)
encoder.fit(data)

TypeError: zip argument #2 must support iteration

In [None]:
data.head()

In [None]:
print(encoder.fit_transform(data))

In [None]:
data.head()

In [None]:
encoder.transform(test_data)

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer

ct = ColumnTransformer(
     [("norm1", Normalizer(norm='l1'), [0, 1]),
      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
X = np.array([[0., 1., 2., 2.],
               [1., 1., 0., 1.]])

In [None]:
ct.fit_transform(X)

In [None]:
encoder.transform(test_data)

### 2. Pipeline <a id="exo12" />

Now it is time to use a Sklearn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

#### Exercise
With the weight/height dataset, build a pipeline to predict the weight of individuals in the test set.

This pipeline should have:
- a `OneHotEncoder` for `gender`
- a step that fills missing values for height
- a scaler for height
- a simple estimator like a linear regression

**Tip** You can also use [make_pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html), a shorthand for the `Pipeline` constructor that names the steps automatically, to easily generate a pipeline without giving names to the transformers.
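
As a small illustration, the sketch below builds a two-step pipeline with `make_pipeline` and inspects the names that were generated, which are the lowercased class names. Those names are what you would use afterwards with `set_params`.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(SimpleImputer(), StandardScaler())

# Step names are generated from the class names
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['simpleimputer', 'standardscaler']

# The generated names work with the usual <step>__<parameter> syntax
pipe.set_params(standardscaler__with_mean=False)
```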

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

encoder = ColumnTransformer([
    ('gender', OneHotEncoder(), ['gender']),
    ('height_scaled', make_pipeline(SimpleImputer(), StandardScaler()), ['height'])
                            ])

pipe  = Pipeline(steps=[ ('features', encoder),
                         ('clf', LassoCV()) ])

pipe.fit(data, data.weight)

# your code

In [None]:
pipe.predict(test_data)

## Refactor Taxi Fare Prediction Problem with a Pipeline

Refactor the model you built yesterday for the Taxi Fare Prediction Problem using:
- Custom encoders you wrote for distance and time features
- OneHotEncoder to encode hour and day of week features
- SimpleImputer to fill missing values
- A simple linear regression
- A pipeline to put all together


Then: 
- train this pipeline
- apply the pipeline on test data
- generate predictions and submit these new predictions to Kaggle

In [None]:
## your code