# ML Pipeline with sklearn
In this session we develop preprocessing pipelines with sklearn. We explore different operations for different data types and learn how to group them into a single sklearn object that exposes the `.fit()` and `.transform()` methods.

In [2]:
import numpy as np
import pandas as pd

from typing import List
from sklearn import set_config
set_config(display='diagram')

In [5]:
file_name = 'taxi-trip-duration.csv'
try:
    data = pd.read_csv(file_name)
    print(f'{file_name} found on disk')
except FileNotFoundError:
    url = "https://factored-workshops.s3.amazonaws.com/taxi-trip-duration.csv"
    print(f'{file_name} not found on disk, downloading from {url}')
    data = pd.read_csv(url)
    data.to_csv(file_name, index=False)

print(data.head())

taxi-trip-duration.csv found on disk
          id  vendor_id      pickup_datetime     dropoff_datetime  \
0  id2875421          2  2016-03-14 17:24:55  2016-03-14 17:32:30   
1  id2377394          1  2016-06-12 00:43:35  2016-06-12 00:54:38   
2  id3858529          2  2016-01-19 11:35:24  2016-01-19 12:10:48   
3  id3504673          2  2016-04-06 19:32:31  2016-04-06 19:39:40   
4  id2181028          2  2016-03-26 13:30:55  2016-03-26 13:38:10   

   passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  \
0                1        -73.982155        40.767937         -73.964630   
1                1        -73.980415        40.738564         -73.999481   
2                1        -73.979027        40.763939         -74.005333   
3                1        -74.010040        40.719971         -74.012268   
4                1        -73.973053        40.793209         -73.972923   

   dropoff_latitude store_and_fwd_flag  trip_duration pickup_borough  \
0         40.765602

In [6]:
# Limit data range
time_min = 60  # 1 minute
time_max = 36000  # 10 hours
data = data[
    (data["trip_duration"] > time_min) &
    (data["trip_duration"] < time_max)
]
data.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,pickup_borough,dropoff_borough
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455,Manhattan,Manhattan
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663,Manhattan,Brooklyn
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,Manhattan,Brooklyn
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429,Brooklyn,Brooklyn
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435,Manhattan,Manhattan


We are going to separate the dependent variable `trip_duration`.

In [7]:
y = data["trip_duration"]
input_df = data.drop(
    ["id", "trip_duration", "dropoff_datetime", "store_and_fwd_flag"],
    axis="columns"
)

## Data split
We should usually divide the data into training, validation and test sets.

In this exercise, we divide it into training and validation sets only, using the `train_test_split` function.

In [8]:
from sklearn.model_selection import train_test_split

train_df, val_df, y_train, y_val = train_test_split(input_df, y, random_state=0)
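
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows. If a separate test set is also needed, one common approach is to call the function twice. The sketch below uses made-up data, and the names `demo_X`, `demo_y` and the 60/20/20 proportions are illustrative assumptions, not part of the workshop.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the taxi data (hypothetical values).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"feature": rng.normal(size=100)})
demo_y = pd.Series(rng.normal(size=100), name="target")

# First hold out 20% of the rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)
# Then take 25% of the remaining rows (20% of the total) for validation.
X_train, X_val, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing `random_state` in both calls keeps the three sets reproducible between runs.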

## Transformers
scikit-learn includes an extensive list of transformers that allow us to clean and transform data depending on our objective. Transformers follow the convention of having at least two methods:
- `.fit()`: computes the parameters needed to perform the transformation from the input data. This method should only be applied to the training data, so that the parameters don't contain information from the validation or test sets.
- `.transform()`: applies the transformation to the data.

Let's see an example with `StandardScaler`, a transformer that removes the mean and scales the data to have variance = 1. We are going to use it to normalize the pickup coordinates.


In [14]:
from sklearn.preprocessing import StandardScaler

transformer = StandardScaler()
transformer.fit(
    train_df[["pickup_longitude", "pickup_latitude"]]
)
normed_array = transformer.transform(
    val_df[["pickup_longitude", "pickup_latitude"]]
)
print(normed_array)

[[-0.30923835 -0.27478234]
 [-0.13589444  0.75675283]
 [ 0.24599071  0.58169127]
 ...
 [-0.19727957 -0.13002453]
 [-0.03955223 -0.08161096]
 [-0.11150857 -0.79791458]]
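
To make the split between `.fit()` and `.transform()` concrete, the sketch below inspects what `StandardScaler` learns during `.fit()` (the `mean_` and `scale_` attributes) and reproduces `.transform()` by hand. The small `train` and `val` arrays are made-up values, not taxi data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays standing in for the training and validation coordinates.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
val = np.array([[2.0, 40.0]])

scaler = StandardScaler().fit(train)
print(scaler.mean_)   # per-column means learned from the training data
print(scaler.scale_)  # per-column population standard deviations (ddof=0)

# transform() applies the stored parameters; it does not recompute them.
manual = (val - scaler.mean_) / scaler.scale_
print(np.allclose(manual, scaler.transform(val)))
```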


## Custom Transformers
Even though scikit-learn offers various operations to transform data, we frequently need to create our own custom transformer specifically for our project. All custom transformers should inherit from `BaseEstimator` and `TransformerMixin` to have all the necessary functions to connect with other sklearn objects. By convention, all objects used to transform data should have the `.fit()` and `.transform()` methods. Both methods receive X and y so that they integrate with sklearn pipelines without any problem.
- The `.fit()` method should always return `self`.
- The `.transform()` method performs the transformation and returns the transformed data.

Let's replicate the `StandardScaler` by creating a custom transformer, and make it return a dataframe instead of an array.

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin

class FirstTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.mean = X.mean()
        self.std = X.std()
        return self

    def transform(self, X, y=None):
        return (X - self.mean) / self.std

In [13]:
first_transformer = FirstTransformer()
first_transformer.fit(
    train_df[['pickup_longitude','pickup_latitude']]
    )
val_normed_df = first_transformer.transform(
    val_df[['pickup_longitude','pickup_latitude']]
)
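
As a sanity check on the convention, the sketch below builds a scaler of the same shape on made-up data and compares it with `StandardScaler`. The class name `DemoScaler` and the toy dataframes are hypothetical. One difference worth knowing: pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, while `StandardScaler` uses the population one (`ddof=0`), so the two results differ by a factor of `sqrt(n / (n - 1))`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class DemoScaler(BaseEstimator, TransformerMixin):
    """Hypothetical scaler mirroring the custom transformer pattern."""

    def fit(self, X, y=None):
        self.mean_ = X.mean()
        self.std_ = X.std()  # pandas default: sample std (ddof=1)
        return self

    def transform(self, X, y=None):
        return (X - self.mean_) / self.std_

demo_train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 50.0]})
demo_val = pd.DataFrame({"a": [2.5], "b": [25.0]})

demo_scaler = DemoScaler()
fitted = demo_scaler.fit(demo_train)         # fit() returns the transformer itself
demo_out = demo_scaler.transform(demo_val)   # transform() returns a DataFrame

# Rescale to undo the ddof difference before comparing with StandardScaler.
sk_out = StandardScaler().fit(demo_train).transform(demo_val)
n = len(demo_train)
rescaled = demo_out.to_numpy() * np.sqrt(n / (n - 1))
print(type(demo_out).__name__, np.allclose(rescaled, sk_out))
```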

## Transforming Dates

Now that we know how to create objects to transform data, we are going to create a transformer that derives new variables, such as the weekday and the hour, from the pickup time. This information is going to be relevant for our models, as we learned in our previous EDA session.

- `.fit()`: our transformer doesn't need to store any data to perform the transformation, so we only return `self`.
- `.transform()`: we extract the date components and return a dataframe with the new columns of interest.

In [18]:
class TransformerDates(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        date_column = pd.to_datetime(X["pickup_datetime"])
        date_df = pd.DataFrame()
        # Create the weekday and pickup-hour columns
        date_df['weekday'] = date_column.dt.weekday
        date_df['hour'] = date_column.dt.hour
        return date_df

In [19]:
transformer_dates = TransformerDates()
dates_df = transformer_dates.fit_transform(train_df)
dates_df.head()

Unnamed: 0,weekday,hour
518949,3,21
1128931,6,21
574396,1,18
54790,6,17
599130,0,16
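
To read the `weekday` column correctly, it helps to remember that pandas numbers weekdays from Monday = 0 to Sunday = 6. A minimal check, using two of the pickup timestamps shown in the data preview above:

```python
import pandas as pd

# Two pickup timestamps taken from the first rows of the dataset.
stamps = pd.Series(["2016-03-14 17:24:55", "2016-06-12 00:43:35"])
parsed = pd.to_datetime(stamps)

weekdays = parsed.dt.weekday.tolist()  # Monday=0 ... Sunday=6
hours = parsed.dt.hour.tolist()
print(weekdays, hours)
```

The first timestamp falls on a Monday and the second on a Sunday, so the weekdays come out as 0 and 6.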


## Transforming Distance
We also want a transformer to help us measure the distance between the pickup and dropoff coordinates. 

- `.fit()`: we don't need to store anything in this case either.
- `.transform()`: we compute the distance between the two points using the Haversine distance. The code provided already includes a function to get this distance, and we will skip the details for now. As you can see, our transformers can include additional helper methods that support the transformation.

In [26]:
class TransformerDistance(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_init = X[["pickup_latitude", "pickup_longitude"]].to_numpy()
        X_final = X[["dropoff_latitude", "dropoff_longitude"]].to_numpy()

        # Haversine distance
        # Compute the distance variable with the Haversine helper method
        distance = self.distance_haversine(X_init, X_final)
        
        distance_df = pd.DataFrame()
        distance_df["distance"] = distance
        return distance_df
    
    def distance_haversine(self, X_init, X_final):
        # Convert from decimal degrees to radians
        X_init = np.radians(X_init)
        X_final = np.radians(X_final)

        # Haversine formula
        dlat = X_final[:, 0] - X_init[:, 0] 
        dlon = X_final[:, 1] - X_init[:, 1]
        a = np.sin(dlat / 2) ** 2 + np.cos(X_init[:, 0]) * np.cos(X_final[:, 0]) * np.sin(dlon / 2) ** 2
        c = 2 * np.arcsin(np.sqrt(a))
        r = 6371 # Radius of earth in kilometers. Use 3956 for miles. Determines return value units.
        return c * r

In [27]:
transformer_distance = TransformerDistance()
distance_df = transformer_distance.fit_transform(train_df)
distance_df.head()

Unnamed: 0,distance
0,2.404355
1,0.390267
2,5.629826
3,4.298386
4,7.488963
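
One way to gain confidence in the Haversine helper is to check it against cases with a known answer. The sketch below restates the same formula as a standalone function (the name `haversine_km` and the test points are illustrative assumptions): a quarter turn along the equator should equal a quarter of the great circle, `r * pi / 2`, and identical points should give zero.

```python
import numpy as np

def haversine_km(start, end, radius_km=6371.0):
    """Great-circle distance between arrays of (lat, lon) pairs in degrees."""
    start, end = np.radians(start), np.radians(end)
    dlat = end[:, 0] - start[:, 0]
    dlon = end[:, 1] - start[:, 1]
    a = (
        np.sin(dlat / 2) ** 2
        + np.cos(start[:, 0]) * np.cos(end[:, 0]) * np.sin(dlon / 2) ** 2
    )
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Row 0: 90 degrees of longitude along the equator. Row 1: identical points.
start = np.array([[0.0, 0.0], [40.0, -74.0]])
end = np.array([[0.0, 90.0], [40.0, -74.0]])
dist = haversine_km(start, end)
print(dist)
```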


## Joining Transformers with Pipelines
### Numeric Pipeline

We are now going to use the `Pipeline` and `ColumnTransformer` objects to join the transformers we have created with other transformers available in sklearn to apply them to our data.

In [28]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

`ColumnTransformer` 

`ColumnTransformer` lets us choose the columns a transformation is applied to when the input also contains other columns. In this case we want only `pickup_longitude`, `pickup_latitude`, `dropoff_longitude` and `dropoff_latitude` to reach `TransformerDistance`.
From the documentation, we know that we must pass a tuple with the name of the transformer, the transformer instance, and the columns the transformation should be applied to. `ColumnTransformer` also lets us define what to do with the columns we are not transforming; in this case we choose to pass them through unchanged with `remainder="passthrough"`.
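
A minimal sketch of that tuple structure, using `StandardScaler` on a made-up three-row dataframe rather than the workshop's own transformers. The step name `"scale_coords"` and the values are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame with two columns to scale and one to leave untouched.
demo = pd.DataFrame({
    "pickup_longitude": [-73.98, -73.97, -74.01],
    "pickup_latitude": [40.76, 40.73, 40.72],
    "passenger_count": [1, 2, 1],
})

column_transformer = ColumnTransformer(
    transformers=[
        # (name, transformer instance, columns it receives)
        ("scale_coords", StandardScaler(), ["pickup_longitude", "pickup_latitude"]),
    ],
    remainder="passthrough",  # keep the untouched columns in the output
)
result = column_transformer.fit_transform(demo)
print(result.shape)
```

The transformed columns come first in the output, followed by the passed-through columns, so `passenger_count` ends up as the last column here.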

In [20]:
data.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,pickup_borough,dropoff_borough
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455,Manhattan,Manhattan
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663,Manhattan,Brooklyn
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,Manhattan,Brooklyn
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429,Brooklyn,Brooklyn
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435,Manhattan,Manhattan
