# Tip Classifier for NYC Taxi Trips (2020)

Inspired by the talk ["Keeping up with Machine Learning in Production"](https://github.com/shreyashankar/debugging-ml-talk) by [Shreya Shankar](https://twitter.com/sh_reya).

This notebook walks through building a toy machine learning model using 2020 trip data from New York City yellow taxis, [provided by the NYC Taxi and Limousine Commission (TLC)](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

The goal is to identify trips where the passenger left a high tip, that is, a tip greater than 20% of the trip cost.

To do so, we fit a RandomForest binary classification model on the trips from January 2020, then test the resulting model on the trips from February 2020. We compare the model's performance on both sets using the [F1 score](https://en.wikipedia.org/wiki/F-score).
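As a quick illustration of the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand for a handful of made-up labels. The function name `f1_from_labels` and the example lists are illustrative assumptions and are not part of this project's code.

```python
def f1_from_labels(y_true, y_pred):
    """Compute the F1 score for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive predictions: define F1 as 0 to avoid dividing by zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = high tip, 0 = not a high tip
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f"F1: {f1_from_labels(y_true, y_pred):.4f}")
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so precision and recall are both 2/3 and the F1 score is also 2/3.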

**This notebook is built to run on [Google Colab](https://colab.research.google.com/), which is free to use with just a Google (Gmail) account and a web browser. Nothing needs to be installed on the local machine.**

## Loading the required libraries

In [21]:
# Imports and configuration
import pandas as pd
import numpy as np
import os
import sys

# Allow imports from ../src
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', 'src')))

from data.dataset import load_dataset, add_target
from features.build_features import preprocess
from modeling.train import train_model

## Reading the January 2020 data (training)

In [23]:
# Load the data from the URL
parquet_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-01.parquet"
taxi = load_dataset(parquet_url)

# Create the target variable
taxi = add_target(taxi, target_col="target")

In [24]:
taxi.head()

|   | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | ... | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee | tip_fraction | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2020-01-01 00:28:15 | 2020-01-01 00:33:03 | 1.0 | 1.2 | 1.0 | N | 238 | 239 | 1 | ... | 3.0 | 0.5 | 1.47 | 0.0 | 0.3 | 11.27 | 2.5 |  | 0.245 | 1 |
| 1 | 1 | 2020-01-01 00:35:39 | 2020-01-01 00:43:04 | 1.0 | 1.2 | 1.0 | N | 239 | 238 | 1 | ... | 3.0 | 0.5 | 1.5 | 0.0 | 0.3 | 12.3 | 2.5 |  | 0.214286 | 1 |
| 2 | 1 | 2020-01-01 00:47:41 | 2020-01-01 00:53:52 | 1.0 | 0.6 | 1.0 | N | 238 | 238 | 1 | ... | 3.0 | 0.5 | 1.0 | 0.0 | 0.3 | 10.8 | 2.5 |  | 0.166667 | 0 |
| 3 | 1 | 2020-01-01 00:55:23 | 2020-01-01 01:00:14 | 1.0 | 0.8 | 1.0 | N | 238 | 151 | 1 | ... | 0.5 | 0.5 | 1.36 | 0.0 | 0.3 | 8.16 | 0.0 |  | 0.247273 | 1 |
| 4 | 2 | 2020-01-01 00:01:58 | 2020-01-01 00:04:16 | 1.0 | 0.0 | 1.0 | N | 193 | 193 | 2 | ... | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 4.8 | 0.0 |  | 0.0 | 0 |
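The `tip_fraction` and `target` columns in the output above come from `add_target`, whose implementation is not shown in this notebook. The sample rows are consistent with `tip_fraction` being `tip_amount` divided by `fare_amount` (for example, 1.47 / 0.245 = 6.0), although `fare_amount` itself is elided in the output. Under that assumption, a minimal sketch of the labeling step might look like the following. The function name `add_high_tip_label`, the zero-fare handling, and the `fare_amount` values in the sample are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_high_tip_label(df, threshold=0.2, target_col="target"):
    """Hypothetical stand-in for add_target: flag trips whose tip exceeds `threshold` of the fare."""
    out = df.copy()
    # Assumption: a zero fare is treated as missing so the division never hits zero
    fare = out["fare_amount"].replace(0, np.nan)
    out["tip_fraction"] = out["tip_amount"] / fare
    # Comparisons against NaN are False, so zero-fare trips are labeled 0
    out[target_col] = (out["tip_fraction"] > threshold).astype(int)
    return out


sample = pd.DataFrame({
    "fare_amount": [6.0, 7.0, 6.0],  # back-computed from the sample rows, illustrative only
    "tip_amount": [1.47, 1.5, 1.0],
})
labeled = add_high_tip_label(sample)
print(labeled)
```

With these values the first two trips are labeled 1 and the third is labeled 0, matching the `target` column shown for the first three rows.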


In [25]:
taxi.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee', 'tip_fraction',
       'target'],
      dtype='object')

## Dataset description

The data dictionary can be found [here](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf):

| Field Name      | Description |
| ----------- | ----------- |
| VendorID      | A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.       |
| tpep_pickup_datetime   | The date and time when the meter was engaged.        |
| tpep_dropoff_datetime   | The date and time when the meter was disengaged.        |
| Passenger_count   | The number of passengers in the vehicle. This is a driver-entered value.      |
| Trip_distance   | The elapsed trip distance in miles reported by the taximeter.      |
| PULocationID   | TLC Taxi Zone in which the taximeter was engaged.      |
| DOLocationID   | TLC Taxi Zone in which the taximeter was disengaged.      |
| RateCodeID   | The final rate code in effect at the end of the trip. 1= Standard rate, 2=JFK, 3=Newark, 4=Nassau or Westchester, 5=Negotiated fare, 6=Group ride     |
| Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip, N= not a store and forward trip |
| Payment_type | A numeric code signifying how the passenger paid for the trip. 1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown, 6= Voided trip |
| Fare_amount | The time-and-distance fare calculated by the meter. |
| Extra | Miscellaneous extras and surcharges. Currently, this only includes the \$0.50 and \$1 rush hour and overnight charges. |
| MTA_tax | \$0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| Improvement_surcharge | \$0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015. |
| Tip_amount | Tip amount – This field is automatically populated for credit card tips. Cash tips are not included. |
| Tolls_amount | Total amount of all tolls paid in trip. |
| Total_amount | The total amount charged to passengers. Does not include cash tips. |

## Defining the features used for classification

We build them next, in the data preprocessing step.

In [26]:
# sys, os, and preprocess were already imported in the first cell, where ../src
# was also added to sys.path, so nothing needs to be re-imported here.

target_col = "high_tip"
taxi_train = preprocess(df=taxi, target_col=target_col)

In [30]:
# Drop the label and the raw dropoff timestamp so that only model features remain
X = taxi_train.drop(columns=[target_col, "tpep_dropoff_datetime"])
y = taxi_train[target_col]

In [34]:
# Basic information about the dataset
print(f'Num rows: {len(taxi_train)}, Size: {taxi_train.memory_usage(deep=True).sum() / 1e9:.2f} GB')

Num rows: 6382762, Size: 0.36 GB


In [36]:
taxi_train.head()

|   | tpep_dropoff_datetime | pickup_weekday | pickup_hour | work_hours | pickup_minute | passenger_count | trip_distance | trip_time | trip_speed | PULocationID | DOLocationID | RatecodeID | high_tip |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2020-01-01 00:33:03 | 2.0 | 0.0 | 0.0 | 28.0 | 1.0 | 1.2 | 288.0 | 0.004167 | 238.0 | 239.0 | 1.0 | 1 |
| 1 | 2020-01-01 00:43:04 | 2.0 | 0.0 | 0.0 | 35.0 | 1.0 | 1.2 | 445.0 | 0.002697 | 239.0 | 238.0 | 1.0 | 1 |
| 2 | 2020-01-01 00:53:52 | 2.0 | 0.0 | 0.0 | 47.0 | 1.0 | 0.6 | 371.0 | 0.001617 | 238.0 | 238.0 | 1.0 | 0 |
| 3 | 2020-01-01 01:00:14 | 2.0 | 0.0 | 0.0 | 55.0 | 1.0 | 0.8 | 291.0 | 0.002749 | 238.0 | 151.0 | 1.0 | 1 |
| 4 | 2020-01-01 00:04:16 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 138.0 | 0.0 | 193.0 | 193.0 | 1.0 | 0 |


In [40]:
taxi_train.columns

Index(['tpep_dropoff_datetime', 'pickup_weekday', 'pickup_hour', 'work_hours',
       'pickup_minute', 'passenger_count', 'trip_distance', 'trip_time',
       'trip_speed', 'PULocationID', 'DOLocationID', 'RatecodeID', 'high_tip'],
      dtype='object')
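The time-based columns listed above are produced inside `preprocess`, which is not shown here. The sample rows suggest that `trip_time` is the trip duration in seconds and `trip_speed` is `trip_distance / trip_time` (1.2 miles over 288 seconds gives about 0.004167). The following sketch derives comparable features from the raw timestamps. The function name `build_time_features`, the `work_hours` rule (weekdays, 08:00 to 18:59), and the small epsilon in the speed calculation are assumptions, so the real implementation may differ.

```python
import pandas as pd


def build_time_features(df):
    """Illustrative sketch only: derive pickup-time and trip-duration features."""
    out = pd.DataFrame(index=df.index)
    pickup = pd.to_datetime(df["tpep_pickup_datetime"])
    dropoff = pd.to_datetime(df["tpep_dropoff_datetime"])

    out["pickup_weekday"] = pickup.dt.weekday  # Monday=0 ... Sunday=6
    out["pickup_hour"] = pickup.dt.hour
    out["pickup_minute"] = pickup.dt.minute
    # Assumed rule: Monday to Friday, pickup hour between 8 and 18 inclusive
    out["work_hours"] = (
        (out["pickup_weekday"] <= 4) & out["pickup_hour"].between(8, 18)
    ).astype(int)

    out["trip_time"] = (dropoff - pickup).dt.total_seconds()
    # Miles per second; the epsilon (an assumption) avoids dividing by zero
    out["trip_speed"] = df["trip_distance"] / (out["trip_time"] + 1e-7)
    return out


# First trip from the January data shown earlier
sample = pd.DataFrame({
    "tpep_pickup_datetime": ["2020-01-01 00:28:15"],
    "tpep_dropoff_datetime": ["2020-01-01 00:33:03"],
    "trip_distance": [1.2],
})
feats = build_time_features(sample)
print(feats)
```

For this trip the sketch yields a weekday of 2 (Wednesday), a pickup minute of 28, and a trip time of 288 seconds, in line with the first row of `taxi_train`.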

## Training and exporting the model

In [79]:
# Train the model and save it to disk

from time import time
start = time()

model_path = os.path.join(os.getcwd(), "..", "models", "random_forest_model.joblib")
model = train_model(X, y, model_path=model_path)


print(f"Training finished in {time() - start:.2f} seconds")

Training finished in 744.61 seconds
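The body of `train_model` lives in `../src` and is not reproduced in this notebook. As a rough idea of what a train-and-export step of this kind involves, the sketch below fits a scikit-learn `RandomForestClassifier` and persists it with `joblib`. The function name `train_and_save`, the hyperparameters, and the synthetic data are assumptions chosen to keep the example small, and they are not taken from the project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_and_save(X, y, model_path, n_estimators=10, max_depth=5, random_state=42):
    """Hypothetical stand-in for train_model: fit a random forest and save it."""
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=random_state
    )
    model.fit(X, y)
    # Create the target folder if it does not exist yet, then persist the model
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    joblib.dump(model, model_path)
    return model


# Tiny synthetic dataset, used only to exercise the function
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_path = os.path.join(tempfile.mkdtemp(), "models", "demo_model.joblib")
demo_model = train_and_save(X_demo, y_demo, demo_path)

# Reloading checks that the saved artifact reproduces the same predictions
reloaded = joblib.load(demo_path)
print((reloaded.predict(X_demo) == demo_model.predict(X_demo)).all())
```

Saving the fitted model to a file means later steps, such as scoring another month of trips, can load it with `joblib.load` instead of repeating a training run that takes several minutes.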


## Computing the F1 score on the training set

In [88]:
from sklearn.metrics import f1_score

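# Class probabilities per row; column 1 is the positive class (high tip)
preds = model.predict_proba(X)
# Threshold at 0.5 to turn probabilities into 0/1 labels
preds_labels = (preds[:, 1] > 0.5).astype(int)
f1 = f1_score(y, preds_labels)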

In [89]:
print(f'F1 score on the training set: {f1:.4f}')

F1 score on the training set: 0.7296
