## Dataset description - January 2020

The dataset covers yellow taxi trips recorded in New York City during January 2020. It is published by the NYC Taxi and Limousine Commission (TLC) and contains detailed information about each trip, including:

- Trip start and end date and time
- Distance traveled (`trip_distance`)
- Base fare (`fare_amount`) and total amount (`total_amount`)
- Tip paid (`tip_amount`)
- Number of passengers (`passenger_count`)
- Pickup and drop-off locations (`PULocationID`, `DOLocationID`)
- Rate code (`RatecodeID`)

This dataset is the basis for training a classification model that predicts whether or not a trip will receive a high tip, using numerical and categorical variables derived from this information.
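
The text above does not say how a "high tip" is defined. As a minimal sketch, assume a trip counts as high-tip when the tip is at least 20% of the base fare. The 20% threshold, the `add_high_tip_label` helper, and the `tip_fraction` and `high_tip` column names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def add_high_tip_label(df: pd.DataFrame, threshold: float = 0.2) -> pd.DataFrame:
    """Return a copy of df with a binary `high_tip` column.

    Assumption: "high tip" means tip_amount / fare_amount >= threshold.
    Rows with a non-positive fare are dropped because the ratio is undefined.
    """
    out = df[df["fare_amount"] > 0].copy()
    out["tip_fraction"] = out["tip_amount"] / out["fare_amount"]
    out["high_tip"] = (out["tip_fraction"] >= threshold).astype(int)
    return out


# Toy rows using the fare and tip values of the first five trips in the data
sample = pd.DataFrame({
    "fare_amount": [6.0, 7.0, 6.0, 5.5, 3.5],
    "tip_amount": [1.47, 1.5, 1.0, 1.36, 0.0],
})
labeled = add_high_tip_label(sample)
print(labeled[["fare_amount", "tip_amount", "high_tip"]])
```

Other definitions (for example, a fixed dollar amount or a percentile of the tip distribution) would fit the same structure; only the comparison inside the helper would change.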


In [None]:
import pandas as pd

# Load the January dataset
taxi = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-01.parquet')

In [3]:
# Initial exploration
taxi.head()

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2020-01-01 00:28:15 | 2020-01-01 00:33:03 | 1.0 | 1.2 | 1.0 | N | 238 | 239 | 1 | 6.0 | 3.0 | 0.5 | 1.47 | 0.0 | 0.3 | 11.27 | 2.5 | |
| 1 | 1 | 2020-01-01 00:35:39 | 2020-01-01 00:43:04 | 1.0 | 1.2 | 1.0 | N | 239 | 238 | 1 | 7.0 | 3.0 | 0.5 | 1.5 | 0.0 | 0.3 | 12.3 | 2.5 | |
| 2 | 1 | 2020-01-01 00:47:41 | 2020-01-01 00:53:52 | 1.0 | 0.6 | 1.0 | N | 238 | 238 | 1 | 6.0 | 3.0 | 0.5 | 1.0 | 0.0 | 0.3 | 10.8 | 2.5 | |
| 3 | 1 | 2020-01-01 00:55:23 | 2020-01-01 01:00:14 | 1.0 | 0.8 | 1.0 | N | 238 | 151 | 1 | 5.5 | 0.5 | 0.5 | 1.36 | 0.0 | 0.3 | 8.16 | 0.0 | |
| 4 | 2 | 2020-01-01 00:01:58 | 2020-01-01 00:04:16 | 1.0 | 0.0 | 1.0 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 4.8 | 0.0 | |


In [4]:
taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6405008 entries, 0 to 6405007
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[us]
 2   tpep_dropoff_datetime  datetime64[us]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  airport_fee           

In [5]:
taxi.describe()

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 6405008.0 | 6405008 | 6405008 | 6339567.0 | 6405008.0 | 6339567.0 | 6405008.0 | 6405008.0 | 6405008.0 | 6405008.0 | 6405008.0 | 6405008.0 | 6405008.0 | 6405008.0 | 6405008.0 | 6405008.0 | 6339567.0 |
| mean | 1.673002 | 2020-01-17 03:05:16.413238 | 2020-01-17 03:21:13.417920 | 1.515333 | 2.929644 | 1.059908 | 164.7323 | 162.6627 | 1.257319 | 12.69411 | 1.115456 | 0.4923182 | 2.189342 | 0.3488395 | 0.297987 | 18.66315 | 2.299052 |
| min | 1.0 | 2003-01-01 00:07:17 | 2003-01-01 14:16:59 | 0.0 | -30.62 | 1.0 | 1.0 | 1.0 | 0.0 | -1238.0 | -27.0 | -0.5 | -91.0 | -35.74 | -0.3 | -1242.3 | -2.5 |
| 25% | 1.0 | 2020-01-09 17:10:53 | 2020-01-09 17:27:34.750000 | 1.0 | 0.96 | 1.0 | 132.0 | 113.0 | 1.0 | 6.5 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 11.16 | 2.5 |
| 50% | 2.0 | 2020-01-16 23:16:29 | 2020-01-16 23:32:24 | 1.0 | 1.6 | 1.0 | 162.0 | 162.0 | 1.0 | 9.0 | 0.5 | 0.5 | 1.95 | 0.0 | 0.3 | 14.3 | 2.5 |
| 75% | 2.0 | 2020-01-24 18:24:30 | 2020-01-24 18:39:51 | 2.0 | 2.93 | 1.0 | 234.0 | 234.0 | 2.0 | 14.0 | 2.5 | 0.5 | 2.86 | 0.0 | 0.3 | 19.8 | 2.5 |
| max | 5.0 | 2021-01-02 01:12:10 | 2021-01-02 01:25:01 | 9.0 | 210240.1 | 99.0 | 265.0 | 265.0 | 5.0 | 4265.0 | 113.01 | 30.8 | 1100.0 | 910.5 | 0.3 | 4268.3 | 2.75 |
| std | 0.4691265 | | | 1.151594 | 83.15911 | 0.8118432 | 65.54374 | 69.91261 | 0.4885669 | 12.1273 | 1.260054 | 0.07374184 | 2.760028 | 1.766978 | 0.03385937 | 14.75736 | 0.7017109 |
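
The summary above includes values that cannot be valid January 2020 trips: pickup timestamps from 2003 and 2021, negative distances and amounts, and a maximum distance of 210240.1 miles. A minimal filtering sketch follows. The `filter_plausible_trips` helper, the 100-mile cap, and the choice to drop zero-distance trips are assumptions made for illustration, not the notebook's own cleaning procedure.

```python
import pandas as pd


def filter_plausible_trips(df: pd.DataFrame, max_distance: float = 100.0) -> pd.DataFrame:
    """Keep trips picked up in January 2020 with plausible distance and amounts.

    Assumptions: distance must be in (0, max_distance] miles, the fare must be
    positive, and the tip must be non-negative.
    """
    start = pd.Timestamp("2020-01-01")
    end = pd.Timestamp("2020-02-01")  # exclusive upper bound
    mask = (
        (df["tpep_pickup_datetime"] >= start)
        & (df["tpep_pickup_datetime"] < end)
        & (df["trip_distance"] > 0)
        & (df["trip_distance"] <= max_distance)
        & (df["fare_amount"] > 0)
        & (df["tip_amount"] >= 0)
    )
    return df[mask].reset_index(drop=True)


# Toy rows that mix one valid trip with the kinds of outliers seen in the summary
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2020-01-01 00:28:15",  # valid trip
        "2003-01-01 00:07:17",  # pickup outside January 2020
        "2020-01-16 23:16:29",  # negative fare
        "2020-01-24 18:24:30",  # implausible distance
    ]),
    "trip_distance": [1.2, 1.6, 2.93, 210240.1],
    "fare_amount": [6.0, 9.0, -1238.0, 14.0],
    "tip_amount": [1.47, 1.95, 0.0, 2.86],
})
clean = filter_plausible_trips(sample)
print(len(clean))  # 1
```

On the full data the same call would be `filter_plausible_trips(taxi)`. The thresholds are worth revisiting after inspecting how many rows each condition removes.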


## Dataset field descriptions

The data dictionary is available [here](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf):

| Field Name | Description |
| ----------- | ----------- |
| `VendorID` | A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc. |
| `tpep_pickup_datetime` | The date and time when the meter was engaged. |
| `tpep_dropoff_datetime` | The date and time when the meter was disengaged. |
| `passenger_count` | The number of passengers in the vehicle. This is a driver-entered value. |
| `trip_distance` | The elapsed trip distance in miles reported by the taximeter. |
| `PULocationID` | TLC Taxi Zone in which the taximeter was engaged. |
| `DOLocationID` | TLC Taxi Zone in which the taximeter was disengaged. |
| `RatecodeID` | The final rate code in effect at the end of the trip. 1 = Standard rate, 2 = JFK, 3 = Newark, 4 = Nassau or Westchester, 5 = Negotiated fare, 6 = Group ride |
| `store_and_fwd_flag` | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y = store and forward trip, N = not a store and forward trip |
| `payment_type` | A numeric code signifying how the passenger paid for the trip. 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided trip |
| `fare_amount` | The time-and-distance fare calculated by the meter. |
| `extra` | Miscellaneous extras and surcharges. Currently, this only includes the \$0.50 and \$1 rush hour and overnight charges. |
| `mta_tax` | \$0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| `improvement_surcharge` | \$0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
| `tip_amount` | Tip amount. This field is automatically populated for credit card tips. Cash tips are not included. |
| `tolls_amount` | Total amount of all tolls paid in trip. |
| `total_amount` | The total amount charged to passengers. Does not include cash tips. |
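
Because the dictionary states that cash tips are not recorded, tip-based analyses are often restricted to credit card trips. The sketch below decodes `payment_type` with the codes listed in the table and keeps only credit card rows. The `PAYMENT_TYPES` constant, the `credit_card_trips` helper, and the `payment_label` column are hypothetical names introduced here, and restricting to credit card trips is an assumption rather than a step the notebook confirms.

```python
import pandas as pd

# Code-to-label mapping copied from the payment type row of the data dictionary
PAYMENT_TYPES = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}


def credit_card_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Add a readable `payment_label` column and keep only credit card trips.

    Codes missing from the dictionary map to NaN in `payment_label`.
    """
    out = df.copy()
    out["payment_label"] = out["payment_type"].map(PAYMENT_TYPES)
    return out[out["payment_type"] == 1].reset_index(drop=True)


# Toy rows using the payment codes and tips of the first five trips in the data
sample = pd.DataFrame({
    "payment_type": [1, 1, 1, 1, 2],
    "tip_amount": [1.47, 1.5, 1.0, 1.36, 0.0],
})
cards = credit_card_trips(sample)
print(cards)
```

Filtering this way avoids treating unrecorded cash tips as zero tips, at the cost of making the model describe credit card trips only.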