# Description
We will work with the Kaggle dataset from the [New York City Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration) competition and build a model to predict the total duration of taxi trips in New York City 🗽. The dataset includes variables such as vendor, pick-up and drop-off location, passenger count, and pick-up time 🚕💨💨💨.

In [2]:
# Import libraries
import pandas as pd
import numpy as np
from typing import List
import matplotlib


In [4]:
file_name = 'taxi-trip-duration.csv'

try:
    df_train = pd.read_csv(file_name)
    print(f'{file_name} found on disk')
except FileNotFoundError:
    # Only fall back to the download when the local copy is missing
    url = "https://factored-workshops.s3.amazonaws.com/taxi-trip-duration.csv"
    print(f'{file_name} not found on disk, downloading from {url}')
    df_train = pd.read_csv(url)
    df_train.to_csv(file_name, index=False)

print(df_train.head())

taxi-trip-duration.csv not found on disk, downloading from https://factored-workshops.s3.amazonaws.com/taxi-trip-duration.csv
          id  vendor_id      pickup_datetime     dropoff_datetime  \
0  id2875421          2  2016-03-14 17:24:55  2016-03-14 17:32:30   
1  id2377394          1  2016-06-12 00:43:35  2016-06-12 00:54:38   
2  id3858529          2  2016-01-19 11:35:24  2016-01-19 12:10:48   
3  id3504673          2  2016-04-06 19:32:31  2016-04-06 19:39:40   
4  id2181028          2  2016-03-26 13:30:55  2016-03-26 13:38:10   

   passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  \
0                1        -73.982155        40.767937         -73.964630   
1                1        -73.980415        40.738564         -73.999481   
2                1        -73.979027        40.763939         -74.005333   
3                1        -74.010040        40.719971         -74.012268   
4                1        -73.973053        40.793209         -73.972923   

   

In [17]:
print('columns:', df_train.columns)
print('shape:', df_train.shape)

columns: Index(['id', 'vendor_id', 'pickup_datetime', 'dropoff_datetime',
       'passenger_count', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'store_and_fwd_flag',
       'trip_duration', 'pickup_borough', 'dropoff_borough'],
      dtype='object')
shape: (1458644, 13)


## Prevent data leakage
We need to remove `dropoff_datetime`: since the target is the trip duration, a model could trivially recover it as `trip_duration = dropoff_datetime - pickup_datetime`. The drop-off time is also unknown when a trip starts, so it would not be available at prediction time. Let's remove this variable.

In [18]:
df_train = df_train.drop("dropoff_datetime", axis=1)
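To see why this column leaks the target, here is a minimal sketch that rebuilds the duration from the timestamps alone. The `toy` frame is an illustrative assumption: it holds just two rows copied from the `head()` output above, and the result is assumed to be in the same unit (seconds) as `trip_duration`.

```python
import pandas as pd

# Two rows copied from the head() output above (toy frame, not the full dataset)
toy = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-06-12 00:54:38"],
})

pickup = pd.to_datetime(toy["pickup_datetime"])
dropoff = pd.to_datetime(toy["dropoff_datetime"])

# Drop-off minus pick-up gives the elapsed time; convert it to seconds
leaked_duration = (dropoff - pickup).dt.total_seconds()
print(leaked_duration.tolist())  # [455.0, 663.0]
```

No model is needed to obtain these values, which is exactly what makes the column a leak.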

Let's now check each column's data type:

In [19]:
df_train.dtypes

id                     object
vendor_id               int64
pickup_datetime        object
passenger_count         int64
pickup_longitude      float64
pickup_latitude       float64
dropoff_longitude     float64
dropoff_latitude      float64
store_and_fwd_flag     object
trip_duration           int64
pickup_borough         object
dropoff_borough        object
dtype: object

As we can see, the `pickup_datetime` column should be of type datetime but is stored as `object` (string). We can convert it with `pd.to_datetime()`. Using the `.dt` accessor of the resulting datetime column, we can then extract the year, month, weekday (Monday is 0, Sunday is 6) and hour of each ride.

In [20]:
df_train['pickup_datetime'] = pd.to_datetime(df_train['pickup_datetime'])
df_train["year"] = df_train["pickup_datetime"].dt.year
df_train["month"] = df_train["pickup_datetime"].dt.month
df_train["weekday"] = df_train["pickup_datetime"].dt.weekday
df_train["hour"] = df_train["pickup_datetime"].dt.hour

df_train[["pickup_datetime","year","month","weekday","hour"]].head()

| | pickup_datetime | year | month | weekday | hour |
|---|---|---|---|---|---|
| 0 | 2016-03-14 17:24:55 | 2016 | 3 | 0 | 17 |
| 1 | 2016-06-12 00:43:35 | 2016 | 6 | 6 | 0 |
| 2 | 2016-01-19 11:35:24 | 2016 | 1 | 1 | 11 |
| 3 | 2016-04-06 19:32:31 | 2016 | 4 | 2 | 19 |
| 4 | 2016-03-26 13:30:55 | 2016 | 3 | 5 | 13 |
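The `weekday` values are easier to read next to the day names. The sketch below is a small sanity check on two pick-up timestamps copied from the output above; the `check` frame is only illustrative and is not part of the original notebook.

```python
import pandas as pd

# Pick-up timestamps copied from the output above
pickups = pd.to_datetime(pd.Series(["2016-03-14 17:24:55", "2016-06-12 00:43:35"]))

check = pd.DataFrame({
    "weekday": pickups.dt.weekday,      # integer code: Monday=0 ... Sunday=6
    "day_name": pickups.dt.day_name(),  # human-readable day name
})
print(check)
```

The first timestamp falls on a Monday (code 0) and the second on a Sunday (code 6), matching the `weekday` column shown above.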


We can now describe our variables

In [21]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)
df_train.describe()

| | vendor_id | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | trip_duration | year | month | weekday | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 |
| mean | 1.53495 | 1.66453 | -73.97349 | 40.75092 | -73.97342 | 40.7518 | 959.49227 | 2016.0 | 3.51682 | 3.05037 | 13.60648 |
| std | 0.49878 | 1.31424 | 0.0709 | 0.03288 | 0.07064 | 0.03589 | 5237.43172 | 0.0 | 1.68104 | 1.95404 | 6.39969 |
| min | 1.0 | 0.0 | -121.93334 | 34.3597 | -121.9333 | 32.18114 | 1.0 | 2016.0 | 1.0 | 0.0 | 0.0 |
| 25% | 1.0 | 1.0 | -73.99187 | 40.73735 | -73.99133 | 40.73588 | 397.0 | 2016.0 | 2.0 | 1.0 | 9.0 |
| 50% | 2.0 | 1.0 | -73.98174 | 40.7541 | -73.97975 | 40.75452 | 662.0 | 2016.0 | 4.0 | 3.0 | 14.0 |
| 75% | 2.0 | 2.0 | -73.96733 | 40.76836 | -73.96301 | 40.76981 | 1075.0 | 2016.0 | 5.0 | 5.0 | 19.0 |
| max | 2.0 | 9.0 | -61.33553 | 51.88108 | -61.33553 | 43.92103 | 3526282.0 | 2016.0 | 6.0 | 6.0 | 23.0 |


From this description we can observe the following:
- The number of passengers per trip ranges from 0 to 9.
- Trip duration ranges from a minimum of 1 second to a maximum of 3,526,282 seconds, which is roughly 980 hours, or about 41 days. Do these values make sense? It is important to check for outliers.
- The data covers January through June 2016.
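One common way to follow up on the outlier question is Tukey's IQR rule, which flags values lying more than 1.5 interquartile ranges outside the quartiles. The sketch below is an assumption about how such a check could look, not the notebook's own method: `iqr_bounds` is a hypothetical helper, and the toy series reuses only the min, quartiles, and max of `trip_duration` from the `describe()` output rather than the real distribution.

```python
from typing import Tuple

import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy series: min, 25%, 50%, 75% and max of trip_duration from describe()
durations = pd.Series([1, 397, 662, 1075, 3526282])

lower, upper = iqr_bounds(durations)
outliers = durations[(durations < lower) | (durations > upper)]

print(f"fences: [{lower}, {upper}]")
print(f"max duration in hours: {durations.max() / 3600:.1f}")
print(outliers.tolist())
```

On the real data the same helper could be called as `iqr_bounds(df_train["trip_duration"])`. Whether 1.5 is the right multiplier for taxi trips is a judgment call, since legitimate long rides would also be flagged.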

In [22]:
import matplotlib.style 
import matplotlib as mpl 
mpl.style.use('classic')



In [23]:
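# Plotly Express provides `px`; it must be imported before use
import plotly.express as px

fig = px.box(df_train, y="trip_duration", labels={"trip_duration": "Trip duration"}, title="Boxplot: Trip duration")

# Limit the y-axis so that extreme durations do not flatten the box
fig.update_yaxes(range=[0, 10000])

fig.show()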