# Challenge - Week 2

The dataset ***taxi_trip.csv*** contains the total ride duration of taxi trips in New York City. 

Path: /dbfs/FileStore/CDS2024/taxi_trip.csv

Below you can find the description of each column in the dataset:

* **id** - a unique identifier for each trip
* **vendor_id** - a code indicating the provider associated with the trip record
* **pickup_datetime** - date and time when the meter was engaged
* **dropoff_datetime** - date and time when the meter was disengaged
* **passenger_count** - the number of passengers in the vehicle (driver-entered value)
* **pickup_longitude** - the longitude where the meter was engaged
* **pickup_latitude** - the latitude where the meter was engaged
* **dropoff_longitude** - the longitude where the meter was disengaged
* **dropoff_latitude** - the latitude where the meter was disengaged
* **store_and_fwd_flag** - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* **trip_duration** - duration of the trip in seconds

## Goals: Create new features from existing columns in the dataset and plot some analysis

The new features you will need to create are:

* Datetime features
    * The **month** when a trip started
    * The **hour** when a trip started
    * The **week of the year** when a trip started
    * The **day of the year** when a trip started
    * The **day of the week** when a trip started
    * Whether the trip started on a **US holiday**


* Coordinates features
    * The **distances** from/to two nearby airports
    * The **distance** of a trip
    * The **speed** (in meters per second) of a trip
    * The **bearing** (in degrees) of a trip - Hint: https://mapscaping.com/how-to-calculate-bearing-between-two-coordinates/

**Tip:** Use geopy library to calculate the distance between two coordinates

### After creating the new features, do the analysis below:

* Check the average time taken by two different vendors vs weekday
* Check the distributions between passenger_count and trip_duration for each vendor_id
* Check the average speed by vendor per day of the week
* Check the average speed by vendor per hour of the day
* Check the distribution of average speed per hour of a day and day of a week

**Tip:** Use the next two cells to install/import useful libraries for this challenge.

In [0]:
%pip install -U geopy

In [0]:
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
from geopy import distance
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Let's start

Read the dataset with pandas. Path: /dbfs/FileStore/CDS2023/taxi_trip.csv

In [0]:
taxi = pd.read_csv('/dbfs/FileStore/CDS2023/taxi_trip.csv')

In [0]:
taxi.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435


In [0]:
taxi.dtypes

In [0]:
taxi['pickup_datetime'] = pd.to_datetime(taxi['pickup_datetime'])
taxi['dropoff_datetime'] = pd.to_datetime(taxi['dropoff_datetime'])

# Create the new features

## Datetime Features

**Feature**: The month when a trip started

In [0]:
taxi['trip_started_month'] = taxi['pickup_datetime'].dt.month
taxi.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,trip_started_month
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455,3
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663,6
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,1
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429,4
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435,3


**Feature**: The hour when a trip started

In [0]:
taxi['trip_started_hour'] = taxi['pickup_datetime'].dt.hour
taxi.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,trip_started_month,trip_started_hour
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455,3,17
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663,6,0
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,1,11
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429,4,19
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435,3,13


**Feature**: The week of the year when a trip started

In [0]:
# Series.dt.week was removed in pandas 2.0; isocalendar().week returns the ISO week number
taxi['trip_started_week'] = taxi['pickup_datetime'].dt.isocalendar().week
taxi.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,trip_started_month,trip_started_hour,trip_started_week
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455,3,17,11
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663,6,0,23
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,1,11,3
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429,4,19,14
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435,3,13,12


**Feature**: The day of the year when a trip started

In [0]:
# dt.dayofyear gives the day of the year (1-366); dt.day would give the day of the month
taxi['trip_started_day'] = taxi['pickup_datetime'].dt.dayofyear
taxi.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,trip_started_month,trip_started_hour,trip_started_week,trip_started_day
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455,3,17,11,74
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663,6,0,23,164
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,1,11,3,19
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429,4,19,14,97
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435,3,13,12,86


**Feature**: The day of the week when a trip started

In [0]:
# dt.weekday encodes Monday as 0 and Sunday as 6
taxi['trip_started_weekday'] = taxi['pickup_datetime'].dt.weekday
taxi.head(10)

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,trip_started_month,trip_started_hour,trip_started_week,trip_started_day,trip_started_weekday
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455,3,17,11,74,0
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663,6,0,23,164,6
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,1,11,3,19,1
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429,4,19,14,97,2
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435,3,13,12,86,5
5,id0801584,2,2016-01-30 22:01:40,2016-01-30 22:09:03,6,-73.982857,40.742195,-73.992081,40.749184,N,443,1,22,4,30,5
6,id1813257,1,2016-06-17 22:34:59,2016-06-17 22:40:40,4,-73.969017,40.757839,-73.957405,40.765896,N,341,6,22,24,169,4
7,id1324603,2,2016-05-21 07:54:58,2016-05-21 08:20:49,1,-73.969276,40.797779,-73.92247,40.760559,N,1551,5,7,20,142,5
8,id1301050,1,2016-05-27 23:12:23,2016-05-27 23:16:38,1,-73.999481,40.7384,-73.985786,40.732815,N,255,5,23,21,148,4
9,id0012891,2,2016-03-10 21:45:01,2016-03-10 22:05:26,1,-73.981049,40.744339,-73.973,40.789989,N,1225,3,21,10,70,3


In [0]:
taxi['pickup_datetime'].describe()

**Feature**: Whether the trip started on a US holiday

In [0]:
%pip install holidays

In [0]:
import holidays

# Select country
us_holidays = holidays.US()

# Membership test: True if the date is a holiday, False otherwise
print('01-01-2016' in us_holidays)
print('06-30-2016' in us_holidays)

# Which holiday is it? get() returns the holiday name, or None for a regular day
print(us_holidays.get('01-01-2016'))
print(us_holidays.get('06-30-2016'))

In [0]:
# Print all US holidays in 2016 as (date, name) pairs
for holiday in holidays.US(years=2016).items():
    print(holiday)
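
The cells above only inspect the holiday calendar; the feature column itself is not created yet. One possible way to build it is sketched below, using the `USFederalHolidayCalendar` class from pandas instead of the `holidays` package, so it covers federal holidays only. The mini-frame `sample`, the helper `add_holiday_flag`, and the column name `trip_started_holiday` are hypothetical names chosen for illustration.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Hypothetical mini-frame standing in for `taxi`
sample = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-01-01 10:00:00",  # New Year's Day
        "2016-03-14 17:24:55",  # an ordinary Monday
        "2016-07-04 09:00:00",  # Independence Day
    ])
})


def add_holiday_flag(df, column="pickup_datetime"):
    """Return a copy of df with a boolean column marking US federal holidays."""
    days = df[column].dt.normalize()  # drop the time of day, keep the date
    federal_holidays = USFederalHolidayCalendar().holidays(
        start=days.min(), end=days.max()
    )
    out = df.copy()
    out["trip_started_holiday"] = days.isin(federal_holidays)
    return out


flagged = add_holiday_flag(sample)
print(flagged)
```

On the real data the same helper could be applied as `taxi = add_holiday_flag(taxi)`, assuming `pickup_datetime` has already been converted with `pd.to_datetime`.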

## Coordinates Features

In [0]:
taxi.info()

**Feature**: The distance of a trip

In [0]:
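# Geodesic distance in meters between pickup and dropoff (row-wise apply, so slow on large frames)
taxi['distance'] = taxi.apply(lambda row: distance.geodesic((row['pickup_latitude'], row['pickup_longitude']), (row['dropoff_latitude'], row['dropoff_longitude'])).meters, axis=1)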
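
As a faster alternative to calling geopy row by row, the haversine formula can be vectorized with NumPy. The sketch below is not the geopy geodesic: it assumes a spherical Earth with a mean radius of 6,371 km, so results differ slightly. It also covers the airport distances from the goals list. The airport coordinates are approximate values assumed here for JFK and LaGuardia, and all new column names are hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical model, an assumption)

# Approximate (latitude, longitude) of two NYC airports; assumed values, not from the dataset
JFK = (40.6413, -73.7781)
LAGUARDIA = (40.7769, -73.8740)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; accepts scalars or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# First two rows of the dataset, used as a small stand-in for `taxi`
sample = pd.DataFrame({
    "pickup_latitude": [40.767937, 40.738564],
    "pickup_longitude": [-73.982155, -73.980415],
    "dropoff_latitude": [40.765602, 40.731152],
    "dropoff_longitude": [-73.96463, -73.999481],
})

# Trip distance from pickup to dropoff
sample["distance_haversine"] = haversine_m(
    sample["pickup_latitude"], sample["pickup_longitude"],
    sample["dropoff_latitude"], sample["dropoff_longitude"],
)

# Distance from the pickup and to the dropoff for each airport
for name, (lat, lon) in {"jfk": JFK, "lga": LAGUARDIA}.items():
    sample[f"pickup_{name}_distance"] = haversine_m(
        sample["pickup_latitude"], sample["pickup_longitude"], lat, lon
    )
    sample[f"dropoff_{name}_distance"] = haversine_m(
        sample["dropoff_latitude"], sample["dropoff_longitude"], lat, lon
    )

print(sample.round(1))
```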

**Feature**: The speed (in meters per second) of a trip
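
A minimal sketch of the speed feature, assuming a `distance` column that is already in meters and the `trip_duration` column in seconds. The sample values and the helper name `speed_m_per_s` are hypothetical. Trips with a non-positive duration are mapped to NaN rather than producing infinite speeds.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: distances in meters, durations in seconds
sample = pd.DataFrame({
    "distance": [1500.0, 0.0, 800.0],
    "trip_duration": [455, 60, 0],
})


def speed_m_per_s(distance_m, duration_s):
    """Average speed in meters per second; NaN where the duration is not positive."""
    valid_duration = duration_s.where(duration_s > 0)  # non-positive durations become NaN
    return distance_m / valid_duration


sample["speed"] = speed_m_per_s(sample["distance"], sample["trip_duration"])
print(sample)
```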

**Feature**: The bearing (in degrees) of a trip
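
The initial bearing between two coordinates can be computed with the standard forward-azimuth formula, sketched below with NumPy so that it also works on whole columns. The function name `bearing_deg` is a hypothetical choice; the result is normalized to the range [0, 360), with 0 meaning north and 90 meaning east.

```python
import numpy as np


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing in degrees from point 1 to point 2, in [0, 360)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    x = np.sin(dlon) * np.cos(lat2)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    # arctan2 returns an angle in (-180, 180]; shift it into [0, 360)
    return (np.degrees(np.arctan2(x, y)) + 360) % 360


print(bearing_deg(0.0, 0.0, 1.0, 0.0))   # heading due north
print(bearing_deg(0.0, 0.0, 0.0, 1.0))   # heading due east
```

With the dataset loaded, a call such as `bearing_deg(taxi['pickup_latitude'], taxi['pickup_longitude'], taxi['dropoff_latitude'], taxi['dropoff_longitude'])` would return one bearing per trip.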

## Analysis

Check the average time taken by two different vendors vs weekday
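
One way to approach this check is to group by weekday and vendor and average `trip_duration`. The sketch below uses the ten rows shown by `taxi.head(10)` as a stand-in for the full frame; the variable names are hypothetical.

```python
import pandas as pd

# vendor_id, weekday and trip_duration of the ten rows shown by taxi.head(10)
sample = pd.DataFrame({
    "vendor_id": [2, 1, 2, 2, 2, 2, 1, 2, 1, 2],
    "trip_started_weekday": [0, 6, 1, 2, 5, 5, 4, 5, 4, 3],
    "trip_duration": [455, 663, 2124, 429, 435, 443, 341, 1551, 255, 1225],
})

# Rows: weekday (0 = Monday), columns: vendor, values: mean duration in seconds
avg_duration = (
    sample.groupby(["trip_started_weekday", "vendor_id"])["trip_duration"]
    .mean()
    .unstack("vendor_id")
)
print(avg_duration)
```

The resulting table has one line per vendor when plotted, for example with `avg_duration.plot()`; weekdays on which a vendor has no trips appear as NaN.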

Check the distributions between passenger_count and trip_duration for each vendor_id
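
Before plotting, a numeric summary per vendor and passenger count can show how the durations are distributed. The sketch below again uses the ten rows from `taxi.head(10)`; the choice of statistics is an assumption and can be adapted.

```python
import pandas as pd

# vendor_id, passenger_count and trip_duration of the ten rows shown by taxi.head(10)
sample = pd.DataFrame({
    "vendor_id": [2, 1, 2, 2, 2, 2, 1, 2, 1, 2],
    "passenger_count": [1, 1, 1, 1, 1, 6, 4, 1, 1, 1],
    "trip_duration": [455, 663, 2124, 429, 435, 443, 341, 1551, 255, 1225],
})

# Trip duration statistics for every (vendor, passenger count) pair
summary = (
    sample.groupby(["vendor_id", "passenger_count"])["trip_duration"]
    .agg(["count", "median", "mean"])
)
print(summary)
```

For a visual version, a box plot with `passenger_count` on the x-axis, `trip_duration` on the y-axis and one panel per `vendor_id` is a common choice.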

Check the average speed by vendor per day of the week

Check the average speed by vendor per hour of the day
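
This check and the previous one differ only in the time column, so a single helper can serve both. The sketch assumes a `speed` column already exists; the helper name `mean_speed_by` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with a precomputed speed column (meters per second)
sample = pd.DataFrame({
    "vendor_id": [1, 1, 2, 2, 2],
    "trip_started_weekday": [0, 0, 0, 1, 1],
    "trip_started_hour": [8, 9, 8, 8, 8],
    "speed": [4.0, 6.0, 5.0, 3.0, 7.0],
})


def mean_speed_by(df, time_column):
    """Mean speed per value of time_column (rows) and vendor (columns)."""
    return df.pivot_table(
        index=time_column, columns="vendor_id", values="speed", aggfunc="mean"
    )


by_weekday = mean_speed_by(sample, "trip_started_weekday")
by_hour = mean_speed_by(sample, "trip_started_hour")
print(by_weekday)
print(by_hour)
```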

Check the distribution of average speed per hour of a day and day of a week
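
A possible starting point is an hour-by-weekday table of mean speeds, which can then be drawn as a heatmap. The sketch assumes a `speed` column already exists and reindexes the table to the full 24 x 7 grid so that empty combinations show up as NaN; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with a precomputed speed column (meters per second)
sample = pd.DataFrame({
    "trip_started_weekday": [0, 0, 5, 5],
    "trip_started_hour": [17, 17, 13, 22],
    "speed": [3.0, 5.0, 2.0, 6.0],
})

# Rows: hour of the day (0-23), columns: weekday (0 = Monday), values: mean speed
speed_grid = (
    sample.pivot_table(
        index="trip_started_hour",
        columns="trip_started_weekday",
        values="speed",
        aggfunc="mean",
    )
    .reindex(index=range(24), columns=range(7))
)
print(speed_grid.dropna(how="all"))
```

Passing `speed_grid` to `sns.heatmap` gives a quick visual of when traffic is slowest.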