You have to work on the [NYC Trip Fare](https://www.kaggle.com/api/v1/datasets/download/diishasiing/revenue-for-cab-drivers/archive.zip) dataset. You can skip the store_and_fwd_flag column, but handling it correctly earns a bonus point.

Notes
1. It is mandatory to use GitHub for developing the project.
2. The project must be a Jupyter notebook.
3. There is no restriction on the libraries that can be used, nor on the Python version.
4. All questions on the project must be asked in the Discussion forum on the course website.
5. Each group can have at most 3 students. You must form the groups yourselves; you can use the Discussion forum to find group members.
6. You do not have to send me the project before the discussion.
7. You do not have to prepare any slides for the discussion.

In [None]:
import pandas as pd
import numpy as np


# low_memory=False makes pandas infer each column's dtype from the whole file,
# which should avoid a mixed-type warning for the store_and_fwd_flag column.
df = pd.read_csv('data.csv', low_memory=False)
# Handle the store_and_fwd_flag column by replacing NaN values with 'Unknown' so that
# the column doesn't have mixed types. Assigning the result back replaces the chained
# inplace call, which pandas warns will stop working in pandas 3.0.
df["store_and_fwd_flag"] = df["store_and_fwd_flag"].fillna('Unknown')

df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])


1. Extract all trips with trip_distance larger than 50

In [70]:
df[df["trip_distance"] > 50].head(4)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
23842,2.0,2020-01-01 01:53:07,2020-01-01 03:54:41,1.0,52.3,5.0,N,262,265,1.0,300.0,0.0,0.0,61.78,6.12,0.3,370.7,2.5
39013,2.0,2020-01-01 02:05:07,2020-01-01 03:03:10,1.0,51.23,5.0,N,264,264,1.0,329.0,0.0,0.5,100.78,6.12,0.3,436.7,0.0
41620,1.0,2020-01-01 03:05:54,2020-01-01 04:16:26,1.0,53.8,5.0,N,132,265,1.0,250.0,0.0,0.0,53.35,16.62,0.3,320.27,0.0
58262,2.0,2020-01-01 05:36:12,2020-01-01 06:40:06,1.0,55.23,5.0,N,132,265,2.0,170.0,0.0,0.5,0.0,18.26,0.3,189.06,0.0


2. Extract all trips where payment_type is missing


In [71]:
df[df["payment_type"].isna()].head(4)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
6339567,,2020-01-01 08:51:00,2020-01-01 09:19:00,,13.69,,Unknown,136,232,,51.05,2.75,0.5,0.0,0.0,0.3,54.6,0.0
6339568,,2020-01-01 08:38:43,2020-01-01 08:51:08,,3.42,,Unknown,121,9,,27.06,2.75,0.0,0.0,0.0,0.3,30.11,0.0
6339569,,2020-01-01 08:27:00,2020-01-01 08:32:00,,2.2,,Unknown,197,216,,24.36,2.75,0.5,0.0,0.0,0.3,27.91,0.0
6339570,,2020-01-01 08:46:00,2020-01-01 08:57:00,,0.84,,Unknown,262,236,,26.08,2.75,0.5,0.0,0.0,0.3,29.63,0.0


3. For each (PULocationID, DOLocationID) pair, determine the number of trips

In [72]:
trip_counts = df.groupby(['PULocationID', 'DOLocationID']).size()
trip_counts

PULocationID  DOLocationID
1             1                638
              50                 1
              68                 1
              138                2
              140                1
                              ... 
265           259                2
              261                1
              263                4
              264              317
              265             2508
Length: 31277, dtype: int64

4. Save all rows with missing VendorID, passenger_count, store_and_fwd_flag, payment_type in a new dataframe called bad, and remove those rows from the original dataframe.

In [73]:
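# Select the rows where ANY of these are missing. store_and_fwd_flag was filled with
# 'Unknown' when loading, so its missing values are detected through that placeholder.
cols = ['VendorID', 'passenger_count', 'store_and_fwd_flag', 'payment_type']
bad = df[df[cols].isna().any(axis=1) | df['store_and_fwd_flag'].eq('Unknown')]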
bad


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
6339567,,2020-01-01 08:51:00,2020-01-01 09:19:00,,13.69,,Unknown,136,232,,51.05,2.75,0.5,0.0,0.00,0.3,54.60,0.0
6339568,,2020-01-01 08:38:43,2020-01-01 08:51:08,,3.42,,Unknown,121,9,,27.06,2.75,0.0,0.0,0.00,0.3,30.11,0.0
6339569,,2020-01-01 08:27:00,2020-01-01 08:32:00,,2.20,,Unknown,197,216,,24.36,2.75,0.5,0.0,0.00,0.3,27.91,0.0
6339570,,2020-01-01 08:46:00,2020-01-01 08:57:00,,0.84,,Unknown,262,236,,26.08,2.75,0.5,0.0,0.00,0.3,29.63,0.0
6339571,,2020-01-01 08:21:00,2020-01-01 08:38:00,,7.24,,Unknown,45,142,,25.28,2.75,0.5,0.0,0.00,0.3,28.83,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6405003,,2020-01-31 22:51:00,2020-01-31 23:22:00,,3.24,,Unknown,237,234,,17.59,2.75,0.5,0.0,0.00,0.3,21.14,0.0
6405004,,2020-01-31 22:10:00,2020-01-31 23:26:00,,22.13,,Unknown,259,45,,46.67,2.75,0.5,0.0,12.24,0.3,62.46,0.0
6405005,,2020-01-31 22:50:07,2020-01-31 23:17:57,,10.51,,Unknown,137,169,,48.85,2.75,0.0,0.0,0.00,0.3,51.90,0.0
6405006,,2020-01-31 22:25:53,2020-01-31 22:48:32,,5.49,,Unknown,50,42,,27.17,2.75,0.0,0.0,0.00,0.3,30.22,0.0


In [74]:
df = df.drop(bad.index)
df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.20,1.0,N,238,239,1.0,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.20,1.0,N,239,238,1.0,7.0,3.0,0.5,1.50,0.0,0.3,12.30,2.5
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.60,1.0,N,238,238,1.0,6.0,3.0,0.5,1.00,0.0,0.3,10.80,2.5
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.80,1.0,N,238,151,1.0,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.00,1.0,N,193,193,2.0,3.5,0.5,0.5,0.00,0.0,0.3,4.80,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6339562,2.0,2020-01-31 23:38:07,2020-01-31 23:52:21,1.0,2.10,1.0,N,163,246,1.0,11.0,0.5,0.5,2.96,0.0,0.3,17.76,2.5
6339563,2.0,2020-01-31 23:00:18,2020-01-31 23:19:18,1.0,2.13,1.0,N,164,79,1.0,13.0,0.5,0.5,3.36,0.0,0.3,20.16,2.5
6339564,2.0,2020-01-31 23:24:22,2020-01-31 23:40:39,1.0,2.55,1.0,N,79,68,1.0,12.5,0.5,0.5,3.26,0.0,0.3,19.56,2.5
6339565,2.0,2020-01-31 23:44:22,2020-01-31 23:54:00,1.0,1.61,1.0,N,100,142,2.0,8.5,0.5,0.5,0.00,0.0,0.3,12.30,2.5


5. Add a duration column storing how long each trip has taken (use tpep_pickup_datetime, tpep_dropoff_datetime)

In [75]:
# Both columns were already converted to datetime when the data was loaded.
df['duration'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']

df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,duration
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.2,1.0,N,238,239,1.0,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5,0 days 00:04:48
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.2,1.0,N,239,238,1.0,7.0,3.0,0.5,1.5,0.0,0.3,12.3,2.5,0 days 00:07:25
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.6,1.0,N,238,238,1.0,6.0,3.0,0.5,1.0,0.0,0.3,10.8,2.5,0 days 00:06:11
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.8,1.0,N,238,151,1.0,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0,0 days 00:04:51
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.0,1.0,N,193,193,2.0,3.5,0.5,0.5,0.0,0.0,0.3,4.8,0.0,0 days 00:02:18


6. For each pickup location, determine how many trips have started there.


In [76]:
df.groupby("PULocationID").size()

PULocationID
1         753
2           3
3          70
4        9902
5          39
        ...  
261     34229
262     85591
263    123997
264     43779
265      3090
Length: 260, dtype: int64

7. Cluster the pickup time of the day into 30-minute intervals (e.g. from 02:00 to 02:30)


In [86]:
def get_30min_interval(dt):
    """Return the half-hour slot containing dt, e.g. '02:00 - 02:30'."""
    hour = dt.hour
    if dt.minute < 30:
        return f"{hour:02}:00 - {hour:02}:30"
    # Second half of the hour; the last slot of the day is labelled '23:30 - 24:00'.
    return f"{hour:02}:30 - {hour + 1:02}:00"


df['pickup_interval'] = df['tpep_pickup_datetime'].apply(get_30min_interval)

df[['tpep_pickup_datetime', 'pickup_interval']]

        tpep_pickup_datetime pickup_interval
0        2020-01-01 00:28:15   00:00 - 00:30
1        2020-01-01 00:35:39   00:30 - 01:00
2        2020-01-01 00:47:41   00:30 - 01:00
3        2020-01-01 00:55:23   00:30 - 01:00
4        2020-01-01 00:01:58   00:00 - 00:30
...                      ...             ...
6339562  2020-01-31 23:38:07   23:30 - 24:00
6339563  2020-01-31 23:00:18   23:00 - 23:30
6339564  2020-01-31 23:24:22   23:00 - 23:30
6339565  2020-01-31 23:44:22   23:30 - 24:00
6339566  2020-01-31 23:19:37   23:00 - 23:30

[6339567 rows x 2 columns]
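
Calling `apply` with a Python function on several million rows can be slow. A possible vectorized alternative, shown here only as a sketch on a small made-up frame (`toy` is not part of the dataset), floors each timestamp to its half-hour slot with `dt.floor` and builds the same labels:

```python
import pandas as pd

# Toy pickup times, invented for illustration only.
toy = pd.DataFrame({"tpep_pickup_datetime": pd.to_datetime(
    ["2020-01-01 00:28:15", "2020-01-01 00:35:39", "2020-01-31 23:44:22"])})

# Round each timestamp down to the start of its half-hour slot.
start = toy["tpep_pickup_datetime"].dt.floor("30min")
end = start + pd.Timedelta(minutes=30)

# The last slot of the day ends at midnight of the next day; label it 24:00
# so that the result matches the labels produced by get_30min_interval.
same_day = end.dt.normalize() == start.dt.normalize()
end_label = end.dt.strftime("%H:%M").where(same_day, "24:00")

toy["pickup_interval"] = start.dt.strftime("%H:%M") + " - " + end_label
print(toy)
```

The labels are plain strings, so they sort correctly in chronological order within a day.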


8. For each interval, determine the average number of passengers and the average fare amount.


In [92]:
df.groupby('pickup_interval')[['passenger_count', 'fare_amount']].mean()


pickup_interval,passenger_count,fare_amount
00:00 - 00:30,1.572848,13.526433
00:30 - 01:00,1.584345,13.214132
01:00 - 01:30,1.578933,12.699554
01:30 - 02:00,1.589182,12.265997
02:00 - 02:30,1.587479,12.089669
02:30 - 03:00,1.587687,12.041626
03:00 - 03:30,1.582064,12.500846
03:30 - 04:00,1.585838,13.094785
04:00 - 04:30,1.580261,14.192685
04:30 - 05:00,1.515886,16.409774


9. For each payment type and each interval, determine the average fare amount

In [100]:
df.groupby(['payment_type', 'pickup_interval'])[['fare_amount']].mean()


payment_type,pickup_interval,fare_amount
1.0,00:00 - 00:30,13.869142
1.0,00:30 - 01:00,13.472232
1.0,01:00 - 01:30,12.824603
1.0,01:30 - 02:00,12.357974
1.0,02:00 - 02:30,12.008589
...,...,...
4.0,22:00 - 22:30,1.533326
4.0,22:30 - 23:00,-0.787090
4.0,23:00 - 23:30,-0.351277
4.0,23:30 - 24:00,-2.748432


10. For each payment type, determine the interval when the average fare amount is maximum
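
One possible approach, sketched on a small made-up frame (`toy` and its values are invented; the column names are the ones used in the cells above): compute the average fare per (payment type, interval) pair as in point 9, then use `idxmax` within each payment type to pick the row with the largest average.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "payment_type": [1.0, 1.0, 1.0, 2.0, 2.0],
    "pickup_interval": ["00:00 - 00:30", "00:00 - 00:30", "00:30 - 01:00",
                        "00:00 - 00:30", "00:30 - 01:00"],
    "fare_amount": [10.0, 20.0, 30.0, 8.0, 6.0],
})

# Average fare for each (payment type, interval) pair, kept as regular columns.
avg = toy.groupby(["payment_type", "pickup_interval"], as_index=False)["fare_amount"].mean()

# idxmax returns, for each payment type, the row label of the largest average.
best = avg.loc[avg.groupby("payment_type")["fare_amount"].idxmax()]
print(best)
```

If two intervals tie for the maximum, `idxmax` keeps only the first one it encounters.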


11. For each payment type, determine the interval when the overall ratio between the tip and the fare amounts is maximum
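
A sketch of one way to read this point, assuming that "overall ratio" means the sum of all tips divided by the sum of all fares in each (payment type, interval) group, rather than the mean of per-trip ratios. The frame `toy` is invented, and dropping groups with a non-positive total fare is an extra assumption made to avoid division by zero.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "payment_type": [1.0, 1.0, 1.0, 2.0, 2.0],
    "pickup_interval": ["00:00 - 00:30", "00:00 - 00:30", "00:30 - 01:00",
                        "00:00 - 00:30", "00:30 - 01:00"],
    "tip_amount": [2.0, 4.0, 9.0, 0.0, 3.0],
    "fare_amount": [10.0, 20.0, 30.0, 8.0, 6.0],
})

# Total tip and total fare for each (payment type, interval) pair.
sums = toy.groupby(["payment_type", "pickup_interval"],
                   as_index=False)[["tip_amount", "fare_amount"]].sum()

# Assumption: ignore groups whose total fare is zero or negative.
sums = sums[sums["fare_amount"] > 0].copy()
sums["tip_ratio"] = sums["tip_amount"] / sums["fare_amount"]

# For each payment type, keep the interval with the largest overall ratio.
best_ratio = sums.loc[sums.groupby("payment_type")["tip_ratio"].idxmax()]
print(best_ratio)
```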

12. Find the location with the highest average fare amount
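
The task does not say whether "location" means the pickup or the dropoff location. The sketch below assumes the pickup location (`PULocationID`); the same pattern applies to `DOLocationID`. The frame `toy` is invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "PULocationID": [132, 132, 238, 238, 238],
    "fare_amount": [50.0, 60.0, 6.0, 7.0, 5.5],
})

# Average fare per pickup location, then the location with the largest average.
avg_by_location = toy.groupby("PULocationID")["fare_amount"].mean()
top_location = avg_by_location.idxmax()
top_fare = avg_by_location.max()
print(top_location, top_fare)
```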

13. Build a new dataframe (called common) where, for each pickup location we keep all trips to the 5 most common destinations (i.e. each pickup location can have different common destinations).
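
A possible sketch: count the trips for each (pickup, dropoff) pair, keep the most frequent destinations of each pickup location, and filter the trips with an inner merge. The helper `keep_common` and the frame `toy` are hypothetical names introduced here; the demo uses `n=2` only so that the filtering is visible on such a small frame.

```python
import pandas as pd

def keep_common(trips, n=5):
    """Keep, for each pickup location, only the trips to its n most frequent destinations."""
    pair_counts = (trips.groupby(["PULocationID", "DOLocationID"])
                        .size()
                        .reset_index(name="n_trips"))
    # After sorting by count, head(n) keeps the n most frequent pairs of each pickup location.
    top = (pair_counts.sort_values("n_trips", ascending=False)
                      .groupby("PULocationID")
                      .head(n)[["PULocationID", "DOLocationID"]])
    # The inner merge keeps only trips whose (pickup, dropoff) pair is in `top`.
    return trips.merge(top, on=["PULocationID", "DOLocationID"], how="inner")

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "PULocationID": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "DOLocationID": [10, 10, 10, 20, 20, 30, 40, 40, 10],
})
common = keep_common(toy, n=2)
print(common)
```

Destinations with the same count at the cutoff are kept or dropped in an arbitrary order, and `merge` returns a frame with a fresh index rather than the original row labels.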

14. On the common dataframe, for each payment type and each interval, determine the average fare amount

15. Compute the difference between the average fare amounts computed in the previous point and those computed in point 9.

16. Compute the ratio between the differences computed in the previous point and the average fare amounts computed in point 9. Note: you have to compute a ratio for each pair (payment type, interval).
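
Points 14 to 16 can be sketched together, because pandas aligns two Series on their index before doing arithmetic. The sketch assumes the difference is taken as (average on common) minus (average on all trips); `toy` and `toy_common` are invented stand-ins for `df` and `common`.

```python
import pandas as pd

keys = ["payment_type", "pickup_interval"]

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "payment_type": [1.0, 1.0, 1.0, 2.0, 2.0],
    "pickup_interval": ["00:00 - 00:30", "00:00 - 00:30", "00:30 - 01:00",
                        "00:00 - 00:30", "00:00 - 00:30"],
    "fare_amount": [10.0, 20.0, 30.0, 8.0, 12.0],
})
toy_common = toy.iloc[[0, 2, 3]]  # stand-in for the common dataframe

avg_all = toy.groupby(keys)["fare_amount"].mean()            # point 9
avg_common = toy_common.groupby(keys)["fare_amount"].mean()  # point 14

# Point 15: subtraction aligns on (payment_type, pickup_interval);
# pairs missing from one side become NaN.
diff = avg_common - avg_all

# Point 16: one ratio for each (payment type, interval) pair.
ratio = diff / avg_all
print(pd.DataFrame({"diff": diff, "ratio": ratio}))
```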

17. Build chains of trips. Two trips are consecutive in a chain if (a) they have the same VendorID, (b) the pickup location of the second trip is also the dropoff location of the first trip, (c) the pickup time of the second trip is after the dropoff time of the first trip, and (d) the pickup time of the second trip is at most 2 minutes later than the dropoff time of the first trip.

Hint: Add a column chain to the dataset. A chain can have more than two trips.
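
One possible way to build the chains, shown as a sketch under stated assumptions rather than a definitive solution: `pd.merge_asof` finds, for each trip, the most recent earlier dropoff with the same VendorID at the trip's pickup location, no more than 2 minutes before the pickup. Each trip then inherits the chain of that predecessor. The frame `toy` and the helper names (`trip_id`, `prev_id`, `loc`, `t`) are invented. The sketch assumes every trip has a non-negative duration, and if several trips link to the same predecessor they all end up in the same chain.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "VendorID": [1, 1, 1, 2],
    "tpep_pickup_datetime": pd.to_datetime([
        "2020-01-01 10:00:00", "2020-01-01 10:11:00",
        "2020-01-01 10:31:30", "2020-01-01 10:11:00"]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2020-01-01 10:10:00", "2020-01-01 10:30:00",
        "2020-01-01 10:45:00", "2020-01-01 10:20:00"]),
    "PULocationID": [100, 200, 300, 200],
    "DOLocationID": [200, 300, 400, 300],
})

trips = toy.rename_axis("trip_id").reset_index()

# "Second" trips are matched on their pickup time and location ...
seconds = (trips[["trip_id", "VendorID", "PULocationID", "tpep_pickup_datetime"]]
           .rename(columns={"PULocationID": "loc", "tpep_pickup_datetime": "t"})
           .sort_values("t"))
# ... against "first" trips, described by their dropoff time and location.
firsts = (trips[["trip_id", "VendorID", "DOLocationID", "tpep_dropoff_datetime"]]
          .rename(columns={"trip_id": "prev_id", "DOLocationID": "loc",
                           "tpep_dropoff_datetime": "t"})
          .sort_values("t"))

# Conditions (a) and (b) come from `by`, (c) from allow_exact_matches=False
# with a backward search, and (d) from the 2-minute tolerance.
links = pd.merge_asof(seconds, firsts, on="t", by=["VendorID", "loc"],
                      tolerance=pd.Timedelta(minutes=2),
                      allow_exact_matches=False, direction="backward")

# Walk the trips in pickup order: a trip with a predecessor joins its chain,
# any other trip starts a new chain.
prev = dict(zip(links["trip_id"], links["prev_id"]))
chain = {}
next_chain = 0
for trip_id in links["trip_id"]:
    p = prev[trip_id]
    if pd.notna(p) and int(p) in chain:
        chain[trip_id] = chain[int(p)]
    else:
        chain[trip_id] = next_chain
        next_chain += 1

toy["chain"] = toy.index.map(chain)
print(toy[["VendorID", "PULocationID", "DOLocationID", "chain"]])
```

The Python loop is easy to follow but slow on millions of rows; it is kept here only because it makes the chain propagation explicit.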