### Project Setup

Before running the rest of the notebook, run its first two cells:  
- The first cell installs all the dependencies listed in `requirements.txt`.  
- The second cell runs a script to download and extract the dataset into the `data` folder.

These steps need to be performed only once.

**Note**: This project is developed using Python version `3.12.1`.


In [1]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
!python scripts/download_data.py

Downloading data from https://www.kaggle.com/api/v1/datasets/download/diishasiing/revenue-for-cab-drivers/...
Download successful.
Files in archive: ['data.csv']
Extracting data.csv...
data.csv extracted to data.
Data is ready at data\data.csv


### Project Execution

I import the `pandas` library and define constants for handling missing values (`FILL_STRING`) and specifying the dataset path (`DATA_PATH`).

In [None]:
import pandas as pd
FILL_STRING = "N/A"
DATA_PATH = 'data/data.csv'

I load the dataset from the specified `DATA_PATH` and display it; pandas shows a truncated preview with the first and last rows.

In [4]:
df = pd.read_csv(DATA_PATH)
df



Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.20,1.0,N,238,239,1.0,6.00,3.00,0.5,1.47,0.00,0.3,11.27,2.5
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.20,1.0,N,239,238,1.0,7.00,3.00,0.5,1.50,0.00,0.3,12.30,2.5
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.60,1.0,N,238,238,1.0,6.00,3.00,0.5,1.00,0.00,0.3,10.80,2.5
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.80,1.0,N,238,151,1.0,5.50,0.50,0.5,1.36,0.00,0.3,8.16,0.0
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.00,1.0,N,193,193,2.0,3.50,0.50,0.5,0.00,0.00,0.3,4.80,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6405003,,2020-01-31 22:51:00,2020-01-31 23:22:00,,3.24,,,237,234,,17.59,2.75,0.5,0.00,0.00,0.3,21.14,0.0
6405004,,2020-01-31 22:10:00,2020-01-31 23:26:00,,22.13,,,259,45,,46.67,2.75,0.5,0.00,12.24,0.3,62.46,0.0
6405005,,2020-01-31 22:50:07,2020-01-31 23:17:57,,10.51,,,137,169,,48.85,2.75,0.0,0.00,0.00,0.3,51.90,0.0
6405006,,2020-01-31 22:25:53,2020-01-31 22:48:32,,5.49,,,50,42,,27.17,2.75,0.0,0.00,0.00,0.3,30.22,0.0


The dataset contains **6,405,008 rows** and **18 columns**. However, a `DtypeWarning` is raised, indicating that column 6 (`store_and_fwd_flag`) has mixed data types. To understand the issue, I investigate the unique values and their counts in this column using the `value_counts` method.







In [5]:
count_values = df["store_and_fwd_flag"].value_counts(dropna=False)

# Print the results
print(count_values)

store_and_fwd_flag
N      6271447
Y        68120
NaN      65441
Name: count, dtype: int64


The `store_and_fwd_flag` column contains three distinct values:
- `'N'`: 6,271,447 rows
- `'Y'`: 68,120 rows
- `NaN`: 65,441 rows

The presence of `NaN` values is likely the cause of the `DtypeWarning`, as it introduces mixed data types in the column. To address this, I fill the missing values with the constant `FILL_STRING` (`"N/A"`). After filling, I recheck the value counts to confirm the update.

In [6]:
df['store_and_fwd_flag'] = df['store_and_fwd_flag'].fillna(FILL_STRING)
count_values = df["store_and_fwd_flag"].value_counts(dropna=False)

# Print the results
print(count_values)

store_and_fwd_flag
N      6271447
Y        68120
N/A      65441
Name: count, dtype: int64
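
As an alternative to filling values after loading, a `DtypeWarning` of this kind can usually be avoided at read time by declaring the column's dtype, which is also what the warning message itself suggests. The sketch below uses a tiny in-memory CSV with made-up rows (not taken from the real dataset) and the nullable `"string"` dtype; both are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for data.csv (hypothetical rows, not from the real dataset).
csv_text = "VendorID,store_and_fwd_flag\n1,N\n2,Y\n,\n"

# Declaring the dtype up front means pandas does not have to infer it.
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"store_and_fwd_flag": "string"},
)

# The empty field is still reported as missing.
print(sample["store_and_fwd_flag"].isna().sum())
```

With this approach the missing entries stay visible to `.isna()`, whereas a placeholder string such as `FILL_STRING` is treated as a regular value.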


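I calculate the completeness of each column, that is, the percentage of non-missing values. This helps identify missing data and assess its impact on the overall quality of the dataset. Note that `store_and_fwd_flag` reports 100% only because its missing values were replaced with `FILL_STRING` in the previous step.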







In [7]:
# Compute the completeness of each column
completeness = (df.notnull().sum() / len(df)) * 100

# Convert to a DataFrame for a clearer display
completeness_df = completeness.reset_index()
completeness_df.columns = ["Column", "Completeness (%)"]

# Print the results
print(completeness_df)


                   Column  Completeness (%)
0                VendorID         98.978284
1    tpep_pickup_datetime        100.000000
2   tpep_dropoff_datetime        100.000000
3         passenger_count         98.978284
4           trip_distance        100.000000
5              RatecodeID         98.978284
6      store_and_fwd_flag        100.000000
7            PULocationID        100.000000
8            DOLocationID        100.000000
9            payment_type         98.978284
10            fare_amount        100.000000
11                  extra        100.000000
12                mta_tax        100.000000
13             tip_amount        100.000000
14           tolls_amount        100.000000
15  improvement_surcharge        100.000000
16           total_amount        100.000000
17   congestion_surcharge        100.000000
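
The same percentage can be obtained more compactly, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a toy frame (the column names `a` and `b` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, None, 3.0, 4.0],   # one missing value out of four
    "b": ["x", "y", "z", "w"],    # no missing values
})

# notna() gives a boolean frame; its column-wise mean is the share of non-missing cells.
completeness = toy.notna().mean() * 100

print(completeness.to_dict())  # {'a': 75.0, 'b': 100.0}
```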


#### 1. Extract all trips with trip_distance larger than 50



I use a boolean mask (`df["trip_distance"] > 50`) to filter rows where the `trip_distance` exceeds 50.


In [8]:
long_distance_trips = df[df["trip_distance"] > 50]
long_distance_trips.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
23842,2.0,2020-01-01 01:53:07,2020-01-01 03:54:41,1.0,52.3,5.0,N,262,265,1.0,300.0,0.0,0.0,61.78,6.12,0.3,370.7,2.5
39013,2.0,2020-01-01 02:05:07,2020-01-01 03:03:10,1.0,51.23,5.0,N,264,264,1.0,329.0,0.0,0.5,100.78,6.12,0.3,436.7,0.0
41620,1.0,2020-01-01 03:05:54,2020-01-01 04:16:26,1.0,53.8,5.0,N,132,265,1.0,250.0,0.0,0.0,53.35,16.62,0.3,320.27,0.0
58262,2.0,2020-01-01 05:36:12,2020-01-01 06:40:06,1.0,55.23,5.0,N,132,265,2.0,170.0,0.0,0.5,0.0,18.26,0.3,189.06,0.0
63024,2.0,2020-01-01 07:40:30,2020-01-01 08:40:01,1.0,54.19,5.0,N,132,265,1.0,230.0,0.0,0.0,0.0,12.24,0.3,242.54,0.0


#### 2. Extract all trips where payment_type is missing



I use the `.isna()` method to create a boolean mask that identifies rows where the `payment_type` column has missing (`NaN`) values. The mask is then applied to filter the dataset.


In [9]:
missing_payment_trips = df[df["payment_type"].isna()]

missing_payment_trips.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
6339567,,2020-01-01 08:51:00,2020-01-01 09:19:00,,13.69,,,136,232,,51.05,2.75,0.5,0.0,0.0,0.3,54.6,0.0
6339568,,2020-01-01 08:38:43,2020-01-01 08:51:08,,3.42,,,121,9,,27.06,2.75,0.0,0.0,0.0,0.3,30.11,0.0
6339569,,2020-01-01 08:27:00,2020-01-01 08:32:00,,2.2,,,197,216,,24.36,2.75,0.5,0.0,0.0,0.3,27.91,0.0
6339570,,2020-01-01 08:46:00,2020-01-01 08:57:00,,0.84,,,262,236,,26.08,2.75,0.5,0.0,0.0,0.3,29.63,0.0
6339571,,2020-01-01 08:21:00,2020-01-01 08:38:00,,7.24,,,45,142,,25.28,2.75,0.5,0.0,0.0,0.3,28.83,0.0


#### 3. For each (PULocationID, DOLocationID) pair, determine the number of trips







I use the `.groupby()` method to group the dataset by `PULocationID` and `DOLocationID`. The `.size()` method calculates the number of trips for each unique pair, and `.reset_index(name="trip_count")` converts the result into a DataFrame with a new column, `trip_count`, that stores the counts.


In [10]:
trips_per_location_pair = df.groupby(["PULocationID", "DOLocationID"]).size().reset_index(name="trip_count")

trips_per_location_pair.head()

Unnamed: 0,PULocationID,DOLocationID,trip_count
0,1,1,638
1,1,50,1
2,1,68,1
3,1,138,2
4,1,140,1


#### 4. Save all rows with missing VendorID, passenger_count, store_and_fwd_flag, payment_type in a new dataframe called bad, and remove those rows from the original dataframe.



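I use `.isna().any(axis=1)` to create a boolean mask that identifies rows where at least one value is missing (`NaN`) in the specified columns: `VendorID`, `passenger_count`, `store_and_fwd_flag`, and `payment_type`. Using this mask, I extract these rows into a new DataFrame called `bad`. Because the missing values in `store_and_fwd_flag` were already replaced with `FILL_STRING` in an earlier cell, that column no longer contains `NaN`; the affected rows are still caught here because, judging by the identical missing counts, they also lack `VendorID`, `passenger_count`, and `payment_type`.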

I then remove these rows from the original DataFrame using `.drop(bad.index)`.

In [11]:
bad = df[df[["VendorID", "passenger_count", "store_and_fwd_flag", "payment_type"]].isna().any(axis=1)]

df = df.drop(bad.index)
bad.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
6339567,,2020-01-01 08:51:00,2020-01-01 09:19:00,,13.69,,,136,232,,51.05,2.75,0.5,0.0,0.0,0.3,54.6,0.0
6339568,,2020-01-01 08:38:43,2020-01-01 08:51:08,,3.42,,,121,9,,27.06,2.75,0.0,0.0,0.0,0.3,30.11,0.0
6339569,,2020-01-01 08:27:00,2020-01-01 08:32:00,,2.2,,,197,216,,24.36,2.75,0.5,0.0,0.0,0.3,27.91,0.0
6339570,,2020-01-01 08:46:00,2020-01-01 08:57:00,,0.84,,,262,236,,26.08,2.75,0.5,0.0,0.0,0.3,29.63,0.0
6339571,,2020-01-01 08:21:00,2020-01-01 08:38:00,,7.24,,,45,142,,25.28,2.75,0.5,0.0,0.0,0.3,28.83,0.0
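
The effect of a placeholder on `.isna()` can be seen on a toy frame. The sketch below is illustrative only: the rows are made up, and treating `FILL_STRING` as missing via `.eq(FILL_STRING)` is one possible way to handle it, not something the notebook does.

```python
import pandas as pd

FILL_STRING = "N/A"  # same placeholder value as in the notebook

toy = pd.DataFrame({
    "VendorID": [1.0, None, 2.0],
    "store_and_fwd_flag": ["N", None, None],
})
toy["store_and_fwd_flag"] = toy["store_and_fwd_flag"].fillna(FILL_STRING)

# After fillna, isna() only sees the row where VendorID is missing.
nan_only = toy[["VendorID", "store_and_fwd_flag"]].isna().any(axis=1)

# Treating the placeholder as missing also recovers the third row.
with_placeholder = nan_only | toy["store_and_fwd_flag"].eq(FILL_STRING)

print(nan_only.tolist())          # [False, True, False]
print(with_placeholder.tolist())  # [False, True, True]
```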


#### 5. Add a duration column storing how long each trip has taken (use tpep_pickup_datetime, tpep_dropoff_datetime)



I ensure that the `tpep_pickup_datetime` and `tpep_dropoff_datetime` columns are in datetime format using `pd.to_datetime`. This step is necessary for performing datetime operations.

I then calculate the trip duration by subtracting the pickup time (`tpep_pickup_datetime`) from the dropoff time (`tpep_dropoff_datetime`). The resulting timedelta is converted to total seconds using `.dt.total_seconds()` and divided by 60 to express the duration in minutes. 

Finally, I create a new column, `duration`, in the DataFrame to store the calculated values.

In [12]:

df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])


df["duration"] = (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]).dt.total_seconds() / 60

df[["tpep_pickup_datetime", "tpep_dropoff_datetime", "duration"]].head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,duration
0,2020-01-01 00:28:15,2020-01-01 00:33:03,4.8
1,2020-01-01 00:35:39,2020-01-01 00:43:04,7.416667
2,2020-01-01 00:47:41,2020-01-01 00:53:52,6.183333
3,2020-01-01 00:55:23,2020-01-01 01:00:14,4.85
4,2020-01-01 00:01:58,2020-01-01 00:04:16,2.3
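
As a worked check, the first row above spans 4 minutes and 48 seconds, i.e. 288 seconds, which gives 288 / 60 = 4.8 minutes. A minimal sketch reproducing that single row:

```python
import pandas as pd

trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2020-01-01 00:28:15"],
    "tpep_dropoff_datetime": ["2020-01-01 00:33:03"],
})
pickup = pd.to_datetime(trips["tpep_pickup_datetime"])
dropoff = pd.to_datetime(trips["tpep_dropoff_datetime"])

elapsed = dropoff - pickup                   # timedelta of 4 min 48 s
minutes = elapsed.dt.total_seconds() / 60    # 288 s / 60

print(minutes.iloc[0])  # 4.8
```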


#### 6. For each pickup location, determine how many trips have started there.


I group the dataset by `PULocationID` using `.groupby()` and count the number of trips for each location with `.size()`. The result is converted into a DataFrame with a new column, `trip_count`, using `.reset_index(name="trip_count")`.

In [13]:
trips_per_pickup_location = df.groupby("PULocationID").size().reset_index(name="trip_count")

trips_per_pickup_location.head()


Unnamed: 0,PULocationID,trip_count
0,1,753
1,2,3
2,3,70
3,4,9902
4,5,39


#### 7. Cluster the pickup time of the day into 30-minute intervals (e.g. from 02:00 to 02:30)


I create a new column, `pickup_30min_interval`, to group pickup times into 30-minute intervals.

- I calculate the rounded time once by using `.dt.floor("30min")` on `tpep_pickup_datetime` and store the result in a temporary Series called `pickup_rounded`.
- I format the rounded time (`pickup_rounded`) into `HH:MM` format using `.dt.strftime("%H:%M")`.
- To create the interval, I calculate the end time by adding 30 minutes to `pickup_rounded` using `+ pd.Timedelta(minutes=30)` and format it in the same way.
- The start and end times are concatenated with a `"-"` separator to define the interval (e.g., `02:00-02:30`).

In [14]:
pickup_rounded = df["tpep_pickup_datetime"].dt.floor("30min")

df["pickup_30min_interval"] = (
    pickup_rounded.dt.strftime("%H:%M")
    + "-"
    + (pickup_rounded + pd.Timedelta(minutes=30)).dt.strftime("%H:%M")
)

df[["tpep_pickup_datetime", "pickup_30min_interval"]].head()

Unnamed: 0,tpep_pickup_datetime,pickup_30min_interval
0,2020-01-01 00:28:15,00:00-00:30
1,2020-01-01 00:35:39,00:30-01:00
2,2020-01-01 00:47:41,00:30-01:00
3,2020-01-01 00:55:23,00:30-01:00
4,2020-01-01 00:01:58,00:00-00:30
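
The boundary behaviour of this labelling is worth checking on a few hand-picked timestamps (hypothetical values, chosen to hit an interval edge and the wrap-around at midnight):

```python
import pandas as pd

pickups = pd.Series(pd.to_datetime([
    "2020-01-01 02:29:59",   # last second of 02:00-02:30
    "2020-01-01 02:30:00",   # exactly on a boundary
    "2020-01-01 23:45:10",   # interval that ends at midnight
]))

start = pickups.dt.floor("30min")
labels = (
    start.dt.strftime("%H:%M")
    + "-"
    + (start + pd.Timedelta(minutes=30)).dt.strftime("%H:%M")
)

print(labels.tolist())
# ['02:00-02:30', '02:30-03:00', '23:30-00:00']
```

Each interval is closed on the left and open on the right, and because the labels are zero-padded they sort alphabetically in chronological order.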


#### 8. For each interval, determine the average number of passengers and the average fare amount.



I group the dataset by `pickup_30min_interval` using `.groupby()` and calculate the following metrics for each interval:
- `avg_passengers`: computed as the mean of the `passenger_count` column.
- `avg_fare_amount`: computed as the mean of the `fare_amount` column.

These metrics are calculated using the `.agg()` method, which allows applying aggregation functions (in this case, `"mean"`) to multiple columns.

In [15]:
interval_stats = df.groupby("pickup_30min_interval").agg(
    avg_passengers=("passenger_count", "mean"),
    avg_fare_amount=("fare_amount", "mean")
).reset_index()

interval_stats.head()

Unnamed: 0,pickup_30min_interval,avg_passengers,avg_fare_amount
0,00:00-00:30,1.572848,13.526433
1,00:30-01:00,1.584345,13.214132
2,01:00-01:30,1.578933,12.699554
3,01:30-02:00,1.589182,12.265997
4,02:00-02:30,1.587479,12.089669


In [16]:
interval_stats.tail()

Unnamed: 0,pickup_30min_interval,avg_passengers,avg_fare_amount
43,21:30-22:00,1.564563,12.559674
44,22:00-22:30,1.565824,12.80571
45,22:30-23:00,1.571608,13.170497
46,23:00-23:30,1.569022,13.253366
47,23:30-00:00,1.57012,13.206596


#### 9. For each payment type and each interval, determine the average fare amount


I group the dataset by `payment_type` and `pickup_30min_interval` using `.groupby()` to calculate the average fare amount (`avg_fare_amount`) for each combination of payment type and interval. 

The aggregation is performed with the `.agg()` method, applying the `"mean"` function to the `fare_amount` column.

In [17]:
payment_interval_stats = df.groupby(["payment_type", "pickup_30min_interval"]).agg(
    avg_fare_amount=("fare_amount", "mean")
).reset_index()

payment_interval_stats.head()


Unnamed: 0,payment_type,pickup_30min_interval,avg_fare_amount
0,1.0,00:00-00:30,13.869142
1,1.0,00:30-01:00,13.472232
2,1.0,01:00-01:30,12.824603
3,1.0,01:30-02:00,12.357974
4,1.0,02:00-02:30,12.008589


#### 10. For each payment type, determine the interval when the average fare amount is maximum


I first group the dataset by `payment_type` and `pickup_30min_interval` using `.groupby()` and calculate the average fare amount (`avg_fare_amount`) for each combination. The results are stored in the `payment_interval_stats` DataFrame.

To find the interval with the maximum average fare amount for each `payment_type`, I:
1. Use `.groupby("payment_type")["avg_fare_amount"].idxmax()` to identify the row index of the maximum `avg_fare_amount` for each `payment_type`.
2. Use `.loc[]` to select these rows from the `payment_interval_stats` DataFrame.

In [18]:
payment_interval_stats = df.groupby(["payment_type", "pickup_30min_interval"]).agg(
    avg_fare_amount=("fare_amount", "mean")
).reset_index()

max_fare_intervals = payment_interval_stats.loc[
    payment_interval_stats.groupby("payment_type")["avg_fare_amount"].idxmax()
]

max_fare_intervals


Unnamed: 0,payment_type,pickup_30min_interval,avg_fare_amount
10,1.0,05:00-05:30,21.256949
58,2.0,05:00-05:30,14.846814
110,3.0,07:00-07:30,10.950938
154,4.0,05:00-05:30,6.634043
192,5.0,17:30-18:00,0.0
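
The `idxmax` plus `.loc[]` pattern can be followed on a small, made-up table: `idxmax` returns one row label per group, and `.loc[]` then pulls those full rows.

```python
import pandas as pd

# Hypothetical per-group averages, for illustration only.
stats = pd.DataFrame({
    "payment_type": [1, 1, 2, 2],
    "pickup_30min_interval": ["00:00-00:30", "05:00-05:30",
                              "00:00-00:30", "05:00-05:30"],
    "avg_fare_amount": [13.0, 21.0, 15.0, 12.0],
})

# One row label per payment_type: the row holding that group's maximum.
best_rows = stats.groupby("payment_type")["avg_fare_amount"].idxmax()

best = stats.loc[best_rows]
print(best["pickup_30min_interval"].tolist())
# ['05:00-05:30', '00:00-00:30']
```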


#### 11. For each payment type, determine the interval when the overall ratio between the tip and the fare amounts is maximum


I first group the dataset by `payment_type` and `pickup_30min_interval` using `.groupby()` and calculate:
- `total_tip`: The sum of `tip_amount` for each group.
- `total_fare`: The sum of `fare_amount` for each group.

The results are stored in the `tip_fare_ratio_stats` DataFrame. 

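I then compute the tip-to-fare ratio by dividing `total_tip` by `total_fare` and store the result in a new column, `tip_to_fare_ratio`. When both totals are zero the division yields `NaN`, so I replace those values with `-1` using `.fillna(-1)`. Note that a group with a non-zero tip total and a zero fare total would produce `inf` rather than `NaN`, which `.fillna(-1)` does not catch.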

To find the interval with the maximum tip-to-fare ratio for each `payment_type`, I:
1. Use `.groupby("payment_type")["tip_to_fare_ratio"].idxmax()` to identify the row index of the maximum `tip_to_fare_ratio` for each `payment_type`.
2. Use `.loc[]` to select these rows from the `tip_fare_ratio_stats` DataFrame.


In [None]:
tip_fare_ratio_stats = df.groupby(["payment_type", "pickup_30min_interval"]).agg(
    total_tip=("tip_amount", "sum"),
    total_fare=("fare_amount", "sum")
).reset_index()

tip_fare_ratio_stats["tip_to_fare_ratio"] = tip_fare_ratio_stats["total_tip"] / tip_fare_ratio_stats["total_fare"]

tip_fare_ratio_stats["tip_to_fare_ratio"] = tip_fare_ratio_stats["tip_to_fare_ratio"].fillna(-1)

max_ratio_intervals = tip_fare_ratio_stats.loc[
    tip_fare_ratio_stats.groupby("payment_type")["tip_to_fare_ratio"].idxmax()
]

max_ratio_intervals


Unnamed: 0,payment_type,pickup_30min_interval,total_tip,total_fare,tip_to_fare_ratio
37,1.0,18:30-19:00,485543.67,1998355.2,0.242972
58,2.0,05:00-05:30,15.0,109406.17,0.000137
138,3.0,21:00-21:30,35.62,5646.67,0.006308
170,4.0,13:00-13:30,36.48,170.05,0.214525
192,5.0,17:30-18:00,0.0,0.0,-1.0


In the last row of the result, `total_fare` is `0`, which means the `tip_to_fare_ratio` is set to `-1` as per the handling logic. 
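
The two division-by-zero outcomes can be told apart on a toy example. The sketch below uses made-up totals, and replacing `inf` before calling `.fillna(-1)` is an optional extra safeguard that the notebook itself does not apply.

```python
import numpy as np
import pandas as pd

# Hypothetical totals: a normal group, a 0/0 group, and a tip with zero fare.
totals = pd.DataFrame({
    "total_tip":  [5.0, 0.0, 2.0],
    "total_fare": [20.0, 0.0, 0.0],
})

ratio = totals["total_tip"] / totals["total_fare"]
# 5/20 -> 0.25, 0/0 -> NaN, 2/0 -> inf

# Map infinities to NaN first, then apply the same -1 sentinel.
cleaned = ratio.replace([np.inf, -np.inf], np.nan).fillna(-1)

print(cleaned.tolist())  # [0.25, -1.0, -1.0]
```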






#### 12. Find the location with the highest average fare amount


I first group the dataset by `PULocationID` using `.groupby()` and calculate the average fare amount (`avg_fare_amount`) for each location by applying the `"mean"` aggregation to the `fare_amount` column. The results are stored in the `avg_fare_by_location` DataFrame.

To identify the location with the highest average fare amount:
1. I use `.idxmax()` on the `avg_fare_amount` column to find the row index corresponding to the maximum value.
2. I use `.loc[]` to extract this row from `avg_fare_by_location` and store it in `max_fare_location`.

In [None]:
avg_fare_by_location = df.groupby("PULocationID").agg(
    avg_fare_amount=("fare_amount", "mean")
).reset_index()

max_fare_location = avg_fare_by_location.loc[avg_fare_by_location["avg_fare_amount"].idxmax()]

max_fare_location

PULocationID       204.0
avg_fare_amount    107.0
Name: 198, dtype: float64

#### 13. Build a new dataframe (called common) where, for each pickup location we keep all trips to the 5 most common destinations (i.e. each pickup location can have different common destinations).


I group the dataset by `PULocationID` and `DOLocationID` using `.groupby()` and count the number of trips for each pair with `.size()`. This creates a DataFrame (`top_destinations`) with columns `PULocationID`, `DOLocationID`, and `trip_count`. I then sort the results by `PULocationID` in ascending order and `trip_count` in descending order to identify the most common destinations for each pickup location.

Next, I extract the top 5 destinations for each pickup location by using `.groupby("PULocationID").head(5)`. This step produces a DataFrame (`top_5_destinations`) containing the most frequent destinations for each `PULocationID`.

Finally, I merge the original DataFrame (`df`) with `top_5_destinations` on `PULocationID` and `DOLocationID` using an `inner` join. 


In [None]:
top_destinations = (
    df.groupby(["PULocationID", "DOLocationID"])
    .size()
    .reset_index(name="trip_count")
    .sort_values(["PULocationID", "trip_count"], ascending=[True, False])
)

top_5_destinations = (
    top_destinations.groupby("PULocationID")
    .head(5)
    .reset_index(drop=True)
)

common = df.merge(
    top_5_destinations[["PULocationID", "DOLocationID"]],
    on=["PULocationID", "DOLocationID"],
    how="inner"
)

common.head()


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,duration,pickup_30min_interval,tip_to_fare_ratio
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.2,1.0,N,238,239,1.0,...,3.0,0.5,1.47,0.0,0.3,11.27,2.5,4.8,00:00-00:30,0.245
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.2,1.0,N,239,238,1.0,...,3.0,0.5,1.5,0.0,0.3,12.3,2.5,7.416667,00:30-01:00,0.214286
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.6,1.0,N,238,238,1.0,...,3.0,0.5,1.0,0.0,0.3,10.8,2.5,6.183333,00:30-01:00,0.166667
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.8,1.0,N,238,151,1.0,...,0.5,0.5,1.36,0.0,0.3,8.16,0.0,4.85,00:30-01:00,0.247273
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.0,1.0,N,193,193,2.0,...,0.5,0.5,0.0,0.0,0.3,4.8,0.0,2.3,00:00-00:30,0.0
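
The same three steps (count, keep the top destinations per pickup, inner-join back) can be traced on a toy frame. The data is made up, and `N = 2` is used instead of 5 so the effect is visible with a handful of rows.

```python
import pandas as pd

trips = pd.DataFrame({
    "PULocationID": [1, 1, 1, 1, 1, 1, 2, 2],
    "DOLocationID": [10, 10, 10, 20, 20, 30, 40, 50],
})
N = 2  # the notebook uses 5

counts = (
    trips.groupby(["PULocationID", "DOLocationID"])
    .size()
    .reset_index(name="trip_count")
    .sort_values(["PULocationID", "trip_count"], ascending=[True, False])
)

# First N rows of each pickup group, i.e. its N most frequent destinations.
top_n = counts.groupby("PULocationID").head(N)

# Inner join keeps only trips whose (pickup, dropoff) pair is in top_n.
common_toy = trips.merge(
    top_n[["PULocationID", "DOLocationID"]],
    on=["PULocationID", "DOLocationID"],
    how="inner",
)

print(len(common_toy))  # 7: the single trip from 1 to 30 is dropped
```

When several destinations tie at the cut-off, `.head(N)` keeps whichever ones the sort placed first, so the selection among tied destinations is arbitrary.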


#### 14. On the common dataframe, for each payment type and each interval, determine the average fare amount


I group the `common` DataFrame by `payment_type` and `pickup_30min_interval` using `.groupby()` to calculate the average fare amount (`avg_fare_amount`) for each combination of payment type and interval. 

I use the `.agg()` function to apply the `"mean"` aggregation to the `fare_amount` column, resulting in the average fare amount for each group.

In [None]:
avg_fare_common = common.groupby(["payment_type", "pickup_30min_interval"]).agg(
    avg_fare_amount=("fare_amount", "mean")
).reset_index()

avg_fare_common.head()


Unnamed: 0,payment_type,pickup_30min_interval,avg_fare_amount
0,1.0,00:00-00:30,8.588957
1,1.0,00:30-01:00,8.690681
2,1.0,01:00-01:30,8.496143
3,1.0,01:30-02:00,8.026808
4,1.0,02:00-02:30,7.945909


#### 15. Compute the difference of the average fare amount computed in the previous point with those computed at point 9.


I merge the DataFrame `payment_interval_stats` (computed at point 9) with the DataFrame `avg_fare_common` (computed at point 14) using `.merge()`. The merge is performed on the shared columns `payment_type` and `pickup_30min_interval`, ensuring alignment between the two datasets. I use the `suffixes` parameter to differentiate between the average fare amounts in the merged DataFrame:
- Columns from `payment_interval_stats` are suffixed with `_all`.
- Columns from `avg_fare_common` are suffixed with `_common`.

Next, I compute the difference `avg_fare_amount_common - avg_fare_amount_all`, that is, the average on the `common` DataFrame minus the average on the original DataFrame, and store it in a new column, `fare_amount_difference`. Because `.merge()` defaults to an inner join, only (payment type, interval) pairs present in both DataFrames are kept.

In [None]:
merged_fares = payment_interval_stats.merge(
    avg_fare_common,
    on=["payment_type", "pickup_30min_interval"],
    suffixes=("_all", "_common")
)

merged_fares["fare_amount_difference"] = (
    merged_fares["avg_fare_amount_common"] - merged_fares["avg_fare_amount_all"]
)

merged_fares[["payment_type", "pickup_30min_interval", "avg_fare_amount_all", "avg_fare_amount_common", "fare_amount_difference"]].head()


Unnamed: 0,payment_type,pickup_30min_interval,avg_fare_amount_all,avg_fare_amount_common,fare_amount_difference
0,1.0,00:00-00:30,13.869142,8.588957,-5.280185
1,1.0,00:30-01:00,13.472232,8.690681,-4.781551
2,1.0,01:00-01:30,12.824603,8.496143,-4.32846
3,1.0,01:30-02:00,12.357974,8.026808,-4.331167
4,1.0,02:00-02:30,12.008589,7.945909,-4.062679


#### 16. Compute the ratio between the differences computed in the previous point and those computed in point 9. Note: you have to compute a ratio for each pair (payment type, interval).


I take the DataFrame from the previous step and calculate the ratio by dividing `fare_amount_difference` by `avg_fare_amount_all`. The result is stored in the `fare_difference_ratio` column.


In [None]:
merged_fares["fare_difference_ratio"] = (
    merged_fares["fare_amount_difference"] / merged_fares["avg_fare_amount_all"]
)

merged_fares[[
    "payment_type",
    "pickup_30min_interval",
    "fare_amount_difference",
    "avg_fare_amount_all",
    "fare_difference_ratio"
]].head()


Unnamed: 0,payment_type,pickup_30min_interval,fare_amount_difference,avg_fare_amount_all,fare_difference_ratio
0,1.0,00:00-00:30,-5.280185,13.869142,-0.380715
1,1.0,00:30-01:00,-4.781551,13.472232,-0.354919
2,1.0,01:00-01:30,-4.32846,12.824603,-0.337512
3,1.0,01:30-02:00,-4.331167,12.357974,-0.350475
4,1.0,02:00-02:30,-4.062679,12.008589,-0.338314
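
As a worked check of the first row, the difference and the ratio can be recomputed by hand from the two averages reported above. The ratio is equivalent to `avg_fare_amount_common / avg_fare_amount_all - 1`, so a value of about -0.38 means fares in `common` are roughly 38% lower for that pair.

```python
avg_all = 13.869142     # avg_fare_amount_all, first row of the table
avg_common = 8.588957   # avg_fare_amount_common, first row of the table

difference = avg_common - avg_all
ratio = difference / avg_all

print(round(difference, 6))  # -5.280185
print(round(ratio, 6))       # -0.380715
```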


#### 17. Build chains of trips. Two trips are consecutive in a chain if (a) they have the same VendorID, (b) the pickup location of the second trip is also the dropoff location of the first trip, (c) the pickup time of the second trip is after the dropoff time of the first trip, and (d) the pickup time of the second trip is at most 2 minutes later than the dropoff time of the first trip. Hint: Add a column chain to the dataset. A chain can have more than two trips.




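I first sort the dataset by `VendorID`, `tpep_dropoff_datetime`, and `tpep_pickup_datetime`, so that trips are ordered by dropoff time within each vendor. Each trip is then compared only with the row immediately before it in this order, so a link is detected only when the two trips are adjacent after sorting.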

To identify consecutive trips:
- I use `.shift()` to compare each trip with the previous one:
  - `same_vendor`: Checks if the `VendorID` of the current trip is equal to the previous trip.
  - `matching_locations`: Checks if the `PULocationID` (pickup location) of the current trip matches the `DOLocationID` (dropoff location) of the previous trip.
  - `time_diff`: Calculates the time difference in seconds between the `tpep_pickup_datetime` of the current trip and the `tpep_dropoff_datetime` of the previous trip.
- I evaluate the time condition `(time_diff > 0) & (time_diff <= 120)` to ensure the pickup of the current trip falls strictly after the dropoff of the previous trip and at most 2 minutes (120 seconds) later.

The logical AND (`&`) of these conditions (`same_vendor`, `matching_locations`, and `time_condition`) creates a boolean mask, `is_chain`, indicating whether each trip is part of the same chain as the previous trip.

I then invert `is_chain` with `~is_chain` to detect the start of a new chain and compute a cumulative sum (`.cumsum()`) to assign a unique identifier to each chain. The result is stored in the `chain` column.


In [22]:
df = df.sort_values(by=["VendorID", "tpep_dropoff_datetime", "tpep_pickup_datetime"]).reset_index(drop=True)

same_vendor = df["VendorID"].eq(df["VendorID"].shift())
matching_locations = df["PULocationID"].eq(df["DOLocationID"].shift())
time_diff = (df["tpep_pickup_datetime"] - df["tpep_dropoff_datetime"].shift()).dt.total_seconds()
time_condition = (time_diff > 0) & (time_diff <= 120)

is_chain = same_vendor & matching_locations & time_condition

df["chain"] = (~is_chain).cumsum()

df_sorted_by_chain = df.sort_values(by="chain")
df_sorted_by_chain[["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "PULocationID", "DOLocationID", "chain"]].head(20)


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,PULocationID,DOLocationID,chain
0,1.0,2020-01-01 00:01:40,2020-01-01 00:01:52,79,79,1
1,1.0,2020-01-01 00:00:50,2020-01-01 00:02:32,158,158,2
2,1.0,2020-01-01 00:00:07,2020-01-01 00:03:26,75,75,3
3,1.0,2020-01-01 00:01:55,2020-01-01 00:04:34,141,140,4
4,1.0,2020-01-01 00:01:01,2020-01-01 00:04:46,236,236,5
5,1.0,2020-01-01 00:01:59,2020-01-01 00:05:14,181,181,6
6,1.0,2020-01-01 00:01:21,2020-01-01 00:05:47,231,87,7
7,1.0,2020-01-01 00:00:25,2020-01-01 00:05:59,145,179,8
8,1.0,2020-01-01 00:03:16,2020-01-01 00:06:10,137,170,9
9,1.0,2020-01-01 00:01:45,2020-01-01 00:06:16,236,237,10

The first rows all belong to single-trip chains. To inspect actual chains, I count the trips per chain with `.value_counts()`, keep only the chains with more than one trip, and merge them back onto `df`.


In [27]:
chain_counts = df["chain"].value_counts().reset_index()
chain_counts.columns = ["chain", "trip_count"]

multi_trip_chains = chain_counts[chain_counts["trip_count"] > 1]

chains_with_details = df.merge(multi_trip_chains, on="chain")

chains_with_details_sorted = chains_with_details.sort_values(by=["chain", "tpep_pickup_datetime"])
chains_with_details_sorted[["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "PULocationID", "DOLocationID", "chain"]].head(20)


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,PULocationID,DOLocationID,chain
0,1.0,2020-01-03 05:28:12,2020-01-03 05:31:18,100,161,104457
1,1.0,2020-01-03 05:31:26,2020-01-03 05:31:26,161,264,104457
2,1.0,2020-01-03 21:15:40,2020-01-03 21:20:47,142,142,152727
3,1.0,2020-01-03 21:20:49,2020-01-03 21:20:49,142,264,152727
4,1.0,2020-01-05 14:46:55,2020-01-05 14:52:45,249,234,245997
5,1.0,2020-01-05 14:52:47,2020-01-05 14:52:47,234,264,245997
6,1.0,2020-01-06 11:02:24,2020-01-06 11:11:09,48,90,288547
7,1.0,2020-01-06 11:11:12,2020-01-06 11:11:12,90,264,288547
8,1.0,2020-01-06 18:03:09,2020-01-06 18:12:01,236,237,313619
9,1.0,2020-01-06 18:12:03,2020-01-06 18:12:03,237,264,313619
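
The `shift` and `cumsum` mechanics can be followed on four hypothetical trips, already sorted by vendor and dropoff time: the first three link into one chain and the fourth starts a new one.

```python
import pandas as pd

# Made-up trips for illustration; not rows from the dataset.
trips = pd.DataFrame({
    "VendorID": [1, 1, 1, 1],
    "PULocationID": [100, 161, 264, 50],
    "DOLocationID": [161, 264, 90, 60],
    "tpep_pickup_datetime": pd.to_datetime([
        "2020-01-03 05:28:12", "2020-01-03 05:31:26",
        "2020-01-03 05:34:00", "2020-01-03 05:50:00",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2020-01-03 05:31:18", "2020-01-03 05:33:00",
        "2020-01-03 05:40:00", "2020-01-03 05:55:00",
    ]),
})

same_vendor = trips["VendorID"].eq(trips["VendorID"].shift())
matching = trips["PULocationID"].eq(trips["DOLocationID"].shift())
gap = (
    trips["tpep_pickup_datetime"] - trips["tpep_dropoff_datetime"].shift()
).dt.total_seconds()

# True when a trip continues the chain of the previous row.
is_chain = same_vendor & matching & (gap > 0) & (gap <= 120)

# Every False starts a new chain, so the running count of starts is the chain id.
trips["chain"] = (~is_chain).cumsum()

print(trips["chain"].tolist())  # [1, 1, 1, 2]
```

The first row has no predecessor, so all its comparisons against the shifted values are `False` and it always opens a chain.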
