<a href="https://colab.research.google.com/github/FrancescaNegriUniMiB/focsproject/blob/main/Foundations_CS_2425.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FOUNDATIONS OF COMPUTER SCIENCE - PROJECT FOR A.Y. 2024-2025

### Project text

1. Extract all trips with `trip_distance` larger than 50
2. Extract all trips where `payment_type` is missing
3. For each (`PULocationID`, `DOLocationID`) pair, determine the number of trips
4. Save all rows with missing `VendorID`, `passenger_count`, `store_and_fwd_flag`, `payment_type` in a new DataFrame called `bad`, and remove those rows from the original DataFrame
5. Add a `duration` column storing how long each trip has taken (use `tpep_pickup_datetime`, `tpep_dropoff_datetime`)
6. For each pickup location, determine how many trips have started there
7. Cluster the pickup time of the day into 30-minute intervals (e.g., from 02:00 to 02:30)
8. For each interval, determine the average number of passengers and the average fare amount
9. For each payment type and each interval, determine the average fare amount
10. For each payment type, determine the interval when the average fare amount is maximum
11. For each payment type, determine the interval when the overall ratio between the tip and the fare amounts is maximum
12. Find the location with the highest average fare amount
13. Build a new DataFrame (called `common`) where, for each pickup location, we keep all trips to the 5 most common destinations (i.e., each pickup location can have different common destinations)
14. On the `common` DataFrame, for each payment type and each interval, determine the average fare amount
15. Compute the difference of the average fare amount computed in the previous point with those computed at point 9
16. Compute the ratio between the differences computed in the previous point and those computed in point 9  
    **Note:** You have to compute a ratio for each pair (`payment type`, `interval`)
17. Build chains of trips. Two trips are consecutive in a chain if:
    1. They have the same `VendorID`
    2. The pickup location of the second trip is also the dropoff location of the first trip
    3. The pickup time of the second trip is after the dropoff time of the first trip
    4. The pickup time of the second trip is at most 2 minutes later than the dropoff time of the first trip

## Importing


In [None]:
import pandas as pd
import time as t
import numpy as np
from bisect import bisect_left, bisect_right
from collections import defaultdict, deque

if 'executed' not in globals():

  executed = True

  !pip install -q gdown
  !gdown --fuzzy https://drive.google.com/file/d/1IUOdTOYgjco0ggTVsNluQOl-xPbMZ3Z-/view?usp=sharing

  df = pd.read_csv('/content/focs_data.csv',dtype={'store_and_fwd_flag':object})
  display(df.head())

  df_backup = df.copy()  # independent copy, so in-place changes to df never alter the backup
else:
  df = df_backup.copy()
  print("Reset df: ok.")

**PRE-PROCESSING**

The following rows are removed because they cannot represent a valid trip.
A valid trip is assumed to satisfy these constraints:


*   fare_amount: can be 0 (free trip or voided payment?) or greater, but not below 0
*   tip_amount: can be 0, but logically not below 0
*   trip_distance: must be > 0, because some distance has to be covered
*   total_amount: must be > 0, since it is the sum of multiple components

In [None]:
cols_to_check = ['fare_amount', 'tip_amount', 'trip_distance', 'total_amount']

len_before = len(df)

for col in cols_to_check:
    n_neg = (df[col] < 0).sum()
    print(f"{col}: {n_neg} negative values found")

df = df[(df['fare_amount'] >= 0) & (df['trip_distance'] > 0) & (df['tip_amount'] >= 0) & (df['total_amount'] > 0)]


print(f"\nA total of {len_before - len(df)} rows removed.")

## Task 1

Extract all trips with trip_distance larger than 50

In [None]:
trips_g_50 = df[df['trip_distance'] > 50].copy().reset_index(drop=True)
print(f"\n{len(trips_g_50)} rows found.\n")
print("Details for trip_distance greater than 50:\n")
display(trips_g_50.trip_distance.describe())

## Task 2

Extract all trips where payment_type is missing

In [None]:
trips_nan_pt = df[df['payment_type'].isna()].copy().reset_index(drop=True)
print(f"\n{len(trips_nan_pt)} rows found.\n")
print("First 5 rows of the subdf:\n")
display(trips_nan_pt.head().round(2))

In [None]:
#print(f"Removing {len(df[df['VendorID'].isna()])} rows")
#df = df[df['VendorID'].notna()]

## Task 3

For each (PULocationID, DOLocationID) pair, determine the number of trips

In [None]:
# Count the trips for each (PULocationID, DOLocationID) pair

trip_counts = df[['PULocationID','DOLocationID']].value_counts().reset_index(name='total_trips_per_route')
print(f"{len(trip_counts)} pairs found.")
print("\nStatistical data about 'total_trips_per_route' distribution:")
display(trip_counts.total_trips_per_route.describe().astype(int))

## Task 4

Save all rows with missing VendorID, passenger_count, store_and_fwd_flag, payment_type in a new DataFrame called bad, and remove those rows from the original DataFrame

In [None]:
nan_cols = ['VendorID', 'passenger_count', 'store_and_fwd_flag', 'payment_type']

null_indices = df[nan_cols].isna().any(axis=1)

print(f"Number of rows where at least one of the specified columns is null: {null_indices.sum()}")
print("\nCheck if rows null in one column are null in others:")
for col in nan_cols:
    is_null_in_col = df[col].isna()
    # Check if all rows that are null in 'col' are also null in all other columns
    check = df.loc[is_null_in_col, [c for c in nan_cols if c != col]].isna().all(axis=1)
    print(f"  - Rows where '{col}' is null are also null in all other specified columns: {check.all()}")

# A row counts as bad if the value is missing in at least 1 of the 4 columns
bad = df[null_indices].copy().reset_index(drop=True)
df = df.dropna(subset=nan_cols).reset_index(drop=True)

print(f"\n{len(bad)} rows removed from the original dataset.")
print(f"\nNow the dataset is {len(df)} rows long.")

## Task 5

Add a duration column storing how long each trip has taken (use tpep_pickup_datetime, tpep_dropoff_datetime)

In [None]:
start = t.time()

df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

df['duration_sec'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds().astype(int)

def format_duration(sec):
    # Human-readable duration; days and hours are shown only when the trip is long enough
    days = f"{sec // 86400} days " if sec // 86400 > 0 else ""
    hours = f"{sec // 3600 % 24} hours " if sec // 3600 > 0 else ""
    return f"{days}{hours}{sec // 60 % 60} minutes {sec % 60} seconds"

df['duration_str'] = df['duration_sec'].apply(format_duration)

maxdur = df.loc[df['duration_sec'].idxmax()]
mindur = df.loc[df['duration_sec'].idxmin()]
print(f"\nLongest trip: {maxdur.duration_str}")
print(f"Shortest trip: {mindur.duration_str}")

end = t.time()
print(f"\n\n\n(Execution time: {round(end - start,2)} seconds.)")

## Task 6

For each pickup location, determine how many trips have started there

In [None]:
pickup_counts = df['PULocationID'].value_counts().reset_index()
pickup_counts.columns = ['PULocationID', 'number_of_trips']

print("Top 5 most common PULocationID (with the number of trips started there):\n")
display(pickup_counts.head())
print(f"\nTotal {len(pickup_counts)} unique pickup locations.")

print("\nStatistical data about the number of trips per pickup location:")
display(pickup_counts.number_of_trips.describe().astype(int))

## Task 7

Cluster the pickup time of the day into 30-minute intervals (e.g., from 02:00 to 02:30)

In [None]:
start = t.time()

df['hour']   = df['hour2']   = df['tpep_pickup_datetime'].dt.hour
df['minute'] = df['minute2'] = df['tpep_pickup_datetime'].dt.floor('30min').dt.minute

df['hour2'] = np.where(df['minute'] == 30, (df['hour'] + 1) % 24, df['hour']) # (hour + 1) % 24 wraps 24 back to 0, so 23:30 ends at 00:00
df['minute2'] = np.where(df['minute'] == 30, 0, 30)

df['tpep_pickup_time_interval'] = (
    df['hour'].map('{:02}'.format) + ':' + df['minute'].map('{:02}'.format) + '-' +
    df['hour2'].map('{:02}'.format) + ':' + df['minute2'].map('{:02}'.format)
)

df.drop(columns=['hour','hour2','minute','minute2'],inplace=True)

print('\nNew column preview:')
display(df[['tpep_pickup_datetime','tpep_pickup_time_interval']].head())

print('\nList of all the possible intervals:')
display(df['tpep_pickup_time_interval'].unique())

end = t.time()
print(f"\n\n\n(Execution time: {round(end - start,2)} seconds.)")

#start = t.time()

#df['pickup_time_30min'] = df['tpep_pickup_datetime'].dt.floor('30min')
#df['time_interval'] = df['pickup_time_30min'].dt.strftime('%H:%M') + '-' + (df['pickup_time_30min'] + pd.Timedelta(minutes=30)).dt.strftime('%H:%M')

#end = t.time()
#print(f"\nExecution time: {end - start} seconds.")

| Method | Mean computation time (relative) |
|---|---|
| `astype(str).str.zfill(2)` | 1× |
| `.map('{:02}'.format)` | ~0.6× |
| `.dt.strftime('%H')` on datetime | 3–4× |

## Task 8

For each interval, determine the average number of passengers and the average fare amount
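
A minimal sketch of one way to compute this, assuming the `tpep_pickup_time_interval` column from Task 7 is already present. The helper name `interval_averages`, the output column names and the toy data are hypothetical and only serve as illustration.

```python
import pandas as pd

def interval_averages(trips: pd.DataFrame) -> pd.DataFrame:
    """Average passenger count and fare amount for each 30-minute interval."""
    return (
        trips.groupby('tpep_pickup_time_interval')[['passenger_count', 'fare_amount']]
        .mean()
        .rename(columns={'passenger_count': 'avg_passengers', 'fare_amount': 'avg_fare'})
        .reset_index()
    )

# Toy data (made up) standing in for the real DataFrame
toy = pd.DataFrame({
    'tpep_pickup_time_interval': ['02:00-02:30', '02:00-02:30', '02:30-03:00'],
    'passenger_count': [1, 3, 2],
    'fare_amount': [10.0, 20.0, 8.0],
})
avg_by_interval = interval_averages(toy)
print(avg_by_interval)
```

On the real data the call would presumably be `interval_averages(df)`.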

## Task 9

For each payment type and each interval, determine the average fare amount
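
One possible approach, sketched under the assumption that the interval column from Task 7 exists, is a two-key `groupby`. The function name `fare_by_payment_and_interval` and the toy rows are hypothetical.

```python
import pandas as pd

def fare_by_payment_and_interval(trips: pd.DataFrame) -> pd.DataFrame:
    """Average fare amount for each (payment_type, interval) pair."""
    return (
        trips.groupby(['payment_type', 'tpep_pickup_time_interval'])['fare_amount']
        .mean()
        .reset_index(name='avg_fare')
    )

# Toy data (made up) standing in for the real DataFrame
toy = pd.DataFrame({
    'payment_type': [1, 1, 1, 2],
    'tpep_pickup_time_interval': ['02:00-02:30', '02:00-02:30', '02:30-03:00', '02:00-02:30'],
    'fare_amount': [10.0, 20.0, 8.0, 5.0],
})
avg_fare_pt = fare_by_payment_and_interval(toy)
print(avg_fare_pt)
```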

## Task 10

For each payment type, determine the interval when the average fare amount is maximum
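
A possible sketch: compute the per-pair averages, then pick the row with the largest average inside each payment type using `idxmax`. The name `peak_fare_interval` and the toy data are hypothetical; when two intervals tie, `idxmax` keeps the first one it meets.

```python
import pandas as pd

def peak_fare_interval(trips: pd.DataFrame) -> pd.DataFrame:
    """For each payment type, the interval with the highest average fare."""
    avg = (
        trips.groupby(['payment_type', 'tpep_pickup_time_interval'])['fare_amount']
        .mean()
        .reset_index(name='avg_fare')
    )
    # idxmax returns, per payment type, the row label of the largest average
    best_rows = avg.groupby('payment_type')['avg_fare'].idxmax()
    return avg.loc[best_rows].reset_index(drop=True)

# Toy data (made up) standing in for the real DataFrame
toy = pd.DataFrame({
    'payment_type': [1, 1, 1, 2, 2],
    'tpep_pickup_time_interval': ['02:00-02:30', '02:00-02:30', '02:30-03:00',
                                  '02:00-02:30', '02:30-03:00'],
    'fare_amount': [10.0, 20.0, 8.0, 5.0, 30.0],
})
peak = peak_fare_interval(toy)
print(peak)
```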

## Task 11

For each payment type, determine the interval when the overall ratio between the tip and the fare amounts is maximum
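
The sketch below reads "overall ratio" as total tips divided by total fares within each (payment type, interval) pair, rather than the mean of per-trip ratios; that reading is an assumption. Pairs whose total fare is 0 are dropped because the ratio is undefined there, which is also an assumption. The name `peak_tip_ratio_interval` and the toy data are hypothetical.

```python
import pandas as pd

def peak_tip_ratio_interval(trips: pd.DataFrame) -> pd.DataFrame:
    """For each payment type, the interval where sum(tip) / sum(fare) is largest."""
    totals = (
        trips.groupby(['payment_type', 'tpep_pickup_time_interval'])[['tip_amount', 'fare_amount']]
        .sum()
        .reset_index()
    )
    # Assumption: pairs with a total fare of 0 have no defined ratio and are skipped
    totals = totals[totals['fare_amount'] > 0].copy()
    totals['tip_fare_ratio'] = totals['tip_amount'] / totals['fare_amount']
    best_rows = totals.groupby('payment_type')['tip_fare_ratio'].idxmax()
    cols = ['payment_type', 'tpep_pickup_time_interval', 'tip_fare_ratio']
    return totals.loc[best_rows, cols].reset_index(drop=True)

# Toy data (made up) standing in for the real DataFrame
toy = pd.DataFrame({
    'payment_type': [1, 1, 1, 2],
    'tpep_pickup_time_interval': ['02:00-02:30', '02:00-02:30', '02:30-03:00', '02:00-02:30'],
    'fare_amount': [10.0, 20.0, 8.0, 5.0],
    'tip_amount': [2.0, 4.0, 4.0, 0.0],
})
peak_ratio = peak_tip_ratio_interval(toy)
print(peak_ratio)
```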

## Task 12

Find the location with the highest average fare amount

**PLEASE NOTE:**

Since it is not specified whether the location refers to PU or DO, I'll assume the goal is to find where the most expensive trips ***begin***, and therefore I will consider the **PU** locations.

The same calculation can, of course, also be done using the DO locations, perhaps to explore the cost of reaching certain areas.
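
A minimal sketch of the calculation described above, with the location column as a parameter so that the same helper covers both the PU and the DO reading. The name `priciest_location` and the toy data are hypothetical.

```python
import pandas as pd

def priciest_location(trips: pd.DataFrame, location_col: str = 'PULocationID'):
    """Return (location id, average fare) for the location with the highest average fare."""
    avg_fare = trips.groupby(location_col)['fare_amount'].mean()
    return avg_fare.idxmax(), avg_fare.max()

# Toy data (made up) standing in for the real DataFrame
toy = pd.DataFrame({
    'PULocationID': [10, 10, 20],
    'DOLocationID': [5, 6, 6],
    'fare_amount': [10.0, 20.0, 40.0],
})
pu_location, pu_avg_fare = priciest_location(toy)
do_location, do_avg_fare = priciest_location(toy, location_col='DOLocationID')
print(pu_location, pu_avg_fare)
print(do_location, do_avg_fare)
```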

## Task 13

Build a new DataFrame (called common) where, for each pickup location, we keep all trips to the 5 most common destinations (i.e., each pickup location can have different common destinations)
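
One way this could be done, as a sketch: count the trips per route, keep the top routes of each pickup location, then inner-merge back onto the trips. The helper `keep_common_destinations` and its `top_n` parameter are hypothetical; the toy example uses `top_n=2` only to keep the data small, and ties at the cut-off are broken by whichever route comes first after sorting.

```python
import pandas as pd

def keep_common_destinations(trips: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """Keep only the trips going to the top_n most common destinations of their pickup location."""
    route_counts = (
        trips.groupby(['PULocationID', 'DOLocationID'])
        .size()
        .reset_index(name='n_trips')
    )
    # Within each pickup location, sort routes by popularity and keep the first top_n
    top_routes = (
        route_counts.sort_values(['PULocationID', 'n_trips'], ascending=[True, False])
        .groupby('PULocationID')
        .head(top_n)
    )
    return trips.merge(
        top_routes[['PULocationID', 'DOLocationID']],
        on=['PULocationID', 'DOLocationID'],
        how='inner',
    )

# Toy data (made up): pickup 1 has three destinations, pickup 2 has one
toy = pd.DataFrame({
    'PULocationID': [1, 1, 1, 1, 1, 1, 2],
    'DOLocationID': [10, 10, 10, 11, 11, 12, 12],
})
common = keep_common_destinations(toy, top_n=2)
print(common)
```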

## Task 14

On the common DataFrame, for each payment type and each interval, determine the average fare amount

## Task 15

Compute the difference of the average fare amount computed in the previous point with those computed at point 9

## Task 16

Compute the ratio between the differences computed in the previous point and those computed in point 9.

**Note:** You have to compute a ratio for each pair (payment type, interval)
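
Tasks 14 to 16 can be sketched together. The block below assumes the difference is taken as "average on `common` minus average on all trips", and that pairs missing from `common` are simply left out; both are assumptions. The name `compare_average_fares`, the output column names and the toy data are hypothetical. A pair whose overall average fare is 0 would give an infinite or undefined ratio.

```python
import pandas as pd

KEYS = ['payment_type', 'tpep_pickup_time_interval']

def compare_average_fares(trips: pd.DataFrame, common: pd.DataFrame) -> pd.DataFrame:
    """Per (payment type, interval): both averages, their difference and the relative difference."""
    avg_all = trips.groupby(KEYS)['fare_amount'].mean().rename('avg_fare_all')
    avg_common = common.groupby(KEYS)['fare_amount'].mean().rename('avg_fare_common')
    # Inner join on the (payment_type, interval) index: pairs absent from `common` are dropped
    out = avg_all.to_frame().join(avg_common, how='inner')
    out['difference'] = out['avg_fare_common'] - out['avg_fare_all']   # Task 15
    out['ratio'] = out['difference'] / out['avg_fare_all']             # Task 16
    return out.reset_index()

# Toy data (made up): `toy_common` pretends that the third trip was filtered out in Task 13
toy = pd.DataFrame({
    'payment_type': [1, 1, 1, 2],
    'tpep_pickup_time_interval': ['02:00-02:30'] * 4,
    'fare_amount': [10.0, 20.0, 30.0, 8.0],
})
toy_common = toy.iloc[[0, 1, 3]]
comparison = compare_average_fares(toy, toy_common)
print(comparison)
```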

## Task 17

Build chains of trips. Two trips are consecutive in a chain if:


*   They have the same VendorID
*   The pickup location of the second trip is also the dropoff location of the first trip
*   The pickup time of the second trip is after the dropoff time of the first trip
*   The pickup time of the second trip is at most 2 minutes later than the dropoff time of the first trip

**REASONING**

Task preparation:

- sort the dataset by VendorID, tpep_pickup_datetime, tpep_dropoff_datetime

- create a column `starting_dropoff` that contains tpep_dropoff_datetime + 1 second

- create a column `ending_dropoff` that contains tpep_dropoff_datetime + 2 minutes

- initialize n = 0

Then define a main function that:

1. creates a column `chain` initialized to n (that is 0, meaning "not yet assigned to a chain")

2. increments n by 1, takes the first row "i" with chain = 0 and does the following:

   - set the chain of i to n

   - create a subset of the rows that have the same VendorID as i and whose tpep_pickup_datetime is between the starting_dropoff and the ending_dropoff of i

   - check whether the DOLocationID of i appears in the PULocationID column of this subset

   - if it does not appear: the chain ends, go to step 3

   - if it does appear: call the matching row "j", set the chain of j to n, and repeat the sub-steps of step 2 for j (which thus becomes the new i) without changing n

3. moves to the next row with chain = 0 (the new "i"), increments n by 1 and repeats the sub-steps of step 2 for the new row
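
The steps above can be sketched as follows. This is only one possible reading, not a definitive implementation: it assumes a unique DataFrame index, that a trip belongs to at most one chain, and that when several trips qualify the earliest unassigned one is taken. Instead of the `starting_dropoff` and `ending_dropoff` columns it uses a strict "later than the dropoff" comparison plus a 2-minute upper bound, and it replaces the subset scan with a binary search over pickup times grouped by (VendorID, PULocationID). The name `build_chains` and the toy data are hypothetical.

```python
import pandas as pd
from bisect import bisect_right
from collections import defaultdict

def build_chains(trips: pd.DataFrame, max_gap=pd.Timedelta(minutes=2)) -> pd.Series:
    """Greedy chain labelling: returns chain ids (1, 2, ...) aligned with trips.index."""
    ordered = trips.sort_values(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime'])

    # For each (VendorID, PULocationID): pickup times in ascending order and the matching row labels
    pickups, labels = defaultdict(list), defaultdict(list)
    for row in ordered.itertuples():
        key = (row.VendorID, row.PULocationID)
        pickups[key].append(row.tpep_pickup_datetime)
        labels[key].append(row.Index)

    chain = {}  # row label -> chain id
    n = 0
    for row in ordered.itertuples():
        if row.Index in chain:
            continue  # already absorbed into an earlier chain
        n += 1
        chain[row.Index] = n
        vendor, drop_loc, dropoff = row.VendorID, row.DOLocationID, row.tpep_dropoff_datetime
        while True:
            key = (vendor, drop_loc)
            times = pickups.get(key, [])
            # First trip picked up strictly after the current dropoff
            pos = bisect_right(times, dropoff)
            # Skip trips that already belong to a chain
            while pos < len(times) and labels[key][pos] in chain:
                pos += 1
            if pos == len(times) or times[pos] > dropoff + max_gap:
                break  # nobody continues this chain
            nxt = labels[key][pos]
            chain[nxt] = n
            drop_loc = ordered.at[nxt, 'DOLocationID']
            dropoff = ordered.at[nxt, 'tpep_dropoff_datetime']
    return pd.Series(chain, name='chain').reindex(trips.index)

# Toy data (made up): trips 0 and 1 are linked, trip 2 starts too late, trip 3 has another vendor
toy = pd.DataFrame({
    'VendorID': [1, 1, 1, 2],
    'PULocationID': [100, 200, 300, 200],
    'DOLocationID': [200, 300, 400, 300],
    'tpep_pickup_datetime': pd.to_datetime(
        ['2024-01-01 10:00', '2024-01-01 10:11', '2024-01-01 10:25', '2024-01-01 10:11']),
    'tpep_dropoff_datetime': pd.to_datetime(
        ['2024-01-01 10:10', '2024-01-01 10:20', '2024-01-01 10:30', '2024-01-01 10:15']),
})
chains = build_chains(toy)
print(chains)
```

Grouping the pickup times by vendor and location keeps each lookup to a binary search plus a short scan, which should matter on a dataset of this size; the row-by-row Python loop remains the slow part.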