<a href="https://colab.research.google.com/github/FrancescaNegriUniMiB/focsproject/blob/main/Foundations_CS_2425.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FOUNDATIONS OF COMPUTER SCIENCE - PROJECT FOR A.Y. 2024-2025

### Project text

1. Extract all trips with `trip_distance` larger than 50
2. Extract all trips where `payment_type` is missing
3. For each (`PULocationID`, `DOLocationID`) pair, determine the number of trips
4. Save all rows with missing `VendorID`, `passenger_count`, `store_and_fwd_flag`, `payment_type` in a new DataFrame called `bad`, and remove those rows from the original DataFrame
5. Add a `duration` column storing how long each trip has taken (use `tpep_pickup_datetime`, `tpep_dropoff_datetime`)
6. For each pickup location, determine how many trips have started there
7. Cluster the pickup time of the day into 30-minute intervals (e.g., from 02:00 to 02:30)
8. For each interval, determine the average number of passengers and the average fare amount
9. For each payment type and each interval, determine the average fare amount
10. For each payment type, determine the interval when the average fare amount is maximum
11. For each payment type, determine the interval when the overall ratio between the tip and the fare amounts is maximum
12. Find the location with the highest average fare amount
13. Build a new DataFrame (called `common`) where, for each pickup location, we keep all trips to the 5 most common destinations (i.e., each pickup location can have different common destinations)
14. On the `common` DataFrame, for each payment type and each interval, determine the average fare amount
15. Compute the difference of the average fare amount computed in the previous point with those computed at point 9
16. Compute the ratio between the differences computed in the previous point and those computed in point 9  
    **Note:** You have to compute a ratio for each pair (`payment type`, `interval`)
17. Build chains of trips. Two trips are consecutive in a chain if:
    1. They have the same `VendorID`
    2. The pickup location of the second trip is also the dropoff location of the first trip
    3. The pickup time of the second trip is after the dropoff time of the first trip
    4. The pickup time of the second trip is at most 2 minutes later than the dropoff time of the first trip

## Importing


In [1]:
import pandas as pd
import time as t
import numpy as np
from bisect import bisect_left, bisect_right
from collections import defaultdict, deque

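# Download and load the data only once per session; later runs restore df from the backup.
if 'executed' not in globals():

  executed = True

  # Fetch the dataset from Google Drive (Colab shell commands).
  !pip install -q gdown
  !gdown --fuzzy https://drive.google.com/file/d/1IUOdTOYgjco0ggTVsNluQOl-xPbMZ3Z-/view?usp=sharing

  # Force store_and_fwd_flag to object dtype to avoid mixed-type inference.
  df = pd.read_csv('/content/focs_data.csv',dtype={'store_and_fwd_flag':object})
  display(df.head())

  # Keep a reference to the raw data so df can be reset without downloading again.
  df_backup = df
else:
  df = df_backup
  print("Reset df: ok.")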

Downloading...
From (original): https://drive.google.com/uc?id=1IUOdTOYgjco0ggTVsNluQOl-xPbMZ3Z-
From (redirected): https://drive.google.com/uc?id=1IUOdTOYgjco0ggTVsNluQOl-xPbMZ3Z-&confirm=t&uuid=b0de23f6-b5a6-4426-a06c-974a875ae13f
To: /content/focs_data.csv
100% 594M/594M [00:07<00:00, 77.6MB/s]


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.2,1.0,N,238,239,1.0,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.2,1.0,N,239,238,1.0,7.0,3.0,0.5,1.5,0.0,0.3,12.3,2.5
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.6,1.0,N,238,238,1.0,6.0,3.0,0.5,1.0,0.0,0.3,10.8,2.5
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.8,1.0,N,238,151,1.0,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.0,1.0,N,193,193,2.0,3.5,0.5,0.5,0.0,0.0,0.3,4.8,0.0


**PRE-PROCESSING**

The following rows are removed because they cannot represent a valid trip.
A valid trip is assumed to satisfy these constraints:


*   fare_amount: must be >= 0 (0 may indicate a free trip or a voided payment)
*   tip_amount: must be >= 0
*   trip_distance: must be > 0, because some distance has to be covered
*   total_amount: must be > 0, since it is the sum of multiple components

In [2]:
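cols_to_check = ['fare_amount', 'tip_amount', 'trip_distance', 'total_amount']

len_before = len(df)

# Report strictly negative values only; zeros are not counted here.
for col in cols_to_check:
    n_neg = (df[col] < 0).sum()
    print(f"{col}: {n_neg} negative values found")

# Keep valid trips. Zero trip_distance and zero total_amount are dropped as well,
# so the total removed can exceed the negative counts printed above.
df = df[(df['fare_amount'] >= 0) & (df['trip_distance'] > 0) & (df['tip_amount'] >= 0) & (df['total_amount'] > 0)]


print(f"\nA total of {len_before - len(df)} rows removed.")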

fare_amount: 19505 negative values found
tip_amount: 170 negative values found
trip_distance: 2338 negative values found
total_amount: 19505 negative values found

A total of 87985 rows removed.


## Task 1

Extract all trips with trip_distance larger than 50
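
A minimal sketch of this filter on toy data. The values below are made up and only stand in for the real `df`; the name `long_trips` is an assumption.

```python
import pandas as pd

# Toy data standing in for the taxi DataFrame (values are made up).
df = pd.DataFrame({"trip_distance": [1.2, 75.0, 50.0, 120.5]})

# A boolean mask keeps only trips strictly longer than 50.
long_trips = df[df["trip_distance"] > 50]
print(long_trips)
```

The comparison is strict, so a trip of exactly 50 is excluded.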

## Task 2

Extract all trips where payment_type is missing
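
One possible way to select these rows is `Series.isna()`. The sketch below uses made-up values, and `missing_payment` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy data: two of the four trips have no payment_type (values are made up).
df = pd.DataFrame({
    "payment_type": [1.0, np.nan, 2.0, np.nan],
    "fare_amount": [6.0, 7.0, 5.5, 3.5],
})

# isna() flags NaN/None entries; the mask selects those rows.
missing_payment = df[df["payment_type"].isna()]
print(missing_payment)
```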

## Task 3

For each (PULocationID, DOLocationID) pair, determine the number of trips
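
A sketch of the pair count using `groupby(...).size()`, on made-up data. The output column name `n_trips` is an assumption.

```python
import pandas as pd

# Toy data: four trips over three distinct (pickup, dropoff) pairs.
df = pd.DataFrame({
    "PULocationID": [238, 239, 238, 238],
    "DOLocationID": [239, 238, 239, 151],
})

# size() counts the rows in each group; reset_index turns the result into a flat table.
pair_counts = (
    df.groupby(["PULocationID", "DOLocationID"])
      .size()
      .reset_index(name="n_trips")
)
print(pair_counts)
```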

## Task 4

Save all rows with missing VendorID, passenger_count, store_and_fwd_flag, payment_type in a new DataFrame called bad, and remove those rows from the original DataFrame
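
The task text does not say whether a row must miss all four columns or just one of them. The sketch below assumes "at least one" (`any`); switching to `.all(axis=1)` would give the stricter reading. Data values are made up.

```python
import numpy as np
import pandas as pd

cols = ["VendorID", "passenger_count", "store_and_fwd_flag", "payment_type"]

# Toy data: row 1 misses every column, row 2 misses only payment_type.
df = pd.DataFrame({
    "VendorID": [1.0, np.nan, 2.0],
    "passenger_count": [1.0, np.nan, 2.0],
    "store_and_fwd_flag": ["N", None, "N"],
    "payment_type": [1.0, np.nan, np.nan],
})

# Assumption: a row is "bad" if at least one of the four columns is missing.
mask = df[cols].isna().any(axis=1)

bad = df[mask].copy()   # rows with missing values
df = df[~mask].copy()   # original DataFrame without those rows
```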

## Task 5

Add a duration column storing how long each trip has taken (use tpep_pickup_datetime, tpep_dropoff_datetime)
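
A sketch of the duration computation, assuming the two datetime columns are still strings as read from the CSV. The sample timestamps are taken from the first rows shown earlier.

```python
import pandas as pd

# Two sample trips with string timestamps, as they come out of read_csv.
df = pd.DataFrame({
    "tpep_pickup_datetime": ["2020-01-01 00:28:15", "2020-01-01 00:35:39"],
    "tpep_dropoff_datetime": ["2020-01-01 00:33:03", "2020-01-01 00:43:04"],
})

# Parse the strings into datetime64 so they can be subtracted.
for col in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    df[col] = pd.to_datetime(df[col])

# The difference of two datetime columns is a Timedelta column.
df["duration"] = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
print(df["duration"])
```

If a numeric column is preferred, `df["duration"].dt.total_seconds()` converts the result to seconds.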

## Task 6

For each pickup location, determine how many trips have started there
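
One way to count trips per pickup location is `value_counts()`, sketched here on made-up data. The name `pickups_per_location` is an assumption.

```python
import pandas as pd

# Toy data: five trips starting from three locations.
df = pd.DataFrame({"PULocationID": [238, 239, 238, 193, 238]})

# value_counts() returns one count per distinct location, most frequent first.
pickups_per_location = df["PULocationID"].value_counts()
print(pickups_per_location)
```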

## Task 7

Cluster the pickup time of the day into 30-minute intervals (e.g., from 02:00 to 02:30)

| Variant | Mean time (relative) |
| --- | --- |
| `astype(str).str.zfill(2)` | 1× |
| `.map('{:02}'.format)` | ~0.6× |
| `.dt.strftime('%H')` on datetime | 3–4× |
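
A sketch of the binning, assuming each interval is labelled by its start time as an `HH:MM` string and that the zero-padding is done with `.map('{:02}'.format)`. The timestamps are made up and `interval` is a hypothetical name.

```python
import pandas as pd

# Made-up pickup times on different days; only the time of day matters.
pickup = pd.to_datetime(pd.Series([
    "2020-01-01 00:28:15",
    "2020-01-01 02:30:00",
    "2020-01-02 02:59:59",
]))

# Hour as a zero-padded string.
hh = pickup.dt.hour.map("{:02}".format)

# Minutes rounded down to 0 or 30, then zero-padded.
mm = (pickup.dt.minute // 30 * 30).map("{:02}".format)

# Label of the 30-minute interval, e.g. "02:30" for 02:30 to 03:00.
interval = hh + ":" + mm
print(interval.tolist())
```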

## Task 8

For each interval, determine the average number of passengers and the average fare amount
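
A sketch of the two averages, assuming an `interval` column already exists (the column name and values are made up).

```python
import pandas as pd

# Toy data with a precomputed interval label.
df = pd.DataFrame({
    "interval": ["00:00", "00:00", "00:30"],
    "passenger_count": [1.0, 3.0, 2.0],
    "fare_amount": [6.0, 10.0, 5.5],
})

# Selecting both columns before mean() computes the two averages in one pass.
per_interval = df.groupby("interval")[["passenger_count", "fare_amount"]].mean()
print(per_interval)
```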

## Task 9

For each payment type and each interval, determine the average fare amount
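
A sketch grouping on both keys at once. The `interval` column and the name `avg_fare` are assumptions; values are made up.

```python
import pandas as pd

# Toy data with a precomputed interval label.
df = pd.DataFrame({
    "payment_type": [1, 1, 2, 2],
    "interval": ["00:00", "00:00", "00:00", "00:30"],
    "fare_amount": [6.0, 8.0, 5.0, 9.0],
})

# The result is a Series indexed by (payment_type, interval).
avg_fare = df.groupby(["payment_type", "interval"])["fare_amount"].mean()
print(avg_fare)
```

Keeping the result indexed by the pair makes it easy to align with other per-pair averages later.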

## Task 10

For each payment type, determine the interval when the average fare amount is maximum
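
One way to pick the best interval per payment type is `groupby(...).idxmax()`, which returns the row label of the maximum in each group. The sketch starts from a made-up table of per-pair averages; the column name `avg_fare` is an assumption.

```python
import pandas as pd

# Made-up per-(payment_type, interval) averages in flat form.
avg_fare = pd.DataFrame({
    "payment_type": [1, 1, 2, 2],
    "interval": ["00:00", "00:30", "00:00", "00:30"],
    "avg_fare": [7.0, 9.5, 12.0, 4.0],
})

# idxmax gives, for each payment type, the row label with the largest average.
best_rows = avg_fare.groupby("payment_type")["avg_fare"].idxmax()
best = avg_fare.loc[best_rows]
print(best)
```

In case of ties, `idxmax` returns the first occurrence.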

## Task 11

For each payment type, determine the interval when the overall ratio between the tip and the fare amounts is maximum
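
The sketch below reads "overall ratio" as total tips divided by total fares within each group, which is an assumption; averaging per-trip ratios would be a different quantity. Groups with a zero fare total are dropped to avoid dividing by zero. All values are made up.

```python
import pandas as pd

# Toy data with a precomputed interval label.
df = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2],
    "interval": ["00:00", "00:00", "00:30", "00:00", "00:30"],
    "tip_amount": [1.0, 3.0, 1.0, 0.0, 2.0],
    "fare_amount": [10.0, 10.0, 10.0, 5.0, 8.0],
})

# Sum tips and fares per (payment_type, interval).
sums = (
    df.groupby(["payment_type", "interval"])[["tip_amount", "fare_amount"]]
      .sum()
      .reset_index()
)

# Skip groups whose fare total is zero, then compute the overall ratio.
sums = sums[sums["fare_amount"] > 0].copy()
sums["tip_ratio"] = sums["tip_amount"] / sums["fare_amount"]

# Row with the highest ratio for each payment type.
best = sums.loc[sums.groupby("payment_type")["tip_ratio"].idxmax()]
print(best)
```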

## Task 12

Find the location with the highest average fare amount

**PLEASE NOTE:**

Since it is not specified whether the location refers to PU or DO, I'll assume the goal is to find where the most expensive trips ***begin***, and therefore I will consider the **PU** locations.

The same calculation can, of course, also be done using the DO locations, perhaps to explore the cost of reaching certain areas.
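
Under the PU assumption stated above, the computation can be sketched as follows. The data is made up; replacing `PULocationID` with `DOLocationID` gives the dropoff variant.

```python
import pandas as pd

# Toy data: three pickup locations with different fare levels.
df = pd.DataFrame({
    "PULocationID": [238, 238, 193, 193, 151],
    "fare_amount": [6.0, 8.0, 20.0, 30.0, 15.0],
})

# Average fare per pickup location.
avg_by_location = df.groupby("PULocationID")["fare_amount"].mean()

# idxmax returns the location ID whose average is the largest.
top_location = avg_by_location.idxmax()
print(top_location, avg_by_location.max())
```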

## Task 13

Build a new DataFrame (called common) where, for each pickup location, we keep all trips to the 5 most common destinations (i.e., each pickup location can have different common destinations)
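
A possible approach is to rank destinations within each pickup location, keep the top ones, and join back to the trips. The helper `keep_common` and its `top_n` parameter are hypothetical; the toy example uses `top_n=2` to stay small, whereas the task asks for 5. Ties between equally frequent destinations are broken arbitrarily in this sketch.

```python
import pandas as pd

def keep_common(df, top_n=5):
    """Keep trips whose destination is among the top_n most frequent for their pickup location."""
    keys = ["PULocationID", "DOLocationID"]
    # Number of trips for every (pickup, dropoff) pair.
    counts = df.groupby(keys).size().reset_index(name="n")
    # Most frequent destinations first within each pickup location.
    counts = counts.sort_values(["PULocationID", "n"], ascending=[True, False])
    top = counts.groupby("PULocationID").head(top_n)
    # Inner join keeps only trips on the selected pairs.
    return df.merge(top[keys], on=keys, how="inner")

# Toy data: location 1 has three destinations, location 2 has one.
df = pd.DataFrame({
    "PULocationID": [1, 1, 1, 1, 1, 1, 2],
    "DOLocationID": [10, 10, 10, 11, 11, 12, 10],
})

common = keep_common(df, top_n=2)
print(common)
```

Note that `merge` resets the row index, so `common` does not keep the original row labels.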

## Task 14

On the common DataFrame, for each payment type and each interval, determine the average fare amount

## Task 15

Compute the difference of the average fare amount computed in the previous point with those computed at point 9
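
If both sets of averages are Series indexed by (`payment_type`, `interval`), pandas aligns them on that index when subtracting. The sketch assumes the direction "common minus overall"; the names `avg_all` and `avg_common` and all values are made up.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, "00:00"), (1, "00:30"), (2, "00:00")],
    names=["payment_type", "interval"],
)

# Made-up averages on all trips (point 9) and on the common trips (previous point).
avg_all = pd.Series([10.0, 8.0, 12.0], index=idx)
avg_common = pd.Series([9.0, 8.5], index=idx[:2])

# Subtraction aligns on the index; pairs absent from one side become NaN.
diff = avg_common - avg_all
print(diff)
```

A pair can be missing from `common` when none of its trips go to a common destination, which is why the NaN case is worth keeping visible.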

## Task 16

Compute the ratio between the differences computed in the previous point and those computed in point 9

**Note:** You have to compute a ratio for each pair (payment type, interval)
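
A sketch of the per-pair ratio, assuming the differences and the point 9 averages share the same (`payment_type`, `interval`) index. A zero average is replaced by NaN before dividing so that the result is NaN rather than infinite; that choice, the names, and the values are assumptions.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, "00:00"), (1, "00:30"), (2, "00:00")],
    names=["payment_type", "interval"],
)

# Made-up point 9 averages and differences for the same pairs.
avg_all = pd.Series([10.0, 8.0, 0.0], index=idx)
diff = pd.Series([-1.0, 0.5, 2.0], index=idx)

# One ratio per (payment_type, interval); a zero denominator yields NaN.
ratio = diff / avg_all.replace(0, np.nan)
print(ratio)
```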

## Task 17

Build chains of trips. Two trips are consecutive in a chain if:


*   They have the same VendorID
*   The pickup location of the second trip is also the dropoff location of the first trip
*   The pickup time of the second trip is after the dropoff time of the first trip
*   The pickup time of the second trip is at most 2 minutes later than the dropoff time of the first trip
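
The four conditions above can be sketched with a sorted list of pickups per (`VendorID`, pickup location) and a binary search for the first pickup after each dropoff. This is only one possible reading: the helper `build_chains` is hypothetical, each trip gets at most one successor and one predecessor (the earliest free candidate wins), and the data is made up.

```python
from bisect import bisect_right
from collections import defaultdict

import pandas as pd

def build_chains(df, max_gap=pd.Timedelta(minutes=2)):
    """Return a list of chains, each a list of row labels of consecutive trips."""
    # Pickup times and row labels per (vendor, pickup location), sorted by time.
    times, ids = defaultdict(list), defaultdict(list)
    for row in df.sort_values("tpep_pickup_datetime").itertuples():
        key = (row.VendorID, row.PULocationID)
        times[key].append(row.tpep_pickup_datetime)
        ids[key].append(row.Index)

    successor, has_predecessor = {}, set()
    for row in df.sort_values("tpep_dropoff_datetime").itertuples():
        key = (row.VendorID, row.DOLocationID)
        # First candidate whose pickup is strictly after this dropoff.
        pos = bisect_right(times[key], row.tpep_dropoff_datetime)
        while pos < len(times[key]) and times[key][pos] - row.tpep_dropoff_datetime <= max_gap:
            nxt = ids[key][pos]
            if nxt != row.Index and nxt not in has_predecessor:
                successor[row.Index] = nxt
                has_predecessor.add(nxt)
                break
            pos += 1

    # Follow the links starting from trips that have no predecessor.
    chains = []
    for start in df.index:
        if start in has_predecessor:
            continue
        chain = [start]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        chains.append(chain)
    return chains

# Toy data: trips 0 -> 1 -> 2 form a chain; 3 has another vendor, 4 starts too late.
df = pd.DataFrame({
    "VendorID": [1, 1, 1, 2, 1],
    "PULocationID": [238, 239, 238, 151, 151],
    "DOLocationID": [239, 238, 151, 10, 10],
    "tpep_pickup_datetime": pd.to_datetime([
        "2020-01-01 00:28:15", "2020-01-01 00:34:00", "2020-01-01 00:44:30",
        "2020-01-01 00:50:30", "2020-01-01 00:53:00"]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2020-01-01 00:33:03", "2020-01-01 00:43:04", "2020-01-01 00:50:00",
        "2020-01-01 00:55:00", "2020-01-01 00:58:00"]),
})

chains = build_chains(df)
print(chains)
```

Row-by-row iteration keeps the sketch readable but would be slow on the full dataset; a vectorised join on the same keys would be one alternative to explore.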