# Calculating cancellation rates of trips and users

Cancellation rate is a standard metric for ride-sharing platforms: it hints at both user behavior and service reliability. In this post, we'll compute the daily cancellation rate for trips where neither the client nor the driver is banned, using a small taxi-service dataset and pandas. We'll filter out trips that involve banned users, restrict the data to the requested dates, and group by day to get the rate. Let's dive into the details!

### Tables:

#### **Trips Table**
| Column Name | Type     | Description                                               |
|-------------|----------|-----------------------------------------------------------|
| id          | int      | Primary key for this table (unique for each trip).        |
| client_id   | int      | Foreign key referring to `users_id` in the Users table.   |
| driver_id   | int      | Foreign key referring to `users_id` in the Users table.   |
| city_id     | int      | City where the trip occurred.                             |
| status      | str      | str type with values: `('completed', 'cancelled_by_driver', 'cancelled_by_client')`. |
| request_at  | str      | Date of the trip request (in format `YYYY-MM-DD`).        |

#### **Users Table**
| Column Name | Type     | Description                                               |
|-------------|----------|-----------------------------------------------------------|
| users_id    | int      | Primary key for this table (unique for each user).        |
| banned      | str      | str type with values: `('Yes', 'No')`.                   |
| role        | str      | str type with values: `('client', 'driver', 'partner')`. |

### Task:
The cancellation rate is computed by dividing the number of canceled (by client or driver) requests with unbanned users by the total number of requests with unbanned users on that day.
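
As a formula, for a single day $d$ (the symbols $C_d$ and $N_d$ are shorthand introduced here, not part of the original problem statement):

$$
\text{Cancellation Rate}(d) = \frac{C_d}{N_d}, \qquad
\begin{aligned}
N_d &= \text{number of requests on day } d \text{ whose client and driver are both unbanned},\\
C_d &= \text{number of those requests with status } \texttt{cancelled\_by\_client} \text{ or } \texttt{cancelled\_by\_driver}.
\end{aligned}
$$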

Write a solution to find the cancellation rate of requests with unbanned users (both client and driver must not be banned) for each day between "2013-10-01" and "2013-10-03". Round Cancellation Rate to two decimal places.

Return the result table in any order.

### Example Input:

#### **Trips Table**
| id  | client_id | driver_id | city_id | status              | request_at  |
|-----|-----------|-----------|---------|---------------------|-------------|
| 1   | 1         | 10        | 1       | completed           | 2013-10-01  |
| 2   | 2         | 11        | 1       | cancelled_by_driver | 2013-10-01  |
| 3   | 3         | 12        | 6       | completed           | 2013-10-01  |
| 4   | 4         | 13        | 6       | cancelled_by_client | 2013-10-01  |
| 5   | 1         | 10        | 1       | completed           | 2013-10-02  |
| 6   | 2         | 11        | 6       | completed           | 2013-10-02  |
| 7   | 3         | 12        | 6       | completed           | 2013-10-02  |
| 8   | 2         | 12        | 12      | completed           | 2013-10-03  |
| 9   | 3         | 10        | 12      | completed           | 2013-10-03  |
| 10  | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03  |

#### **Users Table**
| users_id | banned | role   |
|----------|--------|--------|
| 1        | No     | client |
| 2        | Yes    | client |
| 3        | No     | client |
| 4        | No     | client |
| 10       | No     | driver |
| 11       | No     | driver |
| 12       | No     | driver |
| 13       | No     | driver |

### Example Output:
| Day         | Cancellation Rate |
|-------------|-------------------|
| 2013-10-01  | 0.33              |
| 2013-10-02  | 0.00              |
| 2013-10-03  | 0.50              |

### Explanation:
On 2013-10-01:
  - There were 4 requests in total, 2 of which were canceled.
  - However, the request with Id=2 was made by a banned client (User_Id=2), so it is ignored in the calculation.
  - Hence there are 3 unbanned requests in total, 1 of which was canceled.
  - The Cancellation Rate is (1 / 3) = 0.33

On 2013-10-02:
  - There were 3 requests in total, 0 of which were canceled.
  - The request with Id=6 was made by a banned client, so it is ignored.
  - Hence there are 2 unbanned requests in total, 0 of which were canceled.
  - The Cancellation Rate is (0 / 2) = 0.00

On 2013-10-03:
  - There were 3 requests in total, 1 of which was canceled.
  - The request with Id=8 was made by a banned client, so it is ignored.
  - Hence there are 2 unbanned requests in total, 1 of which was canceled.
  - The Cancellation Rate is (1 / 2) = 0.50

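The arithmetic in the explanation above can be checked in a few lines of plain Python. This is only a sketch: the `counts` dictionary is typed in by hand from the explanation (cancelled and total unbanned requests per day) and is not derived from the tables.

```python
# (cancelled, total) counts of unbanned requests per day,
# copied by hand from the explanation above.
counts = {
    "2013-10-01": (1, 3),
    "2013-10-02": (0, 2),
    "2013-10-03": (1, 2),
}

# Rate = cancelled / total, rounded to two decimal places.
rates = {
    day: round(cancelled / total, 2)
    for day, (cancelled, total) in counts.items()
}

for day, rate in rates.items():
    print(f"{day}  {rate:.2f}")
```

The `:.2f` format only affects how the number is printed; the stored value for 2013-10-02 is simply `0.0`.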

In [3]:
import pandas as pd
import numpy as np

data = [['1', '1', '10', '1', 'completed', '2013-10-01'], 
        ['2', '2', '11', '1', 'cancelled_by_driver', '2013-10-01'], 
        ['3', '3', '12', '6', 'completed', '2013-10-01'], 
        ['4', '4', '13', '6', 'cancelled_by_client', '2013-10-01'], 
        ['5', '1', '10', '1', 'completed', '2013-10-02'], 
        ['6', '2', '11', '6', 'completed', '2013-10-02'], 
        ['7', '3', '12', '6', 'completed', '2013-10-02'], 
        ['8', '2', '12', '12', 'completed', '2013-10-03'], 
        ['9', '3', '10', '12', 'completed', '2013-10-03'], 
        ['10', '4', '13', '12', 'cancelled_by_driver', '2013-10-03']]
trips = pd.DataFrame(data, 
                     columns=['id', 
                              'client_id', 
                              'driver_id', 
                              'city_id', 
                              'status', 
                              'request_at']).astype({'id':'Int64', 
                                                     'client_id':'Int64', 
                                                     'driver_id':'Int64', 
                                                     'city_id':'Int64', 
                                                     'status':'object', 
                                                     'request_at':'object'})
display(trips)

data = [['1', 'No', 'client'], 
        ['2', 'Yes', 'client'], 
        ['3', 'No', 'client'], 
        ['4', 'No', 'client'], 
        ['10', 'No', 'driver'], 
        ['11', 'No', 'driver'], 
        ['12', 'No', 'driver'], 
        ['13', 'No', 'driver']]
users = pd.DataFrame(data, 
                     columns=['users_id', 
                              'banned', 
                              'role']).astype({'users_id':'Int64', 
                                               'banned':'object', 
                                               'role':'object'})
display(users)

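|   | id | client_id | driver_id | city_id | status              | request_at |
|---|----|-----------|-----------|---------|---------------------|------------|
| 0 | 1  | 1         | 10        | 1       | completed           | 2013-10-01 |
| 1 | 2  | 2         | 11        | 1       | cancelled_by_driver | 2013-10-01 |
| 2 | 3  | 3         | 12        | 6       | completed           | 2013-10-01 |
| 3 | 4  | 4         | 13        | 6       | cancelled_by_client | 2013-10-01 |
| 4 | 5  | 1         | 10        | 1       | completed           | 2013-10-02 |
| 5 | 6  | 2         | 11        | 6       | completed           | 2013-10-02 |
| 6 | 7  | 3         | 12        | 6       | completed           | 2013-10-02 |
| 7 | 8  | 2         | 12        | 12      | completed           | 2013-10-03 |
| 8 | 9  | 3         | 10        | 12      | completed           | 2013-10-03 |
| 9 | 10 | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03 |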


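|   | users_id | banned | role   |
|---|----------|--------|--------|
| 0 | 1        | No     | client |
| 1 | 2        | Yes    | client |
| 2 | 3        | No     | client |
| 3 | 4        | No     | client |
| 4 | 10       | No     | driver |
| 5 | 11       | No     | driver |
| 6 | 12       | No     | driver |
| 7 | 13       | No     | driver |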

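Before filtering, it can help to confirm that the two tables agree with their descriptions: `users_id` is unique, and every `client_id` and `driver_id` refers to a known user. The sketch below is an optional sanity check, not part of the original solution. It uses a shortened copy of the data, and the names `ids_unique`, `clients_ok` and `drivers_ok` are made up for illustration.

```python
import pandas as pd

# Shortened stand-ins for the trips and users tables.
trips = pd.DataFrame({
    "id": [1, 2, 3],
    "client_id": [1, 2, 3],
    "driver_id": [10, 11, 12],
    "status": ["completed", "cancelled_by_driver", "completed"],
    "request_at": ["2013-10-01", "2013-10-01", "2013-10-01"],
})
users = pd.DataFrame({
    "users_id": [1, 2, 3, 10, 11, 12],
    "banned": ["No", "Yes", "No", "No", "No", "No"],
    "role": ["client", "client", "client", "driver", "driver", "driver"],
})

# Primary key check: no user ID appears twice.
ids_unique = users["users_id"].is_unique

# Foreign key checks: every client and driver exists in the users table.
clients_ok = bool(trips["client_id"].isin(users["users_id"]).all())
drivers_ok = bool(trips["driver_id"].isin(users["users_id"]).all())

print(ids_unique, clients_ok, drivers_ok)
```

If one of these checks fails on real data, the later filtering steps still run, but trips with unknown users would silently count as unbanned.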

**Step 1. Identify banned users:**

- `users[users["banned"] == "Yes"]`: Filters the `users` DataFrame to include only rows where the `banned` column is `"Yes"`.
- `["users_id"]`: Selects the `users_id` column from the filtered DataFrame.
- `.tolist()`: Converts the resulting column of banned user IDs into a Python list, `banned_users`.

In [5]:
banned_users = users[users["banned"] == "Yes"]["users_id"].tolist()

display(banned_users)

[2]

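The same lookup can also be written with a single `.loc` call, which selects rows and the column in one step instead of chaining two indexing operations. This is an alternative sketch on a copy of the example users, not the code used in the rest of this post.

```python
import pandas as pd

users = pd.DataFrame({
    "users_id": [1, 2, 3, 4, 10, 11, 12, 13],
    "banned": ["No", "Yes", "No", "No", "No", "No", "No", "No"],
    "role": ["client"] * 4 + ["driver"] * 4,
})

# Row condition and column name go into the same .loc indexer.
banned_users = users.loc[users["banned"] == "Yes", "users_id"].tolist()

print(banned_users)
```
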
**Step 2. Filter out trips involving banned users:**
    
- `~trips["driver_id"].isin(banned_users)`: Checks, for each trip, that the `driver_id` is not in the `banned_users` list. `.isin(...)` returns a Boolean mask, and `~` negates it, so it reads as "not".
- `~trips["client_id"].isin(banned_users)`: Similarly checks that the `client_id` is not in the `banned_users` list.
- `&`: Combines the two conditions element-wise, so only trips where neither the driver nor the client is banned are kept.
- `trips[...]`: Filters the `trips` DataFrame based on the combined condition.

In [7]:
filtered_trips = trips[~trips["driver_id"].isin(banned_users) & ~trips["client_id"].isin(banned_users)]

display(filtered_trips)

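|   | id | client_id | driver_id | city_id | status              | request_at |
|---|----|-----------|-----------|---------|---------------------|------------|
| 0 | 1  | 1         | 10        | 1       | completed           | 2013-10-01 |
| 2 | 3  | 3         | 12        | 6       | completed           | 2013-10-01 |
| 3 | 4  | 4         | 13        | 6       | cancelled_by_client | 2013-10-01 |
| 4 | 5  | 1         | 10        | 1       | completed           | 2013-10-02 |
| 6 | 7  | 3         | 12        | 6       | completed           | 2013-10-02 |
| 8 | 9  | 3         | 10        | 12      | completed           | 2013-10-03 |
| 9 | 10 | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03 |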

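For readers coming from SQL, the same filter can be expressed as two inner joins against the unbanned users, one for the client and one for the driver. The sketch below is an alternative to the `isin` approach, shown on the first day of the example data only. The names `unbanned` and `filtered` are chosen for illustration.

```python
import pandas as pd

trips = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "client_id": [1, 2, 3, 4],
    "driver_id": [10, 11, 12, 13],
    "status": ["completed", "cancelled_by_driver",
               "completed", "cancelled_by_client"],
    "request_at": ["2013-10-01"] * 4,
})
users = pd.DataFrame({
    "users_id": [1, 2, 3, 4, 10, 11, 12, 13],
    "banned": ["No", "Yes", "No", "No", "No", "No", "No", "No"],
})

# One-column frame holding only the IDs of unbanned users.
unbanned = users.loc[users["banned"] == "No", ["users_id"]]

# Each inner join drops trips whose client (then driver) is not unbanned.
filtered = (
    trips
    .merge(unbanned.rename(columns={"users_id": "client_id"}), on="client_id")
    .merge(unbanned.rename(columns={"users_id": "driver_id"}), on="driver_id")
)

print(sorted(filtered["id"].tolist()))
```

Here trip 2 is dropped because client 2 is banned, leaving trips 1, 3 and 4. One behavioral difference is worth knowing: the join also drops trips whose client or driver is missing from the users table, while the `isin` filter keeps them.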

**Step 3. Filter trips for specific dates:**

- `filtered_trips["request_at"] == "2013-10-01"`: Checks if the `request_at` date is `"2013-10-01"`; the other two comparisons do the same for `"2013-10-02"` and `"2013-10-03"`.
- `|`: Logical OR operator; includes rows where the condition on either side is true.
- `filtered_trips[...]`: Filters `filtered_trips` to include only rows where the `request_at` date matches one of the three specified dates.

In [9]:
filtered_trips = filtered_trips[(filtered_trips["request_at"] == "2013-10-01") | 
                                (filtered_trips["request_at"] == "2013-10-02") | 
                                (filtered_trips["request_at"] == "2013-10-03")]

display(filtered_trips)

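|   | id | client_id | driver_id | city_id | status              | request_at |
|---|----|-----------|-----------|---------|---------------------|------------|
| 0 | 1  | 1         | 10        | 1       | completed           | 2013-10-01 |
| 2 | 3  | 3         | 12        | 6       | completed           | 2013-10-01 |
| 3 | 4  | 4         | 13        | 6       | cancelled_by_client | 2013-10-01 |
| 4 | 5  | 1         | 10        | 1       | completed           | 2013-10-02 |
| 6 | 7  | 3         | 12        | 6       | completed           | 2013-10-02 |
| 8 | 9  | 3         | 10        | 12      | completed           | 2013-10-03 |
| 9 | 10 | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03 |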

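Chaining `==` comparisons with `|` gets unwieldy as the date range grows. Two shorter alternatives are sketched below on a small made-up frame that includes dates outside the range: `Series.between` for a contiguous range and `Series.isin` for an explicit list of days. The string comparison relies on the assumption that `request_at` always uses the `YYYY-MM-DD` format, which sorts in calendar order.

```python
import pandas as pd

# Made-up rows: ids 1 and 4 fall outside the requested range.
filtered_trips = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "request_at": ["2013-09-30", "2013-10-01", "2013-10-03", "2013-10-04"],
})

# between() is inclusive on both ends by default.
in_range = filtered_trips["request_at"].between("2013-10-01", "2013-10-03")
by_range = filtered_trips[in_range]

# Equivalent filter for an explicit list of days.
days = ["2013-10-01", "2013-10-02", "2013-10-03"]
by_list = filtered_trips[filtered_trips["request_at"].isin(days)]

print(by_range["id"].tolist(), by_list["id"].tolist())
```

If the dates were stored in another format, converting the column with `pd.to_datetime` first would be the safer route.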

**Step 4. Group trips by date and calculate cancellation rates:**

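- `filtered_trips.groupby('request_at')`: Groups the `filtered_trips` DataFrame by the `request_at` date.
- `["status"]`: Focuses on the `status` column within each group.
- `.apply(lambda x: ...)`: Runs the function once per day, where `x` is the `status` Series for that day.
- `x.str.contains("cancelled")`: Checks if each value in the `status` column contains the word "cancelled", returning a Boolean Series (True/False). This matches both `cancelled_by_driver` and `cancelled_by_client`.
- `.mean()`: Calculates the mean of the Boolean Series. Since True counts as 1 and False as 0, this is the proportion of cancellations in that group.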

In [11]:
grouped = filtered_trips.groupby('request_at')["status"].apply(lambda x: x.str.contains("cancelled").mean())

display(grouped)

request_at
2013-10-01    0.333333
2013-10-02    0.000000
2013-10-03    0.500000
Name: status, dtype: float64

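The same rates can be computed without `apply` by building the Boolean mask once for the whole column and then grouping that mask by day. This is an alternative sketch on the status and date columns of the filtered example data. It uses `str.startswith` instead of `str.contains`, which assumes both cancellation statuses begin with "cancelled", as they do in this dataset.

```python
import pandas as pd

# Status and date columns of the seven filtered example trips.
filtered_trips = pd.DataFrame({
    "status": ["completed", "completed", "cancelled_by_client",
               "completed", "completed",
               "completed", "cancelled_by_driver"],
    "request_at": ["2013-10-01"] * 3 + ["2013-10-02"] * 2 + ["2013-10-03"] * 2,
})

# True for cancelled trips, False otherwise (computed once, not per group).
is_cancelled = filtered_trips["status"].str.startswith("cancelled")

# Mean of the mask per day = share of cancelled trips on that day.
rates = is_cancelled.groupby(filtered_trips["request_at"]).mean()

print(rates)
```
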
**Step 5. Reset index and rename columns:**

- `.reset_index()`: Converts the Series into a DataFrame, with `request_at` as a column instead of the index.
- `.rename(columns={"request_at":"Day","status":"Cancellation Rate"})`: Renames the columns for clarity: `request_at` → `Day` and `status` → `Cancellation Rate`.
- `.round(2)`: Rounds the cancellation rates to 2 decimal places.

In [13]:
grouped = grouped.reset_index().rename(columns={"request_at":"Day","status":"Cancellation Rate"}).round(2)

display(grouped)

|   | Day        | Cancellation Rate |
|---|------------|-------------------|
| 0 | 2013-10-01 | 0.33              |
| 1 | 2013-10-02 | 0.0               |
| 2 | 2013-10-03 | 0.5               |

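Putting the five steps together, the whole calculation fits in one small function. The version below is a sketch rather than a reference solution: the function name `cancellation_rates` and its `start` and `end` parameters are made up for this post, and it assumes `request_at` holds `YYYY-MM-DD` strings so that a string range comparison is valid.

```python
import pandas as pd


def cancellation_rates(trips, users, start, end):
    """Daily cancellation rate for trips where neither party is banned."""
    banned = users.loc[users["banned"] == "Yes", "users_id"]
    keep = (
        ~trips["client_id"].isin(banned)
        & ~trips["driver_id"].isin(banned)
        & trips["request_at"].between(start, end)  # inclusive range
    )
    kept = trips[keep]
    is_cancelled = kept["status"].str.startswith("cancelled")
    rate = is_cancelled.groupby(kept["request_at"]).mean().round(2)
    return rate.rename("Cancellation Rate").rename_axis("Day").reset_index()


# The example data from this post (city_id omitted, it is not needed).
trips = pd.DataFrame(
    [
        (1, 1, 10, "completed", "2013-10-01"),
        (2, 2, 11, "cancelled_by_driver", "2013-10-01"),
        (3, 3, 12, "completed", "2013-10-01"),
        (4, 4, 13, "cancelled_by_client", "2013-10-01"),
        (5, 1, 10, "completed", "2013-10-02"),
        (6, 2, 11, "completed", "2013-10-02"),
        (7, 3, 12, "completed", "2013-10-02"),
        (8, 2, 12, "completed", "2013-10-03"),
        (9, 3, 10, "completed", "2013-10-03"),
        (10, 4, 13, "cancelled_by_driver", "2013-10-03"),
    ],
    columns=["id", "client_id", "driver_id", "status", "request_at"],
)
users = pd.DataFrame({
    "users_id": [1, 2, 3, 4, 10, 11, 12, 13],
    "banned": ["No", "Yes", "No", "No", "No", "No", "No", "No"],
})

result = cancellation_rates(trips, users, "2013-10-01", "2013-10-03")
print(result)
```

One limitation to keep in mind: a day on which every trip involves a banned user produces no group, so it is missing from the result rather than reported with a rate of 0.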

References:
[1] https://leetcode.com/problems/trips-and-users/description/