# Hopper Ride Sharing Company Queries (Part 1)

The success of a ridesharing company like Hopper depends on balancing driver availability against ride demand, so trends in driver onboarding and ride acceptance matter. In this post, we'll work through a data analysis challenge: tracking the number of active drivers and accepted rides for each month of 2020.
By the end, we'll have a step-by-step pandas solution and a set of monthly figures that could inform driver recruitment and service planning. Whether you're a data enthusiast or a professional sharpening your analytical skills, this problem is a good exercise in grouping, merging, and transforming datasets. Let's break it down.

**Table: Drivers**

| Column Name | Type    |
|-------------|---------|
| driver_id   | int     |
| join_date   | date    |

- `driver_id` is the primary key (unique for each row) for this table.
- Each row contains the driver's ID and the date they joined the Hopper company.

---

**Table: Rides**

| Column Name  | Type    |
|--------------|---------|
| ride_id      | int     |
| user_id      | int     |
| requested_at | date    |

- `ride_id` is the primary key (unique for each row) for this table.
- Each row contains the ride's ID, the user's ID who requested it, and the date of the request.
- Some ride requests in this table may not have been accepted.

---

**Table: AcceptedRides**

| Column Name   | Type    |
|---------------|---------|
| ride_id       | int     |
| driver_id     | int     |
| ride_distance | int     |
| ride_duration | int     |

- `ride_id` is the primary key (unique for each row) for this table.
- Each row contains information about an accepted ride.
- Every accepted ride exists in the `Rides` table.

---

**Task:** Write a solution to generate the following statistics for each month of 2020:

1. **Active Drivers (`active_drivers`)**: The number of drivers who have joined the Hopper company by the end of the month.
2. **Accepted Rides (`accepted_rides`)**: The number of accepted rides in that month.

The result should include:

| month | active_drivers | accepted_rides |
|-------|----------------|----------------|

- **Order**: Results must be ordered by month in ascending order (January = 1, February = 2, etc.).

---

### Example

**Input:**

**Drivers Table:**

| driver_id | join_date  |
|-----------|------------|
| 10        | 2019-12-10 |
| 8         | 2020-01-13 |
| 5         | 2020-02-16 |
| 7         | 2020-03-08 |
| 4         | 2020-05-17 |
| 1         | 2020-10-24 |
| 6         | 2021-01-05 |

**Rides Table:**

| ride_id | user_id | requested_at |
|---------|---------|--------------|
| 6       | 75      | 2019-12-09   |
| 1       | 54      | 2020-02-09   |
| 10      | 63      | 2020-03-04   |
| 19      | 39      | 2020-04-06   |
| 3       | 41      | 2020-06-03   |
| 13      | 52      | 2020-06-22   |
| 7       | 69      | 2020-07-16   |
| 17      | 70      | 2020-08-25   |
| 20      | 81      | 2020-11-02   |
| 5       | 57      | 2020-11-09   |
| 2       | 42      | 2020-12-09   |
| 11      | 68      | 2021-01-11   |
| 15      | 32      | 2021-01-17   |
| 12      | 11      | 2021-01-19   |
| 14      | 18      | 2021-01-27   |

**AcceptedRides Table:**

| ride_id | driver_id | ride_distance | ride_duration |
|---------|-----------|---------------|---------------|
| 10      | 10        | 63            | 38            |
| 13      | 10        | 73            | 96            |
| 7       | 8         | 100           | 28            |
| 17      | 7         | 119           | 68            |
| 20      | 1         | 121           | 92            |
| 5       | 7         | 42            | 101           |
| 2       | 4         | 6             | 38            |
| 11      | 8         | 37            | 43            |
| 15      | 8         | 108           | 82            |
| 12      | 8         | 38            | 34            |
| 14      | 1         | 90            | 74            |

**Output:**

| month | active_drivers | accepted_rides |
|-------|----------------|----------------|
| 1     | 2              | 0              |
| 2     | 3              | 0              |
| 3     | 4              | 1              |
| 4     | 4              | 0              |
| 5     | 5              | 0              |
| 6     | 5              | 1              |
| 7     | 5              | 1              |
| 8     | 5              | 1              |
| 9     | 5              | 0              |
| 10    | 6              | 0              |
| 11    | 6              | 2              |
| 12    | 6              | 1              |

**Explanation:**

- By the end of January: two active drivers (10, 8) and no accepted rides.
- By the end of February: three active drivers (10, 8, 5) and no accepted rides.
- By the end of March: four active drivers (10, 8, 5, 7) and one accepted ride (10).
- By the end of April: four active drivers (10, 8, 5, 7) and no accepted rides.
- By the end of May: five active drivers (10, 8, 5, 7, 4) and no accepted rides.
- By the end of June: five active drivers (10, 8, 5, 7, 4) and one accepted ride (13).
- By the end of July: five active drivers (10, 8, 5, 7, 4) and one accepted ride (7).
- By the end of August: five active drivers (10, 8, 5, 7, 4) and one accepted ride (17).
- By the end of September: five active drivers (10, 8, 5, 7, 4) and no accepted rides.
- By the end of October: six active drivers (10, 8, 5, 7, 4, 1) and no accepted rides.
- By the end of November: six active drivers (10, 8, 5, 7, 4, 1) and two accepted rides (20, 5).
- By the end of December: six active drivers (10, 8, 5, 7, 4, 1) and one accepted ride (2).

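The active-driver counts in the explanation can be checked by hand. The sketch below uses only the standard library and the join dates from the example table; the helper name `active_by_end_of` is made up for illustration.

```python
from datetime import date

# Join dates copied from the example Drivers table.
join_dates = [
    date(2019, 12, 10), date(2020, 1, 13), date(2020, 2, 16),
    date(2020, 3, 8), date(2020, 5, 17), date(2020, 10, 24),
    date(2021, 1, 5),
]

def active_by_end_of(month, year=2020):
    """Count drivers whose join date falls in or before the given month."""
    return sum((d.year, d.month) <= (year, month) for d in join_dates)

active_counts = [active_by_end_of(m) for m in range(1, 13)]
print(active_counts)
```

The printed list should line up with the `active_drivers` column of the expected output, one entry per month.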

In [63]:
import pandas as pd

data = [[10, '2019-12-10'], 
        [8, '2020-1-13'], 
        [5, '2020-2-16'], 
        [7, '2020-3-8'], 
        [4, '2020-5-17'], 
        [1, '2020-10-24'], 
        [6, '2021-1-5']]
drivers = pd.DataFrame(
           data, 
           columns=['driver_id', 
                    'join_date']).astype({'driver_id':'Int64', 
                    'join_date':'datetime64[ns]'})
data = [[6, 75, '2019-12-9'],
        [1, 54, '2020-2-9'], 
        [10, 63, '2020-3-4'], 
        [19, 39, '2020-4-6'], 
        [3, 41, '2020-6-3'], 
        [13, 52, '2020-6-22'], 
        [7, 69, '2020-7-16'], 
        [17, 70, '2020-8-25'], 
        [20, 81, '2020-11-2'], 
        [5, 57, '2020-11-9'], 
        [2, 42, '2020-12-9'], 
        [11, 68, '2021-1-11'], 
        [15, 32, '2021-1-17'], 
        [12, 11, '2021-1-19'], 
        [14, 18, '2021-1-27']]
rides = pd.DataFrame(
         data, 
         columns=['ride_id', 
                  'user_id', 
                  'requested_at']).astype({'ride_id':'Int64', 
                  'user_id':'Int64', 
                  'requested_at':'datetime64[ns]'})
data = [[10, 10, 63, 38], 
        [13, 10, 73, 96], 
        [7, 8, 100, 28], 
        [17, 7, 119, 68], 
        [20, 1, 121, 92], 
        [5, 7, 42, 101], 
        [2, 4, 6, 38], 
        [11, 8, 37, 43], 
        [15, 8, 108, 82], 
        [12, 8, 38, 34], 
        [14, 1, 90, 74]]
accepted_rides = pd.DataFrame(
          data, 
          columns=['ride_id', 
                   'driver_id', 
                   'ride_distance', 
                   'ride_duration']).astype({'ride_id':'Int64', 
                   'driver_id':'Int64', 
                   'ride_distance':'Int64', 
                   'ride_duration':'Int64'})
display(drivers, rides, accepted_rides)

Unnamed: 0,driver_id,join_date
0,10,2019-12-10
1,8,2020-01-13
2,5,2020-02-16
3,7,2020-03-08
4,4,2020-05-17
5,1,2020-10-24
6,6,2021-01-05


Unnamed: 0,ride_id,user_id,requested_at
0,6,75,2019-12-09
1,1,54,2020-02-09
2,10,63,2020-03-04
3,19,39,2020-04-06
4,3,41,2020-06-03
5,13,52,2020-06-22
6,7,69,2020-07-16
7,17,70,2020-08-25
8,20,81,2020-11-02
9,5,57,2020-11-09
10,2,42,2020-12-09
11,11,68,2021-01-11
12,15,32,2021-01-17
13,12,11,2021-01-19
14,14,18,2021-01-27


Unnamed: 0,ride_id,driver_id,ride_distance,ride_duration
0,10,10,63,38
1,13,10,73,96
2,7,8,100,28
3,17,7,119,68
4,20,1,121,92
5,5,7,42,101
6,2,4,6,38
7,11,8,37,43
8,15,8,108,82
9,12,8,38,34
10,14,1,90,74


**Step 1: Sorting the drivers DataFrame**
- Orders the drivers DataFrame by the join_date column in ascending order.
- Ensures the rows are sorted chronologically by the date each driver joined the company.

In [64]:
drivers = drivers.sort_values(by=["join_date"])
display(drivers.head())

Unnamed: 0,driver_id,join_date
0,10,2019-12-10
1,8,2020-01-13
2,5,2020-02-16
3,7,2020-03-08
4,4,2020-05-17

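Sorting up front keeps the intermediate tables easy to read. As far as the later grouping is concerned, though, `groupby` already orders its group keys by default (`sort=True`), so the cumulative sum in Step 4 would see the months in chronological order either way. A small sketch with deliberately shuffled rows (the name `unsorted_drivers` is illustrative only):

```python
import pandas as pd

# Deliberately unsorted copy of a few rows from the example Drivers table.
unsorted_drivers = pd.DataFrame({
    "driver_id": [7, 10, 8],
    "join_date": pd.to_datetime(["2020-03-08", "2019-12-10", "2020-01-13"]),
})
unsorted_drivers["year"] = unsorted_drivers["join_date"].dt.year
unsorted_drivers["month"] = unsorted_drivers["join_date"].dt.month

# groupby sorts by the group keys by default (sort=True).
counts = unsorted_drivers.groupby(["year", "month"])["driver_id"].count()
group_keys = list(counts.index)
print(group_keys)
```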

**Step 2: Extracting Year and Month**
- Creates two new columns in the drivers DataFrame:
  - year: The year part of the join_date.
  - month: The month part of the join_date.

**Step 3: Filtering Drivers Who Joined by the End of 2020**
- Filters the drivers DataFrame to include only drivers who joined the company in 2020 or earlier.
- Removes any drivers who joined after 2020, since they cannot count as active in any month of 2020.

In [65]:
drivers["year"] = drivers["join_date"].dt.year
drivers["month"] = drivers["join_date"].dt.month
drivers = drivers[drivers["year"]<=2020]
display(drivers.head())

Unnamed: 0,driver_id,join_date,year,month
0,10,2019-12-10,2019,12
1,8,2020-01-13,2020,1
2,5,2020-02-16,2020,2
3,7,2020-03-08,2020,3
4,4,2020-05-17,2020,5

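The same filter can be written as a comparison on the date itself, without the helper `year` column. This is only an alternative sketch; `drivers_demo` and `cutoff` are names introduced here for illustration.

```python
import pandas as pd

drivers_demo = pd.DataFrame({
    "driver_id": [10, 8, 6],
    "join_date": pd.to_datetime(["2019-12-10", "2020-01-13", "2021-01-05"]),
})

# Everything strictly before 1 January 2021 joined in 2020 or earlier.
cutoff = pd.Timestamp("2021-01-01")
joined_by_2020 = drivers_demo[drivers_demo["join_date"] < cutoff]
print(joined_by_2020)
```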

**Step 4: Grouping and Counting Drivers by Month**
- Grouping: Groups the data by year and month.
- Counting: Counts the number of drivers (driver_id) for each group (i.e., drivers joining in each month).
- Cumulative Sum: Calculates the cumulative sum of drivers up to each month using cumsum, resulting in a new column active_drivers.
- Produces a DataFrame (active_drivers) with one row for each month in which at least one driver joined, holding the number of active drivers at the end of that month. Months without new drivers are not present yet.

In [66]:
active_drivers = drivers.groupby(["year", "month"])[["driver_id"]].count().reset_index()
active_drivers["active_drivers"] = active_drivers["driver_id"].cumsum()
display(active_drivers.head())

Unnamed: 0,year,month,driver_id,active_drivers
0,2019,12,1,1
1,2020,1,1,2
2,2020,2,1,3
3,2020,3,1,4
4,2020,5,1,5

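An alternative sketch of the same running total: instead of separate `year` and `month` columns, each join date can be converted to a monthly period with `dt.to_period("M")`, counted, and accumulated. The variable names below are assumptions made for this example.

```python
import pandas as pd

join_dates = pd.Series(pd.to_datetime([
    "2019-12-10", "2020-01-13", "2020-02-16",
    "2020-03-08", "2020-05-17", "2020-10-24",
]))

# One monthly Period per driver, e.g. 2020-01.
join_period = join_dates.dt.to_period("M")

# Joins per period in chronological order, then a running total.
running_total = join_period.value_counts().sort_index().cumsum()
print(running_total)
```

A single period index keeps year and month together, which avoids grouping on two columns.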

**Step 5: Filtering for the Year 2020**
- Filters the active_drivers DataFrame to include only rows corresponding to the year 2020.


In [67]:
active_drivers = active_drivers[active_drivers["year"]==2020]
display(active_drivers.head())

Unnamed: 0,year,month,driver_id,active_drivers
1,2020,1,1,2
2,2020,2,1,3
3,2020,3,1,4
4,2020,5,1,5
5,2020,10,1,6


**Step 6: Creating a Month DataFrame**
- Creates a new DataFrame month_df with a single column month containing values from 1 to 12 (representing the months of the year 2020).


In [68]:
month_df = pd.DataFrame()
month_df["month"] = range(1, 13)
display(month_df.head())

Unnamed: 0,month
0,1
1,2
2,3
3,4
4,5

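The month scaffold matters because the grouped driver data only has rows for months in which someone joined. A short sketch, using the join months from the example and an illustrative name `month_scaffold`, builds the scaffold in one call and lists the months that would otherwise be missing:

```python
import pandas as pd

month_scaffold = pd.DataFrame({"month": range(1, 13)})

# Months of 2020 in which at least one driver joined (from the example data).
months_with_joins = [1, 2, 3, 5, 10]

has_join = month_scaffold["month"].isin(months_with_joins)
missing = month_scaffold.loc[~has_join, "month"].tolist()
print(missing)
```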

**Step 7: Merging Active Drivers Data**
- Merge: Combines the month_df DataFrame with active_drivers using the month column.
- Forward Fill: Fills missing values in active_drivers column using the ffill method to propagate the last valid observation forward.
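- Ensures every month in 2020 has a value for active_drivers, even if no new drivers joined in that month. Note that a forward fill has nothing to carry into January: if no driver joined in January 2020, that row stays missing and would need the count of drivers who joined before 2020.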

In [69]:
month_df = month_df.merge(active_drivers[["month", "active_drivers"]], how="left")
month_df["active_drivers"] = month_df["active_drivers"].ffill()
display(month_df.head())

Unnamed: 0,month,active_drivers
0,1,2
1,2,3
2,3,4
3,4,4
4,5,5

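The forward fill works on the example because a driver joined in January 2020. As a hedged sketch of a variant that does not depend on that, the code below starts from the number of drivers who joined before 2020 and adds a running total of 2020 joins. The data and the names `drivers_demo`, `baseline`, and `joins_per_month` are hypothetical.

```python
import pandas as pd

# Hypothetical data: two drivers joined in 2019, the next one in March 2020.
drivers_demo = pd.DataFrame({
    "driver_id": [1, 2, 3],
    "join_date": pd.to_datetime(["2019-06-01", "2019-11-20", "2020-03-08"]),
})

# Drivers already on board when 2020 starts.
baseline = int((drivers_demo["join_date"] < pd.Timestamp("2020-01-01")).sum())

# Joins per month of 2020, with 0 for months without new drivers.
in_2020 = drivers_demo[drivers_demo["join_date"].dt.year == 2020]
joins_per_month = (
    in_2020.groupby(in_2020["join_date"].dt.month)["driver_id"]
    .count()
    .reindex(range(1, 13), fill_value=0)
)

active = (baseline + joins_per_month.cumsum()).tolist()
print(active)
```

Here January and February report the two drivers from 2019 instead of a missing value.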

**Step 8: Preparing and Filtering the Rides Data**
- Sorting: Orders the rides DataFrame by requested_at in ascending order.
- Extracting Year and Month: Adds year and month columns from the requested_at date.
- Filtering: Retains only rides that were requested in the year 2020.
- Prepares the rides DataFrame for analysis by including only relevant data.

In [70]:
rides = rides.sort_values(by=["requested_at"])
rides["year"] = rides["requested_at"].dt.year
rides["month"] = rides["requested_at"].dt.month
rides = rides[rides["year"]==2020]
display(rides.head())

Unnamed: 0,ride_id,user_id,requested_at,year,month
1,1,54,2020-02-09,2020,2
2,10,63,2020-03-04,2020,3
3,19,39,2020-04-06,2020,4
4,3,41,2020-06-03,2020,6
5,13,52,2020-06-22,2020,6

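The preparation in this step can also be written as one chained expression that leaves the input untouched. This is a sketch on a few example rows; `rides_demo` and `rides_2020` are illustrative names, and the `year` column is skipped because it is not needed once the filter has run.

```python
import pandas as pd

rides_demo = pd.DataFrame({
    "ride_id": [6, 1, 2, 11],
    "requested_at": pd.to_datetime(
        ["2019-12-09", "2020-02-09", "2020-12-09", "2021-01-11"]
    ),
})

rides_2020 = (
    rides_demo
    .loc[rides_demo["requested_at"].dt.year == 2020]       # keep 2020 requests
    .assign(month=lambda df: df["requested_at"].dt.month)  # add the month
    .sort_values("requested_at")
)
print(rides_2020)
```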

**Step 9: Filtering Accepted Rides**
- Performs an inner join between rides and accepted_rides on the ride_id column to keep only rides that were accepted.
- Filters out ride requests that were not accepted.

In [71]:
rides = rides.merge(accepted_rides[["ride_id"]], how="inner")
display(rides.head())

Unnamed: 0,ride_id,user_id,requested_at,year,month
0,10,63,2020-03-04,2020,3
1,13,52,2020-06-22,2020,6
2,7,69,2020-07-16,2020,7
3,17,70,2020-08-25,2020,8
4,20,81,2020-11-02,2020,11

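Because only the `ride_id` column of `accepted_rides` is used, the inner join acts as a filter. One way to express that intent directly, shown here as a sketch with made-up frame names, is a membership test with `isin`:

```python
import pandas as pd

rides_demo = pd.DataFrame({"ride_id": [1, 10, 19, 13], "month": [2, 3, 4, 6]})
accepted_demo = pd.DataFrame({"ride_id": [10, 13, 7]})

# Keep only the requests whose ride_id appears among the accepted rides.
is_accepted = rides_demo["ride_id"].isin(accepted_demo["ride_id"])
accepted_only = rides_demo[is_accepted]
print(accepted_only)
```

Unlike a merge, this keeps the original row index and cannot duplicate rows, although duplicates are not a concern here since `ride_id` is the primary key of both tables.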

**Step 10: Grouping and Counting Accepted Rides**
- Grouping: Groups the rides DataFrame by month.
- Counting: Counts the number of accepted rides (ride_id) for each month.
- Renaming: Renames the column ride_id to accepted_rides for clarity.
- Creates a DataFrame containing the number of accepted rides for each month.

In [72]:
rides = rides.groupby(["month"])[["ride_id"]].count()
rides = rides.reset_index().rename(columns={"ride_id": "accepted_rides"})
display(rides.head())

Unnamed: 0,month,accepted_rides
0,3,1
1,6,1
2,7,1
3,8,1
4,11,2

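Counting and zero-filling can also be done in one go. The sketch below takes the request months of the seven accepted 2020 rides from the example, counts them with `value_counts`, and uses `reindex` with `fill_value=0` so that months without rides appear as 0. The names are chosen for this example only.

```python
import pandas as pd

# Request months of the accepted 2020 rides (ride_ids 10, 13, 7, 17, 20, 5, 2).
accepted_months = pd.Series([3, 6, 7, 8, 11, 11, 12], name="month")

rides_per_month = (
    accepted_months.value_counts()
    .reindex(range(1, 13), fill_value=0)  # one entry per month, 0 if absent
)
print(rides_per_month.tolist())
```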

**Step 11: Merging Accepted Rides with Month Data**
- Merge: Combines month_df with the accepted rides data using the month column.
- Fill Missing Values: Replaces any missing values in the accepted_rides column with 0, indicating no rides were accepted in those months.
- Ensures that each month in 2020 has a corresponding value for accepted_rides.

In [73]:
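month_df = month_df.merge(rides[["month", "accepted_rides"]], how="left")
# Months without accepted rides come back as missing values; treat them as 0.
month_df["accepted_rides"] = month_df["accepted_rides"].fillna(0).astype(int)
display(month_df)  # show the final result rather than the intermediate rides table

Unnamed: 0,month,active_drivers,accepted_rides
0,1,2,0
1,2,3,0
2,3,4,1
3,4,4,0
4,5,5,0
5,6,5,1
6,7,5,1
7,8,5,1
8,9,5,0
9,10,6,0
10,11,6,2
11,12,6,1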

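For reference, the whole walkthrough can be condensed into a single function. This is a sketch under the assumptions stated in the task, not the only way to solve it: the function name `hopper_monthly_stats` and its `year` parameter are introduced here, and it seeds January with the drivers who joined before the target year rather than relying on a forward fill.

```python
import pandas as pd

def hopper_monthly_stats(drivers, rides, accepted_rides, year=2020):
    """Return month, active_drivers, accepted_rides for each month of `year`."""
    months = range(1, 13)

    # Drivers who joined before `year` are the starting count for January.
    join = pd.to_datetime(drivers["join_date"])
    baseline = int((join.dt.year < year).sum())
    joined = join[join.dt.year == year].dt.month.value_counts()
    active = baseline + joined.reindex(months, fill_value=0).cumsum()

    # Accepted rides requested in `year`, counted per month.
    req = rides[rides["ride_id"].isin(accepted_rides["ride_id"])]
    req_at = pd.to_datetime(req["requested_at"])
    accepted = (
        req_at[req_at.dt.year == year].dt.month.value_counts()
        .reindex(months, fill_value=0)
    )

    return pd.DataFrame({
        "month": list(months),
        "active_drivers": active.to_numpy(),
        "accepted_rides": accepted.to_numpy(),
    })

# Example data from the problem statement.
drivers = pd.DataFrame({
    "driver_id": [10, 8, 5, 7, 4, 1, 6],
    "join_date": ["2019-12-10", "2020-01-13", "2020-02-16", "2020-03-08",
                  "2020-05-17", "2020-10-24", "2021-01-05"],
})
rides = pd.DataFrame({
    "ride_id": [6, 1, 10, 19, 3, 13, 7, 17, 20, 5, 2, 11, 15, 12, 14],
    "user_id": [75, 54, 63, 39, 41, 52, 69, 70, 81, 57, 42, 68, 32, 11, 18],
    "requested_at": [
        "2019-12-09", "2020-02-09", "2020-03-04", "2020-04-06", "2020-06-03",
        "2020-06-22", "2020-07-16", "2020-08-25", "2020-11-02", "2020-11-09",
        "2020-12-09", "2021-01-11", "2021-01-17", "2021-01-19", "2021-01-27",
    ],
})
accepted_rides = pd.DataFrame({
    "ride_id": [10, 13, 7, 17, 20, 5, 2, 11, 15, 12, 14],
})

result = hopper_monthly_stats(drivers, rides, accepted_rides)
print(result)
```

On the example data, `result` should match the expected output table from the problem statement.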

References: [1] https://leetcode.com/problems/hopper-company-queries-i/?lang=pythondata