Day 15 of Python Summer Party

by Interview Master

Uber

UberPool Driver Earnings Optimization Strategies

You are a Business Analyst on the Uber Pool Product Team working to optimize driver compensation. The team aims to understand how trip characteristics impact driver earnings. Your goal is to develop data-driven recommendations that maximize driver earnings potential.

In [1]:
import pandas as pd
import numpy as np

fct_trips = pd.read_csv('fct_trips.csv')
fct_trips_df = fct_trips.copy()

print(fct_trips.info())
print()
print(fct_trips_df)
print()
print("=" * 150)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   trip_id         15 non-null     int64  
 1   driver_id       15 non-null     int64  
 2   ride_type       15 non-null     object 
 3   trip_date       15 non-null     object 
 4   rider_count     15 non-null     int64  
 5   total_distance  15 non-null     float64
 6   total_earnings  15 non-null     float64
dtypes: float64(2), int64(3), object(2)
memory usage: 972.0+ bytes
None

    trip_id  driver_id ride_type   trip_date  rider_count  total_distance  \
0       101          1  UberPool  2024-07-05            3            10.5   
1       102          1  UberPool  2024-07-15            2             8.0   
2       103          2  UberPool  2024-08-10            4            15.0   
3       104          3     UberX  2024-07-20            1             5.0   
4       105          2  UberPool  2

Question 1 of 3

What is the average driver earnings per completed UberPool ride with more than two riders between July 1st and September 30th, 2024? This analysis will help isolate trips that meet specific rider thresholds to understand their impact on driver earnings.
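The three conditions in this question (ride type, rider threshold, date window) can be combined into a single boolean mask before averaging. The sketch below is a minimal illustration on five rows copied from the notebook output (trips 101 to 105), so the printed value describes that sample only, not the full dataset. The names `sample`, `mask`, and `avg_earnings` are hypothetical.

```python
import pandas as pd

# Five illustrative rows copied from the notebook output (trips 101-105).
sample = pd.DataFrame({
    "trip_id": [101, 102, 103, 104, 105],
    "ride_type": ["UberPool", "UberPool", "UberPool", "UberX", "UberPool"],
    "trip_date": pd.to_datetime(
        ["2024-07-05", "2024-07-15", "2024-08-10", "2024-07-20", "2024-09-01"]
    ),
    "rider_count": [3, 2, 4, 1, 3],
    "total_earnings": [22.5, 18.0, 35.0, 12.0, 30.0],
})

# Combine all three conditions; Series.between is inclusive on both ends by default.
mask = (
    (sample["ride_type"] == "UberPool")
    & (sample["rider_count"] > 2)
    & sample["trip_date"].between(pd.Timestamp("2024-07-01"), pd.Timestamp("2024-09-30"))
)

# Average earnings over the rows that satisfy every condition.
avg_earnings = sample.loc[mask, "total_earnings"].mean()
print(f"Sample average: ${avg_earnings:.2f}")
```

On this sample, trips 101, 103, and 105 pass the mask. The cells that follow reach the same kind of result in separate steps, which makes each intermediate frame easier to inspect.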

In [2]:
# The dataset has no missing values (see the info() output above).
# Before profiling the numerical and non-numerical columns,
# we first convert the trip_date column to datetime.
fct_trips_df['trip_date'] = pd.to_datetime(fct_trips_df['trip_date'], format='%Y-%m-%d', errors='coerce')
print(fct_trips_df.info())
print()
print(fct_trips_df.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   trip_id         15 non-null     int64         
 1   driver_id       15 non-null     int64         
 2   ride_type       15 non-null     object        
 3   trip_date       15 non-null     datetime64[ns]
 4   rider_count     15 non-null     int64         
 5   total_distance  15 non-null     float64       
 6   total_earnings  15 non-null     float64       
dtypes: datetime64[ns](1), float64(2), int64(3), object(1)
memory usage: 972.0+ bytes
None

   trip_id  driver_id ride_type  trip_date  rider_count  total_distance  \
0      101          1  UberPool 2024-07-05            3            10.5   
1      102          1  UberPool 2024-07-15            2             8.0   
2      103          2  UberPool 2024-08-10            4            15.0   
3      104          3     UberX 2024-0
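Because the conversion uses `errors='coerce'`, any string that does not match the format becomes `NaT` silently instead of raising. A quick count of newly created `NaT` values confirms nothing was lost. The sketch below uses a made-up series (`raw_dates`, with one deliberately invalid entry) purely for illustration.

```python
import pandas as pd

# Hypothetical input: the last value is deliberately invalid.
raw_dates = pd.Series(["2024-07-05", "2024-07-15", "not-a-date"])

# Invalid strings are turned into NaT rather than raising an error.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Values that were present before parsing but are missing afterwards.
n_unparsed = int(parsed.isna().sum() - raw_dates.isna().sum())
print("Values lost during parsing:", n_unparsed)
```

In the notebook itself, `trip_date` still shows 15 non-null values after conversion, so no dates were coerced to `NaT`.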

In [3]:
# Making a list of all categorical variables ('object' or 'category')
cat_cols = fct_trips_df.select_dtypes(include=['object', 'category']).columns

# Iterate through each categorical column and print the count of each level, followed by a separator line.
for column in cat_cols:
    print(fct_trips_df[column].value_counts())
    print("-" * 50)


ride_type
UberPool    13
UberX        2
Name: count, dtype: int64
--------------------------------------------------


In [4]:
# Making a list of all numerical variables ('int64', 'float64', 'complex')
num_cols = fct_trips_df.select_dtypes(include=['int64', 'float64', 'complex']).columns

# Iterate through each numerical column and print summary statistics, followed by a separator line.
for column in num_cols:
    print(fct_trips_df[column].describe())
    print("-" * 50)


count     15.000000
mean     108.000000
std        4.472136
min      101.000000
25%      104.500000
50%      108.000000
75%      111.500000
max      115.000000
Name: trip_id, dtype: float64
--------------------------------------------------
count    15.000000
mean      3.466667
std       1.995232
min       1.000000
25%       2.000000
50%       3.000000
75%       5.000000
max       7.000000
Name: driver_id, dtype: float64
--------------------------------------------------
count    15.000000
mean      3.133333
std       1.245946
min       1.000000
25%       2.500000
50%       3.000000
75%       4.000000
max       5.000000
Name: rider_count, dtype: float64
--------------------------------------------------
count    15.000000
mean     12.366667
std       6.432248
min       4.000000
25%       7.500000
50%      11.000000
75%      16.500000
max      25.000000
Name: total_distance, dtype: float64
--------------------------------------------------
count    15.000000
mean     29.366667
std      

In [5]:
# Checking missing values across each column
missing_values = fct_trips_df.isnull().sum()
print('The number of missing values on each column of the data set is:')
print(missing_values)

# Check for complete duplicate records
duplicate_records = fct_trips_df.duplicated().sum()
print('The number of duplicate values on the data set is:', duplicate_records)


The number of missing values on each column of the data set is:
trip_id           0
driver_id         0
ride_type         0
trip_date         0
rider_count       0
total_distance    0
total_earnings    0
dtype: int64
The number of duplicate values on the data set is: 0
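`duplicated()` with no arguments flags only rows that match on every column. If `trip_id` is meant to identify a trip uniquely (an assumption, the source does not state it), a key-level check is a useful complement. The frame below is made up to show the difference between the two checks.

```python
import pandas as pd

# Made-up frame: trip 102 appears twice with different earnings.
trips = pd.DataFrame({
    "trip_id": [101, 102, 102],
    "total_earnings": [22.5, 18.0, 19.0],
})

# Rows identical across all columns (none here).
full_row_dupes = int(trips.duplicated().sum())

# Rows that repeat an earlier trip_id, regardless of the other columns.
key_dupes = int(trips.duplicated(subset=["trip_id"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Repeated trip_id values:", key_dupes)
```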


In [6]:
# Now we filter to rides between July 1st, 2024 and September 30th, 2024 (both dates inclusive)
julsept_fct_trips_df = fct_trips_df[(fct_trips_df['trip_date'] >= '2024-07-01') & (fct_trips_df['trip_date'] <= '2024-09-30')]
print(julsept_fct_trips_df.head())
print()
print(julsept_fct_trips_df.info())


   trip_id  driver_id ride_type  trip_date  rider_count  total_distance  \
0      101          1  UberPool 2024-07-05            3            10.5   
1      102          1  UberPool 2024-07-15            2             8.0   
2      103          2  UberPool 2024-08-10            4            15.0   
3      104          3     UberX 2024-07-20            1             5.0   
4      105          2  UberPool 2024-09-01            3            12.0   

   total_earnings  
0            22.5  
1            18.0  
2            35.0  
3            12.0  
4            30.0  

<class 'pandas.core.frame.DataFrame'>
Index: 14 entries, 0 to 14
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   trip_id         14 non-null     int64         
 1   driver_id       14 non-null     int64         
 2   ride_type       14 non-null     object        
 3   trip_date       14 non-null     datetime64[ns]
 4   rider_count   

In [7]:
# Now we filter to UberPool rides with more than two riders (rider_count > 2)
grouped_sep_jul_df = julsept_fct_trips_df.query("(ride_type == 'UberPool') & (rider_count > 2)")
print(grouped_sep_jul_df)

# Now we calculate the average total_earnings
average_earnings_sep_jul = grouped_sep_jul_df['total_earnings'].mean()

# Answer to Question 1
print(f"\nThe average total earnings for UberPool rides with more than 2 riders between July 1st, 2024 and September 30th, 2024 is: ${average_earnings_sep_jul:.2f}")


    trip_id  driver_id ride_type  trip_date  rider_count  total_distance  \
0       101          1  UberPool 2024-07-05            3            10.5   
2       103          2  UberPool 2024-08-10            4            15.0   
4       105          2  UberPool 2024-09-01            3            12.0   
5       106          4  UberPool 2024-09-15            5            20.0   
7       108          5  UberPool 2024-08-25            4            11.0   
8       109          1  UberPool 2024-09-30            3             6.0   
10      111          3  UberPool 2024-08-05            4            13.0   
12      113          6  UberPool 2024-07-30            3            22.0   
13      114          6  UberPool 2024-08-22            4            18.0   
14      115          7  UberPool 2024-09-21            5            25.0   

    total_earnings  
0             22.5  
2             35.0  
4             30.0  
5             50.0  
7             28.0  
8             16.0  
10            32

Question 2 of 3

For completed UberPool rides between July 1st and September 30th, 2024, derive a q2 column calculating earnings per mile (total_earnings divided by total_distance) and then compute the average earnings per mile for rides with more than two riders. This calculation will reveal efficiency metrics for driver compensation.
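The question asks for the mean of per-trip ratios, which is not the same number as total earnings divided by total distance: the first weights every trip equally, the second weights longer trips more. The sketch below shows both on three rows copied from the notebook output (trips 101, 103, and 105), so the values describe that sample only. The variable names are hypothetical.

```python
import pandas as pd

# Three UberPool rides with more than two riders, copied from the notebook output.
sample = pd.DataFrame({
    "trip_id": [101, 103, 105],
    "total_distance": [10.5, 15.0, 12.0],
    "total_earnings": [22.5, 35.0, 30.0],
})

# Per-trip earnings per mile, as the question defines it.
sample["earnings_per_mile"] = sample["total_earnings"] / sample["total_distance"]

# What the question asks for: the average of the per-trip ratios.
mean_of_ratios = sample["earnings_per_mile"].mean()

# A different metric: pooled earnings over pooled distance.
ratio_of_totals = sample["total_earnings"].sum() / sample["total_distance"].sum()

print(f"Mean of per-trip ratios: {mean_of_ratios:.4f}")
print(f"Ratio of totals:         {ratio_of_totals:.4f}")
```

The notebook computes the first quantity, which matches the wording of the question.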

In [8]:
# Display data again
print(julsept_fct_trips_df.head())
print()
print(julsept_fct_trips_df.info())
print()



   trip_id  driver_id ride_type  trip_date  rider_count  total_distance  \
0      101          1  UberPool 2024-07-05            3            10.5   
1      102          1  UberPool 2024-07-15            2             8.0   
2      103          2  UberPool 2024-08-10            4            15.0   
3      104          3     UberX 2024-07-20            1             5.0   
4      105          2  UberPool 2024-09-01            3            12.0   

   total_earnings  
0            22.5  
1            18.0  
2            35.0  
3            12.0  
4            30.0  

<class 'pandas.core.frame.DataFrame'>
Index: 14 entries, 0 to 14
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   trip_id         14 non-null     int64         
 1   driver_id       14 non-null     int64         
 2   ride_type       14 non-null     object        
 3   trip_date       14 non-null     datetime64[ns]
 4   rider_count   

In [9]:
# Copy the dataframe to avoid overwriting
q2_julsept_fct_df = julsept_fct_trips_df.copy()

# Calculate earnings per mile
q2_julsept_fct_df['earnings_per_mile'] = q2_julsept_fct_df['total_earnings'] / q2_julsept_fct_df['total_distance']
print(q2_julsept_fct_df.head())
print()
print(q2_julsept_fct_df.info())
print()


   trip_id  driver_id ride_type  trip_date  rider_count  total_distance  \
0      101          1  UberPool 2024-07-05            3            10.5   
1      102          1  UberPool 2024-07-15            2             8.0   
2      103          2  UberPool 2024-08-10            4            15.0   
3      104          3     UberX 2024-07-20            1             5.0   
4      105          2  UberPool 2024-09-01            3            12.0   

   total_earnings  earnings_per_mile  
0            22.5           2.142857  
1            18.0           2.250000  
2            35.0           2.333333  
3            12.0           2.400000  
4            30.0           2.500000  

<class 'pandas.core.frame.DataFrame'>
Index: 14 entries, 0 to 14
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   trip_id            14 non-null     int64         
 1   driver_id          14 non-null     int64       

In [10]:
# Now we filter to UberPool rides with more than two riders (rider_count > 2)
q2_grouped_sep_jul_df = q2_julsept_fct_df.query("(ride_type == 'UberPool') & (rider_count > 2)")
print(q2_grouped_sep_jul_df)

# Now we calculate the average earnings per mile
q2_avg_earn_per_mile_sep_jul = q2_grouped_sep_jul_df['earnings_per_mile'].mean().round(2)

# Answer to Question 2
print(f"\nThe average earnings per mile for UberPool rides with more than 2 riders between July 1st, 2024 and September 30th, 2024 is: ${q2_avg_earn_per_mile_sep_jul:.2f} per mile")


    trip_id  driver_id ride_type  trip_date  rider_count  total_distance  \
0       101          1  UberPool 2024-07-05            3            10.5   
2       103          2  UberPool 2024-08-10            4            15.0   
4       105          2  UberPool 2024-09-01            3            12.0   
5       106          4  UberPool 2024-09-15            5            20.0   
7       108          5  UberPool 2024-08-25            4            11.0   
8       109          1  UberPool 2024-09-30            3             6.0   
10      111          3  UberPool 2024-08-05            4            13.0   
12      113          6  UberPool 2024-07-30            3            22.0   
13      114          6  UberPool 2024-08-22            4            18.0   
14      115          7  UberPool 2024-09-21            5            25.0   

    total_earnings  earnings_per_mile  
0             22.5           2.142857  
2             35.0           2.333333  
4             30.0           2.500000  
5  

Question 3 of 3

Identify the combination of rider count and total distance that results in the highest average driver earnings per UberPool ride between July 1st and September 30th, 2024. This analysis directly recommends optimal trip combination strategies to maximize driver earnings.
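One way to read this question is as a group-by on the pair (`rider_count`, `total_distance`), followed by picking the group with the highest mean. The sketch below illustrates that on five rows copied from the notebook output (trips 101, 102, 103, 105, and 106), and adds a `trip_count` column (a hypothetical addition, not part of the notebook) to show how many trips back each average.

```python
import pandas as pd

# Five UberPool rows copied from the notebook output.
sample = pd.DataFrame({
    "rider_count": [3, 2, 4, 3, 5],
    "total_distance": [10.5, 8.0, 15.0, 12.0, 20.0],
    "total_earnings": [22.5, 18.0, 35.0, 30.0, 50.0],
})

# Average earnings and number of trips for each (rider_count, total_distance) pair.
combos = sample.groupby(["rider_count", "total_distance"]).agg(
    average_driver_earnings=("total_earnings", "mean"),
    trip_count=("total_earnings", "size"),
)

# idxmax on a MultiIndex returns the (rider_count, total_distance) tuple of the best group.
best_combo = combos["average_driver_earnings"].idxmax()
best_value = combos.loc[best_combo, "average_driver_earnings"]

print("Best combination in the sample:", best_combo)
print("Average earnings for that combination:", best_value)
```

When `trip_count` is 1 for every group, as in this sample, each "average" is just one trip's earnings, so the ranking reflects individual trips rather than a pattern across many rides.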

In [11]:
# Copy the dataframe to avoid overwriting
q3_julsept_fct_df = q2_julsept_fct_df.copy()
print(q3_julsept_fct_df)



    trip_id  driver_id ride_type  trip_date  rider_count  total_distance  \
0       101          1  UberPool 2024-07-05            3            10.5   
1       102          1  UberPool 2024-07-15            2             8.0   
2       103          2  UberPool 2024-08-10            4            15.0   
3       104          3     UberX 2024-07-20            1             5.0   
4       105          2  UberPool 2024-09-01            3            12.0   
5       106          4  UberPool 2024-09-15            5            20.0   
7       108          5  UberPool 2024-08-25            4            11.0   
8       109          1  UberPool 2024-09-30            3             6.0   
9       110          2  UberPool 2024-07-07            2             7.0   
10      111          3  UberPool 2024-08-05            4            13.0   
11      112          5     UberX 2024-09-10            1             4.0   
12      113          6  UberPool 2024-07-30            3            22.0   
13      114 

In [12]:
# Now we filter to UberPool rides only, without restricting rider_count
q3_grouped_sep_jul_df = q3_julsept_fct_df.query("(ride_type == 'UberPool')")
print(q3_grouped_sep_jul_df)


    trip_id  driver_id ride_type  trip_date  rider_count  total_distance  \
0       101          1  UberPool 2024-07-05            3            10.5   
1       102          1  UberPool 2024-07-15            2             8.0   
2       103          2  UberPool 2024-08-10            4            15.0   
4       105          2  UberPool 2024-09-01            3            12.0   
5       106          4  UberPool 2024-09-15            5            20.0   
7       108          5  UberPool 2024-08-25            4            11.0   
8       109          1  UberPool 2024-09-30            3             6.0   
9       110          2  UberPool 2024-07-07            2             7.0   
10      111          3  UberPool 2024-08-05            4            13.0   
12      113          6  UberPool 2024-07-30            3            22.0   
13      114          6  UberPool 2024-08-22            4            18.0   
14      115          7  UberPool 2024-09-21            5            25.0   

    total_e

In [13]:
# Now I need to narrow down the combinations in that data frame.
# I will do that by grouping on rider_count and total_distance, averaging total_earnings,
# and sorting the result from highest to lowest average.
q3_rdrcount_earnings_df = (
    q3_grouped_sep_jul_df
    .groupby(['rider_count', 'total_distance'])
    .agg(average_driver_earnings=('total_earnings', 'mean'))
    .sort_values('average_driver_earnings', ascending=False)
    .reset_index()
)
print("Table of combinations of rider count and total distance, sorted by average driver earnings per UberPool ride between July 1st and September 30th, 2024")
print(q3_rdrcount_earnings_df)


Table of combinations of rider count and total distance, sorted by average driver earnings per UberPool ride between July 1st and September 30th, 2024
    rider_count  total_distance  average_driver_earnings
0             5            25.0                     60.0
1             5            20.0                     50.0
2             3            22.0                     45.0
3             4            18.0                     42.0
4             4            15.0                     35.0
5             4            13.0                     32.0
6             3            12.0                     30.0
7             4            11.0                     28.0
8             3            10.5                     22.5
9             2             8.0                     18.0
10            3             6.0                     16.0
11            2             7.0                     15.0


In [14]:
# Now that this is sorted, we need to call out the highest average driver earnings per UberPool ride between July 1st and September 30th, 2024
# Answer to Question 3
print("The combination of rider count and total distance that results in the highest average driver earnings per UberPool ride between July 1st and September 30th, 2024 is:")
print(q3_rdrcount_earnings_df.nlargest(1, 'average_driver_earnings'))



The combination of rider count and total distance that results in the highest average driver earnings per UberPool ride between July 1st and September 30th, 2024 is:
   rider_count  total_distance  average_driver_earnings
0            5            25.0                     60.0
