# Uber Trends & Trips — 2024 Analytics Project

This project explores Uber’s 2024 ride data with a focus on:
- Booking trends across vehicle types
- Customer and driver behavior
- Ratings and satisfaction
- Revenue breakdown by payment method


## Exploratory Data Analysis

### First 5 rows of the dataset

Each row represents one ride booking.

The columns include:
- **Date** and **Time** of the booking
- **Booking ID** and **Customer ID** (unique for each ride and customer)
- **Vehicle Type** (like Auto, Bike, Go Sedan, etc.)
- **Pickup and Drop Locations**
- **Avg VTAT** (average time for the driver to reach the pickup location, in minutes)
- **Avg CTAT** (average trip duration from pickup to destination, in minutes)
- **Booking Status** (whether the ride was completed, cancelled, etc.)
- **Booking Value** (total fare)
- **Ratings** (from customer and driver)
- **Payment Method** (UPI, Debit Card, etc.)

Some rows have missing values (like when the ride was incomplete or cancelled), which we'll explore and clean later.

In [3]:
# Loading the dataset

import pandas as pd

df = pd.read_csv('../data/ride_bookings.csv')

# Showing the first 5 rows

df.head()

Unnamed: 0,Date,Time,Booking ID,Booking Status,Customer ID,Vehicle Type,Pickup Location,Drop Location,Avg VTAT,Avg CTAT,...,Reason for cancelling by Customer,Cancelled Rides by Driver,Driver Cancellation Reason,Incomplete Rides,Incomplete Rides Reason,Booking Value,Ride Distance,Driver Ratings,Customer Rating,Payment Method
0,2024-03-23,12:29:38,"""CNR5884300""",No Driver Found,"""CID1982111""",eBike,Palam Vihar,Jhilmil,,,...,,,,,,,,,,
1,2024-11-29,18:01:39,"""CNR1326809""",Incomplete,"""CID4604802""",Go Sedan,Shastri Nagar,Gurgaon Sector 56,4.9,14.0,...,,,,1.0,Vehicle Breakdown,237.0,5.73,,,UPI
2,2024-08-23,08:56:10,"""CNR8494506""",Completed,"""CID9202816""",Auto,Khandsa,Malviya Nagar,13.4,25.8,...,,,,,,627.0,13.58,4.9,4.9,Debit Card
3,2024-10-21,17:17:25,"""CNR8906825""",Completed,"""CID2610914""",Premier Sedan,Central Secretariat,Inderlok,13.1,28.5,...,,,,,,416.0,34.02,4.6,5.0,UPI
4,2024-09-16,22:08:00,"""CNR1950162""",Completed,"""CID9933542""",Bike,Ghitorni Village,Khan Market,5.3,19.6,...,,,,,,737.0,48.21,4.1,4.3,UPI
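
In the preview above, the Booking ID and Customer ID values appear to be stored with literal double quotes around them. A minimal sketch of stripping them is shown below, assuming the quotes are part of the stored text; `sample` is a hypothetical stand-in for `df`, built from two ID pairs copied from the preview.

```python
import pandas as pd

# Tiny stand-in for df, using ID values copied from the preview above.
sample = pd.DataFrame({
    "Booking ID": ['"CNR5884300"', '"CNR1326809"'],
    "Customer ID": ['"CID1982111"', '"CID4604802"'],
})

# Remove the literal double quotes wrapped around each ID.
for col in ["Booking ID", "Customer ID"]:
    sample[col] = sample[col].str.strip('"')

print(sample)
```

The same loop could be run on `df` itself once the quoting has been confirmed on the full dataset.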


### How big is the data?

Number of rows and columns in the table.  
Each row is one ride, and each column is a detail about that ride.

In [5]:
df.shape
# we have 150,000 rows and 21 columns

(150000, 21)

### Column names

Checking the names of all the columns in the dataset to know what information is included.

In [6]:
df.columns

Index(['Date', 'Time', 'Booking ID', 'Booking Status', 'Customer ID',
       'Vehicle Type', 'Pickup Location', 'Drop Location', 'Avg VTAT',
       'Avg CTAT', 'Cancelled Rides by Customer',
       'Reason for cancelling by Customer', 'Cancelled Rides by Driver',
       'Driver Cancellation Reason', 'Incomplete Rides',
       'Incomplete Rides Reason', 'Booking Value', 'Ride Distance',
       'Driver Ratings', 'Customer Rating', 'Payment Method'],
      dtype='object')

### Data types and missing values

Checking what type of data each column holds (like text or numbers) and whether anything is missing.

### What the Data Looks Like

- The dataset has **150,000 rows** (ride bookings) and **21 columns**.
- Most columns contain **text** (like dates, locations, IDs) or **numbers** (like ratings and distances).
- Columns like **Avg VTAT**, **Avg CTAT**, **Booking Value**, and **Ratings** are numbers (float).
- Columns like **Booking Status**, **Vehicle Type**, and **Payment Method** are text (object).
- Some columns have missing values:
  - **Avg VTAT** is missing in 10,500 rows
  - **Avg CTAT**, **Booking Value**, **Ride Distance**, and **Payment Method** are missing in 48,000 rows each
  - **Ratings** (driver and customer) are missing in 57,000 rows each
  - **Cancellation and incomplete ride columns** are mostly empty, because they only apply to cancelled or incomplete rides


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 21 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Date                               150000 non-null  object 
 1   Time                               150000 non-null  object 
 2   Booking ID                         150000 non-null  object 
 3   Booking Status                     150000 non-null  object 
 4   Customer ID                        150000 non-null  object 
 5   Vehicle Type                       150000 non-null  object 
 6   Pickup Location                    150000 non-null  object 
 7   Drop Location                      150000 non-null  object 
 8   Avg VTAT                           139500 non-null  float64
 9   Avg CTAT                           102000 non-null  float64
 10  Cancelled Rides by Customer        10500 non-null   float64
 11  Reason for cancelling by Customer  1050
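
The listing above shows **Date** and **Time** stored as text (`object`). For trend analysis it can help to combine them into a single datetime column. The sketch below assumes the `YYYY-MM-DD` and `HH:MM:SS` formats seen in the preview; `sample` and the new column names (`Booking Datetime`, `Hour`, `Month`) are hypothetical.

```python
import pandas as pd

# Tiny stand-in for df, using date/time values copied from the preview.
sample = pd.DataFrame({
    "Date": ["2024-03-23", "2024-11-29"],
    "Time": ["12:29:38", "18:01:39"],
})

# Combine the two text columns into one datetime column.
sample["Booking Datetime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"], format="%Y-%m-%d %H:%M:%S"
)

# Derived fields that are handy for grouping bookings by time.
sample["Hour"] = sample["Booking Datetime"].dt.hour
sample["Month"] = sample["Booking Datetime"].dt.month

print(sample.dtypes)
```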

### Summary of the numbers and insights from the columns

Quick stats about the number columns, like the average, minimum, and maximum values.

- **Driver arrival time (Avg VTAT)** averages about **8.5 minutes**; the middle half of rides fall between **5.3 and 11.3 minutes**, with a range of **2 to 20 minutes**.
- **Trip duration (Avg CTAT)** averages about **29 minutes** and ranges from **10** to **45 minutes**.
- **All cancellation and incomplete ride columns contain only 1s**, so they act as **flags** rather than counts: 1 means the event happened, and a missing value means it did not (there are no 0s).
- **Average ride fare (Booking Value)** is about **508**, ranging from **50** to **4,277** (the highest fares are possibly long-distance or UberXL rides).
- **Ride distance** averages about **24.6 km** and ranges from **1 km** to **50 km**.
- **Customer ratings** are slightly higher (average **4.40**) than **driver ratings** (average **4.23**); both are generally high, with a minimum of **3.0**.

Overall, among rides with recorded trip data, a typical ride lasts about 29 minutes and costs around 500 (currency is not stated in the data; the Delhi NCR locations and UPI payments suggest Indian rupees), and rated rides average above 4 from both drivers and customers.



In [8]:
df.describe()

| | Avg VTAT | Avg CTAT | Cancelled Rides by Customer | Cancelled Rides by Driver | Incomplete Rides | Booking Value | Ride Distance | Driver Ratings | Customer Rating |
|---|---|---|---|---|---|---|---|---|---|
| count | 139500.0 | 102000.0 | 10500.0 | 27000.0 | 9000.0 | 102000.0 | 102000.0 | 93000.0 | 93000.0 |
| mean | 8.456352 | 29.149636 | 1.0 | 1.0 | 1.0 | 508.295912 | 24.637012 | 4.230992 | 4.404584 |
| std | 3.773564 | 8.902577 | 0.0 | 0.0 | 0.0 | 395.805774 | 14.002138 | 0.436871 | 0.437819 |
| min | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 50.0 | 1.0 | 3.0 | 3.0 |
| 25% | 5.3 | 21.6 | 1.0 | 1.0 | 1.0 | 234.0 | 12.46 | 4.1 | 4.2 |
| 50% | 8.3 | 28.8 | 1.0 | 1.0 | 1.0 | 414.0 | 23.72 | 4.3 | 4.5 |
| 75% | 11.3 | 36.8 | 1.0 | 1.0 | 1.0 | 689.0 | 36.82 | 4.6 | 4.8 |
| max | 20.0 | 45.0 | 1.0 | 1.0 | 1.0 | 4277.0 | 50.0 | 5.0 | 5.0 |
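
Because the three cancellation and incomplete ride columns hold only `1.0` or a missing value, they can be read as yes/no flags. The sketch below shows one way to turn them into booleans with `notna()`. The rows in `sample` are made-up placeholders for illustration, not real rows from the dataset, and `flags` is a hypothetical name.

```python
import numpy as np
import pandas as pd

flag_cols = [
    "Cancelled Rides by Customer",
    "Cancelled Rides by Driver",
    "Incomplete Rides",
]

# Placeholder rows (illustrative only): each column is either 1.0 or missing.
sample = pd.DataFrame({
    "Cancelled Rides by Customer": [np.nan, np.nan, 1.0],
    "Cancelled Rides by Driver": [np.nan, 1.0, np.nan],
    "Incomplete Rides": [1.0, np.nan, np.nan],
})

# True where the flag is set, False where the value is missing.
flags = sample[flag_cols].notna()

# Summing booleans counts how many rides carry each flag.
print(flags.sum())
```

On the full `df`, the sums should match the `count` row of `df.describe()` for these columns.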


### Missing values

Checking how many missing entries each column has.  
Columns with a high number of missing values might need to be cleaned, filled, or dropped depending on how important they are.

### What’s missing in the data?

- Most of the missing values come from cancellation and incomplete ride columns. This makes sense because not all rides were cancelled or incomplete.
  - **141,000 rows** don’t have "Incomplete Ride" info (only 9,000 rides were incomplete)
  - **139,500 rows** don’t have customer cancellation info
  - **123,000 rows** don’t have driver cancellation info

- Ratings and trip details are also missing in many rows:
  - **57,000 rows** are missing the driver rating, and the same number are missing the customer rating
  - **48,000 rows** don’t have trip details like distance, fare, payment method, or trip time (these are likely cancelled rides and bookings where no driver was found)

**Summary**

- Not all rides are cancelled or incomplete, so those columns are blank most of the time.
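- Trip details like fare, distance, and duration are missing for 48,000 rows. They are not limited to completed rides: the Incomplete ride in the preview (row 1) still has a fare, distance, and payment method.
- Ratings are missing for 57,000 rows, which is the 48,000 rows without trip details plus 9,000 more, the same as the number of incomplete rides. This suggests ratings are mostly absent because the ride never finished, not only because people skipped rating.

- The main ride information (Date, Time, Booking ID, Booking Status, Customer ID, Vehicle Type, and both locations) is fully filled in, with no missing values at all.

**Next step:** clean or safely ignore the columns with missing data, depending on which columns we analyze.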



In [9]:
df.isnull().sum().sort_values(ascending=False)


Incomplete Rides Reason              141000
Incomplete Rides                     141000
Cancelled Rides by Customer          139500
Reason for cancelling by Customer    139500
Driver Cancellation Reason           123000
Cancelled Rides by Driver            123000
Customer Rating                       57000
Driver Ratings                        57000
Ride Distance                         48000
Booking Value                         48000
Payment Method                        48000
Avg CTAT                              48000
Avg VTAT                              10500
Time                                      0
Drop Location                             0
Pickup Location                           0
Vehicle Type                              0
Customer ID                               0
Booking Status                            0
Booking ID                                0
Date                                      0
dtype: int64
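
To check whether the gaps really follow the booking status, one option is to compute the share of missing values per status. The sketch below is a rough illustration, not the notebook's own method: `sample` is a hypothetical stand-in for `df` built from the five preview rows, and `missing_share` is an assumed name.

```python
import numpy as np
import pandas as pd

# Stand-in for df: status and three numeric columns from the five preview rows.
sample = pd.DataFrame({
    "Booking Status": ["No Driver Found", "Incomplete", "Completed", "Completed", "Completed"],
    "Avg VTAT": [np.nan, 4.9, 13.4, 13.1, 5.3],
    "Booking Value": [np.nan, 237.0, 627.0, 416.0, 737.0],
    "Driver Ratings": [np.nan, np.nan, 4.9, 4.6, 4.1],
})

cols = ["Avg VTAT", "Booking Value", "Driver Ratings"]

# Fraction of missing values in each column, split by booking status.
missing_share = sample[cols].isna().groupby(sample["Booking Status"]).mean()

print(missing_share)
```

A share of 1.0 for a status means that column is never filled for that kind of booking, which indicates structural missingness rather than data to impute.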