In [2]:
import pandas as pd
import numpy as np

# --- 1. Load the Dataset ---
# We're loading the dataset into a pandas DataFrame. Since it's an Excel file,
# we use pd.read_excel() and need the 'openpyxl' library installed.
# Using a try-except block is good practice to handle potential errors.

try:
    # This is an absolute path on the author's machine; adjust it to wherever
    # OLA_DataSet.xlsx lives in your project (e.g. its 'data' folder).
    # openpyxl reads .xlsx files; a legacy .xls file needs the 'xlrd' library instead.
    df = pd.read_excel(r'C:\Users\jaiku\PycharmProjects\Ola_Ride_Analytics\data\OLA_DataSet.xlsx')
    print("✅ Excel file loaded successfully!")
except FileNotFoundError:
    print("❌ Error: 'OLA_DataSet.xlsx' not found. Make sure it's in the 'data' directory.")
    # Re-raise so the cell stops here; exit() would stop the notebook kernel,
    # not just this cell.
    raise
except Exception as e:
    print(f"An error occurred: {e}")
    raise


# --- 2. Initial Reconnaissance ---
# This step is like a doctor's initial check-up on a patient. We are getting
# a high-level overview of the dataset's health and structure.

print("\n--- First 5 Rows of the Dataset (df.head()) ---")
# df.head() shows us the first few rows. It helps us visually inspect the
# column names, the type of data in each column (text, numbers, dates?),
# and check for any obvious issues like formatting problems.
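# display() renders the table and returns None, so wrapping it in print()
# would only add a stray "None" to the output.
display(df.head())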


print("\n--- Technical Summary of the Dataset (df.info()) ---")
# df.info() gives a technical summary. We pay close attention to:
# - Non-Null Count: This immediately tells us which columns have missing data.
# - Dtype: This is the data type of each column. A column like 'Booking_Value'
#   being an 'object' instead of a number (int64/float64) is a red flag
#   that it contains non-numeric characters and needs cleaning.
# df.info() prints its report itself and returns None, so no print() is needed.
df.info()


print("\n--- Statistical Summary of the Dataset (df.describe()) ---")
# df.describe(include='all') provides statistical summaries.
# For NUMERIC columns: look at 'min' and 'max' for impossible values
# (e.g., negative Ride_Distance, a rating above 5). Check the 'mean' vs.
# '50%' (median) to get a sense of skewness.
# For CATEGORICAL (object) columns: 'unique' tells us how many distinct
# categories exist. 'top' shows the most frequent category, and 'freq'
# shows its count. This is great for understanding the distribution.
# As with df.head(), call display() directly to avoid printing "None".
display(df.describe(include='all'))

# --- 3. Initial Observations & Next Steps ---
# Based on the output, we can start forming a plan.
# For example, if 'Date' and 'Time' are 'object' types, we know our first
# cleaning task will be to combine and convert them to a proper datetime format.
# If there are many missing values in 'Customer_Rating', we need to investigate why.
print("\n--- Initial Analysis Complete ---")




✅ Excel file loaded successfully!

--- First 5 Rows of the Dataset (df.head()) ---


Unnamed: 0,Date,Time,Booking_ID,Booking_Status,Customer_ID,Vehicle_Type,Pickup_Location,Drop_Location,V_TAT,C_TAT,Canceled_Rides_by_Customer,Canceled_Rides_by_Driver,Incomplete_Rides,Incomplete_Rides_Reason,Booking_Value,Payment_Method,Ride_Distance,Driver_Ratings,Customer_Rating,Vehicle Images
0,2024-07-26 14:00:00,14:00:00,CNR7153255142,Canceled by Driver,CID713523,Prime Sedan,Tumkur Road,RT Nagar,,,,Personal & Car related issue,,,444,,0,,,https://cdn-icons-png.flaticon.com/128/14183/1...
1,2024-07-25 22:20:00,22:20:00,CNR2940424040,Success,CID225428,Bike,Magadi Road,Varthur,203.0,30.0,,,No,,158,Cash,13,4.1,4.0,https://cdn-icons-png.flaticon.com/128/9983/99...
2,2024-07-30 19:59:00,19:59:00,CNR2982357879,Success,CID270156,Prime SUV,Sahakar Nagar,Varthur,238.0,130.0,,,No,,386,UPI,40,4.2,4.8,https://cdn-icons-png.flaticon.com/128/9983/99...
3,2024-07-22 03:15:00,03:15:00,CNR2395710036,Canceled by Customer,CID581320,eBike,HSR Layout,Vijayanagar,,,Driver is not moving towards pickup location,,,,384,,0,,,https://cdn-icons-png.flaticon.com/128/6839/68...
4,2024-07-02 09:02:00,09:02:00,CNR1797421769,Success,CID939555,Mini,Rajajinagar,Chamarajpet,252.0,80.0,,,No,,822,Credit Card,45,4.0,3.0,https://cdn-icons-png.flaticon.com/128/3202/32...


None

--- Technical Summary of the Dataset (df.info()) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103024 entries, 0 to 103023
Data columns (total 20 columns):
 #   Column                      Non-Null Count   Dtype         
---  ------                      --------------   -----         
 0   Date                        103024 non-null  datetime64[ns]
 1   Time                        103024 non-null  object        
 2   Booking_ID                  103024 non-null  object        
 3   Booking_Status              103024 non-null  object        
 4   Customer_ID                 103024 non-null  object        
 5   Vehicle_Type                103024 non-null  object        
 6   Pickup_Location             103024 non-null  object        
 7   Drop_Location               103024 non-null  object        
 8   V_TAT                       63967 non-null   float64       
 9   C_TAT                       63967 non-null   float64       
 10  Canceled_Rides_by_Customer  10499 non-null   

Unnamed: 0,Date,Time,Booking_ID,Booking_Status,Customer_ID,Vehicle_Type,Pickup_Location,Drop_Location,V_TAT,C_TAT,Canceled_Rides_by_Customer,Canceled_Rides_by_Driver,Incomplete_Rides,Incomplete_Rides_Reason,Booking_Value,Payment_Method,Ride_Distance,Driver_Ratings,Customer_Rating,Vehicle Images
count,103024,103024,103024,103024,103024,103024,103024,103024,63967.0,63967.0,10499,18434,63967,3926,103024.0,63967,103024.0,63967.0,63967.0,103024
unique,,1440,103024,4,94544,7,50,50,,,5,4,2,3,,4,,,,7
top,,00:53:00,CNR7153255142,Success,CID954071,Prime Sedan,Banashankari,Peenya,,,Driver is not moving towards pickup location,Personal & Car related issue,No,Customer Demand,,Cash,,,,https://cdn-icons-png.flaticon.com/128/14183/1...
freq,,101,1,63967,5,14877,2201,2159,,,3175,6542,60041,1601,,35022,,,,14877
mean,2024-07-16 11:31:38.879678720,,,,,,,,170.876952,84.873372,,,,,548.751883,,14.189927,3.997457,3.998313,
min,2024-07-01 00:00:00,,,,,,,,35.0,25.0,,,,,100.0,,0.0,3.0,3.0,
25%,2024-07-08 18:41:00,,,,,,,,98.0,55.0,,,,,242.0,,0.0,3.5,3.5,
50%,2024-07-16 11:23:00,,,,,,,,168.0,85.0,,,,,386.0,,8.0,4.0,4.0,
75%,2024-07-24 05:18:00,,,,,,,,238.0,115.0,,,,,621.0,,26.0,4.5,4.5,
max,2024-07-31 23:58:00,,,,,,,,308.0,145.0,,,,,2999.0,,49.0,5.0,5.0,


None

--- Initial Analysis Complete ---


## Analysis of the Initial Output

You've successfully loaded the data, and the initial outputs from `df.info()` and `df.describe()` are revealing several important characteristics of your dataset.

### Key Insights from df.info()

- Missing Data is Structural, Not Random: Several columns (`V_TAT`, `C_TAT`, `Incomplete_Rides`, `Payment_Method`, `Driver_Ratings`, `Customer_Rating`) have exactly the same number of non-null entries: 63,967. The `df.describe()` output shows that `Success` is the most frequent `Booking_Status`, also with exactly 63,967 rows. Matching counts strongly suggest that these fields are recorded only for successful rides, although this should be confirmed row by row. If it holds, this isn't messy data; it's a business rule, and we should not try to "fill in" these values for cancelled rides.
- Cancellation Reasons are Separate: `Canceled_Rides_by_Customer` (10,499 non-null) and `Canceled_Rides_by_Driver` (18,434 non-null) are mostly null because each is filled in only when that party cancelled the ride. A ride can be cancelled by only one party, so the two columns should be mutually exclusive.
- Data Types to Revisit: The `Time` column is an object (text) and, more importantly, it is redundant because the `Date` column already holds the full timestamp. `Incomplete_Rides` is also an object; it has only two distinct values (`No` is the most frequent), so it is logically a boolean flag.
- Low-Value Column: The `Vehicle Images` column contains image URLs. It has only 7 distinct values, the same as the number of vehicle types, and its top value appears 14,877 times, exactly as often as the top `Vehicle_Type` (Prime Sedan). It appears to be one icon per vehicle type, so it adds nothing to the analysis and can be removed.
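
The two structural claims above rest on matching counts, so they are worth checking row by row. The following is a minimal sketch of such a check, not part of the original notebook: the helper names `missingness_matches_success` and `both_cancellation_reasons` are hypothetical, and the small `toy` frame is made-up data that only mimics the column names so the example runs without the Excel file. On the real data you would pass `df` instead.

```python
import numpy as np
import pandas as pd

# Columns that appear to be filled in only for successful rides.
SUCCESS_ONLY_COLS = ["V_TAT", "C_TAT", "Payment_Method", "Driver_Ratings", "Customer_Rating"]


def missingness_matches_success(df: pd.DataFrame) -> pd.Series:
    """For each column, True if it is non-null exactly on the 'Success' rows."""
    is_success = df["Booking_Status"].eq("Success")
    return pd.Series(
        {col: bool(df[col].notna().eq(is_success).all()) for col in SUCCESS_ONLY_COLS}
    )


def both_cancellation_reasons(df: pd.DataFrame) -> int:
    """Count rows where both cancellation-reason columns are filled in."""
    both = df["Canceled_Rides_by_Customer"].notna() & df["Canceled_Rides_by_Driver"].notna()
    return int(both.sum())


# Hypothetical toy data (assumption: mirrors the real column names only).
toy = pd.DataFrame({
    "Booking_Status": ["Success", "Canceled by Driver", "Canceled by Customer", "Success"],
    "V_TAT": [203.0, np.nan, np.nan, 238.0],
    "C_TAT": [30.0, np.nan, np.nan, 130.0],
    "Payment_Method": ["Cash", None, None, "UPI"],
    "Driver_Ratings": [4.1, np.nan, np.nan, 4.2],
    "Customer_Rating": [4.0, np.nan, np.nan, 4.8],
    "Canceled_Rides_by_Customer": [None, None, "Driver is not moving towards pickup location", None],
    "Canceled_Rides_by_Driver": [None, "Personal & Car related issue", None, None],
})

structural = missingness_matches_success(toy)
overlap = both_cancellation_reasons(toy)
print(structural)
print("Rows with both cancellation reasons:", overlap)
```

If every entry of `structural` is `True` and `overlap` is 0 on the real `df`, the missing values can be treated as structural and left as they are.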

### Key Insights from df.describe()

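- Ride Distance Zeros: The min value for `Ride_Distance` is 0, and so is the 25th percentile, so at least a quarter of all rows have zero distance. Unlike the ratings, `Ride_Distance` is populated for all 103,024 rows, and both cancelled bookings shown in `df.head()` have a distance of 0. The zeros are therefore most likely placeholders for rides that never took place rather than data errors. This should be confirmed by breaking the zeros down by `Booking_Status`, and distance statistics should be computed on successful rides only.
- Booking Value Skewness: The mean booking value (about 549) is well above the median, or 50th percentile (386), and the max (2,999) is far above the 75th percentile (621). This indicates that the data is right-skewed. In simple terms, a small number of very expensive rides (possibly long distances or surge pricing) pull the average up. Note that `Booking_Value` is populated for all rows, including cancelled bookings.
- Clean Ratings Data: The `Driver_Ratings` and `Customer_Rating` columns have a min of 3.0 and a max of 5.0, so there are no out-of-range values. The absence of any rating below 3.0 is narrower than one might expect and is worth keeping in mind when interpreting averages.
- Categorical Dominance: For `Booking_Status`, `Success` is the most frequent category (63,967 of 103,024 bookings, about 62%). For `Payment_Method`, `Cash` is the most popular (35,022 of 63,967, about 55%). Among cancellation reasons, "Driver is not moving towards pickup location" leads for customers (3,175 of 10,499) and "Personal & Car related issue" leads for drivers (6,542 of 18,434). These are valuable initial business insights.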
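
Two of the points above can be checked with a few lines of pandas. This is a rough sketch rather than the notebook's own code: the helper names `zero_distance_by_status` and `mean_median_gap` are hypothetical, and the `toy` frame simply reuses the five rows visible in the `df.head()` output so the example is self-contained. Running the same calls on the full `df` is what would actually settle the question.

```python
import pandas as pd


def zero_distance_by_status(df: pd.DataFrame) -> pd.Series:
    """Share of rows with Ride_Distance == 0 within each Booking_Status."""
    return df["Ride_Distance"].eq(0).groupby(df["Booking_Status"]).mean()


def mean_median_gap(values: pd.Series) -> float:
    """Mean minus median; a positive gap hints at right skew."""
    return float(values.mean() - values.median())


# The five rows shown by df.head(), restricted to the columns used here.
toy = pd.DataFrame({
    "Booking_Status": ["Canceled by Driver", "Success", "Success", "Canceled by Customer", "Success"],
    "Ride_Distance": [0, 13, 40, 0, 45],
    "Booking_Value": [444, 158, 386, 384, 822],
})

zero_share = zero_distance_by_status(toy)
gap = mean_median_gap(toy["Booking_Value"])
skew = toy["Booking_Value"].skew()  # sample skewness; > 0 means a longer right tail

print(zero_share)
print(f"mean - median = {gap:.1f}, skewness = {skew:.2f}")
```

If the share of zeros is close to 1.0 for every non-`Success` status and close to 0.0 for `Success`, the zeros are placeholders and not measurement errors.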




### Key Columns Summary

| Column | Data Type | Issue/Insight | Action Needed |
|--------|-----------|---------------|---------------|
| `V_TAT`, `C_TAT`, `Payment_Method` | Mixed | Only populated for successful rides (63,967 entries) | Keep as-is, structural missing data |
| `Driver_Ratings`, `Customer_Rating` | Float | Clean data, range 3.0-5.0 | No action needed |
| `Canceled_Rides_by_Customer/Driver` | Object | Mutually exclusive, many nulls | Keep as-is, business logic |
| `Time` | Object | Redundant with `Date` column | Consider removing |
| `Incomplete_Rides` | Object | Should be boolean | Convert to True/False |
| `Vehicle Images` | Object | URLs, not analytical | Remove column |
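| `Ride_Distance` | Numeric | 0 for at least 25% of rows; likely bookings that never became rides | Confirm by `Booking_Status`; analyze successful rides only |
| `Booking_Value` | Numeric | Right-skewed distribution (mean 549 vs. median 386) | Consider log transformation |
| `Booking_Status` | Object | `Success` is most frequent (63,967 of 103,024) | No action needed |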
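
The "Action Needed" column could translate into a first cleaning pass along these lines. This is only a sketch under stated assumptions: `apply_initial_cleaning` and the `Booking_Value_log` column are hypothetical names, the second value of `Incomplete_Rides` is assumed to be `"Yes"` (the output only shows `"No"`), and the `toy` frame uses placeholder strings where the real data has image URLs.

```python
import numpy as np
import pandas as pd


def apply_initial_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    # Drop the redundant time column and the non-analytical image URLs.
    out = df.drop(columns=["Time", "Vehicle Images"], errors="ignore")

    # Map Yes/No text to pandas' nullable boolean so that rows without a
    # value (non-successful bookings) stay missing instead of becoming False.
    # Assumption: the two values are "Yes" and "No".
    out["Incomplete_Rides"] = (
        out["Incomplete_Rides"].map({"Yes": True, "No": False}).astype("boolean")
    )

    # log1p compresses the long right tail of the booking values.
    out["Booking_Value_log"] = np.log1p(out["Booking_Value"])
    return out


# Hypothetical toy data (placeholders stand in for the real image URLs).
toy = pd.DataFrame({
    "Time": ["14:00:00", "22:20:00", "19:59:00", "03:15:00", "09:02:00"],
    "Incomplete_Rides": [None, "No", "No", None, "Yes"],
    "Booking_Value": [444, 158, 386, 384, 822],
    "Vehicle Images": ["icon-placeholder"] * 5,
})

cleaned = apply_initial_cleaning(toy)
print(cleaned.dtypes)
print(cleaned)
```

Keeping the original `Booking_Value` next to the log-transformed column means reporting can still use real currency amounts, while the transformed column is available for modeling.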

This analysis provides a solid foundation for your data cleaning and preprocessing steps.