In [20]:
'''Convert the notebook (ipynb) to a script (py) at the end using:
jupyter nbconvert --to script rating.ipynb'''

'''https://github.com/kevinchen27/lyft-rides-analysis/blob/master/README.md'''

'''Load and Merge the Dataset'''

import pandas as pd
import yaml

# Load config
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

# Load data using config paths
df_drivers = pd.read_csv(config['data_paths']['driver_ids'])
df_rides = pd.read_csv(config['data_paths']['ride_ids'])
df_timestamps = pd.read_csv(config['data_paths']['ride_timestamps'])


# Pivot the timestamp events to wide format: one row per ride_id with columns for each event
df_events = df_timestamps.pivot(index='ride_id', columns='event', values='timestamp').reset_index()

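# Convert timestamp columns to datetime.
# After the pivot, columns are named after the event values used below
# (requested_at, accepted_at, dropped_off_at), not 'requested'/'accepted'/'completed'.
for col in ['requested_at', 'accepted_at', 'dropped_off_at']:
    if col in df_events.columns:
        df_events[col] = pd.to_datetime(df_events[col])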

# Joining data on ride_id to obtain per-ride data

# Join ride data with event timestamps
df_combined = pd.merge(df_rides, df_events, on='ride_id', how='inner')

# Join driver info (e.g., onboarding date)
df_combined = pd.merge(df_combined, df_drivers, on='driver_id', how='left')

# Compute accept delay (seconds)
df_combined['accept_delay'] = (pd.to_datetime(df_combined['accepted_at']) - pd.to_datetime(df_combined['requested_at'])).dt.total_seconds()

# Compute completion flag: 1 if the ride was dropped off
df_combined['completed_flag'] = df_combined['dropped_off_at'].notnull().astype(int)

# Preview
print(df_combined[['ride_id', 'driver_id', 'ride_duration', 'ride_distance', 'ride_prime_time', 'accept_delay', 'completed_flag']].head())



                            ride_id                         driver_id  \
0  006d61cf7446e682f7bc50b0f8a5bea5  002be0ffdc997bd5c50703158b7c2491   
1  01b522c5c3a756fbdb12e95e87507eda  002be0ffdc997bd5c50703158b7c2491   
2  029227c4c2971ce69ff2274dc798ef43  002be0ffdc997bd5c50703158b7c2491   
3  034e861343a63ac3c18a9ceb1ce0ac69  002be0ffdc997bd5c50703158b7c2491   
4  034f2e614a2f9fc7f1c2f77647d1b981  002be0ffdc997bd5c50703158b7c2491   

   ride_duration  ride_distance  ride_prime_time  accept_delay  completed_flag  
0            327           1811               50          25.0               1  
1            809           3362                0           3.0               1  
2            572           3282                0           8.0               1  
3           3338          65283               25           4.0               1  
4            823           4115              100           2.0               1  


Plan:

## Step 1: Load and Merge the Dataset

- Load the following CSV files:
  - `driver_ids.csv`: Contains driver onboarding information.
  - `ride_ids.csv`: Includes ride duration, distance, prime time multiplier, and driver ID.
  - `ride_timestamps.csv`: Contains ride lifecycle events (e.g., `requested_at`, `accepted_at`, `dropped_off_at`).
- Join rides to their pivoted timestamps on `ride_id`, then attach driver information on `driver_id`, to create a unified view of each ride.

## Step 2: Create Per-Ride and Per-Driver Features

- From `ride_timestamps.csv`:
  - Calculate **Accept Delay**: Time between request and accept.
  - Calculate **Ride Completion Rate**: Percentage of requested rides that reach the drop-off event (`dropped_off_at`).
  - Simulate cancellations: treat rides that were accepted but never dropped off as cancelled.

- From `ride_ids.csv`:
  - Calculate **Average Trip Duration** and **Average Distance**.
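  - Determine **Prime Time Utilization**: Percentage of rides with high prime time. In the preview above, `ride_prime_time` takes values such as 0, 25, 50, and 100, so the cutoff should be on that scale (e.g., >= 50) rather than a multiplier like 1.5.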
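
The per-ride features in this step could be sketched as follows. This is a minimal illustration, assuming the merged frame carries the columns used in the code cell above (`requested_at`, `accepted_at`, `dropped_off_at`, `ride_prime_time`). The helper name `add_ride_features`, the `cancelled_flag` and `high_prime_time` columns, and the default cutoff of 50 are hypothetical choices, and the two example rows are made up for demonstration.

```python
import pandas as pd


def add_ride_features(rides: pd.DataFrame, prime_time_threshold: float = 50) -> pd.DataFrame:
    """Return a copy of `rides` with per-ride feature columns added."""
    out = rides.copy()
    requested = pd.to_datetime(out["requested_at"])
    accepted = pd.to_datetime(out["accepted_at"])

    # Accept delay: seconds between the request and the driver's acceptance
    out["accept_delay"] = (accepted - requested).dt.total_seconds()

    # Completed: the ride has a drop-off timestamp
    out["completed_flag"] = out["dropped_off_at"].notna().astype(int)

    # Simulated cancellation: accepted but never dropped off
    out["cancelled_flag"] = (
        out["accepted_at"].notna() & out["dropped_off_at"].isna()
    ).astype(int)

    # High prime time: threshold is an assumption, tune it to the data
    out["high_prime_time"] = (out["ride_prime_time"] >= prime_time_threshold).astype(int)
    return out


# Made-up rows, only to show the shape of the input and output
example = pd.DataFrame({
    "ride_id": ["a", "b"],
    "requested_at": ["2016-04-23 02:13:50", "2016-04-23 03:00:00"],
    "accepted_at": ["2016-04-23 02:14:15", "2016-04-23 03:00:05"],
    "dropped_off_at": ["2016-04-23 02:30:00", None],
    "ride_prime_time": [50, 0],
})
features = add_ride_features(example)
print(features[["ride_id", "accept_delay", "completed_flag", "cancelled_flag", "high_prime_time"]])
```

Keeping the cutoff as a parameter makes it easy to revisit once the distribution of `ride_prime_time` has been inspected.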

## Step 3: Aggregate Features Per Driver

- For each driver, compute aggregated performance metrics:
  - Average trip duration.
  - Average accept delay.
  - Completion rate.
  - Percentage of rides in high prime time.
  - Total number of rides.
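
One way to compute these aggregates is a single `groupby` with named aggregations. The sketch below assumes a per-ride frame that already has `accept_delay`, `completed_flag`, and a 0/1 `high_prime_time` column; the last one, the function name `aggregate_driver_features`, and the output column names are hypothetical, and the rows are made up.

```python
import pandas as pd


def aggregate_driver_features(rides: pd.DataFrame) -> pd.DataFrame:
    """Collapse per-ride rows into one row of metrics per driver."""
    return (
        rides.groupby("driver_id")
        .agg(
            avg_trip_duration=("ride_duration", "mean"),
            avg_accept_delay=("accept_delay", "mean"),
            # The mean of a 0/1 flag is the share of rides with the flag set
            completion_rate=("completed_flag", "mean"),
            high_prime_time_pct=("high_prime_time", "mean"),
            total_rides=("ride_id", "count"),
        )
        .reset_index()
    )


# Made-up per-ride rows for two drivers
rides = pd.DataFrame({
    "driver_id": ["d1", "d1", "d2"],
    "ride_id": ["r1", "r2", "r3"],
    "ride_duration": [300, 500, 800],
    "accept_delay": [10.0, 20.0, 4.0],
    "completed_flag": [1, 0, 1],
    "high_prime_time": [1, 0, 0],
})
driver_features = aggregate_driver_features(rides)
print(driver_features)
```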

## Step 4: Label Drivers

- Assign each driver to a rating category based on performance:
  - **Poor**: Low completion rate or high average accept delay.
  - **Average**: Clears the Poor thresholds but falls short of the Good ones.
  - **Good**: High completion rate with acceptable response times.
  - **Excellent**: Very high performance across the board.
- Customize thresholds based on data distribution.
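
A possible rule-based labeling is sketched below, with thresholds taken from quantiles of the data instead of fixed constants. The specific quantiles (25th, 50th, 75th), the rule order, the function name `label_drivers`, and the column names `completion_rate` and `avg_accept_delay` are all assumptions for illustration; the four drivers are made up.

```python
import numpy as np
import pandas as pd


def label_drivers(drivers: pd.DataFrame) -> pd.Series:
    """Assign a rating label per driver using quantile-based thresholds."""
    completion = drivers["completion_rate"].to_numpy()
    delay = drivers["avg_accept_delay"].to_numpy()

    # Thresholds follow the data distribution (quartiles are an assumption)
    c_low, c_mid, c_high = np.quantile(completion, [0.25, 0.50, 0.75])
    d_low, d_mid, d_high = np.quantile(delay, [0.25, 0.50, 0.75])

    # np.select picks the first matching condition, so order matters
    conditions = [
        (completion < c_low) | (delay > d_high),    # Poor
        (completion >= c_high) & (delay <= d_low),  # Excellent
        (completion >= c_mid) & (delay <= d_mid),   # Good
    ]
    labels = np.select(conditions, ["Poor", "Excellent", "Good"], default="Average")
    return pd.Series(labels, index=drivers.index, name="rating")


# Made-up driver metrics
drivers = pd.DataFrame({
    "driver_id": ["d1", "d2", "d3", "d4"],
    "completion_rate": [0.50, 0.90, 0.95, 1.00],
    "avg_accept_delay": [30.0, 12.0, 8.0, 3.0],
})
ratings = label_drivers(drivers)
print(drivers.assign(rating=ratings))
```

Checking the Poor rule first means a driver with a very slow response is never rated above Poor, whatever the completion rate.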

## Step 5: Train a Multi-Class Classifier

- Use the labeled dataset to train a model to predict driver ratings.
- Consider algorithms such as:
  - XGBoost (high performance for tabular data).
  - Random Forest.
  - Logistic Regression (in multi-class mode).
- Evaluate performance using metrics like:
  - Accuracy
  - F1-score
  - Confusion matrix
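
The training and evaluation loop could look like the sketch below, which uses scikit-learn's `RandomForestClassifier` as one of the candidate algorithms listed above. The feature table here is synthetic (randomly generated, not the Lyft data), and the labels come from a placeholder rule on `completion_rate` so that the example has something to learn; the feature names are assumed to match the per-driver aggregates.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

CLASSES = ["Poor", "Average", "Good", "Excellent"]

# Synthetic stand-in for the per-driver feature table (not real data)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "avg_trip_duration": rng.normal(800, 150, n),
    "avg_accept_delay": rng.uniform(2, 30, n),
    "completion_rate": rng.uniform(0.5, 1.0, n),
    "high_prime_time_pct": rng.uniform(0, 0.6, n),
    "total_rides": rng.integers(10, 500, n),
})

# Placeholder labels from a simple rule; real labels would come from Step 4
y = pd.cut(
    X["completion_rate"], bins=[0, 0.7, 0.8, 0.9, 1.0], labels=CLASSES
).astype(str)

# Stratify so every rating class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
cm = confusion_matrix(y_test, y_pred, labels=CLASSES)

print(f"accuracy={accuracy:.3f}  macro-F1={macro_f1:.3f}")
print(pd.DataFrame(cm, index=CLASSES, columns=CLASSES))
```

Macro-averaged F1 weights each rating class equally, which matters if one category (for example Poor) turns out to be rare.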

## Optional Improvements

- Incorporate time-based features like:
  - Ride frequency by weekday.
  - Peak hour patterns.
- Apply feature scaling as needed.
- Use hyperparameter tuning (e.g., GridSearchCV) to improve model performance.
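
The time-based features could be derived from the request timestamp, for instance as below. The definition of peak hours (`PEAK_HOURS`), the helper names, and the `weekday` and `is_peak_hour` columns are assumptions for illustration; the rows are made up.

```python
import pandas as pd

# Assumed peak hours (morning and evening commute); adjust to the data
PEAK_HOURS = {7, 8, 17, 18}


def add_time_features(rides: pd.DataFrame) -> pd.DataFrame:
    """Add weekday and peak-hour columns derived from `requested_at`."""
    out = rides.copy()
    requested = pd.to_datetime(out["requested_at"])
    out["weekday"] = requested.dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_peak_hour"] = requested.dt.hour.isin(PEAK_HOURS).astype(int)
    return out


def weekday_ride_counts(rides: pd.DataFrame) -> pd.DataFrame:
    """Count rides per driver and weekday (one column per weekday seen)."""
    return pd.crosstab(rides["driver_id"], rides["weekday"])


# Made-up rides: two on a Monday for d1, one on a Saturday for d2
rides = pd.DataFrame({
    "driver_id": ["d1", "d1", "d2"],
    "requested_at": [
        "2016-04-25 08:15:00",
        "2016-04-25 13:00:00",
        "2016-04-23 17:30:00",
    ],
})
timed = add_time_features(rides)
counts = weekday_ride_counts(timed)
print(timed)
print(counts)
```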


## Why We Still Need a Model

Even though the training labels are assigned by rules, a trained model is still useful:

- It learns the underlying patterns in the data.
- It can predict ratings for new drivers without hardcoded thresholds.
- It is more flexible than rules, especially as the dataset grows.
- It can run off-chain, with the result sent back to the smart contract as the driver's updated score.