### Route metrics
- Total number of stops per route
- Planned total distance per route (sum of DistanceP)
- Actual total distance per route (sum of DistanceA)
- Distance deviation per route (actual total distance minus planned total distance)
- Distance efficiency ratio per route (actual total distance divided by planned total distance)
- Number of sequence deviations per route (count of stops where IndexP is not equal to IndexA)
- Deviation rate per route (sequence deviations divided by total stops)
- Number of SLA violations per route (count of stops where Arrived Time is greater than Latest Time)
- SLA violation rate per route (SLA violations divided by total stops)
- Total route duration per route (maximum Arrived Time minus minimum Arrived Time)


In [1]:
import pandas as pd
import numpy as np

In [2]:
df=pd.read_csv(r'D:\Coding\last-mile-route-deviation-analytics\data\raw\routes_performance.csv')

In [3]:
df.head()

Unnamed: 0,Route ID,Driver ID,Stop ID,Address ID,Week ID,Country,Day of Week,IndexP,IndexA,Arrived Time,Earliest Time,Latest Time,DistanceP,DistanceA,Depot,Delivery
0,0,0,0,0,0,1,Monday,0,0,42.275,0.0,360.0,0.0,0.0,1,0
1,0,0,1,1,0,1,Tuesday,1,4,332.788,240.0,480.0,16.329053,16.329053,0,1
2,0,0,2,2,0,1,Tuesday,2,5,332.956,120.0,540.0,0.37311,0.37311,0,1
3,0,0,3,3,0,1,Monday,3,2,244.994,60.0,540.0,2.491915,0.0,0,1
4,0,0,4,3,0,1,Monday,4,1,244.855,60.0,540.0,0.0,13.944962,0,1


In [10]:
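# Flag stops where the driver arrived after the latest allowed time (SLA breach).
df['sla_violation'] = df['Arrived Time'] > df['Latest Time']
# Flag stops visited at a different position than planned.
df['sequence_deviation'] = df['IndexP'] != df['IndexA']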

In [11]:
route_metrics = (
    df
    .groupby('Route ID')
    .agg(
        total_stops=('Stop ID','count'),
        planned_distance=('DistanceP','sum'),
        actual_distance=('DistanceA','sum'),
        deviation_count=('sequence_deviation','sum'),
        sla_violation_count=('sla_violation','sum'),
        route_start_time=('Arrived Time','min'),
        route_end_time=('Arrived Time','max')
    )
    .reset_index()
)

route_metrics['distance_deviation'] = (
    route_metrics['actual_distance'] - route_metrics['planned_distance']
)

route_metrics['distance_efficiency_ratio'] = (
    route_metrics['actual_distance'] / route_metrics['planned_distance']
)

route_metrics['route_duration'] = (
    route_metrics['route_end_time'] - route_metrics['route_start_time']
)

route_metrics['sla_violation_rate'] = (
    route_metrics['sla_violation_count'] / route_metrics['total_stops']
)

route_metrics['deviation_rate'] = (
    route_metrics['deviation_count'] / route_metrics['total_stops']
)


In [8]:
route_metrics.head()

Unnamed: 0,Route ID,total_stops,planned_distance,actual_distance,deviation_count,sla_violation_count,route_start_time,route_end_time,distance_deviation,distance_efficiency_ratio,route_duration,sla_violation_rate,deviation_rate
0,0,7,49.468094,44.965197,5,0,42.275,373.553,-4.502897,0.908974,331.278,0.0,0.714286
1,1,7,33.274342,33.610418,6,0,64.855,371.387,0.336076,1.0101,306.532,0.0,0.857143
2,2,7,12.124804,12.508786,4,0,110.283,473.273,0.383982,1.031669,362.99,0.0,0.571429
3,3,10,19.039848,19.374644,6,0,194.274,448.366,0.334795,1.017584,254.092,0.0,0.6
4,4,8,20.632674,19.528799,2,0,196.588,357.34,-1.103875,0.946499,160.752,0.0,0.25


### Sanity Checks

In [13]:
route_metrics['Route ID'].is_unique

True

In [18]:
# row count check

total_rows = df.shape[0]
total_routes = route_metrics.shape[0]

total_rows, total_routes


(249231, 19647)

In [19]:
# non-negative distances

(route_metrics['planned_distance'] < 0).sum(), (route_metrics['actual_distance'] < 0).sum()

(np.int64(0), np.int64(0))

In [21]:
# check for NaN or infinity
route_metrics.isna().sum(), \
np.isinf(route_metrics['distance_efficiency_ratio']).sum()

(Route ID                     0
 total_stops                  0
 planned_distance             0
 actual_distance              0
 deviation_count              0
 sla_violation_count          0
 route_start_time             0
 route_end_time               0
 distance_deviation           0
 distance_efficiency_ratio    1
 route_duration               0
 sla_violation_rate           0
 deviation_rate               0
 dtype: int64,
 np.int64(0))

In [None]:
# deviation bound check
(route_metrics["deviation_count"] > route_metrics["total_stops"]).sum()

np.int64(0)

In [23]:
#SLA bound check
(route_metrics["sla_violation_count"] > route_metrics["total_stops"]).sum()

np.int64(0)

In [None]:
# route duration check
(route_metrics['route_duration']<0).sum()

np.int64(0)

In [25]:
route_metrics.describe()


Unnamed: 0,Route ID,total_stops,planned_distance,actual_distance,deviation_count,sla_violation_count,route_start_time,route_end_time,distance_deviation,distance_efficiency_ratio,route_duration,sla_violation_rate,deviation_rate
count,19647.0,19647.0,19647.0,19647.0,19647.0,19647.0,19647.0,19647.0,19647.0,19646.0,19647.0,19647.0,19647.0
mean,9823.0,12.685448,64.160508,63.199997,6.327175,1.291189,1508.296,1960.068,-0.960511,0.998766,451.772815,0.102011,0.440783
std,5671.744705,5.566924,68.117164,71.166229,5.639303,2.264176,118261.1,118289.6,22.444547,0.19092,1351.406029,0.180803,0.31722
min,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,-507.106885,0.141538,0.0,0.0,0.0
25%,4911.5,8.0,29.861861,29.204916,2.0,0.0,143.962,467.036,-1.11526,0.972524,267.278,0.0,0.133333
50%,9823.0,12.0,50.079076,49.029368,5.0,0.0,203.76,562.585,0.0,1.0,354.163,0.0,0.5
75%,14734.5,17.0,80.855897,79.364024,10.0,2.0,266.102,665.7215,0.632074,1.01621,444.045,0.133333,0.722222
max,19646.0,36.0,1948.653027,2409.824486,33.0,28.0,12809990.0,12814320.0,1898.701726,6.227439,57552.851,1.0,1.0


### Dataset Overview
- Total Rows: 249,231 stops.

- Total Routes: 19,647 unique routes.
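
The two totals above can also be reconciled against each other rather than only printed side by side. The sketch below uses a small hypothetical stop table (`toy_stops`) with the notebook's column names. It checks that the per-route stop counts add back up to the number of stop rows and that there is one output row per distinct route. Note that `count` ignores missing `Stop ID` values, so a mismatch on real data would point to missing identifiers. In the notebook's own output, the mean of 12.685448 stops across 19,647 routes works out to roughly 249,231, which is consistent with the row count.

```python
import pandas as pd

# Toy stop-level data (hypothetical values) using the notebook's column names.
toy_stops = pd.DataFrame({
    "Route ID": [0, 0, 0, 1, 1],
    "Stop ID": [0, 1, 2, 0, 1],
})

# Same aggregation pattern as the notebook, reduced to the stop count.
toy_routes = (
    toy_stops
    .groupby("Route ID")
    .agg(total_stops=("Stop ID", "count"))
    .reset_index()
)

# Every stop row should be attributed to exactly one route.
rows_reconcile = int(toy_routes["total_stops"].sum()) == len(toy_stops)

# There should be one aggregated row per distinct route.
routes_reconcile = len(toy_routes) == toy_stops["Route ID"].nunique()
```

On the real data, the equivalent comparison would be between `route_metrics['total_stops'].sum()` and `df.shape[0]`.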

### Data Quality
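- Distances: All route-level planned and actual distance totals are non-negative.

- Efficiency Ratio: Exactly one route has a missing (NaN) ratio, and no ratio is infinite.

- Route Duration: No route has a negative duration. Because duration is computed as maximum minus minimum Arrived Time, this holds by construction and does not rule out outliers: the maximum duration is 57,552.851 against a 75th-percentile value of 444.045.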

### Data Sanity Checks
#### Logic Verification

- Deviation Counts: The number of sequence deviations never exceeds the total stops in a route. This is consistent with the deviation logic, although a bound check alone cannot prove the logic correct.

- SLA Violations: The number of SLA violations never exceeds the total stops in a route. This is likewise consistent with the SLA logic.
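
The two bound checks can be collected into a single reusable helper, roughly as sketched below. The function name `check_count_bounds` and the toy values are hypothetical; only the column names come from the notebook. The sketch also checks the lower bound (counts below zero), which the notebook cells do not test.

```python
import pandas as pd

def check_count_bounds(metrics: pd.DataFrame) -> dict:
    """Return, per count column, how many routes fall outside [0, total_stops].

    Hypothetical helper, not part of the original notebook.
    """
    checks = {}
    for col in ("deviation_count", "sla_violation_count"):
        below = metrics[col] < 0
        above = metrics[col] > metrics["total_stops"]
        checks[col] = int((below | above).sum())
    return checks

# Toy route-level data (hypothetical values).
toy_metrics = pd.DataFrame({
    "total_stops": [7, 10, 8],
    "deviation_count": [5, 6, 9],      # last row deliberately exceeds total_stops
    "sla_violation_count": [0, 0, 0],
})

bound_violations = check_count_bounds(toy_metrics)
```

A result of zero for every column on `route_metrics` would match the outputs shown in the sanity-check cells.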

#### Edge Cases

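- Distance Discrepancies: At the stop level, some rows have zero planned distance but positive actual distance (for example, the row at index 4 in `df.head()` has DistanceP 0.0 and DistanceA 13.944962). These are flagged as edge cases. At the route level no such case was found, since it would produce an infinite efficiency ratio and the infinity check returned zero.

- Missing Ratios (NaN): The single NaN comes from a route whose planned and actual total distances are both zero (0/0). A positive actual distance over zero planned distance would have produced infinity instead. This will be addressed during the visualization step.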
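
To make the NaN-versus-infinity distinction concrete, the sketch below reproduces both cases on hypothetical toy data. It then shows one possible way to make zero-planned routes explicit before plotting. The `zero_planned` flag is an assumption for illustration, not something the notebook defines, and the notebook may handle this differently at the visualization step.

```python
import numpy as np
import pandas as pd

# Toy route-level totals (hypothetical values).
toy = pd.DataFrame({
    "planned_distance": [10.0, 0.0, 0.0],
    "actual_distance": [12.0, 0.0, 5.0],
})

# Plain division, as in the notebook: 0/0 gives NaN, positive/0 gives inf.
raw_ratio = toy["actual_distance"] / toy["planned_distance"]

# Hypothetical flag marking routes whose ratio is undefined.
toy["zero_planned"] = toy["planned_distance"] == 0

# Mask zero denominators so every zero-planned route yields NaN rather than inf.
toy["distance_efficiency_ratio"] = (
    toy["actual_distance"] / toy["planned_distance"].where(~toy["zero_planned"])
)
```

Keeping the flag alongside the ratio lets a later plot drop or annotate those routes explicitly instead of silently skipping NaN values.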