# Insights from Failed Orders
![taxi](taxi.jpg)

This data project has been used as a take-home assignment in the recruitment process for the data science positions at Gett.

Gett, previously known as GetTaxi, is an Israeli-developed technology platform solely focused on corporate Ground Transportation Management (GTM). They have an application where clients can order taxis, and drivers can accept their rides (offers). At the moment, when the client clicks the Order button in the application, the matching system searches for the most relevant drivers and offers them the order. In this task, we would like to investigate some matching metrics for orders that did not complete successfully, i.e., the customer didn't end up getting a car.

## Assignment
Please complete the following tasks:

1. **Build up distribution of orders according to reasons for failure**: cancellations before and after driver assignment, and reasons for order rejection. Analyze the resulting plot. Which category has the highest number of orders?

2. **Plot the distribution of failed orders by hours**: Is there a trend that certain hours have an abnormally high proportion of one category or another? What hours are the biggest fails? How can this be explained?

3. **Plot the average time to cancellation with and without driver, by the hour**: If there are any outliers in the data, it would be better to remove them. Can we draw any conclusions from this plot?

4. **Plot the distribution of average ETA by hours**: How can this plot be explained?


## Data Description
We have two data sets:

- **data_orders**: Contains the following columns:
  - `order_datetime`: Time of the order
  - `origin_longitude`: Longitude of the order
  - `origin_latitude`: Latitude of the order
  - `m_order_eta`: Time before order arrival
  - `order_gk`: Order number
  - `order_status_key`: Status, an enumeration consisting of the following mapping:
    - 4 - Cancelled by client
    - 9 - Cancelled by system (i.e., a reject)
  - `is_driver_assigned_key`: Whether a driver has been assigned
  - `cancellations_time_in_seconds`: How many seconds passed before cancellation

- **data_offers**: Contains the following columns:
  - `order_gk`: Order number, associated with the same column from the orders data set
  - `offer_id`: ID of an offer


In [43]:
%load_ext sql

%sql postgresql://postgres:your-password@localhost:5432/failed_orders_db


The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [44]:
import psycopg2
import pandas as pd 
import warnings
warnings.filterwarnings('ignore')

host = "localhost"
database = "failed_orders_db"
user = "postgres"
password = "your-password"

# Connect to the PostgreSQL database
conn = psycopg2.connect(
    host=host,
    database=database,
    user=user,
    password=password
)

In [45]:
import matplotlib.pyplot as plt
import seaborn as sns 
import plotly.express as px
import plotly.graph_objects as go


In [46]:
%%sql
-- Create the view, replacing it if it already exists so the cell can be re-run
CREATE OR REPLACE VIEW table_columns AS
SELECT
    table_name,
    STRING_AGG(column_name, ', ') AS columns
FROM information_schema.columns
WHERE table_schema = 'public'
GROUP BY table_name


In [47]:
%%sql
-- USING view
SELECT * FROM table_columns

 * postgresql://postgres:***@localhost:5432/failed_orders_db
3 rows affected.


table_name,columns
data_orders,"origin_longitude, origin_latitude, m_order_eta, order_gk, order_status_key, is_driver_assigned_key, cancellations_time_in_seconds, order_datetime"
table_columns,"table_name, columns"
data_offers,"offer_id, order_gk"


In [48]:
data_orders = pd.read_sql_query("""SELECT * FROM data_orders""",conn)
data_orders.head()

Unnamed: 0,order_datetime,origin_longitude,origin_latitude,m_order_eta,order_gk,order_status_key,is_driver_assigned_key,cancellations_time_in_seconds
0,18:08:07,-0.978916,51.456173,60.0,3000583041974,4,1,198.0
1,20:57:32,-0.950385,51.456843,,3000583116437,4,0,128.0
2,12:07:50,-0.96952,51.455544,477.0,3000582891479,4,1,46.0
3,13:50:20,-1.054671,51.460544,658.0,3000582941169,4,1,62.0
4,21:24:45,-0.967605,51.458236,,3000583140877,9,0,


In [49]:
data_offers = pd.read_sql_query("""SELECT * FROM data_offers""",conn)
data_offers.head()

Unnamed: 0,order_gk,offer_id
0,3000579625629,300050936206
1,3000627306450,300052064651
2,3000632920686,300052408812
3,3000632771725,300052393030
4,3000583467642,300051001196


In [50]:
print('shape of data_orders',data_orders.shape)
print('shape of data_offers',data_offers.shape)

shape of data_orders (10716, 8)
shape of data_offers (334363, 2)


In [51]:
# checking for null values 
data_orders.isnull().sum()

order_datetime                      0
origin_longitude                    0
origin_latitude                     0
m_order_eta                      7902
order_gk                            0
order_status_key                    0
is_driver_assigned_key              0
cancellations_time_in_seconds    3409
dtype: int64

In [52]:
data_offers.isnull().sum()

order_gk    0
offer_id    0
dtype: int64

In [53]:
data_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10716 entries, 0 to 10715
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   order_datetime                 10716 non-null  object 
 1   origin_longitude               10716 non-null  float64
 2   origin_latitude                10716 non-null  float64
 3   m_order_eta                    2814 non-null   float64
 4   order_gk                       10716 non-null  int64  
 5   order_status_key               10716 non-null  int64  
 6   is_driver_assigned_key         10716 non-null  int64  
 7   cancellations_time_in_seconds  7307 non-null   float64
dtypes: float64(4), int64(3), object(1)
memory usage: 669.9+ KB


In [54]:
data_offers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 334363 entries, 0 to 334362
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   order_gk  334363 non-null  int64
 1   offer_id  334363 non-null  int64
dtypes: int64(2)
memory usage: 5.1 MB


In [55]:
# check for duplicates
print(data_orders.order_gk.nunique())
print(data_orders.shape[0])
print(data_orders.duplicated().sum())

10716
10716
0


### DATA PREPARATION

In [56]:
# order_datetime is stored as object -> convert it to datetime; make sure cancellations_time_in_seconds is numeric
# order_status_key takes the values 4 (cancelled by client) and 9 (cancelled by the system) -> replace 4 with 'client' and 9 with 'service'

In [57]:
# Convert Data Types:
data_orders['order_datetime'] = pd.to_datetime(data_orders['order_datetime'], format='%H:%M:%S')
data_orders['cancellations_time_in_seconds'] = pd.to_numeric(data_orders['cancellations_time_in_seconds'], errors='coerce')

In [58]:
data_orders.dtypes

order_datetime                   datetime64[ns]
origin_longitude                        float64
origin_latitude                         float64
m_order_eta                             float64
order_gk                                  int64
order_status_key                          int64
is_driver_assigned_key                    int64
cancellations_time_in_seconds           float64
dtype: object

In [59]:
# Make order_status_key more readable
data_orders['order_status_key'] = data_orders['order_status_key'].replace({4:'client',9:'service'})
data_orders.head()

Unnamed: 0,order_datetime,origin_longitude,origin_latitude,m_order_eta,order_gk,order_status_key,is_driver_assigned_key,cancellations_time_in_seconds
0,1900-01-01 18:08:07,-0.978916,51.456173,60.0,3000583041974,client,1,198.0
1,1900-01-01 20:57:32,-0.950385,51.456843,,3000583116437,client,0,128.0
2,1900-01-01 12:07:50,-0.96952,51.455544,477.0,3000582891479,client,1,46.0
3,1900-01-01 13:50:20,-1.054671,51.460544,658.0,3000582941169,client,1,62.0
4,1900-01-01 21:24:45,-0.967605,51.458236,,3000583140877,service,0,
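
Parsing a time-only string with `format='%H:%M:%S'` fills in a default date of 1900-01-01, as the output above shows. Since the later analysis only needs the hour, a minimal sketch of extracting it directly is given below; the `toy_times` values are copied from the first rows of the table, and the variable names are illustrative.

```python
import pandas as pd

# Time-only strings, as stored in order_datetime (first three rows of the table above).
toy_times = pd.Series(["18:08:07", "20:57:32", "12:07:50"])

# Parsing attaches the placeholder date 1900-01-01; only the time part is meaningful.
parsed = pd.to_datetime(toy_times, format="%H:%M:%S")

# The hour of day is all the hourly analysis needs.
hours = parsed.dt.hour
print(hours.tolist())
```

The placeholder date is harmless as long as only time-of-day components such as `.dt.hour` are used.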


In [60]:
data_orders.describe()['cancellations_time_in_seconds']

count    7307.000000
mean      157.892021
std       213.366963
min         3.000000
25%        45.000000
50%        98.000000
75%       187.500000
max      4303.000000
Name: cancellations_time_in_seconds, dtype: float64

The maximum cancellation time (4,303 seconds) is far above the 75th percentile (187.5 seconds), so let's look at the rows above the 99th percentile:

In [61]:
data_orders[data_orders['cancellations_time_in_seconds'] > data_orders['cancellations_time_in_seconds'].quantile(0.99)]

Unnamed: 0,order_datetime,origin_longitude,origin_latitude,m_order_eta,order_gk,order_status_key,is_driver_assigned_key,cancellations_time_in_seconds
379,1900-01-01 08:33:20,-0.958621,51.459510,897.0,3000592145793,client,1,1070.0
587,1900-01-01 08:45:43,-0.978230,51.454575,60.0,3000584765892,client,1,1710.0
730,1900-01-01 14:34:36,-0.924755,51.463005,597.0,3000585541881,client,1,1116.0
1246,1900-01-01 01:10:48,-0.980845,51.456257,358.0,3000630807209,client,1,1091.0
1318,1900-01-01 06:31:33,-0.963151,51.452836,239.0,3000550355598,client,1,1442.0
...,...,...,...,...,...,...,...,...
10228,1900-01-01 11:25:37,-0.964947,51.471804,239.0,3000590480744,client,1,1536.0
10230,1900-01-01 03:28:53,-0.979874,51.455549,60.0,3000590256037,client,1,1536.0
10240,1900-01-01 23:07:40,-0.971851,51.459834,,3000590756619,client,0,999.0
10310,1900-01-01 12:35:39,-0.966739,51.452743,478.0,3000555496022,client,1,1774.0


### DATA AGGREGATION AND ANALYSIS
##### 1. Build up distribution of orders according to reasons for failure: cancellations before and after driver assignment, and reasons for order rejection. Analyse the resulting plot. Which category has the highest number of orders?

#### Distribution of Order Status:

In [62]:
failure_counts = data_orders['order_status_key'].value_counts()
failure_counts

client     7307
service    3409
Name: order_status_key, dtype: int64

In [63]:

fig = px.bar(
    x=failure_counts.index,
    y=failure_counts.values,
    labels={'x': 'Failure Reason', 'y': 'Number of Orders'},
    title='Distribution of Orders by Failure Reason',
    color=failure_counts.index,
    text=failure_counts.values
)
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(xaxis_title='Failure Reason', yaxis_title='Number of Orders')
fig.show()


#### Count of Cancellations by Clients vs. System:

In [64]:
client_cancellations = data_orders[data_orders['order_status_key'] == 'client'].shape[0]
system_cancellations = data_orders[data_orders['order_status_key'] == 'service'].shape[0]
print(f"Client Cancellations: {client_cancellations}")
print(f"System Cancellations: {system_cancellations}")

Client Cancellations: 7307
System Cancellations: 3409


#### Impact of Driver Assignment:

In [65]:

fig = px.histogram(
    data_orders,
    x='order_status_key',
    color='is_driver_assigned_key',
    color_discrete_map={0: 'blue', 1: 'red'},
    labels={'order_status_key': 'Order Rejection', 'is_driver_assigned_key': 'Driver Assigned'},
    title='Order Cancellation Count before and after Driver Assignment'
)
fig.update_layout(
    xaxis_title='Order Rejection',
    yaxis_title='Count of Order Rejections',
    legend_title='Driver Assigned'
)
fig.show()


#### Breakdown of Cancellations with and without Driver Assignment:

In [66]:
client_driver_assigned = data_orders[(data_orders['order_status_key'] == 'client') & (data_orders['is_driver_assigned_key'] == 1)].shape[0]
client_driver_not_assigned = data_orders[(data_orders['order_status_key'] == 'client') & (data_orders['is_driver_assigned_key'] == 0)].shape[0]
system_driver_assigned = data_orders[(data_orders['order_status_key'] == 'service') & (data_orders['is_driver_assigned_key'] == 1)].shape[0]
system_driver_not_assigned = data_orders[(data_orders['order_status_key'] == 'service') & (data_orders['is_driver_assigned_key'] == 0)].shape[0]

print(f"Client Cancellations with Driver Assigned: {client_driver_assigned}")
print(f"Client Cancellations without Driver Assigned: {client_driver_not_assigned}")
print(f"System Cancellations with Driver Assigned: {system_driver_assigned}")
print(f"System Cancellations without Driver Assigned: {system_driver_not_assigned}")


Client Cancellations with Driver Assigned: 2811
Client Cancellations without Driver Assigned: 4496
System Cancellations with Driver Assigned: 3
System Cancellations without Driver Assigned: 3406


From above observations:

- **High Client Cancellations**: With 7307 client cancellations vs. 3409 system cancellations, client cancellations are significantly higher.
- **Driver Assignment**: Both clients and the system tend to cancel more when no driver is assigned. Very few orders are canceled by the system when a driver is assigned.
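
The four counts above can also be obtained in a single call. The following is a minimal sketch using `pd.crosstab` on a toy frame; `toy_orders` and its values are hypothetical stand-ins for `data_orders`, not the real data.

```python
import pandas as pd

# Toy stand-in for data_orders (hypothetical values, not the real data).
toy_orders = pd.DataFrame({
    "order_status_key": ["client", "client", "client", "service", "service"],
    "is_driver_assigned_key": [1, 0, 0, 0, 0],
})

# Rows: who cancelled; columns: whether a driver was assigned.
# Combinations that never occur are filled with 0.
breakdown = pd.crosstab(
    toy_orders["order_status_key"],
    toy_orders["is_driver_assigned_key"],
)
print(breakdown)
```

Applied to the real `data_orders`, the same call would replace the four separate boolean filters with one table.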


In [67]:
fig = px.box(
    data_orders,
    x='order_status_key',
    y='cancellations_time_in_seconds',
    color='is_driver_assigned_key',
    color_discrete_map={0: 'blue', 1: 'red'},
    labels={'order_status_key': 'Order Status Key', 'cancellations_time_in_seconds': 'Cancellation Time (seconds)'},
    title='Time to Cancellation by Order Status and Driver Assignment'
)
fig.update_layout(
    xaxis_title='Order Status Key',
    yaxis_title='Cancellation Time (seconds)',
    legend_title='Driver Assigned'
)
fig.show()


#### Remove outliers

In [68]:
Q1 = data_orders['cancellations_time_in_seconds'].quantile(0.25)
Q3 = data_orders['cancellations_time_in_seconds'].quantile(0.75)
IQR = Q3 - Q1
data_clean = data_orders[~((data_orders['cancellations_time_in_seconds'] < (Q1 - 1.5 * IQR)) | (data_orders['cancellations_time_in_seconds'] > (Q3 + 1.5 * IQR)))]

In [69]:
fig = px.box(
    data_clean,
    x='order_status_key',
    y='cancellations_time_in_seconds',
    color='is_driver_assigned_key',
    color_discrete_map={0: 'blue', 1: 'red'},
    labels={'order_status_key': 'Order Status Key', 'cancellations_time_in_seconds': 'Cancellation Time (seconds)'},
    title='Time to Cancellation by Order Status and Driver Assignment (Outliers Removed)'
)
fig.update_layout(
    xaxis_title='Order Status Key',
    yaxis_title='Cancellation Time (seconds)',
    legend_title='Driver Assigned'
)
fig.show()


In [70]:
data_orders[data_orders['order_status_key'] == 'service']['cancellations_time_in_seconds'].unique()

array([nan])

**Time to Cancellation Analysis:**

- **Client Cancellations:**
  - **With Driver Assigned:**
    - **Median Time:** 127.0 seconds
    - **Interquartile Range (IQR):** 242.0 seconds
    - **Outliers:** 219
  - **Without Driver Assigned:**
    - **Median Time:** 88.0 seconds
    - **Interquartile Range (IQR):** 119.0 seconds
    - **Outliers:** 101
- **Service Cancellations:**
  - Service cancellations have no cancellation time at all: `cancellations_time_in_seconds` is null for every one of these orders, so they do not appear in the box plots.
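
The per-group median, IQR, and outlier count reported above could be computed with a helper like the one sketched below. `iqr_summary` is a hypothetical helper, the toy values are made up, and the 1.5 × IQR fences are computed within each group (an assumption about how the outlier counts were obtained).

```python
import pandas as pd

def iqr_summary(values: pd.Series) -> pd.Series:
    """Median, IQR, and number of points outside the 1.5*IQR fences."""
    values = values.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return pd.Series({"median": values.median(), "iqr": iqr, "outliers": int(outside.sum())})

# Toy data (hypothetical): one extreme value in the group without a driver.
toy = pd.DataFrame({
    "is_driver_assigned_key": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "cancellations_time_in_seconds": [10, 20, 30, 40, 500, 100, 110, 120, 130, 140],
})

# One summary row per driver-assignment group.
summary = pd.DataFrame({
    key: iqr_summary(group["cancellations_time_in_seconds"])
    for key, group in toy.groupby("is_driver_assigned_key")
}).T
print(summary)
```

Computing the fences per group rather than on the pooled column matters when the two groups have different spreads, as they do here.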

**Overall Conclusion:**

- **Driver Assignment:** Most failed orders (7,902 of 10,716) had no driver assigned, and this holds for both client and system cancellations.
- **Client Behavior:** Clients account for most failures (7,307 of 10,716), and most of their cancellations (4,496) happen before a driver is assigned.
- **Service Behavior:** The system rarely cancels orders when a driver is assigned (3 cases), which suggests a robust process for driver-assigned orders.


#### Plot the distribution of failed orders by hours: Is there a trend that certain hours have an abnormally high proportion of one category or another? What hours are the biggest fails? How can this be explained?

In [71]:
data_orders['order_hour'] = data_orders['order_datetime'].dt.hour
data_orders.head(2)

Unnamed: 0,order_datetime,origin_longitude,origin_latitude,m_order_eta,order_gk,order_status_key,is_driver_assigned_key,cancellations_time_in_seconds,order_hour
0,1900-01-01 18:08:07,-0.978916,51.456173,60.0,3000583041974,client,1,198.0,18
1,1900-01-01 20:57:32,-0.950385,51.456843,,3000583116437,client,0,128.0,20


In [72]:
# Group by hour and order status key
hourly_failures = data_orders.groupby(['order_hour','order_status_key','is_driver_assigned_key']).size().reset_index(name='count')


# Pivot the data for better visualization 
pivot_failures = hourly_failures.pivot_table(index='order_hour',columns=['order_status_key','is_driver_assigned_key'],values='count',fill_value=0)
pivot_failures.head()

order_hour,client (driver=0),client (driver=1),service (driver=0),service (driver=1)
0,298,120,263,2
1,219,88,164,0
2,237,78,240,0
3,224,64,225,0
4,50,41,61,0


In [73]:
pivot_failures = pivot_failures.reset_index()
melted_failures = pivot_failures.melt(id_vars=['order_hour'], var_name=['order_status_key', 'is_driver_assigned_key'], value_name='count')

# Fix column names
melted_failures.columns = ['order_hour', 'order_status_key', 'is_driver_assigned_key', 'count']
melted_failures['is_driver_assigned_key'] = melted_failures['is_driver_assigned_key'].astype(str)
melted_failures.head()


Unnamed: 0,order_hour,order_status_key,is_driver_assigned_key,count
0,0,client,0,298
1,1,client,0,219
2,2,client,0,237
3,3,client,0,224
4,4,client,0,50


In [74]:
# Create the bar plot
fig = px.bar(melted_failures, 
             x='order_hour', 
             y='count', 
             color='order_status_key', 
             barmode='group', 
             facet_col='is_driver_assigned_key',
             labels={
                 'order_hour': 'Hour of the Day', 
                 'count': 'Number of Failures', 
                 'order_status_key': 'Failure Reason'
             },
             title='Distribution of Failed Orders by Hour and Failure Reason')

# Customize layout
fig.update_layout(
    xaxis_title="Hour of the Day",
    yaxis_title="Number of Failures",
    legend_title="Failure Reason"
)

fig.show()


### Analysis of Failed Orders by Hour

#### Before Driver Assignment (is_driver_assigned_key=0):
- **Significant Failure Times:** Peaks at 8 AM and 9 PM.
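- **Client vs. Service Failures:** Client cancellations are more frequent than service-related failures overall, although service rejects match or slightly exceed them in some early-morning hours (2-4 AM in the pivot table above).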
- **Service-Related Peaks:** Notable around 8 AM and 9-11 PM.

#### After Driver Assignment (is_driver_assigned_key=1):
- **Overall Failures:** Significantly lower.
- **Client vs. Service Failures:** Mainly due to client reasons.
- **Failure Times:** Relatively stable with minor peaks at 8 AM and 5 PM.

### Conclusion

From the analysis, we can conclude:

1. **Impact of Driver Assignment:**
   - Far fewer failed orders have a driver assigned (2,814 of 10,716). This highlights the importance of timely driver assignment in reducing cancellations.

2. **Peak Failure Times:**
   - Before driver assignment, failures peak at 8 AM and 9 PM, suggesting these times might have higher demand or other operational challenges.
   - After driver assignment, failures are relatively stable, with minor peaks at 8 AM and 5 PM, indicating some ongoing issues even when a driver is assigned.

3. **Client vs. Service Failures:**
   - Client cancellations are more common than service-related failures both before and after driver assignment. This suggests that addressing client concerns and improving the customer experience could reduce cancellations.
   - Service-related failures are notably high around 8 AM and between 9 and 11 PM, indicating possible operational inefficiencies or resource constraints during these times.

Overall, these insights suggest that improving driver assignment processes and focusing on high-demand times could significantly reduce order failures.
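
The task asks about the *proportion* of each category per hour, while the plot above shows raw counts. A minimal sketch of turning counts into per-hour shares follows; the toy frame is hypothetical, and `normalize="index"` is the `pd.crosstab` option that divides each row by its own total.

```python
import pandas as pd

# Toy stand-in for data_orders with the order_hour column already added.
toy = pd.DataFrame({
    "order_hour": [8, 8, 8, 8, 21, 21],
    "order_status_key": ["client", "client", "client", "service", "client", "service"],
})

# Each row (hour) sums to 1, so hours with very different volumes become comparable.
hourly_share = pd.crosstab(
    toy["order_hour"],
    toy["order_status_key"],
    normalize="index",
)
print(hourly_share)
```

An hour whose share of one category sits well away from the overall share would be a candidate for the "abnormally high proportion" the question refers to.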


### Plot the average time to cancellation with and without driver, by the hour: If there are any outliers in the data, it would be better to remove them. Can we draw any conclusions from this plot?

In [75]:
# Remove outliers
Q1 = data_orders['cancellations_time_in_seconds'].quantile(0.25)
Q3 = data_orders['cancellations_time_in_seconds'].quantile(0.75)
IQR = Q3 - Q1
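# Keep rows within the 1.5 * IQR fences; rows with a missing cancellation time are dropped by the comparison
mask = (data_orders['cancellations_time_in_seconds'] >= (Q1 - 1.5 * IQR)) & (data_orders['cancellations_time_in_seconds'] <= (Q3 + 1.5 * IQR))
data_clean = data_orders.loc[mask]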

# Group by hour and driver assignment status
avg_cancel_time = data_clean.groupby(['order_hour', 'is_driver_assigned_key'])['cancellations_time_in_seconds'].mean().reset_index()

In [76]:
fig = px.line(avg_cancel_time, x='order_hour', y='cancellations_time_in_seconds', color='is_driver_assigned_key',
              labels={
                  "order_hour": "Hour of the Day",
                  "cancellations_time_in_seconds": "Average Cancellation Time (seconds)",
                  "is_driver_assigned_key": "Driver Assigned"
              },
              title="Average Time to Cancellation by Hour and Driver Assignment")
fig.show()


### Conclusion

- **Overall Trend**: The cancellation time is generally higher when a driver is assigned (`is_driver_assigned_key = 1`), suggesting that clients wait longer before canceling once a driver is already assigned.

- **Peak Cancellation Times**:
  - **Without Driver Assignment**: 
    - The highest peak is at 5 AM with an average cancellation time of 121.8 seconds.
    - Significant peaks are also observed around 12 AM (103.45 seconds) and 10 PM (108.27 seconds).
  - **With Driver Assignment**:
    - The highest peak is at 12 AM with an average cancellation time of 146.13 seconds.
    - Other notable peaks occur at 5 AM (140.53 seconds) and 11 PM (137.25 seconds).

- **Hourly Patterns**:
  - **Without Driver Assignment**: The cancellation times are relatively consistent throughout the day with minor variations.
  - **With Driver Assignment**: The cancellation times fluctuate more noticeably, with significant peaks at the start and end of the day, reflecting a tendency for longer waits before cancellation during these hours.
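
Peak hours like those listed above can be read off programmatically instead of from the plot. The sketch below uses `idxmax` per group on a frame shaped like `avg_cancel_time`; `toy_avg` contains only the six hour/value pairs quoted in this conclusion, not the full hourly table.

```python
import pandas as pd

# Subset of hourly averages, using only the figures quoted in the conclusion above.
toy_avg = pd.DataFrame({
    "order_hour": [0, 5, 22, 0, 5, 23],
    "is_driver_assigned_key": [0, 0, 0, 1, 1, 1],
    "cancellations_time_in_seconds": [103.45, 121.8, 108.27, 146.13, 140.53, 137.25],
})

# idxmax returns, for each group, the row label of the largest average.
peak_index = toy_avg.groupby("is_driver_assigned_key")["cancellations_time_in_seconds"].idxmax()
peak_rows = toy_avg.loc[peak_index]
print(peak_rows)
```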


### Plot the distribution of average ETA by hours: How can this plot be explained?

In [77]:
# Calculate the average ETA (estimated time of arrival) by hour
average_eta_by_hour = data_orders.groupby('order_hour')['m_order_eta'].mean().reset_index()

average_eta_by_hour

Unnamed: 0,order_hour,m_order_eta
0,0,357.959016
1,1,324.75
2,2,391.012821
3,3,388.09375
4,4,299.658537
5,5,411.12
6,6,427.148936
7,7,583.358974
8,8,636.910828
9,9,504.891026


In [78]:

fig = px.line(average_eta_by_hour, x='order_hour', y='m_order_eta', 
              title='Average ETA by Hour of the Day',
              labels={'order_hour': 'Hour of the Day', 'm_order_eta': 'Average ETA (seconds)'})
fig.update_traces(mode='lines+markers')
fig.show()


### Conclusion

- **Early Morning ETAs**: Between midnight and 6 AM the average ETA stays between roughly 300 and 430 seconds, with local highs at 2 AM (391 seconds) and 3 AM (388 seconds). These values are moderate compared with the morning peak; the bumps may reflect fewer drivers being available during these hours.
- **Highest ETAs Around 8 AM**: The peak average ETA occurs at 8 AM (637 seconds), suggesting a potential surge in demand during this time, possibly due to morning commute hours.
- **Afternoon and Evening Variations**: There are noticeable fluctuations in ETAs during the afternoon and evening hours. For instance, ETAs are relatively high at 5 PM (520 seconds) and decrease towards 8 PM (300 seconds).
- **Midday Consistency**: The average ETA remains relatively consistent during the midday hours (11 AM - 4 PM), with slight variations, indicating a steady driver availability during this period.

Overall, the plot shows that driver availability and demand fluctuations throughout the day significantly impact the average ETA for orders.
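
Because `m_order_eta` is missing for 7,902 of the 10,716 orders (see the null counts earlier), each hourly mean rests on a subset of that hour's orders. A minimal sketch of reporting the sample size next to the mean is shown below; the toy values and the output column names (`mean_eta`, `n_with_eta`, `n_orders`) are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_orders: some orders have no ETA.
toy = pd.DataFrame({
    "order_hour": [8, 8, 8, 2, 2],
    "m_order_eta": [600.0, 700.0, np.nan, 390.0, np.nan],
})

# "mean" and "count" skip missing ETAs; "size" counts every order in the hour.
eta_stats = (
    toy.groupby("order_hour")["m_order_eta"]
    .agg(mean_eta="mean", n_with_eta="count", n_orders="size")
    .reset_index()
)
print(eta_stats)
```

Hours where `n_with_eta` is small deserve more caution when interpreting peaks in the line plot.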


In [41]:
import pandas as pd
import folium



# Keep only orders cancelled by the client (order_status_key == 'client')
rejected_orders = data_orders[data_orders['order_status_key'] == 'client']

# Drop rows with missing latitude or longitude
rejected_orders = rejected_orders.dropna(subset=['origin_latitude', 'origin_longitude'])

# Create a base map
map_center = [rejected_orders['origin_latitude'].mean(), rejected_orders['origin_longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=10)

# Add rejected order locations to the map
for _, row in rejected_orders.iterrows():
    folium.CircleMarker(
        location=[row['origin_latitude'], row['origin_longitude']],
        radius=5,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.6,
        popup=f"Order ID: {row['order_gk']}, Time: {row['order_datetime']}"
    ).add_to(m)

# Save the map to an HTML file
m.save('rejection_map.html')

print("Map has been saved as rejection_map.html")


Map has been saved as rejection_map.html


In [79]:


# Drop rows with missing latitude or longitude (copy so the new column is added to an independent frame)
data_orders = data_orders.dropna(subset=['origin_latitude', 'origin_longitude']).copy()

# Create a column to differentiate client and service rejections
data_orders['rejection_type'] = data_orders['order_status_key'].map({'client': 'Client', 'service': 'Service'})

# Create the map
fig = px.scatter_mapbox(
    data_orders, 
    lat="origin_latitude", 
    lon="origin_longitude", 
    color="rejection_type",
    hover_name="order_gk",
    hover_data=["order_datetime"],
    color_discrete_map={'Client': 'red', 'Service': 'blue'},
    title="Order Rejections Map",
    height=600,
    zoom=10
)

fig.update_layout(mapbox_style="open-street-map")

# Show the map
fig.show()


# Conclusion: Insights from Failed Orders

### Geographic Distribution of Rejections
Our analysis revealed distinct patterns in the geographic distribution of order rejections. We visualized these rejections on a map using different colors to distinguish between client and service rejections:
- **Red Markers**: Represent client rejections, indicating orders canceled by the clients.
- **Blue Markers**: Represent service rejections, indicating orders canceled by the system.

The map highlights specific regions with a higher frequency of rejections, allowing us to identify potential problem areas in the service network.
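
One way to make "regions with a higher frequency" concrete is to bin the coordinates into a coarse grid and count failed orders per cell. The sketch below is an assumption-laden illustration, not the method used for the map: the toy coordinates are made up, and rounding to two decimals (cells of roughly a kilometre) is an arbitrary choice.

```python
import pandas as pd

# Toy coordinates (hypothetical): three orders fall into the same grid cell.
toy = pd.DataFrame({
    "origin_latitude": [51.4561, 51.4562, 51.4558, 51.4718],
    "origin_longitude": [-0.9789, -0.9812, -0.9795, -0.9649],
})

# Round coordinates to two decimals to assign each order to a grid cell.
cells = toy.assign(
    lat_cell=toy["origin_latitude"].round(2),
    lon_cell=toy["origin_longitude"].round(2),
)

# Count orders per cell and list the busiest cells first.
hotspots = (
    cells.groupby(["lat_cell", "lon_cell"])
    .size()
    .sort_values(ascending=False)
    .rename("failed_orders")
    .reset_index()
)
print(hotspots)
```

The top rows of `hotspots` give candidate problem areas that can then be checked against the map.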

### Temporal Patterns in Rejections
We analyzed the distribution of failed orders by the hour of the day and observed significant variations:
- Certain hours, particularly during peak times, showed higher rejection rates.
- The busiest hours with the highest number of rejections were typically during late evenings and early mornings, suggesting that these times might correlate with high demand or operational challenges.

### Impact of Driver Assignment on Rejections
Our data analysis indicated a strong correlation between driver assignment and order rejections:
- Orders without assigned drivers had a significantly higher rejection rate, underscoring the importance of improving driver assignment efficiency.
- This finding suggests that enhancing driver availability and assignment processes could reduce the number of failed orders.

### Average Time to Cancellation
We examined the average time to cancellation, both with and without driver assignment:
- The data showed noticeable differences in cancellation times, with orders without drivers being canceled more quickly.
- By plotting these times, we identified outliers and trends that could inform improvements in the customer experience and operational efficiency.

### Distribution of Average ETA by Hours
The analysis of the average estimated time of arrival (ETA) by hours provided additional insights:
- There were fluctuations in the average ETA across different hours, which might be influenced by traffic conditions, driver availability, or demand patterns.
- Understanding these variations can help optimize service delivery and manage customer expectations.

### Overall Insights
Our comprehensive analysis of failed orders has provided valuable insights into the reasons and patterns behind order rejections. By focusing on geographic hotspots, peak times, and the impact of driver assignment, we have identified key areas for improvement:
- Enhancing driver assignment processes to reduce rejection rates.
- Addressing geographic areas with high rejection frequencies.
- Optimizing operations during peak demand hours to improve service reliability.

These findings will help inform strategic decisions and operational improvements, ultimately leading to a better customer experience and more efficient service delivery.
