In [14]:
import pandas as pd
import numpy as np
from db_utils import get_connection, validate_connection, get_data
pd.options.display.float_format = '{:,.2f}'.format

# Motivating Question

How can we manipulate data in pandas to answer our questions?

We're going to be working with some `Milestone Updates` data, which is messy and buggy, and at the same time is the source for a lot of questions we'll need to answer.
The column update_lead_hours is the number of hours between when the update was made and when the milestone occurred. A negative value means the update was created after the milestone happened; a positive value means the update was created ahead of the actual event.
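To make the sign convention concrete, here is a minimal sketch of how such a lead time could be computed. The column names `update_created_at` and `milestone_at` and the timestamps are illustrative assumptions, not the actual logic behind the dataset.

```python
import pandas as pd

# Hypothetical timestamps; the column names are assumptions for illustration only.
updates = pd.DataFrame({
    "update_created_at": pd.to_datetime(["2020-07-01 08:00", "2020-07-05 12:00"]),
    "milestone_at": pd.to_datetime(["2020-07-03 08:00", "2020-07-05 06:00"]),
})

# Lead hours = milestone time minus update creation time, expressed in hours.
lead = updates["milestone_at"] - updates["update_created_at"]
updates["update_lead_hours"] = lead.dt.total_seconds() / 3600

# First row: update logged two days ahead of the event (positive).
# Second row: update logged six hours after the event (negative).
print(updates["update_lead_hours"].tolist())
```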

In [36]:
# We'll get started by establishing a database connection and pulling a predetermined dataset of milestone updates.
# The query selects only July 2020 and only arrival/departure milestones; we narrow down to Air/Ocean shipments later in pandas.
conn, cur = get_connection()
df = get_data(
    "select \
    mu.id,\
    mu.shipment_id,\
    s.mode as shipment_mode,\
    s.load_type as shipment_load_type,\
    legs.leg_transportation_mode_name as leg_mode_type,\
    address_type,\
    update_event_type,\
    update_date_type,\
    source,\
    created_by,\
    update_lead_hours\
    from entities.shipment_milestone_updates as mu\
    left join entities.legs as legs \
        on legs.leg_id = mu.leg_id \
    left join entities.shipment_attributes as s\
        on s.shipment_id = mu.shipment_id\
    where update_created_at >= '2020-07-01'\
    and update_created_at <= '2020-07-31'\
    and update_event_type in ('arrival', 'departure')\
    order by shipment_id desc", 'text', conn)


Initiating login request with your identity provider. A browser window should have opened for you to complete the login. If you can't see it, check existing browser windows, or your OS settings. Press CTRL+C to abort and try again...


# Exploration

Let's take a look through our data to have an idea of what we're working with.


In [38]:
df.head()
# Looks like there are some outliers in our update_lead_hours data which we'll need to take care of

Unnamed: 0,id,shipment_id,shipment_mode,shipment_load_type,leg_mode_type,address_type,update_event_type,update_date_type,source,created_by,update_lead_hours
0,vu3219541886076666330647937003561493TLLU1613600,860766,Ocean,LCL,Truck - Domestic,departure_port,arrival,scheduled,human,Cassie Yang,155.0
1,vu3219541986076666330647937003905235TLLU1613600,860766,Ocean,LCL,Truck - Domestic,consolidation,departure,scheduled,human,Cassie Yang,155.0
2,vu32295137860525665492378826761106066ct,860525,Trucking - Domestic,FTL,Truck - Domestic,origin,departure,scheduled,human,Nate Lund,100.0
3,vu32295136860525665492378826761106069ct,860525,Trucking - Domestic,FTL,Truck - Domestic,destination,arrival,scheduled,human,Nate Lund,109.0
4,vu32295096860523665491278826651106069ct,860523,Trucking - Domestic,FTL,Truck - Domestic,destination,arrival,scheduled,human,Nate Lund,109.0


# Questions:
    1. What percentage of actual updates are human vs automated?
    2. What (if any) is the improvement in update speed for automation?
    3. What milestones are frequently missing for a given mode?
    4. What are the fastest updates per mode?

    Keep in mind that for all of the above we'll need to handle outliers and extraneous/missing data.

# 1. Percentage of updates human vs automated

I'm solving this by stringing together some logic to keep only air and ocean milestones (that's all we want right now) and only actual dates, not scheduled ones. From there we get a total count of rows, and finally determine the count of each value in the source series before dividing by the total.

After examining this solution, work through your own using another way to cut the data. This could be solved using iloc, loc, or <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html">query, which I suggest</a>. You can also use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html">groupby function</a> to circumvent value_counts. In total, this question can be answered in a single line.
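As a rough illustration of the alternatives mentioned above, the sketch below applies the same filter with `.loc` and with `query` on a small made-up frame. The `toy` data is invented for the example and only mirrors a few column names from the real dataset.

```python
import pandas as pd

# Made-up rows that reuse a few of the real column names.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "shipment_mode": ["Ocean", "Air", "Trucking - Domestic", "Ocean", "Air"],
    "update_date_type": ["actual", "scheduled", "actual", "actual", "actual"],
    "source": ["human", "human", "human", "cargosmart", "human"],
})

# Option 1: .loc with a boolean mask built from isin() and an equality test.
via_loc = toy.loc[
    toy["shipment_mode"].isin(["Ocean", "Air"]) & (toy["update_date_type"] == "actual")
]

# Option 2: query() with the same conditions written as a string expression.
via_query = toy.query("shipment_mode in ('Ocean', 'Air') and update_date_type == 'actual'")

# Share of each source among the filtered rows, using groupby instead of value_counts.
shares = via_query.groupby("source")["id"].count() / len(via_query)
print(shares)
```

Both options select the same rows; `query` tends to read more naturally once several conditions are combined.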

In [58]:
# One (somewhat expensive, since each condition materializes a full-length boolean Series) way to filter data
# is to store each conditional expression in a Series and then combine them, for example:
ocean_filter = df.shipment_mode == 'Ocean'
air_filter = df.shipment_mode == 'Air'
milestones_filter = df.update_date_type == 'actual'
final_filter = (ocean_filter | air_filter) & milestones_filter

total_rows = df[final_filter]['id'].count()
df[final_filter]['source'].value_counts()/total_rows


human           0.67
no user found   0.29
cargosmart      0.02
inttra          0.02
crux            0.00
Name: source, dtype: float64
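As a side note, the divide-by-total step can also be folded into `value_counts` through its `normalize` parameter. The sketch below uses a small made-up Series rather than the real data.

```python
import pandas as pd

# Made-up values standing in for the 'source' column.
sources = pd.Series(["human", "human", "no user found", "human"], name="source")

# normalize=True returns shares instead of raw counts.
shares = sources.value_counts(normalize=True)

# Equivalent manual version: counts divided by the number of non-null rows.
manual = sources.value_counts() / sources.count()

print(shares)
```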

In [None]:
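# SOLUTION CELL
# The denominator uses the same filter as the numerator (actual + Air/Ocean) so the shares sum to 1.
df.query("update_date_type == 'actual' and shipment_mode in ('Ocean', 'Air')").groupby(by='source')['id'].count()/df.query("update_date_type == 'actual' and shipment_mode in ('Ocean', 'Air')")['id'].count()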

# 2. Improvement in Update Speed for Automation

Well we learned in the top portion that the majority of our updates are coming from humans, with 