### “We manage what we measure, but frequently we measure what is easy.” - Datanuts meeting

link to dataset
https://www.kaggle.com/lewisgmorris/warehouse-picking-times

Content

PH_DOC - unique order reference

PH_SORDER - unique order reference

PH_DELIVER - unique delivery reference

PH_PICKEDB - name of picker

PH_PICKSTA - datetime of picking started

PH_PICKEND - datetime of picking ended

PH_TOTALLI - total lines picked

PH_TOTALBO - total boxes used

Inspiration

Can you find the patterns in the numbers?

Suggestions:

What days are busiest?   
Are things getting better or worse?   
Who is the best?   
Who's not pulling their weight?   

#### As seen above, these are typical questions asked of UPH (units per hour) data. But is this the whole story?     
Concluding that a worker is not "pulling their weight" because of low UPH assumes that people generally don't want to perform well, and it puts the burden of fixing the problem on the individual. It takes responsibility away from management to find bottlenecks and other sources of delay that prevent workers from succeeding, and it lets senior managers erroneously compare disparate environments.

#### What if we assumed that every worker wants to succeed and that limitations in the environment prevent them from doing so?    
This mindset would challenge managers to find and fix the areas and processes that get in the way of success, or to determine whether fixing an issue costs more or less than hiring additional labor. It would also provide a more accurate basis for performance assessment.

#### How do we find the bottlenecks and issues that prevent success?    
Ask the people doing the job!

#### Some frequent limiting factors that are not typically considered when establishing or comparing UPH metrics from site to site:    
- Bulk/Lumber distribution: weather, % of warehouse indoor/outdoor, ground level/ramp, state truck maximum weight capacity, product diversity of size, product diversity of weight, max forklift capacity, average drive distance from stacks to truck, aisle width of main travel paths, max safe forklift operating speed, average # years operator experience, average # units/truck, % live unload vs staged trucks, lighting levels, equipment downtime, in stock %, variability in product count/package size, average temperature

- Indoor operations/non-perishable: product diversity of size, product diversity of weight, max forklift capacity, average drive distance from stacks to truck, aisle width of main travel paths, max safe forklift operating speed, average # years operator experience, average # units/truck, % live unload vs staged trucks, floor load/pallet, % full pallet items/individual pick, conveyor lines/pallet jack/forklift pick, products with team lift requirements, sqft of pick zone, average # picks per order, equipment downtime, in stock %, variability in product count/package size, average temperature

- Store pulling: store volume, store sqft, product diversity of size, product diversity of weight, order storage area (walk-in/transfer to shelf), sqft of pick zone, average # picks per order, length of shift, task distribution among operators, operator experience (and/or time for response on questions), in stock %, variability in product count/package size, % use best judgement/do not substitute orders

#### Some frequent limiting factors that are not typically considered when establishing or comparing UPH metrics within a site:
- Bulk/Lumber distribution: average # years operator experience, average # units/truck, % live unload vs staged trucks, task distribution among operators (inbound, outbound, "non-productive" (safety walks, meetings, training)), equipment downtime, in stock %, variability in product count/package size, max forklift capacity, average drive distance from stacks to truck

- Indoor operations/non-perishable: product diversity of size, product diversity of weight, max forklift capacity, average drive distance from stacks to truck, aisle width of main travel paths, average # years operator experience, average # units/truck, % live unload vs staged trucks, floor load/pallet, % full pallet items/individual pick, conveyor lines/pallet jack/forklift pick, % products with team lift requirements, sqft of pick zone, average # picks per order, equipment downtime, in stock %, variability in product count/package size

- Store pulling: product diversity of size, product diversity of weight, sqft of pick zone, average # picks per order, length of shift, task distribution among operators, operator experience (and/or time for response on questions), in stock %, % use best judgement/do not substitute orders

Ideally I would find data for all of the factors above, explore which of them have the greatest impact on units per hour, and then build an ML model to predict UPH for a specific site or within a site.

Data source ideas:
Bulk/Lumber distribution: find my old files, maybe reach out to previous connections to see if any info could be exported

Create a fictional warehouse based on this picking dataset. Use a public retail sales dataset for the product mix and generate random values within a range for product weight and size. Use the apriori algorithm to generate product frequencies and create fictional orders?
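As a rough illustration of the apriori idea mentioned above, the sketch below counts how often single items and item pairs appear together in a handful of made-up baskets. The basket contents, the `min_support` threshold, and the `frequent_itemsets` helper are all assumptions for illustration; a real run would use orders from a public retail dataset and would likely rely on an existing library rather than this hand-rolled version.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; a real run would use orders from a public retail dataset.
baskets = [
    {"screws", "hinges", "drill bit"},
    {"screws", "hinges"},
    {"screws", "paint"},
    {"hinges", "screws", "paint"},
    {"paint", "brush"},
]
min_support = 0.4  # assumed threshold: an itemset must appear in 40% of baskets


def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for frequent single items and frequent pairs."""
    n = len(baskets)
    # Pass 1: support of each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent = {
        frozenset([item]): count / n
        for item, count in item_counts.items()
        if count / n >= min_support
    }
    # Pass 2: candidate pairs are built only from frequent single items
    # (the pruning step that gives apriori its name).
    keep = {next(iter(itemset)) for itemset in frequent}
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket & keep), 2):
            pair_counts[frozenset(pair)] += 1
    frequent.update(
        {pair: count / n for pair, count in pair_counts.items() if count / n >= min_support}
    )
    return frequent


supports = frequent_itemsets(baskets, min_support)
for itemset, support in sorted(supports.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(support, 2))
```

The resulting supports could then serve as sampling weights when generating fictional orders, so that items that sell together in the retail data also tend to be picked together.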



In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('pick data.csv', low_memory=False)

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.isna().sum()

In [None]:
df.info()

NOTE: drop all Unnamed columns; these contain only null values
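One way to act on this note is sketched below on a toy frame. The column names `Unnamed: 8` and `Unnamed: 9` and all values are made up for illustration; the real file may number its unnamed columns differently. The check for all-null values is a safety net so that an unnamed column with real data is not dropped by accident.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw CSV (values are made up for illustration).
raw = pd.DataFrame({
    "PH_PICKEDB": ["IT", "JS"],
    "PH_TOTALLI": [3, 71],
    "Unnamed: 8": [np.nan, np.nan],
    "Unnamed: 9": [np.nan, np.nan],
})

# pandas labels header-less columns "Unnamed: <position>" when reading a CSV.
unnamed = [col for col in raw.columns if str(col).startswith("Unnamed")]

# Only drop the ones that are completely empty, so real data is never lost.
empty_unnamed = [col for col in unnamed if raw[col].isna().all()]
cleaned = raw.drop(columns=empty_unnamed)

print(cleaned.columns.tolist())
```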

In [None]:
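# create df of usable columns only; PH_DOC and PH_SORDER show 'value#' for most entries,
# and PH_DELIVER contains a large # of nulls

# .copy() makes pickdf independent of df, so later in-place changes do not trigger SettingWithCopyWarning
pickdf = df[['PH_PICKEDB', 'PH_PICKSTA', 'PH_PICKEND', 'PH_TOTALLI', 'PH_TOTALBO']].copy()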

In [None]:
pickdf.head()

In [None]:
# write this df to csv for future use
#pickdf.to_csv('pickdf.csv')

In [None]:
# Wrangling
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# Exploring
import scipy.stats as stats

# Visualizing
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
from sklearn.model_selection import learning_curve

pd.options.display.float_format = '{:20,.2f}'.format

In [None]:
pickdf = pd.read_csv('pickdf.csv', index_col=0)

In [None]:
pickdf.head()

In [None]:
pickdf.info()

In [None]:
pickdf.isna().sum()

In [None]:
# this is a very small # of nulls, will drop all
pickdf.dropna(inplace=True)

In [None]:
pickdf.isna().sum()

In [None]:
pickdf = pickdf.rename(columns={'PH_PICKEDB': 'operator', 'PH_PICKSTA': 'start_time', 'PH_PICKEND': 'end_time', 'PH_TOTALLI': 'total_lines', 'PH_TOTALBO': 'total_boxes'})

In [None]:
pickdf.head()

In [None]:
# convert start_time and end_time to datetime
pickdf['start'] = pd.to_datetime(pickdf['start_time'])
pickdf['end'] = pd.to_datetime(pickdf['end_time'])

In [None]:
pickdf.head()

In [None]:
pickdf.info()

In [None]:
pickdf = pickdf.drop(columns=['start_time', 'end_time'])

# START HERE

### For 1st iteration looking to create baseline and model to predict boxes/hr that beats baseline

#### for this iteration total lines will be defined as the number of line items on the order
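A minimal sketch of what such a baseline could look like, assuming boxes/hr is computed as `total_boxes` divided by pick time in hours and that the baseline simply predicts the mean boxes/hr from train for every order. The toy rows, the `boxes_per_hour` and `rmse` helpers, and the choice of RMSE as the metric are assumptions for illustration, not the project's confirmed approach.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the real splits come from the wrangle step.
train = pd.DataFrame({
    "total_boxes": [1, 1, 2, 4],
    "pick_seconds": [120.0, 360.0, 240.0, 720.0],
})
validate = pd.DataFrame({
    "total_boxes": [1, 3],
    "pick_seconds": [180.0, 360.0],
})


def boxes_per_hour(df):
    """Boxes picked per hour of pick time."""
    return df["total_boxes"] / (df["pick_seconds"] / 3600)


def rmse(actual, predicted):
    """Root mean squared error between actual values and predictions."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Baseline: always predict the mean boxes/hr observed in train.
baseline = boxes_per_hour(train).mean()

# Score the baseline on validate; a useful model should come in below this error.
baseline_rmse = rmse(boxes_per_hour(validate), baseline)

print(round(baseline, 2), round(baseline_rmse, 2))
```

Using the median instead of the mean is a reasonable alternative if boxes/hr turns out to be heavily skewed by very short picks.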

In [1]:
# Wrangling
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import acquire
import prepare
import wrangle_pick

# Exploring
import scipy.stats as stats

# Visualizing
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
from sklearn.model_selection import learning_curve

pd.options.display.float_format = '{:20,.2f}'.format

### more data cleaning needed   

- sort operator column into actual operators
    - for now, remove all operator names that occur fewer than 10 times
    - a list of those has been saved as the variable one_off in case it is needed later
    - one_off operators will need to be removed from the full dataframe before modeling
- add pick time in seconds column
- remove all observations from 2015 and 2020
   - visualization shows significantly lower volume in 2015 and 2020; for continuity, work with 2016-2019, where volume stays in a consistent range
- add lines per box
- add sec per box
- add sec per line
- drop single observation with negative pick time

### completed and added to prepare function
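The cleaning steps listed above could be sketched roughly as follows. `prepare_picks` is a hypothetical stand-in, not the project's actual prepare function, and filtering years on the `start` column is an assumption. The first three toy rows reuse values from the `train.head()` output in this notebook; the last two are made up to exercise the year and negative-time filters.

```python
import pandas as pd


def prepare_picks(df, min_picks=10, first_year=2016, last_year=2019):
    """Hypothetical stand-in for the prepare step; not the project's actual function."""
    df = df.copy()
    # Pick duration in seconds.
    df["pick_seconds"] = (df["end"] - df["start"]).dt.total_seconds()
    # Drop negative pick times (end recorded before start).
    df = df[df["pick_seconds"] >= 0]
    # Keep only the years with consistent volume (assumed to be judged by start year).
    df = df[df["start"].dt.year.between(first_year, last_year)]
    # Remove operator names seen fewer than min_picks times.
    counts = df["operator"].value_counts()
    one_off = counts[counts < min_picks].index
    df = df[~df["operator"].isin(one_off)].copy()
    # Ratio features; rows with zero boxes or lines would give inf and need separate handling.
    df["lines_per_box"] = df["total_lines"] / df["total_boxes"]
    df["sec_per_box"] = df["pick_seconds"] / df["total_boxes"]
    df["sec_per_line"] = df["pick_seconds"] / df["total_lines"]
    return df


picks = pd.DataFrame({
    "operator": ["IT", "IT", "JS", "IT", "IT"],
    "start": pd.to_datetime([
        "2016-11-10 09:58:58", "2017-08-22 08:40:42", "2017-06-19 16:54:16",
        "2015-03-02 10:00:00", "2018-01-05 12:00:00",
    ]),
    "end": pd.to_datetime([
        "2016-11-10 10:00:57", "2017-08-22 09:08:29", "2017-06-19 16:58:10",
        "2015-03-02 10:05:00", "2018-01-05 11:59:00",
    ]),
    "total_lines": [3, 71, 3, 5, 2],
    "total_boxes": [1, 1, 1, 1, 1],
})

# min_picks is lowered to 2 only because the toy frame is tiny.
prepared = prepare_picks(picks, min_picks=2)
print(prepared[["operator", "pick_seconds", "sec_per_line"]])
```

The operator filter runs after the year and negative-time filters here, so the count threshold applies to the rows that will actually be modeled. Running it first would be an equally defensible choice.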

In [2]:
train, test, validate = wrangle_pick.wrangle_pick_data()
train.shape, test.shape, validate.shape

Acquire: downloading raw data files...
Acquire: Completed!
Prepare: Cleaning acquired data...
Prepare: Completed!


((96644, 18), (20065, 18), (17055, 18))

In [3]:
train.to_csv('train_v1.csv')

In [4]:
train.head()

Unnamed: 0,operator,total_lines,total_boxes,start,end,pick_time,pick_seconds,int_day,day_name,start_year,start_month,start_Y_M,end_year,end_month,end_Y_M,sec_per_box,lines_per_box,sec_per_line
50058,IT,3,1,2016-11-10 09:58:58,2016-11-10 10:00:57,00:01:59,119.0,3,Thursday,2016,11,2016-11,2016,11,2016-11,119.0,3.0,39.67
78958,IT,71,1,2017-08-22 08:40:42,2017-08-22 09:08:29,00:27:47,1667.0,1,Tuesday,2017,8,2017-08,2017,8,2017-08,1667.0,71.0,23.48
72775,JS,3,1,2017-06-19 16:54:16,2017-06-19 16:58:10,00:03:54,234.0,0,Monday,2017,6,2017-06,2017,6,2017-06,234.0,3.0,78.0
68159,IT,4,1,2017-04-27 16:51:23,2017-04-27 16:57:45,00:06:22,382.0,3,Thursday,2017,4,2017-04,2017,4,2017-04,382.0,4.0,95.5
96678,HB,2,1,2018-03-15 12:17:11,2018-03-15 12:17:35,00:00:24,24.0,3,Thursday,2018,3,2018-03,2018,3,2018-03,24.0,2.0,12.0


In [None]:
train.groupby('operator')[['pick_seconds']].mean()

In [None]:
train.groupby('day_name')[['pick_seconds']].mean()

In [None]:
train.groupby(['day_name', 'operator']).pick_seconds.agg(['mean', 'median', 'min', 'max', 'count'])

In [None]:
train.groupby('operator').pick_seconds.agg(['mean', 'median', 'min', 'max', 'count'])

# return to this - 

In [None]:
# additional info needed: operator tenure (within this dataset)
# earliest pick start per operator (this is a min, so the variable is named accordingly)
op_first_start = train.groupby('operator')['start'].min()
op_first_start
#train.groupby('operator')['end'].max()
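A possible way to turn the first start and last end into a tenure figure is sketched below. The toy frame and the `tenure_days` column name are assumptions for illustration. Tenure here only means the span an operator is visible in this dataset, not their actual time with the company.

```python
import pandas as pd

# Toy picks for illustration; operators and timestamps are made up.
picks = pd.DataFrame({
    "operator": ["IT", "IT", "JS", "JS"],
    "start": pd.to_datetime([
        "2016-01-04 08:00", "2018-01-03 08:00", "2017-06-19 16:54", "2017-06-29 09:00",
    ]),
    "end": pd.to_datetime([
        "2016-01-04 08:05", "2018-01-03 08:10", "2017-06-19 16:58", "2017-06-29 09:03",
    ]),
})

# First recorded start and last recorded end for each operator.
tenure = picks.groupby("operator").agg(
    first_start=("start", "min"),
    last_end=("end", "max"),
)

# Whole days between the first and last pick seen for the operator.
tenure["tenure_days"] = (tenure["last_end"] - tenure["first_start"]).dt.days

print(tenure)
```

Computing this on the full cleaned dataframe before the train/validate/test split would likely give a truer span, since a split can leave an operator's first or last pick outside train.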

In [None]:
EDYTA = train[train.operator == 'EDYTA']
EDYTA.sort_values('start')

In [None]:
train.head()

questions for explore
1. total min/max pick time? by operator? by day?
2. does longer tenure = faster pick_time?
3. does average pick_time vary significantly by day of week?
4. days with most orders? fewest? is there a change in # of operators on fewest vs highest order days?
    - initial visual shows a significant drop in box count on Sat and Sun; likely there are fewer shifts on those days
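For question 3, one option is a Kruskal-Wallis test across days, sketched here on made-up numbers. It is a rank-based test, so it compares the distributions of pick times rather than the means directly, and it is swapped in here on the assumption that pick times are too skewed for a one-way ANOVA to be comfortable. The toy values and the 0.05 threshold are illustrative only.

```python
import pandas as pd
from scipy import stats

# Toy pick times in seconds; values are made up so that Saturday clearly differs.
picks = pd.DataFrame({
    "day_name": ["Monday"] * 5 + ["Tuesday"] * 5 + ["Saturday"] * 5,
    "pick_seconds": [
        100, 110, 120, 130, 140,
        105, 115, 125, 135, 145,
        300, 310, 320, 330, 340,
    ],
})

# One array of pick times per day of week.
samples = [group["pick_seconds"].to_numpy() for _, group in picks.groupby("day_name")]

# Kruskal-Wallis H-test: do the days share the same pick-time distribution?
statistic, p_value = stats.kruskal(*samples)

# Per-day means for context alongside the test result.
day_means = picks.groupby("day_name")["pick_seconds"].mean()

alpha = 0.05  # assumed significance level
print(day_means)
print(f"H = {statistic:.2f}, p = {p_value:.4f}, differs by day: {p_value < alpha}")
```

With close to 100,000 training rows, even tiny differences are likely to come out statistically significant, so the size of the per-day gap probably matters more than the p-value alone.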

additional variations
- assign some operators as PT (4hr vs 8hr shift)
- define each line as 1 unique item in 1 unique location, based on that infer size of items