### “We manage what we measure, but frequently we measure what is easy.” - Datanuts meeting

link to dataset
https://www.kaggle.com/lewisgmorris/warehouse-picking-times

Content

PH_DOC - unique order reference

PH_SORDER - unique order reference

PH_DELIVER - unique delivery reference

PH_PICKEDB - name of picker

PH_PICKSTA - datetime of picking started

PH_PICKEND - datetime of picking ended

PH_TOTALLI - total lines picked

PH_TOTALBO - total boxes used

Inspiration

Can you find the patterns in the numbers?

Suggestions:

What days are busiest?   
Are things getting better or worse?   
Who is the best?   
Who's not pulling their weight?   

#### As seen above, these are typical questions asked of UPH (units per hour) information. But is this the whole story?     
Assuming that a worker is not "pulling their weight" because of low UPH presumes that people generally don't want to perform well, and it puts the burden on the individual to fix the problem. This removes the responsibility from management to find bottlenecks and other sources of delay that prevent workers from succeeding. It also allows senior managers to erroneously compare disparate environments.

#### What if we assumed that every worker wants to succeed and limitations in the environment prevent them from doing so?    
This mindset would challenge managers to find and fix the areas and processes that get in the way of success, or to determine whether fixing an issue costs more or less than hiring additional labor. It would also provide a more accurate basis for performance assessment.

#### How do we find the bottlenecks and issues that prevent success?    
Ask the people doing the job!

#### Some frequent limiting factors that are not typically considered when establishing or comparing UPH metrics from site to site:    
- Bulk/Lumber distribution: weather, % of warehouse indoor/outdoor, ground level/ramp, state truck maximum weight capacity, product diversity of size, product diversity of weight, max forklift capacity, average drive distance from stacks to truck, aisle width of main travel paths, max safe forklift operating speed, average # years operator experience, average # units/truck, % live unload vs staged trucks, lighting levels, equipment down time, in stock %, variability in product count/package size, average temperature

- Indoor operations/non-perishable: product diversity of size, product diversity of weight, max forklift capacity, average drive distance from stacks to truck, aisle width of main travel paths, max safe forklift operating speed, average # years operator experience, average # units/truck, % live unload vs staged trucks, floor load/pallet, % full pallet items/individual pick, conveyor lines/pallet jack/forklift pick, products with team lift requirements, sqft of pick zone, average # picks per order, equipment down time, in stock %, variability in product count/package size, average temperature

- Store pulling: store volume, store sqft, product diversity of size, product diversity of weight, order storage area (walk-in/transfer to shelf), sqft of pick zone, average # picks per order, length of shift, task distribution among operators, operator experience (and/or time for response on questions), in stock %, variability in product count/package size, % use best judgement/do not substitute orders

#### Some frequent limiting factors that are not typically considered when establishing or comparing UPH metrics within a site:
- Bulk/Lumber distribution: average # years operator experience, average # units/truck, % live unload vs staged trucks, task distribution among operators (inbound, outbound, "non-productive" such as safety walks, meetings, training), equipment down time, in stock %, variability in product count/package size, max forklift capacity, average drive distance from stacks to truck

- Indoor operations/non-perishable: product diversity of size, product diversity of weight, max forklift capacity, average drive distance from stacks to truck, aisle width of main travel paths, average # years operator experience, average # units/truck, % live unload vs staged trucks, floor load/pallet, % full pallet items/individual pick, conveyor lines/pallet jack/forklift pick, % products with team lift requirements, sqft of pick zone, average # picks per order, equipment down time, in stock %, variability in product count/package size

- Store pulling: product diversity of size, product diversity of weight, sqft of pick zone, average # picks per order, length of shift, task distribution among operators, operator experience (and/or time for response on questions), in stock %, % use best judgement/do not substitute orders

Ideally I would find data for all of the factors above, explore which of them have the greatest impact on units per hour, and then build an ML model to predict UPH for a specific site or within a site.

Data source ideas:

- Bulk/Lumber distribution: find my old files, maybe reach out to previous connections to see if any info could be exported

- Create a fictional warehouse based on this picking dataset. Use a public retail sales dataset for the product mix and generate random values within a range for product weight and size. Use the Apriori algorithm to generate product frequencies and create fictional orders?
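The fictional-warehouse idea could start from something like the sketch below. It is only an illustration under stated assumptions: the SKU names, the weight and volume ranges, and the helper names `make_catalog` and `make_order` are all hypothetical, and products are drawn uniformly at random rather than from Apriori-derived frequencies.

```python
import random

random.seed(42)  # fixed seed so the fictional data is reproducible

# Hypothetical ranges; replace with values that suit the real product mix.
WEIGHT_KG_RANGE = (0.1, 25.0)
VOLUME_L_RANGE = (0.05, 60.0)


def make_catalog(n_products):
    """Build a fictional catalog with a random weight and volume per product."""
    return {
        f"SKU{i:04d}": {
            "weight_kg": round(random.uniform(*WEIGHT_KG_RANGE), 2),
            "volume_l": round(random.uniform(*VOLUME_L_RANGE), 2),
        }
        for i in range(n_products)
    }


def make_order(catalog, n_lines):
    """Draw n_lines distinct products, e.g. to match a row's total_lines value."""
    return random.sample(sorted(catalog), k=min(n_lines, len(catalog)))


catalog = make_catalog(200)
order = make_order(catalog, n_lines=7)
order_weight = sum(catalog[sku]["weight_kg"] for sku in order)
```

Each row of the picking data could then be paired with a generated order of `total_lines` products, giving a per-order weight and volume to explore against pick time.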



In [None]:
import pandas as pd

In [10]:
pd.show_versions()


INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.6.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.15
pytest           : 5.3.5
hypothesis       : 5.5.4
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.7
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : 0.10.0
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.2
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.3

In [None]:
df = pd.read_csv('pick data.csv', low_memory=False)

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.isna().sum()

In [None]:
df.info()

NOTE: drop all Unnamed columns, these contain only null values
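One way to act on this note is sketched below. The toy frame stands in for the raw CSV, whose full column list is not shown here, so the `Unnamed: 8` and `Unnamed: 9` names are assumptions. The check only drops an `Unnamed` column when it really is all null.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw CSV (column names partly hypothetical).
raw = pd.DataFrame({
    "PH_PICKEDB": ["DP", "LM"],
    "PH_TOTALLI": [2, 3],
    "Unnamed: 8": [np.nan, np.nan],
    "Unnamed: 9": [np.nan, np.nan],
})

# Collect the 'Unnamed' columns that contain nothing but nulls.
unnamed = [c for c in raw.columns
           if c.startswith("Unnamed") and raw[c].isna().all()]

cleaned = raw.drop(columns=unnamed)
```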

In [None]:
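# create df of usable columns only: PH_DOC and PH_SORDER show 'value#' for most entries,
# and PH_DELIVER contains a large # of nulls

# .copy() makes pickdf independent of df, so later in-place changes do not trigger SettingWithCopyWarning
pickdf = df[['PH_PICKEDB', 'PH_PICKSTA', 'PH_PICKEND', 'PH_TOTALLI', 'PH_TOTALBO']].copy()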

In [None]:
pickdf.head()

In [None]:
# write this df to csv for future use
#pickdf.to_csv('pickdf.csv')

In [None]:
# Wrangling
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# Exploring
import scipy.stats as stats

# Visualizing
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
from sklearn.model_selection import learning_curve

pd.options.display.float_format = '{:20,.2f}'.format

In [None]:
pickdf = pd.read_csv('pickdf.csv', index_col=0)

In [None]:
pickdf.head()

In [None]:
pickdf.info()

In [None]:
pickdf.isna().sum()

In [None]:
# this is a very small # of nulls, will drop all
pickdf.dropna(inplace=True)

In [None]:
pickdf.isna().sum()

In [None]:
pickdf = pickdf.rename(columns={'PH_PICKEDB': 'operator', 'PH_PICKSTA': 'start_time', 'PH_PICKEND': 'end_time', 'PH_TOTALLI': 'total_lines', 'PH_TOTALBO': 'total_boxes'})

In [None]:
pickdf.head()

In [None]:
# convert start_time and end_time to datetime
pickdf['start'] = pd.to_datetime(pickdf['start_time'])
pickdf['end'] = pd.to_datetime(pickdf['end_time'])

In [None]:
pickdf.head()

In [None]:
pickdf.info()

In [None]:
pickdf = pickdf.drop(columns=['start_time', 'end_time'])

# START HERE

### For 1st iteration looking to create baseline and model to predict boxes/hr that beats baseline

#### for this iteration total lines will be defined as the number of line items on the order
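A minimal sketch of what the baseline could look like, assuming the target is boxes per hour for each pick and the baseline prediction is simply the mean of that target. The rows below are illustrative values in the shape of the prepared data. In practice the mean would be taken from the training split only and then scored on the other splits.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the prepared data (values are illustrative).
picks = pd.DataFrame({
    "total_boxes": [1, 1, 5, 2],
    "start": pd.to_datetime(["2019-04-03 14:13:12", "2016-08-09 14:49:17",
                             "2018-12-02 13:42:16", "2016-06-01 14:33:43"]),
    "end": pd.to_datetime(["2019-04-03 14:13:47", "2016-08-09 14:57:04",
                           "2018-12-02 13:45:00", "2016-06-01 14:39:54"]),
})

# Duration of each pick in hours.
hours = (picks["end"] - picks["start"]).dt.total_seconds() / 3600

# Zero-length (or negative) picks would make the rate undefined, so drop them.
valid = hours > 0
picks = picks[valid].copy()
picks["boxes_per_hour"] = picks["total_boxes"] / hours[valid]

# Baseline: predict the mean rate for every pick, then measure its RMSE.
baseline = picks["boxes_per_hour"].mean()
baseline_rmse = np.sqrt(((picks["boxes_per_hour"] - baseline) ** 2).mean())
```

Any model from this iteration would need an RMSE below `baseline_rmse` on the same rows to count as beating the baseline.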

In [1]:
# Wrangling
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import acquire
import prepare
import wrangle_pick

# Exploring
import scipy.stats as stats

# Visualizing
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
from sklearn.model_selection import learning_curve

pd.options.display.float_format = '{:20,.2f}'.format

In [2]:
train, test, validate = wrangle_pick.wrangle_pick_data()
train.shape, test.shape, validate.shape

Acquire: downloading raw data files...
Acquire: Completed!
Prepare: Cleaning acquired data...
Prepare: Completed!


((115575, 5), (23995, 5), (20396, 5))

In [3]:
train.head()

Unnamed: 0,operator,total_lines,total_boxes,start,end
128944,DP,2,1,2019-04-03 14:13:12,2019-04-03 14:13:47
46790,LEWIS,21,1,2016-08-09 14:49:17,2016-08-09 14:57:04
54283,LM,3,1,2016-11-21 12:21:50,2016-11-21 12:24:07
96986,IT,7,1,2018-03-20 09:31:37,2018-03-20 09:33:37
94283,IT,6,5,2018-12-02 13:42:16,2018-12-02 13:45:00


In [11]:
# additional info needed: time from start to end, day of week, operator tenure (in this dataset)
train['pick_time'] = train.end - train.start
train['int_day'] = train.start.dt.dayofweek
train['day_name'] = train.start.dt.day_name()
train.head()

Unnamed: 0,operator,total_lines,total_boxes,start,end,pick_time,int_day,day_name
128944,DP,2,1,2019-04-03 14:13:12,2019-04-03 14:13:47,00:00:35,2,Wednesday
46790,LEWIS,21,1,2016-08-09 14:49:17,2016-08-09 14:57:04,00:07:47,1,Tuesday
54283,LM,3,1,2016-11-21 12:21:50,2016-11-21 12:24:07,00:02:17,0,Monday
96986,IT,7,1,2018-03-20 09:31:37,2018-03-20 09:33:37,00:02:00,1,Tuesday
94283,IT,6,5,2018-12-02 13:42:16,2018-12-02 13:45:00,00:02:44,6,Sunday


In [32]:
# additional info needed: operator tenure (in this dataset)
# earliest pick start per operator, used as the start of their tenure
op_first_start = train.groupby('operator')['start'].min()
op_first_start
#train.groupby('operator')['end'].max()

operator
 EDYTA       2015-07-28 14:17:43
 IVETA       2016-09-06 13:47:30
 PETER       2015-02-10 09:36:43
10090630     2018-02-16 09:34:08
10173433     2016-11-16 09:25:52
                     ...        
WC12/22      2017-10-17 09:26:10
WH109/18     2016-07-07 09:32:04
Z2383700     2016-08-26 09:38:31
ZWSTRAPPRI   2019-11-15 15:02:31
ZWTIMPSTRA   2016-09-06 10:31:48
Name: start, Length: 165, dtype: datetime64[ns]
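The per-operator first start can be turned into a row-level tenure feature. The sketch below assumes tenure means the number of days since an operator's first recorded pick in this dataset. The `tenure_days` column name and the toy rows are illustrative.

```python
import pandas as pd

# Toy rows; only operator and start are needed for tenure.
picks = pd.DataFrame({
    "operator": ["IT", "IT", "LM", "IT", "LM"],
    "start": pd.to_datetime(["2018-03-20 09:31:37", "2018-12-02 13:42:16",
                             "2016-11-21 12:21:50", "2018-03-21 09:00:00",
                             "2016-11-28 12:21:50"]),
})

# First pick per operator, broadcast back onto every row of that operator.
first_start = picks.groupby("operator")["start"].transform("min")

# Whole days between this pick and the operator's first pick.
picks["tenure_days"] = (picks["start"] - first_start).dt.days
```

If this runs on the train split alone, an operator's first pick may have landed in another split, so computing the first start before splitting would give a more faithful tenure.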

In [29]:
EDYTA = train[train.operator == 'EDYTA']
EDYTA.sort_values('start')

Unnamed: 0,operator,total_lines,total_boxes,start,end,pick_time,int_day,day_name
2225,EDYTA,2,1,2015-01-05 15:05:22,2015-01-05 15:05:28,00:00:06,0,Monday
1174,EDYTA,3,1,2015-04-15 12:45:58,2015-04-15 12:46:07,00:00:09,2,Wednesday
1175,EDYTA,2,1,2015-04-15 12:54:34,2015-04-15 12:57:56,00:03:22,2,Wednesday
1176,EDYTA,2,1,2015-04-15 12:58:39,2015-04-15 12:59:45,00:01:06,2,Wednesday
1177,EDYTA,2,1,2015-04-15 13:10:05,2015-04-15 13:11:44,00:01:39,2,Wednesday
...,...,...,...,...,...,...,...,...
22991,EDYTA,2,1,2016-06-01 14:20:34,2016-06-01 14:21:09,00:00:35,2,Wednesday
22995,EDYTA,10,2,2016-06-01 14:33:43,2016-06-01 14:39:54,00:06:11,2,Wednesday
22998,EDYTA,14,1,2016-06-01 14:41:20,2016-06-01 14:45:08,00:03:48,2,Wednesday
23003,EDYTA,8,1,2016-06-01 14:47:22,2016-06-01 14:49:51,00:02:29,2,Wednesday
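
The groupby output lists ` EDYTA` with a leading space and a first start of 2015-07-28, while filtering on `'EDYTA'` returns rows from 2015-01-05. That suggests the same name is stored under two spellings. A sketch of normalizing the names follows. Treating both spellings as one operator is an assumption worth checking against the source system.

```python
import pandas as pd

# Toy operator column with the kind of stray whitespace seen in the output above.
picks = pd.DataFrame({"operator": [" EDYTA", "EDYTA", " PETER", "LM"]})

n_before = picks["operator"].nunique()

# Remove leading and trailing whitespace so both spellings group together.
picks["operator"] = picks["operator"].str.strip()

n_after = picks["operator"].nunique()
```

Comparing `n_before` with `n_after` on the real data shows how many of the 165 operator labels were duplicates caused by whitespace.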


questions for explore
1. total min/max pick time? by operator? by day?
2. does longer tenure mean a faster pick_time?
3. does average pick_time vary significantly by day of week?
4. days with most orders? fewest? is there a change in # of operators on fewest vs highest order days?
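
Questions 1, 3 and 4 can be approached with plain groupby summaries, sketched below on a few rows taken from the train sample above. The `pick_seconds`, `orders` and `operators` names are illustrative. Whether a day-of-week difference is significant would still need a statistical test on top of these summaries.

```python
import pandas as pd

picks = pd.DataFrame({
    "operator": ["DP", "LEWIS", "LM", "IT", "IT"],
    "start": pd.to_datetime(["2019-04-03 14:13:12", "2016-08-09 14:49:17",
                             "2016-11-21 12:21:50", "2018-03-20 09:31:37",
                             "2018-12-02 13:42:16"]),
    "end": pd.to_datetime(["2019-04-03 14:13:47", "2016-08-09 14:57:04",
                           "2016-11-21 12:24:07", "2018-03-20 09:33:37",
                           "2018-12-02 13:45:00"]),
})

picks["pick_seconds"] = (picks["end"] - picks["start"]).dt.total_seconds()
picks["day_name"] = picks["start"].dt.day_name()
picks["date"] = picks["start"].dt.date

# Questions 1 and 3: pick-time summary per day of week.
by_day = picks.groupby("day_name")["pick_seconds"].agg(["count", "mean", "min", "max"])

# Question 4: orders and distinct operators per calendar date.
daily = picks.groupby("date").agg(
    orders=("start", "size"),
    operators=("operator", "nunique"),
)
busiest = daily.sort_values("orders", ascending=False)
```

Swapping `"day_name"` for `"operator"` in the first groupby gives the per-operator version of question 1.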

additional variations
- assign some operators as PT (4hr vs 8hr shift)
- define each line as 1 unique item in 1 unique location and, based on that, infer the size of items