### Part 1: Analysis 
#### Analyze and extract any interesting insights you can derive from the data set attached (each row represents the assignment of a job in our research queue, including some data about the analyst who received the assignment and the current state of the research queue). What can you infer? What do you think this means for us from a business perspective? 

In [1]:
# import libraries

import pandas as pd
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 30)

import sqlite3

In [2]:
# read in file and create dataframe
df = pd.read_csv('Assignment Log.csv')

### Exploratory Data Analysis

In [3]:
df.head(5)

Unnamed: 0,Event occurred at,Analyst,Quality score (sourcing),Quality score (writing),Action,Request,Request created at,Job,Wait time (min),Waiting for,Analysts available,Analysts occupied,Total jobs available,Review jobs available,Vetting jobs available,Planning jobs available,Editing jobs available,Sourcing jobs available,Writing jobs available
0,06/22/2017 19:59:06,9fcbc63ff4c8bea5cea4efad782c87cf,5.0,5.0,Accepted Job,594bec5c95e2ce005840c23a,06/22/2017 12:12:12,review,1,review,0,13,14,4,6,2,0,1,1
1,06/22/2017 19:59:02,9fcbc63ff4c8bea5cea4efad782c87cf,5.0,5.0,Assigned Job,594bec5c95e2ce005840c23a,06/22/2017 12:12:12,review,1,review,1,13,15,5,6,2,0,1,1
2,06/22/2017 19:51:30,85c7b78e76b5232cd38014ea4cdc8f56,4.35,4.35,Declined Job,594bec83fd2cf400280aa965,06/22/2017 12:12:51,writing,9,"sourcing, writing",1,12,12,5,5,1,0,0,1
3,06/22/2017 19:51:01,0e9802516f8a79dd0d45211dd4ee74af,4.5,4.5,Accepted Job,594c1f5cd7e68f0028c9062c,06/22/2017 15:49:48,sourcing,1,"sourcing, writing",1,11,12,5,5,1,0,0,1
4,06/22/2017 19:50:58,85c7b78e76b5232cd38014ea4cdc8f56,4.35,4.35,Assigned Job,594bec83fd2cf400280aa965,06/22/2017 12:12:51,writing,8,"sourcing, writing",2,11,14,5,5,2,0,1,1


In [4]:
df.shape

(791, 19)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 791 entries, 0 to 790
Data columns (total 19 columns):
Event occurred at           791 non-null object
Analyst                     791 non-null object
Quality score (sourcing)    791 non-null float64
Quality score (writing)     791 non-null float64
Action                      791 non-null object
Request                     791 non-null object
Request created at          791 non-null object
Job                         791 non-null object
Wait time (min)             791 non-null int64
Waiting for                 785 non-null object
Analysts available          791 non-null int64
Analysts occupied           791 non-null int64
Total jobs available        791 non-null int64
Review jobs available       791 non-null int64
Vetting jobs available      791 non-null int64
Planning jobs available     791 non-null int64
Editing jobs available      791 non-null int64
Sourcing jobs available     791 non-null int64
Writing jobs available      791 non-null int64

In [6]:
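# convert 'Event occurred at' and 'Request created at' to datetime
# (explicit format, since the raw strings look like 06/22/2017 19:59:06)
date_format = '%m/%d/%Y %H:%M:%S'
df['Event occurred at'] = pd.to_datetime(df['Event occurred at'], format=date_format)
df['Request created at'] = pd.to_datetime(df['Request created at'], format=date_format)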

In [7]:
# Min and Max times for creating requests

print('Min Date: ' + str(min(df['Request created at'])))
print('Max Date: ' + str(max(df['Request created at'])))

Min Date: 2017-06-19 13:43:51
Max Date: 2017-06-22 18:16:11


In [8]:
# Min and Max times for events

print('Min Date: ' + str(min(df['Event occurred at'])))
print('Max Date: ' + str(max(df['Event occurred at'])))

Min Date: 2017-06-21 20:15:42
Max Date: 2017-06-22 19:59:06
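
With both columns parsed as datetimes, the age of a request at the moment of an event is a simple subtraction. The sketch below uses two rows copied from the `df.head(5)` output; the column name `request_age_min` is my own addition and is not part of the data set.

```python
import pandas as pd

# Rows 0 and 3 from df.head(5), timestamps only
sample = pd.DataFrame({
    'Event occurred at': ['06/22/2017 19:59:06', '06/22/2017 19:51:01'],
    'Request created at': ['06/22/2017 12:12:12', '06/22/2017 15:49:48'],
})

# Parse with the explicit month/day/year format used in the raw file
fmt = '%m/%d/%Y %H:%M:%S'
for col in ['Event occurred at', 'Request created at']:
    sample[col] = pd.to_datetime(sample[col], format=fmt)

# Minutes elapsed between request creation and the logged event
sample['request_age_min'] = (
    (sample['Event occurred at'] - sample['Request created at'])
    .dt.total_seconds() / 60
).round(1)

print(sample['request_age_min'].tolist())
```

Both requests are hours old at the time of the event, while 'Wait time (min)' is 1 for the same rows. That suggests the wait-time column measures something other than request age, though this reading rests on two rows only.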


#### Handling Nulls

In [9]:
# check for any null values
df.isnull().sum()

Event occurred at           0
Analyst                     0
Quality score (sourcing)    0
Quality score (writing)     0
Action                      0
Request                     0
Request created at          0
Job                         0
Wait time (min)             0
Waiting for                 6
Analysts available          0
Analysts occupied           0
Total jobs available        0
Review jobs available       0
Vetting jobs available      0
Planning jobs available     0
Editing jobs available      0
Sourcing jobs available     0
Writing jobs available      0
dtype: int64

In [10]:
# show the rows that contain a null, restricted to the columns that have nulls
def find_nulls(df):
    null_columns = df.columns[df.isnull().any()]
    return df.loc[df[null_columns].isnull().any(axis=1), null_columns]

In [11]:
find_nulls(df)

Unnamed: 0,Waiting for
195,
340,
392,
454,
595,
723,


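#### There are just 6 null values in the dataset, and all are in the 'Waiting for' column. 'Job' and 'Waiting for' do not always match (the check below finds 392 matching rows out of 791), so I inspect the rows around each null: where an adjacent event for the same request and analyst exists, I copy its 'Waiting for' value; otherwise I fall back to the row's 'Job' value.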

In [12]:
len(df[df['Job'] == df['Waiting for']])

392

In [13]:
len(df[df['Job'] != df['Waiting for']])

399

In [14]:
df.iloc[194:196]

Unnamed: 0,Event occurred at,Analyst,Quality score (sourcing),Quality score (writing),Action,Request,Request created at,Job,Wait time (min),Waiting for,Analysts available,Analysts occupied,Total jobs available,Review jobs available,Vetting jobs available,Planning jobs available,Editing jobs available,Sourcing jobs available,Writing jobs available
194,2017-06-22 14:31:08,94959cf2d0b592d0fa1e5e9cf760a1c7,5.0,5.0,Assigned Job,594bec81fd2cf400280aa95c,2017-06-22 12:12:49,planning,1,planning,5,16,8,0,3,5,0,0,0
195,2017-06-22 14:30:59,9fa24ddce8fd9d1526d9d7451304fc74,4.89,4.89,Accepted Job,594bec7ffd2cf400280aa953,2017-06-22 12:12:47,sourcing,26,,5,15,8,0,3,5,0,0,0


In [15]:
df.loc[[195], 'Waiting for'] = 'sourcing'

In [16]:
df.loc[339:341]

Unnamed: 0,Event occurred at,Analyst,Quality score (sourcing),Quality score (writing),Action,Request,Request created at,Job,Wait time (min),Waiting for,Analysts available,Analysts occupied,Total jobs available,Review jobs available,Vetting jobs available,Planning jobs available,Editing jobs available,Sourcing jobs available,Writing jobs available
339,2017-06-22 08:03:23,98de7e62209a07eee6ec8dc984911042,0.0,0.0,Assigned Job,594a9ce6c864e200288e766b,2017-06-21 12:20:54,sourcing,1,sourcing,4,10,4,0,0,1,0,1,2
340,2017-06-22 08:01:25,a29fe6d26c1a49ff4a3c876eaee0a1af,0.0,0.0,Accepted Job,594a9ce6c864e200288e766b,2017-06-21 12:20:54,sourcing,1,,3,10,3,0,0,1,0,0,2
341,2017-06-22 08:01:22,a29fe6d26c1a49ff4a3c876eaee0a1af,0.0,0.0,Accepted Job,594a9ce6c864e200288e766b,2017-06-21 12:20:54,sourcing,1,sourcing,3,10,3,0,0,1,0,0,2


In [17]:
df.loc[[340], 'Waiting for'] = 'sourcing'

In [18]:
df.loc[391:393]

Unnamed: 0,Event occurred at,Analyst,Quality score (sourcing),Quality score (writing),Action,Request,Request created at,Job,Wait time (min),Waiting for,Analysts available,Analysts occupied,Total jobs available,Review jobs available,Vetting jobs available,Planning jobs available,Editing jobs available,Sourcing jobs available,Writing jobs available
391,2017-06-22 06:17:54,7e22ad15724c44543d1d4bcafd10c812,5.0,5.0,Assigned Job,594ae6fb4f06e20035d92345,2017-06-21 17:36:59,review,1,review,1,10,6,3,1,0,0,0,2
392,2017-06-22 06:13:40,766d3435eda76c4de9f034b8f97a0602,3.5,3.5,Accepted Job,594a8fd422b3c50028e71976,2017-06-21 11:25:08,writing,1,,0,9,6,3,1,0,0,0,2
393,2017-06-22 06:13:34,766d3435eda76c4de9f034b8f97a0602,3.5,3.5,Assigned Job,594a8fd422b3c50028e71976,2017-06-21 11:25:08,writing,1,"sourcing, writing",1,9,7,3,1,0,0,0,3


In [19]:
df.loc[[392], 'Waiting for'] = 'sourcing, writing'

In [20]:
df.loc[453:455]

Unnamed: 0,Event occurred at,Analyst,Quality score (sourcing),Quality score (writing),Action,Request,Request created at,Job,Wait time (min),Waiting for,Analysts available,Analysts occupied,Total jobs available,Review jobs available,Vetting jobs available,Planning jobs available,Editing jobs available,Sourcing jobs available,Writing jobs available
453,2017-06-22 04:45:27,7e22ad15724c44543d1d4bcafd10c812,5.0,5.0,Assigned Job,594b00e23f82d10028f03be2,2017-06-21 19:27:30,review,1,review,1,14,7,3,1,0,0,0,3
454,2017-06-22 04:40:07,7e22ad15724c44543d1d4bcafd10c812,5.0,5.0,Accepted Job,594af3c6f3b9f600287eabe0,2017-06-21 18:31:34,review,1,,0,15,7,3,1,0,0,0,3
455,2017-06-22 04:40:03,7e22ad15724c44543d1d4bcafd10c812,5.0,5.0,Assigned Job,594af3c6f3b9f600287eabe0,2017-06-21 18:31:34,review,1,review,1,15,8,4,1,0,0,0,3


In [21]:
df.loc[[454], 'Waiting for'] = 'review'

In [22]:
df.loc[594:596]

Unnamed: 0,Event occurred at,Analyst,Quality score (sourcing),Quality score (writing),Action,Request,Request created at,Job,Wait time (min),Waiting for,Analysts available,Analysts occupied,Total jobs available,Review jobs available,Vetting jobs available,Planning jobs available,Editing jobs available,Sourcing jobs available,Writing jobs available
594,2017-06-21 23:26:17,9fcbc63ff4c8bea5cea4efad782c87cf,5.0,5.0,Assigned Job,594b20ed3f82d10028f03c6c,2017-06-21 21:44:13,sourcing,1,sourcing,1,24,7,1,3,1,0,1,1
595,2017-06-21 23:23:02,f7f7591403c6c431053920223069550a,5.0,5.0,Accepted Job,594b20ee3f82d10028f03c74,2017-06-21 21:44:14,planning,1,,0,23,7,1,3,1,0,1,1
596,2017-06-21 23:22:37,36ee9fc3bade4a4f71c2a6e5c2bd8862,0.0,0.0,Accepted Job,594b20eb3f82d10028f03c5c,2017-06-21 21:44:11,sourcing,1,sourcing,1,22,8,1,3,2,0,1,1


In [23]:
df.loc[[595], 'Waiting for'] = 'planning'

In [24]:
df.loc[722:724]

Unnamed: 0,Event occurred at,Analyst,Quality score (sourcing),Quality score (writing),Action,Request,Request created at,Job,Wait time (min),Waiting for,Analysts available,Analysts occupied,Total jobs available,Review jobs available,Vetting jobs available,Planning jobs available,Editing jobs available,Sourcing jobs available,Writing jobs available
722,2017-06-21 21:25:06,9fcbc63ff4c8bea5cea4efad782c87cf,5.0,5.0,Assigned Job,5949462dd9ae5200633f9640,2017-06-20 11:58:37,review,1,review,2,21,8,6,1,0,0,0,1
723,2017-06-21 21:24:44,206de922289a1f9f5ee250fc71308628,3.17,3.17,Accepted Job,594ae6fb4f06e20035d92345,2017-06-21 17:36:59,writing,1,,1,20,8,6,1,0,0,0,1
724,2017-06-21 21:24:38,206de922289a1f9f5ee250fc71308628,3.17,3.17,Accepted Job,594ae6fb4f06e20035d92345,2017-06-21 17:36:59,writing,1,writing,1,20,8,6,1,0,0,0,1


In [25]:
df.loc[[723], 'Waiting for'] = 'writing'

In [26]:
# check for any null values
df.isnull().sum()

Event occurred at           0
Analyst                     0
Quality score (sourcing)    0
Quality score (writing)     0
Action                      0
Request                     0
Request created at          0
Job                         0
Wait time (min)             0
Waiting for                 0
Analysts available          0
Analysts occupied           0
Total jobs available        0
Review jobs available       0
Vetting jobs available      0
Planning jobs available     0
Editing jobs available      0
Sourcing jobs available     0
Writing jobs available      0
dtype: int64
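
The six manual assignments follow one pattern, so they could also be written as a single rule. This is a minimal sketch, not the code that was run above: the helper name `fill_waiting_for` is hypothetical, and the toy frame reproduces only a few columns from rows 340, 341, 392, 393, and 595 shown earlier.

```python
import pandas as pd

def fill_waiting_for(frame):
    """Fill missing 'Waiting for' values (hypothetical helper)."""
    out = frame.copy()
    # First choice: borrow the value from another event logged for the
    # same request and analyst
    neighbour = (
        out.groupby(['Request', 'Analyst'])['Waiting for']
           .transform(lambda s: s.bfill().ffill())
    )
    out['Waiting for'] = out['Waiting for'].fillna(neighbour)
    # Fallback: use the row's own job type
    out['Waiting for'] = out['Waiting for'].fillna(out['Job'])
    return out

sample = pd.DataFrame(
    {
        'Request': ['594a9ce6c864e200288e766b', '594a9ce6c864e200288e766b',
                    '594a8fd422b3c50028e71976', '594a8fd422b3c50028e71976',
                    '594b20ee3f82d10028f03c74'],
        'Analyst': ['a29fe6d26c1a49ff4a3c876eaee0a1af', 'a29fe6d26c1a49ff4a3c876eaee0a1af',
                    '766d3435eda76c4de9f034b8f97a0602', '766d3435eda76c4de9f034b8f97a0602',
                    'f7f7591403c6c431053920223069550a'],
        'Job': ['sourcing', 'sourcing', 'writing', 'writing', 'planning'],
        'Waiting for': [None, 'sourcing', None, 'sourcing, writing', None],
    },
    index=[340, 341, 392, 393, 595],
)

filled = fill_waiting_for(sample)
print(filled['Waiting for'])
```

On these rows the rule gives the same values as the manual fixes. On the full log, a request and analyst pair may carry several different 'Waiting for' values, so the result should be spot-checked before relying on it.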

#### Handling Duplicates

In [27]:
# find duplicates
# select duplicate rows except first occurrence based on all columns
def duplicates(df):
    duplicate_rows_df = df[df.duplicated()]
    return duplicate_rows_df

In [28]:
duplicates(df)

Unnamed: 0,Event occurred at,Analyst,Quality score (sourcing),Quality score (writing),Action,Request,Request created at,Job,Wait time (min),Waiting for,Analysts available,Analysts occupied,Total jobs available,Review jobs available,Vetting jobs available,Planning jobs available,Editing jobs available,Sourcing jobs available,Writing jobs available
9,2017-06-22 19:47:03,85c7b78e76b5232cd38014ea4cdc8f56,4.35,4.35,Accepted Job,594c1e983b593b00281250ba,2017-06-22 15:46:32,sourcing,4,"sourcing, writing",1,11,11,5,5,1,0,0,0
55,2017-06-22 18:26:42,e2333b2dc03032f12c8526e45243f0c1,0.0,0.0,Accepted Job,594c0b883b593b002812506e,2017-06-22 14:25:12,sourcing,22,sourcing,2,14,9,2,7,0,0,0,0


In [29]:
df.drop_duplicates(inplace = True)

In [30]:
# confirm duplicates removed
duplicates(df)

Unnamed: 0,Event occurred at,Analyst,Quality score (sourcing),Quality score (writing),Action,Request,Request created at,Job,Wait time (min),Waiting for,Analysts available,Analysts occupied,Total jobs available,Review jobs available,Vetting jobs available,Planning jobs available,Editing jobs available,Sourcing jobs available,Writing jobs available
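
`df.duplicated()` only flags rows that match on every column. Rows 340 and 341 shown earlier log the same analyst accepting the same request three seconds apart, so they are not exact duplicates and survive `drop_duplicates`. As a sketch, a key-based check can surface such repeats; the choice of `key_columns` is my assumption about what identifies an event, and whether these repeats are logging noise or real events is not something the data confirms.

```python
import pandas as pd

# Rows 339-341 from the log, reduced to the identifying columns
sample = pd.DataFrame(
    {
        'Event occurred at': ['2017-06-22 08:03:23', '2017-06-22 08:01:25',
                              '2017-06-22 08:01:22'],
        'Analyst': ['98de7e62209a07eee6ec8dc984911042',
                    'a29fe6d26c1a49ff4a3c876eaee0a1af',
                    'a29fe6d26c1a49ff4a3c876eaee0a1af'],
        'Action': ['Assigned Job', 'Accepted Job', 'Accepted Job'],
        'Request': ['594a9ce6c864e200288e766b'] * 3,
        'Job': ['sourcing'] * 3,
    },
    index=[339, 340, 341],
)

# Exact duplicates: every column must match
exact = sample[sample.duplicated()]

# Repeated events: same analyst, action, request and job, any timestamp
key_columns = ['Analyst', 'Action', 'Request', 'Job']
repeated = sample[sample.duplicated(subset=key_columns, keep=False)]

print(len(exact))
print(repeated.index.tolist())
```

`keep=False` marks every member of a repeated group, which makes it easier to inspect the copies side by side before deciding whether to drop any.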


In [31]:
# inspecting the unique values found in columns
df.nunique()

Event occurred at           669
Analyst                      71
Quality score (sourcing)     37
Quality score (writing)      37
Action                        3
Request                      74
Request created at           72
Job                           7
Wait time (min)              39
Waiting for                  27
Analysts available            9
Analysts occupied            19
Total jobs available         17
Review jobs available        12
Vetting jobs available        8
Planning jobs available      13
Editing jobs available        3
Sourcing jobs available       5
Writing jobs available        7
dtype: int64

In [32]:
# create function to print value counts for a column
def values(column):
    print(df[column].value_counts())

In [33]:
# run values function (it prints its result, so no outer print() is needed)
values('Action')
values('Job')
values('Waiting for')

Assigned Job    393
Accepted Job    301
Declined Job     95
Name: Action, dtype: int64
sourcing         209
writing          160
review           143
vetting           94
planning          94
source review     58
editing           31
Name: Job, dtype: int64
sourcing                                         159
sourcing, writing                                 94
review                                            92
writing                                           69
review, planning, editing, sourcing, writing      66
planning                                          49
vetting, planning, editing, sourcing, writing     46
review, vetting, editing, sourcing, writing       32
vetting                                           27
review, planning                                  24
editing, sourcing, writing                        23
editing                                           22
review, editing, sourcing, writing                15
planning, editing, sourcing, writing       

In [34]:
# create function to find the average wait time for each request based
# on the job type (sourcing, writing, review, planning, vetting, source
# review and editing)

def averages(df, value):    
    return 'Average ' + value.title() + ' Wait Time in Minutes: '+ \
    str(round(df[df['Job'] == value]['Wait time (min)'].mean(), 2))

In [35]:
# run averages function 
print(averages(df, 'sourcing')) 
print(averages(df, 'writing')) 
print(averages(df, 'review')) 
print(averages(df, 'planning')) 
print(averages(df, 'vetting')) 
print(averages(df, 'source review')) 
print(averages(df, 'editing')) 

Average Sourcing Wait Time in Minutes: 10.46
Average Writing Wait Time in Minutes: 3.74
Average Review Wait Time in Minutes: 3.36
Average Planning Wait Time in Minutes: 2.18
Average Vetting Wait Time in Minutes: 1.2
Average Source Review Wait Time in Minutes: 2.48
Average Editing Wait Time in Minutes: 9.03
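
The seven calls above can be collapsed into one `groupby`, which also makes it easy to add the median and the row count next to the mean. This is a sketch on a toy frame built from the 'Job' and 'Wait time (min)' values of the five rows in `df.head(5)`; the name `wait_summary` is illustrative.

```python
import pandas as pd

# 'Job' and 'Wait time (min)' from the first five rows of the log
sample = pd.DataFrame({
    'Job': ['review', 'review', 'writing', 'sourcing', 'writing'],
    'Wait time (min)': [1, 1, 9, 1, 8],
})

# One row per job type: mean, median and number of events
wait_summary = (
    sample.groupby('Job')['Wait time (min)']
          .agg(['mean', 'median', 'count'])
          .round(2)
          .sort_values('mean', ascending=False)
)
print(wait_summary)
```

Reading the median alongside the mean helps because wait times look skewed: many rows show a wait of 1 minute, while row 195 shows 26.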


In [36]:
# maximum number of analysts and jobs available at any single event
print('Max analysts available: ' + str(df['Analysts available'].max()))
print('Max total jobs available: ' + str(df['Total jobs available'].max()))

Max analysts available: 8
Max total jobs available: 16


In [37]:
# number of logged events grouped by analyst, action and job

df.groupby(['Analyst', 'Action', 'Job'])['Action'].count()

Analyst                           Action        Job          
00360b3f177375b01b795a4be7b4686c  Assigned Job  sourcing          4
                                  Declined Job  sourcing          4
008fa95f5c94985e2d44047aeac31655  Accepted Job  review            4
                                  Assigned Job  review            3
052066814364850e1a17a90d576dd904  Accepted Job  review            2
                                  Assigned Job  review            1
0c2680433387fb4cf51a3546296f8422  Accepted Job  sourcing          3
                                                writing           2
                                  Assigned Job  sourcing          5
                                                writing           4
                                  Declined Job  sourcing          2
                                                writing           2
0e9802516f8a79dd0d45211dd4ee74af  Accepted Job  sourcing          5
                                                writin

In [38]:
# number of logged events per action (accepted, assigned, declined) and job type

df.groupby(['Action', 'Job'])['Action'].count()

Action        Job          
Accepted Job  editing           15
              planning          33
              review            65
              source review     27
              sourcing          72
              vetting           35
              writing           54
Assigned Job  editing           14
              planning          49
              review            70
              source review     29
              sourcing         101
              vetting           47
              writing           83
Declined Job  editing            2
              planning          12
              review             8
              source review      2
              sourcing          36
              vetting           12
              writing           23
Name: Action, dtype: int64
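
Raw counts are hard to compare across job types because the volumes differ, so a ratio against the number of assignments is more telling. The sketch below retypes the counts from the output above into a small frame; the column names `declines per assignment` and `accepts per assignment` are my own labels.

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.DataFrame(
    {
        'Accepted Job': [15, 33, 65, 27, 72, 35, 54],
        'Assigned Job': [14, 49, 70, 29, 101, 47, 83],
        'Declined Job': [2, 12, 8, 2, 36, 12, 23],
    },
    index=['editing', 'planning', 'review', 'source review',
           'sourcing', 'vetting', 'writing'],
)
# On the real data the same table could be built with:
# counts = df.groupby(['Job', 'Action']).size().unstack(fill_value=0)

rates = pd.DataFrame({
    'declines per assignment': counts['Declined Job'] / counts['Assigned Job'],
    'accepts per assignment': counts['Accepted Job'] / counts['Assigned Job'],
}).round(2)

print(rates.sort_values('declines per assignment', ascending=False))
```

Sourcing has the highest ratio of declines to assignments (36 of 101), followed by writing (23 of 83). These are ratios of event counts rather than true rates: editing shows 15 accepts against 14 assignments, so some events are either logged more than once or assigned outside the window covered by the log.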

In [39]:
df.groupby(['Request', 'Analyst'])['Wait time (min)'].sum()

Request                   Analyst                         
59480d57e759070028da6467  a09c8906073b4c0b75e3100b857b982a      2
                          d8e25a290ea51352bf9100a99c475f6d      7
59485c262e71030033104e3c  642782c690c8d963c487300a4751e220      2
                          7e22ad15724c44543d1d4bcafd10c812      2
                          a09c8906073b4c0b75e3100b857b982a      2
                          b599bfb42906772db81ac90137fc1916      2
                          e817dd4305458b293cbeb3015da99565      2
59486e7af5874900429ce273  9fcbc63ff4c8bea5cea4efad782c87cf      2
5949462dd9ae5200633f9640  39012c98c8fb80752d2bbcc3dc285230      2
                          9fa24ddce8fd9d1526d9d7451304fc74     10
                          9fcbc63ff4c8bea5cea4efad782c87cf      2
59494c13d9ae5200633f9695  a09c8906073b4c0b75e3100b857b982a      2
5949a81c4d1319005556396c  39012c98c8fb80752d2bbcc3dc285230      4
                          632a6492e9ff20cc4a442245836424e5      2
                 

In [40]:
# one-hot encode the Action column and append the indicator columns
# (Action itself is kept; dtype=int gives 0/1 values rather than booleans)
action_dummies = pd.get_dummies(df['Action'], dtype=int)
df = pd.concat([df, action_dummies], axis=1)

In [41]:
# sort rows by the time each request was created
df.sort_values('Request created at', inplace = True)

In [42]:
# add a sequential record id (1..n, in the sorted order) to serve as a primary key
df['record'] = range(1, len(df) + 1)

In [43]:
df.head()

Unnamed: 0,Event occurred at,Analyst,Quality score (sourcing),Quality score (writing),Action,Request,Request created at,Job,Wait time (min),Waiting for,Analysts available,Analysts occupied,Total jobs available,Review jobs available,Vetting jobs available,Planning jobs available,Editing jobs available,Sourcing jobs available,Writing jobs available,Accepted Job,Assigned Job,Declined Job,record
697,2017-06-21 21:50:31,a09c8906073b4c0b75e3100b857b982a,5.0,5.0,Accepted Job,59480d57e759070028da6467,2017-06-19 13:43:51,review,1,"review, planning, editing, sourcing, writing",1,21,13,6,1,6,0,0,0,1,0,0,1
718,2017-06-21 21:26:17,d8e25a290ea51352bf9100a99c475f6d,5.0,5.0,Assigned Job,59480d57e759070028da6467,2017-06-19 13:43:51,review,1,review,4,21,8,5,1,0,0,1,1,0,1,0,2
713,2017-06-21 21:27:19,d8e25a290ea51352bf9100a99c475f6d,5.0,5.0,Accepted Job,59480d57e759070028da6467,2017-06-19 13:43:51,review,2,review,1,22,6,5,1,0,0,0,0,1,0,0,3
714,2017-06-21 21:26:52,d8e25a290ea51352bf9100a99c475f6d,5.0,5.0,Assigned Job,59480d57e759070028da6467,2017-06-19 13:43:51,review,2,review,2,22,7,6,1,0,0,0,0,0,1,0,4
715,2017-06-21 21:26:52,d8e25a290ea51352bf9100a99c475f6d,5.0,5.0,Declined Job,59480d57e759070028da6467,2017-06-19 13:43:51,review,2,review,2,22,7,6,1,0,0,0,0,0,0,1,5
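
One use of the indicator columns is that per-analyst tallies become a plain sum. The sketch below rebuilds the 'Analyst' and 'Action' values of the five rows shown above; `per_analyst` is an illustrative name.

```python
import pandas as pd

# 'Analyst' and 'Action' from the five rows of df.head() above
sample = pd.DataFrame({
    'Analyst': ['a09c8906073b4c0b75e3100b857b982a']
               + ['d8e25a290ea51352bf9100a99c475f6d'] * 4,
    'Action': ['Accepted Job', 'Assigned Job', 'Accepted Job',
               'Assigned Job', 'Declined Job'],
})

# Append 0/1 indicator columns, one per action
sample = pd.concat(
    [sample, pd.get_dummies(sample['Action'], dtype=int)], axis=1
)

# Summing the indicators gives the number of each action per analyst
action_columns = ['Accepted Job', 'Assigned Job', 'Declined Job']
per_analyst = sample.groupby('Analyst')[action_columns].sum()
print(per_analyst)
```

The same table could be produced without dummies through `pd.crosstab(df['Analyst'], df['Action'])`; the indicator columns are mainly convenient once the data moves to a tool where summing is easier than pivoting.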


In [44]:
# save the cleaned data to a csv file
df.to_csv('clean_assignment_log.csv', index=False, header=True)