# Introduction




**Using Process Mining with Pre- and Post-Intervention Analysis to Improve Digital Service Delivery: A Governmental Case Study**

Authors: Trottier, J., Van Woensel, W., Wang, X., Mallur, K., El-Gharib, N., Amyot, D.


This notebook illustrates the use of the [logprep4pm](https://github.com/ProcessMining-uOttawa/logprep4pm) library for pre-processing event logs, as a companion to the above-titled case study. The library provides a set of convenience functions for preparing and filtering an event log for process discovery.

The notebook also contains some code samples that do not use the library but nevertheless implement common pre-processing tasks.

---



**Notice:** This notebook applies only to Phase I of our case study. Due to confidentiality, the actual dataset has not been shared. Redactions have been made to conceal sensitive information on the process that was examined.



---




**The following is a high-level breakdown of the notebook / sections:**

* Ingestion
    * Sanity Check
* Exploratory Data Analysis
    * Inspection
    * Outlier Review
    * Duplicate Event Review & Removal
* Enhancement & Refinement
    * Replace resource names with role types
    * Anonymize case IDs for privacy protection
    * Rename event classes
    * Timezone conversion
    * Filter date range
* Export event log





## Import dependencies

In [None]:
# Needed in Google Colab
!pip install skimpy pandas

# Import essential modules for preprocessing
import pandas as pd
import re
import csv

# The logprep4pm module can be downloaded here: https://github.com/ProcessMining-uOttawa/logprep4pm
import DataPreprocessing as prep4pm

# Ingestion
Open the CSV file containing the event log, specifying the column names for the case ID, event and timestamp, along with the timestamp format.

In [7]:
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/readCSV

# (update column names to suit CSV dump)
event_log = prep4pm.readCSV('event_log_sample.csv','case_id','activity','timestamp',"%Y-%m-%d %H:%M:%S.%f")
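
For readers without the library at hand, the ingestion step can be approximated in plain pandas. This is a minimal sketch, not the logprep4pm implementation: the helper `read_event_log`, the inline sample rows, and the choice to rename the columns to `case_id`, `event` and `timestamp` are assumptions made for illustration.

```python
import io
import pandas as pd

# Tiny inline sample standing in for 'event_log_sample.csv' (illustrative rows only)
sample_csv = io.StringIO(
    "case_id,activity,timestamp,resource\n"
    "78,appEmailInvitation,2023-03-03 15:16:18.410,maria@garcia.com\n"
    "78,mgrRequestScreening,2023-03-03 15:15:56.010,michael@smith.com\n"
)

def read_event_log(source, case_col, activity_col, time_col, time_format):
    """Hypothetical helper: load a CSV event log with normalized column names."""
    df = pd.read_csv(source)
    # Use the column names that the rest of this notebook works with
    df = df.rename(columns={case_col: "case_id", activity_col: "event", time_col: "timestamp"})
    # An explicit format makes malformed timestamps raise instead of being guessed
    df["timestamp"] = pd.to_datetime(df["timestamp"], format=time_format)
    # Order events chronologically within each case
    return df.sort_values(["case_id", "timestamp"]).reset_index(drop=True)

sample_log = read_event_log(sample_csv, "case_id", "activity", "timestamp", "%Y-%m-%d %H:%M:%S.%f")
print(sample_log.dtypes)
```

Parsing with an explicit format is slower to set up than letting pandas infer it, but it surfaces broken rows at import time rather than later in the analysis.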

## Sanity Check
Ensure the import was done correctly.

In [8]:
event_log.shape

(421, 5)

In [9]:
# Let's visually inspect the first ten rows to make sure the data is formatted properly.
# Ensure there are no obvious import issues, especially with CSVs where structure can break easily.

event_log.head(10)

Unnamed: 0,case_id,event,timestamp,resource,new_time
0,78,mgrRequestScreening,2023-03-03 15:15:56.010,michael@smith.com,1677856556
1,78,appEmailInvitation,2023-03-03 15:16:18.410,maria@garcia.com,1677856578
2,78,appApplicationSubmit,2023-03-03 15:32:05.300,maria@garcia.com,1677857525
3,78,sspApplAmended,2023-03-04 15:11:12.160,mary@smith.com,1677942672
4,78,appApplicationSubmit,2023-03-04 15:40:14.130,maria@garcia.com,1677944414
5,78,sspApplicationReview,2023-03-07 14:09:40.930,mary@smith.com,1678198180
6,78,mgrApplicantIDCheck,2023-03-07 14:10:20.300,maria@hernandez.com,1678198220
7,78,mgrVerifiedApplID,2023-03-08 18:47:08.660,maria@martinez.com,1678301228
8,78,sspCRCSubmit,2023-03-09 18:17:47.930,james@johnson.com,1678385867
9,78,sspReceiveCRC,2023-03-10 14:22:33.360,james@johnson.com,1678458153


# Exploratory Data Analysis

## Inspection
Understand the general properties of the event log and get high-level statistics, e.g., the number of cases, events, event classes, and start/end events. Then look at the event classes and their frequencies.


In [10]:
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/getEventLogStats
prep4pm.getEventLogStats(event_log)

Unnamed: 0,Cases,Events,Event Classes,Start events,End events
0,30,421,15,1,3
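
To make the statistics above more concrete, here is a hedged pandas sketch of how such a summary could be computed. It assumes that "Start events" and "End events" count the distinct event classes that open and close a trace; the function `event_log_stats` and the toy log are hypothetical and are not taken from logprep4pm.

```python
import pandas as pd

# Toy log: two cases that share a start event but end differently
toy_log = pd.DataFrame({
    "case_id": [1, 1, 1, 2, 2, 2],
    "event": ["A", "B", "C", "A", "B", "D"],
    "timestamp": pd.to_datetime([
        "2023-03-03 10:00", "2023-03-03 11:00", "2023-03-03 12:00",
        "2023-03-04 10:00", "2023-03-04 11:00", "2023-03-04 12:00",
    ]),
})

def event_log_stats(df):
    """Hypothetical summary: one row with high-level event log counts."""
    ordered = df.sort_values(["case_id", "timestamp"])
    events_by_case = ordered.groupby("case_id")["event"]
    return pd.DataFrame([{
        "Cases": df["case_id"].nunique(),
        "Events": len(df),
        "Event Classes": df["event"].nunique(),
        # Assumption: distinct event classes observed first / last in a trace
        "Start events": events_by_case.first().nunique(),
        "End events": events_by_case.last().nunique(),
    }])

stats = event_log_stats(toy_log)
print(stats)
```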


## Outlier Review
Examine long-running cases and review outliers, i.e., incomplete or drifted traces that are anomalies. Then decide whether to retain or discard them.

In [11]:
# Look at the event classes and their frequency in the dataset
# We might spot a few low-count event classes that are anomalies or exceptional cases
event_log['event'].value_counts()

event,count
appApplicationSubmit,52
appEmailInvitation,38
mgrApplicantIDCheck,32
sspRequestSecBrief,32
mgrRequestScreening,30
sspApplicationReview,30
mgrVerifiedApplID,30
sspCRCSubmit,30
sspReceiveCRC,30
sspPerformCC,30


In [12]:
# Examine the longest running cases and investigate these traces to determine how to handle these outliers
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/getTraceDurations
prep4pm.getTraceDurations(event_log)[0:20]

Unnamed: 0,case_id,start,end,time_delta
16,94,2023-03-14 13:38:07.740,2023-07-07 13:12:12.330,114 days 23:34:04.590000
18,96,2023-03-16 17:34:55.270,2023-07-06 16:52:05.740,111 days 23:17:10.470000
20,98,2023-03-18 11:23:26.870,2023-07-06 16:47:07.290,110 days 05:23:40.420000
22,100,2023-03-21 18:04:16.980,2023-06-27 19:42:05.900,98 days 01:37:48.920000
19,97,2023-03-16 17:39:48.010,2023-05-30 13:47:05.710,74 days 20:07:17.700000
9,87,2023-03-11 13:07:39.960,2023-05-17 16:22:10.640,67 days 03:14:30.680000
6,84,2023-03-09 17:27:28.690,2023-04-22 16:17:07.600,43 days 22:49:38.910000
26,104,2023-03-25 11:44:35.600,2023-05-05 12:57:05.260,41 days 01:12:29.660000
2,80,2023-03-03 15:53:33.420,2023-04-08 16:12:07.490,36 days 00:18:34.070000
1,79,2023-03-03 15:29:27.140,2023-04-08 14:12:07.490,35 days 22:42:40.350000
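
The duration table above boils down to the first and last timestamp of each case. A rough pandas equivalent, offered as a sketch rather than as the library's code, is shown below; the two sample cases reuse start and end timestamps that appear in the outputs of this notebook.

```python
import pandas as pd

# Start and end events of two cases, with timestamps taken from the outputs in this notebook
toy_log = pd.DataFrame({
    "case_id": [78, 78, 94, 94],
    "timestamp": pd.to_datetime([
        "2023-03-03 15:15:56.010", "2023-03-10 14:28:45.550",
        "2023-03-14 13:38:07.740", "2023-07-07 13:12:12.330",
    ]),
})

# First and last timestamp per case
durations = (
    toy_log.groupby("case_id")["timestamp"]
    .agg(start="min", end="max")
    .reset_index()
)
# Elapsed time per trace, longest first
durations["time_delta"] = durations["end"] - durations["start"]
durations = durations.sort_values("time_delta", ascending=False)
print(durations)
```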


In [13]:
# Remove the traces that contain top secret and secret clearances
# (these would have events like 'checkCSIS', 'checkPolygraph')
delete_event_traces = ['checkCSIS', 'checkPolygraph']

# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/deleteAllEvents
event_log2 = prep4pm.deleteAllEvents(event_log,delete_event_traces)

# Check log stats before & after removing the traces:
print("before:")
print(prep4pm.getEventLogStats(event_log))
print("after:")
print(prep4pm.getEventLogStats(event_log2))

before:
   Cases  Events  Event Classes  Start events  End events
0     30     421             15             1           3
after:
   Cases  Events  Event Classes  Start events  End events
0     27     380             13             1           1
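
The drop from 30 to 27 cases suggests that whole traces are removed when they contain one of the listed events. Under that assumption, the step could be sketched in pandas as follows; `drop_traces_containing` is a hypothetical name, not a logprep4pm function.

```python
import pandas as pd

toy_log = pd.DataFrame({
    "case_id": [1, 1, 2, 2, 3, 3],
    "event": ["mgrRequestScreening", "sspApproved",
              "mgrRequestScreening", "checkCSIS",
              "mgrRequestScreening", "checkPolygraph"],
})

def drop_traces_containing(df, activities):
    """Hypothetical helper: remove every case that has at least one listed event."""
    flagged_cases = df.loc[df["event"].isin(activities), "case_id"].unique()
    return df[~df["case_id"].isin(flagged_cases)].reset_index(drop=True)

filtered = drop_traces_containing(toy_log, ["checkCSIS", "checkPolygraph"])
print(filtered)
```

Removing the whole trace, rather than only the offending rows, keeps the remaining traces complete from start to end.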


In [14]:
# Let's have another look at the event classes and their frequency
event_log2['event'].value_counts()

event,count
appApplicationSubmit,49
appEmailInvitation,35
mgrApplicantIDCheck,29
sspRequestSecBrief,29
mgrRequestScreening,27
sspApplicationReview,27
mgrVerifiedApplID,27
sspCRCSubmit,27
sspReceiveCRC,27
sspPerformCC,27


In [15]:
# Delete low-frequency event classes that are exceptional cases or remnants of a drifted process
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/removeEventsLowFrequency
event_log3 = prep4pm.removeEventsLowFrequency(event_log2, 25)

# Check log stats before & after removing the low-frequency event classes:
print("before:")
print(prep4pm.getEventLogStats(event_log2))
print("after:")
print(prep4pm.getEventLogStats(event_log3))

before:
   Cases  Events  Event Classes  Start events  End events
0     27     380             13             1           1
after:
   Cases  Events  Event Classes  Start events  End events
0     27     358             12             1           1
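
A frequency filter of this kind can be sketched with `value_counts`. The snippet assumes that event classes occurring fewer than `min_count` times are dropped and that the affected traces are otherwise kept; the exact threshold semantics of the library function are not confirmed here, and `remove_low_frequency_events` is a hypothetical name.

```python
import pandas as pd

toy_log = pd.DataFrame({
    "case_id": [1, 1, 1, 2, 2, 3, 3],
    "event": ["A", "B", "X", "A", "B", "A", "B"],
})

def remove_low_frequency_events(df, min_count):
    """Hypothetical helper: keep only event classes seen at least min_count times."""
    counts = df["event"].value_counts()
    frequent = counts[counts >= min_count].index
    return df[df["event"].isin(frequent)].reset_index(drop=True)

frequent_log = remove_low_frequency_events(toy_log, 2)
print(frequent_log["event"].value_counts())
```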


## Duplicate Event Review & Removal
Let's examine duplicate events to see how they impact the event log.

In [16]:
# List duplicate events (i.e., events with same activity occurring right after one another)
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/eventIsRepeated
event_log_repeated = prep4pm.eventIsRepeated(event_log3)

# Let's check the repeated ones:
event_log_duplicates = event_log_repeated.loc[event_log_repeated['isRepeated']]
event_log_duplicates

Unnamed: 0,case_id,event,timestamp,resource,new_time,isRepeated
2,78,appApplicationSubmit,2023-03-03 15:32:05.300,maria@garcia.com,1677857525,True
4,78,appApplicationSubmit,2023-03-04 15:40:14.130,maria@garcia.com,1677944414,True
15,79,appEmailInvitation,2023-03-03 15:29:48.880,maria@rodriguez.com,1677857388,True
16,79,appEmailInvitation,2023-03-08 10:06:40.910,maria@rodriguez.com,1678270000,True
17,79,appEmailInvitation,2023-03-08 14:13:59.950,maria@rodriguez.com,1678284839,True
...,...,...,...,...,...,...
380,105,appApplicationSubmit,2023-03-29 15:07:20.540,david@smith.com,1680102440,True
382,105,mgrApplicantIDCheck,2023-03-29 18:16:13.060,maria@hernandez.com,1680113773,True
383,105,mgrApplicantIDCheck,2023-04-04 16:11:53.940,maria@hernandez.com,1680624713,True
413,107,sspRequestSecBrief,2023-04-01 12:40:52.570,maria@martinez.com,1680352852,True
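
In the output above, both members of a repeated pair are flagged. Assuming that an event counts as repeated when its predecessor or successor in the same case has the same activity, the flag could be derived as in this sketch (not the library's implementation):

```python
import pandas as pd

toy_log = pd.DataFrame({
    "case_id": [1, 1, 1, 2, 2],
    "event": ["A", "A", "B", "B", "C"],
    "timestamp": pd.to_datetime([
        "2023-03-03 10:00", "2023-03-03 10:05", "2023-03-03 11:00",
        "2023-03-04 09:00", "2023-03-04 09:30",
    ]),
})

ordered = toy_log.sort_values(["case_id", "timestamp"]).reset_index(drop=True)
events_by_case = ordered.groupby("case_id")["event"]

# Compare each event with its neighbours inside the same case only
same_as_previous = ordered["event"].eq(events_by_case.shift(1))
same_as_next = ordered["event"].eq(events_by_case.shift(-1))
ordered["isRepeated"] = same_as_previous | same_as_next
print(ordered)
```

Shifting within the group matters: the last event of case 1 and the first event of case 2 are both `B`, yet neither is flagged because they belong to different traces.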


In [18]:
# Let's count how many duplicate events there are in order to determine the impact on the dataset
event_log_duplicates.shape[0]

62

In [19]:
# How many duplicates per event class?
event_log_duplicates['event'].value_counts()

event,count
appApplicationSubmit,40
appEmailInvitation,14
mgrApplicantIDCheck,4
sspRequestSecBrief,4


In [20]:
# Delete duplicate events, but only those repeated within a given time delta (3 minutes, passed in milliseconds)
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/deleteDuplicateEventRowsDelta
event_log_dedup = prep4pm.deleteDuplicateEventRowsDelta(event_log3, 3 * 60 * 1000)

print("before:", event_log3.shape[0])
print(prep4pm.getEventLogStats(event_log3))
print("after:", event_log_dedup.shape[0])
print(prep4pm.getEventLogStats(event_log_dedup))

before: 358
   Cases  Events  Event Classes  Start events  End events
0     27     358             12             1           1
after: 346
   Cases  Events  Event Classes  Start events  End events
0     27     346             12             1           1
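
The idea behind delta-based de-duplication can be illustrated as follows. This sketch assumes that a repeat is dropped when the preceding event in the same case has the same activity and occurred at most `delta_ms` milliseconds earlier, so that the first occurrence is kept; the library may resolve duplicates differently, and `drop_repeats_within` is a hypothetical name.

```python
import pandas as pd

toy_log = pd.DataFrame({
    "case_id": [1, 1, 1, 1],
    "event": ["appApplicationSubmit", "appApplicationSubmit",
              "appApplicationSubmit", "sspApplicationReview"],
    "timestamp": pd.to_datetime([
        "2023-03-03 10:00:00", "2023-03-03 10:01:00",
        "2023-03-03 10:30:00", "2023-03-03 11:00:00",
    ]),
})

def drop_repeats_within(df, delta_ms):
    """Hypothetical helper: drop same-activity repeats that follow within delta_ms."""
    ordered = df.sort_values(["case_id", "timestamp"]).reset_index(drop=True)
    previous_event = ordered.groupby("case_id")["event"].shift(1)
    # Time elapsed since the previous event of the same case (NaT for the first event)
    gap = ordered.groupby("case_id")["timestamp"].diff()
    quick_repeat = ordered["event"].eq(previous_event) & (gap <= pd.Timedelta(milliseconds=delta_ms))
    return ordered[~quick_repeat].reset_index(drop=True)

deduplicated = drop_repeats_within(toy_log, 3 * 60 * 1000)
print(deduplicated)
```

Here the second submission (one minute after the first) is dropped, while the third (29 minutes later) is kept as a genuine resubmission.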


In [21]:
# Let's look at how many duplicate events remain
# Perhaps we can optimize this a little bit further
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/eventIsRepeated
event_log_repeated2 = prep4pm.eventIsRepeated(event_log_dedup)

# Filter out which events are repeated and count them
event_log_duplicates2 = event_log_repeated2.loc[event_log_repeated2['isRepeated']]
event_log_duplicates2.shape[0]

43

In [22]:
# How many duplicates per event class?
event_log_duplicates2['event'].value_counts()

event,count
appApplicationSubmit,25
appEmailInvitation,10
mgrApplicantIDCheck,4
sspRequestSecBrief,4


In [23]:
# Let's eyeball them to see if we should adjust our delta cutoff further
event_log_duplicates2.head(20)

Unnamed: 0,case_id,event,timestamp,resource,new_time,isRepeated
13,79,appEmailInvitation,2023-03-03 15:29:48.880,maria@rodriguez.com,1677857388,True
14,79,appEmailInvitation,2023-03-08 10:06:40.910,maria@rodriguez.com,1678270000,True
15,79,appApplicationSubmit,2023-03-10 18:37:52.480,maria@garcia.com,1678473472,True
16,79,appApplicationSubmit,2023-03-17 15:10:57.220,maria@garcia.com,1679065857,True
64,84,appApplicationSubmit,2023-03-10 13:35:03.220,maria@rodriguez.com,1678455303,True
65,84,appApplicationSubmit,2023-03-14 15:32:48.690,maria@rodriguez.com,1678807968,True
67,84,mgrApplicantIDCheck,2023-03-18 15:03:58.830,maria@martinez.com,1679151838,True
68,84,mgrApplicantIDCheck,2023-03-24 16:30:08.030,maria@martinez.com,1679675408,True
74,84,sspRequestSecBrief,2023-03-28 16:02:26.470,maria@hernandez.com,1680019346,True
75,84,sspRequestSecBrief,2023-04-22 13:55:25.650,maria@hernandez.com,1682171725,True


In [24]:
# Compare between old and new counts:
compare_dupes = pd.concat([event_log_duplicates2['event'].value_counts(),event_log_duplicates['event'].value_counts()],axis=1,keys=["new","old"])
compare_dupes

event,new,old
appApplicationSubmit,25,40
appEmailInvitation,10,14
mgrApplicantIDCheck,4,4
sspRequestSecBrief,4,4


In [25]:
# Let's keep working with the de-duplicated event log
event_log = event_log_dedup

# Enhancement & Refinement

## Replace resource names with role types

In [26]:
# Change resource names to their rank / role / level
event_log.loc[event_log['resource'] == 'john@smith.com','resource'] = "[Role in Organization]"

# Check out our dummy result:
event_log.loc[event_log['resource']=="[Role in Organization]",]

Unnamed: 0,case_id,event,timestamp,resource,new_time
8,78,sspPerformCC,2023-03-10 14:24:11.950,[Role in Organization],1678458251
22,79,sspPerformCC,2023-03-18 11:36:29.340,[Role in Organization],1679139389
34,81,sspPerformCC,2023-03-10 14:55:33.290,[Role in Organization],1678460133
46,82,sspPerformCC,2023-03-09 10:51:51.100,[Role in Organization],1678359111
58,83,sspPerformCC,2023-03-08 16:29:37.420,[Role in Organization],1678292977
72,84,sspPerformCC,2023-03-28 16:00:01.890,[Role in Organization],1680019201
86,85,sspPerformCC,2023-03-17 18:32:13.220,[Role in Organization],1679077933
98,86,sspPerformCC,2023-03-16 11:15:02.900,[Role in Organization],1678965302
111,87,sspPerformCC,2023-05-05 17:47:26.670,[Role in Organization],1683308846
123,88,sspPerformCC,2023-03-16 11:25:48.530,[Role in Organization],1678965948
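
When many resources need to be replaced, a lookup dictionary is usually easier to maintain than one `.loc` assignment per person. The sketch below uses hypothetical role labels and also lists any resource left unmapped, which helps to catch names that would otherwise leak into the exported log.

```python
import pandas as pd

toy_log = pd.DataFrame({
    "event": ["sspPerformCC", "sspApplicationReview", "appApplicationSubmit"],
    "resource": ["john@smith.com", "mary@smith.com", "maria@garcia.com"],
})

# Hypothetical mapping from individual resources to their role in the organization
role_by_resource = {
    "john@smith.com": "[Role A]",
    "mary@smith.com": "[Role B]",
}

# Resources that are not in the mapping are left unchanged by replace()
toy_log["resource"] = toy_log["resource"].replace(role_by_resource)

# Anything that is not a known role label still identifies a person
unmapped = sorted(set(toy_log["resource"]) - set(role_by_resource.values()))
print(toy_log)
print("still unmapped:", unmapped)
```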


## Anonymize case IDs for Privacy Protection

In [27]:
# Anonymize case IDs for privacy protection
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/anonymizeCaseIDs
event_log_export = prep4pm.anonymizeCaseIDs(event_log)
event_log_export.head(20)

Unnamed: 0,case_id,event,timestamp,resource,new_time
0,1,mgrRequestScreening,2023-03-03 15:15:56.010,michael@smith.com,1677856556
1,1,appEmailInvitation,2023-03-03 15:16:18.410,maria@garcia.com,1677856578
2,1,appApplicationSubmit,2023-03-03 15:32:05.300,maria@garcia.com,1677857525
3,1,sspApplicationReview,2023-03-07 14:09:40.930,mary@smith.com,1678198180
4,1,mgrApplicantIDCheck,2023-03-07 14:10:20.300,maria@hernandez.com,1678198220
5,1,mgrVerifiedApplID,2023-03-08 18:47:08.660,maria@martinez.com,1678301228
6,1,sspCRCSubmit,2023-03-09 18:17:47.930,james@johnson.com,1678385867
7,1,sspReceiveCRC,2023-03-10 14:22:33.360,james@johnson.com,1678458153
8,1,sspPerformCC,2023-03-10 14:24:11.950,[Role in Organization],1678458251
9,1,sspApproved,2023-03-10 14:28:45.550,michael@smith.com,1678458525
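
The output above shows the original case IDs replaced by sequential integers. One way to obtain a similar result in plain pandas is `pd.factorize`, as in this sketch; `anonymize_case_ids` is a hypothetical helper and may not match how the library assigns its IDs.

```python
import pandas as pd

toy_log = pd.DataFrame({
    "case_id": [78, 78, 79, 94],
    "event": ["mgrRequestScreening", "appEmailInvitation",
              "mgrRequestScreening", "mgrRequestScreening"],
})

def anonymize_case_ids(df):
    """Hypothetical helper: replace case IDs by 1, 2, 3, ... in order of appearance."""
    out = df.copy()
    codes, _ = pd.factorize(out["case_id"])
    out["case_id"] = codes + 1  # factorize codes start at 0
    return out

anonymized = anonymize_case_ids(toy_log)
print(anonymized)
```

Note that sequential IDs still reveal the order in which cases first appear in the log; if that ordering is itself sensitive, shuffling the cases before numbering them would be one option.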


## Rename event classes

In [28]:
# Obfuscate event names to conceal the nature of the process

# This is a map of original to new event names, written as a Python dictionary:
# { Original Event Name in Database : New User Friendly Name }

replace_names = {
        'mgrRequestScreening'  : '[User Friendly Name]',
}

# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/renameEventNames
event_log_export2 = prep4pm.renameEventNames(event_log_export,replace_names)

# Check out our dummy result:
event_log_export2.loc[event_log_export2['event']=='[User Friendly Name]']

Unnamed: 0,case_id,event,timestamp,resource,new_time
0,1,[User Friendly Name],2023-03-03 15:15:56.010,michael@smith.com,1677856556
12,2,[User Friendly Name],2023-03-03 15:29:27.140,james@smith.com,1677857367
26,3,[User Friendly Name],2023-03-03 23:50:44.810,james@smith.com,1677887444
38,4,[User Friendly Name],2023-03-04 18:29:26.520,james@smith.com,1677954566
50,5,[User Friendly Name],2023-03-04 18:50:09.100,michael@smith.com,1677955809
62,6,[User Friendly Name],2023-03-09 17:27:28.690,michael@smith.com,1678382848
77,7,[User Friendly Name],2023-03-10 16:59:59.380,michael@smith.com,1678467599
90,8,[User Friendly Name],2023-03-10 17:56:18.520,james@smith.com,1678470978
102,9,[User Friendly Name],2023-03-11 13:07:39.960,michael@smith.com,1678540059
115,10,[User Friendly Name],2023-03-11 14:58:57.770,james@smith.com,1678546737


## Timezone conversion

In [29]:
event_log_export3 = event_log_export2.copy()

# Convert timestamps from UTC to Eastern Time (EST or EDT, depending on the date)
event_log_export3['timestamp'] = pd.DatetimeIndex(event_log_export2['timestamp']).tz_localize("UTC").tz_convert("America/New_York")

event_log_export3.head(20)

Unnamed: 0,case_id,event,timestamp,resource,new_time
0,1,[User Friendly Name],2023-03-03 10:15:56.010000-05:00,michael@smith.com,1677856556
1,1,appEmailInvitation,2023-03-03 10:16:18.410000-05:00,maria@garcia.com,1677856578
2,1,appApplicationSubmit,2023-03-03 10:32:05.300000-05:00,maria@garcia.com,1677857525
3,1,sspApplicationReview,2023-03-07 09:09:40.930000-05:00,mary@smith.com,1678198180
4,1,mgrApplicantIDCheck,2023-03-07 09:10:20.300000-05:00,maria@hernandez.com,1678198220
5,1,mgrVerifiedApplID,2023-03-08 13:47:08.660000-05:00,maria@martinez.com,1678301228
6,1,sspCRCSubmit,2023-03-09 13:17:47.930000-05:00,james@johnson.com,1678385867
7,1,sspReceiveCRC,2023-03-10 09:22:33.360000-05:00,james@johnson.com,1678458153
8,1,sspPerformCC,2023-03-10 09:24:11.950000-05:00,[Role in Organization],1678458251
9,1,sspApproved,2023-03-10 09:28:45.550000-05:00,michael@smith.com,1678458525


In [30]:
# Remove localization for date range filter function
event_log_export3['timestamp'] = pd.DatetimeIndex(event_log_export3['timestamp']).tz_localize(None)
event_log_export3.head(20)

Unnamed: 0,case_id,event,timestamp,resource,new_time
0,1,[User Friendly Name],2023-03-03 10:15:56.010,michael@smith.com,1677856556
1,1,appEmailInvitation,2023-03-03 10:16:18.410,maria@garcia.com,1677856578
2,1,appApplicationSubmit,2023-03-03 10:32:05.300,maria@garcia.com,1677857525
3,1,sspApplicationReview,2023-03-07 09:09:40.930,mary@smith.com,1678198180
4,1,mgrApplicantIDCheck,2023-03-07 09:10:20.300,maria@hernandez.com,1678198220
5,1,mgrVerifiedApplID,2023-03-08 13:47:08.660,maria@martinez.com,1678301228
6,1,sspCRCSubmit,2023-03-09 13:17:47.930,james@johnson.com,1678385867
7,1,sspReceiveCRC,2023-03-10 09:22:33.360,james@johnson.com,1678458153
8,1,sspPerformCC,2023-03-10 09:24:11.950,[Role in Organization],1678458251
9,1,sspApproved,2023-03-10 09:28:45.550,michael@smith.com,1678458525
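
Because `America/New_York` observes daylight saving time, the offset applied by the conversion depends on the date. The short sketch below repeats the two steps above on one March and one July timestamp from this notebook, using the `.dt` accessor on a Series instead of a `DatetimeIndex`.

```python
import pandas as pd

utc_naive = pd.Series(pd.to_datetime([
    "2023-03-03 15:15:56.010",  # before the DST switch: UTC-05:00
    "2023-07-07 13:12:12.330",  # during DST: UTC-04:00
]))

# Step 1: declare the naive timestamps as UTC, then convert to Eastern Time
eastern = utc_naive.dt.tz_localize("UTC").dt.tz_convert("America/New_York")

# Step 2: drop the timezone again, keeping the local wall-clock time
eastern_naive = eastern.dt.tz_localize(None)

print(eastern)
print(eastern_naive)
```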


## Filter Date Range
We could not include data prior to a specific date (using "March 7th" as an example) as the system had just launched and the process model was unstable.

In [31]:
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/filterTracesWithinDateRange
event_log_export4 = prep4pm.filterTracesWithinDateRange(event_log_export3,'2023-03-07 00:00:00.0','2023-10-26 23:59:59.0','%Y-%m-%d %H:%M:%S.%f')

print("before:")
print(prep4pm.getEventLogStats(event_log_export3))
print("after:")
print(prep4pm.getEventLogStats(event_log_export4))

before:
   Cases  Events  Event Classes  Start events  End events
0     27     346             12             1           1
after:
   Cases  Events  Event Classes  Start events  End events
0     22     284             12             1           1
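
The case count drops here as well, which suggests that the filter works on whole traces. The sketch below assumes that a trace is kept only if all of its events fall inside the date range; whether the library applies exactly this rule is not confirmed, and `filter_traces_within` is a hypothetical name.

```python
import pandas as pd

toy_log = pd.DataFrame({
    "case_id": [1, 1, 2, 2],
    "event": ["A", "B", "A", "B"],
    "timestamp": pd.to_datetime([
        "2023-03-03 10:15:56", "2023-03-10 09:28:45",  # starts before the cutoff
        "2023-03-08 09:00:00", "2023-03-20 16:00:00",  # entirely inside the range
    ]),
})

def filter_traces_within(df, start, end):
    """Hypothetical helper: keep cases whose first and last events lie in [start, end]."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    bounds = df.groupby("case_id")["timestamp"].agg(["min", "max"])
    inside = bounds[(bounds["min"] >= start) & (bounds["max"] <= end)].index
    return df[df["case_id"].isin(inside)].reset_index(drop=True)

in_range = filter_traces_within(toy_log, "2023-03-07 00:00:00", "2023-10-26 23:59:59")
print(in_range)
```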


# Export Event Log

In [32]:
# Output event log to CSV
# To-do: create export functions to export to XLS and XES
event_log_export4.to_csv('event_log_output.csv')
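
By default, `to_csv` also writes the DataFrame index, which shows up as an extra `Unnamed: 0` column when the file is read back. If that column is not wanted, `index=False` avoids it. The following sketch demonstrates the round trip with an in-memory buffer instead of a file.

```python
import io
import pandas as pd

toy_log = pd.DataFrame({
    "case_id": [1, 1],
    "event": ["[User Friendly Name]", "appEmailInvitation"],
    "timestamp": pd.to_datetime(["2023-03-03 10:15:56.010", "2023-03-03 10:16:18.410"]),
})

# Write without the index column (a file path works the same way as the buffer)
buffer = io.StringIO()
toy_log.to_csv(buffer, index=False)

# Read it back to confirm that columns and timestamps survive the round trip
buffer.seek(0)
round_trip = pd.read_csv(buffer, parse_dates=["timestamp"])
print(round_trip)
```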