# Introduction




**Using Process Mining with Pre- and Post-Intervention Analysis to Improve Digital Service Delivery: A Governmental Case Study**

Authors: Trottier, J., Van Woensel, W., Wang, X., Mallur, K., El-Gharib, N., Amyot, D.


This notebook illustrates the use of the [logprep4pm](https://github.com/ProcessMining-uOttawa/logprep4pm) library for pre-processing event logs, as a companion to the above-titled case study. The library provides a set of convenience functions for preparing and filtering an event log for process discovery.

The notebook also contains some code samples that do not use the library but nevertheless implement common pre-processing tasks.

---



**Notice:** This notebook applies only to Phase I of our case study. Due to confidentiality, the actual dataset has not been shared. Redactions have been made to conceal sensitive information on the process that was examined.



---




**The following is a high-level breakdown of the notebook / sections:**

* Ingestion
    * Sanity Check
* Exploratory Data Analysis
    * Inspection
    * Outlier Review
    * Duplicate Event Review & Removal
* Enhancement & Refinement
    * Replace resource names with role types
    * Anonymize case IDs for privacy protection
    * Rename event classes
    * Timezone conversion
    * Filter date range
* Export event log





## Import dependencies

In [None]:
# NOTE uncomment for Google Colab
# !pip install skimpy pandas

# Import essential modules for preprocessing
import pandas as pd
import re
import csv

# The logprep4pm module can be downloaded here: https://github.com/ProcessMining-uOttawa/logprep4pm
import DataPreprocessing as prep4pm

# Ingestion
Open the CSV file containing the event log, specifying the column names for the case ID, event, and timestamp, along with the timestamp format.
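
For readers without the library at hand, the following is a minimal plain-pandas sketch of the same idea: load a CSV and parse the timestamp column with an explicit format. It is not the implementation of `readCSV`; the inline sample data and the `sample_log` name are hypothetical and only stand in for `event_log_sample.csv`.

```python
import io
import pandas as pd

# Hypothetical three-event sample standing in for 'event_log_sample.csv'
sample_csv = io.StringIO(
    "case_id,activity,timestamp\n"
    "c1,submitRequest,2023-03-08 09:15:00.000\n"
    "c1,approveRequest,2023-03-08 10:02:30.500\n"
    "c2,submitRequest,2023-03-09 14:45:10.250\n"
)

sample_log = pd.read_csv(sample_csv)

# Parse timestamps with the same explicit format string used in the notebook
sample_log["timestamp"] = pd.to_datetime(
    sample_log["timestamp"], format="%Y-%m-%d %H:%M:%S.%f"
)

# Order events chronologically within each case
sample_log = sample_log.sort_values(["case_id", "timestamp"]).reset_index(drop=True)
print(sample_log.dtypes)
```

Passing an explicit format makes parsing fail loudly on malformed rows rather than guessing a format per value.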

In [None]:
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/readCSV

# (update column names to suit CSV dump)
event_log = prep4pm.readCSV('event_log_sample.csv','case_id','activity','timestamp',"%Y-%m-%d %H:%M:%S.%f")

## Sanity Check
Ensure the import was done correctly.
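
Beyond looking at the shape and the first rows, two quick programmatic checks can help: counting missing values per column and confirming that the timestamp column was parsed as a datetime type. The sketch below uses a hypothetical toy log (`toy_log`) with one deliberately missing timestamp.

```python
import pandas as pd

# Hypothetical toy log; the last timestamp is missing on purpose
toy_log = pd.DataFrame({
    "case_id": ["c1", "c1", "c2"],
    "event": ["submit", "approve", "submit"],
    "timestamp": pd.to_datetime(
        ["2023-03-08 09:15:00", "2023-03-08 10:02:30", None]
    ),
})

# Number of missing values in each column
missing_per_column = toy_log.isna().sum()

# True when the column holds parsed datetimes rather than raw strings
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(toy_log["timestamp"])

print(missing_per_column)
print(timestamp_is_datetime)
```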

In [None]:
event_log.shape

In [None]:
# Let's visually inspect the first ten rows to make sure the data is formatted properly.
# Ensure there are no obvious import issues, especially with CSVs where structure can break easily.

event_log.head(10)

# Exploratory Data Analysis

## Inspection
Understand the general properties of the event log and get high-level statistics, e.g., how many cases, events, event classes, and start/end events there are; then look at the event classes and their frequency.
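
To make these statistics concrete, here is a rough plain-pandas sketch that computes them for a hypothetical toy log. It is an illustration only, not the implementation of `getEventLogStats`, and it assumes the `case_id`, `event`, and `timestamp` column names used in the later cells of this notebook.

```python
import pandas as pd

# Hypothetical toy log with two cases
toy_log = pd.DataFrame({
    "case_id": ["c1", "c1", "c1", "c2", "c2"],
    "event": ["submit", "review", "approve", "submit", "reject"],
    "timestamp": pd.to_datetime([
        "2023-03-08 09:00:00", "2023-03-08 10:00:00", "2023-03-08 11:00:00",
        "2023-03-09 09:00:00", "2023-03-09 12:00:00",
    ]),
})

# Events must be in chronological order per case to find start/end events
ordered = toy_log.sort_values(["case_id", "timestamp"])
events_per_case = ordered.groupby("case_id")["event"]

stats = {
    "cases": ordered["case_id"].nunique(),
    "events": len(ordered),
    "event_classes": ordered["event"].nunique(),
    "start_events": events_per_case.first().value_counts().to_dict(),
    "end_events": events_per_case.last().value_counts().to_dict(),
}
print(stats)
```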


In [None]:
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/getEventLogStats
prep4pm.getEventLogStats(event_log)

## Outlier Review
Examine long-running cases and review outliers and incomplete or drifted traces that are anomalies, then decide whether to retain or discard them.
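
As an illustration of what "long-running" means here, the sketch below computes each trace's duration as the time between its first and last event, sorted longest first. It is a plain-pandas approximation on a hypothetical toy log, not the implementation of `getTraceDurations`.

```python
import pandas as pd

# Hypothetical toy log: c1 lasts 8 hours, c2 lasts 30 days
toy_log = pd.DataFrame({
    "case_id": ["c1", "c1", "c2", "c2"],
    "event": ["submit", "approve", "submit", "approve"],
    "timestamp": pd.to_datetime([
        "2023-03-08 09:00:00", "2023-03-08 17:00:00",
        "2023-03-01 00:00:00", "2023-03-31 00:00:00",
    ]),
})

# Duration of a trace = last timestamp minus first timestamp
timestamps_per_case = toy_log.groupby("case_id")["timestamp"]
durations = (timestamps_per_case.max() - timestamps_per_case.min()).sort_values(
    ascending=False
)
print(durations)
```

Cases at the top of this ranking are candidates for manual review rather than automatic removal.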

In [None]:
# Look at the event classes and their frequency in the dataset
# We might spot a few low-count event classes that are anomalies or exceptional cases
event_log['event'].value_counts()

In [None]:
# Examine the longest running cases and investigate these traces to determine how to handle these outliers
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/getTraceDurations
prep4pm.getTraceDurations(event_log)[0:20]

In [None]:
# Remove the traces that contain top secret and secret clearances
# (these would have events like 'checkCSIS', 'checkPolygraph')
delete_event_traces = ['checkCSIS', 'checkPolygraph']

# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/deleteAllEvents
event_log2 = prep4pm.deleteAllEvents(event_log,delete_event_traces)

# Check log stats before & after removing the traces:
print("before:")
print(prep4pm.getEventLogStats(event_log))
print("after:")
print(prep4pm.getEventLogStats(event_log2))

In [None]:
# Let's have another look at the event classes and their frequency
event_log2['event'].value_counts()

In [None]:
# Delete low-frequency event classes that are exceptional cases or remnants of a drifted process
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/removeEventsLowFrequency
event_log3 = prep4pm.removeEventsLowFrequency(event_log2, 25)

# Check log stats before & after removing the low-frequency event classes:
print("before:")
print(prep4pm.getEventLogStats(event_log2))
print("after:")
print(prep4pm.getEventLogStats(event_log3))

## Duplicate Event Review & Removal
Let's examine duplicate events to see how they impact the event log.
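
The idea of a duplicate used in this section (the same event class occurring right after itself within a case, optionally within a short time window) can be sketched in plain pandas as follows. This is an assumption-laden illustration on a hypothetical toy log, not the implementation of `eventIsRepeated` or `deleteDuplicateEventRowsDelta`; in particular, it compares each event with the immediately preceding row, which may differ from how the library handles longer runs of repeats.

```python
import pandas as pd

# Hypothetical toy log: 'submit' is logged three times in a row
toy_log = pd.DataFrame({
    "case_id": ["c1", "c1", "c1", "c1"],
    "event": ["submit", "submit", "submit", "approve"],
    "timestamp": pd.to_datetime([
        "2023-03-08 09:00:00", "2023-03-08 09:01:00",
        "2023-03-08 09:30:00", "2023-03-08 10:00:00",
    ]),
}).sort_values(["case_id", "timestamp"]).reset_index(drop=True)

# Previous event and timestamp within the same case
prev_event = toy_log.groupby("case_id")["event"].shift()
prev_time = toy_log.groupby("case_id")["timestamp"].shift()

# Repeated: same event class as the row before it in the same case
is_repeated = toy_log["event"].eq(prev_event)

# Only treat a repeat as a duplicate if it follows within 3 minutes
within_delta = (toy_log["timestamp"] - prev_time) <= pd.Timedelta(minutes=3)

deduplicated = toy_log[~(is_repeated & within_delta)]
print(deduplicated)
```

Here the second `submit` (one minute later) is dropped, while the third (29 minutes later) is kept because it falls outside the window.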

In [None]:
# List duplicate events (i.e., events with same activity occurring right after one another)
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/eventIsRepeated
event_log_repeated = prep4pm.eventIsRepeated(event_log3)

# Let's check the repeated ones:
event_log_duplicates = event_log_repeated.loc[event_log_repeated['isRepeated']]
event_log_duplicates

In [None]:
# Let's count how many duplicate events there are in order to determine the impact on the dataset
event_log_duplicates.shape[0]

In [None]:
# How many duplicates per event class?
event_log_duplicates['event'].value_counts()

In [None]:
# Delete duplicate events, i.e., remove duplicates but only within a given time delta (3 minutes)
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/deleteDuplicateEventRowsDelta
event_log_dedup = prep4pm.deleteDuplicateEventRowsDelta(event_log3, 3 * 60 * 1000)

print("before:", event_log3.shape[0])
print(prep4pm.getEventLogStats(event_log3))
print("after:", event_log_dedup.shape[0])
print(prep4pm.getEventLogStats(event_log_dedup))

In [None]:
# Let's look at how many duplicate events remain
# Perhaps we can optimize this a little bit further
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/eventIsRepeated
event_log_repeated2 = prep4pm.eventIsRepeated(event_log_dedup)

# Filter out which events are repeated and count them
event_log_duplicates2 = event_log_repeated2.loc[event_log_repeated2['isRepeated']]
event_log_duplicates2.shape[0]

In [None]:
# How many duplicates per event class?
event_log_duplicates2['event'].value_counts()

In [None]:
# Let's eyeball them to see if we should adjust our delta cutoff further
event_log_duplicates2.head(20)

In [None]:
# Compare between old and new counts:
compare_dupes = pd.concat([event_log_duplicates2['event'].value_counts(),event_log_duplicates['event'].value_counts()],axis=1,keys=["new","old"])
compare_dupes

In [None]:
# Let's keep working with the de-duplicated event log
event_log = event_log_dedup

# Enhancement & Refinement

## Replace resource names with role types
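
When many resources need to be replaced, a lookup table may be easier to maintain than one `.loc` assignment per resource. The sketch below shows this variant on a hypothetical toy log; the addresses other than the notebook's placeholder, the role labels, and the `"Unmapped"` fallback are all made up for illustration.

```python
import pandas as pd

# Hypothetical resources; only the first address appears in the notebook
toy_log = pd.DataFrame({
    "resource": ["john@smith.com", "jane@doe.com", "unknown@example.com"],
})

# Hypothetical mapping from resource name to role type
role_by_resource = {
    "john@smith.com": "Manager",
    "jane@doe.com": "Analyst",
}

# Resources missing from the mapping get an explicit fallback label
toy_log["resource"] = toy_log["resource"].map(role_by_resource).fillna("Unmapped")
print(toy_log)
```

Using an explicit fallback avoids leaving any original resource name in the exported log by accident.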

In [None]:
# Change resource names to their rank / role / level
event_log.loc[event_log['resource'] == 'john@smith.com','resource'] = "[Role in Organization]"

# Check out our dummy result:
event_log.loc[event_log['resource']=="[Role in Organization]",]

## Anonymize case IDs for privacy protection
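
One simple way to picture case ID anonymization is to replace each distinct ID with a neutral label. The sketch below does this with `pd.factorize` on hypothetical IDs; it is not the implementation of `anonymizeCaseIDs`. Note that `pd.factorize` numbers IDs in order of first appearance, so the new labels still reveal the order in which cases appear in the log.

```python
import pandas as pd

# Hypothetical original case IDs
toy_log = pd.DataFrame({
    "case_id": ["A-1001", "A-1001", "B-2002"],
    "event": ["submit", "approve", "submit"],
})

# factorize assigns one integer code per distinct value (0, 1, 2, ...)
codes, original_ids = pd.factorize(toy_log["case_id"])

# Replace the original IDs with neutral labels
toy_log["case_id"] = ["case_" + str(code + 1) for code in codes]
print(toy_log)
```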

In [None]:
# Anonymize case IDs for privacy protection
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/anonymizeCaseIDs
event_log_export = prep4pm.anonymizeCaseIDs(event_log)
event_log_export.head(20)

## Rename event classes

In [None]:
# Obfuscate event names to conceal the nature of the process

# This is a map of original to new event names, written as a Python dict:
# { Original Event Name in Database : New User Friendly Name }

replace_names = {
        'mgrRequestScreening'  : '[User Friendly Name]',
}

# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/renameEventNames
event_log_export2 = prep4pm.renameEventNames(event_log_export,replace_names)

# Check out our dummy result:
event_log_export2.loc[event_log_export2['event']=='[User Friendly Name]']

## Timezone conversion
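
The `America/New_York` zone follows daylight saving time, so the offset from UTC is five hours in winter and four hours in summer. The small sketch below, using two hypothetical UTC timestamps, shows the localize, convert, and strip-timezone steps on a pandas Series.

```python
import pandas as pd

# Hypothetical naive timestamps that are known to be in UTC
utc_naive = pd.Series(pd.to_datetime(["2023-01-15 12:00:00", "2023-07-15 12:00:00"]))

# Attach the UTC timezone, then convert to Eastern Time
eastern = utc_naive.dt.tz_localize("UTC").dt.tz_convert("America/New_York")

# Drop the timezone information again, keeping the local wall-clock time
eastern_naive = eastern.dt.tz_localize(None)
print(eastern_naive)
```

The January timestamp shifts by five hours and the July timestamp by four, which is why a fixed offset would be wrong for part of the year.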

In [None]:
event_log_export3 = event_log_export2.copy()

# Convert timestamps from UTC to Eastern Time (America/New_York, i.e., EST or EDT depending on the date)
event_log_export3['timestamp'] = pd.DatetimeIndex(event_log_export2['timestamp']).tz_localize("UTC").tz_convert("America/New_York")

event_log_export3.head(20)

In [None]:
# Remove localization for date range filter function
event_log_export3['timestamp'] = pd.DatetimeIndex(event_log_export3['timestamp']).tz_localize(None)
event_log_export3.head(20)

## Filter Date Range
We could not include data prior to a specific date (using "March 7th" as an example) as the system had just launched and the process model was unstable.
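
A trace-level date filter can be sketched in plain pandas as shown below. The sketch assumes that a trace is kept only if all of its events fall inside the range; whether `filterTracesWithinDateRange` applies exactly this rule is an assumption, and the toy log is hypothetical.

```python
import pandas as pd

# Hypothetical toy log: c1 starts before the cutoff, c2 lies fully inside the range
toy_log = pd.DataFrame({
    "case_id": ["c1", "c1", "c2", "c2"],
    "event": ["submit", "approve", "submit", "approve"],
    "timestamp": pd.to_datetime([
        "2023-03-01 10:00:00", "2023-03-10 10:00:00",
        "2023-03-08 10:00:00", "2023-04-01 10:00:00",
    ]),
})

range_start = pd.Timestamp("2023-03-07 00:00:00")
range_end = pd.Timestamp("2023-10-26 23:59:59")

# First and last timestamp of each trace
bounds = toy_log.groupby("case_id")["timestamp"].agg(["min", "max"])

# Keep only traces whose first and last events both fall inside the range
kept_cases = bounds[(bounds["min"] >= range_start) & (bounds["max"] <= range_end)].index
filtered = toy_log[toy_log["case_id"].isin(kept_cases)]
print(filtered)
```

Filtering whole traces rather than individual events avoids leaving truncated traces in the log.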

In [None]:
# Link to function description: https://processmining-uottawa.github.io/logprep4pm/#/./APIs/filterTracesWithinDateRange
event_log_export4 = prep4pm.filterTracesWithinDateRange(event_log_export3,'2023-03-07 00:00:00.0','2023-10-26 23:59:59.0','%Y-%m-%d %H:%M:%S.%f')

print("before:")
print(prep4pm.getEventLogStats(event_log_export3))
print("after:")
print(prep4pm.getEventLogStats(event_log_export4))

# Export Event Log
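
By default, `DataFrame.to_csv` also writes the row index as an unnamed first column. If the downstream process mining tool does not expect that column, passing `index=False` omits it. The sketch below compares the two header lines for a hypothetical toy log, writing to in-memory buffers instead of files.

```python
import io
import pandas as pd

# Hypothetical toy log
toy_log = pd.DataFrame({
    "case_id": ["case_1", "case_2"],
    "event": ["submit", "approve"],
})

# Default export: the index becomes an unnamed leading column
with_index = io.StringIO()
toy_log.to_csv(with_index)

# Export without the index column
without_index = io.StringIO()
toy_log.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(header_with_index)
print(header_without_index)
```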

In [None]:
# Output event log to CSV
# To-do: create export functions to export to XLS and XES
event_log_export4.to_csv('event_log_output.csv')