## ***Preprocessing of Real Event Logs***
#### *In this notebook, we preprocess the event log. We check whether the log contains multitasking executions and, if it does, remove them so that the log can be treated as ground truth. The cells below carry out these steps. Some event logs require additional checks, which are stated explicitly in the comments.*
*The event log XES file must contain the following attributes:*

`case:concept:name` | `concept:name` | `start:timestamp` | `time:timestamp` | `org:resource`
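
Before running the rest of the notebook, it can help to confirm that these attributes are present once the log has been converted to a DataFrame. The following is a minimal sketch, not part of the original pipeline; the helper name `missing_columns` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

# Column names listed above as required in the XES log.
REQUIRED_COLUMNS = [
    "case:concept:name",
    "concept:name",
    "start:timestamp",
    "time:timestamp",
    "org:resource",
]

def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Hypothetical helper: return the required column names absent from df."""
    return [col for col in required if col not in df.columns]

# Toy frame that deliberately lacks 'start:timestamp'.
toy = pd.DataFrame({
    "case:concept:name": ["Case1"],
    "concept:name": ["Lapping"],
    "time:timestamp": pd.to_datetime(["2012-02-14 08:15:00"], utc=True),
    "org:resource": ["ID4882"],
})

print(missing_columns(toy))  # ['start:timestamp']
```

An empty list means all five attributes are available and the overlap check can proceed.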


In [None]:
# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import time
import pm4py
from pm4py.objects.log.importer.xes import importer as xes_importer
import pytz
import pandas as pd
from collections import defaultdict
from pm4py.objects.log.util import sorting
from pm4py.objects.log.exporter.xes import exporter as xes_exporter

In [None]:
# Load the event log to check it for multitasking executions
log = xes_importer.apply('Production.xes')
log

parsing log, completed traces ::   0%|          | 0/225 [00:00<?, ?it/s]

[{'attributes': {'concept:name': 'Case1'}, 'events': [{'concept:name': 'Turning & Milling', 'start:timestamp': datetime.datetime(2012, 1, 30, 6, 24, tzinfo=datetime.timezone.utc), 'time:timestamp': datetime.datetime(2012, 1, 30, 12, 43, tzinfo=datetime.timezone.utc), 'org:resource': 'ID4932'}, '..', {'concept:name': 'Packing', 'start:timestamp': datetime.datetime(2012, 2, 17, 7, 0, tzinfo=datetime.timezone.utc), 'time:timestamp': datetime.datetime(2012, 2, 17, 8, 0, tzinfo=datetime.timezone.utc), 'org:resource': 'ID4820'}]}, '....', {'attributes': {'concept:name': 'Case99'}, 'events': [{'concept:name': 'Turning & Milling Q.C.', 'start:timestamp': datetime.datetime(2012, 3, 22, 18, 59, tzinfo=datetime.timezone.utc), 'time:timestamp': datetime.datetime(2012, 3, 22, 23, 3, tzinfo=datetime.timezone.utc), 'org:resource': 'ID4618'}, '..', {'concept:name': 'Packing', 'start:timestamp': datetime.datetime(2012, 3, 30, 8, 0, tzinfo=datetime.timezone.utc), 'time:timestamp': datetime.datetime(2012

In [None]:
# Converting eventlog xes into dataframe
log_df = pm4py.convert_to_dataframe(log)
# log_df = log_df.drop('triage color', axis=1)  # Run for ED_Log
log_df.head(60)

Unnamed: 0,concept:name,start:timestamp,time:timestamp,org:resource,case:concept:name
0,Turning & Milling,2012-01-30 06:24:00+00:00,2012-01-30 12:43:00+00:00,ID4932,Case1
1,Turning & Milling,2012-01-30 12:44:00+00:00,2012-01-30 13:42:00+00:00,ID4932,Case1
2,Turning & Milling,2012-01-30 13:59:00+00:00,2012-01-30 14:21:00+00:00,ID4167,Case1
3,Turning & Milling,2012-01-30 14:21:00+00:00,2012-01-30 17:58:00+00:00,ID4167,Case1
4,Turning & Milling Q.C.,2012-01-31 20:20:00+00:00,2012-01-31 21:50:00+00:00,ID4163,Case1
5,Laser Marking,2012-02-01 15:18:00+00:00,2012-02-01 15:27:00+00:00,ID0998,Case1
6,Lapping,2012-02-14 07:00:00+00:00,2012-02-14 08:15:00+00:00,ID4882,Case1
7,Lapping,2012-02-14 07:00:00+00:00,2012-02-14 08:15:00+00:00,ID4882,Case1
8,Lapping,2012-02-14 16:05:00+00:00,2012-02-14 16:38:00+00:00,ID4882,Case1
9,Lapping,2012-02-14 16:05:00+00:00,2012-02-14 17:20:00+00:00,ID4882,Case1


In [None]:
## "WE CHECK THIS FOR ED_LOG"

# # Negative duration check for ED LOG:
# log_df['start:timestamp'] = pd.to_datetime(log_df['start:timestamp'], format='ISO8601', utc=True)
# log_df['time:timestamp'] = pd.to_datetime(log_df['time:timestamp'], format='ISO8601', utc=True)
# log_df['duration_minutes'] = (log_df['time:timestamp'] - log_df['start:timestamp']).dt.total_seconds()/60
# log_df[log_df['duration_minutes']<0]

# # Removing rows which has duration in negative: (FOR ED_LOG):
# log_df = log_df[log_df['duration_minutes'] >= 0]
# log_df

##### ***Checking whether the real log has any overlapping executions:***
*In this step, we identify whether each workitem overlaps with another workitem of the same resource or executes on its own. The result is recorded in the column `overlap` as a Boolean indicator (True or False). Furthermore, each group of overlapping workitems (those connected through shared overlapping intervals) is assigned a unique identifier in the column `overlap_section`, where `0` marks workitems that do not overlap.*
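
To make the meaning of `overlap` and `overlap_section` concrete, here is a small illustrative sketch on toy data. It is not the notebook's own function: it uses a simpler sweep over start-sorted intervals, and it assumes every workitem has a positive duration. The function name `toy_overlap_sections` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

def toy_overlap_sections(df: pd.DataFrame) -> pd.DataFrame:
    """Sweep-based sketch: label groups of transitively overlapping intervals per resource."""
    out = df.copy()
    out["overlap_section"] = 0
    next_id = 1
    for _, g in out.groupby("org:resource", sort=False):
        g = g.sort_values("start:timestamp")
        groups, current, max_end = [], [], None
        for idx, start, end in zip(g.index, g["start:timestamp"], g["time:timestamp"]):
            if max_end is not None and start < max_end:
                # Starts before the group's latest end: same overlap group.
                current.append(idx)
                max_end = max(max_end, end)
            else:
                # No overlap with the running group: close it and start a new one.
                groups.append(current)
                current, max_end = [idx], end
        groups.append(current)
        for grp in groups:
            if len(grp) > 1:  # only groups with at least two workitems get an id
                out.loc[grp, "overlap_section"] = next_id
                next_id += 1
    out["overlap"] = out["overlap_section"] > 0
    return out

toy = pd.DataFrame({
    "concept:name": ["A", "B", "C", "D", "E"],
    "org:resource": ["R1", "R1", "R1", "R1", "R2"],
    "start:timestamp": pd.to_datetime(
        ["2012-01-30 09:00", "2012-01-30 09:30", "2012-01-30 10:30",
         "2012-01-30 13:00", "2012-01-30 09:00"]),
    "time:timestamp": pd.to_datetime(
        ["2012-01-30 10:00", "2012-01-30 11:00", "2012-01-30 12:00",
         "2012-01-30 14:00", "2012-01-30 10:00"]),
})

result = toy_overlap_sections(toy)
print(result["overlap_section"].tolist())  # [1, 1, 1, 0, 0]
```

A and C do not overlap directly, but both overlap B, so all three share section 1. D runs alone, and E belongs to a different resource, so both keep section 0.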

In [None]:
def mark_overlaps(
    log_df: pd.DataFrame,
    start_col: str = 'start:timestamp',
    end_col: str = 'time:timestamp',
    resource_col: str = 'org:resource',
    duration_sec_col: str = 'duration_seconds'  
) -> pd.DataFrame:
  
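    df = log_df.copy()
    # Parse both timestamp columns so the interval comparisons operate on datetimes
    df[start_col] = pd.to_datetime(df[start_col])
    df[end_col] = pd.to_datetime(df[end_col])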
    
    #initial values for columns
    df['overlap'] = False
    df['overlap_section'] = 0
    next_section_id = 1  #unique id for overlap sections across all resources

    #loop per resource
    grouped = df.groupby(resource_col, sort=False)
    for resource, group in grouped:
        # sort by start time 
        g = group.sort_values(start_col)
        indices = g.index.to_list()
        starts = g[start_col].values  
        ends = g[end_col].values
        n = len(indices)
        if n <= 1:
            continue  # no possible overlap for a single item

        # union-find (disjoint set) structure
        parent = list(range(n))
        def find(i):
            # path compression (path halving)
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        def union(i, j):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

        # compare intervals pairwise but break early using sorted starts:
        # for each i, only j with starts[j] < ends[i] can overlap i, so:
        for i in range(n):
            # j starts with i+1
            for j in range(i+1, n):
                # if start_j >= end_i then j and later cannot overlap i (sorted starts), so break
                if starts[j] >= ends[i]:
                    break
                # otherwise check overlap condition (strict)
                if (starts[i] < ends[j]) and (starts[j] < ends[i]):
                    union(i, j)

        # collect components
        components = {}
        for k in range(n):
            root = find(k)
            components.setdefault(root, []).append(k)

        # assign overlap flags/section ids for components with size > 1
        for comp in components.values():
            if len(comp) > 1:
                # assign a new global section id
                sid = next_section_id
                next_section_id += 1
                for k in comp:
                    df.at[indices[k], 'overlap'] = True
                    df.at[indices[k], 'overlap_section'] = sid
        # singletons remain overlap=False and overlap_section=0

    # return marked_df
    marked_df = df  # it contains all original columns plus 'overlap' and 'overlap_section'
    return marked_df

marked_df = mark_overlaps(log_df)
marked_df.head()

Unnamed: 0,concept:name,start:timestamp,time:timestamp,org:resource,case:concept:name,overlap,overlap_section
0,Turning & Milling,2012-01-30 06:24:00+00:00,2012-01-30 12:43:00+00:00,ID4932,Case1,True,16
1,Turning & Milling,2012-01-30 12:44:00+00:00,2012-01-30 13:42:00+00:00,ID4932,Case1,True,16
2,Turning & Milling,2012-01-30 13:59:00+00:00,2012-01-30 14:21:00+00:00,ID4167,Case1,False,0
3,Turning & Milling,2012-01-30 14:21:00+00:00,2012-01-30 17:58:00+00:00,ID4167,Case1,True,50
4,Turning & Milling Q.C.,2012-01-31 20:20:00+00:00,2012-01-31 21:50:00+00:00,ID4163,Case1,False,0


In [None]:
marked_df_sorted = marked_df.sort_values(['org:resource', 'start:timestamp']).reset_index(drop=True)

# Check whether any row has overlap == True; if so, the log needs to be preprocessed
print("No. of workitems that are taking part in multitasking: ", len(marked_df_sorted[marked_df_sorted['overlap']==True]))
print("No. of workitems that are NOT taking part in multitasking: ", len(marked_df_sorted[marked_df_sorted['overlap']==False]))
# Note: this count includes section 0 (the placeholder for non-overlapping workitems),
# so the number of actual overlap groups is one less than the printed value.
print("No. of unique overlapping group of workitems belongs to the same overlap interval: ", len(marked_df_sorted['overlap_section'].unique()))

No. of workitems that are taking part in multitasking:  2014
No. of workitems that are NOT taking part in multitasking:  2489
No. of unique overlapping group of workitems belongs to the same overlap interval:  710


#### ***Deleting rows from the real log to make it free of multitasking executions (No_Multitasking: NMT)***

In [None]:
#Average Duration of Activities (before making any amendments)

log_df['start:timestamp'] = pd.to_datetime(log_df['start:timestamp'], format='ISO8601', utc=True)
log_df['time:timestamp'] = pd.to_datetime(log_df['time:timestamp'], format='ISO8601', utc=True)

log_df['duration_seconds'] = (log_df['time:timestamp'] - log_df['start:timestamp']).dt.total_seconds()
log_df['duration_minutes'] = (log_df['time:timestamp'] - log_df['start:timestamp']).dt.total_seconds()/60

avg_durations = log_df.groupby('concept:name')['duration_seconds'].mean().reset_index()
avg_durations['duration_minutes'] = avg_durations['duration_seconds'] / 60

avg_durations = avg_durations.sort_values(by='concept:name', ascending=True)
print(avg_durations)

                   concept:name  duration_seconds  duration_minutes
0                Change Version      29505.000000        491.750000
1              Final Inspection       4500.000000         75.000000
2         Final Inspection Q.C.       6888.327273        114.805455
3                           Fix       8140.000000        135.666667
4                 Flat Grinding       5702.631579         95.043860
5               Grinding Rework       9490.515464        158.175258
6                       Lapping       6392.432432        106.540541
7                 Laser Marking       3465.714286         57.761905
8                       Milling      19718.000000        328.633333
9                  Milling Q.C.       4500.000000         75.000000
10               Nitration Q.C.       1575.000000         26.250000
11                      Packing       3600.000000         60.000000
12               Rework Milling       6300.000000        105.000000
13                  Round  Q.C.       4050.00000

In [None]:
# checking the summary of log (showing 'total_count': how many times it is executed, 'overlap_count': how many times it is executed as a multitasked activity)

#Total occurrences of each activity
total_counts = (marked_df_sorted.groupby('concept:name').size()
      .reset_index(name='total_count'))

# overlap_counts of each activity
overlap_counts = (marked_df_sorted[marked_df_sorted['overlap'] == True].groupby('concept:name')
      .size()
      .reset_index(name='overlap_count'))

# Merging for summary
overlap_summary = (pd.merge(total_counts, overlap_counts, on='concept:name', how='left').fillna(0))

overlap_summary['overlap_percentage'] = (overlap_summary['overlap_count'] / overlap_summary['total_count'] * 100).round(2)

overlap_summary = overlap_summary.sort_values('overlap_percentage', ascending=False)

print(overlap_summary)

affected_cases = marked_df_sorted.loc[marked_df_sorted['overlap'], 'case:concept:name'].nunique()
print("Number of affected cases:", affected_cases)

                   concept:name  total_count  overlap_count  \
1              Final Inspection            1            1.0   
9                  Milling Q.C.            1            1.0   
15  SETUP     Turning & Milling            3            3.0   
11                      Packing          277          258.0   
18   Turn & Mill. & Screw Assem           35           21.0   
10               Nitration Q.C.            4            2.0   
6                       Lapping          370          183.0   
20            Turning & Milling         1269          623.0   
21       Turning & Milling Q.C.          522          240.0   
7                 Laser Marking          252          110.0   
22                 Turning Q.C.           55           24.0   
2         Final Inspection Q.C.          550          235.0   
4                 Flat Grinding          114           44.0   
14               Round Grinding          774          229.0   
19                      Turning          127           

In [None]:
# >>>> Removing multitasked events to make the log non_multitasking(NMT): 

df_cleaned = marked_df_sorted.copy()

#A: Separating non-overlapping
non_overlapping = df_cleaned[df_cleaned['overlap_section'] == 0]

#B: From overlapping sections, keep only the first event in each section
first_in_section = (df_cleaned[df_cleaned['overlap_section'] > 0]
    .sort_values(['overlap_section', 'start:timestamp'])  
    .groupby('overlap_section')
    .head(1))

#C: Combining both subsets back and sorting
df_filtered = pd.concat([non_overlapping, first_in_section], ignore_index=True)

df_filtered = df_filtered.sort_values(['case:concept:name', 'start:timestamp']).reset_index(drop=True)

print(f"Original rows: {len(marked_df_sorted)}")
print(f"Filtered rows: {len(df_filtered)}")
print(f"Removed rows: {len(marked_df_sorted) - len(df_filtered)}")

Original rows: 4503
Filtered rows: 3198
Removed rows: 1305
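
After keeping only the first event of each overlap section, it can be worth confirming that no resource still has overlapping workitems. The sketch below is an optional sanity check and not part of the original notebook; the helper name `has_resource_overlap` and the toy rows are assumptions made for illustration. It uses the same strict comparison as the overlap check above, so intervals that merely touch are not counted as overlapping.

```python
import pandas as pd

def has_resource_overlap(df: pd.DataFrame,
                         resource_col: str = "org:resource",
                         start_col: str = "start:timestamp",
                         end_col: str = "time:timestamp") -> bool:
    """Hypothetical helper: True if any resource has two strictly overlapping intervals."""
    for _, g in df.groupby(resource_col, sort=False):
        g = g.sort_values(start_col)
        running_end = None  # latest end time among earlier events of this resource
        for start, end in zip(g[start_col], g[end_col]):
            if running_end is not None and start < running_end:
                return True
            running_end = end if running_end is None else max(running_end, end)
    return False

toy = pd.DataFrame({
    "org:resource": ["R1", "R1", "R1"],
    "start:timestamp": pd.to_datetime(
        ["2012-01-30 09:00", "2012-01-30 09:30", "2012-01-30 11:00"]),
    "time:timestamp": pd.to_datetime(
        ["2012-01-30 10:00", "2012-01-30 11:00", "2012-01-30 12:00"]),
})

print(has_resource_overlap(toy))          # True
print(has_resource_overlap(toy.drop(1)))  # False
```

In the notebook, the same check could be applied to `df_filtered`, where a result of `False` would indicate that the filtering left no overlapping executions.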


In [None]:
df_filtered = df_filtered.drop(['overlap', 'overlap_section'], axis=1)
df_filtered = df_filtered.sort_values(['org:resource', 'start:timestamp']).reset_index(drop=True)
df_filtered

Unnamed: 0,concept:name,start:timestamp,time:timestamp,org:resource,case:concept:name
0,Round Grinding,2012-01-23 14:01:00+00:00,2012-01-24 00:26:00+00:00,ID0420,Case238
1,Round Grinding,2012-01-24 13:50:00+00:00,2012-01-24 15:40:00+00:00,ID0420,Case21
2,Round Grinding,2012-01-24 21:53:00+00:00,2012-01-25 00:47:00+00:00,ID0420,Case207
3,Round Grinding,2012-01-25 14:00:00+00:00,2012-01-25 16:30:00+00:00,ID0420,Case20
4,Round Grinding,2012-01-26 15:21:00+00:00,2012-01-26 22:56:00+00:00,ID0420,Case20
...,...,...,...,...,...
3193,Turning & Milling,2012-03-30 01:49:00+00:00,2012-03-30 03:49:00+00:00,ID4932,Case134
3194,Turning & Milling,2012-03-30 03:50:00+00:00,2012-03-30 08:00:00+00:00,ID4932,Case134
3195,Turning & Milling,2012-03-30 08:01:00+00:00,2012-03-30 13:45:00+00:00,ID4932,Case135
3196,Turning & Milling,2012-03-31 07:43:00+00:00,2012-03-31 13:45:00+00:00,ID4932,Case134


In [None]:
# Saving the dataframe as an event log that has no multitasking executions (NMT):

from pm4py.objects.log.exporter.xes import exporter as xes_exporter
output_path = 'Production_NMT.xes'

xes_exporter.apply(df_filtered, output_path)

print(f"NMT log saved as {output_path}")

exporting log, completed traces ::   0%|          | 0/222 [00:00<?, ?it/s]

NMT log saved as Production_NMT.xes


In [None]:
# >>>>Checking activity duration after amendments

df_new = df_filtered.copy()

df_new['start:timestamp'] = pd.to_datetime(df_new['start:timestamp'], format='ISO8601', utc=True)
df_new['time:timestamp'] = pd.to_datetime(df_new['time:timestamp'], format='ISO8601', utc=True)

df_new['duration_seconds'] = (df_new['time:timestamp'] - df_new['start:timestamp']).dt.total_seconds()
df_new['duration_minutes'] = (df_new['time:timestamp'] - df_new['start:timestamp']).dt.total_seconds()/60

avg_durations = df_new.groupby('concept:name')['duration_seconds'].mean().reset_index()
avg_durations['duration_minutes'] = avg_durations['duration_seconds'] / 60

avg_durations = avg_durations.sort_values(by='concept:name', ascending=True)
print(avg_durations)

                  concept:name  duration_seconds  duration_minutes
0               Change Version      29505.000000        491.750000
1        Final Inspection Q.C.       7136.818182        118.946970
2                          Fix       8140.000000        135.666667
3                Flat Grinding       5928.791209         98.813187
4              Grinding Rework       9725.161290        162.086022
5                      Lapping       7103.112840        118.385214
6                Laser Marking       3678.034682         61.300578
7                      Milling      20635.714286        343.928571
8               Nitration Q.C.       3090.000000         51.500000
9                      Packing       3600.000000         60.000000
10              Rework Milling       6300.000000        105.000000
11                 Round  Q.C.       4050.000000         67.500000
12              Round Grinding      13165.273011        219.421217
13                       Setup       5670.000000         94.50