# Preprocessing
This notebook filters the data and saves the training and test sets in the data folder.

In [1]:
# import basic libraries
import pandas as pd

# import machine learning library
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import IsolationForest

# import pm4py library to work with XES logs and process mining
import pm4py
from pm4py.algo.transformation.log_to_features.util import locally_linear_embedding
from pm4py.visualization.graphs import visualizer

In [2]:
log = pm4py.read_xes("data/BPI_Challenge_2017.xes.gz")
log_df = pm4py.convert_to_dataframe(log)
log_df.head()

Unnamed: 0,Action,org:resource,concept:name,EventOrigin,EventID,lifecycle:transition,time:timestamp,case:LoanGoal,case:ApplicationType,case:concept:name,case:RequestedAmount,FirstWithdrawalAmount,NumberOfTerms,Accepted,MonthlyCost,Selected,CreditScore,OfferedAmount,OfferID
0,Created,User_1,A_Create Application,Application,Application_652823628,complete,2016-01-01 09:51:15.304000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
1,statechange,User_1,A_Submitted,Application,ApplState_1582051990,complete,2016-01-01 09:51:15.352000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
2,Created,User_1,W_Handle leads,Workflow,Workitem_1298499574,schedule,2016-01-01 09:51:15.774000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
3,Deleted,User_1,W_Handle leads,Workflow,Workitem_1673366067,withdraw,2016-01-01 09:52:36.392000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
4,Created,User_1,W_Complete application,Workflow,Workitem_1493664571,schedule,2016-01-01 09:52:36.403000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,


## Ordinal encoding
This encodes all string inputs as integers, which the models need as input. It may not be the best encoding method: the categories have no inherent order, while integers do.

For future implementations we also want to experiment with:
- One-hot encoding (using pm4py log_to_features) followed by PCA to reduce dimensionality
- Bi-Grams (also using pm4py log_to_features)
- Multisets
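
To make the ordering concern concrete, here is a minimal sketch on a toy column (the three values are taken from case:LoanGoal; the dataframe `toy_df` is made up for illustration). Ordinal encoding numbers the categories alphabetically, which suggests an order that does not exist, while one-hot encoding gives each category its own 0/1 column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with three unordered categories (illustrative data only)
toy_df = pd.DataFrame({"case:LoanGoal": ["Car", "Boat", "Home improvement", "Car"]})

# Ordinal encoding: categories are sorted and numbered 0, 1, 2,
# so "Boat" < "Car" < "Home improvement" becomes part of the data
ordinal = OrdinalEncoder().fit_transform(toy_df)
print(ordinal.ravel().tolist())

# One-hot encoding: one 0/1 column per category, no implied order
onehot = pd.get_dummies(toy_df, dtype=int)
print(onehot)
```

The price of one-hot encoding is one column per category, which is why the list above mentions PCA for high-cardinality columns.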

In [3]:
# encode string values using ordinal encoding
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoded_log = encoder.fit_transform(log_df)
encoded_df = pd.DataFrame(encoded_log)
encoded_df.fillna(value=-1, inplace=True)
encoded_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.0,0.0,4.0,0.0,233979.0,1.0,0.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
1,4.0,0.0,8.0,0.0,62695.0,1.0,1.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
2,0.0,0.0,22.0,2.0,552510.0,3.0,2.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
3,1.0,0.0,22.0,2.0,702398.0,6.0,3.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
4,0.0,0.0,21.0,2.0,631062.0,3.0,4.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0


## Anomaly detection
We apply IsolationForest to the encoded dataframe. This adds a column scores with one anomaly score per event: a score below 0 marks the event as anomalous, and a score of 0 or above marks it as normal.

*Note: based on the results, we think it's better not to remove the traces flagged as most anomalous (the lowest scores). After visual inspection, these traces don't seem to contain anything unusual*
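
As a sanity check on how to read the scores, here is a minimal sketch on synthetic data (the cluster, the planted outlier, and `random_state=0` are all assumptions for illustration, not part of the log). It shows that the planted outlier gets the lowest `decision_function` value and that `predict` labels a point as an outlier (-1) exactly when its score is negative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a tight 2D cluster plus one far-away point (illustrative only)
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[10.0, 10.0]])
X = np.vstack([inliers, outlier])

# Fixing random_state makes the scores reproducible between runs
model = IsolationForest(random_state=0).fit(X)
scores = model.decision_function(X)  # lower = more anomalous
labels = model.predict(X)            # -1 = outlier, 1 = inlier

print("score of planted outlier:", scores[-1])
print("number of points flagged:", int((labels == -1).sum()))
```

Note that the model in this notebook is fitted without a fixed random_state, so the exact scores will differ slightly from run to run.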

In [4]:
scores_df = log_df.copy()

model=IsolationForest()
model.fit(encoded_df)
scores_df["scores"] = model.decision_function(encoded_df)
scores_df.head()

Unnamed: 0,Action,org:resource,concept:name,EventOrigin,EventID,lifecycle:transition,time:timestamp,case:LoanGoal,case:ApplicationType,case:concept:name,case:RequestedAmount,FirstWithdrawalAmount,NumberOfTerms,Accepted,MonthlyCost,Selected,CreditScore,OfferedAmount,OfferID,scores
0,Created,User_1,A_Create Application,Application,Application_652823628,complete,2016-01-01 09:51:15.304000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.002252
1,statechange,User_1,A_Submitted,Application,ApplState_1582051990,complete,2016-01-01 09:51:15.352000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.032915
2,Created,User_1,W_Handle leads,Workflow,Workitem_1298499574,schedule,2016-01-01 09:51:15.774000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.05289
3,Deleted,User_1,W_Handle leads,Workflow,Workitem_1673366067,withdraw,2016-01-01 09:52:36.392000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.01981
4,Created,User_1,W_Complete application,Workflow,Workitem_1493664571,schedule,2016-01-01 09:52:36.403000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.056859


To see which events are most anomalous, we sort the dataframe by score in ascending order, so the most anomalous events appear first.

In [5]:
# show the lowest (most anomalous) scores first
scores_df.sort_values("scores")

Unnamed: 0,Action,org:resource,concept:name,EventOrigin,EventID,lifecycle:transition,time:timestamp,case:LoanGoal,case:ApplicationType,case:concept:name,case:RequestedAmount,FirstWithdrawalAmount,NumberOfTerms,Accepted,MonthlyCost,Selected,CreditScore,OfferedAmount,OfferID,scores
1185439,Created,User_96,O_Create Offer,Offer,Offer_1534515155,complete,2016-12-27 11:52:54.741000+00:00,Car,New credit,Application_1596066079,62500.0,39960.0,120.0,True,663.88,True,963.0,65200.0,,-0.272542
354267,Created,User_13,O_Create Offer,Offer,Offer_1992128451,complete,2016-04-30 13:33:55.876000+00:00,Car,Limit raise,Application_1125253534,58000.0,33000.0,121.0,True,600.00,True,897.0,58000.0,,-0.271849
816403,Created,User_12,O_Create Offer,Offer,Offer_103582037,complete,2016-09-09 18:57:32.651000+00:00,Boat,New credit,Application_2005060264,75000.0,75000.0,100.0,True,887.42,True,985.0,75000.0,,-0.269413
604512,Created,User_53,O_Create Offer,Offer,Offer_2101069545,complete,2016-07-18 08:49:18.540000+00:00,Boat,New credit,Application_1310092467,75000.0,60000.0,120.0,True,610.93,True,971.0,60000.0,,-0.266891
490244,Created,User_49,O_Create Offer,Offer,Offer_1684795486,complete,2016-06-15 17:20:17.520000+00:00,Boat,New credit,Application_461381832,60000.0,60000.0,120.0,True,610.93,True,891.0,60000.0,,-0.266814
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
484937,Obtained,User_3,W_Call incomplete files,Workflow,Workitem_2040598482,resume,2016-07-08 08:45:04.326000+00:00,Existing loan takeover,New credit,Application_396471385,10000.0,,,,,,,,,0.121299
530933,Obtained,User_3,W_Call incomplete files,Workflow,Workitem_1552303102,resume,2016-07-08 10:56:18.438000+00:00,Home improvement,New credit,Application_212279407,10000.0,,,,,,,,,0.121335
486665,Obtained,User_3,W_Call incomplete files,Workflow,Workitem_1437480239,resume,2016-06-30 07:48:07.342000+00:00,Existing loan takeover,New credit,Application_479534679,14000.0,,,,,,,,,0.121404
484935,Obtained,User_33,W_Call incomplete files,Workflow,Workitem_1830672503,resume,2016-07-04 14:12:33.335000+00:00,Existing loan takeover,New credit,Application_396471385,10000.0,,,,,,,,,0.121568


In [6]:
# show average score per trace, most anomalous first
scores_df[["case:concept:name", "scores"]].groupby(["case:concept:name"]).mean().sort_values("scores")

case:concept:name,scores
Application_896441766,-0.082105
Application_918459127,-0.072499
Application_742871702,-0.070639
Application_923895936,-0.070573
Application_83337214,-0.069637
...,...
Application_342142984,0.082741
Application_1806060525,0.083940
Application_2089806999,0.083981
Application_1845792027,0.084568


## Feature evolution
We may want to evaluate how the features evolve over time, to identify the parts of the event log whose behavior differs from the mainstream behavior.

*Note: my laptop doesn't have enough memory to run this, so I don't know what the results are*

In [None]:
x, y = locally_linear_embedding.apply(log)
gviz = visualizer.apply(x, y, variant=visualizer.Variants.DATES,
                        parameters={"title": "Locally Linear Embedding", "format": "svg", "y_axis": "Intensity"})
visualizer.view(gviz)

## Remaining time
Here we calculate the remaining time for each event, i.e. the time from that event until the last event of its trace.

In [7]:
# the shortest trace has 10 events
log_df[["case:concept:name", "Action"]].groupby(["case:concept:name"]).count().min()

Action    10
dtype: int64

In [8]:
# Remaining time

# add column "event_index_in_trace"
# which gives the 0-based position of the event in its trace (0 = first event)
log_df["event_index_in_trace"] = log_df.groupby("case:concept:name").cumcount()

# add column "remaining_time"
# which indicates time from that event until the last event in the trace
log_df["time:timestamp"] = pd.to_datetime(log_df["time:timestamp"], utc=True)
log_df["remaining_time"] = log_df.groupby("case:concept:name")["time:timestamp"].transform(lambda x: x.max() - x).dt.total_seconds() / (24 * 60 * 60)  # convert to float days

log_df.head()

Unnamed: 0,Action,org:resource,concept:name,EventOrigin,EventID,lifecycle:transition,time:timestamp,case:LoanGoal,case:ApplicationType,case:concept:name,...,FirstWithdrawalAmount,NumberOfTerms,Accepted,MonthlyCost,Selected,CreditScore,OfferedAmount,OfferID,event_index_in_trace,remaining_time
0,Created,User_1,A_Create Application,Application,Application_652823628,complete,2016-01-01 09:51:15.304000+00:00,Existing loan takeover,New credit,Application_652823628,...,,,,,,,,,0,13.248566
1,statechange,User_1,A_Submitted,Application,ApplState_1582051990,complete,2016-01-01 09:51:15.352000+00:00,Existing loan takeover,New credit,Application_652823628,...,,,,,,,,,1,13.248566
2,Created,User_1,W_Handle leads,Workflow,Workitem_1298499574,schedule,2016-01-01 09:51:15.774000+00:00,Existing loan takeover,New credit,Application_652823628,...,,,,,,,,,2,13.248561
3,Deleted,User_1,W_Handle leads,Workflow,Workitem_1673366067,withdraw,2016-01-01 09:52:36.392000+00:00,Existing loan takeover,New credit,Application_652823628,...,,,,,,,,,3,13.247628
4,Created,User_1,W_Complete application,Workflow,Workitem_1493664571,schedule,2016-01-01 09:52:36.403000+00:00,Existing loan takeover,New credit,Application_652823628,...,,,,,,,,,4,13.247628
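
The remaining-time calculation can be checked by hand on a tiny log. The sketch below uses two made-up traces (`A` and `B`) and the same column names as the notebook; it is an illustration of the calculation, not a replacement for the cell above.

```python
import pandas as pd

# Two made-up traces: A spans one day, B spans six hours
toy = pd.DataFrame({
    "case:concept:name": ["A", "A", "A", "B", "B"],
    "time:timestamp": pd.to_datetime([
        "2016-01-01 00:00", "2016-01-01 12:00", "2016-01-02 00:00",
        "2016-01-05 00:00", "2016-01-05 06:00",
    ], utc=True),
})

# Timestamp of the last event of each trace, repeated on every row of that trace
last_event = toy.groupby("case:concept:name")["time:timestamp"].transform("max")

# Remaining time in float days; the last event of every trace gets 0.0
toy["remaining_time"] = (last_event - toy["time:timestamp"]).dt.total_seconds() / 86400
print(toy)
```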


## Date formatting for training

In [9]:
# Convert timestamp to a pandas datetime object
log_df['timestamp'] = pd.to_datetime(log_df['time:timestamp'],format='ISO8601')

# Extract relevant features
log_df['year'] = log_df['timestamp'].dt.year
log_df['month'] = log_df['timestamp'].dt.month
log_df['day'] = log_df['timestamp'].dt.day
log_df['hour'] = log_df['timestamp'].dt.hour
log_df['minute'] = log_df['timestamp'].dt.minute
log_df['second'] = log_df['timestamp'].dt.second
log_df['microsecond'] = log_df['timestamp'].dt.microsecond 

# Drop the helper timestamp column (the original time:timestamp column is kept)
log_df = log_df.drop(['timestamp'], axis=1)

## Split train and test
Using pm4py.split_train_test unfortunately produced train traces that ended after test traces had started. This is not a good split, because the train set then contains events that happen during the test period, so we implement the split manually by sorting traces on timestamp.

In [10]:
trace_start_df = log_df[["case:concept:name", "time:timestamp"]].groupby(["case:concept:name"]).min()
trace_end_df = log_df[["case:concept:name", "time:timestamp"]].groupby(["case:concept:name"]).max()

In [11]:
# take the last 10% of the traces as test set
test_size = round(len(trace_start_df)*0.1)
test_cases = trace_start_df.sort_values("time:timestamp").tail(test_size)

In [12]:
# train cases must end before test cases start
train_cases = trace_end_df[trace_end_df["time:timestamp"] < test_cases["time:timestamp"].min()]

In [13]:
train_df = log_df[log_df["case:concept:name"].isin(train_cases.index)]
test_df = log_df[log_df["case:concept:name"].isin(test_cases.index)]

In [14]:
# double check that the timestamps don't overlap
# all traces in train must end before the start of traces in test
print(train_df["time:timestamp"].max())
print(test_df["time:timestamp"].min())

2016-11-22 09:21:30.939000+00:00
2016-11-22 09:22:17.274000+00:00
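
The split steps above can be collected into one function. This is a sketch under assumptions: the function name `temporal_split` and its parameters are ours, and the toy log is made up. It also makes one side effect visible: a trace that starts before the test period but ends inside it belongs to neither set and is dropped.

```python
import pandas as pd

def temporal_split(df, case_col="case:concept:name", time_col="time:timestamp",
                   test_fraction=0.1):
    """Hypothetical helper: the last `test_fraction` of traces (by start time)
    form the test set; train keeps only traces that end before the test set starts."""
    starts = df.groupby(case_col)[time_col].min().sort_values()
    ends = df.groupby(case_col)[time_col].max()
    n_test = round(len(starts) * test_fraction)
    test_cases = starts.tail(n_test)
    cutoff = test_cases.min()            # start of the earliest test trace
    train_cases = ends[ends < cutoff]    # traces that are fully finished by then
    train = df[df[case_col].isin(train_cases.index)]
    test = df[df[case_col].isin(test_cases.index)]
    return train, test

# Made-up log: c3 starts before the cutoff but ends after it, so it is dropped
toy = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4"],
    "time:timestamp": pd.to_datetime([
        "2016-01-01", "2016-01-02", "2016-01-02", "2016-01-03",
        "2016-01-03", "2016-01-06", "2016-01-05", "2016-01-07",
    ], utc=True),
})
toy_train, toy_test = temporal_split(toy, test_fraction=0.25)
print(sorted(toy_train["case:concept:name"].unique()))
print(sorted(toy_test["case:concept:name"].unique()))
```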


## Feature encoding
For now we use a basic one-hot encoding (pd.get_dummies), but we want to experiment with complex index encoding, where we encode the previous 10 activities (padding shorter prefixes). We also include the index of each event within its trace.
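
The index encoding is not implemented in this notebook yet. One possible way to build it is sketched below, with a window of 2 instead of 10 to keep the output readable. The function name `last_k_activities`, the `prev_activity_i` column names, and the `"PAD"` token are assumptions for illustration.

```python
import pandas as pd

def last_k_activities(df, k, case_col="case:concept:name",
                      act_col="concept:name", pad="PAD"):
    """Hypothetical helper: add k columns holding the previous k activities
    of the same trace, padded where the trace has fewer than k earlier events."""
    out = df.copy()
    grouped = df.groupby(case_col)[act_col]
    for i in range(1, k + 1):
        # shift(i) looks i events back within the same trace only
        out[f"prev_activity_{i}"] = grouped.shift(i).fillna(pad)
    return out

# Made-up log with one trace of three events and one trace of a single event
toy = pd.DataFrame({
    "case:concept:name": ["A", "A", "A", "B"],
    "concept:name": ["A_Create Application", "A_Submitted", "W_Handle leads",
                     "A_Create Application"],
})
indexed = last_k_activities(toy, k=2)
print(indexed)
```

The new columns are still strings, so they would need to be one-hot or ordinal encoded before training.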


In [15]:
# select the features we are going to encode
columns_to_encode = ['Action', 'concept:name', 'time:timestamp', 'case:LoanGoal', 'case:RequestedAmount', 'event_index_in_trace', 'remaining_time']

# one-hot encode the data
train_df_encode = pd.get_dummies(train_df[columns_to_encode], dtype=int)
test_df_encode = pd.get_dummies(test_df[columns_to_encode], dtype=int)
test_df_encode.head()

Unnamed: 0,time:timestamp,case:RequestedAmount,event_index_in_trace,remaining_time,Action_Created,Action_Deleted,Action_Obtained,Action_Released,Action_statechange,concept:name_A_Accepted,...,case:LoanGoal_Debt restructuring,case:LoanGoal_Existing loan takeover,case:LoanGoal_Extra spending limit,case:LoanGoal_Home improvement,case:LoanGoal_Motorcycle,case:LoanGoal_Not speficied,"case:LoanGoal_Other, see explanation",case:LoanGoal_Remaining debt home,case:LoanGoal_Tax payments,case:LoanGoal_Unknown
1080782,2016-11-22 09:22:17.274000+00:00,0.0,0,30.901643,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1080783,2016-11-22 09:22:17.285000+00:00,0.0,1,30.901643,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1080784,2016-11-22 09:22:17.288000+00:00,0.0,2,30.901643,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1080785,2016-11-22 09:22:17.291000+00:00,0.0,3,30.901643,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
1080786,2016-11-22 09:24:43.370000+00:00,0.0,4,30.899953,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,1
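
Because get_dummies is called on train and test separately, the two frames only have the same columns if every category occurs in both sets. A possible safeguard is sketched below on made-up data: reindex the test frame to the train columns, so that categories missing from test become all-zero columns and categories unseen in train are dropped.

```python
import pandas as pd

# Made-up data: "Released" only occurs in test, "Deleted" and "Obtained" only in train
train_toy = pd.DataFrame({"Action": ["Created", "Deleted", "Obtained"]})
test_toy = pd.DataFrame({"Action": ["Created", "Released"]})

train_enc = pd.get_dummies(train_toy, dtype=int)
test_enc = pd.get_dummies(test_toy, dtype=int)

# Give test exactly the train columns, in the same order
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

Whether this matters for the split in this notebook depends on the data; comparing the column lists of train_df_encode and test_df_encode is a quick way to find out.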


In [16]:
# Concatenate the DataFrames based on the index
full_train_df = pd.concat([train_df[["case:concept:name", "org:resource", "lifecycle:transition"]], train_df_encode], axis=1)
full_test_df = pd.concat([test_df[["case:concept:name", "org:resource", "lifecycle:transition"]], test_df_encode], axis=1)

full_train_df.head()

Unnamed: 0,case:concept:name,org:resource,lifecycle:transition,time:timestamp,case:RequestedAmount,event_index_in_trace,remaining_time,Action_Created,Action_Deleted,Action_Obtained,...,case:LoanGoal_Debt restructuring,case:LoanGoal_Existing loan takeover,case:LoanGoal_Extra spending limit,case:LoanGoal_Home improvement,case:LoanGoal_Motorcycle,case:LoanGoal_Not speficied,"case:LoanGoal_Other, see explanation",case:LoanGoal_Remaining debt home,case:LoanGoal_Tax payments,case:LoanGoal_Unknown
0,Application_652823628,User_1,complete,2016-01-01 09:51:15.304000+00:00,20000.0,0,13.248566,1,0,0,...,0,1,0,0,0,0,0,0,0,0
1,Application_652823628,User_1,complete,2016-01-01 09:51:15.352000+00:00,20000.0,1,13.248566,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,Application_652823628,User_1,schedule,2016-01-01 09:51:15.774000+00:00,20000.0,2,13.248561,1,0,0,...,0,1,0,0,0,0,0,0,0,0
3,Application_652823628,User_1,withdraw,2016-01-01 09:52:36.392000+00:00,20000.0,3,13.247628,0,1,0,...,0,1,0,0,0,0,0,0,0,0
4,Application_652823628,User_1,schedule,2016-01-01 09:52:36.403000+00:00,20000.0,4,13.247628,1,0,0,...,0,1,0,0,0,0,0,0,0,0


In [None]:
# Create frequency encoding for train and test df
# The one-hot columns only exist in full_train_df / full_test_df, so we start from those

def frequency_encode(df):
    # Select the one-hot columns that start with "concept:" or "Action_"
    relevant_columns = [c for c in df.columns if c.startswith(("concept:", "Action_"))]
    encoded = df.copy()
    # Running count per trace; rows are already in event order within each trace
    encoded[relevant_columns] = df.groupby("case:concept:name")[relevant_columns].cumsum()
    return encoded

train_df = frequency_encode(full_train_df)
test_df = frequency_encode(full_test_df)
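
To illustrate what the frequency encoding produces, here is a minimal sketch on a made-up one-hot frame. A cumulative sum per trace turns each 0/1 column into a running count of how often that value has occurred in the trace so far.

```python
import pandas as pd

# Made-up one-hot encoded log with two traces
toy = pd.DataFrame({
    "case:concept:name": ["A", "A", "A", "B", "B"],
    "Action_Created": [1, 0, 1, 1, 0],
    "Action_Deleted": [0, 1, 0, 0, 1],
})
count_cols = ["Action_Created", "Action_Deleted"]

# Running count per trace; the counts restart at every new trace
freq = toy.copy()
freq[count_cols] = toy.groupby("case:concept:name")[count_cols].cumsum()
print(freq)
```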

## Split and save features X and targets y

In [None]:
# Save the frequency encoded dataframes
X_train = train_df.drop(columns=["remaining_time"])
X_train.to_csv("data/generated/frequency/X_train.csv")

X_test = test_df.drop(columns=["remaining_time"])
X_test.to_csv("data/generated/frequency/X_test.csv")

y_train = train_df["remaining_time"]
y_train.to_csv("data/generated/frequency/y_train.csv")

y_test = test_df["remaining_time"]
y_test.to_csv("data/generated/frequency/y_test.csv")

In [None]:
# Save the one-hot encoded dataframes
X_train = full_train_df.drop(columns=["remaining_time"])
X_train.to_csv("data/generated/onehot/X_train.csv")

X_test = full_test_df.drop(columns=["remaining_time"])
X_test.to_csv("data/generated/onehot/X_test.csv")

y_train = full_train_df["remaining_time"]
y_train.to_csv("data/generated/onehot/y_train.csv")

y_test = full_test_df["remaining_time"]
y_test.to_csv("data/generated/onehot/y_test.csv")
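
The save cells above repeat the same four steps and assume the output folders already exist. A possible refactoring is sketched below; the helper name `save_split` is an assumption, and the example writes a made-up frame to a temporary directory rather than to data/generated.

```python
from pathlib import Path
import tempfile

import pandas as pd

def save_split(df, out_dir, name, target="remaining_time"):
    """Hypothetical helper: write features to X_<name>.csv and the target
    to y_<name>.csv, creating the output folder if it does not exist."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    df.drop(columns=[target]).to_csv(out_dir / f"X_{name}.csv")
    df[target].to_csv(out_dir / f"y_{name}.csv")
    return out_dir

# Made-up frame written to a temporary folder
toy = pd.DataFrame({"feature": [1, 2], "remaining_time": [0.5, 0.0]})
saved_dir = save_split(toy, Path(tempfile.mkdtemp()) / "onehot", name="train")
print(sorted(p.name for p in saved_dir.iterdir()))
```

With such a helper, each encoding would need two calls, one for train and one for test.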