# Preprocessing
This notebook filters the data and saves the training and test sets under `data/generated`.

In [1]:
# import basic libraries
import pandas as pd

# import machine learning library
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import IsolationForest

# import pm4py library to work with XES logs and process mining
import pm4py
from pm4py.algo.transformation.log_to_features.util import locally_linear_embedding
from pm4py.visualization.graphs import visualizer

In [2]:
log = pm4py.read_xes("data/BPI_Challenge_2017.xes.gz")
log_df = pm4py.convert_to_dataframe(log)
log_df.head()



parsing log, completed traces ::   0%|          | 0/31509 [00:00<?, ?it/s]

Unnamed: 0,Action,org:resource,concept:name,EventOrigin,EventID,lifecycle:transition,time:timestamp,case:LoanGoal,case:ApplicationType,case:concept:name,case:RequestedAmount,FirstWithdrawalAmount,NumberOfTerms,Accepted,MonthlyCost,Selected,CreditScore,OfferedAmount,OfferID
0,Created,User_1,A_Create Application,Application,Application_652823628,complete,2016-01-01 09:51:15.304000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
1,statechange,User_1,A_Submitted,Application,ApplState_1582051990,complete,2016-01-01 09:51:15.352000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
2,Created,User_1,W_Handle leads,Workflow,Workitem_1298499574,schedule,2016-01-01 09:51:15.774000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
3,Deleted,User_1,W_Handle leads,Workflow,Workitem_1673366067,withdraw,2016-01-01 09:52:36.392000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
4,Created,User_1,W_Complete application,Workflow,Workitem_1493664571,schedule,2016-01-01 09:52:36.403000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,


## Ordinal encoding
Ordinal encoding maps every string value to an integer, which the models need as input. It may not be the best encoding method: the categories have no inherent order, but integers imply one.

For future implementations we also want to experiment with:
- One-hot encoding (using pm4py log_to_features) followed by PCA to reduce dimensionality
- Bi-Grams (also using pm4py log_to_features)
- Multisets
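
As a rough illustration of the ordering concern above, the sketch below encodes a toy column both ways. The toy values are hypothetical and chosen only for illustration; they are not taken from the event log.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy column (hypothetical values, for illustration only)
toy = pd.DataFrame({"goal": ["Car", "Boat", "Home improvement", "Car"]})

# ordinal encoding: categories are sorted and mapped to 0, 1, 2, ...
# so "Boat" < "Car" < "Home improvement" as far as a model can tell
ordinal = OrdinalEncoder().fit_transform(toy[["goal"]]).ravel()

# one-hot encoding: one indicator column per category, no order implied
one_hot = pd.get_dummies(toy["goal"], dtype=int)

print(ordinal)
print(one_hot)
```

The one-hot version avoids the artificial order at the cost of one column per category, which is why a dimensionality reduction step such as PCA is listed as a follow-up.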

In [3]:
# encode string values using ordinal encoding
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoded_log = encoder.fit_transform(log_df)
encoded_df = pd.DataFrame(encoded_log)
encoded_df.fillna(value=-1, inplace=True)
encoded_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.0,0.0,4.0,0.0,233979.0,1.0,0.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
1,4.0,0.0,8.0,0.0,62695.0,1.0,1.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
2,0.0,0.0,22.0,2.0,552510.0,3.0,2.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
3,1.0,0.0,22.0,2.0,702398.0,6.0,3.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
4,0.0,0.0,21.0,2.0,631062.0,3.0,4.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0


## Anomaly detection
We fit an IsolationForest to the encoded dataframe and add a `scores` column taken from its decision function. Negative scores mark events that should be considered anomalous; positive scores mark events that should not.

*Note: based on the results, we think it is better not to remove the traces flagged as most anomalous. On visual inspection, nothing about these traces looks unusual.*
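
To make the sign convention concrete, here is a minimal sketch on synthetic data. The points, the seed, and the single far-away outlier are assumptions for illustration and have nothing to do with the loan log.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# synthetic data: 200 points around the origin plus one far-away point
rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[100.0, 100.0]])
X = np.vstack([normal_points, outlier])

# fixed seed so the toy result is reproducible
model = IsolationForest(random_state=0).fit(X)

scores = model.decision_function(X)  # negative = anomalous, positive = normal
labels = model.predict(X)            # -1 = anomalous, 1 = normal

print(scores[-1], labels[-1])
```

The far-away point is isolated after very few random splits, so it receives the lowest score in the set.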

In [4]:
scores_df = log_df.copy()

model=IsolationForest()
model.fit(encoded_df)
scores_df["scores"] = model.decision_function(encoded_df)
scores_df.head()

Unnamed: 0,Action,org:resource,concept:name,EventOrigin,EventID,lifecycle:transition,time:timestamp,case:LoanGoal,case:ApplicationType,case:concept:name,case:RequestedAmount,FirstWithdrawalAmount,NumberOfTerms,Accepted,MonthlyCost,Selected,CreditScore,OfferedAmount,OfferID,scores
0,Created,User_1,A_Create Application,Application,Application_652823628,complete,2016-01-01 09:51:15.304000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,-0.0063
1,statechange,User_1,A_Submitted,Application,ApplState_1582051990,complete,2016-01-01 09:51:15.352000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.014594
2,Created,User_1,W_Handle leads,Workflow,Workitem_1298499574,schedule,2016-01-01 09:51:15.774000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.033172
3,Deleted,User_1,W_Handle leads,Workflow,Workitem_1673366067,withdraw,2016-01-01 09:52:36.392000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.019175
4,Created,User_1,W_Complete application,Workflow,Workitem_1493664571,schedule,2016-01-01 09:52:36.403000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.037771


To see which events are most anomalous, we sort the dataframe by score in ascending order, so the most anomalous events appear first.

In [5]:
# sort ascending: the lowest (most anomalous) scores come first
scores_df.sort_values("scores")

Unnamed: 0,Action,org:resource,concept:name,EventOrigin,EventID,lifecycle:transition,time:timestamp,case:LoanGoal,case:ApplicationType,case:concept:name,case:RequestedAmount,FirstWithdrawalAmount,NumberOfTerms,Accepted,MonthlyCost,Selected,CreditScore,OfferedAmount,OfferID,scores
1185439,Created,User_96,O_Create Offer,Offer,Offer_1534515155,complete,2016-12-27 11:52:54.741000+00:00,Car,New credit,Application_1596066079,62500.0,39960.0,120.0,True,663.88,True,963.0,65200.0,,-0.261864
354267,Created,User_13,O_Create Offer,Offer,Offer_1992128451,complete,2016-04-30 13:33:55.876000+00:00,Car,Limit raise,Application_1125253534,58000.0,33000.0,121.0,True,600.00,True,897.0,58000.0,,-0.259584
653354,Created,User_41,O_Create Offer,Offer,Offer_1511444856,complete,2016-07-28 09:46:08.475000+00:00,Existing loan takeover,Limit raise,Application_1220630254,75000.0,0.0,126.0,True,750.00,True,886.0,75000.0,,-0.257827
1005841,Created,User_66,O_Create Offer,Offer,Offer_466589272,complete,2016-10-31 13:06:19.111000+00:00,Car,New credit,Application_1209476276,65600.0,18105.8,120.0,True,667.96,True,941.0,65600.0,,-0.257751
1201746,Created,User_72,O_Create Offer,Offer,Offer_1632721348,complete,2017-01-03 10:40:30.764000+00:00,Existing loan takeover,New credit,Application_1616238013,54000.0,13500.0,120.0,True,585.48,True,953.0,57500.0,,-0.257239
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
442016,Obtained,User_41,W_Call incomplete files,Workflow,Workitem_1933806778,resume,2016-06-08 17:19:14.406000+00:00,Home improvement,New credit,Application_1924053629,10000.0,,,,,,,,,0.113725
631598,Obtained,User_39,W_Call incomplete files,Workflow,Workitem_1526416952,resume,2016-08-04 14:44:03.004000+00:00,Home improvement,New credit,Application_2019183781,13000.0,,,,,,,,,0.113762
469200,Obtained,User_4,W_Call after offers,Workflow,Workitem_1533235476,resume,2016-06-13 15:32:28.117000+00:00,Home improvement,New credit,Application_1585186665,10000.0,,,,,,,,,0.113869
662598,Obtained,User_39,W_Call incomplete files,Workflow,Workitem_1905067702,resume,2016-08-04 13:13:12.504000+00:00,Home improvement,New credit,Application_237168632,10000.0,,,,,,,,,0.114320


In [6]:
# show the average score per trace, lowest (most anomalous) first
scores_df[["case:concept:name", "scores"]].groupby(["case:concept:name"]).mean().sort_values("scores")

Unnamed: 0_level_0,scores
case:concept:name,Unnamed: 1_level_1
Application_896441766,-0.090179
Application_1113899604,-0.082390
Application_1200551534,-0.081788
Application_946413213,-0.081064
Application_922011706,-0.078052
...,...
Application_22510455,0.080227
Application_1373016712,0.081355
Application_150888226,0.081631
Application_2123037823,0.082707


## Feature evolution
We may also want to look at how the features evolve over time, to find the parts of the event log whose behavior differs from the mainstream behavior.

*Note: my laptop doesn't have enough memory to run this, so I don't know what the results are*

In [None]:
x, y = locally_linear_embedding.apply(log)
gviz = visualizer.apply(x, y, variant=visualizer.Variants.DATES,
                        parameters={"title": "Locally Linear Embedding", "format": "svg", "y_axis": "Intensity"})
visualizer.view(gviz)

## Split train and test
Using `pm4py.split_train_test` unfortunately produced traces in the training set that ended after traces in the test set had started. This is not a good split, so we implement it manually by sorting the traces on their timestamps.
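
The idea can be sketched on a toy log with four cases. The data and the 25% test fraction are assumptions chosen to keep the example small; the column names follow the ones used in this notebook.

```python
import pandas as pd

# toy event log with four cases (hypothetical data)
events = pd.DataFrame({
    "case:concept:name": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "time:timestamp": pd.to_datetime([
        "2016-01-01", "2016-01-05",  # A ends before D starts
        "2016-01-02", "2016-01-20",  # B is still running when D starts
        "2016-01-03", "2016-01-08",  # C ends before D starts
        "2016-01-10", "2016-01-15",  # D starts last
    ], utc=True),
})

grouped = events.groupby("case:concept:name")["time:timestamp"]
case_start = grouped.min()
case_end = grouped.max()

# the most recently started cases form the test set
test_size = round(len(case_start) * 0.25)
test_cases = case_start.sort_values().tail(test_size)

# training cases must finish before the first test case starts
train_cases = case_end[case_end < test_cases.min()]

print(sorted(train_cases.index))
print(sorted(test_cases.index))
```

Case B overlaps the start of the test period, so it ends up in neither set. That is the price of a strict temporal split: some traces are discarded.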

In [8]:
trace_start_df = log_df[["case:concept:name", "time:timestamp"]].groupby(["case:concept:name"]).min()
trace_end_df = log_df[["case:concept:name", "time:timestamp"]].groupby(["case:concept:name"]).max()

In [9]:
# take the last 10% of the traces as test set
test_size = round(len(trace_start_df)*0.1)
test_cases = trace_start_df.sort_values("time:timestamp").tail(test_size)

In [10]:
# train cases must end before test cases start
train_cases = trace_end_df[trace_end_df["time:timestamp"] < test_cases["time:timestamp"].min()]

In [33]:
train_df = log_df[log_df["case:concept:name"].isin(train_cases.index)]
test_df = log_df[log_df["case:concept:name"].isin(test_cases.index)]

In [34]:
# double check that the timestamps don't overlap
# all traces in train must end before the start of traces in test
print(train_df["time:timestamp"].max())
print(test_df["time:timestamp"].min())

2016-11-22 09:21:30.939000+00:00
2016-11-22 09:22:17.274000+00:00


## Feature encoding
For now we use the basic feature encoding from pm4py, but we want to experiment with complex index encoding, where we encode the previous 10 activities (adding padding where a trace has fewer). We also add the index of each event within its trace.
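
One possible way to build such a "previous activities" encoding is sketched below. This is only an assumption about how it could look, not the implementation used here: the window length `K`, the padding token `PAD`, and the `prev_activity_*` column names are all hypothetical.

```python
import pandas as pd

K = 3          # window length; the text mentions 10, 3 keeps the toy output readable
PAD = "<pad>"  # hypothetical padding token for positions before the trace start

# toy event log (hypothetical data)
events = pd.DataFrame({
    "case:concept:name": ["A", "A", "A", "A", "B", "B"],
    "concept:name": ["create", "submit", "offer", "accept", "create", "cancel"],
})

for lag in range(1, K + 1):
    # shift within each case so an event never sees another case's activities
    shifted = events.groupby("case:concept:name")["concept:name"].shift(lag)
    events[f"prev_activity_{lag}"] = shifted.fillna(PAD)

print(events)
```

Each row then carries its own activity plus the `K` activities before it, and the resulting columns can be one-hot encoded like any other categorical feature.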


In [35]:
# the shortest trace has 10 activities
log_df[["case:concept:name", "Action"]].groupby(["case:concept:name"]).count().min()

Action    10
dtype: int64

In [36]:
# work on explicit copies so the assignments below do not raise SettingWithCopyWarning
train_df = train_df.copy()
test_df = test_df.copy()

# add column "event_index_in_trace"
# which indicates the 1st, 2nd ... event in the trace (0-based)
train_df["event_index_in_trace"] = train_df.groupby("case:concept:name").cumcount()
test_df["event_index_in_trace"] = test_df.groupby("case:concept:name").cumcount()

# add column "remaining_time"
# which indicates time from that event until the last event in the trace
# transform("max") keeps the original row order, so the subtraction aligns row by row
train_df["time:timestamp"] = pd.to_datetime(train_df["time:timestamp"], utc=True)
train_end = train_df.groupby("case:concept:name")["time:timestamp"].transform("max")
train_df["remaining_time"] = (train_end - train_df["time:timestamp"]).dt.total_seconds() / (24 * 60 * 60)  # convert to float days
test_df["time:timestamp"] = pd.to_datetime(test_df["time:timestamp"], utc=True)
test_end = test_df.groupby("case:concept:name")["time:timestamp"].transform("max")
test_df["remaining_time"] = (test_end - test_df["time:timestamp"]).dt.total_seconds() / (24 * 60 * 60)  # convert to float days

In [37]:
columns_to_keep = ['Action', 'concept:name', 'time:timestamp', 'case:LoanGoal', 'case:RequestedAmount', 'event_index_in_trace', 'remaining_time']

train_df = train_df[columns_to_keep]
test_df = test_df[columns_to_keep]

In [38]:
# all the features we are going to encode
test_df.head()

Unnamed: 0,Action,concept:name,time:timestamp,case:LoanGoal,case:RequestedAmount,event_index_in_trace,remaining_time
1080782,Created,A_Create Application,2016-11-22 09:22:17.274000+00:00,Unknown,0.0,0,30.901643
1080783,Created,W_Complete application,2016-11-22 09:22:17.285000+00:00,Unknown,0.0,1,30.901643
1080784,Obtained,W_Complete application,2016-11-22 09:22:17.288000+00:00,Unknown,0.0,2,30.901643
1080785,statechange,A_Concept,2016-11-22 09:22:17.291000+00:00,Unknown,0.0,3,30.901643
1080786,statechange,A_Accepted,2016-11-22 09:24:43.370000+00:00,Unknown,0.0,4,30.899953


In [39]:
# one-hot encode the data
train_df = pd.get_dummies(train_df, dtype=int)
test_df = pd.get_dummies(test_df, dtype=int)
test_df.head()

Unnamed: 0,time:timestamp,case:RequestedAmount,event_index_in_trace,remaining_time,Action_Created,Action_Deleted,Action_Obtained,Action_Released,Action_statechange,concept:name_A_Accepted,...,case:LoanGoal_Debt restructuring,case:LoanGoal_Existing loan takeover,case:LoanGoal_Extra spending limit,case:LoanGoal_Home improvement,case:LoanGoal_Motorcycle,case:LoanGoal_Not speficied,"case:LoanGoal_Other, see explanation",case:LoanGoal_Remaining debt home,case:LoanGoal_Tax payments,case:LoanGoal_Unknown
1080782,2016-11-22 09:22:17.274000+00:00,0.0,0,30.901643,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1080783,2016-11-22 09:22:17.285000+00:00,0.0,1,30.901643,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1080784,2016-11-22 09:22:17.288000+00:00,0.0,2,30.901643,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1080785,2016-11-22 09:22:17.291000+00:00,0.0,3,30.901643,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
1080786,2016-11-22 09:24:43.370000+00:00,0.0,4,30.899953,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,1
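
Because the training and test sets are one-hot encoded separately, a category that appears in only one of them produces a column the other lacks. A minimal sketch of one way to line the columns up is shown below, on hypothetical toy frames. This step is not part of the notebook above.

```python
import pandas as pd

# toy frames (hypothetical): "Boat" only occurs in train, "Motorcycle" only in test
train = pd.DataFrame({"case:LoanGoal": ["Car", "Boat", "Car"]})
test = pd.DataFrame({"case:LoanGoal": ["Car", "Motorcycle"]})

train_enc = pd.get_dummies(train, dtype=int)
test_enc = pd.get_dummies(test, dtype=int)

# reindex the test columns to the training columns:
# categories missing from test become all-zero columns,
# categories unseen in train are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_aligned)
```

After this step both frames have the same columns in the same order, which most models require at prediction time.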


## Split and save features X and targets y

In [40]:
X_train = train_df.drop(columns=["remaining_time"])
X_train.to_csv("data/generated/X_train.csv")

X_test = test_df.drop(columns=["remaining_time"])
X_test.to_csv("data/generated/X_test.csv")

y_train = train_df["remaining_time"]
y_train.to_csv("data/generated/y_train.csv")

y_test = test_df["remaining_time"]
y_test.to_csv("data/generated/y_test.csv")