# Preprocessing
This notebook filters the data and saves the training and test data in the data folder.

In [1]:
# import basic libraries
import pandas as pd

# import machine learning library
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import IsolationForest

# import pm4py library to work with XES logs and process mining
import pm4py
from pm4py.algo.transformation.log_to_features.util import locally_linear_embedding
from pm4py.visualization.graphs import visualizer

In [2]:
log = pm4py.read_xes("data/BPI_Challenge_2017.xes.gz")
log_df = pm4py.convert_to_dataframe(log)
log_df.head()




Unnamed: 0,Action,org:resource,concept:name,EventOrigin,EventID,lifecycle:transition,time:timestamp,case:LoanGoal,case:ApplicationType,case:concept:name,case:RequestedAmount,FirstWithdrawalAmount,NumberOfTerms,Accepted,MonthlyCost,Selected,CreditScore,OfferedAmount,OfferID
0,Created,User_1,A_Create Application,Application,Application_652823628,complete,2016-01-01 09:51:15.304000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
1,statechange,User_1,A_Submitted,Application,ApplState_1582051990,complete,2016-01-01 09:51:15.352000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
2,Created,User_1,W_Handle leads,Workflow,Workitem_1298499574,schedule,2016-01-01 09:51:15.774000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
3,Deleted,User_1,W_Handle leads,Workflow,Workitem_1673366067,withdraw,2016-01-01 09:52:36.392000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,
4,Created,User_1,W_Complete application,Workflow,Workitem_1493664571,schedule,2016-01-01 09:52:36.403000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,


## Ordinal encoding
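This encodes all string values as integers, which the models need as input. It might not be the best encoding method: the categories have no inherent order, whereas integers imply one. Because the encoder is fit on the whole dataframe, numeric and timestamp columns are also replaced by category indices (for example, a requested amount of 20000.0 becomes 301.0 in the output below).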
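
A small sketch of the ordering issue, using made-up loan goals rather than the real log: `OrdinalEncoder` sorts the categories and assigns consecutive integers, so the numeric distance between two codes carries no meaning. The toy column values and the `"Unknown goal"` entry are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy column with made-up values (not taken from the real log)
toy_df = pd.DataFrame({"case:LoanGoal": ["Car", "Boat", "Home improvement", "Car"]})

toy_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
toy_codes = toy_encoder.fit_transform(toy_df)

# categories are sorted, so Boat=0, Car=1, Home improvement=2
print(toy_encoder.categories_[0])
print(toy_codes.ravel())

# a value that was not seen during fit is mapped to unknown_value (-1)
unseen = pd.DataFrame({"case:LoanGoal": ["Unknown goal"]})
unseen_code = toy_encoder.transform(unseen)
print(unseen_code.ravel())
```

A distance-based or tree-based model would treat "Home improvement" (2) as further from "Boat" (0) than "Car" (1) is, although nothing in the data supports that.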

For future implementations we also want to experiment with:
- One-hot encoding (using pm4py log_to_features) followed by PCA to reduce dimensionality
- Bi-Grams (also using pm4py log_to_features)
- Multisets
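
As a rough sketch of the first option in the list above, the cell below one-hot encodes a few made-up events and reduces them with PCA. It uses `pandas.get_dummies` and scikit-learn's `PCA` as stand-ins, not pm4py's `log_to_features`, and the toy rows and the choice of 2 components are assumptions.

```python
import pandas as pd
from sklearn.decomposition import PCA

# a handful of made-up events with two categorical attributes
toy_events = pd.DataFrame({
    "concept:name": ["A_Create Application", "A_Submitted", "W_Handle leads",
                     "W_Handle leads", "W_Complete application", "A_Submitted"],
    "lifecycle:transition": ["complete", "complete", "schedule",
                             "withdraw", "schedule", "complete"],
})

# one column per (attribute, value) pair, so no artificial order is introduced
one_hot = pd.get_dummies(toy_events).astype(float)

# project the one-hot columns onto 2 principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(one_hot)

print(one_hot.shape)
print(reduced.shape)
```

On the full log, high-cardinality columns such as `EventID` would produce one column per value, so they would probably need to be dropped before this step.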

In [3]:
# encode string values using ordinal encoding
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoded_log = encoder.fit_transform(log_df)
encoded_df = pd.DataFrame(encoded_log)
encoded_df.fillna(value=-1, inplace=True)
encoded_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.0,0.0,4.0,0.0,233979.0,1.0,0.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
1,4.0,0.0,8.0,0.0,62695.0,1.0,1.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
2,0.0,0.0,22.0,2.0,552510.0,3.0,2.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
3,1.0,0.0,22.0,2.0,702398.0,6.0,3.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
4,0.0,0.0,21.0,2.0,631062.0,3.0,4.0,5.0,1.0,25893.0,301.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0


## Anomaly detection
We apply an IsolationForest to the encoded dataframe. Its decision function gives every event (row) a score that we store in a new column, scores: a negative score means the model considers the event anomalous, a positive score means it considers the event normal, and the lower the score, the more anomalous the event.

*Note: based on the results, we think it is better not to remove the traces that are flagged as most anomalous. On visual inspection, these traces do not seem to have anything unusual going on.*
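
To illustrate how the sign of the score behaves, here is a minimal sketch on synthetic data: 200 points around the origin plus one obvious outlier. The data, the variable names and the fixed `random_state` are assumptions made for reproducibility; the notebook itself uses the default settings on the encoded log.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 "normal" points around the origin plus one obvious outlier (toy data)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[10.0, 10.0]])
toy_X = np.vstack([normal_points, outlier])

# random_state is fixed here only to make the sketch reproducible
toy_model = IsolationForest(random_state=0)
toy_model.fit(toy_X)
toy_scores = toy_model.decision_function(toy_X)

# the outlier is the last row; it should get the lowest, negative score
print(toy_scores[-1], toy_scores.argmin())
# predict() returns -1 for rows with a negative score and 1 otherwise
print(toy_model.predict(outlier))
```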

In [4]:
scores_df = log_df.copy()

model = IsolationForest()
model.fit(encoded_df)
scores_df["scores"] = model.decision_function(encoded_df)
scores_df.head()

Unnamed: 0,Action,org:resource,concept:name,EventOrigin,EventID,lifecycle:transition,time:timestamp,case:LoanGoal,case:ApplicationType,case:concept:name,case:RequestedAmount,FirstWithdrawalAmount,NumberOfTerms,Accepted,MonthlyCost,Selected,CreditScore,OfferedAmount,OfferID,scores
0,Created,User_1,A_Create Application,Application,Application_652823628,complete,2016-01-01 09:51:15.304000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,-0.01545
1,statechange,User_1,A_Submitted,Application,ApplState_1582051990,complete,2016-01-01 09:51:15.352000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.027764
2,Created,User_1,W_Handle leads,Workflow,Workitem_1298499574,schedule,2016-01-01 09:51:15.774000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.044494
3,Deleted,User_1,W_Handle leads,Workflow,Workitem_1673366067,withdraw,2016-01-01 09:52:36.392000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.013311
4,Created,User_1,W_Complete application,Workflow,Workitem_1493664571,schedule,2016-01-01 09:52:36.403000+00:00,Existing loan takeover,New credit,Application_652823628,20000.0,,,,,,,,,0.05508


To see which events are the most anomalous, we sort the dataframe by score in ascending order, so the events with the lowest (most anomalous) scores are shown first.

In [5]:
# sort ascending: most anomalous events (lowest scores) first
scores_df.sort_values("scores")

Unnamed: 0,Action,org:resource,concept:name,EventOrigin,EventID,lifecycle:transition,time:timestamp,case:LoanGoal,case:ApplicationType,case:concept:name,case:RequestedAmount,FirstWithdrawalAmount,NumberOfTerms,Accepted,MonthlyCost,Selected,CreditScore,OfferedAmount,OfferID,scores
886461,Created,User_51,O_Create Offer,Offer,Offer_195849826,complete,2016-09-28 14:43:30.444000+00:00,Boat,New credit,Application_1045189910,50000.0,50000.0,120.0,True,509.11,True,965.0,50000.0,,-0.256726
1191015,Created,User_77,O_Create Offer,Offer,Offer_961425477,complete,2016-12-28 15:50:00.230000+00:00,"Other, see explanation",New credit,Application_926354715,75000.0,75000.0,120.0,True,763.67,True,989.0,75000.0,,-0.256714
12163,Created,User_92,O_Create Offer,Offer,Offer_659630981,complete,2016-01-05 18:12:49.532000+00:00,"Other, see explanation",New credit,Application_1231177181,50000.0,21382.0,113.0,True,548.39,True,1009.0,50000.0,,-0.256215
970466,Created,User_67,O_Create Offer,Offer,Offer_238375766,complete,2016-10-21 09:17:15.175000+00:00,Existing loan takeover,New credit,Application_949156991,60000.0,48277.0,55.0,True,1199.05,True,977.0,60000.0,,-0.256215
923784,Created,User_61,O_Create Offer,Offer,Offer_1729753611,complete,2016-10-31 17:58:52.360000+00:00,"Other, see explanation",New credit,Application_1141437569,50000.0,50000.0,120.0,True,509.11,True,979.0,50000.0,,-0.256215
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530933,Obtained,User_3,W_Call incomplete files,Workflow,Workitem_1552303102,resume,2016-07-08 10:56:18.438000+00:00,Home improvement,New credit,Application_212279407,10000.0,,,,,,,,,0.107134
530140,Obtained,User_24,W_Call incomplete files,Workflow,Workitem_136935623,resume,2016-07-11 13:32:09.266000+00:00,Home improvement,New credit,Application_286583043,12000.0,,,,,,,,,0.107260
662600,Obtained,User_39,W_Call incomplete files,Workflow,Workitem_1966625285,resume,2016-08-04 13:37:37.032000+00:00,Home improvement,New credit,Application_237168632,10000.0,,,,,,,,,0.107404
662598,Obtained,User_39,W_Call incomplete files,Workflow,Workitem_1905067702,resume,2016-08-04 13:13:12.504000+00:00,Home improvement,New credit,Application_237168632,10000.0,,,,,,,,,0.107598


In [6]:
# average score per trace, sorted ascending (most anomalous traces first)
scores_df[["case:concept:name", "scores"]].groupby(["case:concept:name"]).mean().sort_values("scores")

case:concept:name,scores
Application_896441766,-0.103550
Application_946413213,-0.096499
Application_1113899604,-0.086795
Application_1562291654,-0.086046
Application_918459127,-0.086018
...,...
Application_1797625521,0.072118
Application_1817442788,0.072330
Application_1845792027,0.073978
Application_1806060525,0.075903
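
Averaging can hide a single very anomalous event inside an otherwise normal trace. The sketch below, on invented case ids and scores, reports the minimum and the event count next to the mean so that both views can be compared; it is one possible aggregation, not the one used in this notebook.

```python
import pandas as pd

# invented scores: case_A has one strongly anomalous event, case_B is mildly off
toy_scores_df = pd.DataFrame({
    "case:concept:name": ["case_A", "case_A", "case_A", "case_B", "case_B"],
    "scores": [0.05, 0.06, -0.20, -0.04, -0.05],
})

per_trace = (
    toy_scores_df
    .groupby("case:concept:name")["scores"]
    .agg(["mean", "min", "count"])
    .sort_values("min")
)
print(per_trace)
```

Sorted by `min`, `case_A` comes first; sorted by `mean`, `case_B` would, which shows that the ranking depends on the aggregation chosen.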


## Feature evolution
We may want to look at how the features evolve over time, to find the parts of the event log whose behavior differs from the mainstream behavior.

*Note: my laptop doesn't have enough memory to run this, so I don't know what the results are*

In [7]:
x, y = locally_linear_embedding.apply(log)
gviz = visualizer.apply(x, y, variant=visualizer.Variants.DATES,
                        parameters={"title": "Locally Linear Embedding", "format": "svg", "y_axis": "Intensity"})
visualizer.view(gviz)

MemoryError: Unable to allocate 7.40 GiB for an array with shape (31509, 31509) and data type float64
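
The size in the error message follows from the shape: a dense 31509 x 31509 array of 8-byte floats. The sketch below reproduces that arithmetic and estimates how many traces would fit in a given memory budget; the helper names and the 1 GiB budget are hypothetical.

```python
import math

BYTES_PER_FLOAT64 = 8
GIB = 2 ** 30


def dense_square_matrix_gib(n_rows):
    """Memory needed for an n_rows x n_rows float64 array, in GiB."""
    return n_rows * n_rows * BYTES_PER_FLOAT64 / GIB


def max_rows_for_budget(budget_gib):
    """Largest n such that an n x n float64 array fits in budget_gib."""
    return math.isqrt(int(budget_gib * GIB) // BYTES_PER_FLOAT64)


print(round(dense_square_matrix_gib(31509), 2))  # 7.4, matching the error message
print(max_rows_for_budget(1.0))                  # 11585
```

Running the embedding on a random sample of roughly that many traces might avoid the error, although we have not checked whether the result would still be representative.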

## Feature selection
For now, we select the basic features, such as activity, case_id and timestamp. Later on, it would be interesting to test PCA and correlation coefficients.

In [None]:
features_df = pm4py.extract_features_dataframe(
    log_df,
    activity_key='concept:name',
    case_id_key='case:concept:name',
    timestamp_key='time:timestamp')

features_df.head()
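
The correlation check mentioned above could look roughly like the sketch below, which drops one column of every highly correlated pair. The helper `drop_highly_correlated`, the 0.95 threshold and the toy columns are assumptions, and this has not been run on `features_df`.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Drop the later column of each pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so every pair is looked at once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# toy features: b is an exact multiple of a, c is only weakly related to both
toy_features = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})

selected, dropped = drop_highly_correlated(toy_features)
print(dropped)
print(list(selected.columns))
```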