# <center>Assignment 4 : Predictive Process Monitoring</center>

## <center>Gamal Elkoumy</center>
### <center>University of Tartu, 2021</center>

This document reports our solution to the predictive process monitoring assignment described at https://courses.cs.ut.ee/LTAT.05.025/2021_spring/uploads/Main/2021Homework4.pdf

## Solution GitHub Repository
Our solution is available at the following URL:
https://github.com/Elkoumy/predictive-monitoring-benchmark

## Task 1
(1 point) 

As part of the log preprocessing, it is necessary to categorize the process traces as
deviant or regular. This log contains a column called SLA. It is a "case attribute" that indicates
within how many minutes each case must complete. You must create a new column in the log that
contains a case attribute called label, which contains a value of 1 for deviant cases or 0 for
regular ones. This column's value is 0 if the duration of the case (in minutes) is less than or equal
to the SLA; otherwise, this column's value must be 1 (the SLA has not been met). NB! If there are
cases that do not have SLA, categorize them as 0. 

In [4]:
import numpy as np
import pandas as pd

In [5]:
df=pd.read_csv(r'C:\Gamal Elkoumy\PhD\OneDrive - Tartu Ülikool\Courses\Process Mining\Assignment4\predictive-monitoring-benchmark\data\turnaround_anon_sla.csv')

#converting datatypes , timestamps
df.start_timestamp= pd.to_datetime(df.start_timestamp,utc=True)
df.end_timestamp= pd.to_datetime(df.end_timestamp,utc=True)

df=df.sort_values(['start_timestamp']).reset_index(drop=True)


In [6]:
"""Q1"""

#calculating the start time and end time of every case
df['case_end_time']=df.groupby(['caseid']).end_timestamp.transform('max')
df['case_start_time']=df.groupby(['caseid']).start_timestamp.transform('min')

#calculating case duration in minutes (the same time unit as the SLA)
#total_seconds()/60 keeps fractional minutes and does not depend on the 'timedelta64[m]' cast
df['duration']=(df.case_end_time-df.case_start_time).dt.total_seconds()/60


#creating the label column
df['label']=1
df.loc[df.duration<=df['SLA MIN'],'label']=0
df.loc[df['SLA MIN'].isna(),'label']=0
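
The labeling rule can be checked on a small toy log. The sketch below is only an illustration: the three cases and their timestamps are made up, and the single comparison is an equivalent formulation of the three assignments above, relying on the fact that a comparison against a missing SLA evaluates to False.

```python
import pandas as pd

# Toy log: three cases with one event each (all values are made up).
toy = pd.DataFrame({
    "caseid": ["a", "b", "c"],
    "start_timestamp": pd.to_datetime(
        ["2021-01-01 10:00", "2021-01-01 10:00", "2021-01-01 10:00"], utc=True),
    "end_timestamp": pd.to_datetime(
        ["2021-01-01 10:30", "2021-01-01 12:00", "2021-01-01 12:00"], utc=True),
    "SLA MIN": [60.0, 60.0, None],
})

# Case duration in minutes, computed from the first start and the last end.
case_start = toy.groupby("caseid").start_timestamp.transform("min")
case_end = toy.groupby("caseid").end_timestamp.transform("max")
toy["duration"] = (case_end - case_start).dt.total_seconds() / 60

# A case is deviant (1) only if its duration exceeds the SLA.
# Comparing against NaN yields False, so cases without an SLA get label 0.
toy["label"] = (toy["duration"] > toy["SLA MIN"]).astype(int)
```

Case `a` finishes within its SLA, case `b` exceeds it, and case `c` has no SLA, so only case `b` is labeled as deviant.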

## Task 2
(2 points) 

Add a column to the event log that captures the WIP of the process at the moment
where the last event in the prefix occurs. Remember that the WIP refers to the number of active
cases, meaning the number of cases that have started but not yet completed. 

#### First, we define a function that computes the WIP for each event (row) of the log.

In [8]:
def count_wip(row, case_times):
    #a case is counted as active for this event if any of the following holds:
    #1. it started before the event started and ended after the event ended
    #2. it started after the event started and ended before the event ended
    #3. it started before the event started and ended after the event started
    #4. it started before the event ended and ended after the event ended
    wip=case_times.loc[((case_times.case_start_time <= row.start_timestamp) & (case_times.case_end_time >= row.end_timestamp)) |
                       ((case_times.case_start_time >= row.start_timestamp) & (case_times.case_end_time <= row.end_timestamp)) |
                       ((case_times.case_start_time <= row.start_timestamp) & (case_times.case_end_time >= row.start_timestamp)) |
                       ((case_times.case_start_time <= row.end_timestamp) & (case_times.case_end_time >= row.end_timestamp))
                       ].shape[0]

    return wip

#### We then use the pandas apply function to execute the count_wip function as follows.

In [9]:
"""Q2"""
case_times= pd.DataFrame()
case_times['case_end_time']=df.groupby(['caseid']).end_timestamp.max()
case_times['case_start_time']=df.groupby(['caseid']).start_timestamp.min()
case_times=case_times.reset_index()

df['WIP']=df.apply(count_wip,case_times=case_times ,axis=1)
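
As a sanity check, the four conditions in `count_wip` can be compared with a single interval-overlap test. The sketch below assumes that every event and every case has a start time no later than its end time; under that assumption, a case satisfies one of the four conditions exactly when its interval overlaps the event interval. The helper name `count_overlapping_cases` and the toy timestamps are hypothetical and not part of the repository.

```python
import pandas as pd

# Hypothetical case intervals (values are made up for illustration).
case_times = pd.DataFrame({
    "caseid": ["a", "b", "c"],
    "case_start_time": pd.to_datetime(
        ["2021-01-01 09:00", "2021-01-01 10:00", "2021-01-01 13:00"], utc=True),
    "case_end_time": pd.to_datetime(
        ["2021-01-01 11:00", "2021-01-01 12:00", "2021-01-01 14:00"], utc=True),
})

def count_overlapping_cases(event_start, event_end, case_times):
    """Count cases whose [start, end] interval overlaps the event interval."""
    overlaps = (case_times.case_start_time <= event_end) & \
               (case_times.case_end_time >= event_start)
    return int(overlaps.sum())

# An event between 10:30 and 10:45 falls inside cases a and b only.
wip = count_overlapping_cases(
    pd.Timestamp("2021-01-01 10:30", tz="UTC"),
    pd.Timestamp("2021-01-01 10:45", tz="UTC"),
    case_times,
)

# An event between 12:30 and 12:45 falls between all cases.
wip_idle = count_overlapping_cases(
    pd.Timestamp("2021-01-01 12:30", tz="UTC"),
    pd.Timestamp("2021-01-01 12:45", tz="UTC"),
    case_times,
)
```

Note that the count includes the case to which the event itself belongs, which is also true of `count_wip`.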

#### We export the result so that it can be used separately to optimize the model parameters, as described under Task 4.

In [10]:
df=df.rename(columns={'caseid': 'Case ID','activity':'Activity', 'start_timestamp':'time:timestamp'})
df.to_csv(r'C:\Gamal Elkoumy\PhD\OneDrive - Tartu Ülikool\Courses\Process Mining\Assignment4\predictive-monitoring-benchmark\experiments\experiment_log\turnaround_anon_sla_renamed.csv',index=False, sep=';')


#### To prepare for the next step, we split the data into training and test sets.


In [12]:
# split into training and test
def split_data_strict(data, train_ratio, split="temporal"):
    # split into train and test using temporal split and discard events that overlap the periods
    data = data.sort_values(['time:timestamp', 'Activity'], ascending=True, kind='mergesort')
    grouped = data.groupby('Case ID')
    start_timestamps = grouped['time:timestamp'].min().reset_index()
    start_timestamps = start_timestamps.sort_values('time:timestamp', ascending=True, kind='mergesort')
    train_ids = list(start_timestamps['Case ID'])[:int(train_ratio*len(start_timestamps))]
    train = data[data['Case ID'].isin(train_ids)].sort_values(['time:timestamp', 'Activity'], ascending=True, kind='mergesort')
    test = data[~data['Case ID'].isin(train_ids)].sort_values(['time:timestamp', 'Activity'], ascending=True, kind='mergesort')
    split_ts = test['time:timestamp'].min()
    train = train[train['time:timestamp'] < split_ts]
    return (train, test)

In [13]:
"""Split into train and test"""
train_ratio = 0.8
n_splits = 2
random_state = 22

train, test = split_data_strict(df, train_ratio, split="temporal")
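
To illustrate what "strict" means in this split, the following sketch applies a condensed version of the same idea to a toy log. The function `temporal_split_sketch` and the toy data are hypothetical; it is meant to mirror the logic of `split_data_strict`, not to replace it.

```python
import pandas as pd

def temporal_split_sketch(data, train_ratio):
    """Hypothetical, condensed version of the strict temporal split."""
    data = data.sort_values("time:timestamp", kind="mergesort")
    # Order cases by the timestamp of their first event.
    case_starts = (data.groupby("Case ID")["time:timestamp"].min()
                       .sort_values(kind="mergesort"))
    n_train = int(train_ratio * len(case_starts))
    in_train = data["Case ID"].isin(case_starts.index[:n_train])
    test = data[~in_train]
    # Discard training events that occur after the first test event.
    train = data[in_train & (data["time:timestamp"] < test["time:timestamp"].min())]
    return train, test

# Toy log with three cases (timestamps are made up).
toy = pd.DataFrame({
    "Case ID": ["a", "a", "b", "b", "c", "c"],
    "time:timestamp": pd.to_datetime([
        "2021-01-01", "2021-01-05",
        "2021-01-02", "2021-01-03",
        "2021-01-04", "2021-01-06",
    ], utc=True),
})
train, test = temporal_split_sketch(toy, train_ratio=0.67)
```

Cases `a` and `b` start first and go to the training set, but the second event of case `a` happens after case `c` has started, so it is dropped from the training data to avoid leaking information from the test period.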


## Task 3

(4 points) 

Currently, the work proposed by Teinemaa et al. performs the extraction of the prefixes
of the traces registered in the log to train the classification models. For large logs, this approach
leads to an increase in the dimensionality of the models' input (too many features) without
necessarily improving its precision, especially in cases in which the event traces are very long.
You must modify this technique to extract subsequences of size n (n-grams), where n is a user-defined parameter, instead of encoding entire prefixes. An n-gram is a contiguous sequence of n
items from a given trace.

#### First, we define the function that calculates the n-grams. The following function calculates the prefixes using the n-grams for every case separately.

In [20]:
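def create_ngrams(data, ngram_size):
    #data holds the events of a single case, ordered by time
    windows=[]

    #slide a window of ngram_size events over the trace
    for idx in range(0,data.shape[0]- ngram_size +1):

        prefix=data.iloc[idx:idx+ngram_size].copy()
        #drop the old index so that no extra 'index' column is added
        prefix=prefix.reset_index(drop=True)

        #give every n-gram its own case id so that it is treated as a separate instance
        prefix['Case ID']=prefix['Case ID'].astype(str)+'_'+str(idx)

        windows.append(prefix)

    #concatenate once; a trace shorter than ngram_size yields an empty frame
    return pd.concat(windows) if windows else pd.DataFrame(columns=data.columns)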
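
The sliding-window idea behind the n-gram extraction can be seen on a toy log that keeps only the activity names. The helper `ngram_windows` and the data below are hypothetical and serve only to show which windows are produced for a given n.

```python
import pandas as pd

def ngram_windows(activities, n):
    """Return every contiguous window of n activities from one trace."""
    return [tuple(activities[i:i + n]) for i in range(len(activities) - n + 1)]

# Toy log: case "a" has four events, case "b" has two (made-up data).
toy = pd.DataFrame({
    "Case ID": ["a", "a", "a", "a", "b", "b"],
    "Activity": ["A", "B", "C", "D", "A", "B"],
})

# Extract the 3-grams of every case separately.
ngrams = {
    case_id: ngram_windows(trace["Activity"].tolist(), 3)
    for case_id, trace in toy.groupby("Case ID")
}
```

Case `a` yields the two windows `(A, B, C)` and `(B, C, D)`, while case `b` is shorter than n and yields none. With this kind of extraction, traces shorter than n contribute no training instances, which is worth keeping in mind when choosing n.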

#### As a helper function, we adapted the following method to the new label values.

In [21]:
def get_class_ratio(data):
    class_freqs = data['label'].value_counts()
    #share of deviant cases (label 1); 0 if the chunk contains no deviant case
    return class_freqs.get(1, 0) / class_freqs.sum()


#### We then follow the same CV method as in practice session 10.

In [22]:
from sklearn.model_selection import StratifiedKFold
def get_stratified_split_generator(data, n_splits=5, shuffle=True, random_state=22):
    grouped_firsts = data.groupby('Case ID', as_index=False).first()
    skf = StratifiedKFold(n_splits=n_splits, shuffle=shuffle, random_state=random_state)

    for train_index, test_index in skf.split(grouped_firsts, grouped_firsts['label']):
        current_train_names = grouped_firsts['Case ID'][train_index]
        train_chunk = data[data['Case ID'].isin(current_train_names)].sort_values('time:timestamp', ascending=True, kind='mergesort')
        test_chunk = data[~data['Case ID'].isin(current_train_names)].sort_values('time:timestamp', ascending=True, kind='mergesort')
        yield (train_chunk, test_chunk)

In [23]:

# prepare chunks for CV
dt_prefixes = []
class_ratios = []
min_prefix_length = 1
ngram_size=5

for train_chunk, test_chunk in get_stratified_split_generator(train, n_splits=n_splits):
    class_ratios.append(get_class_ratio(train_chunk))
    # generate data where each prefix is a separate instance
    dt_prefixes.append(generate_prefix_data(test_chunk, ngram_size))
del train

## Task 4

(3 points) 

Test the results of your modifications with the Turnaround process event log using
cluster bucketing, index encoding, and the XGBoost model.

### Model Parameter Optimization

Teinemaa et al. provide a method for optimizing the model parameters for predictive process monitoring. The file <a href="https://github.com/Elkoumy/predictive-monitoring-benchmark/blob/master/experiments/optimize_params.py">optimize_params.py</a> performs the parameter optimization. We adapted the file by adding the required parameters for the input event log "turnaround_anon_sla.csv".

We also adapted the file <a href="https://github.com/Elkoumy/predictive-monitoring-benchmark/blob/master/experiments/dataset_confs.py">dataset_confs.py</a> to enable parameter tuning for the dataset "turnaround_anon_sla.csv". 

We used the following command to execute the optimizer:
``` python optimize_params.py turnaround_anon_sla_renamed optimizer_log 10  cluster index xgboost ```

The output of the optimizer can be found in the folder "optimizer_log". The optimal parameters are in the pickle file <a href="https://github.com/Elkoumy/predictive-monitoring-benchmark/blob/master/experiments/dataset_confs.py">dataset_confs.py</a>