# Predictive Process Mining with Vanilla Methods

This notebook demonstrates predictive process mining using vanilla methods. It covers data preprocessing, feature engineering, and training a linear regression model and a random forest regressor to predict the total cost of a case from event log data.

## 1. Import Packages

In [8]:
import pandas as pd
import pm4py
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from src import SRC_DIR

<br></br>
## 2. Data Preprocessing

In this section, we load the event log, parse the timestamp column as datetimes, and replace each activity name with a short key.

In [33]:
# Import Eventlog
event_log = pd.read_csv(SRC_DIR / 'Datasets' / 'Example' / 'Running_Example' / 'running-example.csv', sep=";")
event_log.head()

Unnamed: 0,case_id,activity,timestamp,costs,resource
0,3,register request,2010-12-30 14:32:00+01:00,50,Pete
1,3,examine casually,2010-12-30 15:06:00+01:00,400,Mike
2,3,check ticket,2010-12-30 16:34:00+01:00,100,Ellen
3,3,decide,2011-01-06 09:18:00+01:00,200,Sara
4,3,reinitiate request,2011-01-06 12:18:00+01:00,200,Sara


In [34]:
# Parse the "timestamp" column as datetime
event_log['timestamp'] = pd.to_datetime(event_log['timestamp'])
event_log.head()

Unnamed: 0,case_id,activity,timestamp,costs,resource
0,3,register request,2010-12-30 14:32:00+01:00,50,Pete
1,3,examine casually,2010-12-30 15:06:00+01:00,400,Mike
2,3,check ticket,2010-12-30 16:34:00+01:00,100,Ellen
3,3,decide,2011-01-06 09:18:00+01:00,200,Sara
4,3,reinitiate request,2011-01-06 12:18:00+01:00,200,Sara
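
The sample rows all carry a `+01:00` offset. If a log mixes offsets (for example across a daylight-saving change), `pd.to_datetime` may not return a single timezone-aware dtype, and passing `utc=True` is one way to normalise everything to UTC. A minimal sketch on made-up timestamps (the values below are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical timestamps with two different UTC offsets (not from the running example)
raw = pd.Series([
    "2010-12-30 14:32:00+01:00",
    "2011-07-01 09:00:00+02:00",
])

# utc=True converts every value to UTC, giving one timezone-aware datetime dtype
parsed = pd.to_datetime(raw, utc=True)
print(parsed.dtype)
print(parsed.iloc[0])
```

Durations computed from UTC timestamps are unaffected by the conversion, so elapsed-time features stay the same.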


In [35]:
# Replace Activities with keys
activities_code = {activity: str(idx) for idx, activity in enumerate(event_log.activity.unique())}
event_log.replace({"activity": activities_code}, inplace=True)
event_log.head()

Unnamed: 0,case_id,activity,timestamp,costs,resource
0,3,0,2010-12-30 14:32:00+01:00,50,Pete
1,3,1,2010-12-30 15:06:00+01:00,400,Mike
2,3,2,2010-12-30 16:34:00+01:00,100,Ellen
3,3,3,2011-01-06 09:18:00+01:00,200,Sara
4,3,4,2011-01-06 12:18:00+01:00,200,Sara
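
The same encoding can be sketched with `pd.factorize`, which numbers values in order of first appearance, just like `enumerate(unique())`. The snippet also builds an inverse mapping so that encoded prefixes can be translated back to activity names later. The toy series and the name `code_to_activity` are assumptions for illustration:

```python
import pandas as pd

# Toy activity column (names taken from the sample rows of the event log)
activity = pd.Series(["register request", "examine casually", "check ticket",
                      "decide", "register request"])

# factorize returns integer codes plus the unique values in order of first appearance
codes, uniques = pd.factorize(activity)
encoded = pd.Series(codes, index=activity.index).astype(str)

# Forward and inverse mappings between activity names and string keys
activities_code = {name: str(idx) for idx, name in enumerate(uniques)}
code_to_activity = {code: name for name, code in activities_code.items()}

print(encoded.tolist())
print(code_to_activity)
```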


<br></br>
## 3. Prediction with Vanilla Method

### 3.1. Making Dummy Variables for Resources

We one-hot encode the resource column so that each resource becomes its own 0/1 feature column.


In [36]:
event_log = pd.get_dummies(event_log, columns=['resource'], prefix="Resource")
event_log.head()

Unnamed: 0,case_id,activity,timestamp,costs,Resource_Ellen,Resource_Mike,Resource_Pete,Resource_Sara,Resource_Sean,Resource_Sue
0,3,0,2010-12-30 14:32:00+01:00,50,0,0,1,0,0,0
1,3,1,2010-12-30 15:06:00+01:00,400,0,1,0,0,0,0
2,3,2,2010-12-30 16:34:00+01:00,100,1,0,0,0,0,0
3,3,3,2011-01-06 09:18:00+01:00,200,0,0,0,1,0,0
4,3,4,2011-01-06 12:18:00+01:00,200,0,0,0,1,0,0
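
Recent pandas versions return boolean dummy columns by default, whereas the output above shows 0/1 integers. A small sketch, on a toy frame that is assumed for illustration, of requesting integer columns explicitly with `dtype=int`:

```python
import pandas as pd

# Toy frame with three resources (illustrative values only)
toy = pd.DataFrame({"case_id": [3, 3, 3], "resource": ["Pete", "Mike", "Ellen"]})

# dtype=int forces 0/1 columns; newer pandas versions default to booleans
dummies = pd.get_dummies(toy, columns=["resource"], prefix="Resource", dtype=int)
print(dummies)
```

Integer columns matter in the next step, where the per-resource indicators are summed over each prefix.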


<br></br>
### 3.2. Reshape Eventlog (Generate State Dataset)

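Reshape the event log into a state dataset in which each row represents one prefix (state) of a case: the activities completed so far, the elapsed time, the costs paid, and the cumulative number of events handled by each resource. Every row also carries the total time and total cost of its case, which serve as prediction targets.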

In [55]:
reshaped_event_log_lst = []

for group_name, group in event_log.groupby('case_id'):
    # Work on a sorted copy; modifying a groupby slice in place can raise SettingWithCopyWarning
    group = group.sort_values("timestamp").reset_index(drop=True)

    # Input measures
    prefix = [tuple(group['activity'].values[:i]) for i in range(len(group) + 1)]

    # Total Elapsed time
    elapsed_time = []
    for i in range(len(group) + 1):
        start_time = min(group['timestamp'].values)
        end_time = max(group['timestamp'].values[:i]) if i != 0 else start_time
        elapsed_time.append(end_time - start_time)

    # Total Paid Costs
    paid_costs = [sum(group['costs'].values[:i]) for i in range(len(group) + 1)]

    # Total number of done activities
    number_of_done_activities = [len(group['activity'].values[:i]) for i in range(len(group) + 1)]

    # Cumulative number of events handled by each resource
    resources = {}
    for col in group.columns:
        if "Resource_" in col:
            resources[col] = [sum(group[col].values[:i]) for i in range(len(group) + 1)]

    # Output measures
    total_time = [(max(group['timestamp']) - min(group['timestamp'])) for _ in range(len(group) + 1)]
    total_cost = [sum(group['costs'].values) for _ in range(len(group) + 1)]


    # Create DataFrame
    reshaped_group = pd.DataFrame(
        {'Prefix': prefix,
         'Elapsed_Time': elapsed_time,
         'Paid_Costs': paid_costs,
         '#Activities': number_of_done_activities,
         'Total_Time': total_time,
         'Total_Cost': total_cost,
         **resources         # Unpack the per-resource count columns
        })

    reshaped_group['Case_Name'] = group_name
    reshaped_group.Elapsed_Time = reshaped_group.Elapsed_Time.dt.total_seconds()
    reshaped_group.Total_Time = reshaped_group.Total_Time.dt.total_seconds()

    # Append 'reshaped group' to 'reshaped eventlog'
    reshaped_event_log_lst.append(reshaped_group)

reshaped_event_log = pd.concat(reshaped_event_log_lst, axis=0)

In [57]:
reshaped_event_log.head()

Unnamed: 0,Prefix,Elapsed_Time,Paid_Costs,#Activities,Total_Time,Total_Cost,Resource_Ellen,Resource_Mike,Resource_Pete,Resource_Sara,Resource_Sean,Resource_Sue,Case_Name
0,(),0.0,0,0,703320.0,950,0,0,0,0,0,0,1
1,"(0,)",0.0,50,1,703320.0,950,0,0,1,0,0,0,1
2,"(0, 5)",83040.0,450,2,703320.0,950,0,0,1,0,0,1,1
3,"(0, 5, 2)",533400.0,550,3,703320.0,950,0,1,1,0,0,1,1
4,"(0, 5, 2, 3)",605760.0,750,4,703320.0,950,0,1,1,1,0,1,1
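
The cumulative features built in the loop can also be sketched with grouped, vectorised operations. The version below is an assumption-laden illustration rather than a drop-in replacement: it runs on a made-up `toy_log`, it omits the empty-prefix row that the loop emits for each case, and it leaves out the `Prefix` and resource columns.

```python
import pandas as pd

# Toy event log (illustrative values, not the running example)
toy_log = pd.DataFrame({
    "case_id": [1, 1, 1, 2, 2],
    "activity": ["0", "1", "2", "0", "2"],
    "timestamp": pd.to_datetime([
        "2011-01-01 10:00", "2011-01-01 11:00", "2011-01-02 10:00",
        "2011-01-03 09:00", "2011-01-03 09:30",
    ]),
    "costs": [50, 400, 100, 50, 100],
})

# Sort once so that cumulative operations follow event order within each case
toy_log = toy_log.sort_values(["case_id", "timestamp"]).reset_index(drop=True)
by_case = toy_log.groupby("case_id")
case_start = by_case["timestamp"].transform("min")
case_end = by_case["timestamp"].transform("max")

states = pd.DataFrame({
    "Case_Name": toy_log["case_id"],
    # Time from the first event of the case to the latest event in the prefix
    "Elapsed_Time": (toy_log["timestamp"] - case_start).dt.total_seconds(),
    # Running sum of costs within the case
    "Paid_Costs": by_case["costs"].cumsum(),
    # Number of events in the prefix (1-based)
    "#Activities": by_case.cumcount() + 1,
    # Case-level targets, repeated on every prefix row
    "Total_Time": (case_end - case_start).dt.total_seconds(),
    "Total_Cost": by_case["costs"].transform("sum"),
})
print(states)
```

On large logs this avoids the repeated slicing inside the Python loop, at the cost of handling the empty prefix separately if it is needed.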


<br></br>
### 3.3. Making Dummy Variables for Prefix

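One-hot encode the prefix column so that every distinct prefix becomes its own 0/1 feature column. Because each observed activity sequence gets a separate column, the resulting feature matrix is wide and sparse.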

In [58]:
reshaped_event_log = pd.get_dummies(reshaped_event_log, columns=['Prefix'], prefix="Prefix")
reshaped_event_log.head()

Unnamed: 0,Elapsed_Time,Paid_Costs,#Activities,Total_Time,Total_Cost,Resource_Ellen,Resource_Mike,Resource_Pete,Resource_Sara,Resource_Sean,...,"Prefix_('0', '2', '1')","Prefix_('0', '2', '1', '3')","Prefix_('0', '2', '1', '3', '6')","Prefix_('0', '2', '5')","Prefix_('0', '2', '5', '3')","Prefix_('0', '2', '5', '3', '7')","Prefix_('0', '5')","Prefix_('0', '5', '2')","Prefix_('0', '5', '2', '3')","Prefix_('0', '5', '2', '3', '7')"
0,0.0,0,0,703320.0,950,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0.0,50,1,703320.0,950,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,83040.0,450,2,703320.0,950,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
3,533400.0,550,3,703320.0,950,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
4,605760.0,750,4,703320.0,950,0,1,1,1,0,...,0,0,0,0,0,0,0,0,1,0
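
One-hot encoding whole prefixes creates one column per observed sequence. A commonly used alternative, which is not what this notebook does, is a frequency ("bag of activities") encoding that counts how often each activity occurs in the prefix, so the number of columns is bounded by the number of activities. A minimal sketch on hypothetical prefixes (the `Count_` column prefix is an assumed naming choice):

```python
from collections import Counter

import pandas as pd

# Hypothetical prefixes of encoded activities, including the empty prefix
prefixes = [(), ("0",), ("0", "5"), ("0", "5", "2"), ("0", "5", "2", "5")]

# Count activity occurrences per prefix; activities absent from a prefix become 0
counts = pd.DataFrame([dict(Counter(p)) for p in prefixes]).fillna(0).astype(int)
counts = counts.add_prefix("Count_")
print(counts)
```

This encoding discards the order of activities, so it trades sequence information for a much smaller feature space.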


<br></br>
### 3.4. Model Training and Evaluation

We train two models to predict the total cost of a case (`Total_Cost`) from the features extracted from the event log.

#### 3.4.1. Linear Regression

In [74]:
x_columns_name = ["Elapsed_Time", "Paid_Costs", "#Activities"] + \
                 [col for col in reshaped_event_log.columns if "Prefix_" in col or "Resource_" in col]

y_column_name = ["Total_Cost"]

In [80]:
x = reshaped_event_log[x_columns_name]
y = reshaped_event_log[y_column_name]
y = y.to_numpy().ravel()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
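
Because every case contributes several prefix rows that share the same `Total_Cost`, a row-wise random split can place prefixes of one case in both the training and the test set. One way to avoid that, sketched here on synthetic data, is to split by case with scikit-learn's `GroupShuffleSplit`. The arrays below are illustrative assumptions; in this notebook the groups would correspond to `Case_Name`.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 6 cases with 4 prefix rows each (values are illustrative only)
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(6), 4)          # stands in for Case_Name
X = rng.normal(size=(len(groups), 3))
y = rng.normal(size=len(groups))

# Split on whole cases so that no case is shared between train and test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

shared_cases = set(groups[train_idx]) & set(groups[test_idx])
print(f"cases in both splits: {len(shared_cases)}")
```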

In [82]:
model = LinearRegression()
model.fit(x_train, y_train)

y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)

print(f"Train R^2: {r2_score(y_train, y_pred_train)}")
print(f"Test R^2: {r2_score(y_test, y_pred_test)}")

Train R^2: 0.6978343749705542
Test R^2: 0.4486057453477055
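
R² is unitless, so it can help to also report errors in the units of the target (here, cost). A short sketch using `mean_squared_error` and `mean_absolute_error` on made-up values, which are assumptions chosen only to make the arithmetic easy to follow:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative true and predicted total costs (not results from this notebook)
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 300.0])

# RMSE is the square root of the mean squared error, in the target's units
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
mae = mean_absolute_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)

print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}")
print(f"R^2:  {r2:.3f}")
```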


In [83]:
print(f"intercept: {model.intercept_}")
print(f"coefficients: {model.coef_}")

intercept: 1525.5973329426927
coefficients: [ 4.42187828e-04 -2.70292667e-01  5.41221304e+01  1.00585502e+03
 -6.87589693e+02  2.89046877e+02 -1.03896262e+02 -2.57024218e+02
 -1.92269596e+02 -2.15597333e+02 -3.13286022e+02  2.44204689e+02
 -1.22990343e+01 -2.06052023e+01  1.27813974e+02 -5.36319598e+02
  1.26812067e+02  1.23246152e+02  7.64242003e-11  4.00435421e+02
  8.03690527e+01  4.28426183e-11  7.51753412e+02 -4.30839887e+01
 -3.78802822e-11 -3.64938677e+02 -1.60141403e+03  1.94503681e+02
  7.30820441e+02  1.03854972e+03  9.17768654e+02 -2.03905895e+02
  0.00000000e+00  0.00000000e+00  0.00000000e+00 -6.95706452e+02
 -2.34353333e+02  0.00000000e+00 -4.94767698e+02]
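
A raw coefficient array is hard to read without the feature names. One option is to wrap the coefficients in a `pd.Series` indexed by column name. The sketch below uses a toy model with a known linear target; the names `features`, `target`, and `toy_model` are hypothetical, and in this notebook the index would be `x_columns_name`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy features; the target is exactly 2*Paid_Costs + 3*Activities + 1
features = pd.DataFrame({"Paid_Costs": [0, 1, 2, 3, 4], "Activities": [1, 0, 2, 1, 3]})
target = 2 * features["Paid_Costs"] + 3 * features["Activities"] + 1

toy_model = LinearRegression().fit(features, target)

# Pair each coefficient with its feature name
coef_by_feature = pd.Series(toy_model.coef_, index=features.columns)

# Order features by the absolute size of their coefficient
order = coef_by_feature.abs().sort_values(ascending=False).index
ranked = coef_by_feature.reindex(order)
print(ranked)
```

Coefficient magnitudes are only comparable across features on similar scales, so scaling the inputs first would be needed before reading them as importance.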


<br></br>
#### 3.4.2. Random Forest Regressor

In [84]:
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(x_train, y_train)

y_pred_train_rf = rf_model.predict(x_train)
y_pred_test_rf = rf_model.predict(x_test)

print(f"Train R^2 (RF): {r2_score(y_train, y_pred_train_rf)}")
print(f"Test R^2 (RF): {r2_score(y_test, y_pred_test_rf)}")

Train R^2 (RF): 0.7863601707470897
Test R^2 (RF): 0.7596022825432256
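
The random forest above uses default hyperparameters. `GridSearchCV`, which is imported at the top of the notebook, could be used to tune them. The following sketch runs on synthetic data from `make_regression`; the grid values are arbitrary assumptions, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the prefix features
X_demo, y_demo = make_regression(n_samples=80, n_features=5, noise=10.0, random_state=42)

# Small, arbitrary grid; each combination is scored with 3-fold cross-validation
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds the forest refit on all of the data passed to `fit` with the best parameter combination.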


<br></br>
## 4. Conclusion

This notebook demonstrated predictive process mining with two vanilla models. We covered data preprocessing, feature engineering, and model training. On the held-out split, linear regression reached a test R² of about 0.45 and the random forest about 0.76. Further improvements could include hyperparameter tuning, more sophisticated models, and a more thorough evaluation (for example, splitting by case rather than by row).
