<h1>Encodings for Next Step Activity Prediction</h1>
<br/>
<h5>Lorenzo Manuel Cirac Monteagudo</h5>
<h5>Supervisor: Ana Luisa Oliveira da Nobrega Costa</h5>
<h5>Chair: Information Systems</h5>
<h5>TUM School of Computation, Information and Technology</h5>
<br/>

<h3>Overview</h3>
<p>Next-step activity prediction is a supervised machine learning task: given the history of activities that have already occurred in a case, predict which activity will happen next in the business process.</p>
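<p>To make the task concrete, the following minimal sketch turns a single trace into (prefix, next activity) pairs. The trace is a hypothetical toy example that only borrows activity names from the Helpdesk log; it is not taken from the data.</p>

```python
# Hypothetical toy trace using activity names that occur in the Helpdesk log.
trace = ["Assign seriousness", "Take in charge ticket", "Resolve ticket", "Closed"]

# Every proper prefix of the trace becomes one training example;
# the label is the activity that directly follows the prefix.
examples = [(trace[:i], trace[i]) for i in range(1, len(trace))]

for prefix, label in examples:
    print(prefix, "->", label)
```

<p>A trace with four events yields three examples, one per prefix length.</p>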

<h3>Dataset Information</h3>
<p>Helpdesk dataset: <a href="https://github.com/ERamaM/PredictiveMonitoringDatasets/tree/master/raw_datasets">https://github.com/ERamaM/PredictiveMonitoringDatasets/tree/master/raw_datasets</a></p>
<p>This event log contains data of a ticketing management process
from an Italian software company.</p>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from pm4py.objects.conversion.log import converter as xes_converter
from pm4py.objects.log.importer.xes import importer as xes_importer

log = xes_importer.apply("data/helpdesk")
df = xes_converter.apply(log, variant=xes_converter.Variants.TO_DATA_FRAME)

parsing log, completed traces ::   0%|          | 0/4580 [00:00<?, ?it/s]

<h3>Dataset Exploration</h3>

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21348 entries, 0 to 21347
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   concept:name          21348 non-null  object             
 1   lifecycle:transition  21348 non-null  object             
 2   org:resource          21348 non-null  object             
 3   time:timestamp        21348 non-null  datetime64[ns, UTC]
 4   Activity              21348 non-null  object             
 5   Resource              21348 non-null  object             
 6   case:concept:name     21348 non-null  object             
 7   case:variant          21348 non-null  object             
 8   case:variant-index    21348 non-null  int64              
 9   case:creator          21348 non-null  object             
dtypes: datetime64[ns, UTC](1), int64(1), object(8)
memory usage: 1.6+ MB


In [4]:
df.head()

Unnamed: 0,concept:name,lifecycle:transition,org:resource,time:timestamp,Activity,Resource,case:concept:name,case:variant,case:variant-index,case:creator
0,Assign seriousness,complete,Value 2,2010-01-13 08:40:25+00:00,Assign seriousness,Value 2,Case3608,Variant 33,33,Fluxicon Disco
1,Take in charge ticket,complete,Value 2,2010-01-29 08:52:27+00:00,Take in charge ticket,Value 2,Case3608,Variant 33,33,Fluxicon Disco
2,Resolve ticket,complete,Value 2,2010-01-29 08:52:34+00:00,Resolve ticket,Value 2,Case3608,Variant 33,33,Fluxicon Disco
3,Closed,complete,Value 5,2010-02-13 08:52:48+00:00,Closed,Value 5,Case3608,Variant 33,33,Fluxicon Disco
4,Closed,complete,Value 5,2010-02-13 08:52:48+00:00,Closed,Value 5,Case3608,Variant 33,33,Fluxicon Disco


<h3>Data Cleaning & Prefix Generation</h3>

In [5]:
# Clean Data
df = df.rename(columns = {
    "case:concept:name": "case_id",
    "concept:name": "activity",
    "org:resource": "resource",
    "time:timestamp": "timestamp"
})

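# Keep only the columns needed downstream
df = df[["case_id", "activity", "resource", "timestamp"]].copy()
# Sort first so that shift(-1) picks the chronologically next event of the case
df = df.sort_values(by = ["case_id", "timestamp"])
df["next_activity"] = df.groupby("case_id")["activity"].shift(-1)
# Drop the last event of each case, which has no successor
df = df[df["next_activity"].notna()]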

df.head(10)

Unnamed: 0,case_id,activity,resource,timestamp,next_activity
17612,Case1,Assign seriousness,Value 1,2012-10-09 14:50:17+00:00,Take in charge ticket
17613,Case1,Take in charge ticket,Value 1,2012-10-09 14:51:01+00:00,Take in charge ticket
17614,Case1,Take in charge ticket,Value 2,2012-10-12 15:02:56+00:00,Resolve ticket
17615,Case1,Resolve ticket,Value 1,2012-10-25 11:54:26+00:00,Closed
212,Case10,Assign seriousness,Value 2,2010-02-10 08:50:20+00:00,Take in charge ticket
213,Case10,Take in charge ticket,Value 2,2010-03-19 08:47:06+00:00,Resolve ticket
214,Case10,Resolve ticket,Value 2,2010-03-19 08:47:13+00:00,Closed
19330,Case100,Assign seriousness,Value 1,2013-04-12 10:25:17+00:00,Take in charge ticket
19331,Case100,Take in charge ticket,Value 9,2013-04-24 10:24:01+00:00,Require upgrade
19332,Case100,Require upgrade,Value 9,2013-04-24 15:51:11+00:00,Resolve ticket


In [6]:
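# Build one training example per prefix of each case
prefixes = []
for case_id, group in df.groupby("case_id"):
    activities = group["activity"].astype(str).tolist()
    resources = group["resource"].astype(str).tolist()

    # A prefix of length i is labelled with the activity at position i.
    # Note: the last event of each case was dropped above, so the final
    # transition of a case (e.g. Resolve ticket -> Closed) is not a target.
    for i in range(1, len(activities)):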
        prefixes.append({
            "case_id": case_id,
            "step": i,
            "activities": activities[:i],
            "resources": resources[:i],
            "next_activity": activities[i],
        })

df = pd.DataFrame(prefixes)

df.head(10)

Unnamed: 0,case_id,step,activities,resources,next_activity
0,Case1,1,[Assign seriousness],[Value 1],Take in charge ticket
1,Case1,2,"[Assign seriousness, Take in charge ticket]","[Value 1, Value 1]",Take in charge ticket
2,Case1,3,"[Assign seriousness, Take in charge ticket, Ta...","[Value 1, Value 1, Value 2]",Resolve ticket
3,Case10,1,[Assign seriousness],[Value 2],Take in charge ticket
4,Case10,2,"[Assign seriousness, Take in charge ticket]","[Value 2, Value 2]",Resolve ticket
5,Case100,1,[Assign seriousness],[Value 1],Take in charge ticket
6,Case100,2,"[Assign seriousness, Take in charge ticket]","[Value 1, Value 9]",Require upgrade
7,Case100,3,"[Assign seriousness, Take in charge ticket, Re...","[Value 1, Value 9, Value 9]",Resolve ticket
8,Case1000,1,[Assign seriousness],[Value 2],Assign seriousness
9,Case1000,2,"[Assign seriousness, Assign seriousness]","[Value 2, Value 2]",Take in charge ticket


In [7]:
from sklearn.preprocessing import LabelEncoder

X = df.drop(columns = ["case_id", "next_activity"])
y = df["next_activity"]

y_encoder = LabelEncoder()
y = y_encoder.fit_transform(y)

In [8]:
X

Unnamed: 0,step,activities,resources
0,1,[Assign seriousness],[Value 1]
1,2,"[Assign seriousness, Take in charge ticket]","[Value 1, Value 1]"
2,3,"[Assign seriousness, Take in charge ticket, Ta...","[Value 1, Value 1, Value 2]"
3,1,[Assign seriousness],[Value 2]
4,2,"[Assign seriousness, Take in charge ticket]","[Value 2, Value 2]"
...,...,...,...
12183,1,[Assign seriousness],[Value 9]
12184,2,"[Assign seriousness, Take in charge ticket]","[Value 9, Value 2]"
12185,3,"[Assign seriousness, Take in charge ticket, Wait]","[Value 9, Value 2, Value 9]"
12186,1,[Assign seriousness],[Value 1]


<h3>Encodings</h3>

<h5>One-Hot Encoding</h5>

In [9]:
from sklearn.preprocessing import MultiLabelBinarizer

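def apply_one_hot_encoding(X):
    # Multi-hot encode the prefix lists: one binary column per activity/resource,
    # set to 1 if it occurs anywhere in the prefix (order and counts are not kept).
    mlb_activities = MultiLabelBinarizer()
    mlb_resources = MultiLabelBinarizer()

    # Use the argument X rather than the global df, so the function is self-contained
    activities_encoded = pd.DataFrame(
        mlb_activities.fit_transform(X["activities"]),
        columns=[f"act_{a}" for a in mlb_activities.classes_],
        index=X.index,
    )
    resources_encoded = pd.DataFrame(
        mlb_resources.fit_transform(X["resources"]),
        columns=[f"res_{r}" for r in mlb_resources.classes_],
        index=X.index,
    )

    # Keep the remaining columns (e.g. "step") and append the encoded ones
    result = pd.concat([X.drop(columns = ["activities", "resources"]), activities_encoded, resources_encoded], axis=1)

    return result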
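
<p>The sketch below illustrates what <code>MultiLabelBinarizer</code> does to prefix lists, using a few hypothetical toy prefixes rather than the real data. It suggests that this encoding records only which activities are present in a prefix, so two prefixes that differ in order or in repetitions map to the same vector.</p>

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy prefixes (not taken from the dataset).
prefixes = [
    ["Assign seriousness"],
    ["Assign seriousness", "Take in charge ticket"],
    ["Take in charge ticket", "Assign seriousness"],  # same activities, different order
    ["Assign seriousness", "Assign seriousness"],     # repeated activity
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(prefixes)

# One column per distinct activity, in sorted order.
print(list(mlb.classes_))
# One row per prefix; 1 means the activity occurs somewhere in the prefix.
print(encoded.tolist())
```

<p>Rows 2 and 3 are identical, as are rows 1 and 4, which is why the <code>step</code> column is the only feature here that carries information about prefix length.</p>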

<h3>Random Forest Model</h3>

In [10]:
X = apply_one_hot_encoding(X)

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 2025)
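
<p>The split above shuffles individual prefix rows, so prefixes of the same case can land in both the training and the test set. Whether that matters depends on the evaluation goal. As a hedged alternative, the sketch below uses scikit-learn's <code>GroupShuffleSplit</code> on hypothetical toy arrays (<code>X_toy</code>, <code>y_toy</code>, <code>case_ids</code> are illustrative names) to keep all rows of a case on one side of the split.</p>

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 6 prefix rows that belong to 3 cases.
X_toy = np.arange(12).reshape(6, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1])
case_ids = np.array(["Case1", "Case1", "Case10", "Case10", "Case100", "Case100"])

# test_size refers to the share of groups (cases), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=2025)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=case_ids))

train_cases = set(case_ids[train_idx])
test_cases = set(case_ids[test_idx])
print("train cases:", sorted(train_cases))
print("test cases:", sorted(test_cases))
```

<p>Applying this to the notebook would require keeping the <code>case_id</code> column aligned with <code>X</code> to pass it as <code>groups</code>; the reported scores were obtained with the row-level split and would likely differ.</p>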

In [13]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state = 2025)
model.fit(X_train, y_train)

In [14]:
from sklearn.metrics import accuracy_score, f1_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.4f}")
print(f"F1-Score: {f1:.4f}")

Accuracy: 0.7443
F1-Score: 0.7110
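
<p>The values in <code>y_pred</code> are integer codes produced by the label encoder, not activity names. The following sketch, on hypothetical toy labels, shows how <code>LabelEncoder.inverse_transform</code> maps such codes back to readable names.</p>

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels using activity names from the Helpdesk log.
labels = ["Take in charge ticket", "Resolve ticket", "Closed", "Resolve ticket"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)  # classes are numbered in sorted order

# Pretend these codes came out of a classifier's predict() call.
predicted_codes = np.array([2, 0])
decoded = encoder.inverse_transform(predicted_codes)

print(codes.tolist())
print(decoded.tolist())
```

<p>In the notebook, the analogous call would be <code>y_encoder.inverse_transform(y_pred)</code>.</p>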
