<h1>Encodings for Next Step Activity Prediction</h1>
<br/>
<h5>Lorenzo Manuel Cirac Monteagudo</h5>
<h5>Supervisor: Ana Luisa Oliveira da Nobrega Costa</h5>
<h5>Chair: Information Systems</h5>
<h5>TUM School of Computation, Information and Technology</h5>
<br/>

<h3>Overview</h3>
<p>Next-step activity prediction is a supervised machine learning task: given the activities that have already occurred in a case, predict which activity will happen next in the business process.</p>
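<p>As a minimal sketch of this setup, each case can be unrolled into (prefix, next activity) training pairs. The trace below is a made-up example, and <code>prefix_pairs</code> is a hypothetical helper name that does not appear elsewhere in this notebook.</p>

```python
def prefix_pairs(trace):
    """Return (prefix, next_activity) pairs for one case."""
    return [(trace[:i], trace[i]) for i in range(1, len(trace))]

# Hypothetical trace, using activity names that occur in the Helpdesk log
trace = ["Assign seriousness", "Take in charge ticket", "Resolve ticket", "Closed"]

pairs = prefix_pairs(trace)
for prefix, nxt in pairs:
    print(prefix, "->", nxt)
```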

<h3>Dataset Information</h3>
<p>Helpdesk dataset: <a href="https://github.com/ERamaM/PredictiveMonitoringDatasets/tree/master/raw_datasets">https://github.com/ERamaM/PredictiveMonitoringDatasets/tree/master/raw_datasets</a></p>
<p>This event log contains data from a ticketing management process
of an Italian software company.</p>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [2]:
from pm4py.objects.conversion.log import converter as xes_converter
from pm4py.objects.log.importer.xes import importer as xes_importer

log = xes_importer.apply("data/helpdesk")
df = xes_converter.apply(log, variant=xes_converter.Variants.TO_DATA_FRAME)


<h3>Dataset Exploration</h3>

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21348 entries, 0 to 21347
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   concept:name          21348 non-null  object             
 1   lifecycle:transition  21348 non-null  object             
 2   org:resource          21348 non-null  object             
 3   time:timestamp        21348 non-null  datetime64[ns, UTC]
 4   Activity              21348 non-null  object             
 5   Resource              21348 non-null  object             
 6   case:concept:name     21348 non-null  object             
 7   case:variant          21348 non-null  object             
 8   case:variant-index    21348 non-null  int64              
 9   case:creator          21348 non-null  object             
dtypes: datetime64[ns, UTC](1), int64(1), object(8)
memory usage: 1.6+ MB


In [4]:
df.head()

Unnamed: 0,concept:name,lifecycle:transition,org:resource,time:timestamp,Activity,Resource,case:concept:name,case:variant,case:variant-index,case:creator
0,Assign seriousness,complete,Value 2,2010-01-13 08:40:25+00:00,Assign seriousness,Value 2,Case3608,Variant 33,33,Fluxicon Disco
1,Take in charge ticket,complete,Value 2,2010-01-29 08:52:27+00:00,Take in charge ticket,Value 2,Case3608,Variant 33,33,Fluxicon Disco
2,Resolve ticket,complete,Value 2,2010-01-29 08:52:34+00:00,Resolve ticket,Value 2,Case3608,Variant 33,33,Fluxicon Disco
3,Closed,complete,Value 5,2010-02-13 08:52:48+00:00,Closed,Value 5,Case3608,Variant 33,33,Fluxicon Disco
4,Closed,complete,Value 5,2010-02-13 08:52:48+00:00,Closed,Value 5,Case3608,Variant 33,33,Fluxicon Disco


<h3>Data Cleaning & Prefix Generation</h3>

In [5]:
# Clean Data
df = df.rename(columns = {
    "case:concept:name": "case_id",
    "concept:name": "activity",
    "org:resource": "resource",
    "time:timestamp": "timestamp"
})

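# Keep only the columns used below
df = df[["case_id", "activity", "resource", "timestamp"]]

# Sort chronologically within each case before shifting, so that
# "next row" means "next event"; a stable sort keeps the log order for ties
df = df.sort_values(by = ["case_id", "timestamp"], kind = "stable")

# Label each event with the following activity of the same case
df["next_activity"] = df.groupby("case_id")["activity"].shift(-1)

# Drop the last event of each case, which has no successor
df = df[df["next_activity"].notna()]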

df.head(10)

Unnamed: 0,case_id,activity,resource,timestamp,next_activity
17612,Case1,Assign seriousness,Value 1,2012-10-09 14:50:17+00:00,Take in charge ticket
17613,Case1,Take in charge ticket,Value 1,2012-10-09 14:51:01+00:00,Take in charge ticket
17614,Case1,Take in charge ticket,Value 2,2012-10-12 15:02:56+00:00,Resolve ticket
17615,Case1,Resolve ticket,Value 1,2012-10-25 11:54:26+00:00,Closed
212,Case10,Assign seriousness,Value 2,2010-02-10 08:50:20+00:00,Take in charge ticket
213,Case10,Take in charge ticket,Value 2,2010-03-19 08:47:06+00:00,Resolve ticket
214,Case10,Resolve ticket,Value 2,2010-03-19 08:47:13+00:00,Closed
19330,Case100,Assign seriousness,Value 1,2013-04-12 10:25:17+00:00,Take in charge ticket
19331,Case100,Take in charge ticket,Value 9,2013-04-24 10:24:01+00:00,Require upgrade
19332,Case100,Require upgrade,Value 9,2013-04-24 15:51:11+00:00,Resolve ticket
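
<p>The <code>next_activity</code> label relies on <code>groupby(...).shift(-1)</code>, which pairs each event with the following row of the same case, so the row order within a case matters. A small sketch on made-up data (the case IDs and timestamps are illustrative only):</p>

```python
import pandas as pd

toy = pd.DataFrame({
    "case_id": ["A", "B", "A", "B", "A"],
    "activity": ["Assign seriousness", "Assign seriousness",
                 "Take in charge ticket", "Closed", "Closed"],
    "timestamp": pd.to_datetime([
        "2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04", "2012-01-05",
    ]),
})

# Sort first so that "next row in the group" means "next event in time"
toy = toy.sort_values(["case_id", "timestamp"], kind="stable")
toy["next_activity"] = toy.groupby("case_id")["activity"].shift(-1)

# The last event of every case has no successor and gets NaN
print(toy)
```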


In [6]:
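# Build one row per prefix: the first i events of a case and the activity that follows.
# The last event of each case was dropped above, so the final activity of a case
# (e.g. "Closed" for Case1) never appears as a target here.
prefixes = []
for case_id, group in df.groupby("case_id"):
    activities = group["activity"].astype(str).tolist()
    resources = group["resource"].astype(str).tolist()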
    
    for i in range(1, len(activities)):   
        prefixes.append({
            "case_id": case_id,
            "step": i,
            "activities": activities[:i],
            "resources": resources[:i],
            "next_activity": activities[i],
        })

df = pd.DataFrame(prefixes)

df.head(10)

Unnamed: 0,case_id,step,activities,resources,next_activity
0,Case1,1,[Assign seriousness],[Value 1],Take in charge ticket
1,Case1,2,"[Assign seriousness, Take in charge ticket]","[Value 1, Value 1]",Take in charge ticket
2,Case1,3,"[Assign seriousness, Take in charge ticket, Ta...","[Value 1, Value 1, Value 2]",Resolve ticket
3,Case10,1,[Assign seriousness],[Value 2],Take in charge ticket
4,Case10,2,"[Assign seriousness, Take in charge ticket]","[Value 2, Value 2]",Resolve ticket
5,Case100,1,[Assign seriousness],[Value 1],Take in charge ticket
6,Case100,2,"[Assign seriousness, Take in charge ticket]","[Value 1, Value 9]",Require upgrade
7,Case100,3,"[Assign seriousness, Take in charge ticket, Re...","[Value 1, Value 9, Value 9]",Resolve ticket
8,Case1000,1,[Assign seriousness],[Value 2],Assign seriousness
9,Case1000,2,"[Assign seriousness, Assign seriousness]","[Value 2, Value 2]",Take in charge ticket


In [7]:
from sklearn.preprocessing import LabelEncoder

# Features: prefix length plus the activity and resource sequences
X = df.drop(columns = ["case_id", "next_activity"])

# Target: next activity, encoded as integer class IDs
le = LabelEncoder()
y = le.fit_transform(df["next_activity"])
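
<p><code>LabelEncoder</code> maps each activity name to an integer class ID, and <code>inverse_transform</code> recovers the name, which helps when reading model predictions. A small sketch with made-up labels (<code>le_demo</code> is a hypothetical name, kept separate from the notebook's <code>le</code>):</p>

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Take in charge ticket", "Resolve ticket", "Take in charge ticket", "Wait"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# Classes are stored in sorted order; each integer is an index into classes_
print(list(le_demo.classes_))
print(encoded)
print(le_demo.inverse_transform(encoded))
```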

In [8]:
X

Unnamed: 0,step,activities,resources
0,1,[Assign seriousness],[Value 1]
1,2,"[Assign seriousness, Take in charge ticket]","[Value 1, Value 1]"
2,3,"[Assign seriousness, Take in charge ticket, Ta...","[Value 1, Value 1, Value 2]"
3,1,[Assign seriousness],[Value 2]
4,2,"[Assign seriousness, Take in charge ticket]","[Value 2, Value 2]"
...,...,...,...
12183,1,[Assign seriousness],[Value 9]
12184,2,"[Assign seriousness, Take in charge ticket]","[Value 9, Value 2]"
12185,3,"[Assign seriousness, Take in charge ticket, Wait]","[Value 9, Value 2, Value 9]"
12186,1,[Assign seriousness],[Value 1]


<h3>Encodings</h3>

<h5>One Hot Encoding</h5>

In [9]:
from sklearn.preprocessing import MultiLabelBinarizer

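def apply_one_hot_encoding(X_train, X_test):
    """Binary indicators for which activities and resources occur in the prefix
    (order and counts are ignored), plus the step."""
    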
    mlb_activities = MultiLabelBinarizer()
    mlb_resources = MultiLabelBinarizer()
    
    X_train_activities = mlb_activities.fit_transform(X_train["activities"])
    X_test_activities = mlb_activities.transform(X_test["activities"])
    
    X_train_resources = mlb_resources.fit_transform(X_train["resources"])
    X_test_resources = mlb_resources.transform(X_test["resources"])
    
    X_train_step = X_train[["step"]].values
    X_test_step = X_test[["step"]].values
    
    X_train_final = np.hstack([X_train_step, X_train_activities, X_train_resources])
    X_test_final = np.hstack([X_test_step, X_test_activities, X_test_resources])
    
    return X_train_final, X_test_final, "one-hot encoding"
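
<p><code>MultiLabelBinarizer</code> treats each prefix as a set, so the resulting vector records which activities and resources have occurred, but not their order or how often. A minimal illustration with made-up prefixes:</p>

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three made-up prefixes: same activity set, different order and repetition
prefixes = [
    ["Assign seriousness", "Take in charge ticket"],
    ["Take in charge ticket", "Assign seriousness"],
    ["Assign seriousness", "Assign seriousness", "Take in charge ticket"],
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(prefixes)

print(mlb.classes_)
print(encoded)  # all three rows are identical
```

<p>In the function above, the <code>step</code> column is the only feature that separates prefixes with the same activity and resource sets.</p>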

<h5>Last State Encoding</h5>

In [10]:
from sklearn.preprocessing import OneHotEncoder

def apply_last_state_encoding(X_train, X_test, n=3):
    """One-hot encode the last n events of the prefix, left-padded with "None"."""
    
    def extract_last_state(X):
        
        last_activities = X["activities"].apply(
            lambda x: (["None"] * max(0, n - len(x)) + x)[-n:]
        )
        last_resources = X["resources"].apply(
            lambda x: (["None"] * max(0, n - len(x)) + x)[-n:]
        )
        
        features = {"step": X["step"]}
        for i in range(n):
            features[f"last_activity_{i+1}"] = last_activities.apply(lambda x: x[i])
            features[f"last_resource_{i+1}"] = last_resources.apply(lambda x: x[i])
        
        return pd.DataFrame(features, index=X.index)
    
    X_train_new = extract_last_state(X_train)
    X_test_new = extract_last_state(X_test)
    
    # Onehot encode categorical columns
    categorical_cols = [col for col in X_train_new.columns if col != "step"]
    
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    X_train_encoded = encoder.fit_transform(X_train_new[categorical_cols])
    X_test_encoded = encoder.transform(X_test_new[categorical_cols])
    
    X_train_final = np.column_stack([X_train_new[["step"]].values, X_train_encoded])
    X_test_final = np.column_stack([X_test_new[["step"]].values, X_test_encoded])
    
    return X_train_final, X_test_final, "last_state_encoding"
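
<p>The padding expression used above can be checked in isolation. In this sketch, <code>last_n</code> is a hypothetical helper that wraps the same expression. Because the padding goes on the left, the last slot always holds the most recent event.</p>

```python
def last_n(prefix, n=3, pad="None"):
    """Left-pad with `pad` and keep the final n items of the prefix."""
    return ([pad] * max(0, n - len(prefix)) + prefix)[-n:]

short = ["Assign seriousness"]
long = ["Assign seriousness", "Take in charge ticket", "Wait", "Resolve ticket"]

print(last_n(short))
print(last_n(long))
```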

<h5>Index Encoding</h5>

In [11]:
from sklearn.preprocessing import OneHotEncoder

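def apply_index_encoding(X_train, X_test, max_length=5):
    """One-hot encode the first max_length events of the prefix by position,
    right-padded with "None"."""
    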
    def extract_index_features(X):
        
        activities = X["activities"].apply(
            lambda x: (x + ["None"] * max_length)[:max_length]
        )
        resources = X["resources"].apply(
            lambda x: (x + ["None"] * max_length)[:max_length]
        )
        
        features = {"step": X["step"]}
        for i in range(max_length):
            features[f"activity_{i+1}"] = activities.apply(lambda x: x[i])
            features[f"resource_{i+1}"] = resources.apply(lambda x: x[i])
        
        return pd.DataFrame(features, index=X.index)
    
    X_train_new = extract_index_features(X_train)
    X_test_new = extract_index_features(X_test)
    
    # Onehot encode categorical columns
    categorical_cols = [col for col in X_train_new.columns if col != "step"]
    
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    X_train_encoded = encoder.fit_transform(X_train_new[categorical_cols])
    X_test_encoded = encoder.transform(X_test_new[categorical_cols])
    
    X_train_final = np.column_stack([X_train_new[["step"]].values, X_train_encoded])
    X_test_final = np.column_stack([X_test_new[["step"]].values, X_test_encoded])
    
    return X_train_final, X_test_final, "index_encoding"
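
<p>Index encoding pads on the right and truncates from the end, so a prefix longer than <code>max_length</code> loses its most recent events. A small sketch of the padding expression, where <code>first_k</code> is a hypothetical helper name:</p>

```python
def first_k(prefix, max_length=5, pad="None"):
    """Right-pad with `pad` and keep the first max_length items."""
    return (prefix + [pad] * max_length)[:max_length]

short = ["Assign seriousness", "Take in charge ticket"]
long = ["Assign seriousness", "Take in charge ticket", "Wait"]

print(first_k(short, max_length=3))
print(first_k(long, max_length=2))  # "Wait", the most recent event, is cut off
```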

<h5>Inter Case Encoding</h5>

In [12]:
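def apply_inter_case_encoding(X_train, X_test):
    """MultiLabelBinarizer features plus step_load, the number of training
    prefixes that share the same step."""
    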
    step_counts = X_train["step"].value_counts().to_dict()
    
    mlb_activities = MultiLabelBinarizer()
    mlb_resources = MultiLabelBinarizer()
    
    X_train_activities = mlb_activities.fit_transform(X_train["activities"])
    X_test_activities = mlb_activities.transform(X_test["activities"])
    
    X_train_resources = mlb_resources.fit_transform(X_train["resources"])
    X_test_resources = mlb_resources.transform(X_test["resources"])
    
    # Add inter-case features
    X_train_step_load = X_train["step"].map(step_counts).values.reshape(-1, 1)
    X_test_step_load = X_test["step"].map(step_counts).fillna(0).values.reshape(-1, 1)
    
    X_train_final = np.hstack([
        X_train[["step"]].values,
        X_train_step_load,
        X_train_activities, 
        X_train_resources
    ])
    X_test_final = np.hstack([
        X_test[["step"]].values,
        X_test_step_load,
        X_test_activities, 
        X_test_resources
    ])
    
    return X_train_final, X_test_final, "inter_case_encoding"
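
<p>In <code>apply_inter_case_encoding</code>, <code>step_load</code> is the number of training prefixes with the same <code>step</code> value, so it depends on <code>step</code> alone. A feature that reflects other cases running at the same time needs timestamps. The sketch below counts the cases open at each event's timestamp on made-up data. The names <code>open_cases</code> and <code>open_cases_at</code> are hypothetical, and treating the last event of a case as its end is an assumption that only holds for offline analysis.</p>

```python
import pandas as pd

events = pd.DataFrame({
    "case_id": ["A", "A", "B", "B", "C"],
    "timestamp": pd.to_datetime([
        "2012-01-01", "2012-01-05", "2012-01-02", "2012-01-03", "2012-01-04",
    ]),
})

# First and last event time of each case
spans = events.groupby("case_id")["timestamp"].agg(start="min", end="max")

def open_cases_at(ts):
    """Number of cases whose [start, end] interval contains ts."""
    return int(((spans["start"] <= ts) & (spans["end"] >= ts)).sum())

events["open_cases"] = events["timestamp"].apply(open_cases_at)
print(events)
```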

<h5>Aggregation Encoding</h5>

In [13]:
from collections import Counter

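def apply_aggregation_encoding(X_train, X_test):
    """Summarize each prefix with counts: length, distinct activities and
    resources, and maximum frequencies."""
    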
    def compute_aggregation_features(X):
        features = []
        for _, row in X.iterrows():
            activities = row["activities"]
            resources = row["resources"]
            
            features.append({
                "step": row["step"],
                "prefix_length": len(activities),
                "unique_activities": len(set(activities)),
                "unique_resources": len(set(resources)),
                "most_frequent_activity_count": max(Counter(activities).values()) if activities else 0,
                "most_frequent_resource_count": max(Counter(resources).values()) if resources else 0,
                "activity_resource_ratio": len(set(activities)) / len(set(resources)) if len(set(resources)) > 0 else 0
            })
        
        return pd.DataFrame(features)
    
    X_train_new = compute_aggregation_features(X_train)
    X_test_new = compute_aggregation_features(X_test)
    
    return X_train_new.values, X_test_new.values, "aggregation_encoding"

<h5>Embeddings (TF-IDF)</h5>

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

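def apply_embedding_encoding(X_train, X_test):
    """TF-IDF vectors of the activity and resource sequences, each joined
    into a whitespace-separated string."""
    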
    def lists_to_string(series):
        return series.apply(lambda x: " ".join(x))
    
    X_train_activities_str = lists_to_string(X_train["activities"])
    X_test_activities_str = lists_to_string(X_test["activities"])
    X_train_resources_str = lists_to_string(X_train["resources"])
    X_test_resources_str = lists_to_string(X_test["resources"])
    
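    # Fit TF-IDF vectorizers on training data.
    # Note: the default tokenizer lowercases, splits on word boundaries and drops
    # single-character tokens, so "Value 1" contributes only the token "value".
    activity_vectorizer = TfidfVectorizer(max_features = 50)  
    resource_vectorizer = TfidfVectorizer(max_features = 50)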
    
    X_train_activities_tfidf = activity_vectorizer.fit_transform(X_train_activities_str).toarray()
    X_test_activities_tfidf = activity_vectorizer.transform(X_test_activities_str).toarray()
    
    X_train_resources_tfidf = resource_vectorizer.fit_transform(X_train_resources_str).toarray()
    X_test_resources_tfidf = resource_vectorizer.transform(X_test_resources_str).toarray()
    
    # Combine all features
    X_train_final = np.column_stack([
        X_train[["step"]].values,
        X_train_activities_tfidf,
        X_train_resources_tfidf
    ])
    
    X_test_final = np.column_stack([
        X_test[["step"]].values,
        X_test_activities_tfidf,
        X_test_resources_tfidf
    ])
    
    return X_train_final, X_test_final, "embedding_encoding"
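
<p>TF-IDF produces a weighted bag of tokens rather than a learned embedding, and the result depends on how labels are tokenized. The sketch below compares the default tokenization of joined strings with a pass-through analyzer that keeps each label as one token. The resource lists are made up, and the pass-through variant is an alternative for illustration, not what the function above does.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

resources = [["Value 1", "Value 2"], ["Value 2", "Value 12"]]

# Default: join to a string and let the vectorizer tokenize on word boundaries.
# Tokens are lowercased and single-character tokens such as "1" are dropped.
default_vec = TfidfVectorizer()
default_vec.fit([" ".join(r) for r in resources])
print(sorted(default_vec.vocabulary_))

# Alternative: a pass-through analyzer keeps each label as a single token
label_vec = TfidfVectorizer(analyzer=lambda labels: labels)
label_vec.fit(resources)
print(sorted(label_vec.vocabulary_))
```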

<h3>Random Forest Model</h3>

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

results = []

# Parameter grid for Random Forest (you can expand this)
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10]
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2025)

encodings = [
    apply_one_hot_encoding, 
    apply_aggregation_encoding, 
    apply_last_state_encoding,
    apply_index_encoding,
    apply_inter_case_encoding,
    apply_embedding_encoding
]

for encoding in encodings:
    
    X_train_encoded, X_test_encoded, method = encoding(X_train, X_test)
    
    rf = RandomForestClassifier(random_state = 2025)
    grid_search = GridSearchCV(rf, param_grid, cv = 3, scoring = 'f1_weighted', n_jobs = -1)
    grid_search.fit(X_train_encoded, y_train)
    
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test_encoded)
    
    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    print(method)
    print(f"Best Params: {grid_search.best_params_}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print()
    
    results.append({
        'encoding': method,
        'best_params': grid_search.best_params_,
        'model': best_model,
        'accuracy': accuracy,
        'f1_score': f1
    })

one-hot encoding
Best Params: {'max_depth': None, 'min_samples_split': 10, 'n_estimators': 200}
Accuracy: 0.7518
F1-Score: 0.7078

aggregation_encoding
Best Params: {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 50}
Accuracy: 0.7190
F1-Score: 0.6512

last_state_encoding
Best Params: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 100}
Accuracy: 0.7654
F1-Score: 0.7238

index_encoding
Best Params: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 200}
Accuracy: 0.7564
F1-Score: 0.7117

inter_case_encoding
Best Params: {'max_depth': None, 'min_samples_split': 10, 'n_estimators': 200}
Accuracy: 0.7576
F1-Score: 0.7133

embedding_encoding
Best Params: {'max_depth': 20, 'min_samples_split': 10, 'n_estimators': 200}
Accuracy: 0.7473
F1-Score: 0.7015
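
<p>The scores above come from a random split over prefix rows, so prefixes of the same case can land in both the training and the test set, which may make the scores optimistic. One way to check this is to split by case. The sketch below uses <code>GroupShuffleSplit</code> on made-up case IDs. It assumes a <code>case_id</code> array is kept alongside <code>X</code>, which the notebook currently drops, and <code>X_demo</code> is a hypothetical stand-in for the feature table.</p>

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Made-up prefix table: one row per prefix, several rows per case
case_ids = np.array([
    "Case1", "Case1", "Case1", "Case2", "Case2",
    "Case3", "Case4", "Case4", "Case5",
])
X_demo = np.arange(len(case_ids)).reshape(-1, 1)

# Split on cases rather than rows: every case ends up on exactly one side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=2025)
train_idx, test_idx = next(splitter.split(X_demo, groups=case_ids))

train_cases = set(case_ids[train_idx])
test_cases = set(case_ids[test_idx])
print(sorted(train_cases), sorted(test_cases))
```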

