<h1>Encodings for Next Step Activity Prediction</h1>
<br/>
<h4>Lorenzo Manuel Cirac Monteagudo</h4>
<h4>Supervisor: Ana Luisa Oliveira da Nobrega Costa</h4>
<h4>TUM School of Computation, Information and Technology</h4>
<h4>Information Systems Chair</h4>
<br/>

<h3>Information about the Dataset</h3>
<p>MIP dataset: <a href="https://github.com/Sergey-Zeltyn/MIP-dataset/tree/main">https://github.com/Sergey-Zeltyn/MIP-dataset/tree/main</a></p>
<p>The MIP dataset originates from the paper "Prescriptive Process Monitoring in Intelligent Process Automation with Chatbot Orchestration" by Sergey Zeltyn, Segev Shlomov, Avi Yaeli, and Alon Oved, presented at the IJCAI 2022 International Workshop on Process Management in the AI Era (PMAI). The dataset contains event logs from intelligent process automation workflows that involve chatbot orchestration. It is a resource for studying and developing AI-driven prescriptive monitoring techniques within business process management.</p>

<h3>Data Preprocessing</h3>

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("data/MIP/mip.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49604 entries, 0 to 49603
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_id             49604 non-null  int64  
 1   session_id          49604 non-null  object 
 2   num_session         49604 non-null  int64  
 3   role                49604 non-null  object 
 4   user_id             49604 non-null  object 
 5   timestamp           49604 non-null  object 
 6   turn                49604 non-null  int64  
 7   activity            49604 non-null  object 
 8   user_utterance      44166 non-null  object 
 9   chatbot_response    49604 non-null  object 
 10  intent              32226 non-null  object 
 11  intent_confidence   49604 non-null  float64
 12  entity              3838 non-null   object 
 13  entity_confidence   3838 non-null   float64
 14  score               49604 non-null  float64
 15  expecting_response  49604 non-null  bool   
dtypes: b

In [3]:
df.head()

Unnamed: 0,case_id,session_id,num_session,role,user_id,timestamp,turn,activity,user_utterance,chatbot_response,intent,intent_confidence,entity,entity_confidence,score,expecting_response
0,1,M7vkTk2f537I,1,team leader,Robert North,2022-03-07T16:26:43.584,1,welcome,,Welcome message,,0.0,,,0.0,False
1,1,M7vkTk2f537I,1,team leader,Robert North,2022-03-07T16:28:05.194,2,report_yearly_assessments,show yearly assessments,Yearly assessment report,report_yearly_assessments,0.783387,,,0.747725,False
2,1,M7vkTk2f537I,1,team leader,Robert North,2022-03-07T16:28:37.723,3,disambiguation2,view project table,Do you wish to view Project assessments report...,,0.989976,,,0.922247,False
3,1,M7vkTk2f537I,1,team leader,Robert North,2022-03-07T16:29:24.237,4,report_project_assessments,project assessments report,Project assessments report,report_project_assessments,0.725236,,,0.68899,False
4,1,M7vkTk2f537I,1,team leader,Robert North,2022-03-07T16:30:14.552,5,report_learning_activities,view learning activities summary,Learning activities report,report_learning_activities,0.837968,,,0.879382,False


<p>The dataset has no label column. For next-step activity prediction, we create the label <code>next_activity</code> ourselves: we sort the events by case and timestamp, group them by <code>case_id</code>, and shift the <code>activity</code> column up by one row within each group, so that every event is paired with the activity that follows it. The last event of each case has no successor, so its label is missing and we drop that row.</p>

In [4]:
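# Order events chronologically within each case. As long as the ISO 8601
# timestamp strings share one format, string order equals chronological order.
df = df.sort_values(by=["case_id", "timestamp"])

# Pair each event with the activity that follows it in the same case,
# then drop the last event of each case, which has no successor.
df["next_activity"] = df.groupby("case_id")["activity"].shift(-1)
df = df.dropna(subset=["next_activity"])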

In [5]:
df.head()

Unnamed: 0,case_id,session_id,num_session,role,user_id,timestamp,turn,activity,user_utterance,chatbot_response,intent,intent_confidence,entity,entity_confidence,score,expecting_response,next_activity
0,1,M7vkTk2f537I,1,team leader,Robert North,2022-03-07T16:26:43.584,1,welcome,,Welcome message,,0.0,,,0.0,False,report_yearly_assessments
1,1,M7vkTk2f537I,1,team leader,Robert North,2022-03-07T16:28:05.194,2,report_yearly_assessments,show yearly assessments,Yearly assessment report,report_yearly_assessments,0.783387,,,0.747725,False,disambiguation2
2,1,M7vkTk2f537I,1,team leader,Robert North,2022-03-07T16:28:37.723,3,disambiguation2,view project table,Do you wish to view Project assessments report...,,0.989976,,,0.922247,False,report_project_assessments
3,1,M7vkTk2f537I,1,team leader,Robert North,2022-03-07T16:29:24.237,4,report_project_assessments,project assessments report,Project assessments report,report_project_assessments,0.725236,,,0.68899,False,report_learning_activities
4,1,M7vkTk2f537I,1,team leader,Robert North,2022-03-07T16:30:14.552,5,report_learning_activities,view learning activities summary,Learning activities report,report_learning_activities,0.837968,,,0.879382,False,fallback
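
<p>The labeling step can be checked on a small example. The sketch below is not part of the original notebook: the toy log and the helper <code>add_next_activity</code> are hypothetical, and parsing the timestamps with <code>pd.to_datetime</code> before sorting is an added assumption (the notebook sorts the raw strings). It shows that labels never cross case boundaries and that exactly one row per case is dropped.</p>

```python
import pandas as pd

# Hypothetical toy log: two cases, with events deliberately out of order.
toy = pd.DataFrame({
    "case_id": [1, 2, 1, 1, 2],
    "timestamp": [
        "2022-03-07T16:28:05.194",
        "2022-03-08T09:00:00.000",
        "2022-03-07T16:26:43.584",
        "2022-03-07T16:28:37.723",
        "2022-03-08T09:01:30.000",
    ],
    "activity": [
        "report_yearly_assessments",
        "welcome",
        "welcome",
        "disambiguation2",
        "fallback",
    ],
})


def add_next_activity(log: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `log` with a next_activity label.

    The last event of each case has no successor and is dropped.
    """
    out = log.copy()
    # Assumption: parse timestamps so the sort is chronological
    # even if the string formats were to differ.
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    out = out.sort_values(["case_id", "timestamp"])
    # shift(-1) within each case pairs an event with its successor.
    out["next_activity"] = out.groupby("case_id")["activity"].shift(-1)
    return out.dropna(subset=["next_activity"]).reset_index(drop=True)


labeled = add_next_activity(toy)
print(labeled[["case_id", "activity", "next_activity"]])
```

<p>On the real data, the same check applies: the number of labeled rows should equal the number of events minus the number of distinct <code>case_id</code> values.</p>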
