# Simple Log Analysis with Declare4Py

This tutorial explains how to perform simple log analysis with Declare4Py.

After importing the Declare4Py package and specifying the path of the log, instantiate a `Declare4Py` object.

In [1]:
import sys
sys.path.append("..")
import os
from src.api.declare4py import Declare4Py


log_path = os.path.join("..", "test", "Sepsis Cases.xes.gz")

d4py = Declare4Py()

The next step is to parse the log with the `parse_xes_log` function. Logs can be passed in either the `.xes` or the `.xes.gz` format.

In [2]:
d4py.parse_xes_log(log_path)

parsing log, completed traces ::   0%|          | 0/1050 [00:00<?, ?it/s]

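Declare4Py offers several facilities for simple log indexing and analysis: the number of cases, the case ids, and the alphabets of activities and resources that occur in the log.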

In [3]:
# Return the number of cases in the log
print(f"Number of cases: {d4py.get_log_length()}")
print("--------------------------------------")

# Return the ids of the cases in the log
print(f"Cases ids:\n{d4py.get_trace_keys()}")
print("--------------------------------------")

# Return the names of the activities in the log
print(f"Activity alphabet:\n{d4py.get_log_alphabet_activities()}")
print("--------------------------------------")

# Return the names of the resources in the log
print(f"Resource alphabet:\n{d4py.get_log_alphabet_payload()}")
print("--------------------------------------")

Number of cases: 1050
--------------------------------------
Cases ids:
[(0, 'A'), (1, 'B'), (2, 'C'), (3, 'D'), (4, 'E'), (5, 'F'), (6, 'G'), (7, 'H'), (8, 'I'), (9, 'J'), (10, 'K'), (11, 'L'), (12, 'M'), (13, 'N'), (14, 'O'), (15, 'P'), (16, 'Q'), (17, 'R'), (18, 'S'), (19, 'T'), (20, 'U'), (21, 'V'), (22, 'W'), (23, 'X'), (24, 'Y'), (25, 'Z'), (26, 'AA'), (27, 'BA'), (28, 'CA'), (29, 'DA'), (30, 'EA'), (31, 'FA'), (32, 'GA'), (33, 'HA'), (34, 'IA'), (35, 'JA'), (36, 'KA'), (37, 'LA'), (38, 'MA'), (39, 'NA'), (40, 'OA'), (41, 'PA'), (42, 'QA'), (43, 'RA'), (44, 'SA'), (45, 'TA'), (46, 'UA'), (47, 'VA'), (48, 'WA'), (49, 'XA'), (50, 'YA'), (51, 'ZA'), (52, 'AB'), (53, 'BB'), (54, 'CB'), (55, 'DB'), (56, 'EB'), (57, 'FB'), (58, 'GB'), (59, 'HB'), (60, 'IB'), (61, 'JB'), (62, 'KB'), (63, 'LB'), (64, 'MB'), (65, 'NB'), (66, 'OB'), (67, 'PB'), (68, 'QB'), (69, 'RB'), (70, 'SB'), (71, 'TB'), (72, 'UB'), (73, 'VB'), (74, 'WB'), (75, 'XB'), (76, 'YB'), (77, 'ZB'), (78, 'AC'), (79, 'BC'), (80

A log is a complex data structure that can be explored along several dimensions. The functions `activities_log_projection` and `resources_log_projection` project the cases in the log according to the activities and resources dimensions, respectively. Each projection is a list (the log) of lists (the single cases) containing the name of the activity/resource.

In [4]:
# Activity projection
for idx, trace in enumerate(d4py.activities_log_projection()):
    print(f"{idx}- {trace}")
print("--------------------------------------")

# Resource projection
for idx, trace in enumerate(d4py.resources_log_projection()):
    print(f"{idx}- {trace}")
print("--------------------------------------")

0- ['ER Registration', 'Leucocytes', 'CRP', 'LacticAcid', 'ER Triage', 'ER Sepsis Triage', 'IV Liquid', 'IV Antibiotics', 'Admission NC', 'CRP', 'Leucocytes', 'Leucocytes', 'CRP', 'Leucocytes', 'CRP', 'CRP', 'Leucocytes', 'Leucocytes', 'CRP', 'CRP', 'Leucocytes', 'Release A']
1- ['ER Registration', 'ER Triage', 'CRP', 'LacticAcid', 'Leucocytes', 'ER Sepsis Triage', 'IV Liquid', 'IV Antibiotics', 'Admission NC', 'CRP', 'CRP', 'Release A']
2- ['ER Registration', 'ER Triage', 'ER Sepsis Triage', 'Leucocytes', 'CRP', 'IV Liquid', 'IV Antibiotics', 'Admission NC', 'Admission NC', 'Leucocytes', 'CRP', 'Leucocytes', 'CRP', 'Release A']
3- ['ER Registration', 'ER Triage', 'ER Sepsis Triage', 'CRP', 'LacticAcid', 'Leucocytes', 'IV Liquid', 'IV Antibiotics', 'Admission NC', 'Leucocytes', 'CRP', 'Release A', 'Return ER']
4- ['ER Registration', 'ER Triage', 'ER Sepsis Triage', 'IV Liquid', 'CRP', 'Leucocytes', 'LacticAcid', 'IV Antibiotics']
5- ['ER Registration', 'ER Triage', 'ER Sepsis Triage', 
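
Because a projection is a plain list of lists, standard Python tools can be applied to it directly. The following is a minimal sketch that uses a small hand-written projection (hypothetical data, not taken from the Sepsis log) as a stand-in for the result of `d4py.activities_log_projection()`. It counts how often each activity occurs and how many events each case contains.

```python
from collections import Counter

# Hypothetical stand-in for d4py.activities_log_projection():
# a list (the log) of lists (the cases) of activity names.
projection = [
    ["ER Registration", "ER Triage", "CRP", "CRP"],
    ["ER Registration", "ER Triage", "Leucocytes"],
    ["ER Registration", "CRP"],
]

# Total number of occurrences of each activity across all cases
activity_counts = Counter(act for trace in projection for act in trace)

# Number of events in each case
trace_lengths = [len(trace) for trace in projection]

print(activity_counts)
print(trace_lengths)
```

The same pattern should apply to the resource projection, since it has the same list-of-lists shape.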

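A useful utility for logs is the one-hot encoding along the `act` or `payload` dimension. Each row corresponds to a case and each column to an activity (or resource); a cell is `True` if that activity (or resource) occurs in the case. These encodings can be useful for statistical analysis or Machine Learning tasks. The returned data type is a pandas `DataFrame`.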

In [5]:
# One hot encoding for activities
d4py.log_encoding(dimension='act')

Unnamed: 0,Admission IC,Admission NC,CRP,ER Registration,ER Sepsis Triage,ER Triage,IV Antibiotics,IV Liquid,LacticAcid,Leucocytes,Release A,Release B,Release C,Release D,Release E,Return ER
0,False,True,True,True,True,True,True,True,True,True,True,False,False,False,False,False
1,False,True,True,True,True,True,True,True,True,True,True,False,False,False,False,False
2,False,True,True,True,True,True,True,True,False,True,True,False,False,False,False,False
3,False,True,True,True,True,True,True,True,True,True,True,False,False,False,False,True
4,False,False,True,True,True,True,True,True,True,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1045,True,True,True,True,True,True,True,False,True,True,True,False,False,False,False,False
1046,False,False,False,True,True,True,False,False,False,False,False,False,False,False,False,False
1047,False,False,False,True,True,True,False,False,False,False,False,False,False,False,False,False
1048,False,True,True,True,True,True,True,True,True,True,True,False,False,False,False,False


In [6]:
# One hot encoding for payload
d4py.log_encoding(dimension='payload')

Unnamed: 0,?,A,B,C,D,E,F,G,H,I,...,P,Q,R,S,T,U,V,W,X,Y
0,False,True,True,True,True,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,True,True,True,False,True,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,True,True,True,False,True,False,True,True,False,...,False,False,False,False,False,False,False,False,False,False
3,True,True,True,True,False,True,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,True,True,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1045,False,True,True,True,False,True,True,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1046,False,True,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1047,False,True,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1048,False,True,True,True,False,True,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
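
To make the meaning of the encoding concrete, here is a minimal pure-Python sketch that builds the same kind of table from a projection. The input data is hypothetical, and this is only an illustration of the idea, not necessarily how `log_encoding` is implemented inside Declare4Py.

```python
# Hypothetical projection: one list of activity names per case
projection = [
    ["ER Registration", "ER Triage", "CRP", "CRP"],
    ["ER Registration", "ER Triage", "Leucocytes"],
    ["ER Registration", "CRP"],
]

# Columns of the encoding: every distinct activity, sorted by name
alphabet = sorted({act for trace in projection for act in trace})

# One row per case: True if the activity occurs at least once in the case
encoded = []
for trace in projection:
    present = set(trace)
    encoded.append({act: act in present for act in alphabet})

for idx, row in enumerate(encoded):
    print(idx, row)
```

Repeated occurrences of an activity within a case (such as `CRP` in the first case) collapse to a single `True`, so the encoding records presence, not frequency.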


## Frequent Itemsets

Declare4Py supports computing the frequent itemsets of activities/resources in the log. The function `compute_frequent_itemsets` takes three arguments: `min_support`, the minimum support of the itemsets; `algorithm`, the algorithm used for the computation (`fpgrowth` or `apriori`); and `len_itemset`, the maximum length of the itemsets (the default is `None`). The result is stored in the `frequent_item_sets` attribute.

In [7]:
d4py.compute_frequent_itemsets(min_support=0.8, algorithm='fpgrowth', len_itemset=3)
d4py.frequent_item_sets

Unnamed: 0,support,itemsets,length
0,1.0,(ER Triage),1
1,1.0,(ER Registration),1
2,0.999048,(ER Sepsis Triage),1
3,0.96381,(Leucocytes),1
4,0.959048,(CRP),1
5,0.819048,(LacticAcid),1
6,1.0,"(ER Registration, ER Triage)",2
7,0.999048,"(ER Sepsis Triage, ER Registration)",2
8,0.999048,"(ER Sepsis Triage, ER Triage)",2
9,0.999048,"(ER Sepsis Triage, ER Registration, ER Triage)",3
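
The support of an itemset is the fraction of cases that contain all of its items. The sketch below illustrates the roles of `min_support` and `len_itemset` with a naive enumeration of all candidate itemsets over hypothetical data. It is not the `fpgrowth` or `apriori` algorithm and not Declare4Py's implementation; the function name `naive_frequent_itemsets` is an assumption introduced for this example, and treating `len_itemset=None` as "no length limit" is likewise an assumption.

```python
from itertools import combinations


def naive_frequent_itemsets(traces, min_support, len_itemset=None):
    """Return {itemset: support} for every itemset with support >= min_support.

    Brute-force enumeration, suitable only for small alphabets.
    """
    # Each case becomes a set: only presence of an activity matters
    transactions = [set(trace) for trace in traces]
    alphabet = sorted(set().union(*transactions))
    # Assumption: None means "no limit on the itemset length"
    max_len = len(alphabet) if len_itemset is None else len_itemset

    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(alphabet, size):
            items = set(candidate)
            # Fraction of cases that contain every item of the candidate
            hits = sum(items <= t for t in transactions)
            support = hits / len(transactions)
            if support >= min_support:
                result[candidate] = support
    return result


# Hypothetical projection with four cases
traces = [
    ["ER Registration", "ER Triage", "CRP"],
    ["ER Registration", "ER Triage"],
    ["ER Registration", "CRP"],
    ["ER Registration"],
]

itemsets = naive_frequent_itemsets(traces, min_support=0.5, len_itemset=2)
for itemset, support in itemsets.items():
    print(support, itemset)
```

Raising `min_support` prunes rarer itemsets, while lowering `len_itemset` caps how many items may appear together in one itemset.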
