# Natural Language Processing of Logs



#### 1.- Reading and writing logs

#### 2.- NLP Techniques applied to logs

    2.1.- Sentence detection

    2.2.- Part-Of-Speech Tagging

    2.3.- Named Entities Recognition

    2.4.- Acronym detection

    2.5.- Dependency Parser

    2.6.- Lemmatization

#### 3.- Clustering

    3.1.- DBSCAN algorithm
    
    3.2.- Computation of clusters
    
    3.3.- Evaluation of clusters

#### 4.- Text Quality metrics
#### 5.- Creation of the new log and re-labelling of events
#### 6.- Data Log Quality Metrics Calculation
#### 7.- Process Quality Metrics Calculation

## 1.- Reading and writing CSV logs

In [1]:
import csv

LOG_FILE_PATH="./data/original_log.csv"

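def read_csv_log(file_path):
    """Read a CSV log and return its rows as tuples, skipping the header row."""
    with open(file_path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # the header is consumed here; it only tells us how many columns to expect
        if len(header) == 3:
            return [(inc_code, inc_type, description) for inc_code, inc_type, description in reader]
        else:
            return [(inc_code, description) for inc_code, description in reader]

    
def write_csv_log(log, file_path):
    """Write every row of the log to a CSV file."""
    with open(file_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        for line in log:
            writer.writerow(line)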


In [2]:
incidences = read_csv_log(LOG_FILE_PATH)

print("Number of incidences: ", len(incidences))

Number of incidences:  4416


In [3]:
from collections import Counter

type_counter = Counter([inc_type for _, inc_type, _ in incidences])

print("Number of incidences by type:")
print("\tTYPE \t INCIDENCES")
for itype, count in type_counter.items():
    print("\t %s \t   %d" %(itype, count))



Number of incidences by type:
	TYPE 	 INCIDENCES
	 TT 	   1775
	 NHA 	   1287
	 HEQ 	   185
	 PIS 	   40
	 NHM 	   212
	 HMC 	   33
	 PIO 	   383
	 HEL 	   196
	 NHP 	   33
	 NHT 	   272


## 2.- NLP Techniques applied to logs

Loading NLP resources

In [4]:
import spacy
from spacy.lang.es.stop_words import STOP_WORDS

nlp = spacy.load('es_core_news_md')  # Language model for Spanish

stopwords = ["warning", "warning:"]  # words that occur in the logs and do not provide any information

docs = {code:nlp(text) for code, _, text in incidences}
tokens = {code: [t for t in doc] for code, doc in docs.items()}

# Pre-process incidences
raw_texts = [str(token) for _, token in tokens.items()]  # without NLP filters
codes = [code for code, _ in tokens.items()]

### 2.1.- Sentence Detection

The trained language models provided by spaCy cover everything this approach needs, including a simple but useful sentence detector.

In [5]:
sentences = {code:list(doc.sents) for code, doc in docs.items()}

print(list(sentences.keys())[:5])
print(list(sentences.values())[:5])

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']


### 2.2.- POS Tagging

Each log entry is filtered to keep only the words whose part-of-speech tag belongs to a chosen set of morphosyntactic categories.

In [6]:
TAGS = {"NOUN", "VERB", "ADJ", "PROPN"}    # we want to keep those words classified with these tags

postags = {code:[str(token) for token in tks if token.pos_ in TAGS] for code, tks in tokens.items()}

print(list(postags.keys())[:5])
print(list(postags.values())[:5])

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']


### 2.3.- Named Entities Recognition (NER)

Named entities are words or groups of words that refer to an organization, a specific person, a location, etc.

In [7]:
# doc.ents holds the entity spans recognised by the language model
entities = {code: [ent.text for ent in doc.ents] for code, doc in docs.items()}

print(list(entities.keys())[:5])
print(list(entities.values())[:5])

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']


### 2.4.- Acronym Detection

A simple rule-based approach: a word is considered an **acronym** if it is written in uppercase, is longer than one character, and its lowercase form does not appear in the vocabulary of the target language.

In [8]:
def detect_acronyms(inc_tokens):
    return [str(token) for token in set(inc_tokens) if str(token).isupper() and str(token).lower() not in nlp.vocab and len(str(token)) > 1]

In [9]:
acronyms = {code:detect_acronyms(tks) for code, tks in tokens.items()}

print(list(acronyms.keys())[:5])
print(list(acronyms.values())[:5])

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']
[[], [], [], ['SWLP', 'LMCP', 'FWS', 'LMWS'], []]
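
The rule can also be illustrated without a language model. The following is a minimal sketch, assuming a plain Python set as the vocabulary; `KNOWN_WORDS`, `detect_acronyms_simple` and the sample words are hypothetical and only stand in for `nlp.vocab` and the real log tokens.

```python
# Hypothetical stand-in for the language-model vocabulary (assumption: a set of lowercase words).
KNOWN_WORDS = {"se", "aborta", "avion", "masa", "fallo", "por"}

def detect_acronyms_simple(words, vocabulary=KNOWN_WORDS):
    """Return uppercase words longer than one character whose lowercase form is not in the vocabulary."""
    found = []
    for word in words:
        if word.isupper() and len(word) > 1 and word.lower() not in vocabulary:
            if word not in found:  # keep the first occurrence only, preserving order
                found.append(word)
    return found

sample = ["SE", "aborta", "FWS", "por", "fallo", "LMWS", "FWS", "A"]
print(detect_acronyms_simple(sample))  # ['FWS', 'LMWS']
```

"SE" is discarded because its lowercase form is a known word, and "A" because it has a single character.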


### 2.5.- Dependency Parser

The words in each log entry are filtered, keeping only those with a specific syntactic function in the text (_subject, direct object, root,_ etc.) together with the head word they depend on.

In [10]:
DEPENDENCIES = {"nsubj", "obj"}     # Set of dependencies that we want to keep. 
                                    # We could be (a lot) more restrictive with: DEPENDENCIES = {"ROOT"}

def dependencies(inc_tokens, dependencies=DEPENDENCIES):
    valids = set()
    for token in inc_tokens:
        if token.dep_ in dependencies:
            valids.add(str(token))              # Token with the required dependency
            valids.add(str(token.head))         # Token that is the origin of this dependency
    
    return [str(token) for token in inc_tokens if str(token) in valids]

In [11]:
deps = {code:dependencies(tks) for code, tks in tokens.items()}

print(list(deps.keys())[:5])
print(list(deps.values())[:5])

deps_root = {code:dependencies(tks, {"ROOT"}) for code, tks in tokens.items()}

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']
[['Se', 'aborta'], ['tenemos', 'masa'], ['SE', 'aborta'], ['se', 'comprueba', 'funcionalidad', 'asi', 'indicación'], ['se', 'aborta', 'configurar', 'avion']]


### 2.6.- Lemmatization

Logs in which every word is replaced by its lemma.

In [12]:
lemmas = {code:[token.lemma_ for token in tks] for code, tks in tokens.items()}

print(list(lemmas.keys())[:5])
print(list(lemmas.values())[:5])

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']


## 3.- Clustering

### 3.1.- DBSCAN algorithm 

In [13]:
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_incidences(texts, inc_codes=codes):
    # Clustering model
    model = DBSCAN(n_jobs=2)

    # Vectorization of the texts
    vectorizer = TfidfVectorizer(stop_words=list(STOP_WORDS), max_df=0.95, min_df=1, lowercase=True)
    vec_model = vectorizer.fit_transform(texts)

    # compute clusters
    topics = model.fit_predict(vec_model)

    return {code: topic for code, topic in zip(inc_codes, topics)}, vec_model

### 3.2.- Computation of clusters for the different NLP pre-processed texts

In [14]:
# Raw texts
raw_text_clusters, raw_texts_vecmodel = cluster_incidences(raw_texts)

# POS-Tagging
pos_texts = [" ".join(t) for _,t in postags.items()]
pos_clustering, pos_vecmodel = cluster_incidences(pos_texts)

# Dependencies: nsubj and obj
deps_texts = [" ".join(t) for _,t in deps.items()]
deps_clustering, deps_vecmodel = cluster_incidences(deps_texts)

# Dependencies: ROOT
root_texts = [" ".join(t) for _,t in deps_root.items()]
root_clustering, root_vecmodel = cluster_incidences(root_texts)

### 3.3.- Evaluation of clusters with silhouette metric

In [15]:
from sklearn.metrics import silhouette_score

def silhouette_computation(vec_model, topics):
    return silhouette_score(vec_model, topics)

In [16]:
# Raw texts
silhouette_raw = silhouette_score(raw_texts_vecmodel, list(raw_text_clusters.values()))
print("* Raw texts => Silhouette metric: %f" % silhouette_raw)

* Raw texts => Silhouette metric: -0.038807


In [17]:
# POS-Tagging: NOUN, VERB, PROPN, ADJ
silhouette_pos = silhouette_score(pos_vecmodel, list(pos_clustering.values()))
print("* POS tagged texts => Silhouette metric: %f" % silhouette_pos)

* POS tagged texts => Silhouette metric: -0.019892


In [18]:
# Dependency parser: nsubj, obj
silhouette_deps = silhouette_score(deps_vecmodel, list(deps_clustering.values()))
print("* Dependency parsed texts => Silhouette metric: %f" % silhouette_deps)

* Dependency parsed texts => Silhouette metric: 0.238413


In [19]:
# Dependency parser: ROOT
silhouette_root = silhouette_score(root_vecmodel, list(root_clustering.values()))
print("* Dependency parsed texts (ROOT) => Silhouette metric: %f" % silhouette_root)

* Dependency parsed texts (ROOT) => Silhouette metric: 0.685790
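
To make the scores above easier to read, here is a minimal pure-Python sketch of the silhouette coefficient on one-dimensional toy points. It is only an illustration: the `silhouette` helper and the data are hypothetical, and the notebook itself relies on `sklearn.metrics.silhouette_score` over TF-IDF vectors.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points, using absolute distance."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster
        same = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels)) if l == label and j != i]
        if not same:  # singleton cluster: the coefficient is defined as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [0.0, 1.0, 10.0, 11.0]
good = silhouette(points, [0, 0, 1, 1])   # compact, well separated clusters
mixed = silhouette(points, [0, 1, 0, 1])  # each cluster mixes distant points
print(round(good, 4), round(mixed, 4))    # 0.8997 -0.45
```

Values close to 1 indicate compact, well separated clusters, while values around or below 0 indicate overlapping clusters, which is how the negative scores for the raw and POS-tagged texts should be read.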


## 4.- Text Quality Metrics

Set of metrics computed in order to assess the quality of the data logs.

In [20]:
# Reading data log files
log_dep_lemmas_root = read_csv_log('./data/log_sentences_dep_lemmas_root.csv')
text_dep_lemmas_root = ["".join(t) for _,t in log_dep_lemmas_root]

log_dep_lemmas = read_csv_log('./data/log_sentences_dependencies_lemmas.csv')
text_dep_lemmas = ["".join(t) for _,t in log_dep_lemmas]

log_pos_ner_acronyms_lemmas = read_csv_log('./data/log_sentences_pos_ner_acronyms_lemma.csv')
text_pos_ner_acronyms_lemmas = ["".join(t) for _,t in log_pos_ner_acronyms_lemmas]

log_pos_ner_acronyms = read_csv_log('./data/log_sentences_pos_ner_acronyms.csv')
text_pos_ner_acronyms = ["".join(t) for _,t in log_pos_ner_acronyms]

log_pos_ner_acronyms_v2 = read_csv_log('./data/log_sentences_pos_ner_acronyms_v2.csv')
text_pos_ner_acronyms_v2 = ["".join(t) for _,t in log_pos_ner_acronyms_v2]

log_pos_VERB_ner_acronyms_lemmas = read_csv_log('./data/log_sentences_pos_VERB_ner_acronyms_lemma.csv')
text_pos_VERB_ner_acronyms_lemmas = ["".join(t) for _,t in log_pos_VERB_ner_acronyms_lemmas]

In [21]:
# Computing clusters with DBSCAN for each data log:

inc_codes = [code for code,_ in log_dep_lemmas_root]
cluster_dep_lemmas_root, vecmodel_dep_lemmas_root = cluster_incidences(text_dep_lemmas_root, inc_codes = inc_codes)

inc_codes = [code for code,_ in log_dep_lemmas]
cluster_dep_lemmas, vecmodel_dep_lemmas = cluster_incidences(text_dep_lemmas, inc_codes = inc_codes)

inc_codes = [code for code,_ in log_pos_ner_acronyms_lemmas]
cluster_pos_ner_acronyms_lemmas, vecmodel_pos_ner_acronyms_lemmas = cluster_incidences(text_pos_ner_acronyms_lemmas, inc_codes = inc_codes)

inc_codes = [code for code,_ in log_pos_ner_acronyms]
cluster_pos_ner_acronyms, vecmodel_pos_ner_acronyms = cluster_incidences(text_pos_ner_acronyms, inc_codes = inc_codes)

inc_codes = [code for code,_ in log_pos_ner_acronyms_v2]
cluster_pos_ner_acronyms_v2, vecmodel_pos_ner_acronyms_v2 = cluster_incidences(text_pos_ner_acronyms_v2, inc_codes = inc_codes)

inc_codes = [code for code,_ in log_pos_VERB_ner_acronyms_lemmas]
cluster_pos_VERB_ner_acronyms_lemmas, vecmodel_pos_VERB_ner_acronyms_lemmas = cluster_incidences(text_pos_VERB_ner_acronyms_lemmas, inc_codes = inc_codes)

In [25]:
# Silhouette metric for each cluster:

silhouette_dep_lemmas_root = silhouette_score(vecmodel_dep_lemmas_root, list(cluster_dep_lemmas_root.values()))
print("* Dependency parsed texts (ROOT)=> Silhouette metric: %f" % silhouette_dep_lemmas_root)

silhouette_dep_lemmas = silhouette_score(vecmodel_dep_lemmas, list(cluster_dep_lemmas.values()))
print("* Dependency parsed texts => Silhouette metric: %f" % silhouette_dep_lemmas)

silhouette_pos_ner_acronyms_lemmas = silhouette_score(vecmodel_pos_ner_acronyms_lemmas, list(cluster_pos_ner_acronyms_lemmas.values()))
print("* POS (NOUN, VERB, ADJ) + NER + Acronyms + lemmas => Silhouette metric: %f" % silhouette_pos_ner_acronyms_lemmas)

silhouette_pos_ner_acronyms = silhouette_score(vecmodel_pos_ner_acronyms, list(cluster_pos_ner_acronyms.values()))
print("* POS (NOUN, VERB, ADJ) + NER + Acronyms => Silhouette metric: %f" % silhouette_pos_ner_acronyms)

silhouette_pos_ner_acronyms_v2 = silhouette_score(vecmodel_pos_ner_acronyms_v2, list(cluster_pos_ner_acronyms_v2.values()))
print("* POS (NOUN, VERB, ADJ, NPROP)+ NER + Acronyms => Silhouette metric: %f" % silhouette_pos_ner_acronyms_v2)

silhouette_pos_VERB_ner_acronyms_lemmas = silhouette_score(vecmodel_pos_VERB_ner_acronyms_lemmas, list(cluster_pos_VERB_ner_acronyms_lemmas.values()))
print("* POS (VERB) + NER + Acronyms + lemmas => Silhouette metric: %f" % silhouette_pos_VERB_ner_acronyms_lemmas)    

# Write the clustering of dep_lemmas_root to a file
with open("./data/clustering_log5_dep_lemmas_root.csv", "w", encoding="utf-8") as f:
    f.write("INCIDENCECODE,CLUSTER\n")  # the header needs its own line
    for code, cluster in cluster_dep_lemmas_root.items():
        f.write("%s,%s\n" % (code, cluster))

* Dependency parsed texts (ROOT)=> Silhouette metric: 0.765413
* Dependency parsed texts => Silhouette metric: 0.082992
* POS (NOUN, VERB, ADJ) + NER + Acronyms + lemmas => Silhouette metric: 0.001605
* POS (NOUN, VERB, ADJ) + NER + Acronyms => Silhouette metric: 0.027595
* POS (NOUN, VERB, ADJ, NPROP)+ NER + Acronyms => Silhouette metric: -0.009782
* POS (VERB) + NER + Acronyms + lemmas => Silhouette metric: 0.131110


## 5.- Creation of the new log and re-labelling of events.

The ***log_sentences_dep_lemmas_root*** log contains the incident identifiers and their processed descriptions. The ***clustering_log5_dep_lemmas_root*** log is the result of clustering that log, and contains each incident identifier together with the cluster number assigned to it.

Both logs must be joined so that the three relevant fields are held in the same in-memory structure: *incident identifier, processed description and assigned cluster.*



In [None]:
import pandas as pd

df_sentences_dep_lemmas_root = pd.read_csv('./data/log_sentences_dep_lemmas_root.csv')
df_clustering_dbscan_sentences_dep_lemmas = pd.read_csv('./data/clustering_log5_dep_lemmas_root.csv')
df_clustering_dbscan_sentences_dep_lemmas.columns = ['INCIDENCECODE', 'CLUSTER']
# Left join: every clustered incident is kept, matched by INCIDENCECODE
df_merged_sent_dep_lemmas = pd.merge(df_clustering_dbscan_sentences_dep_lemmas, df_sentences_dep_lemmas_root, how='left', left_on=['INCIDENCECODE'], right_on=['INCIDENCECODE'])
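
The semantics of this left join can be sketched with plain Python structures. This is only an illustration under assumed data: the codes, descriptions and the `left_join` helper are hypothetical.

```python
# Hypothetical rows; the real codes and descriptions come from the CSV files.
descriptions = {"1001": "abortar", "1002": "comprobar", "1003": "configurar"}
clusters = [("1001", 0), ("1002", 1), ("1003", 0), ("1004", -1)]

def left_join(cluster_rows, description_by_code):
    """Keep every clustered row; DESCRIPTION is None when the code has no match."""
    return [
        {"INCIDENCECODE": code, "CLUSTER": cluster, "DESCRIPTION": description_by_code.get(code)}
        for code, cluster in cluster_rows
    ]

merged = left_join(clusters, descriptions)
for row in merged:
    print(row)
```

Incident "1004" is kept even though it has no description, which is the behaviour a left join gives for unmatched rows.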

In [None]:
#Extracted from: https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.XoRkFXLtaUk

def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]
 
    score_vals = []
    feature_vals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
 
    #create a tuples of feature,score
    #results = zip(feature_vals,score_vals)
    results= {}
    for idx in range(len(feature_vals)):
        if feature_vals[idx] not in results.keys():
            if score_vals[idx] == max(score_vals):
                results[feature_vals[idx]] = {'max_punctuation':score_vals[idx]}
            else:
                results[feature_vals[idx]]=score_vals[idx]
    
    return results

The next step is to compute the new label for each cluster. To do this, the most representative words of each cluster are obtained with TF-IDF, and the top-scoring keyword becomes the label.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

cv = CountVectorizer(max_df=0.8, ngram_range=(1,3))
X = cv.fit_transform(df_merged_sent_dep_lemmas['DESCRIPTION'])
#If errors of format
#X = cv.fit_transform(df_merged_sent_dep_lemmas['DESCRIPTION'].values.astype('U'))
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(X)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
feature_names = cv.get_feature_names_out()

different_clusters_indexes = list(df_merged_sent_dep_lemmas['CLUSTER'].unique())
keyword_cluster_relabelling = dict()

for dci in different_clusters_indexes:
    df_filtered = df_merged_sent_dep_lemmas[df_merged_sent_dep_lemmas['CLUSTER'] == dci]
    # named cluster_docs so that the spaCy `docs` dictionary is not overwritten
    cluster_docs = list(df_filtered['DESCRIPTION'])
    #If errors of format:
    #cluster_docs = list(df_filtered['DESCRIPTION'].values.astype('U'))
    tfidf_vector_cluster = tfidf_transformer.transform(cv.transform(cluster_docs))
    sorted_items_cluster = sort_coo(tfidf_vector_cluster.tocoo())
    keywords_cluster = extract_topn_from_vector(feature_names, sorted_items_cluster, 10)
    # extract_topn_from_vector stores the top-scoring keyword(s) as a dict
    most_relevant_to_return = [k for k,v in keywords_cluster.items() if isinstance(v, dict)]
    keyword_cluster_relabelling[dci] = most_relevant_to_return[0]
    
keyword_cluster_relabelling
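
The idea of picking one keyword per cluster can be sketched in pure Python. This is a simplified illustration, not the notebook's pipeline: it uses single words only, applies the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and skips the L2 normalisation that `TfidfTransformer` applies by default. The cluster contents and helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical processed descriptions grouped by cluster id.
cluster_texts = {
    0: ["abortar prueba", "abortar prueba motor"],
    1: ["configurar avion", "configurar"],
}

all_docs = [text.split() for texts in cluster_texts.values() for text in texts]
n_docs = len(all_docs)
doc_freq = Counter(term for doc in all_docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def top_keyword(texts):
    """Return the term with the highest single-document tf-idf score in a cluster."""
    best_term, best_score = None, float("-inf")
    for text in texts:
        for term, count in Counter(text.split()).items():
            score = count * idf(term)
            if score > best_score:
                best_term, best_score = term, score
    return best_term

labels = {cluster: top_keyword(texts) for cluster, texts in cluster_texts.items()}
print(labels)  # {0: 'motor', 1: 'avion'}
```

Without normalisation the rarest term of each cluster wins, so the labels produced by the notebook's `TfidfTransformer` can differ from this sketch.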

Next, a helper function maps each cluster number to its new label, and the labels are stored in a new column of the merged log.
To include the relevant information in the XES file, the *case* (GTI code) and the *timestamp* of each incident must also be added.

In [None]:
def new_label_value(row, keyword_cluster_relabelling):
    return keyword_cluster_relabelling[row['CLUSTER']]

In [None]:
df_merged_sent_dep_lemmas['LABEL'] = df_merged_sent_dep_lemmas.apply(lambda x: new_label_value(x, keyword_cluster_relabelling), axis=1)

In [None]:
df_to_xes_merge = df_merged_sent_dep_lemmas.filter(items=['INCIDENCECODE', 'LABEL'])

new_incidence_code_no_sentences = list(df_to_xes_merge['INCIDENCECODE'])
new_incidence_code_no_sentences = [ic.split('_')[0] if '_' in ic else ic for ic in new_incidence_code_no_sentences]

df_to_xes_merge_no_ = df_to_xes_merge.copy()
df_to_xes_merge_no_['INCIDENCECODE'] = new_incidence_code_no_sentences

In [None]:
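df_ora_complete = pd.read_csv('./data/df_ora_complete.csv')
# Alternative: load the same data from the database (requires `query` and `connection` to be defined)
# df_ora_complete = pd.read_sql(query, con=connection)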
df_ora_columns_needed = df_ora_complete.filter(items=['INCIDENCECODE','INCIDENCEDATE', 'GTICODE'])
df_merged_to_ora =  pd.merge(df_to_xes_merge_no_, df_ora_columns_needed, how='left', left_on=['INCIDENCECODE'], right_on=['INCIDENCECODE'])
df_merged_to_ora['INCIDENCECODE'] = df_merged_sent_dep_lemmas['INCIDENCECODE']
df_merged_to_ora.to_csv('./data/log_sentences_dep_lemmas_root_relabelled.csv',index=False)

## 6.- Data Log Quality Metrics Calculation.

In [None]:
import pandas as pd
import numpy as np
import cx_Oracle
import collections
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pprint
import seaborn as sns
import xmltodict
from json import dumps,loads
import xml.etree.ElementTree as xml
from xml.etree import ElementTree
from xml.etree.ElementTree import Element, SubElement
from xmlr import xmlparse

pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is no longer accepted by recent pandas
pd.set_option('display.max_rows', None)
pp = pprint.PrettyPrinter(indent=4)

In [None]:
def get_metrics(path):
    with open(path, mode='r', encoding='utf-8') as f:
        xml_string_xes = f.read()
    log_xes = xmltodict.parse(xml_string_xes)
    log_xes = loads(dumps(log_xes))
    traces = log_xes['log']['trace']
    new_traces = list(map(lambda x: {'id_trace': x['string']['@value'], 'events': x['event']}, traces))
    
    new_traces_2 = []
    events_set=set()
    different_per_trace= dict()
      
    for i in new_traces:
        incidencecodes = []
        new_events = []
        for e in i['events']:
            #Use 'int' if INCIDENCECODE are int or 'string' if they are string in the XES log.
#             incidencecode = [d['@value'] for d in e['int'] if d['@key']=='INCIDENCECODE'][0]
            incidencecode = [d['@value'] for d in e['string'] if d['@key']=='INCIDENCECODE'][0]
            if incidencecode not in incidencecodes:
                incidencecodes.append(incidencecode)
                new_events.append([d['@value'] for d in e['string'] if d['@key']== 'concept:name'][0])
    
        new_traces_2.append({'id_trace':i['id_trace'], 'events':new_events})
        events_set.update(new_events)
        different_per_trace[i['id_trace']] = len(set(new_events))
    
    
    traces_count_events = list(map(lambda x: {x['id_trace']: len(x['events'])}, new_traces_2))
    
    traces_count_events_dict = dict()
    for d in traces_count_events:
        for k,v in d.items():
            traces_count_events_dict[k] = v
    
    total_events = sum([list(d.values())[0] for d in traces_count_events])
    
    events_dict = dict()
    for i in new_traces_2:
        for e in i['events']:
            if e in events_dict.keys():
                events_dict[e] = events_dict[e] + 1
            else:
                events_dict[e] = 1

    lonely_events = [k for k, v in events_dict.items() if v == 1]
    
    lonely_events_per_trace = dict()

    for d in new_traces_2:
        lonely_events_per_trace[d['id_trace']] = len([e for e in d['events'] if e in lonely_events])
    
    
    return traces_count_events_dict, lonely_events_per_trace, total_events, len(list(events_set)), len(lonely_events), len(traces_count_events_dict), total_events/len(traces_count_events_dict), different_per_trace

In [None]:
def get_average_lonely_events_events(lonely_events_per_trace, events_per_trace):
    average_lonely_events_dict = dict()
    for k, v in lonely_events_per_trace.items():
        average_lonely_events_dict[k] = (v/events_per_trace[k])
    
    return average_lonely_events_dict

In [None]:
def get_noise_roughest_trace(average_lonely_events_per_trace):
    df = pd.DataFrame()
    df['id_trace'] = list(average_lonely_events_per_trace.keys())
    df['average'] = list(average_lonely_events_per_trace.values())
    average_mean = df['average'].mean()
    df['sd'] = df.apply(lambda x: np.std([x['average'], average_mean]), axis=1)
    return max(list(df['sd']), default=-1)

In [None]:
def get_noise_in_log(total_events, total_lonely_events):
    return total_lonely_events/total_events

In [None]:
def get_diversity_in_log(total_events, total_different_events):
    return total_different_events/total_events

In [None]:
def get_diversity_average(different_events_per_trace, events_per_trace):
    diversity_events_dict = dict()
    for k, v in different_events_per_trace.items():
        diversity_events_dict[k] = (v/events_per_trace[k])
    
    return diversity_events_dict

In [None]:
def get_diversity_disparate_trace(average_diversity_per_trace):
    df = pd.DataFrame()
    df['id_trace'] = list(average_diversity_per_trace.keys())
    df['average'] = list(average_diversity_per_trace.values())
    average_mean = df['average'].mean()
    df['sd'] = df.apply(lambda x: np.std([x['average'], average_mean]), axis=1)
    return max(list(df['sd']), default=-1)

In [None]:
def get_composed_metrics(total_events, total_lonely_events, total_different_events, events_per_trace, lonely_per_trace, different_per_trace):
    noise_in_log = get_noise_in_log(total_events, total_lonely_events)
    average_lonely_per_trace = get_average_lonely_events_events(lonely_per_trace, events_per_trace)
    max_noise = get_noise_roughest_trace(average_lonely_per_trace)
    diversity_in_log = get_diversity_in_log(total_events, total_different_events)
    average_diversity_per_trace = get_diversity_average(different_per_trace, events_per_trace)
    max_diversity = get_diversity_disparate_trace(average_diversity_per_trace)
    
    return noise_in_log, max_noise, diversity_in_log, max_diversity

In [None]:
def print_metrics(path):
    res = get_metrics(path)
    res_composed = get_composed_metrics(res[2], res[4], res[3], res[0], res[1], res[7])
    print('EVENTS: ', res[2])
    print('DIFFERENT EVENTS: ', res[3])
    print('LONELY EVENTS: ', res[4])
    print('TRACES: ', res[5])
    print('COMPLEXITY: ', res[6])
    print('NOISE IN LOG: ', res_composed[0])
    print('NOISE MAX DEV: ', res_composed[1])
    print('DIVERSITY IN LOG: ', res_composed[2])
    print('DIVERSITY MAX DEV: ', res_composed[3])

In [None]:
#Example of metrics computation
print_metrics('./data/log_sentences_dep_lemmas_root.xes') 
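
As a worked example of the log-level metrics, the sketch below computes noise, diversity and complexity on three hypothetical traces. It is a simplification of `get_metrics`: it does not parse XES and does not remove repeated incidence codes within a trace.

```python
from collections import Counter

# Hypothetical traces: trace id -> ordered event labels.
traces = {
    "case_1": ["abortar", "configurar", "abortar"],
    "case_2": ["abortar", "comprobar"],
    "case_3": ["configurar"],
}

event_counts = Counter(e for events in traces.values() for e in events)
total_events = sum(event_counts.values())                        # 6
different_events = len(event_counts)                             # 3
lonely_events = [e for e, c in event_counts.items() if c == 1]   # labels seen only once

noise_in_log = len(lonely_events) / total_events        # lonely events / events
diversity_in_log = different_events / total_events      # different events / events
complexity = total_events / len(traces)                 # average events per trace

print(lonely_events, noise_in_log, diversity_in_log, complexity)
```

Here only "comprobar" occurs once, so the noise is 1/6, the diversity is 3/6 and the complexity is 2 events per trace.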

## 7.- Process Quality Metrics Calculation.

In [None]:
def get_quality(df_param):
    df_param = df_param.iloc[:,1:]
    df_param = df_param.loc[:, (df_param != 0).any()]
    total_configurations = df_param.shape[0]
    df_param = df_param.apply(pd.to_numeric)
    # pd.np was removed in pandas 2.0; use NumPy directly
    sums = df_param.iloc[:,1:df_param.shape[1]].select_dtypes(include=np.number).sum().rename('total')
    total_sum = sums.sum()
    lasagna_quality = total_sum/total_configurations
    lasagna_quality = (df_param.shape[1]-1) - lasagna_quality
    return lasagna_quality

In [None]:
def load_dataframe_from_xes_file(filepath):
    path = filepath
    with open(path, mode='r', encoding='utf-8') as f:
        xml_string = f.read()
    log_is = xmltodict.parse(xml_string)
    log_is = loads(dumps(log_is))
    events_set = set()
    traces = log_is['log']['trace']
    new_traces = list(map(lambda x: {'id_trace': x['string']['@value'], 'events': x['event']}, traces))

    for i in new_traces:
        new_events = list(map(lambda x: [dictionary['@value'] for dictionary in x['string'] if dictionary['@key'] == 'concept:name'][0], i['events']))
        i['events'] = new_events

    for i in new_traces:
        events_set.update(i['events'])
    # Build the column list once, after all event names have been collected
    columns = np.array(list(events_set), dtype=object)
    
    traces_rows = []
    for trace in new_traces:
        count = 0
        zeros = np.zeros(len(columns))
        zeros = np.append(zeros, [trace['id_trace']])
        for i in columns:
            if (i in trace['events']):
                zeros[count] = float(zeros[count])+1
            count = count + 1
        traces_rows.append(zeros)

    traces_rows2 = list(map(lambda x: pd.Series(x), traces_rows))

    df = pd.DataFrame()
    df = pd.concat(traces_rows2, axis=1)
    df = df.T
    
    columns = np.append(columns, ['id_instance'])
    df.columns = columns

    cols = df.columns.tolist()
    cols = cols[-1:] + cols[:-1]
    df = df[cols]
    df.iloc[:,1:] = df.iloc[:,1:].apply(pd.to_numeric)
    df['id_instance'] = df['id_instance'].astype(str)

    return df

In [None]:
df = load_dataframe_from_xes_file('./data/log_sentences_dep_lemmas_root_relabelled.xes')

In [None]:
get_quality(df)

The process metrics were calculated with the following Java program.

In [None]:
import java.io.File;
import java.util.HashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/***	
 * Class to determine process metrics
 * 
 * @author Ángel Jesús Varela Vaca (ajvarela@us.es)
 * @author IDEA Research Group (http://www.idea.us.es)
 * @version 1.0   
 *
 */

public class testBPMN {

	/**
	 * Main program which receives the BPMN file and prints the process metrics. 
	 * @param args receive as input a bpmn file (UTF-8)
	 * 
	 */
	public static void main(String[] args) {

		try {
			File inputFile = new File("src/main/resources/" + args[0] + ".bpmn");

			DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
			DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
			Document doc = dBuilder.parse(inputFile);
			doc.getDocumentElement().normalize();

			System.out.println("Root element :" + doc.getDocumentURI());

			NodeList lSequence = doc.getElementsByTagName("sequenceFlow");

			NodeList lTask = doc.getElementsByTagName("task");

			NodeList lgateway = doc.getElementsByTagName("exclusiveGateway");

			NodeList pgateway = doc.getElementsByTagName("parallelGateway");

			NodeList loutgoings = doc.getElementsByTagName("outgoing");

			Map<String, String> mtask = new HashMap<>();

			System.out.println("Number of tasks:" + lTask.getLength());

			System.out.println("Number of sequences:" + lSequence.getLength());

			int gateways = pgateway.getLength() + lgateway.getLength();
			
			System.out.println("Number of and:" + pgateway.getLength());

			System.out.println("Number of exclusives:" + lgateway.getLength());

			System.out.println("Number of gateways:" + gateways);

			System.out.println("Number of outgoings (CFC):" + loutgoings.getLength());

			for (int temp = 0; temp < lTask.getLength(); temp++) {
				Node nNode = lTask.item(temp);
				Element eElement = (Element) nNode;

				mtask.put(eElement.getAttribute("name"), eElement.getAttribute("id"));
			}

			int contador = 0;
			for (int i = 0; i < lSequence.getLength(); i++) {
				Node n = lSequence.item(i);
				Element eElement = (Element) n;

				boolean nsource = mtask.values().contains(eElement.getAttribute("sourceRef"));
				boolean ntarget = mtask.values().contains(eElement.getAttribute("targetRef"));

				if (nsource && ntarget) {
					contador += 1;
				}
			}

			System.out.println("Arcs between non-connector nodes:" + contador);
			float seq = (float) contador/lSequence.getLength();
			System.out.println("Sequentiality:" + String.format("%,.10f", seq));

		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}
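
For readers who prefer to stay in Python, the sequentiality computed by the Java program (sequence flows that connect two tasks directly, divided by all sequence flows) can be sketched with the standard library. This is a minimal sketch under assumptions: the `BPMN` string is a hypothetical, simplified model, and the `local_name` and `sequentiality` helpers are not part of the original code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal BPMN-like document used only for illustration.
BPMN = """
<definitions>
  <process id="p">
    <task id="t1" name="A"/>
    <task id="t2" name="B"/>
    <task id="t3" name="C"/>
    <exclusiveGateway id="g1"/>
    <sequenceFlow id="f1" sourceRef="t1" targetRef="t2"/>
    <sequenceFlow id="f2" sourceRef="t2" targetRef="g1"/>
    <sequenceFlow id="f3" sourceRef="g1" targetRef="t3"/>
    <sequenceFlow id="f4" sourceRef="g1" targetRef="t1"/>
  </process>
</definitions>
"""

def local_name(tag):
    """Strip an XML namespace such as '{uri}task' down to 'task'."""
    return tag.rsplit("}", 1)[-1]

def sequentiality(bpmn_xml):
    """Share of sequence flows whose source and target are both tasks."""
    root = ET.fromstring(bpmn_xml.strip())
    task_ids = {el.get("id") for el in root.iter() if local_name(el.tag) == "task"}
    flows = [el for el in root.iter() if local_name(el.tag) == "sequenceFlow"]
    direct = sum(
        1 for f in flows
        if f.get("sourceRef") in task_ids and f.get("targetRef") in task_ids
    )
    return direct / len(flows) if flows else 0.0

print(sequentiality(BPMN))  # 0.25
```

Only flow `f1` links two tasks directly, so one of the four sequence flows counts and the sequentiality is 0.25.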
