# Natural Language Processing of Logs



#### 1.- Reading and writing logs

#### 2.- NLP Techniques applied to logs

    2.1.- Sentence detection

    2.2.- Part-Of-Speech Tagging

    2.3.- Named Entities Recognition

    2.4.- Acronym detection

    2.5.- Dependency Parser

    2.6.- Lemmatization

#### 3.- Creation of the new log and re-labelling of events
#### 4.- Data Log Quality Metrics Calculation


## 1.- Reading and writing CSV logs

In [1]:
import csv

LOG_FILE_PATH="./data/original_log.csv"

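def read_csv_log(file_path):
    """Read a CSV log and return its rows as tuples, skipping the first row."""
    with open(file_path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        row = next(reader)  # first row (assumed to be the header) only decides the layout
        if len(row) == 3:
            return [(inc_code, inc_type, description) for inc_code, inc_type, description in reader]
        else:
            return [(inc_code, description) for inc_code, description in reader]


def write_csv_log(log, file_path):
    """Write an iterable of rows to a CSV file."""
    # newline="" prevents blank lines between rows on Windows
    with open(file_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(log)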
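
As a quick sanity check, the sketch below round-trips a tiny log through the `csv` module in the same way as the two helpers above. The file name, column names and rows are made up for illustration. The first row is treated as a header and is not part of the returned data, which mirrors what `read_csv_log` does.

```python
import csv
import os
import tempfile

# Hypothetical three-column log: a header row plus two incidences.
rows = [
    ("INCIDENCECODE", "TYPE", "DESCRIPTION"),
    ("1001", "TT", "Se aborta la prueba"),
    ("1002", "NHA", "WARNING: tenemos masa"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_log.csv")

    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)

    with open(path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                   # consumed, not part of the data
        data = [tuple(row) for row in reader]   # remaining rows as tuples

print(header)
print(data)
```

Rows with a different number of columns than the header would make the tuple unpacking in `read_csv_log` fail, so it may be worth validating the file before loading a full log.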


In [2]:
incidences = read_csv_log(LOG_FILE_PATH)

print("Number of incidences: ", len(incidences))

Number of incidences:  4416


In [3]:
from collections import Counter

type_counter = Counter([inc_type for _, inc_type, _ in incidences])

print("Number of incidences by type:")
print("\tTYPE \t INCIDENCES")
for itype, count in type_counter.items():
    print("\t %s \t   %d" %(itype, count))



Number of incidences by type:
	TYPE 	 INCIDENCES
	 TT 	   1775
	 NHA 	   1287
	 HEQ 	   185
	 PIS 	   40
	 NHM 	   212
	 HMC 	   33
	 PIO 	   383
	 HEL 	   196
	 NHP 	   33
	 NHT 	   272


## 2.- NLP Techniques applied to logs

Loading NLP resources

In [4]:
import spacy
from spacy.lang.es.stop_words import STOP_WORDS

nlp = spacy.load('es_core_news_md')  # Language model for Spanish

stopwords = ["warning", "warning:"]  # words that occur in the logs and do not provide any information

docs = {code:nlp(text) for code, _, text in incidences}
tokens = {code: [t for t in doc] for code, doc in docs.items()}

# Pre-process incidences
raw_texts = [doc.text for doc in docs.values()]  # original text, without NLP filters
codes = list(tokens.keys())

### 2.1.- Sentence Detection

The pretrained language models provided by spaCy cover everything this approach needs, including a simple but useful sentence segmenter.

In [5]:
sentences = {code:list(doc.sents) for code, doc in docs.items()}

print(list(sentences.keys())[:5])
print(list(sentences.values())[:5])

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']


### 2.2.- POS Tagging

Each log entry is filtered to keep only the words whose part-of-speech tag belongs to a chosen set of morphosyntactic categories.

In [6]:
TAGS = {"NOUN", "VERB", "ADJ", "PROPN"}    # we want to keep those words classified with these tags

postags = {code:[str(token) for token in tks if token.pos_ in TAGS] for code, tks in tokens.items()}

print(list(postags.keys())[:5])
print(list(postags.values())[:5])

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']


### 2.3.- Named Entities Recognition (NER)

Named entities are words or groups of words that refer to an organization, a specific person, a location, etc.

In [7]:
# doc.ents holds Span objects, so entity membership is tested per token via ent_type_
entities = {code:[str(token) for token in tks if token.ent_type_] for code, tks in tokens.items()}

print(list(entities.keys())[:5])
print(list(entities.values())[:5])

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']
[[], [], [], [], []]


### 2.4.- Acronym Detection

A simple rule-based approach: a word is considered an **acronym** if it is written in uppercase, is longer than one character, and its lowercase form does not appear in the vocabulary of the target language.

In [8]:
def detect_acronyms(inc_tokens):
    return [str(token) for token in set(inc_tokens) if str(token).isupper() and str(token).lower() not in nlp.vocab and len(str(token)) > 1]

In [9]:
acronyms = {code:detect_acronyms(tks) for code, tks in tokens.items()}

print(list(acronyms.keys())[:5])
print(list(acronyms.values())[:5])

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']
[[], [], [], ['SWLP', 'LMCP', 'FWS', 'LMWS'], []]
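
The rule can be tried without loading a language model. The following is a minimal sketch in which a small hand-written set, `TOY_VOCAB`, stands in for `nlp.vocab`; both the set and the helper name `detect_acronyms_toy` are assumptions made for this illustration only.

```python
# Toy lowercase vocabulary standing in for the language model's vocabulary (assumption).
TOY_VOCAB = {"se", "aborta", "la", "prueba", "por", "fallo", "en"}


def detect_acronyms_toy(words, vocab=TOY_VOCAB):
    """Return the distinct words that are uppercase, longer than one
    character, and whose lowercase form is not in the vocabulary."""
    found = []
    for word in words:
        is_candidate = word.isupper() and len(word) > 1
        if is_candidate and word.lower() not in vocab and word not in found:
            found.append(word)
    return found


words = "SE aborta la prueba por fallo en FWS y LMWS".split()
result = detect_acronyms_toy(words)
print(result)  # ['FWS', 'LMWS']
```

Here `SE` is rejected because its lowercase form is a regular word, which is the case the vocabulary check is meant to cover: log entries typed entirely in capitals should not be reported as acronyms.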


### 2.5.- Dependency Parser

The words in the log are filtered, keeping only those that play a specific syntactic role in the text (_subject, direct object, root,_ etc.), together with the head word each of them depends on.

In [10]:
DEPENDENCIES = {"nsubj", "obj"}     # Set of dependencies that we want to keep. 
                                    # We could be (a lot) more restrictive with: DEPENDENCIES = {"ROOT"}

def dependencies(inc_tokens, dependencies=DEPENDENCIES):
    valids = set()
    for token in inc_tokens:
        if token.dep_ in dependencies:
            valids.add(str(token))              # Token with the required dependency
            valids.add(str(token.head))         # Token that is the origin of this dependency
    
    return [str(token) for token in inc_tokens if str(token) in valids]

In [11]:
deps = {code:dependencies(tks) for code, tks in tokens.items()}

print(list(deps.keys())[:5])
print(list(deps.values())[:5])

deps_root = {code:dependencies(tks, {"ROOT"}) for code, tks in tokens.items()}

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']
[['Se', 'aborta'], ['tenemos', 'masa'], ['SE', 'aborta'], ['se', 'comprueba', 'funcionalidad', 'asi', 'indicación'], ['se', 'aborta', 'configurar', 'avion']]
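
To see what the filter keeps, the sketch below applies the same logic to a hand-annotated sentence. `ToyToken` is a hypothetical stand-in for a spaCy token exposing only `dep_` and `head`, and the dependency labels are assigned by hand for illustration rather than produced by a parser.

```python
class ToyToken:
    """Minimal stand-in for a parsed token (assumption, not a spaCy class)."""

    def __init__(self, text, dep_, head=None):
        self.text = text
        self.dep_ = dep_
        # As in spaCy, the root of the sentence is its own head.
        self.head = head if head is not None else self

    def __str__(self):
        return self.text


def filter_by_dependency(tokens, wanted):
    """Keep tokens carrying a wanted dependency, plus their heads."""
    valid = set()
    for token in tokens:
        if token.dep_ in wanted:
            valid.add(str(token))       # token with the required dependency
            valid.add(str(token.head))  # word it depends on
    return [str(token) for token in tokens if str(token) in valid]


# "El piloto aborta la prueba", annotated by hand.
aborta = ToyToken("aborta", "ROOT")
piloto = ToyToken("piloto", "nsubj", aborta)
el = ToyToken("El", "det", piloto)
prueba = ToyToken("prueba", "obj", aborta)
la = ToyToken("la", "det", prueba)
sentence = [el, piloto, aborta, la, prueba]

kept = filter_by_dependency(sentence, {"nsubj", "obj"})
kept_root = filter_by_dependency(sentence, {"ROOT"})
print(kept)       # ['piloto', 'aborta', 'prueba']
print(kept_root)  # ['aborta']
```

Because the filter compares word strings rather than token positions, a repeated word is kept at every position where it occurs once any of its occurrences qualifies.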


### 2.6.- Lemmatization

Each word in the log is replaced by its lemma.

In [12]:
lemmas = {code:[token.lemma_ for token in tks] for code, tks in tokens.items()}

print(list(lemmas.keys())[:5])
print(list(lemmas.values())[:5])

['1007314995715', '1010259490543', '1014902821262', '1015282934226', '1017842527711']


## 3.- Creation of the new log and re-labelling of events

In [None]:
import os

import pandas as pd
import cx_Oracle

connection = cx_Oracle.connect('user/pass@localhost')
query = """select * FROM INCIDENTS"""
df_ora_complete = pd.read_sql(query, con=connection)
df_ora_columns_needed = df_ora_complete.filter(items=['INCIDENCECODE','INCIDENCEDATE', 'GTICODE'])

root_path = './data'
files_list = os.listdir(root_path)
files_list_nosent = [f for f in files_list if not 'sent' in f]
files_list_sent = [f for f in files_list if 'sent' in f]
save_path = './xes_files'

for file_name in files_list_nosent:
    complete_path_aux = os.path.join(root_path,file_name)
    df_log_aux = pd.read_csv(complete_path_aux)
    df_log_aux['INCIDENCECODE'] = df_log_aux['INCIDENCECODE'].apply(str)
    df_log_aux_merged =  pd.merge(df_log_aux, df_ora_columns_needed, how='left', left_on=['INCIDENCECODE'], right_on=['INCIDENCECODE'])
    df_log_aux_merged.to_csv(os.path.join(save_path, file_name), index=False)

    
for file_name in files_list_sent:
    complete_path_aux = os.path.join(root_path,file_name)
    df_log_aux = pd.read_csv(complete_path_aux)
    df_log_aux['INCIDENCECODE'] = df_log_aux['INCIDENCECODE'].apply(str)
    df_log_aux['INCIDENCECODE_nosent'] = df_log_aux['INCIDENCECODE']
    df_log_aux['INCIDENCECODE_nosent'] =df_log_aux['INCIDENCECODE_nosent'].apply(lambda ic: ic.split('_')[0] if '_' in ic else ic)
    df_log_aux_merged =  pd.merge(df_log_aux, df_ora_columns_needed, how='left', left_on=['INCIDENCECODE_nosent'], right_on=['INCIDENCECODE'])
    df_log_aux_merged_to_save = df_log_aux_merged.filter(items=['INCIDENCECODE_x','INCIDENCEDATE', 'GTICODE', 'DESCRIPTION'])
    df_log_aux_merged_to_save.to_csv(os.path.join(save_path, file_name), index=False)


## 4.- Data Log Quality Metrics Calculation

In [None]:
import pandas as pd
import numpy as np
import cx_Oracle
import collections
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pprint
import seaborn as sns
import xmltodict
from json import dumps,loads
import xml.etree.ElementTree as xml
from xml.etree import ElementTree
from xml.etree.ElementTree import Element, SubElement
from xmlr import xmlparse

pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is not accepted by recent pandas versions
pd.set_option('display.max_rows', None)
pp = pprint.PrettyPrinter(indent=4)

In [None]:
def compute_consistency(new_traces_2, avg_length_events_string, total_events):
    consistency_result = 0
    for elem in new_traces_2:
        aux_events_list = elem['events']
        for e_i in aux_events_list:
            e_i_value = abs(len(e_i)-avg_length_events_string)/total_events
            consistency_result += e_i_value
            
    return consistency_result
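
As a worked example, the sketch below evaluates the same quantity on two made-up traces: the sum over all events of the absolute difference between the event label length and the average label length, divided by the total number of events. As in `get_metrics`, the average is taken over the distinct labels. The trace contents are invented for illustration.

```python
# Two hypothetical traces with event labels as plain strings.
toy_traces = [
    {"id_trace": "t1", "events": ["abortar prueba", "comprobar"]},
    {"id_trace": "t2", "events": ["abortar prueba"]},
]

all_events = [event for trace in toy_traces for event in trace["events"]]
distinct_events = set(all_events)

# Average label length over distinct labels: (14 + 9) / 2 = 11.5
avg_length = sum(len(event) for event in distinct_events) / len(distinct_events)
total_events = len(all_events)  # 3

# Each of the three events deviates from the average by 2.5 characters.
consistency = sum(abs(len(event) - avg_length) for event in all_events) / total_events
print(consistency)  # 2.5
```

A value of 0 would mean that every event label has exactly the average length, so lower values indicate more uniform labels.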

In [None]:
def get_metrics(path):
    with open(path, mode='r', encoding='utf-8') as f:  # close the file once it has been read
        xml_string_xes = f.read()
    log_xes = xmltodict.parse(xml_string_xes)
    log_xes = loads(dumps(log_xes))
    traces = log_xes['log']['trace']
    new_traces = list(map(lambda x: {'id_trace': x['string']['@value'], 'events': x['event']}, traces))
    
    new_traces_2 = []
    events_set=set()
    different_per_trace= dict()
      
    for i in new_traces:
        incidencecodes = []
        new_events = []
        for e in i['events']:
            # Use e['int'] if INCIDENCECODE is stored as an int in the XES log, or e['string'] if it is stored as a string.
            # incidencecode = [d['@value'] for d in e['int'] if d['@key']=='INCIDENCECODE'][0]
            incidencecode = [d['@value'] for d in e['string'] if d['@key']=='INCIDENCECODE'][0]
            if incidencecode not in incidencecodes:
                incidencecodes.append(incidencecode)
                new_events.append([d['@value'] for d in e['string'] if d['@key']== 'concept:name'][0])
    
        new_traces_2.append({'id_trace':i['id_trace'], 'events':new_events})
        events_set.update(new_events)
        different_per_trace[i['id_trace']] = len(set(new_events))
    
    avg_length_events_string = sum( map(len, list(events_set)) ) / len(list(events_set))
    traces_count_events = list(map(lambda x: {x['id_trace']: len(x['events'])}, new_traces_2))
    
    traces_count_events_dict = dict()
    for d in traces_count_events:
        for k,v in d.items():
            traces_count_events_dict[k] = v
    
    total_events = sum([list(d.values())[0] for d in traces_count_events])
    
    events_dict = dict()
    for i in new_traces_2:
        for e in i['events']:
            if e in events_dict.keys():
                events_dict[e] = events_dict[e] + 1
            else:
                events_dict[e] = 1

    lonely_events = [k for k, v in events_dict.items() if v == 1]
    
    lonely_events_per_trace = dict()

    for d in new_traces_2:
        lonely_events_per_trace[d['id_trace']] = len([e for e in d['events'] if e in lonely_events])
        
    consistency = compute_consistency(new_traces_2, avg_length_events_string, total_events)
    
    
    return traces_count_events_dict, lonely_events_per_trace, total_events, len(list(events_set)), len(lonely_events), len(traces_count_events_dict), total_events/len(traces_count_events_dict), different_per_trace, consistency
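
The function above assumes a particular XES layout: each trace carries one `string` attribute holding its identifier, and each event carries `string` attributes keyed `INCIDENCECODE` and `concept:name`. The sketch below makes that assumption concrete with a made-up log and extracts the same per-trace event lists using only the standard library's `xml.etree.ElementTree` instead of `xmltodict`. The XML content and the helper `attribute` are illustrative, not taken from the real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical XES fragment; the second event repeats INCIDENCECODE 1001.
TOY_XES = """<log>
  <trace>
    <string key="concept:name" value="trace_1"/>
    <event>
      <string key="INCIDENCECODE" value="1001"/>
      <string key="concept:name" value="abortar prueba"/>
    </event>
    <event>
      <string key="INCIDENCECODE" value="1001"/>
      <string key="concept:name" value="abortar prueba"/>
    </event>
    <event>
      <string key="INCIDENCECODE" value="1002"/>
      <string key="concept:name" value="comprobar funcionalidad"/>
    </event>
  </trace>
</log>"""


def attribute(node, key):
    """Return the value of the direct <string> child with the given key."""
    for child in node.findall("string"):
        if child.get("key") == key:
            return child.get("value")
    return None


root = ET.fromstring(TOY_XES)
traces = []
for trace in root.findall("trace"):
    seen_codes = set()
    events = []
    for event in trace.findall("event"):
        code = attribute(event, "INCIDENCECODE")
        if code not in seen_codes:  # keep one event per incidence code
            seen_codes.add(code)
            events.append(attribute(event, "concept:name"))
    traces.append({"id_trace": attribute(trace, "concept:name"), "events": events})

print(traces)
```

One practical difference is worth keeping in mind: `xmltodict` returns a single dictionary instead of a list when an element has only one child of a given name, so a trace with a single event would likely need special handling in `get_metrics`.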

In [None]:
def get_average_lonely_events_events(lonely_events_per_trace, events_per_trace):
    average_lonely_events_dict = dict()
    for k, v in lonely_events_per_trace.items():
        average_lonely_events_dict[k] = (v/events_per_trace[k])
    
    return average_lonely_events_dict

In [None]:
def get_noise_roughest_trace(average_lonely_events_per_trace):
    df = pd.DataFrame()
    df['id_trace'] = list(average_lonely_events_per_trace.keys())
    df['average'] = list(average_lonely_events_per_trace.values())
    average_mean = df['average'].mean()
    df['sd'] = df.apply(lambda x: np.std([x['average'], average_mean]), axis=1)
    return max(list(df['sd']), default=-1)

In [None]:
def get_noise_in_log(total_events, total_lonely_events):
    return total_lonely_events/total_events

In [None]:
def get_diversity_in_log(total_events, total_different_events):
    return total_different_events/total_events

In [None]:
def get_diversity_average(different_events_per_trace, events_per_trace):
    diversity_events_dict = dict()
    for k, v in different_events_per_trace.items():
        diversity_events_dict[k] = (v/events_per_trace[k])
    
    return diversity_events_dict

In [None]:
def get_diversity_disparate_trace(average_diversity_per_trace):
    df = pd.DataFrame()
    df['id_trace'] = list(average_diversity_per_trace.keys())
    df['average'] = list(average_diversity_per_trace.values())
    average_mean = df['average'].mean()
    df['sd'] = df.apply(lambda x: np.std([x['average'], average_mean]), axis=1)
    return max(list(df['sd']), default=-1)

In [None]:
def get_composed_metrics(total_events, total_lonely_events, total_different_events, events_per_trace, lonely_per_trace, different_per_trace):
    noise_in_log = get_noise_in_log(total_events, total_lonely_events)
    average_lonely_per_trace = get_average_lonely_events_events(lonely_per_trace, events_per_trace)
    max_noise = get_noise_roughest_trace(average_lonely_per_trace)
    diversity_in_log = get_diversity_in_log(total_events, total_different_events)
    average_diversity_per_trace = get_diversity_average(different_per_trace, events_per_trace)
    max_diversity = get_diversity_disparate_trace(average_diversity_per_trace)
    
    return noise_in_log, max_noise, diversity_in_log, max_diversity

In [None]:
def print_metrics(path):
    res = get_metrics(path)
    res_composed = get_composed_metrics(res[2], res[4], res[3], res[0], res[1], res[7])
    print('EVENTS: ', res[2])
    print('DIFFERENT EVENTS: ', res[3])
    print('LONELY EVENTS: ', res[4])
    print('TRACES: ', res[5])
    print('COMPLEXITY: ', res[6])
    print('UNIQUENESS: ', res_composed[0])
    print('RELEVANCY: ', res_composed[2])
    print('CONSISTENCY: ', res[8])

In [None]:
#Example of metrics computation
print_metrics('./data/log_sentences_dep_lemmas_root.xes') 
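
For intuition about the printed ratios, the following sketch recomputes three of them on made-up traces, following the definitions used in the functions above: COMPLEXITY is events per trace, UNIQUENESS is the share of event labels that occur exactly once in the whole log, and RELEVANCY is the number of distinct labels over the total number of events. The trace contents are invented for illustration.

```python
from collections import Counter

# Hypothetical traces: trace id -> list of event labels.
toy_traces = {
    "t1": ["a", "b", "a"],
    "t2": ["a", "c"],
}

label_counts = Counter(label for events in toy_traces.values() for label in events)

total_events = sum(label_counts.values())      # 5
different_events = len(label_counts)           # 3 distinct labels: a, b, c
lonely_events = sum(1 for count in label_counts.values() if count == 1)  # b and c

complexity = total_events / len(toy_traces)    # events per trace
uniqueness = lonely_events / total_events      # same ratio as get_noise_in_log
relevancy = different_events / total_events    # same ratio as get_diversity_in_log

print("COMPLEXITY:", complexity)
print("UNIQUENESS:", uniqueness)
print("RELEVANCY:", relevancy)
```

With these traces the log has 5 events in 2 traces, 3 distinct labels and 2 labels that occur only once, so the three ratios are 5/2, 2/5 and 3/5 respectively.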