# Pre-analysis of the BlueGene/L dataset 
This notebook uses data mining techniques to better understand the dataset.

### About the dataset

BGL is an open dataset of logs collected from a BlueGene/L supercomputer system at Lawrence Livermore National Laboratory (LLNL) in Livermore, California, with 131,072 processors and 32,768 GB of memory. The log contains alert and non-alert messages identified by alert category tags. In the first column of the log, "-" indicates a non-alert message, while any other value marks an alert message. This label information makes the dataset suitable for alert detection and prediction research. It has been used in several studies on log parsing, anomaly detection, and failure prediction.

### Structure of log

Each log line follows the format "LABEL TIMESTAMP DATE NODE DATE-FULL NODE(again) TYPE COMPONENT LEVEL CONTENT", where the node identifier appears twice. For example:

"- 1117838570 2005.06.03 R02-M1-N0-C:J12-U11 2005-06-03-15.42.50.363779 R02-M1-N0-C:J12-U11 RAS KERNEL INFO instruction cache parity error corrected"

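The format above can be sketched as a small splitter. This is a minimal illustration, not the implementation behind `parse_logs`: the function name `split_bgl_line`, the field name `NODE_REPEAT`, and the `is_alert` flag are assumptions introduced here. Lines with fewer than ten fields (for example, lines with no content) raise an error in this sketch.

```python
from datetime import datetime

# Field names follow the format string above; NODE_REPEAT is a name chosen here.
FIELDS = ["LABEL", "TIMESTAMP", "DATE", "NODE", "DATE_FULL",
          "NODE_REPEAT", "TYPE", "COMPONENT", "LEVEL", "CONTENT"]

def split_bgl_line(line: str) -> dict:
    """Split one raw BGL line into its ten fields (CONTENT keeps its spaces)."""
    # Split on whitespace at most 9 times so the message text stays in one piece.
    parts = line.rstrip("\n").split(maxsplit=len(FIELDS) - 1)
    if len(parts) < len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    # "-" in the first column marks a non-alert message.
    record["is_alert"] = record["LABEL"] != "-"
    # The full timestamp uses dots between time components and microseconds.
    record["DATE_FULL"] = datetime.strptime(record["DATE_FULL"], "%Y-%m-%d-%H.%M.%S.%f")
    return record

sample = ("- 1117838570 2005.06.03 R02-M1-N0-C:J12-U11 "
          "2005-06-03-15.42.50.363779 R02-M1-N0-C:J12-U11 "
          "RAS KERNEL INFO instruction cache parity error corrected")
record = split_bgl_line(sample)
print(record["LEVEL"], "|", record["CONTENT"])
```
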
In [1]:
# dependencies
from utils.parser import parse_logs
from utils.paths import project_root

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

In [2]:
truncate_proportion = 0.1  # proportion of the log to keep (from the start of the file) for testing

# loads parsed logs if needed
parse_logs(f'{project_root()}/data', 'BGL.log', f'{project_root()}/data/parsed', truncate_proportion)
# TODO: lines with no content are currently skipped; decide later whether to keep this behavior

Processing BGL.log...
Processing only: 474796
Cleaning log file: /home/paulofr/repos/failure-prediction-probabilistic-ml/data/BGL.log
Truncating input to first 474796 lines.
Cleaned log saved to: /home/paulofr/repos/failure-prediction-probabilistic-ml/data/cleaned_BGL
Parsing file: /home/paulofr/repos/failure-prediction-probabilistic-ml/data/cleaned_BGL
Total lines:  474796
Processed 0.2% of log lines.
Processed 0.4% of log lines.
Processed 0.6% of log lines.
Processed 0.8% of log lines.
Processed 1.1% of log lines.
Processed 1.3% of log lines.
Processed 1.5% of log lines.
Processed 1.7% of log lines.
Processed 1.9% of log lines.
Processed 2.1% of log lines.
Processed 2.3% of log lines.
Processed 2.5% of log lines.
Processed 2.7% of log lines.
Processed 2.9% of log lines.
Processed 3.2% of log lines.
Processed 3.4% of log lines.
Processed 3.6% of log lines.
Processed 3.8% of log lines.
Processed 4.0% of log lines.
Processed 4.2% of log lines.
Processed 4.4% of log lines.
Processed 4.6%

In [3]:
df = pd.read_csv(f'{project_root()}/data/parsed/cleaned_BGL_structured.csv')
df.head()

Unnamed: 0,LineId,Label,Timestamp,Date,Node,Time,NodeRepeat,Type,Component,Level,Content,EventId,EventTemplate,ParameterList
0,1,-,1117838570,2005.06.03,R02-M1-N0-C:J12-U11,2005-06-03-15.42.50.363779,R02-M1-N0-C:J12-U11,RAS,KERNEL,INFO,instruction cache parity error corrected,3aa50e45,instruction cache parity error corrected,[]
1,2,-,1117838570,2005.06.03,R02-M1-N0-C:J12-U11,2005-06-03-15.42.50.527847,R02-M1-N0-C:J12-U11,RAS,KERNEL,INFO,instruction cache parity error corrected,3aa50e45,instruction cache parity error corrected,[]
2,3,-,1117838570,2005.06.03,R02-M1-N0-C:J12-U11,2005-06-03-15.42.50.675872,R02-M1-N0-C:J12-U11,RAS,KERNEL,INFO,instruction cache parity error corrected,3aa50e45,instruction cache parity error corrected,[]
3,4,-,1117838570,2005.06.03,R02-M1-N0-C:J12-U11,2005-06-03-15.42.50.823719,R02-M1-N0-C:J12-U11,RAS,KERNEL,INFO,instruction cache parity error corrected,3aa50e45,instruction cache parity error corrected,[]
4,5,-,1117838570,2005.06.03,R02-M1-N0-C:J12-U11,2005-06-03-15.42.50.982731,R02-M1-N0-C:J12-U11,RAS,KERNEL,INFO,instruction cache parity error corrected,3aa50e45,instruction cache parity error corrected,[]


## Understanding Features
Labels indicate the type of event that generated the log line. Most lines have "-" as the label, meaning no problem was identified for that line. According to the article "What Supercomputers Say: A Study of Five System Logs" (https://ieeexplore.ieee.org/document/4273008), the alert types can be split into the following categories:

![labels](../../images/labels.png) 


In [4]:
# anomalies distribution
df['Label'].value_counts()

Label
-           281949
KERNDTLB    152659
KERNSTOR     36864
APPREAD       3181
KERNRTSP       133
KERNMC          10
Name: count, dtype: int64
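
As a rough sketch, the counts above can be collapsed into a binary alert/non-alert view, which is the form usually needed as a detection target. The numbers are copied from the output above; the variable names are chosen here for illustration, and the share only describes the truncated slice of the log, not necessarily the full file.

```python
import pandas as pd

# Counts copied from the value_counts() output above (truncated log).
label_counts = pd.Series(
    {"-": 281949, "KERNDTLB": 152659, "KERNSTOR": 36864,
     "APPREAD": 3181, "KERNRTSP": 133, "KERNMC": 10},
    name="count",
)

# Everything except "-" is an alert category.
alert_counts = label_counts.drop("-")
alert_share = alert_counts.sum() / label_counts.sum()
print(f"alert lines: {alert_counts.sum()} of {label_counts.sum()} ({alert_share:.1%})")

# On the full frame, the equivalent per-row flag would be:
# df["is_alert"] = df["Label"] != "-"
```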

Similar to the label, the Level column indicates the severity of the logged event. It ranges from a plain INFO message up to a failure or kill.

In [5]:
df['Level'].value_counts()

Level
FATAL      241534
INFO       233138
SEVERE         34
ERROR          34
Name: count, dtype: int64
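
Label and Level do not appear to be interchangeable: the outputs above show more FATAL lines (241,534) than alert-labelled lines (192,847), so at least some FATAL lines carry the "-" label. A cross-tabulation is one way to inspect this. The sketch below uses a few made-up rows purely for illustration; in the notebook the same call would be applied to `df['Label']` and `df['Level']`.

```python
import pandas as pd

# Toy rows for illustration only; in the notebook this would be the parsed `df`.
toy = pd.DataFrame({
    "Label": ["-", "-", "KERNDTLB", "KERNSTOR", "-", "APPREAD"],
    "Level": ["INFO", "INFO", "FATAL", "FATAL", "FATAL", "FATAL"],
})

# Rows: alert label, columns: severity level, cells: number of log lines.
table = pd.crosstab(toy["Label"], toy["Level"])
print(table)
```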

In [6]:
df['Content'].value_counts().head(20)

Content
data TLB error interrupt                                                                                                               152659
data storage interrupt                                                                                                                  36864
instruction address: 0x00004ed8                                                                                                         36864
instruction cache parity error corrected                                                                                                22624
data address: 0x00000002                                                                                                                 4096
ciod: Message code 0 is not 51 or 4294967295                                                                                             2268
ciod: LOGIN chdir(/p/gb2/draeger/benchmark/dat32k_060205) failed: No such file or directory                                              154

### Parsing

The logparser library was used to parse the log events: it removes variable data from the content, such as IP addresses or variable names, and keeps the constant core of the message as a template. These templates can then be used to build the bag-of-logs embedding.
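
To make the idea of a template concrete, here is a simplified stand-in that masks variable tokens with regular expressions. It is not how logparser works internally and will not reproduce its templates in general; the masking rules and the function name `to_template` are assumptions made for this illustration.

```python
import re

# Illustrative masking rules, NOT the ones logparser applies.
MASKS = [
    re.compile(r"0x[0-9a-fA-F]+"),          # hexadecimal values
    re.compile(r"\b\d+(?:\.\d+){3}\b"),      # IPv4-like addresses
    re.compile(r"(?<![\w.])\d+(?![\w.])"),   # standalone integers
]

def to_template(content: str) -> str:
    """Replace variable-looking tokens with the <*> placeholder."""
    for pattern in MASKS:
        content = pattern.sub("<*>", content)
    return content

print(to_template("instruction address: 0x00004ed8"))
print(to_template("data TLB error interrupt"))
```

Messages without variable parts pass through unchanged, which is why a constant message keeps its original text as its template.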

In [7]:
df['EventId'].nunique()

125

In [8]:
df['EventTemplate'].nunique()

125
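
Since `EventId` and `EventTemplate` have the same number of unique values, each template appears to map to one identifier. Assuming "bag-of-logs" means a vector of template counts per time window (the source does not define it), a minimal sketch could look like the following. The toy events, the identifiers `e1` and `e2`, and the 60-second window are hypothetical.

```python
import pandas as pd

# Toy events for illustration; EventIds other than 3aa50e45 are made up.
events = pd.DataFrame({
    "Timestamp": [1117838570, 1117838571, 1117838575, 1117838632, 1117838640],
    "EventId":   ["3aa50e45", "3aa50e45", "e1", "e2", "3aa50e45"],
})

window_seconds = 60  # hypothetical window size
# Assign each event to a fixed-width window counted from the first timestamp.
window = (events["Timestamp"] - events["Timestamp"].min()) // window_seconds

# One row per window, one column per template, cells = occurrence counts.
bag = pd.crosstab(window.rename("window"), events["EventId"])
print(bag)
```

Applied to this slice of the log, each row would have one column per distinct template, so 125 columns.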

In [9]:
df['EventTemplate'].value_counts().head(20)

EventTemplate
generating <*>                                                                        189334
data TLB error interrupt                                                              152659
instruction address: <*>                                                               36963
data storage interrupt                                                                 36864
instruction cache parity error corrected                                               22624
CE sym <*> at <*> mask <*>                                                              8262
<*> double-hummer alignment exceptions                                                  4737
data address: <*>                                                                       4186
ciod: failed to read message prefix on control stream (CioStream socket to <*>:<*>      3181
ciod: cpu <*> at treeaddr <*> sent unrecognized message <*>                             2569
ciod: Message code <*> is not <*> or <*><*>             