# Pre-analysis of the BlueGene/L dataset 
This notebook uses data mining techniques to better understand the data.

### About the dataset

BGL is an open dataset of logs collected from a BlueGene/L supercomputer system at Lawrence Livermore National Laboratory (LLNL) in Livermore, California. The system has 131,072 processors and 32,768 GB of memory. The log contains alert and non-alert messages, identified by alert category tags: in the first column of the log, "-" indicates a non-alert message, while any other value marks an alert. This label information makes the dataset suitable for alert detection and prediction research, and it has been used in several studies on log parsing, anomaly detection, and failure prediction.

### Structure of log

Logs are structured in the format "LABEL TIMESTAMP DATE NODE DATE-FULL NODE(again) TYPE COMPONENT LEVEL CONTENT". For example:

"- 1117838570 2005.06.03 R02-M1-N0-C:J12-U11 2005-06-03-15.42.50.363779 R02-M1-N0-C:J12-U11 RAS KERNEL INFO instruction cache parity error corrected"
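
To make the format concrete, the sketch below splits one raw line into its ten fields. It is only an illustration, not the implementation of `parse_logs` in `utils.parser`: the helper `split_bgl_line` and the field name `NodeRepeat` are hypothetical, and it assumes the first nine fields never contain spaces.

```python
# Field names follow the format string above; NODE appears twice, so the
# second occurrence gets the hypothetical name "NodeRepeat".
BGL_FIELDS = ["Label", "Timestamp", "Date", "Node", "DateFull",
              "NodeRepeat", "Type", "Component", "Level", "Content"]

def split_bgl_line(line):
    """Split one raw BGL line into a dict keyed by BGL_FIELDS."""
    # Assumption: the first nine fields contain no spaces, so everything
    # after the ninth split belongs to Content.
    parts = line.strip().split(maxsplit=len(BGL_FIELDS) - 1)
    if len(parts) < len(BGL_FIELDS):
        return None  # too few fields, e.g. a line with no content
    return dict(zip(BGL_FIELDS, parts))

example = ("- 1117838570 2005.06.03 R02-M1-N0-C:J12-U11 "
           "2005-06-03-15.42.50.363779 R02-M1-N0-C:J12-U11 "
           "RAS KERNEL INFO instruction cache parity error corrected")
record = split_bgl_line(example)
print(record["Level"], "|", record["Content"])
```

Returning `None` for short lines mirrors the idea of skipping lines with no content; whether that is the right policy is a separate decision.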

In [None]:
# dependencies
from utils.parser import parse_logs
from utils.paths import project_root

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

In [None]:
truncate_proportion = 1.0 # proportion of the log to truncate for testing

# loads parsed logs if needed
parse_logs(f'{project_root()}/data', 'BGL.log', f'{project_root()}/data/parsed', truncate_proportion)
# TODO: lines with no content are currently skipped; decide later whether to keep this behavior

In [None]:
df = pd.read_csv(f'{project_root()}/data/parsed/cleaned_BGL_structured.csv')
df.head()

In [None]:
df.shape

In [None]:
df['Label'].unique()

In [None]:
df[df['Label']!='-']

## Understanding Features
Labels indicate the type of event that generated the log line. Most lines have a "-" label, meaning no problem was identified in that line. According to the article "What Supercomputers Say: A Study of Five System Logs" (https://ieeexplore.ieee.org/document/4273008), the alert types can be split into the following categories:

![labels](../../images/labels.png) 


In [None]:
# anomalies distribution
df['Label'].value_counts()
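
Since "-" marks a non-alert line and any other tag marks an alert, the label can be collapsed into a binary flag. The sketch below does this on a small toy frame rather than on `df`; the column name `IsAlert` is an assumption made for illustration, and the tags in the toy data are only example values.

```python
import pandas as pd

# Toy stand-in for df; the alert tags are illustrative values only.
toy = pd.DataFrame({"Label": ["-", "-", "KERNDTLB", "-", "APPREAD"]})

# "-" marks a non-alert line; any other tag marks an alert.
toy["IsAlert"] = toy["Label"] != "-"

# The mean of a boolean column is the share of True values.
alert_share = toy["IsAlert"].mean()
print(f"{alert_share:.1%} of lines are alerts")
```

The same two lines should apply to `df` unchanged, assuming its `Label` column holds the raw tags.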

Similar to labels, the level field indicates the severity of the logged event. It ranges from a plain INFO message to a failure or kill.

In [None]:
df['Level'].value_counts()
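
One way to check how level relates to label is to cross-tabulate the two. The sketch below uses a toy frame with made-up rows, so the counts say nothing about the real dataset; the `Kind` name and the "alert"/"normal" values are assumptions introduced here.

```python
import pandas as pd

# Toy data with made-up rows; not taken from BGL.
toy = pd.DataFrame({
    "Label": ["-", "-", "KERNDTLB", "-", "APPREAD"],
    "Level": ["INFO", "INFO", "FATAL", "WARNING", "FATAL"],
})

# Collapse the label into two groups: "normal" for "-", "alert" otherwise.
kind = toy["Label"].ne("-").map({True: "alert", False: "normal"}).rename("Kind")

# Rows are levels, columns are the two groups, cells are line counts.
table = pd.crosstab(toy["Level"], kind)
print(table)
```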

In [None]:
df['Content'].value_counts().head(20)

### Parsing

The logparser library was used to parse the log events, removing variable data from the content, such as IP addresses or variable names, and keeping the core of the message. The resulting templates can be used to create the bag-of-logs embedding.
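
To illustrate what removing variable data means, the sketch below masks a few kinds of variable tokens with regular expressions. It is a simplified stand-in and not how the logparser library works internally; the masking rules, placeholder tokens, and the helper `to_template` are all hypothetical.

```python
import re

# Hypothetical masking rules, applied in order.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),  # IPv4 addresses
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),           # hex literals
    (re.compile(r"\b\d+\b"), "<NUM>"),                      # remaining integers
]

def to_template(content):
    """Replace variable tokens so that similar messages share one template."""
    for pattern, token in MASKS:
        content = pattern.sub(token, content)
    return content

print(to_template("connection from 10.0.0.12 failed after 3 retries"))
# Two messages that differ only in a number map to the same template.
print(to_template("generating core.2275") == to_template("generating core.3"))
```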

In [None]:
df['EventId'].nunique()

In [None]:
df['EventTemplate'].nunique()

In [None]:
df['EventTemplate'].value_counts().head(20)
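
The bag-of-logs embedding mentioned in this section is not defined in the notebook. The sketch below assumes it is analogous to bag-of-words: each window of log lines is represented by how often each event ID occurs in it. The fixed window size, the toy event IDs, and the helper `bag_of_logs` are assumptions made for illustration.

```python
from collections import Counter

# Toy sequence of event IDs (hypothetical values) in log order.
event_ids = ["E1", "E2", "E1", "E3", "E1", "E1", "E2", "E2"]
window_size = 4  # assumption: fixed-size, non-overlapping windows

# Sorted vocabulary fixes the column order of every vector.
vocabulary = sorted(set(event_ids))

def bag_of_logs(events, vocab):
    """Count how often each event in vocab occurs in one window."""
    counts = Counter(events)
    return [counts[e] for e in vocab]

windows = [event_ids[i:i + window_size]
           for i in range(0, len(event_ids), window_size)]
vectors = [bag_of_logs(w, vocabulary) for w in windows]
print(vocabulary)
print(vectors)
```

On the real data, the `EventId` column would take the place of `event_ids`, and windows could instead be defined by time or by node.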