# Parsing and Extracting Log Keys
This section uses `src/parse.py` to parse log files and extract their log keys. The project must be installed first; to install it, run `pip install -e .` in the root directory of the project.

To import the modules, we first import `sys` and `os` and add the `src` directory to the path:

In [1]:
import sys
import os
# Make the modules in ../src importable from this notebook
sys.path.append(os.path.abspath("../src"))

from preprocessor import Preprocessor
from message_encoder import *
from util import get_sorted_log_numbers_by_size

# Preprocessing Parameters
The following parameters control preprocessing:
- `log_files` - Which log files to load for preprocessing. All of these files are loaded to initialize the message encoder, but not all of their entries necessarily end up in the dataset
- `logs_per_class` - How many entries to collect per class. The resulting dataset has size `num_classes*logs_per_class`, unless the end of the log files is reached first
- `window_size` - The size of the sliding window used to build the dataset. Each run of `window_size` consecutive logs is treated as one data entry
- `encoding_output_size` - The size passed to the message encoder. Note that this is not necessarily the shape of the output: for `BERTEmbeddingEncoder`, the output size is multiplied by the length of BERT's hidden state, typically 768 (for example, 16 × 768 = 12288)
- `message_encoder` - The type of message encoding to use. Can be one of `TextVectorizationEncoder` (uses `keras.layers.TextVectorization`), `BERTEncoder` (only uses the BERT tokenizer) or `BERTEmbeddingEncoder` (also uses the BERT model)
- `extended_datetime_features` - Whether to use simple datetime features (days since epoch and seconds since midnight) or extended datetime features, which include a larger set of normalized features extracted from the datetime
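
To make the simple datetime features concrete, here is a minimal sketch of how days since epoch and seconds since midnight could be computed. The function name `simple_datetime_features` is hypothetical and not part of the project, and treating naive timestamps as UTC is an assumption; the actual `Preprocessor` may compute these values differently.

```python
from datetime import datetime, timezone


def simple_datetime_features(ts: datetime) -> tuple[int, int]:
    """Return (days since the Unix epoch, seconds since midnight) for ts.

    Hypothetical helper for illustration only. Naive timestamps are
    assumed to be in UTC.
    """
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    else:
        ts = ts.astimezone(timezone.utc)

    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    days_since_epoch = (ts - epoch).days
    seconds_since_midnight = ts.hour * 3600 + ts.minute * 60 + ts.second
    return days_since_epoch, seconds_since_midnight


example = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(simple_datetime_features(example))  # (19724, 11045)
```

Splitting the timestamp this way separates the long-term position of a log entry (the day) from its position within the day, which is presumably why both values are kept as separate features.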

In [None]:
# preprocessing parameters
log_files = get_sorted_log_numbers_by_size(r"C:\Users\Askion\Documents\agmge\log-classification\data\CCI")[:20]
logs_per_class = 100
window_size = 20
encoding_output_size = 16
message_encoder = BERTEncoder(max_length=encoding_output_size)
extended_datetime_features = True

The next code block creates a `Preprocessor` instance and preprocesses the data. When the `Preprocessor` is instantiated, all logs are loaded from the specified `log_files`, the `message_encoder` is initialized with the log messages found, and the `function_encoder` is initialized. Because all logs are loaded, this takes a while. A keyboard interrupt stops the loading of further logs, and the `Preprocessor` then works with the data loaded up to that point.
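
The interruptible loading described above can be sketched in plain Python as follows. This is only an illustration of the pattern, not the project's code: `load_until_interrupted` and the `load_one` callback are hypothetical names, and the real `Preprocessor` may handle the interrupt at a finer granularity (for example, per log line rather than per file).

```python
def load_until_interrupted(paths, load_one):
    """Load files one by one; on Ctrl+C keep what was loaded so far.

    `paths` is a list of file paths, `load_one` is any callable that
    loads a single file. Both are placeholders for illustration.
    """
    loaded = []
    try:
        for path in paths:
            loaded.append(load_one(path))
    except KeyboardInterrupt:
        # Swallow the interrupt and continue with the partial result.
        print(f"Interrupted, continuing with {len(loaded)} of {len(paths)} files")
    return loaded
```

Because the partial result is returned instead of re-raising the exception, the caller can proceed with preprocessing on whatever was loaded before the interrupt.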

Preprocessing then slides a window over the logs and labels each window. If the label's class is already full (its count has reached `logs_per_class`), the window is discarded; otherwise the window and its label are added to the dataset. Preprocessing stops once all classes are full or no data is left.
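
The windowing and class-capping logic can be sketched roughly like this. It is a simplified illustration under stated assumptions, not the actual `Preprocessor.preprocess` implementation: the function name `build_balanced_windows`, the `label_window` callback, the explicit `num_classes` argument, and the stride of one log per step are all hypothetical.

```python
from collections import Counter


def build_balanced_windows(logs, label_window, window_size, logs_per_class, num_classes):
    """Slide a window over `logs` and collect at most `logs_per_class` windows per label.

    `label_window` is a placeholder callable that maps a window to its class label.
    A stride of 1 is assumed.
    """
    counts = Counter()
    dataset = []
    for start in range(len(logs) - window_size + 1):
        window = logs[start:start + window_size]
        label = label_window(window)
        if counts[label] >= logs_per_class:
            continue  # class is already full, discard this window
        dataset.append((window, label))
        counts[label] += 1
        # Stop early once every class has reached its target count.
        if len(counts) == num_classes and all(c >= logs_per_class for c in counts.values()):
            break
    return dataset, counts


# Toy usage: the label is the parity of the last entry in the window.
data, counts = build_balanced_windows(
    list(range(20)), lambda w: w[-1] % 2, window_size=3, logs_per_class=2, num_classes=2
)
print(len(data), dict(counts))  # 4 {0: 2, 1: 2}
```

If the logs run out before every class is full, the loop simply ends and the dataset is smaller than `num_classes*logs_per_class`.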

In [9]:
# preprocessing
pp = Preprocessor(log_files, message_encoder, logs_per_class, window_size, extended_datetime_features, True)
pp.preprocess()

parsing log file CCLog-backup.749.log:   0%|          | 0/86349 [00:00<?, ?it/s]

parsing log file CCLog-backup.751.log:   0%|          | 0/103334 [00:00<?, ?it/s]

parsing log file CCLog-backup.752.log:   0%|          | 0/102980 [00:00<?, ?it/s]

parsing log file CCLog-backup.750.log:   0%|          | 0/104128 [00:00<?, ?it/s]

parsing log file CCLog-backup.610.log:   0%|          | 0/189786 [00:00<?, ?it/s]

parsing log file CCLog-backup.611.log:   0%|          | 0/197591 [00:00<?, ?it/s]

parsing log file CCLog-backup.595.log:   0%|          | 0/273676 [00:00<?, ?it/s]

parsing log file CCLog-backup.596.log:   0%|          | 0/286621 [00:00<?, ?it/s]

parsing log file CCLog-backup.599.log:   0%|          | 0/257536 [00:00<?, ?it/s]

parsing log file CCLog-backup.585.log:   0%|          | 0/291334 [00:00<?, ?it/s]

parsing log file CCLog-backup.598.log:   0%|          | 0/294818 [00:00<?, ?it/s]

parsing log file CCLog-backup.602.log:   0%|          | 0/410334 [00:00<?, ?it/s]

parsing log file CCLog-backup.603.log:   0%|          | 0/415123 [00:00<?, ?it/s]

parsing log file CCLog-backup.597.log:   0%|          | 0/338397 [00:00<?, ?it/s]

parsing log file CCLog-backup.589.log:   0%|          | 0/443758 [00:00<?, ?it/s]

parsing log file CCLog-backup.588.log:   0%|          | 0/450698 [00:00<?, ?it/s]

parsing log file CCLog-backup.768.log:   0%|          | 0/330362 [00:00<?, ?it/s]

parsing log file CCLog-backup.590.log:   0%|          | 0/588384 [00:00<?, ?it/s]

parsing log file CCLog-backup.600.log:   0%|          | 0/706309 [00:00<?, ?it/s]

parsing log file CCLog-backup.591.log:   0%|          | 0/799535 [00:00<?, ?it/s]

annotating events:   0%|          | 0/5753233 [00:00<?, ?it/s]

State counts:
  - 0 : 100
  - 3 : 100
  - 2 : 100
  - 1 : 100
All states have the desired log count


In the last step, the preprocessor can be saved. Preprocessors are saved as a zip file containing all serialized data about the preprocessor as well as the complete data, both preprocessed and unpreprocessed. The zip file will therefore be relatively large. The naming scheme for the preprocessor zip files is `preprocessor_{len(loaded_files)}files_{logs_per_class}lpc_{window_size}ws_{message_encoder}x{encoding_output_size}{_extdt if extended_datetime_features}`.
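
As an illustration of the naming scheme, the following sketch assembles a file name from the individual parts. The function `preprocessor_archive_name` and its parameter names are hypothetical, and the short encoder tag (such as `BERTemb`) is assumed to be supplied by the caller; how the project actually derives it from the message encoder is not shown here.

```python
def preprocessor_archive_name(num_files, logs_per_class, window_size,
                              encoder_tag, encoding_output_size,
                              extended_datetime_features):
    """Build a zip file name following the naming scheme described above.

    Hypothetical helper for illustration; `encoder_tag` is an assumed
    short name for the message encoder.
    """
    suffix = "_extdt" if extended_datetime_features else ""
    return (
        f"preprocessor_{num_files}files_{logs_per_class}lpc_"
        f"{window_size}ws_{encoder_tag}x{encoding_output_size}{suffix}.zip"
    )


print(preprocessor_archive_name(20, 100, 20, "BERTemb", 12288, False))
# preprocessor_20files_100lpc_20ws_BERTembx12288.zip
```

Encoding the parameters in the file name makes it possible to tell saved preprocessors apart without opening the archive.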

In [10]:
pp.save()

'C:/Users/Askion/Documents/agmge/log-classification/data/preprocessors/preprocessor_20files_100lpc_20ws_BERTembx12288.zip'