# Log line embedding using a BERT-based encoder

This notebook shows how to use a pretrained DistilBERT-based model to embed log lines into a vector space, using the Hugging Face Transformers and Datasets libraries.

Note: This notebook assumes the [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/) project directory structure and expects to be located in the /notebooks/ folder.

In [1]:
from datasets import load_dataset
import numpy as np
from dataclasses import dataclass
from typing import List, Union, Dict, Optional
import torch
from transformers import DistilBertTokenizerFast, DistilBertPreTrainedModel, DistilBertModel
from transformers.file_utils import ModelOutput
from pathlib import Path

Set up commonly used objects and constants.

In [2]:
project_base_dir = Path.cwd().parent
base_pretrained_model_name = "distilbert-base-cased"

## Dataset preparation
First we load the HDFS1 dataset and select its first 1000 lines as a demonstration subset.

In [3]:
hdfs1_path = project_base_dir / 'data' / 'raw' / 'HDFS1' / 'HDFS.log'
hdfs1_dataset = load_dataset('text', data_files=str(hdfs1_path), split='train')
small_raw_dataset = hdfs1_dataset.select(range(1000))

Using custom data configuration default-f7d20bad4b8d075b
Reusing dataset text (/home/cernypro/.cache/huggingface/datasets/text/default-f7d20bad4b8d075b/0.0.0/44d63bd03e7e554f16131765a251f2d8333a5fe8a73f6ea3de012dbc49443691)


Now we perform rudimentary log-line preprocessing, removing the timestamp from each line. (Note that the model used in this notebook was pretrained on data preprocessed this way.)

In [4]:
def remove_timestamp(example):
    # Find the third occurrence of a space and keep only the text after it.
    # This is a quick, non-robust solution: it assumes every line has at least three spaces.
    s = example['text']
    example['text'] = s[s.find(' ', s.find(' ', s.find(' ')+1)+1)+1:]
    return example

small_cleaned_dataset = small_raw_dataset.map(remove_timestamp)

Loading cached processed dataset at /home/cernypro/.cache/huggingface/datasets/text/default-f7d20bad4b8d075b/0.0.0/44d63bd03e7e554f16131765a251f2d8333a5fe8a73f6ea3de012dbc49443691/cache-9078a3a0732e2ad5.arrow
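
To make the preprocessing step concrete, here is a minimal sketch of a slightly more defensive variant, shown on a single string. The helper name `strip_leading_fields` and the sample line are hypothetical and only illustrate the idea of dropping the first three space-separated fields; this is not the function the model was pretrained with.

```python
def strip_leading_fields(line: str, n_fields: int = 3) -> str:
    """Drop the first `n_fields` space-separated fields of a log line."""
    # maxsplit=n_fields keeps the rest of the message intact as the last element
    parts = line.split(' ', n_fields)
    # If the line has fewer fields than expected, return it unchanged
    return parts[n_fields] if len(parts) > n_fields else line


# Hypothetical line in an HDFS-like layout: three leading fields, then the message
sample = "081109 203518 143 INFO dfs.DataNode: Receiving block blk_123 src: /10.0.0.1:5000"
cleaned = strip_leading_fields(sample)
print(cleaned)
```

Unlike the nested `find` calls, this variant leaves lines with fewer than three spaces untouched instead of slicing them at an arbitrary position.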


## Transformer model preparation
Here we'll prepare the Transformer model classes.

In [5]:
@dataclass
class EmbeddingOutput(ModelOutput):
    """
    ModelOutput subclass following Hugging Face Transformers conventions; it may be replaced by a suitable class from the library if one exists.
    """
    embedding: torch.FloatTensor = None
        
class DistilBertForClsEmbedding(DistilBertPreTrainedModel):
    """
    DistilBertModel with a linear layer applied to [CLS] token.
    Initialize using .from_pretrained(path_or_model_name) method
    """
    def __init__(self, config):
        super().__init__(config)
        if config.task_specific_params is None:
            config.task_specific_params = dict()

        self.distilbert = DistilBertModel(config)
        self.cls_projector = torch.nn.Linear(config.dim, config.task_specific_params.setdefault('cls_embedding_dimension', 512))

        self.init_weights()
    
    def forward(self, input_ids, attention_mask):
        bert_output = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        cls_token_embedding = bert_output.last_hidden_state[:, 0]
        cls_encoding = self.cls_projector(cls_token_embedding)
        return EmbeddingOutput(embedding=cls_encoding)

Now load the model from a checkpoint and prepare its tokenizer.

In [6]:
embedding_model_directory = project_base_dir / 'models' / 'Line_Encoder_ICT_8M_loglines_2_epochs'

encoder_model = DistilBertForClsEmbedding.from_pretrained(embedding_model_directory).to('cuda')
# The tokenizer must match the one used for the saved model; this model uses the distilbert-base-cased tokenizer.
tokenizer = DistilBertTokenizerFast.from_pretrained(base_pretrained_model_name)

Here we'll prepare the encode function, which we will map over our dataset to add an embedding column containing the vector embedding of each log line.

We apply this function in batches (for faster processing), since both the tokenizer and the model can handle batched data. The batch size was chosen arbitrarily.

The encode function takes two additional arguments, which must be passed to the map function as a dict via fn_kwargs. (Closures would also work, but I find this approach cleaner and easier to copy from a notebook into a script.)

See [Datasets .map documentation](https://huggingface.co/docs/datasets/processing.html#processing-data-with-map) for more info
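
As a rough illustration of the calling convention, the sketch below imitates a batched map in plain Python. `map_in_batches` and `add_length` are hypothetical toy stand-ins written for this example, not the Datasets implementation; they only show that the mapped function receives a dict of column lists plus the extra keyword arguments, and returns a dict of new columns.

```python
from typing import Callable, Dict, List


def map_in_batches(columns: Dict[str, List], fn: Callable,
                   batch_size: int, fn_kwargs: dict) -> Dict[str, List]:
    """Toy stand-in for a batched map: call `fn` on column slices, collect new columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns: Dict[str, List] = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        # Extra arguments reach the function as keyword arguments
        for name, values in fn(batch, **fn_kwargs).items():
            new_columns.setdefault(name, []).extend(values)
    # Returned columns are added to (or replace) the existing ones
    return {**columns, **new_columns}


batch_sizes_seen = []


def add_length(examples, log):
    """Example mapped function: records the batch size and returns a new column."""
    log.append(len(examples['text']))
    return {'length': [len(text) for text in examples['text']]}


data = {'text': [f'line {i}' for i in range(1000)]}
out = map_in_batches(data, add_length, batch_size=128,
                     fn_kwargs={'log': batch_sizes_seen})
print(len(batch_sizes_seen), batch_sizes_seen[-1])  # 8 batches, the last holding 104 rows
```

With 1000 rows and a batch size of 128, the function is called 8 times, and the final call sees the remaining 104 rows.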

In [7]:
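def encode(examples, tokenizer, encoder):
    # Tokenize the whole batch at once, padding to the longest line and truncating overly long ones
    inputs = tokenizer(examples['text'],
                       return_tensors='pt',
                       truncation=True,
                       padding=True).to('cuda')
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        embeddings = encoder(**inputs).embedding
    return {'embedding': embeddings.cpu().numpy().tolist()}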

small_embedded_dataset = small_cleaned_dataset.map(encode,
                                                   fn_kwargs={'tokenizer': tokenizer,
                                                              'encoder': encoder_model},
                                                   batched=True,
                                                   batch_size=128)
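
The `padding=True` and `truncation=True` arguments are what allow lines of different lengths to be stacked into one tensor. The sketch below is a simplified, plain-Python picture of that idea. `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption for illustration; the real tokenizer also takes care of special tokens when truncating, which this sketch does not.

```python
from typing import Dict, List, Optional


def pad_batch(sequences: List[List[int]], pad_id: int = 0,
              max_length: Optional[int] = None) -> Dict[str, List[List[int]]]:
    """Pad token-id sequences to a common length and build the matching attention mask."""
    if max_length is not None:
        # Simplified truncation: cut each sequence at max_length
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    # Shorter sequences are filled with pad_id on the right
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


# Made-up token ids for two lines of different length
padded = pad_batch([[101, 7, 8, 102], [101, 9, 102]], max_length=512)
print(padded['input_ids'])
print(padded['attention_mask'])
```

The `input_ids` and `attention_mask` keys mirror the two arguments that the model's `forward` method expects.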


Here we can see the first three embeddings alongside their original lines, using the slicing notation that Datasets supports.
The returned object is a dict with column names as keys and lists of column contents as values.

In [8]:
small_embedded_dataset[:3]

{'embedding': [[-0.07249397784471512,
   0.08337511122226715,
   0.44695860147476196,
   -0.0203128382563591,
   -0.0008805245161056519,
   0.033679693937301636,
   -0.2584163248538971,
   -0.01713814213871956,
   -0.13464701175689697,
   0.2772810161113739,
   -0.10751543939113617,
   -0.02893763780593872,
   -0.1402347981929779,
   -0.1562652587890625,
   -0.2721223533153534,
   0.07620158791542053,
   -0.14240868389606476,
   0.3091081380844116,
   -0.00977490097284317,
   -0.027673259377479553,
   0.22512134909629822,
   0.02421267330646515,
   -0.1515868902206421,
   0.02557799220085144,
   -0.059474557638168335,
   0.13023941218852997,
   -0.11966827511787415,
   -0.2382989227771759,
   -0.04261612892150879,
   0.11512461304664612,
   -0.2678832411766052,
   0.27379319071769714,
   0.47852030396461487,
   0.007744520902633667,
   0.2810533046722412,
   -0.08203402161598206,
   0.027505144476890564,
   -0.1304769665002823,
   -0.06472301483154297,
   0.19899003207683563,
   -0.081
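
Because a slice comes back in column-oriented form, it can be handy to turn it into one record per row. The following is a minimal sketch in plain Python; `columns_to_rows` is a hypothetical helper, and the short vectors and texts are placeholder values rather than real model output.

```python
from typing import Dict, List


def columns_to_rows(batch: Dict[str, List]) -> List[dict]:
    """Convert a dict of equally long column lists into a list of per-row dicts."""
    names = list(batch)
    # zip(*columns) walks all columns in lockstep, yielding one tuple per row
    return [dict(zip(names, values))
            for values in zip(*(batch[name] for name in names))]


# Placeholder data with the same layout as a sliced dataset
batch = {
    'embedding': [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    'text': ['line a', 'line b', 'line c'],
}
rows = columns_to_rows(batch)
print(rows[0])
```

Each element of `rows` then pairs one log line with its own embedding vector.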

The dataset can be saved to a folder as follows and later loaded with the load_from_disk(path) function from the datasets library.

In [9]:
small_dataset_path = project_base_dir / 'data' / 'processed' / 'demo-small-embedding-dataset'
small_embedded_dataset.save_to_disk(small_dataset_path)