
Pre-training input format? #1

@bolu61

Hi, I was testing the pre-trained model, but I can't seem to get it to behave correctly on one of the pre-training tasks:

from transformers import AutoTokenizer, BartForConditionalGeneration

model_path = "/path/to/pretrained/PreLog"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BartForConditionalGeneration.from_pretrained(model_path)

# Text-infilling input: a single <mask> token in place of the masked span
x = tokenizer("authentication <mask>; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4", return_tensors="pt")
y = model.generate(**x, max_length=50)

print(tokenizer.decode(y[0]))

I get: </s><s>authentication; logname=0 uid=0 euid=0 tty=NODEVssh ruser=0 rhost=218.188.2.4</s>, which is unexpected. I expected it to fill in the <mask> token with "failure". This log line is taken directly from the Linux dataset in LogHub, which should be part of the pre-training data. I found the pre-trained weights here: https://figshare.com/s/5a08ef8b02b94f6726c2.

Is the input format I used wrong?
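As a sanity check on my side (assuming the tokenizer follows BART's convention of a single dedicated mask token), one way to verify the input format is to confirm that the literal <mask> string actually encodes to tokenizer.mask_token_id:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/pretrained/PreLog")

# The mask token and its id as registered in the tokenizer
print(tokenizer.mask_token, tokenizer.mask_token_id)

# Confirm that the literal "<mask>" in the input maps to a single mask token
ids = tokenizer("authentication <mask>;")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))

If <mask> were split into several subword tokens instead of one mask token, that could explain the model ignoring it, but I haven't confirmed which behavior applies here.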
