# SciBERT for Single-Label Classification

[![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/center-for-threat-informed-defense/tram/blob/main/user_notebooks/fine_tune_single_label.ipynb)

This notebook continues fine-tuning the provided SciBERT single-label sequence-classification model on custom data.

In [None]:
!mkdir -p scibert_single_label_model
!wget https://ctidtram.blob.core.windows.net/tram-models/single-label-202308303/config.json -O scibert_single_label_model/config.json
!wget https://ctidtram.blob.core.windows.net/tram-models/single-label-202308303/pytorch_model.bin -O scibert_single_label_model/pytorch_model.bin
!pip install torch transformers pandas

This cell instantiates the label encoder. Do not modify this cell, as the classes (i.e., ATT&CK techniques) and their order must match those the model expects.

In [9]:
from sklearn.preprocessing import OneHotEncoder as OHE

CLASSES = [
   'T1003.001', 'T1005', 'T1012', 'T1016', 'T1021.001', 'T1027',
   'T1033', 'T1036.005', 'T1041', 'T1047', 'T1053.005', 'T1055',
   'T1056.001', 'T1057', 'T1059.003', 'T1068', 'T1070.004',
   'T1071.001', 'T1072', 'T1074.001', 'T1078', 'T1082', 'T1083',
   'T1090', 'T1095', 'T1105', 'T1106', 'T1110', 'T1112', 'T1113',
   'T1140', 'T1190', 'T1204.002', 'T1210', 'T1218.011', 'T1219',
   'T1484.001', 'T1518.001', 'T1543.003', 'T1547.001', 'T1548.002',
   'T1552.001', 'T1557.001', 'T1562.001', 'T1564.001', 'T1566.001',
   'T1569.002', 'T1570', 'T1573.001', 'T1574.002'
]

encoder = OHE(sparse_output=False)
encoder.fit([[c] for c in CLASSES])

encoder.categories_

[array(['T1003.001', 'T1005', 'T1012', 'T1016', 'T1021.001', 'T1027',
        'T1033', 'T1036.005', 'T1041', 'T1047', 'T1053.005', 'T1055',
        'T1056.001', 'T1057', 'T1059.003', 'T1068', 'T1070.004',
        'T1071.001', 'T1072', 'T1074.001', 'T1078', 'T1082', 'T1083',
        'T1090', 'T1095', 'T1105', 'T1106', 'T1110', 'T1112', 'T1113',
        'T1140', 'T1190', 'T1204.002', 'T1210', 'T1218.011', 'T1219',
        'T1484.001', 'T1518.001', 'T1543.003', 'T1547.001', 'T1548.002',
        'T1552.001', 'T1557.001', 'T1562.001', 'T1564.001', 'T1566.001',
        'T1569.002', 'T1570', 'T1573.001', 'T1574.002'], dtype=object)]
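To see why the class order matters, it can help to look at what a one-hot encoding does in plain Python. The following is a minimal sketch, assuming a hypothetical three-class subset; `one_hot` and `decode` are illustrative helpers and not part of the notebook. scikit-learn's `OneHotEncoder` sorts the categories it is fitted on, and `CLASSES` is already sorted, so the column order matches the list order.

```python
# Hypothetical three-class subset, used only for illustration.
classes = sorted(['T1003.001', 'T1005', 'T1012'])
label_to_index = {label: i for i, label in enumerate(classes)}

def one_hot(label):
    """Return a vector with a single 1.0 at the label's column."""
    vec = [0.0] * len(classes)
    vec[label_to_index[label]] = 1.0
    return vec

def decode(scores):
    """Map a row of per-class scores back to the highest-scoring label."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best]

encoded = one_hot('T1005')
decoded = decode([0.1, 0.2, 0.7])
```

If the list were reordered, the same column index would decode to a different technique, which is why the model's predictions are only meaningful with the original order.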

This cell loads the training data, and you will need to modify it to load your own. By the end of the cell, the variable `data` must hold a DataFrame with a `text` column containing the segments and a `label` column containing one string per row, each an ATT&CK ID that this model can classify. How the DataFrame is indexed and which other columns it has, if any, do not matter.

For demonstration purposes, we will use the same single-label data that was produced during this TRAM effort, even though the model was trained on this data already. This cell is only present to show the expected format of the `data` DataFrame, and is not intended to be run as shown.

In [13]:
import pandas as pd
data = pd.read_json('../single_label.json').drop(columns='doc_title').head(500)
data

| | text | label |
|---|---|---|
| 0 | This file extracts credentials from LSASS simi... | T1003.001 |
| 1 | It calls OpenProcess on lsass.exe with access ... | T1003.001 |
| 2 | It spreads to Microsoft Windows machines using... | T1210 |
| 3 | SMB exploitation via EternalBlue | T1210 |
| 4 | SMBv1 Exploitation via EternalBlue | T1210 |
| ... | ... | ... |
| 495 | The unpacked sample is approximately 540 KB | T1027 |
| 496 | decompress data blobs | T1140 |
| 497 | decompress them | T1140 |
| 498 | The decompression function | T1140 |
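Before training on your own data, it may be worth checking that every row meets the format described above. The following is a minimal sketch in plain Python; `find_invalid_rows` is a hypothetical helper, and the label subset and records are made up for illustration.

```python
def find_invalid_rows(records, known_labels):
    """Return (row_number, problem) pairs for rows the model cannot use."""
    known = set(known_labels)
    problems = []
    for i, rec in enumerate(records):
        text = rec.get('text')
        label = rec.get('label')
        # The text column must hold a non-empty string.
        if not isinstance(text, str) or not text.strip():
            problems.append((i, 'missing or empty text'))
        # The label must be one of the IDs the model was trained on.
        if label not in known:
            problems.append((i, f'unknown label: {label!r}'))
    return problems

known_labels = ['T1003.001', 'T1140', 'T1210']  # hypothetical subset
records = [
    {'text': 'SMB exploitation via EternalBlue', 'label': 'T1210'},
    {'text': 'decompress data blobs', 'label': 'T9999'},
    {'text': '', 'label': 'T1140'},
]
problems = find_invalid_rows(records, known_labels)
```

With the notebook's own objects, the equivalent call would be `find_invalid_rows(data.to_dict('records'), CLASSES)`; an empty result suggests the DataFrame is in the expected shape.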


In [16]:
import transformers
import torch

cuda = torch.device('cuda')

tokenizer = transformers.BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased", max_length=512)
bert = transformers.BertForSequenceClassification.from_pretrained('scibert_single_label_model').to(cuda).train()

In [19]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, shuffle=True)

def _load_data(x, y, batch_size=10):
    x_len, y_len = x.shape[0], y.shape[0]
    assert x_len == y_len
    for i in range(0, x_len, batch_size):
        slc = slice(i, i + batch_size)
        yield x[slc].to(cuda), y[slc].to(cuda)

def _tokenize(instances: list[str]):
    return tokenizer(instances, return_tensors='pt', padding='max_length', truncation=True, max_length=512).input_ids

def _encode_labels(labels):
    """:labels: a single-column DataFrame such as `data[['label']]`; the encoder expects 2-D input"""
    return torch.Tensor(encoder.transform(labels))

In [21]:
x_train = _tokenize(train['text'].tolist())
x_train

tensor([[  102,  4546,   217,  ...,     0,     0,     0],
        [  102,   106,  2289,  ...,     0,     0,     0],
        [  102,   111,  1384,  ...,     0,     0,     0],
        ...,
        [  102,  9683,   972,  ...,     0,     0,     0],
        [  102,   111, 24870,  ...,     0,     0,     0],
        [  102, 14397,   111,  ...,     0,     0,     0]])
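The trailing zeros in this output appear to be padding added by `padding='max_length'`, and the training loop later derives an attention mask from them with `x.ne(tokenizer.pad_token_id)`. The following plain-Python sketch shows the same idea on short rows; the pad id of `0`, the separator id `103`, and the helper name `attention_mask` are assumptions made for illustration.

```python
PAD_ID = 0  # assumed pad id, matching the trailing zeros in the output above

def attention_mask(batch_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding, row by row."""
    return [[int(tok != pad_id) for tok in row] for row in batch_ids]

# Two made-up, already padded rows (103 is assumed to be the separator id).
batch = [
    [102, 4546, 217, 103, 0, 0],
    [102, 106, 103, 0, 0, 0],
]
mask = attention_mask(batch)
```

The mask tells the model to ignore padded positions, so short segments are not influenced by the filler tokens.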

In [23]:
y_train = _encode_labels(train[['label']])
y_train



tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

This tensor may appear to contain only zeros, but that is an effect of the truncated display. Its sum equals the number of training rows (400), which is consistent with one `1` per row.

In [24]:
y_train.sum()

tensor(400.)
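A total equal to the row count is a useful sanity check, but it would not catch a row with two ones offset by a row with none. The following is a minimal sketch of a stricter per-row check on plain lists; `rows_not_one_hot` is a hypothetical helper. On the actual tensor, `y_train.sum(dim=1)` gives the per-row sums.

```python
def rows_not_one_hot(matrix):
    """Return indices of rows that are not exactly one 1 and otherwise 0s."""
    bad = []
    for i, row in enumerate(matrix):
        if sum(row) != 1 or any(v not in (0, 1) for v in row):
            bad.append(i)
    return bad

# Made-up example: the grand total is 3 (one per row on average),
# yet only the first row is a valid one-hot vector.
example = [
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
bad_rows = rows_not_one_hot(example)
```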

This cell contains the training loop. You may set `NUM_EPOCHS` to any positive integer.

In [25]:
NUM_EPOCHS = 3

from statistics import mean

from tqdm import tqdm
from torch.optim import AdamW

optim = AdamW(bert.parameters(), lr=2e-5, eps=1e-8)

for epoch in range(NUM_EPOCHS):
    epoch_losses = []
    for x, y in tqdm(_load_data(x_train, y_train, batch_size=10)):
        bert.zero_grad()
        out = bert(x, attention_mask=x.ne(tokenizer.pad_token_id).to(int), labels=y)
        epoch_losses.append(out.loss.item())
        out.loss.backward()
        optim.step()
    print(f"epoch {epoch + 1} loss: {mean(epoch_losses)}")

40it [00:23,  1.68it/s]


epoch 1 loss: 0.009985628997674212


40it [00:20,  1.92it/s]


epoch 2 loss: 0.005113609199179336


40it [00:20,  1.91it/s]

epoch 3 loss: 0.0038467945356387644
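The `40it` in the progress output follows from the batching: with 400 training rows and `batch_size=10`, `_load_data` yields 40 batches per epoch. The following plain-Python sketch reproduces that slicing arithmetic; `batch_bounds` is a hypothetical helper and not part of the notebook.

```python
def batch_bounds(n_rows, batch_size):
    """Return (start, stop) index pairs covering n_rows in order."""
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]

full = batch_bounds(400, 10)    # same sizes as the run above
ragged = batch_bounds(25, 10)   # the last batch is smaller when sizes do not divide evenly
```

If you change the batch size or the amount of data, the number of iterations per epoch changes accordingly, and the final batch may be shorter than the rest.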





If the loss after the final epoch is higher than you would like, do not re-run the previous cell: doing so would re-create the optimizer and discard its accumulated state. Instead, uncomment the following cell and run it for as many additional epochs as you need.

In [None]:
# NUM_EXTRA_EPOCHS = 1
# for epoch in range(NUM_EXTRA_EPOCHS):
#     epoch_losses = []
#     for x, y in tqdm(_load_data(x_train, y_train, batch_size=10)):
#         bert.zero_grad()
#         out = bert(x, attention_mask=x.ne(tokenizer.pad_token_id).to(int), labels=y)
#         epoch_losses.append(out.loss.item())
#         out.loss.backward()
#         optim.step()
#     print(f"extra epoch {epoch + 1} loss: {mean(epoch_losses)}")

The next cells evaluate the performance after the additional fine-tuning. The performance scores on the example data will be high, as the model has already been trained on most of these instances.
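For readers less familiar with the reported metrics, the following is a minimal plain-Python sketch of how per-class precision, recall, and F1 can be computed from lists of actual and predicted labels. `per_class_scores` is a hypothetical helper, the labels are made up, and a score is assumed to be zero when its denominator is zero; the notebook itself relies on scikit-learn for this.

```python
from collections import Counter

def per_class_scores(actual, predicted):
    """Return {label: (precision, recall, f1, support)} for each label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1
        else:
            fn[a] += 1  # the true class was missed
            fp[p] += 1  # the predicted class was wrongly assigned
    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        p_denom = tp[label] + fp[label]
        r_denom = tp[label] + fn[label]
        precision = tp[label] / p_denom if p_denom else 0.0
        recall = tp[label] / r_denom if r_denom else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1, r_denom)
    return scores

# Hypothetical labels, for illustration only.
actual = ['T1055', 'T1055', 'T1027', 'T1027']
predicted = ['T1055', 'T1027', 'T1027', 'T1027']
scores = per_class_scores(actual, predicted)
```

Here one of the two `T1055` instances is missed, so that class has perfect precision but a recall of 0.5, and its F1 is 2 × 1.0 × 0.5 / 1.5 ≈ 0.667.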

In [30]:
bert.eval()

x_test = _tokenize(test['text'].tolist())
y_test = test['label']

batch_size = 20
preds = []

with torch.no_grad():
    for i in range(0, x_test.shape[0], batch_size):
        x = x_test[i : i + batch_size].to(cuda)
        out = bert(x, attention_mask=x.ne(tokenizer.pad_token_id).to(int))
        preds.extend(out.logits.to('cpu'))

import torch.nn.functional as F
from sklearn.metrics import precision_recall_fscore_support as calculate_score
        
predicted_labels = (
    encoder.inverse_transform(
        F.one_hot(
            torch.vstack(preds).softmax(-1).argmax(-1),
            num_classes=len(encoder.categories_[0])
        )
        .numpy()
    )
    .reshape(-1)
)

predicted = list(predicted_labels)
actual = y_test.tolist()

labels = sorted(set(actual) | set(predicted))

scores = calculate_score(actual, predicted, labels=labels)

scores_df = pd.DataFrame(scores).T
scores_df.columns = ['P', 'R', 'F1', '#']
scores_df.index = labels
scores_df.loc['(micro)'] = calculate_score(actual, predicted, average='micro', labels=labels)
scores_df.loc['(macro)'] = calculate_score(actual, predicted, average='macro', labels=labels)

scores_df

Unnamed: 0,P,R,F1,#
T1003.001,1.0,1.0,1.0,2.0
T1005,1.0,1.0,1.0,4.0
T1016,1.0,1.0,1.0,1.0
T1021.001,1.0,1.0,1.0,2.0
T1027,1.0,1.0,1.0,12.0
T1033,1.0,1.0,1.0,2.0
T1041,1.0,1.0,1.0,2.0
T1047,1.0,1.0,1.0,2.0
T1053.005,1.0,1.0,1.0,5.0
T1055,1.0,0.5,0.666667,2.0
