In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

Download the TRAM single-label data:

In [None]:
!mkdir -p ../data/input
!wget -O ../data/input/single_label.json https://raw.githubusercontent.com/center-for-threat-informed-defense/tram/main/data/tram2-data/single_label.json

In this version, we treat the text, the technique and the document title as nodes, all three of them.
The ontology is then:

Nodes: 
    text, technique, doc_title
    
Relationships: 
    uses, found-in

Graph triple types will be:
    text uses technique
    text found-in doc_title
    technique found-in doc_title

In [None]:
data = pd.read_json('../data/input/single_label.json')

In [None]:
data

Collect the unique labels (techniques) and doc_titles, plus every text entry:

In [None]:
all_techniques = data['label'].explode().dropna().unique()
all_techniques

In [None]:
doc_titles = data['doc_title'].explode().dropna().unique()
doc_titles

In [None]:
text = data['text'].to_numpy()
text

Concatenate them into a single node array, in the order techniques, doc_titles, text.
There are 50 techniques, 149 doc_titles and 5089 text entries.

In [None]:
nodes = np.concatenate((all_techniques, doc_titles, text))
nodes

The node list will then have 
    0-49 techniques
    50-198 doc_titles 
    199-5287 text
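
The index ranges above follow directly from the three counts and the concatenation order. A minimal sketch of that arithmetic, where `counts` and `index_ranges` are illustrative names that do not appear elsewhere in this notebook:

```python
# Counts quoted in the text above, in the order the arrays are concatenated.
counts = {"technique": 50, "doc_title": 149, "text": 5089}

def index_ranges(counts):
    """Return {name: (first_index, last_index)} for blocks laid out back to back."""
    ranges, start = {}, 0
    for name, n in counts.items():
        ranges[name] = (start, start + n - 1)
        start += n
    return ranges

ranges = index_ranges(counts)
print(ranges)
# {'technique': (0, 49), 'doc_title': (50, 198), 'text': (199, 5287)}
```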
    
Now to make the numeric triples, we will use the indexes of the nodes from the nodes list.


Let us say that of the two relationships, uses = 0 and found-in = 1

1. we make the triples for text uses technique
2. we make the triples for text found-in doc_title
3. we make the triples for technique found-in doc_title

In [None]:
triples = []
tech2doc = []

np_data = data.to_numpy()

for row in np_data:
    text_index = np.where(nodes == row[0])[0][0]
    technique_index = np.where(nodes == row[1])[0][0]
    doc_title_index = np.where(nodes == row[2])[0][0]

    triples.append((text_index, 0, technique_index))
    triples.append((text_index, 1, doc_title_index))
    tech2doc.append((technique_index, 1, doc_title_index))

tech2doc = np.unique(tech2doc, axis=0)
triples = np.array(triples)
triples = np.append(triples, tech2doc, axis=0)
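
The loop above scans the whole `nodes` array three times per row. As a hedged alternative, the same triples can be built with a dictionary that maps each node string to its first index. The sketch below uses a made-up toy `nodes` array and toy rows, and assumes the same column order (text, technique, doc_title) as the loop above:

```python
import numpy as np

# Toy stand-ins for the notebook's `nodes` array and data rows (made-up values).
nodes = np.array(
    ["T1059", "T1003", "report A", "runs a script", "dumps credentials"],
    dtype=object,
)
rows = [
    ("runs a script", "T1059", "report A"),
    ("dumps credentials", "T1003", "report A"),
]

# Map each node string to its first index, matching np.where(...)[0][0].
node_index = {}
for i, node in enumerate(nodes):
    node_index.setdefault(node, i)

USES, FOUND_IN = 0, 1  # relation ids as defined in the text above
triples, tech2doc = [], set()
for text, technique, doc_title in rows:
    t, k, d = node_index[text], node_index[technique], node_index[doc_title]
    triples.append((t, USES, k))
    triples.append((t, FOUND_IN, d))
    tech2doc.add((k, FOUND_IN, d))  # the set removes duplicate triples

# Append the de-duplicated technique/doc_title triples at the end.
triples = np.array(triples + sorted(tech2doc))
print(triples.shape)
```

Each lookup is then a constant-time dictionary access rather than a scan over all nodes.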

In [None]:
triples

In [None]:
len(triples)

In [None]:
assert len(triples) == 2 * len(np_data) + len(tech2doc)

Split the triples into training and validation sets and save them to files:

In [None]:
!mkdir -p ../data/output/single

In [None]:
output = "../data/output/single"
pd.DataFrame(triples).to_csv(output + '/triples.txt', index=False, header=False, sep=' ')
train, valid = train_test_split(triples, test_size=0.05)
pd.DataFrame(train).to_csv(output + '/train.txt', index=False, header=False, sep=' ')
pd.DataFrame(valid).to_csv(output + '/valid.txt', index=False, header=False, sep=' ')
assert len(train) + len(valid) == len(triples)
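
The split above is random, so `train.txt` and `valid.txt` change on every run. If reproducible files are wanted, `train_test_split` accepts a `random_state` seed. A small sketch on made-up toy triples (the seed value 42 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy array standing in for the notebook's `triples` (made-up values).
toy_triples = np.arange(300).reshape(100, 3)

# The same seed yields the same partition on every call.
train_a, valid_a = train_test_split(toy_triples, test_size=0.05, random_state=42)
train_b, valid_b = train_test_split(toy_triples, test_size=0.05, random_state=42)

print(np.array_equal(valid_a, valid_b))
```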

We also split the nodes into train, validation and test sets for MLM:

In [None]:
def write_file(file_path, _list):
    with open(file_path, 'w') as f:
        for _row in _list:
            f.write(_row.replace("\n", r"\n").replace("\t", r"\t") + "\n")

In [None]:
n_train, n_test = train_test_split(nodes, test_size=0.2)

# pd.DataFrame(n_test).to_csv('./data/output/single/nodes_test.txt', index=False, header=False, sep=' ')

n_train, n_valid = train_test_split(n_train, test_size=0.05)
#pd.DataFrame(n_train).to_csv('./data/output/single/nodes_train.txt', index=False, header=False, sep=' ')
#pd.DataFrame(n_valid).to_csv('./data/output/single/nodes_valid.txt', index=False, header=False, sep=' ')

assert len(n_train) + len(n_test) + len(n_valid) == len(nodes)

write_file(output + '/nodes_train.txt', n_train)
write_file(output + '/nodes_valid.txt', n_valid)
write_file(output + '/nodes_test.txt', n_test)

Save all the nodes to a file. This is somewhat tricky because some node texts contain newline characters; write_file escapes them so that each node stays on exactly one line.

In [None]:
write_file(output + '/nodes.txt', nodes)
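
To see why the escaping matters, the sketch below repeats the `write_file` helper on three made-up node strings and checks that the output has exactly one line per node. Line `i` of `nodes.txt` must correspond to node index `i`, since the triples refer to nodes by index. The temporary path and `toy_nodes` are illustrative only:

```python
import os
import tempfile

def write_file(file_path, _list):
    # Same escaping as the helper above: newlines and tabs become literal \n and \t.
    with open(file_path, 'w') as f:
        for _row in _list:
            f.write(_row.replace("\n", r"\n").replace("\t", r"\t") + "\n")

toy_nodes = ["T1059", "first line\nsecond line", "col1\tcol2"]
path = os.path.join(tempfile.mkdtemp(), "nodes.txt")
write_file(path, toy_nodes)

with open(path) as f:
    lines = f.read().split("\n")[:-1]  # drop the empty piece after the final newline

print(len(lines) == len(toy_nodes))
print(lines[1])
```

Note that the escaping is one-way: a node that already contains a literal backslash followed by `n` cannot be told apart from an escaped newline afterwards.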

Now we follow KEPLER's README.md and prepare the KE and MLM data from the above files.
    We will use the nodes...txt as our MLM data.
    We will use the triples...txt as our KE data.
    

We first install the local version of KEPLER, which is built by extending fairseq.
We then start with KE data preprocessing:

In [None]:
!pip install --editable ../../

1. Encode the entity descriptions with the GPT-2 BPE:

In [None]:
!mkdir -p ../data/gpt2_bpe
!wget -O ../data/gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
!wget -O ../data/gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe

!python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json ../data/gpt2_bpe/encoder.json \
    --vocab-bpe ../data/gpt2_bpe/vocab.bpe \
    --inputs ../data/output/single/nodes.txt \
    --outputs ../data/output/single/nodes.bpe \
    --keep-empty \
    --workers 60

2. Do negative sampling and dump the whole training and validation data:

In [None]:
!python ../../examples/KEPLER/Pretrain/KGpreprocess.py --dumpPath ../data/output/single/KE1 \
    -ns 1 \
    --ent_desc ../data/output/single/nodes.bpe \
    --train ../data/output/single/train.txt \
    --valid ../data/output/single/valid.txt
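
For intuition, negative sampling for knowledge embedding usually means corrupting the head or the tail of a true triple with a random entity. The sketch below illustrates that general idea on made-up triples. It is not the logic of `KGpreprocess.py`, and `corrupt` is a hypothetical helper:

```python
import random

def corrupt(triple, num_entities, known, rng):
    """Return (neg_head, neg_tail) triples that do not occur in `known`."""
    h, r, t = triple
    while True:  # resample until the corrupted head gives an unseen triple
        nh = rng.randrange(num_entities)
        if (nh, r, t) not in known:
            break
    while True:  # same for the corrupted tail
        nt = rng.randrange(num_entities)
        if (h, r, nt) not in known:
            break
    return (nh, r, t), (h, r, nt)

# Made-up triples over 5 entities; relation 0 = uses, 1 = found-in.
known = {(3, 0, 0), (3, 1, 2), (4, 0, 1), (4, 1, 2)}
rng = random.Random(0)
neg_head, neg_tail = corrupt((3, 0, 0), 5, known, rng)
print(neg_head, neg_tail)
```

With `-ns 1`, one negative sample is drawn per positive triple.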

3. Then randomly split the KE training data into smaller parts, so that the number of training instances in each part aligns with the MLM training data.
In our case there will be just one split, since our data is small.

In [None]:
!python ../../examples/KEPLER/Pretrain/splitDump.py --Path ../data/output/single/KE1 \
    --split_size 6834352 \
    --negative_sampling_size 1

4. We then binarize them for training:

In [None]:
%%bash

wget -O ../data/gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt

KE_Data=../data/output/single/KE1_0/
for SPLIT in head tail negHead negTail;
  do
    python -m fairseq_cli.preprocess \
      --only-source \
      --srcdict ../data/gpt2_bpe/dict.txt \
      --trainpref ${KE_Data}${SPLIT}/train.bpe \
      --validpref ${KE_Data}${SPLIT}/valid.bpe \
      --destdir ${KE_Data}${SPLIT} \
      --workers 60; \
  done

We now start with MLM data preprocessing:


1. Now we encode nodes_train, nodes_valid and nodes_test with the GPT-2 BPE:
   (gpt2_bpe is already downloaded during the KE data preparation, we reuse that.)

In [None]:
%%bash

mkdir -p ../data/output/single/MLM

for SPLIT in train valid test; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json ../data/gpt2_bpe/encoder.json \
        --vocab-bpe ../data/gpt2_bpe/vocab.bpe \
        --inputs ../data/output/single/nodes_${SPLIT}.txt \
        --outputs ../data/output/single/MLM/nodes_${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done

2. We then preprocess/binarize the data using the GPT-2 fairseq dictionary:

In [None]:
%%bash

mkdir -p ../data/output/single/MLM-bin

fairseq-preprocess \
    --only-source \
    --srcdict ../data/gpt2_bpe/dict.txt \
    --trainpref ../data/output/single/MLM/nodes_train.bpe \
    --validpref ../data/output/single/MLM/nodes_valid.bpe \
    --testpref ../data/output/single/MLM/nodes_test.bpe \
    --destdir ../data/output/single/MLM-bin \
    --workers 60

All preprocessing is done, now we try out training the model with our data.

We first download the pretrained models:

In [None]:
%%bash

mkdir -p ../data/keplerModels

# Download each checkpoint only if it is not already present.
# Note: wget needs -O (output file); lowercase -o would write the log there instead.
if ! [ -f ../data/keplerModels/KEPLERforNLP.pt ]; then
    wget -O ../data/keplerModels/KEPLERforNLP.pt https://cloud.tsinghua.edu.cn/seafhttp/files/a21e5254-ceac-4b88-88e9-8ec58cbe8a1a/KEPLERforNLP.pt
fi
if ! [ -f ../data/keplerModels/KEPLERforKE.pt ]; then
    wget -O ../data/keplerModels/KEPLERforKE.pt https://cloud.tsinghua.edu.cn/seafhttp/files/a684dc30-6a1a-4613-97ad-0144ae84e1ca/KEPLERforKE.pt
fi

We then start training from the KEPLERforNLP checkpoint:

In [None]:
%%bash

TOTAL_UPDATES=125000                                    # Total number of training steps
WARMUP_UPDATES=10000                                    # Warmup the learning rate over this many updates
LR=6e-04                                                # Peak LR for polynomial LR scheduler.
NUM_CLASSES=2                           
MAX_SENTENCES=3                                         # Batch size.
NUM_NODES=1                                             # Number of machines
ROBERTA_PATH="../data/keplerModels/KEPLERforNLP.pt"     # Path to the original roberta model
CHECKPOINT_PATH="../data/checkpoints"                   # Directory to store the checkpoints
UPDATE_FREQ=`expr 784 / $NUM_NODES`                     # Increase the batch size

DATA_DIR=../data/output/single

#Path to the preprocessed KE dataset, each item corresponds to a data directory for one epoch
KE_DATA=$DATA_DIR/KE1_0:

DIST_SIZE=`expr $NUM_NODES \* 4`

fairseq-train $DATA_DIR/MLM-bin --KEdata $KE_DATA --restore-file $ROBERTA_PATH \
        --save-dir $CHECKPOINT_PATH \
        --max-sentences $MAX_SENTENCES \
        --tokens-per-sample 512 \
        --task MLMetKE \
        --sample-break-mode complete \
        --required-batch-size-multiple 1 \
        --arch roberta_base \
        --criterion MLMetKE \
        --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
        --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
        --clip-norm 0.0 \
        --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_UPDATES --warmup-updates $WARMUP_UPDATES \
        --update-freq "$UPDATE_FREQ" \
        --negative-sample-size 1 --ke-model TransE \
        --init-token 0 \
        --separator-token 2 \
        --gamma 4 --nrelation 822 \
        --skip-invalid-size-inputs-valid-test \
        --fp16 --fp16-init-scale 2 --threshold-loss-scale 1 --fp16-scale-window 128 \
        --reset-optimizer --distributed-world-size "${DIST_SIZE}" --ddp-backend no_c10d --distributed-port 23456 \
        --log-format simple --log-interval 1 \
        #--relation-desc  #Add this option to encode the relation descriptions as relation embeddings (KEPLER-Rel in the paper)
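
As a rough check on the settings above, the number of sentences per optimizer step can be estimated as per-GPU batch size times update frequency times number of workers. The sketch below redoes the shell arithmetic in Python. It assumes that `DIST_SIZE` reflects 4 GPUs per machine, as the `NUM_NODES \* 4` expression suggests:

```python
# Values copied from the training cell above.
MAX_SENTENCES = 3                 # sentences per GPU per forward/backward pass
NUM_NODES = 1                     # number of machines
UPDATE_FREQ = 784 // NUM_NODES    # gradient accumulation steps (expr 784 / $NUM_NODES)
DIST_SIZE = NUM_NODES * 4         # total workers, assuming 4 GPUs per machine

# Sentences contributing to each optimizer update.
effective_batch = MAX_SENTENCES * UPDATE_FREQ * DIST_SIZE
print(effective_batch)  # 9408
```

On a machine with fewer GPUs, `DIST_SIZE` would need to be lowered, and `UPDATE_FREQ` raised in proportion to keep the same effective batch size.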