# **CyBERT** (**Cy**ber Security **BERT**): Pre-Training and Fine-Tuning

This Colab allows you to further pre-train and/or fine-tune a BERT model. The two steps can be performed independently.

* The prerequisites (Part 0) must be executed before any of the subsequent steps. Activating the GPU is optional but highly recommended, since it accelerates training.

* Part 1 is the pre-training step. It builds a corpus from several text files and then trains the model with the specified hyperparameters.

* Part 2 is the fine-tuning step. It can be run on the pre-trained model from Part 1 or on any other BERT model. A pre-trained CyBERT model is provided in the PEASEC cloud and is downloaded together with the other files in the prerequisites.
Part 2 pre-processes a labeled CSV file and splits it into separate datasets, after which the model is fine-tuned with the specified hyperparameters.

# Part 0: Prerequisites

## Activate GPU

To train the model, activate the GPU in the notebook settings as follows:

*   Edit (Bearbeiten) -> Notebook Settings (Notebook-Einstellungen)
*   In the Hardware Accelerator (Hardwarebeschleuniger) dropdown select 'GPU'


---

Run the following cell to check whether the GPU is enabled.




In [None]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
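
The check above goes through TensorFlow. As a framework-independent cross-check, the following sketch asks the NVIDIA driver directly by calling `nvidia-smi -L`. The helper name `list_nvidia_gpus` is our own and not part of the notebook; it simply returns an empty list when no driver or no GPU is available.

```python
import shutil
import subprocess


def list_nvidia_gpus():
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if none."""
    # Without the NVIDIA tooling on the PATH there is nothing to query.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True
    )
    # A non-zero exit code usually means no usable driver or device.
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


gpus = list_nvidia_gpus()
if gpus:
    print("GPU(s) visible to the driver:")
    for gpu in gpus:
        print("  " + gpu)
else:
    print("No NVIDIA GPU visible; check the notebook settings.")
```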

## Downloading and unzipping files

In this step, all required files are downloaded from the PEASEC Cloud and unzipped. This includes datasets, code and a ready-to-use CyBERT model.

In [None]:
!wget -O cybert_data.zip https://cloud.peasec.de/index.php/s/YCiyap7yK5DfyFx/download
!unzip cybert_data.zip
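
Before continuing, it can be useful to confirm that the archive was unpacked as expected. The sketch below checks for the files and directories that later cells of this notebook refer to. The list is collected from those cells and may not be exhaustive, and the helper name `find_missing` is our own.

```python
from pathlib import Path

# Paths referenced by later cells of this notebook (not an exhaustive list).
EXPECTED_PATHS = [
    "cybert/requirements.txt",
    "cybert/code/run_mlm.py",
    "cybert/code/run_ner.py",
    "cybert/input/Corpus",
    "cybert/input/NER/tagged_all.csv",
    "cybert/model/CyBERT_beta",
]


def find_missing(paths, root="."):
    """Return the entries of `paths` that do not exist below `root`."""
    root = Path(root)
    return [p for p in paths if not (root / p).exists()]


missing = find_missing(EXPECTED_PATHS)
if missing:
    print("Missing after unzip:")
    for path in missing:
        print("  " + path)
else:
    print("All expected files and directories are present.")
```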

## Installing required libraries via pip

The following cell installs the Python packages listed in *cybert/requirements.txt* and then installs 🤗 Transformers from the current default branch of its GitHub repository.


In [None]:
!pip install -r cybert/requirements.txt

!pip install git+https://github.com/huggingface/transformers

# Part 1: Pre-Training CyBERT

If you do not want to pre-train CyBERT yourself, you can jump to [Part 2: Fine-Tuning](#fine_tuning) and use the provided CyBERT model.

## Compiling the datasets


Running the following cell compiles all .txt files in the *cybert/input/Corpus/* directory into a single *cysec_corpus.txt* file, which is then used for pre-training the model.

You can add further text datasets to the corpus by saving them as .txt files in the *cybert/input/Corpus/* directory before running this step.

In [None]:
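import glob
import shutil

# Sort the paths so that the corpus is assembled in a reproducible order.
input_txt_paths = sorted(glob.glob("cybert/input/Corpus/*.txt"))

# Concatenate the files byte-wise, separating them with a newline.
with open('cysec_corpus.txt','wb') as output_corpus:
    for input_txt_file in input_txt_paths:
        with open(input_txt_file,'rb') as input_txt:
            shutil.copyfileobj(input_txt, output_corpus)
            output_corpus.write(b'\n')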
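
A quick sanity check of the compiled corpus can catch an empty or unexpectedly small file before a long training run. The sketch below reports the file size and line counts. The helper name `corpus_stats` is our own, and the check is skipped if the corpus file does not exist yet.

```python
from pathlib import Path


def corpus_stats(path):
    """Return (size in bytes, total lines, non-empty lines) of a text file."""
    path = Path(path)
    total = non_empty = 0
    # Read in binary mode, matching how the corpus was written.
    with path.open("rb") as corpus:
        for line in corpus:
            total += 1
            if line.strip():
                non_empty += 1
    return path.stat().st_size, total, non_empty


corpus_path = Path("cysec_corpus.txt")
if corpus_path.exists():
    size, total, non_empty = corpus_stats(corpus_path)
    print(f"{size} bytes, {total} lines, {non_empty} non-empty lines")
else:
    print("cysec_corpus.txt not found; run the cell above first.")
```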

## Hyperparameters and Training the model

This step consists of specifying the hyperparameters **and** the training itself. You can modify the parameters as needed.

* Pretrained models that can serve as the base for pre-training are listed here: [HuggingFace Pretrained Models](https://huggingface.co/models)

* The full specification of the parameters: [HuggingFace TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments)<br>
You can pass additional parameters by appending arguments to the Python call in the following cell.

In [None]:
MODEL = 'bert-base-cased' #@param ["bert-base-cased"] {allow-input: true}
INPUT_FILE_PATH = 'cysec_corpus.txt'  #@param {type: "string"}

NUM_EPOCHS = 3 #@param {type: "integer"}
BATCH_SIZE = 4 #@param {type: "integer"}
LEARN_RATE = "2e-5" #@param ["5e-5", "2e-5", "1e-5", "1e-4"]
WARMUP_STEPS = 10000 #@param {type: "integer"}
WEIGHT_DECAY = 0.01 #@param {type: "number"}

SAVE_STRATEGY = "steps" #@param ["no", "epoch", "steps"]
SAVE_STEPS = 5000 #@param {type: "integer"}
LOGGING_STEPS = 500 #@param {type: "integer"}

SEED = 42 #@param {type: "integer"}
GRADIENT_ACC_STEPS = 1 #@param {type: "integer"}

OVERWRITE_OUTPUT_DIR = False #@param {type:"boolean"}

OUTPUT_DIR = 'model/' + 'cybert_e-' + str(NUM_EPOCHS) + '_b-' + str(BATCH_SIZE) + '_l-' + str(LEARN_RATE)


!python cybert/code/run_mlm.py \
    --model_name_or_path $MODEL \
    --train_file $INPUT_FILE_PATH \
    --do_train \
    --num_train_epochs $NUM_EPOCHS \
    --per_device_train_batch_size $BATCH_SIZE \
    --learning_rate $LEARN_RATE \
    --weight_decay $WEIGHT_DECAY \
    --output_dir $OUTPUT_DIR \
    --overwrite_output_dir $OVERWRITE_OUTPUT_DIR \
    --save_steps $SAVE_STEPS \
    --save_strategy $SAVE_STRATEGY \
    --logging_steps $LOGGING_STEPS \
    --seed $SEED \
    --warmup_steps $WARMUP_STEPS \
    --cache_dir "cache" \
    --gradient_accumulation_steps $GRADIENT_ACC_STEPS
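
How the values above relate to each other can be checked with a little arithmetic. The sketch below estimates the number of optimizer updates for a single-device run and compares it with WARMUP_STEPS. The function name and the figure of 100,000 training examples are placeholders chosen for illustration; an "example" here means one tokenized training sequence as counted by the training script, not one line of the corpus. The formula approximates how the 🤗 Trainer counts steps and is not a call into it.

```python
import math


def estimate_optimizer_steps(num_examples, batch_size, grad_acc_steps, num_epochs):
    """Rough number of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # With gradient accumulation, several batches form one optimizer update.
    updates_per_epoch = max(batches_per_epoch // grad_acc_steps, 1)
    return updates_per_epoch * num_epochs


# Hypothetical corpus size; replace it with the number of training examples
# that the training script reports for your corpus.
NUM_EXAMPLES = 100_000
WARMUP = 10_000  # same value as WARMUP_STEPS above

total_steps = estimate_optimizer_steps(
    NUM_EXAMPLES, batch_size=4, grad_acc_steps=1, num_epochs=3
)
print(f"estimated optimizer steps: {total_steps}")
print(f"share of steps spent in warmup: {WARMUP / total_steps:.1%}")
```

If the estimate comes out below WARMUP_STEPS, a linear warmup would not finish before training ends, so the learning rate would never reach the configured value. In that case, lowering WARMUP_STEPS may be worth considering.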


<a name="fine_tuning"></a>
# Part 2: Fine-Tuning and Evaluation

## Pre-process and split NER dataset

In this step, the provided CSV file is converted to JSONL and split into a train, a test and an optional dev dataset. You can specify a different file path to use your own dataset, but be aware that this pre-processing step was developed for the provided dataset and only works on data in the same format. The default is the provided labeled CVE database from the Ovana paper.

Other parameters to be specified:
* SPLIT_SEED: Seed for the random split of the data
* SPLIT_RATIO: Size of the train set in percent. The remainder becomes the test set, or is split equally between test set and dev set if CREATE_DEV_SET is true. Values of 0 and 100 are rejected by `train_test_split`, since they would leave one side of the split empty.
* CREATE_DEV_SET: Whether to create a dev set. It is not used in the fine-tuning step by default.
* NER_TAG: Since multi-tagging is not supported, you need to choose the specific tag the model is to be fine-tuned on
* CSV_SEPARATOR: Delimiter of your input CSV file
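
To make the effect of SPLIT_RATIO and CREATE_DEV_SET concrete, the following sketch performs the described split with the standard library only. It is an illustration of the intended set sizes, not the notebook's implementation: the notebook uses scikit-learn's `train_test_split`, which shuffles differently, and the function name `three_way_split` and the stand-in records are our own.

```python
import random


def three_way_split(items, ratio_percent, seed, create_dev_set=False):
    """Split items into train/test, optionally halving the remainder into dev/test."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)

    # The first ratio_percent of the shuffled items form the train set.
    n_train = int(len(shuffled) * ratio_percent / 100)
    train, rest = shuffled[:n_train], shuffled[n_train:]

    if not create_dev_set:
        return {"train": train, "test": rest, "dev": None}

    # The remainder is divided equally between dev and test.
    n_dev = len(rest) // 2
    return {"train": train, "dev": rest[:n_dev], "test": rest[n_dev:]}


records = list(range(100))  # stand-in for the per-CVE records
split = three_way_split(records, ratio_percent=80, seed=2021, create_dev_set=True)
print({name: len(part) for name, part in split.items() if part is not None})
# prints {'train': 80, 'dev': 10, 'test': 10}
```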




In [None]:
# Parameters
# =============================================================================
FILE_PATH = 'cybert/input/NER/tagged_all.csv' #@param {type:"string"}
SPLIT_SEED = 2021 #@param {type: "integer"}
SPLIT_RATIO = 80 #@param {type:"slider", min:0, max:100, step:10}
CREATE_DEV_SET = False #@param {type:"boolean"}

NER_TAG = 'SN' #@param ['SV', 'SN', 'AC']

CSV_SEPARATOR = 'Space(s)' #@param ['Space(s)', ';', ',']
# =============================================================================


# Code

import json
import os
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

O_TAG = 'O'


def main():
    df = read_csv()

    json_list = create_json_per_cve(df)

    split_dict = split_data(json_list)

    data_to_json(split_dict)


def read_csv():
    global csv_filename, req_tag
    csv_filename = FILE_PATH
    req_tag = NER_TAG
    # 'Space(s)' stands for any run of whitespace; otherwise use the delimiter as given.
    csv_sep = r'\s+' if CSV_SEPARATOR.startswith('Space') else CSV_SEPARATOR

    if not csv_filename.endswith('.csv'):
        raise ValueError("Not a CSV file!")

    df = pd.read_csv(
        csv_filename,
        sep=csv_sep,
        dtype=str,
        header=None,
        skip_blank_lines=True,
        na_filter=True
    )

    return df


def create_json_per_cve(df):
    json_list = []

    # Rows are grouped by column 2; each group becomes one record.
    for _, group in df.groupby(by=2):
        # Filter out rows with a NaN token (column 0) or NaN tag (column 1)
        # via a boolean mask, since pandas.DataFrame.dropna seems to have
        # a bug on Google Colab.
        valid = group[group[0].notna() & group[1].notna()]

        json_list.append({
            'words': valid[0].tolist(),
            'ner': [tags_to_tag(tags) for tags in valid[1]]
        })
    return json_list

    

def split_data(json_list):
    seed = SPLIT_SEED
    ratio = int(SPLIT_RATIO) / 100

    # First split: train set versus the remainder.
    train, dev_test = train_test_split(json_list, train_size=ratio, random_state=seed)
    ret_dict = {'train': train,
                'test': dev_test,
                'dev': None}
    if CREATE_DEV_SET:
        # Split the remainder (not the full dataset) equally into dev and test,
        # so that neither set overlaps with the train set.
        dev, test = train_test_split(dev_test, train_size=0.5, random_state=seed)
        ret_dict['dev'] = dev
        ret_dict['test'] = test

    return ret_dict


def data_to_json(split_dict):

    json_filename_prefix = Path(csv_filename[:-4] + '_' + req_tag)
    json_filename_train = json_filename_prefix / 'train.json'
    json_filename_test = json_filename_prefix / 'test.json'
    
    os.makedirs(os.path.dirname(json_filename_train), exist_ok=True)
    with open(json_filename_train, 'w') as json_file:
        for e in split_dict['train']:
            json.dump(e, json_file)
            json_file.write('\n')
        print("Train dataset saved in : " + str(json_filename_train))

    with open(json_filename_test, 'w') as json_file:
        for e in split_dict['test']:
            json.dump(e, json_file)
            json_file.write('\n')
        print("Test dataset saved in : " + str(json_filename_test))

    if CREATE_DEV_SET:
        json_filename_dev = json_filename_prefix / 'dev.json'
        with open(json_filename_dev, 'w') as json_file:
            for e in split_dict['dev']:
                json.dump(e, json_file)
                json_file.write('\n')
        print("Dev dataset saved in : " + str(json_filename_dev))


def tags_to_tag(tags):
    return req_tag if req_tag in str(tags) else O_TAG


        
main()
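
The shape of the generated JSONL records can be illustrated without pandas. The sketch below uses a few made-up rows in the column layout the code above reads (column 0 = token, column 1 = tag or tags, column 2 = grouping key, which we assume to be the CVE identifier). The rows, the CVE identifiers and the helper names are hypothetical and not taken from the provided dataset.

```python
import json

# Hypothetical rows: (token, tag(s), CVE identifier).
rows = [
    ("ExampleServer", "SN", "CVE-0000-0001"),
    ("2.4", "SV", "CVE-0000-0001"),
    ("allows", "O", "CVE-0000-0001"),
    ("ExampleLib", "SN,SV", "CVE-0000-0002"),
    ("crashes", "O", "CVE-0000-0002"),
]


def collapse_tag(tags, wanted, outside="O"):
    """Keep only the wanted tag; every other tag becomes the outside tag."""
    return wanted if wanted in str(tags) else outside


def rows_to_records(rows, wanted):
    """Build one {'words': [...], 'ner': [...]} record per CVE identifier."""
    grouped = {}
    for token, tags, cve_id in rows:
        record = grouped.setdefault(cve_id, {"words": [], "ner": []})
        record["words"].append(token)
        record["ner"].append(collapse_tag(tags, wanted))
    return list(grouped.values())


# Each printed line corresponds to one line of train.json or test.json.
for record in rows_to_records(rows, wanted="SN"):
    print(json.dumps(record))
```

Note that the tag check is a substring match, so a row tagged with several tags is kept as long as the chosen tag occurs among them.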

## Fine-Tuning Process

* The full specification of the parameters: [HuggingFace TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments)<br>
You can pass additional parameters by appending arguments to the Python call in the following cell.

The evaluation results will be printed in the cell's output and also stored in the *all_results.json* file of the fine-tuned model's directory.

In [None]:
# ============================== Hyperparameters ==============================

MODEL_NAME_OR_PATH = 'cybert/model/CyBERT_beta' #@param {type: "string"}
DATASET_DIR_PATH = 'cybert/input/NER/tagged_all_SN' #@param {type: "string"}
NUM_EPOCHS =  4 #@param {type: "integer"}
BATCH_SIZE =  16 #@param {type: "integer"}

SAVE_STRATEGY = "epoch" #@param ["no", "epoch", "steps"]
EVAL_STRATEGY = "epoch" #@param ["no", "epoch", "steps"]
SAVE_STEPS = 500 #@param {type: "integer"}
LOGGING_STEPS = 500 #@param {type: "integer"}
NER_SEED = 42 #@param {type: "integer"}

OVERWRITE_OUTPUT_DIR = True #@param {type:"boolean"}
DO_TRAIN = True #@param {type:"boolean"}
DO_EVAL = True #@param {type:"boolean"}

# =============================================================================

from pathlib import Path
datasets_posix = Path(DATASET_DIR_PATH)

OUTPUT_DIR = Path('cybert/model') / '{}-fine_tuned-{}'.format(
    Path(MODEL_NAME_OR_PATH).stem,
    datasets_posix.stem
    )
print("Fine-tuned model will be saved in: " + str(OUTPUT_DIR))

TRAIN_FILE = str(datasets_posix / "train.json")
TEST_FILE = str(datasets_posix / "test.json")
# Only exists if CREATE_DEV_SET was enabled; not passed to the script below.
DEV_FILE = str(datasets_posix / "dev.json")


!python cybert/code/run_ner.py \
  --model_name_or_path=$MODEL_NAME_OR_PATH \
  --task_name=ner \
  --train_file=$TRAIN_FILE \
  --do_train=$DO_TRAIN \
  --validation_file=$TEST_FILE \
  --do_eval=$DO_EVAL \
  --evaluation_strategy=$EVAL_STRATEGY \
  --num_train_epochs=$NUM_EPOCHS \
  --per_device_train_batch_size=$BATCH_SIZE \
  --output_dir=$OUTPUT_DIR \
  --overwrite_output_dir=$OVERWRITE_OUTPUT_DIR \
  --save_steps=$SAVE_STEPS \
  --logging_steps=$LOGGING_STEPS \
  --seed=$NER_SEED \
  --save_strategy=$SAVE_STRATEGY
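
Once fine-tuning has finished, the stored evaluation results can be read back from *all_results.json*. The sketch below assumes the output directory that results from the default settings above (*cybert/model/CyBERT_beta-fine_tuned-tagged_all_SN*). It makes no assumption about which metric names the file contains and simply prints every entry; the helper name `load_results` is our own.

```python
import json
from pathlib import Path


def load_results(model_dir):
    """Return the contents of all_results.json in model_dir, or None if absent."""
    results_path = Path(model_dir) / "all_results.json"
    if not results_path.exists():
        return None
    with results_path.open() as results_file:
        return json.load(results_file)


# Output directory derived from the default MODEL_NAME_OR_PATH and DATASET_DIR_PATH.
results = load_results("cybert/model/CyBERT_beta-fine_tuned-tagged_all_SN")
if results is None:
    print("all_results.json not found; has the fine-tuning cell finished?")
else:
    for name, value in sorted(results.items()):
        print(f"{name}: {value}")
```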
