# Observation of Data related to Cyber-Threat-Intelligence

## Task definition of NER:

NER stands for Named Entity Recognition, which is a subtask of Natural Language Processing (NLP) that involves identifying and categorizing named
entities in text into predefined categories such as person names, organization names, locations, and others.

## Task definition of NER-CTI

NER-CTI stands for Named Entity Recognition for Cyber-Threat-Intelligence, which is a subtask of NER that involves identifying and categorizing named
entities related to Cyber-Threats in text into predefined categories such as IPs, URLs, protocols, locations or threat participants.

## Data sources:

As data is limited for NER-CTI but the question of NER-CTI boils down to the same questions of NER but with special tag-sets, we focus on the
following three open-source datasets:

    1. APTNER: https://github.com/wangxuren/APTNER
    2. CyNER: https://github.com/aiforsec/CyNER
    3. DNRTI: https://github.com/SCreaMxp/DNRTI-A-Large-scale-Dataset-for-Named-Entity-Recognition-in-Threat-Intelligence

The following link might be of special interest, as it contains a curated list about resources for CTI in general:

    1. https://github.com/hslatman/awesome-threat-intelligence

## Comparability (Pre-Limitation):

This is a basic comparison of the three datasets APTNER, CyNER and DNRTI according to:

    1) Format and readability
    2) Data distribution (train, valid, test) at token- and tag-level
    3) Interpretation of label semantic

### 1) Format and readability

Since we work with three datasets, it is important to compare them in terms of some very basic characteristics: format and readability.
Format refers to the way the data is structured, meaning if the data formated similar or universal.

It turned out that the format is indeed the same. All three files show to have the format (token, whitespace, tag). With respect to APTNER, all
datasets use train, valid and test as identifiers. For APTNER the files need to be renamed accordingly. In addition, APTNER uses a dev set which
is equivalent to a valid set.

In terms of readability, it has to be compared if the data contains any label errors, like tokens only having more than two or no tag.
I turned out that the error rate for all three datasets is quite low.

## Evaluation of different NER-techniques:

    TBC.

In [1]:
import pandas as pd
import csv

In [2]:
def read_file_from(path):
    """
    This method reads a file for NER-CTI in the format (token, tag) where token and tag are separated by whitespace.
    Further, this method counts the cases of data being assigned with more than one label.

    :param path: The path to the file to read.
    :return: Tuple having all tokens and tags as dataframe and the amount of mislabeled data.
    """
    column_names = ['Token', 'Tag']

    with open(path, newline='') as file:
        reader = csv.reader(file)
        malicious = 0
        token_tag_list = list()
        for row in reader:
            if len(row) == 1:
                row_split = row[0].split()
                if len(row_split) == 2:
                    token, tag = row_split[0], row_split[1]
                    token_tag_list += [(token, tag)]
                else:
                    malicious += 1
        df = pd.DataFrame.from_records(token_tag_list, columns=column_names)
    return df, malicious

In [3]:
def show_readability(dataset, train_mal, valid_mal, test_mal):
    """
    This method shows some information about the readability and error rate regarding a specified dataset.

    :param dataset: The dataset to be used.
    :param train_mal: Amount of malicious training data.
    :param valid_mal: Amount of malicious validation data.
    :param test_mal: Amount of malicious test data.
    :return: Nothing
    """

    train = dataset[dataset.Set == 'train']
    valid = dataset[dataset.Set == 'valid']
    test = dataset[dataset.Set == 'test']


    print("Length Train:", train.shape[0])
    print("Length Valid:", valid.shape[0])
    print("Length Test:", test.shape[0])

    print()

    print("Unique tokens:", len(dataset.Token.unique()))
    print("Unique tokens:", len(train.Token.unique()))
    print("Unique tokens:", len(valid.Token.unique()))
    print("Unique tokens:", len(test.Token.unique()))

    print()

    print("Error Rate Train:", round(100 * (train_mal / aptner_train.shape[0]), 2), "%")
    print("Error Rate Dev:", round(100 * (valid_mal / aptner_valid.shape[0]), 2), "%")
    print("Error Rate Test:", round(100 * (test_mal / aptner_test.shape[0]), 2), "%")

## APTNER

Notice: For this dataset, the files had to be renamed. The original pattern was like "APTNERtrain.txt". Now the file are simply train.txt, valid.txt and test.txt. Furthermore, the file "APTNERdev.txt" is now valid.txt.

In [4]:
aptner_train, aptner_train_malicious = read_file_from('../data/APTNER/train.txt')
aptner_train['Set'] = 'train'
aptner_valid, aptner_valid_malicious = read_file_from('../data/APTNER/valid.txt')
aptner_valid['Set'] = 'valid'
aptner_test, aptner_test_malicious = read_file_from('../data/APTNER/test.txt')
aptner_test['Set'] = 'test'

aptner = pd.concat([aptner_train, aptner_valid, aptner_test])

show_readability(aptner, aptner_train_malicious, aptner_valid_malicious, aptner_test_malicious)

Length Train: 154426
Length Valid: 35990
Length Test: 37359

Unique tokens: 14991
Unique tokens: 11818
Unique tokens: 5501
Unique tokens: 4793

Error Rate Train: 0.33 %
Error Rate Dev: 0.19 %
Error Rate Test: 0.06 %


In [5]:
cyner_train, cyner_train_malicious = read_file_from('../data/CyNER/train.txt')
cyner_train['Set'] = 'train'
cyner_valid, cyner_valid_malicious = read_file_from('../data/CyNER/valid.txt')
cyner_valid['Set'] = 'valid'
cyner_test, cyner_test_malicious = read_file_from('../data/CyNER/test.txt')
cyner_test['Set'] = 'test'

cyner = pd.concat([cyner_train, cyner_valid, cyner_test])

show_readability(cyner, cyner_train_malicious, cyner_valid_malicious, cyner_test_malicious)

Length Train: 25769
Length Valid: 18742
Length Test: 6726

Unique tokens: 6733
Unique tokens: 4567
Unique tokens: 3363
Unique tokens: 1830

Error Rate Train: 0.02 %
Error Rate Dev: 0.0 %
Error Rate Test: 0.03 %


## DNRTI

In [6]:
dnrti_train, dnrti_train_malicious = read_file_from('../data/DNRTI/train.txt')
dnrti_train['Set'] = 'train'
dnrti_valid, dnrti_valid_malicious = read_file_from('../data/DNRTI/valid.txt')
dnrti_valid['Set'] = 'valid'
dnrti_test, dnrti_test_malicious = read_file_from('../data/DNRTI/test.txt')
dnrti_test['Set'] = 'test'

dnrti = pd.concat([dnrti_train, dnrti_valid, dnrti_test])

show_readability(dnrti, dnrti_train_malicious, dnrti_valid_malicious, dnrti_test_malicious)

Length Train: 94829
Length Valid: 16652
Length Test: 16706

Unique tokens: 8325
Unique tokens: 7377
Unique tokens: 3326
Unique tokens: 3239

Error Rate Train: 0.29 %
Error Rate Dev: 0.09 %
Error Rate Test: 0.1 %
