# Observation of Data related to Cyber-Threat-Intelligence

## Task definition of NER:

NER stands for Named Entity Recognition, which is a subtask of Natural Language Processing (NLP) that involves identifying and categorizing named
entities in text into predefined categories such as person names, organization names, locations, and others.

## Task definition of NER-CTI

NER-CTI stands for Named Entity Recognition for Cyber-Threat-Intelligence, which is a subtask of NER that involves identifying and categorizing named
entities related to Cyber-Threats in text into predefined categories such as IPs, URLs, protocols, locations or threat participants.

## BIO format
The BIO format is a commonly used labeling scheme in NER tasks. In this format, each token in a text is labeled with a tag. The tag's prefix indicates whether the token belongs to a named entity, and its suffix, if present, names the entity type (for example `B-LOC`). The prefix is either "B", "I", or "O", where:

* B (Beginning) indicates that the token is the beginning of a named entity.
* I (Inside) indicates that the token is inside a named entity.
* O (Outside) indicates that the token is not part of a named entity.

This is an example of how BIO might look in a sentence:

    John   lives in  New   York  City
    B-PER  O     O   B-LOC I-LOC I-LOC

In this example, "John" is the beginning of a person (PER) entity, "New" is the beginning of a location (LOC) entity, and "York" and "City" are inside the same location entity.
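To make the scheme concrete, the following minimal sketch groups BIO-tagged tokens back into entities. The function name `bio_to_spans` is hypothetical and not part of any dataset tooling, and the handling of inconsistent tags (an `I-` tag without a matching `B-` tag is treated like `O`) is an assumption.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_text, label) pairs from parallel token and BIO tag lists."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition('-')
        if prefix == 'B':
            # A new entity starts; close the previous one if it is still open.
            if current_tokens:
                spans.append((' '.join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == 'I' and current_tokens and label == current_label:
            current_tokens.append(token)
        else:
            # 'O' or an inconsistent 'I-' tag closes any open entity.
            if current_tokens:
                spans.append((' '.join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((' '.join(current_tokens), current_label))
    return spans


tokens = ['John', 'lives', 'in', 'New', 'York', 'City']
tags = ['B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
print(bio_to_spans(tokens, tags))
# [('John', 'PER'), ('New York City', 'LOC')]
```

Applied to the example sentence, the three location tokens are merged into one entity because only the first of them carries the `B` prefix.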

## CoNLL format

The CoNLL format is a standard format for representing labeled sequences of tokens, often used for tasks like named entity recognition (NER) or part-of-speech (POS) tagging. The format is named after the Conference on Computational Natural Language Learning (CoNLL), whose shared tasks, such as the CoNLL-2000 chunking task, popularized it.

In the CoNLL format, each line of a text file represents a single token and its associated labels. The first column contains the token itself, while subsequent columns contain labels for various linguistic features. For example, in a typical NER task, the second column might contain the named entity label for each token, while in a POS tagging task, it might contain the part-of-speech tag.
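As an illustration, the sketch below parses a small CoNLL-style snippet into sentences. It assumes whitespace-separated columns, the label in the last column, and a blank line between sentences; these are common conventions, but individual datasets may deviate from them. The function name `parse_conll` and the sample text are made up for this example.

```python
def parse_conll(text):
    """Split CoNLL-style text into sentences of (token, tag) pairs.

    Assumes the token is in the first column, the label in the last
    column, and that a blank line separates sentences.
    """
    sentences, current = [], []
    for line in text.splitlines():
        columns = line.split()
        if not columns:
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = """John B-PER
lives O
in O
New B-LOC
York I-LOC
City I-LOC

He O
left O
"""
sentences = parse_conll(sample)
print(len(sentences))  # 2
print(sentences[1])    # [('He', 'O'), ('left', 'O')]
```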

## Data sources:

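Annotated data for NER-CTI is scarce. Since NER-CTI is essentially NER with a domain-specific tag set, we focus on the following three open-source datasets. In each table, Token is the number of tokens, Unique the number of distinct tokens, Sentence the number of sentences (counted as `.` tokens), and Error the number of lines that could not be parsed as a token-tag pair: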

| APTNER   | Token    | Unique  | Sentence  | Error |
|----------|----------|---------|-----------|-------|
| Train    | 154,412  |  11,818 |  6,940    |  518  |
| Valid    |  35,990  |   5,501 |  1,664    |   68  |
| Test     |  37,359  |   4,793 |  1,529    |   23  |

**Entity-Types:** 5  *(B I O E S)*

**Entity-Labels:** 22 *('TIME', 'OS', 'ACT', 'LOC', 'TOOL', 'VULNAME', 'DOM', 'APT', 'EMAIL', 'IP', 'SHA1', 'SHA2', 'URL', 'IDTY', 'FILE', 'SECTEAM', 'PROT', 'MAL', 'VULID', 'MD5', 'O', 'ENCR')*

**Repository:** https://github.com/wangxuren/APTNER
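APTNER uses five tag prefixes (B I O E S), whereas CyNER and DNRTI use three (B I O). To compare or combine the datasets, the five-prefix tags can be mapped onto the three-prefix scheme. The sketch below assumes the usual BIOES reading, in which `E` marks the last token of a multi-token entity and `S` marks a single-token entity; the source only lists the prefixes, so this interpretation is an assumption. The function name `bioes_to_bio` is hypothetical.

```python
def bioes_to_bio(tag):
    """Map one BIOES tag to BIO: 'S-' becomes 'B-', 'E-' becomes 'I-'."""
    prefix, separator, label = tag.partition('-')
    if not separator:
        return tag  # plain 'O' has no prefix to convert
    if prefix == 'S':
        return f'B-{label}'
    if prefix == 'E':
        return f'I-{label}'
    return tag  # 'B-' and 'I-' tags are already valid BIO


tags = ['B-APT', 'E-APT', 'O', 'S-MAL', 'B-TOOL', 'I-TOOL', 'E-TOOL']
print([bioes_to_bio(tag) for tag in tags])
# ['B-APT', 'I-APT', 'O', 'B-MAL', 'B-TOOL', 'I-TOOL', 'I-TOOL']
```

The mapping works tag by tag and needs no context, because every BIOES entity already starts with either `B` or `S`.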

| CyNER | Token | Unique | Sentence  | Error |
|-------|--------|--------|-----------|-------|
| Train | 25,769 | 4,567  | 1,097     | 33    |
| Valid | 18,742 | 3,363  | 785       | 0     |
| Test  | 6,726  | 1,830  | 294       | 12    |

**Entity-Types:** 3 *(B I O)*

**Entity-Labels:** 6 *('Organization', 'System', 'Malware', 'Indicator', 'O', 'Vulnerability')*

**Repository:** https://github.com/aiforsec/CyNER

| DNRTI    | Token    | Unique | Sentence  | Error |
|----------|----------|--------|-----------|-------|
| Train    | 94,829   | 7,377  | 3,704     |  450  |
| Valid    | 16,652   | 3,326  |   662     |   33  |
| Test     | 16,706   | 3,239  |   663     |   39  |

**Entity-Types:** 3 *(B I O)*

**Entity-Labels:** 14 *('Exp', 'OffAct', 'Area', 'SamFile', 'Tool', 'Features', 'Way', 'SecTeam', 'Org', 'Purp', 'Time', 'Idus', 'O', 'HackOrg')*

**Repository:** https://github.com/SCreaMxp/DNRTI-A-Large-scale-Dataset-for-Named-Entity-Recognition-in-Threat-Intelligence

### Other (not) usable datasets:


* [1TCFII](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1TCFII) contains 1000 tweets with binary annotations. It may be useful for a final model.

* [twitter-cyberthreat-detection](https://paperswithcode.com/dataset/twitter-cyberthreat-detection-dataset) contains annotated tweets referenced only by their ID. Hence, the data is not directly accessible.

* [BERT-for-Cybersecurity-NER](https://github.com/stelemate/BERT-for-Cybersecurity-NER) only contains data written in Chinese.

* [CTIMiner](https://github.com/dgkim0803/CTIMiner) may be interesting, but it is behind a paywall and uses an XML structure.

* [CrossNER](https://github.com/zliucr/CrossNER) is interesting because it combines entity labels from different domains (science, politics, music, ...), which makes it a good candidate for future work.


The following links might be of special interest, as they contain curated lists of resources for CTI in general:

* [awesome-threat-intelligence](https://github.com/hslatman/awesome-threat-intelligence)

* [Awesome-Cybersecurity-Datasets](https://github.com/shramos/Awesome-Cybersecurity-Datasets)


## Evaluation of different NER-techniques:

**Idea:** Compare the pipelines of CoreNLP and spaCy by focusing on their primary components. Both consist of several components that together lead to the final detection of named entities in text, so it is interesting to see how they compare at each stage. Further factors of interest are their implementation, their usability in terms of programming effort, and their scalability.
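Any such comparison needs a common score. One widely used option, shown here only as a minimal sketch, is exact-match precision, recall and F1 over entity spans. The function name `span_f1` and the `(start, end, label)` span representation are assumptions made for this illustration; neither CoreNLP nor spaCy is involved in the code.

```python
def span_f1(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Token offsets: 'John' is token 0, 'New York City' covers tokens 3 to 5.
gold = {(0, 1, 'PER'), (3, 6, 'LOC')}
predicted = {(0, 1, 'PER'), (3, 5, 'LOC')}  # the location span is cut short
print(span_f1(gold, predicted))
# (0.5, 0.5, 0.5)
```

Under exact matching, the truncated location span counts as both a false positive and a false negative, which is why a single boundary error halves all three scores in this example.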

In [1]:
import pandas as pd
import csv

In [2]:
def read_file_from(path):
    """
    Read a NER-CTI file in which each line holds a token and its tag, separated by whitespace.
    Lines that do not split into exactly one token and one well-formed tag are counted as malformed.

    :param path: The path to the file to read.
    :return: Tuple of a DataFrame with all tokens and tags, and the number of malformed lines.
    """
    column_names = ['Token', 'Tag']

    with open(path, newline='') as file:
        # csv.reader splits on commas, so an ordinary "token tag" line arrives as a single field.
        reader = csv.reader(file)
        malicious = 0  # number of malformed lines
        token_tag_list = []
        for row in reader:
            # Rows with no field (blank lines) or several fields (lines containing commas) are skipped.
            if len(row) == 1:
                row_split = row[0].split()
                if len(row_split) == 2:
                    token, tag = row_split
                    # A well-formed tag is either 'O' or '<prefix>-<label>'.
                    if len(tag.split('-')) == 2 or tag == 'O':
                        token_tag_list.append((token, tag))
                    else:
                        malicious += 1
                else:
                    malicious += 1
        df = pd.DataFrame.from_records(token_tag_list, columns=column_names)
    return df, malicious
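Because `read_file_from` passes every line through `csv.reader`, a line that contains a comma is split into several fields and skipped without being counted, and quotation marks may be interpreted as CSV quoting. The following sketch shows an alternative that splits each line on whitespace directly. The name `read_token_tag_lines` and the sample lines are hypothetical, and counts produced this way would not necessarily match the numbers reported in the tables above.

```python
def read_token_tag_lines(lines):
    """Parse whitespace-separated 'token tag' lines without CSV parsing.

    Returns the (token, tag) pairs and the number of non-empty lines
    that do not match the expected shape.
    """
    pairs, malformed = [], 0
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # blank line, e.g. a sentence separator
        # Same validity rule as read_file_from: 'O' or '<prefix>-<label>'.
        if len(parts) == 2 and (parts[1] == 'O' or len(parts[1].split('-')) == 2):
            pairs.append((parts[0], parts[1]))
        else:
            malformed += 1
    return pairs, malformed


lines = ['APT28 B-APT', ', O', 'used O', '', 'broken line here', 'X-Agent B-MAL-X']
pairs, malformed = read_token_tag_lines(lines)
print(pairs)      # [('APT28', 'B-APT'), (',', 'O'), ('used', 'O')]
print(malformed)  # 2
```

To use it on a file, the open file object can be passed directly, since iterating over it yields one line at a time.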

In [3]:
def show_readability(dataset, train_mal, valid_mal, test_mal):
    """
    Print size statistics and the number of malformed lines for the specified dataset.

    :param dataset: The dataset to be used.
    :param train_mal: Number of malformed lines in the training data.
    :param valid_mal: Number of malformed lines in the validation data.
    :param test_mal: Number of malformed lines in the test data.
    :return: Nothing
    """

    train = dataset[dataset.Set == 'train']
    valid = dataset[dataset.Set == 'valid']
    test = dataset[dataset.Set == 'test']


    print("Length Train:", train.shape[0])
    print("Length Valid:", valid.shape[0])
    print("Length Test:", test.shape[0])

    print()

    # Approximate the number of sentences by counting '.' tokens.
    print("Sentences Train:", train[train.Token == '.'].shape[0])
    print("Sentences Valid:", valid[valid.Token == '.'].shape[0])
    print("Sentences Test:", test[test.Token == '.'].shape[0])

    print()

    print("Unique Tokens Train:", len(train.Token.unique()))
    print("Unique Tokens Valid:", len(valid.Token.unique()))
    print("Unique Tokens Test:", len(test.Token.unique()))

    print()

    print("Error Rate Train:", train_mal)
    print("Error Rate Dev:", valid_mal)
    print("Error Rate Test:", test_mal)

In [4]:
def show_labels(dataset):
    """
    Print an overview of the different tag prefixes (entity types) and entity labels in a dataset.

    :param dataset: The dataset to work with.
    :return: Nothing
    """
    unique_tags = dataset.Tag.unique()

    tag_types = {'O'}
    tag_words = set()

    for tag in unique_tags:
        if '-' in tag:
            # Tags look like '<type>-<label>', for example 'B-LOC'.
            tag_type, tag_word = tag.split('-')[:2]
            tag_types.add(tag_type)
        else:
            # A tag without a prefix (only 'O') counts as a label of its own.
            tag_word = tag

        tag_words.add(tag_word)

    print('Different Entity Types:', len(tag_types))
    print('Different Entity Labels:', len(tag_words))
    print('Entity Types:', tag_types)
    print('Entity Labels:', tag_words)

## Datasets

In [5]:
for dataset in ['APTNER', 'CyNER', 'DNRTI']:
    train, train_malicious = read_file_from(f'../data/{dataset}/train.txt')
    train['Set'] = 'train'
    valid, valid_malicious = read_file_from(f'../data/{dataset}/valid.txt')
    valid['Set'] = 'valid'
    test, test_malicious = read_file_from(f'../data/{dataset}/test.txt')
    test['Set'] = 'test'

    data = pd.concat([train, valid, test])

    print(f'#### {dataset} ####')
    print('About the data')
    show_readability(data, train_malicious, valid_malicious, test_malicious)
    print()
    print('About the labels')
    show_labels(data)
    print()

#### APTNER ####
About the data
Length Train: 154412
Length Valid: 35990
Length Test: 37359

Sentences Train: 6940
Sentences Valid: 1664
Sentences Test: 1529

Unique Tokens Train: 11818
Unique Tokens Valid: 5501
Unique Tokens Test: 4793

Error Rate Train: 518
Error Rate Dev: 68
Error Rate Test: 23

About the labels
Different Entity Types: 5
Different Entity Labels: 22
Entity Types: {'B', 'E', 'S', 'O', 'I'}
Entity Labels: {'O', 'MAL', 'FILE', 'TOOL', 'ENCR', 'SECTEAM', 'SHA2', 'SHA1', 'VULNAME', 'IDTY', 'DOM', 'TIME', 'VULID', 'URL', 'LOC', 'MD5', 'ACT', 'OS', 'APT', 'PROT', 'IP', 'EMAIL'}

#### CyNER ####
About the data
Length Train: 25769
Length Valid: 18742
Length Test: 6726

Sentences Train: 1097
Sentences Valid: 785
Sentences Test: 294

Unique Tokens Train: 4567
Unique Tokens Valid: 3363
Unique Tokens Test: 1830

Error Rate Train: 33
Error Rate Dev: 0
Error Rate Test: 12

About the labels
Different Entity Types: 3
Different Entity Labels: 6
Entity Types: {'B', 'I', 'O'}
Entity Lab