# Comparison of NER methods in the field of Cyber-Threat-Intelligence

## Approach
This report takes a twofold approach. First, two traditional approaches are contrasted: the [CoreNLP pipeline](https://stanfordnlp.github.io/CoreNLP/ner.html#additional-tokensregexner-rules), which ultimately determines the entities using a [Conditional Random Field (CRF)](https://towardsdatascience.com/conditional-random-fields-explained-e5b8256da776), is compared with the [extended spaCy pipeline](https://spacy.io/usage/processing-pipelines). "Extended" here means that spaCy, the more modern of the two libraries, allows individual components to be replaced, for example swapping the feature extraction for a different embedding technique. Building on this comparison, a [foundation model](https://research.ibm.com/blog/what-are-foundation-models), i.e. a more specialized [transformer pipeline](https://spacy.io/usage/v3#features-transformers-pipelines), is then integrated into this process and evaluated.

## Task definition of NER:
NER stands for Named Entity Recognition, which is a subtask of Natural Language Processing (NLP) that involves identifying and categorizing named
entities in text into predefined categories such as person names, organization names, locations, and others.

## Task definition of NER-CTI
NER-CTI stands for Named Entity Recognition for Cyber-Threat-Intelligence, which is a subtask of NER that involves identifying and categorizing named
entities related to Cyber-Threats in text into predefined categories such as IPs, URLs, protocols, locations or threat participants.

## BIO format
The BIO format is a commonly used labeling scheme in NER tasks. In this format, each token in a text is labeled with a prefix indicating whether it belongs to a named entity and, if so, what type of entity it is. The prefix is either "B", "I", or "O", where:

* B (Beginning) indicates that the token is the beginning of a named entity.
* I (Inside) indicates that the token is inside a named entity.
* O (Outside) indicates that the token is not part of a named entity.

This is an example of how BIO might look in a sentence:

    John   lives in  New   York  City
    B-PER  O     O   B-LOC I-LOC I-LOC

In this example, "John" is the beginning (and only token) of a person (PER) entity, "New" is the beginning of a location (LOC) entity, and "York" and "City" are inside the same location entity.
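To make the scheme concrete, the following minimal sketch groups BIO-tagged tokens back into entities. The function name `bio_to_spans` is made up for this illustration and is not part of any library; an I tag without a matching open entity is leniently treated as the start of a new entity, which is an assumption of this sketch.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # An I tag with the same label extends the currently open entity.
        if prefix == "I" and label == current_label:
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        # B opens a new entity; a stray I is leniently treated the same way.
        if prefix in ("B", "I"):
            current_tokens, current_label = [token], label
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["John", "lives", "in", "New", "York", "City"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
```

For the example sentence this yields the two entities "John" (PER) and "New York City" (LOC).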

## BIOES format
The BIOES format is an extension of the BIO format. It adds more information to each token-label relation: the simpler BIO format does not distinguish single-token entities, and it tags the last token of an entity with I like any other inside token.
With the BIOES format it is possible to mark single-token entities like "John" (S) and to mark the end of an entity explicitly (E).
Hence, the additional prefixes are "S" and "E", where:

* S (Single) indicates that the token is the complete named entity.
* E (End) indicates that the token is the ending of a named entity.

This is the extended example for BIOES:

    John   lives in  New   York  City
    S-PER  O     O   B-LOC I-LOC E-LOC

Both formats can be converted into one another, and the choice of format depends on how fine-grained the annotation or model should be. BIO appears to be preferred in most applications.
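The conversion between the two schemes can be sketched in a few lines. The helper names `bio_to_bioes` and `bioes_to_bio` are hypothetical and only serve this illustration; the sketch assumes well-formed input tags.

```python
def bio_to_bioes(tags):
    """Convert a sequence of BIO tags to BIOES tags."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I tag with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes


def bioes_to_bio(tags):
    """Convert BIOES tags back to BIO: S becomes B, E becomes I."""
    mapping = {"S": "B", "E": "I"}
    return [mapping.get(tag[0], tag[0]) + tag[1:] if tag != "O" else tag for tag in tags]


bio = ["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC"]
bioes = bio_to_bioes(bio)
print(bioes)
print(bioes_to_bio(bioes) == bio)
```

Applied to the example sentence, the conversion reproduces the BIOES tags shown above, and converting back restores the original BIO sequence.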

## CoNLL format
The CoNLL format is a standard format for representing labeled sequences of tokens, often used for tasks like named entity recognition (NER) or part-of-speech (POS) tagging. The format is named after the [Conference on Natural Language Learning (CoNLL)](https://www.conll.org/previous-tasks), which first introduced it in 2000.

In the CoNLL format, which was used for the language-independent named entity recognition tasks in [2002](https://www.clips.uantwerpen.be/conll2002/ner/) and [2003](https://www.clips.uantwerpen.be/conll2003/ner/), each line of a text file represents a single token and its associated labels. The first column contains the token itself, while subsequent columns contain labels for various linguistic features. For example, in a typical NER task, the second column might contain the named entity label for each token, while in a POS tagging task, it might contain the part-of-speech tag.
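A minimal sketch of reading such a file is shown below. It assumes that columns are separated by whitespace, that the entity label sits in the last column, and that blank lines separate sentences, as in the CoNLL-2003 files. The function name `parse_conll` and the sample text are made up for this illustration.

```python
def parse_conll(text):
    """Parse CoNLL-style text into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column: token; last column: entity label (assumption of this sketch).
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = """\
John B-PER
lives O
in O
New B-LOC
York I-LOC
City I-LOC

John B-PER
lives O
"""

sentences = parse_conll(sample)
print(len(sentences))
print(sentences[0])
```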

## Data sources:
Data for NER-CTI is limited. However, NER-CTI boils down to the same task as general NER with specialized tag sets, so we focus on the following three open-source datasets:

| APTNER   | Tokens   | Unique tokens | Sentences | Malformed lines |
|----------|----------|---------------|-----------|-----------------|
| Train    | 154,412  | 11,818        | 6,940     | 518             |
| Valid    |  35,990  |  5,501        | 1,664     |  68             |
| Test     |  37,359  |  4,793        | 1,529     |  23             |

**Entity-Types:** 5  *(B I O E S)*

**Entity-Labels:** 22 *('TIME', 'OS', 'ACT', 'LOC', 'TOOL', 'VULNAME', 'DOM', 'APT', 'EMAIL', 'IP', 'SHA1', 'SHA2', 'URL', 'IDTY', 'FILE', 'SECTEAM', 'PROT', 'MAL', 'VULID', 'MD5', 'O', 'ENCR')*

**Repository:** https://github.com/wangxuren/APTNER

**Example:**

    Kaspersky Lab       products detect the malware described
    B-SECTEAM E-SECTEAM O        O      O   O       O

    in this report as Trojan.Win32.Remexi and Trojan.Win32.Agent .
    O  O    O      O  S-FILE              O   S-FILE             O

In this example, the security team (SECTEAM) "Kaspersky Lab" detects the files "Trojan.Win32.Remexi" and "Trojan.Win32.Agent" (FILE). As this example makes directly apparent, the task is only to identify named entities, not the relations between them: "detect" is labeled "O".

-----

| CyNER | Tokens | Unique tokens | Sentences | Malformed lines |
|-------|--------|---------------|-----------|-----------------|
| Train | 25,769 | 4,567         | 1,097     | 33              |
| Valid | 18,742 | 3,363         | 785       | 0               |
| Test  | 6,726  | 1,830         | 294       | 12              |

**Entity-Types:** 3 *(B I O)*

**Entity-Labels:** 6 *('Organization', 'System', 'Malware', 'Indicator', 'O', 'Vulnerability')*

**Repository:** https://github.com/aiforsec/CyNER

**Example:**

    This malicious APK is 334326 bytes file , MD5 :
    O    O         O   O  O      O     O    O O   O

    0b8806b38b52bebfe39ff585639e2ea2 and is detected
    B-Indicator                      O   O  O

    by Kaspersky      Lab products   as " Backdoor.AndroidOS.Chuli.a " .
    O  B-Organization I-Organization O  O B-Indicator                O O

This example is similar to the previous one, but shows significant differences in how the meaning of a label is expressed. Here, files like "0b8806b38b52bebfe39ff585639e2ea2" or "Backdoor.AndroidOS.Chuli.a" (Indicator) are detected by "Kaspersky Lab" (Organization). Again, the relation "detected by" is not of further interest.

-----

| DNRTI    | Tokens   | Unique tokens | Sentences | Malformed lines |
|----------|----------|---------------|-----------|-----------------|
| Train    | 94,829   | 7,377         | 3,704     | 450             |
| Valid    | 16,652   | 3,326         |   662     |  33             |
| Test     | 16,706   | 3,239         |   663     |  39             |

**Entity-Types:** 3 *(B I O)*

**Entity-Labels:** 14 *('Exp', 'OffAct', 'Area', 'SamFile', 'Tool', 'Features', 'Way', 'SecTeam', 'Org', 'Purp', 'Time', 'Idus', 'O', 'HackOrg')*

**Repository:** https://github.com/SCreaMxp/DNRTI-A-Large-scale-Dataset-for-Named-Entity-Recognition-in-Threat-Intelligence

**Example:**

    Kaspersky Lab       's products detect the Microsoft Office exploits
    B-SecTeam I-SecTeam  O O        O      O   B-Exp     I-Exp  I-Exp

    used in the spear-phishing attacks  , including Exploit.MSWord.CVE-2010-333
    O    O  O   B-OffAct       I-OffAct O O         B-SamFile

    , Exploit.Win32.CVE-2012-0158 .
    O B-SamFile                   O

In this example, "Kaspersky Lab" (SecTeam) again detects the "Microsoft Office exploits" (Exp) that are typically used for offensive acts like "spear-phishing attacks" (OffAct), with files such as "Exploit.MSWord.CVE-2010-333" and "Exploit.Win32.CVE-2012-0158" (SamFile). It is also notable that "'s" forms a token of its own and is only labeled as 'O'.

------
### About a universal annotation language for CTI data (STIX)

As the three samples above illustrate, the still young CTI research community might need to develop a universal annotation language.
All three samples mention the same kinds of entities but use different wordings for the same things. This can be traced well for files: "Trojan.Win32.Agent" is tagged as "FILE", "Backdoor.AndroidOS.Chuli.a" is labeled "Indicator", and "Exploit.Win32.CVE-2012-0158" is denoted as "SamFile".
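Until such a shared annotation language exists, one pragmatic workaround is a hand-written mapping from dataset-specific labels to a common label set. The sketch below only covers the labels that occur in the three samples. The unified names `SAMPLE_FILE` and `SECURITY_TEAM` and the helper `unify_tag` are invented for this illustration and are not taken from STIX or from any of the datasets. Note that such a mapping is lossy: CyNER's "Indicator" and "Organization" labels are broader than the file and security-team labels of the other two datasets.

```python
# Hypothetical mapping from (dataset, label) to a shared label name.
UNIFIED_LABELS = {
    ("APTNER", "FILE"): "SAMPLE_FILE",
    ("CyNER", "Indicator"): "SAMPLE_FILE",
    ("DNRTI", "SamFile"): "SAMPLE_FILE",
    ("APTNER", "SECTEAM"): "SECURITY_TEAM",
    ("CyNER", "Organization"): "SECURITY_TEAM",
    ("DNRTI", "SecTeam"): "SECURITY_TEAM",
}


def unify_tag(dataset, tag):
    """Replace the dataset-specific label of a tag, keeping its prefix (B, I, E, S)."""
    if tag == "O":
        return tag
    prefix, label = tag.split("-", 1)
    # Labels without a mapping are passed through unchanged.
    return f"{prefix}-{UNIFIED_LABELS.get((dataset, label), label)}"


examples = [("APTNER", "S-FILE"), ("CyNER", "B-Indicator"), ("DNRTI", "B-SamFile")]
unified = [unify_tag(dataset, tag) for dataset, tag in examples]
print(unified)
```

The prefix schemes (BIOES for APTNER, BIO for CyNER and DNRTI) are left untouched here and would have to be harmonized separately.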

One starting point is a close look at the [STIX](https://oasis-open.github.io/cti-documentation/stix/intro) guidelines mentioned in APTNER (STIX 2.1) and DNRTI (STIX).
A closer look at STIX might also help to find other CTI datasets that include relationships.

Additionally, STIX adds an interesting turn to working with CTI data by introducing not only "Entities" and "Relations" but also "Sightings", defined as: "belief that something in CTI (e.g., an indicator, malware, tool, threat actor, etc.) was seen". This is especially fascinating because a relation like "Kaspersky Lab detected Trojan.Win32.Agent" can be read as an established fact: the threat has already reached the system. In contrast, a "sighting" is information from real-time data streams that has not yet been proven true or false, which makes the task of detecting cyberattacks especially difficult.

### Other (not) usable datasets:
* [1TCFII](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1TCFII) contains 1000 tweets with binary annotations. This may be useful for a final model.

* [twitter-cyberthreat-detection](https://paperswithcode.com/dataset/twitter-cyberthreat-detection-dataset) contains annotated tweets referenced only by their ID. Hence, the data is not directly accessible.

* [BERT-for-Cybersecurity-NER](https://github.com/stelemate/BERT-for-Cybersecurity-NER) only contains data written in Chinese.

* [CTIMiner](https://github.com/dgkim0803/CTIMiner) may be interesting, but it is behind a paywall and uses an XML structure.

* [CrossNER](https://github.com/zliucr/CrossNER) is interesting because it combines entity labels from different domains (science, politics, music, ...), which makes it a good candidate for future work.


The following links might be of special interest, as they contain curated lists of resources for CTI in general:
* [awesome-threat-intelligence](https://github.com/hslatman/awesome-threat-intelligence)

* [Awesome-Cybersecurity-Datasets](https://github.com/shramos/Awesome-Cybersecurity-Datasets)

In [1]:
import pandas as pd
import csv

In [2]:
def read_file_from(path):
    """
    Read a NER-CTI file in the format (token, tag), where token and tag are separated by whitespace.
    Lines that cannot be split into exactly one token and one well-formed tag are counted as malformed.

    :param path: The path to the file to read.
    :return: Tuple of a dataframe with all tokens and tags and the number of malformed lines.
    """
    column_names = ['Token', 'Tag']

    with open(path, newline='') as file:
        # csv.reader is used as a line reader here. Caveat: it splits on commas and interprets
        # quote characters, so lines containing a comma yield several fields and are skipped
        # below without being counted.
        reader = csv.reader(file)
        malicious = 0
        token_tag_list = []
        for row in reader:
            if len(row) == 1:
                row_split = row[0].split()
                if len(row_split) == 2:
                    token, tag = row_split
                    # A valid tag is either 'O' or '<prefix>-<label>' with exactly one hyphen.
                    if len(tag.split('-')) == 2 or tag == 'O':
                        token_tag_list.append((token, tag))
                    else:
                        malicious += 1
                else:
                    malicious += 1
        df = pd.DataFrame.from_records(token_tag_list, columns=column_names)
    return df, malicious

In [3]:
def show_readability(dataset, train_mal, valid_mal, test_mal):
    """
    Print size statistics and the number of malformed lines for a given dataset.
    Sentences are approximated by counting the '.' tokens.

    :param dataset: The dataset to be used, with the columns Token, Tag and Set.
    :param train_mal: Number of malformed lines in the training data.
    :param valid_mal: Number of malformed lines in the validation data.
    :param test_mal: Number of malformed lines in the test data.
    :return: Nothing
    """

    train = dataset[dataset.Set == 'train']
    valid = dataset[dataset.Set == 'valid']
    test = dataset[dataset.Set == 'test']


    print("Length Train:", train.shape[0])
    print("Length Valid:", valid.shape[0])
    print("Length Test:", test.shape[0])

    print()

    print("Sentences Train:", train[train.Token == '.'].shape[0])
    print("Sentences Valid:", valid[valid.Token == '.'].shape[0])
    print("Sentences Test:", test[test.Token == '.'].shape[0])

    print()

    print("Unique Tokens Train:", len(train.Token.unique()))
    print("Unique Tokens Valid:", len(valid.Token.unique()))
    print("Unique Tokens Test:", len(test.Token.unique()))

    print()

    print("Error Rate Train:", train_mal)
    print("Error Rate Dev:", valid_mal)
    print("Error Rate Test:", test_mal)

In [4]:
def show_labels(dataset):
    """
    This method creates an overview about the different kind of labels and types.

    :param dataset: The dataset to work with.
    :return: Nothing
    """
    unique_tags = dataset.Tag.unique()

    # 'O' counts both as a type (prefix) and as a label.
    tag_types = {'O'}
    tag_words = set()

    for tag in unique_tags:
        if '-' in tag:
            # Tags look like '<type>-<label>', e.g. 'B-LOC'.
            tag_type, tag_word = tag.split('-', 1)
            tag_types.add(tag_type)
        else:
            tag_word = tag

        tag_words.add(tag_word)

    print('Different Entity Types:', len(tag_types))
    print('Different Entity Labels:', len(tag_words))
    print('Entity Types:', tag_types)
    print('Entity Labels:', tag_words)

In [5]:
for dataset in ['APTNER', 'CyNER', 'DNRTI']:
    train, train_malicious = read_file_from(f'../data/{dataset}/train.txt')
    train['Set'] = 'train'
    valid, valid_malicious = read_file_from(f'../data/{dataset}/valid.txt')
    valid['Set'] = 'valid'
    test, test_malicious = read_file_from(f'../data/{dataset}/test.txt')
    test['Set'] = 'test'

    data = pd.concat([train, valid, test])

    print(f'#### {dataset} ####')
    print('About the data')
    show_readability(data, train_malicious, valid_malicious, test_malicious)
    print()
    print('About the labels')
    show_labels(data)
    print()

#### APTNER ####
About the data
Length Train: 154412
Length Valid: 35990
Length Test: 37359

Sentences Train: 6940
Sentences Valid: 1664
Sentences Test: 1529

Unique Tokens Train: 11818
Unique Tokens Valid: 5501
Unique Tokens Test: 4793

Error Rate Train: 518
Error Rate Dev: 68
Error Rate Test: 23

About the labels
Different Entity Types: 5
Different Entity Labels: 22
Entity Types: {'B', 'O', 'I', 'E', 'S'}
Entity Labels: {'TOOL', 'SECTEAM', 'SHA2', 'ENCR', 'URL', 'MD5', 'IDTY', 'VULID', 'PROT', 'OS', 'LOC', 'IP', 'DOM', 'O', 'EMAIL', 'FILE', 'SHA1', 'ACT', 'TIME', 'APT', 'MAL', 'VULNAME'}

#### CyNER ####
About the data
Length Train: 25769
Length Valid: 18742
Length Test: 6726

Sentences Train: 1097
Sentences Valid: 785
Sentences Test: 294

Unique Tokens Train: 4567
Unique Tokens Valid: 3363
Unique Tokens Test: 1830

Error Rate Train: 33
Error Rate Dev: 0
Error Rate Test: 12

About the labels
Different Entity Types: 3
Different Entity Labels: 6
Entity Types: {'B', 'O', 'I'}
Entity Lab

# Evaluation of different NER-techniques:
**Idea:** Compare the pipelines of CoreNLP and spaCy by focusing on their primary components. Both consist of several components that lead to the final detection of named entities in text, so it is interesting to see how they work compared to each other. Further factors of interest are their implementation, their usability in terms of programming effort, and their scalability.

**Possible Criteria:**

    1) General structure of pipelines
    2) Ease of use (Functionality)
    3) Changeability of components
    4) Domain adaptation
    5) Performance (Runtime, Scalability)

| Tool                                                                                                                        | Basic Entities | BIO Format | Domain-Adaptation | Methods for Entity Recognition                        | Adding Pre-trained Models | End-to-End Readiness | Programming Language | Popularity on GitHub |
|-----------------------------------------------------------------------------------------------------------------------------|----------------|------------|-------------------|-------------------------------------------------------|----------------------------|----------------------|----------------------|----------------------|
| [spaCy](https://spacy.io/usage/linguistic-features#named-entities)                                                          | 18             | Yes        | Yes               | Ensemble, CNN, BILSTM, rule-based                     | Yes                        | Yes                  | Python               | 25,000+              |
| [flairNLP](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md#named-entity-recognition-ner) | 13             | Yes        | Yes               | Ensemble, CRF, BILSTM, rule-based                     | Yes                        | Yes                  | Python               | 12,000+              |
| [NLTK](https://www.nltk.org/book/ch07.html#named-entity-recognition)                                                        | 5              | Yes        | Yes               | MaxEntropy, rule-based, regexp                        | No                         | No                   | Python               | 11,000+              |
| [CoreNLP](https://stanfordnlp.github.io/CoreNLP/ner.html)                                                                   | 4              | Yes        | Yes               | Ensemble, CRF, rule-based, perceptron, neural network | No                         | No                   | Java                 | 8,000+               |

**spaCy:** spaCy has excellent documentation that is well-organized, comprehensive, and easy to follow. The documentation includes detailed guides for installation, usage, and customization, as well as a complete API reference. Additionally, spaCy has a vibrant community of developers who contribute to the documentation and provide support through forums and chat channels.

**flairNLP:** flairNLP also has good documentation, although it is not as extensive as spaCy's. The documentation includes guides for installation, usage, and customization, as well as examples and API reference.

**NLTK:** NLTK has been around for a long time and has a very extensive documentation, with comprehensive guides and tutorials for various natural language processing tasks. However, the documentation can be overwhelming for new users, as it covers a lot of ground and may require some programming experience to fully understand.

**CoreNLP:** CoreNLP has documentation that is adequate for basic usage, but it can be difficult to navigate and lacks examples and detailed explanations for more advanced features like adding new entities. Additionally, the documentation is less actively maintained than some other libraries, which may make it harder to get support when needed.

# spaCy:

The spaCy library provides a powerful and flexible pipeline for state-of-the-art natural language processing. At its core is the nlp object, which represents the pipeline itself. The pipeline is a sequence of traceable components that are applied to each input text in turn, with each component performing a specific task such as tokenization, part-of-speech tagging, or named entity recognition.

The nlp object is created by loading a pre-trained model, such as en_core_web_sm, which contains a set of pre-defined pipeline components for standard cases of NER. These components can be modified or extended as needed using the nlp.add_pipe() method. Each pipeline component takes a Doc object as input and returns a modified Doc object with additional annotations.

The Doc object represents a processed document of text, and contains a sequence of Token objects that represent individual words or other elements of the text, such as punctuation or whitespace. Each Token object has a variety of properties and annotations, such as its lemma, part-of-speech tag, and named entity label.

The nlp object also provides a range of convenient methods and attributes for working with processed documents, such as accessing specific tokens or entities, visualizing the document structure, or performing similarity calculations between documents.

![SpaCy pipeline](https://spacy.io/images/pipeline.svg)
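The interplay of the nlp object, nlp.add_pipe(), Doc and Token can be sketched without downloading a pre-trained model. The example assumes spaCy v3 and uses a blank English pipeline together with the built-in rule-based `entity_ruler` component. The single pattern that labels "Kaspersky Lab" with the APTNER-style tag SECTEAM is a hypothetical choice for this illustration, not part of the evaluation in this report.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")

# Add the built-in rule-based entity ruler as a pipeline component.
ruler = nlp.add_pipe("entity_ruler")
# Hypothetical pattern for illustration only.
ruler.add_patterns([{"label": "SECTEAM", "pattern": "Kaspersky Lab"}])

# Calling nlp runs the tokenizer and then each component in order on the Doc.
doc = nlp("Kaspersky Lab products detect the malware .")

print(nlp.pipe_names)

# Each Token carries its own annotations, including the IOB prefix and entity type.
for token in doc:
    print(token.text, token.ent_iob_, token.ent_type_)

# Entities are exposed as spans on the Doc.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

A rule-based component like this is only one of several ways to introduce domain-specific labels; a statistical ner component would instead have to be trained on annotated data.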

## Basic Entities Labels

The English spaCy pipelines trained on OntoNotes 5 provide a basic set of 18 entity labels, to be interpreted as:

    CARDINAL : Numerals that do not fall under another type
    DATE : Absolute or relative dates or periods
    EVENT : Named hurricanes, battles, wars, sports events, etc.
    FAC : Buildings, airports, highways, bridges, etc.
    GPE : Countries, cities, states
    LANGUAGE : Any named language
    LAW : Named documents made into laws.
    LOC : Non-GPE locations, mountain ranges, bodies of water
    MONEY : Monetary values, including unit
    NORP : Nationalities or religious or political groups
    ORDINAL : "first", "second", etc.
    ORG : Companies, agencies, institutions, etc.
    PERCENT : Percentage, including "%"
    PERSON : People, including fictional
    PRODUCT : Objects, vehicles, foods, etc. (not services)
    QUANTITY : Measurements, as of weight or distance
    TIME : Times smaller than a day
    WORK_OF_ART : Titles of books, songs, etc.

These basic entity labels provide a solid foundation for most common NLP setups.
However, certain use cases like NER-CTI or other downstream tasks require specific entity labels, as can be seen for APTNER, CyNER and DNRTI. This will be shown in the section "Domain Adaptation".

## Model NER Performance In Comparison

| Name                  | F1   | Prec. | Rec. |    
|-----------------------|------|-------|------|
| en_core_web_sm        | 84.6 | 84.5  | 84.6 |
| en_core_web_md        | 85.2 | 84.9  | 85.5 |
| en_core_web_lg        | -    | -     | -    |
| en_core_web_trf       | -    | -     | -    |
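
For reference, the three metrics are related by F1 = 2 * P * R / (P + R). The following sketch computes them on entity level, assuming exact-match scoring in which a predicted entity only counts if both its span boundaries and its label agree with the gold annotation. The function name `precision_recall_f1` and the predicted spans are hypothetical; the gold spans correspond to the APTNER example sentence with whitespace tokenization.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores over sets of (start, end, label) entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gold spans as (start_token, end_token_exclusive, label).
gold = {(0, 2, "SECTEAM"), (11, 12, "FILE"), (13, 14, "FILE")}
# Hypothetical model output: two correct entities and one spurious one.
predicted = {(0, 2, "SECTEAM"), (11, 12, "FILE"), (5, 6, "FILE")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P = {precision:.3f}, R = {recall:.3f}, F1 = {f1:.3f}")
```

Because the table values are rounded, recomputing F1 from the rounded precision and recall may differ from the reported F1 in the last digit.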

## Domain Adaptation
TBC.

In [6]:
# Please note: To install nlp-models like en_core_web_md use !python -m spacy download <model>

In [7]:
import re
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Define an example sentence from APTNER containing a special component.
sentence = "Kaspersky Lab products detect the malware described in this report as Trojan.Win32.Remexi and Trojan.Win32.Agent ."

# Define the custom tokenizer as a pipeline component
@Language.component('conll_tokenizer')
def conll_tokenizer(doc):
    # Re-tokenize on whitespace only, so that the tokens match the CoNLL-style datasets.
    pattern = r'\s+'
    # Strip the text first and drop empty strings, so that no empty tokens are created.
    processed_tokens = [token for token in re.split(pattern, doc.text.strip()) if token]
    return Doc(doc.vocab, words=processed_tokens)

# Loading the ner pipeline.
nlp = spacy.load('en_core_web_sm')
# The tokenizer is a custom component and runs as a preprocessing pipeline.
nlp.add_pipe("conll_tokenizer", first=True)

# Have a look at all components present.
for name, component in nlp.components:
    # Have a look at https://spacy.io/usage/spacy-101#pipelines
    # The models components are separate pipelines and can be accessed via nlp.get_pipe(<name>).
    print(name, component)

# Print the last layer of the ner pipeline used for classification.
ner = nlp.get_pipe('ner')
ner_last_layer = ner.model.layers[-1]
print("\tLast layer in ner pipeline:", ner_last_layer.name)

# Run the pipeline, including the custom tokenizer, on the sample sentence.
print("\nNamed entities in example sentence:")
doc = nlp(sentence)

# Extract the named entities from the processed document.
for ent in doc.ents:
    print(ent.text, ent.label_)


conll_tokenizer <function conll_tokenizer at 0x127d2b490>
tok2vec <spacy.pipeline.tok2vec.Tok2Vec object at 0x13e6b2c20>
tagger <spacy.pipeline.tagger.Tagger object at 0x13e6b1cc0>
parser <spacy.pipeline.dep_parser.DependencyParser object at 0x13e5b0c80>
senter <spacy.pipeline.senter.SentenceRecognizer object at 0x13e6b2e00>
attribute_ruler <spacy.pipeline.attributeruler.AttributeRuler object at 0x13e8a5d80>
lemmatizer <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x13e8e0c40>
ner <spacy.pipeline.ner.EntityRecognizer object at 0x13e5b0c10>
	Last layer in ner pipeline: linear

Named entities in example sentence:
Kaspersky Lab PERSON


In [8]:
# Evaluate the model performance on their standard data.
for model_name in ['en_core_web_sm', 'en_core_web_md']:
    pre_trained_model = spacy.load(model_name)
    # get the baseline values for all standard models.
    performance = pre_trained_model.meta.get('performance')
    pre = performance.get('ents_p')
    rec = performance.get('ents_r')
    f1 = performance.get('ents_f')
    
    print(f"Name:{model_name}:")
    for source in pre_trained_model.meta.get('sources'):
        print(f"Source: {source.get('name')}")
    print("NER Performance: ")
    print(f"\tF1 = {f1:.3f}, Prec. = {pre:.3f}, Rec. {rec:.3f}:\n")

Name:en_core_web_sm:
Source: OntoNotes 5
Source: ClearNLP Constituent-to-Dependency Conversion
Source: WordNet 3.0
NER Performance: 
	F1 = 0.846, Prec. = 0.845, Rec. 0.846:

Name:en_core_web_md:
Source: OntoNotes 5
Source: ClearNLP Constituent-to-Dependency Conversion
Source: WordNet 3.0
Source: Explosion Vectors (OSCAR 2109 + Wikipedia + OpenSubtitles + WMT News Crawl)
NER Performance: 
	F1 = 0.852, Prec. = 0.849, Rec. 0.855:



# Domain Adaptation