# Guided Exploration - Infer MITRE technique from Threat Intel Data

__Notebook Version:__ 1.0 <br>
__Notebook Author:__ Vani Asawa<br>


__Python Version:__ >=Python 3.8<br>
__Platforms Supported:__  Azure Machine Learning Notebooks<br>

__Data Source Required:__ None<br>

__GPU Compute Required:__ No<br>
__GPU Compute Recommended:__ Yes<br>

__Requirements Path:__ ```../mitre-technique-inference/requirements.txt```<br>
__Essential Packages:__ 
- ipywidgets==7.5.1
- transformers==4.5.1
- torch==1.10.2
- msticpy==2.1.2
- nltk==3.6.2
- iocextract==1.13.1
- shap==0.41.0

## Description
**Cyber Threat Intelligence** (CTI) provides a framework for threat analysts to document the operations of a threat actor group and record the findings of their investigations of specific cyber attack incidents.

With the increasing number and sophistication of attacks occurring across organizations' workspaces, CTI allows organizations to develop a more robust and proactive security posture, better detect vulnerabilities in their infrastructure, and adopt security solutions and policies that better protect their environment. For example, **Indicators of Compromise (IoC)** represent network artifacts of a cyber intrusion and are widely used in intrusion detection systems and antivirus software to detect future attacks.

**Threat Intel Data** is another form of CTI, comprising rich, unstructured textual data that describes the tactics, techniques, and procedures (TTPs) used by threat actor groups in a cyber operation. Historically, TI data has been made available to the security community in the form of blog posts, reports, and white papers. With the increasing number of cyber attacks, it is not scalable to manually process this growing corpus of TI data to understand the motivations, capabilities, and TTPs associated with an actor group. Additionally, TI data does not facilitate easy extraction of IoCs, so indicators documented in a report can be lost from the threat intelligence corpus. This opens up several avenues for **Machine Learning**, and more particularly **Natural Language Processing** (NLP), to identify TTPs and IoCs from this data.

The **MITRE ATT&CK** framework is a publicly available knowledge base of TTPs used by adversaries across enterprise and mobile applications. MITRE TTPs allow people and organizations to proactively identify vulnerabilities in their systems based on the behaviors, methods, and patterns of activity used by an actor group at different stages of a cyber operation.

#################################

In this notebook, we use NLP to
1. *Detect MITRE TTPs* and
2. *Extract IoCs*

from unstructured, English text-based Threat Intel data. We also provide some explainability into the TTP predictions made by our NLP model by identifying specific words or phrases in the input TI data that contribute to the prediction.

#################################

## Prerequisites
**Please do not run all the notebook cells at once**. Run the cells sequentially, and confirm that each cell has executed successfully before proceeding to the next.

## Table of Contents

1. Imports
2. Configure Input Data and Model Parameters
3. Download Model Artifacts
4. Process TI Data
5. Inference
6. Explainability

## Installations

Please install the packages listed in ```../mitre-technique-inference/requirements.txt``` in your virtual environment before running the rest of the cells in the notebook.

In [1]:
import os
import sys
sys.path.append(os.getcwd())

############### REQUIREMENTS.TXT ################
# requirements_path = os.path.join(os.getcwd(), 'requirements.txt')
# os.system(f'pip install -r {requirements_path}')
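
Before moving on, it can help to confirm that the environment matches the pinned versions in the "Essential Packages" list. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `find_mismatches` are illustrative and are not part of the notebook's `utils` package.

```python
from importlib import metadata

# Pins copied from the "Essential Packages" list at the top of this notebook.
ESSENTIAL_PINS = {
    "ipywidgets": "7.5.1",
    "transformers": "4.5.1",
    "torch": "1.10.2",
    "msticpy": "2.1.2",
    "nltk": "3.6.2",
    "iocextract": "1.13.1",
    "shap": "0.41.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a tuple of (expected, found)."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


# Report anything that is missing or installed at a different version.
for name, (expected, found) in find_mismatches(ESSENTIAL_PINS).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty report means every essential package is installed at its pinned version.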

## Imports

The modules used to run this notebook can be found under ```mitre-technique-inference/utils/*```

In [18]:
import torch
import utils
from utils import (
    configs as config_utils,
    storage as storage_utils,
    inference as inference_utils,
    process as process_utils,
)

## Configure Input Data and Model Parameters
The notebook requires the following parameters from the user:
1. ***Threat Intel Data***: Unstructured, English text data that the user would like to process through the NLP model. If you are entering multiple text reports in the widget, separate the reports with an empty line. Do not include any commas, punctuation, or brackets before or after the reports. <br>
- For example, the three threat reports below are separated by empty lines. A report can be longer than one sentence; each report here is a single sentence to keep the example short.

    ```
    Like many threat groups, TG-3390 conducts strategic web compromises (SWCs), also known as watering hole attacks, on websites associated with the target organization's vertical or demographic to increase the likelihood of finding victims with relevant information.

    Threat groups use strategic web compromises (SWCs), also known as watering hole attacks, to target a wide array of potential victims.

    A build tool is likely being used by these attackers that allows the operator to configure details such as C2 addresses, C2 encryption keys, and a campaign code.
    ```
       
2. ***Select NLP Model***: We have trained four variations of GPT-2 transformer models using publicly-available threat intel datasets that map TI data to MITRE TTPs. 
- *distilgpt2* models are 40% smaller in storage size than the *gpt2* models. <br>

- *distilgpt2-1024* and *gpt2-1024* models process more word tokens in a single threat input statement than the *distilgpt2-512* and *gpt2-512* models, which can be particularly useful if your threat intel data is long. <br>

- Default model: **distilgpt2-512** <br><br>

3. ***Minimum Score Threshold***: Each TTP prediction for a TI input has an associated confidence score from the NLP model, ranging from 0 (least confident) to 1 (most confident). The results are filtered to predictions whose confidence is greater than or equal to the threshold configured by the user. <br>

- Default threshold: **0.7** <br> <br>
       
4. ***Chunk Threat Intel Data?***: 
- One of the limitations of transformer models is that they can only process inputs up to a certain length, after which the rest of the report is discarded. 
- As a result, the model can miss potentially important information about the actor's TTPs described in the latter parts of the report. 
- If a single threat report in your input is longer than 3 sentences, we recommend **chunking**: the model processes the sentences in your input data in batches of 3 sentences, assigning a TTP prediction to each chunk so that the entire report is processed. <br>

- Default value: **Yes** <br><br>

5. ***Extract Indicators of Compromise (IoCs)***: Extract IoCs from the input TI data. <br>

- Default value: **Yes** <br><br>

6. ***Get NLP Model Explainability***: Obtain further insights into which words and phrases in your input data contributed to the TTP prediction. <br>

- Default value: **Yes** <br><br>
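
To make the input format and the chunking option concrete, here is a rough sketch of how text separated by empty lines could be split into reports, and how one report could be grouped into chunks of 3 sentences. It is illustrative only: the function names are hypothetical, and the sentence splitter is a naive regular expression rather than the NLTK-based processing the notebook uses.

```python
import re


def split_reports(raw_text):
    """Split widget text into reports, treating one or more empty lines as the separator."""
    parts = re.split(r"\n\s*\n", raw_text.strip())
    # Collapse internal whitespace so each report becomes a single line of text.
    return [" ".join(part.split()) for part in parts if part.strip()]


def split_sentences(report):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    Abbreviations such as 'e.g.' will be split incorrectly."""
    return [s for s in re.split(r"(?<=[.!?])\s+", report.strip()) if s]


def chunk_report(report, sentences_per_chunk=3):
    """Group a report's sentences into consecutive chunks of at most `sentences_per_chunk`."""
    sentences = split_sentences(report)
    return [
        " ".join(sentences[i : i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


raw = "First report. It has two sentences.\n\nSecond report.\n\nA. B. C. D."
for report in split_reports(raw):
    print(chunk_report(report))
```

With chunking enabled, a four-sentence report yields two chunks (three sentences, then one), and each chunk would receive its own TTP prediction.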

In [3]:
all_config_widgets = config_utils.configure_model_parameters()
for widget in all_config_widgets.values():
    display(widget)

Textarea(value='', description='Threat Intel Data:', layout=Layout(height='200px', width='80%'), style=Descrip…

Select(description='Select NLP Model: ', layout=Layout(height='80px', width='50%'), options=('distilgpt2-512',…

FloatSlider(value=0.7, description='Minimum Score Threshold: ', layout=Layout(height='30px', width='50%'), max…

Select(description='Chunk Threat Intel data?: ', layout=Layout(height='80px', width='50%'), options=('Yes', 'N…

Select(description='Extract Indicators Of Compromise (IoCs)?: ', layout=Layout(height='80px', width='50%'), op…

Select(description='Get NLP Model Explainability?: ', layout=Layout(height='80px', width='50%'), options=('Yes…

In [5]:
set_configs = {
    k: v.value for k, v in all_config_widgets.items()
}

configs = config_utils.format_user_configuration(set_configs, verbose=True)

#################### SUMMARY #################### 

Threat Intel (TI) Data: [

	"Like many threat groups, TG-3390 conducts strategic web compromises (SWCs), also known as watering hole attacks, on websites associated with the target organization's vertical or demographic to increase the likelihood of finding victims with relevant information.", 

	"Threat groups use strategic web compromises (SWCs), also known as watering hole attacks, to target a wide array of potential victims.", 

	"A build tool is likely being used by these attackers that allows the operator to configure details such as C2 addresses, C2 encryption keys, and a campaign code."

]

# of TI entries: 3

NLP Model: distilgpt2-512

Minimum Score Threshold: 0.1

Chunk Threat Intel data?: No

Extract Indicators Of Compromise (IoCs)?: Yes

Get NLP Model Explainability?: Yes

################################################# 



## Download Model Artifacts

- In order to download the model artifacts, you will need ```bash``` configured in your notebook environment.
- The bash script will download the model configured by the user from [MSTICPy's Data Repository](https://github.com/microsoft/msticpy-data/tree/mitre-inference/mitre-inference-models) to your local machine. 
- If you re-run the notebook with the same model configuration, you do not need to re-run the bash script, provided the downloaded model folder has not been removed.
- All the model artifacts associated with the configured model will be stored under ```mitre-technique-inference/artifacts/CONFIGURED-MODEL-NAME/*```. 
- If you have access to a GPU, we HIGHLY recommend using a GPU in the inference environment. The notebook will detect the device that is used to run the notebook, and configure the model to run on that device.
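
The device detection itself happens inside `storage_utils.AssetStorage`, whose internals are not shown in this notebook. As a rough illustration of the idea, a check along the following lines is a common PyTorch pattern; `pick_device` is a hypothetical helper and is not part of the notebook's utilities.

```python
def pick_device():
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        # Imported lazily so the helper also works where torch is not installed.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Model would run on device '{pick_device()}'")
```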

In [6]:
! bash ./models.sh {configs['model']}

Downloading labels for model distilgpt2-512...

mkdir: cannot create directory ‘distilgpt2-512’: File exists
Downloaded labels for model distilgpt2-512.
Downloading tokenizer for model distilgpt2-512...
Downloaded tokenizer for model distilgpt2-512.
Downloading model dicts for model distilgpt2-512...
Downloaded model dicts for model distilgpt2-512.




In [7]:
assets = storage_utils.AssetStorage(
    configs['model']
)

inference_model = inference_utils.InferenceClassificationPipeline(
    model = assets.model,
    tokenizer = assets.tokenizer,
    device = assets.device.type
)

Tokenizer artifact obtained from path c:\Users\vaasawa\Documents\GitHub\Azure-Sentinel-Notebooks\mitre-technique-inference\artifacts\distilgpt2-512\tokenizer
Labels artifact obtained from path c:\Users\vaasawa\Documents\GitHub\Azure-Sentinel-Notebooks\mitre-technique-inference\artifacts\distilgpt2-512\labels


Some weights of the model checkpoint at distilgpt2 were not used when initializing GPT2ForSequenceClassification: ['lm_head.weight']
- This IS expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model artifact obtained from path c:\Users\vaasawa\Documents\GitHub\Azure-Sentinel-Notebooks\mitre-technique-inference\artifacts\distilgpt2-512\model_state_dicts
Model on device 'cpu'


## Process TI Data

- Extract IoCs from the raw Threat Intel reports using the ```iocextract``` package, and ```msticpy```'s IoC Extractor.
- Process the raw Threat Intel reports using the NLTK package, while assigning special tokens to IoCs detected in the raw data.
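
The two steps above can be pictured with a small, standard-library-only sketch: find indicators with regular expressions, then replace each one with a placeholder token so the model sees a stable vocabulary. This is a simplified stand-in and not how `iocextract` or `msticpy` work internally; the patterns cover only a few easy cases, and the token names (`url_token`, `email_token`, `ip_token`) are assumptions made for illustration.

```python
import re

# Simplified patterns for illustration only. Real extractors also handle
# defanged indicators, IPv6 addresses, hashes, and many more edge cases.
# URLs are matched first so that an IP inside a URL is not extracted twice.
IOC_PATTERNS = {
    "URL": re.compile(r"https?://[^\s\"'<>]+"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def extract_and_mask(text):
    """Return (iocs, masked_text): IoCs grouped by type, and the text with
    each IoC replaced by a placeholder token (hypothetical token names)."""
    iocs = {}
    masked = text
    for ioc_type, pattern in IOC_PATTERNS.items():
        iocs[ioc_type] = pattern.findall(masked)
        masked = pattern.sub(f"{ioc_type.lower()}_token", masked)
    return iocs, masked


sample = "The implant beacons to http://203.0.113.7/gate.php and 198.51.100.23."
found, masked_text = extract_and_mask(sample)
print(found)
print(masked_text)
```

The addresses in the sample come from reserved documentation ranges and do not refer to real infrastructure.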

In [8]:
processed_data_object = process_utils.ProcessData(
    configs = configs
)

processed_data_object.go()

## Inference

- MITRE ATT&CK is a knowledge base of adversary tactics and techniques based on real-world observations of attacks. The knowledge base is publicly available for use by the community, and is used to develop specific threat models and methodologies across all sectors.
- The MITRE Enterprise ATT&CK Matrix represents how an adversary achieves its tactical goal by performing a certain action. A chain of actions represents the sequence of events that the actor uses to carry out an attack. More information about the kinds of tactics and techniques used by threat actors can be found [here](https://attack.mitre.org/techniques/enterprise/).
- The model has been trained on publicly available threat intel data that has enterprise tactics assigned to the entries by security experts. We have scraped data from TRAM, Sentinel Hunting and Detection Queries, Sigma, CTID, and MITRE Repositories to create our training dataset, comprising 13k entries. The model has been trained on all 191 MITRE Enterprise techniques, but the number of entries per technique used for training varies.
- The trained model uses the processed threat intel data as input, and outputs the MITRE Enterprise Technique inferred from the processed text description. 
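
As a rough picture of the final classification step, the sketch below turns raw class scores into probabilities and reports the most likely class. It assumes a softmax over the logits, which is common for single-label sequence classifiers but is not confirmed by this notebook, and the label-to-technique mapping in the example is a toy one rather than the contents of `assets.labels`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, label_to_technique):
    """Pick the highest-probability class and attach its technique ID, if known."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = f"LABEL_{best}"
    return {
        "label": label,
        "score": probs[best],
        "technique": label_to_technique.get(label),
    }


# Toy three-class example; the mapping below is hypothetical.
toy_labels = {"LABEL_0": "T1059", "LABEL_1": "T1190", "LABEL_2": "T1566"}
print(top_prediction([0.2, 1.5, -0.3], toy_labels))
```

The resulting dictionary has the same keys (`label`, `score`, `technique`) as the predicted label printed later in this notebook.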

In [9]:
outputs = inference_model.go(processed_data_object.processed_data)

In [11]:
inference_df = inference_utils.format_predictions(
    configs = configs,
    processed_data_object = processed_data_object,
    labels = assets.labels,
    outputs = outputs,
    classifier = inference_model.classifier_max_scores
)

Partition explainer: 2it [00:31, 31.20s/it]               
Partition explainer: 2it [00:17, 17.31s/it]               
                                                  

In [12]:
print(f'Shape of Inference DF: {inference_df.shape}')
       
print('Sample Result: ')
if inference_df.empty:
    print('Empty Dataframe.')
else:
    print('Preview: ')
    display(inference_df.head(1))

Shape of DF: (3, 11)
Sample Result: 
Preview: 


| | threat_intel | processed_threat_intel | flag_chunk | num_chunks | flag_iocs | iocs | output | model | flag_explain | shap_base | shap_contribution |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Like many threat groups, TG-3390 conducts stra... | like many threat group , alphanumeric_token co... | False | | True | {'IP': [], 'EMAIL': [], 'URL': [], 'YARA': [],... | {'label': 'LABEL_78', 'score': 0.2774803936481... | distilgpt2-512 | True | {'values': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... | {'positive': {}, 'neutral': {'like': 0.0, 'man... |
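
The preview row has a score of roughly 0.28. It appears in the results because this run was configured with a minimum score threshold of 0.1; at the default threshold of 0.7 it would be filtered out. A minimal sketch of that filtering step, using plain dictionaries rather than the notebook's dataframe (the function name is hypothetical):

```python
def filter_by_threshold(predictions, threshold=0.7):
    """Keep predictions whose confidence score is at least `threshold`."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [p for p in predictions if p["score"] >= threshold]


predictions = [
    # The first entry mirrors the preview row; the second is made up for contrast.
    {"label": "LABEL_78", "score": 0.2774803936481476},
    {"label": "LABEL_5", "score": 0.91},
]
print(filter_by_threshold(predictions, threshold=0.1))  # threshold used in this run
print(filter_by_threshold(predictions))                 # default threshold of 0.7
```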


## Explainability - Visualisation

- We use SHAP (SHapley Additive exPlanations, Lundberg and Lee 2017) values to determine which words in our input data contributed to the corresponding MITRE Enterprise Technique prediction.
- Based on whether the user configured the 

In [30]:
#### Set row_index to the index of the row in inference_df that you would like to examine further
row_index = 0

In [31]:
inference_results = inference_utils.process_shap_explainability_for_row(
    inference_df, row_index
)

Inference Dataframe row index: 0 

Threat Intel Data: 

Like many threat groups, TG-3390 conducts strategic web compromises (SWCs), also known as watering hole attacks, on websites associated with the target organization's vertical or demographic to increase the likelihood of finding victims with relevant information.

Processed Data: 
like many threat group , alphanumeric_token conduct strategic web compromise ( swcs ) , also known watering hole attack , website associated target organization 's vertical demographic increase likelihood finding victim relevant information .

Predicted Label: 
{'label': 'LABEL_78', 'score': 0.2774803936481476, 'technique': 'T1190'}

Insufficient shap data for explainability.
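
The `shap_contribution` column groups tokens under keys such as `positive` and `neutral`. The sketch below shows one way per-token SHAP values could be bucketed by sign. The bucketing rule, the `negative` key, and both function names are assumptions made for illustration; the actual logic in `process_shap_explainability_for_row` is not shown in this notebook.

```python
def bucket_contributions(tokens, shap_values, eps=1e-9):
    """Group tokens by the sign of their SHAP value for the predicted class.
    A repeated token keeps the value from its last occurrence."""
    buckets = {"positive": {}, "neutral": {}, "negative": {}}
    for token, value in zip(tokens, shap_values):
        if value > eps:
            buckets["positive"][token] = value
        elif value < -eps:
            buckets["negative"][token] = value
        else:
            buckets["neutral"][token] = value
    return buckets


def has_signal(buckets):
    """True when at least one token pushed the prediction up or down."""
    return bool(buckets["positive"] or buckets["negative"])


# Illustrative values only; these are not taken from the run above.
example = bucket_contributions(["watering", "hole", "like"], [0.31, 0.12, 0.0])
print(example, has_signal(example))
```

In the run above, the visible SHAP values are all 0.0 and the `positive` group is empty, which is consistent with the "Insufficient shap data" message, although the exact condition the notebook checks is not shown.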
