# Guided Exploration - Infer MITRE technique from Threat Intel Data v2

__Notebook Version:__ 1.0 <br>
__Notebook Author:__ Vani Asawa<br>


__Python Version:__ >=Python 3.8<br>
__Platforms Supported:__  Azure Machine Learning Notebooks<br>

__Data Source Required:__ None<br>

__GPU Compute Required:__ No<br>
__GPU Compute Recommended:__ Yes<br>

__Requirements Path:__ ```../mitre-technique-inference/requirements.txt```<br>
__Essential Packages:__ 
- ipywidgets==7.5.1
- transformers==4.5.1
- torch==1.10.2
- msticpy==2.1.2
- nltk==3.6.2
- iocextract==1.13.1
- shap==0.41.0

## Motivation
**Cyber Threat Intelligence** (CTI) provides a framework for threat analysts to document the operations of a threat actor group, and record the findings of their investigations of specific cyber attack incidents.

With the increasing number and sophistication of attacks occurring across organizations' workspaces, CTI allows organizations to:
- Develop a more robust and proactive security posture
- Better detect threat vulnerabilities in their infrastructure
- Adopt security solutions and policies that allow them to better protect their environment

For example, **Indicators of Compromise (IoC)** represent network artifacts of a cyber intrusion and are widely used in intrusion detection systems and antivirus software to detect future attacks.

**Threat Intel Data** is another form of CTI, comprising unstructured text data that describes the tools, techniques and procedures (TTPs) used by threat actor groups in a cyber operation. Historically, TI data has been made available to the security community in the form of *blog posts*, *reports* and *white papers*. With the increasing number of cyber attacks, it is not scalable to manually process this growing corpus of TI data to understand the motivations, capabilities and TTPs associated with an actor group. Additionally, the unstructured format makes it hard to extract IoCs that are documented in a report, so known indicators can be lost from the threat intelligence corpus. This opens up several avenues for **Machine Learning**, and more particularly **Natural Language Processing** (NLP), to identify TTPs and extract IoCs from this data.

The **MITRE ATT&CK** framework is an openly available knowledge base of TTPs used by adversaries across enterprise and mobile platforms. MITRE TTPs allow people and organizations to proactively identify vulnerabilities in their systems based on the behaviors, methods and patterns of activity used by an actor group in different stages of a cyber operation. More information about the kinds of tactics and techniques used by threat actors can be found [here](https://attack.mitre.org/techniques/enterprise/).

#################################

In this notebook, we apply NLP to unstructured, English-language Threat Intel data to:
1. *Detect MITRE TTPs* using the **Distil-GPT2** transformer model, and
2. *Extract IoCs* using the ```iocextract``` package and ```msticpy```'s IoC Extractor.

We also provide some explainability for the TTP predictions made by our NLP model by using [SHAP](https://arxiv.org/pdf/1705.07874.pdf) values to identify the specific words or phrases in the input TI data that contribute to each prediction.

#################################
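
To make the IoC extraction step more concrete, here is a minimal sketch that uses plain regular expressions as a simplified stand-in for the extractors. The notebook itself relies on ```iocextract``` and ```msticpy```; the patterns, the ```extract_iocs_sketch``` function name, and the sample sentence below are assumptions for illustration only, and they cover far fewer cases than the real packages (for example, no defanged indicators).

```python
import re

# Simplified, hypothetical patterns for illustration only.
# The real extractors (iocextract, msticpy) handle many more IoC types.
IOC_PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "url": re.compile(r"\bhttps?://[^\s\"'<>]+"),
}


def extract_iocs_sketch(text):
    """Return a sorted list of (ioc_type, ioc_value) pairs found in text."""
    found = set()
    for ioc_type, pattern in IOC_PATTERNS.items():
        for match in pattern.finditer(text):
            found.add((ioc_type, match.group(0)))
    return sorted(found)


# Made-up sample text; the IP is from a documentation-only address range.
sample = (
    "The loader contacted 203.0.113.7 and fetched a payload from "
    "http://example.com/update.bin (MD5 d41d8cd98f00b204e9800998ecf8427e)."
)

for ioc_type, value in extract_iocs_sketch(sample):
    print(ioc_type, value)
```

Each match is returned as a type/value pair, which loosely mirrors the IoC type and value columns shown in the results section of this notebook.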

## Prerequisites
**Please do not run the notebook cells all at once**. The cells need to be run sequentially and successfully executed before proceeding with the remainder of the notebook.

## Table of Contents

0. Installations [One-Time Setup]
1. Imports
2. Configure Input Data and Model Parameters
3. Run
4. Results

Note: If you copy this notebook, you also need to copy the associated ```utils``` folder.
- Can we bundle the package and requirements text in a zip file? A whl?
- Make a utils.zip

## 0. Installations [One-Time Setup]

Please configure a virtual environment, install the packages listed in ```../mitre-technique-inference/requirements.txt``` into it, and then download the model artifacts.

### Creating a virtual environment

Navigate to the ```\Azure-Sentinel-Notebooks``` folder in a terminal, configure a virtual environment, and install the ```../mitre-technique-inference/requirements.txt``` packages in your venv:

``` 
    > pip install virtualenv
    > virtualenv <VENV_NAME>
    > source <VENV_NAME>/bin/activate    # Linux/macOS; in Git Bash on Windows use <VENV_NAME>/Scripts/activate
    > cd mitre-technique-inference
    > pip install -r requirements.txt
```

### Downloading Model Artifacts

Estimated Time: < 10 minutes

- [**Distil-GPT2**](https://huggingface.co/distilgpt2) is an English-language model pre-trained with the supervision of the smallest GPT-2 model. It was developed using knowledge distillation to serve as a faster, lightweight version of GPT-2.

- We train a Distil-GPT2 model on publicly available Threat Intel data that has been mapped to Enterprise Techniques by security experts. We scraped data from the TRAM, Sentinel Hunting and Detection Queries, Sigma, CTID, and MITRE repositories to create our training dataset, which comprises 13k entries. The model has been trained on all 191 MITRE Enterprise techniques, but the number of training entries per technique varies.

- In order to download the model artifacts, you will need ```bash``` configured in your notebook environment. The bash script will download the trained ```distilgpt2-512``` model artifacts from [MSTICPy's Data Repository](https://github.com/microsoft/msticpy-data/tree/mitre-inference/mitre-inference-models) to the local path ```../mitre-technique-inference/artifacts/distilgpt2-512/*```. <br>

- **Alternatively**, you can use GitHub to download the model artifacts to the above local path. 

- The model artifacts stored locally will comprise:<br>

    - ```../mitre-technique-inference/artifacts/distilgpt2-512/model_state_dicts``` - Model weights associated with the trained Distil-GPT2 Model.
    - ```../mitre-technique-inference/artifacts/distilgpt2-512/labels``` - Mapping of prediction labels to MITRE Enterprise Techniques.
    - ```../mitre-technique-inference/artifacts/distilgpt2-512/tokenizer``` - Trained Distil-GPT2 tokenizer associated with the model. <br>
<br>

- If you have access to a GPU, we HIGHLY recommend using a GPU in the inference environment. The notebook will detect the device that is used to run the notebook, and configure the model to run on that device.<br>
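
As a rough illustration of the device detection described above, the sketch below picks a GPU when PyTorch reports one and falls back to the CPU otherwise. The ```pick_device``` helper is a hypothetical name, not necessarily how the notebook's ```utils``` modules implement it; the import guard is there only so the sketch also runs where PyTorch is not installed.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # listed among the notebook's essential packages
    except ImportError:
        # Fallback so this sketch still runs without PyTorch installed.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Inference would run on: {device}")
```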

In [None]:
! bash ./model.sh distilgpt2-512

Re-start the kernel and run the Notebook from **1. Imports**.
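
Before moving on, it can help to confirm that the download produced the three artifacts listed above. The following is a minimal sketch under the assumption that each artifact appears as an entry with exactly that name under the artifact path; the ```missing_artifacts``` helper is hypothetical and not part of the notebook's ```utils``` package.

```python
from pathlib import Path

# Artifact names taken from the list above; whether each one is a file or a
# directory is not assumed, so only existence is checked.
EXPECTED_ARTIFACTS = ("model_state_dicts", "labels", "tokenizer")


def missing_artifacts(artifact_dir):
    """Return the expected artifact names that do not exist under artifact_dir."""
    base = Path(artifact_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (base / name).exists()]


missing = missing_artifacts("../mitre-technique-inference/artifacts/distilgpt2-512")
if missing:
    print("Missing artifacts:", ", ".join(missing))
else:
    print("All expected artifacts found.")
```

If anything is reported as missing, re-running the download step (or fetching the files from GitHub) is the likely fix.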

## 1. Imports

The modules used to run this notebook can be found under ```mitre-technique-inference/utils/*```

In [None]:
import os
import sys
sys.path.append(os.getcwd())

import utils
from utils import main, inference, configs

## 2. Configure Input Data and Model Parameters

**IMPORTANT** To view the widgets in your notebook, you may need to install the following Jupyter extension from a terminal: ```jupyter labextension install @jupyter-widgets/jupyterlab-manager``` <br><br>

The notebook requires the following parameters from the user:
1. ***Threat Intel Data***: Unstructured, English threat report that the user would like to process through the NLP model.

- Sample Report:

    ```    
    The Black Basta ransomware group quickly gained notoriety after it laid claim to massive breaches earlier this year. On April 20, 2022, a user under the name “Black Basta” sought out corporate network access credentials on underground forums in exchange for a share of the profits from their ransomware attacks. Specifically, the user was in the market for credentials that could compromise organizations based in English-speaking countries, including Australia, Canada, New Zealand, the UK, and the US.

    Two days later, the American Dental Association (ADA) suffered a cyberattack that led it to shutter multiple systems. Data allegedly stolen from the ADA was published on the Black Basta leak site only 96 hours after the attack.

    While it was previously assumed that the ransomware group used bought or stolen corporate network access credentials to infiltrate its victims’ networks, our analysis of another set of samples monitored within a 72-hour time frame shows a possible correlation between the Qakbot trojan and the Black Basta ransomware. Black Basta continued to evolve, and in June, a Linux build of the ransomware that encrypts VMware ESXi virtual machines was discovered in the wild.

    Interestingly, the ransomware group does not appear to distribute its malware at random. That Black Basta’s operators have turned to underground markets to acquire network access credentials and have hard-coded a unique ID in every Black Basta build betrays their mature understanding of how ransomware works as a business. While Black Basta may be a newly formed group, the individuals behind it are likely seasoned cybercriminals.
    ```

2. ***Minimum Score Threshold***: Each TTP prediction for the input TI data has an associated confidence score from the NLP model, ranging from 0 (least confident) to 1 (most confident). The results are filtered to predictions with a confidence >= the threshold configured by the user. <br>

- Default threshold: **0.6** <br> <br>
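
The thresholding rule can be sketched in a few lines. This is only an illustration of the filter described above, not the notebook's own implementation: the ```filter_by_threshold``` function, the dictionary layout, and the scores are all hypothetical.

```python
def filter_by_threshold(predictions, threshold=0.6):
    """Keep predictions whose score is >= threshold, highest score first."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    kept = [p for p in predictions if p["score"] >= threshold]
    return sorted(kept, key=lambda p: p["score"], reverse=True)


# Hypothetical predictions; the scores are made up for illustration.
predictions = [
    {"technique": "T1078", "score": 0.60},
    {"technique": "T1486", "score": 0.91},
    {"technique": "T1059", "score": 0.42},
]

for p in filter_by_threshold(predictions):
    print(p["technique"], p["score"])
```

Note that the comparison is inclusive, so a prediction scoring exactly the threshold is kept. Raising the threshold trades recall for precision: fewer techniques are reported, but each with higher model confidence.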

In [None]:
config_widgets = configs.configure_model_parameters()
# Render each configuration widget so its value can be set interactively
for key in config_widgets:
    display(config_widgets[key])

## 3. Run

Size of Threat Intel Report - Mention time estimates

In [None]:
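# Note: this rebinds the name `configs`, shadowing the `configs` module imported in
# section 1. Re-run the Imports cell before re-running the widget cell in section 2.
configs, inference_df, iocs_df = main.go(
    config_widgets
)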

## 4. Results

In [None]:
inference.print_detailed_report(
    inference_df,
    configs
)

In [None]:
print('Summary Statistics for Inference Dataframe: ')
print('Shape of Inference Dataframe: ', inference_df.shape)
if not inference_df.empty:
    print('Sample rows: ')
    display(inference_df.head(5))
else:
    print('No results obtained.')

In [None]:
print('Summary Statistics for IOCs Dataframe: ')
print('Shape of IOCs Dataframe: ', iocs_df.shape)
if not iocs_df.empty:
    print('Distinct counts for each category of IOCs: ')
    display(iocs_df.groupby('IOC_Type').count().rename(columns={'IOC_Value': 'Count'}))
    print('Sample rows: ')
    display(iocs_df.head(5))
else:
    print('No IOCs obtained.')