In [1]:
%load_ext nb_black


---

# Get the data source from the MITRE ATT&CK website

1. The sources of the attack techniques are listed in the ATT&CK website configuration (https://github.com/mitre-attack/attack-website/blob/master/modules/site_config.py). The links there point to the MITRE CTI GitHub repo.

```python
# Domains for stix objects
domains = [
    {
        "name": "enterprise-attack",
        "location": "https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json",
        "alias": "Enterprise",
        "deprecated": False,
    },
    {
        "name": "mobile-attack",
        "location": "https://raw.githubusercontent.com/mitre/cti/master/mobile-attack/mobile-attack.json",
        "alias": "Mobile",
        "deprecated": False,
    },
    {
        "name": "pre-attack",
        "location": "https://raw.githubusercontent.com/mitre/cti/master/pre-attack/pre-attack.json",
        "alias": "PRE-ATT&CK",
        "deprecated": True,
    },
]
```
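As a minimal sketch of how this configuration could be used, the snippet below selects the non-deprecated domains and defines a helper that downloads one bundle with the standard library. The helper names `active_domains` and `download_bundle` and the `../data/mitre` target directory are assumptions made for illustration; they are not part of the ATT&CK website code.

```python
import urllib.request
from pathlib import Path

# Trimmed copy of the site_config.py entries shown above
domains = [
    {
        "name": "enterprise-attack",
        "location": "https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json",
        "deprecated": False,
    },
    {
        "name": "mobile-attack",
        "location": "https://raw.githubusercontent.com/mitre/cti/master/mobile-attack/mobile-attack.json",
        "deprecated": False,
    },
    {
        "name": "pre-attack",
        "location": "https://raw.githubusercontent.com/mitre/cti/master/pre-attack/pre-attack.json",
        "deprecated": True,
    },
]


def active_domains(domains):
    """Return the domain entries that are not marked as deprecated."""
    return [domain for domain in domains if not domain["deprecated"]]


def download_bundle(domain, target_dir="../data/mitre"):
    """Download one STIX bundle into target_dir (hypothetical location)."""
    target = Path(target_dir) / f"{domain['name']}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(domain["location"], str(target))
    return target
```

Calling `download_bundle(domain)` for each entry of `active_domains(domains)` would fetch the enterprise and mobile bundles. The download is kept in a separate function so that nothing touches the network until it is called.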

2. The information shown on a technique page (e.g. https://attack.mitre.org/techniques/T1134/) can be found in https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json. Each technique object contains several useful fields:

    - "id"
    - "type"
    - "name"
    - "description"
    - "x_mitre_detection"
    

```json
{
    "object_marking_refs": ["marking-definition--fa42a846-8d90-4e51-bc29-71d5b4802168"],
    "external_references": [
        {
            "source_name": "mitre-attack",
            "external_id": "T1134",
            "url": "https://attack.mitre.org/techniques/T1134",
        },
        {
            "external_id": "CAPEC-633",
            "source_name": "capec",
            "url": "https://capec.mitre.org/data/definitions/633.html",
        },
        {
            "url": "https://pentestlab.blog/2017/04/03/token-manipulation/",
            "description": "netbiosX. (2017, April 3). Token Manipulation. Retrieved April 21, 2017.",
            "source_name": "Pentestlab Token Manipulation",
        },
        {
            "url": "https://technet.microsoft.com/en-us/windows-server-docs/identity/ad-ds/manage/component-updates/command-line-process-auditing",
            "description": "Mathers, B. (2017, March 7). Command line process auditing. Retrieved April 21, 2017.",
            "source_name": "Microsoft Command-line Logging",
        },
        {
            "url": "https://msdn.microsoft.com/en-us/library/windows/desktop/aa378184(v=vs.85).aspx",
            "description": "Microsoft TechNet. (n.d.). Retrieved April 25, 2017.",
            "source_name": "Microsoft LogonUser",
        },
        {
            "url": "https://msdn.microsoft.com/en-us/library/windows/desktop/aa446617(v=vs.85).aspx",
            "description": "Microsoft TechNet. (n.d.). Retrieved April 25, 2017.",
            "source_name": "Microsoft DuplicateTokenEx",
        },
        {
            "url": "https://msdn.microsoft.com/en-us/library/windows/desktop/aa378612(v=vs.85).aspx",
            "description": "Microsoft TechNet. (n.d.). Retrieved April 25, 2017.",
            "source_name": "Microsoft ImpersonateLoggedOnUser",
        },
        {
            "url": "https://www.blackhat.com/docs/eu-17/materials/eu-17-Atkinson-A-Process-Is-No-One-Hunting-For-Token-Manipulation.pdf",
            "description": "Atkinson, J., Winchester, R. (2017, December 7). A Process is No One: Hunting for Token Manipulation. Retrieved December 21, 2017.",
            "source_name": "BlackHat Atkinson Winchester Token Manipulation",
        },
    ],
    "description": "Adversaries may modify access tokens to operate under a different user or system security context to perform actions and bypass access controls. Windows uses access tokens to determine the ownership of a running process. A user can manipulate access tokens to make a running process appear as though it is the child of a different process or belongs to someone other than the user that started the process. When this occurs, the process also takes on the security context associated with the new token.\n\nAn adversary can use built-in Windows API functions to copy access tokens from existing processes; this is known as token stealing. These token can then be applied to an existing process (i.e. [Token Impersonation/Theft](https://attack.mitre.org/techniques/T1134/001)) or used to spawn a new process (i.e. [Create Process with Token](https://attack.mitre.org/techniques/T1134/002)). An adversary must already be in a privileged user context (i.e. administrator) to steal a token. However, adversaries commonly use token stealing to elevate their security context from the administrator level to the SYSTEM level. An adversary can then use a token to authenticate to a remote system as the account for that token if the account has appropriate permissions on the remote system.(Citation: Pentestlab Token Manipulation)\n\nAny standard user can use the <code>runas</code> command, and the Windows API functions, to create impersonation tokens; it does not require access to an administrator account. There are also other mechanisms, such as Active Directory fields, that can be used to modify access tokens.",
    "name": "Access Token Manipulation",
    "created_by_ref": "identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5",
    "id": "attack-pattern--dcaa092b-7de9-4a21-977f-7fcb77e89c48",
    "type": "attack-pattern",
    "kill_chain_phases": [
        {"kill_chain_name": "mitre-attack", "phase_name": "defense-evasion"},
        {"kill_chain_name": "mitre-attack", "phase_name": "privilege-escalation"},
    ],
    "modified": "2021-04-24T13:40:52.952Z",
    "created": "2017-12-14T16:46:06.044Z",
    "x_mitre_defense_bypassed": [
        "Windows User Account Control",
        "System access controls",
        "File system access controls",
        "Heuristic Detection",
        "Host forensic analysis",
    ],
    "x_mitre_is_subtechnique": false,
    "x_mitre_version": "2.0",
    "x_mitre_contributors": [
        "Tom Ueltschi @c_APT_ure",
        "Travis Smith, Tripwire",
        "Robby Winchester, @robwinchester3",
        "Jared Atkinson, @jaredcatkinson",
    ],
    "x_mitre_data_sources": [
        "Process: Process Creation",
        "Process: Process Metadata",
        "Process: OS API Execution",
        "User Account: User Account Metadata",
        "Active Directory: Active Directory Object Modification",
        "Command: Command Execution",
    ],
    "x_mitre_detection": "If an adversary is using a standard command-line shell, analysts can detect token manipulation by auditing command-line activity. Specifically, analysts should look for use of the <code>runas</code> command. Detailed command-line logging is not enabled by default in Windows.(Citation: Microsoft Command-line Logging)\n\nIf an adversary is using a payload that calls the Windows token APIs directly, analysts can detect token manipulation only through careful analysis of user network activity, examination of running processes, and correlation with other endpoint and network behavior. \n\nThere are many Windows API calls a payload can take advantage of to manipulate access tokens (e.g., <code>LogonUser</code> (Citation: Microsoft LogonUser), <code>DuplicateTokenEx</code>(Citation: Microsoft DuplicateTokenEx), and <code>ImpersonateLoggedOnUser</code>(Citation: Microsoft ImpersonateLoggedOnUser)). Please see the referenced Windows API pages for more information.\n\nQuery systems for process and thread token information and look for inconsistencies such as user owns processes impersonating the local SYSTEM account.(Citation: BlackHat Atkinson Winchester Token Manipulation)\n\nLook for inconsistencies between the various fields that store PPID information, such as the EventHeader ProcessId from data collected via Event Tracing for Windows (ETW), Creator Process ID/Name from Windows event logs, and the ProcessID and ParentProcessID (which are also produced from ETW and other utilities such as Task Manager and Process Explorer). The ETW provided EventHeader ProcessId identifies the actual parent process.",
    "x_mitre_permissions_required": ["User", "Administrator"],
    "x_mitre_effective_permissions": ["SYSTEM"],
    "x_mitre_platforms": ["Windows"],
},
```
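The fields listed above can be pulled out of a technique object with a few lines of plain Python. The following is only a sketch: `technique_summary` is a hypothetical helper name, and `sample` is an abbreviated copy of the T1134 object shown above.

```python
def technique_summary(attack_pattern):
    """Pick the fields of interest plus the ATT&CK technique ID (e.g. T1134)."""
    technique_id = next(
        (
            ref.get("external_id")
            for ref in attack_pattern.get("external_references", [])
            if ref.get("source_name") == "mitre-attack"
        ),
        None,  # no "mitre-attack" reference found
    )
    fields = ("id", "type", "name", "description", "x_mitre_detection")
    summary = {field: attack_pattern.get(field) for field in fields}
    summary["technique_id"] = technique_id
    return summary


# Abbreviated copy of the T1134 object (long text fields shortened)
sample = {
    "id": "attack-pattern--dcaa092b-7de9-4a21-977f-7fcb77e89c48",
    "type": "attack-pattern",
    "name": "Access Token Manipulation",
    "description": "Adversaries may modify access tokens ...",
    "x_mitre_detection": "If an adversary is using a standard command-line shell ...",
    "external_references": [
        {
            "source_name": "mitre-attack",
            "external_id": "T1134",
            "url": "https://attack.mitre.org/techniques/T1134",
        },
        {"source_name": "capec", "external_id": "CAPEC-633"},
    ],
}

summary = technique_summary(sample)
print(summary["technique_id"], summary["name"])  # T1134 Access Token Manipulation
```

The technique ID is read from the `external_references` entry whose `source_name` is `mitre-attack`, because the STIX `id` field only holds the `attack-pattern--<uuid>` identifier.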

3. The technique pages (e.g. https://attack.mitre.org/techniques/T1134/) also list procedure examples for the technique (G0108: Blue Mockingbird, S0038: Duqu, etc.). These are stored as relationship objects in the same file (https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json):

```json
{
    "created_by_ref": "identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5",
    "object_marking_refs": ["marking-definition--fa42a846-8d90-4e51-bc29-71d5b4802168"],
    "source_ref": "intrusion-set--73a80fab-2aa3-48e0-a4d0-3a4828200aee",
    "target_ref": "attack-pattern--dcaa092b-7de9-4a21-977f-7fcb77e89c48",
    "external_references": [
        {
            "source_name": "RedCanary Mockingbird May 2020",
            "url": "https://redcanary.com/blog/blue-mockingbird-cryptominer/",
            "description": "Lambert, T. (2020, May 7). Introducing Blue Mockingbird. Retrieved May 26, 2020.",
        }
    ],
    "description": "[Blue Mockingbird](https://attack.mitre.org/groups/G0108) has used JuicyPotato to abuse the <code>SeImpersonate</code> token privilege to escalate from web application pool accounts to NT Authority\\SYSTEM.(Citation: RedCanary Mockingbird May 2020)",
    "relationship_type": "uses",
    "id": "relationship--6d3d48ff-ea37-4626-8148-4111163e95e3",
    "type": "relationship",
    "modified": "2020-06-25T13:59:09.926Z",
    "created": "2020-05-27T15:31:09.535Z",
},
{
    "id": "relationship--968610c5-7fa5-4840-b9bb-2f70eecd87fa",
    "created_by_ref": "identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5",
    "description": "[Duqu](https://attack.mitre.org/software/S0038) examines running system processes for tokens that have specific system privileges. If it finds one, it will copy the token and store it for later use. Eventually it will start new processes with the stored token attached. It can also steal tokens to acquire administrative privileges.(Citation: Kaspersky Duqu 2.0)",
    "object_marking_refs": ["marking-definition--fa42a846-8d90-4e51-bc29-71d5b4802168"],
    "external_references": [
        {
            "url": "https://securelist.com/files/2015/06/The_Mystery_of_Duqu_2_0_a_sophisticated_cyberespionage_actor_returns.pdf",
            "description": "Kaspersky Lab. (2015, June 11). The Duqu 2.0. Retrieved April 21, 2017.",
            "source_name": "Kaspersky Duqu 2.0",
        }
    ],
    "source_ref": "malware--68dca94f-c11d-421e-9287-7c501108e18c",
    "relationship_type": "uses",
    "target_ref": "attack-pattern--dcaa092b-7de9-4a21-977f-7fcb77e89c48",
    "type": "relationship",
    "modified": "2019-04-24T23:18:53.108Z",
    "created": "2017-12-14T16:46:06.044Z",
},
```
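The relationship objects point at their technique through `target_ref`, so procedure examples can be grouped per technique. The sketch below assumes that only relationships with `relationship_type == "uses"` count as procedure examples; that restriction, the helper name `procedures_by_technique`, and the abbreviated sample objects are assumptions of this example.

```python
from collections import defaultdict


def procedures_by_technique(objects):
    """Group 'uses' relationship descriptions by the attack pattern they target."""
    grouped = defaultdict(list)
    for obj in objects:
        if obj.get("type") != "relationship":
            continue
        if obj.get("relationship_type") != "uses":
            continue
        target = obj.get("target_ref") or ""
        description = obj.get("description")
        # Keep only relationships that target a technique and carry text
        if target.startswith("attack-pattern--") and description:
            grouped[target].append(description)
    return dict(grouped)


T1134 = "attack-pattern--dcaa092b-7de9-4a21-977f-7fcb77e89c48"

# Abbreviated sample objects; the third entry is a made-up placeholder
sample_objects = [
    {
        "type": "relationship",
        "relationship_type": "uses",
        "source_ref": "intrusion-set--73a80fab-2aa3-48e0-a4d0-3a4828200aee",
        "target_ref": T1134,
        "description": "[Blue Mockingbird] has used JuicyPotato ...",
    },
    {
        "type": "relationship",
        "relationship_type": "uses",
        "source_ref": "malware--68dca94f-c11d-421e-9287-7c501108e18c",
        "target_ref": T1134,
        "description": "[Duqu] examines running system processes ...",
    },
    {
        "type": "relationship",
        "relationship_type": "mitigates",
        "target_ref": T1134,
        "description": "Placeholder mitigation text.",
    },
]

procedures = procedures_by_technique(sample_objects)
print(len(procedures[T1134]))  # 2
```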

# Reference


- **Techniques with procedure examples**
    - https://attack.mitre.org/techniques/T1134/
- **GitHub Repo**
    - https://github.com/mitre-attack/attack-website/
    - https://github.com/mitre/cti
- **Attack Techniques**
    - https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json
    - https://raw.githubusercontent.com/mitre/cti/master/mobile-attack/mobile-attack.json
    - (Deprecated) https://raw.githubusercontent.com/mitre/cti/master/pre-attack/pre-attack.json
- **Others**
    - https://www.anomali.com/resources/what-mitre-attck-is-and-how-it-is-useful

---

# Build dataset

In [2]:
from functional import seq

enterprise_attack = seq.json("../data/mitre/enterprise-attack.json").to_dict()


Top-level fields of `enterprise_attack`:
- id
- objects
- spec_version
- type

In [3]:
attack_objects = seq(enterprise_attack.get("objects")).cache()

# get attack pattern from objects
attack_patterns = attack_objects.filter(
    lambda attack_object: attack_object.get("id").startswith("attack-pattern")
).cache()

attack_patterns.take(1)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
2020-02-11T18:46:56.263Z,2020-03-20T15:56:55.022Z,"[{'kill_chain_name': 'mitre-attack', 'phase_name': 'credential-access'}]",attack-pattern,attack-pattern--d0b4fcdb-d67d-4ed2-99ce-788b12f8c0f4,"Adversaries may attempt to dump the contents of <code>/etc/passwd</code> and <code>/etc/shadow</code> to enable offline password cracking. Most modern Linux operating systems use a combination of <code>/etc/passwd</code> and <code>/etc/shadow</code> to store user account information including password hashes in <code>/etc/shadow</code>. By default, <code>/etc/shadow</code> is only readable by the root user.(Citation: Linux Password and Shadow File Formats) The Linux utility, unshadow, can be used to combine the two files in a format suited for password cracking utilities such as John the Ripper:(Citation: nixCraft - John the Ripper) <code># /usr/bin/unshadow /etc/passwd /etc/shadow > /tmp/crack.password.db</code>",/etc/passwd and /etc/shadow,identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5,['marking-definition--fa42a846-8d90-4e51-bc29-71d5b4802168'],"[{'url': 'https://attack.mitre.org/techniques/T1003/008', 'external_id': 'T1003.008', 'source_name': 'mitre-attack'}, {'description': 'The Linux Documentation Project. (n.d.). Linux Password and Shadow File Formats. Retrieved February 19, 2020.', 'url': 'https://www.tldp.org/LDP/lame/LAME/linux-admin-made-easy/shadow-file-formats.html', 'source_name': 'Linux Password and Shadow File Formats'}, {'description': 'Vivek Gite. (2014, September 17). Linux Password Cracking: Explain unshadow and john Commands (John the Ripper Tool). 
Retrieved February 19, 2020.', 'url': 'https://www.cyberciti.biz/faq/unix-linux-password-cracking-john-the-ripper/', 'source_name': 'nixCraft - John the Ripper'}]",['Linux'],True,1,['root'],"The AuditD monitoring tool, which ships stock in many Linux distributions, can be used to watch for hostile processes attempting to access <code>/etc/passwd</code> and <code>/etc/shadow</code>, alerting on the pid, process name, and arguments of such programs.","['Command: Command Execution', 'File: File Access']"



In [4]:
attack_pattern_technique_ids = attack_patterns.map(
    lambda attack_pattern: (
        attack_pattern.get("id"),
        seq(attack_pattern.get("external_references"))
        .filter(lambda ref: ref.get("source_name") == "mitre-attack")
        .map(lambda ref: ref.get("external_id"))
        .make_string(""),
    )
).cache()

attack_pattern_technique_ids

0,1
attack-pattern--d0b4fcdb-d67d-4ed2-99ce-788b12f8c0f4,T1003.008
attack-pattern--cabe189c-a0e3-4965-a473-dcff00f17213,T1557.002
attack-pattern--3986e7fd-a8e9-4ecb-bfc6-55920855912b,T1558.004
attack-pattern--67720091-eee3-4d2d-ae16-8264567f6f5b,T1548
attack-pattern--dcaa092b-7de9-4a21-977f-7fcb77e89c48,T1134
attack-pattern--9b99b83a-1aac-4e29-b975-b374950551a3,T1015
attack-pattern--70e52b04-2a0c-4cea-9d18-7149f1df9dc5,T1546.008
attack-pattern--b24e2a20-3b3d-4bf0-823b-1ed765398fb0,T1531
attack-pattern--72b74d71-8169-42aa-92e0-e7b04b9f5a08,T1087
attack-pattern--a10641f4-87b4-45a3-a906-92a149cb2c27,T1098



In [5]:
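# Group the descriptions of all non-attack-pattern objects by the attack pattern
# they target. Every relationship type is kept ("uses", "mitigates", ...), so the
# result mixes procedure examples with mitigation text.
attack_pattern_procedures = (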
    attack_objects.filter_not(
        lambda attack_object: attack_object.get("id").startswith("attack-pattern")
    )
    .group_by(lambda attack_object: attack_object.get("target_ref"))
    .starmap(
        lambda target_ref, attack_objects: (
            target_ref,
            seq(attack_objects).map(
                lambda attack_object: attack_object.get("description")
            ),
        )
    )
    .filter_not(lambda kv: kv[0] is None)
    .filter(lambda kv: kv[0].startswith("attack-pattern"))
    .cache()
)

attack_pattern_procedures.take(1)

0,1
attack-pattern--ca1a3f50-5ebd-41f8-8320-2c7d6a6e88be,"[None, 'Remove users from the local administrator group on systems.', 'Although UAC bypass techniques exist, it is still prudent to use the highest enforcement level for UAC when possible and mitigate bypass opportunities that exist with techniques such as [DLL Search Order Hijacking](https://attack.mitre.org/techniques/T1038).', 'Check for common UAC bypass weaknesses on Windows systems to be aware of the risk posture and address issues where appropriate. (Citation: Github UACMe)']"



In [6]:
# join by key "target_ref": "attack-pattern--dcaa092b-7de9-4a21-977f-7fcb77e89c48",

attack_pattern_descriptions = (
    attack_pattern_technique_ids.join(attack_pattern_procedures)
    .starmap(lambda pattern, join_result: join_result)
    .starmap(
        lambda technique_id, descriptions: seq(descriptions)
        .filter_not(lambda description: description is None)
        .map(lambda description: (technique_id, description))
    )
    .flatten()
    .sorted(lambda x: x[0])
)

attack_pattern_descriptions

0,1
T1001,"The [Axiom](https://attack.mitre.org/groups/G0001) group has used other forms of obfuscation, include commingling legitimate traffic with communications traffic so that network streams appear legitimate."
T1001,[FlawedAmmyy](https://attack.mitre.org/software/S0381) may obfuscate portions of the initial C2 handshake.(Citation: Proofpoint TA505 Mar 2018)
T1001,Network intrusion detection and prevention systems that use network signatures to identify traffic for specific adversary malware can be used to mitigate some obfuscation activity at the network level.
T1001,[RDAT](https://attack.mitre.org/software/S0495) has used encoded data within subdomains as AES ciphertext to communicate from the host to the C2.(Citation: Unit42 RDAT July 2020)
T1001,"[Operation Wocao](https://attack.mitre.org/groups/G0116) has encrypted IP addresses used for ""Agent"" proxy hops with RC4.(Citation: FoxIT Wocao December 2019)"
T1001,[SLOTHFULMEDIA](https://attack.mitre.org/software/S0533) has hashed a string containing system information prior to exfiltration via POST requests.(Citation: CISA MAR SLOTHFULMEDIA October 2020)
T1001.001,"[APT28](https://attack.mitre.org/groups/G0007) added ""junk data"" to each encoded string, preventing trivial decoding without knowledge of the junk removal algorithm. Each implant was given a ""junk length"" value when created, tracked by the controller software to allow seamless communication but prevent analysis of the command protocol on the wire.(Citation: FireEye APT28)"
T1001.001,"[Downdelph](https://attack.mitre.org/software/S0134) inserts pseudo-random characters between each original character during encoding of C2 network requests, making it difficult to write signatures on them.(Citation: ESET Sednit Part 3)"
T1001.001,[P2P ZeuS](https://attack.mitre.org/software/S0016) added junk data to outgoing UDP packets to peer implants.(Citation: Dell P2P ZeuS)
T1001.001,Network intrusion detection and prevention systems that use network signatures to identify traffic for specific adversary malware can be used to mitigate some obfuscation activity at the network level.



In [7]:
import re

# remove ATT&CK link targets and citation markers (the [link text] brackets are kept)
def clean_text(text):
    return re.sub(r"\(https://attack\.mitre\.org/.+?\)|\(Citation:.+?\)", "", text)


# example
text = """[Operation Wocao](https://attack.mitre.org/groups/G0116) has encrypted IP addresses used for "Agent" proxy hops with RC4.(Citation: FoxIT Wocao December 2019)"""
clean_text(text)

'[Operation Wocao] has encrypted IP addresses used for "Agent" proxy hops with RC4.'
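As the example output shows, `clean_text` keeps the square brackets around the link text. If plain text without brackets is preferred, one possible variant is sketched below. `clean_text_plain` is a hypothetical alternative and not the function used to build the dataset in this notebook; it produces different output from `clean_text`.

```python
import re

# [link text](https://attack.mitre.org/...) -> link text
ATTACK_LINK = re.compile(r"\[([^\]]+)\]\(https://attack\.mitre\.org/[^)]*\)")
# (Citation: ...) -> removed
CITATION = re.compile(r"\(Citation:[^)]*\)")


def clean_text_plain(text):
    """Replace ATT&CK markdown links with their link text and drop citations."""
    text = ATTACK_LINK.sub(r"\1", text)
    text = CITATION.sub("", text)
    return text.strip()


example = (
    "[Operation Wocao](https://attack.mitre.org/groups/G0116) has encrypted "
    'IP addresses used for "Agent" proxy hops with RC4.'
    "(Citation: FoxIT Wocao December 2019)"
)
cleaned = clean_text_plain(example)
print(cleaned)
# Operation Wocao has encrypted IP addresses used for "Agent" proxy hops with RC4.
```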


In [8]:
# save as a dataset: one CSV row per (technique ID, cleaned description)
attack_pattern_descriptions.starmap(
    lambda technique_id, description: (technique_id, clean_text(description).strip('"'))
).to_csv("../data/mitre/technique_description_dataset.csv")
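A quick sanity check of the saved file is to count how many descriptions each technique received. The sketch below assumes the CSV has two columns (technique ID, description) and no header row; `count_rows_per_technique` is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import csv
import io
from collections import Counter


def count_rows_per_technique(csv_file):
    """Count rows per technique ID; each row is (technique_id, description)."""
    return Counter(row[0] for row in csv.reader(csv_file) if row)


# Abbreviated sample rows in the same shape as the saved dataset
sample_csv = io.StringIO(
    "T1001,[FlawedAmmyy] may obfuscate portions of the initial C2 handshake.\n"
    "T1001.001,[P2P ZeuS] added junk data to outgoing UDP packets.\n"
    'T1001,"[RDAT] has used encoded data within subdomains, as an example."\n'
)

row_counts = count_rows_per_technique(sample_csv)
print(row_counts["T1001"], row_counts["T1001.001"])  # 2 1
```

For the real dataset, the same function could be called on a file opened with `open("../data/mitre/technique_description_dataset.csv", newline="")`.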


---