# Explore ATT&CK Data Sources
------------------

## Goals:
* Access ATT&CK data sources in STIX format via a public TAXII server
* Learn to retrieve and work with all ATT&CK data at once
* Explore and identify patterns in the data retrieved
* Learn more about ATT&CK data sources

## 1. ATT&CK Python Client Installation

You can install it via pip: **pip install attackcti**

## 2. Import ATT&CK API Client

In [1]:
from attackcti import attack_client

## 3. Import Extra Libraries

In [2]:
import pandas
import numpy as np

import altair as alt
alt.renderers.enable('default')

import itertools

## 4. Initialize ATT&CK Client Class

In [3]:
lift = attack_client()

## 5. Getting Information About Techniques

Getting ALL ATT&CK Techniques

In [4]:
all_techniques = lift.get_techniques(stix_format=False)

Showing the first technique in our list

In [5]:
all_techniques[0]

{'external_references': [{'source_name': 'mitre-attack',
   'external_id': 'T1553.006',
   'url': 'https://attack.mitre.org/techniques/T1553/006'},
  {'source_name': 'Microsoft DSE June 2017',
   'url': 'https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn653559(v=vs.85)?redirectedfrom=MSDN',
   'description': 'Microsoft. (2017, June 1). Digital Signatures for Kernel Modules on Windows. Retrieved April 22, 2021.'},
  {'source_name': 'Apple Disable SIP',
   'url': 'https://developer.apple.com/documentation/security/disabling_and_enabling_system_integrity_protection',
   'description': 'Apple. (n.d.). Disabling and Enabling System Integrity Protection. Retrieved April 22, 2021.'},
  {'source_name': 'Microsoft Unsigned Driver Apr 2017',
   'url': 'https://docs.microsoft.com/en-us/windows-hardware/drivers/install/installing-an-unsigned-driver-during-development-and-test',
   'description': 'Microsoft. (2017, April 20). Installing an Unsigned Driver during Developme

Normalizing semi-structured JSON data into a flat table via **pandas.json_normalize**
* Reference: https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html

In [6]:
techniques_normalized = pandas.json_normalize(all_techniques)

In [7]:
techniques_normalized[0:1]

Unnamed: 0,external_references,kill_chain_phases,x_mitre_version,x_mitre_is_subtechnique,url,matrix,technique_id,object_marking_refs,created_by_ref,technique,...,effective_permissions,impact_type,revoked,x_mitre_deprecated,x_mitre_old_attack_id,difficulty_explanation,difficulty_for_adversary,detectable_explanation,detectable_by_common_defenses,tactic_type
0,"[{'source_name': 'mitre-attack', 'external_id'...","[{'kill_chain_name': 'mitre-attack', 'phase_na...",1.0,True,https://attack.mitre.org/techniques/T1553/006,mitre-attack,T1553.006,[marking-definition--fa42a846-8d90-4e51-bc29-7...,identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5,Code Signing Policy Modification,...,,,,,,,,,,
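To make the flattening step concrete, here is a minimal sketch on two toy records. The `meta` key and its contents are hypothetical and only serve to show how nested dictionaries become dotted column names while list values stay as lists.

```python
import pandas

# Toy records; the nested "meta" dictionary is a hypothetical field used for illustration.
records = [
    {
        "technique": "Container API",
        "technique_id": "T1552.007",
        "platform": ["Containers"],
        "meta": {"matrix": "mitre-attack", "version": "1.0"},
    },
    {
        "technique": "Build Image on Host",
        "technique_id": "T1612",
        "platform": ["Containers"],
        "meta": {"matrix": "mitre-attack"},  # no "version" key here
    },
]

# Nested dictionaries are flattened into "parent.child" columns.
# Keys that are missing in a record become NaN; lists are kept as list objects.
flat = pandas.json_normalize(records)
print(sorted(flat.columns))
print(flat)
```

This is why many columns in the normalized ATT&CK table are empty for a given technique: a column exists as soon as any record defines that key.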


## 6. Re-indexing Dataframe

In [8]:
techniques = techniques_normalized.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)

In [9]:
techniques.head()

Unnamed: 0,matrix,platform,tactic,technique,technique_id,data_sources
0,mitre-attack,"[Windows, macOS]",[defense-evasion],Code Signing Policy Modification,T1553.006,[Windows Registry: Windows Registry Key Modifi...
1,mitre-attack,"[Windows, Linux, macOS, IaaS]",[discovery],System Location Discovery,T1614,"[Instance: Instance Metadata, Process: Process..."
2,mitre-attack,[Containers],[discovery],Container and Resource Discovery,T1613,"[Cluster: Cluster Metadata, Container: Contain..."
3,mitre-attack,[Containers],[credential-access],Container API,T1552.007,"[Command: Command Execution, File: File Access..."
4,mitre-attack,[Containers],[defense-evasion],Build Image on Host,T1612,"[Image: Image Creation, Network Traffic: Netwo..."


In [10]:
print('A total of ',len(techniques),' techniques')

A total of  1062  techniques


## 7. Removing Revoked Techniques

In [11]:
all_techniques_no_revoked = lift.remove_revoked(all_techniques)

In [12]:
print('A total of ',len(all_techniques_no_revoked),' techniques')

A total of  916  techniques


## 8. Extracting Revoked Techniques

In [13]:
all_techniques_revoked = lift.extract_revoked(all_techniques)

In [14]:
print('A total of ',len(all_techniques_revoked),' techniques that have been revoked')

A total of  146  techniques that have been revoked
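The two helper calls above split the technique list by revocation status. How attackcti implements them is not shown in this notebook, so the following is only a sketch of the idea, assuming each record carries an optional boolean `revoked` key (the normalized table does have a `revoked` column). The function name `split_by_revoked` is hypothetical.

```python
# Toy technique records; only the optional "revoked" key matters for this sketch.
sample_techniques = [
    {"technique": "Code Signing Policy Modification"},
    {"technique": "Kerberoasting", "revoked": True},
    {"technique": "System Location Discovery", "revoked": False},
]


def split_by_revoked(records):
    """Return (active, revoked) lists based on each record's optional 'revoked' flag.

    Hypothetical helper: a record without the key is treated as not revoked.
    """
    active = [r for r in records if not r.get("revoked", False)]
    revoked = [r for r in records if r.get("revoked", False)]
    return active, revoked


active, revoked = split_by_revoked(sample_techniques)
print(len(active), "active,", len(revoked), "revoked")
```

The two lists always partition the input, which matches the counts above: 916 + 146 = 1062.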


The revoked techniques are:

In [15]:
for t in all_techniques_revoked:
    print(t['technique'])

Web Session Cookie
Emond
Cloud Instance Metadata API
Revert Cloud Instance
Application Access Token
Elevated Execution with Prompt
Credentials from Web Browsers
PowerShell Profile
Parent PID Spoofing
Compile After Delivery
Systemd Service
Runtime Data Manipulation
Transmitted Data Manipulation
Stored Data Manipulation
Disk Content Wipe
Disk Structure Wipe
Domain Generation Algorithms
Compiled HTML File
SIP and Trust Provider Hijacking
Time Providers
CMSTP
Credentials in Registry
Control Panel Items
Kernel Modules and Extensions
Spearphishing Link
Sudo Caching
Spearphishing Attachment
Kerberoasting
Spearphishing via Service
LSASS Driver
Password Filter DLL
Screensaver
AppCert DLLs
Domain Fronting
Mshta
Dynamic Data Exchange
Hooking
Image File Execution Options Injection
Extra Window Memory Injection
LLMNR/NBT-NS Poisoning and Relay
Multi-hop Proxy
SSH Hijacking
Process Doppelgänging
SID-History Injection
Application Shimming
Hidden Window
Login Item
Plist Modification
Re-opened Applicat

## 9. Updating our Dataframe

In [16]:
techniques_normalized = pandas.json_normalize(all_techniques_no_revoked)
techniques = techniques_normalized.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)

## 10. Techniques Per Matrix
Using the **altair** Python library, we can chart the number of techniques per matrix and, in the next section, the number of techniques with and without data sources.
Reference: https://altair-viz.github.io/

In [17]:
data = techniques
data_2 = data.groupby(['matrix'])['technique'].count()
data_3 = data_2.to_frame().reset_index()
data_3

Unnamed: 0,matrix,technique
0,mitre-attack,563
1,mitre-ics-attack,89
2,mitre-mobile-attack,90
3,mitre-pre-attack,174


In [18]:
alt.Chart(data_3).mark_bar().encode(x='technique', y='matrix', color='matrix').properties(height = 200)

## 11. Techniques With and Without Data Sources

In [19]:
data_source_distribution = pandas.DataFrame({
    'Techniques': ['Without DS','With DS'],
    'Count of Techniques': [techniques['data_sources'].isna().sum(),techniques['data_sources'].notna().sum()]})
bars = alt.Chart(data_source_distribution).mark_bar().encode(x='Techniques',y='Count of Techniques',color='Techniques').properties(width=200,height=300)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text

What is the distribution of techniques based on ATT&CK Matrix?

In [20]:
data = techniques
data['Count_DS'] = data['data_sources'].str.len()
data['Ind_DS'] = np.where(data['Count_DS']>0,'With DS','Without DS')
data_2 = data.groupby(['matrix','Ind_DS'])['technique'].count()
data_3 = data_2.to_frame().reset_index()
data_3

Unnamed: 0,matrix,Ind_DS,technique
0,mitre-attack,With DS,472
1,mitre-attack,Without DS,91
2,mitre-ics-attack,With DS,72
3,mitre-ics-attack,Without DS,17
4,mitre-mobile-attack,Without DS,90
5,mitre-pre-attack,Without DS,174
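In the cell above, a technique with no `data_sources` list gets a missing `Count_DS`, and a missing value is never greater than 0, so it falls into the "Without DS" bucket. The same tally can be sketched in plain Python on toy records (the records and the helper name `tally_by_matrix` are made up for illustration):

```python
from collections import Counter

# Toy records: data_sources is either a list of names or None when ATT&CK defines none.
sample = [
    {"matrix": "mitre-attack", "data_sources": ["Process: Process Creation"]},
    {"matrix": "mitre-attack", "data_sources": None},
    {"matrix": "mitre-mobile-attack", "data_sources": None},
    {"matrix": "mitre-ics-attack", "data_sources": ["Network Traffic: Network Traffic Flow"]},
]


def tally_by_matrix(records):
    """Count records per (matrix, 'With DS' / 'Without DS') pair."""
    counts = Counter()
    for rec in records:
        # None and empty lists are both falsy, so both count as "Without DS".
        label = "With DS" if rec.get("data_sources") else "Without DS"
        counts[(rec["matrix"], label)] += 1
    return counts


tally = tally_by_matrix(sample)
for key, value in sorted(tally.items()):
    print(key, value)
```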


In [21]:
alt.renderers.enable('default') 

RendererRegistry.enable('default')

In [22]:
alt.Chart(data_3).mark_bar().encode(x='technique', y='Ind_DS', color='matrix').properties(height = 200)

What are those mitre-attack techniques without data sources?

In [23]:
data[(data['matrix']=='mitre-attack') & (data['Ind_DS']=='Without DS')]

Unnamed: 0,matrix,platform,tactic,technique,technique_id,data_sources,Count_DS,Ind_DS
10,mitre-attack,[PRE],[resource-development],Link Target,T1608.005,,,Without DS
11,mitre-attack,[PRE],[resource-development],Drive-by Target,T1608.004,,,Without DS
12,mitre-attack,[PRE],[resource-development],Install Digital Certificate,T1608.003,,,Without DS
13,mitre-attack,[PRE],[resource-development],Upload Tool,T1608.002,,,Without DS
14,mitre-attack,[PRE],[resource-development],Upload Malware,T1608.001,,,Without DS
...,...,...,...,...,...,...,...,...
524,mitre-attack,"[Linux, macOS, Windows]",[execution],Graphical User Interface,T1061,,,Without DS
531,mitre-attack,[Windows],[lateral-movement],Shared Webroot,T1051,,,Without DS
536,mitre-attack,"[Linux, macOS, Windows]",[command-and-control],Commonly Used Port,T1043,,,Without DS
542,mitre-attack,[Windows],"[persistence, privilege-escalation]",Path Interception,T1034,,,Without DS


### Techniques without data sources

In [24]:
techniques_without_data_sources=techniques[techniques.data_sources.isnull()].reset_index(drop=True)

In [25]:
techniques_without_data_sources.head()

Unnamed: 0,matrix,platform,tactic,technique,technique_id,data_sources,Count_DS,Ind_DS
0,mitre-attack,[PRE],[resource-development],Link Target,T1608.005,,,Without DS
1,mitre-attack,[PRE],[resource-development],Drive-by Target,T1608.004,,,Without DS
2,mitre-attack,[PRE],[resource-development],Install Digital Certificate,T1608.003,,,Without DS
3,mitre-attack,[PRE],[resource-development],Upload Tool,T1608.002,,,Without DS
4,mitre-attack,[PRE],[resource-development],Upload Malware,T1608.001,,,Without DS


In [26]:
print('There are ',techniques['data_sources'].isna().sum(),' techniques without data sources (',"{0:.0%}".format(techniques['data_sources'].isna().sum()/len(techniques)),' of ',len(techniques),' techniques)')

There are  372  techniques without data sources ( 41%  of  916  techniques)


### Techniques With Data Sources

In [27]:
techniques_with_data_sources=techniques[techniques.data_sources.notnull()].reset_index(drop=True)

In [28]:
techniques_with_data_sources.head()

Unnamed: 0,matrix,platform,tactic,technique,technique_id,data_sources,Count_DS,Ind_DS
0,mitre-attack,"[Windows, macOS]",[defense-evasion],Code Signing Policy Modification,T1553.006,[Windows Registry: Windows Registry Key Modifi...,3.0,With DS
1,mitre-attack,"[Windows, Linux, macOS, IaaS]",[discovery],System Location Discovery,T1614,"[Instance: Instance Metadata, Process: Process...",4.0,With DS
2,mitre-attack,[Containers],[discovery],Container and Resource Discovery,T1613,"[Cluster: Cluster Metadata, Container: Contain...",6.0,With DS
3,mitre-attack,[Containers],[credential-access],Container API,T1552.007,"[Command: Command Execution, File: File Access...",3.0,With DS
4,mitre-attack,[Containers],[defense-evasion],Build Image on Host,T1612,"[Image: Image Creation, Network Traffic: Netwo...",4.0,With DS


In [29]:
print('There are ',techniques['data_sources'].notna().sum(),' techniques with data sources (',"{0:.0%}".format(techniques['data_sources'].notna().sum()/len(techniques)),' of ',len(techniques),' techniques)')

There are  544  techniques with data sources ( 59%  of  916  techniques)


## 12. Grouping Techniques With Data Sources By Matrix

Let's create a graph to represent the number of techniques with data sources per matrix:

In [30]:
matrix_distribution = pandas.DataFrame({
    'Matrix': list(techniques_with_data_sources.groupby(['matrix'])['matrix'].count().keys()),
    'Count of Techniques': techniques_with_data_sources.groupby(['matrix'])['matrix'].count().tolist()})
bars = alt.Chart(matrix_distribution).mark_bar().encode(y='Matrix',x='Count of Techniques').properties(width=300,height=100)
text = bars.mark_text(align='center',baseline='middle',dx=10,dy=0).encode(text='Count of Techniques')
bars + text

Most of the techniques with data sources (472) belong to the **mitre-attack** matrix, which is the main **Enterprise** matrix; the remaining 72 belong to **mitre-ics-attack**. Reference: https://attack.mitre.org/

## 13. Grouping Techniques With Data Sources by Platform

First, we need to split the **platform** column values because a technique might be mapped to more than one platform

In [31]:
techniques_platform=techniques_with_data_sources

attributes_1 = ['platform'] # In attributes we are going to indicate the name of the columns that we need to split

for a in attributes_1:
    s = techniques_platform.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    # "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
    s.name = a
    # We name "s" with the same name of "a".
    techniques_platform=techniques_platform.drop(a, axis=1).join(s).reset_index(drop=True)
    # We drop the column "a" from "techniques_platform", and then join "techniques_platform" with "s"

# Let's re-arrange the columns from general to specific
techniques_platform_2=techniques_platform.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)

We can now show techniques with data sources mapped to one platform at a time

In [32]:
techniques_platform_2.head()

Unnamed: 0,matrix,platform,tactic,technique,technique_id,data_sources
0,mitre-attack,Windows,[defense-evasion],Code Signing Policy Modification,T1553.006,[Windows Registry: Windows Registry Key Modifi...
1,mitre-attack,macOS,[defense-evasion],Code Signing Policy Modification,T1553.006,[Windows Registry: Windows Registry Key Modifi...
2,mitre-attack,Windows,[discovery],System Location Discovery,T1614,"[Instance: Instance Metadata, Process: Process..."
3,mitre-attack,Linux,[discovery],System Location Discovery,T1614,"[Instance: Instance Metadata, Process: Process..."
4,mitre-attack,macOS,[discovery],System Location Discovery,T1614,"[Instance: Instance Metadata, Process: Process..."
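The split performed above with `apply`, `stack` and `join` can also be written with `DataFrame.explode`, available in pandas 0.25 and later. This is offered as an alternative sketch on a toy dataframe, not as what the notebook itself runs:

```python
import pandas

# Toy dataframe with a list-valued "platform" column, mirroring the shape used above.
toy = pandas.DataFrame({
    "technique": ["Code Signing Policy Modification", "Container API"],
    "platform": [["Windows", "macOS"], ["Containers"]],
})

# explode() emits one row per list element and repeats the other columns;
# reset_index(drop=True) gives the result a clean 0..n-1 index.
exploded = toy.explode("platform").reset_index(drop=True)
print(exploded)
```

Either approach yields one row per (technique, platform) pair, so a technique mapped to two platforms is counted once under each of them in the chart that follows.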


Let's create a visualization to show the number of techniques grouped by platform:

In [33]:
platform_distribution = pandas.DataFrame({
    'Platform': list(techniques_platform_2.groupby(['platform'])['platform'].count().keys()),
    'Count of Techniques': techniques_platform_2.groupby(['platform'])['platform'].count().tolist()})
bars = alt.Chart(platform_distribution,height=300).mark_bar().encode(x ='Platform',y='Count of Techniques',color='Platform').properties(width=200)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text

The bar chart above shows that Windows is the platform with the most techniques that have data sources.

## 14. Grouping Techniques With Data Sources by Tactic

Again, first we need to split the tactic column values because a technique might be mapped to more than one tactic:

In [34]:
techniques_tactic=techniques_with_data_sources

attributes_2 = ['tactic'] # In attributes we are going to indicate the name of the columns that we need to split

for a in attributes_2:
    s = techniques_tactic.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    # "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
    s.name = a
    # We name "s" with the same name of "a".
    techniques_tactic = techniques_tactic.drop(a, axis=1).join(s).reset_index(drop=True)
    # We drop the column "a" from "techniques_tactic", and then join "techniques_tactic" with "s"

# Let's re-arrange the columns from general to specific
techniques_tactic_2=techniques_tactic.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)


We can now show techniques with data sources mapped to one tactic at a time

In [35]:
techniques_tactic_2.head()

Unnamed: 0,matrix,platform,tactic,technique,technique_id,data_sources
0,mitre-attack,"[Windows, macOS]",defense-evasion,Code Signing Policy Modification,T1553.006,[Windows Registry: Windows Registry Key Modifi...
1,mitre-attack,"[Windows, Linux, macOS, IaaS]",discovery,System Location Discovery,T1614,"[Instance: Instance Metadata, Process: Process..."
2,mitre-attack,[Containers],discovery,Container and Resource Discovery,T1613,"[Cluster: Cluster Metadata, Container: Contain..."
3,mitre-attack,[Containers],credential-access,Container API,T1552.007,"[Command: Command Execution, File: File Access..."
4,mitre-attack,[Containers],defense-evasion,Build Image on Host,T1612,"[Image: Image Creation, Network Traffic: Netwo..."


Let's create a visualization to show the number of techniques grouped by tactic:

In [36]:
tactic_distribution = pandas.DataFrame({
    'Tactic': list(techniques_tactic_2.groupby(['tactic'])['tactic'].count().keys()),
    'Count of Techniques': techniques_tactic_2.groupby(['tactic'])['tactic'].count().tolist()}).sort_values(by='Count of Techniques',ascending=True)
bars = alt.Chart(tactic_distribution,width=800,height=300).mark_bar().encode(x ='Tactic',y='Count of Techniques',color='Tactic').properties(width=400)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text

Defense-evasion and persistence are the tactics with the highest number of techniques with data sources.

## 15. Grouping Techniques With Data Sources by Data Source

We need to split the data source column values because a technique might be mapped to more than one data source:

In [37]:
techniques_data_source=techniques_with_data_sources

attributes_3 = ['data_sources'] # In attributes we are going to indicate the name of the columns that we need to split

for a in attributes_3:
    s = techniques_data_source.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    # "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
    s.name = a
    # We name "s" with the same name of "a".
    techniques_data_source = techniques_data_source.drop(a, axis=1).join(s).reset_index(drop=True)
    # We drop the column "a" from "techniques_data_source", and then join "techniques_data_source" with "s"

# Let's re-arrange the columns from general to specific
techniques_data_source_2 = techniques_data_source.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)

# We are going to edit some names inside the dataframe to improve the consistency:
techniques_data_source_3 = techniques_data_source_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])

We can now show techniques with data sources mapped to one data source at a time

In [38]:
techniques_data_source_3.head()

Unnamed: 0,matrix,platform,tactic,technique,technique_id,data_sources
0,mitre-attack,"[Windows, macOS]",[defense-evasion],Code Signing Policy Modification,T1553.006,Windows Registry: Windows Registry Key Modific...
1,mitre-attack,"[Windows, macOS]",[defense-evasion],Code Signing Policy Modification,T1553.006,Command: Command Execution
2,mitre-attack,"[Windows, macOS]",[defense-evasion],Code Signing Policy Modification,T1553.006,Process: Process Creation
3,mitre-attack,"[Windows, Linux, macOS, IaaS]",[discovery],System Location Discovery,T1614,Instance: Instance Metadata
4,mitre-attack,"[Windows, Linux, macOS, IaaS]",[discovery],System Location Discovery,T1614,Process: Process Creation


Let's create a visualization to show the number of techniques grouped by data sources:

In [39]:
data_source_distribution = pandas.DataFrame({
    'Data Source': list(techniques_data_source_3.groupby(['data_sources'])['data_sources'].count().keys()),
    'Count of Techniques': techniques_data_source_3.groupby(['data_sources'])['data_sources'].count().tolist()})
bars = alt.Chart(data_source_distribution,width=800,height=300).mark_bar().encode(x ='Data Source',y='Count of Techniques',color='Data Source').properties(width=1200)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text

A few interesting things from the bar chart above:
* Process Monitoring, File Monitoring, and Process Command-line parameters are the Data Sources with the highest number of techniques
* Some data source names include Windows-specific strings such as PowerShell, Windows, and WMI

## 16. Most Relevant Groups Of Data Sources Per Technique

### Number Of Data Sources Per Technique

Although identifying the data sources with the highest number of techniques is a good start, they usually do not work alone. You might be collecting **Process Monitoring** already but you might be still missing a lot of context from a data perspective.

In [40]:
data_source_distribution_2 = pandas.DataFrame({
    'Techniques': list(techniques_data_source_3.groupby(['technique'])['technique'].count().keys()),
    'Count of Data Sources': techniques_data_source_3.groupby(['technique'])['technique'].count().tolist()})

data_source_distribution_3 = pandas.DataFrame({
    'Number of Data Sources': list(data_source_distribution_2.groupby(['Count of Data Sources'])['Count of Data Sources'].count().keys()),
    'Count of Techniques': data_source_distribution_2.groupby(['Count of Data Sources'])['Count of Data Sources'].count().tolist()})

bars = alt.Chart(data_source_distribution_3).mark_bar().encode(x ='Number of Data Sources',y='Count of Techniques').properties(width=500)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text

The chart above shows the number of data sources listed per technique according to ATT&CK:
* 71 techniques list 3 data sources as the context needed to validate their detection
* Only one technique lists 12 data sources
* 19 techniques list only one data source
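The two-step grouping behind this chart (count data sources per technique, then count techniques per count) can be sketched with `collections.Counter`. The technique names and data source lists below are placeholders, not ATT&CK data:

```python
from collections import Counter

# Placeholder mapping: technique name -> list of data source names.
data_sources_by_technique = {
    "Technique A": ["ds1", "ds2", "ds3"],
    "Technique B": ["ds1", "ds4", "ds5"],
    "Technique C": ["ds2"],
}

# Step 1: number of data sources per technique.
sources_per_technique = {
    name: len(sources) for name, sources in data_sources_by_technique.items()
}

# Step 2: how many techniques share each "number of data sources" value.
distribution = Counter(sources_per_technique.values())
print(sorted(distribution.items()))
```

Each bar in the chart corresponds to one key of `distribution`, and the bar height is its value.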

Let's build subsets of the data sources listed for each technique by defining and applying a Python function:

In [41]:
# https://stackoverflow.com/questions/26332412/python-recursive-function-to-display-all-subsets-of-given-set
def subs(l):
    res = []
    for i in range(1, len(l) + 1):
        for combo in itertools.combinations(l, i):
            res.append(list(combo))
    return res
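To see what `subs` returns, here is the same function applied to a three-element list. It yields every non-empty combination, ordered by size, so a list of n items produces 2^n - 1 subsets.

```python
import itertools


def subs(l):
    """Return every non-empty subset of l as a list, smallest subsets first."""
    res = []
    for i in range(1, len(l) + 1):
        for combo in itertools.combinations(l, i):
            res.append(list(combo))
    return res


example = subs(["a", "b", "c"])
print(example)
# [['a'], ['b'], ['c'], ['a', 'b'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]

# The number of subsets grows exponentially with the list length.
print(len(subs(list(range(12)))))  # 4095
```

Because of this growth, a technique with 12 data sources contributes 4095 rows once the subsets column is split.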

Before applying the function, we need to use lowercase data sources names and sort data sources names to improve consistency:

In [42]:
# Work on an explicit copy so that adding columns later does not raise SettingWithCopyWarning
df = techniques_with_data_sources[['data_sources']].copy()

In [43]:
# Lowercase the data source names and sort each list
df['data_sources'] = df['data_sources'].apply(lambda names: sorted(name.lower() for name in names))

In [44]:
df.head()

Unnamed: 0,data_sources
0,"[command: command execution, process: process ..."
1,"[command: command execution, instance: instanc..."
2,"[application log: application log content, clu..."
3,"[command: command execution, file: file access..."
4,"[image: image creation, network traffic: netwo..."


Let's apply the function and split the subsets column:

In [45]:
df['subsets']=df['data_sources'].apply(subs)


In [46]:
df.head()

Unnamed: 0,data_sources,subsets
0,"[command: command execution, process: process ...","[[command: command execution], [process: proce..."
1,"[command: command execution, instance: instanc...","[[command: command execution], [instance: inst..."
2,"[application log: application log content, clu...","[[application log: application log content], [..."
3,"[command: command execution, file: file access...","[[command: command execution], [file: file acc..."
4,"[image: image creation, network traffic: netwo...","[[image: image creation], [network traffic: ne..."


We need to split the subsets column values:

In [47]:
techniques_with_data_sources_preview = df

In [48]:
attributes_4 = ['subsets']

for a in attributes_4:
    s = techniques_with_data_sources_preview.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    s.name = a
    techniques_with_data_sources_preview = techniques_with_data_sources_preview.drop(a, axis=1).join(s).reset_index(drop=True)
    
techniques_with_data_sources_subsets = techniques_with_data_sources_preview.reindex(['data_sources','subsets'], axis=1)


In [49]:
techniques_with_data_sources_subsets.head()

Unnamed: 0,data_sources,subsets
0,"[command: command execution, process: process ...",[command: command execution]
1,"[command: command execution, process: process ...",[process: process creation]
2,"[command: command execution, process: process ...",[windows registry: windows registry key modifi...
3,"[command: command execution, process: process ...","[command: command execution, process: process ..."
4,"[command: command execution, process: process ...","[command: command execution, windows registry:..."


Let's add three columns to analyse the dataframe: subsets_name (the subset list joined into a string), subsets_number_elements (number of data sources per subset) and number_data_sources_per_technique

In [50]:
techniques_with_data_sources_subsets['subsets_name']=techniques_with_data_sources_subsets['subsets'].apply(lambda x: ','.join(map(str, x)))
techniques_with_data_sources_subsets['subsets_number_elements']=techniques_with_data_sources_subsets['subsets'].str.len()
techniques_with_data_sources_subsets['number_data_sources_per_technique']=techniques_with_data_sources_subsets['data_sources'].str.len()

In [51]:
techniques_with_data_sources_subsets.head()

Unnamed: 0,data_sources,subsets,subsets_name,subsets_number_elements,number_data_sources_per_technique
0,"[command: command execution, process: process ...",[command: command execution],command: command execution,1,3
1,"[command: command execution, process: process ...",[process: process creation],process: process creation,1,3
2,"[command: command execution, process: process ...",[windows registry: windows registry key modifi...,windows registry: windows registry key modific...,1,3
3,"[command: command execution, process: process ...","[command: command execution, process: process ...","command: command execution,process: process cr...",2,3
4,"[command: command execution, process: process ...","[command: command execution, windows registry:...","command: command execution,windows registry: w...",2,3


As described above, we need to find groups of data sources, so we filter out all the subsets with only one data source:

In [52]:
subsets = techniques_with_data_sources_subsets

subsets_ok=subsets[subsets.subsets_number_elements != 1]

In [53]:
subsets_ok.head()

Unnamed: 0,data_sources,subsets,subsets_name,subsets_number_elements,number_data_sources_per_technique
3,"[command: command execution, process: process ...","[command: command execution, process: process ...","command: command execution,process: process cr...",2,3
4,"[command: command execution, process: process ...","[command: command execution, windows registry:...","command: command execution,windows registry: w...",2,3
5,"[command: command execution, process: process ...","[process: process creation, windows registry: ...","process: process creation,windows registry: wi...",2,3
6,"[command: command execution, process: process ...","[command: command execution, process: process ...","command: command execution,process: process cr...",3,3
11,"[command: command execution, instance: instanc...","[command: command execution, instance: instanc...","command: command execution,instance: instance ...",2,4


Finally, we calculate the most relevant groups of data sources (Top 15):

In [54]:
subsets_graph = subsets_ok.groupby(['subsets_name'])['subsets_name'].count().to_frame(name='subsets_count').sort_values(by='subsets_count',ascending=False)[0:15]

In [55]:
subsets_graph

subsets_name,subsets_count
"command: command execution,process: process creation",167
"network traffic: network traffic content,network traffic: network traffic flow",62
"file: file modification,process: process creation",60
"file: file creation,process: process creation",57
"command: command execution,file: file modification",54
"file: file creation,file: file modification",51
"command: command execution,file: file modification,process: process creation",47
"command: command execution,process: os api execution",47
"command: command execution,file: file creation",47
"command: command execution,windows registry: windows registry key modification",45


In [56]:
subsets_graph_2 = pandas.DataFrame({
    'Data Sources': list(subsets_graph.index),
    'Count of Techniques': subsets_graph['subsets_count'].tolist()})

bars = alt.Chart(subsets_graph_2).mark_bar().encode(x ='Data Sources', y ='Count of Techniques', color='Data Sources').properties(width=500)
text = bars.mark_text(align='center',baseline='middle',dx= 0,dy=-5).encode(text='Count of Techniques')
bars + text

The group (command: command execution, process: process creation) is the group of data sources with the highest number of techniques. Together, these two data sources are suggested for hunting 167 techniques.
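The whole group-counting pipeline of this section (normalize names, build subsets, drop single-element subsets, count) can be condensed into a short sketch. The technique names are placeholders and `count_groups` is a hypothetical helper, not part of the notebook:

```python
from collections import Counter
from itertools import combinations

# Placeholder mapping: technique name -> data source names as listed for it.
sample = {
    "Technique A": ["Process: Process Creation", "Command: Command Execution"],
    "Technique B": [
        "Command: Command Execution",
        "Process: Process Creation",
        "File: File Creation",
    ],
    "Technique C": ["File: File Creation"],
}


def count_groups(data_sources_by_technique, min_size=2):
    """Count how many techniques contain each group of at least min_size data sources."""
    counts = Counter()
    for sources in data_sources_by_technique.values():
        # Lowercase and sort so the same group always produces the same key.
        normalized = sorted({s.lower() for s in sources})
        for size in range(min_size, len(normalized) + 1):
            for combo in combinations(normalized, size):
                counts[",".join(combo)] += 1
    return counts


group_counts = count_groups(sample)
top = group_counts.most_common(1)
print(top)
```

Sorting before building the combinations matters: without it, the same pair listed in a different order would be counted under two different keys.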

## 17. Let's Split all the Information About Techniques With Data Sources Defined: Matrix, Platform, Tactic and Data Source

Let's split all the relevant columns of the dataframe:

In [57]:
techniques_data = techniques_with_data_sources

attributes = ['platform','tactic','data_sources'] # In attributes we are going to indicate the name of the columns that we need to split

for a in attributes:
    s = techniques_data.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    # "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
    s.name = a
    # We name "s" with the same name of "a".
    techniques_data=techniques_data.drop(a, axis=1).join(s).reset_index(drop=True)
    # We drop the column "a" from "techniques_data", and then join "techniques_data" with "s"

# Let's re-arrange the columns from general to specific
techniques_data_2=techniques_data.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)

# We are going to edit some names inside the dataframe to improve the consistency:
techniques_data_3 = techniques_data_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])

techniques_data_3.head()

Unnamed: 0,matrix,platform,tactic,technique,technique_id,data_sources
0,mitre-attack,Windows,defense-evasion,Code Signing Policy Modification,T1553.006,Windows Registry: Windows Registry Key Modific...
1,mitre-attack,Windows,defense-evasion,Code Signing Policy Modification,T1553.006,Command: Command Execution
2,mitre-attack,Windows,defense-evasion,Code Signing Policy Modification,T1553.006,Process: Process Creation
3,mitre-attack,macOS,defense-evasion,Code Signing Policy Modification,T1553.006,Windows Registry: Windows Registry Key Modific...
4,mitre-attack,macOS,defense-evasion,Code Signing Policy Modification,T1553.006,Command: Command Execution


Do you remember the data source names that reference Windows? After splitting the dataframe by platform, tactic and data source, are there any macOS or Linux techniques paired with Windows data sources? Let's identify those rows:

In [58]:
# After splitting the rows of the dataframe, there are some values that relate windows data sources with platforms like linux and masOS.
# We need to identify those rows
conditions = [(techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('windows',case=False)== True),
             (techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('windows',case=False)== True),
             (techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('powershell',case=False)== True),
             (techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('powershell',case=False)== True),
             (techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('wmi',case=False)== True),
             (techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('wmi',case=False)== True)]
# In conditions we indicate a logical test

choices = ['NO OK','NO OK','NO OK','NO OK','NO OK','NO OK']
# In choices, we indicate the result when the logical test is true

techniques_data_3['Validation'] = np.select(conditions,choices,default='OK')
# We add a column "Validation" to "techniques_data_3" with the result of the logical test. The default value is going to be "OK"
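
The six conditions above all test the same thing: a non-Windows platform paired with a data source whose name mentions Windows, PowerShell, or WMI. As a minimal sketch of that idea (not the notebook's own code), the check can be written as two boolean masks. The `toy` dataframe and its rows are made up for illustration, and `na=False` is an added assumption so that missing data source names count as "no match":

```python
import numpy as np
import pandas as pd

# Toy rows standing in for techniques_data_3 (illustrative values only)
toy = pd.DataFrame({
    'platform': ['Windows', 'macOS', 'Linux', 'Linux'],
    'data_sources': [
        'Windows Registry: Windows Registry Key Modification',
        'Windows Registry: Windows Registry Key Modification',
        'Command: Command Execution',
        'WMI: WMI Creation',
    ],
})

# Mask 1: the row belongs to a non-Windows platform
non_windows = toy['platform'].isin(['Linux', 'macOS'])

# Mask 2: the data source name mentions a Windows-only keyword (case-insensitive)
windows_only = toy['data_sources'].str.contains(
    'windows|powershell|wmi', case=False, regex=True, na=False)

# Rows where both masks are true are the inconsistent ones
toy['Validation'] = np.where(non_windows & windows_only, 'NO OK', 'OK')
print(toy[['platform', 'Validation']])
```

With only two outcomes, `np.where` is enough; `np.select` becomes useful when different conditions need different labels.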

What is the inconsistent data?

In [59]:
techniques_analysis_data_no_ok = techniques_data_3[techniques_data_3.Validation == 'NO OK']
# Finally, we are filtering all the values with NO OK

techniques_analysis_data_no_ok.head()

Unnamed: 0,matrix,platform,tactic,technique,technique_id,data_sources,Validation
3,mitre-attack,macOS,defense-evasion,Code Signing Policy Modification,T1553.006,Windows Registry: Windows Registry Key Modific...,NO OK
307,mitre-attack,Linux,defense-evasion,Run Virtual Instance,T1564.006,Windows Registry: Windows Registry Key Modific...,NO OK
312,mitre-attack,macOS,defense-evasion,Run Virtual Instance,T1564.006,Windows Registry: Windows Registry Key Modific...,NO OK
318,mitre-attack,Linux,defense-evasion,Hidden File System,T1564.005,Windows Registry: Windows Registry Key Modific...,NO OK
321,mitre-attack,macOS,defense-evasion,Hidden File System,T1564.005,Windows Registry: Windows Registry Key Modific...,NO OK


In [60]:
print('There are ',len(techniques_analysis_data_no_ok),' rows with inconsistent data')

There are  100  rows with inconsistent data


What is the impact of this inconsistent data from a platform and data sources perspective?

In [61]:
# Start again from the dataframe whose platform and data_sources columns still hold lists
df = techniques_with_data_sources

attributes = ['platform','data_sources']

# Split each list-valued column into one row per value
for a in attributes:
    s = df.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    s.name = a
    df=df.drop(a, axis=1).join(s).reset_index(drop=True)

df_2=df.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
df_3 = df_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])

# Flag every row whose data source name mentions Windows, PowerShell, or WMI
conditions = [(df_3['data_sources'].str.contains('windows',case=False)== True),
              (df_3['data_sources'].str.contains('powershell',case=False)== True),
              (df_3['data_sources'].str.contains('wmi',case=False)== True)]

choices = ['Windows','Windows','Windows']

df_3['Validation'] = np.select(conditions,choices,default='Other')
df_3['Num_Tech'] = 1
df_4 = df_3[df_3.Validation == 'Windows']
# Count the distinct techniques per Windows data source and platform
df_5 = df_4.groupby(['data_sources','platform'])['technique'].nunique()
df_6 = df_5.to_frame().reset_index()
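
The `apply`/`stack`/`join` loop above turns each list-valued column into one row per value. As a hedged alternative, `DataFrame.explode` (available since pandas 0.25) does the same kind of split. The sketch below runs on a made-up one-row dataframe rather than `techniques_with_data_sources`, and resets the index between the two calls so that duplicate index labels cannot interfere:

```python
import pandas as pd

# Toy dataframe with list-valued columns (illustrative values only)
toy = pd.DataFrame({
    'technique': ['Code Signing Policy Modification'],
    'platform': [['Windows', 'macOS']],
    'data_sources': [['Command: Command Execution', 'Process: Process Creation']],
})

# One row per platform, then one row per (platform, data source) pair
flat = (toy.explode('platform')
           .reset_index(drop=True)
           .explode('data_sources')
           .reset_index(drop=True))

# Distinct techniques per data source and platform, as in df_5 / df_6
counts = (flat.groupby(['data_sources', 'platform'])['technique']
              .nunique()
              .reset_index())
print(flat)
print(counts)
```

Two platforms times two data sources gives four rows, the same cross product the loop produces for a single technique.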

In [62]:
alt.Chart(df_6).mark_bar().encode(
    x=alt.X('technique', stack="normalize"),
    y='data_sources',
    color='platform'
).properties(height=200)

Some techniques list Windows Error Reporting, Windows Registry, and Windows event logs as data sources while also listing platforms like Linux and macOS. We do not need to consider these rows because those data sources can only be collected in a Windows environment. These are the technique and data source pairs that we should exclude from our database:

In [63]:
techniques_analysis_data_no_ok[['technique','data_sources']].drop_duplicates().sort_values(by='data_sources',ascending=True)

Unnamed: 0,technique,data_sources
2956,Event Triggered Execution,WMI: WMI Creation
5094,OS Credential Dumping,Windows Registry: Windows Registry Key Access
2415,Unsecured Credentials,Windows Registry: Windows Registry Key Access
4024,Browser Extensions,Windows Registry: Windows Registry Key Creation
3155,Create or Modify System Process,Windows Registry: Windows Registry Key Creation
2892,Boot or Logon Autostart Execution,Windows Registry: Windows Registry Key Creation
4851,Boot or Logon Initialization Scripts,Windows Registry: Windows Registry Key Creation
2326,Subvert Trust Controls,Windows Registry: Windows Registry Key Creation
1533,Install Root Certificate,Windows Registry: Windows Registry Key Creation
4499,Indicator Removal on Host,Windows Registry: Windows Registry Key Deletion


Without considering this inconsistent data, the final dataframe is:

In [64]:
techniques_analysis_data_ok = techniques_data_3[techniques_data_3.Validation == 'OK']
techniques_analysis_data_ok.head()

Unnamed: 0,matrix,platform,tactic,technique,technique_id,data_sources,Validation
0,mitre-attack,Windows,defense-evasion,Code Signing Policy Modification,T1553.006,Windows Registry: Windows Registry Key Modific...,OK
1,mitre-attack,Windows,defense-evasion,Code Signing Policy Modification,T1553.006,Command: Command Execution,OK
2,mitre-attack,Windows,defense-evasion,Code Signing Policy Modification,T1553.006,Process: Process Creation,OK
4,mitre-attack,macOS,defense-evasion,Code Signing Policy Modification,T1553.006,Command: Command Execution,OK
5,mitre-attack,macOS,defense-evasion,Code Signing Policy Modification,T1553.006,Process: Process Creation,OK


In [65]:
print('There are ',len(techniques_analysis_data_ok),' rows of data that you can play with')

There are  5693  rows of data that you can play with


## 18. Getting Techniques by Data Sources

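The `get_techniques_by_datasources` function returns the techniques whose data sources include the names you pass in. It accepts one or more data source names, and the mixed-case arguments used below suggest that matching is case-insensitive.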

In [66]:
data_source = 'PROCESS MONITORING'

In [67]:
results = lift.get_techniques_by_datasources(data_source)

In [68]:
len(results)

21

In [69]:
type(results)

list

In [70]:
results2 = lift.get_techniques_by_datasources('pRoceSS MoniTorinG','process commAnd-linE parameters')

In [71]:
len(results2)

22
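
The counts above (21 for one name, 22 for two) are consistent with a technique being returned when it matches any of the given names, regardless of case. The sketch below illustrates that behavior on made-up data. It is not attackcti's implementation: `filter_by_data_sources`, the `toy_techniques` list, and the placeholder technique names are all hypothetical.

```python
def filter_by_data_sources(techniques, *data_sources):
    """Return techniques that list at least one of the given data sources.

    Hypothetical helper: matching is a case-insensitive substring test.
    """
    wanted = [d.lower() for d in data_sources]
    matches = []
    for t in techniques:
        sources = [s.lower() for s in t.get('data_sources', [])]
        # Keep the technique if any wanted name appears in any of its sources
        if any(w in s for w in wanted for s in sources):
            matches.append(t)
    return matches


# Toy records with placeholder names (illustrative values only)
toy_techniques = [
    {'technique': 'A', 'data_sources': ['Process monitoring', 'File monitoring']},
    {'technique': 'B', 'data_sources': ['Process command-line parameters']},
    {'technique': 'C', 'data_sources': ['Netflow/Enclave netflow']},
]

one = filter_by_data_sources(toy_techniques, 'PROCESS MONITORING')
two = filter_by_data_sources(toy_techniques,
                             'pRoceSS MoniTorinG',
                             'process commAnd-linE parameters')
print(len(one), len(two))
```

Adding a second name can only keep or grow the result under this "match any" reading, which is why the second call returns at least as many techniques as the first.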

In [72]:
results2[1]

AttackPattern(type='attack-pattern', id='attack-pattern--d67adac8-e3b9-44f9-9e6d-6c2a7d69dbe4', created_by_ref='identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5', created='2020-05-21T17:43:26.506Z', modified='2020-05-21T17:43:26.506Z', name='Connection Proxy', description='Adversaries may use a connection proxy to direct network traffic between systems or act as an intermediary for network communications.\n\nThe definition of a proxy can also be expanded to encompass trust relationships between networks in peer-to-peer, mesh, or trusted connections between networks consisting of hosts or systems that regularly communicate with each other.\n\nThe network may be within a single organization or across multiple organizations with trust relationships. Adversaries could use these types of relationships to manage command and control communications, to reduce the number of simultaneous outbound network connections, to provide resiliency in the face of connection loss, or to ride over existing tr