# Building the Nation-State Attack Dataset
This notebook aggregates the nation-state attack (NSA) data that was sent via email and merges it with the secondary nation-state data that was tested for normality and validated for correlation in `nsa_corr_validation`.

The first set of nation-state data is built from a dictionary of key-value pairs with the name of the attack, its group and state attribution, the year it started, the year it ended, and the ID and description of each CVE exploited. The second is imported as an Excel file that has already been cleaned.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re # For regex support

## Aggregated NSA Data

In [2]:
# Create nation-state attack dataframe
nsa_dict = {
    'attack_name': [
        'Mirai Botnet',
        'VPNFilter',
        'Triton/Trisis',
        'Iranian Cyberattacks on Water Systems',
        'Iranian APT Exploits on Fortinet Vulnerabilities',
        'Operation Shadowhammer',
        'Ripple20 Vulnerabilities',
        'Dragonfly/Energetic Bear Campaign 1',
        'Dragonfly/Energetic Bear Campaign 2',
        'Stuxnet',
        'Heartbleed Exploits',
        'BlackEnergy Attack on Ukraine',
        'Microsoft Exchange ProxyShell Exploits',
        'F5 BIG-IP Exploits',
        'Pulse Secure VPN Exploits',
        'Equifax Data Breach',
        'SolarWinds Orion Supply Chain Attack',
        'Not Petya Ransomware Attack',
        'WannaCry Ransomware Attack'
    ],
    'year_start': [
        2016,
        2018,
        2017,
        2020,
        2021,
        2018,
        2020,
        2013,
        2017,
        2018,
        2014,
        2015,
        2021,
        2021,
        2019,
        2017,
        2020,
        2017,
        2017
    ],
    'year_end': [
        2016,
        2018,
        2017,
        2020,
        2021,
        2019,
        2020,
        2014,
        2017,
        2018,
        2014,
        2015,
        2021,
        2021,
        2021,
        2017,
        2020,
        2017,
        2017
    ],
    'attribution_group': [
        pd.NA,
        'APT28 (Fancy Bear)',
        pd.NA,
        pd.NA,
        pd.NA,
        'APT41',
        pd.NA,
        'Dragonfly (Energetic Bear)',
        'Dragonfly (Energetic Bear)',
        pd.NA,
        pd.NA,
        'Sandworm',
        pd.NA,
        pd.NA,
        'APT5',
        'APT10',
        'APT29 (Cozy Bear)',
        'Sandworm',
        'Lazarus'
    ],
    'attribution_state': [
        [pd.NA],
        ['Russia', 'Russia', 'Russia', 'Russia', 'Russia'],
        ['Russia', 'Russia'],
        ['Iran'],
        ['Iran'],
        ['China'],
        [pd.NA, pd.NA, pd.NA, pd.NA],
        ['Russia'],
        ['Russia'],
        ['US', 'Israel'],
        ['China'],
        ['Russia'],
        ['China', 'China', 'China'],
        ['Russia', 'China'],
        ['China'],
        ['China'],
        ['Russia'],
        ['Russia'],
        ['DPRK']
    ],
    'cve_id': [
        [pd.NA],
        [
            'CVE-2018-14847',
            'CVE-2017-12074',
            'CVE-2018-10561',
            'CVE-2018-10562',
            'CVE-2017-8418',
        ],
        [
            'CVE-2017-7905',
            'CVE-2017-7921'
        ],
        [pd.NA],
        ['CVE-2018-13379'],
        ['CVE-2019-19781'],
        [
            'CVE-2020-11896',
            'CVE-2020-11898',
            'CVE-2020-11899',
            'CVE-2020-11901'
        ],
        [pd.NA],
        [pd.NA],
        [pd.NA, pd.NA],
        ['CVE-2014-0160'],
        [pd.NA],
        [
            'CVE-2021-34473',
            'CVE-2021-34523',
            'CVE-2021-31207'
        ],
        [
            'CVE-2020-5902',
            'CVE-2020-5902'
        ],
        ['CVE-2019-11510'],
        ['CVE-2017-5638'],
        [pd.NA],
        ['CVE-2017-0144'],
        ['CVE-2017-0144']
    ],
    'description': [
        [pd.NA],
        [
            'MikroTik RouterOS through 6.42 allows unauthenticated remote attackers to read arbitrary files and remote authenticated attackers to write arbitrary files due to a directory traversal vulnerability in the WinBox interface.',
            'Directory traversal vulnerability in the SYNO.DNSServer.Zone.MasterZoneConf in Synology DNS Server before 2.2.1-3042 allows remote authenticated attackers to write arbitrary files via the domain_name parameter.',
            'An issue was discovered on Dasan GPON home routers. It is possible to bypass authentication simply by appending "?images" to any URL of the device that requires authentication, as demonstrated by the /menu.html?images/ or /GponForm/diag_FORM?images/ URI. One can then manage the device.',
            "An issue was discovered on Dasan GPON home routers. Command Injection can occur via the dest_host parameter in a diag_action=ping request to a GponForm/diag_Form URI. Because the router saves ping results in /tmp and transmits them to the user when the user revisits /diag.html, it's quite simple to execute commands and retrieve their output.",
            'RuboCop 0.48.1 and earlier does not use /tmp in safe way, allowing local users to exploit this to tamper with cache files belonging to other users.'
        ],
        [
            'A Weak Cryptography for Passwords issue was discovered in General Electric (GE) Multilin SR 750 Feeder Protection Relay, firmware versions prior to Version 7.47; SR 760 Feeder Protection Relay, firmware versions prior to Version 7.47; SR 469 Motor Protection Relay, firmware versions prior to Version 5.23; SR 489 Generator Protection Relay, firmware versions prior to Version 4.06; SR 745 Transformer Protection Relay, firmware versions prior to Version 5.23; SR 369 Motor Protection Relay, all firmware versions; Multilin Universal Relay, firmware Version 6.0 and prior versions; and Multilin URplus (D90, C90, B95), all versions. Ciphertext versions of user passwords were created with a non-random initialization vector leaving them susceptible to dictionary attacks. Ciphertext of user passwords can be obtained from the front LCD panel of affected products and through issued Modbus commands.',
            'An Improper Authentication issue was discovered in Hikvision DS-2CD2xx2F-I Series V5.2.0 build 140721 to V5.4.0 build 160530, DS-2CD2xx0F-I Series V5.2.0 build 140721 to V5.4.0 Build 160401, DS-2CD2xx2FWD Series V5.3.1 build 150410 to V5.4.4 Build 161125, DS-2CD4x2xFWD Series V5.2.0 build 140721 to V5.4.0 Build 160414, DS-2CD4xx5 Series V5.2.0 build 140721 to V5.4.0 Build 160421, DS-2DFx Series V5.2.0 build 140805 to V5.4.5 Build 160928, and DS-2CD63xx Series V5.0.9 build 140305 to V5.3.5 Build 160106 devices. The improper authentication vulnerability occurs when an application does not adequately or correctly authenticate users. This may allow a malicious user to escalate his or her privileges on the system and gain access to sensitive information.'
        ],
        [pd.NA],
        ['An Improper Limitation of a Pathname to a Restricted Directory ("Path Traversal") in Fortinet FortiOS 6.0.0 to 6.0.4, 5.6.3 to 5.6.7 and 5.4.6 to 5.4.12 and FortiProxy 2.0.0, 1.2.0 to 1.2.8, 1.1.0 to 1.1.6, 1.0.0 to 1.0.7 under SSL VPN web portal allows an unauthenticated attacker to download system files via special crafted HTTP resource requests.'],
        ['An issue was discovered in Citrix Application Delivery Controller (ADC) and Gateway 10.5, 11.1, 12.0, 12.1, and 13.0. They allow Directory Traversal.'],
        [
            'The Treck TCP/IP stack before 6.0.1.66 allows Remote Code Execution, related to IPv4 tunneling.',
            'The Treck TCP/IP stack before 6.0.1.66 improperly handles an IPv4/ICMPv4 Length Parameter Inconsistency, which might allow remote attackers to trigger an information leak.',
            'The Treck TCP/IP stack before 6.0.1.66 has an IPv6 Out-of-bounds Read.',
            'The Treck TCP/IP stack before 6.0.1.66 allows Remote Code execution via a single invalid DNS response.'
        ],
        [pd.NA],
        [pd.NA],
        [pd.NA, pd.NA],
        ['The (1) TLS and (2) DTLS implementations in OpenSSL 1.0.1 before 1.0.1g do not properly handle Heartbeat Extension packets, which allows remote attackers to obtain sensitive information from process memory via crafted packets that trigger a buffer over-read, as demonstrated by reading private keys, related to d1_both.c and t1_lib.c, aka the Heartbleed bug.'],
        [pd.NA],
        [
            'Microsoft Exchange Server Remote Code Execution Vulnerability',
            'Microsoft Exchange Server Elevation of Privilege Vulnerability',
            'Microsoft Exchange Server Security Feature Bypass Vulnerability'
        ],
        [
            'In BIG-IP versions 15.0.0-15.1.0.3, 14.1.0-14.1.2.5, 13.1.0-13.1.3.3, 12.1.0-12.1.5.1, and 11.6.1-11.6.5.1, the Traffic Management User Interface (TMUI), also referred to as the Configuration utility, has a Remote Code Execution (RCE) vulnerability in undisclosed pages.',
            'In BIG-IP versions 15.0.0-15.1.0.3, 14.1.0-14.1.2.5, 13.1.0-13.1.3.3, 12.1.0-12.1.5.1, and 11.6.1-11.6.5.1, the Traffic Management User Interface (TMUI), also referred to as the Configuration utility, has a Remote Code Execution (RCE) vulnerability in undisclosed pages.'
        ],
        ['In Pulse Secure Pulse Connect Secure (PCS) 8.2 before 8.2R12.1, 8.3 before 8.3R7.1, and 9.0 before 9.0R3.4, an unauthenticated remote attacker can send a specially crafted URI to perform an arbitrary file reading vulnerability.'],
        ['The Jakarta Multipart parser in Apache Struts 2 2.3.x before 2.3.32 and 2.5.x before 2.5.10.1 has incorrect exception handling and error-message generation during file-upload attempts, which allows remote attackers to execute arbitrary commands via a crafted Content-Type, Content-Disposition, or Content-Length HTTP header, as exploited in the wild in March 2017 with a Content-Type header containing a #cmd= string.'],
        [pd.NA],
        ['The SMBv1 server in Microsoft Windows Vista SP2; Windows Server 2008 SP2 and R2 SP1; Windows 7 SP1; Windows 8.1; Windows Server 2012 Gold and R2; Windows RT 8.1; and Windows 10 Gold, 1511, and 1607; and Windows Server 2016 allows remote attackers to execute arbitrary code via crafted packets, aka "Windows SMB Remote Code Execution Vulnerability." This vulnerability is different from those described in CVE-2017-0143, CVE-2017-0145, CVE-2017-0146, and CVE-2017-0148.'],
        ['The SMBv1 server in Microsoft Windows Vista SP2; Windows Server 2008 SP2 and R2 SP1; Windows 7 SP1; Windows 8.1; Windows Server 2012 Gold and R2; Windows RT 8.1; and Windows 10 Gold, 1511, and 1607; and Windows Server 2016 allows remote attackers to execute arbitrary code via crafted packets, aka "Windows SMB Remote Code Execution Vulnerability." This vulnerability is different from those described in CVE-2017-0143, CVE-2017-0145, CVE-2017-0146, and CVE-2017-0148.']
    ],
}

nsa = pd.DataFrame(nsa_dict)
nsa.head(3)

Unnamed: 0,attack_name,year_start,year_end,attribution_group,attribution_state,cve_id,description
0,Mirai Botnet,2016,2016,,[<NA>],[<NA>],[<NA>]
1,VPNFilter,2018,2018,APT28 (Fancy Bear),"[Russia, Russia, Russia, Russia, Russia]","[CVE-2018-14847, CVE-2017-12074, CVE-2018-1056...",[MikroTik RouterOS through 6.42 allows unauthe...
2,Triton/Trisis,2017,2017,,"[Russia, Russia]","[CVE-2017-7905, CVE-2017-7921]",[A Weak Cryptography for Passwords issue was d...


## Exploding the Data
With the attack data aggregated into a (dirty) dataset, we have to consider how the lists in the list-containing columns relate to one another. Since each `cve_id` maps one-to-one to its `description`, we'll explode these two columns simultaneously. Only then will we explode the lists in the `attribution_state` column, since we don't want to create false relationships suggesting that, within the context of a single attack, Nation $A$ used CVE $A$ while Nation $B$ used CVE $B$, when in fact we don't know. Ultimately, we have to represent the situation as both nations having used both CVEs. I created the dictionary knowing that Pandas needs each observation's lists to be aligned, so we can avoid what we had to do for the CVE and CWE lists in terms of normalizing their content lengths.

In [3]:
# Explode the nation-state attack data
nsa = nsa.explode(['cve_id', 'description'])
nsa = nsa.explode('attribution_state')
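
To make the ordering of the two `explode` calls concrete, here is a minimal sketch on a toy record. The attack, nation, and CVE names are hypothetical placeholders, and passing a list of columns to `explode` assumes pandas 1.3 or later.

```python
import pandas as pd

# Toy record with hypothetical values, shaped like one entry of the attack data
toy = pd.DataFrame({
    'attack_name': ['Attack X'],
    'attribution_state': [['Nation A', 'Nation B']],
    'cve_id': [['CVE-A', 'CVE-B']],
    'description': [['first flaw', 'second flaw']],
})

# Step 1: explode the paired columns together so each ID keeps its description
paired = toy.explode(['cve_id', 'description'])

# Step 2: explode the states separately, so every state is paired with every CVE
crossed = paired.explode('attribution_state').reset_index(drop=True)

# For contrast: exploding all three columns at once pairs them positionally,
# which would imply Nation A used only CVE-A and Nation B used only CVE-B
zipped = toy.explode(['attribution_state', 'cve_id', 'description'])

print(crossed[['attribution_state', 'cve_id', 'description']])
print(len(crossed), len(zipped))
```

The two-step version yields four rows (both nations crossed with both CVEs), while the single positional explode yields only two.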

## Deleting Duplicates

In [4]:
print(f'There were {nsa.duplicated().sum()} duplicates.')

# Remove duplicates
nsa = nsa.drop_duplicates()
print(f'There are now {nsa.duplicated().sum()} duplicates.')

There were 44 duplicates.
There are now 0 duplicates.
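
One likely source of these duplicates is the repeated entries in some `attribution_state` lists (for example `['Russia', 'Russia']`), which multiply rows during the second explode. The sketch below reproduces the effect on a single record copied from the dictionary above, with the descriptions left out for brevity.

```python
import pandas as pd

# One record copied from the dictionary above (descriptions omitted)
toy = pd.DataFrame({
    'attack_name': ['Triton/Trisis'],
    'attribution_state': [['Russia', 'Russia']],
    'cve_id': [['CVE-2017-7905', 'CVE-2017-7921']],
})

# Same order as the notebook: CVEs first, then states
exploded = toy.explode('cve_id').explode('attribution_state')

# Each CVE now appears once per repeated state, so half the rows are duplicates
n_dupes = int(exploded.duplicated().sum())
deduped = exploded.drop_duplicates().reset_index(drop=True)

print(len(exploded), n_dupes, len(deduped))
```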


## Convert Objects To Text Data

In [5]:
obj_cols = nsa.select_dtypes(include=['object']).columns
nsa[obj_cols] = nsa[obj_cols].astype('string')

## Remove Extra Whitespace

In [6]:
nsa[obj_cols] = nsa[obj_cols].apply(lambda x: x.str.strip())
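
As a small illustration of the two steps above, the sketch below converts object columns to the `string` dtype and strips whitespace while leaving missing values as `pd.NA`. The padded values are made up for the example, and the frame is built with `dtype=object` on purpose so that the `select_dtypes` call behaves the same across pandas versions.

```python
import pandas as pd

# Toy frame with deliberately padded (made-up) values and one missing entry
toy = pd.DataFrame(
    {
        'attack_name': [' Stuxnet ', 'Mirai Botnet'],
        'attribution_group': [pd.NA, ' Sandworm'],
    },
    dtype=object,
)

# Convert object columns to the dedicated string dtype
obj_cols = toy.select_dtypes(include=['object']).columns
toy[obj_cols] = toy[obj_cols].astype('string')

# Strip surrounding whitespace; missing values pass through as <NA>
toy[obj_cols] = toy[obj_cols].apply(lambda s: s.str.strip())

print(toy.dtypes)
print(toy)
```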

## Import Second NSA Data
This dataset requires several cleaning steps. The columns have to be renamed so that merges with other data happen seamlessly, some of the columns are no longer needed, some data should be converted to `string` for easier processing, and some of the values need to be fixed for consistency's sake.

In [7]:
new_nsa = pd.read_parquet(path='../data/new_nsa_data_v1.parquet')
new_nsa.head(3)

Unnamed: 0,Attack_Name,IoT,CVE_ID,CVE_List_Date,Date_Of_First_Exploit,Patch_Release_Date,Patch/Device_Manufacturer,Affected_Devices,CVSS,CVSS_Status,Days_To_Patch_Release,Days_To_First_Exploit
0,VPN Filter malware,True,CVE-2017-6742,2017-10-17,2018-05-24,2018-06-27,CISCO,CISCO routers,9.8,critical,253,219
1,VPN Filter malware,True,CVE-2017-6750,2017-10-17,2018-05-24,2018-06-27,CISCO,"CISCO routers, network devices",9.8,critical,253,219
2,VPN Filter malware,True,CVE-2017-6751,2017-10-17,2018-05-24,2018-06-27,CISCO,CISCO routers,9.8,critical,253,219


## Lowercasing the Columns

In [8]:
new_nsa.columns = new_nsa.columns.str.lower()

## Removing Unnecessary Columns

In [9]:
unnecessary_cols = ['iot', 'patch/device_manufacturer', 'affected_devices']
new_nsa = new_nsa.drop(columns=unnecessary_cols)

## Converting Object to Text Data

In [10]:
obj_cols = new_nsa.select_dtypes(include=['object']).columns
new_nsa[obj_cols] = new_nsa[obj_cols].astype('string')

## Fixing Attack Names
Some of the attack names need to be conformed to the way they appear in the first NSA dataframe. We can take the years that exist in the old values to populate some new attributes that will fit in with the larger NSA dataset.

In [11]:
# Rename attacks to accord with larger dataset
new_names = {
    'VPN Filter malware': 'VPNFilter',
    'Dragonfly (2022)': 'Dragonfly/Energetic Bear Campaign 3',
    'BlackEnergy (2015)': 'BlackEnergy Attack on Ukraine',
    'stuxnet': 'Stuxnet',
    'Mirai Botnet Variants': 'Mirai Botnet',
    'Triton/Trisis (2017 and ongoing)': 'Triton/Trisis'
}

new_nsa['attack_name'] = new_nsa['attack_name'].replace(new_names)

# Add data to new nation state values
updates = {
    'BlackEnergy Attack on Ukraine': [2015, 2015, 'Sandworm', 'Russia'],
    'Dragonfly/Energetic Bear Campaign 3': [2022, 2022, 'Dragonfly (Energetic Bear)', 'Russia'],
    'Stuxnet': [2018, 2018, pd.NA, 'US'],
    'Triton/Trisis': [2017, 2017, pd.NA, 'Russia'],
    'VPNFilter': [2018, 2018, 'APT28 (Fancy Bear)', 'Russia'],
    'Volt Typhoon': [pd.NA, pd.NA, pd.NA, 'China']
}

for attack, values in updates.items():
    new_nsa.loc[new_nsa['attack_name'] == attack, ['year_start', 'year_end', 'attribution_group', 'attribution_state']] = values

# Reconvert to string data the new values that were incorrectly parsed
obj_cols = new_nsa.select_dtypes(include=['object']).columns
new_nsa[obj_cols] = new_nsa[obj_cols].astype('string')
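
A quick way to confirm that the renaming did its job is to compare the attack names on both sides before merging. The following is a sketch on a small subset of the names used in this notebook; `first_names`, `second_names`, and `unmatched` are illustrative variable names, not objects defined elsewhere in the notebook.

```python
import pandas as pd

# Subsets of the names from the two datasets, for illustration only
first_names = pd.Series(['VPNFilter', 'Stuxnet', 'Mirai Botnet'])
second_names = pd.Series(['VPN Filter malware', 'stuxnet', 'Volt Typhoon'])

# Same renaming idea as above, restricted to the names in this toy sample
new_names = {
    'VPN Filter malware': 'VPNFilter',
    'stuxnet': 'Stuxnet',
}
renamed = second_names.replace(new_names)

# Names that still have no counterpart in the first dataset
unmatched = sorted(set(renamed) - set(first_names))
print(unmatched)
```

Any name left in `unmatched` is either a genuinely new attack (as Volt Typhoon is here) or a spelling that still needs an entry in the renaming dictionary.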

In [12]:
new_nsa.head(3)

Unnamed: 0,attack_name,cve_id,cve_list_date,date_of_first_exploit,patch_release_date,cvss,cvss_status,days_to_patch_release,days_to_first_exploit,year_start,year_end,attribution_group,attribution_state
0,VPNFilter,CVE-2017-6742,2017-10-17,2018-05-24,2018-06-27,9.8,critical,253,219,2018.0,2018.0,APT28 (Fancy Bear),Russia
1,VPNFilter,CVE-2017-6750,2017-10-17,2018-05-24,2018-06-27,9.8,critical,253,219,2018.0,2018.0,APT28 (Fancy Bear),Russia
2,VPNFilter,CVE-2017-6751,2017-10-17,2018-05-24,2018-06-27,9.8,critical,253,219,2018.0,2018.0,APT28 (Fancy Bear),Russia


## Merging Both Dataframes
An outer merge between these two dataframes makes the most sense, since all of the data from both sets will be kept. The resulting dataset will be used to populate our master copy, however sparsely, with information about nation-state attacks. We don't want to narrow this down.

Once we've merged the two, I'll perform another round of checks to make sure everything is clean and update certain values to ensure consistency.

In [12]:
nsa = nsa.merge(
    new_nsa,
    on='attack_name',
    how='outer'
)
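
To see what an outer merge on `attack_name` alone produces, here is a minimal sketch using a few attack names and CVE IDs that appear in this notebook. The `indicator=True` argument is an addition for illustration; it is not used in the cell above.

```python
import pandas as pd

# Small stand-ins for the two dataframes
left = pd.DataFrame({
    'attack_name': ['VPNFilter', 'Heartbleed Exploits'],
    'cve_id': ['CVE-2018-14847', 'CVE-2014-0160'],
})
right = pd.DataFrame({
    'attack_name': ['VPNFilter', 'Volt Typhoon'],
    'cve_id': ['CVE-2017-6742', 'CVE-2022-26658'],
})

# Outer merge keeps every attack; '_merge' records which side each row came from
merged = left.merge(right, on='attack_name', how='outer', indicator=True)

print(merged)
```

Because `cve_id` exists on both sides but is not part of the merge key, pandas keeps both copies under the `_x` and `_y` suffixes. Rows marked `both` pair a row from the left with a row from the right that shares only the attack name.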

In [13]:
nsa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   attack_name            87 non-null     string        
 1   year_start_x           75 non-null     float64       
 2   year_end_x             75 non-null     float64       
 3   attribution_group_x    43 non-null     string        
 4   attribution_state_x    67 non-null     string        
 5   cve_id_x               52 non-null     string        
 6   description            52 non-null     string        
 7   cve_id_y               67 non-null     string        
 8   cve_list_date          67 non-null     datetime64[ns]
 9   date_of_first_exploit  67 non-null     datetime64[ns]
 10  patch_release_date     67 non-null     datetime64[ns]
 11  cvss                   67 non-null     float64       
 12  cvss_status            67 non-null     category      
 13  days_to

## Combine Data
Because we merged on the attack names alone (which we needed to do to keep the rest of the data), the other columns that existed in both dataframes were duplicated in the merged dataset with `_x` and `_y` suffixes. The easiest fix is to combine each pair back into one column, preferring the value from the first dataframe, and then drop the suffixed columns.

In [14]:
# Columns that exist in both dataframes and were suffixed by the merge
shared_cols = [
    'year_start',
    'year_end',
    'attribution_group',
    'attribution_state',
    'cve_id'
]

# Combine each pair, taking the "_x" value and falling back to "_y" where it is missing
for col in shared_cols:
    nsa[col] = nsa[f'{col}_x'].combine_first(nsa[f'{col}_y'])
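
The sketch below shows how `combine_first` resolves a suffixed pair, along with a check for rows where both sides are present but disagree. The non-null counts in the `info()` output above (52 for `cve_id_x` and 67 for `cve_id_y`) suggest some rows carry both IDs, and in those rows the `_x` value wins. The `conflicts` check is an optional extra and not part of the original workflow.

```python
import pandas as pd

# Toy suffixed pair using CVE IDs that appear in this notebook
toy = pd.DataFrame({
    'cve_id_x': pd.array(['CVE-2018-14847', pd.NA, 'CVE-2014-0160'], dtype='string'),
    'cve_id_y': pd.array(['CVE-2017-6742', 'CVE-2022-26658', pd.NA], dtype='string'),
})

# Prefer the "_x" value, fall back to "_y" only where "_x" is missing
toy['cve_id'] = toy['cve_id_x'].combine_first(toy['cve_id_y'])

# Rows where both sides are present and differ: the "_y" value is discarded
both = toy['cve_id_x'].notna() & toy['cve_id_y'].notna()
conflicts = toy[both & (toy['cve_id_x'] != toy['cve_id_y'])]

print(toy['cve_id'].tolist())
print(len(conflicts))
```

If such conflicts exist in the real data, it may be worth confirming that the date and CVSS columns taken from the second dataframe still describe the CVE that ends up in `cve_id`.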

## Populate the Description Column
Because CVE descriptions were only present for the CVE IDs from one of the two datasets, we'll pull in the cleaned-up CVE data `cve_list_v3` and left-merge its descriptions into the NSA data. We'll then need to drop a host of columns that are no longer wanted.

In [15]:
# Import CVE data
cves = pd.read_parquet(path='../data/CVE_Project/cvelistV5/cve_list_v3.parquet')

# Merge with just the descriptions
nsa = nsa.merge(
    cves[['cve_id', 'description']],
    on=['cve_id'],
    how='left'
)

# Combine the descriptions
nsa['description'] = nsa['description_x'].combine_first(nsa['description_y'])
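
A left merge like the one above only keeps the row count stable if each `cve_id` appears once in the lookup table. The sketch below adds `validate='many_to_one'`, which makes pandas raise an error if that assumption fails. The description strings are placeholders rather than real CVE text, and `nsa_toy` and `cves_toy` are hypothetical stand-ins for the real dataframes.

```python
import pandas as pd

# Stand-in for the NSA data: one description present, two missing
nsa_toy = pd.DataFrame({
    'cve_id': ['CVE-2014-0160', 'CVE-2017-6742', 'CVE-2022-26658'],
    'description': pd.array(['existing text', pd.NA, pd.NA], dtype='string'),
})

# Stand-in for the CVE lookup: covers only one of the missing IDs
cves_toy = pd.DataFrame({
    'cve_id': ['CVE-2017-6742'],
    'description': pd.array(['looked-up text'], dtype='string'),
})

# validate='many_to_one' raises MergeError if the lookup has duplicate cve_id values
merged = nsa_toy.merge(cves_toy, on='cve_id', how='left', validate='many_to_one')

# Keep the existing description, fill from the lookup where it is missing
merged['description'] = merged['description_x'].combine_first(merged['description_y'])

print(merged[['cve_id', 'description']])
```

An ID that is absent from the lookup, like the last one here, keeps a missing description.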

## Drop Unnecessary Columns

In [16]:
unnecessary_cols = [
    'year_start_x',
    'year_start_y',
    'year_end_x',
    'year_end_y',
    'attribution_group_x',
    'attribution_group_y',
    'attribution_state_x',
    'attribution_state_y',
    'cve_id_x',
    'cve_id_y',
    'description_x',
    'description_y'
]

nsa = nsa.drop(columns=unnecessary_cols)
nsa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   attack_name            87 non-null     string        
 1   cve_list_date          67 non-null     datetime64[ns]
 2   date_of_first_exploit  67 non-null     datetime64[ns]
 3   patch_release_date     67 non-null     datetime64[ns]
 4   cvss                   67 non-null     float64       
 5   cvss_status            67 non-null     category      
 6   days_to_patch_release  67 non-null     float64       
 7   days_to_first_exploit  67 non-null     float64       
 8   year_start             80 non-null     float64       
 9   year_end               80 non-null     float64       
 10  attribution_group      48 non-null     string        
 11  attribution_state      79 non-null     string        
 12  cve_id                 83 non-null     string        
 13  descrip

<span style='color:#ffcc00;text-shadow:0 0 3px #ffcc00;'>This is the only CVE ID without a description, and it's because no such ID exists in MITRE's data.</span>

In [21]:
nsa[(nsa['cve_id'].notnull()) & nsa['description'].isnull()].head()

Unnamed: 0,attack_name,cve_list_date,date_of_first_exploit,patch_release_date,cvss,cvss_status,days_to_patch_release,days_to_first_exploit,year_start,year_end,attribution_group,attribution_state,cve_id,description
80,Volt Typhoon,2022-05-10,2023-04-25,2022-06-06,6.5,medium,27.0,350.0,,,,China,CVE-2022-26658,


## Series of Rapid Checks

In [17]:
# Update data types
floats = nsa.select_dtypes(include=['float']).columns
wanted_floats = [col for col in floats if col != 'cvss']
nsa[wanted_floats] = nsa[wanted_floats].astype('Int64')

# Convert datetime to UTC time
dates = nsa.select_dtypes(include=['datetime64[ns]']).columns
nsa[dates] = nsa[dates].apply(pd.to_datetime, utc=True)

# Convert into categorical data types
cat_cols = [
    'attack_name',
    'attribution_group',
    'attribution_state'
]
for col in cat_cols:
    nsa[col] = nsa[col].astype('category')

# Remove whitespace
str_cols = nsa.select_dtypes(include=['string']).columns
nsa[str_cols] = nsa[str_cols].apply(lambda x: x.str.strip())
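
The year and day-count columns come out of the merge as floats because missing values force a float dtype. Casting to the nullable `Int64` type restores whole numbers while keeping the gaps. The sketch below shows that cast and the UTC conversion on a toy frame. It selects datetime columns with `include=['datetime']`, a slightly broader selector than the one in the cell above, so the example does not depend on the datetime resolution pandas infers.

```python
import pandas as pd

# Toy frame with a missing year and a missing date
toy = pd.DataFrame({
    'year_start': [2018.0, float('nan')],
    'cvss': [9.8, 6.5],
    'cve_list_date': pd.to_datetime(['2017-10-17', None]),
})

# Cast every float column except the CVSS score to nullable integers
floats = toy.select_dtypes(include=['float']).columns
wanted_floats = [col for col in floats if col != 'cvss']
toy[wanted_floats] = toy[wanted_floats].astype('Int64')

# Make the naive datetimes timezone-aware (UTC)
dates = toy.select_dtypes(include=['datetime']).columns
toy[dates] = toy[dates].apply(pd.to_datetime, utc=True)

print(toy.dtypes)
```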

In [18]:
nsa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   attack_name            87 non-null     category           
 1   cve_list_date          67 non-null     datetime64[ns, UTC]
 2   date_of_first_exploit  67 non-null     datetime64[ns, UTC]
 3   patch_release_date     67 non-null     datetime64[ns, UTC]
 4   cvss                   67 non-null     float64            
 5   cvss_status            67 non-null     category           
 6   days_to_patch_release  67 non-null     Int64              
 7   days_to_first_exploit  67 non-null     Int64              
 8   year_start             80 non-null     Int64              
 9   year_end               80 non-null     Int64              
 10  attribution_group      48 non-null     category           
 11  attribution_state      79 non-null     category           
 

## Saving the Data

In [19]:
nsa.to_parquet(path='../data/nsa_data_v2.parquet')