# Building the Nation-State Attack Dataset
This notebook aggregates the nation-state attack (NSA) data sent via email and merges it with the second NSA dataset, which was tested for normality and validated for correlation in `nsa_corr_validation`.

The first set of nation-state data is built from a dictionary of key-value pairs holding the name of each attack, its group and state attribution, the year it started, the year it ended, and the ID and description of each CVE exploited. Some of these values are lists that describe more complex relationships between attacks and their properties. It was important to capture this information and intuitive to organize it this way, but it introduces a problem common to more complex datasets, which will be addressed after exploding these lists across multiple rows.

The second NSA data imported into this notebook came from the Excel sheet that I was given and contains <i>incorrect</i> patch release and exploitation dates. <span style='color:#ffcc00;text-shadow:0 0 3px #ffcc00;'>It's important not to confuse <span style='font-weight:bold;color:#ff9900;background-color:#525767;border-radius:3px;padding-inline:3px;padding-block:1px;font-style:normal;text-shadow:none;'>nsa_data_p2_v0</span> (the unclean Excel sheet that is cleaned up to become the second NSA data) with <span style='font-weight:bold;color:#ff9900;background-color:#525767;border-radius:3px;padding-inline:3px;padding-block:1px;font-style:normal;text-shadow:none;'>new_nsa_data_v1</span> (the cleaned version of the second NSA data), nor to confuse either of them with <span style='font-weight:bold;color:#ff9900;background-color:#525767;border-radius:3px;padding-inline:3px;padding-block:1px;font-style:normal;text-shadow:none;'>nsa_data_v2</span> (the <i>incorrect</i> completed NSA data product resulting from a merge between the first and second parts), <span style='font-weight:bold;color:#ff9900;background-color:#525767;border-radius:3px;padding-inline:3px;padding-block:1px;font-style:normal;text-shadow:none;'>mock_nsa_data_v3</span> (the merged result of both parts, understanding that it contains mock data), or <span style='font-weight:bold;color:#ff9900;background-color:#525767;border-radius:3px;padding-inline:3px;padding-block:1px;font-style:normal;text-shadow:none;'>nsa_data_v3</span> (the cleaned, correct merge of all relevant NSA data)</span>.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re # For regex support

## Aggregated NSA Data

In [2]:
# Create nation-state attack dataframe
nsa_dict = {
    'attack_name': [
        'Mirai Botnet',
        'VPNFilter',
        'Triton/Trisis',
        'Iranian Cyberattacks on Water Systems',
        'Iranian APT Exploits on Fortinet Vulnerabilities',
        'Operation Shadowhammer',
        'Ripple20 Vulnerabilities',
        'Dragonfly/Energetic Bear Campaign 1',
        'Dragonfly/Energetic Bear Campaign 2',
        'Stuxnet',
        'Heartbleed Exploits',
        'BlackEnergy Attack on Ukraine',
        'Microsoft Exchange ProxyShell Exploits',
        'F5 BIG-IP Exploits',
        'Pulse Secure VPN Exploits',
        'Equifax Data Breach',
        'SolarWinds Orion Supply Chain Attack',
        'Not Petya Ransomware Attack',
        'WannaCry Ransomware Attack'
    ],
    'year_start': [
        2016,
        2018,
        2017,
        2020,
        2021,
        2018,
        2020,
        2013,
        2017,
        2018,
        2014,
        2015,
        2021,
        2021,
        2019,
        2017,
        2020,
        2017,
        2017
    ],
    'year_end': [
        2016,
        2018,
        2017,
        2020,
        2021,
        2019,
        2020,
        2014,
        2017,
        2018,
        2014,
        2015,
        2021,
        2021,
        2021,
        2017,
        2020,
        2017,
        2017
    ],
    'attribution_group': [
        pd.NA,
        'APT28 (Fancy Bear)',
        pd.NA,
        pd.NA,
        pd.NA,
        'APT41',
        pd.NA,
        'Dragonfly (Energetic Bear)',
        'Dragonfly (Energetic Bear)',
        pd.NA,
        pd.NA,
        'Sandworm',
        pd.NA,
        pd.NA,
        'APT5',
        'APT10',
        'APT29 (Cozy Bear)',
        'Sandworm',
        'Lazarus'
    ],
    'attribution_state': [
        [pd.NA],
        ['Russia', 'Russia', 'Russia', 'Russia', 'Russia'],
        ['Russia', 'Russia'],
        ['Iran'],
        ['Iran'],
        ['China'],
        [pd.NA, pd.NA, pd.NA, pd.NA],
        ['Russia'],
        ['Russia'],
        ['US', 'Israel'],
        ['China'],
        ['Russia'],
        ['China', 'China', 'China'],
        ['Russia', 'China'],
        ['China'],
        ['China'],
        ['Russia'],
        ['Russia'],
        ['DPRK']
    ],
    'cve_id': [
        [pd.NA],
        [
            'CVE-2018-14847',
            'CVE-2017-12074',
            'CVE-2018-10561',
            'CVE-2018-10562',
            'CVE-2017-8418',
        ],
        [
            'CVE-2017-7905',
            'CVE-2017-7921'
        ],
        [pd.NA],
        ['CVE-2018-13379'],
        ['CVE-2019-19781'],
        [
            'CVE-2020-11896',
            'CVE-2020-11898',
            'CVE-2020-11899',
            'CVE-2020-11901'
        ],
        [pd.NA],
        [pd.NA],
        [pd.NA, pd.NA],
        ['CVE-2014-0160'],
        [pd.NA],
        [
            'CVE-2021-34473',
            'CVE-2021-34523',
            'CVE-2021-31207'
        ],
        [
            'CVE-2020-5902',
            'CVE-2020-5902'
        ],
        ['CVE-2019-11510'],
        ['CVE-2017-5638'],
        [pd.NA],
        ['CVE-2017-0144'],
        ['CVE-2017-0144']
    ],
    'description': [
        [pd.NA],
        [
            'MikroTik RouterOS through 6.42 allows unauthenticated remote attackers to read arbitrary files and remote authenticated attackers to write arbitrary files due to a directory traversal vulnerability in the WinBox interface.',
            'Directory traversal vulnerability in the SYNO.DNSServer.Zone.MasterZoneConf in Synology DNS Server before 2.2.1-3042 allows remote authenticated attackers to write arbitrary files via the domain_name parameter.',
            'An issue was discovered on Dasan GPON home routers. It is possible to bypass authentication simply by appending "?images" to any URL of the device that requires authentication, as demonstrated by the /menu.html?images/ or /GponForm/diag_FORM?images/ URI. One can then manage the device.',
            "An issue was discovered on Dasan GPON home routers. Command Injection can occur via the dest_host parameter in a diag_action=ping request to a GponForm/diag_Form URI. Because the router saves ping results in /tmp and transmits them to the user when the user revisits /diag.html, it's quite simple to execute commands and retrieve their output.",
            'RuboCop 0.48.1 and earlier does not use /tmp in safe way, allowing local users to exploit this to tamper with cache files belonging to other users.'
        ],
        [
            'A Weak Cryptography for Passwords issue was discovered in General Electric (GE) Multilin SR 750 Feeder Protection Relay, firmware versions prior to Version 7.47; SR 760 Feeder Protection Relay, firmware versions prior to Version 7.47; SR 469 Motor Protection Relay, firmware versions prior to Version 5.23; SR 489 Generator Protection Relay, firmware versions prior to Version 4.06; SR 745 Transformer Protection Relay, firmware versions prior to Version 5.23; SR 369 Motor Protection Relay, all firmware versions; Multilin Universal Relay, firmware Version 6.0 and prior versions; and Multilin URplus (D90, C90, B95), all versions. Ciphertext versions of user passwords were created with a non-random initialization vector leaving them susceptible to dictionary attacks. Ciphertext of user passwords can be obtained from the front LCD panel of affected products and through issued Modbus commands.',
            'An Improper Authentication issue was discovered in Hikvision DS-2CD2xx2F-I Series V5.2.0 build 140721 to V5.4.0 build 160530, DS-2CD2xx0F-I Series V5.2.0 build 140721 to V5.4.0 Build 160401, DS-2CD2xx2FWD Series V5.3.1 build 150410 to V5.4.4 Build 161125, DS-2CD4x2xFWD Series V5.2.0 build 140721 to V5.4.0 Build 160414, DS-2CD4xx5 Series V5.2.0 build 140721 to V5.4.0 Build 160421, DS-2DFx Series V5.2.0 build 140805 to V5.4.5 Build 160928, and DS-2CD63xx Series V5.0.9 build 140305 to V5.3.5 Build 160106 devices. The improper authentication vulnerability occurs when an application does not adequately or correctly authenticate users. This may allow a malicious user to escalate his or her privileges on the system and gain access to sensitive information.'
        ],
        [pd.NA],
        ['An Improper Limitation of a Pathname to a Restricted Directory ("Path Traversal") in Fortinet FortiOS 6.0.0 to 6.0.4, 5.6.3 to 5.6.7 and 5.4.6 to 5.4.12 and FortiProxy 2.0.0, 1.2.0 to 1.2.8, 1.1.0 to 1.1.6, 1.0.0 to 1.0.7 under SSL VPN web portal allows an unauthenticated attacker to download system files via special crafted HTTP resource requests.'],
        ['An issue was discovered in Citrix Application Delivery Controller (ADC) and Gateway 10.5, 11.1, 12.0, 12.1, and 13.0. They allow Directory Traversal.'],
        [
            'The Treck TCP/IP stack before 6.0.1.66 allows Remote Code Execution, related to IPv4 tunneling.',
            'The Treck TCP/IP stack before 6.0.1.66 improperly handles an IPv4/ICMPv4 Length Parameter Inconsistency, which might allow remote attackers to trigger an information leak.',
            'The Treck TCP/IP stack before 6.0.1.66 has an IPv6 Out-of-bounds Read.',
            'The Treck TCP/IP stack before 6.0.1.66 allows Remote Code execution via a single invalid DNS response.'
        ],
        [pd.NA],
        [pd.NA],
        [pd.NA, pd.NA],
        ['The (1) TLS and (2) DTLS implementations in OpenSSL 1.0.1 before 1.0.1g do not properly handle Heartbeat Extension packets, which allows remote attackers to obtain sensitive information from process memory via crafted packets that trigger a buffer over-read, as demonstrated by reading private keys, related to d1_both.c and t1_lib.c, aka the Heartbleed bug.'],
        [pd.NA],
        [
            'Microsoft Exchange Server Remote Code Execution Vulnerability',
            'Microsoft Exchange Server Elevation of Privilege Vulnerability',
            'Microsoft Exchange Server Security Feature Bypass Vulnerability'
        ],
        [
            'In BIG-IP versions 15.0.0-15.1.0.3, 14.1.0-14.1.2.5, 13.1.0-13.1.3.3, 12.1.0-12.1.5.1, and 11.6.1-11.6.5.1, the Traffic Management User Interface (TMUI), also referred to as the Configuration utility, has a Remote Code Execution (RCE) vulnerability in undisclosed pages.',
            'In BIG-IP versions 15.0.0-15.1.0.3, 14.1.0-14.1.2.5, 13.1.0-13.1.3.3, 12.1.0-12.1.5.1, and 11.6.1-11.6.5.1, the Traffic Management User Interface (TMUI), also referred to as the Configuration utility, has a Remote Code Execution (RCE) vulnerability in undisclosed pages.'
        ],
        ['In Pulse Secure Pulse Connect Secure (PCS) 8.2 before 8.2R12.1, 8.3 before 8.3R7.1, and 9.0 before 9.0R3.4, an unauthenticated remote attacker can send a specially crafted URI to perform an arbitrary file reading vulnerability.'],
        ['The Jakarta Multipart parser in Apache Struts 2 2.3.x before 2.3.32 and 2.5.x before 2.5.10.1 has incorrect exception handling and error-message generation during file-upload attempts, which allows remote attackers to execute arbitrary commands via a crafted Content-Type, Content-Disposition, or Content-Length HTTP header, as exploited in the wild in March 2017 with a Content-Type header containing a #cmd= string.'],
        [pd.NA],
        ['The SMBv1 server in Microsoft Windows Vista SP2; Windows Server 2008 SP2 and R2 SP1; Windows 7 SP1; Windows 8.1; Windows Server 2012 Gold and R2; Windows RT 8.1; and Windows 10 Gold, 1511, and 1607; and Windows Server 2016 allows remote attackers to execute arbitrary code via crafted packets, aka "Windows SMB Remote Code Execution Vulnerability." This vulnerability is different from those described in CVE-2017-0143, CVE-2017-0145, CVE-2017-0146, and CVE-2017-0148.'],
        ['The SMBv1 server in Microsoft Windows Vista SP2; Windows Server 2008 SP2 and R2 SP1; Windows 7 SP1; Windows 8.1; Windows Server 2012 Gold and R2; Windows RT 8.1; and Windows 10 Gold, 1511, and 1607; and Windows Server 2016 allows remote attackers to execute arbitrary code via crafted packets, aka "Windows SMB Remote Code Execution Vulnerability." This vulnerability is different from those described in CVE-2017-0143, CVE-2017-0145, CVE-2017-0146, and CVE-2017-0148.']
    ],
}

nsa = pd.DataFrame(nsa_dict)
nsa.head(3)

Unnamed: 0,attack_name,year_start,year_end,attribution_group,attribution_state,cve_id,description
0,Mirai Botnet,2016,2016,,[<NA>],[<NA>],[<NA>]
1,VPNFilter,2018,2018,APT28 (Fancy Bear),"[Russia, Russia, Russia, Russia, Russia]","[CVE-2018-14847, CVE-2017-12074, CVE-2018-1056...",[MikroTik RouterOS through 6.42 allows unauthe...
2,Triton/Trisis,2017,2017,,"[Russia, Russia]","[CVE-2017-7905, CVE-2017-7921]",[A Weak Cryptography for Passwords issue was d...


##### Save First Part of Nation-State Attack Data As Dataset

## Check for List Misalignment

In [3]:
list_cols = ['cve_id', 'description', 'attribution_state']

def check_alignment(row):
    lengths = [len(row[col]) for col in list_cols]
    return len(set(lengths)) == 1

aligned = nsa.apply(check_alignment, axis=1)
misaligned = nsa[~aligned]
if not misaligned.empty:
    print('Rows with misaligned list columns:')
    print(misaligned)
else:
    print('No misaligned lists are present.')


No misaligned lists are present.


## Exploding the Data
With the attack data aggregated and processed into a (dirty) dataset, we have to look at the relationships between the lists in list-containing columns. Since there is a one-to-one relationship between each `cve_id` and its `description`, these two columns must be exploded together. The lists in the `attribution_state` column need more care, since we don't want to create false relationships suggesting that, within the context of a single attack, Nation $A$ used CVE $A$ while Nation $B$ used CVE $B$, when in fact we don't know. Ultimately, we have to represent the situation as both nations having used both CVEs. I created the dictionary object knowing that Pandas needs each observation's lists aligned, so all three columns can be exploded in one call and we can avoid what we had to do for the CVE and CWE lists in terms of normalizing their content lengths.

In [5]:
# Explode the nation-state attack data
nsa = nsa.explode(list_cols, ignore_index=True)
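
To see why the lists are exploded together rather than one column at a time, here is a minimal sketch on toy data. The `toy` frame and its placeholder values are made up for illustration, and the sketch assumes pandas 1.3 or later, where `DataFrame.explode` accepts a list of columns.

```python
import pandas as pd

# Toy data (hypothetical values): one attack with two CVEs and two descriptions
toy = pd.DataFrame({
    "attack": ["A"],
    "cve_id": [["CVE-1", "CVE-2"]],
    "description": [["first flaw", "second flaw"]],
})

# Exploding both columns in one call keeps each CVE paired with its description
paired = toy.explode(["cve_id", "description"], ignore_index=True)

# Exploding one column after the other pairs every CVE with every description
crossed = (
    toy.explode("cve_id", ignore_index=True)
       .explode("description", ignore_index=True)
)

print(len(paired), len(crossed))
```

The sequential version yields four rows instead of two, which is exactly the kind of false pairing the aligned lists are meant to prevent.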

In [6]:
# Convert objects to text data
obj_cols = nsa.select_dtypes(include=['object']).columns
nsa[obj_cols] = nsa[obj_cols].astype('string')

# Remove extra whitespace
nsa[obj_cols] = nsa[obj_cols].apply(lambda x: x.str.strip())

In [7]:
# Identify duplicates based on 'cve_id'
duplicates = nsa[nsa.duplicated('cve_id', keep=False)]

# Show potential duplicates for review
print("Duplicate CVE IDs:")
duplicates

Duplicate CVE IDs:


Unnamed: 0,attack_name,year_start,year_end,attribution_group,attribution_state,cve_id,description
0,Mirai Botnet,2016,2016,,,,
8,Iranian Cyberattacks on Water Systems,2020,2020,,Iran,,
15,Dragonfly/Energetic Bear Campaign 1,2013,2014,Dragonfly (Energetic Bear),Russia,,
16,Dragonfly/Energetic Bear Campaign 2,2017,2017,Dragonfly (Energetic Bear),Russia,,
17,Stuxnet,2018,2018,,US,,
18,Stuxnet,2018,2018,,Israel,,
20,BlackEnergy Attack on Ukraine,2015,2015,Sandworm,Russia,,
24,F5 BIG-IP Exploits,2021,2021,,Russia,CVE-2020-5902,"In BIG-IP versions 15.0.0-15.1.0.3, 14.1.0-14...."
25,F5 BIG-IP Exploits,2021,2021,,China,CVE-2020-5902,"In BIG-IP versions 15.0.0-15.1.0.3, 14.1.0-14...."
28,SolarWinds Orion Supply Chain Attack,2020,2020,APT29 (Cozy Bear),Russia,,


## Aggregating Duplicates
Here, we tackle a challenge commonly encountered when working with complex datasets: handling duplicate values. Initially, our dataset contained multiple values for certain fields within the same row. To manage this, we "exploded" or "unnested" these lists into multiple rows, ensuring that each value appeared in its own row. However, this step introduced a new issue—merging this dataset with others would likely produce **Cartesian products** (unwanted duplication of data) and misalignment across rows and columns.

While simply deleting the duplicate rows might seem like a solution, it would result in the loss of valuable information. For instance:
- A specific CVE (like CVE-2020-5902) might have been exploited by two countries in the same attack.
- Another CVE (like CVE-2017-0144) might have been used by different countries in separate attacks.
- Though other attacks lack a CVE altogether, they still represent important attacks that we may fill in with a CVE once we find one.

This type of detail is crucial for our analysis and should be preserved.

To address this, we use a process called aggregation. Instead of removing duplicates, we group the rows by their cve_id and aggregate the values of the other columns into lists. This approach allows us to retain all relevant information while ensuring that our dataset remains clean and organized, ready for further merging or analysis. By combining multiple rows into one, we avoid the risk of data misalignment during future merges and ensure that the relationships between cve_ids and their associated data (such as countries involved or attack details) remain intact.

This method helps us maintain a balance: we retain the richness of the data without sacrificing the structural integrity of our dataset.

In [9]:
# Columns we expect to have over one value associated with a given CVE
listable_cols = ['attack_name', 'attribution_state']
other_cols = [col for col in nsa.columns.to_list() if col not in listable_cols]

# For aggregating the data around our merge key
nsa = nsa.groupby('cve_id').agg({
    **{col: lambda x: list(x.dropna().unique()) for col in listable_cols},
    **{col: 'first' for col in other_cols}
}).reset_index(drop=True)
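
One thing worth checking with this pattern: by default, `groupby` drops rows whose key is missing, so attacks without a CVE do not survive the aggregation. The sketch below uses made-up toy data to show the default next to `dropna=False` (available in pandas 1.1 and later). Whether CVE-less attacks should be kept at this stage is a judgment call, and the cell above does not do it.

```python
import pandas as pd

# Toy data (hypothetical values): two attacks share one CVE, one attack has none
toy = pd.DataFrame({
    "attack_name": ["Attack A", "Attack A", "Attack B", "Attack C"],
    "attribution_state": ["X", "Y", "Z", "X"],
    "cve_id": ["CVE-0000-0001", "CVE-0000-0001", "CVE-0000-0001", None],
})

def unique_list(values):
    """Collect the distinct non-missing values of a group into a list."""
    return list(values.dropna().unique())

agg_spec = {"attack_name": unique_list, "attribution_state": unique_list}

# Default behavior: the row with a missing cve_id is silently dropped
default = toy.groupby("cve_id").agg(agg_spec).reset_index()

# dropna=False keeps a group for the missing key
kept = toy.groupby("cve_id", dropna=False).agg(agg_spec).reset_index()

print(len(default), len(kept))
```

Note that `dropna=False` collapses every CVE-less attack into a single group, so setting those rows aside and concatenating them back after the aggregation may be the safer option if they need to stay distinct.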

## Import Second NSA Data
This dataset requires several cleaning steps. The columns have to be renamed so that merges with other data happen seamlessly, some columns are no longer needed, some data should be converted to `string` for easier processing, and some values need to be fixed for consistency's sake.

In [10]:
new_nsa = pd.read_parquet(path='../data/new_nsa_data_v1.parquet')
new_nsa.head(3)

Unnamed: 0,Attack_Name,IoT,CVE_ID,CVE_List_Date,Date_Of_First_Exploit,Patch_Release_Date,Patch/Device_Manufacturer,Affected_Devices,CVSS,CVSS_Status,Days_To_Patch_Release,Days_To_First_Exploit
0,VPN Filter malware,True,CVE-2017-6742,2017-10-17,2018-05-24,2018-06-27,CISCO,CISCO routers,9.8,critical,253,219
1,VPN Filter malware,True,CVE-2017-6750,2017-10-17,2018-05-24,2018-06-27,CISCO,"CISCO routers, network devices",9.8,critical,253,219
2,VPN Filter malware,True,CVE-2017-6751,2017-10-17,2018-05-24,2018-06-27,CISCO,CISCO routers,9.8,critical,253,219


## Lowercasing the Columns

In [11]:
new_nsa.columns = new_nsa.columns.str.lower()

## Removing Unnecessary Columns

In [12]:
unnecessary_cols = ['iot', 'patch/device_manufacturer', 'affected_devices']
new_nsa = new_nsa.drop(columns=unnecessary_cols)

## Converting Object to Text Data

In [13]:
obj_cols = new_nsa.select_dtypes(include=['object']).columns
new_nsa[obj_cols] = new_nsa[obj_cols].astype('string')

## Fixing Attack Names
Some of the attack names need to conform to the way they appear in the first NSA dataframe. We can use the years embedded in the old values to populate some new attributes that fit the larger NSA dataset.

In [14]:
# Rename attacks to accord with larger dataset
new_names = {
    'VPN Filter malware': 'VPNFilter',
    'Dragonfly (2022)': 'Dragonfly/Energetic Bear Campaign 3',
    'BlackEnergy (2015)': 'BlackEnergy Attack on Ukraine',
    'stuxnet': 'Stuxnet',
    'Mirai Botnet Variants': 'Mirai Botnet',
    'Triton/Trisis (2017 and ongoing)': 'Triton/Trisis'
}

new_nsa['attack_name'] = new_nsa['attack_name'].replace(new_names)

# Add data to new nation state values
updates = {
    'BlackEnergy Attack on Ukraine': [2015, 2015, 'Sandworm', 'Russia'],
    'Dragonfly/Energetic Bear Campaign 3': [2022, 2022, 'Dragonfly (Energetic Bear)', 'Russia'],
    'Stuxnet': [2018, 2018, pd.NA, 'US'],
    'Triton/Trisis': [2017, 2017, pd.NA, 'Russia'],
    'VPNFilter': [2018, 2018, 'APT28 (Fancy Bear)', 'Russia'],
    'Volt Typhoon': [pd.NA, pd.NA, pd.NA, 'China']
}

for attack, values in updates.items():
    new_nsa.loc[new_nsa['attack_name'] == attack, ['year_start', 'year_end', 'attribution_group', 'attribution_state']] = values

# Convert the newly added object columns back to string dtype
obj_cols = new_nsa.select_dtypes(include=['object']).columns
new_nsa[obj_cols] = new_nsa[obj_cols].astype('string')

In [16]:
nsa[nsa['cve_id'] == 'CVE-2017-0144']

Unnamed: 0,attack_name,attribution_state,year_start,year_end,attribution_group,cve_id,description
1,"[Not Petya Ransomware Attack, WannaCry Ransomw...","[Russia, DPRK]",2017,2017,Sandworm,CVE-2017-0144,The SMBv1 server in Microsoft Windows Vista SP...


In [15]:
new_nsa[new_nsa['cve_id'] == 'CVE-2017-0144']

Unnamed: 0,attack_name,cve_id,cve_list_date,date_of_first_exploit,patch_release_date,cvss,cvss_status,days_to_patch_release,days_to_first_exploit,year_start,year_end,attribution_group,attribution_state
18,Dragonfly/Energetic Bear Campaign 3,CVE-2017-0144,2017-03-14,2017-05-12,2017-03-14,8.1,high,0,59,2022.0,2022.0,Dragonfly (Energetic Bear),Russia


## Aggregate Duplicates
We can check for duplicates in our merge key `cve_id`. Depending on what we find, either nothing else is needed to prepare the data for a merge, or we'll do what we did for the first part of the NSA data and determine how to aggregate our information around the merge key. The duplicates show that certain CVEs were exploited in multiple attacks by different groups belonging to different nation-states, so the attack name, group, and state are the columns whose values we will aggregate.

In [17]:
# Identify duplicates based on 'cve_id'
duplicates = new_nsa[new_nsa.duplicated('cve_id', keep=False)]

# Show potential duplicates for review
print('Duplicate CVE IDs:')
duplicates

Duplicate CVE IDs:


Unnamed: 0,attack_name,cve_id,cve_list_date,date_of_first_exploit,patch_release_date,cvss,cvss_status,days_to_patch_release,days_to_first_exploit,year_start,year_end,attribution_group,attribution_state
25,BlackEnergy Attack on Ukraine,CVE-2014-0630,2014-01-31,2015-12-23,2014-07-01,5.0,medium,151,691,2015.0,2015.0,Sandworm,Russia
26,BlackEnergy Attack on Ukraine,CVE-2014-4166,2014-10-07,2015-12-23,2015-01-12,5.0,medium,97,442,2015.0,2015.0,Sandworm,Russia
27,BlackEnergy Attack on Ukraine,CVE-2014-6485,2014-10-14,2015-12-23,2014-12-10,7.5,high,57,435,2015.0,2015.0,Sandworm,Russia
28,BlackEnergy Attack on Ukraine,CVE-2015-0057,2015-01-27,2015-12-23,2015-01-27,6.8,medium,0,330,2015.0,2015.0,Sandworm,Russia
29,BlackEnergy Attack on Ukraine,CVE-2015-1673,2015-04-15,2015-12-23,2015-04-15,7.5,high,0,252,2015.0,2015.0,Sandworm,Russia
30,Stuxnet,CVE-2014-0630,2014-01-31,2015-12-23,2014-07-01,5.0,medium,151,691,2018.0,2018.0,,US
31,Stuxnet,CVE-2014-4166,2014-10-07,2015-12-23,2015-01-12,5.0,medium,97,442,2018.0,2018.0,,US
32,Stuxnet,CVE-2014-6485,2014-10-14,2015-12-23,2014-12-10,7.5,high,57,435,2018.0,2018.0,,US
33,Stuxnet,CVE-2015-0057,2015-01-27,2015-12-23,2015-01-27,6.8,medium,0,330,2018.0,2018.0,,US
34,Stuxnet,CVE-2015-1673,2015-04-15,2015-12-23,2015-04-15,7.5,high,0,252,2018.0,2018.0,,US


In [18]:
listable_cols = ['attack_name', 'attribution_group', 'attribution_state']
other_cols = [col for col in new_nsa.columns if col not in listable_cols]

new_nsa = new_nsa.groupby('cve_id').agg({
    **{col: lambda x: list(x.dropna().unique()) for col in listable_cols},
    **{col: 'first' for col in other_cols}
}).reset_index(drop=True)

## Merging Both Dataframes
An outer merge between these two dataframes makes the most sense, since all of the data from both sets will be kept. The resulting dataset will be used to populate our master copy, however sparsely, with information about nation-state attacks. We don't want to narrow this down.

Once we've merged the two, I'll perform another round of checks to make sure everything is clean, update certain values to ensure consistency, and extract certain 

In [19]:
nsa = nsa.merge(
    new_nsa,
    on='cve_id',
    how='outer'
)
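
To audit an outer merge like this one, pandas can label where each row came from. The sketch below uses made-up toy frames. `indicator=True` is a standard `merge` option, but using it here is a suggestion and not part of the original workflow.

```python
import pandas as pd

# Toy frames (hypothetical values) sharing the merge key and one other column
left = pd.DataFrame({
    "cve_id": ["CVE-A", "CVE-B"],
    "attack_name": [["One"], ["Two"]],
})
right = pd.DataFrame({
    "cve_id": ["CVE-B", "CVE-C"],
    "attack_name": [["Two", "Three"], ["Four"]],
})

# indicator=True adds a `_merge` column: left_only, right_only, or both
merged = left.merge(right, on="cve_id", how="outer", indicator=True)

# Count how many rows came from each side
origin_counts = merged["_merge"].value_counts()
print(origin_counts)
```

The shared non-key column comes back as `attack_name_x` and `attack_name_y`, which is the same suffixing visible in the `info()` output for the real merge.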

In [20]:
nsa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   attack_name_x          21 non-null     object        
 1   attribution_state_x    21 non-null     object        
 2   year_start_x           21 non-null     float64       
 3   year_end_x             21 non-null     float64       
 4   attribution_group_x    9 non-null      string        
 5   cve_id                 48 non-null     string        
 6   description            21 non-null     string        
 7   attack_name_y          30 non-null     object        
 8   attribution_group_y    30 non-null     object        
 9   attribution_state_y    30 non-null     object        
 10  cve_list_date          30 non-null     datetime64[ns]
 11  date_of_first_exploit  30 non-null     datetime64[ns]
 12  patch_release_date     30 non-null     datetime64[ns]
 13  cvss   

## Combine Data
Because we merged on `cve_id` alone (which we needed to do to keep the rest of the data), the other columns that exist in both dataframes were duplicated in the merged dataset with `_x` and `_y` suffixes. The easiest fix is to combine each pair back into one column and drop the suffixed originals.

In [25]:
# Custom logic to handle combining lists of data
def combine_lists(val1, val2):
    if isinstance(val1, list) and isinstance(val2, list): # Both columns are lists
        return list(dict.fromkeys(val1 + val2)) # Combine and remove duplicates, keeping first-seen order
    elif isinstance(val1, list): # Append val2 if it's not in val1's list
        if pd.isna(val2):
            return val1
        return val1 if val2 in val1 else val1 + [val2]
    elif isinstance(val2, list): # Append val1 if it's not in val2's list
        if pd.isna(val1):
            return val2
        return val2 if val1 in val2 else val2 + [val1]
    else: # If neither is a list
        return val1 if pd.notna(val1) else val2

# Helper function to combine columns and drop their duplicates
def combine_and_drop(df, cols: dict):
    """
    Combine pairs of columns in `df` and drop the originals. `cols` maps each
    resulting column name to a list of the two source columns to combine.
    """
    # Loop through dictionary to combine columns
    for result, source in cols.items():
        if "date_known" in result:
            # Take earliest date between the two
            df[result] = df[[source[0], source[1]]].min(axis=1)
        else:
            df[result] = df.apply(lambda row: combine_lists(row[source[0]], row[source[1]]), axis=1)

    # Drop duplicate columns
    df = df.drop(
        columns=[
            col for result, source in cols.items() for col in source if col != result
        ]
    )
    return df

cols_to_combine = {
    'attack_name': ['attack_name_x', 'attack_name_y'],
    'year_start': ['year_start_x', 'year_start_y'],
    'year_end': ['year_end_x', 'year_end_y'],
    'attribution_group': ['attribution_group_x', 'attribution_group_y'],
    'attribution_state': ['attribution_state_x', 'attribution_state_y'],
}

nsa = combine_and_drop(nsa, cols_to_combine)
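
The combining rules can be easier to reason about on a few isolated values. The sketch below restates them compactly as a hypothetical `merge_values` helper. It is not the function used above: unlike `combine_lists`, it always returns a list (empty when both sides are missing) and preserves first-seen order.

```python
import pandas as pd

def merge_values(left, right):
    """Combine two cell values that may each be a list, a scalar, or missing.

    Hypothetical helper for illustration: always returns a list of the
    distinct non-missing items, in the order they were first seen.
    """
    merged = []
    for value in (left, right):
        # Treat a scalar as a one-item list so both cases share one code path
        items = value if isinstance(value, list) else [value]
        for item in items:
            if pd.isna(item):
                continue  # skip missing scalars such as NaN or pd.NA
            if item not in merged:
                merged.append(item)
    return merged

print(merge_values(["Russia"], "US"))
print(merge_values(float("nan"), ["China"]))
print(merge_values(["Russia", "China"], ["China", "DPRK"]))
```

Returning a list in every case would also make a later "wrap scalars in a list" step unnecessary for these columns, at the cost of turning scalar columns such as the years into lists, so it only suits the list-like columns.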

In [27]:
nsa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   cve_id                 48 non-null     string        
 1   description            21 non-null     string        
 2   cve_list_date          30 non-null     datetime64[ns]
 3   date_of_first_exploit  30 non-null     datetime64[ns]
 4   patch_release_date     30 non-null     datetime64[ns]
 5   cvss                   30 non-null     float64       
 6   cvss_status            30 non-null     category      
 7   days_to_patch_release  30 non-null     float64       
 8   days_to_first_exploit  30 non-null     float64       
 9   attack_name            48 non-null     object        
 10  year_start             37 non-null     float64       
 11  year_end               37 non-null     float64       
 12  attribution_group      37 non-null     object        
 13  attribu

## Populate the Description Column
Because descriptions of the CVEs were only present for the CVE IDs from one of the datasets, we'll pull in the cleaned-up CVE data `cve_list_v3` and left-merge its descriptions into the NSA data. We'll then combine the two description columns and drop the leftovers.

In [30]:
# Import CVE data
cves = pd.read_parquet(path='../data/CVE_Project/cvelistV5/cve_list_v3.parquet')

# Merge with just the descriptions
nsa = nsa.merge(
    cves[['cve_id', 'description']],
    on=['cve_id'],
    how='left'
)

# Combine the descriptions
nsa['description'] = nsa['description_x'].combine_first(nsa['description_y'])
nsa = nsa.drop(columns=['description_x', 'description_y'])
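
As a small illustration of the fill-in step, the sketch below runs the same left merge and `combine_first` on made-up toy frames. The `validate="many_to_one"` argument is an optional guard added here as an assumption. It makes pandas raise if the lookup table has duplicate `cve_id` values, which would otherwise multiply rows.

```python
import pandas as pd

# Toy frames (hypothetical values): some descriptions known, some missing
nsa_toy = pd.DataFrame({
    "cve_id": ["CVE-A", "CVE-B", "CVE-C"],
    "description": ["from first source", None, None],
})
lookup = pd.DataFrame({
    "cve_id": ["CVE-A", "CVE-B"],
    "description": ["from lookup A", "from lookup B"],
})

# Left merge keeps every row of nsa_toy; validate guards against duplicate keys
merged = nsa_toy.merge(lookup, on="cve_id", how="left", validate="many_to_one")

# Prefer the existing description and fall back to the lookup value
merged["description"] = merged["description_x"].combine_first(merged["description_y"])
merged = merged.drop(columns=["description_x", "description_y"])
print(merged)
```

`CVE-A` keeps its original text, `CVE-B` is filled from the lookup, and `CVE-C` stays missing because neither source has it.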

In [31]:
nsa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   cve_id                 48 non-null     string        
 1   cve_list_date          30 non-null     datetime64[ns]
 2   date_of_first_exploit  30 non-null     datetime64[ns]
 3   patch_release_date     30 non-null     datetime64[ns]
 4   cvss                   30 non-null     float64       
 5   cvss_status            30 non-null     category      
 6   days_to_patch_release  30 non-null     float64       
 7   days_to_first_exploit  30 non-null     float64       
 8   attack_name            48 non-null     object        
 9   year_start             37 non-null     float64       
 10  year_end               37 non-null     float64       
 11  attribution_group      37 non-null     object        
 12  attribution_state      48 non-null     object        
 13  descrip

<span style='color:#ffcc00;text-shadow:0 0 3px #ffcc00;'>This is the only description not found for its respective CVE ID, and it's because no such ID exists in MITRE's data.</span>

In [21]:
nsa[(nsa['cve_id'].notnull()) & nsa['description'].isnull()].head()

Unnamed: 0,attack_name,cve_list_date,date_of_first_exploit,patch_release_date,cvss,cvss_status,days_to_patch_release,days_to_first_exploit,year_start,year_end,attribution_group,attribution_state,cve_id,description
80,Volt Typhoon,2022-05-10,2023-04-25,2022-06-06,6.5,medium,27.0,350.0,,,,China,CVE-2022-26658,


## Series of Rapid Checks

In [32]:
# Update data types
floats = nsa.select_dtypes(include=['float']).columns
wanted_floats = [col for col in floats if col != 'cvss']
nsa[wanted_floats] = nsa[wanted_floats].astype('Int64')

# Convert datetime to UTC time
dates = nsa.select_dtypes(include=['datetime64[ns]']).columns
nsa[dates] = nsa[dates].apply(pd.to_datetime, utc=True)

# Remove whitespace
str_cols = nsa.select_dtypes(include=['string']).columns
nsa[str_cols] = nsa[str_cols].apply(lambda x: x.str.strip())
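
For context on the two conversions above, here is a minimal sketch with made-up values. It shows why the nullable `Int64` dtype is used for year and day counts that contain gaps, and what `utc=True` does to naive timestamps.

```python
import numpy as np
import pandas as pd

# Years arrive as floats after the merge because NaN forces a float column
years = pd.Series([2015.0, np.nan, 2022.0])

# Nullable Int64 keeps whole numbers and represents the gap as <NA>
nullable_years = years.astype("Int64")

# Naive timestamps (no timezone) with one missing value
naive = pd.Series(pd.to_datetime(["2017-03-14", None]))

# utc=True interprets naive timestamps as UTC and returns tz-aware values
aware = pd.to_datetime(naive, utc=True)

print(nullable_years.dtype, aware.dt.tz)
```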

In [33]:
nsa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   cve_id                 48 non-null     string             
 1   cve_list_date          30 non-null     datetime64[ns, UTC]
 2   date_of_first_exploit  30 non-null     datetime64[ns, UTC]
 3   patch_release_date     30 non-null     datetime64[ns, UTC]
 4   cvss                   30 non-null     float64            
 5   cvss_status            30 non-null     category           
 6   days_to_patch_release  30 non-null     Int64              
 7   days_to_first_exploit  30 non-null     Int64              
 8   attack_name            48 non-null     object             
 9   year_start             37 non-null     Int64              
 10  year_end               37 non-null     Int64              
 11  attribution_group      37 non-null     object             
 

Because some values were aggregated into lists and then combined through merging with single values that are not lists, these columns now mix lists and scalars, and Pandas cannot save them to Parquet in a way that makes loading and reusing the data easy. To solve this, I'm going to wrap every non-list value in these columns in a list.

In [35]:
# Listify semi-listed columns
def put_in_list(val):
    return val if isinstance(val, list) else [val]

cols_to_listify = ['attack_name', 'attribution_group', 'attribution_state']

for col in cols_to_listify:
    nsa[col] = nsa[col].apply(put_in_list)

nsa.head(3)

Unnamed: 0,cve_id,cve_list_date,date_of_first_exploit,patch_release_date,cvss,cvss_status,days_to_patch_release,days_to_first_exploit,attack_name,year_start,year_end,attribution_group,attribution_state,description
0,CVE-2014-0160,NaT,NaT,NaT,,,,,[Heartbleed Exploits],2014,2014,[nan],[China],The (1) TLS and (2) DTLS implementations in Op...
1,CVE-2014-0630,2014-01-31 00:00:00+00:00,2015-12-23 00:00:00+00:00,2014-07-01 00:00:00+00:00,5.0,medium,151.0,691.0,"[BlackEnergy Attack on Ukraine, Stuxnet]",2015,2015,[Sandworm],"[Russia, US]",EMC Documentum TaskSpace (TSP) 6.7SP1 before P...
2,CVE-2014-4166,2014-10-07 00:00:00+00:00,2015-12-23 00:00:00+00:00,2015-01-12 00:00:00+00:00,5.0,medium,97.0,442.0,"[BlackEnergy Attack on Ukraine, Stuxnet]",2015,2015,[Sandworm],"[Russia, US]",Cross-site scripting (XSS) vulnerability in th...
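
The output above shows a missing group wrapped as `[nan]`. If an empty list is preferred for missing values, a variant along these lines could be used. This is a sketch with a hypothetical `to_list` helper and toy values, not the function applied in this notebook.

```python
import pandas as pd

def to_list(val):
    """Wrap scalars in a list and map missing scalars to an empty list."""
    if isinstance(val, list):
        return val
    if pd.isna(val):
        return []  # missing value becomes an empty list instead of [nan]
    return [val]

# Toy column (hypothetical values) mixing a list, a scalar, and a missing value
toy = pd.Series([["Sandworm"], "APT5", float("nan")], dtype="object")
listified = toy.apply(to_list)
print(listified.tolist())
```

Whether `[]` or `[nan]` is the better representation depends on how downstream code tests for missing attributions, so the choice should be made once and applied consistently.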


## Saving the Data

In [36]:
nsa.to_parquet(path='../data/NSA/mock_nsa_data_v3.parquet')