# Working with Known IoT-Related CVEs: A New Direction
This notebook cleans up the MITRE list of known IoT-related CVEs, creates a dataframe from nation-state attack data, and merges both with a cleaned-up version of the CVE data aggregated in the `APT_IoT_CVE_EDA` notebook. The resulting dataset is then saved as both CSV and Parquet files for easy reading and preprocessing.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Merging with IoT CVE Data

In [12]:
df = df2019_2024.merge(
    df_nsa,
    on=['cve_id', 'description'],
    how='outer'
)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1120 entries, 0 to 1119
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   cve_id             1112 non-null   object 
 1   description        1112 non-null   object 
 2   attack             31 non-null     object 
 3   year_start         31 non-null     float64
 4   year_end           31 non-null     float64
 5   attribution_group  14 non-null     object 
 6   attribution_state  26 non-null     object 
dtypes: float64(2), object(5)
memory usage: 61.4+ KB
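To see how many rows an outer merge takes from each side, pandas can add an indicator column. The sketch below uses small made-up frames in place of `df2019_2024` and `df_nsa`; the CVE IDs and attack names are placeholders, not real data.

```python
import pandas as pd

# Toy stand-ins for df2019_2024 and df_nsa (placeholder values, not real CVEs)
iot = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0002"],
    "description": ["first", "second"],
})
nsa = pd.DataFrame({
    "cve_id": ["CVE-0000-0002", "CVE-0000-0003"],
    "description": ["second", "third"],
    "attack": ["Example Attack", "Other Attack"],
})

# indicator=True adds a `_merge` column: 'left_only', 'right_only', or 'both'
merged = iot.merge(nsa, on=["cve_id", "description"], how="outer", indicator=True)
counts = merged["_merge"].value_counts()
print(counts)
```

Rows tagged `both` matched on every key column; the others exist in only one source.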


## Adding CVEs from the Check Point Article
[Revisit the article here.](https://blog.checkpoint.com/security/the-tipping-point-exploring-the-surge-in-iot-cyberattacks-plaguing-the-education-sector/)

In [13]:
# New observations from article
cp = {
    'cve_id': [
        'CVE-2015-2051',
        'CVE-2016-6277',
        'CVE-2022-37061'
    ],
    'description': [
        'The D-Link DIR-645 Wired/Wireless Router Rev. Ax with firmware 1.04b12 and earlier allows remote attackers to execute arbitrary commands via a GetDeviceSettings action to the HNAP interface.',
        'NETGEAR R6250 before 1.0.4.6.Beta, R6400 before 1.0.1.18.Beta, R6700 before 1.0.1.14.Beta, R6900, R7000 before 1.0.7.6.Beta, R7100LG before 1.0.0.28.Beta, R7300DST before 1.0.0.46.Beta, R7900 before 1.0.1.8.Beta, R8000 before 1.0.3.26.Beta, D6220, D6400, D7000, and possibly other routers allow remote attackers to execute arbitrary commands via shell metacharacters in the path info to cgi-bin/.',
        'All FLIR AX8 thermal sensor cameras version up to and including 1.46.16 are vulnerable to Remote Command Injection. This can be exploited to inject and execute arbitrary shell commands as the root user through the id HTTP POST parameter in the res.php endpoint. A successful exploit could allow the attacker to execute arbitrary commands on the underlying operating system with the root privileges.'
    ],
    'attack': [pd.NA, pd.NA, pd.NA],
    'year_start': [pd.NA, pd.NA, pd.NA],
    'year_end': [pd.NA, pd.NA, pd.NA],
    'attribution_group': [pd.NA, pd.NA, pd.NA],
    'attribution_state': [pd.NA, pd.NA, pd.NA]
}

# Convert new data to dataframe
df_cp = pd.DataFrame(cp)

# Concatenate df and df_cp
df = pd.concat([df, df_cp], ignore_index=True)


In [14]:
# Convert data types to text
obj_cols = df.select_dtypes('object').columns
df[obj_cols] = df[obj_cols].astype('string')

In [15]:
# Extract year from CVE ID
df['year_cve'] = df['cve_id'].str.split('-').str[1]

# Convert year columns back to whole numbers
year_cols = ['year_start', 'year_end', 'year_cve']
df[year_cols] = df[year_cols].astype('Int64')

# Move the year column
df.insert(1, 'year_cve', df.pop('year_cve'))
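Splitting on `-` assumes every ID follows the `CVE-YYYY-NNNN` pattern. As a hedged alternative, a regular expression can pull the year out only when the ID is well formed, leaving anything else as missing. The sample IDs below are for illustration only.

```python
import pandas as pd

ids = pd.Series(
    ["CVE-2015-2051", "CVE-2022-37061", pd.NA, "not-a-cve"],
    dtype="string",
)

# Capture the four-digit year only when the whole ID matches CVE-YYYY-NNNN;
# non-matching or missing IDs become <NA>
year = ids.str.extract(r"^CVE-(\d{4})-\d+$", expand=False).astype("Int64")
print(year)
```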

### Importing and Cleaning the CVE Data
All these steps were determined to be necessary in the `APT_IoT_CVE_EDA` notebook.

In [16]:
# Import
cves = pd.read_parquet('../data/CVE_V5/CVE_List.parquet')

# Drop rejected CVEs
cves = cves.drop(cves[cves['cve_state'] == 'REJECTED'].index)

# Convert publication date to datetime format
cves['date_published'] = pd.to_datetime(cves['date_published'], format='ISO8601', utc=True)

# Convert objects to text data (string)
obj_cols = cves.select_dtypes(include=['object']).columns
cves[obj_cols] = cves[obj_cols].astype('string')

# Standardize severity labels (lowercase, with 'moderate' folded into 'medium')
cves['severity'] = cves['severity'].str.lower().replace('moderate', 'medium')

# Remove leading or trailing whitespace
str_cols = cves.select_dtypes(include=['string']).columns
cves[str_cols] = cves[str_cols].apply(lambda x: x.str.strip())
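As a quick check of the severity cleanup, here is a minimal sketch on a made-up Series. The input labels are assumptions about the kinds of variants that may appear, not values taken from the CVE list.

```python
import pandas as pd

severity = pd.Series(
    ["HIGH", "medium", "MODERATE", " Critical ", pd.NA],
    dtype="string",
)

# Strip whitespace, lowercase, then fold 'moderate' into 'medium'
normalized = severity.str.strip().str.lower().replace("moderate", "medium")
print(normalized.value_counts(dropna=False))
```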

In [38]:
# Glance
cves.head(3)

| | cve_id | cwe_id | cve_state | date_published | description | severity | severity_score | attack_vector | attack_complexity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CVE-1999-0001 | | PUBLISHED | 2000-02-04 05:00:00+00:00 | ip_input.c in BSD-derived TCP/IP implementatio... | | | | |
| 1 | CVE-1999-0002 | | PUBLISHED | 1999-09-29 04:00:00+00:00 | Buffer overflow in NFS mountd gives root acces... | | | | |
| 2 | CVE-1999-0003 | | PUBLISHED | 1999-09-29 04:00:00+00:00 | Execute commands as root via buffer overflow i... | | | | |


### Merging into Main Dataframe
Since we don't want hundreds of thousands of CVEs with no known IoT connection in our dataset, I'm going to perform a left merge of the CVE data into our nation-state and IoT CVE attack data. This keeps information from the CVE data only when a CVE's `cve_id` and `description` both match a row in our main dataset.

In [17]:
df = df.merge(
    cves,
    on=['cve_id', 'description'],
    how='left'
)
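Because the merge keys include `description`, a row whose description differs between the two sources will not pick up any CVE details, even if its `cve_id` matches. This may be one reason some rows end up without a `cve_state`. The sketch below contrasts the two key choices on made-up frames; all values are placeholders.

```python
import pandas as pd

main = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0002"],
    "description": ["same text", "edited text"],
})
catalog = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"],
    "description": ["same text", "original text", "unrelated"],
    "severity": ["high", "low", "medium"],
})

# Strict: both keys must match, so the second row gets no severity
strict = main.merge(catalog, on=["cve_id", "description"], how="left")

# Loose: match on cve_id only; the two descriptions are kept side by side
loose = main.merge(catalog, on="cve_id", how="left", suffixes=("_main", "_cat"))

print(strict)
print(loose)
```

In both cases the left merge keeps exactly the rows of `main`; the unrelated third CVE in `catalog` never enters the result.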

In [18]:
# Rename CVE data's attack-related attributes for clarity
df = df.rename(columns={
    'date_published': 'cve_publish_date',
    'severity': 'cve_severity',
    'severity_score': 'cve_severity_score',
    'attack_vector': 'cve_attack_vector',
    'attack_complexity': 'cve_attack_complexity'
})

In [42]:
df.head(3)

| | cve_id | year_cve | description | attack | year_start | year_end | attribution_group | attribution_state | cwe_id | cve_state | cve_publish_date | cve_severity | cve_severity_score | cve_attack_vector | cve_attack_complexity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CVE-2014-0160 | 2014 | The (1) TLS and (2) DTLS implementations in Op... | Heartbleed Exploits | 2014 | 2014 | | China | | PUBLISHED | 2014-04-07 00:00:00+00:00 | | | | |
| 1 | CVE-2017-0144 | 2017 | The SMBv1 server in Microsoft Windows Vista SP... | Not Petya Ransomware Attack | 2017 | 2017 | Sandworm | Russia | | PUBLISHED | 2017-03-17 00:00:00+00:00 | | | | |
| 2 | CVE-2017-0144 | 2017 | The SMBv1 server in Microsoft Windows Vista SP... | WannaCry Ransomware Attack | 2017 | 2017 | Lazarus | DPRK | | PUBLISHED | 2017-03-17 00:00:00+00:00 | | | | |


Currently, the following four observations are the only ones in the dataset that have both an attack name and a severity rating.

In [19]:
df[(df['cve_severity'].notnull()) & (df['attack'].notnull())]

| | cve_id | year_cve | description | attack | year_start | year_end | attribution_group | attribution_state | cwe_id | cve_state | cve_publish_date | cve_severity | cve_severity_score | cve_attack_vector | cve_attack_complexity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | CVE-2018-13379 | 2018 | An Improper Limitation of a Pathname to a Rest... | Iranian APT Exploits on Fortinet Vulnerabilities | 2021 | 2021 | | Iran | | PUBLISHED | 2019-06-04 20:18:08+00:00 | critical | 9.1 | NETWORK | LOW |
| 844 | CVE-2021-31207 | 2021 | Microsoft Exchange Server Security Feature Byp... | Microsoft Exchange ProxyShell Exploits | 2021 | 2021 | | China | | PUBLISHED | 2021-05-11 19:11:41+00:00 | medium | 6.6 | | |
| 856 | CVE-2021-34473 | 2021 | Microsoft Exchange Server Remote Code Executio... | Microsoft Exchange ProxyShell Exploits | 2021 | 2021 | | China | | PUBLISHED | 2021-07-14 17:54:03+00:00 | critical | 9.1 | | |
| 857 | CVE-2021-34523 | 2021 | Microsoft Exchange Server Elevation of Privile... | Microsoft Exchange ProxyShell Exploits | 2021 | 2021 | | China | | PUBLISHED | 2021-07-14 17:54:38+00:00 | critical | 9.0 | | |


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1123 entries, 0 to 1122
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   cve_id                 1115 non-null   string             
 1   year_cve               1115 non-null   Int64              
 2   description            1115 non-null   string             
 3   attack                 31 non-null     string             
 4   year_start             31 non-null     Int64              
 5   year_end               31 non-null     Int64              
 6   attribution_group      14 non-null     string             
 7   attribution_state      26 non-null     string             
 8   cwe_id                 107 non-null    string             
 9   cve_state              1103 non-null   string             
 10  cve_publish_date       1103 non-null   datetime64[ns, UTC]
 11  cve_severity           476 non-null    string           

### Merging New Nation-State Attack Data

In [25]:
df = df.merge(
    new_nsa,
    on=['attack', 'cve_id'],
    how='outer',
    suffixes=('_main', '_new')
)

In [26]:
# Combine the columns that weren't merged (because doing so would have prevented the merge from capturing all the data)
df['cvss'] = df['cvss_main'].combine_first(df['cvss_new'])
df['cvss_status'] = df['cvss_status_main'].combine_first(df['cvss_status_new'])

In [27]:
# Check if the values in 'cvss' match those in 'cvss_main' where 'cvss_main' is not empty
cvss_main_match = df.loc[df['cvss_main'].notna(), 'cvss'] == df.loc[df['cvss_main'].notna(), 'cvss_main']

# Check if the values in 'cvss' match those in 'cvss_new' where 'cvss_new' is not empty
cvss_new_match = df.loc[df['cvss_new'].notna(), 'cvss'] == df.loc[df['cvss_new'].notna(), 'cvss_new']

# Check all values from both columns
cvss_main_match_all = cvss_main_match.all()
cvss_new_match_all = cvss_new_match.all()

# Print results
print(f"All values from 'cvss_main' correctly transferred: {cvss_main_match_all}")
print(f"All values from 'cvss_new' correctly transferred: {cvss_new_match_all}")

# Check if the values in 'cvss_status' match those in 'cvss_status_main' where 'cvss_status_main' is not empty
cvss_status_main_match = df.loc[df['cvss_status_main'].notna(), 'cvss_status'] == df.loc[df['cvss_status_main'].notna(), 'cvss_status_main']

# Check if the values in 'cvss_status' match those in 'cvss_status_new' where 'cvss_status_new' is not empty
cvss_status_new_match = df.loc[df['cvss_status_new'].notna(), 'cvss_status'] == df.loc[df['cvss_status_new'].notna(), 'cvss_status_new']

# Check all values from both columns
cvss_status_main_match_all = cvss_status_main_match.all()
cvss_status_new_match_all = cvss_status_new_match.all()

# Print results
print(f"All values from 'cvss_status_main' correctly transferred: {cvss_status_main_match_all}")
print(f"All values from 'cvss_status_new' correctly transferred: {cvss_status_new_match_all}")

All values from 'cvss_main' correctly transferred: True
All values from 'cvss_new' correctly transferred: True
All values from 'cvss_status_main' correctly transferred: True
All values from 'cvss_status_new' correctly transferred: True
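The combine-then-verify pattern above can be sketched more compactly on toy data. The helper name `transferred` is hypothetical and the scores are invented; this is one possible way to express the check, not the notebook's own code.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "cvss_main": [9.1, None, None],
    "cvss_new": [None, 7.5, None],
})

# Prefer the value from `cvss_main`; fall back to `cvss_new` where it is missing
df_toy["cvss"] = df_toy["cvss_main"].combine_first(df_toy["cvss_new"])


def transferred(frame, combined, source):
    """Return True if every non-null value in `source` equals `combined`."""
    mask = frame[source].notna()
    return bool((frame.loc[mask, combined] == frame.loc[mask, source]).all())


results = {src: transferred(df_toy, "cvss", src) for src in ["cvss_main", "cvss_new"]}
print(results)
```

Note that the check against the fallback column only passes when the two sources never disagree on a row, since `combine_first` always keeps the first column's value when both are present.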


## Filling in Missing Information
The next step is to fill in key pieces of data that are missing from either dataset. To do this, we selectively re-merge the datasets on different columns, capturing data that was left behind by the way the overall merge had to be constructed. To make the process clearer, we'll also drop all unnecessary columns from the dataset.

In [28]:
df['cve_publish_date'] = df['cve_publish_date'].combine_first(df['cve_list_date'])

In [36]:
print(df.columns.to_list())

['cve_id', 'year_cve', 'description', 'attack', 'year_start', 'year_end', 'attribution_group', 'attribution_state', 'cwe_id', 'cve_publish_date', 'cvss_status_main', 'cvss_main', 'cve_attack_vector', 'cve_attack_complexity', 'cve_list_date', 'date_of_first_exploit', 'patch_release_date', 'cvss_new', 'cvss_status_new', 'days_to_patch_release', 'days_to_first_exploit', 'cvss', 'cvss_status']


In [29]:
unnecessary_cols = ['cve_attack_vector', 'cve_attack_complexity', 'cwe_id', 'cvss_main', 'cvss_new', 'cvss_status_main', 'cvss_status_new', 'cve_list_date']

df.drop(labels=unnecessary_cols, axis=1, inplace=True)

In [30]:
df = df.merge(
    cves,
    on=['cve_id'],
    how='left'
)

In [31]:
# Combine the descriptions
df['description'] = df['description_y'].combine_first(df['description_x'])

In [32]:
unnecessary_cols = ['cwe_id', 'cve_state', 'date_published', 'severity', 'severity_score', 'attack_vector', 'attack_complexity', 'description_y', 'description_x']

df.drop(labels=unnecessary_cols, axis=1, inplace=True)

In [33]:
# Add data to new nation state values
updates = {
    'BlackEnergy Attack on Ukraine': [2015, 2015, 'Sandworm', 'Russia'],
    'Dragonfly/Energetic Bear Campaign 3': [2022, 2022, 'Dragonfly (Energetic Bear)', 'Russia'],
    'Stuxnet': [2018, 2018, pd.NA, 'US'],
    'Triton/Trisis': [2017, 2017, pd.NA, 'Russia'],
    'VPNFilter': [2018, 2018, 'APT28 (Fancy Bear)', 'Russia'],
    'Volt Typhoon': [pd.NA, pd.NA, pd.NA, 'China']
}

for attack, values in updates.items():
    df.loc[df['attack'] == attack, ['year_start', 'year_end', 'attribution_group', 'attribution_state']] = values
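For a single column, the same kind of update can also be written as a lookup with `Series.map`. This is a hedged alternative, not the approach used above; the toy frame is invented, and only the two attack-to-state pairs are taken from the `updates` dictionary.

```python
import pandas as pd

toy = pd.DataFrame({
    "attack": ["VPNFilter", "Volt Typhoon", "Other"],
    "attribution_state": [None, None, "Iran"],
})

state_by_attack = {"VPNFilter": "Russia", "Volt Typhoon": "China"}

# Look up each attack; attacks missing from the dict map to NaN
mapped = toy["attack"].map(state_by_attack)

# Mapped values win; rows without a mapping keep their existing value
toy["attribution_state"] = mapped.combine_first(toy["attribution_state"])
print(toy)
```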

In [34]:
# Drop one of the Stuxnet observations whose state attribution was changed from 'Israel' to 'US'
df = df.drop(index=41)

# Filter and copy dataframe for Stuxnet observations
stuxnet_df = df[df['attack'] == 'Stuxnet'].copy()

# Update state attribution
stuxnet_df['attribution_state'] = 'Israel'

# Split the dataframe to add new observations in desired location
df_part1 = df.loc[:40]
df_part2 = df.loc[41:]

# Reappend updated observations to original dataframe
df = pd.concat([df_part1, stuxnet_df, df_part2], ignore_index=True)
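The cell above relies on hard-coded row positions (40 and 41), which will shift if earlier steps change. A minimal sketch of a position-independent variant follows, using made-up rows. It differs from the cell above in one respect: each copy is placed directly after its source row rather than inserting all copies as one block.

```python
import pandas as pd

toy = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"],
    "attack": ["Other", "Stuxnet", "Stuxnet"],
    "attribution_state": ["China", "US", "US"],
})

# Copy the rows for the attack and give the copies the second attributed state
is_stuxnet = toy["attack"] == "Stuxnet"
second = toy[is_stuxnet].assign(attribution_state="Israel")

# Append the copies (they keep their original index labels), then use a stable
# sort on the index so each copy lands right after the row it was copied from
expanded = (
    pd.concat([toy, second])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(expanded)
```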

In [35]:
# Restore numeric data types for the year columns
yr_cols = ['year_cve', 'year_start', 'year_end']

df[yr_cols] = df[yr_cols].apply(pd.to_numeric)

In [36]:
# Re-extract year from CVE ID
df['year_cve'] = df['cve_id'].str.split('-').str[1]

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1162 entries, 0 to 1161
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   cve_id                 1154 non-null   string        
 1   year_cve               1154 non-null   object        
 2   attack                 70 non-null     string        
 3   year_start             59 non-null     Int64         
 4   year_end               59 non-null     Int64         
 5   attribution_group      29 non-null     string        
 6   attribution_state      61 non-null     object        
 7   cve_publish_date       1142 non-null   object        
 8   date_of_first_exploit  40 non-null     datetime64[ns]
 9   patch_release_date     40 non-null     datetime64[ns]
 10  days_to_patch_release  40 non-null     float64       
 11  days_to_first_exploit  40 non-null     float64       
 12  cvss                   516 non-null    float64       
 13  cvs

### Sorting Records with CVSS Scores First

In [49]:
df[df['cvss'].notnull()].info()

<class 'pandas.core.frame.DataFrame'>
Index: 516 entries, 0 to 515
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   cve_id                 516 non-null    string        
 1   year_cve               516 non-null    object        
 2   attack                 44 non-null     string        
 3   year_start             33 non-null     Int64         
 4   year_end               33 non-null     Int64         
 5   attribution_group      16 non-null     string        
 6   attribution_state      40 non-null     object        
 7   cve_publish_date       516 non-null    object        
 8   date_of_first_exploit  40 non-null     datetime64[ns]
 9   patch_release_date     40 non-null     datetime64[ns]
 10  days_to_patch_release  40 non-null     float64       
 11  days_to_first_exploit  40 non-null     float64       
 12  cvss                   516 non-null    float64       
 13  cvss_statu

In [44]:
df.sort_values(by='cvss', ascending=True, inplace=True)
df = df.reset_index(drop=True)
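Records with a score end up first because `sort_values` places missing values last by default. The sketch below makes that explicit with `na_position` on a made-up frame.

```python
import pandas as pd

toy = pd.DataFrame({"cve_id": ["a", "b", "c"], "cvss": [7.5, None, 9.8]})

# Ascending by score; rows without a score are pushed to the end
by_score = (
    toy.sort_values("cvss", ascending=True, na_position="last")
    .reset_index(drop=True)
)
print(by_score)
```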

## Saving the Dataframe

In [45]:
df.to_parquet(path='../data/IoT_CVE_Attacks_V2.parquet', index=None)
df.to_csv('../data/IoT_CVE_Attacks_V2_Indexed.csv', index=True)
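One practical difference between the two files is that CSV does not store column dtypes, so nullable integer and string columns need to be declared again when reading the CSV back. The sketch below shows this with an in-memory round trip on made-up data; the column names mirror the dataset but the values are placeholders.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "cve_id": pd.array(["CVE-0000-0001", "CVE-0000-0002"], dtype="string"),
    "year_start": pd.array([2017, pd.NA], dtype="Int64"),
})

# Write to CSV text in memory instead of a file
csv_text = toy.to_csv(index=False)

# Default read: the missing year forces the column to float64
plain = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes restore the nullable integer and string columns
restored = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"cve_id": "string", "year_start": "Int64"},
)

print(plain.dtypes)
print(restored.dtypes)
```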