# Analysis of IoT-Related CVEs and Their Exploitation by APTs

In [2]:
import os # For traversing and reading folders and files
import json # For reading and extracting data from CVE records
import re # Handle regex patterns
import xml.etree.ElementTree as ET # For reading and extracting data from CWE records
import pandas as pd # For data cleaning and analysis
import numpy as np # For advanced calculations
import matplotlib.pyplot as plt # For data visualization

## Data Collection
### Extracting Data from the CVE List
There are precisely $260,896$ records in the CVE list, each represented by a single JSON file. These files contain all the information (much of it incomplete) that will populate our `CVE_df` dataframe. After writing the initial script, it became clear that the files' structures differ massively from one another. This makes it especially complicated to reliably retrieve the desired values, but I managed to write a recursive function that, at every level of nesting, chooses how to retrieve data based on the type of the object it encounters on its way to the desired value. Based on the CVE List schema, I figured that the most important information (and really, the only information in the files that would be useful to us) would be the CVE ID, the CWE ID, the description, the date the record was published, a categorical and a numerical severity score, the attack vector, and the attack complexity.
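The recursive retrieval described above can be sketched roughly as follows. This is a minimal illustration, not the function actually used for the extraction: the helper names `find_values` and `first_value` are hypothetical, the sample record is made up, and the key names (`cveId`, `cweId`, `baseScore`, and so on) are assumed to match the CVE JSON schema.

```python
import json
from typing import Any, Iterator

def find_values(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key`, at any depth of a parsed JSON tree."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key may also appear deeper in the tree
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)
    # Scalars (str, int, float, bool, None) contain no keys, so recursion stops here

def first_value(node: Any, key: str, default: Any = None) -> Any:
    """Return the first match for `key`, or `default` when the key is absent."""
    return next(find_values(node, key), default)

# Made-up record, loosely shaped like a CVE JSON file (not a real entry)
record = json.loads("""
{
  "cveMetadata": {"cveId": "CVE-0000-0000", "state": "PUBLISHED"},
  "containers": {"cna": {
    "metrics": [{"cvssV3_1": {"baseScore": 7.5, "baseSeverity": "HIGH"}}],
    "problemTypes": [{"descriptions": [{"cweId": "CWE-79"}]}]
  }}
}
""")

row = {
    "cve_id": first_value(record, "cveId"),
    "cwe_id": first_value(record, "cweId"),
    "severity": first_value(record, "baseSeverity"),
    "severity_score": first_value(record, "baseScore"),
    "attack_vector": first_value(record, "attackVector"),  # absent here, so None
}
print(row)
```

Searching by key name rather than by fixed path is what makes this tolerant of files that nest the same field at different depths; the trade-off is that an unrelated field sharing the same key name would also match.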



### Extracting Data from the CWE List
The CWE list contains lots of information that could be pertinent to our analysis and that's relatively easy to access. Among these I counted the CWE IDs, names, descriptions, related CWEs, the nature of these relationships, the technological vector (web-based, network, etc.), background details, the phases of development wherein the weakness could be introduced, descriptions of how it is introduced in those phases, the likelihood of these weaknesses, the scope and impact of the weakness's common consequences, detection methods and their descriptions and effectiveness, mitigation strategies, potential vulnerabilities, observed vulnerabilities (with direct references to CVE IDs), and more. Because the observed vulnerabilities reference CVE IDs, we can merge the two dataframes, then clean and filter the result so that we have a comprehensive dataset to work with. Because there's so much text data in the CWE list, we should have plenty of information to feed into a natural language processing pipeline that can help build a classification model to predict the association of a given CVE with the IoT and its various vulnerabilities.

I created several helper functions to get us from `A` to `B`. The first simply takes the XML file as input. Next, we have a function that takes an element, gathers all of the text inside it and its child nodes, and returns it as one long string. This kind of column data will be useful when we tokenize and lemmatize the text to more accurately search for our desired IoT-vulnerability-related keywords.
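The text-flattening helper described above might look something like this. It is a sketch under assumptions: the function name `element_text` and the XML snippet are invented for illustration, and the real CWE file uses XML namespaces that this toy example leaves out.

```python
import xml.etree.ElementTree as ET

def element_text(elem: ET.Element) -> str:
    """Join the text of an element and all of its descendants into one string."""
    # itertext() walks the subtree in document order; split()/join() then
    # collapses the newlines and indentation left over from the XML layout
    return " ".join(" ".join(elem.itertext()).split())

# Invented snippet, only loosely modeled on a CWE entry
snippet = """
<Weakness ID="0" Name="Example">
  <Description>The product stores data
     <b>without</b> validation.</Description>
</Weakness>
""".strip()

root = ET.fromstring(snippet)
flat = element_text(root.find("Description"))
print(flat)
```

Collapsing whitespace at this stage keeps the later tokenization step from having to deal with stray indentation.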

### Saving the Data
Parquet is a file format designed for efficient storage and retrieval of columnar data. I also saved a copy of both dataframes to CSV (a basic type of spreadsheet file). For the Parquet export, I chose the default option `None` for the method's `index` parameter, which stores a range index as a range in the file's metadata instead of writing out every index value. This means it won't take up the space it would have if the index were saved into the dataframe as a separate column, but it still provides a way to keep track of the records for the purposes of splitting them up between training, test, and validation sets for an ML algorithm, should our work come to that.

<span style='font-weight:600;color:#ff8800;background-color:#666672;border-radius:3px;padding-inline:3px;padding-block:1px;'>Don't run these code cells unless you want to overwrite the saved files!</span>

## Loading the Data
I saved the dataframes resulting from our extraction into secondary storage because pulling data from $260,000$ files takes a very long time (1.5+ hours). Now that we have neatly packaged CSV files, we'll read them into Pandas so we can inspect their cleanliness, merge them together in interesting ways, and construct a comprehensive analysis with descriptive statistics, chi-square independence testing, Spearman's rank correlation, and logistic regression.

In [46]:
cves = pd.read_csv('../data/CVE_V5/CVE_List.csv')

cves.head(n=1)

Unnamed: 0,cve_id,cwe_id,cve_state,date_published,description,severity,severity_score,attack_vector,attack_complexity
0,CVE-1999-0001,,PUBLISHED,2000-02-04T05:00:00,ip_input.c in BSD-derived TCP/IP implementatio...,,,,


In [47]:
cwes = pd.read_csv('../data/CWE_V4.15/CWE_List.csv')

cwes.head(n=1)

Unnamed: 0,cwe_id,name,description,tech_class,bg_details,rel_ids,nature_of_rels,modes_of_intro_phases,modes_of_intro_descs,likelihood,...,consequence_notes,detection_method,detection_desc,detection_effectiveness,mitigation_phases,mitigation_descs,mitigation_effectiveness,mitigate_notes,observed_vulnerabilities,vulnerability_descs
0,1004,Sensitive Cookie Without 'HttpOnly' Flag,The product uses a cookie to store sensitive i...,Web Based,An HTTP cookie is a small piece of data attrib...,['732'],['ChildOf'],['Implementation'],[''],Medium,...,"['If the HttpOnly flag is not set, then sensit...",['Automated Static Analysis'],"['Automated static analysis, commonly referred...",['High'],Implementation,['Leverage the HttpOnly flag when setting a se...,['High'],"[""While this mitigation is effective for prote...","['CVE-2022-24045', 'CVE-2014-3852', 'CVE-2015-...",['Web application for a room automation system...


## Preprocessing
In order to reliably analyze the data, we need to make sure it's clean. This involves several important steps, namely:
1) Understanding the data at a bird's eye level; e.g. number of rows, summary stats about numerical columns, data types (numbers, text, dates, etc.)
2) Handling missing values, redundant white space, inconsistent formatting, and typos
3) Removing duplicates
4) Converting data types into their respective forms for efficient processing

Before feeding the data into a machine learning model (if we get there), we'll have several additional steps to add on to this process:
1) We'll want to encode our categorical data so that we can use math on it
2) We'll split our data into training, validation, and testing sets to ensure our model generalizes well to unseen data
3) We'll have to handle outliers, which can make the model prone to overfitting
4) We'll scale our data around a common mean of `0` and a standard deviation of `1` so that all features contribute equally to the algorithm's output.
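
As a rough illustration of the model-preparation steps listed above, here is a minimal sketch using only pandas and NumPy. The toy values are invented, the 60/20/20 split ratio is an assumption, and outlier handling (step 3) is left out because it depends on the model chosen.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned table; values are invented for illustration
df = pd.DataFrame({
    "attack_vector": ["NETWORK", "LOCAL", "NETWORK", "PHYSICAL", "LOCAL",
                      "NETWORK", "NETWORK", "LOCAL", "NETWORK", "LOCAL"],
    "severity_score": [9.8, 5.5, 7.5, 4.6, 6.1, 8.8, 5.3, 7.8, 6.5, 3.3],
})

# Step 1: encode the categorical column as 0/1 indicator columns
encoded = pd.get_dummies(df, columns=["attack_vector"], dtype=float)

# Step 2: shuffle once with a fixed seed, then cut into 60/20/20 splits
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(encoded))
n_train = int(0.6 * len(encoded))
n_val = int(0.2 * len(encoded))
train = encoded.iloc[order[:n_train]].copy()
val = encoded.iloc[order[n_train:n_train + n_val]].copy()
test = encoded.iloc[order[n_train + n_val:]].copy()

# Step 4: standardize with statistics from the training split only,
# so nothing about the validation or test rows leaks into the scaling
mu = train["severity_score"].mean()
sigma = train["severity_score"].std(ddof=0)
for part in (train, val, test):
    part["severity_score"] = (part["severity_score"] - mu) / sigma

print(train.describe().loc[["mean", "std"]])
```

Fitting the scaling parameters on the training rows alone is the usual way to keep the held-out sets genuinely unseen.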

### Bird's Eye View of CVE List
Let's start with the CVE List, then traverse the CWEs, and finally, the full dataset once we've merged them.

In [48]:
cves.info() # Overview of the size of the dataset, its null values in each column, and their datatypes
cves.describe() # View summary stats of numerical attributes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260894 entries, 0 to 260893
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   cve_id             260894 non-null  object 
 1   cwe_id             58852 non-null   object 
 2   cve_state          260894 non-null  object 
 3   date_published     257370 non-null  object 
 4   description        246505 non-null  object 
 5   severity           65617 non-null   object 
 6   severity_score     65742 non-null   float64
 7   attack_vector      52410 non-null   object 
 8   attack_complexity  52431 non-null   object 
dtypes: float64(1), object(8)
memory usage: 17.9+ MB


Unnamed: 0,severity_score
count,65742.0
mean,6.631829
std,1.752317
min,0.0
25%,5.4
50%,6.5
75%,7.8
max,10.0


In [49]:
cves['severity'].value_counts() # Types of severity

severity
MEDIUM      31632
HIGH        24219
CRITICAL     5646
LOW          4085
NONE           33
medium          1
MODERATE        1
Name: count, dtype: int64

In [50]:
cves['attack_vector'].value_counts() # Types of attack vector

attack_vector
NETWORK             37346
LOCAL               11747
ADJACENT_NETWORK     2659
PHYSICAL              614
ADJACENT               44
Name: count, dtype: int64

In [51]:
cves['attack_complexity'].value_counts() # Levels of attack complexity

attack_complexity
LOW     43947
HIGH     8484
Name: count, dtype: int64

In [52]:
cves['cwe_id'].describe() # View summary stats of CWE IDs

count      58852
unique       664
top       CWE-79
freq        8871
Name: cwe_id, dtype: object

In [53]:
cves['date_published'] # View date formats

0              2000-02-04T05:00:00
1              1999-09-29T04:00:00
2              1999-09-29T04:00:00
3              2000-02-04T05:00:00
4              1999-09-29T04:00:00
                    ...           
260889    2024-08-20T23:31:03.646Z
260890    2024-08-20T23:31:05.010Z
260891    2024-08-21T20:20:26.856Z
260892    2024-08-21T20:20:27.045Z
260893    2024-08-21T20:20:27.239Z
Name: date_published, Length: 260894, dtype: object

In [55]:
cves['description']

0         ip_input.c in BSD-derived TCP/IP implementatio...
1         Buffer overflow in NFS mountd gives root acces...
2         Execute commands as root via buffer overflow i...
3         MIME buffer overflow in email clients, e.g. So...
4         Arbitrary command execution via IMAP buffer ov...
                                ...                        
260889    A vulnerability was found in Genexis Tilgin Ho...
260890    A vulnerability classified as critical has bee...
260891    Inappropriate implementation in WebApp Install...
260892    Inappropriate implementation in Custom Tabs in...
260893    Inappropriate implementation in Extensions in ...
Name: description, Length: 260894, dtype: object

In [28]:
cves['cve_state'].value_counts() # Types of state

cve_state
PUBLISHED    246505
REJECTED      14389
Name: count, dtype: int64

In [42]:
cves[cves['cve_state'] == 'REJECTED'].head() # Rejected CVE IDs

Unnamed: 0,cve_id,cwe_id,cve_state,date_published,description,severity,severity_score,attack_vector,attack_complexity
19,CVE-1999-0020,,REJECTED,2000-02-04T05:00:00,,,,,
109,CVE-1999-0110,,REJECTED,2000-02-04T05:00:00,,,,,
186,CVE-1999-0187,,REJECTED,2000-02-04T05:00:00,,,,,
281,CVE-1999-0282,,REJECTED,2000-02-04T05:00:00,,,,,
334,CVE-1999-0335,,REJECTED,1999-09-29T04:00:00,,,,,


In [30]:
len(cves[cves['description'].isna()]) == len(cves[cves['cve_state'] == 'REJECTED']) # Does the number of CVEs lacking a description exactly match the number of rejected CVEs?

True

From the looks of it, every record has an ID and a state, and nearly all have a publication date. Some $14,000$ records have been rejected; these contain no other useful information and can be dropped entirely from the table. The publication date will be converted to a proper datetime format. Every published record has a description. The severity scores are already the proper numerical data type, but their categorical counterparts contain multiple spellings of `MEDIUM` (`medium` and `MODERATE`) which, based on the [NVD-documented categories](https://nvd.nist.gov/vuln-metrics/cvss), will need to be addressed.
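
One way to sanity-check the categorical labels against the numerical scores is to derive a label from each score using the CVSS v3.x qualitative bands documented by NVD (None 0.0, Low 0.1-3.9, Medium 4.0-6.9, High 7.0-8.9, Critical 9.0-10.0). The sketch below assumes every score is a v3.x base score, which may not hold for older records scored under CVSS v2 (where the bands differ); the function name `severity_from_score` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Upper edges of the CVSS v3.x bands; the first bin holds only the score 0.0
bins = [-np.inf, 0.0, 3.9, 6.9, 8.9, 10.0]
labels = ["NONE", "LOW", "MEDIUM", "HIGH", "CRITICAL"]

def severity_from_score(scores: pd.Series) -> pd.Series:
    """Map numerical scores onto severity labels (assumes CVSS v3.x bands)."""
    # right=True (the default) makes each bin include its upper edge
    return pd.cut(scores, bins=bins, labels=labels)

# Invented rows: two nonstandard labels, one standard, one missing
sample = pd.DataFrame({
    "severity": ["medium", "MODERATE", "HIGH", None],
    "severity_score": [5.4, 6.5, 7.8, 9.8],
})
sample["derived"] = severity_from_score(sample["severity_score"])
print(sample)
```

Rows where the derived label disagrees with the recorded one would be worth inspecting by hand rather than overwriting automatically.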

### Dropping Unnecessary Information

In [63]:
# Drop rejected CVEs
cves = cves.drop(cves[cves['cve_state'] == 'REJECTED'].index)
cves['cve_state'].value_counts()

cve_state
PUBLISHED    246505
Name: count, dtype: int64

### Correcting Data Types

In [74]:
# Convert publication date to datetime format
cves['date_published'] = pd.to_datetime(cves['date_published'], format='ISO8601', utc=True)

# Convert objects to text data (string)
obj_cols = cves.select_dtypes(include=['object']).columns
cves[obj_cols] = cves[obj_cols].astype('string')

cves.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246505 entries, 0 to 260893
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   cve_id             246505 non-null  string             
 1   cwe_id             58852 non-null   string             
 2   cve_state          246505 non-null  string             
 3   date_published     245989 non-null  datetime64[ns, UTC]
 4   description        246505 non-null  string             
 5   severity           65617 non-null   string             
 6   severity_score     65742 non-null   float64            
 7   attack_vector      52410 non-null   string             
 8   attack_complexity  52431 non-null   string             
dtypes: datetime64[ns, UTC](1), float64(1), string(7)
memory usage: 18.8 MB


### Standardizing Formatting

In [70]:
# Standardize severity labels (fold 'medium' and 'MODERATE' into 'MEDIUM')
cves['severity'] = cves['severity'].replace(['medium', 'MODERATE'], 'MEDIUM')
cves['severity'].value_counts()

severity
MEDIUM      31634
HIGH        24219
CRITICAL     5646
LOW          4085
NONE           33
Name: count, dtype: int64

In [84]:
# Check for leading or trailing whitespace
str_cols = cves.select_dtypes(include=['string']).columns

def check_whitespace():
    for col in str_cols:
        trimmable = cves[col].str.contains(r'^\s|\s$', regex=True).sum()
        print(f"Column '{col}' has {trimmable} trimmable whitespace characters.")

# Remove leading or trailing whitespace
cves[str_cols] = cves[str_cols].apply(lambda x: x.str.strip())

check_whitespace()

Column 'cve_id' has 0 trimmable whitespace characters.
Column 'cwe_id' has 0 trimmable whitespace characters.
Column 'cve_state' has 0 trimmable whitespace characters.
Column 'description' has 0 trimmable whitespace characters.
Column 'severity' has 0 trimmable whitespace characters.
Column 'attack_vector' has 0 trimmable whitespace characters.
Column 'attack_complexity' has 0 trimmable whitespace characters.


### Removing Duplicates

In [71]:
# Check for duplicates
cves.duplicated().sum()

0

### Bird's Eye View of CWE List
We still have to handle null values in the CVE list, but since the merge will fill in some of them with data from the CWE list, we'll first clean our second dataset using the same techniques. Once the two have been merged, we'll filter out or impute the remaining null values depending on what works best for the analysis.

We'll want to figure out the same basic facts about the CWE list that we did for the CVE list.

In [89]:
cwes.info() # Overview of the size of the dataset, its null values in each column, and their datatypes
cwes.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 964 entries, 0 to 963
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   cwe_id                    964 non-null    int64 
 1   name                      964 non-null    object
 2   description               964 non-null    object
 3   tech_class                172 non-null    object
 4   bg_details                45 non-null     object
 5   rel_ids                   964 non-null    object
 6   nature_of_rels            964 non-null    object
 7   modes_of_intro_phases     964 non-null    object
 8   modes_of_intro_descs      964 non-null    object
 9   likelihood                185 non-null    object
 10  scope_of_consequences     964 non-null    object
 11  impact_of_consequences    964 non-null    object
 12  consequence_notes         964 non-null    object
 13  detection_method          964 non-null    object
 14  detection_desc            

Unnamed: 0,cwe_id,name,description,tech_class,bg_details,rel_ids,nature_of_rels,modes_of_intro_phases,modes_of_intro_descs,likelihood,...,consequence_notes,detection_method,detection_desc,detection_effectiveness,mitigation_phases,mitigation_descs,mitigation_effectiveness,mitigate_notes,observed_vulnerabilities,vulnerability_descs
0,1004,Sensitive Cookie Without 'HttpOnly' Flag,The product uses a cookie to store sensitive i...,Web Based,An HTTP cookie is a small piece of data attrib...,['732'],['ChildOf'],['Implementation'],[''],Medium,...,"['If the HttpOnly flag is not set, then sensit...",['Automated Static Analysis'],"['Automated static analysis, commonly referred...",['High'],Implementation,['Leverage the HttpOnly flag when setting a se...,['High'],"[""While this mitigation is effective for prote...","['CVE-2022-24045', 'CVE-2014-3852', 'CVE-2015-...",['Web application for a room automation system...
1,1007,Insufficient Visual Distinction of Homoglyphs ...,The product displays information or identifier...,Web Based,,['451'],['ChildOf'],"['Architecture and Design', 'Implementation']",['This weakness may occur when characters from...,Medium,...,"[""An attacker may ultimately redirect a user t...",['Manual Dynamic Analysis'],"['If utilizing user accounts, attempt to submi...",['Moderate'],Implementation,[' Use a browser that displays Punycode for ID...,"['', '']","['', '']","['CVE-2013-7236', 'CVE-2012-0584', 'CVE-2009-0...",['web forum allows impersonation of users with...
2,102,Struts: Duplicate Validation Forms,The product uses multiple validation forms wit...,,,"['694', '1173', '20']","['ChildOf', 'ChildOf', 'ChildOf']",['Implementation'],[''],,...,[''],[],[],[],Implementation,['The DTD or schema validation will not catch ...,[''],[''],[],[]
3,1021,Improper Restriction of Rendered UI Layers or ...,The web application does not restrict or incor...,Web Based,,"['441', '610', '451']","['ChildOf', 'ChildOf', 'ChildOf']",['Implementation'],[''],,...,"[""An attacker can trick a user into performing...",['Automated Static Analysis'],"['Automated static analysis, commonly referred...",['High'],Implementation,[' The use of X-Frame-Options allows developer...,"['', '', '']","['', '', '']","['CVE-2017-7440', 'CVE-2017-5697', 'CVE-2017-4...",['E-mail preview feature in a desktop applicat...
4,1022,Use of Web Link to Untrusted Target with windo...,The web application produces links to untruste...,Web Based,,['266'],['ChildOf'],"['Architecture and Design', 'Implementation']",['This weakness is introduced during the desig...,Medium,...,['The user may be redirected to an untrusted p...,['Automated Static Analysis'],"['Automated static analysis, commonly referred...",['High'],Implementation,['Specify in the design that any linked extern...,"['', '', '']","['', '', '']",['CVE-2022-4927'],"['Library software does not use rel: ""noopener..."


In [93]:
cves[cves['cve_id'] == 'CVE-2017-5697']

Unnamed: 0,cve_id,cwe_id,cve_state,date_published,description,severity,severity_score,attack_vector,attack_complexity
105930,CVE-2017-5697,,PUBLISHED,2017-06-14 12:00:00+00:00,Insufficient clickjacking protection in the We...,,,,
