# Preprocessing CVE Data
This data was pulled from MITRE's CVE project, a collection of several hundred thousand JSON files containing vulnerability information about all kinds of software, products, and services. Within this particular notebook, it will be processed in as many ways as necessary and merged with FIRST's Exploit Prediction Scoring System (EPSS) before being saved for use in further notebooks, namely `iot_cve_main_v3_preprocessing` and `iot_cve_main_analysis`.

## Preprocessing Steps
In order to reliably use the data in successive merges, we need to make sure it's clean. This involves several important steps, namely:
- Understanding the data at a bird's eye level—e.g. number of rows, summary stats about numerical columns, data types (numbers, text, dates, etc.)
- Handling incorrect data types
- Interpolating or discarding missing values
- Removing redundant white space characters
- Standardizing the formatting of column names, categorical data, etc.
- Deleting duplicate observations

In [1]:
# Import libraries
import pandas as pd # For data cleaning and analysis
import numpy as np # For advanced calculations
from datetime import datetime # For working with dates

# Import the data
cves = pd.read_parquet('../data/CVE_Project/cvelistV5/cve_list_v2.parquet')
cves.head(n=3)

Unnamed: 0,cve_id,cwe_id,cve_state,date_published,date_public,description,cvss_v4_0,cvss_v3_1,cvss_v3_0,cvss_v2_0,vendor,product
0,CVE-2024-9549,,,,,,,,,,,
1,CVE-2024-9549,,,,,,,,,,,
2,CVE-1999-0001,,PUBLISHED,2000-02-04T05:00:00,,ip_input.c in BSD-derived TCP/IP implementatio...,,,,,,


In [4]:
cves.info() # Overview of the size of the dataset, its null values in each column, and their datatypes
cves.describe() # View summary stats of numerical attributes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264610 entries, 0 to 264609
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   cve_id          264610 non-null  object 
 1   cwe_id          62937 non-null   object 
 2   cve_state       264608 non-null  object 
 3   date_published  261096 non-null  object 
 4   date_public     128264 non-null  object 
 5   description     250171 non-null  object 
 6   cvss_v4_0       2012 non-null    float64
 7   cvss_v3_1       56594 non-null   float64
 8   cvss_v3_0       16962 non-null   float64
 9   cvss_v2_0       4227 non-null    float64
 10  vendor          250023 non-null  object 
 11  product         250057 non-null  object 
dtypes: float64(4), object(8)
memory usage: 24.2+ MB


Unnamed: 0,cvss_v4_0,cvss_v3_1,cvss_v3_0,cvss_v2_0
count,2012.0,56594.0,16962.0,4227.0
mean,6.541054,6.620412,6.461508,5.647433
std,1.634232,1.751461,1.850871,1.652543
min,0.0,0.0,0.0,0.8
25%,5.3,5.4,5.3,4.0
50%,6.3,6.5,6.4,5.8
75%,8.2,7.8,7.8,6.5
max,10.0,10.0,10.0,10.0
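
To complement `info()`, the share of missing values per column can be read off directly. The following is a minimal sketch on a toy frame: the column names come from the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `cves`; values are illustrative, not real data
toy = pd.DataFrame({
    'cve_id': ['CVE-0000-0001', 'CVE-0000-0002', 'CVE-0000-0003', 'CVE-0000-0004'],
    'cwe_id': ['CWE-120', None, None, None],
    'cvss_v3_1': [8.8, None, 5.4, None],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real frame, `cves.isna().mean()` would make it easy to see how sparse the CVSS columns are relative to `cve_id`.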


## Check Validity of IDs

In [2]:
# na=False treats a missing ID as incorrectly formatted instead of silently passing it
non_ideal_cves = cves[~cves['cve_id'].str.startswith('CVE-', na=False)]
print(f'There are {len(non_ideal_cves)} incorrectly-formatted CVE IDs.')

There are 0 incorrectly-formatted CVE IDs.
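
The prefix check above only confirms that each ID starts with `CVE-`. A stricter check, sketched below, matches the whole ID against the `CVE-<year>-<sequence>` shape, assuming a four-digit year and a sequence of four or more digits. The pattern constant and the sample IDs are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Assumed shape: CVE-YYYY-NNNN with four or more sequence digits
CVE_ID_PATTERN = r'CVE-\d{4}-\d{4,}'

sample_ids = pd.Series(['CVE-1999-0001', 'CVE-2024-9549', 'CVE-2024-12', 'cve-2024-9549', None])

# fullmatch requires the entire string to match; missing IDs count as invalid
is_valid = sample_ids.str.fullmatch(CVE_ID_PATTERN, na=False)
print(sample_ids[~is_valid])
```

Applied to the real data, the same mask could replace the `startswith` test if a tighter guarantee on ID format is wanted.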


## Data Filtration
The data currently has about $14,000$ records where the CVE was rejected, meaning it was never mapped to an actual vulnerability; these are dropped below. In addition to this, MITRE's GitHub repository containing the CVE list warns that records between $\text{May } 8, 2024$ and $\text{June } 7, 2024$ contain "publication and update discrepancies... affecting approximately $515$ records." These are handled in the section [Validate Date Ranges](#validate-date-ranges) once the rest of the preprocessing is completed, because the date columns need the correct datatypes before that range can be checked reliably.

In [6]:
cves['cve_state'].value_counts()

cve_state
PUBLISHED    250171
REJECTED      14437
Name: count, dtype: int64

In [3]:
# Only keep published CVEs
cves = cves[cves['cve_state'] == 'PUBLISHED']

# Drop the state column since every value is identical now
cves = cves.drop(columns=['cve_state'])

## Dealing with Duplicates
There seem to be two duplicates of the very last observation (`CVE-2024-9549`) at the very top of the dataframe, and they will have to be discarded.

In [4]:
cves[cves['cve_id'] == 'CVE-2024-9549']

Unnamed: 0,cve_id,cwe_id,date_published,date_public,description,cvss_v4_0,cvss_v3_1,cvss_v3_0,cvss_v2_0,vendor,product
264609,CVE-2024-9549,CWE-120,2024-10-06T04:00:06.718Z,,A vulnerability was found in D-Link DIR-605L 2...,8.7,8.8,8.8,9.0,D-Link,DIR-605L


In [27]:
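# Delete the index where the duplicates appear.
# Their cve_state is null, so the PUBLISHED filter above may already have removed them;
# errors='ignore' avoids a KeyError when labels 0 and 1 no longer exist.
cves = cves.drop(index=[0, 1], errors='ignore')
cves = cves.reset_index(drop=True)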

In [5]:
print(f'There are {cves.duplicated().sum()} duplicate observations.')

There are 0 duplicate observations.
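
`duplicated()` with no arguments only flags rows that match in every column. The two stray rows shared a `cve_id` with a complete record but were otherwise empty, so a check keyed on `cve_id` alone can catch that kind of duplicate as well. A minimal sketch on made-up rows (the helper column `_n` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'cve_id': ['CVE-2024-9549', 'CVE-1999-0001', 'CVE-2024-9549'],
    'description': [None, 'first record', 'complete record'],
})

full_row_dupes = toy.duplicated().sum()                       # identical in every column
id_dupes = toy.duplicated(subset='cve_id', keep=False).sum()  # share an ID

# Keep the most complete row per ID (the one with the most non-null values)
deduped = (
    toy.assign(_n=toy.notna().sum(axis=1))
       .sort_values('_n', ascending=False)
       .drop_duplicates(subset='cve_id')
       .drop(columns='_n')
       .sort_index()
)
print(full_row_dupes, id_dupes)
print(deduped)
```

This keeps the populated `CVE-2024-9549` row and discards the empty one without relying on hard-coded index labels.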


## Data Type Validation
After mapping the CVSS scores to their respective categories, we'll need to redo this step for those specific variables.

In [6]:
col_dtypes = {
    'cve_id': 'string',
    'cwe_id': 'category',
    'description': 'string',
    #'cvss_v4_0': 'float',
    #'cvss_v3_1': 'float',
    #'cvss_v3_0': 'float',
    #'cvss_v2_0': 'float',
    # 'cvss_v4_0_cat': 'category',
    # 'cvss_v3_1_cat': 'category',
    # 'cvss_v3_0_cat': 'category',
    # 'cvss_v2_0_cat': 'category',
    'vendor': 'string',
    'product': 'string',
}

cves = cves.astype(col_dtypes)

# Handle date-based data types
date_cols = ['date_published', 'date_public']
cves[date_cols] = cves[date_cols].apply(pd.to_datetime, utc=True, errors='coerce')


In [7]:
cves.dtypes

cve_id                 string[python]
cwe_id                       category
date_published    datetime64[ns, UTC]
date_public       datetime64[ns, UTC]
description            string[python]
cvss_v4_0                     float64
cvss_v3_1                     float64
cvss_v3_0                     float64
cvss_v2_0                     float64
vendor                 string[python]
product                string[python]
dtype: object
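
Because `errors='coerce'` silently turns unparseable strings into `NaT`, it can be worth counting how many non-null values were lost in the conversion. The sketch below does this on made-up strings; the helper name `count_coerced` is hypothetical.

```python
import pandas as pd

def count_coerced(raw: pd.Series, parsed: pd.Series) -> int:
    """Count values that were present before parsing but are NaT afterwards."""
    return int((raw.notna() & parsed.isna()).sum())

raw_dates = pd.Series(['2000-02-04T05:00:00', '1999-09-29T04:00:00', 'not a date', None])
parsed_dates = pd.to_datetime(raw_dates, utc=True, errors='coerce')

print(f'{count_coerced(raw_dates, parsed_dates)} value(s) were coerced to NaT.')
```

The previews in this notebook show timestamps both with and without fractional seconds and a trailing `Z`. If that mix turns out to cause coercion, pandas 2.0 and later accept `format='ISO8601'` in `pd.to_datetime`, which may reduce the number of lost values. Whether it applies here depends on the pandas version in use, so treat it as something to verify rather than a confirmed fix.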

## Map CVSS Scores to Labels
Four versions of the CVSS score were scraped from the CVE list. According to FIRST, the organization that designed the standard, versions $3.0$ and $3.1$ are calculated in the same way; $3.1$ differs only in clarifying the metrics and documentation. As such, the two attributes are blended into one, taking from version $3.0$ where $3.1$ was not available from MITRE and from version $3.1$ where $3.0$ was not available. Where the two versions disagreed, which applies to precisely $1$ observation (`CVE-2023-22515`), $3.1$ was preferred for its currency.

In [8]:
def map_cvss_to_category(score):
    if score >= 9.0:
        return 'critical'
    elif score >= 7.0:
        return 'high'
    elif score >= 4.0:
        return 'medium'
    elif score > 0.0:
        return 'low'
    elif score == 0.0:
        return 'none'
    return None

cvss_cols = ['cvss_v4_0', 'cvss_v3_1', 'cvss_v3_0', 'cvss_v2_0']

for col in cvss_cols:
    label_col = col + '_cat'
    cves[label_col] = cves[col].apply(map_cvss_to_category)

In [9]:
# Merge versions 3 and 3.1 of the CVSS
cves['cvss_v3'] = cves['cvss_v3_1'].combine_first(cves['cvss_v3_0'])
cves['cvss_v3_cat'] = cves['cvss_v3_1_cat'].combine_first(cves['cvss_v3_0_cat'])

# Delete extraneous versions
cvss_cols = [
    'cvss_v4_0',
    'cvss_v3_1',
    'cvss_v3_0',
    'cvss_v2_0',
    'cvss_v4_0_cat',
    'cvss_v3_1_cat',
    'cvss_v3_0_cat',
    'cvss_v2_0_cat'
]
cves = cves.drop(columns=cvss_cols)

# Reposition the aggregated scores
cves.insert(5, 'cvss_v3', cves.pop('cvss_v3'))
cves.insert(6, 'cvss_v3_cat', cves.pop('cvss_v3_cat'))

# Order CVSS categories
cvss_cats = ['none', 'low', 'medium', 'high', 'critical']
cves['cvss_v3_cat'] = pd.Categorical(
    cves['cvss_v3_cat'],
    categories=cvss_cats,
    ordered=True
)
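
The blending rule in this section (prefer $3.1$, fall back to $3.0$) can be checked on a small example, along with a count of rows where both versions are present but disagree. The scores below are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'cvss_v3_1': [9.8, None, 7.5, None],
    'cvss_v3_0': [None, 5.3, 8.1, None],
})

# Prefer 3.1; use 3.0 only where 3.1 is missing
toy['cvss_v3'] = toy['cvss_v3_1'].combine_first(toy['cvss_v3_0'])

# Rows where both versions exist and differ
both_present = toy['cvss_v3_1'].notna() & toy['cvss_v3_0'].notna()
conflicts = both_present & (toy['cvss_v3_1'] != toy['cvss_v3_0'])
print(f'{conflicts.sum()} conflicting row(s)')
```

Running the conflict mask on the real columns before they are dropped is one way to confirm how many observations disagree between the two versions.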

## Remove Extraneous Whitespace

In [10]:
def check_whitespace(df, cols):
    # Note: this counts values with leading or trailing whitespace, not individual characters
    for col in cols:
        trimmable = df[col].str.contains(r'^\s+|\s+$', regex=True).sum()
        print(f"Column '{col}' has {trimmable} trimmable whitespace characters.")

str_cols = ['cve_id', 'description', 'vendor', 'product']

# Check for whitespace characters in the specified columns
check_whitespace(cves, str_cols)

# Remove the whitespace that was found
cves[str_cols] = cves[str_cols].apply(lambda x: x.str.strip())

# Check again
check_whitespace(cves, str_cols)

Column 'cve_id' has 0 trimmable whitespace characters.
Column 'description' has 8179 trimmable whitespace characters.
Column 'vendor' has 299 trimmable whitespace characters.
Column 'product' has 618 trimmable whitespace characters.
Column 'cve_id' has 0 trimmable whitespace characters.
Column 'description' has 0 trimmable whitespace characters.
Column 'vendor' has 0 trimmable whitespace characters.
Column 'product' has 0 trimmable whitespace characters.
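
`str.strip()` only removes leading and trailing whitespace. If runs of whitespace inside a value (double spaces, tabs, newlines) should also be normalized, one option is to collapse them to a single space. This is an optional extra step that the notebook does not perform, and the sample strings are made up.

```python
import pandas as pd

vendors = pd.Series(['  D-Link ', 'Acme\t\tCorp', 'Foo  Bar\nBaz', None], dtype='string')

# Collapse any run of whitespace to one space, then trim the ends
normalized = vendors.str.replace(r'\s+', ' ', regex=True).str.strip()
print(normalized.tolist())
```

Whether this is desirable for `description` depends on whether line breaks in the text carry meaning for later analysis.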


In [93]:
cves.info()

<class 'pandas.core.frame.DataFrame'>
Index: 250171 entries, 0 to 264607
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype              
---  ------          --------------   -----              
 0   cve_id          250171 non-null  string             
 1   cwe_id          62937 non-null   category           
 2   date_published  177416 non-null  datetime64[ns, UTC]
 3   date_public     117154 non-null  datetime64[ns, UTC]
 4   description     250171 non-null  string             
 5   cvss_v3         69256 non-null   float64            
 6   cvss_v3_cat     69256 non-null   category           
 7   vendor          250023 non-null  string             
 8   product         250057 non-null  string             
dtypes: category(2), datetime64[ns, UTC](2), float64(1), string(4)
memory usage: 16.0 MB


## Validate Date Ranges
MITRE's CVE Project GitHub warns of $515$ records affected by date discrepancies between $05/08/2024$ and $06/07/2024$. I've only counted $69$ records that fall within this range. Apparently, <span style='color:#ffcc00;text-shadow:0 0 3px #ffcc00;'>there is a fix in progress</span>. It may be the case that once we've merged all of the data together, dates from this range don't appear anyway, so <span style='color:#ffcc00;text-shadow:0 0 3px #ffcc00;'>I've made a preliminary executive decision to keep them in the dataset as of <span style='font-weight:bold;color:#ff9900;background-color:#525767;border-radius:3px;padding-inline:3px;padding-block:1px;font-style:normal;text-shadow:none;'>cve_list_v3</span></span>. Once the data is fixed on MITRE's end, updating it in the notebooks will be a nearly automatic process, so I'd rather filter the dates out in the final preparatory notebook if necessary.

What this section currently does, however, is keep the earlier of the two date variables, `date_published` (the date the CVE was listed) and `date_public` (the date the CVE was known to the public). The result is stored as `date_known` in the `cve_list_v3` dataset.

In [102]:
# Count the records published within the affected date range (nothing is removed here)
start_date = pd.Timestamp('2024-05-08 00:00:00', tz='UTC')
end_date = pd.Timestamp('2024-06-07 23:59:59', tz='UTC')

date_discrepancies = len(cves[(cves['date_published'] >= start_date) & (cves['date_published'] <= end_date)])
print(f'There are {date_discrepancies} records that fall within the range believed by MITRE to have inconsistent dates.')
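
Should the affected range need to be excluded later, the same bounds can be turned into a boolean mask. The following sketch uses made-up timestamps and a half-open interval, so that every instant on June 7 is covered, including fractions of the final second.

```python
import pandas as pd

published = pd.Series([
    pd.Timestamp('2024-05-07 23:59:59', tz='UTC'),
    pd.Timestamp('2024-05-08 00:00:00', tz='UTC'),
    pd.Timestamp('2024-06-07 23:59:59.500', tz='UTC'),
    pd.Timestamp('2024-06-08 00:00:00', tz='UTC'),
])

start = pd.Timestamp('2024-05-08', tz='UTC')
end_exclusive = pd.Timestamp('2024-06-08', tz='UTC')

# True for records inside [start, end_exclusive)
in_range = (published >= start) & (published < end_exclusive)
kept = published[~in_range]
print(f'{in_range.sum()} record(s) in range, {len(kept)} kept')
```

Missing dates compare as `False` on both sides, so records without a publication date would be kept by this mask.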

In [113]:
date_cols = ['date_published', 'date_public']

# Take the earliest date between both attributes
cves['date_known'] = cves[date_cols].min(axis=1)

# Drop the previous date columns
cves = cves.drop(columns=date_cols)

# Reposition date known
position = cves.columns.get_loc('description') + 1
cves.insert(position, 'date_known', cves.pop('date_known'))
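
The row-wise minimum skips missing dates by default, so a record with only one of the two dates still receives a `date_known`. A small sketch on made-up rows shows the three cases: both dates present, one present, and none present.

```python
import pandas as pd

toy = pd.DataFrame({
    'date_published': pd.to_datetime(
        ['2000-02-04T05:00:00', '2024-10-06T04:00:06', None], utc=True),
    'date_public': pd.to_datetime(
        ['1999-09-29T04:00:00', None, None], utc=True),
})

# min(axis=1) ignores NaT unless every value in the row is NaT
date_known = toy[['date_published', 'date_public']].min(axis=1)
print(date_known)
```

Only the last row, where both inputs are missing, ends up as `NaT`.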


## Remove CWE ID Attribute
This variable is not necessary for our current analytic direction, so its continued existence in the dataset is clutter and it should be removed. The `vendor` and `product` attributes are kept solely to search for the CVEs' patch release and first exploited dates.

In [None]:
cves = cves.drop(columns=['cwe_id'])

In [116]:
cves.head(3)

Unnamed: 0,cve_id,description,date_known,cvss_v3,cvss_v3_cat,vendor,product
0,CVE-1999-0001,ip_input.c in BSD-derived TCP/IP implementatio...,2000-02-04 05:00:00+00:00,,,,
1,CVE-1999-0002,Buffer overflow in NFS mountd gives root acces...,1999-09-29 04:00:00+00:00,,,,
2,CVE-1999-0003,Execute commands as root via buffer overflow i...,1999-09-29 04:00:00+00:00,,,,


## Saving the Dataset
This file is being saved as a `.parquet` file within the `CVE_Project` directory inside of `data`.

In [117]:
cves.to_parquet(path='../data/CVE_Project/cve_list_v3.parquet', index=False)