# Preprocessing EPSS Data
The Exploit Prediction Scoring System (EPSS) assigns each CVE a score: a probability between 0 (0% likely) and 1 (100% likely) that estimates how likely the CVE is to be exploited within a rolling 30-day window. Each score is also converted into a percentile that ranks the CVE's likelihood relative to all other scored CVEs. The data was gathered from FIRST and is a complete dataset with no empty values. This EPSS model version was created on 03/01/2023, and the scores were computed on 09/21/2024.
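
To make the relationship between score and percentile concrete, here is a minimal sketch on toy data. It assumes the percentile is the fraction of CVEs whose score is less than or equal to the given one, which `rank(method='max', pct=True)` computes in pandas. The IDs and scores are hypothetical, and FIRST's exact tie handling and rounding may differ.

```python
import pandas as pd

# Hypothetical IDs and scores for illustration only (not real EPSS data)
toy = pd.DataFrame({
    'cve_id': ['CVE-0000-0001', 'CVE-0000-0002', 'CVE-0000-0003', 'CVE-0000-0004'],
    'epss': [0.1, 0.5, 0.5, 0.9],
})

# Assumption: percentile = share of CVEs with a score <= this CVE's score.
# method='max' gives tied scores the highest rank in their group.
toy['percentile'] = toy['epss'].rank(method='max', pct=True)

print(toy)
```

Under this assumption, two CVEs with the same score always receive the same percentile, and the highest-scoring CVE sits at 1.0.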

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
epss = pd.read_csv('../data/EPSS/EPSS_data/EPSS_data.csv')
epss.head(3)

|   | cve | epss | percentile |
|---|---|---|---|
| 0 | CVE-1999-0001 | 0.00383 | 0.73343 |
| 1 | CVE-1999-0002 | 0.0208 | 0.89305 |
| 2 | CVE-1999-0003 | 0.04409 | 0.92563 |


In [2]:
epss.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260010 entries, 0 to 260009
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   cve         260010 non-null  object 
 1   epss        260010 non-null  float64
 2   percentile  260010 non-null  float64
dtypes: float64(2), object(1)
memory usage: 6.0+ MB


## Convert ID to Text Type

In [3]:
epss['cve'] = epss['cve'].astype('string')

## Rename Columns

In [7]:
epss = epss.rename(columns={'cve': 'cve_id'})

## Validate Consistency
We'll run a few simple checks to make sure every CVE ID is well formed and free of stray whitespace. Then we'll confirm that the `epss` and `percentile` values fall within the range 0 to 1 before saving the data as a `.parquet` file.

In [12]:
# Strip any leading or trailing whitespace from the IDs
epss['cve_id'] = epss['cve_id'].str.strip()

# Check for invalid CVEs (na=False so that a missing ID is flagged instead of silently passing)
non_ideal_epss = epss[~epss['cve_id'].str.startswith('CVE-', na=False)]
print(f'There are {len(non_ideal_epss)} incorrectly-formatted CVE IDs.')

# Validate ranges
cols = ['epss', 'percentile']

for col in cols:
    # Rows whose value falls outside the closed interval [0, 1]
    out_of_range = epss[(epss[col] < 0) | (epss[col] > 1)]
    print(f'{col} has {len(out_of_range)} out-of-range values.')

There are 0 incorrectly-formatted CVE IDs.
epss has 0 out-of-range values.
percentile has 0 out-of-range values.
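
The prefix check above only confirms that an ID begins with `CVE-`. A stricter check could match the whole ID against the `CVE-YYYY-NNNN` layout (a four-digit year followed by four or more digits). The sketch below runs on a small hypothetical sample rather than the real dataset; the `CVE_PATTERN` name and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample: two well-formed IDs, two malformed ones, and a missing value
sample = pd.DataFrame({
    'cve_id': pd.array(
        ['CVE-1999-0001', 'CVE-2024-123456', 'CVE-99-1', 'cve-2020-0001', pd.NA],
        dtype='string',
    )
})

# Four-digit year, then a sequence number of at least four digits
CVE_PATTERN = r'CVE-\d{4}-\d{4,}'

# fullmatch requires the entire string to match; na=False treats missing IDs as invalid
is_valid = sample['cve_id'].str.fullmatch(CVE_PATTERN, na=False)
malformed = sample[~is_valid]

print(f'There are {len(malformed)} incorrectly-formatted CVE IDs.')
```

Applying the same pattern to `epss['cve_id']` would catch IDs with a wrong case, a short year, or trailing characters that a simple `startswith` test lets through.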


In [13]:
epss.head()

|   | cve_id | epss | percentile |
|---|---|---|---|
| 0 | CVE-1999-0001 | 0.00383 | 0.73343 |
| 1 | CVE-1999-0002 | 0.0208 | 0.89305 |
| 2 | CVE-1999-0003 | 0.04409 | 0.92563 |
| 3 | CVE-1999-0004 | 0.00917 | 0.83132 |
| 4 | CVE-1999-0005 | 0.91963 | 0.99 |


## Saving the Data

In [14]:
epss.to_parquet(path='../data/EPSS/epss_data.parquet', index=None)