# Exploratory Data Analysis Notebook
This notebook covers exploratory data analysis (EDA) of the large and focused final data sets. General facts about the data are recorded in the following cells and, where warranted, may also appear in the README file. We'll use `pandas` and `numpy` for data processing and analysis, and `matplotlib` and `seaborn` for data visualization. The first step is to load the libraries and the data we're interested in. The data has already been cleaned, but some attributes still contain null values that will have to be handled during certain operations.

In [1]:
import os
import sys
import pandas as pd
# import numpy as np
# import scipy.stats as stats
# import matplotlib.pyplot as plot
# import seaborn as sbn
import textwrap as txt

# Import utility functions
src_path = os.path.abspath(os.path.join('..', 'src'))
if src_path not in sys.path:
    sys.path.append(src_path)

from classes import Plotter

df = pd.read_parquet(path='../data/processed/composite/dataset_focused.parquet')

First, we look at general facts about the data we're interested in.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10891 entries, 0 to 10890
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   cve_id                10891 non-null  string             
 1   date_public           10678 non-null  datetime64[ns, UTC]
 2   origin                10891 non-null  category           
 3   cvss                  10290 non-null  Float64            
 4   cvss_severity         10891 non-null  category           
 5   cvss_src              10290 non-null  category           
 6   exploit_count         10891 non-null  Float64            
 7   days_to_poc_exploit   10678 non-null  Float64            
 8   exploitation_date_0   10891 non-null  datetime64[ns, UTC]
 9   epss_0                3969 non-null   Float64            
 10  percentile_0          2886 non-null   Float64            
 11  exploitation_date_30  4654 non-null   datetime64[ns, UTC]
 12  epss

In [11]:
print(f'There are \033[32;1m{df.shape[0]}\033[0m CVE records in the data frame, each with \033[32;1m{df.shape[1]}\033[0m attributes.')
print('Some general statistics about the numerical columns:')

# Change output display for clarity
pd.options.display.precision = 3
df.describe().T

There are [32;1m10891[0m CVE records in the data frame, each with [32;1m20[0m attributes.
Some general statistics about the numerical columns:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| cvss | 10290.0 | 7.36 | 1.794 | 0.0 | 6.1 | 7.5 | 8.8 | 10.0 |
| exploit_count | 10891.0 | 1.86 | 5.999 | 1.0 | 1.0 | 1.0 | 1.0 | 394.0 |
| days_to_poc_exploit | 10678.0 | 230.404 | 790.141 | -4457.0 | 0.0 | 1.0 | 66.0 | 9978.0 |
| epss_0 | 3969.0 | 0.067 | 0.191 | -0.783 | 0.0 | 0.004 | 0.02 | 1.019 |
| percentile_0 | 2886.0 | 0.488 | 0.317 | 0.008 | 0.194 | 0.42 | 0.788 | 1.0 |
| epss_30 | 4003.0 | 0.1 | 0.224 | 0.0 | 0.001 | 0.009 | 0.051 | 0.975 |
| percentile_30 | 3965.0 | 0.554 | 0.334 | 0.008 | 0.229 | 0.563 | 0.895 | 1.0 |
| epss_60 | 4116.0 | 0.104 | 0.236 | -0.008 | 0.001 | 0.009 | 0.052 | 1.937 |
| percentile_60 | 4008.0 | 0.557 | 0.334 | 0.008 | 0.243 | 0.565 | 0.898 | 1.0 |
| change_0_to_30 | 3945.0 | 1910.202 | 15133.251 | -32500.0 | 0.0 | 0.0 | 20.0 | 226241.86 |
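
EPSS scores are probabilities, so they should fall between 0 and 1, yet the summary above lists values outside that range (for example, a minimum of -0.783 for `epss_0` and a maximum of 1.937 for `epss_60`). A quick way to count such values is sketched here on a small toy frame rather than the real data; the frame `sample` and its contents are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for the real data; only the column names mirror the dataset
sample = pd.DataFrame({
    'epss_0': [0.004, -0.783, 1.019, None],
    'epss_60': [0.009, 1.937, 0.052, 0.001],
})

epss_cols = ['epss_0', 'epss_60']

# True where a value lies outside the valid probability range [0, 1].
# Comparisons against NaN evaluate to False, so missing scores are not counted.
out_of_range = (sample[epss_cols] < 0) | (sample[epss_cols] > 1)
counts = out_of_range.sum()
print(counts)
```

The real columns use the nullable `Float64` dtype, where comparisons against missing values yield `<NA>` rather than `False`, so a `.fillna(False)` before summing may be needed there.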


There are $601$ CVEs with an `UNKNOWN` CVSS severity. These are the records that have no CVSS score at all, so they cannot be placed in any of the severity ranges defined by FIRST.

In [9]:
df[(df['cvss'].isnull()) & (df['cvss_severity'].notnull())]['cvss_severity'].value_counts()

cvss_severity
UNKNOWN     601
CRITICAL      0
HIGH          0
LOW           0
MEDIUM        0
NONE          0
Name: count, dtype: int64
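
The severity labels correspond to the qualitative rating scale that FIRST publishes for CVSS v3.x. A minimal sketch of that mapping follows, assuming a hypothetical helper named `cvss_to_severity` (the code that actually labeled this dataset is not shown in this notebook) and treating a missing or out-of-range score as `UNKNOWN`.

```python
import math


def cvss_to_severity(score):
    """Map a CVSS v3.x base score to its qualitative severity rating.

    Hypothetical helper for illustration; not the project's own code.
    """
    # Missing scores (None or NaN) cannot be rated
    if score is None or (isinstance(score, float) and math.isnan(score)):
        return 'UNKNOWN'
    # Scores outside the 0.0-10.0 scale are not valid CVSS scores
    if score < 0.0 or score > 10.0:
        return 'UNKNOWN'
    if score == 0.0:
        return 'NONE'
    if score < 4.0:
        return 'LOW'        # 0.1 - 3.9
    if score < 7.0:
        return 'MEDIUM'     # 4.0 - 6.9
    if score < 9.0:
        return 'HIGH'       # 7.0 - 8.9
    return 'CRITICAL'       # 9.0 - 10.0


for value in (None, 0.0, 3.9, 6.1, 7.5, 9.8):
    print(value, cvss_to_severity(value))
```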

In [12]:
date_cols = df.select_dtypes('datetime64[ns, UTC]').columns
df[date_cols].describe().T

| | count | mean | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| date_public | 10678 | 2016-01-07 08:06:07.169118464+00:00 | 1990-08-14 04:00:00+00:00 | 2008-12-01 00:00:00+00:00 | 2018-04-19 12:00:00+00:00 | 2022-05-24 23:53:15+00:00 | 2025-01-28 22:15:15.860000+00:00 |
| exploitation_date_0 | 10891 | 2016-10-06 01:45:41.016527616+00:00 | 1990-05-19 00:00:00+00:00 | 2009-04-20 00:00:00+00:00 | 2019-10-15 06:26:08+00:00 | 2023-01-31 22:51:30+00:00 | 2025-01-13 10:07:25+00:00 |
| exploitation_date_30 | 4654 | 2023-05-27 01:51:19.655350272+00:00 | 2021-05-14 00:00:00+00:00 | 2022-05-26 11:11:19+00:00 | 2023-07-05 17:10:29.500000+00:00 | 2024-05-03 17:20:47.750000128+00:00 | 2025-02-12 10:07:25+00:00 |
| exploitation_date_60 | 4654 | 2023-06-26 01:51:19.655350272+00:00 | 2021-06-13 00:00:00+00:00 | 2022-06-25 11:11:19+00:00 | 2023-08-04 17:10:29.500000+00:00 | 2024-06-02 17:20:47.750000128+00:00 | 2025-03-14 10:07:25+00:00 |


In [19]:
cvss_greater_than_or_equal_to_7 = df[df['cvss'] >= 7.0]
cvss_greater_than_or_equal_to_7_count = len(cvss_greater_than_or_equal_to_7)
nonnull_epss = cvss_greater_than_or_equal_to_7[cvss_greater_than_or_equal_to_7['epss_0'].notnull()]
valid_epss_threshold = nonnull_epss[nonnull_epss['epss_0'] >= 0.5]

text = f"""
    There are \033[32;1m{cvss_greater_than_or_equal_to_7_count}\033[0m CVEs with a CVSS score greater than or
    equal to 7.0, \033[32;1m{(cvss_greater_than_or_equal_to_7_count / df['cvss'].count()) * 100:.2f}%\033[0m of the total number
    of CVSS scores contained in the dataset. Of these CVEs, \033[32;1m{len(nonnull_epss)}\033[0m have valid
    EPSS scores. \033[32;1m{len(valid_epss_threshold)}\033[0m of the CVEs have EPSS scores that fall above a threshold
    of 0.5, meaning those with a 50% chance of being exploited in the 30 days after said scores were calculated.
    This is \033[32;1m{(len(valid_epss_threshold) / len(nonnull_epss)) * 100:.2f}%\033[0m of our valid EPSS scores, and \033[32;1m{(len(valid_epss_threshold) / df['epss_0'].count()) * 100:.2f}%\033[0m
    of the total number of EPSS scores contained in the dataset. This suggests that most CVEs, even those with severe CVSS scores,
    are not likely to be exploited in the 30 days after.
"""

# Collapse the newlines and indentation of the triple-quoted string so that
# fill() does not carry runs of spaces into the wrapped output
output = txt.fill(
    ' '.join(text.split()),
    initial_indent='    ',
    width=75,
    break_long_words=False,
)

print(output)

    There are 6706 CVEs with a CVSS score greater than or equal
to 7.0, 65.17% of the total number of CVSS scores contained in
the dataset. Of these CVEs, 2826 have valid EPSS scores.
180 of the CVEs have EPSS scores that fall above a threshold of
0.5, meaning those with a 50% chance of being exploited in the 30 days
after said scores were calculated. This is 6.37% of our valid
EPSS scores, and 4.54% of the total number of EPSS scores
contained in the dataset. This suggests that most CVEs, even those with
severe CVSS scores, are not likely to be exploited in the 30 days after.


## Verifying Distributional Normality
### Histograms, Q-Q Plots, and Scatterplots
Judging by the histograms and Q-Q plots, none of the project's variables appear to be normally distributed; most look heavily right-skewed, roughly exponential, and in some cases perhaps bimodal. Appearances alone aren't enough, though, so in the next section we'll use several statistical tests to check the shape of the data formally.

If those tests confirm that the data is not normal, we can perform correlation analysis with Spearman's rank correlation coefficient and Kendall's tau, neither of which assumes normally distributed data, and begin to flesh out an analysis that will help validate the effectiveness of the project's model against the CVSS and EPSS parameters. Scatterplots of the relationships between variables will help visualize any patterns that emerge.
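
As a rough illustration of the rank-based correlation measures mentioned above, the sketch below computes Spearman's rho and Kendall's tau with `scipy.stats` on synthetic data. The frame `sample` and its columns `x` and `y` are made up for the example; the real analysis would pass columns such as `df['cvss']` and `df['epss_0']` instead.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data: right-skewed x and a monotonic, non-linear y
rng = np.random.default_rng(seed=0)
x = rng.exponential(scale=2.0, size=200)
y = x ** 2 + rng.normal(scale=0.5, size=200)
sample = pd.DataFrame({'x': x, 'y': y})
sample.loc[::25, 'y'] = np.nan  # mimic the null values in the real data

# Keep only rows where both values are present before testing
pairs = sample.dropna()

rho, rho_p = stats.spearmanr(pairs['x'], pairs['y'])
tau, tau_p = stats.kendalltau(pairs['x'], pairs['y'])

print(f'Spearman rho = {rho:.3f} (p = {rho_p:.3g})')
print(f'Kendall tau  = {tau:.3f} (p = {tau_p:.3g})')
```

Both measures work on ranks, so they capture monotonic relationships without assuming normality. The real columns use the nullable `Float64` dtype, so converting them with `.astype('float64')` after dropping nulls may be needed before handing them to SciPy.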

In [None]:
# Load Plotter
plotter = Plotter(figsize=(4, 3), save_fig=True)

# Plot histograms
plotter.plot_histogram(
    df,
    column='cvss',
    title='Frequency of CVEs with varying CVSS scores',
    xlabel='CVSS Score',
    ylabel='Number of CVEs',
    bins=50,
    alpha=0.5,
    ylim=(0, 2000)
)
plotter.plot_histogram(
    df=df[df['cvss_severity'] != 'UNKNOWN'],
    column='cvss_severity',
    title='Frequency of CVSS Severity Levels',
    xlabel='CVSS Severity',
    ylabel='Number of CVEs',
    bins=5,
    alpha=0.5,
    ylim=(0, 5000)
)
plotter.plot_histogram(
    df,
    column='exploit_count',
    title='Distribution of Exploit Counts',
    xlabel='Exploit Count',
    ylabel='Number of CVEs',
    bins=10,
    alpha=0.5,
    xlim=(0, 10),
    ylim=(0, 5000),
    transform='log',
)
plotter.plot_histogram(
    df,
    column='days_to_poc_exploit',
    title='Time to First Exploit Code Publishing',
    xlabel='Days From Public Disclosure',
    ylabel='Number of CVEs',
    bins=100,
    alpha=0.5,
    xlim=(-500, 3000)
)
plotter.plot_histogram(
    df,
    column='epss_0',
    title='Histogram of EPSS Scores on Exploit Code Publish Date',
    xlabel='EPSS Score',
    ylabel='Number of CVEs',
    bins=100,
    alpha=0.5,
    xlim=(0.001, 1.000),
    ylim=(0, 200)
)
plotter.plot_histogram(
    df,
    column='epss_30',
    title='Histogram of EPSS Scores 30 Days After Exploit Code Publish Date',
    xlabel='EPSS Score',
    ylabel='Number of CVEs',
    bins=100,
    alpha=0.5,
    xlim=(0.001, 1.000),
    ylim=(0, 200)
)
plotter.plot_histogram(
    df,
    column='epss_60',
    title='Histogram of EPSS Scores 60 Days After Exploit Code Publish Date',
    xlabel='EPSS Score',
    ylabel='Number of CVEs',
    bins=100,
    alpha=0.5,
    xlim=(0.001, 1.000),
    ylim=(0, 200)
)

In [None]:
# Configure Q-Q plots
plotter.plot_qq(
    df,
    column='cvss',
    dist='norm',
    title='QQ Plot of CVSS Scores',
    xlabel='Theoretical Quantiles',
    ylabel='CVSS Scores',
    color='blue',
)
plotter.plot_qq(
    data=df[df['cvss_severity'] != 'UNKNOWN'],
    column='cvss',
    dist='norm',
    title='QQ Plot of CVSS Severity Levels',
    xlabel='Theoretical Quantiles',
    ylabel='CVSS Severity Levels',
    color='blue',
)
plotter.plot_qq(
    df,
    column='exploit_count',
    dist='norm',
    title='QQ Plot of Exploit Count',
    xlabel='Theoretical Quantiles',
    ylabel='Number of Exploits',
    color='blue'
)
plotter.plot_qq(
    df,
    column='days_to_poc_exploit',
    dist='norm',
    title='QQ Plot of the Interval Between CVE Disclosure and Exploit Code Publish Date',
    xlabel='Theoretical Quantiles',
    ylabel='Days to Exploit Code Publishing',
    color='blue',
)
plotter.plot_qq(
    df,
    column='epss_0',
    dist='norm',
    title='QQ Plot of EPSS Scores on Exploit Code Publish Date',
    xlabel='Theoretical Quantiles',
    ylabel='EPSS Scores',
    color='blue',
)
plotter.plot_qq(
    df,
    column='epss_30',
    dist='norm',
    title='QQ Plot of EPSS Scores 30 Days After Exploit Code Publish Date',
    xlabel='Theoretical Quantiles',
    ylabel='EPSS Scores',
    color='blue',
)
plotter.plot_qq(
    df,
    column='epss_60',
    dist='norm',
    title='QQ Plot of EPSS Scores 60 Days After Exploit Code Publish Date',
    xlabel='Theoretical Quantiles',
    ylabel='EPSS Scores',
    color='blue',
)

### Statistical Normality Testing

In [23]:
df[(df['origin'] == 'kev') | (df['origin'] == 'poc_kev')].count()

cve_id                  730
date_public             730
origin                  730
cvss                    710
cvss_severity           730
cvss_src                710
exploit_count           730
days_to_poc_exploit     730
exploitation_date_0     730
epss_0                  700
percentile_0            628
exploitation_date_30    730
epss_30                 703
percentile_30           702
exploitation_date_60    730
epss_60                 712
percentile_60           707
change_0_to_30          700
change_30_to_60         700
change_0_60             700
dtype: int64

In [None]:
# Variables to test
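cvss = df['cvss']
# The frame has no 'epss' or 'days_to_exploit' column; epss_0 (the score on the
# exploit code publish date) is used here, and epss_30 / epss_60 can be tested the same way
epss = df['epss_0']
days_to_exploit = df['days_to_poc_exploit']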
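
The notebook does not name the normality tests it intends to run, so the sketch below assumes two common choices from `scipy.stats`: the Shapiro-Wilk test and the D'Agostino-Pearson test. It runs on a synthetic right-skewed sample (`skewed`, an illustrative stand-in); the real variables would need their nulls dropped and a conversion to plain `float64` first.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample standing in for a column such as epss_0
rng = np.random.default_rng(seed=0)
skewed = rng.exponential(scale=1.0, size=500)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(skewed)

# D'Agostino-Pearson: combines skewness and kurtosis into a single test statistic
dagostino_stat, dagostino_p = stats.normaltest(skewed)

alpha = 0.05  # significance level, chosen here for illustration
for name, p_value in [('Shapiro-Wilk', shapiro_p), ("D'Agostino-Pearson", dagostino_p)]:
    verdict = 'reject normality' if p_value < alpha else 'fail to reject normality'
    print(f'{name}: p = {p_value:.3g} -> {verdict}')
```

A small p-value is evidence against normality, which would support using the rank-based correlation measures rather than Pearson's coefficient.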