# Exploratory Data Analysis Notebook
This notebook covers the exploratory data analysis (EDA) of the large and focused final datasets. General facts about the data are recorded in the cells below and, where warranted, may also appear in the README file. We'll use pandas and NumPy for data processing and analysis, and `matplotlib` and `seaborn` for data visualization. The first step is to load the libraries and the data we're interested in. The data has already been cleaned, but some attributes still contain null values that will have to be handled during certain operations.

In [43]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plot
import seaborn as sb
import textwrap as txt

df = pd.read_parquet(path='../data/processed/composite/dataset_focused.parquet')

First, we look at some general facts about the data we're interested in.

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6259 entries, 0 to 6258
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   cve_id             6259 non-null   string             
 1   exploit_count      6259 non-null   Int64              
 2   exploitation_date  5506 non-null   datetime64[ns, UTC]
 3   cvss               1565 non-null   Float64            
 4   cvss_severity      1565 non-null   category           
 5   date_published     5924 non-null   datetime64[ns, UTC]
 6   epss               2253 non-null   Float64            
 7   percentile         2253 non-null   Float64            
dtypes: Float64(3), Int64(1), category(1), datetime64[ns, UTC](2), string(1)
memory usage: 373.2 KB
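
Since several attributes are only partially populated, it can help to tabulate the missing share per column before running operations that are sensitive to nulls. The following is a minimal sketch on a small made-up frame; the `toy` data and the `null_summary` helper are illustrative assumptions, not part of the notebook, though the same helper could be applied to `df`.

```python
import numpy as np
import pandas as pd


def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and null percentage for each column."""
    nulls = frame.isna().sum()
    return pd.DataFrame({
        'null_count': nulls,
        'null_pct': (nulls / len(frame) * 100).round(2),
    }).sort_values('null_count', ascending=False)


# Hypothetical stand-in for the real data: four records, two partially
# populated columns. The identifiers are placeholders, not real CVE IDs.
toy = pd.DataFrame({
    'cve_id': ['CVE-A', 'CVE-B', 'CVE-C', 'CVE-D'],
    'cvss': [9.8, np.nan, np.nan, 7.5],
    'epss': [0.02, 0.5, np.nan, np.nan],
})

summary = null_summary(toy)
print(summary)
```

Columns with a high `null_pct` are the ones where filtering on non-null values first, as the later cells do for EPSS, changes the denominator of any percentage that is reported.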


In [62]:
print(f'There are \033[32;1m{df.shape[0]}\033[0m CVE records in the data frame, each with \033[32;1m{df.shape[1]}\033[0m attributes.')
print('Some general statistics about the numerical columns:')

df.describe().T

There are 6259 CVE records in the data frame, each with 8 attributes.
Some general statistics about the numerical columns:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| exploit_count | 6259.0 | 2.292858 | 7.765336 | 1.0 | 1.0 | 1.0 | 1.0 | 391.0 |
| cvss | 1565.0 | 7.767987 | 1.734501 | 0.0 | 6.6 | 7.8 | 9.1 | 10.0 |
| epss | 2253.0 | 0.047891 | 0.150789 | 0.00042 | 0.00048 | 0.00358 | 0.01404 | 0.97516 |
| percentile | 2253.0 | 0.459049 | 0.308128 | 0.043279 | 0.16754 | 0.39478 | 0.74964 | 0.99991 |


In [59]:
cvss_greater_than_or_equal_to_7 = df[df['cvss'] >= 7.0]
cvss_greater_than_or_equal_to_7_count = len(cvss_greater_than_or_equal_to_7)
nonnull_epss = cvss_greater_than_or_equal_to_7[cvss_greater_than_or_equal_to_7['epss'].notnull()]
valid_epss_threshold = nonnull_epss[nonnull_epss['epss'] >= 0.5]

output = txt.fill(
    f"There are \033[32;1m{cvss_greater_than_or_equal_to_7_count}\033[0m CVEs with a CVSS score greater than or equal to 7.0, \033[32;1m{(cvss_greater_than_or_equal_to_7_count / df['cvss'].count()) * 100:.2f}%\033[0m of the total number of CVSS scores contained in the dataset. Of these CVEs, \033[32;1m{len(nonnull_epss)}\033[0m have valid EPSS scores. \033[32;1m{len(valid_epss_threshold)}\033[0m of the CVEs have EPSS scores that fall above a threshold of 0.5, meaning those with a 50% chance of being exploited in the 30 days after said scores were calculated. This is \033[32;1m{(len(valid_epss_threshold) / len(nonnull_epss)) * 100:.2f}%\033[0m of our valid EPSS scores, and \033[32;1m{(len(valid_epss_threshold) / df['epss'].count()) * 100:.2f}%\033[0m of the total number of EPSS scores contained in the dataset. This suggests that most CVEs, even those with severe CVSS scores, are not likely to be exploited in the 30 days after.",
    initial_indent='    ',
    width=75,
    break_long_words=False,
)

print(output)

    There are 1131 CVEs with a CVSS score greater than or equal
to 7.0, 72.27% of the total number of CVSS scores contained in
the dataset. Of these CVEs, 694 have valid EPSS scores.
15 of the CVEs have EPSS scores that fall above a threshold of
0.5, meaning those with a 50% chance of being exploited in the 30 days
after said scores were calculated. This is 2.16% of our valid
EPSS scores, and 0.67% of the total number of EPSS scores
contained in the dataset. This suggests that most CVEs, even those with
severe CVSS scores, are not likely to be exploited in the 30 days after.


## Verifying Distributional Normality
### Histogram Plots
From the histogram plots, the CVSS scores appear roughly normal (bell-shaped), whereas the days to patch and days to exploit do not: they look heavily right-skewed and even bimodal, respectively. We can't rely on appearances alone, though, so in the next section we'll use several statistical tests to check the shape of each distribution mathematically. Each histogram is overlaid with a kernel density estimate (KDE) that approximates the shape of the distribution. Taken together, the plots show that the test data, though theoretically continuous, has large gaps where certain possible values are not represented. This makes the results of the normality tests that follow less reliable, since those tests all assume continuous data. The issue may largely fall away when we test our actual dataset, given its larger sample size. For now, rank-based non-parametric methods such as Spearman's correlation and Kendall's tau can handle discrete variables, but we'll still go through normality testing to see how the discreteness of our current test data affects the results.
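
To make the two kinds of checks mentioned above concrete, here is a minimal sketch using `scipy.stats` on synthetic data. The generated sample and the small `pairs` frame are assumptions for illustration only, and the notebook's own tests may differ. It runs a Shapiro-Wilk normality test on a right-skewed, discretized sample, then computes Spearman's and Kendall's rank correlations on paired values that contain ties and a missing entry.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(seed=0)

# Synthetic, illustrative data only: a right-skewed variable rounded to whole
# numbers, so it is discrete and contains many ties.
skewed = np.round(rng.exponential(scale=30.0, size=300))

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed,
# so a small p-value is evidence against normality.
w_stat, w_p = stats.shapiro(skewed)
print(f'Shapiro-Wilk: W={w_stat:.3f}, p={w_p:.3g}')

# Rank correlations need complete pairs, so drop rows with a missing value.
pairs = pd.DataFrame({
    'x': [1.0, 2.0, 2.0, 3.0, 4.0, np.nan],
    'y': [0.1, 0.2, 0.2, 0.3, 0.9, 0.5],
}).dropna()

# Both functions return a (statistic, p-value) pair; ties are handled by
# average ranks (Spearman) and the tau-b variant (Kendall's default).
rho, rho_p = stats.spearmanr(pairs['x'], pairs['y'])
tau, tau_p = stats.kendalltau(pairs['x'], pairs['y'])
print(f"Spearman's rho={rho:.3f}, Kendall's tau={tau:.3f}")
```

Because rank-based statistics depend only on the ordering of values, gaps between represented values do not affect them the way they affect tests that assume a continuous distribution.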

In [None]:
# Variables to test
cvss = df['cvss']
epss = df['epss']
days_to_exploit = df['days_to_exploit']
dtfe = df['days_to_first_exploit']
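
The two day-count columns selected above are not among the eight columns reported by `df.info()` earlier, so they presumably come from a separate preparation step. One plausible derivation, offered purely as an assumption, is the number of whole days between `date_published` and `exploitation_date`. The sketch below uses a made-up frame (`toy`) to show the arithmetic and how missing dates propagate.

```python
import pandas as pd

# Hypothetical records; dates are timezone-aware (UTC) like the real columns.
toy = pd.DataFrame({
    'date_published': pd.to_datetime(
        ['2021-01-01', '2021-03-10', None], utc=True),
    'exploitation_date': pd.to_datetime(
        ['2021-01-11', '2021-03-05', '2021-06-01'], utc=True),
})

# Assumed definition: whole days from publication to exploitation.
# Negative values mean exploitation was recorded before publication;
# a missing date on either side yields a missing result.
toy['days_to_exploit'] = (
    toy['exploitation_date'] - toy['date_published']
).dt.days

print(toy['days_to_exploit'].tolist())
```

If the real definition differs (for example, measuring from the first of several exploits per CVE), only the subtraction step would need to change.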