# Building the Master Dataset
This notebook tracks the development of the master dataset on which all of our analyses will be based. It merges data from MITRE's CVE Project and MITRE's list of IoT CVEs, the NVD, FIRST's EPSS scores, and aggregated nation-state attack (NSA) data, along with a handful of IoT CVEs found in various articles (e.g. Check Point). The goal is a dataset comprehensive enough to support a holistic metric that helps the (industrial) internet-of-things industry strengthen its cybersecurity against advanced persistent threats (APTs).

Merging these sources can introduce duplicate or empty values, which are dealt with as they arise. The actual analysis of the resulting dataset is undertaken in a separate notebook (`master_analysis`) for clarity and in the interest of a separation of concerns.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the datasets
cves = pd.read_parquet(path='../data/CVE_Project/cvelistV5/cve_list_v3.parquet')
epss = pd.read_parquet(path='../data/EPSS/epss_data.parquet')
iots = pd.read_parquet(path='../data/MITRE/mitre_iot_cves_v1.parquet')
nvd = pd.read_parquet(path='../data/NVD/nvd_data_v1.parquet')
nsa = pd.read_parquet(path='../data/nsa_data_v2.parquet')

In [13]:
cves.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250171 entries, 0 to 250170
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype              
---  ------       --------------   -----              
 0   cve_id       250171 non-null  string             
 1   description  250171 non-null  string             
 2   date_known   194415 non-null  datetime64[ns, UTC]
 3   cvss_v3      69256 non-null   float64            
 4   cvss_v3_cat  69256 non-null   category           
 5   vendor       250023 non-null  string             
 6   product      250057 non-null  string             
dtypes: category(1), datetime64[ns, UTC](1), float64(1), string(4)
memory usage: 11.7 MB


In [14]:
epss.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260010 entries, 0 to 260009
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   cve_id      260010 non-null  string 
 1   epss        260010 non-null  float64
 2   percentile  260010 non-null  float64
dtypes: float64(2), string(1)
memory usage: 6.0 MB


In [15]:
iots.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1088 entries, 0 to 1087
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   cve_id       1088 non-null   string
 1   description  1088 non-null   string
dtypes: string(2)
memory usage: 17.1 KB


In [16]:
nvd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1088 entries, 0 to 1087
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   cve_id      1088 non-null   string             
 1   date_known  1088 non-null   datetime64[ns, UTC]
 2   cvss_v3     1088 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(1), string(1)
memory usage: 25.6 KB


In [17]:
nsa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   attack_name            87 non-null     category      
 1   cve_list_date          67 non-null     datetime64[ns]
 2   date_of_first_exploit  67 non-null     datetime64[ns]
 3   patch_release_date     67 non-null     datetime64[ns]
 4   cvss                   67 non-null     float64       
 5   cvss_status            67 non-null     category      
 6   days_to_patch_release  67 non-null     Int64         
 7   days_to_first_exploit  67 non-null     Int64         
 8   year_start             80 non-null     Int64         
 9   year_end               80 non-null     Int64         
 10  attribution_group      48 non-null     category      
 11  attribution_state      79 non-null     category      
 12  cve_id                 83 non-null     string        
 13  descrip

In [18]:
cves.head()

Unnamed: 0,cve_id,description,date_known,cvss_v3,cvss_v3_cat,vendor,product
0,CVE-1999-0001,ip_input.c in BSD-derived TCP/IP implementatio...,2000-02-04 05:00:00+00:00,,,,
1,CVE-1999-0002,Buffer overflow in NFS mountd gives root acces...,1999-09-29 04:00:00+00:00,,,,
2,CVE-1999-0003,Execute commands as root via buffer overflow i...,1999-09-29 04:00:00+00:00,,,,
3,CVE-1999-0004,"MIME buffer overflow in email clients, e.g. So...",2000-02-04 05:00:00+00:00,,,,
4,CVE-1999-0005,Arbitrary command execution via IMAP buffer ov...,1999-09-29 04:00:00+00:00,,,,


In [19]:
epss.head()

Unnamed: 0,cve_id,epss,percentile
0,CVE-1999-0001,0.00383,0.73343
1,CVE-1999-0002,0.0208,0.89305
2,CVE-1999-0003,0.04409,0.92563
3,CVE-1999-0004,0.00917,0.83132
4,CVE-1999-0005,0.91963,0.99


In [20]:
iots.head()

Unnamed: 0,cve_id,description
0,CVE-2024-29195,The azure-c-shared-utility is a C library for ...
1,CVE-2024-29055,Microsoft Defender for IoT Elevation of Privil...
2,CVE-2024-29054,Microsoft Defender for IoT Elevation of Privil...
3,CVE-2024-29053,Microsoft Defender for IoT Remote Code Executi...
4,CVE-2024-21324,Microsoft Defender for IoT Elevation of Privil...


In [21]:
nsa.head()

Unnamed: 0,attack_name,cve_list_date,date_of_first_exploit,patch_release_date,cvss,cvss_status,days_to_patch_release,days_to_first_exploit,year_start,year_end,attribution_group,attribution_state,cve_id,description
0,BlackEnergy Attack on Ukraine,2014-01-31,2015-12-23,2014-07-01,5.0,medium,151,691,2015,2015,Sandworm,Russia,CVE-2014-0630,EMC Documentum TaskSpace (TSP) 6.7SP1 before P...
1,BlackEnergy Attack on Ukraine,2014-10-07,2015-12-23,2015-01-12,5.0,medium,97,442,2015,2015,Sandworm,Russia,CVE-2014-4166,Cross-site scripting (XSS) vulnerability in th...
2,BlackEnergy Attack on Ukraine,2014-10-14,2015-12-23,2014-12-10,7.5,high,57,435,2015,2015,Sandworm,Russia,CVE-2014-6485,Unspecified vulnerability in Oracle Java SE 8u...
3,BlackEnergy Attack on Ukraine,2015-01-27,2015-12-23,2015-01-27,6.8,medium,0,330,2015,2015,Sandworm,Russia,CVE-2015-0057,win32k.sys in the kernel-mode drivers in Micro...
4,BlackEnergy Attack on Ukraine,2015-04-15,2015-12-23,2015-04-15,7.5,high,0,252,2015,2015,Sandworm,Russia,CVE-2015-1673,The Windows Forms (aka WinForms) libraries in ...


## Merge Strategy
The goal of this merge strategy is to combine data from five distinct dataframes: `iots`, `cves`, `epss`, `nvd`, and `nsa`. Since we're focused on IoT CVEs, the merge begins with a series of left merges into the `iots` dataframe to preserve its integrity as the base of our master data. We then finalize the working copy of the master data through an outer merge with `nsa`, since it contains relevant NSA data that we don't want to filter out. Whenever both tables share a column name other than the merge key, pandas keeps both copies and appends `_x` (left table) and `_y` (right table) to their names. Because of this, it may be necessary to combine these columns and drop the duplicates after each merge.
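The suffix behaviour described above can be reproduced on a pair of toy frames. This is a minimal sketch: the frames `left` and `right` and all of their values are made up for illustration and only stand in for `iots` and `cves`.

```python
import pandas as pd

# Toy stand-ins for `iots` and `cves` (all values are hypothetical)
left = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0002"],
    "description": ["left text", None],
})
right = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0002"],
    "description": ["right text", "only on the right"],
    "vendor": ["Acme", "Globex"],
})

merged = left.merge(right, on="cve_id", how="left")

# `description` exists in both frames but is not the merge key,
# so pandas keeps both copies and suffixes them with _x and _y.
print(merged.columns.tolist())
```

The `vendor` column exists only on the right-hand side, so it is carried over without a suffix.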

In [2]:
# Helper function to combine columns and drop their duplicates
def combine_and_drop(df, cols: dict):
    """
    Combine pairs of overlapping columns and drop the leftovers.

    `cols` maps the name of each resultant column to a list of the two
    source columns to combine, e.g.
    {'description': ['description_x', 'description_y']}.
    """
    # Loop through dictionary to combine columns
    for result, source in cols.items():
        if "date_known" in result:
            # Take earliest date between the two
            df[result] = df[[source[0], source[1]]].min(axis=1)
        else:
            # Prefer the first source; fall back to the second where it is null
            df[result] = df[source[0]].combine_first(df[source[1]])
    # Drop the source columns, keeping any that double as the resultant column
    df = df.drop(
        columns=[
            col for result, source in cols.items() for col in source if col != result
        ]
    )
    return df
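
The two combination rules used by `combine_and_drop` can be seen in isolation on a toy frame. This is only a sketch with made-up values; the toy date columns are timezone-naive for simplicity, whereas the notebook's `date_known` column is timezone-aware.

```python
import pandas as pd

# Hypothetical suffixed columns, as they might look right after a merge
toy = pd.DataFrame({
    "cvss_v3_x": [7.5, None, None],
    "cvss_v3_y": [9.8, 5.0, None],
    "date_known_x": pd.to_datetime(["2020-01-10", None, "2021-06-01"]),
    "date_known_y": pd.to_datetime(["2020-01-05", "2019-03-02", None]),
})

# Rule 1: keep the left value and fall back to the right one where it is null
cvss = toy["cvss_v3_x"].combine_first(toy["cvss_v3_y"])

# Rule 2: take the earliest of the two dates; nulls are skipped by default
date_known = toy[["date_known_x", "date_known_y"]].min(axis=1)

print(cvss.tolist())
print(date_known.tolist())
```

Note that `combine_first` never overwrites a non-null left value, so the first row keeps `7.5` even though the right-hand column holds a different score.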

### Merge with `cves`

In [3]:
df = iots.merge(cves, on='cve_id', how='left')

# Dictionary of columns to combine
cols_to_combine = {
    'description': ['description_x', 'description_y']
}

df = combine_and_drop(df, cols_to_combine)

### Merge with `epss`

In [4]:
df = df.merge(epss, on='cve_id', how='left')

### Merge with `nvd`

In [5]:
df = df.merge(nvd, on='cve_id', how='left')

cols_to_combine = {
    'date_known': ['date_known_x', 'date_known_y'],
    'cvss_v3': ['cvss_v3_x', 'cvss_v3_y']
}

df = combine_and_drop(df, cols_to_combine)

### Merge with `nsa`

In [6]:
df = df.merge(nsa, on='cve_id', how='outer')

cols_to_combine = {
    'description': ['description_x', 'description_y'],
    'cvss_v3': ['cvss_v3', 'cvss'],
    'cvss_v3_cat': ['cvss_v3_cat', 'cvss_status'],
    'date_known': ['date_known', 'cve_list_date']
}

df = combine_and_drop(df, cols_to_combine)

## Re-Merge Procedure
Because the outer merge with `nsa` adds observations that were not in `iots`, we need to re-merge `cves` and `epss` into the dataset to capture CVSS and EPSS scores, dates, vendors, and products for those new rows (if available). If `cves` doesn't have this information, we'll update the API caller in `nvd_extraction`, pass in the new list of CVEs in this master dataset, and then merge an updated version of `nvd_data` back into the master dataset.
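
To illustrate why a re-merge is needed, the sketch below uses three hypothetical frames (`base`, `attacks`, `scores`) in place of the master data, `nsa`, and `epss`. The `indicator=True` option of `merge` is used here only to show which rows the outer merge introduced; the notebook itself does not use it.

```python
import pandas as pd

# Hypothetical stand-ins: `base` already carries scores, `attacks` adds a new CVE
base = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0002"],
    "epss": [0.10, 0.20],
})
attacks = pd.DataFrame({
    "cve_id": ["CVE-0000-0002", "CVE-0000-0003"],
    "attack_name": ["Example A", "Example B"],
})
scores = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"],
    "epss": [0.10, 0.20, 0.30],
})

# The outer merge keeps CVEs from both sides; `_merge` records their origin
merged = base.merge(attacks, on="cve_id", how="outer", indicator=True)
new_rows = merged[merged["_merge"] == "right_only"]  # rows with no score yet

# Re-merge the score table and fill the gaps left by the outer merge
remerged = merged.drop(columns="_merge").merge(scores, on="cve_id", how="left")
remerged["epss"] = remerged["epss_x"].combine_first(remerged["epss_y"])
remerged = remerged.drop(columns=["epss_x", "epss_y"])

print(new_rows["cve_id"].tolist())
print(remerged)
```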

In [7]:
# Remerge CVEs
df = df.merge(cves, on='cve_id', how='left')

cols_to_combine = {
    'cvss_v3_cat': ['cvss_v3_cat_x', 'cvss_v3_cat_y'],
    'vendor': ['vendor_x', 'vendor_y'],
    'product': ['product_x', 'product_y'],
    'date_known': ['date_known_x', 'date_known_y'],
    'cvss_v3': ['cvss_v3_x', 'cvss_v3_y'],
    'description': ['description_x', 'description_y']
}

df = combine_and_drop(df, cols_to_combine)

In [8]:
# Remerge EPSS
df = df.merge(epss, on='cve_id', how='left')

cols_to_combine = {
    'epss': ['epss_x', 'epss_y'],
    'percentile': ['percentile_x', 'percentile_y']
}

df = combine_and_drop(df, cols_to_combine)

This is the same CVE ID (`CVE-2022-26658`) that doesn't exist in MITRE's CVE Project, the NVD, VulnDB, or CVEFeed.io. MITRE's website says that this ID is `reserved`, meaning that it hasn't been mapped to an actual vulnerability yet. Curiously, it still has a list date, patch release date, first exploitation date, CVSS score, and an association with Volt Typhoon. Obviously, this is something we'll have to look into further. It is likely the only reserved CVE ID in our dataset, since only reserved IDs are missing descriptions and every other ID in our dataset has one.
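
One way to check that claim is to list every row that has a CVE ID but no description. The sketch below does this on a toy frame with made-up IDs; on the real data the same filter would be applied to `df`.

```python
import pandas as pd

# Hypothetical rows: one complete, one with an ID but no description, one with neither
toy = pd.DataFrame({
    "cve_id": pd.array(["CVE-0000-0001", "CVE-0000-0002", None], dtype="string"),
    "description": pd.array(["Some text", None, None], dtype="string"),
})

# Candidates for reserved IDs: the ID is present but the description is missing
undescribed = toy[toy["cve_id"].notna() & toy["description"].isna()]
print(undescribed["cve_id"].tolist())
```

If the filter returns a single ID on the real data, that supports the assumption that only one reserved ID is present.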

In [9]:
# Get rid of the nonexistent vulnerability
df = df.drop(df[df['cve_id'] == 'CVE-2022-26658'].index)

## Adding CVEs from Check Point Article
Since $11$ CVE IDs are missing CVSS scores, we'll take those $11$ observations, save them into their own small dataset, import it into `nvd_extraction`, and call the NVD's API to grab what we need. At the same time, we'll add the $2$ CVEs found in the Check Point article ([reread it here](https://blog.checkpoint.com/security/the-tipping-point-exploring-the-surge-in-iot-cyberattacks-plaguing-the-education-sector/)) that weren't in our data already because MITRE doesn't have their CVSS scores.

In [15]:
count = len(df[(df['cve_id'].notnull()) & (df['cvss_v3'].isnull())])
print(f'We need to gather CVSS scores for {count} CVEs from the NVD.')

We need to gather CVSS scores for 11 CVEs from the NVD.


In [18]:
# Create Check Point CVE dataset
cp = {
    'cve_id': [
        'CVE-2015-2051',
        'CVE-2016-6277',
        'CVE-2022-37061'
    ]
}
article_cves = pd.DataFrame(cp)

# Pull out mini dataset
missing_cves = df[(df['cve_id'].notnull()) & (df['cvss_v3'].isnull())]
missing_cves = missing_cves['cve_id']
missing_cves = missing_cves.to_frame()
missing_cves = pd.concat([missing_cves, article_cves], ignore_index=True)

# Save the mini dataset
# missing_cves.to_parquet(path='../data/miniset_cves.parquet')

In [10]:
# Loading in the mini response taken from NVD
mini = pd.read_parquet(path='../data/NVD/mini_nvd_response.parquet')

In [12]:
# Remerge NVD
df = df.merge(mini, on='cve_id', how='left')

cols_to_combine = {
    'date_known': ['date_known_x', 'date_known_y'],
    'cvss_v3': ['cvss_v3_x', 'cvss_v3_y']
}

df = combine_and_drop(df, cols_to_combine)

## Standardizing Null Values

In [14]:
# Treat the literal string 'n/a' in these columns as a missing value
cols_of_int = ['vendor', 'product']
df[cols_of_int] = df[cols_of_int].replace('n/a', pd.NA)

## Validating CVSS Scores

In [15]:
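def map_cvss_to_category(score):
    """Map a CVSS v3 score to its qualitative severity rating."""
    if score >= 9.0:
        return 'critical'
    elif score >= 7.0:
        return 'high'
    elif score >= 4.0:
        return 'medium'
    elif score > 0.0:
        return 'low'
    elif score == 0.0:
        return 'none'
    # Missing scores (NaN) fail every comparison above and stay uncategorized
    return None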

df['cvss_v3_cat'] = df['cvss_v3'].apply(map_cvss_to_category)

cvss_cats = ['none', 'low', 'medium', 'high', 'critical']
df['cvss_v3_cat'] = pd.Categorical(
    df['cvss_v3_cat'],
    categories=cvss_cats,
    ordered=True
)
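
As a cross-check, the same banding can be expressed with `pd.cut`. This is an alternative sketch, not the method used above, and it assumes scores carry one decimal place (a value such as `3.95` would land in a different band than in `map_cvss_to_category`).

```python
import numpy as np
import pandas as pd

# Boundary values for every band, plus a missing score
scores = pd.Series([0.0, 3.9, 4.0, 6.9, 7.0, 8.9, 9.0, 10.0, np.nan])
labels = ["none", "low", "medium", "high", "critical"]

# Right-closed bins: (-0.1, 0.0], (0.0, 3.9], (3.9, 6.9], (6.9, 8.9], (8.9, 10.0]
cats = pd.cut(scores, bins=[-0.1, 0.0, 3.9, 6.9, 8.9, 10.0], labels=labels)

print(cats.tolist())
print(cats.cat.ordered)  # pd.cut returns an ordered categorical by default
```

Because the result is already an ordered categorical, comparisons such as `cats >= "high"` work without a separate `pd.Categorical` step.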

## Reordering Columns for Readability

In [16]:
new_order = [
    'cve_id',
    'description',
    'epss',
    'percentile',
    'cvss_v3',
    'cvss_v3_cat',
    'date_known',
    'patch_release_date',
    'date_of_first_exploit',
    'days_to_patch_release',
    'days_to_first_exploit',
    'vendor',
    'product',
    'attack_name',
    'year_start',
    'year_end',
    'attribution_group',
    'attribution_state'
]
df = df[new_order]

## Sorting the Data
The following section floats meaningful data to the top and sinks empty values to the bottom. It does this by adding a dummy column that counts the non-null values in each row, sorting on that count in descending order, and breaking ties by `cve_id` in ascending order. The dummy column is dropped once the sort is done.

In [17]:
# Create column of non-null counts
df['nn_count'] = df.notnull().sum(axis=1)

# Sort table
df = df.sort_values(by=['nn_count', 'cve_id'], ascending=[False, True])

# Drop dummy column
df = df.drop(columns=['nn_count'])
df.head(3)

Unnamed: 0,cve_id,description,epss,percentile,cvss_v3,cvss_v3_cat,date_known,patch_release_date,date_of_first_exploit,days_to_patch_release,days_to_first_exploit,vendor,product,attack_name,year_start,year_end,attribution_group,attribution_state
20,CVE-2017-0144,The SMBv1 server in Microsoft Windows Vista SP...,0.96402,0.99603,8.1,high,2017-03-14 00:00:00+00:00,2017-03-14 00:00:00+00:00,2017-05-12 00:00:00+00:00,0,59,Microsoft Corporation,Windows SMB,Dragonfly/Energetic Bear Campaign 3,2022,2022,Dragonfly (Energetic Bear),Russia
21,CVE-2017-0144,The SMBv1 server in Microsoft Windows Vista SP...,0.96402,0.99603,8.1,high,2017-03-14 00:00:00+00:00,2017-03-14 00:00:00+00:00,2017-05-12 00:00:00+00:00,0,59,Microsoft Corporation,Windows SMB,Dragonfly/Energetic Bear Campaign 3,2022,2022,Dragonfly (Energetic Bear),Russia
26,CVE-2017-12074,Directory traversal vulnerability in the SYNO....,0.00062,0.2731,9.8,critical,2017-08-23 00:00:00+00:00,2018-06-27 00:00:00+00:00,2018-05-24 00:00:00+00:00,253,219,Synology,Synology DNS Server,VPNFilter,2018,2018,APT28 (Fancy Bear),Russia
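
The effect of the non-null count sort is easiest to see on a small example. The frame below is hypothetical and the chained `assign` form is just one way to write it; the behaviour matches the two-key sort used above.

```python
import pandas as pd

# Hypothetical rows with 2, 1, and 3 non-null values respectively
toy = pd.DataFrame({
    "cve_id": ["CVE-0000-0003", "CVE-0000-0001", "CVE-0000-0002"],
    "epss": [0.3, None, 0.2],
    "attack_name": [None, None, "Example attack"],
})

ranked = (
    toy.assign(nn_count=toy.notna().sum(axis=1))  # count non-null cells per row
       .sort_values(by=["nn_count", "cve_id"], ascending=[False, True])
       .drop(columns="nn_count")                  # remove the dummy column
       .reset_index(drop=True)
)
print(ranked["cve_id"].tolist())
```

The most complete row comes first even though its ID is not the smallest, because `cve_id` only breaks ties between rows with the same count.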


## Dropping Duplicate Observations

In [18]:
dups = df.duplicated().sum()
print(f'There are {dups} duplicate observations in the dataset.')

# Drop duplicates
df = df.drop_duplicates(keep='first')
dups = df.duplicated().sum()
print(f'After the drop, there are now {dups} duplicate observations in the dataset.')

There are 24 duplicate observations in the dataset.
After the drop, there are now 0 duplicate observations in the dataset.
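
Note that `drop_duplicates` only removes rows that match in every column. Rows that share a `cve_id` but differ elsewhere survive, as the `CVE-2017-12074` rows with different CVSS scores in the `df.head(50)` output show. A minimal sketch for inspecting both kinds of repetition, using made-up data:

```python
import pandas as pd

# Hypothetical rows: two exact copies, one same-ID variant, one unrelated row
toy = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0001", "CVE-0000-0001", "CVE-0000-0002"],
    "cvss_v3": [8.1, 8.1, 7.5, 9.8],
})

# keep=False flags every member of a duplicate group, not just the repeats
exact = toy[toy.duplicated(keep=False)]
same_id = toy[toy.duplicated(subset="cve_id", keep=False)]

print(exact.index.tolist())
print(same_id.index.tolist())
```

Whether same-ID rows should be collapsed is a modelling decision, since each row may correspond to a distinct attack record.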


The following table shows the four observations that do not include CVE IDs and their associated information. If we find them, adding this information will be relatively straightforward.

In [29]:
df[df['cve_id'].isnull()]

Unnamed: 0,cve_id,description,epss,percentile,cvss_v3,cvss_v3_cat,date_known,patch_release_date,date_of_first_exploit,days_to_patch_release,days_to_first_exploit,vendor,product,attack_name,year_start,year_end,attribution_group,attribution_state
1175,,,,,,,NaT,NaT,NaT,,,,,Dragonfly/Energetic Bear Campaign 1,2013,2014,Dragonfly (Energetic Bear),Russia
1176,,,,,,,NaT,NaT,NaT,,,,,Dragonfly/Energetic Bear Campaign 2,2017,2017,Dragonfly (Energetic Bear),Russia
1178,,,,,,,NaT,NaT,NaT,,,,,SolarWinds Orion Supply Chain Attack,2020,2020,APT29 (Cozy Bear),Russia
1177,,,,,,,NaT,NaT,NaT,,,,,Iranian Cyberattacks on Water Systems,2020,2020,,Iran


## Resetting Index

In [19]:
df = df.reset_index(drop=True)

In [20]:
df.head(50)

Unnamed: 0,cve_id,description,epss,percentile,cvss_v3,cvss_v3_cat,date_known,patch_release_date,date_of_first_exploit,days_to_patch_release,days_to_first_exploit,vendor,product,attack_name,year_start,year_end,attribution_group,attribution_state
0,CVE-2017-0144,The SMBv1 server in Microsoft Windows Vista SP...,0.96402,0.99603,8.1,high,2017-03-14 00:00:00+00:00,2017-03-14 00:00:00+00:00,2017-05-12 00:00:00+00:00,0.0,59.0,Microsoft Corporation,Windows SMB,Dragonfly/Energetic Bear Campaign 3,2022.0,2022.0,Dragonfly (Energetic Bear),Russia
1,CVE-2017-12074,Directory traversal vulnerability in the SYNO....,0.00062,0.2731,9.8,critical,2017-08-23 00:00:00+00:00,2018-06-27 00:00:00+00:00,2018-05-24 00:00:00+00:00,253.0,219.0,Synology,Synology DNS Server,VPNFilter,2018.0,2018.0,APT28 (Fancy Bear),Russia
2,CVE-2017-12074,Directory traversal vulnerability in the SYNO....,0.00062,0.2731,8.8,high,2017-08-23 00:00:00+00:00,2018-06-27 00:00:00+00:00,2018-05-24 00:00:00+00:00,307.0,273.0,Synology,Synology DNS Server,VPNFilter,2018.0,2018.0,APT28 (Fancy Bear),Russia
3,CVE-2017-12074,Directory traversal vulnerability in the SYNO....,0.00062,0.2731,7.5,high,2017-08-23 00:00:00+00:00,2018-06-27 00:00:00+00:00,2018-05-24 00:00:00+00:00,147.0,113.0,Synology,Synology DNS Server,VPNFilter,2018.0,2018.0,APT28 (Fancy Bear),Russia
4,CVE-2018-13379,An Improper Limitation of a Pathname to a Rest...,0.96972,0.99773,9.8,critical,2018-09-27 00:00:00+00:00,2018-10-02 00:00:00+00:00,2020-01-30 00:00:00+00:00,5.0,490.0,Fortinet,"Fortinet FortiOS, FortiProxy",Dragonfly/Energetic Bear Campaign 3,2022.0,2022.0,Dragonfly (Energetic Bear),Russia
5,CVE-2020-0601,A spoofing vulnerability exists in the way Win...,0.96964,0.99766,8.4,high,2020-01-14 00:00:00+00:00,2020-01-14 00:00:00+00:00,2020-01-13 00:00:00+00:00,0.0,-1.0,Microsoft,Windows,Dragonfly/Energetic Bear Campaign 3,2022.0,2022.0,Dragonfly (Energetic Bear),Russia
6,CVE-2021-31166,HTTP Protocol Stack Remote Code Execution Vuln...,0.97159,0.99844,7.5,high,2021-03-09 00:00:00+00:00,2021-03-09 00:00:00+00:00,2021-03-09 00:00:00+00:00,0.0,0.0,Microsoft,Windows 10 Version 2004,Dragonfly/Energetic Bear Campaign 3,2022.0,2022.0,Dragonfly (Energetic Bear),Russia
7,CVE-2021-22986,"On BIG-IP versions 16.0.x before 16.0.1.1, 15....",0.97465,0.99969,9.8,critical,2021-03-09 00:00:00+00:00,2021-03-09 00:00:00+00:00,2021-03-09 00:00:00+00:00,0.0,0.0,,BIG-IP; BIG-IQ,Dragonfly/Energetic Bear Campaign 3,2022.0,2022.0,Dragonfly (Energetic Bear),Russia
8,CVE-2014-0630,EMC Documentum TaskSpace (TSP) 6.7SP1 before P...,0.00129,0.48399,5.0,medium,2014-01-31 00:00:00+00:00,2014-07-01 00:00:00+00:00,2015-12-23 00:00:00+00:00,151.0,691.0,,,BlackEnergy Attack on Ukraine,2015.0,2015.0,Sandworm,Russia
9,CVE-2014-4166,Cross-site scripting (XSS) vulnerability in th...,0.00221,0.6039,5.0,medium,2014-10-07 00:00:00+00:00,2015-01-12 00:00:00+00:00,2015-12-23 00:00:00+00:00,97.0,442.0,,,BlackEnergy Attack on Ukraine,2015.0,2015.0,Sandworm,Russia


In [21]:
df.tail(50)

Unnamed: 0,cve_id,description,epss,percentile,cvss_v3,cvss_v3_cat,date_known,patch_release_date,date_of_first_exploit,days_to_patch_release,days_to_first_exploit,vendor,product,attack_name,year_start,year_end,attribution_group,attribution_state
1105,CVE-2019-20473,An issue was discovered on TK-Star Q90 Junior ...,0.00099,0.41836,6.8,medium,2021-02-01 20:13:13+00:00,NaT,NaT,,,,,,,,,
1106,CVE-2020-11624,An issue was discovered in AvertX Auto focus N...,0.00465,0.75759,9.8,critical,2020-07-23 20:02:05+00:00,NaT,NaT,,,,,,,,,
1107,CVE-2020-11915,An issue was discovered in Svakom Siime Eye 14...,0.00088,0.38458,6.8,medium,2021-02-08 01:40:45+00:00,NaT,NaT,,,,,,,,,
1108,CVE-2020-11920,An issue was discovered in Svakom Siime Eye 14...,0.01624,0.87764,9.8,critical,2021-02-08 01:43:04+00:00,NaT,NaT,,,,,,,,,
1109,CVE-2020-11922,An issue was discovered in WiZ Colors A60 1.14...,0.00083,0.35986,4.3,medium,2021-02-15 00:00:00+00:00,NaT,NaT,,,,,,,,,
1110,CVE-2020-11923,An issue was discovered in WiZ Colors A60 1.14...,0.00051,0.20874,5.5,medium,2021-02-15 00:00:00+00:00,NaT,NaT,,,,,,,,,
1111,CVE-2020-11924,An issue was discovered in WiZ Colors A60 1.14...,0.00051,0.20874,5.5,medium,2021-02-15 00:00:00+00:00,NaT,NaT,,,,,,,,,
1112,CVE-2020-11925,An issue was discovered in Luvion Grand Elite ...,0.00103,0.42827,8.8,high,2021-02-22 00:00:00+00:00,NaT,NaT,,,,,,,,,
1113,CVE-2020-13702,The Rolling Proximity Identifier used in the A...,0.00361,0.72568,4.3,medium,2020-06-11 18:16:01+00:00,NaT,NaT,,,,,,,,,
1114,CVE-2020-14934,Buffer overflows were discovered in Contiki-NG...,0.00423,0.74573,9.8,critical,2020-08-18 16:25:51+00:00,NaT,NaT,,,,,,,,,


## Saving Master Data
That's it! This is the state of our data so far.

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1155 entries, 0 to 1154
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   cve_id                 1151 non-null   string             
 1   description            1151 non-null   string             
 2   epss                   1151 non-null   float64            
 3   percentile             1151 non-null   float64            
 4   cvss_v3                1151 non-null   float64            
 5   cvss_v3_cat            1151 non-null   category           
 6   date_known             1151 non-null   datetime64[ns, UTC]
 7   patch_release_date     47 non-null     datetime64[ns, UTC]
 8   date_of_first_exploit  47 non-null     datetime64[ns, UTC]
 9   days_to_patch_release  47 non-null     Int64              
 10  days_to_first_exploit  47 non-null     Int64              
 11  vendor                 1044 non-null   string           

In [23]:
df.to_parquet(path='../data/master_data_v1.parquet', index=None)
df.to_csv('../data/master_data_v1_indexed.csv', index=True)