# NVD Data
The data captured from MITRE's CVE Project is fairly robust, but it nevertheless has large gaps in certain CVE variables, notably the dates on which a given CVE went public and was published, and its CVSS severity score. Gathering this additional data will help flesh out the master dataset we'll analyze. The `nvdlib` library makes it easier to communicate with the NVD's API, though the code in this notebook queries the API directly with `requests`. The result is a dataframe that will ultimately be merged into the master dataset within the notebook `master_merge`.

In [2]:
# Import libraries
import os # For reading environment variables
import time # For sleeping requests to the API
import requests # For establishing contact with the NVD's API
import json # For handling JSON serialization
import pandas as pd # For data collection, cleaning, and storage
import numpy as np # For advanced calculations
# import nvdlib as nvd # A wrapper to make API comms more intuitive
from dotenv import load_dotenv as env # For securely loading the API key

# Import data
iots = pd.read_parquet(path='../data/MITRE/mitre_iot_cves_v1.parquet')

In [11]:
iots.head()

Unnamed: 0,cve_id,description
0,CVE-2024-29195,The azure-c-shared-utility is a C library for ...
1,CVE-2024-29055,Microsoft Defender for IoT Elevation of Privil...
2,CVE-2024-29054,Microsoft Defender for IoT Elevation of Privil...
3,CVE-2024-29053,Microsoft Defender for IoT Remote Code Executi...
4,CVE-2024-21324,Microsoft Defender for IoT Elevation of Privil...


## Communicating with NVD's API
According to the documentation (https://nvdlib.com/en/latest/v1/v1.html), NVD's API is rate-limited for calls both with and without an API key. Because of this rate limit, every CVE ID we look up takes at least $0.6$ seconds, so looping through all ~$260,000$+ CVEs would take at least $43$ hours, which is impractical. Instead, we import MITRE's IoT CVEs, which form the backbone of the rest of the data in our master copy. That set has just $1088$ CVEs and will therefore take only around $11$ minutes. The first call gives a glimpse of the structure of the API's response. In it, we see exactly the keys we need to retrieve from the JSON (`"published"` and `"baseScore"`).
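
The runtime figures above follow from multiplying the number of lookups by the pause per request. A minimal sketch of that arithmetic, assuming the $0.6$-second delay is the only cost and ignoring network latency (the function name `estimated_seconds` is hypothetical):

```python
# Back-of-the-envelope runtime for rate-limited lookups.
SECONDS_PER_REQUEST = 0.6  # pause between calls, as used in this notebook

def estimated_seconds(n_cves, delay=SECONDS_PER_REQUEST):
    """Lower bound on wall-clock time: one pause per CVE, no network latency."""
    return n_cves * delay

full_catalog_hours = estimated_seconds(260_000) / 3600
iot_subset_minutes = estimated_seconds(1088) / 60

print(f"All CVEs:   ~{full_catalog_hours:.1f} hours")
print(f"IoT subset: ~{iot_subset_minutes:.1f} minutes")
```

Real runs will take longer than this lower bound, since each request also spends time on the network.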

In [3]:
# Load environment variables from .env file
env()

# Access the API key from environment variables
key = os.getenv('NVD_API_KEY')

base_url = 'https://services.nvd.nist.gov/rest/json/cves/2.0'

# Formulating the API request
headers = {
    'apiKey': key
}

# Example request to see how the data is returned
#query = f'{base_url}?cveId=CVE-2024-29195'

# Call the API
#response = requests.get(query, headers=headers)

# Parse the JSON response
#cve_data = response.json()

# Pretty print the response to make it easier to inspect
#print(json.dumps(cve_data, indent=4))
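
To illustrate where `"published"` and `"baseScore"` sit in the response, here is a sketch that walks a hand-built payload offline. The payload mirrors only the keys this notebook reads. Its values are illustrative, not a recorded API response, and the helper name `extract_score` is hypothetical.

```python
# Hand-built payload containing only the keys this notebook reads.
# All values are placeholders, not real NVD data.
sample_response = {
    "vulnerabilities": [
        {
            "cve": {
                "id": "CVE-0000-0000",  # placeholder ID
                "published": "2024-01-01T00:00:00.000",
                "metrics": {
                    "cvssMetricV31": [
                        {"cvssData": {"version": "3.1", "baseScore": 7.5}}
                    ]
                },
            }
        }
    ]
}

def extract_score(payload):
    """Return (id, published, baseScore, version), preferring v3.1 over v3.0."""
    items = payload.get("vulnerabilities", [])
    if not items:
        return None
    cve = items[0].get("cve", {})
    metrics = cve.get("metrics", {})
    for key in ("cvssMetricV31", "cvssMetricV30"):
        if metrics.get(key):
            data = metrics[key][0]["cvssData"]
            return (cve.get("id"), cve.get("published"),
                    data.get("baseScore"), data.get("version"))
    # No CVSS v3.x metrics present
    return cve.get("id"), cve.get("published"), None, None

print(extract_score(sample_response))
```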

In [6]:
# Function to retrieve CVE details from NVD's API
def fetch_cve_data(cve):
    query = f'{base_url}?cveId={cve}'
    try:
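        response = requests.get(query, headers=headers, timeout=30)
        # Raise on 4xx/5xx so the HTTPError handler below reports the failure
        response.raise_for_status()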
        if response.status_code == 200:
            cve_data = response.json()

            vulnerabilities = cve_data.get('vulnerabilities', [])
            if vulnerabilities:
                cve_item = vulnerabilities[0].get('cve', {})
                # Extract the info
                cve_id = cve_item.get('id', None)
                date_known = cve_item.get('published', None)

                cvss = None
                version = None
                metrics = cve_item.get('metrics', {})
                # print(json.dumps(metrics, indent=4))

                # Check for CVSS v3.1
                if 'cvssMetricV31' in metrics:
                    cvss = metrics['cvssMetricV31'][0]['cvssData'].get('baseScore', None)
                    version = metrics['cvssMetricV31'][0]['cvssData'].get('version', None)

                # Check for CVSS v3.0 if v3.1 not found
                if cvss is None and 'cvssMetricV30' in metrics:
                    cvss = metrics['cvssMetricV30'][0]['cvssData'].get('baseScore', None)
                    version = metrics['cvssMetricV30'][0]['cvssData'].get('version', None)

                return {
                    'cve_id': cve_id,
                    'date_known': date_known,
                    'cvss_v3': cvss,
                    'version': version
                }
        # If no vulnerabilities are found or NVD has bad data
        return None
    except requests.exceptions.HTTPError as h_e:
        print(f'HTTP Error for CVE {cve}: {h_e}')
    except requests.exceptions.RequestException as r_e:
        print(f'Request Error for CVE {cve}: {r_e}')
    return None

# # Filter a copy of the IoT CVE's for just their IDs
# cves = iots['cve_id']

# nvd_data = []

# # Call the API
# for i, cve in enumerate(cves):
#     # Limit to 5 calls for testing
#     # if i >= 5:
#     #    break

#     cve_info = fetch_cve_data(cve)
#     if cve_info:
#         nvd_data.append(cve_info)
#     # Respect the rate limit between requests
#     time.sleep(0.6)

# # Create dataframe from the extracted information
# nvd = pd.DataFrame(nvd_data)

In [53]:
nvd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1088 entries, 0 to 1087
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   cve_id      1088 non-null   object 
 1   date_known  1088 non-null   object 
 2   cvss_v3     1088 non-null   float64
 3   version     1088 non-null   object 
dtypes: float64(1), object(3)
memory usage: 34.1+ KB


In [66]:
nvd.head()

Unnamed: 0,cve_id,date_known,cvss_v3
0,CVE-2024-29195,2024-03-26 03:15:13.333000+00:00,6.0
1,CVE-2024-29055,2024-04-09 17:15:59.320000+00:00,7.2
2,CVE-2024-29054,2024-04-09 17:15:59.123000+00:00,7.2
3,CVE-2024-29053,2024-04-09 17:15:58.930000+00:00,8.8
4,CVE-2024-21324,2024-04-09 17:15:34.607000+00:00,7.2


## Correcting Datatypes

In [61]:
# Convert to timezone-aware (UTC) datetime
nvd['date_known'] = pd.to_datetime(nvd['date_known'], utc=True)

# Convert to text
nvd['cve_id'] = nvd['cve_id'].astype('string')
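
As a quick illustration of what these two conversions do, the sketch below applies them to a toy frame standing in for `nvd`. The IDs and timestamp strings are made up for the example.

```python
import pandas as pd

# Toy frame standing in for `nvd`; IDs and timestamps are illustrative.
toy = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0002"],
    "date_known": ["2024-03-26T03:15:13.333", "2024-04-09T17:15:59.320"],
})

# Parse the strings and mark the result as UTC
toy["date_known"] = pd.to_datetime(toy["date_known"], utc=True)

# Use pandas' dedicated string dtype instead of generic object
toy["cve_id"] = toy["cve_id"].astype("string")

print(toy.dtypes)
```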

## Validate CVSS Range

In [58]:
# A score is out of range if it is below 0.0 OR above 10.0
count = len(nvd[(nvd['cvss_v3'] < 0.0) | (nvd['cvss_v3'] > 10.0)])
print(f'{count} observations fall out of range.')

0 observations fall out of range.
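
A range check of this kind can also be written with `Series.between`, which is inclusive on both ends by default. The sketch below runs on toy scores with deliberately planted out-of-range values, not on the real `nvd` frame.

```python
import pandas as pd

# Toy scores; the out-of-range values are planted on purpose.
scores = pd.DataFrame({"cvss_v3": [0.0, 6.0, 10.0, -1.0, 10.5]})

# CVSS base scores run from 0.0 to 10.0 inclusive.
in_range = scores["cvss_v3"].between(0.0, 10.0)
out_of_range = scores[~in_range]

print(f"{len(out_of_range)} observations fall out of range.")
```

Because comparisons with missing values evaluate to false, this form also flags any missing scores as out of range.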


In [60]:
v3_0 = len(nvd[nvd['version'] == '3.0'])
print(f'{v3_0} CVSS scores were of version 3.0.')

64 CVSS scores were of version 3.0.


In [64]:
# Drop the version column
nvd = nvd.drop(columns=['version'])

# Remove extra whitespace
nvd['cve_id'] = nvd['cve_id'].str.strip()

# Check ID formatting
# na=False counts missing IDs as incorrectly formatted
non_ideal = nvd[~nvd['cve_id'].str.startswith('CVE-', na=False)]
print(f'There are {len(non_ideal)} incorrectly-formatted CVE IDs.')

There are 0 incorrectly-formatted CVE IDs.
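
A prefix test only confirms that each ID starts with `CVE-`. A stricter check, sketched below on toy IDs, matches the whole string against a pattern. The pattern is an assumption about the ID syntax (`CVE-`, a four-digit year, then four or more digits), and the malformed entries are planted on purpose.

```python
import pandas as pd

# Toy IDs; the malformed and missing entries are planted on purpose.
ids = pd.Series(
    ["CVE-2024-29195", "CVE-2024-123", "cve-2024-29055", None],
    dtype="string",
)

# Assumed pattern: "CVE-", a four-digit year, then four or more digits.
pattern = r"CVE-\d{4}-\d{4,}"

# fullmatch requires the entire string to match; na=False flags missing IDs
well_formed = ids.str.fullmatch(pattern, na=False)

print(f"There are {int((~well_formed).sum())} incorrectly-formatted CVE IDs.")
```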


## Saving the Data

In [65]:
nvd.to_parquet(path='../data/NVD/nvd_data_v1.parquet')

## Re-Aggregating CVSS Scores From CVE Miniset

In [9]:
# read_parquet already returns a DataFrame, so no extra conversion is needed
mini_cves = pd.read_parquet(path='../data/miniset_cves_v2.parquet')

In [10]:
mini_cves.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cve_id  17 non-null     object
dtypes: object(1)
memory usage: 264.0+ bytes


In [11]:
mini_cves = mini_cves['cve_id']
mini_data = []

for i, cve in enumerate(mini_cves):
    # Limit to 5 calls for testing
    # if i >= 5:
    #     break

    cve_info = fetch_cve_data(cve)
    if cve_info:
        mini_data.append(cve_info)
    # Respect the rate limit between requests
    time.sleep(0.6)

mini = pd.DataFrame(mini_data)
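
The fetch-and-sleep loop appears twice in this notebook, so it could be factored into a helper. The following is only a sketch: `collect_cve_rows` and `fake_fetch` are hypothetical names, and the stub fetcher stands in for `fetch_cve_data` so the example runs without network access.

```python
import time

def collect_cve_rows(cve_ids, fetch, delay=0.6, sleep=time.sleep):
    """Call `fetch` once per ID, pausing `delay` seconds after each call.

    IDs for which `fetch` returns a falsy value are recorded in `missed`.
    """
    rows, missed = [], []
    for cve in cve_ids:
        info = fetch(cve)
        if info:
            rows.append(info)
        else:
            missed.append(cve)
        sleep(delay)  # Respect the rate limit between requests
    return rows, missed

# Stub fetcher so the sketch runs offline; it "fails" for one planted ID.
def fake_fetch(cve):
    return {"cve_id": cve} if cve != "CVE-0000-0002" else None

rows, missed = collect_cve_rows(
    ["CVE-0000-0001", "CVE-0000-0002"], fake_fetch, delay=0
)
print(rows, missed)
```

Returning the missed IDs alongside the rows makes it easy to see which lookups came back empty and to retry only those.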

In [12]:
mini.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   cve_id      17 non-null     object 
 1   date_known  17 non-null     object 
 2   cvss_v3     17 non-null     float64
 3   version     17 non-null     object 
dtypes: float64(1), object(3)
memory usage: 672.0+ bytes


In [13]:
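# Convert to timezone-aware (UTC) datetime
mini['date_known'] = pd.to_datetime(mini['date_known'], utc=True)
# Convert IDs to text and remove extra whitespace
mini['cve_id'] = mini['cve_id'].astype('string')
mini['cve_id'] = mini['cve_id'].str.strip()
# Drop the version column
mini = mini.drop(columns=['version'])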

mini.to_parquet(path='../data/NVD/mini_nvd_response_v2.parquet')