# Extracting Data from the CVE Project
This notebook scrapes and processes vulnerability data from more than 260,000 JSON files that make up MITRE's CVE (Common Vulnerabilities and Exposures) project. The goal is to extract key information that can be combined with data from CISA, the NVD, and other sources to analyze the relationship between criticality, speed of exploitation, and the importance of proper patching in the Internet-of-Things (IoT) space. The notebook uses recursive search to navigate the JSON file structure and collect the relevant data for further assessment.

In [None]:
import os # For traversing and reading folders and files
import json # For reading and extracting data from CVE records
import re # Handle regex patterns
import xml.etree.ElementTree as ET # For reading and extracting data from CWE records
import pandas as pd # For data cleaning and analysis
import numpy as np # For advanced calculations
import matplotlib.pyplot as plt # For data visualization

## Data Collection
There are precisely $264,610$ records in the CVE list, each represented by a single JSON file. These files contain all the information (though much of it is incomplete) that will populate our primary dataframe. It was clear after creating the initial script that massive discrepancies existed between the files' structures. This makes recursive searching particularly beneficial, as it is far more flexible and easier to maintain than a program with explicitly designed conditional statements covering as many cases as possible. If more data is needed, the collection function can be easily adapted to fit future requirements. The data currently retrieved populates a dataframe with the following variables: CVE IDs, CWE (Common Weakness Enumeration) IDs, publication dates, vulnerability descriptions, four types of severity scores (CVSS 2.0, 3.0, 3.1, and 4.0), affected vendors, and impacted products.
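To make the idea of recursive searching concrete, here is a minimal sketch rather than the collection function used in this notebook. The helper names `find_values` and `iter_json_files` are hypothetical, and the sample record is an invented, heavily simplified stand-in whose field names (`cveId`, `vendor`, `product`, `cweId`) are assumed to mirror the CVE record layout. The point is that one key lookup works at any nesting depth, so the same call handles records whose structures differ.

```python
import json
import os
from typing import Any, Iterator


def find_values(node: Any, target_key: str) -> Iterator[Any]:
    """Yield every value stored under `target_key`, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == target_key:
                yield value
            # Keep descending: the same key can reappear deeper in the tree.
            yield from find_values(value, target_key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, target_key)
    # Scalars (str, int, None, ...) are leaves and yield nothing.


def iter_json_files(root: str) -> Iterator[str]:
    """Yield the path of every .json file found anywhere below `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".json"):
                yield os.path.join(dirpath, name)


# Hypothetical, simplified record; real CVE files carry many more fields.
sample = json.loads("""
{
  "cveMetadata": {"cveId": "CVE-0000-0001"},
  "containers": {
    "cna": {
      "affected": [{"vendor": "ExampleVendor", "product": "ExampleCam"}],
      "problemTypes": [{"descriptions": [{"cweId": "CWE-79"}]}]
    },
    "adp": [{"affected": [{"vendor": "ExampleVendor", "product": "ExampleHub"}]}]
  }
}
""")

print(list(find_values(sample, "cveId")))
print(list(find_values(sample, "product")))
print(list(find_values(sample, "baseScore")))  # key absent: empty list, no error
```

A missing key simply produces an empty result, which is what makes this approach tolerant of incomplete records. Collecting an additional field would only require one more `find_values` call with a different key.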

## Saving the Data
Parquet is a columnar file format that streamlines storage and retrieval because it preserves each column's data type. CSV files lack this capability, but they are easily shared and can be viewed in common spreadsheet software. Because of this, I saved a copy in both formats. I chose the default option `None` for the method's `index` parameter, which stores the dataframe's default range index as a range in the file's metadata rather than as a column of values. This means it won't take up the space it would if the index were saved as a separate column, but it still provides a way to keep track of the records for the purposes of splitting them into training, test, and validation sets for a machine-learning algorithm, should our work come to that.
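As a rough illustration of the two-format save, the sketch below assumes the method in question is pandas' `DataFrame.to_parquet` and that a Parquet engine such as `pyarrow` is installed. The function name `save_copies`, the file stem `cve_records`, and the `formats` and `overwrite` parameters are hypothetical and are not taken from this notebook. The overwrite guard is an extra safety measure so that existing files are not replaced by accident.

```python
from pathlib import Path

import pandas as pd


def save_copies(df, out_dir, stem="cve_records",
                formats=("parquet", "csv"), overwrite=False):
    """Write `df` to `out_dir` in each requested format and return the paths."""
    unknown = set(formats) - {"parquet", "csv"}
    if unknown:
        raise ValueError(f"Unsupported formats: {sorted(unknown)}")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {fmt: out / f"{stem}.{fmt}" for fmt in formats}

    # Refuse to replace existing files unless the caller opts in.
    existing = [str(p) for p in paths.values() if p.exists()]
    if existing and not overwrite:
        raise FileExistsError(f"Refusing to overwrite: {existing}")

    if "parquet" in paths:
        # index=None (the default): a RangeIndex is kept as a compact range
        # in the file metadata instead of being written out as values.
        df.to_parquet(paths["parquet"], index=None)
    if "csv" in paths:
        # CSV has no type metadata; by default the index is written
        # as an unnamed first column.
        df.to_csv(paths["csv"])
    return paths


# Example call (writes files, so it is left commented out here):
# save_copies(cve_df, "data")   # cve_df is a placeholder dataframe name
```

Reading the Parquet copy back with `pd.read_parquet` should restore the column types, whereas the CSV copy needs them re-inferred or specified, for example with `pd.read_csv(path, index_col=0)`.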

<span style='font-weight:600;color:#ff9900;background-color:#525767;border-radius:3px;padding-inline:3px;padding-block:1px;'>Don't run this code cell unless you want to overwrite the saved files!</span>