# Extracting Information from Exploit DB
Although Exploit DB has no official API, a community-built Python package named `pyxploitdb` can search Exploit DB by CVE. Documentation for this package is [available on GitHub](https://github.com/nicolasmf/pyxploit-db/wiki/How-to-use#searchcve). The package can only be installed with `pip`, so a separate environment was created to avoid conflicts with packages managed by `conda`. That isolation makes it difficult to integrate an extraction module into the pipeline from within the `src` directory. This notebook addresses the problem, since a different kernel can be selected to run it in isolation.

This notebook takes the CVEs from the SCADA-filtered MITRE results and extracts a couple of key pieces of information from Exploit DB: the number of exploits available for each CVE and the date of the earliest one. It then merges these CVEs into the cleaned-up proof-of-concept data obtained from Nomi Sec's GitHub repository. The result is a dataset of CVEs with available exploit code that serves as the backbone of the project's analysis.

The first thing to do is to import the required libraries.

In [9]:
import os
import sys
import pandas as pd
import pyxploitdb as pyx
from pprint import pprint as pp
from datetime import datetime

From here, it makes sense to define a couple of functions that keep track of the data found in Exploit DB. `process_cve` takes a single CVE from the SCADA-filtered MITRE data and looks it up in Exploit DB through `pyxploitdb`, whereas `run_xdb_extraction` calls `process_cve` for each CVE and collects the results into a DataFrame.

In [23]:
def process_cve(cve_id):
    # Look up the CVE in Exploit DB
    result = pyx.searchCVE(cve_id)

    # Defaults for CVEs with no Exploit DB entries
    poc_count = 0
    earliest_date = pd.NaT

    if result:
        print('Result found!')
        poc_count = len(result)
        # Pick the exploit with the earliest publication date
        earliest_exploit = min(
            result, key=lambda exploit: datetime.strptime(
                exploit.date_published, '%Y-%m-%d'
            )
        )
        earliest_date = earliest_exploit.date_published

    return poc_count, earliest_date

def run_xdb_extraction(df: pd.DataFrame) -> pd.DataFrame:
    # Grab our CVE IDs
    cves = df['cve_id']
    # Initialize an empty list to collect one record per CVE
    results = []
    # Loop through the CVEs and append the captured data
    for cve in cves:
        print(f'Processing CVE: {cve} ...')
        poc_count, earliest_date = process_cve(cve)
        results.append({
            'cve_id': cve,
            'exploit_count': poc_count,
            'earliest_date': earliest_date,
        })

    return pd.DataFrame(results)
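
The counting and earliest-date logic in `process_cve` can be checked without calling Exploit DB. The sketch below is only an illustration: `FakeExploit` and `summarize_exploits` are hypothetical names, and it assumes, as the function above does, that each search result exposes a `date_published` string in `YYYY-MM-DD` format.

```python
from collections import namedtuple
from datetime import datetime

import pandas as pd

# Hypothetical stand-in for the objects returned by pyx.searchCVE; only the
# date_published attribute used by process_cve is modeled here.
FakeExploit = namedtuple('FakeExploit', ['id', 'date_published'])


def summarize_exploits(result):
    """Mirror the counting and earliest-date logic of process_cve."""
    if not result:
        # No exploits found: zero count and a missing date
        return 0, pd.NaT
    earliest = min(
        result,
        key=lambda exploit: datetime.strptime(
            exploit.date_published, '%Y-%m-%d'
        ),
    )
    return len(result), earliest.date_published


sample = [
    FakeExploit('a', '2021-06-15'),
    FakeExploit('b', '2019-11-02'),
    FakeExploit('c', '2020-01-30'),
]
count, earliest = summarize_exploits(sample)
print(count, earliest)

# An empty search result falls back to the defaults
empty_count, empty_date = summarize_exploits([])
print(empty_count, empty_date)
```

Note that the earliest date comes back as a string while the default is `pd.NaT`, so the resulting column holds mixed types until it is converted.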

The following notebook cell loads the CVEs and runs the extraction.

In [None]:
df = pd.read_parquet(path='../../data/processed/mitre/cve/cve_cleaned.parquet')

# Extract proof-of-concept data
df = run_xdb_extraction(df)

In [43]:
# Save the Exploit DB data for ease-of-use
df.to_parquet(path='../../data/intermediate/exploits/xdb/xdb_extracted.parquet')

With proof-of-concept exploit code information successfully extracted from the database, the next step is to merge the DataFrame created from Exploit DB into the one extracted from the PoC-in-GitHub data. Rather than simply concatenating the rows and dropping duplicates by `cve_id`, the merge adds together the `exploit_count` from each dataset and keeps the earlier of the two `earliest_date` values whenever a CVE ID appears in both.

In [147]:
# Load in Exploit DB data
xdb = pd.read_parquet(
    path='../../data/intermediate/exploits/xdb/xdb_extracted.parquet'
)
# Load PoC-in-GitHub data
poc = pd.read_parquet(
    path='../../data/processed/exploits/poc/poc_cleaned.parquet'
)
# Load in KEV data
kev = pd.read_parquet(
    path='../../data/processed/cisa/kev/kev_processed.parquet'
)

In [148]:
# Merge PoC-in-GitHub with Exploit DB data
df = pd.merge(poc, xdb, on='cve_id', how='outer', suffixes=('_poc', '_xdb'))
# Find the sum total of exploit codes for a given CVE ID
df['exploit_count'] = (
    df['exploit_count_poc'].fillna(0) + df['exploit_count_xdb'].fillna(0)
)
# Find the earliest date of an exploit code from either dataset for a given CVE
df['earliest_date'] = df[['earliest_date_poc', 'earliest_date_xdb']].min(axis=1)
# Drop intermediate columns created during merge
df = df[['cve_id', 'exploit_count', 'earliest_date']]

# Group by CVE and aggregate the sum exploit count and min date for duplicates
df = df.groupby('cve_id').agg({
    'exploit_count': 'sum',
    'earliest_date': 'min'
}).reset_index()
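
To make the merge rules concrete, here is a minimal sketch on toy data. The CVE IDs, counts, and dates are made up for illustration, and the date columns are assumed to already be datetimes.

```python
import pandas as pd

# Toy stand-ins for the PoC-in-GitHub and Exploit DB frames (made-up values)
poc_toy = pd.DataFrame({
    'cve_id': ['CVE-A', 'CVE-B'],
    'exploit_count': [2, 1],
    'earliest_date': pd.to_datetime(['2020-05-01', '2021-03-10']),
})
xdb_toy = pd.DataFrame({
    'cve_id': ['CVE-B', 'CVE-C'],
    'exploit_count': [3, 1],
    'earliest_date': pd.to_datetime(['2020-12-25', '2019-07-04']),
})

merged = pd.merge(
    poc_toy, xdb_toy, on='cve_id', how='outer', suffixes=('_poc', '_xdb')
)
# A CVE missing from one source contributes 0 exploits from that source
merged['exploit_count'] = (
    merged['exploit_count_poc'].fillna(0)
    + merged['exploit_count_xdb'].fillna(0)
)
# Row-wise minimum skips missing dates, so the only available date wins
merged['earliest_date'] = merged[
    ['earliest_date_poc', 'earliest_date_xdb']
].min(axis=1)
merged = merged[['cve_id', 'exploit_count', 'earliest_date']]
print(merged)
```

Here `CVE-B` appears in both sources, so its count becomes 4 and its date becomes 2020-12-25, while `CVE-A` and `CVE-C` keep the values from their single source.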

Next, the KEV CVEs are added to the mix. This requires a different strategy because the columns used in the previous merge (`exploit_count` and `earliest_date`) do not exist in the KEV data. The merge itself is simpler to accomplish.

In [149]:
# Merge new DataFrame with KEV data
df = pd.merge(df, kev, on='cve_id', how='outer', indicator=True)
df['_merge'].value_counts()

_merge
left_only     5028
right_only     753
both           478
Name: count, dtype: int64

With the indicator, we can see that 478 CVEs appear in both datasets. We'll set the `exploit_count` attribute to `1` for the CVEs that came only from the KEV, since we know at least one exploit code successfully exploits each of those vulnerabilities. We won't touch the `exploit_count` of the CVEs that exist in both datasets, because they already carry the counts computed in the previous merge. Lastly, we'll perform a simple datatype transformation before saving the dataset.
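
The indicator and the KEV-only update can be sketched on toy data first. The frames below are made up for illustration, and `vendor` stands in for the KEV columns.

```python
import pandas as pd

# Toy stand-ins for the merged exploit data and the KEV data (made-up values)
exploits_toy = pd.DataFrame({
    'cve_id': ['CVE-A', 'CVE-B'],
    'exploit_count': [2.0, 4.0],
})
kev_toy = pd.DataFrame({
    'cve_id': ['CVE-B', 'CVE-K'],
    'vendor': ['Acme', 'Initech'],
})

combined = pd.merge(
    exploits_toy, kev_toy, on='cve_id', how='outer', indicator=True
)
# _merge records whether each row came from the left, right, or both frames
counts = combined['_merge'].value_counts()
print(counts)

# Rows found only in the KEV have no count yet, so set it to 1
combined.loc[combined['_merge'] == 'right_only', 'exploit_count'] = 1
print(combined)
```

In this sketch `CVE-K` exists only in the KEV frame, so it starts with a missing `exploit_count` and ends with 1, while `CVE-B` keeps its existing count of 4.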

In [151]:
# Allow imports from outside the notebook directory
src_path = os.path.abspath(os.path.join('../..', 'src'))
if src_path not in sys.path:
    sys.path.append(src_path)

# Import only after src has been added to the path
from utils import convert_cols

# Set the exploit count to 1 for CVEs found only in the KEV
df.loc[df['_merge'] == 'right_only', 'exploit_count'] = 1
# Update column types
COL_TYPES = {
    'string': [
        'cve_id',
        'cve_name',
        'cve_short_desc',
        'required_action',
        'notes',
        'cwe_id'
    ],
    'category': ['vendor', 'product']
}
df = convert_cols(df, COL_TYPES)
# Drop the merge indicator
df.drop(columns='_merge', inplace=True)

cve_id converted to string!
cve_name converted to string!
cve_short_desc converted to string!
required_action converted to string!
notes converted to string!
cwe_id converted to string!
vendor converted to category!
product converted to category!
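
The project's `convert_cols` helper lives in `utils` and is not shown in this notebook. Purely as an assumption, a helper that is consistent with the mapping passed in and the messages printed above might look like the following sketch. The name `convert_cols_sketch` and the choice to skip missing columns are hypothetical, not the project's actual implementation.

```python
import pandas as pd


def convert_cols_sketch(df, col_types):
    """Cast each listed column to its target dtype (hypothetical helper)."""
    for dtype, cols in col_types.items():
        for col in cols:
            # Assumption: columns absent from the frame are silently skipped
            if col in df.columns:
                df[col] = df[col].astype(dtype)
                print(f'{col} converted to {dtype}!')
    return df


toy = pd.DataFrame({
    'cve_id': ['CVE-A', 'CVE-B'],
    'vendor': ['Acme', 'Acme'],
})
toy = convert_cols_sketch(
    toy, {'string': ['cve_id'], 'category': ['vendor', 'product']}
)
print(toy.dtypes)
```

Casting free-text columns to `string` and low-cardinality columns such as `vendor` to `category` keeps the saved dataset's types explicit rather than leaving them as generic `object` columns.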


Save the merged and aggregated DataFrame for use in the rest of the pipeline.

In [156]:
df.to_parquet(path='../../data/processed/composite/exploits_cleaned.parquet')