# Extracting Information from Exploit DB
Exploit DB has no official API, but a community-built Python package named `pyxploitdb` can search Exploit DB by CVE. Documentation for the package is [available on GitHub](https://github.com/nicolasmf/pyxploit-db/wiki/How-to-use#searchcve). The package can only be installed with `pip`, so a separate environment was created to avoid conflicts with packages managed by `conda`. That makes integrating an extraction module into the pipeline from within the `src` directory a challenge. This notebook addresses it, since a different kernel can be selected to run the extraction in isolation.

The notebook takes the CVEs from the MITRE results, randomly samples $20\%$ of the records, and uses them to extract a couple of key pieces of information from Exploit DB: the number of exploits listed for each CVE and the publication date of the earliest one. After extracting this information, it merges these CVEs into the cleaned-up proof-of-concept data obtained from Nomi Sec's GitHub repository as well as CISA's KEV catalogue. The result is a robust dataset of CVEs with available exploit code that can serve as the backbone of the project's analysis.

The first thing to do is to import the required libraries.

In [38]:
import os
import sys
import time # To track progress of API
import pandas as pd
import pyxploitdb as pyx
from pprint import pprint as pp
from datetime import datetime

From here, it makes sense to define a couple of functions that keep track of the data they find in Exploit DB. `process_cve` takes a single CVE from the SCADA-filtered MITRE data and uses it to query Exploit DB through `pyxploitdb`, retrying on errors, whereas `run_xdb_extraction` loops over the sampled CVEs and reports progress.

In [41]:
def process_cve(cve_id: str, max_retries: int=3, retry_delay: int=5):
    start_time = time.time() # Start a timer for the current CVE

    # Default data
    poc_count = 0
    earliest_date = pd.NaT

    for attempt in range(1, max_retries + 1):
        try:
            # Call the searchCVE function
            result = pyx.searchCVE(cve_id)

            if not result:
                break # Exit loop immediately

            print('Result found!')
            poc_count = len(result)
            earliest_date = min(
                result, key=lambda exploit: datetime.strptime(
                    exploit.date_published, '%Y-%m-%d'
                )
            )
            earliest_date = earliest_date.date_published

            break # Exit retry loop if successful

        except Exception as e:  # Catches ANY unexpected errors
            print(f"❌ Unexpected error on attempt {attempt}/{max_retries} for {cve_id}: {e}")
            if attempt < max_retries:
                time.sleep(retry_delay)  # Wait before retrying; skip after the final attempt

    elapsed_time = time.time() - start_time
    return poc_count, earliest_date, elapsed_time

def run_xdb_extraction(df: pd.DataFrame) -> pd.DataFrame:
    # Grab random sample of 20% of the data frame; seed=1945
    sample = df.sample(frac=0.2, random_state=1945)

    # Grab our CVE IDs
    cves = sample['cve_id'].tolist()

    # # TEST: Limit API calls
    # cves = cves[:50]

    # Grab total number of CVEs
    total_cves = len(cves)
    # Initialize an empty list to collect one result record per CVE
    results = []
    # Start a timer for the current extraction
    start_time = time.time()

    # Loop through the CVEs and append the captured data
    for i, cve in enumerate(cves, 1):
        print(f'Processing CVE {i}/{total_cves}: {cve}')
        poc_count, earliest_date, elapsed_time = process_cve(cve)
        results.append({
            'cve_id': cve,
            'exploit_count': poc_count,
            'earliest_date': earliest_date,
        })

        # Calculate remaining time estimate
        time_spent = time.time() - start_time
        cves_left = total_cves - i
        avg_time_per_cve = time_spent / i
        estimated_time_remaining = avg_time_per_cve * cves_left
        hours, remainder = divmod(estimated_time_remaining, 3600)
        minutes, seconds = divmod(remainder, 60)

        # Print progress stats
        print(
            f'Elapsed time: {time_spent:.2f}s | '
            f'Remaining time: {int(hours)}hrs {int(minutes)}mins {seconds:.2f}s\r'
        )

    return pd.DataFrame(results)
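
The earliest-date selection inside `process_cve` can be tried without touching the network. The sketch below is only an illustration: `FakeExploit` and `summarize` are hypothetical names, and the stand-in class assumes nothing about the real search results beyond the `date_published` string that `process_cve` already parses.

```python
from dataclasses import dataclass
from datetime import datetime

import pandas as pd


@dataclass
class FakeExploit:
    """Hypothetical stand-in for one search result (not the real class)."""
    date_published: str  # assumed 'YYYY-MM-DD', the format parsed in process_cve


def summarize(result):
    """Return (exploit count, earliest publication date) for a list of results."""
    if not result:
        # Same defaults as process_cve when nothing is found
        return 0, pd.NaT
    earliest = min(
        result,
        key=lambda exploit: datetime.strptime(exploit.date_published, '%Y-%m-%d'),
    )
    # The date is returned as the original string, not as a datetime
    return len(result), earliest.date_published


hits = [
    FakeExploit('2016-08-05'),
    FakeExploit('2007-01-24'),
    FakeExploit('2010-03-11'),
]
count, earliest = summarize(hits)
empty_count, empty_date = summarize([])
print(count, earliest)
```

Because the date is kept as a string, a column built from these values holds Python strings rather than timestamps, so it needs an explicit `pd.to_datetime` before any date comparison.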

The following notebook cell actually loads the CVEs and runs the API client.

In [42]:
df = pd.read_parquet(path='../data/processed/mitre/cve/cve_cleaned.parquet')

# Extract proof-of-concept data
df = run_xdb_extraction(df)

# Save the Exploit DB data for ease-of-use
df.to_parquet(path='../data/intermediate/exploits/xdb/xdb_extracted.parquet')

Processing CVE 1/51746: CVE-2022-42127
Elapsed time: 0.44s | Remaining time: 6hrs 22mins 17.55s
Processing CVE 2/51746: CVE-2006-0597
Elapsed time: 0.75s | Remaining time: 5hrs 21mins 19.84s
Processing CVE 3/51746: CVE-2021-44495
Elapsed time: 1.09s | Remaining time: 5hrs 12mins 24.21s
Processing CVE 4/51746: CVE-2024-48734
Elapsed time: 1.37s | Remaining time: 4hrs 55mins 9.27s
Processing CVE 5/51746: CVE-2007-0010
Result found!
Elapsed time: 1.78s | Remaining time: 5hrs 7mins 48.79s
Processing CVE 6/51746: CVE-2011-3421
Elapsed time: 2.10s | Remaining time: 5hrs 1mins 53.44s
Processing CVE 7/51746: CVE-2015-1355
Elapsed time: 2.42s | Remaining time: 4hrs 57mins 49.87s
Processing CVE 8/51746: CVE-2021-41292
Elapsed time: 2.70s | Remaining time: 4hrs 50mins 59.73s
Processing CVE 9/51746: CVE-2023-38881
Elapsed time: 3.01s | Remaining time: 4hrs 48mins 8.89s
Processing CVE 10/51746: CVE-2016-5330
Result found!
Elapsed time: 3.33s | Remaining time: 4hrs 47mins 12.09s
Processing CVE 11/51
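
The remaining-time figures in the log are a running average: total elapsed time divided by the number of CVEs processed so far, multiplied by the number still to go. A minimal sketch of that arithmetic follows; the helper names `estimate_remaining` and `format_eta` are made up for illustration and do not appear in the notebook.

```python
def estimate_remaining(time_spent: float, done: int, total: int) -> float:
    """Estimate the seconds left from the average time per processed item."""
    if done <= 0:
        raise ValueError('at least one item must be processed first')
    avg_time_per_item = time_spent / done
    return avg_time_per_item * (total - done)


def format_eta(seconds: float) -> str:
    """Split a duration in seconds into hours, minutes, and seconds."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f'{int(hours)}hrs {int(minutes)}mins {secs:.2f}s'


# 10 of 100 items done in 30 s: 3 s per item on average, 90 items left
eta = estimate_remaining(time_spent=30.0, done=10, total=100)
label = format_eta(eta)
print(label)
```

Early estimates swing widely because the average is based on only a handful of requests, which is consistent with the first few lines of the log above.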

In [43]:
df.head()

           cve_id  exploit_count  earliest_date
0  CVE-2022-42127              0            NaT
1   CVE-2006-0597              0            NaT
2  CVE-2021-44495              0            NaT
3  CVE-2024-48734              0            NaT
4   CVE-2007-0010              1     2007-01-24


In [43]:
# Save the Exploit DB data for ease-of-use
df.to_parquet(path='../data/intermediate/exploits/xdb/xdb_extracted.parquet') # ! THIS IS THE RIGHT PATH

With proof-of-concept exploit information extracted from Exploit DB, the next step is to merge that DataFrame with the one built from the PoC-in-GitHub data. Rather than simply concatenating the rows and dropping duplicates on `cve_id`, the merge adds together the `exploit_count` from each dataset and retains the earlier `earliest_date` whenever a CVE ID appears in both.

In [92]:
# Load in Exploit DB data
xdb = pd.read_parquet(
    path='../data/intermediate/exploits/xdb/xdb_extracted.parquet'
)
# Focus on relevant ExploitDB data
xdb = xdb[xdb['exploit_count'] > 0].reset_index(drop=True)

# Load PoC-in-GitHub data
poc = pd.read_parquet(
    path='../data/processed/exploits/poc/poc_cleaned.parquet'
)

# Load in KEV data
kev = pd.read_parquet(
    path='../data/processed/cisa/kev/kev_processed.parquet'
)

In [93]:
xdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4931 entries, 0 to 4930
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   cve_id         4931 non-null   object
 1   exploit_count  4931 non-null   int64 
 2   earliest_date  4931 non-null   object
dtypes: int64(1), object(2)
memory usage: 115.7+ KB


In [84]:
# Merge PoC-in-GitHub with Exploit DB data
df = pd.merge(
    poc,
    xdb,
    on='cve_id',
    how='outer',
    suffixes=('_poc', '_xdb'),
    indicator='_merge1'
)

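# Customize indicator values (rename the categories rather than calling
# `replace`, which is deprecated for categorical columns)
df['_merge1'] = df['_merge1'].cat.rename_categories({
    'left_only': 'poc',
    'right_only': 'xdb',
    'both': 'poc_xdb'
})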

# Find the sum total of exploit codes for a given CVE ID
df['exploit_count'] = (
    df['exploit_count_poc'].fillna(0) + df['exploit_count_xdb'].fillna(0)
)

# Find the earliest date of an exploit code from either dataset for a given CVE
date_cols = ['earliest_date_poc', 'earliest_date_xdb']
for col in date_cols:
    df[col] = pd.to_datetime(df[col], utc=True)
df['earliest_date'] = df[date_cols].min(axis=1)

# Drop intermediate columns created during merge
df = df[['cve_id', 'exploit_count', 'earliest_date', '_merge1']]

# Group by CVE and aggregate the sum exploit count and min date for duplicates
df = df.groupby('cve_id').agg({
    'exploit_count': 'sum',
    'earliest_date': 'min',
    '_merge1': 'first'
}).reset_index()


In [85]:
df['_merge1'].value_counts()

_merge1
poc        5230
xdb        4742
poc_xdb     189
Name: count, dtype: int64
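
To make the merge logic easier to follow, here is a small sketch on toy data. The CVE IDs, counts, and dates are invented, and the indicator column is named `source` purely for illustration. It follows the same steps as the cell above: an outer merge with an indicator, a summed count, and a row-wise minimum over the two date columns.

```python
import pandas as pd

# Toy stand-ins for the two sources (all values are made up)
poc = pd.DataFrame({
    'cve_id': ['CVE-A', 'CVE-B'],
    'exploit_count': [2, 1],
    'earliest_date': pd.to_datetime(['2020-05-01', '2021-01-10'], utc=True),
})
xdb = pd.DataFrame({
    'cve_id': ['CVE-B', 'CVE-C'],
    'exploit_count': [3, 1],
    'earliest_date': ['2019-12-31', '2018-07-04'],  # plain strings
})

merged = pd.merge(
    poc, xdb, on='cve_id', how='outer',
    suffixes=('_poc', '_xdb'), indicator='source',
)

# Relabel the indicator categories with the dataset names
merged['source'] = merged['source'].cat.rename_categories({
    'left_only': 'poc', 'right_only': 'xdb', 'both': 'poc_xdb',
})

# A CVE missing from one side contributes 0 exploits from that side
merged['exploit_count'] = (
    merged['exploit_count_poc'].fillna(0)
    + merged['exploit_count_xdb'].fillna(0)
).astype(int)

# Put both date columns on the same tz-aware type, then take the row-wise min
date_cols = ['earliest_date_poc', 'earliest_date_xdb']
for col in date_cols:
    merged[col] = pd.to_datetime(merged[col], utc=True)
merged['earliest_date'] = merged[date_cols].min(axis=1)

summary = merged.set_index('cve_id')[['exploit_count', 'earliest_date', 'source']]
print(summary)
```

In this toy case `CVE-B` appears in both sources, so its count is the sum of the two and its date is the earlier of the two, while `CVE-A` and `CVE-C` keep the values from their single source.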

In [86]:
kev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1231 entries, 0 to 1230
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   cve_id              1231 non-null   object             
 1   vendor              1231 non-null   object             
 2   product             1231 non-null   object             
 3   cve_name            1231 non-null   object             
 4   kev_date_published  1231 non-null   datetime64[ns, UTC]
 5   cve_short_desc      1231 non-null   object             
 6   required_action     1231 non-null   object             
 7   due_date            1231 non-null   datetime64[ns, UTC]
 8   known_use           1231 non-null   category           
 9   notes               1231 non-null   object             
 10  cwe_id              1231 non-null   object             
dtypes: category(1), datetime64[ns, UTC](2), object(8)
memory usage: 97.6+ KB


In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10181 entries, 0 to 10180
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   exploit_count_poc  5439 non-null   Int64              
 1   earliest_date_poc  5439 non-null   datetime64[ns, UTC]
 2   cve_id             10181 non-null  object             
 3   exploit_count_xdb  4933 non-null   float64            
 4   earliest_date_xdb  4933 non-null   object             
 5   _merge1            10181 non-null  category           
dtypes: Int64(1), category(1), datetime64[ns, UTC](1), float64(1), object(2)
memory usage: 417.8+ KB


Now it is time to add the KEV CVEs to the mix. This requires a different strategy, since the data points combined in the previous merge (`exploit_count` and `earliest_date`) do not exist in the KEV. As a result, this merge is simpler.

In [89]:
# Merge new DataFrame with KEV data
df = pd.merge(
    df,
    kev,
    on='cve_id',
    how='outer',
    suffixes=('_og', '_kev'),
    indicator='_merge2'
)

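# Customize indicator values (rename the categories rather than calling
# `replace`, which is deprecated for categorical columns)
df['_merge2'] = df['_merge2'].cat.rename_categories({
    'left_only': 'poc_xdb',
    'right_only': 'kev',
    'both': 'poc_xdb_kev'
})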

# Find the earliest date of an exploit code from either dataset for a given CVE
df['earliest_date'] = df[['earliest_date', 'kev_date_published']].min(axis=1)

# Update KEV CVE's exploit counts
df.loc[df['_merge2'] == 'kev', 'exploit_count'] = 1

# # Group by CVE and aggregate the sum exploit count and min date for duplicates
# df = df.groupby('cve_id').agg({
#     'exploit_count': 'sum',
#     'earliest_date': 'min'
# }).reset_index()
# df['_merge2'].value_counts()

ValueError: Cannot use name of an existing column for indicator column

With the indicator, we can see that 478 CVEs appear in both datasets. We'll set `exploit_count` to `1` for the CVEs that came only from the KEV, since we know that at least one exploit successfully exploits each of those vulnerabilities. We won't touch the `exploit_count` of CVEs that exist in both datasets, since the merge already carried over the proper counts. Lastly, we'll perform a simple datatype conversion before saving the dataset.
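
The KEV-only rule described above can be illustrated on toy data. In this sketch the CVE IDs and dates are invented and the indicator column is named `origin` for illustration only. Rows that exist only on the KEV side have no count after the outer merge, so they are assigned a count of 1.

```python
import pandas as pd

# Toy stand-ins (all values are made up)
exploits = pd.DataFrame({
    'cve_id': ['CVE-A', 'CVE-B'],
    'exploit_count': [2, 4],
    'earliest_date': pd.to_datetime(['2020-05-01', '2019-12-31'], utc=True),
})
kev = pd.DataFrame({
    'cve_id': ['CVE-B', 'CVE-D'],
    'kev_date_published': pd.to_datetime(['2019-11-15', '2022-03-03'], utc=True),
})

merged = pd.merge(exploits, kev, on='cve_id', how='outer', indicator='origin')

# Keep whichever date is earlier; missing values are skipped by default
merged['earliest_date'] = (
    merged[['earliest_date', 'kev_date_published']].min(axis=1)
)

# KEV-only rows have no count yet, so give them one known exploit
merged.loc[merged['origin'] == 'right_only', 'exploit_count'] = 1

shared = int((merged['origin'] == 'both').sum())  # CVEs present in both frames
by_cve = merged.set_index('cve_id')
print(by_cve[['exploit_count', 'earliest_date', 'origin']])
```

Counting the `both` rows, as `shared` does here, is one way to arrive at a figure for the overlap between the two datasets.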

In [151]:
# Allow imports from outside the notebook directory
src_path = os.path.abspath(os.path.join('../..', 'src'))
if src_path not in sys.path:
    sys.path.append(src_path)

from utils import convert_cols

# Update KEV CVE's exploit counts (KEV-only rows are labeled 'kev' in `_merge2`)
df.loc[df['_merge2'] == 'kev', 'exploit_count'] = 1
# Update column types
COL_TYPES = {
    'string': [
        'cve_id',
        'cve_name',
        'cve_short_desc',
        'required_action',
        'notes',
        'cwe_id'
    ],
    'category': ['vendor', 'product']
}
df = convert_cols(df, COL_TYPES)
# Drop the merge indicator created by the KEV merge
df.drop(columns='_merge2', inplace=True)

cve_id converted to string!
cve_name converted to string!
cve_short_desc converted to string!
required_action converted to string!
notes converted to string!
cwe_id converted to string!
vendor converted to category!
product converted to category!
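
The implementation of `convert_cols` lives in the project's `utils` module and is not shown in this notebook. As a rough idea of what such a helper might look like, here is a hypothetical sketch; the name `convert_cols_sketch` and its body are assumptions and should not be read as the project's actual code.

```python
import pandas as pd


def convert_cols_sketch(df: pd.DataFrame, col_types: dict) -> pd.DataFrame:
    """Hypothetical helper: cast each listed column to the dtype in its key."""
    out = df.copy()  # Leave the caller's frame untouched
    for dtype, cols in col_types.items():
        for col in cols:
            out[col] = out[col].astype(dtype)
            print(f'{col} converted to {dtype}!')
    return out


demo = pd.DataFrame({
    'cve_id': ['CVE-A', 'CVE-B'],
    'vendor': ['acme', 'acme'],
})
demo = convert_cols_sketch(demo, {'string': ['cve_id'], 'category': ['vendor']})
print(demo.dtypes)
```

Mapping dtype names to column lists keeps the conversion declarative, which is presumably why a dictionary like `COL_TYPES` is passed to the helper.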


Save the merged and aggregated DataFrame for use in the rest of the pipeline.

In [156]:
df.to_parquet(path='../../data/processed/composite/exploits_cleaned.parquet')