# Analysis of IoT-Related CVEs and Their Exploitation by APTs

In [2]:
import os # For traversing and reading folders and files
import json # For reading and extracting data from CVE records
import re # Handle regex patterns
import xml.etree.ElementTree as ET # For reading and extracting data from CWE records
import ast # For safely evaluating strings as Python expressions
import pandas as pd # For data cleaning and analysis
import numpy as np # For advanced calculations
import matplotlib.pyplot as plt # For data visualization

## Data Collection
### Extracting Data from the CVE List
There are precisely $260,896$ records in the CVE list, each represented by a single JSON file. These files contain all the information (though much of it is incomplete) that will populate our `CVE_df` dataframe. After writing the initial script, it became clear that massive discrepancies exist between the files' structures. This makes it especially complicated to retrieve the desired values reliably, so I wrote a recursive function that walks each file and, at every level, chooses how to proceed based on the data type of the object it finds there, until it reaches the desired value. Based on the CVE List schema, I concluded that the most important information in the files (and really the only information useful to us) is the CVE ID, the CWE ID, the description, the date the record was published, the categorical and numerical severity scores, the attack vector, and the attack complexity.
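A minimal sketch of the recursive idea described above, not the extraction script itself. The helper names `find_values` and `first` are hypothetical, and the trimmed record below is an illustrative assumption about how fields such as `cveId`, `cweId`, and `baseScore` are nested; real files vary, which is exactly why the search ignores the path and only looks at the type of each node.

```python
import json

def find_values(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON object."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key may also appear deeper down
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)
    # Scalars (str, int, None, ...) hold no nested keys, so recursion stops here

def first(node, key):
    """Return the first match for `key`, or None when the field is absent."""
    return next(find_values(node, key), None)

# Hypothetical, heavily trimmed record used only for illustration
record = json.loads("""
{
  "cveMetadata": {"cveId": "CVE-0000-0001", "state": "PUBLISHED"},
  "containers": {"cna": {
    "problemTypes": [{"descriptions": [{"cweId": "CWE-79"}]}],
    "metrics": [{"cvssV3_1": {"baseScore": 6.1, "baseSeverity": "MEDIUM"}}]
  }}
}
""")

row = {
    "cve_id": first(record, "cveId"),
    "cwe_id": first(record, "cweId"),
    "severity": first(record, "baseSeverity"),
    "severity_score": first(record, "baseScore"),
    "attack_vector": first(record, "attackVector"),  # absent here, so None
}
print(row)
```

Returning `None` for absent fields keeps every record in the table and leaves the decision about missing values to the cleaning stage.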



### Extracting Data from the CWE List
The CWE list contains a lot of information that could be pertinent to our analysis, and it is relatively easy to access. This includes the CWE IDs, names, descriptions, related CWEs, the nature of these relationships, the technological vector (web-based, network, etc.), background details, the phases of development in which the weakness could be introduced, descriptions of how it is introduced, the likelihood of the weakness, the scope and impact of its common consequences, detection methods along with their descriptions and effectiveness, mitigation strategies, potential vulnerabilities, observed vulnerabilities (with direct references to CVE IDs), and more. Because of the observed vulnerabilities, we can merge the two dataframes and then clean and filter the result so that we have a comprehensive dataset to work with. Because there's so much text data in the CWE list, we should have plenty of information to feed into a natural language processing pipeline that can help build a classification model to predict whether a given CVE is associated with the IoT and its various vulnerabilities.

I created several helper functions to get us from `A` to `B`. The first simply takes the XML file as input. The next takes an element, gathers all of the text inside it and its child nodes, and returns it as one long string. This kind of column data will be useful when we tokenize and lemmatize the text to search more accurately for our desired IoT-vulnerability-related keywords.
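The text-flattening helper could look roughly like the sketch below. The function name `element_text` and the tiny XML snippet are hypothetical, and the snippet omits the XML namespaces that a real CWE file would likely require in `find()` calls.

```python
import xml.etree.ElementTree as ET

def element_text(elem):
    """Join the text of an element and all of its descendants into one string."""
    if elem is None:
        return ""
    # itertext() walks the subtree in document order;
    # split() followed by join() collapses newlines and repeated spaces
    return " ".join(" ".join(elem.itertext()).split())

# Hypothetical snippet for illustration only
xml = """
<Weakness ID="1004">
  <Description>The product uses a cookie
     <b>without</b> the HttpOnly flag.</Description>
</Weakness>
"""
root = ET.fromstring(xml)
print(element_text(root.find("Description")))
```

Handling `None` up front means a missing element produces an empty string instead of an exception, which is convenient when many records lack optional sections.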

### Saving the Data
Parquet is a file format that streamlines the storage and retrieval of columnar data. I also saved a copy of both dataframes to CSV (a basic, spreadsheet-like text format). I kept the default option `None` for the `index` parameter of the Parquet writer, which stores a default range index as a range in the file's metadata instead of as a column of values. This means the index doesn't take up the space it would if it were saved into the dataframe as a separate attribute, yet it still provides a way to keep track of the records when splitting them into training, validation, and test sets for an ML algorithm, should our work come to that.

<span style='font-weight:600;color:#ff8800;background-color:#666672;border-radius:3px;padding-inline:3px;padding-block:1px;'>Don't run these code cells unless you want to overwrite the saved files!</span>

## Loading the Data
I saved the dataframes resulting from our extraction to secondary storage because pulling data from $260,000$ files takes a very long time (1.5+ hours). Now that we have neatly packaged files, we'll read them into Pandas so we can inspect their cleanliness, merge them in interesting ways, and construct a comprehensive analysis with descriptive statistics, chi-square independence testing, Spearman's rank correlation, and logistic regression. The CWE list in particular contains many columns whose values are lists. Because Pandas doesn't parse such lists natively when reading CSV, I give the parsing function a converter that saves us a cleaning step. That's what `explosive_cols` is: the columns containing lists of values that we'll explode into new rows during preprocessing.
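As a rough illustration of the converter idea for the CSV copies, the sketch below passes a parsing function to `pd.read_csv` through its `converters` argument. The helper `parse_list`, the in-memory CSV text, and its values are assumptions made for the example; the cells that follow load the Parquet files instead.

```python
import ast
import io
import pandas as pd

def parse_list(cell):
    """Turn a CSV cell such as "['a', 'b']" back into a Python list."""
    cell = cell.strip()
    if cell.startswith("[") and cell.endswith("]"):
        return ast.literal_eval(cell)
    # Bare values become one-item lists; empty cells become empty lists
    return [cell] if cell else []

# Toy stand-in for a saved CSV file
csv_text = '''cwe_id,rel_ids
1004,"['732']"
1007,"['451', '1021']"
'''

explosive_cols = ["rel_ids"]
df = pd.read_csv(
    io.StringIO(csv_text),
    converters={col: parse_list for col in explosive_cols},
)
print(df["rel_ids"].tolist())
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text that comes from a file.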

In [3]:
cves = pd.read_parquet('../data/CVE_V5/CVE_List.parquet')

cves.head(n=1)

Unnamed: 0,cve_id,cwe_id,cve_state,date_published,description,severity,severity_score,attack_vector,attack_complexity
0,CVE-1999-0001,,PUBLISHED,2000-02-04T05:00:00,ip_input.c in BSD-derived TCP/IP implementatio...,,,,


In [4]:
cwes = pd.read_parquet('../data/CWE_V4.15/CWE_List.parquet')

cwes.head(n=1)

Unnamed: 0,cwe_id,name,description,tech_class,bg_details,rel_ids,nature_of_rels,modes_of_intro_phases,modes_of_intro_descs,likelihood,...,consequence_notes,detection_method,detection_desc,detection_effectiveness,mitigation_phases,mitigation_descs,mitigation_effectiveness,mitigate_notes,observed_vulnerabilities,vulnerability_descs
0,1004,Sensitive Cookie Without 'HttpOnly' Flag,The product uses a cookie to store sensitive i...,Web Based,An HTTP cookie is a small piece of data attrib...,[732],[ChildOf],[Implementation],[],Medium,...,"[If the HttpOnly flag is not set, then sensiti...",[Automated Static Analysis],"[Automated static analysis, commonly referred ...",[High],Implementation,[Leverage the HttpOnly flag when setting a sen...,[High],[While this mitigation is effective for protec...,"[CVE-2022-24045, CVE-2014-3852, CVE-2015-4138]",[Web application for a room automation system ...


## Preprocessing
In order to reliably analyze the data, we need to make sure it's clean. This involves several important steps, namely:
1) Understanding the data at a bird's eye level; e.g. number of rows, summary stats about numerical columns, data types (numbers, text, dates, etc.)
2) Handling missing values, redundant white space, inconsistent formatting, and typos
3) Removing duplicates
4) Converting data types into their respective forms for efficient processing

Before feeding the data into a machine learning model (if we get there), we'll have several additional steps to add on to this process:
1) We'll want to encode our categorical data so that we can use math on it
2) We'll split our data into training, validation, and testing sets to ensure our model generalizes well to unseen data
3) We'll have to handle outliers, which can skew the model and raise the risk of overfitting
4) We'll scale our data around a common mean of `0` and a standard deviation of `1` so that all features contribute equally to the algorithm's output.
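To make the encoding and scaling steps concrete, here is a small sketch using NumPy. The severity ordering in `SEVERITY_ORDER` and the sample scores are illustrative assumptions, not values taken from the pipeline.

```python
import numpy as np

# Ordinal encoding: severity labels have a natural order (assumed mapping)
SEVERITY_ORDER = {"NONE": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}
labels = ["LOW", "HIGH", "MEDIUM", "CRITICAL"]
encoded = np.array([SEVERITY_ORDER[label] for label in labels])

# Standardization: learn the mean and standard deviation from the training rows only
train_scores = np.array([5.4, 6.5, 7.8, 10.0])
test_scores = np.array([4.3, 9.8])
mean, std = train_scores.mean(), train_scores.std()

train_scaled = (train_scores - mean) / std
test_scaled = (test_scores - mean) / std  # reuse training statistics

print(encoded)
print(train_scaled.round(3), test_scaled.round(3))
```

Reusing the training statistics on the test rows keeps information about unseen data from leaking into the model.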

### Bird's Eye View of CVE List
Let's start with the CVE List, then traverse the CWEs, and finally, the full dataset once we've merged them.

In [5]:
cves.info() # Overview of the size of the dataset, its null values in each column, and their datatypes
cves.describe() # View summary stats of numerical attributes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260894 entries, 0 to 260893
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   cve_id             260894 non-null  object 
 1   cwe_id             58852 non-null   object 
 2   cve_state          260894 non-null  object 
 3   date_published     257370 non-null  object 
 4   description        246505 non-null  object 
 5   severity           65617 non-null   object 
 6   severity_score     65742 non-null   float64
 7   attack_vector      52410 non-null   object 
 8   attack_complexity  52431 non-null   object 
dtypes: float64(1), object(8)
memory usage: 17.9+ MB


Unnamed: 0,severity_score
count,65742.0
mean,6.631829
std,1.752317
min,0.0
25%,5.4
50%,6.5
75%,7.8
max,10.0


In [6]:
cves['severity'].value_counts() # Types of severity

severity
MEDIUM      31632
HIGH        24219
CRITICAL     5646
LOW          4085
NONE           33
medium          1
MODERATE        1
Name: count, dtype: int64

In [7]:
cves['attack_vector'].value_counts() # Types of attack vector

attack_vector
NETWORK             37346
LOCAL               11747
ADJACENT_NETWORK     2659
PHYSICAL              614
ADJACENT               44
Name: count, dtype: int64

In [8]:
cves['attack_complexity'].value_counts() # Levels of attack complexity

attack_complexity
LOW     43947
HIGH     8484
Name: count, dtype: int64

In [9]:
cves['cwe_id'].describe() # View summary stats of CWE IDs

count      58852
unique       664
top       CWE-79
freq        8871
Name: cwe_id, dtype: object

In [10]:
cves['date_published'] # View date formats

0              2000-02-04T05:00:00
1              1999-09-29T04:00:00
2              1999-09-29T04:00:00
3              2000-02-04T05:00:00
4              1999-09-29T04:00:00
                    ...           
260889    2024-08-20T23:31:03.646Z
260890    2024-08-20T23:31:05.010Z
260891    2024-08-21T20:20:26.856Z
260892    2024-08-21T20:20:27.045Z
260893    2024-08-21T20:20:27.239Z
Name: date_published, Length: 260894, dtype: object

In [11]:
cves['description']

0         ip_input.c in BSD-derived TCP/IP implementatio...
1         Buffer overflow in NFS mountd gives root acces...
2         Execute commands as root via buffer overflow i...
3         MIME buffer overflow in email clients, e.g. So...
4         Arbitrary command execution via IMAP buffer ov...
                                ...                        
260889    A vulnerability was found in Genexis Tilgin Ho...
260890    A vulnerability classified as critical has bee...
260891    Inappropriate implementation in WebApp Install...
260892    Inappropriate implementation in Custom Tabs in...
260893    Inappropriate implementation in Extensions in ...
Name: description, Length: 260894, dtype: object

In [12]:
cves['cve_state'].value_counts() # Types of state

cve_state
PUBLISHED    246505
REJECTED      14389
Name: count, dtype: int64

In [13]:
# Do the rejected IDs contain any useful info?
cves[cves['cve_state'] == 'REJECTED'].head() # Rejected CVE IDs

Unnamed: 0,cve_id,cwe_id,cve_state,date_published,description,severity,severity_score,attack_vector,attack_complexity
19,CVE-1999-0020,,REJECTED,2000-02-04T05:00:00,,,,,
109,CVE-1999-0110,,REJECTED,2000-02-04T05:00:00,,,,,
186,CVE-1999-0187,,REJECTED,2000-02-04T05:00:00,,,,,
281,CVE-1999-0282,,REJECTED,2000-02-04T05:00:00,,,,,
334,CVE-1999-0335,,REJECTED,1999-09-29T04:00:00,,,,,


In [13]:
len(cves[cves['description'].isna()]) == len(cves[cves['cve_state'] == 'REJECTED']) # Does the number of CVEs lacking a description exactly match the number of rejected CVEs?

True

From the looks of it, every record has an ID and a state, and nearly all have a publication date ($257,370$ of $260,894$). Some $14,000$ records have been rejected; these contain no other useful information and can be dropped entirely from the table. The publication date will be converted to a proper datetime format. Every published record has a description. The severity scores are already the proper numerical data type, but their categorical counterparts contain variant spellings of `MEDIUM` (`medium` and `MODERATE`) which, based on the [NVD-documented categories](https://nvd.nist.gov/vuln-metrics/cvss), will need to be addressed.

### Dropping Unnecessary Information

In [14]:
# Drop rejected CVEs
cves = cves.drop(cves[cves['cve_state'] == 'REJECTED'].index)
cves['cve_state'].value_counts()

cve_state
PUBLISHED    246505
Name: count, dtype: int64

### Correcting Data Types

In [15]:
# Convert publication date to datetime format
cves['date_published'] = pd.to_datetime(cves['date_published'], format='ISO8601', utc=True)

# Convert objects to text data (string)
obj_cols = cves.select_dtypes(include=['object']).columns
cves[obj_cols] = cves[obj_cols].astype('string')

cves.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246505 entries, 0 to 260893
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   cve_id             246505 non-null  string             
 1   cwe_id             58852 non-null   string             
 2   cve_state          246505 non-null  string             
 3   date_published     245989 non-null  datetime64[ns, UTC]
 4   description        246505 non-null  string             
 5   severity           65617 non-null   string             
 6   severity_score     65742 non-null   float64            
 7   attack_vector      52410 non-null   string             
 8   attack_complexity  52431 non-null   string             
dtypes: datetime64[ns, UTC](1), float64(1), string(7)
memory usage: 18.8 MB


### Standardizing Formatting

In [16]:
# Standardize severity scores
cves['severity'] = cves['severity'].replace(['medium', 'MODERATE'], 'MEDIUM')
cves['severity'].value_counts()

severity
MEDIUM      31634
HIGH        24219
CRITICAL     5646
LOW          4085
NONE           33
Name: count, dtype: Int64

In [17]:
# Check for leading or trailing whitespace
str_cols = cves.select_dtypes(include=['string']).columns

def check_whitespace():
    for col in str_cols:
        trimmable = cves[col].str.contains(r'^\s|\s$', regex=True).sum()
        print(f"Column '{col}' has {trimmable} trimmable whitespace characters.")

# Remove leading or trailing whitespace
cves[str_cols] = cves[str_cols].apply(lambda x: x.str.strip())

check_whitespace()

Column 'cve_id' has 0 trimmable whitespace characters.
Column 'cwe_id' has 0 trimmable whitespace characters.
Column 'cve_state' has 0 trimmable whitespace characters.
Column 'description' has 0 trimmable whitespace characters.
Column 'severity' has 0 trimmable whitespace characters.
Column 'attack_vector' has 0 trimmable whitespace characters.
Column 'attack_complexity' has 0 trimmable whitespace characters.


### Removing Duplicates

In [18]:
# Check for duplicates
duplicates = cves.duplicated().sum()
print(f'There are {duplicates} duplicates in the dataset.')

There are 0 duplicates in the dataset.


### Bird's Eye View of CWE List
We still have to handle null values in the CVE list, but since we'll be populating it with data from the CWE list during the merge, we'll first apply the same cleaning techniques to our second dataset. Once the two have been merged, we'll filter out or impute the null values depending on what works best for the analysis.

We'll want to figure out the same basic facts about the CWE list that we did for the CVE list.

In [19]:
cwes.info() # Overview of the size of the dataset, its null values in each column, and their datatypes
cwes.head(n=1)
cwes.columns



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 964 entries, 0 to 963
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   cwe_id                    964 non-null    object
 1   name                      964 non-null    object
 2   description               964 non-null    object
 3   tech_class                917 non-null    object
 4   bg_details                964 non-null    object
 5   rel_ids                   964 non-null    object
 6   nature_of_rels            964 non-null    object
 7   modes_of_intro_phases     964 non-null    object
 8   modes_of_intro_descs      964 non-null    object
 9   likelihood                964 non-null    object
 10  scope_of_consequences     964 non-null    object
 11  impact_of_consequences    964 non-null    object
 12  consequence_notes         964 non-null    object
 13  detection_method          964 non-null    object
 14  detection_desc            

Index(['cwe_id', 'name', 'description', 'tech_class', 'bg_details', 'rel_ids',
       'nature_of_rels', 'modes_of_intro_phases', 'modes_of_intro_descs',
       'likelihood', 'scope_of_consequences', 'impact_of_consequences',
       'consequence_notes', 'detection_method', 'detection_desc',
       'detection_effectiveness', 'mitigation_phases', 'mitigation_descs',
       'mitigation_effectiveness', 'mitigate_notes',
       'observed_vulnerabilities', 'vulnerability_descs'],
      dtype='object')

Due to the original data potentially having more than one value for a particular column, I had to pull data dynamically for several attributes into what are technically lists of values. For example, a single CWE ID may have multiple related weaknesses or multiple detection methods. There are groups of these attributes that should correspond to other columns in their group with a 1-to-1 relationship, so that each of a CWE ID's related weaknesses can be mapped to their own respective "nature of relationship" values, or that each of a CWE's detection methods has a detection description and a detection effectiveness, for example. All of this makes cleaning this particular dataset significantly more involved than the CVE list. Dealing with other cleaning steps is made much more complicated without first handling the dataset's architectural issues.

The solution I landed on was to pull in the data as a Parquet file (which can represent a broader range of data types than a simple CSV), create a function that transforms the necessary columns into lists, correct various syntactic anomalies, pad each list so that every list in a given row has the same number of items, and finally explode these lists across multiple rows. This preserves the 1-to-1 relationship between the items in one list and the items in the other lists of its group.

In [56]:
# For converting values into manipulable lists
def safe_lit_eval(value):
    try:
        if isinstance(value, np.ndarray):
            value = value.tolist()
        elif isinstance(value, str):
            value = value.strip()
            # Use `not` here: `~` on a Python bool is bitwise and always truthy
            if not value.startswith('[') and not value.endswith(']') and value:
                return [value]
            if value.startswith('[') and value.endswith(']'):
                return ast.literal_eval(value)
        elif isinstance(value, list):
            return value

    except (ValueError, SyntaxError) as e:
        print(f'Warning: Error parsing value "{value}": {e}')

    # If list is empty, return NaN
    if isinstance(value, list) and not value:
        return [np.nan]
    return value

# For applying the list conversion to every listed column
def eval_lists(df, cols):
    for col in cols:
        df[col] = df[col].apply(safe_lit_eval)
    return df

# For normalizing list values
def normalize_list_syntax(lst):
    if isinstance(lst, list):
        if not lst: # []
            return [np.nan]
        return [np.nan if item == '' or item is None else str(item).strip() for item in lst]
    return lst

# For applying the list normalization to every listed column
def normalize_lists(df, cols):
    for col in cols:
        df[col] = df[col].apply(normalize_list_syntax)
    return df

# For padding each list with NaN so that it matches the length of the
# longest list in the row
def pad_lists(row, cols):
    # Find the length of the longest list in the row (non-list values count as 0)
    max_len = max(len(row[col]) if isinstance(row[col], list) else 0 for col in cols)
    # Equalize list length
    for col in cols:
        if isinstance(row[col], list):
            row[col] += [np.nan] * (max_len - len(row[col]))
        else:
            row[col] = [np.nan] * max_len
    return row

# For checking if all list values in a given row have the same length
def check_equal_list_lengths(row, cols):
    lengths = [len(row[col]) for col in cols if isinstance(row[col], list)]
    return len(set(lengths)) == 1

# For spreading list items over new rows
def explode_cols(df, cols):
    return df.explode(cols, ignore_index=True)

# For combining all necessary actions
def process_explosive_cols(df, cols):
    df = eval_lists(df, cols)
    df = normalize_lists(df, cols)
    df = df.apply(lambda row: pad_lists(row, cols), axis=1)

    all_rows_valid = df.apply(lambda row: check_equal_list_lengths(row, cols), axis=1).all()
    if all_rows_valid:
        df = explode_cols(df, cols)
        print('Lists have been exploded successfully.')
    else:
        print('All lists in a row need to have the same length to explode.')
    return df

# Columns to process
explosive_cols = [
    'rel_ids',
    'nature_of_rels',
    'modes_of_intro_phases',
    'modes_of_intro_descs',
    'scope_of_consequences',
    'impact_of_consequences',
    'consequence_notes',
    'detection_method',
    'detection_desc',
    'detection_effectiveness',
    'mitigation_phases',
    'mitigation_descs',
    'mitigation_effectiveness',
    'mitigate_notes',
    'observed_vulnerabilities',
    'vulnerability_descs'
]

# Make a copy for testing purposes
# cwes_test1 = cwes.copy(deep=True)

# Process the explosive columns
cwes = process_explosive_cols(cwes, explosive_cols)

Lists have been exploded successfully.
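To see what padding and exploding do, here is a toy example that is separate from the cell above. The frame `toy` and its values are made up for illustration; passing several columns to `DataFrame.explode` at once requires the lists in each row to have matching lengths, which is what the padding guarantees.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "cwe_id": ["A", "B"],
    "methods": [["Fuzzing", "Manual Analysis"], ["Black Box"]],
    "effectiveness": [["High"], ["Limited"]],
})
cols = ["methods", "effectiveness"]

# Length of the longest list in each row
longest = [max(len(v) for v in values) for values in zip(*(toy[c] for c in cols))]

# Pad the shorter lists with NaN so every list in a row has the same length
for c in cols:
    padded = [v + [np.nan] * (n - len(v)) for v, n in zip(toy[c], longest)]
    toy[c] = pd.Series(padded, index=toy.index)

# Explode all list columns together: item i of one list stays beside item i of the others
exploded = toy.explode(cols, ignore_index=True)
print(exploded)
```

Row `A` becomes two rows: the first pairs `Fuzzing` with `High`, and the second pairs `Manual Analysis` with a missing effectiveness value.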


In [61]:
cwes.head(3)

Unnamed: 0,cwe_id,name,description,tech_class,bg_details,rel_ids,nature_of_rels,modes_of_intro_phases,modes_of_intro_descs,likelihood,...,consequence_notes,detection_method,detection_desc,detection_effectiveness,mitigation_phases,mitigation_descs,mitigation_effectiveness,mitigate_notes,observed_vulnerabilities,vulnerability_descs
0,1004,Sensitive Cookie Without 'HttpOnly' Flag,The product uses a cookie to store sensitive i...,Web Based,An HTTP cookie is a small piece of data attrib...,732.0,ChildOf,Implementation,,Medium,...,"If the HttpOnly flag is not set, then sensitiv...",Automated Static Analysis,"Automated static analysis, commonly referred t...",High,Implementation,Leverage the HttpOnly flag when setting a sens...,High,While this mitigation is effective for protect...,CVE-2022-24045,Web application for a room automation system h...
1,1004,Sensitive Cookie Without 'HttpOnly' Flag,The product uses a cookie to store sensitive i...,Web Based,An HTTP cookie is a small piece of data attrib...,,,,,Medium,...,If the cookie in question is an authentication...,,,,,,,,CVE-2014-3852,CMS written in Python does not include the HTT...
2,1004,Sensitive Cookie Without 'HttpOnly' Flag,The product uses a cookie to store sensitive i...,Web Based,An HTTP cookie is a small piece of data attrib...,,,,,Medium,...,,,,,,,,,CVE-2015-4138,Appliance for managing encrypted communication...


The goal is now to continue preprocessing the data—looking at data types, checking for duplicates and typos, removing whitespace and normalizing capitalization, renaming columns if necessary, and merging the two data tables based on the `observed_vulnerabilities`/`cve_id` attributes.

### Checking For Typos

In [70]:
# Rename columns
col_names = {
    'detection_method': 'detection_methods',
    'detection_desc': 'detection_descs'
}

cwes = cwes.rename(columns=col_names)

# Check for typos
cwes['tech_class'].value_counts()



In [71]:
cwes['nature_of_rels'].value_counts()

nature_of_rels
ChildOf       1308
CanPrecede     141
PeerOf          92
nan             34
CanAlsoBe       27
Requires        13
StartsWith       3
Name: count, dtype: int64

In [72]:
cwes['modes_of_intro_phases'].value_counts()

modes_of_intro_phases
Implementation              745
Architecture and Design     340
nan                         138
Operation                   104
System Configuration          9
Manufacturing                 7
Requirements                  7
Integration                   6
Installation                  6
Build and Compilation         4
Documentation                 3
Policy                        2
Patching and Maintenance      1
Bundling                      1
Distribution                  1
Testing                       1
Name: count, dtype: int64

In [73]:
cwes['likelihood'].value_counts()

likelihood
          2804
High       810
Medium     430
Low        120
Name: count, dtype: int64

In [74]:
cwes['detection_methods'].value_counts()

detection_methods
nan                                                       639
Automated Static Analysis                                 257
Architecture or Design Review                              62
Dynamic Analysis with Manual Results Interpretation        52
Manual Static Analysis - Source Code                       46
Automated Static Analysis - Source Code                    44
Manual Analysis                                            44
Automated Static Analysis - Binary or Bytecode             34
Fuzzing                                                    32
Manual Static Analysis - Binary or Bytecode                31
Dynamic Analysis with Automated Results Interpretation     31
Automated Dynamic Analysis                                 27
Black Box                                                  21
Manual Static Analysis                                     13
Automated Analysis                                         11
Manual Dynamic Analysis                             

In [75]:
cwes['detection_effectiveness'].value_counts()

detection_effectiveness
nan              639
High             420
SOAR Partial     150
Moderate          54
Limited            9
Opportunistic      6
Name: count, dtype: int64

In [76]:
cwes['mitigation_phases'].value_counts()

mitigation_phases
Implementation              517
Architecture and Design     254
Testing                      55
System Configuration         47
Operation                    40
Build and Compilation        11
Installation                  9
Integration                   6
Requirements                  5
Documentation                 4
Manufacturing                 4
Policy                        2
Distribution                  2
Patching and Maintenance      1
Name: count, dtype: int64

In [77]:
cwes['mitigation_effectiveness'].value_counts()

mitigation_effectiveness
nan                            296
High                            75
Defense in Depth                50
Moderate                        43
Limited                         37
Discouraged Common Practice      9
Incidental                       4
None                             1
Name: count, dtype: int64
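Several of the counts above include a `nan` row. Since `value_counts()` drops real missing values by default, those rows are most likely the literal string `'nan'`, produced when a missing value is passed through `str()` during list normalization. The sketch below shows one way to turn such placeholders (and empty strings) back into proper missing values; the toy series is an assumption for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.Series(["High", "nan", "Moderate", "nan", ""], name="detection_effectiveness")

# A visible 'nan' row here means the value is a string, not a missing value
print(toy.value_counts())

# Replace the placeholder strings with real missing values
cleaned = toy.replace(["nan", ""], np.nan)
print(cleaned.value_counts(dropna=False))
```

With real missing values in place, `isna()`, `dropna()`, and `fillna()` behave as expected during the later null-handling step.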

### Checking Data Types

In [78]:
# Check data types
def check_type(cols):
    for col in cols:
        dtype = type(col)
        print(f'"{col}" is of type {dtype}.')

check_type(cwes.columns)

"cwe_id" is of type <class 'str'>.
"name" is of type <class 'str'>.
"description" is of type <class 'str'>.
"tech_class" is of type <class 'str'>.
"bg_details" is of type <class 'str'>.
"rel_ids" is of type <class 'str'>.
"nature_of_rels" is of type <class 'str'>.
"modes_of_intro_phases" is of type <class 'str'>.
"modes_of_intro_descs" is of type <class 'str'>.
"likelihood" is of type <class 'str'>.
"scope_of_consequences" is of type <class 'str'>.
"impact_of_consequences" is of type <class 'str'>.
"consequence_notes" is of type <class 'str'>.
"detection_methods" is of type <class 'str'>.
"detection_descs" is of type <class 'str'>.
"detection_effectiveness" is of type <class 'str'>.
"mitigation_phases" is of type <class 'str'>.
"mitigation_descs" is of type <class 'str'>.
"mitigation_effectiveness" is of type <class 'str'>.
"mitigate_notes" is of type <class 'str'>.
"observed_vulnerabilities" is of type <class 'str'>.
"vulnerability_descs" is of type <class 'str'>.


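Note that `type(col)` reports the type of each column label rather than of the data in the column, so the output above only confirms that the labels are strings; `cwes.dtypes` shows the types of the values themselves. Once the columns hold simple text, trimming whitespace from each entry is straightforward.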

In [83]:
# Check for leading or trailing whitespace
str_cols = cwes.select_dtypes(include=['string']).columns

def check_whitespace():
    for col in str_cols:
        trimmable = cwes[col].str.contains(r'^\s+|\s+$', regex=True).sum()
        print(f"Column '{col}' has {trimmable} trimmable whitespace characters.")

# Remove leading or trailing whitespace
cwes[str_cols] = cwes[str_cols].apply(lambda x: x.str.strip())
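One caveat with selecting only `'string'` columns is that text stored with the generic `object` dtype is skipped, in which case the strip would do nothing. A hedged sketch of a more defensive variant follows, using a made-up frame `toy`.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["  Fuzzing ", "Black Box"],
    "likelihood": ["High", None],
})

# .dtypes describes the data in each column (type(col) would describe the label)
print(toy.dtypes)

# Select text stored either as generic objects or as a dedicated string dtype
text_cols = toy.select_dtypes(include=["object", "string"]).columns

for col in text_cols:
    # Convert to the string dtype first, then trim; missing values stay missing
    toy[col] = toy[col].astype("string").str.strip()

print(toy)
```

Converting before stripping also makes later `select_dtypes(include=['string'])` calls pick these columns up.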

### Checking For Duplicates

In [88]:
print(f'There are {cwes.duplicated().sum()} duplicates in the dataframe.')

dups = cwes[cwes.duplicated(keep=False)]
cwes = cwes.drop_duplicates(keep='first')

print(f'There are now {cwes.duplicated().sum()} duplicates in the dataframe.')


There are 11 duplicates in the dataframe.
There are now 0 duplicates in the dataframe.


## Merging the Dataframes
It's time to perform the merge. Since there are hundreds of times more CVE records than CWEs, the operation that makes the most sense is a left outer merge of the CWEs into the CVE dataset via the `cwe_id` attribute. This keeps every CVE record and fills in the CWE fields wherever a match exists, instead of dropping every CVE record that lacks a CWE ID. A more exclusive dataset can be made by inner merging the dataframes, so that the only observations that remain are those with both a CVE ID and a CWE ID.

In [97]:
df = cves.merge(cwes, how='left')

In [99]:
df.head(3)

Unnamed: 0,cve_id,cwe_id,cve_state,date_published,description,severity,severity_score,attack_vector,attack_complexity,name,...,consequence_notes,detection_methods,detection_descs,detection_effectiveness,mitigation_phases,mitigation_descs,mitigation_effectiveness,mitigate_notes,observed_vulnerabilities,vulnerability_descs
0,CVE-1999-0001,,PUBLISHED,2000-02-04 05:00:00+00:00,ip_input.c in BSD-derived TCP/IP implementatio...,,,,,,...,,,,,,,,,,
1,CVE-1999-0002,,PUBLISHED,1999-09-29 04:00:00+00:00,Buffer overflow in NFS mountd gives root acces...,,,,,,...,,,,,,,,,,
2,CVE-1999-0003,,PUBLISHED,1999-09-29 04:00:00+00:00,Execute commands as root via buffer overflow i...,,,,,,...,,,,,,,,,,


In [100]:
df['cwe_id'].value_counts()

cwe_id
CWE-79      8871
CWE-89      3554
CWE-20      2609
CWE-352     2118
CWE-200     1802
            ... 
CWE-301        1
CWE-446        1
CWE-1007       1
CWE-623        1
CWE-1262       1
Name: count, Length: 664, dtype: int64
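One thing worth checking after a merge like this: when `merge` is called without `on`, pandas joins on every column name the two frames share, which here appears to include `description` as well as `cwe_id`. The key formats also seem to differ (`CWE-79` on the CVE side, bare numbers such as `1004` on the CWE side). The sketch below, on made-up toy frames, normalizes the key, merges on it explicitly, and uses `indicator=True` to show which rows found a match; the `cwe_key` column name is hypothetical.

```python
import pandas as pd

cves_toy = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0002"],
    "cwe_id": ["CWE-79", None],
    "description": ["XSS in a toy app", "No weakness recorded"],
})
cwes_toy = pd.DataFrame({
    "cwe_id": ["79", "1004"],
    "name": ["Toy name for CWE-79", "Sensitive Cookie Without 'HttpOnly' Flag"],
    "description": ["Toy CWE description", "Another toy description"],
})

# Bring both keys to the same format by dropping the 'CWE-' prefix
cves_toy["cwe_key"] = cves_toy["cwe_id"].str.replace("CWE-", "", regex=False)

merged = cves_toy.merge(
    cwes_toy.rename(columns={"cwe_id": "cwe_key"}),
    how="left",
    on="cwe_key",                # join on the weakness ID only
    suffixes=("_cve", "_cwe"),   # keep both description columns apart
    indicator=True,              # adds a _merge column: 'both' or 'left_only'
)
print(merged[["cve_id", "cwe_key", "name", "_merge"]])
```

Counting the values of `_merge` on the real data would show how many CVE records actually picked up CWE details.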