# Preprocessing CWE Data
Now that we have a neatly packaged CSV file, we'll read it into Pandas and inspect how clean it is, so that we can later build a comprehensive analysis on top of it. The CWE list contains many columns whose values are themselves lists. Because Pandas doesn't handle these lists natively, I give the parsing function a converter that saves us a cleaning step. That's what `explosive_cols` is: the columns containing lists of values that we'll explode into new rows during preprocessing.

Because the original data can hold more than one value for a given attribute, I had to pull several attributes in dynamically as lists of values. For example, a single CWE ID may have multiple related weaknesses or multiple detection methods. Some of these attributes form groups whose columns should correspond 1-to-1: each of a CWE ID's related weaknesses maps to its own "nature of relationship" value, and each detection method has its own detection description and detection effectiveness. All of this makes cleaning this dataset significantly more involved than cleaning the CVE list, and the other cleaning steps become much more complicated unless these structural issues are handled first.

The solution I landed on was to pull in the data as a parquet file (which understands a broader range of datatypes than a simple CSV), create a function that transforms the necessary columns into lists, correct various syntactic anomalies, pad each list so that every list in a given row has the same number of items, and finally explode these lists across multiple rows. This preserves the 1-to-1 relationship between the items in one list and the items in the other lists of the same row.
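
To make the pad-then-explode idea concrete, here is a minimal sketch on a made-up two-row frame. The values and the `toy`, `list_cols`, and `longest` names are placeholders for illustration only; the real pipeline further down works on the full CWE columns.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): weakness A has two relationships, weakness B has
# one relationship and no recorded "nature" value.
toy = pd.DataFrame({
    'cwe_id': ['A', 'B'],
    'rel_ids': [['732', '284'], ['20']],
    'nature_of_rels': [['ChildOf', 'PeerOf'], []],
})
list_cols = ['rel_ids', 'nature_of_rels']

# Length of the longest list in each row
longest = [max(len(v) for v in vals) for vals in zip(*(toy[c] for c in list_cols))]

# Pad every list with NaN up to that length
for c in list_cols:
    toy[c] = pd.Series(
        [v + [np.nan] * (n - len(v)) for v, n in zip(toy[c], longest)],
        index=toy.index,
    )

# Exploding several columns at once keeps items at the same position together
exploded = toy.explode(list_cols, ignore_index=True)
print(exploded)
```

Each related ID stays on the same row as its own relationship type, and the missing value for weakness B shows up as a NaN rather than shifting other values out of alignment.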

## Preprocessing Steps
In order to reliably analyze the data, we need to make sure it's clean. This involves several important steps, namely:
- Understanding the data at a bird's-eye level, e.g. number of rows, summary stats about numerical columns, data types (numbers, text, dates, etc.)
- Handling incorrect data types
- Interpolating or discarding missing values
- Removing redundant white space characters
- Standardizing the formatting of column names, categorical data, etc.
- Deleting duplicate observations

Before feeding the data into a machine learning model (if we get there), we'll have several additional steps to add on to this process:
- We'll want to encode our categorical data so that we can use math on it
- We'll split our data into training, validation, and testing sets to ensure our model generalizes well to unseen data
- We'll have to handle outliers, which can make overfitting our model a real risk
- We'll scale our data around a common mean of `0` and a standard deviation of `1` so that all features contribute on a comparable scale to the algorithm's output
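
As a rough sketch of what three of those later steps (encoding, splitting, scaling) could look like with plain Pandas and NumPy: the `features` table, its column names, and the 70/15/15 split are all assumptions made up for this example, and outlier handling is left out.

```python
import numpy as np
import pandas as pd

# Made-up feature table; the column names are placeholders, not CWE fields
rng = np.random.default_rng(seed=0)
features = pd.DataFrame({
    'score': rng.normal(loc=50, scale=10, size=100),
    'phase': np.tile(['Design', 'Implementation', 'Operation', 'Design'], 25),
})

# One-hot encode the categorical column so it becomes numeric
encoded = pd.get_dummies(features, columns=['phase'], dtype=float)

# Shuffle, then split 70/15/15 into training, validation, and testing sets
shuffled = encoded.sample(frac=1, random_state=0).reset_index(drop=True)
train, valid, test = shuffled.iloc[:70], shuffled.iloc[70:85], shuffled.iloc[85:]

# Standardize with statistics from the training set only, so that nothing
# about the validation or testing data leaks into the scaling
mean, std = train['score'].mean(), train['score'].std(ddof=0)
train_scaled = (train['score'] - mean) / std
test_scaled = (test['score'] - mean) / std

print(round(train_scaled.mean(), 6), round(train_scaled.std(ddof=0), 6))
```

Splitting before scaling is the design choice worth noting here: the mean and standard deviation come from the training rows and are then reused for the other sets.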

In [1]:
# Import libraries
import pandas as pd # For data cleaning and analysis
import numpy as np # For advanced calculations
import ast # For safely evaluating strings as Python expressions

# Import the data
cwes = pd.read_parquet('../data/CWE_V4.15/CWE_List.parquet')
cwes.head(n=1)

Unnamed: 0,cwe_id,name,description,tech_class,bg_details,rel_ids,nature_of_rels,modes_of_intro_phases,modes_of_intro_descs,likelihood,...,consequence_notes,detection_method,detection_desc,detection_effectiveness,mitigation_phases,mitigation_descs,mitigation_effectiveness,mitigate_notes,observed_vulnerabilities,vulnerability_descs
0,1004,Sensitive Cookie Without 'HttpOnly' Flag,The product uses a cookie to store sensitive i...,Web Based,An HTTP cookie is a small piece of data attrib...,[732],[ChildOf],[Implementation],[],Medium,...,"[If the HttpOnly flag is not set, then sensiti...",[Automated Static Analysis],"[Automated static analysis, commonly referred ...",[High],Implementation,[Leverage the HttpOnly flag when setting a sen...,[High],[While this mitigation is effective for protec...,"[CVE-2022-24045, CVE-2014-3852, CVE-2015-4138]",[Web application for a room automation system ...


In [2]:
cwes.info() # Overview of the size of the dataset, its null values in each column, and their datatypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 964 entries, 0 to 963
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   cwe_id                    964 non-null    object
 1   name                      964 non-null    object
 2   description               964 non-null    object
 3   tech_class                917 non-null    object
 4   bg_details                964 non-null    object
 5   rel_ids                   964 non-null    object
 6   nature_of_rels            964 non-null    object
 7   modes_of_intro_phases     964 non-null    object
 8   modes_of_intro_descs      964 non-null    object
 9   likelihood                964 non-null    object
 10  scope_of_consequences     964 non-null    object
 11  impact_of_consequences    964 non-null    object
 12  consequence_notes         964 non-null    object
 13  detection_method          964 non-null    object
 14  detection_desc            

In [3]:
# For converting values into manipulable lists
def safe_lit_eval(value):
    try:
        if isinstance(value, np.ndarray):
            value = value.tolist()
        elif isinstance(value, str):
            value = value.strip()
            # Use `not` here: `~` is bitwise NOT, and `~True` (-2) is still truthy
            if not value.startswith('[') and not value.endswith(']') and value:
                return [value]
            if value.startswith('[') and value.endswith(']'):
                return ast.literal_eval(value)
        elif isinstance(value, list):
            return value

    except (ValueError, SyntaxError) as e:
        print(f'Warning: Error parsing value "{value}": {e}')

    # If list is empty, return NaN
    if isinstance(value, list) and not value:
        return [np.nan]
    return value

# For applying safe_lit_eval to every value in the given columns
def eval_lists(df, cols):
    for col in cols:
        df[col] = df[col].apply(safe_lit_eval)
    return df

# For normalizing list values
def normalize_list_syntax(lst):
    if isinstance(lst, list):
        if not lst: # []
            return [np.nan]
        return [np.nan if item == '' or item is None else str(item).strip() for item in lst]
    return lst

# For applying normalize_list_syntax to every value in the given columns
def normalize_lists(df, cols):
    for col in cols:
        df[col] = df[col].apply(normalize_list_syntax)
    return df

# For padding each list in a row with NaN so that it matches the length of the
# longest list in that row
def pad_lists(row, cols):
    # Find the length of the longest list among the given columns of this row
    max_len = max(len(row[col]) if isinstance(row[col], list) else 0 for col in cols)
    # Equalize list length
    for col in cols:
        if isinstance(row[col], list):
            row[col] += [np.nan] * (max_len - len(row[col]))
        else:
            row[col] = [np.nan] * max_len
    return row

# For checking if all list values in a given row have the same length
def check_equal_list_lengths(row, cols):
    lengths = [len(row[col]) for col in cols if isinstance(row[col], list)]
    return len(set(lengths)) == 1

# For spreading list items over new rows
def explode_cols(df, cols):
    return df.explode(cols, ignore_index=True)

# For combining all necessary actions
def process_explosive_cols(df, cols):
    df = eval_lists(df, cols)
    df = normalize_lists(df, cols)
    df = df.apply(lambda row: pad_lists(row, cols), axis=1)

    all_rows_valid = df.apply(lambda row: check_equal_list_lengths(row, cols), axis=1).all()
    if all_rows_valid:
        df = explode_cols(df, cols)
        print('Lists have been exploded successfully.')
    else:
        print('All lists in a row need to have the same length to explode.')
    return df

# Columns to process
explosive_cols = [
    'rel_ids',
    'nature_of_rels',
    'modes_of_intro_phases',
    'modes_of_intro_descs',
    'scope_of_consequences',
    'impact_of_consequences',
    'consequence_notes',
    'detection_method',
    'detection_desc',
    'detection_effectiveness',
    'mitigation_phases',
    'mitigation_descs',
    'mitigation_effectiveness',
    'mitigate_notes',
    'observed_vulnerabilities',
    'vulnerability_descs'
]

# Make a copy for testing purposes
# cwes_test1 = cwes.copy(deep=True)

# Process the explosive columns
cwes = process_explosive_cols(cwes, explosive_cols)

Lists have been exploded successfully.
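
For reference, this is roughly how `ast.literal_eval` behaves on the kinds of strings the parsing step has to deal with. `parse_list_string` is a hypothetical, stripped-down helper written only for this illustration; it is not the `safe_lit_eval` function used above.

```python
import ast

def parse_list_string(text):
    """Parse a string such as "['a', 'b']" into a list; wrap anything else in a list."""
    text = text.strip()
    if text.startswith('[') and text.endswith(']'):
        try:
            # literal_eval only accepts Python literals, so it never runs code
            return ast.literal_eval(text)
        except (ValueError, SyntaxError):
            # Malformed literal: fall back to treating it as a single value
            return [text]
    return [text] if text else []

print(parse_list_string("['CVE-2022-24045', 'CVE-2014-3852']"))
print(parse_list_string("Implementation"))
print(parse_list_string("[unquoted, items]"))
```

The last call shows why the `try`/`except` matters: a bracketed string that isn't a valid Python literal raises instead of returning a list.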


## Renaming Attributes To Account For Singular Values

In [4]:
# Rename columns
col_names = {
    'rel_ids': 'rel_id',
    'nature_of_rels': 'nature_of_rel',
    'modes_of_intro_phases': 'mode_of_intro_phase',
    'modes_of_intro_descs': 'mode_of_intro_desc',
    'scope_of_consequences': 'scope_of_consequence',
    'impact_of_consequences': 'impact_of_consequence',
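    'detection_methods': 'detection_method',  # no-op here: the column is already named 'detection_method'
    'detection_descs': 'detection_desc',  # no-op here: the column is already named 'detection_desc'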
    'mitigation_phases': 'mitigation_phase',
    'mitigation_descs': 'mitigation_desc',
    'observed_vulnerabilities': 'observed_vulnerability',
    'vulnerability_descs': 'vulnerability_desc',
}

cwes = cwes.rename(columns=col_names)
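
One thing worth knowing about `rename`: by default it silently skips mapping keys that don't match any column. A small sketch on a made-up frame (`toy` and `mapping` are placeholders) of how to surface such keys:

```python
import pandas as pd

toy = pd.DataFrame({'rel_ids': ['732'], 'detection_method': ['Fuzzing']})
mapping = {'rel_ids': 'rel_id', 'detection_methods': 'detection_method'}

# Keys in the mapping that don't correspond to any existing column
unmatched = [old for old in mapping if old not in toy.columns]
print(unmatched)

# Default behaviour (errors='ignore'): unmatched keys are skipped silently
renamed = toy.rename(columns=mapping)
print(renamed.columns.tolist())

# With errors='raise', the same call fails loudly instead
try:
    toy.rename(columns=mapping, errors='raise')
    raised = False
except KeyError as exc:
    raised = True
    print(f'rename refused: {exc}')
```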

## Checking for Typos and Inconsistent Categories

In [6]:
# Check for typos
print(cwes['tech_class'].value_counts(), '\n')
print(cwes['nature_of_rel'].value_counts(), '\n')
print(cwes['mode_of_intro_phase'].value_counts(), '\n')
print(cwes['likelihood'].value_counts(), '\n')
print(cwes['detection_method'].value_counts(), '\n')
print(cwes['detection_effectiveness'].value_counts(), '\n')
print(cwes['mitigation_phase'].value_counts(), '\n')
print(cwes['mitigation_effectiveness'].value_counts(), '\n')

tech_class
                           2999
Not Technology-Specific     382
ICS/OT                      148
Mobile                      143
Web Based                   106
System on Chip               63
Cloud Computing              55
Name: count, dtype: int64 

nature_of_rel
ChildOf       1308
CanPrecede     141
PeerOf          92
nan             34
CanAlsoBe       27
Requires        13
StartsWith       3
Name: count, dtype: int64 

mode_of_intro_phase
Implementation              745
Architecture and Design     340
nan                         138
Operation                   104
System Configuration          9
Manufacturing                 7
Requirements                  7
Integration                   6
Installation                  6
Build and Compilation         4
Documentation                 3
Policy                        2
Patching and Maintenance      1
Bundling                      1
Distribution                  1
Testing                       1
Name: count, dtype: int64 

li
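
Reading `value_counts()` output by eye works for a handful of labels. As a complement, a set difference against the labels we expect flags anything unexpected. In this sketch the `expected` set is taken from the `nature_of_rel` counts above, while the `observed` values are invented to include two deliberate mistakes.

```python
import pandas as pd

expected = {'ChildOf', 'PeerOf', 'CanPrecede', 'CanAlsoBe', 'Requires', 'StartsWith'}

# Invented sample with a casing mistake and a stray space
observed = pd.Series(['ChildOf', 'childof', 'CanPrecede', 'Peer Of', None, 'Requires'])

# Labels present in the data but not in the expected set
unexpected = sorted(set(observed.dropna()) - expected)
print(unexpected)  # ['Peer Of', 'childof']
```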

## Checking Data Types

In [8]:
cols = cwes.columns

# Convert string data into appropriate form
cwes[cols] = cwes[cols].astype('string')

# Remove whitespace
cwes[cols] = cwes[cols].apply(lambda x: x.str.strip())

# Function to check whitespace
def check_whitespace(df, cols):
    for col in cols:
        trimmable = df[col].str.contains(r'^\s+|\s+$', regex=True).sum()
        print(f"Column '{col}' has {trimmable} trimmable whitespace characters.")

check_whitespace(cwes, cols)

Column 'cwe_id' has 0 trimmable whitespace characters.
Column 'name' has 0 trimmable whitespace characters.
Column 'description' has 0 trimmable whitespace characters.
Column 'tech_class' has 0 trimmable whitespace characters.
Column 'bg_details' has 0 trimmable whitespace characters.
Column 'rel_id' has 0 trimmable whitespace characters.
Column 'nature_of_rel' has 0 trimmable whitespace characters.
Column 'mode_of_intro_phase' has 0 trimmable whitespace characters.
Column 'mode_of_intro_desc' has 0 trimmable whitespace characters.
Column 'likelihood' has 0 trimmable whitespace characters.
Column 'scope_of_consequence' has 0 trimmable whitespace characters.
Column 'impact_of_consequence' has 0 trimmable whitespace characters.
Column 'consequence_notes' has 0 trimmable whitespace characters.
Column 'detection_method' has 0 trimmable whitespace characters.
Column 'detection_desc' has 0 trimmable whitespace characters.
Column 'detection_effectiveness' has 0 trimmable whitespace characters
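
Note that a check built on `str.contains(...).sum()` counts the values that have leading or trailing whitespace, not the individual whitespace characters. A small sketch on a made-up series shows the difference; the variable names are placeholders.

```python
import pandas as pd

s = pd.Series(['  High', 'Medium  ', 'Low', None, ' a b '], dtype='string')

# Number of values with leading or trailing whitespace (missing values are skipped)
values_with_padding = int(s.str.contains(r'^\s+|\s+$', regex=True).sum())

# Number of whitespace characters that strip() would remove
padding_chars = int((s.str.len() - s.str.strip().str.len()).sum())

# strip() only touches the ends; the space inside 'a b' is kept
cleaned = s.str.strip()

print(values_with_padding, padding_chars)  # 3 6
```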

## Normalizing Data

In [11]:
# Standardize null values in text columns
cwes[cols] = cwes[cols].replace([None, 'NaN', 'nan', '', np.nan], pd.NA)

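# Function to prepend a prefix such as "CWE-" to non-null IDs
def prepend(col, txt):
    # Return a new column instead of writing into a possibly temporary view
    return col.apply(lambda x: f'{txt}{x}' if pd.notna(x) else x).astype('string')

cwes['cwe_id'] = prepend(cwes['cwe_id'], 'CWE-')
cwes['rel_id'] = prepend(cwes['rel_id'], 'CWE-')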

# Filter out rows whose observed vulnerability is not a CVE ID
# (na=True keeps the rows that have no observed vulnerability at all)
cwes = cwes[cwes['observed_vulnerability'].str.startswith('CVE-', na=True)]
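
The `na` argument decides what happens to missing values in that filter. Here is a sketch on a made-up series; it uses `mask` with `isin` to standardize the nulls, which is an alternative I'm assuming is equivalent for this purpose, not the `replace` call used above.

```python
import pandas as pd

s = pd.Series(['CVE-2022-24045', 'nan', '', None, 'CWE-79'], dtype='string')

# Turn placeholder strings into proper missing values
s = s.mask(s.isin(['NaN', 'nan', '']))

# na=True treats missing values as matches, so those rows survive the filter
keep_missing = s[s.str.startswith('CVE-', na=True)]

# na=False would drop them along with the invalid IDs
drop_missing = s[s.str.startswith('CVE-', na=False)]

print(len(keep_missing), len(drop_missing))  # 4 1
```

Keeping the missing values makes sense here because most exploded rows have no observed vulnerability; only values that are present but malformed get removed.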

In [15]:
cwes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4157 entries, 0 to 4163
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   cwe_id                    4157 non-null   string
 1   name                      4157 non-null   string
 2   description               4157 non-null   string
 3   tech_class                894 non-null    string
 4   bg_details                292 non-null    string
 5   rel_id                    1580 non-null   string
 6   nature_of_rel             1580 non-null   string
 7   mode_of_intro_phase       1231 non-null   string
 8   mode_of_intro_desc        368 non-null    string
 9   likelihood                1358 non-null   string
 10  scope_of_consequence      1174 non-null   string
 11  impact_of_consequence     1174 non-null   string
 12  consequence_notes         603 non-null    string
 13  detection_method          728 non-null    string
 14  detection_desc            374

In [18]:
cats_dict = {
    'tech_class': {
        'categories': [
            'Web Based',
            'Not Technology-Specific',
            'ICS/OT',
            'Mobile',
            'System on Chip',
            'Cloud Computing',
        ],
        'ordered': False,
    },
    'nature_of_rel': {
        'categories': [
            'ChildOf',
            'PeerOf',
            'CanPrecede',
            'CanAlsoBe',
            'Requires',
            'StartsWith',
        ],
        'ordered': False,
    },
    'mode_of_intro_phase': {
        'categories': [
            'Implementation',
            'Architecture and Design',
            'Operation',
            'Manufacturing',
            'Integration',
            'System Configuration',
            'Patching and Maintenance',
            'Distribution',
            'Documentation',
            'Build and Compilation',
            'Policy',
            'Requirements',
            'Installation',
            'Bundling',
            'Testing',
        ], 'ordered': False
    },
    'likelihood': {
        'categories': [
            'Low',
            'Medium',
            'High',
        ], 'ordered': True
    },
    'detection_method': {
        'categories': [
            'Automated Static Analysis',
            'Manual Dynamic Analysis',
            'White Box',
            'Manual Static Analysis',
            'Fuzzing',
            'Automated Dynamic Analysis',
            'Automated Static Analysis - Binary or Bytecode',
            'Manual Static Analysis - Binary or Bytecode',
            'Dynamic Analysis with Automated Results Interpretation',
            'Dynamic Analysis with Manual Results Interpretation',
            'Manual Static Analysis - Source Code',
            'Automated Static Analysis - Source Code',
            'Architecture or Design Review',
            'Manual Analysis',
            'Simulation / Emulation',
            'Formal Verification',
            'Automated Analysis',
            'Black Box',
            'Other',
        ], 'ordered': False
    },
    'detection_effectiveness': {
        'categories': [
            'High',
            'Moderate',
            'Opportunistic',
            'SOAR Partial',
            'Limited',
        ], 'ordered': False
    },
    'mitigation_phase': {
        'categories': [
            'Implementation',
            'Testing',
            'System Configuration',
            'Integration',
            'Manufacturing',
            'Requirements',
            'Installation',
            'Distribution',
            'Documentation',
            'Architecture and Design',
            'Operation',
            'Patching and Maintenance',
            'Build and Compilation',
            'Policy',
        ], 'ordered': False
    },
    'mitigation_effectiveness': {
        'categories': [
            'High',
            'Defense in Depth',
            'Limited',
            'Incidental',
            'Moderate',
            'Discouraged Common Practice',
            'None',
        ], 'ordered': False
    },
}

# Apply categories
for col, cat_info in cats_dict.items():
    cwes[col] = pd.Categorical(cwes[col], categories=cat_info['categories'], ordered=cat_info['ordered'])
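
A caveat with `pd.Categorical`: any value that isn't in the declared `categories` list becomes missing without a warning, so a single misspelled category can erase real data. A minimal sketch with made-up values (the misspelling `'Childof'` is deliberate):

```python
import pandas as pd

values = ['ChildOf', 'CanPrecede', 'PeerOf', 'CanPrecede']

# 'Childof' is misspelled on purpose, so 'ChildOf' has no matching category
misspelled = pd.Categorical(values, categories=['Childof', 'PeerOf', 'CanPrecede'])
lost = int(misspelled.isna().sum())
print(lost)  # 1

# Guard: compare the labels in the data with the declared categories first
declared = ['ChildOf', 'PeerOf', 'CanPrecede']
undeclared = set(values) - set(declared)
print(undeclared)  # set()

# Ordered categoricals support comparisons that follow the declared order
likelihood = pd.Categorical(
    ['Low', 'High', 'Medium'],
    categories=['Low', 'Medium', 'High'],
    ordered=True,
)
at_least_medium = (likelihood >= 'Medium').tolist()
print(likelihood.max(), at_least_medium)
```

Running a check like `undeclared` for each column before the conversion loop is a cheap way to catch category typos.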

## Checking for Duplicates

In [19]:
cwes = cwes.drop_duplicates(keep='first')
print(f'There are now {cwes.duplicated().sum()} duplicates in the dataframe.')

There are now 0 duplicates in the dataframe.


## Saving CWE Data

In [21]:
# Saving CWE list (index=False so the leftover row index isn't written out)
cwes.to_parquet(path='../data/CWE_V4.15/cve_list_v2.parquet', index=False)
cwes.to_csv('../data/CWE_V4.15/cve_list_v2.csv', index=False)
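
Two things are worth keeping in mind about the CSV copy, sketched here with an in-memory buffer and a made-up one-column frame: writing the index produces an `Unnamed: 0` column when the file is read back, and CSV doesn't store the categorical dtypes we just set.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {'likelihood': pd.Categorical(['Low', 'High'],
                                  categories=['Low', 'Medium', 'High'],
                                  ordered=True)},
    index=[0, 5],
)

# Writing the index (the to_csv default) adds an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'likelihood']

# Without the index, only the data columns come back
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)
print(restored.columns.tolist())  # ['likelihood']

# The categorical dtype is gone after the CSV round trip
is_categorical = isinstance(restored['likelihood'].dtype, pd.CategoricalDtype)
print(is_categorical)  # False
```

That's one reason to treat the parquet file as the primary copy and the CSV as a convenience export.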