# Preprocessing MITRE's IoT CVE Data
This notebook cleans MITRE's list of known IoT-involved CVEs. The dataset is an Excel file with three sheets. The `2019-2024 CVEs` sheet contains every observation found in the file's two other sheets. This was verified by making each sheet's rows hashable, placing them in sets, and checking that the two other sheets were subsets of the first. The cleaned dataset is essential to the master copy because it defines the scope of all the other data required.
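
The subset check described above can be sketched roughly as follows. This is only an illustration of the technique: the frames `all_years`, `sheet_a`, and `sheet_b`, the helper `rows_as_set`, and all values are made up and stand in for the real sheets.

```python
import pandas as pd

def rows_as_set(df):
    """Return the rows of df as a set of plain tuples (tuples are hashable)."""
    return set(df.itertuples(index=False, name=None))

# Toy stand-ins for the three sheets (made-up values, not the real data)
all_years = pd.DataFrame({
    'cve_id': ['CVE-0000-0001', 'CVE-0000-0002', 'CVE-0000-0003'],
    'description': ['first', 'second', 'third'],
})
sheet_a = all_years.iloc[:2]
sheet_b = all_years.iloc[1:]

# Each smaller sheet should be a subset of the combined sheet
is_subset_a = rows_as_set(sheet_a) <= rows_as_set(all_years)
is_subset_b = rows_as_set(sheet_b) <= rows_as_set(all_years)
print(is_subset_a, is_subset_b)
```

Comparing sets ignores row order and repeated rows, which is what a "does this sheet add anything new" check needs.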

Cleaning the data involves several important steps, namely:
- Understanding the data at a high level, e.g. number of rows and data types (numbers, text, dates, etc.)
- Handling incorrect data and data types
- Interpolating or discarding missing values
- Removing redundant whitespace characters
- Standardizing the formatting of column names, categorical data, etc.
- Deleting duplicate observations
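
As one illustration of the standardization step in the list above, column names could be normalized to snake_case along these lines. The helper `standardize_columns` and the example column labels are hypothetical and are not part of this notebook's pipeline, which renames its two columns by hand.

```python
import pandas as pd

def standardize_columns(df):
    """Return a copy of df with lower-case, snake_case column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()                        # drop surrounding whitespace
        .str.lower()                                   # lower-case everything
        .str.replace(r'[^0-9a-z]+', '_', regex=True)   # non-alphanumerics -> underscore
        .str.strip('_')                                # drop leading/trailing underscores
    )
    return out

# Made-up column labels for demonstration only
example = pd.DataFrame({' CVE ID ': ['x'], 'Description (raw)': ['y']})
cleaned = standardize_columns(example)
print(cleaned.columns.tolist())
```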

In [2]:
# Import libraries
import pandas as pd
import numpy as np

# Import raw data
df2019_2024 = pd.read_excel(
    '../data/MITRE/MITRE_2024_IoT_CVEs.xlsx',
    sheet_name='2019-2024 CVEs'
)
df2019_2024.head(3)

Unnamed: 0,CVE-2024-38089,Microsoft Defender for IoT Elevation of Privilege Vulnerability
0,CVE-2024-29195,The azure-c-shared-utility is a C library for ...
1,CVE-2024-29055,Microsoft Defender for IoT Elevation of Privil...
2,CVE-2024-29054,Microsoft Defender for IoT Elevation of Privil...


In [2]:
df2019_2024.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1088 entries, 0 to 1087
Data columns (total 2 columns):
 #   Column                                                           Non-Null Count  Dtype 
---  ------                                                           --------------  ----- 
 0   CVE-2024-38089                                                   1088 non-null   object
 1   Microsoft Defender for IoT Elevation of Privilege Vulnerability  1088 non-null   object
dtypes: object(2)
memory usage: 17.1+ KB


## Fixing Column Names
The column names of these datasets are themselves observations. They'll have to be pushed down into the dataset and replaced with accurate column names.

In [3]:
def add_cols_as_obs(df):
    """Return a copy of df with its column labels inserted as the first row."""
    df = df.copy()  # Avoid mutating the caller's DataFrame
    df.loc[-1] = df.columns.tolist()  # Add the column labels as a row at index -1
    df.index = df.index + 1  # Shift the index so the new row sits at 0
    df = df.sort_index().reset_index(drop=True)  # Move the new row to the top
    return df

# Push the header row, which is really an observation, down into the data
df2019_2024 = add_cols_as_obs(df2019_2024)

# Rename columns
df2019_2024 = df2019_2024.rename(columns={
    'CVE-2024-38089': 'cve_id',
    'Microsoft Defender for IoT Elevation of Privilege Vulnerability': 'description'
})

In [6]:
df2019_2024.head(3)

Unnamed: 0,cve_id,description
0,CVE-2024-38089,Microsoft Defender for IoT Elevation of Privil...
1,CVE-2024-29195,The azure-c-shared-utility is a C library for ...
2,CVE-2024-29055,Microsoft Defender for IoT Elevation of Privil...
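
An alternative to repairing the header after loading is to tell pandas up front that the file has no header row. The sketch below uses an in-memory CSV with made-up rows as a stand-in for the Excel sheet; `pd.read_excel` accepts the same `header` and `names` parameters, but whether that fits this particular workbook is an assumption.

```python
import io
import pandas as pd

# In-memory stand-in for a sheet whose first row is data, not a header
raw = io.StringIO(
    'CVE-0000-0001,first made-up description\n'
    'CVE-0000-0002,second made-up description\n'
)

# header=None keeps the first row as data; names supplies the column labels
df = pd.read_csv(raw, header=None, names=['cve_id', 'description'])
print(df.shape)
```

Loading this way means the first observation is never treated as column labels, so no rows need to be moved afterwards.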


## Duplicates

In [4]:
print(f'"df2019-2024" has {df2019_2024.duplicated().sum()} duplicate observations.')

"df2019-2024" has 0 duplicate observations.

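Note that `duplicated()` with no arguments only flags rows that match in every column. A CVE ID that appears twice with different descriptions would not be counted. A minimal sketch of a stricter check on the identifier alone, using a made-up frame `toy`:

```python
import pandas as pd

# Made-up data: the same ID appears twice with different descriptions
toy = pd.DataFrame({
    'cve_id': ['CVE-0000-0001', 'CVE-0000-0001', 'CVE-0000-0002'],
    'description': ['original wording', 'revised wording', 'other'],
})

full_row_dupes = toy.duplicated().sum()              # rows identical in every column
id_dupes = toy.duplicated(subset='cve_id').sum()     # repeated IDs, any description
print(full_row_dupes, id_dupes)
```
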

## Null Values

In [5]:
print(f'"df2019-2024" has {df2019_2024.isnull().sum().tolist()} null values per column.')

"df2019-2024" has [0, 0] null values per column.

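`isnull()` does not treat empty or whitespace-only strings as missing, so a zero count does not rule out blank cells. The following sketch, on a made-up frame `toy`, shows one way such cells could be counted separately:

```python
import pandas as pd

# Made-up data with one empty and one whitespace-only description
toy = pd.DataFrame({
    'cve_id': ['CVE-0000-0001', 'CVE-0000-0002', 'CVE-0000-0003'],
    'description': ['a made-up description', '', '   '],
})

null_counts = toy.isnull().sum().tolist()
# Count values that are empty once surrounding whitespace is removed
blank_counts = toy.apply(lambda col: col.str.strip().eq('').sum()).tolist()
print(null_counts, blank_counts)
```
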

## Data Type Conversion

In [6]:
df2019_2024 = df2019_2024.astype('string')
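
With the columns stored as strings, the `.str` accessor can also be used to check for malformed identifiers. The sketch below assumes the standard CVE ID syntax (`CVE`, a four-digit year, and a sequence number of four or more digits); the series `ids` mixes two IDs from this dataset with two deliberately invalid, made-up values.

```python
import pandas as pd

ids = pd.Series(
    ['CVE-2024-38089', 'CVE-2024-29195', 'cve-2024-1', 'N/A'],
    dtype='string',
)

# fullmatch requires the entire value to match the pattern
pattern = r'CVE-\d{4}-\d{4,}'
is_valid = ids.str.fullmatch(pattern)
invalid_ids = ids[~is_valid].tolist()
print(invalid_ids)
```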

## Whitespace Removal

In [7]:
def check_whitespace(df, cols):
    """Print how many values in each column have leading or trailing whitespace."""
    for col in cols:
        # str.contains flags each value, so the sum counts values, not characters
        trimmable = df[col].str.contains(r'^\s+|\s+$', regex=True).sum()
        print(f"Column '{col}' has {trimmable} values with leading or trailing whitespace.")

str_cols = ['cve_id', 'description']
check_whitespace(df2019_2024, str_cols)

# Remove leading and trailing whitespace
df2019_2024[str_cols] = df2019_2024[str_cols].apply(lambda x: x.str.strip())
check_whitespace(df2019_2024, str_cols)

Column 'cve_id' has 0 values with leading or trailing whitespace.
Column 'description' has 0 values with leading or trailing whitespace.
Column 'cve_id' has 0 values with leading or trailing whitespace.
Column 'description' has 0 values with leading or trailing whitespace.
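
`str.strip()` only handles the ends of each value. If repeated spaces, tabs, or newlines inside a description also count as redundant, they could be collapsed as sketched below. The series `text` is made up, and whether internal whitespace should be normalized in this dataset is a judgment call rather than something the notebook establishes.

```python
import pandas as pd

text = pd.Series(
    ['  buffer   overflow\tin  parser ', 'clean text'],
    dtype='string',
)

# Collapse every run of whitespace to a single space, then trim the ends
normalized = text.str.replace(r'\s+', ' ', regex=True).str.strip()
print(normalized.tolist())
```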


## Saving the Data

In [8]:
df2019_2024.to_parquet(path='../data/MITRE/mitre_iot_cves_v1.parquet')