# CIC-Darknet2020 Feature Selection

Here we load the cleaned dataset produced by running Dataset_Cleaning.ipynb on the [CIC-Darknet2020](https://www.unb.ca/cic/datasets/darknet2020.html) dataset. We then remove every feature whose removal is well justified.

First we import all relevant libraries, set a random seed, and print the Python and library versions for reproducibility.

In [1]:
import datetime, os, platform, pprint, sys
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

seed: int = 14

# set up pretty printer for easier data evaluation
pretty = pprint.PrettyPrinter(indent=4, width=30).pprint

# set up pandas display options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print(
    f'''
    Last Execution: {datetime.datetime.now()}
    python:\t{platform.python_version()}

    \tmatplotlib:\t{mpl.__version__}
    \tnumpy:\t\t{np.__version__}
    \tpandas:\t\t{pd.__version__}
    '''
)


    Last Execution: 2022-02-20 21:11:31.735960
    python:	3.7.10

    	matplotlib:	3.3.4
    	numpy:		1.20.3
    	pandas:		1.2.5
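In this section the seed is defined but not yet consumed. The following is a minimal sketch of how it could be applied, assuming later steps draw random numbers through Python's `random` module and NumPy's legacy global generator (neither is shown in this notebook):

```python
import random

import numpy as np

seed: int = 14  # same value as the notebook's seed

# Seed both generators so repeated runs draw the same numbers.
random.seed(seed)
np.random.seed(seed)
first_draw = np.random.rand(3)

# Re-seeding reproduces exactly the same draw.
np.random.seed(seed)
second_draw = np.random.rand(3)
print(np.array_equal(first_draw, second_draw))
```

Libraries with their own generators usually take the seed as an explicit argument, so the same value would need to be passed to them as well.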
    


Next we define helper functions for loading and processing the data.

In [2]:
def get_file_path(directory: str):
    '''
        Closure that will return a function. 
        Function will return the filepath to the directory given to the closure
    '''

    def func(file: str) -> str:
        return os.path.join(directory, file)

    return func



def load_data(filePath):
    '''
        Loads the Dataset from the given filepath and caches it for quick access in the future
        Function will only work when filepath is a .csv file
    '''

    # slice off the leading './cleaned/' (10 characters) from the filePath
    if filePath[0] == '.' and filePath[1] == '/':
        filePathClean: str = filePath[10::]
        pickleDump: str = f'./cache/{filePathClean}.pickle'
    else:
        pickleDump: str = f'./cache/{filePath}.pickle'
    
    print(f'Loading Dataset: {filePath}')
    print(f'\tTo Dataset Cache: {pickleDump}\n')


    # check if data already exists within cache
    if os.path.exists(pickleDump):
        df = pd.read_pickle(pickleDump)
        df['Label1'] = df['Label1'].str.lower()

    # if not, load the csv, normalise the labels, and cache the result
    else:
        df = pd.read_csv(filePath, low_memory=True)
        df['Label1'] = df['Label1'].str.lower()
        # make sure the cache directory exists before writing to it
        os.makedirs(os.path.dirname(pickleDump), exist_ok=True)
        df.to_pickle(pickleDump)
    
    return df



def features_with_bad_values(df: pd.DataFrame, datasetName: str) -> pd.DataFrame:
    '''
        Function will scan the dataframe for features with Inf, NaN, or Zero values.
        Returns a new dataframe describing the distribution of these values in the original dataframe
    '''

    # Inf and NaN values can take different forms so we screen for every one of them
    invalid_values: list = [ np.inf, np.nan, 'Infinity', 'inf', 'NaN', 'nan', 0 ]
    infs          : list = [ np.inf, 'Infinity', 'inf' ]
    NaNs          : list = [ np.nan, 'NaN', 'nan' ]

    # We will collect stats on the dataset, specifically how many instances of Infs, NaNs, and 0s are present.
    # using a dictionary that will be converted into a (3, 2 + number of features) dataframe
    stats: dict = {
        'Dataset':[ datasetName, datasetName, datasetName ],
        'Value'  :['Inf', 'NaN', 'Zero']
    }

    for col in df.columns:

        feature = np.zeros(3)

        for value in invalid_values:
            if value in infs:
                j = 0
            elif value in NaNs:
                j = 1
            else:
                j = 2

            # NaN never compares equal to anything, so real NaNs are counted with isna()
            if isinstance(value, float) and np.isnan(value):
                feature[j] += df[col].isna().sum()
            else:
                feature[j] += (df[col] == value).sum()

        stats[col] = feature

    return pd.DataFrame(stats)



def clean_data(df: pd.DataFrame, prune: list) -> pd.DataFrame:
    '''
        Function will take a dataframe and remove the columns that match a value in prune 
        Inf values will also be removed from Flow Bytes/s and Flow Packets/s
        once appropriate rows and columns have been removed, we will return
        the dataframe with the appropriate values
    '''

    # remove the features in the prune list    
    for col in prune:
        if col in df.columns:
            df.drop(columns=[col], inplace=True)
            
    
    # drop missing values/NaN etc.
    df.dropna(inplace=True)

    
    # Search through dataframe for any Infinite or NaN values in various forms that were not picked up previously
    # (np.nan never compares equal, so real NaNs are handled by dropna above)
    invalid_values: list = [
        np.inf, np.nan, 'Infinity', 'inf', 'NaN', 'nan'
    ]
    
    for col in df.columns:
        for value in invalid_values:
            indexNames = df[df[col] == value].index
            if not indexNames.empty:
                print(f'deleting {len(indexNames)} rows with value {value} in column {col}')
                df.drop(indexNames, inplace=True)

    return df
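Equality tests against `np.nan` never match, because NaN compares unequal to itself. That is why missing values have to be counted with `isna()`, and why `clean_data` relies on `dropna` for them. The sketch below illustrates the difference on a hypothetical toy column; the values are made up and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column containing one NaN, one Inf, and one zero.
toy = pd.DataFrame({'Flow Bytes/s': [1.0, np.nan, np.inf, 0.0]})
col = toy['Flow Bytes/s']

nan_by_equality = int((col == np.nan).sum())  # NaN != NaN, so nothing matches
nan_by_isna = int(col.isna().sum())           # counts the real NaN
inf_count = int(np.isinf(col).sum())          # counts +Inf and -Inf
zero_count = int((col == 0).sum())            # ordinary equality works for 0

print(nan_by_equality, nan_by_isna, inf_count, zero_count)
```

Infinite values can be found by equality with `np.inf`, but `np.isinf` also catches negative infinity.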


Before we do any processing on the data, we need to list the file paths of the datasets. To reproduce the process carried out here, place the files in the same locations relative to the notebook. The file we are looking at in particular, Darknet_cleaned.csv, is generated by the Dataset_Cleaning.ipynb notebook. Before Dataset_Cleaning.ipynb is run, the original dataset must be placed in the original directory, which sits in the same directory as this notebook.

In [3]:
# This code is used to scale to processing numerous datasets, even though we currently are only looking at one
data_path_1: str = './cleaned/'   
data_set_1: list = [
    'Darknet_cleaned.csv',
]

data_set: list  = data_set_1
file_path_1      = get_file_path(data_path_1)
file_set: list   = list(map(file_path_1, data_set_1))
current_job: int = 0
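Because the notebook depends on a fixed directory layout, it can help to check that layout before running anything. The sketch below assumes the three directory names that appear in the paths used in this notebook (`cleaned`, `cache`, `phase1`); `missing_dirs` is a hypothetical helper and is not part of the notebook:

```python
import os
import tempfile


def missing_dirs(root: str, expected: list) -> list:
    '''Return the expected sub-directories that do not exist under root.'''
    return [d for d in expected if not os.path.isdir(os.path.join(root, d))]


# Directory names taken from the paths used in this notebook.
expected_dirs = ['cleaned', 'cache', 'phase1']

# Demonstrate on a throwaway directory that only contains 'cleaned'.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'cleaned'))
    missing = missing_dirs(root, expected_dirs)

print(missing)
```

In the notebook itself the call would be `missing_dirs('.', expected_dirs)`, and an empty list would mean the layout is in place.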

Next we define more helper functions that process the data using the file and dataset information above.

In [4]:
def examine_dataset(job_id: int) -> dict:
    '''
        Function will return a dictionary containing dataframe of the job_id passed in as well as that dataframe's
        feature stats, data composition, and file name.

        This dictionary is expected as the input for all of the other helper functions
    '''

    job_id = job_id - 1  # adjusts for indexing while enumerating jobs from 1
    print(f'Dataset {job_id+1}/{len(data_set)}: We now look at {file_set[job_id]}\n\n')

    # Load the dataset
    df: pd.DataFrame = load_data(file_set[job_id])
 

    # print the data composition
    print(f'''
        File:\t\t\t\t{file_set[job_id]}  
        Job Number:\t\t\t{job_id+1}
        Shape:\t\t\t\t{df.shape}
        Samples:\t\t\t{df.shape[0]} 
        Features:\t\t\t{df.shape[1]}
    ''')
    

    # return the dataframe and the feature stats
    data_summary: dict =  {
        'File':             file_set[job_id],
        'Dataset':          df,
        'Feature_stats':    features_with_bad_values(df, file_set[job_id]), 
    }
    
    return data_summary



def package_data_for_inspection(df: pd.DataFrame) -> dict:
    '''
        Function will return a dictionary containing dataframe passed in as well as that dataframe's feature stats.
    '''

    # print the data composition
    print(f'''
    Dataset statistics:
        Shape:\t\t\t\t{df.shape}
        Samples:\t\t\t{df.shape[0]} 
        Features:\t\t\t{df.shape[1]}
    ''')
    

    # return the dataframe and the feature stats
    data_summary: dict =  {
        'File':             '',
        'Dataset':          df,
        'Feature_stats':    features_with_bad_values(df, ''), 
    }
    
    return data_summary



def package_data_for_inspection_with_file_label(df: pd.DataFrame, label: str) -> dict:
    '''
        Function will return a dictionary containing dataframe passed in as well as that dataframe's feature stats.
    '''

    # print the data composition
    print(f'''
        Shape:\t\t\t\t{df.shape}
        Samples:\t\t\t{df.shape[0]} 
        Features:\t\t\t{df.shape[1]}
    ''')
    

    # return the dataframe and the feature stats
    data_summary: dict =  {
        'File':             f'{label}',
        'Dataset':          df,
        'Feature_stats':    features_with_bad_values(df, f'{label}'),
    }
    
    return data_summary



def check_infs(data_summary: dict) -> pd.DataFrame:
    '''
        Function will return a dataframe of features with a value of Inf.
    '''

    
    vals: pd.DataFrame = data_summary['Feature_stats']
    inf_df = vals[vals['Value'] == 'Inf'].T

    return inf_df[inf_df[0] != 0]



def check_nans(data_summary: dict) -> pd.DataFrame:
    '''
        Function will return a dataframe of features with a value of NaN.
    '''

    vals: pd.DataFrame = data_summary['Feature_stats']
    nan_df = vals[vals['Value'] == 'NaN'].T

    return nan_df[nan_df[1] != 0]



def check_zeros(data_summary: dict) -> pd.DataFrame:
    '''
        Function will return a dataframe of features with a value of 0.
    '''

    vals: pd.DataFrame = data_summary['Feature_stats']
    zero_df = vals[vals['Value'] == 'Zero'].T

    return zero_df[zero_df[2] != 0]



def check_zeros_over_threshold(data_summary: dict, threshold: int) -> pd.DataFrame:
    '''
        Function will return a dataframe of features whose count of 0 values
        is greater than the threshold.
    '''

    vals: pd.DataFrame = data_summary['Feature_stats']
    zero_df = vals[vals['Value'] == 'Zero'].T
    zero_df_bottom = zero_df[2:]

    return zero_df_bottom[zero_df_bottom[2] > threshold]



def check_zeros_over_threshold_percentage(data_summary: dict, threshold: float) -> pd.DataFrame:
    '''
        Function will return a dataframe of all features whose fraction of
        0 values is greater than the threshold
    '''

    vals: pd.DataFrame = data_summary['Feature_stats']
    size: int = data_summary['Dataset'].shape[0]
    zero_df = vals[vals['Value'] == 'Zero'].T
    zero_df_bottom = zero_df[2:]

    return zero_df_bottom[zero_df_bottom[2] > threshold*size]



def remove_infs_and_nans(data_summary: dict) -> pd.DataFrame:
    '''
        Function will return the dataset with all inf and nan values removed.
    '''

    df: pd.DataFrame = data_summary['Dataset'].copy()
    df = clean_data(df, [])

    return df


def rename_columns(data_summary: dict, columns: list, new_names: list) -> dict:
    '''
        Function will return the data_summary dict with the names of the columns in the dataframe changed
    '''

    df: pd.DataFrame = data_summary['Dataset'].copy()
    for x, i in enumerate(columns):
        df.rename(columns={i: new_names[x]}, inplace=True)

    data_summary['Dataset'] = df

    return data_summary

def prune_dataset(data_summary: dict, prune: list) -> pd.DataFrame:
    '''
        Function will return the dataset with all the columns in the prune list removed.
    '''

    df: pd.DataFrame = data_summary['Dataset'].copy()
    df = clean_data(df, prune)

    return df



def create_new_prune_candidates(zeros_df: pd.DataFrame) -> list:
    '''
        Function creates a list of prune candidates from a dataframe of features with a high frequency of 0 values
    '''

    return list(zeros_df.T.columns)



def intersection_of_prune_candidates(pruneCandidates: list, newPruneCandidates: list) -> list:
    '''
        Function will return a list of features that are in both pruneCandidates and newPruneCandidates
    '''

    return list(set(pruneCandidates).intersection(newPruneCandidates))



def test_infs(data_summary: dict) -> bool:
    '''
        Function asserts the dataset has no inf values.
    '''
    vals: pd.DataFrame = data_summary['Feature_stats']
    inf_df = vals[vals['Value'] == 'Inf'].T

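    # the 'Dataset' and 'Value' rows hold strings and always pass the != 0 filter,
    # so exactly 2 remaining rows means no feature contains inf values
    assert inf_df[inf_df[0] != 0].shape[0] == 2, 'Dataset has inf values'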
    

    return True



def test_nans(data_summary: dict) -> bool:
    '''
        Function asserts the dataset has no NaN values
    '''

    vals: pd.DataFrame = data_summary['Feature_stats']
    nan_df = vals[vals['Value'] == 'NaN'].T

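    # the 'Dataset' and 'Value' rows hold strings and always pass the != 0 filter,
    # so exactly 2 remaining rows means no feature contains NaN values
    assert nan_df[nan_df[1] != 0].shape[0] == 2, 'Dataset has NaN values'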


    return True



def test_pruned(data_summary: dict, prune: list) -> bool:
    '''
        Function asserts the dataset has none of the columns present in the prune list 
    '''

    pruned: bool = True

    for col in prune:
        if col in data_summary['Dataset'].columns:
            pruned = False

    assert pruned, 'Dataset has columns present in prune list'

    return pruned



def test_pruned_size(data_summary_original: dict, data_summary_pruned: dict, prune: list) -> bool:
    '''
        Function asserts the pruned dataset has exactly len(prune) fewer columns than the original 
    '''

    original_size: int = data_summary_original['Dataset'].shape[1]
    pruned_size: int = data_summary_pruned['Dataset'].shape[1]
    prune_list_size: int = len(prune)

    assert original_size - prune_list_size == pruned_size, 'Pruned dataset has an unexpected number of columns'

    # print('Original size: ', original_size)
    # print('Pruned size: ', pruned_size)
    # print('Pruned list size: ', prune_list_size)
    # print('Difference: ', original_size - pruned_size)

    return True
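The comparison with 2 in `test_infs` and `test_nans` follows from how the stats frame is transposed. The sketch below reproduces the filter on a toy stand-in for `data_summary['Feature_stats']`; the feature names and counts are illustrative only:

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_summary['Feature_stats']; the counts are made up.
toy_stats = pd.DataFrame({
    'Dataset': ['toy', 'toy', 'toy'],
    'Value': ['Inf', 'NaN', 'Zero'],
    'Flow Duration': np.array([0.0, 0.0, 5.0]),
    'Flow Bytes/s': np.array([0.0, 0.0, 2.0]),
})

# Same steps as check_infs: select the 'Inf' row, transpose, keep non-zero rows.
inf_rows = toy_stats[toy_stats['Value'] == 'Inf'].T
flagged = inf_rows[inf_rows[0] != 0]

# Only the two metadata rows survive, so shape[0] == 2 means "no Inf values".
print(list(flagged.index))
```

A feature with a non-zero Inf count would appear as a third row, which is what makes the assertion fail.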




This gives us a set of file locations. Let's look at the set of files that we will be eliminating features from.

In [5]:
print(f'We will be eliminating features from {len(file_set)} files:')
pretty(file_set)

We will be eliminating features from 1 files:
[   './cleaned/Darknet_cleaned.csv']


## The Cleaned CIC-Darknet2020 Dataset

In [6]:
dataset_1: dict = examine_dataset(1)

print(f"""
    Labels in the first layer:
{dataset_1['Dataset'].groupby('Label').size()}

    Labels in the second layer:
 {dataset_1['Dataset'].groupby('Label1').size()}
""")

Dataset 1/1: We now look at ./cleaned/Darknet_cleaned.csv


Loading Dataset: ./cleaned/Darknet_cleaned.csv
	To Dataset Cache: ./cache/Darknet_cleaned.csv.pickle


        File:				./cleaned/Darknet_cleaned.csv  
        Job Number:			1
        Shape:				(141481, 85)
        Samples:			141481 
        Features:			85
    

    Labels in the first layer:
Label
Non-Tor    93309
NonVPN     23861
Tor         1392
VPN        22919
dtype: int64

    Labels in the second layer:
 Label1
audio-streaming    18050
browsing           32808
chat               11473
email               6143
file-transfer      11173
p2p                48520
video-streaming     9748
voip                3566
dtype: int64
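The first-layer counts are easier to compare as percentages. The sketch below recomputes the shares from the counts printed above, which are copied into the code by hand rather than read from the file:

```python
import pandas as pd

# Counts copied from the first-layer output above.
label_counts = pd.Series(
    {'Non-Tor': 93309, 'NonVPN': 23861, 'Tor': 1392, 'VPN': 22919},
    name='Label',
)

total = int(label_counts.sum())
shares = (label_counts / total * 100).round(2)  # percentage of all flows

print(total)
print(shares)
```

On the loaded dataframe, `dataset_1['Dataset']['Label'].value_counts(normalize=True)` should give the same proportions directly.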



### Data Inspection

Here we verify that all the inf and nan values have been removed from the dataset, in case we accidentally loaded the original dataset.

In [7]:
check_infs(dataset_1)


Unnamed: 0,0
Dataset,./cleaned/Darknet_cleaned.csv
Value,Inf


In [8]:
check_nans(dataset_1)


Unnamed: 0,1
Dataset,./cleaned/Darknet_cleaned.csv
Value,


In [9]:
if(test_infs(dataset_1) and test_nans(dataset_1)):
    print('Dataset is clean')

Dataset is clean


### Feature Breakdown

Now that the dataset is loaded and verified to be clean, we can look at the features and see if any merit removal.

In [10]:
values = dataset_1['Dataset'].values
columns = dataset_1['Dataset'].columns


print("Feature types:")
for i in range(dataset_1['Dataset'].shape[1]):
    print(f"Column: {i}\tType: {type(values[0][i])}\tFeature: {columns[i]}")


print("\nFeature Samples:")
for i in range(dataset_1['Dataset'].shape[1]):
    print(f"Column: {i}\tSample: {values[0][i]}")

Feature types:
Column: 0	Type: <class 'str'>	Feature: Flow ID
Column: 1	Type: <class 'str'>	Feature: Src IP
Column: 2	Type: <class 'int'>	Feature: Src Port
Column: 3	Type: <class 'str'>	Feature: Dst IP
Column: 4	Type: <class 'int'>	Feature: Dst Port
Column: 5	Type: <class 'int'>	Feature: Protocol
Column: 6	Type: <class 'str'>	Feature: Timestamp
Column: 7	Type: <class 'int'>	Feature: Flow Duration
Column: 8	Type: <class 'int'>	Feature: Total Fwd Packet
Column: 9	Type: <class 'int'>	Feature: Total Bwd packets
Column: 10	Type: <class 'int'>	Feature: Total Length of Fwd Packet
Column: 11	Type: <class 'int'>	Feature: Total Length of Bwd Packet
Column: 12	Type: <class 'int'>	Feature: Fwd Packet Length Max
Column: 13	Type: <class 'int'>	Feature: Fwd Packet Length Min
Column: 14	Type: <class 'float'>	Feature: Fwd Packet Length Mean
Column: 15	Type: <class 'float'>	Feature: Fwd Packet Length Std
Column: 16	Type: <class 'int'>	Feature: Bwd Packet Length Max
Column: 17	Type: <class 'int'>	Feature

In [11]:
check_zeros_over_threshold(dataset_1, dataset_1['Dataset'].shape[0]-1)

Unnamed: 0,2
Bwd PSH Flags,141481.0
Fwd URG Flags,141481.0
Bwd URG Flags,141481.0
URG Flag Count,141481.0
CWE Flag Count,141481.0
ECE Flag Count,141481.0
Fwd Bytes/Bulk Avg,141481.0
Fwd Packet/Bulk Avg,141481.0
Fwd Bulk Rate Avg,141481.0
Bwd Bytes/Bulk Avg,141481.0


### Feature Selection

From the feature samples above, we can see that the Flow ID feature has the form (Source IP)-(Destination IP)-(Source Port)-(Destination Port)-(Protocol). This is a unique identifier for each flow.

However, this information is contained within other features, meaning we can completely eliminate it.
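To make the redundancy concrete, the sketch below rebuilds a Flow ID string from the five component columns. The rows are hypothetical (the addresses and ports are invented), and whether every row of the real dataset follows this pattern exactly is not verified here:

```python
import pandas as pd

# Hypothetical flows; the addresses and ports are made up for illustration.
toy = pd.DataFrame({
    'Src IP': ['10.0.0.1', '10.0.0.2'],
    'Dst IP': ['192.0.2.10', '192.0.2.20'],
    'Src Port': [50000, 50001],
    'Dst Port': [443, 80],
    'Protocol': [6, 17],
})
parts = ['Src IP', 'Dst IP', 'Src Port', 'Dst Port', 'Protocol']

# Join the five fields with '-' in the order described above.
toy['Flow ID'] = toy[parts].astype(str).agg('-'.join, axis=1)
print(toy['Flow ID'].tolist())
```

On the real data, comparing the rebuilt string with the stored Flow ID column would confirm the claim before the column is dropped.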

The Timestamp feature can also be eliminated. We are working with tabular data, so a classifier sees each flow in isolation: it has no awareness of the flows that occurred before or after a given flow and cannot infer anything from when a flow arrived relative to others. Network traffic is also not inherently bound to a time frame; for example, people are not restricted from using Tor during the day, and YouTube videos can be streamed at any hour.

Thus, we will also eliminate the timestamp feature.

The Source and Destination IPs are going to be eliminated because they only reflect the IPs of the computers used by the original researchers to generate the dataset. The distribution of IP addresses within this dataset does not accurately describe the distribution of IP addresses in real internet traffic.

The Source and Destination Ports are going to be eliminated as well, but for a different reason. Unlike the IP addresses, the ports are not simply an artifact of how the original researchers generated the dataset; they are the ports each flow was actually sent on. However, malicious or evasive users may change the ports they connect to and from in order to appear innocuous and to obfuscate their activity. A service can also often switch ports after a connection is established, so a classifier that learns to rely on the port could misclassify subsequent flows that occur over a new port.

This gives us a set of six features that we can eliminate:
 * Flow ID
 * Timestamp
 * Source IP
 * Destination IP
 * Source Port
 * Destination Port

In [12]:
prune: list = [
    'Flow ID',
    'Src IP',
    'Dst IP',
    'Src Port',
    'Dst Port',
    'Timestamp',
]

We will also remove every feature that contains only 0 values. A constant feature carries no information that could help separate the classes, and keeping such columns only adds unused inputs, which may make it easier for a classifier to overfit.

In [13]:
prune += list(check_zeros_over_threshold(dataset_1, dataset_1['Dataset'].shape[0]-1).T.columns)
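All-zero features are a special case of constant features. The sketch below shows an alternative check based on the number of distinct values per column; `constant_columns` is a hypothetical helper, not one of the notebook's functions, and the toy values are invented:

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    '''Return the names of columns that hold a single distinct value.'''
    counts = df.nunique(dropna=False)
    return list(counts[counts <= 1].index)


# Hypothetical toy frame: two columns never vary, one does.
toy = pd.DataFrame({
    'Fwd URG Flags': [0, 0, 0],
    'Protocol': [6, 6, 6],
    'Flow Duration': [10, 250, 37],
})

print(constant_columns(toy))
```

Unlike the zero count, this also flags columns stuck at a non-zero value, so its result may differ from the list produced by `check_zeros_over_threshold`.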

In [14]:
print(f'We will be pruning the {len(prune)} features:')
pretty(prune)

We will be pruning the 21 features:
[   'Flow ID',
    'Src IP',
    'Dst IP',
    'Src Port',
    'Dst Port',
    'Timestamp',
    'Bwd PSH Flags',
    'Fwd URG Flags',
    'Bwd URG Flags',
    'URG Flag Count',
    'CWE Flag Count',
    'ECE Flag Count',
    'Fwd Bytes/Bulk Avg',
    'Fwd Packet/Bulk Avg',
    'Fwd Bulk Rate Avg',
    'Bwd Bytes/Bulk Avg',
    'Subflow Bwd Packets',
    'Active Mean',
    'Active Std',
    'Active Max',
    'Active Min']


In [15]:
dataset_2 = package_data_for_inspection(prune_dataset(dataset_1, prune))


    Dataset statistics:
        Shape:				(141481, 64)
        Samples:			141481 
        Features:			64
    


In [16]:
if(test_pruned(dataset_2, prune) and test_pruned_size(dataset_1, dataset_2, prune)):
    print('Dataset has been pruned') 

Dataset has been pruned


### Saving the Dataset after Feature Selection

Now we save the dataset to a new file in the phase1 directory.

In [17]:
dataset_2['Dataset'].to_csv('./phase1/Darknet_phase_1.csv', index=False)
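`to_csv` does not create missing directories, so the phase1 directory has to exist before the cell above runs. The sketch below shows one way to guarantee that, using a temporary directory and a hypothetical two-row frame in place of `dataset_2['Dataset']`:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for dataset_2['Dataset'].
toy = pd.DataFrame({'Protocol': [6, 17], 'Label': ['Tor', 'VPN']})

with tempfile.TemporaryDirectory() as root:
    out_dir = os.path.join(root, 'phase1')
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists
    out_path = os.path.join(out_dir, 'Darknet_phase_1.csv')
    toy.to_csv(out_path, index=False)
    round_trip = pd.read_csv(out_path)   # read back to confirm the write

print(list(round_trip.columns), round_trip.shape)
```

In the notebook, calling `os.makedirs('./phase1', exist_ok=True)` before the `to_csv` line would have the same effect.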

### Renaming Columns

Since Label and Label1 are confusing names, we will rename them to Traffic Type and Application Type to be more meaningful.

In [18]:
dataset_3: dict = rename_columns(dataset_2, ['Label', 'Label1'], ['Traffic Type', 'Application Type'])
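Note that `rename_columns` stores the renamed frame back into the dictionary it was given, so `dataset_3` and `dataset_2` refer to the same object after this call. The sketch below shows the same renaming as a single `rename` call on a hypothetical toy frame, which leaves the input untouched:

```python
import pandas as pd

# Hypothetical two-row frame with the original label columns.
toy = pd.DataFrame({
    'Protocol': [6, 17],
    'Label': ['Tor', 'VPN'],
    'Label1': ['chat', 'p2p'],
})

old_names = ['Label', 'Label1']
new_names = ['Traffic Type', 'Application Type']

# rename returns a new frame; toy keeps its original column names.
renamed = toy.rename(columns=dict(zip(old_names, new_names)))
print(list(renamed.columns))
```

Building the mapping with `zip` renames all columns in one pass instead of once per column.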

In [19]:
dataset_3['Dataset'].columns

Index(['Protocol', 'Flow Duration', 'Total Fwd Packet', 'Total Bwd packets',
       'Total Length of Fwd Packet', 'Total Length of Bwd Packet',
       'Fwd Packet Length Max', 'Fwd Packet Length Min',
       'Fwd Packet Length Mean', 'Fwd Packet Length Std',
       'Bwd Packet Length Max', 'Bwd Packet Length Min',
       'Bwd Packet Length Mean', 'Bwd Packet Length Std', 'Flow Bytes/s',
       'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max',
       'Flow IAT Min', 'Fwd IAT Total', 'Fwd IAT Mean', 'Fwd IAT Std',
       'Fwd IAT Max', 'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean',
       'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min', 'Fwd PSH Flags',
       'Fwd Header Length', 'Bwd Header Length', 'Fwd Packets/s',
       'Bwd Packets/s', 'Packet Length Min', 'Packet Length Max',
       'Packet Length Mean', 'Packet Length Std', 'Packet Length Variance',
       'FIN Flag Count', 'SYN Flag Count', 'RST Flag Count', 'PSH Flag Count',
       'ACK Flag Count', 'Down/Up Ratio

In [20]:
dataset_3['Dataset'].to_csv('./phase1/Darknet_reduced_features.csv', index=False)

In [21]:
print(f'Last Execution: {datetime.datetime.now()}')
assert False, 'Nothing is complete after this point'

Last Execution: 2022-02-20 21:11:44.832977


AssertionError: Nothing is complete after this point