# Chi-Squared Feature Selection
This Jupyter notebook uses chi-squared feature selection to estimate how informative each feature is for the models used later on.

In [1]:
# Import for normalization
from sklearn.preprocessing import MinMaxScaler

# Import chi2 function 
from sklearn.feature_selection import chi2

# Import for data management
import pandas as pd

# Import our preprocessing function
from utils.data_preprocessing import get_data

print('Imports complete')

Imports complete


The dataset has two separate layers, so we run the analysis on each layer in turn. We have built helper functions to assist with importing the data.

## Layer 1

In [2]:
path = '/media/notclaytonjohnson/Seagate Portable Drive/Data/doh_dataset/Total-CSVs'
df = get_data(path=path, layer=1)
df.head()

Unnamed: 0,SourceIP,DestinationIP,SourcePort,DestinationPort,TimeStamp,Duration,FlowBytesSent,FlowSentRate,FlowBytesReceived,FlowReceivedRate,...,PacketTimeCoefficientofVariation,ResponseTimeTimeVariance,ResponseTimeTimeStandardDeviation,ResponseTimeTimeMean,ResponseTimeTimeMedian,ResponseTimeTimeMode,ResponseTimeTimeSkewFromMedian,ResponseTimeTimeSkewFromMode,ResponseTimeTimeCoefficientofVariation,Label
0,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:49:11,95.08155,62311,655.342703,65358,687.388878,...,0.574626,0.001053,0.032457,0.027624,0.026854,0.026822,0.071187,0.024715,1.174948,DoH
1,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:50:52,122.309318,93828,767.136973,101232,827.672018,...,0.509047,0.00117,0.0342,0.024387,0.021043,0.026981,0.293297,-0.075845,1.402382,DoH
2,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:52:55,120.958413,38784,320.639127,38236,316.108645,...,0.732636,0.000785,0.028021,0.029238,0.026921,0.026855,0.248064,0.085061,0.958348,DoH
3,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:54:56,110.50108,61993,561.017141,69757,631.278898,...,0.646859,0.000411,0.020274,0.019925,0.019268,0.026918,0.097199,-0.344926,1.017535,DoH
4,176.103.130.131,192.168.20.191,443,50749,2020-01-14 15:56:46,54.229891,83641,1542.341289,76804,1416.266907,...,0.507334,0.079079,0.281209,0.02593,4.7e-05,2.1e-05,0.276133,0.092135,10.844829,DoH


We need to remove some columns, such as `SourceIP`, because they hold identifiers and timestamps rather than traffic statistics, and the models could overfit to them.

In [3]:
bad_columns = ['SourceIP', 'DestinationIP', 'TimeStamp']
df.drop(labels=bad_columns, axis='columns', inplace=True)

In [4]:
# The target classifications are in the 'Label' column,
#  so this is the dependent variable!
dep_var = 'Label'
df[dep_var].value_counts()

NonDoH    889809
DoH       269299
Name: Label, dtype: int64

In [5]:
# Split up the data into the data (X) and classifications (y)
X = df.loc[:, df.columns != dep_var]
y = df[dep_var]

In [6]:
# We need to normalize X so there are no negative values.
#  sklearn's chi2 raises an error on negative inputs!
scaler = MinMaxScaler()
X = pd.DataFrame( 
    scaler.fit_transform(X), 
    columns=X.columns 
)
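
The effect of the scaling step can be checked on a small example. The sketch below uses made-up toy data (`X_toy` and `y_toy` are illustrative names, not part of the dataset) with one column containing negative values, as the skew features do. It assumes a scikit-learn version in which `chi2` raises a `ValueError` on negative input.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Toy data, made up for illustration: the second column contains negatives.
X_toy = np.array([[1.0, -0.5],
                  [2.0,  0.0],
                  [3.0,  0.5],
                  [4.0,  1.5]])
y_toy = np.array(['DoH', 'DoH', 'NonDoH', 'NonDoH'])

# chi2 rejects negative inputs.
try:
    chi2(X_toy, y_toy)
    raised = False
except ValueError:
    raised = True

# After min-max scaling, every column lies in [0, 1], so chi2 accepts it.
X_scaled = MinMaxScaler().fit_transform(X_toy)
scores, p_values = chi2(X_scaled, y_toy)
print('Raised on raw data:', raised)
print('Scaled range:', X_scaled.min(), 'to', X_scaled.max())
```

Min-max scaling is only one way to remove negative values. It keeps the relative spacing of values within a column, but the resulting chi-squared scores depend on each column's minimum and maximum.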

In [7]:
# Adapted from: https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223
# chi2 returns a pair of arrays: (chi-squared statistics, p-values)
chi_scores = chi2(X, y)
p_values = pd.Series(chi_scores[1], index=X.columns)
p_values.sort_values(ascending=True, inplace=True)
#p_values.plot.bar()
print('Values in order of ascending p-values (lower=more significant)')
print(p_values)

Values in order of ascending p-values (lower=more significant)
SourcePort                                 0.000000e+00
ResponseTimeTimeSkewFromMedian             0.000000e+00
ResponseTimeTimeMode                       0.000000e+00
ResponseTimeTimeMedian                     0.000000e+00
ResponseTimeTimeMean                       0.000000e+00
PacketTimeSkewFromMedian                   0.000000e+00
PacketTimeMode                             0.000000e+00
PacketTimeMedian                           0.000000e+00
PacketTimeMean                             0.000000e+00
PacketTimeStandardDeviation                0.000000e+00
ResponseTimeTimeSkewFromMode               0.000000e+00
PacketLengthCoefficientofVariation         0.000000e+00
PacketTimeVariance                         0.000000e+00
ResponseTimeTimeCoefficientofVariation     0.000000e+00
PacketLengthMode                           0.000000e+00
PacketLengthMedian                         0.000000e+00
PacketLengthMean                         
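
Many of the Layer 1 p-values print as exactly `0.000000e+00`. With more than a million rows, this is most likely floating-point underflow rather than a true tie, so the p-values alone cannot order those features. One possible workaround is to rank by the chi-squared statistic, the first array returned by `chi2`. The sketch below shows the idea on made-up toy data. The column names `strong`, `weak`, and `flat` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

# Toy data, made up for illustration: 100 rows per class, already in [0, 1].
y_toy = np.repeat(['DoH', 'NonDoH'], 100)
X_toy = pd.DataFrame({
    'strong': np.where(y_toy == 'DoH', 0.9, 0.1),  # differs a lot by class
    'weak':   np.where(y_toy == 'DoH', 0.6, 0.4),  # differs a little by class
    'flat':   np.full(200, 0.5),                   # identical for both classes
})

# chi2 returns (statistics, p_values); a larger statistic means stronger dependence.
statistics, p_values = chi2(X_toy, y_toy)
ranking = pd.DataFrame(
    {'chi2': statistics, 'p_value': p_values},
    index=X_toy.columns,
).sort_values('chi2', ascending=False)
print(ranking)
```

Sorting on the statistic gives the same order as sorting on the p-value whenever the p-values are distinguishable, and it still separates features once the p-values have all collapsed to zero.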

## Layer 2

In [8]:
path = '/media/notclaytonjohnson/Seagate Portable Drive/Data/doh_dataset/Total-CSVs'
df = get_data(path=path, layer=2)
df.head()

Unnamed: 0,SourceIP,DestinationIP,SourcePort,DestinationPort,TimeStamp,Duration,FlowBytesSent,FlowSentRate,FlowBytesReceived,FlowReceivedRate,...,PacketTimeCoefficientofVariation,ResponseTimeTimeVariance,ResponseTimeTimeStandardDeviation,ResponseTimeTimeMean,ResponseTimeTimeMedian,ResponseTimeTimeMode,ResponseTimeTimeSkewFromMedian,ResponseTimeTimeSkewFromMode,ResponseTimeTimeCoefficientofVariation,Label
0,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:49:11,95.08155,62311,655.342703,65358,687.388878,...,0.574626,0.001053,0.032457,0.027624,0.026854,0.026822,0.071187,0.024715,1.174948,Benign
1,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:50:52,122.309318,93828,767.136973,101232,827.672018,...,0.509047,0.00117,0.0342,0.024387,0.021043,0.026981,0.293297,-0.075845,1.402382,Benign
2,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:52:55,120.958413,38784,320.639127,38236,316.108645,...,0.732636,0.000785,0.028021,0.029238,0.026921,0.026855,0.248064,0.085061,0.958348,Benign
3,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:54:56,110.50108,61993,561.017141,69757,631.278898,...,0.646859,0.000411,0.020274,0.019925,0.019268,0.026918,0.097199,-0.344926,1.017535,Benign
4,176.103.130.131,192.168.20.191,443,50749,2020-01-14 15:56:46,54.229891,83641,1542.341289,76804,1416.266907,...,0.507334,0.079079,0.281209,0.02593,4.7e-05,2.1e-05,0.276133,0.092135,10.844829,Benign


As with Layer 1, we need to remove some columns, such as `SourceIP`, because they hold identifiers and timestamps rather than traffic statistics, and the models could overfit to them.

In [9]:
bad_columns = ['SourceIP', 'DestinationIP', 'TimeStamp']
df.drop(labels=bad_columns, axis='columns', inplace=True)

In [10]:
# The target classifications are in the 'Label' column,
#  so this is the dependent variable!
dep_var = 'Label'
df[dep_var].value_counts()

Malicious    249553
Benign        19746
Name: Label, dtype: int64

In [11]:
# Split up the data into the data (X) and classifications (y)
X = df.loc[:, df.columns != dep_var]
y = df[dep_var]

In [12]:
# We need to normalize X so there are no negative values.
#  sklearn's chi2 raises an error on negative inputs!
scaler = MinMaxScaler()
X = pd.DataFrame( 
    scaler.fit_transform(X), 
    columns=X.columns 
)

In [13]:
# Adapted from: https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223
# chi2 returns a pair of arrays: (chi-squared statistics, p-values)
chi_scores = chi2(X, y)
p_values = pd.Series(chi_scores[1], index=X.columns)
p_values.sort_values(ascending=True, inplace=True)
#p_values.plot.bar()
print('Values in order of ascending p-values (lower=more significant)')
print(p_values)

Values in order of ascending p-values (lower=more significant)
PacketLengthCoefficientofVariation         0.000000e+00
PacketLengthStandardDeviation              0.000000e+00
FlowReceivedRate                          1.450939e-244
PacketLengthMean                          7.975795e-217
Duration                                  2.510029e-216
PacketTimeSkewFromMedian                  4.561902e-188
FlowSentRate                              1.859601e-176
PacketLengthVariance                      5.351375e-147
PacketTimeMean                            1.857341e-131
PacketTimeStandardDeviation               3.619944e-129
ResponseTimeTimeMedian                    3.364445e-115
PacketTimeMedian                           5.147378e-95
ResponseTimeTimeSkewFromMode               9.872719e-91
DestinationPort                            2.812960e-68
ResponseTimeTimeMean                       1.589233e-62
ResponseTimeTimeMode                       4.342812e-61
PacketTimeCoefficientofVariation         
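
Both layers go through the same steps: drop the identifier columns, split off the label, scale, and score. A possible way to avoid repeating the cells is to wrap those steps in one function. The helper below is a hypothetical sketch, not part of `utils.data_preprocessing`. The name `chi2_p_values` and its parameters are assumptions made for illustration, and the toy frame stands in for the real data.

```python
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler


def chi2_p_values(df, dep_var='Label', drop_columns=()):
    """Hypothetical helper: chi-squared p-value per feature, sorted ascending."""
    # Drop the label and any identifier-like columns; skip names that are absent.
    features = df.drop(columns=list(drop_columns) + [dep_var], errors='ignore')
    # Scale into [0, 1] because chi2 needs non-negative inputs.
    scaled = MinMaxScaler().fit_transform(features)
    _, p_values = chi2(scaled, df[dep_var])
    return pd.Series(p_values, index=features.columns).sort_values()


# Toy frame, made up for illustration.
toy = pd.DataFrame({
    'SourceIP': ['10.0.0.1'] * 4,
    'Duration': [1.0, 2.0, 9.0, 10.0],
    'Skew': [-0.2, 0.1, -0.1, 0.2],
    'Label': ['Benign', 'Benign', 'Malicious', 'Malicious'],
})
result = chi2_p_values(toy, drop_columns=['SourceIP', 'DestinationIP', 'TimeStamp'])
print(result)
```

With such a helper, each layer would reduce to a call like `chi2_p_values(get_data(path=path, layer=1), drop_columns=bad_columns)`, assuming `get_data` returns a DataFrame as it does in the cells above.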