# Proseminar Research Data 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg
from scipy import stats

---
## Initial Data Preparation

The primary tasks for making this a viable data set are:
* remove the invalid variables
* remove instances with ICMP
* remove instances with ARP.

Refer to the dataset review paper for the specifics of why these variables and observations need to be dropped.
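The three steps listed above can be sketched on a small toy frame before running them on the full file. The rows and values here are invented for illustration; only the column names and the list of dropped columns mirror this notebook.

```python
import pandas as pd

# Hypothetical toy rows; the values are made up for illustration only.
toy = pd.DataFrame({
    "pkSeqID": [1, 2, 3, 4],
    "proto": ["tcp", "arp", "icmp", "udp"],
    "sport": [49960, -1, "0x0303", 49962],
    "pkts": [10, 2, 1, 7],
})

# Step 1: drop the invalid columns (only those present in the toy frame).
invalid_cols = ["pkSeqID", "seq", "stime", "ltime", "saddr", "daddr"]
cleaned = toy.drop(columns=[c for c in invalid_cols if c in toy.columns])

# Steps 2 and 3: drop ICMP and ARP rows in a single pass.
cleaned = cleaned[~cleaned["proto"].isin(["icmp", "arp"])]

print(cleaned)
```

Filtering with `isin` is equivalent to applying the two `!=` filters one after the other; the notebook keeps them separate so each step can be inspected on its own.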

In [4]:
# load dataset 
data = pd.read_csv("Data/Bot-IoT/All-features/All-features/combined.csv")


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3668525 entries, 0 to 3668524
Data columns (total 46 columns):
 #   Column                            Dtype 
---  ------                            ----- 
 0   pkSeqID                           object
 1   stime                             object
 2   flgs                              object
 3   flgs_number                       object
 4   proto                             object
 5   proto_number                      object
 6   saddr                             object
 7   sport                             object
 8   daddr                             object
 9   dport                             object
 10  pkts                              object
 11  bytes                             object
 12  state                             object
 13  state_number                      object
 14  ltime                             object
 15  seq                               object
 16  dur                               object
 17  mean    

In [18]:
# create copy of dataset without invalid features
valid_data = data.drop(columns=['pkSeqID', 'seq', 'stime', 'ltime', 'saddr', 'daddr'])

In [19]:
valid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3668525 entries, 0 to 3668524
Data columns (total 40 columns):
 #   Column                            Dtype 
---  ------                            ----- 
 0   flgs                              object
 1   flgs_number                       object
 2   proto                             object
 3   proto_number                      object
 4   sport                             object
 5   dport                             object
 6   pkts                              object
 7   bytes                             object
 8   state                             object
 9   state_number                      object
 10  dur                               object
 11  mean                              object
 12  stddev                            object
 13  sum                               object
 14  min                               object
 15  max                               object
 16  spkts                             object
 17  dpkts   

In [22]:
# check protocol values 
valid_data.proto.unique()

array(['tcp', 'arp', 'udp', 'icmp', 'proto', 'ipv6-icmp'], dtype=object)

In [34]:
# check out instances of ICMP observations
icmp_records = valid_data.loc[valid_data.proto == 'icmp']

Let's go ahead and drop these instances, because they contain hex values in the sport and dport features. These are incongruent with the rest of the data: hex values are not valid source/destination ports, which makes these records difficult to interpret. There is a chance they could be fixed; however, it is not worth the time to attempt this, so I will simply drop these cases.
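As a rough illustration of what "hex values in the port features" means, the sketch below flags hex-looking entries in a mixed-type column and shows how they could be parsed if one did want to repair them. The series is a hypothetical stand-in for `sport`, not the real data, and whether the parsed numbers are meaningful as ports is a separate question that this sketch does not answer.

```python
import pandas as pd

# Hypothetical port values mixing integers and hex strings.
ports = pd.Series([49960, -1, "0x0303", "0x0008", 80], dtype=object)

# Flag entries whose string form starts with "0x".
is_hex = ports.astype(str).str.startswith("0x")
hex_count = int(is_hex.sum())

# Possible repair (not done in this notebook): parse hex strings as base-16 integers.
repaired = ports.map(
    lambda v: int(v, 16) if isinstance(v, str) and v.startswith("0x") else v
)

print(hex_count)
print(repaired.tolist())
```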

In [41]:
icmp_records.sport.unique()

array(['0x0303', '0x0008', '0x000d', '0x0011'], dtype=object)

In [42]:
icmp_records.dport.unique()

array(['0x5000', '0xfcec', '0x0000', ..., '0x3ead', '0xeeaa', '0x9a89'],
      dtype=object)

In [43]:
valid_data.sport.unique()

array([49960, -1, 49962, ..., '0x0008', '0x000d', '0x0011'], dtype=object)

In [51]:
# drop cases of icmp records 
valid_data = valid_data[valid_data.proto != 'icmp']

In [52]:
valid_data.proto.unique()

array(['tcp', 'arp', 'udp', 'proto', 'ipv6-icmp'], dtype=object)

The next thing to do is to drop all ARP packets. The paper reviewing the dataset showed that there are mislabeled observations: several instances using ARP are labeled as 'attack' traffic when they should be labeled as normal. One could go through and manually inspect each packet; however, it is simpler to drop these observations.
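Before dropping the ARP rows, it can help to see how they are labeled. A minimal sketch, assuming the `proto` and `category` columns used elsewhere in this notebook; the toy rows are hypothetical and do not reflect the real label counts.

```python
import pandas as pd

# Hypothetical toy rows for illustration only.
toy = pd.DataFrame({
    "proto": ["tcp", "arp", "arp", "udp", "arp"],
    "category": ["DoS", "DoS", "Normal", "DDoS", "DDoS"],
})

# Count how each protocol is labeled across categories.
label_counts = pd.crosstab(toy["proto"], toy["category"])
print(label_counts)

# Drop the ARP rows, as in the next cell of the notebook.
without_arp = toy[toy["proto"] != "arp"]
```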

In [57]:
# drop the cases of arp
valid_data = valid_data[valid_data.proto != 'arp']

In [58]:
valid_data.proto.unique()

array(['tcp', 'udp', 'proto', 'ipv6-icmp'], dtype=object)

In [66]:
valid_data.category.unique()

array(['DoS', 'category', 'DDoS', 'Normal', 'Reconnaissance', 'Theft'],
      dtype=object)

In [67]:
valid_data.subcategory.unique()

array(['HTTP', 'TCP', 'UDP', 'subcategory', 'Normal', 'OS_Fingerprint',
       'Service_Scan', 'Data_Exfiltration', 'Keylogging'], dtype=object)