# Prepare Active DNS Tunneling Attack Dataset

## Overview:

This notebook creates an active DNS Tunneling attack dataset from a small sample of data collected by performing real DNS Tunneling attacks in a controlled environment.<br>
The resulting dataset is intended to closely represent real-world data and was used to train our SVM model.<br>
It is worth noting that the sample dataset does not contain any missing values or outliers, because we tested each part of the collection process and verified that it is correct.<br>
The notebook generates an attack dataset of 50,000 DNS Tunneling flows, based on the samples we collected while running DNS Tunneling attacks in various configurations with the well-known DNScat2 tool. In every sample the victim host has an active DNS tunnel, meaning the tunnel is open and commands are running in it.<br>

## Imports & Global Variables:

In [46]:
import pandas as pd
import numpy as np
import random

NUM_OF_ROWS = 50000
ATTACK_NAME = 'DNS'

In [47]:
# the following command will make it so that when we print the dataframe we will see all the columns
pd.set_option('display.max_columns', None)

---

## Load the sample dataset:

In [48]:
# import the attack sample dataset
dns_samples = pd.read_csv('dns_samples_active.csv')
print(f'Dataset Shape: {dns_samples.shape}')
dns_samples

Dataset Shape: (20, 27)


Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
0,0,0,150,84,116,174,39.758621,34,70,181.363636,42,241,32.309848,14.333333,39.833333,268.742857,101,386,88,276,44448,34912,350,17.115152,1.102749,0.049041,0.153262
1,0,0,94,68,60,110,39.55,34,70,149.0,42,241,28.125,14.333333,39.833333,233.810811,101,357,56,148,23990,18592,222,14.696482,0.770567,0.0665,0.158759
2,0,0,94,96,136,162,38.12766,34,52,185.609756,42,241,32.740854,14.333333,39.833333,272.122699,101,357,82,262,41792,33228,326,14.316078,0.990149,0.044049,0.145406
3,0,0,100,128,88,158,38.175439,34,70,170.493671,42,241,30.845992,14.333333,39.833333,256.006329,101,357,79,234,38002,29624,316,16.216136,0.9994,0.05148,0.15114
4,0,0,136,122,88,172,38.40625,34,44,207.091954,42,241,35.493295,14.333333,39.833333,292.16185,101,357,87,306,47564,38992,346,11.440871,0.990594,0.033162,0.099968
5,0,0,56,90,68,106,37.388889,34,44,155.87037,42,241,28.910494,14.333333,39.833333,240.252336,101,357,54,149,23756,18670,214,13.661421,0.801823,0.064138,0.165087
6,0,0,74,60,80,106,38.909091,34,52,108.611111,42,241,22.862654,14.333333,39.833333,194.897196,101,357,54,111,19154,13566,214,19.716813,0.987562,0.092567,0.25585
7,0,0,66,52,72,94,39.655172,34,60,132.375,42,241,25.881944,14.333333,39.833333,219.2,101,357,48,116,19328,14340,190,13.953869,0.996097,0.07383,0.189524
8,0,0,72,52,80,102,38.645161,34,42,84.529412,42,241,19.767974,14.333333,39.833333,170.578431,101,357,51,86,15874,10356,204,21.172616,0.993989,0.104299,0.275736
9,0,0,60,68,96,112,38.875,34,60,119.464286,42,241,24.232143,14.333333,39.833333,205.5,101,357,56,124,21340,15284,224,18.313704,0.979127,0.082124,0.187037


### Find the columns that we need to synthesize data for:

In [49]:
columns_to_gather = dns_samples.replace(0, np.nan) # replace all 0 values with NaN
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() # drop only the columns in which every value is NaN
columns_to_gather # the columns that are not zero in every row (we know for a fact that the data is consistent and there are no missing values in the rows)

['CName Record Count',
 'TXT Record Count',
 'MX Record Count',
 'DF Flag Count',
 'Average Response Data Length',
 'Min Response Data Length',
 'Max Response Data Length',
 'Average Domain Name Length',
 'Min Domain Name Length',
 'Max Domain Name Length',
 'Average Sub Domain Name Length',
 'Min Sub Domain Name Length',
 'Max Sub Domain Name Length',
 'Average Packet Length',
 'Min Packet Length',
 'Max Packet Length',
 'Number of Domian Names',
 'Number of Sub Domian Names',
 'Total Length of Fwd Packet',
 'Total Length of Bwd Packet',
 'Total Number of Packets',
 'Flow Duration',
 'IAT Max',
 'IAT Mean',
 'IAT Std']
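
To illustrate the filtering step on a small scale, here is a minimal sketch using a made-up three-column frame (the `toy` frame and its values are hypothetical, not taken from the collected samples). It suggests that `dropna(how='all', axis=1)` removes only the columns that are zero in every row, while a column with some zeros is kept.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame: one all-zero column, one partly-zero column, one non-zero column
toy = pd.DataFrame({
    'A Record Count': [0, 0, 0],
    'TXT Record Count': [0, 5, 7],
    'IAT Mean': [0.05, 0.07, 0.04],
})

# zeros become NaN, then only the columns that are NaN in every row are dropped
kept = toy.replace(0, np.nan).dropna(how='all', axis=1).columns.tolist()
print(kept)  # ['TXT Record Count', 'IAT Mean']
```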

### Find approximate minimum and maximum values for each column:

In [50]:
# find the minimum and maximum values for each column, scale the range (reduce min by 10% and increase max by 15%), and store the results in a dictionary.
min_max_dict = {col: (float(dns_samples[col].min() * 0.9), float(dns_samples[col].max() * 1.15)) for col in columns_to_gather}

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

CName Record Count             | Min: 37.80 | Max: 172.50
TXT Record Count               | Min: 46.80 | Max: 170.20
MX Record Count                | Min: 48.60 | Max: 156.40
DF Flag Count                  | Min: 84.60 | Max: 200.10
Average Response Data Length   | Min: 33.58 | Max: 45.72
Min Response Data Length       | Min: 30.60 | Max: 39.10
Max Response Data Length       | Min: 37.80 | Max: 80.50
Average Domain Name Length     | Min: 76.08 | Max: 238.16
Min Domain Name Length         | Min: 37.80 | Max: 48.30
Max Domain Name Length         | Min: 216.90 | Max: 277.15
Average Sub Domain Name Length | Min: 17.79 | Max: 40.82
Min Sub Domain Name Length     | Min: 12.90 | Max: 16.48
Max Sub Domain Name Length     | Min: 35.85 | Max: 45.81
Average Packet Length          | Min: 153.52 | Max: 335.99
Min Packet Length              | Min: 90.90 | Max: 116.15
Max Packet Length              | Min: 321.30 | Max: 443.90
Number of Domian Names         | Min: 43.20 | Max: 101.20
Number of Sub Domi
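
As a worked example of the scaling rule, the sketch below applies it to 'CName Record Count', whose sample minimum and maximum are 42 and 150. The helper name `scaled_range` is hypothetical and only mirrors the dictionary comprehension in the cell above. Note that multiplying by 0.9 and 1.15 widens the range only for positive values, which holds for every column in the samples.

```python
def scaled_range(min_val, max_val, lower=0.9, upper=1.15):
    """Return (min * lower, max * upper), the widened range for one column."""
    return (min_val * lower, max_val * upper)

# sample minimum and maximum of 'CName Record Count'
low, high = scaled_range(42, 150)
print(f'Min: {low:.2f} | Max: {high:.2f}')  # Min: 37.80 | Max: 172.50
```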

### Create the base attack dataset (full of zeros):

In [51]:
# creating a zero-filled dataframe before adding values to it
dns_dataset = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(dns_samples.columns))), columns = dns_samples.columns)
dns_dataset.head(3)

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Find the columns with constant zero values based on samples:

In [52]:
# adding zeros to all columns that should not have any values
zero_columns = [col for col in dns_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    dns_dataset[col] = 0
zero_columns

['A Record Count', 'AAAA Record Count']

---

## Filling in values based on collected samples:

### First, we insert data into columns that are not related to each other:

In [53]:
constant_columns = ['TXT Record Count', 'MX Record Count', 'CName Record Count', 'DF Flag Count', 
                   'Average Response Data Length', 'Min Response Data Length', 'Max Response Data Length', 
                   'Min Domain Name Length', 'Max Domain Name Length', 
                   'Min Sub Domain Name Length', 'Max Sub Domain Name Length',
                   'Min Packet Length', 'Max Packet Length', 'Average Sub Domain Name Length', 'Flow Duration']

# adding the attack feature values to the dataset at random based on the sample data using the min max dictionary
# np.random.randint draws integers, so the float bounds are truncated explicitly (the upper bound is exclusive)
for col in constant_columns:
    low, high = min_max_dict[col]
    dns_dataset[col] = np.random.randint(int(low), int(high), NUM_OF_ROWS)
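
The bounds stored in `min_max_dict` are floats, while `np.random.randint` draws integers from a half-open range. The summary statistics further below show 'CName Record Count' spanning 37 to 171, which is consistent with the bounds 37.80 and 172.50 being truncated to 37 and 172 (exclusive). The following is a minimal sketch of that behaviour with the truncation written out, assuming NumPy's `Generator` API and an arbitrary seed that is not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumption: seeded only to make the sketch reproducible

low, high = 37.8, 172.5  # scaled bounds of 'CName Record Count'

# truncate the float bounds explicitly; the upper bound is exclusive
values = rng.integers(int(low), int(high), size=50_000)
print(values.min(), values.max())
```

Because the draw is integer-valued, columns that are fractional in the samples (for example 'Average Response Data Length') also receive whole numbers with this approach.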

In [54]:
dns_dataset

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
0,0,0,165,154,79,133,42,30,62,0.0,39,249,23,14,42,0.0,98,424,0.0,0.0,0.0,0.0,0.0,19,0.0,0.0,0.0
1,0,0,114,55,90,110,42,31,44,0.0,40,217,29,13,39,0.0,105,392,0.0,0.0,0.0,0.0,0.0,23,0.0,0.0,0.0
2,0,0,111,57,129,88,37,30,44,0.0,43,251,30,15,40,0.0,113,381,0.0,0.0,0.0,0.0,0.0,23,0.0,0.0,0.0
3,0,0,129,167,107,186,39,35,57,0.0,46,268,20,15,37,0.0,98,428,0.0,0.0,0.0,0.0,0.0,22,0.0,0.0,0.0
4,0,0,116,135,145,96,34,38,69,0.0,45,239,19,14,41,0.0,107,414,0.0,0.0,0.0,0.0,0.0,16,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,0,0,129,62,128,175,38,36,59,0.0,45,216,32,13,42,0.0,113,430,0.0,0.0,0.0,0.0,0.0,13,0.0,0.0,0.0
49996,0,0,109,97,54,165,39,33,75,0.0,39,259,33,13,41,0.0,101,326,0.0,0.0,0.0,0.0,0.0,21,0.0,0.0,0.0
49997,0,0,157,144,89,194,34,37,40,0.0,46,234,32,14,36,0.0,112,372,0.0,0.0,0.0,0.0,0.0,13,0.0,0.0,0.0
49998,0,0,50,145,113,125,42,35,65,0.0,47,242,19,13,40,0.0,104,421,0.0,0.0,0.0,0.0,0.0,17,0.0,0.0,0.0


## Then we fill values into columns that have a certain correlation between them:

Correlations between two or more columns are common in our dataset, since most features are inherently related: all of them are derived from the same network packet traffic.<br>
For example, as the **average sub domain name length** increases, the **number of sub domain names** is likely to increase as well. This follows from the nature of domain names: the longer the sub domain portion of a name, the more sub domains can fit inside it.<br>
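
The relationship can be checked directly on the samples. The sketch below computes the Pearson correlation coefficient between the two columns, using only the ten sample rows displayed earlier in this notebook (the full sample file has twenty rows, so the value on the complete data may differ).

```python
import pandas as pd

# values copied from the ten sample rows displayed earlier in this notebook
sub = pd.DataFrame({
    'Average Sub Domain Name Length': [32.309848, 28.125, 32.740854, 30.845992, 35.493295,
                                       28.910494, 22.862654, 25.881944, 19.767974, 24.232143],
    'Number of Sub Domian Names': [276, 148, 262, 234, 306, 149, 111, 116, 86, 124],
})

# Pearson correlation coefficient between the two columns
r = sub['Average Sub Domain Name Length'].corr(sub['Number of Sub Domian Names'])
print(f'Pearson r = {r:.2f}')
```

A value close to 1 indicates a strong positive linear relationship, which is what motivates deriving one column from the other in the cells that follow.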

### Correlation between 'Average Sub Domain Name Length' and all of the following: 'Average Domain Name Length', 'Number of Sub Domian Names', 'Average Packet Length', 'Number of Domian Names', 'Total Length of Fwd Packet', 'Total Length of Bwd Packet', 'Total Number of Packets':

In [55]:
base_column = 'Average Sub Domain Name Length'
correlation_columns = ['Average Domain Name Length', 'Number of Sub Domian Names', 'Average Packet Length', 
                        'Number of Domian Names', 'Total Length of Fwd Packet', 'Total Length of Bwd Packet', 'Total Number of Packets']

In [None]:
# finding the correlation between the 'Average Sub Domain Name Length' column to the rest of the columns in correlation_columns
independent_col = dns_samples[base_column].values.reshape(-1, 1) #column 'Average Sub Domain Name Length'
dependent_cols = dns_samples[correlation_columns].values

# using least squares regression to find scaling factors that best approximate the relationship between 'Average Sub Domain Name Length' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]
scaling_factors = [(name, float(factor)) for name, factor in zip(correlation_columns, scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

# precompute values using random deltas for all rows and all dependent columns
for col, factor in scaling_factors:
    deltas = np.random.uniform(factor * 0.075, factor * 0.2, size = NUM_OF_ROWS)
    signs = np.random.choice([-1, 1], size = NUM_OF_ROWS)
    updated_factors = factor + signs * deltas

    # insert the data into the attack dataset for each column
    dns_dataset[col] = dns_dataset[base_column] * updated_factors

('Average Domain Name Length', 5.397201131256436)
('Number of Sub Domian Names', 6.80256261306239)
('Average Packet Length', 8.346997407096115)
('Number of Domian Names', 2.3720454375026234)
('Total Length of Fwd Packet', 1103.566273611197)
('Total Length of Bwd Packet', 856.9022907907998)
('Total Number of Packets', 9.450172899622817)
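
Since `independent_col` has a single column and no intercept term, each scaling factor is the slope of a line through the origin, which has the closed form $k = \sum_i x_i y_i / \sum_i x_i^2$. The sketch below compares that formula with `np.linalg.lstsq` for 'Number of Sub Domian Names', again using only the ten sample rows displayed earlier, so the result may differ slightly from the factor printed above.

```python
import numpy as np

# values copied from the ten sample rows displayed earlier in this notebook
x = np.array([32.309848, 28.125, 32.740854, 30.845992, 35.493295,
              28.910494, 22.862654, 25.881944, 19.767974, 24.232143])
y = np.array([276, 148, 262, 234, 306, 149, 111, 116, 86, 124], dtype=float)

# least squares without an intercept: minimize sum((y - k * x) ** 2)
k_lstsq = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]

# closed-form solution of the same one-parameter problem
k_closed = (x @ y) / (x @ x)

print(k_lstsq, k_closed)
```

Fitting without an intercept keeps the generation step a simple multiplication, at the cost of forcing the fitted line through zero.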


In [57]:
dns_dataset

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
0,0,0,165,154,79,133,42,30,62,112.362604,39,249,23,14,42,171.994568,98,424,65.249269,133.121477,23136.318793,23148.014492,176.905843,19,0.0,0.0,0.0
1,0,0,114,55,90,110,42,31,44,187.291959,40,217,29,13,39,268.994608,105,392,81.078447,176.918096,35349.802181,21300.040365,227.533396,23,0.0,0.0,0.0
2,0,0,111,57,129,88,37,30,44,141.798955,43,251,30,15,40,296.411015,113,381,65.675491,235.157813,28196.949380,23091.308180,328.923195,23,0.0,0.0,0.0
3,0,0,129,167,107,186,39,35,57,116.127797,46,268,20,15,37,135.816358,98,428,41.064315,124.619090,19153.651895,18966.445200,170.928755,22,0.0,0.0,0.0
4,0,0,116,135,145,96,34,38,69,92.715809,45,239,19,14,41,137.698856,107,414,52.539601,144.950412,19017.775476,14441.318398,199.143804,16,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,0,0,129,62,128,175,38,36,59,194.033641,45,216,32,13,42,298.457847,113,430,88.788325,177.363341,38458.529373,22074.921106,269.461812,13,0.0,0.0,0.0
49996,0,0,109,97,54,165,39,33,75,206.217187,39,259,33,13,41,323.408867,101,326,93.389750,268.907867,40930.371356,30985.625937,354.515859,21,0.0,0.0,0.0
49997,0,0,157,144,89,194,34,37,40,155.490777,46,234,32,14,36,314.367614,112,372,84.955409,178.347982,40605.950685,31935.693940,268.477196,13,0.0,0.0,0.0
49998,0,0,50,145,113,125,42,35,65,114.122591,47,242,19,13,40,140.084997,104,421,53.097504,147.266886,24997.107488,13951.058526,156.044649,17,0.0,0.0,0.0
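
The per-row perturbation used for these columns multiplies the fitted factor by a value in either [0.8, 0.925] or [1.075, 1.2], so no generated row lies within 7.5% of the fitted line. The sketch below reproduces the three sampling lines for one factor and checks that property; the seeded `Generator` is an assumption added for reproducibility and is not what the notebook itself uses.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # assumption: seeded generator for reproducibility
factor = 6.80256261306239            # printed factor for 'Number of Sub Domian Names'
n = 50_000

# same construction as in the notebook cell: a magnitude and a random sign per row
deltas = rng.uniform(factor * 0.075, factor * 0.2, size=n)
signs = rng.choice([-1, 1], size=n)
updated_factors = factor + signs * deltas

# ratio of each perturbed factor to the fitted factor
ratio = updated_factors / factor
tol = 1e-9  # tolerance for floating-point rounding at the interval edges
in_lower = (ratio >= 0.8 - tol) & (ratio <= 0.925 + tol)
in_upper = (ratio >= 1.075 - tol) & (ratio <= 1.2 + tol)
print(bool(np.all(in_lower | in_upper)))
```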


### Correlation between 'IAT Mean' and all of the following: 'IAT Max', 'IAT Std', 'Flow Duration':

In [58]:
# finding the correlation between the 'IAT Mean' column to the rest of the columns in order to create new data
correlation = ['IAT Mean', 'IAT Max', 'IAT Std', 'Flow Duration']
independent_col = dns_samples[correlation[0]].values.reshape(-1, 1) #column 'IAT Mean'
dependent_cols = dns_samples[correlation[1:]].values

# using least squares regression to find scaling factors that best approximate the relationship between 'IAT Mean' and the rest of the columns in correlation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, float(factor)) for name, factor in zip(correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('IAT Max', 13.127911876318981)
('IAT Std', 2.67841979702811)
('Flow Duration', 232.71429250504946)


In [59]:
dns_dataset['IAT Mean'] = np.random.uniform(min_max_dict['IAT Mean'][0], min_max_dict['IAT Mean'][1], NUM_OF_ROWS)

# iterating over the dependent columns ('IAT Max', 'IAT Std', 'Flow Duration'); this overwrites the 'Flow Duration' values inserted earlier
for col, factor in scaling_factors:
    delta = random.uniform(factor * 0.1, factor * 0.2) # a single random delta per column, always positive
    updated_factor = factor + delta
    dns_dataset[col] = dns_dataset['IAT Mean'] * updated_factor
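
Because `random.uniform` returns one scalar per column here, every row shares the same factor: in the generated rows displayed further below, 'IAT Max' divided by 'IAT Mean' is about 14.8 in each row. The sketch below contrasts this with a hypothetical per-row variant, similar to the one used for the sub domain columns. It is only an illustration under assumed inputs (a seeded generator and an approximate 'IAT Mean' range), not a change to how the dataset in this notebook was produced.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # assumption: seeded generator for reproducibility
factor = 13.127911876318981          # printed factor for 'IAT Max'
iat_mean = rng.uniform(0.03, 0.12, size=1_000)  # approximate 'IAT Mean' range

# one scalar delta for the whole column: the ratio is identical in every row
scalar_factor = factor + rng.uniform(factor * 0.1, factor * 0.2)
ratio_scalar = (iat_mean * scalar_factor) / iat_mean

# hypothetical alternative: one delta per row, so the ratio varies between rows
row_factors = factor + rng.uniform(factor * 0.1, factor * 0.2, size=iat_mean.size)
ratio_rows = (iat_mean * row_factors) / iat_mean

spread_scalar = ratio_scalar.max() - ratio_scalar.min()
spread_rows = ratio_rows.max() - ratio_rows.min()
print(spread_scalar, spread_rows)
```

With the scalar variant the three dependent columns are exact multiples of 'IAT Mean', so they carry no information beyond that column; the per-row variant trades this for some added noise.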

---

## Adding the Label column:

In [60]:
# adding a label to the dataset
dns_dataset['Label'] = ATTACK_NAME

---

## Validate the generated data by comparing the samples with the generated dataset:

In [61]:
dns_samples

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
0,0,0,150,84,116,174,39.758621,34,70,181.363636,42,241,32.309848,14.333333,39.833333,268.742857,101,386,88,276,44448,34912,350,17.115152,1.102749,0.049041,0.153262
1,0,0,94,68,60,110,39.55,34,70,149.0,42,241,28.125,14.333333,39.833333,233.810811,101,357,56,148,23990,18592,222,14.696482,0.770567,0.0665,0.158759
2,0,0,94,96,136,162,38.12766,34,52,185.609756,42,241,32.740854,14.333333,39.833333,272.122699,101,357,82,262,41792,33228,326,14.316078,0.990149,0.044049,0.145406
3,0,0,100,128,88,158,38.175439,34,70,170.493671,42,241,30.845992,14.333333,39.833333,256.006329,101,357,79,234,38002,29624,316,16.216136,0.9994,0.05148,0.15114
4,0,0,136,122,88,172,38.40625,34,44,207.091954,42,241,35.493295,14.333333,39.833333,292.16185,101,357,87,306,47564,38992,346,11.440871,0.990594,0.033162,0.099968
5,0,0,56,90,68,106,37.388889,34,44,155.87037,42,241,28.910494,14.333333,39.833333,240.252336,101,357,54,149,23756,18670,214,13.661421,0.801823,0.064138,0.165087
6,0,0,74,60,80,106,38.909091,34,52,108.611111,42,241,22.862654,14.333333,39.833333,194.897196,101,357,54,111,19154,13566,214,19.716813,0.987562,0.092567,0.25585
7,0,0,66,52,72,94,39.655172,34,60,132.375,42,241,25.881944,14.333333,39.833333,219.2,101,357,48,116,19328,14340,190,13.953869,0.996097,0.07383,0.189524
8,0,0,72,52,80,102,38.645161,34,42,84.529412,42,241,19.767974,14.333333,39.833333,170.578431,101,357,51,86,15874,10356,204,21.172616,0.993989,0.104299,0.275736
9,0,0,60,68,96,112,38.875,34,60,119.464286,42,241,24.232143,14.333333,39.833333,205.5,101,357,56,124,21340,15284,224,18.313704,0.979127,0.082124,0.187037


In [62]:
dns_dataset

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std,Label
0,0,0,165,154,79,133,42,30,62,112.362604,39,249,23,14,42,171.994568,98,424,65.249269,133.121477,23136.318793,23148.014492,176.905843,15.114447,0.863241,0.058235,0.171849,DNS
1,0,0,114,55,90,110,42,31,44,187.291959,40,217,29,13,39,268.994608,105,392,81.078447,176.918096,35349.802181,21300.040365,227.533396,14.279366,0.815547,0.055018,0.162355,DNS
2,0,0,111,57,129,88,37,30,44,141.798955,43,251,30,15,40,296.411015,113,381,65.675491,235.157813,28196.949380,23091.308180,328.923195,27.693345,1.581668,0.106701,0.314870,DNS
3,0,0,129,167,107,186,39,35,57,116.127797,46,268,20,15,37,135.816358,98,428,41.064315,124.619090,19153.651895,18966.445200,170.928755,27.595174,1.576061,0.106323,0.313754,DNS
4,0,0,116,135,145,96,34,38,69,92.715809,45,239,19,14,41,137.698856,107,414,52.539601,144.950412,19017.775476,14441.318398,199.143804,24.956175,1.425338,0.096155,0.283749,DNS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,0,0,129,62,128,175,38,36,59,194.033641,45,216,32,13,42,298.457847,113,430,88.788325,177.363341,38458.529373,22074.921106,269.461812,29.724673,1.697685,0.114528,0.337966,DNS
49996,0,0,109,97,54,165,39,33,75,206.217187,39,259,33,13,41,323.408867,101,326,93.389750,268.907867,40930.371356,30985.625937,354.515859,13.487722,0.770333,0.051968,0.153354,DNS
49997,0,0,157,144,89,194,34,37,40,155.490777,46,234,32,14,36,314.367614,112,372,84.955409,178.347982,40605.950685,31935.693940,268.477196,26.115462,1.491549,0.100622,0.296930,DNS
49998,0,0,50,145,113,125,42,35,65,114.122591,47,242,19,13,40,140.084997,104,421,53.097504,147.266886,24997.107488,13951.058526,156.044649,29.444451,1.681680,0.113448,0.334780,DNS


In [63]:
dns_samples.describe()

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
count,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0
mean,0.0,0.0,88.6,87.3,90.2,132.5,38.511953,34.0,54.2,151.014408,42.0,241.0,28.31744,14.33333,39.83333,236.795765,101.0,358.45,66.8,186.2,30354.6,23409.8,266.1,16.006378,0.931769,0.064669,0.175722
std,0.0,0.0,31.147189,28.181928,21.607991,31.768074,0.778566,0.0,9.058872,37.134279,0.0,0.0,4.773938,1.822504e-15,7.290015e-15,37.170469,0.0,6.484597,15.813385,76.963833,11632.193025,9970.843324,63.386783,2.757227,0.109439,0.022172,0.053685
min,0.0,0.0,42.0,52.0,54.0,94.0,37.311475,34.0,42.0,84.529412,42.0,241.0,19.767974,14.33333,39.83333,170.578431,101.0,357.0,48.0,84.0,15286.0,10120.0,190.0,11.440871,0.764955,0.033162,0.099968
25%,0.0,0.0,63.0,68.0,78.0,105.5,37.981744,34.0,50.0,119.12453,42.0,241.0,24.188727,14.33333,39.83333,205.268805,101.0,357.0,53.75,122.0,20837.0,15048.0,213.0,13.880757,0.794057,0.046086,0.140232
50%,0.0,0.0,89.0,82.0,88.0,112.0,38.525706,34.0,52.0,153.586129,42.0,241.0,28.667511,14.33333,39.83333,239.05474,101.0,357.0,56.5,148.5,23888.0,18631.0,225.0,15.517295,0.989264,0.063071,0.161923
75%,0.0,0.0,103.0,102.0,108.0,165.0,39.074074,34.0,60.0,181.570346,42.0,241.0,32.304397,14.33333,39.83333,267.810714,101.0,357.0,82.5,262.5,42089.0,33287.0,330.0,18.407313,0.994516,0.082358,0.200219
max,0.0,0.0,150.0,148.0,136.0,174.0,39.758621,34.0,70.0,207.091954,42.0,241.0,35.493295,14.33333,39.83333,292.16185,101.0,386.0,88.0,306.0,47564.0,38992.0,350.0,21.172616,1.102749,0.104299,0.275736


In [64]:
dns_dataset.describe()

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,0.0,0.0,103.95628,107.33198,101.64682,141.41224,38.50638,34.00258,58.05528,150.976021,41.99418,246.03052,27.97312,13.51182,39.50754,233.49832,102.4873,381.6967,66.355699,190.298251,30870.228379,23970.741267,264.484227,19.446298,1.110649,0.074926,0.221102
std,0.0,0.0,38.960012,35.730134,31.099937,33.461879,3.443747,2.592788,12.389093,42.027113,3.168013,17.597955,6.634235,1.119609,2.872015,64.947535,7.496362,35.266913,18.49223,52.875507,8581.960328,6669.393329,73.692031,6.73058,0.384408,0.025933,0.076526
min,0.0,0.0,37.0,46.0,48.0,84.0,33.0,30.0,37.0,73.404199,37.0,216.0,17.0,12.0,35.0,113.579083,90.0,321.0,32.260075,92.531592,15011.475836,11653.96323,128.631066,7.746683,0.442441,0.029848,0.088079
25%,0.0,0.0,70.0,76.0,75.0,113.0,36.0,32.0,47.0,117.668513,39.0,231.0,22.0,13.0,37.0,181.810853,96.0,351.0,51.632972,148.117546,24079.553151,18657.980973,205.597793,13.642174,0.779154,0.052563,0.15511
50%,0.0,0.0,104.0,107.0,102.0,141.0,39.0,34.0,58.0,147.804001,42.0,246.0,28.0,14.0,40.0,228.65713,102.0,382.0,64.997753,186.658702,30161.891648,23492.827366,258.976286,19.454969,1.111144,0.074959,0.221201
75%,0.0,0.0,138.0,138.0,129.0,170.0,41.0,36.0,69.0,178.891581,45.0,261.0,34.0,15.0,42.0,276.843162,109.0,412.0,78.668272,225.70771,36617.104835,28445.861625,314.379017,25.275229,1.443561,0.097385,0.287376
max,0.0,0.0,171.0,169.0,155.0,199.0,44.0,38.0,79.0,252.579062,47.0,276.0,39.0,15.0,44.0,390.590695,115.0,442.0,111.001017,318.346955,51635.857473,40098.431307,442.200019,31.130124,1.777955,0.119943,0.353946
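
The visual comparison of the two `describe()` tables can be complemented by a programmatic range check. The sketch below uses a hypothetical helper, `out_of_range_columns`, and miniature stand-ins for `dns_dataset` and `min_max_dict`. It flags columns whose generated values leave their scaled bounds, as happens for 'TXT Record Count' (generated minimum 46 against a lower bound of 46.80, due to integer truncation) and for correlated columns such as 'Average Domain Name Length' (generated maximum 252.58 against an upper bound of 238.16).

```python
import pandas as pd

def out_of_range_columns(generated, bounds):
    """Return the columns whose generated values fall outside their (low, high) bounds."""
    bad = []
    for col, (low, high) in bounds.items():
        if generated[col].min() < low or generated[col].max() > high:
            bad.append(col)
    return bad

# hypothetical miniature inputs standing in for dns_dataset and min_max_dict
generated = pd.DataFrame({
    'TXT Record Count': [46, 107, 169],
    'IAT Mean': [0.030, 0.075, 0.119],
})
bounds = {
    'TXT Record Count': (46.8, 170.2),
    'IAT Mean': (0.0298, 0.1199),
}

print(out_of_range_columns(generated, bounds))  # ['TXT Record Count']
```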


---

## Finally, we save the dataset as a CSV file:

In [65]:
print(f'Attack Dataset Shape: {dns_dataset.shape}')

Attack Dataset Shape: (50000, 28)


In [66]:
# save the dataset
dns_dataset.to_csv('dns_active_dataset.csv', index=False)
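
Passing `index=False` means the row index is not written to the file; otherwise pandas would store it as an extra column that reappears as `Unnamed: 0` when the file is read back. A minimal round-trip sketch, assuming a two-row stand-in for `dns_dataset` and a temporary file with a hypothetical name:

```python
import os
import tempfile

import pandas as pd

# hypothetical two-row stand-in for dns_dataset
frame = pd.DataFrame({'IAT Mean': [0.058235, 0.055018], 'Label': ['DNS', 'DNS']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.csv')  # hypothetical file name
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# the reloaded frame has the same shape and columns as the original
print(reloaded.shape == frame.shape)
print(list(reloaded.columns) == list(frame.columns))
```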