# Prepare Background (inactive) DNS Tunneling Attack Dataset

## Overview:

This notebook creates a background (inactive) DNS Tunneling attack dataset from a small sample of data collected by performing real DNS Tunneling attacks in a controlled environment.<br>
The dataset that this notebook creates closely represents real-world data and was used to train our SVM model.<br>  
It is worth noting that the sample dataset we collected does not contain any missing values or outliers, because we tested each part of the collection process and verified that it is correct.<br>
The notebook generates an attack dataset with 40,000 flows of the DNS Tunneling attack, based on the samples we collected while running DNS Tunneling attacks in various configurations with the well-known DNScat2 tool. In every sample the victim host has a DNS tunnel open <u>but no</u> commands or data are being transmitted through it: the tunnel is open but not actively used by the attacker.<br>

## Imports & Global Variables:

In [56]:
import pandas as pd
import numpy as np
import random

NUM_OF_ROWS = 40000
ATTACK_NAME = 'DNS'

In [57]:
# show all the columns when printing a dataframe
pd.set_option('display.max_columns', None)

---

## Load the sample dataset:

In [58]:
# import the attack sample dataset
dns_samples = pd.read_csv('dns_samples_background.csv')
print(f'Dataset Shape: {dns_samples.shape}')
dns_samples

Dataset Shape: (10, 27)


Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
0,0,0,16,36,36,44,36.461538,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.272727,101,158,22,24,4908,2596,88,22.814415,1.096587,0.262235,0.461227
1,0,0,48,48,36,66,38.0,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.318182,101,158,33,35,7368,3894,132,34.926907,1.105063,0.266618,0.465087
2,0,0,48,36,52,68,38.571429,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.823529,101,158,34,36,7660,4012,136,35.418652,1.105114,0.26236,0.456099
3,0,0,48,52,34,68,37.84,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.626866,101,158,33,35,7580,3894,134,35.846157,1.102855,0.26952,0.464188
4,0,0,44,56,36,68,37.52,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.117647,101,158,34,36,7564,4012,136,35.797906,1.107464,0.26517,0.4614
5,0,0,28,40,68,68,37.294118,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.823529,101,158,34,36,7660,4012,136,35.301731,1.10236,0.261494,0.454823
6,0,0,52,40,44,68,38.521739,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.647059,101,158,34,36,7636,4012,136,35.763095,1.096167,0.264912,0.461058
7,0,0,40,48,48,68,37.636364,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.441176,101,158,34,36,7608,4012,136,35.993821,1.098705,0.266621,0.464297
8,0,0,44,48,40,66,37.826087,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.348485,101,158,33,35,7372,3894,132,34.958758,1.098239,0.266861,0.465344
9,0,0,52,32,48,66,38.952381,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.893939,101,158,33,35,7444,3894,132,34.897108,1.098657,0.26639,0.463545
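The statement that the samples contain no missing values can be checked directly. The sketch below uses a small hypothetical frame (`toy_samples`, with only two of the columns and three of the rows) in place of `dns_samples`; the same calls should apply unchanged to the real frame.

```python
import pandas as pd

# Hypothetical stand-in for dns_samples (a subset of the values shown above)
toy_samples = pd.DataFrame({
    'TXT Record Count': [36, 48, 36],
    'IAT Mean': [0.262235, 0.266618, 0.26236],
})

# Count the missing values per column; all zeros means nothing is missing
missing_per_column = toy_samples.isna().sum()
has_missing = bool(missing_per_column.any())

print(missing_per_column)
print(f'Any missing values: {has_missing}')
```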


### Find the columns that we need to synthesize data for:

In [59]:
columns_to_gather = dns_samples.replace(0, np.nan) # replace all 0 values with NaN
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() # drop only the columns in which every value is NaN (i.e. was 0 in every sample)
columns_to_gather # the columns that hold at least one non-zero value (we know for a fact that the data is consistent and that there are no missing values in the rows)

['CName Record Count',
 'TXT Record Count',
 'MX Record Count',
 'DF Flag Count',
 'Average Response Data Length',
 'Min Response Data Length',
 'Max Response Data Length',
 'Average Domain Name Length',
 'Min Domain Name Length',
 'Max Domain Name Length',
 'Average Sub Domain Name Length',
 'Min Sub Domain Name Length',
 'Max Sub Domain Name Length',
 'Average Packet Length',
 'Min Packet Length',
 'Max Packet Length',
 'Number of Domian Names',
 'Number of Sub Domian Names',
 'Total Length of Fwd Packet',
 'Total Length of Bwd Packet',
 'Total Number of Packets',
 'Flow Duration',
 'IAT Max',
 'IAT Mean',
 'IAT Std']
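To make the selection rule concrete, here is a minimal sketch on a hypothetical three-column frame (`toy`). It illustrates that `how='all'` removes a column only when every one of its values was zero, so a column that merely contains a zero is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are illustrative only
toy = pd.DataFrame({
    'A Record Count': [0, 0, 0],      # always zero: dropped
    'TXT Record Count': [36, 0, 52],  # contains a zero but is not all zero: kept
    'IAT Mean': [0.26, 0.27, 0.26],   # never zero: kept
})

# Zeros become NaN; how='all' drops only the columns that are NaN in every row
non_zero_columns = toy.replace(0, np.nan).dropna(how='all', axis=1).columns.tolist()
print(non_zero_columns)
```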

### Find an approximate minimum and maximum values of each column:

In [60]:
# find the minimum and maximum values for each column, scale the range (reduce min by 10% and increase max by 35%), and store the results in a dictionary.
min_max_dict = {col: (float(dns_samples[col].min() * 0.9), float(dns_samples[col].max() * 1.35)) for col in columns_to_gather}

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

CName Record Count             | Min: 14.40 | Max: 70.20
TXT Record Count               | Min: 28.80 | Max: 75.60
MX Record Count                | Min: 30.60 | Max: 91.80
DF Flag Count                  | Min: 39.60 | Max: 91.80
Average Response Data Length   | Min: 32.82 | Max: 52.59
Min Response Data Length       | Min: 30.60 | Max: 45.90
Max Response Data Length       | Min: 37.80 | Max: 56.70
Average Domain Name Length     | Min: 37.80 | Max: 56.70
Min Domain Name Length         | Min: 37.80 | Max: 56.70
Max Domain Name Length         | Min: 37.80 | Max: 56.70
Average Sub Domain Name Length | Min: 12.90 | Max: 19.35
Min Sub Domain Name Length     | Min: 12.90 | Max: 19.35
Max Sub Domain Name Length     | Min: 12.90 | Max: 19.35
Average Packet Length          | Min: 114.41 | Max: 172.66
Min Packet Length              | Min: 90.90 | Max: 136.35
Max Packet Length              | Min: 142.20 | Max: 213.30
Number of Domian Names         | Min: 19.80 | Max: 45.90
Number of Sub Domian Names
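As a worked example of the scaling, the sketch below recomputes the range for a single column, using the ten `CName Record Count` values from the samples. The variable names (`cname_counts`, `lower`, `upper`) are introduced here for illustration only.

```python
import pandas as pd

# 'CName Record Count' values copied from the ten samples shown earlier
cname_counts = pd.Series([16, 48, 48, 48, 44, 28, 52, 40, 44, 52])

lower = float(cname_counts.min() * 0.9)   # 16 * 0.9
upper = float(cname_counts.max() * 1.35)  # 52 * 1.35

print(f'Min: {lower:.2f} | Max: {upper:.2f}')
```

The scaling is asymmetric: the range is widened by 35% above the observed maximum but only by 10% below the observed minimum.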

### Create the base attack dataset (full of zeros):

In [None]:
# creating an empty dataframe before adding values to it
dns_dataset = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(dns_samples.columns))), columns = dns_samples.columns)
dns_dataset.head(3)

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Find the columns with constant zero values based on samples:

In [62]:
# adding zeros to all columns that should not have any values
zero_columns = [col for col in dns_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    dns_dataset[col] = int(0)
zero_columns

['A Record Count', 'AAAA Record Count']

---

## Filling in values based on collected samples:

### First, we insert data into columns that are not related to each other:

In [63]:
constant_columns = ['TXT Record Count', 'MX Record Count', 'CName Record Count', 'DF Flag Count', 'Number of Sub Domian Names', 'Total Number of Packets']

# adding the attack feature values to the dataset at random, based on the sample data, using the minimum/maximum dict
for col in constant_columns:
    dns_dataset[col] = np.random.randint(min_max_dict[col][0], min_max_dict[col][1], NUM_OF_ROWS)
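The bounds stored in `min_max_dict` are floats, while the columns filled here hold integer counts. One possible variant, sketched below under the assumption that truncating both bounds to integers is acceptable, makes that conversion explicit and seeds the generator so the draw is repeatable. `SEED`, `toy_bounds`, and `toy_columns` are hypothetical names that are not part of the notebook.

```python
import numpy as np

SEED = 42  # hypothetical seed, not used in the original notebook
rng = np.random.default_rng(SEED)

# Hypothetical (min, max) bounds in the same shape as min_max_dict
toy_bounds = {
    'CName Record Count': (14.4, 70.2),
    'TXT Record Count': (28.8, 75.6),
}

toy_columns = {}
for col, (low, high) in toy_bounds.items():
    # int() truncates the bounds; the upper bound of integers() is exclusive
    toy_columns[col] = rng.integers(int(low), int(high), size=1000)

print({col: (int(v.min()), int(v.max())) for col, v in toy_columns.items()})
```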

### Then we fill values into columns that have a certain correlation between them:

Correlation between two or more columns is common in our dataset, since most features are inherently related: all of them are derived from the same network packet traffic.<br>
For example, as the **total number of packets** increases, the **total length of fwd packet** and the **total length of bwd packet** are both likely to increase as well. This happens because every packet is either a forward or a backward packet, so the more packets there are, the more fwd and bwd traffic there will be.<br>
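The relationship described above can be checked on the collected samples. The following sketch computes the Pearson correlation coefficient between two of the sample columns; the array names are introduced here for illustration.

```python
import numpy as np

# Values copied from the ten samples shown earlier
total_packets = np.array([88, 132, 136, 134, 136, 136, 136, 136, 132, 132])
fwd_length = np.array([4908, 7368, 7660, 7580, 7564, 7660, 7636, 7608, 7372, 7444])

# Pearson correlation coefficient between the two columns
pearson_r = np.corrcoef(total_packets, fwd_length)[0, 1]
print(f'Pearson r: {pearson_r:.4f}')
```

A coefficient close to 1 indicates that the forward length grows almost linearly with the packet count in these samples.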

### Correlation between multiple columns based on the groups in lists_of_correlation:

In [64]:
lists_of_correlation = [
    ['Number of Sub Domian Names', 'Number of Domian Names'],
    ['Average Response Data Length', 'Min Response Data Length'],
    ['Total Number of Packets', 'Total Length of Fwd Packet', 'Total Length of Bwd Packet'],
    ['Average Packet Length', 'Min Packet Length', 'Max Packet Length']
]

In [65]:
# for each list of correlation, sample the first (independent) column, then derive the rest of the columns from it using scaling factors fitted on the samples
for l in lists_of_correlation:
    dns_dataset[l[0]] = np.random.randint(min_max_dict[l[0]][0], min_max_dict[l[0]][1], NUM_OF_ROWS)
    if l[0] == 'Average Response Data Length':
        dns_dataset[l[0]] = np.random.uniform(min_max_dict[l[0]][0], min_max_dict[l[0]][1], NUM_OF_ROWS)

    # finding the correlation between the first column to the rest of the columns in order to create new data
    independent_col = dns_samples[l[0]].values.reshape(-1, 1) 
    dependent_cols = dns_samples[l[1:]].values

    # using least squares regression to find scaling factors that best approximate the relationship between the columns
    scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]
    scaling_factors = [(name, float(factor)) for name, factor in zip(l[1:], scaling_factors.flatten())]
    print(l[0]+':')
    for val in scaling_factors:
        print(val)
    print('-'*50)

    # Precompute values using random deltas for all rows and all dependent columns
    for col, factor in scaling_factors:
        deltas = np.random.uniform(factor * 0.1, factor * 0.2, size = NUM_OF_ROWS)
        signs = np.random.choice([-1, 1], size = NUM_OF_ROWS)
        updated_factors = factor + signs * deltas

        # insert the data into the attack dataset for each column
        dns_dataset[col] = dns_dataset[l[0]] * updated_factors

Number of Sub Domian Names:
('Number of Domian Names', 0.9424556707929075)
--------------------------------------------------
Average Response Data Length:
('Min Response Data Length', 0.897701272076271)
--------------------------------------------------
Total Number of Packets:
('Total Length of Fwd Packet', 56.09391500246403)
('Total Length of Bwd Packet', 29.45361744068712)
--------------------------------------------------
Average Packet Length:
('Min Packet Length', 0.7919591927438456)
('Max Packet Length', 1.238906459935917)
--------------------------------------------------
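Because the regression uses a single independent column and no intercept, each scaling factor has a simple closed form, `k = sum(x * y) / sum(x * x)`. The sketch below checks this on the first pair of columns, using the sample values shown earlier; `x`, `y`, `k_lstsq`, and `k_closed` are names introduced for illustration.

```python
import numpy as np

# 'Number of Sub Domian Names' (x) and 'Number of Domian Names' (y) from the ten samples
x = np.array([24, 35, 36, 35, 36, 36, 36, 36, 35, 35], dtype=float)
y = np.array([22, 33, 34, 33, 34, 34, 34, 34, 33, 33], dtype=float)

# Least squares with one column and no intercept: minimise ||x * k - y||
k_lstsq = float(np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0])

# Closed form of the same problem
k_closed = float(np.dot(x, y) / np.dot(x, x))

print(k_lstsq, k_closed)
```

Both values should agree with the factor printed for `Number of Domian Names`. Since the fit has no intercept, it approximates the relationship by a line through the origin, even though in these samples the two columns differ by a constant offset of 2.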


In [66]:
dns_dataset

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
0,0,0,50,35,63,44,35.808317,37.147779,0.0,0.0,0.0,0.0,0.0,0.0,0.0,131,122.456328,141.301846,39.712734,38,6185.698164,3342.472820,99,0.0,0.0,0.0,0.0
1,0,0,49,61,68,55,38.098088,39.836043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,133,88.005215,143.599962,17.370482,23,6272.382685,2970.632007,126,0.0,0.0,0.0,0.0
2,0,0,52,40,78,48,38.123016,29.257643,0.0,0.0,0.0,0.0,0.0,0.0,0.0,153,133.304668,225.185590,50.538084,46,5730.716922,3031.393180,88,0.0,0.0,0.0,0.0
3,0,0,37,69,65,68,33.502485,35.731191,0.0,0.0,0.0,0.0,0.0,0.0,0.0,155,102.137462,168.038240,17.228296,21,9344.417639,5158.090510,147,0.0,0.0,0.0,0.0
4,0,0,43,68,55,50,40.236105,42.967968,0.0,0.0,0.0,0.0,0.0,0.0,0.0,134,121.821800,140.853104,50.020809,47,8419.666162,3290.598924,133,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,0,0,50,34,89,76,44.396031,44.830313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,155,139.157922,220.124335,48.631814,45,9132.171333,3403.665721,138,0.0,0.0,0.0,0.0
39996,0,0,28,45,73,39,39.310571,38.911268,0.0,0.0,0.0,0.0,0.0,0.0,0.0,157,143.354979,227.294136,30.485205,28,7538.197875,3987.824395,157,0.0,0.0,0.0,0.0
39997,0,0,20,65,38,87,42.954191,32.260217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,171,112.803007,236.673343,35.556901,46,6988.613007,3719.948853,142,0.0,0.0,0.0,0.0
39998,0,0,55,31,54,46,35.896917,37.735829,0.0,0.0,0.0,0.0,0.0,0.0,0.0,133,117.023007,196.561483,28.819181,37,6253.219597,4501.798248,134,0.0,0.0,0.0,0.0
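The loop above perturbs each fitted factor per row by between 10% and 20%, in a random direction. A consequence worth noting is that the effective factor never lands within 10% of the fitted value. The sketch below illustrates this for one factor; the seed and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed for repeatability
factor = 56.09391500246403      # 'Total Length of Fwd Packet' factor printed above
n = 10_000

# Same construction as in the loop: a magnitude in [10%, 20%] and a random sign
deltas = rng.uniform(factor * 0.1, factor * 0.2, size=n)
signs = rng.choice([-1, 1], size=n)
updated_factors = factor + signs * deltas

# Relative distance of each per-row factor from the fitted factor
distance = np.abs(updated_factors / factor - 1)
print(distance.min(), distance.max())
```

The relative distance stays between 0.1 and 0.2, so the per-row factors fall into two bands, roughly 0.8 to 0.9 and 1.1 to 1.2 times the fitted factor.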


### Then we insert data into columns that have the same values:

In [None]:
# this is a list of lists, each inner list holds columns that should have the same values
same_values = [['Max Response Data Length', 'Average Domain Name Length', 'Min Domain Name Length', 'Max Domain Name Length'], 
              ['Average Sub Domain Name Length', 'Min Sub Domain Name Length', 'Max Sub Domain Name Length']]

# for each list of columns, generate a vector with values based on the minimum and maximum dict and insert that vector to each column in the current list
for l in same_values:
    val = np.random.uniform(min_max_dict[l[0]][0], min_max_dict[l[0]][1], NUM_OF_ROWS)
    for col in l:
        dns_dataset[col] = val

dns_dataset.head(10)

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
0,0,0,50,35,63,44,35.808317,37.147779,52.099052,52.099052,52.099052,52.099052,14.816525,14.816525,14.816525,131,122.456328,141.301846,39.712734,38,6185.698164,3342.47282,99,0.0,0.0,0.0,0.0
1,0,0,49,61,68,55,38.098088,39.836043,52.661364,52.661364,52.661364,52.661364,15.365976,15.365976,15.365976,133,88.005215,143.599962,17.370482,23,6272.382685,2970.632007,126,0.0,0.0,0.0,0.0
2,0,0,52,40,78,48,38.123016,29.257643,46.789052,46.789052,46.789052,46.789052,13.709802,13.709802,13.709802,153,133.304668,225.18559,50.538084,46,5730.716922,3031.39318,88,0.0,0.0,0.0,0.0
3,0,0,37,69,65,68,33.502485,35.731191,56.66241,56.66241,56.66241,56.66241,13.693586,13.693586,13.693586,155,102.137462,168.03824,17.228296,21,9344.417639,5158.09051,147,0.0,0.0,0.0,0.0
4,0,0,43,68,55,50,40.236105,42.967968,39.852676,39.852676,39.852676,39.852676,13.752507,13.752507,13.752507,134,121.8218,140.853104,50.020809,47,8419.666162,3290.598924,133,0.0,0.0,0.0,0.0
5,0,0,23,44,30,67,36.257984,27.429194,47.84338,47.84338,47.84338,47.84338,17.592384,17.592384,17.592384,168,148.914542,174.907289,25.967314,23,5554.447492,2723.324307,114,0.0,0.0,0.0,0.0
6,0,0,16,61,69,78,36.040042,36.086707,41.427435,41.427435,41.427435,41.427435,13.622613,13.622613,13.622613,154,107.81139,163.19126,22.645809,29,5722.499273,2792.823531,116,0.0,0.0,0.0,0.0
7,0,0,30,45,87,71,37.311174,38.66619,42.154009,42.154009,42.154009,42.154009,18.159797,18.159797,18.159797,116,79.071617,167.93011,28.662012,34,7890.520585,3998.099424,162,0.0,0.0,0.0,0.0
8,0,0,62,74,48,61,49.272941,48.740616,44.617987,44.617987,44.617987,44.617987,17.217011,17.217011,17.217011,136,119.890174,194.229691,23.353679,30,5608.296103,2098.980623,84,0.0,0.0,0.0,0.0
9,0,0,39,72,81,62,34.938585,36.269358,49.321736,49.321736,49.321736,49.321736,14.618184,14.618184,14.618184,153,135.022136,216.356212,33.656142,44,7363.250129,4000.490936,157,0.0,0.0,0.0,0.0
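A quick way to confirm that every column in a group received the same vector is to count the distinct values per row. This is a minimal sketch on a hypothetical frame (`toy`) built the same way as the cell above, with made-up bounds.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # hypothetical seed
group = ['Average Sub Domain Name Length', 'Min Sub Domain Name Length', 'Max Sub Domain Name Length']

# One shared vector, assigned to every column in the group
shared = rng.uniform(12.9, 19.35, size=5)
toy = pd.DataFrame({col: shared for col in group})

# Each row should hold exactly one distinct value across the group
all_equal = bool((toy.nunique(axis=1) == 1).all())
print(all_equal)
```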


### Correlation between 'IAT Mean' and all of the following: 'IAT Max', 'IAT Std', 'Flow Duration':

In [68]:
# finding the correlation between the 'IAT Mean' column to the rest of the columns in order to create new data
correlation = ['IAT Mean', 'IAT Max', 'IAT Std', 'Flow Duration']
independent_col = dns_samples[correlation[0]].values.reshape(-1, 1) #column 'IAT Mean'
dependent_cols = dns_samples[correlation[1:]].values 

# using least squares regression to find scaling factors that best approximate the relationship between 'IAT Mean' and the rest of the columns in correlation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, float(factor)) for name, factor in zip(correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('IAT Max', 4.151419427246799)
('IAT Std', 1.7408151217079082)
('Flow Duration', 128.88859283302108)


In [69]:
dns_dataset['IAT Mean'] = np.random.uniform(min_max_dict['IAT Mean'][0], min_max_dict['IAT Mean'][1], NUM_OF_ROWS)

# iterating over all the columns we need to add values to (every column in correlation except 'IAT Mean')
for col, factor in scaling_factors:
    delta = random.uniform(factor * 0.1, factor * 0.2) # a single random delta per column, shared by all rows
    updated_factor = factor + delta
    dns_dataset[col] = dns_dataset['IAT Mean'] * updated_factor
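Unlike the per-row jitter used for the earlier groups, this cell draws one scalar delta per column and always adds it. Each derived column is therefore an exact multiple of `IAT Mean`. The sketch below illustrates the effect for `IAT Max`; the seeds and the sampling bounds are approximations chosen for the example.

```python
import random
import numpy as np

random.seed(0)                  # hypothetical seeds for repeatability
rng = np.random.default_rng(0)

factor = 4.151419427246799      # 'IAT Max' factor printed above
iat_mean = rng.uniform(0.235, 0.364, size=1000)  # approximate 'IAT Mean' bounds

# One scalar delta for the whole column, always added (never subtracted)
delta = random.uniform(factor * 0.1, factor * 0.2)
iat_max = iat_mean * (factor + delta)

# The ratio is the same in every row, so the two columns are perfectly correlated
ratio = iat_max / iat_mean
pearson_r = np.corrcoef(iat_mean, iat_max)[0, 1]
print(ratio.min(), ratio.max(), pearson_r)
```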


---

## Adding the Label column:

In [70]:
# adding a label to the dataset
dns_dataset['Label'] = ATTACK_NAME

---

## Validate that the generated data looks valid by comparing the samples with the generated dataset:

Cast the columns that should hold integer values to integers for consistency. Note that `astype(int)` truncates the fractional part rather than rounding.  

In [71]:
int_columns = ['Min Response Data Length', 'Max Response Data Length', 'Average Domain Name Length', 'Min Domain Name Length',
                'Max Domain Name Length', 'Min Packet Length', 'Max Packet Length', 'Number of Domian Names', 'Number of Sub Domian Names',
                'Total Length of Fwd Packet', 'Total Length of Bwd Packet', 'Total Number of Packets']

for col in int_columns:
    dns_dataset[col] = dns_dataset[col].astype(int)
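The difference between truncating and rounding can be seen on two of the generated values from row 0 of the table shown earlier. This sketch only illustrates the behaviour of the cast; whether rounding would be preferable for this dataset is left open.

```python
import pandas as pd

# Two generated values from row 0, before the cast
values = pd.Series([37.147779, 6185.698164])

truncated = values.astype(int)        # drops the fractional part
rounded = values.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```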

In [72]:
dns_samples

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
0,0,0,16,36,36,44,36.461538,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.272727,101,158,22,24,4908,2596,88,22.814415,1.096587,0.262235,0.461227
1,0,0,48,48,36,66,38.0,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.318182,101,158,33,35,7368,3894,132,34.926907,1.105063,0.266618,0.465087
2,0,0,48,36,52,68,38.571429,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.823529,101,158,34,36,7660,4012,136,35.418652,1.105114,0.26236,0.456099
3,0,0,48,52,34,68,37.84,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.626866,101,158,33,35,7580,3894,134,35.846157,1.102855,0.26952,0.464188
4,0,0,44,56,36,68,37.52,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.117647,101,158,34,36,7564,4012,136,35.797906,1.107464,0.26517,0.4614
5,0,0,28,40,68,68,37.294118,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.823529,101,158,34,36,7660,4012,136,35.301731,1.10236,0.261494,0.454823
6,0,0,52,40,44,68,38.521739,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.647059,101,158,34,36,7636,4012,136,35.763095,1.096167,0.264912,0.461058
7,0,0,40,48,48,68,37.636364,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.441176,101,158,34,36,7608,4012,136,35.993821,1.098705,0.266621,0.464297
8,0,0,44,48,40,66,37.826087,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.348485,101,158,33,35,7372,3894,132,34.958758,1.098239,0.266861,0.465344
9,0,0,52,32,48,66,38.952381,34,42,42.0,42,42,14.333333,14.333333,14.333333,127.893939,101,158,33,35,7444,3894,132,34.897108,1.098657,0.26639,0.463545


In [73]:
dns_dataset

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std,Label
0,0,0,50,35,63,44,35.808317,37,52,52,52,52,14.816525,14.816525,14.816525,131,122,141,39,38,6185,3342,99,51.068020,1.619813,0.332942,0.677521,DNS
1,0,0,49,61,68,55,38.098088,39,52,52,52,52,15.365976,15.365976,15.365976,133,88,143,17,23,6272,2970,126,42.149110,1.336917,0.274795,0.559194,DNS
2,0,0,52,40,78,48,38.123016,29,46,46,46,46,13.709802,13.709802,13.709802,153,133,225,50,46,5730,3031,88,41.055082,1.302216,0.267662,0.544679,DNS
3,0,0,37,69,65,68,33.502485,35,56,56,56,56,13.693586,13.693586,13.693586,155,102,168,17,21,9344,5158,147,52.991496,1.680824,0.345483,0.703040,DNS
4,0,0,43,68,55,50,40.236105,42,39,39,39,39,13.752507,13.752507,13.752507,134,121,140,50,47,8419,3290,133,42.818027,1.358134,0.279156,0.568068,DNS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,0,0,50,34,89,76,44.396031,44,54,54,54,54,13.341173,13.341173,13.341173,155,139,220,48,45,9132,3403,138,48.552329,1.540019,0.316541,0.644146,DNS
39996,0,0,28,45,73,39,39.310571,38,50,50,50,50,16.486593,16.486593,16.486593,157,143,227,30,28,7538,3987,157,50.151489,1.590742,0.326967,0.665362,DNS
39997,0,0,20,65,38,87,42.954191,32,46,46,46,46,17.855364,17.855364,17.855364,171,112,236,35,46,6988,3719,142,52.838715,1.675978,0.344486,0.701013,DNS
39998,0,0,55,31,54,46,35.896917,37,44,44,44,44,12.918386,12.918386,12.918386,133,117,196,28,37,6253,4501,134,46.461118,1.473688,0.302907,0.616401,DNS


In [74]:
dns_samples.describe()

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,0.0,0.0,42.0,43.6,44.2,65.0,37.862366,34.0,42.0,42.0,42.0,42.0,14.33333,14.333333,14.333333,127.531314,101.0,158.0,32.4,34.4,7280.0,3823.2,129.8,34.171855,1.101121,0.265218,0.461707
std,0.0,0.0,11.508451,7.87683,10.432854,7.438638,0.714916,0.0,0.0,0.0,0.0,0.0,3.601719e-15,0.0,0.0,0.268694,0.0,0.0,3.687818,3.687818,840.850363,435.162498,14.800901,4.011606,0.003966,0.002528,0.003659
min,0.0,0.0,16.0,32.0,34.0,44.0,36.461538,34.0,42.0,42.0,42.0,42.0,14.33333,14.333333,14.333333,127.117647,101.0,158.0,22.0,24.0,4908.0,2596.0,88.0,22.814415,1.096167,0.261494,0.454823
25%,0.0,0.0,41.0,37.0,36.0,66.0,37.549091,34.0,42.0,42.0,42.0,42.0,14.33333,14.333333,14.333333,127.325758,101.0,158.0,33.0,35.0,7390.0,3894.0,132.0,34.93487,1.098343,0.262998,0.4611
50%,0.0,0.0,46.0,44.0,42.0,68.0,37.833043,34.0,42.0,42.0,42.0,42.0,14.33333,14.333333,14.333333,127.534021,101.0,158.0,33.5,35.5,7572.0,3953.0,135.0,35.360191,1.100532,0.26578,0.462472
75%,0.0,0.0,48.0,48.0,48.0,68.0,38.391304,34.0,42.0,42.0,42.0,42.0,14.33333,14.333333,14.333333,127.779412,101.0,158.0,34.0,36.0,7629.0,4012.0,136.0,35.789203,1.104511,0.26662,0.46427
max,0.0,0.0,52.0,56.0,68.0,68.0,38.952381,34.0,42.0,42.0,42.0,42.0,14.33333,14.333333,14.333333,127.893939,101.0,158.0,34.0,36.0,7660.0,4012.0,136.0,35.993821,1.107464,0.26952,0.465344


In [75]:
dns_dataset.describe()

Unnamed: 0,A Record Count,AAAA Record Count,CName Record Count,TXT Record Count,MX Record Count,DF Flag Count,Average Response Data Length,Min Response Data Length,Max Response Data Length,Average Domain Name Length,Min Domain Name Length,Max Domain Name Length,Average Sub Domain Name Length,Min Sub Domain Name Length,Max Sub Domain Name Length,Average Packet Length,Min Packet Length,Max Packet Length,Number of Domian Names,Number of Sub Domian Names,Total Length of Fwd Packet,Total Length of Bwd Packet,Total Number of Packets,Flow Duration,IAT Max,IAT Mean,IAT Std
count,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0
mean,0.0,0.0,41.6763,50.879875,59.916,64.414825,42.677969,37.83095,46.728275,46.728275,46.728275,46.728275,16.123766,16.123766,16.123766,142.405825,112.2247,176.01525,31.5816,34.0075,7308.55535,3840.288225,130.4186,45.932123,1.456909,0.299458,0.609383
std,0.0,0.0,16.19221,13.577089,17.688548,15.012829,5.719167,7.8057,5.469176,5.469176,5.469176,5.469176,1.86829,1.86829,1.86829,16.743518,21.873446,34.215229,8.898223,7.796997,2029.443176,1063.8989,29.935915,5.688222,0.180423,0.037085,0.075466
min,0.0,0.0,14.0,28.0,30.0,39.0,32.816414,23.0,37.0,37.0,37.0,37.0,12.90003,12.90003,12.90003,114.0,72.0,113.0,15.0,21.0,3546.0,1862.0,79.0,36.098584,1.145002,0.235348,0.478921
25%,0.0,0.0,28.0,39.0,45.0,51.0,37.699438,32.0,42.0,42.0,42.0,42.0,14.508305,14.508305,14.508305,128.0,95.0,149.0,25.0,27.0,5714.0,3005.0,104.0,40.989852,1.300147,0.267237,0.543814
50%,0.0,0.0,42.0,51.0,60.0,64.0,42.690363,37.0,47.0,47.0,47.0,47.0,16.120317,16.120317,16.120317,142.0,110.0,172.0,31.0,34.0,7132.0,3757.0,130.0,45.938351,1.457107,0.299499,0.609466
75%,0.0,0.0,56.0,63.0,75.0,77.0,47.640156,44.0,51.0,51.0,51.0,51.0,17.735509,17.735509,17.735509,157.0,129.0,202.0,37.0,41.0,8598.0,4517.0,156.0,50.898794,1.614446,0.331839,0.675276
max,0.0,0.0,69.0,74.0,90.0,90.0,52.583687,56.0,56.0,56.0,56.0,56.0,19.34975,19.34975,19.34975,171.0,162.0,254.0,53.0,47.0,12248.0,6432.0,182.0,55.807986,1.770159,0.363845,0.740407
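The comparison of the two summaries can also be done programmatically. The sketch below takes the upper bounds printed by the min/max cell and the generated maxima from the table above for three columns, and lists the columns whose generated maximum exceeds the bound. The dictionary names are introduced for illustration.

```python
# Upper bounds printed by the min/max cell
upper_bounds = {
    'CName Record Count': 70.20,
    'Min Packet Length': 136.35,
    'Max Packet Length': 213.30,
}

# Maxima of the generated dataset, copied from the describe() output above
generated_max = {
    'CName Record Count': 69.0,
    'Min Packet Length': 162.0,
    'Max Packet Length': 254.0,
}

# Columns whose generated maximum is above the scaled bound
outside = [col for col, bound in upper_bounds.items() if generated_max[col] > bound]
print(outside)
```

The two packet-length columns exceed their bounds because they are not sampled from `min_max_dict` directly: they are derived from `Average Packet Length` multiplied by a jittered scaling factor.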


---

## Finally, we save the dataset as a CSV file

In [76]:
print(f'Attack Dataset Shape: {dns_dataset.shape}')

Attack Dataset Shape: (40000, 28)


In [None]:
# save the dataset
dns_dataset.to_csv('dns_background_dataset.csv', index=False)
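Passing `index=False` keeps the row index out of the file, so reading the CSV back yields the same columns. The sketch below demonstrates the round trip on a hypothetical two-row frame, writing to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Hypothetical frame with illustrative values
toy = pd.DataFrame({'IAT Mean': [0.33, 0.27], 'Label': ['DNS', 'DNS']})

# Write to an in-memory buffer, without the index column
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Read the data back and inspect the result
restored = pd.read_csv(buffer)
print(restored.columns.tolist(), restored.shape)
```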