# Prepare DoS Hulk & Goldeneye Closed Port Attack Dataset
* 'Closed Port' means that the DoS attack is performed on an HTTP server that is <u>not</u> currently running.

## Overview:

This notebook creates a DoS Hulk & Goldeneye closed port attack dataset from a small sample of data collected by running real DoS HTTP GET Flood attacks, in a controlled environment, against a web server that was offline.<br>
The generated dataset is designed to closely resemble real-world data and was used to train our SVM model.<br>  
It is worth noting that the sample dataset we collected contains no missing values or outliers, because we tested and verified each part of the collection process.<br>
In this notebook we generate an attack dataset of 5,000 flows of the DoS HTTP GET Flood closed port attack, based on samples we collected while running DoS HTTP GET Flood attacks in various configurations with the well-known DoS Hulk and DoS Goldeneye tools while the victim web server was offline.<br> 
The victim web server was a regular Flask web server.

## Imports & Global Variables:

In [2]:
import pandas as pd
import numpy as np
import random

NUM_OF_ROWS = 5000
ATTACK_NAME = 'DoS'

In [3]:
# the following command will make it so that when we print the dataframe we will see all the columns
pd.set_option('display.max_columns', None)

---

## Load the sample dataset:

In [4]:
# import the attack sample dataset
dos_samples = pd.read_csv('dos_hulk_goldeneye_samples_closed_port.csv')
print(f'Dataset Shape: {dos_samples.shape}')
dos_samples

Dataset Shape: (11, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1,74.0,74,74,0.0,0.0,12120,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,0.0,303,0,0,1.942041,156.021429,0.770559,0.006431,0.044304
1,1,74.0,74,74,0.0,0.0,24240,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,24240.0,606,0,0,7.609775,79.634417,4.271673,0.012578,0.175409
2,1,74.0,74,74,0.0,0.0,24280,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,24280.0,607,0,0,9.05479,67.036342,6.13993,0.014942,0.249637
3,1,74.0,74,74,0.0,0.0,12120,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,0.0,303,0,0,1.398528,216.656355,0.253589,0.004631,0.017373
4,1,74.0,74,74,0.0,0.0,19160,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,19160.0,479,0,0,4.504522,106.337591,2.296905,0.009424,0.106528
5,1,74.0,74,74,0.0,0.0,20360,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,0.0,509,0,0,2.271243,224.10635,0.96003,0.004471,0.056644
6,1,74.0,74,74,0.0,0.0,20160,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,20160.0,504,0,0,2.253179,223.683952,1.020697,0.004479,0.057479
7,1,74.0,74,74,0.0,0.0,40440,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,20220.0,1011,0,0,10.423238,96.994811,6.01986,0.01032,0.19808
8,1,74.0,74,74,0.0,0.0,20400,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,0.0,510,0,0,2.180446,233.897111,0.767459,0.004284,0.04866
9,1,74.0,74,74,0.0,0.0,40840,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,40840.0,1021,0,0,11.646797,87.663587,7.156308,0.011418,0.229241


### Find the columns that we need to synthesize data for:

In [5]:
columns_to_gather = dos_samples.replace(0, np.nan) # replace all 0 values with NaN
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist()  # drop the columns in which every value is NaN (i.e. was 0 in every sample)
columns_to_gather # left with the columns that have at least one non-zero value (we know for a fact that the data is consistent and that there are no missing values in the rows)

['Number of Ports',
 'Average Packet Length',
 'Packet Length Min',
 'Packet Length Max',
 'Total Length of Fwd Packet',
 'Fwd Packet Length Max',
 'Fwd Packet Length Mean',
 'Fwd Packet Length Min',
 'Subflow Fwd Bytes',
 'SYN Flag Count',
 'Flow Duration',
 'Packets Per Second',
 'IAT Max',
 'IAT Mean',
 'IAT Std']
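
To see why this filter keeps a column such as 'Subflow Fwd Bytes', which is zero in some rows only, here is a minimal sketch on a toy frame. The column names and values are illustrative assumptions, not the collected samples. Because `dropna` is called with `how='all'`, a column is removed only when every one of its values was zero.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one all-zero column, one mixed, one non-zero
toy = pd.DataFrame({
    'always_zero': [0, 0, 0],
    'sometimes_zero': [0.0, 24240.0, 0.0],
    'never_zero': [303, 606, 607],
})

# Zeros become NaN; how='all' drops a column only if every value is NaN
kept = toy.replace(0, np.nan).dropna(how='all', axis=1).columns.tolist()
print(kept)
```

Only `always_zero` is dropped, so the mixed column still gets synthesized values later on.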

### Find approximate minimum and maximum values for each column:

In [6]:
# find the minimum and maximum values for each column, scale the range (reduce min by 10% and increase max by 10%), and store the results in a dictionary.
min_max_dict = {col: (float(dos_samples[col].min() * 0.9), float(dos_samples[col].max() * 1.1)) for col in columns_to_gather}
min_max_dict['Number of Ports'] = (1, 1)

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

Number of Ports                | Min: 1.00 | Max: 1.00
Average Packet Length          | Min: 66.60 | Max: 81.40
Packet Length Min              | Min: 66.60 | Max: 81.40
Packet Length Max              | Min: 66.60 | Max: 81.40
Total Length of Fwd Packet     | Min: 10908.00 | Max: 44924.00
Fwd Packet Length Max          | Min: 36.00 | Max: 44.00
Fwd Packet Length Mean         | Min: 36.00 | Max: 44.00
Fwd Packet Length Min          | Min: 36.00 | Max: 44.00
Subflow Fwd Bytes              | Min: 0.00 | Max: 44924.00
SYN Flag Count                 | Min: 272.70 | Max: 1123.10
Flow Duration                  | Min: 1.26 | Max: 12.81
Packets Per Second             | Min: 60.33 | Max: 257.29
IAT Max                        | Min: 0.23 | Max: 7.87
IAT Mean                       | Min: 0.00 | Max: 0.02
IAT Std                        | Min: 0.02 | Max: 0.27
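
As a quick check of the arithmetic, the sketch below recomputes two of the printed rows by hand from the sample extremes (303 and 1021 for 'SYN Flag Count', 0.004284 for the 'IAT Mean' minimum). The variable names are illustrative. It also shows that the `Min: 0.00` printed for 'IAT Mean' is a formatting effect of the two-decimal output, not a true zero bound.

```python
# Observed extremes in the sample table above
syn_min, syn_max = 303, 1021
iat_mean_min = 0.004284

# Widen each bound by 10%, as the dictionary comprehension does
syn_low, syn_high = syn_min * 0.9, syn_max * 1.1
iat_low = iat_mean_min * 0.9

print(f'{syn_low:.2f} {syn_high:.2f}')  # the 'SYN Flag Count' bounds
print(f'{iat_low:.2f} {iat_low:.6f}')   # two decimals hide the non-zero bound
```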


### Create the base attack dataset (full of zeros):

In [7]:
# creating an empty dataframe before adding values to it
dos_dataset = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(dos_samples.columns))), columns = dos_samples.columns)
dos_dataset.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Find the columns with constant zero values based on samples:

In [8]:
# set every column that is zero in all samples to a constant 0
zero_columns = [col for col in dos_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    dos_dataset[col] = int(0)
zero_columns

['Packet Length Std',
 'Packet Length Variance',
 'Fwd Packet Length Std',
 'Bwd Packet Length Max',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Min',
 'Bwd Packet Length Std',
 'Fwd Segment Size Avg',
 'Bwd Segment Size Avg',
 'ACK Flag Count',
 'RST Flag Count']

---

## Filling in values based on collected samples:

### Firstly we insert data into columns that have the exact same values:

In [9]:
same_value = ['Average Packet Length', 'Packet Length Min', 'Packet Length Max']
val = np.random.randint(min_max_dict[same_value[0]][0], min_max_dict[same_value[0]][1]*1.1, NUM_OF_ROWS)

for col in same_value:
    dos_dataset[col] = val

In [10]:
same_value2 = ['Fwd Packet Length Max', 'Fwd Packet Length Mean', 'Fwd Packet Length Min']
val2 = np.random.randint(min_max_dict[same_value2[0]][0], min_max_dict[same_value2[0]][1]*1.25, NUM_OF_ROWS)

for col in same_value2:
    dos_dataset[col] = val2
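
The two cells above pass float bounds to `np.random.randint`. The generated minimum of 66 and maximum of 88 reported later by `describe()` are consistent with those bounds being truncated to integers and the upper bound being exclusive. The sketch below makes that explicit by casting the bounds itself. It uses a seeded `np.random.default_rng` generator, which is an assumption for reproducibility and not what the notebook uses.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (assumption, not in the notebook)

low, high = 66.6, 81.4 * 1.1    # same bounds as the first cell above

# Explicit integer bounds: draws cover int(low) .. int(high) - 1
shared = rng.integers(int(low), int(high), size=5000)

# One draw per row, reused by every column in the group
group = ('Average Packet Length', 'Packet Length Min', 'Packet Length Max')
columns = {name: shared for name in group}
print(shared.min(), shared.max())
```

Reusing one vector for the whole group is what keeps the minimum, mean and maximum packet length identical within each generated row.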

### Then we insert data into columns that are independent of each other, based on the min max values:

In [None]:
dos_dataset['Number of Ports'] = np.full(shape = NUM_OF_ROWS, fill_value = 1, dtype = int)
dos_dataset['SYN Flag Count'] = np.random.randint(min_max_dict['SYN Flag Count'][0]*1.35, min_max_dict['SYN Flag Count'][1]*1.1, NUM_OF_ROWS)

## Then we fill values into columns that have a certain correlation between them:

### Correlation between 'SYN Flag Count' and all of the following: 'Total Length of Fwd Packet', 'Flow Duration':

In [12]:
# finding the correlation between the 'SYN Flag Count' column to the rest of the columns in order to create new data
first_correlation = ['SYN Flag Count', 'Total Length of Fwd Packet', 'Flow Duration']
independent_col = dos_samples[first_correlation[0]].values.reshape(-1, 1) #column 'SYN Flag Count'
dependent_cols = dos_samples[first_correlation[1:]].values 

# using least squares regression to find scaling factors that best approximate the relationship between 'SYN Flag Count' and each of 'Total Length of Fwd Packet' and 'Flow Duration'
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)
    
# adding the rest of the attack feature values to the dataset at random based on the sample data
for index, row in dos_dataset.iterrows():
    for col, factor in scaling_factors: #iterating over all columns we need to add values to except 'SYN Flag Count'
        delta = random.uniform(factor * 0.05, factor * 0.25) 
        updated_factor = factor + random.choice([-1, 1]) * delta
        dos_dataset.loc[index, col] = int(row['SYN Flag Count'] * updated_factor) if col == 'Total Length of Fwd Packet' else row['SYN Flag Count'] * updated_factor

('Total Length of Fwd Packet', np.float64(40.00000000000001))
('Flow Duration', np.float64(0.008855462081054798))
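
Because `independent_col` is a single column with no intercept term, `np.linalg.lstsq` fits a line through the origin, y ≈ k·x, whose slope has the closed form k = Σxy / Σx². The sketch below checks this on three (SYN Flag Count, Total Length of Fwd Packet) pairs copied from the sample table. The variable names are illustrative.

```python
import numpy as np

# Three (SYN Flag Count, Total Length of Fwd Packet) pairs from the sample table
syn = np.array([303.0, 606.0, 607.0])
fwd_len = np.array([12120.0, 24240.0, 24280.0])

# lstsq with a single column and no intercept fits y ≈ k * x
k_lstsq = np.linalg.lstsq(syn.reshape(-1, 1), fwd_len, rcond=None)[0][0]

# Closed form of the same fit
k_closed = (syn @ fwd_len) / (syn @ syn)
print(k_lstsq, k_closed)
```

Both values come out at 40, which matches the constant forward packet length of 40 in the samples: the total forward length is the packet count times 40.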


### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std':

In [13]:
# finding the correlation between the 'Flow Duration' column to the rest of the columns in order to create new data
second_correlation = ['Flow Duration', 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std']
independent_col = dos_samples[second_correlation[0]].values.reshape(-1, 1) #column 'Flow Duration'
dependent_cols = dos_samples[second_correlation[1:]].values 

# using least squares regression to find scaling factors that best approximate the relationship between 'Flow Duration' and the rest of the columns in second_correlation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(second_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

# multiply each row's flow duration by its packets per second (this product approximates the number of packets in the flow), then take the average over all samples.
duration_to_packets_corr = [x * y for x, y in zip(dos_samples['Flow Duration'].values, dos_samples['Packets Per Second'].values)]
duration_to_packets_corr = np.mean(duration_to_packets_corr)
duration_to_packets_corr

('Packets Per Second', np.float64(15.176195493353456))
('IAT Max', np.float64(0.5699914672955171))
('IAT Mean', np.float64(0.0013028246020595424))
('IAT Std', np.float64(0.021232950310855647))


np.float64(624.2727272727273)
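
The averaged product is not a correlation coefficient in the statistical sense. Seconds multiplied by packets per second gives a packet count, and in these samples that count matches 'SYN Flag Count' row by row, which is why the result equals the mean 'SYN Flag Count' of the samples (624.27). A small check on three rows copied from the sample table:

```python
# (Flow Duration, Packets Per Second, SYN Flag Count) for three sample rows
rows = [
    (1.942041, 156.021429, 303),
    (7.609775, 79.634417, 606),
    (9.05479, 67.036342, 607),
]

products = []
for duration, pps, syn_count in rows:
    packets = duration * pps  # seconds * packets/second = packets
    products.append(packets)
    print(f'{packets:.1f} vs {syn_count}')
```

This is also why the next cell divides the perturbed value by 'Flow Duration' to obtain 'Packets Per Second' instead of multiplying by a fitted factor.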

In [14]:
# adding the rest of the attack feature values to the dataset at random based on the sample data
for index, row in dos_dataset.iterrows():
    for col, factor in scaling_factors: #iterating over all columns we need to add values to except 'Flow Duration'
        if col == 'Packets Per Second':
            delta = random.uniform(duration_to_packets_corr*0.05, duration_to_packets_corr * 0.125) 
            updated_factor = duration_to_packets_corr + random.choice([-1, 1]) * delta
            dos_dataset.loc[index, col] = updated_factor / row['Flow Duration']
        else:
            if col == 'IAT Std':
                delta = random.uniform(factor * 0.1, factor * 0.25)
                updated_factor = factor + random.choices([-1, 1], weights=[1, 2], k=1)[0] * delta  
            elif col == 'IAT Max':
                delta = random.uniform(factor * 0.1, factor * 0.25)
                updated_factor = factor + random.choice([-1, 1]) * delta  
            else:
                delta = random.uniform(factor * 0.25, factor * 0.5)
                updated_factor = factor + random.choices([-1, 1], weights=[1, 3], k=1)[0] * delta
            dos_dataset.loc[index, col] = row['Flow Duration'] * updated_factor
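
The row-by-row loop above can also be written with whole-array operations. The sketch below is one possible vectorized version of the 'IAT Max' branch only, under stated assumptions: it uses a seeded `np.random.default_rng` generator and stand-in flow durations instead of the notebook's `dos_dataset`, so its numbers will not reproduce the notebook's output.

```python
import numpy as np

rng = np.random.default_rng(0)             # seeded generator (assumption)
n = 5000
factor = 0.5699914672955171                # 'IAT Max' factor printed above
flow_duration = rng.uniform(2.5, 13.5, n)  # stand-in durations (hypothetical)

# Relative shift of 10%..25% of the factor with a random sign, one draw per row
delta = rng.uniform(0.10, 0.25, n) * factor
sign = rng.choice([-1, 1], size=n)
iat_max = flow_duration * (factor + sign * delta)

ratio = iat_max / flow_duration
print(ratio.min(), ratio.max())
```

Note a property this scheme shares with the loop: because the shift is at least 10% of the factor, the per-row ratio never falls within ±10% of the fitted factor itself.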

In our sample dataset, the 'Subflow Fwd Bytes' column usually holds values in a specific range, but it is sometimes zero.<br>
To reproduce this, we generate a vector with a fixed mix of values: about 50% of the rows get a value drawn from the usual range, and the remaining rows get zero.  

In [15]:
# generate a vector with random values based on min max dict, and also create a zero vector
col = 'Subflow Fwd Bytes'
subflow_values = dos_samples[dos_samples[col] != 0][col] 
min_max_dict[col] = (np.min(subflow_values), np.max(subflow_values))

rand_values = np.random.uniform(min_max_dict[col][0]*0.9, min_max_dict[col][1]*1.1, NUM_OF_ROWS)
zero_values = np.zeros(NUM_OF_ROWS)

# choose values randomly (50% from rand_values, 50% from zero_values)
dos_dataset[col] = np.where(np.random.rand(NUM_OF_ROWS) > 0.5, rand_values, zero_values)
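
Since each row is assigned independently with probability 0.5, the share of zero rows is close to 50% but not exactly 50%. The sketch below reproduces the mixing step in isolation and measures that share. The seeded generator is an assumption, and the bounds 19160 and 40840 are the smallest and largest non-zero 'Subflow Fwd Bytes' values in the samples.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (assumption)
n = 5000

# Non-zero range taken from the samples, widened by 10% on each side
rand_values = rng.uniform(19160 * 0.9, 40840 * 1.1, n)
mask = rng.random(n) > 0.5      # True for about half of the rows

subflow = np.where(mask, rand_values, 0.0)
zero_share = (subflow == 0).mean()
print(f'{zero_share:.3f}')
```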

---

## Check that the generated data looks valid by comparing the samples with the generated dataset:

In [16]:
dos_samples

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1,74.0,74,74,0.0,0.0,12120,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,0.0,303,0,0,1.942041,156.021429,0.770559,0.006431,0.044304
1,1,74.0,74,74,0.0,0.0,24240,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,24240.0,606,0,0,7.609775,79.634417,4.271673,0.012578,0.175409
2,1,74.0,74,74,0.0,0.0,24280,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,24280.0,607,0,0,9.05479,67.036342,6.13993,0.014942,0.249637
3,1,74.0,74,74,0.0,0.0,12120,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,0.0,303,0,0,1.398528,216.656355,0.253589,0.004631,0.017373
4,1,74.0,74,74,0.0,0.0,19160,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,19160.0,479,0,0,4.504522,106.337591,2.296905,0.009424,0.106528
5,1,74.0,74,74,0.0,0.0,20360,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,0.0,509,0,0,2.271243,224.10635,0.96003,0.004471,0.056644
6,1,74.0,74,74,0.0,0.0,20160,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,20160.0,504,0,0,2.253179,223.683952,1.020697,0.004479,0.057479
7,1,74.0,74,74,0.0,0.0,40440,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,20220.0,1011,0,0,10.423238,96.994811,6.01986,0.01032,0.19808
8,1,74.0,74,74,0.0,0.0,20400,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,0.0,510,0,0,2.180446,233.897111,0.767459,0.004284,0.04866
9,1,74.0,74,74,0.0,0.0,40840,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,40840.0,1021,0,0,11.646797,87.663587,7.156308,0.011418,0.229241


In [17]:
dos_samples.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0
mean,1.0,74.0,74.0,74.0,0.0,0.0,24970.909091,40.0,40.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15380.0,624.272727,0.0,0.0,5.318936,153.285979,2.789452,0.008012,0.113205
std,0.0,0.0,0.0,0.0,0.0,0.0,10791.24224,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13552.384292,269.781056,0.0,0.0,3.758764,66.859893,2.595128,0.003857,0.083803
min,1.0,74.0,74.0,74.0,0.0,0.0,12120.0,40.0,40.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,303.0,0.0,0.0,1.398528,67.036342,0.253589,0.004284,0.017373
25%,1.0,74.0,74.0,74.0,0.0,0.0,19660.0,40.0,40.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,491.5,0.0,0.0,2.216812,92.329199,0.865294,0.004555,0.052652
50%,1.0,74.0,74.0,74.0,0.0,0.0,20400.0,40.0,40.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20160.0,510.0,0.0,0.0,4.504522,156.021429,1.026959,0.006431,0.0619
75%,1.0,74.0,74.0,74.0,0.0,0.0,32360.0,40.0,40.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22260.0,809.0,0.0,0.0,8.332283,220.170153,5.145766,0.010869,0.186744
max,1.0,74.0,74.0,74.0,0.0,0.0,40840.0,40.0,40.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40840.0,1021.0,0.0,0.0,11.646797,233.897111,7.156308,0.014942,0.249637


In [18]:
dos_dataset.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,1.0,76.932,76.932,76.932,0.0,0.0,32045.0558,45.0392,45.0392,45.0392,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15115.325413,804.3456,0.0,0.0,7.131503,100.559464,4.088149,0.011062,0.1607
std,0.0,6.620373,6.620373,6.620373,0.0,0.0,11263.899104,5.541137,5.541137,5.541137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16420.567755,248.705152,0.0,0.0,2.520671,41.539421,1.651946,0.005082,0.062962
min,1.0,66.0,66.0,66.0,0.0,0.0,11294.0,36.0,36.0,36.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,368.0,0.0,0.0,2.517072,41.54077,1.133186,0.001812,0.042499
25%,1.0,71.0,71.0,71.0,0.0,0.0,23059.75,40.0,40.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,595.0,0.0,0.0,5.093124,69.684177,2.818326,0.007019,0.111384
50%,1.0,77.0,77.0,77.0,0.0,0.0,31219.5,45.0,45.0,45.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,799.0,0.0,0.0,6.947593,89.543693,3.868461,0.010348,0.152974
75%,1.0,83.0,83.0,83.0,0.0,0.0,39727.75,50.0,50.0,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30818.947753,1020.25,0.0,0.0,8.881915,122.753813,5.168005,0.014649,0.201461
max,1.0,88.0,88.0,88.0,0.0,0.0,61044.0,54.0,54.0,54.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,44919.517544,1234.0,0.0,0.0,13.555006,270.610675,9.508376,0.026128,0.347554
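
Comparing two wide `describe()` tables by eye is error-prone. As a sketch, a small helper can put the ranges side by side. `range_report` is a hypothetical helper, not part of the notebook, and the two tiny frames below only carry the min and max values copied from the outputs above.

```python
import pandas as pd

def range_report(samples: pd.DataFrame, generated: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column min/max of both frames, side by side."""
    cols = [c for c in samples.columns if c in generated.columns]
    return pd.DataFrame({
        'sample_min': samples[cols].min(),
        'generated_min': generated[cols].min(),
        'sample_max': samples[cols].max(),
        'generated_max': generated[cols].max(),
    })

# Min and max values copied from the describe() outputs above
samples = pd.DataFrame({'SYN Flag Count': [303, 1021],
                        'Flow Duration': [1.398528, 11.646797]})
generated = pd.DataFrame({'SYN Flag Count': [368, 1234],
                          'Flow Duration': [2.517072, 13.555006]})

report = range_report(samples, generated)
print(report)
```

In the notebook, the same call would be `range_report(dos_samples, dos_dataset)`. For these two columns it shows that the generated range is shifted upward relative to the samples, which follows from the 1.35 multiplier applied to the lower 'SYN Flag Count' bound.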


---

## Adding the Label column:

In [19]:
dos_dataset['Label'] = ATTACK_NAME

---

## Finally, we save the dataset as a CSV file

In [20]:
print(f'Attack Dataset Shape: {dos_dataset.shape}')

Attack Dataset Shape: (5000, 27)


In [21]:
# save the dataset
dos_dataset.to_csv('dos_hulk_goldeneye_closed_port_dataset.csv', index=False)