# Prepare DoS Goldeneye Attack Dataset

## Overview:

This notebook builds a DoS Goldeneye attack dataset from a small sample of flows collected by running real DoS HTTP GET Flood attacks in a controlled environment.<br>
The generated dataset is intended to closely resemble real-world traffic and was used to train our SVM model.<br>
Note that the collected sample dataset contains no missing values or outliers, because we tested and verified each part of the collection process.<br>
The notebook generates an attack dataset of 25,000 DoS HTTP GET Flood flows, based on samples collected while running the well-known DoS Goldeneye tool in various configurations against an online victim web server.<br>
The victim web server was a regular Flask web server.

## Imports & Global Variables:

In [29]:
import pandas as pd
import numpy as np
import random

NUM_OF_ROWS = 25000
ATTACK_NAME = 'DoS'

In [30]:
# show all columns when a dataframe is displayed
pd.set_option('display.max_columns', None)

---

## Load the sample dataset:

In [31]:
# import the attack sample dataset
dos_samples = pd.read_csv('dos_goldeneye_samples.csv') 
print(f'Dataset Shape: {dos_samples.shape}')
dos_samples

Dataset Shape: (20, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1,123.434336,54,800,145.0221,21031.409346,757956,766,89.434336,20,145.0221,0,0.0,0,0.0,55.9259,0.0,757956.0,1604,6867,1255,6.103949,1388.445453,3.753694,0.00072,0.041021
1,1,117.336729,54,798,137.485156,18902.168182,809783,764,83.336729,20,137.485156,0,0.0,0,0.0,50.03427,0.0,809783.0,1660,8005,2411,3.808417,2551.453738,1.820085,0.000392,0.018483
2,1,133.224948,66,847,153.839422,23666.567721,958513,813,99.224948,32,153.839422,0,0.0,0,0.0,65.378986,0.0,958513.0,2229,7431,106,5.103857,1892.686242,2.224035,0.000528,0.022869
3,1,127.334055,54,827,149.385549,22316.042388,904967,793,93.334055,20,149.385549,0,0.0,0,0.0,59.850557,0.0,904967.0,1906,7718,868,5.16678,1876.603999,2.473612,0.000533,0.025403
4,1,124.758296,54,783,146.48448,21457.70277,877905,749,90.758296,20,146.48448,0,0.0,0,0.0,57.186499,0.0,877905.0,1959,7675,1222,5.400769,1791.04124,3.195697,0.000558,0.032502
5,1,117.920988,54,786,138.710865,19240.704013,791291,752,83.920988,20,138.710865,0,0.0,0,0.0,50.694559,0.0,791291.0,1651,7641,2243,5.991286,1573.785651,3.882892,0.000635,0.040014
6,1,121.535804,54,816,143.291036,20532.320952,838593,782,87.535804,20,143.291036,0,0.0,0,0.0,54.26691,0.0,838593.0,1755,7668,1646,5.244331,1826.734388,2.550409,0.000547,0.026378
7,1,121.197395,54,832,142.506324,20308.052313,816778,798,87.197395,20,142.506324,0,0.0,0,0.0,53.704495,0.0,816778.0,1787,7554,1581,6.696016,1398.891554,4.246982,0.000715,0.044103
8,1,118.740191,54,859,139.588957,19485.077035,764526,825,84.740191,20,139.588957,0,0.0,0,0.0,51.500111,0.0,764526.0,1622,7251,2070,5.257887,1715.89838,2.85699,0.000583,0.030171
9,1,135.19741,54,847,158.072184,24986.815414,890942,813,101.19741,20,158.072184,0,0.0,0,0.0,67.505452,0.0,890942.0,1988,6732,106,5.875914,1498.320066,3.307963,0.000667,0.035406


### Find the columns that we need to synthesize data for:

In [32]:
columns_to_gather = dos_samples.replace(0, np.nan) # replace all 0 values with NaN
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() # drop only the columns in which every value is NaN (i.e. was 0 in every row)
columns_to_gather # the remaining columns have at least one non-zero value (we know for a fact that the data is consistent and there are no missing values in the rows)

['Number of Ports',
 'Average Packet Length',
 'Packet Length Min',
 'Packet Length Max',
 'Packet Length Std',
 'Packet Length Variance',
 'Total Length of Fwd Packet',
 'Fwd Packet Length Max',
 'Fwd Packet Length Mean',
 'Fwd Packet Length Min',
 'Fwd Packet Length Std',
 'Fwd Segment Size Avg',
 'Subflow Fwd Bytes',
 'SYN Flag Count',
 'ACK Flag Count',
 'RST Flag Count',
 'Flow Duration',
 'Packets Per Second',
 'IAT Max',
 'IAT Mean',
 'IAT Std']
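
The selection above can be illustrated on a toy frame. This is a minimal sketch with made-up column names (`a`, `b`, `c`), not part of the original pipeline; it shows that `dropna(how='all', axis=1)` removes only columns that are zero in every row, so a column that is zero in just some rows is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'b' is zero in every row, 'c' is zero in only some rows.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [0, 0, 0], 'c': [0, 5, 0]})

# Zeros become NaN, then only columns that are NaN in every row are dropped.
non_zero_cols = toy.replace(0, np.nan).dropna(how='all', axis=1).columns.tolist()
all_zero_cols = [c for c in toy.columns if c not in non_zero_cols]

print(non_zero_cols)  # ['a', 'c']
print(all_zero_cols)  # ['b']
```

This is why columns such as 'RST Flag Count' and 'Subflow Fwd Bytes', which are zero in some samples only, still appear in the list.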

### Find the approximate minimum and maximum values of each column:

The 'RST Flag Count' column is sometimes non-zero and sometimes exactly 0. Therefore, for this specific column<br>
we calculate the minimum/maximum values from the non-zero values only, and later add rows where the<br>
value is exactly 0 in order to match the collected sample dataset.

In [33]:
# find the minimum and maximum values for each column, scale the range (reduce min by 15% and increase max by 15%), and store the results in a dictionary.
RST_FlagCount = dos_samples[dos_samples['RST Flag Count'] != 0]['RST Flag Count'] 

min_max_dict = {col: (dos_samples[col].min() * 0.85, dos_samples[col].max() * 1.15) for col in columns_to_gather}
min_max_dict['Number of Ports'] = (1, 1)
min_max_dict['RST Flag Count'] = (np.min(RST_FlagCount), np.max(RST_FlagCount))

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

Number of Ports                | Min: 1.00 | Max: 1.00
Average Packet Length          | Min: 95.06 | Max: 156.41
Packet Length Min              | Min: 45.90 | Max: 75.90
Packet Length Max              | Min: 665.55 | Max: 987.85
Packet Length Std              | Min: 113.01 | Max: 181.78
Packet Length Variance         | Min: 15024.89 | Max: 28734.84
Total Length of Fwd Packet     | Min: 432112.80 | Max: 1150113.85
Fwd Packet Length Max          | Min: 636.65 | Max: 948.75
Fwd Packet Length Mean         | Min: 66.16 | Max: 117.31
Fwd Packet Length Min          | Min: 17.00 | Max: 36.80
Fwd Packet Length Std          | Min: 113.01 | Max: 181.78
Fwd Segment Size Avg           | Min: 37.96 | Max: 78.55
Subflow Fwd Bytes              | Min: 0.00 | Max: 1102289.95
SYN Flag Count                 | Min: 1296.25 | Max: 2856.60
ACK Flag Count                 | Min: 3689.00 | Max: 9205.75
RST Flag Count                 | Min: 69.00 | Max: 2411.00
Flow Duration                  | Min: 2.12 | Max: 4
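
As a worked example of the 15% widening, the sketch below recomputes the 'Packet Length Min' range from its sample minimum (54) and maximum (66). The helper name `widened_range` and its `margin` parameter are assumptions introduced for illustration. Note that multiplying by 0.85 and 1.15 only widens the range when the values are non-negative, which holds for the columns used here.

```python
import pandas as pd

def widened_range(values, margin=0.15):
    """Return (min * (1 - margin), max * (1 + margin)) for a non-negative column."""
    low = float(values.min() * (1 - margin))
    high = float(values.max() * (1 + margin))
    return low, high

# Illustrative values covering the sample minimum and maximum of 'Packet Length Min'.
packet_length_min = pd.Series([54, 54, 66])
low, high = widened_range(packet_length_min)

print(f'{low:.2f} {high:.2f}')  # 45.90 75.90
```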

### Create the base attack dataset (full of zeros):

In [34]:
# create a zero-filled dataframe before adding values to it
dos_dataset = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(dos_samples.columns))), columns = dos_samples.columns)
dos_dataset.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Find the columns with constant zero values based on samples:

In [35]:
# adding zeros to all columns that should not have any values
zero_columns = [col for col in dos_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    dos_dataset[col] = int(0)
zero_columns

['Bwd Packet Length Max',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Min',
 'Bwd Packet Length Std',
 'Bwd Segment Size Avg']

In [36]:
dos_dataset.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---

## Filling in values based on collected samples:

## Calculate and fill values into columns that have a certain correlation between them:

A correlation between two or more columns is common in our dataset since most features are inherently related. All of them are derived from network packet traffic.<br>
For example, as the **flow duration** increases, the **packets per second** is likely to decrease. This occurs because each flow has an upper limit on duration, after which data collection stops and a new flow begins.<br>  
Similarly, the **Inter-Arrival Time (IAT)** of packets within a flow is influenced by the flow duration. Given these dependencies, <br>
the attack dataset should generate data for these columns collectively, ensuring that their inherent correlations are maintained.

### Correlation between 'SYN Flag Count' and 'ACK Flag Count':

In [37]:
# fit the relationship between the 'SYN Flag Count' column and the 'ACK Flag Count' column in order to create new data
first_correlation = ['SYN Flag Count', 'ACK Flag Count']
independent_col = dos_samples[first_correlation[0]].values.reshape(-1, 1) # column 'SYN Flag Count'
dependent_cols = dos_samples[first_correlation[1]].values # column 'ACK Flag Count'

# using least squares regression to find scaling factors that best approximate the relationship between 'SYN Flag Count' and 'ACK Flag Count'
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name,factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('ACK Flag Count', np.float64(3.6602748137166277))
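
Because the design matrix has a single column and no intercept term, `np.linalg.lstsq` fits a line through the origin, `ACK ≈ k * SYN`. The sketch below, which uses only the first four sample rows shown earlier, checks that the fitted slope equals the closed form `k = sum(x * y) / sum(x * x)`. The variable names are illustrative.

```python
import numpy as np

# SYN and ACK counts from the first four sample rows displayed above.
syn = np.array([1604.0, 1660.0, 2229.0, 1906.0])
ack = np.array([6867.0, 8005.0, 7431.0, 7718.0])

# Single-column least squares without an intercept: ack ≈ k * syn.
k_lstsq = np.linalg.lstsq(syn.reshape(-1, 1), ack, rcond=None)[0][0]

# The same slope in closed form.
k_closed = np.dot(syn, ack) / np.dot(syn, syn)

print(np.isclose(k_lstsq, k_closed))  # True
```

A fit through the origin means a SYN count of zero always maps to an ACK count of zero; the scaling factor carries the whole relationship.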


After finding the scaling factors we can apply some randomness when generating values for the attack dataset in order to generate better data (without many duplications).<br>
We add randomness by creating a modified scaling factor, which introduces controlled variations in the generated values.<br>
This is done by selecting a small random delta (between 10% and 20% of the factor) and adding or subtracting it from the original scaling factor.<br>
As a result, the generated data maintains realistic correlations while avoiding exact duplicates.
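
The perturbation described above can also be written in vectorized form. This is a minimal sketch under stated assumptions, not the notebook's own cell: the helper `jittered_factors`, the seed, and the rounded factor 3.66 are introduced here for illustration.

```python
import numpy as np

def jittered_factors(rng, factor, size, low=0.10, high=0.20):
    """Return `size` factors, each moved up or down by 10% to 20% of `factor`."""
    delta = rng.uniform(factor * low, factor * high, size)  # magnitude of the change
    sign = rng.choice([-1, 1], size)                        # direction of the change
    return factor + sign * delta

rng = np.random.default_rng(0)                  # arbitrary seed, for reproducibility
syn = rng.integers(1296, 2857, size=5)          # hypothetical SYN counts
ack = (syn * jittered_factors(rng, 3.66, syn.size)).astype(int)

print(ack / syn)  # every ratio lies roughly within 3.66 * [0.8, 1.2]
```

One consequence of this scheme is that no generated ratio falls within 10% of the fitted factor itself, since the delta is always at least 10% of it.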

In [None]:
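# the bounds stored in min_max_dict are floats, so cast them to int explicitly before sampling integers
dos_dataset['SYN Flag Count'] = np.random.randint(int(min_max_dict['SYN Flag Count'][0]), int(min_max_dict['SYN Flag Count'][1]), NUM_OF_ROWS)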

# generate new data by scaling the original correlated column value using the updated factor.
for index, row in dos_dataset.iterrows():
    for col, factor in scaling_factors: #iterating over all generated scaling factors
        delta = random.uniform(factor * 0.1, factor * 0.2) #select a delta
        updated_factor = factor + random.choice([-1, 1]) * delta
        dos_dataset.loc[index, col] = int(row['SYN Flag Count'] * updated_factor)

In [39]:
dos_dataset.head(10)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,2824,9100.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,1976,6216.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,2271,7282.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,2631,8163.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,2391,7157.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,1482,4449.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,2015,6557.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,1892,8213.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,1364,4127.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,1447,5977.0,0.0,0.0,0.0,0.0,0.0,0.0


As mentioned above, the 'RST Flag Count' column in this attack sample takes two kinds of values: sometimes it is exactly zero, and other times it is a number in a specific range.<br> 
Therefore, we fill this column so that half of the rows are zero and the other half hold values in the usual range (between the smallest non-zero value and the maximum value).

In [None]:
# adding the RST flag such that half of the rows will have 0 and the other half will have a random value from the known range
midpoint = len(dos_dataset) // 2 #calculate the midpoint of the DataFrame

# generate random integers for the second half
random_values = np.random.randint(min_max_dict['RST Flag Count'][0], min_max_dict['RST Flag Count'][1] + 1, size=len(dos_dataset) - midpoint)

# add the new column with 0s for the first half and random integers for the second half
new_column_values = [0] * midpoint + list(random_values)
dos_dataset['RST Flag Count'] = new_column_values
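
The cell above places all zeros in the first half of the rows. If row order matters to a later step, one possible variant keeps the exact half-and-half split but shuffles the positions. This is a hedged sketch on a small hypothetical size, using the non-zero range printed earlier (69 to 2411); it is not what the notebook does.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility
n_rows = 10                      # small hypothetical size for illustration
low, high = 69, 2411             # non-zero 'RST Flag Count' range printed above

midpoint = n_rows // 2
values = np.concatenate([
    np.zeros(midpoint, dtype=int),                     # rows with no RST flags
    rng.integers(low, high + 1, size=n_rows - midpoint),  # rows within the usual range
])

# Shuffle so zero and non-zero rows are interleaved rather than grouped.
shuffled = rng.permutation(values)
print(shuffled)
```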

In [41]:
dos_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,2824,9100.0,0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,1976,6216.0,0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,2271,7282.0,0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,2631,8163.0,0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,2391,7157.0,0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,1987,8512.0,761,0.0,0.0,0.0,0.0,0.0
24996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,1421,4634.0,1466,0.0,0.0,0.0,0.0,0.0
24997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,2202,6501.0,497,0.0,0.0,0.0,0.0,0.0
24998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0.0,0,0.0,2507,10718.0,1867,0.0,0.0,0.0,0.0,0.0


### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std':

In [None]:
# generate random values for the 'Flow Duration' column
rand_values = np.random.uniform(min_max_dict['Flow Duration'][0]*0.9, min_max_dict['Flow Duration'][1]*1.05, size = NUM_OF_ROWS)

# assign the random values
dos_dataset['Flow Duration'] = rand_values

In [43]:
# finding the correlation between the 'Flow Duration' column to the rest of the columns in order to create new data
second_correlation = ['Flow Duration', 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std']
independent_col = dos_samples[second_correlation[0]].values.reshape(-1, 1) #column 'Flow Duration'
dependent_cols = dos_samples[second_correlation[1:]].values 

# using least squares regression to find scaling factors that best approximate the relationship between 'Flow Duration' and the rest of the columns in second_correlation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name,factor) for name, factor in zip(second_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('Packets Per Second', np.float64(52.73181510400994))
('IAT Max', np.float64(0.8336811683983014))
('IAT Mean', np.float64(0.00010657804108310443))
('IAT Std', np.float64(0.008605181263205227))


In [44]:
# multiply each sample's flow duration by its packets per second and take the mean of these products; this mean product (rather than a linear scaling factor) is used below to generate 'Packets Per Second'.
duration_to_packets_corr = [x * y for x, y in zip(dos_samples['Flow Duration'].values, dos_samples['Packets Per Second'].values)]
duration_to_packets_corr = np.mean(duration_to_packets_corr)
duration_to_packets_corr

np.float64(9298.15)
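
If 'Flow Duration' is measured in seconds, the product of duration and packets per second is dimensionally a packet count, so the mean product can be read as an average number of packets per flow. Dividing a perturbed copy of that constant by the duration then yields a rate that falls as the duration grows. The sketch below illustrates this inverse relationship; the duration range and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility
mean_product = 9298.15           # mean of duration * packets-per-second printed above

duration = rng.uniform(2.0, 45.0, size=5)            # hypothetical flow durations
delta = rng.uniform(0.10, 0.20, size=5) * mean_product
sign = rng.choice([-1, 1], size=5)

# packets per second = (perturbed packet count) / duration
pps = (mean_product + sign * delta) / duration

print(np.column_stack([duration, pps]))
```

A linear factor from `lstsq` would instead make the rate grow with the duration, which is the opposite of the behaviour described earlier.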

As before, after finding the scaling factors we add some randomness and generate the data.

In [45]:
# for each row, perturb each factor by a random delta and generate the dependent values
for index, row in dos_dataset.iterrows():
    for col, factor in scaling_factors: # iterate over all columns that need values, except 'Flow Duration'
        if col == 'Packets Per Second':
            delta = random.uniform(duration_to_packets_corr * 0.1, duration_to_packets_corr * 0.2) 
            updated_factor = duration_to_packets_corr + random.choice([-1, 1]) * delta
            dos_dataset.loc[index, col] = updated_factor / row['Flow Duration']
        else:
            if col == 'IAT Std' or col == 'IAT Max':
                delta = random.uniform(factor * 0.7, factor * 0.99)
                updated_factor = factor + random.choices([-1, 1], weights=[2, 1], k=1)[0] * delta  
            else:
                delta = random.uniform(factor * 0.1, factor * 0.2) 
                updated_factor = factor + random.choice([-1, 1]) * delta
            dos_dataset.loc[index, col] = row['Flow Duration'] * updated_factor

### Correlation between 'Average Packet Length' and all of the following:<br>'Packet Length Std', 'Packet Length Variance', 'Total Length of Fwd Packet', 'Fwd Packet Length Mean', 'Fwd Packet Length Std', 'Fwd Segment Size Avg':

In [46]:
# finding the correlation between the 'Average Packet Length' column to the rest of the columns in order to create new data
first_correlation = ['Average Packet Length', 'Packet Length Std', 'Packet Length Variance', 'Total Length of Fwd Packet', 
                    'Fwd Packet Length Mean', 'Fwd Packet Length Std', 'Fwd Segment Size Avg']
independent_col = dos_samples[first_correlation[0]].values.reshape(-1, 1) #column 'Average Packet Length'
dependent_cols = dos_samples[first_correlation[1:]].values 

# using least squares regression to find scaling factors that best approximate the relationship between 'Average Packet Length' and the rest of the columns in first_correlation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name,factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('Packet Length Std', np.float64(1.1713685544035226))
('Packet Length Variance', np.float64(173.05703938204599))
('Total Length of Fwd Packet', np.float64(6807.072826019522))
('Fwd Packet Length Mean', np.float64(0.7297631883951156))
('Fwd Packet Length Std', np.float64(1.1713685544035226))
('Fwd Segment Size Avg', np.float64(0.4625773838748729))


As before, after finding the scaling factors we add some randomness and generate the data.

In [47]:
dos_dataset['Average Packet Length'] = np.random.uniform(min_max_dict['Average Packet Length'][0]*0.85, min_max_dict['Average Packet Length'][1]*1.15, NUM_OF_ROWS)

for index, row in dos_dataset.iterrows():
    for col, factor in scaling_factors: # iterate over all columns that need values, except 'Average Packet Length'
        delta = random.uniform(factor * 0.1, factor * 0.2) 
        updated_factor = factor + random.choice([-1, 1]) * delta
        dos_dataset.loc[index, col] = row['Average Packet Length'] * updated_factor

### Then we insert data into columns that are independent of each other:

In [48]:
independant_columns = ['Packet Length Min', 'Packet Length Max', 'Fwd Packet Length Max', 'Fwd Packet Length Min']

for col in independant_columns:
    dos_dataset[col] = (np.random.uniform(min_max_dict[col][0]*0.85, min_max_dict[col][1]*1.15, NUM_OF_ROWS)).astype(int)

dos_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,105.587820,48,811,107.975975,21606.250764,8.379445e+05,784,66.001649,17,100.599824,0,0,0,0,56.401964,0,0.0,2824,9100.0,0,44.073579,240.713811,6.278590,0.005171,0.099604
1,0.0,147.823620,80,1075,140.517726,30009.173407,1.160707e+06,548,87.131647,42,204.588507,0,0,0,0,56.056813,0,0.0,1976,6216.0,0,2.123064,5066.773119,3.415648,0.000270,0.004027
2,0.0,130.944127,49,656,173.492877,26738.347462,7.905741e+05,664,113.250776,14,123.690720,0,0,0,0,49.897789,0,0.0,2271,7282.0,0,13.394617,555.752803,20.155189,0.001698,0.009695
3,0.0,134.263527,77,1047,181.106997,19713.256816,7.914585e+05,640,85.905870,37,177.537367,0,0,0,0,54.010349,0,0.0,2631,8163.0,0,46.794556,161.257651,73.103100,0.005723,0.042695
4,0.0,162.985253,56,1113,215.697005,23358.248906,9.454286e+05,978,133.057202,35,212.639765,0,0,0,0,87.540970,0,0.0,2391,7157.0,0,35.143267,301.707569,6.869862,0.004151,0.572579
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,0.0,125.001640,62,729,128.084013,19432.881900,7.280644e+05,632,101.846778,31,130.631412,0,0,0,0,66.117524,0,0.0,1987,8512.0,761,5.054775,1571.320357,0.857303,0.000460,0.075367
24996,0.0,83.159171,73,1049,79.455610,11979.887229,5.078021e+05,589,67.222318,26,84.526155,0,0,0,0,34.024051,0,0.0,1421,4634.0,1466,19.113563,566.140852,1.984823,0.002245,0.303757
24997,0.0,131.912612,77,738,175.884625,18976.228713,7.576047e+05,617,108.940677,29,133.643911,0,0,0,0,69.268531,0,0.0,2202,6501.0,497,46.989745,219.764158,2.585987,0.004182,0.034811
24998,0.0,84.145230,41,671,79.493996,16270.217131,4.890558e+05,1077,68.685017,29,82.804102,0,0,0,0,43.771385,0,0.0,2507,10718.0,1867,8.311653,1286.288353,0.099015,0.001058,0.011450


In our sample dataset, the 'Subflow Fwd Bytes' column usually has values in a specific range, but is sometimes zero.<br>
To reproduce this, we generate a vector with a fixed mix of values: <br>
roughly 50% of the values fall within the usual range, and the other 50% are zero.

In [49]:
# generate a vector with random values based on min max dict, and also create a zero vector
col = 'Subflow Fwd Bytes'
subflow_values = dos_samples[dos_samples[col] != 0][col] 
min_max_dict[col] = (np.min(subflow_values), np.max(subflow_values))

rand_values = np.random.uniform(min_max_dict[col][0]*0.9, min_max_dict[col][1]*1.1, NUM_OF_ROWS)
zero_values = np.zeros(NUM_OF_ROWS)

# choose values randomly (50% from rand_values, 50% from zero_values)
dos_dataset[col] = np.where(np.random.rand(NUM_OF_ROWS) > 0.5, rand_values, zero_values)
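
Because the mask is drawn independently for each row, the share of zeros is only approximately 50%, not exactly half. A minimal sketch of the same `np.where` mixing, with a hypothetical value range and an arbitrary seed, shows how to measure that share:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility
n_rows = 25_000

# Hypothetical non-zero range, standing in for the widened sample range.
rand_values = rng.uniform(400_000, 1_000_000, n_rows)

# Each row independently keeps its random value or becomes zero.
mixed = np.where(rng.random(n_rows) > 0.5, rand_values, 0.0)

zero_share = np.mean(mixed == 0)
print(f'share of zero rows: {zero_share:.3f}')
```

This differs from the 'RST Flag Count' column, where exactly half of the rows are set to zero by position.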

---

## Adding the Label and Number of Ports columns:

In [50]:
# adding number of ports and a label to the dataset
dos_dataset['Number of Ports'] = np.full(shape = NUM_OF_ROWS, fill_value = 1, dtype = int)
dos_dataset['Label'] = ATTACK_NAME

## Validate that the generated data looks valid by comparing the samples with the generated dataset:

In [51]:
dos_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std,Label
0,1,105.587820,48,811,107.975975,21606.250764,8.379445e+05,784,66.001649,17,100.599824,0,0,0,0,56.401964,0,0.000000,2824,9100.0,0,44.073579,240.713811,6.278590,0.005171,0.099604,DoS
1,1,147.823620,80,1075,140.517726,30009.173407,1.160707e+06,548,87.131647,42,204.588507,0,0,0,0,56.056813,0,0.000000,1976,6216.0,0,2.123064,5066.773119,3.415648,0.000270,0.004027,DoS
2,1,130.944127,49,656,173.492877,26738.347462,7.905741e+05,664,113.250776,14,123.690720,0,0,0,0,49.897789,0,0.000000,2271,7282.0,0,13.394617,555.752803,20.155189,0.001698,0.009695,DoS
3,1,134.263527,77,1047,181.106997,19713.256816,7.914585e+05,640,85.905870,37,177.537367,0,0,0,0,54.010349,0,0.000000,2631,8163.0,0,46.794556,161.257651,73.103100,0.005723,0.042695,DoS
4,1,162.985253,56,1113,215.697005,23358.248906,9.454286e+05,978,133.057202,35,212.639765,0,0,0,0,87.540970,0,0.000000,2391,7157.0,0,35.143267,301.707569,6.869862,0.004151,0.572579,DoS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,1,125.001640,62,729,128.084013,19432.881900,7.280644e+05,632,101.846778,31,130.631412,0,0,0,0,66.117524,0,721327.942429,1987,8512.0,761,5.054775,1571.320357,0.857303,0.000460,0.075367,DoS
24996,1,83.159171,73,1049,79.455610,11979.887229,5.078021e+05,589,67.222318,26,84.526155,0,0,0,0,34.024051,0,0.000000,1421,4634.0,1466,19.113563,566.140852,1.984823,0.002245,0.303757,DoS
24997,1,131.912612,77,738,175.884625,18976.228713,7.576047e+05,617,108.940677,29,133.643911,0,0,0,0,69.268531,0,0.000000,2202,6501.0,497,46.989745,219.764158,2.585987,0.004182,0.034811,DoS
24998,1,84.145230,41,671,79.493996,16270.217131,4.890558e+05,1077,68.685017,29,82.804102,0,0,0,0,43.771385,0,745111.280215,2507,10718.0,1867,8.311653,1286.288353,0.099015,0.001058,0.011450,DoS


In [52]:
dos_samples.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0
mean,1.0,125.445591,58.2,823.15,146.978334,21656.714941,851842.1,789.15,91.445591,24.2,146.978334,0.0,0.0,0.0,0.0,57.8405,0.0,670594.0,1954.85,7279.05,929.25,9.298159,1695.083908,6.375138,0.001,0.066081
std,0.0,6.98969,5.872326,22.31892,7.545255,2210.140943,115763.5,22.31892,6.98969,5.872326,7.545255,0.0,0.0,0.0,0.0,6.783959,0.0,357323.929619,295.35942,768.275819,942.848198,9.726431,984.25318,9.52269,0.001032,0.098056
min,1.0,111.840541,54.0,783.0,132.952415,17676.344618,508368.0,749.0,77.840541,20.0,132.952415,0.0,0.0,0.0,0.0,44.653153,0.0,0.0,1525.0,4340.0,0.0,2.492676,232.498325,0.038326,0.000259,0.000933
25%,1.0,120.193743,54.0,807.5,140.78931,19822.110164,805160.0,773.5,86.193743,20.0,140.78931,0.0,0.0,0.0,0.0,52.323621,0.0,695559.0,1657.75,7143.5,0.0,4.851513,978.750865,1.935547,0.000501,0.021773
50%,1.0,126.046176,54.0,824.0,147.935015,21886.872579,870507.0,790.0,92.046176,20.0,147.935015,0.0,0.0,0.0,0.0,58.518528,0.0,819837.0,1969.5,7492.5,762.5,5.556939,1644.842016,3.026343,0.000609,0.031337
75%,1.0,130.877932,66.0,841.75,152.393116,23224.359169,935989.8,807.75,96.877932,32.0,152.393116,0.0,0.0,0.0,0.0,63.115272,0.0,894448.25,2184.25,7684.25,1729.5,7.992698,2018.125175,4.92631,0.00103,0.05056
max,1.0,136.009282,66.0,859.0,158.072184,24986.815414,1000099.0,825.0,102.009282,32.0,158.072184,0.0,0.0,0.0,0.0,68.305488,0.0,958513.0,2484.0,8005.0,2411.0,39.54007,3867.986356,36.608864,0.004302,0.381823


In [53]:
dos_dataset.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0
mean,1.0,130.398256,62.66484,852.00644,152.663756,22559.802841,888177.5,814.57688,94.906606,27.82472,152.770317,0.0,0.0,0.0,0.0,60.351966,0.0,376034.9,2075.04708,7595.02,622.97004,24.906375,651.054819,14.958577,0.002658,0.152843
std,0.0,28.555713,13.904733,164.416183,41.176453,6056.318253,239845.5,159.164593,25.621168,8.066362,41.012698,0.0,0.0,0.0,0.0,16.295715,0.0,396766.8,448.518443,2027.992385,786.324974,13.214827,738.757669,20.4359,0.001483,0.210267
min,1.0,80.807713,39.0,565.0,75.853676,11258.426371,441588.4,541.0,47.242603,14.0,76.067212,0.0,0.0,0.0,0.0,30.008169,0.0,0.0,1296.0,3808.0,0.0,1.907317,156.195443,0.020891,0.000164,0.000191
25%,1.0,105.648327,51.0,711.0,120.867673,17860.268965,701532.8,677.0,75.044706,21.0,121.155991,0.0,0.0,0.0,0.0,47.809957,0.0,0.0,1691.0,6026.75,0.0,13.533222,255.700246,1.714729,0.001404,0.017521
50%,1.0,130.466042,63.0,851.0,149.397966,22080.197709,867326.2,814.0,92.8379,28.0,149.0907,0.0,0.0,0.0,0.0,58.893287,0.0,0.0,2070.0,7405.0,34.5,24.96066,372.765744,4.675008,0.002608,0.047703
75%,1.0,155.108374,75.0,995.0,178.968536,26399.353861,1043508.0,951.0,111.091176,35.0,178.842782,0.0,0.0,0.0,0.0,70.743835,0.0,755598.8,2461.0,8884.25,1251.0,36.285422,685.443756,21.060305,0.003789,0.208817
max,1.0,179.857787,87.0,1136.0,252.460457,37286.377047,1466228.0,1091.0,157.073172,42.0,252.096894,0.0,0.0,0.0,0.0,99.826826,0.0,1054355.0,2855.0,12522.0,2411.0,47.744572,5741.975441,78.453382,0.006092,0.814948
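
Comparing the two `describe()` tables by eye is error-prone, so a small helper can place the sample and generated ranges side by side. This is a sketch only: `range_report` and the toy frames are hypothetical names introduced here, and the same call could be applied to `dos_samples` and `dos_dataset`.

```python
import pandas as pd

def range_report(samples, generated):
    """Compare per-column min/max of two frames over their shared columns."""
    cols = [c for c in samples.columns if c in generated.columns]
    report = pd.DataFrame({
        'sample_min': samples[cols].min(),
        'generated_min': generated[cols].min(),
        'sample_max': samples[cols].max(),
        'generated_max': generated[cols].max(),
    })
    # True when the generated range fully contains the sample range.
    report['covers_samples'] = (
        (report['generated_min'] <= report['sample_min'])
        & (report['generated_max'] >= report['sample_max'])
    )
    return report

# Toy frames standing in for the sample and generated datasets.
toy_samples = pd.DataFrame({'x': [2.0, 3.0], 'y': [10.0, 20.0]})
toy_generated = pd.DataFrame({'x': [1.5, 3.5], 'y': [12.0, 25.0]})

report = range_report(toy_samples, toy_generated)
print(report)
```

Columns where `covers_samples` is False, or where the generated range is much wider than the sample range, are the ones worth inspecting first.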


---

## Finally, save the dataset as a CSV file

In [54]:
print(f'Attack Dataset Shape: {dos_dataset.shape}')

Attack Dataset Shape: (25000, 27)


In [55]:
# save the dataset
dos_dataset.to_csv('dos_goldeneye_dataset.csv', index=False)
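
Passing `index=False` keeps the row index out of the file, so reloading it does not produce an extra `Unnamed: 0` column. The round trip can be checked as sketched below on a toy frame written to a temporary directory; the file name `toy_dataset.csv` is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the generated dataset.
frame = pd.DataFrame({'SYN Flag Count': [2824, 1976], 'Label': ['DoS', 'DoS']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_dataset.csv')
    frame.to_csv(path, index=False)   # write without the index column
    reloaded = pd.read_csv(path)      # read it back for comparison

print(reloaded.shape == frame.shape)
print(list(reloaded.columns))
```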

---