# Prepare DoS Hping Attack Dataset

## Overview:

This notebook will focus on creating a DoS Hping attack dataset based on a small sample of data collected by performing real DoS TCP SYN Flood attacks in a controlled environment.<br>
The dataset that this notebook creates closely represents real-world data and was used to train our SVM model.<br>  
There are multiple sample datasets because we performed the attack in a few different ways, and in each way, the data is slightly different.<br>
That is why we split the original sample dataset into multiple samples, ensuring that the attack dataset we generate matches the real-world data as closely as possible.<br>  
It is worth noting that the sample dataset we collected contains no missing values or outliers, because we tested each part of the collection process and verified that it is correct.<br>
In this notebook we generate an attack dataset with 25,000 flows of the DoS Hping attack, based on the samples we collected when running DoS TCP SYN Flood attacks in various configurations using the well-known hping3 tool.<br> 

## Imports & Global Variables:

In [2]:
import pandas as pd
import numpy as np
import random

NUM_OF_ROWS = 17000
ATTACK_NAME = 'DoS'

In [3]:
# the following command will make it so that when we print the dataframe we will see all the columns
pd.set_option('display.max_columns', None)

---

## Load the first sample dataset:

In [4]:
# import the attack sample dataset
dos_samples = pd.read_csv('dos_hping_samples_1.csv')
print(f'Dataset Shape: {dos_samples.shape}')
dos_samples

Dataset Shape: (19, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1,60.0,60,60,0.0,0.0,258180,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,258180.0,9930,0,0,19.600603,506.617064,19.450618,0.001974,0.19519
1,1,60.0,60,60,0.0,0.0,260000,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,86666.666667,10000,0,0,4.503996,2220.250574,1.634673,0.00045,0.025437
2,1,60.0,60,60,0.0,0.0,259766,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,86588.666667,9991,0,0,7.420343,1346.43373,4.149745,0.000743,0.047032
3,1,60.0,60,60,0.0,0.0,259974,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,259974.0,9999,0,0,2.54017,3936.350788,1.439891,0.000254,0.017419
4,1,60.0,60,60,0.0,0.0,259506,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,86502.0,9981,0,0,9.664079,1032.793715,6.416394,0.000968,0.067794
5,1,60.0,60,60,0.0,0.0,260000,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,260000.0,10000,0,0,1.838951,5437.88317,1.658607,0.000184,0.016591
6,1,60.0,60,60,0.0,0.0,260000,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,86666.666667,10000,0,0,4.842348,2065.113824,1.691813,0.000484,0.027245
7,1,60.0,60,60,0.0,0.0,259506,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,86502.0,9981,0,0,10.240651,974.645079,6.93523,0.001026,0.07283
8,1,60.0,60,60,0.0,0.0,260000,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,86666.666667,10000,0,0,4.972892,2010.902289,1.693308,0.000497,0.028051
9,1,60.0,60,60,0.0,0.0,257660,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,257660.0,9910,0,0,19.010815,521.282224,18.835732,0.001919,0.189211


### Find the columns that we need to synthesize data for:

In [5]:
columns_to_gather = dos_samples.replace(0, np.nan) #replace all 0 values with NaN
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() #drop the columns in which every value is NaN (i.e. the column was 0 in every sample)
columns_to_gather #left with the columns that have at least one non-zero value (we know that the data is consistent and has no missing values, because we verified the collection process)

['Number of Ports',
 'Average Packet Length',
 'Packet Length Min',
 'Packet Length Max',
 'Total Length of Fwd Packet',
 'Fwd Packet Length Max',
 'Fwd Packet Length Mean',
 'Fwd Packet Length Min',
 'Fwd Segment Size Avg',
 'Subflow Fwd Bytes',
 'SYN Flag Count',
 'Flow Duration',
 'Packets Per Second',
 'IAT Max',
 'IAT Mean',
 'IAT Std']
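
To make the column filter above concrete, here is a minimal sketch on a toy frame. The frame and its column names (`a`, `b`, `c`) are made up for illustration; only the `replace` / `dropna(how='all', axis=1)` chain is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): 'b' is zero in every row, 'c' is zero in only one row.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [0, 0, 0], 'c': [0, 5, 6]})

# Zeros become NaN, then only the columns that are NaN in *every* row are dropped.
non_zero_cols = toy.replace(0, np.nan).dropna(how='all', axis=1).columns.tolist()
print(non_zero_cols)  # ['a', 'c']
```

Column `c` survives because `how='all'` only drops a column when every value in it is missing, so a feature that is occasionally zero is still treated as one that needs generated values.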

### Find an approximate minimum and maximum values of each column:

In [6]:
# find the minimum and maximum values for each column, scale the range (reduce min by 15% and increase max by 10%), and store the results in a dictionary.
min_max_dict = {col: (dos_samples[col].min() * 0.85, dos_samples[col].max() * 1.1) for col in columns_to_gather}
min_max_dict['Number of Ports'] = (1, 1) #ensure that the 'Number of Ports' column always has the value '1'

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

Number of Ports                | Min: 1.00 | Max: 1.00
Average Packet Length          | Min: 51.00 | Max: 66.00
Packet Length Min              | Min: 51.00 | Max: 66.00
Packet Length Max              | Min: 51.00 | Max: 66.00
Total Length of Fwd Packet     | Min: 217950.20 | Max: 286000.00
Fwd Packet Length Max          | Min: 22.10 | Max: 28.60
Fwd Packet Length Mean         | Min: 22.10 | Max: 28.60
Fwd Packet Length Min          | Min: 22.10 | Max: 28.60
Fwd Segment Size Avg           | Min: 5.10 | Max: 6.60
Subflow Fwd Bytes              | Min: 73526.70 | Max: 286000.00
SYN Flag Count                 | Min: 8382.70 | Max: 11000.00
Flow Duration                  | Min: 1.05 | Max: 21.56
Packets Per Second             | Min: 430.62 | Max: 8933.12
IAT Max                        | Min: 0.91 | Max: 21.40
IAT Mean                       | Min: 0.00 | Max: 0.00
IAT Std                        | Min: 0.01 | Max: 0.21
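
The `IAT Mean` row prints `0.00` for both bounds only because of the two-decimal format, not because the range is empty. The sketch below assumes the minimum and maximum `IAT Mean` values of the first sample dataset (0.000123 and 0.001974) and scales them as in the cell above; the `:.3g` format is a suggestion, not part of the original notebook.

```python
# 'IAT Mean' bounds of the first sample dataset, scaled as in the notebook
# (min * 0.85, max * 1.1).
col = 'IAT Mean'
min_val, max_val = 0.000123 * 0.85, 0.001974 * 1.1

fixed = f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}'    # two decimals: both round to 0.00
general = f'{col:<30} | Min: {min_val:.3g} | Max: {max_val:.3g}'  # three significant digits
print(fixed)
print(general)
```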


### Create the base attack dataset (full of zeros):

In [7]:
# creating a zero-filled dataframe before adding values to it
dos_dataset = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(dos_samples.columns))), columns = dos_samples.columns)
dos_dataset.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Find the columns with constant zero values based on samples:

In [8]:
# adding zeros to all columns that should not have any values
zero_columns = [col for col in dos_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    dos_dataset[col] = int(0)
zero_columns

['Packet Length Std',
 'Packet Length Variance',
 'Fwd Packet Length Std',
 'Bwd Packet Length Max',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Min',
 'Bwd Packet Length Std',
 'Bwd Segment Size Avg',
 'ACK Flag Count',
 'RST Flag Count']

In [9]:
dos_dataset.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0


---

## Filling in values based on collected samples:

### Firstly we insert data into columns that have the exact same values:

For each one of the 'same_value' columns we insert the same vector, meaning that in each row of the attack dataset, these columns will have the same value.

In [10]:
same_value = ['Average Packet Length', 'Packet Length Min', 'Packet Length Max'] #based on collected samples
val = np.random.randint(min_max_dict[same_value[0]][0], min_max_dict[same_value[0]][1]*1.1, NUM_OF_ROWS)

for col in same_value:
    dos_dataset[col] = val

In [11]:
same_value2 = ['Fwd Packet Length Max', 'Fwd Packet Length Mean', 'Fwd Packet Length Min'] #based on collected samples
val2 = np.random.randint(min_max_dict[same_value2[0]][0], min_max_dict[same_value2[0]][1]*1.25, NUM_OF_ROWS)

for col in same_value2:
    dos_dataset[col] = val2

In [12]:
dos_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,58,58,58,0,0,0.0,33,33,33,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
1,0.0,62,62,62,0,0,0.0,25,25,25,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
2,0.0,68,68,68,0,0,0.0,33,33,33,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
3,0.0,59,59,59,0,0,0.0,33,33,33,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
4,0.0,61,61,61,0,0,0.0,26,26,26,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16995,0.0,67,67,67,0,0,0.0,32,32,32,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
16996,0.0,60,60,60,0,0,0.0,32,32,32,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
16997,0.0,68,68,68,0,0,0.0,29,29,29,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
16998,0.0,56,56,56,0,0,0.0,30,30,30,0,0,0,0,0,0.0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0


In [13]:
dos_dataset['Fwd Segment Size Avg'] = np.random.randint(min_max_dict['Fwd Segment Size Avg'][0]*0.9, min_max_dict['Fwd Segment Size Avg'][1]*1.5, NUM_OF_ROWS)
dos_dataset['Subflow Fwd Bytes'] = np.random.uniform(min_max_dict['Subflow Fwd Bytes'][0], min_max_dict['Subflow Fwd Bytes'][1], NUM_OF_ROWS)
dos_dataset['Number of Ports'] = np.full(shape = NUM_OF_ROWS, fill_value = 1, dtype = int)

Some columns, like 'SYN Flag Count', based on the collected samples, usually have values in a specific range, but sometimes they have values outside of the range.<br>
In order to generate accurate data, we generate a vector that will have a certain distribution of values. For example, in the 'SYN Flag Count' column, 90% of the values will be within the usual range,<br>
but the other 10% will have values that are anywhere between the minimal and maximal value for this column, meaning they will have values outside of the usual range as well.  
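
The mixing rule described above can be sketched in isolation as follows. This is an illustration under assumptions: it uses NumPy's `default_rng` generator with a fixed seed instead of the legacy `np.random` functions, and the bounds of the wide range are rounded for readability.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
n = 100_000

usual = rng.integers(8176, 10658, n)  # the "usual" range
wide = rng.integers(8382, 12100, n)   # the wider min/max range (bounds rounded for illustration)

# A uniform draw is above 0.1 about 90% of the time, so about 90% of rows take the usual value.
take_usual = rng.random(n) > 0.1
mixed = np.where(take_usual, usual, wide)

print(f'share drawn from the usual range: {take_usual.mean():.3f}')
```

Because the two ranges overlap, a value taken from the wide range can still fall inside the usual range, so the share of values outside the usual range is somewhat below 10%.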

In [14]:
rand_values = np.random.randint(min_max_dict['SYN Flag Count'][0], min_max_dict['SYN Flag Count'][1]*1.1, NUM_OF_ROWS)
usual_values = np.random.randint(8176, 10658, NUM_OF_ROWS)

# choose values randomly (10% from rand_values, 90% from usual_values)
chosen_values = np.where(np.random.rand(NUM_OF_ROWS) > 0.1, usual_values, rand_values) 

dos_dataset['SYN Flag Count'] = chosen_values 

In [15]:
rand_values = np.random.uniform(min_max_dict['Flow Duration'][0], min_max_dict['Flow Duration'][1], NUM_OF_ROWS)
usual_values = np.random.uniform(1.654, 45.175, NUM_OF_ROWS)

# choose values randomly (25% from rand_values, 75% from usual_values)
chosen_values = np.where(np.random.rand(NUM_OF_ROWS) > 0.25, usual_values, rand_values) 

dos_dataset['Flow Duration'] = chosen_values

In [16]:
dos_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1,58,58,58,0,0,0.0,33,33,33,0,0,0,0,0,8,0,234294.022213,10344,0,0,10.424518,0.0,0.0,0.0,0.0
1,1,62,62,62,0,0,0.0,25,25,25,0,0,0,0,0,6,0,193897.951510,8593,0,0,27.832099,0.0,0.0,0.0,0.0
2,1,68,68,68,0,0,0.0,33,33,33,0,0,0,0,0,5,0,112888.499713,9483,0,0,15.879211,0.0,0.0,0.0,0.0
3,1,59,59,59,0,0,0.0,33,33,33,0,0,0,0,0,7,0,81215.198003,9078,0,0,16.359363,0.0,0.0,0.0,0.0
4,1,61,61,61,0,0,0.0,26,26,26,0,0,0,0,0,8,0,166037.850835,8571,0,0,40.118965,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16995,1,67,67,67,0,0,0.0,32,32,32,0,0,0,0,0,8,0,183007.960123,8745,0,0,35.394276,0.0,0.0,0.0,0.0
16996,1,60,60,60,0,0,0.0,32,32,32,0,0,0,0,0,7,0,213194.853607,10186,0,0,4.487526,0.0,0.0,0.0,0.0
16997,1,68,68,68,0,0,0.0,29,29,29,0,0,0,0,0,5,0,92273.138047,9222,0,0,23.013778,0.0,0.0,0.0,0.0
16998,1,56,56,56,0,0,0.0,30,30,30,0,0,0,0,0,8,0,223794.507723,9848,0,0,6.116590,0.0,0.0,0.0,0.0


## Then we fill values into columns that have a certain correlation between them:

A correlation between two or more columns is common in our dataset since most features are inherently related. All of them are derived from network packet traffic.<br>
For example, as the **flow duration increases**, the **packets per second** is likely to decrease. This occurs because each flow has an upper limit on duration, after which data collection stops and a new flow begins.<br>  
Similarly, the **Inter-Arrival Time (IAT)** of packets within a flow is influenced by the flow duration. Given these dependencies, <br>
the attack dataset should generate data for these columns collectively, ensuring that their inherent correlations are maintained.
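
As a small worked check of this dependency, the sketch below multiplies `Flow Duration` by `Packets Per Second` for three rows copied from the first sample table. In each row the product lands on the flow's `SYN Flag Count`, so a longer flow must have a lower packet rate.

```python
# Three rows copied from the first sample table:
# (Flow Duration, Packets Per Second, SYN Flag Count)
rows = [
    (19.600603, 506.617064, 9930),
    (4.503996, 2220.250574, 10000),
    (2.54017, 3936.350788, 9999),
]

# duration * rate should recover the number of packets in the flow
products = [duration * pps for duration, pps, _ in rows]
for (duration, pps, syn), product in zip(rows, products):
    print(f'{duration:>10.6f} * {pps:>11.6f} = {product:10.2f}  (SYN Flag Count: {syn})')
```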

### Correlation between 'SYN Flag Count' and 'Total Length of Fwd Packet':

In [17]:
# finding the correlation between the 'SYN Flag Count' column to the rest of the columns in order to create new data
first_correlation = ['SYN Flag Count', 'Total Length of Fwd Packet']
independent_col = dos_samples[first_correlation[0]].values.reshape(-1, 1) #column 'SYN Flag Count'
dependent_cols = dos_samples[first_correlation[1]].values 

# using least squares regression to find scaling factors that best approximate the relationship between 'SYN Flag Count' and 'Total Length of Fwd Packet'
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond=None)[0]

scaling_factors = [(name, factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('Total Length of Fwd Packet', np.float64(26.000000000000014))
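
With a single independent column and no intercept term, the least squares fit reduces to one slope with the closed form `sum(x*y) / sum(x*x)`. The sketch below checks this on three (`SYN Flag Count`, `Total Length of Fwd Packet`) pairs copied from the first sample table; the variable names are illustrative.

```python
import numpy as np

# Three (SYN Flag Count, Total Length of Fwd Packet) pairs from the first sample table.
syn = np.array([9930.0, 10000.0, 9991.0])
total_len = np.array([258180.0, 260000.0, 259766.0])

# Least squares with one column and no intercept ...
slope_lstsq = np.linalg.lstsq(syn.reshape(-1, 1), total_len, rcond=None)[0][0]
# ... has the closed form sum(x*y) / sum(x*x).
slope_closed = (syn @ total_len) / (syn @ syn)

print(slope_lstsq, slope_closed)
```

Both values come out at 26 (up to floating point error), which is also the constant `Fwd Packet Length Mean` in the samples: every SYN packet contributes 26 bytes to the total forward length.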


After finding the scaling factors we can apply some randomness when generating values for the attack dataset in order to generate better data (without many duplications).<br>
We add randomness by creating a modified scaling factor, which introduces controlled variations in the generated values.<br>
This is done by selecting a small random delta (between 5% and 25% of the factor) and adding or subtracting it from the original scaling factor.<br>
As a result, the generated data maintains realistic correlations while avoiding exact duplicates.

In [18]:
# generate new data by scaling the original correlated column value using the updated factor.
for index, row in dos_dataset.iterrows():
    for col, factor in scaling_factors: #iterating over all generated scaling factors
        delta = random.uniform(factor * 0.05, factor * 0.25) 
        updated_factor = factor + random.choice([-1, 1]) * delta
        dos_dataset.loc[index, col] = int(row['SYN Flag Count'] * updated_factor) 
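
The loop above writes one cell at a time, which can be slow for thousands of rows. Below is a possible vectorized sketch of the same idea, not the notebook's own code. It assumes NumPy's `default_rng` generator instead of the `random` module, so it will not reproduce the exact values of the loop, and it uses a small made-up array in place of `dos_dataset['SYN Flag Count']`.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: NumPy generator instead of the `random` module

# Small stand-in for dos_dataset['SYN Flag Count'] (values for illustration only).
syn = np.array([10344, 8593, 9483, 9078])
factor = 26.0  # scaling factor found by the regression above

# One relative delta per row in [5%, 25%], applied with a random sign.
rel_delta = rng.uniform(0.05, 0.25, len(syn))
sign = rng.choice([-1, 1], len(syn))
updated_factor = factor * (1 + sign * rel_delta)

# Truncate towards zero, like int() in the loop.
total_len = np.trunc(syn * updated_factor)
print(total_len)
```

Each row still receives its own randomized factor, so the result follows the same distribution as the loop while avoiding per-row `.loc` writes.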

In [19]:
dos_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1,58,58,58,0,0,327151.0,33,33,33,0,0,0,0,0,8,0,234294.022213,10344,0,0,10.424518,0.0,0.0,0.0,0.0
1,1,62,62,62,0,0,186193.0,25,25,25,0,0,0,0,0,6,0,193897.951510,8593,0,0,27.832099,0.0,0.0,0.0,0.0
2,1,68,68,68,0,0,186152.0,33,33,33,0,0,0,0,0,5,0,112888.499713,9483,0,0,15.879211,0.0,0.0,0.0,0.0
3,1,59,59,59,0,0,179660.0,33,33,33,0,0,0,0,0,7,0,81215.198003,9078,0,0,16.359363,0.0,0.0,0.0,0.0
4,1,61,61,61,0,0,240290.0,26,26,26,0,0,0,0,0,8,0,166037.850835,8571,0,0,40.118965,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16995,1,67,67,67,0,0,183245.0,32,32,32,0,0,0,0,0,8,0,183007.960123,8745,0,0,35.394276,0.0,0.0,0.0,0.0
16996,1,60,60,60,0,0,292307.0,32,32,32,0,0,0,0,0,7,0,213194.853607,10186,0,0,4.487526,0.0,0.0,0.0,0.0
16997,1,68,68,68,0,0,205101.0,29,29,29,0,0,0,0,0,5,0,92273.138047,9222,0,0,23.013778,0.0,0.0,0.0,0.0
16998,1,56,56,56,0,0,295588.0,30,30,30,0,0,0,0,0,8,0,223794.507723,9848,0,0,6.116590,0.0,0.0,0.0,0.0


### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std':

In [20]:
# finding the correlation between the 'Flow Duration' column to the rest of the columns in order to create new data
second_correlation = ['Flow Duration', 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std']
independent_col = dos_samples[second_correlation[0]].values.reshape(-1, 1) #column 'Flow Duration'
dependent_cols = dos_samples[second_correlation[1:]].values 

# using least squares regression to find scaling factors that best approximate the relationship between 'Flow Duration' and the rest of the columns in second_correlation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(second_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

# multiply flow duration by packets per second for each sample and average the products; the result is roughly the number of packets per flow rather than a statistical correlation
duration_to_packets_corr = [x * y for x, y in zip(dos_samples['Flow Duration'].values, dos_samples['Packets Per Second'].values)]
duration_to_packets_corr = np.mean(duration_to_packets_corr)
duration_to_packets_corr

('Packets Per Second', np.float64(87.84098187225318))
('IAT Max', np.float64(0.9069689451257357))
('IAT Mean', np.float64(0.00010065380228822136))
('IAT Std', np.float64(0.00925615690679181))


np.float64(9971.421052631578)

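As before, after finding the scaling factors we add some randomness and generate the data. 'Packets Per Second' is the exception: it is generated by dividing a randomized version of the average computed above by the flow duration, so it decreases as the flow duration grows.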

In [21]:
# for every row, adjust each scaling factor by a small random delta and fill in the dependent columns
for index, row in dos_dataset.iterrows():
    for col, factor in scaling_factors: #iterating over all columns we need to add values to except 'Flow Duration'
        if col == 'Packets Per Second':
            delta = random.uniform(duration_to_packets_corr * 0.1, duration_to_packets_corr * 0.15)
            updatedFactor = duration_to_packets_corr + random.choice([-1, 1]) * delta
            dos_dataset.loc[index, col] = updatedFactor / row['Flow Duration']
        else:
            if col == 'IAT Std':
                delta = random.uniform(factor * 0.1, factor * 0.35)
                updatedFactor = factor + random.choice([-1, 1]) * delta  
            elif col == 'IAT Max':
                delta = random.uniform(factor * 0.1, factor * 0.225)
                updatedFactor = factor + random.choice([-1, 1]) * delta  
            else:
                delta = random.uniform(factor * 0.05, factor * 0.2)
                updatedFactor = factor + random.choice([-1, 1]) * delta
            dos_dataset.loc[index, col] = row['Flow Duration'] * updatedFactor

In [22]:
dos_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1,58,58,58,0,0,327151.0,33,33,33,0,0,0,0,0,8,0,234294.022213,10344,0,0,10.424518,1054.957292,7.677853,0.001212,0.074474
1,1,62,62,62,0,0,186193.0,25,25,25,0,0,0,0,0,6,0,193897.951510,8593,0,0,27.832099,411.421058,28.609782,0.002354,0.173096
2,1,68,68,68,0,0,186152.0,33,33,33,0,0,0,0,0,5,0,112888.499713,9483,0,0,15.879211,555.459122,16.042706,0.001718,0.103454
3,1,59,59,59,0,0,179660.0,33,33,33,0,0,0,0,0,7,0,81215.198003,9078,0,0,16.359363,539.535454,13.295555,0.001421,0.181407
4,1,61,61,61,0,0,240290.0,26,26,26,0,0,0,0,0,8,0,166037.850835,8571,0,0,40.118965,285.117211,40.063877,0.003821,0.479968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16995,1,67,67,67,0,0,183245.0,32,32,32,0,0,0,0,0,8,0,183007.960123,8745,0,0,35.394276,246.663212,25.909351,0.003229,0.420561
16996,1,60,60,60,0,0,292307.0,32,32,32,0,0,0,0,0,7,0,213194.853607,10186,0,0,4.487526,1991.671654,3.348309,0.000425,0.028184
16997,1,68,68,68,0,0,205101.0,29,29,29,0,0,0,0,0,5,0,92273.138047,9222,0,0,23.013778,384.273366,17.788977,0.001916,0.247649
16998,1,56,56,56,0,0,295588.0,30,30,30,0,0,0,0,0,8,0,223794.507723,9848,0,0,6.116590,1873.796805,6.323783,0.000514,0.039218


## Adding the Label column:

In [23]:
# adding a label to the dataset
dos_dataset['Label'] = ATTACK_NAME

## Validate the generated data by comparing the samples with the generated dataset:

In [24]:
dos_samples.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0
mean,1.0,60.0,60.0,60.0,0.0,0.0,259256.947368,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,181785.385965,9971.421053,0.0,0.0,8.917401,2037.65914,7.455617,0.000896,0.077946
std,0.0,0.0,0.0,0.0,0.0,0.0,981.375762,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,84390.321689,37.745222,0.0,0.0,5.990442,1963.164277,6.564728,0.000605,0.063584
min,1.0,60.0,60.0,60.0,0.0,0.0,256412.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,86502.0,9862.0,0.0,0.0,1.230634,506.617064,1.065089,0.000123,0.010654
25%,1.0,60.0,60.0,60.0,0.0,0.0,258999.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,86666.666667,9961.5,0.0,0.0,4.613889,809.208563,1.692561,0.000461,0.026082
50%,1.0,60.0,60.0,60.0,0.0,0.0,259506.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,256412.0,9981.0,0.0,0.0,7.615424,1309.973037,6.416394,0.000763,0.067794
75%,1.0,60.0,60.0,60.0,0.0,0.0,260000.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,259207.0,10000.0,0.0,0.0,12.838119,2168.599449,12.251514,0.00129,0.123172
max,1.0,60.0,60.0,60.0,0.0,0.0,260000.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,260000.0,10000.0,0.0,0.0,19.600603,8121.017468,19.450618,0.001974,0.19519


In [25]:
dos_dataset.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,1.0,60.926882,60.926882,60.926882,0.0,0.0,246712.022765,28.028765,28.028765,28.028765,0.0,0.0,0.0,0.0,0.0,6.012706,0.0,180196.885658,9482.777294,0.0,0.0,20.338936,945.009478,18.461437,0.002045,0.188792
std,0.0,6.048105,6.048105,6.048105,0.0,0.0,44952.794277,3.745939,3.745939,3.745939,0.0,0.0,0.0,0.0,0.0,1.413532,0.0,60766.348445,793.604382,0.0,0.0,12.473129,1131.598475,11.888971,0.001294,0.12683
min,1.0,51.0,51.0,51.0,0.0,0.0,160139.0,22.0,22.0,22.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,73531.08256,8176.0,0.0,0.0,1.054009,188.794811,0.759046,8.9e-05,0.006554
25%,1.0,56.0,56.0,56.0,0.0,0.0,207771.75,25.0,25.0,25.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,127887.081167,8827.0,0.0,0.0,9.751034,325.557319,8.669683,0.000967,0.085793
50%,1.0,61.0,61.0,61.0,0.0,0.0,244502.5,28.0,28.0,28.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,179968.119499,9449.0,0.0,0.0,18.400341,543.178439,16.349765,0.001815,0.164394
75%,1.0,66.0,66.0,66.0,0.0,0.0,282886.75,31.0,31.0,31.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,232857.466319,10092.0,0.0,0.0,30.757176,1027.687443,26.983188,0.00304,0.267996
max,1.0,71.0,71.0,71.0,0.0,0.0,391073.0,34.0,34.0,34.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,285990.772622,12098.0,0.0,0.0,45.172364,10731.937012,50.048639,0.005454,0.561301
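
Comparing two `describe()` tables by eye is error-prone, so a programmatic range check can help. The helper below, `range_report`, is hypothetical and not part of the original notebook; it reports, per shared column, whether the generated range contains the observed sample range. The toy frames use made-up numbers only to show the call.

```python
import pandas as pd

def range_report(samples: pd.DataFrame, generated: pd.DataFrame) -> pd.DataFrame:
    """Side-by-side min/max of the columns shared by both frames (hypothetical helper)."""
    cols = [c for c in samples.columns if c in generated.columns]
    report = pd.DataFrame({
        'sample_min': samples[cols].min(),
        'generated_min': generated[cols].min(),
        'sample_max': samples[cols].max(),
        'generated_max': generated[cols].max(),
    })
    # True when the generated range fully contains the observed sample range.
    report['covers_samples'] = (
        (report['generated_min'] <= report['sample_min'])
        & (report['generated_max'] >= report['sample_max'])
    )
    return report

# Toy frames with made-up numbers, only to show the call.
toy_samples = pd.DataFrame({'Flow Duration': [1.2, 19.6], 'SYN Flag Count': [9862, 10000]})
toy_generated = pd.DataFrame({'Flow Duration': [1.0, 45.2], 'SYN Flag Count': [9900, 12098]})
report = range_report(toy_samples, toy_generated)
print(report)
```

In the notebook this could be called as `range_report(dos_samples, dos_dataset.drop(columns='Label'))`, assuming both frames are still in memory.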


---

## Load the second sample dataset:

The following code creates another attack dataset, this time based on a different sample dataset.<br> 
The code in this section is mostly the same as the code up to this point in the notebook, therefore we will not repeat the same explanations here.<br>

In [26]:
NUM_OF_ROWS = 8000 #use fewer rows for the second dataset: samples like these are rare, so they should be less prominent in the final attack dataset

In [27]:
# import the attack sample dataset
dos_samples = pd.read_csv('dos_hping_samples_2.csv')
print(f'Dataset Shape: {dos_samples.shape}')
dos_samples

Dataset Shape: (8, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1,60.0,60,60,0.0,0.0,251472,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9672,0,0,0.113266,85391.915937,0.002451,1.2e-05,7.7e-05
1,1,60.0,60,60,0.0,0.0,259818,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9993,0,0,0.1256,79562.038842,0.013888,1.3e-05,0.000176
2,1,60.0,60,60,0.0,0.0,259558,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9983,0,0,0.122131,81740.025636,0.001796,1.2e-05,7e-05
3,1,60.0,60,60,0.0,0.0,250952,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9652,0,0,0.155204,62189.191252,0.002296,1.6e-05,8.3e-05
4,1,60.0,60,60,0.0,0.0,259844,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9994,0,0,0.150346,66473.316835,0.005427,1.5e-05,0.0001
5,1,60.0,60,60,0.0,0.0,258440,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9940,0,0,0.146744,67737.00547,0.001699,1.5e-05,8e-05
6,1,60.0,60,60,0.0,0.0,253734,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9759,0,0,0.133626,73032.202973,0.003685,1.4e-05,8.5e-05
7,1,60.0,60,60,0.0,0.0,259740,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9990,0,0,0.150142,66537.031827,0.013583,1.5e-05,0.000234


### Find an approximate minimum and maximum values of each column:

In [28]:
# find an approximate minimum and maximum values of each column and save that data into a dictionary
min_max_dict = {col: (dos_samples[col].min() * 0.85, dos_samples[col].max() * 1.1) for col in columns_to_gather}
min_max_dict['Number of Ports'] = (1, 1)

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

Number of Ports                | Min: 1.00 | Max: 1.00
Average Packet Length          | Min: 51.00 | Max: 66.00
Packet Length Min              | Min: 51.00 | Max: 66.00
Packet Length Max              | Min: 51.00 | Max: 66.00
Total Length of Fwd Packet     | Min: 213309.20 | Max: 285828.40
Fwd Packet Length Max          | Min: 22.10 | Max: 28.60
Fwd Packet Length Mean         | Min: 22.10 | Max: 28.60
Fwd Packet Length Min          | Min: 22.10 | Max: 28.60
Fwd Segment Size Avg           | Min: 5.10 | Max: 6.60
Subflow Fwd Bytes              | Min: 0.00 | Max: 0.00
SYN Flag Count                 | Min: 8204.20 | Max: 10993.40
Flow Duration                  | Min: 0.10 | Max: 0.17
Packets Per Second             | Min: 52860.81 | Max: 93931.11
IAT Max                        | Min: 0.00 | Max: 0.02
IAT Mean                       | Min: 0.00 | Max: 0.00
IAT Std                        | Min: 0.00 | Max: 0.00


### Create the base attack dataset (full of zeros):

In [29]:
# creating a zero-filled dataframe before adding values to it
dos_dataset2 = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(dos_samples.columns))), columns = dos_samples.columns)
dos_dataset2.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Find the columns with constant zero values based on samples:

In [30]:
# adding zeros to all columns that should not have any values
zero_columns = [col for col in dos_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    dos_dataset2[col] = int(0)
zero_columns

['Packet Length Std',
 'Packet Length Variance',
 'Fwd Packet Length Std',
 'Bwd Packet Length Max',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Min',
 'Bwd Packet Length Std',
 'Bwd Segment Size Avg',
 'ACK Flag Count',
 'RST Flag Count']

---

## Filling in values based on collected samples:

### Firstly we insert data into columns that have the exact same values:

In [31]:
same_value = ['Average Packet Length', 'Packet Length Min', 'Packet Length Max']
val = np.random.randint(min_max_dict[same_value[0]][0], min_max_dict[same_value[0]][1]*1.1, NUM_OF_ROWS)

for col in same_value:
    dos_dataset2[col] = val

In [32]:
same_value2 = ['Fwd Packet Length Max', 'Fwd Packet Length Mean', 'Fwd Packet Length Min']
val2 = np.random.randint(min_max_dict[same_value2[0]][0], min_max_dict[same_value2[0]][1]*1.25, NUM_OF_ROWS)

for col in same_value2:
    dos_dataset2[col] = val2
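
The two cells above draw one random array and assign it to several columns, so those columns stay identical within each row. A minimal sketch of the same idea follows, assuming the sample minimum and maximum are both 60. It uses NumPy's `default_rng` generator and casts the widened bounds to `int` explicitly, which is an assumption about the intended rounding. The helper `widened_int_range` and the frame `toy` are hypothetical names.

```python
import numpy as np
import pandas as pd

def widened_int_range(sample_min, sample_max, low_scale=1.0, high_scale=1.1):
    """Scale the sampled min/max and return integer bounds (high is exclusive)."""
    low = int(sample_min * low_scale)
    high = int(sample_max * high_scale)
    # keep the range non-empty even when the scaled bounds collapse
    return low, max(high, low + 1)

rng = np.random.default_rng(seed=0)
low, high = widened_int_range(60, 60, high_scale=1.1)

# one shared draw, reused for every column that must carry the same value
shared = rng.integers(low, high, size=5)
toy = pd.DataFrame(index=range(5))
for col in ['Average Packet Length', 'Packet Length Min', 'Packet Length Max']:
    toy[col] = shared
print(toy)
```

Because the array is drawn once and reused, the equality between the columns holds by construction and does not need to be enforced afterwards.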

### Then we insert data into columns that are independent of each other, based on the min max values:

In [33]:
dos_dataset2['Fwd Segment Size Avg'] = np.random.randint(min_max_dict['Fwd Segment Size Avg'][0]*0.9, min_max_dict['Fwd Segment Size Avg'][1]*1.5, NUM_OF_ROWS)
dos_dataset2['Flow Duration'] = np.random.uniform(min_max_dict['Flow Duration'][0]*0.95, min_max_dict['Flow Duration'][1]*1.05, NUM_OF_ROWS)
dos_dataset2['Number of Ports'] = np.full(shape = NUM_OF_ROWS, fill_value = 1, dtype = int)
dos_dataset2['Subflow Fwd Bytes'] = np.full(shape = NUM_OF_ROWS, fill_value = 0, dtype = int)
dos_dataset2['SYN Flag Count'] = np.random.randint(min_max_dict['SYN Flag Count'][0]*0.9, min_max_dict['SYN Flag Count'][1]*1.1, NUM_OF_ROWS)

## Then we fill values into columns that have a certain correlation between them:

### Correlation between 'SYN Flag Count' and 'Total Length of Fwd Packet':

In [34]:
# fitting the relationship between 'SYN Flag Count' and 'Total Length of Fwd Packet' in order to create new data
first_correlation = ['SYN Flag Count', 'Total Length of Fwd Packet']
independent_col = dos_samples[first_correlation[0]].values.reshape(-1, 1) #column 'SYN Flag Count'
dependent_cols = dos_samples[first_correlation[1]].values 

# using least squares regression to find scaling factors that best approximate the relationship between 'SYN Flag Count' and 'Total Length of Fwd Packet'
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name,factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)
    
# adding the rest of the attack feature values to the dataset at random based on the sample data
for index, row in dos_dataset2.iterrows():
    for col, factor in scaling_factors: #iterating over all columns we need to add values to except 'SYN Flag Count'
        delta = random.uniform(factor * 0.05, factor * 0.25) 
        updated_factor = factor + random.choice([-1, 1]) * delta
        dos_dataset2.loc[index, col] = int(row['SYN Flag Count'] * updated_factor)

('Total Length of Fwd Packet', np.float64(26.000000000000007))
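
With a single independent column and no intercept term, `np.linalg.lstsq` fits a line through the origin, and its slope has the closed form `sum(x * y) / sum(x * x)`. The sketch below checks this on four rows taken from the sample table, where 'Total Length of Fwd Packet' is exactly 26 times 'SYN Flag Count'. The variable names are illustrative only.

```python
import numpy as np

# 'SYN Flag Count' and 'Total Length of Fwd Packet' for four sample rows
syn = np.array([9672.0, 9993.0, 9983.0, 9652.0])
fwd_len = np.array([251472.0, 259818.0, 259558.0, 250952.0])

# least squares with a single column and no intercept: y ~ slope * x
slope_lstsq = np.linalg.lstsq(syn.reshape(-1, 1), fwd_len, rcond=None)[0][0]

# closed-form slope of a regression line through the origin
slope_closed = (syn @ fwd_len) / (syn @ syn)

print(slope_lstsq, slope_closed)
```

Both values come out at approximately 26, matching the forward packet length of 26 bytes seen in the samples.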


### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std':

In [35]:
# fitting the relationship between the 'Flow Duration' column and the rest of the columns in order to create new data
second_correlation = ['Flow Duration', 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std']
independent_col = dos_samples[second_correlation[0]].values.reshape(-1, 1) #column 'Flow Duration'
dependent_cols = dos_samples[second_correlation[1:]].values

# using least squares regression to find scaling factors that best approximate the relationship between 'Flow Duration' and the rest of the columns in second_correlation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name,factor) for name, factor in zip(second_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

# 'Flow Duration' * 'Packets Per Second' approximates the number of packets in a flow, so we average this product over the samples (a mean product, not a statistical correlation coefficient).
duration_to_packets_corr = [x * y for x, y in zip(dos_samples['Flow Duration'].values, dos_samples['Packets Per Second'].values)]
duration_to_packets_corr = np.mean(duration_to_packets_corr)
duration_to_packets_corr

('Packets Per Second', np.float64(519129.6051044216))
('IAT Max', np.float64(0.040715922209246726))
('IAT Mean', np.float64(0.00010129001644309197))
('IAT Std', np.float64(0.000825544773212792))


np.float64(9872.875)
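
The value above can be read as an average packet count per flow: multiplying a flow's duration by its packet rate gives the number of packets, which in these samples lines up with 'SYN Flag Count'. A short check on three rows from the sample table (variable names are illustrative):

```python
import numpy as np

# three rows from the sample table
duration = np.array([0.113266, 0.1256, 0.122131])              # 'Flow Duration'
pps = np.array([85391.915937, 79562.038842, 81740.025636])     # 'Packets Per Second'
syn = np.array([9672, 9993, 9983])                             # 'SYN Flag Count'

# duration * rate gives an estimated packet count per flow
packets_est = duration * pps
print(np.round(packets_est, 2))
print(syn)
```

The small differences come from the rounded values shown in the table. This is why the next cell divides the jittered product by 'Flow Duration' to obtain 'Packets Per Second'.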

In [36]:
# adding the rest of the attack feature values to the dataset at random based on the sample data
for index, row in dos_dataset2.iterrows():
    for col, factor in scaling_factors: #iterating over all columns we need to add values to except 'Flow Duration'
        if col == 'Packets Per Second':
            delta = random.uniform(duration_to_packets_corr*0.025, duration_to_packets_corr * 0.075)
            updated_factor = duration_to_packets_corr + random.choice([-1, 1]) * delta
            dos_dataset2.loc[index, col] = updated_factor / row['Flow Duration']
        else:
            if col == 'IAT Std':
                delta = random.uniform(factor * 0.1, factor * 0.35)
                updated_factor = factor + random.choice([-1, 1]) * delta  
            elif col == 'IAT Max':
                delta = random.uniform(factor * 0.15, factor * 0.7)
                updated_factor = factor + random.choice([-1, 1]) * delta  
            else:
                delta = random.uniform(factor * 0.05, factor * 0.2) 
                updated_factor = factor + random.choice([-1, 1]) * delta
            dos_dataset2.loc[index, col] = row['Flow Duration'] * updated_factor
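
The row-by-row loop above can be slow for thousands of rows. A vectorized sketch of the same jitter scheme is shown below for one column ('IAT Mean', with the 5% to 20% band). It is an alternative under stated assumptions, not the notebook's implementation: `jittered_factors` is a hypothetical helper, the flow durations are synthetic, and NumPy's generator replaces the `random` module, so the values drawn will differ.

```python
import numpy as np

def jittered_factors(factor, low_frac, high_frac, size, rng):
    """Return `size` copies of `factor`, each moved up or down by a random
    fraction between low_frac and high_frac of its value."""
    delta = rng.uniform(factor * low_frac, factor * high_frac, size)
    sign = rng.choice([-1, 1], size)
    return factor + sign * delta

rng = np.random.default_rng(seed=0)

# synthetic stand-in for dos_dataset2['Flow Duration']
flow_duration = rng.uniform(0.09, 0.18, 1000)

# scaling factor for 'IAT Mean' as printed by the regression cell
iat_mean_factor = 0.00010129001644309197
iat_mean = flow_duration * jittered_factors(iat_mean_factor, 0.05, 0.2,
                                            flow_duration.size, rng)
print(iat_mean[:5])
```

Each generated value is the flow duration times a factor that sits 5% to 20% above or below the fitted one, which matches the `else` branch of the loop.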

---

## Check that the generated data looks valid by comparing the samples with the generated dataset:

In [37]:
dos_samples

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1,60.0,60,60,0.0,0.0,251472,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9672,0,0,0.113266,85391.915937,0.002451,1.2e-05,7.7e-05
1,1,60.0,60,60,0.0,0.0,259818,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9993,0,0,0.1256,79562.038842,0.013888,1.3e-05,0.000176
2,1,60.0,60,60,0.0,0.0,259558,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9983,0,0,0.122131,81740.025636,0.001796,1.2e-05,7e-05
3,1,60.0,60,60,0.0,0.0,250952,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9652,0,0,0.155204,62189.191252,0.002296,1.6e-05,8.3e-05
4,1,60.0,60,60,0.0,0.0,259844,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9994,0,0,0.150346,66473.316835,0.005427,1.5e-05,0.0001
5,1,60.0,60,60,0.0,0.0,258440,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9940,0,0,0.146744,67737.00547,0.001699,1.5e-05,8e-05
6,1,60.0,60,60,0.0,0.0,253734,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9759,0,0,0.133626,73032.202973,0.003685,1.4e-05,8.5e-05
7,1,60.0,60,60,0.0,0.0,259740,26,26.0,26,0.0,0,0.0,0,0.0,6.0,0.0,0.0,9990,0,0,0.150142,66537.031827,0.013583,1.5e-05,0.000234


In [38]:
dos_dataset2.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0
mean,1.0,60.95575,60.95575,60.95575,0.0,0.0,252407.773,27.953875,27.953875,27.953875,0.0,0.0,0.0,0.0,0.0,6.02175,0.0,0.0,9728.046125,0.0,0.0,0.135653,75489.698247,0.005531,1.4e-05,0.000112
std,0.0,6.06967,6.06967,6.06967,0.0,0.0,54474.36855,3.737512,3.737512,3.737512,0.0,0.0,0.0,0.0,0.0,1.419517,0.0,0.0,1362.872186,0.0,0.0,0.025616,15450.254031,0.00275,3e-06,3.4e-05
min,1.0,51.0,51.0,51.0,0.0,0.0,144590.0,22.0,22.0,22.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,7383.0,0.0,0.0,0.091475,51086.121586,0.001156,7e-06,4.9e-05
25%,1.0,56.0,56.0,56.0,0.0,0.0,211655.0,25.0,25.0,25.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,8524.75,0.0,0.0,0.113101,62373.420717,0.003046,1.1e-05,8.5e-05
50%,1.0,61.0,61.0,61.0,0.0,0.0,246069.5,28.0,28.0,28.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,9746.0,0.0,0.0,0.135807,72463.122991,0.005226,1.4e-05,0.000106
75%,1.0,66.0,66.0,66.0,0.0,0.0,290264.75,31.0,31.0,31.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,10927.0,0.0,0.0,0.158056,87132.789548,0.00779,1.6e-05,0.000137
max,1.0,71.0,71.0,71.0,0.0,0.0,391192.0,34.0,34.0,34.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,12090.0,0.0,0.0,0.179241,115417.280236,0.012365,2.2e-05,0.000199
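
The comparison above is done by eye. A programmatic version could flag columns whose generated range drifts too far from the sampled range. The sketch below is one possible check: `out_of_range_columns` and its `tolerance` parameter are hypothetical, and the two frames are small made-up stand-ins for `dos_samples` and `dos_dataset2`.

```python
import pandas as pd

def out_of_range_columns(samples, generated, tolerance=0.5):
    """List columns whose generated min/max fall outside the sampled range,
    widened on each side by `tolerance` times the largest absolute sample value."""
    flagged = []
    for col in samples.columns:
        lo, hi = samples[col].min(), samples[col].max()
        span = max(abs(lo), abs(hi)) * tolerance
        if generated[col].min() < lo - span or generated[col].max() > hi + span:
            flagged.append(col)
    return flagged

# made-up stand-ins: the second 'Flow Duration' value is deliberately too large
toy_samples = pd.DataFrame({'SYN Flag Count': [9652, 9994],
                            'Flow Duration': [0.113, 0.155]})
toy_generated = pd.DataFrame({'SYN Flag Count': [7383, 12090],
                              'Flow Duration': [0.09, 0.9]})

flagged = out_of_range_columns(toy_samples, toy_generated)
print(flagged)
```

An empty list would mean every generated column stays within the widened sample range. The right tolerance depends on how much spread was added on purpose during generation.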


## Adding the Label column:

In [39]:
dos_dataset2['Label'] = ATTACK_NAME

---

## Finally, we merge both attack datasets into one final DoS Hping attack dataset and save it as a CSV file:

In [40]:
# merging and shuffling the rows in the final dataset of the DoS Hping attack
merged_dos_dataset = pd.concat([dos_dataset, dos_dataset2], axis=0)
merged_dos_dataset = merged_dos_dataset.sample(frac=1, random_state=42).reset_index(drop=True)
print(f'Attack Dataset Shape: {merged_dos_dataset.shape}')

Attack Dataset Shape: (25000, 27)
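
A few cheap checks after the merge can confirm that no rows were lost or duplicated and that the index is clean. The sketch below runs the same `concat`, `sample` and `reset_index` steps on two tiny made-up frames (`part_a`, `part_b`); the assertions are suggestions and are not part of the notebook.

```python
import pandas as pd

# tiny made-up stand-ins for dos_dataset and dos_dataset2
part_a = pd.DataFrame({'SYN Flag Count': [1, 2, 3], 'Label': 'DoS'})
part_b = pd.DataFrame({'SYN Flag Count': [4, 5], 'Label': 'DoS'})

merged = pd.concat([part_a, part_b], axis=0)
shuffled = merged.sample(frac=1, random_state=42).reset_index(drop=True)

# the row count is the sum of both parts
assert len(shuffled) == len(part_a) + len(part_b)
# shuffling only reorders rows, so the set of values is unchanged
assert sorted(shuffled['SYN Flag Count']) == [1, 2, 3, 4, 5]
# the index runs 0..n-1 after reset_index(drop=True)
assert shuffled.index.tolist() == list(range(len(shuffled)))
# every row carries the attack label
assert (shuffled['Label'] == 'DoS').all()
print(shuffled.shape)
```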


In [41]:
# save the dataset
merged_dos_dataset.to_csv('dos_hping_dataset.csv', index=False)

---