# Prepare Port Scanning Open Port Attack Dataset

## Overview:

This notebook will focus on creating a Port Scanning open port attack dataset based on a small sample of data collected by performing real Port Scanning open port attacks in a controlled environment.<br>
The dataset that this notebook creates closely represents real-world data and was used to train our SVM model.<br>  
There are multiple sample datasets because we performed the attack in a few different ways, and in each way, the data is slightly different.<br>
That is why we split the original sample dataset into multiple samples, ensuring that the attack dataset we generate matches the real-world data as closely as possible.<br>  
It is worth noteing that the sample dataset we collected does not contain any missing values or any outliers due to the fact we tested each part of the collection process and verified that it is correct.<br>
In this notebook we have generated an attack dataset with 7,500 flows of the Port Scanning open port attack based on the samples we collected when running a Port Scanning attack in various configurations using the well known NMap tool when the majority of ports on the victim host machine where open.<br> 

## Imports & Global Variables:

In [1]:
import pandas as pd
import numpy as np
import random

NUM_OF_ROWS = 6000
ATTACK_NAME = 'PortScan'

In [2]:
# the following command will make it so that when we print the dataframe we will see all the columns
pd.set_option('display.max_columns', None)

---

## Load the first sample dataset:

In [3]:
# import the attack sample dataset
port_samples = pd.read_csv('portscan_open_port_samples_1.csv')
print(f'Dataset Shape: {port_samples.shape}')
port_samples

Dataset Shape: (18, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,4990,57.006307,54,60,2.999593,8.997558,130078,26,26.0,26,0.0,24,20.002407,20,0.098088,2.002399,0,61.357547,5003,4986,4986,1.460154,6841.058966,1.012127,0.000146,0.010134
1,5003,57.007603,54,60,2.999457,8.996741,130208,26,26.0,26,0.0,24,20.003208,20,0.113228,2.003195,0,33.890682,5008,4988,4988,1.520292,6575.052498,1.009149,0.000152,0.010104
2,4985,56.998799,54,60,3.0,8.999999,129792,26,26.0,26,0.0,20,20.0,20,0.0,2.0,0,178.285714,4992,4996,4996,1.420907,7029.312865,0.961251,0.000142,0.009626
3,4937,57.012343,54,60,2.999573,8.997439,130026,26,26.0,26,0.0,24,20.002417,20,0.098305,2.0024,0,31.853503,5001,4964,4964,14.153931,704.044686,10.852338,0.001421,0.110534
4,4995,57.0049,54,60,2.999863,8.999176,130182,26,26.0,26,0.0,24,20.000801,20,0.056608,2.000799,0,34.882637,5007,4992,4992,1.527861,6544.443,1.007949,0.000153,0.01009
5,5020,57.015505,54,60,2.999693,8.998159,130598,26,26.0,26,0.0,24,20.001608,20,0.080193,2.001593,0,219.124161,5023,4974,4974,1.943438,5143.976667,1.0682,0.000194,0.011385
6,4962,57.004834,54,60,2.999593,8.99756,129246,26,26.0,26,0.0,24,20.00242,20,0.098354,2.002414,0,50.964511,4971,4959,4959,1.45254,6836.2998,1.008287,0.000146,0.010125
7,5003,57.012301,54,60,2.999975,8.999849,130520,26,26.0,26,0.0,20,20.0,20,0.0,2.0,0,0.0,5020,4979,4979,0.986577,10135.04233,0.504682,9.9e-05,0.00508
8,4968,57.007404,54,60,2.999324,8.995943,130156,26,26.0,26,0.0,24,20.00401,20,0.12658,2.003995,0,29.487087,5006,4988,4988,6.943519,1439.327787,2.646078,0.000695,0.034648
9,5001,57.003902,54,60,2.999997,8.999985,130104,26,26.0,26,0.0,20,20.0,20,0.0,2.0,0,33.155963,5004,4991,4991,1.93033,5177.871039,1.005448,0.000193,0.0113


### Find the columns that we need to synthesis data for:

In [4]:
columns_to_gather = port_samples.replace(0, np.nan) #replace all 0 values with null
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() #remove all columns where there are null values
columns_to_gather #left with all columns that the values are not 0 (be know for a fact that the data is consistant and there are not missing values in the rows)

['Number of Ports',
 'Average Packet Length',
 'Packet Length Min',
 'Packet Length Max',
 'Packet Length Std',
 'Packet Length Variance',
 'Total Length of Fwd Packet',
 'Fwd Packet Length Max',
 'Fwd Packet Length Mean',
 'Fwd Packet Length Min',
 'Fwd Packet Length Std',
 'Bwd Packet Length Max',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Min',
 'Bwd Packet Length Std',
 'Fwd Segment Size Avg',
 'Subflow Fwd Bytes',
 'SYN Flag Count',
 'ACK Flag Count',
 'RST Flag Count',
 'Flow Duration',
 'Packets Per Second',
 'IAT Max',
 'IAT Mean',
 'IAT Std']

### Find an approximate minimum and maximum values of each column:

In [None]:
# find the minimum and maximum values for each column, scale the range (reduce min by 15% and increase max by 7.5%), and store the results in a dictionary.
min_max_dict = {col: (port_samples[col].min() * 0.85, port_samples[col].max() * 1.075) for col in columns_to_gather}

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

Number of Ports                | Min: 2476.90 | Max: 5396.50
Average Packet Length          | Min: 48.45 | Max: 68.94
Packet Length Min              | Min: 45.90 | Max: 58.05
Packet Length Max              | Min: 51.00 | Max: 79.55
Packet Length Std              | Min: 2.55 | Max: 10.75
Packet Length Variance         | Min: 7.65 | Max: 107.42
Total Length of Fwd Packet     | Min: 102765.00 | Max: 216367.40
Fwd Packet Length Max          | Min: 22.10 | Max: 43.00
Fwd Packet Length Mean         | Min: 22.10 | Max: 42.99
Fwd Packet Length Min          | Min: 22.10 | Max: 34.40
Fwd Packet Length Std          | Min: 0.00 | Max: 5.19
Bwd Packet Length Max          | Min: 17.00 | Max: 43.00
Bwd Packet Length Mean         | Min: 17.00 | Max: 21.51
Bwd Packet Length Min          | Min: 17.00 | Max: 21.50
Bwd Packet Length Std          | Min: 0.00 | Max: 0.54
Fwd Segment Size Avg           | Min: 0.00 | Max: 2.15
Subflow Fwd Bytes              | Min: 0.00 | Max: 1547.23
SYN Flag Count           

### Create the base attack dataset (full of zeros):

In [6]:
# creating an empty dataframe before adding values to it
port_dataset = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(port_samples.columns))), columns = port_samples.columns)
port_dataset.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Find the columns with constant zero values based on samples:

In [7]:
# adding zeros to all columns that should not have any values
zero_columns = [col for col in port_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    port_dataset[col] = int(0)
zero_columns

['Bwd Segment Size Avg']

In [8]:
port_dataset.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---

## Filling in values based on collected samples:

### Firstly fill values into 'Fwd Packet' columns that are related to each other:

When generating data for the following columns we take the time to ensure that the values generated are correct in the sence that the minimum value should be lower than the mean and the mean should be lower than the max value <u>in each row</u> of the attack dataset.<br>  
Also sometimes in the sample dataset the values in the following columns are exactly the same, and other times they are different, there for we randomly select 25% of the rows to have the same value and the rest to have some variance within the acceptable range.

In [9]:
independant = ['Fwd Packet Length Max', 'Fwd Packet Length Min', 'Fwd Packet Length Mean']

packet_length_max = np.random.randint(min_max_dict['Fwd Packet Length Max'][0] * 0.9, min_max_dict['Fwd Packet Length Max'][1] * 1.1, NUM_OF_ROWS)

# define probability distribution: 25% True, 75% False
probability = [0.25, 0.75]

# decide for each row whether to copy or vary 'Fwd Packet Length Max'
copy_values = np.random.choice([True, False], size = NUM_OF_ROWS, p=probability)

# create 'Fwd Packet Length Min': either copy or apply a small variation
packet_length_min = np.where(copy_values, packet_length_max, packet_length_max + np.random.uniform(-4, 4, NUM_OF_ROWS))
packet_length_min = np.minimum(packet_length_min, packet_length_max)

# calculate 'Fwd Packet Length Mean': average of min and max, or copy if equal
average_packet_length = np.where(packet_length_max != packet_length_min, (packet_length_max + packet_length_min) / 2, packet_length_min)

# assign the values to the dataset
port_dataset['Fwd Packet Length Max'] = packet_length_max.astype(int)
port_dataset['Fwd Packet Length Mean'] = average_packet_length
port_dataset['Fwd Packet Length Min'] = packet_length_min.astype(int)

In [10]:
port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29,27.109093,25,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27,26.198608,25,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23,21.673972,20,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20,20.000000,20,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,41,40.423601,39,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26,26.000000,26,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39,39.000000,39,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,43,43.000000,43,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38,37.442955,36,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Then fill values into columns that are not related to each other:

Here most of the columns are unrelated to each other excent the Bwd Packet columns, for these ones we just ensure that again the minimum is lower than the mean and the mean is lower than the maximum value in each row.

In [None]:
independent = ['Number of Ports', 'Average Packet Length', 'Packet Length Max', 'Bwd Packet Length Max', 'Subflow Fwd Bytes', 'Bwd Packet Length Mean']

# generate 'Bwd Packet Length Min' values
bwd_min_low, bwd_min_high = min_max_dict['Bwd Packet Length Min']
bwd_min_values = np.random.randint(bwd_min_low * 0.9, bwd_min_high * 1.05, size = NUM_OF_ROWS)

for col in independent:
    if col == 'Bwd Packet Length Mean':
        rand_values = np.random.uniform(min_max_dict[col][0]*0.995, min_max_dict[col][1] * 1.005, NUM_OF_ROWS)
    else:
        rand_values = np.random.randint(min_max_dict[col][0] * 0.9, min_max_dict[col][1] * 1.1, NUM_OF_ROWS)

    port_dataset[col] = rand_values

# ensure that 'Bwd Packet Length Max' is always >= 'Bwd Packet Length Min'
port_dataset['Bwd Packet Length Min'] = bwd_min_values
port_dataset['Bwd Packet Length Max'] = np.maximum(bwd_min_values, port_dataset['Bwd Packet Length Max']) #fix inconsistencies

# ensure that 'Bwd Packet Length Max' is always > 'Bwd Packet Length Mean' > 'Bwd Packet Length Min'
invalid_rows = port_dataset['Bwd Packet Length Mean'] > port_dataset['Bwd Packet Length Max']

# compute the correct mean for those rows
corrected_means = (port_dataset.loc[invalid_rows, 'Bwd Packet Length Min'] + 
                   port_dataset.loc[invalid_rows, 'Bwd Packet Length Max']) / 2

# update only the invalid rows
port_dataset.loc[invalid_rows, 'Bwd Packet Length Mean'] = corrected_means

port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,2372,60,0.0,57,0.0,0.0,0.0,29,27.109093,25,0.0,21,18.936479,17,0.0,0.0,0,378,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3554,59,0.0,46,0.0,0.0,0.0,27,26.198608,25,0.0,46,21.186321,20,0.0,0.0,0,1615,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2368,47,0.0,86,0.0,0.0,0.0,23,21.673972,20,0.0,16,15.500000,15,0.0,0.0,0,248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3076,54,0.0,62,0.0,0.0,0.0,20,20.000000,20,0.0,41,21.473990,16,0.0,0.0,0,1549,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4195,69,0.0,70,0.0,0.0,0.0,41,40.423601,39,0.0,19,17.000000,15,0.0,0.0,0,102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,5000,48,0.0,64,0.0,0.0,0.0,26,26.000000,26,0.0,33,19.711557,18,0.0,0.0,0,1194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5996,2495,66,0.0,55,0.0,0.0,0.0,39,39.000000,39,0.0,40,18.150773,19,0.0,0.0,0,717,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5997,5179,59,0.0,55,0.0,0.0,0.0,43,43.000000,43,0.0,44,18.015041,21,0.0,0.0,0,783,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5998,2739,62,0.0,73,0.0,0.0,0.0,38,37.442955,36,0.0,40,19.603177,20,0.0,0.0,0,1261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Some columns, like 'Packet Length Std', based on the collected samples, usually have values in a specific range, but sometimes they have values outside of the range.<br>
In order to generate accurate data, we generate a vector that will have a certain distribution of values. For example, in the 'Packet Length Std' column, 80% of the values will be within the usual range,<br>
but the other 20% will have values that are anywhere between the minimal and maximal value for this column, meaning they will have values outside of the usual range as well.  

In [None]:
half_and_half = ['Packet Length Std', 'Packet Length Variance', 'Fwd Packet Length Std', 
                 'Flow Duration', 'Total Length of Fwd Packet', 'Bwd Packet Length Std', 'Fwd Segment Size Avg']

for col in half_and_half:
    # generate random values from the uniform distribution (90% - 110% of min-max range)
    rand_values = np.random.uniform(min_max_dict[col][0]*0.9, min_max_dict[col][1]*1.1, NUM_OF_ROWS)
    
    # generate alternative random values based on column-specific conditions
    if col == 'Packet Length Std':
        usual_values = np.random.uniform(2.9, 3.1, NUM_OF_ROWS)
    elif col == 'Packet Length Variance':
        usual_values = np.random.uniform(8.85, 9.15, NUM_OF_ROWS)
    elif col == 'Fwd Packet Length Std':
        rand_values = np.random.uniform(min_max_dict[col][0], min_max_dict[col][1]*1.1, NUM_OF_ROWS)
        usual_values = np.zeros(NUM_OF_ROWS)
    elif col == 'Flow Duration':
        rand_values = np.random.uniform(min_max_dict[col][0]*0.85, min_max_dict[col][1], NUM_OF_ROWS)
        usual_values = np.random.uniform(0.85, 8.597, NUM_OF_ROWS)
    elif col == 'Total Length of Fwd Packet':
        usual_values = np.random.randint(min_max_dict[col][0]*0.9, 150000, NUM_OF_ROWS)
    elif col == 'Bwd Packet Length Std':
        rand_values = np.random.uniform(min_max_dict[col][0], min_max_dict[col][1]*1.1, NUM_OF_ROWS)
        usual_values = np.random.uniform(0.035, 0.15, NUM_OF_ROWS)
    elif col == 'Fwd Segment Size Avg':
        rand_values = np.random.uniform(min_max_dict[col][0]*0.95, min_max_dict[col][1]*1.05, NUM_OF_ROWS)
        usual_values = np.random.uniform(1.99, 2.01, NUM_OF_ROWS)

    # choose values randomly (20% from rand_values, 80% from usual_values)
    chosen_values = np.where(np.random.rand(NUM_OF_ROWS) > 0.2, usual_values, rand_values)

    port_dataset[col] = chosen_values

port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,2372,60,0.0,57,2.920504,9.099900,115800.000000,29,27.109093,25,0.0,21,18.936479,17,0.115317,2.009010,0,378,0.0,0.0,0.0,2.260604,0.0,0.0,0.0,0.0
1,3554,59,0.0,46,2.919790,8.912675,109430.000000,27,26.198608,25,0.0,46,21.186321,20,0.036078,1.995346,0,1615,0.0,0.0,0.0,2.048747,0.0,0.0,0.0,0.0
2,2368,47,0.0,86,3.022850,8.951914,182527.215127,23,21.673972,20,0.0,16,15.500000,15,0.073291,2.000501,0,248,0.0,0.0,0.0,25.124060,0.0,0.0,0.0,0.0
3,3076,54,0.0,62,10.384071,90.049986,113765.000000,20,20.000000,20,0.0,41,21.473990,16,0.146599,2.005556,0,1549,0.0,0.0,0.0,17.510340,0.0,0.0,0.0,0.0
4,4195,69,0.0,70,2.935106,13.274075,124915.484977,41,40.423601,39,0.0,19,17.000000,15,0.099098,2.007666,0,102,0.0,0.0,0.0,4.254797,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,5000,48,0.0,64,3.022384,9.107200,133370.000000,26,26.000000,26,0.0,33,19.711557,18,0.097800,0.657461,0,1194,0.0,0.0,0.0,5.610132,0.0,0.0,0.0,0.0
5996,2495,66,0.0,55,6.979195,9.057955,105102.000000,39,39.000000,39,0.0,40,18.150773,19,0.451297,2.009211,0,717,0.0,0.0,0.0,7.663756,0.0,0.0,0.0,0.0
5997,5179,59,0.0,55,10.883723,8.981118,103049.000000,43,43.000000,43,0.0,44,18.015041,21,0.110635,1.990716,0,783,0.0,0.0,0.0,5.633648,0.0,0.0,0.0,0.0
5998,2739,62,0.0,73,10.791156,8.987080,96887.000000,38,37.442955,36,0.0,40,19.603177,20,0.571546,2.000042,0,1261,0.0,0.0,0.0,2.894112,0.0,0.0,0.0,0.0


In [13]:
# generate random values for the 'Packet Length Min' column
rand_values = np.random.randint(min_max_dict['Packet Length Min'][0]*0.9, min_max_dict['Packet Length Min'][1]*1.05, size = NUM_OF_ROWS)

# assign the random values
port_dataset['Packet Length Min'] = rand_values

port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,2372,60,47,57,2.920504,9.099900,115800.000000,29,27.109093,25,0.0,21,18.936479,17,0.115317,2.009010,0,378,0.0,0.0,0.0,2.260604,0.0,0.0,0.0,0.0
1,3554,59,46,46,2.919790,8.912675,109430.000000,27,26.198608,25,0.0,46,21.186321,20,0.036078,1.995346,0,1615,0.0,0.0,0.0,2.048747,0.0,0.0,0.0,0.0
2,2368,47,42,86,3.022850,8.951914,182527.215127,23,21.673972,20,0.0,16,15.500000,15,0.073291,2.000501,0,248,0.0,0.0,0.0,25.124060,0.0,0.0,0.0,0.0
3,3076,54,51,62,10.384071,90.049986,113765.000000,20,20.000000,20,0.0,41,21.473990,16,0.146599,2.005556,0,1549,0.0,0.0,0.0,17.510340,0.0,0.0,0.0,0.0
4,4195,69,45,70,2.935106,13.274075,124915.484977,41,40.423601,39,0.0,19,17.000000,15,0.099098,2.007666,0,102,0.0,0.0,0.0,4.254797,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,5000,48,48,64,3.022384,9.107200,133370.000000,26,26.000000,26,0.0,33,19.711557,18,0.097800,0.657461,0,1194,0.0,0.0,0.0,5.610132,0.0,0.0,0.0,0.0
5996,2495,66,56,55,6.979195,9.057955,105102.000000,39,39.000000,39,0.0,40,18.150773,19,0.451297,2.009211,0,717,0.0,0.0,0.0,7.663756,0.0,0.0,0.0,0.0
5997,5179,59,50,55,10.883723,8.981118,103049.000000,43,43.000000,43,0.0,44,18.015041,21,0.110635,1.990716,0,783,0.0,0.0,0.0,5.633648,0.0,0.0,0.0,0.0
5998,2739,62,47,73,10.791156,8.987080,96887.000000,38,37.442955,36,0.0,40,19.603177,20,0.571546,2.000042,0,1261,0.0,0.0,0.0,2.894112,0.0,0.0,0.0,0.0


## Calculate and fill values into columns that have a certain correlation between them:

A correlation between two or more columns is common in our dataset since most features are inherently related. All of them are derived from network packet traffic.<br>
For example, as the **flow duration increases**, the **packets per second** is likely to decrease. This occurs because each flow has an upper limit on duration, after which data collection stops and a new flow begins.<br>  
Similarly, the **Inter-Arrival Time (IAT)** of packets within a flow is influenced by the flow duration. Given these dependencies, <br>
the attack dataset should generate data for these columns collectively, ensuring that their inherent correlations are maintained.

### Correlation between 'SYN Flag Count' and all the following: 'ACK Flag Count', 'RST Flag Count':

In [None]:
first_correlation = ['SYN Flag Count', 'ACK Flag Count', 'RST Flag Count']

# finding the correlation between the 'SYN Flag Count' column to the rest of the columns in order to create new data
independent_col = port_samples[first_correlation[0]].values.reshape(-1, 1) #column 'SYN Flag Count'
dependent_cols = port_samples[first_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'SYN Flag Count' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('ACK Flag Count', np.float64(0.9934935261651908))
('RST Flag Count', np.float64(0.9933592536718514))


After finding the scaling factors we can apply some randomness when generating values for the attack dataset in order to generate better data (without many duplications).<br>
We add randomness by creating a modified scaling factor, which introduces controlled variations in the generated values.<br>
This is done by selecting a small random delta (between 1% and 2% of the factor) and adding or subtracting it from the original scaling factor.<br>
As a result, the generated data maintains realistic correlations while avoiding exact duplicates.

In [15]:
# adding the rest of the attack feature values to the dataset at random based on the smaple data
port_dataset['SYN Flag Count'] = np.random.randint(min_max_dict['SYN Flag Count'][0]*0.85, min_max_dict['SYN Flag Count'][1]*1.1, NUM_OF_ROWS)

# generate new data by scaling the original correlated column value using the updated factor.
for index, row in port_dataset.iterrows():
    for col, factor in zip(first_correlation[1:], scaling_factors): #iterating over all generated scaling factors
        delta = random.uniform(factor[1] * 0.01, factor[1] * 0.02)
        updated_factor = factor[1] + (-1) * delta
        port_dataset.loc[index, col] = int(row['SYN Flag Count'] * updated_factor)

### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std':

In [None]:
# finding the correlation between the 'Flow Duration' column to the rest of the columns in order to create new data
second_correlation = ['Flow Duration', 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std']
independent_col = port_samples[second_correlation[0]].values.reshape(-1, 1) #column 'Flow Duration'
dependent_cols = port_samples[second_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Flow Duration' and the rest of the columns in second_correlation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(second_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('Packets Per Second', np.float64(78.30849469959969))
('IAT Max', np.float64(0.6321809769264504))
('IAT Mean', np.float64(0.00010107180700798969))
('IAT Std', np.float64(0.006820131130425529))


In [17]:
# calculate the average correlation between flow duration and packets per second by multiplying their corresponding values from both columns and then calculate the average.
duration_to_packets_corr = [x * y for x, y in zip(port_samples['Flow Duration'].values, port_samples['Packets Per Second'].values)]
duration_to_packets_corr = np.mean(duration_to_packets_corr)
duration_to_packets_corr

np.float64(9940.88888921109)

And again here after finding the scaling factors we add some randomness and generate the data

In [18]:
# calculate a random small delta of the factor for some randomness
for index, row in port_dataset.iterrows():
    for col, factor in zip(second_correlation[1:], scaling_factors):#iterating over all rows we need to add values to except 'Flow Duration'
        if col == 'Packets Per Second':
            delta = random.uniform(duration_to_packets_corr * 0.075, duration_to_packets_corr * 0.1)
            updated_factor = duration_to_packets_corr + random.choice([-1, 1]) * delta
            port_dataset.loc[index, col] = updated_factor / row['Flow Duration']
        else:
            delta = random.uniform(factor[1] * 0.01, factor[1] * 0.02)
            updated_factor = factor[1] + random.choice([-1, 1]) * delta
            port_dataset.loc[index, col] = row['Flow Duration'] * updated_factor

---

## Adding the Label column:

In [19]:
# adding a label to the dataset
port_dataset['Label'] = ATTACK_NAME

---

## Validate that the generated data looks valid by comparing the samples with the generated dataset:

In [20]:
port_samples.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0
mean,4848.222222,57.459618,54.0,61.555556,3.513727,15.084284,134114.666667,27.555556,26.884977,26.333333,0.283675,24.666667,20.002894,20.0,0.1139,1.874946,0.0,152.51972,4986.777778,4954.444444,4953.777778,5.742802,4944.269115,3.658145,0.000579,0.039199
std,491.206505,1.681917,0.0,4.527332,1.702665,21.622031,17063.62018,4.527332,3.302369,1.414214,1.136493,5.861138,0.003808,0.0,0.146877,0.472518,0.0,330.387193,85.082351,83.26091,83.743695,9.974609,2783.707692,6.380214,0.001009,0.068487
min,2914.0,56.998799,54.0,60.0,2.999324,8.995943,120900.0,26.0,26.0,26.0,0.0,20.0,20.0,20.0,0.0,0.0,0.0,0.0,4650.0,4641.0,4641.0,0.986577,235.28316,0.504682,9.9e-05,0.00508
25%,4963.5,57.004851,54.0,60.0,2.999574,8.997443,130084.5,26.0,26.0,26.0,0.0,21.0,20.0002,20.0,0.014152,2.0,0.0,33.339643,5001.5,4966.5,4966.5,1.463707,2663.73993,1.006845,0.000147,0.010095
50%,4998.0,57.007502,54.0,60.0,2.999775,8.998651,130195.0,26.0,26.0,26.0,0.0,24.0,20.002409,20.0,0.098137,2.001196,0.0,35.352346,5006.5,4982.5,4982.5,1.691956,5844.73954,1.010638,0.000172,0.010876
75%,5003.0,57.012333,54.0,60.0,2.999995,8.999973,130390.0,26.0,26.0,26.0,0.0,24.0,20.003212,20.0,0.113304,2.00241,0.0,149.053673,5012.0,4988.0,4988.0,4.097363,6808.495571,1.460001,0.00041,0.021448
max,5020.0,64.131025,54.0,74.0,9.996243,99.924867,201272.0,40.0,39.990463,32.0,4.830089,40.0,20.013104,20.0,0.498803,2.003995,0.0,1439.285714,5030.0,4996.0,4996.0,41.979205,10135.04233,25.286814,0.004251,0.276435


In [21]:
port_dataset.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0
mean,4050.938833,58.4525,49.857667,65.743,3.797924,19.829723,130125.562163,32.456333,32.083783,31.524833,0.574673,30.785833,19.011373,17.993833,0.135819,1.833786,0.0,845.578667,4663.023833,4562.376333,4561.977833,8.413711,2531.09816,5.319021,0.00085,0.057375
std,1074.75598,9.152962,5.476289,12.115173,2.024482,25.794553,29669.079752,8.075888,8.096464,8.187748,1.357496,8.923559,1.438029,1.97622,0.116582,0.441967,0.0,488.781675,754.385721,738.381589,738.13409,9.508562,2213.694141,6.009526,0.000962,0.064803
min,2229.0,43.0,41.0,45.0,2.29623,6.910745,92517.0,19.0,17.033877,15.0,0.0,15.0,15.0,15.0,1.5e-05,0.000569,0.0,0.0,3359.0,3271.0,3273.0,0.726413,200.376017,0.450895,7.2e-05,0.005038
25%,3112.75,51.0,45.0,55.0,2.960131,8.941943,108810.5,25.0,25.0,24.0,0.0,23.0,17.873088,16.0,0.068243,1.991851,0.0,419.0,3998.0,3913.75,3911.0,3.245476,1272.954128,2.058138,0.000329,0.022197
50%,4043.0,58.0,50.0,66.0,3.022095,9.037271,125088.395149,33.0,32.0,32.0,0.0,31.0,19.0,18.0,0.103523,1.998018,0.0,835.0,4672.5,4571.5,4568.0,5.498811,1809.536986,3.481597,0.000554,0.037463
75%,4983.0,66.0,55.0,76.0,3.082749,9.131231,141591.170156,39.0,39.0,38.0,0.0,39.0,20.196126,20.0,0.137663,2.004207,0.0,1270.25,5320.25,5208.0,5205.5,7.813595,3071.648235,4.944757,0.00079,0.053362
max,5935.0,74.0,59.0,86.0,11.811701,118.112376,237966.272248,46.0,46.0,46.0,5.707833,46.0,21.621492,21.0,0.589369,2.260816,0.0,1700.0,5946.0,5844.0,5842.0,45.105317,14929.9913,28.969828,0.004641,0.312743


## Turning certain columns into type Integer for consistency  

In [22]:
int_columns = ['Number of Ports', 'Packet Length Min', 'Packet Length Max', 'Total Length of Fwd Packet', 'Fwd Packet Length Max', 'Fwd Packet Length Min', 'Bwd Packet Length Max', 'Bwd Packet Length Min', 'SYN Flag Count', 'ACK Flag Count', 'RST Flag Count']
for col in int_columns:
    port_dataset[col] = port_dataset[col].astype(int)

port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std,Label
0,2372,60,47,57,2.920504,9.099900,115800,29,27.109093,25,0.0,21,18.936479,17,0.115317,2.009010,0,378,5818,5674,5684,2.260604,4052.661455,1.449311,0.000226,0.015680,PortScan
1,3554,59,46,46,2.919790,8.912675,109430,27,26.198608,25,0.0,46,21.186321,20,0.036078,1.995346,0,1615,3582,3501,3520,2.048747,4433.964433,1.272969,0.000203,0.013831,PortScan
2,2368,47,42,86,3.022850,8.951914,182527,23,21.673972,20,0.0,16,15.500000,15,0.073291,2.000501,0,248,4329,4215,4256,25.124060,364.948333,16.186273,0.002575,0.173525,PortScan
3,3076,54,51,62,10.384071,90.049986,113765,20,20.000000,20,0.0,41,21.473990,16,0.146599,2.005556,0,1549,5456,5337,5312,17.510340,623.734035,10.868378,0.001788,0.121027,PortScan
4,4195,69,45,70,2.935106,13.274075,124915,41,40.423601,39,0.0,19,17.000000,15,0.099098,2.007666,0,102,4910,4817,4814,4.254797,2536.088115,2.640620,0.000423,0.029539,PortScan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,5000,48,48,64,3.022384,9.107200,133370,26,26.000000,26,0.0,33,19.711557,18,0.097800,0.657461,0,1194,5713,5569,5562,5.610132,1935.474486,3.583260,0.000557,0.038991,PortScan
5996,2495,66,56,55,6.979195,9.057955,105102,39,39.000000,39,0.0,40,18.150773,19,0.451297,2.009211,0,717,3479,3392,3405,7.663756,1395.068493,4.920667,0.000788,0.053295,PortScan
5997,5179,59,50,55,10.883723,8.981118,103049,43,43.000000,43,0.0,44,18.015041,21,0.110635,1.990716,0,783,3530,3468,3459,5.633648,1908.426573,3.631398,0.000580,0.039080,PortScan
5998,2739,62,47,73,10.791156,8.987080,96887,38,37.442955,36,0.0,40,19.603177,20,0.571546,2.000042,0,1261,3796,3705,3731,2.894112,3764.723016,1.803846,0.000290,0.019508,PortScan


---

## Load the second sample dataset:

The following code will create another attack dataset, this time based on a different sample dataset, the code in this section<br> 
will be mostly the same as it was up until this point in the notebook, there for we will not repeat the same explanations here.<br>

In [23]:
NUM_OF_ROWS = 6000

## Load the second sample dataset:

In [24]:
# import the attack sample dataset
port_samples = pd.read_csv('portscan_open_port_samples_2.csv')
print(f'Dataset Shape: {port_samples.shape}')
port_samples

Dataset Shape: (18, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,3198,59.99655,58,60,0.082993,0.006888,165516,26,26.0,26,0.0,24,24.0,24,0.0,2.006912,0.0,26.077832,6366,11,11,9.874222,645.82303,1.088071,0.001549,0.01638
1,4163,59.998305,58,60,0.058205,0.003388,214526,26,26.0,26,0.0,24,24.0,24,0.0,2.003394,0.0,28.954785,8251,7,7,9.240387,893.685517,0.05886,0.001119,0.005757
2,2728,59.998156,58,60,0.060706,0.003685,140842,26,26.0,26,0.0,24,24.0,24,0.0,2.003692,0.0,31.487145,5417,5,5,5.565601,974.198454,0.049144,0.001027,0.004635
3,3607,59.996941,58,60,0.078163,0.006109,186680,26,26.0,26,0.0,24,24.0,24,0.0,2.006128,0.0,26.047161,7180,11,11,10.963778,655.887048,1.096967,0.001525,0.015087
4,4047,59.998252,58,60,0.059109,0.003494,208000,26,26.0,26,0.0,24,24.0,24,0.0,2.0035,0.0,29.005717,8000,7,7,8.881072,901.580345,0.067206,0.001109,0.004755
5,3473,59.996815,58,60,0.079745,0.006359,179322,26,26.0,26,0.0,24,24.0,24,0.0,2.00638,0.0,26.060456,6897,11,11,10.75229,642.46779,1.097064,0.001557,0.015597
6,4015,59.998231,58,60,0.059455,0.003535,205582,26,26.0,26,0.0,24,24.0,24,0.0,2.003541,0.0,29.193695,7907,7,7,8.70982,908.629566,0.058229,0.001101,0.004645
7,3484,59.997114,58,60,0.075924,0.005765,179894,26,26.0,26,0.0,24,24.0,24,0.0,2.005781,0.0,26.048943,6919,10,10,10.701885,647.456033,1.092027,0.001545,0.016018
8,4038,59.998251,58,60,0.05912,0.003495,207922,26,26.0,26,0.0,24,24.0,24,0.0,2.003501,0.0,29.023171,7997,7,7,8.978749,891.438213,0.070668,0.001122,0.006004
9,3827,59.996868,58,60,0.079083,0.006254,198926,26,26.0,26,0.0,24,24.0,24,0.0,2.006274,0.0,26.044252,7651,12,12,10.75696,712.375993,1.095676,0.001404,0.014864


In this attack sample, we noticed that there are two attack flows that have a low number of ports (indexes 11 and 12), and that the rest of the data in these two rows differs from the rest in a small but noticeable way.<br> That is why we decided to put them aside for now and, at the end of this notebook, create a small sample of data based solely on these two rows.<br> This will ensure the correctness of the data we generate.  

In [None]:
small_port_samples = port_samples.iloc[[11, 12]]

port_samples.drop(index=11, inplace=True)
port_samples.drop(index=12, inplace=True)
port_samples.reset_index(drop=True, inplace=True)
print(f'Dataset Shape: {port_samples.shape}')
port_samples

Dataset Shape: (16, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,3198,59.99655,58,60,0.082993,0.006888,165516,26,26.0,26,0.0,24,24.0,24,0.0,2.006912,0.0,26.077832,6366,11,11,9.874222,645.82303,1.088071,0.001549,0.01638
1,4163,59.998305,58,60,0.058205,0.003388,214526,26,26.0,26,0.0,24,24.0,24,0.0,2.003394,0.0,28.954785,8251,7,7,9.240387,893.685517,0.05886,0.001119,0.005757
2,2728,59.998156,58,60,0.060706,0.003685,140842,26,26.0,26,0.0,24,24.0,24,0.0,2.003692,0.0,31.487145,5417,5,5,5.565601,974.198454,0.049144,0.001027,0.004635
3,3607,59.996941,58,60,0.078163,0.006109,186680,26,26.0,26,0.0,24,24.0,24,0.0,2.006128,0.0,26.047161,7180,11,11,10.963778,655.887048,1.096967,0.001525,0.015087
4,4047,59.998252,58,60,0.059109,0.003494,208000,26,26.0,26,0.0,24,24.0,24,0.0,2.0035,0.0,29.005717,8000,7,7,8.881072,901.580345,0.067206,0.001109,0.004755
5,3473,59.996815,58,60,0.079745,0.006359,179322,26,26.0,26,0.0,24,24.0,24,0.0,2.00638,0.0,26.060456,6897,11,11,10.75229,642.46779,1.097064,0.001557,0.015597
6,4015,59.998231,58,60,0.059455,0.003535,205582,26,26.0,26,0.0,24,24.0,24,0.0,2.003541,0.0,29.193695,7907,7,7,8.70982,908.629566,0.058229,0.001101,0.004645
7,3484,59.997114,58,60,0.075924,0.005765,179894,26,26.0,26,0.0,24,24.0,24,0.0,2.005781,0.0,26.048943,6919,10,10,10.701885,647.456033,1.092027,0.001545,0.016018
8,4038,59.998251,58,60,0.05912,0.003495,207922,26,26.0,26,0.0,24,24.0,24,0.0,2.003501,0.0,29.023171,7997,7,7,8.978749,891.438213,0.070668,0.001122,0.006004
9,3827,59.996868,58,60,0.079083,0.006254,198926,26,26.0,26,0.0,24,24.0,24,0.0,2.006274,0.0,26.044252,7651,12,12,10.75696,712.375993,1.095676,0.001404,0.014864


### Find the columns that we need to synthesis data for:

In [None]:
columns_to_gather = port_samples.replace(0, np.nan) #replace all 0 values with null
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() #remove all columns where there are null values
columns_to_gather #left with all columns that the values are not 0 (be know for a fact that the data is consistant and there are not missing values in the rows)

['Number of Ports',
 'Average Packet Length',
 'Packet Length Min',
 'Packet Length Max',
 'Packet Length Std',
 'Packet Length Variance',
 'Total Length of Fwd Packet',
 'Fwd Packet Length Max',
 'Fwd Packet Length Mean',
 'Fwd Packet Length Min',
 'Bwd Packet Length Max',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Min',
 'Fwd Segment Size Avg',
 'Subflow Fwd Bytes',
 'SYN Flag Count',
 'ACK Flag Count',
 'RST Flag Count',
 'Flow Duration',
 'Packets Per Second',
 'IAT Max',
 'IAT Mean',
 'IAT Std']

### Find an approximate minimum and maximum values of each column:

In [27]:
# find the minimum and maximum values for each column, scale the range (reduce min by 15% and increase max by 7.5%), and store the results in a dictionary.
min_max_dict = {col: (port_samples[col].min() * 0.85, port_samples[col].max() * 1.075) for col in columns_to_gather}

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

Number of Ports                | Min: 2318.80 | Max: 4798.80
Average Packet Length          | Min: 51.00 | Max: 64.50
Packet Length Min              | Min: 49.30 | Max: 62.35
Packet Length Max              | Min: 51.00 | Max: 64.50
Packet Length Std              | Min: 0.05 | Max: 0.09
Packet Length Variance         | Min: 0.00 | Max: 0.01
Total Length of Fwd Packet     | Min: 119715.70 | Max: 247581.10
Fwd Packet Length Max          | Min: 22.10 | Max: 27.95
Fwd Packet Length Mean         | Min: 22.10 | Max: 27.95
Fwd Packet Length Min          | Min: 22.10 | Max: 27.95
Bwd Packet Length Max          | Min: 20.40 | Max: 25.80
Bwd Packet Length Mean         | Min: 20.40 | Max: 25.80
Bwd Packet Length Min          | Min: 20.40 | Max: 25.80
Fwd Segment Size Avg           | Min: 1.70 | Max: 2.16
Subflow Fwd Bytes              | Min: 22.14 | Max: 33.85
SYN Flag Count                 | Min: 4604.45 | Max: 9522.35
ACK Flag Count                 | Min: 4.25 | Max: 12.90
RST Flag Count        

### Create the base attack dataset (full of zeros):

In [None]:
# creating an empty dataframe before adding values to it
port_dataset2 = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(port_samples.columns))), columns = port_samples.columns)
port_dataset2.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Find the columns with constant zero values based on samples:

In [29]:
# adding zeros to all columns that should not have any values
zero_columns = [col for col in port_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    port_dataset2[col] = int(0)
zero_columns

['Fwd Packet Length Std', 'Bwd Packet Length Std', 'Bwd Segment Size Avg']

---

## Filling in values based on collected samples:

### Firstly fill values into columns that are not related to each other:

In [30]:
random_values = ['Average Packet Length', 'Packet Length Std', 'Packet Length Variance', 'Fwd Segment Size Avg', 'Subflow Fwd Bytes']

for col in random_values:
    if col == 'Subflow Fwd Bytes':
        val = np.random.uniform(min_max_dict[col][0]*0.995, min_max_dict[col][1]*1.005, size = NUM_OF_ROWS)
    else:
        val = np.random.uniform(min_max_dict[col][0]*0.9, min_max_dict[col][1]*1.1, size = NUM_OF_ROWS)
    port_dataset2[col] = val

In [31]:
same_value1 = ['Packet Length Min', 'Packet Length Max']
val1 = np.random.randint(min_max_dict[same_value1[0]][0]*0.9, min_max_dict[same_value1[0]][1]*1.05, size = NUM_OF_ROWS)

same_value2 = ['Fwd Packet Length Max', 'Fwd Packet Length Mean', 'Fwd Packet Length Min']
val2 = np.random.randint(min_max_dict[same_value2[0]][0]*0.9, min_max_dict[same_value2[0]][1]*1.05, size = NUM_OF_ROWS)

same_value3 = ['Bwd Packet Length Min', 'Bwd Packet Length Max', 'Bwd Packet Length Mean']
val3 = np.random.randint(min_max_dict[same_value3[0]][0]*0.9, min_max_dict[same_value3[0]][1]*1.05, size = NUM_OF_ROWS)

for col in same_value1:
    if col == 'Packet Length Min':
        port_dataset2[col] = val1
    else:
        port_dataset2[col] = [val + np.random.randint(2, 8) for val in val1]

for col in same_value2:
    port_dataset2[col] = val2

for col in same_value3:
    port_dataset2[col] = val3

## Calculate and fill values into columns that have a certain correlation between them:

### Correlation between 'Number of Ports' and all the following: 'Total Length of Fwd Packet', 'SYN Flag Count':

In [None]:
first_correlation = ['Number of Ports', 'Total Length of Fwd Packet', 'SYN Flag Count', 'ACK Flag Count']

# finding the correlation between the 'Number of Ports' column to the rest of the columns in order to create new data
independent_col = port_samples[first_correlation[0]].values.reshape(-1, 1) #column 'Number of Ports'
dependent_cols = port_samples[first_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Number of Ports' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('Total Length of Fwd Packet', np.float64(51.603268694587804))
('SYN Flag Count', np.float64(1.9847411036379927))
('ACK Flag Count', np.float64(0.0022734815333307983))


In [33]:
# adding the rest of the attack feature values to the dataset at random based on the smaple data
port_dataset2['Number of Ports'] = np.random.randint(min_max_dict['Number of Ports'][0]*0.85, min_max_dict['Number of Ports'][1]*1.1, NUM_OF_ROWS)

# generate new data by scaling the original correlated column value using the updated factor.
for index, row in port_dataset2.iterrows():
    for col, factor in scaling_factors: #iterating over all generated scaling factors
        delta = random.uniform(factor * 0.1, factor * 0.2) 
        updated_factor = factor + random.choice([-1, 1]) * delta
        port_dataset2.loc[index, col] = int(row['Number of Ports'] * updated_factor)
        if col == 'ACK Flag Count':
            port_dataset2.loc[index, 'RST Flag Count'] = int(row['Number of Ports'] * updated_factor) #copy the value to RST column


### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std':

In [None]:
# generate random values for the 'Flow Duration' column
rand_values = np.random.uniform(min_max_dict['Flow Duration'][0]*0.9, min_max_dict['Flow Duration'][1]*1.05, size = NUM_OF_ROWS)

# assign the random values
port_dataset2['Flow Duration'] = rand_values

In [None]:
# finding the correlation between the 'Flow Duration' column to the rest of the columns in order to create new data
secondCorrelation = ['Flow Duration', 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std']
independent_col = port_samples[secondCorrelation[0]].values.reshape(-1, 1) #column 'Flow Duration'
dependent_cols = port_samples[secondCorrelation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Flow Duration' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(secondCorrelation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

# calculate the average correlation between flow duration and packets per second by multiplying their corresponding values from both columns and then calculate the average.
duration_to_packets_corr = [x * y for x, y in zip(port_samples['Flow Duration'].values, port_samples['Packets Per Second'].values)]
duration_to_packets_corr = np.mean(duration_to_packets_corr)
duration_to_packets_corr

('Packets Per Second', np.float64(81.07547767708505))
('IAT Max', np.float64(0.05919619316767066))
('IAT Mean', np.float64(0.0001340921963392785))
('IAT Std', np.float64(0.0010486295444490786))


np.float64(7516.125)

In [36]:
# adding the rest of the attack feature values to the dataset at random based on the smaple data
for index, row in port_dataset2.iterrows():
    for col, factor in scaling_factors: #iterating over all rows we need to add values to except 'Flow Duration'
        if col == 'Packets Per Second':
            delta = random.uniform(duration_to_packets_corr * 0.075, duration_to_packets_corr * 0.1)
            updated_factor = duration_to_packets_corr + random.choice([-1, 1]) * delta
            port_dataset2.loc[index, col] = updated_factor / row['Flow Duration']
        else:
            if col == 'IAT Std' or col == 'IAT Max':
                delta = random.uniform(factor * 0.5, factor * 0.8)
                updated_factor = factor + random.choices([-1, 1], weights=[1, 2], k=1)[0] * delta  
            else:
                delta = random.uniform(factor * 0.1, factor * 0.2)
                updated_factor = factor + random.choice([-1, 1]) * delta
            port_dataset2.loc[index, col] = row['Flow Duration'] * updated_factor

---

## Validate that the generated data looks valid by comparing the samples with the generated dataset:

In [37]:
port_samples.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0
mean,3782.0625,59.997635,58.0,60.0,0.068096,0.004724,195191.75,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.004736,0.0,27.88073,7507.375,8.75,8.75,9.529528,804.528544,0.518116,0.001279,0.009613
std,431.957478,0.000669,0.0,0.0,0.009604,0.001335,22068.30596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001342,0.0,1.769354,848.780998,2.081666,2.081666,1.421124,136.161725,0.526051,0.000226,0.005487
min,2728.0,59.99655,58.0,60.0,0.058205,0.003388,140842.0,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.003394,0.0,26.044252,5417.0,5.0,5.0,5.565601,642.46779,0.049144,0.001027,0.003061
25%,3570.25,59.996937,58.0,60.0,0.059371,0.003525,184359.5,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.003531,0.0,26.055982,7090.75,7.0,7.0,8.838259,650.235297,0.061723,0.001093,0.004727
50%,3921.0,59.998011,58.0,60.0,0.063039,0.003974,202254.0,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.003982,0.0,28.891462,7779.0,8.0,8.0,9.340837,892.561865,0.105977,0.001121,0.00665
75%,4052.75,59.998236,58.0,60.0,0.078206,0.006116,208747.5,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.006135,0.0,29.042916,8028.75,11.0,11.0,10.753457,915.198956,1.095999,0.001538,0.015236
max,4464.0,59.998305,58.0,60.0,0.082993,0.006888,230308.0,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.006912,0.0,31.487145,8858.0,12.0,12.0,11.127804,974.198454,1.098471,0.001557,0.01638


In [38]:
port_dataset2.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0
mean,3618.503667,58.352955,54.0405,58.539,0.071316,0.005359,186417.0265,23.532667,23.532667,23.532667,0.0,22.044167,22.044167,22.044167,0.0,1.9491,0.0,28.003484,7183.710667,7.711,7.711,8.422083,977.12143,0.61365,0.001125,0.010698
std,966.364314,7.239216,6.073824,6.272403,0.015416,0.001618,57891.587437,2.858431,2.858431,2.858431,0.0,2.601284,2.601284,2.601284,0.0,0.241537,0.0,3.459602,2239.615399,2.573567,2.573567,2.391429,323.904739,0.362536,0.000367,0.00649
min,1970.0,45.91765,44.0,46.0,0.044532,0.002592,82045.0,19.0,19.0,19.0,0.0,18.0,18.0,18.0,0.0,1.533557,0.0,22.027308,3163.0,3.0,3.0,4.260206,541.251143,0.050955,0.000466,0.000914
25%,2764.75,52.094679,49.0,53.0,0.057983,0.003936,139373.0,21.0,21.0,21.0,0.0,20.0,20.0,20.0,0.0,1.73916,0.0,25.003989,5362.75,6.0,6.0,6.362517,716.495929,0.227055,0.000828,0.003792
50%,3614.5,58.368818,54.0,59.0,0.071053,0.005346,181606.0,24.0,24.0,24.0,0.0,22.0,22.0,22.0,0.0,1.947273,0.0,27.988627,6967.5,7.0,7.0,8.426448,887.689643,0.628803,0.001097,0.010966
75%,4471.0,64.597781,59.0,64.0,0.084934,0.006777,225655.0,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.158815,0.0,31.027614,8689.0,9.0,9.0,10.493203,1179.197592,0.928982,0.001372,0.016307
max,5277.0,70.944618,64.0,71.0,0.098138,0.008144,324986.0,28.0,28.0,28.0,0.0,26.0,26.0,26.0,0.0,2.373114,0.0,34.017798,12526.0,14.0,14.0,12.55468,1934.06667,1.337132,0.002013,0.023535


---

## Adding the Label column:

In [39]:
# adding a label to the dataset
port_dataset2['Label'] = ATTACK_NAME

---

Make sure that the data that needs to be of type Integer will be Integer for consistency.  

In [40]:
int_columns = ['Number of Ports', 'Packet Length Min', 'Packet Length Max', 'Total Length of Fwd Packet', 'Fwd Packet Length Max', 'Fwd Packet Length Min', 'Bwd Packet Length Max', 'Bwd Packet Length Min', 'SYN Flag Count', 'ACK Flag Count', 'RST Flag Count']
for col in int_columns:
    port_dataset2[col] = port_dataset2[col].astype(int)

port_dataset2['Fwd Packet Length Mean'] = port_dataset2['Fwd Packet Length Mean'].astype(float)
port_dataset2['Bwd Packet Length Mean'] = port_dataset2['Bwd Packet Length Mean'].astype(float)

port_dataset2

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std,Label
0,3153,64.283501,51,58,0.089751,0.005294,188090,22,22.0,22,0,22,22.0,22,0,2.199814,0,31.149895,7101,6,6,6.133804,1133.419275,0.616359,0.000738,0.011134,PortScan
1,2068,64.455389,64,66,0.047709,0.004417,92980,22,22.0,22,0,25,25.0,25,0,2.059450,0,29.552035,3589,5,5,8.190811,997.203524,0.153291,0.001284,0.002426,PortScan
2,3961,70.437187,49,51,0.062916,0.006522,181818,20,20.0,20,0,23,23.0,23,0,1.972845,0,32.849444,6469,7,7,5.658816,1459.084046,0.154635,0.000839,0.009272,PortScan
3,4485,63.652641,62,67,0.090927,0.004581,187120,24,24.0,24,0,25,25.0,25,0,1.846070,0,27.021780,7158,11,11,8.302587,975.219348,0.852157,0.001330,0.002429,PortScan
4,3808,62.274696,56,61,0.071425,0.006545,162405,28,28.0,28,0,25,25.0,25,0,2.264907,0,22.274706,8423,10,10,6.517127,1262.695110,0.100823,0.000725,0.001611,PortScan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,2062,60.028694,52,54,0.096937,0.007706,121344,19,19.0,19,0,26,26.0,26,0,1.611376,0,32.735299,4642,4,4,12.155878,676.078614,1.088350,0.001902,0.022623,PortScan
5996,2794,60.009691,45,48,0.083576,0.004008,119842,27,27.0,27,0,23,23.0,23,0,1.747310,0,26.151547,6290,5,5,10.721345,764.560749,1.033977,0.001171,0.019183,PortScan
5997,4299,50.521237,53,59,0.098086,0.005621,191801,23,23.0,23,0,18,18.0,18,0,2.237203,0,32.636211,9907,11,11,11.331543,606.788684,0.257756,0.001325,0.018771,PortScan
5998,3019,46.920914,50,53,0.064849,0.008134,129203,23,23.0,23,0,24,24.0,24,0,1.775838,0,23.313498,7149,5,5,4.775738,1701.049411,0.456945,0.000715,0.001964,PortScan


---

## Creating more rows base on small subset of samples that is slightly different:

In [None]:
small_port_samples

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
11,1000,73.960367,66,74,0.561691,0.315496,160920,40,39.960268,32,0.562384,40,40.0,40,0.0,0.0,0.0,40.139686,4017,30,10,12.911425,312.668816,3.273951,0.003199,0.057305
12,1000,73.960278,66,74,0.562315,0.316198,80280,40,39.960179,32,0.56301,40,40.0,40,0.0,0.0,0.0,40.240602,2004,15,5,4.624056,435.548349,1.099811,0.002297,0.026689


In [42]:
NUM_OF_ROWS = 3000

### Create the base attack dataset (full of zeros):

In [None]:
# creating an empty dataframe before adding values to it
port_dataset3 = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(small_port_samples.columns))), columns = small_port_samples.columns)

# find the columns that we need to synthesis data for to produce an attack dataset
columns_to_gather = small_port_samples.replace(0, np.nan) #replace all 0 values with null
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() #remove all columns where there are null values

# find an approximate minimum and maximum values of each column and save that data into a dictionary
min_max_dict = {col: (small_port_samples[col].min() * 0.85, small_port_samples[col].max() * 1.1) for col in columns_to_gather}

# adding zeros to all columns that should not have any values
zero_columns = [col for col in small_port_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    port_dataset3[col] = int(0)
zero_columns

['Bwd Packet Length Std', 'Fwd Segment Size Avg', 'Bwd Segment Size Avg']

---

## Filling in values based on collected samples:

### Firstly fill values into columns that are not related to each other:

In [44]:
random_values = ['Average Packet Length', 'Packet Length Std', 'Packet Length Variance', 'Fwd Packet Length Std', 'Subflow Fwd Bytes', 'Number of Ports']

for col in random_values:
    val = np.random.uniform(min_max_dict[col][0]*0.95, min_max_dict[col][1]*1.05, size = NUM_OF_ROWS)
    port_dataset3[col] = val

### Then filling same value columns:

In [45]:
same_value1 = ['Packet Length Min', 'Packet Length Max']
val1 = np.random.randint(min_max_dict[same_value1[0]][0]*0.9, min_max_dict[same_value1[0]][1]*1.05, size = NUM_OF_ROWS)

same_value2 = ['Bwd Packet Length Min', 'Bwd Packet Length Max', 'Bwd Packet Length Mean']
val2 = np.random.randint(min_max_dict[same_value2[0]][0]*0.9, min_max_dict[same_value2[0]][1]*1.05, size = NUM_OF_ROWS)


for col in same_value1:
    if col == 'Packet Length Min':
        port_dataset3[col] = val1
    else:
        port_dataset3[col] = [val + np.random.randint(2, 14) for val in val1]

for col in same_value2:
    port_dataset3[col] = val2

In [46]:
independant = ['Fwd Packet Length Max', 'Fwd Packet Length Min', 'Fwd Packet Length Mean']

packet_length_max = np.random.randint(min_max_dict['Fwd Packet Length Max'][0] * 0.9, min_max_dict['Fwd Packet Length Max'][1] * 1.1, NUM_OF_ROWS)

# create 'Fwd Packet Length Min' by applying a small variation
packet_length_min = packet_length_max - np.random.randint(2, 16, NUM_OF_ROWS)

# calculate 'Fwd Packet Length Mean': average of min and max, or copy if equal
average_packet_length = np.where(packet_length_max != packet_length_min, (packet_length_max + packet_length_min) / 2, packet_length_min)

# assign the values to the dataset
port_dataset3['Fwd Packet Length Max'] = packet_length_max.astype(int)
port_dataset3['Fwd Packet Length Mean'] = average_packet_length
port_dataset3['Fwd Packet Length Min'] = packet_length_min.astype(int)

In [47]:
port_dataset3

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,850.110655,60.193093,70,73,0.454961,0.267520,0.0,30,25.5,21,0.613209,37,37,37,0,0,0,46.260579,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1073.376988,71.035755,68,75,0.526132,0.256637,0.0,34,32.0,30,0.560503,41,41,41,0,0,0,38.540909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,851.117589,62.644438,74,81,0.510642,0.345980,0.0,31,29.5,28,0.520716,45,45,45,0,0,0,43.515491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1055.235680,80.947002,69,77,0.470617,0.288430,0.0,39,38.0,37,0.646861,37,37,37,0,0,0,41.609240,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1021.119786,74.992107,52,59,0.584143,0.331596,0.0,41,35.5,30,0.569918,45,45,45,0,0,0,41.548575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,1084.540283,73.731776,73,81,0.586182,0.299747,0.0,31,25.0,19,0.468385,40,40,40,0,0,0,45.212818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2996,818.175096,66.319265,64,71,0.467107,0.276685,0.0,39,33.5,28,0.615429,44,44,44,0,0,0,38.942502,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2997,855.397812,74.202778,74,81,0.464416,0.325954,0.0,42,37.0,32,0.463799,43,43,43,0,0,0,36.891445,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2998,914.872463,83.833566,67,75,0.565814,0.339365,0.0,40,34.0,28,0.540441,30,30,30,0,0,0,42.043146,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Calculate and fill values into columns that have a certain correlation between them:

### Correlation between 'SYN Flag Count' and all the following: 'ACK Flag Count', 'RST Flag Count', 'Total Length of Fwd Packet':

In [None]:
first_correlation = ['SYN Flag Count', 'ACK Flag Count', 'RST Flag Count', 'Total Length of Fwd Packet']

# finding the correlation between the 'SYN Flag Count' column to the rest of the columns in order to create new data
independent_col = small_port_samples[first_correlation[0]].values.reshape(-1, 1) #column 'SYN Flag Count'
dependent_cols = small_port_samples[first_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'SYN Flag Count' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)
    
# adding the rest of the attack feature values to the dataset at random based on the smaple data
port_dataset3['SYN Flag Count'] = np.random.randint(min_max_dict['SYN Flag Count'][0]*0.9, min_max_dict['SYN Flag Count'][1]*1.05, NUM_OF_ROWS)

for index, row in port_dataset3.iterrows():
    for col, factor in zip(first_correlation[1:], scaling_factors): #iterating over all rows we need to add values to except 'SYN Flag Count'
        delta = random.uniform(factor[1] * 0.01, factor[1] * 0.02)
        updated_factor = factor[1] + (-1) * delta
        port_dataset3.loc[index, col] = int(row['SYN Flag Count'] * updated_factor)

('ACK Flag Count', np.float64(0.007471601883754736))
('RST Flag Count', np.float64(0.002490533961251579))
('Total Length of Fwd Packet', np.float64(40.05977281507004))


In [49]:
port_dataset3

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,850.110655,60.193093,70,73,0.454961,0.267520,145792.0,30,25.5,21,0.613209,37,37,37,0,0,0,46.260579,3698,27.0,9.0,0.0,0.0,0.0,0.0,0.0
1,1073.376988,71.035755,68,75,0.526132,0.256637,158469.0,34,32.0,30,0.560503,41,41,41,0,0,0,38.540909,4001,29.0,9.0,0.0,0.0,0.0,0.0,0.0
2,851.117589,62.644438,74,81,0.510642,0.345980,136927.0,31,29.5,28,0.520716,45,45,45,0,0,0,43.515491,3464,25.0,8.0,0.0,0.0,0.0,0.0,0.0
3,1055.235680,80.947002,69,77,0.470617,0.288430,174944.0,39,38.0,37,0.646861,37,37,37,0,0,0,41.609240,4418,32.0,10.0,0.0,0.0,0.0,0.0,0.0
4,1021.119786,74.992107,52,59,0.584143,0.331596,87273.0,41,35.5,30,0.569918,45,45,45,0,0,0,41.548575,2203,16.0,5.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,1084.540283,73.731776,73,81,0.586182,0.299747,96959.0,31,25.0,19,0.468385,40,40,40,0,0,0,45.212818,2454,17.0,5.0,0.0,0.0,0.0,0.0,0.0
2996,818.175096,66.319265,64,71,0.467107,0.276685,116402.0,39,33.5,28,0.615429,44,44,44,0,0,0,38.942502,2961,21.0,7.0,0.0,0.0,0.0,0.0,0.0
2997,855.397812,74.202778,74,81,0.464416,0.325954,102526.0,42,37.0,32,0.463799,43,43,43,0,0,0,36.891445,2586,19.0,6.0,0.0,0.0,0.0,0.0,0.0
2998,914.872463,83.833566,67,75,0.565814,0.339365,77575.0,40,34.0,28,0.540441,30,30,30,0,0,0,42.043146,1968,14.0,4.0,0.0,0.0,0.0,0.0,0.0


### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std':

In [None]:
# generate random values for the 'Flow Duration' column
rand_values = np.random.uniform(min_max_dict['Flow Duration'][0]*0.9, min_max_dict['Flow Duration'][1]*1.05, size = NUM_OF_ROWS)

# assign the random values
port_dataset3['Flow Duration'] = rand_values

# finding the correlation between the 'Flow Duration' column to the rest of the columns in order to create new data
secondCorrelation = ['Flow Duration', 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std']
independent_col = small_port_samples[secondCorrelation[0]].values.reshape(-1, 1) #column 'Flow Duration'
dependent_cols = small_port_samples[secondCorrelation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Flow Duration' and the rest of the columns in second_correlation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(secondCorrelation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

# calculate the average correlation between flow duration and packets per second by multiplying their corresponding values from both columns and then calculate the average.
duration_to_packets_corr = [x * y for x, y in zip(small_port_samples['Flow Duration'].values, small_port_samples['Packets Per Second'].values)]
duration_to_packets_corr = np.mean(duration_to_packets_corr)
duration_to_packets_corr

('Packets Per Second', np.float64(32.17131779515319))
('IAT Max', np.float64(0.2517824943859383))
('IAT Mean', np.float64(0.00027607674297135994))
('IAT Std', np.float64(0.004589897159462554))


np.float64(3025.5)

In [51]:
# adding the rest of the attack feature values to the dataset at random based on the smaple data
for index, row in port_dataset3.iterrows():
    for col, factor in zip(secondCorrelation[1:], scaling_factors): #iterating over all rows we need to add values to except 'Flow Duration'
        if col == 'Packets Per Second':
            delta = random.uniform(duration_to_packets_corr * 0.1, duration_to_packets_corr * 0.2) 
            updated_factor = duration_to_packets_corr + random.choices([-1, 1], weights=[2, 1], k=1)[0] * delta
            port_dataset3.loc[index, col] = updated_factor / row['Flow Duration']
        elif col == 'IAT Mean':
            delta = random.uniform(factor[1] * 0.5, factor[1] * 0.8)
            updated_factor = factor[1] + delta
            port_dataset3.loc[index, col] = row['Flow Duration'] * updated_factor
        else:
            delta = random.uniform(factor[1] * 0.15, factor[1] * 0.35)
            updated_factor = factor[1] + random.choice([-1, 1]) * delta
            port_dataset3.loc[index, col] = row['Flow Duration'] * updated_factor

Make sure that the data that needs to be of type Integer will be Integer for consistency.  

In [52]:
int_columns = ['Number of Ports', 'Packet Length Min', 'Packet Length Max', 'Total Length of Fwd Packet', 'Fwd Packet Length Max', 'Fwd Packet Length Min', 'Bwd Packet Length Max', 'Bwd Packet Length Min', 'SYN Flag Count', 'ACK Flag Count', 'RST Flag Count']
for col in int_columns:
    port_dataset3[col] = port_dataset3[col].astype(int)

port_dataset3['Bwd Packet Length Mean'] = port_dataset3['Bwd Packet Length Mean'].astype(float)

port_dataset3

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,850,60.193093,70,73,0.454961,0.267520,145792,30,25.5,21,0.613209,37,37.0,37,0,0,0,46.260579,3698,27,9,11.120724,230.319880,2.249622,0.004715,0.037372
1,1073,71.035755,68,75,0.526132,0.256637,158469,34,32.0,30,0.560503,41,41.0,41,0,0,0,38.540909,4001,29,9,4.993318,684.862076,0.860284,0.002324,0.027817
2,851,62.644438,74,81,0.510642,0.345980,136927,31,29.5,28,0.520716,45,45.0,45,0,0,0,43.515491,3464,25,8,9.001009,381.979277,3.032999,0.004094,0.033799
3,1055,80.947002,69,77,0.470617,0.288430,174944,39,38.0,37,0.646861,37,37.0,37,0,0,0,41.609240,4418,32,10,6.823532,527.783608,2.066955,0.002929,0.038624
4,1021,74.992107,52,59,0.584143,0.331596,87273,41,35.5,30,0.569918,45,45.0,45,0,0,0,41.548575,2203,16,5,5.003316,521.193506,1.646441,0.002346,0.027592
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,1084,73.731776,73,81,0.586182,0.299747,96959,31,25.0,19,0.468385,40,40.0,40,0,0,0,45.212818,2454,17,5,8.510000,288.369927,2.563109,0.004170,0.026624
2996,818,66.319265,64,71,0.467107,0.276685,116402,39,33.5,28,0.615429,44,44.0,44,0,0,0,38.942502,2961,21,7,8.801904,292.817026,1.654419,0.003983,0.031233
2997,855,74.202778,74,81,0.464416,0.325954,102526,42,37.0,32,0.463799,43,43.0,43,0,0,0,36.891445,2586,19,6,9.728176,350.330189,2.861212,0.004231,0.030408
2998,914,83.833566,67,75,0.565814,0.339365,77575,40,34.0,28,0.540441,30,30.0,30,0,0,0,42.043146,1968,14,4,4.565524,531.050677,1.413088,0.002097,0.028281


---

## Validate that the generated data looks valid by comparing the samples with the generated dataset:

In [None]:
small_port_samples

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
11,1000,73.960367,66,74,0.561691,0.315496,160920,40,39.960268,32,0.562384,40,40.0,40,0.0,0.0,0.0,40.139686,4017,30,10,12.911425,312.668816,3.273951,0.003199,0.057305
12,1000,73.960278,66,74,0.562315,0.316198,80280,40,39.960179,32,0.56301,40,40.0,40,0.0,0.0,0.0,40.240602,2004,15,5,4.624056,435.548349,1.099811,0.002297,0.026689


In [54]:
port_dataset3.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,980.250667,72.704631,62.458333,69.970333,0.552079,0.309325,120550.182333,38.690667,34.434667,30.178667,0.551408,37.512667,37.512667,37.512667,0.0,0.0,0.0,39.60079,3055.264333,21.979,6.989667,9.179229,364.210954,2.309159,0.004175,0.04196
std,100.517065,7.475367,7.436085,8.223057,0.057064,0.031551,35236.096402,5.184072,5.562033,6.555741,0.05735,4.615583,4.615583,4.615583,0.0,0.0,0.0,3.972738,893.064929,6.580805,2.214466,3.288817,168.152921,1.043737,0.001511,0.018719
min,807.0,59.723789,50.0,52.0,0.453668,0.254809,60309.0,30.0,22.5,15.0,0.454189,30.0,30.0,30.0,0.0,0.0,0.0,32.414837,1534.0,11.0,3.0,3.54375,163.194581,0.599812,0.001511,0.010688
25%,892.75,66.257574,56.0,63.0,0.502872,0.281539,90478.75,34.0,30.0,25.0,0.501719,34.0,34.0,34.0,0.0,0.0,0.0,36.313965,2292.0,16.0,5.0,6.337607,237.400928,1.520742,0.002862,0.027181
50%,981.0,72.635148,63.0,70.0,0.551301,0.308986,120389.5,39.0,34.5,30.0,0.550706,38.0,38.0,38.0,0.0,0.0,0.0,39.639272,3048.0,22.0,7.0,9.128559,313.132767,2.141206,0.004141,0.03919
75%,1068.0,79.32674,69.0,76.0,0.603278,0.336261,149803.0,43.0,39.0,35.0,0.600888,42.0,42.0,42.0,0.0,0.0,0.0,42.940429,3798.5,27.0,9.0,12.017814,451.871024,2.920218,0.005441,0.052936
max,1154.0,85.423112,75.0,88.0,0.649439,0.365154,183678.0,47.0,46.0,45.0,0.650116,45.0,45.0,45.0,0.0,0.0,0.0,46.473107,4637.0,34.0,11.0,14.895448,1002.717575,5.023238,0.007284,0.091543


---

## Adding the Label column:

In [55]:
# adding a label to the dataset
port_dataset3['Label'] = ATTACK_NAME

---

## At the end we merge the three sample datasets tougether and then save it as a CSV file:

In [56]:
# sample dos attack dataset
mergedport_dataset = pd.concat([port_dataset, port_dataset2, port_dataset3], axis=0)
mergedport_dataset = mergedport_dataset.sample(frac=1, random_state=42).reset_index(drop=True)
print(f'Attack Dataset Shape: {mergedport_dataset.shape}')

Attack Dataset Shape: (15000, 27)


In [None]:
# save the dataset
mergedport_dataset.to_csv('port_scan_open_ports_dataset.csv', index=False)