# Prepare Port Scanning Closed Port Attack Dataset

## Overview:

This notebook creates a Port Scanning closed-port attack dataset from a small sample of data collected by performing real Port Scanning closed-port attacks in a controlled environment.<br>
The dataset that this notebook creates closely represents real-world data and was used to train our SVM model.<br>  
There are multiple sample datasets because we performed the attack in a few different ways, and the data differs slightly between them.<br>
For that reason we split the original sample dataset into multiple samples, so that the generated attack dataset matches the real-world data as closely as possible.<br>  
It is worth noting that the sample dataset we collected does not contain any missing values or outliers, because we tested each part of the collection process and verified that it is correct.<br>
In this notebook we generate an attack dataset with 7,500 flows of the Port Scanning closed-port attack, based on the samples we collected while running a Port Scanning attack in various configurations with the well-known Nmap tool when the majority of ports on the victim host machine were closed.<br> 

## Imports & Global Variables:

In [1]:
import pandas as pd
import numpy as np
import random

NUM_OF_ROWS = 7500
ATTACK_NAME = 'PortScan'

In [2]:
# the following command will make it so that when we print the dataframe we will see all the columns
pd.set_option('display.max_columns', None)

---

## Load the first sample dataset:

In [3]:
# import the attack sample dataset
port_samples = pd.read_csv('portscan_closed_port_samples_1.csv')
print(f'Dataset Shape: {port_samples.shape}')
port_samples

Dataset Shape: (19, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1970,60.0,60,60,0.0,0.0,102154,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,26.679028,3929,0,0,39.64862,99.095504,0.101249,0.010094,0.017669
1,1980,60.0,60,60,0.0,0.0,102778,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,26.674799,3953,0,0,39.988767,98.85276,0.090225,0.010119,0.016015
2,1800,60.0,60,60,0.0,0.0,93366,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,26.072605,3591,0,0,37.254382,96.391345,1.103064,0.010377,0.029695
3,4942,59.998174,58,60,0.060403,0.003649,256074,26,26.0,26,0.0,24,24.0,24,0.0,2.003655,0.0,26.026425,9849,9,9,28.134188,350.3922,1.101169,0.002854,0.017863
4,3416,59.998822,58,60,0.048532,0.002355,176410,26,26.0,26,0.0,24,24.0,24,0.0,2.002358,0.0,26.38893,6785,4,4,39.957571,169.905223,0.137244,0.005887,0.018825
5,1410,60.0,60,60,0.0,0.0,73060,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,26.761905,2810,0,0,39.887447,70.448229,0.14227,0.0142,0.029792
6,3314,59.998782,58,60,0.049349,0.002435,170612,26,26.0,26,0.0,24,24.0,24,0.0,2.002438,0.0,26.039683,6562,4,4,38.899376,168.794481,1.100685,0.005925,0.021482
7,5019,59.99819,58,60,0.060138,0.003617,258336,26,26.0,26,0.0,24,24.0,24,0.0,2.003623,0.0,28.282899,9936,9,9,11.427656,870.257194,0.038543,0.001149,0.002328
8,1930,74.0,74,74,0.0,0.0,154400,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,40.103896,3860,0,0,39.700051,97.229094,1.017906,0.010288,0.034187
9,1999,74.0,74,74,0.0,0.0,159200,40,40.0,40,0.0,0,0.0,0,0.0,0.0,0.0,41.020356,3980,0,0,39.994987,99.512472,0.215444,0.010052,0.03092


### Find the columns that we need to synthesize data for:

In [4]:
columns_to_gather = port_samples.replace(0, np.nan) # replace all 0 values with NaN
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() # drop the columns in which every value is NaN
columns_to_gather # the columns that contain at least one non-zero value (we know for a fact that the data is consistent and that no rows have missing values)

['Number of Ports',
 'Average Packet Length',
 'Packet Length Min',
 'Packet Length Max',
 'Packet Length Std',
 'Packet Length Variance',
 'Total Length of Fwd Packet',
 'Fwd Packet Length Max',
 'Fwd Packet Length Mean',
 'Fwd Packet Length Min',
 'Bwd Packet Length Max',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Min',
 'Fwd Segment Size Avg',
 'Subflow Fwd Bytes',
 'SYN Flag Count',
 'ACK Flag Count',
 'RST Flag Count',
 'Flow Duration',
 'Packets Per Second',
 'IAT Max',
 'IAT Mean',
 'IAT Std']
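
The selection above relies on `dropna(how = 'all', axis = 1)` dropping only the columns that are null in every row. A minimal sketch on a toy frame illustrates which columns survive (the column names `a`, `b` and `c` are hypothetical and not part of the dataset):

```python
import numpy as np
import pandas as pd

# toy frame: 'b' is zero in every row, 'c' is zero in only some rows
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [0, 0, 0], 'c': [0, 5, 0]})

# zeros become NaN, then only the columns that are NaN in every row are dropped
kept = toy.replace(0, np.nan).dropna(how='all', axis=1).columns.tolist()
dropped = [col for col in toy.columns if col not in kept]

print(kept)
print(dropped)
```

A column with even a single non-zero sample is kept, which is why columns that are zero in most rows still appear in the list.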

### Find approximate minimum and maximum values of each column:

In [None]:
# find the minimum and maximum values for each column, scale the range (reduce min by 15% and increase max by 15%), and store the results in a dictionary.
min_max_dict = {col: (port_samples[col].min() * 0.85, port_samples[col].max() * 1.15) for col in columns_to_gather}

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

Number of Ports                | Min: 850.00 | Max: 5771.85
Average Packet Length          | Min: 51.00 | Max: 85.10
Packet Length Min              | Min: 49.30 | Max: 85.10
Packet Length Max              | Min: 51.00 | Max: 85.10
Packet Length Std              | Min: 0.00 | Max: 0.07
Packet Length Variance         | Min: 0.00 | Max: 0.00
Total Length of Fwd Packet     | Min: 62101.00 | Max: 297086.40
Fwd Packet Length Max          | Min: 22.10 | Max: 46.00
Fwd Packet Length Mean         | Min: 22.10 | Max: 46.00
Fwd Packet Length Min          | Min: 22.10 | Max: 46.00
Bwd Packet Length Max          | Min: 0.00 | Max: 27.60
Bwd Packet Length Mean         | Min: 0.00 | Max: 27.60
Bwd Packet Length Min          | Min: 0.00 | Max: 27.60
Fwd Segment Size Avg           | Min: 0.00 | Max: 2.30
Subflow Fwd Bytes              | Min: 22.12 | Max: 47.19
SYN Flag Count                 | Min: 2388.50 | Max: 11426.40
ACK Flag Count                 | Min: 0.00 | Max: 10.35
RST Flag Count            

### Create the base attack dataset (full of zeros):

In [6]:
# creating a zero-filled dataframe before adding values to it
port_dataset = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(port_samples.columns))), columns = port_samples.columns)
port_dataset.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Find the columns with constant zero values based on samples:

In [7]:
# adding zeros to all columns that should not have any values
zero_columns = [col for col in port_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    port_dataset[col] = int(0)
zero_columns

['Fwd Packet Length Std', 'Bwd Packet Length Std', 'Bwd Segment Size Avg']

In [8]:
port_dataset.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---

## Filling in values based on collected samples:

## Calculate and fill values into columns that have a certain correlation between them:

A correlation between two or more columns is common in our dataset since most features are inherently related. All of them are derived from network packet traffic.<br>
For example, as the **flow duration increases**, the **packets per second** value is likely to decrease. This occurs because each flow has an upper limit on duration, after which data collection stops and a new flow begins.<br>  
Similarly, the **Inter-Arrival Time (IAT)** of packets within a flow is influenced by the flow duration.<br>
Given these dependencies, the attack dataset should generate data for these columns collectively, ensuring that their inherent correlations are maintained.
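
The dependency between flow duration and packets per second can be checked on the samples themselves. The sketch below uses values copied from sample rows 0, 1 and 5 shown earlier (rows without ACK or RST packets). In these rows the product of the two columns lands on the 'SYN Flag Count' value, which suggests that the rate is simply the packet count divided by the duration. The variable names are illustrative only:

```python
import numpy as np

# values copied from sample rows 0, 1 and 5 shown earlier
flow_duration = np.array([39.64862, 39.988767, 39.887447])
packets_per_second = np.array([99.095504, 98.85276, 70.448229])
syn_flag_count = np.array([3929, 3953, 2810])

# duration * rate gives an estimate of the number of packets in the flow
estimated_packets = flow_duration * packets_per_second

print(np.round(estimated_packets, 2))
print(np.abs(estimated_packets - syn_flag_count).max())
```

This is why the two columns cannot be drawn independently: for a fixed packet count, a longer flow necessarily has a lower rate.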

### Correlation between 'Number of Ports' and all the following: 'Total Length of Fwd Packet', 'SYN Flag Count':

In [None]:
# finding the correlation between the 'Number of Ports' column to the rest of the columns in order to create new data
first_correlation = ['Number of Ports', 'Total Length of Fwd Packet', 'SYN Flag Count']
independent_col = port_samples[first_correlation[0]].values.reshape(-1, 1) #column 'Number of Ports'
dependent_cols = port_samples[first_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Number of Ports' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('Total Length of Fwd Packet', np.float64(58.833888228624055))
('SYN Flag Count', np.float64(2.0010731757589486))
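
Because `independent_col` contains no column of ones, the fit above is a line through the origin, so each scaling factor is a single ratio. A small sketch, using only sample rows 0 to 2 shown earlier (so the number differs slightly from the factor printed above), compares `np.linalg.lstsq` with the closed form of that fit:

```python
import numpy as np

# values copied from sample rows 0-2 shown earlier
ports = np.array([1970.0, 1980.0, 1800.0])
syn = np.array([3929.0, 3953.0, 3591.0])

# least squares with a single column and no intercept term
slope_lstsq = np.linalg.lstsq(ports.reshape(-1, 1), syn, rcond=None)[0][0]

# closed form of the same fit: sum(x * y) / sum(x * x)
slope_closed = np.sum(ports * syn) / np.sum(ports * ports)

print(slope_lstsq, slope_closed)
```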


After finding the scaling factors we can apply some randomness when generating values for the attack dataset in order to generate better data (without many duplications).<br>
We add randomness by creating a modified scaling factor, which introduces controlled variations in the generated values.<br>
This is done by selecting a small random delta (between 1% and 2% of the factor) and adding or subtracting it from the original scaling factor.<br>
As a result, the generated data maintains realistic correlations while avoiding exact duplicates.
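
The perturbation described above can be sketched in vectorized form. This is an illustration under stated assumptions, not the code this notebook runs: the helper name `jittered_factors`, the factor `2.0` and the five port counts are hypothetical.

```python
import numpy as np

def jittered_factors(factor, size, low=0.01, high=0.02, rng=None):
    """Return `size` copies of `factor`, each moved up or down by low..high of its value."""
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.uniform(factor * low, factor * high, size=size)  # magnitude of the change
    sign = rng.choice([-1, 1], size=size)                        # direction of the change
    return factor + sign * delta

rng = np.random.default_rng(0)
factors = jittered_factors(2.0, size=5, rng=rng)
ports = np.array([1000, 2000, 3000, 4000, 5000])

# each row gets its own slightly different factor
print(ports * factors)
```

Every generated value stays within 1% to 2% of the value the unmodified factor would give, so the correlation is preserved while exact duplicates become unlikely.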

In [None]:
# generating random 'Number of Ports' values based on the sample data
port_dataset['Number of Ports'] = np.random.randint(min_max_dict['Number of Ports'][0]*0.9, min_max_dict['Number of Ports'][1]*1.10, NUM_OF_ROWS)

# generate new data by scaling the original correlated column value using the updated factor.
for index, row in port_dataset.iterrows():
    for col, factor in zip(first_correlation[1:], scaling_factors): #iterating over all generated scaling factors
        delta = random.uniform(factor[1] * 0.01, factor[1] * 0.02)
        updated_factor = factor[1] + random.choice([-1, 1]) * delta
        port_dataset.loc[index, col] = row['Number of Ports'] * updated_factor

In [11]:
port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,3477,0.0,0.0,0.0,0.0,0.0,207118.696308,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,7039.099088,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,4017,0.0,0.0,0.0,0.0,0.0,231726.964329,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,7905.590160,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1632,0.0,0.0,0.0,0.0,0.0,94959.857436,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,3233.077331,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1745,0.0,0.0,0.0,0.0,0.0,101075.580707,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,3429.253877,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,953,0.0,0.0,0.0,0.0,0.0,56927.715384,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,1938.123424,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,1512,0.0,0.0,0.0,0.0,0.0,87332.997876,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,2984.714298,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7496,5162,0.0,0.0,0.0,0.0,0.0,298746.038973,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,10146.837686,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7497,2562,0.0,0.0,0.0,0.0,0.0,153386.517166,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,5216.959902,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7498,928,0.0,0.0,0.0,0.0,0.0,53642.305955,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,1875.742428,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std':

In [None]:
# finding the correlation between the 'Flow Duration' column to the rest of the columns in order to create new data
second_correlation = ['Flow Duration', 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std']
independent_col = port_samples[second_correlation[0]].values.reshape(-1, 1) #column 'Flow Duration'
dependent_cols = port_samples[second_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Flow Duration' and the rest of the columns in second_correlation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(second_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('Packets Per Second', np.float64(3.4232599999586326))
('IAT Max', np.float64(0.016503150061282924))
('IAT Mean', np.float64(0.00024361763844503648))
('IAT Std', np.float64(0.0007367524780914209))


In [None]:
# generate random values for the 'Flow Duration' column
rand_values = np.random.uniform(min_max_dict['Flow Duration'][0]*0.9, min_max_dict['Flow Duration'][1]*1.05, size = NUM_OF_ROWS)

# assign the random values
port_dataset['Flow Duration'] = rand_values

port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,3477,0.0,0.0,0.0,0.0,0.0,207118.696308,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,7039.099088,0.0,0.0,9.702468,0.0,0.0,0.0,0.0
1,4017,0.0,0.0,0.0,0.0,0.0,231726.964329,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,7905.590160,0.0,0.0,11.416355,0.0,0.0,0.0,0.0
2,1632,0.0,0.0,0.0,0.0,0.0,94959.857436,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,3233.077331,0.0,0.0,21.412667,0.0,0.0,0.0,0.0
3,1745,0.0,0.0,0.0,0.0,0.0,101075.580707,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,3429.253877,0.0,0.0,21.270453,0.0,0.0,0.0,0.0
4,953,0.0,0.0,0.0,0.0,0.0,56927.715384,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,1938.123424,0.0,0.0,44.882708,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,1512,0.0,0.0,0.0,0.0,0.0,87332.997876,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,2984.714298,0.0,0.0,42.847998,0.0,0.0,0.0,0.0
7496,5162,0.0,0.0,0.0,0.0,0.0,298746.038973,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,10146.837686,0.0,0.0,48.355042,0.0,0.0,0.0,0.0
7497,2562,0.0,0.0,0.0,0.0,0.0,153386.517166,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,5216.959902,0.0,0.0,41.464769,0.0,0.0,0.0,0.0
7498,928,0.0,0.0,0.0,0.0,0.0,53642.305955,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,1875.742428,0.0,0.0,37.283102,0.0,0.0,0.0,0.0


In [14]:
# multiply the corresponding 'Flow Duration' and 'Packets Per Second' values of each sample row (duration * rate, i.e. roughly the packet count of the flow) and take the average of these products.
duration_to_packets_corr = [x * y for x, y in zip(port_samples['Flow Duration'].values, port_samples['Packets Per Second'].values)]
duration_to_packets_corr = np.mean(duration_to_packets_corr)
duration_to_packets_corr

np.float64(4832.368421052632)

As before, after finding the scaling factors we add some randomness and generate the data.

In [None]:
# calculate a random small delta of the factor for some randomness
for index, row in port_dataset.iterrows():
    for col, factor in scaling_factors: # iterating over all columns we need to fill, i.e. every column except 'Flow Duration'
        if col == 'Packets Per Second':
            delta = random.uniform(duration_to_packets_corr * 0.25, duration_to_packets_corr * 0.65) 
            updated_factor = duration_to_packets_corr + random.choice([-1, 1]) * delta
            port_dataset.loc[index, col] = updated_factor / row['Flow Duration']
        else:
            if col == 'IAT Std':
                delta = random.uniform(factor * 0.35, factor * 0.65)
                updated_factor = factor + random.choice([-1, 1]) * delta  
            else:
                delta = random.uniform(factor * 0.1, factor * 0.2) 
                updated_factor = factor + random.choice([-1, 1]) * delta

            if col == 'IAT Max':
                delta = random.uniform(factor * 0.6, factor * 0.99)
                updated_factor = factor + random.choices([-1, 1], weights=[1, 3], k=1)[0] * delta  
                port_dataset.loc[index, col] = (row['Flow Duration'] * updated_factor) * 2.3
            else:
                port_dataset.loc[index, col] = row['Flow Duration'] * updated_factor

### Correlation between 'Packet Length Std' and 'Packet Length Variance':

In [16]:
# insert values based on minimum and maximum values
port_dataset['Packet Length Std'] = np.random.uniform(min_max_dict['Packet Length Std'][0]*0.9, min_max_dict['Packet Length Std'][1]*1.1, size = NUM_OF_ROWS)

In [None]:
# finding the correlation between the 'Packet Length Std' column to the rest of the columns in order to create new data
third_correlation = ['Packet Length Std', 'Packet Length Variance']
independent_col = port_samples[third_correlation[0]].values.reshape(-1, 1) #column 'Packet Length Std'
dependent_cols = port_samples[third_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Packet Length Std' and 'Packet Length Variance'
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(third_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('Packet Length Variance', np.float64(0.0547351141554972))
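
The linear factor above is a least-squares approximation. Since variance is by definition the square of the standard deviation, the relationship in the samples can also be checked directly. The sketch below is offered as a cross-check, not as the method this notebook uses; the values are copied from the non-zero sample rows 3, 4, 6 and 7 shown earlier:

```python
import numpy as np

# non-zero values copied from sample rows 3, 4, 6 and 7 shown earlier
packet_length_std = np.array([0.060403, 0.048532, 0.049349, 0.060138])
packet_length_var = np.array([0.003649, 0.002355, 0.002435, 0.003617])

# variance is the square of the standard deviation
squared_std = packet_length_std ** 2

print(np.round(squared_std, 6))
print(np.abs(squared_std - packet_length_var).max())
```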


In [None]:
# generate new data by scaling the original correlated column value using the updated factor.
for index, row in port_dataset.iterrows():
    for col, factor in scaling_factors: #iterating over all generated scaling factors
        delta = random.uniform(factor * 0.05, factor * 0.1) 
        updated_factor = factor + random.choice([-1, 1]) * delta
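        # note: int() truncates toward zero, and with the ranges above the product is always well below 1, so 0 is stored in every row
        port_dataset.loc[index, col] = int(row['Packet Length Std'] * updated_factor) 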

Based on our collected sample dataset, the values in these columns are sometimes zero and sometimes non-zero. That is why we select half of the cells in each vector (the same row indexes in both columns) and insert zeros into them, while the other half keep values between the minimum and maximum values.

In [19]:
mask = np.random.rand(NUM_OF_ROWS) > 0.5 # randomly choose 50% of the cells in the vector
port_dataset.loc[mask, 'Packet Length Std'] = 0
port_dataset.loc[mask, 'Packet Length Variance'] = 0
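
Reusing one boolean mask is what keeps the zeros aligned across columns. A minimal sketch on a toy frame (the row count of 8 and the value ranges are hypothetical) shows the effect:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 8  # small hypothetical row count, for illustration only

toy = pd.DataFrame({
    'Packet Length Std': rng.uniform(0.01, 0.07, size=n),
    'Packet Length Variance': rng.uniform(0.001, 0.004, size=n),
})

# one mask, reused for both columns, so the zeros land in the same rows
mask = rng.random(n) > 0.5
toy.loc[mask, ['Packet Length Std', 'Packet Length Variance']] = 0

print(toy)
```

Drawing a separate mask per column would instead produce rows where only one of the two columns is zero, which the samples never show.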

### Then we insert data into the 'Fwd Segment Size Avg' column:

This column also holds either a zero or a non-zero number, and the sample data shows that its zeros coincide with zeros in the 'Packet Length Std' and 'Packet Length Variance' columns. Therefore we use the same mask to insert the zeros into the same row indexes as in the other columns.

In [20]:
port_dataset['Fwd Segment Size Avg'] = np.random.uniform(min_max_dict['Fwd Segment Size Avg'][0]*0.9, min_max_dict['Fwd Segment Size Avg'][1]*1.1, size = NUM_OF_ROWS)
port_dataset.loc[mask, 'Fwd Segment Size Avg'] = 0

In [21]:
port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,3477,0.0,0.0,0.0,0.000000,0.0,207118.696308,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.000000,0,0.0,7039.099088,0.0,0.0,9.702468,676.679389,0.694347,0.001921,0.010772
1,4017,0.0,0.0,0.0,0.000000,0.0,231726.964329,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.000000,0,0.0,7905.590160,0.0,0.0,11.416355,162.890749,0.742951,0.002254,0.005349
2,1632,0.0,0.0,0.0,0.000000,0.0,94959.857436,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.000000,0,0.0,3233.077331,0.0,0.0,21.412667,168.975752,1.422753,0.006244,0.025178
3,1745,0.0,0.0,0.0,0.037832,0.0,101075.580707,0.0,0.0,0.0,0,0.0,0.0,0.0,0,1.520240,0,0.0,3429.253877,0.0,0.0,21.270453,147.918086,1.584327,0.004197,0.008933
4,953,0.0,0.0,0.0,0.000000,0.0,56927.715384,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.000000,0,0.0,1938.123424,0.0,0.0,44.882708,157.078777,3.238709,0.012341,0.045918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,1512,0.0,0.0,0.0,0.000000,0.0,87332.997876,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.000000,0,0.0,2984.714298,0.0,0.0,42.847998,174.718756,2.689744,0.011584,0.019245
7496,5162,0.0,0.0,0.0,0.030296,0.0,298746.038973,0.0,0.0,0.0,0,0.0,0.0,0.0,0,1.692078,0,0.0,10146.837686,0.0,0.0,48.355042,150.379049,3.611664,0.010030,0.017721
7497,2562,0.0,0.0,0.0,0.000000,0.0,153386.517166,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.000000,0,0.0,5216.959902,0.0,0.0,41.464769,185.818164,2.720942,0.008724,0.012590
7498,928,0.0,0.0,0.0,0.016411,0.0,53642.305955,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.817855,0,0.0,1875.742428,0.0,0.0,37.283102,53.647834,2.773919,0.007339,0.015650


### Then we insert data into columns that have the exact same values:

The following columns hold exactly the same values, so we generate a single vector and insert it into all of these columns.

In [22]:
same_values = ['Fwd Packet Length Max', 'Fwd Packet Length Mean', 'Fwd Packet Length Min']

# Generate random values shared by the 'Fwd Packet Length' columns
rand_values = np.random.randint(min_max_dict['Fwd Packet Length Max'][0]*0.9, min_max_dict['Fwd Packet Length Max'][1]*1.1, size = NUM_OF_ROWS)

# Assign the random values
for col in same_values:
    port_dataset[col] = rand_values

The 'Subflow Fwd Bytes' column has approximately the same values as 'Fwd Packet Length Max', 'Fwd Packet Length Mean' and 'Fwd Packet Length Min'. That is why we insert slightly adjusted values from rand_values into the 'Subflow Fwd Bytes' column.

In [None]:
adjustment_factor = np.random.uniform(0.9995, 1.0005, size = NUM_OF_ROWS)
subflow_fwd_bytes = rand_values * adjustment_factor
port_dataset['Subflow Fwd Bytes'] = subflow_fwd_bytes

### Then we insert values into columns that have approximately the same values as one another:

When generating data for the following columns we make sure that the generated values are consistent, in the sense that the minimum value should not exceed the mean and the mean should not exceed the maximum value <u>in each row</u> of the attack dataset.<br>  
In the sample dataset the values in these columns are sometimes exactly the same and sometimes different. Therefore we randomly select 50% of the rows to share one value, and the rest get some variance within the acceptable range.

In [None]:
approx_same = ['Average Packet Length', 'Packet Length Min', 'Packet Length Max']

# Generate random values for 'Packet Length Max'
packet_length_max = np.random.randint(min_max_dict['Packet Length Max'][0] * 0.9, min_max_dict['Packet Length Max'][1] * 1.1, NUM_OF_ROWS)

# Decide whether to copy or adjust based on a condition or randomly
copy_values = np.random.choice([True, False], size = NUM_OF_ROWS)  # Randomly decide whether to copy values or not

# Create 'Packet Length Min' from 'Packet Length Max': copy the value where copy_values is True, otherwise apply a small variation
packet_length_min = np.where(copy_values, packet_length_max, packet_length_max + np.random.uniform(-2, 2, NUM_OF_ROWS))
# Clip so that the minimum never exceeds the maximum
packet_length_min = np.minimum(packet_length_min, packet_length_max)

# 'Average Packet Length' is the midpoint of min and max when they differ, otherwise the shared value
average_packet_length = np.where(packet_length_max != packet_length_min, (packet_length_max + packet_length_min) / 2, packet_length_min)

# Assign the values to the dataset
port_dataset['Packet Length Max'] = packet_length_max
port_dataset['Average Packet Length'] = average_packet_length
port_dataset['Packet Length Min'] = packet_length_min.astype(int)
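
The per-row ordering described above (minimum, then mean, then maximum) can be verified with a short check. The sketch below rebuilds the three vectors the same way on a hypothetical scale (1,000 rows, maximum drawn from 45 to 93) and asserts the ordering; the names `pkt_min`, `pkt_avg` and `pkt_max` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical row count for the check

# same construction as in the cell above, with a hypothetical range for the maximum
pkt_max = rng.integers(45, 94, size=n)
copy_values = rng.choice([True, False], size=n)
pkt_min = np.where(copy_values, pkt_max, pkt_max + rng.uniform(-2, 2, size=n))
pkt_min = np.minimum(pkt_min, pkt_max)   # the minimum may never exceed the maximum
pkt_avg = (pkt_max + pkt_min) / 2        # midpoint, equal to both when min == max

# per-row ordering: min <= mean <= max
ordered = (pkt_min <= pkt_avg) & (pkt_avg <= pkt_max)
print(ordered.all())
```

The same boolean expression can be run against the three columns of the generated dataset as a quick sanity check.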

In [25]:
port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,3477,75.000000,75,75,0.000000,0.0,207118.696308,23,23,23,0,0.0,0.0,0.0,0,0.000000,0,22.992134,7039.099088,0.0,0.0,9.702468,676.679389,0.694347,0.001921,0.010772
1,4017,76.000000,76,76,0.000000,0.0,231726.964329,25,25,25,0,0.0,0.0,0.0,0,0.000000,0,24.995169,7905.590160,0.0,0.0,11.416355,162.890749,0.742951,0.002254,0.005349
2,1632,71.000000,71,71,0.000000,0.0,94959.857436,23,23,23,0,0.0,0.0,0.0,0,0.000000,0,23.006447,3233.077331,0.0,0.0,21.412667,168.975752,1.422753,0.006244,0.025178
3,1745,92.000000,92,92,0.037832,0.0,101075.580707,38,38,38,0,0.0,0.0,0.0,0,1.520240,0,38.011261,3429.253877,0.0,0.0,21.270453,147.918086,1.584327,0.004197,0.008933
4,953,65.000000,65,65,0.000000,0.0,56927.715384,27,27,27,0,0.0,0.0,0.0,0,0.000000,0,26.987657,1938.123424,0.0,0.0,44.882708,157.078777,3.238709,0.012341,0.045918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,1512,58.429868,57,59,0.000000,0.0,87332.997876,23,23,23,0,0.0,0.0,0.0,0,0.000000,0,23.009665,2984.714298,0.0,0.0,42.847998,174.718756,2.689744,0.011584,0.019245
7496,5162,49.000000,49,49,0.030296,0.0,298746.038973,28,28,28,0,0.0,0.0,0.0,0,1.692078,0,27.999102,10146.837686,0.0,0.0,48.355042,150.379049,3.611664,0.010030,0.017721
7497,2562,82.000000,82,82,0.000000,0.0,153386.517166,28,28,28,0,0.0,0.0,0.0,0,0.000000,0,27.989323,5216.959902,0.0,0.0,41.464769,185.818164,2.720942,0.008724,0.012590
7498,928,55.000000,55,55,0.016411,0.0,53642.305955,42,42,42,0,0.0,0.0,0.0,0,0.817855,0,42.009474,1875.742428,0.0,0.0,37.283102,53.647834,2.773919,0.007339,0.015650


### Correlation between 'Bwd Packet Length Max' and the rest of 'Bwd Packet Length Mean', 'Bwd Packet Length Min', 'ACK Flag Count', 'RST Flag Count':

As we see in our sample dataset, these columns are correlated, but they are also usually zero. That is why we randomly select 25% of the rows to have values and the rest to be zero.<br> In the rows that have values, the three 'Bwd Packet Length' columns share one random value and the two flag count columns share another, which reproduces the relationship seen in the samples.

In [None]:
backward_flags = ['Bwd Packet Length Max', 'Bwd Packet Length Mean', 'Bwd Packet Length Min', 'ACK Flag Count', 'RST Flag Count']

# define probability distribution: 25% True, 75% False
probability = [0.25, 0.75]

# decide whether to use backward flags (True or False) based on the probability for each row
has_backward_flags = np.random.choice([True, False], size = NUM_OF_ROWS, p = probability)

# check if the value should be True or False for each row
for i in range(NUM_OF_ROWS):
    if has_backward_flags[i]: # if True, generate random values for Bwd Packet Length and Flag Count
        bwd_vector = np.random.randint(16, min_max_dict['Bwd Packet Length Max'][1] * 1.15)
        flag_vector = np.random.randint(2, min_max_dict['ACK Flag Count'][1] * 1.15)
        
        # apply the values to the backward packets and then the flags, each with their respective vector
        for col in backward_flags[:3]:
            port_dataset.at[i, col] = bwd_vector
        
        for col in backward_flags[3:]:
            port_dataset.at[i, col] = flag_vector
    
    else: # if False, set only the current row to zero for all backward flags
        for col in backward_flags:
            port_dataset.at[i, col] = 0
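
The row-by-row loop above can also be expressed with `np.where`. The following is a sketch of an equivalent vectorized approach on a toy frame, not the code this notebook runs; the row count and the value ranges (16 to 30 and 2 to 10) are hypothetical stand-ins for the ranges derived from `min_max_dict`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000  # hypothetical row count

has_backward = rng.random(n) < 0.25          # roughly 25% of the rows get values
bwd_values = rng.integers(16, 31, size=n)    # hypothetical range for the Bwd packet lengths
flag_values = rng.integers(2, 11, size=n)    # hypothetical range for the ACK/RST counts

toy = pd.DataFrame(index=range(n))
for col in ['Bwd Packet Length Max', 'Bwd Packet Length Mean', 'Bwd Packet Length Min']:
    toy[col] = np.where(has_backward, bwd_values, 0)   # shared value or zero
for col in ['ACK Flag Count', 'RST Flag Count']:
    toy[col] = np.where(has_backward, flag_values, 0)  # shared value or zero

print(toy.head())
print(has_backward.mean())
```

Each selected row carries the same value across the three length columns and the same value across the two flag columns, while every other row stays at zero.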

---

## Adding the Label column and adjusting some columns to have integer values:

In [27]:
# making the SYN Flag Count column have int values instead of floats
port_dataset['SYN Flag Count'] = port_dataset['SYN Flag Count'].astype(int)

# adding a label to the dataset
port_dataset['Label'] = ATTACK_NAME

---

## Check that the generated data looks plausible by comparing the samples with the generated dataset:

In [28]:
port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std,Label
0,3477,75.000000,75,75,0.000000,0.0,207118.696308,23,23,23,0,0.0,0.0,0.0,0,0.000000,0,22.992134,7039,0.0,0.0,9.702468,676.679389,0.694347,0.001921,0.010772,PortScan
1,4017,76.000000,76,76,0.000000,0.0,231726.964329,25,25,25,0,0.0,0.0,0.0,0,0.000000,0,24.995169,7905,0.0,0.0,11.416355,162.890749,0.742951,0.002254,0.005349,PortScan
2,1632,71.000000,71,71,0.000000,0.0,94959.857436,23,23,23,0,23.0,23.0,23.0,0,0.000000,0,23.006447,3233,7.0,7.0,21.412667,168.975752,1.422753,0.006244,0.025178,PortScan
3,1745,92.000000,92,92,0.037832,0.0,101075.580707,38,38,38,0,16.0,16.0,16.0,0,1.520240,0,38.011261,3429,4.0,4.0,21.270453,147.918086,1.584327,0.004197,0.008933,PortScan
4,953,65.000000,65,65,0.000000,0.0,56927.715384,27,27,27,0,0.0,0.0,0.0,0,0.000000,0,26.987657,1938,0.0,0.0,44.882708,157.078777,3.238709,0.012341,0.045918,PortScan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,1512,58.429868,57,59,0.000000,0.0,87332.997876,23,23,23,0,0.0,0.0,0.0,0,0.000000,0,23.009665,2984,0.0,0.0,42.847998,174.718756,2.689744,0.011584,0.019245,PortScan
7496,5162,49.000000,49,49,0.030296,0.0,298746.038973,28,28,28,0,0.0,0.0,0.0,0,1.692078,0,27.999102,10146,0.0,0.0,48.355042,150.379049,3.611664,0.010030,0.017721,PortScan
7497,2562,82.000000,82,82,0.000000,0.0,153386.517166,28,28,28,0,30.0,30.0,30.0,0,0.000000,0,27.989323,5216,7.0,7.0,41.464769,185.818164,2.720942,0.008724,0.012590,PortScan
7498,928,55.000000,55,55,0.016411,0.0,53642.305955,42,42,42,0,0.0,0.0,0.0,0,0.817855,0,42.009474,1875,0.0,0.0,37.283102,53.647834,2.773919,0.007339,0.015650,PortScan


In [29]:
port_samples.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0
mean,2387.526316,66.631196,66.105263,66.631579,0.01411,0.000764,150824.842105,32.631579,32.631579,32.631579,0.0,6.315789,6.315789,6.315789,0.0,1.053397,0.0,33.236968,4830.789474,1.578947,1.578947,36.960596,161.40442,0.627105,0.008909,0.02711
std,1100.883199,7.182221,7.730853,7.181848,0.024427,0.001351,48603.855729,7.181848,7.181848,7.181848,0.0,10.857934,10.857934,10.857934,0.0,1.026725,0.0,7.314582,2104.368646,3.005842,3.005842,6.933406,182.503211,0.719782,0.003132,0.011919
min,1000.0,59.998174,58.0,60.0,0.0,0.0,73060.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26.026425,2810.0,0.0,0.0,11.427656,70.448229,0.038543,0.001149,0.002328
25%,1880.0,59.999411,59.0,60.0,0.0,0.0,112749.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26.230768,3760.0,0.0,0.0,38.380666,96.947314,0.130432,0.008012,0.019438
50%,1994.0,60.0,60.0,60.0,0.0,0.0,158960.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,28.282899,3974.0,0.0,0.0,39.700051,99.53883,0.152234,0.010049,0.029792
75%,2618.5,74.0,74.0,74.0,0.024266,0.001178,163941.0,40.0,40.0,40.0,0.0,12.0,12.0,12.0,0.0,2.001179,0.0,41.00323,5232.0,2.0,2.0,39.960304,133.336021,1.100399,0.010318,0.030047
max,5019.0,74.0,74.0,74.0,0.060403,0.003649,258336.0,40.0,40.0,40.0,0.0,24.0,24.0,24.0,0.0,2.003655,0.0,41.033325,9936.0,9.0,9.0,40.052014,870.257194,2.878662,0.0142,0.063127


In [30]:
port_dataset.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0
mean,3539.894,68.299105,68.0492,68.421867,0.019488,0.0,208305.912605,34.086133,34.086133,34.086133,0.0,5.854133,5.854133,5.854133,0.0,0.631692,0.0,34.086017,7082.651867,1.496933,1.496933,28.422239,211.51538,1.514942,0.006915,0.020533
std,1611.957026,13.779616,13.798354,13.776816,0.024827,0.0,94938.112061,8.956849,8.956849,8.956849,0.0,10.263253,10.263253,10.263253,0.0,0.809579,0.0,8.956747,3226.493142,2.878407,2.878407,11.561074,157.144298,1.017796,0.003017,0.01415
min,765.0,44.01853,43.0,45.0,0.0,0.0,44510.427802,19.0,19.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.990692,1502.0,0.0,0.0,8.744011,36.176157,0.003702,0.00172,0.002287
25%,2135.75,56.118742,56.0,57.0,0.0,0.0,125878.001924,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26.008816,4287.5,0.0,0.0,18.382721,93.043717,0.614821,0.004361,0.009863
50%,3550.5,68.0,68.0,68.0,0.000913,0.0,208275.680621,34.0,34.0,34.0,0.0,0.0,0.0,0.0,0.0,0.028149,0.0,33.997773,7118.0,0.0,0.0,28.33949,170.712999,1.487296,0.006746,0.015392
75%,4942.0,80.0,80.0,80.0,0.039235,0.0,289958.290559,42.0,42.0,42.0,0.0,16.0,16.0,16.0,0.0,1.233378,0.0,41.998238,9876.25,2.0,2.0,38.524347,265.873154,2.392359,0.009136,0.03025
max,6348.0,92.0,92.0,92.0,0.076382,0.0,379949.722741,49.0,49.0,49.0,0.0,30.0,30.0,30.0,0.0,2.53398,0.0,49.024332,12913.0,10.0,10.0,48.362035,906.025486,3.63278,0.014008,0.05851


---

## Load the second sample dataset:

The following code creates another attack dataset, this time based on a different sample dataset. The code in this section<br> 
is mostly the same as the code used so far in the notebook, therefore we will not repeat the same explanations here.<br>  
For the second sample we intentionally generate more rows than we need, because at the end we select 7,500 rows out of this second dataset. The rows we keep are those with 'Number of Ports' >= 120.

In [31]:
NUM_OF_ROWS = 12500 

In [32]:
# import the attack sample dataset
port_samples = pd.read_csv('portscan_closed_port_samples_2.csv')
print(f'Dataset Shape: {port_samples.shape}')
port_samples

Dataset Shape: (10, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,100,60.0,60,60,0.0,0.0,5200,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,27.368421,200,0,0,2.920097,68.490873,1.103361,0.014674,0.082441
1,120,60.0,60,60,0.0,0.0,6240,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,27.130435,240,0,0,3.34079,71.839295,1.102872,0.013978,0.075529
2,120,60.0,60,60,0.0,0.0,6240,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,27.130435,240,0,0,3.325836,72.162309,1.10776,0.013916,0.076588
3,140,60.0,60,60,0.0,0.0,7280,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,26.962963,280,0,0,3.753263,74.601753,1.109797,0.013453,0.071323
4,240,60.0,60,60,0.0,0.0,12480,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,26.553191,480,0,0,5.749811,83.481006,1.105315,0.012004,0.05814
5,180,60.0,60,60,0.0,0.0,9360,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,26.742857,360,0,0,4.554911,79.03557,1.103638,0.012688,0.063769
6,280,60.0,60,60,0.0,0.0,14560,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,26.472727,560,0,0,6.582215,85.077743,1.1115,0.011775,0.055129
7,150,60.0,60,60,0.0,0.0,7800,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,26.896552,300,0,0,3.944551,76.054284,1.105556,0.013192,0.069209
8,190,60.0,60,60,0.0,0.0,9880,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,26.702703,380,0,0,4.743357,80.112036,1.105079,0.012515,0.061978
9,220,60.0,60,60,0.0,0.0,11440,26,26.0,26,0.0,0,0.0,0,0.0,2.0,0.0,26.604651,440,0,0,5.354074,82.180411,1.10544,0.012196,0.058847


### Find the columns that we need to synthesize data for:

In [33]:
# find the columns that we need to synthesize data for to produce an attack dataset
columns_to_gather = port_samples.replace(0, np.nan) #replace all 0 values with null
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() #remove only the columns in which every value is null
columns_to_gather #left with all columns that contain non-zero values (we know for a fact that the data is consistent and there are no missing values in the rows)

['Number of Ports',
 'Average Packet Length',
 'Packet Length Min',
 'Packet Length Max',
 'Total Length of Fwd Packet',
 'Fwd Packet Length Max',
 'Fwd Packet Length Mean',
 'Fwd Packet Length Min',
 'Fwd Segment Size Avg',
 'Subflow Fwd Bytes',
 'SYN Flag Count',
 'Flow Duration',
 'Packets Per Second',
 'IAT Max',
 'IAT Mean',
 'IAT Std']
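
The column selection above can be illustrated on a small stand-in frame. This is only a sketch: `toy_samples` and its values are made up for illustration and are not part of the collected samples. It shows that `dropna(how='all', axis=1)` removes a column only when every value in it was zero, so a column with a single zero is kept.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sample dataset (illustrative values only)
toy_samples = pd.DataFrame({
    'Number of Ports': [100, 120, 140],
    'ACK Flag Count': [0, 0, 0],            # all zeros -> dropped
    'Packet Length Std': [0.0, 0.0, 0.0],   # all zeros -> dropped
    'IAT Max': [1.10, 0.0, 1.11],           # only one zero -> kept
})

# zeros become NaN, then only the columns that are entirely NaN are removed
kept_columns = toy_samples.replace(0, np.nan).dropna(how='all', axis=1).columns.tolist()
print(kept_columns)
```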

### Find approximate minimum and maximum values of each column:

In [34]:
# find the minimum and maximum values for each column, scale the range (reduce min by 15% and increase max by 15%), and store the results in a dictionary.
min_max_dict = {col: (port_samples[col].min() * 0.85, port_samples[col].max() * 1.15) for col in columns_to_gather}

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

Number of Ports                | Min: 85.00 | Max: 322.00
Average Packet Length          | Min: 51.00 | Max: 69.00
Packet Length Min              | Min: 51.00 | Max: 69.00
Packet Length Max              | Min: 51.00 | Max: 69.00
Total Length of Fwd Packet     | Min: 4420.00 | Max: 16744.00
Fwd Packet Length Max          | Min: 22.10 | Max: 29.90
Fwd Packet Length Mean         | Min: 22.10 | Max: 29.90
Fwd Packet Length Min          | Min: 22.10 | Max: 29.90
Fwd Segment Size Avg           | Min: 1.70 | Max: 2.30
Subflow Fwd Bytes              | Min: 22.50 | Max: 31.47
SYN Flag Count                 | Min: 170.00 | Max: 644.00
Flow Duration                  | Min: 2.48 | Max: 7.57
Packets Per Second             | Min: 58.22 | Max: 97.84
IAT Max                        | Min: 0.94 | Max: 1.28
IAT Mean                       | Min: 0.01 | Max: 0.02
IAT Std                        | Min: 0.05 | Max: 0.09


### Create the base attack dataset (full of zeros):

In [36]:
# creating an empty dataframe before adding values to it
port_dataset2 = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(port_samples.columns))), columns = port_samples.columns)

### Find the columns with constant zero values based on samples:

In [37]:
# adding zeros to all columns that should not have any values
zero_columns = [col for col in port_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    port_dataset2[col] = int(0)
zero_columns

['Packet Length Std',
 'Packet Length Variance',
 'Fwd Packet Length Std',
 'Bwd Packet Length Max',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Min',
 'Bwd Packet Length Std',
 'Bwd Segment Size Avg',
 'ACK Flag Count',
 'RST Flag Count']

---

## Filling in values based on collected samples:

## Calculate and fill values into columns that have a certain correlation between them:

### Correlation between 'Number of Ports' and all the following: 'Total Length of Fwd Packet', 'SYN Flag Count':

In [None]:
first_correlation = ['Number of Ports', 'Total Length of Fwd Packet', 'SYN Flag Count']

# finding the correlation between the 'Number of Ports' column to the rest of the columns in order to create new data
independent_col = port_samples[first_correlation[0]].values.reshape(-1, 1) #column 'Number of Ports'
dependent_cols = port_samples[first_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Number of Ports' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)
    
# generating random 'Number of Ports' values within a widened range based on the sample data
port_dataset2['Number of Ports'] = np.random.randint(min_max_dict['Number of Ports'][0]*0.9, min_max_dict['Number of Ports'][1]*1.10, NUM_OF_ROWS)

# generate new data by scaling the original correlated column value using the updated factor.
for index, row in port_dataset2.iterrows():
    for col, factor in scaling_factors: #iterating over all (column name, scaling factor) pairs
        delta = random.uniform(factor * 0.01, factor * 0.02) 
        updated_factor = factor + random.choice([-1, 1]) * delta
        port_dataset2.loc[index, col] = int(row['Number of Ports'] * updated_factor)

('Total Length of Fwd Packet', np.float64(51.99999999999999))
('SYN Flag Count', np.float64(2.0))
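
To see why the factors come out as 52 and 2, here is a minimal sketch using three rows from the second sample dataset. Because the design matrix has a single column and no intercept, `np.linalg.lstsq` fits a line through the origin, and the slope has the closed form sum(x*y) / sum(x*x). The variable names below are chosen for illustration only.

```python
import numpy as np

# three rows from the second sample dataset ('Number of Ports' = 100, 120, 140)
ports = np.array([100, 120, 140], dtype=float).reshape(-1, 1)
targets = np.array([[5200, 200],
                    [6240, 240],
                    [7280, 280]], dtype=float)  # 'Total Length of Fwd Packet', 'SYN Flag Count'

# single column, no intercept: least squares fits y = k * x for each target column
factors = np.linalg.lstsq(ports, targets, rcond=None)[0].flatten()

# closed form of the same fit: k = sum(x * y) / sum(x * x)
closed_form = (ports * targets).sum(axis=0) / (ports ** 2).sum()

print(factors)
print(closed_form)
```

In these samples every probed port contributes 52 bytes of forward payload and 2 SYN packets, which is why the fit is essentially exact.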


### Correlation between 'Number of Ports' and all of the following: 'Flow Duration', 'IAT Mean', 'IAT Std':

In [None]:
second_correlation = ['Number of Ports', 'Flow Duration', 'IAT Mean', 'IAT Std'] 

# finding the correlation between the Number of Ports column to the rest of the columns in order to create new data
independent_col = port_samples[second_correlation[0]].values.reshape(-1, 1) #column 'Number of Ports'
dependent_cols = port_samples[second_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Number of Ports' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(second_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('Flow Duration', np.float64(0.024958470011985968))
('IAT Mean', np.float64(6.650583610346264e-05))
('IAT Std', np.float64(0.000336893054787924))


In [None]:
#iterating over all rows we need to add values
for index, row in port_dataset2.iterrows():
    for col, factor in scaling_factors: #iterating over all generated scaling factors
        if col == 'Flow Duration':
            delta = random.uniform(factor * 0.05, factor * 0.1) 
        elif col == 'IAT Std':
            delta = random.uniform(factor * 0.05, factor * 0.2) * random.choice([-1, 1]) 
        else:
            delta = random.uniform(factor * 0.1, factor * 0.25) 
        updated_factor = factor + delta
        port_dataset2.loc[index, col] = row['Number of Ports'] * updated_factor
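
Assigning with `.loc` inside `iterrows()` works, but it is slow for 12,500 rows. The following is a hedged sketch of a vectorized alternative that draws from the same uniform ranges. It is not the code used to produce the outputs in this notebook: the helper name `scale_with_jitter` is hypothetical, and it uses a NumPy generator instead of the `random` module, so the individual values will differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for illustration

def scale_with_jitter(ports, factor, low, high, signed=False):
    """Return ports * factor * (1 + jitter), with jitter drawn uniformly from [low, high)."""
    jitter = rng.uniform(low, high, size=len(ports))
    if signed:
        # randomly flip the sign, as done for 'IAT Std' in the loop above
        jitter = jitter * rng.choice([-1, 1], size=len(ports))
    return ports * factor * (1 + jitter)

ports = pd.Series([175, 283, 134], name='Number of Ports')
duration_factor = 0.024958470011985968  # 'Flow Duration' factor printed above
iat_std_factor = 0.000336893054787924   # 'IAT Std' factor printed above

flow_duration = scale_with_jitter(ports, duration_factor, 0.05, 0.10)
iat_std = scale_with_jitter(ports, iat_std_factor, 0.05, 0.20, signed=True)
print(flow_duration.round(4).tolist())
```

Note that `factor + delta` with `delta` drawn from `[factor * low, factor * high]` is the same as `factor * (1 + jitter)`, which is what the helper computes.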

### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max':

In [None]:
packets_per_second = 63.5 + (port_dataset2['Flow Duration'] - 2.0) * (35 / 7.5) #linear transformation
port_dataset2['Packets Per Second'] = np.clip(packets_per_second, 63.5, 98.75) #ensure within range

iat_max = 1.100 + (port_dataset2['Flow Duration'] - 2.0) * (0.013 / 7.5) + np.random.uniform(-0.002, 0.002, size = NUM_OF_ROWS)
port_dataset2['IAT Max'] = np.clip(iat_max, 1.100, 1.113) #ensure within range
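
As a worked example of the linear transformation above, the sketch below applies the same formula and clipping bounds to a few hand-picked durations. The function name `packets_per_second_from_duration` is hypothetical and only wraps the two lines from the cell above.

```python
import numpy as np

def packets_per_second_from_duration(flow_duration):
    """Map 'Flow Duration' linearly onto 'Packets Per Second', then clip to [63.5, 98.75]."""
    raw = 63.5 + (np.asarray(flow_duration, dtype=float) - 2.0) * (35 / 7.5)
    return np.clip(raw, 63.5, 98.75)

# 4.620928 is the 'Flow Duration' of row 0 in the generated dataset
durations = np.array([1.0, 2.0, 4.620928, 9.5, 12.0])
pps = packets_per_second_from_duration(durations)
print(pps)
```

Durations below 2.0 are clipped to the lower bound of 63.5, a duration of 9.5 maps to 98.5, and anything above roughly 9.55 is clipped to 98.75.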

In [42]:
port_dataset2

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,175,0.0,0.0,0.0,0,0,9236.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,353.0,0,0,4.620928,75.730999,1.104598,0.014289,0.061956
1,283,0.0,0.0,0.0,0,0,14935.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,557.0,0,0,7.740055,90.286925,1.111589,0.021027,0.085428
2,293,0.0,0.0,0.0,0,0,15411.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,597.0,0,0,7.826857,90.692001,1.109409,0.022244,0.113244
3,266,0.0,0.0,0.0,0,0,14047.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,541.0,0,0,7.004469,86.854189,1.109694,0.022012,0.085030
4,134,0.0,0.0,0.0,0,0,6834.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,264.0,0,0,3.599746,70.965479,1.104583,0.009881,0.040319
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12495,290,0.0,0.0,0.0,0,0,15307.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,589.0,0,0,7.834477,90.727561,1.110409,0.021315,0.114594
12496,256,0.0,0.0,0.0,0,0,13468.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,504.0,0,0,6.761352,85.719642,1.110073,0.020689,0.098980
12497,130,0.0,0.0,0.0,0,0,6886.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,256.0,0,0,3.545366,70.711707,1.102331,0.010230,0.038209
12498,199,0.0,0.0,0.0,0,0,10185.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0.0,393.0,0,0,5.424459,79.480809,1.105030,0.014984,0.062178


### Then fill values into columns that are not related to each other:

In [43]:
port_dataset2['Fwd Segment Size Avg'] = np.full(NUM_OF_ROWS, 2.0)
port_dataset2['Subflow Fwd Bytes'] = np.random.uniform(min_max_dict['Subflow Fwd Bytes'][0]*0.95, min_max_dict['Subflow Fwd Bytes'][1]*1.05, NUM_OF_ROWS)
port_dataset2

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,175,0.0,0.0,0.0,0,0,9236.0,0.0,0.0,0.0,0,0,0,0,0,2.0,0,22.094612,353.0,0,0,4.620928,75.730999,1.104598,0.014289,0.061956
1,283,0.0,0.0,0.0,0,0,14935.0,0.0,0.0,0.0,0,0,0,0,0,2.0,0,27.741205,557.0,0,0,7.740055,90.286925,1.111589,0.021027,0.085428
2,293,0.0,0.0,0.0,0,0,15411.0,0.0,0.0,0.0,0,0,0,0,0,2.0,0,32.050201,597.0,0,0,7.826857,90.692001,1.109409,0.022244,0.113244
3,266,0.0,0.0,0.0,0,0,14047.0,0.0,0.0,0.0,0,0,0,0,0,2.0,0,28.629255,541.0,0,0,7.004469,86.854189,1.109694,0.022012,0.085030
4,134,0.0,0.0,0.0,0,0,6834.0,0.0,0.0,0.0,0,0,0,0,0,2.0,0,28.209263,264.0,0,0,3.599746,70.965479,1.104583,0.009881,0.040319
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12495,290,0.0,0.0,0.0,0,0,15307.0,0.0,0.0,0.0,0,0,0,0,0,2.0,0,24.171462,589.0,0,0,7.834477,90.727561,1.110409,0.021315,0.114594
12496,256,0.0,0.0,0.0,0,0,13468.0,0.0,0.0,0.0,0,0,0,0,0,2.0,0,30.663721,504.0,0,0,6.761352,85.719642,1.110073,0.020689,0.098980
12497,130,0.0,0.0,0.0,0,0,6886.0,0.0,0.0,0.0,0,0,0,0,0,2.0,0,25.671725,256.0,0,0,3.545366,70.711707,1.102331,0.010230,0.038209
12498,199,0.0,0.0,0.0,0,0,10185.0,0.0,0.0,0.0,0,0,0,0,0,2.0,0,22.925432,393.0,0,0,5.424459,79.480809,1.105030,0.014984,0.062178


### Then we insert data into columns that have the exact same values:

In [None]:
same_values1 = ['Average Packet Length', 'Packet Length Min', 'Packet Length Max']

# generate random values for the 'Average Packet Length' column
rand_values = np.random.randint(min_max_dict['Average Packet Length'][0]*0.95, min_max_dict['Average Packet Length'][1]*1.05, size = NUM_OF_ROWS)

# assign the random values
for col in same_values1:
    port_dataset2[col] = rand_values

In [None]:
same_values2 = ['Fwd Packet Length Max', 'Fwd Packet Length Mean', 'Fwd Packet Length Min']

# generate random values for the 'Fwd Packet Length Max' column
rand_values = np.random.randint(min_max_dict['Fwd Packet Length Max'][0]*0.95, min_max_dict['Fwd Packet Length Max'][1]*1.05, size = NUM_OF_ROWS)

# assign the random values
for col in same_values2:
    port_dataset2[col] = rand_values
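
The two cells above pass non-integer bounds to `np.random.randint`. The sketch below makes the bounds explicit, assuming they are truncated toward zero, which is consistent with the minimum of 48 and maximum of 71 reported for 'Average Packet Length' in the generated dataset. It uses `Generator.integers` on a small illustrative frame rather than the notebook's own call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for illustration

# explicit integer bounds for 'Average Packet Length' (min 51.00 and max 69.00 above)
low = int(51.0 * 0.95)    # 48
high = int(69.0 * 1.05)   # 72, exclusive upper bound

# one shared draw, reused for all three columns so they stay identical per row
shared = rng.integers(low, high, size=1000)
toy = pd.DataFrame({col: shared for col in
                    ['Average Packet Length', 'Packet Length Min', 'Packet Length Max']})
print(toy.describe().loc[['min', 'max']])
```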

---

## Validate that the generated data looks valid by comparing the samples with the generated dataset:

In [46]:
port_dataset2

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,175,62,62,62,0,0,9236.0,20,20,20,0,0,0,0,0,2.0,0,22.094612,353.0,0,0,4.620928,75.730999,1.104598,0.014289,0.061956
1,283,69,69,69,0,0,14935.0,23,23,23,0,0,0,0,0,2.0,0,27.741205,557.0,0,0,7.740055,90.286925,1.111589,0.021027,0.085428
2,293,70,70,70,0,0,15411.0,24,24,24,0,0,0,0,0,2.0,0,32.050201,597.0,0,0,7.826857,90.692001,1.109409,0.022244,0.113244
3,266,56,56,56,0,0,14047.0,20,20,20,0,0,0,0,0,2.0,0,28.629255,541.0,0,0,7.004469,86.854189,1.109694,0.022012,0.085030
4,134,51,51,51,0,0,6834.0,27,27,27,0,0,0,0,0,2.0,0,28.209263,264.0,0,0,3.599746,70.965479,1.104583,0.009881,0.040319
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12495,290,66,66,66,0,0,15307.0,26,26,26,0,0,0,0,0,2.0,0,24.171462,589.0,0,0,7.834477,90.727561,1.110409,0.021315,0.114594
12496,256,53,53,53,0,0,13468.0,25,25,25,0,0,0,0,0,2.0,0,30.663721,504.0,0,0,6.761352,85.719642,1.110073,0.020689,0.098980
12497,130,53,53,53,0,0,6886.0,22,22,22,0,0,0,0,0,2.0,0,25.671725,256.0,0,0,3.545366,70.711707,1.102331,0.010230,0.038209
12498,199,61,61,61,0,0,10185.0,21,21,21,0,0,0,0,0,2.0,0,22.925432,393.0,0,0,5.424459,79.480809,1.105030,0.014984,0.062178


In [47]:
port_samples.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,174.0,60.0,60.0,60.0,0.0,0.0,9048.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,26.856494,348.0,0.0,0.0,4.426891,77.303528,1.106032,0.013039,0.067295
std,58.727241,0.0,0.0,0.0,0.0,0.0,3053.816556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.291425,117.454483,0.0,0.0,1.188862,5.538596,0.002827,0.00096,0.009113
min,100.0,60.0,60.0,60.0,0.0,0.0,5200.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,26.472727,200.0,0.0,0.0,2.920097,68.490873,1.102872,0.011775,0.055129
25%,125.0,60.0,60.0,60.0,0.0,0.0,6500.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,26.629164,250.0,0.0,0.0,3.443908,72.77217,1.103998,0.012276,0.059629
50%,165.0,60.0,60.0,60.0,0.0,0.0,8580.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,26.819704,330.0,0.0,0.0,4.249731,77.544927,1.105377,0.01294,0.066489
75%,212.5,60.0,60.0,60.0,0.0,0.0,11050.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,27.088567,425.0,0.0,0.0,5.201395,81.663317,1.107209,0.0138,0.074478
max,280.0,60.0,60.0,60.0,0.0,0.0,14560.0,26.0,26.0,26.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,27.368421,560.0,0.0,0.0,6.582215,85.077743,1.1115,0.014674,0.082441


In [48]:
port_dataset2.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0
mean,214.3624,59.55912,59.55912,59.55912,0.0,0.0,11146.64424,24.98904,24.98904,24.98904,0.0,0.0,0.0,0.0,0.0,2.0,0.0,27.185844,428.19528,0.0,0.0,5.751479,81.006144,1.106507,0.016753,0.072179
std,80.473129,6.957826,6.957826,6.957826,0.0,0.0,4188.55463,3.155813,3.155813,3.155813,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.349133,161.075465,0.0,0.0,2.160944,10.08304,0.003822,0.006329,0.028962
min,76.0,48.0,48.0,48.0,0.0,0.0,3873.0,20.0,20.0,20.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,21.376742,149.0,0.0,0.0,1.99285,63.5,1.1,0.005598,0.020545
25%,144.0,53.0,53.0,53.0,0.0,0.0,7467.75,22.0,22.0,22.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,24.268212,287.0,0.0,0.0,3.855087,72.157073,1.103272,0.011221,0.047877
50%,215.0,60.0,60.0,60.0,0.0,0.0,11171.5,25.0,25.0,25.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,27.137907,429.0,0.0,0.0,5.755154,81.024053,1.10649,0.016733,0.070983
75%,284.0,66.0,66.0,66.0,0.0,0.0,14762.25,28.0,28.0,28.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,30.075416,567.0,0.0,0.0,7.621279,89.732637,1.109787,0.022205,0.093713
max,353.0,71.0,71.0,71.0,0.0,0.0,18697.0,30.0,30.0,30.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,33.04659,719.0,0.0,0.0,9.687389,98.75,1.113,0.029329,0.142247
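
Comparing the two `describe()` tables by eye works, but the same check can be sketched in code. The helper `range_report` below is hypothetical and not part of the original notebook. It lists the per-column minimum and maximum of both frames and flags whether the generated range covers the sample range. The toy frames reuse a few values from the tables above.

```python
import pandas as pd

def range_report(samples, generated):
    """Compare per-column min/max of the generated data against the samples."""
    cols = samples.select_dtypes('number').columns
    report = pd.DataFrame({
        'sample_min': samples[cols].min(),
        'generated_min': generated[cols].min(),
        'sample_max': samples[cols].max(),
        'generated_max': generated[cols].max(),
    })
    # True when the generated range contains the whole sample range
    report['covers_samples'] = (
        (report['generated_min'] <= report['sample_min'])
        & (report['generated_max'] >= report['sample_max'])
    )
    return report

# min/max values taken from the describe() outputs above
toy_samples = pd.DataFrame({'Number of Ports': [100, 280], 'IAT Max': [1.102872, 1.1115]})
toy_generated = pd.DataFrame({'Number of Ports': [76, 353], 'IAT Max': [1.1, 1.113]})

report = range_report(toy_samples, toy_generated)
print(report)
```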


---

## Adding the Label column:

In [49]:
# adding a label to the dataset
port_dataset2['Label'] = ATTACK_NAME

---

## Select the rows we want from the generated second dataset such that the 'Number of Ports' value is >= 120:

In [50]:
port_dataset2 = port_dataset2[port_dataset2['Number of Ports'] >= 120]
print(f'Second Attack Dataset Shape Before: {port_dataset2.shape}')

Second Attack Dataset Shape Before: (10513, 27)


In [51]:
port_dataset2 = port_dataset2.sample(n=7500, random_state = 42) 
print(f'Second Attack Dataset Shape After: {port_dataset2.shape}')

Second Attack Dataset Shape After: (7500, 27)
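
The filter-then-sample step depends on enough rows surviving the filter, which is why more rows than needed were generated. A minimal sketch of the same step with an explicit guard is shown below. The helper name `filter_and_sample` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def filter_and_sample(df, min_ports, n, seed=42):
    """Keep rows with 'Number of Ports' >= min_ports, then draw n of them without replacement."""
    eligible = df[df['Number of Ports'] >= min_ports]
    if len(eligible) < n:
        raise ValueError(f'only {len(eligible)} eligible rows, need {n}')
    return eligible.sample(n=n, random_state=seed)

# toy frame: 40 rows, 20 of which have 'Number of Ports' >= 120
toy = pd.DataFrame({'Number of Ports': range(100, 140)})
picked = filter_and_sample(toy, min_ports=120, n=10)
print(picked.shape)
```

Without the guard, `DataFrame.sample` raises its own error when `n` exceeds the number of available rows, so the check mainly makes the failure message clearer.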


---

## At the end we merge the two generated datasets together and then save the result as a CSV file:

In [52]:
mergedport_dataset = pd.concat([port_dataset, port_dataset2], axis=0)
mergedport_dataset = mergedport_dataset.sample(frac=1, random_state=42).reset_index(drop=True)
print(f'Attack Dataset Shape: {mergedport_dataset.shape}')

Attack Dataset Shape: (15000, 27)
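
The merge step can be sketched on two tiny frames to show why `reset_index(drop=True)` is needed. After `pd.concat` the original row indices are kept, so they repeat, and shuffling with `sample(frac=1)` only reorders them. The frames below are illustrative and reuse a few 'Number of Ports' values from the outputs above.

```python
import pandas as pd

first = pd.DataFrame({'Number of Ports': [3477, 4017], 'Label': 'PortScan'})
second = pd.DataFrame({'Number of Ports': [175, 283], 'Label': 'PortScan'})

merged = pd.concat([first, second], axis=0)
print(merged.index.tolist())   # indices repeat after concatenation

# shuffle all rows, then replace the duplicated index with a fresh 0..n-1 range
shuffled = merged.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled.index.tolist())
```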


Make sure that the data in the 'Total Length of Fwd Packet' column is of type int for consistency.  

In [None]:
mergedport_dataset['Total Length of Fwd Packet'] = mergedport_dataset['Total Length of Fwd Packet'].astype(int)

In [None]:
# save the dataset
mergedport_dataset.to_csv('port_scan_closed_port_dataset.csv', index=False)