# Prepare Port Scanning Open Port Attack Dataset

## Overview:

This notebook creates a Port Scanning open port attack dataset from a small sample of data collected by performing real Port Scanning open port attacks in a controlled environment.<br>
The dataset that this notebook creates is designed to closely resemble real-world data and was used to train our SVM model.<br>  
There are multiple sample datasets because we performed the attack in several different ways, and each way produces slightly different data.<br>
For that reason we split the original sample dataset into multiple samples, so that the attack dataset we generate matches the real-world data as closely as possible.<br>  
It is worth noting that the sample dataset we collected contains no missing values or outliers, because we tested each part of the collection process and verified that it is correct.<br>
In this notebook we have generated an attack dataset with 40,000 flows of the Port Scanning open port attack, based on the samples we collected when running a Port Scanning attack in various configurations using the well-known Nmap tool while the majority of ports on the victim host machine were open.<br> 

## Imports & Global Variables:

In [2]:
import pandas as pd
import numpy as np
import random

NUM_OF_ROWS = 16000
ATTACK_NAME = 'PortScan'

In [3]:
# the following command will make it so that when we print the dataframe we will see all the columns
pd.set_option('display.max_columns', None)

---

## Load the first sample dataset:

In [4]:
# import the attack sample dataset
port_samples = pd.read_csv('portscan_open_port_samples_1.csv')
print(f'Dataset Shape: {port_samples.shape}')
port_samples

Dataset Shape: (19, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,4654,57.005583,54,60,2.999422,8.996533,121238,26,26.0,26,0.0,24,20.00344,20,0.117255,2.003431,0.0,121238.0,4663,4651,4651,1.429131,6517.247053,1.008289,0.000153,0.010453
1,4994,57.0058,54,60,2.999461,8.996766,130182,26,26.0,26,0.0,24,20.003204,20,0.113171,2.003196,0.0,130182.0,5007,4993,4993,1.417464,7054.852803,1.004343,0.000142,0.010049
2,4584,57.006039,54,60,2.999591,8.997548,129376,26,26.0,26,0.0,24,20.002419,20,0.098344,2.002412,0.0,129376.0,4976,4960,4960,12.688453,783.074188,12.201695,0.001277,0.12241
3,4996,56.9996,54,60,2.999733,8.9984,129948,26,26.0,26,0.0,24,20.001599,20,0.079968,2.001601,0.0,0.0,4998,5002,5002,0.369603,27056.045949,0.007196,3.7e-05,0.000278
4,4987,57.007918,54,60,2.999455,8.99673,129974,26,26.0,26,0.0,24,20.003214,20,0.113341,2.003201,0.0,129974.0,4999,4978,4978,1.428317,6985.144878,1.002485,0.000143,0.010042
5,4490,57.003043,54,60,2.999593,8.997557,128258,26,26.0,26,0.0,24,20.002436,20,0.098673,2.002433,0.0,128258.0,4933,4927,4927,11.969671,823.748622,11.435005,0.001214,0.11516
6,4995,57.0068,54,60,2.999726,8.998354,130260,26,26.0,26,0.0,24,20.001603,20,0.080064,2.001597,0.0,130260.0,5010,4990,4990,1.456153,6867.41062,1.008573,0.000146,0.010091
7,4997,57.007705,54,60,2.999323,8.995938,130156,26,26.0,26,0.0,24,20.00401,20,0.126592,2.003995,0.0,0.0,5006,4987,4987,1.462775,6831.535988,0.898302,0.000146,0.009002
8,4982,57.008102,54,60,2.999589,8.997534,130260,26,26.0,26,0.0,24,20.002406,20,0.098078,2.002395,0.0,65130.0,5010,4987,4987,10.794247,926.141498,9.383592,0.00108,0.094389
9,4993,57.002101,54,60,2.999599,8.997595,130000,26,26.0,26,0.0,24,20.002401,20,0.09798,2.0024,0.0,130000.0,5000,4997,4997,42.628286,234.515646,41.218939,0.004265,0.412373


### Find the columns that we need to synthesize data for:

In [5]:
columns_to_gather = port_samples.replace(0, np.nan) # replace all 0 values with NaN
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() # drop the columns in which every value is NaN (i.e. was 0 in every row)
columns_to_gather # columns with at least one non-zero value (we know for a fact that the data is consistent and that no rows have missing values)

['Number of Ports',
 'Average Packet Length',
 'Packet Length Min',
 'Packet Length Max',
 'Packet Length Std',
 'Packet Length Variance',
 'Total Length of Fwd Packet',
 'Fwd Packet Length Max',
 'Fwd Packet Length Mean',
 'Fwd Packet Length Min',
 'Bwd Packet Length Max',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Min',
 'Bwd Packet Length Std',
 'Fwd Segment Size Avg',
 'Subflow Fwd Bytes',
 'SYN Flag Count',
 'ACK Flag Count',
 'RST Flag Count',
 'Flow Duration',
 'Packets Per Second',
 'IAT Max',
 'IAT Mean',
 'IAT Std']
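
The same selection can also be written without the intermediate NaN replacement. The sketch below is only an illustration on a small hypothetical frame (`toy` stands in for `port_samples`); it keeps every column that contains at least one non-zero value, which gives the same result as the cell above as long as the data has no missing values.

```python
import pandas as pd

# hypothetical toy frame standing in for port_samples
toy = pd.DataFrame({
    'Number of Ports': [4654, 4994],
    'Fwd Packet Length Std': [0.0, 0.0],   # zero in every row -> dropped
    'Subflow Fwd Bytes': [121238.0, 0.0],  # zero in some rows -> kept
})

# keep the columns that have at least one non-zero value
non_zero_columns = toy.columns[(toy != 0).any(axis=0)].tolist()
print(non_zero_columns)
```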

### Find approximate minimum and maximum values for each column:

In [6]:
# find the minimum and maximum values for each column, scale the range (reduce min by 15% and increase max by 7.5%), and store the results in a dictionary.
min_max_dict = {col: (port_samples[col].min() * 0.85, port_samples[col].max() * 1.075) for col in columns_to_gather}

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

Number of Ports                | Min: 3816.50 | Max: 5379.30
Average Packet Length          | Min: 48.45 | Max: 61.29
Packet Length Min              | Min: 45.90 | Max: 58.05
Packet Length Max              | Min: 51.00 | Max: 64.50
Packet Length Std              | Min: 2.55 | Max: 3.22
Packet Length Variance         | Min: 7.65 | Max: 9.67
Total Length of Fwd Packet     | Min: 103052.30 | Max: 140141.30
Fwd Packet Length Max          | Min: 22.10 | Max: 27.95
Fwd Packet Length Mean         | Min: 22.10 | Max: 27.95
Fwd Packet Length Min          | Min: 22.10 | Max: 27.95
Bwd Packet Length Max          | Min: 17.00 | Max: 25.80
Bwd Packet Length Mean         | Min: 17.00 | Max: 21.50
Bwd Packet Length Min          | Min: 17.00 | Max: 21.50
Bwd Packet Length Std          | Min: 0.00 | Max: 0.14
Fwd Segment Size Avg           | Min: 1.70 | Max: 2.15
Subflow Fwd Bytes              | Min: 0.00 | Max: 140057.45
SYN Flag Count                 | Min: 3963.55 | Max: 5391.12
ACK Flag Count      
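
As a worked example of the scaling, the sketch below recomputes the 'Number of Ports' range by hand. The helper name `scaled_range` is an assumption made for illustration and is not part of the notebook; the extremes 4490 and 5004 are the minimum and maximum reported for this column by `port_samples.describe()`.

```python
def scaled_range(values, lower_factor=0.85, upper_factor=1.075):
    """Return (min * lower_factor, max * upper_factor) for a sequence of numbers."""
    return min(values) * lower_factor, max(values) * upper_factor

# 'Number of Ports' extremes in the samples: min 4490, max 5004
low, high = scaled_range([4490, 5004])
print(f'Min: {low:.2f} | Max: {high:.2f}')
```

Note that multiplying by 0.85 only widens the range for positive minimums: a column whose minimum is 0, such as 'Bwd Packet Length Std', keeps a lower bound of 0.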

### Create the base attack dataset (full of zeros):

In [7]:
# creating an empty dataframe before adding values to it
port_dataset = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(port_samples.columns))), columns = port_samples.columns)
port_dataset.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Find the columns with constant zero values based on samples:

In [8]:
# adding zeros to all columns that should not have any values
zero_columns = [col for col in port_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    port_dataset[col] = int(0)
zero_columns

['Fwd Packet Length Std', 'Bwd Segment Size Avg']

In [9]:
port_dataset.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---

## Filling in values based on collected samples:

### First, fill values into the 'Fwd Packet' columns that are related to each other:

When generating data for the following columns, we make sure the values are consistent, in the sense that the minimum value is not higher than the mean and the mean is not higher than the maximum value <u>in each row</u> of the attack dataset.<br>  
In the sample dataset the values in these columns are sometimes exactly the same and sometimes different. Therefore we randomly mark 25% of the rows to copy the maximum value exactly; the remaining rows receive a small random offset, which is capped so that the minimum never exceeds the maximum.

In [10]:
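# the three 'Fwd Packet' columns that are generated together (this list is not used further down)
related_fwd_columns = ['Fwd Packet Length Max', 'Fwd Packet Length Min', 'Fwd Packet Length Mean']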

packet_length_max = np.random.randint(min_max_dict['Fwd Packet Length Max'][0] * 0.9, min_max_dict['Fwd Packet Length Max'][1] * 1.1, NUM_OF_ROWS)

# define probability distribution: 25% True, 75% False
probability = [0.25, 0.75]

# decide for each row whether to copy or vary 'Fwd Packet Length Max'
copy_values = np.random.choice([True, False], size = NUM_OF_ROWS, p=probability)

# create 'Fwd Packet Length Min': either copy or apply a small variation
packet_length_min = np.where(copy_values, packet_length_max, packet_length_max + np.random.uniform(-4, 4, NUM_OF_ROWS))
packet_length_min = np.minimum(packet_length_min, packet_length_max)

# calculate 'Fwd Packet Length Mean': average of min and max, or copy if equal
average_packet_length = np.where(packet_length_max != packet_length_min, (packet_length_max + packet_length_min) / 2, packet_length_min)

# assign the values to the dataset
port_dataset['Fwd Packet Length Max'] = packet_length_max.astype(int)
port_dataset['Fwd Packet Length Mean'] = average_packet_length
port_dataset['Fwd Packet Length Min'] = packet_length_min.astype(int)

In [11]:
port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26,26.000000,26,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22,22.000000,22,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26,26.000000,26,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28,28.000000,28,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26,24.800242,23,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28,28.000000,28,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29,29.000000,29,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24,22.984329,21,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23,23.000000,23,0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
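
A quick way to confirm the row-wise ordering described above is to count the rows that break it. The following is a minimal sketch on hypothetical values; the helper `ordering_violations` is an illustrative name, not something defined elsewhere in this notebook.

```python
import pandas as pd

def ordering_violations(df, min_col, mean_col, max_col):
    """Count the rows in which min <= mean <= max does not hold."""
    ok = (df[min_col] <= df[mean_col]) & (df[mean_col] <= df[max_col])
    return int((~ok).sum())

# hypothetical rows: the last one breaks the ordering on purpose (mean < min)
toy = pd.DataFrame({
    'Fwd Packet Length Min':  [26, 23, 25],
    'Fwd Packet Length Mean': [26.0, 24.8, 24.0],
    'Fwd Packet Length Max':  [26, 26, 27],
})

violations = ordering_violations(
    toy, 'Fwd Packet Length Min', 'Fwd Packet Length Mean', 'Fwd Packet Length Max'
)
print(violations)
```

Calling the same helper on `port_dataset` after this step should return 0 if the generation logic behaves as intended.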


### Then fill values into columns that are not related to each other:

Most of the columns here are unrelated to each other, except the 'Bwd Packet' columns. For those we again make sure that the minimum is not higher than the mean and the mean is not higher than the maximum value in each row.

In [12]:
independent = ['Number of Ports', 'Average Packet Length', 'Packet Length Max', 'Bwd Packet Length Max', 'Bwd Packet Length Mean']

# generate 'Bwd Packet Length Min' values
bwd_min_low, bwd_min_high = min_max_dict['Bwd Packet Length Min']
bwd_min_values = np.random.randint(bwd_min_low * 0.9, bwd_min_high * 1.05, size = NUM_OF_ROWS)

for col in independent:
    if col == 'Bwd Packet Length Mean':
        rand_values = np.random.uniform(min_max_dict[col][0]*0.995, min_max_dict[col][1] * 1.005, NUM_OF_ROWS)
    else:
        rand_values = np.random.randint(min_max_dict[col][0] * 0.9, min_max_dict[col][1] * 1.1, NUM_OF_ROWS)

    port_dataset[col] = rand_values

# ensure that 'Bwd Packet Length Max' is always >= 'Bwd Packet Length Min'
port_dataset['Bwd Packet Length Min'] = bwd_min_values
port_dataset['Bwd Packet Length Max'] = np.maximum(bwd_min_values, port_dataset['Bwd Packet Length Max']) #fix inconsistencies

# make sure 'Bwd Packet Length Mean' does not exceed 'Bwd Packet Length Max' (means that fall below 'Bwd Packet Length Min' are left unchanged)
invalid_rows = port_dataset['Bwd Packet Length Mean'] > port_dataset['Bwd Packet Length Max']

# replace the mean with the midpoint of min and max for those rows
corrected_means = (port_dataset.loc[invalid_rows, 'Bwd Packet Length Min'] + 
                   port_dataset.loc[invalid_rows, 'Bwd Packet Length Max']) / 2

# update only the invalid rows
port_dataset.loc[invalid_rows, 'Bwd Packet Length Mean'] = corrected_means

port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,4561,57,0.0,45,0.0,0.0,0.0,26,26.000000,26,0,22,20.882645,21,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,4809,47,0.0,65,0.0,0.0,0.0,22,22.000000,22,0,20,19.010884,15,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3465,54,0.0,57,0.0,0.0,0.0,26,26.000000,26,0,24,17.196154,20,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4436,61,0.0,45,0.0,0.0,0.0,28,28.000000,28,0,20,19.434706,20,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3436,55,0.0,62,0.0,0.0,0.0,26,24.800242,23,0,23,18.359937,16,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15995,4449,43,0.0,57,0.0,0.0,0.0,28,28.000000,28,0,19,17.000000,15,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15996,4391,52,0.0,50,0.0,0.0,0.0,29,29.000000,29,0,23,17.267953,15,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15997,5011,46,0.0,64,0.0,0.0,0.0,24,22.984329,21,0,22,18.355819,20,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15998,5732,57,0.0,47,0.0,0.0,0.0,23,23.000000,23,0,15,15.000000,15,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
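
The cell above only corrects means that exceed the maximum, so a generated mean can still fall below the minimum (row 2 in the output is one such case). One possible way to bound the mean on both sides, shown here on hypothetical values rather than on the notebook's data, is `Series.clip` with per-row bounds:

```python
import pandas as pd

# hypothetical rows: the first mean is below its min, the last is above its max
toy = pd.DataFrame({
    'Bwd Packet Length Min':  [20, 15, 21],
    'Bwd Packet Length Mean': [17.2, 19.0, 25.5],
    'Bwd Packet Length Max':  [24, 20, 22],
})

# clip each mean into the [min, max] interval of its own row
toy['Bwd Packet Length Mean'] = toy['Bwd Packet Length Mean'].clip(
    lower=toy['Bwd Packet Length Min'],
    upper=toy['Bwd Packet Length Max'],
)
print(toy['Bwd Packet Length Mean'].tolist())
```

Clipping moves out-of-range means onto the boundary itself; replacing them with the midpoint of min and max, as the notebook does for the upper bound, is an alternative that avoids piling values up at the edges.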


In our sample dataset, the column 'Subflow Fwd Bytes' usually has values in a specific range, but sometimes it has zero values.<br>
In order to generate accurate data, we generate a vector that will have a certain distribution of values. For example, in the 'Subflow Fwd Bytes' column, about 50% of the values will be within the usual range and the other 50% will be zero.  

In [13]:
# generate a vector with random values based on min max dict, and also create a zero vector
col = 'Subflow Fwd Bytes'
subflow_values = port_samples[port_samples[col] != 0][col] 
min_max_dict[col] = (np.min(subflow_values), np.max(subflow_values))

rand_values = np.random.uniform(min_max_dict[col][0]*0.9, min_max_dict[col][1]*1.1, NUM_OF_ROWS)
zero_values = np.zeros(NUM_OF_ROWS)

# choose values randomly (50% from rand_values, 50% from zero_values)
port_dataset[col] = np.where(np.random.rand(NUM_OF_ROWS) > 0.5, rand_values, zero_values)
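
The same zero-or-value mix can be made repeatable by drawing from a seeded generator. This is a sketch under assumptions: the seed, the row count and the `low`/`high` bounds below are hypothetical placeholders, not the values computed from the samples.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen only for repeatability
n_rows = 10_000                      # hypothetical row count
zero_share = 0.5                     # fraction of rows that should be zero
low, high = 60_000.0, 140_000.0      # hypothetical bounds for the non-zero values

usual = rng.uniform(low, high, n_rows)
is_zero = rng.random(n_rows) < zero_share
subflow = np.where(is_zero, 0.0, usual)

print(f'share of zeros: {np.mean(subflow == 0):.3f}')
```

With a fixed seed, rerunning the cell reproduces the same column, which makes later comparisons between runs easier.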

Some columns, like 'Packet Length Std', based on the collected samples, usually have values in a specific range, but sometimes they have values outside of the range.<br>
In order to generate accurate data, we generate a vector that will have a certain distribution of values. For example, in the 'Packet Length Std' column, 80% of the values will be within the usual range,<br>
but the other 20% will have values that are anywhere between the minimal and maximal value for this column, meaning they will have values outside of the usual range as well.  

In [14]:
port_samples = pd.read_csv('portscan_open_port_samples_1.csv')
port_samples

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,4654,57.005583,54,60,2.999422,8.996533,121238,26,26.0,26,0.0,24,20.00344,20,0.117255,2.003431,0.0,121238.0,4663,4651,4651,1.429131,6517.247053,1.008289,0.000153,0.010453
1,4994,57.0058,54,60,2.999461,8.996766,130182,26,26.0,26,0.0,24,20.003204,20,0.113171,2.003196,0.0,130182.0,5007,4993,4993,1.417464,7054.852803,1.004343,0.000142,0.010049
2,4584,57.006039,54,60,2.999591,8.997548,129376,26,26.0,26,0.0,24,20.002419,20,0.098344,2.002412,0.0,129376.0,4976,4960,4960,12.688453,783.074188,12.201695,0.001277,0.12241
3,4996,56.9996,54,60,2.999733,8.9984,129948,26,26.0,26,0.0,24,20.001599,20,0.079968,2.001601,0.0,0.0,4998,5002,5002,0.369603,27056.045949,0.007196,3.7e-05,0.000278
4,4987,57.007918,54,60,2.999455,8.99673,129974,26,26.0,26,0.0,24,20.003214,20,0.113341,2.003201,0.0,129974.0,4999,4978,4978,1.428317,6985.144878,1.002485,0.000143,0.010042
5,4490,57.003043,54,60,2.999593,8.997557,128258,26,26.0,26,0.0,24,20.002436,20,0.098673,2.002433,0.0,128258.0,4933,4927,4927,11.969671,823.748622,11.435005,0.001214,0.11516
6,4995,57.0068,54,60,2.999726,8.998354,130260,26,26.0,26,0.0,24,20.001603,20,0.080064,2.001597,0.0,130260.0,5010,4990,4990,1.456153,6867.41062,1.008573,0.000146,0.010091
7,4997,57.007705,54,60,2.999323,8.995938,130156,26,26.0,26,0.0,24,20.00401,20,0.126592,2.003995,0.0,0.0,5006,4987,4987,1.462775,6831.535988,0.898302,0.000146,0.009002
8,4982,57.008102,54,60,2.999589,8.997534,130260,26,26.0,26,0.0,24,20.002406,20,0.098078,2.002395,0.0,65130.0,5010,4987,4987,10.794247,926.141498,9.383592,0.00108,0.094389
9,4993,57.002101,54,60,2.999599,8.997595,130000,26,26.0,26,0.0,24,20.002401,20,0.09798,2.0024,0.0,130000.0,5000,4997,4997,42.628286,234.515646,41.218939,0.004265,0.412373


In [15]:
half_and_half = ['Packet Length Std', 'Packet Length Variance', 'Flow Duration', 'Total Length of Fwd Packet', 'Bwd Packet Length Std', 'Fwd Segment Size Avg']

for col in half_and_half:
    # generate random values from the uniform distribution (90% - 110% of min-max range)
    rand_values = np.random.uniform(min_max_dict[col][0]*0.9, min_max_dict[col][1]*1.1, NUM_OF_ROWS)
    
    # generate alternative random values based on column-specific conditions
    if col == 'Packet Length Std':
        usual_values = np.random.uniform(2.9, 3.1, NUM_OF_ROWS)
    elif col == 'Packet Length Variance':
        usual_values = np.random.uniform(8.85, 9.15, NUM_OF_ROWS)
    elif col == 'Flow Duration':
        rand_values = np.random.uniform(min_max_dict[col][0]*0.95, min_max_dict[col][1], NUM_OF_ROWS)
        usual_values = np.random.uniform(0.45, 1.7, NUM_OF_ROWS)
    elif col == 'Total Length of Fwd Packet':
        usual_values = np.random.randint(min_max_dict[col][0]*0.9, 130000, NUM_OF_ROWS)
    elif col == 'Bwd Packet Length Std':
        rand_values = np.random.uniform(min_max_dict[col][0], min_max_dict[col][1]*1.1, NUM_OF_ROWS)
        usual_values = np.random.uniform(0.09, 0.11, NUM_OF_ROWS)
    elif col == 'Fwd Segment Size Avg':
        rand_values = np.random.uniform(min_max_dict[col][0]*0.95, min_max_dict[col][1]*1.05, NUM_OF_ROWS)
        usual_values = np.random.uniform(2.001, 2.002, NUM_OF_ROWS)

    # choose values randomly (20% from rand_values, 80% from usual_values)
    chosen_values = np.where(np.random.rand(NUM_OF_ROWS) > 0.2, usual_values, rand_values)

    port_dataset[col] = chosen_values

port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,4561,57,0.0,45,2.947471,8.857574,145123.636005,26,26.000000,26,0,22,20.882645,21,0.093753,2.001062,0,0.000000,0.0,0.0,0.0,1.039104,0.0,0.0,0.0,0.0
1,4809,47,0.0,65,3.506635,8.882307,110651.000000,22,22.000000,22,0,20,19.010884,15,0.105235,2.001751,0,102437.118250,0.0,0.0,0.0,1.268739,0.0,0.0,0.0,0.0
2,3465,54,0.0,57,2.857611,9.618570,105984.000000,26,26.000000,26,0,24,17.196154,20,0.104441,2.001264,0,0.000000,0.0,0.0,0.0,1.476553,0.0,0.0,0.0,0.0
3,4436,61,0.0,45,2.986253,7.670678,94346.000000,28,28.000000,28,0,20,19.434706,20,0.099311,1.618916,0,0.000000,0.0,0.0,0.0,0.451240,0.0,0.0,0.0,0.0
4,3436,55,0.0,62,2.962029,8.983642,117937.000000,26,24.800242,23,0,23,18.359937,16,0.036875,2.001080,0,0.000000,0.0,0.0,0.0,11.911448,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15995,4449,43,0.0,57,2.435099,8.977256,121910.000000,28,28.000000,28,0,19,17.000000,15,0.049477,2.001239,0,122710.651752,0.0,0.0,0.0,1.210337,0.0,0.0,0.0,0.0
15996,4391,52,0.0,50,2.919020,9.104157,149339.613793,29,29.000000,29,0,23,17.267953,15,0.100909,2.001576,0,0.000000,0.0,0.0,0.0,1.661436,0.0,0.0,0.0,0.0
15997,5011,46,0.0,64,2.933730,8.933704,128610.617086,24,22.984329,21,0,22,18.355819,20,0.092906,1.746315,0,80275.962006,0.0,0.0,0.0,1.109014,0.0,0.0,0.0,0.0
15998,5732,57,0.0,47,2.502362,9.127717,106555.000000,23,23.000000,23,0,15,15.000000,15,0.002730,2.001485,0,0.000000,0.0,0.0,0.0,1.055370,0.0,0.0,0.0,0.0
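
The per-column branches in the cell above all follow one pattern: draw from a narrow "usual" range most of the time and from a wider range otherwise. A compact way to express that pattern is sketched below. The helper `mixture` and the `ranges` dictionary are illustrative names, and the wide ranges reuse the rounded min/max values printed earlier, so the numbers are approximate.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability

def mixture(usual_range, wide_range, size, usual_share=0.8):
    """Draw from usual_range with probability usual_share, otherwise from wide_range."""
    usual = rng.uniform(*usual_range, size)
    wide = rng.uniform(*wide_range, size)
    return np.where(rng.random(size) < usual_share, usual, wide)

# (usual range, wide range) per column; wide = rounded min * 0.9 .. rounded max * 1.1
ranges = {
    'Packet Length Std':      ((2.9, 3.1),   (2.55 * 0.9, 3.22 * 1.1)),
    'Packet Length Variance': ((8.85, 9.15), (7.65 * 0.9, 9.67 * 1.1)),
}

generated = {col: mixture(usual, wide, 1000) for col, (usual, wide) in ranges.items()}
for col, values in generated.items():
    print(f'{col:<25} min={values.min():.3f} max={values.max():.3f}')
```

Keeping the ranges in one dictionary puts all the tunable numbers in a single place, at the cost of not covering the columns that need a different sampler (for example the integer draw used for 'Total Length of Fwd Packet').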


In [16]:
# generate random values for the 'Packet Length Min' column
rand_values = np.random.randint(min_max_dict['Packet Length Min'][0]*0.9, min_max_dict['Packet Length Min'][1]*1.05, size = NUM_OF_ROWS)

# assign the random values
port_dataset['Packet Length Min'] = rand_values

port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,4561,57,58,45,2.947471,8.857574,145123.636005,26,26.000000,26,0,22,20.882645,21,0.093753,2.001062,0,0.000000,0.0,0.0,0.0,1.039104,0.0,0.0,0.0,0.0
1,4809,47,45,65,3.506635,8.882307,110651.000000,22,22.000000,22,0,20,19.010884,15,0.105235,2.001751,0,102437.118250,0.0,0.0,0.0,1.268739,0.0,0.0,0.0,0.0
2,3465,54,41,57,2.857611,9.618570,105984.000000,26,26.000000,26,0,24,17.196154,20,0.104441,2.001264,0,0.000000,0.0,0.0,0.0,1.476553,0.0,0.0,0.0,0.0
3,4436,61,47,45,2.986253,7.670678,94346.000000,28,28.000000,28,0,20,19.434706,20,0.099311,1.618916,0,0.000000,0.0,0.0,0.0,0.451240,0.0,0.0,0.0,0.0
4,3436,55,49,62,2.962029,8.983642,117937.000000,26,24.800242,23,0,23,18.359937,16,0.036875,2.001080,0,0.000000,0.0,0.0,0.0,11.911448,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15995,4449,43,50,57,2.435099,8.977256,121910.000000,28,28.000000,28,0,19,17.000000,15,0.049477,2.001239,0,122710.651752,0.0,0.0,0.0,1.210337,0.0,0.0,0.0,0.0
15996,4391,52,51,50,2.919020,9.104157,149339.613793,29,29.000000,29,0,23,17.267953,15,0.100909,2.001576,0,0.000000,0.0,0.0,0.0,1.661436,0.0,0.0,0.0,0.0
15997,5011,46,55,64,2.933730,8.933704,128610.617086,24,22.984329,21,0,22,18.355819,20,0.092906,1.746315,0,80275.962006,0.0,0.0,0.0,1.109014,0.0,0.0,0.0,0.0
15998,5732,57,57,47,2.502362,9.127717,106555.000000,23,23.000000,23,0,15,15.000000,15,0.002730,2.001485,0,0.000000,0.0,0.0,0.0,1.055370,0.0,0.0,0.0,0.0


## Calculate and fill values into columns that have a certain correlation between them:

A correlation between two or more columns is common in our dataset since most features are inherently related. All of them are derived from network packet traffic.<br>
For example, as the **flow duration increases**, the **packets per second** is likely to decrease. This occurs because each flow has an upper limit on duration, after which data collection stops and a new flow begins.<br>  
Similarly, the **Inter-Arrival Time (IAT)** of packets within a flow is influenced by the flow duration. Given these dependencies, <br>
the attack dataset should generate data for these columns collectively, ensuring that their inherent correlations are maintained.

### Correlation between 'SYN Flag Count' and all the following: 'ACK Flag Count', 'RST Flag Count':

In [17]:
first_correlation = ['SYN Flag Count', 'ACK Flag Count', 'RST Flag Count']

# finding the correlation between the 'SYN Flag Count' column to the rest of the columns in order to create new data
independent_col = port_samples[first_correlation[0]].values.reshape(-1, 1) #column 'SYN Flag Count'
dependent_cols = port_samples[first_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'SYN Flag Count' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('ACK Flag Count', np.float64(0.9973550153245457))
('RST Flag Count', np.float64(0.9973443799504141))
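
Because the regression uses a single input column and no intercept, each scaling factor is the slope of a line through the origin, which has the closed form sum(x*y) / sum(x*x). The sketch below checks this on the SYN and ACK counts of the first three sample rows displayed at the top of the notebook; with only three rows the slope differs slightly from the factor printed above.

```python
import numpy as np

# 'SYN Flag Count' and 'ACK Flag Count' of the first three displayed sample rows
syn = np.array([4663.0, 5007.0, 4976.0])
ack = np.array([4651.0, 4993.0, 4960.0])

# least squares without an intercept, as in the cell above
slope_lstsq = np.linalg.lstsq(syn.reshape(-1, 1), ack, rcond=None)[0][0]

# closed form for a line through the origin
slope_closed_form = (syn @ ack) / (syn @ syn)

print(slope_lstsq, slope_closed_form)
```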


After finding the scaling factors, we apply some randomness when generating values for the attack dataset in order to produce better data (without many duplicates).<br>
We add randomness by creating a modified scaling factor, which introduces controlled variation in the generated values.<br>
This is done by selecting a small random delta (between 1% and 2% of the factor) and subtracting it from the original scaling factor, so the generated ACK and RST counts stay slightly below the SYN count.<br>
As a result, the generated data maintains realistic correlations while avoiding exact duplicates.

In [18]:
# generate random 'SYN Flag Count' values for the dataset, based on the range seen in the sample data
port_dataset['SYN Flag Count'] = np.random.randint(min_max_dict['SYN Flag Count'][0]*0.85, min_max_dict['SYN Flag Count'][1]*1.1, NUM_OF_ROWS)

# generate new data by scaling the original correlated column value using the updated factor.
for index, row in port_dataset.iterrows():
    for col, factor in zip(first_correlation[1:], scaling_factors): #iterating over all generated scaling factors
        delta = random.uniform(factor[1] * 0.01, factor[1] * 0.02)
        updated_factor = factor[1] - delta
        port_dataset.loc[index, col] = int(row['SYN Flag Count'] * updated_factor)
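
Looping with `iterrows` and assigning through `.loc` one cell at a time can be slow for thousands of rows. A vectorized variant of the same idea is sketched below on a hypothetical frame; the factors are rounded from the output above, and the SYN range and seed are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for repeatability

# scaling factors rounded from the least-squares output
factors = {'ACK Flag Count': 0.99736, 'RST Flag Count': 0.99734}

# hypothetical SYN counts standing in for the generated column
toy = pd.DataFrame({'SYN Flag Count': rng.integers(3400, 5900, size=1000)})

for col, factor in factors.items():
    # one delta per row, between 1% and 2% of the factor, always subtracted
    delta = rng.uniform(factor * 0.01, factor * 0.02, size=len(toy))
    toy[col] = (toy['SYN Flag Count'] * (factor - delta)).astype(int)

print(toy.head(3))
```

Each column is computed in one array operation, so the result has the same structure as the loop above while avoiding per-row assignment.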

### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std':

In [19]:
# finding the correlation between the 'Flow Duration' column to the rest of the columns in order to create new data
second_correlation = ['Flow Duration', 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std']
independent_col = port_samples[second_correlation[0]].values.reshape(-1, 1) #column 'Flow Duration'
dependent_cols = port_samples[second_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Flow Duration' and the rest of the columns in second_correlation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(second_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('Packets Per Second', np.float64(39.569547160555715))
('IAT Max', np.float64(0.9616734549103965))
('IAT Mean', np.float64(0.00010010917563052352))
('IAT Std', np.float64(0.009626118429844118))


In [20]:
# estimate the typical number of packets per flow: multiply 'Flow Duration' by 'Packets Per Second' row by row, then average the products.
duration_to_packets_corr = [x * y for x, y in zip(port_samples['Flow Duration'].values, port_samples['Packets Per Second'].values)]
duration_to_packets_corr = np.mean(duration_to_packets_corr)
duration_to_packets_corr

np.float64(9949.105263157895)
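
Since packets per second is the packet count divided by the duration, the product of the two columns recovers the approximate number of packets in each flow, which is why the mean above is close to 10,000. The relationship is therefore inverse rather than linear, so a rate for a new duration is better derived by dividing a packet count by that duration. A small check on three of the displayed sample rows (the value of `new_duration` is a hypothetical example):

```python
import numpy as np

# ('Flow Duration', 'Packets Per Second') from three displayed sample rows
duration = np.array([1.429131, 1.417464, 0.369603])
pps = np.array([6517.247053, 7054.852803, 27056.045949])

# approximate packet count per flow
packets = duration * pps
print(np.round(packets, 1))

# derive a rate for a new, longer duration from the mean packet count
new_duration = 2.0  # hypothetical flow duration in the same unit as the samples
new_pps = packets.mean() / new_duration
print(round(new_pps, 1))
```

For these rows the products also match the sum of the SYN and ACK flag counts (9314, 10000 and 10000).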

As before, after finding the scaling factors we add some randomness and generate the data.

In [None]:
# for every row, perturb each factor by a small random delta to add some randomness
for index, row in port_dataset.iterrows():
    for col, factor in zip(second_correlation[1:], scaling_factors): # iterating over all columns we need to add values to, except 'Flow Duration'
        if col == 'Packets Per Second':
            delta = random.uniform(duration_to_packets_corr * 0.075, duration_to_packets_corr * 0.1)
            updated_factor = duration_to_packets_corr + random.choice([-1, 1]) * delta
            port_dataset.loc[index, col] = updated_factor / row['Flow Duration']
        else:
            delta = random.uniform(factor[1] * 0.01, factor[1] * 0.02)
            updated_factor = factor[1] + random.choice([-1, 1]) * delta
            port_dataset.loc[index, col] = row['Flow Duration'] * updated_factor

---

## Adding the Label column:

In [22]:
# adding a label to the dataset
port_dataset['Label'] = ATTACK_NAME

---

## Validate that the generated data looks valid by comparing the samples with the generated dataset:

In [23]:
port_samples.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0
mean,4913.0,57.005327,54.0,60.0,2.999534,8.997204,129508.736842,26.0,26.0,26.0,0.0,23.578947,20.002758,20.0,0.0983,2.002707,0.0,91855.263158,4981.157895,4968.0,4967.947368,8.197893,7216.74883,7.51774,0.000822,0.075384
std,160.321552,0.003941,0.0,0.0,0.000214,0.001282,2058.053148,0.0,0.0,0.0,0.0,1.261207,0.001268,0.0,0.037878,0.001261,0.0,54096.803566,79.179306,78.688556,78.6783,13.944993,7693.306562,13.639942,0.001395,0.136448
min,4490.0,56.996999,54.0,60.0,2.999318,8.995907,121238.0,26.0,26.0,26.0,0.0,20.0,20.0,20.0,0.0,2.0,0.0,0.0,4663.0,4651.0,4651.0,0.369603,204.920557,0.007196,3.7e-05,0.000278
25%,4948.5,57.003725,54.0,60.0,2.999328,8.995967,129883.0,26.0,26.0,26.0,0.0,24.0,20.002403,20.0,0.098019,2.002394,0.0,65039.0,4995.5,4983.5,4983.0,1.428724,898.537219,1.003414,0.000144,0.010045
50%,4992.0,57.006201,54.0,60.0,2.999461,8.996766,130156.0,26.0,26.0,26.0,0.0,24.0,20.003204,20.0,0.113171,2.002433,0.0,129376.0,5006.0,4988.0,4988.0,1.517677,6517.247053,1.008459,0.000153,0.010093
75%,4996.5,57.007859,54.0,60.0,2.999596,8.997576,130234.0,26.0,26.0,26.0,0.0,24.0,20.004008,20.0,0.126554,2.003993,0.0,130156.0,5009.0,4992.5,4992.5,11.128908,6926.277749,9.715109,0.001114,0.097721
max,5004.0,57.010903,54.0,60.0,2.999998,8.999991,130364.0,26.0,26.0,26.0,0.0,24.0,20.004013,20.0,0.12663,2.004003,0.0,130286.0,5015.0,5005.0,5005.0,48.784759,27056.045949,47.34065,0.00488,0.473583


In [24]:
port_dataset.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0
mean,4682.637437,54.432125,50.024375,56.922375,2.983189,8.952741,113890.229616,24.034063,23.656335,23.091625,0.0,21.636312,18.663947,17.991937,0.09514595,1.988296,0.0,50055.532634,4645.758375,4563.482563,4563.453062,6.111009,8672.069803,5.873913,0.000612,0.058801
std,717.756256,6.910988,5.506989,7.209367,0.174166,0.500436,13412.895263,3.152681,3.206326,3.440653,0.0,3.138875,1.483633,1.995901,0.02259697,0.087517,0.0,53436.35276,739.697534,726.807027,726.822319,12.161901,5524.315253,11.691156,0.001218,0.117006
min,3434.0,43.0,41.0,45.0,2.295034,6.88324,92760.697388,19.0,17.001126,15.0,0.0,15.0,15.0,15.0,8.30873e-07,1.61515,0.0,0.0,3369.0,3294.0,3295.0,0.300257,172.004011,0.285018,3e-05,0.002845
25%,4061.0,48.0,45.0,51.0,2.937146,8.904227,102914.0,21.0,21.0,20.0,0.0,19.0,17.510395,16.0,0.09319019,2.001158,0.0,0.0,4003.0,3931.0,3929.75,0.836025,5983.74742,0.802892,8.4e-05,0.00805
50%,4687.0,54.0,50.0,57.0,2.998134,8.994106,113035.5,24.0,23.991182,23.0,0.0,21.0,18.555321,18.0,0.0992538,2.001476,0.0,0.0,4652.5,4570.0,4569.5,1.229579,8117.774895,1.1822,0.000123,0.011832
75%,5305.25,60.0,55.0,63.0,3.056278,9.086533,123195.25,27.0,26.189209,26.0,0.0,24.0,19.85432,20.0,0.1053491,2.001788,0.0,100552.436974,5283.25,5190.0,5191.0,1.608214,11880.391323,1.546915,0.000161,0.01549
max,5916.0,66.0,59.0,69.0,3.547441,10.641753,154149.93193,29.0,29.0,29.0,0.0,27.0,21.611507,21.0,0.1496838,2.261792,0.0,143306.862318,5929.0,5849.0,5849.0,52.438742,35254.276784,51.179391,0.005345,0.513112
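Part of this comparison can be automated rather than read off the two `describe()` tables. The sketch below is illustrative only: `range_report` and `count_order_violations` are hypothetical helpers, and the toy frames stand in for `port_samples` and `port_dataset`. It reports, per column, whether the generated range covers the sample range, and counts rows in which a "Min" column exceeds its "Max" counterpart.

```python
import pandas as pd

def range_report(samples: pd.DataFrame, generated: pd.DataFrame) -> pd.DataFrame:
    """Per-column min/max of both frames, plus a flag for range coverage."""
    cols = [c for c in samples.columns if c in generated.columns]
    report = pd.DataFrame({
        'sample_min': samples[cols].min(),
        'sample_max': samples[cols].max(),
        'generated_min': generated[cols].min(),
        'generated_max': generated[cols].max(),
    })
    # True when the generated values span at least the range seen in the samples
    report['covers_samples'] = (
        (report['generated_min'] <= report['sample_min'])
        & (report['generated_max'] >= report['sample_max'])
    )
    return report

def count_order_violations(df: pd.DataFrame, min_col: str, max_col: str) -> int:
    """Number of rows where the 'min' column is larger than the 'max' column."""
    return int((df[min_col] > df[max_col]).sum())

# toy data (assumption: not the notebook's real frames)
toy_samples = pd.DataFrame({'Packet Length Min': [54, 54],
                            'Packet Length Max': [60, 60]})
toy_generated = pd.DataFrame({'Packet Length Min': [41, 58, 59],
                              'Packet Length Max': [45, 45, 69]})

report = range_report(toy_samples, toy_generated)
violations = count_order_violations(toy_generated, 'Packet Length Min', 'Packet Length Max')
print(report)
print(f'Rows with Min > Max: {violations}')
```

A non-zero violation count would suggest that the two columns were generated independently of each other and may need to be derived from one another instead.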


## Turning certain columns into type Integer for consistency  

In [25]:
int_columns = ['Number of Ports', 'Packet Length Min', 'Packet Length Max', 'Total Length of Fwd Packet', 'Fwd Packet Length Max', 'Fwd Packet Length Min', 'Bwd Packet Length Max', 'Bwd Packet Length Min', 'SYN Flag Count', 'ACK Flag Count', 'RST Flag Count']
for col in int_columns:
    port_dataset[col] = port_dataset[col].astype(int)

port_dataset

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std,Label
0,4561,57,58,45,2.947471,8.857574,145123,26,26.000000,26,0,22,20.882645,21,0.093753,2.001062,0,0.000000,3581,3500,3501,1.039104,8687.425437,1.016492,0.000106,0.009822,PortScan
1,4809,47,45,65,3.506635,8.882307,110651,22,22.000000,22,0,20,19.010884,15,0.105235,2.001751,0,102437.118250,5043,4962,4959,1.268739,7102.088226,1.237470,0.000129,0.011979,PortScan
2,3465,54,41,57,2.857611,9.618570,105984,26,26.000000,26,0,24,17.196154,20,0.104441,2.001264,0,0.000000,3516,3468,3466,1.476553,6220.557088,1.443817,0.000150,0.013955,PortScan
3,4436,61,47,45,2.986253,7.670678,94346,28,28.000000,28,0,20,19.434706,20,0.099311,1.618916,0,0.000000,5837,5730,5739,0.451240,20143.259610,0.442166,0.000044,0.004415,PortScan
4,3436,55,49,62,2.962029,8.983642,117937,26,24.800242,23,0,23,18.359937,16,0.036875,2.001080,0,0.000000,3648,3572,3577,11.911448,758.429284,11.275823,0.001208,0.113511,PortScan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15995,4449,43,50,57,2.435099,8.977256,121910,28,28.000000,28,0,19,17.000000,15,0.049477,2.001239,0,122710.651752,4821,4713,4744,1.210337,7421.595089,1.177889,0.000119,0.011441,PortScan
15996,4391,52,51,50,2.919020,9.104157,149339,29,29.000000,29,0,23,17.267953,15,0.100909,2.001576,0,0.000000,5567,5456,5467,1.661436,6548.915950,1.619673,0.000169,0.015829,PortScan
15997,5011,46,55,64,2.933730,8.933704,128610,24,22.984329,21,0,22,18.355819,20,0.092906,1.746315,0,80275.962006,4895,4799,4798,1.109014,8294.921929,1.079457,0.000110,0.010791,PortScan
15998,5732,57,57,47,2.502362,9.127717,106555,23,23.000000,23,0,15,15.000000,15,0.002730,2.001485,0,0.000000,5717,5634,5593,1.055370,10238.521667,1.029001,0.000107,0.010051,PortScan
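Note that `astype(int)` truncates toward zero instead of rounding. The following minimal sketch uses toy values (an assumption, not the notebook's frame) to contrast plain truncation with rounding first; which behaviour is preferable for this dataset is a judgment call.

```python
import pandas as pd

# toy values with a fractional part
toy = pd.Series([92760.697388, 154149.93193, 26.0])

truncated = toy.astype(int)          # drops the fractional part
rounded = toy.round().astype(int)    # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```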


---

## Load the second sample dataset:

The following code creates another attack dataset, this time based on a different sample dataset.<br> 
The code in this section is mostly the same as in the notebook up to this point, so we will not repeat the explanations here.<br>

In [26]:
NUM_OF_ROWS = 16000

In [27]:
# import the attack sample dataset
port_samples = pd.read_csv('portscan_open_port_samples_2.csv')
print(f'Dataset Shape: {port_samples.shape}')
port_samples

Dataset Shape: (19, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,4424,59.997053,58,60,0.076709,0.005884,229086,26,26.0,26,0.0,24,24.0,24,0.0,2.005902,0.0,229086.0,8811,13,13,12.892101,684.450111,1.095053,0.001461,0.014709
1,4986,59.99719,58,60,0.074912,0.005612,258726,26,26.0,26,0.0,24,24.0,24,0.0,2.005628,0.0,258726.0,9951,14,14,14.184491,702.527845,1.09692,0.001424,0.013249
2,4983,59.998384,58,60,0.056836,0.00323,257140,26,26.0,26,0.0,24,24.0,24,0.0,2.003236,0.0,0.0,9890,8,8,10.319836,959.123778,0.066469,0.001043,0.004925
3,4021,59.997506,58,60,0.070587,0.004983,208208,26,26.0,26,0.0,24,24.0,24,0.0,2.004995,0.0,0.0,8008,10,10,10.557126,759.487001,0.109485,0.001317,0.009402
4,3000,59.99701,58,60,0.077267,0.00597,156312,26,26.0,26,0.0,24,24.0,24,0.0,2.005988,0.0,156312.0,6012,9,9,9.757549,617.060695,1.092883,0.001621,0.016516
5,4006,59.997246,58,60,0.074167,0.005501,207402,26,26.0,26,0.0,24,24.0,24,0.0,2.005516,0.0,207402.0,7977,11,11,12.066848,661.979,1.099047,0.001511,0.015127
6,4082,59.99802,58,60,0.062904,0.003957,209846,26,26.0,26,0.0,24,24.0,24,0.0,2.003965,0.0,0.0,8071,8,8,8.730299,925.397859,0.069691,0.001081,0.005426
7,4377,59.997925,58,60,0.064386,0.004146,225316,26,26.0,26,0.0,24,24.0,24,0.0,2.004154,0.0,0.0,8666,9,9,10.738069,807.873367,0.087439,0.001238,0.006715
8,4327,59.997221,58,60,0.074505,0.005551,224198,26,26.0,26,0.0,24,24.0,24,0.0,2.005567,0.0,224198.0,8623,12,12,12.768577,676.269559,1.096336,0.001479,0.013964
9,1000,59.994036,58,60,0.109054,0.011893,52156,26,26.0,26,0.0,24,24.0,24,0.0,2.011964,0.0,52156.0,2006,6,6,4.456998,451.424917,1.092991,0.002216,0.026842


In this attack sample, we noticed two attack flows with a low number of ports (indexes 9 and 10), whose remaining values also differ from the other rows in a small but noticeable way.<br> That is why we decided to set them aside for now and, at the end of this notebook, generate a small dataset based solely on these two rows.<br> This keeps each generated dataset close to the samples it was derived from.  

In [28]:
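# set the two low-port rows aside (copy so that later changes to port_samples do not affect them)
small_port_samples = port_samples.iloc[[9, 10]].copy()

# remove both rows from the main samples and renumber the index
port_samples.drop(index=[9, 10], inplace=True)
port_samples.reset_index(drop=True, inplace=True)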
print(f'Dataset Shape: {port_samples.shape}')
port_samples

Dataset Shape: (17, 26)


Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,4424,59.997053,58,60,0.076709,0.005884,229086,26,26.0,26,0.0,24,24.0,24,0.0,2.005902,0.0,229086.0,8811,13,13,12.892101,684.450111,1.095053,0.001461,0.014709
1,4986,59.99719,58,60,0.074912,0.005612,258726,26,26.0,26,0.0,24,24.0,24,0.0,2.005628,0.0,258726.0,9951,14,14,14.184491,702.527845,1.09692,0.001424,0.013249
2,4983,59.998384,58,60,0.056836,0.00323,257140,26,26.0,26,0.0,24,24.0,24,0.0,2.003236,0.0,0.0,9890,8,8,10.319836,959.123778,0.066469,0.001043,0.004925
3,4021,59.997506,58,60,0.070587,0.004983,208208,26,26.0,26,0.0,24,24.0,24,0.0,2.004995,0.0,0.0,8008,10,10,10.557126,759.487001,0.109485,0.001317,0.009402
4,3000,59.99701,58,60,0.077267,0.00597,156312,26,26.0,26,0.0,24,24.0,24,0.0,2.005988,0.0,156312.0,6012,9,9,9.757549,617.060695,1.092883,0.001621,0.016516
5,4006,59.997246,58,60,0.074167,0.005501,207402,26,26.0,26,0.0,24,24.0,24,0.0,2.005516,0.0,207402.0,7977,11,11,12.066848,661.979,1.099047,0.001511,0.015127
6,4082,59.99802,58,60,0.062904,0.003957,209846,26,26.0,26,0.0,24,24.0,24,0.0,2.003965,0.0,0.0,8071,8,8,8.730299,925.397859,0.069691,0.001081,0.005426
7,4377,59.997925,58,60,0.064386,0.004146,225316,26,26.0,26,0.0,24,24.0,24,0.0,2.004154,0.0,0.0,8666,9,9,10.738069,807.873367,0.087439,0.001238,0.006715
8,4327,59.997221,58,60,0.074505,0.005551,224198,26,26.0,26,0.0,24,24.0,24,0.0,2.005567,0.0,224198.0,8623,12,12,12.768577,676.269559,1.096336,0.001479,0.013964
9,4353,59.997236,58,60,0.074304,0.005521,225420,26,26.0,26,0.0,24,24.0,24,0.0,2.005536,0.0,225420.0,8670,12,12,12.549492,691.820826,1.090418,0.001446,0.014636


### Find the columns that we need to synthesize data for:

In [29]:
columns_to_gather = port_samples.replace(0, np.nan) # replace all 0 values with NaN
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() # drop the columns in which every value is NaN (i.e. was 0)
columns_to_gather # the remaining columns have at least one non-zero value (we know for a fact that the data is consistent and that no rows have missing values)

['Number of Ports',
 'Average Packet Length',
 'Packet Length Min',
 'Packet Length Max',
 'Packet Length Std',
 'Packet Length Variance',
 'Total Length of Fwd Packet',
 'Fwd Packet Length Max',
 'Fwd Packet Length Mean',
 'Fwd Packet Length Min',
 'Bwd Packet Length Max',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Min',
 'Fwd Segment Size Avg',
 'Subflow Fwd Bytes',
 'SYN Flag Count',
 'ACK Flag Count',
 'RST Flag Count',
 'Flow Duration',
 'Packets Per Second',
 'IAT Max',
 'IAT Mean',
 'IAT Std']
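To make the selection rule concrete, here is a small sketch on a toy frame (the values are made up for illustration). A column is dropped only if every value in it is zero; a column that is zero in some rows, such as 'Subflow Fwd Bytes', is kept.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Flow Duration': [1.5, 2.0, 0.9],            # never zero     -> kept
    'Subflow Fwd Bytes': [0.0, 130156.0, 0.0],   # sometimes zero -> kept
    'Bwd Segment Size Avg': [0.0, 0.0, 0.0],     # always zero    -> dropped
})

# same idiom as above: zeros become NaN, then all-NaN columns are removed
kept = toy.replace(0, np.nan).dropna(how='all', axis=1).columns.tolist()
dropped = [c for c in toy.columns if c not in kept]

print(kept)
print(dropped)
```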

### Find approximate minimum and maximum values for each column:

In [30]:
# find the minimum and maximum values for each column, scale the range (reduce min by 15% and increase max by 7.5%), and store the results in a dictionary.
min_max_dict = {col: (port_samples[col].min() * 0.85, port_samples[col].max() * 1.075) for col in columns_to_gather}

# print the min max dictionary
for col, (min_val, max_val) in min_max_dict.items():
    print(f'{col:<30} | Min: {min_val:.2f} | Max: {max_val:.2f}')

Number of Ports                | Min: 2550.00 | Max: 5371.77
Average Packet Length          | Min: 51.00 | Max: 64.50
Packet Length Min              | Min: 49.30 | Max: 62.35
Packet Length Max              | Min: 51.00 | Max: 64.50
Packet Length Std              | Min: 0.05 | Max: 0.09
Packet Length Variance         | Min: 0.00 | Max: 0.01
Total Length of Fwd Packet     | Min: 132865.20 | Max: 278242.25
Fwd Packet Length Max          | Min: 22.10 | Max: 27.95
Fwd Packet Length Mean         | Min: 22.10 | Max: 27.95
Fwd Packet Length Min          | Min: 22.10 | Max: 27.95
Bwd Packet Length Max          | Min: 20.40 | Max: 25.80
Bwd Packet Length Mean         | Min: 20.40 | Max: 25.80
Bwd Packet Length Min          | Min: 20.40 | Max: 25.80
Fwd Segment Size Avg           | Min: 1.70 | Max: 2.16
Subflow Fwd Bytes              | Min: 0.00 | Max: 278242.25
SYN Flag Count                 | Min: 5110.20 | Max: 10701.62
ACK Flag Count                 | Min: 6.80 | Max: 15.05
RST Flag Count    

### Create the base attack dataset (full of zeros):

In [31]:
# creating a zero-filled dataframe before adding values to it
port_dataset2 = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(port_samples.columns))), columns = port_samples.columns)
port_dataset2.head(3)

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Find the columns with constant zero values based on samples:

In [32]:
# adding zeros to all columns that should not have any values
zero_columns = [col for col in port_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    port_dataset2[col] = int(0)
zero_columns

['Fwd Packet Length Std', 'Bwd Packet Length Std', 'Bwd Segment Size Avg']

---

## Filling in values based on collected samples:

### First, fill in the columns that are not related to each other:

In [33]:
random_values = ['Average Packet Length', 'Packet Length Std', 'Packet Length Variance', 'Fwd Segment Size Avg']

for col in random_values:
    port_dataset2[col] = np.random.uniform(min_max_dict[col][0]*0.9, min_max_dict[col][1]*1.1, size = NUM_OF_ROWS)

In our sample dataset, the column 'Subflow Fwd Bytes' usually has values in a specific range, but sometimes it is zero.<br>
To reproduce this, we generate the column as a mixture of two vectors: one drawn from the usual range and one filled with zeros.<br>
For the 'Subflow Fwd Bytes' column, 50% of the values are taken from the usual range and the other 50% are zero.  

In [34]:
# generate a vector with random values based on min max dict, and also create a zero vector
col = 'Subflow Fwd Bytes'
subflow_values = port_samples[port_samples[col] != 0][col] 
min_max_dict[col] = (np.min(subflow_values), np.max(subflow_values))

rand_values = np.random.uniform(min_max_dict[col][0]*0.9, min_max_dict[col][1]*1.1, NUM_OF_ROWS)
zero_values = np.zeros(NUM_OF_ROWS)

# choose values randomly (50% from rand_values, 50% from zero_values)
port_dataset2[col] = np.where(np.random.rand(NUM_OF_ROWS) > 0.5, rand_values, zero_values)
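The same mixing step can be wrapped in a small helper with the share of zeros as a parameter. This is only a sketch: `mix_with_zeros` is a hypothetical name, the bounds are toy values, and it uses NumPy's `default_rng` generator instead of the `np.random.*` functions used in the notebook.

```python
import numpy as np

def mix_with_zeros(low, high, zero_share, size, rng):
    """Draw uniform values in [low, high) and replace a random share of them with 0."""
    values = rng.uniform(low, high, size)
    is_zero = rng.random(size) < zero_share   # True for roughly zero_share of the rows
    return np.where(is_zero, 0.0, values)

rng = np.random.default_rng(0)                # seeded so the sketch is reproducible
mixed = mix_with_zeros(low=100.0, high=200.0, zero_share=0.5, size=10_000, rng=rng)

observed_share = float(np.mean(mixed == 0))   # should be close to zero_share
print(f'observed share of zeros: {observed_share:.3f}')
```

If a share other than 50% is wanted, one option is to estimate it from the samples, for example with `(port_samples[col] == 0).mean()`.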

In [35]:
same_value1 = ['Packet Length Min', 'Packet Length Max']
val1 = np.random.randint(min_max_dict[same_value1[0]][0]*0.9, min_max_dict[same_value1[0]][1]*1.05, size = NUM_OF_ROWS)

same_value2 = ['Fwd Packet Length Max', 'Fwd Packet Length Mean', 'Fwd Packet Length Min']
val2 = np.random.randint(min_max_dict[same_value2[0]][0]*0.9, min_max_dict[same_value2[0]][1]*1.05, size = NUM_OF_ROWS)

same_value3 = ['Bwd Packet Length Min', 'Bwd Packet Length Max', 'Bwd Packet Length Mean']
val3 = np.random.randint(min_max_dict[same_value3[0]][0]*0.9, min_max_dict[same_value3[0]][1]*1.05, size = NUM_OF_ROWS)

for col in same_value1:
    if col == 'Packet Length Min':
        port_dataset2[col] = val1
    else:
        port_dataset2[col] = [val + np.random.randint(2, 8) for val in val1]

for col in same_value2:
    port_dataset2[col] = val2

for col in same_value3:
    port_dataset2[col] = val3

## Calculate and fill values into columns that have a certain correlation between them:

### Correlation between 'Number of Ports' and all the following: 'Total Length of Fwd Packet', 'SYN Flag Count', 'ACK Flag Count':

In [36]:
first_correlation = ['Number of Ports', 'Total Length of Fwd Packet', 'SYN Flag Count', 'ACK Flag Count']

# finding the correlation between the 'Number of Ports' column to the rest of the columns in order to create new data
independent_col = port_samples[first_correlation[0]].values.reshape(-1, 1) #column 'Number of Ports'
dependent_cols = port_samples[first_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Number of Ports' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

('Total Length of Fwd Packet', np.float64(51.72121549891846))
('SYN Flag Count', np.float64(1.9892775191891712))
('ACK Flag Count', np.float64(0.002642371853968411))
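Because the independent variable is a single column and no intercept column is added, each factor is the slope of a best-fit line through the origin. For one dependent column $y$ this slope has the closed form $\sum_i x_i y_i / \sum_i x_i^2$. The sketch below checks this on toy data (the numbers are assumptions chosen for illustration, not the notebook's samples).

```python
import numpy as np

# toy 'Number of Ports' values and two dependent columns that scale with them, plus small noise
x = np.array([1000.0, 3000.0, 4000.0, 5000.0])
Y = np.column_stack([
    52.0 * x + np.array([10.0, -20.0, 5.0, 15.0]),
    2.0 * x + np.array([1.0, -1.0, 2.0, -2.0]),
])

# same call as in the notebook: the solution has shape (1, 2), one factor per dependent column
factors_lstsq = np.linalg.lstsq(x.reshape(-1, 1), Y, rcond=None)[0].flatten()

# closed form for a no-intercept fit: (x . y) / (x . x) for each dependent column
factors_closed = (x @ Y) / (x @ x)

print(factors_lstsq)
print(factors_closed)
```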


In [37]:
# adding the rest of the attack feature values to the dataset at random based on the sample data
port_dataset2['Number of Ports'] = np.random.randint(min_max_dict['Number of Ports'][0]*0.85, min_max_dict['Number of Ports'][1]*1.1, NUM_OF_ROWS)

# generate new data by scaling the original correlated column value using the updated factor.
for index, row in port_dataset2.iterrows():
    for col, factor in scaling_factors: #iterating over all generated scaling factors
        delta = random.uniform(factor * 0.1, factor * 0.2) 
        updated_factor = factor + random.choice([-1, 1]) * delta
        port_dataset2.loc[index, col] = int(row['Number of Ports'] * updated_factor)
        if col == 'ACK Flag Count':
            port_dataset2.loc[index, 'RST Flag Count'] = int(row['Number of Ports'] * updated_factor) #copy the value to RST column
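The row-by-row loop above can be slow for thousands of rows. A vectorized version of the same idea might look like the sketch below. It is an illustration under assumptions: `jittered_product` is a hypothetical helper, the factor is positive, and the random numbers come from NumPy's `default_rng`, so the results will not match the loop value for value.

```python
import numpy as np

def jittered_product(base, factor, low_frac, high_frac, rng):
    """Scale `base` by `factor`, moved up or down by a random fraction in [low_frac, high_frac)."""
    delta = rng.uniform(low_frac, high_frac, size=base.shape)   # relative size of the change, per row
    sign = rng.choice([-1.0, 1.0], size=base.shape)             # direction of the change, per row
    return (base * factor * (1.0 + sign * delta)).astype(int)   # truncate like int() in the loop

rng = np.random.default_rng(0)
ports = np.array([3000, 4000, 5000])      # toy 'Number of Ports' values
syn = jittered_product(ports, 2.0, 0.1, 0.2, rng)
ratio = syn / (ports * 2.0)               # falls in 0.8-0.9 or 1.1-1.2

print(syn)
print(ratio)
```

In the loop above, 'RST Flag Count' receives the same value as 'ACK Flag Count'; with this helper that would correspond to computing the ACK column once and assigning it to both columns.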


### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std':

In [38]:
# generate random values for the 'Flow Duration' column
col = 'Flow Duration'
rand_values = np.random.uniform(min_max_dict[col][0]*0.95, min_max_dict[col][1], NUM_OF_ROWS)
usual_values = np.random.uniform(9.7461, 14.4132, NUM_OF_ROWS)

# choose values randomly (30% from rand_values, 70% from usual_values)
port_dataset2[col] = np.where(np.random.rand(NUM_OF_ROWS) > 0.3, usual_values, rand_values)

In [39]:
# finding the relationship between the 'Flow Duration' column and the rest of the columns in order to create new data
secondCorrelation = ['Flow Duration', 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std']
independent_col = port_samples[secondCorrelation[0]].values.reshape(-1, 1) #column 'Flow Duration'
dependent_cols = port_samples[secondCorrelation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Flow Duration' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(secondCorrelation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

# 'Flow Duration' multiplied by 'Packets Per Second' gives the number of packets in a flow; compute that product for every sample and take the average.
duration_to_packets_corr = [x * y for x, y in zip(port_samples['Flow Duration'].values, port_samples['Packets Per Second'].values)]
duration_to_packets_corr = np.mean(duration_to_packets_corr)
duration_to_packets_corr

('Packets Per Second', np.float64(42.41252992335282))
('IAT Max', np.float64(0.2758936721100647))
('IAT Mean', np.float64(0.00012795064831380915))
('IAT Std', np.float64(0.0033957593413350097))


np.float64(8063.0)
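The value above is the average number of packets per flow, since duration (seconds) times packets per second equals packets. The sketch below recomputes the product for the first three sample rows shown earlier and then derives a 'Packets Per Second' value for a new duration; `new_duration` is a hypothetical value chosen for illustration.

```python
import numpy as np

# 'Flow Duration' and 'Packets Per Second' of the first three rows of the second sample
duration = np.array([12.892101, 14.184491, 10.319836])
pps = np.array([684.450111, 702.527845, 959.123778])

packets = duration * pps          # packets per flow, close to whole numbers
mean_packets = packets.mean()

new_duration = 11.0               # hypothetical generated 'Flow Duration'
new_pps = mean_packets / new_duration

print(np.round(packets, 2))
print(round(new_pps, 2))
```

Dividing a (jittered) packet count by the generated duration, as the next cell does, keeps 'Packets Per Second' consistent with 'Flow Duration' in every generated row.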

In [40]:
# adding the rest of the attack feature values to the dataset at random based on the sample data
for index, row in port_dataset2.iterrows():
    for col, factor in scaling_factors: #iterating over all columns we need to add values to except 'Flow Duration'
        if col == 'Packets Per Second':
            delta = random.uniform(duration_to_packets_corr * 0.1, duration_to_packets_corr * 0.2)
            updated_factor = duration_to_packets_corr + random.choice([-1, 1]) * delta
            port_dataset2.loc[index, col] = updated_factor / row['Flow Duration']
        else:
            if col == 'IAT Std' or col == 'IAT Max':
                delta = random.uniform(factor * 0.4, factor * 0.65)
                updated_factor = factor + (-1) * delta  
            else:
                delta = random.uniform(factor * 0.1, factor * 0.2)
                updated_factor = factor + random.choice([-1, 1]) * delta
            port_dataset2.loc[index, col] = row['Flow Duration'] * updated_factor

---

## Validate the generated data by comparing it with the samples:

In [41]:
port_samples.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0
mean,4047.529412,59.997282,58.0,60.0,0.073412,0.005428,209356.588235,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.005444,0.0,150848.941176,8052.176471,10.823529,10.823529,12.714612,684.508845,2.090411,0.00161,0.027269
std,626.554678,0.000449,0.0,0.0,0.006446,0.000896,32330.941502,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000901,0.0,94916.392333,1243.49775,1.878673,1.878673,5.497775,155.368238,5.161284,0.000783,0.059926
min,3000.0,59.996691,58.0,60.0,0.056836,0.00323,156312.0,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.003236,0.0,0.0,6012.0,8.0,8.0,8.730299,218.453264,0.066469,0.001043,0.004925
25%,3658.0,59.99701,58.0,60.0,0.073986,0.005474,186914.0,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.005489,0.0,94120.0,7189.0,9.0,9.0,10.533744,639.958824,1.090418,0.001422,0.012832
50%,4021.0,59.997191,58.0,60.0,0.074897,0.00561,208416.0,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.005625,0.0,184340.0,8016.0,11.0,11.0,11.250724,676.269559,1.095385,0.001479,0.014636
75%,4377.0,59.997259,58.0,60.0,0.077267,0.00597,225420.0,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.005988,0.0,224198.0,8670.0,12.0,12.0,12.768577,703.349833,1.098808,0.001563,0.015609
max,4997.0,59.998384,58.0,60.0,0.081289,0.006608,258830.0,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.00663,0.0,258830.0,9955.0,14.0,14.0,33.197032,959.123778,22.046615,0.004578,0.259357


In [42]:
port_dataset2.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0,16000.0
mean,4028.517438,58.398797,53.994437,58.482062,0.069724,0.005144,208119.662437,23.496688,23.496688,23.496688,0.0,22.021,22.021,22.021,0.0,1.953589,0.0,91865.170682,8013.6135,10.116875,10.116875,14.916575,609.206626,1.958159,0.001908,0.02405
std,1074.171179,7.265077,6.066915,6.293709,0.015205,0.001543,64618.115401,2.861752,2.861752,2.861752,0.0,2.579603,2.579603,2.579603,0.0,0.242594,0.0,101335.644941,2479.089097,3.278893,3.278893,6.385539,198.580478,0.902714,0.000871,0.011075
min,2167.0,45.898433,44.0,46.0,0.04348,0.002472,90346.0,19.0,19.0,19.0,0.0,18.0,18.0,18.0,0.0,1.532524,0.0,0.0,3472.0,4.0,4.0,7.059156,181.812682,0.708491,0.000741,0.008651
25%,3098.75,52.068121,49.0,53.0,0.056585,0.00382,157179.0,21.0,21.0,21.0,0.0,20.0,20.0,20.0,0.0,1.743393,0.0,0.0,6020.0,8.0,8.0,11.091439,493.889982,1.396888,0.001361,0.017188
50%,4023.5,58.392144,54.0,59.0,0.069751,0.005137,202743.5,24.0,24.0,24.0,0.0,22.0,22.0,22.0,0.0,1.954272,0.0,0.0,7822.0,10.0,10.0,12.689131,626.201712,1.670956,0.001616,0.020484
75%,4954.25,64.721314,59.0,64.0,0.082835,0.006478,250099.0,26.0,26.0,26.0,0.0,24.0,24.0,24.0,0.0,2.165041,0.0,185637.599213,9625.25,12.0,12.0,14.268076,736.68421,2.089992,0.002042,0.025728
max,5907.0,70.946926,64.0,71.0,0.096121,0.007813,364683.0,28.0,28.0,28.0,0.0,26.0,26.0,26.0,0.0,2.372745,0.0,284694.452455,14052.0,18.0,18.0,35.681823,1356.549498,5.812875,0.005423,0.072452


---

## Adding the Label column:

In [43]:
# adding a label to the dataset
port_dataset2['Label'] = ATTACK_NAME

---

Make sure that the columns that should hold integer values are of type Integer, for consistency.  

In [44]:
int_columns = ['Number of Ports', 'Packet Length Min', 'Packet Length Max', 'Total Length of Fwd Packet', 'Fwd Packet Length Max', 'Fwd Packet Length Min', 'Bwd Packet Length Max', 'Bwd Packet Length Min', 'SYN Flag Count', 'ACK Flag Count', 'RST Flag Count']
for col in int_columns:
    port_dataset2[col] = port_dataset2[col].astype(int)

port_dataset2['Fwd Packet Length Mean'] = port_dataset2['Fwd Packet Length Mean'].astype(float)
port_dataset2['Bwd Packet Length Mean'] = port_dataset2['Bwd Packet Length Mean'].astype(float)

port_dataset2

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std,Label
0,2808,50.525483,53,60,0.055620,0.006470,127781,19,19.0,19,0,19,19.0,19,0,1.972776,0,231691.427068,4668,8,8,13.697967,528.928489,2.244494,0.001509,0.022194,PortScan
1,4811,58.546596,51,53,0.050881,0.002812,207120,26,26.0,26,0,20,20.0,20,0,1.742772,0,0.000000,11392,10,10,14.089722,685.783196,1.642976,0.001524,0.023207,PortScan
2,3978,64.778952,46,50,0.091477,0.006312,229134,28,28.0,28,0,19,19.0,19,0,2.311743,0,252548.402232,6784,11,11,12.365476,732.537942,1.369674,0.001825,0.017260,PortScan
3,5045,60.378734,44,46,0.045581,0.003411,224185,19,19.0,19,0,21,21.0,21,0,2.097670,0,0.000000,9012,15,15,11.822570,547.616910,1.853016,0.001717,0.018698,PortScan
4,5081,61.880228,53,57,0.093915,0.002572,290041,20,20.0,20,0,24,24.0,24,0,1.872889,0,278893.815154,11597,10,10,23.583003,285.019763,3.417053,0.003455,0.040249,PortScan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15995,2222,56.346854,54,58,0.060562,0.002943,102162,25,25.0,25,0,24,24.0,24,0,1.598152,0,0.000000,4985,4,4,9.080155,715.063368,0.894975,0.001339,0.011825,PortScan
15996,4625,55.662219,58,61,0.067069,0.007354,213244,25,25.0,25,0,23,23.0,23,0,1.992201,0,230417.602007,8233,10,10,12.160445,761.196068,1.448975,0.001279,0.020674,PortScan
15997,2895,57.319606,57,62,0.091155,0.005967,176083,25,25.0,25,0,23,23.0,23,0,1.945818,0,0.000000,6424,6,6,10.360512,669.913302,1.273046,0.001076,0.020275,PortScan
15998,5213,51.210715,56,60,0.087772,0.003489,309658,24,24.0,24,0,25,25.0,25,0,1.837856,0,204272.021050,12363,15,15,10.699182,628.660947,1.426533,0.001640,0.017537,PortScan


---

## Creating more rows based on a small subset of samples that is slightly different:

In [45]:
small_port_samples

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
9,1000,59.994036,58,60,0.109054,0.011893,52156,26,26.0,26,0.0,24,24.0,24,0.0,2.011964,0.0,52156.0,2006,6,6,4.456998,451.424917,1.092991,0.002216,0.026842
10,1000,59.994036,58,60,0.109054,0.011893,52156,26,26.0,26,0.0,24,24.0,24,0.0,2.011964,0.0,52156.0,2006,6,6,4.471801,449.930572,1.099059,0.002224,0.026916


In [46]:
NUM_OF_ROWS = 8000

### Create the base attack dataset (full of zeros):

In [47]:
# creating a zero-filled dataframe before adding values to it
port_dataset3 = pd.DataFrame(np.zeros((NUM_OF_ROWS, len(small_port_samples.columns))), columns = small_port_samples.columns)

# find the columns that we need to synthesize data for to produce an attack dataset
columns_to_gather = small_port_samples.replace(0, np.nan) # replace all 0 values with NaN
columns_to_gather = columns_to_gather.dropna(how = 'all', axis = 1).columns.tolist() # drop the columns in which every value is NaN (i.e. was 0)

# find approximate minimum and maximum values for each column and save them in a dictionary
min_max_dict = {col: (small_port_samples[col].min() * 0.85, small_port_samples[col].max() * 1.1) for col in columns_to_gather}

# adding zeros to all columns that should not have any values
zero_columns = [col for col in small_port_samples.columns if col not in columns_to_gather]
for col in zero_columns:
    port_dataset3[col] = int(0)
zero_columns

['Fwd Packet Length Std', 'Bwd Packet Length Std', 'Bwd Segment Size Avg']

---

## Filling in values based on collected samples:

### First, fill in the columns that are not related to each other:

In [48]:
random_values = ['Average Packet Length', 'Packet Length Std', 'Packet Length Variance', 'Subflow Fwd Bytes', 'Number of Ports']

for col in random_values:
    val = np.random.uniform(min_max_dict[col][0]*0.95, min_max_dict[col][1]*1.05, size = NUM_OF_ROWS)
    port_dataset3[col] = val

### Then fill the columns that share the same value:

In [49]:
same_value1 = ['Packet Length Min', 'Packet Length Max']
val1 = np.random.randint(min_max_dict[same_value1[0]][0]*0.9, min_max_dict[same_value1[0]][1]*1.05, size = NUM_OF_ROWS)

same_value2 = ['Bwd Packet Length Min', 'Bwd Packet Length Max', 'Bwd Packet Length Mean']
val2 = np.random.randint(min_max_dict[same_value2[0]][0]*0.9, min_max_dict[same_value2[0]][1]*1.05, size = NUM_OF_ROWS)


for col in same_value1:
    if col == 'Packet Length Min':
        port_dataset3[col] = val1
    else:
        port_dataset3[col] = [val + np.random.randint(2, 14) for val in val1]

for col in same_value2:
    port_dataset3[col] = val2
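The per-row offset for 'Packet Length Max' can also be drawn in one vectorized call instead of a list comprehension. This is a sketch under assumptions: `rng`, `n_rows`, `pkt_min`, and `pkt_max` are hypothetical names, and the bounds 44 and 66 are chosen to mirror the 44 to 65 range that `describe()` reports for 'Packet Length Min'.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only so the sketch is repeatable
n_rows = 5                      # small size for illustration

# Upper bound is exclusive, so values fall in [44, 65]
pkt_min = rng.integers(44, 66, size=n_rows)

# One offset per row in [2, 13], matching np.random.randint(2, 14)
pkt_max = pkt_min + rng.integers(2, 14, size=n_rows)

print(pkt_min, pkt_max)
```

Drawing the offset per row keeps the maximum strictly above the minimum in every generated flow.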

In [50]:
independent = ['Fwd Packet Length Max', 'Fwd Packet Length Min', 'Fwd Packet Length Mean']

packet_length_max = np.random.randint(min_max_dict['Fwd Packet Length Max'][0] * 0.9, min_max_dict['Fwd Packet Length Max'][1] * 1.1, NUM_OF_ROWS)

# create 'Fwd Packet Length Min' by applying a small variation
packet_length_min = packet_length_max - np.random.randint(2, 16, NUM_OF_ROWS)

# calculate 'Fwd Packet Length Mean': average of min and max, or copy if equal
average_packet_length = np.where(packet_length_max != packet_length_min, (packet_length_max + packet_length_min) / 2, packet_length_min)

# assign the values to the dataset
port_dataset3['Fwd Packet Length Max'] = packet_length_max.astype(int)
port_dataset3['Fwd Packet Length Mean'] = average_packet_length
port_dataset3['Fwd Packet Length Min'] = packet_length_min.astype(int)

In [51]:
port_dataset3

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1066.275284,50.096409,46,49,0.102810,0.013700,0.0,23,21.5,20,0,23,23,23,0,0.0,0,50971.994095,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,808.230746,50.755322,48,56,0.099652,0.011862,0.0,19,12.5,6,0,23,23,23,0,0.0,0,47221.700696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1054.312971,54.495864,50,63,0.115166,0.010852,0.0,22,18.0,14,0,25,25,25,0,0.0,0,59031.103831,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,931.241397,60.893750,50,57,0.107528,0.013508,0.0,22,15.0,8,0,18,18,18,0,0.0,0,44257.613763,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,838.056351,59.248038,48,60,0.119262,0.012404,0.0,22,15.5,9,0,18,18,18,0,0.0,0,49184.363038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7995,830.290304,56.342642,65,72,0.106147,0.013086,0.0,20,19.0,18,0,26,26,26,0,0.0,0,46858.064529,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7996,1052.396177,58.932604,50,59,0.120696,0.011661,0.0,28,26.0,24,0,21,21,21,0,0.0,0,59448.970214,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7997,979.995008,49.221035,49,61,0.101105,0.011227,0.0,29,25.0,21,0,20,20,20,0,0.0,0,56871.720398,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7998,1089.899177,53.467697,58,67,0.113531,0.012442,0.0,26,21.0,16,0,25,25,25,0,0.0,0,57190.006086,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Calculate and fill values into columns that have a certain correlation between them:

### Correlation between 'SYN Flag Count' and all the following: 'ACK Flag Count', 'RST Flag Count', 'Total Length of Fwd Packet':

In [52]:
first_correlation = ['SYN Flag Count', 'ACK Flag Count', 'RST Flag Count', 'Total Length of Fwd Packet']

# finding the correlation between the 'SYN Flag Count' column to the rest of the columns in order to create new data
independent_col = small_port_samples[first_correlation[0]].values.reshape(-1, 1) #column 'SYN Flag Count'
dependent_cols = small_port_samples[first_correlation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'SYN Flag Count' and the rest
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(first_correlation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)
    
# draw 'SYN Flag Count' at random from the widened sample range, then derive the correlated columns from it
port_dataset3['SYN Flag Count'] = np.random.randint(min_max_dict['SYN Flag Count'][0]*0.9, min_max_dict['SYN Flag Count'][1]*1.05, NUM_OF_ROWS)

for index, row in port_dataset3.iterrows():
    for col, factor in zip(first_correlation[1:], scaling_factors): #iterating over all columns we need to fill except 'SYN Flag Count'
        delta = random.uniform(factor[1] * 0.01, factor[1] * 0.02) # random reduction of 1% to 2% of the scaling factor
        updated_factor = factor[1] + (-1) * delta
        port_dataset3.loc[index, col] = int(row['SYN Flag Count'] * updated_factor)

('ACK Flag Count', np.float64(0.0029910269192422725))
('RST Flag Count', np.float64(0.0029910269192422725))
('Total Length of Fwd Packet', np.float64(26.0))
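With a single predictor column and no intercept, the least-squares factor for each dependent column reduces to sum(x·y) / sum(x²). The sketch below checks this against `np.linalg.lstsq` using the values from the two sample rows (SYN 2006, ACK 6, RST 6, total forward length 52156). The names `syn`, `deps`, `factors`, and `closed_form` are illustrative.

```python
import numpy as np

# Predictor: 'SYN Flag Count' from the two sample rows, as a column vector
syn = np.array([[2006.0], [2006.0]])

# Dependents: 'ACK Flag Count', 'RST Flag Count', 'Total Length of Fwd Packet'
deps = np.array([[6.0, 6.0, 52156.0],
                 [6.0, 6.0, 52156.0]])

# Least-squares fit through the origin, one factor per dependent column
factors = np.linalg.lstsq(syn, deps, rcond=None)[0].flatten()

# Closed form for a single predictor: sum(x * y) / sum(x ** 2)
closed_form = (syn * deps).sum(axis=0) / (syn ** 2).sum()

print(factors)
print(closed_form)
```

Since both sample rows are identical in these columns, the factors are plain ratios: 6 / 2006 for the flag counts and 52156 / 2006 = 26 for the forward length.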


In [53]:
port_dataset3

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1066.275284,50.096409,46,49,0.102810,0.013700,55935.0,23,21.5,20,0,23,23,23,0,0.0,0,50971.994095,2178,6.0,6.0,0.0,0.0,0.0,0.0,0.0
1,808.230746,50.755322,48,56,0.099652,0.011862,53663.0,19,12.5,6,0,23,23,23,0,0.0,0,47221.700696,2100,6.0,6.0,0.0,0.0,0.0,0.0,0.0
2,1054.312971,54.495864,50,63,0.115166,0.010852,56045.0,22,18.0,14,0,25,25,25,0,0.0,0,59031.103831,2185,6.0,6.0,0.0,0.0,0.0,0.0,0.0
3,931.241397,60.893750,50,57,0.107528,0.013508,39816.0,22,15.0,8,0,18,18,18,0,0.0,0,44257.613763,1560,4.0,4.0,0.0,0.0,0.0,0.0,0.0
4,838.056351,59.248038,48,60,0.119262,0.012404,52263.0,22,15.5,9,0,18,18,18,0,0.0,0,49184.363038,2051,6.0,6.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7995,830.290304,56.342642,65,72,0.106147,0.013086,40483.0,20,19.0,18,0,26,26,26,0,0.0,0,46858.064529,1573,4.0,4.0,0.0,0.0,0.0,0.0,0.0
7996,1052.396177,58.932604,50,59,0.120696,0.011661,42347.0,28,26.0,24,0,21,21,21,0,0.0,0,59448.970214,1659,4.0,4.0,0.0,0.0,0.0,0.0,0.0
7997,979.995008,49.221035,49,61,0.101105,0.011227,46492.0,29,25.0,21,0,20,20,20,0,0.0,0,56871.720398,1812,5.0,5.0,0.0,0.0,0.0,0.0,0.0
7998,1089.899177,53.467697,58,67,0.113531,0.012442,43215.0,26,21.0,16,0,25,25,25,0,0.0,0,57190.006086,1690,4.0,4.0,0.0,0.0,0.0,0.0,0.0
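The row-by-row loop that fills these columns can be expressed with vectorized operations, which is usually much faster than `iterrows` on thousands of rows. This is a sketch, not the notebook's implementation: `demo`, `factors`, and `shrink` are hypothetical names, the three SYN values are taken from the preview above, and the factors are the ratios printed earlier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # seeded only so the sketch is repeatable

syn = np.array([2178, 2100, 1560])  # 'SYN Flag Count' values from the preview above
factors = {
    'ACK Flag Count': 6 / 2006,
    'RST Flag Count': 6 / 2006,
    'Total Length of Fwd Packet': 26.0,
}

demo = pd.DataFrame({'SYN Flag Count': syn})
for col, factor in factors.items():
    # Shrink the factor by a random 1% to 2%, one draw per row, then truncate to int
    shrink = rng.uniform(0.01, 0.02, size=len(demo))
    demo[col] = (demo['SYN Flag Count'] * factor * (1 - shrink)).astype(int)

print(demo)
```

As in the loop, each column gets its own random draw, so 'ACK Flag Count' and 'RST Flag Count' can occasionally differ by one for the same row.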


### Correlation between 'Flow Duration' and all of the following: 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std':

In [54]:
# generate random values for the 'Flow Duration' column
rand_values = np.random.uniform(min_max_dict['Flow Duration'][0]*0.9, min_max_dict['Flow Duration'][1]*1.05, size = NUM_OF_ROWS)

# assign the random values
port_dataset3['Flow Duration'] = rand_values

# finding the correlation between the 'Flow Duration' column to the rest of the columns in order to create new data
secondCorrelation = ['Flow Duration', 'Packets Per Second', 'IAT Max', 'IAT Mean', 'IAT Std']
independent_col = small_port_samples[secondCorrelation[0]].values.reshape(-1, 1) #column 'Flow Duration'
dependent_cols = small_port_samples[secondCorrelation[1:]].values  

# using least squares regression to find scaling factors that best approximate the relationship between 'Flow Duration' and the rest of the columns in secondCorrelation
scaling_factors = np.linalg.lstsq(independent_col, dependent_cols, rcond = None)[0]

scaling_factors = [(name, factor) for name, factor in zip(secondCorrelation[1:], scaling_factors.flatten())]
for val in scaling_factors:
    print(val)

# calculate the mean of 'Flow Duration' * 'Packets Per Second' over the samples; this mean product (roughly the number of packets per flow) is used later to derive 'Packets Per Second' from 'Flow Duration'
duration_to_packets_corr = [x * y for x, y in zip(small_port_samples['Flow Duration'].values, small_port_samples['Packets Per Second'].values)]
duration_to_packets_corr = np.mean(duration_to_packets_corr)
duration_to_packets_corr

('Packets Per Second', np.float64(100.94868504824292))
('IAT Max', np.float64(0.24550378806247353))
('IAT Mean', np.float64(0.0004972650422675103))
('IAT Std', np.float64(0.006020762164982359))


np.float64(2012.0)
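The value 2012.0 is a mean product rather than a correlation coefficient. Assuming 'Packets Per Second' is the packet count divided by 'Flow Duration', the product recovers the approximate number of packets per flow. The sketch below recomputes it from the two sample rows; the array names are illustrative.

```python
import numpy as np

# 'Flow Duration' and 'Packets Per Second' from the two sample rows
flow_duration = np.array([4.456998, 4.471801])
packets_per_second = np.array([451.424917, 449.930572])

# Element-wise product, then the mean over the sample rows
mean_product = np.mean(flow_duration * packets_per_second)
print(mean_product)
```

Dividing this constant by a generated 'Flow Duration' therefore yields a rate that keeps the implied packet count near the sampled one.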

In [55]:
# filling the columns that depend on 'Flow Duration', with random noise around the factors derived from the sample data
for index, row in port_dataset3.iterrows():
    for col, factor in zip(secondCorrelation[1:], scaling_factors): #iterating over all columns we need to fill except 'Flow Duration'
        if col == 'Packets Per Second':
            delta = random.uniform(duration_to_packets_corr * 0.1, duration_to_packets_corr * 0.2) 
            updated_factor = duration_to_packets_corr + random.choices([-1, 1], weights=[2, 1], k=1)[0] * delta
            port_dataset3.loc[index, col] = updated_factor / row['Flow Duration']
        elif col == 'IAT Mean':
            delta = random.uniform(factor[1] * 0.5, factor[1] * 0.8)
            updated_factor = factor[1] + delta
            port_dataset3.loc[index, col] = row['Flow Duration'] * updated_factor
        else:
            delta = random.uniform(factor[1] * 0.15, factor[1] * 0.35)
            updated_factor = factor[1] + random.choice([-1, 1]) * delta
            port_dataset3.loc[index, col] = row['Flow Duration'] * updated_factor
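The 'Packets Per Second' branch perturbs the base product by 10% to 20% in either direction, choosing the downward direction twice as often. A vectorized sketch of that rule follows. The names `rng`, `duration`, `magnitude`, `sign`, and `pps` are hypothetical, and the duration range is an assumption chosen to be close to the `describe()` output.

```python
import numpy as np

rng = np.random.default_rng(3)  # seeded only so the sketch is repeatable
n_rows = 1000
base = 2012.0                   # mean of duration * rate computed from the samples

# Hypothetical 'Flow Duration' values in a range close to the generated data
duration = rng.uniform(3.41, 5.16, size=n_rows)

# Relative perturbation of 10% to 20%, downward with probability 2/3
magnitude = rng.uniform(0.1, 0.2, size=n_rows)
sign = rng.choice([-1, 1], p=[2 / 3, 1 / 3], size=n_rows)

pps = base * (1 + sign * magnitude) / duration
print(pps.min(), pps.max())
```

One consequence of this rule is that the implied packet count (`pps * duration`) never lands within 10% of the base value; it always falls in the 80% to 90% or 110% to 120% bands.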

Cast the columns that hold whole-number values to integers for consistency.

In [56]:
int_columns = ['Number of Ports', 'Packet Length Min', 'Packet Length Max', 'Total Length of Fwd Packet', 'Fwd Packet Length Max', 'Fwd Packet Length Min', 'Bwd Packet Length Max', 'Bwd Packet Length Min', 'SYN Flag Count', 'ACK Flag Count', 'RST Flag Count']
for col in int_columns:
    port_dataset3[col] = port_dataset3[col].astype(int)

port_dataset3['Bwd Packet Length Mean'] = port_dataset3['Bwd Packet Length Mean'].astype(float)

port_dataset3

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
0,1066,50.096409,46,49,0.102810,0.013700,55935,23,21.5,20,0,23,23.0,23,0,0.0,0,50971.994095,2178,6,6,4.400320,518.388932,0.819930,0.003836,0.018389
1,808,50.755322,48,56,0.099652,0.011862,53663,19,12.5,6,0,23,23.0,23,0,0.0,0,47221.700696,2100,6,6,4.150905,555.905661,0.789094,0.003545,0.030745
2,1054,54.495864,50,63,0.115166,0.010852,56045,22,18.0,14,0,25,25.0,25,0,0.0,0,59031.103831,2185,6,6,4.592802,377.183838,0.814027,0.003762,0.020485
3,931,60.893750,50,57,0.107528,0.013508,39816,22,15.0,8,0,18,18.0,18,0,0.0,0,44257.613763,1560,4,4,5.045390,338.335926,1.000431,0.004443,0.023723
4,838,59.248038,48,60,0.119262,0.012404,52263,22,15.5,9,0,18,18.0,18,0,0.0,0,49184.363038,2051,6,6,4.875268,352.029607,0.983861,0.003854,0.037747
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7995,830,56.342642,65,72,0.106147,0.013086,40483,20,19.0,18,0,26,26.0,26,0,0.0,0,46858.064529,1573,4,4,4.540834,395.791087,1.301523,0.003714,0.022232
7996,1052,58.932604,50,59,0.120696,0.011661,42347,28,26.0,24,0,21,21.0,21,0,0.0,0,59448.970214,1659,4,4,3.824988,625.413604,0.751224,0.003193,0.016926
7997,979,49.221035,49,61,0.101105,0.011227,46492,29,25.0,21,0,20,20.0,20,0,0.0,0,56871.720398,1812,5,5,4.211737,428.226732,1.249910,0.003168,0.030476
7998,1089,53.467697,58,67,0.113531,0.012442,43215,26,21.0,16,0,25,25.0,25,0,0.0,0,57190.006086,1690,4,4,4.007331,424.501007,1.186079,0.003525,0.031513
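Note that `astype(int)` truncates toward zero rather than rounding, which is why row 7997 shows 979 for a 'Number of Ports' value of 979.995008. The sketch below contrasts the two behaviours on values taken from the tables above; `ports`, `truncated`, and `rounded` are illustrative names.

```python
import pandas as pd

# 'Number of Ports' values from rows 0 and 7997 before the cast
ports = pd.Series([1066.275284, 979.995008])

truncated = ports.astype(int)        # drops the fractional part
rounded = ports.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

If nearest-integer values are preferred, rounding before the cast is one option; the notebook as written uses truncation.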


---

## Validate the generated data by comparing the samples with the generated dataset:

In [57]:
small_port_samples

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
9,1000,59.994036,58,60,0.109054,0.011893,52156,26,26.0,26,0.0,24,24.0,24,0.0,2.011964,0.0,52156.0,2006,6,6,4.456998,451.424917,1.092991,0.002216,0.026842
10,1000,59.994036,58,60,0.109054,0.011893,52156,26,26.0,26,0.0,24,24.0,24,0.0,2.011964,0.0,52156.0,2006,6,6,4.471801,449.930572,1.099059,0.002224,0.026916


In [58]:
port_dataset3.describe()

Unnamed: 0,Number of Ports,Average Packet Length,Packet Length Min,Packet Length Max,Packet Length Std,Packet Length Variance,Total Length of Fwd Packet,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Min,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Min,Bwd Packet Length Std,Fwd Segment Size Avg,Bwd Segment Size Avg,Subflow Fwd Bytes,SYN Flag Count,ACK Flag Count,RST Flag Count,Flow Duration,Packets Per Second,IAT Max,IAT Mean,IAT Std
count,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0
mean,981.159375,58.868581,54.56125,62.07825,0.107161,0.011664,49235.854375,24.54925,20.338625,16.128,0.0,21.987875,21.987875,21.987875,0.0,0.0,0.0,51166.15983,1922.52475,5.13775,5.135375,4.283315,453.034798,1.05505,0.003519,0.025703
std,99.119022,6.006858,6.313932,7.244563,0.010843,0.001195,5784.045553,3.439884,3.984016,5.300036,0.0,2.596309,2.596309,2.596309,0.0,0.0,0.0,5215.205315,225.758922,0.738643,0.738997,0.507656,88.249858,0.297395,0.000459,0.007355
min,807.0,48.44834,44.0,46.0,0.088063,0.009604,39104.0,19.0,11.5,4.0,0.0,18.0,18.0,18.0,0.0,0.0,0.0,42118.984503,1534.0,4.0,4.0,3.409846,313.548606,0.547767,0.002555,0.013508
25%,897.0,53.682408,49.0,57.0,0.097703,0.010643,44257.75,22.0,17.5,12.0,0.0,20.0,20.0,20.0,0.0,0.0,0.0,46624.548457,1727.0,5.0,5.0,3.835641,379.266676,0.787968,0.003148,0.019149
50%,983.0,58.920396,55.0,62.0,0.107312,0.011654,49241.0,25.0,20.5,16.0,0.0,22.0,22.0,22.0,0.0,0.0,0.0,51223.812883,1923.0,5.0,5.0,4.283808,443.647515,1.025363,0.003503,0.024714
75%,1066.0,64.062904,60.0,68.0,0.116461,0.012709,54227.0,28.0,23.5,20.0,0.0,24.0,24.0,24.0,0.0,0.0,0.0,55679.283437,2117.0,6.0,6.0,4.728043,501.100028,1.313607,0.003876,0.032062
max,1154.0,69.291855,65.0,78.0,0.125957,0.013736,59584.0,30.0,29.0,28.0,0.0,26.0,26.0,26.0,0.0,0.0,0.0,60239.246797,2315.0,6.0,6.0,5.164772,700.964378,1.699771,0.004618,0.04185
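The visual comparison above can be complemented by a small programmatic check. This is a sketch only: `columns_outside_range` is a hypothetical helper, the 0.75 and 1.25 tolerances are arbitrary assumptions, and the logic assumes non-negative feature values. The toy frames reuse the sample values and the generated minimum and maximum of two columns from the tables above.

```python
import pandas as pd

def columns_outside_range(samples, generated, lower=0.75, upper=1.25):
    """Return columns whose generated min/max leave [lower * sample_min, upper * sample_max]."""
    flagged = []
    for col in samples.columns:
        lo = samples[col].min() * lower
        hi = samples[col].max() * upper
        if generated[col].min() < lo or generated[col].max() > hi:
            flagged.append(col)
    return flagged

samples = pd.DataFrame({
    'SYN Flag Count': [2006, 2006],
    'Flow Duration': [4.456998, 4.471801],
})
# Generated min and max for the same columns, as reported by describe()
generated = pd.DataFrame({
    'SYN Flag Count': [1534, 2315],
    'Flow Duration': [3.409846, 5.164772],
})

print(columns_outside_range(samples, generated))
```

Columns with deliberately wide noise, such as 'Packets Per Second', would need a looser tolerance than the one assumed here.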


---

## Adding the Label column:

In [59]:
# adding a label to the dataset
port_dataset3['Label'] = ATTACK_NAME

---

## Finally, merge the three generated datasets and save the result as a CSV file:

In [60]:
# merge the three generated port scan datasets and shuffle the rows
mergedport_dataset = pd.concat([port_dataset, port_dataset2, port_dataset3], axis=0)
mergedport_dataset = mergedport_dataset.sample(frac=1, random_state=42).reset_index(drop=True)
print(f'Attack Dataset Shape: {mergedport_dataset.shape}')

Attack Dataset Shape: (40000, 27)
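The merge-and-shuffle step can be illustrated on two tiny frames. This is a minimal sketch with hypothetical data (`part_a`, `part_b`); it shows that `sample(frac=1)` only reorders rows and that `reset_index(drop=True)` restores a clean 0..n-1 index.

```python
import pandas as pd

# Two hypothetical partial datasets sharing the same columns
part_a = pd.DataFrame({'SYN Flag Count': [1, 2], 'Label': 'PortScan'})
part_b = pd.DataFrame({'SYN Flag Count': [3], 'Label': 'PortScan'})

# Stack the parts, shuffle all rows reproducibly, and rebuild the index
merged = pd.concat([part_a, part_b], axis=0)
merged = merged.sample(frac=1, random_state=42).reset_index(drop=True)

print(merged.shape)
```

Fixing `random_state` makes the row order reproducible across runs, which helps when the saved CSV is later split into training and test sets.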


In [61]:
# save the dataset
mergedport_dataset.to_csv('port_scan_open_port_dataset.csv', index=False)