# Network Security Data Processing: 4Network, UNSW-NB15, and CESNET-TimeSeries24

This notebook implements the data processing pipeline for the AI4Cyber assignment. The goal is to load, clean, preprocess, and merge three distinct network security datasets into a unified format suitable for training various machine learning models (classification, clustering, and time-series forecasting).

The three datasets are:
1.  **4Network**: The basic dataset provided in the assignment.
2.  **UNSW-NB15**: A comprehensive, modern network intrusion dataset.
3.  **CESNET-TimeSeries24**: A time-series dataset of network traffic per IP address.

## 1. Import Libraries and Define Paths

In [45]:
import pandas as pd
import numpy as np
import os
import glob
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

# Define paths to the datasets
# It's good practice to use relative paths if the notebook is in the project root,
# but here we use absolute paths for clarity.
BASE_PATH = "e:/Swinburne_bsc_data_science/year_two/sem_two/COS30049/Assignment2"

PATH_4NETWORK = os.path.join(BASE_PATH, "Assignment Datasets/Assignment Datasets/4Network")
PATH_UNSW_NB15 = os.path.join(BASE_PATH, "UNSW-NB15/CSV Files")
PATH_CESNET = os.path.join(BASE_PATH, "ip_addresses_sample/ip_addresses_sample")
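
The comment above mentions relative paths as the more portable option. A minimal sketch of that alternative follows, assuming the data folders sit under a single root directory. The environment variable name `AI4CYBER_DATA_DIR` is hypothetical and is not part of the assignment; the folder names are copied from the cell above.

```python
import os
from pathlib import Path

# Hypothetical environment variable; falls back to the current working directory.
base_path = Path(os.environ.get("AI4CYBER_DATA_DIR", Path.cwd()))

# Same folder layout as the absolute paths used in this notebook.
path_4network = base_path / "Assignment Datasets" / "Assignment Datasets" / "4Network"
path_unsw_nb15 = base_path / "UNSW-NB15" / "CSV Files"
path_cesnet = base_path / "ip_addresses_sample" / "ip_addresses_sample"

# Report which folders are actually present before any loading is attempted.
for name, path in [("4Network", path_4network),
                   ("UNSW-NB15", path_unsw_nb15),
                   ("CESNET", path_cesnet)]:
    print(f"{name}: {path} (exists: {path.exists()})")
```

With this layout the notebook can run on another machine by setting one variable instead of editing every path.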


## 2. Load and Preprocess UNSW-NB15 Dataset

In [46]:
# Load UNSW-NB15 training and testing sets
unsw_train_path = os.path.join(PATH_UNSW_NB15, "Training and Testing Sets/UNSW_NB15_training-set.csv")
unsw_test_path = os.path.join(PATH_UNSW_NB15, "Training and Testing Sets/UNSW_NB15_testing-set.csv")

df_unsw_train = pd.read_csv(unsw_train_path)
df_unsw_test = pd.read_csv(unsw_test_path)

# Concatenate them
df_unsw = pd.concat([df_unsw_train, df_unsw_test], ignore_index=True)

# Drop irrelevant columns (like id)
df_unsw = df_unsw.drop(columns=['id'])

# Rename columns for clarity and consistency
df_unsw.columns = df_unsw.columns.str.strip().str.lower().str.replace(' ', '_')

# Map attack categories to a simplified set
# First, let's see the original categories
print("Original UNSW-NB15 attack categories:")
print(df_unsw['attack_cat'].unique())

# Map the UNSW-NB15 categories onto Normal plus four attack families (DoS, Probe, R2L, U2R)
attack_map = {
    'Normal': 'Normal',
    'Generic': 'DoS',
    'Exploits': 'Probe',
    'Fuzzers': 'Probe',
    'DoS': 'DoS',
    'Reconnaissance': 'Probe',
    'Analysis': 'Probe',
    'Backdoor': 'R2L',
    'Shellcode': 'U2R',
    'Worms': 'U2R'
}
# Missing categories are treated as Normal; any unmapped value falls back to 'Other'
df_unsw['attack_cat'] = df_unsw['attack_cat'].fillna('Normal')
df_unsw['category'] = df_unsw['attack_cat'].map(attack_map).fillna('Other')


# Create a binary label (0 for Normal, 1 for Attack)
df_unsw['binary_label'] = (df_unsw['label'] != 0).astype(int)

print("\nProcessed UNSW-NB15 DataFrame head:")
df_unsw.head()


Original UNSW-NB15 attack categories:
['Normal' 'Backdoor' 'Analysis' 'Fuzzers' 'Shellcode' 'Reconnaissance'
 'Exploits' 'DoS' 'Worms' 'Generic']

Processed UNSW-NB15 DataFrame head:


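| | dur | proto | service | state | spkts | dpkts | sbytes | dbytes | rate | sttl | ... | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | ct_src_ltm | ct_srv_dst | is_sm_ips_ports | attack_cat | label | category | binary_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.121478 | tcp | - | FIN | 6 | 4 | 258 | 172 | 74.08749 | 252 | ... | 0 | 0 | 0 | 1 | 1 | 0 | Normal | 0 | Normal | 0 |
| 1 | 0.649902 | tcp | - | FIN | 14 | 38 | 734 | 42014 | 78.473372 | 62 | ... | 0 | 0 | 0 | 1 | 6 | 0 | Normal | 0 | Normal | 0 |
| 2 | 1.623129 | tcp | - | FIN | 8 | 16 | 364 | 13186 | 14.170161 | 62 | ... | 0 | 0 | 0 | 2 | 6 | 0 | Normal | 0 | Normal | 0 |
| 3 | 1.681642 | tcp | ftp | FIN | 12 | 12 | 628 | 770 | 13.677108 | 62 | ... | 1 | 1 | 0 | 2 | 1 | 0 | Normal | 0 | Normal | 0 |
| 4 | 0.449454 | tcp | - | FIN | 10 | 6 | 534 | 268 | 33.373826 | 254 | ... | 0 | 0 | 0 | 2 | 39 | 0 | Normal | 0 | Normal | 0 |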
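
The category mapping and the binary label come from two different source columns (`attack_cat` and `label`), so it can be worth checking that they agree. The sketch below runs that check on a small made-up frame; the rows, and the `NewAttack` value in particular, are illustrative and not taken from UNSW-NB15.

```python
import pandas as pd

# Toy rows: a missing category and an unseen category are included on purpose.
toy = pd.DataFrame({
    "attack_cat": ["Normal", "Generic", "Worms", None, "Backdoor", "NewAttack"],
    "label": [0, 1, 1, 0, 1, 1],
})

attack_map = {
    "Normal": "Normal", "Generic": "DoS", "Exploits": "Probe", "Fuzzers": "Probe",
    "DoS": "DoS", "Reconnaissance": "Probe", "Analysis": "Probe",
    "Backdoor": "R2L", "Shellcode": "U2R", "Worms": "U2R",
}

# Missing categories become Normal; categories absent from the map become Other.
toy["attack_cat"] = toy["attack_cat"].fillna("Normal")
toy["category"] = toy["attack_cat"].map(attack_map).fillna("Other")
toy["binary_label"] = (toy["label"] != 0).astype(int)

# Rows where "category is Normal" and "binary_label is 0" disagree.
inconsistent = toy[(toy["category"] == "Normal") != (toy["binary_label"] == 0)]
print(toy)
print("Inconsistent rows:", len(inconsistent))
```

On the real data, a non-empty `inconsistent` frame would point to rows whose `attack_cat` and `label` columns contradict each other.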


## 3. Load and Preprocess 4Network Dataset

In [47]:
# Load the 4Network data and the label map
df_4network = pd.read_csv(os.path.join(PATH_4NETWORK, "basic_data_4.csv"))
print("Original 4Network columns:", df_4network.columns)

label_map_4network = pd.read_csv(os.path.join(PATH_4NETWORK, "label_category_map.csv"))

# Merge to get descriptive category labels
df_4network = df_4network.merge(label_map_4network, on='label', how='left')

# Rename columns to align with a common schema
# Let's aim for a schema similar to UNSW-NB15 where possible
df_4network = df_4network.rename(columns={
    'src_bytes': 'sbytes',
    'dst_bytes': 'dbytes',
    'label': 'attack_cat'
})

# Create a binary label
df_4network['binary_label'] = df_4network['category'].apply(lambda x: 0 if x == 'Normal' else 1)

print("\nProcessed 4Network DataFrame head:")
df_4network.head()

Original 4Network columns: Index(['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
       'dst_bytes', 'count', 'srv_count', 'serror_rate', 'label'],
      dtype='object')

Processed 4Network DataFrame head:


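| | duration | protocol_type | service | flag | sbytes | dbytes | count | srv_count | serror_rate | attack_cat | category | binary_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | tcp | ftp_data | SF | 491.0 | 0.0 | 2.0 | 2.0 | 0.0 | normal | Normal | 0 |
| 1 | 0.0 | udp | other | SF | 146.0 | 0.0 | 13.0 | 1.0 | 0.0 | normal | Normal | 0 |
| 2 | 0.0 | tcp | private | S0 | 0.0 | 0.0 | 123.0 | 6.0 | 1.0 | neptune | DoS | 1 |
| 3 | 0.0 | tcp | http | SF | 232.0 | 8153.0 | 5.0 | 5.0 | 0.2 | normal | Normal | 0 |
| 4 | 0.0 | tcp | http | SF | 199.0 | 420.0 | 30.0 | 32.0 | 0.0 | normal | Normal | 0 |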
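
Because the label map is joined with `how='left'`, any label missing from `label_category_map.csv` would get a `NaN` category and then be counted as an attack by the binary-label rule. The sketch below shows one way to surface such labels using the `indicator` option of `merge`. The toy frames and the `mystery` label are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for basic_data_4.csv and label_category_map.csv.
flows = pd.DataFrame({
    "src_bytes": [491, 0, 10],
    "label": ["normal", "neptune", "mystery"],
})
label_map = pd.DataFrame({
    "label": ["normal", "neptune"],
    "category": ["Normal", "DoS"],
})

# indicator=True adds a '_merge' column recording where each row found a match.
merged = flows.merge(label_map, on="label", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "label"].unique().tolist()
merged = merged.drop(columns="_merge")

print("Labels without a category:", unmatched)
```

An empty `unmatched` list on the real data would confirm that every label received a category.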


## 4. Load and Preprocess CESNET-TimeSeries24 Dataset

This dataset differs from the other two: it is a time series of traffic statistics aggregated per IP address, not a set of individual flows. To merge it, we first reduce it to a flow-like summary by aggregating each identifier's features over the entire period covered by the `agg_1_day` files. The result is one row summarizing each source's behavior.

In [48]:
# Load and concatenate all daily aggregation files
agg_files_path = os.path.join(PATH_CESNET, "agg_1_day/*.csv")
all_files = glob.glob(agg_files_path)

df_list = []
for filename in all_files:
    df_file = pd.read_csv(filename)
    # Extract the ID from the filename, which is the unique identifier for a source.
    file_id = os.path.basename(filename).split('.')[0]
    df_file['id'] = int(file_id)
    df_list.append(df_file)

df_cesnet_raw = pd.concat(df_list, ignore_index=True)

# The 'identifiers.csv' file does not contain IP addresses, so we do not use it.
# Instead, we group by 'id', which uniquely identifies each entity.

# To make the data compatible, aggregate it into a "flow-like" summary per ID:
# mean, std, and max of the numerical features (mean only for avg_duration).
agg_dict = {
    'n_bytes': ['mean', 'std', 'max'],
    'n_packets': ['mean', 'std', 'max'],
    'n_flows': ['mean', 'std', 'max'],
    'avg_duration': ['mean']
}

# Keep only the aggregation keys that are actually present in the dataframe
agg_keys_to_use = {k: v for k, v in agg_dict.items() if k in df_cesnet_raw.columns}

df_cesnet_agg = df_cesnet_raw.groupby('id').agg(agg_keys_to_use).reset_index()

# Flatten the multi-level column names; stripping '_' turns ('id', '') into 'id'
df_cesnet_agg.columns = ['_'.join(col).strip('_') for col in df_cesnet_agg.columns.values]

# This dataset has no attack labels, so every aggregated row is treated as 'Normal' for now.
df_cesnet_agg['category'] = 'Normal'
df_cesnet_agg['binary_label'] = 0
df_cesnet_agg['attack_cat'] = 'normal'

print("\nProcessed and Aggregated CESNET DataFrame head:")
df_cesnet_agg.head()


Processed and Aggregated CESNET DataFrame head:


| | id | n_bytes_mean | n_bytes_std | n_bytes_max | n_packets_mean | n_packets_std | n_packets_max | n_flows_mean | n_flows_std | n_flows_max | avg_duration_mean | category | binary_label | attack_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 785858000.0 | 451390400.0 | 5080981875 | 10321960.0 | 5944051.0 | 66857421 | 3978649.0 | 928629.599686 | 7243129 | 5.120036 | Normal | 0 | normal |
| 1 | 20 | 15482720.0 | 53776490.0 | 893080618 | 30783.4 | 43089.37 | 698501 | 1079.593 | 726.22925 | 7999 | 49.034357 | Normal | 0 | normal |
| 2 | 101 | 21683000.0 | 30791530.0 | 195103149 | 27207.52 | 32045.26 | 212731 | 303.7066 | 95.100157 | 564 | 58.303892 | Normal | 0 | normal |
| 3 | 103 | 69270590000.0 | 42139530000.0 | 192280591577 | 71344870.0 | 44179190.0 | 197611735 | 512445.1 | 355795.646864 | 1407589 | 24.954857 | Normal | 0 | normal |
| 4 | 118 | 12696150000.0 | 6955015000.0 | 31446730432 | 12885920.0 | 7197750.0 | 32375299 | 60119.18 | 37757.047161 | 163934 | 27.084036 | Normal | 0 | normal |
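
The per-ID aggregation in this section can be followed on a tiny example. The sketch below uses made-up daily rows for two IDs and flattens the column names before resetting the index, which is one way to avoid ending up with a trailing-underscore `id_` column. The values are illustrative only.

```python
import pandas as pd

# Toy daily rows: three days for id 11, two days for id 20.
daily = pd.DataFrame({
    "id": [11, 11, 11, 20, 20],
    "n_bytes": [100, 200, 300, 10, 30],
    "n_flows": [1, 2, 3, 4, 6],
})

agg_spec = {
    "n_bytes": ["mean", "std", "max"],
    "n_flows": ["mean", "max"],
    "avg_duration": ["mean"],  # absent from the toy data, so it is skipped below
}
usable = {col: funcs for col, funcs in agg_spec.items() if col in daily.columns}

summary = daily.groupby("id").agg(usable)
# Columns are (feature, statistic) pairs; join them while 'id' is still the index.
summary.columns = ["_".join(col) for col in summary.columns]
summary = summary.reset_index()

print(summary)
```

Each ID collapses to a single row, which is the "flow-like" shape the merge step expects.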


## 5. Harmonize and Merge Datasets

Now, we'll identify common features and create a unified schema to merge the three datasets. This is a critical step and involves making decisions about which features to keep, which to discard, and how to align them.

**Common Schema:**
We will define a common set of columns. If a dataset doesn't have a particular column, it will be filled with NaN or a sensible default (for example, 0).

- `duration`: Connection duration.
- `protocol`: Transport protocol.
- `service`: Service type.
- `flag`: Connection state (`state` in UNSW-NB15, `flag` in 4Network).
- `sbytes`, `dbytes`: Source and destination bytes.
- `spkts`, `dpkts`: Source and destination packets.
- `srate`, `drate`: Source and destination packet rates, derived from the packet counts and duration after the merge.
- `category`: The multi-class label (Normal, DoS, Probe, etc.).
- `binary_label`: The binary label (0 or 1).
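
One way to apply a schema like this is to rename each dataset's columns and then `reindex` to the common column list, so that any column a dataset lacks appears as NaN. The sketch below shows this on one-row toy frames. The helper `to_common_schema` is a hypothetical name, and the NaN fill is just one of the two options mentioned above (the other being a default such as 0).

```python
import pandas as pd

common_cols = ["duration", "protocol", "service", "flag", "sbytes", "dbytes",
               "spkts", "dpkts", "category", "binary_label"]

def to_common_schema(df, rename_map):
    """Rename columns, then keep exactly common_cols (missing ones become NaN)."""
    return df.rename(columns=rename_map).reindex(columns=common_cols)

# One-row toy frames mimicking the column names of two of the datasets.
unsw_like = pd.DataFrame({
    "dur": [0.12], "proto": ["tcp"], "service": ["-"], "state": ["FIN"],
    "sbytes": [258], "dbytes": [172], "spkts": [6], "dpkts": [4],
    "category": ["Normal"], "binary_label": [0],
})
net4_like = pd.DataFrame({
    "duration": [0.0], "protocol_type": ["udp"], "service": ["other"], "flag": ["SF"],
    "sbytes": [146.0], "dbytes": [0.0], "category": ["Normal"], "binary_label": [0],
})

merged = pd.concat([
    to_common_schema(unsw_like, {"dur": "duration", "proto": "protocol", "state": "flag"}),
    to_common_schema(net4_like, {"protocol_type": "protocol"}),
], ignore_index=True)

print(merged)
```

Leaving the missing packet counts as NaN hands the decision to a later imputation step, whereas a 0 default states that no packets were observed.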

In [49]:
# Harmonize UNSW-NB15 (only the columns whose names differ from the common schema)
df_unsw_h = df_unsw.rename(columns={
    'dur': 'duration',
    'proto': 'protocol',
    'state': 'flag'
})

# Harmonize 4Network ('src_bytes' and 'dst_bytes' were already renamed in Section 3)
df_4network_h = df_4network.rename(columns={'protocol_type': 'protocol'})
# 4Network has no packet counts, so fill them with 0 as a default
df_4network_h['spkts'] = 0
df_4network_h['dpkts'] = 0


# Harmonize CESNET - mapping aggregated stats to the common schema.
df_cesnet_h = pd.DataFrame()
# We use the aggregated 'mean' values as representatives for the flow.
df_cesnet_h['sbytes'] = df_cesnet_agg['n_bytes_mean'] 
df_cesnet_h['dbytes'] = df_cesnet_agg['n_bytes_mean'] # Assuming symmetric traffic for this dataset
df_cesnet_h['spkts'] = df_cesnet_agg['n_packets_mean']
df_cesnet_h['dpkts'] = df_cesnet_agg['n_packets_mean'] # Assuming symmetric traffic
df_cesnet_h['duration'] = df_cesnet_agg['avg_duration_mean']
df_cesnet_h['protocol'] = 'udp' # Placeholder
df_cesnet_h['service'] = 'dns'  # Placeholder
df_cesnet_h['flag'] = 'CON'    # Placeholder
df_cesnet_h['category'] = 'Normal'
df_cesnet_h['binary_label'] = 0


# Define common columns for the final merge
common_cols = ['duration', 'protocol', 'service', 'flag', 'sbytes', 'dbytes', 'spkts', 'dpkts', 'category', 'binary_label']

# Select and reorder columns for all dataframes
df_unsw_final = df_unsw_h[common_cols]
df_4network_final = df_4network_h[common_cols]
df_cesnet_final = df_cesnet_h[common_cols]

# Concatenate all three dataframes
df_merged = pd.concat([df_unsw_final, df_4network_final, df_cesnet_final], ignore_index=True)

print("Shape of the merged dataframe:", df_merged.shape)
df_merged.head()

Shape of the merged dataframe: (283865, 10)


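| | duration | protocol | service | flag | sbytes | dbytes | spkts | dpkts | category | binary_label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.121478 | tcp | - | FIN | 258.0 | 172.0 | 6.0 | 4.0 | Normal | 0 |
| 1 | 0.649902 | tcp | - | FIN | 734.0 | 42014.0 | 14.0 | 38.0 | Normal | 0 |
| 2 | 1.623129 | tcp | - | FIN | 364.0 | 13186.0 | 8.0 | 16.0 | Normal | 0 |
| 3 | 1.681642 | tcp | ftp | FIN | 628.0 | 770.0 | 12.0 | 12.0 | Normal | 0 |
| 4 | 0.449454 | tcp | - | FIN | 534.0 | 268.0 | 10.0 | 6.0 | Normal | 0 |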


## 6. Feature Engineering

Now that we have a merged dataset, we can create new features that might help the models.

In [50]:
# Avoid division by zero for rates
epsilon = 1e-6

df_merged['srate'] = df_merged['spkts'] / (df_merged['duration'] + epsilon)
df_merged['drate'] = df_merged['dpkts'] / (df_merged['duration'] + epsilon)
df_merged['total_bytes'] = df_merged['sbytes'] + df_merged['dbytes']
df_merged['total_pkts'] = df_merged['spkts'] + df_merged['dpkts']
df_merged['bytes_per_pkt'] = df_merged['total_bytes'] / (df_merged['total_pkts'] + epsilon)

print("Merged DataFrame with new features:")
df_merged.head()


Merged DataFrame with new features:


| | duration | protocol | service | flag | sbytes | dbytes | spkts | dpkts | category | binary_label | srate | drate | total_bytes | total_pkts | bytes_per_pkt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.121478 | tcp | - | FIN | 258.0 | 172.0 | 6.0 | 4.0 | Normal | 0 | 49.391253 | 32.927502 | 430.0 | 10.0 | 42.999996 |
| 1 | 0.649902 | tcp | - | FIN | 734.0 | 42014.0 | 14.0 | 38.0 | Normal | 0 | 21.541676 | 58.470264 | 42748.0 | 52.0 | 822.076907 |
| 2 | 1.623129 | tcp | - | FIN | 364.0 | 13186.0 | 8.0 | 16.0 | Normal | 0 | 4.928749 | 9.857498 | 13550.0 | 24.0 | 564.58331 |
| 3 | 1.681642 | tcp | ftp | FIN | 628.0 | 770.0 | 12.0 | 12.0 | Normal | 0 | 7.135878 | 7.135878 | 1398.0 | 24.0 | 58.249998 |
| 4 | 0.449454 | tcp | - | FIN | 534.0 | 268.0 | 10.0 | 6.0 | Normal | 0 | 22.249168 | 13.349501 | 802.0 | 16.0 | 50.124997 |
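
The epsilon guard prevents division by zero, but a flow with zero duration and a non-zero packet count then receives a very large rate (packets divided by `1e-6`). The sketch below contrasts that behavior with one possible alternative that sets the rate to 0 when the duration is not positive. The column names `srate_eps` and `srate_safe` and the toy values are illustrative, and whether 0 is the right convention depends on the downstream model.

```python
import pandas as pd

# Toy flows: one with a normal duration, one with zero duration.
flows = pd.DataFrame({"duration": [0.5, 0.0], "spkts": [10.0, 2.0]})

# Approach used in this notebook: add a small constant to the denominator.
epsilon = 1e-6
flows["srate_eps"] = flows["spkts"] / (flows["duration"] + epsilon)

# Alternative: mask non-positive durations so the division yields NaN, then fill with 0.
positive = flows["duration"] > 0
flows["srate_safe"] = (flows["spkts"] / flows["duration"].where(positive)).fillna(0.0)

print(flows)
```

Extreme values such as the one in `srate_eps` also inflate the standard deviation used by `StandardScaler`, which compresses the scaled values of all other rows.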


## 7. Data Cleaning and Preprocessing

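This step handles missing values (median imputation for numerical features, most-frequent imputation for categorical ones), one-hot encodes the categorical variables, and standardizes the numerical features.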

In [51]:
# Separate features (X) and labels (y)
X = df_merged.drop(columns=['category', 'binary_label'])
y_multi = df_merged['category']
y_binary = df_merged['binary_label']

# Identify numerical and categorical columns
numerical_cols = X.select_dtypes(include=np.number).columns.tolist()
categorical_cols = X.select_dtypes(include='object').columns.tolist()

# Create preprocessing pipelines for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create a preprocessor object using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ],
    remainder='passthrough'
)

# Apply the transformations
X_processed = preprocessor.fit_transform(X)

# X_processed may be a sparse matrix or a dense NumPy array. Recover the feature names.
ohe_feature_names = preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_cols)
all_feature_names = numerical_cols + list(ohe_feature_names)

# Convert the processed data back to a DataFrame for inspection,
# densifying first if the transformer returned a sparse matrix
X_dense = X_processed.toarray() if hasattr(X_processed, "toarray") else X_processed
X_processed_df = pd.DataFrame(X_dense, columns=all_feature_names)


print("Shape of processed features:", X_processed_df.shape)
X_processed_df.head()


Shape of processed features: (283865, 240)


| | duration | sbytes | dbytes | spkts | dpkts | srate | drate | total_bytes | total_pkts | bytes_per_pkt | ... | flag_RSTOS0 | flag_RSTR | flag_S0 | flag_S1 | flag_S2 | flag_S3 | flag_SF | flag_SH | flag_URN | flag_no |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.034962 | -0.003974 | -0.003987 | -0.004535 | -0.004538 | -0.608363 | -0.016298 | -0.003981 | -0.004537 | -0.003435 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.034305 | -0.003973 | -0.003837 | -0.004505 | -0.004412 | -0.608485 | -0.015432 | -0.003905 | -0.004459 | -0.003435 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | -0.033096 | -0.003974 | -0.00394 | -0.004528 | -0.004494 | -0.608557 | -0.017079 | -0.003957 | -0.004511 | -0.003435 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.033023 | -0.003973 | -0.003985 | -0.004513 | -0.004509 | -0.608548 | -0.017172 | -0.003979 | -0.004511 | -0.003435 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | -0.034554 | -0.003973 | -0.003987 | -0.00452 | -0.004531 | -0.608482 | -0.016961 | -0.00398 | -0.004526 | -0.003435 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## 8. Handle Data Imbalance

Classification models can be biased towards the majority class. We'll use SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset. This should only be applied to the training data to avoid data leakage into the test set.

In [52]:
from sklearn.model_selection import train_test_split

# First, split the data into training and testing sets
X_train, X_test, y_train_multi, y_test_multi = train_test_split(
    X_processed_df, y_multi, test_size=0.2, random_state=42, stratify=y_multi
)

# Check the class distribution before SMOTE
print("Class distribution before SMOTE:")
print(y_train_multi.value_counts())

# Apply SMOTE to the training data only
# (n_jobs is deprecated for SMOTE in recent imbalanced-learn releases, so it is not passed)
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote_multi = smote.fit_resample(X_train, y_train_multi)

# Check the class distribution after SMOTE
print("\nClass distribution after SMOTE:")
print(y_train_smote_multi.value_counts())


Class distribution before SMOTE:
category
Normal    85959
Probe     70179
DoS       67566
R2L        2031
U2R        1357
Name: count, dtype: int64





Class distribution after SMOTE:
category
Normal    85959
DoS       85959
Probe     85959
U2R       85959
R2L       85959
Name: count, dtype: int64
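
To make the output above easier to interpret, the sketch below illustrates the core idea behind SMOTE: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified NumPy illustration, not the imbalanced-learn implementation, and the function name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances; the diagonal is excluded from neighbour search.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per point

    # Pick a base point, one of its neighbours, and an interpolation factor in [0, 1).
    base_idx = rng.integers(0, len(minority), size=n_new)
    nn_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gaps = rng.random((n_new, 1))
    return minority[base_idx] + gaps * (minority[nn_idx] - minority[base_idx])

# Four minority points on the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=6)
print(synthetic)
```

Because every synthetic point lies between two existing minority samples, classes such as U2R and R2L grow to 85,959 rows without exact duplication, although the new rows add no information beyond what the original few thousand samples contain.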


## 9. Save Processed Data

Finally, we save the processed and balanced training data, as well as the untouched test data, to be used in the modeling phase.

In [53]:
# Create a directory for processed data if it doesn't exist
PROCESSED_DATA_PATH = os.path.join(BASE_PATH, "processed_data")
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)

# Save the datasets
# We'll save the SMOTE-balanced training set and the original test set
X_train_smote.to_csv(os.path.join(PROCESSED_DATA_PATH, "X_train_processed.csv"), index=False)
y_train_smote_multi.to_csv(os.path.join(PROCESSED_DATA_PATH, "y_train_processed.csv"), index=False)
X_test.to_csv(os.path.join(PROCESSED_DATA_PATH, "X_test_processed.csv"), index=False)
y_test_multi.to_csv(os.path.join(PROCESSED_DATA_PATH, "y_test_processed.csv"), index=False)

print(f"Processed data saved to: {PROCESSED_DATA_PATH}")


Processed data saved to: e:/Swinburne_bsc_data_science/year_two/sem_two/COS30049/Assignment2\processed_data
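
Before moving on to modeling, it can help to confirm that the saved files reload with features and labels still aligned. The sketch below does a round trip through a temporary directory with toy data. The file names match the ones used above, while the values are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the processed training features and labels.
X_train = pd.DataFrame({"duration": [-0.03, 0.5], "flag_SF": [1.0, 0.0]})
y_train = pd.Series(["Normal", "DoS"], name="category")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    X_train.to_csv(out_dir / "X_train_processed.csv", index=False)
    y_train.to_csv(out_dir / "y_train_processed.csv", index=False)

    # Reload: the label file has a single column named after the Series.
    X_loaded = pd.read_csv(out_dir / "X_train_processed.csv")
    y_loaded = pd.read_csv(out_dir / "y_train_processed.csv")["category"]

print(X_loaded.shape, y_loaded.shape)
```

Since both files are written with `index=False`, row order is the only link between features and labels, so neither file should be shuffled or filtered on its own.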
