# Prepare UNSW-NB15 Dataset
This notebook prepares the UNSW-NB15 dataset for use in this project. For more information about the dataset, see *[UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)](https://ieeexplore.ieee.org/document/7348942) by Nour Moustafa and Jill Slay*.
## Major steps
1. Download the UNSW-NB15 dataset from Kaggle
2. Merge the 4 pieces of the dataset into 1 complete dataset
3. Simple data cleaning
4. Save dataset to .csv file  
******

## Download dataset
***Only needed when running the project for the first time***

In [7]:
# Dataset to download from Kaggle
kaggle_name = 'mrwellsdavid/unsw-nb15'
# Prepare the directory that stores the raw dataset files
# (makedirs creates 'data' and 'data/achieve' if they do not exist yet)
import os
os.makedirs('data/achieve', exist_ok=True)

### Requirement
To download the dataset automatically, make sure the **[Kaggle API Tool](https://www.kaggle.com/docs/api#getting-started-installation-&-authentication)** is installed and configured in your environment.  
Kaggle API Documentation: <https://www.kaggle.com/docs/api>  
**Important**: Make sure the **[authentication part](https://www.kaggle.com/docs/api#getting-started-installation-&-authentication)** of the set-up process is performed correctly.
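
Before running the download cell, it can help to check that credentials are visible to the Kaggle tool. The sketch below assumes the standard setup described in the Kaggle documentation: either the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` token file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). The helper name `kaggle_credentials_present` is hypothetical and is not part of the Kaggle API.

```python
import os
from pathlib import Path

def kaggle_credentials_present(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured (hypothetical helper)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Option 1: credentials supplied through environment variables
    if env.get('KAGGLE_USERNAME') and env.get('KAGGLE_KEY'):
        return True
    # Option 2: a kaggle.json token file in the config directory
    config_dir = Path(env.get('KAGGLE_CONFIG_DIR', home / '.kaggle'))
    return (config_dir / 'kaggle.json').is_file()

if not kaggle_credentials_present():
    print('No Kaggle credentials found; the download cell will likely fail.')
```

This only checks that credentials exist, not that they are valid, so a failed download can still point to an expired or mistyped token.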

In [8]:
# Download the original UNSW-NB15 dataset using the Kaggle API (this may take some time depending on the internet connection)
# This cell is equivalent to running the command in the system shell
status = os.system('kaggle datasets download --force --unzip -d {} -p data/achieve'.format(kaggle_name))
# A non-zero exit status means the kaggle command failed
if status != 0:
    raise RuntimeError('Download failed')

### Manual Replacement
If the automatic download does not work, download the dataset from the [Kaggle page](https://www.kaggle.com/datasets/mrwellsdavid/unsw-nb15) and put all the unzipped files under `data/achieve`.
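
Whichever download route is used, a quick check that the expected files are in place can save a confusing error later. The following is a minimal sketch: the file names are the ones the merge step of this notebook reads, and the helper `missing_dataset_files` is a hypothetical name introduced here for illustration.

```python
import os

# File names taken from the merge step of this notebook
EXPECTED_FILES = ['NUSW-NB15_features.csv'] + ['UNSW-NB15_{}.csv'.format(i) for i in range(1, 5)]

def missing_dataset_files(directory='data/achieve'):
    """Return the expected files that are not present in `directory` (hypothetical helper)."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(directory, name))]

missing = missing_dataset_files()
if missing:
    print('Missing files:', missing)
```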

----
## Merge the datasets
This step merges `UNSW-NB15_{1,2,3,4}.csv` into a single dataframe, using the feature names and types defined in `NUSW-NB15_features.csv`.

### Get Names and Types of the features

In [18]:
# Read feature definition in NUSW-NB15_features.csv
import pandas as pd
df_features = pd.read_csv('data/achieve/NUSW-NB15_features.csv', encoding='cp1252', index_col=0)
print(df_features.shape)
# Display
df_features

(49, 3)


No.,Name,Type,Description
1,srcip,nominal,Source IP address
2,sport,integer,Source port number
3,dstip,nominal,Destination IP address
4,dsport,integer,Destination port number
5,proto,nominal,Transaction protocol
6,state,nominal,Indicates to the state and its dependent proto...
7,dur,Float,Record total duration
8,sbytes,Integer,Source to destination transaction bytes
9,dbytes,Integer,Destination to source transaction bytes
10,sttl,Integer,Source to destination time to live value


In [19]:
df_features.iloc[:, 1].value_counts()

integer      20
Float        10
Integer       8
nominal       6
Timestamp     2
Binary        2
binary        1
Name: Type , dtype: int64

In [20]:
# Get Feature Names
col_names = df_features.iloc[:, 0].to_list()
# Get Feature Types
# Create dictionary for type translation
type_dict = {
    'integer': 'int32', 
    'Float': 'float64', 
    'Integer': 'int32', 
    'nominal': 'object', 
    'Timestamp': 'int64', 
    'Binary': 'int16', 
    'binary': 'int16'}
# Get dtypes for .csv reading
col_types = [type_dict[dtype] if dtype in type_dict else dtype for dtype in df_features.iloc[:, 1].to_list()]
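
The `value_counts()` output above shows the same type spelled in several ways (`integer`/`Integer`, `Binary`/`binary`), and the column header itself prints as `Type ` with a trailing space. As an alternative to listing every spelling, the labels could be normalized before the lookup. The sketch below is one possible way to do that; `normalized_type_dict` and `normalize_types` are hypothetical names, and the dtype choices are copied from `type_dict`. Unlike the cell above, it falls back to the cleaned label rather than the raw one when a label is not in the dictionary.

```python
# dtype choices copied from type_dict above, keyed by the normalized spelling
normalized_type_dict = {
    'integer': 'int32',
    'float': 'float64',
    'nominal': 'object',
    'timestamp': 'int64',
    'binary': 'int16',
}

def normalize_types(raw_types):
    """Map raw type labels to dtype names, ignoring case and surrounding spaces."""
    result = []
    for raw in raw_types:
        key = str(raw).strip().lower()
        # Fall back to the cleaned label when it is not in the dictionary
        result.append(normalized_type_dict.get(key, key))
    return result

print(normalize_types(['integer', 'Integer ', 'Float', 'Binary', 'binary', 'nominal', 'Timestamp']))
```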

### Merge all pieces of the UNSW-NB15 dataset

In [21]:
# Get all dataset piece paths
dataset_piece_paths = ['data/achieve/UNSW-NB15_{}.csv'.format(i) for i in range(1, 5)]
dataset_piece_paths

['data/achieve/UNSW-NB15_1.csv',
 'data/achieve/UNSW-NB15_2.csv',
 'data/achieve/UNSW-NB15_3.csv',
 'data/achieve/UNSW-NB15_4.csv']

In [22]:
# Some numeric columns contain unexpected non-numeric values,
# so the merging process converts those values to NaN; the affected rows are removed during data cleaning

import numpy as np

# The following functions convert non-numeric values to NaN
# They are passed to read_csv through its converters argument
def castInt(value: str):
    try:
        return int(value)
    except ValueError:
        return np.nan

def castFloat(value: str):
    try:
        return float(value)
    except ValueError:
        return np.nan

# Create a converter dictionary (column name -> function) for csv reading
converter = dict()
for i in range(len(col_names)):
    if col_types[i].startswith('int'):
        converter[col_names[i]] = castInt
    elif col_types[i].startswith('float'):
        converter[col_names[i]] = castFloat
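
To see what the converters do, the sketch below applies them to a small in-memory CSV instead of the real files. The sample rows are made up for illustration; `castInt` and `castFloat` are repeated only so that the block runs on its own. Because `NaN` is a float, an integer column that contains any converted `NaN` can no longer be stored as an integer column, which is consistent with `sport` being displayed as `1390.0` in the preview further down.

```python
import io
import numpy as np
import pandas as pd

def castInt(value: str):
    try:
        return int(value)
    except ValueError:
        return np.nan

def castFloat(value: str):
    try:
        return float(value)
    except ValueError:
        return np.nan

# Made-up rows: a hex-like port, a dash, and an empty duration field
sample = io.StringIO("80,0.5\n0x000b,1.5\n-,2.5\n443,\n")
demo = pd.read_csv(sample, header=None, names=['sport', 'dur'],
                   converters={'sport': castInt, 'dur': castFloat})
print(demo)
print(demo.isna().sum())
```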

In [23]:
# Read the first dataset
df = pd.read_csv(dataset_piece_paths[0], header=None, index_col=False, names=col_names, converters=converter)


In [24]:
# Check dataset
print(df.shape)
df.head()

(700001, 49)


Unnamed: 0,srcip,sport,dstip,dsport,proto,state,dur,sbytes,dbytes,sttl,...,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,attack_cat,Label
0,59.166.0.0,1390.0,149.171.126.6,53.0,udp,CON,0.001055,132,164,31,...,0,3,7,1,3,1,1,1,,0
1,59.166.0.0,33661.0,149.171.126.9,1024.0,udp,CON,0.036133,528,304,31,...,0,2,4,2,3,1,1,2,,0
2,59.166.0.6,1464.0,149.171.126.7,53.0,udp,CON,0.001119,146,178,31,...,0,12,8,1,2,2,1,1,,0
3,59.166.0.5,3593.0,149.171.126.5,53.0,udp,CON,0.001209,132,164,31,...,0,6,9,1,1,1,1,1,,0
4,59.166.0.3,49664.0,149.171.126.0,53.0,udp,CON,0.001169,146,178,31,...,0,7,9,1,1,1,1,1,,0


In [25]:
# Merge the remaining dataset pieces
for path in dataset_piece_paths[1:]:
    # Read the dataset
    temp = pd.read_csv(path, header=None, index_col=False, names=col_names, converters=converter)
    # Concat the dataframe to the existing dataframe
    df = pd.concat([df, temp])
    # Release memory
    del temp
# Display the shape of the complete dataset
# Expected output (according to the paper): (2540044, 49)
# Actual output: (2540047, 49), i.e. 3 more rows than the paper reports
print(df.shape)


(2540047, 49)
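
The loop above keeps each piece's own row index, so the index of the merged dataframe restarts at 0 for every piece. If unique row labels are wanted, one option is to collect the pieces in a list and concatenate once with `ignore_index=True`. The sketch below shows the difference on toy frames that stand in for the four dataset pieces; `concat_pieces` is a hypothetical helper name, and in the notebook the list would be built from the same `pd.read_csv` calls used above.

```python
import pandas as pd

def concat_pieces(frames):
    """Concatenate dataframes once and renumber rows from 0 (hypothetical helper)."""
    return pd.concat(list(frames), ignore_index=True)

# Toy frames standing in for the dataset pieces
piece_a = pd.DataFrame({'sport': [80, 443], 'Label': [0, 1]})
piece_b = pd.DataFrame({'sport': [53], 'Label': [0]})

looped = pd.concat([piece_a, piece_b])        # same behaviour as the loop: index restarts
merged = concat_pieces([piece_a, piece_b])    # index renumbered without gaps
print(looped.index.tolist(), merged.index.tolist())
```

Holding all pieces in a list needs more memory at the moment of concatenation than the loop with `del temp`, so the loop may still be the better choice on a small machine.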


In [26]:
# Check the dataset
df.head()

Unnamed: 0,srcip,sport,dstip,dsport,proto,state,dur,sbytes,dbytes,sttl,...,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,attack_cat,Label
0,59.166.0.0,1390.0,149.171.126.6,53.0,udp,CON,0.001055,132,164,31,...,0.0,3,7,1,3,1,1,1,,0
1,59.166.0.0,33661.0,149.171.126.9,1024.0,udp,CON,0.036133,528,304,31,...,0.0,2,4,2,3,1,1,2,,0
2,59.166.0.6,1464.0,149.171.126.7,53.0,udp,CON,0.001119,146,178,31,...,0.0,12,8,1,2,2,1,1,,0
3,59.166.0.5,3593.0,149.171.126.5,53.0,udp,CON,0.001209,132,164,31,...,0.0,6,9,1,1,1,1,1,,0
4,59.166.0.3,49664.0,149.171.126.0,53.0,udp,CON,0.001169,146,178,31,...,0.0,7,9,1,1,1,1,1,,0


## Simple Data Cleaning
Three modifications are performed in this step:  
1. Replace the `NaN` values in **attack_cat** with `Normal`
2. Fix the `NaN` values in the features `ct_flw_http_mthd`, `is_ftp_login`, and `ct_ftp_cmd` according to their definitions by replacing `NaN` with `-1`  
3. Remove the rows that still contain `NaN` values, which were introduced in the merging step where values violated the feature definitions  

In [27]:
# Check values in attack_cat
df['attack_cat'].value_counts()

Generic             215481
Exploits             44525
 Fuzzers             19195
DoS                  16353
 Reconnaissance      12228
 Fuzzers              5051
Analysis              2677
Backdoor              1795
Reconnaissance        1759
 Shellcode            1288
Backdoors              534
Shellcode              223
Worms                  174
Name: attack_cat, dtype: int64

In [28]:
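# Replace the NaN values in attack_cat with Normal
# (assign the result back; inplace fillna on a selected column is deprecated in recent pandas versions)
df['attack_cat'] = df['attack_cat'].fillna('Normal')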
# Check values in attack_cat
df['attack_cat'].value_counts()

Normal              2218764
Generic              215481
Exploits              44525
 Fuzzers              19195
DoS                   16353
 Reconnaissance       12228
 Fuzzers               5051
Analysis               2677
Backdoor               1795
Reconnaissance         1759
 Shellcode             1288
Backdoors               534
Shellcode               223
Worms                   174
Name: attack_cat, dtype: int64
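
The counts above list some categories more than once: labels such as ` Fuzzers` and ` Shellcode` differ from their counterparts only by surrounding whitespace, and both `Backdoor` and `Backdoors` appear. This notebook keeps the labels as they are. If merged categories are preferred, a sketch along the following lines could be applied to `df['attack_cat']`. It assumes that the whitespace variants and the `Backdoor`/`Backdoors` pair each denote the same category, which is an assumption and is not stated in the dataset files; the sample labels are made up for illustration.

```python
import pandas as pd

# Made-up sample of labels showing the variants seen in the counts above
labels = pd.Series([' Fuzzers ', ' Fuzzers', 'Backdoor', 'Backdoors',
                    ' Shellcode ', 'Shellcode', 'Normal'])

# Strip surrounding whitespace, then map the plural spelling to the singular one
cleaned = labels.str.strip().replace({'Backdoors': 'Backdoor'})
print(cleaned.value_counts())
```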

In [33]:
# Fix the NaN values in ct_flw_http_mthd, is_ftp_login, and ct_ftp_cmd according to their definitions
# (assign the results back instead of using inplace fillna on a selected column)
df['ct_flw_http_mthd'] = df['ct_flw_http_mthd'].fillna(-1)
df['is_ftp_login'] = df['is_ftp_login'].fillna(-1)
df['ct_ftp_cmd'] = df['ct_ftp_cmd'].fillna(-1)
# Check value in ct_ftp_cmd
df['ct_ftp_cmd'].value_counts()

-1.0    1429879
 0.0    1066498
 1.0      40077
 2.0       1264
 4.0        960
 3.0        729
 6.0        332
 5.0        290
 8.0         18
Name: ct_ftp_cmd, dtype: int64

In [35]:
# Remove the rows that still contain NaN, i.e. values that violated the feature definitions during merging
df.dropna(axis=0, inplace=True)
# Check dataset shape
df.shape

(2539739, 49)
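
Comparing the two shapes, 2540047 - 2539739 = 308 rows were removed by `dropna`. To see which columns are responsible, the `NaN` values can be counted per column before the rows are dropped. The sketch below shows the idea on a made-up toy dataframe; in the notebook the same calls would be applied to `df` just before the `dropna` cell.

```python
import numpy as np
import pandas as pd

# Toy frame with NaN in two different columns
toy = pd.DataFrame({
    'sport': [80.0, np.nan, 443.0, 53.0],
    'dsport': [53.0, 80.0, np.nan, 22.0],
    'Label': [0, 1, 0, 0],
})

# Number of NaN values in each column
nan_per_column = toy.isna().sum()
# Number of rows with at least one NaN, i.e. the rows dropna(axis=0) removes
rows_with_nan = int(toy.isna().any(axis=1).sum())

print(nan_per_column[nan_per_column > 0])
print('rows that dropna would remove:', rows_with_nan)
```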

## Save Dataset
Save the dataframe to a .csv file.

In [40]:
# Save the dataframe to file
output_path = 'data/UNSW-NB15.csv'
df.to_csv(output_path)
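
Note that `df.to_csv(output_path)` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `pd.read_csv` unless `index_col=0` is passed. If the index is not needed downstream, passing `index=False` keeps the file limited to the 49 feature columns. The sketch below demonstrates the round trip on a made-up toy dataframe written to a temporary directory.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'srcip': ['59.166.0.0', '59.166.0.6'],
                    'sport': [1390.0, 1464.0],
                    'Label': [0, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy.csv')
    # index=False leaves the row index out of the file
    toy.to_csv(toy_path, index=False)
    restored = pd.read_csv(toy_path)

print(restored.columns.tolist())
```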