# Prepare UNSW-NB15 Dataset
This notebook prepares the UNSW-NB15 dataset used by the project. For more information about the dataset, see *[UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)](https://ieeexplore.ieee.org/document/7348942) by Nour Moustafa and Jill Slay*.  
## Major steps
1. Download the UNSW-NB15 dataset from Kaggle
2. Merge the 4 pieces of the dataset into 1 complete dataset
3. Save the dataset to a `.csv` file  
******

## Download dataset
***Only needed when running the project for the first time***

In [1]:
# Dataset to download from Kaggle
kaggle_name = 'mrwellsdavid/unsw-nb15'
# Prepare the directories that store the datasets
# (makedirs creates 'data' as well and does nothing if both already exist)
import os
os.makedirs('data/achieve', exist_ok=True)

### Requirement
Please ensure you have installed and configured the **[Kaggle API Tool](https://www.kaggle.com/docs/api#getting-started-installation-&-authentication)** in your environment so that the dataset can be downloaded automatically.  
Kaggle API Documentation: <https://www.kaggle.com/docs/api>  
**Important**: Make sure the **[authentication part](https://www.kaggle.com/docs/api#getting-started-installation-&-authentication)** of the set-up process is performed correctly.
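
Before running the download cell, it can help to confirm that credentials are visible to the Kaggle CLI. The sketch below assumes the two credential sources described in the Kaggle API documentation: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). The helper name `kaggle_credentials_found` is hypothetical, and it only checks that credentials are present, not that they are valid.

```python
import os

def kaggle_credentials_found(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    Only presence is checked; the credentials are not validated.
    """
    env = os.environ if env is None else env
    home = os.path.expanduser('~') if home is None else home

    # Option 1: credentials supplied through environment variables.
    if env.get('KAGGLE_USERNAME') and env.get('KAGGLE_KEY'):
        return True

    # Option 2: a kaggle.json token file in the configuration directory.
    config_dir = env.get('KAGGLE_CONFIG_DIR') or os.path.join(home, '.kaggle')
    return os.path.isfile(os.path.join(config_dir, 'kaggle.json'))

if not kaggle_credentials_found():
    print('No Kaggle credentials found; see the authentication guide above.')
```

If the check fails, the download cell will most likely fail as well, so it is worth revisiting the authentication guide first.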

In [2]:
# Download the original UNSW-NB15 dataset using the Kaggle API (this may take some time depending on the internet connection)
# This cell is equivalent to running the command in a system shell
status = os.system('kaggle datasets download --force --unzip -d {} -p data/achieve'.format(kaggle_name))
# A non-zero exit status means the Kaggle command failed
if status != 0:
    raise RuntimeError('Download failed')

### Manual Alternative
If the automatic download does not work, download the dataset from the [Kaggle page](https://www.kaggle.com/datasets/mrwellsdavid/unsw-nb15) and put all the unzipped files under `data/achieve`.
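
After a manual download, a quick check that the files used by this notebook are in place can save a confusing error later. The following is a minimal sketch: the file names are the ones this notebook refers to, while the helper name `missing_dataset_files` is hypothetical.

```python
import os

# File names referenced in this notebook.
EXPECTED_FILES = ['UNSW-NB15_{}.csv'.format(i) for i in range(1, 5)] + ['NUSW-NB15_features.csv']

def missing_dataset_files(directory='data/achieve'):
    """Return the expected file names that are not present in `directory`."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(directory, name))]

missing = missing_dataset_files()
if missing:
    print('Missing files:', ', '.join(missing))
```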

----
## Merge the datasets
This step will merge `UNSW-NB15_{1,2,3,4}.csv` according to the feature definition in `NUSW-NB15_features.csv`.
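
The idea can be sketched on toy data as follows. This is an illustration, not the notebook's actual merge code: it assumes the four parts have no header row, uses only the first three feature names, and replaces the real files with small in-memory stand-ins.

```python
import io
import pandas as pd

# First three feature names; in the notebook they come from the feature definition file.
col_names = ['srcip', 'sport', 'dstip']

# Two tiny in-memory stand-ins for the header-less UNSW-NB15_{1,2,3,4}.csv parts.
parts = [
    io.StringIO('10.0.0.1,80,10.0.0.2\n10.0.0.3,443,10.0.0.4\n'),
    io.StringIO('10.0.0.5,22,10.0.0.6\n'),
]

# Read each part without a header row, applying the shared column names.
frames = [pd.read_csv(part, header=None, names=col_names) for part in parts]

# Stack the parts vertically and renumber the rows from 0.
df_merged = pd.concat(frames, ignore_index=True)
print(df_merged.shape)
```

Passing `ignore_index=True` avoids duplicate row labels, which would otherwise restart at 0 for every part.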

In [21]:
# Read feature definition in NUSW-NB15_features.csv
import pandas as pd
df_features = pd.read_csv('data/achieve/NUSW-NB15_features.csv', encoding='cp1252', index_col=0)
print(df_features.shape)
df_features.head()

(49, 3)


| No. | Name | Type | Description |
|---|---|---|---|
| 1 | srcip | nominal | Source IP address |
| 2 | sport | integer | Source port number |
| 3 | dstip | nominal | Destination IP address |
| 4 | dsport | integer | Destination port number |
| 5 | proto | nominal | Transaction protocol |


In [23]:
df_features.iloc[:, 1].value_counts()

integer      20
Float        10
Integer       8
nominal       6
Timestamp     2
Binary        2
binary        1
Name: Type , dtype: int64
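
The counts above show that several labels differ only in capitalization (`integer`/`Integer`, `Binary`/`binary`), and the column name itself carries a trailing space (`Type `). A minimal sketch of normalizing the labels before using them as lookup keys, with the counts copied from the output above and a hypothetical helper name `normalize_type`:

```python
from collections import Counter

# Label counts copied from the value_counts() output above.
raw_counts = {
    'integer': 20, 'Float': 10, 'Integer': 8, 'nominal': 6,
    'Timestamp': 2, 'Binary': 2, 'binary': 1,
}

def normalize_type(label):
    """Strip surrounding whitespace and lower-case a type label."""
    return label.strip().lower()

# Merge the counts of labels that become identical after normalization.
normalized = Counter()
for label, count in raw_counts.items():
    normalized[normalize_type(label)] += count

print(dict(normalized))
# {'integer': 28, 'float': 10, 'nominal': 6, 'timestamp': 2, 'binary': 3}
```

After normalization only five distinct labels remain, and they still account for all 49 features, so a lookup keyed by lower-case labels needs just five entries.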

In [25]:
# Get Feature Names
col_names = df_features.iloc[:, 0].to_list()
# Get Feature Types
type_dict = {'integer': pd.Int64Dtype, 'float': , }