<a href="https://www.nvidia.com/en-us/deep-learning-ai/education/"> <img src="images/DLI Header.png" alt="Header" style="width: 400px;"/> </a>

# Anomaly Detection in Network Data using GPU-Accelerated XGBoost

- Ananth Sankar, Solutions Architect at NVIDIA.
- Eric Harper, Solutions Architect, Global Telecoms at NVIDIA.

As network traffic continues to grow, both the number and the variety of network attacks are growing as well. The ability to train machine learning models quickly and frequently to detect network intrusions is more important than ever.

In this series of labs, we will learn how to use machine learning and deep learning models to detect network intrusions in the full KDD99 dataset. The data processing and model training techniques covered in these labs can be applied to many other anomaly detection datasets.

The KDD99 dataset was built by logging network packet information. It consists of normal connections and connections labeled as one of four attack categories: Denial of Service (DoS), Remote to Local (R2L), User to Root (U2R), and Probing (Probe). More information about the dataset can be found at https://kdd.ics.uci.edu/databases/kddcup99/task.html.


We'll start off by exploring the dataset and then we will use the NVIDIA RAPIDS library to train GPU-accelerated XGBoost models for network intrusion detection. The RAPIDS suite of software libraries, built on CUDA-X AI, gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

Note that we will also use the pandas and scikit-learn packages in this lab.

In [1]:
!tar -zxvf ./data.tar.gz

data/
data/preprocessed_data_full.pkl
data/kddcup.data.corrected


In [2]:
# Import libraries that will be needed for the lab

import xgboost as xgb
import numpy as np
from collections import OrderedDict
import gc
from glob import glob
import os
import pandas as pd
from copy import copy
from time import time
from sklearn.metrics import roc_auc_score,confusion_matrix,accuracy_score,classification_report,roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from timeit import default_timer
import matplotlib.pyplot as plt
import pickle

# Set the seed for numpy
np.random.seed(123)

# Display all columns of Pandas' dataframes by default
pd.set_option('display.max_columns', None)

data_path = './data/kddcup.data.corrected'

## Section 1: Data

### 1.1 Load the Dataset

Let's begin by loading the KDD99 dataset with pandas and then doing some basic data exploration.

In [3]:
col_names = ["duration","protocol_type","service","flag","src_bytes","dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins","logged_in",
             "num_compromised","root_shell","su_attempted","num_root","num_file_creations","num_shells","num_access_files","num_outbound_cmds",
             "is_host_login","is_guest_login","count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
             "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate",
             "dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate",
             "dst_host_srv_rerror_rate","label"]

df = pd.read_csv(data_path, header=None, names=col_names, index_col=False)

# Display the first few rows of the dataset
df.head(5)

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal.
1,0,tcp,http,SF,162,4528,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1,1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,normal.
2,0,tcp,http,SF,236,1228,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2,2,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,normal.
3,0,tcp,http,SF,233,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3,3,1.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,normal.
4,0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,4,4,1.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,normal.


Each row of the KDD Cup 99 dataset is a network connection, with a total of 41 independent variables and 1 dependent variable. The independent variables can be broadly divided into three groups:

1. Basic input features of network connections such as duration, protocol type, and number of bytes from source IP addresses 
2. Content input features of network connections
3. The statistical input features computed over a time window

Scroll to the right in the output above to see all of the features. The last column on the right is the `label` column. It indicates whether the connection is normal or some type of anomalous network traffic.

Let's see what types of anomalies are in our dataset.

In [4]:
pd.DataFrame(df['label'].value_counts())

Unnamed: 0,label
smurf.,2807886
neptune.,1072017
normal.,972781
satan.,15892
ipsweep.,12481
portsweep.,10413
nmap.,2316
back.,2203
warezclient.,1020
teardrop.,979


In [5]:
# here we train a label encoder so that we can map our classes to integers later for model training
le = LabelEncoder()
le.fit(df.label)
print(le.classes_)

['back.' 'buffer_overflow.' 'ftp_write.' 'guess_passwd.' 'imap.'
 'ipsweep.' 'land.' 'loadmodule.' 'multihop.' 'neptune.' 'nmap.' 'normal.'
 'perl.' 'phf.' 'pod.' 'portsweep.' 'rootkit.' 'satan.' 'smurf.' 'spy.'
 'teardrop.' 'warezclient.' 'warezmaster.']
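
The 23 classes above are individual labels, while the introduction speaks of four attack categories. The sketch below shows one way to group them. The grouping follows the KDD Cup 99 task description linked earlier; the names `ATTACK_CATEGORY`, `toy_labels`, and `toy_categories` are illustrative and not part of the lab.

```python
import pandas as pd

# Label -> category grouping, following the KDD Cup 99 task description.
# The dictionary name and the lowercase category strings are assumptions
# made for this illustration.
ATTACK_CATEGORY = {
    'normal.': 'normal',
    # Denial of Service
    'back.': 'dos', 'land.': 'dos', 'neptune.': 'dos',
    'pod.': 'dos', 'smurf.': 'dos', 'teardrop.': 'dos',
    # Remote to Local
    'ftp_write.': 'r2l', 'guess_passwd.': 'r2l', 'imap.': 'r2l',
    'multihop.': 'r2l', 'phf.': 'r2l', 'spy.': 'r2l',
    'warezclient.': 'r2l', 'warezmaster.': 'r2l',
    # User to Root
    'buffer_overflow.': 'u2r', 'loadmodule.': 'u2r',
    'perl.': 'u2r', 'rootkit.': 'u2r',
    # Probing
    'ipsweep.': 'probe', 'nmap.': 'probe',
    'portsweep.': 'probe', 'satan.': 'probe',
}

# Map a handful of example labels to their categories
toy_labels = pd.Series(['smurf.', 'normal.', 'satan.', 'rootkit.', 'guess_passwd.'])
toy_categories = toy_labels.map(ATTACK_CATEGORY)
print(toy_categories.tolist())
```

On the full dataset, the same idea would be `df['label'].map(ATTACK_CATEGORY)`.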


### 1.2 Dataset Modification

Notice that the dataset has more anomalies than normal data. Reflect for a moment on what the implications of having more anomalies than normal traffic might be. Reflect either here in the notebook, on a piece of paper, or with a peer sitting next to you.

We'll come back to test your hypothesis shortly. 
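
To put a number on the imbalance, the sketch below computes class fractions from the counts printed by `value_counts()` above and the total row count reported by the shape checks in this lab (4,898,431). The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above (three largest classes)
total_rows = 4898431
counts = pd.Series({'smurf.': 2807886, 'neptune.': 1072017, 'normal.': 972781})

# Fraction of all connections that falls into each of these classes
fractions = counts / total_rows
normal_fraction = fractions['normal.']

print(fractions.round(3))
print(f"anomalous fraction: {1 - normal_fraction:.3f}")
```

Only about one connection in five is normal, so a model that always predicts "anomalous" would already score roughly 80% accuracy on this data. Accuracy alone is therefore a weak measure of quality here.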

<a id='return'></a>

### 1.3 Data Preprocessing

In order to train an XGBoost model, we have to encode the string-valued categorical variables as numbers.

We will pass the 7 categorical features (`protocol_type`, `service`, `flag`, `land`, `logged_in`, `is_host_login`, `is_guest_login`) to the pandas function `get_dummies()`. One-hot encoding replaces a categorical variable with one binary indicator column per category: if a variable takes ten distinct values, it is transformed into 10 indicator columns. Note that `get_dummies()` only expands the string-valued columns (`protocol_type`, `service`, `flag`). The other four features are already stored as 0/1 integers and pass through unchanged.
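
As a small illustration of how `get_dummies()` behaves, the sketch below encodes a toy DataFrame with one string-valued column and one column that is already a 0/1 integer. The toy data and variable names are made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    'protocol_type': ['tcp', 'udp', 'icmp', 'tcp'],  # string-valued, gets expanded
    'logged_in': [1, 0, 0, 1],                        # already numeric, left as-is
})

# With no `columns` argument, only object/string/category columns are encoded
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

The numeric column comes first in the result, followed by one indicator column per protocol value. Depending on the pandas version, the indicator columns are stored as integers or as booleans.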

### 1.4 One-hot Encode the Categorical Data

In [6]:
# capture the categorical variables and one-hot encode them
cat_vars = ['protocol_type', 'service', 'flag', 'land', 'logged_in','is_host_login', 'is_guest_login']

# one-hot encode the string-valued columns; the 0/1 integer columns pass through unchanged
cat_data = pd.get_dummies(df[cat_vars])

# check that the categorical variables were created correctly
cat_data.head()

Unnamed: 0,land,logged_in,is_host_login,is_guest_login,protocol_type_icmp,protocol_type_tcp,protocol_type_udp,service_IRC,service_X11,service_Z39_50,service_aol,service_auth,service_bgp,service_courier,service_csnet_ns,service_ctf,service_daytime,service_discard,service_domain,service_domain_u,service_echo,service_eco_i,service_ecr_i,service_efs,service_exec,service_finger,service_ftp,service_ftp_data,service_gopher,service_harvest,service_hostnames,service_http,service_http_2784,service_http_443,service_http_8001,service_imap4,service_iso_tsap,service_klogin,service_kshell,service_ldap,service_link,service_login,service_mtp,service_name,service_netbios_dgm,service_netbios_ns,service_netbios_ssn,service_netstat,service_nnsp,service_nntp,service_ntp_u,service_other,service_pm_dump,service_pop_2,service_pop_3,service_printer,service_private,service_red_i,service_remote_job,service_rje,service_shell,service_smtp,service_sql_net,service_ssh,service_sunrpc,service_supdup,service_systat,service_telnet,service_tftp_u,service_tim_i,service_time,service_urh_i,service_urp_i,service_uucp,service_uucp_path,service_vmnet,service_whois,flag_OTH,flag_REJ,flag_RSTO,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


<br>
Scroll to the right to see the new columns that were created. For example, the categorical variable "protocol_type" is split into three columns: protocol_type_icmp, protocol_type_tcp, and protocol_type_udp. Now that the categorical data is one-hot encoded, we need to combine it with the numerical columns from the original data.
<br>
<br>

In [7]:
numeric_vars = list(set(df.columns.values.tolist()) - set(cat_vars))
numeric_vars.remove('label')
numeric_data = df[numeric_vars].copy()

# check that the numeric data has been captured accurately
numeric_data.head()

Unnamed: 0,dst_host_diff_srv_rate,num_file_creations,dst_host_srv_rerror_rate,dst_host_same_srv_rate,dst_host_srv_diff_host_rate,su_attempted,dst_host_serror_rate,num_root,wrong_fragment,dst_bytes,urgent,diff_srv_rate,duration,dst_host_srv_serror_rate,root_shell,same_srv_rate,num_outbound_cmds,num_shells,srv_serror_rate,src_bytes,srv_count,num_access_files,num_compromised,srv_diff_host_rate,dst_host_count,rerror_rate,dst_host_same_src_port_rate,srv_rerror_rate,count,dst_host_srv_count,hot,dst_host_rerror_rate,serror_rate,num_failed_logins
0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,45076,0,0.0,0,0.0,0,1.0,0,0,0.0,215,1,0,0,0.0,0,0.0,0.0,0.0,1,0,0,0.0,0.0,0
1,0.0,0,0.0,1.0,0.0,0,0.0,0,0,4528,0,0.0,0,0.0,0,1.0,0,0,0.0,162,2,0,0,0.0,1,0.0,1.0,0.0,2,1,0,0.0,0.0,0
2,0.0,0,0.0,1.0,0.0,0,0.0,0,0,1228,0,0.0,0,0.0,0,1.0,0,0,0.0,236,1,0,0,0.0,2,0.0,0.5,0.0,1,2,0,0.0,0.0,0
3,0.0,0,0.0,1.0,0.0,0,0.0,0,0,2032,0,0.0,0,0.0,0,1.0,0,0,0.0,233,2,0,0,0.0,3,0.0,0.33,0.0,2,3,0,0.0,0.0,0
4,0.0,0,0.0,1.0,0.0,0,0.0,0,0,486,0,0.0,0,0.0,0,1.0,0,0,0.0,239,3,0,0,0.0,4,0.0,0.25,0.0,3,4,0,0.0,0.0,0
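
The columns in the output above appear in a different order than in the original file because `numeric_vars` is built from a set difference, and Python sets do not preserve insertion order. The order may also vary between Python sessions. If a stable column order matters, an order-preserving alternative could look like the sketch below; the toy DataFrame and the names `toy_cat_vars` and `ordered_numeric_vars` are illustrative only.

```python
import pandas as pd

# Toy stand-in for df with a few of the original column names
toy = pd.DataFrame({
    'duration': [0],
    'protocol_type': ['tcp'],
    'src_bytes': [215],
    'flag': ['SF'],
    'label': ['normal.'],
})
toy_cat_vars = ['protocol_type', 'flag']

# Keep every column that is neither categorical nor the label,
# iterating over the DataFrame's columns so the file order is preserved
excluded = set(toy_cat_vars) | {'label'}
ordered_numeric_vars = [c for c in toy.columns if c not in excluded]
print(ordered_numeric_vars)
```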


In [8]:
# concat numeric and the encoded categorical variables
numeric_cat_data = pd.concat([numeric_data, cat_data], axis=1)

# here we do a quick sanity check that the data has been concatenated correctly by checking the dimension of the vectors
print(cat_data.shape)
print(numeric_data.shape)
print(numeric_cat_data.shape)

(4898431, 88)
(4898431, 34)
(4898431, 122)


<br>
Now let's split the data into a training set and a test set in the ratio 75:25. We will use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html">LabelEncoder</a> fitted earlier to convert the string labels to integers with its `transform` method, and <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">train_test_split</a> from scikit-learn to perform the split.
<br>
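
Before applying these two tools to the full dataset, it can help to see them on a toy example. The sketch below encodes a few labels, decodes them again, and performs a 75:25 split on eight made-up rows. All `toy_` names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Classes are sorted alphabetically, then numbered from 0
toy_le = LabelEncoder()
toy_le.fit(['normal.', 'smurf.', 'neptune.', 'smurf.'])
toy_encoded = toy_le.transform(['smurf.', 'normal.'])
toy_decoded = toy_le.inverse_transform(toy_encoded)
print(toy_le.classes_, toy_encoded, toy_decoded)

# Eight toy rows with two features each, split 75:25
toy_x = np.arange(16).reshape(8, 2)
toy_y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=42)
print(toy_x_train.shape, toy_x_test.shape)
```

`inverse_transform` is useful later for turning integer predictions back into readable attack names.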

In [9]:
# capture the labels
labels = df['label'].copy()

# convert labels to integers
integer_labels = le.transform(labels)

# split data into test and train
x_train, x_test, y_train, y_test = train_test_split(numeric_cat_data,
                                                    integer_labels,
                                                    test_size=.25, 
                                                    random_state=42)

<br>
We can inspect the dimensions of the training set and the test set to confirm that the data has been split correctly. We will also save the dataset, to be used in the later portion of this lab and in Lab 2, by "pickling" the data. Pickling allows us to save a Python object as a binary file.
<br>

In [10]:
# check that the dimensions of our train and test sets are okay
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(3673823, 122)
(3673823,)
(1224608, 122)
(1224608,)


In [11]:
# save the datasets for later use
preprocessed_data = {
    'x_train':x_train,
    'y_train':y_train,
    'x_test':x_test,
    'y_test':y_test,
    'le':le
}

# pickle the preprocessed_data; the with-block closes the file even if dumping fails
path = 'preprocessed_data_full.pkl'
with open(path, 'wb') as out:
    pickle.dump(preprocessed_data, out)
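
Reading a pickled object back is the mirror image of writing it. The sketch below does a round trip with a small stand-in dictionary in a temporary directory, so it does not touch the lab's files. The file name `toy_preprocessed.pkl` and the toy contents are hypothetical.

```python
import os
import pickle
import tempfile

# Small stand-in for the preprocessed_data dictionary
toy_data = {'x_train': [[0, 1], [1, 0]], 'y_train': [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_preprocessed.pkl')

    # Write the object as a binary file
    with open(toy_path, 'wb') as f:
        pickle.dump(toy_data, f)

    # Read it back; note the 'rb' (read binary) mode
    with open(toy_path, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_data)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.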

We will approach this anomaly detection problem in two ways:

1. Binary classification: we label normal connections as '0' and anomalous connections as '1', and use a 'one vs. all' approach to detect an anomalous connection
2. Multi-class classification: we also detect the *type* of anomaly, using our original y_train and y_test labels
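
For the binary approach, the integer labels have to be collapsed into two classes. One possible way to do this, sketched on toy labels, is to look up the integer that the label encoder assigned to `normal.` and mark everything else as anomalous. The `toy_` names are illustrative; in the full dataset, `normal.` sits at index 11 of `le.classes_`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy encoder with a few of the dataset's label names
toy_le = LabelEncoder().fit(['back.', 'neptune.', 'normal.', 'smurf.'])
toy_y = toy_le.transform(['normal.', 'smurf.', 'neptune.', 'normal.', 'back.'])

# Integer assigned to the normal class by the encoder
normal_index = toy_le.transform(['normal.'])[0]

# 0 for normal connections, 1 for any kind of anomaly
toy_y_binary = (toy_y != normal_index).astype(int)
print(toy_y_binary)
```

Looking the index up through the encoder avoids hard-coding a number that would change if the set of labels changed.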

# Conclusion 

- As we saw in the binary and multi-class classification problems, XGBoost can be very effective at detecting anomalies when you have labeled data. In labs 2 and 3, we will consider the same KDD99 dataset, but we will train deep learning models to detect anomalies without using the labels. This mimics a more likely situation in the real world.

- GPU-accelerating XGBoost through RAPIDS is easy and fast. The only change we had to make to use the GPU was to set the 'tree_method' parameter to 'gpu_hist'.
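
As a rough sketch of what that one-parameter change looks like, the helper below builds an XGBoost parameter dictionary for either the binary or the multi-class setup. The function name `make_xgb_params` and its arguments are hypothetical, and the exact parameters used in the lab's training cells may differ. Note also that in XGBoost 2.0 and later, `gpu_hist` is deprecated in favor of `tree_method='hist'` combined with `device='cuda'`.

```python
def make_xgb_params(use_gpu, num_class=None):
    """Build a minimal XGBoost parameter dict (illustrative helper)."""
    # The only GPU-specific setting: the tree construction algorithm
    params = {'tree_method': 'gpu_hist' if use_gpu else 'hist'}

    if num_class is None:
        # Binary case: normal (0) vs. anomalous (1)
        params['objective'] = 'binary:logistic'
    else:
        # Multi-class case: one class per label in le.classes_
        params['objective'] = 'multi:softmax'
        params['num_class'] = num_class
    return params


gpu_binary_params = make_xgb_params(use_gpu=True)
gpu_multi_params = make_xgb_params(use_gpu=True, num_class=23)
cpu_binary_params = make_xgb_params(use_gpu=False)
print(gpu_binary_params)
print(gpu_multi_params)

# The dictionary would then be passed to training, for example:
#   dtrain = xgb.DMatrix(x_train, label=y_train)
#   model = xgb.train(gpu_multi_params, dtrain, num_boost_round=10)
# (num_boost_round=10 is an arbitrary value chosen for this sketch)
```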

    
   
