# 1. Building a Custom Environment (NSL-KDD Dataset)

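In this section we use OpenAI Gym to define the environment for our Deep Reinforcement Learning (DRL) agent. The environment casts intrusion detection as multi-class classification: each step presents one record from the dataset, and the agent's action is its predicted class. It has the following characteristics: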

Actions:

0 (do not alert)
<br>
1 (DoS alert)
<br>
2 (Probe alert)
<br>
3 (R2L alert)
<br>
4 (U2R alert)

Rewards:

+1 if the agent raises an alert for the correct type of attack <br>
0 if the agent raises no alert when none is needed <br>
-1 if the agent raises no alert when there is an attack <br>
-1 if the agent raises an alert when none is needed <br>
-1 if the agent raises an alert for the wrong type of attack <br>
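
The reward scheme above can be written as a lookup table keyed by (true label, action). The following is a minimal sketch, assuming the labels use the same integer encoding as the actions (0 = benign, 1 to 4 = attack classes); `build_reward_table` is a hypothetical helper introduced only for illustration.

```python
def build_reward_table(n_classes=5):
    """Map (true_label, action) -> reward for the scheme listed above."""
    table = {}
    for label in range(n_classes):
        for action in range(n_classes):
            if label == action:
                # Correct decision: 0 for staying silent on benign traffic,
                # +1 for naming the right attack class.
                table[(label, action)] = 0 if label == 0 else 1
            else:
                # False alarm, missed attack, or wrong attack class.
                table[(label, action)] = -1
    return table


rewards = build_reward_table()
print(len(rewards))     # one entry per (label, action) pair
print(rewards[(0, 0)])  # benign, no alert
print(rewards[(2, 2)])  # Probe attack, Probe alert
print(rewards[(4, 0)])  # U2R attack, no alert
```

Generating the table from the rule keeps the 25 entries consistent if the number of classes ever changes.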

Episode Termination Condition:

i. The episode reaches >= 500 steps <br>
ii. An attack occurs and the agent raises no alert <br>
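
As a rough illustration, the two termination conditions can be expressed as one predicate. This sketch assumes that condition (ii) applies to every attack class (labels 1 to 4); `episode_finished` and `MAX_STEPS` are hypothetical names introduced here.

```python
MAX_STEPS = 500  # step limit from condition (i)


def episode_finished(step_count, true_label, action, max_steps=MAX_STEPS):
    """Return (finished, cause) for the two termination conditions above."""
    if true_label != 0 and action == 0:
        # Condition (ii): an attack was present and no alert was raised.
        return True, "attack_unalerted"
    if step_count >= max_steps:
        # Condition (i): the step budget for this episode is used up.
        return True, "max_step_limit_reached"
    return False, None


print(episode_finished(10, true_label=3, action=0))
print(episode_finished(500, true_label=0, action=0))
print(episode_finished(10, true_label=1, action=2))
```

A wrong alert type (last call) is penalised by the reward but does not end the episode under this reading.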

In [98]:
# pip install pycm==3.3

Collecting pycm==3.3
  Downloading pycm-3.3-py2.py3-none-any.whl (65 kB)
Collecting art>=1.8
  Downloading art-5.3-py2.py3-none-any.whl (574 kB)
Installing collected packages: art, pycm
Successfully installed art-5.3 pycm-3.3
Note: you may need to restart the kernel to use updated packages.


In [102]:
# imports and filtering unnecessary tensorflow warnings

import gym 
import numpy as np
import pandas as pd
from pycm import ConfusionMatrix
from stable_baselines.common.env_checker import check_env
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2
from stable_baselines.common.policies import FeedForwardPolicy
from stable_baselines.common.callbacks import BaseCallback
from sklearn.metrics import accuracy_score, f1_score
import swifter
from sklearn.model_selection import train_test_split

import os
import warnings
import tensorflow as tf
import logging

In [212]:
df = pd.read_csv("preprocessed_NSL-KDD.csv")
df=df.rename(columns = {'Class':'label'})


# converting any label != 0 to 1, so we just determine attack or not
# df.label = df.label.astype(bool).astype(int)

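def normalize(df):
    """Min-max scale every feature column to [0, 1]; 'label' is left untouched."""
    result = df.copy()
    for feature_name in df.columns:
        if feature_name == 'label':
            continue
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        value_range = max_value - min_value
        if value_range == 0:
            # constant column: avoid 0/0 -> NaN, which would fall outside the [0, 1] observation space
            result[feature_name] = 0.0
        else:
            result[feature_name] = (df[feature_name] - min_value) / value_range
    return result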

df = normalize(df)
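
For reference, min-max scaling maps each value x to (x - min) / (max - min). The plain-Python sketch below applies that formula to a single column; `min_max_scale` is a hypothetical helper, and sending a constant column to 0.0 is a convention chosen here for illustration.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula would divide by zero, so use 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([2, 4, 10]))  # smallest value -> 0.0, largest -> 1.0
print(min_max_scale([7, 7, 7]))   # constant column
```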

In [171]:
X_train, X_test, y_train, y_test = train_test_split(             # hold out 15% of the data for testing
    df.iloc[:, :-1], df["label"], test_size=0.15, shuffle=True)  # assumes 'label' is the last column

X_train, X_val, y_train, y_val = train_test_split(               # hold out 15% of the remaining 85% (about 12.75% of all data) for validation
    X_train, y_train, test_size=0.15, shuffle=True)

print("train percentage: ", len(X_train)/len(df.index) * 100, "%")
print("test percentage: ", len(X_test)/len(df.index) * 100, "%")
print("validation percentage ", len(X_val)/len(df.index) * 100, "%")

train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)
val_data = pd.concat([X_val, y_val], axis=1)

print(train_data.label.unique())

train percentage:  72.24702848877429 %
test percentage:  15.001152967318617 %
validation percentage  12.751818543907092 %
[4 1 0 2 3]
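
The printed percentages follow from applying the second split to what remains after the first. A short arithmetic check (the variable names are illustrative only):

```python
test_fraction = 0.15         # first split: share of all rows held out for testing
val_fraction_of_rest = 0.15  # second split: share of the remaining rows

remaining = 1.0 - test_fraction               # 0.85 of the data is left
val_share = remaining * val_fraction_of_rest  # 0.85 * 0.15
train_share = remaining - val_share           # whatever is left for training

print(f"train {train_share:.2%}, test {test_fraction:.2%}, validation {val_share:.2%}")
```

The small differences from the printed values arise because the splits are rounded to whole rows.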


In [172]:
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # 3 = suppress TensorFlow C++ INFO, WARNING and ERROR logs
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=Warning)

tf.autograph.set_verbosity(0)
tf.get_logger().setLevel(logging.ERROR)  # only show errors from the TensorFlow Python logger

In [173]:
class DRL_IDS_Env(gym.Env):
    def __init__(self, train_data): # training data created by the split above
        '''
        constructor
        '''
        super().__init__()
        self.train_data = train_data
    
        # set limit for episode to 500 steps
        self.max_steps = 500
        self.extra_steps = None # counter for steps going beyond the max_steps limit
    
        # defining the reward function as discussed above
        # [(true_label, action) : reward]
        
        # binary (attack / no attack) reward table, kept for reference:
        # self.rewards = {(0, 1): -1, # (benign, alert) : -1
        #                 (1, 0): -1, # (attack, no alert) : -1
        #                 (1, 1): 1,  # (attack, alert) : 1
        #                 (0, 0): 0}  # (benign, no alert) : 0
        
        self.rewards = {(0, 0): 0,
                        (0, 1): -1,
                        (0, 2): -1,
                        (0, 3): -1,
                        (0, 4): -1,
                        (1, 0): -1,
                        (1, 1): 1, 
                        (1, 2): -1,
                        (1, 3): -1,
                        (1, 4): -1,
                        (2, 0): -1,
                        (2, 1): -1,
                        (2, 2): 1,
                        (2, 3): -1,
                        (2, 4): -1,
                        (3, 0): -1,
                        (3, 1): -1,
                        (3, 2): -1,
                        (3, 3): 1,
                        (3, 4): -1,
                        (4, 0): -1,
                        (4, 1): -1,
                        (4, 2): -1,
                        (4, 3): -1,
                        (4, 4): 1}
        
        # defining action/observation space
        self.action_space = gym.spaces.Discrete(5)  # 0 (no alert) or 1-4 (DoS, Probe, R2L, U2R alert)
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=(train_data.shape[1] - 1,), dtype=np.float64) # 'Box' means real-valued features, here min-max scaled to [0, 1]
    
    def step(self, action):
        '''
        advance the environment by one step
        called with the action the agent chose for the current observation
        '''
        
        # check that the action exists in the action space
        # (contains() returns a bool rather than raising, so assert on it)
        assert self.action_space.contains(action), "invalid action: %r" % (action,)
        
        # determine if the episode is finished
        ep_info = {}
        finished = False
        self.current_step += 1
        if self.current_step >= self.max_steps:
            ep_info['end_cause'] = 'max_step_limit_reached'
            finished = True # we do not want to exceed our max step limit
        
        if self.label != 0 and action == 0: # an attack (of any class) occurred and no alert was raised
            ep_info['end_cause'] = 'attack_unalerted'
            finished = True
            
        # calculate reward based on the label of the observation and action taken by agent
        reward = self.rewards[(self.label, action)] # maps back to our self.reward dictionary
        
        # calculate the next state if finished = False
        if not finished:
            self.i += 1 # hop to next row in dataset
            if self.i >= self.train_data.shape[0]: # if this extends beyond the number of rows in our dataset
                self.i = 0 # set back to first 'state'
            
            self.obs = self.train_data.iloc[self.i] # pulling that row, or 'observation' from our dataset
            self.label = int(self.obs.pop('label'))
            
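        elif self.extra_steps is None:
            self.extra_steps = 0 # the episode has just ended on this step
        else:
            # step() was called again after the episode ended without a reset
            if self.extra_steps == 0:
                gym.logger.warn('step() called after the episode has ended. You are entering uncharted territory and should reset the episode.')
            self.extra_steps += 1
            reward = 0
                
        return self.obs.values, reward, finished, ep_info
    
    def reset(self):
        '''
        start a new episode at a random row of the dataset
        '''
        self.extra_steps = None # clear the beyond-the-end counter for the new episode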
        self.current_step = 0
        
        self.i = np.random.randint(0, self.train_data.shape[0]) # pick a random starting location from the 0th row to the nth row

        self.obs = self.train_data.iloc[self.i]
        # record the true label of self.obs
        self.label = int(self.obs.pop('label'))
        
        return self.obs.values

Next, we create an instance of DRL_IDS_Env and validate it with the environment checker from stable_baselines.
Note: stable_baselines (https://stable-baselines.readthedocs.io/en/master/) is a set of improved implementations of Reinforcement Learning (RL) algorithms based on OpenAI Baselines.

In [174]:
# train_data = pd.read_csv("processed_data/train.csv")
env = DRL_IDS_Env(train_data)
check_env(env, warn=True)
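
Beyond `check_env`, a quick manual rollout is a common way to sanity-check the `reset()`/`step()` contract. The sketch below is not the notebook's code: `ToyEnv` is a hypothetical stand-in that mimics the same interface so the snippet runs without gym or the dataset, and `rollout` is a hypothetical helper. Under those assumptions, `rollout` should also accept a `DRL_IDS_Env` instance and any policy that returns an integer in 0 to 4.

```python
class ToyEnv:
    """Hypothetical stand-in exposing the same reset()/step() contract."""

    def __init__(self, labels, max_steps=5):
        self.labels = labels
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.labels[0]  # observation = true label, for illustration only

    def step(self, action):
        current_label = self.labels[self.t % len(self.labels)]
        reward = 1 if action == current_label else -1
        self.t += 1
        done = self.t >= self.max_steps
        next_obs = self.labels[self.t % len(self.labels)]
        return next_obs, reward, done, {}


def rollout(env, policy):
    """Run one episode and return (total_reward, number_of_steps)."""
    obs, total, steps, done = env.reset(), 0, 0, False
    while not done:
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        steps += 1
    return total, steps


# A policy that echoes the observation is always right in this toy setting.
print(rollout(ToyEnv([0, 1, 2]), policy=lambda obs: obs))
```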

# 2. Training within Custom Environment

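We train the agent on multiple copies of the environment at once using the vectorized environments from stable-baselines. Note that DummyVecEnv steps its environments one after another in a single process rather than truly in parallel; it batches their observations so that the policy can act on all of them in one call.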

In [175]:
n_envs = 16  # hyperparameter
env = DummyVecEnv([lambda: DRL_IDS_Env(train_data)] * n_envs)
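
To make the idea of a vectorized environment concrete, here is a simplified sketch of a wrapper that steps several environments in sequence and resets each one when its episode ends. It is an illustration only, not the stable-baselines implementation: `SequentialVecEnv` and `CountingEnv` are hypothetical classes. It also shows that repeating one factory function with `* n` still yields separate environment instances, because the factory is called once per entry.

```python
class CountingEnv:
    """Hypothetical toy environment: an episode ends after `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


class SequentialVecEnv:
    """Step a list of environments one after another and batch the results."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]  # each call builds a fresh env

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        results = []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)
            if done:
                obs = env.reset()  # start the next episode immediately
            results.append((obs, reward, done, info))
        obs, rewards, dones, infos = map(list, zip(*results))
        return obs, rewards, dones, infos


vec_env = SequentialVecEnv([lambda: CountingEnv()] * 4)
print(len({id(env) for env in vec_env.envs}))  # number of distinct instances
print(vec_env.reset())
for _ in range(3):
    batch = vec_env.step([0, 0, 0, 0])
print(batch[0], batch[2])  # observations after auto-reset, done flags
```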

In [176]:
# Defining network architecture

# https://stable-baselines.readthedocs.io/en/master/guide/custom_policy.html#custom-policy
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(
            *args,
            **kwargs,
            net_arch=[128, 64, 32],
            act_fun=tf.nn.relu,  # ReLU (Rectified Linear Unit) activation function
            feature_extraction="mlp"  # multi-layer perceptron feature extractor
        )

In [177]:
# TODO: ADJUST THESE HYPERPARAMETERS

'''
The Proximal Policy Optimization algorithm combines ideas from A2C (having multiple workers) and TRPO
(it uses a trust region to improve the actor).

The main idea is that after an update, the new policy should not be too far from the old policy. To that end, PPO
uses clipping to avoid overly large updates.
'''
model = PPO2(
    CustomPolicy,
    env,
    gamma=0.9,
    n_steps=512,
    ent_coef=1e-05,
    learning_rate=lambda progress: progress
    * 0.0021,  # progress decreases from 1 to 0 -> lr decreases linearly from 0.0021 to 0
    vf_coef=0.6,
    max_grad_norm=0.8,
    lam=0.8,
    nminibatches=16,
    noptepochs=55,
    cliprange=0.2,
    verbose=0,
    tensorboard_log="log_25",  # define the tensorboard log location
)
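
Two ideas from the cell above can be illustrated in isolation: the linearly decaying learning rate and the clipping of the policy ratio. The sketch below is a scalar, per-sample illustration under stated assumptions, not the library's implementation; `linear_schedule` and `clipped_surrogate` are hypothetical helpers. The last lines work out the batch sizes implied by the hyperparameters, assuming one update uses n_envs * n_steps samples split evenly into nminibatches.

```python
def linear_schedule(initial_lr):
    """Return a function mapping remaining progress (1 -> 0) to a learning rate."""
    def schedule(progress):
        return progress * initial_lr
    return schedule


def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


lr = linear_schedule(0.0021)
print(lr(1.0), lr(0.5), lr(0.0))     # start, halfway, end of training

print(clipped_surrogate(1.5, 2.0))   # ratio above 1 + eps is clipped
print(clipped_surrogate(0.5, -1.0))  # ratio below 1 - eps is clipped
print(clipped_surrogate(1.1, 2.0))   # ratio inside the range is unchanged

n_envs, n_steps, nminibatches = 16, 512, 16
samples_per_update = n_envs * n_steps
minibatch_size = samples_per_update // nminibatches
print(samples_per_update, minibatch_size)
```

Taking the minimum of the clipped and unclipped terms removes the incentive to push the ratio far outside the interval [1 - eps, 1 + eps] in a single update.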

In [209]:
class AccF1Callback(BaseCallback):
    def __init__(self, train, val, eval_freq):
        super().__init__()
        self.train_data = train
        self.val_data = val
        self.eval_freq = eval_freq

    def _on_step(self):
        '''
        _on_step is called at every training step; the evaluation below
        only runs on every eval_freq-th call
        '''

        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            super()._on_step()

            predicted = self.train_data.drop(columns=["label"]).swifter.apply(lambda x: self.model.predict(x, deterministic=True)[0], axis=1)
            cm = ConfusionMatrix(self.train_data["label"].tolist(), predicted, digit=5)
            print("*-" * 50)
            
            print("total current timesteps: ", self.num_timesteps, '\n')
            # accuracy: fraction of all predictions that are correct
            print('Overall Training Accuracy: ', cm.Overall_ACC)
            # F1 score: harmonic mean of precision and recall, reported per class;
            # 1 is best (perfect precision and recall), 0 is worst
            print("Training F1 Scores: ", cm.F1)
            
            print('\n')
            predicted = self.val_data.drop(columns=["label"]).swifter.apply(lambda x: self.model.predict(x, deterministic=True)[0], axis=1)
            cm = ConfusionMatrix(self.val_data["label"].tolist(), predicted, digit=5)
            # accuracy: fraction of all predictions that are correct
            print('Overall Validation Accuracy: ', cm.Overall_ACC)
            # F1 score: harmonic mean of precision and recall, reported per class
            print("Validation F1 Scores: ", cm.F1)
            
            print("*-" * 50)

        return True
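
Overall accuracy and per-class F1 can diverge sharply when a class is rare. The sketch below computes both by hand on a made-up example, using F1 = 2·TP / (2·TP + FP + FN), which is equivalent to the harmonic mean of precision and recall. `per_class_f1` is a hypothetical helper and the vectors are invented for illustration; they are not results from this notebook.

```python
def per_class_f1(actual, predicted):
    """Return {class: F1}, with F1 = 2*TP / (2*TP + FP + FN) per class."""
    pairs = list(zip(actual, predicted))
    scores = {}
    for cls in sorted(set(actual) | set(predicted)):
        tp = sum(a == cls and p == cls for a, p in pairs)
        fp = sum(a != cls and p == cls for a, p in pairs)
        fn = sum(a == cls and p != cls for a, p in pairs)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


# Invented labels: class 3 is rare and one of its two samples is missed.
actual = [0, 0, 0, 0, 0, 0, 0, 0, 3, 3]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 3, 0]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
f1 = per_class_f1(actual, predicted)
print(accuracy)
print(f1)
```

A single missed sample barely moves the accuracy but pulls the rare class's F1 well below that of the majority class, which is why the callback reports both.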

In [210]:
# val_data = pd.read_csv("processed_data/val.csv")
eval_callback = AccF1Callback(train_data, val_data, eval_freq=1000 // n_envs)
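
The integer division determines how often the metrics are printed. Assuming the callback's `_on_step` runs once per vectorized step and each such step advances the timestep counter by `n_envs`, the evaluation interval works out as follows:

```python
n_envs = 16
eval_freq = 1000 // n_envs                    # callback calls between evaluations
timesteps_between_evals = eval_freq * n_envs  # environment steps between evaluations

print(eval_freq, timesteps_between_evals)
```

This is consistent with the training log, whose reported timesteps are multiples of 992 rather than 1000.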

In [211]:
model.learn(5000000, callback=eval_callback)

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
total current timesteps:  992 

Overall Training Accuracy:  0.99544452181987
Training F1 Scores:  {0: 0.9978836952521163, 1: 0.9958872265601717, 2: 0.9623029472241261, 3: 0.4793388429752066, 4: 0.9971909138200103}


Overall Validation Accuracy:  0.9947394377774125
Validation F1 Scores:  {0: 0.9973656480505796, 1: 0.9945982444294396, 2: 0.9608540925266904, 3: 0.26666666666666666, 4: 0.9968701095461658}
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
total current timesteps:  1984 

Overall Training Accuracy:  0.99544452181987
Training F1 Scores:  {0: 0.9978836952521163, 1: 0.9958872265601717, 2: 0.9623029472241261, 3: 0.4793388429752066, 4: 0.9971909138200103}


Overall Validation Accuracy:  0.9947394377774125
Validation F1 Scores:  {0:

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
total current timesteps:  14880 

Overall Training Accuracy:  0.9959377901578459
Training F1 Scores:  {0: 0.9991720331186753, 1: 0.9954713383386963, 2: 0.963276836158192, 3: 0.5607476635514018, 4: 0.9969878475227643}


Overall Validation Accuracy:  0.9939174749301332
Validation F1 Scores:  {0: 0.9973642593568793, 1: 0.9922375970300371, 2: 0.9477611940298507, 3: 0.3076923076923077, 4: 0.9964898595943837}
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
total current timesteps:  15872 

Overall Training Accuracy:  0.9959377901578459
Training F1 Scores:  {0: 0.9991720331186753, 1: 0.9954713383386963, 2: 0.963276836158192, 3: 0.5607476635514018, 4: 0.9969878475227643}


Overall Validation Accuracy:  0.9939174749301332
Validation F1 Scores: 

KeyboardInterrupt: 