# 1. Building a Custom Environment

This section uses OpenAI Gym to build our Deep Reinforcement Learning (DRL) environment. The environment has the following characteristics:

Actions:

0 (do not alert) <br>
1 (alert) <br>

Rewards:

+1 if the agent correctly raises an alert on an attack <br>
0 if the agent does not raise an alert when none is needed <br>
-1 if the agent does not raise an alert when there is an attack <br>
-1 if the agent raises an alert when none is needed <br>

Episode Termination Condition:

i. The episode reaches 500 steps <br>
ii. An attack occurs and no alert is raised <br>
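
The reward and termination rules listed above can be sketched as a single pure-Python function. This is a minimal illustration rather than the environment itself; the names `REWARDS`, `MAX_STEPS`, and `reward_and_done` are assumptions made for this example.

```python
# Hypothetical helper names; only the rules themselves come from the lists above.
REWARDS = {
    (1, 1): 1,   # attack, alert
    (0, 0): 0,   # benign, no alert
    (1, 0): -1,  # attack, no alert (missed attack)
    (0, 1): -1,  # benign, alert (false alarm)
}
MAX_STEPS = 500


def reward_and_done(label, action, step_count):
    """Return (reward, done) for one step.

    label: 1 if the current sample is an attack, else 0
    action: 1 if the agent alerts, else 0
    step_count: steps taken in the episode so far, including this one
    """
    reward = REWARDS[(label, action)]
    missed_attack = label == 1 and action == 0
    done = missed_attack or step_count >= MAX_STEPS
    return reward, done


print(reward_and_done(1, 1, 10))   # caught attack, episode continues
print(reward_and_done(1, 0, 10))   # missed attack ends the episode
print(reward_and_done(0, 0, 500))  # step limit reached
```

Note that a false alarm costs the same reward as a missed attack, but only the missed attack ends the episode.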

In [27]:
# imports and filtering unnecessary tensorflow warnings

import gym 
import numpy as np
import pandas as pd
from stable_baselines.common.env_checker import check_env
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2
from stable_baselines.common.policies import FeedForwardPolicy
from stable_baselines.common.callbacks import BaseCallback
from sklearn.metrics import accuracy_score, f1_score
import swifter

import os
import warnings
import tensorflow as tf
import logging

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # or any {'0', '1', '2'}
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=Warning)

tf.autograph.set_verbosity(0)
tf.get_logger().setLevel(logging.ERROR)  # only show errors from TensorFlow's logger

In [58]:
class DRL_IDS_Env(gym.Env):
    def __init__(self, train_data): # training data created in the previous module
        '''
        constructor
        '''
        super().__init__()
        self.train_data = train_data
    
        # set limit for episode to 500 steps
        self.max_steps = 500
        self.extra_steps = None # counter for steps going beyond the max_steps limit
    
        # defining the reward function as discussed above
        # [(true_label, action) : reward]
        self.rewards = {(0, 1): -1, # (benign, alert) : -1
                        (1, 0): -1, # (attack, no alert) : -1
                        (1, 1): 1, # (attack, alert) : 1
                        (0, 0): 0} # (benign, no alert) : 0
        
        # defining action/observation space
        self.action_space = gym.spaces.Discrete(2)  # either 0 (do not alert) or 1 (alert)
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=(train_data.shape[1] - 1,), dtype=np.float64) # 'Box' means we are dealing with real-valued quantities; the -1 excludes the label column
    
    def step(self, action):
        '''
        advance the environment by one step
        this method is called each time the agent chooses an action
        '''
        
        # check that the action exists in the action space
        # (contains() returns a bool instead of raising, so assert on the result)
        assert self.action_space.contains(action), f"invalid action: {action!r}"
        
        # determine if the episode is finished
        ep_info = {}
        finished = False
        self.current_step += 1
        if self.current_step >= self.max_steps:
            ep_info['end_cause'] = 'max_step_limit_reached'
            finished = True # we do not want to exceed our max step limit
        
        if self.label == 1 and action == 0: # this implies there was an attack that we did not alert
            ep_info['end_cause'] = 'attack_unalerted'
            finished = True
            
        # calculate reward based on the label of the observation and action taken by agent
        reward = self.rewards[(self.label, action)] # look up the reward in the self.rewards dictionary
        
        # calculate the next state if finished = False
        if not finished:
            self.i += 1 # hop to next row in dataset
            if self.i >= self.train_data.shape[0]: # if this extends beyond the number of rows in our dataset
                self.i = 0 # set back to first 'state'
            
            self.obs = self.train_data.iloc[self.i] # pulling that row, or 'observation' from our dataset
            self.label = int(self.obs.pop('label'))
            
        elif self.extra_steps is None:
            self.extra_steps = 0
        else:
            if self.extra_steps == 0:
                gym.logger.warn('Episode max_step length exceeded. You are entering uncharted territory and should reset the episode.')
                self.extra_steps += 1
                reward = 0
                
        return self.obs.values, reward, finished, ep_info
    
    def reset(self):
        
        self.extra_steps = None # reset the counter on the instance, not a local variable
        self.current_step = 0
        
        self.i = np.random.randint(0, self.train_data.shape[0]) # pick a random starting location from the 0th row to the nth row
        
        #print('reset at state number: ', self.i)
        
        self.obs = self.train_data.iloc[self.i]
        
        # record the true label of self.obs
        self.label = int(self.obs.pop('label'))
        
        return self.obs.values

Now we create an instance of DRL_IDS_Env and validate it with the check_env helper from stable_baselines.
Note: stable_baselines (https://stable-baselines.readthedocs.io/en/master/) is a set of improved implementations of Reinforcement Learning (RL) algorithms based on OpenAI Baselines.

In [59]:
train_data = pd.read_csv("processed_data/train.csv")
env = DRL_IDS_Env(train_data)
check_env(env, warn=True)
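
Beyond `check_env`, it can help to trace an episode by hand. The sketch below mimics the environment's stepping rules (advance one row per step, wrap to the first row at the end of the data, stop on a missed attack or at the step limit) on a plain list of labels. `run_episode`, the toy labels, and the two policies are hypothetical stand-ins for the real dataset and agent.

```python
def run_episode(labels, policy, start=0, max_steps=500):
    """Replay the stepping rules on a list of 0/1 labels.

    policy: function mapping the true label to an action. This is a toy
            stand-in; a real agent sees features, never the label.
    Returns (total_reward, steps_taken, end_cause).
    """
    rewards = {(0, 1): -1, (1, 0): -1, (1, 1): 1, (0, 0): 0}
    i, total, step = start, 0, 0
    while True:
        step += 1
        label = labels[i]
        action = policy(label)
        total += rewards[(label, action)]
        if label == 1 and action == 0:
            return total, step, "attack_unalerted"
        if step >= max_steps:
            return total, step, "max_step_limit_reached"
        i = (i + 1) % len(labels)  # wrap around to the first row


labels = [0, 0, 1, 0]


def oracle(label):
    return label  # always takes the correct action


def never_alert(label):
    return 0  # never raises an alert


print(run_episode(labels, oracle, max_steps=8))
print(run_episode(labels, never_alert))
```

With these toy labels, the policy that never alerts ends its episode at the first attack it meets, while the always-correct policy runs until the step limit.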

# 2. Training within the Custom Environment

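We train the algorithm on multiple environments at once using the vectorized environments from the stable-baselines library. Note that DummyVecEnv steps its environments one after another in the current process rather than in separate processes.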

In [60]:
n_envs = 16  # hyperparameter
env = DummyVecEnv([lambda: DRL_IDS_Env(train_data)] * n_envs)
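
`[factory] * n_envs` repeats a reference to the same factory function, but `DummyVecEnv` calls each entry, so every call constructs a separate environment object. The sketch below shows this with a stand-in class; `ToyEnv` is hypothetical and replaces `DRL_IDS_Env` so that the example runs without Gym.

```python
class ToyEnv:
    """Hypothetical stand-in for DRL_IDS_Env."""

    def __init__(self, data):
        self.data = data       # shared between environments, like train_data
        self.current_step = 0  # per-environment state


data = [0, 1, 0]
n_envs = 4
env_fns = [lambda: ToyEnv(data)] * n_envs

# The list holds n_envs references to one and the same factory...
same_factory = all(fn is env_fns[0] for fn in env_fns)

# ...but calling each entry builds an independent environment.
envs = [fn() for fn in env_fns]
distinct_envs = len({id(env) for env in envs}) == n_envs

envs[0].current_step = 7  # does not affect the other environments
print(same_factory, distinct_envs, envs[1].current_step)
```

All environments share the same `data` object, so only per-environment state such as the step counter differs between them.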

In [61]:
# Defining network architecture

# https://stable-baselines.readthedocs.io/en/master/guide/custom_policy.html#custom-policy
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(
            *args,
            **kwargs,
            net_arch=[128, 64, 32],
            act_fun=tf.nn.relu,  # use the ReLU (Rectified Linear Unit) activation function
            feature_extraction="mlp"  # multi-layer perceptron feature extractor
        )
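
To get a feel for the size of this network, the weights and biases of the three hidden layers can be counted by hand. The sketch assumes plain fully connected layers and a hypothetical input width of 40 features (the real width is `train_data.shape[1] - 1`); the policy and value output heads that stable-baselines adds on top are not counted.

```python
def dense_param_count(n_inputs, hidden_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for width in hidden_sizes:
        total += fan_in * width + width  # weight matrix + bias vector
        fan_in = width
    return total


n_features = 40  # assumption for illustration only
print(dense_param_count(n_features, [128, 64, 32]))
```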

In [62]:
# TODO: ADJUST THESE HYPERPARAMETERS

'''
The Proximal Policy Optimization algorithm combines ideas from A2C (having multiple workers) and TRPO
(it uses a trust region to improve the actor).

The main idea is that after an update, the new policy should not be too far from the old policy. To achieve this, PPO
uses clipping to avoid overly large updates.
'''
model = PPO2(
    CustomPolicy,
    env,
    gamma=0.9,
    n_steps=512,
    ent_coef=1e-05,
    learning_rate=lambda progress: progress * 0.0021,  # progress decreases from 1 to 0 -> lr decreases from 0.0021 to 0
    vf_coef=0.6,
    max_grad_norm=0.8,
    lam=0.8,
    nminibatches=16,
    noptepochs=55,
    cliprange=0.2,
    verbose=0,
    tensorboard_log="log_25",  # define the tensorboard log location
)
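
Some of the settings above are easier to read with numbers attached. The sketch below evaluates the linear learning-rate schedule at a few progress values, works out the batch sizes implied by `n_envs`, `n_steps`, and `nminibatches`, and shows the clipping of the probability ratio with `cliprange=0.2` for a single sample. The helper names are hypothetical, and the clipped term is the textbook PPO surrogate, not a copy of the stable-baselines code.

```python
def linear_lr(progress, initial_lr=0.0021):
    """Learning rate while progress falls from 1 (start) to 0 (end)."""
    return progress * initial_lr


def clipped_surrogate(ratio, advantage, cliprange=0.2):
    """PPO objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - cliprange, min(ratio, 1.0 + cliprange))
    return min(ratio * advantage, clipped_ratio * advantage)


n_envs, n_steps, nminibatches = 16, 512, 16
batch_size = n_envs * n_steps                # samples collected per update
minibatch_size = batch_size // nminibatches  # samples per gradient step

for progress in (1.0, 0.5, 0.0):
    print(progress, linear_lr(progress))
print(batch_size, minibatch_size)

# A large ratio with a positive advantage is capped at (1 + cliprange) * A.
print(clipped_surrogate(1.5, 1.0))
# With a negative advantage, the more pessimistic (clipped) value is kept.
print(clipped_surrogate(0.5, -1.0))
```

With `noptepochs=55`, each collected batch is reused for 55 passes of minibatch updates before new experience is gathered.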

In [75]:
class AccF1Callback(BaseCallback):
    def __init__(self, train, val, eval_freq):
        super().__init__()
        self.train_data = train
        self.val_data = val
        self.eval_freq = eval_freq

    def _on_step(self):
        '''
        _on_step is called after every step of the vectorized environment;
        the metrics below are computed once every eval_freq calls
        '''

        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            super()._on_step()

            # calculating metrics to print for training data
            predicted = self.train_data.drop(columns=["label"]).swifter.apply(lambda x: self.model.predict(x, deterministic=True)[0], axis=1)
            accuracy = accuracy_score(self.train_data["label"], predicted)
            f1 = f1_score(self.train_data["label"], predicted)

            print("*-" * 50)
            
            print("total current timesteps: ", self.num_timesteps, '\n')
            print("Training --- Accuracy: ", accuracy)
            print("Training --- F1-Score: ", f1, '\n')

            # calculating metrics to print for validation data
            predicted = self.val_data.drop(columns=["label"]).swifter.apply(lambda x: self.model.predict(x, deterministic=True)[0], axis=1)
            accuracy = accuracy_score(self.val_data["label"], predicted)
            
            f1 = f1_score(self.val_data["label"], predicted)
            print("Validation --- Accuracy: ", accuracy)
            print("Validation --- F1-Score: ", f1)
            
            print("*-" * 50)

        return True
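
Accuracy alone can look high on imbalanced data, which is one reason to track F1 as well. The sketch below computes both by hand for binary labels, using the standard definitions for the positive class; the function name and the toy labels are made up for illustration.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 (positive class = 1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); returned as 0 here if the denominator is 0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy data: 8 benign samples and 2 attacks.
y_true = [0] * 8 + [1] * 2
never_alerts = [0] * 10             # predictor that never raises an alert
catches_one = [0] * 8 + [1, 0]      # predictor that catches one of the two attacks

print(accuracy_and_f1(y_true, never_alerts))
print(accuracy_and_f1(y_true, catches_one))
```

In this toy case, a predictor that never alerts still reaches 80% accuracy, while its F1 score is 0.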

In [76]:
val_data = pd.read_csv("processed_data/val.csv")
eval_callback = AccF1Callback(train_data, val_data, eval_freq=1000 // n_envs)
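
The callback counts calls, not timesteps: with a vectorized environment, each call to `_on_step` corresponds to one step in all `n_envs` environments, so `num_timesteps` grows by `n_envs` per call. The arithmetic below is a small sketch of why an interval of `1000 // n_envs` calls yields evaluations at multiples of 992 timesteps rather than 1000; the variable names other than `n_envs` and `eval_freq` are chosen for this example.

```python
n_envs = 16
eval_freq = 1000 // n_envs               # calls between evaluations
timesteps_per_eval = eval_freq * n_envs  # timesteps between evaluations

first_reports = [timesteps_per_eval * k for k in (1, 2, 3)]
print(eval_freq, timesteps_per_eval, first_reports)
```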

In [None]:
model.learn(5000000, callback=eval_callback)

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
total current timesteps:  992 

Training --- Accuracy:  0.9734716881050258
Training --- F1-Score:  0.9350720655175491 

Validation --- Accuracy:  0.9738518322061619
Validation --- F1-Score:  0.9361945389730508
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
total current timesteps:  1984 

Training --- Accuracy:  0.9734716881050258
Training --- F1-Score:  0.9350720655175491 

Validation --- Accuracy:  0.9738518322061619
Validation --- F1-Score:  0.9361945389730508
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
total current timesteps:  2976 

Training --- Accuracy:  0.973471688105025