## Network Traffic Dataset for Malicious Attack Detection

This dataset of network traffic flows was generated by CICFlowMeter and indicates whether each flow is a malicious attack (Bot) or not (Benign).
CICFlowMeter, a network traffic flow generator, produces 69 statistical features per flow, such as duration, number of packets, number of bytes, and packet length, calculated separately in the forward and reverse directions.
The application outputs a CSV file in which each flow is labeled as either Benign or Bot.
The dataset is organized per day. For each day, the raw data, including the network traffic (pcaps) and event logs (Windows and Ubuntu event logs) per machine, were recorded.
Download the dataset with the wget command below, which saves it as Network_Traffic.csv.

In [None]:
! wget -O Network_Traffic.csv https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed+Traffic+Data+for+ML+Algorithms/Friday-02-03-2018_TrafficForML_CICFlowMeter.csv

## Install Libraries

In [None]:
! pip install pandas --user
! pip install imbalanced-learn --user

## Restart Notebook Kernel

In [None]:
from IPython.display import display_html
display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)

## Import Libraries

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import tempfile
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder,MinMaxScaler
from sklearn.model_selection import KFold
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

### Declare Variables

In [2]:
lstZerodrp = ['Timestamp', 'BwdPSHFlags', 'FwdURGFlags', 'BwdURGFlags', 'CWEFlagCount', 'FwdBytsbAvg', 'FwdPktsbAvg',
              'FwdBlkRateAvg', 'BwdBytsbAvg',
              'BwdBlkRateAvg', 'BwdPktsbAvg']

lstScaledrp = ['FwdPSHFlags', 'FINFlagCnt', 'SYNFlagCnt', 'RSTFlagCnt', 'PSHFlagCnt', 'ACKFlagCnt', 'URGFlagCnt',
               'ECEFlagCnt']

DATA_FILE = "Network_Traffic.csv"

In [3]:
def read_dataFile():
    """
    Reads the data file in chunks, normalizes column names, drops the columns in lstZerodrp,
    and returns the resulting dataframe
    """
    chunksize = 100000
    chunk_list = []
    # Strings that should be parsed as missing values
    missing_values = ["n/a", "na", "--", "Infinity", "infinity", "Nan", "NaN"]

    # Read the CSV in chunks of 100000 rows, then concatenate them
    for chunk in pd.read_csv(DATA_FILE, chunksize=chunksize, na_values=missing_values):
        chunk_list.append(chunk)
    dataFrme = pd.concat(chunk_list)

    # Strip spaces and slashes from column names, e.g. 'Dst Port' -> 'DstPort'
    dataFrme.columns = [str(c).replace(' ', '').replace('/', '') for c in dataFrme.columns]
    dataFrme = dataFrme.drop(lstZerodrp, axis=1)
    return dataFrme

## Network Traffic Input Dataset 

### Attribute Information
    Features extracted from the captured traffic using CICFlowMeter-V3 = 69
    After removal of noise/unwarranted features, number of feature columns chosen: 10
    Features: FlowDuration,BwdPktLenMax,FlowIATStd,FwdPSHFlags,BwdPktLenMean,FlowIATMean,BwdIATMean,
              FwdSegSizeMin,InitBwdWinByts,BwdPktLenMin
    Flows labelled: Bot or Benign

In [4]:
read_dataFile().head()

Unnamed: 0,DstPort,Protocol,FlowDuration,TotFwdPkts,TotBwdPkts,TotLenFwdPkts,TotLenBwdPkts,FwdPktLenMax,FwdPktLenMin,FwdPktLenMean,...,FwdSegSizeMin,ActiveMean,ActiveStd,ActiveMax,ActiveMin,IdleMean,IdleStd,IdleMax,IdleMin,Label
0,443,6,141385,9,7,553,3773.0,202,0,61.444444,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
1,49684,6,281,2,1,38,0.0,38,0,19.0,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
2,443,6,279824,11,15,1086,10527.0,385,0,98.727273,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
3,443,6,132,2,0,0,0.0,0,0,0.0,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
4,443,6,274016,9,13,1285,6141.0,517,0,142.777778,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign


In [5]:
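def preprocess_na(dataFrme):
    """
    Replace NA values with 0
    """
    # Columns that contain at least one NA value
    na_lst = dataFrme.columns[dataFrme.isna().any()].tolist()
    for j in na_lst:
        # Assign the result back; fillna(inplace=True) on a column selection
        # may act on a temporary copy in recent pandas versions
        dataFrme[j] = dataFrme[j].fillna(0)
    return dataFrme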
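
The missing-value handling above can be illustrated without pandas. The following is a minimal sketch, assuming the same missing-value tokens as `read_dataFile`; the helper names `parse_cell` and `fill_missing` and the sample column are hypothetical and are not part of the notebook.

```python
import math

# Tokens treated as missing, copied from missing_values in read_dataFile
MISSING_TOKENS = {"n/a", "na", "--", "Infinity", "infinity", "Nan", "NaN"}

def parse_cell(raw):
    """Return a float, or None when the raw CSV cell counts as missing."""
    if raw in MISSING_TOKENS:
        return None
    value = float(raw)
    # Guard against non-finite values that were spelled differently
    return None if math.isnan(value) or math.isinf(value) else value

def fill_missing(values, fill=0.0):
    """Replace every None with the fill value, as preprocess_na does with 0."""
    return [fill if v is None else v for v in values]

raw_column = ["12.5", "Infinity", "NaN", "3"]  # hypothetical sample cells
parsed = [parse_cell(cell) for cell in raw_column]
cleaned = fill_missing(parsed)
print(cleaned)
```

Filling with 0 keeps every row, at the cost of treating an infinite rate the same as a zero rate; whether that is acceptable depends on the feature.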

In [6]:
def create_features_label(dataFrme):
    """
    Split the dataframe into independent features (X) and the dependent label (Y)
    """
    # Store the variable we are predicting
    target = "Label"
    # Every column except the target is a feature
    columns = [c for c in dataFrme.columns.tolist() if c != target]
    X = dataFrme[columns]
    Y = dataFrme[target]
    return X, Y

In [7]:
def label_substitution(dataFrme):
    """
    Label substitution: 'Benign' as 0, 'Bot' as 1
    """
    dictLabel = {'Benign': 0, 'Bot': 1}
    dataFrme['Label'] = dataFrme['Label'].map(dictLabel)
    return dataFrme

In [8]:
def handle_class_imbalance(X,Y):
    """
    Handle class imbalance by randomly oversampling the minority class
    """
#    os_us = SMOTETomek(ratio=0.5)
#    X_res, y_res = os_us.fit_sample(X, Y)
    ros = RandomOverSampler(random_state=50)
    # fit_resample is the current name of the method formerly called fit_sample
    X_res, y_res = ros.fit_resample(X, Y)
    ibtrain_X = pd.DataFrame(X_res,columns=X.columns)
    ibtrain_y = pd.DataFrame(y_res,columns=['Label'])
    return ibtrain_X,ibtrain_y
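
To make the idea behind random oversampling concrete, here is a minimal standard-library sketch. It is a stand-in for illustration, not the imbalanced-learn implementation: the function `random_oversample` and the toy rows are hypothetical. It duplicates randomly chosen rows of each smaller class until every class matches the majority count.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=50):
    """Duplicate random minority rows until all classes match the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Candidate rows of this class to sample from (with replacement)
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical toy data: four Benign flows and one Bot flow
rows = [[0.1], [0.2], [0.3], [0.4], [9.0]]
labels = ["Benign", "Benign", "Benign", "Benign", "Bot"]
res_rows, res_labels = random_oversample(rows, labels)
print(Counter(res_labels))
```

Because the added rows are exact duplicates, oversampling before a train/test split can place copies of the same flow in both sets, which is worth keeping in mind when reading evaluation metrics.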

In [9]:
def correlation_features(ibtrain_X):
    """
    Feature Selection - Correlation Analysis: drop the later column of each pair with correlation >= 0.9
    """
    corr = ibtrain_X.corr()
    cor_columns = np.full((corr.shape[0],), True, dtype=bool)
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[0]):
            if corr.iloc[i, j] >= 0.9:
                if cor_columns[j]:
                    cor_columns[j] = False

    dfcorr_features = ibtrain_X[corr.columns[cor_columns]]
    return dfcorr_features
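
The pairwise filter above can be sketched in plain Python. This is an illustrative sketch only: `pearson` and `drop_correlated` are hypothetical helpers, and the column values are made up. Like the notebook code, it compares the signed correlation against the threshold, so strongly negative correlations are not dropped.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.9):
    """Keep the first column of each highly correlated pair, drop the later one."""
    names = list(columns)
    keep = {name: True for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if pearson(columns[first], columns[second]) >= threshold:
                keep[second] = False
    return [name for name in names if keep[name]]

# Hypothetical values; the column names are borrowed from the dataset header
columns = {
    "TotFwdPkts": [1, 2, 3, 4],
    "TotLenFwdPkts": [10, 20, 30, 41],
    "FlowDuration": [5, 1, 4, 2],
}
kept = drop_correlated(columns)
print(kept)
```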

In [10]:
def top_ten_features(dfcorr_features,ibtrain_X,ibtrain_y):
    """
    Feature Selection - SelectKBest: return the 10 best features ranked by ANOVA F-score
    """
    feat_X = dfcorr_features
    feat_y = ibtrain_y['Label']
    
    #apply SelectKBest class to extract top 10 best features
    bestfeatures = SelectKBest(score_func=f_classif, k=10)
    fit = bestfeatures.fit(feat_X,feat_y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(feat_X.columns)
    #concat two dataframes for better visualization 
    featureScores = pd.concat([dfcolumns,dfscores],axis=1)
    featureScores.columns = ['Features','Score']  #naming the dataframe columns
    final_feature = featureScores.nlargest(10,'Score')['Features'].tolist()
    final_feature.sort()
    sort_fn = final_feature
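    # ibtrain_y still holds the string labels (Y was extracted before label_substitution ran),
    # so map them to 0/1 here
    dictLabel1 = {'Benign':0,'Bot':1}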
    ibtrain_y['Label']= ibtrain_y['Label'].map(dictLabel1)
    selected_X = ibtrain_X[sort_fn]
    selected_Y = ibtrain_y['Label']
    return selected_X,selected_Y,sort_fn

In [11]:
def normalize_data(selected_X, selected_Y):
    """
    Scale features to the range [0, 1], then split into train (75%) and test (25%) sets
    """
    scaler = MinMaxScaler(feature_range=(0, 1))
    selected_X = pd.DataFrame(scaler.fit_transform(selected_X), columns=selected_X.columns, index=selected_X.index)
    trainX, testX, trainY, testY = train_test_split(selected_X, selected_Y, test_size=0.25)
    print('-----------------------------------------------------------------')
    print("## Final features and Data pre-process for prediction")
    print('-----------------------------------------------------------------')
    print(testX)
    return trainX, testX, trainY, testY
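
For reference, min-max scaling to [0, 1] maps each value x to (x - min) / (max - min) per column. The sketch below applies that formula to the FlowDuration values of the five preview rows shown earlier. The helper names are hypothetical, and since the notebook fits the scaler on the full resampled data, the actual scaled values in the notebook differ from these.

```python
def fit_min_max(column):
    """Return the (min, max) pair that defines the scaling for one column."""
    return min(column), max(column)

def apply_min_max(column, lo, hi):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]
    return [(value - lo) / span for value in column]

# FlowDuration values from the five preview rows of the dataset
flow_duration = [141385, 281, 279824, 132, 274016]
lo, hi = fit_min_max(flow_duration)
scaled = apply_min_max(flow_duration, lo, hi)
print(scaled)
```

Keeping the fit and apply steps separate makes it possible to fit on the training rows only and reuse the same `lo` and `hi` for test rows or serving requests, which is one way to keep test data from influencing the scaling.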

In [12]:
tf.logging.set_verbosity(tf.logging.INFO)
    
# Read the data file into a dataframe
dataFrme = read_dataFile()

# Replace NA values with 0
dataFrme = preprocess_na(dataFrme)

# Create independent and dependent features
X, Y = create_features_label(dataFrme)

# Label substitution: 'Benign' as 0, 'Bot' as 1
dataFrme = label_substitution(dataFrme)

# Handle class imbalance
ibtrain_X, ibtrain_y = handle_class_imbalance(X, Y)

# Feature selection - correlation analysis
dfcorr_features = correlation_features(ibtrain_X)

# Feature selection - SelectKBest: return the best 10 features
selected_X, selected_Y, final_feature = top_ten_features(dfcorr_features, ibtrain_X, ibtrain_y)

# Normalize data and split into train and test sets
trainX, testX, trainY, testY = normalize_data(selected_X, selected_Y)

-----------------------------------------------------------------
## Final features and Data pre-process for prediction
-----------------------------------------------------------------
         BwdPktLenMax  BwdPktLenMean  BwdPktLenMin  FlowDuration  FlowIATMax  \
1438410      0.000000       0.000000      0.000000      0.000004    0.000004   
1518446      0.000000       0.000000      0.000000      0.000004    0.000004   
1150618      0.000000       0.000000      0.000000      0.000004    0.000004   
173401       0.000000       0.000000      0.000000      0.000004    0.000004   
844752       0.082192       0.082213      0.083916      0.000424    0.000263   
...               ...            ...           ...           ...         ...   
1049367      0.000000       0.000000      0.000000      0.000005    0.000005   
1167751      0.076712       0.022095      0.000000      0.000090    0.000081   
123788       0.039726       0.039736      0.040559      0.000003    0.000003   
206238       0

## Definition of Serving Input Receiver Function

In [13]:
def make_feature_cols():
  input_columns = [tf.feature_column.numeric_column(k,dtype=tf.dtypes.float64) for k in final_feature]
  return input_columns
feature_columns =  make_feature_cols()
inputs = {}
for feat in feature_columns:
  inputs[feat.name] = tf.placeholder(shape=[None], dtype=feat.dtype)
serving_input_receiver_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(inputs)

## Train and Save Network Traffic Model

In [14]:
TF_DATA_DIR = os.getenv("TF_DATA_DIR", "/tmp/data/")
TF_MODEL_DIR = os.getenv("TF_MODEL_DIR", "network/")
TF_EXPORT_DIR = os.getenv("TF_EXPORT_DIR", "network/")

x1 = np.asarray(trainX[final_feature])
y1 = np.asarray(trainY)

x2 = np.asarray(testX[final_feature])
y2 = np.asarray(testY)

In [15]:
def formatFeatures(features):
    formattedFeatures = {}
    numColumns = features.shape[1]

    for i in range(0, numColumns):
        formattedFeatures[final_feature[i]] = features[:, i]

    return formattedFeatures
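
`formatFeatures` turns a row-major array into the column-keyed dictionary that the estimator input functions expect. A minimal list-based sketch of the same transformation follows; `format_features` and the sample rows are hypothetical, and plain lists stand in for the NumPy arrays used in the notebook.

```python
def format_features(rows, names):
    """Map each feature name to the list of that feature's values across all rows."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

# Two hypothetical preprocessed rows with three of the selected features
names = ["BwdPktLenMax", "BwdPktLenMean", "FlowDuration"]
rows = [
    [0.082192, 0.082213, 0.000424],
    [0.076712, 0.022095, 0.000090],
]
formatted = format_features(rows, names)
print(formatted["FlowDuration"])
```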

In [17]:
trainingFeatures = formatFeatures(x1)
trainingCategories = y1

testFeatures = formatFeatures(x2)
testCategories = y2

# Train Input Function
def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((trainingFeatures, y1))
    dataset = dataset.batch(32).repeat(1000)
    return dataset

# Test Input Function
def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((testFeatures, y2))
    return dataset.batch(32).repeat(1000)


# Distribution strategy that determines which devices are used to train the model

distribution=tf.distribute.experimental.ParameterServerStrategy()
print('Number of devices: {}'.format(distribution.num_replicas_in_sync))

# Configuration of the training run

config = tf.estimator.RunConfig(train_distribute=distribution, model_dir=TF_MODEL_DIR, save_summary_steps=100, save_checkpoints_steps=1000)

# Build a DNN classifier with 3 hidden layers

model = tf.estimator.DNNClassifier(hidden_units=[13,65,110],
                                   feature_columns=feature_columns,
                                   model_dir=TF_MODEL_DIR,
                                   n_classes=2, config=config
                                   )


export_final = tf.estimator.FinalExporter(TF_EXPORT_DIR, serving_input_receiver_fn=serving_input_receiver_fn)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn,
                                    max_steps=1000)

eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,
                                  steps=100,
                                  exporters=export_final,
                                  throttle_secs=1,
                                  start_delay_secs=1)

result = tf.estimator.train_and_evaluate(model, train_spec, eval_spec)
print(result)

print('Training finished successfully')

INFO:tensorflow:ParameterServerStrategy with compute_devices = ('/device:GPU:0', '/device:GPU:1'), variable_device = '/device:CPU:0'
Number of devices: 2
INFO:tensorflow:Initializing RunConfig with distribution strategies.
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Using config: {'_model_dir': 'network/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.python.distribute.parameter_server_strategy.ParameterServerStrategyV1 object at 0x7f561c486358>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': Non

## Update storageUri in network_kfserving.yaml with the PVC name

In [None]:
pvcname = !(echo  $HOSTNAME | sed 's/.\{2\}$//')
pvc = "workspace-"+pvcname[0]
! sed -i "s/nfs/$pvc/g" network_kfserving.yaml
! cat network_kfserving.yaml

## Serve the Network Traffic Model using Kubeflow KFServing

In [None]:
!kubectl apply -f network_kfserving.yaml -n anonymous

In [None]:
!kubectl get inferenceservices -n anonymous

#### Note:
Wait until the inference service reports READY="True".

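## Predict data from serving after setting INGRESS_IP
### Note - Replace the IP address and port in the curl command with those of your ingress, and use one of the preprocessed rows printed in the "Final features and Data pre-process for prediction" output cell as input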

In [None]:
! curl -v -H "Host: network-model.anonymous.example.com" http://10.23.222.166:31380/v1/models/network-model:predict -d '{"signature_name":"predict","instances":[{"BwdPktLenMax":[0.158904] , "BwdPktLenMean":[0.039736] , "BwdPktLenMin":[0.00000], "FlowDuration":[0.053778] , "FlowIATMax":[0.053262] , "FwdPktLenMin":[0.0] , "FwdSegSizeMin":[0.454545] , "InitBwdWinByts":[1.0] , "Protocol":[0.0] , "RSTFlagCnt":[0.003357]}]}'
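
The request body in the curl command can also be generated from a preprocessed row instead of being typed by hand. The sketch below builds the same JSON shape (a `signature_name` plus one instance whose values are single-element lists). The helper `build_predict_payload` is hypothetical; the feature values are the ones used in the curl command above.

```python
import json

def build_predict_payload(row, signature_name="predict"):
    """Wrap each feature value in a list, matching the request body used with curl."""
    instance = {name: [float(value)] for name, value in row.items()}
    return json.dumps({"signature_name": signature_name, "instances": [instance]})

# Feature values taken from the curl example
row = {
    "BwdPktLenMax": 0.158904,
    "BwdPktLenMean": 0.039736,
    "BwdPktLenMin": 0.0,
    "FlowDuration": 0.053778,
    "FlowIATMax": 0.053262,
    "FwdPktLenMin": 0.0,
    "FwdSegSizeMin": 0.454545,
    "InitBwdWinByts": 1.0,
    "Protocol": 0.0,
    "RSTFlagCnt": 0.003357,
}
payload = build_predict_payload(row)
print(payload)
```

The printed string can be passed to curl with `-d`, which avoids quoting mistakes when the feature list changes.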

## Delete the KFServing model and clean up stored models

In [None]:
!kubectl delete -f network_kfserving.yaml -n anonymous
!rm -rf /mnt/network
pvcname = !(echo  $HOSTNAME | sed 's/.\{2\}$//')
pvc = "workspace-"+pvcname[0]
! sed -i "s/$pvc/nfs/g" network_kfserving.yaml
! cat network_kfserving.yaml