# Web attack detection using CICIDS2017 dataset

Training data: "Intrusion Detection Evaluation Dataset" (CICIDS2017). Description page: https://www.unb.ca/cic/datasets/ids-2017.html

The data set is public. Download link: http://205.174.165.80/CICDataset/CIC-IDS-2017/Dataset/

CICIDS2017 consists of 8 files recorded on different observation days (PCAP + CSV). Archive used: http://205.174.165.80/CICDataset/CIC-IDS-2017/Dataset/GeneratedLabelledFlows.zip

From the downloaded archive GeneratedLabelledFlows.zip, the Thursday file Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv is selected.

Sources:

* [Sharafaldin2018] Iman Sharafaldin, Arash Habibi Lashkari and Ali A. Ghorbani. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. 2018
* [Kostas2018] Kahraman Kostas. Anomaly Detection in Networks Using Machine Learning. 2018 (an error was found in its assessment of feature importance)
* https://github.com/bozbil/Anomaly-Detection-in-Networks-Using-Machine-Learning (an error was found in its assessment of feature importance)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
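import os

# Load every CSV in csvs/ and stack them into one DataFrame.
# Each file keeps its own row index, so index values repeat across files.
dfs = [pd.read_csv(os.path.join('csvs', file)) for file in os.listdir('csvs/')]
df = pd.concat(dfs)
df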

Unnamed: 0,Average Packet Size,Flow Bytes/s,Fwd Packet Length Mean,Max Packet Length,Fwd IAT Min,Total Length of Fwd Packets,Flow IAT Mean,Fwd IAT Std,Fwd Packet Length Max,Fwd Header Length,Label
0,158.3,270.947,526,144.3,0.004,1443,1.65100,0.64916,512,536,1
1,181.3,118.621,666,167.3,0.004,1673,2.66759,2.02854,652,536,1
2,179.8,37.896,665,165.8,0.049,1658,8.98554,4.74464,651,536,1
3,174.9,68.606,603,160.9,0.004,1609,3.42530,2.54967,589,536,1
4,126.8,100.780,666,112.8,0.003,1128,2.12797,1.25877,652,528,1
...,...,...,...,...,...,...,...,...,...,...,...
313,982.9,3502.007,1935,968.9,4.035,9689,0.31952,0.30548,1921,544,1
314,1158.7,3638.156,1934,1144.7,3.942,11447,0.33230,0.40012,1920,536,1
315,1158.7,3470.817,1934,1144.7,4.145,11447,0.36742,0.40015,1920,536,1
316,1158.7,3319.433,1933,1144.7,4.308,11447,0.33678,0.40009,1919,536,1


In [3]:
# df.to_csv('combined.csv', index=False)

In [4]:
# df=pd.read_csv('combined.csv')
# df

In [5]:
df['Label'].unique()

array([1, 0])

In [6]:
df['Label'].value_counts()

Label
0    12500
1     2506
Name: count, dtype: int64
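
The two labels are far from balanced (12500 rows against 2506), so it helps to know what accuracy a trivial model would reach. The sketch below uses only the counts printed above and computes the accuracy of always predicting the more frequent label. The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
label_counts = {0: 12500, 1: 2506}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of a trivial classifier that always predicts the majority label.
baseline_accuracy = label_counts[majority_label] / total

print(f"total rows: {total}")
print(f"majority label: {majority_label}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy reported for a trained model is most informative when read against this baseline of roughly 0.83.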

In [7]:
y = df['Label'].values
X = df.drop(columns=['Label'])
print(X.shape, y.shape)

(15006, 10) (15006,)


In [8]:
from sklearn.model_selection import train_test_split

# 70% train; the remaining 30% is split evenly into test and validation.
# stratify keeps the class ratio the same in every subset.
X_train, X_other, y_train, y_other = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y)
X_test, X_val, y_test, y_val = train_test_split(
    X_other, y_other, test_size=0.5, random_state=42, stratify=y_other)

In [9]:
X_train.shape

(10504, 10)

In [10]:
y_train.shape

(10504,)

In [11]:
unique, counts = np.unique(y_train, return_counts=True)
dict(zip(unique, counts))

{0: 8750, 1: 1754}
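
Because label 1 makes up only about one sixth of the training rows, one common option is to weight the classes inversely to their frequency. The following is a minimal sketch of the usual "balanced" heuristic, computed from the counts printed above. The notebook itself does not use class weights, so this is an assumption about a possible extension, and the names `train_counts` and `class_weight` are illustrative.

```python
# Training-set class counts copied from the np.unique output above.
train_counts = {0: 8750, 1: 1754}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
# so the rarer class receives a proportionally larger weight.
class_weight = {label: n_samples / (n_classes * count)
                for label, count in train_counts.items()}

for label, weight in class_weight.items():
    print(f"label {label}: weight {weight:.4f}")
```

A dictionary of this shape can be passed to Keras through the `class_weight` argument of `model.fit`. Whether it improves results on this data is not something the notebook tests.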

In [12]:
X_val.shape

(2251, 10)

In [13]:
y_val.shape

(2251,)

In [14]:
X_test.shape

(2251, 10)

In [15]:
y_test.shape

(2251,)

In [16]:
X_other.shape

(4502, 10)

In [17]:
y_other.shape

(4502,)
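
The shapes above can be cross-checked with a little arithmetic. This sketch uses only the numbers already printed in the notebook and confirms that the three subsets cover every row and that stratification preserved the share of label 1 in the training set. All names are illustrative.

```python
# Row counts copied from the shape outputs above.
total_rows = 15006
split_sizes = {"train": 10504, "val": 2251, "test": 2251}

# The three subsets should account for every row exactly once.
assert sum(split_sizes.values()) == total_rows

split_fractions = {name: size / total_rows
                   for name, size in split_sizes.items()}

# Stratification should keep the share of label 1 nearly identical.
overall_positive_share = 2506 / total_rows
train_positive_share = 1754 / split_sizes["train"]

for name, fraction in split_fractions.items():
    print(f"{name}: {fraction:.3f}")
print(f"label-1 share overall:  {overall_positive_share:.4f}")
print(f"label-1 share in train: {train_positive_share:.4f}")
```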

In [18]:
import tensorflow as tf

In [21]:
class CustomClassifier(tf.keras.Model):
    """Small feed-forward binary classifier: 64 -> 32 -> 1 units."""

    def __init__(self):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(64, activation=tf.nn.relu)
        self.dense2 = tf.keras.layers.Dense(32, activation=tf.nn.relu)
        # Sigmoid output: estimated probability of label 1.
        self.dense3 = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.dense3(x)

model = CustomClassifier()
learning_rate = 1e-3
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
              loss=tf.keras.losses.binary_crossentropy,
              metrics=[tf.keras.metrics.binary_accuracy])

# Train for 10 epochs, monitoring the validation split after each epoch.
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))

# Final check on the held-out test split.
test_loss, test_acc = model.evaluate(X_test, y_test)
print('\nTest accuracy:', test_acc)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.9609062671661377
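
Accuracy alone can hide errors on the rarer label, since label 0 accounts for 12500 of the 15006 rows. The sketch below shows one way to derive a confusion matrix, precision, recall and F1 from predicted probabilities. The helper `binary_report` and the small example arrays are hypothetical and serve only to illustrate the calculation; they are not results from the model above.

```python
def binary_report(y_true, y_prob, threshold=0.5):
    """Confusion counts and derived metrics for a binary classifier.

    y_true: iterable of 0/1 labels.
    y_prob: iterable of predicted probabilities for label 1.
    """
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1

    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels and probabilities, for illustration only.
example_true = [0, 0, 0, 0, 1, 1, 1, 0]
example_prob = [0.1, 0.2, 0.6, 0.3, 0.9, 0.4, 0.8, 0.05]
report = binary_report(example_true, example_prob)
print(report)
```

For the model in this notebook, the same helper could be called as `binary_report(y_test, model.predict(X_test).ravel())`, assuming the sigmoid output is interpreted as the probability of label 1.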
