# Web attack detection using the CICIDS2017 dataset

Training data: "Intrusion Detection Evaluation Dataset" (CICIDS2017). Description page: https://www.unb.ca/cic/datasets/ids-2017.html

The data set is public. Download link: http://205.174.165.80/CICDataset/CIC-IDS-2017/Dataset/

CICIDS2017 combines 8 files recorded on different days of observation (PCAP + CSV). Archive used: http://205.174.165.80/CICDataset/CIC-IDS-2017/Dataset/GeneratedLabelledFlows.zip

From the downloaded archive GeneratedLabelledFlows.zip, the Thursday file Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv is used.
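The cells below read already preprocessed files from `csvs/` with a 0/1 `Label` column; the preprocessing itself is not shown. As a minimal sketch of one plausible step, assuming the raw CSV marks normal traffic with the label `BENIGN` and that header names may carry surrounding whitespace, the labels could be binarized as follows. The helper name `binarize_labels` and the tiny sample frame are hypothetical.

```python
import pandas as pd

def binarize_labels(raw: pd.DataFrame, benign_value: str = "BENIGN") -> pd.DataFrame:
    """Return a copy with stripped column names and a 0/1 'Label' column."""
    out = raw.copy()
    # Assumption: raw header names may contain leading or trailing spaces.
    out.columns = [str(c).strip() for c in out.columns]
    # 0 = benign flow, 1 = any attack label.
    out["Label"] = (out["Label"].astype(str).str.strip() != benign_value).astype(int)
    return out

# Hypothetical miniature stand-in for the Thursday web-attack CSV.
sample = pd.DataFrame({
    " Flow Bytes/s": [270.9, 4089.4, 118.6],
    " Label": ["BENIGN", "Web Attack - XSS", "BENIGN"],
})
clean = binarize_labels(sample)
print(clean["Label"].tolist())   # [0, 1, 0]
```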

Sources:

* [Sharafaldin2018] Iman Sharafaldin, Arash Habibi Lashkari and Ali A. Ghorbani. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. 2018
* [Kostas2018] Kahraman Kostas. Anomaly Detection in Networks Using Machine Learning. 2018 (an error was found in the feature importance assessment)
* https://github.com/bozbil/Anomaly-Detection-in-Networks-Using-Machine-Learning (an error was found in the feature importance assessment)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import os
dfs = [pd.read_csv('csvs/'+file) for file in os.listdir('csvs/')]
df = pd.concat(dfs)
df

Unnamed: 0,Average Packet Size,Flow Bytes/s,Fwd Packet Length Mean,Max Packet Length,Fwd IAT Min,Total Length of Fwd Packets,Flow IAT Mean,Fwd IAT Std,Fwd Packet Length Max,Fwd Header Length,Label
0,158.3,270.947,526,144.3,0.004,1443,1.65100,0.64916,512,536,1
1,181.3,118.621,666,167.3,0.004,1673,2.66759,2.02854,652,536,1
2,179.8,37.896,665,165.8,0.049,1658,8.98554,4.74464,651,536,1
3,174.9,68.606,603,160.9,0.004,1609,3.42530,2.54967,589,536,1
4,126.8,100.780,666,112.8,0.003,1128,2.12797,1.25877,652,528,1
...,...,...,...,...,...,...,...,...,...,...,...
313,982.9,3502.007,1935,968.9,4.035,9689,0.31952,0.30548,1921,544,1
314,1158.7,3638.156,1934,1144.7,3.942,11447,0.33230,0.40012,1920,536,1
315,1158.7,3470.817,1934,1144.7,4.145,11447,0.36742,0.40015,1920,536,1
316,1158.7,3319.433,1933,1144.7,4.308,11447,0.33678,0.40009,1919,536,1


In [3]:
# df.to_csv('combined.csv', index=False)

In [4]:
# df=pd.read_csv('combined.csv')
# df

In [5]:
df['Label'].unique()

array([1, 0])

In [6]:
df['Label'].value_counts()

Label
0    12500
1     2506
Name: count, dtype: int64

In [7]:
y = df['Label'].values
X = df.drop(columns=['Label'])
print(X.shape, y.shape)

(15006, 10) (15006,)


In [8]:
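from sklearn.model_selection import train_test_split
# 70% of the rows go to training; the remaining 30% is held out in X_other.
X_train, X_other, y_train, y_other = train_test_split(X, y,
                                                      train_size=0.7,
                                                      random_state=42,
                                                      stratify=y)
# The held-out part is split evenly into test and validation sets.
X_test, X_val, y_test, y_val = train_test_split(X_other, y_other,
                                                test_size=0.5,
                                                random_state=42,
                                                stratify=y_other)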

In [9]:
X_train.shape

(10504, 10)

In [10]:
y_train.shape

(10504,)

In [11]:
unique, counts = np.unique(y_train, return_counts=True)
dict(zip(unique, counts))

{0: 8750, 1: 1754}

In [12]:
X_val.shape

(2251, 10)

In [13]:
y_val.shape

(2251,)

In [14]:
X_test.shape

(2251, 10)

In [15]:
y_test.shape

(2251,)

In [16]:
X_other.shape

(4502, 10)

In [17]:
y_other.shape

(4502,)
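The shapes printed above can be reproduced by hand. The sketch below assumes that `train_test_split` rounds a fractional `train_size` down and a fractional `test_size` up, which matches the sizes shown in this notebook; the variable names are illustrative.

```python
import math

n_total, n_pos = 15006, 2506          # from df['Label'].value_counts()

# First split: train_size=0.7, train count rounded down, remainder held out.
n_train = math.floor(0.7 * n_total)
n_other = n_total - n_train

# Second split: test_size=0.5 of the held-out part, rounded up.
n_val = math.ceil(0.5 * n_other)
n_test = n_other - n_val

print(n_train, n_other, n_test, n_val)   # 10504 4502 2251 2251

# Stratification keeps the attack share roughly constant across splits.
print(round(n_pos / n_total, 3))         # 0.167
print(round(1754 / 10504, 3))            # 0.167
```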

In [18]:
import tensorflow as tf

In [19]:
# Small fully connected binary classifier over the 10 flow features.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(10,)),
    tf.keras.layers.Dense(32, activation=tf.nn.relu),
    tf.keras.layers.Dense(1, activation=tf.nn.sigmoid),  # attack probability
])
learning_rate = 1e-3
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
              loss=tf.keras.losses.binary_crossentropy,
              metrics=[tf.keras.metrics.binary_accuracy])

# Train while monitoring the validation split, then score the held-out test split.
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))
test_loss, test_acc = model.evaluate(X_test, y_test)
print('\nTest accuracy:', test_acc)

2024-04-14 10:16:43.155428: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

Test accuracy: 0.9640160202980042
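Because the classes are imbalanced (12500 benign against 2506 attack flows), accuracy alone is hard to interpret: a model that always answers 0 would already score about 0.833. The sketch below computes that baseline and shows one way to derive confusion counts, precision, and recall for the attack class. The helper `binary_report` and the toy inputs are hypothetical and are not part of the original notebook.

```python
import numpy as np

def binary_report(y_true, y_prob, threshold=0.5):
    """Confusion counts plus precision and recall for the positive (attack) class."""
    y_true = np.asarray(y_true).ravel().astype(int)
    y_pred = (np.asarray(y_prob).ravel() >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}

# Majority-class baseline from the label counts shown earlier.
baseline_accuracy = 12500 / (12500 + 2506)
print(round(baseline_accuracy, 3))   # 0.833

# Hypothetical toy labels and scores, only to show the call.
report = binary_report([1, 0, 1, 0, 0], [0.9, 0.2, 0.1, 0.7, 0.0])
print(report)
```

With the objects from this notebook, the call would presumably be `binary_report(y_test, model.predict(X_test))`.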


In [20]:
file_name = 'my_model.h5'
model.save(file_name)
loaded_model = tf.keras.models.load_model(file_name)

In [21]:
loaded_model.predict(X_other)



array([[1.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]], dtype=float32)

In [22]:
X_other

Unnamed: 0,Average Packet Size,Flow Bytes/s,Fwd Packet Length Mean,Max Packet Length,Fwd IAT Min,Total Length of Fwd Packets,Flow IAT Mean,Fwd IAT Std,Fwd Packet Length Max,Fwd Header Length
303,1158.7,4.089442e+03,1933,1144.7,4.063,11447,0.31266,0.38543,1919,536
2679,1806.0,2.621571e+06,2962,1792.0,0.000,17920,0.00058,0.00069,2948,544
6690,195.9,4.424119e+04,809,181.9,0.000,1819,0.00907,0.00599,795,528
2976,680.9,6.047069e+06,1514,666.9,0.000,6669,0.00022,0.00011,1500,544
1609,296.4,3.459788e+05,1482,282.4,-0.034,2824,0.00131,0.00086,1468,520
...,...,...,...,...,...,...,...,...,...,...
1056,178.8,5.132695e+03,451,164.8,0.058,1648,0.06642,0.03530,437,520
6223,345.8,1.015208e+05,2862,331.8,0.005,3318,0.00696,0.00341,2848,520
392,1200.0,1.932367e+07,2898,1186.0,0.000,11860,0.00014,0.00010,2884,532
161,755.8,5.495928e+05,2914,741.8,0.000,7418,0.00255,0.00151,2900,532


In [23]:
good = (454.8,11590.893,2866,440.8,-0.047,4408,0.09773,0.0436,2852,536)
bad = (158.3,270.947,526,144.3,0.004,1443,1.651,0.64916,512,536)
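Keras models predict on batches, so a single flow such as `good` has to be wrapped into a 2-D array of shape `(1, 10)` before it is passed to `predict`. A minimal sketch of that step follows; the helper `as_batch` and the constant `N_FEATURES` are hypothetical names introduced for illustration.

```python
import numpy as np

N_FEATURES = 10  # input width of the model defined above

def as_batch(sample, n_features=N_FEATURES):
    """Turn one flat feature sequence into a float32 array of shape (1, n_features)."""
    arr = np.asarray(sample, dtype=np.float32)
    if arr.shape != (n_features,):
        raise ValueError(f'expected {n_features} values, got shape {arr.shape}')
    return arr.reshape(1, n_features)

good_example = (454.8, 11590.893, 2866, 440.8, -0.047, 4408, 0.09773, 0.0436, 2852, 536)
batch = as_batch(good_example)
print(batch.shape)   # (1, 10)
```

Checking the length up front turns a malformed sample into an explicit `ValueError` instead of a shape error raised inside the model.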

In [24]:
def get_prediction(single):
    # Keras expects a batch, so wrap the 10 feature values into shape (1, 10).
    batch = np.array([single], dtype=np.float32)
    prediction = loaded_model.predict(batch)
    print(f'pred={prediction}')
    threshold = 0.5
    if prediction[0, 0] >= threshold:
        print("Warning: malicious activity detected!")
    else:
        print("No warning: no malicious activity detected.")

In [25]:
get_prediction(good)
get_prediction(bad)

pred=[[0.]]
No warning: no malicious activity detected.
pred=[[1.442228e-13]]
No warning: no malicious activity detected.


In [56]:
def get_single_prediction(single):
    # Wrap the single sample into a batch of shape (1, 10) before predicting.
    prediction = loaded_model.predict(np.array([single], dtype=np.float32))
    threshold = 0.5
    return prediction >= threshold


def update_model(X, y, epochs=10, batch_size=32):
    """
    Update the pretrained model (the global 'loaded_model') in real time.
    :param X: network characteristics of a single sample (10 values)
    :param y: label of the sample as a 1-tuple, e.g. (0,)
    :param epochs: number of epochs
    :param batch_size: batch size
    :return: None; 'loaded_model' is updated in place
    """
    global loaded_model
    if loaded_model is None:
        # Reload the model saved earlier under 'file_name'.
        loaded_model = tf.keras.models.load_model(file_name)
    # Note: training runs only when the current prediction agrees with the label.
    if get_single_prediction(X) == y:
        learning_rate = 1e-3
        loaded_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                             loss=tf.keras.losses.binary_crossentropy,
                             metrics=[tf.keras.metrics.binary_accuracy])
        loaded_model.fit(np.array([X], dtype=np.float32),
                         np.array(y, dtype=np.float32),
                         epochs=epochs, batch_size=batch_size)
#         save_loaded_model()

In [47]:
X_other[0:1]

Unnamed: 0,Average Packet Size,Flow Bytes/s,Fwd Packet Length Mean,Max Packet Length,Fwd IAT Min,Total Length of Fwd Packets,Flow IAT Mean,Fwd IAT Std,Fwd Packet Length Max,Fwd Header Length
303,1158.7,4089.442,1933,1144.7,4.063,11447,0.31266,0.38543,1919,536


In [52]:
good = (454.8,11590.893,2866,440.8,-0.047,4408,0.09773,0.0436,2852,536)
good

(454.8, 11590.893, 2866, 440.8, -0.047, 4408, 0.09773, 0.0436, 2852, 536)

In [48]:
y_other[0:1]

array([1])

In [36]:
loaded_model.fit(X_other[0:1], y_other[0:1], epochs=2, batch_size=32)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x77614b45a290>

In [57]:
update_model(good, (0,))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [26]:
def get_single_dataset():
    # Xy.txt holds one comma-separated line: the feature values followed by the label.
    with open('Xy.txt', 'r') as f:
        fields = f.readline().strip().split(',')
    X = tuple(float(e) for e in fields[:-1])
    y = (int(fields[-1]),)
    return X, y

X, y = get_single_dataset()
print(X)

(186.356, 845.243, 2962.0, 172.356, -621169.97695, 499832.0, 11.55217, 0.00628, 2948.0, 150924.0)
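The `Xy.txt` format read above is a single comma-separated line holding the feature values followed by the label. The sketch below shows a variant of the parser that also checks the field count, exercised on a temporary file. The helper `parse_xy_line`, the constant `N_FEATURES`, and the sample line are hypothetical; the label value in the sample is illustrative only.

```python
import os
import tempfile

N_FEATURES = 10  # the model above takes 10 input features

def parse_xy_line(line):
    """Split 'f1,...,f10,label' into a float feature tuple and a 1-tuple label."""
    fields = line.strip().split(',')
    if len(fields) != N_FEATURES + 1:
        raise ValueError(f'expected {N_FEATURES + 1} fields, got {len(fields)}')
    features = tuple(float(v) for v in fields[:-1])
    label = (int(fields[-1]),)
    return features, label

# Hypothetical sample line written to a temporary Xy.txt and read back.
line = '454.8,11590.893,2866,440.8,-0.047,4408,0.09773,0.0436,2852,536,0\n'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Xy.txt')
    with open(path, 'w') as f:
        f.write(line)
    with open(path) as f:
        X_one, y_one = parse_xy_line(f.readline())

print(len(X_one), y_one)   # 10 (0,)
```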
