# InternetFirewallSupervisedLearning
## Author: Tijs van Lieshout

Predicting the Internet Firewall action based on log info. Supervised Learning Assignment for Master DSLS.

### Data availability:
- [Direct link](https://archive.ics.uci.edu/ml/machine-learning-databases/00542/log2.csv)
- [Archive link with some info](https://archive.ics.uci.edu/ml/datasets/Internet+Firewall+Data)
- [F. Ertam and M. Kaya, "Classification of firewall log files with multiclass support vector machine," 2018 6th International Symposium on Digital Forensic and Security (ISDFS), 2018, pp. 1-4, doi: 10.1109/ISDFS.2018.8355382.](https://doi.org/10.1109/ISDFS.2018.8355382)

Table 1. Features and Description. Adapted from Ertam & Kaya, 2018.

|        Feature       |                  Description                 |
|:--------------------:|:--------------------------------------------:|
| Source Port          | Client Source Port                           |
| Destination Port     | Client Destination Port                      |
| NAT Source Port      | Network Address Translation Source Port      |
| NAT Destination Port | Network Address Translation Destination Port |
| Elapsed Time (sec)   | Elapsed Time for flow                        |
| Bytes                | Total Bytes                                  |
| Bytes Sent           | Bytes Sent                                   |
| Bytes Received       | Bytes Received                               |
| Packets              | Total Packets                                |
| pkts_sent            | Packets Sent                                 |
| pkts_received        | Packets Received                             |
| Action               | Class (allow, deny, drop, reset-both)        |

Table 2. Classes to predict (Actions). Adapted from Ertam & Kaya, 2018.

|   Action   |                                                               Description                                                              |
|:----------:|:--------------------------------------------------------------------------------------------------------------------------------------:|
| Allow      | Allows the internet traffic.                                                                                                           |
| Deny       | Blocks traffic and enforces the default Deny Action defined for the application that is being denied.                                  |
| Drop       | Silently drops the traffic; for an application, it overrides the default deny action. A TCP reset is not sent to the host/application. |
| Reset-Both | Sends a TCP reset to both the client-side and server-side devices.                                                                     |



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv("log2.csv")
df

In [None]:
df.info()

In [None]:
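# Look the counts up by label instead of relying on the frequency ordering of value_counts()
counts = df['Action'].value_counts()
allow, deny, drop, reset_both = (counts[label] for label in ['allow', 'deny', 'drop', 'reset-both'])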
print('Number of allowed logs: ', allow)  
print('Number of denied logs: ', deny)
print('Number of dropped logs: ', drop)  
print('Number of reset-both logs: ', reset_both) 

print('\n')
print('% of allowed logs', round(allow / len(df) * 100, 1), '%')
print('% of denied logs', round(deny / len(df) * 100, 1), '%')
print('% of dropped logs', round(drop / len(df) * 100, 1), '%')
print('% of reset-both logs', round(reset_both / len(df) * 100, 1), '%')

In [None]:
df['Action'].value_counts().plot(kind='barh')

The classes to predict are quite unbalanced, unfortunately. I might run into problems with the reset-both class later on. SMOTE could be used to oversample it, but I won't attempt that here (it is outside the scope of this assignment).
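A lighter alternative to SMOTE is to reweight the classes, which scikit-learn estimators such as `LogisticRegression`, `DecisionTreeClassifier` and `SVC` support through `class_weight='balanced'`. The sketch below only shows how those weights are derived; the labels in `y_toy` are hypothetical and are not the firewall data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (hypothetical counts), not the firewall data: class 3 is rare
y_toy = np.array([0] * 60 + [1] * 25 + [2] * 13 + [3] * 2)
classes = np.unique(y_toy)

# 'balanced' weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for label, w in zip(classes, weights):
    print(f"class {label}: weight {w:.2f}")
```

The rare class receives the largest weight, so its misclassifications cost more during training.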

In [None]:
# missing data
df.isnull().sum() 
# no missing data, no imputation needed

In [None]:
description = df.groupby(['Action']).describe()

In [None]:
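# numeric_only=True skips the non-numeric 'Action' column (required by recent pandas versions)
c = df.corr(numeric_only=True).abs()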
sns.heatmap(c, cmap=sns.color_palette("Blues", as_cmap=True))

'Bytes Sent', 'Bytes Received', 'pkts_sent' and 'pkts_received' can be discarded as Bytes and Packets are the total of the two pairs respectively.
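The claim that the totals are exact sums can be checked directly before dropping columns. A minimal sketch, using the dataset's column names on a few hypothetical rows (the values in `toy` are made up for illustration):

```python
import pandas as pd

# Toy rows (hypothetical values) using the dataset's column names
toy = pd.DataFrame({
    "Bytes": [177, 4768, 238],
    "Bytes Sent": [94, 1600, 118],
    "Bytes Received": [83, 3168, 120],
    "Packets": [2, 19, 2],
    "pkts_sent": [1, 10, 1],
    "pkts_received": [1, 9, 1],
})

# Each total should equal the sum of its sent and received parts on every row
bytes_is_total = (toy["Bytes"] == toy["Bytes Sent"] + toy["Bytes Received"]).all()
packets_is_total = (toy["Packets"] == toy["pkts_sent"] + toy["pkts_received"]).all()
print(bytes_is_total, packets_is_total)
```

Running the same two comparisons on `df` would confirm whether the relation holds for every log entry.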

I am also going to discard Packets in favour of Bytes, as the two are highly correlated. I'll keep Bytes since it is more detailed than Packets (one packet consists of multiple bytes).

The port variables should not be treated as continuous, but their ranges are probably still interesting to look at.

In [None]:
sns.displot(df, x="Source Port", hue="Action")
plt.show()

In [None]:
description['Source Port']

All Actions of drop seem to happen in high source ports (minimum 49156). Reset-both Source Port minimum is 1024

In [None]:
sns.displot(df, x="Destination Port", hue="Action")
plt.show()

In [None]:
description['Destination Port']

Most actions seem to have a very low destination port. All drop actions are done on Destination Port 445

In [None]:
sns.displot(df, x="NAT Source Port", hue="Action")
plt.show()

In [None]:
description['NAT Source Port']

Allowed actions seem to be uniformly distributed over NAT Source Ports. All dropped NAT Source Ports are equal to 0. Most deny and reset-both actions have NAT Source Ports of 0.

In [None]:
sns.displot(df, x="NAT Destination Port", hue="Action")
plt.show()

In [None]:
description['NAT Destination Port']

Allowed actions seem to be distributed over all NAT Destination Ports, but mostly under 443. All dropped NAT Destination Ports are equal to 0. Most deny and reset-both actions have NAT Destination Ports of 0.

In [None]:
cols = ['Source Port', 
        'Destination Port', 
        'NAT Source Port', 
        'NAT Destination Port', 
        'Bytes', 
        'Elapsed Time (sec)']
df_features = df[cols].rename(columns={'Source Port':'source_port',
                                       'Destination Port':'destination_port', 
                                       'NAT Source Port':'nat_source_port', 
                                       'NAT Destination Port':'nat_destination_port',
                                       'Bytes':'bytes',
                                       'Elapsed Time (sec)':'elapsed_time'})

In [None]:
c = df_features.corr().abs()
sns.heatmap(c, cmap=sns.color_palette("Blues", as_cmap=True))

In [None]:
# Encode the four action labels as integers 0-3
y = np.array(df['Action'].map({'allow':0,'deny':1,'drop':2, 'reset-both':3}))
X = np.array(df_features)
print(y.shape)
print(X.shape)

Let's standardize the features (zero mean, unit variance) because their variances differ greatly.

In [None]:
from sklearn.preprocessing import StandardScaler

def normalize(X):
    # Standardize each feature to zero mean and unit variance
    scaler = StandardScaler()
    return scaler.fit_transform(X)

X = normalize(X)
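Scaling the full matrix before splitting lets the test rows influence the scaler's mean and variance. One common way to avoid that, sketched here on synthetic stand-in data (`X_demo`, `y_demo` and the model choice are assumptions for illustration), is to put the scaler in a pipeline so it is fitted on the training part only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in features with very different scales (not the firewall data)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)

# The scaler's mean and variance are learned from X_tr only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

scaler = model.named_steps["standardscaler"]
print(scaler.mean_)
print(model.score(X_te, y_te))
```

At prediction time the pipeline applies the training-set statistics to the test rows automatically.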

In [None]:
from sklearn.model_selection import train_test_split, ShuffleSplit

#split
test_size = 0.4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size)

#cross validation
cv = ShuffleSplit(n_splits=100, test_size=test_size, random_state=42)
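With a class as rare as reset-both, a plain random split can leave very different shares of it in the training and test parts. A sketch of a stratified, reproducible split on hypothetical toy labels (the counts and `random_state=42` are illustrative choices, not the notebook's settings):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical counts): class 3 has only 10 of 1000 samples
y_toy = np.array([0] * 600 + [1] * 250 + [2] * 140 + [3] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=42
)

print(np.bincount(y_tr))
print(np.bincount(y_te))
```

Passing `stratify=y` to the split above would apply the same idea to the firewall data.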

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

def evaluate(y_test, y_pred, X_test):
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

In [None]:
# Based on code from https://github.com/fenna/student_BFVM19DATASC3
from sklearn.metrics import mean_squared_error

def plot_learning_curves(model, X_train, y_train, X_val, y_val, training_sizes=None):
    # Derive the default sizes from the X_train argument, not from a global at definition time
    if training_sizes is None:
        training_sizes = range(999, len(X_train), 1000)
    # The helper returns square roots of the MSE, i.e. RMSE values
    rmse_train, rmse_val = calculate_MSE_over_training_sizes(model, X_train, y_train, 
                                                             X_val, y_val, training_sizes)

    plt.plot(training_sizes, rmse_train,
             "r-o", linewidth=2, label="training data")
    plt.plot(training_sizes, rmse_val, 
             "b-*", linewidth=3, label="validation data")
    plt.legend(loc="best", fontsize=14)   
    plt.xlabel("Training set size", fontsize=14) 
    plt.ylabel("RMSE", fontsize=14) 

    
def calculate_MSE_over_training_sizes(model, X_train, y_train, X_val, y_val, training_sizes):
    train_errors, val_errors = [], []
    for m in training_sizes:
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train_predict, y_train[:m]))
        val_errors.append(mean_squared_error(y_val_predict, y_val))
        
    return np.sqrt(train_errors), np.sqrt(val_errors)
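Because the labels are class codes, the RMSE above depends on the arbitrary integer encoding (mistaking class 0 for class 3 counts more than mistaking it for class 1). An alternative sketch uses scikit-learn's `learning_curve` with the misclassification rate instead; the synthetic data, the estimator and the five-split `cv_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic 4-class stand-in data (not the firewall log)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

cv_demo = ShuffleSplit(n_splits=5, test_size=0.4, random_state=42)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=cv_demo, scoring="accuracy",
)

# Misclassification rate = 1 - accuracy, averaged over the CV splits
train_error = 1 - train_scores.mean(axis=1)
val_error = 1 - val_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_error, val_error):
    print(f"n={n}: train error {tr:.3f}, validation error {va:.3f}")
```

This variant also puts a `ShuffleSplit` object to use, averaging each point of the curve over several random splits.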

## Most basic model: a dummy classifier

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
dm = DummyClassifier(strategy='stratified') 
# stratified generates predictions by respecting the training set’s class distribution.
dm.fit(X_train, y_train)

In [None]:
y_pred = dm.predict(X_test)
evaluate(y_test, y_pred, X_test)

In [None]:
plot_learning_curves(dm, X_train, y_train, X_test, y_test, range(1999, len(X_train), 2000))

Based on this learning curve you cannot say whether the model is underfitted or overfitted: as a dummy classifier it does not actually learn from the features, whatever the training set size.

Let's move on from this dummy model and attempt a basic classifier: logistic regression.

## Logistic Regression

In [None]:
# train
from sklearn.linear_model import LogisticRegression

lg = LogisticRegression(max_iter=1000)
lg.fit(X_train, y_train)

In [None]:
# evaluation
y_pred = lg.predict(X_test)
evaluate(y_test, y_pred, X_test)

This warning most likely means that the model never predicted the fourth class (reset-both) on the test set, so the precision and F1 score for that class are undefined and reported as 0.
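If the warning is scikit-learn's `UndefinedMetricWarning` (an assumption, since the cell output is not shown here), it is raised when a class receives no predictions at all, which makes its precision a division by zero. A minimal sketch on hypothetical toy labels reproduces the situation:

```python
import warnings
import numpy as np
from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

# Toy labels (hypothetical): class 3 occurs in y_true but is never predicted
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 3])
y_hat = np.array([0, 0, 0, 1, 1, 2, 2, 0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    per_class = precision_score(y_true, y_hat, average=None)

print(per_class)
print([w.category.__name__ for w in caught])

# zero_division=0 reports the same 0.0 for class 3 without emitting the warning
quiet = precision_score(y_true, y_hat, average=None, zero_division=0)
```

`classification_report` accepts the same `zero_division` argument.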

In [None]:
plot_learning_curves(lg, X_train, y_train, X_test, y_test, range(1999, len(X_train), 2000))

Based on this learning curve I would say this model fits pretty well, as the validation and training errors are close together.

As the fourth class could not be evaluated, let's build a decision tree, which will at least attempt to predict it despite the low support.

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

In [None]:
y_pred = dt.predict(X_test)
evaluate(y_test, y_pred, X_test)

In [None]:
plot_learning_curves(dt, X_train, y_train, X_test, y_test)

Based on this learning curve I would say this model is overfitted, as there is still a large gap between training error and validation error. It looks like more data would not fix this problem, though.
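A gap between training and validation error on an unconstrained tree is typically tackled by limiting its complexity rather than by adding data. The sketch below compares an unrestricted tree with a depth-limited one on synthetic stand-in data; `max_depth=4` is an arbitrary illustrative value, not a tuned setting for the firewall log.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the firewall log)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=0
)

results = {}
for depth in (None, 4):  # None = grow until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train {results[depth][0]:.3f}, test {results[depth][1]:.3f}")
```

Comparing the train and test accuracy per depth shows how much of the gap comes from the tree memorising the training set.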

Let's now get to the real deal, comparing SVM methods to Ertam & Kaya, 2018.

## SVM 

### Kernel: linear

In [None]:
from sklearn.svm import SVC
svm_lin = SVC(kernel='linear')
svm_lin.fit(X_train, y_train)

In [None]:
y_pred = svm_lin.predict(X_test)
evaluate(y_test, y_pred, X_test)

In [None]:
plot_learning_curves(svm_lin, X_train, y_train, X_test, y_test, range(4999, len(X_train), 5000))

Based on this learning curve I would say this model fits reasonably well: there is no large gap in error between training and validation data.

### Kernel: poly

In [None]:
from sklearn.svm import SVC
svm_poly = SVC(kernel='poly')
svm_poly.fit(X_train, y_train)

In [None]:
y_pred = svm_poly.predict(X_test)
evaluate(y_test, y_pred, X_test)

This warning most likely means that the model never predicted the fourth class (reset-both) on the test set, so the precision and F1 score for that class are undefined and reported as 0.

In [None]:
plot_learning_curves(svm_poly, X_train, y_train, X_test, y_test, range(4999, len(X_train), 5000))

Based on this learning curve I would say this model fits reasonably well: there is no large gap in error between training and validation data.

### Kernel: RBF

In [None]:
from sklearn.svm import SVC
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)

In [None]:
y_pred = svm_rbf.predict(X_test)
evaluate(y_test, y_pred, X_test)

This warning most likely means that the model never predicted the fourth class (reset-both) on the test set, so the precision and F1 score for that class are undefined and reported as 0.

In [None]:
plot_learning_curves(svm_rbf, X_train, y_train, X_test, y_test, range(4999, len(X_train), 5000))

Based on this learning curve I would say this model fits reasonably well: there is no large gap in error between training and validation data.

### Kernel: sigmoid

In [None]:
from sklearn.svm import SVC
svm_sig = SVC(kernel='sigmoid')
svm_sig.fit(X_train, y_train)

In [None]:
y_pred = svm_sig.predict(X_test)
evaluate(y_test, y_pred, X_test)

This warning most likely means that the model never predicted the fourth class (reset-both) on the test set, so the precision and F1 score for that class are undefined and reported as 0.

In [None]:
plot_learning_curves(svm_sig, X_train, y_train, X_test, y_test, range(4999, len(X_train), 5000))

Based on this learning curve I would say this model does not fit well, as there is still a large gap between training error and validation error (a gap like this usually points to overfitting rather than underfitting).

Table 3: Comparison of SVM performance between Ertam & Kaya, 2018 and my own implementation.

\* = Better performing model

<style type="text/css">
.tg  {border-collapse:collapse;border-color:#93a1a1;border-spacing:0;}
.tg td{background-color:#fdf6e3;border-color:#93a1a1;border-style:solid;border-width:1px;color:#002b36;
  font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{background-color:#657b83;border-color:#93a1a1;border-style:solid;border-width:1px;color:#fdf6e3;
  font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-uzvj{border-color:inherit;font-weight:bold;text-align:center;vertical-align:middle}
.tg .tg-7btt{border-color:inherit;font-weight:bold;text-align:center;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-uzvj" rowspan="2">Method</th>
    <th class="tg-7btt" colspan="2">F1</th>
    <th class="tg-7btt" colspan="2">Precision</th>
    <th class="tg-7btt" colspan="2">Recall</th>
  </tr>
  <tr>
    <td class="tg-7btt">Ertam &amp; Kaya</td>
    <td class="tg-7btt">van Lieshout</td>
    <td class="tg-7btt">Ertam &amp; Kaya</td>
    <td class="tg-7btt">van Lieshout</td>
    <td class="tg-7btt">Ertam &amp; Kaya</td>
    <td class="tg-7btt">van Lieshout</td>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky">SVM Linear</td>
    <td class="tg-0pky">0.75</td>
    <td class="tg-0pky">0.99*</td>
    <td class="tg-0pky">0.68</td>
    <td class="tg-0pky">0.99<span style="font-weight:400;font-style:normal">*</span></td>
    <td class="tg-0pky">0.85</td>
    <td class="tg-0pky">0.99<span style="font-weight:400;font-style:normal">*</span></td>
  </tr>
  <tr>
    <td class="tg-0pky">SVM Polynomial</td>
    <td class="tg-0pky">0.53</td>
    <td class="tg-0pky">0.98<span style="font-weight:400;font-style:normal">*</span></td>
    <td class="tg-0pky">0.62</td>
    <td class="tg-0pky">0.98<span style="font-weight:400;font-style:normal">*</span></td>
    <td class="tg-0pky">0.47</td>
    <td class="tg-0pky">0.98<span style="font-weight:400;font-style:normal">*</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal">SVM RBF</span></td>
    <td class="tg-0pky">0.76</td>
    <td class="tg-0pky">0.99<span style="font-weight:400;font-style:normal">*</span></td>
    <td class="tg-0pky">0.63</td>
    <td class="tg-0pky">0.99<span style="font-weight:400;font-style:normal">*</span></td>
    <td class="tg-0pky">0.97</td>
    <td class="tg-0pky">0.99<span style="font-weight:400;font-style:normal">*</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal">SVM Sigmoid</span></td>
    <td class="tg-0pky">0.75</td>
    <td class="tg-0pky">0.84<span style="font-weight:400;font-style:normal">*</span></td>
    <td class="tg-0pky">0.60</td>
    <td class="tg-0pky">0.84<span style="font-weight:400;font-style:normal">*</span></td>
    <td class="tg-0pky">0.99*</td>
    <td class="tg-0pky">0.84</td>
  </tr>
</tbody>
</table>

Conclusion: Except for the recall of the SVM with sigmoid kernel, my implementation scores higher than Ertam & Kaya on every metric for every kernel of the multi-class Support Vector Machine.
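When summary scores like these are compared, the averaging mode matters on unbalanced classes. Which averaging each column of Table 3 uses is not stated here, so the sketch below only illustrates the effect on hypothetical toy labels in which the rare class is always misclassified:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels (hypothetical): the rare class 3 is always misclassified as class 1
y_true = np.array([0] * 60 + [1] * 25 + [2] * 13 + [3] * 2)
y_hat = y_true.copy()
y_hat[y_true == 3] = 1

# 'weighted' averages per-class F1 by support; 'macro' treats every class equally
f1_weighted = f1_score(y_true, y_hat, average="weighted", zero_division=0)
f1_macro = f1_score(y_true, y_hat, average="macro", zero_division=0)
print(f"weighted F1: {f1_weighted:.2f}, macro F1: {f1_macro:.2f}")
```

The weighted score barely registers the missed class, while the macro score drops sharply, so both are worth reporting side by side.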