# InternetFirewallSupervisedLearning
## Author: Tijs van Lieshout

Predicting the Internet Firewall action based on log info. Supervised Learning Assignment for Master DSLS.

### Data availability:
- [Direct link](https://archive.ics.uci.edu/ml/machine-learning-databases/00542/log2.csv)
- [Archive link with some info](https://archive.ics.uci.edu/ml/datasets/Internet+Firewall+Data)
- [F. Ertam and M. Kaya, "Classification of firewall log files with multiclass support vector machine," 2018 6th International Symposium on Digital Forensic and Security (ISDFS), 2018, pp. 1-4, doi: 10.1109/ISDFS.2018.8355382.](https://doi.org/10.1109/ISDFS.2018.8355382)

Table 1. Classes to predict (Actions)

|   Action   |                                                               Description                                                              |
|:----------:|:--------------------------------------------------------------------------------------------------------------------------------------:|
| Allow      | Allows the internet traffic.                                                                                                           |
| Deny       | Blocks traffic and enforces the default Deny Action defined for the application that is being denied.                                  |
| Drop       | Silently drops the traffic; for an application, it overrides the default deny action. A TCP reset is not sent to the host/application. |
| Reset-Both | Sends a TCP reset to both the client-side and server-side devices.                                                                     |



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv("log2.csv")
df

In [None]:
df.info()

In [None]:
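counts = df['Action'].value_counts()
# Look counts up by label instead of relying on the sort order of value_counts()
allow, deny, drop, reset_both = (counts[k] for k in ['allow', 'deny', 'drop', 'reset-both'])
print('Number of allowed logs: ', allow)
print('Number of denied logs: ', deny)
print('Number of dropped logs: ', drop)
print('Number of reset-both logs: ', reset_both)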

print('\n')
print('% of allowed logs', round(allow / len(df) * 100, 1), '%')
print('% of denied logs', round(deny / len(df) * 100, 1), '%')
print('% of dropped logs', round(drop / len(df) * 100, 1), '%')
print('% of reset-both logs', round(reset_both / len(df) * 100, 1), '%')

In [None]:
df['Action'].value_counts().plot(kind='barh')

The classes to predict are clearly imbalanced, as the counts and the bar chart above show.
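
One common way to keep a rare class represented in both the training and the test set is a stratified split. The sketch below is not part of the original analysis: it uses made-up toy labels (`y_toy`, with assumed class proportions rather than the real counts) and the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with an assumed imbalance (not the real counts):
# 0=allow, 1=deny, 2=drop, 3=reset-both.
rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2, 3], [570, 230, 190, 10])
X_toy = rng.normal(size=(len(y_toy), 3))

# stratify=y_toy keeps each class's share close to identical in both splits,
# so the rare class does not vanish from the test set by chance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=42
)

train_counts = np.bincount(y_tr, minlength=4)
test_counts = np.bincount(y_te, minlength=4)
print("train:", train_counts)
print("test: ", test_counts)
```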

In [None]:
# missing data
df.isnull().sum() 
# no missing data, no imputation needed

In [None]:
description = df.groupby(['Action']).describe()

In [None]:
# Correlate the numeric columns only; 'Action' is a string column
c = df.select_dtypes('number').corr().abs()
sns.heatmap(c, cmap=sns.color_palette("Blues", as_cmap=True))

'Bytes Sent', 'Bytes Received', 'pkts_sent' and 'pkts_received' can be discarded, as 'Bytes' and 'Packets' are the totals of the respective pairs.

I am also going to discard 'Packets' in favour of 'Bytes', since the two are highly correlated. I'll keep 'Bytes' because it is more fine-grained than 'Packets' (one packet consists of multiple bytes).
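
The claim that the totals equal the sum of their parts can be checked before dropping anything. A minimal sketch, assuming the column names used in this notebook and made-up toy values (the frame `toy` is illustrative, not real log data):

```python
import pandas as pd

# Toy rows using the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "Bytes Sent": [94, 1600, 118],
    "Bytes Received": [83, 3168, 120],
    "Bytes": [177, 4768, 238],
    "pkts_sent": [1, 10, 1],
    "pkts_received": [1, 9, 1],
    "Packets": [2, 19, 2],
})

# Each total should equal the sum of its sent/received pair on every row.
bytes_ok = (toy["Bytes Sent"] + toy["Bytes Received"]).eq(toy["Bytes"]).all()
packets_ok = (toy["pkts_sent"] + toy["pkts_received"]).eq(toy["Packets"]).all()
print(bytes_ok, packets_ok)

# If both checks hold, the parts (and 'Packets') carry no extra information.
redundant = ["Bytes Sent", "Bytes Received", "pkts_sent", "pkts_received", "Packets"]
reduced = toy.drop(columns=redundant)
print(list(reduced.columns))
```

Running the same two checks on `df` would confirm whether the relationship holds for every row of the real data.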

None of the port variables should be treated as continuous, but their ranges are probably still interesting to look at.
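
One possible way to treat a port as categorical, not used in the rest of this notebook, is to bin it into the three IANA port ranges (well-known 0-1023, registered 1024-49151, dynamic 49152-65535) and one-hot encode the result. The sketch below uses made-up port values; the names `port_class` and `one_hot` are illustrative.

```python
import pandas as pd

# Made-up example ports, one or two per IANA range.
ports = pd.Series([53, 445, 1024, 3389, 49156, 65535], name="Source Port")

# Bin edges are right-inclusive: (-1, 1023], (1023, 49151], (49151, 65535].
port_class = pd.cut(
    ports,
    bins=[-1, 1023, 49151, 65535],
    labels=["well_known", "registered", "dynamic"],
)

# One indicator column per range, usable as model features.
one_hot = pd.get_dummies(port_class, prefix="source_port")
print(pd.concat([ports, port_class.rename("range")], axis=1))
```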

In [None]:
sns.displot(df, x="Source Port", hue="Action")
plt.show()

In [None]:
description['Source Port']

All drop actions seem to occur on high source ports (minimum 49156). The minimum source port for reset-both is 1024.

In [None]:
sns.displot(df, x="Destination Port", hue="Action")
plt.show()

In [None]:
description['Destination Port']

Most actions seem to have a very low destination port. All drop actions occur on destination port 445.

In [None]:
sns.displot(df, x="NAT Source Port", hue="Action")
plt.show()

In [None]:
description['NAT Source Port']

Allowed actions seem to be uniformly distributed over NAT Source Ports. All dropped NAT Source Ports are equal to 0. Most deny and reset-both actions have NAT Source Ports of 0.

In [None]:
sns.displot(df, x="NAT Destination Port", hue="Action")
plt.show()

In [None]:
description['NAT Destination Port']

Allowed actions seem to be uniformly distributed over NAT Destination Ports. All dropped NAT Destination Ports are equal to 0. Most deny and reset-both actions have NAT Destination Ports of 0.
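
Observations like "all dropped NAT ports are 0" can be quantified as the share of zero-valued ports per action. A minimal sketch on made-up toy rows (the frame `toy` and the name `zero_share` are illustrative); the same expression could be applied to `df`:

```python
import pandas as pd

# Toy rows with the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "Action": ["allow", "allow", "deny", "deny", "drop", "reset-both"],
    "NAT Source Port": [54587, 13394, 0, 0, 0, 0],
})

# Fraction of rows per action whose NAT Source Port equals 0.
zero_share = toy["NAT Source Port"].eq(0).groupby(toy["Action"]).mean()
print(zero_share)
```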

In [None]:
cols = ['Source Port', 
        'Destination Port', 
        'NAT Source Port', 
        'NAT Destination Port', 
        'Bytes', 
        'Elapsed Time (sec)']
df_features = df[cols].rename(columns={'Source Port':'source_port',
                                       'Destination Port':'destination_port', 
                                       'NAT Source Port':'nat_source_port', 
                                       'NAT Destination Port':'nat_destination_port',
                                       'Bytes':'bytes',
                                       'Elapsed Time (sec)':'elapsed_time'})

In [None]:
c = df_features.corr().abs()
sns.heatmap(c, cmap=sns.color_palette("Blues", as_cmap=True))

In [None]:
# map() yields a plain integer column; unmapped labels would show up as NaN
y = np.array(df['Action'].map({'allow':0,'deny':1,'drop':2, 'reset-both':3}))
X = np.array(df_features)
print(y.shape)
print(X.shape)

In [None]:
from sklearn.preprocessing import StandardScaler

def normalize(X):
    # Standardize every feature to zero mean and unit variance
    scaler = StandardScaler()
    return scaler.fit_transform(X)

X = normalize(X)
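
Scaling the full `X` before splitting lets the test rows influence the mean and variance used for scaling. An alternative, sketched here on made-up toy data and not what this notebook does, is to put the scaler in a scikit-learn `Pipeline` so it is fitted on the training rows only. The names `X_toy`, `y_toy` and `pipe` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: two unscaled features, label depends on the first one.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=100.0, scale=20.0, size=(200, 2))
y_toy = (X_toy[:, 0] > 100.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42
)

# fit() learns the scaling statistics from X_tr only; predict()/score()
# reuse those statistics on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps["standardscaler"]
print("scaler mean (train only):", scaler.mean_)
print("test accuracy:", pipe.score(X_te, y_te))
```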

In [None]:
from sklearn.model_selection import train_test_split, ShuffleSplit

# split (fixed seed so the hold-out set is reproducible)
test_size = 0.4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

#cross validation
cv = ShuffleSplit(n_splits=100, test_size=test_size, random_state=42)
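
The `cv` splitter is defined here but not passed to any of the later cells. One way it could be used is with `cross_val_score`, which refits the model on every shuffle split and returns one score per split. The sketch below runs on made-up toy data with only 5 splits to keep it fast; `X_toy`, `y_toy`, `cv_toy` and the choice of `f1_macro` as the metric are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data: label depends on the sum of the first two features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Same kind of splitter as above, with fewer repetitions.
cv_toy = ShuffleSplit(n_splits=5, test_size=0.4, random_state=42)

# Macro-averaged F1 weights every class equally, which is often
# preferred over accuracy when classes are imbalanced.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    X_toy, y_toy, cv=cv_toy, scoring="f1_macro",
)
print("mean F1 (macro):", scores.mean(), "+/-", scores.std())
```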

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_squared_error

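def evaluate(y_test, y_pred, X_test, clf):
    """Print the confusion matrix and the per-class precision, recall and F1.

    X_test and clf are accepted but currently unused.
    """
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))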

def plot_learning_curves(model, X_train, y_train, X_val, y_val, training_sizes=None):
    """
    Plot training and validation RMSE against the training set size.

    input:
        model: estimator or pipeline object
        X_train, y_train: training data
        X_val, y_val: validation data
        training_sizes: iterable of training set sizes (default: steps of 1000)

    Note: the RMSE is computed on the integer-encoded class labels.
    """
    if training_sizes is None:
        # Derive the default from the X_train argument, not from a global
        training_sizes = range(999, len(X_train), 1000)
    train_errors, val_errors = [], []

    for m in training_sizes:
        # Refit on the first m training rows and score on both sets
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train_predict, y_train[:m]))
        val_errors.append(mean_squared_error(y_val_predict, y_val))

    plt.plot(training_sizes, np.sqrt(train_errors),
             "r-+", linewidth=2, label="training data")
    plt.plot(training_sizes, np.sqrt(val_errors),
             "b-", linewidth=3, label="validation data")
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)
    return 0
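
The function above measures RMSE on integer class codes, so confusing allow (0) with reset-both (3) counts as a larger error than confusing allow (0) with deny (1), even though the codes have no natural order. A possible alternative, sketched here on made-up toy data and not used elsewhere in this notebook, is scikit-learn's `learning_curve` with a classification metric such as accuracy. The names `X_toy`, `y_toy`, `sizes`, `train_scores` and `val_scores` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Toy data: label depends on two of the three features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] - X_toy[:, 2] > 0).astype(int)

# learning_curve refits the model on growing subsets of each training split
# and returns one row of scores per training size, one column per split.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_toy, y_toy,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=ShuffleSplit(n_splits=5, test_size=0.4, random_state=42),
    scoring="accuracy",
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train acc={tr:.3f}  validation acc={va:.3f}")
```

The mean scores per size can be passed to `plt.plot` in the same way as the RMSE values above.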

## Logistic Regression

In [None]:
#train
from sklearn.linear_model import LogisticRegression

lg = LogisticRegression(max_iter=1000)
lg.fit(X_train, y_train)

In [None]:
# evaluation
y_pred = lg.predict(X_test)
evaluate(y_test, y_pred, X_test, lg)

In [None]:
plot_learning_curves(lg, X_train, y_train, X_test, y_test, range(1999, len(X_train), 2000))

## Decision Tree

In [None]:
#train
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

In [None]:
# evaluation
y_pred = dt.predict(X_test)
evaluate(y_test, y_pred, X_test, dt)

In [None]:
plot_learning_curves(dt, X_train, y_train, X_test, y_test)

## SVM 

### Kernel: linear

In [None]:
from sklearn.svm import SVC
svm_lin = SVC(kernel='linear')
svm_lin.fit(X_train, y_train)

In [None]:
y_pred = svm_lin.predict(X_test)
evaluate(y_test, y_pred, X_test, svm_lin)

In [None]:
plot_learning_curves(svm_lin, X_train, y_train, X_test, y_test, range(4999, len(X_train), 5000))

### Kernel: poly

In [None]:
from sklearn.svm import SVC
svm_poly = SVC(kernel='poly')
svm_poly.fit(X_train, y_train)

In [None]:
y_pred = svm_poly.predict(X_test)
evaluate(y_test, y_pred, X_test, svm_poly)

In [None]:
plot_learning_curves(svm_poly, X_train, y_train, X_test, y_test, range(4999, len(X_train), 5000))

### Kernel: RBF

In [None]:
from sklearn.svm import SVC
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)

In [None]:
y_pred = svm_rbf.predict(X_test)
evaluate(y_test, y_pred, X_test, svm_rbf)

In [None]:
plot_learning_curves(svm_rbf, X_train, y_train, X_test, y_test, range(4999, len(X_train), 5000))

### Kernel: sigmoid

In [None]:
from sklearn.svm import SVC
svm_sig = SVC(kernel='sigmoid')
svm_sig.fit(X_train, y_train)

In [None]:
y_pred = svm_sig.predict(X_test)
evaluate(y_test, y_pred, X_test, svm_sig)

In [None]:
plot_learning_curves(svm_sig, X_train, y_train, X_test, y_test, range(4999, len(X_train), 5000))