Train a model using NetFlow records in CSV format

Import dependencies
pip install numpy pandas xgboost scikit-learn matplotlib

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier,LocalOutlierFactor
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier,IsolationForest
from sklearn.cluster import DBSCAN,KMeans
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import model_selection
from pickle import dump
import time

The dataset can be downloaded from:

https://rdm.uq.edu.au/files/650f1fa0-ef9c-11ed-b5f6-b1a04f482c13

Columns needed from the CSV files
Example:
172.31.66.17,51128,23.36.69.189,443,6,91.0,152,0,3,0,194,4285680,0,anomaly_name

The following fields are dropped during preprocessing:
'src_ip', 'dst_ip', 'l7_proto', 'anomaly'
Their values can therefore be arbitrary; they are not used in the learning process.

The 'label' field is 0 for normal traffic and 1 for an anomaly. The model does not use it during training; it is needed only for validation.
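
To make the column layout concrete, the sketch below parses the example row above with the same field names and drops the unused fields. It is only an illustration: the in-memory string stands in for the real file, and the variable names `sample`, `row`, and `features` are made up for this example.

```python
import io
import pandas as pd

# Field order assumed to match the example row shown above.
flow_fields = [
    "src_ip", "src_port", "dst_ip", "dst_port", "ip_protocol", "l7_proto",
    "in_bytes", "out_bytes", "in_pkts", "out_pkts", "tcp_flags", "duration",
    "label", "anomaly",
]

# The example row from the text, used here instead of the real CSV file.
sample = "172.31.66.17,51128,23.36.69.189,443,6,91.0,152,0,3,0,194,4285680,0,anomaly_name\n"

row = pd.read_csv(io.StringIO(sample), names=flow_fields)

# Drop the fields that are not used for learning.
features = row.drop(["src_ip", "dst_ip", "l7_proto", "anomaly"], axis=1)
print(features.iloc[0].to_dict())
```

Ten columns remain: nine feature columns plus 'label', which is kept only for validation.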

In [2]:
flow_fields = [
    "src_ip",
    "src_port",
    "dst_ip",
    "dst_port",
    "ip_protocol",
    "l7_proto",
    "in_bytes",
    "out_bytes",
    "in_pkts",
    "out_pkts",
    "tcp_flags",
    "duration",
    "label",
    "anomaly"
]

with open("datasets/NF-CSE-CIC-IDS2018.csv", "r") as csvfile:
    # Column names are supplied explicitly via names=, so the first row is treated as data.
    # Note: this reads the whole file into memory at once.
    df_src = pd.read_csv(csvfile, names=flow_fields)

Scaler for data standardization

In [3]:
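def do_scl(df_num, cols):
    # Scale the numeric columns with RobustScaler (based on median and interquartile range).
    print("Original values:\n", df_num)

    scaler = RobustScaler()
    scaler_temp = scaler.fit_transform(df_num)

    # fit_transform returns a NumPy array, so wrap it back into a DataFrame.
    std_df = pd.DataFrame(scaler_temp, columns=cols)

    print("\nScaled values:\n", std_df)

    return std_df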

cat_cols = ['ip_protocol']
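
As a rough illustration of what RobustScaler does with its default settings, the sketch below compares its output with a manual calculation: subtract the median, then divide by the interquartile range. The sample values are made up.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up single-column data with one large outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(values)

# Manual equivalent: (x - median) / (Q3 - Q1)
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Because the median and quartiles are barely affected by the single outlier, the other values keep a sensible scale, which is a common reason for preferring this scaler on traffic counters with heavy tails.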

Standardization and encoding routine

In [4]:
def process(dataframe):
    df_num = dataframe.drop(cat_cols, axis=1)
    num_cols = df_num.columns
    scaled_df = do_scl(df_num, num_cols)

    dataframe.drop(labels=num_cols, axis="columns", inplace=True)
    dataframe[num_cols] = scaled_df[num_cols]

    print("Before encoding:")
    print(dataframe['ip_protocol'])

    dataframe = pd.get_dummies(dataframe, columns = ['ip_protocol'])

    print("\nColumns after encoding:")
    print(dataframe.filter(regex='^ip_protocol_'))
    
    return dataframe
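
One-hot encoding with pd.get_dummies names each new column after the source column and the distinct value it represents. A small sketch with made-up values (the frame `demo` is hypothetical):

```python
import pandas as pd

# Made-up flows: protocol 6 is TCP, 17 is UDP.
demo = pd.DataFrame({"ip_protocol": [6, 17, 6], "in_bytes": [152, 994, 585]})

encoded = pd.get_dummies(demo, columns=["ip_protocol"])

# New columns are named "<source column>_<value>".
print(list(encoded.columns))
print(encoded.filter(regex="^ip_protocol_"))
```

A data set that contains different protocol values produces a different set of columns, so data scored later needs the same columns as the training data.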

Drop unnecessary columns and apply scaling

In [5]:
df = df_src.drop(['src_ip', 'dst_ip','l7_proto','anomaly'] ,axis=1)
scaled_train = process(df)

Original values:
          src_port  dst_port  in_bytes  out_bytes  in_pkts  out_pkts  \
0           51128       443       152          0        3         0   
1             443     51036       994        979        7         7   
2           12262       445       585        344        5         4   
3           61023        53       136        168        2         2   
4             443     51037        72         40        1         1   
...           ...       ...       ...        ...      ...       ...   
8392396        22     40810      2601          0       12         0   
8392397     15476        23        44          0        1         0   
8392398        23     15476        40          0        1         0   
8392399     56407        53        72          0        1         0   
8392400        53     56407       126          0        1         0   

         tcp_flags  duration  label  
0              194   4285680      0  
1               24   4234714      0  
2              

Split data into training and validation sets

In [6]:
y = scaled_train['label'].values
y = y.astype('int')

X = scaled_train.drop(['label'], axis=1)

x_train, x_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=42)
x_train_reduced, x_test_reduced, y_train_reduced, y_test_reduced = \
    train_test_split(X, y, test_size=0.2, random_state=42)
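
When anomalies are rare, a plain random split can give the two sets slightly different anomaly ratios. The sketch below uses the `stratify` parameter of train_test_split, which the cell above does not use, to keep the ratio equal in both sets. The data is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 10% labelled as anomalies.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 90 + [1] * 10)

# stratify=y_demo is an optional addition, not part of the notebook's split.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(x_tr), len(x_te))
print(y_tr.mean(), y_te.mean())
```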

n_estimators: the higher the value (for example 1500), the longer the training takes.

Start the learning process
These values can be tuned for best performance:
'random_state=47, contamination=0.01, n_estimators=1000'

In [24]:
# Earlier configurations, with the accuracy noted for each run:
#clf = IsolationForest(random_state=47, n_jobs=-1, contamination=0.05, n_estimators=1000)  # 84.3% accuracy
#clf = IsolationForest(random_state=47, n_jobs=-1, contamination=0.01, n_estimators=1000)  # noted as 86.3% / 84.3% accuracy
#clf = IsolationForest(random_state=47, n_jobs=-1, contamination=0.02, n_estimators=1500)  # 86.26% accuracy
clf = IsolationForest(random_state=47, n_jobs=-1, contamination=0.01, n_estimators=3000)
clf.fit(x_train)
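
The `contamination` parameter sets the share of training samples that IsolationForest places below its decision threshold, so it largely determines how many flows are flagged. A minimal sketch on synthetic data (not the NetFlow set; the smaller `n_estimators` is chosen only to keep it fast):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data: 2000 samples, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 3))

demo_clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=47)
demo_clf.fit(X_demo)

# Share of training samples predicted as anomalies (-1); expected to be close to 0.05.
flagged = (demo_clf.predict(X_demo) == -1).mean()
print(flagged)
```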



Start prediction. The trained model has seen only the x_train data.

In [25]:
predict_test = clf.predict(x_test)
predict_train = clf.predict(x_train)

Map the predicted values to the label convention of the source data so they can be compared: 0 is normal behaviour, 1 is an anomaly

In [None]:
predict_test[predict_test == 1] = 0
predict_train[predict_train == 1] = 0

predict_test[predict_test == -1] = 1
predict_train[predict_train == -1] = 1
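
The same mapping can also be written in one step with np.where, which returns a new array and leaves the raw predictions untouched. A small sketch with made-up predictions:

```python
import numpy as np

# Made-up IsolationForest output: 1 = inlier, -1 = outlier.
raw = np.array([1, -1, 1, 1, -1])

# Map -1 (outlier) to 1 (anomaly) and everything else to 0 (normal).
labels = np.where(raw == -1, 1, 0)
print(labels)
```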

Compare the predicted values with the known labels and calculate accuracy

In [None]:
test_accuracy = metrics.accuracy_score(y_test,predict_test)
train_accuracy = metrics.accuracy_score(y_train,predict_train)

train_accuracy,test_accuracy

(0.8613730875554073, 0.8610708134319066)
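
On imbalanced data, accuracy alone can look high even when few anomalies are found, so it may help to also look at the confusion matrix, precision, and recall. The sketch below uses made-up labels to show the effect; it says nothing about the results above.

```python
from sklearn import metrics

# Made-up labels: 8 normal flows, 2 anomalies; the "model" predicts normal for all.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = metrics.accuracy_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred, zero_division=0)
precision = metrics.precision_score(y_true, y_pred, zero_division=0)
# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])

print(accuracy, precision, recall)
print(cm)
```

Here accuracy is 0.8 although no anomaly was detected, which the recall of 0.0 makes visible.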

Print the number of predicted anomalies and the accuracy

In [None]:
# After the mapping above, 1 marks a predicted anomaly, so these are anomaly counts.
n_anomalies_test = predict_test[predict_test == 1].size
n_anomalies_train = predict_train[predict_train == 1].size

print("predicted anomalies in test set: %d ; predicted anomalies in training set: %d"
    % (n_anomalies_test, n_anomalies_train))

print("Training Accuracy " + "IsolationForestClassifier" + " {}  Test Accuracy ".format(train_accuracy*100) + 'IsolationForestClassifier' + " {}".format(test_accuracy*100))

predicted anomalies in test set: 31794 ; predicted anomalies in training set: 126261
Training Accuracy IsolationForestClassifier 86.13730875554073  Test Accuracy IsolationForestClassifier 86.10708134319066


Save the trained model for future use

In [13]:
with open("IsolationForestModel_86.pkl", "wb") as f:
    dump(clf, f, protocol=5)

The following example loads the model from the saved file and uses it to predict anomalies

In [None]:
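# load is imported here so that this cell also works in a fresh session
from pickle import load

# Read the pickled model from the binary file
with open("IsolationForestModel_86.pkl", "rb") as f:
    clf_test = load(f)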
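
A quick way to check that saving and loading preserve the model is a round trip followed by a comparison of predictions. The sketch below does this with a small model on synthetic data and a temporary file; the file name `demo_model.pkl` is made up.

```python
import os
import tempfile
from pickle import dump, load

import numpy as np
from sklearn.ensemble import IsolationForest

# Small synthetic data set and model, used only for the round-trip check.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
model = IsolationForest(n_estimators=50, random_state=47).fit(X_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")
    with open(path, "wb") as f:
        dump(model, f, protocol=5)
    with open(path, "rb") as f:
        restored = load(f)

# The restored model should give identical predictions.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickle files should be loaded only from trusted sources, and ideally with the same scikit-learn version that wrote them.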

We use the test data again here. You can use your own data instead, but you must first standardize and encode it using the methods described above.
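
For new data to be comparable with the training data, it is usually scaled with the statistics learned from the training set rather than with a freshly fitted scaler. The sketch below shows that pattern with RobustScaler; it is an assumption about how the preprocessing could be reused, since the notebook's do_scl fits a new scaler on every call. All values are made up.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up training and new values for a single feature.
train_values = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
new_values = np.array([[30.0], [70.0]])

# Fit once on the training data ...
scaler = RobustScaler().fit(train_values)

# ... then only transform new data, reusing the training median and IQR.
new_scaled = scaler.transform(new_values)
print(new_scaled.ravel())
```

The fitted scaler can be pickled alongside the model so that both are available at prediction time.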

In [None]:
predict_test_data= clf_test.predict(x_test)

Map the predicted values to the label convention of the source data: 0 is normal behaviour, 1 is an anomaly. The model itself returns 1 for normal and -1 for an anomaly, so we convert these to 0 and 1.

In [None]:
predict_test_data[predict_test_data == 1] = 0
predict_test_data[predict_test_data == -1] = 1

Validate the accuracy if your dataset has known, labelled anomalies. Otherwise, you can select only the results with value 1 (anomaly) and compare them with the source data to find the IPs and other details.

In [None]:
test_accuracy_test = metrics.accuracy_score(y_test,predict_test_data)
test_accuracy_test
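
To find the IPs behind flagged flows, the predictions can be joined back to the source rows through the DataFrame index. The sketch below assumes the feature frame keeps the same row index as the source frame (as `x_test` does relative to `df_src` when the split is done on a DataFrame); all data and names here are made up.

```python
import numpy as np
import pandas as pd

# Made-up source data, standing in for df_src.
source = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
    "dst_ip": ["10.0.1.1", "10.0.1.2", "10.0.1.3", "10.0.1.4"],
    "dst_port": [443, 23, 53, 22],
})

# Stand-in for x_test: a shuffled subset that keeps the original row index.
test_part = source[["dst_port"]].loc[[3, 1]]

# Made-up predictions for those two rows, already mapped to 0/1.
predicted = np.array([0, 1])

# Select the index labels of predicted anomalies and look them up in the source.
anomalous_index = test_part.index[predicted == 1]
suspects = source.loc[anomalous_index, ["src_ip", "dst_ip"]]
print(suspects)
```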