<a href="https://colab.research.google.com/github/Bast-94/CYBERML-Project/blob/data-set/cyber-ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning for Cyber Security Project

In [42]:
CURRENT_PATH = '/content/drive/MyDrive/cyberml'
import pandas as pd
import numpy as np
import os
import sys
sys.path.append(CURRENT_PATH)
from anomaly_detection_use_case import *

## Prepare Data

### Loading dataset

In [43]:
dataset_path = os.path.join(CURRENT_PATH,'SWaT.A3_dataset_Jul_19_labelled.xlsx')
df = pd.read_excel(dataset_path,header=1)

### Cleaning data

To make the dataset easier to work with, we first need to clean it.

In [44]:
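# Drop the first row (index label 0) and renumber the remaining rows from 0
full_df = df.drop([0])
full_df = full_df.reset_index(drop=True)
# Give the timestamp column a shorter name
full_df = full_df.rename(columns={'GMT +0':'Date'})
# Rows without an attack name or label are treated as benign traffic (label 0)
full_df['Attack'] = full_df['Attack'].fillna('benign')
full_df['Label'] = full_df['Label'].fillna(0).astype(int)
full_df['Date'] = pd.to_datetime(full_df['Date'])
# Save the cleaned data and preview the first rows
full_df.to_csv('SWaT.A3_dataset_Jul_19_labelled.csv')
full_df.head()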

Unnamed: 0,Date,Attack,Label,FIT 101,LIT 101,MV 101,P1_STATE,P101 Status,P102 Status,AIT 201,...,LSH 601,LSH 602,LSH 603,LSL 601,LSL 602,LSL 603,P6 STATE,P601 Status,P602 Status,P603 Status
0,2019-07-20 04:30:00+00:00,benign,0,0,729.8658,1,3,2,1,142.527557,...,"{u'IsSystem': False, u'Name': u'Active', u'Val...","{u'IsSystem': False, u'Name': u'Active', u'Val...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Active', u'Val...",2,1,1,1
1,2019-07-20 04:30:01+00:00,benign,0,0,729.434,1,3,2,1,142.527557,...,"{u'IsSystem': False, u'Name': u'Active', u'Val...","{u'IsSystem': False, u'Name': u'Active', u'Val...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Active', u'Val...",2,1,1,1
2,2019-07-20 04:30:02.004013+00:00,benign,0,0,729.12,1,3,2,1,142.527557,...,"{u'IsSystem': False, u'Name': u'Active', u'Val...","{u'IsSystem': False, u'Name': u'Active', u'Val...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Active', u'Val...",2,1,1,1
3,2019-07-20 04:30:03.004013+00:00,benign,0,0,728.6882,1,3,2,1,142.527557,...,"{u'IsSystem': False, u'Name': u'Active', u'Val...","{u'IsSystem': False, u'Name': u'Active', u'Val...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Active', u'Val...",2,1,1,1
4,2019-07-20 04:30:04+00:00,benign,0,0,727.7069,1,3,2,1,142.527557,...,"{u'IsSystem': False, u'Name': u'Active', u'Val...","{u'IsSystem': False, u'Name': u'Active', u'Val...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Inactive', u'V...","{u'IsSystem': False, u'Name': u'Active', u'Val...",2,1,1,1


## Outlier detection with Isolation Forest

First, we need to list the columns that contain categorical data. A column is treated as categorical if it holds no float value and is not the `Date` column.

In [45]:
is_float = lambda x: isinstance(x,float)
categorical_columns = []
for col in full_df.columns:
  if not np.any(full_df[col].apply(is_float)) and col != 'Date':
    categorical_columns.append(col)
print(categorical_columns)

['Attack', 'Label', 'MV 101', 'P1_STATE', 'P101 Status', 'P102 Status', 'LS 201', 'LS 202', 'LSL 203', 'LSLL 203', 'MV201', 'P2_STATE', 'P201 Status', 'P202 Status', 'P203 Status', 'P204 Status', 'P205 Status', 'P206 Status', 'P207 Status', 'P208 Status', 'MV 301', 'MV 302', 'MV 303', 'MV 304', 'P3_STATE', 'P301 Status', 'P302 Status', 'AIT 401', 'LS 401', 'P4_STATE', 'P401 Status', 'P402 Status', 'P403 Status', 'P404 Status', 'UV401', 'MV 501', 'MV 502', 'MV 503', 'MV 504', 'P5_STATE', 'P501 Status', 'P502 Status', 'LSH 601', 'LSH 602', 'LSH 603', 'LSL 601', 'LSL 602', 'LSL 603', 'P6 STATE', 'P601 Status', 'P602 Status', 'P603 Status']
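To make the selection rule concrete, here is a minimal sketch on a toy frame. The frame, its values, and the helper name `find_categorical_columns` are hypothetical and only illustrate the same float check used above.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the mix of column types in the dataset.
toy_df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-07-20 04:30:00', '2019-07-20 04:30:01']),
    'LIT 101': [729.8658, 729.434],   # continuous sensor reading
    'MV 101': [1, 2],                 # discrete actuator state
    'Attack': ['benign', 'benign'],   # text label
})

def find_categorical_columns(frame, excluded=('Date',)):
    """Return the columns that hold no float value, skipping the excluded ones."""
    result = []
    for col in frame.columns:
        if col in excluded:
            continue
        has_float = frame[col].apply(lambda x: isinstance(x, float)).any()
        if not has_float:
            result.append(col)
    return result

toy_categorical = find_categorical_columns(toy_df)
print(toy_categorical)  # ['MV 101', 'Attack']
```

One consequence of this rule is that a discrete column containing a missing value is stored as float by pandas, so it would not be listed as categorical.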


The Isolation Forest algorithm is applied to the categorical data with an `outlier_fraction` parameter, which represents the expected rate of outliers in the dataset. We estimate it by dividing the number of attack rows (those with `Label` equal to 1) by the total number of rows.

With `get_list_of_if_outliers`, implemented in previous practical sessions, we apply the Isolation Forest algorithm to the categorical data and retrieve the indexes of the outliers.

In [None]:
outlier_fraction = (full_df['Label'] == 1).sum() / len(full_df)
outlier_indexes = get_list_of_if_outliers(full_df[categorical_columns],outlier_fraction)
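The helper `get_list_of_if_outliers` comes from a module that is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below fits scikit-learn's `IsolationForest` on synthetic data and returns the positions of the flagged rows. The function name `sketch_if_outlier_positions`, the `seed` argument, and the data are assumptions made for illustration, not the actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def sketch_if_outlier_positions(features, contamination, seed=0):
    """Hypothetical stand-in: fit an Isolation Forest and return positions of flagged rows."""
    model = IsolationForest(contamination=contamination, random_state=seed)
    predictions = model.fit_predict(features)  # -1 marks an outlier, 1 an inlier
    return np.where(predictions == -1)[0]

# Synthetic data: 95 ordinary rows plus 5 rows far away from the rest.
rng = np.random.default_rng(0)
normal_rows = rng.normal(loc=0.0, scale=1.0, size=(95, 3))
odd_rows = rng.normal(loc=8.0, scale=1.0, size=(5, 3))
toy_features = pd.DataFrame(np.vstack([normal_rows, odd_rows]), columns=['a', 'b', 'c'])

# Same idea as outlier_fraction: expected share of outliers in the data.
toy_fraction = 5 / len(toy_features)
toy_positions = sketch_if_outlier_positions(toy_features, toy_fraction)
print(len(toy_positions), toy_positions)
```

The `contamination` argument sets the share of rows the model flags, so the number of predicted outliers stays close to the number of labelled attacks. scikit-learn estimators expect numeric features, so text columns would need to be encoded or removed before fitting in a sketch like this one.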

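Let's measure how well the predicted outliers match the labelled attacks by counting the predicted outliers that are real attacks. Dividing this count by the number of predicted outliers gives the precision of the detector, which the code below reports as accuracy.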

In [47]:
outliers = np.zeros(len(full_df))
outliers[outlier_indexes] = 1
full_df['outliers'] = outliers

attack_outliers = full_df[(full_df['outliers']== 1) & (full_df['Label']== 1)]
outliers_matches = len(attack_outliers)
print(f'{outliers_matches} outliers found in unsupervised manner are labelled as attacks')
if_accuracy = outliers_matches / len(outlier_indexes)
print(f'Isolation Forest accuracy:  {if_accuracy:.2f} ')

1002 outliers found in unsupervised manner are labelled as attacks
Isolation Forest accuracy:  0.43 
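The ratio printed above divides the matches by the number of predicted outliers, which corresponds to what `sklearn.metrics` calls precision. The sketch below, on hypothetical toy labels, shows how this ratio can differ from accuracy, which also counts correctly predicted benign rows.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical ground truth and detector output for ten rows (1 = attack / outlier).
toy_labels = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
toy_outliers = np.array([1, 0, 0, 1, 1, 0, 0, 0, 0, 0])

# Same ratio as in the cell above: matches divided by the number of predicted outliers.
matches = int(((toy_outliers == 1) & (toy_labels == 1)).sum())
manual_ratio = matches / int(toy_outliers.sum())

toy_precision = precision_score(toy_labels, toy_outliers)  # TP / (TP + FP)
toy_recall = recall_score(toy_labels, toy_outliers)        # TP / (TP + FN)
toy_accuracy = accuracy_score(toy_labels, toy_outliers)    # correct rows / all rows

print(f'ratio={manual_ratio:.2f} precision={toy_precision:.2f} '
      f'recall={toy_recall:.2f} accuracy={toy_accuracy:.2f}')
# ratio=0.33 precision=0.33 recall=0.33 accuracy=0.60
```

On imbalanced data such as attack logs, accuracy is dominated by the benign rows, which is why precision and recall are usually more informative here.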


Then we compute the F1 score.

In [48]:
from sklearn.metrics import f1_score

# f1_score expects the true labels first, then the predictions
if_f1_score = f1_score(full_df['Label'],full_df['outliers'])
print(f'Isolation Forest F1 score: {if_f1_score:.2f}')

Isolation Forest F1 score: 0.41


With a precision of `0.43` (reported above as accuracy) and an F1 score of `0.41`, Isolation Forest is poorly suited to this dataset: fewer than half of the rows it flags are labelled as attacks.

In [49]:
full_df.to_csv('clean_swat.csv')