# Task 1 - Anomaly Detection
## Student ID1: 345174478
## Student ID2: 326876786

#### In this assignment we will be using the Isolation Forest method to detect anomalies among the given dataset. 
#### In the following report, we explore the data, answer the assignment questions, and train and test our model.

In [None]:
!pip3 install oletools

In [None]:
# Imports
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings

In [None]:
# File path. This relative path works as-is on Linux; on Windows, backslashes in a path must be escaped ("\\").
f_path = "conn_attack.csv"
'''
record ID - The unique identifier for each connection record.
duration_ - The number of seconds (rounded) of the connection. For example, a connection lasting 0.17s or 0.3s is recorded as "0" in this field.
src_bytes - The number of data bytes transferred from the source to the destination (i.e., the amount of outgoing bytes from the host).
dst_bytes - The number of data bytes transferred from the destination to the source (i.e., the amount of bytes received by the host).
'''
df = pd.read_csv(f_path,names=["record ID","duration_", "src_bytes","dst_bytes"], header=None)

# Data exploration

### Here we have explored the data in order to gain a further understanding of the features. 
##### Comments on what we learned from this are written throughout the notebook.

In [None]:
#Relationship with numerical variables
var = 'record ID'
data = pd.concat([df['src_bytes'], df[var]], axis=1)
data.plot.scatter(x=var, y='src_bytes', ylim=(0,60000)); 

In [None]:
#Relationship with numerical variables
var = 'record ID'
data = pd.concat([df['dst_bytes'], df[var]], axis=1)
data.plot.scatter(x=var, y='dst_bytes', ylim=(0,800000)); 

In [None]:
#Relationship with numerical variables
var = 'record ID'
data = pd.concat([df['duration_'], df[var]], axis=1)
data.plot.scatter(x=var, y='duration_', ylim=(0,1600)); 

##### Plotting the record ID against the src_bytes, dst_bytes and duration_ features makes it simpler to see which instances are anomalous. Record ID gives the model no additional information about the data, so it can be disregarded as a feature when training the model. However, each graph shows outliers, meaning anomalies can be detected through any one feature alone (src_bytes, dst_bytes or duration_), and thus each of these features is vital for the model.

In [None]:
sns.distplot(df['dst_bytes'])
print("Skewness: %f" % df['dst_bytes'].skew())
print("Kurtosis: %f" % df['dst_bytes'].kurt())

In [None]:
sns.distplot(df['duration_'])
print("Skewness: %f" % df['duration_'].skew())
print("Kurtosis: %f" % df['duration_'].kurt())

In [None]:
sns.distplot(df['src_bytes'])
print("Skewness: %f" % df['src_bytes'].skew())
print("Kurtosis: %f" % df['src_bytes'].kurt())

##### The skewness tells us that most data points have low src_bytes, dst_bytes and duration_ values. So when we train our model using Isolation Forest, data points with high values for these features will be isolated more quickly.
##### The skewness of each feature tells us the direction of the outliers, while the kurtosis describes how heavy the tails are. Since all three features are positively skewed, most of the outliers will be on the right side of the distribution. Neither measure tells us the number of outliers, only the direction and the weight of the tails.
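To illustrate how positive skewness points to outliers on the right side, here is a minimal sketch on synthetic data. The exponential sample and the 3-standard-deviation cutoff are assumptions chosen for illustration; they are not taken from the assignment dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (NOT the assignment data):
# many small values and a few very large ones.
rng = np.random.default_rng(0)
sample = pd.Series(rng.exponential(scale=100.0, size=10_000))

skewness = sample.skew()
kurtosis = sample.kurt()

# Share of points more than 3 standard deviations above / below the mean.
# The 3-sigma cutoff is an illustrative choice, not a rule from the report.
upper_tail = (sample > sample.mean() + 3 * sample.std()).mean()
lower_tail = (sample < sample.mean() - 3 * sample.std()).mean()

print("Skewness: %f" % skewness)
print("Kurtosis: %f" % kurtosis)
print("Share above mean + 3 std: %f" % upper_tail)
print("Share below mean - 3 std: %f" % lower_tail)
```

With a positively skewed sample like this one, the extreme values sit on the high side only, which is the direction an Isolation Forest would isolate first.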


In [None]:
df.corr() 

In [None]:
# Increase the size of the heatmap.
plt.figure(figsize=(16, 6))
# Store heatmap object in a variable to easily access it when you want to include more features (such as title).
# Set the range of values to be displayed on the colormap from -1 to 1, and set the annotation to True to display the correlation values on the heatmap.
heatmap = sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True)
# Give a title to the heatmap. fontdict sets the font size of the title.
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12});

##### The correlation heatmap shows the correlation between any two features. Training a model on features that are heavily correlated is redundant, so a correlation map is usually helpful for removing one feature from each pair of correlated features. However, we learn from this heatmap that our features are not highly correlated, so we keep all three.
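The idea of dropping one feature from a highly correlated pair can be sketched as follows. This is an illustration on a made-up DataFrame; the column names and the 0.9 threshold are hypothetical and are not part of the assignment.

```python
import numpy as np
import pandas as pd

# Toy data (NOT the assignment data): "a_copy" is almost a rescaled copy of "a",
# while "b" is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=1000),
    "b": rng.normal(size=1000),
})

corr = toy.corr().abs()

# Keep only the upper triangle so every pair of features is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

THRESHOLD = 0.9  # hypothetical cutoff for "heavily correlated"
redundant = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]

print("Redundant features:", redundant)
reduced = toy.drop(columns=redundant)
print("Remaining features:", list(reduced.columns))
```

If no pair exceeds the threshold, as with the three features in this report, the `redundant` list is empty and nothing is dropped.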

# Choosing an Unsupervised Model
##### After exploring the data, we decided on the unsupervised model Isolation Forest. Since the dataset file does not contain labels (we were only given labels in a separate file to check our work), the group of unsupervised algorithms is suitable for this task.

# Why Isolation Forest?
##### Isolation Forest identifies anomalous data points by isolating them rather than by modeling the normal ones. After examining the data and noticing clear outliers in the graphs, we realized that instead of constructing a profile of what is "normal" and then reporting anything that does not fit it as anomalous, our algorithm should explicitly isolate anomalous points in the dataset. The model partitions the data in a tree structure using randomly selected features and randomly selected split values. All three features add context to the model and are thus helpful when trying to isolate data points through cuts. Data points that end up deep in the trees are less likely to be anomalies, since more cuts were required to isolate them. Conversely, data points on short branches are likely anomalies, as it was easier for the tree to separate them from the other data points. We can observe that our model is sufficient by examining the confusion matrix, accuracy score and recall score.
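The isolation idea described above can be seen in a small sketch on synthetic data. The Gaussian cluster, the single far-away point and the parameter values are assumptions for illustration only; they are not the assignment data or the settings used later in this notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (NOT the assignment data): a tight cluster of "normal"
# points plus one point placed far away from it.
rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
outlier = np.array([[15.0, 15.0, 15.0]])
X = np.vstack([normal_points, outlier])

# Illustrative settings; random_state is fixed only to make the sketch repeatable.
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
forest.fit(X)

# score_samples returns lower values for points that are isolated
# after fewer random cuts, i.e. points on shorter branches.
scores = forest.score_samples(X)
outlier_score = scores[-1]
median_inlier_score = np.median(scores[:-1])

print("outlier score:", outlier_score)
print("median inlier score:", median_inlier_score)
```

The far-away point receives a noticeably lower score than a typical cluster point, reflecting that it takes fewer cuts to separate it.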

# Training The Model
##### In this section, we train and test the model and validate our results against the labels file. Isolation Forest works as an ensemble of isolation trees. We chose 500 base estimators using 256 samples each. Comments are written throughout the code, along with a confusion matrix at the bottom.

In [None]:
DATA_PATH = "conn_attack.csv"
df = pd.read_csv(DATA_PATH, header=None, names=["record ID","duration_", "src_bytes","dst_bytes"])

In [None]:
data = df.drop(columns=["record ID"]).copy() # removing record ID from the feature list

In [None]:
data # showing that record ID was indeed dropped


In [None]:
%%time
from sklearn.ensemble import IsolationForest

# max features is 3 since all 3 features are useful. 
model = IsolationForest(contamination=0.004, n_estimators=500, max_samples=256, max_features=3)
model.fit(data.values) 

In [None]:
# testing the model on the dataset and adding prediction as column
df["is_anomaly?"] = pd.Series(model.predict(data.values))
df["is_anomaly?"] = df["is_anomaly?"].map({1: 0, -1: 1}) # instead of 1:normal -1:anomaly, we mapped to 0:normal 1:anomaly
print(df["is_anomaly?"].value_counts())

In [None]:
results = df.drop(["duration_","src_bytes","dst_bytes"], axis=1) # dropping the features

In [None]:
# showing the prediction results
results

In [None]:
results.to_csv("conn_attack_iforest_pred.csv", index=False) # output file with prediction column

In [None]:
# validating the model against the labels given
PATH_TO_LABELS = 'conn_attack_anomaly_labels.csv'
data_labels = pd.read_csv(PATH_TO_LABELS, header=None, names=["record ID","label"])

## Confusion Matrix, Accuracy and Recall

In [None]:
from sklearn.metrics import confusion_matrix
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
confusion_matrix(data_labels["label"], df["is_anomaly?"], labels=[0,1])

In [None]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(data_labels["label"], df["is_anomaly?"])
print("accuracy score: {0:.2f}%".format(accuracy*100))

In [None]:
from sklearn.metrics import recall_score
recall = recall_score(data_labels["label"], df["is_anomaly?"])
print("recall score: {0:.2f}%".format(recall*100))

# Summary 
##### Our model has an accuracy score of 99.97% and a recall score of 96.87%. The high accuracy shows that the model labels almost all records correctly, although with so few anomalies in the dataset, accuracy is dominated by the normal class and should be read together with the confusion matrix. Because we are dealing with anomaly detection, recall is the more important measure. Recall is the number of correctly predicted anomalies divided by the total number of truly anomalous data points. Emphasizing a high recall rate means keeping the number of false negatives as low as possible. This is important in anomaly detection because undetected anomalous data points can develop into cyber attacks. Therefore, we consider our model satisfactory.
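To make the relationship between the confusion matrix, precision and recall concrete, here is a minimal sketch on hand-made labels. The `y_true` and `y_pred` lists are hypothetical toy values, not the results of our model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 0 = normal, 1 = anomaly.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# With labels=[0, 1], ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Precision: of everything flagged as anomalous, how much really was.
precision = precision_score(y_true, y_pred)
# Recall: of everything truly anomalous, how much was flagged.
recall = recall_score(y_true, y_pred)

print("tn=%d fp=%d fn=%d tp=%d" % (tn, fp, fn, tp))
print("precision: {0:.2f}%".format(precision * 100))
print("recall: {0:.2f}%".format(recall * 100))
```

A missed anomaly shows up as a false negative and lowers recall, while a false alarm shows up as a false positive and lowers precision.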
### Link to Github : https://github.com/Rashipachino/Anomaly_Detection.git
### How to run : 
#### 1. Open terminal at folder containing the Dockerfile. 
#### 2. Run the following commands


In [None]:
!docker build -t iforest .

In [None]:
!docker run -t -d -p 8080:8080 iforest

The server will be running at: http://0.0.0.0:8080