<h1 style="text-align: center">Notebook 3 - Cybersecurity Analytics</h1>

<h3>Prerequisites</h3>

- You must have R installed on your system (<a href="https://cran.r-project.org/bin/">Download: Follow "base" links</a>)
- You must have Jupyter installed on your system (<a href="https://jupyter.org/install">Download</a>)
- Some knowledge of R may be required

<h3>Explanation of Notebook 3</h3>

In this notebook, you'll explore two techniques used for intrusion detection:
+ K-Means Clustering (Unsupervised)
+ Decision Tree Classification (Supervised)

There will be different types of data visualisation involved in this notebook, and you must use them for the analysis.<br>
You will only use R throughout this notebook.

<h3>Getting started</h3>

You will require four packages for the scripts in this lab to work:
+ cluster
+ C50
+ e1071
+ ggplot2 

Run the cell below to install the packages.


In [18]:
install.packages(c("cluster","C50","e1071", "ggplot2"))

<h3>Overview</h3>

+ This notebook is based on the 'KDD-CUP-99' dataset, which simulates a typical U.S. Air Force LAN (Local Area Network).
+ The data covers 9 weeks of network traffic: 7 weeks of training data and 2 weeks of test data.
+ The raw training data is about 4GB of compressed binary TCP dump data, which was processed into approximately 5 million connection records.

<h4>What is a connection?</h4>

A connection is a sequence of TCP packets starting and ending at well-defined times, during which data flows between a source IP address and a target IP address under a well-defined protocol.<br>

+ Each connection is labelled either _normal_, or as an _attack_, with exactly one attack type.<br>
+ Each connection record consists of about 100 bytes.

<h4>Attacks</h4>

The two techniques will be used to analyse cybersecurity attacks, which fall into four main categories:
+ DOS - Denial Of Service
+ R2L - Unauthorised access from a remote machine
+ U2R - Unauthorised access to local superuser privileges
+ Probing - Surveillance and other probing

<h4>Features</h4>

There are 41 features in total. They are divided into two types, which together help distinguish normal connections from attacks:

+ Raw features (e.g. <code>flag</code>, <code>src_bytes</code>)
+ Derived 'higher-level' features (e.g. <code>serror_rate</code>, <code>count</code>)

Each feature has its own name, description, and type.

<h3>What is <i>k</i>-means clustering?</h3>

Clustering is a form of unsupervised learning. It is used to analyse unlabelled data, that is, data that has not been categorised or grouped, or about which little is known.

<i>k</i>-means is a clustering algorithm that is partition-based, and it is commonly used in intrusion detection due to the transparent analysis it provides of clustered data.

This means that it partitions the data into ___k___ groups, known as clusters, based on a similarity measure.<br>
Using the features, it iteratively assigns points to clusters so that each point is similar to the other points in its cluster and dissimilar to points in other clusters.

<i>k</i>-means produces two outputs:
+ The centroids/centre points of each cluster
+ The clustering assignments where each point is assigned to exactly one cluster

The figure below shows an example of <i>k</i>-means clustering:

![cluster.png](attachment:cluster.png)

On the left, you can see the raw data without any clusters; on the right, the data has been separated into 3 clusters, shown in <font color='red'>red</font>, <font color='green'>green</font>, and <font color='blue'>blue</font>, meaning that ___k___ is 3.
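
To make the two outputs concrete, here is a minimal sketch using base R's `kmeans()` function. The data is synthetic (three made-up groups of points, not the lab dataset), and the variable names `centres`, `pts`, and `fit` are chosen purely for illustration.

```r
# Synthetic 2-D data: three well-separated groups of 50 points each (illustrative only).
set.seed(42)
centres <- rbind(c(0, 0), c(5, 5), c(0, 5))
pts <- do.call(rbind, lapply(1:3, function(i) {
  cbind(x = rnorm(50, mean = centres[i, 1], sd = 0.5),
        y = rnorm(50, mean = centres[i, 2], sd = 0.5))
}))

# Run k-means with k = 3. nstart = 10 repeats the random initialisation
# and keeps the run with the lowest total within-cluster sum of squares.
fit <- kmeans(pts, centers = 3, nstart = 10)

fit$centers         # output 1: one row per centroid
table(fit$cluster)  # output 2: how many points were assigned to each cluster
```

`fit$cluster` holds one cluster number per row of `pts`, so every point belongs to exactly one cluster.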

<h3>Activity 1</h3>

1. Identify k by analysing the within-groups sum of squares
2. Perform k-means clustering with the identified k value
3. Can you identify any relationships between clusters and attacks?

Before proceeding with any of the three tasks, first you must load the installed libraries, which for now, will be 'cluster' and 'ggplot2'.

In [2]:
library(cluster)
library(ggplot2)
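
One common way to approach the first task is the "elbow" method: run k-means for a range of k values and plot the total within-groups sum of squares against k. The sketch below assumes a synthetic stand-in matrix called `demo` (not the lab data) and uses base graphics so that it runs without extra packages; the same curve could equally be drawn with ggplot2.

```r
# Hypothetical stand-in for the numeric features of the lab data:
# three synthetic groups, standardised so each column has mean 0 and sd 1.
set.seed(1)
demo <- scale(rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
))

# Total within-groups sum of squares for k = 1..8.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(demo, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Plot WSS against k; the point where the curve flattens suggests a value for k.
plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Within-groups sum of squares")
```

The within-groups sum of squares generally falls as k grows, so the aim is not to minimise it but to find the k beyond which adding clusters gives little further improvement.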

<h3>Decision Tree Classification</h3>

Decision trees construct tree-like structures using a series of boolean functions (e.g. "yes" or "no" questions based on the characteristics of a set of variables) until no more relevant branches can be derived.

New data items can then be classified by starting at the root node (the node at the very top) and moving down through the branches until a leaf node (a node without any child nodes) is reached and a classification is obtained.
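
The root-to-leaf walk can be sketched in a few lines of base R. The tree below is hand-written and hypothetical: the thresholds on `src_bytes` and `count` are made up for illustration and are not learned from the data, and the helper `classify()` is not part of any package.

```r
# A hypothetical two-level tree: internal nodes ask a yes/no question,
# leaf nodes carry a class label. Thresholds are made up for illustration.
tree <- list(
  test = function(rec) rec$src_bytes > 1000,
  yes  = list(label = "attack"),
  no   = list(
    test = function(rec) rec$count > 100,
    yes  = list(label = "attack"),
    no   = list(label = "normal")
  )
)

# Start at the root and follow the yes/no branches until a leaf is reached.
classify <- function(node, rec) {
  while (is.null(node$label)) {
    node <- if (node$test(rec)) node$yes else node$no
  }
  node$label
}

classify(tree, list(src_bytes = 200, count = 5))    # "normal"
classify(tree, list(src_bytes = 200, count = 250))  # "attack"
```

A learning algorithm such as C5.0 builds this kind of structure automatically, choosing the questions and thresholds from the training data.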


<h3>Activity 2</h3>

1. Build a C5.0 decision tree model on the training data.
2. What issues do you notice with the tree model?
3. Test the model on the test data. How does it perform?
4. Select subsets of features and rerun the classification.

To do these four tasks, you must load the 'C50' library (note that R package names are case-sensitive).

In [1]:
library(C50)
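
As a rough outline of the build-then-test workflow, the sketch below fits a C5.0 tree on R's built-in `iris` data, which is only a stand-in for the lab dataset. The 100/50 split and the names `train`, `test`, `model`, and `pred` are assumptions made for illustration.

```r
library(C50)

# iris is a stand-in for the lab data; the split below is illustrative.
set.seed(123)
train_idx <- sample(nrow(iris), size = 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit a C5.0 tree: x holds the predictor columns, y the class labels (a factor).
model <- C5.0(x = train[, 1:4], y = train$Species)
summary(model)  # prints the tree and its error on the training data

# Classify the held-out rows and compare predictions with the true labels.
pred <- predict(model, newdata = test[, 1:4])
table(predicted = pred, actual = test$Species)
mean(pred == test$Species)  # proportion classified correctly
```

To rerun the classification on a subset of features, pass fewer columns as `x` (and the same columns as `newdata`) and compare the resulting confusion tables.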