# Machine Learning Engineer Nanodegree 
## Capstone Proposal 
## Proposal: Network Intrusion Detection with Recurrent Neural Networks
Tobias Burri

May 28th 2017
 
### Domain Background (Cyber Security / Intrusion Detection) 
 
With an ever-growing number of connected devices, more and more traffic is transferred over networks, generating an overwhelming amount of log data that has to be monitored. As a result, manual monitoring is becoming increasingly infeasible as a means of preventing cyber attacks. Network Intrusion Detection Systems (NIDS) can help system administrators detect network breaches; however, setting up policies that are both flexible and effective against unforeseen attacks can be challenging. Applying machine learning to the analysis of log data can help to improve NIDS and thus strengthen the security posture of organizations.

I work for an IT consulting company with a wide range of expertise, including cyber security. So far, however, the application of machine learning to cyber security topics has been of little relevance in our daily work. With this project, I would like to promote ML techniques within my company for potential future projects.


----------



### Problem Statement 

Given a sequence of network connections between a source and a target IP, each represented by 41 features, the problem is to predict whether a connection is an attempt to attack the network and, if so, to predict the type of attack. The proposed solution is to model the network traffic as a time series using a long short-term memory (LSTM) recurrent neural network. Because every connection in the dataset is labeled, predictions can be checked directly against the ground truth, and setting a random seed makes the results reproducible.

----------


### Dataset and inputs

The dataset used for the capstone project will be the KDD Cup 1999 Data, which was prepared for the Third International Knowledge Discovery and Data Mining Tools Competition. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections.
The data was downloaded from the <a href="http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html">KDD Cup 1999 Data page</a>.
 
The KDD dataset was created by collecting raw TCP dump data from a simulated U.S. Air Force local-area network over the course of 9 weeks. During the data collection, the network was deliberately attacked multiple times. Each datapoint represents a connection between a source and a target IP, using a well-defined protocol, during a well-defined time frame. Each datapoint is made up of 41 features and is labeled either as a normal connection or as an attack.

Some of the features are derived directly from the TCP/IP connections during a time interval; however, the dataset also includes "higher-level" features that were derived from some of the basic features.

- The **'basic'** set of features includes inputs like the duration of the connection (in seconds), the protocol type, and the number of bytes sent from the source to the destination and vice versa.
- The **'higher-level'** features include inputs like the number of connections to the same host as the current connection in the past two seconds, the number of connections that use the same service, and the number of failed login attempts.
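
To make the record layout concrete, the sketch below splits one comma-separated record into its 41 feature fields and its label. The sample row is illustrative only (written in the layout of the public KDD Cup 1999 files), and the helper name `split_record` is an assumption of this sketch rather than part of the planned project code.

```python
import csv
import io

# Illustrative record: 41 feature fields followed by the label.
SAMPLE = (
    "0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,"
    "8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,"
    "9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.\n"
)

def split_record(row):
    """Split a parsed CSV row into (features, label).

    Assumption: labels carry a trailing '.', which is stripped here.
    """
    if len(row) != 42:
        raise ValueError("expected 41 features plus a label, got %d fields" % len(row))
    return row[:41], row[41].rstrip(".")

features, label = split_record(next(csv.reader(io.StringIO(SAMPLE))))

# The first six fields are 'basic' features of the connection.
duration, protocol_type, service, flag, src_bytes, dst_bytes = features[:6]
```

All fields are still strings at this point; converting them to numbers is part of the data preparation step.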

There is a total of 38 types of attacks, grouped into 4 categories:

- *DOS* (denial of service) e.g. syn flood
- *R2L* (unauthorized access from a remote machine) e.g. guessing password
- *U2R* (unauthorized access to local superuser privileges)  e.g. various 'buffer overflow' attacks
- *PROBING* (surveillance and other probing) e.g. port scanning

Not all of the attack types that occur in the test set occur in the training set. This is a deliberate characteristic of the dataset, intended to make the task more realistic. However, it is believed that most of the novel attacks are variants of known attacks (see Tavallaee et al., 2009). Every attack type will be labeled with the category it belongs to.
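
A minimal sketch of this relabeling is shown below. The lookup table is deliberately partial and lists only some well-known attack names; the names `ATTACK_CATEGORY`, `CLASSES` and `to_class_index` are assumptions of this sketch. The class order follows the 0=NORMAL, 1=PROBE, 2=DOS, 3=U2R, 4=R2L convention used for the benchmark table.

```python
# Partial mapping from attack name to category (assumption: the real
# mapping must cover every label found in the training and test files).
ATTACK_CATEGORY = {
    "smurf": "dos", "neptune": "dos", "back": "dos",
    "teardrop": "dos", "pod": "dos", "land": "dos",
    "satan": "probe", "ipsweep": "probe", "portsweep": "probe", "nmap": "probe",
    "guess_passwd": "r2l", "ftp_write": "r2l", "imap": "r2l",
    "phf": "r2l", "warezmaster": "r2l",
    "buffer_overflow": "u2r", "rootkit": "u2r", "loadmodule": "u2r", "perl": "u2r",
}

CLASSES = ["normal", "probe", "dos", "u2r", "r2l"]

def to_class_index(raw_label):
    """Map a raw label such as 'smurf.' to its class index.

    Unknown attack names raise KeyError, so gaps in the table are
    noticed instead of being silently mislabeled.
    """
    name = raw_label.rstrip(".")
    category = "normal" if name == "normal" else ATTACK_CATEGORY[name]
    return CLASSES.index(category)
```

Failing loudly on unknown names matters here because the test set contains attack types that never appear in the training set.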

----------

### Solution Statement

To keep the work reproducible, the solution will be implemented entirely in a Python notebook. To classify malicious connections, an LSTM recurrent neural network will be implemented and tuned. The model will be written entirely with the TensorFlow 1.0.0 Python library. Additional libraries such as numpy, sklearn or matplotlib will be used as needed for data preparation, exploration and visualisation. Once the basic model is in place, it will be improved by experimenting with hyperparameter optimization, regularization techniques and feature selection. The model's performance will be measured by how well it classifies the different attack types in comparison to the winning model of the 1999 KDD competition.

----------

### Benchmark Model

The benchmark model for this capstone project will be the winning model used in the 1999 Data Mining and Knowledge Discovery competition (KDD Cup). The model applied was an ensemble of C5 decision trees. See (Pfahringer, 2000) for a more detailed overview of how this model was implemented.

This model resulted in the following confusion matrix (Taken from Elkan, 2000): 

<table style="width:50%; text-align: center;">
  <tr>
    <th>Predicted → </th>
    <th>0</th>
    <th>1</th> 
    <th>2</th>
    <th>3</th>
    <th>4</th>
    <th>%Correct</th>
  </tr>
  <tr>
    <th>Actual ↓</th>
    <th></th>
    <th></th> 
    <th></th>
    <th></th>
    <th></th>
    <th></th>
  </tr>
  <tr>
    <th>0</th>
    <td style='background-color:#E6E6E6'>60262</td> 
    <td style='background-color:#E6E6E6'>243</td>
    <td style='background-color:#E6E6E6'>78</td>
    <td style='background-color:#E6E6E6'>4</td>
    <td style='background-color:#E6E6E6'>6</td>
    <td>99.5%</td>
  </tr>
  <tr>
    <th>1</th>
    <td style='background-color:#E6E6E6'>511</td> 
    <td style='background-color:#E6E6E6'>3471</td>
    <td style='background-color:#E6E6E6'>184</td>
    <td style='background-color:#E6E6E6'>0</td>
    <td style='background-color:#E6E6E6'>0</td>
    <td>83.3%</td>
  </tr>
    <tr>
    <th>2</th>
    <td style='background-color:#E6E6E6'>5299</td> 
    <td style='background-color:#E6E6E6'>1328</td>
    <td style='background-color:#E6E6E6'>223226</td>
    <td style='background-color:#E6E6E6'>0</td>
    <td style='background-color:#E6E6E6'>0</td>
    <td>97.1%</td>
  </tr>
    <tr>
    <th>3</th>   
    <td style='background-color:#E6E6E6'>168</td> 
    <td style='background-color:#E6E6E6'>20</td>
    <td style='background-color:#E6E6E6'>0</td>
    <td style='background-color:#E6E6E6'>30</td>
    <td style='background-color:#E6E6E6'>10</td>
    <td>13.2%</td>
  </tr>
    <tr>
    <th>4</th>
    <td style='background-color:#E6E6E6'>14527</td> 
    <td style='background-color:#E6E6E6'>294</td>
    <td style='background-color:#E6E6E6'>0</td>
    <td style='background-color:#E6E6E6'>8</td>
    <td style='background-color:#E6E6E6'>1360</td>
    <td>8.4%</td>
  </tr>
      <tr>
    <th>%correct</th>
    <td>74.6%</td> 
    <td>64.8%</td>
    <td>99.9%</td>
    <td>71.4%</td>
    <td>98.8%</td>
    <td></td>
  </tr>
   <caption>0=NORMAL;   1=PROBE;   2=DOS;  3=U2R;   4=R2L </caption>
</table>


----------
  

### Evaluation Metrics

The following metrics will be used to evaluate the performance of the applied model in comparison to the benchmark model:

- Accuracy: The proportion of correctly classified datapoints (true positives and true negatives) among all datapoints examined.

- True Positive Rate: The ratio between the number of attacks correctly categorized as attacks and the total number of attacks.

- False Positive Rate: The ratio between the number of normal connections wrongly categorized as attacks and the total number of normal connections.

- Precision: The number of true positives divided by the sum of true positives and false positives.

- Recall: The number of true positives divided by the sum of true positives and false negatives. For a single positive class, this is the same quantity as the true positive rate.


The evaluation metrics will be computed for the complete test dataset as well as for every connection type (NORMAL, PROBE, DOS, U2R, R2L).
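
The definitions above can be sketched as functions of a confusion matrix. This is a minimal sketch with numpy, assuming the convention that rows hold the actual class and columns the predicted class; the function names and the 3-class toy matrix are hypothetical and serve only as illustration.

```python
import numpy as np

def per_class_metrics(cm):
    """One-vs-rest metrics; cm[i, j] = samples of actual class i predicted as j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # actual class i, predicted as another class
    fp = cm.sum(axis=0) - tp   # predicted class i, actually another class
    tn = cm.sum() - tp - fn - fp
    with np.errstate(divide="ignore", invalid="ignore"):
        return {
            "accuracy": tp.sum() / cm.sum(),   # scalar, over all classes
            "recall": tp / (tp + fn),          # per class, equals the TPR
            "precision": tp / (tp + fp),
            "fpr": fp / (fp + tn),
        }

def attack_detection_rates(cm, normal=0):
    """Attack-vs-normal view: an attack predicted as any attack class counts as detected."""
    cm = np.asarray(cm, dtype=float)
    attacks = [i for i in range(cm.shape[0]) if i != normal]
    tpr = cm[np.ix_(attacks, attacks)].sum() / cm[attacks, :].sum()
    fpr = cm[normal, attacks].sum() / cm[normal, :].sum()
    return tpr, fpr

# Hypothetical 3-class matrix (class 0 = normal), for illustration only.
toy = [[8, 1, 1],
       [2, 6, 2],
       [0, 1, 9]]
metrics = per_class_metrics(toy)
tpr, fpr = attack_detection_rates(toy)
```

Keeping the per-class view and the attack-vs-normal view separate makes it possible to report both how well attacks are detected at all and how well each category is identified.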

----------


### Project Design

The main framework that will be used for the implementation of the capstone project is the TensorFlow Python library. Additional Python libraries such as numpy, sklearn or matplotlib will be used for data transformation and exploration.

The workflow for solving the stated problem will have the following structure:


**1. Initial Setup**

 - Ensure that all employed software tools meet the Udacity project requirements
 - Install all required libraries
 - Set up the Jupyter notebook and import the libraries
 - Read in the data

**2. Data Preparation**
 - Data cleaning
 
    - *Since the dataset was preprocessed for a data mining competition, it does not exhibit missing entries or any
     other grave impurities. Therefore this step can be skipped.*

 - Transformation of the data so that it can be fed to the neural network, by writing helper functions to:

    - *Map the 38 attack types to the five groups (normal, dos, probe, r2l, u2r).*

    - *Cast labels and every other categorical feature that is encoded as a string into a float.*

    - *Rescale features to values between 0 and 1 by subtracting the feature's minimum from every observation
    and dividing the result by the feature's range (maximum minus minimum).*

    - *Binarize labels into a vector of size 5 (attack type 3 ==> [0,0,0,1,0]).*
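
The helper functions listed above could look roughly like the following numpy sketch. The function names, the optional `x_min`/`x_max` arguments and the handling of constant columns are assumptions of this sketch, not decisions fixed by the proposal.

```python
import numpy as np

def encode_categories(values):
    """Map string categories (e.g. protocol types) to floats, in sorted order."""
    lookup = {v: float(i) for i, v in enumerate(sorted(set(values)))}
    return np.array([lookup[v] for v in values]), lookup

def min_max_scale(x, x_min=None, x_max=None):
    """Rescale every column to [0, 1] via (x - min) / (max - min).

    Passing the training-set min/max allows test data to be scaled the
    same way; test values outside the training range then fall outside [0, 1].
    """
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0) if x_min is None else np.asarray(x_min, dtype=float)
    x_max = x.max(axis=0) if x_max is None else np.asarray(x_max, dtype=float)
    # Constant columns get a span of 1 so they map to 0 instead of dividing by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (x - x_min) / span

def one_hot(labels, num_classes=5):
    """Turn integer class indices into one-hot rows, e.g. 3 -> [0, 0, 0, 1, 0]."""
    return np.eye(num_classes)[np.asarray(labels, dtype=int)]

protocols, lookup = encode_categories(["tcp", "udp", "icmp", "tcp"])
scaled = min_max_scale([[0, 5], [10, 5], [5, 5]])
encoded = one_hot([3])
```

Computing the minimum and maximum on the training set only, and reusing them for the test set, avoids leaking information from the test data into the preprocessing.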

**3. Evaluation Metrics**
 
 - Provide functions to compute evaluation metrics:
   
   - *Accuracy*
   - *True Positive Rate*
   - *False Positive Rate*
   - *Precision*
   - *Recall*

**4. Basic Neural Network**
 - Set up a basic feed-forward neural network to ensure that features and labels are correctly formatted
 - Run it to ensure that the model and the evaluation metrics are correctly implemented


**5. Recurrent Neural Network - LSTM**
 - Extend the model architecture to meet the requirements of an LSTM RNN implementation
 - Reshape the input features to fit the model requirements
 - Run it to ensure that the model and the evaluation metrics are correctly implemented
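
One possible way to reshape the inputs is to cut the ordered connection log into overlapping windows, giving a 3-D array of shape (batch, time steps, features), a layout recurrent layers commonly expect. The sketch below assumes that each window is labeled with the class of its last connection; the window length and this labeling rule are assumptions of the sketch, and `make_windows` is a hypothetical helper name.

```python
import numpy as np

def make_windows(features, labels, time_steps):
    """Build overlapping windows of consecutive connections.

    features: (n, d) array in log order; labels: (n, k) one-hot array.
    Returns x with shape (n - time_steps + 1, time_steps, d) and y with the
    label of the last connection of every window.
    """
    features = np.asarray(features)
    labels = np.asarray(labels)
    n = features.shape[0]
    if time_steps < 1 or time_steps > n:
        raise ValueError("time_steps must be between 1 and the number of rows")
    # Row i of idx holds the indices i, i + 1, ..., i + time_steps - 1.
    idx = np.arange(time_steps)[None, :] + np.arange(n - time_steps + 1)[:, None]
    return features[idx], labels[time_steps - 1:]

# Toy data: 6 connections with 2 features each, 5 classes.
toy_features = np.arange(12).reshape(6, 2)
toy_labels = np.eye(5)[[0, 0, 2, 2, 0, 4]]
x, y = make_windows(toy_features, toy_labels, time_steps=3)
```

With `time_steps=1` the windows reduce to single connections, which gives a direct comparison point against the feed-forward network from step 4.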


**6. Improve the model through experiments**

  - Hyperparameter Optimization
    
    - *Batch Size*
    - *Number of LSTM cells*
    - *Number of Hidden Units*
 
  - Regularization 
    - *L1* 
    - *L2* 
    - *Dropout* 
    
  - Feature Selection
    - *Recursive Feature Elimination*
    - *Feature Importance*
    
    
**7. Documentation of Results**
 
  - Plot learning curves for different attack types
  - Compare Benchmark performance versus Deep Learning performance via confusion matrix / evaluation metrics
  - Give a final report on the results


--------------------------------

[1] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, "A Detailed Analysis of the KDD CUP 99 Data Set", submitted to the Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.

[2] B. Pfahringer, "Winning the KDD99 classification cup: bagged boosting", ACM SIGKDD Explorations Newsletter, Volume 1, Issue 2, January 2000.

[3] C. Elkan, "Results of the KDD'99 classifier learning", ACM SIGKDD Explorations Newsletter, Volume 1, Issue 2, January 2000.