# Intrusion Detection Learning

## Cybersecurity

Cyber security is the set of technologies and processes designed to protect computers, networks, programs, and data from attack, unauthorized access, change, or destruction. Cyber security systems are composed of network security systems and computer (host) security systems. Each of these has, at a minimum, a firewall, antivirus software, and an intrusion detection system (IDS). IDSs help discover, determine, and identify unauthorized use, duplication, alteration, and destruction of information systems. Security breaches include external intrusions (attacks from outside the organization) and internal intrusions (attacks from within the organization).

There are three main types of cyber analytics in support of IDSs: **misuse-based** (sometimes also called signature-based), **anomaly-based**, and **hybrid**. Misuse-based techniques are designed to detect known attacks by using signatures of those attacks. They are effective at detecting known types of attacks without generating an overwhelming number of false alarms. However, they require frequent manual updates of the database of rules and signatures, and they cannot detect novel (zero-day) attacks.
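
To make the misuse-based idea concrete, here is a minimal sketch of signature matching. The signature names and the byte threshold are hypothetical and chosen only for illustration; the field names (`land`, `protocol_type`, `src_bytes`) follow the KDD-CUP-99 feature list used later in this notebook.

```python
# Hypothetical signature database: signature name -> rule over one connection record.
# Real IDS rule sets are far larger and are maintained by hand.
SIGNATURES = {
    "land": lambda c: c["land"] == 1,
    "oversized_icmp": lambda c: c["protocol_type"] == "icmp" and c["src_bytes"] > 1000,
}

def match_signatures(conn):
    """Return the names of all signatures that fire for this connection."""
    return [name for name, rule in SIGNATURES.items() if rule(conn)]

known_attack = {"protocol_type": "tcp", "src_bytes": 0, "land": 1}
novel_attack = {"protocol_type": "udp", "src_bytes": 40, "land": 0}

print(match_signatures(known_attack))  # a stored signature fires
print(match_signatures(novel_attack))  # nothing fires: no signature exists yet
```

The second call returns an empty list, which is the zero-day blind spot described above: a connection that matches no stored signature passes undetected.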

Anomaly-based techniques model the normal network and system behavior, and identify anomalies as deviations from normal behavior. They are appealing because of their ability to detect zero-day attacks. Another advantage is that the profiles of normal activity are customized for every system, application, or network, thereby making it difficult for attackers to know which activities they can carry out undetected. Additionally, the data on which anomaly-based techniques alert (novel attacks) can be used to define the signatures for misuse detectors. The main disadvantage of anomaly-based techniques is the potential for high false alarm rates (FARs) because previously unseen (yet legitimate) system behaviors may be categorized as anomalies.
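
The trade-off can be sketched with a very simple anomaly detector: a profile of normal behaviour (mean and standard deviation of one feature) and a rule that flags large deviations. The numbers below are toy values, and the three-sigma threshold is an assumption for illustration, not a recommendation. The false alarm rate is computed here as FP / (FP + TN).

```python
from statistics import mean, stdev

def fit_profile(normal_values):
    """Summarise normal behaviour as (mean, sample standard deviation)."""
    return mean(normal_values), stdev(normal_values)

def is_anomaly(value, profile, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = profile
    return abs(value - mu) > threshold * sigma

# Toy src_bytes values observed during normal operation (illustrative numbers)
normal_src_bytes = [210, 190, 230, 205, 215, 198, 222, 208]
profile = fit_profile(normal_src_bytes)

# Labelled evaluation traffic: (src_bytes, is_attack)
evaluation = [(212, False), (204, False), (300, False), (5000, True)]
flags = [is_anomaly(value, profile) for value, _ in evaluation]

# False alarm rate = legitimate connections flagged / all legitimate connections
false_positives = sum(flag and not attack for flag, (_, attack) in zip(flags, evaluation))
true_negatives = sum(not flag and not attack for flag, (_, attack) in zip(flags, evaluation))
false_alarm_rate = false_positives / (false_positives + true_negatives)
print(flags, false_alarm_rate)
```

The attack (5000 bytes) is caught without any signature, but the legitimate 300-byte connection is also flagged because nothing like it appeared in the normal profile.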

Hybrid techniques combine misuse and anomaly detection. They are employed to raise detection rates of known intrusions and decrease the false positive (FP) rate for unknown attacks. An in-depth review of the literature did not discover many pure anomaly detection methods; most of the methods were really hybrid. Therefore, in the descriptions of machine learning (ML) and data mining (DM) methods, the anomaly detection and hybrid methods are described together.

Another division of IDSs is based on where they look for intrusive behavior: network-based or host-based. A network-based IDS identifies intrusions by monitoring traffic through network devices. A host-based IDS monitors process and file activities related to the software environment associated with a specific host. [[1]](http://ieeexplore.ieee.org/document/7307098/)

## Machine Learning

An ML approach usually consists of two phases: training and testing. Often, the following steps are performed:

- Identify class attributes (features) and classes from training data.
- Identify a subset of the attributes necessary for classification (i.e., dimensionality reduction).
- Learn the model using training data.
- Use the trained model to classify the unknown data.

In reality, for most ML methods, there should be three phases, not two: training, validation, and testing. ML and DM methods often have parameters such as the number of layers and nodes for an ANN. After the training is complete, there are usually several models (e.g., ANNs) available. To decide which one to use and have a good estimation of the error it will achieve on a test set, there should be a third separate data set, the validation data set. The model that performs the best on the validation data should be the model used, and should not be fine-tuned depending on its accuracy on the test data set. Otherwise, the accuracy reported is optimistic and might not reflect the accuracy that would be obtained on another test set similar to but slightly different from the existing test set.
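
A minimal sketch of the three-phase protocol follows, using only the standard library. The data are synthetic, the "model" is a one-feature threshold classifier, and the hyperparameter `n_cuts` (how many candidate thresholds training may try) is a hypothetical stand-in for choices such as the number of layers in an ANN.

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    return rows[n_test + n_val:], rows[n_test:n_test + n_val], rows[:n_test]

def accuracy(threshold, rows):
    """Fraction of rows where the rule `x > threshold` agrees with the label."""
    return sum((x > threshold) == label for x, label in rows) / len(rows)

def train_stump(rows, n_cuts):
    """Training: pick the best of n_cuts evenly spaced thresholds on the given rows."""
    cuts = [100 * (i + 1) / (n_cuts + 1) for i in range(n_cuts)]
    return max(cuts, key=lambda t: accuracy(t, rows))

# Synthetic data: the label is True exactly when x > 50
rng = random.Random(1)
data = [(x, x > 50) for x in (rng.uniform(0, 100) for _ in range(500))]
train, val, test = three_way_split(data)

# Train one model per hyperparameter value, choose on validation data only
models = {n_cuts: train_stump(train, n_cuts) for n_cuts in (2, 4, 9)}
best_n_cuts = max(models, key=lambda n: accuracy(models[n], val))

# The test set is touched once, after the choice is final
test_accuracy = accuracy(models[best_n_cuts], test)
print(best_n_cuts, models[best_n_cuts], test_accuracy)
```

Because the test rows play no part in choosing `best_n_cuts`, `test_accuracy` is not inflated by the selection step.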

## Data Mining

The CRISP-DM model illustrates the phases and paradigms commonly used by DM experts to solve problems. The model is composed of the following six phases:

- Business understanding: Defining the DM problem shaped by the project requirements.
- Data understanding: Data collection and examination.
- Data preparation: All aspects of data preparation to reach the final dataset.
- Modeling: Applying DM and ML methods and optimizing parameters to fit the best model.
- Evaluation: Evaluating the method with appropriate metrics to verify that business goals are reached.
- Deployment: Varies from submitting a report to a full implementation of the data collection and modeling framework.

Usually, the data analyst carries out the phases up to deployment, while the customer performs the deployment phase.

![alt text](image1.png)


## Dataset

Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders.  The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections.
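
As a small illustration of the task, the sketch below reduces the raw connection labels to the good/bad distinction described above. It assumes, as in the KDD-CUP-99 files, that normal traffic is labelled `normal.` and every other label names an attack; the sample labels are made up.

```python
from collections import Counter

# Toy labels in the KDD-CUP-99 style (note the trailing dot)
raw_labels = ["normal.", "smurf.", "neptune.", "normal.", "back."]

# True marks a bad connection (any attack), False a good one
is_attack = [label != "normal." for label in raw_labels]

print(Counter(is_attack))
```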


All the files were downloaded from [KDD-CUP-99](http://archive.ics.uci.edu/ml/databases/kddcup99/task.html). The download contains the datasets, the feature names, and the task description; the directory listing further below shows them.

- UserID (Doctor (type of doctor), Nurse, Administrative, other)
- PatientID (where is the patient, what is his/her illness)
- Type of connection
- Time of connection

In [5]:
!ls

KDD-CUP-99 Task Description.html
Untitled.ipynb
corrected.gz
image1.png
kdd.ipynb
kddcup.data.gz
kddcup.data_10_percent.gz
kddcup.names
kddcup.newtestdata_10_percent_unlabeled.gz
kddcup.testdata.unlabeled.gz
kddcup.testdata.unlabeled_10_percent.gz
training_attack_types.txt
typo-correction.txt
yelp_exploration.ipynb


### Load training data

In [6]:
import pandas as pd
import numpy  as np

In [7]:
properties = pd.read_table('kddcup.names',sep=':',skiprows=1,header=None)
names      = list(properties[0])
names.append('target')
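
To show why `skiprows=1` and `sep=':'` are used, here is a sketch that parses a small in-memory sample laid out like `kddcup.names`: the first line lists the attack labels, and each later line has the form `name: type.`. The sample content is abbreviated for illustration, and the `toy_` variable names are hypothetical so that they do not overwrite the notebook's own variables.

```python
import io
import pandas as pd

# Abbreviated stand-in for kddcup.names (assumed layout, not the full file)
sample_file = io.StringIO(
    "back,neptune,normal,smurf.\n"
    "duration: continuous.\n"
    "protocol_type: symbolic.\n"
    "service: symbolic.\n"
)

# Skip the label line, then split each remaining line into name and type
toy_properties = pd.read_csv(sample_file, sep=":", skiprows=1, header=None)

toy_names = list(toy_properties[0]) + ["target"]
toy_kinds = toy_properties[1].str.strip().str.rstrip(".").tolist()

print(toy_names)
print(toy_kinds)
```

The type column distinguishes `symbolic` (categorical) features from `continuous` ones, which is useful when deciding which columns need encoding.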

In [8]:
X_train         = pd.read_table('kddcup.data.gz',header=None,sep=',')
X_train.columns = names
X_train.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
0,0,tcp,http,SF,215,45076,0,0,0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal.
1,0,tcp,http,SF,162,4528,0,0,0,0,...,1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,normal.
2,0,tcp,http,SF,236,1228,0,0,0,0,...,2,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,normal.
3,0,tcp,http,SF,233,2032,0,0,0,0,...,3,1.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,normal.
4,0,tcp,http,SF,239,486,0,0,0,0,...,4,1.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,normal.


### Missing Data

It is important to think about why data may be missing. Statisticians often use the terms ‘missing at random’ and ‘not missing at random’ to represent different scenarios.

Data are said to be ‘missing at random’ if the fact that they are missing is unrelated to actual values of the missing data. For instance, if some quality-of-life questionnaires were lost in the postal system, this would be unlikely to be related to the quality of life of the trial participants who completed the forms. In some circumstances, statisticians distinguish between data ‘missing at random’ and data ‘missing completely at random’, although in the context of a systematic review the distinction is unlikely to be important. Data that are missing at random may not be important. Analyses based on the available data will tend to be unbiased, although based on a smaller sample size than the original data set.

Data are said to be ‘not missing at random’ if the fact that they are missing is related to the actual missing data. For instance, in a depression trial, participants who had a relapse of depression might be less likely to attend the final follow-up interview, and more likely to have missing outcome data. Such data are ‘non-ignorable’ in the sense that an analysis of the available data alone will typically be biased. Publication bias and selective reporting bias lead by definition to data that are 'not missing at random', and attrition and exclusions of individuals within studies often do as well.

The principal options for dealing with missing data are:
- analysing only the available data (i.e. ignoring the missing data), which is appropriate when data are missing at random;
- imputing the missing data with replacement values and treating these as if they were observed (e.g. last observation carried forward, imputing an assumed outcome such as assuming all were poor outcomes, imputing the mean, or imputing predicted values from a regression analysis), which is practical in most circumstances and commonly used;
- imputing the missing data and accounting for the fact that these were imputed with uncertainty (e.g. multiple imputation, or the simple imputation methods from the previous option with an adjustment to the standard error);
- using statistical models to allow for missing data, making assumptions about their relationships with the available data.
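
The first two options can be sketched in pandas as follows. The frame below is a toy example with gaps inserted by hand and is not taken from the KDD data; mean imputation is only one of the simple imputation choices listed above.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing src_bytes values (illustrative numbers)
df = pd.DataFrame({
    "src_bytes": [215.0, np.nan, 236.0, 233.0, np.nan],
    "dst_bytes": [45076.0, 4528.0, 1228.0, 2032.0, 486.0],
})

# How much is missing, per column
missing_per_column = df.isnull().sum()

# Option 1: analyse only the available data (drop incomplete rows)
available = df.dropna()

# Option 2: impute each gap with the column mean and treat it as observed
imputed = df.fillna(df.mean())

print(missing_per_column)
print(len(available), imputed["src_bytes"].tolist())
```

Dropping rows shrinks the sample from five to three, while mean imputation keeps all rows but understates the variability of `src_bytes`.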
 
General recommendations for dealing with missing data in Cochrane reviews include the following:
- Whenever possible, contact the original investigators to request missing data.
- Make explicit the assumptions of any methods used to cope with missing data: for example, that the data are assumed missing at random, or that missing values were assumed to have a particular value such as a poor outcome.
- Perform sensitivity analyses to assess how sensitive results are to reasonable changes in the assumptions that are made. [[2]](http://handbook.cochrane.org/chapter_16/16_1_missing_data.htm)
 

![alt text](missingdata.png)

In [9]:
pd.isnull(X_train).sum().sum()

0

In [10]:
y = X_train['target']
X_train.drop('target',axis=1,inplace=True)

In [11]:
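import plotly.plotly     as py   # plotly < 4 only; from plotly 4 on this module lives in the chart_studio package
import plotly.graph_objs as go

# Count how many connections carry each label, then plot the class distribution
counts = y.value_counts()
data   = [go.Bar(x=counts.index, y=counts.values)]

py.iplot(data, filename='basic-bar')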

TODO: add an additional class that groups every label other than `smurf.`, `neptune.`, and `normal.`.
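
One way to collapse every label other than `smurf.`, `neptune.`, and `normal.` into a single extra class is sketched here on a toy label series. The class name `other.` is a hypothetical choice.

```python
import pandas as pd

# Toy labels; in the notebook this would be the target series y
labels = pd.Series(["smurf.", "neptune.", "normal.", "back.", "satan.", "smurf."])

# Labels to keep as they are; everything else becomes 'other.'
keep = ["smurf.", "neptune.", "normal."]
grouped = labels.where(labels.isin(keep), other="other.")

print(grouped.value_counts())
```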

In [None]:
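from sklearn          import base
from sklearn.pipeline import Pipeline

class ColumnSelector(base.BaseEstimator, base.TransformerMixin):
    """Select one field from each row; rows are assumed to be dict-like records."""
    
    def __init__(self, col_names):
        self.col_names = col_names
    
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn
        return self
    
    def transform(self, X):
        # Iterating a DataFrame yields column labels, not rows, so pass records
        # (e.g. df.to_dict('records')) rather than the DataFrame itself
        return [[row[self.col_names]] for row in X]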

In [None]:
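class DictEncoder(base.BaseEstimator, base.TransformerMixin):
    """Turn each categorical value into a {value: 1} dict for DictVectorizer."""
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        result = []
        
        for i in X:
            temp = {}
            # A missing value (None) becomes an empty dict, i.e. an all-zero row
            if i is not None:
                temp[i] = 1
                    
            result.append(temp)
            
        return result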

In [None]:
from sklearn.feature_extraction import DictVectorizer

de = DictEncoder()
dv = DictVectorizer(sparse=False)

y2 = de.fit_transform(y.values)
y2 = dv.fit_transform(y2)
y2 = pd.DataFrame(y2,columns=dv.feature_names_)

In [None]:
de = DictEncoder()
dv = DictVectorizer(sparse=False)

pt = de.fit_transform(X_train['protocol_type'].values)
pt = dv.fit_transform(pt)
pt = pd.DataFrame(pt,columns=dv.feature_names_)

In [None]:
de = DictEncoder()
dv = DictVectorizer(sparse=False)

sr = de.fit_transform(X_train['service'].values)
sr = dv.fit_transform(sr)
sr = pd.DataFrame(sr,columns=dv.feature_names_)

In [None]:
de = DictEncoder()
dv = DictVectorizer(sparse=False)

fl = de.fit_transform(X_train['flag'].values)
fl = dv.fit_transform(fl)
fl = pd.DataFrame(fl,columns=dv.feature_names_)
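
The same kind of one-hot encoding can also be produced in a single call with `pd.get_dummies`, shown here on a toy frame as an alternative sketch rather than a replacement for the cells above. Unlike `DictVectorizer`, it prefixes each new column with the source column name.

```python
import pandas as pd

# Toy rows using the three symbolic KDD columns (values are illustrative)
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "service": ["http", "domain_u", "http", "ecr_i"],
    "flag": ["SF", "S0", "SF", "REJ"],
})

# One indicator column per distinct value of each listed column
encoded = pd.get_dummies(toy, columns=["protocol_type", "service", "flag"], dtype=float)

print(encoded.columns.tolist())
```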

In [None]:
X_train.iloc[0].apply(type)

In [None]:
X_train['dst_host_same_srv_rate'].value_counts()