# Lab Assignment #4 
## Network Intrusion Detection Dataset Exploration

The Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) of the Association for Computing Machinery (ACM) holds the KDD Cup every year, posing a different challenge to participants each time. In 1999, the topic was “computer network intrusion detection”, in which the task was to “learn a predictive model capable of distinguishing between bad connections, called intrusions or **attacks**, and good **normal** connections in a computer network.”

The KDD dataset was used for this competition. 

In [1]:
import os
from collections import defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

dataset_root = 'kdd-dataset'
train_file = os.path.join(dataset_root, 'KDDTrain+.txt')
test_file = os.path.join(dataset_root, 'KDDTest+.txt')

In [2]:
# Original KDD dataset feature names obtained from 
# http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names
# http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

header_names = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'attack_type', 'success_pred']

The `attack_type` column (second from the last) is the target: it holds one of 22 different attack labels, or `normal` for a benign connection. The file `kdd-dataset/training_attack_types.txt` maps each of the 22 attacks to one of the following four categories:
1) Denial of Service Attack (DoS): an attack in which the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
2) User to Root Attack (U2R): a class of exploit in which the attacker starts out with access to a normal user account on the system (perhaps gained by sniffing passwords, a dictionary attack, or social engineering) and is able to exploit some vulnerability to gain root access to the system.
3) Remote to Local Attack (R2L): occurs when an attacker who can send packets to a machine over a network, but who does not have an account on that machine, exploits some vulnerability to gain local access as a user of that machine.
4) Probing Attack (probe): an attempt to gather information about a network of computers for the apparent purpose of circumventing its security controls.





## Your data exploration tasks
__Task 1: Generate train and test data sets__ (2 points)

Load `train_file` into a pandas DataFrame called `train_data`.
Load `test_file` into a pandas DataFrame called `test_data`.
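
As a starting point, the sketch below shows one way to read a headerless delimited file into a DataFrame while supplying the column names yourself. It is not the required solution: it assumes the dataset files are comma-separated with no header row, and it uses a made-up three-column sample (`demo_names`, `demo_csv`) in place of the real files and the 43-entry `header_names` list.

```python
import io
import pandas as pd

# Hypothetical three-column stand-in for header_names (the real list has 43 entries).
demo_names = ['duration', 'protocol_type', 'attack_type']

# Assumption: the dataset files are comma-separated and have no header row.
# This in-memory sample stands in for a real file so the sketch runs anywhere.
demo_csv = io.StringIO("0,tcp,normal\n0,udp,neptune\n2,tcp,normal\n")

# names= assigns column labels; without it, the first data row would become the header.
demo_data = pd.read_csv(demo_csv, names=demo_names)
print(demo_data.head())
```

With the real files, the equivalent call would pass `train_file` or `test_file` as the first argument and `header_names` as `names`. Checking `.shape` and `.head()` afterwards is a quick way to confirm that the columns line up.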

__Task 2: Create attack category__ (5 points)

Add an `attack_category` column to both `train_data` and `test_data` as an additional target.

Assign `attack_category` from the value of `attack_type`, using the mapping in `kdd-dataset/training_attack_types.txt`. The possible values are `dos`, `u2r`, `r2l`, and `probe`, plus `benign` when `attack_type` is `normal`.
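
One possible approach is to build a dictionary from the mapping file and apply it with `Series.map`. The sketch below is illustrative only: it assumes each line of the mapping file has the form `<attack_type> <category>` separated by whitespace, and it uses a four-line in-memory sample (`mapping_text`) and a tiny made-up DataFrame (`demo`) instead of the real data.

```python
import io
import pandas as pd

# Assumption: each mapping line is "<attack_type> <category>", whitespace-separated.
# This short sample stands in for kdd-dataset/training_attack_types.txt.
mapping_text = io.StringIO("neptune dos\nrootkit u2r\nguess_passwd r2l\nnmap probe\n")

# Start with the one entry the mapping file is not expected to contain.
category_of = {'normal': 'benign'}
for line in mapping_text:
    parts = line.split()
    if len(parts) == 2:  # skip blank or malformed lines
        attack, category = parts
        category_of[attack] = category

# Made-up rows standing in for train_data / test_data.
demo = pd.DataFrame({'attack_type': ['normal', 'neptune', 'nmap', 'rootkit']})
demo['attack_category'] = demo['attack_type'].map(category_of)
print(demo)
```

`Series.map` returns `NaN` for any `attack_type` that is missing from the dictionary, so `demo['attack_category'].isna().sum()` is a useful check after applying the mapping to each data set.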

__Task 3: Data visualization__ (8 points total, 2 points for each graph)

For `train_data`:

- Draw a graph of the total number of instances of each `attack_type`, with `attack_type` on one axis and the count on the other.

- Draw a graph of the total number of instances of each `attack_category`, with `attack_category` on one axis and the count on the other.

For `test_data`, do the same:

- Draw a graph of the total number of instances of each `attack_type` versus `attack_type`.

- Draw a graph of the total number of instances of each `attack_category` versus `attack_category`.
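
A bar chart of `value_counts()` is one common way to draw such a graph. The sketch below is a minimal example on made-up data (`demo`), not a prescribed solution; the axis labels, title, and figure size are arbitrary choices.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripts; omit this line when using %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Made-up labels standing in for a real attack_category column.
demo = pd.DataFrame({'attack_category': ['benign', 'dos', 'dos', 'probe', 'benign', 'dos']})

# value_counts() returns one count per distinct label, sorted from most to least frequent.
counts = demo['attack_category'].value_counts()

fig, ax = plt.subplots(figsize=(6, 3))
counts.plot(kind='bar', ax=ax)
ax.set_xlabel('attack_category')
ax.set_ylabel('number of instances')
ax.set_title('Instances per attack_category (made-up data)')
fig.tight_layout()
```

When a few classes dominate the counts, a logarithmic y-axis (`ax.set_yscale('log')`) can make the rare classes visible.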

__Task 4: Data analysis__ (5 points)

What do you observe from the graphs? Do the train and test sets have similar statistics?
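
To back up visual impressions with numbers, one option is to compare the relative class frequencies of the two sets side by side. The sketch below uses two made-up label Series (`train_demo`, `test_demo`) and shows only the mechanics; the values say nothing about the real data.

```python
import pandas as pd

# Made-up labels standing in for train_data['attack_category'] and test_data['attack_category'].
train_demo = pd.Series(['benign', 'dos', 'dos', 'probe'], name='attack_category')
test_demo = pd.Series(['benign', 'benign', 'r2l', 'dos'], name='attack_category')

# normalize=True turns counts into fractions, so sets of different sizes are comparable.
shares = pd.concat(
    [train_demo.value_counts(normalize=True), test_demo.value_counts(normalize=True)],
    axis=1,
    keys=['train', 'test'],
).fillna(0.0)  # a class absent from one set gets a share of 0

print(shares)
```

Classes with a share of 0 in one column exist in only one of the two sets, which is worth mentioning when discussing whether the statistics are similar.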
