# Assignment: Intrusion detection

## Task: Connection Classification

Kaggle challenge: https://www.kaggle.com/sampadab17/network-intrusion-detection?select=Train_data.csv

### Problem description
The dataset to be audited consists of a wide variety of intrusions simulated in a military network environment. The environment was set up to acquire raw TCP/IP dump data by simulating a typical US Air Force LAN. The LAN was operated like a real network and subjected to multiple attacks.

## Data
A connection is a sequence of TCP packets that starts and ends at well-defined times, during which data flows between a source IP address and a target IP address under a well-defined protocol. Each connection is labelled either as normal or as an attack with exactly one specific attack type. Each connection record consists of about 100 bytes.
For each TCP/IP connection, 41 features are obtained from normal and attack data: 3 qualitative and 38 quantitative. The class variable has two categories:
* Normal
* Anomalous



## Task 1: Problem Statement
Discuss the problem setting and the first implications of the given data set.
* What assumptions can we make about the data?
  * The class labels are correct.
* What problems are we expecting?
  * protocol_type, service, flag and class are neither int nor float, so they have to be encoded before training.
  * Not all features are needed.
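
The two expected problems can be checked programmatically. The following is a minimal sketch on a made-up miniature frame, not on the real data: the rows and the `constant_feature` column are hypothetical and only illustrate how non-numeric and uninformative columns could be found.

```python
import pandas as pd

# Hypothetical miniature of the training data; all values are made up.
sample = pd.DataFrame({
    "duration": [0, 0, 2],
    "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "other", "private"],
    "flag": ["SF", "SF", "S0"],
    "src_bytes": [232, 146, 0],
    "constant_feature": [0, 0, 0],  # hypothetical column with a single value
    "class": ["normal", "normal", "anomaly"],
})

# Columns that are neither int nor float need encoding before training.
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()

# A column with only one distinct value cannot help a classifier.
constant = [col for col in sample.columns if sample[col].nunique() == 1]

print(non_numeric)
print(constant)
```

Running the same two checks on the loaded training frame would show which columns to encode and which are candidates for removal.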

## Task 2: First Data Analysis, Cleaning and Feature Extraction
* Import the data to a Pandas DataFrame
* Run first simple statistics and visualizations
* Is there a need to clean the data? If yes, do so...
* Can you use the raw data directly, or should you extract features? Which features are suitable?
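
As a rough sketch of what "first simple statistics" and a cleaning check might look like, the snippet below uses a small hypothetical frame (the rows are invented) and only standard pandas calls. In the notebook the same calls would be applied to the loaded training frame.

```python
import pandas as pd

# Hypothetical rows standing in for the loaded training frame.
frame = pd.DataFrame({
    "duration": [0, 0, 0, 5],
    "src_bytes": [491, 146, 146, None],
    "protocol_type": ["tcp", "udp", "udp", "tcp"],
})

# Number of missing values in each column.
missing_per_column = frame.isna().sum()

# Number of rows that repeat an earlier row exactly.
duplicate_rows = int(frame.duplicated().sum())

print(frame.describe())                       # summary statistics of numeric columns
print(frame["protocol_type"].value_counts())  # frequency of each category
print(missing_per_column)
print(duplicate_rows)
```

If both counts are zero on the real data, no cleaning for missing values or duplicates is needed.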


In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("Train_data.csv", encoding="ISO-8859-1")

In [3]:
test = pd.read_csv("Test_data.csv")

In [4]:
data.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,25,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,normal
1,0,udp,other,SF,146,0,0,0,0,0,...,1,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,normal
2,0,tcp,private,S0,0,0,0,0,0,0,...,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,anomaly
3,0,tcp,http,SF,232,8153,0,0,0,0,...,255,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,normal
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25192 entries, 0 to 25191
Data columns (total 42 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     25192 non-null  int64  
 1   protocol_type                25192 non-null  object 
 2   service                      25192 non-null  object 
 3   flag                         25192 non-null  object 
 4   src_bytes                    25192 non-null  int64  
 5   dst_bytes                    25192 non-null  int64  
 6   land                         25192 non-null  int64  
 7   wrong_fragment               25192 non-null  int64  
 8   urgent                       25192 non-null  int64  
 9   hot                          25192 non-null  int64  
 10  num_failed_logins            25192 non-null  int64  
 11  logged_in                    25192 non-null  int64  
 12  num_compromised              25192 non-null  int64  
 13  root_shell      

In [6]:
from sklearn.preprocessing import LabelEncoder

# Replace each text column with integer codes; categories are numbered in sorted order.
for column in ["protocol_type", "service", "flag", "class"]:
    data[column] = LabelEncoder().fit_transform(data[column])


In [7]:
data.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0,1,19,9,491,0,0,0,0,0,...,25,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,1
1,0,2,41,9,146,0,0,0,0,0,...,1,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,1
2,0,1,46,5,0,0,0,0,0,0,...,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,0
3,0,1,22,9,232,8153,0,0,0,0,...,255,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,1
4,0,1,22,9,199,420,0,0,0,0,...,255,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
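
The encoding cell above discards each fitted encoder, so the same mapping cannot be reapplied to `Test_data.csv` later. One possible alternative, swapped in here for illustration, is scikit-learn's `OrdinalEncoder`, which is fitted once and then reused. The two small frames below are hypothetical stand-ins for the train and test data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-ins for the train and test frames.
train_part = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "flag": ["SF", "SF", "S0", "REJ"],
})
test_part = pd.DataFrame({
    "protocol_type": ["udp", "tcp"],
    "flag": ["SF", "RSTO"],  # "RSTO" never appears in train_part
})

categorical = ["protocol_type", "flag"]

# Fit once on the training data; categories unseen during fitting map to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train_part[categorical])

# Reuse the fitted encoder so a code means the same thing in both frames.
test_codes = encoder.transform(test_part[categorical])

print(encoder.categories_)
print(test_codes)
```

Ordinal codes impose an arbitrary order on the categories. Tree-based models usually tolerate this, while distance-based models such as KNeighbors may be better served by one-hot encoding.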


## Task 3: Train a Model
* Which ML model would you choose and why?
* Train and evaluate the model using the train data
* Is the data balanced? What are the implications, and how can you deal with them?
* Discuss the results. What improvements are possible?
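
The balance question can be answered with a class frequency count, and a stratified split keeps the class shares equal in both parts. The sketch below uses hypothetical toy labels (0 = anomaly, 1 = normal, matching the encoded class column) and is meant as an illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 anomalies (0) and 12 normal connections (1).
toy = pd.DataFrame({
    "feature": range(20),
    "class": [0] * 8 + [1] * 12,
})

# Share of each class; values near 0.5 indicate a roughly balanced problem.
print(toy["class"].value_counts(normalize=True))

# stratify keeps the class proportions the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["feature"]], toy["class"],
    test_size=0.25, random_state=42, stratify=toy["class"],
)
print(y_te.value_counts(normalize=True))
```

For strongly imbalanced data, options worth considering include class weights (for example `class_weight="balanced"` in `RandomForestClassifier`) or resampling the minority class.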


In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X = data.drop("class", axis = 1)
y = data["class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state=42)

In [10]:
#Train
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
bayes.fit(X_train, y_train)

GaussianNB()

In [11]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=1, n_jobs=4)
rf.fit(X_train, y_train)

RandomForestClassifier(n_jobs=4, random_state=1)

In [12]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier()

## Task 4: Evaluate 
* Report the F1 score on the test data. Who will build the best model?

In [13]:
from sklearn.metrics import classification_report

In [14]:
predictbayes = bayes.predict(X_test)
print(classification_report(y_test, predictbayes))

              precision    recall  f1-score   support

           0       0.90      0.12      0.21      1165
           1       0.57      0.99      0.72      1355

    accuracy                           0.59      2520
   macro avg       0.73      0.55      0.47      2520
weighted avg       0.72      0.59      0.48      2520
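
To make the report easier to read: the F1 score is the harmonic mean of precision and recall, and the macro average is the unweighted mean of the per-class F1 scores. The sketch below recomputes these from the rounded precision and recall values printed above, so results may differ from the report in the last digit.

```python
# Precision and recall per class, copied from the GaussianNB report.
report = {
    0: {"precision": 0.90, "recall": 0.12},
    1: {"precision": 0.57, "recall": 0.99},
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-class F1 scores.
scores = {label: f1(r["precision"], r["recall"]) for label, r in report.items()}

# Macro average: every class counts equally, regardless of its support.
macro = sum(scores.values()) / len(scores)

print({label: round(value, 2) for label, value in scores.items()})
print(round(macro, 2))
```

The low recall of 0.12 for class 0 means that most anomalies are predicted as normal, which pulls that class's F1 score down even though its precision is high.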



In [15]:
predictrf = rf.predict(X_test)
print(classification_report(y_test, predictrf))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1165
           1       1.00      1.00      1.00      1355

    accuracy                           1.00      2520
   macro avg       1.00      1.00      1.00      2520
weighted avg       1.00      1.00      1.00      2520



In [16]:
predictknn = knn.predict(X_test)
print(classification_report(y_test, predictknn))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1165
           1       0.99      0.99      0.99      1355

    accuracy                           0.99      2520
   macro avg       0.99      0.99      0.99      2520
weighted avg       0.99      0.99      0.99      2520



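* Use Random Forest or KNeighbors as the model. Random Forest has the better results: an F1 score of 1.00 for both classes, compared with 0.99 for KNeighbors and 0.21 and 0.72 for Gaussian naive Bayes.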
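
A score of 1.00 on a single 10% hold-out split is worth double-checking. One possible follow-up is k-fold cross-validation, sketched below on synthetic data from `make_classification` as a stand-in for `X` and `y`. The sample size and the number of trees are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y; the notebook would use the real features.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=1)

# cv=5 uses five stratified folds for a classifier; each fold is scored with F1.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")

print(cv_scores.mean(), cv_scores.std())
```

A small spread across folds would suggest that the hold-out result is not an artifact of one lucky split.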