# Malicious Event Detection using Decision Tree and Context-Based

This notebook uses the KDD Cup 1999 computer network intrusion detection dataset. 
The goal is to distinguish good (normal) network connections from bad (malicious) ones.

### About the dataset
The data comes primarily from the 1998 DARPA Intrusion Detection Evaluation Program, run by MIT Lincoln Labs. 
The dataset contains a variety of network events that were simulated in a military network environment. 
The data is a TCP dump accumulated from the local network of an Air Force environment.


In [1]:
from time import time
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import cross_val_score

#### Implement the isolation forest using the `sklearn.ensemble` package:

In [2]:
from sklearn.ensemble import IsolationForest
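
As a rough illustration of how the imported `IsolationForest` can be used, here is a minimal sketch on synthetic data. The arrays `inliers`, `outliers`, and `X_demo` are hypothetical stand-ins and not the KDD features, and the parameter values are assumptions chosen only for the demonstration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic stand-in data (hypothetical, not the KDD Cup features):
# a dense cluster of "normal" points plus a few far-away points.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X_demo = np.vstack([inliers, outliers])

# Fit the forest without labels; it isolates points with random splits.
iso = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
iso.fit(X_demo)

# score_samples returns lower values for more anomalous points,
# so negate it to get a score where higher means more anomalous.
anomaly_score = -iso.score_samples(X_demo)

# predict returns +1 for inliers and -1 for outliers.
labels = iso.predict(X_demo)
```

The negated score can be passed to `roc_curve` together with ground-truth labels when those are available.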

#### Measure the performance using the ROC curve and AUC

In [4]:
from sklearn.metrics import roc_curve, auc
from sklearn.datasets import fetch_kddcup99
%matplotlib inline
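
To make the metric concrete, the following sketch computes a ROC curve and its AUC with the two functions imported above. The labels and scores are a tiny made-up example, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical ground truth (1 = anomaly) and model scores
# (higher score = more likely an anomaly).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the decision threshold and returns the
# false-positive and true-positive rates at each step.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# auc integrates the curve with the trapezoidal rule.
roc_auc = auc(fpr, tpr)
print(roc_auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of anomalies above normal connections.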

#### Try to fetch the KDD Cup 1999 dataset

In [None]:
dataset = fetch_kddcup99(subset=None, shuffle=True, percent10=True)

### Alternative: download the data from Kaggle

In [11]:
from pathlib import Path

path_dir = Path().resolve()
datasets = path_dir / "datasets"

training_data = datasets / "Train_data.csv"
test_data = datasets / "Test_data.csv"

df_train = pd.read_csv(training_data)
df_test = pd.read_csv(test_data)

df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25192 entries, 0 to 25191
Data columns (total 42 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     25192 non-null  int64  
 1   protocol_type                25192 non-null  object 
 2   service                      25192 non-null  object 
 3   flag                         25192 non-null  object 
 4   src_bytes                    25192 non-null  int64  
 5   dst_bytes                    25192 non-null  int64  
 6   land                         25192 non-null  int64  
 7   wrong_fragment               25192 non-null  int64  
 8   urgent                       25192 non-null  int64  
 9   hot                          25192 non-null  int64  
 10  num_failed_logins            25192 non-null  int64  
 11  logged_in                    25192 non-null  int64  
 12  num_compromised              25192 non-null  int64  
 13  root_shell      

In [27]:
X = df_train.drop(columns=['class'])
# Take an explicit copy so the label mapping below does not write to a slice of df_train
y = df_train[['class']].copy()
# Encode the target: normal -> 0, anomaly -> 1
y['class'] = y['class'].map({'normal': 0, 'anomaly': 1})
y


       class
0          0
1          0
2          1
3          0
4          0
...      ...
25187      1
25188      1
25189      1
25190      1


In [30]:
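# One-hot encode the categorical (object-dtype) columns
object_cols = X.select_dtypes(include='object').columns
X = pd.get_dummies(X, columns=object_cols, drop_first=True)
# to_numpy() returns a new array and is not assigned here, so X and y remain DataFrames
X.to_numpy()
y.to_numpy()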

array([[0],
       [0],
       [1],
       ...,
       [1],
       [1],
       [1]], shape=(25192, 1))
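
The encoding step above can be seen on a small example. This sketch uses a hypothetical three-row frame named `toy` with one categorical column; the column names mirror the dataset, but the values are invented.

```python
import pandas as pd

# Hypothetical miniature of the training frame.
toy = pd.DataFrame({
    "duration": [0, 2, 0],
    "protocol_type": ["tcp", "udp", "icmp"],
})

# One indicator column per category; drop_first removes the first
# category (alphabetically "icmp"), which is implied when the
# remaining indicators are all zero.
encoded = pd.get_dummies(toy, columns=["protocol_type"], drop_first=True)
print(list(encoded.columns))
```

Because the indicator columns depend on the categories present in the data, a test set encoded separately may end up with different columns than the training set and would need to be aligned before prediction.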

In [31]:
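from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Limit the depth to 7 to keep the tree small
treeclf = DecisionTreeClassifier(max_depth=7)
# 5-fold cross-validated accuracy on the training data
scores = cross_val_score(treeclf, X, y, scoring='accuracy', cv=5)
print(np.mean(scores))

# cross_val_score fits clones of the estimator, so fit treeclf itself on the full training set
treeclf.fit(X, y)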

0.9940059516714447
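
Cross-validated accuracy can be complemented by a held-out split scored with ROC AUC, which is less sensitive to class balance. The sketch below shows one way to do that, assuming synthetic data from `make_classification` in place of the encoded KDD features; the names `X_demo`, `y_demo`, and the split sizes are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (hypothetical stand-in).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# Hold out 25% of the rows, keeping the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Same depth limit as in the notebook; random_state fixes tie-breaking.
clf = DecisionTreeClassifier(max_depth=7, random_state=0)
clf.fit(X_tr, y_tr)

# Probability of the positive class for each held-out row.
proba = clf.predict_proba(X_te)[:, 1]

holdout_auc = roc_auc_score(y_te, proba)
holdout_acc = clf.score(X_te, y_te)
print(holdout_auc, holdout_acc)
```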
