#### wg 5 assignment
# Anomaly Detection

## 1. Random Forests
Using Random Forests to get a classification accuracy of 0.9667.
(This code is taken from intrusion_detection.py, version of commit 29821ccffaca5817b3fe0b3ae2d972f8132b7999.)

(As an aside, I'm using Python 3, not 2. The only change that had to be made was to the print statements marked [*].)

#### Imports

In [1]:
import sys
import gc

import pandas as pd
import numpy as np
import sklearn as sk
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

In [2]:
import autosklearn

In [3]:
import autosklearn.classification
import sklearn.cross_validation
import sklearn.datasets
import sklearn.metrics

In [4]:
gc.collect()

0

Read data.

In [5]:
header = pd.read_table("kddcup.names.txt", header=None)
att_types = pd.read_table("training_attack_types.txt", sep=" ", header=None)

tr_raw = pd.read_csv("kddcup.data_10_percent_corrected", header=None)
test_raw = pd.read_csv("kddcup_testdata.corrected", header=None)

Do the preprocessing.

In [6]:
def preprocess(dat):
    dat.columns = header[0]
    att_types.columns = ["attack", "type"]
    dat["type"] = np.nan
    for i in range(0, len(att_types["attack"])):
        dat.loc[dat["attack"] == att_types.loc[i,].attack, "type"] = att_types.loc[i,].type
    dat.type = dat.type.fillna("unlisted")
    dat.attack = dat.attack.astype('category')

    dat[dat.select_dtypes(include=['number']).columns] = preprocessing.scale(
        dat[dat.select_dtypes(include=['number']).columns])
    dat = pd.get_dummies(dat, columns=['protocol_type', 'service', 'flag'])

    return dat
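
The loop in `preprocess` that copies each attack's type into the `type` column can also be written as a single lookup. The following is a minimal sketch on toy data, not the code that produced the results here; `att_types_toy` and `dat_toy` are hypothetical stand-ins for `att_types` and the raw data.

```python
import pandas as pd

# Hypothetical toy stand-ins for att_types and the raw data.
att_types_toy = pd.DataFrame({"attack": ["smurf", "nmap", "normal"],
                              "type": ["dos", "probe", "normal"]})
dat_toy = pd.DataFrame({"attack": ["smurf", "nmap", "normal", "snmpguess"]})

# Build an attack -> type lookup and apply it in one vectorized step.
lookup = att_types_toy.set_index("attack")["type"]
dat_toy["type"] = dat_toy["attack"].map(lookup).fillna("unlisted")

print(dat_toy)
```

Attacks missing from the lookup come out as NaN from `map` and are then filled with "unlisted", which mirrors the `fillna` step in `preprocess`.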

In [7]:
tr = preprocess(tr_raw)
test = preprocess(test_raw)  # actual test data (but labeled--don't cheat!)
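
Note that `preprocess` scales the training and the test data separately, each with its own mean and standard deviation. A common alternative (not what the code above does) is to learn the scaling on the training data only and reuse it for the test data. A minimal sketch on random toy arrays, where `train_num` and `test_num` are hypothetical stand-ins for the numeric columns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy stand-ins for the numeric columns of tr and test.
rng = np.random.RandomState(0)
train_num = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
test_num = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Learn mean and standard deviation on the training data only ...
scaler = StandardScaler().fit(train_num)
train_scaled = scaler.transform(train_num)
# ... and apply the very same transformation to the test data.
test_scaled = scaler.transform(test_num)

print(train_scaled.mean(axis=0), test_scaled.mean(axis=0))
```

Whether this would change the accuracies reported in this notebook has not been checked.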



Prepare for classification.

In [8]:
# some services occur in only one of the two sets, so we add the missing dummy columns by hand
tr['service_icmp'] = 0
test['service_red_i'] = 0
test['service_urh_i'] = 0
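
Columns added by hand like this are appended at the end, so a column may end up in a different position in `tr` than in `test`, and `.values` later drops the column names. One way to avoid depending on the order is to reindex both frames to a shared column list. This is only a sketch on toy data; `tr_toy` and `test_toy` are hypothetical, and whether the ordering affects the scores in this notebook has not been checked.

```python
import pandas as pd

# Hypothetical toy frames: each has one service the other lacks.
tr_toy = pd.get_dummies(pd.DataFrame({"service": ["http", "red_i"]}),
                        columns=["service"])
test_toy = pd.get_dummies(pd.DataFrame({"service": ["http", "icmp"]}),
                          columns=["service"])

# Union of all column names, in one fixed (sorted) order.
all_cols = sorted(set(tr_toy.columns) | set(test_toy.columns))

# Missing columns are created and filled with 0; the order is now identical.
tr_aligned = tr_toy.reindex(columns=all_cols, fill_value=0)
test_aligned = test_toy.reindex(columns=all_cols, fill_value=0)

print(list(tr_aligned.columns))
print(list(test_aligned.columns))
```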

In [9]:
tr_labels = tr["type"].values
tr_features = tr.drop(["type", "attack"], axis=1).values

In [10]:
test_labels = test["type"].values
test_features = test.drop(["type", "attack"], axis=1).values

In [11]:
print(sys.getsizeof(tr))
print(sys.getsizeof(test))
print(sys.getsizeof(tr_labels))
print(sys.getsizeof(tr_features))
print(sys.getsizeof(test_labels))
print(sys.getsizeof(test_features))

230668458
147203606
96
112
96
112
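
The 96 and 112 bytes reported for the NumPy arrays are most likely just the size of the array objects themselves: `sys.getsizeof` does not count the data buffer of an array that does not own its data. The attribute `nbytes` reports the size of the data instead. A small sketch with a hypothetical toy array:

```python
import sys
import numpy as np

# Toy array: 1000 x 10 float64 values, i.e. 80000 bytes of data.
a = np.zeros((1000, 10))
view = a[:]  # a view shares a's buffer and does not own any data

print(sys.getsizeof(view))  # only the small array object
print(view.nbytes)          # size of the data the view refers to
```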


In [14]:
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf.fit(tr_features, tr_labels)
print(clf.score(test_features, test_labels)) # [*]

0.966704069395


Naive manual fine-tuning doesn't improve this.
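
A more systematic alternative to tuning by hand is a small grid search with cross-validation. The following is a sketch on the digits data set rather than on the intrusion data; the parameter grid is an arbitrary assumption chosen to keep the run short, not a recommendation.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small stand-in data set so the search finishes quickly.
X_demo, y_demo = load_digits(return_X_y=True)

# Hypothetical grid; every combination is evaluated with 3-fold CV.
param_grid = {"n_estimators": [10, 30], "min_samples_split": [2, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

On the full intrusion data each fit is far more expensive, so the grid and the number of folds would have to be chosen with the available time in mind.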

## 2. Extremely Randomized Trees

Try Extremely Randomized Trees for a change. Gets up to .9634 in accuracy.

In [26]:
clf_2 = sk.ensemble.ExtraTreesClassifier(n_estimators=50, max_depth=None, min_samples_split=100, random_state=1)
clf_2.fit(tr_features, tr_labels)
print(clf_2.score(test_features, test_labels)) # [*]

0.96174954747


In [43]:
clf_3 = sk.ensemble.ExtraTreesClassifier(n_estimators=48, max_depth=None, min_samples_split=100, random_state=1)
clf_3.fit(tr_features, tr_labels)
print(clf_3.score(test_features, test_labels)) # [*]

0.96344713837
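
The gap between the two runs (50 vs. 48 trees) is small, so it may be worth checking how much the score moves with the random seed alone. A sketch on the digits data set, as an assumption-laden stand-in for the intrusion data:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Small stand-in data set and a fixed train/test split.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Same model settings, only the seed changes between runs.
scores = []
for seed in range(5):
    model = ExtraTreesClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(min(scores), max(scores))
```

If the spread across seeds is as large as the difference between two settings, that difference should not be read as an improvement.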


If you run the following, your computer might crash (mine did):

## 3. Try to find a better model through automation
We can automate preprocessing, algorithm selection, and the tuning of the hyperparameters of both. Auto-sklearn is a drop-in replacement for scikit-learn estimators and is arguably the best package for this at the moment (see for instance http://www.kdnuggets.com/2017/01/current-state-automated-machine-learning.html).

Let's go!

### Small example from the documentation first
(to get an idea)

In [10]:
digits = sklearn.datasets.load_digits()

In [11]:
X = digits.data
y = digits.target

In [12]:
X_train, X_test, y_train, y_test = \
    sklearn.cross_validation.train_test_split(X, y, random_state=1)

In [13]:
automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=1200)

In [9]:
automl.fit(X_train, y_train)

You are already timing task: index_run5
You are already timing task: index_run6
[previous line repeated 10 more times]
Process pynisher function call:
Traceback (most recent call last):
  File "/home/valentin/anaconda2/envs/DS-CCMLWI-wg5/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/valentin/anaconda2/envs/DS-CCMLWI-wg5/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/valentin/anaconda2/envs/DS-CCMLWI-wg5/lib/python3.5/site-packages/pynisher/limit_function_call.py", line 83, in subprocess_func
    return_value = (



You are already timing task: index_run6




In [10]:
y_hat = automl.predict(X_test)

In [11]:
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))

Accuracy score 0.986666666667


Wow, this takes a long time! No wonder, it's doing a lot... (The default is to terminate after 60 minutes, see https://automl.github.io/auto-sklearn/stable/api.html; here the limit was set to 1200 seconds.)

.98666 is what we'd expect as per the example, nice.

TODO: How big is that data set? Based on that, how long should we expect the computation on our own data to take for it to get anywhere? How much time should we allocate?

In [14]:
import sys
sys.getsizeof(digits)

304

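.3 KB. That's nothing! (Though `sys.getsizeof` only reports the shallow size of the container object here, not the arrays it holds, so this understates the real size.)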

In [15]:
sys.getsizeof(tr)

230668458

Almost 6 orders of magnitude larger by this measure, heh! Let's see where this goes.
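
As a quick check of the arithmetic, the two `sys.getsizeof` figures can be compared directly. The sketch below also compares against the raw digits feature matrix, assuming it holds 1797 x 64 float64 values (that shape and dtype are an assumption here, not something measured in this notebook).

```python
import math

size_digits_object = 304   # sys.getsizeof(digits), shallow size only
size_tr = 230668458        # sys.getsizeof(tr)

# Orders of magnitude between the two reported figures.
orders_reported = math.log10(size_tr / size_digits_object)

# Assumed size of the digits feature matrix: 1797 rows x 64 columns x 8 bytes.
size_digits_data = 1797 * 64 * 8
orders_data = math.log10(size_tr / size_digits_data)

print(round(orders_reported, 2))
print(size_digits_data, round(orders_data, 2))
```

Under that assumption the intrusion data is closer to two and a half orders of magnitude larger than the digits data, which is still a lot.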

### Now back to our own task

In [13]:
autoclf = autosklearn.classification.AutoSklearnClassifier(ml_memory_limit=2048,
                                                           time_left_for_this_task=180)
# autosklearn counterpart of:
# clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
autoclf.fit(tr_features, tr_labels, metric='f1_metric')
# autosklearn counterpart of:
# clf.fit(tr_features, tr_labels)
y_hat_intdect = autoclf.predict(test_features)
# no counterpart above: clf.score() does the prediction internally
# compare the predictions against the true labels (not the features)
print("Accuracy score", sklearn.metrics.accuracy_score(test_labels, y_hat_intdect))
# autosklearn counterpart of:
# print(clf.score(test_features, test_labels)) # [*]

[ERROR] [2017-03-22 19:25:13,652:AutoML(1):9757dd88af9ad89e9f3939c2f9337b06] Error creating dummy predictions:Memout 


ValueError: Trying to include unknown component: svc

I can't seem to make this work. Running it with the default settings and a time limit of a couple of minutes gives you an error along the lines of "no models trained" and "cannot make a prediction without a model".
As per the API documentation, it should be possible to constrain auto-sklearn to specific families of estimators, or even to a single one. Trying this (current code above) as described there, or anywhere else on the internet, won't get you there, though. It seems the current version works a little differently than described. Bummer! Overall, auto-sklearn is not very well documented, which is a known drawback (given that the current version is 0.1.3, this might be expected).

As I don't have the time to just try running it for a lot longer, I instead want to try something else:

## 4. Try optimizing hyperparameters with hyperopt

Actually, I'm using hyperopt-sklearn, an sklearn wrapper around hyperopt. See https://hyperopt.github.io/hyperopt-sklearn/.

(This still gives us automation of hyperparameter optimization of our desired estimator, but it's a little simpler than auto-sklearn. Might be a benefit now.)

Starting small: try out one of the examples.

In [2]:
from hpsklearn import HyperoptEstimator, any_classifier
import sklearn.datasets  # load_digits ships with scikit-learn; fetch_mldata was unused
from hyperopt import tpe
import numpy as np

# Load the data and split into training and test sets

digits = sklearn.datasets.load_digits()

X = digits.data
y = digits.target

test_size = int( 0.2 * len( y ) )
np.random.seed( 1 )
indices = np.random.permutation(len(X))
X_train = X[ indices[:-test_size]]
y_train = y[ indices[:-test_size]]
X_test = X[ indices[-test_size:]]
y_test = y[ indices[-test_size:]]
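
The manual split above shuffles the row indices once and puts the last 20% aside as the test set. A sketch of the same idea on index arrays only, assuming 1797 samples (the size of the digits data set) and using a local `RandomState` instead of the global seed:

```python
import numpy as np

n_samples = 1797                    # assumed number of rows in digits
test_size = int(0.2 * n_samples)    # 20% of the rows go to the test set

# Shuffle all row indices reproducibly.
rng = np.random.RandomState(1)
indices = rng.permutation(n_samples)

# Everything but the last test_size indices is training data.
train_idx = indices[:-test_size]
test_idx = indices[-test_size:]

print(len(train_idx), len(test_idx))
```

Because both parts come from a single permutation, no row can end up in both sets.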

In [3]:
#estim = HyperoptEstimator( classifier=any_classifier('clf'),  
#                            algo=tpe.suggest, trial_timeout=300)

estim = HyperoptEstimator( algo=tpe.suggest, 
                            max_evals=150, 
                            trial_timeout=60 )

In [None]:
estim.fit( X_train, y_train )

In [None]:
print( estim.score( X_test, y_test ) )
# <<show score here>>
print( estim.best_model() )
# <<show model here>>