<a href="https://colab.research.google.com/github/DonErnesto/amld2021-unsupervised/blob/master/notebooks/challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop challenge

## Package installing and data import

In [None]:
# load the required files...
if 'google.colab' in str(get_ipython()):
    print('Running on CoLab, need to get data and install libraries..')
    data_path = './'
    !curl -O https://raw.githubusercontent.com/DonErnesto/amld2021-unsupervised/master/notebooks/outlierutils.py
    !curl -O https://raw.githubusercontent.com/DonErnesto/amld2021-unsupervised/master/data/x_kdd.csv
    !curl -O https://raw.githubusercontent.com/DonErnesto/amld2021-unsupervised/master/data/x_kdd_prepared.csv
    !pip install --upgrade pyod
else:
    print('Not running on CoLab, data and libraries are already present')
    data_path = '../data'
    

In [None]:
# standard library imports
import os
import sys
from collections import Counter
import getpass

# pandas, seaborn etc.
import seaborn as sns
import sklearn 
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np

# sklearn outlier models
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

# other sklearn functions
from sklearn.decomposition import PCA
from sklearn.covariance import MinCovDet, EmpiricalCovariance
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import scale as preproc_scale
from sklearn.manifold import TSNE

# pyod
import pyod
from pyod.models.auto_encoder import AutoEncoder
from pyod.models.knn import KNN
from pyod.models.lof import LOF
# from pyod.models.pca import PCA as pyod_PCA
from pyod.models.iforest import IForest

In [None]:
from outlierutils import plot_top_N, plot_outlier_scores, LabelSubmitter, API_URL

## Dataset Import

In [None]:
dataset_path = 'x_kdd_prepared.csv'
x = pd.read_csv(os.path.join(data_path, dataset_path))
print(f'Data set size: {x.shape}')

## Challenge Description

You have just imported a data set, `x_kdd`, with 48K rows. The dataset was collected by MIT Lincoln Labs in 1999 by operating a LAN as usual while additionally carrying out various attacks. This specific dataset (a subset of the original) has "normal" traffic as the inlier class and several attacks (buffer_overflow, ftp_write, imap, ...) as the outlier class. Although this data does not represent payment fraud, it is relevant because of its mixed data types.

The goal of the challenge is for you to tell which rows are the outliers, i.e. which rows correspond to network attacks.

There are no labels available. The target is to predict as many true positives as possible and as few false positives as possible, with the following weights:

- Each true positive reported yields **500** points
- Each false positive reported costs **25** points
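To make the trade-off concrete, here is a minimal sketch of the scoring arithmetic. The function names are made up for illustration and are not part of `outlierutils`; only the weights (500 and 25) come from the rules above.

```python
TP_REWARD = 500   # points gained per true positive (challenge rules)
FP_PENALTY = 25   # points lost per false positive (challenge rules)


def challenge_score(n_true_pos, n_false_pos):
    """Total points for a set of submissions (hypothetical helper)."""
    return TP_REWARD * n_true_pos - FP_PENALTY * n_false_pos


def expected_points_per_submission(precision):
    """Expected points for one submitted row, given the fraction of
    submitted rows that are true outliers (hypothetical helper)."""
    return precision * TP_REWARD - (1 - precision) * FP_PENALTY


# Precision at which submitting neither gains nor loses points on average
break_even_precision = FP_PENALTY / (TP_REWARD + FP_PENALTY)

print(challenge_score(10, 30))                          # 10 hits, 30 misses
print(round(break_even_precision, 4))                   # roughly 0.0476
print(round(expected_points_per_submission(0.01), 2))   # a 1% hit rate loses points
```

Under these weights, a batch pays off on average only if more than about 1 in 21 submitted rows is a true outlier.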

You submit your predictions (the indices of the rows that you think are outliers) to a server by means of some code discussed below. The server will provide feedback: it will tell you which of the submitted rows are actually outliers and which ones are not.

**Hints**
- proceed iteratively! Submit a few points, learn from the server's feedback, then submit a few more points, and so on.
- only submit points that you think are positives!! Just submitting all points, or random points, will not get you a good score :)
- the fraction of positives is less than 1%. Random guessing to gather labels is therefore unlikely to pay off. 
- given the limited time available for the workshop, we have already cleaned the data for you. If you would rather do the cleaning yourself, set `dataset_path = 'x_kdd.csv'` in the cell above
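One way to organize the iterative approach is to always pick the highest-scoring rows that have not been submitted yet. The sketch below uses only NumPy; `next_batch` and the demo scores are hypothetical and are not part of `outlierutils`.

```python
import numpy as np


def next_batch(scores, already_submitted, batch_size=20):
    """Return the indices of the highest-scoring rows not yet submitted
    (hypothetical helper, highest score first)."""
    order = np.argsort(scores)[::-1]                      # best candidates first
    submitted = np.asarray(list(already_submitted), dtype=int)
    fresh = order[~np.isin(order, submitted)]             # drop rows already sent
    return fresh[:batch_size]


# Made-up scores for six rows
demo_scores = np.array([0.1, 0.9, 0.4, 0.8, 0.3, 0.7])

first = next_batch(demo_scores, already_submitted=[], batch_size=2)
second = next_batch(demo_scores, already_submitted=first, batch_size=2)
print(first, second)
```

Between batches you can recompute or re-weight the scores based on the feedback, then call the helper again with the accumulated set of submitted indices.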

## Your Outlier detection code

In [None]:
# Your code goes here!!
# by using one or more than one method, you will estimate a vector of scores, like this


from pyod.models.iforest import IForest

ifo = IForest(n_estimators=1000, max_samples=1024, random_state=1, contamination=0.01, behaviour='new')
ifo.fit(x)
# get the outlier scores of the data
scores_ifo = ifo.decision_scores_  # raw outlier scores

In [None]:
plt.plot(sorted(scores_ifo)[::-1])
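If you try more than one method, the raw scores are usually on different scales and cannot be averaged directly. One common option, sketched here with made-up numbers, is to convert each score vector to normalized ranks first. The helper names are hypothetical, and rank averaging is only one of several possible combination schemes.

```python
import numpy as np


def rank_normalize(scores):
    """Map scores to ranks scaled into [0, 1]; the highest score gets 1.
    Assumes at least two rows; ties are broken arbitrarily by argsort."""
    scores = np.asarray(scores, dtype=float)
    ranks = np.argsort(np.argsort(scores))   # 0 = lowest score
    return ranks / (len(scores) - 1)


def combine_scores(*score_vectors):
    """Average the rank-normalized scores of several detectors."""
    return np.mean([rank_normalize(s) for s in score_vectors], axis=0)


# Made-up outputs of two detectors on four rows, on very different scales
scores_a = np.array([0.2, 5.0, 0.1, 3.0])
scores_b = np.array([10.0, 90.0, 20.0, 30.0])

combined = combine_scores(scores_a, scores_b)
print(combined)
```

The combined vector can be used in the same way as `scores_ifo` in the cells that follow.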

## Example of a submission process

Given the `scores` array, you may want to submit, for example, the indices of the N points with the highest scores. You can use this helper function to calculate these indices:

In [None]:
def get_top_N_indices(scores, N=100):
    """ Helper function. Returns the indices of the points with the top N highest outlier scores
    """
    return np.argsort(scores)[::-1][:N]

In [None]:
indices_submission = get_top_N_indices(scores_ifo, N=40)
indices_submission

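As a smaller illustration: for the example score vector `scores_example = np.array([5.23, 4.12, 1.45, 7.23, 19.2, 2.23])`, the N=2 highest-scoring points are at indices 4 and 3 of the original table, so `get_top_N_indices(scores_example, N=2)` returns `[4, 3]`. The `indices_submission` vector above holds the same kind of result for `scores_ifo`.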
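The small example can be checked directly. The block below repeats the helper so that it runs on its own:

```python
import numpy as np


def get_top_N_indices(scores, N=100):
    """Return the indices of the N highest scores, highest first."""
    return np.argsort(scores)[::-1][:N]


scores_example = np.array([5.23, 4.12, 1.45, 7.23, 19.2, 2.23])
top_two = get_top_N_indices(scores_example, N=2)
print(top_two)  # [4 3]
```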

## API submission

Submit your predictions to the API with the `LabelSubmitter` class. 
This class has two useful methods:
- with `.post_predictions()` you submit the indices of the estimated outliers. Submitting the same index more than once has no additional effect on your score
- with `.get_labels()` you retrieve the label (1 for outliers and 0 for inliers) of all previously submitted indices

In [None]:
username='your_user_name'
password = getpass.getpass()
if not ('server' in locals() and server.jwt_token): #only if no labelsubmitter with .jwt_token is available
    server = LabelSubmitter(username=username,
                       password=password,
                       url=API_URL)

Use the `endpoint='kdd'` parameter for this challenge.

In [None]:
server.post_predictions(idx=indices_submission, endpoint='kdd')

In [None]:
labels = server.get_labels(endpoint='kdd')
labels
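Once feedback comes back, it helps to tally how well the last submissions did. The exact return type of `.get_labels()` is not shown in this notebook, so the sketch below assumes the labels can be coerced to a pandas Series of 0/1 values indexed by row index. `summarize_feedback` and the demo labels are made up for illustration.

```python
import pandas as pd


def summarize_feedback(labels):
    """Count hits and misses among the submitted rows (hypothetical helper).

    Assumes `labels` holds 1 for outliers and 0 for inliers."""
    labels = pd.Series(labels).astype(int)
    n_submitted = len(labels)
    n_true_pos = int(labels.sum())
    n_false_pos = n_submitted - n_true_pos
    precision = n_true_pos / n_submitted if n_submitted else float('nan')
    return {
        'submitted': n_submitted,
        'true_positives': n_true_pos,
        'false_positives': n_false_pos,
        'precision': precision,
    }


# Made-up stand-in for the server feedback: row index -> label
labels_demo = pd.Series({12: 1, 57: 0, 803: 0, 4410: 1})
summary = summarize_feedback(labels_demo)
print(summary)
```

If the real `labels` object has a different shape, only the first line of the helper needs to be adapted.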