<a href="https://colab.research.google.com/github/DonErnesto/amld2021-unsupervised/blob/master/notebooks/challenge_hands_on.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop challenge

## Package installation and data download

In [None]:
# load the required files...
if 'google.colab' in str(get_ipython()):
    print('Running on CoLab, need to get data and install libraries..')
    data_path = './'
    !curl -O https://raw.githubusercontent.com/DonErnesto/amld2021-unsupervised/master/notebooks/outlierutils.py
    !curl -O https://raw.githubusercontent.com/DonErnesto/amld2021-unsupervised/master/data/x_kdd_prepared.csv
    !pip install --upgrade pyod
else:
    print('Not running on CoLab, data and libraries are already present')
    data_path = '../data'
    

In [None]:
# standard library imports
import os
import sys
from collections import Counter
import getpass

# pandas, seaborn etc.
import seaborn as sns
import sklearn 
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np

# sklearn outlier models
from sklearn.neighbors import NearestNeighbors
# from sklearn.neighbors import LocalOutlierFactor
# from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

# other sklearn functions
from sklearn.decomposition import PCA
from sklearn.covariance import MinCovDet, EmpiricalCovariance
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import scale as preproc_scale
from sklearn.manifold import TSNE

# pyod
import pyod
from pyod.models.auto_encoder import AutoEncoder
from pyod.models.knn import KNN
from pyod.models.lof import LOF
# from pyod.models.pca import PCA as pyod_PCA
from pyod.models.iforest import IForest

In [None]:
from outlierutils import plot_top_N, plot_outlier_scores, LabelSubmitter, API_URL

## Data Imports

In [None]:
x_kdd = pd.read_csv(os.path.join(data_path, 'x_kdd_prepared.csv'))
x_kdd = x_kdd.drop_duplicates()
if x_kdd.index.max() > len(x_kdd):
    x_kdd = x_kdd.reset_index()
print(f'Data set size: {x_kdd.shape}')

## Challenge Description

You just imported a data set, `x_kdd`, with 48K rows. The data was collected by MIT Lincoln Labs in 1999 by operating a LAN as usual while additionally carrying out various attacks. This specific data set (a subset of the original) has "normal" traffic as the inlier class and several attacks (buffer_overflow, ftp_write, imap, ...) as the outlier class. Although this data does not represent payment fraud, it is relevant because of its mixed data types (numerical and categorical features).


No labels are available, so there is no split into train and test sets.
The goal is to predict as many true positives as possible (each true positive earns points) and as few false positives as possible (each false positive costs a small number of points). So only submit points that are likely to be positives!


Be selective: submitting all points, or random points, will not get you a good score :)

- Each true positive found yields **500** points
- Each false positive costs **25** points
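
To make the trade-off concrete, the sketch below computes the score of a submission from the two rules above, together with the precision at which a submission breaks even. The helper name `submission_score` is hypothetical and not part of `outlierutils`; the counts in the example call are made up.

```python
TP_REWARD = 500   # points earned per true positive (from the rules above)
FP_PENALTY = 25   # points lost per false positive (from the rules above)

def submission_score(n_true_pos, n_false_pos):
    """Hypothetical helper: total score of one submission."""
    return TP_REWARD * n_true_pos - FP_PENALTY * n_false_pos

# Fraction of submitted points that must be true positives to score exactly zero
break_even_precision = FP_PENALTY / (TP_REWARD + FP_PENALTY)

# Made-up example: 100 submitted points, 5 of which turn out to be positives
print(submission_score(5, 95))
print(round(break_even_precision, 4))
```

A submission only pays off when the share of true positives among the submitted points exceeds this break-even precision of 25/525, just under 5%.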

**Hints**

- The fraction of positives is less than 1%. Random guessing to gather labels is therefore unlikely to pay off. 
- Once enough positive labels are available, they can be used to tune the unsupervised algorithms further, or to train a supervised classifier.
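
One possible way to use collected labels for tuning is to compare candidate outlier scores on the labeled points only, for example with `roc_auc_score`. The sketch below does this on synthetic data for a simple kNN-distance score with two values of `k`. The names `labeled_idx`, `labeled_y` and `knn_scores` are hypothetical, and nothing here assumes a particular format for the labels returned by the API.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic stand-in for the prepared feature matrix: 500 inliers, 10 outliers
X = np.vstack([rng.normal(0, 1, size=(500, 3)),
               rng.normal(6, 1, size=(10, 3))])
y_true = np.r_[np.zeros(500), np.ones(10)]

# Pretend only these points were labeled through earlier submissions
labeled_idx = np.r_[np.arange(0, 40), np.arange(500, 505)]
labeled_y = y_true[labeled_idx]

def knn_scores(X, k):
    """Outlier score: mean distance to the k nearest neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    return dist[:, 1:].mean(axis=1)  # skip column 0: each point's distance to itself

# Evaluate each hyperparameter setting on the labeled subset only
auc_by_k = {k: roc_auc_score(labeled_y, knn_scores(X, k)[labeled_idx])
            for k in (3, 20)}
print(auc_by_k)
```

With few labels the AUC estimate is noisy, so it is best read as a rough guide for choosing between settings rather than as a precise measurement.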


First clean up the data: one-hot encode the categorical columns and MinMax-scale all features. Do not remove any rows!
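
As a minimal sketch of these two steps, the example below uses a small toy frame rather than `x_kdd` itself. The column names are made up, and treating every non-numeric column as categorical is an assumption that should be checked against the actual dtypes.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with made-up columns; the real column names in x_kdd may differ
df = pd.DataFrame({
    'duration': [0, 12, 3, 250],
    'protocol_type': ['tcp', 'udp', 'tcp', 'icmp'],
    'src_bytes': [181, 239, 0, 1032],
})

# Assumption: every non-numeric column is categorical
cat_cols = df.select_dtypes(exclude='number').columns.tolist()

# One-hot encode the categorical columns; numerical columns pass through unchanged
df_onehot = pd.get_dummies(df, columns=cat_cols, dtype=float)

# Scale every feature to the [0, 1] range; the number of rows is unchanged
X_prepared = pd.DataFrame(
    MinMaxScaler().fit_transform(df_onehot),
    columns=df_onehot.columns,
    index=df_onehot.index,
)
print(X_prepared.shape)
```

Keeping the original index on the result matters here, because predictions are submitted by row index.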



In [None]:
# clean-up code here


## Outlier detection: your code!


In [None]:
def get_top_N_indices(scores, N=100):
    """ Helper function. Returns the indices of the points with the top N highest outlier scores
    """
    return np.argsort(scores)[::-1][:N]

In [None]:
get_top_N_indices(np.array([5, 4, 3, 2, 1, 0]), N=2)
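
The helper expects scores where a higher value means a more anomalous point. As a sketch of how it might be combined with a detector, the example below scores synthetic data with the squared Mahalanobis distance from `EmpiricalCovariance`. This detector is chosen only for its simplicity and is not a recommendation for the challenge; the data is made up.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

def get_top_N_indices(scores, N=100):
    """Same helper as above: indices of the N highest outlier scores."""
    return np.argsort(scores)[::-1][:N]

rng = np.random.default_rng(42)
# Synthetic stand-in data: 300 inliers around the origin,
# followed by 5 planted outliers (rows 300 to 304) far away
X = np.vstack([rng.normal(0, 1, size=(300, 4)),
               rng.normal(8, 0.5, size=(5, 4))])

# mahalanobis() returns squared distances to the fitted mean:
# larger means more anomalous, which is the convention the helper expects
scores = EmpiricalCovariance().fit(X).mahalanobis(X)

top_idx = get_top_N_indices(scores, N=5)
print(sorted(top_idx.tolist()))
```

Detectors that return higher values for normal points (for instance the `score_samples` method of sklearn's `IsolationForest`) need their scores negated before being passed to the helper.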

## API submission

Submit your predictions to the API with a LabelSubmitter object. 
This object has a `.post_predictions()` method to submit predictions, and a `.get_labels()` method to retrieve the labels (positives and negatives) of all previous submissions. 

In [None]:
username = 'test'
password = getpass.getpass()
# Only create a new LabelSubmitter if none with a .jwt_token is available yet
if not ('ls' in locals() and ls.jwt_token):
    ls = LabelSubmitter(username=username,
                        password=password,
                        url=API_URL)


Pass `endpoint='kdd'` to both methods for this challenge.

In [None]:
ls.post_predictions(idx=[0, 1], endpoint='kdd')

In [None]:
labels = ls.get_labels(endpoint='kdd')
labels