# Tutorial for Reef Workflow v1.0
This dataset is not ideal for the tutorial: we have no ground truth labels for the "unlabeled" portion and no held-out test set. It does, however, walk you through the overall procedure. Contact paroma@stanford.edu for additional (possibly cooler) datasets.

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline



## Load Dataset
For this tutorial, we look at sample data that [Macrobase](http://macrobase.stanford.edu) uses to evaluate their anomaly detection system. Download the file `sample_labeled.csv` from [here](http://paroma.github.io/sample_labeled.csv) and place it in a folder called `mobile_mb` under `reef/data/`.

### Explore Data
We will take a look at the data at hand. The task is to predict which data points are outliers based on the `usage`, `latency`, `location`, and `version` features.

In [2]:
import pandas as pd
df = pd.read_csv('/dfs/scratch0/paroma/reef/data/mobile_mb/sample_labeled.csv', sep=',',header=0)

df

Unnamed: 0,usage,latency,location,version,inlier,outlier
0,30.770,238,CAN,v2,False,False
1,31.280,611,CAN,v2,False,False
2,31.170,768,RUS,v4,False,False
3,30.940,192,AUS,v3,False,False
4,35.360,401,UK,v3,True,False
5,39.120,531,RUS,v4,False,False
6,33.900,223,UK,v3,False,False
7,40.090,582,USA,v1,False,False
8,2.897,391,CAN,v3,False,True
9,39.030,441,CAN,v2,False,False


### Convert Data
For this tutorial, we will designate data points that are outliers as `+1` and inliers as `-1`. Data points that are neither form the unlabeled dataset.

In [3]:
#Convert categorical features to one-hot vector features
df_cat = pd.get_dummies(df)
df_cat

Unnamed: 0,usage,latency,inlier,outlier,location_AUS,location_CAN,location_RUS,location_UK,location_USA,version_v1,version_v2,version_v3,version_v4
0,30.770,238,False,False,0,1,0,0,0,0,1,0,0
1,31.280,611,False,False,0,1,0,0,0,0,1,0,0
2,31.170,768,False,False,0,0,1,0,0,0,0,0,1
3,30.940,192,False,False,1,0,0,0,0,0,0,1,0
4,35.360,401,True,False,0,0,0,1,0,0,0,1,0
5,39.120,531,False,False,0,0,1,0,0,0,0,0,1
6,33.900,223,False,False,0,0,0,1,0,0,0,1,0
7,40.090,582,False,False,0,0,0,0,1,1,0,0,0
8,2.897,391,False,True,0,1,0,0,0,0,0,1,0
9,39.030,441,False,False,0,1,0,0,0,0,1,0,0
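
To see what the one-hot conversion does in isolation, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not taken from `sample_labeled.csv`). It shows that `pd.get_dummies` expands only the string columns and passes numeric and boolean columns through. Depending on the pandas version, the new columns may hold booleans rather than the 0/1 integers shown above.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the tutorial data (values are made up).
toy = pd.DataFrame({
    "usage": [30.77, 35.36, 2.897],
    "location": ["CAN", "UK", "CAN"],
    "version": ["v2", "v3", "v3"],
    "outlier": [False, False, True],
})

# Only string columns are expanded; numeric and boolean columns pass through.
toy_cat = pd.get_dummies(toy)
print(list(toy_cat.columns))
```

Each distinct category becomes its own `<column>_<value>` column, which is why the positional slices used later depend on the column order of `df_cat`.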


In [4]:
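# Build the primitive (feature) matrices and ground truth labels
# for the labeled (val) and unlabeled (train) splits
primitive_matrix_train = []
primitive_matrix_val = []

ground_train = []
ground_val = []

for i in range(df_cat.values.shape[0]):
    # Columns 2 and 3 are taken to be the inlier and outlier flags;
    # a row is labeled if either flag is set
    is_val = df_cat.values[i,2] or df_cat.values[i,3]
    is_outlier = df_cat.values[i,3]

    # Columns 4:12 are used as the primitives
    if is_val:
        primitive_matrix_val.append(df_cat.values[i,4:12])
        ground_val.append(is_outlier)
    else:
        primitive_matrix_train.append(df_cat.values[i,4:12])
        ground_train.append(is_outlier)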
        
np.save('/dfs/scratch0/paroma/reef/data/mobile_mb/primitive_matrix_val.npy', np.array(primitive_matrix_val).astype(float))
np.save('/dfs/scratch0/paroma/reef/data/mobile_mb/ground_val.npy', 2*np.array(ground_val).astype(float)-1.)
np.save('/dfs/scratch0/paroma/reef/data/mobile_mb/primitive_matrix_train.npy', np.array(primitive_matrix_train).astype(float))
np.save('/dfs/scratch0/paroma/reef/data/mobile_mb/ground_train.npy', 2*np.array(ground_train).astype(float)-1.)
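
The split and the label mapping above can also be written with boolean masks. The following is a minimal sketch on hypothetical flag arrays, not a replacement for the cell above. It shows how `2*x - 1` turns `False`/`True` into `-1`/`+1`, and that every unlabeled row ends up with `-1`, which is a placeholder rather than a true label.

```python
import numpy as np

# Hypothetical flags for five rows: (inlier, outlier). Values are illustrative only.
inlier = np.array([False, True, False, False, True])
outlier = np.array([False, False, True, False, False])

# A row belongs to the labeled (validation) split if either flag is set.
is_val = inlier | outlier

# Booleans map to -1 (False) and +1 (True), as in the np.save calls above.
ground_val = 2 * outlier[is_val].astype(float) - 1.0
ground_train = 2 * outlier[~is_val].astype(float) - 1.0

print(ground_val)    # labels for the labeled rows
print(ground_train)  # placeholder labels for the unlabeled rows
```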

## REEF: Automatically Generating Heuristics to Label Training Data

### Load Data
We load in the data that we generated above. The `train_` variables are the data points we do not have ground truth labels for. The `val_` variables are the data points we do have ground truth labels for.

In [5]:
from data.loader import DataLoader
dataset='mobile_mb'

#replace data_path with the location where the .npy files are saved
dl = DataLoader()

train_primitive_matrix, val_primitive_matrix, test_primitive_matrix, train_ground, val_ground, test_ground = dl.load_data(dataset=dataset, data_path='/dfs/scratch0/paroma/reef/data/')
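
Before pointing the loader at a folder, it can help to confirm that the saved arrays line up. The sketch below does not use or reproduce `DataLoader`; it writes two illustrative arrays to a temporary folder (a hypothetical stand-in for `reef/data/`) using the same `<data_path>/<dataset>/` layout and file names as the conversion step, then reloads them and checks that there is one label per row.

```python
import os
import tempfile

import numpy as np

# Illustrative arrays; in the tutorial these come from the conversion step.
primitive_matrix_val = np.array([[0, 1, 0], [1, 0, 0]], dtype=float)
ground_val = np.array([-1.0, 1.0])

# Write to a throwaway folder that mirrors the <data_path>/<dataset>/ layout.
data_path = tempfile.mkdtemp()
dataset_dir = os.path.join(data_path, "mobile_mb")
os.makedirs(dataset_dir)
np.save(os.path.join(dataset_dir, "primitive_matrix_val.npy"), primitive_matrix_val)
np.save(os.path.join(dataset_dir, "ground_val.npy"), ground_val)

# Reload and confirm there is one label per row of the primitive matrix.
loaded_matrix = np.load(os.path.join(dataset_dir, "primitive_matrix_val.npy"))
loaded_ground = np.load(os.path.join(dataset_dir, "ground_val.npy"))
print(loaded_matrix.shape, loaded_ground.shape)
```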

## Synthesis + Verification
Generate heuristics iteratively, using feedback from the verifier to guide each round (the cardinality can be varied via `max_cardinality`).

In [6]:
from program_synthesis.heuristic_generator import HeuristicGenerator
validation_accuracy = []
training_accuracy = []
validation_coverage = []
training_coverage = []
idx = None

hg = HeuristicGenerator(train_primitive_matrix, val_primitive_matrix, 
                            val_ground, train_ground, 
                            b=0.5)
for i in range(3,5):
    if i == 3:
        hg.run_synthesizer(max_cardinality=1, idx=idx, keep=3, model='dt')
    else:
        hg.run_synthesizer(max_cardinality=1, idx=idx, keep=1, model='dt')
    hg.run_verifier()
    
    va,ta, vc, tc = hg.evaluate()
    validation_accuracy.append(va)
    training_accuracy.append(ta)
    validation_coverage.append(vc)
    training_coverage.append(tc)
    
    #No feedback needed for this small dataset (none of the data points are vague)
    #hg.find_feedback()
    #idx = hg.feedback_idx
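
The loop above tracks accuracy and coverage on both splits. As a rough illustration of what these two numbers mean for labels that may abstain, here is a minimal sketch. It assumes labels in `{-1, 0, +1}` with `0` meaning "abstain", and the helper `accuracy_and_coverage` is hypothetical. It is not the implementation behind `hg.evaluate()`.

```python
import numpy as np

def accuracy_and_coverage(predicted, ground):
    """Return (accuracy on covered points, coverage) for labels in {-1, 0, +1}."""
    predicted = np.asarray(predicted, dtype=float)
    ground = np.asarray(ground, dtype=float)
    covered = predicted != 0          # 0 is treated as "abstain" (assumption)
    coverage = covered.mean()
    if not covered.any():
        return float("nan"), float(coverage)
    accuracy = (predicted[covered] == ground[covered]).mean()
    return float(accuracy), float(coverage)

# Illustrative values only.
ground = [1, -1, -1, 1, -1]
predicted = [1, -1, 0, -1, 0]
acc, cov = accuracy_and_coverage(predicted, ground)
print(acc, cov)
```

Under this definition, a heuristic can trade coverage for accuracy by abstaining on points it is unsure about, which is why both quantities are recorded at every iteration.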

### Look at Decision Trees

In [7]:
from sklearn import tree

# Names of the one-hot features used to build the primitive matrices
feat_names = df_cat.columns.values[4:12]

for i, dt in enumerate(hg.hf):
    print 'HF:' + str(i)
    print 'Feat:' + feat_names[hg.feat_combos[i][0]]
    # Write each heuristic's decision tree to a .dot file for Graphviz
    filename = 'hf_' + str(i) + '_feats_' + str(hg.feat_combos[i][0]) + '_' + str(hg.feat_combos[i][0]) + '.dot'
    tree.export_graphviz(dt, out_file=filename)

HF:0
Feat:version_v2
HF:1
Feat:version_v1
HF:2
Feat:location_USA
HF:3
Feat:version_v2
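
To inspect a tree without writing a file, `export_graphviz` can return the DOT source directly. The sketch below fits a depth-1 scikit-learn tree on one made-up binary feature, standing in for a heuristic over a single one-hot feature. The data and the name `stump` are illustrative and do not come from `hg.hf`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# One binary feature and made-up labels, standing in for a cardinality-1 heuristic.
X = np.array([[0.0], [0.0], [1.0], [1.0]])
y = np.array([-1, -1, 1, 1])

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(stump, out_file=None, feature_names=["version_v2"])
print(dot_source.splitlines()[0])
```

Passing `feature_names` makes the split condition in each node readable instead of showing a positional feature index.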


**After running this, run the following command from the command line to convert a .dot file to PostScript (for a PNG, use `-Tpng` and a `.png` output name)**
`dot -Tps filename.dot -o outfile.ps`
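
With several exported trees, the conversion can be scripted instead of typed per file. This is a minimal sketch, assuming Graphviz's `dot` is on the PATH. The helpers `dot_command` and `render_all` are hypothetical, and the file name in the example only follows the naming pattern used above.

```python
import glob
import subprocess

def dot_command(dot_path, fmt="png"):
    """Build the Graphviz command that renders dot_path to the given format."""
    out_path = dot_path[:-len(".dot")] + "." + fmt
    return ["dot", "-T" + fmt, dot_path, "-o", out_path]

def render_all(pattern="hf_*.dot", fmt="png"):
    """Render every matching .dot file; requires the Graphviz `dot` binary."""
    for path in sorted(glob.glob(pattern)):
        subprocess.check_call(dot_command(path, fmt))

# Show the command for one (hypothetical) exported file without running it.
print(dot_command("hf_0_feats_3_3.dot"))
```

Calling `render_all()` from the folder that holds the `.dot` files would then produce one image per heuristic.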

## Performance!

In [8]:
print "Program Synthesis Validation Accuracy: ", np.max(validation_accuracy[1:])

Program Synthesis Validation Accuracy:  0.9017857142857143
