# Evaluation of default configurations


We have two questions to answer:
 1. By which method can we find good symbolic defaults?
 2. **Can we find good (i.e. better than currently known) symbolic defaults?**
 
This notebook addresses the second question.

----

General remarks:
 - Runtime is not currently factored in, although it may be especially important for default values. I did not measure runtime explicitly, but the 'symbolic defaults' for AdaBoost take much longer to train, because they significantly increase `n_estimators` and `max_depth`.
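
If runtime were to be recorded alongside accuracy, a small wall-clock helper would be enough to start with. The sketch below is an assumption about how that could look, not something used in these experiments; `timed` is a hypothetical helper and the `sum` call merely stands in for an expensive `fit`.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-in for an expensive `fit` call.
result, seconds = timed(sum, range(1_000_000))
print(f"finished in {seconds:.4f} s")
```

Storing `seconds` per task next to the accuracy would allow the same win/loss comparison to be made on runtime.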

In [1]:
from persistence import load_problem, load_results_for_problem
from visualization.output_parser import get_performance_from_console_output, get_performance_from_csv

def load_random_search_results(problem_name):
    p = load_problem('problems.json', problem_name)
    return load_results_for_problem(p)

# 1a. SVC

After determining good symbolic defaults, we need to see how they compare to the current scikit-learn defaults. To this end, we compare the following default configurations (the name by which each is referenced henceforth is in bold):

 - The **symbolic_pre** defaults we found from evolutionary optimization, specifically: `C=128, gamma=(mkd / 4)`.
   This symbolic function uses metafeatures as calculated on the dataset *before* it is preprocessed.
 - The **symbolic_post** defaults we found from evolutionary optimization, specifically: `C=64, gamma=mkd`.
   This symbolic function uses metafeatures as calculated on the dataset *after* it has been preprocessed.
 - The scikit-learn **0.20** defaults, specifically: `C=1., gamma=(1 / n_features)`
 - The scikit-learn >= **0.22** defaults, specifically: `C=1., gamma=(1 / (n_features * X.var()))`
 
Note that all of these defaults are in fact symbolic, since each value of `gamma` depends on properties of the dataset.
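
To make the two scikit-learn formulas concrete, the sketch below evaluates them on a small toy matrix. The data and function names are illustrative assumptions and not part of the experiments. `X.var()` is taken to be the population variance over all entries of the matrix, which is what NumPy computes by default.

```python
from statistics import pvariance


def gamma_sklearn_020(X):
    """gamma = 1 / n_features (the scikit-learn 0.20 default)."""
    n_features = len(X[0])
    return 1.0 / n_features


def gamma_sklearn_022(X):
    """gamma = 1 / (n_features * X.var()) (the scikit-learn >= 0.22 default)."""
    n_features = len(X[0])
    # Variance over all entries of the matrix, not per feature.
    all_values = [value for row in X for value in row]
    return 1.0 / (n_features * pvariance(all_values))


# Hypothetical toy data: 3 samples, 2 features.
X_toy = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]
gamma_old = gamma_sklearn_020(X_toy)
gamma_new = gamma_sklearn_022(X_toy)
print(gamma_old, gamma_new)
```

Both values change with the dataset, which is why these defaults count as symbolic as well.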

A second important detail to note is that these settings are not tried by themselves.
A (fairly standard) preprocessing pipeline is applied:
 - **Imputation**: using the mean for numeric features, and the most frequent value for categorical features.
 - **Transformation**: numeric features are scaled to N(0, 1), categorical features are one-hot encoded.
 - **Feature Selection**: all constant features are removed.
 
After these steps, the SVC is invoked on the preprocessed data with the given values for `C` and `gamma`.
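
The steps above could be assembled roughly as follows with scikit-learn. This is a sketch under assumptions, not the pipeline used for the experiments: the helper name `build_svc_pipeline`, the toy data, and choices such as `handle_unknown="ignore"` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC


def build_svc_pipeline(numeric_cols, categorical_cols, C, gamma):
    """Impute, transform, drop constant features, then fit an RBF SVC."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        # A threshold of 0.0 removes features with zero variance.
        ("drop_constant", VarianceThreshold(threshold=0.0)),
        ("svc", SVC(kernel="rbf", C=C, gamma=gamma)),
    ])


# Hypothetical toy data with a missing value per type and one constant column.
X = pd.DataFrame({
    "x1": [0.0, 1.0, np.nan, 3.0, 4.0, 5.0],
    "x2": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "color": pd.Series(["a", "b", "a", np.nan, "b", "a"], dtype=object),
})
y = np.array([0, 0, 0, 1, 1, 1])

model = build_svc_pipeline(["x1", "x2"], ["color"], C=1.0, gamma=0.5).fit(X, y)
kept = model.named_steps["drop_constant"].get_support()
predictions = model.predict(X)
```

Here `C` and `gamma` are passed in as plain numbers, so a symbolic default would first have to be evaluated on the dataset's metafeatures.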

Note: for the scikit-learn defaults, the metafeatures of the preprocessed data are currently used (e.g. `n_features` is determined after one-hot encoding). For the *symbolic* method, `mkd` is determined on the original, (largely) unprocessed dataset (samples with NaN values are ignored).

## 1a.1  Loading Data

In [2]:
# The grid search result from Jan.
svc_results = load_random_search_results('svc_rbf')

# results currently still stored in log. should be aggregated to single file..
# "data/results/pipeline_c128mkd4.txt"
symb_default_performances = get_performance_from_csv("data/results/ppp_svc_rbf_C32_Gmkd.txt")
old_default_performances = get_performance_from_csv("data/results/ppp_svc_rbf_default.txt")
new_default_performances = get_performance_from_csv("data/results/ppp_svc_rbf_default_Gscale.txt")

## 1a.2 Comparing Results
We compare results by the number of times a method's average cross-validation performance is better (first three columns) and by its loss relative to the best result found in the original set of experiments ('loss' column).

**Note:** The 'loss' column is summed over all tasks on which the default has been evaluated. Because the experiments are still in progress and some runs were cut off early by time constraints, the number of evaluated tasks may differ per method; it is reported in the 'N' column. The total loss is *not* normalized by the number of completed tasks.

In [4]:
import pandas as pd
import numpy as np

methods = ['Symbolic', '0.20', '0.22']
df = pd.DataFrame(np.zeros(shape=(len(methods), len(methods)+2)), columns = methods + ['loss', 'N'])
df.index = methods

# Calculate 'wins'
performances = list(zip(methods, [symb_default_performances, old_default_performances, new_default_performances]))
for (method, performance) in performances:
    for (method2, performance2) in performances:
        one_over_two = (performance.avg - performance2.avg) > 0
        # Use a single .loc call; chained indexing may write to a copy.
        df.loc[method, method2] = sum(one_over_two)

# Calculate loss        
for (method, performance) in performances:
    loss_sum = 0
    for i, row in performance.iterrows():
        best_score = svc_results[svc_results.task_id == row.name].predictive_accuracy.max()
        loss = best_score - row.avg
        if loss < 0:
            print('{} outperformed best on task {} by {}'.format(method, row.name, loss))
        loss_sum += loss
    df.loc[method, 'loss'] = loss_sum
    df.loc[method, 'N'] = len(performance)
    
df

Symbolic outperformed best on task 29 by -0.0057970434782607105
Symbolic outperformed best on task 3560 by -0.003849189873417702
Symbolic outperformed best on task 34538 by -0.003703222222221969
Symbolic outperformed best on task 23 by -0.006135132744989891
Symbolic outperformed best on task 9956 by -0.0037779685534591323
Symbolic outperformed best on task 20 by -0.0024999999999999467
Symbolic outperformed best on task 14 by -0.000500000000000056
Symbolic outperformed best on task 146607 by -0.0010720731730265998
Symbolic outperformed best on task 14965 by -0.0003985369498339386
Symbolic outperformed best on task 7592 by -0.003931128026509967
0.20 outperformed best on task 34538 by -0.0027772962962961945
0.20 outperformed best on task 23 by -0.006135132744989891
0.22 outperformed best on task 125920 by -0.008000000000000007
0.22 outperformed best on task 34538 by -0.0027772962962961945
0.22 outperformed best on task 23 by -0.008185141937856244
0.22 outperformed best on task 7592 by -0.

| | Symbolic | 0.20 | 0.22 | loss | N |
|---|---|---|---|---|---|
| Symbolic | 0.0 | 67.0 | 65.0 | 1.740567 | 92.0 |
| 0.20 | 19.0 | 0.0 | 3.0 | 3.965604 | 92.0 |
| 0.22 | 21.0 | 26.0 | 0.0 | 3.351373 | 92.0 |


This reads as follows: *Symbolic* beat the *0.20* default on 67 tasks, while the *0.20* default beat *Symbolic* on 19 tasks. *Symbolic* incurred a total loss of 1.74 relative to the best known result of each task (slightly under 0.02 accuracy per task, on average).
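
Since the reported loss is a sum, dividing it by 'N' gives the average loss per task. The short calculation below does this for the values in the table above; the variable names are illustrative.

```python
# Total loss and number of evaluated tasks, copied from the table above.
total_loss = {"Symbolic": 1.740567, "0.20": 3.965604, "0.22": 3.351373}
n_tasks = {"Symbolic": 92, "0.20": 92, "0.22": 92}

# Average loss in accuracy per evaluated task.
mean_loss = {method: total_loss[method] / n_tasks[method] for method in total_loss}

for method, value in mean_loss.items():
    print(f"{method}: {value:.4f}")
```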

We see that *Symbolic* (i.e. `C=128, gamma=mkd/4`) as default outperforms both scikit-learn defaults, both in the number of tasks on which it achieves higher predictive accuracy and in the total loss in accuracy it incurs across tasks.

# 1b. SVC Poly

## 1b.1 Loading Data

In [6]:
# The grid search result from Jan.
svc_results = load_random_search_results('svc_poly')

# results currently still stored in log. should be aggregated to single file..
# "data/results/pipeline_c128mkd4.txt"
symb_default_performances = get_performance_from_csv("data/results/ppp_svc_poly_D1_C6_G7.txt")
old_default_performances = get_performance_from_csv("data/results/ppp_svc_poly_default.txt")

## 1b.2 Comparing Results
We compare results by the number of times a method's average cross-validation performance is better (first two columns) and by its loss relative to the best result found in the original set of experiments ('loss' column).

**Note:** The 'loss' column is summed over all tasks on which the default has been evaluated. Because the experiments are still in progress and some runs were cut off early by time constraints, the number of evaluated tasks may differ per method; it is reported in the 'N' column. The total loss is *not* normalized by the number of completed tasks.
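
When 'N' differs between methods, one way to keep the losses comparable is to restrict them to the tasks that every method has completed and to report the mean as well as the sum. The sketch below illustrates this with plain dictionaries; the function name and all scores are hypothetical and are not taken from the experiments.

```python
def loss_on_common_tasks(scores_by_method, best_by_task):
    """Sum and average the loss per method over tasks evaluated by all methods."""
    common = set(best_by_task)
    for scores in scores_by_method.values():
        common &= set(scores)

    summary = {}
    for method, scores in scores_by_method.items():
        total = sum(best_by_task[task] - scores[task] for task in common)
        summary[method] = {
            "loss": total,
            "mean_loss": total / len(common) if common else float("nan"),
            "N": len(common),
        }
    return summary


# Hypothetical scores (task id -> mean cross-validation accuracy).
scores = {
    "Symbolic": {1: 0.90, 2: 0.80, 3: 0.70},
    "0.20": {1: 0.85, 2: 0.75},  # task 3 has not finished yet
}
best = {1: 0.95, 2: 0.85, 3: 0.75}

summary = loss_on_common_tasks(scores, best)
```

In this toy example task 3 is excluded for both methods, so the extra completed task does not add to the loss of *Symbolic*.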

In [7]:
import pandas as pd
import numpy as np

methods = ['Symbolic', '0.20']
df = pd.DataFrame(np.zeros(shape=(len(methods), len(methods)+2)), columns = methods + ['loss', 'N'])
df.index = methods

# Calculate 'wins'
# Only two methods are compared here, so only their two result sets are zipped.
performances = list(zip(methods, [symb_default_performances, old_default_performances]))
for (method, performance) in performances:
    for (method2, performance2) in performances:
        one_over_two = (performance.avg - performance2.avg) > 0
        # Use a single .loc call; chained indexing may write to a copy.
        df.loc[method, method2] = sum(one_over_two)

# Calculate loss        
for (method, performance) in performances:
    loss_sum = 0
    for i, row in performance.iterrows():
        best_score = svc_results[svc_results.task_id == row.name].predictive_accuracy.max()
        loss = best_score - row.avg
        if loss < 0:
            print('{} outperformed best on task {} by {}'.format(method, row.name, loss))
        loss_sum += loss
    df.loc[method, 'loss'] = loss_sum
    df.loc[method, 'N'] = len(performance)
    
df

Symbolic outperformed best on task 9976 by -0.01999992307692311
Symbolic outperformed best on task 3904 by -0.0002752198725218813
0.20 outperformed best on task 34538 by -0.004629148148147966
0.20 outperformed best on task 9956 by -0.0012683144654086487
0.20 outperformed best on task 9954 by -0.0031250000000000444
0.20 outperformed best on task 9955 by -0.026249999999999996
0.20 outperformed best on task 20 by -0.0014999999999999458
0.20 outperformed best on task 3917 by -0.0009414299255247061


| | Symbolic | 0.20 | loss | N |
|---|---|---|---|---|
| Symbolic | 0.0 | 12.0 | 11.213985 | 92.0 |
| 0.20 | 77.0 | 0.0 | 3.493108 | 91.0 |


# 2. AdaBoost

From the evolutionary optimization, we found that the following symbolic defaults were often recommended:
 - **learning rate**: 0.75..1.0
 - **n_estimators**: n
 - **max_depth**: p
 
However, as also noted, given the ranges of the hyperparameter values in the original experiments, these might as well read `n_estimators=500` and `max_depth=10`, the bounds of the original experiments.
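
One way to read this is that the symbolic values are effectively clipped at the bounds of the search space. The sketch below encodes that reading; it is an interpretation, not the evaluated configuration. It assumes that `n` and `p` are dataset metafeatures (taken here to be the number of samples and the number of features), uses 0.75 from the recommended learning-rate range, and the function name is hypothetical.

```python
def adaboost_symbolic_defaults(n, p, max_estimators=500, max_tree_depth=10):
    """Return AdaBoost settings with n_estimators=n and max_depth=p, clipped to bounds.

    Assumption: n is the number of samples and p the number of features.
    """
    return {
        "learning_rate": 0.75,
        "n_estimators": min(n, max_estimators),
        "max_depth": min(p, max_tree_depth),
    }


# For any dataset with n >= 500 and p >= 10 the result is constant.
large = adaboost_symbolic_defaults(n=1000, p=20)
small = adaboost_symbolic_defaults(n=100, p=5)
print(large)
print(small)
```

For datasets at or above both bounds the symbolic expression and the constant setting coincide, which is the point made above.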

I am not aware of any proposed changes to default values for this algorithm, so the only comparison is to scikit-learn defaults as they are.

In [None]:
# The grid search result from Jan.
adaboost_results = load_random_search_results('adaboost')

# results currently still stored in log. should be aggregated to single file..
# "data/results/pipeline_c128mkd4.txt"
symb_default_performances = get_performance_from_console_output("data/results/pipeline_ada_75_500_10.txt")
sklearn_default_performances = get_performance_from_console_output("data/results/pipeline_ada_default.txt")

## 2.2 Comparing Results
We compare results by the number of times a method's average cross-validation performance is better (first two columns) and by its loss relative to the best result found in the original set of experiments ('loss' column).

In [None]:
import pandas as pd
import numpy as np

methods = ['Symbolic', 'sklearn']
df = pd.DataFrame(np.zeros(shape=(len(methods), len(methods)+2)), columns = methods + ['loss', 'N'])
df.index = methods

# Calculate 'wins'
performances = list(zip(methods, [symb_default_performances, sklearn_default_performances]))
for (method, performance) in performances:
    for (method2, performance2) in performances:
        one_over_two = (performance.avg - performance2.avg) > 0
        # Use a single .loc call; chained indexing may write to a copy.
        df.loc[method, method2] = sum(one_over_two)

# Calculate loss        
for (method, performance) in performances:
    loss_sum = 0
    for i, row in performance.iterrows():
        # Compare against the AdaBoost results, not the SVC results loaded earlier.
        best_score = adaboost_results[adaboost_results.task_id == row.name].predictive_accuracy.max()
        loss = best_score - row.avg
        if loss < 0:
            print('{} outperformed best on task {} by {}'.format(method, row.name, loss))
        loss_sum += loss
    df.loc[method, 'loss'] = loss_sum
    df.loc[method, 'N'] = len(performance)
    
df

# 3. Random Forest

----
**note**: Everything below is scratchpad and should be ignored

----