# Evaluation of default configurations


We have two questions to answer:
 1. By which method can we find good symbolic defaults?
 2. **Can we find good (i.e. better than currently known) symbolic defaults?**
 
This notebook addresses the second question.

----

General remarks:
 - This evaluation does not currently factor in runtime, which may be especially important for default values. I did not explicitly measure runtime, but the 'symbolic defaults' for AdaBoost take much longer to train (because they significantly increase `n_estimators` and `max_depth`).

In [1]:
from visualization.output_parser import generate_comparisons

# 1a. SVC

After determining good symbolic defaults, we ought to see how they compare to current (scikit-learn) defaults. To this end, we compare three different default configurations (in bold is the name by which they will be referenced henceforth):

 - The **symbolic** defaults we found from evolutionary optimization, specifically: `C=16, gamma=mkd/xvar`.
     This symbolic function uses metafeatures as calculated on the dataset *after* it has been preprocessed.
 - The scikit-learn **0.20** defaults, specifically: `C=1., gamma=(1 / n_features)`
 - The scikit-learn >= **0.22** defaults, specifically: `C=1., gamma=(1 / (n_features * X.var()))`
 
Note that all of these defaults are in fact symbolic: in each case `gamma` is a function of properties of the dataset.
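
The two scikit-learn `gamma` formulas listed above can be written out directly. The following is a minimal sketch using only NumPy; the helper names `gamma_020` and `gamma_022` are hypothetical and not part of scikit-learn, and the data is randomly generated for illustration. It also shows that on purely numeric, standardized data (where `X.var()` is 1) the two formulas coincide.

```python
import numpy as np

def gamma_020(X):
    """gamma = 1 / n_features, the scikit-learn 0.20 formula quoted above."""
    return 1.0 / X.shape[1]

def gamma_022(X):
    """gamma = 1 / (n_features * X.var()), the scikit-learn >= 0.22 formula quoted above."""
    return 1.0 / (X.shape[1] * X.var())

# Illustrative random data: 200 rows, 4 numeric features with variance around 9.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

# Standardize every column to zero mean and unit variance.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print("raw:         ", gamma_020(X_raw), gamma_022(X_raw))
print("standardized:", gamma_020(X_std), gamma_022(X_std))
```

Once one-hot encoded categorical columns are part of the preprocessed data, `X.var()` is generally no longer 1, so the two defaults can differ even after scaling.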

A second important detail to note is that these settings are not tried by themselves.
A (fairly standard) preprocessing pipeline is applied:
 - **Imputation**: using the mean for numeric features, and the most frequent value for categorical features.
 - **Transformation**: numeric features are scaled to N(0, 1), categorical features are one-hot encoded.
 - **Feature Selection**: all constant features are removed.
 
After these steps, the SVC is invoked on the preprocessed data with the given values for `C` and `gamma`.
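
As a rough illustration of the steps above, the pipeline could be assembled with scikit-learn as follows. This is a sketch under assumptions, not the code used for the experiments: the function name `build_svc_pipeline`, the toy data, and the column names are made up, and `gamma=0.5` is a placeholder (the symbolic `mkd/xvar` value would have to be computed from the preprocessed data, which is not shown here).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

def build_svc_pipeline(numeric_cols, categorical_cols, C, gamma):
    """Imputation -> transformation -> constant-feature removal -> RBF SVC."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        # The default threshold of 0.0 drops features that are constant.
        ("drop_constant", VarianceThreshold()),
        ("svc", SVC(kernel="rbf", C=C, gamma=gamma)),
    ])

# Made-up toy data: one numeric column with a missing value,
# one constant numeric column, and one categorical column.
X = pd.DataFrame({
    "height": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "constant": [3.0] * 6,
    "colour": ["red", "blue", "red", "blue", "red", "blue"],
})
y = np.array([0, 0, 0, 1, 1, 1])

pipe = build_svc_pipeline(["height", "constant"], ["colour"], C=16, gamma=0.5)
pipe.fit(X, y)

# Number of features that actually reach the SVC after preprocessing.
n_features_after = pipe[:-1].transform(X).shape[1]
print(n_features_after)
```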

Note that I had previously run experiments that used metafeatures computed on the data *before* preprocessing, which resulted in different defaults: `C=128, gamma=(mkd / 4)`. I evaluated those defaults too: they were better than both the new and old scikit-learn defaults, but worse than the newly found symbolic defaults based on metafeatures computed after transformation. Those results are still stored in an older format, however, and require some additional processing before I can compare them against this newer format.

## Comparing Results
We compare results by the number of tasks on which one default's average cross-validation performance is better than another's (first three columns) and by the loss relative to the best result found in the original set of experiments ('loss' column).
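
The internals of `generate_comparisons` are not shown in this notebook. As a minimal sketch of the two metrics described above, assume a DataFrame of average scores with one row per task and one column per default, plus a Series with the best known score per task. The function name `compare_defaults`, the task ids, and all scores below are made up for illustration.

```python
import pandas as pd

def compare_defaults(avg_scores, best_known):
    """Build a win/loss table.

    avg_scores: DataFrame, rows = task ids, columns = default configurations.
    best_known: Series with the best known score per task id.
    """
    methods = list(avg_scores.columns)
    table = pd.DataFrame(0.0, index=methods, columns=methods + ["loss", "N"])
    for a in methods:
        for b in methods:
            # Number of tasks on which `a` scores strictly higher than `b`.
            table.loc[a, b] = (avg_scores[a] > avg_scores[b]).sum()
        scores = avg_scores[a].dropna()
        # Total loss: summed gap to the best known score on evaluated tasks.
        table.loc[a, "loss"] = (best_known.loc[scores.index] - scores).sum()
        table.loc[a, "N"] = len(scores)
    return table

avg_scores = pd.DataFrame(
    {"symbolic": [0.90, 0.80, 0.70], "0.20": [0.85, 0.82, 0.60]},
    index=[3, 6, 11],
)
best_known = pd.Series([0.95, 0.85, 0.75], index=[3, 6, 11])

result = compare_defaults(avg_scores, best_known)
print(result)
```

Each row reads left to right: how often that default beat each other default, its total loss, and the number of tasks it was evaluated on.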

In [7]:
generate_comparisons('svc_rbf',
                    ["data/results/svc_rbf_c16_gmkdxvar.txt",
                     "data/results/ppp_svc_rbf_default.txt",
                     "data/results/ppp_svc_rbf_default_Gscale.txt"
                    ],
                    ['symbolic-post','0.20','0.22'])

symbolic-post outperformed best on task 14 by -0.0015000000000000568
symbolic-post outperformed best on task 12 by -1.1102230246251565e-16


| | symbolic-post | 0.20 | 0.22 | loss | N |
|---|---|---|---|---|---|
| symbolic-post | 0.0 | 70.0 | 68.0 | 1.660083 | 92.0 |
| 0.20 | 18.0 | 0.0 | 3.0 | 4.107663 | 92.0 |
| 0.22 | 19.0 | 26.0 | 0.0 | 3.493432 | 92.0 |


This reads as: *symbolic-post* won over the *0.20* default on 70 tasks and over the *0.22* default on 68 tasks, while the *0.20* default was better than *symbolic-post* on 18 tasks. *Symbolic-post* accumulated a total loss of 1.66 relative to the best known result of each task (slightly under 0.02 accuracy per task, on average).
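
The per-task average follows from dividing the total loss by the number of tasks. A short sketch using the numbers from the table above (the variable names are only illustrative):

```python
# Total loss per default and the number of tasks, copied from the table above.
total_loss = {"symbolic-post": 1.660083, "0.20": 4.107663, "0.22": 3.493432}
n_tasks = 92

# Average loss in accuracy per task for each default.
mean_loss = {name: loss / n_tasks for name, loss in total_loss.items()}

for name, value in mean_loss.items():
    print(f"{name}: {value:.4f}")
```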

We see that *symbolic-post* (i.e. `C=16, gamma=mkd/xvar`) as a default outperforms both scikit-learn defaults, in terms of both the number of tasks on which it achieves higher predictive accuracy and the loss in accuracy it incurs across tasks.

# 1b. SVC Poly

## 1b.1 Loading Data

In [None]:
# The grid search result from Jan.
svc_results = load_random_search_results('svc_poly')

# results currently still stored in log. should be aggregated to single file..
# "data/results/pipeline_c128mkd4.txt"
symb_default_performances = get_performance_from_csv("data/results/ppp_svc_poly_D1_C6_G7.txt")
old_default_performances = get_performance_from_csv("data/results/ppp_svc_poly_default.txt")

## 1b.2 Comparing Results
We compare results by the number of tasks on which one default's average cross-validation performance is better than the other's (first two columns) and by the loss relative to the best result found in the original set of experiments ('loss' column).

**Note:** The 'loss' column is summed over all tasks on which the default was evaluated. Because this is work in progress and some runs were cut off early due to time constraints, the number of tasks evaluated currently differs per method; the 'N' column gives the number of tasks on which each default has been evaluated. The total loss is *not* normalized by the number of completed tasks.

In [None]:
import pandas as pd
import numpy as np

methods = ['Symbolic', '0.20']
df = pd.DataFrame(np.zeros(shape=(len(methods), len(methods)+2)), columns = methods + ['loss', 'N'])
df.index = methods

# Calculate 'wins'
performances = list(zip(methods, [symb_default_performances, old_default_performances]))
for (method, performance) in performances:
    for (method2, performance2) in performances:
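        # Tasks on which `method` has a strictly higher average score than `method2`.
        one_over_two = (performance.avg - performance2.avg) > 0
        # Single .loc call: chained indexing may assign to a temporary copy.
        df.loc[method, method2] = one_over_two.sum()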

# Calculate loss        
for (method, performance) in performances:
    loss_sum = 0
    for i, row in performance.iterrows():
        best_score = svc_results[svc_results.task_id == row.name].predictive_accuracy.max()
        loss = best_score - row.avg
        if loss < 0:
            print('{} outperformed best on task {} by {}'.format(method, row.name, loss))
        loss_sum += loss
    df.loc[method, 'loss'] = loss_sum
    df.loc[method, 'N'] = len(performance)
    
df

# 2. AdaBoost

From the evolutionary optimization, we found that the following symbolic defaults were often recommended:
 - **learning rate**: 0.75..1.0
 - **n_estimators**: n
 - **max_depth**: p
 
However, as noted before, given the range of hyperparameter values in the original experiments, this might as well read `n_estimators=500` and `max_depth=10`, which are the bounds of the original experiments.
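
To make this concrete, here is a small sketch of why the symbolic values collapse to the bounds on most datasets. It rests on several assumptions that the text does not state explicitly: that `n` is the number of samples and `p` the number of features, that values above the bounds are simply clipped, and that `learning_rate=0.75` is picked from the recommended range. The function name `adaboost_symbolic_defaults` is hypothetical.

```python
def adaboost_symbolic_defaults(n, p, max_estimators=500, max_depth_bound=10):
    """Symbolic AdaBoost defaults, clipped to the bounds of the experiments.

    Assumptions: n = number of samples, p = number of features,
    and out-of-range values are clipped to the bound.
    """
    return {
        "learning_rate": 0.75,
        "n_estimators": min(n, max_estimators),
        "max_depth": min(p, max_depth_bound),
    }

# A dataset with at least 500 rows and 10 features ends up at the bounds.
print(adaboost_symbolic_defaults(n=2000, p=40))

# Only small datasets get values that actually depend on n and p.
print(adaboost_symbolic_defaults(n=150, p=4))
```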

I am not aware of any proposed changes to default values for this algorithm, so the only comparison is to scikit-learn defaults as they are.

In [None]:
# The grid search result from Jan.
adaboost_results = load_random_search_results('adaboost')

# results currently still stored in log. should be aggregated to single file..
# "data/results/pipeline_c128mkd4.txt"
symb_default_performances = get_performance_from_console_output("data/results/pipeline_ada_75_500_10.txt")
sklearn_default_performances = get_performance_from_console_output("data/results/pipeline_ada_default.txt")

## 2.2 Comparing Results
We compare results by the number of tasks on which one default's average cross-validation performance is better than the other's (first two columns) and by the loss relative to the best result found in the original set of experiments ('loss' column).

In [None]:
import pandas as pd
import numpy as np

methods = ['Symbolic', 'sklearn']
df = pd.DataFrame(np.zeros(shape=(len(methods), len(methods)+2)), columns = methods + ['loss', 'N'])
df.index = methods

# Calculate 'wins'
performances = list(zip(methods, [symb_default_performances, sklearn_default_performances]))
for (method, performance) in performances:
    for (method2, performance2) in performances:
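        # Tasks on which `method` has a strictly higher average score than `method2`.
        one_over_two = (performance.avg - performance2.avg) > 0
        # Single .loc call: chained indexing may assign to a temporary copy.
        df.loc[method, method2] = one_over_two.sum()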

# Calculate loss        
for (method, performance) in performances:
    loss_sum = 0
    for i, row in performance.iterrows():
        best_score = adaboost_results[adaboost_results.task_id == row.name].predictive_accuracy.max()
        loss = best_score - row.avg
        if loss < 0:
            print('{} outperformed best on task {} by {}'.format(method, row.name, loss))
        loss_sum += loss
    df.loc[method, 'loss'] = loss_sum
    df.loc[method, 'N'] = len(performance)
    
df

The symbolic defaults take far longer to train, so I could not reasonably evaluate them on all datasets.

# 3. Random Forest

## 3.1 Loading Results

In [None]:
# The grid search result from Jan.
rfc_results = load_random_search_results('randomforest')

# results currently still stored in log. should be aggregated to single file..
# "data/results/pipeline_c128mkd4.txt"
symb_default_performances = get_performance_from_csv("data/results/ppp_rf_p027.txt")
sklearn_default_performances = get_performance_from_csv("data/results/ppp_rfc_default.txt")

## 3.2 Comparing Results
We compare results by the number of tasks on which one default's average cross-validation performance is better than the other's (first two columns) and by the loss relative to the best result found in the original set of experiments ('loss' column).

In [None]:
import pandas as pd
import numpy as np

methods = ['Symbolic', 'sklearn']
df = pd.DataFrame(np.zeros(shape=(len(methods), len(methods)+2)), columns = methods + ['loss', 'N'])
df.index = methods

# Calculate 'wins'
performances = list(zip(methods, [symb_default_performances, sklearn_default_performances]))
for (method, performance) in performances:
    for (method2, performance2) in performances:
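        # Tasks on which `method` has a strictly higher average score than `method2`.
        one_over_two = (performance.avg - performance2.avg) > 0
        # Single .loc call: chained indexing may assign to a temporary copy.
        df.loc[method, method2] = one_over_two.sum()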

# Calculate loss        
for (method, performance) in performances:
    loss_sum = 0
    for i, row in performance.iterrows():
        best_score = rfc_results[rfc_results.task_id == row.name].predictive_accuracy.max()
        if np.isnan(best_score):
            continue
        loss = best_score - row.avg
        if loss < 0:
            print('{} outperformed best on task {} by {}'.format(method, row.name, loss))
        print(method, row.name, loss, best_score, row.avg)
        loss_sum += loss
    df.loc[method, 'loss'] = loss_sum
    df.loc[method, 'N'] = len(performance)
    
df

The default of `max_features=0.271` did outperform the default of `sqrt(p)` in terms of loss. However, it is not clearly a better default: `sqrt(p)` still wins on more tasks, and its runtime scales much better with the number of features in a dataset.
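
The runtime argument can be illustrated by counting how many candidate features each setting considers per split. The sketch below assumes that a fractional `max_features` is resolved by truncating `max_features * p` and that `sqrt` is resolved by truncating `sqrt(p)`, each with a minimum of 1; this mirrors my understanding of scikit-learn's behaviour but should be treated as an assumption. The helper `features_per_split` is hypothetical.

```python
import math

def features_per_split(p, max_features):
    """Number of candidate features per split for a dataset with p features."""
    if max_features == "sqrt":
        return max(1, int(math.sqrt(p)))
    # Fractional setting: a fixed share of all features.
    return max(1, int(max_features * p))

for p in (10, 50, 100, 400):
    print(p, features_per_split(p, "sqrt"), features_per_split(p, 0.271))
```

The fractional default grows linearly in `p` while `sqrt(p)` grows sublinearly; the two cross at roughly `p = 1 / 0.271**2`, i.e. around 14 features.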

# 4. XGBoost Linear

In [8]:
from visualization.output_parser import generate_comparisons

In [9]:
generate_comparisons('xgb_linear',
                     ["data/results/xgblinear_symbolic_eval",
                      "data/results/xgblinear_default_eval"],
                     ['Symbolic', 'Default'])

Symbolic outperformed best on task 3493 by -2.4371584699256488e-06
Default outperformed best on task 3493 by -2.4371584699256488e-06


| | Symbolic | Default | loss | N |
|---|---|---|---|---|
| Symbolic | 0.0 | 2.0 | 5.665586 | 82.0 |
| Default | 65.0 | 0.0 | 3.105805 | 95.0 |


Note that the loss above is only calculated for ~37 tasks (and possibly for a different number of tasks for symbolic and default; if so, the tasks for symbolic are a subset of those for default, meaning that default has the lower loss either way).
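
One way to sidestep the differing task counts would be to compute the loss only on tasks that every default has been evaluated on. This is a possible approach and not what `generate_comparisons` does, as far as this notebook shows; the function name `loss_on_common_tasks`, the task ids, and the scores are made up.

```python
import pandas as pd

def loss_on_common_tasks(performances, best_known):
    """Total and mean loss per default, restricted to shared tasks.

    performances: dict mapping a name to a Series of average scores by task id.
    best_known: Series with the best known score per task id.
    """
    common = None
    for scores in performances.values():
        idx = scores.dropna().index
        common = idx if common is None else common.intersection(idx)
    rows = {}
    for name, scores in performances.items():
        loss = best_known.loc[common] - scores.loc[common]
        rows[name] = {"loss": loss.sum(), "mean_loss": loss.mean(), "N": len(common)}
    return pd.DataFrame.from_dict(rows, orient="index")

# Symbolic was evaluated on two tasks, Default on three.
symbolic = pd.Series([0.80, 0.70], index=[3, 6])
default = pd.Series([0.85, 0.72, 0.90], index=[3, 6, 11])
best = pd.Series([0.90, 0.75, 0.95], index=[3, 6, 11])

table = loss_on_common_tasks({"Symbolic": symbolic, "Default": default}, best)
print(table)
```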

---

----
**note**: Everything below is scratchpad and should be ignored

----