## Analysis of the experiments

### Goal: Analyze the hyperparameters

- Goal: Analyze the hyperparameters of the experiments, summarize them, and explore trends
- Author: T. Slanináková, xslanin@fi.muni.cz
- Date: 2023-07-17

**Observed points / notes:**
- parameters with the highest impact on overall time: number of categories (`cat`), number of most similar buckets searched (`bucket`), and possibly number of epochs (`ep`)
- parameters with the lowest impact: `lr`, `model`
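
One possible way to check observations like these (not necessarily how they were obtained) is to rank the numeric parameters by the strength of their rank correlation with `querytime`. The sketch below is only an illustration: the helper `rank_parameter_impact` is a hypothetical name, the categorical `model` column is left out, and the sample frame holds just the seven rows that appear in the outputs further down in this notebook, so its numbers say nothing about the full set of experiments.

```python
import pandas as pd

def rank_parameter_impact(df, params=('ep', 'lr', 'cat', 'bucket'), target='querytime'):
    """Spearman correlation of each numeric parameter with `target`, strongest first."""
    cols = list(params)
    # Spearman correlation equals Pearson correlation computed on ranks
    corr = df[cols].rank().corrwith(df[target].rank())
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# the seven rows visible in this notebook's outputs (far too few for real conclusions)
rows = pd.DataFrame({
    'ep': [212, 55, 205, 205, 211, 210, 209],
    'lr': [0.008, 0.09, 0.009, 0.009, 0.008, 0.009, 0.009],
    'cat': [122, 122, 122, 123, 122, 122, 122],
    'bucket': [1, 4, 4, 4, 4, 4, 4],
    'querytime': [216.556589, 669.535078, 514.916801, 519.733375,
                  526.341861, 526.359743, 534.578541],
})
impact = rank_parameter_impact(rows)
print(impact)
```

On the processed DataFrame the call would be `rank_parameter_impact(df)`, or `rank_parameter_impact(df, target='buildtime')` for build time.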

In [37]:
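# pandas options
import pandas as pd
import warnings
try:
    # newer pandas releases expose the warning in `pandas.errors`
    from pandas.errors import SettingWithCopyWarning
except ImportError:
    # older pandas releases
    from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)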

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

To generate `res-stable.csv`, run `python eval/eval.py`:
- it iterates through every experiment in `result/` and transforms each one into a row of `res-stable.csv`

#### Load the experiments from `result/`, keep only the '10M' size

In [38]:
df = pd.read_csv('res-stable.csv').query('size == "10M"')#.query('params.str.len() > 30', engine='python')
df.shape

(15535, 7)

In [39]:
df.head(2)

Unnamed: 0,data,size,algo,buildtime,querytime,params,recall
0,pca96v2,10M,Learned-index,22322.130576,216.556589,learned-index-pca96v2-10M-ep=212-lr=0.008-cat=122-model=MLP-4-buck=1-862353.elixir-pbs.elixir-czech.cz,0.67625
1,pca96v2,10M,Learned-index,5743.931219,669.535078,learned-index-pca96v2-10M-ep=55-lr=0.09-cat=122-model=MLP-6-buck=4-861285.elixir-pbs.elixir-czech.cz,0.88893


In [42]:
def process_dataframe(df):
    """ Parse the hyperparameters out of the `params` column. """
    # work on a copy so that the caller's DataFrame is not modified
    df = df.copy()
    # `params` splits into 13 '-'-separated parts; `model=MLP-4` spans parts 7 and 8
    parts = list(range(13))
    df[parts] = df.params.str.split('-', expand=True)
    # drop rows with missing values (e.g. `params` with fewer than 13 parts)
    df = df.dropna()
    df['ep'] = df[4].str.split('=', expand=True)[1].astype(int)
    df['lr'] = df[5].str.split('=', expand=True)[1].astype(float)
    df['cat'] = df[6].str.split('=', expand=True)[1].astype(int)
    df['model'] = ('MLP-' + df[8]).astype('category')
    df['bucket'] = df[9].str.split('=', expand=True)[1].astype(int)
    df['jobid'] = df[10].str.split('.elixir', expand=True)[0].astype(int)
    # move `params` column to the back
    df.insert(df.shape[1]-1, 'params', df.pop('params'))
    # drop the temporary split columns
    df = df.drop(parts, axis=1)
    return df
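
The `params` strings shown above can also be parsed with a single regular expression instead of positional splitting. The following is a minimal sketch of that alternative, not the notebook's own method: `PARAMS_PATTERN` and `parse_params` are hypothetical names, and the pattern assumes that every `params` value follows the `ep=…-lr=…-cat=…-model=MLP-…-buck=…-<jobid>.` layout of the two rows displayed earlier, with `lr` written in plain decimal notation.

```python
import pandas as pd

# named groups become the column names of the extracted DataFrame
PARAMS_PATTERN = (
    r'ep=(?P<ep>\d+)-lr=(?P<lr>[\d.]+)-cat=(?P<cat>\d+)'
    r'-model=(?P<model>MLP-\d+)-buck=(?P<bucket>\d+)-(?P<jobid>\d+)\.'
)

def parse_params(params):
    """Extract the hyperparameters from a Series of `params` strings."""
    # rows that do not match the pattern come back as NaN and are dropped
    parsed = params.str.extract(PARAMS_PATTERN).dropna()
    return parsed.astype({
        'ep': int, 'lr': float, 'cat': int,
        'model': 'category', 'bucket': int, 'jobid': int,
    })

# the two `params` values shown by `df.head(2)`
sample = pd.Series([
    'learned-index-pca96v2-10M-ep=212-lr=0.008-cat=122-model=MLP-4-buck=1-862353.elixir-pbs.elixir-czech.cz',
    'learned-index-pca96v2-10M-ep=55-lr=0.09-cat=122-model=MLP-6-buck=4-861285.elixir-pbs.elixir-czech.cz',
])
parsed = parse_params(sample)
print(parsed)
```

Because the pattern is anchored on the key names rather than on positions, the hyphen inside `MLP-4` needs no special handling.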

In [43]:
df = process_dataframe(df)
df.head(2)

Unnamed: 0,data,size,algo,buildtime,querytime,recall,ep,lr,cat,model,bucket,jobid,params
0,pca96v2,10M,Learned-index,22322.130576,216.556589,0.67625,212,0.008,122,MLP-4,1,862353,learned-index-pca96v2-10M-ep=212-lr=0.008-cat=122-model=MLP-4-buck=1-862353.elixir-pbs.elixir-czech.cz
1,pca96v2,10M,Learned-index,5743.931219,669.535078,0.88893,55,0.09,122,MLP-6,4,861285,learned-index-pca96v2-10M-ep=55-lr=0.09-cat=122-model=MLP-6-buck=4-861285.elixir-pbs.elixir-czech.cz


In [45]:
# check out 5 fastest setups with recall higher than 90%
df.query('recall > 0.9').sort_values(by='querytime').head(5)

Unnamed: 0,data,size,algo,buildtime,querytime,recall,ep,lr,cat,model,bucket,jobid,params
594,pca96v2,10M,Learned-index,29538.396589,514.916801,0.90883,205,0.009,122,MLP-5,4,862500,learned-index-pca96v2-10M-ep=205-lr=0.009-cat=122-model=MLP-5-buck=4-862500.elixir-pbs.elixir-czech.cz
2100,pca96v2,10M,Learned-index,29868.730211,519.733375,0.90273,205,0.009,123,MLP-5,4,862501,learned-index-pca96v2-10M-ep=205-lr=0.009-cat=123-model=MLP-5-buck=4-862501.elixir-pbs.elixir-czech.cz
7393,pca96v2,10M,Learned-index,34515.362605,526.341861,0.9116,211,0.008,122,MLP-7,4,862386,learned-index-pca96v2-10M-ep=211-lr=0.008-cat=122-model=MLP-7-buck=4-862386.elixir-pbs.elixir-czech.cz
7617,pca96v2,10M,Learned-index,32231.619164,526.359743,0.91039,210,0.009,122,MLP-3,4,859428,learned-index-pca96v2-10M-ep=210-lr=0.009-cat=122-model=MLP-3-buck=4-859428.elixir-pbs.elixir-czech.cz
5929,pca96v2,10M,Learned-index,31652.786343,534.578541,0.91023,209,0.009,122,MLP-5,4,862506,learned-index-pca96v2-10M-ep=209-lr=0.009-cat=122-model=MLP-5-buck=4-862506.elixir-pbs.elixir-czech.cz


In [51]:
# use `jobid` to map to logs of the experiments in `job-logs`
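
A small helper can make this lookup repeatable. The sketch below assumes that the log files in `job-logs` carry the job id somewhere in their file name; the notebook does not state the naming scheme, so both that assumption and the helper name `find_job_logs` are hypothetical.

```python
from pathlib import Path

def find_job_logs(jobid, log_dir='job-logs'):
    """Return the files in `log_dir` whose name contains `jobid` (assumed naming scheme)."""
    return sorted(Path(log_dir).glob(f'*{jobid}*'))
```

For the fastest setup listed above, the call would be `find_job_logs(862500)`.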

#### Fastest query time (recall > 0.9) by data type, bucket size and number of categories

In [48]:
df.query('recall > 0.9').groupby('data')['querytime'].min()

data
clip768v2    839.350932
pca32v2      689.580922
pca96v2      514.916801
Name: querytime, dtype: float64

In [54]:
df.query('recall > 0.9').groupby('bucket')['querytime'].min()

bucket
2     1587.177841
3      853.549118
4      514.916801
5      647.813633
6      661.953912
7      798.435510
8      820.391420
9     1035.096954
10    1010.960264
12    1226.541410
20    2772.028177
24    4191.389168
40    5401.773514
Name: querytime, dtype: float64

In [60]:
df.query('recall > 0.9').groupby('cat')['querytime'].min().to_frame().style.background_gradient(axis=1, subset=['querytime'], vmin=500, vmax=3000)

cat,querytime
10,2807.695384
20,1587.177841
30,1576.220044
50,1047.837158
80,926.336974
90,877.09595
100,699.878418
105,689.580922
110,564.778666
112,572.394605
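
The `groupby(...).min()` calls above return only the minimum `querytime` per group. To see which setup achieved it, `idxmin` can be used to pull the full row instead. This is a minimal sketch with a hypothetical helper `fastest_per_group`; the sample frame reuses four rows from the outputs above, reduced to a few columns.

```python
import pandas as pd

def fastest_per_group(df, by, min_recall=0.9):
    """Return the full row of the fastest setup (lowest `querytime`) in each `by` group."""
    good = df[df['recall'] > min_recall]
    # idxmin yields the index label of the fastest row within each group
    return good.loc[good.groupby(by)['querytime'].idxmin()]

# four rows taken from the outputs above; the first one fails the recall filter
rows = pd.DataFrame(
    {
        'querytime': [669.535078, 514.916801, 519.733375, 526.341861],
        'recall': [0.88893, 0.90883, 0.90273, 0.9116],
        'cat': [122, 122, 123, 122],
        'model': ['MLP-6', 'MLP-5', 'MLP-5', 'MLP-7'],
    },
    index=[1, 594, 2100, 7393],
)
best = fastest_per_group(rows, by='cat')
print(best)
```

On the processed DataFrame, `fastest_per_group(df, by='bucket')` would list the complete configuration behind each per-bucket minimum.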
