## Evaluation results

### Overall association statistics

The tables in this section show the standard association metrics between human scores and different types of machine scores. These results are computed on the evaluation set. The scores for each model have been truncated to [min-0.4998, max+0.4998]. When indicated, scaled scores are computed by rescaling the predicted scores using the mean and standard deviation of the human scores observed on the training data and the mean and standard deviation of the machine scores predicted for the training set.
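
To make the truncation and rescaling described above concrete, here is a minimal sketch in plain Python. The function names `trim` and `rescale` and their parameters are hypothetical and chosen for illustration only; they are not taken from the code that produced these tables, and the order in which the two steps are applied is not specified in this section.

```python
def trim(score, trim_min, trim_max, tolerance=0.4998):
    """Clip a score to [trim_min - tolerance, trim_max + tolerance]."""
    low, high = trim_min - tolerance, trim_max + tolerance
    return min(max(score, low), high)


def rescale(score, train_pred_mean, train_pred_sd,
            train_human_mean, train_human_sd):
    """Map a predicted score onto the human score scale.

    The prediction is converted to a z-score using the training-set
    predictions, then mapped back using the training-set human scores.
    """
    z = (score - train_pred_mean) / train_pred_sd
    return z * train_human_sd + train_human_mean


# Hypothetical 1-6 score scale and made-up training statistics.
clipped = trim(6.9, trim_min=1, trim_max=6)
scaled = rescale(3.0, train_pred_mean=2.0, train_pred_sd=1.0,
                 train_human_mean=3.0, train_human_sd=0.5)
```

With a tolerance just under 0.5, a truncated score at either end of the scale still rounds to the nearest valid score point rather than to a value outside the scale.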


In [None]:
import os

import pandas as pd


def read_evals(model_list, file_format_summarize):
    """Collect the short evaluation table of every model into one data frame."""
    # NOTE: `file_format_summarize` is not used here; each entry in
    # `model_list` carries its own file format.
    evals = []
    for (model_id, config, csvdir, file_format) in model_list:
        csv_file = os.path.join(csvdir, '{}_eval_short.{}'.format(model_id, file_format))
        if os.path.exists(csv_file):
            df_eval = DataReader.read_from_file(csv_file, index_col=0)
            df_eval.index = [model_id]

            # figure out whether the score was scaled
            scaled = (config.get('use_scaled_predictions') == True
                      or config.get('scale_with') is not None)
            df_eval['system score type'] = 'scale' if scaled else 'raw'

            # rename the columns to remove the reference to scale/raw scores
            new_column_names = [col.split('.')[0] if 'round' not in col
                                else '{} (rounded)'.format(col.split('.')[0])
                                for col in df_eval.columns]
            df_eval.columns = new_column_names
            evals.append(df_eval)
    if len(evals) > 0:
        df_evals = pd.concat(evals, sort=True)
    else:
        df_evals = pd.DataFrame()
    return df_evals

df_eval = read_evals(model_list, file_format_summarize)
if not df_eval.empty:
    writer = DataWriter(summary_id)
    writer.write_experiment_output(output_dir,
                                   {'eval_short': df_eval},
                                   index=True,
                                   file_format=file_format_summarize)

#### Descriptive holistic score statistics

The table shows distributional properties of human and system scores. SMD values lower than -0.15 or higher than 0.15 are <span class="highlight_color">highlighted</span>.
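
As a rough illustration of how an SMD value and its highlighting could be derived, the sketch below uses one common definition of the standardized mean difference: the difference in means divided by the square root of the average of the two variances. That denominator is an assumption; the tool that generated this table may define SMD differently. `smd` and `highlight_outside` are hypothetical helpers, not the `color_highlighter` used in the cell below.

```python
import math


def smd(h_mean, h_sd, sys_mean, sys_sd):
    """Standardized mean difference between system and human scores.

    ASSUMPTION: the denominator is sqrt((h_sd^2 + sys_sd^2) / 2).
    """
    return (sys_mean - h_mean) / math.sqrt((h_sd ** 2 + sys_sd ** 2) / 2)


def highlight_outside(value, low=-0.15, high=0.15, span_class="highlight_color"):
    """Wrap the formatted value in a <span> when it lies outside [low, high]."""
    text = "{:.3f}".format(value)
    if value < low or value > high:
        return '<span class="{}">{}</span>'.format(span_class, text)
    return text


# Made-up summary statistics for a single model.
value = smd(h_mean=3.0, h_sd=1.0, sys_mean=3.2, sys_sd=1.0)
cell = highlight_outside(value)
```

Because the helper returns raw HTML, a table built from it has to be rendered with escaping turned off, which is what `escape=False` does in the `to_html` call.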

In [None]:
pd.options.display.width=10
formatter = partial(color_highlighter, low=-0.15, high=0.15)
if not df_eval.empty:
    # distributional statistics for human (h_*) and system (sys_*) scores
    columns = ['N', 'system score type', 'h_mean', 'h_sd',
               'sys_mean', 'sys_sd', 'SMD']
    display(HTML(df_eval[columns].to_html(index=True,
                                          classes=['sortable'],
                                          escape=False,
                                          formatters={'SMD': formatter},
                                          float_format=int_or_float_format_func)))
else:
    display(Markdown("No information available for any of the models"))

#### Association statistics

The table shows the standard association metrics between human scores and machine scores. Note that some evaluations are based on rounded (`Trim-round`) scores computed by first truncating and then rounding the predicted score.
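
The trim-then-round step and the two agreement columns can be sketched as follows. This is an illustration under stated assumptions: `trim_round` and `agreement` are hypothetical names, agreement is reported here as a percentage, and adjacent agreement is taken to mean scores that differ by at most one point. The actual metric definitions used for the table may differ.

```python
def trim_round(score, trim_min, trim_max, tolerance=0.4998):
    """Truncate to [trim_min - tolerance, trim_max + tolerance], then round."""
    trimmed = min(max(score, trim_min - tolerance), trim_max + tolerance)
    # Python's round() sends exact .5 values to the nearest even integer;
    # how such ties are handled in the real pipeline is not stated here.
    return int(round(trimmed))


def agreement(human, system, max_diff=0):
    """Percentage of score pairs that differ by at most `max_diff` points."""
    matches = sum(abs(h - s) <= max_diff for h, s in zip(human, system))
    return 100.0 * matches / len(human)


# Made-up human scores and raw predictions on a 1-6 scale.
human = [1, 3, 4, 6]
predicted = [0.2, 3.4, 2.8, 7.3]
rounded = [trim_round(p, 1, 6) for p in predicted]

exact = agreement(human, rounded)                 # identical scores
adjacent = agreement(human, rounded, max_diff=1)  # within one point
```

Metrics such as `corr`, `R2`, and `RMSE` are computed on the unrounded scores, which is why only the kappa and agreement columns carry the `(rounded)` suffix.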

In [None]:
if not df_eval.empty:
    # association metrics; columns marked "(rounded)" use trim-round scores
    columns = ['N',
               'system score type',
               'corr', 'R2', 'RMSE',
               'wtkappa (rounded)',
               'kappa (rounded)',
               'exact_agr (rounded)',
               'adj_agr (rounded)']
    display(HTML(df_eval[columns].to_html(index=True,
                                          classes=['sortable'],
                                          escape=False,
                                          float_format=int_or_float_format_func)))
else:
    display(Markdown("No information available for any of the models"))