## Goal of this notebook
Compare the quality of the different combinations of QA systems built as part of `build-testing-qa`. This notebook ingests every file created by that process, as long as it is in the `generated_comparison_files` folder.

In [9]:
import os
import pandas as pd
import pickle

In [7]:
dir_prefix = './generated_comparison_files/'
# Collect every pickle file in the comparison folder, in a stable order
all_files = sorted(
    f for f in os.listdir(dir_prefix)
    if os.path.isfile(os.path.join(dir_prefix, f)) and f.endswith('.pkl')
)
len(all_files)
all_files[0]

'word(100,20)_finetuned_GENERATIVE_BART(8,500).pkl'

In [17]:
# Load each pickle into its own DataFrame
concat_files = []
for fname in all_files:
    with open(os.path.join(dir_prefix, fname), 'rb') as fp:
        concat_files.append(pd.DataFrame(pickle.load(fp)))

# Viewing all of this in a single notebook window is messy, so also write a combined CSV
pd.concat(concat_files).to_csv(os.path.join(dir_prefix, 'combined_comparison.csv'))
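
When comparing configurations it can help to know which pickle each row came from. The sketch below is one possible way to do that, not what the notebook itself does: `load_comparison_files` and the `source_file` column are hypothetical names, and it assumes each pickle holds something `pd.DataFrame` can consume (here, a list of dicts). The demo records are made up.

```python
import os
import pickle
import tempfile

import pandas as pd


def load_comparison_files(directory):
    """Load every .pkl in `directory` into one DataFrame, tagging rows with their file."""
    frames = []
    for fname in sorted(os.listdir(directory)):
        path = os.path.join(directory, fname)
        if not (os.path.isfile(path) and fname.endswith(".pkl")):
            continue
        with open(path, "rb") as fp:
            frame = pd.DataFrame(pickle.load(fp))
        frame["source_file"] = fname  # hypothetical column, not in the original files
        frames.append(frame)
    # ignore_index gives one continuous row index across all files
    return pd.concat(frames, ignore_index=True)


# Made-up records written to a temporary folder, purely for illustration
with tempfile.TemporaryDirectory() as tmp:
    demo = {
        "a.pkl": [{"question": "q1", "answer": "x"}],
        "b.pkl": [{"question": "q1", "answer": "y"}, {"question": "q2", "answer": "z"}],
    }
    for name, records in demo.items():
        with open(os.path.join(tmp, name), "wb") as fp:
            pickle.dump(records, fp)
    combined = load_comparison_files(tmp)
```

In the notebook this would be called as `load_comparison_files(dir_prefix)`.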

In [19]:
df = pd.concat(concat_files)
preprocessor_params = ['PREPROCESSOR_SPLIT_BY', 'PREPROCESSOR_SPLIT_LENGTH', 'PREPROCESSOR_SPLIT_OVERLAP']
embedding_params = ['EMBEDDING_MODEL', 'EMBEDDING_MODEL_SHORTNAME', 'EMBEDDING_MAX_SEQ_LENGTH']
output_params = ['OUTPUT_TYPE', 'OUTPUT_NBEAMS', 'OUTPUT_MAXLENGTH', 'retriever_topk']
decision_factors = ['question', 'answer', 'exec_time_seconds']

df[decision_factors]

Unnamed: 0,question,answer,exec_time_seconds
0,Who is Avery Kelly?,Avery Kelly is one of the most famous people i...,11.973174
1,Who is Avery Kelly?,"Avery Kelly is a witch. She's a witch, but she...",10.186448
2,Who is Avery Kelly?,"Avery Kelly is a witch. She's a witch, but she...",12.462568
3,Who is Avery Kelly?,"Avery Kelly is a witch. She's a witch, but she...",12.849471
4,Who is Zed?,Zed Sadler is one of the founding members of t...,6.341122
...,...,...,...
59,How long has the Carmine been dead?,"I don't know the answer to your question, but ...",4.843870
60,How old is Matthew?,"I don't know the answer to your question, but ...",13.131870
61,How old is Matthew?,"I don't know the answer to your question, but ...",28.584262
62,How old is Matthew?,"I don't know the answer to your question, but ...",10.802597


## Summary of Learnings:
- Execution time does not appear to correlate directly with the retriever's top-k.
- Execution time varies somewhat with the question: minimum times are similar (about 1 second), while maximum times are highly variable (4-40 seconds).
- 400-token preprocessed data tends to skew ~30% higher in execution time.
- The embedding model had the largest impact on execution time: comparing distilbert and the finetuned model, the average case was 50% slower, the worst case 75% slower, and the best case 100% slower.
- The finetuned model loves to say "I don't know if this is what you've looking for".
- A split length of 400 seems a little long, but 100 is too short.
- Honestly, there may have been too many questions created.
- A good top-k is around 10.
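
The timing observations above could be checked with a grouped summary of `exec_time_seconds`. This is a minimal sketch, assuming the combined DataFrame has the `question`, `retriever_topk`, and `exec_time_seconds` columns named earlier; `summarize_exec_time` is a hypothetical helper and the demo values are made up, not results from the notebook.

```python
import pandas as pd


def summarize_exec_time(df, by):
    """Min / mean / max / count of exec_time_seconds for each value of `by`."""
    return (
        df.groupby(by)["exec_time_seconds"]
        .agg(["min", "mean", "max", "count"])
        .sort_values("mean")
    )


# Made-up timings, purely for illustration
demo = pd.DataFrame({
    "question": ["q1", "q1", "q2", "q2"],
    "retriever_topk": [5, 10, 5, 10],
    "exec_time_seconds": [1.0, 3.0, 2.0, 6.0],
})

by_question = summarize_exec_time(demo, "question")
by_topk = summarize_exec_time(demo, "retriever_topk")
```

On the real data, passing `'EMBEDDING_MODEL_SHORTNAME'` or `'PREPROCESSOR_SPLIT_LENGTH'` as `by` would give the same summary per embedding model or per split length.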
### Question-by-question performance
- Verona Sight: OK. Low word-count documents are generally better, though typical distilbert is bad. Top-k doesn't matter past 10, but that might be a limitation of the Generator. Not super accurate.
- Carmine Dead: All slightly wrong, but very similar.
- Matthew: Low-length finetuned is the best; everything else gets too inventive.
- Snowdrop: BERT has no clue, and the others are roughly the same (with some generative weirdness).
- Awakening Ritual: Distilbert did well, but longer passages were needed to properly build the context.
- Miss: Completely lost (ha).
- Forest Ribbon Trail: Not too bad on all 3; the larger context is better.
- Toadswallow: All bad.
- Alpeona: Not great.
- Hungry Choir: Finetuned and distilbert are all good, but longer context is important.
- Kennet: Finetuned for sure, with more context.
- Arena: Finetuned.
- Avery: Finetuned, but none great.
- Zed: Finetuned.
- Maricaca: Finetuned.
- Why Avery: None great.
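
Judgements like the ones above are easier to make when the answers for one question sit next to each other. The sketch below shows one way to lay that out, assuming the `question`, `answer`, `EMBEDDING_MODEL_SHORTNAME`, and `PREPROCESSOR_SPLIT_LENGTH` columns named earlier; `answers_side_by_side` is a hypothetical helper and the demo rows are made up.

```python
import pandas as pd


def answers_side_by_side(df, config_cols):
    """One row per question, one column per configuration; cells hold the answer.

    If a question has several answers for the same configuration,
    only the first one is kept.
    """
    return (
        df.groupby(["question"] + list(config_cols))["answer"]
        .first()
        .unstack(list(config_cols))
    )


# Made-up answers, purely for illustration
demo = pd.DataFrame({
    "question": ["q1", "q1", "q2", "q2"],
    "EMBEDDING_MODEL_SHORTNAME": ["finetuned", "distilbert", "finetuned", "distilbert"],
    "PREPROCESSOR_SPLIT_LENGTH": [100, 100, 100, 100],
    "answer": ["a1", "a2", "a3", "a4"],
})

wide = answers_side_by_side(
    demo, ["EMBEDDING_MODEL_SHORTNAME", "PREPROCESSOR_SPLIT_LENGTH"]
)
```

Each column of `wide` is labelled by a (model, split length) pair, so a single row can be read across to compare configurations for one question.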