## Steps
- Document Store: FAISS, loaded with only the first few chapters / arcs (through arc 5-10) to keep test runs fast.
- Preprocessor: We'll use `max_seq_length=400` for each of the embedding models and iterate on `split_length` and `split_overlap`.
- Retriever: Embedding Retriever, so we can work with the SBERT models. It is among the stronger retrievers available at the moment and works reasonably well.
- Embedding Models: Testing three: the [SBERT MS MARCO DistilBERT](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b), the [SBERT MS MARCO BERT](https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5), and my fine-tuned DistilBERT (link tbd).
- End result methodologies:
    - [Extractive Search](https://docs.haystack.deepset.ai/docs/ready_made_pipelines#extractiveqapipeline) as a baseline to find relevant documents and contextual quotes.
    - [Summarization](https://docs.haystack.deepset.ai/docs/ready_made_pipelines#extractiveqapipeline) as a less freeform version of the Generative search.
    - [Generative Seq2Seq](https://docs.haystack.deepset.ai/docs/ready_made_pipelines#generativeqapipeline) model using `vblagoje/bart_lfqa`. I may also experiment with the ELI5 bart (or try my own).
- Test run location: If I can get a GPU in Colab I may run it there for speed; otherwise the local machine should be fine.
- Test run questions: listed in the notebook; they are representative of the first few arcs, plus some off-the-wall questions.
- Timing: measure execution time to understand the tradeoffs (measured on FAISS, so the numbers will not translate directly to ElasticSearch).
- Top K: find a reasonable Top K for the Retriever, likely 5-15.
- Output: question-answer pairs, each stored with the metadata (parameters) that produced it.
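
To make the `split_length` / `split_overlap` trade-off from the Preprocessor step concrete, here is a minimal pure-Python sketch of word-based splitting with overlap. It is an illustration only and not Haystack's `PreProcessor` implementation (which also cleans text and can respect sentence boundaries); the function name `split_words` and the sample text are hypothetical.

```python
def split_words(text, split_length, split_overlap):
    """Split text into chunks of at most `split_length` words,
    where consecutive chunks share `split_overlap` words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical ten-word sample: "w0 w1 ... w9"
sample = " ".join(f"w{i}" for i in range(10))
chunks = split_words(sample, split_length=4, split_overlap=1)
for chunk in chunks:
    print(chunk)
```

A larger overlap repeats more context across neighbouring chunks, at the cost of more documents to embed and retrieve.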

In [1]:
PREPROCESSOR_SPLIT_BY = 'word' # options: 'word', 'sentence'
PREPROCESSOR_SPLIT_LENGTH = 400 # 400 or 100 for 'word' / 8 or 2 for 'sentence'
PREPROCESSOR_SPLIT_OVERLAP = 20 # 20 for 'word' / 1 for 'sentence'

# EMBEDDING_MODEL = "sentence-transformers/msmarco-distilbert-base-tas-b" # listed above 
# EMBEDDING_MODEL_SHORTNAME = "msmarco_distilbert" # msmarco_distilbert, msmarco_bert, finetuned
# EMBEDDING_MODEL = "sentence-transformers/msmarco-bert-base-dot-v5"
# EMBEDDING_MODEL_SHORTNAME = "msmarco_bert"
EMBEDDING_MODEL = "../twig_otherverse_parahumans_adapted"
EMBEDDING_MODEL_SHORTNAME = "finetuned"
EMBEDDING_MAX_SEQ_LENGTH = 500 # 500

# OUTPUT_TYPE = "GENERATIVE_BART" # SUMMARATIVE_X, EXTRACTIVE
OUTPUT_TYPE = "SUMMARATIVE_PEGASUS"
OUTPUT_NBEAMS = 8 # 3, 8
OUTPUT_MAXLENGTH = 500 # 200, 500

RETRIEVER_TOP_KS = [5,10,20,50]

FINAL_TEST_CHAPTER = 6 # Exclusive upper bound on arc number: only arcs below this value are tested

SUMMARY_NAME = f"{PREPROCESSOR_SPLIT_BY}({PREPROCESSOR_SPLIT_LENGTH},{PREPROCESSOR_SPLIT_OVERLAP})_{EMBEDDING_MODEL_SHORTNAME}_{OUTPUT_TYPE}({OUTPUT_NBEAMS},{OUTPUT_MAXLENGTH})"

In [2]:
TEST_QUESTIONS = [
    "Who is Avery Kelly?",
    "Who is Zed?",

    "What is Toadswallow?",
    "What is Miss?",
    "What is the Forest Ribbon Trail?",
    "What animal is Snowdrop?",
    "What happened at the Awakening Ritual?",

    "When does the Hungry Choir contest start?",
    "When does Alpeana Operate?",

    "Where is Kennet?",
    "Where is the Arena?",

    "Why was Avery chosen for Awakening?",
    "Why is Maricaca a suspect?",

    "How does Verona's Sight describe objects?",
    "How long has the Carmine been dead?",
    "How old is Matthew?"
]

In [3]:
import os
import pickle
import logging
import time
from haystack.document_stores import FAISSDocumentStore
from haystack import Document
from haystack.nodes import PreProcessor, EmbeddingRetriever, Seq2SeqGenerator, TransformersSummarizer, FARMReader
from haystack.pipelines import GenerativeQAPipeline, ExtractiveQAPipeline, SearchSummarizationPipeline


In [4]:
logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.WARNING)


In [5]:
if os.path.isfile(f'generated_comparison_files/{SUMMARY_NAME}.pkl'):
    raise ValueError("This combination of parameters already exists!")

In [6]:
with open('../data/chapter_fmt_list.pkl','rb') as f:
    all_chapters = pickle.load(f)
chapters = [i for i in all_chapters if int(i['meta']['arc_number']) < FINAL_TEST_CHAPTER]
print(f"Testing with {len(chapters)} chapters up to Arc {FINAL_TEST_CHAPTER}")

Testing with 75 chapters up to Arc 6


In [7]:
preprocessor = PreProcessor(
    split_by=PREPROCESSOR_SPLIT_BY,
    split_length=PREPROCESSOR_SPLIT_LENGTH,
    split_overlap=PREPROCESSOR_SPLIT_OVERLAP,

    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_respect_sentence_boundary=(PREPROCESSOR_SPLIT_BY == 'word'),  # only enabled when splitting by word
    progress_bar=True,
    add_page_number=True
)
docs = preprocessor.process(chapters)
print(f"We will be working with {len(docs)} documents from {len(chapters)} chapters")

Preprocessing:   0%|          | 0/75 [00:00<?, ?docs/s]

We will be working with 1564 documents from 75 chapters


In [8]:
try:
    print("Removing old document store...")
    os.remove("faiss_document_store.db")
except OSError:
    print("Are you sure the document store db exists?")
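# Cosine similarity is used with the SBERT models here. The two msmarco checkpoints listed
# above are described on their model cards as tuned for dot-product scoring, so
# similarity='dot_product' may be worth comparing.
document_store = FAISSDocumentStore(embedding_dim=768, faiss_index_factory_str="Flat", similarity='cosine')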
document_store.write_documents(docs)

Removing old document store...


Writing Documents:   0%|          | 0/1564 [00:00<?, ?it/s]
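
The document store above is configured with cosine similarity. As a rough illustration of what that choice means, the sketch below compares cosine similarity with a raw dot product on toy vectors. It is pure Python and independent of FAISS or Haystack; the vectors and function names are hypothetical.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity is the dot product of the unit-length vectors."""
    return dot(normalize(a), normalize(b))


# Toy embeddings: both documents point in the same direction as the query,
# but long_doc has a much larger magnitude.
query = [1.0, 2.0, 0.0]
short_doc = [2.0, 4.0, 0.0]
long_doc = [20.0, 40.0, 0.0]

print(cosine_similarity(query, short_doc), cosine_similarity(query, long_doc))
print(dot(query, short_doc), dot(query, long_doc))
```

Cosine scores the two documents identically because it only compares direction, while the dot product favours the longer vector.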

In [9]:
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model=EMBEDDING_MODEL,
    model_format="sentence_transformers",
    max_seq_len=EMBEDDING_MAX_SEQ_LENGTH,
    progress_bar=True,
)

document_store.update_embeddings(retriever)

Updating Embedding:   0%|          | 0/1564 [00:00<?, ? docs/s]

Batches:   0%|          | 0/49 [00:00<?, ?it/s]

In [10]:
if OUTPUT_TYPE == "GENERATIVE_BART":
    print("Configuring this in the testing iteration...")
    # TODO: generate a dict of params: generators / pipelines based on the factors
    generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa", num_beams=OUTPUT_NBEAMS, max_length=OUTPUT_MAXLENGTH)
    pipe = GenerativeQAPipeline(generator, retriever) # We specify the params later
elif OUTPUT_TYPE == "SUMMARATIVE_PEGASUS":
    # Despite the "PEGASUS" label, this checkpoint is an LED model; OUTPUT_NBEAMS is not passed to the summarizer
    summarizer = TransformersSummarizer(model_name_or_path="pszemraj/led-large-book-summary", max_length=OUTPUT_MAXLENGTH)
    pipe = SearchSummarizationPipeline(summarizer=summarizer, retriever=retriever, generate_single_summary=True, return_in_answer_format=True)
else:
    raise ValueError("Not Configured yet!")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.44k [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/1.84G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

In [11]:
# Run every test question against every retriever top_k and record the result
all_combos = []

# To keep the parameter matching simple, this loops over single runs instead of using pipeline.batch
for q_idx,q in enumerate(TEST_QUESTIONS):
    print(f'Question {q_idx+1} of {len(TEST_QUESTIONS)}')
    for k_idx,r_topk in enumerate(RETRIEVER_TOP_KS):
        print(f'  TopK {k_idx+1} of {len(RETRIEVER_TOP_KS)}')
        start_time = time.time()
        try:
            result = pipe.run(
                query = q,
                params = {
                    "Retriever": {"top_k":r_topk},
                    # "Generator": {"max_length": maxlength, "num_beams": nbeam}
                }
            )
            if OUTPUT_TYPE == "GENERATIVE_BART":
                answer = result['answers'][0].answer
            elif OUTPUT_TYPE == "SUMMARATIVE_PEGASUS":
                answer = result['answers'][0]['answer']
        except Exception:
            # Catches every exception, not only CUDA errors, so the message below may be misleading
            print("CUDA problem")
            answer = "N/A"

        end_time = time.time()
        execution_time_seconds = end_time - start_time

        d = {
            'question': q,
            'exec_time_seconds': execution_time_seconds,
            'answer': answer,
            # Input all params
            'retriever_topk': r_topk,
            'PREPROCESSOR_SPLIT_BY' : PREPROCESSOR_SPLIT_BY,
            'PREPROCESSOR_SPLIT_LENGTH' : PREPROCESSOR_SPLIT_LENGTH,
            'PREPROCESSOR_SPLIT_OVERLAP' : PREPROCESSOR_SPLIT_OVERLAP,

            'EMBEDDING_MODEL' : EMBEDDING_MODEL,
            'EMBEDDING_MODEL_SHORTNAME' : EMBEDDING_MODEL_SHORTNAME,
            'EMBEDDING_MAX_SEQ_LENGTH': EMBEDDING_MAX_SEQ_LENGTH,

            'OUTPUT_TYPE': OUTPUT_TYPE,
            'OUTPUT_NBEAMS':OUTPUT_NBEAMS,
            'OUTPUT_MAXLENGTH':OUTPUT_MAXLENGTH
        }
        all_combos.append(d)


Question 1 of 16
  TopK 1 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 2 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 3 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 4 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]



CUDA problem
Question 2 of 16
  TopK 1 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 2 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 3 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 4 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
Question 3 of 16
  TopK 1 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 2 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 3 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 4 of 4




Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
Question 4 of 16
  TopK 1 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 2 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 3 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
  TopK 4 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

CUDA problem
Question 5 of 16
  TopK 1 of 4


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
with open(f'generated_comparison_files/{SUMMARY_NAME}.pkl','wb') as f:
    pickle.dump(all_combos,f)
print(f"File written to generated_comparison_files/{SUMMARY_NAME}.pkl")
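
Once a results file exists, the timing and Top K goals from the Steps list can be checked by grouping the records by `retriever_topk`. The sketch below assumes records shaped like the dicts appended to `all_combos`; the sample values are hypothetical placeholders, not results from this notebook, and `summarize_by_topk` is an illustrative name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records with the same keys as the dicts in `all_combos`
records = [
    {"question": "Who is Zed?", "retriever_topk": 5, "exec_time_seconds": 2.0, "answer": "..."},
    {"question": "Where is Kennet?", "retriever_topk": 5, "exec_time_seconds": 4.0, "answer": "N/A"},
    {"question": "Who is Zed?", "retriever_topk": 10, "exec_time_seconds": 5.0, "answer": "..."},
    {"question": "Where is Kennet?", "retriever_topk": 10, "exec_time_seconds": 7.0, "answer": "..."},
]


def summarize_by_topk(rows):
    """Return mean execution time and failure count for each top_k value."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["retriever_topk"]].append(row)
    summary = {}
    for top_k, group in sorted(grouped.items()):
        summary[top_k] = {
            "mean_seconds": mean(r["exec_time_seconds"] for r in group),
            "failed": sum(r["answer"] == "N/A" for r in group),  # "N/A" marks a failed run
        }
    return summary


summary = summarize_by_topk(records)
for top_k, stats in summary.items():
    print(top_k, stats)
```

In practice `records` would come from `pickle.load` on one of the files in `generated_comparison_files/`.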