# What does this do?
Given a database of documents (focused around Wildbow serials), attempts to answer questions.

# How does it do this?
For more details (or to recreate the steps yourself) please check out the Github README!
1. Vectorizes all chapters of the serials using a GPL fine-tuned SentenceTransformers model. These chapters are split into Documents with varying numbers of words (results can vary based on the sizes used) and stored in a database.
2. Processes your question into a vector of a similar format to the above.
3. Finds the Document(s) most similar to the question and returns them. This will also filter the documents down to the `RETRIEVER_TOPK` closest matches.
4. From those documents, [extracts](https://docs.haystack.deepset.ai/docs/retriever#embedding-retrieval-recommended) one or more answers to the question.
5. Outputs the chapters and content used to inform the answer (since the answer itself is likely to be unsatisfactory). Use this as a source for which chapters are likely to have your answers, or to get some raw content to paste into something else (ChatGPT, etc.) or interpret on your own.
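
Steps 2 and 3 above can be sketched in a few lines. This is only an illustration under stated assumptions: the 3-dimensional vectors and chapter names are made up, and cosine similarity is assumed as the similarity measure (the real document store may use a different one, such as dot product).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(question_vector, documents, top_k=2):
    """Return the names of the top_k documents most similar to the question."""
    scored = [
        (cosine_similarity(question_vector, vector), name)
        for name, vector in documents.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Hypothetical embeddings; real ones have hundreds of dimensions.
documents = {
    "chapter_a": [0.9, 0.1, 0.0],
    "chapter_b": [0.0, 1.0, 0.0],
    "chapter_c": [0.7, 0.3, 0.1],
}
question_vector = [1.0, 0.0, 0.0]

top_documents = retrieve(question_vector, documents, top_k=2)
print(top_documents)
```

The `top_k` argument plays the same role as `RETRIEVER_TOPK` in this notebook: it caps how many Documents are handed to the answer step.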

# How do I use it?
0. If at all possible, go to `Change Runtime Type` in `Resources` and choose `GPU`. This should significantly speed up the queries. I have the best luck past 7pm ET for availability, but it's not strictly necessary.
1. Ensure that the database files (`wildbow_150.index`, `wildbow_150.json`, `wildbow_150_sqldb.db`, etc.) are in the folder set by `DATABASE_FOLDER_LOCATION` (inside the subfolder named for your `DOCUMENT_LENGTH`). Colab can be finicky about how it attaches to GDrive, so this may need some finagling. I've tried to leave in 2 solutions, either of which may work (you shouldn't need both).
2. There are a few variables defined in Cell X that can slightly change how everything behaves - feel free to experiment!
  - NUM_BEAMS: How many different [beams](https://towardsdatascience.com/foundations-of-nlp-explained-visually-beam-search-how-it-works-1586b9849a24) the Generator will produce for answer generation. Default is 8, though the configuration cell in this notebook currently sets 32.
  - MAX_LENGTH: The maximum length of the sequence that will be generated. Keep in mind that a sequence will most likely naturally terminate before this is reached. Default is 500.
  - GENERATOR_TOPK: The number of answers that a Generator will return. Default is 3.
  - DATABASE_NAME: The name of the database that we'll search through to answer the questions. Options are `pale`, `otherverse` (comprising Pale, Pact, Poke, and Pate), `parahumans` (comprising Worm, Glow-worm, and Ward), `twig`, and `wildbow` (comprising everything).
  - DOCUMENT_LENGTH: The size of the documents used in the database. This can cause the generator to return different results because of the information it ingests. `150`, `250`, `400`, `mixed` (storing all 3 types) are available here.
3. Run the cells through X - this will set everything up properly.
4. In Cell Y, specify the following variables:
  - QUESTION: The question that you wish to ask. Default is `Who is the Carmine?`
  - RETRIEVER_TOPK: The number of documents that will be used to answer the question. Default is 10, though the question cell in this notebook currently sets 15.
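
As a rough sketch of how `DATABASE_NAME` and `DOCUMENT_LENGTH` map to file names, the helper below mirrors the naming used in this notebook's loading cell (`<name>_<length>.index`, or `<name>.index` for `mixed`). The function `database_files` and the two validation sets are hypothetical and exist only for illustration.

```python
# Option lists taken from the variable descriptions above.
VALID_DATABASES = {"pale", "otherverse", "parahumans", "twig", "wildbow"}
VALID_LENGTHS = {"150", "250", "400", "mixed"}

def database_files(database_name, document_length):
    """Return the index and config file names for a database choice.

    Hypothetical helper: the 'mixed' files carry no length suffix,
    matching the loading cell in this notebook.
    """
    length = str(document_length)
    if database_name not in VALID_DATABASES:
        raise ValueError(f"Unknown DATABASE_NAME: {database_name!r}")
    if length not in VALID_LENGTHS:
        raise ValueError(f"Unknown DOCUMENT_LENGTH: {document_length!r}")
    prefix = database_name if length == "mixed" else f"{database_name}_{length}"
    return {"index": f"{prefix}.index", "config": f"{prefix}.json"}

print(database_files("wildbow", 150))
print(database_files("pale", "mixed"))
```

Checking the choice up front like this gives a clearer error than a missing-file failure when the database is loaded.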

# Limitations
1. There is no ability to filter on series or chapters here due to [database limitations](https://docs.haystack.deepset.ai/docs/document_store#choosing-the-right-document-store). To get around the series limitation, I've included a few different databases (utilizing the same underlying embeddings). Based on the series you select, it should pull the correct db if everything has been formatted correctly. There is no equivalent workaround for chapters. Both may be fixed in a future version, but that requires a different database setup.
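
One possible stopgap is to filter the returned sources after retrieval, in plain Python. The sketch below assumes each source is a dict with a `meta` entry holding keys such as `series`, `chapter`, and `pov` (the keys this notebook prints for each source). The sample records and the `filter_by_metadata` helper are hypothetical, and this is not Haystack's own filtering API.

```python
def filter_by_metadata(documents, **wanted):
    """Keep only documents whose metadata matches every wanted key/value pair."""
    return [
        doc for doc in documents
        if all(doc["meta"].get(key) == value for key, value in wanted.items())
    ]

# Hypothetical retrieved documents, for illustration only.
retrieved = [
    {"content": "first passage", "meta": {"series": "Pale", "chapter": "1.1", "pov": "Avery"}},
    {"content": "second passage", "meta": {"series": "Pact", "chapter": "1.1", "pov": "Blake"}},
    {"content": "third passage", "meta": {"series": "Pale", "chapter": "1.2", "pov": "Lucy"}},
]

pale_only = filter_by_metadata(retrieved, series="Pale")
avery_only = filter_by_metadata(retrieved, series="Pale", pov="Avery")
print([doc["meta"]["chapter"] for doc in pale_only])
print([doc["content"] for doc in avery_only])
```

Filtering after retrieval can leave fewer than `RETRIEVER_TOPK` documents, which is one reason a document store with native metadata filtering would be the better long-term fix.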

# How accurate are the answers?
You're best off thinking of this like you're asking a mediocre fanfiction writer what the answer is - while they're enthusiastic about the series they probably only read part of it and don't exactly have a commitment to accuracy.
While the answers are grounded in truth, they're prone to a number of pitfalls:
  - Generative models are prone to favor likely combinations of words over rare combinations. This is especially true here as there is a specialized vocabulary for these series - the Generator doesn't know _exactly_ why "Other" is special or why "Scion" appears so often here, and is less likely to select it as a high-probability word.
  - It's only as good as the data it can build off of. Not only are these _stories_ where we're supposed to read between the lines and infer information, but those stories are told from the perspectives of biased characters who receive half-truths from those around them. It's very gullible!



# Will there be any changes to this in the future?
Nothing is in progress at the moment, but there are a few areas of potential here:
  - Change of database to allow filtering on metadata (eg only Pale, only chapters < 10.1, only Avery PoV, etc)
  - Different types of answer generation like Extractive or Summarization. These are relatively simple switches to make through this framework; it's just a matter of interest.
  - A Generator that can better mimic the style of Wildbow's answers. This would rely on a source-of-truth set of Document:Summary data that would be difficult to compile. However, it would both provide a more familiar output style and potentially could teach the model some of the vocabulary quirks that are present in the worlds (eg Other in the PactVerse != other in common usage).
  - A better way to capture rare words (like Primordial) in the answers. In NLP, the sampling parameter that controls this is called [temperature](https://lukesalamone.github.io/posts/what-is-temperature/). While you typically don't want high amounts of creativity in your QA system, it can help make up for other shortcomings like the lack of trained vocabulary. This would require [monkey patching](https://stackoverflow.com/questions/5626193/what-is-monkey-patching) or other general modifications of the [Haystack source code](https://github.com/deepset-ai/haystack/blob/main/haystack/nodes/answer_generator/transformers.py#L465), which could be a large endeavor.
  - A better pretraining on real questions and answers. The pretraining here is done via a technique called [Generative Pseudo Labeling](https://www.pinecone.io/learn/gpl/), but it's no substitute for stronger training data for domain adaptation.
  - The inclusion of various Word-of-God quotes or user-submitted answers to other queries (eg Reddit, Discord). The difficulty here is parsing the data formats and gathering it into formats that 1) better inform the Document embeddings how to store the information, 2) can be searched efficiently, and 3) are correct.
  - Updating Pale chapters available - currently the initial dataset only includes Pale chapters up to 23.1. This would be an incremental update.
  - Incorporating a MultiModal Transformer for the Extra Materials. Currently, information from the Pale Extra Materials is included via transcripts manually pulled from the site (many thanks to those who transcribed this information), but searching text and image embeddings simultaneously is a possibility.
  - Incorporating the ability to switch to batch processing when more than one question is asked. Currently this only works on 1 question at a time.
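
To make the temperature idea from the list above concrete, here is a small sketch of temperature-scaled softmax. The three scores are invented for illustration (the last one stands in for a rare word such as "Primordial"), and nothing here reflects how Haystack's generator is actually configured.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, flattened or sharpened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - largest) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical scores: a common word, a less common word, a rare word.
logits = [4.0, 2.0, 0.5]

cold = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
warm = softmax_with_temperature(logits, temperature=2.0)

for name, probs in (("cold", cold), ("neutral", neutral), ("warm", warm)):
    print(name, [round(p, 3) for p in probs])
```

Higher temperatures flatten the distribution, so the rare word gets a larger share of the probability at the cost of less predictable output.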

In [None]:
DATABASE_FOLDER_LOCATION = "./drive/MyDrive/pale-companion-files/db-gen/"
NUM_BEAMS = 32
MAX_LENGTH = 500
GENERATOR_TOPK = 3 

DATABASE_NAME = "pale" # pale, otherverse, parahumans, twig, wildbow.
DOCUMENT_LENGTH = 'mixed' # 150, 250, 400, 'mixed'

In [None]:
!pip install "faiss-gpu>=1.6.3,<2"
!pip install "sqlalchemy <2"
!pip install "farm-haystack==1.14.0"

In [None]:
import os
import pickle
import logging
import time
import textwrap
from haystack.document_stores import FAISSDocumentStore
from haystack import Document
from haystack.nodes import PreProcessor, EmbeddingRetriever, Seq2SeqGenerator, TransformersSummarizer, FARMReader
from haystack.pipelines import GenerativeQAPipeline, ExtractiveQAPipeline, SearchSummarizationPipeline

In [None]:
logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.DEBUG)

In [None]:
# Use this solution if you get a popup with no code when you attach GDrive
# Note - you'll need to reset the kernel or comment this line out if this has already been run,
# because the relative path only resolves from the original working directory
os.chdir(os.path.join(DATABASE_FOLDER_LOCATION, str(DOCUMENT_LENGTH)))
print(os.getcwd()) # This should match your GDrive/db/doclength folder
print([f for f in os.listdir('.') if os.path.isfile(f)])
# Use this solution if you have to run a code cell to attach GDrive
# TODO
# The 'mixed' database files carry no document-length suffix in their names
db_prefix = DATABASE_NAME if DOCUMENT_LENGTH == 'mixed' else f'{DATABASE_NAME}_{DOCUMENT_LENGTH}'
document_store = FAISSDocumentStore.load(
    index_path=f'{db_prefix}.index',
    config_path=f'{db_prefix}.json'
    )
document_store.get_embedding_count()

In [None]:
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="TheSpaceManG/wildbow-distilbert", # The name of model to create embeddings with. This will pull from HuggingFace and can be changed out as desired (eg distilbert-base-uncased-distilled-squad).
    model_format="sentence_transformers",
    max_seq_len=500,
    progress_bar=True,
)

In [None]:
generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa", num_beams=NUM_BEAMS, max_length=MAX_LENGTH, top_k=GENERATOR_TOPK)

pipe = GenerativeQAPipeline(generator, retriever) # We specify some of these params later

In [None]:
# Given the format of the answer from our retrieval pipeline, parse out in a reasonable format
# 1. The answer(s) of which there will be K
# 2. The metadata about the answers, of which there will be k sets we want to break down into a few categories
# What series, what PoVs, what chapters, etc
def format_answers(output, chapter_info=False, text_output=False):
  # Note: text_output is accepted but not used yet
  for answerset in output['answers']:
    response = answerset.answer
    # Wrap long answers at 120 characters without splitting words
    for line in textwrap.wrap(f"Answer: {response}", width=120, break_long_words=False):
      print(line)
    metadata = answerset.meta['doc_metas']
    if chapter_info:
      print('-----')
      print(f"{len(metadata)} Documents")
      for idx,x in enumerate(metadata):
        print(f"Source {idx+1}: {x['arc_title']}, {x['chapter']}, {x['pov']}, {x['series']}")

    print('-----------------------------------------------------------')

In [None]:
# QUESTION: Your question that you wish to ask! 
QUESTION = "Who is the Carmine?"

# The number of documents you want to parse to answer this.
# Personally I find that for the primitive Generative QA 15 is a decent number, but you should play with this
RETRIEVER_TOPK = 15


In [None]:
# This will take 3-5 minutes or so, just be patient!
OUTPUT = pipe.run(
                query = QUESTION,
                params = {
                    "Retriever": {"top_k":RETRIEVER_TOPK}
                    }
            )

In [None]:
# Parse the output
format_answers(OUTPUT, chapter_info=True)