# What does this do?
Given a database of documents (focused on Wildbow serials), attempts to answer questions.

# How does it do this?
For more details (or to recreate the steps yourself) please check out the Github README!
1. Vectorizes all chapters of the serials using a GPL fine-tuned SentenceTransformers model. These chapters are split into Documents of varying word counts (results can vary based on the sizes used) and stored in a database.
2. Processes your question into a vector of a similar format to the above.
3. Finds the Document(s) most similar to the question and returns them. This will also filter the documents.
4. From those documents, [Generate](https://docs.haystack.deepset.ai/reference/answer-generator-api#seq2seqgenerator) a few responses that answer your question. This will also note which documents these answers came from in case you want to explore further manually.
5. In addition, [Extract](https://docs.haystack.deepset.ai/docs/retriever#embedding-retrieval-recommended) a number of answers directly from the text. This can include the full text of the passages the answers came from.
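
Steps 2 and 3 above boil down to "embed the question, then rank documents by similarity". Here is a minimal sketch of that idea using only the standard library. The `embed` function is a toy bag-of-words stand-in (an assumption for illustration), not the fine-tuned SentenceTransformers model this notebook actually uses, and the sample documents are made up.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(question, documents, top_k=2):
    """Return the top_k documents most similar to the question."""
    question_vector = embed(question)
    scored = [(cosine(question_vector, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Made-up documents, purely for illustration
docs = [
    "the wolf ran through the forest",
    "a ledger of accounts and debts",
    "the forest was quiet at night",
]
top = retrieve("where is the wolf", docs, top_k=1)
print(top)
```

A real embedding model places texts with similar meaning close together even when they share no words, which this word-count version cannot do.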

# How do I use it?
0. If at all possible, go to `Change Runtime Type` in `Resources` and choose `GPU`. This should significantly speed up the queries. I have the best luck past 7pm ET for availability, but it's not strictly necessary.
1. Ensure that the database files (`wildbow_150.index`, `wildbow_150.json`, `wildbow_150_sqldb.db`, etc.) are in the folder that `DATABASE_FOLDER_LOCATION` points to. Colab can be finicky about how it attaches to GDrive, so this may need some finagling. I've tried to leave in 2 solutions, either of which may work (you shouldn't need both).
2. There are a few variables defined in Cell X that can slightly change how everything behaves - feel free to experiment!
  - NUM_BEAMS: How many different [beams](https://towardsdatascience.com/foundations-of-nlp-explained-visually-beam-search-how-it-works-1586b9849a24) the Generator will explore during answer generation. Default is 8 (the example cell in this notebook sets it to 32).
  - MAX_LENGTH: The maximum length of the sequence that will be generated. Keep in mind that a sequence will most likely naturally terminate before this is reached. Default is 500.
  - GENERATOR_TOPK: The number of answers that a Generator will return. Default is 3.
  - DATABASE_NAME: The name of the database that we'll search through to answer the questions. Options are `pale`, `otherverse` (comprising Pale, Pact, Poke, and Pate), `parahumans` (comprising Worm, Glow-worm, and Ward), `twig`, and `wildbow` (comprising everything).
  - DOCUMENT_LENGTH: The size of the documents used in the database. This can cause the generator to return different results because of the information it ingests. `150`, `250`, `400`, `mixed` (storing all 3 types) are available here.
3. Run the cells through X - this will set everything up properly.
4. In Cell Y, specify the following variables:
  - QUESTION: Your question that you wish to ask.
  - RETRIEVER_TOPK: the number of documents that will be used to answer the question. Default is 15.
  - EXTRACT_TOPK: The number of extractive (raw text) answers you want to create. Similar to the GENERATOR_TOPK above.
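
To make `NUM_BEAMS` and `GENERATOR_TOPK` more concrete, here is a toy beam search over a made-up next-token table. This is only a sketch of the general technique: the `NEXT` table, its probabilities, and the `beam_search` function are hypothetical, and this is not how Haystack or the underlying generator model implements it.

```python
import math

# Hypothetical toy "model": probability of each next token given the previous one
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"wolf": 0.55, "judge": 0.45},
    "a": {"wolf": 0.9, "judge": 0.1},
    "wolf": {"</s>": 1.0},
    "judge": {"</s>": 1.0},
}

def beam_search(num_beams, max_length, top_k):
    """Keep the num_beams best partial sequences at each step; return the top_k finished ones."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for candidate in candidates[:num_beams]:
            if candidate[1][-1] == "</s>":
                finished.append(candidate)
            else:
                beams.append(candidate)
        if not beams:
            break
    # Sequences still unfinished after max_length steps are dropped in this sketch
    finished.sort(key=lambda c: c[0], reverse=True)
    return [" ".join(tokens[1:-1]) for _, tokens in finished[:top_k]]

print(beam_search(num_beams=1, max_length=5, top_k=1))
print(beam_search(num_beams=2, max_length=5, top_k=2))
```

With a single beam the search commits to the likeliest first word and misses the sequence that is more probable overall. A wider beam can recover it, at the cost of more computation per step.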

# Limitations
1. There is no ability to filter on series or chapters here due to [Database limitations](https://docs.haystack.deepset.ai/docs/document_store#choosing-the-right-document-store). To get around the series limitation, I've included several databases (all built with the same underlying embedding generation); the one matching the series you select is loaded, provided the files are named and placed correctly. Filtering by chapter is not possible at all. Both may be fixed in a future version, but that requires a different database setup that is harder to run in the Colab environment.

# How accurate are the generated answers?
You're best off thinking of this like you're asking a mediocre fanfiction writer what the answer is - while they're enthusiastic about the series they probably only read part of it and don't exactly have a commitment to accuracy.
While the answers are grounded in truth, they're prone to a number of pitfalls:
  - Generative models tend to favor likely combinations of words over rare ones. This is especially true here because these series have a specialized vocabulary - the Generator doesn't know _exactly_ why "Other" is special or why "Scion" appears so often, so it is less likely to treat those words as high-probability choices.
  - It's only as good as the data it can build off of. Not only are these _stories_ where we're supposed to read between the lines and infer information, but those stories are told from biased character perspectives who receive half-truths from those around them. It's very gullible!

## What about the extractive answers?
These answers are true to the text, but just keep in mind they're lacking context and persist things like character bias without additional information.


# Will there be any changes to this in the future?
Nothing is in progress at the moment, but there are a few areas of potential here:
  - Change of database to allow filtering on metadata (eg only Pale, only chapters < 10.1, only Avery PoV, etc)
  - Different styles of answer generation. This type of work is relatively simple to do within the Haystack framework but it relies on a custom [Converter](https://github.com/deepset-ai/haystack/blob/main/haystack/nodes/answer_generator/transformers.py#L481) being written. See [this](https://yjernite.github.io/lfqa.html) for an explanation on what is involved in getting a new LFQA generator working.
  - A Generator that can better mimic the style of Wildbow's answers. This would rely on a source-of-truth set of Document:Summary data that would be difficult to compile. However, it would both provide a more familiar output style and potentially could teach the model some of the vocabulary quirks that are present in the worlds (eg Other in the PactVerse != other in common usage). Similar to the above.
  - A better way to capture rare words (like Primordial, Aurum) in the answers - in NLP, one lever for this is the sampling [temperature](https://lukesalamone.github.io/posts/what-is-temperature/). While you typically don't want high amounts of creativity in your QA system, it can help make up for other shortcomings like the lack of trained vocabulary. This would require [MonkeyPatching](https://stackoverflow.com/questions/5626193/what-is-monkey-patching) or other general modifications of the [Haystack source code](https://github.com/deepset-ai/haystack/blob/main/haystack/nodes/answer_generator/transformers.py#L465), which could be a large endeavor.
  - Better pretraining on real questions and answers. The pretraining here is done via a technique called [Generative Pseudolabeling](https://www.pinecone.io/learn/gpl/), but it's no substitute for stronger training data for domain adaptation.
  - The inclusion of various Word-of-God quotes or user-submitted answers to other queries (ie Reddit, Discord). The difficulty here is parsing the data formats and gathering it into formats that 1) better inform the Document embeddings how to store the information, 2) can be searched efficiently, and 3) are correct.
  - Updating Pale chapters available - currently the initial dataset only includes Pale chapters up to 23.1. This would be an incremental update.
  - Incorporating a MultiModal Transformer for the Extra Materials. Currently, any information from the Pale Extra Materials is included via transcripts manually pulled from the site (many thanks to those who transcribed this information), but searching text and image embeddings simultaneously is a possibility.
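
As a rough illustration of the temperature idea mentioned in the list above, here is a small sketch of a temperature-scaled softmax. The scores are made-up numbers, not output from the Generator, and the function name is my own; it only shows why a higher temperature gives rare words more of a chance.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - largest) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical scores for a common word, a less common word, and a rare word
logits = [4.0, 2.0, 0.5]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

At the lower temperature nearly all of the probability sits on the common word. At the higher temperature the rare word gets a noticeably larger share, which is the trade-off between accuracy and vocabulary coverage described above.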

In [1]:
DATABASE_FOLDER_LOCATION = "./drive/MyDrive/pale-companion-files/db-gen/"
NUM_BEAMS = 32
MAX_LENGTH = 500
GENERATOR_TOPK = 3 

DATABASE_NAME = "pale" # pale, otherverse, parahumans, twig, wildbow.
DOCUMENT_LENGTH = 'mixed' # 150, 250, 400, 'mixed'

In [2]:
!pip install "faiss-gpu>=1.6.3,<2"
!pip install "sqlalchemy <2"
!pip install "farm-haystack==1.14.0"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import os
import pickle
import logging
import time
import textwrap
from haystack.document_stores import FAISSDocumentStore
from haystack import Document
from haystack.nodes import PreProcessor, EmbeddingRetriever, Seq2SeqGenerator, TransformersSummarizer, FARMReader
from haystack.pipelines import GenerativeQAPipeline, ExtractiveQAPipeline, SearchSummarizationPipeline

In [4]:
logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

In [5]:
# Use this solution if you get a popup with no code when you attach GDrive
os.chdir(f'{DATABASE_FOLDER_LOCATION}/{DOCUMENT_LENGTH}') # Note - you'll need to reset the kernel or comment this line out if this has already been run
print(os.getcwd()) # This should match your GDrive/db/doclength folder
print([f for f in os.listdir('.') if os.path.isfile(f)])
# Use this solution if you have to run a code cell to attach GDrive
# TODO
document_store = FAISSDocumentStore.load(
    index_path=f'{DATABASE_NAME}_{DOCUMENT_LENGTH}.index' if DOCUMENT_LENGTH != 'mixed' else f'{DATABASE_NAME}.index',
    config_path=f'{DATABASE_NAME}_{DOCUMENT_LENGTH}.json' if DOCUMENT_LENGTH != 'mixed' else f'{DATABASE_NAME}.json'
    )
document_store.get_embedding_count() 

/content/drive/MyDrive/pale-companion-files/db-gen/mixed
['twig.index', 'twig.json', 'twig_sqldb.db', 'parahumans.index', 'parahumans.json', 'parahumans_sqldb.db', 'pale.index', 'pale.json', 'pale_sqldb.db', 'otherverse.index', 'otherverse.json', 'otherverse_sqldb.db', 'wildbow.index', 'wildbow.json', 'wildbow_sqldb.db']


9423
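
If the load step above fails, it can help to check that the expected files are actually in the folder. The helper below is a small sketch based on the file names shown in this notebook (`wildbow_150.index`, `pale.index`, `pale_sqldb.db`, and so on). The function names are hypothetical and not part of Haystack, and the naming pattern is inferred from those examples.

```python
import os

def expected_db_files(database_name, document_length):
    """Build the three file names the notebook expects for a database (inferred pattern)."""
    if document_length == "mixed":
        stem = database_name
    else:
        stem = f"{database_name}_{document_length}"
    return [f"{stem}.index", f"{stem}.json", f"{stem}_sqldb.db"]

def missing_db_files(folder, database_name, document_length):
    """Return the expected files that are not present in the folder."""
    return [
        name
        for name in expected_db_files(database_name, document_length)
        if not os.path.isfile(os.path.join(folder, name))
    ]

print(expected_db_files("wildbow", 150))
print(expected_db_files("pale", "mixed"))
```

Calling `missing_db_files(".", DATABASE_NAME, DOCUMENT_LENGTH)` after the `os.chdir` step should return an empty list when everything is in place.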

In [6]:
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="TheSpaceManG/wildbow-distilbert", # The name of model to create embeddings with. This will pull from HuggingFace and can be changed out as desired (eg distilbert-base-uncased-distilled-squad).
    model_format="sentence_transformers",
    max_seq_len=500,
    progress_bar=True,
)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.nodes.retriever.dense:Init retriever using embeddings of model TheSpaceManG/wildbow-distilbert
  return self.fget.__get__(instance, owner)()


In [13]:
generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa", num_beams=NUM_BEAMS, max_length=MAX_LENGTH, top_k=GENERATOR_TOPK)
reader = FARMReader("deepset/bert-base-cased-squad2")

pipe = GenerativeQAPipeline(generator, retriever) # We specify some of these params later
extract_pipe = ExtractiveQAPipeline(reader, retriever)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
  return self.fget.__get__(instance, owner)()
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Downloading (…)lve/main/config.json:   0%|          | 0.00/508 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/bert-base-cased-squad2' (Bert)


Downloading pytorch_model.bin:   0%|          | 0.00/433M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/bert-base-cased-squad2' (Bert model) from model hub.
DEBUG:haystack.modeling.model.prediction_head:Prediction head initialized with size [768, 2]


Downloading (…)okenizer_config.json:   0%|          | 0.00/152 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


In [46]:
# Given the format of the answer from our retrieval pipeline, parse out in a reasonable format
# 1. The answer(s) of which there will be K
# 2. The metadata about the answers, of which there will be k sets we want to break down into a few categories
# What series, what PoVs, what chapters, etc
def format_answers(output, chapter_info=False, answer_type="gen"):
  if answer_type=="gen":
    for answerset in output['answers']:
      response = answerset.answer
      [print(x) for x in textwrap.wrap(f"Answer: {response}", width=120,break_long_words=False)]
      metadata = answerset.meta['doc_metas']
      if chapter_info:
        print('-----')
        print(f"{len(metadata)} Documents")
        for idx,x in enumerate(metadata):
          print(f"Source {idx+1}: {x['arc_title']}, {x['chapter']}, {x['pov']}, {x['series']}")

      print('-----------------------------------------------------------')
  if answer_type=="extract":
    for answerset in output['answers']:
      response = answerset.answer
      [print(x) for x in textwrap.wrap(f"Answer: {response}", width=120,break_long_words=False)]
      metadata = answerset.meta
      if chapter_info:
        print('-----')
        print(f"Source: {metadata['arc_title']}, {metadata['chapter']}, {metadata['pov']}, {metadata['series']}")

      print('-----------------------------------------------------------')

In [92]:
# Given the output of the pipelines here, return the full source contents that the answers come from
def return_source_content(output, answer_type="gen"):
  if answer_type=='gen':
    for answerset in output['answers']:
      response = answerset.answer
      [print(x) for x in textwrap.wrap(f"Answer: {response}", width=120,break_long_words=False)]
      print('--------')
      metadata = answerset.meta['doc_metas']
      content = answerset.meta['content']
      for m,c in zip(metadata,content):
        print(f"Source: {m['arc_title']}, {m['chapter']}, {m['pov']}, {m['series']}")
        [print(o) for o in textwrap.wrap(f"Content: {c}", width=160,break_long_words=False)]
        print('-----')

  if answer_type=='extract':
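    # Caveat: zip pairs the i-th answer with the i-th retrieved document. The Reader ranks answers
    # independently of retrieval order, so this document may not be the one the answer came from.
    for answerset, content in zip(output['answers'], output['documents']):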
      response = answerset.answer
      [print(x) for x in textwrap.wrap(f"Answer: {response}", width=120,break_long_words=False)]
      metadata = answerset.meta
      print('-----')
      print(f"Source: {metadata['arc_title']}, {metadata['chapter']}, {metadata['pov']}, {metadata['series']}")
      [print(o) for o in textwrap.wrap(f"Content: {content.content}", width=160,break_long_words=False)]

In [51]:
# QUESTION: Your question that you wish to ask! 
QUESTION = "What is the Carmine Wolf?"

# The number of documents you want to parse to answer this.
# Personally I find that for the primitive Generative QA 15 is a decent number, but you should play with this
RETRIEVER_TOPK = 15

# The number of exact answers you want from the above
EXTRACT_TOPK = 5


In [19]:
# This will take 3-5 minutes or so, just be patient!
OUTPUT = pipe.run(
                query = QUESTION,
                params = {
                    "Retriever": {"top_k":RETRIEVER_TOPK}
                    }
            )

DEBUG:haystack.pipelines.base:Running node 'Query` with input: {'root_node': 'Query', 'params': {'Retriever': {'top_k': 15}}, 'query': 'Who is the Carmine?', 'node_id': 'Query'}
DEBUG:haystack.pipelines.base:Running node 'Retriever` with input: {'root_node': 'Query', 'params': {'Retriever': {'top_k': 15}}, 'query': 'Who is the Carmine?', 'node_id': 'Retriever'}


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

DEBUG:haystack.nodes.retriever.base:Retrieved documents with IDs: ['7aca828764c184f00d9604e3446edafe', '3e234e554c9b669a0171622205f18e5e', 'aa45d317c82135ae704cfb260b9bc9ea', 'd1e8083567bf4e859b32dedea14899c9', '5eac146509cf8d10fd1f8f87a2547d0a', '4c7dfe0e2ffeafc8d146235304369dd3', '543ec48b4a70a7d62ba2f286307c8c58', '19c3215c5a6379aa1cad3d1c7877a06f', 'd3c9e6224a3c3e6c6190a78c9d0caad6', '4e25fa247ed273489b05dfd6e1d5a26c', 'e3059a8aac74b66c1699d12e9f69bf35', '180d11272c1af282aac4bf84765fb844', '2641e959def8f2d2485418bd782023b7', '3889d29ce1e91213e3743139af2c571a', '389bceceb4a25a92f120812b299cbb8']
DEBUG:haystack.pipelines.base:Running node 'Generator` with input: {'documents': [<Document: {'content': '“So long as this is nonviolent, or the opposite of violence.”  “It’s negotiation,” Lucy replied.  She reached into her inside coat pocket, then pulled out a folded up printout.  She held it out.  Mike was the one who approached to take it.  He glanced up at Jude.  “Hi,” Jude said.  Veron

In [47]:
# Parse the output
format_answers(OUTPUT, chapter_info=True, answer_type="gen")

Answer: I don't know much about the Carmine, but I'll give it a shot. The Carmine is a group of people who believe in
the existence of a supernatural being called the "Alabaster." They believe that the Alabaster is the source of all evil
in the world, and that the "Carmine" is responsible for bringing about the end of the world as we know it. They believe
this to be the cause of the Black Death, the Black Plague, and other supernatural phenomena that have plagued the world
since the beginning of time. The Black Death is believed to be responsible for many of the atrocities that have occurred
throughout history, including the Holocaust, World War II, the Rwandan Genocide, the Armenian Genocide, and so on and so
forth. There's a lot more to it than that, but that's the gist of it.
-----
15 Documents
Source 1: In Absentia, 21.10, Verona, Pale
Source 2: Let Slip, 20.a, Nomi, Pale
Source 3: In Absentia, 21.13, All, Pale
Source 4: False Moves, 12.6, Avery, Pale
Source 5: Let Slip, 20.z, Jule

In [52]:
OUTPUT_EXTRACTED = extract_pipe.run(
                query = QUESTION,
                params = {
                    "Retriever": {"top_k":RETRIEVER_TOPK},
                    "Reader": {"top_k":EXTRACT_TOPK}
                    }
            )

DEBUG:haystack.pipelines.base:Running node 'Query` with input: {'root_node': 'Query', 'params': {'Retriever': {'top_k': 15}, 'Reader': {'top_k': 5}}, 'query': 'Who is the Carmine?', 'node_id': 'Query'}
DEBUG:haystack.pipelines.base:Running node 'Retriever` with input: {'root_node': 'Query', 'params': {'Retriever': {'top_k': 15}, 'Reader': {'top_k': 5}}, 'query': 'Who is the Carmine?', 'node_id': 'Retriever'}


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

DEBUG:haystack.nodes.retriever.base:Retrieved documents with IDs: ['7aca828764c184f00d9604e3446edafe', '3e234e554c9b669a0171622205f18e5e', 'aa45d317c82135ae704cfb260b9bc9ea', 'd1e8083567bf4e859b32dedea14899c9', '5eac146509cf8d10fd1f8f87a2547d0a', '4c7dfe0e2ffeafc8d146235304369dd3', '543ec48b4a70a7d62ba2f286307c8c58', '19c3215c5a6379aa1cad3d1c7877a06f', 'd3c9e6224a3c3e6c6190a78c9d0caad6', '4e25fa247ed273489b05dfd6e1d5a26c', 'e3059a8aac74b66c1699d12e9f69bf35', '180d11272c1af282aac4bf84765fb844', '2641e959def8f2d2485418bd782023b7', '3889d29ce1e91213e3743139af2c571a', '389bceceb4a25a92f120812b299cbb8']
DEBUG:haystack.pipelines.base:Running node 'Reader` with input: {'documents': [<Document: {'content': '“So long as this is nonviolent, or the opposite of violence.”  “It’s negotiation,” Lucy replied.  She reached into her inside coat pocket, then pulled out a folded up printout.  She held it out.  Mike was the one who approached to take it.  He glanced up at Jude.  “Hi,” Jude said.  Verona s

Inferencing Samples:   0%|          | 0/2 [00:00<?, ? Batches/s]

In [53]:
format_answers(OUTPUT_EXTRACTED, chapter_info=True, answer_type="extract")

Answer: The Carmine is meant to serve and represent us
-----
Source: In Absentia, 21.10, Verona, Pale
-----------------------------------------------------------
Answer: Beast’s domain
-----
Source: Lost for Words, 1.3, Avery, Pale
-----------------------------------------------------------
Answer: ex-Forsworn
-----
Source: Let Slip, 20.9, Avery, Pale
-----------------------------------------------------------
Answer: Charles Abrams
-----
Source: Wild Abandon, 18.z, Misc., Pale
-----------------------------------------------------------
Answer: ex-Forsworn
-----
Source: Let Slip, 20.z, Julette, Pale
-----------------------------------------------------------
