## Generate Q-A pairs for Pale chapters with Generative Pseudo Labeling (GPL)

The goal is to improve the performance of the DPR retriever. This notebook leans on a few different sets of examples: [Generating Examples](https://haystack.deepset.ai/tutorials/18_gpl) and [Training your own model](https://haystack.deepset.ai/tutorials/09_dpr_training) are the 2 examples from Haystack, while [this](https://kaushikshakkari.medium.com/open-domain-question-answering-series-part-6-fine-tuning-dense-retriever-models-without-any-aed7c08c33ff) and [this](https://www.pinecone.io/learn/gpl/) provide some more general content.

1. Generate QA pairs for the listed set of files (Pale). This set should be preprocessed slightly differently than it is for the primary retriever: the maximum number of tokens is a bit longer than the 256-token maximum of the DPR models used in the retrievers.

2. Save these pairs and make sure they match the format needed by the retriever in the daily-driver primary models.

3. Repeat this for the other novels as needed. There may be value in generating these questions at the same time, but that is unsure and seems unlikely.
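Step 2 could look roughly like the sketch below, which reshapes GPL-style labels into DPR-style training records. It is only an illustration under assumptions: the label keys `neg_doc` and `score`, the DPR field names (taken from the JSON layout used by the original DPR training data), the helper name `gpl_labels_to_dpr`, and the dataset name `pale-gpl` are not confirmed by this notebook, and the format the primary retriever actually needs may differ.

```python
import json


def gpl_labels_to_dpr(gpl_labels, dataset_name="pale-gpl"):
    """Convert GPL-style label dicts into DPR-style training records.

    Assumes each label has "question" and "pos_doc", and optionally "neg_doc".
    The dataset name is a hypothetical placeholder.
    """
    records = []
    for label in gpl_labels:
        neg_doc = label.get("neg_doc")
        records.append(
            {
                "dataset": dataset_name,
                "question": label["question"],
                "answers": [],  # GPL produces passages, not answer spans
                "positive_ctxs": [{"title": "", "text": label["pos_doc"]}],
                "negative_ctxs": [],
                # The mined negative is treated as a hard negative here (assumption)
                "hard_negative_ctxs": [{"title": "", "text": neg_doc}] if neg_doc else [],
            }
        )
    return records


# Toy label, not real notebook output
example_labels = [
    {"question": "toy question", "pos_doc": "passage A", "neg_doc": "passage B", "score": 1.5}
]
dpr_records = gpl_labels_to_dpr(example_labels)
print(json.dumps(dpr_records[0], indent=2))
```

The margin score is dropped in this conversion, so this shape only makes sense if the downstream training does not use the GPL margin loss.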

In [1]:
!nvidia-smi

Fri Feb 17 22:49:49 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 516.94       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  ERR!                On   | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8    N/A /  N/A |      0MiB /  4096MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# !pip install "faiss-gpu>=1.6.3,<2"
# !pip install -q git+https://github.com/deepset-ai/haystack.git

In [3]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)


In [4]:
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# from datasets import load_dataset


In [5]:
import os
# print(os.getcwd())
# os.chdir('./drive/MyDrive/pale-companion-files/GPL/')

In [6]:
import pickle
with open('../../data/poke_fmt_list.pkl','rb') as f:
    chapters = pickle.load(f)

In [7]:
from haystack import Document
chapter_documents = [Document.from_dict(d) for d in chapters]
len(chapter_documents)

4

In [8]:
from haystack.nodes import PreProcessor

word_preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=200,
    split_respect_sentence_boundary=True,
    split_overlap=30,
    progress_bar=True, 
    add_page_number=True
)

In [9]:
corpus = word_preprocessor.process(chapter_documents)
len(corpus)

Preprocessing:   0%|          | 0/4 [00:00<?, ?docs/s]

147
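To make the effect of `split_length=200` and `split_overlap=30` concrete, here is a minimal pure-Python sketch of word-based splitting with overlap. It is not Haystack's implementation: it ignores cleaning and `split_respect_sentence_boundary`, so real chunk boundaries will differ. The function name `split_words` and the toy text are made up for illustration.

```python
def split_words(text, split_length=200, split_overlap=30):
    """Split text into chunks of at most split_length words.

    Consecutive chunks share split_overlap words. Sentence boundaries
    are ignored in this simplified sketch.
    """
    words = text.split()
    step = split_length - split_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Toy text of 500 distinct "words"
toy_text = " ".join(f"w{i}" for i in range(500))
chunks = split_words(toy_text)
print(len(chunks), [len(c.split()) for c in chunks])  # 3 [200, 200, 160]
```

Each chunk starts 170 words after the previous one, so the last 30 words of one chunk reappear as the first 30 words of the next.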

In [10]:
# We load the TAS-B model, a state-of-the-art model trained on MS MARCO
max_seq_length = 300
model_name = "msmarco-distilbert-base-tas-b"

org_model = SentenceTransformer(model_name)
org_model.max_seq_length = max_seq_length

In [11]:
print("Len Corpus:", len(corpus))

Len Corpus: 147


In [12]:
from haystack.nodes.retriever import EmbeddingRetriever
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", similarity="cosine")
document_store.write_documents(corpus)


retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b",
    model_format="sentence_transformers",
    max_seq_len=max_seq_length,
    progress_bar=False,
)
document_store.update_embeddings(retriever)


Writing Documents:   0%|          | 0/147 [00:00<?, ?it/s]

INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.nodes.retriever.dense -  Init retriever using embeddings of model sentence-transformers/msmarco-distilbert-base-tas-b
INFO - haystack.document_stores.faiss -  Updating embeddings for 147 docs...


Updating Embedding:   0%|          | 0/147 [00:00<?, ? docs/s]

INFO - haystack.schema -  Setting the ID manually. This might cause a mismatch with the ID that would be generated from the document content and id_hash_keys value.

In [13]:
from haystack.nodes.question_generator import QuestionGenerator
from haystack.nodes.label_generator import PseudoLabelGenerator

use_question_generator = True


if use_question_generator:
    questions_producer = QuestionGenerator(
        model_name_or_path="doc2query/msmarco-t5-base-v1",
        max_length=64,
        split_length=128,
        split_overlap=30,
        batch_size=1,
        num_queries_per_doc=3,
    )

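else:
    # NOTE: query_doc_pairs is not defined in this notebook. To use this branch,
    # define it first; the Haystack GPL tutorial uses a list of
    # {"question": ..., "document": ...} dicts.
    questions_producer = query_doc_pairs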

# We can use either QuestionGenerator or already generated questions in PseudoLabelGenerator
psg = PseudoLabelGenerator(questions_producer, retriever, max_questions_per_document=10, batch_size=1, top_k=10)
output, pipe_id = psg.run(documents=document_store.get_all_documents())


INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
Using sep_token, but it is not set yet.
INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.schema -  Setting the ID manually. This might cause a mismatch with the ID that would be generated from the document content and id_hash_keys value.

Generating questions:   0%|          | 0/290 [00:00<?, ?it/s]

Mine negatives:   0%|          | 0/859 [00:00<?, ?it/s]

INFO - haystack.schema -  Setting the ID manually. This might cause a mismatch with the ID that would be generated from the document content and id_hash_keys value.

Score margin:   0%|          | 0/859 [00:00<?, ?it/s]
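The "Score margin" stage is the pseudo-labeling part of GPL: for each (question, positive passage, mined negative passage) triple, a cross-encoder scores both passages against the question, and the difference between the two scores becomes the training label. The sketch below shows only that arithmetic. The word-overlap scorer is a toy stand-in for a real cross-encoder, and the function names and example strings are hypothetical.

```python
def margin_labels(triples, score_fn):
    """Build GPL-style labels: margin = score(q, pos) - score(q, neg)."""
    labels = []
    for question, pos_doc, neg_doc in triples:
        margin = score_fn(question, pos_doc) - score_fn(question, neg_doc)
        labels.append(
            {"question": question, "pos_doc": pos_doc, "neg_doc": neg_doc, "score": margin}
        )
    return labels


def toy_overlap_score(question, doc):
    """Toy stand-in for a cross-encoder: number of words shared by both texts."""
    return float(len(set(question.split()) & set(doc.split())))


# Toy triple, not real notebook output
triples = [("who wants information", "mags wants information", "the goblin growled")]
labels = margin_labels(triples, toy_overlap_score)
print(labels[0]["score"])  # 2.0
```

A large positive margin means the scorer considers the positive passage much more relevant than the negative; a margin near zero or below flags a mined "negative" that may actually be relevant.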

In [14]:
with open('./poke-gpl-output.pkl', 'wb') as fp:
    pickle.dump(output, fp)
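Before relying on the saved file, it may help to confirm that the pickle round-trips and that every label has the expected keys. The following is a minimal sketch using toy data and a temporary file. The required key set is an assumption (only `question` and `pos_doc` are visible in this notebook's output), and `find_malformed` is a hypothetical helper.

```python
import os
import pickle
import tempfile

# Assumed label schema; adjust to whatever the real output contains
REQUIRED_KEYS = {"question", "pos_doc", "neg_doc", "score"}


def find_malformed(labels, required_keys=REQUIRED_KEYS):
    """Return the indices of labels that are missing any required key."""
    return [i for i, label in enumerate(labels) if not required_keys <= set(label)]


# Toy stand-in for the GPL output; the second label is deliberately incomplete
toy_output = {
    "gpl_labels": [
        {"question": "q1", "pos_doc": "p1", "neg_doc": "n1", "score": 0.7},
        {"question": "q2", "pos_doc": "p2"},
    ]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-gpl-output.pkl")
    with open(path, "wb") as fp:
        pickle.dump(toy_output, fp)
    with open(path, "rb") as fp:
        reloaded = pickle.load(fp)

bad = find_malformed(reloaded["gpl_labels"])
print("Malformed label indices:", bad)
```

The same check can be pointed at the real saved file by loading it with `pickle.load` and passing its `gpl_labels` list to `find_malformed`.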

In [15]:
output['gpl_labels'][0]

{'question': 'who said mags has been as kind and fair as any practitioner',
 'pos_doc': 'mags has been as kind and fair as any practitioner i’ve run into-”\n\n“you don’t think that’s because she’s got motives?” deedee asked.\n\n“i’ve got motives,” mags said.  “i want my information.  i want to not make enemies.”\n\n“yeah?” dee hissed.  “you want-”\n\n“dee,” ben said, with a note of anger in his voice.  she turned her head, glancing at him.  “take this any further, and it might be where we part ways.  this isn’t fun.”\n\nthe look in her eyes was wounded.\n\n“pussy bitch,” buttsack growled.  “you know you’d lose if you came at me.  you’re a shit goblin.  spending too much time around humans, losing your edge?  that’s bad enough.  but you picked a shitty, sad, coward pussy of a human to hang around.”\n\n“no,” mags told him.  “stop that.”\n\ndee tensed, muscles moving beneath skin, poised to pounce.\n\na distant boom served like the firing of a starter pistol.  dee lunged to action, and be