## Generate Q-A pairs for Pale chapters and use Generative Pseudo Labeling (GPL) to improve DPR performance

This leans on a few sets of examples. Two come from Haystack: [Generating Examples](https://haystack.deepset.ai/tutorials/18_gpl) and [Training your own model](https://haystack.deepset.ai/tutorials/09_dpr_training). For more general background, see [this](https://kaushikshakkari.medium.com/open-domain-question-answering-series-part-6-fine-tuning-dense-retriever-models-without-any-aed7c08c33ff) and [this](https://www.pinecone.io/learn/gpl/).

1. Generate QA pairs for the listed set of files (Pale). This entire set should be preprocessed slightly differently than it is for the primary retriever (the maximum number of tokens is a bit longer than the 256-token maximum of the DPR models used in the retrievers).

2. Save these pairs and make sure they match the format needed by the retriever in the daily-driver primary models.

3. Repeat this for the other novels as needed. There may be value in generating these questions at the same time, but that is uncertain and seems unlikely.
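Step 2 could be sketched roughly as follows. This assumes each GPL label is a dict with `question`, `pos_doc`, and (per the Haystack GPL tutorial) `neg_doc` and `score` keys, and that the target is the DPR-style JSON layout described in the Haystack DPR training tutorial (`dataset`, `question`, `answers`, `positive_ctxs`, `negative_ctxs`, `hard_negative_ctxs`). The function name `gpl_to_dpr`, the dataset name, the empty `answers` list, and the empty `title`/`passage_id` fields are illustrative assumptions, not something the notebook defines; whether the DPR trainer accepts records without answers has not been checked here.

```python
import json


def gpl_to_dpr(gpl_labels, dataset_name="pale-gpl"):
    """Group GPL triples by question into DPR-style training records.

    Hypothetical helper: the key names follow the Haystack tutorials,
    but the mapping itself is an assumption, not the notebook's method.
    """
    by_question = {}
    for label in gpl_labels:
        record = by_question.setdefault(
            label["question"],
            {
                "dataset": dataset_name,
                "question": label["question"],
                "answers": [],  # GPL produces no answer strings
                "positive_ctxs": [],
                "negative_ctxs": [],
                "hard_negative_ctxs": [],
            },
        )
        positive = {"title": "", "text": label["pos_doc"], "passage_id": ""}
        if positive not in record["positive_ctxs"]:
            record["positive_ctxs"].append(positive)

        # Mined negatives are treated as hard negatives (assumption).
        negative_text = label.get("neg_doc")
        if negative_text:
            negative = {"title": "", "text": negative_text, "passage_id": ""}
            if negative not in record["hard_negative_ctxs"]:
                record["hard_negative_ctxs"].append(negative)
    return list(by_question.values())


# Toy labels standing in for output["gpl_labels"].
toy_labels = [
    {"question": "q1", "pos_doc": "p1", "neg_doc": "n1", "score": 1.0},
    {"question": "q1", "pos_doc": "p1", "neg_doc": "n2", "score": 0.5},
    {"question": "q2", "pos_doc": "p2"},
]
toy_records = gpl_to_dpr(toy_labels)
toy_json = json.dumps(toy_records, indent=2)
```

Grouping by question keeps one record per generated query, so several mined negatives for the same query end up in a single `hard_negative_ctxs` list instead of duplicating the positive passage.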

In [1]:
!nvidia-smi

Fri Feb 17 01:59:05 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P0    27W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install "faiss-gpu>=1.6.3,<2"
!pip install -q git+https://github.com/deepset-ai/haystack.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)


In [4]:
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# from datasets import load_dataset


In [5]:
import os
print(os.getcwd())
os.chdir('./drive/MyDrive/pale-companion-files/GPL/')

/content


In [6]:
import pickle
with open('../chapter_fmt_list.pkl','rb') as f:
    chapters = pickle.load(f)

In [7]:
from haystack import Document
chapter_documents = [Document.from_dict(d) for d in chapters]
len(chapter_documents)

307

In [8]:
from haystack.nodes import PreProcessor

word_preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=200,
    split_respect_sentence_boundary=True,
    split_overlap=30,
    progress_bar=True, 
    add_page_number=True
)

In [9]:
corpus = word_preprocessor.process(chapter_documents)
len(corpus)

Preprocessing:   0%|          | 0/307 [00:00<?, ?docs/s]



21196
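To get a feel for how `split_length=200` and `split_overlap=30` turn 307 chapters into roughly 21k passages, here is a minimal sliding-window sketch. It is a simplification under stated assumptions: it splits on words only and ignores `split_respect_sentence_boundary`, so the real `PreProcessor` output will differ at window edges. The helper name `word_windows` is hypothetical.

```python
def word_windows(words, split_length=200, split_overlap=30):
    """Return overlapping word windows (simplified, no sentence handling)."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    stride = split_length - split_overlap  # 170 words with the settings above
    windows = []
    for start in range(0, len(words), stride):
        windows.append(words[start:start + split_length])
        if start + split_length >= len(words):
            break  # the last window already reaches the end of the text
    return windows


# A 500-word toy "chapter" made of numbered tokens.
toy_words = [f"w{i}" for i in range(500)]
toy_windows = word_windows(toy_words)
print([len(w) for w in toy_windows])
```

Each window starts 170 words after the previous one, so consecutive passages share 30 words of context.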

In [10]:
# We load the TAS-B model, a state-of-the-art model trained on MS MARCO
max_seq_length = 300
model_name = "msmarco-distilbert-base-tas-b"

org_model = SentenceTransformer(model_name)
org_model.max_seq_length = max_seq_length

In [11]:
print("Len Corpus:", len(corpus))

Len Corpus: 21196


In [13]:
from haystack.nodes.retriever import EmbeddingRetriever
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", similarity="cosine")
document_store.write_documents(corpus)


retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b",
    model_format="sentence_transformers",
    max_seq_len=max_seq_length,
    progress_bar=False,
)
document_store.update_embeddings(retriever)


Writing Documents:   0%|          | 0/21196 [00:00<?, ?it/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.nodes.retriever.dense:Init retriever using embeddings of model sentence-transformers/msmarco-distilbert-base-tas-b
INFO:haystack.document_stores.faiss:Updating embeddings for 21196 docs...


Updating Embedding:   0%|          | 0/21196 [00:00<?, ? docs/s]

In [14]:
from haystack.nodes.question_generator import QuestionGenerator
from haystack.nodes.label_generator import PseudoLabelGenerator

use_question_generator = True


if use_question_generator:
    questions_producer = QuestionGenerator(
        model_name_or_path="doc2query/msmarco-t5-base-v1",
        max_length=64,
        split_length=128,
        split_overlap=30,
        batch_size=8,
        num_queries_per_doc=3,
    )

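else:
    # NOTE: query_doc_pairs is not defined in this notebook. Supply
    # pre-generated question/document pairs before disabling the generator.
    questions_producer = query_doc_pairs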

# We can use either QuestionGenerator or already generated questions in PseudoLabelGenerator
psg = PseudoLabelGenerator(questions_producer, retriever, max_questions_per_document=10, batch_size=8, top_k=10)
output, pipe_id = psg.run(documents=document_store.get_all_documents())


INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
Using sep_token, but it is not set yet.
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Generating questions:   0%|          | 0/43226 [00:00<?, ?it/s]

Mine negatives:   0%|          | 0/15989 [00:00<?, ?it/s]

Score margin:   0%|          | 0/15989 [00:00<?, ?it/s]
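The three progress bars correspond to the GPL stages: generate queries, mine negatives with the retriever, then score each (query, positive, negative) triple. In GPL the score is the margin between a cross-encoder's relevance scores for the positive and the negative passage, and the retriever is later trained to reproduce that margin (MarginMSE). A minimal sketch of the arithmetic, with made-up numbers and hypothetical function names:

```python
def score_margin(pos_score, neg_score):
    """Teacher margin: how much more relevant the positive is than the negative."""
    return pos_score - neg_score


def margin_mse(student_margins, teacher_margins):
    """Mean squared error between student and teacher margins."""
    if len(student_margins) != len(teacher_margins):
        raise ValueError("margin lists must have the same length")
    squared_errors = [
        (student - teacher) ** 2
        for student, teacher in zip(student_margins, teacher_margins)
    ]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only; real scores come from the cross-encoder and retriever.
teacher = [score_margin(9.5, 2.0), score_margin(4.0, 3.0)]
student = [score_margin(7.5, 2.0), score_margin(4.0, 3.0)]
loss = margin_mse(student, teacher)
```

A small margin signals that the mined "negative" is nearly as relevant as the positive, which is why GPL uses soft margins rather than hard positive/negative labels.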

In [15]:
with open('./pale-gpl-output.pkl', 'wb') as fp:
    pickle.dump(output, fp)

In [22]:
output['gpl_labels'][9]

{'question': "who said this won't screw with our heads",
 'pos_doc': 'The points of comparison, the minute factors… no.”  Charles glanced in the direction of the Aurum a moment before the Aurum spoke.  “An amendment.  The question is about leadership, character, and merit.  The three challengers will participate.”  The centipede rasped as it grazed floor and glass.  It let the man who’d been seated on the gold centipede’s head down to the floor, but remained in constant contact with him as it slid by, reaching.  For Avery, Verona, and Lucy.  “Us against him?” Avery asked.  “Each of them alone,” Musser said.  “I must insist.”  Alone?  Avery felt a bit of trepidation.  “We awoke together,” Lucy said.  “It’s a fair adjustment, to keep to the spirit of the challenge,” the Aurum said.  “This won’t-” Verona paused, as the centipede’s long body slid between her and Lucy.  She rubbed her palm.  “It won’t Worold us or whatever the term is?  It won’t screw with our heads?”  “No more than briefly