## Fine-tune a Retriever on the GPL data
Follow the [DPR training tutorial](https://haystack.deepset.ai/tutorials/09_dpr_training) or the [GPL tutorial](https://haystack.deepset.ai/tutorials/18_gpl) to train your own Retriever. The main open question is whether an EmbeddingRetriever or a DensePassageRetriever (DPR) is the better fit. I'll make the final decision as part of the general evaluation of the preprocessor steps, using a qualitative analysis on a few chapters of Pale (for speed), since unfortunately the GPL data generated with the EmbeddingRetriever isn't readily compatible with DPR.

This will need to be done iteratively because:
1) my personal PC likely won't be able to handle all the files
2) Colab is unlikely to manage the fine-tuning in one go. I will most likely process one book at a time, starting with the shorter ones as a proof of concept.
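
The "shortest first" ordering can be sketched as below. This is a minimal illustration rather than part of the original workflow: `plan_training_order` is a hypothetical helper, and the counts are the GPL label counts reported by `len(...["gpl_labels"])` further down in this notebook.

```python
# Hypothetical helper: order books so the smallest GPL label sets are trained first.
def plan_training_order(label_counts):
    """Return book names sorted from fewest to most GPL labels."""
    return sorted(label_counts, key=label_counts.get)


# GPL label counts as reported later in this notebook.
label_counts = {"pact": 36607, "poke": 859, "pate": 335, "pale": 127907}

training_order = plan_training_order(label_counts)
print(training_order)
```

The runs in this notebook actually start with Pact as a medium-sized test, then move on to the two small books and finally Pale.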

In [None]:
import os
print(os.getcwd())
os.chdir('./drive/MyDrive/pale-companion-files/finetune/')
print(os.getcwd())

/content
/content/drive/MyDrive/pale-companion-files/finetune


In [None]:
import pickle
# Test on Pact first, as a medium-sized dataset
with open('../GPL/pact-gpl-output.pkl','rb') as f:
    pact_questions = pickle.load(f)

In [None]:
pact_questions['gpl_labels'][0].keys()

dict_keys(['question', 'pos_doc', 'neg_doc', 'score'])

In [None]:
pact_questions['gpl_labels'][5]

{'question': "what did dad say about pearl's trick",
 'pos_doc': 'He said that was close enough.\nIf I have to be truthful then I need to say my feelings hurt almost as bad as any of it.  I wish someone would explain this better.  Daddy said it was a trick but I said I did not think it made sense that someone my age could plan a trick like that and plan ahead to have people waiting in the shed like Pearl did.\nDaddy said the members of the Duchamp family could and they would do worse because they were scared of me so I could never ever never ever be friends with them.  I asked him not even when I was an adult and he said when I am an adult I will know better or I deserve what I get.\nI think I started having the bad dreams around then.  Every night for a long time.  Then one night daddy came and picked me up and he carried me to his bed.  He told me the deal was I was allowed to cry but only so long as it was night and my head was on the pillow.  In daylight I cannot cry or show weakne

Haystack expects the training data as a list of dicts, each with these keys:
```
[{'question': ..., 'pos_doc': ..., 'neg_doc': ..., 'score': ...}]
```
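
Before training, it can be worth checking that every label really has this shape. The following is a minimal sketch: `find_malformed_labels` is a hypothetical helper (not a Haystack function), and it assumes each label is a plain dict like the one printed above. The toy data is made up for illustration.

```python
REQUIRED_KEYS = {"question", "pos_doc", "neg_doc", "score"}


def find_malformed_labels(labels):
    """Return the indices of labels that are not dicts or lack a required key."""
    bad = []
    for i, label in enumerate(labels):
        if not isinstance(label, dict) or not REQUIRED_KEYS.issubset(label):
            bad.append(i)
    return bad


# Toy data standing in for pact_questions["gpl_labels"] (values are made up).
example_labels = [
    {"question": "q1", "pos_doc": "p1", "neg_doc": "n1", "score": 1.5},
    {"question": "q2", "pos_doc": "p2"},  # missing neg_doc and score
]
print(find_malformed_labels(example_labels))
```

In the notebook this would be called as `find_malformed_labels(pact_questions["gpl_labels"])`, and an empty list would mean every label has all four keys.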

In [None]:
!nvidia-smi

Tue Feb 21 01:31:00 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P0    28W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install -q datasets
!pip install "faiss-gpu>=1.6.3,<2"
!pip install -q git+https://github.com/deepset-ai/haystack.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-gpu<2,>=1.6.3
  Downloading faiss_gpu-1.7.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: fais

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)


In [None]:
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import load_dataset


In [22]:
# We load the TAS-B model, a state-of-the-art model trained on MS MARCO
max_seq_length = 400
model_name = "msmarco-distilbert-base-tas-b"

org_model = SentenceTransformer(model_name)
org_model.max_seq_length = max_seq_length


In [18]:
from haystack import Document

with open('../chapter_fmt_list.pkl', 'rb') as f:
    pale_chapters = pickle.load(f)

# Keep only the early arcs (arc_number < 5) so the evaluation corpus stays small
limited_chapters = [c for c in pale_chapters if int(c['meta']['arc_number']) < 5]

chapter_documents = [Document.from_dict(d) for d in limited_chapters]
len(chapter_documents)

62

In [19]:
from haystack.nodes import PreProcessor

word_preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=300,
    split_respect_sentence_boundary=True,
    split_overlap=40,
    progress_bar=True,
    add_page_number=True,
)

corpus = word_preprocessor.process(chapter_documents)

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable  HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://docs.haystack.deepset.ai/docs/telemetry
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Preprocessing:   0%|          | 0/62 [00:00<?, ?docs/s]
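
To make the `split_length=300` / `split_overlap=40` settings concrete, here is a simplified sketch of word-window splitting with overlap. It is an illustration only: `split_words` is a hypothetical function, and it ignores cleaning and sentence boundaries, which the real `PreProcessor` handles (`split_respect_sentence_boundary=True`), so the actual chunk boundaries will differ.

```python
def split_words(words, split_length=300, split_overlap=40):
    """Split a list of words into windows that share `split_overlap` words."""
    step = split_length - split_overlap  # each new window starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + split_length])
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input: 700 numbered "words" instead of a real chapter.
words = [f"w{i}" for i in range(700)]
chunks = split_words(words)
print([len(c) for c in chunks])
```

With these settings each window starts 260 words after the previous one, so consecutive chunks repeat 40 words of context.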

In [23]:
from haystack.nodes.retriever import EmbeddingRetriever
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", similarity="cosine")
document_store.write_documents(corpus)


retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b",
    model_format="sentence_transformers",
    max_seq_len=max_seq_length,
    progress_bar=True,
)
document_store.update_embeddings(retriever)


Writing Documents:   0%|          | 0/1877 [00:00<?, ?it/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.nodes.retriever.dense:Init retriever using embeddings of model sentence-transformers/msmarco-distilbert-base-tas-b
INFO:haystack.document_stores.faiss:Updating embeddings for 1877 docs...


Updating Embedding:   0%|          | 0/1877 [00:00<?, ? docs/s]

Batches:   0%|          | 0/59 [00:00<?, ?it/s]

In [24]:
len(pact_questions["gpl_labels"])

36607

In [25]:
retriever.train(pact_questions["gpl_labels"])

INFO:haystack.nodes.retriever._embedding_encoder:Training/adapting SentenceTransformer(
  (0): Transformer({'max_seq_length': 400, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
) with 36607 examples


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/2287 [00:00<?, ?it/s]

In [26]:
retriever.save("pact_adapted_retriever")

In [32]:
# Now Poke
with open('../GPL/poke-gpl-output.pkl','rb') as f:
    poke_questions = pickle.load(f)
len(poke_questions["gpl_labels"])

859

In [33]:
retriever.train(poke_questions["gpl_labels"])

INFO:haystack.nodes.retriever._embedding_encoder:Training/adapting SentenceTransformer(
  (0): Transformer({'max_seq_length': 400, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
) with 859 examples


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/53 [00:00<?, ?it/s]

In [34]:
retriever.save("poke_pact_adapted_retriever")

In [35]:
# Now Pate
with open('../GPL/pate-gpl-output.pkl','rb') as f:
    pate_questions = pickle.load(f)
len(pate_questions["gpl_labels"])

335

In [36]:
retriever.train(pate_questions["gpl_labels"])

INFO:haystack.nodes.retriever._embedding_encoder:Training/adapting SentenceTransformer(
  (0): Transformer({'max_seq_length': 400, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
) with 335 examples


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/20 [00:00<?, ?it/s]

In [37]:
retriever.save("pate_poke_pact_adapted_retriever")

In [38]:
# Now expand the training to Pale
with open('../GPL/pale-gpl-output.pkl','rb') as f:
    pale_questions = pickle.load(f)
len(pale_questions["gpl_labels"])

127907

In [39]:
retriever.train(pale_questions["gpl_labels"])

INFO:haystack.nodes.retriever._embedding_encoder:Training/adapting SentenceTransformer(
  (0): Transformer({'max_seq_length': 400, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
) with 127907 examples


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/7994 [00:00<?, ?it/s]

In [40]:
retriever.save("otherverse_adapted_retriever")
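
As a sanity check on the logs, the iteration counts in the four progress bars above are consistent with one epoch at a batch size of 16 where an incomplete final batch is not counted. The batch size is an assumption inferred from the numbers; the notebook never sets it explicitly.

```python
# Assumption: batch size 16, with the final partial batch dropped.
ASSUMED_BATCH_SIZE = 16

# (number of GPL labels, iterations shown in the progress bar) for each run above
runs = {
    "pact": (36607, 2287),
    "poke": (859, 53),
    "pate": (335, 20),
    "pale": (127907, 7994),
}

for book, (n_examples, reported_iterations) in runs.items():
    expected = n_examples // ASSUMED_BATCH_SIZE
    print(book, expected, expected == reported_iterations)
```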