## Fine-tune a Retriever on the GPL data
Follow the [DPR training tutorial](https://haystack.deepset.ai/tutorials/09_dpr_training) or the [GPL tutorial](https://haystack.deepset.ai/tutorials/18_gpl) to train your own Retriever. The main choice is whether you're better off using an EmbeddingRetriever or a DensePassageRetriever. I'll make the final decision as part of the general evaluation of the preprocessor steps and a qualitative analysis on a few chapters of Pale (for speed), since unfortunately the GPL data generated with the EmbeddingRetriever isn't readily compatible with the DPR.

This will need to be done iteratively because:
1) my personal PC likely won't be able to handle all the files
2) Colab is unlikely to be able to do the fine-tuning in one go. I will most likely process one book at a time, starting with the shorter ones as a proof of concept.
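
The one-book-at-a-time approach amounts to chaining runs, where each run starts from the model the previous run saved. A minimal sketch of that bookkeeping is below. The helper `plan_runs`, the `<books>_adapted` naming scheme, and the `../GPL/<book>-gpl-output.pkl` path pattern are assumptions for illustration (the pattern is extrapolated from the Ward file used in this notebook), not something the notebook itself defines.

```python
from typing import Iterator, List, Tuple

# Starting checkpoint named in the configuration cell of this notebook
BASE_MODEL = "sentence-transformers/msmarco-distilbert-base-tas-b"


def plan_runs(books: List[str], base_model: str = BASE_MODEL) -> Iterator[Tuple[str, str, str]]:
    """Yield (model_in, gpl_file, model_out) per book, feeding each output into the next run."""
    model_in = base_model
    done: List[str] = []
    for book in books:
        done.append(book)
        # Hypothetical naming scheme: every book seen so far, joined by underscores
        model_out = "_".join(done) + "_adapted"
        # Assumed path pattern, extrapolated from "../GPL/ward-gpl-output.pkl"
        gpl_file = f"../GPL/{book}-gpl-output.pkl"
        yield model_in, gpl_file, model_out
        # The next run starts from the model this run saved
        model_in = model_out


for model_in, gpl_file, model_out in plan_runs(["twig", "pact", "ward"]):
    print(f"{model_in} + {gpl_file} -> {model_out}")
```

Each tuple maps onto `MODEL_NAME_IN`, `FILENAME_TO_TUNE`, and `MODEL_NAME_OUT` in the configuration cell, so a run that is interrupted can be resumed from the last saved model.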

#### Training Status
- Otherverse 
  - Pale (~ch 23.1) - done
  - Pact - done
  - Poke - done
  - Pate - done
- Parahumans 
  - Worm - done
  - Ward - in progress
  - Glowworm - done
- Twig - done

In [1]:
MODEL_NAME_IN = "twig_otherverse_glowworm_worm_adapted" # The name of the model to start training from. Can be either a local model or the starting model sentence-transformers/msmarco-distilbert-base-tas-b
FILENAME_TO_TUNE = "../GPL/ward-gpl-output.pkl" # filepath of the file / documents to tune on. Relative path to the /finetune/ directory this code executes in
# The retriever only uses a small handful of documents as its "test" configuration so we don't need to do anything special here.
MODEL_NAME_OUT = "twig_otherverse_parahumans_adapted"
REMOVE_DOCUMENT_STORE = True # Needed if you want to make a new retriever document store
# The document store in this case is irrelevant so it's just a small bit of sample data

In [2]:
import os
print(os.getcwd())
os.chdir('./drive/MyDrive/pale-companion-files/finetune/')
if REMOVE_DOCUMENT_STORE:
    print("Deleting Document Store...")
    store_files = ("training_document_store_index.db", "training_document_store_config.db", "faiss_document_store.db")
    # Remove each file independently so one missing file doesn't skip the rest
    for path in store_files:
        try:
            os.remove(path)
        except OSError:
            print(f"Tried to delete {path} - are you sure it exists?")

print(os.getcwd())

/content
Deleting Document Store...
/content/drive/MyDrive/pale-companion-files/finetune


In [3]:
import pickle
with open(FILENAME_TO_TUNE,'rb') as f:
    questions = pickle.load(f)

In [4]:
questions['gpl_labels'][5]

{'question': "what was tristan's cape about",
 'pos_doc': 'They would be pulling an all-nighter.\nOn his way back to the dorm rooms, he saw and waved at Figurehead.  Then it was back to his room.\nHe couldn’t sleep.  More accurately, he couldn’t bring himself to lie down in the bed, couldn’t bring himself to give up the time he would spend unconscious.  It wasn’t supposed to count, but—\nSuffocation gas, the thought crossed his mind.  It was hard to breathe, to swallow.  It had been a heck of a week, as Mr. Vaughn had said.  Something practically every day, whether it was fights or showing up at an event for law enforcement.  As fun as the cape stuff could be, with the banter and the team interplay, the emotional highs and lows had their cost.\nAnd he had so very little available to spend.\nHe made his way to the desk he shared with Tristan.  Homework.\nHe felt like if someone said one mean word to him, he could burst into tears.  Homework felt just masochistic enough to punish himself

The expected format of the data in Haystack should be:
```
[{'question', 'pos_doc', 'neg_doc', 'score'}]
```
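
Since a malformed label would only surface partway through training, it can be worth checking the keys up front. The sketch below is a plain-Python check against the four keys listed above. `find_malformed_labels` and the sample records are hypothetical and not part of Haystack.

```python
from typing import Any, Dict, List

# The four keys named in the expected format above
REQUIRED_KEYS = {"question", "pos_doc", "neg_doc", "score"}


def find_malformed_labels(labels: List[Dict[str, Any]]) -> List[int]:
    """Return the indices of labels that are missing any required key."""
    return [i for i, label in enumerate(labels) if not REQUIRED_KEYS.issubset(label)]


# Hypothetical sample data: the second record lacks 'neg_doc' and 'score'
sample = [
    {"question": "q1", "pos_doc": "p1", "neg_doc": "n1", "score": 1.5},
    {"question": "q2", "pos_doc": "p2"},
]
print(find_malformed_labels(sample))  # prints [1]
```

On the real data this would be called as `find_malformed_labels(questions['gpl_labels'])`, with an empty list meaning every label has the expected keys.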

In [5]:
!nvidia-smi

Fri Feb 24 22:02:06 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    24W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [6]:
!pip install "faiss-gpu>=1.6.3,<2"
!pip install -q git+https://github.com/deepset-ai/haystack.git


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-gpu<2,>=1.6.3
  Downloading faiss_gpu-1.7.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
     85.5/85.5 MB 11.2 MB/s eta 0:00:00
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
     190.3/190.3 KB 16.5 MB/s eta 0:00:00
     10.7/10.7 MB 99.5 MB/s eta 0:00:00
     86.0/86.0 KB 10.9 MB/s eta 0:00:00
  Preparing metadata (setu

In [7]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)


In [8]:
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

from haystack import Document
from haystack.nodes import PreProcessor
from haystack.nodes.retriever import EmbeddingRetriever
from haystack.document_stores import FAISSDocumentStore


INFO:haystack.telemetry_2:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).


In [9]:
with open('../chapter_fmt_list.pkl','rb') as f:
    pale_chapters = pickle.load(f)
limited_chapters = [i for i in pale_chapters if int(i['meta']['arc_number']) < 1]

chapter_documents = [Document.from_dict(d) for d in limited_chapters]
len(chapter_documents)

2

In [10]:
word_preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=300,
    split_respect_sentence_boundary=True,
    split_overlap=40,
    progress_bar=True, 
    add_page_number=True
)

corpus =  word_preprocessor.process(chapter_documents)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Preprocessing:   0%|          | 0/2 [00:00<?, ?docs/s]

In [11]:
if REMOVE_DOCUMENT_STORE:  # cold start: build a fresh store
    # The old index files must be deleted first each time (see REMOVE_DOCUMENT_STORE above)
    document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", similarity="cosine")
    document_store.write_documents(corpus)
else:
    document_store = FAISSDocumentStore.load(index_path="training_document_store_index.db", config_path="training_document_store_config.db")

max_seq_length = 400 # This should be kept at 400 to match all other iterations of fine tuning
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model=MODEL_NAME_IN, # In the future we may want to fine tune the embedding model on the main documents as well
    model_format="sentence_transformers",
    max_seq_len=max_seq_length,
    progress_bar=True,
)

# Update the embeddings on every run: the embedding model changes after each training round,
# and this doubles as a quick timing check (originally 24 seconds)
document_store.update_embeddings(retriever)

document_store.save(index_path="training_document_store_index.db", config_path="training_document_store_config.db")


Writing Documents:   0%|          | 0/31 [00:00<?, ?it/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.nodes.retriever.dense:Init retriever using embeddings of model twig_otherverse_glowworm_worm_adapted
INFO:haystack.document_stores.faiss:Updating embeddings for 31 docs...


Updating Embedding:   0%|          | 0/31 [00:00<?, ? docs/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [12]:
num_warmup_steps=int(0.05*len(questions["gpl_labels"]))
learning_rate = 2e-5 # The learning rate recommended for BERT; too high and you overwrite pretrained knowledge
batch_size = 16 # 16 is the default and barely fits (14/15 GB used) on a Colab GPU
len(questions["gpl_labels"]), int(0.05*len(questions["gpl_labels"]))


(73997, 3699)
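
The arithmetic behind these numbers is worth spelling out. The warmup value is 5% of the number of examples, not 5% of the number of optimizer steps. Assuming `num_warmup_steps` is counted in optimizer steps (which is how the underlying sentence-transformers scheduler is usually understood to work, but treat that as an assumption), warmup would cover most of the single epoch. The variable names below are illustrative only.

```python
n_examples = 73997  # len(questions["gpl_labels"])
batch_size = 16

# Full batches in one epoch; matches the 4624 iterations shown in the training progress bar
steps_per_epoch = n_examples // batch_size

# What the cell above computes: 5% of the examples
warmup_from_examples = int(0.05 * n_examples)

# Alternative, if the intent was 5% of the optimizer steps
warmup_from_steps = int(0.05 * steps_per_epoch)

print(steps_per_epoch, warmup_from_examples, warmup_from_steps)
print(f"warmup share of the epoch: {warmup_from_examples / steps_per_epoch:.0%}")
```

Under that assumption the learning rate is still ramping up for roughly 80% of the epoch. Whether that is intended is a judgment call. Changing it would also break consistency with the earlier fine-tuning rounds.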

In [13]:
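# Pass batch_size explicitly so the value set above is actually used (16 is also the default)
retriever.train(questions["gpl_labels"], n_epochs=1, num_warmup_steps=num_warmup_steps, learning_rate=learning_rate, batch_size=batch_size)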

INFO:haystack.nodes.retriever._embedding_encoder:Training/adapting SentenceTransformer(
  (0): Transformer({'max_seq_length': 400, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
) with 73997 examples


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/4624 [00:00<?, ?it/s]

In [14]:
retriever.save(MODEL_NAME_OUT)
print(f"Underlying retriever model saved to {MODEL_NAME_OUT}")

Underlying retriever model saved to twig_otherverse_parahumans_adapted
