# Dense Retrieval with Simple Transformers - Training


## Imports and logging

In [22]:
import logging
import json

from simpletransformers.retrieval import RetrievalModel, RetrievalArgs
import wandb


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

2022-08-23 15:14:35.128017: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
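
The training output further down repeats a `huggingface/tokenizers` warning about the process being forked after parallelism was used, and the warning itself suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch of that suggestion, assuming it is acceptable to disable tokenizer parallelism for this run:

```python
import os

# Set this before any tokenizer is used (ideally at the very top of the
# notebook). "false" disables tokenizer parallelism, which is the state the
# warning falls back to anyway; "true" would keep it enabled.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Either value silences the message; which one is faster depends on the data loading setup.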


## Data Preparation

The Natural Questions (NQ) train and dev sets used here can be downloaded with the `download_data.py` script from the DPR repository: https://github.com/facebookresearch/DPR

```bash
# Downloads the dataset from https://github.com/facebookresearch/DPR
python download_data.py --resource data.retriever.nq-train
python download_data.py --resource data.retriever.nq-dev

# Move things around and clean up
mv downloads/data .
mv data/retriever/* data
rm -r data/retriever
rm -r downloads
```

In [2]:
train_data = "data/nq-train.json"
eval_data = "data/nq-dev.json"

In [5]:
with open(train_data, "r") as f:
    train = json.load(f)

## Data Formats

### DPR format

A JSON file containing a list of entries, where each entry is a dictionary that must contain the following two keys:
- `question`
- `positive_ctxs`

In [6]:
train[0].keys()

dict_keys(['dataset', 'question', 'answers', 'positive_ctxs', 'negative_ctxs', 'hard_negative_ctxs'])

Here, `positive_ctxs` is a list of relevant documents where each entry must contain the two keys:
- `title`
- `text`

Although this list may contain multiple relevant documents, only the first (most relevant) document is used during training.

In [7]:
train[0]["positive_ctxs"][0].keys()

dict_keys(['title', 'text', 'score', 'title_score', 'passage_id'])
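
The key requirements described above can be checked before training. The helper below is a hypothetical sketch (`first_positive` is not part of Simple Transformers, and the sample entry is abbreviated toy data); it validates the required keys and returns the first positive passage, mirroring the "first positive only" behaviour described above.

```python
def first_positive(entry):
    """Return (question, title, text) for one DPR-format entry.

    Raises ValueError if a required key is missing.
    """
    for key in ("question", "positive_ctxs"):
        if key not in entry:
            raise ValueError(f"missing required key: {key!r}")
    if not entry["positive_ctxs"]:
        raise ValueError("positive_ctxs is empty")
    ctx = entry["positive_ctxs"][0]  # only the first positive is used
    for key in ("title", "text"):
        if key not in ctx:
            raise ValueError(f"positive context is missing key: {key!r}")
    return entry["question"], ctx["title"], ctx["text"]


# Hypothetical toy entry, shaped like the DPR data shown above.
sample = {
    "question": "big little lies season 2 how many episodes",
    "positive_ctxs": [
        {"title": "Big Little Lies (TV series)", "text": "toy passage text"},
        {"title": "Another page", "text": "a second positive that is ignored"},
    ],
}
question, title, text = first_positive(sample)
print(question, "|", title)
```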

Overall, Simple Transformers looks for a query and the relevant doc/passage for that query. The relevant passage may also contain an optional `title` value.

In [11]:
print(f"Query: {train[0]['question']}")
print()
print(f"Relevant passage title: {train[0]['positive_ctxs'][0]['title']}")
print()
print(f"Relevant passage text: {train[0]['positive_ctxs'][0]['text']}")


Query: big little lies season 2 how many episodes

Relevant passage title: Big Little Lies (TV series)

Relevant passage text: series garnered several accolades. It received 16 Emmy Award nominations and won eight, including Outstanding Limited Series and acting awards for Kidman, Skarsgård, and Dern. The trio also won Golden Globe Awards in addition to a Golden Globe Award for Best Miniseries or Television Film win for the series. Kidman and Skarsgård also received Screen Actors Guild Awards for their performances. Despite originally being billed as a miniseries, HBO renewed the series for a second season. Production on the second season began in March 2018 and is set to premiere in 2019. All seven episodes are being written by Kelley


## Recommended data format for custom datasets

### TSV file

The recommended data format for custom datasets is a simple TSV file with the following 3 columns:
- `query_text`
- `gold_passage`
- `title`

Alternatively, a pandas `DataFrame` with the same columns may be used.
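
As an illustration of the column layout, here is a minimal sketch that converts DPR-style entries into TSV rows with the three columns listed above. The function name `dpr_to_tsv` and the toy entry are hypothetical, and writing a header row with the column names is an assumption about what the loader expects.

```python
import csv
import io


def dpr_to_tsv(entries, out):
    """Write query_text, gold_passage, title rows for DPR-style entries."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["query_text", "gold_passage", "title"])
    for entry in entries:
        positive = entry["positive_ctxs"][0]  # first positive passage only
        writer.writerow(
            [entry["question"], positive["text"], positive.get("title", "")]
        )


# Hypothetical toy entry; real data would come from json.load(...).
entries = [
    {
        "question": "example question",
        "positive_ctxs": [{"title": "Example title", "text": "Example passage."}],
    }
]

buffer = io.StringIO()  # stand-in for open("train.tsv", "w", newline="")
dpr_to_tsv(entries, buffer)
print(buffer.getvalue())
```

Using the `csv` module rather than manual string joins keeps tabs or quotes inside passages from corrupting the file.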

## Retrieval Model

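The retrieval model is a dual encoder consisting of two BERT encoders: one for queries and one for passages (contexts). Both are initialized from `bert-base-uncased` here.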

In [2]:
model_type = "custom"
model_name = None
context_name = "bert-base-uncased"
question_name = "bert-base-uncased"
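
To make the dual-encoder idea concrete, the sketch below scores queries against passages with a dot product and computes a DPR-style loss with in-batch negatives. This illustrates the general technique only; it is an assumption that `RetrievalModel` follows it exactly, and the 2-d vectors are hypothetical stand-ins for the two encoders' outputs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_loss(query_vecs, passage_vecs):
    """Mean negative log-likelihood of each query's own passage.

    Row i of the score matrix compares query i with every passage in the
    batch; passage i is the positive, the others act as negatives.
    """
    losses, correct = [], 0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        max_s = max(scores)
        # Log-sum-exp with the max subtracted for numerical stability.
        log_z = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
        losses.append(log_z - scores[i])
        correct += int(scores.index(max_s) == i)
    return sum(losses) / len(losses), correct


# Hypothetical 2-d embeddings standing in for the two encoders' outputs.
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[2.0, 0.1], [0.2, 1.5]]
loss, n_correct = in_batch_loss(queries, passages)
print(f"loss={loss:.4f}, correct={n_correct}/{len(queries)}")
```

With in-batch negatives, a larger batch gives each query more negatives to be contrasted against, which is one reason batch size matters for this kind of model.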

In [4]:
model_args = RetrievalArgs()

# Training parameters
model_args.num_train_epochs = 40
model_args.train_batch_size = 40
model_args.learning_rate = 1e-5
model_args.max_seq_length = 256

# Evaluation parameters
model_args.retrieve_n_docs = 100
model_args.eval_batch_size = 100
model_args.evaluate_during_training = True
model_args.evaluate_during_training_verbose = True
# model_args.evaluate_during_training_steps = 200

# Model tracking
model_args.wandb_project = "Dense retrieval with Simple Transformers"
model_args.save_model_every_epoch = False
model_args.save_eval_checkpoints = False
model_args.save_steps = -1
model_args.save_best_model = True
model_args.overwrite_output_dir = True
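
As a quick sanity check on these settings, the number of optimizer steps can be worked out from the dataset size. This sketch assumes a single device and no gradient accumulation; the example count of 58,880 is taken from the training log shown later in this notebook.

```python
import math

num_examples = 58880      # "Generating train split: 0/58880" in the log
train_batch_size = 40     # model_args.train_batch_size
num_train_epochs = 40     # model_args.num_train_epochs

steps_per_epoch = math.ceil(num_examples / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```

The per-epoch figure agrees with the `0/1472` progress bar in the training output.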

In [5]:
model = RetrievalModel(
    model_type=model_type,
    model_name=model_name,
    context_encoder_name=context_name,
    query_encoder_name=question_name,
    args=model_args,
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

## Training the model

To train the model, we simply call `train_model` and pass the path to the training data file, along with the validation data file (required for evaluating during training).

In [6]:
model.train_model(train_data, eval_data=eval_data)



Downloading and preparing dataset retrieval_dataset_loading_script/default (download: Unknown size, generated: 38.37 MiB, post-processed: Unknown size, total: 38.37 MiB) to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-e40ee4a6467d4bb8/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3...


Generating train split:   0%|          | 0/58880 [00:00<?, ? examples/s]

Dataset retrieval_dataset_loading_script downloaded and prepared to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-e40ee4a6467d4bb8/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/59 [00:00<?, ?ba/s]

INFO:simpletransformers.retrieval.retrieval_model: Training started


Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

[34m[1mwandb[0m: Currently logged in as: [33mthilina[0m (use `wandb login --relogin` to force relogin)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[34m[1mwandb[0m: wandb version 0.13.1 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
2022-08-18 12:21:57.985474: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0


Running Epoch 0 of 5:   0%|          | 0/1472 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset


Downloading and preparing dataset retrieval_dataset_loading_script/default (download: Unknown size, generated: 4.24 MiB, post-processed: Unknown size, total: 4.24 MiB) to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3...


Generating train split:   0%|          | 0/6515 [00:00<?, ? examples/s]

Dataset retrieval_dataset_loading_script downloaded and prepared to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset completed.
INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages


  0%|          | 0/51 [00:00<?, ?ba/s]

INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages completed.
INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages


  0%|          | 0/7 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages completed.


Downloading and preparing dataset retrieval_dataset_loading_script/default (download: Unknown size, generated: 4.24 MiB, post-processed: Unknown size, total: 4.24 MiB) to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3...


Generating train split:   0%|          | 0/6515 [00:00<?, ? examples/s]

Dataset retrieval_dataset_loading_script downloaded and prepared to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

Retrieving docs:   0%|          | 0/13 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_model: Initializing WandB run for evaluation.



0,1
Training loss,0.27994
lr,1e-05
global_step,1450.0
_runtime,747.0
_timestamp,1660818863.0
_step,28.0


0,1
Training loss,█▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
lr,▁▂▃▄▅▅▆▇█████████▇▇▇▇▇▇▇▇▇▇▇▇
global_step,▁▁▁▂▂▂▃▃▃▃▃▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇▇██
_runtime,▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
_timestamp,▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
_step,▁▁▁▂▂▂▃▃▃▃▃▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇▇██


2022-08-18 12:38:01.996084: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0


INFO:simpletransformers.retrieval.retrieval_model:{'eval_loss': 16.66825433210893, 'mrr@1': 0.5493476592478895, 'mrr@2': 0.6128933231005372, 'mrr@3': 0.6332054233819392, 'mrr@5': 0.6478255308262983, 'mrr@10': 0.6558959787547661, 'top_1_accuracy': 0.5493476592478895, 'top_2_accuracy': 0.6764389869531849, 'top_3_accuracy': 0.7373752877973906, 'top_5_accuracy': 0.8006139677666922, 'top_10_accuracy': 0.8597083653108212}
INFO:simpletransformers.retrieval.retrieval_model:Saving model into outputs/best_model


Running Epoch 1 of 5:   0%|          | 0/1472 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset


Downloading and preparing dataset retrieval_dataset_loading_script/default (download: Unknown size, generated: 4.24 MiB, post-processed: Unknown size, total: 4.24 MiB) to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3...


Generating train split:   0%|          | 0/6515 [00:00<?, ? examples/s]

Dataset retrieval_dataset_loading_script downloaded and prepared to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset completed.
INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages


  0%|          | 0/51 [00:00<?, ?ba/s]

INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages completed.
INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages


  0%|          | 0/7 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages completed.


Downloading and preparing dataset retrieval_dataset_loading_script/default (download: Unknown size, generated: 4.24 MiB, post-processed: Unknown size, total: 4.24 MiB) to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3...


Generating train split:   0%|          | 0/6515 [00:00<?, ? examples/s]

Dataset retrieval_dataset_loading_script downloaded and prepared to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

Retrieving docs:   0%|          | 0/13 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_model: Initializing WandB run for evaluation.



0,1
eval_loss,16.66825
mrr@1,0.54935
mrr@2,0.61289
mrr@3,0.63321
mrr@5,0.64783
mrr@10,0.6559
top_1_accuracy,0.54935
top_2_accuracy,0.67644
top_3_accuracy,0.73738
top_5_accuracy,0.80061


0,1
eval_loss,▁▁
mrr@1,▁▁
mrr@2,▁▁
mrr@3,▁▁
mrr@5,▁▁
mrr@10,▁▁
top_1_accuracy,▁▁
top_2_accuracy,▁▁
top_3_accuracy,▁▁
top_5_accuracy,▁▁


2022-08-18 12:46:01.875104: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0


INFO:simpletransformers.retrieval.retrieval_model:{'eval_loss': 18.92426655509255, 'mrr@1': 0.5857252494244052, 'mrr@2': 0.6488104374520338, 'mrr@3': 0.6676899462778204, 'mrr@5': 0.6815272448196469, 'mrr@10': 0.6886760101840685, 'top_1_accuracy': 0.5857252494244052, 'top_2_accuracy': 0.7118956254796623, 'top_3_accuracy': 0.7685341519570222, 'top_5_accuracy': 0.8283960092095165, 'top_10_accuracy': 0.8805832693783576}
INFO:tensorboardX.summary:Summary name eval_mrr@1 is illegal; using eval_mrr_1 instead.
INFO:tensorboardX.summary:Summary name eval_mrr@2 is illegal; using eval_mrr_2 instead.
INFO:tensorboardX.summary:Summary name eval_mrr@3 is illegal; using eval_mrr_3 instead.
INFO:tensorboardX.summary:Summary name eval_mrr@5 is illegal; using eval_mrr_5 instead.
INFO:tensorboardX.summary:Summary name eval_mrr@10 is illegal; using eval_mrr_10 instead.
INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset


Downloading and preparing dataset retrieval_dataset_loading_script/default (download: Unknown size, generated: 4.24 MiB, post-processed: Unknown size, total: 4.24 MiB) to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3...


Generating train split:   0%|          | 0/6515 [00:00<?, ? examples/s]

Dataset retrieval_dataset_loading_script downloaded and prepared to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset completed.
INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages


  0%|          | 0/51 [00:00<?, ?ba/s]

INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages completed.
INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages


  0%|          | 0/7 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages completed.


Downloading and preparing dataset retrieval_dataset_loading_script/default (download: Unknown size, generated: 4.24 MiB, post-processed: Unknown size, total: 4.24 MiB) to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3...


Generating train split:   0%|          | 0/6515 [00:00<?, ? examples/s]

Dataset retrieval_dataset_loading_script downloaded and prepared to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

Retrieving docs:   0%|          | 0/13 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_model: Initializing WandB run for evaluation.



0,1
eval_loss,18.92427
mrr@1,0.58573
mrr@2,0.64881
mrr@3,0.66769
mrr@5,0.68153
mrr@10,0.68868
top_1_accuracy,0.58573
top_2_accuracy,0.7119
top_3_accuracy,0.76853
top_5_accuracy,0.8284


0,1
eval_loss,▁▁
mrr@1,▁▁
mrr@2,▁▁
mrr@3,▁▁
mrr@5,▁▁
mrr@10,▁▁
top_1_accuracy,▁▁
top_2_accuracy,▁▁
top_3_accuracy,▁▁
top_5_accuracy,▁▁


2022-08-18 12:57:15.515633: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0


INFO:simpletransformers.retrieval.retrieval_model:{'eval_loss': 20.71651282454982, 'mrr@1': 0.642056792018419, 'mrr@2': 0.7044512663085188, 'mrr@3': 0.7226144794064979, 'mrr@5': 0.7344333589153237, 'mrr@10': 0.7401528219371658, 'top_1_accuracy': 0.642056792018419, 'top_2_accuracy': 0.7668457405986185, 'top_3_accuracy': 0.8213353798925557, 'top_5_accuracy': 0.8726016884113584, 'top_10_accuracy': 0.9146584804297775}


Running Epoch 2 of 5:   0%|          | 0/1472 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset


Downloading and preparing dataset retrieval_dataset_loading_script/default (download: Unknown size, generated: 4.24 MiB, post-processed: Unknown size, total: 4.24 MiB) to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3...


Generating train split:   0%|          | 0/6515 [00:00<?, ? examples/s]

Dataset retrieval_dataset_loading_script downloaded and prepared to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset completed.
INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages


  0%|          | 0/51 [00:00<?, ?ba/s]

INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages completed.
INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages


  0%|          | 0/7 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages completed.


Downloading and preparing dataset retrieval_dataset_loading_script/default (download: Unknown size, generated: 4.24 MiB, post-processed: Unknown size, total: 4.24 MiB) to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3...


Generating train split:   0%|          | 0/6515 [00:00<?, ? examples/s]

Dataset retrieval_dataset_loading_script downloaded and prepared to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

Retrieving docs:   0%|          | 0/13 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_model: Initializing WandB run for evaluation.



0,1
eval_loss,20.71651
mrr@1,0.64206
mrr@2,0.70445
mrr@3,0.72261
mrr@5,0.73443
mrr@10,0.74015
top_1_accuracy,0.64206
top_2_accuracy,0.76685
top_3_accuracy,0.82134
top_5_accuracy,0.8726


0,1
eval_loss,▁▁
mrr@1,▁▁
mrr@2,▁▁
mrr@3,▁▁
mrr@5,▁▁
mrr@10,▁▁
top_1_accuracy,▁▁
top_2_accuracy,▁▁
top_3_accuracy,▁▁
top_5_accuracy,▁▁


2022-08-18 13:09:20.713398: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0


INFO:simpletransformers.retrieval.retrieval_model:{'eval_loss': 23.59813054402669, 'mrr@1': 0.6535686876438987, 'mrr@2': 0.7157329240214889, 'mrr@3': 0.7334868252750063, 'mrr@5': 0.7447608083908929, 'mrr@10': 0.750366797012998, 'top_1_accuracy': 0.6535686876438987, 'top_2_accuracy': 0.7778971603990791, 'top_3_accuracy': 0.8311588641596316, 'top_5_accuracy': 0.8796623177283193, 'top_10_accuracy': 0.9201841903300076}
INFO:tensorboardX.summary:Summary name eval_mrr@1 is illegal; using eval_mrr_1 instead.
INFO:tensorboardX.summary:Summary name eval_mrr@2 is illegal; using eval_mrr_2 instead.
INFO:tensorboardX.summary:Summary name eval_mrr@3 is illegal; using eval_mrr_3 instead.
INFO:tensorboardX.summary:Summary name eval_mrr@5 is illegal; using eval_mrr_5 instead.
INFO:tensorboardX.summary:Summary name eval_mrr@10 is illegal; using eval_mrr_10 instead.
INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset


Downloading and preparing dataset retrieval_dataset_loading_script/default (download: Unknown size, generated: 4.24 MiB, post-processed: Unknown size, total: 4.24 MiB) to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3...


Generating train split:   0%|          | 0/6515 [00:00<?, ? examples/s]

Dataset retrieval_dataset_loading_script downloaded and prepared to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset completed.
INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages


  0%|          | 0/51 [00:00<?, ?ba/s]

INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages completed.
INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages


  0%|          | 0/7 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages completed.


Downloading and preparing dataset retrieval_dataset_loading_script/default (download: Unknown size, generated: 4.24 MiB, post-processed: Unknown size, total: 4.24 MiB) to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3...


Generating train split:   0%|          | 0/6515 [00:00<?, ? examples/s]

Dataset retrieval_dataset_loading_script downloaded and prepared to /deep_learning/.cache/huggingface/datasets/retrieval_dataset_loading_script/default-6677aded6fea238d/0.0.0/f0c8460ab8d4db814fa43c3f14f6e1ad1e59a0b97758c518dafbf7dcd9e00fc3. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

Retrieving docs:   0%|          | 0/13 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_model: Initializing WandB run for evaluation.



0,1
eval_loss,23.59813
mrr@1,0.65357
mrr@2,0.71573
mrr@3,0.73349
mrr@5,0.74476
mrr@10,0.75037
top_1_accuracy,0.65357
top_2_accuracy,0.7779
top_3_accuracy,0.83116
top_5_accuracy,0.87966


0,1
eval_loss,▁▁
mrr@1,▁▁
mrr@2,▁▁
mrr@3,▁▁
mrr@5,▁▁
mrr@10,▁▁
top_1_accuracy,▁▁
top_2_accuracy,▁▁
top_3_accuracy,▁▁
top_5_accuracy,▁▁


2022-08-18 13:16:11.500419: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0


INFO:simpletransformers.retrieval.retrieval_model:{'eval_loss': 24.18571879646995, 'mrr@1': 0.6581734458940905, 'mrr@2': 0.7214121258633922, 'mrr@3': 0.7382962394474291, 'mrr@5': 0.7491864927091327, 'mrr@10': 0.7545236267953075, 'top_1_accuracy': 0.6581734458940905, 'top_2_accuracy': 0.7846508058326938, 'top_3_accuracy': 0.8353031465848043, 'top_5_accuracy': 0.8825786646201075, 'top_10_accuracy': 0.9215656178050652}


Running Epoch 3 of 5:   0%|          | 0/1472 [00:00<?, ?it/s]

KeyboardInterrupt: 
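
The evaluation logs above report `mrr@k` and `top_k_accuracy`. The sketch below shows the usual definitions of these metrics, assuming each query has a single gold passage with a 1-based rank; the helper names and toy ranks are hypothetical, and this is not the library's own implementation. The logged numbers are consistent with these definitions: `mrr@1` equals `top_1_accuracy` in every evaluation, and the logged `mrr@2` can be recovered from the two accuracies.

```python
def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting only hits within the top k."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)


def top_k_accuracy(ranks, k):
    """Fraction of queries whose gold passage appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


# Hypothetical 1-based ranks of the gold passage for five queries;
# None means the gold passage was not retrieved at all.
ranks = [1, 3, 2, None, 1]
print(mrr_at_k(ranks, 1), top_k_accuracy(ranks, 1))
print(mrr_at_k(ranks, 3), top_k_accuracy(ranks, 3))

# Under these definitions mrr@2 = top_1 + (top_2 - top_1) / 2.
# Values below are copied from the first evaluation logged above.
top_1, top_2 = 0.5493476592478895, 0.6764389869531849
logged_mrr_2 = 0.6128933231005372
print(top_1 + (top_2 - top_1) / 2, logged_mrr_2)
```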

## Tracking training progress

In [1]:
import wandb

In [2]:
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mthilina[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [3]:
api = wandb.Api()
run_path = "thilina/Dense retrieval with Simple Transformers/3k77hhta"

run = api.run(run_path)

### Training losses at each logging step

In [4]:
run.history().dropna(subset=["Training loss"])

Unnamed: 0,_step,_runtime,lr,Training loss,global_step,_timestamp,top_1_accuracy,mrr@1,train_loss,top_3_accuracy,...,gradients/graph_1embeddings.word_embeddings.weight,gradients/encoder.layer.6.attention.self.query.bias,gradients/encoder.layer.8.attention.output.dense.bias,gradients/graph_1encoder.layer.7.attention.self.query.bias,gradients/embeddings.word_embeddings.weight,gradients/encoder.layer.6.attention.self.query.weight,gradients/graph_1encoder.layer.11.attention.output.LayerNorm.bias,gradients/graph_1encoder.layer.6.output.LayerNorm.weight,gradients/graph_1encoder.layer.6.output.dense.bias,gradients/encoder.layer.4.attention.self.key.bias
0,6,174,9.906595e-07,9.326607e+00,350.0,1660863692,,,,,...,,,,,,,,,,
1,15,391,2.264365e-06,3.730740e+00,800.0,1660863909,,,,,...,,,,,,,,,,
2,16,415,2.405887e-06,3.454845e+00,850.0,1660863933,,,,,...,,,,,,,,,,
3,17,439,2.547410e-06,2.934197e+00,900.0,1660863957,,,,,...,,,,,,,,,,
4,18,463,2.688933e-06,2.701782e+00,950.0,1660863981,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
509,1299,40719,1.228612e-07,4.877671e-03,58200.0,1660904237,,,,,...,,,,,,,,,,
510,1302,40791,9.575948e-08,7.152554e-08,58350.0,1660904309,,,,,...,,,,,,,,,,
511,1303,40815,8.672557e-08,1.490113e-07,58400.0,1660904333,,,,,...,,,,,,,,,,
512,1305,40863,6.865774e-08,9.651372e-06,58500.0,1660904381,,,,,...,,,,,,,,,,
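
The `lr` column in the table above rises at first and then falls towards zero, which looks like linear warmup followed by linear decay. The sketch below reproduces the logged values under two assumptions that are not set anywhere in this notebook: a warmup ratio of 0.06 and a linear schedule. The total of 58,880 steps assumes 1,472 steps per epoch over 40 epochs.

```python
import math

peak_lr = 1e-5            # model_args.learning_rate
total_steps = 58880       # 1472 steps per epoch * 40 epochs
warmup_ratio = 0.06       # assumption: not set anywhere in this notebook
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule(step):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


# (global_step, logged lr) pairs copied from the table above.
logged_points = [(350, 9.906595e-07), (800, 2.264365e-06), (58500, 6.865774e-08)]
for step, logged in logged_points:
    print(step, f"{linear_schedule(step):.6e}", f"{logged:.6e}")
```

If the two columns printed here stop agreeing for a different run, the warmup ratio or scheduler type was probably changed from the values assumed above.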


### Evaluation during training

In [18]:
from IPython.display import display, HTML
display(HTML("<style>div.output_scroll { height: 100em; }</style>"))

In [20]:
%wandb thilina/Dense%20retrieval%20with%20Simple%20Transformers -h 1024