# Experiment 1

* [Dataset collection](#Dataset-collection)
* [Initialise Dataset, Storage and Model](#Initialise-Dataset,-Storage-and-Model)
* [Build Triples](#Build-Triples)
* [Training](#Training)
* [Evaluating](#Evaluating)

## Dataset collection

In [1]:
from data import collect_negatives # import the defined negative collection function

In [2]:
%%script false --no-raise-error

negative_lookup = collect_negatives()

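`collect_negatives()` also saves the collection to the local `data` folder. The `%%script false --no-raise-error` magic skips this cell on re-runs; the saved copy is loaded in the next step.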

### Load the saved negatives lookup

In [3]:
from utils import load_data

negative_lookup = load_data('data', 'negative_lookup')
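
The implementations of `collect_negatives` and `load_data` are not shown in this notebook. As a rough illustration of the save-then-reload pattern used here, the sketch below pickles a toy lookup and reads it back. The names `save_data_sketch` and `load_data_sketch`, the `.pkl` file layout, and the toy ids are all assumptions for illustration, not the project's actual helpers.

```python
import pickle
import tempfile
from pathlib import Path


def save_data_sketch(obj, folder, name):
    """Pickle `obj` to <folder>/<name>.pkl (hypothetical file layout)."""
    path = Path(folder) / f"{name}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path


def load_data_sketch(folder, name):
    """Load the object previously written by save_data_sketch."""
    with (Path(folder) / f"{name}.pkl").open("rb") as f:
        return pickle.load(f)


# Toy lookup: query id -> candidate negative passage ids (made-up ids).
toy_lookup = {"q1": ["d7", "d9"], "q2": ["d3"]}

# Round trip through a temporary folder so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    save_data_sketch(toy_lookup, tmp, "negative_lookup")
    restored = load_data_sketch(tmp, "negative_lookup")
```

Caching the lookup this way means the expensive collection step only has to run once; later sessions can start from the saved file.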

## Initialise Dataset, Storage and Model

In [5]:
import ir_datasets

DATASET = r'msmarco-passage/train/triples-small' # From https://ir-datasets.com

dataset = ir_datasets.load(DATASET)

In [6]:
from datetime import datetime
import pandas as pd

start_time = datetime.now()
docpairs = pd.DataFrame(dataset.docpairs_iter())
docpairs.to_csv('data/docpairs.csv')
end_time = datetime.now()

print("Loading time: %s" % (end_time - start_time))

Loading time: 0:02:13.401365


In [2]:
docpairs = pd.read_csv('data/docpairs.csv', index_col=0)

### 100k samples

In [6]:
%%script false --no-raise-error

from data import sample_df

new_docpairs = sample_df(docpairs, 100000, 'new_docpairs') # baseline
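
`sample_df` is defined in the project's `data` module and its body is not shown here. As a minimal sketch of what drawing a fixed-size, reproducible sample could look like, the example below samples rows without replacement using a seeded generator. The function `sample_rows`, the seed, and the toy rows are assumptions for illustration only.

```python
import random


def sample_rows(rows, n, seed=0):
    """Draw n rows without replacement; the same seed gives the same sample."""
    if n > len(rows):
        raise ValueError("cannot sample more rows than are available")
    rng = random.Random(seed)
    return rng.sample(rows, n)


# Toy stand-in for docpairs: (query_id, doc_id_a, doc_id_b) tuples.
toy_docpairs = [(f"q{i}", f"pos{i}", f"neg{i}") for i in range(1000)]

subset = sample_rows(toy_docpairs, 100, seed=42)
again = sample_rows(toy_docpairs, 100, seed=42)
```

On a pandas DataFrame, `DataFrame.sample(n=..., random_state=...)` gives the same kind of seeded sample. Fixing the seed matters here because the baseline and the true-negative variant are meant to be built from the same 100k rows.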

### Build the true negative 100k samples

In [7]:
%%script false --no-raise-error

from data import true_negatives

truenegative_docpairs = true_negatives(new_docpairs, negative_lookup)
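
The body of `true_negatives` is not shown in this notebook. One plausible reading, sketched below under stated assumptions, is that the negative passage in each sampled pair is replaced with one drawn from `negative_lookup`. The sketch assumes the lookup maps a query id to a list of candidate negative passage ids and that pairs without candidates keep their original negative; `swap_negatives` and the toy ids are hypothetical.

```python
import random


def swap_negatives(docpairs, lookup, seed=0):
    """Replace doc_id_b with a negative from `lookup` where one exists.

    docpairs: iterable of (query_id, doc_id_a, doc_id_b) tuples.
    lookup:   dict mapping query_id -> list of candidate negative ids
              (assumed structure, not confirmed by the source).
    """
    rng = random.Random(seed)
    swapped = []
    for query_id, doc_id_a, doc_id_b in docpairs:
        # Never pick the positive passage as its own negative.
        candidates = [d for d in lookup.get(query_id, []) if d != doc_id_a]
        new_negative = rng.choice(candidates) if candidates else doc_id_b
        swapped.append((query_id, doc_id_a, new_negative))
    return swapped


pairs = [("q1", "p1", "n1"), ("q2", "p2", "n2")]
toy_lookup = {"q1": ["x1", "x2"]}
swapped = swap_negatives(pairs, toy_lookup)
```

Only the negative column changes, so the two training sets stay aligned on queries and positives and differ only in the choice of negatives.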

### Load the saved 100k samples

In [8]:
new_docpairs = load_data('data', 'new_docpairs')
truenegative_docpairs = load_data('data', 'truenegative_docpairs')

## Build Triples

In [9]:
from data import dataset_from_idx

baseline_triples = dataset_from_idx(dataset, new_docpairs, 'baseline_triples')
new_triples = dataset_from_idx(dataset, truenegative_docpairs, 'new_triples')
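
`dataset_from_idx` is another project helper whose body is not shown. Conceptually, building triples means resolving the ids in each pair to query and passage text. The sketch below does this with plain dictionaries standing in for the dataset; `build_triples`, the column names, and the toy texts are assumptions. With `ir_datasets`, the texts could come from `dataset.queries_iter()` and `dataset.docs_store().get(doc_id).text`.

```python
def build_triples(docpairs, queries, passages):
    """Resolve (query_id, doc_id_a, doc_id_b) ids to text triples.

    queries:  dict mapping query_id -> query text.
    passages: dict mapping doc_id -> passage text.
    """
    triples = []
    for query_id, doc_id_a, doc_id_b in docpairs:
        triples.append(
            {
                "query": queries[query_id],
                "positive": passages[doc_id_a],
                "negative": passages[doc_id_b],
            }
        )
    return triples


# Toy stand-ins for the dataset's query and passage stores.
toy_queries = {"q1": "what is a passage ranker"}
toy_passages = {
    "p1": "A passage ranker orders passages by relevance to a query.",
    "n1": "Bread is baked from flour, water and yeast.",
}

triples = build_triples([("q1", "p1", "n1")], toy_queries, toy_passages)
```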

In [3]:
baseline_triples = pd.read_csv('data/baseline_triples.csv', index_col=0)
new_triples = pd.read_csv('data/new_triples.csv', index_col=0)

## Training

In [1]:
%run -i 'train.py' --dataset_name 'docpairs' --train_name 'baseline_triples' --out_dir 'model_base' --batch_size 256

{0, 1}


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


  0%|          | 0/1000000.0 [00:00<?, ?it/s]

In [2]:
%run -i 'train.py' --dataset_name 'docpairs' --train_name 'new_triples' --out_dir 'model_new' --batch_size 256

{0, 1}




  0%|          | 0/1000000.0 [00:00<?, ?it/s]

## Evaluating

In [1]:
%run -i 'evaluate.py' --output_name '20230717' --batch_size 256

PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


03:42:40.901 [main] WARN org.terrier.structures.BaseCompressingMetaIndex - Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 1.9 GiB of memory would be required.


  indexref = iter_indexer.index(msmarco_generate(), meta={'docno' : 20, 'text': 4096})


03:53:15.007 [ForkJoinPool-1-worker-1] WARN org.terrier.structures.indexing.Indexer - Indexed 5 empty documents
03:53:15.165 [main] WARN org.terrier.structures.BaseCompressingMetaIndex - Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 1.9 GiB of memory would be required.


monoT5: 100%|████████████████████████████| 755/755 [11:51<00:00,  1.06batches/s]
monoT5: 100%|████████████████████████████| 755/755 [11:53<00:00,  1.06batches/s]
