<a href="https://colab.research.google.com/github/FilipeFariaDias/information-retrival-using-pyterrier/blob/main/Information_Retrieval_with_PyTerrier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction

[Terrier](http://terrier.org/) is an open-source Information Retrieval tool developed at the University of Glasgow. Beyond basic indexing and querying, Terrier implements several state-of-the-art techniques for improving retrieval performance. In this course we use PyTerrier, a Python API for Terrier.
Documentation: https://pyterrier.readthedocs.io/_/downloads/en/latest/pdf/


## Installing PyTerrier


In [None]:
pip --version

pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)


In [None]:
!pip install pyserini
!pip install python-terrier
!pip install --upgrade git+https://github.com/Georgetown-IR-Lab/OpenNIR
# !pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

Collecting pyserini
  Downloading pyserini-0.21.0-py3-none-any.whl (154.1 MB)
     154.1/154.1 MB 6.4 MB/s eta 0:00:00
Collecting pyjnius>=1.4.0 (from pyserini)
  Downloading pyjnius-1.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
     1.4/1.4 MB 35.7 MB/s eta 0:00:00
Collecting transformers>=4.6.0 (from pyserini)
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
     7.4/7.4 MB 79.0 MB/s eta 0:00:00
Collecting sentencepiece>=0.1.95 (from pyserini)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     1.3/1.3 MB 53.3 MB/s eta 0:00:00
Collecting nmslib>=2.1.1 (from pyse

## Init

`pt.init()` must be called before any PyTerrier function can be used.

Optional arguments:
 - `version` - Terrier version, e.g. "5.2"
 - `mem` - megabytes of memory allocated to Java, e.g. "4096"
 - `packages` - external Java packages for Terrier to load, e.g. ["org.terrier:terrier.prf"]
 - `logging` - logging level for Terrier. Defaults to "WARN"; use "INFO" or "DEBUG" for more output.


In [None]:
import pyterrier as pt
import pandas as pd
import pyserini
import json
import os
import hashlib
import shutil
if not pt.started():
  pt.init()
import onir_pt
from pprint import pprint
from sklearn.model_selection import train_test_split

terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


#Getting the datasets

In [None]:
!pip install wget



In [None]:
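from google.colab import drive
drive.mount('/content/drive')
# Start from an empty output directory for the TREC-formatted documents.
# ignore_errors avoids a failure on the first run, when the directory does not exist yet.
shutil.rmtree('/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/trec_docs', ignore_errors=True)
os.makedirs('/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/trec_docs', exist_ok=True)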

Mounted at /content/drive


In [None]:
DATASET_PATH = "/content/drive/My Drive/data/training_data/training_data/"

TOPICS_PATH = "/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/topics_qrels/topics.txt"
TEST_TOPICS_PATH = "/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/topics_qrels/topics_test.txt"
TRAIN_TOPICS_PATH = "/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/topics_qrels/topics_train.txt"

QRELS_PATH = "/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/topics_qrels/qrels.txt"
TRAIN_QRELS_PATH = "/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/topics_qrels/qrels_train.txt"
TEST_QRELS_PATH = "/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/topics_qrels/qrels_test.txt"

DOCS_PATH = os.path.join(DATASET_PATH, "CT json")
files = pt.io.find_files(DOCS_PATH)
DEV_PATH = os.path.join(DATASET_PATH, "dev.json")
TRAIN_PATH = os.path.join(DATASET_PATH, "train.json")
TREC_DOCS_PATH = '/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/trec_docs'

gh_files = pt.io.find_files("/content/drive/My Drive/data/GH95/docs")

In [None]:
with open(TRAIN_PATH) as f:
  train_files = json.load(f)

with open(DEV_PATH) as f:
  dev_files = json.load(f)

In [None]:
docs_data = []
for file_ in files:
  # Load only the clinical trial JSON files, closing each file after reading
  if file_.endswith('.json'):
    with open(file_, 'r') as f:
      docs_data.append(json.load(f))

In [None]:
columns = ["docno", "Type", "Section_id", "Primary_id", "Secondary_id", "Statement", "Label", "Primary_evidence_index", "Secondary_evidence_index"]

topics = pt.io.read_topics(TOPICS_PATH)
# train_topics = pt.io.read_topics(TRAIN_TOPICS_PATH)
test_topics = pt.io.read_topics(TEST_TOPICS_PATH)

qrels = pt.io.read_qrels(QRELS_PATH)
train_qrels = pt.io.read_qrels(TRAIN_QRELS_PATH)
test_qrels = pt.io.read_qrels(TEST_QRELS_PATH)

dev_df = pd.DataFrame.from_dict(dev_files, orient='index').reset_index()
train_df = pd.DataFrame.from_dict(train_files, orient='index').reset_index()
# train_df.columns = columns
docs_df = pd.DataFrame(docs_data)

In [None]:
train, test = train_test_split(docs_df, test_size=0.2)

In [None]:
dev_df

Unnamed: 0,index,Type,Section_id,Primary_id,Statement,Label,Primary_evidence_index,Secondary_id,Secondary_evidence_index
0,1adc970c-d433-44d0-aa09-d3834986f7a2,Single,Results,NCT00066573,there is a 13.2% difference between the result...,Contradiction,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",,
1,6b9162d0-0816-46d4-81af-c60028dcc63b,Comparison,Eligibility,NCT00425854,Patients with significantly elevated ejection ...,Contradiction,[15],NCT01224678,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,..."
2,0b6cc8e3-69ee-4a91-b93d-2ad3fddce65f,Comparison,Adverse Events,NCT02273973,a significant number of the participants in th...,Contradiction,"[5, 18]",NCT00281697,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,..."
3,cc1f712a-2116-4e40-9810-f315e3fa5ff8,Single,Results,NCT00593346,the primary trial does not report the PFS or o...,Entailment,"[0, 1, 2, 3]",,
4,904061c0-14fa-4f13-9118-9a41e24fa8eb,Single,Eligibility,NCT02340221,Prior treatment with fulvestrant or with a pho...,Contradiction,[13],,
...,...,...,...,...,...,...,...,...,...
195,d310ec4e-993e-4827-8dc5-9aca053972db,Comparison,Intervention,NCT00688909,The the primary trial intervention involves on...,Contradiction,"[0, 1, 2]",NCT00450723,"[0, 1, 2]"
196,a5617ae4-05a3-42d0-9e14-141de5f8c010,Comparison,Adverse Events,NCT00258960,the secondary trial reported 1 single case of ...,Entailment,"[0, 1, 2, 3, 4, 5, 6, 7, 8]",NCT00121992,"[0, 9, 14, 23]"
197,42d1fcd3-8faa-4065-bbba-42cc90ab67fb,Comparison,Results,NCT00856492,the secondary trial and the primary trial do n...,Entailment,"[0, 1, 2, 3]",NCT00009945,"[0, 1, 2, 3]"
198,d01fda83-5dc8-4ad5-92b8-7553dabd7046,Single,Results,NCT00428922,the outcome measurement of the primary trial i...,Entailment,"[0, 1]",,


In [None]:
train_df

Unnamed: 0,index,Type,Section_id,Primary_id,Secondary_id,Statement,Label,Primary_evidence_index,Secondary_evidence_index
0,5bc844fc-e852-4270-bfaf-36ea9eface3d,Comparison,Intervention,NCT01928186,NCT00684983,All the primary trial participants do not rece...,Contradiction,"[0, 1, 2, 3, 4, 5]","[0, 1, 2, 3, 4, 5]"
1,86b7cb3d-6186-4a04-9aa6-b174ab764eed,Single,Eligibility,NCT00662129,,"Patients with Platelet count over 100,000/mm¬¨...",Contradiction,"[18, 22, 23, 24]",
2,dbed5471-c2fc-45b5-b26f-430c9fa37a37,Comparison,Adverse Events,NCT00093145,NCT00703326,Heart-related adverse events were recorded in ...,Entailment,"[0, 3]","[0, 7, 8, 9, 10]"
3,20c35c89-8d23-4be3-b603-ac0ee0f3b4de,Single,Eligibility,NCT01097642,,Adult Patients with histologic confirmation of...,Contradiction,"[0, 1, 3, 4, 5]",
4,f17cb242-419d-4f5d-bfa4-41494ed5ac0e,Comparison,Intervention,NCT00852930,NCT02308020,Laser Therapy is in each cohort of the primary...,Contradiction,"[0, 1, 2, 3, 4, 5, 6, 7]","[0, 1, 2, 3, 4, 5, 6]"
...,...,...,...,...,...,...,...,...,...
1695,f37774f4-db96-4aa6-b3a1-626953faeecf,Comparison,Eligibility,NCT00213980,NCT02536339,"Adequate blood, kidney, and hepatic function a...",Entailment,"[0, 1, 2, 3, 4, 5]","[0, 7]"
1696,4fef4cdf-53bf-4239-9d31-4710fd3edc6f,Single,Results,NCT01605396,,The Ridaforolimus + Dalotuzumab + Exemestane g...,Contradiction,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]",
1697,331affb2-f8e9-4a55-ac4c-62d2ecc4f80b,Single,Intervention,NCT00181363,,The only difference between the interventions ...,Entailment,"[0, 1, 2, 3, 4, 5]",
1698,a577e819-c928-4217-8743-f4809e852919,Single,Eligibility,NCT00834678,,Patients must have a white blood cell count ab...,Entailment,[18],


In [None]:
docs_df

Unnamed: 0,Clinical Trial ID,Intervention,Eligibility,Results,Adverse Events
0,NCT00001832,"[INTERVENTION 1: , Abl Cells IV + Cyclophosp...","[INCLUSION CRITERIA, Patients must have eval...","[Outcome Measurement: , Clinical Response, ...","[Adverse Events 1:, Total: 0/3 (0.00%), Ly..."
1,NCT00003199,"[INTERVENTION 1: , TX/Maintenance Therapy fo...","[Inclusion Criteria:, Patients with inflamma...","[Outcome Measurement: , Event-free Survival,...","[Adverse Events 1:, Total: 2/50 (4.00%), P..."
2,NCT00003404,"[INTERVENTION 1: , Adjuvant Radiotherapy, ...","[DISEASE CHARACTERISTICS:, Histologically pr...","[Outcome Measurement: , Local Recurrence Rat...","[Adverse Events 1:, Total: 0/46 (0.00%)]"
3,NCT00003782,"[INTERVENTION 1: , Arm 1: Doxorubicin + Cycl...","[DISEASE CHARACTERISTICS:, Histologically co...","[Outcome Measurement: , Overall Survival, ...","[Adverse Events 1:, Total: 66/1748 (3.78%), ..."
4,NCT00003830,"[INTERVENTION 1: , Arm I:Sentinel Node Resec...","[DISEASE CHARACTERISTICS:, Resectable invasi...","[Outcome Measurement: , Morbidity - Number o...","[Adverse Events 1:, Total: 9/2788 (0.32%), ..."
...,...,...,...,...,...
994,NCT03719677,"[INTERVENTION 1: , Habit Development Interve...","[Inclusion Criteria:, English speaking, Di...","[Outcome Measurement: , Self Reported Behavi...","[Adverse Events 1:, Total: 0/7 (0.00%)]"
995,NCT03765996,"[INTERVENTION 1: , Decongestive Physiotherap...","[Inclusion Criteria:, Patients who had unila...","[Outcome Measurement: , Change of the Limb V...","[Adverse Events 1:, Total: 0/18 (0.00%), Adv..."
996,NCT04030104,"[INTERVENTION 1: , IUS Alone, IUS alone imag...","[Inclusion Criteria:, One analyzable mass pe...","[Outcome Measurement: , Gain in Specificity ...","[Adverse Events 1:, Total: 2/480 (0.42%), ..."
997,NCT04080297,"[INTERVENTION 1: , 100 mg Q-122, Dosage wa...","[Inclusion Criteria:, Be a female of any rac...","[Outcome Measurement: , Adverse Event (AE) R...","[Adverse Events 1:, Total: 0/10 (0.00%), B..."



# PyTerrier Indexing

More examples are available in the [PyTerrier repository](https://github.com/terrier-org/pyterrier).



## Creating files in TREC format

We used the data from SemEval-2023 Task 7 to create the topics, qrels, and TREC files.

The TREC files were created at different granularities to find out which indexing level yields the best retrieval results.

At document level, each TREC file is a whole clinical trial. At section level, each file is one section of a clinical trial. At passage level, each file is a single piece of evidence (one line of a section).

The problem is that we cannot know for sure which piece of evidence best answers a given hypothesis: the task provides no topics, qrels, or per-evidence scores from which to derive the right answer. The data is therefore not well suited to a passage retrieval task.
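
To make the three granularities concrete, the sketch below splits one toy trial into retrieval units at each level. The function name `retrieval_units`, the toy trial, and the readable identifiers are assumptions made for illustration; the notebook's commented-out cells build similar identifiers and additionally hash them with SHA-1.

```python
def retrieval_units(trial_id, trial, level):
    """Return (doc_id, text) pairs for one trial at the requested indexing level."""
    # Keep only the section lists; metadata such as the trial ID is a plain string.
    sections = {name: lines for name, lines in trial.items() if isinstance(lines, list)}
    if level == "document":
        # One unit per trial: every line of every section joined together
        text = " ".join(line.strip() for lines in sections.values() for line in lines)
        return [(trial_id, text)]
    if level == "section":
        # One unit per section
        return [(f"{trial_id}-{name}", " ".join(line.strip() for line in lines))
                for name, lines in sections.items()]
    if level == "passage":
        # One unit per line of each section
        return [(f"{trial_id}-{name}-{j}", line.strip())
                for name, lines in sections.items()
                for j, line in enumerate(lines)]
    raise ValueError(f"unknown level: {level}")


# Hypothetical trial with two sections (not taken from the dataset)
toy_trial = {
    "Clinical Trial ID": "NCT00000000",
    "Intervention": ["INTERVENTION 1:", "  Drug A, 10 mg daily"],
    "Eligibility": ["Inclusion Criteria:", "  Age 18 or older", "  ECOG status 0-1"],
}

for level in ("document", "section", "passage"):
    units = retrieval_units("NCT00000000", toy_trial, level)
    print(level, len(units))
```

The finer the level, the more units the index holds and the shorter each one is, which is why relevance judgments at passage level would need per-line labels.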

In [None]:
import json

with open(DEV_PATH) as json_file:
    dev = json.load(json_file)

with open(TRAIN_PATH) as json_file:
    train_data = json.load(json_file)

# Example instance
# keys = list(train_data.keys())
# print(dev[keys[0]])
# print(len(keys))

In [None]:
def get_evidence_list(rct_path, section_id=None):
  # Load one clinical trial; return a single section if section_id is given
  with open(rct_path, 'r') as evidence_file:
    evidence_data = json.load(evidence_file)
  if section_id is not None:
    return evidence_data[section_id]
  return evidence_data

In [None]:
topic_template = """
<top>
  <num>%d</num>
  <title>%s</title>
  <desc>%s</desc>
  <narr>%s</narr>
</top>\n
"""

doc_template = """<DOC>
  <DOCNO>%s</DOCNO>
  <DOCID>%s</DOCID>
  <TEXT>%s</TEXT>
</DOC>
"""

docid_template = "%s-%s-%d"
id_template = "%s-%s"

qrels_template = "%d 0 %s %d\n"

In [None]:
# num = 0
# train_count = 0
# test_count = 0
# for k in keys:
#   # print(k)
#   num += 1
#   rct = train_data[k]['Primary_id']
#   section = train_data[k]['Section_id']
#   evidence_list = get_evidence_list(os.path.join(DOCS_PATH, rct + '.json'), section)
#   hypothesis = train_data[k]['Statement']
#   train_count += 1 if rct in train.values else 0
#   test_count += 1 if rct in test.values else 0
#   # for j, evidence in enumerate(evidence_list):
#   #   pprint(train_data[k].keys())
#   # break
# print(f'train: {train_count}')
# print(f'test: {test_count}')
# print(f'num: {num}')

In [None]:
def get_section_data(rct_path, section):
  # Return the content of one section of a clinical trial
  with open(rct_path, 'r') as evidence_file:
    evidence_data = json.load(evidence_file)
  return evidence_data[section]

In [None]:
def get_sections(rct_path):
  # Return the top-level keys (section names) of a clinical trial file
  with open(rct_path, 'r') as evidence_file:
    evidence_data = json.load(evidence_file)
  return list(evidence_data.keys())


In [None]:
keys = list(train_data.keys())

In [None]:
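# Trial IDs: the JSON file names without directory and extension
all_files = [os.path.splitext(os.path.basename(file_))[0] for file_ in files if file_.endswith('.json')]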

**Document files**

Creating TREC files using all clinical trials
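
The cell that follows stores the `json.dumps` output as the document text, so JSON keys and punctuation also reach the indexer. One possible alternative, sketched here under stated assumptions, flattens the section lists into plain text first. The helper name `trial_to_trec`, the simplified template, and the choice to blank out angle brackets are hypothetical; whether this improves retrieval on this collection has not been measured.

```python
DOC_TEMPLATE = "<DOC>\n  <DOCNO>%s</DOCNO>\n  <TEXT>%s</TEXT>\n</DOC>\n"


def trial_to_trec(trial):
    """Render one trial dict as a TREC document with plain-text content."""
    docno = trial["Clinical Trial ID"]
    lines = []
    for name, value in trial.items():
        # Section values are lists of lines; the trial ID is a plain string
        if isinstance(value, list):
            lines.append(name)
            lines.extend(item.strip() for item in value)
    text = "\n".join(lines)
    # Blank out angle brackets so values such as "<5%" are not read as markup
    text = text.replace("<", " ").replace(">", " ")
    return DOC_TEMPLATE % (docno, text)


# Hypothetical trial (not taken from the dataset)
toy_trial = {
    "Clinical Trial ID": "NCT00000000",
    "Eligibility": ["Inclusion Criteria:", "  Tumor size <5 cm"],
    "Results": ["Outcome Measurement:", "  Response rate >30%"],
}

print(trial_to_trec(toy_trial))
```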

In [None]:
for file_ in files:
  if file_.endswith('.json'):
    # Serialize the whole trial as one JSON string and use it as the document text
    with open(file_, 'r') as f:
      content = json.dumps(json.load(f))
    filename_ = os.path.splitext(os.path.basename(file_))[0]
    doc_tmp = doc_template % (filename_, filename_, content)
    with open(os.path.join(TREC_DOCS_PATH, filename_ + '.sgml'), 'w') as f:
      f.write(doc_tmp)

In [None]:
struct = list()
dict_doc = {}
trec_doc = ""
content = ""
qrels_content = ""
train_content = ""
test_content = ""
train_qrels_content = ""
test_qrels_content = ""

num = 0
for k in keys:
  num += 1
  entry = train_data[k]
  rct = entry['Primary_id']
  hypothesis = entry['Statement']
  topic = topic_template % (num, hypothesis, hypothesis, hypothesis)
  content += topic
  # Every key comes from train.json, so each topic also goes into the training split
  train_content += topic

  # Relevant trials: the primary trial, plus the secondary one for comparison statements
  relevant = {rct}
  if entry['Type'] != 'Single':
    relevant.add(entry['Secondary_id'])

  # One judgment per (topic, trial) pair: 1 for the trials the statement refers to, 0 otherwise
  for file_ in all_files:
    line = qrels_template % (num, file_, int(file_ in relevant))
    qrels_content += line
    train_qrels_content += line

In [None]:
keys = list(dev.keys())

In [None]:
# struct = list()
# content = ""
# qrels_content = ""
# train_content = ""
test_content = ""
# train_qrels_content = ""
test_qrels_content = ""
# Topic numbers restart at 1 here, so in the combined topics/qrels contents
# the dev topics reuse numbers already assigned to training topics
num = 0
for k in keys:
  num += 1
  entry = dev[k]
  rct = entry['Primary_id']
  hypothesis = entry['Statement']
  topic = topic_template % (num, hypothesis, hypothesis, hypothesis)
  content += topic
  # Every key comes from dev.json, so each topic also goes into the test split
  test_content += topic

  # Relevant trials: the primary trial, plus the secondary one for comparison statements
  relevant = {rct}
  if entry['Type'] != 'Single':
    relevant.add(entry['Secondary_id'])

  # One judgment per (topic, trial) pair: 1 for the trials the statement refers to, 0 otherwise
  for file_ in all_files:
    line = qrels_template % (num, file_, int(file_ in relevant))
    qrels_content += line
    test_qrels_content += line

**Section files**

Creating TREC files using the sections of the clinical trials

In [None]:
# struct = list()
# content = ""
# qrels_content = ""
# train_content = ""
# test_content = ""
# train_qrels_content = ""
# test_qrels_content = ""
# num = 0
# for k in keys:
#   num += 1
#   rct = train_data[k]['Primary_id']
#   section_id = train_data[k]['Section_id']
#   section_list = get_sections(os.path.join(DOCS_PATH, rct + '.json'))
#   hypothesis = train_data[k]['Statement']
#   content += (topic_template % (num, hypothesis, hypothesis, hypothesis))
#   train_content += (topic_template % (num, hypothesis, hypothesis, hypothesis)) if rct in train.values else ""
#   test_content += (topic_template % (num, hypothesis, hypothesis, hypothesis)) if rct in test.values else ""
#   for section in section_list:
#     docid_tmp = hashlib.sha1((id_template % (rct, section)).encode('utf-8')).hexdigest()
#     qrels_content += (qrels_template % (num, docid_tmp, int(section_id in section)))
#     train_qrels_content += (qrels_template % (num, docid_tmp, int(section_id in section))) if rct in train.values else ""
#     test_qrels_content += (qrels_template % (num, docid_tmp, int(section_id in section))) if rct in test.values else ""

#     if docid_tmp not in struct:
#       doc_tmp = doc_template % (docid_tmp, docid_tmp, get_section_data(os.path.join(DOCS_PATH, rct + '.json'), section))
#       with open(os.path.join(TREC_DOCS_PATH, docid_tmp + '.sgml'), 'w') as f:
#         f.write(doc_tmp)
#       struct.append(docid_tmp)

#   if train_data[k]['Type'] == 'Comparison':
#     rct = train_data[k]['Secondary_id']
#     section_list = get_sections(os.path.join(DOCS_PATH, rct + '.json'))

#     for section in section_list:
#       docid_tmp = hashlib.sha1((id_template % (rct, section)).encode('utf-8')).hexdigest()
#       qrels_content += (qrels_template % (num, docid_tmp, int(section_id in section)))
#       train_qrels_content += (qrels_template % (num, docid_tmp, int(section_id in section))) if rct in train.values else ""
#       test_qrels_content += (qrels_template % (num, docid_tmp, int(section_id in section))) if rct in test.values else ""

#       if docid_tmp not in struct:
#         doc_tmp = doc_template % (docid_tmp, docid_tmp, get_section_data(os.path.join(DOCS_PATH, rct + '.json'), section))
#         with open(os.path.join(TREC_DOCS_PATH, docid_tmp + '.sgml'), 'w') as f:
#           f.write(doc_tmp)
#         struct.append(docid_tmp)

**Passage files**

Creating TREC files using the individual evidence lines of the clinical trials.
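
As a minimal sketch of how passage-level judgments follow from the evidence indices in the statement files, the function below labels every line of a section as relevant or not. It mirrors the identifier scheme of the commented-out cells (SHA-1 of `trial-section-index`); the function name `passage_qrels` and the toy values are hypothetical.

```python
import hashlib


def passage_qrels(qid, trial_id, section, evidence, relevant_idx):
    """Build one qrels line per evidence line of a section."""
    relevant = set(relevant_idx)
    lines = []
    for j, _ in enumerate(evidence):
        # Passage identifier: SHA-1 of "<trial>-<section>-<line index>"
        docno = hashlib.sha1(f"{trial_id}-{section}-{j}".encode("utf-8")).hexdigest()
        lines.append(f"{qid} 0 {docno} {int(j in relevant)}")
    return lines


# Hypothetical section with four lines, of which lines 1 and 3 are marked as evidence
toy_evidence = ["Inclusion Criteria:", "  Age 18 or older", "Exclusion Criteria:", "  Prior chemotherapy"]
for line in passage_qrels(1, "NCT00000000", "Eligibility", toy_evidence, [1, 3]):
    print(line)
```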

In [None]:
# struct = list()
# dict_doc = {}
# trec_doc = ""
# content = ""
# qrels_content = ""
# train_content = ""
# test_content = ""
# train_qrels_content = ""
# test_qrels_content = ""
# num = 0
# for k in keys:
#   num += 1
#   rct = train_data[k]['Primary_id']
#   section = train_data[k]['Section_id']
#   evidence_list = get_evidence_list(os.path.join(DOCS_PATH, rct + '.json'), section)
#   hypothesis = train_data[k]['Statement']

#   content += (topic_template % (num, hypothesis, hypothesis, hypothesis))
#   train_content += (topic_template % (num, hypothesis, hypothesis, hypothesis)) if rct in train_df.values else ""
#   # test_content += (topic_template % (num, hypothesis, hypothesis, hypothesis)) if rct in test.values else ""
#   for j, evidence in enumerate(evidence_list):
#     docid_tmp = hashlib.sha1((docid_template % (rct, section, j)).encode('utf-8')).hexdigest()
#     train_qrels_content += (qrels_template % (num, docid_tmp, int(j in train_data[k]['Primary_evidence_index']))) if rct in train_df.values else ""
#     # test_qrels_content += (qrels_template % (num, docid_tmp, int(j in train_data[k]['Primary_evidence_index']))) if rct in test.values else ""
#     qrels_content += (qrels_template % (num, docid_tmp, int(j in train_data[k]['Primary_evidence_index'])))
#     # if j in train_data[k]['Primary_evidence_index']:
#     #   qrels_content += (qrels_template % (num, docid_tmp, 1))   #criar qrels somente com valores uteis
#     #   if rct in train.values:
#     #     train_qrels_content += (qrels_template % (num, docid_tmp, 1))
#     #   if rct in test.values:
#     #     test_qrels_content += (qrels_template % (num, docid_tmp, 1))

#     if docid_tmp not in struct:
#       dict_doc[docid_tmp] ={}
#       dict_doc[docid_tmp]['docid']= docid_tmp
#       dict_doc[docid_tmp]['text']= evidence.strip()
#       trec_doc += doc_template % (docid_tmp, docid_tmp, evidence.strip())
#       doc_tmp = doc_template % (docid_tmp, docid_tmp, evidence.strip())
#       with open(os.path.join(TREC_DOCS_PATH, docid_tmp + '.sgml'), 'w') as f:
#         f.write(doc_tmp)
#       struct.append(docid_tmp)

#   if train_data[k]['Type'] == 'Comparison':
#     rct = train_data[k]['Secondary_id']
#     evidence_list = get_evidence_list(os.path.join(DOCS_PATH, rct + '.json'), section)
#     for j, evidence in enumerate(evidence_list):
#       docid_tmp = hashlib.sha1((docid_template % (rct, section, j)).encode('utf-8')).hexdigest()
#       train_qrels_content += (qrels_template % (num, docid_tmp, int(j in train_data[k]['Secondary_evidence_index']))) if rct in train_df.values else ""
#       # test_qrels_content += (qrels_template % (num, docid_tmp, int(j in train_data[k]['Secondary_evidence_index']))) if rct in test.values else ""
#       qrels_content += (qrels_template % (num, docid_tmp, int(j in train_data[k]['Secondary_evidence_index'])))
#       # if j in train_data[k]['Secondary_evidence_index']:
#       #   qrels_content += (qrels_template % (num, docid_tmp, 1))
#       #   if rct in train.values:
#       #     train_qrels_content += (qrels_template % (num, docid_tmp, 1))
#       #   if rct in test.values:
#       #     test_qrels_content += (qrels_template % (num, docid_tmp, 1))

#       if docid_tmp not in struct:
#         dict_doc[docid_tmp] ={}
#         dict_doc[docid_tmp]['docid']= docid_tmp
#         dict_doc[docid_tmp]['text']= evidence.strip()
#         trec_doc += doc_template % (docid_tmp, docid_tmp, evidence.strip())
#         doc_tmp = doc_template % (docid_tmp, docid_tmp, evidence.strip())
#         with open(os.path.join(TREC_DOCS_PATH, docid_tmp + '.sgml'), 'w') as f:
#           f.write(doc_tmp)
#         struct.append(docid_tmp)


In [None]:
# keys = list(dev.keys())

In [None]:
# struct = list()
# # content = ""
# # qrels_content = ""
# # train_content = ""
# test_content = ""
# # train_qrels_content = ""
# test_qrels_content = ""
# num = 0
# for k in keys:
#   num += 1
#   rct = dev[k]['Primary_id']
#   section = dev[k]['Section_id']
#   evidence_list = get_evidence_list(os.path.join(DOCS_PATH, rct + '.json'), section)
#   hypothesis = dev[k]['Statement']

#   content += (topic_template % (num, hypothesis, hypothesis, hypothesis))
#   # train_content += (topic_template % (num, hypothesis, hypothesis, hypothesis)) if rct in train.values else ""
#   test_content += (topic_template % (num, hypothesis, hypothesis, hypothesis)) if rct in dev_df.values else ""
#   for j, evidence in enumerate(evidence_list):
#     docid_tmp = hashlib.sha1((docid_template % (rct, section, j)).encode('utf-8')).hexdigest()
#     # train_qrels_content += (qrels_template % (num, docid_tmp, int(j in dev[k]['Primary_evidence_index']))) if rct in train.values else ""
#     test_qrels_content += (qrels_template % (num, docid_tmp, int(j in dev[k]['Primary_evidence_index']))) if rct in dev_df.values else ""
#     qrels_content += (qrels_template % (num, docid_tmp, int(j in dev[k]['Primary_evidence_index'])))
#     # if j in dev[k]['Primary_evidence_index']:
#     #   qrels_content += (qrels_template % (num, docid_tmp, 1))
#     #   if rct in train.values:
#     #     train_qrels_content += (qrels_template % (num, docid_tmp, 1))
#     #   if rct in test.values:
#     #     test_qrels_content += (qrels_template % (num, docid_tmp, 1))

#     if docid_tmp not in struct:
#       dict_doc[docid_tmp] ={}
#       dict_doc[docid_tmp]['docid']= docid_tmp
#       dict_doc[docid_tmp]['text']= evidence.strip()
#       trec_doc += doc_template % (docid_tmp, docid_tmp, evidence.strip())
#       doc_tmp = doc_template % (docid_tmp, docid_tmp, evidence.strip())
#       with open(os.path.join(TREC_DOCS_PATH, docid_tmp + '.sgml'), 'w') as f:
#         f.write(doc_tmp)
#       struct.append(docid_tmp)

#   if dev[k]['Type'] == 'Comparison':
#     rct = dev[k]['Secondary_id']
#     evidence_list = get_evidence_list(os.path.join(DOCS_PATH, rct + '.json'), section)
#     for j, evidence in enumerate(evidence_list):
#       docid_tmp = hashlib.sha1((docid_template % (rct, section, j)).encode('utf-8')).hexdigest()
#       # train_qrels_content += (qrels_template % (num, docid_tmp, int(j in dev[k]['Secondary_evidence_index']))) if rct in train.values else ""
#       test_qrels_content += (qrels_template % (num, docid_tmp, int(j in dev[k]['Secondary_evidence_index']))) if rct in dev_df.values else ""
#       qrels_content += (qrels_template % (num, docid_tmp, int(j in dev[k]['Secondary_evidence_index'])))
#       # if j in dev[k]['Secondary_evidence_index']:
#       #   qrels_content += (qrels_template % (num, docid_tmp, 1))
#       #   if rct in train.values:
#       #     train_qrels_content += (qrels_template % (num, docid_tmp, 1))
#       #   if rct in test.values:
#       #     test_qrels_content += (qrels_template % (num, docid_tmp, 1))

#       if docid_tmp not in struct:
#         dict_doc[docid_tmp] ={}
#         dict_doc[docid_tmp]['docid']= docid_tmp
#         dict_doc[docid_tmp]['text']= evidence.strip()
#         trec_doc += doc_template % (docid_tmp, docid_tmp, evidence.strip())
#         doc_tmp = doc_template % (docid_tmp, docid_tmp, evidence.strip())
#         with open(os.path.join(TREC_DOCS_PATH, docid_tmp + '.sgml'), 'w') as f:
#           f.write(doc_tmp)
#         struct.append(docid_tmp)


Writing the topics, qrels, and TREC files.
They are used for indexing and by the retrieval methods.
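
Before indexing, it can help to sanity-check the generated qrels. The sketch below parses qrels text in the four-column format used here (`topic 0 docno label`) and counts the relevant documents per topic; at document level, a Single statement should have exactly one relevant trial and a Comparison statement normally two. The function name `summarize_qrels` and the toy lines are hypothetical.

```python
from collections import defaultdict


def summarize_qrels(qrels_text):
    """Return {topic number: number of documents judged relevant}."""
    relevant = defaultdict(int)
    for line in qrels_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        qid, _iteration, _docno, label = line.split()
        relevant[int(qid)] += int(int(label) > 0)
    return dict(relevant)


# Hypothetical qrels: topic 1 is a Single statement, topic 2 a Comparison
toy_qrels = (
    "1 0 NCT00000001 1\n"
    "1 0 NCT00000002 0\n"
    "2 0 NCT00000001 1\n"
    "2 0 NCT00000002 1\n"
)
print(summarize_qrels(toy_qrels))
```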

In [None]:
with open(TRAIN_TOPICS_PATH, 'w') as f:
  f.write(train_content)

with open(TEST_TOPICS_PATH, 'w') as f:
  f.write(test_content)

with open(TRAIN_QRELS_PATH, 'w') as f:
  f.write(train_qrels_content)

with open(TEST_QRELS_PATH, 'w') as f:
  f.write(test_qrels_content)

with open(TOPICS_PATH, 'w') as f:
  f.write(content)

with open(QRELS_PATH, 'w') as f:
  f.write(qrels_content)

with open('/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/trec_doc.sgml', 'w') as f:
  f.write(trec_doc)

json_doc = json.dumps(dict_doc)
with open('/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/json_doc.json', 'w') as f:
  f.write(json_doc)

In [None]:
topics = pt.io.read_topics(TOPICS_PATH)
train_topics = pt.io.read_topics(TRAIN_TOPICS_PATH)
test_topics = pt.io.read_topics(TEST_TOPICS_PATH)

qrels = pt.io.read_qrels(QRELS_PATH)
train_qrels = pt.io.read_qrels(TRAIN_QRELS_PATH)
test_qrels = pt.io.read_qrels(TEST_QRELS_PATH)


Transforming the JSON files into TREC files

In [None]:
# def json_to_trec(json_file, trec_file, docid):
#   with open(json_file) as f:
#     data = json.load(f)

#   with open(trec_file, 'w') as f:
#     f.write("<DOC>\n")
#     f.write(f"<DOCNO>{data['Clinical Trial ID']}</DOCNO>\n")
#     f.write(f"<DOCID>{data['Clinical Trial ID']}</DOCID>\n")
#     f.write(f"<TEXT>{data}</TEXT>")



Creating empty TREC files with the same names as the JSON files

In [None]:
# json_directory = '/content/drive/MyDrive/Colab Notebooks/CTRs'
# trec_directory = '/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/trec_docs'
# json_ext = '.json'
# trec_ext = '.sgml'

# filenames = [f for f in os.listdir(docs) if f.endswith(json_ext)]

# for filename in filenames:
#   new_filename = os.path.splitext(filename)[0] + trec_ext
#   open(os.path.join(trec_directory, new_filename), 'w').close()


Getting the TREC files


In [None]:
trec_files = pt.io.find_files("/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/trec_docs")
# docid = 0
# for json_file in files:
#   for trec_file in trec_files:
#     docid+=1
#     if os.path.basename(os.path.splitext(json_file)[0]) == os.path.basename(os.path.splitext(trec_file)[0]):
#       json_to_trec(json_file, trec_file, docid)

# Indexing the TREC formatted files

In [None]:
INDEX_DIR='/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/index' # directory where the index will be stored
indexer = pt.TRECCollectionIndexer(INDEX_DIR,
    # store the document text as metadata
    meta= {'docno' : 50, 'text' : 4096},
    # The tags from which to save the text. ELSE is a special tag name, which means anything not consumed by other tags.
    meta_tags = {'text' : 'ELSE'},
    verbose=True,
    overwrite=True) # overwrite any existing index with the same name
# Index the files by calling index() on the TRECCollectionIndexer object
indexref = indexer.index(trec_files)
index = pt.IndexFactory.of(indexref)

  0%|          | 0/999 [6ms<?, ?files/s]

##Training and reranking models

In [None]:
knrm = onir_pt.reranker('knrm', 'wordvec_hash', text_field='text')

[2023-08-02 15:35:32,242][WordvecHashVocab][DEBUG] [starting] reading cached at /root/data/onir/vocab/wordvec_hash/fasttext-wiki-news-300d-1M.p
[2023-08-02 15:35:33,796][WordvecHashVocab][DEBUG] [finished] reading cached at /root/data/onir/vocab/wordvec_hash/fasttext-wiki-news-300d-1M.p [1.55s]


In [None]:
pipeline = (pt.BatchRetrieve(index) % 100 # get top 100 results
            >> pt.text.get_text(index, 'text') # fetch the document text
            >> knrm) # apply neural re-ranker


In [None]:
br = pt.BatchRetrieve(index)
# foo = (pt.BatchRetrieve(index) # get top 100 results
#        >> pt.text.get_text(index, 'text') # fetch the document text
#        >> trained_model)

# The second label says "VBERT", but `pipeline` re-ranks with the KNRM model defined above
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> VBERT'],
    eval_metrics=["recip_rank", "map", "recall_5", "recall_10", "P.1", "P.5", "P.10", "mrt"]
)

[2023-08-02 15:59:50,339][onir_pt][DEBUG] using GPU (deterministic)
[2023-08-02 15:59:50,830][onir_pt][DEBUG] [starting] batches


batches:   0%|          | 0/57500 [7ms<?, ?it/s]

[2023-08-02 16:03:03,009][onir_pt][DEBUG] [finished] batches: [03:12] [57500it] [299.20it/s]


Unnamed: 0,name,recip_rank,map,recall_5,recall_10,P.1,P.5,P.10,mrt
0,DPH,0.218014,0.192459,0.228235,0.279118,0.161765,0.053294,0.033353,375.448365
1,DPH >> VBERT,0.021444,0.017253,0.013235,0.030882,0.002353,0.003765,0.004176,486.415703
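
To help read the table above, the sketch below computes three of the reported measures for a single toy ranking; `pt.Experiment` reports the mean of such per-topic values (`mrt` is the mean response time per query, not an effectiveness measure). The function names and toy data are hypothetical. Because each topic here has at most two relevant trials, P@5 cannot exceed 0.4 and P@10 cannot exceed 0.2, which is worth keeping in mind when judging the precision columns.

```python
def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0


def precision_at(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(docno in relevant for docno in ranking[:k]) / k


def recall_at(ranking, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    return sum(docno in relevant for docno in ranking[:k]) / len(relevant)


# Hypothetical ranking for one Comparison topic with two relevant trials
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}

print(reciprocal_rank(ranking, relevant))
print(precision_at(ranking, relevant, 5))
print(recall_at(ranking, relevant, 5))
```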


In [None]:
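# Train the re-ranker; fit() takes (training topics, training qrels, validation topics, validation qrels).
# Note: the test_* split is passed in the validation slots here; a separate
# validation split would keep the test data unseen during training.
pipeline.fit(train_topics,
             train_qrels,
             test_topics,
             test_qrels)

# Re-run the comparison, now with the trained pipeline
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> VBERT'],  # note: despite the label, the second system is the KNRM pipeline defined above
    eval_metrics=["recip_rank", "map", "recall_5", "recall_10", "P.1", "P.5", "P.10", "mrt"]
)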

[2023-08-02 16:54:06,281][onir_pt][INFO] pre-validation: 0.0229
[2023-08-02 16:54:12,459][onir_pt][INFO] training   it=0 loss=0.2942
[2023-08-02 16:54:29,916][onir_pt][INFO] validation it=0 map=0.0110 ndcg=0.0756 P_10=0.0025
[2023-08-02 16:54:35,629][onir_pt][INFO] training   it=1 loss=0.2380
[2023-08-02 16:54:53,127][onir_pt][INFO] validation it=1 map=0.0625 ndcg=0.1276 P_10=0.0120 <--
[2023-08-02 16:54:58,444][onir_pt][INFO] training   it=2 loss=0.1762
[2023-08-02 16:55:15,940][onir_pt][INFO] validation it=2 map=0.0583 ndcg=0.1231 P_10=0.0110
[2023-08-02 16:55:21,361][onir_pt][INFO] training   it=3 loss=0.1789
[2023-08-02 16:55:38,677][onir_pt][INFO] validation it=3 map=0.0572 ndcg=0.1219 P_10=0.0110
[2023-08-02 16:55:44,309][onir_pt][INFO] training   it=4 loss=0.1783
[2023-08-02 16:56:01,812][onir_pt][INFO] validation it=4 map=0.0631 ndcg=0.1253 P_10=0.0100 <--
[2023-08-02 16:56:07,039][onir_pt][INFO] training   it=5 loss=0.1775
[2023-08-02 16:56:24,472][onir_pt][INFO] validation it=5 map=0.0609 ndcg=0.1239 P_10=0.0100
[2023-08-02 16:56:29,969][onir_pt][INFO] training   it=6 loss=0.1668
[2023-08-02 16:56:47,060][onir_pt][INFO] validation it=6 map=0.0589 ndcg=0.1225 P_10=0.0100
[2023-08-02 16:56:52,267][onir_pt][INFO] training   it=7 loss=0.1710
[2023-08-02 16:57:09,362][onir_pt][INFO] validation it=7 map=0.0585 ndcg=0.1228 P_10=0.0100
[2023-08-02 16:57:14,473][onir_pt][INFO] training   it=8 loss=0.1755
[2023-08-02 16:57:31,150][onir_pt][INFO] validation it=8 map=0.0586 ndcg=0.1226 P_10=0.0100
[2023-08-02 16:57:36,682][onir_pt][INFO] training   it=9 loss=0.1794
[2023-08-02 16:57:53,381][onir_pt][INFO] validation it=9 map=0.0526 ndcg=0.1183 P_10=0.0110
[2023-08-02 16:57:58,509][onir_pt][INFO] training   it=10 loss=0.1799
[2023-08-02 16:58:15,101][onir_pt][INFO] validation it=10 map=0.0588 ndcg=0.1228 P_10=0.0125
[2023-08-02 16:58:20,611][onir_pt][INFO] training   it=11 loss=0.1662
[2023-08-02 16:58:37,227][onir_pt][INFO] validation it=11 map=0.0549 ndcg=0.1207 P_10=0.0115
[2023-08-02 16:58:42,336][onir_pt][INFO] training   it=12 loss=0.1767
[2023-08-02 16:58:59,067][onir_pt][INFO] validation it=12 map=0.0558 ndcg=0.1217 P_10=0.0110
[2023-08-02 16:59:04,509][onir_pt][INFO] training   it=13 loss=0.1695
[2023-08-02 16:59:21,119][onir_pt][INFO] validation it=13 map=0.0528 ndcg=0.1183 P_10=0.0110
[2023-08-02 16:59:26,310][onir_pt][INFO] training   it=14 loss=0.1899
[2023-08-02 16:59:43,170][onir_pt][INFO] validation it=14 map=0.0548 ndcg=0.1201 P_10=0.0120
[2023-08-02 16:59:48,439][onir_pt][INFO] training   it=15 loss=0.1695
[2023-08-02 17:00:05,232][onir_pt][INFO] validation it=15 map=0.0548 ndcg=0.1201 P_10=0.0120
[2023-08-02 17:00:10,695][onir_pt][INFO] training   it=16 loss=0.1737
[2023-08-02 17:00:27,569][onir_pt][INFO] validation it=16 map=0.0562 ndcg=0.1215 P_10=0.0105
[2023-08-02 17:00:32,732][onir_pt][INFO] training   it=17 loss=0.1746
[2023-08-02 17:00:49,493][onir_pt][INFO] validation it=17 map=0.0528 ndcg=0.1187 P_10=0.0110
[2023-08-02 17:00:55,176][onir_pt][INFO] training   it=18 loss=0.1669
[2023-08-02 17:01:12,032][onir_pt][INFO] validation it=18 map=0.0563 ndcg=0.1218 P_10=0.0115
[2023-08-02 17:01:17,208][onir_pt][INFO] training   it=19 loss=0.1751
[2023-08-02 17:01:33,899][onir_pt][INFO] validation it=19 map=0.0586 ndcg=0.1242 P_10=0.0135
[2023-08-02 17:01:39,384][onir_pt][INFO] training   it=20 loss=0.1671
[2023-08-02 17:01:56,133][onir_pt][INFO] validation it=20 map=0.0535 ndcg=0.1192 P_10=0.0120
[2023-08-02 17:02:01,459][onir_pt][INFO] training   it=21 loss=0.1666
[2023-08-02 17:02:18,347][onir_pt][INFO] validation it=21 map=0.0554 ndcg=0.1211 P_10=0.0115
[2023-08-02 17:02:23,546][onir_pt][INFO] training   it=22 loss=0.1805
[2023-08-02 17:02:40,274][onir_pt][INFO] validation it=22 map=0.0556 ndcg=0.1213 P_10=0.0110
[2023-08-02 17:02:45,924][onir_pt][INFO] training   it=23 loss=0.1659
[2023-08-02 17:03:02,879][onir_pt][INFO] validation it=23 map=0.0545 ndcg=0.1203 P_10=0.0115
[2023-08-02 17:03:08,171][onir_pt][INFO] training   it=24 loss=0.1817
[2023-08-02 17:03:24,883][onir_pt][INFO] validation it=24 map=0.0533 ndcg=0.1187 P_10=0.0110
[2023-08-02 17:03:24,884][onir_pt][INFO] early stopping; model reverting back to it=4
[2023-08-02 17:26:40,031][onir_pt][DEBUG] using GPU (deterministic)
[2023-08-02 17:26:40,032][onir_pt][DEBUG] [starting] batches
[2023-08-02 17:29:43,786][onir_pt][DEBUG] [finished] batches: [03:04] [57500it] [312.92it/s]


| | name | recip_rank | map | recall_5 | recall_10 | P.1 | P.5 | P.10 | mrt |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DPH | 0.218014 | 0.192459 | 0.228235 | 0.279118 | 0.161765 | 0.053294 | 0.033353 | 361.059752 |
| 1 | DPH >> VBERT | 0.112653 | 0.100305 | 0.115294 | 0.160294 | 0.074706 | 0.026353 | 0.018824 | 465.847355 |


In [None]:
# Load a version of EPIC trained on the MS-MARCO dataset
lazy_epic = onir_pt.reranker.from_checkpoint(
    'https://macavaney.us/epic.msmarco.tar.gz',
    expected_md5="2f6a16be1a6a63aab1e8fed55521a4db")

config file not found: config




[2023-08-02 14:50:47,938][onir.util.download][DEBUG] downloaded https://macavaney.us/epic.msmarco.tar.gz [5.03s] [494M] [182MB/s] [md5 hash verified]


100%|██████████| 231508/231508 [647ms<0ms, 357751.40B/s]  
100%|██████████| 433/433 [1ms<0ms, 497843.65B/s]
100%|██████████| 440473133/440473133 [42.34s<0ms, 10402475.15B/s]  


In [None]:
# Use the TREC COVID dataset for this example
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')

In [None]:
# Build an inverted index for TREC-COVID with PyTerrier
import os

pt_index_path = './terrier_cord19'
if not os.path.exists(pt_index_path + '/data.properties'):
    indexer = pt.index.IterDictIndexer(pt_index_path)
    index_ref = indexer.index(dataset.get_corpus_iter(), fields=('abstract',), meta=('docno',))
else:
    index_ref = pt.IndexRef.of(pt_index_path + '/data.properties')
index = pt.IndexFactory.of(index_ref)

cord19/trec-covid documents:   0%|          | 0/192509 [7ms<?, ?it/s]

  index_ref = indexer.index(dataset.get_corpus_iter(), fields=('abstract',), meta=('docno',))


14:52:46.458 [ForkJoinPool-1-worker-3] ERROR org.terrier.structures.indexing.Indexer - Could not finish MetaIndexBuilder: 
java.io.IOException: Key 8lqzfj2e is not unique: 37597,11755
For MetaIndex, to suppress, set metaindex.compressed.reverse.allow.duplicates=true
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.mergeTwo(FSOrderedMapFile.java:1374)
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.close(FSOrderedMapFile.java:1308)
	at org.terrier.structures.indexing.BaseMetaIndexBuilder.close(BaseMetaIndexBuilder.java:321)
	at org.terrier.structures.indexing.classical.BasicIndexer.indexDocuments(BasicIndexer.java:270)
	at org.terrier.structures.indexing.classical.BasicIndexer.createDirectIndex(BasicIndexer.java:388)
	at org.terrier.structures.indexing.Indexer.index(Indexer.java:377)
	at org.terrier.python.ParallelIndexer$3.apply(ParallelIndexer.java:131)
	at org.terrier.python.ParallelIndexer$3.apply(ParallelIndexer.java:120)
	at java.

In [None]:
br = pt.BatchRetrieve(index) % 30
pipeline = (br >> pt.text.get_text(dataset, 'abstract')
               >> pt.apply.generic(lambda x: x.rename(columns={'abstract': 'text'}))
               >> lazy_epic)
pt.Experiment(
    [br, pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    names=['DPH', 'DPH >> EPIC (lazy)'],
    eval_metrics=["map","recip_rank", "P.5", "mrt"]
)

[INFO] [starting] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml: [0ms] [458B] [3.46MB/s]
[INFO] [starting] https://mirror.ir-datasets.com/0307a37b6b9f1a5f233340a769d538ea
[INFO] [finished] https://mirror.ir-datasets.com/0307a37b6b9f1a5f233340a769d538ea: [2ms] [18.7kB] [11.3MB/s]
[INFO] [starting] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt: [0ms] [458B] [3.03MB/s]
[INFO] [starting] https://mirror.ir-datasets.com/8138424a59daea0aba751c8a891e5f54
[INFO] [finished] https://mirror.ir-datasets.com/8138424a59daea0aba751c8a891e5f54: [56ms] [1.14MB] [20.4MB/s]


[2023-08-02 14:53:00,072][onir_pt][DEBUG] using GPU (deterministic)
[2023-08-02 14:53:05,648][onir_pt][DEBUG] [starting] batches

batches:   0%|          | 0/375 [7ms<?, ?it/s]

[2023-08-02 14:54:05,836][onir_pt][DEBUG] [finished] batches: [01:00] [375it] [ 6.23it/s]


| | name | map | recip_rank | P.5 | mrt |
|---|---|---|---|---|---|
| 0 | DPH | 0.031769 | 0.766833 | 0.684 | 44.082439 |
| 1 | DPH >> EPIC (lazy) | 0.032313 | 0.807889 | 0.724 | 1353.03771 |


In [None]:
# vbert = onir_pt.reranker('vanilla_transformer', 'bert', text_field='text', vocab_config={'train': True})

knrm = onir_pt.reranker('knrm', 'wordvec_hash', text_field='abstract')

# knrm = onir_pt.reranker.from_checkpoint('https://macavaney.us/knrm.medmarco.tar.gz', text_field='text', expected_md5="d70b1d4f899690dae51161537e69ed5a")

# foo = onir_pt.reranker('knrm', 'wordvec_hash', text_field='text').fit(tr_run=train_qrels, tr_qrels=train_qrels, va_run=test_topics, va_qrels=test_qrels, tr_pairs=None)

[2023-08-02 14:54:05,984][WordvecHashVocab][DEBUG] [starting] downloading https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip

[2023-08-02 14:54:09,738][onir.util.download][DEBUG] downloaded https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip [3.49s] [682M] [186MB/s]
[2023-08-02 14:54:09,748][WordvecHashVocab][DEBUG] [finished] downloading https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip [3.77s]
[2023-08-02 14:54:09,749][WordvecHashVocab][DEBUG] [starting] extracting vecs
[2023-08-02 14:54:21,627][WordvecHashVocab][DEBUG] [finished] extracting vecs [11.88s]
[2023-08-02 14:54:21,629][WordvecHashVocab][DEBUG] [starting] loading vecs into memory
[2023-08-02 14:56:32,577][WordvecHashVocab][DEBUG] [finished] loading vecs into memory [02:11]
[2023-08-02 14:56:32,930][WordvecHashVocab][DEBUG] [starting] writing cached at /root/data/onir/vocab/wordvec_hash/fasttext-wiki-news-300d-1M.p
[2023-08-02 14:56:4

In [None]:
pipeline = (pt.BatchRetrieve(index) % 100 # get top 100 results
            >> pt.text.get_text(dataset, 'abstract') # fetch the document text
            >> knrm) # apply neural re-ranker

# pipeline.fit(
#     train_topics,
#     train_qrels,
#     test_topics,
#     test_qrels)

In [None]:
br = pt.BatchRetrieve(index)
# foo = (pt.BatchRetrieve(index) # get top 100 results
#        >> pt.text.get_text(index, 'text') # fetch the document text
#        >> trained_model)
topics = dataset.get_topics(variant='description')
qrels = dataset.get_qrels()
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
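    # NOTE: `pipeline` re-ranks with the KNRM model defined above, so 'VBERT' here is only a display label
    names=['DPH', 'DPH >> VBERT'],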
    eval_metrics=["recip_rank", "map", "recall_5", "recall_10", "P.1", "P.5", "P.10", "mrt"]
)

[2023-08-02 14:56:56,185][onir_pt][DEBUG] using GPU (deterministic)
[2023-08-02 14:56:56,470][onir_pt][DEBUG] [starting] batches

batches:   0%|          | 0/1250 [7ms<?, ?it/s]

[2023-08-02 14:56:59,203][onir_pt][DEBUG] [finished] batches: [2.73s] [1250it] [457.60it/s]


| | name | recip_rank | map | recall_5 | recall_10 | P.1 | P.5 | P.10 | mrt |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DPH | 0.829762 | 0.170652 | 0.00796 | 0.015211 | 0.76 | 0.688 | 0.658 | 53.271312 |
| 1 | DPH >> VBERT | 0.56313 | 0.054804 | 0.004497 | 0.009492 | 0.4 | 0.408 | 0.45 | 147.721786 |


In [None]:
# # retrieve documents with text
# br = pt.BatchRetrieve(index, metadata=['docno', 'text'])

# # use Tf as a passage scorer on sliding window passages
# psg_scorer = (
#     pt.text.sliding(text_attr='text', length=15, prepend_attr=None)
#     >> pt.text.scorer(body_attr="text", wmodel='Tf', takes='docs')
# )

# # use psg_scorer for performing query-biased summarisation on docs retrieved by br
# retr_pipe = (br >> pt.text.snippets(psg_scorer) >> lazy_epic)

In [None]:
# lazy_epic = onir_pt.reranker.from_checkpoint(
#     'https://macavaney.us/epic.msmarco.tar.gz',
#     expected_md5="2f6a16be1a6a63aab1e8fed55521a4db")

In [None]:
br = pt.BatchRetrieve(index)
tf = pt.BatchRetrieve(index, wmodel="Tf")
tfidf = pt.BatchRetrieve(index, wmodel="LemurTF_IDF")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
pl2 = pt.BatchRetrieve(index, wmodel="PL2")

pt.Experiment(
    [tf, tfidf, bm25, pl2, br, pipeline],
    topics,
    qrels,
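    # NOTE: 'BR' is the default BatchRetrieve (DPH weighting); the 'DPH' label is attached to `pipeline` (DPH >> KNRM re-ranker)
    names=['TF', 'TFIDF', 'BM25', 'PL2', 'BR', 'DPH'],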
    eval_metrics=["recip_rank", "map", "recall_5", "recall_10", "P.1", "P.5", "P.10", "mrt"]
)

[2023-08-02 14:57:14,091][onir_pt][DEBUG] using GPU (deterministic)
[2023-08-02 14:57:14,092][onir_pt][DEBUG] [starting] batches

batches:   0%|          | 0/1250 [6ms<?, ?it/s]

[2023-08-02 14:57:16,820][onir_pt][DEBUG] [finished] batches: [2.73s] [1250it] [458.22it/s]


| | name | recip_rank | map | recall_5 | recall_10 | P.1 | P.5 | P.10 | mrt |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TF | 0.261535 | 0.019708 | 0.001217 | 0.00227 | 0.1 | 0.136 | 0.132 | 39.368954 |
| 1 | TFIDF | 0.8265 | 0.192764 | 0.008405 | 0.015744 | 0.74 | 0.708 | 0.666 | 46.701783 |
| 2 | BM25 | 0.836857 | 0.195498 | 0.008551 | 0.016738 | 0.76 | 0.712 | 0.692 | 43.860381 |
| 3 | PL2 | 0.772652 | 0.175589 | 0.008581 | 0.015814 | 0.66 | 0.712 | 0.672 | 47.568903 |
| 4 | BR | 0.829762 | 0.170652 | 0.00796 | 0.015211 | 0.76 | 0.688 | 0.658 | 47.557327 |
| 5 | DPH | 0.56313 | 0.054804 | 0.004497 | 0.009492 | 0.4 | 0.408 | 0.45 | 99.288014 |


In [None]:
# INDEX_DIR='/content/drive/MyDrive/Colab Notebooks/GH95' # directory where the index will be stored
# indexer = pt.TRECCollectionIndexer(INDEX_DIR,
#     # save the text as metadata
#     meta= {'docno' : 26, 'text' : 2048},
#     # The tags from which to save the text. ELSE is a special tag name, which means anything not consumed by other tags.
#     meta_tags = {'text' : 'ELSE'},
#     verbose=True,
#     overwrite=True) # overwrite any existing index with that name
# # Index the files by calling the index method on the TRECCollectionIndexer object
# indexref = indexer.index(gh_files)
# index = pt.IndexFactory.of(indexref)

In [None]:

# INDEX_DIR='/content/drive/MyDrive/Colab Notebooks/CTRs/index' # directory where the index will be stored
# indexer = pt.IterDictIndexer(INDEX_DIR,
#                              # save the text as metadata
#                              meta={'docno' : 26,
#                                    'text' : 4096,
#                                    'Intervention': 4096,
#                                    'Eligibility': 4096,
#                                    'Results': 4096,
#                                    'Adverse Events': 4096},
#                              verbose=True,
#                              overwrite=True) # overwrite any existing index with that name

# # Index the documents by calling the index method on the IterDictIndexer object
# indexref = indexer.index(iter_ctrs())
# index = pt.IndexFactory.of(indexref)

By default, PyTerrier applies Porter stemming and stopword removal, both designed for English. We can disable stemming (`stemmer=None`), stem for another language (`stemmer='portugese'`; yes, that is how it is spelled), or keep the stopwords (`stopwords=None`).


# Learning some PyTerrier and information retrieval techniques

##Inspecting the index

`indexref` is an [IndexRef](http://terrier.org/docs/current/javadoc/org/terrier/querying/IndexRef.html) object. It can be seen as a pointer or URI to the location of the index files.

In [None]:
indexref.toString()

'/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/index/data.properties'

`index.getCollectionStatistics().toString()` provides information about the index such as the number of documents indexed and the number of distinct terms.

In [None]:
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())

Number of documents: 999
Number of terms: 8670
Number of postings: 193532
Number of fields: 0
Number of tokens: 440379
Field names: []
Positions:   false
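
To make the figures above concrete, here is a minimal pure-Python sketch that computes the same kinds of counts (documents, distinct terms, postings, tokens) over a tiny corpus. It is not how Terrier computes them: the toy documents, the whitespace tokenizer, and the `collection_statistics` helper are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: docno -> text (not taken from the notebook's data).
docs = {
    "d1": "breast cancer treatment trial",
    "d2": "cancer trial adverse events",
    "d3": "cardiac events trial trial",
}

def collection_statistics(corpus):
    """Count documents, distinct terms, postings and tokens in a corpus."""
    term_freqs = {docno: Counter(text.split()) for docno, text in corpus.items()}
    vocabulary = set()
    for tf in term_freqs.values():
        vocabulary.update(tf)
    return {
        "documents": len(corpus),
        "terms": len(vocabulary),
        # one posting per distinct (term, document) pair
        "postings": sum(len(tf) for tf in term_freqs.values()),
        # every term occurrence counts as one token
        "tokens": sum(sum(tf.values()) for tf in term_freqs.values()),
    }

stats = collection_statistics(docs)
print(stats)
```

A term repeated inside one document (such as `trial` in `d3`) adds to the token count but contributes a single posting, which is why the number of postings never exceeds the number of tokens.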



In [None]:
# check the meta-index fields
print(index.getMetaIndex().getKeys())


['docno', 'text']


In [1]:
# If necessary, we can print the entire lexicon to see the indexed terms
# for kv in index.getLexicon():
#  print("%s (%s) -> %s (%s)" % (kv.getKey(), type(kv.getKey()), kv.getValue().toString(), type(kv.getValue()) ) )

In [None]:
di = index.getDirectIndex()
doi = index.getDocumentIndex()
lex = index.getLexicon()
docid = 100 # docids are 0-based
# NB: the posting list is null if the document is empty
for posting in di.getPostings(doi.getDocumentEntry(docid)):
  termid = posting.getId()
  lee = lex.getLexiconEntry(termid)
  print(lee.getKey(), end=' ')

pregnanc 60 equal 5 arm titl particip clinic analyz outcom prior intraven disord total time infect greater complet frame criteria hepat evid unit year diseas event intervent chemotherapi measur cell respons statu perform normal 1000 elig number result treatment descript advers group inclus mg type follow requir trial 8 3 2 1 iv patient id adequ cancer surgeri myocardi function cycl infarct breast need bone tumor karnofski inflammatori dai local 7 4 tissu cardiac malign renal rate carcinoma invas neutropenia lobular febril diagnosi 17 doxorubicin abnorm 13 21 marrow last primari 70 baselin patholog 08 54 site thrombocytopenia contain advanc detect stomat size vomit 15 1200 exclus previou reserv physician 62 assess second sampl cisplatin pancytopenia bidimension diarrhoea identifi 65 69 extent 26 fatigu concomit origin neutropen presenc gemcitabin feed jaundic neoplast macroscop arrest guid kp section macrophag mononuclear stroma nondescript devoid decis lobul corrobor fibroblast fibroel

In [None]:
# List the frequencies of the terms in doc 100
di = index.getDirectIndex()
doi = index.getDocumentIndex()
lex = index.getLexicon()
docid = 100 #docids are 0-based
#NB: postings will be null if the document is empty
for posting in di.getPostings(doi.getDocumentEntry(docid)):
  termid = posting.getId()
  lee = lex.getLexiconEntry(termid)
  print("%s with frequency %d" % (lee.getKey(),posting.getFrequency()))




pregnanc with frequency 1
60 with frequency 2
equal with frequency 1
5 with frequency 6
arm with frequency 2
titl with frequency 1
particip with frequency 2
clinic with frequency 1
analyz with frequency 1
outcom with frequency 1
prior with frequency 1
intraven with frequency 2
disord with frequency 1
total with frequency 1
time with frequency 1
infect with frequency 2
greater with frequency 1
complet with frequency 3
frame with frequency 1
criteria with frequency 2
hepat with frequency 1
evid with frequency 2
unit with frequency 2
year with frequency 1
diseas with frequency 1
event with frequency 2
intervent with frequency 2
chemotherapi with frequency 3
measur with frequency 4
cell with frequency 2
respons with frequency 3
statu with frequency 2
perform with frequency 2
normal with frequency 1
1000 with frequency 2
elig with frequency 1
number with frequency 4
result with frequency 2
treatment with frequency 1
descript with frequency 1
advers with frequency 2
group with frequency 2
in

In [None]:
# #which documents contain the term and how often
# meta = index.getMetaIndex()
# inv = index.getInvertedIndex()
# le = lex.getLexiconEntry( "stroma" )
# # the lexicon entry is also our pointer to access the inverted index posting list
# for posting in inv.getPostings( le ):
#   docno = meta.getItem("docno", posting.getId())
#   print("%s with frequency %d " % (docno, posting.getFrequency()))
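
The cells above read two complementary structures: the direct index (document to terms) and the inverted index (term to documents). The sketch below mimics both with plain dictionaries to show how they relate. The toy corpus and the variable names are hypothetical, and nothing here reflects Terrier's on-disk format.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; the docnos and texts are illustrative only.
docs = {
    "d1": "stroma cell tumor cell",
    "d2": "tumor size",
    "d3": "stroma stroma biopsy",
}

# Direct index: document -> {term: frequency}
direct = {docno: Counter(text.split()) for docno, text in docs.items()}

# Inverted index: term -> {document: frequency}, built by flipping the direct index
inverted = defaultdict(dict)
for docno, term_freqs in direct.items():
    for term, freq in term_freqs.items():
        inverted[term][docno] = freq

# Direct index lookup: terms of one document with their frequencies
for term, freq in direct["d1"].items():
    print("%s with frequency %d" % (term, freq))

# Inverted index lookup: documents that contain one term
for docno, freq in inverted["stroma"].items():
    print("%s with frequency %d" % (docno, freq))
```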

#Querying

## Interactive Queries

###Terrier Query Language

[Source](http://terrier.org/docs/v1.1.1/terrier_develop.html)

Terrier offers a flexible and powerful query language for searching with phrases or fields, or for requiring that terms appear in the retrieved documents. Some example queries:
* `term1 term2` retrieves documents that contain term1, term2, or both.
* `term1^2.3` boosts the weight of term1 by a factor of 2.3.
* `+term1 +term2` retrieves documents that contain both term1 and term2.
* `+term1 -term2` retrieves documents that contain term1 and do not contain term2.
* `"term1 term2"` retrieves documents where term1 and term2 appear as a phrase.
* `"term1 term2"~n` retrieves documents where term1 and term2 appear within a distance of n blocks, in any order.
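
The required (`+`) and excluded (`-`) operators can be read as a boolean filter applied before ranking. The sketch below models only that filtering step for plain, `+` and `-` terms; boosts, phrases and proximity are not covered. The `matches` function is a hypothetical helper written for this illustration and is not Terrier's query parser.

```python
def matches(query, text):
    """Return True if `text` satisfies the +required / -excluded terms in `query`.

    Plain terms are optional: when the query has no required term, a document
    matches if it contains at least one of the plain terms.
    """
    terms = set(text.lower().split())
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            excluded.append(token[1:])
        else:
            optional.append(token)
    if any(t not in terms for t in required):
        return False  # a required term is missing
    if any(t in terms for t in excluded):
        return False  # an excluded term occurs
    if not required:
        return any(t in terms for t in optional)
    return True

doc = "breast cancer study"
print(matches("cancer trial", doc))    # one optional term is enough
print(matches("+cancer +trial", doc))  # both terms are required
print(matches("+cancer -trial", doc))  # cancer required, trial forbidden
```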




Several ranking functions (`wmodel`) are implemented, including TF_IDF, PL2, and DFR models. The full list is at http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html

In [None]:
br = pt.BatchRetrieve(index, metadata=["docno", "text"], wmodel="TF_IDF")
br.search("Patients with significantly elevated ejection fraction")

| | qid | docid | docno | text | rank | score | query |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 493 | NCT00912340 | NCT00912340 {"Clinical Trial ID": "NCT0091234... | 0 | 7.837889 | Patients with significantly elevated ejection ... |
| 1 | 1 | 783 | NCT01931163 | NCT01931163 {"Clinical Trial ID": "NCT0193116... | 1 | 7.704695 | Patients with significantly elevated ejection ... |
| 2 | 1 | 399 | NCT00698035 | NCT00698035 {"Clinical Trial ID": "NCT0069803... | 2 | 7.541196 | Patients with significantly elevated ejection ... |
| 3 | 1 | 290 | NCT00470847 | NCT00470847 {"Clinical Trial ID": "NCT0047084... | 3 | 7.375055 | Patients with significantly elevated ejection ... |
| 4 | 1 | 711 | NCT01629615 | NCT01629615 {"Clinical Trial ID": "NCT0162961... | 4 | 7.314571 | Patients with significantly elevated ejection ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 777 | 1 | 321 | NCT00545688 | NCT00545688 {"Clinical Trial ID": "NCT0054568... | 777 | 0.548988 | Patients with significantly elevated ejection ... |
| 778 | 1 | 500 | NCT00929240 | NCT00929240 {"Clinical Trial ID": "NCT0092924... | 778 | 0.539227 | Patients with significantly elevated ejection ... |
| 779 | 1 | 604 | NCT01250379 | NCT01250379 {"Clinical Trial ID": "NCT0125037... | 779 | 0.523706 | Patients with significantly elevated ejection ... |
| 780 | 1 | 846 | NCT02259114 | NCT02259114 {"Clinical Trial ID": "NCT0225911... | 780 | 0.521455 | Patients with significantly elevated ejection ... |
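
To give an intuition for what a weighting model does, the sketch below ranks a toy corpus with a textbook tf-idf score: term frequency times `log(N / df)`, summed over the query terms. This is a simplified formula chosen for illustration and is not the exact formula behind Terrier's `TF_IDF` model. The corpus and the `tf_idf_rank` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: docno -> text.
docs = {
    "d1": "ejection fraction elevated in patients",
    "d2": "patients with breast cancer",
    "d3": "ejection fraction ejection fraction measured",
}

def tf_idf_rank(query, corpus):
    """Rank documents by the sum over query terms of tf * log(N / df)."""
    term_freqs = {docno: Counter(text.split()) for docno, text in corpus.items()}
    n_docs = len(corpus)
    scores = {}
    for docno, tf in term_freqs.items():
        score = 0.0
        for term in query.split():
            # document frequency: number of documents containing the term
            df = sum(1 for other in term_freqs.values() if term in other)
            if df == 0 or tf[term] == 0:
                continue
            score += tf[term] * math.log(n_docs / df)
        if score > 0:
            scores[docno] = score
    # highest score first, like the rank column in the results above
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = tf_idf_rank("patients ejection fraction", docs)
for rank, (docno, score) in enumerate(ranking):
    print(rank, docno, round(score, 4))
```

Here `d3` ranks first because it repeats two of the query terms, even though it never mentions `patients`.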


##Batch Queries

To assess the quality of an Information Retrieval (IR) system, it is necessary to run a significant number of queries (i.e., at least 30) and calculate evaluation metrics for them. The queries used in evaluation campaigns are commonly referred to as "topics." A topic is represented by a structure that includes an identification number, a title, a description, and a narrative. The title is a concise description of the topic. The description provides a bit more detail, and the narrative assists the individuals who produce relevance judgments in distinguishing relevant from non-relevant documents. Below, we provide an example of a topic:



```
<top>
<num> 254 </num>
<title> Earthquake Damage </title>
<desc> Find documents describing damage to property or persons caused by an earthquake and specifying the area affected. </desc>
<narr> Relevant documents will provide details on damage to buildings and material goods or injuries to people as a result of an earthquake. The geographical location (e.g. country, region, city) affected by the earthquake must also be mentioned. </narr>
</top>
```
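
As a rough illustration of how such a topic becomes a query, the sketch below extracts the fields with regular expressions. It assumes every field is closed with a matching `</tag>`, which is not true of all topic files. The `parse_topic` helper is hypothetical; in the notebook itself the parsing is done by `pt.io.read_topics`.

```python
import re

# A shortened copy of the example topic, with explicit closing tags (an assumption).
topic_text = """
<top>
<num> 254 </num>
<title> Earthquake Damage </title>
<desc> Find documents describing damage to property or persons caused by an earthquake and specifying the area affected. </desc>
</top>
"""

def parse_topic(text):
    """Extract whichever of the num, title, desc and narr fields are present."""
    fields = {}
    for tag in ("num", "title", "desc", "narr"):
        match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), text, re.DOTALL)
        if match:
            fields[tag] = match.group(1).strip()
    return fields

topic = parse_topic(topic_text)
print(topic["num"], "->", topic["title"])
```

The topic number plays the role of the `qid` column and one of the text fields (often the title) becomes the `query` column of the topics DataFrame.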


In [None]:
#file with topic queries
topicsFile = '/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/topics_qrels/topics.txt'
topics = pt.io.read_topics(topicsFile)
print(topics)

      qid                                              query
0       1  all the primary trial participants do not rece...
1       2   patients with platelet count over 100 000 mm anc
2       3  heart related adverse events were recorded in ...
3       4  adult patients with histologic confirmation of...
4       5  laser therapy is in each cohort of the primary...
...   ...                                                ...
1895  196  the the primary trial intervention involves on...
1896  197  the secondary trial reported 1 single case of ...
1897  198  the secondary trial and the primary trial do n...
1898  199  the outcome measurement of the primary trial i...
1899  200  all the primary trial patients had a minimum o...

[1900 rows x 2 columns]


In [None]:
#retrieving documents with bm25
#bm25 = pt.BatchRetrieve(index, wmodel="BM25",num_results=100) #the default is to retrieve 1000 docs, but it can be changed with num_results
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
resbm25 = bm25.transform(topics)
#analyzing the list of documents retrieved for each query
resbm25

KeyboardInterrupt: ignored

#Evaluating Query Results
To evaluate how well the Information Retrieval (IR) system responded to the queries, we need to know the "expected" responses, which are known as relevance judgments. The list of documents that should have been retrieved for the queries in the file "topics.txt" is found in the file "qrels.txt". A small excerpt from this file is shown below. It lists documents that were judged for query topic 4. The document "NCT01097642" was considered relevant to this query (indicated by the number 1 in the last column).

```
4 0 NCT01091974 0
4 0 NCT01094184 0
4 0 NCT01095003 0
4 0 NCT01097460 0
4 0 NCT01097642 1
4 0 NCT01104584 0
4 0 NCT01105312 0
4 0 NCT01105650 0
```

In [None]:
# file with the relevance judgments
qrelsFile = '/content/drive/MyDrive/Colab Notebooks/CTRs/TREC/topics_qrels/qrels.txt'
qrels = pt.io.read_qrels(qrelsFile)
print(qrels)
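
For readers unfamiliar with the qrels layout, the sketch below parses a few lines in the four-column TREC format (query id, iteration, document id, relevance label) into a dictionary. The `parse_qrels` helper is hypothetical and only illustrates the structure; the notebook relies on `pt.io.read_qrels` for the real file.

```python
# Three lines copied from the excerpt of qrels.txt shown earlier.
qrels_text = """\
4 0 NCT01091974 0
4 0 NCT01097642 1
4 0 NCT01104584 0
"""

def parse_qrels(text):
    """Map each query id to {docno: relevance label}."""
    judgments = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        qid, _iteration, docno, label = line.split()
        judgments.setdefault(qid, {})[docno] = int(label)
    return judgments

judgments = parse_qrels(qrels_text)
relevant = [docno for docno, label in judgments["4"].items() if label > 0]
print(relevant)
```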

##Evaluation metrics

There are several metrics for evaluating the quality of the results of an IR (Information Retrieval) system. Mean Average Precision (MAP) is one of the most important. Additionally, we can also look at precision at various points in the ranking, such as P@1, P@10, etc.
To list the implemented metrics, use `ir_measures.parse_trec_measure('official')`.

In [None]:
from pyterrier.measures import *
# Evaluating the query results in terms of MAP, P@1, P@5, and P@10.
pt.Utils.evaluate(resbm25, qrels, metrics = ['map','P_1','P_5','P_10']) #mean for all queries


In [None]:
#visualizing the result of MAP per query.
pt.Utils.evaluate(resbm25, qrels, metrics = ['map'], perquery=True)
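
To show what these numbers mean, the sketch below computes P@k, average precision (AP) per query, and MAP by hand, using binary relevance. The rankings, the judgments and the helper names are hypothetical; the notebook's real values come from the evaluation calls above.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # relevant documents that were never retrieved contribute zero
    return total / len(relevant)

# Hypothetical rankings and relevance judgments for two queries.
runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z"]}
rels = {"q1": {"a", "c"}, "q2": {"z"}}

ap = {qid: average_precision(runs[qid], rels[qid]) for qid in runs}
mean_ap = sum(ap.values()) / len(ap)  # MAP is the mean of AP over all queries

print(ap)
print(mean_ap)
print(precision_at_k(runs["q1"], rels["q1"], 2))
```

The per-query view matters because a single MAP value can hide queries on which the system fails completely.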

## Precision/Recall Plots

Generate the interpolated precision values at the standard recall levels in order to draw the precision-recall curve.

In [None]:
iprec = pt.Utils.evaluate(resbm25, qrels, metrics = [IPrec@0.0,IPrec@0.1,IPrec@0.2,IPrec@0.3,IPrec@0.4,IPrec@0.5,IPrec@0.6,IPrec@0.7,IPrec@0.8,IPrec@0.9,IPrec@1.0])
iprec.values()

In [None]:
import matplotlib.pyplot as plt
x=[0, 0.1, 0.2,0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
y=list(iprec.values())  # convert the dict view to a list so matplotlib can plot it
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel('recall')
plt.ylabel('precision')
plt.title("Precision-recall curve")
plt.plot(x,y,'bo-', linewidth=2)
plt.show()
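
The sketch below shows how interpolated precision at the 11 standard recall levels can be computed for a single query: the value at recall level r is the highest precision observed at any rank whose recall is at least r. The ranking, the judgments and the `interpolated_precision` helper are hypothetical; the curve plotted above uses values averaged over all queries.

```python
def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision at standard recall levels for one query."""
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    if not relevant:
        return [0.0] * len(levels)
    points = []  # (recall, precision) measured after each rank
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    # maximum precision among all ranks that reach at least this recall level
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]

# Hypothetical query: 4 retrieved documents, 2 of them relevant.
curve = interpolated_precision(["a", "b", "c", "d"], {"a", "c"})
print([round(value, 3) for value in curve])
```

Taking the maximum over higher recall levels is what makes the interpolated curve non-increasing, which removes the sawtooth shape of the raw precision-recall points.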


#Conducting comparative experiments
Knowing the results of metrics for an experimental configuration alone is not sufficient. In most cases, the goal is to compare different configurations. In this case, we will use pt.Experiment() to compare 4 ranking functions: Tf, TF-IDF, BM25, and PL2.
For more details, visit https://pyterrier.readthedocs.io/en/latest/experiments.html.

In [None]:
tf = pt.BatchRetrieve(index, wmodel="Tf")
tfidf = pt.BatchRetrieve(index, wmodel="LemurTF_IDF")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
pl2 = pt.BatchRetrieve(index, wmodel="PL2")

#Analyzing the MAP values for the query set
pt.Experiment([tf, tfidf, bm25, pl2], topics, qrels, eval_metrics=["map"],perquery=False,round=4)