Start looking into query expansion with clinical notes.
* Process the data.
* Replicate the paper's analysis.
    * Include the search with the expert keywords and combine it with the alternatives.
* Add ranking alternatives, for example a semantic one.
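
One possible way to combine the expert-keyword run with the alternative runs is reciprocal rank fusion (RRF). This is only a sketch under assumptions: the plan so far does not say how the runs will be combined, `reciprocal_rank_fusion` and the run names are hypothetical, and `k=60` is the commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(runs, k=60):
    """Fuse several rankings of doc ids into one (hypothetical helper).

    runs: list of ranked lists of doc ids, best first.
    Each document scores sum(1 / (k + rank)) over the runs where it appears.
    """
    scores = defaultdict(float)
    for ranking in runs:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Made-up doc ids, only to show the call.
expert_keywords_run = ['d1', 'd2', 'd3']
summary_run = ['d2', 'd4', 'd1']
fused = reciprocal_rank_fusion([expert_keywords_run, summary_run])
```

Documents that rank well in both runs (`d2`, `d1`) end up ahead of the ones that appear in only one.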

[Investigating the Impact of Query Representation on Medical Information Retrieval](https://drive.google.com/file/d/1V9P9idStG4Gfaf4ZZ5Gpn6ldIk0Zcdq_/view)

They have a repo: https://github.com/GiorgosPeikos/inf_extraction_med_ir

It is a lot of code, and I doubt all of it is strictly necessary. It is built on pyterrier. Index creation is not included, but retrieval and evaluation are (they use the package's metrics).

Check what format the qrels data needs to be in (I think those are what it needs to run the evaluation on its own).
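
For reference, a TREC qrels line has four whitespace-separated fields: topic, iteration, document id, judgment. A minimal sketch of reading them into rows follows. The column names `qid`, `docno` and `label` are the ones pyterrier's evaluation expects in its qrels DataFrame as far as I know, so double-check against the installed version. `parse_qrels` and the sample lines are made up for illustration.

```python
def parse_qrels(lines):
    """Turn TREC qrels lines into dict rows (hypothetical helper)."""
    rows = []
    for line in lines:
        parts = line.split()  # handles both spaces and tabs
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, judgment = parts
        rows.append({'qid': topic, 'docno': doc_id, 'label': int(judgment)})
    return rows

sample = ['1 0 1234567 2', '1\t0\t7654321\t0', '']
rows = parse_qrels(sample)
# pd.DataFrame(rows) would then give the qid / docno / label frame.
```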

https://www.trec-cds.org/2016.html

https://www.trec-cds.org/2023.html (two months left; the content is closer to what is in the different MIMIC sections, but we have nothing to use as a reference... the 2022 qrels were never uploaded)

Judgment ("qrels") file. Judgment of 0 is non-relevant, 1 is excluded, and 2 is eligible.
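
Metrics that expect binary relevance need these three levels collapsed. A small sketch, assuming only "eligible" (2) counts as relevant; whether "excluded" (1) should also count is a choice these notes do not settle, and `binarize` is a hypothetical name.

```python
def binarize(judgments, min_label=2):
    """Map {doc_id: 0/1/2} to {doc_id: 0/1}; labels >= min_label are relevant."""
    return {doc: int(label >= min_label) for doc, label in judgments.items()}

topic_judgments = {'NCT00000001': 0, 'NCT00000002': 1, 'NCT00000003': 2}  # made-up ids
strict = binarize(topic_judgments)                # only eligible counts
lenient = binarize(topic_judgments, min_label=1)  # excluded counts too
```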

* 2014 and 2015 -- pmc-text-02.tar.gz 2014-01-2 (the ids do not match), but they do match what is in ``ir_datasets``.
* 2021 and 2022 -- ClinicalTrials.2021-04-27.part1.zip (all the ids are present)
* adhoc - clinicaltrials.gov-16_dec_2015 (8 ids are missing against the 2016 one; if I do it against the 2021 one, only 2 ids are missing)
    * Although this one is meant for clinical trials, the cases are the same as for 2014 and 2015, so the keywords could be used in the evaluation of medical documents.

* Save pickle for all queries.
* Save pickle for all qrels.
* Save pickles for clinical trials.
    * One per folder + one summarizing everything.
* Save pickle for documents
    * https://github.com/titipata/pubmed_parser --> A parser that works well! It returns metadata + text
    
NOTE: The ids of the pubmed documents do NOT match the ones in the judgements, buuut ``ir_datasets`` has the dataset already processed. It downloads the same files I downloaded from the page... they should be the same sets I already have.
Once downloaded, it creates a doc_store and in theory everything is ready to be used. (https://github.com/allenai/ir_datasets)
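
A sketch of looking up judged documents through the doc store. `dataset.docs_store()` and its `get` method are the ``ir_datasets`` calls I believe apply here. `fetch_docs` is a hypothetical wrapper, and the fake classes only stand in for the real dataset so that the snippet runs without downloading anything.

```python
def fetch_docs(dataset, doc_ids):
    """Return {doc_id: doc} for the requested ids (hypothetical helper)."""
    store = dataset.docs_store()
    return {doc_id: store.get(doc_id) for doc_id in doc_ids}

# Stand-ins for an ir_datasets dataset, for illustration only.
class FakeStore:
    def __init__(self, docs):
        self._docs = docs

    def get(self, doc_id):
        return self._docs[doc_id]

class FakeDataset:
    def docs_store(self):
        return FakeStore({'100': 'doc 100', '200': 'doc 200'})

found = fetch_docs(FakeDataset(), ['200'])
# With the real data: fetch_docs(ir_datasets.load("pmc/v1/trec-cds-2014"), ids)
```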
    
TODO: The clinical trials may still contain some stray \n, depending on where the files were generated
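
A possible way to handle that TODO is to collapse whitespace recursively after parsing. This is a sketch, not what the notebook currently does; `clean_whitespace` is a hypothetical helper and the record below is made up.

```python
import re

_WS = re.compile(r'\s+')

def clean_whitespace(value):
    """Recursively collapse whitespace runs (newlines included) in nested dict/list/str data."""
    if isinstance(value, str):
        return _WS.sub(' ', value).strip()
    if isinstance(value, list):
        return [clean_whitespace(v) for v in value]
    if isinstance(value, dict):
        return {k: clean_whitespace(v) for k, v in value.items()}
    return value  # numbers, None, etc. are left alone

trial = {'brief_summary': {'textblock': ['line one\n   continues', 'ok']}}  # made-up record
cleaned = clean_whitespace(trial)
```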

In [1]:
import pandas as pd
from tqdm.notebook import tqdm
import os
from bs4 import BeautifulSoup
import pickle

In [None]:
# could have used pandas :face-palm:
def get_topics(file_name): # returns [{number: XX, type: XX, description: XX, summary: XX}]
    with open(file_name, 'r',encoding='utf8') as f:
        data = f.read()
    data = BeautifulSoup(data, "xml")
    topic_list = []
    topics = data.find('topics') # may be empty
    if topics is not None:
        task = topics.get('task')
        for topic in topics.find_all('topic'):
            number = topic.get('number')
            summary = topic.summary
            if summary is not None:
                summary = BeautifulSoup(summary.text,'html').text.strip()
                description = BeautifulSoup(topic.description.text,'html').text.strip()
            else:
                description = BeautifulSoup(topic.text,'html').text.strip()
            type_ = topic.get('type')

            tt = {'number':number, 'description':description}
            if type_ is not None:
                tt['type'] = type_
            if summary is not None:
                tt['summary'] = summary
            topic_list.append(tt)
        return topic_list
    for topic in data.find_all('TOP'):
        topic_list.append({'number':topic.NUM.text, 'description':BeautifulSoup(topic.TITLE.text,'html').text.strip()})
    return topic_list    

In [None]:
from collections import defaultdict

def get_judgements(file_name): # trec judgements
    df = pd.read_csv(file_name,sep=' ',header=None) 
    if len(df.columns) == 1:
          df = pd.read_csv(file_name,sep='\t',header=None) 

    df = df.drop(columns=[1]) # column 1 (iteration) can be dropped

    relevants = defaultdict(dict) # {topic : {document : score}}
    relevant_score = defaultdict(defaultdict(set).copy) # {topic: {score : {documents}}}
    for i in tqdm(range(0,len(df))):
        relevants[df[0].values[i]][df[2].values[i]] = df[3].values[i]
        relevant_score[df[0].values[i]][df[3].values[i]].add(df[2].values[i])
        
    return relevants, relevant_score
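
The `defaultdict(defaultdict(set).copy)` factory above looks odd next to the more common `lambda: defaultdict(set)`. My understanding is that the bound `copy` method keeps the structure picklable, while a lambda factory does not. The sketch below checks this on toy data (the ids are made up).

```python
import pickle
from collections import defaultdict

# Factory is the bound .copy of an empty defaultdict(set): each call returns a fresh one.
by_score = defaultdict(defaultdict(set).copy)  # {topic: {score: {documents}}}
by_score[1][2].add('doc_a')
by_score[1][0].add('doc_b')

restored = pickle.loads(pickle.dumps(by_score))
restored[7][1].add('doc_c')  # nested defaults still work after the round trip

# The lambda variant behaves the same in memory but cannot be pickled.
with_lambda = defaultdict(lambda: defaultdict(set))
with_lambda[1][2].add('doc_a')
try:
    pickle.dumps(with_lambda)
    lambda_pickles = True
except (pickle.PicklingError, AttributeError):
    lambda_pickles = False
```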

In [None]:
matching_files = {} # {topics file : qrels file}
matching_files['topics2014.xml'] = 'qrels2014.txt'
matching_files['topics2015B.xml'] = 'qrels-treceval-2015.txt'
matching_files['topics2021.xml'] = 'qrels2021.txt'
# matching_files['topics2022.xml'] = None
matching_files['adhoc-queries.json'] = 'qrels-clinical_trials.txt'

matching_files

In [None]:
dir_topics = './'

In [None]:
topics = {}
judgements = {}

for t,j in matching_files.items():
    if not t.endswith('.json'):
        topics[t] = get_topics(dir_topics + t)  
    judgements[j] = get_judgements(dir_topics + j)
    
with open(dir_topics + '__all_judgements.pickle','wb') as file:
    pickle.dump(judgements,file)

In [None]:
import json
with open(dir_topics + 'adhoc-queries.json','r') as file:
    jj = json.load(file)

for i in tqdm(range(0,len(jj))):
    topic = jj[i]
    topic['number'] = topic['qId']
    del topic['qId']
    topic['task'] = topic['queryType']
    del topic['queryType']
    
    orig = topic['number'].split('-')
    orig[0] = 'topics2014' if orig[0] == 'trec2014' else 'topics2015B' 
    for tt in topics[orig[0]+'.xml']:
        if tt['number'] == orig[1]:
            topic['summary'] = tt['summary']
            break

topics['adhoc-queries.json'] = jj
with open(dir_topics + '__all_topics.pickle','wb') as file:
    pickle.dump(topics,file)

----------------------------------

Clinical trials

In [None]:
import xmltodict

import re
re_spaces = re.compile(r'\n\s+') # raw string avoids the invalid-escape warning for \s

def load_clinical_trial(file_name=None,text=None):
    
    def process_textblock(text):
        xx = [] 
        text = text.replace('\r','')
        for q in text.split('\n\n'): 
            x = q.split('\n') 
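            x = [z.strip() for z in x] # lines that were wrongly split need to be merged back
            for i in range(1,len(x)):
                if x[i] == '':
                    continue
                j = i-1
                while j >= 0 and x[j] == '': # stop at the start instead of wrapping around to x[-1]
                    j = j-1
                if j < 0: # no previous non-empty line to merge into
                    continue
                if not x[i][0].isupper() and not x[j][-1] == '.':
                    x[j] = x[j] + ' ' + x[i]
                    x[i] = ''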
            x = [z.strip() for z in x if len(z) > 0] 
            xx.extend(x)
        aa = []
        title = None
        crit = {}
        for z in xx:
            z = z.strip()
            if z.endswith(':'):
                if title is not None:
                    crit[title] = aa
                title = z[:-1].lower()
                aa = []
            elif title is not None:
                aa.append(z.strip() if not z.startswith('-') else z[1:].strip())
        if title is not None and len(aa) > 0:
            crit[title] = aa
        if len(crit) == 0:
            for l in xx:
                l = l.split(':')
                if l[0].isupper():
                    crit[l[0]] = ' '.join(l[1:]).strip()
        if len(crit) == 0:
            crit = xx
        return crit           
    
    if file_name is not None:
        with open(file_name,encoding='utf8') as xml_file:
            d = xmltodict.parse(xml_file.read())
    else:
        d = xmltodict.parse(text)
    
    if 'eligibility' in d['clinical_study'] and 'criteria' in d['clinical_study']['eligibility'] and 'textblock' in d['clinical_study']['eligibility']['criteria']:
        d['clinical_study']['eligibility']['criteria']['textblock'] = process_textblock(d['clinical_study']['eligibility']['criteria']['textblock'])
    
    if 'eligibility' in d['clinical_study'] and 'study_pop' in d['clinical_study']['eligibility'] and 'textblock' in d['clinical_study']['eligibility']['study_pop']:
        d['clinical_study']['eligibility']['study_pop']['textblock'] = process_textblock(d['clinical_study']['eligibility']['study_pop']['textblock'])
    
    if 'brief_summary' in d['clinical_study'] and 'textblock' in d['clinical_study']['brief_summary']:
        d['clinical_study']['brief_summary']['textblock'] = process_textblock(d['clinical_study']['brief_summary']['textblock'])
    
    if 'detailed_description' in d['clinical_study'] and 'textblock' in d['clinical_study']['detailed_description']:
        d['clinical_study']['detailed_description']['textblock'] = process_textblock(d['clinical_study']['detailed_description']['textblock'])
    
    d = d['clinical_study']
    return d

In [2]:
# clinical_trials_dir = 'C:/Users/Anto/Desktop/clinical_trials/'
clinical_trials_dir = 'D:/clinical_trials/'
prefix = 'ClinicalTrials.2021-04-27'

In [None]:
clinical_trials = {}
for dd in tqdm(os.listdir(clinical_trials_dir + prefix + '/')):
    
    if os.path.exists(clinical_trials_dir + prefix + '__'+ dd + '.pickle'):
        continue
    
    for ct in tqdm(os.listdir(clinical_trials_dir + prefix + '/' + dd)): # dd lives inside the prefix folder listed above
        ct_file = clinical_trials_dir + prefix + '/' + dd + '/' + ct
        print(ct_file)
        ct_dict = load_clinical_trial(ct_file)
        clinical_trials[ct[0:-4]] = ct_dict
        
    with open(clinical_trials_dir + prefix + '__'+ dd + '.pickle','wb') as file:
        pickle.dump(clinical_trials,file)
    clinical_trials.clear()

In [None]:
# from the pickles, keep only the summary, description and eligibility to reduce the size a bit
from collections import defaultdict

def summarize_trials(dir_,prefix):
    reduced_trials = {}
    for ff in tqdm(os.listdir(dir_)):
        if not ff.endswith('.pickle') or not ff.startswith(prefix):
            continue
        trials = pd.read_pickle(dir_ + ff)

        for k,d in tqdm(trials.items()):
            tt = defaultdict(defaultdict(dict).copy)
            if 'eligibility' in d and 'criteria' in d['eligibility'] and 'textblock' in d['eligibility']['criteria']:
                tt['eligibility']['criteria']['textblock'] = d['eligibility']['criteria']['textblock'] 

            if 'brief_summary' in d and 'textblock' in d['brief_summary']:
                tt['brief_summary']['textblock'] = d['brief_summary']['textblock']

            if 'detailed_description' in d and 'textblock' in d['detailed_description']:
                tt['detailed_description']['textblock'] = d['detailed_description']['textblock'] 
            reduced_trials[k] = tt
    return reduced_trials

In [None]:
reduced_trials = summarize_trials(clinical_trials_dir,prefix)

with open(clinical_trials_dir + prefix + '__document_summary.pickle','wb') as file:
    pickle.dump(reduced_trials,file)
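
Since `process_textblock` can leave a `textblock` as a dict of criteria, a list of lines, or a plain string, indexing later will probably need a single string per field. A hedged sketch of that flattening; `flatten_textblock` is a hypothetical helper and the sample record is made up.

```python
def flatten_textblock(block):
    """Join a textblock (dict, list or str) into a single string."""
    if block is None:
        return ''
    if isinstance(block, str):
        return block.strip()
    if isinstance(block, dict):
        # Keep the section title next to its content, e.g. "inclusion criteria: ...".
        parts = [f"{title}: {flatten_textblock(content)}" for title, content in block.items()]
        return ' '.join(parts)
    return ' '.join(flatten_textblock(item) for item in block)

criteria = {'inclusion criteria': ['Age over 18', 'Signed consent'],
            'exclusion criteria': ['Pregnancy']}
flat = flatten_textblock(criteria)
```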

In [None]:
prefix = 'clinicaltrials.gov-16_dec_2015'

In [None]:
import tarfile

In [None]:
tar = tarfile.open(clinical_trials_dir + prefix + ".tgz", "r:gz")

clinical_trials_2015 = {}

if os.path.exists(clinical_trials_dir + prefix + '__document_summary.pickle'):
    processed = pd.read_pickle(clinical_trials_dir + prefix + '__document_summary.pickle').keys()
else:
    processed = set()
    for ff in os.listdir(clinical_trials_dir):
        if not ff.startswith(prefix) or not ff.endswith('.pickle'):
            continue
        processed.update(pd.read_pickle(clinical_trials_dir + ff).keys())
        
for member in tqdm(tar): # iterate lazily; tar.getmembers() would build the full list first
    f = tar.extractfile(member)
    name = member.name.rsplit('/', 1)[-1] # file name without its folder
    if f is None or name.startswith('._'):
        continue
        
    if name in processed:
        continue
        
    clinical_trials_2015[name] = load_clinical_trial(text=f.read())
    
    if len(clinical_trials_2015) % 4000 == 0:
        with open(clinical_trials_dir + prefix + '_' + name + '.pickle','wb') as file:
            pickle.dump(clinical_trials_2015,file)
        clinical_trials_2015.clear()

if len(clinical_trials_2015) > 0:
    with open(clinical_trials_dir + prefix + '_' + name + '.pickle','wb') as file:
        pickle.dump(clinical_trials_2015,file)
    clinical_trials_2015.clear()
        
tar.close()
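
The save-every-N-items and skip-what-is-done pattern is used for both the trials and the documents. A generic sketch of the same idea, with hypothetical names (`process_in_chunks`, `parse`) and a temporary folder in place of the real paths:

```python
import os
import pickle
import tempfile

def process_in_chunks(items, parse, out_dir, prefix, chunk_size=4000):
    """Parse (key, raw) pairs, pickling every chunk_size results; skip keys already saved."""
    done = set()
    for name in os.listdir(out_dir):
        if name.startswith(prefix) and name.endswith('.pickle'):
            with open(os.path.join(out_dir, name), 'rb') as fh:
                done.update(pickle.load(fh).keys())

    def flush(chunk, last_key):
        # Name each chunk after the last key it contains, as the cells above do.
        with open(os.path.join(out_dir, f'{prefix}_{last_key}.pickle'), 'wb') as fh:
            pickle.dump(chunk, fh)

    chunk, last_key = {}, None
    for key, raw in items:
        if key in done:
            continue
        chunk[key] = parse(raw)
        last_key = key
        if len(chunk) == chunk_size:
            flush(chunk, last_key)
            chunk = {}
    if chunk:
        flush(chunk, last_key)

out_dir = tempfile.mkdtemp()
data = [('a', 'x'), ('b', 'y'), ('c', 'z')]
process_in_chunks(data, str.upper, out_dir, 'demo', chunk_size=2)
first_run = sorted(os.listdir(out_dir))
process_in_chunks(data, str.upper, out_dir, 'demo', chunk_size=2)  # nothing new to do
second_run = sorted(os.listdir(out_dir))
```

Re-running after an interruption only processes the keys that are not yet in any saved chunk.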

In [None]:
reduced_trials = summarize_trials(clinical_trials_dir,prefix)

if os.path.exists(clinical_trials_dir + prefix + '__document_summary.pickle'):
    reduced_trials.update(pd.read_pickle(clinical_trials_dir + prefix + '__document_summary.pickle'))

with open(clinical_trials_dir + prefix + '__document_summary.pickle','wb') as file:
    pickle.dump(reduced_trials,file)

----------------------------------------

Pubmed documents

In [None]:
documents_dir = 'C:/Users/Anto/Downloads/'
doc_med_files = ['pmc-00.tar.gz', 'pmc-01.tar.gz', 'pmc-02.tar.gz', 'pmc-03.tar.gz'] 

In [None]:
import pubmed_parser as pp

processed = set()
for ff in tqdm(os.listdir(documents_dir)):
    if not ff.startswith('__documents_'):
        continue
    processed.update(pd.read_pickle(documents_dir + ff).keys())

print(len(processed))
    
docs = {}
for i in range(0,len(doc_med_files)):
    tar = tarfile.open(documents_dir + doc_med_files[i], "r:gz")
    for member in tqdm(tar): 
        if not member.name.endswith('xml'):
            continue
            
        if member.name[member.name.rindex('/')+1:-5] in processed:
            continue
            
        tar.extract(member,documents_dir)
#         print(documents_dir + member.name)
        dict_out = pp.parse_pubmed_xml(documents_dir + member.name) # metadata
        dict_out['text'] = pp.parse_pubmed_paragraph(documents_dir + member.name,all_paragraph=True) # text
        os.remove(documents_dir + member.name) 

        docs[member.name[member.name.rindex('/')+1:-5]] = dict_out
        if len(docs) % 1000 == 0:
            with open(documents_dir + '__documents_'+member.name[member.name.rindex('/')+1:-5]+'.pickle','wb') as file:
                pickle.dump(docs,file)
            docs.clear()

if len(docs) > 0:
    with open(documents_dir + '__documents_'+member.name[member.name.rindex('/')+1:-5]+'.pickle','wb') as file:
        pickle.dump(docs,file)
    docs.clear()

In [None]:
# summarize documents in one pickle title, keywords (?), abstract

from collections import defaultdict

def summarize_documents(dir_,prefix):
    reduced_trials = {}
    for ff in tqdm(os.listdir(dir_)):
        if not ff.endswith('.pickle') or not ff.startswith(prefix):
            continue
        trials = pd.read_pickle(dir_ + ff)

        for k,d in tqdm(trials.items()): # 'full_title', 'abstract', 'journal', 'pmid', 'pmc',
            tt = defaultdict(defaultdict(dict).copy)
            if 'full_title' in d:
                tt['full_title'] = d['full_title']

            if 'abstract' in d:
                tt['abstract'] = d['abstract']

            if 'journal' in d:
                tt['journal'] = d['journal']
                
            if 'pmid' in d:
                tt['pmid'] = d['pmid']
                
            if 'pmc' in d:
                tt['pmc'] = d['pmc']
                
            reduced_trials[k] = tt
    return reduced_trials

In [None]:
# documents_dir = 'D:/'
prefix = '__documents_'

reduced_trials = summarize_documents(documents_dir,prefix)

if os.path.exists(documents_dir + prefix + '__document_summary.pickle'):
    reduced_trials.update(pd.read_pickle(documents_dir + prefix + '__document_summary.pickle'))

with open(documents_dir + prefix + '__document_summary.pickle','wb') as file:
    pickle.dump(reduced_trials,file)

Check that none of the documents we need are missing, for both pmc and clinical trials

In [27]:
all_judgements = pd.read_pickle(clinical_trials_dir + '__all_judgements.pickle')
all_judgements.keys()

dict_keys(['qrels2014.txt', 'qrels-treceval-2015.txt', 'qrels2021.txt', 'qrels-clinical_trials.txt'])

In [31]:
documents = pd.read_pickle(clinical_trials_dir + '__documents___document_summary.pickle')
# documents = pd.read_pickle(clinical_trials_dir + 'ClinicalTrials.2021-04-27__document_summary.pickle')
# documents = pd.read_pickle(clinical_trials_dir + 'clinicaltrials.gov-16_dec_2015__document_summary.pickle')
len(documents)

939770

In [28]:
docs = set()
for jj in all_judgements['qrels2014.txt'][1].values():# for each topic, {score : {docs}}
    docs.update(jj[1])
    docs.update(jj[2])
len(docs)

3271

In [7]:
docs - set(documents.keys()) # for 2016, the xml extension has to be stripped from the name
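
The same check can be wrapped so it runs for every qrels file. A sketch under assumptions: `missing_documents` is a hypothetical helper, the toy data is made up, and `strip_suffix` covers the case where the stored keys still carry the `.xml` extension.

```python
def missing_documents(score_to_docs_by_topic, doc_keys, min_score=1, strip_suffix=''):
    """Return judged doc ids with score >= min_score that are absent from doc_keys."""
    available = set()
    for key in doc_keys:
        key = str(key)  # qrels ids may have been parsed as ints, so compare as strings
        if strip_suffix and key.endswith(strip_suffix):
            key = key[:-len(strip_suffix)]
        available.add(key)

    needed = set()
    for score_to_docs in score_to_docs_by_topic.values():
        for score, docs in score_to_docs.items():
            if score >= min_score:
                needed.update(str(d) for d in docs)
    return needed - available

scores = {1: {0: {'NCT1'}, 1: {'NCT2'}, 2: {'NCT3', 'NCT4'}}}
stored = ['NCT2.xml', 'NCT3.xml']
missing = missing_documents(scores, stored, strip_suffix='.xml')
```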

In theory, the pmc document datasets are available in ``ir_datasets`` and can be passed almost directly to ``pyterrier`` for indexing.
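
A sketch of the "almost directly" part: pyterrier's iterable-dict indexing consumes dictionaries with a `docno` key plus text fields (`text` by default, to my knowledge), so the ``ir_datasets`` tuples need a small adapter. `to_index_dicts` is a hypothetical name, and the namedtuple below only mimics the `<doc_id, journal, title, abstract, body>` layout.

```python
from collections import namedtuple

def to_index_dicts(docs):
    """Yield {'docno', 'text'} dicts from (doc_id, journal, title, abstract, body) tuples."""
    for doc in docs:
        # Skip empty fields so the text does not pick up stray spaces.
        text = ' '.join(part for part in (doc.title, doc.abstract, doc.body) if part)
        yield {'docno': str(doc.doc_id), 'text': text}

PmcDoc = namedtuple('PmcDoc', ['doc_id', 'journal', 'title', 'abstract', 'body'])  # stand-in
sample_docs = [PmcDoc('123', 'J', 'A title', 'An abstract', '')]
index_rows = list(to_index_dicts(sample_docs))
# With the real data this could feed something like
# pt.IterDictIndexer(index_path).index(to_index_dicts(dataset.docs_iter()))
```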

It downloads and decompresses the datasets; they do not include the body, only the abstract.
The downloaded dataset has 147k documents that are not in the other one... Odd.
Odder still, the ids of the relevant documents do not match the TREC ones!

As a "simple" solution, save all of this new data and use it from the structures I had already created.

* The 2015 one uses the same documents, so only the qrels and the topics need to be saved in addition.
* The topics are the same in both cases.
* The qrels are the same in both cases.

https://ir-datasets.com/pmc.html#pmc/v1/trec-cds-2014

In [3]:
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2014") # same documents as the 2015 one!

In [6]:
import pickle
import re
re_spaces = re.compile(r'\s+')
docs_ir = {} 

i = 0

for doc in tqdm(dataset.docs_iter()): # namedtuple<doc_id, journal, title, abstract, body>
    docs_ir[doc[0]] = {'journal' : doc[1], 'title':doc[2], 
                       'abstract':doc[3],
                       'body': re_spaces.sub(' ',doc[4])}  # \s+ also matches newlines, so words are not glued together
    i += 1
    if i % 100000 == 0:
        with open(clinical_trials_dir + '__documents_ir_2014_body_'+str(i)+'.pickle','wb') as file:
            pickle.dump(docs_ir,file)
        docs_ir.clear()

if len(docs_ir) > 0:
    with open(clinical_trials_dir + '__documents_ir_2014_body_'+str(i)+'.pickle','wb') as file:
        pickle.dump(docs_ir,file)
    docs_ir.clear()

In [None]:
from collections import defaultdict
qrels = defaultdict(dict) # {topic : {document : score}}
qrels_scores = defaultdict(defaultdict(set).copy) # {topic: {score : {documents}}}

for query in tqdm(dataset.qrels_iter()):
    qrels[int(query[0])][int(query[1])] = int(query[2])
    qrels_scores[int(query[0])][int(query[2])].add(int(query[1]))

with open(clinical_trials_dir + 'qrles_ir_2014.pickle','wb') as file:
    pickle.dump((qrels,qrels_scores),file)    
len(qrels)

In [71]:
queries = []
for query in tqdm(dataset.queries_iter()):
    queries.append({'number': query[0], 'type':query[1],'description':query[2],'summary':query[3]})
    
with open(clinical_trials_dir + 'queries_ir_2014.pickle','wb') as file:
    pickle.dump(queries,file)

In [107]:
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2015")

In [108]:
qrels = defaultdict(dict) # {topic : {document : score}}
qrels_scores = defaultdict(defaultdict(set).copy) # {topic: {score : {documents}}}

for query in tqdm(dataset.qrels_iter()):
    qrels[int(query[0])][int(query[1])] = int(query[2])
    qrels_scores[int(query[0])][int(query[2])].add(int(query[1]))

with open(clinical_trials_dir + 'qrles_ir_2015.pickle','wb') as file:
    pickle.dump((qrels,qrels_scores),file)    
len(qrels)


30

In [75]:
queries = []
for query in tqdm(dataset.queries_iter()):
    queries.append({'number':str(query[0]), 'type':query[1],'description':query[2],'summary':query[3]})
    
with open(clinical_trials_dir + 'queries_ir_2015.pickle','wb') as file:
    pickle.dump(queries,file)

In [118]:
all_judgements = pd.read_pickle(clinical_trials_dir + '__all_judgements.pickle')
all_judgements.keys()

dict_keys(['qrels2014.txt', 'qrels-treceval-2015.txt', 'qrels2021.txt', 'qrels-clinical_trials.txt'])