# University of Pavia
## Artificial Intelligence BSc
### Information Retrieval and Recommender Systems

### Authors:
 - Michele Ventimiglia
 - Manuel Dellabona

This script is part of the Clinical Trials SE project and is released under the GNU General Public License v3.0:
https://www.gnu.org/licenses/gpl-3.0.html#license-text

## Setup

In [None]:
from _setup import check
check(verbose=True)

### Requirements

In [1]:
from src.tools.setup import check_requirements
check_requirements("./requirements.txt")

### Modules

In [2]:
import os

from src.preprocessing.indexing import Indexer
from src.preprocessing.transform import Transformer
from src.preprocessing.extract import XMLExtractor, PickleExtractor  # use PickleExtractor for the TREC 2021 data

### Paths

In [3]:
DATA_PATH = "path\\to\\ClinicalTrialsSE\\data"
DATASET21_PATH = "path\\to\\ClinicalTrialsSE\\data\\TREC21"
DATASET22_PATH = "path\\to\\ClinicalTrialsSE\\data\\TREC22"
INDEXING_FILES_PATH = "path\\to\\ClinicalTrialsSE\\data\\index"
JDK_PATH = "path\\to\\Java\\jdk-21\\bin"
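
The backslash-separated placeholders above are Windows-specific. A minimal sketch of a more portable alternative, assuming the same directory layout (`data/TREC21`, `data/TREC22`, `data/index`). The helper name `build_paths` is hypothetical and not part of the project:

```python
from pathlib import Path


def build_paths(project_root):
    """Derive the data directories from a single project root (hypothetical helper)."""
    root = Path(project_root)
    data = root / "data"
    return {
        "DATA_PATH": data,
        "DATASET21_PATH": data / "TREC21",
        "DATASET22_PATH": data / "TREC22",
        "INDEXING_FILES_PATH": data / "index",
    }


# Placeholder root, mirroring the cell above; replace it with the real location.
paths = build_paths(Path("path") / "to" / "ClinicalTrialsSE")
print(paths["DATASET21_PATH"])
```

Whether the project classes accept `Path` objects is not confirmed here; wrapping a value in `str(...)` before passing it on is the safe option.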

## Documents conversion

In [4]:
pextractor = PickleExtractor()
extractor = XMLExtractor()

In [5]:
if __name__ == '__main__':
    raw_documents = pextractor.process_data(
        file_path = os.path.join(DATA_PATH, 'TREC21.pkl'),
        save_path = DATASET21_PATH,  # be careful when setting, a lot of files will be generated!
        save = False,
        n_max = -1  # set -1 to process all documents (time and resource expensive!)
    )

In [6]:
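# Alternative: extract from the XML files instead of the pickle.
# DATASET22_PATH is assumed here, since the TREC 2021 data is loaded from the pickle above.
# if __name__ == '__main__':
#     raw_documents = extractor.process_docs(
#         folder_path = DATASET22_PATH,
#         parallel_processing = True
#     )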

In [7]:
raw_documents[0]

{'docno': 'NCT00976963',
 'text': "NCT00976963 Single Dose Monurol for Treatment of Acute Cystitis Single Dose Monurol for Treatment of Acute Cystitis Urinary tract infecton (UTI) is a very common problem in young healthy women, afflicting\r\n      approximately one-half of women by their late 20's. One of the most common antibiotics used\r\n      to treat UTIs is Trimethoprim-sulfa (TMP-SMX), usually for total of three days. However,\r\n      concerns about increased antibiotic resistance have led to increased interest in studying\r\n      other antibiotics for UTI.\r\n\r\n      An alternative antibiotic which is also FDA approved for the treatment of UTIs is fosfomycin\r\n      (Monurol). The effectiveness of fosfomycin in curing UTIs when given as a single dose is not\r\n      well studied. The purpose of this research study is to determine what the cure rates are with\r\n      a single dose of fosfomycin versus the more standard 3-day course of TMP-SMX. Procedures subjects will und
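
The raw `text` field still carries the line breaks and indentation of the source record (`\r\n` followed by runs of spaces). As an illustration only, and not necessarily what `Transformer` does internally, a record of this shape could be normalized with a small helper. `collapse_whitespace` is a hypothetical name and the sample is a short fragment of the output above:

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (line breaks, indentation) with one space."""
    return re.sub(r"\s+", " ", text).strip()


# Short fragment of the record shown above, used as sample input.
sample = {
    "docno": "NCT00976963",
    "text": "afflicting\r\n      approximately one-half of women",
}
cleaned = collapse_whitespace(sample["text"])
print(cleaned)  # afflicting approximately one-half of women
```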

## Preprocessing

In [8]:
transformer = Transformer(
    save_path = DATA_PATH,
    verbose = True
)

[i] spaCy v3.7.2


In [9]:
if __name__ == '__main__':
    documents = transformer.process_docs(
        documents = raw_documents,
        parallel_processing = True,
        batch_size = 1000
    )


In [12]:
documents[0]

{'docno': 'NCT00976924', 'text': 'general volunteers patient monitoring clinical interventional medical inclusion year function strips glucose blood device pressure range iso15197 strip diagnostic test meter 22 exclusion criteria accept monitor tianjin 78 hospital nct00976924 accuracy healthy university diabetes', 'title': 'Clinical Test of Blood Glucose Test Strips', 'summary': 'Blood glucose test strips are tested with the test meters to test the accuracy of the blood\r\n      pressure monitoring function.'}


In [13]:
transformer._save(documents)

## Indexing

In [10]:
indexer = Indexer(
    jdk_path = JDK_PATH,
    file_dir = INDEXING_FILES_PATH,
    verbose = True
)

[i] Pandas v2.1.3
[i] PyTerrier v0.10.0


PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8



In [11]:
indexer.index(
    data = documents,
    index_folder = INDEXING_FILES_PATH
)

22:52:27.524 [main] WARN org.terrier.structures.indexing.Indexer - Indexed 4 empty documents


100%|██████████| 375580/375580 [05:06<00:00, 1225.26documents/s]
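
The Terrier warning reports 4 empty documents. A minimal sketch of how such records might be located before indexing, assuming each document is a dict with `docno` and `text` keys as in the outputs above. `find_empty_documents` is a hypothetical helper, and the two empty records are made up for illustration:

```python
def find_empty_documents(documents, field="text"):
    """Return the docno of every record whose text field is missing or blank."""
    return [
        doc.get("docno")
        for doc in documents
        if not str(doc.get(field) or "").strip()
    ]


sample_docs = [
    {"docno": "NCT00976924", "text": "accuracy healthy university diabetes"},
    {"docno": "DOC_EMPTY", "text": "   "},  # hypothetical record for illustration
    {"docno": "DOC_NONE", "text": None},    # hypothetical record for illustration
]
empty_ids = find_empty_documents(sample_docs)
print(empty_ids)  # ['DOC_EMPTY', 'DOC_NONE']
```

This only catches blank or missing strings. Terrier may also count a document as empty when no indexable terms remain after its own tokenization, so the two counts need not match.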
