## Cooking with ClarityNLP - Session #10: ClarityNLP, Clinical Trials, and Language Models

For today's cooking session we will use ClarityNLP to find mentions of chemotherapy regimens in clinical trial eligibility criteria. We will then use the extracted text to build an ngram language model for recognizing chemotherapy regimens, and we will estimate its performance via K-fold cross validation.

For details on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues).

## ClinicalTrials.gov and the AACT Database

The U.S. National Library of Medicine maintains a website called [ClinicalTrials.gov](https://clinicaltrials.gov/), which hosts a large database of clinical trial information. The data includes the trial objectives, inclusion and exclusion criteria, the participating medical centers, the personnel conducting the trial, and other related information. Each clinical trial is assigned a unique **NCT ID**, such as NCT03601923, which is the ID for a now-recruiting trial to evaluate the drug Niraparib as a treatment for pancreatic cancer.

The home page for ClinicalTrials.gov provides a web form by which users can enter search criteria and find clinical trials of interest:

![clinicaltrials_gov.png](session_10/clinicaltrials_gov.png)

This form is convenient for interactive searches, but larger-scale analysis of the clinical trial data requires access to the database, which this site does not provide.

Luckily, there is a [related AACT website](https://aact.ctti-clinicaltrials.org/) that hosts a downloadable PostgreSQL database containing all of the data at clinicaltrials.gov. As of this writing, the current download is 871 MB in size, and it contains data for more than 289,000 clinical trials. The download file is updated at near-daily intervals. If you want to run natural language processing or machine learning algorithms on the clinical trial data, the AACT database is what you need. The AACT website contains links to download and installation instructions, the data dictionary and database schema, and lots of other documentation.

### Loading the AACT Database into Solr

To be able to interrogate the AACT data with ClarityNLP, the relevant text from the clinical trials needs to be ingested into Solr. At GA Tech we downloaded a snapshot of the AACT database, installed it, and ingested the eligibility criteria into our Solr instance. Some of the Python scripts that we used can be found in our [utilities project](https://github.com/ClarityNLP/Utilities); look for the .py files containing *aact* in the filename if you're interested. The Solr fields that ClarityNLP expects are described in our [documentation](https://claritynlp.readthedocs.io/en/latest/developer_guide/technical_background/solr.html).

Our data ingestion process maps these [required fields](https://claritynlp.readthedocs.io/en/latest/developer_guide/technical_background/solr.html) as follows:
* ``report_type``: either ``"Clinical Trial Inclusion Criteria"`` or ``"Clinical Trial Exclusion Criteria"``
* ``report_text``: text of the inclusion or exclusion criteria
* ``source``: ``"AACT"``
* ``subject``: the NCT ID of the trial
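
As a rough sketch of this mapping, the snippet below builds one record per criteria type for a single trial. The function name `make_criteria_docs` is hypothetical and is not taken from the scripts in the utilities project. Only the four fields listed above are populated; a real ingest would also have to supply whatever other fields the Solr documentation marks as required.

```python
def make_criteria_docs(nct_id, inclusion_text, exclusion_text):
    """
    Return a list of dicts, one per non-empty criteria block, using the
    field mapping described above. (Hypothetical helper for illustration.)
    """
    blocks = [
        ("Clinical Trial Inclusion Criteria", inclusion_text),
        ("Clinical Trial Exclusion Criteria", exclusion_text),
    ]

    docs = []
    for report_type, text in blocks:
        # skip trials that are missing this kind of criteria
        if not text or not text.strip():
            continue
        docs.append({
            "report_type": report_type,
            "report_text": text.strip(),
            "source": "AACT",
            "subject": nct_id,
        })
    return docs


docs = make_criteria_docs("NCT03601923", "- Age 18 or older", "")
```

In this example the exclusion text is empty, so only a single inclusion record is produced.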


**IMPORTANT NOTE**: After ingesting this data we found that our Python scripts did not capture all of the eligibility criteria for some trials. The text of the eligibility criteria unfortunately does not conform to the official data dictionary in all instances. The data dictionary [states the following](https://prsinfo.clinicaltrials.gov/definitions.html#EligibilityCriteria) for the eligibility criteria:

<pre>
Eligibility Criteria *
Definition: A limited list of criteria for selection of participants in the clinical study, provided in terms
of inclusion and exclusion criteria and suitable for assisting potential participants in identifying clinical
studies of interest. Use a bulleted list for each criterion below the headers "Inclusion Criteria" and 
"Exclusion Criteria".
Limit: 15,000 characters. 
</pre>

The eligibility criteria can be found in the ``criteria`` column of the ``eligibilities`` table.

Some trials do not use a bulleted list for the criteria; others are missing either inclusion or exclusion criteria or both; others alternate back and forth between inclusion and exclusion, etc. Our ingest scripts do not capture all of the data under these circumstances.
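
To make the difficulty concrete, here is a minimal sketch of a header-based splitter. It is not the code from our ingest scripts; the function name `split_criteria` and the regular expression are assumptions chosen for illustration. It handles criteria that alternate between inclusion and exclusion, but it returns two empty strings when the headers are missing.

```python
import re

# Matches a header such as "Inclusion Criteria:" or "EXCLUSION CRITERIA"
_HEADER_RE = re.compile(r'\b(inclusion|exclusion)\s+criteria\b\s*:?',
                        re.IGNORECASE)


def split_criteria(text):
    """
    Split eligibility text into an (inclusion, exclusion) pair of strings.
    Text that precedes the first header is discarded. Sections of the
    same kind that appear more than once are joined with newlines.
    """
    sections = {'inclusion': [], 'exclusion': []}
    matches = list(_HEADER_RE.finditer(text))

    for i, match in enumerate(matches):
        # a section runs from the end of its header to the next header
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[start:end].strip()
        if body:
            sections[match.group(1).lower()].append(body)

    return '\n'.join(sections['inclusion']), '\n'.join(sections['exclusion'])


sample = ("Inclusion Criteria:\n- Age 18 or older\n\n"
          "Exclusion Criteria:\n- Prior chemotherapy\n")
inclusion, exclusion = split_criteria(sample)
```

A splitter this simple also treats the phrase "inclusion criteria" inside a criterion (for instance, "meets all inclusion criteria") as a new header, which is one reason a cleaner separation takes more work.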

We are working on improved data ingest code and will update our scripts when we are satisfied that we can cleanly separate inclusion from exclusion criteria. For purposes of this cooking session, the scripts work well enough, but we would like to eventually capture everything. Stay tuned!

## Chemotherapy Regimens

Our goal is to be able to find mentions of chemotherapy regimens in the clinical trial eligibility criteria. We would also like to be able to assign a score to each criterion that indicates how confident we are in the identification. For some regimens such as BEACOPP, FOLFOX, RCHOP-14, and XELOX this is not too difficult. For other regimens such as CA, ICE, and PACE this task is somewhat more difficult, since the regimen names match those of common words or abbreviations.

Our approach to this problem is to use ClarityNLP to extract regimen names and the surrounding text from the clinical trial eligibility criteria. We then normalize the text, replace the regimen names with a token, and compute all ngrams that include the regimen token. These ngrams provide us with sample contexts for the use of regimen names. By counting all such ngrams and scoring them in a smoothed language model, we can assign a probability to each ngram and use it to estimate the likelihood that a given ngram contains a regimen name.
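
The normalization and ngram steps can be sketched as follows. This is a simplified illustration rather than the code used later in this session: the token string `<REGIMEN>`, the helper names, and the normalization rules (lowercasing, keeping only letters and digits) are all assumptions. Smoothing and scoring are not shown.

```python
import re
from collections import Counter

REGIMEN_TOKEN = '<REGIMEN>'   # hypothetical placeholder token


def tokenize_with_regimen(sentence, regimen):
    """Lowercase the sentence, swap the regimen name for the token, tokenize."""
    lowered = sentence.lower()
    # lookarounds keep the name from matching inside a longer word
    pattern = re.compile(r'(?<!\w)' + re.escape(regimen.lower()) + r'(?!\w)')
    replaced = pattern.sub(' ' + REGIMEN_TOKEN + ' ', lowered)
    # the uppercase token can only come from the substitution above
    return re.findall(re.escape(REGIMEN_TOKEN) + r'|[a-z0-9]+', replaced)


def ngrams_with_token(tokens, n):
    """Return all n-grams (as tuples) that contain the regimen token."""
    grams = []
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if REGIMEN_TOKEN in gram:
            grams.append(gram)
    return grams


tokens = tokenize_with_regimen('Prior treatment with FOLFOX is allowed', 'FOLFOX')
trigram_counts = Counter(ngrams_with_token(tokens, 3))
```

For this sentence the three trigrams that contain the token are `('treatment', 'with', '<REGIMEN>')`, `('with', '<REGIMEN>', 'is')`, and `('<REGIMEN>', 'is', 'allowed')`. Accumulating such counts over many sentences gives the raw statistics that a smoothed language model would then turn into probabilities.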

### Regimen Names

So where can one find a list of the names of chemotherapy regimens? There happens to be a comprehensive, peer-reviewed oncology wiki at [HemOnc.org](https://hemonc.org/wiki/Main_Page) that contains information on treatment regimens for many different types of cancer. One of our intrepid co-workers scraped this wiki and generated hundreds of NLPQL files for extracting the treatment regimens and relevant drugs from clinical text. The full list can be found [here](https://github.com/ClarityNLP/Utilities/tree/master/regimen_nlpql), along with NLPQL for related drugs and conditions. A JSON file containing all of this data can be found [here](https://github.com/ClarityNLP/Utilities/blob/master/cancer/regimen_tree.json).

We provide a Python utility script that extracts the regimen names from the JSON data and writes them to a .csv file. You can find this utility script [here](https://github.com/ClarityNLP/Utilities/blob/master/get_all_regimens.py). Running this script generates the file ``all_regimen_names.csv``, which you can find in our GitHub repo in the support folder for this notebook: ``ClarityNLP/notebooks/cooking/session_10/``. Here is a sample of what the file contains (1961 lines total):

![all_regimens.png](session_10/all_regimens.png)

The regimen name(s) are found in the first column. Alternate names are found in the second column.

Believe it or not, there are still other chemotherapy regimen names that this file does not contain. We have collected additional regimen names (mostly from Wikipedia) and saved them to the file ``notebooks/session_10/supplemental_chemo_regimens.txt``.

### NLPQL for Finding Chemotherapy Regimens

ClarityNLP requires an NLPQL file as input. Here is a template for an NLPQL file that we can use to find chemotherapy regimens in the AACT data:

<pre>
phenotype "Chemotherapy Regimens" version "1";
include ClarityCore version "1.0" called Clarity;

// NOTE: uses field mappings described above
documentset Docs:
    Clarity.createDocumentSet({
        "report_types":[
            "Clinical Trial Inclusion Criteria",
            "Clinical Trial Exclusion Criteria"
        ],
        "filter_query":"source:AACT"
    });

// << INSERT TERMSET FOR CHEMOTHERAPY REGIMEN NAMES HERE >>

define hasRegimenTerm:
    Clarity.ProviderAssertion({
        termset: [RegimenTerms],
        documentset: [Docs]
    });
</pre>

This NLPQL file will run a [provider assertion](https://claritynlp.readthedocs.io/en/latest/api_reference/nlpql/provider_assert.html) to identify regimen terms in the AACT data. There is a placeholder for the termset - how can we generate it from our collected regimen names?

It is relatively straightforward to extract the data from the CSV file and the supplemental file and generate an NLPQL termset from it. We have saved the results to ``notebooks/cooking/session_10/regimen_termset.txt``. This termset omits any regimen names that are not found in our copy of the AACT data, as well as any regimen names of fewer than four characters (except for the "FCR" and "TAC" regimens, which are common). We found that two- and three-letter acronyms produce too many false positives. The resulting termset includes 601 regimen names.
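
A minimal sketch of the length filter and the termset formatting is shown below. The helper name `build_termset` is hypothetical, and the sketch does not check whether a name actually occurs in the AACT data, since that check needs the ingested documents.

```python
KEEP_SHORT = {'FCR', 'TAC'}   # short names kept despite the length rule
MIN_LENGTH = 4


def build_termset(names, termset_name='RegimenTerms'):
    """Return NLPQL termset text for the given regimen names."""
    kept = []
    seen = set()
    for name in names:
        name = name.strip()
        # drop short names unless they are explicitly allowed
        if len(name) < MIN_LENGTH and name.upper() not in KEEP_SHORT:
            continue
        # drop case-insensitive duplicates, keeping the first spelling
        key = name.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(name)

    lines = ['termset {0}: ['.format(termset_name)]
    for i, name in enumerate(kept):
        comma = ',' if i < len(kept) - 1 else ''
        lines.append('    "{0}"{1}'.format(name, comma))
    lines.append('];')
    return '\n'.join(lines)


termset_text = build_termset(['FOLFOX', 'CA', 'ICE', 'FCR', 'folfox', 'XELOX'])
```

Here `CA` and `ICE` are dropped by the length rule, the duplicate `folfox` is dropped, and `FCR` survives as an allowed exception. Each name lands on its own quoted line, with a trailing comma on every line except the last.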

With the termset in hand, you can paste it into the NLPQL file above and run it. The full NLPQL file can be found in ``notebooks/cooking/find_regimens.nlpql``.

Note that the termset also includes pure drug names, such as ``Methotrexate``, which generally are not the names for a chemotherapy regimen. We prefer to run with these drug names, since they are often used in contexts that discuss chemotherapy regimens, and we do not want to miss any such contexts. The drug names can be pruned from the results in a postprocessing step as we demonstrate below. 
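
Conceptually, the pruning is a set-membership test on the matched term. Here is a minimal sketch, assuming the results are available as (term, sentence) tuples; the helper name `drop_drug_terms` is hypothetical.

```python
def drop_drug_terms(term_sentence_pairs, drug_names):
    """
    Return only those (term, sentence) tuples whose term is not a
    pure drug name. The comparison ignores case and outer whitespace.
    """
    drugs = {d.strip().lower() for d in drug_names if d.strip()}
    return [(term, sentence)
            for term, sentence in term_sentence_pairs
            if term.strip().lower() not in drugs]


pairs = [
    ('FOLFOX', 'Prior FOLFOX is allowed'),
    ('Methotrexate', 'No methotrexate within 4 weeks'),
]
regimen_pairs = drop_drug_terms(pairs, ['methotrexate\n', ''])
```

Only the `FOLFOX` tuple remains after filtering. The drug sentences are dropped only at this stage, so the contexts they provide are still present in the raw results.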

The NLPQL file takes a while to run. On our system it spawned over 2276 tasks and ran for approximately 45 minutes with five tasks in parallel. So if you run this NLPQL file, be patient!


### Dataset Generation

Now we would like to extract from the results the sentences that mention chemotherapy regimens. Unfortunately we cannot just extract the [sentence](https://claritynlp.readthedocs.io/en/latest/api_reference/nlpql/provider_assert.html) field, since the sentence tokenizer is often confused by the strange formatting of the clinical trial criteria data. So we need to do some more work to find the sentences that we need.

First some preliminaries:

In [1]:
import re
import os
import sys
import csv
import nltk

# output files
REGIMEN_FILE = 'session_10/regimens.txt'
SENTENCE_FILE = 'session_10/sentences.txt'

We will need a function to load the phenotype .csv file and extract the "sentence" and matching term from the termset. We will also need functions to load the termset and a list of drug names, so that we can filter the drug names from the results. You can find a list of drug names in the GitHub repo in ``notebooks/cooking/session_10/drug_names.txt``.

Here is the loading code:

In [2]:
###############################################################################
def load_csv_data(csv_file):
    """
    Extract the 'sentence' and 'term' fields from the file.
    Return a list of (term, sentence) tuples.
    """

    data = []
    
    # newline='' lets the csv module handle embedded line breaks itself
    with open(csv_file, 'rt', newline='') as infile:
        dict_reader = csv.DictReader(infile)
        for row_dict in dict_reader:
            sentence = row_dict['sentence']
            term     = row_dict['term']
            data.append((term, sentence))

    return data


###############################################################################
def _load_termset(termset_file):
    """
    Extract individual regimen terms from an NLPQL termset.
    """

    term_list = []
    with open(termset_file, 'rt') as infile:
        for line in infile:
            line = line.strip()

            # regimen names are quoted
            if not line.startswith('"'):
                continue

            # strip quotes and commas
            if line.endswith('",'):
                regimen = line[1:-2]
            else:
                # final term
                regimen = line[1:-1]

            term_list.append(regimen)
            
    return term_list


###############################################################################
def _load_drugs(drug_file):
    """
    Load the list of drug names. The input file is ascii text, with
    a single drug name per line.
    """

    drug_list = []
    with open(drug_file, 'rt') as infile:
        for line in infile:
            name = line.strip()
            # skip blank lines so that an empty string never counts as a drug
            if name:
                drug_list.append(name)

    return drug_list

We will also need code to perform text cleanup on the sentences and the terms. 