## Cooking with ClarityNLP - Session #10: Clinical Trials, ClarityNLP, and Language Models

For today's cooking session we will use ClarityNLP to find mentions of chemotherapy regimens in clinical trial eligibility criteria. We will then use the extracted text to build an ngram language model and estimate its performance via K-fold cross validation.

For details on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues).

## The AACT Clinical Trials Database

The U.S. National Library of Medicine maintains a website called [ClinicalTrials.gov](https://clinicaltrials.gov/), which hosts a large database of information about clinical trials. This information includes the trial objectives, the inclusion and exclusion criteria, the participating medical centers, the people in charge of the trial, and other related information. Each clinical trial is assigned a unique **NCT ID**, such as NCT03601923, which is the ID for a now-recruiting trial to evaluate the effectiveness of the drug Niraparib as a treatment for pancreatic cancer.

The home page for ClinicalTrials.gov provides a web form by which users can enter search criteria and find clinical trials of interest:

![clinicaltrials_gov.png](data_session_10/clinicaltrials_gov.png)

This form is convenient for interactive searches, but larger-scale analysis of the clinical trials requires direct access to the underlying database, which this site does not provide.

Luckily, there is a [related AACT website](https://aact.ctti-clinicaltrials.org/) that hosts a downloadable PostgreSQL database containing all of the data at clinicaltrials.gov. As of this writing, the current download is 871 MB in size, and it contains data for more than 289,000 clinical trials. The download file is updated at near-daily intervals. If you want to run natural language processing or machine learning algorithms on the clinical trial data, the AACT database is what you need. The AACT website contains links to download and installation instructions, the data dictionary and database schema, and lots of other documentation.

### Loading the AACT Database into Solr

To query the AACT data with ClarityNLP, we first need to load the database contents into Solr. At Georgia Tech we downloaded a snapshot of the AACT database, installed it, and ingested the contents into our Solr instance. Some of the Python scripts that we used can be found in our [utilities project](https://github.com/ClarityNLP/Utilities). Look at the .py files containing *aact* in the filename if you're interested. The Solr fields that ClarityNLP expects are described in our [documentation](https://claritynlp.readthedocs.io/en/latest/developer_guide/technical_background/solr.html).

Our data ingestion process maps these required fields as follows:
* ``report_type``: either ``"Clinical Trial Inclusion Criteria"`` or ``"Clinical Trial Exclusion Criteria"``
* ``report_text``: text of the inclusion or exclusion criteria
* ``source``: ``"AACT"``
* ``subject``: the NCT ID of the trial
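
The mapping above can be sketched in a few lines of Python. This is only an illustration of the four fields listed here, not the actual ingestion script from the utilities project. The function name `make_solr_docs` is hypothetical, and a real Solr document would also need any other fields that the ClarityNLP Solr schema requires.

```python
def make_solr_docs(nct_id, inclusion_text, exclusion_text):
    """Build one document (a dict) per criteria type for a single trial.

    Only the four fields described in the text are populated; any other
    fields required by the Solr schema are omitted from this sketch.
    """
    sections = [
        ("Clinical Trial Inclusion Criteria", inclusion_text),
        ("Clinical Trial Exclusion Criteria", exclusion_text),
    ]
    docs = []
    for report_type, text in sections:
        # Skip criteria sections that are missing or empty.
        if not text or not text.strip():
            continue
        docs.append({
            "report_type": report_type,
            "report_text": text.strip(),
            "source": "AACT",
            "subject": nct_id,
        })
    return docs


docs = make_solr_docs(
    "NCT03601923",
    "Age 18 years or older.",
    "Prior treatment with a PARP inhibitor.",
)
for doc in docs:
    print(doc["subject"], "|", doc["report_type"])
```

Each trial therefore yields up to two documents, one for the inclusion criteria and one for the exclusion criteria, both sharing the same ``subject``.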


## Chemotherapy Regimens

Our goal is to be able to find mentions of chemotherapy regimens in the clinical trial eligibility criteria. We would also like to be able to assign a score to each such criterion that indicates how confident we are in the assignment. For some regimens such as BEACOPP, FOLFOX, RCHOP-14, and XELOX this is not too difficult. For other regimens such as CA, ICE, and PACE this task is somewhat more difficult, since the regimen names match those of common words or abbreviations.

Our approach to this problem is to use ClarityNLP to extract regimen names and the surrounding text from the clinical trial eligibility criteria. We then normalize the text, replace the regimen names with a token, and compute all ngrams that include the regimen token. These ngrams provide us with sample contexts for the use of regimen names. By counting all such ngrams and using them in a smoothed language model, we can assign a probability to each ngram and use it to estimate the likelihood that a given ngram contains a regimen name.
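
To make the idea concrete, here is a minimal sketch of the normalize, tokenize, and count steps. It makes several simplifying assumptions that may differ from the code used in this session: the token string `<regimen>`, the tokenizer regex, and the helper names are all hypothetical; only single-token regimen names are handled; and add-alpha (Laplace) smoothing with a single bucket for unseen ngrams stands in for whatever smoothing method the full model uses.

```python
import re
from collections import Counter

REGIMEN_TOKEN = "<regimen>"  # hypothetical placeholder token
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text, regimen_names):
    """Lowercase the text, split it into tokens, and replace regimen names."""
    names = {name.lower() for name in regimen_names}
    tokens = TOKEN_RE.findall(text.lower())
    return [REGIMEN_TOKEN if tok in names else tok for tok in tokens]


def regimen_ngrams(tokens, n):
    """Return all ngrams of length n that contain the regimen token."""
    grams = []
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if REGIMEN_TOKEN in gram:
            grams.append(gram)
    return grams


def smoothed_probability(gram, counts, alpha=1.0):
    """Add-alpha estimate with one extra bucket reserved for unseen ngrams."""
    total = sum(counts.values())
    num_types = len(counts) + 1  # observed ngram types plus one "unseen" type
    return (counts[gram] + alpha) / (total + alpha * num_types)


text = "Prior treatment with FOLFOX or XELOX is allowed"
tokens = tokenize(text, ["FOLFOX", "XELOX"])
counts = Counter(regimen_ngrams(tokens, 3))

seen = ("with", REGIMEN_TOKEN, "or")
unseen = ("ice", "cold", REGIMEN_TOKEN)
print(smoothed_probability(seen, counts))
print(smoothed_probability(unseen, counts))
```

An ngram observed in the training contexts receives a higher probability than one that was never seen, and smoothing keeps the unseen ngram's probability above zero.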

### Regimen Names

Luckily for us, there is a comprehensive, peer-reviewed oncology wiki at [HemOnc.org](https://hemonc.org/wiki/Main_Page). This website contains information on treatment regimens for many different types of cancer. One of our intrepid co-workers scraped this wiki and generated hundreds of NLPQL files that extract the relevant drugs and treatment regimens from clinical text. The full list can be found [here](https://github.com/ClarityNLP/Utilities/tree/master/regimen_nlpql), along with NLPQL for related drugs and conditions. A JSON file containing all of this data can be found [here](https://github.com/ClarityNLP/Utilities/blob/master/cancer/regimen_tree.json).

We also have a Python utility script that extracts the regimen names from the JSON data and writes them to a .csv file. You can find this utility script [here](https://github.com/ClarityNLP/Utilities/blob/master/get_all_regimens.py). Running this script generates the file ``all_regimen_names.csv``, which you can find in our GitHub repo in the support folder for this notebook: ``ClarityNLP/notebooks/cooking/data_session_10/``. Here is a sample of what the file contains (1961 lines total):

![all_regimens.png](data_session_10/all_regimens.png)

The regimen name(s) are found in the first column. Alternate names or common variants are found in the second column.

There are still other chemotherapy regimen names not contained in this file. We have collected additional regimen names and saved them to the file ``notebooks/data_session_10/supplemental_chemo_regimens.txt``. It is relatively straightforward to extract the data from the CSV file and the supplemental file and generate an NLPQL termset from it. We have done this and saved the result to ``notebooks/cooking/data_session_10/regimen_termset.txt``.
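
The termset generation can be sketched as follows. This is a hedged illustration, not the script we actually used: the function name `build_termset` and the termset name `RegimenTerms` are hypothetical, and the sketch assumes that the CSV file has no header row, that each of its first two cells holds a single term, and that the supplemental file lists one regimen name per line. If a cell holds several names, it would need to be split further according to the delimiter the file actually uses.

```python
import csv
import io


def build_termset(csv_file, supplemental_file, termset_name="RegimenTerms"):
    """Collect unique regimen names and format them as an NLPQL termset."""
    terms = set()

    # First column: regimen name; second column: alternate name or variant.
    for row in csv.reader(csv_file):
        for cell in row[:2]:
            term = cell.strip()
            if term:
                terms.add(term)

    # Supplemental file: assumed to contain one regimen name per line.
    for line in supplemental_file:
        term = line.strip()
        if term:
            terms.add(term)

    # Sort case-insensitively for a stable, readable listing.
    ordered = sorted(terms, key=lambda t: (t.lower(), t))
    quoted = ",\n".join('    "{}"'.format(t.replace('"', '\\"')) for t in ordered)
    return "termset {}: [\n{}\n];".format(termset_name, quoted)


# In-memory stand-ins for the CSV file and the supplemental file.
csv_data = io.StringIO("FOLFOX,FOLFOX4\nXELOX,CAPOX\n")
supplemental_data = io.StringIO("BEACOPP\n")
print(build_termset(csv_data, supplemental_data))
```

To run this against the real files, replace the `io.StringIO` objects with file handles, for example `open(path, newline="")` for the CSV file.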