## Cooking with ClarityNLP - Session #10: Clinical Trials, ClarityNLP, and Language Models

For today's cooking session we will use ClarityNLP to find mentions of chemotherapy regimens in clinical trial announcements. We will then use the extracted text to develop an n-gram language model that can be used to distinguish chemotherapy regimens from ordinary text.

For details on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues).

## The AACT Clinical Trials Database

The U.S. National Library of Medicine maintains a website called [ClinicalTrials.gov](https://clinicaltrials.gov/) that hosts a large database of clinical trial data. The data provided by this website include the objectives of each trial, patient inclusion and exclusion criteria, the participating medical centers, the study directors, and other relevant information. Each clinical trial is assigned a unique **NCT ID** such as NCT03601923, which is the ID for a now-recruiting trial to evaluate the effectiveness of the drug Niraparib for the treatment of pancreatic cancer.

The home page for ClinicalTrials.gov provides a web form by which users can enter search criteria and find clinical trials of interest:

![clinicaltrials_gov.png](data_session_10/clinicaltrials_gov.png)

This form is convenient for interactive searches, but larger-scale analysis of the clinical trials requires access to the database, which this site does not provide.

Luckily, there is a [related AACT website](https://aact.ctti-clinicaltrials.org/) that hosts a downloadable PostgreSQL database containing all of the data at ClinicalTrials.gov. As of this writing, the current download is 871 MB in size, and the download file is updated at near-daily intervals. If you want to run natural language processing or machine learning algorithms on the clinical trial data, the AACT database is what you need. The AACT website contains links to download and installation instructions, the data dictionary and database schema, and lots of other documentation.

### Loading the AACT Database into Solr

To be able to interrogate the AACT data with ClarityNLP, we first need to ingest the database contents into Solr. At Georgia Tech we downloaded a snapshot of the AACT database, installed it, and ingested the contents into our Solr instance.
Some of the Python scripts that we used can be found in our [utilities project](https://github.com/ClarityNLP/Utilities). Look at the .py files containing *aact* in the filename if you're interested. The Solr fields that ClarityNLP expects are described in our [documentation](https://claritynlp.readthedocs.io/en/latest/developer_guide/technical_background/solr.html).

Our data ingestion process populates the required fields as follows:
* ``report_type``: either ``"Clinical Trial Inclusion Criteria"`` or ``"Clinical Trial Exclusion Criteria"``
* ``report_text``: text of the inclusion or exclusion criteria
* ``source``: ``"AACT"``
* ``subject``: the NCT ID of the trial
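The mapping above can be sketched in a few lines of Python. This is only an illustration of the field mapping, not the actual ingestion code from our utilities project: the function name ``make_solr_docs``, the ``id`` scheme, the placeholder NCT ID, and the placeholder criteria text are all assumptions made for the example. A real ingest would also need to supply any other fields that the Solr schema requires.

```python
import json


def make_solr_docs(nct_id, inclusion_text, exclusion_text):
    """Build one Solr-style document per criteria type for a single trial.

    Hypothetical helper: it only illustrates the field mapping listed above.
    """
    docs = []
    for kind, text in (("Inclusion", inclusion_text),
                       ("Exclusion", exclusion_text)):
        if not text:
            # Skip criteria sections that are missing for this trial.
            continue
        docs.append({
            # Assumed unique-key scheme, chosen only for this sketch.
            "id": f"{nct_id}_{kind.lower()}",
            "report_type": f"Clinical Trial {kind} Criteria",
            "report_text": text,
            "source": "AACT",
            "subject": nct_id,
        })
    return docs


# Placeholder ID and text, not data from a real trial.
docs = make_solr_docs(
    "NCT00000000",
    "Placeholder inclusion criteria text.",
    "Placeholder exclusion criteria text.",
)
print(json.dumps(docs, indent=2))
```

Each trial therefore yields up to two documents, one for the inclusion criteria and one for the exclusion criteria, both sharing the same ``subject``.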


## Chemotherapy Regimens

We want to be able to find mentions of chemotherapy regimens in the clinical trial criteria. Some examples:
* ``Relapsed DLBCL patients having received 6-8 cycles of RCHOP-like chemotherapy as 1st line induction treatment.``
* ``patients can receive up to 10 days of steroid therapy prior to starting treatment with BV+AVD``
* ``Melphalan 140 mg/m2 or 200 mg/m2, BEAM, BEAC, BUCY or CBV``
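To get a feel for the problem, here is a naive lexicon lookup applied to the example sentences above. It is a minimal sketch, not the method used by ClarityNLP: the tiny ``REGIMENS`` list is taken only from those examples, and the names ``REGIMEN_RE`` and ``find_regimens`` are made up for this illustration.

```python
import re

# Hypothetical mini-lexicon built only from the example sentences above.
REGIMENS = ["RCHOP", "BV+AVD", "BEAM", "BEAC", "BUCY", "CBV"]

# Put longer names first so the alternation prefers the longest match.
_alternatives = "|".join(
    re.escape(name) for name in sorted(REGIMENS, key=len, reverse=True)
)

# Use lookarounds instead of \b, since names such as BV+AVD contain
# non-word characters. Matching is case-insensitive on purpose.
REGIMEN_RE = re.compile(
    rf"(?<![A-Za-z0-9])(?:{_alternatives})(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def find_regimens(text):
    """Return every lexicon entry found in the text, in order of appearance."""
    return [match.group(0) for match in REGIMEN_RE.finditer(text)]


examples = [
    "Relapsed DLBCL patients having received 6-8 cycles of RCHOP-like "
    "chemotherapy as 1st line induction treatment.",
    "patients can receive up to 10 days of steroid therapy prior to "
    "starting treatment with BV+AVD",
    "Melphalan 140 mg/m2 or 200 mg/m2, BEAM, BEAC, BUCY or CBV",
]
for sentence in examples:
    print(find_regimens(sentence))

# A made-up sentence in which "beam" is an ordinary English word.
print(find_regimens("external beam radiation therapy"))
```

The last call returns ``['beam']`` even though no regimen is mentioned. A plain lexicon match cannot tell the BEAM regimen from the ordinary word, which is the kind of ambiguity a language model can help resolve.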