some final updates to the cooking notebook
richardboyd committed Feb 27, 2019
1 parent 37bf4fc commit c0b17ff
Showing 1 changed file with 9 additions and 7 deletions.
16 changes: 9 additions & 7 deletions notebooks/cooking/Cooking_with_ClarityNLP_022719.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cooking with ClarityNLP - Session #10: ClarityNLP, Clinical Trials, and Language Models"
"## Cooking with ClarityNLP - Session #10: Clinical Trials and Language Models"
]
},
{
@@ -103,7 +103,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Our goal is to be able to find mentions of chemotherapy regimens in the clinical trial eligibility criteria. We would also like to be able to assign a score to each criterion that indicates how confident we are in the identification. For some regimens such as BEACOPP, FOLFOX, RCHOP-14, and XELOX this is not too difficult. For other regimens such as CA, ICE, and PACE this task is somewhat more difficult, since the regimen names match those of common words or abbreviations.\n",
"Our goal is to be able to find mentions of chemotherapy regimens in the clinical trial eligibility criteria. We would also like to be able to assign a score to each criterion that indicates how confident we are in the identification. For some regimens such as BEACOPP, FOLFOX, ProMACE-CytaBOM, and XELOX this is not too difficult. For other regimens such as CA, ICE, and BEAM this task is somewhat more difficult, since the regimen names match those of common words or abbreviations.\n",
"\n",
"Our approach to this problem is to use ClarityNLP to extract regimen names and the surrounding text from the clinical trial eligibility criteria. We then normalize the text, replace the regimen names with a token, and compute all ngrams that include the regimen token. These ngrams provide us with sample contexts for the use of regimen names. By counting all such ngrams and scoring them in a smoothed language model, we can assign a probability to each ngram and use it to estimate the likelihood that a given ngram contains a regimen name."
]
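To make the ngram step concrete, here is a minimal sketch (not the notebook's actual code): the three regimen names, the placeholder token, and the crude normalization rules are all illustrative assumptions.

<pre>
import re
from collections import Counter

# Sketch only: 'regimens' stands in for the full termset, and
# REGIMEN_TOKEN is the placeholder substituted for every mention.
REGIMEN_TOKEN = "regimentoken"
regimens = {"FOLFOX", "XELOX", "BEACOPP"}

def token_ngrams(text, n=3):
    """Return each n-word window that contains the regimen token."""
    for name in regimens:
        text = re.sub(r"\b%s\b" % re.escape(name), REGIMEN_TOKEN, text)
    # crude normalization: lowercase and strip punctuation
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)
            if REGIMEN_TOKEN in words[i:i + n]]

counts = Counter()
counts.update(token_ngrams("Prior treatment with FOLFOX or XELOX required."))
print(counts.most_common())
</pre>

Counting these token-containing windows across the whole corpus produces the training data for the smoothed language model.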
@@ -126,7 +126,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We provide a python utility script that extracts the regimen names from the JSON data and writes them to a .csv file. You can find this utility script [here](https://github.com/ClarityNLP/Utilities/blob/master/get_all_regimens.py). Running this script generates the file ``all_regimen_names.csv``, which you can find in our github repo in the support folder for this notebook: ``ClarityNLP/notebooks/cooking/session_10/``. Here is a sample of what the file contains (**1961 lines total**):"
"We provide a python utility script called ``get_all_regimens.py`` that extracts the regimen names from the JSON data and writes them to a .csv file. You can find this utility script [here](https://github.com/ClarityNLP/Utilities/blob/master/get_all_regimens.py). Running this script generates the file ``all_regimen_names.csv``, which you can find in our github repo in the support folder for this notebook: ``ClarityNLP/notebooks/cooking/session_10/``. Here is a sample of what the file contains (**1961 lines total**):"
]
},
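The utility script itself is the authoritative version; a minimal sketch of the kind of extraction it performs might look like this (the input file name and the "name" field are hypothetical, since the actual JSON layout may differ):

<pre>
import csv
import json

# Hypothetical layout: a JSON array of objects, each with a "name" field.
with open("regimens.json") as infile:
    records = json.load(infile)

names = sorted({rec["name"] for rec in records})

with open("all_regimen_names.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for name in names:
        writer.writerow([name])
</pre>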
{
@@ -203,9 +203,11 @@
"* Creates a placeholder for a chemotherapy regimen termset\n",
"* Creates a [TermFinder](https://claritynlp.readthedocs.io/en/latest/api_reference/nlpql/term_finder.html) task to search for the regimen terms in the AACT data\n",
"\n",
"It is straightforward to extract the data from the CSV file and the supplemental regimen file and generate an NLPQL termset that lists all of the (nearly 2000) acronyms. Unfortunately, the number of false positives that ClarityNLP finds is overwhelming. This is because many of the regimen names such as HAM, RICE, ACT, and others are common English words. Most two and three-letter regimen acronyms seem to collide with other common abbreviations in this data.\n",
"It is straightforward to extract the data from the CSV file and the supplemental regimen file and generate an NLPQL termset containing all of the ~2000 regimen names. Unfortunately, the number of false positives that ClarityNLP finds is overwhelming. This is because many of the regimen names such as HAM, RICE, ACT, and others are common English words. Most two and three-letter regimen acronyms seem to collide with other common abbreviations in this data.\n",
"\n",
"We have ruthlessly pruned the termlist to remove drug names, common acronyms, and any regimen names that are not present in our AACT data. This process required substantial experimentation and repeated runs to produce a result amenable to the automated analysis described below. Our final pruned termset can be found in ``notebooks/cooking/session_10/regimen_termset.txt``. This termset includes **349** regimen names."
"There is also another problem with such a large termset. **Solr has a limit on the number of Booleans that it can use in a given query.** This value can be found in the ``solrconfig.xml`` file as the entry ``maxBooleanClauses``. The default seems to be 1024. ClarityNLP converts the termset into a Boolean query, so running with a huge termset may cause you to exceed your system's ``maxBooleanClauses`` limit. Please check the value in your ``solrconfig.xml`` file and adjust if necessary.\n",
"\n",
"To overcome both of these problems, we have ruthlessly **pruned the termlist** to remove drug names, common acronyms, and any regimen names that are not present in our AACT data. This process required trial-and-error experimentation and repeated runs to produce a result amenable to the **automated** analysis described below. Our final pruned termset can be found in ``notebooks/cooking/session_10/regimen_termset.txt``. This termset includes **349** regimen names."
]
},
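Both steps can be sketched in a few lines of Python. The stoplist below is a tiny illustrative sample (the real pruning was manual, against the AACT data itself), and the emitted termset follows NLPQL's documented `termset Name: [...];` form:

<pre>
# Tiny illustrative stoplist; the real pruning was done by hand.
stoplist = {"HAM", "RICE", "ACT", "CA", "ICE", "BEAM"}

with open("all_regimen_names.csv") as f:
    all_names = [line.strip() for line in f if line.strip()]

pruned = [n for n in all_names if n.upper() not in stoplist]

# Keep the Boolean query Solr builds from the termset under the
# maxBooleanClauses limit (default 1024 in solrconfig.xml).
MAX_BOOLEAN_CLAUSES = 1024
assert len(pruned) < MAX_BOOLEAN_CLAUSES, "termset too large for one query"

# Emit an NLPQL-style termset from the pruned list.
terms = ",\n    ".join('"%s"' % n for n in pruned)
print("termset RegimenTerms: [\n    %s\n];" % terms)
</pre>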
{
@@ -228,7 +230,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We would now like to use the results from ClarityNLP to generate an ngram language model. Unfortunately, we cannot just extract the [sentence](https://claritynlp.readthedocs.io/en/latest/api_reference/nlpql/term_finder.html) field from the results, since the sentence tokenizer is often confused by the strange formatting of the clinical trial criteria data. We need to do some more work to extract the sentences that we need."
"We would now like to use the results from ClarityNLP to generate an ngram language model. Unfortunately, we cannot just extract the [sentence](https://claritynlp.readthedocs.io/en/latest/api_reference/nlpql/term_finder.html) field from the results, since **the sentence tokenizer is often confused by the strange formatting of the clinical trial criteria data**. We need to do some more work to extract the sentences that we need."
]
},
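A sketch of the kind of extra work involved, assuming that line breaks and bullet markers delimit the criteria (this is an illustration, not the notebook's actual extraction code):

<pre>
import re

def criteria_lines(text):
    """Split eligibility criteria on line breaks and strip any
    leading bullet markers, instead of using a sentence tokenizer."""
    lines = re.split(r"[\r\n]+", text)
    return [re.sub(r"^[\s\-\*\u2022]+", "", ln).strip()
            for ln in lines if ln.strip()]

sample = "Inclusion Criteria:\n- Prior FOLFOX therapy\n- Age >= 18 years"
print(criteria_lines(sample))
</pre>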
{
@@ -687,7 +689,7 @@
"\n",
"To generate the language model, we will use the state-of-the art in language model generation software, [KenLM](https://kheafield.com/code/kenlm/). This code is lightning fast and can handle language data at web scale. It is written in C++ for performance, so a binary will need to be built on your system if you want to run it.\n",
"\n",
"Download and build the code for your system with the following commands. You will need to install ``cmake`` and ``wget`` if you do not already have them installed on your system (or you can just download the tarball directly without using wget):\n",
"Download and build the code for your system with the following commands. You will need to install ``boost``, ``Eigen``, ``cmake``, and ``wget`` if you do not already have them installed on your system (or you can just download the tarball directly without using wget):\n",
"\n",
"<pre>\n",
"wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz\n",
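Once the binaries are built, estimating and querying a model is straightforward. A sketch, assuming a trigram model, one-sentence-per-line training text, and the kenlm Python bindings (all file names are placeholders):

<pre>
# Estimate a trigram model from the extracted sentences:
#   bin/lmplz -o 3 --text sentences.txt --arpa regimens.arpa
import kenlm

model = kenlm.Model("regimens.arpa")
# Higher (less negative) log10 scores mean more plausible contexts.
print(model.score("prior treatment with regimentoken", bos=True, eos=True))
</pre>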
