This repository has been archived by the owner on Apr 5, 2024. It is now read-only.

Commit

added some language model references to cooking notebook, other minor updates
richardboyd committed Feb 25, 2019
1 parent a77791c commit 4e98431
Showing 1 changed file with 7 additions and 3 deletions.
10 changes: 7 additions & 3 deletions notebooks/cooking/Cooking_with_ClarityNLP_022719.ipynb
@@ -11,7 +11,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"For today's cooking session we will use ClarityNLP to find mentions of chemotherapy regimens in clinical trial eligibility criteria. We will then use the extracted text to build an ngram language model for recognizing chemotherapy regimens, and we will estimate its performance via K-fold cross validation.\n",
+"For today's cooking session we will use ClarityNLP to find mentions of chemotherapy regimens in clinical trial eligibility criteria. We will then use the extracted text to build an ngram language model that recognizes usage contexts for chemotherapy regimens. We will also estimate the performance of our model via K-fold cross validation.\n",
"\n",
"For details on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html). We welcome questions via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues)."
]
@@ -628,14 +628,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"## Language Model and K-Fold Cross-Validation Code"
+"## Ngram Language Model and K-Fold Cross-Validation Code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"Now that we have the code to generate the cross validation data, we need the code to generate the language model. Rather than writing our own, we will use the state-of-the art in language model generation software, [KenLM](https://kheafield.com/code/kenlm/). This code is lightning fast and can handle language data at web scale. It is written in C++ for performance, so a binary will need to be built on your system if you want to run it.\n",
+"Now that we have the code to generate the cross validation data, we turn our attention to the language model.\n",
+"\n",
+"An ngram language model is essentially a set of ngrams and their scores. The score represents the probability of occurrence for the ngram in the training text. Ngram frequency counts are used as a starting point for the probability estimate, but an additional \"smoothing\" procedure must be applied to make the model useful. You can find more information about ngram models in chapter three of the latest draft of [Jurafsky and Martin](https://web.stanford.edu/~jurafsky/slp3/). A comprehensive overview of smoothing techniques can be found in a [technical report by Chen and Goodman](https://dash.harvard.edu/handle/1/25104739).\n",
+"\n",
+"To generate the language model, we will use the state-of-the art in language model generation software, [KenLM](https://kheafield.com/code/kenlm/). This code is lightning fast and can handle language data at web scale. It is written in C++ for performance, so a binary will need to be built on your system if you want to run it.\n",
"\n",
"Download and build the code for your system with the following commands. You will need to install ``cmake`` and ``wget`` if you do not already have them installed on your system (or you can just download the tarball directly without using wget):\n",
"\n",
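
The ngram description added in this commit (frequency counts as a starting point, followed by smoothing) can be illustrated with a small sketch. The Python below is not from the notebook and is not how KenLM works internally: it assumes a bigram model with add-one (Laplace) smoothing, chosen only because it is the simplest smoothing scheme to show. The function names and the toy sentences are hypothetical.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        # Pad each sentence with start and end markers.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev).

    Without the +1, any bigram absent from the training text would get
    probability zero, which is why raw counts alone are not enough.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


# Toy training text (hypothetical, not taken from the notebook's data).
sentences = [
    "prior treatment with folfox is allowed",
    "prior treatment with folfiri is allowed",
]
unigrams, bigrams = train_bigram_model(sentences)

# A bigram seen in training scores higher than an unseen one,
# but the unseen bigram still receives a nonzero probability.
seen = bigram_probability("with", "folfox", unigrams, bigrams)
unseen = bigram_probability("with", "allowed", unigrams, bigrams)
print(seen, unseen)
```

Add-one smoothing tends to move too much probability mass to unseen ngrams on realistic vocabularies, which is one reason to rely on a dedicated toolkit for the actual model.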
