added some language model references to cooking notebook, other minor updates
ClarityNLP · Feb 25, 2019 · 4e98431
1 parent a77791c
Showing 1 changed file with 7 additions and 3 deletions.
diff --git a/notebooks/cooking/Cooking_with_ClarityNLP_022719.ipynb b/notebooks/cooking/Cooking_with_ClarityNLP_022719.ipynb
@@ -11,7 +11,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "For today's cooking session we will use ClarityNLP to find mentions of chemotherapy regimens in clinical trial eligibility criteria. We will then use the extracted text to build an ngram language model for recognizing chemotherapy regimens, and we will estimate its performance via K-fold cross validation.\n",
+    "For today's cooking session we will use ClarityNLP to find mentions of chemotherapy regimens in clinical trial eligibility criteria. We will then use the extracted text to build an ngram language model that recognizes usage contexts for chemotherapy regimens. We will also estimate the performance of our model via K-fold cross validation.\n",
     "\n",
     "For details on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues)."
    ]
@@ -628,14 +628,18 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Language Model and K-Fold Cross-Validation Code"
+    "## Ngram Language Model and K-Fold Cross-Validation Code"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now that we have the code to generate the cross validation data, we need the code to generate the language model. Rather than writing our own, we will use the state-of-the art in language model generation software, [KenLM](https://kheafield.com/code/kenlm/). This code is lightning fast and can handle language data at web scale. It is written in C++ for performance, so a binary will need to be built on your system if you want to run it.\n",
+    "Now that we have the code to generate the cross validation data, we turn our attention to the language model.\n",
+    "\n",
+    "An ngram language model is essentially a set of ngrams and their scores. The score represents the probability of occurrence for the ngram in the training text. Ngram frequency counts are used as a starting point for the probability estimate, but an additional \"smoothing\" procedure must be applied to make the model useful. You can find more information about ngram models in chapter three of the latest draft of [Jurafsky and Martin](https://web.stanford.edu/~jurafsky/slp3/). A comprehensive overview of smoothing techniques can be found in a [technical report by Chen and Goodman](https://dash.harvard.edu/handle/1/25104739).\n",
+    "\n",
+    "To generate the language model, we will use the state-of-the art in language model generation software, [KenLM](https://kheafield.com/code/kenlm/). This code is lightning fast and can handle language data at web scale. It is written in C++ for performance, so a binary will need to be built on your system if you want to run it.\n",
     "\n",
     "Download and build the code for your system with the following commands. You will need to install ``cmake`` and ``wget`` if you do not already have them installed on your system (or you can just download the tarball directly without using wget):\n",
     "\n",