## Cooking with ClarityNLP - Session #7: NLPQL Expressions

Today we will take a behind-the-scenes look at NLPQL expressions and how the system evaluates them. We will also provide an overview of our new [NLPQL editor tool](https://nlpql-editor.herokuapp.com/demo.html) that makes the task of creating NLPQL files much easier.

For background on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).

We welcome questions via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues).

### Extracting Measurements from Radiology Reports

To start things off, suppose that we have access to a medical center's electronic health records for radiology. We are interested in searching the radiology reports for patients with lesions in the 1cm - 2cm size range. How can we use ClarityNLP to find these patients?

As you've learned in previous cooking sessions, we need to **create an NLPQL file** that defines a **documentset** for the radiology reports and a **termset** with lesion-related terms. We will need to run the **measurement finder** to extract measurements, and then **filter the measurements** with mathematical expressions that constrain the allowable lesion sizes.

#### Creating the NLPQL File

When developing a new NLPQL file, it is best to limit the number of documents processed until the NLPQL is fully debugged and working. So let's start by limiting our initial document set to 50 documents. It shouldn't take too long to process 50 documents, and if we make a mistake, we can quickly recover.

A limit on the number of documents is specified by a ``limit`` statement in the NLPQL file. So open a text editor, create a new file called ``lesion.nlpql``, and enter the following line:

<pre>limit 50;</pre>

Next we need to insert some boilerplate that identifies the phenotype and version, provides a description, and imports the ClarityNLP libraries. All of your NLPQL files will have text similar to this at the start:

<pre>phenotype "Lesions1to2Cm" version "1";
description "Find lesions with sizes ranging from 1 to 2 cm.";
include ClarityCore version "1.0" called Clarity;</pre>

Since we only want to search radiology reports, we can create a documentset specifically for this purpose. Note that the ``report_types`` field actually takes an array argument (identified by the square brackets). We will use a single-element array containing the term ``Radiology``, the label used by the MIMIC-III data set:

<pre>
documentset Docs:
    Clarity.createDocumentSet({
        "report_types":["Radiology"]
    });
</pre>

Next we need to create a list of the tumor-related terms we want ClarityNLP to search for. We ponder this for a while and eventually arrive at a termset that uses some radiology lingo:

<pre>
termset LesionTerms: [
    "lesion", "growth", "mass", "malignancy", "tumor",
    "neoplasm", "nodule", "cyst", "focus of enhancement",
    "echodensity", "hypoechoic focus", "echogenic focus"
];
</pre>

Since we need to find and extract measurements, we must insert a command to run ClarityNLP's measurement finder. The simplest command to do this is:

<pre>
define LesionMeasurement:
    Clarity.MeasurementFinder({
        documentset: [Docs],
        termset: [LesionTerms]
    });
</pre>

This command runs the measurement finder on each sentence of our source documents. It returns any measurements that occur in the same sentence as a term in our ``LesionTerms`` termset.

Our goal is to find **patients** with tumors of the specified dimensions, so we specify a ``Patient`` context:

<pre>
context Patient;
</pre>

Now we're ready to write the expressions that constrain the lesion measurements to our desired size of 1-2 cm. Here we will insert three commands to do so, and we will explain the differences in results for each below:

<pre>
define xBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20;

define xyBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20 AND
          LesionMeasurement.dimension_Y >= 10 AND LesionMeasurement.dimension_Y <= 20;

define xyzBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20 AND
          LesionMeasurement.dimension_Y >= 10 AND LesionMeasurement.dimension_Y <= 20 AND
          LesionMeasurement.dimension_Z >= 10 AND LesionMeasurement.dimension_Z <= 20;
</pre>

ClarityNLP normalizes all dimensional measurements to units of **millimeters**, so our desired range of 1-2 cm becomes 10-20 mm.

Next, why do we need constraints in one, two, and three dimensions? **Because that's how radiology measurements are reported.** Sometimes the radiologist will state that a lesion measures ``15 mm``, ignoring the other dimensions. At other times the radiologist will note the values of the second and third dimensions. We want to enforce constraints on all possibilities, so we need a constraint statement for each dimension.

How did we know to use the ``dimension_X``, ``dimension_Y``, and ``dimension_Z`` field names?  Because our [MeasurementFinder API documentation](https://claritynlp.readthedocs.io/en/latest/api_reference/nlpql/measurementfinder.html) says to!

And with that we're done. Here is the final text of our ``lesion.nlpql``, all in one place:

<pre>
limit 50;
phenotype "Lesions1to2Cm" version "1";
description "Find lesions with sizes ranging from 1 to 2 cm.";
include ClarityCore version "1.0" called Clarity;

// radiology documents only in the documentset
documentset Docs:
    Clarity.createDocumentSet({
        "report_types":["Radiology"]
    });

// lesion terms
termset LesionTerms: [
    "lesion", "growth", "mass", "malignancy", "tumor",
    "neoplasm", "nodule", "cyst", "focus of enhancement",
    "echodensity", "hypoechoic focus", "echogenic focus"
];

// extract lesion measurements
define LesionMeasurement:
    Clarity.MeasurementFinder({
        documentset: [Docs],
        termset: [LesionTerms]
    });

// we want to find patients, so use 'Patient' context
context Patient;

// constraints on X, XY, and XYZ

define xBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20;

define xyBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20 AND
          LesionMeasurement.dimension_Y >= 10 AND LesionMeasurement.dimension_Y <= 20;

define xyzBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20 AND
          LesionMeasurement.dimension_Y >= 10 AND LesionMeasurement.dimension_Y <= 20 AND
          LesionMeasurement.dimension_Z >= 10 AND LesionMeasurement.dimension_Z <= 20;
</pre>

### Testing the NLPQL for Syntax Errors

Before trying to process documents with our new NLPQL file, it is a good idea to first check it for syntax errors. We can do this by submitting it to the ``nlpql_tester`` API endpoint, a useful tool for the NLPQL developer.

In previous cooking sessions we showed you how to use the [Postman](https://www.postman.com) GUI tool to submit NLPQL files to the ClarityNLP webserver. Today we will show you how to use a command-line tool called [curl](https://curl.haxx.se/) to do the same thing.

The nlpql_tester API for a local ClarityNLP instance is typically found at ``localhost:5000/nlpql_tester``. The NLPQL file should be sent via HTTP POST using a content type of ``text/plain``.

To submit the file, install ``curl`` on your system, then open a terminal window, change directories to the location of ``lesion.nlpql``, and run the next command. If you are following along in the notebook, there is a copy of ``lesion.nlpql`` in ``notebooks/cooking/assets/``.

<pre>
curl -i -X POST http://localhost:5000/nlpql_tester -H "Content-Type: text/plain" --data-binary "@lesion.nlpql"
</pre>

Here's what the various options mean:
```
-i: include the HTTP response headers in the output
-X: request type (must be POST)
-H: add the subsequent Content-Type: text/plain to the header of the HTTP request
--data-binary: POST the data exactly as specified, no additional processing
```

Note also that an ``@`` character precedes the file name. If you run the command outside of the folder that contains ``lesion.nlpql``, replace the final quoted string with ``"@/path/to/lesion.nlpql"``, substituting the appropriate path on your system.

The ``curl`` command sends the file to the ``nlpql_tester`` API endpoint via HTTP POST. If the syntax is OK the system responds with a JSON result. Otherwise the system responds with an error message.
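If you prefer Python to ``curl``, the same request can be sketched with the standard library. This is only a minimal sketch: the helper name ``build_nlpql_request`` is our own invention, and the URL assumes the local instance described above. The request is built but not sent, so the code runs even without a ClarityNLP server.

```python
import urllib.request

def build_nlpql_request(nlpql_text, url="http://localhost:5000/nlpql_tester"):
    """Build (but do not send) a POST request equivalent to the curl command."""
    return urllib.request.Request(
        url,
        data=nlpql_text.encode("utf-8"),         # like --data-binary: body sent as-is
        headers={"Content-Type": "text/plain"},  # like -H "Content-Type: text/plain"
        method="POST",                           # like -X POST
    )

request = build_nlpql_request("limit 50;")
print(request.get_method(), request.full_url)

# To actually submit the request to a running ClarityNLP instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

In practice you would pass the full text of ``lesion.nlpql`` instead of the one-line string used here.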

You can run the NLPQL tester directly from this notebook by first running the code in the next cell:

<pre>
# The code below is only required for running ClarityNLP in Jupyter notebooks.
# It is not required if running NLPQL via API or the ClarityNLP GUI.
import pandas as pd
import claritynlp_notebook_helpers as claritynlp
</pre>

Now run the next cell to test the NLPQL file. You should see some JSON output as the result. For some reason, the output is truncated when running in this notebook. You can see the full JSON result by using the ``curl`` command above.

<pre>
lesion_nlpql_text = claritynlp.load_file('assets/lesion.nlpql')
json_result = claritynlp.run_nlpql_tester(lesion_nlpql_text)
print(json_result)
</pre>

<pre>
{
    "limit": 50,
    "phenotype": {
        "declaration": "phenotype",
        "values": [],
        "named_arguments": {},
        "library": "ClarityNLP",
        "name": "\"LesionDemo\"",
        "description": "",
        "alias": "",
        "concept": "",
        "arguments": [],
        "funct": "",
        "version": "1"
    },
    "description": "\"Find lesions of various sizes.\"",
    "document_sets": [
        {
            "declaration": "documentset",
            "values": [],
            "named_arguments": {
                "report_types": [
                    "Radiology"
                ]
            },
            "library": "Clarity",
            "name": "Docs",
            "description": "",
            "alias": "",
            "concept": "",
            "arguments": [],
            "funct": "createDocumentSet",
            "version": ""
        }
    ],
    "includes": [
        {
            "declaration": "include",
            "values": [],
            "named
</pre>

### Running the NLPQL File

Having verified that the NLPQL file has the proper syntax, we submit the job to the ClarityNLP server with a similar ``curl`` command:
<pre>
curl -i -X POST http://localhost:5000/nlpql -H "Content-Type: text/plain" --data-binary "@lesion.nlpql"
</pre>

Alternatively, you can run from the next notebook cell:

<pre>
lesion_nlpql_text = claritynlp.load_file('assets/lesion.nlpql')
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(lesion_nlpql_text)
</pre>

**The job may take several minutes to run.** After it runs to completion, browse to the location of the CSV file containing the intermediate results, and open it in a spreadsheet application such as Microsoft Excel. We have saved the results of a run to ``notebooks/cooking/assets/lesion_intermediate.csv``. The next cell displays some data from this file:

<pre>
lesion_csv = pd.read_csv('assets/lesion_intermediate.csv',
                         usecols=['dimension_X', 'dimension_Y', 'dimension_Z', 'nlpql_feature', 'subject', 'job_id'])
lesion_csv
</pre>

|   | dimension_X | dimension_Y | dimension_Z | job_id | nlpql_feature | subject |
|---|---|---|---|---|---|---|
| 0 | 28 | 16 | NaN | 11131 | LesionMeasurement | 40463 |
| 1 | 6 | NaN | NaN | 11131 | LesionMeasurement | 40463 |
| 2 | 17 | 8 | NaN | 11131 | LesionMeasurement | 40463 |
| 3 | 110 | 101 | NaN | 11131 | LesionMeasurement | 40463 |
| 4 | 7 | NaN | NaN | 11131 | LesionMeasurement | 37766 |
| 5 | 6 | NaN | NaN | 11131 | LesionMeasurement | 37766 |
| 6 | 7 | NaN | NaN | 11131 | LesionMeasurement | 37766 |
| 7 | 39 | 20 | NaN | 11131 | LesionMeasurement | 26259 |
| 8 | 8 | NaN | NaN | 11131 | LesionMeasurement | 43634 |
| 9 | 5 | NaN | NaN | 11131 | LesionMeasurement | 43634 |


### Spreadsheet Rows are MongoDB Documents

Our run generated a CSV file containing a header row and 194 rows of data. This CSV file is a dump of the results for our particular job, which has a unique ``job_id`` of 11131. These results are stored in a MongoDB collection called ``phenotype_results`` in a database called ``nlp``. It is important to understand that **each row** of data above is a **separate document** in the MongoDB database. For instance, here is the underlying database document for row 2 above, which was written directly by the ``MeasurementFinder`` task:

<pre>
import json
obj = { "_id" : "5bfd9c9a31ab5b2e981dca14", "sentence" : "there is a 1.7 x 0.8 cm fdg-avid soft tissue nodule in the subcutaneous tissues of the right breast.", "text" : "1.7 x 0.8 cm", "start" : 11, "value" : 17, "end" : 23, "term" : "avid soft tissue nodule", "dimension_X" : 17, "dimension_Y" : 8, "dimension_Z" : None, "units" : "MILLIMETERS", "location" : [ "subcutaneous tissues of the right breast" ], "condition" : "EQUAL", "value1" : None, "value2" : "", "temporality" : "CURRENT", "min_value" : 8, "max_value" : 17, "pipeline_type" : "MeasurementFinder", "pipeline_id" : 12573, "job_id" : 11131, "batch" : 50, "owner" : "claritynlp", "nlpql_feature" : "LesionMeasurement", "inserted_date" : "2018-11-27T14:35:54.749Z", "concept_code" : -1, "phenotype_final" : False, "report_id" : "1048492", "subject" : "40463", "report_date" : "2119-02-16T00:00:00Z", "report_type" : "Radiology", "source" : "MIMIC", "solr_id" : "1048492" }
print(json.dumps(obj, indent=4))
</pre>

<pre>
{
    "_id": "5bfd9c9a31ab5b2e981dca14",
    "sentence": "there is a 1.7 x 0.8 cm fdg-avid soft tissue nodule in the subcutaneous tissues of the right breast.",
    "text": "1.7 x 0.8 cm",
    "start": 11,
    "value": 17,
    "end": 23,
    "term": "avid soft tissue nodule",
    "dimension_X": 17,
    "dimension_Y": 8,
    "dimension_Z": null,
    "units": "MILLIMETERS",
    "location": [
        "subcutaneous tissues of the right breast"
    ],
    "condition": "EQUAL",
    "value1": null,
    "value2": "",
    "temporality": "CURRENT",
    "min_value": 8,
    "max_value": 17,
    "pipeline_type": "MeasurementFinder",
    "pipeline_id": 12573,
    "job_id": 11131,
    "batch": 50,
    "owner": "claritynlp",
    "nlpql_feature": "LesionMeasurement",
    "inserted_date": "2018-11-27T14:35:54.749Z",
    "concept_code": -1,
    "phenotype_final": false,
    "report_id": "1048492",
    "subject": "40463",
    "report_date": "2119-02-16T00:00:00Z",
    "report_type": "Radiology",
    "sour
</pre>

You will need to view the results in a spreadsheet to see how the field names become column names in the intermediate CSV file. Thus the CSV file provides a 'flattened' view of the database results for a particular job.
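To make the 'flattening' idea concrete, here is a small sketch that turns result documents into CSV rows using only the Python standard library. The two documents are trimmed-down versions of rows 2 and 1 shown earlier, and the helper name ``documents_to_csv`` is hypothetical; this is not how ClarityNLP itself writes the file.

```python
import csv
import io

# Two trimmed-down result documents; the values are copied from the rows shown earlier.
documents = [
    {"dimension_X": 17, "dimension_Y": 8, "dimension_Z": None,
     "nlpql_feature": "LesionMeasurement", "subject": "40463", "job_id": 11131},
    {"dimension_X": 6, "dimension_Y": None, "dimension_Z": None,
     "nlpql_feature": "LesionMeasurement", "subject": "40463", "job_id": 11131},
]

def documents_to_csv(docs):
    """Flatten a list of result documents: field names become column names."""
    fieldnames = sorted({key for doc in docs for key in doc})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for doc in docs:
        # A null field becomes an empty cell in the CSV row
        writer.writerow({k: ("" if v is None else v) for k, v in doc.items()})
    return buffer.getvalue()

csv_text = documents_to_csv(documents)
print(csv_text)
```

Each document contributes exactly one row, and a null field such as ``dimension_Z`` shows up as an empty cell.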

### Interpreting the Results

Looking at the output rows above, you can see that the results are broadly grouped by the value of the ``nlpql_feature`` field. There are four such groups with values ``LesionMeasurement``, ``xBetween10and20mm``, ``xyBetween10and20mm``, and ``xyzBetween10and20mm``. Take a look at the NLPQL file above and observe that these are the name strings in each ``define`` statement.

A value of ``NaN`` (not a number) is the equivalent of a null result, meaning that no data was found for that measurement dimension.

Rows 0-144 contain the extracted measurements, all of which have their ``nlpql_feature`` field equal to ``LesionMeasurement``. These rows comprise the output of the measurement extractor. They are the **input** data for the mathematical expressions that constrain the lesion measurements. The underlying documents for these ``LesionMeasurement`` results in the MongoDB database are called *task result documents*.

Rows 145-183 have their ``nlpql_feature`` field equal to ``xBetween10and20mm``. Unlike the ``LesionMeasurement`` rows, which are directly generated by the MeasurementFinder task, these rows are generated by evaluation of a mathematical expression. This expression places a constraint on the X dimension of each measurement. Only those measurements that satisfy the constraint fill these rows of the intermediate result file.

Rows 184-192 have their ``nlpql_feature`` field equal to ``xyBetween10and20mm``. These rows are generated by evaluation of a mathematical expression that constrains both the X and Y measurement dimensions. Only those measurements that satisfy the constraints fill these rows of the intermediate result file.

Row 193 has its ``nlpql_feature`` field equal to ``xyzBetween10and20mm``.  This row is the only measurement that survives the constraint on all three dimensions.

Note that the ``xBetween10and20mm`` results contain 2D and 3D measurements, some of which have Y or Z dimensions that exceed 20 mm (such as rows 175 and 176). The constraint for these rows is only on the X dimension. The Y and Z dimensions can have any value whatsoever, even NaN (which means they don't exist).

We see a single 3D measurement in the ``xy`` result section, in row 188. Its Z dimension happens to fall within the 10-20 mm range as well, but the ``xyBetween10and20mm`` expression itself imposes no constraint on Z.
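The behavior of the three constraints can be mimicked in a few lines of plain Python. This is only an illustration of the filtering logic, not ClarityNLP code: the helper ``in_range`` is our own, the first two measurements come from the output above, and the last two are made up.

```python
def in_range(value, low=10, high=20):
    """Return True only if the dimension exists and lies within [low, high] mm."""
    return value is not None and low <= value <= high

# (X, Y, Z) in millimeters; None marks a dimension that was not reported.
# The first two tuples appear in the output above; the last two are made up.
measurements = [(28, 16, None), (17, 8, None), (15, None, None), (12, 14, 18)]

x_between = [m for m in measurements if in_range(m[0])]
xy_between = [m for m in measurements if in_range(m[0]) and in_range(m[1])]
xyz_between = [m for m in measurements if all(in_range(d) for d in m)]

print(x_between)    # 1D, 2D, and 3D measurements can all pass the X-only constraint
print(xy_between)   # a 3D measurement can pass the XY constraint as well
print(xyz_between)
```

Note that ``(17, 8, None)`` passes the X-only constraint even though its Y dimension is below 10 mm, just as described for the ``xBetween10and20mm`` rows.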

### NLPQL Expressions

In the NLPQL example above, we enforced constraints on the measurement dimensions with NLPQL expressions. In this section we describe the different expression types and provide an overview of how ClarityNLP evaluates them.

#### Mathematical Expressions

An NLPQL mathematical expression is found in a ``define`` statement such as:
<pre>
define hasFever:
    where Temperature.value >= 100.4;
    
define xBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20;
</pre>

The portion of the statement following the ``where`` keyword is the mathematical expression. These expressions involve mathematical operations on variables of the form ``nlpql_feature.variable_name`` such as ``Temperature.value``, ``LesionMeasurement.dimension_X``, etc. They can also include numeric literals such as ``100.4``.

NLPQL mathematical expressions produce a numerical result from data contained in a **single** task result document. Since each task result document comprises a row in the intermediate results CSV file (see above), the evaluation of mathematical expressions is also called a **single-row operation**. The numerical result from the expression evaluation is written to a new MongoDB result document, as demonstrated in the lesion example above.

#### Logical Expressions

An NLPQL logical expression is also found in a ``define`` statement and involves the logical operators AND, OR, and NOT, such as:
<pre>
define hasSepsis:
    where hasFever AND hasSepsisSymptoms;

define hasNoZConstraint:
    where xBetween10and20mm OR xyBetween10and20mm;
</pre>

The ``where`` portion of the statement is the logical expression. Logical expressions **operate on NLPQL features** such as ``hasFever`` and ``hasSepsisSymptoms``, **not** on individual variables such as ``Temperature.value``.

NLPQL logical expressions use data from one or more task result documents to compute results. The results are written back to MongoDB as a set of new result documents. The evaluation of a logical expression is also called a **multi-row operation**, since it typically consumes and generates multiple rows in the intermediate results CSV file.

The presence of an ``nlpql_feature.variable_name`` token indicates that the expression is actually a single-row operation, not multi-row.

One further constraint: **no parentheses are allowed in logical expressions**. The reason for this limitation is probably clear by now: each logical operand refers to a set of result documents in the Mongo database, all of which share an identical ``nlpql_feature`` field value. Parenthesized operands have no such ``nlpql_feature`` value.

The practical meaning of this is that you need to **synthesize complex logic from simpler parts**. For instance:

<pre>
// not legal
define myLogicFeature:
    where A AND (B OR C);
</pre>

If you imagine looking at a spreadsheet containing the intermediate results for this run, you would see a set of rows for nlpql_feature ``A``, a set of rows for nlpql_feature ``B``, and another set of rows for nlpql_feature ``C``. Where is the set of rows for ``(B OR C)``? It doesn't exist! It's not in the spreadsheet nor in the Mongo database.

To actually evaluate this expression you should write it as two simple statements, each with **its own defined nlpql_feature**:

<pre>
// nlpql_feature == BorC
define BorC:
    where B OR C;

define myLogicFeature:
    where A AND BorC;
</pre>

### Evaluation of Single-Row (Mathematical) Expressions

So what does ClarityNLP actually do to evaluate a mathematical expression?

First, the NLPQL front end parses the NLPQL file and generates a string of whitespace-separated tokens for each expression. The token string is passed to the evaluator, which determines if it is single-row, multi-row, or something else that cannot be evaluated. If single-row, the **nlpql_feature** and **field list** are extracted.

Consider these examples, both of which are single-row mathematical expressions:

<pre>
where Temperature.value >= 100.4
where LesionMeasurement.dimension_X < 5 AND LesionMeasurement.dimension_Y < 5
</pre>

The ``nlpql_feature`` and field list for the first example are:
<pre>
    nlpql_feature: Temperature
    field_list:    ['value']
</pre>
For the second example:
<pre>
    nlpql_feature: LesionMeasurement
    field_list:    ['dimension_X', 'dimension_Y']
</pre>

Important point: there is always a **single** ``nlpql_feature`` for a mathematical expression, since each MongoDB result document contains a single value for this field.
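The extraction step can be sketched with a regular expression that recognizes ``nlpql_feature.variable_name`` tokens. The function name and the regex are our own assumptions for illustration; the real evaluator may work differently.

```python
import re

# A variable token has the form nlpql_feature.variable_name; numeric literals
# such as 100.4 do not match because they start with a digit.
VARIABLE_TOKEN = re.compile(r"^([A-Za-z_]\w*)\.([A-Za-z_]\w*)$")

def extract_feature_and_fields(token_string):
    """Return (nlpql_feature, field_list), or None if no variable tokens are present."""
    features, fields = [], []
    for token in token_string.split():
        match = VARIABLE_TOKEN.match(token)
        if match is None:
            continue
        feature, field = match.groups()
        if feature not in features:
            features.append(feature)
        if field not in fields:
            fields.append(field)
    if not features:
        return None  # no variable tokens, so not a single-row expression
    if len(features) > 1:
        raise ValueError("a single-row expression must reference exactly one nlpql_feature")
    return features[0], fields

print(extract_feature_and_fields("Temperature.value >= 100.4"))
print(extract_feature_and_fields(
    "LesionMeasurement.dimension_X < 5 AND LesionMeasurement.dimension_Y < 5"))
print(extract_feature_and_fields("hasFever AND hasSepsisSymptoms"))
```

The last call returns ``None`` because the expression contains only feature names, which marks it as a candidate multi-row expression instead.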

### MongoDB Aggregation Framework

It is desirable to use the capabilities of MongoDB for expression evaluation since the data is already contained within Mongo itself. Use of another library for evaluation would require queries to extract the data from Mongo; transmission across a network (from a remote Mongo host); ingest into a new library; then network transmission of the results to the Mongo host followed by insertion back into Mongo. This could be a very inefficient process for large data volumes and busy networks.

The [MongoDB aggregation framework](https://docs.mongodb.com/manual/aggregation/) is a powerful facility upon which to build an expression evaluator. The aggregation framework provides filtering, document transformation, and mathematical operations that ClarityNLP uses to evaluate expressions. 

The aggregation framework evaluates expressions via an "aggregation pipeline", which is a series of commands (called stages) that operate on a set of documents and produce a new set as a result. Ideally the stages are arranged so that those that eliminate the most documents run first, which leaves less data for the later stages to process.

#### Initial Filter Stage

The conversion process involves the generation of an initial [``$match``](https://docs.mongodb.com/manual/reference/operator/aggregation/match/#pipe._S_match) query to filter out everything but the data for the desired job, which is identified by its ``job_id``. The match query also checks for the existence of all entries in the field list and that they have non-null values. **A simple existence check is not sufficient**, since a null field actually exists but has a value that cannot be used for computation. Hence checks for existence and a non-null value are both necessary.

For the two examples above, the ``$match`` query generates a pipeline filter stage that looks like this, assuming a job_id of 12345:

<pre>
// Temperature.value >= 100.4
{
    $match : {
        "job_id" : 12345,
        "nlpql_feature" : {$exists:true, $ne:null},
        "value"         : {$exists:true, $ne:null}
    }
}

// LesionMeasurement.dimension_X < 5 AND LesionMeasurement.dimension_Y < 5
{
    $match : {
        "job_id" : 12345,
        "nlpql_feature" : {$exists:true, $ne:null},
        "dimension_X"   : {$exists:true, $ne:null},
        "dimension_Y"   : {$exists:true, $ne:null}
    }
}
</pre>

Note that the presence of this initial filter is the reason why the ``xBetween10and20mm`` results ignore the Y and Z dimensions, and why the ``xyBetween10and20mm`` results ignore the Z dimension. There are no filters on those variables in their respective ``define`` statements!
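Generating this filter stage from a ``job_id`` and a field list is straightforward. Here is a minimal sketch that builds the stage as a Python dictionary; ``build_match_stage`` is a hypothetical name, and Python's ``True`` and ``None`` stand in for the ``true`` and ``null`` shown above.

```python
def build_match_stage(job_id, field_list):
    """Build the initial $match stage for a single-row expression."""
    match = {"job_id": job_id}
    # Each field must both exist and be non-null to be usable in a computation.
    for field in ["nlpql_feature"] + list(field_list):
        match[field] = {"$exists": True, "$ne": None}
    return {"$match": match}

stage = build_match_stage(12345, ["dimension_X", "dimension_Y"])
print(stage)
```

Only the fields named in the expression end up in the filter, so a field such as ``dimension_Z`` is never checked here.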

#### Conversion from Infix to Postfix

After generating the initial filter stage, ClarityNLP **converts the mathematical expression from infix to postfix**. The postfix conversion removes parentheses and resolves operator precedence and associativity issues. NLPQL uses the same [operator precedence](https://docs.python.org/3/reference/expressions.html#operator-precedence) and associativity as the Python programming language.
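One standard way to perform this conversion is the shunting-yard algorithm. The sketch below is our own illustration, not ClarityNLP's implementation: it handles a small subset of binary operators, treats them all as left-associative, and assumes whitespace-separated tokens.

```python
# Precedence for a small operator subset (higher binds tighter), ordered as in Python
PRECEDENCE = {
    "OR": 1, "AND": 2,
    "<": 3, "<=": 3, ">": 3, ">=": 3, "==": 3, "!=": 3,
    "+": 4, "-": 4,
    "*": 5, "/": 5,
}

def infix_to_postfix(expression):
    """Convert a whitespace-separated infix expression to a list of postfix tokens."""
    output, operators = [], []
    for token in expression.split():
        if token == "(":
            operators.append(token)
        elif token == ")":
            # Pop operators until the matching "(" is found
            while operators and operators[-1] != "(":
                output.append(operators.pop())
            if not operators:
                raise ValueError("unbalanced parentheses")
            operators.pop()  # discard the "("
        elif token in PRECEDENCE:
            # Pop operators of equal or higher precedence (left associativity)
            while (operators and operators[-1] != "("
                   and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]):
                output.append(operators.pop())
            operators.append(token)
        else:
            output.append(token)  # operand: variable or numeric literal
    while operators:
        operator = operators.pop()
        if operator == "(":
            raise ValueError("unbalanced parentheses")
        output.append(operator)
    return output

print(infix_to_postfix(
    "LesionMeasurement.dimension_X < 5 AND LesionMeasurement.dimension_Y < 5"))
```

The comparison operators bind more tightly than ``AND``, so each comparison is emitted before the ``AND`` that combines them.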

#### Stack-Based Evaluation

A postfix expression can be evaluated with a stack-based evaluator. The general idea is to push the postfix tokens onto a stack until an operator is encountered, at which point its operands are popped, the operator expression evaluated, and the result pushed back onto the stack.

ClarityNLP uses this method, but **the evaluation process does not compute a mathematical result**. Instead, it performs string processing to format MongoDB aggregation commands for evaluating the mathematical expression. MongoDB aggregation uses a consistent syntax that makes this automated formatting process possible.

After the postfix evaluation and formatting operations the expressions become:

<pre>
// (nlpql_feature == Temperature) and (value >= 100.4)
{
    "$match" : {
        "job_id" : 12345,
        "nlpql_feature" : {$exists:true, $ne:null},
        "value"         : {$exists:true, $ne:null}
    }
},
{
    "$project" : {
        "value" : {
            "$and" : [
                {"$eq"  : ["$nlpql_feature", "Temperature"]},
                {"$gte" : ["$value", 100.4]}
            ]
        }
    }
}

// (nlpql_feature == LesionMeasurement) and (dimension_X < 5 and dimension_Y < 5)
{
    "$match" : {
        "job_id" : 12345,
        "nlpql_feature" : {$exists:true, $ne:null},
        "dimension_X"   : {$exists:true, $ne:null},
        "dimension_Y"   : {$exists:true, $ne:null}
    }
},
{
    "$project" : {
        "value" : {
            "$and" : [
                {
                    "$eq" : ["$nlpql_feature", "LesionMeasurement"]
                },
                {
                    "$and" : [
                        {"$lt" : ["$dimension_X", 5]},
                        {"$lt" : ["$dimension_Y", 5]}
                    ]
                }
            ]
        }
    }
}
</pre>

At this point the aggregation pipelines for both expressions are complete. Each pipeline is sent to MongoDB where it runs and generates the results seen in the spreadsheet output above.
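The stack-based formatting step can be sketched as follows. Instead of computing numbers, the stack holds partially built MongoDB expressions. The function names and the operator table are assumptions made for this illustration; the input is a list of postfix tokens for a single-row expression.

```python
# Map NLPQL operators to MongoDB aggregation operators (a small subset)
MONGO_OPERATORS = {
    "<": "$lt", "<=": "$lte", ">": "$gt", ">=": "$gte", "==": "$eq", "!=": "$ne",
    "AND": "$and", "OR": "$or",
    "+": "$add", "-": "$subtract", "*": "$multiply", "/": "$divide",
}

def to_operand(token):
    """Convert an operand token to a number or a MongoDB field reference."""
    for convert in (int, float):
        try:
            return convert(token)
        except ValueError:
            pass
    # nlpql_feature.variable_name becomes "$variable_name"
    return "$" + token.rsplit(".", 1)[-1]

def postfix_to_project(postfix_tokens, nlpql_feature):
    """Format a $project stage from postfix tokens; nothing is computed here."""
    stack = []
    for token in postfix_tokens:
        if token in MONGO_OPERATORS:
            right = stack.pop()
            left = stack.pop()
            stack.append({MONGO_OPERATORS[token]: [left, right]})
        else:
            stack.append(to_operand(token))
    expression = stack.pop()
    return {"$project": {"value": {"$and": [
        {"$eq": ["$nlpql_feature", nlpql_feature]},
        expression,
    ]}}}

stage = postfix_to_project(["Temperature.value", "100.4", ">="], "Temperature")
print(stage)
```

The result has the same shape as the ``$project`` stage shown above for the temperature example.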

### Evaluation of Multi-Row (Logical) Expressions

Multi-row expressions apply the logical operations ``AND``, ``OR``, and ``NOT`` to **sets** of MongoDB result documents. The sets are determined by the different values of the ``nlpql_feature`` field. In the lesion example above, a multi-row expression for accepting measurements with no constraint on the Z-component is:

<pre>
define hasNoZConstraint:
    where xBetween10and20mm OR xyBetween10and20mm;    
</pre>

This logical OR operates on two **sets** of results. The first set contains all result documents whose ``nlpql_feature`` field has the value ``xBetween10and20mm``. The second set contains all result documents whose ``nlpql_feature`` field has the value ``xyBetween10and20mm``. The result of this logical OR is a new set of documents, each of which **satisfies the logical OR condition individually**.

The logical ``NOT`` operation is used to compute **set differences**, such as in this expression:

<pre>
define hasSepsisSymptomsWithoutRigors:
    where hasSepsisSymptoms NOT hasRigors;
</pre>

The result of the set difference operation ``A NOT B`` is all elements of set ``A`` that are not also elements of set ``B``.

Also, **unary ``NOT`` is not supported**. So you cannot write statements such as ``NOT A``, or ``NOT hasNoZConstraint``.
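Python's built-in sets give a feel for these operations. In this simplified sketch each feature is modeled by the set of subject IDs that have at least one result document for it (a Patient context); the IDs are made up, and real results are sets of documents, not bare IDs.

```python
# Subject IDs having at least one result document for each feature (made-up values)
has_sepsis_symptoms = {"p1", "p2", "p3"}
has_rigors = {"p2"}
has_fever = {"p1", "p4"}

either = has_fever | has_sepsis_symptoms            # hasFever OR hasSepsisSymptoms
both = has_fever & has_sepsis_symptoms              # hasFever AND hasSepsisSymptoms
without_rigors = has_sepsis_symptoms - has_rigors   # hasSepsisSymptoms NOT hasRigors

print(sorted(either))
print(sorted(both))
print(sorted(without_rigors))
```

Set difference always needs a left-hand set to subtract from, which is one way to see why a unary ``NOT`` has no meaning here.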

#### Document Filtering and Grouping

Evaluation of an n-ary logical expression uses the MongoDB aggregation pipeline as well. The evaluator proceeds by filtering result documents by the job_id, similar to the process described above for single-row expressions. Next, an additional filter stage is applied that discards all documents whose ``nlpql_feature`` value differs from those of the sets being logically combined.

Any documents that remain are **grouped by value of the context variable**, which is the ``report_id`` field for a Document context, or the ``subject`` field for a Patient context. For a logical NOT operation, any documents whose ``nlpql_feature`` field equals that of the "B" set are discarded.

For logical AND the documents within each group are **counted**. Any groups not having **at least** ``n`` members for an n-ary logical AND are discarded. Additionally, any groups not having **at least** ``n`` different nlpql_features are discarded as well.

Next, the documents in each group are sorted on the value of the ‘other’ context variable. Thus for a patient context the documents in each group are sorted on the ``report_id`` field. This sort operation generates **subgroups** of documents sharing the same value of the ‘other’ field.

To summarize the state of the result documents at this point: all surviving documents have been filtered and separated into groups. The members of each group all share identical values of the context variable. Within each group, the documents are further separated into subgroups. The documents in each subgroup have identical values of the ‘other’ context variable.
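The grouping, counting, and subgrouping steps can be sketched in plain Python. ClarityNLP performs these steps inside the MongoDB aggregation pipeline, so this is only a model of the logic; the function names and the sample documents are made up, and the defaults correspond to a Patient context.

```python
from collections import defaultdict

def group_by_context(documents, context_field="subject", other_field="report_id"):
    """Group documents by the context field, then subgroup by the 'other' field."""
    groups = defaultdict(lambda: defaultdict(list))
    for doc in documents:
        groups[doc[context_field]][doc[other_field]].append(doc)
    # Sort the subgroups on the value of the 'other' field
    return {key: dict(sorted(subgroups.items())) for key, subgroups in groups.items()}

def keep_groups_for_and(groups, n):
    """Keep only groups with at least n documents and at least n distinct features."""
    kept = {}
    for key, subgroups in groups.items():
        docs = [doc for subgroup in subgroups.values() for doc in subgroup]
        features = {doc["nlpql_feature"] for doc in docs}
        if len(docs) >= n and len(features) >= n:
            kept[key] = subgroups
    return kept

# Made-up result documents, reduced to the fields that matter here
documents = [
    {"subject": "40463", "report_id": "1048492", "nlpql_feature": "hasFever"},
    {"subject": "40463", "report_id": "1048001", "nlpql_feature": "hasSepsisSymptoms"},
    {"subject": "37766", "report_id": "1050010", "nlpql_feature": "hasFever"},
    {"subject": "37766", "report_id": "1050010", "nlpql_feature": "hasFever"},
]
groups = group_by_context(documents)
survivors = keep_groups_for_and(groups, n=2)
print(sorted(groups))       # ['37766', '40463']
print(sorted(survivors))    # ['40463']
```

Subject ``37766`` has two documents but only one distinct feature, so its group is discarded for a 2-ary AND.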

#### Formation of Ntuples

For set difference and logical OR, the processing is complete, and the documents are written out to MongoDB with an ``nlpql_feature`` determined by the ``define`` statement containing the expression.

For an n-ary logical AND, the grouped documents are further arranged into ntuples, since the AND condition requires n-document groups. Each ntuple contains n documents, each of which has a different value for the ``nlpql_feature`` field.

The ntuple formation process generates a **subset** of the full result of performing an n-way inner join on the context field. **The full inner join is not necessary to find the patient IDs**, since the join condition is **not** the entire document, but only the values of the context field. These values are known completely after the grouping operation mentioned a few paragraphs above, so exploding the result set into a full n-way inner join is not going to find any more of those values. The same is also true for a document context.
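A toy comparison shows why the full join adds nothing for finding patients. In this sketch the positional pairing done by ``zip`` is an arbitrary choice made for illustration, not necessarily how ClarityNLP selects its ntuples, and the document IDs are made up.

```python
from itertools import product

# Documents per nlpql_feature for each patient, after grouping (made-up IDs)
groups = {
    "40463": {"hasFever": ["d1", "d2"], "hasSepsisSymptoms": ["d3", "d4", "d5"]},
    "37766": {"hasFever": ["d6"], "hasSepsisSymptoms": ["d7"]},
}

def full_inner_join(feature_docs):
    """Every combination of one document per feature."""
    return list(product(*feature_docs.values()))

def simple_ntuples(feature_docs):
    """One document per feature, paired positionally: a subset of the full join."""
    return list(zip(*feature_docs.values()))

patients_from_groups = set(groups)
patients_from_join = {p for p, docs in groups.items() if full_inner_join(docs)}

for patient, feature_docs in groups.items():
    print(patient, len(simple_ntuples(feature_docs)), "of",
          len(full_inner_join(feature_docs)))
print(patients_from_groups == patients_from_join)
```

The full join produces more rows for patient ``40463``, but the set of patient IDs is the same either way.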

## NLPQL Editor

We would now like to show you how to use our [NLPQL editor tool](https://nlpql-editor.herokuapp.com/demo.html) to reproduce the ``lesion.nlpql`` file that we created earlier. You will see that this tool greatly simplifies the process of creating NLPQL files.

When you start the editor you see a screen that looks like this:

![nlpql_editor_1.png](assets/nlpql_editor_1.png)

On the left side, the blue buttons allow you to create pre-formatted "templates" for NLPQL statements. The text of the templated statement appears in the edit window to the right of the buttons. The blue-green buttons at the upper right allow you to clear the edit window and start over, as well as copy the NLPQL text to the system clipboard.

Now let's walk through the process of creating our lesion.nlpql file using the editor.

To begin, notice that all but the two topmost blue buttons are disabled. You should click one of these buttons to start things off. 

Set the name for the lesion phenotype by pressing the ``Phenotype Name`` button. After you click the button a new dialog box pops up into which you can enter the text strings as shown in the next image:

![nlpql_editor_2.png](assets/nlpql_editor_2.png)

When finished, save your changes and you should see this:

![nlpql_editor_3.png](assets/nlpql_editor_3.png)

Simple, isn't it?

The next step is to click the ``Add Library`` button, which will insert the ``include`` statement for the ClarityNLP libraries. When you do that, the lower set of buttons becomes active, so click the ``Limit Query`` button and enter ``50`` into the dialog that appears. Save your changes and you should see this:

![nlpql_editor_4.png](assets/nlpql_editor_4.png)

There is no button for the ``description`` field, since this is just a quoted string that you can enter directly into the edit window. The description is optional, so we will omit it for now.

Now let's create our documentset of radiology reports. Click the ``Add Document Set`` button and enter ``Docs`` into the topmost edit control. From the ``Document Set Type`` combo box, select the ``By Query`` option. This causes the dialog to expand in size, and you will need to scroll the window to see it all. Enter ``Radiology`` into the ``Report Types`` edit control, as shown here:

![nlpql_editor_5.png](assets/nlpql_editor_5.png)

After entering the data, scroll down to the bottom of the dialog and save your changes. You should see the documentset statement appear in the edit window, as shown here:

![nlpql_editor_6.png](assets/nlpql_editor_6.png)

Now we're getting somewhere. The next task is to enter the termset containing our carefully-selected lesion terms. Click the ``Add Term Set`` button. Set the ``Termset Name`` to ``LesionTerms``, and enter the lesion terms as a (long) comma-separated list:

![nlpql_editor_7.png](assets/nlpql_editor_7.png)

After entering the list of lesion terms, save your changes and you should see the termset appear in the edit window. The long list of terms will scroll off the screen, so insert some newlines in the list so that everything is visible at once:

![nlpql_editor_8.png](assets/nlpql_editor_8.png)

The next thing to do is insert the command to run the measurement finder, which you can do by clicking on the ``Define Feature`` button (recall the discussion of NLPQL **features** above). In the popup dialog, enter ``LesionMeasurement`` as the feature name (with no space, since the expressions that follow refer to it as ``LesionMeasurement.dimension_X``) and select ``Measurement Finder`` from the algorithm selection combo box. When you do this, additional edit controls will appear. Enter ``LesionTerms`` for the termset and ``Docs`` for the document set:

![nlpql_editor_9.png](assets/nlpql_editor_9.png)

After saving changes you should see this as the result:

![nlpql_editor_10.png](assets/nlpql_editor_10.png)

Set the context by clicking the ``Set Logical Context`` button and selecting ``Patient`` in the pop-up dialog. This should be the result:

![nlpql_editor_11.png](assets/nlpql_editor_11.png)

Now all that's left is for us to add the expressions that set constraints on the measurements. To do this, click the ``Define Result`` button. In the ``Result Name`` edit control, enter ``xBetween10and20mm``. In the ``Logic`` edit control below it, enter the string ``LesionMeasurement.dimension_X >= 10 and LesionMeasurement.dimension_X <= 20``. Note that you do not need to type the ``where`` keyword or the terminating semicolon:

![nlpql_editor_12.png](assets/nlpql_editor_12.png)

When you save changes you should see this as the result:

![nlpql_editor_13.png](assets/nlpql_editor_13.png)

Observe the appearance of the ``final`` keyword. We did not use this keyword previously because we wanted all results to appear in the single intermediate CSV file, so that we could see everything together in one place. If we keep the ``final`` keyword it will cause the ``xBetween10and20mm`` result set to appear in the "final" phenotype CSV file. Since we want to reproduce the results above, go ahead and delete this keyword from the text in the edit window.
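As a rough illustration of what the ``xBetween10and20mm`` expression selects, here is a small Python sketch that applies the same condition to a few made-up measurement records. It is only a sketch under stated assumptions: the records and the helper function are hypothetical, the ``subject`` field is assumed to hold the patient ID, and ``dimension_X`` is assumed to be expressed in millimeters, as the result name suggests. ClarityNLP evaluates the expression itself; you do not need to write any code like this.

```python
# Hypothetical measurement-finder results; only the fields used below
# are shown, and all values are made up for illustration.
measurements = [
    {"subject": "p1", "nlpql_feature": "LesionMeasurement", "dimension_X": 8.0},
    {"subject": "p2", "nlpql_feature": "LesionMeasurement", "dimension_X": 10.0},
    {"subject": "p3", "nlpql_feature": "LesionMeasurement", "dimension_X": 15.5},
    {"subject": "p4", "nlpql_feature": "LesionMeasurement", "dimension_X": 20.0},
    {"subject": "p5", "nlpql_feature": "LesionMeasurement", "dimension_X": 23.0},
]


def x_between_10_and_20_mm(doc):
    """Mirror of: LesionMeasurement.dimension_X >= 10 and
    LesionMeasurement.dimension_X <= 20 (both bounds inclusive)."""
    x = doc.get("dimension_X")
    return x is not None and x >= 10 and x <= 20


matches = [m for m in measurements if x_between_10_and_20_mm(m)]
matching_subjects = [m["subject"] for m in matches]
print(matching_subjects)
```

Because both comparisons are inclusive, measurements of exactly 10 mm and exactly 20 mm satisfy the constraint.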

At this point, you could click the ``Define Result`` button and create the other constraints in a similar manner. It is probably easier to copy the existing constraint, paste it, and edit the text directly in the edit window. By whatever method you choose, create the two other constraints and arrive at this final result:

![nlpql_editor_14.png](assets/nlpql_editor_14.png)

Your NLPQL creation can now be copied to the clipboard and saved to a file called ``lesion.nlpql``. With this file you can follow the discussion above and you should see very similar results if you run it on the MIMIC-III data set.

### Thanks

Thank you for joining this week's Cooking with ClarityNLP!  Please send any requests or ideas for future Cooking shows to <charity.hilton@gtri.gatech.edu>.

Have a great week!