## Cooking with ClarityNLP - Session #7 - NLPQL Under the Hood

Today we will take a behind-the-scenes look at how ClarityNLP evaluates NLPQL expressions. We will walk through the construction of an NLPQL file and give a high-level description of how its results are generated. We will also provide an overview of our new NLPQL editor tool that makes the task of creating NLPQL files much easier. For background on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues).

### Extracting Measurements from Radiology Reports

To start things off, suppose that we're developing a promising new immunotherapy drug. This drug has proven effective on tumors of various sizes, but we have noted particular efficacy for tumors in the 1 cm to 2 cm size range. We want to recruit patients for a new clinical trial designed to test the drug on tumors of this size.  We have access to a corpus of radiology reports, and we would like to search these reports for patients with appropriately-sized tumors. How can we use ClarityNLP to find more patients?

As you've learned in previous cooking sessions, we need to create an NLPQL file with the relevant commands.

When developing new NLPQL, it is best to limit the number of documents processed until the NLPQL is fully debugged and working. So let's start by limiting our initial document set to 50 documents. It shouldn't take too long to process 50 documents, and if we make a mistake, we can quickly recover.

A limit on the number of documents processed is specified by a ``limit`` statement on the first line of the NLPQL file. So open a text editor, create a new file called ``lesion.nlpql``, and enter the following line:

<pre>limit 50;</pre>

Next we need to insert some boilerplate that identifies the phenotype and version, provides a description, and imports the ClarityNLP libraries. All of your NLPQL files will have something like this at the start.

<pre>phenotype "Lesions1to2Cm" version "1";
description "Find lesions of sizes ranging from 1 to 2 cm.";
include ClarityCore version "1.0" called Clarity;</pre>

Since we want to search only radiology reports, we can create a documentset specifically for this purpose. Note that the ``report_types`` field is an array with the single entry ``Radiology``. We will process documents from the MIMIC-III dataset, which identifies radiology reports with this label.

<pre>
documentset Docs:
    Clarity.createDocumentSet({
        "report_types":["Radiology"]
    });
</pre>

Next we need to create a list of the terms we want ClarityNLP to search for. We ponder this for a while and eventually construct a termset that uses language common to radiology:

<pre>
termset LesionTerms: [
    "lesion", "growth", "mass", "malignancy", "tumor",
    "neoplasm", "nodule", "cyst", "focus of enhancement",
    "echodensity", "hypoechoic focus", "echogenic focus"
];
</pre>

Since we need to find and extract measurements, we must insert a command to activate ClarityNLP's measurement finder. The simplest command to do this is:

<pre>
define LesionMeasurement:
    Clarity.MeasurementFinder({
        documentset: [Docs],
        termset: [LesionTerms]
    });
</pre>

Observe how we tell the measurement finder to examine only the documents in our custom document set. The termset specification tells the measurement finder to return a measurement only if it appears in the same sentence as one of our custom lesion terms.

Our goal is to find **patients** with tumors of the specified dimensions, so we specify a ``Patient`` context:

<pre>
context Patient;
</pre>

Now we're ready to write the commands for constraining the lesion measurements to our desired size of 1-2 cm. Here we will insert three commands to do so, and will explain the differences in results for each below:

<pre>
define xBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20;

define xyBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20 AND
          LesionMeasurement.dimension_Y >= 10 AND LesionMeasurement.dimension_Y <= 20;

define xyzBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20 AND
          LesionMeasurement.dimension_Y >= 10 AND LesionMeasurement.dimension_Y <= 20 AND
          LesionMeasurement.dimension_Z >= 10 AND LesionMeasurement.dimension_Z <= 20;
</pre>

ClarityNLP normalizes all dimensional measurements to units of **millimeters**, so our desired range of 1-2 cm becomes 10-20 mm. These three statements enforce constraints on the X, XY, and XYZ measurement components respectively.
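To make the unit conversion concrete, here is a minimal sketch of normalizing a length to millimeters and testing it against the 10-20 mm window. The `MM_PER_UNIT` table and the `to_mm` and `in_range_mm` helpers are hypothetical names used only for illustration; they are not part of ClarityNLP, which performs this normalization internally.

```python
# Hypothetical conversion table: millimeters per unit (not ClarityNLP code).
MM_PER_UNIT = {"mm": 1.0, "cm": 10.0, "m": 1000.0, "in": 25.4}


def to_mm(value, units):
    """Convert a length to millimeters."""
    return value * MM_PER_UNIT[units]


def in_range_mm(value, units, low=10.0, high=20.0):
    """Return True if the length, converted to mm, lies in [low, high]."""
    return low <= to_mm(value, units) <= high


print(to_mm(1.5, "cm"))        # 1.5 cm expressed in mm
print(in_range_mm(1.5, "cm"))  # inside the 10-20 mm window
print(in_range_mm(2.8, "cm"))  # 28 mm, outside the window
```

Both endpoints are inclusive here, matching the `>=` and `<=` operators used in the NLPQL expressions.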

And with that we're done. Here is the text of the final ``lesion.nlpql``:

<pre>
limit 50;
phenotype "Lesions1to2Cm" version "1";
description "Find lesions of sizes ranging from 1 to 2 cm.";
include ClarityCore version "1.0" called Clarity;

// radiology documents only in the documentset
documentset Docs:
    Clarity.createDocumentSet({
        "report_types":["Radiology"]
    });

// lesion terms
termset LesionTerms: [
    "lesion", "growth", "mass", "malignancy", "tumor",
    "neoplasm", "nodule", "cyst", "focus of enhancement",
    "echodensity", "hypoechoic focus", "echogenic focus"
];

// extract lesion measurements
define LesionMeasurement:
    Clarity.MeasurementFinder({
        documentset: [Docs],
        termset: [LesionTerms]
    });

// we want to find patients, so use 'Patient' context
context Patient;

define xBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20;

define xyBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20 AND
          LesionMeasurement.dimension_Y >= 10 AND LesionMeasurement.dimension_Y <= 20;

define xyzBetween10and20mm:
    where LesionMeasurement.dimension_X >= 10 AND LesionMeasurement.dimension_X <= 20 AND
          LesionMeasurement.dimension_Y >= 10 AND LesionMeasurement.dimension_Y <= 20 AND
          LesionMeasurement.dimension_Z >= 10 AND LesionMeasurement.dimension_Z <= 20;
</pre>

### Testing the NLPQL Syntax

Before trying to process documents with our new NLPQL file, it is a good idea to first check it for syntax errors. We can do this by submitting it to the ``nlpql_tester`` API endpoint, a useful tool for the NLPQL developer.

In previous cooking sessions we showed you how to use the [Postman](https://www.postman.com) GUI tool to submit NLPQL files to the ClarityNLP webserver. Today we will show you how to use a command-line tool called [cURL](https://curl.haxx.se/) to do the same thing.

The nlpql_tester API for a local ClarityNLP instance is typically found at ``localhost:5000/nlpql_tester``. The NLPQL file should be sent via HTTP POST using a content type of ``text/plain``.

To submit the file, install ``curl`` on your system, then open a terminal window, change directories to the location of ``lesion.nlpql``, and run this command:

<pre>
curl -i -X POST http://localhost:5000/nlpql_tester -H "Content-Type: text/plain" --data-binary "@lesion.nlpql"
</pre>

The various options have the following meanings:
```
-i: include the HTTP response headers in the output
-X: request method (must be POST)
-H: add the header that follows (Content-Type: text/plain) to the HTTP request
--data-binary: POST the data exactly as specified, with no additional processing; the @ prefix reads the data from the named file
```

You can run the NLPQL tester directly from this notebook by first running the code in the next cell:

```python
# The code below is only required for running ClarityNLP in Jupyter notebooks.
# It is not required if running NLPQL via API or the ClarityNLP GUI.
import pandas as pd
import claritynlp_notebook_helpers as claritynlp
```

Now run the next cell to test the NLPQL file:

The system should respond with a JSON result with no mention of error.
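If you would rather stay in Python than use cURL, the same request can be sketched with the standard library. This is a minimal sketch under stated assumptions: the helper names `build_nlpql_request` and `post_nlpql` are hypothetical and are not part of the notebook helper library, and the URL assumes the local instance described above.

```python
import urllib.request

# Assumed endpoint for a local ClarityNLP instance, as described above.
TESTER_URL = "http://localhost:5000/nlpql_tester"


def build_nlpql_request(nlpql_text, url=TESTER_URL):
    """Build (but do not send) a POST request carrying the NLPQL as plain text."""
    return urllib.request.Request(
        url,
        data=nlpql_text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def post_nlpql(nlpql_text, url=TESTER_URL):
    """Send the NLPQL to the server and return the response body as text."""
    request = build_nlpql_request(nlpql_text, url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Building the request needs no running server; sending it does.
request = build_nlpql_request("limit 50;")
print(request.get_method(), request.full_url)
```

With a ClarityNLP server running, reading ``lesion.nlpql`` into a string and passing it to `post_nlpql` would return the JSON reply as text.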

### Running the NLPQL File

Having verified that the NLPQL file has the proper syntax, you can submit the job to the ClarityNLP server via cURL with a similar command:
<pre>
curl -i -X POST http://localhost:5000/nlpql -H "Content-Type: text/plain" --data-binary "@lesion.nlpql"
</pre>

Alternatively, you can run from the next notebook cell:

The job may take several minutes to run. After it runs to completion, browse to the location of the CSV file containing the intermediate results, and open it in a spreadsheet application such as Microsoft Excel. We have saved the results of a run to ``assets/lesion_intermediate.csv``, some of which is displayed in the next cell:

```python
lesion_csv = pd.read_csv('assets/lesion_intermediate.csv',
                         usecols=['dimension_X', 'dimension_Y', 'dimension_Z',
                                  'nlpql_feature', 'subject'])
lesion_csv
```

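|   | dimension_X | dimension_Y | dimension_Z | nlpql_feature | subject |
|---|---|---|---|---|---|
| 0 | 28 | 16 | NaN | LesionMeasurement | 40463 |
| 1 | 6 | NaN | NaN | LesionMeasurement | 40463 |
| 2 | 17 | 8 | NaN | LesionMeasurement | 40463 |
| 3 | 110 | 101 | NaN | LesionMeasurement | 40463 |
| 4 | 7 | NaN | NaN | LesionMeasurement | 37766 |
| 5 | 6 | NaN | NaN | LesionMeasurement | 37766 |
| 6 | 7 | NaN | NaN | LesionMeasurement | 37766 |
| 7 | 39 | 20 | NaN | LesionMeasurement | 26259 |
| 8 | 8 | NaN | NaN | LesionMeasurement | 43634 |
| 9 | 5 | NaN | NaN | LesionMeasurement | 43634 |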


### Interpreting the Results

Our run generated a CSV file containing a header row and 194 rows of data. This CSV file is a dump of the results stored in a MongoDB collection called ``phenotype_results``, which resides in a database called ``nlp``. It is important to understand that **each row** of data above is a separate document in the MongoDB database.

You can see that the results are broadly grouped by the value of the ``nlpql_feature`` field. There are four such groups with values ``LesionMeasurement``, ``xBetween10and20mm``, ``xyBetween10and20mm``, and ``xyzBetween10and20mm``. Take a look at the NLPQL file and see why this is so.
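One quick way to confirm the grouping is to count the rows for each ``nlpql_feature`` value. The sketch below uses only the standard library; `count_features` is a hypothetical helper, and the inline sample is three rows copied from the table above, used as a stand-in for the full file.

```python
import csv
import io
from collections import Counter


def count_features(csv_file):
    """Count result rows per nlpql_feature value."""
    reader = csv.DictReader(csv_file)
    return Counter(row["nlpql_feature"] for row in reader)


# Three rows copied from the table above, standing in for the full file.
sample = io.StringIO(
    "dimension_X,dimension_Y,dimension_Z,nlpql_feature,subject\n"
    "28,16,,LesionMeasurement,40463\n"
    "6,,,LesionMeasurement,40463\n"
    "7,,,LesionMeasurement,37766\n"
)
print(count_features(sample))
```

Passing an open handle to ``assets/lesion_intermediate.csv`` instead of the sample should report one count for each of the four groups.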

A value of ``NaN`` (not a number) is the equivalent of a null result, meaning that no data was found for that measurement dimension.

Rows 0-144 contain the extracted measurements, which have their ``nlpql_feature`` field equal to ``LesionMeasurement``. These rows comprise the output of the measurement extractor and form the **input** data for the mathematical expressions that set constraints on the desired lesion measurements. The underlying documents for these rows in the MongoDB database are called *task result documents*.

Rows 145-183 have their ``nlpql_feature`` field equal to ``xBetween10and20mm``. These result from processing the raw measurements and subjecting them to the stated constraint on the X dimension.

Rows 184-192 have their ``nlpql_feature`` field equal to ``xyBetween10and20mm``. These rows result from processing the raw measurements and subjecting them to the constraints on X and Y.

Row 193 has its ``nlpql_feature`` field equal to ``xyzBetween10and20mm``.  This row is the only measurement that survives the constraint on all three dimensions.

Note that the ``xBetween10and20mm`` results contain 2D and 3D measurements, some of which have Y or Z dimensions that exceed 20 mm (such as rows 175 and 176). The ``xBetween10and20mm`` expression constrains only the X dimension, so the Y and Z dimensions can have any value whatsoever, even NaN, which means they don't exist.

We see a single 3D measurement in the ``xy`` result section, in row 188. The Z dimension of this measurement happens to fall within the same 10-20 mm range, but the ``xyBetween10and20mm`` expression itself imposes no constraint on Z.
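The way unconstrained and missing dimensions behave can be illustrated with a few lines of plain Python. This is only a sketch of the filtering logic, not ClarityNLP's evaluator, and it assumes that a missing dimension can never satisfy a comparison. The `in_range` and `satisfies` helpers are hypothetical, and the sample rows are the first four measurements shown in the table above, with `None` standing in for NaN.

```python
def in_range(value, low=10, high=20):
    """True only if the dimension exists and lies in [low, high] mm."""
    return value is not None and low <= value <= high


def satisfies(row, dims):
    """Check the 10-20 mm constraint on the named dimensions only."""
    return all(in_range(row.get(d)) for d in dims)


# First four LesionMeasurement rows from the results above (None = NaN).
rows = [
    {"X": 28, "Y": 16, "Z": None},
    {"X": 6, "Y": None, "Z": None},
    {"X": 17, "Y": 8, "Z": None},
    {"X": 110, "Y": 101, "Z": None},
]

x_only = [r for r in rows if satisfies(r, ["X"])]
xy = [r for r in rows if satisfies(r, ["X", "Y"])]

print(x_only)  # only the 17 x 8 measurement passes the X constraint
print(xy)      # empty: its Y dimension of 8 mm is below 10 mm
```

Dimensions that are not named in `dims` are never inspected, which mirrors how the X-only expression lets any Y or Z value through.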

### NLPQL Expressions

In the NLPQL example above, we expressed constraints on the measurement dimensions via NLPQL expressions. In this section we describe the different types of expression and provide an overview of how ClarityNLP evaluates them.

The ClarityNLP expression evaluator is built...mongo aggregation...more efficient

math expressions
logic expressions

eval of math expressions
eval of logic expressions

## Case #1:  Sentiment Analysis
For this Cooking session, we are going to integrate a few external APIs that perform [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) and enable their use within the ClarityNLP ecosystem. By the end of the session, you should have a good handle on how to incorporate any REST API into your [NLPQL](https://clarity-nlp.readthedocs.io/en/latest/user_guide/intro/overview.html#example-nlpql-phenotype-walkthrough) phenotypes.

### 1.1 Identify external APIs for sentiment analysis

For this example, we want to leverage some of the brilliant minds in text analytics to help us perform Sentiment Analysis using ClarityNLP.  You may or may not be surprised to learn that there are >100 APIs out there for performing sentiment analysis.

![Sentiment APIs](assets/Sentiment_APIs.png)

Our first stop will be [Microsoft Azure Text Analytics](https://westus.dev.cognitive.microsoft.com/docs/services/TextAnalytics.V2.0/operations/56f30ceeeda5650db055a3c9/console). The Azure Sentiment API lets you pass in a simple sentence or group of sentences and get back an overall sentiment score from 0 to 1, with 0 being very negative and 1 very positive.

Here is an example from Postman:

![Azure sentiment query](assets/Azure_Sentiment_Query.png)

The sentiment score for the above sentence is very low (i.e., negative).  Let's try something a little more upbeat.

![Azure happy query](assets/Azure_Happy_Query.png)

As you can see, we have a much more positive score (0.99+). It's pretty fun to play around with different sentences ("I am super mad at you" scores 0.14, whereas "I am not super mad at you" scores 0.03). Cool stuff, but our goal today is to look at how we might integrate such an API into ClarityNLP.

### 1.2 Transforming APIs into Custom Tasks 

*Start with a Template*

The first thing we'll do is start with a [Custom API Task Base Template](https://github.com/ClarityNLP/ClarityNLP/blob/ceb40586257078ef4f3f7ea91739141d47e83748/nlp/custom_tasks/SampleAPITask.py). This sample task calls an API to assign a random Chuck Norris joke to every document.

```python
from tasks.task_utilities import BaseTask
from pymongo import MongoClient
import requests


class SampleAPITask(BaseTask):
    task_name = "ChuckNorrisJokeTask"

    # NLPQL

    # define sampleTask:
    # Clarity.ChuckNorrisJokeTask({
    #   documentset: [ProviderNotes]
    # });

    def run_custom_task(self, temp_file, mongo_client: MongoClient):
        for doc in self.docs:

            response = requests.post('http://api.icndb.com/jokes/random')
            if response.status_code == 200:
                json_response = response.json()
                if json_response['type'] == 'success':
                    val = json_response['value']
                    obj = {
                        'joke': val['joke']
                    }

                    # writing results
                    self.write_result_data(temp_file, mongo_client, doc, obj)

            else:
                # writing to log (optional)
                self.write_log_data("OOPS", "No jokes this time!")
```

Now there is a lot of stuff to look at in there, but the only part you really have to pay attention to is the middle part below:

```python
     
        for doc in self.docs:

            response = requests.post('http://api.icndb.com/jokes/random')
            if response.status_code == 200:
                json_response = response.json()
                if json_response['type'] == 'success':
                    val = json_response['value']
                    obj = {
                        'joke': val['joke']
                    }

                    # writing results
                    self.write_result_data(temp_file, mongo_client, doc, obj)

            else:
                # writing to log (optional)
                self.write_log_data("OOPS", "No jokes this time!")
```

What this means is that for each document in the selected documentset, the task makes an API POST request. (The parameter `documentset: [ProviderNotes]` from our NLPQL becomes `self.docs` in the Custom Task code.) The documentset could be nursing notes containing the phrase "central line", documents tagged "Echocardiogram", or any other documentset you can imagine, as we discussed in a [prior Cooking class](https://github.com/ClarityNLP/ClarityNLP/blob/master/notebooks/cooking/Cooking_with_ClarityNLP_091218.ipynb). The documents will always be referred to as `self.docs` in a Custom Task.

For every one of these documents, this Task is going to ring up the `http://api.icndb.com/jokes/random` joke API and pick a good joke.  It will then add the joke to an object called `obj` and store it back in our results database.  Now, let's see if we can modify this for our Azure Sentiment API.

*Change the API Call*

For our sentiment analysis, we need to change the POST headers and body to match the Azure API specifications. So we'll change a couple of things:

```python
headers = {'Content-Type': 'application/json', 'Ocp-Apim-Subscription-Key': 'XXXXXX'}
payload = {"documents": [{"language": "en", "id": "1", "text": doc}]}
response = requests.post('https://eastus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment', headers=headers, json=payload)
```

What we've done is added some of the headers required (like our secret API key) and made the body of the request (the "payload") match the configuration shown in the Postman image above. Then instead of calling the ChuckNorris API, we change our call to Microsoft's URL.

*Change the API Result Handling*

Each API returns results in its own way, so you've got to follow the API documentation to see what you can expect back. As we saw earlier, this Sentiment API responds with this kind of result:

```json
{
	"documents": [{
		"score": 0.14780092239379883,
		"id": "1"
	}],
	"errors": []
}
```

So we'll build our object a little differently than we did for Chuck Norris.  It'll need to look something like this.

```python
json_response = response.json()
val = json_response['documents'][0]
obj = {
    'sentiment_score': val['score']
    }
```

If we were passing in multiple documents at a time (which we are not), we would need to loop through the response one document at a time.  But in this case, we can just take the first (and only) response, hence the [0].
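For completeness, here is a minimal sketch of the multi-document case. It assumes the response keeps the shape shown above, with one entry per submitted document under `documents`. The `scores_by_id` helper is hypothetical, and the second sample document is invented for illustration.

```python
import json


def scores_by_id(json_response):
    """Map each document id in the response to its sentiment score."""
    return {doc["id"]: doc["score"] for doc in json_response["documents"]}


# Sample response in the shape shown above; the second entry is invented.
sample = json.loads("""
{
    "documents": [
        {"score": 0.14780092239379883, "id": "1"},
        {"score": 0.5, "id": "2"}
    ],
    "errors": []
}
""")

for doc_id, score in scores_by_id(sample).items():
    print(doc_id, score)
```

Keying the scores by `id` makes it straightforward to match each score back to the document that was submitted with that id.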

*API Keys*

Chuck Norris was a free API. Azure is also free for limited usage, but you need an API key. In this version, we are going to rely on the user to supply the API key by passing a parameter in their NLPQL. Here is an example of the NLPQL we might see:

```
define PatientFeelings:
    Clarity.AzureSentiment({
        documentset: [ProviderNotes],
        "api_key": "{your_api_key}"
    });
```

In order to "catch" this api_key and use it in our Custom Task, we've got a library that gets the custom arguments from the NLPQL. It looks like this:

```
self.pipeline_config.custom_arguments['{parameter_name}']
```

So in this case, `self.pipeline_config.custom_arguments['api_key']` would retrieve the API key submitted by the user in the NLPQL.
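If the user forgets the parameter, a plain lookup would fail with a bare `KeyError` inside the task, so it can help to validate the argument up front. This is a minimal sketch, assuming `custom_arguments` behaves like an ordinary dictionary (which the indexing syntax above suggests); `require_argument` is a hypothetical helper, not a ClarityNLP function.

```python
def require_argument(custom_arguments, name):
    """Return a required custom argument or raise a descriptive error."""
    value = custom_arguments.get(name)
    if not value:
        raise ValueError(f"NLPQL is missing the required argument '{name}'")
    return value


# Stand-in for self.pipeline_config.custom_arguments.
custom_arguments = {"api_key": "XXXXXX"}
api_key = require_argument(custom_arguments, "api_key")
print(api_key)
```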




So putting the whole thing together, we've got our final code:

```python
    for doc in self.docs:
        headers = {'Content-Type': 'application/json', 'Ocp-Apim-Subscription-Key': self.pipeline_config.custom_arguments['api_key']}
        payload = {"documents": [{"language": "en", "id": "1", "text": doc}]}
        response = requests.post('https://eastus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment', headers=headers, json=payload)
        json_response = response.json()
        val = json_response['documents'][0]
        obj = {
            'sentiment_score': val['score'],
        }

        # writing results
        self.write_result_data(temp_file, mongo_client, doc, obj)

```
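The loop above assumes every request succeeds. One way to guard against failures is to check the status code and the response body before reading the score. The sketch below is an assumption about reasonable handling, not the error handling used in the repository; `extract_score` is a hypothetical helper that takes the status code and the parsed JSON so it can be exercised without calling Azure.

```python
def extract_score(status_code, json_response):
    """Return the first document's score, or None if the call failed."""
    if status_code != 200:
        return None
    documents = json_response.get("documents", [])
    if not documents:
        # Nothing was scored; details, if any, are in the "errors" list.
        return None
    return documents[0].get("score")


ok = {"documents": [{"score": 0.14780092239379883, "id": "1"}], "errors": []}
print(extract_score(200, ok))
print(extract_score(401, {}))
```

Inside the task, a `None` result could be reported with `self.write_log_data` instead of writing a result row.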

To see the final code, with the wrapping back in place and a little bit of error handling thrown in, take a look at [AzureSentimentTask.py](https://github.com/ClarityNLP/ClarityNLP/blob/master/nlp/custom_tasks/AzureSentimentTask.py) in the repo. We made one additional tweak to be sure we only send sentences that contain one of the terms in the termset (birds, in our case):

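```python
for doc in self.docs:
    sentence_list = self.get_document_sentences(doc)
    for sentence in sentence_list:
        # only score sentences that mention one of the termset terms
        if any(word.lower() in sentence.lower() for word in self.pipeline_config.terms):
            ...  # call the sentiment API on this sentence, as in the loop above
```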
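The same filter can be tried outside of a Custom Task. A minimal sketch: `sentences_with_terms` is a hypothetical stand-alone version of the check above, and the sample sentences are invented.

```python
def sentences_with_terms(sentences, terms):
    """Keep sentences containing any term, ignoring case."""
    lowered_terms = [term.lower() for term in terms]
    return [s for s in sentences if any(t in s.lower() for t in lowered_terms)]


sentences = [
    "Patient enjoys watching Football on weekends.",
    "No acute distress.",
]
print(sentences_with_terms(sentences, ["football"]))
```

Because this is a substring test, a term such as `mass` would also match `massive`; matching on word boundaries would be stricter.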

### 1.3 Using the Sentiment API Task in a Query 

Our API can now be called using the following NLPQL:

```python
# azure_key must already hold your Azure subscription key
nlpql = '''
limit 1;

//phenotype name
phenotype "How we feel about birds" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Birds:
  ["football"];

define BirdFeelings:
  Clarity.AzureSentiment({
    termset:[Birds],
    "api_key":"'''+azure_key+'''"
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)
```

```
Job Successfully Submitted
{
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/643/phenotype_intermediate",
    "job_id": "643",
    "luigi_task_monitoring": "http://18.220.133.76:8082/static/visualiser/index.html#search__search=job=643",
    "main_results_csv": "http://18.220.133.76:5000/job_results/643/phenotype",
    "phenotype_config": "http://18.220.133.76:5000/phenotype_id/643",
    "phenotype_id": "643",
    "pipeline_configs": [
        "http://18.220.133.76:5000/pipeline_id/862"
    ],
    "pipeline_ids": [
        862
    ],
    "results_viewer": "?job=643",
    "status_endpoint": "http://18.220.133.76:5000/status/643"
}
```


## NLPQL Editor

We know: NLPQL isn't too hard to read or to copy and tweak, but it is pretty tough to generate *de novo*. So we've created an editor that helps you build your NLPQL without worrying about a missed semicolon here or bracket there. Let's [check it out](https://nlpql-editor.herokuapp.com/demo.html).

![NLPQL editor](assets/NLPQL_editor.png)

Thank you for joining this week's Cooking with ClarityNLP!  Please send any requests or ideas for future Cooking shows to charity.hilton@gtri.gatech.edu.

Have a great week!