## Cooking with ClarityNLP - Session #1



The goal of this session is to introduce you to writing basic queries using NLPQL.  For details on installing ClarityNLP, loading data, and tagging document types, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues).

### How to Run NLPQL

To run NLPQL, you must submit it to a ClarityNLP server, either via the API or via the ClarityNLP user interface.  If you are running a local instance, the API endpoint is typically `localhost:5000/nlpql`.  NLPQL should be POSTed as `text/plain`.  An example from [Postman](https://www.postman.com) is shown below.

![Postman.png](attachment:Postman.png)

If you are unfamiliar with tools such as Postman, you can submit NLPQL via the ClarityNLP user interface running in a web browser. For local instances, this will be at [localhost:8200/runner](http://localhost:8200/runner).

![NLPQL_Runner.png](attachment:NLPQL_Runner.png)

If you wish to run NLPQL directly from this notebook, please use the following code.  You will need to edit the `url` variable to point to your ClarityNLP server, for example `http://localhost:5000/` for a local instance, or your ClarityNLP server's IP address.

In [63]:
# The code below is only required for running ClarityNLP in Jupyter notebooks. It is not required if running NLPQL via the API or the ClarityNLP GUI.

import json, csv
import urllib, requests
import pandas as pd

url = 'http://18.220.133.76:5000/'
nlpql_url = url + 'nlpql'
expander_url = url + 'nlpql_expander'
tester_url = url + 'nlpql_tester'

In [87]:
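def run_nlpql(nlpql):
    """POST an NLPQL query to the ClarityNLP server and record the result URLs."""
    # Later cells read these module-level names to locate the results.
    global run_result
    global main_csv
    global intermediate_csv
    global luigi

    # Named `response` rather than `re` to avoid shadowing the regex module name.
    response = requests.post(nlpql_url, data=nlpql, headers={'content-type': 'text/plain'})

    if response.ok:
        run_result = response.json()
        main_csv = run_result['main_results_csv']
        intermediate_csv = run_result['intermediate_results_csv']
        luigi = run_result['luigi_task_monitoring']
        print("Job Successfully Submitted")
        print(json.dumps(run_result, indent=4, sort_keys=True))
    else:
        # Report the failure instead of returning silently.
        print("Job submission failed with HTTP status", response.status_code)
        print(response.text)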


Note: Throughout these tutorials, we will prepend all examples with `limit 100;`.  This limits the server to analyzing a maximum of 100 documents, reducing runtime and compute load when testing new queries. Once a query is producing the expected output, removing this line will allow the full dataset to be run.
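
When iterating on a query from the notebook, it can be convenient to toggle the limit line in code rather than editing the NLPQL text by hand. The following is a minimal sketch, not part of ClarityNLP: `set_limit` is a hypothetical helper, and it assumes the limit statement sits on its own line in the form `limit N;`.

```python
import re

# Hypothetical helper, not part of ClarityNLP.
# Assumes the limit statement is on its own line, written as "limit N;".
LIMIT_LINE = re.compile(r"^[ \t]*limit[ \t]+\d+[ \t]*;[ \t]*\n?", re.MULTILINE)

def set_limit(nlpql, max_docs=None):
    """Remove any existing limit line; if max_docs is given, prepend a new one."""
    stripped = LIMIT_LINE.sub("", nlpql).lstrip("\n")
    if max_docs is None:
        return stripped
    return "limit {};\n\n{}".format(max_docs, stripped)

query = 'limit 100;\n\nphenotype "Orthopnea" version "2";\n'
print(set_limit(query, 10))   # same query, limited to 10 documents
print(set_limit(query))       # same query with the limit removed
```

Calling `set_limit(nlpql)` with no second argument produces the unrestricted query for a full run.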

## Case #1: Congestive Heart Failure
For this first use case, our goal is to find patients with reduced ejection fraction and/or late-stage CHF who are experiencing symptomatic orthopnea.  We will begin by breaking this into smaller components and then combine them into a comprehensive phenotype definition that can be shared across sites.

### 1.1 Find mentions of "Orthopnea" in the patient chart.

We will start with a basic example, simply looking for the presence of a given term in the record.
```java
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.TermFinder({
    termset:[Orthopnea]
    });
```

To run this NLPQL, copy/paste the above and submit it via the API or the ClarityNLP interface.  If you would like to run the NLPQL directly within this notebook, run the code below.

In [83]:
# Sample NLPQL
nlpql ='''
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.TermFinder({
    termset:[Orthopnea]
    });
'''
run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/307/phenotype_intermediate",
    "job_id": "307",
    "luigi_task_monitoring": "http://18.220.133.76:8082/static/visualiser/index.html#search__search=job=307",
    "main_results_csv": "http://18.220.133.76:5000/job_results/307/phenotype",
    "phenotype_config": "http://18.220.133.76:5000/phenotype_id/307",
    "phenotype_id": "307",
    "pipeline_configs": [
        "http://18.220.133.76:5000/pipeline_id/479"
    ],
    "pipeline_ids": [
        479
    ],
    "results_viewer": "?job=307",
    "status_endpoint": "http://18.220.133.76:5000/status/307"
}


### Job Monitoring

Small jobs with limits such as that above typically take less than 60 seconds to run.  Bigger jobs can take much (much) longer.  You can view your job's progress and see how ClarityNLP has broken up the query using the [Luigi Status Monitor](http://18.220.133.76:8082/static/visualiser/index.html).  Here is a screenshot of the above query running in Luigi.

![Screen%20Shot%202018-08-28%20at%2010.27.36%20PM.png](attachment:Screen%20Shot%202018-08-28%20at%2010.27.36%20PM.png)
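
If you would rather check on a job from code, the submission response also includes a `status_endpoint` URL. Below is a minimal polling sketch. The layout of the status response is not described in this session, so the sketch makes no assumption about its fields: the caller supplies the completion test. The names `poll_status`, `fetch_json`, and `is_finished` are hypothetical and introduced here only for illustration.

```python
import json
import time
import urllib.request

def fetch_json(status_url):
    """GET a URL and decode the response body as JSON."""
    with urllib.request.urlopen(status_url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))

def poll_status(status_url, is_finished, interval_seconds=10, max_checks=30,
                fetch=fetch_json, sleep=time.sleep):
    """Fetch status_url until is_finished(status) is true or max_checks is hit.

    is_finished is supplied by the caller because the structure of the
    status JSON is left open here; inspect one response before writing it.
    Returns the last status that was fetched.
    """
    status = None
    for check in range(max_checks):
        status = fetch(status_url)
        if is_finished(status):
            break
        if check < max_checks - 1:
            sleep(interval_seconds)
    return status
```

A typical call would be `poll_status(run_result['status_endpoint'], is_finished=...)`, with the predicate written after looking at one real response from your server.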

### Viewing Results
There are two types of results from an NLPQL query: intermediate results and final results.  *Intermediate results* refer to all data that are extracted by the query.  *Final results* refer to only those patients or documents meeting a specified logic (e.g., cohort criteria).

For the current task, we have not defined any cohort logic.  We asked only to find mentions of orthopnea, which is a data extraction task and will produce only intermediate results.  To download the intermediate results, you can navigate to the `intermediate_results_csv` URL returned by your API call or, if you used this notebook, you can see the URL and a preview below.

In [73]:
print(intermediate_csv)

http://18.220.133.76:5000/job_results/300/phenotype_intermediate


In [70]:
inter_csv_df = pd.read_csv(intermediate_csv)
inter_csv_df.head()

Unnamed: 0,_id,batch,concept_code,end,experiencer,inserted_date,job_id,negation,nlpql_feature,owner,...,report_id,report_type,section,sentence,solr_id,source,start,subject,temporality,term
0,5b8609d5d982be0ca4e3d549,0,-1,95,Patient,2018-08-29 02:49:57.706000,300,Affirmed,hasOrthopnea,clarity,...,1079230,Radiology,UNKNOWN,Clip # [**Clip Number (Radiology) 35135**] Rea...,1079230,MIMIC,86,54604,Recent,orthopnea
1,5b8609d5d982be0ca4e3d54a,0,-1,32,Patient,2018-08-29 02:49:57.710000,300,Affirmed,hasOrthopnea,clarity,...,1079230,Radiology,CONDITION,72 year old woman with orthopnea REASON FOR THIS,1079230,MIMIC,23,54604,Recent,orthopnea
2,5b8609d5d982be0ca4e3d54b,0,-1,44,Patient,2018-08-29 02:49:57.711000,300,Affirmed,hasOrthopnea,clarity,...,1079230,Radiology,PHYSICAL_EXAMINATION,pls eval for chf or other cause of orthopnea F...,1079230,MIMIC,35,54604,Recent,orthopnea
3,5b8609d5d982be0ca4e3d54c,0,-1,9,Patient,2018-08-29 02:49:57.711000,300,Affirmed,hasOrthopnea,clarity,...,1079230,Radiology,HISTORY_PRESENT_ILLNESS,Orthopnea.,1079230,MIMIC,0,54604,Recent,Orthopnea
4,5b8609d5d982be0ca4e3d54d,0,-1,70,Patient,2018-08-29 02:49:57.910000,300,Affirmed,hasOrthopnea,clarity,...,769966,Radiology,UNKNOWN,Clip # [**Clip Number (Radiology) 13510**] Rea...,769966,MIMIC,61,17456,Recent,orthopnea
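
Before tuning a query, it can help to tally how the extracted mentions are distributed across the `negation`, `temporality`, and `experiencer` columns shown in the preview above. The following is a minimal sketch using only the standard library. `tally_columns` is a hypothetical helper, the inline rows are made-up sample data standing in for the downloaded CSV, and the `Negated` and `Historical` labels are assumptions (only `Affirmed`, `Recent`, and `Patient` appear in the preview).

```python
import csv
import io
from collections import Counter

def tally_columns(csv_text, columns):
    """Count how often each distinct value occurs in the given CSV columns."""
    tallies = {column: Counter() for column in columns}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in columns:
            tallies[column][row[column]] += 1
    return tallies

# Made-up sample rows; column names follow the intermediate results preview.
# "Negated" and "Historical" are assumed labels.
sample_csv = """subject,negation,temporality,experiencer,term
54604,Affirmed,Recent,Patient,orthopnea
54604,Negated,Recent,Patient,orthopnea
17456,Affirmed,Historical,Patient,orthopnea
"""

columns = ["negation", "temporality", "experiencer"]
for column, counts in tally_columns(sample_csv, columns).items():
    print(column, dict(counts))
```

With the DataFrame loaded above, `inter_csv_df['negation'].value_counts()` gives the same kind of tally for a single column.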


Looking at the results, you may notice that the `TermFinder` function picks up both negated and affirmed mentions, as well as historical, hypothetical, and any other mentions.  `TermFinder` can be tuned to return only specific results, as in the example below, which returns only affirmed (positive) mentions.

```java
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.TermFinder({
    termset:[Orthopnea],
    negated:"Affirmed"
    });
```

In [84]:
# Sample NLPQL
nlpql ='''
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.TermFinder({
    termset:[Orthopnea],
    negated:"Affirmed"
    });
'''
run_nlpql(nlpql)

Job Successfully Submitted


Because many term searches are actually looking for non-negated, non-hypothetical, subject=patient mentions, we provide a convenience function, `ProviderAssertion`, to capture those mentions without needing to configure `TermFinder`.

```java
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.ProviderAssertion({
    termset:[Orthopnea]
    });
```
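
To make those three conditions concrete, the sketch below applies them after the fact to rows that `TermFinder` has already extracted. This is only an illustration of the criteria as described above, not how `ProviderAssertion` is implemented. The `Affirmed` and `Patient` labels appear in the results preview; the `Hypothetical` label, the `is_patient_assertion` helper, and the sample rows are assumptions.

```python
def is_patient_assertion(row):
    """True for an affirmed, non-hypothetical mention about the patient."""
    return (
        row["negation"] == "Affirmed"
        and row["temporality"] != "Hypothetical"  # assumed label
        and row["experiencer"] == "Patient"
    )

# Made-up rows; keys follow the column names in the intermediate results.
rows = [
    {"subject": 54604, "negation": "Affirmed", "temporality": "Recent", "experiencer": "Patient"},
    {"subject": 54604, "negation": "Negated", "temporality": "Recent", "experiencer": "Patient"},
    {"subject": 17456, "negation": "Affirmed", "temporality": "Hypothetical", "experiencer": "Patient"},
]

asserted = [row for row in rows if is_patient_assertion(row)]
print(len(asserted))  # 1
```

Rows in this shape can be obtained from a results DataFrame with `to_dict('records')`.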

### 1.2 Find NYHA Class III/IV patients or those with EF<30%
For the next component of this tutorial, we will extract specific values related to CHF from the chart.  This is commonly done with the [ValueExtraction](https://claritynlp.readthedocs.io/en/latest/developer_guide/algorithms/value_extraction.html) function.  Extracted values can be numeric or drawn from an enumerated list of values.

In this example, we will search for ejection fraction values using a very simple algorithm: we look for certain terms followed by values that would be typical for EF.  (Far more sophisticated methods exist for finding ejection fraction, e.g., [Kim et al.](https://www.ncbi.nlm.nih.gov/pubmed/28163196))  We will then constrain the "final" cohort to only those patients with an EF of 30 or below.

```java
limit 100;
//phenotype name
phenotype "Ejection Fraction Values" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });

//logical Context (Patient, Document)
context Patient;

define final LowEFPatient:
    where EjectionFraction.value <= 30;
```
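
To illustrate the idea of "a term followed by a value in a plausible range", here is a deliberately simplified sketch built on a regular expression. It is not ClarityNLP's `ValueExtraction` algorithm, which handles many more cases; `extract_values` and the sample sentence are hypothetical and exist only to show the principle.

```python
import re

def extract_values(text, terms, minimum_value, maximum_value):
    """Find each term followed (within 20 non-digit characters) by a number,
    keeping only numbers inside [minimum_value, maximum_value]."""
    # Try longer terms first so "ejection fraction" is preferred over "ef".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(
        r"\b(" + alternation + r")\b\D{0,20}?(\d+(?:\.\d+)?)",
        re.IGNORECASE,
    )
    results = []
    for match in pattern.finditer(text):
        value = float(match.group(2))
        if minimum_value <= value <= maximum_value:
            results.append((match.group(1), value))
    return results

# Made-up sentence for illustration.
sentence = "Echo today: LVEF is 25%. Prior ejection fraction of 55 in 2015."
print(extract_values(sentence, ["ef", "ejection fraction", "lvef"], 10, 85))
# [('LVEF', 25.0), ('ejection fraction', 55.0)]
```

The range check plays the same role as `minimum_value` and `maximum_value` in the NLPQL above: it discards numbers that are unlikely to be ejection fractions.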


In [90]:
# Sample NLPQL
nlpql ='''
limit 100;
//phenotype name
phenotype "Ejection Fraction Values" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });

//logical Context (Patient, Document)
context Patient;

define final LowEFPatient:
    where EjectionFraction.value <= 30;
'''
run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/314/phenotype_intermediate",
    "job_id": "314",
    "luigi_task_monitoring": "http://18.220.133.76:8082/static/visualiser/index.html#search__search=job=314",
    "main_results_csv": "http://18.220.133.76:5000/job_results/314/phenotype",
    "phenotype_config": "http://18.220.133.76:5000/phenotype_id/314",
    "phenotype_id": "314",
    "pipeline_configs": [
        "http://18.220.133.76:5000/pipeline_id/486"
    ],
    "pipeline_ids": [
        486
    ],
    "results_viewer": "?job=314",
    "status_endpoint": "http://18.220.133.76:5000/status/314"
}


The `final` declaration refers to a cohort definition and typically involves some logic.  In this case, we defined an extraction process to pull all values between 10 and 85 that follow EF, LVEF, etc.  We then specified a `context` of Patient, meaning that the logic should operate at the level of a patient.  (The other option is the Document context, which we will describe in a future session.)  Our logical rule states that patients with an EjectionFraction value <= 30 meet our criteria for a Low EF Patient.
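
One way to picture the patient-level rule is as a grouping over the extracted rows: a patient qualifies if at least one of their extracted values satisfies the condition. The sketch below mirrors that reading in plain Python. It illustrates the logic as described here and is not ClarityNLP's internal evaluation; `low_ef_patients` and the sample rows are hypothetical.

```python
def low_ef_patients(rows, threshold=30):
    """Return the sorted ids of patients with at least one value <= threshold."""
    return sorted({row["subject"] for row in rows if row["value"] <= threshold})

# Made-up extraction rows: one row per extracted value.
rows = [
    {"subject": 28785, "value": 15.0},
    {"subject": 57911, "value": 20.0},
    {"subject": 1944, "value": 20.0},
    {"subject": 1944, "value": 55.0},
    {"subject": 4242, "value": 60.0},
]

print(low_ef_patients(rows))  # [1944, 28785, 57911]
```

Patient 1944 qualifies even though one of their values is 55, because another of their values is at or below the threshold.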

Results can be found at the `main_results_csv` URL from your API response or, if you ran the query in this notebook, in the output below:

In [91]:
print(main_csv)

http://18.220.133.76:5000/job_results/314/phenotype


In [92]:
final_csv_df = pd.read_csv(main_csv)
final_csv_df.head()

Unnamed: 0,_id,batch,concept_code,condition,context_type,dimension_X,dimension_Y,dimension_Z,end,inserted_date,...,source,start,subject,temporality,term,text,units,value,value1,value2
0,5b861463d982be111ec73acc,25,-1,EQUAL,subject,15.0,-1.0,,23,2018-08-29 03:34:48.927000,...,MIMIC,0,28785,,,Ejection fraction,,15.0,,
1,5b861463d982be111ec73acd,25,-1,EQUAL,subject,20.0,-1.0,,74,2018-08-29 03:34:49.370000,...,MIMIC,66,57911,,,LVEF,,20.0,,
2,5b861463d982be111ec73ace,25,-1,APPROX,subject,30.0,-1.0,,58,2018-08-29 03:34:49.566000,...,MIMIC,21,68579,,,ejection fraction,,30.0,,
3,5b861463d982be111ec73acf,25,-1,EQUAL,subject,20.0,-1.0,,126,2018-08-29 03:34:49.628000,...,MIMIC,116,1944,,,LVEF,,20.0,,
4,5b861463d982be111ec73ad0,25,-1,EQUAL,subject,20.0,-1.0,,127,2018-08-29 03:34:49.629000,...,MIMIC,104,1944,,,ejection fraction,,20.0,,
