## Cooking with ClarityNLP - Session #1



The goal of this session is to introduce you to writing basic queries using NLPQL.  For details on installing ClarityNLP, loading data, and tagging document types, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues).

### How to Run NLPQL

To run NLPQL, you must submit it to a ClarityNLP server, either via the API or via the ClarityNLP user interface.  If you are running a local instance, the API endpoint is typically `localhost:5000/nlpql`.  NLPQL should be POSTed as `text/plain`.  An example from [Postman](https://www.postman.com) is shown below.

![Postman.png](assets/Postman.png)
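
For readers who prefer a script to Postman, the same request can be sketched with Python's standard library. This is a minimal sketch, not part of ClarityNLP: the helper name `build_nlpql_request` is made up for illustration, and the URL assumes the typical local endpoint mentioned above. The request is only built here; the commented lines show how it could be sent to a running server.

```python
import urllib.request

# Assumed endpoint for a local instance; replace with your server's address.
NLPQL_URL = "http://localhost:5000/nlpql"


def build_nlpql_request(nlpql, url=NLPQL_URL):
    """Build (but do not send) a POST request carrying NLPQL as text/plain.

    Hypothetical helper for illustration; it is not a ClarityNLP function.
    """
    return urllib.request.Request(
        url,
        data=nlpql.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


request = build_nlpql_request('limit 100;\nphenotype "Orthopnea" version "2";')

# Uncomment to submit to a running ClarityNLP server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```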

If you are unfamiliar with tools such as Postman, you can submit NLPQL via the ClarityNLP user interface running in a web browser. For local instances, this will be at [localhost:8200/runner](http://localhost:8200/runner).

![NLPQL_Runner.png](assets/NLPQL_Runner.png)

If you wish to run NLPQL directly from this notebook, then please use the following code.  You will need to edit the `url` variable to "localhost:5000/" or your ClarityNLP server IP address.

In [1]:
# This code below is only required for running ClarityNLP in Jupyter notebooks. It is not required if running NLPQL via API or the ClarityNLP GUI.

import pandas as pd
import claritynlp_notebook_helpers as claritynlp
from sys import path
from os import getcwd

dependencies_path = getcwd()[:-17] + 'nlp/'
path.insert(1, dependencies_path)

ClarityNLP notebook helpers loaded successfully!



Note: Throughout these tutorials, we will prepend all examples with `limit 100;`.  This limits the server to analyzing a maximum of 100 documents, reducing runtime and compute load when testing new queries. Once a query is producing the expected output, removing this line will allow the full dataset to be run.

## Case #1: Congestive Heart Failure
For this first use case, our goal is to find patients with reduced ejection fraction and/or late stage CHF who are experiencing symptomatic orthopnea.  We will begin by breaking this into smaller components then combining into a comprehensive phenotype definition that can be shared across sites.

### 1.1 Find mentions of "Orthopnea" in the patient chart.

We will start with a basic example, simply looking for the presence of a given term in the record.
```java
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.TermFinder({
    termset:[Orthopnea]
    });
```

To run this NLPQL, copy/paste the above and submit via API or the ClarityNLP interface.  Or, if you would like to run the NLPQL directly within this notebook, run the code below.

In [2]:
# Sample NLPQL
nlpql ='''
limit 10;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.TermFinder({
    termset:[Orthopnea]
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://localhost:5000/job_results/152/phenotype_intermediate",
    "job_id": "152",
    "luigi_task_monitoring": "http://localhost:8082/static/visualiser/index.html#search__search=job=152",
    "main_results_csv": "http://localhost:5000/job_results/152/phenotype",
    "phenotype_config": "http://localhost:5000/phenotype_id/152",
    "phenotype_id": "152",
    "pipeline_configs": [
        "http://localhost:5000/pipeline_id/177"
    ],
    "pipeline_ids": [
        177
    ],
    "status_endpoint": "http://localhost:5000/status/152"
}
Job in progress.....
Job successfully completed!
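
If you submit through the API instead of the notebook helper, the JSON response is where the result and monitoring URLs come from. The sketch below parses an abbreviated copy of the response shown above with the standard `json` module and picks out the links used in the rest of this session. The variable names are illustrative only.

```python
import json

# Abbreviated copy of the job submission response shown above.
response_text = '''
{
    "intermediate_results_csv": "http://localhost:5000/job_results/152/phenotype_intermediate",
    "job_id": "152",
    "luigi_task_monitoring": "http://localhost:8082/static/visualiser/index.html#search__search=job=152",
    "main_results_csv": "http://localhost:5000/job_results/152/phenotype",
    "status_endpoint": "http://localhost:5000/status/152"
}
'''

job = json.loads(response_text)

# Keep only the links we will want to open later.
wanted = (
    "main_results_csv",
    "intermediate_results_csv",
    "luigi_task_monitoring",
    "status_endpoint",
)
links = {key: job[key] for key in wanted}

for name, link in links.items():
    print(f"{name}: {link}")
```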


### Job Monitoring

Small jobs with limits such as that above typically take less than 60 seconds to run.  Bigger jobs can take much (much) longer.  You can view your job's progress and see how ClarityNLP has broken up the query using the [Luigi Status Monitor](http://18.220.133.76:8082/static/visualiser/index.html) (see below if the link does not work).  Here is a screenshot of the above query running in Luigi.

![Screen%20Shot%202018-08-28%20at%2010.27.36%20PM.png](assets/Luigi.png)

If the above link for the Luigi task monitor does not work, and you submitted the job using this notebook, then you can run the code block below to get the link for the Luigi task monitoring.

In [3]:
print(luigi)

http://localhost:8082/static/visualiser/index.html#search__search=job=152


### Viewing Results
There are two types of results from an NLPQL query: intermediate results and final results.  *Intermediate results* refer to all data that are extracted by the query.  *Final results* refer to only those patients or documents meeting the specified logic (e.g., cohort criteria).

For the current task, we have not defined any cohort logic.  We asked only to find mentions of orthopnea, which is a data extraction task and will produce only intermediate results.  To download the intermediate results, you can navigate to the `intermediate_results_csv` URL returned by your API call, or, if you used this notebook, you can see the URL and a preview below.

In [4]:
print(intermediate_csv)

http://localhost:5000/job_results/152/phenotype_intermediate


In [5]:
inter_csv_df = pd.read_csv(intermediate_csv)
inter_csv_df.head()

Unnamed: 0,_id,batch,concept_code,concept_code_system,display_name,end,experiencer,inserted_date,job_id,negation,...,report_type,result_display,section,sentence,solr_id,source,start,subject,temporality,term
0,6227785948b11a9eeeb5e509,10,,,hasOrthopnea,67,Patient,2022-03-08 10:38:01.474000,152,Affirmed,...,Journal Abstract,"{'date': '2010-03-01T00:00:00Z', 'result_conte...",UNKNOWN,"Approximately 4 hours after cesarean section, ...",pmid_20434111,PubMed,58,Acta anaesthesiologica Taiwanica : official jo...,Recent,orthopnea
1,6227785948b11a9eeeb5e50a,10,,,hasOrthopnea,86,Patient,2022-03-08 10:38:01.477000,152,Negated,...,Journal Abstract,"{'date': '2010-03-01T00:00:00Z', 'result_conte...",UNKNOWN,This case illustrates that we should bear in m...,pmid_20434111,PubMed,77,Acta anaesthesiologica Taiwanica : official jo...,Recent,orthopnea
2,6227785948b11a9eeeb5e50b,10,,,hasOrthopnea,138,Patient,2022-03-08 10:38:01.533000,152,Affirmed,...,Journal Abstract,"{'date': '2016-01-01T00:00:00Z', 'result_conte...",UNKNOWN,A 26-year-old white woman diagnosed with syste...,pmid_26757303,PubMed,129,Chest,Recent,orthopnea
3,6227785948b11a9eeeb5e50c,10,,,hasOrthopnea,83,Patient,2022-03-08 10:38:01.535000,152,Affirmed,...,Journal Abstract,"{'date': '2016-01-01T00:00:00Z', 'result_conte...",UNKNOWN,"Three months later, she presented to our clini...",pmid_26757303,PubMed,74,Chest,Recent,orthopnea
4,6227785948b11a9eeeb5e50d,10,,,hasOrthopnea,86,Patient,2022-03-08 10:38:01.598000,152,Affirmed,...,Journal Abstract,"{'date': '2003-12-12T00:00:00Z', 'result_conte...",UNKNOWN,A 63-year-old woman was admitted with diffuse ...,pmid_14673740,PubMed,77,Deutsche medizinische Wochenschrift (1946),Recent,orthopnea


Looking at the results, you may notice that the `TermFinder` function picks up both negated and affirmed cases, as well as historical, hypothetical, and any other mentions.  TermFinder can be tuned to pull only specific results, as in the example below, which returns only affirmed (positive) mentions.

```java
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.TermFinder({
    termset:[Orthopnea],
    negated:"Affirmed"
    });
```
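
The same restriction can also be applied after the fact to results you have already downloaded, which is handy for checking what the server-side filter would remove. The sketch below uses only the standard library and a hand-made CSV with a few of the column names from the preview above (`negation`, `term`, `sentence`); the rows themselves are placeholders, not real output.

```python
import csv
import io

# Hand-made rows reusing column names from the intermediate results preview;
# the sentences are placeholders, not real ClarityNLP output.
sample_csv = """negation,term,sentence
Affirmed,orthopnea,placeholder sentence 1
Negated,orthopnea,placeholder sentence 2
Affirmed,orthopnoea,placeholder sentence 3
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Keep only the mentions that were not negated.
affirmed = [row for row in rows if row["negation"] == "Affirmed"]

print(len(rows), "mentions,", len(affirmed), "affirmed")
```

With the DataFrame loaded earlier in this notebook, the equivalent filter would be `inter_csv_df[inter_csv_df["negation"] == "Affirmed"]`.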

In [6]:
# Sample NLPQL
nlpql ='''
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.TermFinder({
    termset:[Orthopnea],
    negated:"Affirmed"
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://localhost:5000/job_results/153/phenotype_intermediate",
    "job_id": "153",
    "luigi_task_monitoring": "http://localhost:8082/static/visualiser/index.html#search__search=job=153",
    "main_results_csv": "http://localhost:5000/job_results/153/phenotype",
    "phenotype_config": "http://localhost:5000/phenotype_id/153",
    "phenotype_id": "153",
    "pipeline_configs": [
        "http://localhost:5000/pipeline_id/178"
    ],
    "pipeline_ids": [
        178
    ],
    "status_endpoint": "http://localhost:5000/status/153"
}
Job in progress............
Job successfully completed!


Because many term searches are actually looking for non-negated, non-hypothetical, subject=patient mentions, we provide a convenience function, `ProviderAssertion`, to capture those mentions without needing to configure TermFinder.

```java
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.ProviderAssertion({
    termset:[Orthopnea]
    });
```

### 1.2 Find NYHA Class III/IV patients or those with EF<30%
For the next component of this tutorial, we will aim to extract specific values about CHF from the chart.  This is commonly done with the [ValueExtraction](https://claritynlp.readthedocs.io/en/latest/developer_guide/algorithms/value_extraction.html) function.  Value extractions can be numeric as well as an enumerated list of values.

In this example, we will be searching for ejection fraction values using a very simple algorithm.  Specifically, we will be looking for certain terms and subsequent values that would be typical for EF values.  (There are many more sophisticated methods to find ejection fraction, e.g., [Kim et al.](https://www.ncbi.nlm.nih.gov/pubmed/28163196))  We will then constrain the "final" cohort to only those with an EF of 30 or less.

```java
limit 100;
//phenotype name
phenotype "Ejection Fraction Values" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });

//logical Context (Patient, Document)
context Patient;

define final LowEFPatient:
    where EjectionFraction.value <= 30;
```
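
To make the "term followed by a plausible value" idea concrete, here is a toy sketch in plain Python. It is not ClarityNLP's ValueExtraction implementation, which handles many more cases; the function name `extract_values`, the 20-character gap, and the regular expression are all assumptions chosen for illustration. Only the terms and the 10 to 85 range come from the NLPQL above.

```python
import re

TERMS = ["ef", "ejection fraction", "lvef"]
MIN_VALUE, MAX_VALUE = 10, 85

# A term, then up to 20 non-digit characters, then a number.
# Longer terms are listed first so the most specific term is preferred.
_terms = "|".join(re.escape(t) for t in sorted(TERMS, key=len, reverse=True))
PATTERN = re.compile(
    r"\b(?P<term>" + _terms + r")\b\D{0,20}?(?P<value>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_values(sentence):
    """Return (term, value) pairs whose value lies within the allowed range."""
    results = []
    for match in PATTERN.finditer(sentence):
        value = float(match.group("value"))
        if MIN_VALUE <= value <= MAX_VALUE:
            results.append((match.group("term").lower(), value))
    return results


print(extract_values("LVEF was 25% on echo."))
print(extract_values("EF 5%"))
```

The range check is what discards implausible numbers such as the `5` in the second call, which mirrors the role of `minimum_value` and `maximum_value` in the query.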


In [7]:
# Sample NLPQL
nlpql ='''
limit 100;
//phenotype name
phenotype "Ejection Fraction Values" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });

//logical Context (Patient, Document)
context Patient;

define final LowEFPatient:
    where EjectionFraction.value <= 30;
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://localhost:5000/job_results/154/phenotype_intermediate",
    "job_id": "154",
    "luigi_task_monitoring": "http://localhost:8082/static/visualiser/index.html#search__search=job=154",
    "main_results_csv": "http://localhost:5000/job_results/154/phenotype",
    "phenotype_config": "http://localhost:5000/phenotype_id/154",
    "phenotype_id": "154",
    "pipeline_configs": [
        "http://localhost:5000/pipeline_id/179"
    ],
    "pipeline_ids": [
        179
    ],
    "status_endpoint": "http://localhost:5000/status/154"
}
Job in progress...............
Job successfully completed!


The `final` declaration refers to a cohort definition and typically involves some logic.  In this case, we defined an extraction process to pull all values between 10 and 85 following EF, LVEF, etc.  We then specified a `context`, meaning that the logic should operate on the level of a patient.  (The other option is Document context, which we will describe in a future session.)  Our logical rule stated that patients with an EjectionFraction <= 30 would meet our criteria for a Low EF Patient.
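
One way to picture patient-level logic is to group the extracted values by patient and keep any patient with at least one qualifying value. The sketch below does that in plain Python on made-up data. The "any value qualifies" reading and the patient identifiers are assumptions for illustration, not a description of ClarityNLP's internal evaluation.

```python
from collections import defaultdict

# Hypothetical (subject, value) pairs standing in for EjectionFraction results.
extractions = [
    ("patient_a", 55.0),
    ("patient_a", 28.0),
    ("patient_b", 60.0),
    ("patient_c", 30.0),
]

# Group every extracted value under its patient.
values_by_patient = defaultdict(list)
for subject, value in extractions:
    values_by_patient[subject].append(value)

# Assumed rule: a patient qualifies if any of their values is <= 30.
low_ef_patients = sorted(
    subject
    for subject, values in values_by_patient.items()
    if any(v <= 30 for v in values)
)

print(low_ef_patients)
```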

Results can be found at the `main_results_csv` URL from your API response or, if you ran the query in this notebook, with the cells below:

In [8]:
print(main_csv)

http://localhost:5000/job_results/154/phenotype


In [9]:
final_csv_df = pd.read_csv(main_csv)
final_csv_df.head()

Unnamed: 0,_id,_ids_1,batch,concept_code,concept_code_system,condition,context_type,display_name,end,inserted_date,...,solr_id,source,start,subject,term,text,value,value1,value2,values_before_terms
0,622778771e352e9afcb5e683,6227786b48b11a9eeeb5e5f7,90,,,RANGE,subject,EjectionFraction,108,2022-03-08 10:38:19.958000,...,pmid_30287144,PubMed,89,Journal of Ayurveda and integrative medicine,ef,EF) 10-30%) wherein,10.0,10.0,30.0,False
1,622778771e352e9afcb5e684,6227786c48b11a9eeeb5e5ff,30,,,EQUAL,subject,EjectionFraction,184,2022-03-08 10:38:20.908000,...,pmid_29929385,PubMed,176,Journal of cardiovascular pharmacology and the...,lvef,LVEF] 30,30.0,30.0,,False
2,622778771e352e9afcb5e685,6227786d48b11a9eeeb5e601,30,,,EQUAL,subject,EjectionFraction,186,2022-03-08 10:38:21.093000,...,pmid_30892806,PubMed,179,European journal of heart failure,lvef,LVEF 30,30.0,30.0,,False
3,622778771e352e9afcb5e686,6227786d48b11a9eeeb5e605,30,,,EQUAL,subject,EjectionFraction,52,2022-03-08 10:38:21.474000,...,pmid_30835055,PubMed,43,Internal and emergency medicine,lvef,LVEF by10,10.0,10.0,,False
4,622778771e352e9afcb5e687,6227786d48b11a9eeeb5e60b,0,,,EQUAL,subject,EjectionFraction,114,2022-03-08 10:38:21.686000,...,pmid_30766004,PubMed,82,Journal of the Saudi Heart Association,ejection fraction,ejection fraction (HFrEF) and 22,22.0,22.0,,False


The next step is to use ValueExtraction to pull out an enumerated value set (rather than a quantitative value).  See the example below for NYHA class.

```java
limit 100;
//phenotype name
phenotype "NYHA Class" version "1";

//include Clarity  main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset NYHATerms:
  ["nyha"];

define NYHAClass:
  Clarity.ValueExtraction({
    termset:[NYHATerms],
    enum_list: ["3","4","iii","iv"];
    });
```

In [10]:
# Sample NLPQL
nlpql ='''
limit 100;
//phenotype name
phenotype "NYHA Class" version "1";

//include Clarity  main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset NYHATerms:
  ["nyha"];

define NYHAClass:
  Clarity.ValueExtraction({
    termset:[NYHATerms],
    enum_list: ["3","4","iii","iv"];
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://localhost:5000/job_results/155/phenotype_intermediate",
    "job_id": "155",
    "luigi_task_monitoring": "http://localhost:8082/static/visualiser/index.html#search__search=job=155",
    "main_results_csv": "http://localhost:5000/job_results/155/phenotype",
    "phenotype_config": "http://localhost:5000/phenotype_id/155",
    "phenotype_id": "155",
    "pipeline_configs": [
        "http://localhost:5000/pipeline_id/180"
    ],
    "pipeline_ids": [
        180
    ],
    "status_endpoint": "http://localhost:5000/status/155"
}
Job in progress...............
Job successfully completed!


In [11]:
# view intermediate results
inter_csv_df = pd.read_csv(intermediate_csv)
inter_csv_df.head()

Unnamed: 0,_id,batch,concept_code,concept_code_system,condition,display_name,end,inserted_date,job_id,max_value,...,solr_id,source,start,subject,term,text,value,value1,value2,values_before_terms
0,6227787b48b11a9eeeb5e681,20,,,EQUAL,NYHAClass,145,2022-03-08 10:38:35.149000,155,,...,pmid_28744959,PubMed,137,Journal of cardiovascular electrophysiology,nyha,NYHA III,iii,iii,,False
1,6227787b48b11a9eeeb5e682,20,,,EQUAL,NYHAClass,113,2022-03-08 10:38:35.271000,155,,...,pmid_29801082,PubMed,96,JAMA cardiology,nyha,NYHA II to 49% (3,3,3,,False
2,6227787b48b11a9eeeb5e683,20,,,EQUAL,NYHAClass,45,2022-03-08 10:38:35.352000,155,,...,pmid_28430960,PubMed,25,"Europace : European pacing, arrhythmias, and c...",nyha,NYHA class III and 3,3,3,,False
3,6227787b48b11a9eeeb5e684,20,,,EQUAL,NYHAClass,143,2022-03-08 10:38:35.436000,155,,...,pmid_29267340,PubMed,97,PloS one,nyha,NYHA classification during follow-up were univ,iv,iv,,False
4,6227787b48b11a9eeeb5e685,20,,,EQUAL,NYHAClass,63,2022-03-08 10:38:35.592000,155,,...,pmid_29520133,PubMed,49,Clinical interventions in aging,nyha,NYHA III or IV,iv,iv,,False


### 1.3 Bringing the criteria together to find the target CHF cohort

For the final step in example 1, we want to bring together the above criteria to generate our final cohort.

```java
limit 100;
//phenotype name
phenotype "Final CHF Cohort" version "1";

//include Clarity  main NLP libraries
include ClarityCore version "1.0" called Clarity;

//termsets
termset Orthopnea:
  ["orthopnea","orthopnoea"];

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

termset NYHATerms:
  ["nyha"];

//data extractions
define hasOrthopnea:
  Clarity.ProviderAssertion({
    termset:[Orthopnea]
    });

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });


define NYHAClass34:
  Clarity.ValueExtraction({
    termset:[NYHATerms],
    enum_list: ["3","4","iii","iv"];
    });

//logical context (Patient, Document)
context Patient;

define LowEF:
    where EjectionFraction.value <= 30;

define SevereCHF:
    where NYHAClass34 OR LowEF;
    
define final SevereCHFwithOrthopnea:
    where SevereCHF AND hasOrthopnea;
```
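
The three logical definitions at the end of this query behave like set operations over patients: `OR` is a union and `AND` is an intersection. The sketch below shows that reading with Python sets and invented patient identifiers. It is an analogy for understanding the query, not a description of how ClarityNLP evaluates it.

```python
# Invented patient identifiers for each extraction or rule.
nyha_class_34 = {"p1", "p2"}
low_ef = {"p2", "p3"}
has_orthopnea = {"p2", "p3", "p4"}

# define SevereCHF: where NYHAClass34 OR LowEF;
severe_chf = nyha_class_34 | low_ef

# define final SevereCHFwithOrthopnea: where SevereCHF AND hasOrthopnea;
severe_chf_with_orthopnea = severe_chf & has_orthopnea

print(sorted(severe_chf_with_orthopnea))
```

Patient `p1` is dropped because there is no orthopnea mention, and `p4` is dropped because neither severity criterion is met.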

In [12]:
# Sample NLPQL
nlpql ='''
limit 100;
//phenotype name
phenotype "Final CHF Cohort" version "1";

//include Clarity  main NLP libraries
include ClarityCore version "1.0" called Clarity;

//termsets
termset Orthopnea:
  ["orthopnea","orthopnoea"];

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

termset NYHATerms:
  ["nyha"];

//data extractions
define hasOrthopnea:
  Clarity.ProviderAssertion({
    termset:[Orthopnea]
    });

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });


define NYHAClass34:
  Clarity.ValueExtraction({
    termset:[NYHATerms],
    enum_list: ["3","4","iii","iv"];
    });

//logical context (Patient, Document)
context Patient;

define LowEF:
    where EjectionFraction.value <= 30;

define SevereCHF:
    where NYHAClass34 OR LowEF;
    
define final SevereCHFwithOrthopnea:
    where SevereCHF AND hasOrthopnea;
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://localhost:5000/job_results/156/phenotype_intermediate",
    "job_id": "156",
    "luigi_task_monitoring": "http://localhost:8082/static/visualiser/index.html#search__search=job=156",
    "main_results_csv": "http://localhost:5000/job_results/156/phenotype",
    "phenotype_config": "http://localhost:5000/phenotype_id/156",
    "phenotype_id": "156",
    "pipeline_configs": [
        "http://localhost:5000/pipeline_id/181",
        "http://localhost:5000/pipeline_id/182",
        "http://localhost:5000/pipeline_id/183"
    ],
    "pipeline_ids": [
        181,
        182,
        183
    ],
    "status_endpoint": "http://localhost:5000/status/156"
}
Job in progress......................................
Job successfully completed!


In [13]:
final_csv_df = pd.read_csv(main_csv)
final_csv_df.head()

Unnamed: 0,_id,_ids_1,_ids_2,_ids_3,batch_1,batch_2,concept_code_1,concept_code_2,concept_code_system_1,concept_code_system_2,...,start_2,subject,temporality,term_1,term_2,text,value,value1,value2,values_before_terms
0,622778ae98af687cdfb5e96b,622778a348b11a9eeeb5e840,622778ae98af687cdfb5e8ec,6227789248b11a9eeeb5e78b,60,60,,,,,...,34,International journal of cardiology,Recent,nyha,orthopnea,NYHA class (NYHA III,iii,iii,,False
1,622778ae98af687cdfb5e96c,622778a848b11a9eeeb5e880,622778ae98af687cdfb5e8ed,6227789248b11a9eeeb5e78b,0,60,,,,,...,34,International journal of cardiology,Recent,nyha,orthopnea,NYHA I and II (23,3,3,,False
2,622778ae98af687cdfb5e96d,622778a848b11a9eeeb5e881,622778ae98af687cdfb5e8ee,6227789248b11a9eeeb5e78b,0,60,,,,,...,34,International journal of cardiology,Recent,nyha,orthopnea,NYHA III and IV,iv,iv,,False
3,622778ae98af687cdfb5e96e,622778aa48b11a9eeeb5e89b,622778ae98af687cdfb5e8ef,6227789248b11a9eeeb5e78b,40,60,,,,,...,34,International journal of cardiology,Recent,nyha,orthopnea,NYHA classes III-IV,iv,iv,,False
4,622778ae98af687cdfb5e96f,622778aa48b11a9eeeb5e89c,622778ae98af687cdfb5e8f0,6227789248b11a9eeeb5e78b,40,60,,,,,...,34,International journal of cardiology,Recent,nyha,orthopnea,NYHA class I-III,iii,iii,,False


In the above case, we may need to increase our limit beyond 100 documents to find many matching patients, because multiple criteria are required and a small sample may not be enough.  Try increasing to 500 documents.

### Reviewing Results through the ClarityNLP UI
Downloading the raw CSV results is handy for analysis and data manipulation.  However, a domain-oriented end user may be more interested in just exploring and validating the final results without getting into all the programmatic details.  That's where the Results Viewer comes in, which in a typical installation will be found at localhost:8200/


![Screen%20Shot%202018-08-29%20at%209.33.35%20AM.png](assets/Clarity%20Validator.png)

## Case #2: Capturing Information on Patient Race
Although there are many useful [core algorithms](https://claritynlp.readthedocs.io/en/latest/developer_guide/index.html#task-algorithms) in ClarityNLP, users will frequently want to extend its functionality.  In this second example, we will explore how to extend ClarityNLP when the built-in algorithms are inadequate.

In this case, we'd like to identify the patient's race.  While some version of this could probably be done with simple search terms, a custom algorithm will likely be necessary.  Below is an example of a custom Python algorithm written to extract race information from a document.

```python
import re
from collections import namedtuple

str_sep = r'(\s-\s|-\s|\s-|\s)'
str_word = r'\b[-a-z.\d]+'
str_punct = r'[,.\s]*'
str_words = r'(' + str_word + str_punct + r'){0,6}'
str_person = r'\b(gentleman|gentlewoman|male|female|man|woman|person|'    +\
             r'child|boy|girl|infant|baby|newborn|neonate|individual)\b'
str_category = r'\b(american' + str_sep + r'indian|'                      +\
               r'alaska' + str_sep + r'native|asian|'                     +\
               r'african' + str_sep + r'american|black|negro|'            +\
               r'native' + str_sep + r'hawaiian|'                         +\
               r'other' + str_sep + r'pacific' + str_sep + r'islander|'   +\
               r'pacific' + str_sep + r'islander|'                        +\
               r'native' + str_sep + r'american|'                         +\
               r'white|caucasian|european)'

str_race1 = r'(\brace:?\s*)' + r'(?P<category>' + str_category + r')'
regex_race1 = re.compile(str_race1, re.IGNORECASE)
str_race2 = r'(?P<category>' + str_category + r')' + str_punct    +\
            str_words + str_person
regex_race2 = re.compile(str_race2, re.IGNORECASE)
str_race3 = str_person + str_punct + str_words + r'(?P<category>' +\
            str_category + r')'
regex_race3 = re.compile(str_race3, re.IGNORECASE)
REGEXES = [regex_race1, regex_race2, regex_race3]

RACE_FINDER_RESULT_FIELDS = ['sentence_index', 'start', 'end', 'race',
                             'normalized_race']
RaceFinderResult = namedtuple('RaceFinderResult', RACE_FINDER_RESULT_FIELDS)


###############################################################################
def normalize(race_text):
    """
    Convert a matching race string to a 'normalized' version.
    """

    NORM_MAP = {
        'african american':'black',
        'negro':'black',
        'caucasian':'white',
        'european':'white',
    }
    
    # convert to lowercase, remove dashes, collapse repeated whitespace
    race = race_text.lower()
    race = re.sub(r'[-]+', '', race)
    race = re.sub(r'\s+', ' ', race)

    if race in NORM_MAP:
        return NORM_MAP[race]
    else:
        return race
    

###############################################################################
def find_race(sentence_list):
    """
    Scan a list of sentences and run race-finding regexes on each.
    Return a list of RaceFinderResult namedtuples. The scan stops at the
    first sentence that matches, so the list has at most one entry.
    """

    result_list = []

    found_match = False
    for i in range(len(sentence_list)):
        s = sentence_list[i]
        for regex in REGEXES:
            match = regex.search(s)
            if match:
                match_text = match.group('category')
                start = match.start()
                end   = match.end()
                normalized = normalize(match_text)
                result = RaceFinderResult(i, start, end, match_text, normalized)
                result_list.append(result)
                found_match = True
                break

        # Reports are unlikely to have more than one sentence stating the
        # patient's race.
        if found_match:
            break
            
    return result_list
```
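
To see the matching approach in action without the full pattern set, here is a reduced, self-contained sketch. It keeps only the `race: <category>` form and a shortened category list, so it is deliberately narrower than the code above. The names `CATEGORY`, `RACE_LABEL`, and `first_race_mention` are made up for this illustration.

```python
import re

# Shortened category list for illustration; the full pattern above covers more.
CATEGORY = r'\b(african[-\s]+american|asian|black|white|caucasian)'

# Matches text such as "Race: Caucasian" and captures the category by name.
RACE_LABEL = re.compile(
    r'\brace:?\s*(?P<category>' + CATEGORY + r')',
    re.IGNORECASE,
)


def first_race_mention(sentences):
    """Return (sentence_index, matched_text) for the first hit, else None."""
    for index, sentence in enumerate(sentences):
        match = RACE_LABEL.search(sentence)
        if match:
            return index, match.group('category')
    return None


sentences = [
    "Patient seen in clinic today.",
    "Race: Caucasian",
    "Follow up in two weeks.",
]
print(first_race_mention(sentences))
```

As in `find_race`, the search stops at the first matching sentence, and the named group `category` is what carries the matched text forward for normalization.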

Without going into the details, this algorithm parses text to find race information, normalizes it to standard terms, and passes back the result.  In order to run this algorithm using NLPQL in ClarityNLP, we create what is called a custom task.  Below is code that creates the custom task, which wraps this function, supplies it with the documents, and handles the result output.

```python
from pymongo import MongoClient
from tasks.task_utilities import BaseTask


class RaceFinderTask(BaseTask):

    # use this name in NLPQL
    task_name = "RaceFinderTask"

    def run_custom_task(self, temp_file, mongo_client: MongoClient):

        # for each document in the NLPQL-specified doc set
        for doc in self.docs:

            # all sentences in this document
            sentence_list = self.get_document_sentences(doc)

            # all race results in this document
            result_list = find_race(sentence_list)

            if len(result_list) > 0:
                for result in result_list:
                    # fields written to the results store for this match
                    obj = {
                        'sentence':sentence_list[result.sentence_index],
                        'start':result.start,
                        'end':result.end,
                        'value':result.race,
                        'value_normalized':result.normalized_race,
                    }

                    self.write_result_data(temp_file, mongo_client, doc, obj)
```

The algorithm and the task wrapper can be kept in two separate files or combined into one, as shown in the final custom task below.

In [14]:
import re
from pymongo import MongoClient
from collections import namedtuple
from tasks.task_utilities import BaseTask

str_sep = r'(\s-\s|-\s|\s-|\s)'
str_word = r'\b[-a-z.\d]+'
str_punct = r'[,.\s]*'
str_words = r'(' + str_word + str_punct + r'){0,6}'
str_person = r'\b(gentleman|gentlewoman|male|female|man|woman|person|'    +\
             r'child|boy|girl|infant|baby|newborn|neonate|individual)\b'
str_category = r'\b(american' + str_sep + r'indian|'                      +\
               r'alaska' + str_sep + r'native|asian|'                     +\
               r'african' + str_sep + r'american|black|negro|'            +\
               r'native' + str_sep + r'hawaiian|'                         +\
               r'other' + str_sep + r'pacific' + str_sep + r'islander|'   +\
               r'pacific' + str_sep + r'islander|'                        +\
               r'native' + str_sep + r'american|'                         +\
               r'white|caucasian|european)'

str_race1 = r'(\brace:?\s*)' + r'(?P<category>' + str_category + r')'
regex_race1 = re.compile(str_race1, re.IGNORECASE)
str_race2 = r'(?P<category>' + str_category + r')' + str_punct    +\
            str_words + str_person
regex_race2 = re.compile(str_race2, re.IGNORECASE)
str_race3 = str_person + str_punct + str_words + r'(?P<category>' +\
            str_category + r')'
regex_race3 = re.compile(str_race3, re.IGNORECASE)
REGEXES = [regex_race1, regex_race2, regex_race3]

RACE_FINDER_RESULT_FIELDS = ['sentence_index', 'start', 'end', 'race',
                             'normalized_race']
RaceFinderResult = namedtuple('RaceFinderResult', RACE_FINDER_RESULT_FIELDS)


###############################################################################
def normalize(race_text):
    """
    Convert a matching race string to a 'normalized' version.
    """

    NORM_MAP = {
        'african american':'black',
        'negro':'black',
        'caucasian':'white',
        'european':'white',
    }
    
    # convert to lowercase, remove dashes, collapse repeated whitespace
    race = race_text.lower()
    race = re.sub(r'[-]+', '', race)
    race = re.sub(r'\s+', ' ', race)

    if race in NORM_MAP:
        return NORM_MAP[race]
    else:
        return race
    

###############################################################################
def find_race(sentence_list):
    """
    Scan a list of sentences and run race-finding regexes on each.
    Return a list of RaceFinderResult namedtuples. The scan stops at the
    first sentence that matches, so the list has at most one entry.
    """

    result_list = []

    found_match = False
    for i in range(len(sentence_list)):
        s = sentence_list[i]
        for regex in REGEXES:
            match = regex.search(s)
            if match:
                match_text = match.group('category')
                start = match.start()
                end   = match.end()
                normalized = normalize(match_text)
                result = RaceFinderResult(i, start, end, match_text, normalized)
                result_list.append(result)
                found_match = True
                break

        # Reports are unlikely to have more than one sentence stating the
        # patient's race.
        if found_match:
            break
            
    return result_list


###############################################################################
class RaceFinderTask(BaseTask):
    """
    A custom task for finding a patient's race.
    """
    
    # use this name in NLPQL
    task_name = "RaceFinderTask"

    def run_custom_task(self, temp_file, mongo_client: MongoClient):

        # for each document in the NLPQL-specified doc set
        for doc in self.docs:

            # all sentences in this document
            sentence_list = self.get_document_sentences(doc)

            # all race results in this document
            result_list = find_race(sentence_list)
                
            if len(result_list) > 0:
                for result in result_list:
                    obj = {
                        'sentence':sentence_list[result.sentence_index],
                        'start':result.start,
                        'end':result.end,
                        'value':result.race,
                        'value_normalized':result.normalized_race,
                    }
            
                    self.write_result_data(temp_file, mongo_client, doc, obj)



[2022-03-08 10:39:31-EST] ERROR in claritynlp_logging: "No option 'username' in section: 'mongo'"
[2022-03-08 10:39:31-EST] ERROR in claritynlp_logging: "No option 'password' in section: 'mongo'"
[2022-03-08 10:39:32-EST] ERROR in claritynlp_logging: "No section: 'results_client'"
[2022-03-08 10:39:32-EST] ERROR in claritynlp_logging: "No option 'batch_size' in section: 'solr'"
[2022-03-08 10:39:32-EST] ERROR in claritynlp_logging: "No section: 'report_mapper'"
[2022-03-08 10:39:32-EST] ERROR in claritynlp_logging: "No section: 'report_mapper'"
[2022-03-08 10:39:33-EST] ERROR in claritynlp_logging: "No section: 'report_mapper'"
[2022-03-08 10:39:33-EST] ERROR in claritynlp_logging: "No section: 'apis'"
[2022-03-08 10:39:33-EST] ERROR in claritynlp_logging: "No section: 'redis'"
[2022-03-08 10:39:33-EST] ERROR in claritynlp_logging: "No section: 'redis'"
[2022-03-08 10:39:34-EST] ERROR in claritynlp_logging: "No section: 'redis'"
[2022-03-08 10:39:34-EST] ERROR in claritynlp_logging: "N

[2022-03-08 10:39:37-EST] INFO in claritynlp_logging: 'Initializing models for term finder...'
[2022-03-08 10:39:37-EST] INFO in claritynlp_logging: 'section_tagger_init...'
[2022-03-08 10:39:38-EST] INFO in claritynlp_logging: ('Context') 'Context init...'
[2022-03-08 10:39:38-EST] INFO in claritynlp_logging: 'Context init...'
[2022-03-08 10:39:38-EST] INFO in claritynlp_logging: 'Segmentation init...'
[2022-03-08 10:39:39-EST] INFO in claritynlp_logging: 'Done initializing models for term finder...'
[2022-03-08 10:39:39-EST] INFO in claritynlp_logging: 'Initializing models for value extractor...'
[2022-03-08 10:39:39-EST] INFO in claritynlp_logging: 'Done initializing models for value extractor...'
[2022-03-08 10:39:40-EST] INFO in claritynlp_logging: 'Initializing models for measurement finder...'
[2022-03-08 10:39:40-EST] INFO in claritynlp_logging: 'Done initializing models for measurement finder..'
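
To make the result fields used by `RaceFinderTask` more concrete, here is a minimal standalone sketch of the sentence-scanning pattern: search each sentence, read the `category` named group, normalize it, and stop at the first sentence that states a race. The regex, the normalization table, and the name `find_race_sketch` are assumptions made for illustration; they are not ClarityNLP's actual patterns, and the real matcher may span more text than the race word alone.

```python
import re
from collections import namedtuple

# Field names mirror the attributes read by RaceFinderTask above.
RaceFinderResult = namedtuple(
    'RaceFinderResult', 'sentence_index start end race normalized_race')

# Hypothetical pattern and normalization table, for illustration only.
RACE_REGEX = re.compile(
    r'\b(?P<category>white|caucasian|black|african american|asian)\b',
    re.IGNORECASE)
NORMALIZATION_MAP = {'caucasian': 'white', 'african american': 'black'}


def normalize(race_text):
    """Map a matched race string to a normalized category."""
    text = race_text.lower()
    return NORMALIZATION_MAP.get(text, text)


def find_race_sketch(sentence_list):
    """Return at most one result: the first sentence that states a race."""
    result_list = []
    for i, sentence in enumerate(sentence_list):
        match = RACE_REGEX.search(sentence)
        if match:
            match_text = match.group('category')
            result_list.append(RaceFinderResult(
                i, match.start(), match.end(),
                match_text, normalize(match_text)))
            # Reports rarely state the patient's race more than once.
            break
    return result_list


sentences = ['Admitted for chest pain.',
             'Elderly caucasian gentleman with Parkinsonian features.']
results = find_race_sketch(sentences)
print(results)
```

Each result carries the sentence index and character offsets, which is what lets the task write back the matching sentence alongside `value` and `value_normalized`.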


This race task can be called in NLPQL as follows:

```java
    limit 100;

    phenotype "Race Finder" version "1";
    include ClarityCore version "1.0" called Clarity;

    documentset DischargeSummaries:
        Clarity.createReportTagList(["Discharge Summary"]);

    define RaceFinderFunction:
        Clarity.RaceFinderTask({
            documentset: [DischargeSummaries]
        });
```

Note: This example is our first use of `documentset`, which lets us restrict a query to a targeted list of document types, such as Discharge Summaries or Radiology notes. We will cover this in greater detail in future Cooking sessions.

In [99]:
nlpql ='''
limit 100;

phenotype "Race Finder" version "1";
include ClarityCore version "1.0" called Clarity;

documentset DischargeSummaries:
    Clarity.createReportTagList(["Discharge Summary"]);

define RaceFinderFunction:
    Clarity.RaceFinderTask({
        documentset: [DischargeSummaries]
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/322/phenotype_intermediate",
    "job_id": "322",
    "luigi_task_monitoring": "http://18.220.133.76:8082/static/visualiser/index.html#search__search=job=322",
    "main_results_csv": "http://18.220.133.76:5000/job_results/322/phenotype",
    "phenotype_config": "http://18.220.133.76:5000/phenotype_id/322",
    "phenotype_id": "322",
    "pipeline_configs": [
        "http://18.220.133.76:5000/pipeline_id/502"
    ],
    "pipeline_ids": [
        502
    ],
    "results_viewer": "?job=322",
    "status_endpoint": "http://18.220.133.76:5000/status/322"
}


In [100]:
# view intermediate results
inter_csv_df = pd.read_csv(intermediate_csv)
inter_csv_df.head()

Unnamed: 0,_id,batch,concept_code,end,inserted_date,job_id,nlpql_feature,owner,phenotype_final,pipeline_id,...,report_date,report_id,report_type,sentence,solr_id,source,start,subject,value,value_normalized
0,5b86a5fc2d76670a1377c786,75,-1,23,2018-08-29 13:56:12.628000,322,RaceFinderFunction,clarity,False,502,...,2106-07-16T00:00:00Z,10851,Discharge summary,Overweight white female.,10851,MIMIC,11,11350,white,white
1,5b86a6002d76670a1977c786,25,-1,37,2018-08-29 13:56:16.440000,322,RaceFinderFunction,clarity,False,502,...,2106-05-03T00:00:00Z,8534,Discharge summary,Patient is a 46-year-old white female with his...,8534,MIMIC,25,15160,white,white
2,5b86a60a2d76670a1677c786,50,-1,54,2018-08-29 13:56:26.021000,322,RaceFinderFunction,clarity,False,502,...,2152-03-14T00:00:00Z,10838,Discharge summary,"In general, the patient was a middle-aged whit...",10838,MIMIC,42,26693,white,white
3,5b86a60c2d76670a1377c787,75,-1,27,2018-08-29 13:56:28.023000,322,RaceFinderFunction,clarity,False,502,...,2187-05-10T00:00:00Z,10862,Discharge summary,Elderly caucasian gentleman with Parkinsonian ...,10862,MIMIC,8,1819,caucasian,white
4,5b86a60d2d76670a1377c788,75,-1,362,2018-08-29 13:56:29.182000,322,RaceFinderFunction,clarity,False,502,...,2100-10-19T00:00:00Z,10863,Discharge summary,Past Medical History: Hyperchol HTN Afib with ...,10863,MIMIC,347,25436,caucasian,white
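
A quick way to sanity-check these results is to compare the raw matched text (`value`) with the normalized category (`value_normalized`). The sketch below uses only the Python standard library and a handful of columns copied from the five rows shown above, so it runs without a ClarityNLP server; the variable names are illustrative.

```python
import csv
import io
from collections import Counter

# Selected columns copied from the intermediate results shown above.
sample_csv = '''subject,sentence,value,value_normalized
11350,Overweight white female.,white,white
15160,Patient is a 46-year-old white female with his...,white,white
26693,"In general, the patient was a middle-aged whit...",white,white
1819,Elderly caucasian gentleman with Parkinsonian ...,caucasian,white
25436,Past Medical History: Hyperchol HTN Afib with ...,caucasian,white
'''

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Tally the text as it appeared in the note vs. the normalized category.
raw_counts = Counter(row['value'] for row in rows)
normalized_counts = Counter(row['value_normalized'] for row in rows)

# One normalized value per patient (subject).
race_by_subject = {row['subject']: row['value_normalized'] for row in rows}

print(raw_counts)
print(normalized_counts)
```

With the DataFrame already loaded in this notebook, `inter_csv_df['value_normalized'].value_counts()` gives the same kind of tally over the full result set.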


### 2.2 Combining race with other criteria

With the race task in place, we can now write NLPQL that combines our CHF criteria from Case #1 with the race information extracted above.  The NLPQL would look like this:

```java
limit 100;
//phenotype name
phenotype "NYHA Class" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

//termsets
termset Orthopnea:
  ["orthopnea","orthopnoea"];

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

termset NYHATerms:
  ["nyha"];


//documentsets
documentset DischargeSummaries:
    Clarity.createReportTagList(["Discharge Summary"]);


//data extractions
define hasOrthopnea:
  Clarity.ProviderAssertion({
    termset:[Orthopnea]
    });

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });


define NYHAClass34:
  Clarity.ValueExtraction({
    termset:[NYHATerms],
    enum_list: ["3","4","iii","iv"]
    });


define Race:
    Clarity.RaceFinderTask({
        documentset: [DischargeSummaries]
    });
       

//logical context (Patient, Document)
context Patient;

define LowEF:
    where EjectionFraction.value <= 30;

define SevereCHF:
    where NYHAClass34 OR LowEF;

define BlackRace:
    where Race.value_normalized = 'black';
    
define final SevereCHFwithOrthopnea:
    where SevereCHF AND hasOrthopnea;

define final BlackSevereCHFPatient:
    where SevereCHF AND BlackRace;
```
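
The logical section of this phenotype can be read as set operations over patients. The sketch below assumes that, under `context Patient`, each feature reduces to the set of subjects with at least one matching result, and that `OR` and `AND` behave like set union and intersection. The subject IDs and values are made up, and keeping a single ejection fraction per patient is a simplification; this is not how ClarityNLP evaluates the expressions internally.

```python
# Hypothetical per-feature results keyed by subject ID (not real data).
has_orthopnea = {101, 102, 105}
nyha_class_34 = {101, 103}
ejection_fraction = {101: 25.0, 102: 55.0, 104: 30.0, 105: 20.0}
race = {101: 'white', 103: 'black', 104: 'black', 105: 'black'}

# define LowEF: where EjectionFraction.value <= 30;
low_ef = {subject for subject, value in ejection_fraction.items()
          if value <= 30}

# define SevereCHF: where NYHAClass34 OR LowEF;
severe_chf = nyha_class_34 | low_ef

# define BlackRace: patients whose normalized race is 'black'
black_race = {subject for subject, value in race.items()
              if value == 'black'}

# define final SevereCHFwithOrthopnea: where SevereCHF AND hasOrthopnea;
severe_chf_with_orthopnea = severe_chf & has_orthopnea

# define final BlackSevereCHFPatient: where SevereCHF AND BlackRace;
black_severe_chf_patient = severe_chf & black_race

print(sorted(severe_chf_with_orthopnea))
print(sorted(black_severe_chf_patient))
```

Note that the two `final` features are independent: a patient can appear in one, both, or neither.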
