## Cooking with ClarityNLP - Session #2

The goal of this series is to introduce you to writing basic queries using NLPQL.  Today we will also be covering an introduction to data ingestion and tagging.  For other details on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues).

### Data Ingestion

In order to run NLP jobs using ClarityNLP, data must first be ingested into the system.  You can ingest data from various sources (e.g., flat files, relational databases, APIs) and of various types (e.g., txt, doc, pdf). Today we will cover one of the most common ingestion patterns: bringing in data from a CSV file.  ClarityNLP has a user interface to support CSV ingestion.  In a typical instance, this will be located at `localhost:6543/csv`.

[INSERT INGESTION PICTURE HERE]

The process of ingesting data from CSV involves the following steps:
1. Select your CSV file to load column headers
2. Assign the required fields for ClarityNLP to columns in your file
3. Add any additional fields you would like to include from your source file
4. Start the Import process


Below is an example of the ingestion screen filled out for the MIMIC-III notes file (NOTEEVENTS.csv)

[Insert MIMIC Ingestion form picture here]

### Document Mapping?? (Save for next time?)

### How to Run NLPQL

In order to run NLPQL, you must submit it to a ClarityNLP server either via API or via the ClarityNLP user interface.  If you are running a local instance, the API endpoint is typically `localhost:5000/nlpql`.  NLPQL should be POSTed as text/plain.  An example from [Postman](www.postman.com) is shown below.

![Postman.png](assets/Postman.png)
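
For readers who prefer scripting to Postman, the same request can be sketched with Python's standard library. This is a minimal sketch, assuming a local instance reachable at `http://localhost:5000/nlpql` as described above; the helper name `build_nlpql_request` is our own and is not part of ClarityNLP.

```python
import urllib.request

def build_nlpql_request(nlpql_text, url="http://localhost:5000/nlpql"):
    """Build (but do not send) a POST request carrying NLPQL as text/plain."""
    return urllib.request.Request(
        url,
        data=nlpql_text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

sample_nlpql = '''
limit 100;
phenotype "Prostate Cancer" version "1";
include ClarityCore version "1.0" called Clarity;

termset ProstateTerms:
  ["prostate cancer","prostate ca"];

define ProstateCA:
  Clarity.ProviderAssertion({
    termset:[ProstateTerms]
    });
'''

request = build_nlpql_request(sample_nlpql)
print(request.get_method(), request.full_url)

# To actually submit the job (requires a running ClarityNLP server):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The request is only constructed here, so the cell runs without a server; uncomment the last lines to submit it to your own instance.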

If you are unfamiliar with using tools such as Postman, you can submit NLPQL via the ClarityNLP user interface running in a web browser. For local instances, this will be at [localhost:8200/runner](localhost:8200/runner). 

![NLPQL_Runner.png](assets/NLPQL_Runner.png)

If you wish to run NLPQL directly from this notebook, then please use the following code.  You will need to edit the `url` variable to "localhost:5000/" or your ClarityNLP server IP address.

In [1]:
# This code below is only required for running ClarityNLP in Jupyter notebooks. It is not required if running NLPQL via API or the ClarityNLP GUI.

import pandas as pd
import claritynlp_notebook_helpers as claritynlp

ClarityNLP notebook helpers loaded successfully!



Note: Throughout these tutorials, we will prepend all examples with `limit 100;`.  This limits the server to analyzing a maximum of 100 documents, reducing runtime and compute load when testing new queries. Once a query is producing the expected output, removing this line will allow the full dataset to be run.

## Case #1:  Prostate Cancer
For this first use case, we are going to look at a few different approaches to analyzing prostate cancer.  First we will start with just the basic approach we covered last time. 

### 1.1 Find mentions of "Prostate Cancer" in the patient chart.

To run this NLPQL, copy/paste the query below and submit it via API or the ClarityNLP interface.  Or, if you would like to run the NLPQL directly within this notebook, run the code below.

In [2]:
# Sample NLPQL
nlpql ='''
limit 100;

//phenotype name
phenotype "Prostate Cancer" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset ProstateTerms:
  ["prostate cancer","prostate ca"];

define ProstateCA:
  Clarity.ProviderAssertion({
    termset:[ProstateTerms]
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://localhost:5000/job_results/10395/phenotype_intermediate",
    "job_id": "10395",
    "luigi_task_monitoring": "http://localhost:8082/static/visualiser/index.html#search__search=job=10395",
    "main_results_csv": "http://localhost:5000/job_results/10395/phenotype",
    "phenotype_config": "http://localhost:5000/phenotype_id/10326",
    "phenotype_id": "10326",
    "pipeline_configs": [
        "http://localhost:5000/pipeline_id/10538"
    ],
    "pipeline_ids": [
        10538
    ],
    "results_viewer": "http://localhost:8200/?job=10395",
    "status_endpoint": "http://localhost:5000/status/10395"
}


In [3]:
inter_csv_df = pd.read_csv(intermediate_csv)
inter_csv_df.head()

Unnamed: 0,_id,batch,concept_code,end,experiencer,inserted_date,job_id,negation,nlpql_feature,owner,...,report_id,report_type,section,sentence,solr_id,source,start,subject,temporality,term
0,5b96beac4606cf403ccca551,75,-1,30,Patient,2018-09-10 14:57:48.859000,10395,Affirmed,ProstateCA,claritynlp,...,1129787,Radiology,CONDITION,77 year old man w/ prostate CA s/p syncopal ep...,1129787,MIMIC,19,27102,Historical,prostate CA
1,5b96beac4606cf403ccca552,75,-1,36,Patient,2018-09-10 14:57:48.867000,10395,Affirmed,ProstateCA,claritynlp,...,1129787,Radiology,HISTORY_PRESENT_ILLNESS,77-year-old man with prostate cancer status po...,1129787,MIMIC,21,27102,Historical,prostate cancer
2,5b96beac4606cf403ccca553,75,-1,11,Patient,2018-09-10 14:57:48.973000,10395,Affirmed,ProstateCA,claritynlp,...,1051534,Radiology,ADMISSION_DIAGNOSIS,PROSTATE CA/SDA UNDERLYING MEDICAL,1051534,MIMIC,0,99810,Recent,PROSTATE CA
3,5b96beac4606cf403ccca554,75,-1,36,Patient,2018-09-10 14:57:48.976000,10395,Affirmed,ProstateCA,claritynlp,...,1051534,Radiology,CONDITION,57 year old man with prostate cancer s/p radic...,1051534,MIMIC,21,99810,Historical,prostate cancer
4,5b96bead4606cf403ccca555,75,-1,36,Patient,2018-09-10 14:57:49.100000,10395,Affirmed,ProstateCA,claritynlp,...,1056030,Radiology,CONDITION,68 year old man with met prostate ca with sob ...,1056030,MIMIC,25,45783,Recent,prostate ca
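
The intermediate results carry context columns such as `negation`, `experiencer`, and `temporality`, which makes it easy to narrow the rows after the job finishes. The sketch below uses only the standard library and an inline sample with a handful of columns copied from the output above, so it runs without a server.

```python
import csv
import io

# A few columns from the intermediate results shown above (abbreviated sample).
sample_csv = """report_id,subject,negation,experiencer,temporality,term
1129787,27102,Affirmed,Patient,Historical,prostate CA
1129787,27102,Affirmed,Patient,Historical,prostate cancer
1051534,99810,Affirmed,Patient,Recent,PROSTATE CA
1051534,99810,Affirmed,Patient,Historical,prostate cancer
1056030,45783,Affirmed,Patient,Recent,prostate ca
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Keep only mentions flagged as recent, then collect the distinct patients.
recent_rows = [row for row in rows if row["temporality"] == "Recent"]
recent_subjects = sorted({row["subject"] for row in recent_rows})
print(recent_subjects)
```

With the DataFrame loaded in the cell above, the equivalent filter would be `inter_csv_df[inter_csv_df['temporality'] == 'Recent']`.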


**Working with Document Sets**

Sometimes we don't want to look for mentions of a concept anywhere, but rather only within certain types of documents.  With ClarityNLP, we can control precisely which types of documents an algorithm runs on.  There are four modifiers that can be used in selecting documents:

- report_type
- report_tag
- filter_query
- query

***Report Type***

When documents are ingested into ClarityNLP, you have the option to assign a report type.  For clinical documents, this is typically something like Discharge Summary, Head CT WWO Contrast, Colonoscopy Report, etc.  Clarity has a convenient function `createReportTypeList` for building a document set based on report type.  Here is an example:

```
documentset ChestXRayDocuments:
   Clarity.createReportTypeList(["CXR PA/LAT","CXR 2V","AP/LAT CHEST"]);
```

***Report Tag***

In our research looking at just 10 different health systems, we found thousands of different report types.  Such diversity makes it challenging to create phenotypes that can be applied across diverse settings.  To address this, ClarityNLP embeds a Report Type tagging system that facilitates linking report types to the LOINC / RadLex document ontology.  This enables creation of more standardized NLPQL code.

```
documentset ChestXRayDocuments:
   Clarity.createReportTagList(["XR","Chest"]);
```

***Filter Query***

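Filter queries narrow the document set using Solr filter syntax; see the Solr documentation for the available options.  The example below restricts the document set to a single subject.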

```
documentset CXRDocuments:
    Clarity.createDocumentSet({
        "report_types":[],
        "report_tags": [],
        "filter_query": "subject:23224"});
```

***Query***

Sometimes a highly customized document selection is required and for this we enable use of the full [Solr query language](https://lucene.apache.org/solr/guide/7_4/query-syntax-and-parsing.html).

```
documentset CXRDocuments:
    Clarity.createDocumentSet({
        "report_types":[],
        "report_tags": [],
        "query":"(french OR fry OR hamburger)"});
```
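
When NLPQL is assembled from a notebook, it can be convenient to generate the `documentset` statement from Python values. The helper below is a hypothetical convenience (the name `make_document_set` is ours, not a ClarityNLP function) that only formats text using the four keys shown above.

```python
import json

def make_document_set(name, report_types=None, report_tags=None,
                      filter_query=None, query=None):
    """Return NLPQL text for a documentset built with Clarity.createDocumentSet."""
    options = {
        "report_types": report_types or [],
        "report_tags": report_tags or [],
    }
    # Only emit the Solr keys when the caller supplied them.
    if filter_query is not None:
        options["filter_query"] = filter_query
    if query is not None:
        options["query"] = query
    body = json.dumps(options, indent=8)
    return "documentset {}:\n    Clarity.createDocumentSet({});\n".format(name, body)

document_set_text = make_document_set("CXRDocuments", filter_query="subject:23224")
print(document_set_text)
```

The returned string can be concatenated into the `nlpql` variable before calling `claritynlp.run_nlpql`.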

### 1.2 Finding Biopsy Information
Let's do a slight addition where we look for mention of biopsy as well as Prostate CA.  First, the simple version where we just look for mention of the word biopsy in the same charts as mention of the word prostate cancer.

Example just with adding biopsy

#### Adding Synonyms
ClarityNLP has a number of cool built-in features for creating synonyms, plurals, lexical variants and so forth.  Check out the full list of [Termset Expansion](https://clarity-nlp.readthedocs.io/en/latest/user_guide/nlpql/macros.html?highlight=lexical) functions.  Here is an example of an expansion set.

```java
  phenotype "Test Expansion Using English Phrases";

  // # Structured Data Model #
  datamodel OMOP version "5.3";

  // # Referenced libraries #
  // The ClarityCore library provides common functions for simplifying NLP pipeline creation
  include ClarityCore version "1.0" called Clarity;
  include OHDSIHelpers version "1.0" called OHDSI;

  // ## Code Systems ##
  codesystem OMOP: "http://omop.org"; // OMOP vocabulary https://github.com/OHDSI/Vocabulary-v5.0;

  termset SynonymTesting: [

  // WordNet synonyms for 'prostate'
  // Clarity.Synonyms("prostate"),
  Clarity.Synonyms("prostate"),


  // WordNet synonyms for 'neoplasm'
  // Clarity.Synonyms("neoplasm"),
  Clarity.Synonyms("neoplasm"),

  // Pluralize Synonyms
  // Clarity.Plurals(Clarity.Synonyms("neoplasm")),
  Clarity.Plurals(Clarity.Synonyms("neoplasm")),

  // OHDSI synonyms for 'neoplasm'
  // OHDSI.Synonyms("neoplasm"),
  OHDSI.Synonyms("neoplasm"),

  // OHDSI synonyms for 'myocardial infarction'
  // OHDSI.Synonyms("myocardial infarction"),
  OHDSI.Synonyms("myocardial infarction"),

  //Wordnet synonyms for myocardial infarction
 //  Clarity.Synonyms("myocardial infarction"),
  Clarity.Synonyms("myocardial infarction"),

  // Verb inflections for 'biopsy'
  // Clarity.VerbInflections("biopsy")
  Clarity.VerbInflections("biopsy")

  ];
```



```java
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.TermFinder({
    termset:[Orthopnea],
    negated:"Affirmed"
    });
```

In [4]:
# Sample NLPQL
nlpql ='''
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.TermFinder({
    termset:[Orthopnea],
    negated:"Affirmed"
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://localhost:5000/job_results/10396/phenotype_intermediate",
    "job_id": "10396",
    "luigi_task_monitoring": "http://localhost:8082/static/visualiser/index.html#search__search=job=10396",
    "main_results_csv": "http://localhost:5000/job_results/10396/phenotype",
    "phenotype_config": "http://localhost:5000/phenotype_id/10327",
    "phenotype_id": "10327",
    "pipeline_configs": [
        "http://localhost:5000/pipeline_id/10539"
    ],
    "pipeline_ids": [
        10539
    ],
    "results_viewer": "http://localhost:8200/?job=10396",
    "status_endpoint": "http://localhost:5000/status/10396"
}


Because many term searches are actually looking for non-negated, non-hypothetical, subject=patient mentions, we provide a convenient function `ProviderAssertion` to capture those mentions without needing to configure TermFinder. 

```java
limit 100;

//phenotype name
phenotype "Orthopnea" version "2";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset Orthopnea:
  ["orthopnea","orthopnoea"];

define hasOrthopnea:
  Clarity.ProviderAssertion({
    termset:[Orthopnea]
    });
```

### 1.3 Finding Gleason Scores
Clues to the aggressiveness of a particular prostate tumor can be obtained from a needle biopsy and examination of the tissue under a microscope. Pathologists have developed a numerical scoring system for classifying the microscopic tissue morphology. This score, called the Gleason score, is an important factor in determining the stage of the cancer and the patient's overall prognosis. In this example we will develop a custom task for extracting Gleason scores.

#### 1.3.1 Gleason Score Regular Expressions

To develop a custom Gleason score extractor we first need to investigate how these scores are actually reported in medical records. By searching for the term 'Gleason' in a corpus of prostate cancer notes you will find considerable variation:

* Gleason score 7
* Gleasons score 7 (3+4)
* Gleason's score 3 + 4 = 7/10
* Gleason pattern of 3+4
* Gleasons 5
* Gleason five
* Gleason's 3 + 3
* Gleason grade is 4+
* etc.

In the first example, the value 7 is the overall score, which consists of the sum of two other components. Sometimes these components are listed in the form i+j, as in the second example, sometimes not. Sometimes only the components are listed and the overall score is omitted, as in the fourth example. Whether listed or not, the value of each component ranges from 1 to 5 inclusive, which means that the total Gleason score has a minimum value of 2 and a maximum value of 10.

These forms and others that we observed all fit this generic pattern: 

1. Gleason text string: `Gleason`, `Gleason's`, `Gleasons`, ...
2. (optional) designator text string: `score`, `pattern`, `grade`, possibly followed by "is" or "of"
3. (optional) score, either numeric or text
4. (optional) two-part component values, possibly parenthesized

A regular expression for recognizing these forms is:

```python
import re

str_gleason  = r'Gleason(\'?s)?\s*'
str_desig    = r'(score|sum|grade|pattern)(\s+(is|of))?'

# accept a score in digits under these circumstances:
#     digit not followed by a '+', e.g. 'Gleason score 6'
#     digit followed by '+' but no digit after, e.g. 'Gleason score 3+'
# constructs such as Gleason score 3+3 captured in two-part expression below
str_score = r'(?P<score>(\d+(?!\+)(?!\s\+)|\d+(?=\+(?!\d))|'             +\
            r'two|three|four|five|six|seven|eight|nine|ten))'

# parens are optional, space surrounding the '+' varies
str_two_part = r'((\(\s*)?(?P<first_num>\d+)\s*\+\s*(?P<second_num>\d+)' +\
               r'(\s*\))?)?'

# combine all strings
str_total = str_gleason + r'(' + str_desig + r'\s*)?'                    +\
             r'(' + str_score + r'\s*)?' + str_two_part
             
# final Gleason regex
regex_gleason = re.compile(str_total, re.IGNORECASE)
```
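
To see which parts of each reported form the regex captures, it helps to run it over a few of the variants listed earlier. The sketch below repeats the pattern strings so that it runs on its own; the `samples` list and the `captured` dictionary are ours, added for illustration.

```python
import re

# Same pattern strings as in the block above, repeated so this cell is self-contained.
str_gleason = r'Gleason(\'?s)?\s*'
str_desig = r'(score|sum|grade|pattern)(\s+(is|of))?'
str_score = (r'(?P<score>(\d+(?!\+)(?!\s\+)|\d+(?=\+(?!\d))|'
             r'two|three|four|five|six|seven|eight|nine|ten))')
str_two_part = (r'((\(\s*)?(?P<first_num>\d+)\s*\+\s*(?P<second_num>\d+)'
                r'(\s*\))?)?')
str_total = (str_gleason + r'(' + str_desig + r'\s*)?' +
             r'(' + str_score + r'\s*)?' + str_two_part)
regex_gleason = re.compile(str_total, re.IGNORECASE)

samples = [
    "Gleason score 7",
    "Gleasons score 7 (3+4)",
    "Gleason's score 3 + 4 = 7/10",
    "Gleason pattern of 3+4",
    "Gleason five",
    "Gleason grade is 4+",
]

# Map each sample to the raw (score, first_num, second_num) capture groups.
captured = {}
for text in samples:
    match = regex_gleason.search(text)
    captured[text] = (match.group('score'),
                      match.group('first_num'),
                      match.group('second_num'))
    print(text, '->', captured[text])
```

Note that forms such as `3 + 4` populate only the component groups; the `score` group is empty there, and the total has to be computed from the components afterwards.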

Named capture groups are used in the regex to extract the score and each component, if present.

#### 1.3.2 Extraction of the Gleason Score from Sentences

This regex, along with some additional logic, can be used to recognize and extract Gleason scores from sentences. An outline of the process is:

* Attempt a regex match on the current sentence.
* If match, extract all named capture groups that exist.
* Convert captured values from string to int.
* Check captured values and make sure they fall within expected ranges.
* If valid, save captured text and values.

Our implementation of this logic is in the next code block:

```python
from collections import namedtuple

# convert text scores to integers
SCORE_TEXT_TO_INT = {
    'two':2,
    'three':3,
    'four':4,
    'five':5,
    'six':6,
    'seven':7,
    'eight':8,
    'nine':9,
    'ten':10
}

# namedtuple for result
GLEASON_SCORE_RESULT_FIELDS = ['sentence_index', 'start', 'end',
                               'score', 'first_num', 'second_num']
GleasonScoreResult = namedtuple('GleasonScoreResult', GLEASON_SCORE_RESULT_FIELDS)

def find_gleason_score(sentence_list):
    """
    Scan a list of sentences and run Gleason score-finding regexes on each.
    Returns a list of GleasonScoreResult namedtuples.
    """

    result_list = []

    for i in range(len(sentence_list)):
        s = sentence_list[i]
        
        # attempt regex match
        iterator = regex_gleason.finditer(s)
        for match in iterator:
            start = match.start()
            end   = match.end()

            # extract first component if it exists
            try:
                first_num = int(match.group('first_num'))
            except:
                first_num = None

            # extract second component if it exists
            try:
                second_num = int(match.group('second_num'))
            except:
                second_num = None

            # extract score if it exists
            try:
                match_text = match.group('score')
                if match_text.isdigit():
                    score = int(match_text)
                else:
                    # convert text score to int
                    match_text = match_text.strip()
                    if match_text in SCORE_TEXT_TO_INT:
                        score = SCORE_TEXT_TO_INT[match_text]
                    else:
                        score = None
            except:
                # no single score was given
                if first_num is not None and second_num is not None:
                    score = first_num + second_num
                else:
                    score = None

            # Now apply these rules to determine if score is valid:
            #
            #     1 <= first_num <= 5
            #     1 <= second_num <= 5
            #     2 <= score <= 10
            #
            # anything outside of these limits is invalid

            if first_num is not None and (first_num > 5 and first_num <= 10):
                # assume score reported for first_num
                score = first_num
                first_num = None
                second_num = None
            elif score is not None and (score < 2 or score > 10):
                # invalid
                score = None
                continue
                    
            result = GleasonScoreResult(i, start, end, score, first_num, second_num)
            result_list.append(result)

    return result_list
```
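
The three range rules listed in the comments of `find_gleason_score` can also be isolated into a small predicate, which is handy for unit testing. This is a sketch under our own assumptions: the helper name `is_valid_gleason` is hypothetical, and it applies the rules strictly rather than reinterpreting a first component between 6 and 10 as a total score, as the function above does.

```python
def is_valid_gleason(score, first_num, second_num):
    """
    Hypothetical helper: check the range rules for a Gleason result.

        1 <= first_num  <= 5
        1 <= second_num <= 5
        2 <= score      <= 10

    A value of None means "not reported" and is not checked.
    """
    if first_num is not None and not 1 <= first_num <= 5:
        return False
    if second_num is not None and not 1 <= second_num <= 5:
        return False
    if score is not None and not 2 <= score <= 10:
        return False
    return True

print(is_valid_gleason(7, 3, 4))
print(is_valid_gleason(11, None, None))
```

Keeping the validation separate from the regex matching makes it easier to adjust one without touching the other.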

#### 1.3.3 Gleason Score Custom Task for ClarityNLP

The code presented above can be combined into a custom task for extracting Gleason scores. Using the custom task framework presented in [Cooking with ClarityNLP Session 1](https://github.com/ClarityNLP/ClarityNLP/blob/master/nlp/notebooks/cooking/Cooking%20with%20ClarityNLP%20-%20082818.ipynb), we have the following code outline:
```python

def find_gleason_score(sentence_list):
    ...  # see code above

class GleasonScoreTask(BaseTask):
    """
    A custom task for finding the Gleason score, which is relevant to 
    prostate cancer diagnosis and staging.
    """
    
    # use this name in NLPQL
    task_name = "GleasonScoreTask"

    def run_custom_task(self, temp_file, mongo_client: MongoClient):

        # for each document in the NLPQL-specified doc set
        for doc in self.docs:

            # all sentences in this document
            sentence_list = self.get_document_sentences(doc)

            # all Gleason score results in this document
            result_list = find_gleason_score(sentence_list)
                
            if len(result_list) > 0:
                for result in result_list:
                    obj = {
                        'sentence':sentence_list[result.sentence_index],
                        'start':result.start,
                        'end':result.end,
                        'value':result.score,
                        'value_first':result.first_num,
                        'value_second':result.second_num
                    }
            
                    self.write_result_data(temp_file, mongo_client, doc, obj)
```

Each ClarityNLP custom task must be implemented as a derived class of ClarityNLP's `BaseTask` class.  Our custom Gleason score task is called `GleasonScoreTask`, and it is a child of `BaseTask`, as required.

The `task_name` field is the name by which this custom task will be invoked from NLPQL. This name is `GleasonScoreTask`.

Each custom task must implement the `run_custom_task` function. We do so by iterating over all documents, extracting each document's sentences, and passing the sentence list to our `find_gleason_score` function, which recognizes and extracts the Gleason score and its components.

If any Gleason scores are found in the document's sentences they are returned as a list of `GleasonScoreResult` namedtuples. We iterate over the list of these tuples and build a python dict that contains the output desired in the phenotype results. The result fields that we write out are:

* `sentence`: the sentence containing the Gleason score
* `start`: the offset of the first character of the text matched by the regex
* `end`: the offset one past the last character matched by the regex
* `value`: the Gleason score value
* `value_first`: the first component of the Gleason score, if any
* `value_second`: the second component of the Gleason score, if any

In the next cell we present a sample NLPQL program to invoke the custom task and extract Gleason scores. This code uses the `createDocumentSet` function to limit the input documents to those with a `report_type` field equal to `Pathology`. Gleason scores are determined by examining tissue under a microscope, so pathology reports are the expected source of these scores.


In [None]:
# Sample NLPQL to find Gleason scores from pathology reports
nlpql ='''
limit 200;

phenotype "Gleason Score Finder" version "1";
include ClarityCore version "1.0" called Clarity;

documentset Docs:
    Clarity.createDocumentSet({
        "report_types":["Pathology"]
    });

define final GleasonFinderFunction:
    Clarity.GleasonScoreTask({
        documentset: [Docs]
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

### 1.4 Finding Patients with Elevated PSA and Biopsy Findings
OHDSI Example

## 2. ? Social Support 

In this example, we will be searching for ejection fraction values using a very simple algorithm.  Specifically, we will be looking for certain terms and subsequent values that would be typical for EF values.  (There are many more sophisticated methods to find ejection fraction, e.g., [Kim et al](https://www.ncbi.nlm.nih.gov/pubmed/28163196).)  We will then constrain the "final" cohort to only those with an EF <= 30.

```java
limit 100;
//phenotype name
phenotype "Ejection Fraction Values" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });

//logical Context (Patient, Document)
context Patient;

define final LowEFPatient:
    where EjectionFraction.value <= 30;
```


In [None]:
# Sample NLPQL
nlpql ='''
limit 100;
//phenotype name
phenotype "Ejection Fraction Values" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });

//logical Context (Patient, Document)
context Patient;

define final LowEFPatient:
    where EjectionFraction.value <= 30;
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

The `final` declaration refers to a cohort definition and typically involves some logic.  So in this case we defined an extraction process to pull all values between 10 and 85 following EF, LVEF, etc.  We then specified a `context`, meaning that the logic should operate on the level of a patient.  (The other option is Document context, which we will describe in a future session.)  Our logical rule stated that patients with an EjectionFraction <= 30 would make our criteria for a Low EF Patient. 
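
To make the idea of a patient-level rule concrete, the sketch below groups some made-up extracted values by patient and keeps the patients that satisfy `value <= 30`. It illustrates the concept only and is not how ClarityNLP evaluates NLPQL internally; in particular, treating a patient as qualifying when any one of their values meets the criterion is our assumption.

```python
from collections import defaultdict

# Hypothetical extracted values: (subject id, ejection fraction value)
extracted = [
    ("1001", 55.0),
    ("1002", 25.0),
    ("1002", 35.0),
    ("1003", 30.0),
    ("1004", 60.0),
]

# Group all values under the patient they belong to.
values_by_patient = defaultdict(list)
for subject, value in extracted:
    values_by_patient[subject].append(value)

# Assumption: a patient qualifies if at least one value is <= 30.
low_ef_patients = sorted(
    subject
    for subject, values in values_by_patient.items()
    if any(value <= 30 for value in values)
)
print(low_ef_patients)
```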

Results can be found at the `main_results_csv` URL from your API response, or, if you ran the query in this notebook, by using the cells below:

In [None]:
print(main_csv)

In [None]:
final_csv_df = pd.read_csv(main_csv)
final_csv_df.head()

The next step is to use ValueExtraction to pull out an enumerated value set (rather than a quantitative value).  See the example below for NYHA class.

```java
limit 100;
//phenotype name
phenotype "NYHA Class" version "1";

//include Clarity  main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset NYHATerms:
  ["nyha"];

define NYHAClass:
  Clarity.ValueExtraction({
    termset:[NYHATerms],
    enum_list: ["3","4","iii","iv"];
    });
```

In [None]:
# Sample NLPQL
nlpql ='''
limit 100;
//phenotype name
phenotype "NYHA Class" version "1";

//include Clarity  main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset NYHATerms:
  ["nyha"];

define NYHAClass:
  Clarity.ValueExtraction({
    termset:[NYHATerms],
    enum_list: ["3","4","iii","iv"];
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

In [None]:
# view intermediate results
inter_csv_df = pd.read_csv(intermediate_csv)
inter_csv_df.head()

### 1.3 Bringing the criteria together to find the target CHF cohort

For the final step in example 1, we want to bring together the above criteria to generate our final cohort.

```java
limit 100;
//phenotype name
phenotype "Final CHF Cohort" version "1";

//include Clarity  main NLP libraries
include ClarityCore version "1.0" called Clarity;

//termsets
termset Orthopnea:
  ["orthopnea","orthopnoea"];

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

termset NYHATerms:
  ["nyha"];

//data extractions
define hasOrthopnea:
  Clarity.ProviderAssertion({
    termset:[Orthopnea]
    });

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });


define NYHAClass34:
  Clarity.ValueExtraction({
    termset:[NYHATerms],
    enum_list: ["3","4","iii","iv"];
    });

//logical context (Patient, Document)
context Patient;

define LowEF:
    where EjectionFraction.value <= 30;

define SevereCHF:
    where NYHAClass34 OR LowEF;
    
define final SevereCHFwithOrthopnea:
    where SevereCHF AND hasOrthopnea;
```

In [None]:
# Sample NLPQL
nlpql ='''
limit 100;
//phenotype name
phenotype "Final CHF Cohort" version "1";

//include Clarity  main NLP libraries
include ClarityCore version "1.0" called Clarity;

//termsets
termset Orthopnea:
  ["orthopnea","orthopnoea"];

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

termset NYHATerms:
  ["nyha"];

//data extractions
define hasOrthopnea:
  Clarity.ProviderAssertion({
    termset:[Orthopnea]
    });

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });


define NYHAClass34:
  Clarity.ValueExtraction({
    termset:[NYHATerms],
    enum_list: ["3","4","iii","iv"];
    });

//logical context (Patient, Document)
context Patient;

define LowEF:
    where EjectionFraction.value <= 30;

define SevereCHF:
    where NYHAClass34 OR LowEF;
    
define final SevereCHFwithOrthopnea:
    where SevereCHF AND hasOrthopnea;
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

In [None]:
final_csv_df = pd.read_csv(main_csv)
final_csv_df.head()

In the above case, we may need to increase our limit beyond 100 documents to find many matching patients, because multiple criteria are required and a small sample may not be enough.  Try increasing to 500 documents.
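
The boolean logic of the final cohort can be mirrored with Python sets, which is a handy way to sanity-check the expected result on a few test patients. The patient identifiers below are made up, and the sketch only models the `OR` / `AND` combination, not the extraction itself.

```python
# Hypothetical patient ids satisfying each individual criterion.
nyha_class_34 = {"p1", "p2"}
low_ef = {"p2", "p3"}
has_orthopnea = {"p2", "p3", "p4"}

# define SevereCHF: where NYHAClass34 OR LowEF;
severe_chf = nyha_class_34 | low_ef

# define final SevereCHFwithOrthopnea: where SevereCHF AND hasOrthopnea;
severe_chf_with_orthopnea = severe_chf & has_orthopnea

print(sorted(severe_chf_with_orthopnea))
```

Here `p1` drops out because orthopnea was never asserted, and `p4` drops out because neither severity criterion was met, which shows why a small document limit can leave the final cohort empty.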

### Reviewing Results through the ClarityNLP UI
Downloading the raw CSV results is handy for analysis and data manipulation.  However, a domain-oriented end user may be more interested in just exploring and validating the final results without getting into all the programmatic details.  That's where the Results Viewer comes in, which in a typical installation will be found at `localhost:8200/`.


![Screen%20Shot%202018-08-29%20at%209.33.35%20AM.png](assets/Clarity Validator.png)

## Case #2: Capturing Information on Patient Race
Although there are many useful [core algorithms](https://claritynlp.readthedocs.io/en/latest/developer_guide/index.html#task-algorithms) in ClarityNLP, users will frequently want to extend its functionality.  In this second example, we will explore how to extend ClarityNLP when the built-in algorithms are inadequate.  

In this case, we'd like to identify the patient's race.  While some version of this could probably be done with simple search terms, a custom algorithm will likely be necessary.  Below is an example of a custom Python algorithm written to extract race information from a document. 

```python

str_sep = r'(\s-\s|-\s|\s-|\s)'
str_word = r'\b[-a-z.\d]+'
str_punct = r'[,.\s]*'
str_words = r'(' + str_word + str_punct + r'){0,6}'
str_person = r'\b(gentleman|gentlewoman|male|female|man|woman|person|'    +\
             r'child|boy|girl|infant|baby|newborn|neonate|individual)\b'
str_category = r'\b(american' + str_sep + r'indian|'                      +\
               r'alaska' + str_sep + r'native|asian|'                     +\
               r'african' + str_sep + r'american|black|negro|'            +\
               r'native' + str_sep + r'hawaiian|'                         +\
               r'other' + str_sep + r'pacific' + str_sep + r'islander|'   +\
               r'pacific' + str_sep + r'islander|'                        +\
               r'native' + str_sep + r'american|'                         +\
               r'white|caucasian|european)'

str_race1 = r'(\brace:?\s*)' + r'(?P<category>' + str_category + r')'
regex_race1 = re.compile(str_race1, re.IGNORECASE)
str_race2 = r'(?P<category>' + str_category + r')' + str_punct    +\
            str_words + str_person
regex_race2 = re.compile(str_race2, re.IGNORECASE)
str_race3 = str_person + str_punct + str_words + r'(?P<category>' +\
            str_category + r')'
regex_race3 = re.compile(str_race3, re.IGNORECASE)
REGEXES = [regex_race1, regex_race2, regex_race3]

RACE_FINDER_RESULT_FIELDS = ['sentence_index', 'start', 'end', 'race',
                             'normalized_race']
RaceFinderResult = namedtuple('RaceFinderResult', RACE_FINDER_RESULT_FIELDS)


###############################################################################
def normalize(race_text):
    """
    Convert a matching race string to a 'normalized' version.
    """

    NORM_MAP = {
        'african american':'black',
        'negro':'black',
        'caucasian':'white',
        'european':'white',
    }
    
    # convert to lowercase, remove dashes, collapse repeated whitespace
    race = race_text.lower()
    race = re.sub(r'[-]+', '', race)
    race = re.sub(r'\s+', ' ', race)

    if race in NORM_MAP:
        return NORM_MAP[race]
    else:
        return race
    

###############################################################################
def find_race(sentence_list):
    """
    Scan a list of sentences and run race-finding regexes on each.
    Return a list of RaceFinderResult namedtuples (at most one per document).
    """

    result_list = []

    found_match = False
    for i in range(len(sentence_list)):
        s = sentence_list[i]
        for regex in REGEXES:
            match = regex.search(s)
            if match:
                match_text = match.group('category')
                start = match.start()
                end   = match.end()
                normalized = normalize(match_text)
                result = RaceFinderResult(i, start, end, match_text, normalized)
                result_list.append(result)
                found_match = True
                break

        # Reports are unlikely to have more than one sentence stating the
        # patient's race.
        if found_match:
            break
            
    return result_list
```
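
A quick way to get a feel for the matching and normalization steps is to run a cut-down version on a few strings. The sketch below keeps the structure of `str_race1` and `normalize` but, for brevity, assumes a reduced category list and a reduced `NORM_MAP`; the test sentences are made up.

```python
import re

# Reduced versions of the patterns above (assumption: only a few categories).
str_sep = r'(\s-\s|-\s|\s-|\s)'
str_category = (r'\b(african' + str_sep + r'american|black|asian|'
                r'white|caucasian)')
regex_race = re.compile(r'(\brace:?\s*)(?P<category>' + str_category + r')',
                        re.IGNORECASE)

NORM_MAP = {'african american': 'black', 'caucasian': 'white'}

def normalize(race_text):
    """Lowercase, remove dashes, collapse whitespace, then map to a standard term."""
    race = race_text.lower()
    race = re.sub(r'[-]+', '', race)
    race = re.sub(r'\s+', ' ', race)
    return NORM_MAP.get(race, race)

sentences = [
    "Race: Caucasian",
    "RACE: African - American",
    "race asian",
    "No demographic information recorded.",
]

# One normalized result per sentence, or None when nothing matched.
results = []
for s in sentences:
    m = regex_race.search(s)
    results.append(normalize(m.group('category')) if m else None)
print(results)
```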

Without going into the details, this algorithm parses text to find race information, normalizes it to standard terms, and passes back the result.  In order to run this algorithm using NLPQL in ClarityNLP, we create what is called a custom task.  Below is code that creates the custom task wrapping this function, supplies it with documents, and handles the result output.

```python
from pymongo import MongoClient
from tasks.task_utilities import BaseTask


class RaceFinderTask(BaseTask):
    """
    A custom task for finding a patient's race.
    """

    # use this name in NLPQL
    task_name = "RaceFinderTask"

    def run_custom_task(self, temp_file, mongo_client: MongoClient):

        # for each document in the NLPQL-specified doc set
        for doc in self.docs:

            # all sentences in this document
            sentence_list = self.get_document_sentences(doc)

            # all race results in this document
            result_list = find_race(sentence_list)

            if len(result_list) > 0:
                for result in result_list:
                    # fields written to the results for this document
                    obj = {
                        'sentence':sentence_list[result.sentence_index],
                        'start':result.start,
                        'end':result.end,
                        'value':result.race,
                        'value_normalized':result.normalized_race,
                    }

                    self.write_result_data(temp_file, mongo_client, doc, obj)
```

The algorithm and the task wrapper can live in two separate files, or they can be combined into a single file, as in the final custom task below.

In [None]:
import re
from pymongo import MongoClient
from collections import namedtuple
from tasks.task_utilities import BaseTask

str_sep = r'(\s-\s|-\s|\s-|\s)'
str_word = r'\b[-a-z.\d]+'
str_punct = r'[,.\s]*'
str_words = r'(' + str_word + str_punct + r'){0,6}'
str_person = r'\b(gentleman|gentlewoman|male|female|man|woman|person|'    +\
             r'child|boy|girl|infant|baby|newborn|neonate|individual)\b'
str_category = r'\b(american' + str_sep + r'indian|'                      +\
               r'alaska' + str_sep + r'native|asian|'                     +\
               r'african' + str_sep + r'american|black|negro|'            +\
               r'native' + str_sep + r'hawaiian|'                         +\
               r'other' + str_sep + r'pacific' + str_sep + r'islander|'   +\
               r'pacific' + str_sep + r'islander|'                        +\
               r'native' + str_sep + r'american|'                         +\
               r'white|caucasian|european)'

str_race1 = r'(\brace:?\s*)' + r'(?P<category>' + str_category + r')'
regex_race1 = re.compile(str_race1, re.IGNORECASE)
str_race2 = r'(?P<category>' + str_category + r')' + str_punct    +\
            str_words + str_person
regex_race2 = re.compile(str_race2, re.IGNORECASE)
str_race3 = str_person + str_punct + str_words + r'(?P<category>' +\
            str_category + r')'
regex_race3 = re.compile(str_race3, re.IGNORECASE)
REGEXES = [regex_race1, regex_race2, regex_race3]

RACE_FINDER_RESULT_FIELDS = ['sentence_index', 'start', 'end', 'race',
                             'normalized_race']
RaceFinderResult = namedtuple('RaceFinderResult', RACE_FINDER_RESULT_FIELDS)


###############################################################################
def normalize(race_text):
    """
    Convert a matching race string to a 'normalized' version.
    """

    NORM_MAP = {
        'african american':'black',
        'negro':'black',
        'caucasian':'white',
        'european':'white',
    }
    
    # convert to lowercase, replace dashes with spaces (so 'african-american'
    # matches the 'african american' key), collapse repeated whitespace
    race = race_text.lower()
    race = re.sub(r'[-]+', ' ', race)
    race = re.sub(r'\s+', ' ', race).strip()

    if race in NORM_MAP:
        return NORM_MAP[race]
    else:
        return race
    

###############################################################################
def find_race(sentence_list):
    """
    Scan a list of sentences and run race-finding regexes on each.
    Return a list of RaceFinderResult namedtuples. Scanning stops at the
    first sentence that matches, so the list holds at most one result.
    """

    result_list = []

    found_match = False
    for i in range(len(sentence_list)):
        s = sentence_list[i]
        for regex in REGEXES:
            match = regex.search(s)
            if match:
                match_text = match.group('category')
                start = match.start()
                end   = match.end()
                normalized = normalize(match_text)
                result = RaceFinderResult(i, start, end, match_text, normalized)
                result_list.append(result)
                found_match = True
                break

        # Reports are unlikely to have more than one sentence stating the
        # patient's race.
        if found_match:
            break
            
    return result_list


###############################################################################
class RaceFinderTask(BaseTask):
    """
    A custom task for finding a patient's race.
    """
    
    # use this name in NLPQL
    task_name = "RaceFinderTask"

    def run_custom_task(self, temp_file, mongo_client: MongoClient):

        # for each document in the NLPQL-specified doc set
        for doc in self.docs:

            # all sentences in this document
            sentence_list = self.get_document_sentences(doc)

            # all race results in this document
            result_list = find_race(sentence_list)
                
            if len(result_list) > 0:
                for result in result_list:
                    obj = {
                        'sentence':sentence_list[result.sentence_index],
                        'start':result.start,
                        'end':result.end,
                        'value':result.race,
                        'value_normalized':result.normalized_race,
                    }
            
                    self.write_result_data(temp_file, mongo_client, doc, obj)



This race task can be called in NLPQL as follows:

```java
    limit 100;

    phenotype "Race Finder" version "1";
    include ClarityCore version "1.0" called Clarity;

    documentset DischargeSummaries:
        Clarity.createReportTagList(["Discharge Summary"]);

    define RaceFinderFunction:
        Clarity.RaceFinderTask({
            documentset: [DischargeSummaries]
        });
```

Note:  This example is the first time we use `documentset`, which lets us target a specific list of document types, such as Discharge Summaries or Radiology notes.  We will cover this in greater detail in future Cooking sessions.
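If you would rather submit NLPQL from your own script, the sketch below builds the POST request described earlier (NLPQL sent as `text/plain`). It assumes a local instance at `http://localhost:5000/nlpql`; the function name `build_nlpql_request` is hypothetical, and the request is only constructed here, not sent.

```python
import urllib.request


def build_nlpql_request(nlpql_text, url="http://localhost:5000/nlpql"):
    """Build (but do not send) a POST request carrying NLPQL as text/plain.

    The default URL assumes a typical local ClarityNLP instance.
    """
    return urllib.request.Request(
        url,
        data=nlpql_text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


request = build_nlpql_request('phenotype "Race Finder" version "1";')
print(request.get_method(), request.full_url)

# To actually submit it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping construction separate from sending makes it easy to inspect the request before pointing it at a real server.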

In [None]:
nlpql ='''
limit 100;

phenotype "Race Finder" version "1";
include ClarityCore version "1.0" called Clarity;

documentset DischargeSummaries:
    Clarity.createReportTagList(["Discharge Summary"]);

define RaceFinderFunction:
    Clarity.RaceFinderTask({
        documentset: [DischargeSummaries]
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

In [None]:
# view intermediate results
inter_csv_df = pd.read_csv(intermediate_csv)
inter_csv_df.head()
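Beyond looking at the first few rows, it can help to tally how often each normalized value appears. The sketch below is a rough illustration using only the standard library; it assumes the intermediate results expose the `value` and `value_normalized` fields written by the custom task as CSV columns, and it uses invented sample rows (including a made-up `subject` column) in place of the real file.

```python
import csv
import io
from collections import Counter

# Invented sample rows standing in for the intermediate results file.
sample_csv = io.StringIO(
    "subject,value,value_normalized\n"
    "101,African-American,black\n"
    "102,Caucasian,white\n"
    "103,black,black\n"
)


def count_normalized(csv_file, column="value_normalized"):
    """Count how often each non-empty value appears in the given column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader if row.get(column))


counts = count_normalized(sample_csv)
print(counts.most_common())
```

To run it on real results, pass an open file handle for the intermediate CSV instead of `sample_csv`, adjusting the column name if your output differs.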

### 2.2 Combining race with other criteria

As you have probably gathered, you can now write NLPQL that combines our CHF criteria with the race information extracted above.  The NLPQL would look like this:

```java
limit 100;
//phenotype name
phenotype "NYHA Class" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

//termsets
termset Orthopnea:
  ["orthopnea","orthopnoea"];

termset EjectionFractionTerms:
  ["ef","ejection fraction","lvef"];

termset NYHATerms:
  ["nyha"];


//documentsets
documentset DischargeSummaries:
    Clarity.createReportTagList(["Discharge Summary"]);


//data extractions
define hasOrthopnea:
  Clarity.ProviderAssertion({
    termset:[Orthopnea]
    });

define EjectionFraction:
  Clarity.ValueExtraction({
    termset:[EjectionFractionTerms],
    minimum_value: "10",
    maximum_value: "85"
    });


define NYHAClass34:
  Clarity.ValueExtraction({
    termset:[NYHATerms],
    enum_list: ["3","4","iii","iv"]
    });


define Race:
    Clarity.RaceFinderTask({
        documentset: [DischargeSummaries]
    });
       

//logical context (Patient, Document)
context Patient;

define LowEF:
    where EjectionFraction.value <= 30;

define SevereCHF:
    where NYHAClass34 OR LowEF;

define BlackRace:
    where Race.value_normalized = 'black';
    
define final SevereCHFwithOrthopnea:
    where SevereCHF AND hasOrthopnea;

define final BlackSevereCHFPatient:
    where SevereCHF AND BlackRace;
```
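One way to reason about the Patient-context logic above is as set operations over patient identifiers. The sketch below is a conceptual illustration only, not how ClarityNLP evaluates expressions; the patient IDs and extracted values are invented for the example.

```python
# Hypothetical per-patient extraction results (invented sample data).
ejection_fraction = {"p1": [25.0], "p2": [55.0], "p3": [30.0, 45.0]}
nyha_class_34 = {"p2", "p4"}
has_orthopnea = {"p1", "p4"}
race_normalized = {"p1": "black", "p2": "white", "p3": "black", "p4": "asian"}

# LowEF: patients with at least one ejection fraction value <= 30
low_ef = {
    patient
    for patient, values in ejection_fraction.items()
    if any(value <= 30 for value in values)
}

# SevereCHF: NYHAClass34 OR LowEF  (set union)
severe_chf = nyha_class_34 | low_ef

# BlackRace: patients whose normalized race value is 'black'
black_race = {p for p, race in race_normalized.items() if race == "black"}

# Final results: AND corresponds to set intersection
severe_chf_with_orthopnea = severe_chf & has_orthopnea
black_severe_chf_patient = severe_chf & black_race

print(sorted(severe_chf_with_orthopnea))
print(sorted(black_severe_chf_patient))
```

Thinking in these terms makes it easier to predict which patients each `define final` statement should return before running the query.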
