## Cooking with ClarityNLP - Session #2

The goal of this series is to introduce you to writing basic queries using NLPQL.  Today we will also be covering an introduction to data ingestion, document selection, lexical variants, and more custom algorithms.  For other details on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues).

### Data Ingestion

In order to run NLP jobs using ClarityNLP, data must first be ingested into the system.  You can ingest data from various sources (e.g. flat files, relational databases, APIs) and of various types (e.g. txt, doc, pdf). Today we will cover one of the most common ingestion patterns: bringing in data from a CSV file.  ClarityNLP has a user interface to support CSV ingestion.  In a typical instance, this will be located at `localhost:6543/csv`.

![Ingest_UI.png](assets/Ingest_UI.png)

The process of ingesting data from CSV involves the following steps:
1. Select your CSV file to load column headers
2. Assign the required fields for ClarityNLP to columns in your file
3. Add any additional fields you would like to include from your source file
4. Start the Import process


Below is an example of the ingestion screen filled out for the MIMIC-III notes file (`NOTEEVENTS.csv`):

![MIMIC_Ingest_UI.png](assets/MIMIC_Ingest_UI.png)
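
Before opening the ingestion screen, it can help to confirm which column headers your file actually contains. The sketch below is a minimal illustration using Python's standard `csv` module. The `FIELD_MAPPING` dictionary, the ClarityNLP field names in it, and the `missing_columns` helper are assumptions made for illustration (check the ingestion screen for the fields your instance requires), and the in-memory sample stands in for a real notes file.

```python
import csv
import io

# Hypothetical mapping: ClarityNLP field -> column in your CSV.
# These field names are assumptions for illustration only.
FIELD_MAPPING = {
    "subject": "SUBJECT_ID",
    "report_type": "CATEGORY",
    "report_text": "TEXT",
}

def missing_columns(csv_file, mapping):
    """Return the mapped column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file))
    return [column for column in mapping.values() if column not in header]

# Tiny in-memory stand-in for a real notes file.
sample = io.StringIO("ROW_ID,SUBJECT_ID,CATEGORY,TEXT\n1,7818,Radiology,BONE SCAN\n")
print(missing_columns(sample, FIELD_MAPPING))
```

An empty list means every column you plan to assign is present in the header row.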

### How to Run NLPQL

In order to run NLPQL, you must submit it to a ClarityNLP server either via API or via the ClarityNLP user interface.  If you are running a local instance, the API endpoint is typically `localhost:5000/nlpql`.  NLPQL should be POSTed as text/plain.  An example from [Postman](https://www.postman.com) is shown below.

![Postman.png](assets/Postman.png)
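
If you prefer a script to Postman, the same POST can be assembled with Python's standard library. This is a minimal sketch, assuming a local instance at `http://localhost:5000`; the helper name `build_nlpql_request` is hypothetical, and the request is only built here, not sent.

```python
import urllib.request

def build_nlpql_request(nlpql, base_url="http://localhost:5000"):
    """Build (but do not send) a POST request carrying NLPQL as text/plain."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/nlpql",
        data=nlpql.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

request = build_nlpql_request('limit 100;\nphenotype "Prostate Cancer" version "1";')
print(request.get_method(), request.full_url)
# To submit against a running server: urllib.request.urlopen(request)
```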

If you are unfamiliar with using tools such as Postman, you can submit NLPQL via the ClarityNLP user interface running in a web browser. For local instances, this will be at [localhost:8200/runner](localhost:8200/runner). 

![NLPQL_Runner.png](assets/NLPQL_Runner.png)

If you wish to run NLPQL directly from this notebook, then please use the following code.  You will need to edit the `url` variable to "localhost:5000/" or your ClarityNLP server IP address.

In [1]:
# This code below is only required for running ClarityNLP in Jupyter notebooks. It is not required if running NLPQL via API or the ClarityNLP GUI.

import pandas as pd
import claritynlp_notebook_helpers as claritynlp

ClarityNLP notebook helpers loaded successfully!



Note: Throughout these tutorials, we will prepend all examples with `limit 100;`.  This limits the server to analyzing a maximum of 100 documents, reducing runtime and compute load when testing new queries. Once a query is producing the expected output, removing this line will allow the full dataset to be run.
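
When the NLPQL lives in a Python string, as it does in this notebook, the `limit` statement can be dropped programmatically once the query is ready for a full run. The helper below is a small sketch; `remove_limit` is a hypothetical name and is not part of the notebook helpers.

```python
import re

# Matches a `limit N;` statement that sits on its own line.
LIMIT_STATEMENT = re.compile(r'^[ \t]*limit\s+\d+\s*;[ \t]*\r?\n?', re.MULTILINE)

def remove_limit(nlpql):
    """Return the NLPQL text with any `limit N;` statement removed."""
    return LIMIT_STATEMENT.sub('', nlpql)

test_query = 'limit 100;\n\nphenotype "Prostate Cancer" version "1";\n'
print(remove_limit(test_query))
```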

## Case #1:  Prostate Cancer
For this first use case, we are going to look at a few different approaches to analyzing prostate cancer.  First we will start with just the basic approach we covered last time. 

### 1.1 Find mentions of "Prostate Cancer" in the patient chart.

To run this NLPQL, copy/paste the query from the cell below and submit it via the API or the ClarityNLP interface.  If you would rather run the NLPQL directly within this notebook, run the code below.

In [3]:
# Sample NLPQL
nlpql ='''
limit 100;

//phenotype name
phenotype "Prostate Cancer" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

termset ProstateTerms:
  ["prostate cancer","prostate ca"];

define ProstateCA:
  Clarity.ProviderAssertion({
    termset:[ProstateTerms]
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/362/phenotype_intermediate",
    "job_id": "362",
    "luigi_task_monitoring": "http://18.220.133.76:8082/static/visualiser/index.html#search__search=job=362",
    "main_results_csv": "http://18.220.133.76:5000/job_results/362/phenotype",
    "phenotype_config": "http://18.220.133.76:5000/phenotype_id/362",
    "phenotype_id": "362",
    "pipeline_configs": [
        "http://18.220.133.76:5000/pipeline_id/542"
    ],
    "pipeline_ids": [
        542
    ],
    "results_viewer": "?job=362",
    "status_endpoint": "http://18.220.133.76:5000/status/362"
}
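
If you submit NLPQL through the API yourself rather than through the notebook helper, the response is the JSON document shown above, and the links you usually need can be pulled out with the standard `json` module. This is a sketch only: `job_links` is a hypothetical helper, and the response text is an abbreviated copy of the output above.

```python
import json

# Abbreviated copy of the job submission response shown above.
response_text = '''
{
    "job_id": "362",
    "main_results_csv": "http://18.220.133.76:5000/job_results/362/phenotype",
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/362/phenotype_intermediate",
    "status_endpoint": "http://18.220.133.76:5000/status/362"
}
'''

def job_links(text):
    """Pull the job id and the most useful URLs out of a submission response."""
    body = json.loads(text)
    keys = ("job_id", "status_endpoint", "main_results_csv", "intermediate_results_csv")
    return {key: body[key] for key in keys}

links = job_links(response_text)
print(links["status_endpoint"])
```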


In [6]:
inter_csv_df = pd.read_csv(intermediate_csv)
inter_csv_df.head()

Unnamed: 0,_id,batch,concept_code,end,experiencer,inserted_date,job_id,negation,nlpql_feature,owner,...,report_id,report_type,section,sentence,solr_id,source,start,subject,temporality,term
0,5b98079561cfff00755e44be,50,-1,97,Patient,2018-09-11 18:21:09.622000,362,Affirmed,ProstateCA,claritynlp,...,954047,Radiology,UNKNOWN,BONE SCAN Clip # [**Clip Number (Radiology) 98...,954047,MIMIC,86,7818,Historical,PROSTATE CA
1,5b98079561cfff00755e44bf,50,-1,37,Patient,2018-09-11 18:21:09.635000,362,Affirmed,ProstateCA,claritynlp,...,954047,Radiology,HISTORY_PRESENT_ILLNESS,58 year old male with prostate cancer.,954047,MIMIC,22,7818,Recent,prostate cancer
2,5b98079561cfff00755e44c0,50,-1,31,Patient,2018-09-11 18:21:09.857000,362,Affirmed,ProstateCA,claritynlp,...,906224,Radiology,ADMISSION_DIAGNOSIS,PAIN METESTATIC PROSTATE CANCER UNDERLYING MED...,906224,MIMIC,16,10608,Recent,PROSTATE CANCER
3,5b98079561cfff00755e44c1,50,-1,43,Patient,2018-09-11 18:21:09.858000,362,Affirmed,ProstateCA,claritynlp,...,906224,Radiology,CONDITION,"62 year old man with metastatic prostate CA, a...",906224,MIMIC,32,10608,Recent,prostate CA
4,5b98079561cfff00755e44c2,50,-1,15,Patient,2018-09-11 18:21:09.961000,362,Affirmed,ProstateCA,claritynlp,...,900404,Radiology,ADMISSION_DIAGNOSIS,PROSTATE CANCER;PNEUMONIA UNDERLYING MEDICAL,900404,MIMIC,0,9227,Recent,PROSTATE CANCER
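
The matched `term` column preserves the casing found in the note, so the same concept can appear as `PROSTATE CA` and `prostate CA`. A quick way to summarize results is to normalize case before counting. The sketch below uses only the standard library and a few columns copied from the five rows displayed above; the `summarize` helper is hypothetical.

```python
import csv
import io
from collections import Counter

# A few columns copied from the five rows displayed above.
sample_csv = """subject,temporality,term
7818,Historical,PROSTATE CA
7818,Recent,prostate cancer
10608,Recent,PROSTATE CANCER
10608,Recent,prostate CA
9227,Recent,PROSTATE CANCER
"""

def summarize(csv_file):
    """Count case-normalized matched terms and collect distinct subjects."""
    term_counts = Counter()
    subjects = set()
    for row in csv.DictReader(csv_file):
        term_counts[row["term"].lower()] += 1
        subjects.add(row["subject"])
    return term_counts, subjects

term_counts, subjects = summarize(io.StringIO(sample_csv))
print(term_counts.most_common())
print(len(subjects), "distinct subjects")
```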


### Working with Document Sets

Sometimes we don't want to look for mentions of a concept in just any document, but only within certain types of documents.  With ClarityNLP, we can tightly control the types of documents on which an algorithm runs.  A group of documents is referred to in NLPQL as a `documentset`. There are four modifiers that can be used in creating document sets:

- report_type
- report_tag
- filter_query
- query

***Report Type***

When documents are ingested into ClarityNLP, you have the option to assign a report type.  For clinical documents, this is typically something like Discharge Summary, Head CT WWO Contrast, Colonoscopy Report, etc.  Clarity has a convenient function `createReportTypeList` for building a document set based on report type.  Here is an example:

```
documentset ChestXRayDocuments:
   Clarity.createReportTypeList(["CXR PA/LAT","CXR 2V","AP/LAT CHEST"]);
```
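
When the list of report types comes from code, for example from a site-specific configuration file, the `documentset` statement can be generated rather than typed by hand. This is a sketch under two assumptions: the helper name `report_type_documentset` is hypothetical, and NLPQL string quoting is taken to match JSON quoting for simple report type names.

```python
import json

def report_type_documentset(name, report_types):
    """Render an NLPQL documentset statement for a list of report types."""
    # Compact separators reproduce the ["a","b"] style used in this tutorial.
    type_list = json.dumps(report_types, separators=(",", ":"))
    return f"documentset {name}:\n   Clarity.createReportTypeList({type_list});"

print(report_type_documentset("ChestXRayDocuments",
                              ["CXR PA/LAT", "CXR 2V", "AP/LAT CHEST"]))
```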

***Report Tag***

In our research looking at multiple health systems, we found thousands of different report type names.  Such diversity makes it challenging to create phenotypes that can be applied across diverse settings.  To address this, ClarityNLP embeds a Report Type tagging system that facilitates linking report types to the LOINC / RadLex document ontology.  This enables creation of more standardized NLPQL code.  We will discuss report tagging in more detail in a future *Cooking with ClarityNLP* session.

```
documentset ChestXRayDocuments:
   Clarity.createReportTagList(["XR","Chest"]);
```
![Report_Tagger.png](assets/Report_Type_Mapper4.png)

***Filter Query***

Filter queries are a powerful feature that applies a filtering function to the documents you have already selected.  These can include dynamic fields added at ingestion time or standard ClarityNLP document fields. See [Solr query documentation](https://lucene.apache.org/solr/guide/7_4/query-syntax-and-parsing.html) for details.

```
documentset CXRDocuments:
    Clarity.createDocumentSet({
        "report_types":[],
        "report_tags": [],
        "filter_query": "subject:23224"});
```

***Query***

Sometimes a highly customized document selection is required that cannot be managed with ClarityNLP's built-in functions.  For these situations, you can create an entirely custom document query using Solr, including wildcards, fuzzy searches, proximity searches, range searches, boosting, etc.  See [Solr query language](https://lucene.apache.org/solr/guide/7_4/query-syntax-and-parsing.html) for full details.

```
documentset CXRDocuments:
    Clarity.createDocumentSet({
        "report_types":[],
        "report_tags": [],
        "query":"*astatin"});
```

![Statin_Result.png](assets/Statin_Result.png)
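
To get a feel for what a leading wildcard such as `*astatin` selects, you can try the pattern locally against a few drug names. This is only an analogy: Python's `fnmatchcase` does shell-style matching on whole strings, whereas Solr matches against indexed terms after its own text analysis, so real results may differ. The drug list is illustrative.

```python
from fnmatch import fnmatchcase

drug_names = ["atorvastatin", "pravastatin", "simvastatin", "nystatin", "statin"]
pattern = "*astatin"

# Keep only the names that end in "astatin".
matches = [name for name in drug_names if fnmatchcase(name, pattern)]
print(matches)
```

Note that `nystatin` (an antifungal, not a cholesterol drug) is excluded by this pattern, whereas the broader pattern `*statin` would include it.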

### 1.2 Finding Prostate mentions in Discharge Summaries

We will repeat the basic example, looking for a positive assertion of "prostate cancer" or "prostate ca", but this time restrict the search to discharge summaries by using a `documentset`.

```java
limit 100;

//phenotype name
phenotype "Prostate Cancer" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

documentset DischargeSummaries:
    Clarity.createReportTagList("Discharge summary");

termset ProstateTerms:
  ["prostate cancer","prostate ca"];

define ProstateCA:
  Clarity.ProviderAssertion({
    documentset: [DischargeSummaries],
    termset:[ProstateTerms]
    });
```

In [12]:
# Sample NLPQL
nlpql ='''
limit 100;

//phenotype name
phenotype "Prostate Cancer in Discharge Summary" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

documentset DischargeSummaries:
    Clarity.createReportTagList("Discharge summary");

termset ProstateTerms:
  ["prostate cancer","prostate ca"];

define ProstateCA:
  Clarity.ProviderAssertion({
    documentset: [DischargeSummaries],
    termset:[ProstateTerms]
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/364/phenotype_intermediate",
    "job_id": "364",
    "luigi_task_monitoring": "http://18.220.133.76:8082/static/visualiser/index.html#search__search=job=364",
    "main_results_csv": "http://18.220.133.76:5000/job_results/364/phenotype",
    "phenotype_config": "http://18.220.133.76:5000/phenotype_id/364",
    "phenotype_id": "364",
    "pipeline_configs": [
        "http://18.220.133.76:5000/pipeline_id/544"
    ],
    "pipeline_ids": [
        544
    ],
    "results_viewer": "?job=364",
    "status_endpoint": "http://18.220.133.76:5000/status/364"
}


In [19]:
inter_csv_df = pd.read_csv(intermediate_csv)
inter_csv_df.head()

Unnamed: 0,_id,batch,concept_code,end,experiencer,inserted_date,job_id,negation,nlpql_feature,owner,...,report_id,report_type,section,sentence,solr_id,source,start,subject,temporality,term
0,5b980a4461cfff0152b5e297,50,-1,97,Patient,2018-09-11 18:32:36.985000,365,Affirmed,ProstateCA,claritynlp,...,954047,Radiology,UNKNOWN,BONE SCAN Clip # [**Clip Number (Radiology) 98...,954047,MIMIC,86,7818,Historical,PROSTATE CA
1,5b980a4461cfff0152b5e298,50,-1,37,Patient,2018-09-11 18:32:36.988000,365,Affirmed,ProstateCA,claritynlp,...,954047,Radiology,HISTORY_PRESENT_ILLNESS,58 year old male with prostate cancer.,954047,MIMIC,22,7818,Recent,prostate cancer
2,5b980a4561cfff0152b5e299,50,-1,31,Patient,2018-09-11 18:32:37.203000,365,Affirmed,ProstateCA,claritynlp,...,906224,Radiology,ADMISSION_DIAGNOSIS,PAIN METESTATIC PROSTATE CANCER UNDERLYING MED...,906224,MIMIC,16,10608,Recent,PROSTATE CANCER
3,5b980a4561cfff0152b5e29a,50,-1,43,Patient,2018-09-11 18:32:37.204000,365,Affirmed,ProstateCA,claritynlp,...,906224,Radiology,CONDITION,"62 year old man with metastatic prostate CA, a...",906224,MIMIC,32,10608,Recent,prostate CA
4,5b980a4561cfff0152b5e29b,50,-1,15,Patient,2018-09-11 18:32:37.306000,365,Affirmed,ProstateCA,claritynlp,...,900404,Radiology,ADMISSION_DIAGNOSIS,PROSTATE CANCER;PNEUMONIA UNDERLYING MEDICAL,900404,MIMIC,0,9227,Recent,PROSTATE CANCER


### Building Term Sets with Lexical Variants
Building term sets can be a time-consuming process. ClarityNLP has a number of cool built-in features for creating synonyms, plurals, lexical variants, and so forth.  Check out the full list of [Termset Expansion](https://clarity-nlp.readthedocs.io/en/latest/user_guide/nlpql/macros.html?highlight=lexical) functions.  Here are a few handy ones:

- Clarity.Synonyms
- Clarity.Plurals
- Clarity.VerbInflections

- OHDSI.Synonyms

Here is an example of a termset to be expanded.

```java
  phenotype "Test Expansion Using English Phrases";

  // # Structured Data Model #
  datamodel OMOP version "5.3";

  // # Referenced libraries #
  // The ClarityCore library provides common functions for simplifying NLP pipeline creation
  include ClarityCore version "1.0" called Clarity;
  include OHDSIHelpers version "1.0" called OHDSI;

  // ## Code Systems ##
  codesystem OMOP: "http://omop.org"; // OMOP vocabulary https://github.com/OHDSI/Vocabulary-v5.0;

  termset SynonymTesting: [

  // WordNet synonyms for 'prostate'
  // Clarity.Synonyms("prostate"),
  Clarity.Synonyms("prostate"),


  // WordNet synonyms for 'neoplasm'
  // Clarity.Synonyms("neoplasm"),
  Clarity.Synonyms("neoplasm"),

  // Pluralize Synonyms
  // Clarity.Plurals(Clarity.Synonyms("neoplasm")),
  Clarity.Plurals(Clarity.Synonyms("neoplasm")),

  // OHDSI synonyms for 'neoplasm'
  // OHDSI.Synonyms("neoplasm"),
  OHDSI.Synonyms("neoplasm"),

  // OHDSI synonyms for 'myocardial infarction'
  // OHDSI.Synonyms("myocardial infarction"),
  OHDSI.Synonyms("myocardial infarction"),

  //Wordnet synonyms for myocardial infarction
 //  Clarity.Synonyms("myocardial infarction"),
  Clarity.Synonyms("myocardial infarction")
  ];
```

Now post-expansion:

```java
termset SynonymTesting: [

  // WordNet synonyms for 'prostate'
  // Clarity.Synonyms("prostate"),
  "prostate","prostate gland","prostatic",


  // WordNet synonyms for 'neoplasm'
  // Clarity.Synonyms("neoplasm"),
  "neoplasm","tumor","tumour",

  // Pluralize Synonyms
  // Clarity.Plurals(Clarity.Synonyms("neoplasm")),
  "neoplasm","neoplasms","tumor","tumors","tumour","tumours",

  // OHDSI synonyms for 'neoplasm'
  // OHDSI.Synonyms("neoplasm"),
  "neoplasm","neoplasm (morphologic abnormality)","tumor","tumour",

  // OHDSI synonyms for 'myocardial infarction'
  // OHDSI.Synonyms("myocardial infarction"),
  "cardiac infarction","heart attack","infarction of heart","mi - myocardial infarction","myocardial infarct","myocardial infarction","myocardial infarction (disorder)",

  //Wordnet synonyms for myocardial infarction
 //  Clarity.Synonyms("myocardial infarction")
  "myocardial infarct","myocardial infarction"
  ];
```
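
Because several expansion functions were applied to overlapping inputs, the expanded term set repeats terms such as "neoplasm" and "tumor". The source does not say whether repeats matter to ClarityNLP, so the following is an optional tidying step if you paste expanded terms into your own term set. The `unique_terms` helper is hypothetical.

```python
def unique_terms(terms):
    """Drop repeated terms (case-insensitively) while keeping first-seen order."""
    seen = set()
    result = []
    for term in terms:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            result.append(term)
    return result

# The three 'neoplasm' expansions from the post-expansion term set above.
expanded = [
    "neoplasm", "tumor", "tumour",
    "neoplasm", "neoplasms", "tumor", "tumors", "tumour", "tumours",
    "neoplasm", "neoplasm (morphologic abnormality)", "tumor", "tumour",
]
print(unique_terms(expanded))
```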

### 1.3 Prostate Biopsies
Let's do a slight addition where we look for mention of prostate and biopsy.  We can use the new synonym expansion capability we learned.

In [3]:
# Sample NLPQL
nlpql ='''
limit 100;

//phenotype name
phenotype "Prostate Cancer and Biopsy" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

documentset PathologyDocuments:
    Clarity.createDocumentSet({
        "report_types":["Pathology"]
    });

termset ProstateTerms:
  ["prostate"];

termset BiopsyTerms:
  [
  Clarity.VerbInflections("biopsy")
  ];
  
define ProstateCA:
  Clarity.ProviderAssertion({
    documentset:[PathologyDocuments],
    termset:[ProstateTerms]
    });

define Biopsy:
  Clarity.ProviderAssertion({
    documentset:[PathologyDocuments],
    termset:[BiopsyTerms]
    });

context Document;

define final ProstateCAandBiopsy:
    where ProstateCA AND Biopsy;
'''
expanded_nlpql = claritynlp.run_term_expansion(nlpql)
print(expanded_nlpql)


limit 100;

//phenotype name
phenotype "Prostate Cancer and Biopsy" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;

documentset PathologyDocuments:
    Clarity.createDocumentSet({
        "report_types":["Pathology"]
    });

termset ProstateTerms:
  ["prostate"];

termset BiopsyTerms:
  [
  "biopsied","biopsies","biopsy","biopsying"
  ];
  
define ProstateCA:
  Clarity.ProviderAssertion({
    documentset:[PathologyDocuments],
    termset:[ProstateTerms]
    });

define Biopsy:
  Clarity.ProviderAssertion({
    documentset:[PathologyDocuments],
    termset:[BiopsyTerms]
    });

context Document;

define final ProstateCAandBiopsy:
    where ProstateCA AND Biopsy;



In [4]:
# Run NLPQL after term expansion
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(expanded_nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/400/phenotype_intermediate",
    "job_id": "400",
    "luigi_task_monitoring": "http://18.220.133.76:8082/static/visualiser/index.html#search__search=job=400",
    "main_results_csv": "http://18.220.133.76:5000/job_results/400/phenotype",
    "phenotype_config": "http://18.220.133.76:5000/phenotype_id/400",
    "phenotype_id": "400",
    "pipeline_configs": [
        "http://18.220.133.76:5000/pipeline_id/596",
        "http://18.220.133.76:5000/pipeline_id/597"
    ],
    "pipeline_ids": [
        596,
        597
    ],
    "results_viewer": "?job=400",
    "status_endpoint": "http://18.220.133.76:5000/status/400"
}


In [45]:
final_csv_df = pd.read_csv(main_csv)
final_csv_df.head()

Unnamed: 0,_id,batch,concept_code,end1,end2,inserted_date,job_id,nlpql_feature,owner,phenotype_final,...,report_type,sentence,solr_id,source,start1,start2,subject,value,word1,word2
0,5b9878776d7df905d814f2fb,25,-1,62,69,2018-09-12 02:22:47.730000,386,TermProximityFunction,claritynlp,True,...,Pathology,Onc hx:(as per recent d/c summary) - [**2531-1...,13000032,MIMIC,54,63,34876,"prostate,biopsy",prostate,biopsy
1,5b9878776d7df905d814f2fc,25,-1,69,102,2018-09-12 02:22:47.733000,386,TermProximityFunction,claritynlp,True,...,Pathology,Onc hx:(as per recent d/c summary) - [**2531-1...,13000032,MIMIC,63,94,34876,"biopsy,prostate",biopsy,prostate
2,5b9878796d7df905db14f2fb,0,-1,15,22,2018-09-12 02:22:49.522000,386,TermProximityFunction,claritynlp,True,...,Pathology,He had prostate biopsy done at [**Location (un...,13000000,MIMIC,7,16,43479,"prostate,biopsy",prostate,biopsy
3,5b98787c6d7df905db14f2fc,0,-1,227,242,2018-09-12 02:22:52.834000,386,TermProximityFunction,claritynlp,True,...,Pathology,Other medications: glyburide 10mg [**Hospital1...,13000004,MIMIC,219,236,34714,"prostate,biopsy",prostate,biopsy
4,5b98788a6d7df905d814f2fd,25,-1,12,58,2018-09-12 02:23:06.749000,386,TermProximityFunction,claritynlp,True,...,Pathology,HTN Prostate cancer - transrectal ultrasound-g...,13000045,MIMIC,4,52,31689,"prostate,biopsy",prostate,biopsy


This search is not terribly useful because it simply finds documents in which both words appear.  To perform a more precise search for term pairs by distance, you can use the convenient `TermProximityTask`.  This lets you set the term sets, word distance, and ordering constraint.

In [47]:
nlpql ='''
limit 100;

phenotype "Prostate and Biopsy Proximity" version "1";
include ClarityCore version "1.0" called Clarity;

documentset Docs:
    Clarity.createDocumentSet({
        "report_types":["Pathology"]
    });

define final TermProximityFunction:
    Clarity.TermProximityTask({
        documentset: [Docs],
        "termset1": "prostate",
        "termset2": "biopsy,biopsied,bx",
        "word_distance": 5,
        "any_order": "True"
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/387/phenotype_intermediate",
    "job_id": "387",
    "luigi_task_monitoring": "http://18.220.133.76:8082/static/visualiser/index.html#search__search=job=387",
    "main_results_csv": "http://18.220.133.76:5000/job_results/387/phenotype",
    "phenotype_config": "http://18.220.133.76:5000/phenotype_id/387",
    "phenotype_id": "387",
    "pipeline_configs": [
        "http://18.220.133.76:5000/pipeline_id/583"
    ],
    "pipeline_ids": [
        583
    ],
    "results_viewer": "?job=387",
    "status_endpoint": "http://18.220.133.76:5000/status/387"
}


In [46]:
final_csv_df = pd.read_csv(main_csv)
final_csv_df.head()

Unnamed: 0,_id,batch,concept_code,end1,end2,inserted_date,job_id,nlpql_feature,owner,phenotype_final,...,report_type,sentence,solr_id,source,start1,start2,subject,value,word1,word2
0,5b9878776d7df905d814f2fb,25,-1,62,69,2018-09-12 02:22:47.730000,386,TermProximityFunction,claritynlp,True,...,Pathology,Onc hx:(as per recent d/c summary) - [**2531-1...,13000032,MIMIC,54,63,34876,"prostate,biopsy",prostate,biopsy
1,5b9878776d7df905d814f2fc,25,-1,69,102,2018-09-12 02:22:47.733000,386,TermProximityFunction,claritynlp,True,...,Pathology,Onc hx:(as per recent d/c summary) - [**2531-1...,13000032,MIMIC,63,94,34876,"biopsy,prostate",biopsy,prostate
2,5b9878796d7df905db14f2fb,0,-1,15,22,2018-09-12 02:22:49.522000,386,TermProximityFunction,claritynlp,True,...,Pathology,He had prostate biopsy done at [**Location (un...,13000000,MIMIC,7,16,43479,"prostate,biopsy",prostate,biopsy
3,5b98787c6d7df905db14f2fc,0,-1,227,242,2018-09-12 02:22:52.834000,386,TermProximityFunction,claritynlp,True,...,Pathology,Other medications: glyburide 10mg [**Hospital1...,13000004,MIMIC,219,236,34714,"prostate,biopsy",prostate,biopsy
4,5b98788a6d7df905d814f2fd,25,-1,12,58,2018-09-12 02:23:06.749000,386,TermProximityFunction,claritynlp,True,...,Pathology,HTN Prostate cancer - transrectal ultrasound-g...,13000045,MIMIC,4,52,31689,"prostate,biopsy",prostate,biopsy
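
To make the word-distance and ordering ideas concrete, here is a small standalone sketch of a proximity check. It is not ClarityNLP's implementation: the function name, the simple `\w+` tokenization, the single-word terms, and the way distance is counted (difference in word positions) are all assumptions for illustration.

```python
import re

def within_distance(sentence, term1, term2, word_distance, any_order=True):
    """True if term1 and term2 occur within word_distance words of each other."""
    words = re.findall(r"\w+", sentence.lower())
    first = [i for i, word in enumerate(words) if word == term1.lower()]
    second = [i for i, word in enumerate(words) if word == term2.lower()]
    for i in first:
        for j in second:
            gap = j - i
            if any_order:
                gap = abs(gap)
            # With any_order=False, term2 must come after term1.
            if 0 < gap <= word_distance:
                return True
    return False

print(within_distance("He had prostate biopsy done", "prostate", "biopsy", 5))
print(within_distance("biopsy of the prostate", "prostate", "biopsy", 5, any_order=False))
```

The first call finds the two words next to each other. The second call fails only because ordering is enforced and "biopsy" precedes "prostate".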


### 1.4 Cooking Special: Finding Gleason Scores
Clues to the severity of a prostate tumor can be obtained from a needle biopsy and examination of the tissue under a microscope. Pathologists have developed a numerical scoring system for classifying the microscopic tissue morphology. This score, called the Gleason score, is an important factor in determining the stage of the cancer and the patient's overall prognosis. In this example we will develop a custom task for extracting Gleason scores.

#### 1.4.1 Gleason Score Regular Expressions

To develop a custom Gleason score extractor we first need to investigate how these scores are actually reported in medical records. By searching for the term 'Gleason' in a corpus of prostate-related health data you will find considerable variation:

* Gleason score 7
* Gleasons score 7 (3+4)
* Gleason's score 3 + 4 = 7/10
* Gleason pattern of 3+4
* Gleasons 5
* Gleason five
* Gleason's 3 + 3
* Gleason grade is 4+
* etc.

In the first example, the value 7 is the overall score, which consists of the sum of two other components. Sometimes these components are listed in the form `i+j`, as in the second example, sometimes not. Sometimes only the components are listed and the overall score is omitted, as in the fourth example. Whether listed or not, the value of each component ranges from 1 to 5 inclusive, which means that the total Gleason score has a minimum value of 2 and a maximum value of 10.

These forms and others that we observed all fit this generic pattern: 

1. Gleason text string: `Gleason`, `Gleason's`, `Gleasons`, ...
2. (optional) designator text string: `score`, `pattern`, `grade`, possibly followed by "is" or "of"
3. (optional) score, either numeric or text
4. (optional) two-part component values, possibly parenthesized

A regular expression for recognizing these forms is:

```python
import re

str_gleason  = r'Gleason(\'?s)?\s*'
str_desig    = r'(score|sum|grade|pattern)(\s+(is|of))?'

# accept a score in digits under these circumstances:
#     digit not followed by a '+', e.g. 'Gleason score 6'
#     digit followed by '+' but no digit after, e.g. 'Gleason score 3+'
# constructs such as Gleason score 3+3 captured in two-part expression below
str_score = r'(?P<score>(\d+(?!\+)(?!\s\+)|\d+(?=\+(?!\d))|'             +\
            r'two|three|four|five|six|seven|eight|nine|ten))'

# parens are optional, space surrounding the '+' varies
str_two_part = r'((\(\s*)?(?P<first_num>\d+)\s*\+\s*(?P<second_num>\d+)' +\
               r'(\s*\))?)?'

# combine all strings
str_total = str_gleason + r'(' + str_desig + r'\s*)?'                    +\
             r'(' + str_score + r'\s*)?' + str_two_part
             
# final Gleason regex
regex_gleason = re.compile(str_total, re.IGNORECASE)
```

Named capture groups are used in the regex to extract the score and each component, if present.
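
As a quick check of the named groups, the snippet below rebuilds the same pattern (repeated so that the example runs on its own) and applies it to three of the variants listed earlier. Groups that do not take part in a match come back as `None`.

```python
import re

# Same pattern as above, repeated so this example is self-contained.
str_gleason = r'Gleason(\'?s)?\s*'
str_desig = r'(score|sum|grade|pattern)(\s+(is|of))?'
str_score = (r'(?P<score>(\d+(?!\+)(?!\s\+)|\d+(?=\+(?!\d))|'
             r'two|three|four|five|six|seven|eight|nine|ten))')
str_two_part = (r'((\(\s*)?(?P<first_num>\d+)\s*\+\s*(?P<second_num>\d+)'
                r'(\s*\))?)?')
str_total = (str_gleason + r'(' + str_desig + r'\s*)?'
             + r'(' + str_score + r'\s*)?' + str_two_part)
regex_gleason = re.compile(str_total, re.IGNORECASE)

def named_groups(text):
    """Return the score and component groups captured from the first match."""
    match = regex_gleason.search(text)
    return match.group('score'), match.group('first_num'), match.group('second_num')

for text in ["Gleasons score 7 (3+4)", "Gleason pattern of 3+4", "Gleason five"]:
    print(text, '->', named_groups(text))
```

In the second case only the components are captured, so the overall score has to be derived by adding them.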

#### 1.4.2 Extraction of the Gleason Score from Sentences

This regex, along with some additional logic, can be used to recognize and extract Gleason scores from sentences. An outline of the process is:

* Attempt a regex match on the current sentence.
* If match, extract all named capture groups that exist.
* Convert captured values from string to int.
* Check captured values and make sure they fall within expected ranges.
* If valid, save captured text and values.

Our implementation of this logic is in the next code block:

```python
from collections import namedtuple

# convert text scores to integers
SCORE_TEXT_TO_INT = {
    'two':2,
    'three':3,
    'four':4,
    'five':5,
    'six':6,
    'seven':7,
    'eight':8,
    'nine':9,
    'ten':10
}

# namedtuple for result
GLEASON_SCORE_RESULT_FIELDS = ['sentence_index', 'start', 'end',
                               'score', 'first_num', 'second_num']
GleasonScoreResult = namedtuple('GleasonScoreResult', GLEASON_SCORE_RESULT_FIELDS)

def find_gleason_score(sentence_list):
    """
    Scan a list of sentences and run Gleason score-finding regexes on each.
    Returns a list of GleasonScoreResult namedtuples.
    """

    result_list = []

    for i in range(len(sentence_list)):
        s = sentence_list[i]
        
        # attempt regex match
        iterator = regex_gleason.finditer(s)
        for match in iterator:
            start = match.start()
            end   = match.end()

            # extract first component if it exists
            try:
                first_num = int(match.group('first_num'))
            except:
                first_num = None

            # extract second component if it exists
            try:
                second_num = int(match.group('second_num'))
            except:
                second_num = None

            # extract score if it exists
            try:
                match_text = match.group('score')
                if match_text.isdigit():
                    score = int(match_text)
                else:
                    # convert text score to int; the regex ignores case,
                    # so normalize before the dictionary lookup
                    match_text = match_text.strip().lower()
                    if match_text in SCORE_TEXT_TO_INT:
                        score = SCORE_TEXT_TO_INT[match_text]
                    else:
                        score = None
            except:
                # no single score was given
                if first_num is not None and second_num is not None:
                    score = first_num + second_num
                else:
                    score = None

            # Now apply these rules to determine if score is valid:
            #
            #     1 <= first_num <= 5
            #     1 <= second_num <= 5
            #     2 <= score <= 10
            #
            # anything outside of these limits is invalid

            if first_num is not None and (first_num > 5 and first_num <= 10):
                # assume score reported for first_num
                score = first_num
                first_num = None
                second_num = None
            elif score is not None and (score < 2 or score > 10):
                # invalid
                score = None
                continue
                    
            result = GleasonScoreResult(i, start, end, score, first_num, second_num)
            result_list.append(result)

    return result_list
```
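
The range rules listed in the comments above can also be written as a small standalone predicate, which is handy for unit testing. This sketch is independent of `find_gleason_score` and is stricter in one respect: when the score and both components are present, it also requires that the score equal their sum. That consistency check and the name `is_valid_gleason` are assumptions, not behavior of the function above.

```python
def is_valid_gleason(score, first_num=None, second_num=None):
    """Check a Gleason score and optional components against the stated ranges."""
    # Each component, when present, must lie in 1..5.
    for component in (first_num, second_num):
        if component is not None and not 1 <= component <= 5:
            return False
    # The overall score, when present, must lie in 2..10.
    if score is not None and not 2 <= score <= 10:
        return False
    # Assumption: a fully specified result must be internally consistent.
    if None not in (score, first_num, second_num):
        return score == first_num + second_num
    return True

print(is_valid_gleason(7, 3, 4))
print(is_valid_gleason(11))
```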

#### 1.4.3 Gleason Score Custom Task for ClarityNLP

The code presented above can be combined into a custom task for extracting Gleason scores. Using the custom task framework presented in [Cooking with ClarityNLP Session 1](https://github.com/ClarityNLP/ClarityNLP/blob/master/nlp/notebooks/cooking/Cooking%20with%20ClarityNLP%20-%20082818.ipynb), we have the following code outline:
```python

def find_gleason_score(sentence_list):
    ...  # see the implementation above

class GleasonScoreTask(BaseTask):
    """
    A custom task for finding the Gleason score, which is relevant to 
    prostate cancer diagnosis and staging.
    """
    
    # use this name in NLPQL
    task_name = "GleasonScoreTask"

    def run_custom_task(self, temp_file, mongo_client: MongoClient):

        # for each document in the NLPQL-specified doc set
        for doc in self.docs:

            # all sentences in this document
            sentence_list = self.get_document_sentences(doc)

            # all Gleason score results in this document
            result_list = find_gleason_score(sentence_list)
                
            if len(result_list) > 0:
                for result in result_list:
                    obj = {
                        'sentence':sentence_list[result.sentence_index],
                        'start':result.start,
                        'end':result.end,
                        'value':result.score,
                        'value_first':result.first_num,
                        'value_second':result.second_num
                    }
            
                    self.write_result_data(temp_file, mongo_client, doc, obj)
```

Each ClarityNLP custom task must be implemented as a derived class of ClarityNLP's `BaseTask` class.  Our custom Gleason score task is called `GleasonScoreTask`, and it is a child of `BaseTask`, as required.

The `task_name` field is the name by which this custom task will be invoked from NLPQL. This name is `GleasonScoreTask`.

Each custom task must implement the `run_custom_task` function. We do so by iterating over all documents, extracting each document's sentences, and calling our `find_gleason_score` function on the resulting sentence list to recognize and extract the Gleason score and its components.

If any Gleason scores are found in the document's sentences, they are returned as a list of `GleasonScoreResult` namedtuples. We iterate over the list of these tuples and build a Python dict that contains the output desired in the phenotype results. The result fields that we write out are:

* `sentence`: the sentence containing the Gleason score
* `start`: the first character of the text matched by the regex
* `end`: one past the last character matched by the regex
* `value`: the Gleason score value
* `value_first`: the first component of the Gleason score, if any
* `value_second`: the second component of the Gleason score, if any
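
Because `end` is one past the last matched character, `start` and `end` can be used directly as a Python slice to recover the matched text. The values below are taken from one of the rows in the sample results for this task.

```python
sentence = "# Gleason 8 Prostate Adenocarcinoma:"
start, end = 2, 12

# The slice covers the text matched by the regex, including the
# trailing space consumed by the optional whitespace after the score.
matched_text = sentence[start:end]
print(repr(matched_text))
```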

In the next cell we present a sample NLPQL program to invoke the custom task and extract Gleason scores. This code uses the `createDocumentSet` function to limit the input documents to those with a `report_type` field equal to `Pathology`. Gleason scores are determined by examining tissue under a microscope, so pathology reports are the expected source of these scores.


In [48]:
# Sample NLPQL to find Gleason scores from pathology reports
nlpql ='''
limit 100;

phenotype "Gleason Score Finder" version "1";
include ClarityCore version "1.0" called Clarity;

documentset Docs:
    Clarity.createDocumentSet({
        "report_types":["Pathology"]
    });

define final GleasonFinderFunction:
    Clarity.GleasonScoreTask({
        documentset: [Docs]
    });
'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/388/phenotype_intermediate",
    "job_id": "388",
    "luigi_task_monitoring": "http://18.220.133.76:8082/static/visualiser/index.html#search__search=job=388",
    "main_results_csv": "http://18.220.133.76:5000/job_results/388/phenotype",
    "phenotype_config": "http://18.220.133.76:5000/phenotype_id/388",
    "phenotype_id": "388",
    "pipeline_configs": [
        "http://18.220.133.76:5000/pipeline_id/584"
    ],
    "pipeline_ids": [
        584
    ],
    "results_viewer": "?job=388",
    "status_endpoint": "http://18.220.133.76:5000/status/388"
}
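
The submission response is plain JSON, so the job ID and result URLs can be extracted with the standard library. The sketch below parses a trimmed copy of the response above; `summarize_job` is a hypothetical helper written for this example and is not part of the `claritynlp` module used in this notebook.

```python
import json

# Trimmed copy of the job submission response shown above.
response_text = '''
{
    "job_id": "388",
    "main_results_csv": "http://18.220.133.76:5000/job_results/388/phenotype",
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/388/phenotype_intermediate",
    "status_endpoint": "http://18.220.133.76:5000/status/388",
    "pipeline_ids": [584]
}
'''

def summarize_job(text):
    """Hypothetical helper: keep the fields needed to track a job and fetch results."""
    response = json.loads(text)
    return {
        'job_id': int(response['job_id']),
        'status_url': response['status_endpoint'],
        'main_csv_url': response['main_results_csv'],
        'intermediate_csv_url': response['intermediate_results_csv'],
    }

job = summarize_job(response_text)
print(job['job_id'], job['main_csv_url'])
```

Once the job has finished, a URL such as `job['main_csv_url']` can be handed to `pd.read_csv` to load the final results.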


In [28]:
final_csv_df = pd.read_csv(main_csv)
final_csv_df.head()

Unnamed: 0,_id,batch,concept_code,end,inserted_date,job_id,nlpql_feature,owner,phenotype_final,pipeline_id,...,report_id,report_type,sentence,solr_id,source,start,subject,value,value_first,value_second
0,5b98117929833c01c7ef8f9d,25,-1,155,2018-09-11 19:03:21.780000,371,GleasonFinderFunction,claritynlp,True,557,...,13000032,Pathology,Onc hx:(as per recent d/c summary) - [**2531-1...,13000032,MIMIC,144,34876,6.0,3.0,3.0
1,5b98117929833c01c7ef8f9e,25,-1,260,2018-09-11 19:03:21.783000,371,GleasonFinderFunction,claritynlp,True,557,...,13000032,Pathology,Onc hx:(as per recent d/c summary) - [**2531-1...,13000032,MIMIC,238,34876,7.0,3.0,4.0
2,5b98117a29833c01c7ef8f9f,25,-1,215,2018-09-11 19:03:22.683000,371,GleasonFinderFunction,claritynlp,True,557,...,13000033,Pathology,Pt uses CPAP at night and has done so for a lo...,13000033,MIMIC,205,35581,8.0,,
3,5b98117a29833c01c7ef8fa0,25,-1,12,2018-09-11 19:03:22.685000,371,GleasonFinderFunction,claritynlp,True,557,...,13000033,Pathology,# Gleason 8 Prostate Adenocarcinoma:,13000033,MIMIC,2,35581,8.0,,
4,5b98117b29833c01caef8f9d,50,-1,158,2018-09-11 19:03:23.357000,371,GleasonFinderFunction,claritynlp,True,557,...,13000057,Pathology,Other ICU medications: Other medications: Past...,13000057,MIMIC,141,31650,8.0,4.0,4.0


In [49]:
# display selected result columns from a run at GA Tech

import pandas as pd

# load the results saved from the earlier run
gleason_df = pd.read_csv('assets/phenotype_gleason.csv')

# limit display cols to extracted score and components
df = gleason_df[['sentence', 'value', 'value_first', 'value_second']]

# make the display wider to see complete sentence text
pd.set_option('display.max_colwidth', 800)

# print results
df

# The results of this run appear in the next cell. The NaN entries mean
# "not a number", indicating that either the score, the components, 
# or both, were not found.

Unnamed: 0,sentence,value,value_first,value_second
0,"He had prostate biopsy done at [**Location (un) 7194**], [**State 1217**] on [**3304-2-8**] that showed prostate adenocarcinoma in two out of 12 cores in the biopsy that were Gleason 3",3.0,,
1,Other medications: metformin amlodipine hydrochlorothiazide multivitamin aspirin Past medical history: Family history: Social History: sleep apnea not on CPAP prostate cancer Gleason 3+3 in two out of 12 cores,6.0,3.0,3.0
2,h/o multiple prostate biopsies with only 1 c/w adenocarcinoma (Gleason 3+3),6.0,3.0,3.0
3,"Gleason score 6 (3+3), involving approximately 5% of the core tissue Colonoscopy [**2763**] w/ adenoma R cataract surgery HTN",6.0,3.0,3.0
4,"prior baseline CR 1.5, most recently 1.1-1.2 BPH vs Prostate cancer - h/o multiple prostate biopsies with only 1 c/w adenocarcinoma (Gleason 3+3)",6.0,3.0,3.0
5,"prior baseline CR 1.5, most recently 1.1-1.2 BPH vs Prostate cancer - h/o multiple prostate biopsies with only 1 c/w adenocarcinoma (Gleason 3+3)",6.0,3.0,3.0
6,"prior baseline CR 1.5, most recently 1.1-1.2 BPH vs Prostate cancer - h/o multiple prostate biopsies with only 1 c/w adenocarcinoma (Gleason 3+3)",6.0,3.0,3.0
7,"prior baseline CR 1.5, most recently 1.1-1.2 BPH vs Prostate cancer - h/o multiple prostate biopsies with only 1 c/w adenocarcinoma (Gleason 3+3) GERD Type",6.0,3.0,3.0
8,"prior baseline CR 1.5, most recently 1.1-1.2 BPH vs Prostate cancer - h/o multiple prostate biopsies with only 1 c/w adenocarcinoma (Gleason 3+3) GERD Type",6.0,3.0,3.0
9,The patient is a 62-year old male with a Gleason score 8 adenocarcinoma of the prostate involving the left and right lobes.,8.0,,
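
To separate scores reported with both components from those reported as a single number, the `NaN` entries can be tested with `isna()`. The sketch below uses a small hand-built frame that mimics three of the rows above rather than the real `phenotype_gleason.csv`; the shortened sentences are illustrative only.

```python
import numpy as np
import pandas as pd

# Hand-built sample mimicking the columns shown above (not the real results file).
sample_df = pd.DataFrame({
    'sentence': [
        '... were Gleason 3',
        '... (Gleason 3+3)',
        '... Gleason score 8 adenocarcinoma ...',
    ],
    'value': [3.0, 6.0, 8.0],
    'value_first': [np.nan, 3.0, np.nan],
    'value_second': [np.nan, 3.0, np.nan],
})

# True for rows where either component is missing.
missing_components = (
    sample_df['value_first'].isna() | sample_df['value_second'].isna()
)

with_components = sample_df[~missing_components]
score_only = sample_df[missing_components]

# Where components exist, they should sum to the reported score.
consistent = (
    with_components['value_first'] + with_components['value_second']
    == with_components['value']
).all()

print(len(with_components), len(score_only), consistent)
```

A check like `consistent` is a cheap sanity test on the extraction: a row whose components do not add up to `value` would be worth inspecting by hand.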


**Prostate Volume**

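Another function that is useful in this scenario is the size extraction function, `MeasurementFinder`. The NLPQL below uses it to look for measurements mentioned together with the term "prostate" in the notes of patients in an OHDSI cohort.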
```java
//phenotype name
phenotype "Prostate Volume v3" version "1";

//include Clarity main NLP libraries
include ClarityCore version "1.0" called Clarity;
include OHDSIHelpers version "1.0" called OHDSI;

cohort PSAPatients:OHDSI.getCohort(336);

termset Prostate:
  ["prostate"];

define final PSADimensions:
  Clarity.MeasurementFinder({
    cohort:PSAPatients,
    termset:[Prostate]
  });
```

### Preview: Working with Coded Data in ClarityNLP
One of the biggest challenges in phenotyping is integrating structured and unstructured data.  We will dedicate a future Cooking session to this topic, but let's wrap up today with a preview.

### 1.5 Finding Patients with Prostate Cancer Diagnosis and Abnormal Gleason Scores
Using the OHDSI Atlas stack, we have defined an OHDSI cohort of patients with a prostate cancer diagnosis, based on the following SNOMED concepts:

- Concept 200962 (SNOMED 93974005): Primary malignant neoplasm of prostate; Condition, Standard
- Concept 4129902 (SNOMED 126906006): Neoplasm of prostate; Condition, Standard
- Concept 4141960 (SNOMED 427492003): Hormone refractory prostate cancer; Condition, Standard

![Atlas_PCa.png](assets/Atlas_PCa.png)




To look at these patients specifically in ClarityNLP, we can reference the cohort and combine it with any other features.  For example:

```java
limit 100;

phenotype "Gleason Score and Prostate Ca" version "1";
include ClarityCore version "1.0" called Clarity;
include OHDSIHelpers version "1.0" called OHDSI;

cohort ProstateCaPatients:OHDSI.getCohort(336);

documentset Docs:
    Clarity.createDocumentSet({
        "report_types":["Pathology"]
    });

define GleasonScore:
    Clarity.GleasonScoreTask({
        cohort:ProstateCaPatients,
        documentset: [Docs]
    });

define final ElevatedGleason:
    where GleasonScore.value > 5;
```
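
Conceptually, this phenotype applies two filters to the Gleason results: the patient must belong to the cohort, and the extracted `value` must exceed the threshold. The plain-Python sketch below mirrors that logic on a few made-up records; the subject IDs, the cohort set, and the `elevated_gleason` helper are hypothetical, and ClarityNLP evaluates the real expression on the server.

```python
# Made-up Gleason results; 'subject' is the patient identifier.
gleason_results = [
    {'subject': 101, 'value': 7.0},
    {'subject': 102, 'value': 4.0},   # below the threshold
    {'subject': 103, 'value': 9.0},   # not in the cohort
    {'subject': 104, 'value': None},  # no score extracted
]

# Hypothetical cohort membership, standing in for OHDSI.getCohort(336).
cohort_subjects = {101, 102, 104}

def elevated_gleason(results, cohort, threshold=5):
    """Keep results for cohort patients whose score exceeds the threshold."""
    return [
        r for r in results
        if r['subject'] in cohort
        and r['value'] is not None
        and r['value'] > threshold
    ]

elevated = elevated_gleason(gleason_results, cohort_subjects)
print([r['subject'] for r in elevated])  # [101]
```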


In [54]:
# Sample NLPQL to find elevated Gleason scores for patients in the prostate cancer cohort
nlpql = '''
phenotype "Gleason Score and Prostate Ca" version "1";
include ClarityCore version "1.0" called Clarity;
include OHDSIHelpers version "1.0" called OHDSI;

cohort ProstateCaPatients:OHDSI.getCohort(336);

documentset Docs:
    Clarity.createDocumentSet({
        "report_types":["Pathology"]
    });

define GleasonScore:
    Clarity.GleasonScoreTask({
        cohort:ProstateCaPatients,
        documentset: [Docs]
    });

define final ElevatedGleason:
    where GleasonScore.value > 5;

'''
run_result, main_csv, intermediate_csv, luigi = claritynlp.run_nlpql(nlpql)

Job Successfully Submitted
{
    "intermediate_results_csv": "http://18.220.133.76:5000/job_results/399/phenotype_intermediate",
    "job_id": "399",
    "luigi_task_monitoring": "http://18.220.133.76:8082/static/visualiser/index.html#search__search=job=399",
    "main_results_csv": "http://18.220.133.76:5000/job_results/399/phenotype",
    "phenotype_config": "http://18.220.133.76:5000/phenotype_id/399",
    "phenotype_id": "399",
    "pipeline_configs": [
        "http://18.220.133.76:5000/pipeline_id/595"
    ],
    "pipeline_ids": [
        595
    ],
    "results_viewer": "?job=399",
    "status_endpoint": "http://18.220.133.76:5000/status/399"
}


In [53]:
final_csv_df = pd.read_csv(main_csv)
final_csv_df.head()

Unnamed: 0,_id,batch,concept_code,context_type,end,inserted_date,job_date,job_id,nlpql_feature,orig_id,...,report_id,report_type,sentence,solr_id,source,start,subject,value,value_first,value_second
0,5b98869d6d7df90fcfef75f6,75,-1,subject,74,2018-09-12 03:22:05.504000,2018-09-12 03:23:09.790000,397,ElevatedGleason,5b98865d6d7df90ee7ef75fa,...,13000086,Pathology,"- Radiation proctitis [**7-24**] - Prostate cancer, T3b N0 M0, Gleason 4+3 stage III - Coronary artery disease status post multiple percutaneous interventions.",13000086,MIMIC,63,38918,7.0,4.0,3.0
1,5b98869d6d7df90fcfef75f7,75,-1,subject,102,2018-09-12 03:22:06.243000,2018-09-12 03:23:09.790000,397,ElevatedGleason,5b98865e6d7df90ee7ef75fb,...,13000087,Pathology,45% nuclear study [**2514**] - Hypertension - Hyperlipidemia - GERD - Prostate cancer Gleason score 6 - Hypothyroidism - Bilateral ankle edema - Kidney stones - Right retroperitoneal abscess - Segmental pulmonary embolism within the right lower lobe .,13000087,MIMIC,86,30941,6.0,,
2,5b98869d6d7df90fcfef75f8,75,-1,subject,102,2018-09-12 03:22:07.341000,2018-09-12 03:23:09.790000,397,ElevatedGleason,5b98865f6d7df90ee7ef75fc,...,13000088,Pathology,45% nuclear study [**2514**] - Hypertension - Hyperlipidemia - GERD - Prostate cancer Gleason score 6 - Hypothyroidism - Bilateral ankle edema - Kidney stones - Right retroperitoneal abscess - Segmental pulmonary embolism within the right lower lobe .,13000088,MIMIC,86,30941,6.0,,
3,5b98869d6d7df90fcfef75f9,75,-1,subject,90,2018-09-12 03:22:07.878000,2018-09-12 03:23:09.790000,397,ElevatedGleason,5b98865f6d7df90ee7ef75fd,...,13000089,Pathology,"cancer s/p robot assisted lap prostatectomy [**Month (only) 435**] [**3041**] -Gleason 4+5 -deferred adjuvant XRT/hormones,",13000089,MIMIC,79,34885,9.0,4.0,5.0
4,5b98869d6d7df90fcfef75fa,50,-1,subject,158,2018-09-12 03:22:09.469000,2018-09-12 03:23:09.790000,397,ElevatedGleason,5b9886616d7df90ef0ef75f6,...,13000057,Pathology,Other ICU medications: Other medications: Past medical history: Family history: Social History: Prostate Cancer diagnosed [**3-/2699**] with Gleason Score 4+4=8 and 4+5= Bone,13000057,MIMIC,141,31650,8.0,4.0,4.0

