## Cooking with ClarityNLP - Session #3

The goal of this series is to introduce you to writing basic queries using NLPQL. Today we will also be covering an introduction to iterative data exploration and feature engineering for a given cohort using unsupervised learning techniques. For details on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions during this presentation, as well as via Slack or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues).

In [1]:
# if running locally, make sure docker swarm is up and running first
# https://claritynlp.readthedocs.io/en/latest/setup/local-docker.html#running-locally

# need to be on VPN to hit AWS instance

# !pip3 install --user `package_name` #if you need to install a dependency rather than import it
%matplotlib inline
import pandas as pd
pd.set_option('display.width',100000)
pd.set_option('max_colwidth',4000)
import requests
from collections import OrderedDict
import matplotlib.pyplot as plt
import claritynlp_notebook_helpers as claritynlp

ClarityNLP notebook helpers loaded successfully!


### Tentative outline
- Review ingestion process
- Briefly discuss a phenotype and/or a set membership problem for which we hope to use unstructured text, both to mine the data and to potentially expand our set of inclusion criteria and/or identify clusters within our identified set.
- Construct a foundational NLPQL query to extract a first set of notes.
- Show how to view commonly occurring n-grams on these notes.
- Show how to call topic modeling algorithm on these notes; walk through return options and how to use them.
- Show how topic vector terms can be fed back into the pipeline to expand the termset.
- Show how topic assignment can be used to cluster patients.

### 1. Review of data ingestion

If needed, re-use content from last session re: ingestion of data; running NLPQL; etc.

### 2. Select a few concepts that we would like to explore and use for phenotype development

In this example, we explore how we can leverage unstructured clinical notes and ClarityNLP to explore a clinically relevant set membership problem for the purpose of feature engineering and/or patient inclusion/exclusion from a cohort of interest. This can be thought of as a step away from rule-based/heuristic patient identification, and toward computational phenotype development. The techniques reviewed here can be particularly useful in situations where the set/term in question cannot be easily derived from structured data alone, and/or when the set/term in question is difficult to quantify or is unlikely to explicitly appear in structured data. Examples include but are by no means limited to: {`poor social support`; `homelessness`; `substance abuse`; `socioeconomic status`; `financial distress`; `domestic abuse`; `insomnia`; `poor medication compliance`; etc.}.

These topics represent important **socioeconomic** and **behavioral determinants of health** and may also affect eligibility for participation in clinical trials, but are difficult for clinical researchers to detect without manual chart review and/or other types of resource-intensive interventions.

In the sections that follow, we'll walk through ways in which NLP techniques can help address this problem. It is important to note that these steps represent an **iterative, human-in-the-loop process**, rather than a one-step solution. 

We can begin by aggregating our existing clinical and contextual knowledge of the topic/concept, and using this knowledge to develop a baseline NLPQL query. It is not necessarily important at this stage that our list of terms and/or constraints be exhaustive, as our initial goal is to explore the possible universe of patients. We can return to this step in an iterative fashion, and impose more specific criteria as we go.

For the purpose of today's discussion, we'll consider three different set membership problems: (1) homelessness; (2) insomnia/trouble sleeping; (3) poor medication compliance. For each of these three set membership questions, we're hoping to understand whether a given patient might currently be experiencing the condition, or has experienced the condition in the past (to the extent that their historical data is present in the system).


We can start by brainstorming a list of terms that we might expect to appear in instances where patients have experienced or are currently experiencing each of our target conditions. Note that we're intentionally trying to cast a wide net at this early stage; we can also use termset expansion and/or lemmatization to handle synonyms and/or similar words represented as different parts of speech. 

#### 2.1: Homelessness 

- homeless
- shelter
- exposure
- frostbite
- dehydration
- lack of access to care
- eviction
- evicted

#### 2.2: Insomnia/Trouble Sleeping 

- insomnia
- poor sleep
- snoring
- sleep apnea
- sleeping pills
- tired
- drowsy 
- lethargic
- restless leg syndrome
- narcolepsy
- side effects
- melatonin
- newborn
- daytime fatigue
- low energy
- sleepy
- depression


#### 2.3: Poor Medication Compliance
- poor compliance
- not taking medication
- overdose
- forgot to take medicine
- poor memory
- side effects 
- elderly
- lives alone
- readmitted
- rehospitalization 
- non-adherence
- can't afford medication
- lack of access to care
- substance abuse
- limited mobility
- cognitive impairment
- poor communication
- non-English speaking
- too many medications
- confusion about dose instructions
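
Because the three lists above were brainstormed independently, it can help to tidy them and check where they overlap before building termsets. The following is a minimal sketch using only the Python standard library; the helper names `normalize` and `shared_terms` are hypothetical, and the term lists are abbreviated copies of the lists above. A term that appears under more than one condition (for example `side effects`) is less likely to discriminate between conditions, which is worth keeping in mind when reviewing results later.

```python
from collections import defaultdict

# Abbreviated copies of the brainstormed lists above (illustration only).
candidate_terms = {
    "homelessness": ["homeless", "shelter", "eviction", "lack of access to care"],
    "insomnia": ["insomnia", "poor sleep", "drowsy ", "side effects"],
    "poor_compliance": ["non-adherence", "side effects ", "lack of access to care"],
}

def normalize(terms):
    """Lowercase, collapse whitespace, and drop duplicates while preserving order."""
    seen = {}
    for term in terms:
        cleaned = " ".join(term.lower().split())
        if cleaned:
            seen.setdefault(cleaned, None)
    return list(seen)

def shared_terms(term_lists):
    """Map each term listed under more than one condition to those conditions."""
    owners = defaultdict(list)
    for condition, terms in term_lists.items():
        for term in normalize(terms):
            owners[term].append(condition)
    return {term: conds for term, conds in owners.items() if len(conds) > 1}

overlap = shared_terms(candidate_terms)
for term, conditions in overlap.items():
    print(term, "->", conditions)
```

This only cleans up strings; it does not perform lemmatization or synonym expansion, which are left to ClarityNLP.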

### 3. Develop and run foundational NLPQL queries to capture instances of patients experiencing each condition

We can use the list we developed above to develop a foundational NLPQL query. Running this query will tell us which patients meet our initial criteria, and we can then perform topic modeling on the notes associated with these patients to help discover additional relevant terms that we can use to either expand or refine our inclusion criteria.

To do this, we can use the terms from our list to form a ClarityNLP termset. In this way, we're asking ClarityNLP to return the set of notes where at least one of these terms appears (the relationship between the terms is a disjunctive Boolean OR). 
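
As a rough illustration of the idea, the sketch below renders a Python list as an NLPQL `termset` declaration and mimics the OR relationship with a plain substring check. Both helpers (`build_termset`, `matches_any`) are hypothetical and are not part of ClarityNLP or the notebook helpers. The actual matching is performed by ClarityNLP and is not reproduced here; the substring check only shows what "at least one term appears" means.

```python
def build_termset(name, terms):
    """Render a list of terms as an NLPQL termset declaration (hypothetical helper)."""
    if not terms:
        raise ValueError("a termset needs at least one term")
    # Strip embedded double quotes so the generated NLPQL stays well-formed.
    quoted = ",\n".join('"{}"'.format(term.replace('"', "")) for term in terms)
    return "termset {}: [\n{}];".format(name, quoted)

def matches_any(note_text, terms):
    """Illustrate OR semantics: True if at least one term occurs in the note."""
    text = note_text.lower()
    return any(term.lower() in text for term in terms)

homeless_terms = ["homeless", "shelter", "evicted", "eviction"]
print(build_termset("PossibleHomelessness", homeless_terms))
print(matches_any("Patient was evicted last month.", homeless_terms))
print(matches_any("No acute distress.", homeless_terms))
```

Generating the declaration from a list makes it easier to revise the termset on later iterations without hand-editing the query string.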

#### 3.1: Homelessness NLPQL

In [None]:
nlpql_homelessness_initial_query = '''

// Phenotype library name
phenotype "Cooking with Clarity Homelessness Initial Set" version "1";

/* Phenotype library description */
description "Sample NLPQL to find notes that may be associated with patients experiencing homelessness; for use in topic modeling.";

// # Structured Data Model #
datamodel OMOP version "5.3";

// # Referenced libraries #
// The ClarityCore library provides common functions for simplifying NLP pipeline creation
include ClarityCore version "1.0" called Clarity;
include OHDSIHelpers version "1.0" called OHDSI;

// ## Code Systems ##
codesystem OMOP: "http://omop.org"; // OMOP vocabulary https://github.com/OHDSI/Vocabulary-v5.0;


// #Manual Term sets#
// simple example-- termset "Vegetables":["broccoli","carrots","cauliflower"]
// can add expansion of structured concepts from terminologies as well with OMOPHelpers

documentset PotentialHomelessnessNotes:
    Clarity.createReportTagList(["Physician","Nurse","Note","Discharge Summary"]);

termset PossibleHomelessness: [
"homeless",
"shelter",
"exposure",
"frostbite",
"dehydration",
"malnutrition",
"lack of access to care",
"no medical records",
"evicted",
"eviction"];

define final isPotentiallyHomeless:
  Clarity.ProviderAssertion({
    termset: [PossibleHomelessness],
    documentset: [PotentialHomelessnessNotes]
  });  
'''

run_result_homeless, main_csv_homeless, intermediate_csv_homeless, luigi_homeless = claritynlp.run_nlpql(nlpql_homelessness_initial_query)
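
Once a query has finished, a quick summary of the results helps judge whether the initial termset is too broad or too narrow. The sketch below is one way to count distinct patients and matched terms from a results CSV using only the standard library. The column names are assumptions: `term` also appears in the scratch cell at the end of this notebook, while `subject` is a guess at the patient identifier column and should be checked against the actual output. The rows are made up for illustration and are not real ClarityNLP output.

```python
import csv
import io
from collections import Counter

def summarize_matches(csv_file, patient_col="subject", term_col="term"):
    """Return (distinct patient count, Counter of matched terms) from a results CSV."""
    patients = set()
    term_counts = Counter()
    for row in csv.DictReader(csv_file):
        patients.add(row[patient_col])
        term_counts[row[term_col].lower()] += 1
    return len(patients), term_counts

# Made-up rows standing in for a results file (illustration only).
toy_results = io.StringIO(
    "subject,term\n"
    "101,homeless\n"
    "101,shelter\n"
    "202,exposure\n"
    "303,homeless\n"
)
n_patients, term_counts = summarize_matches(toy_results)
print(n_patients, term_counts.most_common(2))
```

If a single broad term such as `exposure` accounts for most matches, that is a hint to refine or drop it on the next iteration.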

#### 3.2: Insomnia/Trouble Sleeping NLPQL

In [None]:
nlpql_insomnia_initial_query = '''

// Phenotype library name
phenotype "Cooking with Clarity Insomnia Initial Set" version "1";

/* Phenotype library description */
description "Sample NLPQL to find notes that may be associated with patients experiencing insomnia; for use in topic modeling.";

// # Structured Data Model #
datamodel OMOP version "5.3";

// # Referenced libraries #
// The ClarityCore library provides common functions for simplifying NLP pipeline creation
include ClarityCore version "1.0" called Clarity;
include OHDSIHelpers version "1.0" called OHDSI;

// ## Code Systems ##
codesystem OMOP: "http://omop.org"; // OMOP vocabulary https://github.com/OHDSI/Vocabulary-v5.0;


// #Manual Term sets#
// simple example-- termset "Vegetables":["broccoli","carrots","cauliflower"]
// can add expansion of structured concepts from terminologies as well with OMOPHelpers

documentset PotentialInsomniaNotes:
    Clarity.createReportTagList(["Physician","Nurse","Note","Discharge Summary"]);
    
termset PossibleInsomnia: [
"insomnia",
"poor sleep",
"sleep apnea",
"tired",
"lethargic",
"restless leg syndrome",
"side effects",
"pain",
"newborn",
"melatonin",
"infant",
"daytime fatigue",
"low energy",
"depression"];

define final isPotentiallyInsomnia:
  Clarity.ProviderAssertion({
    termset: [PossibleInsomnia],
    documentset: [PotentialInsomniaNotes]
});
  
'''

run_result_insomnia, main_csv_insomnia, intermediate_csv_insomnia, luigi_insomnia = claritynlp.run_nlpql(nlpql_insomnia_initial_query)

#### 3.3: Poor Medication Compliance NLPQL

In [None]:
nlpql_poor_compliance_initial_query = '''

// Phenotype library name
phenotype "Cooking with Clarity Poor Medication Compliance Initial Set" version "1";

/* Phenotype library description */
description "Sample NLPQL to find notes that may be associated with patients experiencing/exhibiting poor medication compliance; for use in topic modeling.";

// # Structured Data Model #
datamodel OMOP version "5.3";

// # Referenced libraries #
// The ClarityCore library provides common functions for simplifying NLP pipeline creation
include ClarityCore version "1.0" called Clarity;
include OHDSIHelpers version "1.0" called OHDSI;

// ## Code Systems ##
codesystem OMOP: "http://omop.org"; // OMOP vocabulary https://github.com/OHDSI/Vocabulary-v5.0;


// #Manual Term sets#
// simple example-- termset "Vegetables":["broccoli","carrots","cauliflower"]
// can add expansion of structured concepts from terminologies as well with OMOPHelpers

documentset PotentialPoorComplianceNotes:
    Clarity.createReportTagList(["Physician","Nurse","Note","Discharge Summary"]);
    
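termset PossiblePoorCompliance: [
"poor compliance",
"not taking medication",
"overdose",
"forgot to take medicine",
"poor memory",
"side effects",
"elderly",
"lives alone",
"readmitted",
"rehospitalization",
"non-adherence",
"can't afford medication",
"lack of access to care",
"substance abuse",
"limited mobility",
"cognitive impairment",
"poor communication",
"non-English speaking",
"too many medications",
"confusion about dose instructions"];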

define final isPotentiallyPoorCompliance:
  Clarity.ProviderAssertion({
    termset: [PossiblePoorCompliance],
    documentset: [PotentialPoorComplianceNotes]
  });

'''

run_result_poor_compliance, main_csv_poor_compliance, intermediate_csv_poor_compliance, luigi_poor_compliance = claritynlp.run_nlpql(nlpql_poor_compliance_initial_query)

### 4. View commonly occurring n-grams on the subset of notes returned by our queries

An **n-gram** is a contiguous sequence of *n* items (in this case, words, separated by whitespace). We can use **n-grams** to help us understand which words commonly appear together within the subset of notes that make up the sub-corpus associated with each of our conditions. 
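
To make the definition concrete, here is a minimal sketch that builds word-level n-grams from whitespace-separated text and counts them across a few made-up notes. The function names `ngrams` and `ngram_counts` are hypothetical. This deliberately ignores the filtering and lemmatization options that `Clarity.ngram` exposes and only shows the sliding-window idea.

```python
from collections import Counter

def ngrams(text, n=3):
    """Return the contiguous n-word sequences in a whitespace-tokenized string."""
    tokens = text.lower().split()
    # A text shorter than n tokens yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(texts, n=3):
    """Count how often each n-gram occurs across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, n))
    return counts

# Made-up snippets for illustration only.
notes = [
    "patient is currently homeless",
    "patient is currently staying in a shelter",
]
counts = ngram_counts(notes, n=3)
print(counts.most_common(3))
```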

We can define a helper function to let us visualize the distribution of the most commonly co-occurring terms:



In [4]:
from collections import Counter

def view_n_gram_terms(df, keyword, min_appearances=2, n=3):
    """Plot how often each term co-occurs with `keyword` across the returned n-grams."""
    # Split each n-gram on whitespace and drop the keyword itself
    terms = [item for text in df['text'] for item in text.split() if item != keyword]

    # Count every term in a single pass, keep the frequent ones, sort ascending by count
    counts = Counter(terms)
    frequent = sorted(((term, count) for term, count in counts.items() if count >= min_appearances),
                      key=lambda pair: pair[1])

    x = [term for term, _ in frequent]
    y = [count for _, count in frequent]

    plt.figure(figsize=(20,10))
    plt.barh(x,y)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.title("Frequency of co-occurring terms drawn from set of {}-grams associated with keyword: {}".format(n, keyword), fontsize=15)
    plt.show()

#### 4.1 n-grams on notes of patients potentially experiencing homelessness

In [None]:
homelessness_ngrams_query = '''

 phenotype "potential homelessness ngram" version "1";

 include ClarityCore version "1.0" called Clarity;

 termset HomelessTerms:
    ["homeless"];

  define homelessnessNgram:
    Clarity.ngram({
      termset:[HomelessTerms], 
      "n": "3",
      "filter_nums": true,
      "filter_stops": true,
      "filter_punct": true,
      "min_freq": 2,
      "lemmas": true,
      "limit_to_termset": true
      });
      
'''

ngrams_result_homelessness, ngrams_main_csv_homeless, ngrams_intermediate_csv_homeless, ngrams_luigi_homeless = claritynlp.run_nlpql(homelessness_ngrams_query)

In [None]:
hl_ngrams_df = pd.read_csv(ngrams_result_homelessness['intermediate_results_csv'])
view_n_gram_terms(hl_ngrams_df, "homeless")  # keyword matches the term used in the HomelessTerms termset

In [None]:
# TODO: if/when queries get chained on back end, show that this second query runs on the set produced by the first query

#### 4.2 n-grams on notes of patients potentially experiencing insomnia

In [None]:
insomnia_ngrams_query = '''

 phenotype "potential insomnia ngram" version "1";

 include ClarityCore version "1.0" called Clarity;

 termset InsomniaTerms:
    ["insomnia"];

  define insomniaNgram:
    Clarity.ngram({
      termset:[InsomniaTerms], 
      "n": "3",
      "filter_nums": true,
      "filter_stops": true,
      "filter_punct": true,
      "min_freq": 2,
      "lemmas": true,
      "limit_to_termset": true
      });
      
'''

ngrams_result_insomnia, ngrams_main_csv_insomnia, ngrams_intermediate_csv_insomnia, ngrams_luigi_insomnia = claritynlp.run_nlpql(insomnia_ngrams_query)

In [None]:
insomnia_ngrams_df = pd.read_csv(ngrams_result_insomnia['intermediate_results_csv'])
view_n_gram_terms(insomnia_ngrams_df, "insomnia")  

#### 4.3 n-grams on notes of patients potentially exhibiting poor medication compliance

In [None]:
poor_compliance_ngrams_query = '''

 phenotype "potential poor medication compliance ngram" version "1";

 include ClarityCore version "1.0" called Clarity;

 termset PoorComplianceTerms:
    ["compliance"];

  define poorComplianceNgram:
    Clarity.ngram({
      termset:[PoorComplianceTerms], 
      "n": "3",
      "filter_nums": true,
      "filter_stops": true,
      "filter_punct": true,
      "min_freq": 2,
      "lemmas": true,
      "limit_to_termset": true
      });
      
'''

ngrams_result_poor_compliance, ngrams_main_csv_poor_compliance, ngrams_intermediate_csv_poor_compliance, ngrams_luigi_poor_compliance = claritynlp.run_nlpql(poor_compliance_ngrams_query)

In [None]:
poor_comp_ngrams_df = pd.read_csv(ngrams_result_poor_compliance['intermediate_results_csv'])
view_n_gram_terms(poor_comp_ngrams_df, "compliance")  

In [None]:
# TODO: Integrate topic modeling explanation and examples here; also do top docs by topic; clustering by topic labels 

#### Scratch content; will either work this in or remove.
Used to explore the usefulness of the original set of terms by examining the distribution of terms among the 'truth values' derived from the extracted text.

In [None]:
#inter_csv_df = pd.read_csv(intermediate_csv)
#counts = inter_csv_df.groupby(['term']).count()['_id'].reset_index().sort_values(by=['_id'], ascending=False)

In [None]:
#counts.plot(kind='bar', x='term', y='_id')