## Cooking with ClarityNLP - Session #7

The goal of this [#cookingWithClarityNLP](https://twitter.com/hashtag/cookingWithClarityNLP?src=hash&lang=en) session is to highlight how ClarityNLP can be used to detect, extract, and classify different sub-categories of [adverse events](https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/cfrsearch.cfm?fr=312.32) from patient-level documents, and/or drug labels. 

For details on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions during this presentation, as well as [via Slack](https://join.slack.com/t/claritynlp/shared_invite/enQtNTE5NTUzNzk4MTk5LTFmNWY1NWVmZTA4Yjc5MDUwNTRhZTBmNTA0MWM0ZDNmYjdlNTAzYmViYzAzMTkwZDkzODA2YTJhYzQ1ZTliZTQ), Twitter (use the hashtag [#cookingWithClarityNLP](https://twitter.com/hashtag/cookingWithClarityNLP?src=hash&lang=en)), or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues). **We also encourage suggestions for topics to cover in future cooking sessions!**

If you're running ClarityNLP locally via Docker, make sure that the Docker swarm is up and running before following along:
- See https://claritynlp.readthedocs.io/en/latest/setup/local-docker.html#running-locally.

In [36]:
# Import dependencies
%matplotlib inline
import pandas as pd
import numpy as np
import os 
from collections import OrderedDict
pd.set_option('display.width',100000)
pd.set_option('max_colwidth',4000)
import matplotlib.pyplot as plt
import xml.etree.ElementTree
import spacy
from spacy.matcher import Matcher
import textacy
import re
from sklearn.metrics import accuracy_score

#FDA_DIR = "../../../../repos/FDA_AE_challenge_2019"
import claritynlp_notebook_helpers as claritynlp

#TODO: (loose outline)
1. Overview of adverse events (definitions of sub-categories; examples; references); discussion of the subset of AEs we decided to focus on:
    - Pre-existing condition or risk factor
    - Indication
    - Negation
    - Pregnancy
    - Class effect
    - Overdose/withdrawal

2. Overview (high-level) of drug label organization/contents/what it means when an adverse event appears on a drug label; the path toward a reported event appearing on a label
3. Brief discussion of pre-processing steps taken (parse XML; find/replace MedDRA terms w/label; ngrams)
4. Overview of rule-construction process => frequency analysis; dependency parsing; naive baseline + enhancements; how we set up the structure of the rules engine to be as efficient as possible (e.g., maximize downselection potential, etc.) 
5. Describe how our rules engine can be run via NLP_as_a_Service, and/or as a custom task in ClarityNLP
6. Demonstration (screenshots?) of using chart review to manually validate.

## 1. Overview and Background
### 1.1 The FDA Adverse Drug Event Evaluation Challenge

Our team is currently working on a submission to the [FDA Adverse Drug Event Evaluation Challenge](https://sites.mitre.org/adeeval/). This challenge requires participating teams to ingest a set of drug labels (in XML format), parse these labels to detect and classify adverse events, and output a set of labels indicating the location, class, and associated [MedDRA](https://www.meddra.org/) information for each positive occurrence of an adverse event.

Today, we'll break down some of the components of our drug label evaluation strategy to show how you can use components of the ClarityNLP pipeline (specifically, a custom task, and our chart review GUI-based tool) to detect and classify (a subset of) adverse event mentions when they appear in drug labels and/or patient notes. 

### 1.2 The Challenge Training Dataset
The FDA challenge training dataset contains text from 100 drug labels, stored in XML format. Each drug label is mapped to a series of ground-truth labels; for the sake of clarity, we will continue to distinguish between **drug** labels and **ground-truth** labels. Each ground-truth label corresponds to an adverse event associated with the drug in question (though there are several possible ways that this relationship may manifest itself), and contains the following fields:

<img src="assets/truth_label_screenshot.png" style="height:50px">
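
To make the layout of these fields concrete, the sketch below parses one hand-written `Mention` element with Python's standard `xml.etree.ElementTree`. The `Mentions`, `Mention`, and `Normalization` element names and the attribute names mirror what the parsing code in this notebook reads, and the values are copied from one row of the parsed output. The root tag (`Label`) and its `drug` attribute are placeholders we invented, so treat the XML as an assumption about the file layout rather than a verbatim excerpt.

```python
import xml.etree.ElementTree as ET

# Hand-written example. The root tag "Label" and its "drug" attribute are
# placeholders (assumptions); only Mentions/Mention/Normalization are taken
# from the parsing code in this notebook.
SAMPLE_XML = """
<Label drug="xeljanz">
  <Mentions>
    <Mention id="M190" section="S3" type="NonOSE_AE" reason="other" start="4705" len="9">
      <Normalization meddra_pt="Lymphoma" meddra_pt_id="10025310"
                     meddra_llt="Lymphoma" meddra_llt_id="10025310"/>
    </Mention>
  </Mentions>
</Label>
"""

def mention_rows(xml_text):
    """Return one dict per Mention, merged with its Normalization attributes."""
    root = ET.fromstring(xml_text)
    rows = []
    for mention in root.iter("Mention"):
        row = dict(mention.attrib)
        for norm in mention.findall("Normalization"):
            row.update(norm.attrib)
        rows.append(row)
    return rows

rows = mention_rows(SAMPLE_XML)
print(rows[0])
```

Each resulting dictionary carries the same keys as the columns of the ground-truth dataframe (`id`, `section`, `type`, `reason`, `start`, `len`, plus the four MedDRA fields).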


### 1.3 Adverse Event Categories
There are several reasons that the FDA may consider an adverse event to be (or not be) of interest to the Office of Surveillance and Epidemiology (OSE). The unique set of reasons, along with their definitions and counts within the FDA training dataset, appear in the table that is generated below:

In [56]:
def generate_labels_df():
    
    '''

    This function lets us read in the 100 drug label XML files, and parse the XML so that we get each ground 
    truth label associated with a given drug as a row in our dataframe. We can use this dataframe to get counts 
    by label.

    '''

    if not os.path.exists("./data/labels_and_info.csv"):

        labels_df = pd.DataFrame(columns = ["file", "len", "id", "reason", "section", "start", "type", "meddra_llt", "meddra_llt_id", "meddra_pt", "meddra_pt_id"])
        counter = 0

        for f in os.listdir('./data/ose_xml_training_20181101/'):

            drug_name = f.split(".xml")[0]
            root = xml.etree.ElementTree.parse("./data/ose_xml_training_20181101/{}".format(f)).getroot()

            for child in root:

                if child.tag == "Mentions":
                    for subchild in child:
                        if subchild.tag == "Mention": 
                            new_row = {}

                            new_row["file"] = drug_name.lower() 
                            new_row = {**new_row, **subchild.attrib}

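                            for subsubchild in subchild:

                                if subsubchild.tag == "Normalization":
                                    new_row = {**new_row, **subsubchild.attrib}

                            # Record the row only for Mention elements; outside this block,
                            # new_row could be undefined or left over from a previous Mention
                            temp_df = pd.DataFrame.from_dict(new_row, orient='index').T
                            labels_df.loc[counter, :] = temp_df.loc[0, :]
                            counter += 1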

    
        reason_info = pd.read_csv("./data/ae_reason_info.csv", header=0)
        outdf = pd.merge(labels_df, reason_info, left_on=["type", "reason"], right_on=["type", "reason"])
    
    else: 
        outdf = pd.read_csv("./data/labels_and_info.csv", header=0, index_col=0)
    
    return outdf
        
    #return outdf.loc[:, ["file", "len", "reason", "type", "meddra_llt", "meddra_pt", "meddra_pt_id", "description"]]


labels_and_info = generate_labels_df()
labels_and_info.to_csv("./data/labels_and_info.csv")
print("Label information dataframe generated and saved as csv.")


Label information dataframe generated and saved as csv.


In [38]:
labels_and_info.tail()

Unnamed: 0,file,len,id,reason,section,start,type,meddra_llt,meddra_llt_id,meddra_pt,meddra_pt_id,description
11515,xeljanz,9,M190,other,S3,4705,NonOSE_AE,Lymphoma,10025310,Lymphoma,10025310,A reason for disinterest other than the specific reasons listed.
11516,benlysta,25,M129,other,S2,10773,NonOSE_AE,Cancer of skin (excl melanoma),10007116,Skin cancer,10040808,A reason for disinterest other than the specific reasons listed.
11517,arcalyst,10,M26,other,S1,4873,NonOSE_AE,Infection,10021789,Infection,10021789,A reason for disinterest other than the specific reasons listed.
11518,arcalyst,12,M88,other,S2,2611,NonOSE_AE,Neoplasm malignant,10028997,Neoplasm malignant,10028997,A reason for disinterest other than the specific reasons listed.
11519,effexor_xr,7,M343,other,S3,4312,NonOSE_AE,Suicide,10042462,Completed suicide,10010144,A reason for disinterest other than the specific reasons listed.


In [57]:
def generate_type_reason_counts(ldf):
    
    # Count rows per (type, reason, description); selecting columns with a bare tuple after groupby is no longer supported by pandas
    reason_counts = ldf.groupby(['type', 'reason', 'description']).size().reset_index(name='counts').sort_values('counts', ascending=False)
    
    return reason_counts

r_counts = generate_type_reason_counts(labels_and_info)
r_counts

Unnamed: 0,type,reason,description,counts
4,NonOSE_AE,AE_only_as_instruction,"AES mentioned in instructions are often mentioned in a hypothetical context, with instructions for what to do if they develop. These AES are not of interest.",3384
7,NonOSE_AE,general_term,"General terms or non-specific text such as broad categories (e.g., MedDRA system organ class) used to introdue AEs or text describe an outcome (e.g., death) rather than an AE. These are not of interest.",2122
14,Not_AE_Candidate,preexisting_condition_or_risk_factor,"Mentions that describe a condition that developed prior to applying the medication of interest, or condition that increases the likelihood of developing a disease or injury. These mentions are not AEs.",1882
12,Not_AE_Candidate,indication,A clinical symptom or circumstance for which the use of the drug of interest would be appropriate. These mentions are not AEs.,1434
8,NonOSE_AE,manifestation_or_complication,"Text describing signs, symptoms, or changes in lab resuts related to the manifestations of an AE and the sequelae of an AE are not of interest.",1144
2,NonOSE_AE,AE_from_drug_interaction,AEs that result from drug-drug interaction or co-administration are not of interest.,328
5,NonOSE_AE,AE_rate_lteq_placebo,Aes with incidence rate equal to or lower than placebo are not of interest.,306
9,NonOSE_AE,negation,AE whose presence or occurrence is negated or denied. These AEs are not of interst.,245
0,NonOSE_AE,AE_animal,AEs observed in animal data are not of interest.,241
6,NonOSE_AE,OD_or_withdrawal,"AE associated with discontinuing a medication or taking more than the prescribed amount. Drug overdoes and withdrawal do not generally occur when a drug is used as indicated. Additionally, in the context of pharmacovigilance, identifying AEs associated with the drug when used as indicated is the highest priority. These AEs are not of interest.",206


### 1.4 Our Selected Subset of Adverse Event Categories
Looking at the table above, it's clear that the majority of labeled mentions fall into a handful of categories: (1) adverse events that are mentioned in the instructions in a hypothetical context; (2) general terms, which are too broad to be useful; (3) pre-existing conditions or risk factors, which developed before the drug was used; (4) indications (i.e., clinical symptoms or circumstances for which using the drug in question would be appropriate); and (5) manifestations or complications, which might occur as the result of an AE, but are not adverse events per se.

For the purpose of today's discussion, we've decided to focus on detecting mentions of three different buckets of adverse events; note that some of these buckets contain multiple "reasons" as defined in the FDA ground-truth labels.

1. Indication; Contraindication; Risk factor
2. Withdrawal; Pregnancy*
3. Negation

\* Note: in the FDA training dataset, **Pregnancy** is a MedDRA term which appears in the **meddra_llt** field, and co-occurs with multiple ground-truth reasons. The reason that is of interest here is **contraindication** (e.g., drugs that are contraindicated for women who are pregnant, nursing, or may become pregnant in the near term). For example:

<img src="assets/pregnancy_contraindication_ex.png" style="height:70px">

## 2. Implementation

### 2.1 Generation of Sentence-Level Labels
In order to build a rules-based model for detecting and classifying sentences with respect to ground-truth labels, we must first convert the provided XML labels into a sentence-level feature matrix that can be used to assess model performance on sentences that do and do not contain each of our ground-truth labels.

Our first step in this direction was to parse the original XML drug label documents, find the adverse events within the text (using the section-level offsets provided as part of the ground-truth labels), and replace each adverse event with the concatenation of the "type"||"reason" fields. This allows us to identify patterns at a more generic level, since we'll be working with the finite set of {"type"||"reason"} terms rather than the more specific set of underlying adverse events, which are linked to MedDRA terms within the ground-truth labels.
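
The replacement step described above is not shown in this notebook, so here is a minimal sketch of how it could work. It assumes each mention carries a single integer `start` offset and `len` into the section text, which may not hold for every ground-truth label. The function name `replace_mentions` and the example sentence are ours, for illustration only.

```python
def replace_mentions(section_text, mentions):
    """Replace each annotated span with its lower-cased type+reason term.

    `mentions` is a list of dicts with integer `start` and `len` offsets into
    `section_text` plus `type` and `reason` strings (assumed layout).
    """
    # Work from the last span to the first so earlier offsets stay valid.
    for m in sorted(mentions, key=lambda m: m["start"], reverse=True):
        combo = (m["type"] + m["reason"]).replace("_", "").lower()
        start, end = m["start"], m["start"] + m["len"]
        section_text = section_text[:start] + combo + section_text[end:]
    return section_text

# Hypothetical sentence and offsets, not taken from the training data.
text = "Do not use in patients with lymphoma or active infection."
mentions = [
    {"start": text.index("lymphoma"), "len": len("lymphoma"),
     "type": "Not_AE_Candidate", "reason": "contraindication"},
    {"start": text.index("infection"), "len": len("infection"),
     "type": "Not_AE_Candidate", "reason": "contraindication"},
]
replaced = replace_mentions(text, mentions)
print(replaced)
```

Replacing from the end of the text backwards matters because each substitution changes the string length, which would otherwise shift every later offset.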

The functions that appear below represent the next step in the process, and serve to generate a sentence-level feature matrix, such that each sentence in each drug label document represents a row, and each possible "type"||"reason" ground-truth label represents a Boolean-valued column. 
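
As a toy picture of the target structure, the sketch below builds one row per sentence with a 0/1 column per combined label. It uses a naive regular-expression sentence splitter instead of spaCy and plain dictionaries instead of a dataframe, purely to keep the example self-contained. The sentences and the drug name `DrugX` are invented.

```python
import re

def toy_feature_matrix(text, combo_terms):
    """Build one row per sentence with a 0/1 column per combo term."""
    # Naive splitter for illustration only: break after ., ! or ? plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    rows = []
    for sent_id, sentence in enumerate(sentences):
        row = {"sent_id": sent_id, "sentence": sentence}
        for term in combo_terms:
            row[term] = int(term in sentence)
        rows.append(row)
    return rows

combo_terms = ["notaecandidateindication", "nonoseaenegation"]
text = ("DrugX is indicated for notaecandidateindication. "
        "No nonoseaenegation was observed. "
        "Store at room temperature.")
matrix = toy_feature_matrix(text, combo_terms)
for row in matrix:
    print(row)
```

The list of dictionaries can be passed directly to `pd.DataFrame` to get the tabular form used in the rest of this section.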


In [40]:
def catch_pregnancy_contraindication_references(labelsdf):
    
    def meddra_refers_to_pregnancy(row):
        if row["meddra_llt"] == "Pregnancy":
            res = "{}pregnancy".format(row["reason"])
        else:
            res = row["reason"]
        return res

    
    labelsdf["reason"] = labelsdf.apply(meddra_refers_to_pregnancy, axis=1)
        #if row['reason'] == "contraindication" and row["meddra_llt"] == "Pregnancy":
           # labels_df.loc[i, "reason"] += "pregnancy"
        
    return labelsdf

labelsdf = catch_pregnancy_contraindication_references(labels_and_info)
labelsdf[labelsdf['meddra_llt']=="Pregnancy"].head()

Unnamed: 0,file,len,id,reason,section,start,type,meddra_llt,meddra_llt_id,meddra_pt,meddra_pt_id,description
1767,cytoxan,9,M20,general_termpregnancy,S1,5365,NonOSE_AE,Pregnancy,10036556,Pregnancy,10036556,"General terms or non-specific text such as broad categories (e.g., MedDRA system organ class) used to introdue AEs or text describe an outcome (e.g., death) rather than an AE. These are not of interest."
2124,impavido,9,M98,AE_only_as_instructionpregnancy,S2,317,NonOSE_AE,Pregnancy,10036556,Pregnancy,10036556,"AES mentioned in instructions are often mentioned in a hypothetical context, with instructions for what to do if they develop. These AES are not of interest."
2125,impavido,9,M111,AE_only_as_instructionpregnancy,S3,1412,NonOSE_AE,Pregnancy,10036556,Pregnancy,10036556,"AES mentioned in instructions are often mentioned in a hypothetical context, with instructions for what to do if they develop. These AES are not of interest."
2126,impavido,9,M126,AE_only_as_instructionpregnancy,S3,1574,NonOSE_AE,Pregnancy,10036556,Pregnancy,10036556,"AES mentioned in instructions are often mentioned in a hypothetical context, with instructions for what to do if they develop. These AES are not of interest."
2226,tobi,9,M91,AE_only_as_instructionpregnancy,S2,857,NonOSE_AE,Pregnancy,10036556,Pregnancy,10036556,"AES mentioned in instructions are often mentioned in a hypothetical context, with instructions for what to do if they develop. These AES are not of interest."


In [47]:
def generate_type_reason_combos(reason_counts_df):
    
    out = OrderedDict()
    
    for i, row in reason_counts_df.iterrows():

        out[i] = {"type": row['type'], 
                  "reason": row['reason'], 
                  "combo_term":'{}{}'.format(row['type'].replace("_", "").lower(), row['reason'].replace("_", "").lower())}
        
    return out



In [55]:
def generate_feature_matrix(reasons, txt_file_dir="./data/ose_txt_training_20181101_type_and_reasons/"):
    
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    
    # To generate sentence-level labels, we need to find sentences where the reason-labels we inserted occur    
    patterns = [{"ORTH": r['combo_term']} for r in reasons.values()]
    
    for p in patterns:
        matcher.add(p['ORTH'], None, [(p)])
        
    cols = ["file", "sent_id", "sent_start", "sent_end", "sentence"]
    cols += [r['combo_term'] for r in reasons.values()]
        
    out_df = pd.DataFrame(columns=cols)

    for f in os.listdir(txt_file_dir):
            
        with open ("{}{}".format(txt_file_dir, f), "r") as myfile:
            
            drug_name = f.split("_")[0]
            #print(drug_name)

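            # Read the file as one string; formatting the list returned by
            # readlines() would pass the list's repr (brackets, quotes, "\n") to spaCy
            text = myfile.read()
            doc = nlp(text)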

            feature_matrix = pd.DataFrame(index=np.arange(len(list(doc.sents))), columns=cols)

            feature_matrix.loc[:, "file"] = drug_name.lower()
            feature_matrix.loc[:, "sent_id"] = [i for i in range(len(list(doc.sents)))]
            feature_matrix.loc[:, "sent_start"] = [i.start for i in (list(doc.sents))]
            feature_matrix.loc[:, "sent_end"] = [i.end for i in (list(doc.sents))]
            feature_matrix.loc[:, "sentence"] = [i for i in (list(doc.sents))]

            for r in reasons.values():
                feature_matrix.loc[:, r["combo_term"]] = 0

            matches = matcher(doc)

            for match_id, start, end in matches:

                span = doc[start:end]  # the matched span

                string_id = nlp.vocab.strings[match_id]  # get string representation

                match_sent_start = span.sent.start # the start (offset) of the sentence in which the match occurrs
                match_sent_end  = span.sent.end # the end (offset) of the sentence in which the match occurrs

                match_sent_id = feature_matrix.loc[(feature_matrix['sent_start'] == int(match_sent_start)) &
                                                   (feature_matrix['sent_end'] == int(match_sent_end)), "sent_id"].values[0]

                feature_matrix.loc[match_sent_id, string_id] = 1

            if out_df.empty:
                out_df = feature_matrix
            else:
                out_df = pd.concat([out_df, feature_matrix])  # DataFrame.append was removed in pandas 2.0
        
    out_df.to_csv("./data/feature_matrix.csv")
    
    print("Sentence-level feature matrix saved.")
    return 

In [50]:
#labelsdf = catch_pregnancy_contraindication_references(labels_and_info)


In [58]:
reasons = generate_type_reason_combos(r_counts)
print(reasons)


OrderedDict([(4, {'combo_term': 'nonoseaeaeonlyasinstruction', 'reason': 'AE_only_as_instruction', 'type': 'NonOSE_AE'}), (7, {'combo_term': 'nonoseaegeneralterm', 'reason': 'general_term', 'type': 'NonOSE_AE'}), (14, {'combo_term': 'notaecandidatepreexistingconditionorriskfactor', 'reason': 'preexisting_condition_or_risk_factor', 'type': 'Not_AE_Candidate'}), (12, {'combo_term': 'notaecandidateindication', 'reason': 'indication', 'type': 'Not_AE_Candidate'}), (8, {'combo_term': 'nonoseaemanifestationorcomplication', 'reason': 'manifestation_or_complication', 'type': 'NonOSE_AE'}), (2, {'combo_term': 'nonoseaeaefromdruginteraction', 'reason': 'AE_from_drug_interaction', 'type': 'NonOSE_AE'}), (5, {'combo_term': 'nonoseaeaeratelteqplacebo', 'reason': 'AE_rate_lteq_placebo', 'type': 'NonOSE_AE'}), (9, {'combo_term': 'nonoseaenegation', 'reason': 'negation', 'type': 'NonOSE_AE'}), (0, {'combo_term': 'nonoseaeaeanimal', 'reason': 'AE_animal', 'type': 'NonOSE_AE'}), (6, {'combo_term': 'nono

In [59]:
generate_feature_matrix(reasons)

Sentence-level feature matrix saved.


In [60]:
feat_mat = pd.read_csv("./data/feature_matrix.csv", header=0, index_col=0)

### 2.2 Human-in-the-Loop Review and Rule Generation

Our next objective is to review the sentences that are positive instances of the subset of the ground-truth labels that we are interested in, using a combination of tools, including manual review, n-gram frequency analysis, dependency parsing, topic modeling, and embeddings. Our hope is that these tools can help us to discover latent pattern(s) that we can use to develop a sentence-level Boolean classifier that will offer improved performance and efficiency relative to the status quo. We've divided up this part so that each ground-truth label gets its own subsection, as some of the rules we'll develop are label-specific. 

To start, let's define a few helper functions that we can use for each bucket:


In [61]:
def filter_feature_matrix(list_of_reasons, f_matrix, keep_only_true=True, replace_reason_name_w_noun=False, reason_replace_dict=None):
    
    '''
    We can use this function to subset the feature matrix by ground-truth label(s) and/or to keep only rows where the value for >=1 of the labels we've selected is TRUE
    '''
    
    cols = ["file", "sent_id", "sent_start", "sent_end", "sentence"]
    cols += list_of_reasons
    
    temp = f_matrix.copy()
    
    if replace_reason_name_w_noun and reason_replace_dict is not None:

        for r in list_of_reasons:
        
            temp.loc[:, "sentence"] = ["{}".format(sent).replace(r, reason_replace_dict[r]) for sent in temp.loc[:, "sentence"]]
    
    # For initial review, it can be helpful to only review the positive instances
    if keep_only_true:
        
        subset_df = temp.loc[:, cols].copy()  # copy so that adding a column does not raise SettingWithCopyWarning
        subset_df['row_sum'] = temp.loc[:, list_of_reasons].sum(axis=1)  
        
        return subset_df[subset_df["row_sum"] >= 1]
    
    # But it's important not to forget the sentences w/ FALSE values for the subset of labels we've specified 
    else:
        return temp.loc[:, cols]  # cols is already a list; wrapping it in another list raises an error

In [106]:
def generate_ngrams(f_mat, r_list, n=3, filter_punc=True, filter_ngrams_by_r=True):
    
    ngram_df = pd.DataFrame()
    
    cols = ["drug_name", "n"]
    
    for i in range(n):
        cols.append("token_{}".format(i))

    counter = 0
    
    for i, row in f_mat.iterrows():
        drug_name = row['file'].lower()
        doc = textacy.Doc(textacy.preprocess_text(u'{}'.format(row['sentence'].lower())), lang='en')
        ngrams = textacy.extract.ngrams(doc, n, filter_stops=False, filter_punct=filter_punc, filter_nums=False)

        if  filter_ngrams_by_r:
            
            #for x in ngrams:
                #print(x, any(reason in str(x) for reason in r_list))
        
            ngrams = [x for x in ngrams if any(reason in str(x).lower() for reason in r_list)]
        
        for ngram in ngrams:
                ngram_df.loc[counter, "drug_name"] = drug_name
                ngram_df.loc[counter, "n"] = n
                ngram_df.loc[counter, "reason"] = None
                
                # Use a separate loop variable so the row index `i` is not shadowed
                for j in range(n):
                    ngram_df.loc[counter, "token_{}".format(j)] = ngram[j].text
                    
                    if ngram[j].text.strip() in r_list:
                        #print(ngram[j])
                        ngram_df.loc[counter, 'reason'] = ngram[j].text
                counter += 1
        
    return ngram_df

In [63]:
def find_ngrams_with_label_not_in_initial_position(ngramdf, r_list):
    
    out = pd.DataFrame()
    
    cols = ["{}_in_pos_0".format(r) for r in r_list]

    for r in r_list:
        
        ngramdf["{}_in_pos_0".format(r)] = 0
        
        for i, row in ngramdf.iterrows():

            # token_0 already holds the token text (a str), so compare it directly
            if row['token_0'] == r:
                ngramdf.loc[i, "{}_in_pos_0".format(r)] = 1
                
    
    ngramdf['row_sum'] = ngramdf.loc[:, cols].sum(axis=1)
    print("Min row sum= ", ngramdf['row_sum'].min())
    return ngramdf[ngramdf['row_sum'] == 0]


In [64]:
def assess_model_performance(ground_truth_df, generated_df, r_list):
    
    for r in r_list:
        y_true = ground_truth_df.loc[:, r]
        y_pred = generated_df.loc[:, r]
        accuracy = accuracy_score(y_true, y_pred)
        print("Reason: {}; accuracy: {}".format(r, accuracy))
    
    return 

### Bucket A: Indication; Contraindication; Risk factor



In [65]:
labels_a = ["notaecandidateindication", "notaecandidatecontraindication", "notaecandidatepreexistingconditionorriskfactor"] # TODO: find label for risk factor 
subdf_a = filter_feature_matrix(labels_a, feat_mat, keep_only_true=True)
subdf_a.shape

(1824, 9)

In [107]:
ngramdf_a = generate_ngrams(subdf_a, labels_a, n=4, filter_punc=True, filter_ngrams_by_r=True)


In [126]:
ngramdf_a[ngramdf_a['reason']=="notaecandidateindication"].groupby(['reason', 'token_0', 'token_1', 'token_2', 'token_3']).size().reset_index(name='counts').sort_values(['reason','counts'], ascending=False)
    
    

Unnamed: 0,reason,token_0,token_1,token_2,token_3,counts
548,notaecandidateindication,in,patients,with,notaecandidateindication,108
906,notaecandidateindication,notaecandidateindication,notaecandidateindication,and,notaecandidateindication,41
1444,notaecandidateindication,patients,with,notaecandidateindication,and,34
1754,notaecandidateindication,the,treatment,of,notaecandidateindication,32
1337,notaecandidateindication,of,patients,with,notaecandidateindication,28
147,notaecandidateindication,adult,patients,with,notaecandidateindication,21
1472,notaecandidateindication,patients,with,notaecandidateindication,oselabeledaefromdruguse,20
1489,notaecandidateindication,patients,with,notaecandidateindication,who,19
915,notaecandidateindication,notaecandidateindication,notaecandidateindication,notaecandidateindication,and,16
491,notaecandidateindication,in,adults,with,notaecandidateindication,15


In [127]:
ngramdf_a[ngramdf_a['reason']=="notaecandidatecontraindication"].groupby(['reason', 'token_0', 'token_1', 'token_2', 'token_3']).size().reset_index(name='counts').sort_values(['reason','counts'], ascending=False)
    
    

Unnamed: 0,reason,token_0,token_1,token_2,token_3,counts
25,notaecandidatecontraindication,in,patients,with,notaecandidatecontraindication,9
13,notaecandidatecontraindication,family,history,of,notaecandidatecontraindication,3
18,notaecandidatecontraindication,history,of,notaecandidatecontraindication,or,3
32,notaecandidatecontraindication,notaecandidatecontraindication,and,in,patients,3
78,notaecandidatecontraindication,patients,with,severe,notaecandidatecontraindication,3
83,notaecandidatecontraindication,the,treatment,of,notaecandidatecontraindication,3
0,notaecandidatecontraindication,a,history,of,notaecandidatecontraindication,2
4,notaecandidatecontraindication,added,to,usual,notaecandidatecontraindication,2
5,notaecandidatecontraindication,and,without,known,notaecandidatecontraindication,2
14,notaecandidatecontraindication,females,who,are,notaecandidatecontraindication,2


In [128]:
ngramdf_a[ngramdf_a['reason']=="notaecandidatepreexistingconditionorriskfactor"].groupby(['reason', 'token_0', 'token_1', 'token_2', 'token_3']).size().reset_index(name='counts').sort_values(['reason','counts'], ascending=False)
    
    

Unnamed: 0,reason,token_0,token_1,token_2,token_3,counts
695,notaecandidatepreexistingconditionorriskfactor,in,patients,with,notaecandidatepreexistingconditionorriskfactor,125
45,notaecandidatepreexistingconditionorriskfactor,a,history,of,notaecandidatepreexistingconditionorriskfactor,60
632,notaecandidatepreexistingconditionorriskfactor,history,of,notaecandidatepreexistingconditionorriskfactor,or,32
1314,notaecandidatepreexistingconditionorriskfactor,notaecandidatepreexistingconditionorriskfactor,notaecandidatepreexistingconditionorriskfactor,or,notaecandidatepreexistingconditionorriskfactor,23
2127,notaecandidatepreexistingconditionorriskfactor,patients,with,notaecandidatepreexistingconditionorriskfactor,or,23
1812,notaecandidatepreexistingconditionorriskfactor,of,notaecandidatepreexistingconditionorriskfactor,or,notaecandidatepreexistingconditionorriskfactor,22
1844,notaecandidatepreexistingconditionorriskfactor,of,patients,with,notaecandidatepreexistingconditionorriskfactor,21
1275,notaecandidatepreexistingconditionorriskfactor,notaecandidatepreexistingconditionorriskfactor,notaecandidatepreexistingconditionorriskfactor,and,notaecandidatepreexistingconditionorriskfactor,20
2153,notaecandidatepreexistingconditionorriskfactor,patients,with,severe,notaecandidatepreexistingconditionorriskfactor,19
1229,notaecandidatepreexistingconditionorriskfactor,notaecandidatepreexistingconditionorriskfactor,may,be,at,17


In [130]:
def bucket_a_classifier_indication(sentence):
    #"treatment of"
    #"patients with"
    #"adult patients with"
    pass


In [129]:
def bucket_a_classifier_contraindication(sentence):
    #"patients with severe"
    #"females who are"
    #"family history"
    #"history of"
    pass

In [None]:
def bucket_a_classifier_preexist_or_risk(sentence):
    
    #"patients with a history of"
    #"patients with severe"
    #"elderly patients"
    #"family history"
    #"risk factors"
    #"women with"
    #"women during"
    
    # else:
    #     return 0
    pass
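
The three stubs above only list candidate cue phrases. One possible naive baseline, sketched below, flags a sentence for a label whenever any of that label's cue phrases occurs in it. The cue phrases are copied from the stub comments; the dictionary layout, the function names, and the example sentences are our own assumptions, and this is not the rule set used in our challenge submission.

```python
import re

# Cue phrases copied from the comments in the stubs above; grouping them in
# one dictionary is an illustrative choice.
BUCKET_A_CUES = {
    "indication": ["treatment of", "adult patients with", "patients with"],
    "contraindication": ["patients with severe", "females who are",
                         "family history", "history of"],
    "preexisting_or_risk": ["patients with a history of", "patients with severe",
                            "elderly patients", "family history", "risk factors",
                            "women with", "women during"],
}

def compile_cues(cues):
    """Compile a case-insensitive pattern matching any cue on word boundaries."""
    alternation = "|".join(re.escape(c) for c in cues)
    return re.compile(r"\b(?:{})\b".format(alternation), re.IGNORECASE)

PATTERNS = {label: compile_cues(cues) for label, cues in BUCKET_A_CUES.items()}

def classify_bucket_a(sentence):
    """Return {label: 0 or 1} depending on whether any cue phrase occurs."""
    return {label: int(bool(p.search(sentence))) for label, p in PATTERNS.items()}

# Invented example sentences.
first = classify_bucket_a("Indicated for the treatment of adult asthma.")
second = classify_bucket_a("Use caution in elderly patients with a family history of stroke.")
print(first)
print(second)
```

The second sentence fires all three labels. That matches the n-gram tables above, where "in patients with" is the most frequent context for all three labels: shared cues like this cannot separate the labels on their own, so a baseline of this kind favors recall over precision.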

### Bucket B: Withdrawal; Pregnancy 

In [584]:
# We have to consider contraindication here again because pregnancy mentions show up under that type_reason. 
labels_b = ["notaecandidatecontraindication", "nonoseaeodorwithdrawal"] 
subdf_b = filter_feature_matrix(labels_b, feat_mat, keep_only_true=True)
print(subdf_b.shape)


(86, 8)


In [667]:
ngramdf_b = generate_ngrams(subdf_b, labels_b, n=3, filter_punc=True, filter_ngrams_by_r=True)

In [373]:
# For contraindication associated with pregnancy: women; childbearing; female; teratogenic effect(s)
# For withdrawal/overdose: withdrawal, overdose; overdosed; abrupt cessation; early depletion; too rapid

# Generalizations: for pregnancy, references to women, childbearing; pre/post-partum; nursing
# Generalizations: for withdrawal/overdose: explicit term references; temporal references that indicate suddenness;
# associated negative outcomes such as addiction/death
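
The notes above can be turned into a first-pass keyword matcher. The sketch below is one hedged way to do that: the terms come from the notes, while the regular expressions, the label names, and the example sentences are illustrative assumptions rather than our final rules.

```python
import re

# Terms taken from the notes above; the regular expressions themselves are an
# illustrative assumption.
PREGNANCY_RE = re.compile(
    r"\b(?:wom[ae]n|females?|childbearing|teratogenic effects?|"
    r"(?:pre|post)[- ]?partum|nursing)\b", re.IGNORECASE)
WITHDRAWAL_RE = re.compile(
    r"\b(?:withdrawal|overdosed?|abrupt cessation|early depletion|too rapid)\b",
    re.IGNORECASE)

def classify_bucket_b(sentence):
    """Flag a sentence for each bucket B label whose terms it mentions."""
    return {
        "pregnancy_contraindication": int(bool(PREGNANCY_RE.search(sentence))),
        "od_or_withdrawal": int(bool(WITHDRAWAL_RE.search(sentence))),
    }

# Invented example sentences.
preg = classify_bucket_b("Contraindicated in women of childbearing potential.")
wd = classify_bucket_b("Abrupt cessation may cause withdrawal symptoms.")
print(preg)
print(wd)
```

Terms such as "women" and "female" are broad and will also match sentences that have nothing to do with pregnancy, so any matches would still need the manual review described in this section.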

### Bucket C: Negation

#### Rule Generation via Dependency Parse Analysis:

For the negation bucket, we investigate dependency parsing as a possible method for generating rules. Given the "messy" nature of our example texts, this is generally **not** the preferred method to follow. If, however, you happen to have a problem with well-formed, grammatically correct sentences, dependency parse methods can be quite powerful. So we present the methods below mainly to whet your appetite.

A **dependency parse** provides part of speech tags for each word as well as dependency information **encoded in tree form**. To illustrate, here is a diagram of a dependency parse of the sentence "The girl has a flower in her hair.":

<img src=./assets/displacy_girl_flower.png>

This diagram was generated with spaCy’s display tool [displacy](https://explosion.ai/demos/displacy). The part of speech tags appear underneath each word. In addition to NOUN, VERB, and ADJ, we also see DET (determiner) and ADP (adposition, a category that includes prepositions). Documentation on spaCy’s annotation scheme can be found [here](https://spacy.io/api/annotation).

The arrows represent a child-parent relationship, with the child being at the “arrow” end and the parent at the other end. **The word at the arrow end modifies the word at the other end.** Thus the word "The" modifies "girl", since the first arrow starts at the word "girl" and points to the word "The". The label on the arrow indicates the nature of the parent-child relationship. For the “girl-The” arrow, the "det" label on the arrow indicates that the word "The" is a determiner that modifies "girl".

The subject of the verb "has" is the word "girl", as indicated by the nsubj (nominal subject) label on the second arrow. The direct object of the verb is the noun "flower", as the arrow labeled dobj shows. The direct object has a DET modifier "a", just as the word "girl" has the DET modifier "The".

A prepositional phrase "in her hair" follows the direct object, as the two arrows labeled prep (prepositional modifier) and pobj (object of preposition) indicate. The object of the preposition "in" is the noun "hair", which has a possessive modifier "her".

Considered in its entirety, this dependency parse comprises a tree rooted at the verb "has".

Thus a dependency parse allows one to determine the nature of the relationships between the various components of a sentence.
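
To show what "encoded in tree form" means without depending on a parser, the sketch below stores the parse described above as a plain list and walks it. The entries were transcribed by hand from the description. Two details are assumptions: that "in" attaches to "flower" (a parser could also attach it to "has"), and that "her" carries the ADJ tag.

```python
# Each entry: (word, part of speech, dependency label, index of parent token).
# Transcribed by hand; attaching "in" to "flower" and tagging "her" as ADJ
# are assumptions.
PARSE = [
    ("The",    "DET",  "det",   1),
    ("girl",   "NOUN", "nsubj", 2),
    ("has",    "VERB", "ROOT",  2),   # the root is recorded as its own parent
    ("a",      "DET",  "det",   4),
    ("flower", "NOUN", "dobj",  2),
    ("in",     "ADP",  "prep",  4),
    ("her",    "ADJ",  "poss",  7),
    ("hair",   "NOUN", "pobj",  5),
]

def children(parse, parent):
    """Indices of tokens whose parent is `parent` (the root itself is skipped)."""
    return [i for i, (_, _, dep, head) in enumerate(parse)
            if head == parent and dep != "ROOT"]

def find_root(parse):
    """Index of the token labeled ROOT."""
    return next(i for i, tok in enumerate(parse) if tok[2] == "ROOT")

def subtree_words(parse, index):
    """All words in the subtree rooted at `index`, in sentence order."""
    indices, stack = [index], [index]
    while stack:
        for child in children(parse, stack.pop()):
            indices.append(child)
            stack.append(child)
    return [parse[i][0] for i in sorted(indices)]

root = find_root(PARSE)
root_modifiers = [PARSE[i][0] for i in children(PARSE, root)]
object_phrase = subtree_words(PARSE, 4)
print(PARSE[root][0], root_modifiers, object_phrase)
```

With spaCy, the same information is available on each token through its `head`, `dep_`, `pos_`, and `children` attributes, so no hand-built list is needed in practice.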

#### Sample Problem:

We will illustrate dependency parse techniques by developing a set of rules for phrases such as:
- no cases of X were reported ...
- no X occurred in any of the ...
- no evidence of X or X was seen...
- no evidence of serious X was detected in ...

In other words, we want to find phrases that start with "no" followed by one or more X tokens with possible intervening words and modifiers, followed by a verb. Here the symbol X is the token that our rules will try to find. Our methods will be successful if, given an input sentence, we can find the X in phrases of this nature.

So how do we go about finding X with a dependency parse tree?

First, we must make some realistic concessions.

Long run-on sentences containing medical jargon are difficult for NLP tools to process. Dependency parses for such sentences are generally meaningless. We have accordingly confined our attention to the shorter sentences in our data set, and will illustrate dependency parse methods on sentences containing **a maximum of 16 words**.
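
The length restriction can be sketched with a simple whitespace split, which is also how the driver shown later counts words. The helper name `is_short_enough` is illustrative and not part of ClarityNLP.

```python
MAX_WORDS = 16


def is_short_enough(sentence, max_words=MAX_WORDS):
    """True if the sentence has at most max_words whitespace-separated words."""
    return len(sentence.split()) <= max_words


short = "No X occurred in any of the pediatric trials."   # 9 words
long_ = " ".join(["word"] * 17)                           # 17 words

print(is_short_enough(short))
print(is_short_enough(long_))
```

Counting whitespace-separated words is cruder than counting parser tokens, since punctuation stays attached to words, but it is cheap and needs no parse.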

The **algorithm** we ultimately implemented is below: 
- Find the ROOT token of the dependency parse (the root of the parse tree).
- If the ROOT token has any modifiers, extract all modifier tokens. 
- Find all nouns in the list of modifier tokens.
- For each noun modifier, get the prepositional phrases that modify it, if any. Also get any conjunction-linked nouns.
- If any prepositional phrases were found, return the object of the preposition as the target token.
- If any conjunction-linked nouns were found, return the original noun token and the linked noun token as the targets.
- The presence of a target indicates classification as 1; absence of target token(s) indicates classification as 0.
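
The steps above can be sketched without spaCy on a hand-written parse. This is a simplified sketch, not the notebook's implementation: the `Tok` class, the parse of "No cases of X were reported", and the helper names are assumptions made for illustration, and the fallback for a root with no noun modifiers is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    i: int      # position in the sentence
    text: str
    pos: str    # coarse part of speech, e.g. 'NOUN'
    tag: str    # fine-grained tag, e.g. 'IN' for a preposition
    dep: str    # dependency label, e.g. 'ROOT'
    head: int   # index of the parent token (the root points to itself)


# Hand-written parse (an assumption, not parser output)
SENT = [
    Tok(0, 'No', 'DET', 'DT', 'det', 1),
    Tok(1, 'cases', 'NOUN', 'NNS', 'nsubjpass', 5),
    Tok(2, 'of', 'ADP', 'IN', 'prep', 1),
    Tok(3, 'X', 'NOUN', 'NN', 'pobj', 2),
    Tok(4, 'were', 'AUX', 'VBD', 'auxpass', 5),
    Tok(5, 'reported', 'VERB', 'VBN', 'ROOT', 5),
]


def left_modifiers(sent, tok):
    """All left descendants of tok, in sentence order (tok excluded)."""
    mods = []
    for child in (t for t in sent if t.head == tok.i and t.i < tok.i):
        mods.extend(left_modifiers(sent, child))
        mods.append(child)
    return mods


def phrase_to_noun(sent, tag, head):
    """Tokens from the first `tag` child of `head` through the next noun."""
    for t in sent:
        if t.tag == tag and t.head == head.i:
            phrase = [t]
            for nxt in sent[t.i + 1:]:
                phrase.append(nxt)
                if nxt.pos == 'NOUN':
                    break
            return phrase
    return []


def find_targets(sent):
    root = next(t for t in sent if t.dep == 'ROOT')
    nouns = [m for m in left_modifiers(sent, root) if m.pos == 'NOUN']
    targets = []
    for noun in nouns:
        # only nouns modified by the word 'no' are of interest
        if not any(m.text.lower() == 'no' for m in left_modifiers(sent, noun)):
            continue
        prep = [t for t in phrase_to_noun(sent, 'IN', noun) if t.pos == 'NOUN']
        conj = [t for t in phrase_to_noun(sent, 'CC', noun) if t.pos == 'NOUN']
        found = prep + ([noun] + conj if conj else [])
        # with no prepositional phrase or conjunction, the noun is the target
        targets.extend(found or [noun])
    return targets


print([(t.i, t.text) for t in find_targets(SENT)])
```

For this sentence the sketch returns the X at index 3, which agrees with the driver output for the same sentence further down.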

Helper functions and a step-by-step explanation follow:

In [68]:
# load spaCy's English model
nlp = spacy.load('en')

We need a series of helper functions to process the tokens in a dependency parse. Here is a function that returns a list of modifier tokens for a specified start token. The function finds all 'left' modifiers (in spaCy parlance) of the token with an iterative depth-first traversal and returns them, together with the start token itself, in sentence order:

In [69]:
###############################################################################
def get_modifiers(start_token):
    """
    Find all 'left' modifiers of the start token via depth-first search and
    return as a list, in order of occurrence in the sentence.
    """

    results = []

    # use like a stack for depth-first traversal
    tokens = [start_token]

    # indices of nodes whose 'lefts' have been pushed
    indices = []
    
    # recurse depth-first through the dependency parse tree; will pick up
    # all modifiers of each token this way
    while len(tokens) > 0:
        # top of token stack
        top = tokens[-1]
        # if 'lefts' haven't already been pushed
        if top.i not in indices:
            lefts = list(top.lefts)
            indices.append(top.i)
            # push in reverse order to place the most distant modifier
            # at the top of the stack; the most distant mod comes
            # first in a sentence
            for left in reversed(lefts):
                tokens.append(left)
        else:
            # no more mods for the token at the stack top, so pop it
            token = tokens.pop()
            results.append(token)

    # remove any duplicates
    no_dups = []
    for token in results:
        if token not in no_dups:
            no_dups.append(token)

    return no_dups
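
The ordering that this traversal produces may be easier to see in a compact recursive formulation. The sketch below is an equivalent written for illustration, not the function above: `FakeToken` is a hypothetical stand-in that mimics only the `i` and `lefts` attributes of a spaCy token, and the left-modifier links for "No serious cases were reported" are built by hand.

```python
class FakeToken:
    """Minimal stand-in for a spaCy token: an index, text, and left children."""

    def __init__(self, i, text, lefts=()):
        self.i = i
        self.text = text
        self.lefts = list(lefts)


def left_modifiers(token):
    """Left modifiers of token (found recursively), then the token itself."""
    result = []
    for child in token.lefts:
        result.extend(left_modifiers(child))
    result.append(token)
    return result


# hand-built left-modifier links for "No serious cases were reported"
no = FakeToken(0, "No")
serious = FakeToken(1, "serious")
cases = FakeToken(2, "cases", lefts=[no, serious])
were = FakeToken(3, "were")
reported = FakeToken(4, "reported", lefts=[cases, were])

print([t.text for t in left_modifiers(reported)])
```

The start token comes last in the result, which is why the calling code removes it before examining the modifiers. The explicit stack in `get_modifiers` gives the same ordering without relying on Python's recursion limit.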


The next function solves a similar problem by returning a list of modifier tokens starting with a given search tag and ending on the nearest associated noun:

In [70]:
###############################################################################
def get_modifier_with_tag(search_tag, doc, target_tok):
    """
    Given a spaCy tag of either 'IN' for a preposition or 'CC' for a
    conjunction, find and return all tokens up to and including the nearest
    noun. By doing so we can capture modifying prepositional phrases and
    conjunction-linked nouns (with possible modifiers as well).
    """
    
    found_it = False
    phrase = []
    for token in doc:
        tag = token.tag_
        if search_tag == tag and token.head == target_tok:
            phrase.append(token)
            token_index = token.i + 1
            while token_index < len(doc):
                t = doc[token_index]
                if 'NOUN' != t.pos_:
                    phrase.append(t)
                    token_index += 1
                else:
                    phrase.append(t)
                    found_it = True
                    break
            if found_it:
                break

    return phrase        


This function checks all modifiers of a given token for the word 'no', as our desired patterns require. If we find 'no' as a modifier, then we have potentially found the text pattern we are looking for. We find all modifying prepositional phrases or conjunctions associated with the word modified by the 'no' and return our candidates for X:

In [71]:
###############################################################################
def get_targets_for_no(doc, modifiers, tok):

    # spacy tags for preposition and conjunction
    TAG_PREP = 'IN'
    TAG_CONJ = 'CC'
    
    targets = []
    for mod in modifiers:
        mod_text = '{0}'.format(mod).lower()
        # look for the word 'no' as a modifier
        if 'no' == mod_text:

            # check for modifying prep phrase
            prep_phrase = get_modifier_with_tag(TAG_PREP, doc, tok)
            if len(prep_phrase) > 0:
                for p in prep_phrase:
                    if 'NOUN' == p.pos_:
                        targets.append(p)

            conj_phrase = get_modifier_with_tag(TAG_CONJ, doc, tok)
            if len(conj_phrase) > 0:
                for c in conj_phrase:
                    if 'NOUN' == c.pos_:
                        targets.append(tok)
                        targets.append(c)

            if 0 == len(targets):
                targets.append(tok)

    return targets


In [72]:
###############################################################################
def analyze(doc):

    # find the ROOT token of the dependency parse
    root_token = None
    for token in doc:
        if token.dep_ == 'ROOT':
            root_token = token

    # get modifiers of the root token
    root_modifiers = get_modifiers(root_token)
    root_modifiers.remove(root_token)
    
    # find the nouns in the modifier list, if any
    noun_mods = []
    for mod in root_modifiers:
        if 'NOUN' == mod.pos_:
            noun_mods.append(mod)

    # all targets should be 'X' tokens for success
    targets = []

    if len(noun_mods) > 0:
        # for each noun modifier, get prepositional phrases or conjunctions
        # that modify it
        for noun in noun_mods:
            modifiers = get_modifiers(noun)
            these_targets = get_targets_for_no(doc, modifiers, noun)
            targets.extend(these_targets)
    else:
        # no noun modifiers, so find phrases associated with the root node
        these_targets = get_targets_for_no(doc, root_modifiers, root_token)
        targets.extend(these_targets)
                 
    return targets


Here is a driver program for the code above. It cleans up the sentence text and skips any sentence longer than MAX_WORDS words:

In [73]:
###############################################################################
def driver(neg_token, subdf_c):

    # only analyze sentences with this many or fewer words
    MAX_WORDS = 16
    
    # match some throwaway words at the start of a sentence
    regex_there_has_been = re.compile(r'\Athere\s(has|have)\sbeen\s*', re.IGNORECASE)
    
    # iterate through each row of the dataframe
    count = 0
    for index, row in subdf_c.iterrows():

        # get the text associated with the next entry (could be multiple sentences)
        text = row['sentence']

        # do some text cleanup
        
        # replace explicit '\n', with whitespace
        text = re.sub(r'\'?\\n\',?', ' ', text)
    
        # collapse repeated whitespace
        text = re.sub(r'\s+', ' ', text)

        # use spaCy to tokenize into sentences
        doc = nlp(text)
        sentences = [sent.string.strip() for sent in doc.sents]

        for sent in sentences:

            # keep only those sentences that include the negation token
            if neg_token not in sent:
                continue

            # count words; very long sentences can't be analyzed well
            words = sent.split()
            if len(words) > MAX_WORDS:
                continue

            # simplify sentence text by removing throwaway phrases at the beginning
            match = regex_there_has_been.match(sent)
            if match:
                # remove the matching phrase
                sent = sent[match.end():]

            # replace the negation token with just X, for simplicity
            sent = sent.replace(neg_token, 'X')
            
            print('[{0}]: index: {1}, sentence: {2}'.format(count, index, sent))

            # re-analyze this sentence with spacy
            doc = nlp(sent)
            #print_tokens(doc)

            # perform our analysis
            targets = analyze(doc)

            # print any result tokens
            if len(targets) > 0:
                print('TARGETS: ')
                for t in targets:
                    print('\tindex: {0}, text: {1}'.format(t.i, t))
            
            count += 1
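
The text cleanup inside the driver can be tried on its own with only the `re` module. The sketch below applies the same regular expressions to a made-up input string. The `clean` helper and the sample text are assumptions for illustration, and the steps run on the whole string here instead of sentence by sentence.

```python
import re

NEG_TOKEN = 'nonoseaenegation'
regex_there_has_been = re.compile(r'\Athere\s(has|have)\sbeen\s*', re.IGNORECASE)


def clean(text, neg_token=NEG_TOKEN):
    # replace a literal backslash-n plus closing quote (and optional comma)
    text = re.sub(r'\'?\\n\',?', ' ', text)
    # collapse repeated whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    # drop a leading "there has been" or "there have been"
    match = regex_there_has_been.match(text)
    if match:
        text = text[match.end():]
    # replace the negation token with just X
    return text.replace(neg_token, 'X')


# made-up input containing a literal backslash-n and a stray quote
raw = r"There has been no   evidence of nonoseaenegation.\n'"
print(clean(raw))
```

The first pattern only removes a quote that directly follows the literal `\n`, which is consistent with the stray `'` characters that survive in some of the sentences printed below.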


In [74]:
NEGATION_COL_NAME = 'nonoseaenegation'
    
# load the feature matrix from a file
feat_mat = pd.read_csv('./data/negation_feature_matrix.csv', header=0, index_col=0)

# extract negation submatrix
labels_c = [NEGATION_COL_NAME]
subdf_c = filter_feature_matrix(labels_c, feat_mat, keep_only_true=True)
    
# run the algorithm on the negation submatrix
driver(NEGATION_COL_NAME, subdf_c)

[0]: index: 122, sentence: There were no serious X or X reported in either group.
[1]: index: 50, sentence: No cases of X were reported. '
TARGETS: 
	index: 3, text: X
[2]: index: 106, sentence: Clinical studies have shown that LIPITOR does not X or X
[3]: index: 23, sentence: No AndroGel 1% patient discontinued due to X '
TARGETS: 
	index: 4, text: patient
[4]: index: 122, sentence: Free thyroid hormone concentrations remain unchanged, however, and there is no clinical evidence of X ']
[5]: index: 172, sentence: No X occurred in the short-term or longer-term (up to 1 year) notaecandidateindication trials. '
TARGETS: 
	index: 1, text: X
[6]: index: 178, sentence: No X occurred in any of the pediatric trials.
TARGETS: 
	index: 1, text: X
[7]: index: 231, sentence: In clinical trials evaluating Savella in patients with notaecandidateindication X X have not been reported.
[8]: index: 235, sentence: No case met the criteria of X and associated with an X
TARGETS: 
	index: 1, text: case
[9]:

In [None]:
# For negation: grammatical pattern "was/were not associated/observed prepositional phrase"
# "were noted/reported/confirmed/occurred in _modifier_"; 

