## Cooking with ClarityNLP - Session #7

The goal of this [#cookingWithClarityNLP](https://twitter.com/hashtag/cookingWithClarityNLP?src=hash&lang=en) session is to highlight how ClarityNLP can be used to detect, extract, and classify different sub-categories of [adverse events](https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/cfrsearch.cfm?fr=312.32) from patient-level documents, and/or drug labels. 

For details on installing and using ClarityNLP, please see our [documentation](https://claritynlp.readthedocs.io/en/latest/index.html).  We welcome questions during this presentation, as well as [via Slack](https://join.slack.com/t/claritynlp/shared_invite/enQtNTE5NTUzNzk4MTk5LTFmNWY1NWVmZTA4Yjc5MDUwNTRhZTBmNTA0MWM0ZDNmYjdlNTAzYmViYzAzMTkwZDkzODA2YTJhYzQ1ZTliZTQ), Twitter (use the hashtag [#cookingWithClarityNLP](https://twitter.com/hashtag/cookingWithClarityNLP?src=hash&lang=en)), or on [GitHub](https://github.com/ClarityNLP/ClarityNLP/issues). **We also encourage suggestions for topics to cover in future cooking sessions!**

If you're running ClarityNLP locally via Docker, make sure that the Docker swarm is up and running before following along:
- See https://claritynlp.readthedocs.io/en/latest/setup/local-docker.html#running-locally.

In [357]:
# Import dependencies
%matplotlib inline
import pandas as pd
import numpy as np
import os 
from collections import OrderedDict
pd.set_option('display.width',100000)
pd.set_option('max_colwidth',4000)
import matplotlib.pyplot as plt
import spacy
from spacy.matcher import Matcher
import textacy

FDA_DIR = "../../../../repos/FDA_AE_challenge_2019"


import claritynlp_notebook_helpers as claritynlp

#TODO: (loose outline)
1. Overview of adverse events (definitions of sub-categories; examples; references); discussion of the subset of AEs we decided to focus on:
    - Pre-existing condition or risk factor
    - Indication
    - Negation
    - Pregnancy
    - Class effect
    - Overdose/withdrawal

2. Overview (high-level) of drug label organization/contents/what it means when an adverse event appears on a drug label; the path toward a reported event appearing on a label
3. Brief discussion of pre-processing steps taken (parse XML; find/replace MedDRA terms with labels; n-grams)
4. Overview of rule-construction process => frequency analysis; dependency parsing; naive baseline + enhancements; how we set up the structure of the rules engine to be as efficient as possible (e.g., maximize downselection potential, etc.) 
5. Describe how our rules engine can be run via NLP_as_a_Service, and/or as a custom task in ClarityNLP
6. Demonstration (screenshots?) of using chart review to manually validate.

## 1. Overview and Background
### 1.1 The FDA Adverse Drug Event Evaluation Challenge

Our team is currently working on a submission to the [FDA Adverse Drug Event Evaluation Challenge](https://sites.mitre.org/adeeval/). This challenge requires participating teams to ingest a set of drug labels (in XML format), parse these labels to detect and classify adverse events, and output a set of labels indicating the location, class, and associated [MedDRA](https://www.meddra.org/) information for each positive occurrence of an adverse event.

Today, we'll break down some of the components of our drug label evaluation strategy to show how you can use components of the ClarityNLP pipeline (specifically, a custom task, and our chart review GUI-based tool) to detect and classify (a subset of) adverse event mentions when they appear in drug labels and/or patient notes. 

### 1.2 The Challenge Training Dataset
The FDA challenge training dataset contains text from 100 drug labels, stored in XML format. Each drug label is mapped to a series of ground-truth labels; for the sake of clarity, we will continue to distinguish between **drug** labels and **ground-truth** labels. Each ground-truth label corresponds to an adverse event associated with the drug in question (though there are several possible ways that this relationship may manifest itself), and contains the following fields:

<img src="assets/truth_label_screenshot.png" style="height:50px">
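
To make the structure of a ground-truth label concrete, here is a minimal sketch that parses a single `Mention` element into a flat record. The element and attribute names mirror those read by the parsing code in this notebook (`Mentions`, `Mention`, `Normalization`, and the `meddra_*` attributes). The XML snippet itself, including the root element name and every attribute value, is invented for illustration and is not taken from the challenge data.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: the root tag and all attribute values are invented.
SAMPLE_XML = """
<Label drug="exampledrug">
  <Mentions>
    <Mention id="M1" section="S1" type="NonOSE_AE" reason="negation" start="120" len="8">
      <Normalization meddra_pt="Headache" meddra_pt_id="0000001"
                     meddra_llt="Headache" meddra_llt_id="0000002" />
    </Mention>
  </Mentions>
</Label>
"""


def mention_to_record(mention):
    """Flatten a <Mention> element and its <Normalization> child into one dict."""
    record = dict(mention.attrib)
    for child in mention:
        if child.tag == "Normalization":
            record.update(child.attrib)
    return record


root = ET.fromstring(SAMPLE_XML)
records = [mention_to_record(m) for m in root.iter("Mention")]
print(records[0])
```

Each record then carries the location fields (`section`, `start`, `len`), the classification fields (`type`, `reason`), and the MedDRA normalization fields side by side.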


### 1.3 Adverse Event Categories
There are several reasons that the FDA may consider an adverse event to be (or not be) of interest to the Office of Surveillance and Epidemiology (OSE). The unique set of reasons, along with their definitions and counts within the FDA training dataset, appear in the table that is generated below:

In [341]:
import xml.etree.ElementTree as ET


def generate_labels_df(fda_dir=FDA_DIR):
    
    '''

    This function lets us read in the 100 drug label XML files, and parse the XML so that we get each ground 
    truth label associated with a given drug as a row in our dataframe. We can use this dataframe to get counts 
    by label.

    '''

    cached_csv = "{}/analyze_labels/data/labels_dataframe.csv".format(fda_dir)
    label_cols = ["file", "len", "id", "reason", "section", "start", "type", "meddra_llt", "meddra_llt_id", "meddra_pt", "meddra_pt_id"]

    if not os.path.exists(cached_csv):

        xml_dir = "{}/ose_xml_training_20181101".format(fda_dir)
        rows = []

        for f in os.listdir(xml_dir):

            drug_name = f.split(".xml")[0]

            # Parse from the same directory we are listing
            root = ET.parse(os.path.join(xml_dir, f)).getroot()

            for child in root:

                if child.tag == "Mentions":
                    for subchild in child:
                        if subchild.tag == "Mention": 

                            # One row per <Mention>: file name plus the mention's attributes
                            new_row = {"file": drug_name.lower(), **subchild.attrib}

                            # Merge in the MedDRA attributes from any <Normalization> child
                            for subsubchild in subchild:
                                if subsubchild.tag == "Normalization":
                                    new_row = {**new_row, **subsubchild.attrib}

                            rows.append(new_row)

        labels_df = pd.DataFrame(rows, columns=label_cols)

    else: 
        labels_df = pd.read_csv(cached_csv, header=0, index_col=0)
 
    reason_info = pd.read_csv("./data/ae_reason_info.csv", header=0)
    
    outdf = pd.merge(labels_df, reason_info, left_on=["reason", "type"], right_on=["reason", "type"])
        
    return outdf.loc[:, ["file", "len", "reason", "type", "meddra_llt", "meddra_pt", "meddra_pt_id", "description"]]


labels_and_info = generate_labels_df(fda_dir="../../../../repos/FDA_AE_challenge_2019")

In [344]:
reason_counts = labels_and_info.groupby(['type', 'reason', 'description']).size().reset_index(name='counts').sort_values('counts', ascending=False)
reason_counts

Unnamed: 0,type,reason,description,counts
4,NonOSE_AE,AE_only_as_instruction,"AES mentioned in instructions are often mentioned in a hypothetical context, with instructions for what to do if they develop. These AES are not of interest.",3384
7,NonOSE_AE,general_term,"General terms or non-specific text such as broad categories (e.g., MedDRA system organ class) used to introdue AEs or text describe an outcome (e.g., death) rather than an AE. These are not of interest.",2122
12,Not_AE_Candidate,indication,A clinical symptom or circumstance for which the use of the drug of interest would be appropriate. These mentions are not AEs.,1434
8,NonOSE_AE,manifestation_or_complication,"Text describing signs, symptoms, or changes in lab resuts related to the manifestations of an AE and the sequelae of an AE are not of interest.",1144
2,NonOSE_AE,AE_from_drug_interaction,AEs that result from drug-drug interaction or co-administration are not of interest.,328
5,NonOSE_AE,AE_rate_lteq_placebo,Aes with incidence rate equal to or lower than placebo are not of interest.,306
9,NonOSE_AE,negation,AE whose presence or occurrence is negated or denied. These AEs are not of interst.,245
0,NonOSE_AE,AE_animal,AEs observed in animal data are not of interest.,241
6,NonOSE_AE,OD_or_withdrawal,"AE associated with discontinuing a medication or taking more than the prescribed amount. Drug overdoes and withdrawal do not generally occur when a drug is used as indicated. Additionally, in the context of pharmacovigilance, identifying AEs associated with the drug when used as indicated is the highest priority. These AEs are not of interest.",206
3,NonOSE_AE,AE_from_off_label,AES associated with off-label or unapproved drug use are not of interest.,87


### 1.4 Our Selected Subset of Adverse Event Categories
Looking at the table above, it's clear that the majority of labeled adverse events fall into a few categories, specifically: (1) adverse events that are mentioned in the instructions in a hypothetical context; (2) general terms, which are too broad to be useful; (3) indications (i.e., clinical symptoms or circumstances for which using the drug in question would be appropriate); and (4) manifestations or complications, which might occur as the result of an AE, but are not adverse events per se.
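
As a quick check on that claim, the following sketch recomputes the share held by the four largest categories, using the counts copied from the table above. It considers only the ten rows displayed in that table, so the percentage is relative to those rows rather than to the full training set.

```python
# Counts copied from the rows displayed in the reason_counts table.
counts = {
    "AE_only_as_instruction": 3384,
    "general_term": 2122,
    "indication": 1434,
    "manifestation_or_complication": 1144,
    "AE_from_drug_interaction": 328,
    "AE_rate_lteq_placebo": 306,
    "negation": 245,
    "AE_animal": 241,
    "OD_or_withdrawal": 206,
    "AE_from_off_label": 87,
}

total = sum(counts.values())
top4 = sorted(counts.values(), reverse=True)[:4]
top4_share = sum(top4) / total

print("Top-4 share of displayed labels: {:.1%}".format(top4_share))
```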

For the purpose of today's discussion, we've decided to focus on detecting mentions of three different buckets of adverse events. Note that some of these buckets contain multiple "reasons" as defined in the FDA ground-truth labels.

1. Indication; Contraindication; Risk factor
2. Withdrawal; Pregnancy*
3. Negation

\* Note: in the FDA training dataset, **Pregnancy** is a MedDRA term which appears in the **meddra_llt** field and co-occurs with multiple ground-truth reasons. The reason that is of interest here is **contraindication** (e.g., drugs that are contraindicated for women who are pregnant, nursing, or may become pregnant in the near term). For example:

<img src="assets/pregnancy_contraindication_ex.png" style="height:70px">

## 2. Implementation

### 2.1 Generation of Sentence-Level Labels
In order to build a rules-based model for detecting/classifying sentences with respect to ground-truth labels, we must first convert the provided XML labels into a sentence-level feature matrix that can be used to assess model performance on sentences that do and do not contain each of our ground-truth labels.

Our first step in this direction was to parse the original XML drug label documents, find the adverse events within the text (using the section-level offsets provided as part of the ground-truth labels), and replace each adverse event with the concatenation of the "type"||"reason" fields. This allows us to identify patterns at a more generic level, since we'll be working with the finite set of {"type"||"reason"} combinations rather than the more specific set of underlying adverse events, which are linked to MedDRA terms within the ground-truth labels.
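
The replacement step described above can be sketched as follows. This is a simplified illustration rather than the code we ran: it assumes each mention is a single contiguous span described by integer `start` and `len` character offsets, and the sample text, offsets, and function names are invented. The combined token is built the same way as the `combo_term` used later in this notebook (underscores removed, lowercased, type followed by reason).

```python
def combo_term(ae_type, reason):
    """Build the type||reason placeholder, e.g. 'nonoseaenegation'."""
    return ae_type.replace("_", "").lower() + reason.replace("_", "").lower()


def replace_mentions(text, mentions):
    """Replace each mention span with its type||reason placeholder.

    Spans are processed from right to left so that the offsets of
    earlier mentions remain valid after each substitution.
    """
    for m in sorted(mentions, key=lambda m: m["start"], reverse=True):
        start, end = m["start"], m["start"] + m["len"]
        text = text[:start] + combo_term(m["type"], m["reason"]) + text[end:]
    return text


# Invented example text; offsets are computed rather than hard-coded.
section_text = "No headache was reported. Discontinue if rash occurs."
mentions = [
    {"start": section_text.index("headache"), "len": len("headache"),
     "type": "NonOSE_AE", "reason": "negation"},
    {"start": section_text.index("rash"), "len": len("rash"),
     "type": "NonOSE_AE", "reason": "AE_only_as_instruction"},
]

replaced = replace_mentions(section_text, mentions)
print(replaced)
```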

The functions that appear below represent the next step in the process, and serve to generate a sentence-level feature matrix, such that each sentence in each drug label document represents a row, and each possible "type"||"reason" ground-truth label represents a Boolean-valued column. 


In [301]:
def generate_type_reason_combos(reason_counts_df):
    
    out = OrderedDict()
    
    for i, row in reason_counts_df.iterrows():
        out[i] = {"type": row['type'], 
                  "reason": row['reason'], 
                  "combo_term":'{}{}'.format(row['type'].replace("_", "").lower(), row['reason'].replace("_", "").lower())}
    return out

In [353]:
def generate_feature_matrix(reasons, txt_file_dir="{}/ose_txt_training_20181101_type_and_reasons/".format(FDA_DIR)):
    
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    
    # To generate sentence-level labels, we need to find sentences where the reason-labels we inserted occur    
    patterns = [{"ORTH": r['combo_term']} for r in reasons.values()]
    
    for p in patterns:
        matcher.add(p['ORTH'], None, [(p)])
        
    cols = ["file", "sent_id", "sent_start", "sent_end", "sentence"]
    cols += [r['combo_term'] for r in reasons.values()]
        
    out_df = pd.DataFrame(columns=cols)

    for f in os.listdir(txt_file_dir):
            
        with open ("{}{}".format(txt_file_dir, f), "r") as myfile:
            
            drug_name = f.split("_")[0]
            #print(drug_name)

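            lines = myfile.readlines()
            # NOTE: formatting the list returned by readlines() embeds list punctuation
            # (e.g. "\n', '") in the parsed text, as seen in the output below;
            # passing myfile.read() to nlp() would avoid this.
            doc = nlp(u'{}'.format(lines))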

            feature_matrix = pd.DataFrame(index=np.arange(len(list(doc.sents))), columns=cols)

            feature_matrix.loc[:, "file"] = drug_name.lower()
            feature_matrix.loc[:, "sent_id"] = [i for i in range(len(list(doc.sents)))]
            feature_matrix.loc[:, "sent_start"] = [i.start for i in (list(doc.sents))]
            feature_matrix.loc[:, "sent_end"] = [i.end for i in (list(doc.sents))]
            feature_matrix.loc[:, "sentence"] = [i for i in (list(doc.sents))]

            for r in reasons.values():
                feature_matrix.loc[:, r["combo_term"]] = 0

            matches = matcher(doc)

            for match_id, start, end in matches:

                span = doc[start:end]  # the matched span

                string_id = nlp.vocab.strings[match_id]  # get string representation

                match_sent_start = span.sent.start # the start (token offset) of the sentence in which the match occurs
                match_sent_end  = span.sent.end # the end (token offset) of the sentence in which the match occurs

                match_sent_id = feature_matrix.loc[(feature_matrix['sent_start'] == int(match_sent_start)) &
                                                   (feature_matrix['sent_end'] == int(match_sent_end)), "sent_id"].values[0]

                feature_matrix.loc[match_sent_id, string_id] = 1

            if out_df.empty:
                out_df = feature_matrix
            else:
                out_df = out_df.append(feature_matrix)
        
    out_df.to_csv("./data/feature_matrix.csv")
    print("Sentence-level feature matrix saved.")
    return 

In [354]:
reasons = generate_type_reason_combos(reason_counts)
generate_feature_matrix(reasons)

Sentence-level feature matrix saved.


In [501]:
feat_mat = pd.read_csv("./data/feature_matrix.csv", header=0, index_col=0)

In [398]:
feat_mat[feat_mat['nonoseaenegation']==1]

Unnamed: 0,file,sent_id,sent_start,sent_end,sentence,nonoseaeaeonlyasinstruction,nonoseaegeneralterm,notaecandidateindication,nonoseaemanifestationorcomplication,nonoseaeaefromdruginteraction,nonoseaeaeratelteqplacebo,nonoseaenegation,nonoseaeaeanimal,nonoseaeodorwithdrawal,nonoseaeaefromofflabel,notaecandidatecontraindication,nonoseaeaeforanotherdruginclass,notaecandidateother,nonoseaeother
32,valium,32,1673,1788,"No adverse effects on nonoseaenegation or nonoseaenegation were noted at a dose of 80 mg/kg/day (approximately 13 times the MRHD on a mg/m 2 basis).\n', '\n', ' Pregnancy \n', '\n', ' Category D (see WARNINGS: nonoseaeaeonlyasinstruction ).\n', '\n', ' Pediatric Use \n', '\n', ' Safety and effectiveness in pediatric patients below the age of 6 months have not been established.\n', '\n', ' Geriatric Use",1,0,0,0,0,0,1,0,0,0,0,0,0,0
15,cytoxan,15,380,439,"The degree of nonoseaeaeonlyasinstruction is particularly important because it correlates with a nonoseaemanifestationorcomplication nonoseaeaeonlyasinstruction without documented nonoseaenegation has been reported in nonoseaeaeonlyasinstruction patients.\n', '\n', ' Gastrointestinal system: \n', '\n', ' oselabeledaefromdruguse and oselabeledaefromdruguse occur with cyclophosphamide therapy.",1,0,0,1,0,0,1,0,0,0,0,0,0,0
51,benlysta,51,1261,1343,"The proportion of patients who discontinued treatment due to any adverse reaction during the controlled clinical trial was 7.2% of patients receiving BENLYSTA plus standard therapy and 8.9% of patients receiving placebo plus standard therapy.\n', '\n', ' The safety profile observed for BENLYSTA administered subcutaneously plus standard therapy was consistent with the known safety profile of BENLYSTA administered intravenously plus standard therapy, with the exception of local nonoseaenegation \n', '\n",0,0,0,0,0,0,1,0,0,0,0,0,0,0
66,benlysta,66,1817,1847,"In Trial 4 (subcutaneous dosing), there was no formation of nonoseaenegation in 556 patients receiving BENLYSTA 200 mg during the 52-week placebo-controlled period.",0,0,0,0,0,0,1,0,0,0,0,0,0,0
111,benlysta,111,3323,3364,Serious oselabeledaefromdruguse (excluding nonoseaenegation were reported in 0.5% of patients receiving BENLYSTA and 0.4% of patients receiving placebo and included oselabeledaefromdruguse oselabeledaefromdruguse oselabeledaefromdruguse oselabeledaefromdruguse oselabeledaefromdruguse and oselabeledaefromdruguse,0,0,0,0,0,0,1,0,0,0,0,0,0,0
122,benlysta,122,3776,3791,There were no serious nonoseaenegation or nonoseaenegation reported in either group.,0,0,0,0,0,0,1,0,0,0,0,0,0,0
128,benlysta,128,4068,4094,No data are available on the nonoseaenegation from persons receiving live vaccines to patients receiving BENLYSTA or the effect of BENLYSTA on new immunizations.,0,0,0,0,0,0,1,0,0,0,0,0,0,0
59,janumet,59,2126,2315,"Through Week 54, the overall incidence of oselabeledaefromdruguse was 3.9% in patients given add-on sitagliptin and 1.0% in patients given add-on placebo.\n', '\n', ' Vital Signs and Electrocardiograms \n', '\n', ' With the combination of sitagliptin and metformin, no clinically meaningful nonoseaenegation nonoseaenegation vital signs or in nonoseaenegation nonoseaenegation vital signs or in ECG (including in nonoseaenegation were observed.\n', '\n', ' Pancreatitis \n', '\n', ' In a pooled analysis of 19 double-blind clinical trials that included data from 10,246 patients randomized to receive sitagliptin 100 mg/day (N=5429) or corresponding (active or placebo) control (N=4817), the incidence of oselabeledaefromdruguse was 0.1 per 100 patient-years in each group (4 patients with an event in 4708 patient-years for sitagliptin and 4 patients with an event in 3942 patient-years for control).",0,0,0,0,0,0,1,0,0,0,0,0,0,0
85,janumet,85,2968,3030,"The onset of metformin-associated nonoseaeaeonlyasinstruction is often subtle, accompanied only by nonspecific symptoms such as nonoseaemanifestationorcomplication nonoseaemanifestationorcomplication nonoseaemanifestationorcomplication nonoseaemanifestationorcomplication and nonoseaemanifestationorcomplication Metformin-associated oselabeledaefromdruguse was characterized by oselabeledaefromdruguse (>5 mmol/Liter), oselabeledaefromdruguse (without evidence of nonoseaenegation or nonoseaenegation ,",1,0,0,1,0,0,1,0,0,0,0,0,0,0
116,janumet,116,3720,3833,"These cases had a subtle onset and were accompanied by nonspecific symptoms such as nonoseaemanifestationorcomplication nonoseaemanifestationorcomplication nonoseaemanifestationorcomplication nonoseaemanifestationorcomplication or nonoseaemanifestationorcomplication however, nonoseaemanifestationorcomplication nonoseaemanifestationorcomplication and resistant nonoseaemanifestationorcomplication have occurred with severe nonoseaeaeonlyasinstruction Metformin-associated oselabeledaefromdruguse was characterized by oselabeledaefromdruguse (>5 mmol/Liter), oselabeledaefromdruguse (without evidence of nonoseaenegation or nonoseaenegation , and an oselabeledaefromdruguse metformin plasma levels were generally >5 mcg/mL. Metformin decreases liver uptake of lactate oselabeledaefromdruguse which may increase the risk of oselabeledaefromdruguse especially in patients at risk.",1,0,0,1,0,0,1,0,0,0,0,0,0,0


### 2.2 Human-in-the-Loop Review and Rule Generation

Our next objective is to review the sentences that are positive instances of the subset of the ground-truth labels that we are interested in, using a combination of tools, including manual review, n-gram frequency analysis, dependency parsing, topic modeling, and embeddings. Our hope is that these tools can help us to discover latent pattern(s) that we can use to develop a sentence-level Boolean classifier that will offer improved performance and efficiency relative to the status quo. We've divided up this part so that each ground-truth label gets its own subsection, as some of the rules we'll develop are label-specific. 

To start, let's define a few helper functions that we can use for each bucket:


In [498]:
def filter_feature_matrix(list_of_reasons, f_matrix, keep_only_true=True, replace_reason_name_w_noun=False, reason_replace_dict=None):
    
    '''
    We can use this function to subset the feature matrix by ground-truth label(s) and/or to keep only rows where the value for >=1 of the labels we've selected is TRUE
    '''
    
    cols = ["file", "sent_id", "sent_start", "sent_end", "sentence"]
    cols += list_of_reasons
    
    if replace_reason_name_w_noun and reason_replace_dict is not None:

        for r in list_of_reasons:
        
            f_matrix.loc[:, "sentence"] = ["{}".format(sent).replace(r, reason_replace_dict[r]) for sent in f_matrix.loc[:, "sentence"]]
    
    # For initial review, it can be helpful to only review the positive instances
    if keep_only_true:
        
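        # Copy so that adding row_sum does not modify (or warn about) the original matrix
        subset_df = f_matrix.loc[:, cols].copy()
        subset_df['row_sum'] = f_matrix.loc[:, list_of_reasons].sum(axis=1)
        
        return subset_df[subset_df["row_sum"] >= 1]
    
    # But it's important not to forget the sentences w/ FALSE values for the subset of labels we've specified 
    else:
        # Return every sentence, restricted to the selected columns
        return f_matrix.loc[:, cols]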

In [471]:
def generate_ngrams(f_mat, r_list, n=3, filter_punc=True, filter_ngrams_by_r=True):
    
    ngram_df = pd.DataFrame()
    
    cols = ["drug_name", "n"]
    
    for i in range(n):
        cols.append("token_{}".format(i))

    counter = 0
    
    for i, row in f_mat.iterrows():
        drug_name = row['file'].lower()
        doc = textacy.Doc(u'{}'.format(row['sentence']), lang='en')
        ngrams = textacy.extract.ngrams(doc, n, filter_stops=False, filter_punct=filter_punc, filter_nums=False, min_freq=1)
        
        if  filter_ngrams_by_r:
        
            ngrams = [x for x in ngrams if any(reason in str(x) for reason in r_list)]
        
        for ngram in ngrams:
            ngram_df.loc[counter, "drug_name"] = drug_name
            ngram_df.loc[counter, "n"] = n
            ngram_df.loc[counter, "reason"] = None

            # Use j here so the outer loop's row index i is not shadowed
            for j in range(n):
                token_text = ngram[j].text
                ngram_df.loc[counter, "token_{}".format(j)] = token_text

                # Record which ground-truth label this n-gram contains
                if token_text.strip() in r_list:
                    ngram_df.loc[counter, 'reason'] = token_text
            counter += 1
        
    return ngram_df

In [369]:
def assess_model_performance(ground_truth_df, generated_df, r_list):
    pass
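
The evaluation function above is still a stub. One possible way to score a rule, sketched here as an assumption rather than the final implementation, is to compare the predicted and ground-truth 0/1 values for a single label column and report precision, recall, and F1. The function name `binary_scores` and the toy vectors are hypothetical.

```python
def binary_scores(y_true, y_pred):
    """Precision, recall, and F1 for two equal-length sequences of 0/1 values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a label is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy example: one value per sentence for a single label column
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = binary_scores(y_true, y_pred)
print(scores)
```

Applied to the feature matrix, `y_true` would be one label column (for example `nonoseaenegation`) and `y_pred` the output of a rule evaluated on the same sentences in the same order.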

#### Bucket A: Indication; Contraindication; Risk factor



In [352]:
labels_a = ["notaecandidateindication", "notaecandidatecontraindication"] # TODO: find label for risk factor 
subdf_a = filter_feature_matrix(labels_a, feat_mat, keep_only_true=True)
subdf_a.shape

(998, 8)

#### Bucket B: Withdrawal; Pregnancy 

In [379]:
# We have to consider contraindication here again because pregnancy mentions show up under that type_reason. 
labels_b = ["notaecandidatecontraindication", "nonoseaeodorwithdrawal"] 
subdf_b = filter_feature_matrix(labels_b, feat_mat, keep_only_true=True)
print(subdf_b.shape)


(86, 8)


In [396]:
ngramdf_b = generate_ngrams(subdf_b, labels_b, n=5, filter_punc=True)

In [397]:
# Looking at the n-gram results for these labels can help us build upon our existing knowledge of the world 
# to develop an initial set of terms, and/or patterns

ngramdf_b

Unnamed: 0,drug_name,n,reason,token_0,token_1,token_2,token_3,token_4
0,anoro,5.0,nonoseaeodorwithdrawal,nonoseaeodorwithdrawal,and,fatalities,have,been
1,lioresal,5.0,nonoseaeodorwithdrawal,nonoseaeodorwithdrawal,had,their,treatment,temporarily
2,lioresal,5.0,nonoseaeodorwithdrawal,nonoseaeodorwithdrawal,has,been,reported,following
3,lioresal,5.0,nonoseaeodorwithdrawal,nonoseaeodorwithdrawal,or,early,depletion,of
4,lioresal,5.0,nonoseaeodorwithdrawal,nonoseaeodorwithdrawal,symptoms,have,also,been
5,lioresal,5.0,nonoseaeodorwithdrawal,nonoseaeodorwithdrawal,may,develop,or,recur
6,lioresal,5.0,nonoseaeodorwithdrawal,nonoseaeodorwithdrawal,and,with,withdrawal,from
7,xarelto,5.0,nonoseaeodorwithdrawal,nonoseaeodorwithdrawal,was,observed,during,the
8,lipitor,5.0,notaecandidatecontraindication,notaecandidatecontraindication,are,contraindications,to,the
9,aubagio,5.0,notaecandidatecontraindication,notaecandidatecontraindication,or,women,of,childbearing
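
One simple way to summarize output like this is to count which tokens most often follow the label placeholder. The sketch below does so with the standard library, using the `reason` and `token_1` values from the ten rows shown above; with the full n-gram dataframe, the same count could be taken over the whole column.

```python
from collections import Counter

# (reason, token_1) pairs copied from the rows displayed above.
rows = [
    ("nonoseaeodorwithdrawal", "and"),
    ("nonoseaeodorwithdrawal", "had"),
    ("nonoseaeodorwithdrawal", "has"),
    ("nonoseaeodorwithdrawal", "or"),
    ("nonoseaeodorwithdrawal", "symptoms"),
    ("nonoseaeodorwithdrawal", "may"),
    ("nonoseaeodorwithdrawal", "and"),
    ("nonoseaeodorwithdrawal", "was"),
    ("notaecandidatecontraindication", "are"),
    ("notaecandidatecontraindication", "or"),
]

# Tally the token that immediately follows each label placeholder
next_token_counts = {}
for reason, token in rows:
    next_token_counts.setdefault(reason, Counter())[token] += 1

for reason, counter in next_token_counts.items():
    print(reason, counter.most_common(3))
```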


In [373]:
# For contraindication associated with pregnancy: women; childbearing; female; teratogenic effect(s)
# For withdrawal/overdose: withdrawal, overdose; overdosed; abrupt cessation; early depletion; too rapid

# Generalizations: for pregnancy, references to women, childbearing; pre/post-partum; nursing
# Generalizations: for withdrawal/overdose: explicit term references; temporal references that indicate suddenness;
# associated negative outcomes such as addiction/death
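
The term lists above translate fairly directly into a first-pass keyword rule. The following is a minimal sketch, not our rules engine: the function name `flag_bucket_b`, the exact regular expressions, and the example sentences are assumptions made for illustration.

```python
import re

# Terms drawn from the notes above; the regex forms are illustrative choices.
PREGNANCY_TERMS = re.compile(
    r"\b(women|childbearing|female|teratogenic|nursing|postpartum)\b",
    re.IGNORECASE,
)
WITHDRAWAL_OD_TERMS = re.compile(
    r"\b(withdrawal|overdosed?|abrupt cessation|early depletion|too rapid)\b",
    re.IGNORECASE,
)


def flag_bucket_b(sentence):
    """Return Boolean flags for pregnancy and withdrawal/overdose context."""
    return {
        "pregnancy_context": bool(PREGNANCY_TERMS.search(sentence)),
        "withdrawal_or_overdose_context": bool(WITHDRAWAL_OD_TERMS.search(sentence)),
    }


# Invented example sentences
examples = [
    "Symptoms may develop or recur after abrupt cessation of therapy.",
    "Contraindicated in women of childbearing potential.",
    "The most common adverse reaction was headache.",
]
flags = [flag_bucket_b(s) for s in examples]
for sentence, flag in zip(examples, flags):
    print(flag, sentence)
```

A keyword rule like this is deliberately permissive; it is meant as a cheap downselection step whose hits can then be reviewed or passed to more specific rules.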

#### Bucket C: Negation


In [507]:
labels_c = ["nonoseaenegation"]
labels_c_noun_replacements = {"nonoseaenegation":"unicorn"}
subdf_c = filter_feature_matrix(labels_c, 
                                feat_mat, 
                                keep_only_true=True, 
                                replace_reason_name_w_noun=False, 
                                reason_replace_dict=None)
subdf_c.shape
#subdf_c.head()

(186, 7)

In [503]:
ngramdf_c = generate_ngrams(subdf_c, labels_c, n=5, filter_punc=True, filter_ngrams_by_r=True)
#ngramdf_c = generate_ngrams(subdf_c, ["unicorn"], n=3, filter_punc=False, filter_ngrams_by_r=True)

In [None]:
# For negation: grammatical pattern "was/were not associated/observed prepositional phrase"
# "were noted/reported/confirmed/occurred in _modifier_"; 
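
The grammatical patterns noted above can be prototyped with regular expressions before moving to dependency-based rules. This sketch is an assumption about how such a rule might look, not our final rule set: the function name `is_negated`, the negator and verb lists, and the `max_gap` parameter are hypothetical. The first two test sentences are taken from the feature matrix output shown earlier; the last two are invented.

```python
import re

NEGATORS = r"(?:no|not|without)"
NEGATED_VERBS = r"(?:associated|observed|noted|reported)"


def is_negated(sentence, term, max_gap=4):
    """Check whether `term` appears in a simple negated context.

    Two patterns are tried:
      1. a negator followed by the term within `max_gap` intervening words
      2. the term followed (within `max_gap` words) by "was/were not <verb>"
    """
    t = re.escape(term)
    gap = r"(?:\W+\w+){0,%d}?" % max_gap
    negator_before = r"\b" + NEGATORS + r"\b" + gap + r"\W+" + t + r"\b"
    negation_after = (r"\b" + t + r"\b" + gap
                      + r"\W+(?:was|were)\W+not\W+" + NEGATED_VERBS + r"\b")
    return any(re.search(p, sentence, re.IGNORECASE)
               for p in (negator_before, negation_after))


label = "nonoseaenegation"
s1 = "There were no serious nonoseaenegation or nonoseaenegation reported in either group."
s2 = "In Trial 4 (subcutaneous dosing), there was no formation of nonoseaenegation in 556 patients."
s3 = "Headache was not observed in either group."
s4 = "Headache was reported in 5% of patients."

print(is_negated(s1, label), is_negated(s2, label))
print(is_negated(s3, "headache"), is_negated(s4, "headache"))
```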

