# Exploring NLP for Clinical Trial Records
## Installment 3: Extracting Intervention-related Information
### by Munif Mujib

In this notebook, our strategy will be to use specific verbs as _indicators_. From a sentence's syntactic dependency tree, we'll follow the verb to collect sentence fragments that may contain useful information about interventions. We'll also develop a system that uses proper nouns to collect a list of candidate verbs.

## Getting tools in place

We'll import the necessary modules and previously defined functions first.

In [1]:
from __future__ import division
import xml.etree.ElementTree as ET
import re, glob, time
import json
from collections import defaultdict, Counter
import nltk
import random
import spacy

In [2]:
nlp = spacy.load("en")

In [3]:
def process_node(parent, parent_dict):
    if list(parent):
        for child in parent:
            child_dict = {}
            child_dict = process_node(child, child_dict)
            parent_dict[child.tag] = child_dict
    else:
        parent_dict["field_value"] = parent.text
    return parent_dict

def load_trial(trial_id, root_dir = "../clinicaltrials_data/trials/"):
    directory = trial_id[:-4] + "xxxx"
    filepath = root_dir + directory + "/" + trial_id + ".xml"
    
    tree = ET.parse(filepath)
    root = tree.getroot()
    
    root_dict = {}
    return process_node(root, root_dict)

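def find_textblocks(field, textblocks = None, current = "root"):
    # Avoid a shared mutable default: start a fresh list on each top-level call
    if textblocks is None:
        textblocks = []
    subfields = list(field.keys())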
    if not subfields == ["field_value"]:
        if "textblock" in subfields:
            textblocks.append((current, field["textblock"]["field_value"]))
        if len(subfields) > 1:    
            for subfield in subfields:
                if not subfield == "textblock":
                    textblocks = find_textblocks(field[subfield], textblocks, current = current + ">" + subfield)
    return textblocks

def process_textblock(textblock):
    utext = textblock
    lines = re.sub(r"([^\n.!?\"])\n\n",r"\1.\n\n",utext)
    lines = re.sub("\n*[ ]+", " ", lines)
    lines = lines.strip()
    return lines

def get_textblocks(textblocks):
    textblocks_dict = {}
    for textblock in textblocks:
        levels = re.split(">", textblock[0])
        key = ">".join(levels[1:])
        textblocks_dict[key] = process_textblock(textblock[1])
    return textblocks_dict

def extract_textblocks(trial_id, root_dir = "../clinicaltrials_data/trials/"):
    return get_textblocks(find_textblocks(load_trial(trial_id, root_dir = root_dir), textblocks = []))

def create_regex_format(x):
    regex = re.compile(r"[^a-z0-9](" + x + r")[^a-z0-9]")
    return regex
    
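def find_indicators(element, indicators_dict = None, current = "root"):
    # Avoid a shared mutable default: start a fresh dict on each top-level call
    if indicators_dict is None:
        indicators_dict = {}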
    if type(element) == dict:
        subelements = element.keys()
        if "indicators" in subelements and element["indicators"]:
            indicators_dict[re.sub("refinements>", "",current)] = (map(create_regex_format, element["indicators"]))
        if len(subelements) > 1:
            for subelement in subelements:
                if type(element[subelement]) == dict:
                    indicators_dict = find_indicators(
                        element[subelement], indicators_dict, current = current + ">" + subelement
                    )
    return indicators_dict

def match_indicators(d, indicators_dict, secID):
    lines = d.get(secID,"")
    found = defaultdict(list)
    for sentNum, sentence in enumerate(nltk.sent_tokenize(lines)):
        for group in indicators_dict:
            matches = []
            for n, indicator in enumerate(indicators_dict[group]):
                padded = " " + sentence.lower().strip() + " "
                if indicator.search(padded):
                    nuggets = indicator.findall(padded)
                    matches.append(re.sub(r"^.*?\]\((.*?)\)\[.*?$",r"\1",indicator.pattern))
            if matches:
                found[group].append((matches, str(sentNum)))
    return found
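
As a quick sanity check on the first two helpers, the sketch below runs the same logic as `process_node` and `process_textblock` on a small, made-up record. The XML content is hypothetical and only mimics the shape of a trial file; the helpers are repeated here so the snippet runs on its own.

```python
import re
import xml.etree.ElementTree as ET

def process_node(parent, parent_dict):
    # Same recursion as the notebook: children become nested dicts,
    # leaves store their text under "field_value".
    if list(parent):
        for child in parent:
            parent_dict[child.tag] = process_node(child, {})
    else:
        parent_dict["field_value"] = parent.text
    return parent_dict

def process_textblock(textblock):
    # Add a period before blank lines that lack end punctuation,
    # then collapse newlines plus indentation into single spaces.
    lines = re.sub(r"([^\n.!?\"])\n\n", r"\1.\n\n", textblock)
    lines = re.sub("\n*[ ]+", " ", lines)
    return lines.strip()

# Hypothetical record, written with explicit "\n" so the blank line is exact.
SAMPLE_XML = (
    "<clinical_study>"
    "<brief_summary><textblock>\n"
    "      Inclusion Criteria\n"
    "\n"
    "      - Age 18 or older\n"
    "      - Signed consent\n"
    "    </textblock></brief_summary>"
    "<keyword>first</keyword>"
    "<keyword>second</keyword>"
    "</clinical_study>"
)

record = process_node(ET.fromstring(SAMPLE_XML), {})
summary = process_textblock(record["brief_summary"]["textblock"]["field_value"])

print(summary)
print(record["keyword"])
```

Two things are worth noticing. The heading line gains a period, so the sentence tokenizer later sees a boundary there. And because `process_node` stores children by tag name, repeated sibling tags overwrite one another: only the last `keyword` survives in this toy record.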

In [6]:
def print_parse(sentence):
    doc = nlp(sentence)
    for token in doc:
        print(token.text, token.dep_, token.pos_, token.head.text, token.head.pos_,
             [child for child in token.children])

We'll also load our pre-generated _compound_ index of trial, section, and sentence IDs and the `retrieve_sentences` function, which helps us quickly access example sentences.

In [4]:
# Use a context manager so the file handle is closed after loading
with open("../compounded.json", "r") as index_file:
    compounded = json.load(index_file)

In [5]:
def retrieve_sentences(group = "root>burden", 
                       indicator = "(?:(?:minute|min\\.?|hour|hr\\.?|day|week|wk\\.?|month|mo\\.?|year|yr\\.?)s? (?:\\d+-|\\d+\\.)*\\d+|(?:\\d+-|\\d+\\.)*\\d+ (?:minute|min\\.?|hour|hr\\.?|day|week|wk\\.?|month|mo\\.?|year|yr\\.?)s?)", 
                       secIDs = ["brief_summary", "detailed_description"], 
                       n = 10, 
                       seed = 42,
                       index = compounded,
                       root_dir = "../clinicaltrials_data/trials/"
                      ):
    secIDs = re.compile("|".join(secIDs))
    sentIDs = filter(lambda x: secIDs.search(x), index.get(group, {}).get(indicator, []))
    random.seed(seed)
    lines = []
    
    if len(sentIDs) > n:
        sentIDs = random.sample(sentIDs, n)
    for sentID in sentIDs:
        IDparts = re.split(r"\.", sentID)  # [trial ID, section key, sentence number]
        textblocks = extract_textblocks(IDparts[0], root_dir = root_dir)
        line = nltk.sent_tokenize(textblocks[IDparts[1]])[int(IDparts[2])]
        if type(line) == str:
            line = unicode(line, "utf-8")
        lines.append(line)

    return lines, sentIDs
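
The layout of the compound index is not shown here, but the lookups inside `retrieve_sentences` imply a mapping of group to indicator to a list of sentence IDs, where each ID has the form `trial.section.sentence_number`. The sketch below illustrates that reading with a tiny in-memory index. The trial IDs and the helper names `parse_sentence_id` and `filter_sentence_ids` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical miniature of compounded.json: group -> indicator -> sentence IDs
toy_index = {
    "root>burden>active period": {
        "receive": [
            "NCT00000001.brief_summary.0",
            "NCT00000001.detailed_description.3",
            "NCT00000002.eligibility>criteria.7",
        ]
    }
}

def parse_sentence_id(sent_id):
    # Split "trial.section.sentence_number" into its three parts.
    trial_id, section, sent_num = re.split(r"\.", sent_id)
    return trial_id, section, int(sent_num)

def filter_sentence_ids(index, group, indicator, sec_ids):
    # Keep only the IDs whose section matches one of the requested sections.
    pattern = re.compile("|".join(sec_ids))
    candidates = index.get(group, {}).get(indicator, [])
    return [sent_id for sent_id in candidates if pattern.search(sent_id)]

selected = filter_sentence_ids(
    toy_index, "root>burden>active period", "receive",
    ["brief_summary", "detailed_description"],
)
parsed = [parse_sentence_id(sent_id) for sent_id in selected]
print(parsed)
```

The list comprehension stands in for the notebook's `filter` call. Under Python 2 the two behave the same, but under Python 3 `filter` returns an iterator, so the later `len(sentIDs)` check would need a list.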

## Navigating the syntax tree

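First, we'll need the handy `find_start` function, which locates the root of an indicator inside a sentence: the lowest token in the dependency tree that is, or is an ancestor of, every token in the matched phrase.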

In [7]:
def find_start(pieces, sentence):
    for piece in pieces:
        for i in range(len(sentence) - len(piece) + 1):
            j = i + len(piece)
            segment = sentence[i:j]
            ancestors = set()
            if tuple([token.text for token in segment]) == piece:
                for token in segment:
                    token_ancestors = set([token])
                    token_ancestors = token_ancestors.union(token.ancestors)
                    if ancestors:
                        ancestors = ancestors.intersection(token_ancestors)
                    else:
                        ancestors = ancestors.union(token_ancestors)
                for token in ancestors:
                    if not set(token.children).intersection(ancestors):
                        yield token, segment
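
To make the ancestor-intersection idea concrete without loading a parser, here is a minimal sketch on a hand-built tree. `ToyToken`, `build_tree`, and `lowest_common_ancestor` are hypothetical stand-ins, and the head indices are assigned by hand rather than produced by spaCy. The function returns the single lowest shared ancestor instead of yielding matches as `find_start` does.

```python
class ToyToken(object):
    """Minimal stand-in for a spaCy token: text, head, children, ancestors."""
    def __init__(self, text):
        self.text = text
        self.head = self      # the root is its own head, as in spaCy
        self.children = []

    @property
    def ancestors(self):
        node = self
        while node.head is not node:
            node = node.head
            yield node

def build_tree(words, heads):
    # heads[i] is the index of the head of words[i]; the root points to itself.
    tokens = [ToyToken(word) for word in words]
    for token, head_index in zip(tokens, heads):
        token.head = tokens[head_index]
        if token.head is not token:
            token.head.children.append(token)
    return tokens

def lowest_common_ancestor(segment):
    # Intersect each token's chain (itself plus its ancestors) ...
    common = None
    for token in segment:
        chain = set([token]) | set(token.ancestors)
        common = chain if common is None else common & chain
    # ... then pick the shared token that has no child in the shared set.
    for token in common:
        if not set(token.children) & common:
            return token

words = ["Patients", "receive", "oral", "etoposide", "twice", "daily"]
heads = [1, 1, 3, 1, 5, 1]   # hand-assigned, not a real parse
tokens = build_tree(words, heads)

print(lowest_common_ancestor(tokens[2:4]).text)   # segment: oral etoposide
print(lowest_common_ancestor(tokens[3:5]).text)   # segment: etoposide twice
```

For a phrase that forms a single subtree ("oral etoposide"), the result is the head of that phrase. For a span that straddles two subtrees ("etoposide twice"), the result climbs to the verb that governs both.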

Now we'll define our extraction function, `extract_intervention`. Like the extraction function for temporal indicators in Installment 2, it takes a sentence and a list of indicator-matched fragments as input. Here, the indicator matches are simply verbs that we have designated as intervention-related indicators. Using `find_start` to identify the starting point of the tree navigation, we collect the verb along with its auxiliaries and negations. Then we collect the nouns in all of its subjects and direct objects.

In [47]:
def extract_intervention(sentence, matches):
    sentence = nlp(unicode(sentence))
    indicator_tokens = set()
    for match in matches:
        indicator_tokens.add(tuple([token.text for token in nlp(unicode(match))]))
    
    for start, segment in find_start(indicator_tokens, sentence):
        verb_fragment = set()
        subj_fragments = []
        obj_fragments = []
        children = set()
        subjs = []
        objs = []
        
        head = start
        verb_fragment.add(head)
        
        for child in head.children:
            children.add(child)

        while children:
            child = children.pop()
            if re.search("subj", child.dep_):
                subjs.append(child)
            elif re.search("dobj", child.dep_):
                objs.append(child)
            elif re.search("neg|aux", child.dep_):
                verb_fragment.add(child)
            else:
                for grandchild in child.children:
                    children.add(grandchild)
                    
        verb_fragment = sorted(list(verb_fragment), key = lambda x: x.i)
        
        for subj in subjs:
            subj_fragment = set()
            for token in subj.subtree:
                if token.pos_ == "NOUN" or token.pos_ == "PROPN":
                    subj_fragment.add(token)
            parent = subj
            while parent != head:
                subj_fragment.add(parent)
                parent = parent.head
            subj_fragment = sorted(list(subj_fragment), key = lambda x: x.i)
            subj_fragments.append(subj_fragment)
        
        for obj in objs:
            obj_fragment = set()
            for token in obj.subtree:
                if token.pos_ == "NOUN" or token.pos_ == "PROPN":
                    obj_fragment.add(token)
            parent = obj
            while parent != head:
                obj_fragment.add(parent)
                parent = parent.head
            obj_fragment = sorted(list(obj_fragment), key = lambda x: x.i)
            obj_fragments.append(obj_fragment)
            
        fragment = [subj_fragments, verb_fragment, obj_fragments]
        yield fragment
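
The core of `extract_intervention` is the loop that sorts the verb's dependents by dependency label. The sketch below isolates that step on a hand-labelled toy sentence. `ToyToken` and `split_verb_arguments` are hypothetical names, and the labels are assigned by hand under the assumption that a parser would produce something similar.

```python
import re

class ToyToken(object):
    """Minimal stand-in for a spaCy token with an index, text and label."""
    def __init__(self, i, text, dep):
        self.i, self.text, self.dep_ = i, text, dep
        self.children = []

def split_verb_arguments(head):
    # Mirror the notebook's loop: subjects and direct objects are set aside,
    # auxiliaries and negations join the verb, anything else is descended into.
    verb, subjs, objs = [head], [], []
    pending = list(head.children)
    while pending:
        child = pending.pop()
        if re.search("subj", child.dep_):
            subjs.append(child)
        elif re.search("dobj", child.dep_):
            objs.append(child)
        elif re.search("neg|aux", child.dep_):
            verb.append(child)
        else:
            pending.extend(child.children)
    verb.sort(key=lambda token: token.i)
    return subjs, verb, objs

# Hand-labelled toy sentence (labels are assumptions, not parser output)
words = [("Volunteers", "nsubj"), ("will", "aux"), ("not", "neg"),
         ("routinely", "advmod"), ("receive", "ROOT"), ("results", "dobj")]
tokens = [ToyToken(i, text, dep) for i, (text, dep) in enumerate(words)]
head = tokens[4]
head.children = [token for token in tokens if token is not head]

subjs, verb, objs = split_verb_arguments(head)
print(" ".join(token.text for token in verb))
```

Because the patterns are substring searches, "subj" also covers labels such as `nsubjpass`, and "aux" covers `auxpass`. The final `else` branch descends into every other dependent, so subjects and objects that belong to embedded clauses can end up attached to the indicator verb as well.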

To test the function, we'll retrieve example sentences for the indicator "receive".

In [12]:
examples, sentIDs = retrieve_sentences(group = "root>burden>active period", indicator = u"receive", n = 20, seed = 42)

To print our output in a readable format, we'll define a handy `print_intervention_output` function.

In [81]:
def print_intervention_output(sentence, verb):
    print sentence
    print ""
    outputs = list(extract_intervention(sentence, [verb]))
    
    for output in outputs:
        subject_string = ""
        object_string = " "

        if len(output[0]) > 1:
            for subject in output[0]:
                words = [token.text for token in subject]
                single_subject_string = " ".join(words)
                subject_string = subject_string + single_subject_string + ", "
            subject_string = subject_string[:-2] + " "
        elif len(output[0]) == 1:
            subject_string += " ".join([token.text for token in output[0][0]]) + " "

        verb_string = " ".join([token.text for token in output[1]])

        if len(output[2]) > 1:
            for objet in output[2]:
                words = [token.text for token in objet]
                single_object_string = " ".join(words)
                object_string = object_string + single_object_string + ", "
            object_string = object_string[:-2]
        elif len(output[2]) == 1:
            object_string = object_string + " ".join([token.text for token in output[2][0]])

        print subject_string + verb_string + object_string
        print ""
    print "____________________________________________\n"

Now we can use all of this to examine some output from the test sentences:

In [82]:
for example in examples:
    print_intervention_output(example, "receive")

Finally, group C, including 10 patients, will receive anti-VEGF in a different dose (injection volume) as compared to S1.

group C patients will receive 

____________________________________________

Arm I: Patients receive oral estramustine three times a day and oral etoposide twice daily on days 1-14 and paclitaxel IV over 1 hour on day 2.

Patients receive estramustine, paclitaxel IV

____________________________________________

In this study, all patients will receive chemotherapy and radiation therapy.

patients will receive chemotherapy radiation therapy

____________________________________________

Volunteers who provide samples for these studies will not routinely receive their individual results from the Additional Investigation.

Volunteers who samples studies will not receive results

____________________________________________

Further, women who received HOPE displayed fewer PTSD arousal and avoidance symptoms of PTSD, less depression, and greater social support and em

#### Is this output useful to a patient?

Some of the extracted sentence fragments are readable and provide concrete information that a potential trial participant may find useful. Sometimes the output is broken, most often due to the difficulty of parsing complex sentences. Sometimes the output simply does not contain enough information to be useful. In the examples above, for instance, the first fragment is missing the object of "receive" (anti-VEGF), and the second drops etoposide from the list of drugs. Furthermore, these are still fragments of difficult-to-read sentences, and no real summarization or simplification is being performed.

## Counting verbs

Now we'll build a system that goes through trial records and counts verbs. The resulting frequency list can point us to indicators that might be useful (e.g., as input to the `extract_intervention` function).

Our strategy will be to take a sentence and find any proper nouns in it. Then, starting from each proper noun, we'll walk up the dependency tree until we reach a verb or the root of the sentence. If the token we stop at is a verb, we'll collect it.

To do all this, we'll write three functions:

In [83]:
def find_propns(tokens):
    propn_locations = []
    for token in tokens:
        if token.pos_ == "PROPN":
            propn_locations.append(token.i)
    return propn_locations

def collect_verb(sentence, ind):
    token = sentence[ind]
    while token.pos_ != "VERB" and token.dep_ != "ROOT":
        token = token.head
    if token.pos_ == "VERB":
        return token
    
def find_propns_then_find_verbs(sentence):
    sentence = nlp(unicode(sentence))
    
    propn_locs = find_propns(sentence)
    
    verbs = set()
    for ind in propn_locs:
        verb = collect_verb(sentence, ind)
        if verb:
            verbs.add(verb)
    return verbs
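
The upward walk in `collect_verb` can be traced by hand on a toy tree. In the sketch below, `ToyToken` is a hypothetical stand-in for a spaCy token, and the part-of-speech tags, labels, and heads are assigned manually. The function takes a token directly rather than a sentence and an index.

```python
class ToyToken(object):
    """Minimal stand-in for a spaCy token with pos_, dep_ and head."""
    def __init__(self, text, pos, dep):
        self.text, self.pos_, self.dep_ = text, pos, dep
        self.head = self      # the root is its own head, as in spaCy

def collect_verb(token):
    # Climb until we hit a verb or the root of the sentence.
    while token.pos_ != "VERB" and token.dep_ != "ROOT":
        token = token.head
    if token.pos_ == "VERB":
        return token
    return None

# "... treated at Hospital": two steps up from the proper noun to the verb
treated = ToyToken("treated", "VERB", "ROOT")
at = ToyToken("at", "ADP", "prep")
hospital = ToyToken("Hospital", "PROPN", "pobj")
at.head = treated
hospital.head = at

# "Inclusion Criteria": a heading whose root is not a verb
criteria = ToyToken("Criteria", "PROPN", "ROOT")
inclusion = ToyToken("Inclusion", "PROPN", "compound")
inclusion.head = criteria

print(collect_verb(hospital).text)
print(collect_verb(inclusion))
```

The second case shows why `find_propns_then_find_verbs` checks `if verb:` before adding to the set: verbless fragments such as headings reach the root without finding a verb, and nothing is returned.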

Now, we'll create a `Counter` object and a function to help populate the counter with verb counts:

In [90]:
verb_counter = Counter()

def count_verbs(verbs):
    if verbs:
        for verb in verbs:
            verb_counter[(verb.text).lower()] += 1
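
One detail of this design is easy to miss: each sentence contributes a set of verb tokens, so a verb reached from several proper nouns is counted once, while two separate occurrences of the same word are counted twice. The sketch below illustrates this with `(position, text)` tuples standing in for tokens. The data is made up, and this variant of `count_verbs` takes the counter as an argument instead of using a global.

```python
from collections import Counter

def count_verbs(verbs, counter):
    # Each element is a (position, text) pair standing in for a verb token.
    for position, text in verbs:
        counter[text.lower()] += 1
    return counter

toy_counter = Counter()

# Sentence 1: two proper nouns both lead to the same verb token at position 2,
# so the set holds it only once.
sentence_1 = set([(2, "Received"), (2, "Received")])

# Sentence 2: the same word occurs as two separate tokens, plus one other verb.
sentence_2 = set([(1, "received"), (6, "received"), (9, "is")])

for verbs in (sentence_1, sentence_2):
    count_verbs(verbs, toy_counter)

print(toy_counter.most_common())
```

Lowercasing merges "Received" and "received", but inflected forms such as "receive" and "receiving" still get separate entries.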

Now, we'll take a handful of random trial IDs and test this counting system on them.

In [85]:
trial_ids = [
    'NCT01711775',
    'NCT02308540',
    'NCT00122642',
    'NCT02407860',
    'NCT00617630',
    'NCT02587949',
    'NCT03101254',
    'NCT02621502',
    'NCT00458601',
    'NCT00476970',
    'NCT00162292',
    'NCT00708656',
    'NCT02170350',
    'NCT00463710',
    'NCT01503853',
    'NCT00001497',
    'NCT02782039',
    'NCT02108431',
    'NCT03245203',
    'NCT00001664',
    'NCT00639028',
    'NCT00381030',
    'NCT02973100',
    'NCT00528047',
    'NCT00559702',
    'NCT02339298',
    'NCT02309437',
    'NCT00496691',
    'NCT01588275',
    'NCT01928641',
    'NCT01972854',
    'NCT03089385',
    'NCT01978704',
    'NCT01205009',
    'NCT03441022',
    'NCT03405025',
    'NCT03301636',
    'NCT00793403',
    'NCT02314390',
    'NCT00942214',
    'NCT00634946',
    'NCT02929056',
    'NCT03136302',
    'NCT01712061',
    'NCT02018705',
    'NCT00954317',
    'NCT03453814',
    'NCT02998398',
    'NCT01256736',
    'NCT01678664',
    'NCT03319732',
    'NCT02269189',
    'NCT02477332',
    'NCT01821911',
    'NCT02386787',
    'NCT01289964',
    'NCT01710163',
    'NCT02867605',
    'NCT01053676',
    'NCT01679782',
    'NCT00003509',
    'NCT00042185',
    'NCT02094092',
    'NCT02344043',
    'NCT01454414',
    'NCT02998424',
    'NCT03180424',
    'NCT03266146',
    'NCT00703352',
    'NCT02614612',
    'NCT00004446',
    'NCT01212237',
    'NCT02191007',
    'NCT00561873',
    'NCT00068081',
    'NCT01820663',
    'NCT02558868',
    'NCT03249064',
    'NCT00329355',
    'NCT01113931',
    'NCT01150032',
    'NCT01324895',
    'NCT03217539',
    'NCT02278263',
    'NCT03277742',
    'NCT01690260',
    'NCT03086889',
    'NCT02068235',
    'NCT00876733',
    'NCT01951339',
    'NCT03327272',
    'NCT01635764',
    'NCT00194415',
    'NCT01839019',
    'NCT02178423',
    'NCT02229656',
    'NCT00465205',
    'NCT01392053',
    'NCT01430598',
    'NCT01689766'
            ]

In [95]:
for trial_id in trial_ids:
    textblocks = extract_textblocks(trial_id)
    for sec in textblocks:
        sentences = nltk.sent_tokenize(textblocks[sec])
        for n, sentence in enumerate(sentences):
            count_verbs(find_propns_then_find_verbs(sentence))

Let's take a look at the counted verbs, sorted from most frequently occurring to least:

In [97]:
verb_counter.most_common()

[(u'is', 265),
 (u'are', 100),
 (u'treated', 75),
 (u'have', 75),
 (u'be', 75),
 (u'performed', 65),
 (u'evaluate', 65),
 (u'received', 60),
 (u'using', 55),
 (u'administered', 55),
 (u'used', 50),
 (u'receive', 45),
 (u'defined', 45),
 (u'compare', 40),
 (u'including', 40),
 (u'based', 40),
 (u'confirmed', 40),
 (u'assess', 40),
 (u'include', 35),
 (u'diagnosed', 35),
 (u'measured', 35),
 (u'assessed', 35),
 (u'considered', 35),
 (u'according', 35),
 (u'associated', 35),
 (u'identified', 30),
 (u'investigate', 30),
 (u'determine', 30),
 (u'compared', 30),
 (u'reported', 25),
 (u'followed', 25),
 (u'screening', 25),
 (u'provide', 25),
 (u'related', 25),
 (u'measure', 25),
 (u'receiving', 25),
 (u'count', 25),
 (u'taken', 20),
 (u'known', 20),
 (u'relapsing', 20),
 (u'found', 20),
 (u'conducted', 20),
 (u'determined', 20),
 (u'improve', 20),
 (u'evaluated', 20),
 (u'had', 20),
 (u'has', 20),
 (u'admitted', 20),
 (u'called', 20),
 (u'managed', 15),
 (u'prevent', 15),
 (u'sign', 15),
 (u'

#### Summing up:

While this is a slow process, it is exhaustive. We can count up all the verbs associated with proper nouns occurring across all of the nearly 260,000 trial records using this method. Reviewing the collected list of verbs, along with their frequencies, should yield some useful indicators for intervention.