# Exploring NLP for Clinical Trial Records
## Installment 2: Extracting Temporal Information
### by Munif Mujib

In this notebook, we'll explore a syntactic-parsing strategy for extracting information about patient scheduling from unstructured trial description text. Then, we'll try a more lightweight exercise: estimating the potential duration of a trial without leveraging grammar and syntax.

## Getting tools in place

First, we'll import the necessary modules and load the previously developed tools for accessing the trial records.

In [1]:
from __future__ import division
import xml.etree.ElementTree as ET
import re, glob, time
import json
from collections import defaultdict, Counter
import nltk
import random
import spacy

The helpful module we'll be using for grammar and syntax is `spacy`. We'll need to load `spacy`'s English language engine to leverage its capabilities:

In [2]:
nlp = spacy.load("en")

Next, we'll need the functions defined in the last installment to access the trial records:

In [4]:
def process_node(parent, parent_dict):
    if list(parent):
        for child in parent:
            child_dict = {}
            child_dict = process_node(child, child_dict)
            parent_dict[child.tag] = child_dict
    else:
        parent_dict["field_value"] = parent.text
    return parent_dict

def load_trial(trial_id, root_dir = "../clinicaltrials_data/trials/"):
    directory = trial_id[:-4] + "xxxx"
    filepath = root_dir + directory + "/" + trial_id + ".xml"
    
    tree = ET.parse(filepath)
    root = tree.getroot()
    
    root_dict = {}
    return process_node(root, root_dict)

def find_textblocks(field, textblocks = None, current = "root"):
    if textblocks is None:  # a mutable default list would be shared across calls
        textblocks = []
    subfields = field.keys()
    if not subfields == ["field_value"]:
        if "textblock" in subfields:
            textblocks.append((current, field["textblock"]["field_value"]))
        if len(subfields) > 1:    
            for subfield in subfields:
                if not subfield == "textblock":
                    textblocks = find_textblocks(field[subfield], textblocks, current = current + ">" + subfield)
    return textblocks

def process_textblock(textblock):
    utext = textblock
    lines = re.sub(r"([^\n.!?\"])\n\n",r"\1.\n\n",utext)
    lines = re.sub("\n*[ ]+", " ", lines)
    lines = lines.strip()
    return lines

def get_textblocks(textblocks):
    textblocks_dict = {}
    for textblock in textblocks:
        levels = re.split(">", textblock[0])
        key = ">".join(levels[1:])
        textblocks_dict[key] = process_textblock(textblock[1])
    return textblocks_dict

def extract_textblocks(trial_id, root_dir = "../clinicaltrials_data/trials/"):
    return get_textblocks(find_textblocks(load_trial(trial_id, root_dir = root_dir), textblocks = []))

def create_regex_format(x):
    regex = re.compile(r"[^a-z0-9](" + x + r")[^a-z0-9]")
    return regex
    
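def find_indicators(element, indicators_dict = None, current = "root"):
    if indicators_dict is None:  # a mutable default dict would be shared across calls
        indicators_dict = {}
    if type(element) == dict: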
        subelements = element.keys()
        if "indicators" in subelements and element["indicators"]:
            indicators_dict[re.sub("refinements>", "",current)] = (map(create_regex_format, element["indicators"]))
        if len(subelements) > 1:
            for subelement in subelements:
                if type(element[subelement]) == dict:
                    indicators_dict = find_indicators(
                        element[subelement], indicators_dict, current = current + ">" + subelement
                    )
    return indicators_dict

def match_indicators(d, indicators_dict, secID):
    lines = d.get(secID,"")
    found = defaultdict(list)
    for sentNum, sentence in enumerate(nltk.sent_tokenize(lines)):
        for group in indicators_dict:
            matches = []
            for n, indicator in enumerate(indicators_dict[group]):
                padded = " " + sentence.lower().strip() + " "
                if indicator.search(padded):
                    nuggets = indicator.findall(padded)
                    matches.append(re.sub(r"^.*?\]\((.*?)\)\[.*?$",r"\1",indicator.pattern))
            if matches:
                found[group].append((matches, str(sentNum)))
    return found
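
To make the nested-dictionary format concrete, here is a small self-contained sketch that runs copies of `process_node` and `process_textblock` on a made-up record. The `TOY_XML` string is an invented stand-in for a real trial file, and the two functions are condensed copies of the ones above so that the cell runs on its own.

```python
import re
import xml.etree.ElementTree as ET

def process_node(parent, parent_dict):
    # Same logic as above: recurse into children, store leaf text under "field_value".
    if list(parent):
        for child in parent:
            parent_dict[child.tag] = process_node(child, {})
    else:
        parent_dict["field_value"] = parent.text
    return parent_dict

def process_textblock(textblock):
    # Same logic as above: end unpunctuated paragraphs with a period,
    # then collapse newlines and runs of spaces into single spaces.
    lines = re.sub(r"([^\n.!?\"])\n\n", r"\1.\n\n", textblock)
    lines = re.sub("\n*[ ]+", " ", lines)
    return lines.strip()

# A made-up record (assumption), far smaller than a real trial file.
TOY_XML = (
    "<clinical_study>"
    "<brief_summary><textblock>\n"
    "      This task takes about 30 minutes\n\n"
    "      Follow-up visits occur at 6 months.\n"
    "    </textblock></brief_summary>"
    "<phase>Phase 2</phase>"
    "</clinical_study>"
)

record = process_node(ET.fromstring(TOY_XML), {})
summary = process_textblock(record["brief_summary"]["textblock"]["field_value"])

print(record["phase"])  # {'field_value': 'Phase 2'}
print(summary)          # This task takes about 30 minutes. Follow-up visits occur at 6 months.
```

One thing this layout implies: because children are stored under their tag name, repeated sibling tags would overwrite one another, so only the last one survives in the dictionary.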

We've created a handy index of sentences that match with the different indicators we've defined and put it in a `json` file. We'll load this "compound" index (containing trial IDs, section IDs, and sentence numbers) from file:

In [6]:
compounded = json.load(open("../compounded.json", "r"))

Next, we'll define a function that can return a collection of `n` sentences matching the criteria for group, indicator, and section. The default indicator (which, frankly, looks quite scary at this point) is the combination of all temporal patterns, capable of picking out mentions of minutes, hours, days, weeks, months, and years. We can modify the `seed` parameter to get the function to retrieve different sets of sentences.

In [38]:
def retrieve_sentences(group = "root>burden", 
                       indicator = "(?:(?:minute|min\\.?|hour|hr\\.?|day|week|wk\\.?|month|mo\\.?|year|yr\\.?)s? (?:\\d+-|\\d+\\.)*\\d+|(?:\\d+-|\\d+\\.)*\\d+ (?:minute|min\\.?|hour|hr\\.?|day|week|wk\\.?|month|mo\\.?|year|yr\\.?)s?)", 
                       secIDs = ["brief_summary", "detailed_description"], 
                       n = 10, 
                       seed = 42,
                       index = compounded,
                       root_dir = "../clinicaltrials_data/trials/"
                      ):
    secIDs = re.compile("|".join(secIDs))
    sentIDs = filter(lambda x: secIDs.search(x), index.get(group, {}).get(indicator, []))
    random.seed(seed)
    lines = []
    
    if len(sentIDs) > n:
        sentIDs = random.sample(sentIDs, n)
    for sentID in sentIDs:
        IDparts = re.split(r"\.", sentID)  # sentence ID parts: trial ID, section ID, sentence number
        textblocks = {}
        textblocks = extract_textblocks(IDparts[0], root_dir = root_dir)
        line = nltk.sent_tokenize(textblocks[IDparts[1]])[int(IDparts[2])]
        if type(line) == str:
            line = unicode(line, "utf-8")
        lines.append(line)

    return lines, sentIDs
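
To see what the default temporal indicator actually picks up, here is a small self-contained sketch that applies the same pattern to a few invented sentences. The sample sentences and the `find_time_mentions` helper are illustrative assumptions, not part of the pipeline above; the pattern is only split into two named pieces for readability.

```python
import re

# The default `indicator` pattern, assembled from a unit part and a number part.
UNITS = r"(?:minute|min\.?|hour|hr\.?|day|week|wk\.?|month|mo\.?|year|yr\.?)s?"
NUMBER = r"(?:\d+-|\d+\.)*\d+"
TEMPORAL = re.compile("(?:" + UNITS + " " + NUMBER + "|" + NUMBER + " " + UNITS + ")")

def find_time_mentions(sentence):
    """Return every temporal match in a sentence (hypothetical helper; lowercases first)."""
    return TEMPORAL.findall(sentence.lower())

print(find_time_mentions("Outcomes will be measured at 6 months and 12 months."))
# ['6 months', '12 months']
print(find_time_mentions("Children aged 9-12 years will be enrolled."))
# ['9-12 years']
print(find_time_mentions("Blood is drawn on day 1 and week 12."))
# ['day 1', 'week 12']
print(find_time_mentions("We will enroll 6 more patients."))
# ['6 mo']
```

The last case is worth noting: the bare pattern has no word-boundary guard, so "6 more" is read as "6 mo". The regexes built by `create_regex_format` wrap each indicator in `[^a-z0-9]` guards, which would reject that match, but calling `findall` on the bare pattern does not.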

Let's set `seed` to 0 and retrieve 10 sentences containing matches with the temporal indicator:

In [13]:
examples, _ = retrieve_sentences(seed = 0)

In [18]:
examples

[u'Functional biomechanical outcomes will be measured at 6 months and 12 months using DSX at the Biodynamics Lab.',
 u'This task takes about 30 minutes to complete.',
 u'Despite the challenging perspective of the new antiviral drugs directly acting on hepatitis C viral replication such as protease and polymerase inhibitors, nowadays the standard treatment in genotype 1-chronic hepatitis C (CHC) is the combination of peghylated interferon (PEG-IFN) and ribavirin for 48 weeks.',
 u'These last set of questionnaires should take about 30 minutes to complete.',
 u'Patients who were discharged after an uneventful ERCP were contacted by telephone within 5 days to capture delayed occurrence of the primary end point.',
 u'The study will include 80 children in ages 9-12 years, devided in 2 groups; children with ADHD treated with Methylphenidate, and healthy children without ADHD.',
 u'Course of treatment\uff1a10 days.',
 u"After completion of the study intervention, patients' medical charts are r

## Using grammar and syntax

Now, we'll use the `spacy` language engine to transform a sentence into a set of tokens carrying helpful properties such as syntactic dependency, part of speech, and parent-child relations. We'll create a function, `print_parse()`, that prints this output for any sentence passed to it. The function turns the sentence into a list of `spacy` _tokens_ using the `spacy` NLP engine. Then, for each token, it prints the token's index, its text, its dependency relation to its parent token, its part of speech, the text of its parent token, the part of speech of its parent token, and the list of its child tokens.

In [21]:
def print_parse(sentence):
    sentence = unicode(sentence)
    doc = nlp(sentence)
    for token in doc:
        print(token.i, token.text, token.dep_, token.pos_, token.head.text, token.head.pos_,
             [child for child in token.children])

For example, say we want to explore the syntax tree of the sentence "The quick brown fox jumps over the lazy dog." We call the `print_parse` function on it:

In [22]:
print_parse("The quick brown fox jumps over the lazy dog.")

(0, u'The', u'det', u'DET', u'fox', u'NOUN', [])
(1, u'quick', u'amod', u'ADJ', u'fox', u'NOUN', [])
(2, u'brown', u'amod', u'ADJ', u'fox', u'NOUN', [])
(3, u'fox', u'nsubj', u'NOUN', u'jumps', u'VERB', [The, quick, brown])
(4, u'jumps', u'ROOT', u'VERB', u'jumps', u'VERB', [fox, over, .])
(5, u'over', u'prep', u'ADP', u'jumps', u'VERB', [dog])
(6, u'the', u'det', u'DET', u'dog', u'NOUN', [])
(7, u'lazy', u'amod', u'ADJ', u'dog', u'NOUN', [])
(8, u'dog', u'pobj', u'NOUN', u'over', u'ADP', [the, lazy])
(9, u'.', u'punct', u'PUNCT', u'jumps', u'VERB', [])


#### How can we use these helpful syntactic features to extract temporal information?

We'll define two functions, `extract_temporal` and `find_start`. The extraction function is the high-level implementation of our vision to extract a readable fragment of the sentence containing the temporal pattern. The `find_start` function helps initiate the tree navigation in `extract_temporal`.

The general idea is to start at the "root" token of the match (e.g., in a sentence like "Follow-up visits will occur at 6 and 12 months", the root is the word "months") and collect the tokens in its subtree, then navigate upwards from this root until we reach a verb or the root of the sentence or clause.

In [23]:
def find_start(pieces, sentence):
    for piece in pieces:
        for i in range(len(sentence) - len(piece) + 1):
            j = i + len(piece)
            segment = sentence[i:j]
            ancestors = set()
            if tuple([token.text for token in segment]) == piece:
                for token in segment:
                    token_ancestors = set([token])
                    token_ancestors = token_ancestors.union(token.ancestors)
                    if ancestors:
                        ancestors = ancestors.intersection(token_ancestors)
                    else:
                        ancestors = ancestors.union(token_ancestors)
                for token in ancestors:
                    if not set(token.children).intersection(ancestors):
                        yield token, segment
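
The core of `find_start` is locating the lowest token that is an ancestor of (or equal to) every token in the matched segment. The sketch below shows that idea on a plain parent map instead of `spacy` tokens. The `PARENT` dictionary is a hand-written, assumed parse of "This task takes about 30 minutes to complete.", not real parser output, and the helper names are hypothetical.

```python
# Hand-written (assumed) dependency heads; the root points to itself.
PARENT = {
    "This": "task", "task": "takes", "takes": "takes",
    "about": "30", "30": "minutes", "minutes": "takes",
    "to": "complete", "complete": "takes",
}

def ancestors_or_self(word, parent):
    """Return the chain from a word up to the root, starting with the word itself."""
    chain = [word]
    while parent[word] != word:
        word = parent[word]
        chain.append(word)
    return chain

def lowest_common_ancestor(words, parent):
    """Return the deepest node that dominates (or is) every word in `words`."""
    common = set(ancestors_or_self(words[0], parent))
    for word in words[1:]:
        common &= set(ancestors_or_self(word, parent))
    # Walking up from any one word, the first shared node is the lowest one.
    for candidate in ancestors_or_self(words[0], parent):
        if candidate in common:
            return candidate

print(lowest_common_ancestor(["30", "minutes"], PARENT))    # minutes
print(lowest_common_ancestor(["task", "minutes"], PARENT))  # takes
```

For the match "30 minutes" this returns "minutes", which is the token `extract_temporal` would then use as the starting point of its fragment.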

In [24]:
def extract_temporal(sentence, matches):
    sentence = unicode(sentence)
    sentence = nlp(sentence)
    indicator_tokens = set()
    for match in matches:
        indicator_tokens.add(tuple([token.text for token in nlp(unicode(match))]))
    fragment = set()
    for start, segment in find_start(indicator_tokens, sentence):
        head = start
        for token in head.subtree:
            fragment.add(token)
        while head.pos_ != "VERB" and head.dep_ != "ROOT":
            head = head.head
            fragment.add(head)
    return sorted(fragment, key = lambda x: x.i)
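
The two steps inside `extract_temporal` (gather the subtree of the start token, then climb until a verb or the sentence root is reached) can be illustrated without loading `spacy`. This is a minimal sketch under stated assumptions: `Tok` is a stand-in for a `spacy` token, and the parse in `ROWS` is hand-written for "This task takes about 30 minutes to complete." rather than produced by the parser.

```python
class Tok(object):
    """Tiny stand-in for a parsed token (assumption, not the spacy class)."""
    def __init__(self, i, text, pos, dep):
        self.i, self.text, self.pos, self.dep = i, text, pos, dep
        self.head = self
        self.children = []

# (text, part of speech, dependency label, index of head); hand-written parse.
ROWS = [
    ("This", "DET", "det", 1), ("task", "NOUN", "nsubj", 2),
    ("takes", "VERB", "ROOT", 2), ("about", "ADV", "advmod", 4),
    ("30", "NUM", "nummod", 5), ("minutes", "NOUN", "dobj", 2),
    ("to", "PART", "aux", 7), ("complete", "VERB", "xcomp", 2),
    (".", "PUNCT", "punct", 2),
]

def build(rows):
    """Create tokens and wire up head/children links from the head indices."""
    toks = [Tok(i, text, pos, dep) for i, (text, pos, dep, _) in enumerate(rows)]
    for tok, row in zip(toks, rows):
        tok.head = toks[row[3]]
        if tok.head is not tok:
            tok.head.children.append(tok)
    return toks

def subtree(tok):
    """Return the token together with all of its descendants."""
    found = [tok]
    for child in tok.children:
        found.extend(subtree(child))
    return found

def climb(start):
    """Collect the start token's subtree, then walk up to the first verb or root."""
    fragment = set(subtree(start))
    head = start
    while head.pos != "VERB" and head.dep != "ROOT":
        head = head.head
        fragment.add(head)
    return [tok.text for tok in sorted(fragment, key=lambda tok: tok.i)]

toks = build(ROWS)
print(climb(toks[5]))  # ['takes', 'about', '30', 'minutes']
```

Starting from "minutes", the subtree contributes "about 30 minutes" and the upward walk adds "takes", after which the loop stops because "takes" is a verb.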

Let's try this out using our massive temporal indicator with the sentences retrieved earlier.

In [26]:
indicator = "(?:(?:minute|min\\.?|hour|hr\\.?|day|week|wk\\.?|month|mo\\.?|year|yr\\.?)s? (?:\\d+-|\\d+\\.)*\\d+|(?:\\d+-|\\d+\\.)*\\d+ (?:minute|min\\.?|hour|hr\\.?|day|week|wk\\.?|month|mo\\.?|year|yr\\.?)s?)"

In [27]:
for example in examples:
    print "Sentence: ", example
    print "Extract: ", extract_temporal(example, re.compile(indicator).findall(example))
    print ""

Sentence:  Functional biomechanical outcomes will be measured at 6 months and 12 months using DSX at the Biodynamics Lab.
Extract:  [measured, at, 6, months, and, 12, months]

Sentence:  This task takes about 30 minutes to complete.
Extract:  [takes, about, 30, minutes]

Sentence:  Despite the challenging perspective of the new antiviral drugs directly acting on hepatitis C viral replication such as protease and polymerase inhibitors, nowadays the standard treatment in genotype 1-chronic hepatitis C (CHC) is the combination of peghylated interferon (PEG-IFN) and ribavirin for 48 weeks.
Extract:  [ribavirin, for, 48, weeks]

Sentence:  These last set of questionnaires should take about 30 minutes to complete.
Extract:  [take, about, 30, minutes]

Sentence:  Patients who were discharged after an uneventful ERCP were contacted by telephone within 5 days to capture delayed occurrence of the primary end point.
Extract:  [contacted, within, 5, days]

Sentence:  The study will include 80 chil

#### Does this output achieve our goals?

While most of the extracts are readable sentence fragments, there are two problems with this output. 

First, we are chopping out parts of the sentences to make digesting the information easier for the user, but more often than not, fragments like "given for 14 days" leave the user wanting more information. What is given for 14 days? Further, this pipeline cannot ensure that it will present the user with information about the trial as a whole rather than a small and possibly less important fragment.

Second, if we keep grabbing larger and larger portions of the original sentence, this doesn't result in reduced complexity for the reader at all. This doesn't get us any closer to our stated goal of improving readability.

#### So, where can we go from here?
Unfortunately, these fragments are often not quite appropriate for patients as simplified text. However, they are temporal extracts describing the scheduling topic! This could be handy when we get into our simplification moonshot. The style-transfer algorithm we're experimenting with requires topic-specific text for training, and this tool gets us text on the scheduling topic.

## A quick-and-dirty exercise

We conduct a separate experiment in extracting temporal information. What if we made the simple assumption that the largest time period mentioned in a summary or description section is likely the approximate duration of the trial?

All we would need to do is extract the temporal pattern matches and perform some simple arithmetic calculation to determine which fragment denotes the largest amount of time.

First, we define a function, `extract_time_strings`, that returns the temporal patterns found in a textblock.

In [39]:
def extract_time_strings(textblock):
    indicator = re.compile("(?:(?:minute|min\\.?|hour|hr\\.?|day|week|wk\\.?|month|mo\\.?|year|yr\\.?)s? (?:\\d+-|\\d+\\.)*\\d+|(?:\\d+-|\\d+\\.)*\\d+ (?:minute|min\\.?|hour|hr\\.?|day|week|wk\\.?|month|mo\\.?|year|yr\\.?)s?)")
    sentences = nltk.sent_tokenize(textblock)
    time_strings = []
    for sentence in sentences:
        padded = " " + sentence.lower().strip() + " "
        if indicator.search(padded):
            nuggets = indicator.findall(padded)
            time_strings.extend(nuggets)
    return time_strings

Next, we'll define a `convert_to_time` function that allows us to compare the different time strings. We'll need some convenient multipliers to perform this calculation.

Just for fun, we derive our multipliers from two constants: the duration of a [stellar day](https://en.wiktionary.org/wiki/stellar_day) in seconds, and the number of stellar days in a year (366.2422). This saves us from dealing with leap years and arbitrary month-lengths!
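
The derivation itself isn't shown in the notebook, so the following is one plausible reconstruction rather than the author's exact arithmetic. It assumes 24 hours per day, 60 minutes per hour, 7 days per week, and 12 equal months per year, and uses the two constants rounded to four decimal places.

```python
STELLAR_DAY_SECONDS = 86164.0989    # length of one stellar day, in seconds
STELLAR_DAYS_PER_YEAR = 366.2422    # stellar days in one year

year_seconds = STELLAR_DAY_SECONDS * STELLAR_DAYS_PER_YEAR

# Assumed subdivisions: 24 h/day, 60 min/h, 7 days/week, 12 months/year.
derived = {
    "minutes": STELLAR_DAY_SECONDS / (24 * 60),
    "hours":   STELLAR_DAY_SECONDS / 24,
    "days":    STELLAR_DAY_SECONDS,
    "weeks":   STELLAR_DAY_SECONDS * 7,
    "months":  year_seconds / 12,
    "years":   year_seconds,
}

for unit in ["minutes", "hours", "days", "weeks", "months", "years"]:
    print(unit, round(derived[unit], 2))
```

With these rounded inputs the results agree with the multipliers used in this notebook to within about 0.01 seconds; the small remaining gap in the month and year values is consistent with the table having been computed from a slightly more precise day length.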

We'll need to handle some contingencies such as reverse-order time strings and ranges.

In [51]:
multipliers = {
    "minutes" :       59.8362,
    "hours"   :     3590.1708,
    "days"    :    86164.0989,
    "weeks"   :   603148.6923,
    "months"  :  2629744.0953,
    "years"   : 31556929.1435
}

def convert_to_time(time_strings):
    times_and_strings = []
    for time_string in time_strings:
        if re.compile(r"\d+").search(time_string.split()[1]): # handling reverse order time strings
            unit, num = time_string.split()
        else:
            num, unit = time_string.split()
        if len(num.split("-")) > 1: # handling ranges
            num = float(num.split("-")[-1])
        else:    
            num = float(num)
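        unit = unit.replace(".", "")  # drop the period from abbreviations such as "min."
        if unit[-1] != "s":
            unit += "s"
        # re.sub returns a new string, so the result has to be assigned back
        unit = re.sub("^mins$", "minutes", unit)
        unit = re.sub("^hrs$", "hours", unit)
        unit = re.sub("^wks$", "weeks", unit)
        unit = re.sub("^mos$", "months", unit)
        unit = re.sub("^yrs$", "years", unit)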
        multiplier = multipliers[unit]
        time = num * multiplier
        times_and_strings.append([time, time_string])
    return times_and_strings
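
The contingencies mentioned above (reverse-order strings, ranges, and abbreviated units) can be seen in isolation in the following sketch. `to_seconds` and the `ABBREVIATIONS` table are hypothetical names introduced here for illustration; they restate the normalisation steps with a lookup table and are not a drop-in replacement for `convert_to_time`.

```python
import re

# Seconds per unit, copied from the `multipliers` table.
MULTIPLIERS = {
    "minutes": 59.8362, "hours": 3590.1708, "days": 86164.0989,
    "weeks": 603148.6923, "months": 2629744.0953, "years": 31556929.1435,
}
# Abbreviations the temporal pattern can produce, mapped to full unit names.
ABBREVIATIONS = {"min": "minute", "hr": "hour", "wk": "week", "mo": "month", "yr": "year"}

def to_seconds(time_string):
    """Convert one matched time string, e.g. "week 12" or "9-12 years", to seconds."""
    first, second = time_string.lower().split()
    # Reverse-order strings such as "week 12" put the unit first.
    if re.search(r"\d", second):
        unit, num = first, second
    else:
        num, unit = first, second
    # For a range such as "9-12", keep the upper bound.
    value = float(num.split("-")[-1])
    # Normalise "min.", "mins", "hr", ... to the plural keys of MULTIPLIERS.
    unit = unit.replace(".", "")
    if unit.endswith("s"):
        unit = unit[:-1]
    unit = ABBREVIATIONS.get(unit, unit) + "s"
    return value * MULTIPLIERS[unit]

print(to_seconds("week 12") == 12 * MULTIPLIERS["weeks"])    # True
print(to_seconds("30 min.") == 30 * MULTIPLIERS["minutes"])  # True
print(to_seconds("9-12 years") > to_seconds("48 weeks"))     # True
```

Taking the upper bound of a range is a design choice that suits the goal of finding the largest duration; a different goal (say, a typical duration) might call for the midpoint instead.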

Finally, we define `find_largest_time`, which sorts the converted values and returns the time string with the largest duration.

In [32]:
def find_largest_time(times_and_strings):
    sorted_times_and_strings = sorted(times_and_strings, reverse = True)
    if sorted_times_and_strings:
        return sorted_times_and_strings[0][1]
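
Because each entry is a `[seconds, string]` pair, sorting compares the numeric duration first and only falls back to the string on ties. The same selection can be written with `max` and an explicit key, shown here as a hedged alternative; `pick_largest` and the sample pairs are made up for illustration, and the two versions can differ in which string they return when durations tie.

```python
def pick_largest(times_and_strings):
    """Return the time string with the largest duration, or None if the list is empty."""
    if not times_and_strings:
        return None
    return max(times_and_strings, key=lambda pair: pair[0])[1]

# Invented sample input in the same [seconds, string] layout.
sample = [
    [3 * 2629744.0953, "3 months"],
    [6 * 2629744.0953, "6 months"],
    [30 * 59.8362, "30 minutes"],
]

print(pick_largest(sample))  # 6 months
print(pick_largest([]))      # None
```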

We collect some blocks of text from a set of trials and test this out.

In [41]:
trial_ids = [
    'NCT01610479',
    'NCT00817739',
    'NCT01915667',
    'NCT01784029',
    'NCT00003097',
    'NCT02689219',
    'NCT03311841',
    'NCT01599000',
    'NCT02412254',
    'NCT03356236',
    'NCT03187249',
    'NCT03129113',
    'NCT01038180',
    'NCT02668172',
    'NCT00586092',
    'NCT00099502',
    'NCT00138710',
    'NCT02592863',
    'NCT02683902',
    'NCT00009048'
]

In [34]:
blocks_of_text = []
for trial_id in trial_ids:
    textblocks = extract_textblocks(trial_id)
    for section in textblocks.keys():
        if section == "brief_summary" or section == "detailed_description":
            blocks_of_text.append(textblocks[section])

In [52]:
for textblock in blocks_of_text:

    largest_time = find_largest_time(convert_to_time(extract_time_strings(textblock)))
    if largest_time:
        print textblock
        if re.compile(r"\-").search(largest_time): # in case the largest time is part of a range
            parts = largest_time.split("-")
            parts = sorted(parts, key = lambda x: len(x), reverse = True)
            largest_time = parts[0]
        print "\n*** LARGEST TIME: " + largest_time + " ***\n"

In this randomised, sham-controlled trial, investigators will recruit forty patients with primary cranial-cervical dystonia to receive an implanted device for STN-DBS, and participants will be randomly assigned to receive either neurostimulation or sham stimulation for 3 months.The primary end point was the change from baseline to 3 months in the severity of symptoms, according to the Burke-Fahn-Marsden Dystonia Rating Scale. Two masked dystonia experts who unaware of treatment status will assess the severity of dystonia by reviewing standardised videos.Subsequently, all patients will receive open-label neurostimulation; blinded assessment will be repeated after 6 months of active treatment.

*** LARGEST TIME: 6 months ***

The present study is being undertaken to compare counterregulatory hormone responses to a mild and gradual reduction in plasma glucose in young children with T1DM versus responses in adolescents. The studies will be performed under the close supervision of the profe

#### What can this simple algorithm do for us?

By identifying the largest period of time mentioned in a text, it potentially gets us close to an extraction of total trial duration. 
While there is an abundance of false positives, these are often much like the false positives we're encountering in our `extract_temporal()` function. We'll deal with these in a separate experiment using some standard supervised machine learning. But aside from that, this largest-indicator strategy is limited by semantics. For example, we don't know whether a temporal mention relates to an active period or a follow-up period. And because we are providing a _summary_ of the temporal information (the _largest_ time), we lose the rest of the timing detail.

## Is this the best that rule-based extraction can do?
Both of our rule-based extraction methods share a common drawback. Because they both stop at _extracting_ morphological components (e.g., words from the descriptions), they are subject to the inconsistencies in the trials without being imbued with their full context. In other words, to make our rule-based output stronger by using more rule-based processing, we'd have to write some very complex (and overfit) case-by-case rules. Folks in NLP probably won't be surprised by this; it's really the nature of rule-based systems. While they can have high precision, they can be labor-intensive to construct and easy to overfit. Stay tuned for our next pass at the temporal information: since the target temporal data are numeric, we'll be building some regression and supervised machine learning tools to estimate durations, frequencies of visits, and so on.