# Exploring NLP for Clinical Trial Records
## Installment 1: Extracting Sentences of Interest
### by Munif Mujib

## Getting the data

First, we'll download all the available trial records from _clinicaltrials.gov_ using this link: https://clinicaltrials.gov/AllPublicXML.zip.

We'll extract the large ZIP file and place the contents in a suitable directory.
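
If you'd rather do the extraction from Python, something like the following should work. This is only a sketch: the helper name `extract_archive` is ours, and it assumes the archive has already been downloaded and saved locally (for example as `AllPublicXML.zip`).

```python
import os
import zipfile

def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member count."""
    # Hypothetical helper, not part of the original notebook.
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return len(archive.namelist())

# Example call (paths are assumptions; adjust them to your setup):
# extract_archive("AllPublicXML.zip", "../clinicaltrials_data/trials/")
```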

## Reading the data

We'll want to import all the Python modules (open-source libraries) that we'll be using in our code.

In [2]:
from __future__ import division
import xml.etree.ElementTree as ET
import re, glob, time
import json
from collections import defaultdict
import nltk
import random
import spacy
from spacy import displacy

Next, we'll define some functions that'll allow us to navigate the data directory and process the individual XML files corresponding to each record.

We'll write a `load_trial()` function that takes a trial ID and the directory where we extracted the data, and returns a Python dictionary holding the contents of the corresponding XML file. We'll also write a `process_node()` function that recursively walks the XML tree and copies the data into nested dictionaries. Recursion lets us handle whatever fields and nesting a record happens to contain: although a schema specifies the XML format (https://clinicaltrials.gov/ct2/html/images/info/public.xsd), not every field is present in every trial record. Notice that we're using the ElementTree module to read the XML files.

In [3]:
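def process_node(parent, parent_dict):
    # An element with children becomes a nested dictionary keyed by child tag.
    # Note: sibling elements that share a tag overwrite one another, so only the last is kept.
    if list(parent):
        for child in parent:
            parent_dict[child.tag] = process_node(child, {})
    else:
        # A leaf element stores its text under the "field_value" key.
        parent_dict["field_value"] = parent.text
    return parent_dict

def load_trial(trial_id, root_dir = "../clinicaltrials_data/trials/"):
    # Records are grouped in folders named after the ID with its last four digits masked,
    # e.g. NCT00009594 is stored in NCT0000xxxx/NCT00009594.xml.
    directory = trial_id[:-4] + "xxxx"
    filepath = root_dir + directory + "/" + trial_id + ".xml"

    tree = ET.parse(filepath)
    root = tree.getroot()

    return process_node(root, {})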
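
To get a feel for the dictionary shape before loading a real record, here is a small sketch that runs the same recursive idea on a made-up XML snippet. The sample document and the name `toy_trial` are invented for illustration; only the tag names mirror what appears in real records.

```python
import xml.etree.ElementTree as ET

def process_node(parent, parent_dict):
    # Same idea as in the notebook: children become nested dictionaries,
    # leaves store their text under "field_value".
    if list(parent):
        for child in parent:
            parent_dict[child.tag] = process_node(child, {})
    else:
        parent_dict["field_value"] = parent.text
    return parent_dict

# Invented sample, not a real trial record.
sample_xml = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_summary><textblock>A made-up summary.</textblock></brief_summary>
</clinical_study>"""

toy_trial = process_node(ET.fromstring(sample_xml), {})
print(toy_trial["brief_summary"]["textblock"]["field_value"])  # A made-up summary.
```

Every leaf ends up one level deeper than you might expect, under `"field_value"`, which is why the later functions look for that key.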

We can load a trial record like this:

In [16]:
trial = load_trial("NCT00009594")

We have our data in the default root directory, `../clinicaltrials_data/trials/`, but you'll need to pass your own location as `root_dir` if your data lives somewhere else.
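
As a quick check of where `load_trial()` will look for a given record, the path logic can be sketched on its own. The helper `trial_path` is hypothetical and uses `os.path.join` instead of string concatenation; the folder naming rule is the one used in `load_trial()`.

```python
import os

def trial_path(trial_id, root_dir="../clinicaltrials_data/trials/"):
    """Build the expected XML path for a trial ID (hypothetical helper)."""
    # Folder name = trial ID with its last four characters replaced by "xxxx".
    directory = trial_id[:-4] + "xxxx"
    return os.path.join(root_dir, directory, trial_id + ".xml")

path = trial_path("NCT00009594", root_dir="my_data")
print(path)
```

If the printed path doesn't exist on your machine, `ET.parse()` will fail, so this is a handy thing to verify first.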

Now, let's take a look at the dictionary we got.

In [28]:
trial

{'brief_summary': {'textblock': {'field_value': '\n      Temporomandibular disorders (TMD) are characterized by pain and tenderness in the muscles of\n      mastication and/or the temporomandibular joint (TMJ), limitations of jaw opening often\n      accompanied by deviations in mandibular path, and clicking, popping or grating TMJ sounds.\n\n      TMD is often found in association with other problems: depression, anxiety, sleep\n      disturbances, gastrointestinal symptoms, frequent infections, etc. This project proposes to\n      holistically address patient symptoms through three different approaches, Naturopathic\n      Medicine (NM), Traditional Chinese Medicine (TCM), and usual care at KPNW. We will conduct a\n      pilot test and Phase II trial to evaluate the two alternative healing approaches, TCM (n=50)\n      and NM (n=50) delivered by TCM and NM practitioners, are as effective as usual TMD care\n      (n=50) provided by dental clinicians in the KPNW TMD Clinic. Subjects wi

#### How do we interpret this output?

It looks very messy at this point. What we want to do is find the unstructured text fields and extract them. So, we'll again define a few functions.

`find_textblocks()` identifies the fields containing unstructured text in the loaded trial dictionary and returns a list of tuples containing the name of the field and the text in it. `process_textblock()` takes a single textblock and performs a little clean-up on the text data by removing extra whitespace. `get_textblocks()` creates a dictionary from the list of tuples with the field names as keys and the textblocks as values. Finally, `extract_textblocks()` is a wrapper for all of these functions. 

In [7]:
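def find_textblocks(field, textblocks = None, current = "root"):
    # Use None rather than a mutable default list, which would be shared across calls.
    if textblocks is None:
        textblocks = []
    # list() keeps the comparison below working on Python 3, where keys() returns a view.
    subfields = list(field.keys())
    if not subfields == ["field_value"]:
        if "textblock" in subfields:
            textblocks.append((current, field["textblock"]["field_value"]))
        if len(subfields) > 1:
            for subfield in subfields:
                if not subfield == "textblock":
                    textblocks = find_textblocks(field[subfield], textblocks, current = current + ">" + subfield)
    return textblocks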

def process_textblock(textblock):
    # Collapse each run of spaces (and any newlines before it) into a single space,
    # then trim leading and trailing whitespace.
    lines = re.sub("\n*[ ]+", " ", textblock)
    return lines.strip()

def get_textblocks(textblocks):
    textblocks_dict = {}
    for textblock in textblocks:
        levels = re.split(">", textblock[0])
        key = ">".join(levels[1:])
        textblocks_dict[key] = process_textblock(textblock[1])
    return textblocks_dict

def extract_textblocks(trial_id, root_dir = "../clinicaltrials_data/trials/"):
    return get_textblocks(find_textblocks(load_trial(trial_id, root_dir = root_dir), textblocks = []))
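
The same idea can be tried out on a tiny hand-made dictionary. The sketch below is a simplified variant, not the notebook's exact functions: `iter_textblocks` and `clean` are names we made up, the toy record is invented, and this version recurses into every nested dictionary.

```python
import re

def iter_textblocks(field, path=()):
    """Yield (path, raw_text) for every "textblock" entry in a nested trial dictionary."""
    for name, value in field.items():
        if not isinstance(value, dict):
            continue  # "field_value" strings are leaves, nothing to descend into
        if name == "textblock":
            yield ">".join(path), value.get("field_value") or ""
        else:
            for item in iter_textblocks(value, path + (name,)):
                yield item

def clean(text):
    # Same whitespace clean-up as process_textblock().
    return re.sub("\n*[ ]+", " ", text).strip()

# Invented record in the shape produced by process_node().
toy_trial = {
    "id_info": {"nct_id": {"field_value": "NCT00000000"}},
    "brief_summary": {"textblock": {"field_value": "\n      First line\n      second line.\n    "}},
    "eligibility": {"criteria": {"textblock": {"field_value": "\n  Adults only.\n"}}},
}

textblocks = {key: clean(raw) for key, raw in iter_textblocks(toy_trial)}
print(textblocks["brief_summary"])  # First line second line.
```

Nested fields get keys such as `eligibility>criteria`, matching the `>`-joined naming used by `get_textblocks()`.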

All you need to do to get the unstructured text from a trial description is:

In [15]:
extract_textblocks("NCT00009594")

{'brief_summary': 'Temporomandibular disorders (TMD) are characterized by pain and tenderness in the muscles of mastication and/or the temporomandibular joint (TMJ), limitations of jaw opening often accompanied by deviations in mandibular path, and clicking, popping or grating TMJ sounds. TMD is often found in association with other problems: depression, anxiety, sleep disturbances, gastrointestinal symptoms, frequent infections, etc. This project proposes to holistically address patient symptoms through three different approaches, Naturopathic Medicine (NM), Traditional Chinese Medicine (TCM), and usual care at KPNW. We will conduct a pilot test and Phase II trial to evaluate the two alternative healing approaches, TCM (n=50) and NM (n=50) delivered by TCM and NM practitioners, are as effective as usual TMD care (n=50) provided by dental clinicians in the KPNW TMD Clinic. Subjects will be females 25-55 years of age with multiple health problems (defined as patients who have had at lea

## Matching indicators inside the text

We wrote out a schema containing the textual patterns we'll be looking for (we call these _indicators_) in the extracted textblocks. We load this schema in like this:

In [9]:
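# Use a context manager so the file is closed once the schema is loaded.
with open("../structured-output-schema.json", "r") as schema_file:
    schema = json.load(schema_file)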

Next, we'll use regular expressions and write a few functions to match the indicators in the schema against the extracted textblocks.

`create_regex_format()` compiles a regular expression (_regex_) from the defined indicator pattern. The pattern is wrapped in non-alphanumeric boundary characters so that an indicator only matches as a whole word or phrase; because `match_indicators()` pads each sentence with a space on both sides, an indicator can match at any position (beginning, middle, or end). `find_indicators()` processes the schema to generate a dictionary containing the compiled regexes for each indicator group, keyed by the hierarchical path of that group. `match_indicators()` is where we use these regexes to find sentences where a match exists and put the matched indicators, along with a sentence ID number, in a dictionary.


In [29]:
def create_regex_format(x):
    regex = re.compile(r"[^a-z0-9](" + x + r")[^a-z0-9]")
    return regex
    
def find_indicators(element, indicators_dict = None, current = "root"):
    # Use None rather than a mutable default dict, which would be shared across calls.
    if indicators_dict is None:
        indicators_dict = {}
    if type(element) == dict:
        subelements = element.keys()
        if "indicators" in subelements and element["indicators"]:
            # Build a list (not a lazy map object) so the regexes can be reused for every sentence.
            indicators_dict[re.sub("refinements>", "",current)] = [
                create_regex_format(x) for x in element["indicators"]
            ]
        if len(subelements) > 1:
            for subelement in subelements:
                if type(element[subelement]) == dict:
                    indicators_dict = find_indicators(
                        element[subelement], indicators_dict, current = current + ">" + subelement
                    )
    return indicators_dict

def match_indicators(d, indicators_dict, secID):
    lines = d.get(secID,"")
    found = defaultdict(list)
    for sentNum, sentence in enumerate(nltk.sent_tokenize(lines)):
        # Lowercase and pad with spaces so indicators at the start or end of a sentence still match.
        padded = " " + sentence.lower().strip() + " "
        for group in indicators_dict:
            matches = []
            for indicator in indicators_dict[group]:
                if indicator.search(padded):
                    # Recover the original indicator text from the compiled pattern.
                    matches.append(re.sub(r"^.*?\]\((.*?)\)\[.*?$",r"\1",indicator.pattern))
            if matches:
                found[group].append((matches, str(sentNum)))
    return dict(found)
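
To see why the sentence is padded before matching, here is a minimal sketch using the same regex format on a made-up sentence. The boundary characters need something to match on each side, so an indicator at the very start of an unpadded sentence would be missed.

```python
import re

def create_regex_format(x):
    # Same format as in the notebook: indicator wrapped in non-alphanumeric boundaries.
    return re.compile(r"[^a-z0-9](" + x + r")[^a-z0-9]")

indicator = create_regex_format("evaluate")

sentence = "Evaluate the two approaches."  # invented example sentence
unpadded = sentence.lower()
padded = " " + unpadded + " "

print(bool(indicator.search(unpadded)))                       # False: nothing precedes "evaluate"
print(bool(indicator.search(padded)))                         # True: the added space acts as the boundary
print(bool(indicator.search(" we reevaluated the data ")))    # False: "evaluate" is inside a longer word
```

Since sentences are lowercased before matching, indicator patterns in the schema need to be written in lowercase as well.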

First, we need to create the dictionary of indicators from the schema:

In [11]:
indicators_dict = find_indicators(schema)
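
The schema file itself isn't reproduced in this notebook, so the following is only a guess at its general shape, inferred from how `find_indicators()` reads it (an `"indicators"` list per node, with sub-categories nested under `"refinements"`). Both `toy_schema` and `flatten_indicators` are invented for illustration, and the function is a simplified variant that recurses into every nested dictionary.

```python
import re

def flatten_indicators(node, path="root"):
    """Return {path: [compiled regex, ...]} for every node with a non-empty "indicators" list."""
    flat = {}
    if not isinstance(node, dict):
        return flat
    if node.get("indicators"):
        # Drop the "refinements" level from the key, as find_indicators() does.
        key = path.replace("refinements>", "")
        flat[key] = [re.compile(r"[^a-z0-9](" + p + r")[^a-z0-9]") for p in node["indicators"]]
    for name, child in node.items():
        if isinstance(child, dict):
            flat.update(flatten_indicators(child, path + ">" + name))
    return flat

# Hypothetical schema fragment; the real file may be organised differently.
toy_schema = {
    "burden": {"indicators": [r"(?:\d+-)*\d+ years"]},
    "primary_aim": {
        "indicators": [],
        "refinements": {"observational": {"indicators": ["evaluate", "monitor"]}},
    },
}

flat = flatten_indicators(toy_schema)
print(sorted(flat))  # ['root>burden', 'root>primary_aim>observational']
```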

Next, we pass a dictionary of textblocks, extracted from a trial record, into `match_indicators()`, along with the section that we want to examine, and store the output in `found`.

In [30]:
textblocks_dict = extract_textblocks("NCT00009594")
found = match_indicators(textblocks_dict, indicators_dict, "brief_summary")

So, for example, the `found` dictionary shown below tells us that an indicator related to burden, `(?:\d+-)*\d+ years`, matched the 5th sentence (ID 4, since sentence IDs start from 0) in the "brief_summary" section of trial NCT00009594.

In [31]:
found

{u'root>burden': [([u'(?:\\d+-)*\\d+ years'], '4'),
  ([u'(?:\\d+-)*\\d+ months'], '5')],
 u'root>intervention': [([u'will be used'], '7')],
 u'root>primary_aim>observational': [([u'evaluate'], '3'),
  ([u'monitor'], '8'),
  ([u'evaluate'], '9')],
 u'root>size of study': [([u'subjects'], '4'), ([u'subjects'], '5')]}
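
Since `found` only stores sentence IDs, you'll usually want to map them back to the sentences themselves. Here is one way that could look, as a sketch: `sentences_for` is a hypothetical helper, and the toy data below is made up. In the notebook, the sentence list would come from `nltk.sent_tokenize()` applied to the same section.

```python
def sentences_for(found, sentences):
    """Return {group: [(sentence_text, matched_indicators), ...]} (hypothetical helper)."""
    resolved = {}
    for group, hits in found.items():
        # Each hit is (list of matched indicator patterns, sentence ID as a string).
        resolved[group] = [(sentences[int(sent_id)], patterns) for patterns, sent_id in hits]
    return resolved

# Invented stand-ins for the tokenized section and the match_indicators() output.
toy_sentences = [
    "We will evaluate two treatments.",
    "Subjects will be females 25-55 years of age.",
]
toy_found = {
    "root>burden": [([r"(?:\d+-)*\d+ years"], "1")],
    "root>primary_aim>observational": [(["evaluate"], "0")],
}

resolved = sentences_for(toy_found, toy_sentences)
print(resolved["root>burden"][0][0])  # Subjects will be females 25-55 years of age.
```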

#### Where to go from here?

This notebook is a toolkit for extracting sentences that may contain information of value to patients. However, a match only _indicates_ that a sentence may be relevant; the sentence does not necessarily contain the target information. For example, even if the phrase "every 6 months" does relate to patient burden, the match by itself doesn't tell us what occurs every 6 months. Recovering that missing information is the focus of our subsequent exploration of syntax and grammar.