#Determining abstract section headings

Tong Shu Li<br>
Created on Wednesday 2015-07-22<br>
Last updated 2015-07-22

For the abstract-level chemical-induced disease relation extraction task, we want to format the text into sections ([like PMID 20003049](http://www.ncbi.nlm.nih.gov/pubmed/?term=20003049%5Buid%5D)) to make it easier for the workers to read. Since we are treating the input text as free text, we cannot rely on PubMed to determine its section headings.

What we will do instead is query PubMed for the section names of each of the 1000 abstracts in the training and development data (500 each), and use those names to parse any new input.
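
As a rough illustration of how a collected set of section names might later be applied to new input, here is a minimal sketch. It assumes that section headings appear in the free text as an upper-case label followed by a colon (as in `BACKGROUND: ...`), which this notebook does not verify. The function `split_into_sections` and the sample label set `KNOWN_LABELS` are hypothetical and are not part of the notebook's code.

```python
import re

# Hypothetical sample of labels; the real list is whatever PubMed returns.
KNOWN_LABELS = {"BACKGROUND", "METHODS", "RESULTS", "CONCLUSIONS"}

def split_into_sections(text, labels):
    """Split free text on 'LABEL:' markers for the given labels.

    Returns a list of (label, body) pairs. Text that precedes the
    first recognised label is returned with a label of None.
    """
    if not labels:
        return [(None, text.strip())]

    # Try longer labels first so that a multi-word label is preferred
    # over a shorter label that happens to be its suffix.
    ordered = sorted(labels, key=len, reverse=True)
    alternatives = "|".join(re.escape(label) for label in ordered)
    pattern = re.compile(r"\b({0}):\s*".format(alternatives))

    sections = []
    label, start = None, 0
    for match in pattern.finditer(text):
        body = text[start:match.start()].strip()
        # Skip the empty unlabelled chunk when the text starts with a label.
        if body or label is not None:
            sections.append((label, body))
        label, start = match.group(1), match.end()

    sections.append((label, text[start:].strip()))
    return sections

example = "BACKGROUND: Some context. METHODS: Some methods."
for name, body in split_into_sections(example, KNOWN_LABELS):
    print("{0}\t{1}".format(name, body))
```

Matching is case-sensitive here, so a word such as "results" inside a sentence is not treated as a heading; whether that is the right trade-off depends on how the input text is actually formatted.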

In [1]:
from src.file_util import read_file

---

###Preprocess the training and development data to grab all the PMIDs:

In [2]:
def parse_pmids(fname):
    pmids = set()
    for line in read_file(fname):
        vals = line.split('|')
        if len(vals) == 3 and vals[1] in ['a', 't']:
            pmids.add(vals[0])
            
    return pmids
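
To make the filter in `parse_pmids` concrete, the sketch below applies the same test to a few in-memory lines. It assumes the data files use a PubTator-style layout, where title and abstract lines look like `PMID|t|...` and `PMID|a|...` and annotation lines are tab-separated. The helper `parse_pmids_from_lines` and the sample records are made up for illustration.

```python
def parse_pmids_from_lines(lines):
    """Same filter as parse_pmids, but over an in-memory list of lines."""
    pmids = set()
    for line in lines:
        vals = line.split('|')
        # Only title ('t') and abstract ('a') lines have exactly three fields.
        if len(vals) == 3 and vals[1] in ['a', 't']:
            pmids.add(vals[0])
    return pmids

# Made-up records in the assumed layout (not real corpus entries).
sample = [
    "1234567|t|A made-up title",
    "1234567|a|A made-up abstract.",
    "1234567\t10\t18\tsomedrug\tChemical\tD000000",
    "",
    "7654321|t|Another made-up title",
    "7654321|a|Another made-up abstract.",
]

pmids = parse_pmids_from_lines(sample)
print(sorted(pmids))
```

One caveat of this filter: a title or abstract that itself contains a `|` character splits into more than three fields and is skipped. The PMID is still picked up as long as the other line of the pair passes the test.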

In [3]:
fname = "data/training/CDR_TrainingSet.txt"
trainingset = parse_pmids(fname)

In [4]:
len(trainingset)

500

In [5]:
fname = "data/development/CDR_DevelopmentSet.txt"
developmentset = parse_pmids(fname)

In [6]:
len(developmentset)

500

---

###Function for grabbing the section names from PubMed:

In [7]:
# last updated 2015-06-12 toby
import sys
sys.path.append("/home/toby/Code/util/")
from convert import query_ncbi

import xml.etree.cElementTree as ET
from unicode_to_ascii import convert_unicode_to_ascii

def get_pubmed_article_xml_tree(pubmed_id):
    request = "efetch.fcgi?db=pubmed&id={0}&rettype=abstract".format(pubmed_id)
    response = query_ncbi(request)
    return ET.fromstring(response)

def parse_article_xml_tree(article_xml_tree):
    """Returns title as a unicode string, and abstract as an Element (False if missing)"""
    article_title = None # stays None if the record has no ArticleTitle
    for element in article_xml_tree.iter("ArticleTitle"):
        article_title = element.text
        break

    # return the first Abstract element, if there is one
    for element in article_xml_tree.iter("Abstract"):
        return (article_title, element)

    return (article_title, False) # no abstract, title only

def get_section_labels(abstract_xml_tree):
    """
    Returns the set of section labels used in an abstract XML tree.

    Some papers use a background/methods/etc format (eg pmid 24885308).
    Sections with no label, or labelled "UNLABELLED", are skipped.
    """
    section_labels = set()
    for child in abstract_xml_tree.iter("AbstractText"):
        section_name = child.get("Label")
        
        if section_name is not None and section_name != "UNLABELLED":
            section_labels.add(section_name)
            
    return section_labels

def get_abstract_information(pubmed_id):
    article_xml_tree = get_pubmed_article_xml_tree(pubmed_id)
    title, abstract_xml_tree = parse_article_xml_tree(article_xml_tree)

    if abstract_xml_tree is not False: # an Element with no children is falsy
        return get_section_labels(abstract_xml_tree)
    
    return set()
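
The label extraction can be tried offline on a hand-written XML fragment, without calling NCBI. The fragment below only imitates the layout that the functions above rely on (`AbstractText` elements carrying a `Label` attribute inside an `Abstract` element); it is not a real PubMed record. The function `labels_from_xml` is a hypothetical stand-in that mirrors `get_section_labels`.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the assumed efetch layout; not a real record.
SAMPLE_XML = """
<PubmedArticleSet><PubmedArticle><MedlineCitation><Article>
<ArticleTitle>A made-up title</ArticleTitle>
<Abstract>
<AbstractText Label="BACKGROUND">Some context.</AbstractText>
<AbstractText Label="METHODS">Some methods.</AbstractText>
<AbstractText Label="UNLABELLED">Extra text.</AbstractText>
</Abstract>
</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>
"""

def labels_from_xml(xml_text):
    """Collect the Label attribute of every labelled AbstractText element."""
    root = ET.fromstring(xml_text.strip())
    labels = set()
    for node in root.iter("AbstractText"):
        label = node.get("Label")
        # Skip sections without a label and the "UNLABELLED" placeholder.
        if label is not None and label != "UNLABELLED":
            labels.add(label)
    return labels

print(sorted(labels_from_xml(SAMPLE_XML)))
```

An abstract with a single unlabelled `AbstractText` element yields an empty set, which matches what `get_abstract_information` returns for papers without a sectioned abstract.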

---

In [8]:
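def get_all_section_labels(fname, dataset, limit = -1):
    """
    Queries PubMed for each PMID in dataset and appends the section
    labels found to fname, one label per line.

    limit caps the number of PMIDs queried (-1 means no limit).
    """
    for i, pmid in enumerate(dataset):
        if i == limit:
            break
            
        if i % 100 == 0:
            print i # progress indicator
            
        sections = get_abstract_information(pmid)
        # appending after every query keeps earlier results on disk if a
        # later query fails; rerunning without deleting fname duplicates lines
        with open(fname, "a") as fout:
            for section_name in sections:
                fout.write("{0}\n".format(section_name))
            
    print "done"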

###Getting the section names for all 1000 abstracts:

In [9]:
get_all_section_labels("data/training_section_names.txt", trainingset)

0
100
200
300
400
done


In [10]:
get_all_section_labels("data/development_section_names.txt", developmentset)

0
100
200
300
400
done


---

###Creating the unique set of section names for future use:

In [11]:
from collections import Counter

In [15]:
def uniq_section_names(fname):
    names = set()
    for line in read_file(fname):
        names.add(line)
        
    return names

In [17]:
dev_names = uniq_section_names("data/development_section_names.txt")
train_names = uniq_section_names("data/training_section_names.txt")

In [18]:
all_names = dev_names | train_names

In [19]:
len(all_names)

77

In [20]:
with open("data/all_uniq_section_names.txt", "w") as fout:
    for name in sorted(all_names):
        fout.write("{0}\n".format(name))
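
`Counter` is imported above but never used. One possible use, sketched here as an assumption about intent rather than something this notebook does, is to count how often each section name occurs in the per-dataset files, since those files hold one label per abstract per line. The helper `count_section_names` and the sample lines are hypothetical.

```python
from collections import Counter

def count_section_names(lines):
    """Count how many times each non-empty label occurs."""
    stripped = (line.strip() for line in lines)
    return Counter(name for name in stripped if name)

# Made-up lines standing in for the contents of a section names file.
sample_lines = [
    "BACKGROUND\n", "METHODS\n", "RESULTS\n",
    "BACKGROUND\n", "RESULTS\n",
    "BACKGROUND\n",
]

counts = count_section_names(sample_lines)
for name, n in counts.most_common():
    print("{0}\t{1}".format(name, n))
```

Such counts would show which labels are common enough to matter when parsing new input, and which appear in only one or two abstracts.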