# Test extraction 

To work out which XPath expressions are needed, here is a small example of the record structure, together with some code that parses it using lxml's `etree`. To operate at a larger scale, the parser has to iterate over the records and delete each element once it has been processed, so that the entire XML file is never loaded into memory at once.
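The iterate-and-delete idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's final code: it uses the standard-library `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies, the helper name `iter_articles` is hypothetical, and the two-record sample (including its PMIDs and title) is made-up placeholder data wrapped in a `PubmedArticleSet` root.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Hypothetical subset of the fields of interest, as paths relative to <PubmedArticle>.
FIELDS = {
    "pmid": "MedlineCitation/PMID",
    "title": "MedlineCitation/Article/ArticleTitle",
}

# Made-up placeholder records, only meant to show the mechanics.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1</PMID>
      <Article><ArticleTitle>Placeholder title A</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">2</PMID>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def iter_articles(source, fields=FIELDS):
    """Yield one dict per <PubmedArticle>, clearing each element once it is read."""
    # 'end' events fire when an element has been parsed completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        # findtext returns None when the path does not match anything.
        record = {key: elem.findtext(path) for key, path in fields.items()}
        # Drop the children and text of the processed record to free memory.
        elem.clear()
        yield record


records = list(iter_articles(BytesIO(SAMPLE)))
for record in records:
    print(record)
```

Clearing the element after building the record keeps only one full record in memory at a time. The emptied `<PubmedArticle>` shells stay attached to the root, so for very large files it may also be worth removing them from the parent.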

In [1]:
# loading required libraries
from lxml import etree  
from io import BytesIO

In [2]:
# incomplete PubMed citation taken from https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/pubmed20n1017.xml.gz to investigate the record structure
# contains all the fields that will be extracted in a first attempt
XML_DATA = ("""<PubmedArticle>
                <MedlineCitation Status="MEDLINE" Owner="NLM">
                 <PMID Version="1">18285460</PMID>
                  <DateCompleted>
                   <Year>2008</Year>
                   <Month>04</Month>
                   <Day>21</Day>
                  </DateCompleted>
                  <DateRevised>
                   <Year>2019</Year>
                   <Month>12</Month>
                   <Day>17</Day>
                  </DateRevised>
                  <Article PubModel="Print-Electronic">
                   <Journal>
                    <ISSN IssnType="Electronic">1098-5549</ISSN>
                    <JournalIssue CitedMedium="Internet">
                    <Volume>28</Volume>
                     <Issue>8</Issue>
                     <PubDate>
                      <Year>2008</Year>
                      <Month>Apr</Month>
                     </PubDate>
                    </JournalIssue>
                    <Title>Molecular and cellular biology</Title>
                    <ISOAbbreviation>Mol. Cell. Biol.</ISOAbbreviation>
                   </Journal>
                   <ArticleTitle>Human Rvb1/Tip49 is required for the histone acetyltransferase activity of Tip60/NuA4 and for the downregulation of phosphorylation on H2AX after DNA damage.</ArticleTitle>
                   <Pagination>
                    <MedlinePgn>2690-700</MedlinePgn>
                   </Pagination>
                   <ELocationID EIdType="doi" ValidYN="Y">10.1128/MCB.01983-07</ELocationID>
                   <Abstract>
                    <AbstractText>The role of chromatin-remodeling factors in transcription is well established, but the link between chromatin-remodeling complexes and DNA repair remains unexplored. Human Rvb1 and Rvb2 are highly conserved AAA(+) ATP binding proteins that are part of various chromatin-remodeling complexes, such as Ino80, SNF2-related CBP activator protein (SRCAP), and Tip60/NuA4 complexes, but their molecular function is unclear. The depletion of Rvb1 increases the amount and persistence of phosphorylation on chromatin-associated H2AX after the exposure of cells to UV irradiation or to mitomycin C, cisplatin, camptothecin, or etoposide, without increasing the amount of DNA damage. Tip60 depletion, but not Ino80 or SRCAP depletion, mimics the effect of Rvb1 depletion on H2AX phosphorylation. Rvb1 is required for the histone acetyltransferase (HAT) activity of the Tip60 complex, and histone H4 acetylation is required prior to the dephosphorylation of phospho-H2AX. Thus, Rvb1 is critical for the dephosphorylation of phospho-H2AX due to the role of Rvb1 in maintaining the HAT activity of Tip60/NuA4, implicating the Rvb1-Tip60 complex in the chromatin-remodeling response of cells after DNA damage.</AbstractText>
                   </Abstract>
                  </Article> 
                 </MedlineCitation>
                </PubmedArticle>""").encode(encoding="UTF-8")

In [3]:
# dictionary of all the elements of interest and their respective XPaths (relative to <PubmedArticle>)
elem_of_interest = {
    'pmid': 'MedlineCitation/PMID',
    'abstract_text': 'MedlineCitation/Article/Abstract/AbstractText',
    'title': 'MedlineCitation/Article/ArticleTitle',
    'journal': 'MedlineCitation/Article/Journal/Title',
    'pub_year': 'MedlineCitation/Article/Journal/JournalIssue/PubDate/Year'
}


In [4]:
# stream over the XML, yielding each <PubmedArticle> once it has been read completely
context = etree.iterparse(BytesIO(XML_DATA), events=('end',), tag="PubmedArticle")

# go through all the elements of interest, i.e. <PubmedArticle>s
for event, elem in context:
    # attempt to extract all the relevant fields for each publication
    for key, value in elem_of_interest.items():
        eoi = elem.xpath(value)
        # if there are any matches, take the first one and skip it if it has no text
        if eoi and eoi[0].text is not None:
            print(f"{key}: {eoi[0].text} \n")
    # free the processed element so that large files do not accumulate in memory
    elem.clear()


pmid: 18285460 

abstract_text: The role of chromatin-remodeling factors in transcription is well established, but the link between chromatin-remodeling complexes and DNA repair remains unexplored. Human Rvb1 and Rvb2 are highly conserved AAA(+) ATP binding proteins that are part of various chromatin-remodeling complexes, such as Ino80, SNF2-related CBP activator protein (SRCAP), and Tip60/NuA4 complexes, but their molecular function is unclear. The depletion of Rvb1 increases the amount and persistence of phosphorylation on chromatin-associated H2AX after the exposure of cells to UV irradiation or to mitomycin C, cisplatin, camptothecin, or etoposide, without increasing the amount of DNA damage. Tip60 depletion, but not Ino80 or SRCAP depletion, mimics the effect of Rvb1 depletion on H2AX phosphorylation. Rvb1 is required for the histone acetyltransferase (HAT) activity of the Tip60 complex, and histone H4 acetylation is required prior to the dephosphorylation of phospho-H2AX. Thus, R