<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Advantage-of-XML-data-format" data-toc-modified-id="Advantage-of-XML-data-format-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Advantage of XML data format</a></span></li><li><span><a href="#ElementTree" data-toc-modified-id="ElementTree-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>ElementTree</a></span></li><li><span><a href="#Parsing-XML-Data" data-toc-modified-id="Parsing-XML-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Parsing XML Data</a></span></li><li><span><a href="#Project:" data-toc-modified-id="Project:-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Project:</a></span></li></ul></div>

# Advantage of XML data format 

XML stands for "Extensible Markup Language". It is a text-based format for storing and exchanging structured data, and it is widely used on the web. An XML document is organized as a tree of nested elements, which makes hierarchical data easy to represent and interpret. Any file that follows the XML syntax rules can be called an XML document. Its main advantages are:

- Data transfer

- Easy to write code that reads and writes it

- Document validation

- Human readable

- Supports a wide variety of applications

- Robust parsers in most languages

- It is free
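
To illustrate the read and write point from the list above, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The `note` document and its tag names are invented for this example and are not part of the article data used later.

```python
import xml.etree.ElementTree as ET

# Build a small document in memory (tag names are hypothetical).
note = ET.Element("note", {"id": "n1"})
ET.SubElement(note, "to").text = "Alice"
ET.SubElement(note, "body").text = "Meeting at noon"

# Serialize the tree to a plain, human-readable string.
xml_text = ET.tostring(note, encoding="unicode")
print(xml_text)
# <note id="n1"><to>Alice</to><body>Meeting at noon</body></note>

# Parse the string back and read the values again.
parsed = ET.fromstring(xml_text)
print(parsed.attrib["id"], parsed.find("to").text)
# n1 Alice
```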



#  ElementTree
Python has a built-in library, ElementTree, with functions to read and manipulate XML documents (and other similarly structured files).


import xml.etree.ElementTree as ET
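
As a rough sketch of what "manipulate" can mean here, the snippet below changes an attribute, edits text, removes a child and appends a new one. The `<au>` snippet and its values are made up for illustration; only the ElementTree calls themselves come from the standard library.

```python
import xml.etree.ElementTree as ET

# A made-up snippet; the real article file is not needed here.
root = ET.fromstring("<au><fnm>Ada</fnm><snm>Lovelace</snm><draft/></au>")

root.set("role", "corresponding")       # add or overwrite an attribute
root.find("fnm").text = "A."            # change an element's text
root.remove(root.find("draft"))         # drop a direct child element
ET.SubElement(root, "email").text = "ada@example.org"  # append a child

result = ET.tostring(root, encoding="unicode")
print(result)
# <au role="corresponding"><fnm>A.</fnm><snm>Lovelace</snm><email>ada@example.org</email></au>
```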


# Parsing XML Data


In [2]:
import xml.etree.ElementTree as ET



In [7]:
# .tag is a useful attribute: it holds the element's name.
tree = ET.parse('exampleresearcharticle.xml')
root = tree.getroot()
print('root element is:', root.tag)
print('Tags and attributes: at the top level, the root has no attributes')



# The second useful attribute is .text, which holds an element's text.
# We want to find the title; its text sits in the child elements of <title>.
title = root.find('./fm/bibl/title')
title_text = ''
for t in title:
    title_text += t.text
print("Title:", title_text)



# Look for all email addresses.
print('List of email addresses:')
authors = root.findall('./fm/bibl/aug/au')
for author in authors:
    email = author.find('email')
    # Skip authors that have no <email> element.
    if email is not None:
        print(email.text)


root element is: art
Tags and attributes: at the top level, the root has no attributes
Title: Standardization of the functional syndesmosis widening by dynamic U.S examination
List of email addresses:
omer@extremegate.com
mcarmont@hotmail.com
laver17@gmail.com
nyska@internet-zahav.net
kammarh@gmail.com
gideon.mann.md@gmail.com
barns.nz@gmail.com
eukots@gmail.com


In [24]:
for d in root.find('./fm/bibl'):
    print(d.tag, d.attrib)

title {}
aug {}
insg {}
source {}
issn {}
pubdate {}
volume {}
issue {}
fpage {}
url {}
xrefbib {}
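
The lookups used in this section can be compared side by side on a small inline document. This is only a sketch: the XML below is a hypothetical stand-in that mimics the `fm/bibl` layout, with invented authors, so it runs without the article file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in that mimics the fm/bibl layout used above.
sample = """
<art>
  <fm>
    <bibl>
      <title><p>A sample title</p></title>
      <aug>
        <au><fnm>Ada</fnm><snm>Lovelace</snm><email>ada@example.org</email></au>
        <au><fnm>Alan</fnm><snm>Turing</snm></au>
      </aug>
    </bibl>
  </fm>
</art>
"""
root = ET.fromstring(sample)

# find() returns the first match, or None when nothing matches.
first_author = root.find("./fm/bibl/aug/au")
# findall() returns every match as a list.
authors = root.findall("./fm/bibl/aug/au")
# findtext() returns the text directly and accepts a default for missing tags.
emails = [au.findtext("email", default="(none)") for au in authors]
# iter() walks the whole subtree, so no path is needed.
surnames = [snm.text for snm in root.iter("snm")]
# itertext() joins text from nested children, such as <p> inside <title>.
title = "".join(root.find("./fm/bibl/title").itertext())

print(emails)    # ['ada@example.org', '(none)']
print(surnames)  # ['Lovelace', 'Turing']
print(title)     # A sample title
```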


# Project:
The task here is to extract data about the authors of an article from the XML and add it to a list, with one item per author. See the provided data structure for the expected format. The tags for first name (`fnm`), surname (`snm`) and email (`email`) should map directly to the dictionary keys.

In [35]:
import xml.etree.ElementTree as ET

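article_file = "exampleResearchArticle.xml"


def get_root(fname):
    # Parse the XML file and return its root element.
    tree = ET.parse(fname)
    return tree.getroot()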


def get_authors(root):
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        data = {
                "fnm": None,
                "snm": None,
                "email": None,
                "insr": []
        }
        data["fnm"] = author.find('./fnm').text
        data["snm"] = author.find('./snm').text
        data["email"] = author.find('./email').text
        insr = author.findall('./insr')
        # Each <insr> element stores its value in the iid attribute.
        for i in insr:
            data["insr"].append(i.attrib["iid"])
        authors.append(data)

    return authors


def test():
    solution = [{'insr': ['I1'], 'fnm': 'Omer', 'snm': 'Mei-Dan', 'email': 'omer@extremegate.com'},
                {'insr': ['I2'], 'fnm': 'Mike', 'snm': 'Carmont', 'email': 'mcarmont@hotmail.com'},
                {'insr': ['I3', 'I4'], 'fnm': 'Lior', 'snm': 'Laver', 'email': 'laver17@gmail.com'},
                {'insr': ['I3'], 'fnm': 'Meir', 'snm': 'Nyska', 'email': 'nyska@internet-zahav.net'},
                {'insr': ['I8'], 'fnm': 'Hagay', 'snm': 'Kammar', 'email': 'kammarh@gmail.com'},
                {'insr': ['I3', 'I5'], 'fnm': 'Gideon', 'snm': 'Mann', 'email': 'gideon.mann.md@gmail.com'},
                {'insr': ['I6'], 'fnm': 'Barnaby', 'snm': 'Clarck', 'email': 'barns.nz@gmail.com'},
                {'insr': ['I7'], 'fnm': 'Eugene', 'snm': 'Kots', 'email': 'eukots@gmail.com'}]

    root = get_root(article_file)
    data = get_authors(root)

    assert data[0] == solution[0]
    assert data[1]["insr"] == solution[1]["insr"]


test()

In [36]:
root = get_root(article_file)
get_authors(root)

[{'fnm': 'Omer',
  'snm': 'Mei-Dan',
  'email': 'omer@extremegate.com',
  'insr': ['I1']},
 {'fnm': 'Mike',
  'snm': 'Carmont',
  'email': 'mcarmont@hotmail.com',
  'insr': ['I2']},
 {'fnm': 'Lior',
  'snm': 'Laver',
  'email': 'laver17@gmail.com',
  'insr': ['I3', 'I4']},
 {'fnm': 'Meir',
  'snm': 'Nyska',
  'email': 'nyska@internet-zahav.net',
  'insr': ['I3']},
 {'fnm': 'Hagay',
  'snm': 'Kammar',
  'email': 'kammarh@gmail.com',
  'insr': ['I8']},
 {'fnm': 'Gideon',
  'snm': 'Mann',
  'email': 'gideon.mann.md@gmail.com',
  'insr': ['I3', 'I5']},
 {'fnm': 'Barnaby',
  'snm': 'Clarck',
  'email': 'barns.nz@gmail.com',
  'insr': ['I6']},
 {'fnm': 'Eugene', 'snm': 'Kots', 'email': 'eukots@gmail.com', 'insr': ['I7']}]
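
`get_authors` calls `.text` on the result of `find()`, which raises an `AttributeError` if an author lacks one of the tags. A possible variant, sketched below under the assumption that missing fields should simply become `None`, uses `findtext()` instead. The name `get_authors_safe` and the sample input are hypothetical and not part of the original exercise.

```python
import xml.etree.ElementTree as ET


def get_authors_safe(root):
    """Variant of get_authors that tolerates missing child tags."""
    authors = []
    for author in root.findall("./fm/bibl/aug/au"):
        authors.append({
            # findtext() returns None (its default) when the tag is absent.
            "fnm": author.findtext("fnm"),
            "snm": author.findtext("snm"),
            "email": author.findtext("email"),
            # .get() avoids a KeyError if an <insr> lacks the iid attribute.
            "insr": [i.get("iid") for i in author.findall("insr")],
        })
    return authors


# Made-up input: the second author has no <email> element.
sample = ET.fromstring(
    "<art><fm><bibl><aug>"
    "<au><fnm>Ada</fnm><snm>Lovelace</snm><insr iid='I1'/>"
    "<email>ada@example.org</email></au>"
    "<au><fnm>Alan</fnm><snm>Turing</snm><insr iid='I1'/><insr iid='I2'/></au>"
    "</aug></bibl></fm></art>"
)
result = get_authors_safe(sample)
print(result[1])
# {'fnm': 'Alan', 'snm': 'Turing', 'email': None, 'insr': ['I1', 'I2']}
```

Whether `None` is the right placeholder depends on how the list is consumed later; an empty string or skipping the author entirely would be equally reasonable choices.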