<img src="https://github.com/i3hsInnovation/resources/blob/master/images/introbanner.png?raw=true" />

<table style="float:right" width="50%">
    <tr>
        <td>                      
            <div style="text-align: left"><a href="" target="_blank">Dr Peter Causey-Freeman</a></div>
            <div style="text-align: left">Lecturer - Healthcare Sciences</div>
            <div style="text-align: left">(Clinical Bioinformatics)</div>
            <div style="text-align: left">The University of Manchester</div>
         </td>
         <td>
             <img src="https://github.com/i3hsInnovation/resources/blob/master/images/pete.001.png?raw=true" width="40%" />
         </td>
     </tr>
</table>

# Parsing JSON and XML 
****

### About this Notebook

In the previous notebook, we learned about the most common response types returned by REST APIs, *i.e.* JSON and XML. In this notebook, we will learn how to request data from a REST API in both JSON and XML formats. We will also learn how to parse and work with JSON and XML data effectively.


This notebook is at <code>Beginner</code> level and will take approximately 1 hour to complete.

-------------------------------------------------------

<div class="alert alert-block alert-warning"><b>Learning Objective:</b> Using electronic and online resources (specifically parsing and extracting data from common data formats)</div>

<b><h2>Table of Contents</h2></b>  


#### Section 1: [Requesting JSON and XML](#requests)
- [Requests module recap](#rrecap)
- [Requesting XML](#rxml)

#### Section 2: [Parsing JSON and XML](#parse)
- [JSON tutorial](#tutjs)
- [XML and ElementTree tutorial](#tutxml)

#### Section 3: [Parse the XML returned by the REST API](#restxml)
- [Extract and format the XML](#recxml)
- [Parse the XML](#prsxml)

#### [Summary](#sum)

----------

<a id="requests"></a>
<table width="100%" style="float:left">
    <tr>
        <td width="60%" style="text-align: left">
            <h1>Section 1: Requesting JSON and XML</h1>
        </td>
        <td width="40%">
            <img src="https://github.com/i3hsInnovation/resources/blob/master/images/JSON-vs-XML.png?raw=true" width="60%"/>
        </td>
    </tr>
</table> 

***
<sup>Image by Peter Causey-Freeman</sup>

<a id="rrecap"></a>

<b><h2>Requests module recap</h2></b>  
</div>

### The Python requests module

In a previous notebook, we looked at the Python [Requests](https://2.python-requests.org/en/master/) module. Requests is by no means the only Python module that can be used to request data from web resources, but the reason we recommend it is that it is light-touch and user-friendly. 

> Requests: HTTP for Humans™

> Requests is the only HTTP library for Python safe for human consumption


#### Method

Although we have already used requests, let's step through the method now that we have updated our REST API.

***Note: Remember to activate your API***

```bash
$ python applications/app_v6.py
```

<br>

##### 1. **Import required modules**

In [None]:
# For now we will import the requests and json modules
import requests
import json

##### 2. **Request some data**

In [None]:
# Note: we are using the content-type=application/json method we created in the previous notebook 
response = requests.get('http://127.0.0.1:5000/name/Bob?content-type=application/json')

##### 3. **Check the [status code](https://www.w3.org/Protocols/HTTP/HTRESP.html) - we want to see 200**

In [None]:
response.status_code
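
The status code is a plain integer, so it can help to translate it into its standard reason phrase. The following is a minimal sketch using `http.HTTPStatus` from the Python standard library. The helper name `describe_status` is hypothetical and is not part of `requests`; with a live response you could pass it `response.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable description such as '200 OK' (hypothetical helper)."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one of the statuses registered in HTTPStatus
        return f"{code} Unknown status"
    return f"{status.value} {status.phrase}"


print(describe_status(200))
print(describe_status(404))
```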

##### 4. **Look at the response headers**

In [None]:
response.headers

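The response headers are returned as a Python dictionary; however, it is a very specific dictionary type. In `requests` the class is called `CaseInsensitiveDict`, and it behaves like the CaselessDictionary described below.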

> **[CaselessDictionary](https://gist.github.com/babakness/3901174)**<br><br>
    Dictionary that enables case insensitive searching while preserving case sensitivity 
    when keys are listed, ie, via keys() or items() methods




Here are a couple of little tricks for printing dictionaries/JSON in a more human-friendly way:

<br>

   - ***a) Use the dictionary.items() method***

In [None]:
# The dictionary.keys() method returns the dictionary keys
response.headers.keys()

In [None]:
# The dictionary.items() method returns the dictionary keys and values as pairs
response.headers.items()

In [None]:
# We can loop through items, splitting the key and value into separate objects and print as we go along
for my_key, my_val in response.headers.items():
    print(my_key + ': ' + my_val)

<br>

  - ***b) Use the Python [json](https://www.w3schools.com/python/python_json.asp) module*** 

In [None]:
# The case-insensitive headers dictionary is not JSON serialisable
# Convert it into a standard Python dictionary
headers_dict = dict(response.headers)

In [None]:
# Print the dictionary in a human-friendly format using json.dumps()

#                a. What we print  b. Sort keys (A to Z)  c. Add indent   d. Specify separators that split i) items and ii) key-val pairs
print(json.dumps(headers_dict,     sort_keys=True,        indent=4,       separators=(',', ': ')))
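
To see what these arguments do without a running API, here is a small sketch that formats a hand-written dictionary. The header values in `sample_headers` are made up for illustration and are not the real output of our API.

```python
import json

# Hypothetical header values, used only to illustrate the formatting
sample_headers = {
    "Server": "ExampleServer",
    "Content-Type": "application/json",
    "Content-Length": "42",
}

# Default output: a single line, keys in insertion order
compact = json.dumps(sample_headers)

# Sorted keys, 4-space indent, explicit item and key-value separators
pretty = json.dumps(sample_headers, sort_keys=True, indent=4, separators=(',', ': '))

print(compact)
print(pretty)
```

The pretty-printed version places one key-value pair per line, which is much easier to scan when a response contains many headers.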

##### 5. **Extract and print the response JSON**

In [None]:
# Assign to a dictionary
my_dict = response.json()
# print 
print(json.dumps(my_dict, sort_keys=True, indent=4, separators=(',', ': ')))

<a id="rxml"></a>

<b><h2>Requesting XML</h2></b>  
</div>

### Request XML data with the requests module

The initial workflow for recovering XML data from a REST API using the requests module is much the same as for JSON.

In [None]:
# Make request - now set to content-type=application/xml
response_2 = requests.get('http://127.0.0.1:5000/name/Bob?content-type=application/xml')
# Check status code
print('Status code: ' + str(response_2.status_code))
# Print headers
headers_dict_2 = dict(response_2.headers)
print('Headers: ' + json.dumps(headers_dict_2, sort_keys=True, indent=4, separators=(',', ': ')))

***However, unlike JSON (`response.json()`), requests has no built-in method for converting an XML response into an object that Python can parse***
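
One way to handle both formats is to look at the `Content-Type` header and pick a parser accordingly. The sketch below is only an illustration under stated assumptions: `parse_body` is a hypothetical helper, and the two bodies are hand-written stand-ins for what an API might return. It uses `json.loads` for JSON and `xml.etree.ElementTree.fromstring` for XML, both from the standard library.

```python
import json
import xml.etree.ElementTree as ET


def parse_body(content_type, body):
    """Parse a response body according to its Content-Type (hypothetical helper)."""
    # Drop any parameters, e.g. 'application/json; charset=utf-8'
    media_type = content_type.split(';')[0].strip().lower()
    if media_type == 'application/json':
        return json.loads(body)
    if media_type in ('application/xml', 'text/xml'):
        return ET.fromstring(body)
    raise ValueError(f'Unsupported content type: {content_type}')


# Hypothetical bodies standing in for real API responses
as_dict = parse_body('application/json', '{"name": "Bob"}')
as_element = parse_body('application/xml; charset=utf-8',
                        '<person><name>Bob</name></person>')

print(as_dict['name'])
print(as_element.find('name').text)
```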

--------------------

<a id="parse"></a>
<table width="100%" style="float:left">
    <tr>
        <td width="60%" style="text-align: left">
            <h1>Section 2: Parsing JSON and XML</h1>
        </td>
        <td width="40%">
            <img src="https://github.com/i3hsInnovation/resources/blob/master/images/JSON-vs-XML.png?raw=true" width="60%"/>
        </td>
    </tr>
</table> 

***
<sup>Image by Peter Causey-Freeman</sup>

<a id="tutjs"></a>

<b><h2>JSON tutorial</h2></b>  
</div>

Although XML is a very powerful data format, it is not the easiest format to work with in Python. Take a look at this Stack Overflow thread entitled [Is there any python xml parser that was designed with humans in mind?](https://stackoverflow.com/questions/1493899/is-there-any-python-xml-parser-that-was-designed-with-humans-in-mind)

JSON works well with Python because it is essentially structured like a Python dictionary. Therefore, as we saw earlier, we can simply parse the dictionary using Python's built-in [dictionary methods](https://www.w3schools.com/python/python_dictionaries.asp).
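
As a quick illustration of that mapping, the sketch below parses a hand-written JSON string (hypothetical data, not taken from the tutorial) and then uses ordinary dictionary methods on the result. JSON objects become `dict`, arrays become `list`, `true` becomes `True` and `null` becomes `None`.

```python
import json

# A hypothetical JSON document, not taken from the tutorial
json_text = '{"name": "Bob", "age": 42, "tags": ["a", "b"], "active": true, "email": null}'

record = json.loads(json_text)

# The parsed result is a normal Python dictionary
print(type(record).__name__)
print(list(record.keys()))

# .get() returns None (or a default we choose) instead of raising KeyError
print(record.get("email"))
print(record.get("missing", "n/a"))

# Each JSON value has been converted to the matching Python type
for key, value in record.items():
    print(key, type(value).__name__)
```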

Read through the [datacamp.com](https://www.datacamp.com/) tutorial [JSON Data in Python](https://www.datacamp.com/community/tutorials/json-data-python).

I have added the code from the tutorial to the directory `json_tutorial`, and the JSON file is `blog.json`. I have also provided the first commands so that you can follow the tutorial in this notebook.

#### 1. Import json

In [None]:
import json

#### 2. Read the JSON file

In [None]:
# Open the file for reading; the with block closes it automatically
with open("json_tutorial/blog.json", "r") as fo:
    # Read the JSON text into a string
    my_json_string = fo.read()

#### 3. Convert to a Python dictionary object

In [None]:
to_python = json.loads(my_json_string)

#### 4. Print the blog

In [None]:
to_python['blog']
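
The steps above read the file into a string and then parse the string. The `json` module can also work on open files directly with `json.dump` and `json.load`. The sketch below is a minimal round trip that assumes nothing about `blog.json`: it writes made-up data to a temporary file (the file name `example.json` is hypothetical) and reads it back.

```python
import json
import os
import tempfile

# Hypothetical data; the real tutorial file is json_tutorial/blog.json
data = {"blog": [{"title": "Example post", "views": 10}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.json")

    # json.dump writes a Python object straight to an open file
    with open(path, "w") as file_out:
        json.dump(data, file_out, indent=4)

    # json.load reads and parses in one step (no separate read() call)
    with open(path, "r") as file_in:
        reloaded = json.load(file_in)

# The reloaded object is equal to the one we wrote
print(reloaded == data)
```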

<div class="alert alert-info">

### Exercise

Complete the tutorial following the structure we have begun above, *i.e.* add cells to this notebook with your code and some markdown to describe the steps.

<br>

*Note: reading from and writing to the JSON file will require an additional path variable, i.e. `json_tutorial/blog.json`*

</div>

<a id="tutxml"></a>

<b><h2>XML and ElementTree tutorial</h2></b>  
</div>

Parsing XML is more complicated, but can be extremely powerful when you need to parse structured data returned by web APIs.

Before we take a look at the XML returned by our REST API, read through the [datacamp.com](https://www.datacamp.com/) tutorial [Python XML with ElementTree: Beginner's Guide](https://www.datacamp.com/community/tutorials/python-xml-elementtree).

I have added the code from the tutorial to the directory `xml_tutorial`, and the XML file is `movies.xml`. I have also provided the first commands so that you can follow the tutorial in this notebook.

#### 1. Import ElementTree

In [None]:
import xml.etree.ElementTree as ET

#### 2. Read in the file with ElementTree

In [None]:
tree = ET.parse('xml_tutorial/movies.xml')
root = tree.getroot()

#### 3. Print the root tag and attribute

In [None]:
print(root.tag)
print(root.attrib)
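
If you would like to experiment before opening `movies.xml`, the sketch below applies the same ideas to a tiny hand-written document. The XML and the names `demo_xml` and `demo_root` are hypothetical and are chosen so that they do not overwrite `tree` or `root` in this notebook.

```python
import xml.etree.ElementTree as ET

# A small hypothetical document; it is not the tutorial's movies.xml
demo_xml = """<library name="demo">
    <book id="1"><title>First</title></book>
    <book id="2"><title>Second</title></book>
</library>"""

# fromstring parses XML held in a string and returns the root element
demo_root = ET.fromstring(demo_xml)

# The root element has a tag and a dictionary of attributes
print(demo_root.tag)
print(demo_root.attrib)

# findall returns the direct children with a matching tag
for book in demo_root.findall('book'):
    print(book.get('id'), book.find('title').text)
```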

<div class="alert alert-info">

### Exercise

Complete the tutorial following the structure we have begun above, *i.e.* add cells to this notebook with your code and some markdown to describe the steps.

<br>

*Note: reading from and writing to the XML file will require an additional path variable, i.e. `xml_tutorial/movies.xml`*

</div>

<a id="restxml"></a>
<table width="100%" style="float:left">
    <tr>
        <td width="60%" style="text-align: left">
            <h1>Section 3: Parse the XML returned by the REST API</h1>
        </td>
        <td width="25%">
            <img src="https://github.com/i3hsInnovation/resources/blob/master/images/xml.png?raw=true" width="40%"/>
        </td>
    </tr>
</table>
<div><sup>
        Icon by <a href="https://www.flaticon.com/authors/smashicons" title="Smashicons">Smashicons</a> from <a href="https://www.flaticon.com/" title="Flaticon">www.flaticon.com</a> is licensed by <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0" target="_blank">CC 3.0 BY</a>
</sup></div>

<a id="recxml"></a>

<b><h2>Extract and format the XML</h2></b>  
</div>

### So where is our XML in the API response?

When we use `requests` to retrieve XML data from an API, the XML is stored as bytes in the `.content` attribute of the response.

##### 1. Recover the content

In [None]:
xml = response_2.content
xml

##### 2. Convert the content from a [bytestring](https://stackoverflow.com/questions/6224052/what-is-the-difference-between-a-string-and-a-byte-string) into a string

In [None]:
xml = xml.decode("utf-8")
xml

##### 3. Import ElementTree and parse the XML

In [None]:
import xml.etree.ElementTree as ET

In [None]:
# Note: ET.fromstring() returns the root element, so we wrap it in ET.ElementTree() to get a tree object
tree = ET.ElementTree(ET.fromstring(xml))
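
The following sketch recaps these three steps without needing the API. The bytes in `raw_bytes` are a hypothetical stand-in for `response_2.content`, and the variable names are chosen so that nothing in the notebook is overwritten. It also shows that `ET.fromstring` accepts either bytes or a string and returns the root element directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical bytes standing in for response_2.content
raw_bytes = b'<person><name>Bob</name></person>'

# Decoding turns the bytes object into a str
as_text = raw_bytes.decode('utf-8')
print(type(raw_bytes).__name__, type(as_text).__name__)

# ET.fromstring accepts bytes or str and returns the root Element
root_from_bytes = ET.fromstring(raw_bytes)
root_from_text = ET.fromstring(as_text)
print(root_from_bytes.tag, root_from_text.tag)

# Wrapping the element in ET.ElementTree gives an object with getroot(),
# which is the same kind of object that ET.parse() returns for a file
wrapped = ET.ElementTree(root_from_text)
print(wrapped.getroot() is root_from_text)
```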

<a id="prsxml"></a>

<b><h2>Parse the XML</h2></b>  
</div>

### Run some of the methods we learned in the XML tutorial above

In [None]:
root = tree.getroot()

In [None]:
root.tag

In [None]:
root.attrib

In [None]:
for child in root:
    print(child.tag, child.attrib)

In [None]:
print(ET.tostring(root, encoding='utf8').decode('utf8'))

In [None]:
[elem.tag for elem in root.iter()]
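
Once you can list the tags, a common next step is to pull the values out into a Python dictionary. The sketch below shows one possible approach on a hand-written document. The XML in `sample_xml` is hypothetical, so the structure returned by your API may well differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; the structure returned by your API may differ
sample_xml = '<person id="7"><name>Bob</name><role>admin</role></person>'
person = ET.fromstring(sample_xml)

# Every tag in document order, including the root itself
all_tags = [elem.tag for elem in person.iter()]

# Map each direct child tag to the text it contains
fields = {child.tag: child.text for child in person}

print(all_tags)
print(fields)

# findtext returns the text of the first match, or a default if there is none
print(person.findtext('name'))
print(person.findtext('missing', default='n/a'))
```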

<a id="sum"></a>

<b><h2>Summary</h2></b>  
</div>

In this notebook, we have learned how to request both JSON and XML data from REST APIs using the `requests` module. We have also explored the differences in the way that the `requests` module handles JSON and XML data, and how to extract the data from the API response.

We have learned how to convert JSON and XML data into a format that Python can work with using the `json` and `xml.etree.ElementTree` modules respectively. We have also learned how to parse the data effectively.

In the next notebook, we will use what we have learned in this notebook to explore data returned by the Ensembl REST APIs.

-----------------------------------------------------------

#### Notebook details
<br>
<i>Notebook created by <strong>Dr. Pete Causey-Freeman</strong> with <strong>Frances Hooley</strong> 
    

Publish date: October 2020<br>
Review date: October 2021</i>

