# JSON and HTML Processing
This tutorial covers basic operations with HTML and JSON.

For more information about these formats, see:
* <a href="https://en.wikipedia.org/wiki/JSON">JavaScript Object Notation</a>
* <a href="https://en.wikipedia.org/wiki/HTML">HyperText Markup Language (HTML)</a>

## HTML (XML) parsing
There are two main approaches to parsing data:

* SAX (Simple API for XML) - scans elements on the fly and reports them as events. This approach does not keep the document in memory.

* DOM (Document Object Model) - creates a model of all elements in memory, which allows higher-level operations such as searching and navigating the tree.


### HTML parsing with Python HTMLParser class
This section introduces <a href="https://docs.python.org/2/library/htmlparser.html">HTMLParser</a>, which is a SAX-style (event-driven) parser. The following examples use this sample HTML content:

In [1]:
sample_html =  """
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Heading!</h1>
        <p class="major_content">Some content.</p>
        <p class="minor_content">Some other content.</p>
    </body>
</html>
"""

A simple usage example follows. This parser prints every tag and piece of data it encounters.

In [2]:
# from HTMLParser import HTMLParser # Python 2.7
from html.parser import HTMLParser

class TestHTMLParser(HTMLParser):
    
    def handle_starttag(self, tag, attrs):
        print("Tag start:", tag)

    def handle_endtag(self, tag):
        print("Tag end:", tag)

    def handle_data(self, data):
        print("Tag data:", data)
    
# instantiate the parser and feed in some HTML
parser = TestHTMLParser()
parser.feed(sample_html)

Tag data: 

Tag start: html
Tag data: 
    
Tag start: head
Tag data: 
        
Tag start: title
Tag data: Test
Tag end: title
Tag data: 
    
Tag end: head
Tag data: 
    
Tag start: body
Tag data: 
        
Tag start: h1
Tag data: Heading!
Tag end: h1
Tag data: 
        
Tag start: p
Tag data: Some content.
Tag end: p
Tag data: 
        
Tag start: p
Tag data: Some other content.
Tag end: p
Tag data: 
    
Tag end: body
Tag data: 

Tag end: html
Tag data: 



The goal of this second parser is to print the content of the paragraph with the class `major_content`.

In [3]:
class Test2HTMLParser(HTMLParser):

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # True while we are inside the wanted paragraph
        self.recording = False
    
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; check the class attribute only
        if tag == "p" and dict(attrs).get("class") == "major_content":
            self.recording = True

    def handle_endtag(self, tag):
        # any closing tag stops the recording
        self.recording = False

    def handle_data(self, data):
        if self.recording:
            print(data)

# instantiate the parser and feed in some HTML
parser2 = Test2HTMLParser()
parser2.feed(sample_html)

Some content.
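
Because an event-driven parser handles input as it arrives, the HTML does not have to be available all at once. The following is a minimal sketch of that idea, not a definitive implementation: the class name `ParagraphCollector`, the shortened sample string, and the chunk size of 16 characters are all assumptions made for illustration. It collects the text into a list instead of printing it, and joins the pieces because `handle_data` may be called several times for one text node when the input is split.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collect the text of <p> elements with a given class."""

    def __init__(self, wanted_class):
        super().__init__(convert_charrefs=True)
        self.wanted_class = wanted_class
        self.results = []
        self._buffer = None  # list of text pieces while inside a wanted <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == self.wanted_class:
            self._buffer = []

    def handle_data(self, data):
        # may be called more than once per text node when input is chunked
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.results.append("".join(self._buffer))
            self._buffer = None

# shortened sample, assumed for this sketch
html_text = ('<body><p class="major_content">Some content.</p>'
             '<p class="minor_content">Other.</p></body>')

collector = ParagraphCollector("major_content")
# feed the document in small chunks, as if it arrived over a network
for start in range(0, len(html_text), 16):
    collector.feed(html_text[start:start + 16])
collector.close()

print(collector.results)
```

Keeping the results on the parser object, rather than printing inside the handlers, makes the collected text easy to reuse later in the program.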


### Examples with the ElementTree XML API

See <a href="https://docs.python.org/2/library/xml.etree.elementtree.html">ElementTree XML API</a> for more information. This library is designed for XML parsing, but it is also possible to use it for HTML parsing with varying success, because the input must be well-formed XML. It is a DOM-style parser. A simple example that iterates over the first two levels of the HTML tree follows:

In [4]:
import xml.etree.ElementTree as ET

tree = ET.fromstring(sample_html)

for child1 in tree:
    print(child1.tag)
    for child2 in child1:
        print("\t", child2.tag, "-", child2.text)

head
	 title - Test
body
	 h1 - Heading!
	 p - Some content.
	 p - Some other content.


The second example prints only the content of the paragraph with the `major_content` class:

In [5]:
import xml.etree.ElementTree as ET

tree = ET.fromstring(sample_html)
        
tree.findall("./body/p[@class='major_content']")[0].text

'Some content.'
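
ElementTree accepts the sample above only because it happens to be well-formed XML. The short sketch below illustrates the limitation with two made-up one-line fragments (both strings are assumptions for this example): markup that is valid HTML but not well-formed XML, such as an unclosed `<br>`, raises `ET.ParseError`.

```python
import xml.etree.ElementTree as ET

# well-formed markup: the <br/> element is explicitly closed
ok = ET.fromstring("<p>Line one<br/>Line two</p>")
print(ok.tag)

# common HTML style: <br> is never closed, so this is not well-formed XML
try:
    ET.fromstring("<p>Line one<br>Line two</p>")
    parsed = True
except ET.ParseError as err:
    parsed = False
    print("ParseError:", err)
```

For real-world pages, which often contain this kind of markup, an HTML-aware parser is usually the safer choice.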

### Examples with BeautifulSoup library
<a href="https://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> is a library dedicated to simplifying the scraping of information from HTML pages. It is a DOM-style parser. Sample data follows:

In [6]:
SAMPLE_HTML = """<html>
    <head>
        <title>Example webpage!</title>
    </head>
    <body>
        <table id="main_table">
            <tr>
                <th>Firstname</th>
                <th>Lastname</th>
                <th>Age</th>
            </tr>
            <tr>
                <td class="first_name">Alice</td>
                <td class="last_name">Smith</td>
                <td class="age">31</td>
            </tr>
            <tr>
                <td class="first_name">Bob</td>
                <td class="last_name">Stone</td>
                <td class="age">38</td>
            </tr>
            <tr>
                <td class="first_name">Narcissus</td>
                <td class="last_name">Hyacinth</td>
                <td class="age">34</td>
            </tr>
            <tr>
                <td class="first_name">Adelmar</td>
                <td class="last_name">Egino</td>
                <td class="age">50</td>
            </tr>
        </table> 
    </body>
</html>"""

In [7]:
from bs4 import BeautifulSoup

# template for printing the output
sentence = "{} {} is {} years old."

# create tree
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# get title and print it
title = soup.find("title")
print(title.text, "\n")

# select all rows in table
table = soup.find("table",  {"id": "main_table"})
table_rows = table.find_all("tr")

# iterate over table and print results
for row in table_rows:
    first_name = row.find("td", {"class": "first_name"})
    last_name = row.find("td", {"class": "last_name"})
    age = row.find("td", {"class": "age"})
    if first_name and last_name and age:
        print(sentence.format(first_name.text, last_name.text, age.text))

Example webpage! 

Alice Smith is 31 years old.
Bob Stone is 38 years old.
Narcissus Hyacinth is 34 years old.
Adelmar Egino is 50 years old.


The attributes of an element are accessible through its `attrs` dictionary:

In [8]:
print(table.attrs)

{'id': 'main_table'}


### Getting a specific string from HTML (or other text)
In some cases it can be beneficial to extract a particular piece of information from the source without parsing it. The following examples use this source:

In [9]:
sample_html =  """
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Heading!</h1>
        <p class="major_content">Some content. And even more content.</p>
        <p class="minor_content">
            Some other content.
            Numbers related content.
            The important information is, that the key number is 23.
        </p>
    </body>
</html>
"""

If you need just the <i>key number</i> value from the text, and you are sure that:

* the information appears only once in the text

* the information will not change its form (words, word order, ...)

you can use the following approach.

In [10]:
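# unclean way: plain string search
marker = "the key number is "
# position right after the marker (including its trailing space)
target_start = sample_html.find(marker) + len(marker)
# position of the first dot after the marker
target_end = sample_html.find(".", target_start)
print(sample_html[target_start:target_end])

23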


Or you can do the same thing more robustly with a <a href="https://docs.python.org/2/library/re.html">regular expression</a>.

In [11]:
# much better way (with regex)
import re
# capture the digits between the marker phrase and the literal dot
print(re.search(r"the key number is (\d+)\.", sample_html).group(1))

23
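
Note that `re.search` returns `None` when the pattern is not found, so calling `.group(1)` directly fails on unexpected input. A minimal sketch of a more defensive variant follows; the function name `extract_key_number` and the short test strings are assumptions made for this illustration.

```python
import re

def extract_key_number(text):
    """Hypothetical helper: return the key number as an int, or None if absent."""
    match = re.search(r"the key number is (\d+)", text)
    if match is None:
        return None
    return int(match.group(1))

print(extract_key_number("The important information is, that the key number is 23."))
print(extract_key_number("There is no number in this text."))
```

Returning `None` leaves the decision about how to handle a missing value to the caller.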


## Working with JSON
The following code shows how to create a JSON-encoded message in Python with the <a href="https://docs.python.org/2/library/json.html">JSON library</a>.

### Simple example

In [12]:
import json

# sample data
message = [
    {"time": 123, "value": 5},
    {"time": 124, "value": 6},
    {"status": "ok", "finish": [True, False, False]}, 
]

# pack message as json
js_message = json.dumps(message)

# show result
print(type(js_message))
print(js_message)

<class 'str'>
[{"time": 123, "value": 5}, {"time": 124, "value": 6}, {"status": "ok", "finish": [true, false, false]}]


Note that the output is a string. In a similar way, you can unpack the message back into standard Python lists and dictionaries. An example follows.

In [13]:
# unpack message
message = json.loads(js_message)

# show result
print(type(message))
print(message)

<class 'list'>
[{'time': 123, 'value': 5}, {'time': 124, 'value': 6}, {'status': 'ok', 'finish': [True, False, False]}]
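
The round trip returns the same data here, but that is not guaranteed for every Python object, because JSON has fewer types than Python. The sketch below uses a made-up dictionary (an assumption for this example) to show two common differences: non-string keys become strings, and tuples come back as lists.

```python
import json

# sample data chosen to show the type changes (assumed for this sketch)
original = {1: "one", "point": (2, 3), "missing": None}

restored = json.loads(json.dumps(original))

print(restored)
print(restored == original)
```

If the exact Python types matter, they have to be converted back explicitly after loading.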


The json library can also make JSON dumps more readable for humans. See the following example.

In [14]:
nice_message = json.dumps(message, indent=4, sort_keys=True)
print(nice_message)

[
    {
        "time": 123,
        "value": 5
    },
    {
        "time": 124,
        "value": 6
    },
    {
        "finish": [
            true,
            false,
            false
        ],
        "status": "ok"
    }
]
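
The opposite is possible as well. When the message is meant for a program rather than a person, the `separators` parameter of `json.dumps` can remove the optional whitespace. This is a small sketch with a shortened message (an assumption for this example); both forms decode to the same data.

```python
import json

# shortened sample message, assumed for this sketch
message = [{"time": 123, "value": 5}, {"status": "ok"}]

pretty = json.dumps(message, indent=4, sort_keys=True)
# no space after "," and ":" gives the most compact encoding
compact = json.dumps(message, separators=(",", ":"))

print(compact)
print(len(pretty), len(compact))
print(json.loads(pretty) == json.loads(compact))
```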
