# <span style="color:#54B1FF">Parsing data:</span> &nbsp; <span style="color:#1B3EA9"><b>HTML files</b></span>

<br>

[HTML](https://en.wikipedia.org/wiki/HTML) is a text-based markup format and the standard format for web pages.

This notebook demonstrates how to parse relatively simple HTML files.
<br>

⚠️ **NOTE!**  &nbsp; &nbsp; All data files are saved in the same directory as this notebook.


___

First let's import the modules we'll need for this lecture.

In [1]:
import lxml.html

<a name="toc"></a>
# Table of Contents

* [Simple HTML](#html-simple)
* [HTML attributes](#html-attributes)


___
<a name="html"></a>
# HTML
[Back to Table of Contents](#toc)
<br>



___
<a name="html-simple"></a>
## `html` <span style="background-color:powderblue;">Simple HTML</span>
[Back to Table of Contents](#toc)
<br>

The file `data4.html` contains the following text:

<br>

```
<html lang="en">
	<head>
		<title>Example web page.</title>
	</head>
	<body>
		<p><b>This is an example HTML data file</b></p>
		<p>Mass = 65 kg</p>
		<p>Height = 170 cm</p>
	</body>
</html>
```

<br>

This file, when opened in a web browser, looks like this:


<br>
<br>

<img alt="html_example" width=600 src="html_screenshot1.png"/>

<br>
<br>

To view the underlying HTML code, open the HTML file in a text editor like Notepad or Wordpad.

Let's use `lxml.html` to parse this HTML file and try to retrieve the mass and height data.

In [2]:
tree      = lxml.html.parse('data4.html')
print(tree)

<lxml.etree._ElementTree object at 0x7fd9f8372f80>


The parsed HTML contents are now stored as an `ElementTree` object in the `tree` variable.

To retrieve a specific node from the element tree, use `find` like this:

In [3]:
head = tree.find('head')
title = head.find('title')

print(title)

<Element title at 0x7fd9f8443590>


To access the text for this element, use the `text` attribute:

In [4]:
s = title.text
print( s )

Example web page.


Alternatively, you can reach nested elements either by chaining `find` calls or by passing a path such as `head/title`:

In [5]:
print( tree.find('head').find('title') )
print( tree.find('head/title') )


<Element title at 0x7fd9f8443590>
<Element title at 0x7fd9f8443590>


Thus the text of a specific element can be retrieved with a single line of code, like this:

In [6]:
s = tree.find('head/title').text
print(s)

Example web page.
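
`find` returns `None` when no element matches the path, so chaining `.text` onto a failed lookup raises an `AttributeError`. The sketch below shows one possible way to guard against that using `findtext` with a default value. To stay self-contained, it parses an in-memory copy of the `data4.html` text through `io.BytesIO` rather than reading the file, and the `head/subtitle` path is a made-up example of an element that does not exist.

```python
import io
import lxml.html

# In-memory copy of data4.html (an assumption, so the example runs without the file)
HTML = b"""<html lang="en">
<head><title>Example web page.</title></head>
<body>
<p><b>This is an example HTML data file</b></p>
<p>Mass = 65 kg</p>
<p>Height = 170 cm</p>
</body>
</html>"""

tree = lxml.html.parse(io.BytesIO(HTML))

# 'head/subtitle' is a hypothetical path that matches nothing
missing = tree.find('head/subtitle')
print(missing)

# findtext returns the element's text, or the default if nothing matches
title_text    = tree.findtext('head/title', default='(no title)')
subtitle_text = tree.findtext('head/subtitle', default='(no subtitle)')

print(title_text)
print(subtitle_text)
```

Checking for `None` (or using a default) is worthwhile whenever the structure of the HTML file is not guaranteed.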


Next let's extract all of the text from the body paragraphs (i.e. all `<p>` elements).  For this, the `findall` method is convenient.

In [7]:
elements = tree.findall('body/p')

print(elements)

[<Element p at 0x7fd9f8443bd0>, <Element p at 0x7fd9f8443770>, <Element p at 0x7fd9f8443f90>]


Note that three elements are found because there are three paragraphs. The text can be retrieved, similar to above, as follows:

In [8]:
s = [e.text for e in tree.findall('body/p')]

print(s)

[None, 'Mass = 65 kg', 'Height = 170 cm']


Note that no text appears for the first `<p>` element. The `text` attribute holds only the text that comes before an element's first child, and here the paragraph's text lies inside a nested `<b>` (bold) element:

In [9]:
print( tree.findall('body/p/b')[0].text )

This is an example HTML data file
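
If the goal is simply to get all of the text inside each paragraph, nested elements included, elements parsed with `lxml.html` also provide a `text_content()` method. The following is a minimal sketch that uses an in-memory copy of the `data4.html` body (an assumption made so that the example runs without the file).

```python
import io
import lxml.html

# In-memory copy of data4.html (an assumption, so the example runs without the file)
HTML = b"""<html lang="en">
<head><title>Example web page.</title></head>
<body>
<p><b>This is an example HTML data file</b></p>
<p>Mass = 65 kg</p>
<p>Height = 170 cm</p>
</body>
</html>"""

tree = lxml.html.parse(io.BytesIO(HTML))
paragraphs = tree.findall('body/p')

# .text stops at the first child element; text_content() includes nested text
direct_text = [p.text for p in paragraphs]
full_text   = [p.text_content() for p in paragraphs]

print(direct_text)
print(full_text)
```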


For this example, we only want the mass and height data, so we can ignore the first `<p>` element. The second and third elements contain the following text:

In [10]:
s1 = s[1]
s2 = s[2]

print(s1)
print(s2)

Mass = 65 kg
Height = 170 cm


One way to retrieve the values is to split each string on spaces using the string `split` method.

In [11]:
print( s1.split(' ') )
print( s2.split(' ') )

['Mass', '=', '65', 'kg']
['Height', '=', '170', 'cm']


The mass and height are the third item (index 2) in each list. We can convert them to integers like this:

In [12]:
mass   = int( s1.split(' ')[2] )
height = int( s2.split(' ')[2] )

print( mass, height )

65 170
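
Splitting on single spaces works for this file, but it breaks if the spacing changes or a value has a decimal point. As a hedged alternative, the sketch below uses a regular expression instead. The helper name `parse_measurement` and the exact pattern are illustrative assumptions, not part of the original notebook.

```python
import re

# name, "=", a number (integer or decimal), then an optional unit
PATTERN = re.compile(r'^\s*(\w+)\s*=\s*(-?\d+(?:\.\d+)?)\s*(\w*)\s*$')

def parse_measurement(line):
    """Return (name, value, unit), or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    name, number, unit = m.groups()
    return name, float(number), unit

print(parse_measurement('Mass = 65 kg'))
print(parse_measurement('Height=170.5cm'))
print(parse_measurement('no numbers here'))
```

Values are returned as floats here so that decimal inputs are handled; use `int` instead if only whole numbers are expected.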


Putting everything together, we have:

In [13]:
tree      = lxml.html.parse('data4.html')
elements  = tree.findall('body/p')
mass      = int( elements[1].text.split(' ')[2] )
height    = int( elements[2].text.split(' ')[2] )

print(mass, height)

65 170
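
The code above relies on the mass and height being the second and third paragraphs. One possible variation, sketched below, looks up values by name instead of by position. The function name `read_measurements` is hypothetical, and the example parses an in-memory copy of `data4.html` so that it runs without the file; passing `'data4.html'` as the argument should work in the same way.

```python
import io
import lxml.html

# In-memory copy of data4.html (an assumption, so the example runs without the file)
HTML = b"""<html lang="en">
<head><title>Example web page.</title></head>
<body>
<p><b>This is an example HTML data file</b></p>
<p>Mass = 65 kg</p>
<p>Height = 170 cm</p>
</body>
</html>"""

def read_measurements(source):
    """Return a dict like {'Mass': 65} built from 'Name = value unit' paragraphs."""
    tree   = lxml.html.parse(source)
    values = {}
    for p in tree.findall('body/p'):
        text = p.text
        # Skip paragraphs with no direct text or no '=' sign
        if text is None or '=' not in text:
            continue
        name, _, rest = text.partition('=')
        values[name.strip()] = int(rest.split()[0])
    return values

values = read_measurements(io.BytesIO(HTML))
print(values['Mass'], values['Height'])
```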


___
<a name="html-attributes"></a>
## `html` <span style="background-color:powderblue;">HTML attributes</span>
[Back to Table of Contents](#toc)
<br>


Attributes are name-value pairs attached to specific elements. Like the element tags themselves, they are not visible in rendered HTML files (i.e. when viewing HTML files in a web browser).

Although they cannot be seen in web browsers, they can be very useful for parsing HTML files, as we shall see in the example below.

<br> 

The file `data5.html` contains the following text:

<br>

```
<html>
	<head>
		<title>Example HTML file with attributes</title>
	</head>
	<body>
		<h2>HTML file with attributes</h2>
		<p>
			<button id="X Button">X Button 1</button>
			<span>45</span>
		</p>
		<p>
			<button id="X Button">X Button 2</button>
			<span>90</span>
		</p>
		<p>
			<button id="X Button">X Button 3</button>
			<span>23</span>
		</p>
		<p>
			<button id="Y Button">Y Button 1</button>
			<span>40</span>
		</p>
		<p>
			<button id="Y Button">Y Button 2</button>
			<span>55</span>
		</p>
		<p>
			<button id="X Button">X Button 4</button>
			<span>64</span>
		</p>
	</body>
</html>
```

<br>

This file, when opened in a web browser, looks like this:


<br>
<br>

<img alt="html_example" width=600 src="html_screenshot2.png"/>


<br>
<br>


The goal here will be to extract only the "X Button" values. We shall ignore the "Y Button" values.

In [14]:
tree      = lxml.html.parse('data5.html')
elements  = tree.findall('body/p/button')

print(elements)

[<Element button at 0x7fd9f8456b80>, <Element button at 0x7fd9f8456c20>, <Element button at 0x7fd9f8456c70>, <Element button at 0x7fd9f8456bd0>, <Element button at 0x7fd9f8456b30>, <Element button at 0x7fd9f8456ae0>]


All six buttons have been found, but we want only the four "X Button" values. We can check the button type using the `id` attribute:

In [15]:
b = elements[0]
print( b.attrib['id'] )
print( b.attrib['id'] == 'X Button' )

X Button
True
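
Indexing `attrib` with a missing key raises a `KeyError`, just like a Python dictionary. When some elements may lack the attribute, the `get` method is a safer choice because it returns `None` (or a default you supply) instead. The snippet below is a minimal sketch; the second button without an `id` is a made-up case that does not appear in `data5.html`.

```python
import lxml.html

# Hypothetical fragment: the second button has no id attribute
snippet = '<p><button id="X Button">X Button 1</button><button>No id</button></p>'

p       = lxml.html.fromstring(snippet)
buttons = p.findall('button')

# get() returns None (or the given default) when the attribute is missing
ids          = [b.get('id') for b in buttons]
ids_default  = [b.get('id', 'unknown') for b in buttons]

print(ids)
print(ids_default)
```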


Thus we can retrieve the X buttons by checking the `id` attribute, like this: 

In [16]:
elements = [b for b in tree.findall('body/p/button') if b.attrib['id'] == 'X Button']

print( elements )

[<Element button at 0x7fd9f8456b80>, <Element button at 0x7fd9f8456c20>, <Element button at 0x7fd9f8456c70>, <Element button at 0x7fd9f8456ae0>]
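
As an alternative to filtering in a list comprehension, the path passed to `findall` can carry an attribute predicate of the form `[@name='value']`. The sketch below applies this to an in-memory, compacted copy of `data5.html` (an assumption made so that the example runs without the file).

```python
import io
import lxml.html

# Compacted in-memory copy of data5.html (an assumption, so the example runs without the file)
HTML = b"""<html><head><title>Example HTML file with attributes</title></head>
<body>
<h2>HTML file with attributes</h2>
<p><button id="X Button">X Button 1</button><span>45</span></p>
<p><button id="X Button">X Button 2</button><span>90</span></p>
<p><button id="X Button">X Button 3</button><span>23</span></p>
<p><button id="Y Button">Y Button 1</button><span>40</span></p>
<p><button id="Y Button">Y Button 2</button><span>55</span></p>
<p><button id="X Button">X Button 4</button><span>64</span></p>
</body></html>"""

tree = lxml.html.parse(io.BytesIO(HTML))

# Keep only <button> elements whose id attribute equals 'X Button'
x_buttons = tree.findall("body/p/button[@id='X Button']")
labels    = [b.text for b in x_buttons]

print(labels)
```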


Now only the four X button elements are retrieved. Great!

However, we want the values that lie in the `<span>` elements, not in the `<button>` elements, so we need to be a bit more clever in our parsing. Since both the `<button>` and `<span>` elements lie within `<p>` elements, we should parse all `<p>` elements, like this:

In [17]:
data = []
for p in tree.findall('body/p'):
    b = p.find('button')
    if b.attrib['id'] == 'X Button':
        s = p.find('span').text
        data.append( int(s) )
        
print(data)

[45, 90, 23, 64]
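
The same selection can also be written as a single XPath query using the `xpath` method that `lxml` provides. This is offered as an alternative sketch rather than a replacement for the loop above; it again uses an in-memory, compacted copy of `data5.html` so that it runs without the file.

```python
import io
import lxml.html

# Compacted in-memory copy of data5.html (an assumption, so the example runs without the file)
HTML = b"""<html><head><title>Example HTML file with attributes</title></head>
<body>
<h2>HTML file with attributes</h2>
<p><button id="X Button">X Button 1</button><span>45</span></p>
<p><button id="X Button">X Button 2</button><span>90</span></p>
<p><button id="X Button">X Button 3</button><span>23</span></p>
<p><button id="Y Button">Y Button 1</button><span>40</span></p>
<p><button id="Y Button">Y Button 2</button><span>55</span></p>
<p><button id="X Button">X Button 4</button><span>64</span></p>
</body></html>"""

tree = lxml.html.parse(io.BytesIO(HTML))

# Select the text of every <span> whose parent <p> contains an "X Button"
texts = tree.xpath('//p[button[@id="X Button"]]/span/text()')
data  = [int(t) for t in texts]

print(data)
```

The explicit loop is easier to read and debug for beginners, while the XPath form is more compact once the query syntax is familiar.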
