# <span style="color:#54B1FF">Exploring Data:</span> &nbsp; <span style="color:#1B3EA9"><b>Parsing an Entry</b></span>

<br>

The word "entry" implies one unit of data. For this example, "one entry" means "one book".

<br>

Let's first import the modules that we'll need. Let's also specify the directory in which the data are saved.

In [1]:
import os
import lxml.html

dir0       = os.path.abspath('')              # directory in which this notebook is saved
dirLesson  = os.path.dirname( dir0 )          # Lesson directory
dirData    = os.path.join(dirLesson, 'Data')  # Data directory


___

## HTML code for a single entry
<br>

Let's open the first HTML file in a **text editor** (e.g. Notepad), and find where the data are stored for the first entry. The easiest way to do this is to search the text file for a word in the title, for example "Django".  This search will take us to a spot in the file as depicted below.

<br>
<img src="img/ss1.png" alt="screenshot" width=700 />
<br>

Note that, near the bottom of this image, on Line 459, there is a `<span>` tag that contains the title of the first book: <span style="color:red">"Django: Django , Web framework for Python"</span>.

Note also that the class of this `<span>` tag is `"name"`.

We can use this information to parse the HTML file, as demonstrated below.

___

## Parsing book titles

If we look at the HTML file in a browser, we can see that there are 40 books.

We therefore want to find an HTML element that repeats 40 times.

In the image above we see that the book titles are stored in a tag called `<span>` with the class `name`.

Let's see if this `<span class="name">` tag repeats 40 times.  The most convenient way to do this is to use the `find_class` function of a parsed HTML tree, like this:

<br>

In [2]:
fnameHTML  = os.path.join(dirData, 'page1.html')
tree       = lxml.html.parse(fnameHTML)
body       = tree.find('body') 
name_nodes = body.find_class('name')

print( len(name_nodes) )   # number of name nodes

40


We have found 40 name nodes! This suggests that each book title is stored in its own `<span class="name">` tag. You can verify this by using your text editor to search the HTML file for `<span class="name">`.

Let's check the text stored in the first `<span class="name">` tag:

In [3]:
name  = name_nodes[0].text

print( name )

Django: Django , Web framework for Python


<br>

Excellent!  We have extracted the title of the first book.
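
One caveat worth knowing: `.text` returns only the text that precedes an element's first child tag. The title above is plain text, so `.text` is sufficient here, but if a title ever contained nested markup, `text_content()` would likely be the safer choice. The sketch below uses a made-up HTML fragment (the `<b>` tag is an assumption for illustration and is not taken from **page1.html**) to show the difference.

```python
import lxml.html

# Hypothetical fragment: the nested <b> tag is invented for illustration
html = '<div><span class="name">Django: <b>Django</b>, Web framework for Python</span></div>'
root = lxml.html.fromstring(html)
node = root.find_class('name')[0]

text_only = node.text             # only the text before the first child tag
full_text = node.text_content()   # all text, including text inside nested tags

print( repr(text_only) )
print( repr(full_text) )
```

Here `text_only` stops at the `<b>` tag, whereas `full_text` contains the whole title.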

<br>

___

## Parsing book prices

How about the book price?

If you look a few lines further in the file, you will see the following:

<br>
<img src="img/ss2.png" alt="screenshot" width=700 />
<br>

Note that the price is saved on Line 471, in the text field of a `<span>` tag with the class: `"itemCatPrice"`.

We can retrieve all `"itemCatPrice"` nodes like this:

<br>

In [4]:
price_nodes = body.find_class('itemCatPrice')

print( len(price_nodes) )   # number of price nodes

50


<br>

Oh no!  There are 50 prices, but only 40 titles!  What has happened?

If you look near the top of the page you'll see a horizontal preview bar with prices indicated.

<br>
<br>
<img src="img/ss3.png" alt="screenshot" width=700 />
<br>
<br>

If you search (from the beginning of the HTML source code) for the first instance of `itemCatPrice`, you'll find that it appears in a section called the `sponsorShopArea`.


<br>
<img src="img/ss4.png" alt="screenshot" width=700 />
<br>

Let's check how many `itemCatPrice` items appear inside this section.

<br>

In [5]:
shop_area = body.find_class('sponsorShopArea is-grid is-imgSmall')[0]

nodes     = shop_area.find_class('itemCatPrice')

print( len(nodes) )

10


<br>

OK! There are 10 `itemCatPrice` nodes in this section, which means that the other 40 nodes must be in a different section.
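
The effect of scoping a search can be reproduced on a small scale. The sketch below uses an invented HTML fragment whose class names mirror those on the page (the structure and the prices are assumptions for illustration only). Searching from the root finds every price, while searching from a container finds only the prices inside it.

```python
import lxml.html

# Invented fragment: two "sponsored" prices and three "catalogue" prices
html = '''
<div>
  <div class="sponsorShopArea">
    <span class="itemCatPrice">100</span>
    <span class="itemCatPrice">200</span>
  </div>
  <div class="itemCatBox">
    <span class="itemCatPrice">300</span>
    <span class="itemCatPrice">400</span>
    <span class="itemCatPrice">500</span>
  </div>
</div>
'''.strip()

root       = lxml.html.fromstring(html)
all_prices = root.find_class('itemCatPrice')   # search the whole fragment
box        = root.find_class('itemCatBox')[0]  # first (and only) itemCatBox
box_prices = box.find_class('itemCatPrice')    # search inside the box only

print( len(all_prices) )   # prices anywhere in the fragment
print( len(box_prices) )   # prices inside itemCatBox
```

In this toy fragment the two counts differ by exactly the number of prices in the `sponsorShopArea` block, which is the same pattern as the 50 versus 40 seen above.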

If we keep searching the HTML source code for the 11th instance of `itemCatPrice`, we'll find (on Line 450) that it lies in a section called `itemCatBox`.

<br>
<img src="img/ss5.png" alt="screenshot" width=700 />
<br>

<br>

Let's check how many `itemCatPrice` items appear inside this `itemCatBox` section.

<br>

In [6]:
box   = body.find_class('itemCatBox')[0]

nodes = box.find_class('itemCatPrice')

print( len(nodes) )

40


<br>

Excellent! We have found the 40 instances we were looking for.

Although we've already found the book name nodes, let's check if we can also find them inside the `itemCatBox` section.

<br>

In [7]:
nodes = box.find_class('name')

print( len(nodes) )

40


<br>

Excellent again!  We used `body.find_class` above, but that gave the expected count only for the `'name'` nodes, and not for the `'itemCatPrice'` nodes.  We now know that we can use `box.find_class` for both. For consistency, let's summarize using only `box.find_class`:

<br>

In [8]:
fnameHTML   = os.path.join(dirData, 'page1.html')
tree        = lxml.html.parse(fnameHTML)
body        = tree.find('body') 
box         = body.find_class('itemCatBox')[0]
name_nodes  = box.find_class('name')
price_nodes = box.find_class('itemCatPrice')

print( len(name_nodes) )    # number of name nodes
print( len(price_nodes) )   # number of price nodes

40
40


<br>

Last, let's extract the name and price for a single book.

<br>

In [9]:
name_node   = name_nodes[0]
price_node  = price_nodes[0]

print(name_node.text)
print(price_node.text)

Django: Django , Web framework for Python
￥1,075


<br>

Good, this matches the first entry in the HTML page (see first screenshot above).

However, since we'll later want to work with the prices as numbers, it would be more convenient to save the price as `1075` than as `￥1,075`. To do this we can do the following:

* Ignore the first character (which we assume is always `￥`)
* Remove all `,` characters
* Convert the string object to an integer object

This can be achieved in Python like this:

<br>

In [10]:
s0  = price_node.text       # original string
s1  = s0[1:]                # discard the first character
s2  = s1.replace(',', '')   # remove all "," characters
x   = int( s2 )             # convert the resulting string to an integer

print( s0, type(s0) )
print( s1, type(s1) )
print( s2, type(s2) )
print( x, type(x) )

￥1,075 <class 'str'>
1,075 <class 'str'>
1075 <class 'str'>
1075 <class 'int'>


We could instead achieve the same result on a single line:

In [11]:
x  = int( price_node.text[1:].replace(',', '') )

print( x )

1075
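
If the same conversion is needed for many books, it may be convenient to wrap it in a small function. The sketch below is one possible approach, not the method used in this lesson. The name `parse_price` is hypothetical, and it keeps only the digit characters instead of slicing off the first character, so it also tolerates surrounding whitespace and either form of the yen sign (`¥` or `￥`). It assumes that prices are whole numbers of yen; a price with a decimal point would not be handled correctly.

```python
def parse_price(text):
    """Convert a price string such as '￥1,075' to an integer (hypothetical helper)."""
    # Keep ASCII digits only; this drops the currency sign, commas and whitespace
    digits = ''.join(ch for ch in text if ch in '0123456789')
    if not digits:
        raise ValueError(f'no digits found in {text!r}')
    return int(digits)

print( parse_price('￥1,075') )
print( parse_price(' ¥12,800 ') )
```

Raising a `ValueError` for text without digits is a design choice made here so that an unexpected price format is noticed rather than silently converted.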


<br>

Excellent!

How about the number of pages for each book?

<br>

___

## Parsing number of pages



If you look at Line 463 in the HTML file, you will see that it contains the number of pages, in the text field of the `<span class="itemCatsetsumei">` tag, directly after "ページ" (which means "pages").

<br>
<img src="img/ss2.png" alt="screenshot" width=700 />
<br>

To extract the page count, we will have to first find all of the `<span class="itemCatsetsumei">` tags, then parse the text field to extract the number of pages.

Let's first check whether we can extract the text field using our previous `itemCatBox` strategy:

<br>

In [12]:
fnameHTML   = os.path.join(dirData, 'page1.html')
tree        = lxml.html.parse(fnameHTML)
body        = tree.find('body') 
box         = body.find_class('itemCatBox')[0]
s           = box.find_class('itemCatsetsumei')[0].text

print( s )

ページ: 111, ペーパーバック, Independently published


<br>

Excellent!  This is the text that we need.

Let's parse the number of pages from this text string. This can be done with two `split` commands: the first splits on `:` to separate the `ページ` label from the rest, and the second splits on `,` to isolate the number of pages, like this:

<br>

In [13]:
ss = s.split(':')
print( ss )
print()

ss = ss[1].split(',')
print( ss )
print()

ss = ss[0]
print( ss )


['ページ', ' 111, ペーパーバック, Independently published']

[' 111', ' ペーパーバック', ' Independently published']

 111


<br>

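Great! We can do this in a single line, and also convert the string to an integer. Note that the string `' 111'` still has a leading space, but `int` ignores leading and trailing whitespace, so no extra stripping is needed: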

<br>

In [14]:
n = int(  s.split(':')[1].split(',')[0]  )

print(n)

111
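
The two-`split` approach assumes that the text always contains `ページ:` and that the page count comes directly after it. As a hedged alternative, a regular expression can state that expectation explicitly and return `None` when the pattern is absent. The function name `parse_page_count` is hypothetical, and whether any entries on the page actually lack a page count has not been checked here.

```python
import re

def parse_page_count(text):
    """Return the page count following 'ページ:' as an int, or None if absent (hypothetical helper)."""
    # Look for the label, a colon, optional whitespace, then one or more digits
    match = re.search(r'ページ\s*:\s*(\d+)', text)
    return int(match.group(1)) if match else None

s = 'ページ: 111, ペーパーバック, Independently published'

print( parse_page_count(s) )
print( parse_page_count('ペーパーバック, Independently published') )
```

Returning `None` instead of raising an error leaves it to the caller to decide how to treat a book with no page count listed.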


<br>

Finished!  We have successfully parsed the first book on **page1.html**.

In the next notebook, let's proceed to parse the entire page.

<br>