# <span style="color:#54B1FF">Exploring Data:</span> &nbsp; <span style="color:#1B3EA9"><b>Parsing A Whole Page</b></span>

<br>

Just when you think you've understood the structure of a data file, you will very frequently encounter formatting problems, where entries don't follow the format you expect.

The **page1.html** file is no exception. In the screenshot below you will notice that the "Python Projects" item (third from the top in this screenshot) does not contain the word "ページ" ("page") that we used to parse the page count in the previous notebook. Not only that, there is no page count at all!

<br>
<img src="img/ss6.png" alt="screenshot" width=700 />
<br>
<br>

How can we deal with this problem?

One way is to check for the word "ページ" and to handle the entry separately when it is not found.  We'll get to this strategy below.

First let's discuss a general strategy for parsing the whole page.

<br>

___

<a name="parse-page"></a>
## General strategy
[Back to Table of Contents](#toc)
<br>

A good general parsing strategy is to create a separate function for each part of the parsing process.

Let's first import the modules we'll need for this notebook, then let's specify `dirData` as the directory in which the HTML data files are saved.

<br>

In [None]:
import os
import lxml.html

dir0       = os.path.abspath('')              # directory in which this notebook is saved
dirLesson  = os.path.dirname( dir0 )          # Lesson directory
dirData    = os.path.join(dirLesson, 'Data')  # Data directory
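To make the directory logic concrete, here is a minimal sketch that applies the same `dirname` and `join` calls to a hypothetical notebook location. The path `/home/user/Lesson/Notebooks` is an assumption for illustration only, and `posixpath` (the POSIX flavour of `os.path`) is used so that the result looks the same on every operating system.

```python
import posixpath   # POSIX flavour of os.path; gives identical results on every OS

dir0      = '/home/user/Lesson/Notebooks'          # hypothetical notebook directory
dirLesson = posixpath.dirname( dir0 )              # parent of the notebook directory
dirData   = posixpath.join(dirLesson, 'Data')      # "Data" directory next to the notebook directory
fnameHTML = posixpath.join(dirData, 'page1.html')  # full path to the HTML file

print(dirLesson)
print(dirData)
print(fnameHTML)
```

In other words, the data files are expected in a `Data` folder that sits beside the folder containing the notebook.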

<br>

Next, let's create functions to parse titles, prices and page counts, and test them on the first book.

<br>

In [2]:

def parse_name_node(node):
    return node.text

def parse_price_node(node):
    s = node.text
    x = s[1:].replace(',', '')
    return int( x )

def parse_page_count_node(node):
    s = node.text
    x = s.split(':')[1].split(',')[0]
    return x


fnameHTML   = os.path.join(dirData, 'page1.html')
tree        = lxml.html.parse(fnameHTML)
body        = tree.find('body') 
box         = body.find_class('itemCatBox')[0]
name_nodes  = box.find_class('name')
price_nodes = box.find_class('itemCatPrice')
page_nodes  = box.find_class('itemCatsetsumei')


name        = parse_name_node( name_nodes[0] )
price       = parse_price_node( price_nodes[0] )
pages       = parse_page_count_node( page_nodes[0] )


print(name)
print(price)
print(pages)

Django: Django , Web framework for Python
1075
 111


<br>

Great!

It should now be relatively easy to parse the entire page, **but only** if all entries are formatted in the same way!!
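To see why uniform formatting matters, the sketch below applies the same `split` logic to plain strings instead of HTML nodes. The helper name `page_field` and the malformed string are assumptions made for illustration; they are not part of the notebook's code.

```python
def page_field(s):
    # same split logic as parse_page_count_node, applied to a plain string
    return s.split(':')[1].split(',')[0]

good = 'ページ: 111, ペーパーバック, Independently published'
bad  = 'A description without any page information'   # hypothetical malformed entry

print( repr( page_field(good) ) )   # repr() makes the leading space visible

try:
    page_field(bad)
    failed = False
except IndexError as err:
    # without a ':' the split yields a single item, so index [1] does not exist
    failed = True
    print('IndexError:', err)
```

The well-formed string yields the page field, while the string without a `:` raises an `IndexError`, so a single unexpected entry is enough to stop a whole-page parse.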

We know ahead of time that one of the entries will give us a problem, because it doesn't contain a page count.  Let's check which entry this is as follows:

<br>

In [3]:
for i,node in enumerate( page_nodes ):
    print( i, node.text )


0 ページ: 111, ペーパーバック, Independently published
1 ページ: 237, ペーパーバック, Independently published
2 ページ: 205, ペーパーバック, Independently published
3 ページ: 187, ペーパーバック, Independently published
4 ページ: 110, ペーパーバック, Independently published
5 ページ: 245, ペーパーバック, CreateSpace Independent Publishing Platform
6 ページ: 224, ペーパーバック, Independently published
7 ページ: 264, ペーパーバック, Lulu.com
8 ページ: 104, ペーパーバック, Independently published
9 ページ: 408, ペーパーバック, Independently published
10 ページ: 220, ペーパーバック, Independently published
11 ページ: 53, ペーパーバック, Independently published
12 ページ: 162, ペーパーバック, Independently published
13 ページ: 340, ペーパーバック, Packt Publishing
14 ページ: 262, エディション: 1, ペーパーバック, Pragmatic Bookshelf
15 Python ProjectsA guide to completing Python projects for those ready to take their skills to the next level  Python P...
16 ページ: 706, エディション: 3, ペーパーバック, O'Reilly Media
17 ページ: 215, エディション: 1st ed. 2018, ハードカバー, Springer
18 ページ: 110, ペーパーバック, Independently published
19 ページ: 107, ペーパーバック, Independently published


<br>

We can see that entry 15 (the 16th entry, because Python counts from zero) contains text that does not include a page count. Let's deal with this problem by deciding what to do if the entry does not contain the word "ページ". Since we require an integer page count for all entries, let's set the page count for this entry to `-1`. Using `-1` as a placeholder for an unknown page count will make it easier to process all entries later, irrespective of their page counts.

This can be done in Python using an `if... else` statement, as demonstrated below. Refer to the [control flow tools](https://docs.python.org/3/tutorial/controlflow.html) Python documentation, or [this tutorial in Japanese](https://www.javadrive.jp/python/if/index1.html) for details about `if... else` statements.

First let's retrieve the problem node:

<br>

In [4]:
node = page_nodes[15]
print(node.text)

Python ProjectsA guide to completing Python projects for those ready to take their skills to the next level  Python P...


<br>

If we were to try to run the command `x = parse_page_count_node(node)`, an `IndexError` would be raised, because this problem node does not follow the expected pattern: its text contains no `:` to split on. So let's adjust this function to deal with the problem node:

<br>

In [5]:
def parse_page_count_node(node):
    s = node.text
    if 'ページ' in s:
        x = s.split(':')[1].split(',')[0]
    else:
        x = -1  # assign a value of -1 if "ページ" is not in the text field
    return int( x )


n = parse_page_count_node( node )

print( n )

-1


<br>

Great! Let's next confirm whether we can parse all page counts using this new function.

<br>


In [6]:
pages = []
for node in page_nodes:
    n = parse_page_count_node( node )
    pages.append( n )

print(pages)

[111, 237, 205, 187, 110, 245, 224, 264, 104, 408, 220, 53, 162, 340, 262, -1, 706, 215, 110, 107, 648, 248, 240, 378, 386, 39, 358, 216, 225, 334, 244, 330, 156, 320, 110, 252, 496, 128, 286, 372]


<br>

Excellent!  We have parsed all entries, and have returned a value of -1 for the problem entry.

Note that the cell above can be written more compactly using a [list comprehension](https://www.pythonforbeginners.com/basics/list-comprehensions-in-python) statement, like this:

<br>

In [7]:
pages = [parse_page_count_node( node )  for node in page_nodes]
print(pages)

[111, 237, 205, 187, 110, 245, 224, 264, 104, 408, 220, 53, 162, 340, 262, -1, 706, 215, 110, 107, 648, 248, 240, 378, 386, 39, 358, 216, 225, 334, 244, 330, 156, 320, 110, 252, 496, 128, 286, 372]
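The `-1` placeholder keeps the list purely numeric, but it must be excluded before computing any statistics. A minimal sketch, using a short illustrative subset of the values above rather than the full list:

```python
pages = [111, 237, 205, -1, 706]   # short illustrative subset; -1 marks an unknown page count

known      = [n for n in pages if n != -1]   # keep only entries with a real page count
n_missing  = len(pages) - len(known)         # number of entries without a page count
mean_known = sum(known) / len(known)         # mean over the known page counts only

print(known)
print(n_missing)
print(mean_known)
```

Because every entry is an integer, a single comparison (`n != -1`) is enough to separate known from unknown page counts.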


<br>

OK, we're now ready to parse the whole page, saving the title, page count and price for each entry:

<br>

In [8]:
title  = []
pages  = []
price  = []

for i in range( len(name_nodes) ):   # loop over all entries rather than hard-coding 40
    name_node  = name_nodes[i]
    page_node  = page_nodes[i]
    price_node = price_nodes[i]
    s          = parse_name_node( name_node )
    n          = parse_page_count_node( page_node )
    x          = parse_price_node( price_node )
    title.append( s )
    pages.append( n )
    price.append( x )

print(title)
print()
print(pages)
print()
print(price)

['Django: Django , Web framework for Python', 'PYTHON PROGRAMMING ADVANCED: The Guide for Data Analysis and Data Science. Discover Machin...', "Python Data Analytics: The Beginner's Real World Crash Course", 'Programacion Con Python: Guia Completa para Principiantes   Aprende sobre Los Reinos De La...', 'Snake Reptile Week Planner Weekly Organizer Calendar 2020 / 2021 - Green Tree Python: Cute...', 'Python for Everybody: Exploring Data in Python 3', 'Python language for your growing children and for beginners', '101 Extra Python Challenges with Solutions / Code Listings', 'CIE IGCSE COMPUTER SCIENCE 9-1 SYLLABUS 2020-2021: PAPER 2 SPECIFICATION BOOK WITH FULL PY...', 'Computer Programming And Cyber Security for Beginners: This Book Includes: Python Machine ...', 'Python GUI: For Signal and Image Processing', 'Machine Learning with Scikit-Learn and TensorFlow: Deep Learning with Python (Random Fores...', 'Python Machine Learning: How to learn Machine Learning with Python. The Complete G

<br>

Success!!  But to make our program shorter, let's do the same thing using list comprehensions. In the cell below, we'll print just `pages` and `price` to avoid the very long `title` output.

<br>

In [9]:
title  = [parse_name_node( node )  for node in name_nodes]
pages  = [parse_page_count_node( node )  for node in page_nodes]
price  = [parse_price_node( node )  for node in price_nodes]

print(pages)
print()
print(price)

[111, 237, 205, 187, 110, 245, 224, 264, 104, 408, 220, 53, 162, 340, 262, -1, 706, 215, 110, 107, 648, 248, 240, 378, 386, 39, 358, 216, 225, 334, 244, 330, 156, 320, 110, 252, 496, 128, 286, 372]

[1075, 1790, 2379, 2364, 960, 1219, 8491, 3205, 4729, 3033, 2352, 1581, 1801, 5374, 4325, 4526, 7192, 14542, 960, 834, 5388, 2385, 5136, 5408, 5374, 1076, 11060, 5900, 2567, 5152, 5374, 5374, 2153, 4531, 836, 4182, 1089, 1766, 5981, 4194]


<br>

Excellent!  One advantage of list comprehensions is that there are far fewer lines of code to read (and to debug).

We could get even fancier, and parse everything using just a single line of code, like this:

<br>

In [10]:
title,pages,price = zip( *[[parse_name_node( n ), parse_page_count_node(p), parse_price_node( pp )]  for n,p,pp in zip(name_nodes, page_nodes, price_nodes)] )
    
print(price)

(1075, 1790, 2379, 2364, 960, 1219, 8491, 3205, 4729, 3033, 2352, 1581, 1801, 5374, 4325, 4526, 7192, 14542, 960, 834, 5388, 2385, 5136, 5408, 5374, 1076, 11060, 5900, 2567, 5152, 5374, 5374, 2153, 4531, 836, 4182, 1089, 1766, 5981, 4194)


<br>

While this type of single-line parsing is possible, you should generally avoid this style of programming: the line is very long and calls many different functions, which makes problems difficult to debug.
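A possible middle ground between an index-based loop and a one-liner is a plain `for` loop over `zip`, which keeps one parsing step per line. The sketch below is self-contained: the `SimpleNamespace` objects are hypothetical stand-ins for lxml nodes (only their `.text` attribute is used), and the titles and the `¥` price format are assumptions for illustration.

```python
from types import SimpleNamespace

def parse_name_node(node):
    return node.text

def parse_price_node(node):
    # drop the leading currency symbol and the thousands separators
    return int( node.text[1:].replace(',', '') )

def parse_page_count_node(node):
    s = node.text
    if 'ページ' in s:
        return int( s.split(':')[1].split(',')[0] )
    return -1   # unknown page count

# hypothetical stand-ins for lxml nodes: only the .text attribute is needed
name_nodes  = [SimpleNamespace(text='Book A'), SimpleNamespace(text='Book B')]
page_nodes  = [SimpleNamespace(text='ページ: 111, ペーパーバック'),
               SimpleNamespace(text='No page information')]
price_nodes = [SimpleNamespace(text='¥1,075'), SimpleNamespace(text='¥4,526')]

# the three lists are matched by position, so they must have the same length
assert len(name_nodes) == len(page_nodes) == len(price_nodes)

title, pages, price = [], [], []
for name_node, page_node, price_node in zip(name_nodes, page_nodes, price_nodes):
    title.append( parse_name_node( name_node ) )
    pages.append( parse_page_count_node( page_node ) )
    price.append( parse_price_node( price_node ) )

print(title)
print(pages)
print(price)
```

The length check matters because `zip` silently stops at the shortest list, which would hide a missing name, page or price node.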

OK, let's summarize our whole-page parser code, from start-to-finish:

<br>

In [11]:
import os
import lxml.html

def parse_name_node(node):
    return node.text

def parse_price_node(node):
    s = node.text
    x = s[1:].replace(',', '')
    return int( x )

def parse_page_count_node(node):
    s = node.text
    if 'ページ' in s:
        x = s.split(':')[1].split(',')[0]
    else:
        x = -1  # assign a value of -1 if "ページ" is not in the text field
    return int( x )


# specify data directory:
dir0        = os.path.abspath('')              # directory in which this notebook is saved
dirLesson   = os.path.dirname( dir0 )          # Lesson directory
dirData     = os.path.join(dirLesson, 'Data')  # Data directory

# parse the HTML file for name, page and price nodes:
fnameHTML   = os.path.join(dirData, 'page1.html')
tree        = lxml.html.parse(fnameHTML)
body        = tree.find('body') 
box         = body.find_class('itemCatBox')[0]
name_nodes  = box.find_class('name')
page_nodes  = box.find_class('itemCatsetsumei')
price_nodes = box.find_class('itemCatPrice')

# parse the entries:
title       = [parse_name_node( node )  for node in name_nodes]
pages       = [parse_page_count_node( node )  for node in page_nodes]
price       = [parse_price_node( node )  for node in price_nodes]

print(pages)
print()
print(price)

[111, 237, 205, 187, 110, 245, 224, 264, 104, 408, 220, 53, 162, 340, 262, -1, 706, 215, 110, 107, 648, 248, 240, 378, 386, 39, 358, 216, 225, 334, 244, 330, 156, 320, 110, 252, 496, 128, 286, 372]

[1075, 1790, 2379, 2364, 960, 1219, 8491, 3205, 4729, 3033, 2352, 1581, 1801, 5374, 4325, 4526, 7192, 14542, 960, 834, 5388, 2385, 5136, 5408, 5374, 1076, 11060, 5900, 2567, 5152, 5374, 5374, 2153, 4531, 836, 4182, 1089, 1766, 5981, 4194]
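The `'ページ' in s` check assumes that the word is always followed by `: <number>`. One possible hardening, shown here as a sketch rather than as part of the notebook's parser, is to match the number directly with a regular expression. The function name `parse_page_count_text`, the pattern, and the last two test strings are assumptions; the function takes the text itself (for example `node.text`) so that it runs without an HTML file.

```python
import re

# "ページ", a colon, optional whitespace, then one or more digits
PAGE_PATTERN = re.compile(r'ページ:\s*(\d+)')

def parse_page_count_text(s):
    if s is None:                 # an element without text has .text equal to None
        return -1
    m = PAGE_PATTERN.search(s)
    return int( m.group(1) ) if m else -1   # -1 marks an unknown page count

examples = [
    "ページ: 706, エディション: 3, ペーパーバック, O'Reilly Media",
    'Python ProjectsA guide to completing Python projects',
    'ページ数は不明',                # hypothetical: contains "ページ" but no number
    None,
]

counts = [parse_page_count_text(s) for s in examples]
print(counts)
```

With this variant, text that mentions "ページ" without a number, or a node with no text at all, also maps to `-1` instead of raising an error.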


<br>

The next notebook considers how to extend this code to multiple pages.

<br>