# <span style="color:#54B1FF">Exploring Data:</span> &nbsp; <span style="color:#1B3EA9"><b>Parsing Whole Page</b></span>

<br>

Just when you think you've understood the structure of a data file, you will often run into formatting problems: entries that don't follow the format you expect.

The **page1.html** file is no exception. In the screenshot below, notice that the third item ("Python Projects") does not contain the word "ページ" ("pages"), which we used to parse the page count in the previous notebook. In fact, it has no page count at all!

<br>
<img src="img/ss6.png" alt="screenshot" width=700 />
<br>
<br>

How can we deal with this problem?

One way is to ignore an entry if the word "ページ" is not found. We'll get to this strategy below.

First let's discuss a general strategy for parsing the whole page.

<br>

___

<a name="parse-page"></a>
## General strategy
[Back to Table of Contents](#toc)
<br>

A good general parsing strategy is to create a separate function for each part of the parsing process.

Let's first create functions to parse titles, prices and page counts, and test them on the first book.

In [1]:
import os
import lxml.html

dir0       = os.path.abspath('')              # directory in which this notebook is saved
dirLesson  = os.path.dirname( dir0 )          # Lesson directory
dirData    = os.path.join(dirLesson, 'Data')  # Data directory


In [5]:

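def parse_name_node(node):
    # the title is simply the node's text
    return node.text

def parse_price_node(node):
    s = node.text
    x = s[1:].replace(',', '')          # drop the first character and the thousands separators
    return int( x )

def parse_page_count_node(node):
    s = node.text
    x = s.split(':')[1].split(',')[0]   # text between the first ':' and the next ','
    return x                            # NOTE: still a string (e.g. ' 111'), not an int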


fnameHTML   = os.path.join(dirData, 'page1.html')
tree        = lxml.html.parse(fnameHTML)
body        = tree.find('body') 
box         = body.find_class('itemCatBox')[0]
name_nodes  = box.find_class('name')
price_nodes = box.find_class('itemCatPrice')
page_nodes  = box.find_class('itemCatsetsumei')


name        = parse_name_node( name_nodes[0] )
price       = parse_price_node( price_nodes[0] )
pages       = parse_page_count_node( page_nodes[0] )


print(name)
print(price)
print(pages)

Django: Django , Web framework for Python
1075
 111
40


<br>

Great!
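To see what the string handling in `parse_price_node` does, here is a minimal sketch that applies the same operations to plain strings. The function name `parse_price_text` is hypothetical, and the sample strings assume that the price text looks like a currency symbol followed by a number with thousands separators; check the actual node text in **page1.html** before relying on this.

```python
def parse_price_text(s):
    # s[1:] drops the first character (assumed here to be a currency symbol);
    # replace() removes the thousands separators so int() can convert the rest
    return int(s[1:].replace(',', ''))

# sample strings are assumptions, not copied from page1.html
print(parse_price_text('¥1,075'))    # 1075
print(parse_price_text('¥14,542'))   # 14542
```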

It should now be relatively easy to parse the entire page, **but only** if all entries are formatted in the same way!

We know ahead of time that one of the entries will give us a problem, because it doesn't contain a page count.  Let's check which entry this is as follows:

<br>

In [7]:
for i,node in enumerate( page_nodes ):
    print( i, node.text )


0 ページ: 111, ペーパーバック, Independently published
1 ページ: 237, ペーパーバック, Independently published
2 ページ: 205, ペーパーバック, Independently published
3 ページ: 187, ペーパーバック, Independently published
4 ページ: 110, ペーパーバック, Independently published
5 ページ: 245, ペーパーバック, CreateSpace Independent Publishing Platform
6 ページ: 224, ペーパーバック, Independently published
7 ページ: 264, ペーパーバック, Lulu.com
8 ページ: 104, ペーパーバック, Independently published
9 ページ: 408, ペーパーバック, Independently published
10 ページ: 220, ペーパーバック, Independently published
11 ページ: 53, ペーパーバック, Independently published
12 ページ: 162, ペーパーバック, Independently published
13 ページ: 340, ペーパーバック, Packt Publishing
14 ページ: 262, エディション: 1, ペーパーバック, Pragmatic Bookshelf
15 Python ProjectsA guide to completing Python projects for those ready to take their skills to the next level  Python P...
16 ページ: 706, エディション: 3, ペーパーバック, O'Reilly Media
17 ページ: 215, エディション: 1st ed. 2018, ハードカバー, Springer
18 ページ: 110, ペーパーバック, Independently published
19 ページ: 107, ペーパーバック, Independently published
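
<br>

Entry 15 is the odd one out: its text does not contain "ページ". One possible way to handle such entries is sketched below. This is only an illustration: the function name `parse_page_count_text` is hypothetical, and it works on a plain string (such as `node.text`) so that it can be tried without the HTML file. It returns `None` when no page count can be found, and an `int` otherwise.

```python
def parse_page_count_text(text):
    # no text at all, or no "ページ" label: treat the page count as missing
    if text is None or 'ページ' not in text:
        return None
    # keep what follows the first ':' (empty sep means there was no ':')
    _, sep, rest = text.partition(':')
    if not sep:
        return None
    # the page count is the part before the first ','
    x = rest.split(',')[0].strip()
    return int(x) if x.isdigit() else None

print(parse_page_count_text('ページ: 111, ペーパーバック, Independently published'))   # 111
print(parse_page_count_text('Python ProjectsA guide to completing Python projects'))   # None
```

Returning `None` instead of raising an error lets the calling code decide whether to skip the entry or to keep it with a missing page count.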


<br>

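The entry at index 15 has no page count. Names and prices are not affected by this, so we can parse them for all 40 entries with a simple loop: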

In [20]:

names  = []
prices = []

for i in range(40):
    name_node  = name_nodes[i]
    price_node = price_nodes[i]
    name       = parse_name_node( name_node )
    price      = parse_price_node( price_node )
    names.append( name )
    prices.append( price )

print(names)
print()
print(prices)

['Django: Django , Web framework for Python', 'PYTHON PROGRAMMING ADVANCED: The Guide for Data Analysis and Data Science. Discover Machin...', "Python Data Analytics: The Beginner's Real World Crash Course", 'Programacion Con Python: Guia Completa para Principiantes   Aprende sobre Los Reinos De La...', 'Snake Reptile Week Planner Weekly Organizer Calendar 2020 / 2021 - Green Tree Python: Cute...', 'Python for Everybody: Exploring Data in Python 3', 'Python language for your growing children and for beginners', '101 Extra Python Challenges with Solutions / Code Listings', 'CIE IGCSE COMPUTER SCIENCE 9-1 SYLLABUS 2020-2021: PAPER 2 SPECIFICATION BOOK WITH FULL PY...', 'Computer Programming And Cyber Security for Beginners: This Book Includes: Python Machine ...', 'Python GUI: For Signal and Image Processing', 'Machine Learning with Scikit-Learn and TensorFlow: Deep Learning with Python (Random Fores...', 'Python Machine Learning: How to learn Machine Learning with Python. The Complete G

We can achieve the same result with briefer, slightly more clever Python code. Note that `names` and `prices` are now tuples rather than lists:

In [21]:

names,prices = zip( *[[parse_name_node( n ), parse_price_node( p )]  for n,p in zip(name_nodes, price_nodes)] )
    
print(prices)

(1075, 1790, 2379, 2364, 960, 1219, 8491, 3205, 4729, 3033, 2352, 1581, 1801, 5374, 4325, 4526, 7192, 14542, 960, 834, 5388, 2385, 5136, 5408, 5374, 1076, 11060, 5900, 2567, 5152, 5374, 5374, 2153, 4531, 836, 4182, 1089, 1766, 5981, 4194)
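
<br>

The `zip( *... )` call used above is a common Python idiom for "transposing" a list of pairs into two separate sequences. Here is a minimal sketch on made-up data; the titles and prices are placeholders, not values from **page1.html**.

```python
# a list of (name, price) pairs, like the one built by the list comprehension
pairs = [('Book A', 1075), ('Book B', 1790), ('Book C', 2379)]

# the * unpacks the list so that zip receives each pair as a separate argument;
# zip then groups all first elements together and all second elements together
names, prices = zip(*pairs)

print(names)    # ('Book A', 'Book B', 'Book C')
print(prices)   # (1075, 1790, 2379)

# zip yields tuples; convert if you need lists
prices_list = list(prices)
```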
