# <span style="color:#54B1FF">Exploring Data:</span> &nbsp; <span style="color:#1B3EA9"><b>Parsing Multiple Pages</b></span>

<br>

In the previous notebook we saw how to parse a single HTML page. Let's replicate that code here:

<br>


In [1]:
import os
import lxml.html

def parse_name_node(node):
    return node.text  # the title is the node's text

def parse_price_node(node):
    s = node.text
    x = s[1:].replace(',', '')  # skip the first character, then remove the thousands separators
    return int( x )

def parse_page_count_node(node):
    s = node.text
    if 'ページ' in s:
        x = s.split(':')[1].split(',')[0]  # text between the first ':' and the next ','
    else:
        x = -1  # assign a value of -1 if "ページ" is not in the text field
    return int( x )


# specify data directory:
dir0        = os.path.abspath('')              # directory in which this notebook is saved
dirLesson   = os.path.dirname( dir0 )          # Lesson directory
dirData     = os.path.join(dirLesson, 'Data')  # Data directory

# parse the HTML file for name, page and price nodes:
fnameHTML   = os.path.join(dirData, 'page1.html')
tree        = lxml.html.parse(fnameHTML)
body        = tree.find('body') 
box         = body.find_class('itemCatBox')[0]
name_nodes  = box.find_class('name')
page_nodes  = box.find_class('itemCatsetsumei')
price_nodes = box.find_class('itemCatPrice')

# parse the entries:
title       = [parse_name_node( node )  for node in name_nodes]
pages       = [parse_page_count_node( node )  for node in page_nodes]
price       = [parse_price_node( node )  for node in price_nodes]

print(pages)
print()
print(price)

[111, 237, 205, 187, 110, 245, 224, 264, 104, 408, 220, 53, 162, 340, 262, -1, 706, 215, 110, 107, 648, 248, 240, 378, 386, 39, 358, 216, 225, 334, 244, 330, 156, 320, 110, 252, 496, 128, 286, 372]

[1075, 1790, 2379, 2364, 960, 1219, 8491, 3205, 4729, 3033, 2352, 1581, 1801, 5374, 4325, 4526, 7192, 14542, 960, 834, 5388, 2385, 5136, 5408, 5374, 1076, 11060, 5900, 2567, 5152, 5374, 5374, 2153, 4531, 836, 4182, 1089, 1766, 5981, 4194]


<br>

To parse multiple pages, it is most convenient to wrap the page-level code in a function that parses an entire page. Let's create a `parse_page` function that does just that:

<br>

In [2]:
def parse_page(fnameHTML):
    # parse one HTML file and return its titles, page counts and prices as three lists
    tree        = lxml.html.parse(fnameHTML)
    body        = tree.find('body')
    box         = body.find_class('itemCatBox')[0]
    name_nodes  = box.find_class('name')
    page_nodes  = box.find_class('itemCatsetsumei')
    price_nodes = box.find_class('itemCatPrice')

    # parse the entries:
    title       = [parse_name_node( node )  for node in name_nodes]
    pages       = [parse_page_count_node( node )  for node in page_nodes]
    price       = [parse_price_node( node )  for node in price_nodes]

    return title, pages, price

title, pages, price = parse_page(fnameHTML)

print(pages)
print()
print(price)

[111, 237, 205, 187, 110, 245, 224, 264, 104, 408, 220, 53, 162, 340, 262, -1, 706, 215, 110, 107, 648, 248, 240, 378, 386, 39, 358, 216, 225, 334, 244, 330, 156, 320, 110, 252, 496, 128, 286, 372]

[1075, 1790, 2379, 2364, 960, 1219, 8491, 3205, 4729, 3033, 2352, 1581, 1801, 5374, 4325, 4526, 7192, 14542, 960, 834, 5388, 2385, 5136, 5408, 5374, 1076, 11060, 5900, 2567, 5152, 5374, 5374, 2153, 4531, 836, 4182, 1089, 1766, 5981, 4194]


<br>

Great!  We're now ready to parse multiple pages. 

<br>

In [3]:
title  = []
pages  = []
price  = []

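for i in range(1, 6):  # the files are named page1.html to page5.html
    fnameHTML = os.path.join(dirData, f'page{i}.html')
    s,n,x     = parse_page(fnameHTML)  # titles, page counts and prices for this page
    title    += s  # append this page's results to the running lists
    pages    += n
    price    += x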


print( len(title) )
print( len(pages) )
print( len(price) )
print()
print(price)



200
200
200

[1075, 1790, 2379, 2364, 960, 1219, 8491, 3205, 4729, 3033, 2352, 1581, 1801, 5374, 4325, 4526, 7192, 14542, 960, 834, 5388, 2385, 5136, 5408, 5374, 1076, 11060, 5900, 2567, 5152, 5374, 5374, 2153, 4531, 836, 4182, 1089, 1766, 5981, 4194, 3447, 734, 8621, 8976, 1890, 4600, 1234, 5053, 2363, 1320, 1759, 2358, 734, 1770, 1650, 6336, 5393, 1906, 5971, 836, 2066, 2402, 4307, 2399, 2094, 3582, 2329, 14055, 2389, 1905, 5971, 5169, 3607, 5374, 2364, 4179, 5413, 769, 4927, 11281, 1804, 1539, 2085, 3751, 3607, 2987, 1067, 2984, 1190, 474, 1949, 11770, 5971, 834, 18498, 2045, 3251, 815, 2261, 2400, 1551, 2554, 1965, 838, 2388, 5044, 837, 4775, 1018, 12320, 4190, 1164, 2280, 469, 3706, 5354, 269, 5374, 9809, 1803, 3556, 5374, 816, 5080, 2069, 1219, 3322, 1430, 2461, 1099, 7456, 2384, 2200, 4583, 2563, 7457, 8360, 5971, 2045, 3425, 4012, 2277, 2286, 1189, 6368, 2264, 2079, 5019, 4776, 4776, 1936, 1441, 2767, 2143, 3978, 7218, 4576, 5969, 4777, 1099, 3072, 414, 2072, 2207, 2340, 5772, 

<br>

Excellent!  We have parsed all five pages, and have successfully stored the title, page count, and price for all 200 books!
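
Before moving on, it can be worth checking that the three lists are still aligned and seeing how many page counts fell back to the `-1` placeholder. The sketch below uses a few made-up stand-in values instead of the real 200 entries (the titles are placeholders, not actual book names), so treat it as an illustration of the check, not as output from the pages above:

```python
# Illustrative stand-ins for the lists built above (titles are placeholders)
title = ['Book A', 'Book B', 'Book C', 'Book D']
pages = [111, 237, -1, 187]
price = [1075, 1790, 4526, 2364]

# the three lists should stay aligned: one entry per book
assert len(title) == len(pages) == len(price)

# indices of books whose page count could not be parsed (placeholder value -1)
missing = [i for i, n in enumerate(pages) if n == -1]
print(missing)

# pair up the fields so that each book becomes a single record
records = list(zip(title, pages, price))
print(records[0])
```

With the real lists, the same two checks would apply unchanged to `title`, `pages`, and `price` from the loop above.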

⚠️ When dealing with multiple files you will often encounter formatting differences between them, and you will need to adjust your parsing functions to handle these. Here we were lucky: the parsers we developed for **page1.html** handled every entry on all five pages.
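
As one possible adjustment, a parser can be made more tolerant by searching for the digits instead of assuming the text always starts with exactly one symbol. The sketch below is only an illustration: `parse_price_text` is a hypothetical helper (not part of the notebook) that takes the text itself, such as `node.text`, and the `¥` strings are assumed example inputs. It returns `-1` when no price is found, mirroring the placeholder used for page counts.

```python
import re

def parse_price_text(s):
    # hypothetical helper: return the price as an int, or -1 if no digits are found
    if s is None:
        return -1
    m = re.search(r'\d[\d,]*', s)   # first run of digits, allowing comma separators
    if m is None:
        return -1
    return int( m.group().replace(',', '') )

# assumed example inputs (not taken from the data files)
print( parse_price_text('¥1,075') )
print( parse_price_text('  ¥14,542 ') )
print( parse_price_text(None) )
```

Whether a fallback value like `-1` is appropriate depends on the data: for some tasks it may be better to raise an error so that unexpected formats are noticed immediately.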

<br>

In the next notebook, we'll write the parsed results to a file.

<br>