# <span style="color:#54B1FF">Exploring Data:</span> &nbsp; <span style="color:#1B3EA9"><b>Parsing Multiple Pages</b></span>

<br>



<br>

⚠️ **NOTE!**  &nbsp; &nbsp; All data files are saved in the same directory as this notebook.

<br>


___

<a name="parse-pages"></a>
## Parsing multiple pages
[Back to Table of Contents](#toc)
<br>

To parse multiple pages, let's first move our single-page parsing code into a custom function.

In [22]:
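def parse_page(fnameHTML):
    """Return (names, prices) lists parsed from one saved HTML page."""
    tree         = lxml.html.parse(fnameHTML)
    body         = tree.find('body')
    box          = body.find_class('itemCatBox')[0]   # first element with class 'itemCatBox'
    name_nodes   = box.find_class('name')
    price_nodes  = box.find_class('itemCatPrice')
    # Parse each (name, price) node pair, then unzip the pairs into two sequences
    names,prices = zip( *[[parse_name_node( n ), parse_price_node( p )]  for n,p in zip(name_nodes, price_nodes)] )
    return list(names), list(prices)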

fnameHTML    = os.path.join(dir0, 'kakaku-com', 'page1.html')
names,prices = parse_page(fnameHTML)

print(prices)
    

[1075, 1790, 2379, 2364, 960, 1219, 8491, 3205, 4729, 3033, 2352, 1581, 1801, 5374, 4325, 4526, 7192, 14542, 960, 834, 5388, 2385, 5136, 5408, 5374, 1076, 11060, 5900, 2567, 5152, 5374, 5374, 2153, 4531, 836, 4182, 1089, 1766, 5981, 4194]


Great!  It will now be relatively easy to parse multiple pages. We just need to:

* update the HTML file name for each page in turn
* append each page's names and prices to two running lists

One way to do this is:

In [23]:
names  = []
prices = []
for i in range(1, 6):   # pages are numbered 1 to 5
    fnameHTML = os.path.join(dir0, 'kakaku-com', f'page{i}.html')
    n,p       = parse_page(fnameHTML)
    names    += n       # extend the running lists with this page's results
    prices   += p

print( len(prices) )
print()
print(prices)

200

[1075, 1790, 2379, 2364, 960, 1219, 8491, 3205, 4729, 3033, 2352, 1581, 1801, 5374, 4325, 4526, 7192, 14542, 960, 834, 5388, 2385, 5136, 5408, 5374, 1076, 11060, 5900, 2567, 5152, 5374, 5374, 2153, 4531, 836, 4182, 1089, 1766, 5981, 4194, 3447, 734, 8621, 8976, 1890, 4600, 1234, 5053, 2363, 1320, 1759, 2358, 734, 1770, 1650, 6336, 5393, 1906, 5971, 836, 2066, 2402, 4307, 2399, 2094, 3582, 2329, 14055, 2389, 1905, 5971, 5169, 3607, 5374, 2364, 4179, 5413, 769, 4927, 11281, 1804, 1539, 2085, 3751, 3607, 2987, 1067, 2984, 1190, 474, 1949, 11770, 5971, 834, 18498, 2045, 3251, 815, 2261, 2400, 1551, 2554, 1965, 838, 2388, 5044, 837, 4775, 1018, 12320, 4190, 1164, 2280, 469, 3706, 5354, 269, 5374, 9809, 1803, 3556, 5374, 816, 5080, 2069, 1219, 3322, 1430, 2461, 1099, 7456, 2384, 2200, 4583, 2563, 7457, 8360, 5971, 2045, 3425, 4012, 2277, 2286, 1189, 6368, 2264, 2079, 5019, 4776, 4776, 1936, 1441, 2767, 2143, 3978, 7218, 4576, 5969, 4777, 1099, 3072, 414, 2072, 2207, 2340, 5772, 4470, 24

Excellent!  We now have names and prices for all 200 books from the five HTML pages.
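If the number of pages changes later, the loop above can be wrapped in a small reusable function. The sketch below is one possible way to do that, not part of the original notebook: `parse_pages` and `fake_parse_page` are hypothetical names, and the stand-in parser exists only so the example runs without the HTML files. In the notebook, `parse_page` would be passed in its place.

```python
import os

def parse_pages(page_dir, n_pages, parse_fn):
    """Collect names and prices from page1.html ... page{n_pages}.html."""
    all_names, all_prices = [], []
    for i in range(1, n_pages + 1):
        fname = os.path.join(page_dir, f'page{i}.html')
        names, prices = parse_fn(fname)
        # Each page should yield one price per name; otherwise the two lists drift apart
        if len(names) != len(prices):
            raise ValueError(f'{fname}: {len(names)} names but {len(prices)} prices')
        all_names.extend(names)
        all_prices.extend(prices)
    return all_names, all_prices


# Stand-in parser (hypothetical) that returns one fake book per page
def fake_parse_page(fname):
    return [f'book from {os.path.basename(fname)}'], [1000]

names, prices = parse_pages('kakaku-com', 5, fake_parse_page)
print(len(names), len(prices))
```

The length check is a design choice: it fails early if a page yields unequal numbers of names and prices, rather than silently misaligning later entries.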

These data are difficult to inspect as printed Python lists, so let's save them to a CSV file.
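A minimal sketch of that export step, using Python's standard `csv` module. The book titles and the file name `books.csv` are placeholders, and the file is written to a temporary directory so the example runs anywhere. In the notebook, the full `names` and `prices` lists and a path under `dir0` would be used instead.

```python
import csv
import os
import tempfile

# Placeholder data: hypothetical titles, with the first three prices from page 1
names  = ['Book A', 'Book B', 'Book C']
prices = [1075, 1790, 2379]

# Hypothetical output location (a temporary directory, for illustration only)
fnameCSV = os.path.join(tempfile.mkdtemp(), 'books.csv')

# newline='' lets the csv module control line endings;
# utf-8 keeps any non-ASCII characters in the titles intact
with open(fnameCSV, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'price'])      # header row
    writer.writerows(zip(names, prices))    # one row per book

# Read the file back to confirm its contents
with open(fnameCSV, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows[:2])
```

Note that `csv.reader` returns every field as a string, so prices read back this way need converting with `int()` before any arithmetic.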