# <span style="color:#54B1FF">Exploring Data:</span> &nbsp; <span style="color:#1B3EA9"><b>Write Results</b></span>

<br>

In the previous notebooks in this lesson we saw how to parse multiple entries from multiple pages. In this notebook we'll consider how to save the parsed results.

As general data analysis rules:
* Separate your **parsing** code from your **analysis** code
* Save parsed results separately from the original data files, so that your analysis code can focus on the parsed results. 

Let's first assemble our multi-page parsing code, and parse all pages:

<br>


In [1]:
import os
import lxml.html

def parse_name_node(node):
    return node.text

def parse_price_node(node):
    s = node.text
    x = s[1:].replace(',', '')
    return int( x )

def parse_page_count_node(node):
    s = node.text
    if 'ページ' in s:
        x = s.split(':')[1].split(',')[0]
    else:
        x = -1  # assign a value of -1 if "ページ" is not in the text field
    return int( x )


def parse_page(fnameHTML):
    tree        = lxml.html.parse(fnameHTML)
    body        = tree.find('body') 
    box         = body.find_class('itemCatBox')[0]
    name_nodes  = box.find_class('name')
    page_nodes  = box.find_class('itemCatsetsumei')
    price_nodes = box.find_class('itemCatPrice')

    # parse the entries:
    title       = [parse_name_node( node )  for node in name_nodes]
    pages       = [parse_page_count_node( node )  for node in page_nodes]
    price       = [parse_price_node( node)  for node in price_nodes]
    
    return title, pages, price


# specify data directory:
dir0        = os.path.abspath('')              # directory in which this notebook is saved
dirLesson   = os.path.dirname( dir0 )          # Lesson directory
dirData     = os.path.join(dirLesson, 'Data')  # Data directory

# parse all HTML pages:
title  = []
pages  = []
price  = []

for i in range(5):
    fnameHTML = os.path.join(dirData, f'page{i+1}.html')
    s,n,x     = parse_page(fnameHTML)
    title    += s
    pages    += n
    price    += x

print( len(title) )
print( len(pages) )
print( len(price) )
print()
print(price)

200
200
200

[1075, 1790, 2379, 2364, 960, 1219, 8491, 3205, 4729, 3033, 2352, 1581, 1801, 5374, 4325, 4526, 7192, 14542, 960, 834, 5388, 2385, 5136, 5408, 5374, 1076, 11060, 5900, 2567, 5152, 5374, 5374, 2153, 4531, 836, 4182, 1089, 1766, 5981, 4194, 3447, 734, 8621, 8976, 1890, 4600, 1234, 5053, 2363, 1320, 1759, 2358, 734, 1770, 1650, 6336, 5393, 1906, 5971, 836, 2066, 2402, 4307, 2399, 2094, 3582, 2329, 14055, 2389, 1905, 5971, 5169, 3607, 5374, 2364, 4179, 5413, 769, 4927, 11281, 1804, 1539, 2085, 3751, 3607, 2987, 1067, 2984, 1190, 474, 1949, 11770, 5971, 834, 18498, 2045, 3251, 815, 2261, 2400, 1551, 2554, 1965, 838, 2388, 5044, 837, 4775, 1018, 12320, 4190, 1164, 2280, 469, 3706, 5354, 269, 5374, 9809, 1803, 3556, 5374, 816, 5080, 2069, 1219, 3322, 1430, 2461, 1099, 7456, 2384, 2200, 4583, 2563, 7457, 8360, 5971, 2045, 3425, 4012, 2277, 2286, 1189, 6368, 2264, 2079, 5019, 4776, 4776, 1936, 1441, 2767, 2143, 3978, 7218, 4576, 5969, 4777, 1099, 3072, 414, 2072, 2207, 2340, 5772, 

<br>

We now have book titles, page counts and prices saved in three separate lists, each of which has 200 elements.

Let's save these parsed data in a CSV file, so that our analysis code can focus on the data of interest in that file. One way to do this is to use the [csv](https://docs.python.org/3/library/csv.html) module (Japanese documentation [here](https://docs.python.org/ja/3/library/csv.html)).

<br>

In [2]:
import csv

fnameCSV = os.path.join( dir0, 'parsed_data.csv' )
header   = ['Title', 'Pages', 'Price']  # column labels
with open(fnameCSV, 'w', newline='', encoding='utf-8') as f:  # write mode; newline='' is recommended by the csv docs
    writer = csv.writer(f)              # create a writer object
    writer.writerow( header )           # write column labels
    for s,n,x in zip(title, pages, price):   # cycle through all entries
        writer.writerow( [s, n, x] )    # write the current row to file

<br>

After executing this code, you will find a file called **parsed_data.csv** that has been saved to the directory in which this notebook is saved. Note that this file has three columns of data, corresponding to: title, pages, and price, respectively, and that these fields have been filled for all 200 books.
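
One way to check a file like this is to read it back with `csv.reader` and count the rows. The following is only a minimal sketch: it uses an in-memory buffer instead of **parsed_data.csv**, and the sample titles, page counts and prices are hypothetical stand-ins, not values from the lesson data.

```python
import csv
import io

# Hypothetical sample records (assumed for illustration; not the lesson data):
title = ['Book A', 'Book B, Volume 2']
pages = [320, -1]
price = [1075, 1790]

# Write the records to an in-memory text buffer, as in the cell above:
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['Title', 'Pages', 'Price'])   # column labels
for s, n, x in zip(title, pages, price):
    writer.writerow([s, n, x])

# Read the records back and separate the header from the data rows:
buffer.seek(0)
rows = list(csv.reader(buffer))
header, records = rows[0], rows[1:]

print(header)        # column labels
print(len(records))  # number of data rows
print(records[1])    # note: csv.reader returns every field as a string
```

Note that the title containing a comma survives the round trip, because the `csv` module quotes such fields automatically, and that numeric fields come back as strings and would need converting with `int` before analysis.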

Note especially:
* The saved CSV file has a size of approximately **20 KB**.
* Each of the original HTML data files has a size of approximately **240 KB**, for a total of about **1.2 MB**.
* We have therefore greatly reduced the amount of data, by a factor of approximately 60.
* This means that our analysis code, which will use only the parsed results in the CSV file, will have far less data to read and will be able to execute much more efficiently.
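
File sizes like these can be checked with `os.path.getsize`. The sketch below is an illustration only: `size_ratio` is a hypothetical helper, and the temporary files are stand-ins whose sizes simply mirror the approximate figures quoted above, not the actual lesson files.

```python
import os
import tempfile

def size_ratio(original_paths, parsed_path):
    """Total size of the original files divided by the size of the parsed file."""
    total = sum(os.path.getsize(p) for p in original_paths)
    return total / os.path.getsize(parsed_path)

# Demonstration with temporary stand-in files (assumed sizes, not the lesson data):
with tempfile.TemporaryDirectory() as tmp:
    originals = []
    for i in range(5):
        path = os.path.join(tmp, f'page{i+1}.html')
        with open(path, 'wb') as f:
            f.write(b'x' * 240_000)   # stand-in for one HTML page of about 240 KB
        originals.append(path)

    parsed = os.path.join(tmp, 'parsed_data.csv')
    with open(parsed, 'wb') as f:
        f.write(b'x' * 20_000)        # stand-in for the CSV file of about 20 KB

    ratio = size_ratio(originals, parsed)

print(ratio)
```

To apply this to the real files, pass the five `page*.html` paths in `dirData` and `fnameCSV` to `size_ratio`.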

While the [csv](https://docs.python.org/3/library/csv.html) module works fine, it is usually more convenient to use [pandas](https://pandas.pydata.org). Using pandas, the CSV file can be written with more compact, easier-to-read code:

<br>

In [3]:
import pandas as pd

df       = pd.DataFrame( dict(Title=title, Pages=pages, Price=price) )
df.to_csv(fnameCSV, index=False)

<br>

As before, this code will write a CSV file that contains all data for all 200 books.

Note that the command: `dict(Title=title, Pages=pages, Price=price)` creates a dictionary with keys: `Title`, `Pages` and `Price`. The `DataFrame` uses these keys as its column names, and the `to_csv` method writes them as the column labels in the first row of the CSV file.

Note also that the `index=False` keyword argument prevents the `to_csv` method from writing a column of row numbers. By default, `to_csv` will create a CSV file whose first column contains integers that indicate the row number.
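
The effect of `index=False` is easy to see on a small example. This is a minimal sketch with hypothetical sample values; it relies on the fact that `to_csv` returns the CSV text as a string when no file name is given.

```python
import pandas as pd

# Hypothetical sample data (assumed for illustration; not the lesson data):
df = pd.DataFrame(dict(Title=['Book A', 'Book B'], Pages=[320, -1], Price=[1075, 1790]))

with_index    = df.to_csv()             # default: the row index is the first column
without_index = df.to_csv(index=False)  # no row-index column

print(with_index)
print(without_index)
```

In the first output each row starts with its row number and the header row starts with an empty label; in the second output only the three data columns appear.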

After writing the data, the contents of the CSV file can easily be re-read as follows:

<br>

In [4]:
df       = pd.read_csv(fnameCSV)
title    = df['Title']
pages    = df['Pages']
price    = df['Price']

print( price )

0       1075
1       1790
2       2379
3       2364
4        960
       ...  
195      989
196      838
197     3698
198      956
199    14047
Name: Price, Length: 200, dtype: int64


<br>

Note that we can easily convert pandas objects to lists or NumPy arrays:

<br>

In [5]:
import numpy as np

a = list( price )
b = np.array( price )

print( a )
print()
print( b )

[1075, 1790, 2379, 2364, 960, 1219, 8491, 3205, 4729, 3033, 2352, 1581, 1801, 5374, 4325, 4526, 7192, 14542, 960, 834, 5388, 2385, 5136, 5408, 5374, 1076, 11060, 5900, 2567, 5152, 5374, 5374, 2153, 4531, 836, 4182, 1089, 1766, 5981, 4194, 3447, 734, 8621, 8976, 1890, 4600, 1234, 5053, 2363, 1320, 1759, 2358, 734, 1770, 1650, 6336, 5393, 1906, 5971, 836, 2066, 2402, 4307, 2399, 2094, 3582, 2329, 14055, 2389, 1905, 5971, 5169, 3607, 5374, 2364, 4179, 5413, 769, 4927, 11281, 1804, 1539, 2085, 3751, 3607, 2987, 1067, 2984, 1190, 474, 1949, 11770, 5971, 834, 18498, 2045, 3251, 815, 2261, 2400, 1551, 2554, 1965, 838, 2388, 5044, 837, 4775, 1018, 12320, 4190, 1164, 2280, 469, 3706, 5354, 269, 5374, 9809, 1803, 3556, 5374, 816, 5080, 2069, 1219, 3322, 1430, 2461, 1099, 7456, 2384, 2200, 4583, 2563, 7457, 8360, 5971, 2045, 3425, 4012, 2277, 2286, 1189, 6368, 2264, 2079, 5019, 4776, 4776, 1936, 1441, 2767, 2143, 3978, 7218, 4576, 5969, 4777, 1099, 3072, 414, 2072, 2207, 2340, 5772, 4470, 2400, 3
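
pandas also provides its own conversion methods, `Series.tolist` and `Series.to_numpy`, which can be used in place of `list` and `np.array`. A minimal sketch, using a short hypothetical `Series` rather than the 200 parsed prices:

```python
import numpy as np
import pandas as pd

# Hypothetical sample (assumed for illustration; not the full lesson data):
price = pd.Series([1075, 1790, 2379], name='Price')

a = price.tolist()     # plain Python list
b = price.to_numpy()   # NumPy array

print(a)
print(b)
print(b.mean())        # NumPy methods are now available on the array
```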

<br>

Great!  For use in the final section of this lesson, let's assemble all of our **parsing** code, including CSV data file writing.

<br>

In [6]:
'''
Complete parsing code.
'''

import os
import lxml.html
import pandas as pd

def parse_name_node(node):
    return node.text

def parse_price_node(node):
    s = node.text
    x = s[1:].replace(',', '')
    return int( x )

def parse_page_count_node(node):
    s = node.text
    if 'ページ' in s:
        x = s.split(':')[1].split(',')[0]
    else:
        x = -1  # assign a value of -1 if "ページ" is not in the text field
    return int( x )


def parse_page(fnameHTML):
    tree        = lxml.html.parse(fnameHTML)
    body        = tree.find('body') 
    box         = body.find_class('itemCatBox')[0]
    name_nodes  = box.find_class('name')
    page_nodes  = box.find_class('itemCatsetsumei')
    price_nodes = box.find_class('itemCatPrice')

    # parse the entries:
    title       = [parse_name_node( node )  for node in name_nodes]
    pages       = [parse_page_count_node( node )  for node in page_nodes]
    price       = [parse_price_node( node)  for node in price_nodes]
    
    return title, pages, price



# specify data directory:
dir0        = os.path.abspath('')              # directory in which this notebook is saved
dirLesson   = os.path.dirname( dir0 )          # Lesson directory
dirData     = os.path.join(dirLesson, 'Data')  # Data directory


# parse all HTML pages:
title  = []
pages  = []
price  = []
for i in range(5):
    fnameHTML = os.path.join(dirData, f'page{i+1}.html')
    s,n,x     = parse_page(fnameHTML)
    title    += s
    pages    += n
    price    += x


# write parsed results:
df       = pd.DataFrame( dict(Title=title, Pages=pages, Price=price) )
fnameCSV = os.path.join( dir0, 'parsed_data.csv' )
df.to_csv(fnameCSV, index=False)

<br>

OK, now that we've completed our **parsing** code, let's move on to **analysis**, which is discussed in the next notebook.

<br>