# EEBO XML to CSV Converter

This notebook converts EEBO Phase 1 P4 XML files to CSV format for text mining and machine learning applications.

**Output format**: one row per page, with four columns: `author`, `place`, `date`, `page_text`
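
To make the row layout concrete, here is a minimal sketch that writes two placeholder rows with these four columns and reads them back. It uses only the standard-library `csv` module; the row values are invented for illustration and are not taken from the collection.

```python
import csv
import io

FIELDNAMES = ["author", "place", "date", "page_text"]

# Placeholder rows: one dict per page, metadata repeated on every page.
rows = [
    {"author": "Unknown", "place": "London", "date": "1650",
     "page_text": "First page,\nwith a line break."},
    {"author": "Unknown", "place": "London", "date": "1650",
     "page_text": "Second page."},
]

# Write the rows as CSV; the writer quotes fields containing commas or newlines.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(rows)

# Read the CSV back; every value comes back as a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["page_text"])
```

Because the bibliographic fields are repeated on every row, each page can be filtered or grouped by `author`, `place`, or `date` without a join.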

In [None]:
import sys
from pathlib import Path

# Add the current working directory to sys.path so the processing_code package can be imported
sys.path.insert(0, str(Path.cwd()))

import pandas as pd
from processing_code import parse_xml, process_files

# EEBO data source
EEBO_DATA_PATH = Path('/Volumes/X9 Pro/Text-Machine-Data/P4_XML_TCP')

print(f"Data path exists: {EEBO_DATA_PATH.exists()}")
if EEBO_DATA_PATH.exists():
    xml_files = list(EEBO_DATA_PATH.glob("**/[!.]*.xml"))
    print(f"Found {len(xml_files)} XML files")

Data path exists: True
Found 5012 XML files
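
The pattern `**/[!.]*.xml` searches recursively and only accepts file names whose first character is not a dot. This presumably keeps hidden files, such as the `._*` sidecar files macOS can create on external volumes, out of the count. A minimal sketch on a throwaway directory, with made-up file names:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "batch1").mkdir()
    (root / "batch1" / "N00001.p4.xml").write_text("<ETS/>")
    (root / "batch1" / "._N00001.p4.xml").write_text("")  # hidden sidecar file
    (root / "batch1" / "readme.txt").write_text("")

    # Without the [!.] prefix, the hidden sidecar file is matched as well.
    plain = sorted(p.name for p in root.glob("**/*.xml"))
    # With it, only file names that do not start with a dot are kept.
    filtered = sorted(p.name for p in root.glob("**/[!.]*.xml"))

print(plain)
print(filtered)
```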


In [None]:
# Test the parser on a single file
if EEBO_DATA_PATH.exists() and len(xml_files) > 0:
    test_file = xml_files[0]
    print(f"Testing parser on: {test_file.name}")
    test_pages = parse_xml(test_file)
    print(f"\nExtracted {len(test_pages)} pages")
    
    if test_pages:
        # Display first page as example
        test_df = pd.DataFrame([test_pages[0]])
        print("\nFirst page sample:")
        print(test_df.to_string())
        print(f"\nPage text preview: {test_pages[0]['page_text'][:200]}...")

Testing parser on: N00001.p4.xml

Extracted 49 pages

First page sample:
    author                place             date                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           page_text
0  Unknown  [Cambridge, Mass. :  Imprinted 1640.  THE VVHOLE BOOKE OF PSALMES Faithfully TRANSLATED into ENGLISH Metre. Whereunto is prefixed a discourse de∣claring not only the lawfullnes, but also the necessity of the heavenly Ordinance of singing Scripture Psalmes in the Churches of God. Coll.  III. \n Let the word of God dwe
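
The internals of `parse_xml` are not shown in this notebook. As a rough illustration of how a document can be split into one record per page, the sketch below walks a toy XML string and starts a new page at every page-break element. The tag names (`AUTHOR`, `PUBPLACE`, `DATE`, `TEXT`, `PB`), the `split_pages` helper, and the whitespace normalisation are all assumptions made for this example, not a description of the actual parser.

```python
import xml.etree.ElementTree as ET

# Toy document; the tag names are assumptions, not a confirmed schema.
SAMPLE = (
    "<ETS><HEADER><AUTHOR>Unknown</AUTHOR><PUBPLACE>London</PUBPLACE>"
    "<DATE>1650</DATE></HEADER>"
    '<TEXT><PB N="1"/><P>First page.</P>'
    '<PB N="2"/><P>Second <HI>page</HI>.</P></TEXT></ETS>'
)

PAGE_BREAK = object()  # sentinel marking the start of a new page


def text_events(element):
    """Yield text chunks and PAGE_BREAK markers in document order."""
    if element.tag == "PB":
        yield PAGE_BREAK
    elif element.text:
        yield element.text
    for child in element:
        yield from text_events(child)
        if child.tail:
            yield child.tail


def split_pages(xml_string):
    """Hypothetical helper: return one dict per page of the toy document."""
    root = ET.fromstring(xml_string)
    meta = {
        "author": (root.findtext(".//AUTHOR") or "Unknown").strip(),
        "place": (root.findtext(".//PUBPLACE") or "").strip(),
        "date": (root.findtext(".//DATE") or "").strip(),
    }
    body = root.find(".//TEXT")
    if body is None:
        return []
    pages, current = [], None
    for item in text_events(body):
        if item is PAGE_BREAK:
            if current is not None:
                pages.append(current)
            current = []
        elif current is not None:  # text before the first page break is dropped
            current.append(item)
    if current is not None:
        pages.append(current)
    # Collapse runs of whitespace; a real parser may prefer to keep line breaks.
    return [dict(meta, page_text=" ".join("".join(chunks).split())) for chunks in pages]


sample_pages = split_pages(SAMPLE)
print(sample_pages)
```

Walking the tree recursively keeps each element's trailing text in document order, which matters when inline elements such as the `HI` tag above sit in the middle of a sentence.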

In [None]:
# Process all XML files and create CSV
# Uses process_files, imported from the processing_code package above

# Option 1: Test with first 10 files
# df = process_files(xml_files, output_path="data/eebo_pages_sample.csv", max_files=10)

# Option 2: Process all files (can take a while depending on collection size)
# df = process_files(xml_files, output_path="data/eebo_pages_full.csv")
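
For readers without access to `processing_code`, the batch step can be approximated as a loop that parses each file and appends its pages to a single CSV. The sketch below is an assumption about the general shape of such a function, not the implementation of `process_files`: `convert_files` and `fake_parse` are hypothetical names, and only the `output_path` and `max_files` parameters mirror the calls shown above.

```python
import csv
import tempfile
from pathlib import Path

FIELDNAMES = ["author", "place", "date", "page_text"]


def convert_files(paths, parse, output_path, max_files=None):
    """Hypothetical helper: parse each path and stream its pages to one CSV."""
    selected = list(paths)[:max_files]  # a slice ending at None keeps every file
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    total = 0
    with output_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        for path in selected:
            pages = parse(path)  # expected: a list of dicts, one per page
            writer.writerows(pages)
            total += len(pages)
    return total


def fake_parse(path):
    """Stand-in parser that returns a single placeholder page per file."""
    return [{"author": "Unknown", "place": "", "date": "",
             "page_text": f"page from {path}"}]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "data" / "sample.csv"
    written = convert_files(["a.xml", "b.xml", "c.xml"], fake_parse, out, max_files=2)
    with out.open(newline="", encoding="utf-8") as handle:
        rows_back = list(csv.DictReader(handle))

print(written, len(rows_back))
```

Writing rows as each file is parsed keeps memory use flat, at the cost of not returning a DataFrame the way the calls above appear to.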

In [None]:
# Run the conversion on the full collection
# Uncomment the line below and execute to process all files

# df = process_files(xml_files, output_path="data/evan_pages_full.csv")

Processing 5012 files...


100%|██████████| 5012/5012 [11:01<00:00,  7.58it/s]  



Total pages extracted: 161599

DataFrame info:
<class 'pandas.DataFrame'>
RangeIndex: 161599 entries, 0 to 161598
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   author     161599 non-null  str  
 1   place      161599 non-null  str  
 2   date       161599 non-null  str  
 3   page_text  161599 non-null  str  
dtypes: str(4)
memory usage: 4.9 MB
None

First few rows:
    author                place             date  \
0  Unknown  [Cambridge, Mass. :  Imprinted 1640.   
1  Unknown  [Cambridge, Mass. :  Imprinted 1640.   
2  Unknown  [Cambridge, Mass. :  Imprinted 1640.   
3  Unknown  [Cambridge, Mass. :  Imprinted 1640.   
4  Unknown  [Cambridge, Mass. :  Imprinted 1640.   

                                           page_text  
0  THE VVHOLE BOOKE OF PSALMES Faithfully TRANSLA...  
1  The Preface. THe singing of Psalmes, though it...  
2              chron Reu. Reu. Num. Reu. Gal. chron.  
3                           