# File Formats

One "Swiss Army knife" tool (or library) for each format:
- XML: `BeautifulSoup`
- JSON: `json`
- CSV: `csv`

<img src='https://cdn-images-1.medium.com/max/1600/1*Emm10TxVEOvWqwF9oPJb1w.jpeg' width='300px'>

## XML

### `BeautifulSoup`

There are several Python libraries for parsing XML, but `BeautifulSoup` is something of a Swiss Army knife for the job.

It can parse HTML and XML, as well as ill-formed or broken XML documents (very useful for legacy XML or even SGML data).
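
To make the point about broken documents concrete, here is a minimal sketch. The markup string and the tag names `doc`, `line` and `b` are invented for illustration, and the built-in `html.parser` backend is an assumption (any parser supported by `BeautifulSoup` could be used instead). The `<b>` element is never closed, yet the parser repairs the tree instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical ill-formed snippet: <b> is opened but never closed.
broken = "<doc><line id='l1'>Hello <b>world</line></doc>"

# Assumption: Python's built-in HTML parser is used as the backend.
soup = BeautifulSoup(broken, "html.parser")

# The unclosed <b> is closed implicitly when </line> is reached.
line_text = soup.find("line").get_text()
bold_text = soup.find("b").get_text()

print(line_text)
print(bold_text)
```

A strict XML parser would typically reject this input, which is why a lenient parser is handy for legacy data.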

In [None]:
import os
import bs4
from bs4 import BeautifulSoup

In [None]:
data_folder = 'data/altoxml/'

In [None]:
# let's get the path of XML files
# we filter only files with XML extension
# it can be useful to ignore e.g. `.DS_Store` files (under MacOS)

xml_files = [
    os.path.join(data_folder, file)
    for file in os.listdir(data_folder)
    if file.endswith(".xml")
]

In [None]:
xml_files

In [None]:
# prefixing a code cell's content with `!`
# tells Jupyter to execute it as a shell command.
# Here we use the command `head` to peek at the first 50 lines
# of our XML file.

!head -n 50 data/altoxml/27971740_1890-04-01_38_077_0_001.xml

In [None]:
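# We name the parser explicitly to avoid a warning from bs4.
# An HTML parser lowercases tag and attribute names, which is
# why the searches below use lowercase names.
with open(xml_files[0], 'r') as inpfile:
    xml_doc = BeautifulSoup(inpfile, 'html.parser')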

In [None]:
xml_doc

### Finding elements

Finding the `<textblock>` element with `@id` = `Page1_Block1`:

In [None]:
xml_doc.find_all?

In [None]:
target_element = xml_doc.find_all(
    'textblock',
    attrs={'id': 'Page1_Block1'}
)

In [None]:
# by definition, there should exist exactly one element
# with a given ID within the same document
assert len(target_element) == 1

The same search logic applies to *any* XML attribute. 

Here we search for all `<composedblock>` with `@type` = `container`:

In [None]:
composed_blocks = xml_doc.find_all(
    'composedblock',
    {'type': 'container'}
)

Finding all XML elements with a given name:

In [None]:
textline_elements = xml_doc.find_all('textline')

In [None]:
# in ALTO, `hpos` is the horizontal (x) position
# and `vpos` the vertical (y) position
x = textline_elements[0].get('hpos')
y = textline_elements[0].get('vpos')
w = textline_elements[0].get('width')
h = textline_elements[0].get('height')

In [None]:
print(
    f'The coordinates of the first line are: {x} (x), {y} (y), {h} (height), {w} (width)'
)
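
The searches in this section can also be tried without the data folder. The sketch below runs them on a tiny inline document; the sample markup and its attribute values are made up for illustration and only loosely imitate an ALTO file. It assumes the `html.parser` backend, which lowercases tag and attribute names.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified ALTO-like document.
sample = """
<Page ID="Page1">
  <ComposedBlock ID="Page1_Block0" TYPE="container">
    <TextBlock ID="Page1_Block1">
      <TextLine HPOS="10" VPOS="20" WIDTH="300" HEIGHT="15"></TextLine>
      <TextLine HPOS="10" VPOS="40" WIDTH="280" HEIGHT="15"></TextLine>
    </TextBlock>
  </ComposedBlock>
</Page>
"""

soup = BeautifulSoup(sample, "html.parser")

# Search by element name plus attribute value.
blocks = soup.find_all("textblock", attrs={"id": "Page1_Block1"})
containers = soup.find_all("composedblock", {"type": "container"})

# Search by element name only, then read attributes with .get().
lines = soup.find_all("textline")
first_line_box = {
    key: lines[0].get(key) for key in ("hpos", "vpos", "width", "height")
}

print(len(blocks), len(containers), len(lines))
print(first_line_box)
```

Note that `.get()` returns attribute values as strings, so they need converting with `int()` or `float()` before doing arithmetic on coordinates.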

### Navigating the XML tree

In [None]:
el = xml_doc.find('styles')

In [None]:
for child in el.children:
    print(type(child), child.name)

In [None]:
parent = el.parent

In [None]:
el.previous_sibling

In [None]:
el.next_sibling
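
One thing to watch for when navigating: in pretty-printed XML, the direct sibling of an element is often a whitespace text node rather than another element. The sketch below illustrates this on two invented documents (the tag names are hypothetical, and `html.parser` is again an assumption). `find_next_sibling()` skips over text nodes and returns the next element.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Compact document: no whitespace between elements.
compact = BeautifulSoup(
    "<alto><description></description>"
    "<styles><textstyle></textstyle><paragraphstyle></paragraphstyle></styles>"
    "<layout></layout></alto>",
    "html.parser",
)
el = compact.find("styles")
child_names = [child.name for child in el.children]
parent_name = el.parent.name
previous_name = el.previous_sibling.name
next_name = el.next_sibling.name

# Pretty-printed document: newlines become text nodes in the tree.
pretty = BeautifulSoup(
    "<root>\n<first></first>\n<second></second>\n</root>",
    "html.parser",
)
first = pretty.find("first")
raw_sibling = first.next_sibling            # the newline text node
tag_sibling = first.find_next_sibling()     # the next element

print(child_names, parent_name, previous_name, next_name)
print(repr(raw_sibling), tag_sibling.name)
```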

### Exercise

Let's now try to put all these things together to solve a real problem that you have already encountered, i.e. **turning a bunch of XML files into processable data**. Why can this be useful?

(This exercise will take around 20-30 minutes to complete.)

In [None]:
import pandas as pd

In [None]:
def parse_alto(filepath):
    """
    Convert each file to a dictionary with the
    following keys: fulltext (list of lines), wordcount, filename.
    """
    parsed_data = {}
    
    # add your solution here
    # you'll need to parse the xml elements
    # containing the information you are interested in
    
    # HINT: you may want to split the parsing of individual
    # XML elements into dedicated functions that get called from
    # `parse_alto()`
    
    return parsed_data

In [None]:
# once your function is in place, you should be
# able to execute this cell, which applies your function
# to all Alto files.

data = [
    parse_alto(xml_file)
    for xml_file in xml_files
]

df = pd.DataFrame(data)

In [None]:
df.head()

---

## JSON and CSV

These two formats are left for self-study:

* For JSON see the [documentation](https://docs.python.org/3.11/library/json.html).
* For CSV see the [documentation](https://docs.python.org/3.11/library/csv.html).

In [None]:
import json

# Python dictionary
person = {"name": "John", "age": 30, "city": "New York"}

# Serialize: Convert Python dictionary to JSON string
person_json = json.dumps(person)
print("JSON string:", person_json)

# Deserialize: Convert JSON string back to Python dictionary
person_dict = json.loads(person_json)
print("Python dictionary:", person_dict)
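
`json.dumps` and `json.loads` work on strings; their file-based counterparts are `json.dump` and `json.load`. The sketch below writes the same dictionary to a file and reads it back. The temporary directory and the file name `person.json` are assumptions made so that the example runs anywhere.

```python
import json
import os
import tempfile

person = {"name": "John", "age": 30, "city": "New York"}

# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "person.json")  # hypothetical file name

    # Serialize: write the dictionary to a JSON file
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(person, outfile, indent=2)

    # Deserialize: read the JSON file back into a dictionary
    with open(path, "r", encoding="utf-8") as infile:
        loaded = json.load(infile)

print(loaded)
```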

In [None]:
import csv

# Data to be written to the CSV file
data = [
    ['Name', 'Age', 'City'],
    ['John Doe', '30', 'New York'],
    ['Jane Doe', '25', 'Los Angeles']
]

# Open a CSV file for writing (the `stuff/` folder must already exist)
with open('stuff/example.csv', 'w', newline='') as file:
    writer = csv.writer(file)

    # Write the data to the CSV file
    writer.writerows(data)

print("Data written to stuff/example.csv")

In [None]:
import csv

# Open the CSV file for reading
with open('stuff/example.csv', 'r', newline='') as file:
    reader = csv.reader(file)

    # Read and print each row
    for row in reader:
        print(row)
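
When a CSV file has a header row, `csv.DictWriter` and `csv.DictReader` let you work with rows as dictionaries keyed by column name instead of positional lists. This is a minimal sketch using the same sample data; an in-memory `io.StringIO` buffer stands in for a real file so that the example does not depend on any folder existing.

```python
import csv
import io

fieldnames = ["Name", "Age", "City"]
people = [
    {"Name": "John Doe", "Age": "30", "City": "New York"},
    {"Name": "Jane Doe", "Age": "25", "City": "Los Angeles"},
]

# In-memory text buffer used in place of a file on disk.
buffer = io.StringIO(newline="")

writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()       # first row: column names
writer.writerows(people)   # one row per dictionary

# Rewind the buffer and read the rows back as dictionaries.
buffer.seek(0)
rows = list(csv.DictReader(buffer))

for row in rows:
    print(row["Name"], row["City"])
```

As with `csv.reader`, every value comes back as a string, so numeric columns such as `Age` need converting explicitly.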

---