# alto2txt demo notebook

This is a demo notebook, illustrating how to run the `alto2txt` tool from a Jupyter notebook.

It assumes that you have downloaded and installed the latest version of `alto2txt` using `pip`:

```sh
$ pip install alto2txt
```

## Loading functionality

First, we import the function we need from the `alto2txt` tool. The function is called `process_mets_files_in_directory`.

We also need the `load_xslts` function. It returns a dictionary that maps each supported flavour of XML to the matching XSLT document. This dictionary has to be passed as a parameter to every transformation we run (we will get back to that below).

In [4]:
from alto2txt.xml_to_text import process_mets_files_in_directory
from alto2txt.xml import load_xslts

xslts = load_xslts()

## Testing files from BL repository

In this first example, we run the alto2txt tool on one issue of one newspaper from the British Library's research repository. (The full file with the entire digitised run of the newspaper can be found [here](TODO).)

We downloaded the file and extracted one issue into the `bl-repo-files` folder, which sits inside this `demo-files` folder.

The `process_mets_files_in_directory` function takes three arguments: an input directory (`input_dir`), an output directory (`output_dir`), and the dictionary of XSLT transformation files (`xslts`), which we created above.

In the example below, we provide the input directory `./bl-repo-files` as a relative path. (You can also provide it as an absolute path if you want to.) For output, we have used the `demo-output` directory which is one level up from the `demo-files`. Feel free to play around with this parameter to see where the files end up on your computer.

In [5]:
result = process_mets_files_in_directory(
  input_dir="./bl-repo-files",
  output_dir="../demo-output",
  xslts=xslts)
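
Because both directories are given as relative paths, where the output ends up depends on the working directory of the notebook. The sketch below does not call `alto2txt` at all; it only uses `pathlib` to show how the two paths would resolve from a temporary stand-in for the `demo-files` folder. The helper name `resolve_from` and the temporary folder are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def resolve_from(working_dir, relative_path):
    """Return the absolute path that relative_path points to from working_dir."""
    return (Path(working_dir) / relative_path).resolve()


with tempfile.TemporaryDirectory() as root:
    root = Path(root).resolve()
    # Hypothetical stand-in for the folder the notebook runs in
    demo_files = root / "demo-files"
    demo_files.mkdir()

    input_path = resolve_from(demo_files, "./bl-repo-files")
    output_path = resolve_from(demo_files, "../demo-output")

# The input folder sits inside demo-files ...
print(input_path.parent.name)
# ... while the output folder is a sibling of demo-files
print(output_path.parent == root)
```

If the output does not appear where you expect it, printing `Path.cwd()` in the notebook is a quick way to check which folder the relative paths start from.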

You will note that we assign the result of the `process_mets_files_in_directory` function to a variable called `result`. Let's have a look at it (using `pprint` which provides a prettier look at the resulting dictionary):

In [7]:
from pprint import pprint

pprint(result)

{'Date has no text': 0,
 'Date is not formatted correctly (YYYY-MM-DD)': 0,
 'General XSLT Error': 0,
 'Identifier has no text': 0,
 'No dates found': 0,
 'No identifiers found': 0,
 'Too many dates found': 0,
 'Too many identifiers found': 0,
 'Unknown XML schema': 0,
 'Unknown root': 0,
 'Unsupported XML schema for operation': 0,
 'XML Syntax Error': 0,
 'XML file failed to give XSLT output': 0,
 'XSLT Error, cannot resolve URI': 0,
 'num_files': 1}


In this resulting dictionary, `num_files` (at the very end) shows how many files were processed successfully. All the other keys represent errors that occurred while transforming the contents of the `input_dir` that you provided above.
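
Since every key other than `num_files` is an error counter, the dictionary can be reduced to two numbers. The following is a minimal sketch that works on a plain dictionary of the shape printed above; `summarise` is a hypothetical helper, not part of `alto2txt`, and the sample is an abbreviated copy of the real output.

```python
def summarise(result):
    """Split a result dictionary into (files processed, total error count)."""
    processed = result.get("num_files", 0)
    # Every key except num_files counts one kind of error
    errors = sum(count for key, count in result.items() if key != "num_files")
    return processed, errors


# Abbreviated sample in the same shape as the output above
sample = {
    "General XSLT Error": 0,
    "XML Syntax Error": 0,
    "XSLT Error, cannot resolve URI": 0,
    "num_files": 1,
}

processed, errors = summarise(sample)
print(f"{processed} file(s) processed, {errors} error(s)")
```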

### Another example

We have provided another example, which is located in the `0002647` folder right here in `demo-files`. All the other parameters are the same as in the example above, so you can expect to find the result in the `demo-output` folder once again.

Let's run the function and look at the result:

In [8]:
result = process_mets_files_in_directory(
  input_dir="./0002647/",
  output_dir="../demo-output",
  xslts=xslts)

pprint(result)

{'Date has no text': 0,
 'Date is not formatted correctly (YYYY-MM-DD)': 0,
 'General XSLT Error': 0,
 'Identifier has no text': 0,
 'No dates found': 0,
 'No identifiers found': 0,
 'Too many dates found': 0,
 'Too many identifiers found': 0,
 'Unknown XML schema': 0,
 'Unknown root': 0,
 'Unsupported XML schema for operation': 0,
 'XML Syntax Error': 0,
 'XML file failed to give XSLT output': 0,
 'XSLT Error, cannot resolve URI': 0,
 'num_files': 1}


Once again, one METS file was successfully processed.

## Files in a "bad" directory structure

In previous versions of `alto2txt`, the `bad_directory` in our test files would crash the program (as it used to expect a particular folder structure). In the current version, the `bad_directory` test will no longer fail:

In [10]:
result = process_mets_files_in_directory(
  input_dir="../tests/tests/test_files/bad_directory/",
  output_dir="../demo-output/",
  xslts=xslts)

pprint(result)

{'Date has no text': 0,
 'Date is not formatted correctly (YYYY-MM-DD)': 0,
 'General XSLT Error': 0,
 'Identifier has no text': 0,
 'No dates found': 0,
 'No identifiers found': 0,
 'Too many dates found': 0,
 'Too many identifiers found': 0,
 'Unknown XML schema': 0,
 'Unknown root': 0,
 'Unsupported XML schema for operation': 0,
 'XML Syntax Error': 0,
 'XML file failed to give XSLT output': 0,
 'XSLT Error, cannot resolve URI': 0,
 'num_files': 1}


## No directory structure

The tool should now also run on a flat directory, i.e. one where the files are not organised into a directory tree. As we will see in a later example, it also handles multiple METS files in the same directory.

In [11]:
result = process_mets_files_in_directory(
  input_dir="../tests/tests/test_files/no_dir_tree/",
  output_dir="../demo-output/",
  xslts=xslts)

pprint(result)

{'Date has no text': 0,
 'Date is not formatted correctly (YYYY-MM-DD)': 0,
 'General XSLT Error': 0,
 'Identifier has no text': 0,
 'No dates found': 0,
 'No identifiers found': 0,
 'Too many dates found': 0,
 'Too many identifiers found': 0,
 'Unknown XML schema': 0,
 'Unknown root': 0,
 'Unsupported XML schema for operation': 0,
 'XML Syntax Error': 0,
 'XML file failed to give XSLT output': 0,
 'XSLT Error, cannot resolve URI': 0,
 'num_files': 1}


What if we have multiple METS files in the same folder? It works like a charm:

In [12]:
result = process_mets_files_in_directory(
  input_dir="../tests/tests/test_files/multiple_mets_files_one_dir/",
  output_dir="../demo-output/",
  xslts=xslts)

pprint(result)

{'Date has no text': 0,
 'Date is not formatted correctly (YYYY-MM-DD)': 0,
 'General XSLT Error': 0,
 'Identifier has no text': 0,
 'No dates found': 0,
 'No identifiers found': 0,
 'Too many dates found': 0,
 'Too many identifiers found': 0,
 'Unknown XML schema': 0,
 'Unknown root': 0,
 'Unsupported XML schema for operation': 0,
 'XML Syntax Error': 0,
 'XML file failed to give XSLT output': 0,
 'XSLT Error, cannot resolve URI': 0,
 'num_files': 2}


### Missing page references in the METS file

So, what happens if the METS file references pages that are missing? The `alto2txt` tool should fail to process that particular METS file, but continue processing the other METS files in the directory.

Let's try with a METS file that has (at least) one missing page:

In [13]:
# This should fail
result = process_mets_files_in_directory(
  input_dir="../tests/tests/test_files/missing_page/",
  output_dir="../demo-output/",
  xslts=xslts)

result

An XMLError occurred: XSLT Error, cannot resolve URI: /Users/kwesterling/Repositories/lwm/alto2txt/tests/tests/test_files/missing_page/1824/0217/0002647_18240217_0004.xml


XSLT Error, cannot resolve URI: /Users/kwesterling/Repositories/lwm/alto2txt/tests/tests/test_files/missing_page/1824/0217/0002647_18240217_0004.xml


{'Unknown XML schema': 0,
 'Unsupported XML schema for operation': 0,
 'XML Syntax Error': 0,
 'Unknown root': 0,
 'XML file failed to give XSLT output': 0,
 'No dates found': 0,
 'Too many dates found': 0,
 'Date has no text': 0,
 'Date is not formatted correctly (YYYY-MM-DD)': 0,
 'No identifiers found': 0,
 'Too many identifiers found': 0,
 'Identifier has no text': 0,
 'General XSLT Error': 0,
 'XSLT Error, cannot resolve URI': 1,
 'num_files': 0}

As you can see, the `alto2txt` tool logs the error. The `process_mets_files_in_directory` function will also print any errors that occur (so you'll get two notifications above).

In the resulting dictionary (`result`), the error `"XSLT Error, cannot resolve URI"` has a count of one, and `num_files` is zero.
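
With this many keys, the one error that actually occurred is easy to overlook. As a small sketch, the dictionary can be filtered down to the error categories with a non-zero count. The helper `nonzero_errors` is a hypothetical name, and the sample is an abbreviated copy of the output above.

```python
def nonzero_errors(result):
    """Return only the error categories that occurred at least once."""
    return {
        key: count
        for key, count in result.items()
        if key != "num_files" and count > 0
    }


# Abbreviated sample in the same shape as the output above
sample = {
    "XML Syntax Error": 0,
    "General XSLT Error": 0,
    "XSLT Error, cannot resolve URI": 1,
    "num_files": 0,
}

failures = nonzero_errors(sample)
print(failures)
```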

So what happens if we have multiple METS files in one directory, where one METS file has missing pages, but another one has all its pages? Let's try:

In [15]:
result = process_mets_files_in_directory(
  input_dir="../tests/tests/test_files/multiple_mets_files_one_dir_one_missing_page/",
  output_dir="../demo-output/",
  xslts=xslts)

pprint(result)

An XMLError occurred: XSLT Error, cannot resolve URI: /Users/kwesterling/Repositories/lwm/alto2txt/tests/tests/test_files/multiple_mets_files_one_dir_one_missing_page/0002978_18900516_0008.xml


XSLT Error, cannot resolve URI: /Users/kwesterling/Repositories/lwm/alto2txt/tests/tests/test_files/multiple_mets_files_one_dir_one_missing_page/0002978_18900516_0008.xml
{'Date has no text': 0,
 'Date is not formatted correctly (YYYY-MM-DD)': 0,
 'General XSLT Error': 0,
 'Identifier has no text': 0,
 'No dates found': 0,
 'No identifiers found': 0,
 'Too many dates found': 0,
 'Too many identifiers found': 0,
 'Unknown XML schema': 0,
 'Unknown root': 0,
 'Unsupported XML schema for operation': 0,
 'XML Syntax Error': 0,
 'XML file failed to give XSLT output': 0,
 'XSLT Error, cannot resolve URI': 1,
 'num_files': 1}


You will see that `num_files` reports one file that was successfully transformed by the XSLT. Meanwhile, the other file failed with the error `"XSLT Error, cannot resolve URI"`. That's what we expected!
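
If you run the function on several directories, as this notebook does, you may want one combined tally. The sketch below adds up result dictionaries with `collections.Counter`; `combine_results` is a hypothetical helper rather than part of `alto2txt`, and the two samples are abbreviated copies of the outputs in this section.

```python
from collections import Counter


def combine_results(*results):
    """Add up the counts of several result dictionaries, key by key."""
    total = Counter()
    for result in results:
        # update() adds counts and, unlike the + operator, keeps zero entries
        total.update(result)
    return dict(total)


# Abbreviated samples in the same shape as the outputs above
missing_page = {
    "XML Syntax Error": 0,
    "XSLT Error, cannot resolve URI": 1,
    "num_files": 0,
}
one_missing_page = {
    "XML Syntax Error": 0,
    "XSLT Error, cannot resolve URI": 1,
    "num_files": 1,
}

combined = combine_results(missing_page, one_missing_page)
print(combined)
```

`Counter.update` is used instead of `+` because adding two `Counter` objects drops every key whose total is zero, which would hide the error categories that never occurred.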