# Using the eScriptorium Connector

This library provides a simple API to access the eScriptorium online platform.

Install with `pip install escriptorium-connector` and import it as follows.

In [1]:
from escriptorium_connector import EscriptoriumConnector
import os
from dotenv import load_dotenv
import io

## Instantiating the connector

It is good practice to keep your authentication credentials out of your code. Here we use `python-dotenv` to load the user's login credentials and the address of the eScriptorium instance from a `.env` file.

The URL of the eScriptorium instance is passed to the connector along with the URL of its API and the user's login credentials. The connector takes care of obtaining the API key for the user and any cookies needed for asynchronous requests.

In [2]:
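# Read the instance address and login credentials from the .env file
load_dotenv()
url = str(os.getenv('ESCRIPTORIUM_URL'))
# The API address is built as <url>api/, so ESCRIPTORIUM_URL is expected to end with a slash
api = f'{url}api/'
username = str(os.getenv('ESCRIPTORIUM_USERNAME'))
password = str(os.getenv('ESCRIPTORIUM_PASSWORD'))
escr = EscriptoriumConnector(url, api, username, password)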
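
Note that `str(os.getenv(...))` turns a missing variable into the literal string `'None'`, and that the API address is only correct when the base URL ends with a slash. The following is a minimal sketch of two helpers that guard against both cases. The names `require_env` and `build_api_url` are hypothetical and not part of the connector.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def build_api_url(base_url: str) -> str:
    """Build the API address from the base URL, with or without a trailing slash."""
    return base_url.rstrip("/") + "/api/"


# Both spellings of the base URL give the same API address.
print(build_api_url("https://example.org"))   # https://example.org/api/
print(build_api_url("https://example.org/"))  # https://example.org/api/
```

With these helpers, `url = require_env('ESCRIPTORIUM_URL')` fails immediately with a clear message if the `.env` file is incomplete, instead of passing `'None'` on to the connector.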

## Usage

Once instantiated, the connector provides well-documented and convenient methods to interact with your data on eScriptorium.

For example, you can retrieve a list of all your documents.

In [3]:
my_documents = escr.get_documents()
document_pk = my_documents[0]["pk"]


In [4]:
print(len(my_documents))
print(my_documents[0])

45
{'pk': 285, 'name': 'Berlin1594_new', 'project': 'sofer_mahir_project', 'transcriptions': [{'pk': 1689, 'name': 'Pawel'}, {'pk': 1174, 'name': 'Horovitz_Finkelstein_FA'}, {'pk': 1162, 'name': 'Vat32unordered'}, {'pk': 1099, 'name': 'MarginsFromDoc16'}, {'pk': 703, 'name': 'master'}, {'pk': 702, 'name': 'Acad_Full'}, {'pk': 701, 'name': 'kraken:B1594bothmain_7'}, {'pk': 700, 'name': 'kraken:B1594main2_9'}, {'pk': 699, 'name': 'additions_master'}, {'pk': 698, 'name': 'Acad'}, {'pk': 690, 'name': 'kraken:B1594main_29'}, {'pk': 688, 'name': 'kraken:B1594cor02GPUbox_best'}, {'pk': 687, 'name': 'kraken:genHeb'}, {'pk': 686, 'name': 'Zip Import'}, {'pk': 684, 'name': 'manual'}], 'main_script': 'Hebrew', 'read_direction': 'rtl', 'line_offset': 1, 'valid_block_types': [{'pk': 3, 'name': 'Commentary'}, {'pk': 4, 'name': 'Illustration'}, {'pk': 2, 'name': 'Main'}, {'pk': 384, 'name': 'Margin'}, {'pk': 385, 'name': 'Paratext'}, {'pk': 1, 'name': 'Title'}], 'valid_line_types': [{'pk': 3, 'name':
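
Since each document is a plain dictionary, picking one by name rather than by list position needs only a few lines. This is a small sketch, not a connector method: `find_document` is a hypothetical helper, and the sample list is made-up data shaped like the output above, with most keys omitted.

```python
def find_document(documents, name):
    """Return the first document dict whose 'name' matches, or None if there is none."""
    return next((doc for doc in documents if doc.get("name") == name), None)


# Hypothetical sample shaped like the connector output (the second entry is invented).
sample_documents = [
    {"pk": 285, "name": "Berlin1594_new", "main_script": "Hebrew"},
    {"pk": 999, "name": "another_document", "main_script": "Latin"},
]

doc = find_document(sample_documents, "Berlin1594_new")
print(doc["pk"])  # 285
```

In the notebook, `find_document(my_documents, 'Berlin1594_new')` would be used the same way, which avoids relying on the order in which documents are returned.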

## Convenience

The connector provides several conveniences. Perhaps the most broadly useful is that it follows the paging information automatically, so a single call returns the full dataset. Here the request for document parts collects the complete list of parts, with no need to check whether further pages of results are available.

In [5]:
document_parts = escr.get_document_parts(document_pk)

In [7]:
print(len(document_parts))
selected_part = document_parts[1]
print(selected_part)

344
{'pk': 51794, 'name': '', 'filename': '00000006.jpg', 'title': 'Element 2', 'typology': None, 'image': {'uri': '/media/documents/285/00000006.jpg', 'size': [3626, 4571], 'thumbnails': {'card': '/media/documents/285/00000006.jpg.180x180_q85_crop-smart.jpg', 'large': '/media/documents/285/00000006.jpg.1000x1000_q85.jpg'}}, 'bw_image': None, 'workflow': {'convert': 'done', 'segment': 'done', 'transcribe': 'done'}, 'order': 1, 'recoverable': False, 'transcription_progress': 100, 'source': ''}
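To illustrate what following the paging information involves, here is a minimal sketch of the general technique. It is not the connector's actual implementation. It assumes each response is a dictionary with a `results` list and a `next` address that is `None` on the last page; the function `collect_all_pages` and the fake pages are hypothetical stand-ins for real HTTP requests.

```python
from typing import Any, Callable, Dict, List, Optional


def collect_all_pages(fetch_page: Callable[[str], Dict[str, Any]], first_url: str) -> List[Any]:
    """Follow 'next' links from page to page and gather every item in 'results'."""
    items: List[Any] = []
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        items.extend(page.get("results", []))
        url = page.get("next")  # None (or missing) on the last page ends the loop
    return items


# In-memory stand-in for a paginated API, so the sketch runs without a network.
FAKE_PAGES = {
    "page1": {"results": [1, 2], "next": "page2"},
    "page2": {"results": [3], "next": None},
}

all_items = collect_all_pages(FAKE_PAGES.__getitem__, "page1")
print(all_items)  # [1, 2, 3]
```

Passing the fetch function as an argument keeps the loop independent of any HTTP library, which also makes it easy to test.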


## Downloading Files

Images can easily be downloaded and uploaded with the connector. Exported XML transcription files can be downloaded as well.

In [None]:
image = escr.get_image(selected_part["image"]["uri"])

In [None]:
from IPython.display import Image
display(Image(image))
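
To keep a downloaded image, it can be written to disk under the file name taken from its URI. The sketch below assumes that `get_image` returns the raw image bytes, which the notebook does not state explicitly. `save_image` is a hypothetical helper, and the byte string stands in for real image data.

```python
import tempfile
from pathlib import Path, PurePosixPath


def save_image(image_bytes: bytes, uri: str, out_dir) -> Path:
    """Write image bytes to out_dir, named after the last segment of the URI."""
    out_path = Path(out_dir) / PurePosixPath(uri).name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(image_bytes)
    return out_path


# Placeholder bytes instead of a real download, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_image(b"fake image data", "/media/documents/285/00000006.jpg", tmp_dir)
    print(saved.name)  # 00000006.jpg
```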

In [10]:
import zipfile

transcriptions = escr.get_document_transcriptions(document_pk)
part_pks = [x["pk"] for x in document_parts[0:4]]
page_xmls_zipped = escr.download_part_pagexml_transcription(document_pk, part_pks, transcriptions[0]["pk"])
# Named `archive` rather than `zip` to avoid shadowing the built-in
archive = zipfile.ZipFile(io.BytesIO(page_xmls_zipped))
page_xmls = [archive.read(x) for x in archive.infolist()]
print(len(page_xmls))

4


In [13]:
archive = zipfile.ZipFile(io.BytesIO(page_xmls_zipped))
print(archive.infolist())

[<ZipInfo filename='00000005.xml' filemode='?rw-------' file_size=34266>, <ZipInfo filename='00000006.xml' filemode='?rw-------' file_size=30620>, <ZipInfo filename='00000007.xml' filemode='?rw-------' file_size=27657>, <ZipInfo filename='00000008.xml' filemode='?rw-------' file_size=27622>]
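
Because each archive member carries its file name, it can be handy to keep the XML contents keyed by name instead of in a bare list. The following is a small sketch using only the standard library. `read_zip_members` is a hypothetical helper, and the archive is built in memory with placeholder content so that the example runs without a connection.

```python
import io
import zipfile


def read_zip_members(zip_bytes: bytes) -> dict:
    """Map each file name in a zipped byte string to that file's contents."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {info.filename: archive.read(info) for info in archive.infolist()}


# Build a tiny archive in memory as a stand-in for the downloaded export.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("00000005.xml", "<PcGts/>")
    archive.writestr("00000006.xml", "<PcGts/>")

members = read_zip_members(buffer.getvalue())
print(sorted(members))  # ['00000005.xml', '00000006.xml']
```

In the notebook, `read_zip_members(page_xmls_zipped)` would give direct access to a page by file name, for example `members['00000005.xml']`.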


In [9]:
import xml.dom.minidom as md

# Let's see what it looks like:
for page_xml in page_xmls:
    dom = md.parse(io.BytesIO(page_xml))
    pretty_xml = dom.toprettyxml()
    # Drop the blank lines that toprettyxml() produces when the source is already indented:
    pretty_xml = os.linesep.join([s for s in pretty_xml.splitlines() if s.strip()])
    print(pretty_xml)

<?xml version="1.0" ?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
	<Metadata>
		<Creator>escriptorium</Creator>
		<Created>2021-09-30T11:46:16.293992+00:00</Created>
		<LastChange>2021-09-30T11:46:16.294074+00:00</LastChange>
	</Metadata>
	<Page imageFilename="00000005.jpg" imageWidth="3626" imageHeight="4571">
		<TextRegion id="eSc_textblock_aa212ea9" custom="structure {type:Main;}">
			<Coords points="893,3435 3147,3435 3147,601 893,601"/>
			<TextLine id="eSc_line_caab9f49">
				<Coords points="876,681 884,617 911,594 949,598 1025,655 1048,639 1098,651 1181,609 1223,632 1258,601 1277,617 1311,598 1349,620 1376,594 1422,632 1490,594 1616,617 1650,639 1711,620 1727,636 1822,628 1864,643 1929,643 1997,601 2032,601 2070,636 2158,617 22