## Load Packages

We first load the required packages/modules for this simple exploration of the API.
You can find documentation about how importing works at [this link](https://docs.python.org/3/tutorial/modules.html).

See below for the documentation of the following packages/modules:

**I. Data Processing**
- [Requests](https://docs.python-requests.org/en/latest/index.html): This module helps you make HTTP requests to interact with web APIs or servers.
- [json](https://docs.python.org/3/library/json.html): This is a built-in Python module for working with JSON data.
- [xml.etree.ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html): This sub-module provides tools for parsing and working with XML data.

**II. Other Utilities**

- [getpass](https://docs.python.org/3/library/getpass.html): This module retrieves a password from the user without echoing it to the console. This is useful for security purposes when prompting users for passwords in your script.
- [urllib.parse](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse): This module provides functions for parsing URLs and manipulating URL components.
    - We will only use one function from this sub-module, `urlparse`.
- [re](https://docs.python.org/3/library/re.html): This module provides regular expression functionalities for searching and manipulating text data.

In [1]:
import requests
import json
import xml.etree.ElementTree as ET

import getpass

from urllib.parse import urlparse
import re

## Needed Constants
The following are constants which will be queried once for each session.

As I will be using these global variables throughout my functions, they will be in uppercase to make them easier to notice.

Please note the use of `getpass` here. If running this yourself, enter your own information when prompted.

In [2]:
USER_NAME = getpass.getpass('Enter your username')
PASSWORD = getpass.getpass('Enter your password')

API_URL = "https://transkribus.eu/TrpServer/rest/"
SESSION_CACHE = {}       # Store as a dictionary
COLLECTIONS_CACHE = []   # Store as a json

Enter your username ········
Enter your password ········


Notice I'm using `API_URL` to represent the server we're interfacing with.

You can find all the paths you can interface with at this [link](https://transkribus.eu/TrpServer/rest/application.wadl?detail=true).

A more visually pleasing version of these paths can be found [here](https://transkribus.eu/TrpServer/Swadl/wadl.html).

You may also learn more about the available API at their [website](https://readcoop.eu/transkribus/docu/rest-api/).

## Useful Links

The following may prove useful when considering how to connect to Transkribus:

- [Solo Dev API Test - Sydney Stock Exchange](https://github.com/wragge/sydney-stock-exchange/blob/b1365b230c67bf14ffebac81d7c22b610f4e2248/transkribus-api-tests.ipynb)
    - This repo brought to my attention the use of `requests.Session`, which you can read more about at [this link](https://docs.python-requests.org/en/latest/user/advanced/).
- [WADL Files and Packages](https://pypi.org/project/wadllib/)
    - Apparently there are Python packages which can help users interact with the [Web Application Description Language](https://www.w3.org/submissions/wadl/), which is used to describe this particular API. I haven't tried them, but am noting one here.

## Helper Functions


The following are useful details when examining the below functions.

At a later date I may place these in their own files and import them.

_Bear in mind comments will also be added to the functions, and elsewhere, to help understand them._

---


How truthiness works in Python:
- [Link to docs](https://docs.python.org/3/library/stdtypes.html#truth-value-testing)
- I will use `bool` _where possible_ to explicitly note that I'm comparing a boolean, but note that **using `bool` explicitly is not required**. I simply believe it makes the code easier to read.
- Essentially, Python will treat the following as False when evaluated for truthiness:
    - Constants defined to be false: `None` and `False`.
    - Zero of any numeric type: `0`, `0.0`, `0j`, `Decimal(0)`, `Fraction(0, 1)`.
    - Empty sequences and collections: `''`, `()`, `[]`, `{}`, `set()`, `range(0)`.
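
To make the list above concrete, here is a small sketch that checks each of those values. The helper name `describe` is only for illustration and is not used elsewhere in this notebook.

```python
from decimal import Decimal
from fractions import Fraction

# Values that Python treats as false in a boolean context
falsy_values = [None, False, 0, 0.0, 0j, Decimal(0), Fraction(0, 1),
                '', (), [], {}, set(), range(0)]

# Non-empty or non-zero values are treated as true, even if they "look" empty
truthy_values = ['0', ' ', [0], {'a': None}, (None,), -1, 0.1]

all_falsy = all(not bool(v) for v in falsy_values)
all_truthy = all(bool(v) for v in truthy_values)


def describe(value):
    """Hypothetical helper: report which branch an `if` would take."""
    if value:  # takes the same branch as `if bool(value):`
        return 'truthy'
    return 'falsy'


print(all_falsy, all_truthy, describe(''), describe('0'))
```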

Shorthand for `if else` statements
- [Link to docs](https://docs.python.org/3/reference/expressions.html#conditional-expressions)
- We may write a general `if else` statement in Python with the following shorthand:
    - `x if C else y` where `x` is evaluated if `C` is `True`, else `y` will be evaluated when `C` is `False`.
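
As a quick illustration of the shorthand, the sketch below writes the same choice both ways. The function names are made up for this example; the fallback to `'https'` mirrors what `correct_url` does further down.

```python
def pick_scheme(scheme):
    # Shorthand: keep the given scheme, or fall back to 'https' when it is empty
    return scheme if scheme else 'https'


def pick_scheme_long(scheme):
    # The equivalent full `if`/`else` statement
    if scheme:
        return scheme
    else:
        return 'https'


# Only the chosen branch is evaluated, so the division by zero never runs
safe_value = 'ok' if True else 1 / 0

print(pick_scheme(''), pick_scheme('http'), safe_value)
```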

In [3]:
def correct_url(url, add_w = False):
    """
    Corrects a URL by ensuring it has a scheme, proper path, and network location.

    Args:
        url (str): The URL to be corrected.
        add_w (bool, optional): If True, add 'www.' to the front of the link. Defaults to False

    Returns:
        urllib.parse.ParseResult: The parsed URL object.
    """

    ### Parse the url
    up = urlparse(url)

    ### Reformat the url
    up = up._replace(scheme = 'https' if not bool(up.scheme) else up.scheme)
    netloc, _, path = (up.netloc or up.path).partition('/')
    up = up._replace(path = re.sub( '/+', '/', up.path if bool(up.netloc) else path) )
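    # Add `www` to the website if needed
    netloc = ('www.' + netloc) if bool(add_w) else netloc
    # Collapse a repeated leading 'www.' (e.g. 'www.www.') into one.
    # The dots are escaped so hosts that merely start with 'w' are left untouched.
    up = up._replace(netloc = re.sub(r'^(?:www\.)+', 'www.', netloc))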

    return up


#####################################
#####################################
#####################################


def create_session(retry_num = 5,
                   backoff_factor = 0.5,
                   status_forcelist = [502, 503, 504],
                   ####
                   headers = None):
    """
    Creates an HTTP session with retry settings and headers.    

    Args:
        retry_num (int, optional): The maximum number of retries. Defaults to 5
        backoff_factor (float, optional): The backoff factor. Defaults to 0.5
        status_forcelist (list, optional): HTTP status codes on which to retry. Defaults to [502, 503, 504]
        headers (dict, optional): The headers to use continuously through this session. Defaults to None

    Returns:
        requests.sessions.Session: The created session.
    """
    session = requests.Session()
    
    # Configure retry behavior
    retries = requests.packages.urllib3.util.retry.Retry(
        total = retry_num,  # Maximum number of retries
        backoff_factor = backoff_factor,  # Exponential backoff factor (0.5 means 1s, 2s, 4s, 8s, ...)
        status_forcelist = status_forcelist,  # Retry on these HTTP status codes
    )
    
    # Mount the retry settings to both HTTP and HTTPS adapters
    session.mount('http://', requests.adapters.HTTPAdapter(max_retries = retries))
    session.mount('https://', requests.adapters.HTTPAdapter(max_retries = retries))
    
    ###
    
    # Add/Update headers
    if (headers is not None) and isinstance(headers, dict):
        session.headers.update(headers)
    
    return session


#####################################
#####################################
#####################################


def act_on_url(url,
               ###
               method = 'GET',
               req_data = None,
               timeout = 20,
               backoff_factor = 0.5,
               retry_num = 5,
               ###
               add_w = False,
               ###
               session = None,
               **kwargs):
    """
    Fetches the body from a URL after carrying out some 'act', be that `GET`, `POST`, or otherwise.

    Args:
        url (str): The URL to fetch data from.
        method (str, optional): The HTTP method to use. Defaults to 'GET'.
        req_data (dict, optional): Request data. Defaults to None.
        timeout (int, optional): The number of seconds to wait for the server. Defaults to 20.        

        ### Inherited from `create_session` ###
        backoff_factor (float, optional): The backoff factor. Defaults to 0.5.
        retry_num (int, optional): The number of retries. Defaults to 5.
        
        ### Inherited from `correct_url`, excluding 'url' ###
        add_w (bool, optional): If True, add 'www.' to the front of the link.
        
        Notice the use of `**kwargs`, which lets us hand off all other named inputs to `requests.Request`.
        If you're interested, this tutorial covers the idea nicely: https://calmcode.io/course/args-kwargs/introduction

    Returns:
        requests.Response or None: The response object, or None if no response was received
        (for example, on a timeout or connection error).
    """

    # Notice the use of `.geturl()`
    parsed_url = correct_url(url, add_w = add_w).geturl()

    # Choose a session or create one
    if (session is not None) and isinstance(session, requests.sessions.Session):
        sess = session
    else:
        sess = create_session(retry_num = retry_num, backoff_factor = backoff_factor) #requests.Session()
        
    # Start with no response so the `except` branches always have something to return
    resp = None
    with sess as s:
        try:
            request = requests.Request(method = method, url = parsed_url, data = req_data, **kwargs)
            prepped = s.prepare_request(request)
            resp = s.send(prepped, timeout = timeout)
            resp.raise_for_status()  # Raise an exception if the request failed, otherwise returns `None`
            return resp
        except requests.exceptions.HTTPError as http_err:
            print(f'HTTP error occurred for the url {parsed_url}:\n\t {http_err}')
            return resp
        except requests.exceptions.Timeout:
            print('The request timed out')
            return resp
        except Exception as err:
            print(f'Other error occurred for the url {parsed_url}:\n\t {err}')
            return resp


#####################################
#####################################
#####################################


def build_headers(sessionid = None):
    """
    Build headers to use for authenticated session

    Args:
        sessionid (str): The Session ID as a string.

    Returns:
        dict: The headers for the session.
    """
    if not bool(sessionid):
        return None

    # Build the cookie from the `sessionid` argument rather than reading the
    # global `SESSION_CACHE`, so the function depends only on its input
    header = {'Cookie': f'JSESSIONID={sessionid}'}
    #header.update( {'Content-Type': 'application/x-www-form-urlencoded'} )
    
    return header


#####################################
#####################################
#####################################


def authenticate(url = f'{API_URL}auth/login',
                 ###
                 method = 'POST',
                 req_data = {'user': USER_NAME, 'pw': PASSWORD},
                 print_results = False,
                 session = None,
                 **kwargs):
    """
    Logs in to the Transkribus server.

    Args:
        ### Inherited ALL from `act_on_url` EXCEPT 'print_results', but changes the defaults for the following ###
        url (str, optional): The URL to fetch data from. Defaults to the login URL, 'https://transkribus.eu/TrpServer/rest/auth/login'.
        method (str, optional): The HTTP method to use. Defaults to 'POST'.
        req_data (dict, optional): Request data. Defaults to {'user': USER_NAME, 'pw': PASSWORD}.
        print_results (bool, optional): Whether to print the login results. Defaults to False.

    Returns:
        dict or requests.Response:  The session as a dict on success, the failed response otherwise.
    """

    # Get the response from the url
    response = act_on_url(url, method = method, req_data = req_data, session = session, **kwargs)

    ### Parse the response body, assuming it's an XML document
    # You can confirm how the body is encoded (e.g. text, json, xml, etc.) by checking
    # `response.headers['Content-Type']`, where `response` is the response object
    root = ET.fromstring(response.text)
    
    session_cache = {}
    for child in root:
        session_cache.update({child.tag : child.text})
        if bool(print_results):
            if child.tag != 'ip':
                print(f"{child.tag}: {child.text}")

    # `.get` avoids a KeyError when the response has no `sessionId` element
    if bool(session_cache.get('sessionId')):
        return session_cache
    else:
        print('failed to get session details from authentication')
        return response
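
The reason `correct_url` looks at `up.netloc or up.path` is that `urlparse` puts the host into `path` when the URL has no scheme. A minimal sketch of that behaviour, using the server address from this notebook:

```python
from urllib.parse import urlparse

with_scheme = urlparse('https://transkribus.eu/TrpServer/rest/')
without_scheme = urlparse('transkribus.eu/TrpServer/rest/')

# With a scheme, the host lands in `netloc` as expected
print(repr(with_scheme.netloc), repr(with_scheme.path))

# Without a scheme, `netloc` is empty and the host is part of `path`
print(repr(without_scheme.netloc), repr(without_scheme.path))

# Recover the host and the remaining path the same way `correct_url` does
host, _, rest = (without_scheme.netloc or without_scheme.path).partition('/')
fixed = without_scheme._replace(scheme='https', netloc=host, path=rest)
print(fixed.geturl())
```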

#### Authenticate and Get Session Key

Our first step is to get our session key through authentication.

If you wish to see your details printed, switch `False` to `True`.

In [4]:
SESSION_CACHE = authenticate(print_results = False)

In [5]:
SESSION_CACHE['sessionId']

'D17BC690D59B7E9C9899DB999FDB9D80'
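
For reference, this is roughly how `authenticate` turns the XML login response into a dictionary. The payload below is made up and only imitates the shape of a response: the root tag, the field names other than `sessionId`, and all values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up payload; not captured from the server
sample_login_xml = """<trpUserLogin>
    <userId>0</userId>
    <userName>user@example.org</userName>
    <sessionId>0123456789ABCDEF</sessionId>
</trpUserLogin>"""

root = ET.fromstring(sample_login_xml)

# One dictionary entry per direct child of the root element
session = {child.tag: child.text for child in root}

# `.get` returns None instead of raising when a field is absent
session_id = session.get('sessionId')
missing = session.get('ip')

print(session_id, missing)
```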

#### Build Headers & Session to Use Throughout

We next build the headers which will be used throughout the entire task.

In [6]:
headers = build_headers(sessionid = SESSION_CACHE['sessionId'])

In [7]:
headers

{'Cookie': 'JSESSIONID=D17BC690D59B7E9C9899DB999FDB9D80'}

Our session is then constructed which includes our authentication key.

Instead of referencing the authentication key directly, we'll instead reference the session.

In [8]:
s = create_session(headers = headers)
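
To get a feel for what `backoff_factor` means for this session, here is a rough sketch of the waiting times between retries. It assumes the formula given in urllib3's documentation, `backoff_factor * 2 ** (n - 1)` with no wait before the first retry, and it ignores the maximum cap and any jitter, so treat the numbers as an approximation that may differ between versions. The function name is made up for this example.

```python
def backoff_delays(backoff_factor, retry_num):
    """Approximate wait (in seconds) before each retry.

    Assumption: wait = backoff_factor * 2 ** (n - 1), except that the
    first retry happens immediately.
    """
    delays = []
    for attempt in range(1, retry_num + 1):
        if attempt == 1:
            delays.append(0.0)
        else:
            delays.append(backoff_factor * 2 ** (attempt - 1))
    return delays


# The defaults used by `create_session`
print(backoff_delays(0.5, 5))
```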

#### Check Collection List

Note the following:
- We call `.json()` on the response returned by our function to parse its body directly as JSON.

In [9]:
collection_list_json = act_on_url(url = f'{API_URL}collections/list', session = s).json()

In [10]:
collection_list_json

[{'type': 'trpCollection',
  'colId': 284394,
  'colName': 'dario.dat.c+rcoop@gmail.com Collection',
  'description': 'dario.dat.c+rcoop@gmail.com',
  'created': '2024-03-01T16:10:50.344+01:00',
  'crowdsourcing': False,
  'elearning': False,
  'pageId': 66002544,
  'url': 'https://files.transkribus.eu/Get?fileType=view&id=QPDSFBPIZVUZVXYCWHTJLNHD',
  'thumbUrl': 'https://files.transkribus.eu/Get?fileType=thumb&id=QPDSFBPIZVUZVXYCWHTJLNHD',
  'nrOfDocuments': 3,
  'role': 'Owner',
  'accountingStatus': 1}]

If we have several collections, looping over the list is one way to get a quick overview of all of them.

This method of using loops and relevant keys will be repeated throughout this notebook.

In [11]:
# The loop variable is named `col` rather than `dict` to avoid shadowing the built-in type
for col in collection_list_json:
    print('Id num.', ': ', col['colId'],
          '\t|\t',
          'Description', ': ', col['description'], # Depending on the length, it may be good to remove
          '\t|\t',
          'num. of Doc.', ': ', col['nrOfDocuments'],
          sep = '')

Id num.: 284394	|	Description: dario.dat.c+rcoop@gmail.com	|	num. of Doc.: 3
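
Since this loop pattern comes up repeatedly, it could be wrapped in a small helper. The sketch below is one possible shape, not something the rest of the notebook relies on; the name `summarize` and the trimmed sample record are for illustration only.

```python
def summarize(records, fields):
    """Return one line per record, showing only the requested fields.

    `fields` is a list of (label, key) pairs; missing keys show as None.
    """
    lines = []
    for record in records:
        parts = [f'{label}: {record.get(key)}' for label, key in fields]
        lines.append('\t|\t'.join(parts))
    return lines


# A trimmed copy of the collection shown above
sample_collections = [
    {'colId': 284394,
     'description': 'dario.dat.c+rcoop@gmail.com',
     'nrOfDocuments': 3},
]

collection_fields = [('Id num.', 'colId'),
                     ('Description', 'description'),
                     ('num. of Doc.', 'nrOfDocuments')]

for line in summarize(sample_collections, collection_fields):
    print(line)
```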


#### Check The Documents in a Specific Collection

Here we select the collection id we're interested in.

As the parsed JSON is a list of dictionaries, we switch between using an index (i.e. numbers) and keys (i.e. text) to query the part we're interested in.

This method of querying is also repeated throughout this notebook.

In [12]:
col_id = collection_list_json[0]['colId']

Notice below how we're inserting the collection id into the url.

This insertion can be done programmatically if needed.

In [13]:
document_list_json = act_on_url(url = f'{API_URL}collections/{col_id}/list', session = s).json()
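
If several collections need to be queried, the URLs can be generated in a loop. A minimal sketch, where the helper name `collection_docs_url` is made up and the sample list is a trimmed copy of the collection list above:

```python
API_URL = 'https://transkribus.eu/TrpServer/rest/'


def collection_docs_url(col_id, api_url=API_URL):
    """Build the URL that lists the documents of one collection."""
    return f'{api_url}collections/{col_id}/list'


# Trimmed copy of the collection list; only the key we need is kept
sample_collections = [{'colId': 284394}]

urls = [collection_docs_url(col['colId']) for col in sample_collections]
print(urls)
```

Each of these URLs could then be passed to `act_on_url` together with the session.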

In [14]:
document_list_json

[{'type': 'trpDocMetadata',
  'docId': 1858998,
  'title': 'test_1',
  'uploadTimestamp': 1709505983687,
  'uploader': 'dario.dat.c+rcoop@gmail.com',
  'uploaderId': 223153,
  'nrOfPages': 1,
  'pageId': 66002544,
  'url': 'https://files.transkribus.eu/Get?fileType=view&id=QPDSFBPIZVUZVXYCWHTJLNHD',
  'thumbUrl': 'https://files.transkribus.eu/Get?fileType=thumb&id=QPDSFBPIZVUZVXYCWHTJLNHD',
  'status': 0,
  'fimgStoreColl': 'TrpDoc_DEA_1858998',
  'origDocId': 0,
  'collectionList': {'colList': [{'colId': 284394,
     'colName': 'dario.dat.c+rcoop@gmail.com Collection',
     'description': 'dario.dat.c+rcoop@gmail.com',
     'crowdsourcing': False,
     'elearning': False,
     'nrOfDocuments': 0}]},
  'attributes': [],
  'mainColId': 284394,
  'isInMain': True},
 {'type': 'trpDocMetadata',
  'docId': 1865707,
  'title': 'test_title',
  'author': 'test_author',
  'uploadTimestamp': 1709774302892,
  'genre': 'test_genre',
  'writer': 'test_writer',
  'uploader': 'dario.dat.c+rcoop@gmail

Similar to the above collection list, we can also pull specific details for each document in the list.

In [15]:
for doc in document_list_json:
    print('Id num.', ': ', doc['docId'],
          '\t|\t',
          'Title', ': ', doc['title'],
          '\t|\t',
          'num. of Pages', ': ', doc['nrOfPages'],
          sep = '')

Id num.: 1858998	|	Title: test_1	|	num. of Pages: 1
Id num.: 1865707	|	Title: test_title	|	num. of Pages: 1
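
Note that not every document carries the same keys: in the listing above, the second document has an `author` while the first does not. Indexing with `doc['author']` would raise a `KeyError` for the first one, so `.get` with a default is a safer choice for optional fields. A small sketch using trimmed copies of the two documents:

```python
# Trimmed copies of the two documents listed above
sample_documents = [
    {'docId': 1858998, 'title': 'test_1', 'nrOfPages': 1},
    {'docId': 1865707, 'title': 'test_title', 'author': 'test_author', 'nrOfPages': 1},
]

# `.get` returns the default instead of raising when the key is absent
authors = [doc.get('author', 'unknown') for doc in sample_documents]

for doc, author in zip(sample_documents, authors):
    print(doc['docId'], author)
```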


#### Check a Specific Document

Here we select the document id we're interested in.

In [16]:
doc_id = document_list_json[0]['docId']

Next we get the full details for this particular document.

In [17]:
full_document_json = act_on_url(url = f'''{API_URL}collections/{col_id}/{doc_id}/fulldoc''', session = s).json()

Note the semicolon after our statement: it stops the notebook from displaying the value. If you wish to see the contents, simply remove the semicolon.

In [18]:
full_document_json;

As before, we iterate through the json.

In [19]:
for val in ('nrOfRegions', 'nrOfTranscribedLines', 'nrOfTranscribedWords'):
    print(val, ': ', full_document_json['md'][val], sep = '')

nrOfRegions: 3
nrOfTranscribedLines: 37
nrOfTranscribedWords: 296


#### Check Jobs List

In [20]:
job_list_json = act_on_url(url = f'{API_URL}jobs/list', session = s).json()

In [21]:
job_list_json;

In [22]:
for job in job_list_json:
    print('Job Id num.', ': ', job['jobId'],
          '\t|\t',
          'Doc. Id num.', ': ', job['docId'],
          '\t|\t',
          'Type', ': ', job['type'],
          '\t|\t',
          'State', ': ', job['state'],
          '\t|\t',
          'Succeeded', ': ', job['success'],
          sep = '')

Job Id num.: 8354205	|	Doc. Id num.: 1878170	|	Type: Create Document	|	State: FINISHED	|	Succeeded: True
Job Id num.: 8275348	|	Doc. Id num.: 1865707	|	Type: Create Document	|	State: FINISHED	|	Succeeded: True
Job Id num.: 8224005	|	Doc. Id num.: 1858998	|	Type: PyLaia Decoding	|	State: FINISHED	|	Succeeded: True
Job Id num.: 8223063	|	Doc. Id num.: 1858998	|	Type: Create Document	|	State: FINISHED	|	Succeeded: True
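
Once the jobs are a list of dictionaries, ordinary Python tools are enough to filter or count them. The sketch below works on a trimmed copy of the four jobs printed above; I'm assuming `jobId` is a string in the list, as it is in the single-job response further down.

```python
from collections import Counter

# Trimmed copy of the job list printed above
sample_jobs = [
    {'jobId': '8354205', 'docId': 1878170, 'type': 'Create Document', 'state': 'FINISHED', 'success': True},
    {'jobId': '8275348', 'docId': 1865707, 'type': 'Create Document', 'state': 'FINISHED', 'success': True},
    {'jobId': '8224005', 'docId': 1858998, 'type': 'PyLaia Decoding', 'state': 'FINISHED', 'success': True},
    {'jobId': '8223063', 'docId': 1858998, 'type': 'Create Document', 'state': 'FINISHED', 'success': True},
]

# How many jobs of each type were run
jobs_per_type = Counter(job['type'] for job in sample_jobs)

# All jobs belonging to one document
jobs_for_doc = [job['jobId'] for job in sample_jobs if job['docId'] == 1858998]

# Jobs that are not finished yet (none in this sample)
unfinished = [job for job in sample_jobs if job['state'] != 'FINISHED']

print(jobs_per_type, jobs_for_doc, unfinished)
```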


#### Check a Specific Job

In [23]:
job_id = job_list_json[0]['jobId']

In [24]:
full_job_json = act_on_url(url = f'{API_URL}jobs/{job_id}', session = s).json()

In [25]:
full_job_json

{'jobId': '8354205',
 'docId': 1878170,
 'pageNr': -1,
 'type': 'Create Document',
 'state': 'FINISHED',
 'success': True,
 'description': 'Done, duration: 1s 77ms',
 'userName': 'dario.dat.c+rcoop@gmail.com',
 'userId': 223153,
 'createTime': 1710397012368,
 'startTime': 1710397012567,
 'endTime': 1710397013644,
 'jobData': '#Thu Mar 14 07:16:52 CET 2024\ncolId=284394\n',
 'resumable': False,
 'jobImpl': 'UploadImportJob',
 'moduleUrl': 'http://srv6155:8080/UtilityModule-trpProd-2.12.0',
 'moduleName': 'UtilityModule',
 'moduleVersion': '2.12.0',
 'started': '2024-03-14T07:16:52.567+01:00',
 'ended': '2024-03-14T07:16:53.644+01:00',
 'created': '2024-03-14T07:16:52.368+01:00',
 'batchId': 0,
 'pageid': 0,
 'tsid': 0,
 'parent_jobid': 0,
 'parent_batchid': 0,
 'colId': 284394,
 'progress': 1,
 'totalWork': 1,
 'nrOfErrors': 0,
 'docTitle': 'Deleted Document',
 'priority': 0}

In [26]:
for val in ('type', 'docTitle', 'description', 'state'):
    print(val, ': ', full_job_json[val], sep = '')

type: Create Document
docTitle: Deleted Document
description: Done, duration: 1s 77ms
state: FINISHED
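
The `createTime`, `startTime`, and `endTime` fields are milliseconds since the Unix epoch. A minimal sketch of converting them, using the values from the job above; the helper name `from_epoch_ms` is made up, and the one-hour offset is chosen to match the `+01:00` shown in the `created` field.

```python
from datetime import datetime, timedelta, timezone


def from_epoch_ms(ms, utc_offset_hours=0):
    """Convert milliseconds since the Unix epoch to an aware datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    target_tz = timezone(timedelta(hours=utc_offset_hours))
    # Integer milliseconds avoid floating point rounding
    return (epoch + timedelta(milliseconds=ms)).astimezone(target_tz)


# Values copied from the job details above
job_times = {'createTime': 1710397012368,
             'startTime': 1710397012567,
             'endTime': 1710397013644}

created = from_epoch_ms(job_times['createTime'], utc_offset_hours=1)
duration_ms = job_times['endTime'] - job_times['startTime']

print(created.isoformat(timespec='milliseconds'))
print(duration_ms, 'ms')
```

The results line up with the `created` field and with the `1s 77ms` duration in the job's description.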


#### Push an Image to the Server

**Work In Progress**

This section is more involved and requires pushing a JSON object with a specific format to Transkribus via `POST`, and then the image file via `PUT`.

At minimum, this JSON must include a `fileName` and a `pageNr`; everything else is optional.

Further details, including the full JSON template, can be found [here](https://readcoop.eu/transkribus/docu/rest-api/upload/), but for now we will use just the minimum example.

In [27]:
default_img_up = {
    "md": {
        "title": "test_title_2",
        "author": "test_author",
        "genre": "test_genre",
        "writer": "test_writer"
    },
    "pageList": {
        "pages": [
            {
                "fileName": "view_default.jpg",
                "pageNr": 1
            }
            # Add more details if needed
        ]
    }
}

In [28]:
default_img_up['pageList']['pages'][0]['fileName']

'view_default.jpg'
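
If more than one image is involved, the same structure could be generated from a list of file names. This is only a sketch: the helper name `build_upload_payload` is made up, the second file name is invented, and numbering the pages from 1 in list order is my assumption.

```python
import json


def build_upload_payload(title, file_names):
    """Build the upload description with one page entry per file name."""
    pages = [{'fileName': name, 'pageNr': nr}
             for nr, name in enumerate(file_names, start=1)]
    return {'md': {'title': title},
            'pageList': {'pages': pages}}


payload = build_upload_payload('test_title_2',
                               ['view_default.jpg', 'view_second.jpg'])

# The payload is plain dictionaries and lists, so it serialises directly
print(json.dumps(payload, indent=2))
```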

In [31]:
def load_img(api_url = API_URL,
             col_id = None,
             file_path = None,
             def_json = None,
             session = None,
             headers_post = {'Content-Type': 'application/json'}):
    """
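    Uploads a single image to a collection: the upload description is sent via `POST`, then the file via `PUT`.

    Args:
        api_url (str, optional): The base URL of the API. Defaults to API_URL.
        col_id (int): The id of the collection to upload to.
        file_path (str): The path to the image file.
        def_json (dict): The upload description; the `fileName` of its first page must match the image's file name.
        session (requests.sessions.Session, optional): The authenticated session to use.
        headers_post (dict, optional): The headers for the `POST` request. Defaults to {'Content-Type': 'application/json'}.

    Returns:
        requests.Response or None: The response to the `PUT` request, or None if the file names don't match.
    """

    # Logic valid for only one file
    file_name = re.split(r'/|\\', repr(file_path)[1:-1])[-1]
    if file_name != def_json['pageList']['pages'][0]['fileName']:
        print('The file name of the image and the one in the json must match.')
        return None 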
                         
    
    # Post Details
    post = act_on_url(url = f'{api_url}uploads?collId={col_id}',
                      method = 'POST',
                      json = def_json,
                      session = session,
                      headers = headers_post
                     )


    # Get upload id
    up_id = ET.fromstring(post.text).findall('uploadId')[0].text


    # Put Details
    with open(file_path, 'rb') as bf:
        files = {'img': bf, 'Content-Type': 'application/octet-stream'}
        tmp = act_on_url(url = f'{api_url}uploads/{up_id}',
                         method = 'PUT',
                         files = files,
                         session = session)
        
        
        
    return tmp    
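
As an aside, the file name could also be taken from the path with `pathlib` instead of a regular expression. The sketch below uses `PureWindowsPath` because it accepts both `/` and `\` as separators; the helper name `base_name` and the example paths are made up.

```python
from pathlib import PureWindowsPath


def base_name(file_path):
    """Return the last component of a path written with '/' or '\\'."""
    return PureWindowsPath(file_path).name


print(base_name('view_default.jpg'))
print(base_name('images/scans/view_default.jpg'))
print(base_name(r'C:\images\scans\view_default.jpg'))
```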

In [32]:
tmp = load_img(col_id = col_id,
               file_path = 'view_default.jpg',
               def_json = default_img_up,
               session = s)

In [33]:
tmp.content

b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><trpUpload><md><docId>-1</docId><title>test_title_2</title><author>test_author</author><uploadTimestamp>0</uploadTimestamp><genre>test_genre</genre><writer>test_writer</writer><uploaderId>0</uploaderId><nrOfPages>0</nrOfPages><collectionList/></md><pageList><pages><fileName>view_default.jpg</fileName><pageUploaded>true</pageUploaded><pageNr>1</pageNr></pages></pageList><uploadId>1878176</uploadId><created>2024-03-14T07:24:14.509+01:00</created><finished>2024-03-14T07:24:19.635+01:00</finished><userId>223153</userId><userName>dario.dat.c+rcoop@gmail.com</userName><nrOfPagesTotal>1</nrOfPagesTotal><uploadType>JSON</uploadType><jobId>8354263</jobId><colId>284394</colId></trpUpload>'

In [34]:
# `len(element) == 0` means the element has no children.
# Testing an Element's truth value directly is deprecated since Python 3.12.
for child in ET.fromstring(tmp.text):
    if len(child) == 0:
        print(f"{child.tag}: {child.text}")
    else:
        print(child.tag)
        for x in child:
            if len(x) == 0:
                print(f"\t{x.tag}: {x.text}")
            else:
                print(f'\t{x.tag}')
                for y in x:
                    if len(y) == 0:
                        print(f"\t\t{y.tag}: {y.text}")
                    else:
                        print(f'\t\t{y.tag}')

md
	docId: -1
	title: test_title_2
	author: test_author
	uploadTimestamp: 0
	genre: test_genre
	writer: test_writer
	uploaderId: 0
	nrOfPages: 0
	collectionList: None
pageList
	pages
		fileName: view_default.jpg
		pageUploaded: true
		pageNr: 1
uploadId: 1878176
created: 2024-03-14T07:24:14.509+01:00
finished: 2024-03-14T07:24:19.635+01:00
userId: 223153
userName: dario.dat.c+rcoop@gmail.com
nrOfPagesTotal: 1
uploadType: JSON
jobId: 8354263
colId: 284394
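
The nested loops above handle three levels of nesting. If the XML were deeper, a recursive function would cover any depth. A minimal sketch, run on a shortened copy of the response above; the function name `describe_element` is made up.

```python
import xml.etree.ElementTree as ET


def describe_element(element, depth=0):
    """Return indented lines for an element and all of its descendants."""
    indent = '\t' * depth
    if len(element) == 0:
        # No children: show the tag together with its text
        return [f'{indent}{element.tag}: {element.text}']
    lines = [f'{indent}{element.tag}']
    for child in element:
        lines.extend(describe_element(child, depth + 1))
    return lines


# Shortened copy of the upload response shown above
sample_xml = ('<trpUpload><md><docId>-1</docId><title>test_title_2</title>'
              '<collectionList/></md><pageList><pages>'
              '<fileName>view_default.jpg</fileName><pageNr>1</pageNr>'
              '</pages></pageList><uploadId>1878176</uploadId></trpUpload>')

lines = []
for child in ET.fromstring(sample_xml):
    lines.extend(describe_element(child))

print('\n'.join(lines))
```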


In [38]:
job_api_id = ET.fromstring(tmp.text).findall('jobId')[0].text

In [39]:
job_api_id

'8354263'
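
A slightly shorter way to read a single value is `findtext`, which returns `None` (or a default of your choice) when the element is missing instead of raising an `IndexError`. A small sketch on a shortened copy of the response:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the upload response shown above
root = ET.fromstring('<trpUpload><uploadId>1878176</uploadId>'
                     '<jobId>8354263</jobId></trpUpload>')

job_id = root.findtext('jobId')
missing = root.findtext('doesNotExist')
fallback = root.findtext('doesNotExist', default='n/a')

print(job_id, missing, fallback)
```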

In [40]:
job_api_details = act_on_url(url = f'''{API_URL}jobs/{job_api_id}''', session = s).json()

In [None]:
job_api_details;

In [41]:

for val in ('type', 'docTitle', 'docId', 'colId', 'description', 'state'):
    print(val, ': ',
          job_api_details[val],
          sep = '')

type: Create Document
docTitle: test_title_2
docId: 1878176
colId: 284394
description: Done, duration: 1s 244ms
state: FINISHED


#### Start an OCR Job

In [43]:
col_id_ocr = job_api_details['colId']
doc_id_ocr = job_api_details['docId']
pgnr_ocr = 1

The link, `https://transkribus.eu/TrpServer/rest/recognition/ocr?collId=284394&id=1865707&pages=1` gives the following message:

b'The legacy OCR service is no longer available.'

It would seem the documentation about OCR needs to be updated [here](https://readcoop.eu/transkribus/docu/rest-api/).

---

If we look at the xml of possible options, we find the following

```xml
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="collId" style="query" type="xs:int"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="id" style="query" type="xs:int"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="pages" style="query" type="xs:string"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="typeFace" style="query" type="xs:string"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="language" style="query" type="xs:string"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="doBlockSegOnly" style="query" type="xs:boolean" default="false"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="type" style="query" type="xs:string" default="Legacy"/>
```
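
These parameter definitions can also be read programmatically. The sketch below wraps the lines above in a made-up `<request>` element so that they form a single document; in the real WADL file the surrounding elements and namespaces will differ, so treat this only as an illustration of pulling out names, types, and defaults.

```python
import xml.etree.ElementTree as ET

# The <request> wrapper is hypothetical; the <param> lines are copied from above
params_xml = """<request>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="collId" style="query" type="xs:int"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="id" style="query" type="xs:int"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="pages" style="query" type="xs:string"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="typeFace" style="query" type="xs:string"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="language" style="query" type="xs:string"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="doBlockSegOnly" style="query" type="xs:boolean" default="false"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="type" style="query" type="xs:string" default="Legacy"/>
</request>"""

root = ET.fromstring(params_xml)

# Map each parameter name to its (type, default); default is None when absent
params = {p.get('name'): (p.get('type'), p.get('default'))
          for p in root.iter('param')}

for name, (ptype, default) in params.items():
    print(f'{name}: type={ptype}, default={default}')
```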

From this, I first assumed `type` was our issue, but the question is: what are the allowable values?


After some more digging, I'm now assuming I need to use the API at this [link](https://transkribus.eu/processing/swagger/) which I may attempt in the future.

In [44]:
test_link_1 = f'''{API_URL}recognition/ocr?collId={col_id_ocr}&id={doc_id_ocr}&pages={pgnr_ocr}&type='''
test_link_2 = f"{API_URL}recognition/ocr?collId={col_id_ocr}&id={doc_id_ocr}&pages={pgnr_ocr}&language=English&typeFace=combined"

In [45]:
ocr_job_json = act_on_url(url = test_link_2 ,
                          method = 'POST', session = s)

HTTP error occurred for the url https://transkribus.eu/TrpServer/rest/recognition/ocr?collId=284394&id=1878176&pages=1&language=English&typeFace=combined:
	 403 Client Error: 403 for url: https://transkribus.eu/TrpServer/rest/recognition/ocr?collId=284394&id=1878176&pages=1&language=English&typeFace=combined


In [46]:
ocr_job_json.content

b'The legacy OCR service is no longer available.'
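
As a side note on building such links, the query string can be generated with `urllib.parse.urlencode` rather than written out by hand, which also takes care of escaping special characters. A minimal sketch that reproduces the URL used above; the `'1-3,5'` page range is only there to show the escaping and is not a value I have tested against the API.

```python
from urllib.parse import urlencode

API_URL = 'https://transkribus.eu/TrpServer/rest/'

# Same parameters as in the failed request above
query = {'collId': 284394,
         'id': 1878176,
         'pages': 1,
         'language': 'English',
         'typeFace': 'combined'}

ocr_url = f'{API_URL}recognition/ocr?{urlencode(query)}'
print(ocr_url)

# Characters such as commas are percent-encoded automatically
escaped = urlencode({'pages': '1-3,5'})
print(escaped)
```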

#### Retrieve The Text

To extract the transcript, we require three things, which are:
- The collection id
- The document id
- The page number

Note that the page number **is not** the page id; it is the page number you've assigned to that particular image.

Ours are chosen below.

In [47]:
ts_col_id = full_document_json['collection']['colId']
ts_doc_id = full_document_json['md']['docId']
ts_pg_num = 1

We then use the following url to query for the transcript.

In [48]:
ts_url = f'''https://transkribus.eu/TrpServer/rest/collections/{ts_col_id}/{ts_doc_id}/{ts_pg_num}/text'''

The response we receive is the transcript stored as an xml document.

In [49]:
ts_response = act_on_url(ts_url, session = s)
ts_root = ET.fromstring(ts_response.text)

See below for a simple example of querying the xml.

In [50]:
list(ts_root[1].iter('{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}TextRegion'))

[<Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}TextRegion' at 0x0000026176E69D10>,
 <Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}TextRegion' at 0x0000026176EC6540>,
 <Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}TextRegion' at 0x0000026176EC7540>]

We will use two different queries to extract the transcript.

The first, `text_0`, preserves the layout shown in the browser and gives a faithful reconstruction of the text. The trade-off is that the query is more involved.

The second, `text_1`, is a much simpler query, but it loses some formatting and has some minor duplication.

##### Method One

We create a dictionary to make our query a little cleaner.

In [51]:
ts_dict = {
    'TextRegion' : '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}TextRegion',
    'TextLine' : '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}TextLine',
    'TextEquiv' : '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}TextEquiv'
}
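
An alternative to spelling out the full tag names is to give the namespace a short prefix and pass a prefix map to `findall` or `findtext`. The sketch below uses a hand-written, heavily simplified page; the prefix `pc`, the `id` values, and the sample text are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# The prefix 'pc' is an arbitrary choice; only the URI has to match the document
NS = {'pc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15'}

# Hand-written sample, much simpler than a real transcript
sample_page_xml = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Metadata/>
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TextRegion id="r2"/>
  </Page>
</PcGts>"""

root = ET.fromstring(sample_page_xml)

# './/' searches all descendants, similar to `.iter()` with a full tag name
regions = root.findall('.//pc:TextRegion', NS)
region_ids = [region.get('id') for region in regions]

first_text = root.findtext('.//pc:TextLine/pc:TextEquiv/pc:Unicode', namespaces=NS)

print(region_ids, first_text)
```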

Our query:

In [52]:
text_0 = ''

# Select text data [1] and exclude meta data [0]
for region_idx, text_region in enumerate(ts_root[1].iter(ts_dict['TextRegion'])):
    # Isolate Regions
    text_0 = text_0 + f'\n[[Region {region_idx+1}]]\n'
    # Isolate just text
    for text_line in text_region.iter(ts_dict['TextLine']):
        # Collect the TextEquiv elements once instead of re-scanning them on every pass
        equivs = list(text_line.iter(ts_dict['TextEquiv']))
        # Isolate only full sentences, assuming it's the last element in its list
        if equivs:
            for line in equivs[-1]:
                if line.text is not None:
                    text_0 = text_0 + line.text + '\n'


print(text_0)
        


[[Region 1]]
well defered so publick a Mark of their Gratitude,
the inferiour far to the Obligations they are under
to youg, to the amazing Effects of whop Capacely
& ermness it is they owe their presant Happynep,
& the proffect they have of future greatret
with you a long Continuance of this propperity
& ye it may be attended w all other Felicity  &
I am wth the greatef Reffect & Leal &
19 tohy 86
M Woodhouss
think Eyfham hath usd Me so very ill about ye
Affair of the fale of his Estate near Wyebridges, that
I am determind to bear wth his continuing in Arrear
no longer, & therefore defire you to make an immediate
Seizure of what ie kath, & not let hem go free till he
hath pd every penny of what he ower: pray let Me
know too how many years, if any, He hath in his
leased, for pinteled to turne hem out of the Parmfortt.
fw have the Fifhery kept still in o Rental
but discharged under the head of Paymt & if Po
I hope ye whole itefairs
much p. .H.H.
Ocnbury will be all finiht 7 year. I pre

##### Method Two

Our query:

In [53]:
text_1 = ''
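for chunk in ts_root[1].itertext():
    # Skip chunks that are only whitespace
    if not chunk.isspace():
        if ' ' not in chunk:
            # A single word: add it to the current line
            text_1 = text_1 + ' ' + chunk
        else:
            # Text containing spaces is treated as a whole-line entry: start a new line
            text_1 = text_1 + '\n'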


print(text_1)
                

 well defered so publick a Mark of their Gratitude,
 the inferiour far to the Obligations they are under
 to youg, to the amazing Effects of whop Capacely
 & ermness it is they owe their presant Happynep,
 & the proffect they have of future greatret
 with you a long Continuance of this propperity
 & ye it may be attended w all other Felicity &
 I am wth the greatef Reffect & Leal &
 19 tohy 86
 M Woodhouss
 think Eyfham hath usd Me so very ill about ye
 Affair of the fale of his Estate near Wyebridges, that
 I am determind to bear wth his continuing in Arrear
 no longer, & therefore defire you to make an immediate
 Seizure of what ie kath, & not let hem go free till he
 hath pd every penny of what he ower: pray let Me
 know too how many years, if any, He hath in his
 leased, for pinteled to turne hem out of the Parmfortt.
 fw have the Fifhery kept still in o Rental
 but discharged under the head of Paymt & if Po
 I hope ye whole itefairs
 much p. .H.H.
 Ocnbury will be all finiht 7 yea