# Transkribus API Exercise 

 [__API Documentation__](https://readcoop.eu/transkribus/docu/rest-api/)

## Import Modules

* [__requests__](https://requests.readthedocs.io/en/latest/): To send HTTP requests to the API
* [__os__](https://docs.python.org/3/library/os.html): To get secrets (email,password) from .env file
* [__dotenv__](https://pypi.org/project/python-dotenv/#load-env-files-in-ipython): To load .env file with secrets into script
* [__lxml__](https://lxml.de/index.html#documentation): To process XML responses from the API 

In [32]:
import requests
import os
from dotenv import load_dotenv
from lxml import etree

## Authentication

The _authentication_ function loads the login credentials from the .env file, sends them in a POST request to the login URL, and parses the XML response from the API to find and return the session ID.

All subsequent requests must include the session ID (within the cookies or URL params). 
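One way to attach the session ID to every request is a `requests.Session` with the cookie set once. The sketch below is only an illustration: `make_session` is a hypothetical helper that is not part of this notebook, and the session ID is a placeholder. It prepares a request without sending it, so the outgoing `Cookie` header can be inspected offline.

```python
import requests


def make_session(jsessionid):
    """Return a requests.Session that sends JSESSIONID with every request.

    Hypothetical helper for illustration; not part of the notebook.
    """
    session = requests.Session()
    session.cookies.set("JSESSIONID", jsessionid)
    return session


# Placeholder value; a real ID comes from the login response.
session = make_session("PLACEHOLDER_SESSION_ID")

# Prepare (but do not send) a request to inspect the outgoing Cookie header.
request = requests.Request(
    "GET", "https://transkribus.eu/TrpServer/rest/collections/list"
)
prepared = session.prepare_request(request)
cookie_header = prepared.headers["Cookie"]
print(cookie_header)
```

With a session like this, `session.get(...)` and `session.post(...)` would carry the cookie without passing `cookies=` on each call.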

In [33]:
def authentication():    

    login_url = "https://transkribus.eu/TrpServer/rest/auth/login"

    load_dotenv('.env')
    user = os.getenv("EMAIL")
    pw = os.getenv("PASSWORD")

    login_credentials = {
        "user" : user,
        "pw" : pw
    }

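    response = requests.post(login_url, data=login_credentials)
    response.raise_for_status()

    # Parse the XML response and read the <sessionId> child of the root element
    auth_root = etree.fromstring(response.content)
    JSESSIONID = auth_root.findtext("sessionId")

    if JSESSIONID is None:
        raise ValueError("No sessionId element found in the login response")

    # Cookie dictionary that can be passed to requests via cookies=
    sessionId = {
        "JSESSIONID": JSESSIONID
    }

    return sessionId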

sessionId = authentication()

print(sessionId)

{'JSESSIONID': 'B08AE8F1CF547690C3D148C9A1F37F26'}


## Get Collection ID

The _get_collectionId_ function accepts the session ID as a parameter, sends a GET request to the collection list URL, and reads the collection ID from the JSON response.

The JSON response is a list of dictionaries; the function returns the `colId` of the first entry.

In [68]:
def get_collectionId(sessionId):
    
    collection_url = "https://transkribus.eu/TrpServer/rest/collections/list"
    collection_response = requests.get(collection_url, cookies=sessionId)

    collection_response.raise_for_status()

    # The response is a list of dictionaries; take the colId of the first entry
    colId = collection_response.json()[0]["colId"]
    
    return colId

colId = get_collectionId(sessionId)

print(colId)

284395
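Taking index `[0]` only works while the wanted collection happens to come first. The sketch below selects a collection by name instead. `find_collection_id` is a hypothetical helper, the sample list is made up, and the `colName` key is an assumption about the JSON response (only `colId` appears in this notebook).

```python
def find_collection_id(collections, name):
    """Return the colId of the first collection whose colName equals name.

    `collections` is the parsed JSON list; the "colName" key is an assumption.
    """
    for collection in collections:
        if collection.get("colName") == name:
            return collection["colId"]
    raise LookupError(f"no collection named {name!r}")


# Hypothetical sample shaped like the parsed JSON list (only the fields used here).
sample_collections = [
    {"colId": 100001, "colName": "Letters"},
    {"colId": 284395, "colName": "API Exercise"},
]

selected = find_collection_id(sample_collections, "API Exercise")
print(selected)
```

In the notebook this would be called as `find_collection_id(collection_response.json(), "...")` with the real collection name.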


## Upload an Image

[API Documentation to Upload an Image](https://readcoop.eu/transkribus/docu/rest-api/upload/)

The _upload_image_ function accepts the session and collection IDs as parameters. It sends a POST request to the upload URL with the content type header, the image metadata as JSON, the session ID, and the collection ID.

It then parses the XML response from the API to find the upload ID and sends a PUT request to the upload file URL with the binary image data, the session ID, and the upload ID.

Finally, it parses the XML response to that request to find and return the job ID.

In [71]:
def upload_image(sessionId, colId):

    upload_url = f"https://transkribus.eu/TrpServer/rest/uploads?collId={colId}"

    # requests sets this header itself when json= is used; it is spelled out here for clarity
    headers = {
        "Content-Type" : "application/json"
    }

    img = {
    "md": {
        "title": "Something",
        "author": "Somebody",
        "genre": "Some Genre",
        "writer": "Someone"
    },
    "pageList": {"pages": [
        {
            "fileName": "default2.jpg",
            "pageNr": 1
        }
    ]}
    }


    upload_response = requests.post(upload_url, cookies = sessionId, json=img, headers=headers)

    upload_response.raise_for_status()

    upload_root = etree.fromstring(upload_response.content)

    for child in upload_root:
        if child.tag == "uploadId":
            uploadId = child.text


    uploadfile_url =f"https://transkribus.eu/TrpServer/rest/uploads/{uploadId}"

    # Open in binary mode ('rb') and close the file once the request is done
    with open('default2.jpg', 'rb') as image_file:
        file = {'img': image_file}
        uploadfile_response = requests.put(uploadfile_url, cookies = sessionId, files=file)

    uploadfile_response.raise_for_status()

    print(uploadfile_response.status_code)

    print(uploadfile_response.text)


    uploadfile_root = etree.fromstring(uploadfile_response.content)

    for c in uploadfile_root:
        if c.tag == "jobId":
            jobId = c.text

    return jobId

jobId = upload_image(sessionId=sessionId,colId=colId)


print(jobId)



200
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><trpUpload><md><docId>-1</docId><title>Something</title><author>Somebody</author><uploadTimestamp>0</uploadTimestamp><genre>Some Genre</genre><writer>Someone</writer><uploaderId>0</uploaderId><nrOfPages>0</nrOfPages><collectionList/></md><pageList><pages><fileName>default2.jpg</fileName><pageUploaded>true</pageUploaded><pageNr>1</pageNr></pages></pageList><uploadId>1874308</uploadId><created>2024-03-12T02:51:58.921+01:00</created><finished>2024-03-12T02:52:06.211+01:00</finished><userId>223154</userId><userName>vividrakes2147@gmail.com</userName><nrOfPagesTotal>1</nrOfPagesTotal><uploadType>JSON</uploadType><jobId>8327963</jobId><colId>284395</colId></trpUpload>
8327963
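The metadata dictionary in the cell above is written by hand. As a sketch, it can also be built from a list of file paths so that `fileName` and `pageNr` stay consistent. `build_upload_payload` is a hypothetical helper and the file paths are made up; the dictionary layout copies the one used in this notebook.

```python
import os


def build_upload_payload(metadata, file_paths):
    """Build the JSON body for the upload request.

    Pages are numbered from 1 in the order the paths are given.
    Hypothetical helper; the layout mirrors the `img` dictionary above.
    """
    pages = [
        {"fileName": os.path.basename(path), "pageNr": number}
        for number, path in enumerate(file_paths, start=1)
    ]
    return {"md": dict(metadata), "pageList": {"pages": pages}}


metadata = {
    "title": "Something",
    "author": "Somebody",
    "genre": "Some Genre",
    "writer": "Someone",
}

# Hypothetical file paths for illustration.
payload = build_upload_payload(metadata, ["scans/page_a.jpg", "scans/page_b.jpg"])
print(payload["pageList"]["pages"])
```

The resulting dictionary can be passed as `json=payload` in the POST request, and the same list of paths can drive the loop that uploads each file.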


## Upload Multiple Images

[API Documentation to Upload Images](https://readcoop.eu/transkribus/docu/rest-api/upload/)

The _upload_multiple_images_ function accepts the session and collection IDs as parameters. It sends a POST request to the upload URL with the content type header, the image metadata as JSON (with multiple pages under the "pageList" key), the session ID, and the collection ID.

It then parses the XML response from the API to find the upload ID and, in a for loop, sends one PUT request per image to the upload file URL with the session ID, the upload ID, and the binary data of that image file.

Finally, it parses the XML response to the last PUT request to find and return the job ID.

In [66]:
def upload_multiple_images(sessionId, colId):
    upload_url = f"https://transkribus.eu/TrpServer/rest/uploads?collId={colId}"

    # requests sets this header itself when json= is used; it is spelled out here for clarity
    headers = {
        "Content-Type" : "application/json"
    }

    img = {
    "md": {
        "title": "Multiple_Test",
        "author": "Somebody",
        "genre": "Some Genre",
        "writer": "Someone"
    },
    "pageList": {"pages": [
        {
            "fileName": "X.jpg",
            "pageNr": 1 #Page Number 1
        },
        {
            "fileName": "Y.jpg",
            "pageNr": 2 #Page Number 2
        }
    ]}
    }


    upload_response = requests.post(upload_url, cookies = sessionId, json=img, headers=headers)

    upload_response.raise_for_status()

    upload_root = etree.fromstring(upload_response.content)

    for child in upload_root:
        if child.tag == "uploadId":
            uploadId = child.text


    uploadfile_url =f"https://transkribus.eu/TrpServer/rest/uploads/{uploadId}"

    pages = img["pageList"]["pages"]


    for page in pages:
        print(page["fileName"]) 

        # Open in binary mode ('rb') and close the file once the request is done
        with open(page["fileName"], 'rb') as image_file:
            file = {'img': image_file}
            uploadfile_response = requests.put(uploadfile_url, cookies = sessionId, files=file)

        uploadfile_response.raise_for_status()

        print(uploadfile_response.status_code)

        print(uploadfile_response.text) #Doc ID is -1??? 


    uploadfile_root = etree.fromstring(uploadfile_response.content)

    for c in uploadfile_root:
        if c.tag == "jobId":
            jobId = c.text

    print(jobId)
    return jobId

jobId= upload_multiple_images(sessionId=sessionId, colId=colId)

X.jpg
200
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><trpUpload><md><docId>-1</docId><title>Multiple_Test</title><author>Somebody</author><uploadTimestamp>0</uploadTimestamp><genre>Some Genre</genre><writer>Someone</writer><uploaderId>0</uploaderId><nrOfPages>0</nrOfPages><collectionList/></md><pageList><pages><fileName>X.jpg</fileName><pageUploaded>true</pageUploaded><pageNr>1</pageNr></pages><pages><fileName>Y.jpg</fileName><pageUploaded>false</pageUploaded><pageNr>2</pageNr></pages></pageList><uploadId>1874306</uploadId><created>2024-03-12T02:43:44.876+01:00</created><userId>223154</userId><userName>vividrakes2147@gmail.com</userName><nrOfPagesTotal>2</nrOfPagesTotal><uploadType>JSON</uploadType><colId>284395</colId></trpUpload>
Y.jpg
200
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><trpUpload><md><docId>-1</docId><title>Multiple_Test</title><author>Somebody</author><uploadTimestamp>0</uploadTimestamp><genre>Some Genre</genre><writer>Someone</writer><uploaderId
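In the outputs shown in this notebook, the response after the first of two pages has `pageUploaded` set to `false` for the second page and no `jobId` element, while the single-image response, where every page is uploaded, does contain one. The sketch below reads those fields in one pass. `summarize_upload` is a hypothetical helper, the sample is a shortened copy of the first response above, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example runs without extra packages (`lxml.etree` offers the same `fromstring`, `findtext`, and `iterfind` calls).

```python
import xml.etree.ElementTree as ET


def summarize_upload(xml_bytes):
    """Extract upload ID, job ID, and per-page upload status from a trpUpload response.

    Hypothetical helper; "jobId" is None when the element is absent.
    """
    root = ET.fromstring(xml_bytes)
    pages = {
        page.findtext("fileName"): page.findtext("pageUploaded") == "true"
        for page in root.iterfind("pageList/pages")
    }
    return {
        "uploadId": root.findtext("uploadId"),
        "jobId": root.findtext("jobId"),
        "pages": pages,
    }


# Shortened copy of the response printed after uploading X.jpg.
sample = (
    b"<trpUpload><pageList>"
    b"<pages><fileName>X.jpg</fileName><pageUploaded>true</pageUploaded><pageNr>1</pageNr></pages>"
    b"<pages><fileName>Y.jpg</fileName><pageUploaded>false</pageUploaded><pageNr>2</pageNr></pages>"
    b"</pageList><uploadId>1874306</uploadId></trpUpload>"
)

summary = summarize_upload(sample)
print(summary)
```

Checking `summary["pages"]` after each PUT request shows which files are still pending before a job ID can be expected.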

## Start OCR

This is not working yet: the OCR request returns a 403 Forbidden error.



In [82]:


collection_md_url = f"https://transkribus.eu/TrpServer/rest/collections/{colId}/list"

collection_md_response = requests.get(collection_md_url , cookies = sessionId)

collection_md_response.raise_for_status()

# Take the ID of the third document in the collection's document list
docId = collection_md_response.json()[2]["docId"]
print(docId)


pages = 1
 

OCR_URL=f"https://transkribus.eu/TrpServer/rest/recognition/ocr?collId={colId}&id={docId}&pages={pages}"

OCR_response = requests.post(OCR_URL, cookies = sessionId)

OCR_response.raise_for_status() 

#HTTPError 403: Forbidden 

1874308


HTTPError: 403 Client Error: 403 for url: https://transkribus.eu/TrpServer/rest/recognition/ocr?collId=284395&id=1874308&pages=1
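The cell above picks the document by a hardcoded list index and assembles the query string by hand. The sketch below does both steps more defensively; it does not address the 403 response itself. `find_doc_id` and `build_ocr_url` are hypothetical helpers, the sample document list is made up, and the `title` key is an assumption about the JSON response (only `docId` appears in this notebook).

```python
from urllib.parse import urlencode

OCR_ENDPOINT = "https://transkribus.eu/TrpServer/rest/recognition/ocr"


def find_doc_id(documents, title):
    """Return the docId of the first document with the given title.

    `documents` is the parsed JSON list; the "title" key is an assumption.
    """
    for document in documents:
        if document.get("title") == title:
            return document["docId"]
    raise LookupError(f"no document titled {title!r}")


def build_ocr_url(col_id, doc_id, pages):
    """Build the OCR URL with the same query parameters as the cell above."""
    query = urlencode({"collId": col_id, "id": doc_id, "pages": pages})
    return f"{OCR_ENDPOINT}?{query}"


# Hypothetical sample shaped like the parsed document list.
sample_documents = [
    {"docId": 1874306, "title": "Multiple_Test"},
    {"docId": 1874308, "title": "Something"},
]

doc_id = find_doc_id(sample_documents, "Something")
ocr_url = build_ocr_url(284395, doc_id, 1)
print(ocr_url)
```

The same query could also be passed to `requests.post` through its `params=` argument instead of being encoded by hand.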