## Running handwritten text recognition on Rackham notebooks

#### **Important:** set these two values before running the code.

In [1]:
manifest = "https://cudl.lib.cam.ac.uk//iiif/MS-CCCC-00014-00006-00002-00001-00063.json" # the URI of the IIIF manifest containing the images
spread = "Y" # set to "N" if the images are not a double spread

Import the libraries we need that Colab already provides.

In [2]:
import json
import os
import requests

Colab does not come with Google's own Vision library (even though both are Google products), so we need to install it before importing it.

In [4]:
!pip install google-cloud-vision



In [5]:
from google.cloud import vision

Upload a JSON file with the authentication details.

Alternatively, if you are connecting to a Drive account that is not an institutional one, you could store the file there and mount the Drive, which is more secure.
The code below assumes that the `.json` file is at the top level of the files menu (icon to the left), at the same level as the `sample_data` directory.
Note that all files in this directory are deleted after your session ends.

In [6]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "prime-poetry-475713-s7-2e04e578e446.json"
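
A missing or misnamed key file only shows up later as an authentication error, so it can help to check for it first. The following is a minimal sketch, assuming a hypothetical helper called `set_credentials` that is not part of the original notebook:

```python
import os

def set_credentials(key_path):
    """Point the Google client libraries at a service-account key file.

    Hypothetical helper, not part of the notebook. Raises
    FileNotFoundError if the key file has not been uploaded yet.
    """
    if not os.path.isfile(key_path):
        raise FileNotFoundError(f"Upload the key file first: {key_path}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
    return key_path
```

It would be called with the name of your own uploaded `.json` key file in place of the assignment above.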

Open a file that any errors can be written to.

In [7]:
qafile=open('qafile.txt', 'w')
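
The cells in this notebook open `qafile` but do not show anything being written to it. One way to use it would be to wrap each recognition call so that a failure on one image is logged rather than stopping the whole run. This is only a sketch: `safe_detect` is a hypothetical helper, `detect` stands in for the recognition function defined below, and `log` for the open `qafile`.

```python
def safe_detect(detect, url, log):
    """Call detect(url); on failure, log the error and return no paragraphs.

    Hypothetical helper: `detect` would be detect_handwriting_paragraphs
    and `log` the open qafile.
    """
    try:
        return detect(url)
    except Exception as err:
        # One tab-separated line per failed image: URL, then the error message
        log.write(f"{url}\t{err}\n")
        return []
```

With this in place, a page that fails would simply produce an empty `<pb/>` in the XML and a line in `qafile.txt` to follow up on.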

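This function sends an image URL to Google Vision and returns the recognised text as a list of paragraphs, with `<lb/>` marking the line breaks inside each paragraph.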

In [8]:
def detect_handwriting_paragraphs(path):
    """Detects handwriting in the image at the given URL and returns a list of paragraphs."""
    # Values of Vision's DetectedBreak.BreakType enum
    SPACE = 1
    SURE_SPACE = 2       # a very wide space
    EOL_SURE_SPACE = 3   # the line wraps here
    LINE_BREAK = 5       # the line break that ends a paragraph

    client = vision.ImageAnnotatorClient()

    # Vision fetches the image itself from the URL
    image = vision.Image()
    image.source.image_uri = path

    response = client.document_text_detection(image=image)  # Use document_text_detection for handwriting

    # Check for an API error before reading the annotation
    if response.error.message:
        raise Exception('{}\nFor more info on error messages, check: https://cloud.google.com/apis/design/errors'.format(response.error.message))

    annotation = response.full_text_annotation
    paragraphs = []

    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                para = ""
                line = ""
                for word in paragraph.words:
                    for symbol in word.symbols:
                        # Escape angle brackets in the recognised text
                        line += symbol.text.replace("<", "&lt;").replace(">", "&gt;")
                        break_type = symbol.property.detected_break.type
                        if break_type in (SPACE, SURE_SPACE):
                            line += ' '
                        elif break_type == EOL_SURE_SPACE:
                            # The line ends here but the paragraph continues
                            line += " <lb/>\n"
                            para += line
                            line = ''
                        elif break_type == LINE_BREAK:
                            # Last line of the paragraph, so no <lb/> is needed
                            para += line
                            line = ''
                # Keep any text left over if the paragraph did not end with a break
                para += line
                paragraphs.append(para)

    return paragraphs
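
To see how the break types shape the output without calling the API, the same logic can be tried on plain `(text, break_type)` pairs. This is a simplified sketch rather than the function above: `join_symbols` is a hypothetical name, the sample symbols are made up, and the numeric codes are the values of Vision's `BreakType` enum.

```python
SPACE, EOL_SURE_SPACE, LINE_BREAK = 1, 3, 5  # Vision BreakType values
NO_BREAK = 0

def join_symbols(symbols):
    """Build one paragraph string from (text, break_type) pairs."""
    para = ""
    for text, break_type in symbols:
        para += text
        if break_type == SPACE:
            para += " "
        elif break_type == EOL_SURE_SPACE:
            # The line wraps inside the paragraph, so mark it with <lb/>
            para += " <lb/>\n"
        # LINE_BREAK ends the paragraph, so nothing is appended
    return para

# Two short lines: "Hi" wraps onto "you", which ends the paragraph
symbols = [("H", NO_BREAK), ("i", EOL_SURE_SPACE),
           ("y", NO_BREAK), ("o", NO_BREAK), ("u", LINE_BREAK)]
print(repr(join_symbols(symbols)))
```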

Get the IIIF manifest, which contains the URLs of the images we want.
This URL is set right at the top of the notebook.

In [9]:
r = requests.get(manifest)
r.raise_for_status()  # stop here if the manifest could not be fetched
json_data = r.json()
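
The loop further down relies on only a few keys of the manifest: `sequences`, `canvases`, `label`, `images`, `resource` and `@id`. The sketch below walks the same nesting on a cut-down, made-up manifest. The `example.org` URLs and the `list_images` helper are assumptions for illustration, not CUDL data.

```python
# A cut-down, made-up manifest with the same nesting the notebook relies on
sample_manifest = {
    "sequences": [{
        "canvases": [
            {"label": "cover",
             "images": [{"resource": {"@id": "https://example.org/iiif/img-001"}}]},
            {"label": "5125",
             "images": [{"resource": {"@id": "https://example.org/iiif/img-002"}}]},
        ]
    }]
}

def list_images(manifest_json):
    """Return (label, image URL) pairs in manifest order."""
    pairs = []
    for sequence in manifest_json["sequences"]:
        for canvas in sequence["canvases"]:
            for image in canvas["images"]:
                pairs.append((canvas["label"], image["resource"]["@id"]))
    return pairs

for label, url in list_images(sample_manifest):
    print(label, url)
```

Running `list_images(json_data)` on the real manifest would be a quick way to check how many images there are before sending any of them to Google Vision.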


Create an XML file. We will write all transcriptions to one file.

In [10]:
xmlfile=open('xmlfile.xml', 'w')

Add the opening elements to the new file.

In [11]:
xmlfile.write("<text>\n<body>\n<div>\n")

20

Now we can iterate through the IIIF manifest. We need to set the counter to 0 to start with.

In [12]:
counter=0

Iterate through the manifest, passing each URL to Google Vision to do the HTR.

In [13]:
for sequence in json_data['sequences']:
  for canvas in sequence['canvases']:
    label=canvas['label']
    xmlid="page-pb-"+label
    counter+=1
    facs="#page-surface-"+str(counter)
    for image in canvas['images']:
      iiif_image=image['resource']['@id']
      image_id=iiif_image.rsplit('/', 1)[-1]

      #writes out page break information
      pb=f'<pb n="{label}" xml:id="{xmlid}" facs="{facs}"/>'
      print(pb)
      xmlfile.write(pb+"\n")

      paragraphs=[]

      #gets the text from the image
      if spread=="Y":
        # IIIF regions are pct:x,y,w,h, so each half of the spread is 50% wide
        paragraphs_1=detect_handwriting_paragraphs(iiif_image+"/pct:0,0,50,100/full/0/default.jpg")
        paragraphs_2=detect_handwriting_paragraphs(iiif_image+"/pct:50,0,50,100/full/0/default.jpg")
        paragraphs=paragraphs_1+paragraphs_2
      else:
        paragraphs=detect_handwriting_paragraphs(iiif_image+"/full/full/0/default.jpg")

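      for paragraph in paragraphs:
        # escape ampersands, then restore the &lt; and &gt; entities that
        # detect_handwriting_paragraphs has already produced
        paragraph=paragraph.replace("&", "&amp;")
        paragraph=paragraph.replace("&amp;lt;", "&lt;").replace("&amp;gt;", "&gt;")
        xmlfile.write("<p>"+paragraph+"</p>\n")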


<pb n="cover" xml:id="page-pb-cover" facs="#page-surface-1"/>
<pb n="5125" xml:id="page-pb-5125" facs="#page-surface-2"/>
<pb n="5126-5127" xml:id="page-pb-5126-5127" facs="#page-surface-3"/>
<pb n="5128-5129" xml:id="page-pb-5128-5129" facs="#page-surface-4"/>
<pb n="5130-5131" xml:id="page-pb-5130-5131" facs="#page-surface-5"/>
<pb n="5132-5133" xml:id="page-pb-5132-5133" facs="#page-surface-6"/>
<pb n="5134-5135" xml:id="page-pb-5134-5135" facs="#page-surface-7"/>
<pb n="5136-5137" xml:id="page-pb-5136-5137" facs="#page-surface-8"/>
<pb n="5138-5139" xml:id="page-pb-5138-5139" facs="#page-surface-9"/>
<pb n="5140-5141" xml:id="page-pb-5140-5141" facs="#page-surface-10"/>
<pb n="5142-5143" xml:id="page-pb-5142-5143" facs="#page-surface-11"/>
<pb n="5144-5145" xml:id="page-pb-5144-5145" facs="#page-surface-12"/>
<pb n="5146-5147" xml:id="page-pb-5146-5147" facs="#page-surface-13"/>
<pb n="5148-5149" xml:id="page-pb-5148-5149" facs="#page-surface-14"/>
<pb n="5150-5151" xml:id="page-pb
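
The image URLs passed to Google Vision follow the IIIF Image API pattern `{base}/{region}/{size}/{rotation}/{quality}.{format}`. A percentage region is written `pct:x,y,w,h`, where `w` and `h` are a width and height rather than a right-hand or bottom edge, so the right half of a spread is `pct:50,0,50,100`. The sketch below builds these URLs for one image. `page_urls` is a hypothetical helper and the base URL in the example is made up.

```python
def page_urls(base, spread=True):
    """Return the IIIF Image API URLs to transcribe for one image.

    A double spread is split into left and right halves using
    percentage regions of the form pct:x,y,w,h.
    """
    suffix = "/full/0/default.jpg"  # size / rotation / quality.format
    if spread:
        return [f"{base}/pct:0,0,50,100{suffix}",
                f"{base}/pct:50,0,50,100{suffix}"]
    # Single page: request the full region
    return [f"{base}/full{suffix}"]

for url in page_urls("https://example.org/iiif/img-001"):
    print(url)
```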

The file should now be populated, so we can append the closing elements to make the XML well-formed.

In [14]:
xmlfile.write("</div>\n</body>\n</text>")

22
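
Whether the result really is well-formed can be checked with a parser. The sketch below uses Python's built-in `xml.etree.ElementTree` on a small sample string modelled on the output of this notebook. `is_well_formed` is a hypothetical helper, and the sample is illustrative rather than the real transcription.

```python
import xml.etree.ElementTree as ET

sample = (
    "<text>\n<body>\n<div>\n"
    '<pb n="cover" xml:id="page-pb-cover" facs="#page-surface-1"/>\n'
    "<p>Horning <lb/>\nDitc</p>\n"
    "<p>&amp;C.</p>\n"
    "</div>\n</body>\n</text>"
)

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(sample))
# An unescaped ampersand is enough to break the file
print(is_well_formed(sample.replace("&amp;", "&")))
```

For the notebook's own output, the same check could be run on the contents of `xmlfile.xml` once `xmlfile.close()` has been called, so that everything written is actually on disk.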

Show the top of the file to see what was produced.

In [15]:
!head xmlfile.xml

<text>
<body>
<div>
<pb n="cover" xml:id="page-pb-cover" facs="#page-surface-1"/>
<p>5125-5212</p>
<p>29 Oct. 1961</p>
<p>Cambridge</p>
<p>Horning <lb/>
Ditc</p>
<p>&amp;C. <lb/>


Want to try with something else? How about this text in Chinese?
https://cudl.lib.cam.ac.uk/view/MS-FC-00099-00025/1

If you have a look at this, you'll see that each image shows a single page rather than a double spread like the Rackham notebooks. That means you'll have to set the `spread` variable in the cell at the top of the notebook to "N".

The IIIF manifest for this set of images is below; swap it for the address at the top of the notebook to run it:

https://cudl.lib.cam.ac.uk//iiif/MS-FC-00099-00025.json

Finally, you might want to change the name of the output file so that the Rackham transcription is not overwritten. The new file will also be an XML file.
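
One way to avoid overwriting earlier output is to derive the file name from the manifest address. This is a minimal sketch, assuming a hypothetical helper `output_name` that is not part of the notebook:

```python
def output_name(manifest_url):
    """Derive an XML file name from the last part of a manifest URL."""
    stem = manifest_url.rstrip("/").rsplit("/", 1)[-1]
    if stem.endswith(".json"):
        stem = stem[: -len(".json")]
    return stem + ".xml"

print(output_name("https://cudl.lib.cam.ac.uk//iiif/MS-FC-00099-00025.json"))
# prints MS-FC-00099-00025.xml
```

The result could then be passed to `open()` in place of `'xmlfile.xml'`.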