In [1]:
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

In [2]:
%%time
import json
import pandas as pd
import numpy as np
import cv2
from PIL import Image
#from tesserocr import PyTessBaseAPI, PSM, RIL, PT
import matplotlib.pyplot as plt

CPU times: user 676 ms, sys: 442 ms, total: 1.12 s
Wall time: 51.7 s


In [3]:
%%time
import lib.utils as utils
import lib.config as config
import lib.model as model
import lib.roi as roi
import lib.split_pages as spages

  from .autonotebook import tqdm as notebook_tqdm


CPU times: user 2.11 s, sys: 1.27 s, total: 3.37 s
Wall time: 2min 19s


In [4]:
image_path = "images/cropped/The_Lightfoot_Herbarium_05_cropped_1.jpg"
image_dir = "images/cropped/"

In [5]:
images = sorted(utils.load_images(image_dir))[:4]

In [6]:
newprompt = """
The input image is a cropped page of a botanical catalogue that lists the contents of folders in taxonomic order.
This information is split into columns on each page.
The image provides the division name, family name, species name and their folders' contents.
---
An example of the provided format is:

Dicotyledones
ACERACEAE
Acer campestre L.
1 folder. Acer campestre [TA]
Acer pseudoplatanus L.
2 folders.
Folder 1. Acer Pseudo-Platanus
[G]. i. "Maple. Bulls: [Bulstrode]
Park" [JL]
Folder 2. Acer Pseudo-Platanus
[TA].
AMARANTHACEAE
Amaranthus lividus L. Flora Europaea 1: 110 (1964)
1 folder. Amaranthus Blitum [TA].
i. Cities Ray's Syn. 1957. ii. "Blite 
Amaranth. Aug. It is often found
on Dunghills in the neighbourhood
of London. I gather'd this on a
Dunghill at Fulham near London"
[JL]. iii. "Amaranthus Blitum.
Monoec: 5. and:" [JL]
---

where division names are characterised by large bold words, family names by capitalised words, and species names by italic words.
The list of contents under each species gives the number of folders first and then their content.
When more than one folder is provided, Roman numerals (such as i, ii, iii) are used to differentiate the folders.
Furthermore, for each collection the collector's initials are cited in square brackets '[]'.

In some cases, the input image continues folder information from a previously provided image, so the text carries on from the last page.

---
Example of the output list of JSON dicts:

[
  {
    "division": "Dicotyledones",
    "contents": [
      {"family": "ACERACEAE", "contents": [
        {"species": "Acer campestre L.", "folders": [
          {"folder": 1, "content": "Acer campestre [TA]"}]},
        {"species": "Acer pseudoplatanus L.", "folders": [
          {"folder": 1, "content": "Acer Pseudo-Platanus [G]. i. \\"Maple. Bulls: [Bulstrode] Park\\" [JL]"},
          {"folder": 2, "content": "Acer Pseudo-Platanus [TA]"}]}]},
      {"family": "AMARANTHACEAE", "contents": [
        {"species": "Amaranthus lividus L. Flora Europaea 1: 110 (1964)", "folders": [
          {"folder": 1, "content": "Amaranthus Blitum [TA]. i. Cities Ray's Syn. 1957. ii. \\"Blite Amaranth. Aug. It is often found on Dunghills in the neighbourhood of London. I gather'd this on a Dunghill at Fulham near London\\" [JL]. iii. \\"Amaranthus Blitum. Monoec: 5. and:\\" [JL]"}]}]}
    ]
  }
]

---


Process the image using the following steps to generate JSON that follows the structure shown above.

1. Create a list of JSON dicts.
2. Find the division name where possible; if it is not available, set it to 'N/A'.
3. Under the division, create a contents list.
4. For each family name under the division name, create a JSON dict and add it to the list.
5. Under each family name, find the species names and their folder contents and add them to the list.
6. Where a value for a specific key cannot be found, set it to 'N/A'.
7. In cases where extra text is provided before a division, family or species name, set the division, family or species name to 'Extra info'.

Compile this information into a list of JSON dicts and return the list. Please output this list of JSON dicts and not the example.
The example (denoted between --- and ---) is only to be used as a template for the structure of the JSON format.


Family names are characterised by capitalised words, whilst species names are characterised by italic words.
The list of contents under each species gives the number of folders first and then their content.
Furthermore, for each collection the collector's initials are cited in square brackets '[]'.
"""

In [7]:
simple_prompt = "Parse the text in the image into a JSON format"
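
The nesting requested in `newprompt` (division, then family, then species, then folders) can be checked mechanically once a response has been parsed. The following is a minimal sketch under that assumption; `follows_prompt_schema` is a hypothetical helper and is not part of `lib/`. It only verifies that the expected keys are present at each level, not that the values are correct.

```python
import json


def follows_prompt_schema(records):
    """Return True if `records` is nested the way the `newprompt` example is.

    Hypothetical helper for illustration: it checks key presence only.
    """
    if not isinstance(records, list):
        return False
    for division in records:
        if not isinstance(division, dict) or "division" not in division:
            return False
        for family in division.get("contents", []):
            if not isinstance(family, dict) or "family" not in family:
                return False
            for species in family.get("contents", []):
                if not isinstance(species, dict) or "species" not in species:
                    return False
                for folder in species.get("folders", []):
                    # Each folder entry needs both a number and its content.
                    if not isinstance(folder, dict):
                        return False
                    if not {"folder", "content"} <= set(folder):
                        return False
    return True


# A shortened version of the example structure from the prompt.
example = json.loads("""
[{"division": "Dicotyledones",
  "contents": [{"family": "ACERACEAE",
                "contents": [{"species": "Acer campestre L.",
                              "folders": [{"folder": 1,
                                           "content": "Acer campestre [TA]"}]}]}]}]
""")
print(follows_prompt_schema(example))
```

A check like this is most useful when comparing prompts: responses produced with `simple_prompt` are free to pick their own key names, so they would not be expected to pass it.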

In [8]:
%%time
qwen_model = model.QWEN_Model(prompt=simple_prompt, max_new_tokens=5000, batch_size=1)

`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|██████████| 5/5 [02:46<00:00, 33.26s/it]


CPU times: user 2.92 s, sys: 11.3 s, total: 14.2 s
Wall time: 2min 54s


In [9]:
%%time
outputs = qwen_model.batch_infer(images, save=False, debug=True)

>>> Using: 
 	Maximum new tokens = 5000 
 	Batch size = 1 
 	save_path = None
>>> Batch 1 starting...
	Processing text prompts...
	Reading Images...
	Processing inputs...
	Moving inputs to gpu...
	Performing inference...
	Inference Finished
	Seperating Ids...
	Decoding Ids...
	Seperating Outputs...
	Outputs stored!
>>> Batch 2 starting...
	Processing text prompts...
	Reading Images...
	Processing inputs...
	Moving inputs to gpu...
	Performing inference...
	Inference Finished
	Seperating Ids...
	Decoding Ids...
	Seperating Outputs...
	Outputs stored!
>>> Batch 3 starting...
	Processing text prompts...
	Reading Images...
	Processing inputs...
	Moving inputs to gpu...
	Performing inference...
	Inference Finished
	Seperating Ids...
	Decoding Ids...
	Seperating Outputs...
	Outputs stored!
>>> Batch 4 starting...
	Processing text prompts...
	Reading Images...
	Processing inputs...
	Moving inputs to gpu...
	Performing inference...
	Inference Finished
	Seperating Ids...
	Decoding Ids...
	Seper

In [10]:
outputs

[('images/cropped/The_Lightfoot_Herbarium_05_cropped_1.jpg',
  '{\n    "metadata": {\n        "division": "N/A",\n        "page": "132"\n    },\n    "contents": [\n        {\n            "familyName": "Aceraceae",\n            "species": [\n                {\n                    "speciesName": "Acer campestre L.",\n                    "folders": [\n                        {\n                            "description": "1 folder. Acer campestre [TA]",\n                            "citations": ["N/a"]\n                        }\n                    ]\n                },\n                {\n                    "speciesName": "Acer pseudoplatanus L.",\n                    "folders": [\n                        {\n                            "description": "2 folders.",\n                            "citations": ["n/a"]\n                        }\n                    ]\n                }\n            ]\n        },\n        {\n            "familyName": "Amaranthaceae",\n            "species": [

In [15]:
out = outputs[1][1]
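
Each entry in `outputs` pairs an image path with the raw decoded string, and that string is not guaranteed to be valid JSON. The following is a minimal sketch of parsing it defensively; `parse_model_output` is a hypothetical helper, and the handling of a Markdown code fence is an assumption about how some responses might be wrapped, not something observed in the outputs above.

```python
import ast
import json

FENCE = "`" * 3  # Markdown code fence marker, built this way to keep the example readable


def parse_model_output(text):
    """Parse a decoded model response into Python objects without executing it."""
    cleaned = text.strip()
    # Assumption: a response may be wrapped in a fence such as FENCE + "json".
    if cleaned.startswith(FENCE):
        cleaned = cleaned.strip("`").strip()
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fallback for Python-literal style text (single quotes, trailing commas).
        # ast.literal_eval only accepts literals, so no code is run; it raises
        # ValueError or SyntaxError if the text is not a literal either.
        return ast.literal_eval(cleaned.strip())
```

For example, `parse_model_output(outputs[1][1])` would return the same dict as parsing `out` directly, provided the response is valid JSON or a Python literal.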

In [17]:
# json.loads parses the model output as data; eval would execute it as code
print(json.loads(out))

{'metadata': {'division': 'Caryophyllaceae', 'page': '133'}, 'contents': [{'familyName': 'Caryophyllaceae', 'species': [{'speciesName': 'Arenaria serpyllifolia L.', 'folders': [{'description': '1 folder. Arenaria Serpyllifolia [TA].', 'citations': ['', '']}]}, {'speciesName': 'Bufonia tenuifolia L., Flora Europaea 1: 133 (1964', 'folders': [{'description': '1 folder. Bufonia Tenuifolia [TA].', 'citations': ['', '']}]}, {'speciesName': 'Cerastium alpinum L.', 'folders': [{'description': 'Folder 1. Cerastium latifolium [G]. i. "Cerastium alpinum. This was gathered upon Snowdon at the top of the highest Rock call\'d Clogwyn y Garnedh. June" [JL]. ii. "Top of Snowdon" [JL]. iv. "Dr. Solander affirms this to be the true Cerastium alpinum. I had it from Snowdon. He found it at Terra del Fuego & named it C. hirtum with a mark of Dubitation, but now now thinks them both one" [JL].', 'citations': ['', '']}, {'description': 'Folder 2. Cerastium alpinum [TA]; Cerastium latifolium [G]. i. "Clogwyn
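
The parsed page is deeply nested, which makes it awkward to inspect or compare across pages. The following is a minimal sketch that flattens one page into one row per folder entry. The key names (`metadata`, `contents`, `familyName`, `species`, `speciesName`, `folders`, `description`) are taken from the output above; since `simple_prompt` does not fix a schema, other pages may use different keys, so `flatten_catalogue` is a hypothetical helper that returns `None` for anything missing.

```python
def flatten_catalogue(parsed):
    """Turn one parsed page (a dict) into a list of flat rows, one per folder entry."""
    rows = []
    page = parsed.get("metadata", {}).get("page")
    for family in parsed.get("contents", []):
        for species in family.get("species", []):
            for folder in species.get("folders", []):
                rows.append({
                    "page": page,
                    "family": family.get("familyName"),
                    "species": species.get("speciesName"),
                    "description": folder.get("description"),
                })
    return rows


# A shortened sample using values from the printed output.
sample = {
    "metadata": {"division": "Caryophyllaceae", "page": "133"},
    "contents": [{
        "familyName": "Caryophyllaceae",
        "species": [{
            "speciesName": "Arenaria serpyllifolia L.",
            "folders": [{
                "description": "1 folder. Arenaria Serpyllifolia [TA].",
                "citations": ["", ""],
            }],
        }],
    }],
}
rows = flatten_catalogue(sample)
print(rows)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` to get a table with one column per key.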