## GPT4 Vision OCR implementation - Smriti

We modify the existing GPT4-V prompt so that the results conform to the Darwin Core standard, which is commonly used for biodiversity specimen records. To evaluate the labels returned by the LLM, we plan to use three fields: scientific name, locality/country, and collector name.
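
As a rough illustration of the three evaluation fields, the sketch below pulls them out of a parsed Darwin Core record. `scientificName`, `locality`, `country`, and `recordedBy` are standard Darwin Core terms; the helper name `evaluation_fields` is hypothetical and not part of the project code.

```python
# Darwin Core terms assumed to correspond to the three evaluation fields.
EVAL_TERMS = ("scientificName", "locality", "country", "recordedBy")

def evaluation_fields(record):
    """Return only the evaluation terms from a Darwin Core record (a dict).

    Missing terms come back as empty strings so every result has the same keys.
    """
    return {term: str(record.get(term, "")).strip() for term in EVAL_TERMS}

example = {
    "scientificName": "Monopyle maxima Morton",
    "recordedBy": "Ynes Mexia",
    "country": "Ecuador",
    "catalogNumber": "1622390",
}
print(evaluation_fields(example))
```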

In [1]:
pip install openai

Note: you may need to restart the kernel to use updated packages.


### Using the API, we send requests to the newly released GPT4-V model to transcribe the text in the TrOCR evaluation set images and save the JSON responses in a dictionary.

In [6]:
from openai import OpenAI
from collections import defaultdict
import base64
import requests
import os

# os.chdir("ml-herbarium-grp")
print(os.getcwd())
json_results = defaultdict()
too_big = []
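# OpenAI API key. `key` is not defined in this notebook; as an assumption,
# it is read here from the OPENAI_API_KEY environment variable.
api_key = os.environ["OPENAI_API_KEY"]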

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
# folder_path = "ml-herbarium-data/TROCR_Training/goodfiles/"
folder_path = "resized-images/"
image_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f)) and f.endswith(('jpg', 'png'))]

for img in image_files:
    if os.stat(img).st_size > 19000000:
        too_big.append(img)
    else:
        # Getting the base64 string
        base64_image = encode_image(img)
        
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        }
        
        payload = {
            "model": "gpt-4-vision-preview",
            "messages": [
              {
                "role": "user",
                "content": [
                  {
                    "type": "text",
                    "text": "Extract all the text, both typed and handwritten from this image and display it in a JSON format according to the Darwin Core standard for biodiversity specimen"
                  },
                  {
                    "type": "image_url",
                    "image_url": {
                      "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                  }
                ]
              }
            ],
            "max_tokens": 4096
        }
    
        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
        json_results[img] = response.json()
    
    # print(response.json())

/projectnb/sparkgrp/ml-herbarium-grp/fall2023
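
The request loop accepts both `.jpg` and `.png` files but labels every image as `image/jpeg` in the data URL. A minimal sketch of deriving the MIME type from the file extension is shown here; `build_data_url` is a hypothetical helper, and the JPEG fallback is an assumption.

```python
import base64
import mimetypes

def build_data_url(image_path, image_bytes):
    """Build a base64 data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        # Assumption: fall back to JPEG when the extension is not recognized.
        mime_type = "image/jpeg"
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

print(build_data_url("resized-images/example.png", b"abc"))
```

The returned string could be used as the `url` value of the `image_url` entry in the payload.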


#### Note: We observed that both Azure and GPT4-V limit the image file size to 20 MB, so we resized the images to under 4 MB each for easier processing by the APIs.
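
One way to choose a resize factor for the 4 MB target is sketched below. It assumes that file size is roughly proportional to pixel count, which is only an approximation for compressed formats such as JPEG, so the resized file should still be checked against the limit. `scale_for_target` is a hypothetical helper and not the resizing code used in the project.

```python
import math

def scale_for_target(size_bytes, target_bytes=4_000_000):
    """Linear scale factor (at most 1.0) to bring a file near target_bytes.

    Halving both width and height quarters the pixel count, so the linear
    factor is the square root of the size ratio.
    """
    if size_bytes <= target_bytes:
        return 1.0  # already small enough, no resize needed
    return math.sqrt(target_bytes / size_bytes)

print(scale_for_target(16_000_000))  # 0.5
```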

In [7]:
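# Keep a copy of the raw responses before they are overwritten below
copyjson = json_results.copy()
print(json_results.keys())
for l in list(json_results.keys()):
    # Drop failed requests (responses without a "choices" field)
    if "choices" not in json_results[l].keys():
        del json_results[l]
    else:
        # Keep only the model's text reply
        json_results[l] = json_results[l]['choices'][0]['message']['content']

print(len(json_results))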


dict_keys(['resized-images/1320398138.jpg', 'resized-images/1802552799.jpg', 'resized-images/1998322454.jpg', 'resized-images/2236142683.jpg', 'resized-images/2848499425.jpg', 'resized-images/2446828826.jpg', 'resized-images/2284257102.jpg', 'resized-images/2608680770.jpg', 'resized-images/2595747531.jpg', 'resized-images/3356834058.jpg', 'resized-images/2859042459.jpg', 'resized-images/3005750161.jpg', 'resized-images/1320488541.jpg', 'resized-images/1998358368.jpg', 'resized-images/1998413329.jpg', 'resized-images/1322099762.jpg', 'resized-images/1998836464.jpg', 'resized-images/2549603947.jpg', 'resized-images/1675940934.jpg', 'resized-images/1998540182.jpg', 'resized-images/1990825865.jpg', 'resized-images/2236176339.jpg', 'resized-images/1999330570.jpg', 'resized-images/3356803607.jpg', 'resized-images/2512789170.jpg', 'resized-images/1998481102.jpg', 'resized-images/2425404585.jpg', 'resized-images/3092906623.jpg', 'resized-images/2512801142.jpg', 'resized-images/2265485412.jpg',

In [5]:
print(json_results)

defaultdict(None, {'resized-images/1320398138.jpg': 'To display the extracted text in a JSON format according to the Darwin Core standard for biodiversity specimen records, the following fields have been identified and organized:\n\n```json\n{\n  "institutionCode": "US",\n  "collectionCode": "Botany",\n  "catalogNumber": "1622390",\n  "scientificName": "Monopyle maxima Morton",\n  "typeStatus": "Holotype",\n  "recordedBy": "Ynes Mexia",\n  "recordNumber": "7017",\n  "country": "Ecuador",\n  "stateProvince": "Zamora",\n  "locality": "Beyond Estacion Zamora",\n  "habitat": "Cloud forest",\n  "dateIdentified": "1945",\n  "eventDate": "22-26, 1935",\n  "decimalLatitude": "",\n  "decimalLongitude": "",\n  "minimumElevationInMeters": "500",\n  "maximumElevationInMeters": "900",\n  "verbatimElevation": "altitude 500-900 meters",\n  "occurrenceRemarks": "(Tree cover!)"\n}\n```\n\nPlease note that some standard Darwin Core fields such as `decimalLatitude` and `decimalLongitude` cannot be provid

In [8]:
print(len(too_big))

0


In [10]:
paths = list(json_results.keys())
# folder_path = "ml-herbarium-data/TROCR_Training/goodfiles/"
folder_path = "resized-images/"
img_names = [i.replace(folder_path, "") for i in paths]

# print(img_names)
final_dict = defaultdict()

for img in img_names:
    final_dict[img] = json_results[folder_path+img]

print(len(final_dict.keys()))

100


In [62]:
#TEST CODE
# lists = list(json_results.values())
# print(lists[0]['choices'])
# contents = [l['choices'][0]['message']['content'] for l in lists]


[{'message': {'role': 'assistant', 'content': '```json\n{\n  "Image Number": "00427028",\n  "Specimen ID": "1627083",\n  "Herbarium": "UNITED STATES NATIONAL MUSEUM",\n  "Flora": "Flora Hawaiiensis",\n  "Collected by": "C. N. Forbes on Oahu",\n  "Species Name": "Cheirodendron platyphyllum (Hook. & Arn.) Frodin",\n  "Collection Date": "Apr. 26 - May 6 - 1911",\n  "Accession Number": "No. 74318",\n  "Barcode of the Bishop Museum Herbarium": "Image No. 00427028"\n}\n```'}, 'finish_details': {'type': 'stop', 'stop': '<|fim_suffix|>'}, 'index': 0}]


In [63]:
# print(len(contents))

86


In [70]:
# print(contents[1])

```json
{
  "Herbarium Label": {
    "Scientific Name": "Cheirodendron trigynum (Gaud.) A. Heller var. helleri Sherff",
    "Collection Information": "Flora Hawaiianensis. Collected by C. N. Forbes on Oahu.",
    "Location": "Punaluu, Koolau Mts.",
    "Elevation": "Apl. 1-20" + "May 6 - 1914",
    "Collector Number": "XV: 5/18",
    "Barcode": "00427028"
  },
  "Institution Label": {
    "Institution": "UNITED STATES NATIONAL MUSEUM",
    "Specimen Number": "1627083"
  },
  "Imaging Label": {
    "Image Number": "Image No. 00427028"
  }
}
```
Please note that there could be slight inaccuracies in transcription due to the handwriting and the quality of the image. The elevation appears to be a range given with a mixture of dates (April 1-20, May 6, 1914) which is transcribed as given.


#### We save the model's output for each image as a txt file, because most transcriptions also include comments from the model that could be useful when evaluating the results.

In [12]:
print(os.getcwd())
for i in final_dict:
    # Swap the image extension for .txt; the context manager closes the file
    with open("gpt4v-resized-results/" + os.path.splitext(i)[0] + ".txt", "w") as f:
        f.write(final_dict[i])

/projectnb/sparkgrp/ml-herbarium-grp/fall2023
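
Because the saved text mixes model commentary with a fenced JSON block, evaluation will likely need to separate the two. The following is a minimal sketch under that assumption; `extract_json` is a hypothetical helper, and replies whose fenced block is not valid JSON (for example, a value written as two strings joined with `+`) simply yield `None`.

```python
import json
import re

# Matches a fenced block, optionally tagged "json"; `{3} means three backticks.
FENCED_BLOCK = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)

def extract_json(reply):
    """Return the first fenced JSON value in a model reply, or None."""
    match = FENCED_BLOCK.search(reply)
    # If there is no fence, try to parse the whole reply as JSON.
    candidate = match.group(1) if match else reply
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
reply = f'Here is the record:\n\n{fence}json\n{{"country": "Ecuador"}}\n{fence}\n\nPlease note ...'
print(extract_json(reply))
```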


### Observations from the results:

**TrOCR evaluation set**: The images were resized, and we obtained results for all 250 evaluation set images in the Darwin Core format the client requested. We are now working on evaluation metrics to compare the ground truth labels with the LLM-generated ones, and we plan to push the evaluation code this week.
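
The evaluation metric has not been chosen yet. Purely as an illustration of what a per-field comparison could look like, the sketch below scores normalized string similarity with Python's `difflib`. The function names, the choice of fields, and the use of `SequenceMatcher` are all assumptions, not the project's evaluation code.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(str(text).lower().split())

def field_similarity(truth, predicted):
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(truth), normalize(predicted)).ratio()

def score_record(truth, predicted, fields=("scientificName", "country", "recordedBy")):
    """Per-field similarity between a ground truth record and a predicted one."""
    return {f: field_similarity(truth.get(f, ""), predicted.get(f, "")) for f in fields}

truth = {"scientificName": "Monopyle maxima Morton", "country": "Ecuador", "recordedBy": "Ynes Mexia"}
predicted = {"scientificName": "monopyle  maxima Morton", "country": "Ecuador", "recordedBy": "Y. Mexia"}
print(score_record(truth, predicted))
```

A character-level ratio tolerates small transcription slips; an exact-match metric would be stricter and may suit fields such as country better.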