## GPT4 Vision OCR implementation - Smriti

In [1]:
# Required packages (run once)
# pip install openai requests

In [2]:
#import statements
from openai import OpenAI
from collections import defaultdict
import base64
import requests
import os

### Using the API, we ask the newly released GPT4-V model to transcribe the text in the TROCR evaluation set images and store the JSON responses in a dictionary.

In [3]:
# os.chdir("ml-herbarium-grp")
print(os.getcwd())
json_results = defaultdict()
too_big = []
# OpenAI API key: `key` is not defined in this notebook and must be set before running this cell
api_key = key

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
# folder_path = "ml-herbarium-data/TROCR_Training/goodfiles/"
folder_path = "resized-images/"
image_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f)) and f.endswith(('jpg', 'png'))]

for img in image_files:
    # Skip files above ~19 MB to stay under the 20 MB image size limit
    if os.stat(img).st_size > 19000000:
        too_big.append(img)
    else:
        # Getting the base64 string
        base64_image = encode_image(img)
        
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        }
        
        payload = {
            "model": "gpt-4-vision-preview",
            "messages": [
              {
                "role": "user",
                "content": [
                  {
                    "type": "text",
                    "text": "Extract all the text, both typed and handwritten from this image and display it in a JSON format according to the Darwin Core standard for biodiversity specimen"
                  },
                  {
                    "type": "image_url",
                    "image_url": {
                      "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                  }
                ]
              }
            ],
            "max_tokens": 4096
        }
    
        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
        json_results[img] = response.json()
    
    # print(response.json())

/projectnb/sparkgrp/ml-herbarium-grp/fall2023


#### Note: We observed that image files are limited to 20MB, so we could not pass the higher-resolution images (162 of them) to the model. A future step is to compress the larger files to fit within the GPT4-V size limit. We currently obtain results for 64 images from the evaluation set.
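
Because each request embeds the image as a base64 string, the encoded payload is roughly one third larger than the file on disk. As a rough sketch, the files could be partitioned by encoded size before any request is sent. The helper names `base64_size` and `split_by_size` are ours, and whether the 20MB limit applies to the raw or the encoded size is an assumption that should be verified against the API.

```python
import os

def base64_size(n_bytes):
    """Length in characters of the padded base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)

def split_by_size(paths, limit=19_000_000, size_of=os.path.getsize):
    """Partition paths into (ok, too_big) by base64-encoded size.

    limit mirrors the 19000000 threshold used in this notebook; applying it
    to the encoded size instead of the raw size is an assumption.
    """
    ok, too_big = [], []
    for p in paths:
        (too_big if base64_size(size_of(p)) > limit else ok).append(p)
    return ok, too_big
```

With the notebook's variables this would be called as `ok, too_big = split_by_size(image_files)`.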

In [5]:
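# Keep a copy of the raw responses, since the loop below overwrites or deletes entries
copyjson = json_results.copy()
# print(json_results.keys())
for l in list(json_results.keys()):
    # A response without "choices" did not produce a completion; drop it
    if "choices" not in json_results[l].keys():
        del json_results[l]
    else:
        # Keep only the text of the model's reply
        json_results[l] = json_results[l]['choices'][0]['message']['content']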

print(len(json_results))    


224


In [6]:
print(json_results)

defaultdict(None, {'resized-images/1998481102.jpg': 'To extract the text from this image and convert it to JSON format according to the Darwin Core standard, we would follow these steps:\n\n1. Extract any clearly readable text, including typed and handwritten annotations.\n2. Organize the extracted text into Darwin Core fields.\n\nThe image provided contains text which seems to relate to a herbarium specimen, so the fields selected must be relevant to the information typically recorded for herbarium specimens like scientific name, collector name, date collected, etc.\n\nHere\'s the textual information extracted from the image in JSON format with the best approximation of Darwin Core terms:\n\n```json\n{\n  "institutionCode": "NEBC",\n  "collectionCode": "Herbarium of Kate Furbish",\n  "catalogNumber": "<Unknown catalog number>",\n  "scientificName": "Artemisia stelleriana Bess. var. maxima (Rydb.) Cronq.",\n  "country": "USA",\n  "stateProvince": "Maine",\n  "county": "Washington Count
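
The response above wraps the structured data in a fenced `json` block surrounded by commentary. A minimal sketch for pulling out only the JSON part follows. The helper name `extract_json_block` is hypothetical, and the sketch assumes the model emits a single well-formed fenced block, which is not guaranteed.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the fence marker seen in the model output
PATTERN = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def extract_json_block(content):
    """Return the parsed JSON from the first fenced block in content, or None."""
    match = PATTERN.search(content)
    if match is None:
        return None  # no fenced block, e.g. a refusal or free-form reply
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # fenced text was present but is not valid JSON
```

Responses for which this returns `None` would still need manual review, so the txt files with the full replies remain useful.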

In [7]:
print(len(too_big))

0


In [8]:
paths = list(json_results.keys())
# folder_path = "ml-herbarium-data/TROCR_Training/goodfiles/"
folder_path = "resized-images/"
img_names = [i.replace(folder_path, "") for i in paths]

# print(img_names)
final_dict = defaultdict()

for img in img_names:
    final_dict[img] = json_results[folder_path+img]

print(len(final_dict.keys()))

224


In [9]:
#TEST CODE
# lists = list(json_results.values())
# print(lists[0]['choices'])
# contents = [l['choices'][0]['message']['content'] for l in lists]


In [10]:
# print(len(contents))

In [11]:
# print(contents[1])

#### We save the response for each image as a txt file, since most responses also contain comments from the model about the transcription that could be useful when evaluating the results.

In [12]:
print(os.getcwd())
for i in final_dict:
    # Swap only the file extension (jpg or png) for txt
    with open("gpt4v-resized-results/" + os.path.splitext(i)[0] + ".txt", "w") as f:
        f.write(final_dict[i])

/projectnb/sparkgrp/ml-herbarium-grp/fall2023


### To compare with results from Azure Vision, we use a subset of specimens from GBIF. These are low resolution, since Azure Vision does not accept images as large as those in the TROCR evaluation set. The code is the same as before except for the input folder and the prompt, which no longer mentions the Darwin Core standard.
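
Since the two runs differ only in the folder and the prompt, the request construction could be factored into a helper instead of being copied. This is a sketch, not the code that produced the results here. `build_payload` and `mime_type_for` are hypothetical names, and choosing the data URL media type from the file extension is our addition (the cells in this notebook always send `image/jpeg`).

```python
import base64
import os

def mime_type_for(path):
    """Guess the data URL media type from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    return "image/png" if ext == ".png" else "image/jpeg"

def build_payload(prompt, image_bytes, path,
                  model="gpt-4-vision-preview", max_tokens=4096):
    """Build the chat completions request body used in this notebook."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime_type_for(path)};base64,{b64}"}},
            ],
        }],
        "max_tokens": max_tokens,
    }
```

The returned dictionary can be passed as the `json=` argument of the same `requests.post` call used in the cells.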

In [13]:
# os.chdir("ml-herbarium-grp")
print(os.getcwd())
json_results = defaultdict()
too_big = []
# OpenAI API key: `key` is not defined in this notebook and must be set before running this cell
api_key = key

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
folder_path = "LLM_Specimens/"
image_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f)) and f.endswith(('jpg', 'png'))]

for img in image_files:
    # Skip files above ~19 MB to stay under the 20 MB image size limit
    if os.stat(img).st_size > 19000000:
        too_big.append(img)
    else:
        # Getting the base64 string
        base64_image = encode_image(img)
        
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        }
        
        payload = {
            "model": "gpt-4-vision-preview",
            "messages": [
              {
                "role": "user",
                "content": [
                  {
                    "type": "text",
                    "text": "Extract all the text, both typed and handwritten from this image and display it in a JSON format"
                  },
                  {
                    "type": "image_url",
                    "image_url": {
                      "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                  }
                ]
              }
            ],
            "max_tokens": 4096
        }
    
        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
        json_results[img] = response.json()
    
    # print(response.json())

/projectnb/sparkgrp/ml-herbarium-grp/fall2023


In [14]:
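# Keep a copy of the raw responses, since the loop below overwrites or deletes entries
copyjson = json_results.copy()
# print(json_results.keys())
for l in list(json_results.keys()):
    # A response without "choices" did not produce a completion; drop it
    if "choices" not in json_results[l].keys():
        del json_results[l]
    else:
        # Keep only the text of the model's reply
        json_results[l] = json_results[l]['choices'][0]['message']['content']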

print(len(json_results))    

0


In [15]:
paths = list(json_results.keys())
folder_path = "LLM_Specimens/"
img_names = [i.replace(folder_path, "") for i in paths]

# print(img_names)
final_dict = defaultdict()

for img in img_names:
    final_dict[img] = json_results[folder_path+img]

# print(final_dict.keys())

In [16]:
print(os.getcwd())
for i in final_dict:
    # Swap only the file extension (jpg or png) for txt
    with open("gpt4v-results-gbif/" + os.path.splitext(i)[0] + ".txt", "w") as f:
        f.write(final_dict[i])

/projectnb/sparkgrp/ml-herbarium-grp/fall2023


### Observations from the results:

1. **TROCR Evaluation set**: The field labels that GPT4-V assigns differ from specimen to specimen. Some specimens receive labels that do not appear in the image itself, because GPT4-V tries to assign an appropriate label to each piece of text it extracts into the JSON output. We will work on creating uniform labels for every specimen after discussion with the clients and the professor. The extracted handwritten text still contains inaccuracies, but we can confirm an accuracy measure only after the subject matter experts review the results folder. The model also adds comments that help assess the extraction, including cautions that certain phrases come from handwritten samples and may not be the best match. Overall, the results look promising for all the specimens. The folder with the extracted text is named **gpt4v-results**

2. **GBIF Evaluation set**: This smaller specimen set was formed to compare the results of GPT4-V and Azure Vision. Azure Vision cannot handle images of larger size and resolution, so we manually extracted specimens (both plant+label and label-only images) from the GBIF database to run through both models. Results were worse for GPT4-V: it did not understand our request for most of the specimens and extracted text for only a few of them. We are experimenting further, since this could be an effect of the lower-quality images used here or an API request issue on the provider's end. Updates on the Azure Vision part are in the Azure OCR notebook. The folder with the extracted text is named **gpt4v-results-gbif**
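
One way to separate an image-quality effect from a request problem is to inspect the responses that were dropped for lacking a `choices` key. The sketch below tallies them from the raw responses kept in `copyjson`. It assumes that failed requests carry an `error` object with a `message` field, which should be checked against the actual responses, and `summarize_failures` is a hypothetical helper.

```python
from collections import Counter

def summarize_failures(raw_responses):
    """Count error messages among responses that have no "choices" key."""
    counts = Counter()
    for response in raw_responses.values():
        if "choices" in response:
            continue  # successful completion, nothing to report
        error = response.get("error")
        # Assumed shape: {"error": {"message": "..."}}; fall back otherwise
        message = error.get("message") if isinstance(error, dict) else None
        counts[message or "unknown failure"] += 1
    return counts
```

Running `print(summarize_failures(copyjson).most_common())` after the filtering cell would show whether the failures share a single cause.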

In [17]:
c = 0   # number of requests the model refused
l = []  # file names of the refused requests
for i in final_dict:
    if final_dict[i] in ("I'm sorry, but I cannot assist with that request.", "I'm sorry, but I can't assist with that request."):
        c += 1
        l.append(i)

### The following files could not be processed by GPT4-V

In [18]:
print(c, l)

0 []
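
The exact-match check above only catches two specific wordings. A more tolerant heuristic is sketched below. `is_refusal` is a hypothetical helper, and both the "I'm sorry" prefix and the 200-character cutoff are assumptions that may misclassify some replies.

```python
def is_refusal(text):
    """Heuristic: treat short replies that open with an apology as refusals."""
    # Normalize curly apostrophes and case so wording variants still match
    normalized = text.strip().replace("\u2019", "'").lower()
    return normalized.startswith("i'm sorry") and len(normalized) < 200
```

It could replace the condition in the loop as `if is_refusal(final_dict[i]):`.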
