# Herbaria Image Processing Pipeline

## Overview

This notebook is one stage of our herbaria image processing pipeline. It automates the extraction and translation of text from user-uploaded images of herbarium specimens. The extracted and translated text feeds the later data sorting and analysis stages, which turn it into specimen information for researchers and enthusiasts.

## Features

### Batch Processing
- **File Upload and Extraction**: Users are prompted to upload a ZIP file containing JPEG/JPG images of herbarium specimens. The notebook extracts the archive to a designated directory and collects every image file found there for processing.
- **Text Extraction and Translation**: Google Cloud Document AI performs optical character recognition (OCR) on each image to extract the label text. The extracted text is then translated into English (or another specified language) with Google Cloud Translation, so that it can be used in further analysis.
- **Data Aggregation**: Results, including the filename, extracted text, and translated text from each image, are compiled into a DataFrame. This structured format facilitates easy review and downstream processing.
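
The upload-and-extraction step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's exact code: `extract_images` and `build_demo_zip` are hypothetical helper names, the archive contents are placeholder bytes rather than real images, and extracting into a fresh temporary directory is an assumption made here so that files from an earlier upload cannot leak into a later run.

```python
import io
import os
import tempfile
import zipfile

IMAGE_EXTENSIONS = (".jpeg", ".jpg")


def extract_images(zip_bytes: bytes, dest_dir: str) -> list[str]:
    """Extract a ZIP archive into dest_dir and return the sorted JPEG paths."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(dest_dir)

    image_paths = []
    for root, _, filenames in os.walk(dest_dir):
        for filename in filenames:
            # Skip hidden files such as the "._*" entries macOS adds to archives.
            if filename.startswith("."):
                continue
            if filename.lower().endswith(IMAGE_EXTENSIONS):
                image_paths.append(os.path.join(root, filename))
    return sorted(image_paths)


def build_demo_zip() -> bytes:
    """Build a small in-memory archive with placeholder bytes (not real images)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("specimens/02334119.jpg", b"placeholder")
        archive.writestr("specimens/02334120.JPG", b"placeholder")
        archive.writestr("specimens/notes.txt", b"not an image")
        archive.writestr("__MACOSX/specimens/._02334119.jpg", b"metadata")
    return buffer.getvalue()


# A temporary directory gives every run a clean extraction target.
with tempfile.TemporaryDirectory() as workdir:
    found = extract_images(build_demo_zip(), workdir)
    found_names = [os.path.basename(path) for path in found]
    print(found_names)
```

The extension check is case-insensitive, so `.JPG` files are picked up, while text files and hidden metadata entries are ignored.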

## Usage
To use this notebook:
1. Ensure that the Google Cloud services (Document AI and Translation API) are properly configured with the appropriate credentials.
2. Upload a ZIP file containing the herbaria images when prompted.
3. Review the output DataFrame displayed at the end of the notebook for extracted and translated texts.
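
Step 1 can be checked before any API call is made. The sketch below assumes the credentials are supplied through an environment variable that holds the path to a key file, as the notebook does with `GOOGLE_APPLICATION_CREDENTIALS`. `check_credentials` is a hypothetical helper, and the demo deliberately uses a separate variable, `DEMO_CREDENTIALS`, with an empty throwaway file so that no real configuration is touched.

```python
import os
import tempfile


def check_credentials(env_var: str = "GOOGLE_APPLICATION_CREDENTIALS") -> str:
    """Return the key file path stored in env_var, or raise if it is unset or missing."""
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(f"{env_var} is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{env_var} points to a missing file: {path}")
    return path


# Demo: an empty temporary file stands in for a real service account key.
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as handle:
    demo_key_path = handle.name

os.environ["DEMO_CREDENTIALS"] = demo_key_path
checked_path = check_credentials("DEMO_CREDENTIALS")
print(f"Credentials file found: {checked_path}")

# Clean up the throwaway file.
os.remove(demo_key_path)
```

The check only confirms that the file exists. Whether the key is valid and has access to Document AI and the Translation API is only known once the first request is made.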

# Import Libraries

In [None]:
# Install the required Google Cloud client libraries
!pip install google-cloud-documentai google-cloud-storage Pillow google-cloud-translate

Collecting google-cloud-documentai
  Downloading google_cloud_documentai-2.25.0-py2.py3-none-any.whl (308 kB)
Installing collected packages: google-cloud-documentai
Successfully installed google-cloud-documentai-2.25.0


In [None]:
import os
# Upload the service account key (JSON) to the Colab session first, then point the client libraries at it
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "herbaria-ai-3c860bcb0f44.json"

# Single File Upload

In [None]:
# Import necessary libraries
from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai  # v1 API; must match the version used for RawDocument below
from google.cloud.documentai_v1.types import RawDocument
from google.colab import files
from google.cloud import translate_v2 as translate
import io

# Set your Google Cloud Document AI processor details here
project_id = "herbaria-ai"
location = "us"
processor_id = "4307b078717a399a"

# Uses the Google Cloud Translation API to translate the extracted text (partly Chinese in this dataset) into the target language
def translate_text(text, target_language="en"):
    """Translates text into the target language.

    Target language must be an ISO 639-1 language code.
    See https://cloud.google.com/translate/docs/languages for a list of available languages.
    """
    translate_client = translate.Client()
    result = translate_client.translate(text, target_language=target_language)
    return result["translatedText"]

# Sends one uploaded image to the Document AI processor, prints the extracted text, then prints its translation.
# The MIME type is supplied by the caller; the file type is not validated here.
def batch_process_documents(file_stream: io.BytesIO, file_mime_type: str) -> None:
    """Process a single document uploaded by the user in Google Colab."""
    # Setup client options and create the Document AI client
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # Prepare the raw document for processing
    file_stream.seek(0)  # Ensure the file stream is at the start
    raw_document = RawDocument(content=file_stream.read(), mime_type=file_mime_type)

    # The full resource name of the processor
    name = client.processor_path(project_id, location, processor_id)

    # Process the document
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)

    # Display the document text
    print("The document contains the following text:")
    print(result.document.text)

    # Translate the extracted text to English
    translated_text = translate_text(result.document.text)
    print("\nTranslated text:")
    print(translated_text)

def main():
    print("Please upload a JPEG/JPG file to process:")
    uploaded_files = files.upload()  # This will prompt the user to upload a file

    for filename, file_content in uploaded_files.items():
        print(f"Processing file: {filename}")
        file_stream = io.BytesIO(file_content)
        batch_process_documents(file_stream, "image/jpeg")

if __name__ == "__main__":
    main()


Please upload a JPEG/JPG file to process:


Saving 02333972.jpg to 02333972.jpg
Processing file: 02333972.jpg
The document contains the following text:
Chinese National Herbarium (PE)
Plants of Xizang
CHINA, Xizang, Lhoka City, Lhozhag County, Lhakang
Town, Kharchhu Gompa vicinity
西藏自治区山南市洛扎县拉康镇卡久寺附近
28°5'37.15"N, 91°7'24.74"E; 3934 m
Trees. Slopes near roadsides.
PE-Xizang Expedition #PE6679
14 September 2017
9
w
NOI 中国数字植物標本館
4 5 6 7 8 9 10
N? 251176
中国科学院
植物研究所
标本馆
CHINESE NATIONAL HERBARIUM (PE)
PE
02333972
西藏
TIBET
#PE66 6679
BETULACEAE 桦木科
Betula utilis D.Don 糙皮桦
鉴定人: 陈之端 Zhi-duan CHEN
2 Jan. 2019


Translated text:
Chinese National Herbarium (PE) Plants of Xizang CHINA, Xizang, Lhoka City, Lhozhag County, Lhakang Town, Kharchhu Gompa vicinity 28°5&#39;37.15&quot;N, 91°7 &#39;24.74&quot;E; 3934 m Trees. Slopes near roadsides. PE-Xizang Expedition #PE6679 14 September 2017 9 w NOI China Digital Herbarium 4 5 6 7 8 9 10 N? 251176 CHINESE NATIONAL HERBARIUM, Institute of Botany, Chinese Academy of Sciences (PE) PE 02333972 Ti
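
The translated output above contains HTML entities such as `&#39;` and `&quot;` in place of the apostrophes and quotation marks in the coordinates. One way to clean this up after the fact, shown here as a minimal sketch, is to decode the entities with the standard library. `clean_translation` is a hypothetical helper name, and the sample string is copied from the output above.

```python
import html


def clean_translation(translated_text: str) -> str:
    """Decode HTML entities such as &#39; and &quot; into plain characters."""
    return html.unescape(translated_text)


sample = "28°5&#39;37.15&quot;N, 91°7 &#39;24.74&quot;E; 3934 m"
cleaned = clean_translation(sample)
print(cleaned)
```

The v2 client's `translate` method also accepts a `format_` argument. Passing `format_="text"` is expected to avoid the escaping at the source, although that has not been verified against this notebook.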

# Batch Processing

In [None]:
import pandas as pd
from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai
from google.cloud.documentai_v1.types import RawDocument
from google.cloud import translate_v2 as translate
from google.colab import files
import zipfile
import os
import io

# Global DataFrame declaration
results_df = pd.DataFrame(columns=["Filename", "Extracted Text", "Translated Text"])

# Set your Google Cloud Document AI processor details here
project_id = "herbaria-ai"
location = "us"
processor_id = "4307b078717a399a"

def translate_text(text, target_language="en"):
    translate_client = translate.Client()
    result = translate_client.translate(text, target_language=target_language)
    return result["translatedText"]

def batch_process_documents(file_path: str, file_mime_type: str) -> tuple:
    """Run OCR on one image file and return (extracted_text, translated_text)."""
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    with open(file_path, "rb") as file_stream:
        raw_document = RawDocument(content=file_stream.read(), mime_type=file_mime_type)

    name = client.processor_path(project_id, location, processor_id)
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)

    extracted_text = result.document.text
    translated_text = translate_text(extracted_text)
    return extracted_text, translated_text

def find_images(directory, extensions=('.jpeg', '.jpg')):
    """Yield paths of non-hidden image files found anywhere under directory."""
    for root, _, filenames in os.walk(directory):
        for filename in filenames:
            if filename.lower().endswith(extensions) and not filename.startswith('.'):
                yield os.path.join(root, filename)

def main():
    global results_df
    results_df = results_df.iloc[0:0]  # Clear the DataFrame if re-running this cell

    print("Please upload a zip file containing JPEG/JPG files to process:")
    uploaded_files = files.upload()

    for filename in uploaded_files.keys():
        print(f"Extracting {filename}...")
        with zipfile.ZipFile(io.BytesIO(uploaded_files[filename]), 'r') as zip_ref:
            zip_ref.extractall("extracted_files")

        image_files = list(find_images("extracted_files"))
        print(f"Found {len(image_files)} image files for processing.")

        for file_path in image_files:
            try:
                print(f"Processing {os.path.basename(file_path)}...")
                extracted_text, translated_text = batch_process_documents(file_path, "image/jpeg")
                new_row = pd.DataFrame([{
                    "Filename": os.path.basename(file_path),
                    "Extracted Text": extracted_text,
                    "Translated Text": translated_text
                }])
                results_df = pd.concat([results_df, new_row], ignore_index=True)
            except Exception as e:
                print(f"An error occurred while processing {file_path}: {e}")

if __name__ == "__main__":
    main()


Please upload a zip file containing JPEG/JPG files to process:


Saving batch_processing.zip to batch_processing (7).zip
Extracting batch_processing (7).zip...
Found 12 image files for processing.
Processing 02334125.jpg...
Processing 02334129.jpg...
Processing 02334128.jpg...
Processing 02334122.jpg...
Processing 02334123.jpg...
Processing 02334130.jpg...
Processing 02334126.jpg...
Processing 02334124.jpg...
Processing 02334119.jpg...
Processing 02334121.jpg...
Processing 02334127.jpg...
Processing 02334120.jpg...
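
In the batch loop above, a file that fails is reported on screen but leaves no trace in the results, and the DataFrame is rebuilt with `pd.concat` on every iteration. A possible alternative, sketched here under stated assumptions, collects one dictionary per file (failures included) so that the table can be built once at the end. `collect_results`, the extra `Error` column, and `fake_process` are hypothetical; `fake_process` stands in for the real OCR and translation call so that the example runs without any cloud access.

```python
from typing import Callable


def collect_results(
    file_paths: list[str],
    process: Callable[[str], tuple[str, str]],
) -> list[dict]:
    """Run process on each path and return one row per file, failures included."""
    rows = []
    for path in file_paths:
        row = {"Filename": path, "Extracted Text": "", "Translated Text": "", "Error": ""}
        try:
            row["Extracted Text"], row["Translated Text"] = process(path)
        except Exception as exc:
            # Record the failure and continue; one bad image should not stop the batch.
            row["Error"] = str(exc)
        rows.append(row)
    return rows


def fake_process(path: str) -> tuple[str, str]:
    """Hypothetical stand-in for the OCR + translation call; fails on one file."""
    if path == "broken.jpg":
        raise ValueError("unreadable image")
    return "桦木科", "Betulaceae"


rows = collect_results(["02334119.jpg", "broken.jpg"], fake_process)
for row in rows:
    print(row)

# In the notebook, pd.DataFrame(rows) would then build the full table in one step.
```

Keeping failed files in the table makes it easy to see afterwards which images need to be re-run.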


In [None]:
print(results_df)

        Filename                                     Extracted Text  \
0   02334125.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   
1   02334129.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   
2   02334128.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   
3   02334122.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   
4   02334123.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   
5   02334130.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   
6   02334126.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   
7   02334124.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   
8   02334119.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   
9   02334121.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   
10  02334127.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   
11  02334120.jpg  Chinese National Herbarium (PE)\nPlants of Xiz...   

                                      Translated Text  
0   Chinese National

In [None]:
from google.colab import files

# Save the DataFrame to a CSV file
csv_filename = "output_data.csv"
results_df.to_csv(csv_filename, index=False)

# Trigger the download
files.download(csv_filename)

