### This process works well for Hymenoptera figures, or other figures with captions right next to them

This notebook compares two ways to generate image captions: BLIP-2 and GPT. As an example, five of our images are currently hosted in an S3 bucket. We start with a side-by-side comparison on a single image.

In [None]:
from IPython.display import Image

# Display the image from a URL
Image(url="https://ccber-tester-bucket.s3.us-east-1.amazonaws.com/hymenoptera-gena.png")

### Blip2 (Free, but not good!)

In [None]:
import requests
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch

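url = 'https://ccber-tester-bucket.s3.us-east-1.amazonaws.com/hymenoptera-gena.png'
# Download the image and convert it to RGB (three channels)
image = Image.open(requests.get(url, stream=True).raw).convert('RGB')
# Note: `resized` is not used by the cells below, which pass `image` to the processor
resized = image.resize((596, 437))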

In [None]:
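# Load the BLIP-2 processor and model (the weights are roughly 15 GB on first download)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
# float16 halves memory use and is intended for GPU; on CPU, half precision may be slow or unsupported
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)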


Blip2ForConditionalGeneration(
  (vision_model): Blip2VisionModel(
    (embeddings): Blip2VisionEmbeddings(
      (patch_embedding): Conv2d(3, 1408, kernel_size=(14, 14), stride=(14, 14))
    )
    (encoder): Blip2Encoder(
      (layers): ModuleList(
        (0-38): 39 x Blip2EncoderLayer(
          (self_attn): Blip2Attention(
            (dropout): Dropout(p=0.0, inplace=False)
            (qkv): Linear(in_features=1408, out_features=4224, bias=True)
            (projection): Linear(in_features=1408, out_features=1408, bias=True)
          )
          (layer_norm1): LayerNorm((1408,), eps=1e-06, elementwise_affine=True)
          (mlp): Blip2MLP(
            (activation_fn): GELUActivation()
            (fc1): Linear(in_features=1408, out_features=6144, bias=True)
            (fc2): Linear(in_features=6144, out_features=1408, bias=True)
          )
          (layer_norm2): LayerNorm((1408,), eps=1e-06, elementwise_affine=True)
        )
      )
    )
    (post_layernorm): LayerNorm((

In [None]:
prompt = "This is"

inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=300)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)

This is a diagram showing how to use the word "genital"


### GPT (Cheap, and super good!)

In [None]:
from openai import OpenAI
from google.colab import userdata

# Read the OpenAI API key from Colab's secret store (saved under 'openai_api_key')
api_key = userdata.get('openai_api_key')

client = OpenAI(api_key=api_key)


In [None]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://ccber-tester-bucket.s3.us-east-1.amazonaws.com/hymenoptera-gena.png",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)

The image contains a textual description and diagrams related to the concept of "gena," which refers to the cheek area of an insect's head. It describes the location of the gena in relation to features like the compound eye and the occipital carina. The diagrams illustrate different head shapes, including hypognathous and prognathous heads.


Great, so GPT is clearly better. It should also cost only about $1 to generate captions for all of our diagrams, so it is definitely worth it!
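
The cost figure can be sanity-checked with a little arithmetic. The sketch below is a rough estimator, not an official calculator: the function name is made up for this notebook, and the prices and token counts in the example call are placeholder values (assumptions, not quoted OpenAI prices). In practice, the token counts can be read from `response.usage.prompt_tokens` and `response.usage.completion_tokens` on a real response, and the per-million-token prices from the current pricing page.

```python
def estimate_cost_usd(prompt_tokens, completion_tokens, n_images,
                      usd_per_million_input, usd_per_million_output):
    """Estimate the total cost of captioning n_images images.

    Assumes every image uses about the same number of tokens as the
    single measured request, which is only an approximation.
    """
    per_image = (prompt_tokens * usd_per_million_input
                 + completion_tokens * usd_per_million_output) / 1_000_000
    return per_image * n_images


# Hypothetical numbers, for illustration only
estimate = estimate_cost_usd(
    prompt_tokens=1000,          # assumed tokens per request (image + prompt)
    completion_tokens=100,       # assumed tokens per caption
    n_images=200,
    usd_per_million_input=1.0,   # placeholder price, not a real quote
    usd_per_million_output=2.0,  # placeholder price, not a real quote
)
print(f"Estimated total: ${estimate:.2f}")
```

Image inputs are billed as prompt tokens, so measuring one real request is likely more reliable than guessing the token count.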

### So, how do we map this process over 200+ photos?

We are hosting our photos in an S3 bucket using AWS. This option is effectively free, and makes it very easy to interact with our photos (both for caption generation, and for image rendering for our chatbot).

In [None]:

import boto3
from getpass import getpass

# Securely prompt for AWS credentials
aws_access_key_id = getpass("Enter your AWS access key ID: ")
aws_secret_access_key = getpass("Enter your AWS secret access key: ")

# Create an S3 client with the entered credentials
s3_client = boto3.client(
    "s3",
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key
)

BUCKET_NAME = "ccber-tester-bucket"  # Replace with your actual bucket name

# Function to list all object URLs in the bucket
def list_s3_object_urls(bucket_name):
    object_urls = []

    # List objects in the bucket
    response = s3_client.list_objects_v2(Bucket=bucket_name)

    if "Contents" in response:
        for obj in response["Contents"]:
            object_key = obj["Key"]
            object_url = f"https://{bucket_name}.s3.amazonaws.com/{object_key}"
            object_urls.append(object_url)

    return object_urls

# Retrieve and print object URLs
object_urls = list_s3_object_urls(BUCKET_NAME)

print("\nList of Object URLs:")
for url in object_urls:
    print(url)


Enter your AWS access key ID: ··········
Enter your AWS secret access key: ··········

List of Object URLs:
https://ccber-tester-bucket.s3.amazonaws.com/MMD-wings-1.png
https://ccber-tester-bucket.s3.amazonaws.com/MMD-wings-2.png
https://ccber-tester-bucket.s3.amazonaws.com/hymenoptera-frontal-carina.png
https://ccber-tester-bucket.s3.amazonaws.com/hymenoptera-gena.png
https://ccber-tester-bucket.s3.amazonaws.com/hymenoptera-torulus.png
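
One caveat on the listing above: `list_objects_v2` returns at most 1,000 keys per call, so a larger bucket would be silently truncated. The following is a minimal sketch of a paginated variant, assuming the same URL pattern as above. The function name, the extension filter, and the URL-encoding of keys are additions for illustration, not part of the original workflow.

```python
from urllib.parse import quote

# Assumed set of image extensions; adjust to match the bucket's contents
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def list_s3_image_urls(s3_client, bucket_name, extensions=IMAGE_EXTENSIONS):
    """Return URLs for every image object in the bucket, across all pages."""
    paginator = s3_client.get_paginator("list_objects_v2")
    urls = []
    for page in paginator.paginate(Bucket=bucket_name):
        # "Contents" is absent when a page (or the bucket) has no objects
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.lower().endswith(extensions):
                # Percent-encode spaces and other special characters in the key
                urls.append(f"https://{bucket_name}.s3.amazonaws.com/{quote(key)}")
    return urls
```

With the client created earlier, this would be called as `list_s3_image_urls(s3_client, BUCKET_NAME)`.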


In [None]:
from IPython.display import Image, display
for url in object_urls:
  # Display the image from a URL
  display(Image(url=url))
  # Generate caption of image
  response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": url,
                    },
                },
            ],
        }
    ],
    max_tokens=300,
  )

  caption = response.choices[0].message.content

  print(f"\n🔗 Image URL: {url}\n📝 Caption: {caption}\n" + "-"*80)


🔗 Image URL: https://ccber-tester-bucket.s3.amazonaws.com/MMD-wings-1.png
📝 Caption: The image appears to be an anatomical diagram of a wing, likely from an insect, showing various veins and sections labeled with their respective names. Terms like "Rs," "R," "Cu," and others correspond to different parts of the wing's venation system. This type of diagram is often used in entomology to study wing structures and classifications.
--------------------------------------------------------------------------------



🔗 Image URL: https://ccber-tester-bucket.s3.amazonaws.com/MMD-wings-2.png
📝 Caption: The image appears to depict a diagram of an insect wing, likely highlighting its venation system. Various lines and labels indicate different veins such as R (radius), M (media), Cu (cubitus), and others, which are essential for understanding the wing structure and its function. This type of illustration is commonly found in entomological studies or scientific literature related to insect morphology.
--------------------------------------------------------------------------------



🔗 Image URL: https://ccber-tester-bucket.s3.amazonaws.com/hymenoptera-frontal-carina.png
📝 Caption: The image is a labeled diagram illustrating the concept of the "frontal carina." It shows a pair of longitudinal ridges on the frontal area between the toruli. The diagram provides a visual representation of these anatomical features, which are often relevant in entomology or related biological fields.
--------------------------------------------------------------------------------



🔗 Image URL: https://ccber-tester-bucket.s3.amazonaws.com/hymenoptera-gena.png
📝 Caption: The image contains text and illustrations explaining the term "gena," which refers to the cheek area in certain insects. It defines the lateral part of the head between the compound eye and, when present, the occipital carina. The illustrations depict different head structures, labeled to show the position of the gena in relation to other anatomical features.
--------------------------------------------------------------------------------



🔗 Image URL: https://ccber-tester-bucket.s3.amazonaws.com/hymenoptera-torulus.png
📝 Caption: The image depicts a diagram illustrating the concept of "torulus." It labels the paired sockets located on the front of the head where the scape (a part of an insect's antenna) is articulated. Arrows point to specific areas, likely indicating the locations of the toruli on the diagram. The text provides a definition for the term "torulus" and mentions its plural form, "toruli."
--------------------------------------------------------------------------------
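
The loop above only prints the captions, so they are lost when the session ends. The sketch below shows one way to keep them, assuming a JSON file is an acceptable store. `caption_all`, `caption_fn`, and the `captions.json` path are hypothetical names introduced here; `caption_fn` stands for any function that takes an image URL and returns a caption string, such as a thin wrapper around the `client.chat.completions.create` call used above.

```python
import json


def caption_all(urls, caption_fn, out_path="captions.json"):
    """Caption every URL, save the results as JSON, and report failures.

    One failed request (for example a timeout) is recorded and skipped
    rather than aborting the whole run.
    """
    captions, failures = {}, {}
    for url in urls:
        try:
            captions[url] = caption_fn(url)
        except Exception as exc:  # broad on purpose: keep going on any error
            failures[url] = str(exc)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(captions, f, indent=2, ensure_ascii=False)
    return captions, failures
```

Returning the failures separately makes it easy to retry just those URLs instead of paying to caption the whole set again.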


### But what if we embed the images directly?