In [2]:
pip install krixik

Note: you may need to restart the kernel to use updated packages.





In [3]:
import sys 
sys.path.append('..')
from dotenv import load_dotenv
import os
load_dotenv()

LUCAS_STAGING_API_KEY=os.getenv('LUCAS_STAGING_API_KEY')
LUCAS_STAGING_API_URL=os.getenv('LUCAS_STAGING_API_URL')

# import Krixik
from krixik import krixik
krixik.init(api_key = LUCAS_STAGING_API_KEY, 
            api_url = LUCAS_STAGING_API_URL)

import json
def json_print(data):
    print(json.dumps(data, indent=2))

%load_ext autoreload
%autoreload 2 

SUCCESS: You are now authenticated.


---


# Testing Image Caption Models Through Basic Image Transformations

[Image caption](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/) models take image input and generate a textual caption for each image. Each model is trained on an enormous trove of image/caption pairs, enough for the model to "learn" to associate image substructures with words and groups of words. In theory, the larger and more varied the training set of images, the more accurate and flexible the model will be when identifying the components of an input image and generating a caption accordingly.

In this article we'll analyze how much we can "transform" a relatively straightforward image before the caption model no longer recognizes its contents and gives an incorrect or downright [hallucinatory](https://www.ibm.com/topics/ai-hallucinations) caption. To keep things simple, we'll stick to basic transformations:

- Vertical flip
- Horizontal flip
- Color inversion
- Combination of the above

The base image for this experiment will be one of friends sitting around a table, drinking coffee and having a good time.
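The article doesn't show how the transformed files were produced. One way to generate them is with the Pillow library; the sketch below is an assumption about tooling, not necessarily the method used here. `make_variants` and `save_variants` are hypothetical helpers, and the output naming scheme is a guess modeled on the one transformed file name that appears in this article (`coffee-with-friends - horizontal.jpg`).

```python
from pathlib import Path

from PIL import Image, ImageOps  # Pillow; assumed tooling, not confirmed by the article


def make_variants(image):
    """Return the four transformed versions of a PIL image, keyed by a short name."""
    rgb = image.convert("RGB")  # work on an RGB copy so color inversion is well defined
    return {
        "horizontal": ImageOps.mirror(rgb),  # flip left to right
        "vertical": ImageOps.flip(rgb),      # flip top to bottom
        "inverted": ImageOps.invert(rgb),    # each channel value v becomes 255 - v
        "combined": ImageOps.invert(ImageOps.flip(ImageOps.mirror(rgb))),
    }


def save_variants(source_path, output_dir):
    """Save each variant as '<stem> - <name>.jpg' (the naming scheme is an assumption)."""
    source_path, output_dir = Path(source_path), Path(output_dir)
    with Image.open(source_path) as image:
        variants = make_variants(image)
    saved = []
    for name, variant in variants.items():
        target = output_dir / f"{source_path.stem} - {name}.jpg"
        variant.save(target)
        saved.append(target)
    return saved


# Example call (paths are illustrative):
# save_variants('./test_files/coffee-with-friends.jpg', './test_files')
```

Keeping the in-memory transformation (`make_variants`) separate from file handling (`save_variants`) makes it easy to add further transformations later without touching the saving logic.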

All currently available models will be put to the test:
- [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) ([default](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module))
- [git-base](https://huggingface.co/microsoft/git-base)
- [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
- [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)

We'll need just one [Krixik](https://krixik-docs.readthedocs.io/en/latest/) [pipeline](https://krixik-docs.readthedocs.io/en/latest/system/pipeline_creation/components_of_a_krixik_pipeline/) for this entire exercise: a [single-module pipeline](https://krixik-docs.readthedocs.io/en/latest/examples/single_module_pipelines/single_caption/) with a [`caption`](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/) module in it. This is because we can determine the model we use on the pipeline every time we [`.process`](https://krixik-docs.readthedocs.io/en/latest/system/parameters_processing_files_through_pipelines/process_method/) a file through it; it's the sort of welcome flexibility that Krixik offers.

We [create](https://krixik-docs.readthedocs.io/en/latest/system/pipeline_creation/create_pipeline/) our pipeline thus:

In [4]:
# instantiate a single-module pipeline with an image caption module
pipeline_1 = krixik.create_pipeline(name='my_caption_pipeline',
                                    module_chain=['caption'])

### Base Image Captions

Before transforming the image, let's see what base caption the models provide.

Given that the first model in our list, [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning), is the [default](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module) model for the [`caption`](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/) module, we don't have to specify it when [processing](https://krixik-docs.readthedocs.io/en/latest/system/parameters_processing_files_through_pipelines/process_method/) our image through it. Using the pipeline we just created, the line of code for this is thus very simple:

In [7]:
# generate a base caption for our image with the module's default model
pipeline_1.process(local_file_path='./test_files/coffee-with-friends.jpg')

INFO: hydrated input modules: {'module_1': {'model': 'vit-gpt2-image-captioning', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_asxzzozsdt.jpg
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Fri Jun  7 23:04:20 2024 UTC
INFO: my_caption_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: d54a977c-d72b-3be4-f627-0dd0940f4050
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_caption_pipeline',
 'request_id': '4fe940f8-e739-47bb-bb93-8964d577d3c8',
 'file_id': '5ada2b74-bdf8-49a7-ba51-d30db8f52bc9',
 'message': 'SUCCESS - output fetched for file_id 5ada2b74-bdf8-49a7-ba51-d30db8f52bc9.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'caption': 'people sitting around a table'}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/5ada2b74-bdf8-49a7-ba51-d30db8f52bc9.json']}

The caption that was generated is:

> people sitting around a table

Which, if low on detail, is perfectly accurate.

Now let's see what the second model on our list, [git-base](https://huggingface.co/microsoft/git-base), returns. Since that's not the [default](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module) model for the [`caption`](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/) module, we'll have to [specify](https://krixik-docs.readthedocs.io/en/latest/system/parameters_processing_files_through_pipelines/process_method/#selecting-models-via-the-modules-argument) it in the code:

In [8]:
# generate a base caption for our image with the git-base model active
pipeline_1.process(local_file_path='./test_files/coffee-with-friends.jpg',
                   modules={'caption': {'model': 'git-base', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'git-base', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_jlslvhpjui.jpg
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Fri Jun  7 23:13:59 2024 UTC
INFO: my_caption_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 06129d66-6bf4-6090-b9bf-b242045c5d2c
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_caption_pipeline',
 'request_id': 'c4b7dba8-dea9-48ec-934e-8c1d110e616f',
 'file_id': 'b1e25b66-72c7-4238-a0e5-bdb710d1d84f',
 'message': 'SUCCESS - output fetched for file_id b1e25b66-72c7-4238-a0e5-bdb710d1d84f.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'caption': 'man wearing a white shirt'}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/b1e25b66-72c7-4238-a0e5-bdb710d1d84f.json']}

The new caption says:

> man wearing a white shirt

Although that's not what we're aiming for, the caption is not inaccurate. The central figure in the photograph of friends having coffee around a table is indeed a man in a white shirt. Not inaccurate, but not complete either. Model accuracy is not what we're here to test, though, so let's move on.

We'll omit the other two code blocks for brevity and present the full set of four captions generated for our base image:

- [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) ([default](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module)) - 'people sitting around a table'
- [git-base](https://huggingface.co/microsoft/git-base) - 'man wearing a white shirt'
- [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base) - 'group of friends having a coffee break'
- [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large) - 'several people sitting at a table with cups of coffee and smiling'
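Rather than running one cell per model, the four calls can be wrapped in a loop. The sketch below reuses only the `.process` arguments and the output structure shown in the cells above; `caption_with_all_models` and `CAPTION_MODELS` are hypothetical names introduced for illustration. It also assumes that the default model can be named explicitly in the `modules` argument, which the "hydrated input modules" log line suggests but which we haven't verified.

```python
# Model names as listed in this article.
CAPTION_MODELS = [
    "vit-gpt2-image-captioning",
    "git-base",
    "blip-image-captioning-base",
    "blip-image-captioning-large",
]


def caption_with_all_models(pipeline, file_path, models):
    """Process one image once per model and return {model_name: caption}."""
    captions = {}
    for model in models:
        output = pipeline.process(
            local_file_path=file_path,
            modules={"caption": {"model": model, "params": {}}},
        )
        # The caption sits in the first element of 'process_output',
        # as seen in the outputs printed above.
        captions[model] = output["process_output"][0]["caption"]
    return captions


# Example call with the pipeline created earlier:
# caption_with_all_models(pipeline_1, './test_files/coffee-with-friends.jpg', CAPTION_MODELS)
```

Passing the pipeline in as an argument keeps the helper independent of any particular pipeline object, so the same function can be reused for every transformed image.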

### Transformation - Horizontal Flip

Our first transformation will be a simple [horizontal flip](https://www.mathsisfun.com/definitions/horizontal-flip.html), which will generate a horizontal mirror image of the original. Image captions should be very similar, if not identical, given that the scene is essentially the same.

Let's first see what the [`caption`](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/) module's [default](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module) model returns for this image:

In [7]:
# generate a caption for a horizontally flipped version of image with the module's default model
pipeline_1.process(local_file_path='./test_files/coffee-with-friends - horizontal.jpg')

INFO: hydrated input modules: {'module_1': {'model': 'vit-gpt2-image-captioning', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_xtlzzpxyid.jpg
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Fri Jun  7 23:34:44 2024 UTC
INFO: my_caption_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: a3314aa8-f56d-bd28-44d1-09137beaa692
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_caption_pipeline',
 'request_id': '4e1a608b-d035-4812-b139-ee8e6f002f0b',
 'file_id': 'f76cf947-84e4-4e98-a91f-a9456d72f113',
 'message': 'SUCCESS - output fetched for file_id f76cf947-84e4-4e98-a91f-a9456d72f113.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'caption': 'people sitting around a table'}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/f76cf947-84e4-4e98-a91f-a9456d72f113.json']}

The caption, just as before, is:

> people sitting around a table

Let's skip the code for the other three [processes](https://krixik-docs.readthedocs.io/en/latest/system/parameters_processing_files_through_pipelines/process_method/). Here are our captions:

- [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) ([default](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module)) - 'people sitting around a table'
- [git-base](https://huggingface.co/microsoft/git-base) - 'man wearing a white shirt'
- [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base) - 'group of friends having a coffee break'
- [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large) - 'several people sitting at a table with cups of coffee and a phone'

The first three models get full marks for consistency, which is what we expected from all four: the image is essentially the same, its contents unchanged and equally interpretable. There's an odd difference in the output of [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large), however. For the original image the model returned "several people sitting at a table with cups of coffee and smiling". For the horizontally flipped version the caption is identical except for the ending: instead of "smiling" it says "a phone".

Why does this (large) model now mention the phone instead of the smiles? The clearly visible phone and the smile of the central figure both sit on the image's central axis, so their position in the image barely changes with the horizontal flip. However, the second most visible smile, that of the out-of-focus gentleman in the white t-shirt, has moved from the left side of the image to the right. Could it be that this model analyzes the image from left to right? Or what else could produce this minor but conspicuous difference?
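To pinpoint exactly which words change between two captions, a word-level diff helps. This sketch uses Python's standard `difflib` module; `caption_diff` is a hypothetical helper, and the two captions are the ones reported above for blip-image-captioning-large.

```python
import difflib


def caption_diff(before, after):
    """Return (old_words, new_words) pairs for every span where two captions differ."""
    old_words, new_words = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return changes


base = "several people sitting at a table with cups of coffee and smiling"
flipped = "several people sitting at a table with cups of coffee and a phone"
print(caption_diff(base, flipped))  # [('smiling', 'a phone')]
```

For identical captions, such as those of the other three models, the function returns an empty list.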

### Transformation - Vertical Flip

Let's move on to another transformation, the [vertical flip](https://www.mathsisfun.com/definitions/vertical-flip.html), which generates a vertical mirror image of the original. This is a more significant change, given that the image ends up "upside-down", so we expect it to confuse at least some of the models. We'll skip the code and go directly to our captions:

- [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) ([default](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module)) - 'a crowd of people standing around a group of people'
- [git-base](https://huggingface.co/microsoft/git-base) - 'a group of men sitting around a table'
- [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base) - 'a group of people standing around a table'
- [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large) - 'several people standing around a table with plates of food and drinks'

The results are not bad. Three of the four models still agree that the image is of a group of people around a table. A couple of things to note:

- Three of the four models now describe the people as "standing", a word that appeared in none of the base captions (two of which said "sitting"). Flipping the image vertically seems to have confused the models' vertical perception, making it less clear how high up the individuals are.
- Coffee, which two of the base captions mentioned, is no longer mentioned at all. It's been replaced by a single mention of "food and drinks".
- The second model, [git-base](https://huggingface.co/microsoft/git-base), specifies that all the people around the table are men. The specificity of this detail is odd, not least because it's inaccurate.
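The observations above can be checked with a quick tally of how many captions mention a given word before and after the flip. This is a minimal sketch using only the standard library; `count_mentions` is a hypothetical helper, and the captions are copied from the lists in this article.

```python
import re


def count_mentions(captions, keywords):
    """For each keyword, count the captions that contain it as a whole word."""
    counts = {}
    for keyword in keywords:
        pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
        counts[keyword] = sum(1 for caption in captions if pattern.search(caption))
    return counts


base_captions = [
    "people sitting around a table",
    "man wearing a white shirt",
    "group of friends having a coffee break",
    "several people sitting at a table with cups of coffee and smiling",
]
vertical_captions = [
    "a crowd of people standing around a group of people",
    "a group of men sitting around a table",
    "a group of people standing around a table",
    "several people standing around a table with plates of food and drinks",
]
keywords = ["sitting", "standing", "table", "coffee"]

print(count_mentions(base_captions, keywords))
# {'sitting': 2, 'standing': 0, 'table': 2, 'coffee': 2}
print(count_mentions(vertical_captions, keywords))
# {'sitting': 1, 'standing': 3, 'table': 3, 'coffee': 0}
```

Whole-word matching is a crude proxy for meaning (it would miss "seated", for example), but it is enough to confirm the shift from "sitting" to "standing" and the disappearance of "coffee".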

Finally, we should note that one of the models was confused beyond a reasonable caption: the first one, [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning), which is also the module's default model. It detects a group of people, which in theory could be divided into more than one group, but then provides an answer that, while not entirely out of nowhere, is simply incorrect: "a crowd of people standing around a group of people." Perhaps it'll fare better with our next transformation.

### Transformation - Color Inversion

Our third transformation is [color inversion](https://skylum.com/how-to/how-to-invert-colors-on-a-picture), which some of you may better associate with [photographic negatives](https://en.wikipedia.org/wiki/Negative_(photography)). Although every color in the image is inverted, the structure of the image remains as it originally was. Let's see how the models take it. Here are our results:

- [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) ([default](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module)) - 'three people are sitting around a table with flowers'
- [git-base](https://huggingface.co/microsoft/git-base) - 'three people with painted faces'
- [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base) - 'three people are sitting at a table with blue paint'
- [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large) - 'several people are sitting at a table with blue paint on their faces'

There's plenty going on here.

For one thing, the default model is coloring within the lines again (no pun intended), and tells us "three people are sitting around a table with flowers". It agrees with the other models on several things:

- Three people, as 3/4 models interpret.
- Sitting, as 3/4 models interpret.
- They're around a table, as 3/4 models interpret.

What's curious about the first model's output is its observation of the flowers (we had to check the image, assuming a hallucination, but no, there are indeed flowers there). It saw them with more presence than the phone or the coffee cups. Or perhaps, because of the color change, it also sees the coffee cups as flowers?

The other three models all seem to agree that there is paint involved here, and two of them interpret the image as the people having their faces painted. This is understandable: through their training data the models "know" the color range that human faces normally fall in, and this bright blue is nowhere in it. The third model, [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base), doesn't even attach the paint to the faces; it just says that the people are sitting "with blue paint." This is a clear indication that [image caption](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/) models are able to differentiate form from color.

### Transformation - All of the Above

The final transformation we'll try is a combination of the previous three. We will take our base image and flip it [horizontally](https://www.mathsisfun.com/definitions/horizontal-flip.html), flip it [vertically](https://www.mathsisfun.com/definitions/vertical-flip.html), [invert](https://skylum.com/how-to/how-to-invert-colors-on-a-picture) its colors, and then process it through the image caption [pipeline](https://krixik-docs.readthedocs.io/en/latest/examples/single_module_pipelines/single_caption/) we created earlier. What do you think that'll do to pipeline output? Take a look:

- [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) ([default](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module)) - 'a series of colorful animals with faces painted on them'
- [git-base](https://huggingface.co/microsoft/git-base) - 'a wall on the side of a building'
- [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base) - 'a bunch of blue and white flowers'
- [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large) - 'there are many mannequins in a display of clothing'

This is naturally the toughest transformation (given that it's three distinct transformations), and it poses a significant challenge to the models. To their credit, each of them interprets something and returns a grammatically valid caption, though two of the captions are more on-point than the others.

The two better captions come from the first and last models we used, [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) and [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large). The former sees colorful animals with painted faces and the latter sees mannequins in a clothing display. Both are off, but the identification of faces, figures, and unusual colors rings true. The colorful-animals caption, in particular, hits close to home; with it, vit-gpt2-image-captioning has redeemed itself from its earlier failure on the vertical flip.

Speaking of failure, let's look at the output of the other two models, [git-base](https://huggingface.co/microsoft/git-base) and [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base). The former returned "a wall on the side of a building", which doesn't really mean anything. In the worst case it's completely wrong; in the best case it's going with the cop-out of "well, anything could be painted on a wall." Very low marks for this. The other caption says "a bunch of blue and white flowers." There's certainly blue and white in this image, and flowers do sometimes have haphazard form and distribution like this, but it's not a particularly strong caption.

### Conclusion

The models fared better than we expected. The group of people and the table among them survived most of the transformations with most of the models, and hallucinations and outright errors largely stayed away until the last, most difficult transformation. There was never much detail in any of the captions, but that's how these models have been trained: keep detail low and go for the general sense of the picture.

The final image was meant to be a challenge, and to the models' credit, none of them spouted pure nonsense.

There are many ways to get silly captions, incorrect captions, and hallucinatory captions from image models; they don't always fare well with transformations or with strange images. If you'd like to see this in action, why don't you set up a [Krixik pipeline](https://krixik-docs.readthedocs.io/en/latest/) like the one above and give it a whirl?