In [1]:
pip install krixik

Note: you may need to restart the kernel to use updated packages.





In [2]:
import sys
sys.path.append('..')
from dotenv import load_dotenv
import os

# load API credentials from a local .env file
load_dotenv()

LUCAS_STAGING_API_KEY = os.getenv('LUCAS_STAGING_API_KEY')
LUCAS_STAGING_API_URL = os.getenv('LUCAS_STAGING_API_URL')

# import Krixik and authenticate with the credentials above
from krixik import krixik
krixik.init(api_key=LUCAS_STAGING_API_KEY,
            api_url=LUCAS_STAGING_API_URL)

# helper for pretty-printing JSON-serializable output
import json
def json_print(data):
    print(json.dumps(data, indent=2))

# Jupyter magics: automatically reload edited modules
%load_ext autoreload
%autoreload 2

SUCCESS: You are now authenticated.


---


# Hallucinations when Captioning Monochrome

Certain types of AI models take non-text input and generate a textual interpretation of this input. They don't extract existing text (as [OCR](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/ocr_module/) does) or search through text (like [semantic search](https://krixik-docs.readthedocs.io/en/latest/system/search_methods/semantic_search_method/)), but generate the text outright. Two fine examples are [transcription](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/transcribe_module/) models, which receive an audio input and output a textual transcript of any spoken words within the audio, and [image captioning](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/) models, which generate a textual description of an input image file.

In this article we explore the hallucinations that image captioning models produce when their input is devoid of processable content (click here[LINK] for an article in which we perform a similar exercise for transcription models). Image captioning is a complex task, and models struggle when given little or nothing to interpret. The extreme case is a monochrome image file: a single color filling the entire frame.

To see what this looks like in practice, we'll use [Krixik](https://krixik-docs.readthedocs.io/en/latest/) to build a [single-module](https://krixik-docs.readthedocs.io/en/latest/examples/single_module_pipelines/single_caption/) pipeline with an [image caption](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/) module:

In [4]:
# instantiate a single-module pipeline with an image caption module
pipeline_1 = krixik.create_pipeline(name='my_caption_pipeline',
                                    module_chain=['caption'])

### Hallucinations when Captioning Yellow

Our file, *yellow.jpg*, is a plain yellow space generated in MS Paint. Let's see what happens when we [.process](https://krixik-docs.readthedocs.io/en/latest/system/parameters_processing_files_through_pipelines/process_method/) it through our pipeline with each of the four [caption models](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module) currently available on Krixik:

- [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning)
- [git-base](https://huggingface.co/microsoft/git-base)
- [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
- [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)


We'll use them in that same order, so [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) will go first. Since it's the module's [default model](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module), we don't have to specify it:

In [5]:
# .process an all-yellow image through the caption module without specifying the model, since vit-gpt2-image-captioning is default
pipeline_1.process(local_file_path='./test_files/yellow.jpg')

INFO: hydrated input modules: {'module_1': {'model': 'vit-gpt2-image-captioning', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_eyegsekiry.jpg
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 14:09:36 2024 UTC
INFO: my_caption_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 2c22686d-d45d-3a6c-7d76-bcca03ff82b6
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_caption_pipeline',
 'request_id': '0bc6ba01-496f-47f5-911f-b1371cd40bfa',
 'file_id': '5b1d36b3-5576-4b2d-ba8e-2f3f06521a26',
 'message': 'SUCCESS - output fetched for file_id 5b1d36b3-5576-4b2d-ba8e-2f3f06521a26.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'caption': 'a yellow and blue striped and red and white striped and red and white striped and red and white'}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/5b1d36b3-5576-4b2d-ba8e-2f3f06521a26.json']}

This first model has, as some might say, 'blown a fuse' with this image. The image is entirely yellow, but the model has hallucinated colors in stripes: "a yellow and blue striped and red and white striped and red and white striped and red and white." It's also been unable to generate a grammatically correct caption.
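Looping output like this can be flagged with a simple heuristic: count how often the same run of words repeats within the caption. The sketch below is illustrative only and not part of Krixik; `max_ngram_repeats` is a hypothetical helper name, and the choice of three-word runs is an assumption.

```python
from collections import Counter


def max_ngram_repeats(caption: str, n: int = 3) -> int:
    """Return how many times the most frequent word n-gram occurs in a caption."""
    words = caption.lower().split()
    # Slide a window of n words across the caption.
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return max(Counter(ngrams).values(), default=0)


# Captions copied from outputs in this article.
looping = ("a yellow and blue striped and red and white striped "
           "and red and white striped and red and white")
plain = "this is a yellow shirt"

print(max_ngram_repeats(looping))  # 3
print(max_ngram_repeats(plain))    # 1
```

A value above 1 means some three-word run appears more than once, which is rare in a short, well-formed caption.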

Let's see what our second model in line, [git-base](https://huggingface.co/microsoft/git-base), interprets:

In [7]:
# .process an all-yellow image through the caption module with git-base as the active model
pipeline_1.process(local_file_path='./test_files/yellow.jpg',
                   modules={'caption': {'model': 'git-base', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'git-base', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_vjclhmccsn.jpg
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 14:13:04 2024 UTC
INFO: my_caption_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 43b91d36-44b6-9c08-756a-ae181d80ab76
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_caption_pipeline',
 'request_id': '0eee6b5b-43e5-4999-a09d-55b1ff85f61e',
 'file_id': 'bb45f693-bb5c-4377-8c19-a36f280e895c',
 'message': 'SUCCESS - output fetched for file_id bb45f693-bb5c-4377-8c19-a36f280e895c.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'caption': 'this is a yellow shirt'}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/bb45f693-bb5c-4377-8c19-a36f280e895c.json']}

The caption here is "this is a yellow shirt." Although the 'shirt' part is hallucinatory, it has correctly identified the color and added no other colors. Moreover, one could argue (as a stretch) that the caption reflects imaginative hallucination, and that the model is "assuming" a zoomed-in image.

The third model we'll try is [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base):

In [9]:
# .process an all-yellow image through the caption module with blip-image-captioning-base as the active model
pipeline_1.process(local_file_path='./test_files/yellow.jpg',
                   modules={'caption': {'model': 'blip-image-captioning-base', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'blip-image-captioning-base', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_mqkqfczjdb.jpg
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 14:19:16 2024 UTC
INFO: my_caption_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: c3321bf1-9539-24c9-29c2-eabcaf55cdb4
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_caption_pipeline',
 'request_id': '0152b61b-9b1b-44f5-aaed-f17f6182a4d1',
 'file_id': '80c780cc-bce3-410b-83ba-24a2d63d6c25',
 'message': 'SUCCESS - output fetched for file_id 80c780cc-bce3-410b-83ba-24a2d63d6c25.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'caption': 'a yellow background with a white border'}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/80c780cc-bce3-410b-83ba-24a2d63d6c25.json']}

Accuracy improves as we go down the list. This third model accurately describes "a yellow background". It then adds "a white border", which is not in the image; this may be an artifact of how the model is set up.

One last model to go, [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large):

In [10]:
# .process an all-yellow image through the caption module with blip-image-captioning-large as the active model
pipeline_1.process(local_file_path='./test_files/yellow.jpg',
                   modules={'caption': {'model': 'blip-image-captioning-large', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'blip-image-captioning-large', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_ixzbmkpuno.jpg
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 14:23:44 2024 UTC
INFO: my_caption_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: f9140a8c-9346-ce8b-f465-4354dd7d92de
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_caption_pipeline',
 'request_id': 'f9094c94-5c31-41b5-994b-a8ac8d1b7267',
 'file_id': '8f3d1f92-98b8-4d2c-9328-2626757d2c87',
 'message': 'SUCCESS - output fetched for file_id 8f3d1f92-98b8-4d2c-9328-2626757d2c87.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'caption': 'yellow background with a white border and a black border'}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/8f3d1f92-98b8-4d2c-9328-2626757d2c87.json']}

Interestingly, this fourth model, [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large), is hallucinating **more** than the previous model, [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base), despite being the larger member of the same family. It now reports two borders, one white and one black, around the yellow background.
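The four calls above differ only in the `model` value passed through `modules`, so they could be driven from a single loop. The following is a rough sketch under stated assumptions: `caption_with_models` and `fake_process` are hypothetical names introduced here, the stub stands in for `pipeline_1.process` so the snippet runs without an API key, and the output is assumed to keep the `process_output` shape shown above.

```python
from typing import Callable, Dict, List

# Model names as listed in this article; the first one is the module default.
CAPTION_MODELS: List[str] = [
    "vit-gpt2-image-captioning",
    "git-base",
    "blip-image-captioning-base",
    "blip-image-captioning-large",
]


def caption_with_models(process_fn: Callable[..., dict],
                        local_file_path: str,
                        models: List[str] = CAPTION_MODELS) -> Dict[str, str]:
    """Run one file through each caption model and collect the captions.

    process_fn is expected to accept the same keyword arguments as the
    pipeline_1.process calls used in this article.
    """
    captions = {}
    for model in models:
        output = process_fn(
            local_file_path=local_file_path,
            modules={"caption": {"model": model, "params": {}}},
        )
        captions[model] = output["process_output"][0]["caption"]
    return captions


def fake_process(local_file_path: str, modules: dict) -> dict:
    """Hypothetical offline stand-in for pipeline_1.process."""
    model = modules["caption"]["model"]
    return {"status_code": 200,
            "process_output": [{"caption": f"caption from {model}"}]}


results = caption_with_models(fake_process, "./test_files/yellow.jpg")
for model, caption in results.items():
    print(f"{model}: {caption}")
```

With a live pipeline, passing `pipeline_1.process` in place of `fake_process` should produce one real caption per model.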

### Hallucinations when Captioning Red

Now let's try the same exercise but with a slightly different file: *red.png*, an entirely red space also generated in MS Paint. We'll use the same models as above, and see if results differ noticeably because of the color of the image.
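The files in this article were made in MS Paint, but a monochrome test image can also be generated in code. The sketch below writes a single-color PNG using only the Python standard library. `solid_color_png` is a hypothetical helper, and the 64x64 size and the RGB value `(255, 0, 0)` are assumptions, not the properties of the original *red.png*.

```python
import struct
import zlib


def solid_color_png(width: int, height: int, rgb: tuple) -> bytes:
    """Encode a single-color 8-bit RGB image as PNG bytes."""

    def chunk(kind: bytes, data: bytes) -> bytes:
        # A PNG chunk is: data length, chunk type, data, CRC-32 of type + data.
        crc = zlib.crc32(kind + data) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)

    # IHDR: width, height, bit depth 8, color type 2 (truecolor),
    # then compression, filter, and interlace methods all set to 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline starts with filter byte 0 (none), followed by RGB triples.
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height)
    return (b"\x89PNG\r\n\x1a\n"
            + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", idat)
            + chunk(b"IEND", b""))


png_bytes = solid_color_png(64, 64, (255, 0, 0))  # assumed "pure red"
print(len(png_bytes), "bytes")
```

Writing `png_bytes` to a file opened in binary mode (`"wb"`) yields an image that can be passed to `.process` through `local_file_path`.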

First we go with the [default model](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/#available-models-in-the-caption-module), [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning), which as you know we don't have to specify in the code:

In [11]:
# .process an all-red image through the caption module without specifying the model, since vit-gpt2-image-captioning is default
pipeline_1.process(local_file_path='./test_files/red.png')

INFO: hydrated input modules: {'module_1': {'model': 'vit-gpt2-image-captioning', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_uqsmqphlkv.png
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 14:32:17 2024 UTC
INFO: my_caption_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: f21d8b19-84b2-5fe6-b13a-b1dd32237f8c
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_caption_pipeline',
 'request_id': 'd3f16a7c-f750-4f7f-b6d3-e6bfd6264639',
 'file_id': 'b94ef900-2bd6-49b5-abe2-bfc3fcfffe0d',
 'message': 'SUCCESS - output fetched for file_id b94ef900-2bd6-49b5-abe2-bfc3fcfffe0d.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'caption': 'a red and white flag with a blue and white stripe'}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/b94ef900-2bd6-49b5-abe2-bfc3fcfffe0d.json']}

This hallucination is similar to the one this same model produced earlier for the yellow image, but less acute. For one thing, the caption mentions fewer colors: this time it adds only two (blue and white), as opposed to three (blue, red, and white) above. Moreover, this is the first time any model references a flag, which is a reasonable guess for simply colored images like these. Also note that the caption "a red and white flag with a blue and white stripe" is grammatically correct this time. What makes red 'easier' than yellow for the [vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) model is anybody's guess, but it may have to do with how much yellow versus red content appeared in the training data.

Now let's try [git-base](https://huggingface.co/microsoft/git-base) again:

In [12]:
# .process an all-red image through the caption module with git-base as the active model
pipeline_1.process(local_file_path='./test_files/red.png',
                   modules={'caption': {'model': 'git-base', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'git-base', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_joddyfoemc.png
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 15:52:21 2024 UTC
INFO: my_caption_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 2deb844f-c603-25ed-2c11-26622deb425f
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_caption_pipeline',
 'request_id': '51455ef9-ca1e-4090-90c3-1604081c4b55',
 'file_id': 'f4de1069-5f21-4698-85b3-4e58978e14af',
 'message': 'SUCCESS - output fetched for file_id f4de1069-5f21-4698-85b3-4e58978e14af.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'caption': 'the red color of the flag'}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/f4de1069-5f21-4698-85b3-4e58978e14af.json']}

This is a great caption: "the red color of the flag." Like the previous caption, it references a flag, but the head noun of the phrase is "color". There are no extra colors here (and no reference to a shirt), and it's debatable whether the 'flag' reference even counts as a hallucination. Top marks.

It's interesting that the yellow captions mentioned no flags, but so far two of two red captions have flags as a critical element of the caption. Are red-including flags used more frequently in training image caption models than yellow-including ones?

Let's see what the third model, [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base), has to say:

In [13]:
# .process an all-red image through the caption module with blip-image-captioning-base as the active model
pipeline_1.process(local_file_path='./test_files/red.png',
                   modules={'caption': {'model': 'blip-image-captioning-base', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'blip-image-captioning-base', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_jujshhbnyf.png
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 15:56:26 2024 UTC
INFO: my_caption_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 2aea9393-1e35-c047-c66e-94310cd79884
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_caption_pipeline',
 'request_id': '3c3cda34-ceb8-43e6-a157-a4ddd054919e',
 'file_id': 'b2f4bcbe-afc6-45e7-a22c-2474f233de71',
 'message': 'SUCCESS - output fetched for file_id b2f4bcbe-afc6-45e7-a22c-2474f233de71.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'caption': 'a red background with a white border'}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/b2f4bcbe-afc6-45e7-a22c-2474f233de71.json']}

Here we see consistency. For the all-yellow image, the [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base) model generated: "a yellow background with a white border". For this image we have "a red background with a white border." The same white border hallucination is present, but top marks for consistency.

Our fourth and final model is once again [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large), from the same blip family as the above. Will it also be consistent in its output?

In [14]:
# .process an all-red image through the caption module with blip-image-captioning-large as the active model
pipeline_1.process(local_file_path='./test_files/red.png',
                   modules={'caption': {'model': 'blip-image-captioning-large', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'blip-image-captioning-large', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_lbjzltkrsi.png
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 16:01:50 2024 UTC
INFO: my_caption_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 577352d7-2f97-8525-b433-6ffbb291ad3b
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_caption_pipeline',
 'request_id': '9c908090-0d84-4597-a129-50c99d495da0',
 'file_id': 'ed4c1152-52aa-4db0-9dc8-9eeba8ced54f',
 'message': 'SUCCESS - output fetched for file_id ed4c1152-52aa-4db0-9dc8-9eeba8ced54f.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'caption': 'a close up of a red background with a white border'}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/ed4c1152-52aa-4db0-9dc8-9eeba8ced54f.json']}

The caption is "a close up of a red background with a white border." This is an odd framing: does the model take the close-up to have cropped the white border out of view, leaving an entirely red image?

This caption isn't consistent with what the same model produced for the yellow image ("yellow background with a white border and a black border"), but it is an improvement: the white border hallucination remains, while the black border is gone.
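One way to make "consistency" concrete is to replace each image's true color with a placeholder and check whether the two captions then match exactly. The sketch below does that for the two blip models, using captions copied from the outputs above. The helper names are hypothetical, and exact string equality is a deliberately strict, assumed definition of consistency.

```python
import re


def caption_template(caption: str, color: str) -> str:
    """Replace the image's true color with a placeholder token."""
    return re.sub(rf"\b{re.escape(color)}\b", "<COLOR>", caption.lower())


def is_consistent(captions_by_color: dict) -> bool:
    """True if all captions reduce to the same template."""
    templates = {caption_template(text, color)
                 for color, text in captions_by_color.items()}
    return len(templates) == 1


blip_base = {
    "yellow": "a yellow background with a white border",
    "red": "a red background with a white border",
}
blip_large = {
    "yellow": "yellow background with a white border and a black border",
    "red": "a close up of a red background with a white border",
}

print(is_consistent(blip_base))   # True
print(is_consistent(blip_large))  # False
```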

### Conclusion

There are a few conclusions that we can draw from this exercise:

- Image captioning models seem to better associate flags with the color red than with the color yellow.
- Red produces fewer hallucinations than yellow does. Together with the previous point, this may suggest that there was more red than yellow in the training data.
- Within the same family, larger models are not necessarily better (while being heavier and more expensive). In the above example, the [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base) model was both more consistent in its output and less hallucination prone than its in-theory stronger sibling, [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large).
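The observations above can be tallied roughly from the eight captions produced in this article. The sketch below counts, per image color, how many distinct extra color words the captions introduce and how many captions mention a flag. The color vocabulary is a small hand-picked assumption, and the count is only a proxy: hallucinated objects such as "shirt" or "border" are not counted.

```python
import re

# Captions copied from the process outputs in this article.
CAPTIONS = {
    "yellow": {
        "vit-gpt2-image-captioning": "a yellow and blue striped and red and white "
                                     "striped and red and white striped and red and white",
        "git-base": "this is a yellow shirt",
        "blip-image-captioning-base": "a yellow background with a white border",
        "blip-image-captioning-large": "yellow background with a white border "
                                       "and a black border",
    },
    "red": {
        "vit-gpt2-image-captioning": "a red and white flag with a blue and white stripe",
        "git-base": "the red color of the flag",
        "blip-image-captioning-base": "a red background with a white border",
        "blip-image-captioning-large": "a close up of a red background with a white border",
    },
}

# Assumption: a small illustrative color vocabulary; extend as needed.
COLOR_WORDS = {"red", "yellow", "blue", "white", "black", "green",
               "orange", "purple", "pink", "brown", "gray", "grey"}


def extra_colors(caption: str, true_color: str) -> set:
    """Color words in the caption other than the image's actual color."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return (words & COLOR_WORDS) - {true_color}


summary = {}
for color, by_model in CAPTIONS.items():
    summary[color] = {
        "extra_colors": sum(len(extra_colors(c, color)) for c in by_model.values()),
        "flag_mentions": sum("flag" in c.split() for c in by_model.values()),
    }

print(summary)
# {'yellow': {'extra_colors': 6, 'flag_mentions': 0}, 'red': {'extra_colors': 4, 'flag_mentions': 2}}
```

With only two images and four models, these counts illustrate the pattern rather than establish it.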

As the technology evolves, hallucinations will likely become less and less frequent in the output of models like these. In the meantime, though, hallucinations offer a valuable perspective on how these models work, how they've been created, and how and why they fail.