In [2]:
pip install krixik

Note: you may need to restart the kernel to use updated packages.





In [4]:
import sys 
sys.path.append('..')
from dotenv import load_dotenv
import os
load_dotenv()

LUCAS_STAGING_API_KEY=os.getenv('LUCAS_STAGING_API_KEY')
LUCAS_STAGING_API_URL=os.getenv('LUCAS_STAGING_API_URL')

# import Krixik
from krixik import krixik
krixik.init(api_key = LUCAS_STAGING_API_KEY, 
            api_url = LUCAS_STAGING_API_URL)

import json
def json_print(data):
    print(json.dumps(data, indent=2))

%load_ext autoreload
%autoreload 2 

SUCCESS: You are now authenticated.


---


# Hallucinations when Transcribing Silence and Noise

Certain types of AI models take non-text input and generate a textual interpretation of this input. They don't extract existing text (as [OCR](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/ocr_module/) does) or search through text (like [semantic search](https://krixik-docs.readthedocs.io/en/latest/system/search_methods/semantic_search_method/)), but generate the text outright. Two fine examples are [transcription](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/transcribe_module/) models, which receive an audio input and output a textual transcript of any spoken words within the audio, and [image captioning](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/caption_module/) models, which generate a textual description of an input image file.

In this article we explore the hallucinations transcription models produce when their input is devoid of processable content (click here[LINK] for an article in which we perform a similar exercise for image captioning models). Transcription is a complex task, and models struggle when their input gives them little or nothing to interpret. The extreme case is entirely silent audio, but audio with noise or music and no clear spoken voice proves just as challenging.

To see what this looks like in practice, we'll use [Krixik](https://krixik-docs.readthedocs.io/en/latest/) to build a [single-module](https://krixik-docs.readthedocs.io/en/latest/examples/single_module_pipelines/single_transcribe/) pipeline with a [transcribe](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/transcribe_module/) module:

In [5]:
# instantiate a single-module pipeline with a transcribe module
pipeline_1 = krixik.create_pipeline(name='my_transcribe_pipeline',
                                    module_chain=['transcribe'])

### Hallucinations when Transcribing Silence

Our file, *silence.mp3*, is a completely silent audio file. Let's see what happens when we [.process](https://krixik-docs.readthedocs.io/en/latest/system/parameters_processing_files_through_pipelines/process_method/) it through our pipeline with three of the [available models](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/transcribe_module/#available-models-in-the-transcribe-module), selected at random:

- [whisper-medium](https://huggingface.co/openai/whisper-medium)
- [whisper-small](https://huggingface.co/openai/whisper-small)
- [whisper-base](https://huggingface.co/openai/whisper-base)

[whisper-medium](https://huggingface.co/openai/whisper-medium), the strongest of these three, will go first:

In [6]:
# .process a silent audio through the transcribe module with whisper-medium as the active model
pipeline_1.process(local_file_path='./test_files/silence.mp3',
                   modules={'transcribe': {'model': 'whisper-medium', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'whisper-medium', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_zbwodqyniy.mp3
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 00:04:17 2024 UTC
INFO: my_transcribe_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 77258e5a-19c8-8978-051f-d28fa6b5928c
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_transcribe_pipeline',
 'request_id': '45d40379-3175-40e8-b54e-d36d0ff08951',
 'file_id': '31b668d4-32c1-42d2-939f-089a1e28e771',
 'message': 'SUCCESS - output fetched for file_id 31b668d4-32c1-42d2-939f-089a1e28e771.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'transcript': ' you',
   'timestamped_transcript': [{'id': 0,
     'start': 0.0,
     'end': 0.82,
     'text': ' you',
     'no_speech_prob': 0.934208869934082,
     'confidence': 0.29,
     'words': [{'text': 'you',
       'start': 0.0,
       'end': 0.82,
       'confidence': 0.29}]}]}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/31b668d4-32c1-42d2-939f-089a1e28e771.json']}

As you can see, despite the audio being completely silent, the model detects the word "you" in its first second.
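Note the `no_speech_prob` value of roughly 0.93 in the output above: the model itself appears to rate this segment as very probably not speech, yet it emits a word anyway. As a minimal sketch, and assuming the output structure shown above, one could filter such segments out after processing. The helper name `drop_probable_silence` and the 0.6 cutoff are our own illustrative choices, not part of Krixik:

```python
def drop_probable_silence(process_output, threshold=0.6):
    """Join the text of all segments whose no_speech_prob is below threshold.

    `threshold` is an illustrative cutoff (an assumption), not a Krixik setting.
    """
    kept = []
    for item in process_output:
        for segment in item["timestamped_transcript"]:
            if segment["no_speech_prob"] < threshold:
                kept.append(segment["text"].strip())
    return " ".join(kept)


# Segment values copied from the whisper-medium output above ('words' omitted).
silent_output = [{
    "transcript": " you",
    "timestamped_transcript": [{
        "id": 0,
        "start": 0.0,
        "end": 0.82,
        "text": " you",
        "no_speech_prob": 0.934208869934082,
        "confidence": 0.29,
    }],
}]

# The only segment exceeds the cutoff, so nothing survives the filter.
print(repr(drop_probable_silence(silent_output)))
```

A filter like this trades recall for precision: a cutoff that is too low may also discard quiet but genuine speech, so the value would need tuning on real audio.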

Now [whisper-small](https://huggingface.co/openai/whisper-small), a less powerful member of the Whisper family:

In [7]:
# .process a silent audio through the transcribe module with whisper-small as the active model
pipeline_1.process(local_file_path='./test_files/silence.mp3',
                   modules={'transcribe': {'model': 'whisper-small', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'whisper-small', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_tymepxopnh.mp3
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 00:07:50 2024 UTC
INFO: my_transcribe_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 80a06ebf-bb3b-0591-362d-b48d77d85fae
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_transcribe_pipeline',
 'request_id': '34dac7c6-252e-4467-b6e3-ad0a5bfc22ed',
 'file_id': '2d0c214e-1be0-45ce-af29-ac80732b0a40',
 'message': 'SUCCESS - output fetched for file_id 2d0c214e-1be0-45ce-af29-ac80732b0a40.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'transcript': ' you',
   'timestamped_transcript': [{'id': 0,
     'start': 0.18,
     'end': 1.1,
     'text': ' you',
     'no_speech_prob': 0.9380698204040527,
     'confidence': 0.234,
     'words': [{'text': 'you',
       'start': 0.18,
       'end': 1.1,
       'confidence': 0.234}]}]}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/2d0c214e-1be0-45ce-af29-ac80732b0a40.json']}

Both models have detected the word "you" in roughly the same place, though note that the timestamps differ a bit. [whisper-medium](https://huggingface.co/openai/whisper-medium) detects it from 0.0 to 0.82 seconds, while [whisper-small](https://huggingface.co/openai/whisper-small) detects it from 0.18 to 1.1 seconds. Once again, bear in mind that there is no sound of any type in this audio.
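To make the timestamp comparison concrete, here is a small sketch that computes how much the two detections overlap in time. It assumes the result structure shown in the outputs above; `first_span` and `overlap_seconds` are hypothetical helper names, and the dictionaries are trimmed copies of those outputs:

```python
def first_span(result):
    """Return (start, end) of the first timestamped segment in a result dict."""
    segment = result["process_output"][0]["timestamped_transcript"][0]
    return segment["start"], segment["end"]


def overlap_seconds(span_a, span_b):
    """Length of the interval shared by two (start, end) spans, or 0.0 if none."""
    return max(0.0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))


# Trimmed copies of the whisper-medium and whisper-small outputs above.
medium_result = {"process_output": [{"timestamped_transcript": [
    {"start": 0.0, "end": 0.82, "text": " you"}]}]}
small_result = {"process_output": [{"timestamped_transcript": [
    {"start": 0.18, "end": 1.1, "text": " you"}]}]}

shared = overlap_seconds(first_span(medium_result), first_span(small_result))
print(f"shared interval: {shared:.2f} s")
```

The two hallucinated words share most of their duration, which is what "roughly the same place" amounts to numerically.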

What about [whisper-base](https://huggingface.co/openai/whisper-base), the smallest of these three?

In [8]:
# .process a silent audio through the transcribe module with whisper-base as the active model
pipeline_1.process(local_file_path='./test_files/silence.mp3',
                   modules={'transcribe': {'model': 'whisper-base', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'whisper-base', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_utxagxjchl.mp3
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 00:11:02 2024 UTC
INFO: my_transcribe_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 7326d3f8-e7d7-6163-f1f4-5c4185253467
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_transcribe_pipeline',
 'request_id': '640434e0-1b85-4077-8c60-e9e289bba5f1',
 'file_id': 'a728067e-4d4d-4113-9c2c-c640df855e26',
 'message': 'SUCCESS - output fetched for file_id a728067e-4d4d-4113-9c2c-c640df855e26.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'transcript': ' you',
   'timestamped_transcript': [{'id': 0,
     'start': 2.48,
     'end': 2.5,
     'text': ' you',
     'no_speech_prob': 0.9320588707923889,
     'confidence': 0.247,
     'words': [{'text': 'you',
       'start': 2.48,
       'end': 2.5,
       'confidence': 0.247}]}]}],
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/a728067e-4d4d-4113-9c2c-c640df855e26.json']}

Fascinating. This smallest model also detects the word "you", but long after the other two do: not in the first second of the audio, but in the third!
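Another detail stands out in this output: the word runs from 2.48 to 2.5, which is about 0.02 seconds and far too brief for a spoken word. A rough sanity check along those lines might look like the sketch below. The helper name `implausibly_short_words` and the 0.05-second minimum are assumptions made for illustration only:

```python
def implausibly_short_words(segments, min_duration=0.05):
    """Return (word, duration) pairs for words shorter than min_duration seconds.

    `min_duration` is an arbitrary illustrative value, not an established limit.
    """
    flagged = []
    for segment in segments:
        for word in segment["words"]:
            duration = word["end"] - word["start"]
            if duration < min_duration:
                flagged.append((word["text"], round(duration, 2)))
    return flagged


# Word timings copied from the whisper-base and whisper-medium outputs above.
base_segments = [{"id": 0, "text": " you", "words": [
    {"text": "you", "start": 2.48, "end": 2.5, "confidence": 0.247}]}]
medium_segments = [{"id": 0, "text": " you", "words": [
    {"text": "you", "start": 0.0, "end": 0.82, "confidence": 0.29}]}]

print(implausibly_short_words(base_segments))    # flags the 0.02 s "you"
print(implausibly_short_words(medium_segments))  # nothing flagged
```

A check like this only catches one symptom; the whisper-medium hallucination has a plausible duration and passes it.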

### Hallucinations when Transcribing Noise

Now let's try an audio file that contains only modem sounds. Those of you 25 years or older may remember these well. We'll use a new random selection of three [available models](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/transcribe_module/#available-models-in-the-transcribe-module):

- [whisper-tiny](https://huggingface.co/openai/whisper-tiny), the module's default model
- [whisper-medium](https://huggingface.co/openai/whisper-medium)
- [whisper-base](https://huggingface.co/openai/whisper-base)

[whisper-tiny](https://huggingface.co/openai/whisper-tiny) goes first. Since it's the module's [default model](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/transcribe_module/#available-models-in-the-transcribe-module), we don't have to specify it:

In [9]:
# .process a modem-sounds audio through the transcribe module without specifying the model, since whisper-tiny is default
pipeline_1.process(local_file_path='./test_files/modem_sounds.mp3')

INFO: hydrated input modules: {'module_1': {'model': 'whisper-tiny', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_efyjfaauqr.mp3
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 00:20:05 2024 UTC
INFO: my_transcribe_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 59b5a63d-403a-4ba5-1255-ca1909fa5284
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_transcribe_pipeline',
 'request_id': '3438d535-b799-48ba-9445-15611c425eb6',
 'file_id': 'f26060b4-b318-4ec6-b98f-539b45044abb',
 'message': 'SUCCESS - output fetched for file_id f26060b4-b318-4ec6-b98f-539b45044abb.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'transcript': " I'm not sure if you can see the sound of the sound of the sound of the sound. I'm not sure if you can see the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the sound of the soun

Look at that: it supposedly detects the words "I'm not sure if you can see the sound of the sound", and then essentially repeats the words "of the sound" ad nauseam. Is this, as some have suggested, the model 'complaining' to us? The result also brings back the questions raised by the silent audio above: does the training data somehow suggest that "you" is a valid interpretation of silence? Is the interpretation of "you" parametrically close to that of silence, so that one results in the other?
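Looping output like this can be spotted mechanically. The sketch below measures what fraction of a transcript's word trigrams are repeats; it is one simple heuristic among many, and the name `repetition_ratio`, the tokenizer, and the choice of trigrams are all our own assumptions. The `looping` string is the first sentence of the transcript above:

```python
import re


def repetition_ratio(text, n=3):
    """Fraction of word n-grams that repeat an earlier n-gram (0.0 = no repeats)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# First sentence of the whisper-tiny transcript above.
looping = "I'm not sure if you can see the sound of the sound of the sound of the sound."
# An ordinary sentence for contrast.
normal = "Transcription models receive audio and output a textual transcript."

print(repetition_ratio(looping))
print(repetition_ratio(normal))
```

Even this single sentence scores well above the ordinary one, and the full transcript, which keeps looping, would score far higher. A cutoff above which a transcript is treated as suspect would have to be chosen empirically.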

Let's see what happens if we use a stronger model, [whisper-medium](https://huggingface.co/openai/whisper-medium):

In [10]:
# .process a modem-sounds audio through the transcribe module with whisper-medium as the active model
pipeline_1.process(local_file_path='./test_files/modem_sounds.mp3',
                   modules={'transcribe': {'model': 'whisper-medium', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'whisper-medium', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_kjunzkblsz.mp3
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 00:30:19 2024 UTC
INFO: my_transcribe_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: a13ed43a-da7a-8428-dbe2-36701d6de1e7
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_transcribe_pipeline',
 'request_id': '3359fa8e-2807-4415-b9f7-8da5da9b5598',
 'file_id': '882e5b18-b891-4c3c-9ffc-83d5368d01fc',
 'message': 'SUCCESS - output fetched for file_id 882e5b18-b891-4c3c-9ffc-83d5368d01fc.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'transcript': ' You You You You You',
   'timestamped_transcript': [{'id': 0,
     'start': 0.0,
     'end': 0.84,
     'text': ' You',
     'no_speech_prob': 0.759978175163269,
     'confidence': 0.032,
     'words': [{'text': 'You',
       'start': 0.0,
       'end': 0.84,
       'confidence': 0.032}]},
    {'id': 1,
     'start': 32.48,
     'end': 32.5,
     'text': ' You',
     'no_speech_prob': 0.4594312906265259,
     'confidence': 0.723,
     'words': [{'text': 'You',
       'start': 32.48,
       'end': 32.5,
       'confidence': 0.723}]},
    {'id': 2,
     'start': 60.0,
     'end': 61.1,
     'text': ' You',
     'no_speech_prob': 0.4354951977

The stronger model gives us a no less baffling transcription: the word "you" repeated five times, detected at five seemingly unrelated points in the audio. It is difficult not to note the similarity with the transcription of the silent audio above, although neither audio file includes the word "you".

The third model we'll leverage here, [whisper-base](https://huggingface.co/openai/whisper-base), falls between the previous two models in size:

In [11]:
# .process a modem-sounds audio through the transcribe module with whisper-base as the active model
pipeline_1.process(local_file_path='./test_files/modem_sounds.mp3',
                   modules={'transcribe': {'model': 'whisper-base', 'params': {}}})

INFO: hydrated input modules: {'module_1': {'model': 'whisper-base', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_aaxqkuksje.mp3
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sat Jun  1 00:33:00 2024 UTC
INFO: my_transcribe_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 081016b3-7321-5aa8-0ea3-2a67b0639570
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_transcribe_pipeline',
 'request_id': '1a5ae2c9-0bd9-4c39-b779-09b2d154b5c3',
 'file_id': '46261d75-6c7d-4872-9ae0-95060739aa81',
 'message': 'SUCCESS - output fetched for file_id 46261d75-6c7d-4872-9ae0-95060739aa81.Output saved to location(s) listed in process_output_files.',
 'process_output': [{'transcript': " I'm not sure if I can get the right one. I'm not sure if I can get the right one. I'm not sure if I can get the right one. I'm not sure if I can get the right one. I'm not sure if I can get the right one. I'm not sure if I can get the right one. I'm not sure if I can get the right one. I'm not sure if I can get the right one. I'm not sure if I can get the right one.",
   'timestamped_transcript': [{'id': 0,
     'start': 60.14,
     'end': 64.14,
     'text': " I'm not sure if I can get the right one.",
     'no_speech_prob': 0.6607950329780579,
     'confidence': 0.118,
     'words': [{'text': "I'm",
       'start': 60.14,
       'end': 6

As you can see, this transcription is closer to what [whisper-tiny](https://huggingface.co/openai/whisper-tiny) produced: the repetition of a phrase that is nowhere to be found in the audio. In this case, the repeated phrase is "I'm not sure if I can get the right one."
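When a whole sentence repeats like this, a reader-facing transcript can at least be compacted by collapsing consecutive duplicates. The following is a minimal sketch under our own assumptions: `collapse_repeated_sentences` is a hypothetical helper, sentences are split naively on `.`, `!` and `?`, and the excerpt is shortened to three repetitions of the phrase:

```python
import re


def collapse_repeated_sentences(transcript):
    """Collapse runs of identical consecutive sentences into (sentence, count) pairs.

    Splitting is naive: any trailing text without '.', '!' or '?' is ignored.
    """
    sentences = re.findall(r"[^.!?]+[.!?]", transcript)
    collapsed = []
    for sentence in sentences:
        sentence = sentence.strip()
        if collapsed and collapsed[-1][0] == sentence:
            collapsed[-1][1] += 1
        else:
            collapsed.append([sentence, 1])
    return [(text, count) for text, count in collapsed]


# Shortened excerpt of the whisper-base transcript above (three repetitions).
excerpt = " I'm not sure if I can get the right one." * 3

for text, count in collapse_repeated_sentences(excerpt):
    print(f"{count}x {text}")
```

This does not remove the hallucination, of course; it only makes the repetition explicit, and a high count is itself a useful warning sign.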

### Conclusion

This hallucination comparison gives us a peek at how these models function, how they've been built, and how they can fail. Much has been written about this topic elsewhere, so please feel free to explore; we will not go into detail here. We offer the following thought instead: in these tests, [whisper-tiny](https://huggingface.co/openai/whisper-tiny) was no less likely to hallucinate than [whisper-medium](https://huggingface.co/openai/whisper-medium), a heavier, more powerful, more expensive model. In what other ways is the smaller model just as good as its heftier siblings?

Transcription hallucinations can be interesting, fun, and sometimes even a little spooky (for instance, try transcribing an audio file that holds Cradle of Filth's cover of the Misfits' *Death Comes Ripping*. Why don't you [create a Krixik pipeline](https://krixik-docs.readthedocs.io/en/latest/examples/single_module_pipelines/single_transcribe/) and give it a shot?). This technology is evolving at extraordinary speed, and hallucinations like these may soon largely become a thing of the past.

That said, how close they'll come to being "perfect" is another matter altogether... although to be fair, the human auditory/interpretative system doesn't always score perfect marks either.