In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table align="left">
<td style="text-align: center">
<a href="https://colab.research.google.com/github/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/genai-on-vertex-ai/gemini/prompting_recipes/multimodal/multimodal_prompting_audio.ipynb">
<img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
</a>
</td>
      <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fmain%2Fgenai-on-vertex-ai%2Fgemini%2Fprompting_recipes%2Fmultimodal%2Fmultimodal_prompting_audio.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
<td style="text-align: center">
<a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/applied-ai-engineering-samples/main/genai-on-vertex-ai/gemini/prompting_recipes/multimodal/multimodal_prompting_audio.ipynb">
<img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
</a>
</td>    
<td style="text-align: center">
<a href="https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/genai-on-vertex-ai/gemini/prompting_recipes/multimodal/multimodal_prompting_audio.ipynb">
<img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
</a>
</td>
</table>

| | |
|----------|-------------|
| Author(s)   | Michael Chertushkin |
| Reviewer(s) | Rajesh Thallam |
| Last updated | 2024-07-25: Draft |

### Best practices for multimodal prompting

This notebook provides a set of best practices for multimodal prompting. We list the practices first and then demonstrate them with examples.

Here is the list of best practices:
- Use specific instructions
- Explicitly set the Persona
- Explicitly set the Mission
- Explain the steps one-by-one
- Use structured output
- If you ask the model to explain the reasoning behind a decision, ask for the reasoning before the decision. For example, the helpfulness reasoning should appear in the output before the helpfulness score.
- Put media content before text content. If you have images, put the images before the prompt; the same goes for audio and video.
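
The last practice, media before text, can be enforced with a small helper. The sketch below is only an illustration: `MediaStub` and `media_first` are hypothetical names that do not come from the Vertex AI SDK, and the URI is a placeholder. In a real request the media items would be `Part` objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MediaStub:
    """Hypothetical stand-in for a real media part (for example, the result of Part.from_uri)."""

    uri: str


def media_first(items):
    """Reorder request items so media parts precede text, keeping relative order."""
    # sorted() is stable and False sorts before True, so non-string (media) items
    # keep their order and move ahead of every plain-text string.
    return sorted(items, key=lambda item: isinstance(item, str))


# Placeholder URI, used only for illustration.
contents = media_first(["Describe the audio.", MediaStub("gs://bucket/sound.mp3")])
print(contents)
```

The helper treats every non-string item as media, which is a simplifying assumption that fits requests made of `Part` objects and plain strings.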



### Content
- Setup
- Introduction
- Audio Understanding Task
- Audio Understanding. Advanced Prompt
- Audio Understanding Task. System instruction
- Audio Understanding Task. Structured output

### Setup

In [None]:
PROJECT_ID = "chertushkin-genai-sa"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}
MODEL_NAME = "gemini-1.5-flash-001"

import sys

if "google.colab" in sys.modules:
    from google.colab import auth
    auth.authenticate_user()
    print('Authenticated')

Authenticated


In [None]:
!gsutil --version

gsutil version: 5.30


In [None]:
bucket_name = "multimodal-examples"
source_blob_name = "sound_1.mp3"
destination_file_name = "sound_1.mp3"

!gsutil cp gs://{bucket_name}/{source_blob_name} {destination_file_name}

Copying gs://multimodal-examples/sound_1.mp3...
/ [1 files][  3.0 MiB/  3.0 MiB]                                                
Operation completed over 1 objects/3.0 MiB.                                      


In [None]:
from IPython.display import Audio

Audio(destination_file_name)

### Introduction

This notebook shows how to prompt multimodal models with audio input. Audio tasks can be split into two categories:
- Audio Understanding
- Audio Generation

This notebook covers only Audio Understanding with Gemini. Audio Generation is the subject of another notebook.

#### Important comment
There is terminological confusion about what a "multimodal model" is. We use the following definition: a multimodal model is a model that can process different types of input, including text, images, video and audio. Gemini 1.5 Pro is an example of a multimodal model.

### Audio Understanding Task

This task requires input in two modalities: text and audio. An example API call is shown below; however, the prompt is not optimal and we can make it better.

In [None]:
import vertexai
from vertexai.generative_models import GenerativeModel, Part, FinishReason
import vertexai.preview.generative_models as generative_models

vertexai.init(project=PROJECT_ID, location=LOCATION)
model = GenerativeModel(MODEL_NAME)

In [None]:
generation_config = {
    "max_output_tokens": 8192,
    "temperature": 0.1,
    "top_p": 0.95,
}

safety_settings = {
    generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

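def generate(model, input_prompt, input_audio, safety_settings, generation_config):
    """Stream a response for the given audio and prompt, printing chunks as they arrive."""
    # The audio part goes before the text prompt (media first).
    responses = model.generate_content(
        [input_audio, input_prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True,
    )

    for response in responses:
        print(response.text, end="")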
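
The `generate` function above prints the streamed chunks. If you want to keep the text for later processing, the chunks can be joined into one string instead. The sketch below is a minimal illustration: `collect_text` is a hypothetical helper, and the fake stream stands in for the iterable returned by `model.generate_content(..., stream=True)`.

```python
from types import SimpleNamespace


def collect_text(responses):
    """Join the .text of each streamed chunk into a single string."""
    return "".join(chunk.text for chunk in responses)


# Stand-in chunks; a real call would pass the streamed responses from the model.
fake_stream = [SimpleNamespace(text="This is "), SimpleNamespace(text="an audio program.")]
print(collect_text(fake_stream))
```

Accessing `.text` on a chunk that carries no text content may raise an error, so production code would likely need to handle that case.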

In [None]:
# Reference the audio file directly in Cloud Storage instead of sending raw bytes.
audio_uri = f"gs://{bucket_name}/{source_blob_name}"
audio_content = Part.from_uri(audio_uri, mime_type="audio/mp3")

prompt = """Provide a description of the audio.
The description should also contain anything important which people say in the audio."""

In [None]:
generate(model, prompt, audio_content, safety_settings, generation_config)

This is an audio program to accompany English in Action 1, 2nd edition by Barbara H. Foley and Elizabeth R. Nebblett. It is a copyright of 2018 National Geographic Learning, a part of Cengage Learning. The audio program starts with a section called "Listen and Repeat". It then goes on to list 16 sentences, each containing a present continuous verb. The sentences are: 1. He is eating. 2. He is washing the car. 3. She is listening to the radio. 4. They are studying. 5. He is cooking. 6. She is sleeping. 7. He is reading. 8. She is drinking. 9. They are talking. 10. They are watching TV. 11. He is doing his homework. 12. She is cleaning the house. 13. She is driving. 14. They are walking. 15. She is making lunch. 16. He is doing the laundry. 

As we can see, the model correctly recognized that this is an English lesson; however, we can improve the level of detail.

### Audio Understanding. Advanced Prompt

The advanced prompt uses the following best practices:
- Persona
- Mission
- Instructions in XML format
- Suggestions

In [None]:
prompt = """You are an audio analyzer. You receive an audio and produce the detailed description about what happens in the audio.

<Instructions>
- Determine what happens in the audio
- Understand the hidden meaning of the audio
- If there are dialogues, identify the talking personas
- Make sure the description is clear and helpful
</Instructions>

Now analyse the following audio
"""

In [None]:
generate(model, prompt, audio_content, safety_settings, generation_config) # updated description with prompt changes

This audio is a language learning exercise. It presents a series of sentences in English, each describing an action. The sentences are spoken clearly and slowly, with a slight pause between each one. The purpose of the audio is to help learners practice listening and repeating the sentences. 

Here is a breakdown of the sentences:

1. **He is eating.**
2. **He is washing the car.**
3. **She is listening to the radio.**
4. **They are studying.**
5. **He is cooking.**
6. **She is sleeping.**
7. **He is reading.**
8. **She is drinking.**
9. **They are talking.**
10. **They are watching TV.**
11. **He is doing his homework.**
12. **She is cleaning the house.**
13. **She is driving.**
14. **They are walking.**
15. **She is making lunch.**
16. **He is doing the laundry.**

The audio is designed to help learners improve their pronunciation and understanding of basic English verbs and sentence structures. 


#### What changed
We were able to capture more detail with this prompt, even though the prompt is rather generic and can be reused for other audio files. Now let's add a system instruction and see what the result will be.
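
Because the prompt is generic, it can be assembled from reusable pieces. The sketch below shows one way to combine a persona, a mission and XML-tagged instructions. `build_prompt` and its parameters are hypothetical names introduced here for illustration, not part of the SDK.

```python
def build_prompt(persona, mission, instructions, closing="Now analyze the following audio"):
    """Assemble a persona + mission + XML-tagged instruction prompt."""
    bullet_lines = "\n".join(f"- {step}" for step in instructions)
    return (
        f"{persona} {mission}\n\n"
        f"<Instructions>\n{bullet_lines}\n</Instructions>\n\n"
        f"{closing}\n"
    )


prompt_sketch = build_prompt(
    persona="You are an audio analyzer.",
    mission="You receive an audio and produce a detailed description of what happens in it.",
    instructions=[
        "Determine what happens in the audio",
        "If there are dialogues, identify the talking personas",
    ],
)
print(prompt_sketch)
```

Keeping the persona, mission and steps as separate arguments makes it easy to change one of them without retyping the whole prompt.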

### Audio Understanding Task. System instruction

To achieve the desired behaviour, we can use a system instruction. A system instruction is more "sticky" across a multi-turn conversation: if you want behaviour that the model follows consistently on every turn, the system instruction is the best place to put it.

There is no consensus on whether to use a system instruction or a regular prompt; for a single request they work more or less the same. Let's demonstrate it.

In [None]:
system_prompt = """You are an audio analyzer. You receive an audio and produce the detailed description about what happens in the audio.

<Instructions>
- Determine what happens in the audio
- Understand the hidden meaning of the audio
- If there are dialogues, identify the talking personas
- Make sure the description is clear and helpful
</Instructions>
"""

In [None]:
model = GenerativeModel(MODEL_NAME, system_instruction=system_prompt)
simple_prompt = "Now analyze the audio"

In [None]:
generate(model, simple_prompt, audio_content, safety_settings, generation_config) # persona and instructions now come from the system instruction

The audio is a recording of a CD-ROM for an English language learning program called "English in Action 1". It is the second edition of the program, authored by Barbara H. Foley and Elizabeth R. Nebblett. The copyright is held by National Geographic Learning, a part of Cengage Learning.

The audio begins with a brief introduction, stating that it is an audio program to accompany the English in Action 1 textbook. Then, it moves into a "Listen and Repeat" section, where a narrator reads out 16 sentences, each describing a different activity. The sentences are simple and repetitive, using the present continuous tense. 

Here are the sentences and their corresponding activities:

1. **He is eating.**
2. **He is washing the car.**
3. **She is listening to the radio.**
4. **They are studying.**
5. **He is cooking.**
6. **She is sleeping.**
7. **He is reading.**
8. **She is drinking.**
9. **They are talking.**
10. **They are watching TV.**
11. **He is doing his homework.**
12. **She is cleani
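
The "sticky" behaviour of a system instruction matters most in multi-turn use. The sketch below is a rough illustration under stated assumptions: `ask_follow_ups` is a hypothetical helper, and `model` is assumed to be a `GenerativeModel` created with `system_instruction`, so that `start_chat()` and `send_message()` apply the instruction on every turn.

```python
def ask_follow_ups(model, audio_part, questions):
    """Send several questions in one chat session and return the answers in order."""
    chat = model.start_chat()
    answers = []
    for index, question in enumerate(questions):
        # Attach the audio only to the first turn, with the media before the text.
        content = [audio_part, question] if index == 0 else question
        answers.append(chat.send_message(content).text)
    return answers
```

With the objects defined earlier, a call could look like `ask_follow_ups(model, audio_content, ["Now analyze the audio", "Which tense do the sentences use?"])`; the second question is only an example.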

### Audio Understanding Task. Structured Output

One of the best practices for achieving better understanding is to use structured output and to make the LLM output its "reasoning" about the matter. This has proven to be a very effective method, although it can increase latency. Here is a recipe for designing structured output:
- Ask the LLM to provide the output as JSON
- Also ask it to provide "reasoning" of 4-5 sentences about the matter
- The "reasoning" field should come **before** the actual answer because of the autoregressive nature of LLMs
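
The recipe above can be sketched as a small template builder that always places the reasoning field first. `json_output_template` is a hypothetical helper, and the score placeholder is written as a string here only so that the template itself is valid JSON.

```python
import json


def json_output_template(subject):
    """Build a JSON skeleton whose reasoning field precedes the score field."""
    # Python dicts preserve insertion order, so the reasoning key is serialized first.
    template = {
        f"{subject}_reasoning": f"<4-5 sentences of reasoning about {subject}>",
        f"{subject}_score": "<number between 0 and 5>",
    }
    return json.dumps(template, indent=2)


print(json_output_template("helpfulness"))
```

The resulting text can be pasted into the prompt as the expected output format.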

In [None]:
structured_prompt = """You are an audio analyzer. You receive an audio and produce a harmfulness score: how harmful this audio can be for kids.

Key Points:
- The output MUST BE A VALID JSON object. This requires the following:
  - Use double quotes for all keys and string values.
  - Separate multiple key-value pairs with commas.
  - Avoid trailing commas after the last key-value pair.
  - Special characters in strings should be escaped using backslashes (e.g., use \\" for double quotes within strings).
- The JSON output should clearly define a harmfulness score and reasoning:
  {
    "harmfullness_reasoning": "<Step-by-step detailed reasoning about how harmful the audio is>",
    "harmfullness_score": <Number between 0 and 5 indicating how harmful the audio is>
  }
"""


In [None]:
generate(model, structured_prompt, audio_content, safety_settings, generation_config) # updated description with prompt changes

```json
{
  "harmfullness_reasoning": "The audio is a simple English language learning exercise. It features a voice reading out various sentences describing everyday activities. There is no explicit content, violence, or harmful language. The audio is designed for educational purposes and is appropriate for children.",
  "harmfullness_score": 0
}
```

### What changed
In this example we obtained a well-grounded score for the audio by making the LLM output the "reasoning" behind the score first, and then the score itself. It is **very important** to put the "reasoning" field before the score, so that the LLM generates the reasoning first and can rely on it to produce an appropriate score.
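
The model wrapped its reply in a Markdown `json` fence, so the reply needs a little cleaning before it can be used programmatically. The sketch below is a minimal illustration: `parse_scored_reply` is a hypothetical helper that strips an optional fence, parses the JSON, and checks that the reasoning key was emitted before the score key. The key names follow the recorded output above.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_scored_reply(reply, reasoning_key, score_key):
    """Parse a JSON reply and confirm the reasoning field comes before the score."""
    match = FENCED_BLOCK.search(reply)
    data = json.loads(match.group(1) if match else reply)
    keys = list(data)  # json.loads keeps the key order of the reply
    if keys.index(reasoning_key) > keys.index(score_key):
        raise ValueError("reasoning must come before the score")
    return data


# Shortened example reply, not actual model output.
reply = FENCE + 'json\n{"harmfullness_reasoning": "No harmful content.", "harmfullness_score": 0}\n' + FENCE
result = parse_scored_reply(reply, "harmfullness_reasoning", "harmfullness_score")
print(result["harmfullness_score"])
```

A missing key also raises `ValueError` here, because `list.index` fails when the key is absent.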

### Audio Understanding Task (Conclusion)

To summarize, by using these tips for writing better prompts, you can achieve better model performance without changing the model or its hyperparameters.
Let's list the best practices one more time:
- Use specific instructions
- Explicitly set the Persona
- Explicitly set the Mission
- Explain the steps one-by-one
- Use structured output
- If you ask the model to explain the reasoning behind a decision, ask for the reasoning before the decision. For example, the helpfulness reasoning should appear in the output before the helpfulness score.
- Put media content before text content. If you have images, put the images before the prompt; the same goes for audio and video.

Following these practices can help you improve the performance of LLMs without the need for sophisticated fine-tuning.