# Multimodal Prompt Engineering with Google Gemini

## Overview

Gemini 1.5 Flash is a multimodal model from the Gemini family. It offers a long context window of up to 1 million tokens, which lets it analyze large amounts of information in a single request. It can process text, images, audio, video, and code together. Learn more about [Gemini 1.5 Flash](https://deepmind.google/technologies/gemini/flash/).

Here we will:

- analyze images for insights.
- analyze audio for insights.
- understand videos (including their audio components).
- extract information from PDF documents.
- process images, video, audio, and text simultaneously.

## Getting Started

### Install Google Gen AI library for Python


In [None]:
!pip install google-generativeai==0.8.3

### Enter API key

In [None]:
from getpass import getpass

GOOGLE_API_KEY = getpass('Enter Gemini API Key:')
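
Typing the key on every run can get tedious. As a minimal sketch, assuming you are comfortable keeping the key in an environment variable, a small helper can check the environment first and fall back to the prompt only when nothing is set. The helper name `load_api_key` and the variable name `GOOGLE_API_KEY` are illustrative choices, not part of the library.

```python
import os
from getpass import getpass


def load_api_key(env_var="GOOGLE_API_KEY"):
    """Return the API key from the environment, prompting only if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to the same interactive prompt the notebook uses.
        key = getpass("Enter Gemini API Key:").strip()
    return key
```

With this in place, `GOOGLE_API_KEY = load_api_key()` could stand in for the direct `getpass` call.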

### Import the library and list available models


In [None]:
import google.generativeai as genai

genai.configure(api_key=GOOGLE_API_KEY)

for m in genai.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)

### Load the Gemini 1.5 Flash model



In [None]:
generation_config = genai.types.GenerationConfig(
    temperature=0
)
gemini = genai.GenerativeModel(model_name='gemini-1.5-flash-latest',
                               generation_config=generation_config)
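
Temperature is only one of the generation settings. The sketch below builds the settings as a plain dict, which the library is understood to accept in place of a `GenerationConfig` object. `make_generation_config` is a hypothetical helper; `max_output_tokens`, `top_p`, and `top_k` are other `GenerationConfig` fields, and the values you pick for them are up to you.

```python
def make_generation_config(temperature=0.0, max_output_tokens=None,
                           top_p=None, top_k=None):
    """Build a generation_config dict, leaving out any setting that is None."""
    settings = {
        "temperature": temperature,
        "max_output_tokens": max_output_tokens,
        "top_p": top_p,
        "top_k": top_k,
    }
    # Unset values are dropped so the model's own defaults apply.
    return {name: value for name, value in settings.items() if value is not None}
```

The result could then be passed as `generation_config=make_generation_config(max_output_tokens=512)` when constructing the model.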

### LLM basic usage

Below is a simple example that demonstrates how to prompt the Gemini 1.5 Flash model using the API. Learn more about the [Gemini API parameters](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/gemini#gemini-pro).

In [None]:
from IPython.display import Markdown, display

prompt = """
  Explain what is Generative AI in 3 bullet points
"""

response = gemini.generate_content(contents=prompt)
display(Markdown(response.text))
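
`response.text` is a convenience accessor, and in this SDK it raises a `ValueError` when the response contains no text part (for example, when a prompt is blocked). A defensive wrapper, sketched here under the hypothetical name `safe_text`, keeps the notebook from stopping on such a response. The `prompt_feedback` attribute is read with `getattr` in case it is absent.

```python
def safe_text(response, fallback="(no text returned)"):
    """Return response.text, or a fallback message if the accessor raises."""
    try:
        return response.text
    except ValueError as err:
        # Include prompt feedback, when present, to help diagnose the failure.
        feedback = getattr(response, "prompt_feedback", None)
        return f"{fallback} Reason: {err}. Feedback: {feedback}"
```

In the notebook, `display(Markdown(safe_text(response)))` would then show either the answer or the reason no answer came back.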

## Image Analysis

In [None]:
# download images using curl
!curl https://i.imgur.com/6b9jwkk.png -o image1.png
!curl https://i.imgur.com/9CWuU2q.png -o image2.png

In [None]:
from IPython.display import Image as ImageDisp, display

display(ImageDisp('image1.png'))

In [None]:
display(ImageDisp('image2.png'))

In [None]:
from PIL import Image

image1 = Image.open('image1.png')
image2 = Image.open('image2.png')

In [None]:
prompt = """
  Given the following images which can contain graphs, tables and text,
  analyze all of them to answer the following question:

  Tell me about the top 5 years with largest Wildfires
"""

contents = [image1, image2,  prompt]
response = gemini.generate_content(contents)
display(Markdown(response.text))

In [None]:
prompt = """
  Given the following images which can contain graphs, tables and text,
  analyze all of them to answer the following question:

  Tell me about trend of wildfires in terms of acreage burned by region and ownership
"""

contents = [image1, image2,  prompt]
response = gemini.generate_content(contents)
display(Markdown(response.text))
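
The two image prompts share the same preamble and differ only in the question. One way to avoid repeating it, sketched here with the hypothetical names `IMAGE_PREAMBLE` and `ask_about_images`, is to keep the preamble in a constant and pass the question in. The helper takes the model as an argument, so it makes no assumption about how the model was created.

```python
IMAGE_PREAMBLE = (
    "Given the following images which can contain graphs, tables and text,\n"
    "analyze all of them to answer the following question:\n\n"
)


def ask_about_images(model, images, question):
    """Send the images, then the shared preamble plus the question."""
    # Media parts come first and the text prompt last, as in the cells above.
    contents = list(images) + [IMAGE_PREAMBLE + question]
    return model.generate_content(contents)
```

For example, `ask_about_images(gemini, [image1, image2], "Tell me about the top 5 years with largest Wildfires")` would send the same request as the first image prompt, apart from whitespace.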

## PDF Doc Analysis

In [None]:
!wget https://sgp.fas.org/crs/misc/IF10244.pdf

In [None]:
pdf_ref = genai.upload_file(path='./IF10244.pdf')
pdf_ref

In [None]:
prompt = """
  Given the PDF file, use it to answer the following question:

  Tell me about the top 5 years with largest Wildfires
"""

contents = [pdf_ref, prompt]

response = gemini.generate_content(contents)

display(Markdown(response.text))
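
Long documents can use up a large share of the context window. As a rough sketch, the model's `count_tokens` method can be used to check the size of a request before sending it. The helper name `fits_context` is hypothetical, and the default limit of 1,000,000 tokens simply mirrors the figure quoted in the overview; the actual limit for your model may differ.

```python
def fits_context(model, contents, limit=1_000_000):
    """Return (fits, total_tokens) for the given request contents."""
    # count_tokens returns a response object with a total_tokens field.
    total = model.count_tokens(contents).total_tokens
    return total <= limit, total
```

Calling `fits_context(gemini, [pdf_ref, prompt])` before `generate_content` gives an early warning if the request is too large.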

## Audio understanding

Gemini 1.5 Flash can directly process audio for long-context understanding.


In [None]:
!wget "https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/pixel.mp3"

In [None]:
import IPython

IPython.display.Audio('./pixel.mp3')

In [None]:
audio_file = genai.upload_file(path='./pixel.mp3')

In [None]:
audio_file

#### Example 1: Title Generation

In [None]:

prompt = """
  Please provide a summary for the audio.
  Provide chapter titles with timestamps, be concise and short, no need to provide chapter summaries.
  Do not make up any information that is not part of the audio and do not be verbose.
"""

contents = [audio_file, prompt]
response = gemini.generate_content(contents=contents)
print(response.text)

#### Example 2: Transcription

In [None]:
prompt = """
    Can you transcribe this interview, in the format of [timecode] - [speaker] : caption.
    Use speaker A, speaker B, etc. to identify the speakers. Map each speaker to their real name at the start of the output
    Each speaker should have a single caption based on their starting timestamp
    Do not break up the transcript into multiple timestamps for the same speaker
    Show the output only for the part of the conversation about the pixel watch and follow the format mentioned above for the output
"""

contents = [audio_file, prompt]
response = gemini.generate_content(contents=contents)
print(response.text)
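
Because the prompt asks for a fixed line format, the transcript can be turned into structured records afterwards. The sketch below assumes the model follows the requested `[timecode] - [speaker] : caption` layout, with or without the square brackets; models do not always comply, so lines that do not match are skipped. `LINE_PATTERN` and `parse_transcript` are hypothetical names.

```python
import re

# timecode like 00:15 or 01:02:03, a dash, the speaker label, a colon, the caption
LINE_PATTERN = re.compile(
    r"^\[?(?P<timecode>\d{1,2}:\d{2}(?::\d{2})?)\]?\s*-\s*"
    r"\[?(?P<speaker>[^\]:]+?)\]?\s*:\s*(?P<caption>.+)$"
)


def parse_transcript(text):
    """Return a list of dicts with timecode, speaker, and caption keys."""
    entries = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:  # skip headers, blank lines, and anything off-format
            entries.append(match.groupdict())
    return entries
```

`parse_transcript(response.text)` would then give one record per caption, ready to load into a table.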

#### Example 3: Summarization

In [None]:
prompt = """
    Given the audio file, generate a comprehensive summary of:
     - Key Speakers
     - Key products and features discussed
     - Any other noteworthy discussions
"""

contents = [audio_file, prompt]
response = gemini.generate_content(contents=contents)
display(Markdown(response.text))

## Video with audio understanding

Try out Gemini 1.5 Flash’s native multimodal and long-context capabilities on video with interleaved audio.

In [None]:
!wget "https://storage.googleapis.com/cloud-samples-data/generative-ai/video/pixel8.mp4"

In [None]:
IPython.display.Video('pixel8.mp4', embed=True, width=450)

In [None]:
video_file = genai.upload_file(path='./pixel8.mp4')
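
Uploaded videos are processed on the server before they can be used in a prompt, and a request sent while the file is still in the `PROCESSING` state may fail. A minimal polling sketch follows. The helper name `wait_until_active` and its defaults are assumptions; `get_file` is passed in as an argument, so in the notebook it would be `genai.get_file`.

```python
import time


def wait_until_active(file, get_file, poll_seconds=5, timeout=300, sleep=time.sleep):
    """Poll until the uploaded file leaves PROCESSING, then return it."""
    waited = 0
    while file.state.name == "PROCESSING":
        if waited >= timeout:
            raise TimeoutError(f"{file.name} still processing after {timeout}s")
        sleep(poll_seconds)
        waited += poll_seconds
        file = get_file(file.name)  # refresh the file's metadata
    if file.state.name != "ACTIVE":
        raise RuntimeError(f"{file.name} ended in state {file.state.name}")
    return file
```

In this notebook that would be `video_file = wait_until_active(video_file, genai.get_file)`, and the same call applies to the other video uploads.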

In [None]:
prompt = """
  Provide a comprehensive summary of the video.
  The summary should also contain anything important which people discuss in the video.
"""

contents = [video_file, prompt]

response = gemini.generate_content(contents=contents)
display(Markdown(response.text))

In [None]:
!gdown -O 'awsq_video.mp4' '1shnBXeuXYcbRr9IhxofHkT3rlSaqPG1e'

In [None]:
IPython.display.Video('awsq_video.mp4', embed=True, width=450)

In [None]:
vid = genai.upload_file(path='./awsq_video.mp4')

In [None]:
prompt = """
  Provide a description of the video.
  The description should cover the key steps covered in the video in bullet points
"""

contents = [vid, prompt]
response = gemini.generate_content(contents)
display(Markdown(response.text))

Gemini 1.5 Flash can process the video together with its audio track, retrieving and extracting both textual and audio information.

## All modalities (images, video, audio, text) at once

Gemini 1.5 Flash is natively multimodal and supports interleaving data from different modalities: it can accept a mix of audio, visual, text, and code inputs in the same input sequence.

In [None]:
!wget 'https://storage.googleapis.com/cloud-samples-data/generative-ai/video/behind_the_scenes_pixel.mp4'

In [None]:
!wget 'https://storage.googleapis.com/cloud-samples-data/generative-ai/image/a-man-and-a-dog.png'

In [None]:
IPython.display.Image('a-man-and-a-dog.png', width=450)

In [None]:
video_file = genai.upload_file(path='./behind_the_scenes_pixel.mp4')
image_file = genai.upload_file(path='./a-man-and-a-dog.png')

In [None]:
prompt = """
  Look through each frame in the video carefully and answer the questions.
  Only base your answers strictly on what information is available in the video attached.
  Do not make up any information that is not part of the video and summarize your answer
  in three bullet points max

  Questions:
  - When is the moment in the image happening in the video? Provide a timestamp.
  - What is the context of the moment and what does the narrator say about it?
"""

contents = [video_file, image_file, prompt]
response = gemini.generate_content(contents)
display(Markdown(response.text))
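
The files uploaded in this notebook remain on the server after the cells finish. If you would rather remove them explicitly, a small cleanup helper can do so. This is a sketch: `delete_uploaded_files` is a hypothetical name, and `delete_file` is passed in as an argument, so in the notebook it would be `genai.delete_file`.

```python
def delete_uploaded_files(files, delete_file):
    """Delete each uploaded file and return the names that were removed."""
    removed = []
    for file in files:
        delete_file(file.name)  # files are addressed by their server-side name
        removed.append(file.name)
    return removed
```

For example, `delete_uploaded_files([pdf_ref, audio_file, vid, video_file, image_file], genai.delete_file)` would remove the uploads still referenced by variables at this point.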

## Conclusion

In this tutorial, you've learned how to use Gemini 1.5 Flash to:

- analyze images for insights.
- analyze PDF docs for insights.
- analyze audio for insights.
- understand videos (including their audio components).
- process images, video, audio, and text simultaneously.