In [1]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Document Processing with Gemini

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fdocument-processing%2Fdocument_processing.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Run in Colab Enterprise
    </a>
  </td>       
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/document-processing/document_processing.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
</table>


| | |
|-|-|
|Author(s) | [Holt Skinner](https://github.com/holtskinner), [Drew Gillson](https://github.com/drewgillson) |

## Overview

In today's information-driven world, the volume of digital documents generated daily is staggering. From emails and reports to legal contracts and scientific papers, businesses and individuals alike are inundated with vast amounts of textual data. Extracting meaningful insights from these documents efficiently and accurately has become a paramount challenge.

Document processing involves a range of tasks, including text extraction, classification, summarization, and translation, among others. Traditional methods often rely on rule-based algorithms or statistical models, which may struggle with the nuances and complexities of natural language.

Generative AI offers a promising alternative: it can understand, generate, and manipulate text through natural language prompting. Gemini on Vertex AI makes these models available at scale through:

- [Vertex AI Studio](https://cloud.google.com/generative-ai-studio) in the Cloud Console
- [Vertex AI REST API](https://cloud.google.com/vertex-ai/docs/reference/rest)
- [Vertex AI SDK for Python](https://cloud.google.com/vertex-ai/docs/python-sdk/use-vertex-ai-python-sdk-ref)
- [Other client libraries](https://cloud.google.com/vertex-ai/docs/start/client-libraries)

This notebook focuses on using the **Vertex AI SDK for Python** to call the Gemini API in Vertex AI with the Gemini 1.5 Flash model.

For more information, see the [Generative AI on Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview) documentation.


### Objectives

In this tutorial, you will learn how to use the Gemini API in Vertex AI with the Vertex AI SDK for Python to process PDF documents.

You will complete the following tasks:

- Install the Vertex AI SDK for Python
- Use the Gemini API in Vertex AI to interact with the Gemini 1.5 Flash (`gemini-1.5-flash`) model:
  - Extract structured entities from an unstructured document
  - Classify document types
  - Combine classification and entity extraction into a single workflow
  - Answer questions about a document
  - Summarize documents
  - Parse tables from documents
  - Translate documents
  - Compare multiple documents


### Costs

This tutorial uses billable components of Google Cloud:

- Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.


## Getting Started


### Install Vertex AI SDK for Python


In [2]:
%pip install --upgrade --user --quiet google-cloud-aiplatform

Note: you may need to restart the kernel to use updated packages.


### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [3]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).


In [1]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [2]:
# Define project information
PROJECT_ID = "qwiklabs-gcp-00-8f9bc8d932c0"  # @param {type:"string"}
LOCATION = "europe-west1"  # @param {type:"string"}

# Initialize Vertex AI
import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)
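
Hardcoding a project ID ties the notebook to one environment. As a minimal sketch, the settings could instead be read from environment variables. The helper name `resolve_project_settings`, the `GOOGLE_CLOUD_REGION` variable, and the `us-central1` default are assumptions made for this illustration and are not part of the original notebook.

```python
import os


def resolve_project_settings(env=None, default_location="us-central1"):
    """Return (project_id, location) read from environment variables.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    The variable names and the default region are assumptions for this sketch.
    """
    env = os.environ if env is None else env
    project_id = env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if not project_id:
        raise ValueError("Set GOOGLE_CLOUD_PROJECT to your Google Cloud project ID.")
    location = env.get("GOOGLE_CLOUD_REGION", default_location).strip()
    return project_id, location
```

The returned pair can then be passed to `vertexai.init(project=..., location=...)` in place of the literal values.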



### Import libraries


In [3]:
import json

from IPython.display import Markdown, display, display_pdf
from vertexai.generative_models import (
    GenerationConfig,
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    Part,
)

### Load the Gemini 1.5 Flash model

Gemini 1.5 Flash (`gemini-1.5-flash`) is a multimodal model: a prompt can combine text, images, video, and documents such as PDFs, and the model returns text or code responses.

In [4]:
model = GenerativeModel(
    "gemini-1.5-flash",
    safety_settings={
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH
    },
)
# This Generation Config sets the model to respond in JSON format.
generation_config = GenerationConfig(
    temperature=0.0, response_mime_type="application/json"
)

### Define helper functions

Define helper functions to print the multimodal prompt and to send a document stored in Cloud Storage to Gemini.

In [5]:
PDF_MIME_TYPE = "application/pdf"


def print_multimodal_prompt(contents: list) -> None:
    """
    Given contents that would be sent to Gemini,
    output the full multimodal prompt for ease of readability.
    """
    for content in contents:
        if not isinstance(content, Part):
            print(content)
        elif content.inline_data:
            # The inline data is raw PDF bytes, so pass raw=True
            display_pdf(content.inline_data.data, raw=True)
        elif content.file_data:
            gcs_url = (
                "https://storage.googleapis.com/"
                + content.file_data.file_uri.replace("gs://", "").replace(" ", "%20")
            )
            print(f"PDF URL: {gcs_url}")


# Send Google Cloud Storage Document to Vertex AI
def process_document(
    prompt: str,
    file_uri: str,
    mime_type: str = PDF_MIME_TYPE,
    generation_config: GenerationConfig | None = None,
    print_prompt: bool = False,
    print_raw_response: bool = False,
) -> str:
    """Send a document in Cloud Storage to Gemini with a prompt and return the response text."""
    # Load file directly from Google Cloud Storage
    file_part = Part.from_uri(
        uri=file_uri,
        mime_type=mime_type,
    )

    # Load contents
    contents = [file_part, prompt]

    # Send to Gemini
    response = model.generate_content(contents, generation_config=generation_config)

    if print_prompt:
        print("-------Prompt--------")
        print_multimodal_prompt(contents)

    if print_raw_response:
        print("\n-------Raw Response--------")
        print(response)

    return response.text

## Entity Extraction

[Named Entity Extraction](https://en.wikipedia.org/wiki/Named-entity_recognition) is a Natural Language Processing technique for identifying specific fields and values in unstructured text. For example, you can find key-value pairs in a filled-out form, or get all of the important data from an invoice, categorized by type.

### Extract entities from an invoice

In this example, you will use a sample invoice and get all of the information in JSON format.

This is the prompt to be sent to Gemini along with the PDF document. Feel free to edit this for your specific use case.

In [6]:
invoice_extraction_prompt = """You are a document entity extraction specialist. Given a document, your task is to extract the text value of the following entities:
{
	"amount_paid_since_last_invoice": "",
	"carrier": "",
	"currency": "",
	"currency_exchange_rate": "",
	"delivery_date": "",
	"due_date": "",
	"freight_amount": "",
	"invoice_date": "",
	"invoice_id": "",
	"line_items": [
		{
			"amount": "",
			"description": "",
			"product_code": "",
			"purchase_order": "",
			"quantity": "",
			"unit": "",
			"unit_price": ""
		}
	],
	"net_amount": "",
	"payment_terms": "",
	"purchase_order": "",
	"receiver_address": "",
	"receiver_email": "",
	"receiver_name": "",
	"receiver_phone": "",
	"receiver_tax_id": "",
	"receiver_website": "",
	"remit_to_address": "",
	"remit_to_name": "",
	"ship_from_address": "",
	"ship_from_name": "",
	"ship_to_address": "",
	"ship_to_name": "",
	"supplier_address": "",
	"supplier_email": "",
	"supplier_iban": "",
	"supplier_name": "",
	"supplier_payment_ref": "",
	"supplier_phone": "",
	"supplier_registration": "",
	"supplier_tax_id": "",
	"supplier_website": "",
	"total_amount": "",
	"total_tax_amount": "",
	"vat": [
		{
			"amount": "",
			"category_code": "",
			"tax_amount": "",
			"tax_rate": "",
			"total_amount": ""
		}
	]
}

- The JSON schema must be followed during the extraction.
- The values must only include text found in the document
- Do not normalize any entity value.
- If an entity is not found in the document, set the entity value to null.
"""

In [7]:
# Download a PDF from Google Cloud Storage
! gsutil cp "gs://cloud-samples-data/generative-ai/pdf/invoice.pdf" ./invoice.pdf

Copying gs://cloud-samples-data/generative-ai/pdf/invoice.pdf...
/ [1 files][340.0 KiB/340.0 KiB]                                                
Operation completed over 1 objects/340.0 KiB.                                    


In [8]:
# Load file bytes
with open("invoice.pdf", "rb") as f:
    file_part = Part.from_data(data=f.read(), mime_type="application/pdf")

# Load contents
contents = [file_part, invoice_extraction_prompt]

# Send to Gemini with GenerationConfig
response = model.generate_content(contents, generation_config=generation_config)

In [9]:
print("-------Prompt--------")
print_multimodal_prompt(contents)

print("\n-------Raw Response--------")
print(response.text)

-------Prompt--------
You are a document entity extraction specialist. Given a document, your task is to extract the text value of the following entities:
{
	"amount_paid_since_last_invoice": "",
	"carrier": "",
	"currency": "",
	"currency_exchange_rate": "",
	"delivery_date": "",
	"due_date": "",
	"freight_amount": "",
	"invoice_date": "",
	"invoice_id": "",
	"line_items": [
		{
			"amount": "",
			"description": "",
			"product_code": "",
			"purchase_order": "",
			"quantity": "",
			"unit": "",
			"unit_price": ""
		}
	],
	"net_amount": "",
	"payment_terms": "",
	"purchase_order": "",
	"receiver_address": "",
	"receiver_email": "",
	"receiver_name": "",
	"receiver_phone": "",
	"receiver_tax_id": "",
	"receiver_website": "",
	"remit_to_address": "",
	"remit_to_name": "",
	"ship_from_address": "",
	"ship_from_name": "",
	"ship_to_address": "",
	"ship_to_name": "",
	"supplier_address": "",
	"supplier_email": "",
	"supplier_iban": "",
	"supplier_name": "",
	"supplier_payment_ref": "",


This response can then be parsed as JSON into a Python dictionary for use in other applications.

In [10]:
print("\n-------Parsed Entities--------")
json_object = json.loads(response.text)
print(json_object)


-------Parsed Entities--------
{'amount_paid_since_last_invoice': None, 'carrier': None, 'currency': '$', 'currency_exchange_rate': None, 'delivery_date': None, 'due_date': None, 'freight_amount': None, 'invoice_date': '02/23/2021', 'invoice_id': '3222', 'line_items': [{'amount': '490.12', 'description': 'Drag Series Transmission Build - A WD DSM', 'product_code': None, 'purchase_order': None, 'quantity': '1', 'unit': None, 'unit_price': '490.12'}, {'amount': '220.15', 'description': 'Drive Shaft Automatic Right', 'product_code': None, 'purchase_order': None, 'quantity': '7', 'unit': None, 'unit_price': '31.45'}, {'amount': '549.10', 'description': 'Multigrade Synthetic Technology Bench', 'product_code': None, 'purchase_order': None, 'quantity': '1', 'unit': None, 'unit_price': '549.10'}, {'amount': '1,187.79', 'description': '6689 Transit Stan', 'product_code': None, 'purchase_order': None, 'quantity': '1', 'unit': None, 'unit_price': '1,187.79'}, {'amount': '883.12', 'description': 

You can see that Gemini extracted all of the relevant fields from the document.
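
Because the prompt asks Gemini not to normalize values, amounts come back as strings such as `'1,187.79'`. The following is a minimal sketch of downstream post-processing, assuming US-style number formatting (comma as thousands separator, optional `$` sign). The helper names `to_decimal` and `check_line_items` are hypothetical and not part of the original notebook.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(value):
    """Convert an extracted amount such as '1,187.79' or '$ 20.00' to Decimal.

    Returns None when the value is missing or not numeric.
    Assumes US-style formatting (an assumption of this sketch).
    """
    if value is None:
        return None
    cleaned = str(value).replace(",", "").replace("$", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def check_line_items(line_items):
    """Return descriptions of line items where quantity * unit_price != amount."""
    mismatches = []
    for item in line_items:
        quantity = to_decimal(item.get("quantity"))
        unit_price = to_decimal(item.get("unit_price"))
        amount = to_decimal(item.get("amount"))
        if None in (quantity, unit_price, amount):
            continue  # Not enough data to check this item
        if quantity * unit_price != amount:
            mismatches.append(item.get("description"))
    return mismatches
```

Calling `check_line_items(json_object["line_items"])` on the parsed response would list any line items whose arithmetic does not add up, which can serve as a cheap sanity check on the extraction.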

### Extract entities from a payslip

Now try another type of document: a payslip (also called a paystub).

In [11]:
payslip_extraction_prompt = """You are a document entity extraction specialist. Given a document, your task is to extract the text value of the following entities:
{
"earning_item": [
{
"earning_rate": "",
"earning_hours": "",
"earning_type": "",
"earning_this_period": ""
}
],
"direct_deposit_item": [
{
"direct_deposit": "",
"employee_account_number": ""
}
],
"current_deduction": "",
"ytd_deduction": "",
"employee_id": "",
"employee_name": "",
"employer_name": "",
"employer_address": "",
"federal_additional_tax": "",
"federal_allowance": "",
"federal_marital_status": "",
"gross_earnings": "",
"gross_earnings_ytd": "",
"net_pay": "",
"net_pay_ytd": "",
"ssn": "",
"pay_date": "",
"pay_period_end": "",
"pay_period_start": "",
"state_additional_tax": "",
"state_allowance": "",
"state_marital_status": "",
"tax_item": [
{
"tax_this_period": "",
"tax_type": "",
"tax_ytd": ""
}
]
}

- The JSON schema must be followed during the extraction.
- The values must only include text strings found in the document.
- Generate null for missing entities.
"""

In [12]:
response_text = process_document(
    payslip_extraction_prompt,
    "gs://cloud-samples-data/generative-ai/pdf/earnings_statement.pdf",
    generation_config=generation_config,
    print_prompt=True,
)

-------Prompt--------
PDF URL: https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/earnings_statement.pdf
You are a document entity extraction specialist. Given a document, your task is to extract the text value of the following entities:
{
"earning_item": [
{
"earning_rate": "",
"earning_hours": "",
"earning_type": "",
"earning_this_period": ""
}
],
"direct_deposit_item": [
{
"direct_deposit": "",
"employee_account_number": ""
}
],
"current_deduction": "",
"ytd_deduction": "",
"employee_id": "",
"employee_name": "",
"employer_name": "",
"employer_address": "",
"federal_additional_tax": "",
"federal_allowance": "",
"federal_marital_status": "",
"gross_earnings": "",
"gross_earnings_ytd": "",
"net_pay": "",
"net_pay_ytd": "",
"ssn": "",
"pay_date": "",
"pay_period_end": "",
"pay_period_start": "",
"state_additional_tax": "",
"state_allowance": "",
"state_marital_status": "",
"tax_item": [
{
"tax_this_period": "",
"tax_type": "",
"tax_ytd": ""
}
]
}

- The JSON schema mus

In [13]:
print("\n-------Parsed Entities--------")
json_object = json.loads(response_text)
print(json_object)


-------Parsed Entities--------
{'earning_item': [{'earning_rate': '20', 'earning_hours': '80', 'earning_type': 'regular pay', 'earning_this_period': '1,600.00'}], 'direct_deposit_item': [], 'current_deduction': '160.00', 'ytd_deduction': '1,920.00', 'employee_id': '123456', 'employee_name': 'Janet Doe', 'employer_name': 'The Greatest Company LLC', 'employer_address': '176 Imaginary Ave\nCambridge, ΜΑ 02138', 'federal_additional_tax': None, 'federal_allowance': None, 'federal_marital_status': None, 'gross_earnings': '1,600.00', 'gross_earnings_ytd': '19,200.00', 'net_pay': '1,060.80', 'net_pay_ytd': '12,729.60', 'ssn': 'XXX-XX-1234', 'pay_date': '12/15/17', 'pay_period_end': '12/12/17', 'pay_period_start': '11/10/17', 'state_additional_tax': None, 'state_allowance': None, 'state_marital_status': None, 'tax_item': [{'tax_this_period': '20.80', 'tax_type': 'FICA MED TAX', 'tax_ytd': '249.60'}, {'tax_this_period': '99.20', 'tax_type': 'FICA SS TAX', 'tax_ytd': '1190.40'}, {'tax_this_perio
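
The prompt requires the response to follow the given JSON schema, but nothing in the notebook verifies that it does. A minimal sketch of such a check is shown below, assuming the template from the prompt is also available as a Python dictionary. The function name `find_missing_keys` is hypothetical.

```python
def find_missing_keys(template, result, path=""):
    """Recursively list keys present in the template but absent from the result.

    Lists of objects are checked item by item against the first template item.
    """
    missing = []
    for key, expected in template.items():
        key_path = f"{path}.{key}" if path else key
        if not isinstance(result, dict) or key not in result:
            missing.append(key_path)
            continue
        actual = result[key]
        if isinstance(expected, dict) and isinstance(actual, dict):
            missing.extend(find_missing_keys(expected, actual, key_path))
        elif isinstance(expected, list) and expected and isinstance(expected[0], dict):
            items = actual if isinstance(actual, list) else []
            for index, item in enumerate(items):
                missing.extend(
                    find_missing_keys(expected[0], item, f"{key_path}[{index}]")
                )
    return missing
```

An empty return value means every key in the template appears in the response. A `null` value still counts as present, which matches the prompt's instruction to generate null for missing entities.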

## Document Classification

Document classification is the process of identifying the type of a document, for example an invoice, a W-2, or a receipt.

In this example, you will send a sample tax form (W-9) to Gemini and ask it to choose the document type from a predefined list of categories.

In [14]:
classification_prompt = """You are a document classification assistant. Given a document, your task is to find which category the document belongs to from the list of document categories provided below.

 1040_2019
 1040_2020
 1099-r
 bank_statement
 credit_card_statement
 expense
 form_1120S_2019
 form_1120S_2020
 investment_retirement_statement
 invoice
 paystub
 property_insurance
 purchase_order
 utility_statement
 w2
 w9
 driver_license

Which category does the above document belong to? Answer with one of the predefined document categories only.
"""

In [15]:
response_text = process_document(
    classification_prompt,
    "gs://cloud-samples-data/generative-ai/pdf/w9.pdf",
    print_prompt=True,
)

-------Prompt--------
PDF URL: https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/w9.pdf
You are a document classification assistant. Given a document, your task is to find which category the document belongs to from the list of document categories provided below.

 1040_2019
 1040_2020
 1099-r
 bank_statement
 credit_card_statement
 expense
 form_1120S_2019
 form_1120S_2020
 investment_retirement_statement
 invoice
 paystub
 property_insurance
 purchase_order
 utility_statement
 w2
 w9
 driver_license

Which category does the above document belong to? Answer with one of the predefined document categories only.



In [16]:
print("\n-------Document Classification--------")
print(response_text)


-------Document Classification--------
w9



You can see that Gemini successfully categorized the document.
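
The classification response is free text, so it may carry trailing whitespace, quotes, or different casing. The sketch below shows one way to map the raw response onto the predefined categories before using it in code. The function name `normalize_label` is hypothetical, and the matching rules are assumptions for illustration.

```python
def normalize_label(raw_response, allowed_labels):
    """Map a raw model response to one of the allowed labels, or None.

    Matching ignores surrounding whitespace, quotes, backticks, and letter case.
    """
    cleaned = raw_response.strip().strip("`'\"").strip().lower()
    lookup = {label.lower(): label for label in allowed_labels}
    return lookup.get(cleaned)
```

A return value of `None` signals that the model answered with something outside the list, which the calling code can treat as an unclassified document.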

### Chaining Classification and Extraction

These techniques can be chained together to extract entities from any number of document types. For example, if you have multiple types of documents to process, you can first send each document to Gemini with a classification prompt, then use that output to select the matching extraction prompt.

In [17]:
generic_document_prompt = """You are a document entity extraction specialist. Given a document, your task is to extract the text value of the following entities:

{}

- The JSON schema must be followed during the extraction.
- The values must only include text found in the document
- Do not normalize any entity value.
- If an entity is not found in the document, set the entity value to null.
"""

w2_extraction_prompt = generic_document_prompt.format(
    """
{
    "ControlNumber": "",
    "EIN": "",
    "EmployeeAddress_City": "",
    "EmployeeAddress_State": "",
    "EmployeeAddress_StreetAddressOrPostalBox": "",
    "EmployeeAddress_Zip": "",
    "EmployeeName_FirstName": "",
    "EmployeeName_LastName": "",
    "EmployerAddress_City": "",
    "EmployerAddress_State": "",
    "EmployerAddress_StreetAddressOrPostalBox": "",
    "EmployerAddress_Zip": "",
    "EmployerName": "",
    "EmployerStateIdNumber_Line1": "",
    "FederalIncomeTaxWithheld": "",
    "FormYear": "",
    "MedicareTaxWithheld": "",
    "MedicareWagesAndTips": "",
    "SocialSecurityTaxWithheld": "",
    "SocialSecurityWages": "",
    "StateIncomeTax_Line1": "",
    "StateWagesTipsEtc_Line1": "",
    "State_Line1": "",
    "WagesTipsOtherCompensation": "",
    "a_Code": "",
    "a_Value": "",
}
"""
)

drivers_license_prompt = generic_document_prompt.format(
    """
{
    "Address": "",
    "Date Of Birth": "",
    "Document Id": "",
    "Expiration Date": "",
    "Family Name": "",
    "Given Names": "",
    "Issue Date": "",
}
"""
)

# Map classification types to extraction prompts
classification_to_prompt = {
    "invoice": invoice_extraction_prompt,
    "w2": w2_extraction_prompt,
    "driver_license": drivers_license_prompt,
}

In [18]:
gcs_uris = [
    "gs://cloud-samples-data/documentai/SampleDocuments/US_DRIVER_LICENSE_PROCESSOR/dl3.pdf",
    "gs://cloud-samples-data/documentai/SampleDocuments/INVOICE_PROCESSOR/google_invoice.pdf",
    "gs://cloud-samples-data/documentai/SampleDocuments/FORM_W2_PROCESSOR/2020FormW-2.pdf",
]

for gcs_uri in gcs_uris:
    print(f"\nFile: {gcs_uri}\n")

    # Send to Gemini with Classification Prompt
    doc_classification = process_document(classification_prompt, gcs_uri).strip()

    print(f"Document Classification: {doc_classification}")

    # Get Extraction prompt based on Classification
    extraction_prompt = classification_to_prompt.get(doc_classification)

    if not extraction_prompt:
        print(f"Document does not belong to a specified class {doc_classification}")
        continue

    # Send to Gemini with Extraction Prompt
    extraction_response_text = process_document(
        extraction_prompt,
        gcs_uri,
        generation_config=generation_config,
        print_prompt=True,
    ).strip()

    print("\n-------Extracted Entities--------")
    json_object = json.loads(extraction_response_text)
    print(json_object)


File: gs://cloud-samples-data/documentai/SampleDocuments/US_DRIVER_LICENSE_PROCESSOR/dl3.pdf

Document Classification: driver_license
-------Prompt--------
PDF URL: https://storage.googleapis.com/cloud-samples-data/documentai/SampleDocuments/US_DRIVER_LICENSE_PROCESSOR/dl3.pdf
You are a document entity extraction specialist. Given a document, your task is to extract the text value of the following entities:


{
    "Address": "",
    "Date Of Birth": "",
    "Document Id": "",
    "Expiration Date": "",
    "Family Name": "",
    "Given Names": "",
    "Issue Date": "",
}


- The JSON schema must be followed during the extraction.
- The values must only include text found in the document
- Do not normalize any entity value.
- If an entity is not found in the document, set the entity value to null.


-------Extracted Entities--------
{'Address': '123 MAIN STREET\nHELENA, MT 59601', 'Date Of Birth': '08/04/1968', 'Document Id': '0812319684104', 'Expiration Date': '08/04/2023', 'Family N
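
The loop above prints results and would stop on the first response that is not valid JSON. As a minimal sketch, the same classify-then-extract flow can collect results per file and record failures instead of raising. The function `run_pipeline` and its `classify` and `extract` parameters are hypothetical. They are passed in as callables so the sketch runs without calling Gemini.

```python
import json


def run_pipeline(uris, classify, extract, prompts_by_label):
    """Classify each document, then extract entities with the matching prompt.

    `classify(uri)` returns a label; `extract(prompt, uri)` returns JSON text.
    Both are hypothetical callables supplied by the caller.
    """
    results = {}
    for uri in uris:
        label = classify(uri).strip()
        prompt = prompts_by_label.get(label)
        if prompt is None:
            results[uri] = {
                "label": label,
                "entities": None,
                "error": "no prompt for label",
            }
            continue
        try:
            entities = json.loads(extract(prompt, uri))
            error = None
        except json.JSONDecodeError as exc:
            entities, error = None, f"invalid JSON: {exc.msg}"
        results[uri] = {"label": label, "entities": entities, "error": error}
    return results
```

In this notebook, `classify` could wrap `process_document(classification_prompt, uri)` and `extract` could wrap `process_document(prompt, uri, generation_config=generation_config)`, with `classification_to_prompt` passed as `prompts_by_label`.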

## Document Question Answering

Gemini can be used to answer questions about a document.

This example answers a question about the Transformer model paper "Attention Is All You Need".

In [19]:
qa_prompt = """What is attention in the context of transformer models? Give me the answer first, followed by an explanation."""

In [22]:
# Send Q&A Prompt to Gemini
response_text = process_document(
    qa_prompt,
    "gs://cloud-samples-data/generative-ai/pdf/1706.03762v7.pdf",
)

print(f"Answer: {response_text}")

Answer: Attention in the context of transformer models is a mechanism that allows the model to focus on specific parts of the input sequence when generating the output sequence. It is a key component of the transformer architecture, which has achieved state-of-the-art results in various natural language processing tasks, such as machine translation.

Here's a more detailed explanation:

* **Self-attention:** In transformer models, self-attention allows the model to attend to different parts of the same input sequence. This helps the model understand the relationships between words and phrases in a sentence, even if they are far apart.
* **Multi-head attention:** The transformer architecture uses multiple attention heads, each of which attends to a different subset of the input sequence. This allows the model to capture a more comprehensive understanding of the input sequence and to generate more nuanced output.
* **Scaled dot-product attention:** The transformer architecture uses scale

## Document Summarization

Gemini can also be used to summarize or paraphrase a document's contents. Your prompt can specify how detailed the summary should be or specific formatting, such as bullet points or paragraphs.

In [23]:
summarization_prompt = """You are a very professional document summarization specialist. Given a document, your task is to provide a detailed summary of the content of the document.

If it includes images, provide descriptions of the images.
If it includes tables, extract all elements of the tables.
If it includes graphs, explain the findings in the graphs.
Do not include any numbers that are not mentioned in the document.
"""

In [24]:
# Send Summarization Prompt to Gemini
response_text = process_document(
    summarization_prompt,
    "gs://cloud-samples-data/generative-ai/pdf/fdic_board_meeting.pdf",
)

print(f"Summarization: {response_text}")

Summarization: The document is a speech by FDIC Chairman Jelena McWilliams on the Notice of Proposed Rulemaking on Revisions to the Community Reinvestment Act Regulations. 

The speech highlights the need for updating the Community Reinvestment Act (CRA) regulations to reflect changes in the banking industry and ensure that they continue to serve the needs of low- and moderate-income (LMI) communities. The proposed rulemaking aims to achieve this goal by:

* Encouraging banks to make long-term commitments in LMI communities by providing greater credit for retail loans retained on-balance sheet.
* Increasing the size of qualifying loans to small businesses and small farms to encourage economic development and job creation.
* Providing CRA credit for retail and community development activities in Indian Country.
* Expanding the activities that qualify for CRA credit to include capital investments and loan participations undertaken by a bank in cooperation with Community Development Finan

## Table parsing from documents

Gemini can parse the contents of a table and return them in a structured format, such as HTML or Markdown.

In [25]:
table_extraction_prompt = """What is the html code of the table in this document?"""

In [26]:
# Send Table Extraction Prompt to Gemini
response_text = process_document(
    table_extraction_prompt,
    "gs://cloud-samples-data/generative-ai/pdf/salary_table.pdf",
)
display(Markdown(response_text))

```html
<table border="1">
  <thead>
    <tr>
      <th rowspan="2">Grade</th>
      <th colspan="10">Annual Rates by Grade and Step</th>
      <th rowspan="2">WITHIN GRADE AMOUNTS</th>
    </tr>
    <tr>
      <th>Step 1</th>
      <th>Step 2</th>
      <th>Step 3</th>
      <th>Step 4</th>
      <th>Step 5</th>
      <th>Step 6</th>
      <th>Step 7</th>
      <th>Step 8</th>
      <th>Step 9</th>
      <th>Step 10</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>$ 20,999</td>
      <td>$ 21,704</td>
      <td>$ 22,401</td>
      <td>$ 23,097</td>
      <td>$ 23,794</td>
      <td>$ 24,202</td>
      <td>$ 24,893</td>
      <td>$ 25,589</td>
      <td>$ 25,617</td>
      <td>$ 26,273</td>
      <td>VARIES</td>
    </tr>
    <tr>
      <td>2</td>
      <td>23,612</td>
      <td>24,174</td>
      <td>24,956</td>
      <td>25,617</td>
      <td>25,906</td>
      <td>26,668</td>
      <td>27,430</td>
      <td>28,192</td>
      <td>28,954</td>
      <td>29,716</td>
      <td>VARIES</td>
    </tr>
    <tr>
      <td>3</td>
      <td>25,764</td>
      <td>26,623</td>
      <td>27,482</td>
      <td>28,341</td>
      <td>29,200</td>
      <td>30,059</td>
      <td>30,918</td>
      <td>31,777</td>
      <td>32,636</td>
      <td>33,495</td>
      <td>859</td>
    </tr>
    <tr>
      <td>4</td>
      <td>28,921</td>
      <td>29,885</td>
      <td>30,849</td>
      <td>31,813</td>
      <td>32,777</td>
      <td>33,741</td>
      <td>34,705</td>
      <td>35,669</td>
      <td>36,633</td>
      <td>37,597</td>
      <td>964</td>
    </tr>
    <tr>
      <td>5</td>
      <td>32,357</td>
      <td>33,436</td>
      <td>34,515</td>
      <td>35,594</td>
      <td>36,673</td>
      <td>37,752</td>
      <td>38,831</td>
      <td>39,910</td>
      <td>40,989</td>
      <td>42,068</td>
      <td>1,079</td>
    </tr>
    <tr>
      <td>6</td>
      <td>36,070</td>
      <td>37,272</td>
      <td>38,474</td>
      <td>39,676</td>
      <td>40,878</td>
      <td>42,080</td>
      <td>43,282</td>
      <td>44,484</td>
      <td>45,686</td>
      <td>46,888</td>
      <td>1,202</td>
    </tr>
    <tr>
      <td>7</td>
      <td>40,082</td>
      <td>41,418</td>
      <td>42,754</td>
      <td>44,090</td>
      <td>45,426</td>
      <td>46,762</td>
      <td>48,098</td>
      <td>49,434</td>
      <td>50,770</td>
      <td>52,106</td>
      <td>1,336</td>
    </tr>
    <tr>
      <td>8</td>
      <td>44,389</td>
      <td>45,869</td>
      <td>47,349</td>
      <td>48,829</td>
      <td>50,309</td>
      <td>51,789</td>
      <td>53,269</td>
      <td>54,749</td>
      <td>56,229</td>
      <td>57,709</td>
      <td>1,480</td>
    </tr>
    <tr>
      <td>9</td>
      <td>49,028</td>
      <td>50,662</td>
      <td>52,296</td>
      <td>53,930</td>
      <td>55,564</td>
      <td>57,198</td>
      <td>58,832</td>
      <td>60,466</td>
      <td>62,100</td>
      <td>63,734</td>
      <td>1,634</td>
    </tr>
    <tr>
      <td>10</td>
      <td>53,990</td>
      <td>55,790</td>
      <td>57,590</td>
      <td>59,390</td>
      <td>61,190</td>
      <td>62,990</td>
      <td>64,790</td>
      <td>66,590</td>
      <td>68,390</td>
      <td>70,190</td>
      <td>1,800</td>
    </tr>
    <tr>
      <td>11</td>
      <td>59,319</td>
      <td>61,296</td>
      <td>63,273</td>
      <td>65,250</td>
      <td>67,227</td>
      <td>69,204</td>
      <td>71,181</td>
      <td>73,158</td>
      <td>75,135</td>
      <td>77,112</td>
      <td>1,977</td>
    </tr>
    <tr>
      <td>12</td>
      <td>71,099</td>
      <td>73,469</td>
      <td>75,839</td>
      <td>78,209</td>
      <td>80,579</td>
      <td>82,949</td>
      <td>85,319</td>
      <td>87,689</td>
      <td>90,059</td>
      <td>92,429</td>
      <td>2,370</td>
    </tr>
    <tr>
      <td>13</td>
      <td>84,546</td>
      <td>87,364</td>
      <td>90,182</td>
      <td>93,000</td>
      <td>95,818</td>
      <td>98,636</td>
      <td>101,454</td>
      <td>104,272</td>
      <td>107,090</td>
      <td>109,908</td>
      <td>2,818</td>
    </tr>
    <tr>
      <td>14</td>
      <td>99,908</td>
      <td>103,238</td>
      <td>106,568</td>
      <td>109,898</td>
      <td>113,228</td>
      <td>116,558</td>
      <td>119,888</td>
      <td>123,218</td>
      <td>126,548</td>
      <td>129,878</td>
      <td>3,330</td>
    </tr>
    <tr>
      <td>15</td>
      <td>117,518</td>
      <td>121,435</td>
      <td>125,352</td>
      <td>129,269</td>
      <td>133,186</td>
      <td>137,103</td>
      <td>141,020</td>
      <td>144,937</td>
      <td>148,854</td>
      <td>152,771</td>
      <td>3,917</td>
    </tr>
  </tbody>
</table>
```
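
The HTML returned above arrives as text wrapped in a Markdown code fence. To work with the values in Python, one option is to strip the fence and read the cells with the standard library's `html.parser`. This is a minimal sketch: the names `strip_code_fence`, `TableRows`, and `html_table_to_rows` are hypothetical, and `rowspan`/`colspan` attributes are not expanded.

```python
from html.parser import HTMLParser

# Markdown code-fence marker, built this way to avoid nesting fences here
FENCE = "`" * 3


def strip_code_fence(text):
    """Remove a surrounding Markdown code fence (with optional language tag)."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines)


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # Text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def html_table_to_rows(response_text):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(strip_code_fence(response_text))
    parser.close()
    return parser.rows
```

Calling `html_table_to_rows(response_text)` on the response would give one list per table row, which can then be written to CSV or loaded into a data frame.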

## Document Translation

Gemini can translate documents between languages. This example translates the first paragraph of an English document into French and Spanish.

In [27]:
translation_prompt = """Translate the first paragraph into French and Spanish. Label each paragraph with the target language."""

In [None]:
# Send Translation Prompt to Gemini
response_text = process_document(
    translation_prompt,
    "gs://cloud-samples-data/generative-ai/pdf/fdic_board_meeting.pdf",
)

print(response_text)

## Document Comparison

Gemini can compare and contrast the contents of multiple documents. This example asks how the standard deduction changed between the 2013 and 2023 versions of IRS Form 1040.

Note: when working with multiple documents, the order can matter and should be specified in your prompt.
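
One way to make the document order explicit is to generate the ordering sentences from a list of labels, so that the prompt text and the order of the file parts stay in sync. This is a minimal sketch. The function name `build_comparison_prompt` and the "Document N is ..." wording are assumptions, not part of the original notebook.

```python
def build_comparison_prompt(document_labels, question):
    """Prefix a question with an explicit, ordered description of the documents.

    The labels must be listed in the same order as the file parts in `contents`.
    """
    if len(document_labels) < 2:
        raise ValueError("Comparison needs at least two documents.")
    lines = [
        f"Document {position} is {label}."
        for position, label in enumerate(document_labels, start=1)
    ]
    lines.append(question)
    return "\n".join(lines)
```

For the example in this section, the labels could be `"IRS Form 1040 from 2013"` and `"IRS Form 1040 from 2023"`, followed by the question about the standard deduction.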

In [None]:
comparison_prompt = """The first document is from 2013, the second one from 2023. How did the standard deduction evolve?"""

In [None]:
# Send Comparison Prompt to Gemini
file_part1 = Part.from_uri(
    uri="gs://cloud-samples-data/generative-ai/pdf/form_1040_2013.pdf",
    mime_type=PDF_MIME_TYPE,
)

file_part2 = Part.from_uri(
    uri="gs://cloud-samples-data/generative-ai/pdf/form_1040_2023.pdf",
    mime_type=PDF_MIME_TYPE,
)

# Load contents
contents = [file_part1, file_part2, comparison_prompt]

# Send to Gemini
response = model.generate_content(contents)

print("-------Prompt--------")
print_multimodal_prompt(contents)

print("-------Output--------")
print(response.text)