In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Generative AI Document Summarization

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/terraform-google-gen-ai-document-summarization/blob/main/terraform/webhooks/notebook/gen_ai_jss.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/terraform-google-gen-ai-document-summarization/blob/main/terraform/webhooks/notebook/gen_ai_jss.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/terraform-google-gen-ai-document-summarization/main/terraform/webhooks/notebook/gen_ai_jss.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.9

## Overview

This notebook is a companion to the [Generative AI Document Summarization Jump Start Solution](https://cloud.google.com/blog/products/application-modernization/introducing-google-cloud-jump-start-solutions). With this notebook, you can use the summarization solution to create summaries of academic PDF files. In the notebook, you programmatically upload a PDF file to a Cloud Storage bucket and then view the summary of that PDF in a BigQuery table.

+ Learn more about [using text chat LLM with Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview).
+ Learn more about [querying tables in Cloud BigQuery](https://cloud.google.com/bigquery/docs/tables).
+ Learn more about [creating Eventarc triggers for Cloud Functions](https://cloud.google.com/functions/docs/calling/eventarc).
+ Learn more about [storing data in Cloud Storage](https://cloud.google.com/storage/docs/uploading-objects).
+ Learn more about [transcribing PDFs with Cloud Vision OCR](https://cloud.google.com/vision/docs/pdf).

### Objective

In this tutorial, you learn how to create a Cloud Function process that transcribes characters from a PDF, stores the complete PDF text in a Storage bucket, summarizes the PDF, and then upserts the document data (summary, complete text, URI) into a BigQuery table.
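
The four stages named above can be sketched as one function that calls them in order. This is a minimal illustration only: `DocumentRecord`, `summarize_document`, and the stage callables are hypothetical names, and the deployed Cloud Function's real code is not shown in this notebook. The stages are passed in as plain functions so that the sketch runs without any Google Cloud access.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DocumentRecord:
    """The document data the tutorial says ends up in BigQuery."""
    uri: str
    complete_text: str
    summary: str


def summarize_document(
    pdf_uri: str,
    transcribe: Callable[[str], str],
    store_text: Callable[[str, str], None],
    summarize: Callable[[str], str],
    upsert: Callable[[DocumentRecord], None],
) -> DocumentRecord:
    """Run the four stages in the order the tutorial describes."""
    complete_text = transcribe(pdf_uri)    # 1. transcribe characters from the PDF
    store_text(pdf_uri, complete_text)     # 2. store the complete text
    summary = summarize(complete_text)     # 3. summarize the text
    record = DocumentRecord(pdf_uri, complete_text, summary)
    upsert(record)                         # 4. upsert the row
    return record


# Stand-in stages (assumptions) so the sketch runs locally.
stored = {}
rows = []
record = summarize_document(
    "gs://example-bucket/pdfs/paper.pdf",
    transcribe=lambda uri: "full text of the paper",
    store_text=lambda uri, text: stored.update({uri: text}),
    summarize=lambda text: text[:9],
    upsert=rows.append,
)
print(record.summary)
```

In the deployed solution, each stand-in would be backed by the corresponding service (Cloud Vision OCR, Cloud Storage, Vertex AI, and BigQuery).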

This tutorial uses the following Google Cloud services and resources:

- Vertex AI Generative AI
- BigQuery
- Cloud Vision OCR
- Eventarc triggers
- Cloud Functions
- Cloud Storage

The steps performed include:

- Trigger an Eventarc event by uploading a PDF to a Cloud Storage bucket
- Query the BigQuery table to see the results of the summarization process

### Dataset

This notebook uses a [Kaggle dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv) that contains a large collection of academic summaries from [arXiv.org](https://arxiv.org/). This dataset is made publicly available through a Cloud Storage bucket.

### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Cloud Vision
* Cloud Functions
* Eventarc
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
[BigQuery pricing](https://cloud.google.com/bigquery/pricing),
[Cloud Vision pricing](https://cloud.google.com/vision/pricing),
[Cloud Functions pricing](https://cloud.google.com/functions/pricing),
[Eventarc pricing](https://cloud.google.com/eventarc/pricing), and
[Cloud Storage pricing](https://cloud.google.com/storage/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook. 

In [None]:
%%writefile requirements.txt

google-cloud-aiplatform
google-cloud-bigquery
google-cloud-logging
google-cloud-storage
polling2
tqdm

In [None]:
# Install the packages
import os

if not os.getenv("IS_TESTING"):
    USER = "--user"
else:
    USER = ""
! pip3 install {USER} --upgrade -r requirements.txt

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

This notebook assumes that you have already deployed this solution using either the Terraform script or the [Solutions console](https://console.cloud.google.com/products/solutions/catalog). During this deployment, several actions required to run this solution were performed on your behalf:

1. The [Cloud Function](https://console.cloud.google.com/functions/list) was deployed.

2. The [Eventarc trigger](https://console.cloud.google.com/eventarc/triggers) was applied to the input Cloud Storage bucket.

3. The following APIs were enabled for you: 

   - [Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com)
   - [BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com)
   - [Cloud Vision API](https://console.cloud.google.com/flows/enableapi?apiid=vision.googleapis.com)


<div style="background-color:rgb(150,200,255); padding:2px;"><strong>Note:</strong> It is recommended to run this notebook from <a href="https://console.cloud.google.com/vertex-ai/workbench/">Vertex AI Workbench</a>. If you are running this notebook locally instead, you need to install the <a href="https://cloud.google.com/sdk" target="_blank">Cloud SDK</a>.</div>

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)
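
Before running the `gcloud` command, it can help to confirm that the placeholder value was replaced. The following is a minimal sketch, assuming the standard project ID format (6 to 30 characters; lowercase letters, digits, and hyphens; starting with a letter and not ending with a hyphen). The helper name `looks_like_project_id` is hypothetical, and domain-scoped legacy IDs such as `example.com:my-project` are not covered.

```python
import re

# Assumed format: starts with a letter, 6-30 characters in total,
# lowercase letters, digits, and hyphens only, no trailing hyphen.
_PROJECT_ID_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def looks_like_project_id(value: str) -> bool:
    """Return True if the value has the shape of a Google Cloud project ID."""
    return bool(_PROJECT_ID_RE.match(value))


print(looks_like_project_id("[your-project-id]"))  # placeholder left unchanged
print(looks_like_project_id("my-summaries-123"))
```

A check like this only validates the shape of the string. It does not confirm that the project exists or that you have access to it.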

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow one of the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Import libraries

In [None]:
import os
import polling2
import re
import time

from tqdm.notebook import tqdm
from google.cloud import aiplatform
from google.cloud import bigquery
from google.cloud import logging
from google.cloud import storage

## Download test data

This Jump Start Solution uses data from [arXiv.org](https://arxiv.org/) to demonstrate the summarization capabilities of Vertex AI. arXiv, through [Kaggle.com](https://www.kaggle.com/datasets/Cornell-University/arxiv), has made many scholarly papers available, free of charge, from a Google Cloud Storage bucket.
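
The object paths in this bucket appear to follow a regular pattern. The sketch below builds the URI for one paper from its archive name and paper ID. The layout is inferred only from the single path used in this notebook, so treat it as an assumption; the helper name `arxiv_pdf_uri` is hypothetical, and other archives or newer ID formats may be laid out differently.

```python
def arxiv_pdf_uri(archive: str, paper_id: str, bucket: str = "arxiv-dataset") -> str:
    """Build the Cloud Storage URI for an old-style arXiv ID such as '9404002v1'."""
    # The first four characters of the ID name the folder (for example, '9404').
    folder = paper_id[:4]
    return f"gs://{bucket}/arxiv/{archive}/pdf/{folder}/{paper_id}.pdf"


print(arxiv_pdf_uri("cmp-lg", "9404002v1"))
```

For the paper used in this notebook, the helper reproduces the `file_uri` value set in the download cell.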

In [None]:
# List the computational linguistics (cmp-lg) papers from April 1994 in Cloud Storage
! gsutil ls gs://arxiv-dataset/arxiv/cmp-lg/pdf/9404

In [None]:
filename = '9404002v1'
file_uri = f'gs://arxiv-dataset/arxiv/cmp-lg/pdf/9404/{filename}.pdf'

# Create a local folder and download some test PDFs
if not os.path.exists('pdfs'):
    os.mkdir('pdfs')

! gsutil cp -r $file_uri pdfs/

## Upload test data to Storage bucket

The Terraform scripts for this JSS apply an Eventarc trigger to a Cloud Storage bucket. When a PDF is uploaded to the storage bucket, the Eventarc trigger fires, starting the summarization process.
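
To make the trigger concrete, here is a minimal sketch of how a handler might decide whether an upload event refers to a PDF. It assumes the event data carries the `bucket` and `name` fields of a Cloud Storage object. The function name `pdf_uri_from_event` and the example values are hypothetical, and the deployed function's actual logic is not shown in this notebook.

```python
from typing import Optional


def pdf_uri_from_event(event_data: dict) -> Optional[str]:
    """Return the gs:// URI of an uploaded PDF, or None for any other object."""
    bucket = event_data.get("bucket")
    name = event_data.get("name")
    if not bucket or not name:
        return None
    # Ignore objects that are not PDFs, such as text files written by later steps.
    if not name.lower().endswith(".pdf"):
        return None
    return f"gs://{bucket}/{name}"


# Example event data (illustrative values only).
example_event = {
    "bucket": "my-project_uploads",
    "name": "pdfs/9404002v1.pdf",
    "contentType": "application/pdf",
}
print(pdf_uri_from_event(example_event))
```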

In [None]:
INPUT_BUCKET = f'{PROJECT_ID}_uploads'

Running the next cell uploads a local PDF file (downloaded in the previous section) to the target Cloud Storage bucket. 

In [None]:
file_complete_text = f'{filename}_summary.txt'
pdf = f'pdfs/{filename}.pdf'
logger_name = 'summarization-by-llm'

In [None]:
storage_client = storage.Client()
bucket = storage_client.bucket(INPUT_BUCKET)
blob = bucket.blob(pdf)
blob.upload_from_filename(pdf)

Uploading the file triggers the summarization process. You can follow its progress in the [Cloud Console](https://console.cloud.google.com/functions/details/us-central1/jss16-1).

## Optional: View summarization process in Cloud Logging

You can view the results of the summarization Cloud Function as it writes updates to Cloud Logging. Each run of the summarization pipeline is associated with a `cloud_event_id`. By filtering for this ID, you can track the summarization process.
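
The ID can be pulled out of a log line with a regular expression. The sketch below assumes log lines of the form `cloud_event_id(<id>): <message>`, which is the shape implied by the pattern used in the next cell. The sample line and the helper name `extract_cloud_event_id` are made up for illustration.

```python
import re
from typing import Optional

# Non-greedy group so that the match stops at the first "):" in the line.
CLOUD_EVENT_PATTERN = re.compile(r"cloud_event_id\((.*?)\):")


def extract_cloud_event_id(entry_text: str) -> Optional[str]:
    """Return the ID inside cloud_event_id(...), or None if the line has none."""
    match = CLOUD_EVENT_PATTERN.search(entry_text)
    return match.group(1) if match else None


# Hypothetical log line, not taken from a real run.
sample = "cloud_event_id(8412345678901234): received pdfs/9404002v1.pdf"
print(extract_cloud_event_id(sample))
```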

In [None]:
@polling2.poll_decorator(check_success=lambda x: x != '', step=0.5, timeout=90)
def get_cloud_event_id(pdf_filename, bar):
    """Return the cloud_event_id from the first matching log entry, or '' if none yet."""
    logging_client = logging.Client(project=PROJECT_ID)
    logger = logging_client.logger(logger_name)

    # Raw string so the backslashes reach the regex engine unchanged
    pattern = r'cloud_event_id\((.*)\):'
    for entry in logger.list_entries(filter_=pdf_filename, max_results=100):
        # Structured payloads are not plain strings, so convert before searching
        res = re.search(pattern, str(entry.payload))
        if res is not None:
            cloud_id = res.group(1)
            print(cloud_id)
            bar.update(100)
            return cloud_id

    # An empty string tells the poller to try again
    return ''

In [None]:
# The with block closes the progress bar automatically
with tqdm(total=100) as bar:
    cloud_event_id = get_cloud_event_id(filename, bar)

Now that you have the `cloud_event_id`, you can filter the logs down to the updates for just this event.
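
If you want to run the same search in the Logs Explorer or with another client, the filter can be written out explicitly. The following is a minimal sketch, assuming the Cloud Function writes plain text payloads to the `summarization-by-llm` log. The helper name `build_log_filter` and the example values are hypothetical.

```python
def build_log_filter(project_id: str, logger_name: str, cloud_event_id: str) -> str:
    """Compose a Logging query that matches one log and one cloud event ID."""
    # Escape backslashes and double quotes so the ID stays inside its string literal.
    escaped = cloud_event_id.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f'logName="projects/{project_id}/logs/{logger_name}" '
        f'AND textPayload:"{escaped}"'
    )


print(build_log_filter("my-project", "summarization-by-llm", "8412345678901234"))
```

The `:` operator matches a substring, so entries that mention the ID anywhere in their text payload are returned.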

In [None]:
print(f'cloud_event_id: {cloud_event_id}')

In [None]:
@polling2.poll_decorator(step=10, timeout=70)
def get_cloud_event_logs(cloud_event_id):
    print("polling")
    logging_client = logging.Client(project=PROJECT_ID)
    logger = logging_client.logger(logger_name)
    
    entries = []
    for entry in logger.list_entries(filter_=cloud_event_id, max_results=100):
        entry_text = entry.payload
        entries.append(entry_text)
    return entries

In [None]:
entries = []
bar = tqdm(total=6)

for _ in range(6):
    tmp_entries = get_cloud_event_logs(cloud_event_id)
    for e in tmp_entries:
        # Print each log entry only once, even if several polls return it
        if e not in entries:
            bar.update(1)
            entries.append(e)
            print(e)

bar.close()

## Query the BigQuery table to see the summary

Once the summarization flow has completed, the summary of the PDF document should be available for you to read. To get the summary of the PDF document, you can query the BigQuery table that contains the summary.

If you do not get a result the first time you run the query, then the summarization pipeline might still be running. You might need to wait a minute to allow the pipeline to finish and to try the query again.
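
Instead of re-running the cell by hand, you could wrap the query in a small retry loop. This is a minimal sketch: `wait_for_rows` is a hypothetical helper, and `run_query` stands in for any function that returns a list of rows (for example, one that wraps the BigQuery call in the next cell). The stand-in query below returns no rows twice before returning one row.

```python
import time
from typing import Callable, List, Optional


def wait_for_rows(
    run_query: Callable[[], List[dict]],
    attempts: int = 5,
    delay_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[List[dict]]:
    """Call run_query until it returns rows or the attempts run out."""
    for attempt in range(attempts):
        rows = run_query()
        if rows:
            return rows
        # Wait before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None


# Stand-in query (assumption): empty twice, then one row. No real waiting.
responses = iter([[], [], [{"summary": "An example summary."}]])
rows = wait_for_rows(lambda: next(responses), sleep=lambda seconds: None)
print(rows[0]["summary"])
```

The attempt count and delay are illustrative defaults. Choose values that match how long the pipeline takes for your documents.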

In [None]:
bigquery_client = bigquery.Client(project=PROJECT_ID)

table_name = f"{PROJECT_ID}.summary_dataset.summary_table"

# Compose the SQL query to select the summary for the PDF document
sql_query = f"SELECT summary FROM `{table_name}` WHERE filename LIKE '%{file_complete_text}%'"

job = bigquery_client.query(sql_query)
rows = job.result()
row_list = list(rows)

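# Show the first matching summary, or a hint if the pipeline has not finished yet
if row_list:
    summary = row_list[0]['summary']
else:
    summary = 'No summary found yet. Wait a minute, then run this cell again.'

summary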