In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Text Summarization of Large Documents

> **NOTE:** This notebook uses the PaLM generative model, which will reach its [discontinuation date in October 2024](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text#model_versions). Please refer to [this updated notebook](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb) for a version which uses the latest Gemini model.

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-summarization/summarization_large_documents.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-summarization/summarization_large_documents.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/language/use-cases/document-summarization/summarization_large_documents.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
</table>

<div style="clear: both;"></div>

<b>Share to:</b>

<a href="https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-summarization/summarization_large_documents.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg" alt="LinkedIn logo">
</a>

<a href="https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-summarization/summarization_large_documents.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg" alt="Bluesky logo">
</a>

<a href="https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-summarization/summarization_large_documents.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/53/X_logo_2023_original.svg" alt="X logo">
</a>

<a href="https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-summarization/summarization_large_documents.ipynb" target="_blank">
  <img width="20px" src="https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" alt="Reddit logo">
</a>

<a href="https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-summarization/summarization_large_documents.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg" alt="Facebook logo">
</a>            


| | |
|-|-|
|Author(s) | [Thu Ya Kyaw](https://github.com/iamthuya) |

## Overview

Text summarization is the process of creating a shorter version of a text document while still preserving the most important information. This can be useful for a variety of purposes, such as quickly skimming a long document, getting the gist of an article, or sharing a summary with others.

Summarizing a short paragraph is relatively straightforward, but summarizing a large document, such as a PDF file with multiple pages, poses a few additional challenges. In this notebook, you will go through a few examples of how you can use generative models to summarize large documents.

### Objective

In this tutorial, you will learn how to use generative models to summarize information from text by working through the following examples:

- Stuffing method
- MapReduce method
- MapReduce with Overlapping Chunks method
- MapReduce with Rolling Summary method

### Costs

This tutorial uses billable components of Google Cloud:
- Vertex AI Generative AI Studio

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Generative AI pricing](https://cloud.google.com/vertex-ai/pricing#generative_ai_models), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Getting Started

### Install Vertex AI SDK, other packages and their dependencies

In [None]:
%pip install google-cloud-aiplatform PyPDF2 ratelimit backoff --upgrade --quiet --user

**Colab only**: Uncomment the following cell to restart the kernel. For Vertex AI Workbench, you can restart the kernel using the button on top.

In [None]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

### Authenticating your notebook environment
* If you are using **Colab** to run this notebook, uncomment the cell below and continue.
* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [None]:
# from google.colab import auth
# auth.authenticate_user()

### Import libraries

**Colab only:** Uncomment the following cell to initialize the Vertex AI SDK. For Vertex AI Workbench, you don't need to run this.

In [None]:
# import vertexai

# PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
# vertexai.init(project=PROJECT_ID, location="us-central1")

In [None]:
from pathlib import Path
import urllib.request
import warnings

import PyPDF2
import backoff
from google.api_core import exceptions
import ratelimit
from tqdm import tqdm
from vertexai.language_models import TextGenerationModel

warnings.filterwarnings("ignore")

### Import models

Here you load the pre-trained text generation model called `text-bison`.


In [None]:
generation_model = TextGenerationModel.from_pretrained("text-bison")

### Preparing data files

To begin, you will need to download a PDF file for the summarization tasks below.

In [None]:
# Define a folder to store the files
data_folder = "data"
Path(data_folder).mkdir(parents=True, exist_ok=True)

# Define a pdf link to download and place to store the download file
pdf_url = "https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf"
pdf_file = Path(data_folder, pdf_url.split("/")[-1])

# Download the file using `urllib` library
urllib.request.urlretrieve(pdf_url, pdf_file)

Here you will take a peek at a few pages of the downloaded PDF file.

In [None]:
# Read the PDF file and create a list of pages
reader = PyPDF2.PdfReader(pdf_file)
pages = reader.pages

# Print three pages from the pdf
for i in range(3):
    text = pages[i].extract_text().strip()
    print(f"Page {i}: {text} \n\n")

## Method 1: Stuffing

The simplest way to pass data to a language model is to "stuff" it all into the prompt as context. This means simply including all of the relevant information in the prompt, in the order that you want the model to process it.

Here you will extract the text from all the pages in the pdf file.

In [None]:
# Read the PDF file and create a list of pages
reader = PyPDF2.PdfReader(pdf_file)
pages = reader.pages

# Entry string to concatenate all the extracted texts
concatenated_text = ""

# Loop through the pages
for page in tqdm(pages):
    # Extract the text from the page and remove any leading or trailing whitespace
    text = page.extract_text().strip()

    # Concat the extracted text to the concatenated text
    concatenated_text += text

print(f"There are {len(concatenated_text)} characters in the pdf")

You will now create a prompt template that can be used later in the notebook.

In [None]:
prompt_template = """
    Write a concise summary of the following text delimited by triple backquotes.
    Return your response in bullet points which covers the key points of the text.

    ```{text}```

    BULLET POINT SUMMARY:
"""

Here you will use the LLM via the API to summarize the extracted text. Note that LLMs currently have an input text limit, so stuffing a large input text into the prompt might not be accepted. You can read more about quotas and limits [here](https://cloud.google.com/vertex-ai/docs/quotas).

The following code will cause **an exception**!

In [None]:
# Define the prompt using the prompt template
prompt = prompt_template.format(text=concatenated_text)

# Use the model to summarize the text using the prompt
summary = generation_model.predict(prompt=prompt, max_output_tokens=1024).text

print(summary)

#### Retrying

The model responded with an error message: **400 Request contains an invalid argument** because the extracted text is too long for the generative model to process.

To avoid this issue, you will only input a chunk of the extracted text (e.g. the first 30,000 characters).

In [None]:
# Define the prompt using the prompt template
prompt = prompt_template.format(text=concatenated_text[:30000])

# Use the model to summarize the text using the prompt
summary = generation_model.predict(prompt=prompt, max_output_tokens=1024).text

print(summary)
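
The truncation step above can be wrapped in a small helper that also reports how much text was left out, which makes the cost of stuffing visible. The following is a minimal sketch: `build_truncated_prompt`, `demo_template`, and `demo_text` are hypothetical names used only for illustration and are not part of the notebook or the Vertex AI SDK.

```python
def build_truncated_prompt(template: str, text: str, max_chars: int) -> tuple[str, int]:
    """Fill `template` with at most `max_chars` characters of `text`.

    Returns the prompt and the number of characters that were left out.
    """
    kept = text[:max_chars]
    dropped = len(text) - len(kept)
    return template.format(text=kept), dropped


# Hypothetical stand-ins for `prompt_template` and `concatenated_text`
demo_template = "Summarize the text in triple backquotes.\n```{text}```\nSUMMARY:"
demo_text = "MLOps " * 10  # 60 characters

prompt, dropped = build_truncated_prompt(demo_template, demo_text, max_chars=24)
print(prompt)
print(f"Characters left out: {dropped}")  # Characters left out: 36
```

Any text beyond `max_chars` is simply ignored, so the summary can only reflect the beginning of the document.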

### Recap

Although the full text is too large for the model, you have managed to create a concise, bulleted list of the most important information from a portion of the PDF using the model. Here are the pros and cons of using the stuffing method:

**Pros:**
- Requires only a single call to the model.
- When summarizing text, the model has access to all the data at once, so the result may be better.

**Cons:**
- Most models have a maximum context length, and a large document (or many documents) will produce a prompt that exceeds it.
- As a result, this method only works on smaller pieces of data and is not suitable for large documents most of the time.

In the following section, you will explore approaches designed to handle text that is longer than the context length limit of LLMs.

### Adding rate limit to model calls

When you use MapReduce or other similar methods, you will be making multiple API calls to the model in a short period of time. There is a limit on the number of API calls you can make per minute, so you will need to add a safety measure to your code to prevent exceeding the limit. This will help to ensure that your code runs smoothly and does not encounter any errors.

For this method, here are a few specific things that you will do:
1. You will make use of a Python library called [ratelimit](https://pypi.org/project/ratelimit/) to limit the number of API calls per minute
2. You will make use of a Python library called [backoff](https://pypi.org/project/backoff/) to retry until the maximum time limit has been reached

The following function improves the API call process by limiting the number of calls to **20 per minute**. It also backs off and retries calling the API after encountering a **Resource Exhausted** exception. The wait duration grows **exponentially until the 5-minute mark**, and then the function gives up on retrying.

In [None]:
CALL_LIMIT = 20  # Number of calls to allow within a period
ONE_MINUTE = 60  # One minute in seconds
FIVE_MINUTE = 5 * ONE_MINUTE


# A function to print a message when the function is retrying
def backoff_hdlr(details):
    print(
        "Backing off {} seconds after {} tries".format(
            details["wait"], details["tries"]
        )
    )


@backoff.on_exception(  # Retry with exponential backoff strategy when exceptions occur
    backoff.expo,
    (
        exceptions.ResourceExhausted,
        ratelimit.RateLimitException,
    ),  # Exceptions to retry on
    max_time=FIVE_MINUTE,
    on_backoff=backoff_hdlr,  # Function to call when retrying
)
@ratelimit.limits(  # Limit the number of calls to the model per minute
    calls=CALL_LIMIT, period=ONE_MINUTE
)
# This function will call the `generation_model.predict` function, but it will retry if defined exceptions occur.
def model_with_limit_and_backoff(**kwargs):
    return generation_model.predict(**kwargs)
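
To get a feel for how quickly exponential waits add up to the 5-minute limit, the schedule can be computed by hand. This is a simplified sketch that uses only the standard library: `expo_wait_schedule` is a hypothetical helper, it assumes waits of `factor * base**n` seconds, and it ignores both jitter and the time taken by the calls themselves, so the `backoff` library's actual timing may differ.

```python
import itertools


def expo_wait_schedule(max_time: float, base: float = 2, factor: float = 1) -> list[float]:
    """Return the un-jittered waits factor * base**n whose running total fits in max_time."""
    waits = []
    elapsed = 0.0
    for n in itertools.count():
        wait = factor * base**n
        if elapsed + wait > max_time:
            break
        waits.append(wait)
        elapsed += wait
    return waits


schedule = expo_wait_schedule(max_time=5 * 60)
print(schedule)  # [1, 2, 4, 8, 16, 32, 64, 128]
print(f"Total time spent waiting: {sum(schedule)} seconds")  # 255 seconds
```

Under these assumptions, only a handful of retries fit inside the 5-minute window, because each wait is double the previous one.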

## Method 2: MapReduce

This method works by first splitting the large data into chunks, then running a prompt on each chunk of text. For summarization tasks, the output from the initial prompt would be a summary of that chunk. Once all the initial outputs have been generated, a different prompt is run to combine them.
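
The flow described above can be sketched without calling any model. In this minimal illustration, `first_sentence` is a hypothetical stand-in for the summarization call (it just keeps the first sentence), and `map_reduce_summarize` and `demo_pages` are likewise made up for the example.

```python
import re


def first_sentence(text: str) -> str:
    """Stand-in 'summarizer': keep only the first sentence of the text."""
    return re.split(r"(?<=\.)\s+", text.strip())[0]


def map_reduce_summarize(chunks, summarize):
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    combined = "\n".join(partial_summaries)
    return partial_summaries, summarize(combined)  # reduce step


demo_pages = [
    "MLOps unifies ML development and operations. It borrows from DevOps.",
    "Pipelines automate training. They also automate deployment.",
]
partials, final = map_reduce_summarize(demo_pages, first_sentence)
print(partials)
print(final)
```

In the notebook itself, the map and reduce steps use different prompts; the sketch reuses one stand-in function for both only to keep the example short.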

This method is a bit more complex than the first method, but it can be more effective for large datasets. Here you will prepare two prompt templates: one for the initial summary step and another for the final combine step. You will be using these two templates later in this notebook.

In [None]:
initial_prompt_template = """
    Write a concise summary of the following text delimited by triple backquotes.

    ```{text}```

    CONCISE SUMMARY:
"""

final_prompt_template = """
    Write a concise summary of the following text delimited by triple backquotes.
    Return your response in bullet points which covers the key points of the text.

    ```{text}```

    BULLET POINT SUMMARY:
"""

#### Map step

In this section, you will read the PDF file again and use the model to summarize each page individually using the initial prompt template.

In [None]:
# Read the PDF file and create a list of pages
reader = PyPDF2.PdfReader(pdf_file)
pages = reader.pages

# Create an empty list to store the summaries
initial_summary = []

# Iterate over the pages and generate a summary for each page
for page in tqdm(pages):
    # Extract the text from the page and remove any leading or trailing whitespace
    text = page.extract_text().strip()

    # Create a prompt for the model using the extracted text and a prompt template
    prompt = initial_prompt_template.format(text=text)

    # Generate a summary using the model and the prompt
    summary = model_with_limit_and_backoff(prompt=prompt, max_output_tokens=1024).text

    # Append the summary to the list of summaries
    initial_summary.append(summary)

Take a look at the first few summaries from the initial Map phase.

In [None]:
print("\n\n".join(initial_summary[:10]))

Here you will count the number of characters in the initial summary to see if they are small enough to fit in a prompt.

In [None]:
len("\n".join(initial_summary))

Because a prompt with 30,000 characters was accepted earlier, the combined summaries, which have fewer characters, can also be passed directly in a single prompt. You will do that in the next step.
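
If the combined summaries had not fit in a single prompt, one option (not part of this notebook's flow) would be to group them into batches that each stay under a character budget and reduce each batch separately. The following is a rough sketch under that assumption; `batch_by_char_budget` and its parameters are hypothetical.

```python
def batch_by_char_budget(summaries, max_chars, sep="\n"):
    """Greedily group summaries so each joined batch stays within max_chars.

    A single summary longer than max_chars is kept as its own batch.
    """
    batches, current, current_len = [], [], 0
    for summary in summaries:
        extra = len(summary) + (len(sep) if current else 0)
        if current and current_len + extra > max_chars:
            # The current batch is full: store it and start a new one
            batches.append(current)
            current, current_len = [], 0
            extra = len(summary)
        current.append(summary)
        current_len += extra
    if current:
        batches.append(current)
    return batches


demo_summaries = ["a" * 10, "b" * 10, "c" * 10, "d" * 10]
batches = batch_by_char_budget(demo_summaries, max_chars=25)
print([len("\n".join(batch)) for batch in batches])  # [21, 21]
```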

#### Reduce step

Here you will create a reduce function that concatenates the summaries from the initial summarization step (Map step) and uses the final prompt template to summarize the summaries again.

In [None]:
# Define a function to create a summary of the summaries


def reduce(initial_summary, prompt_template):
    # Concatenate the summaries from the initial step
    concat_summary = "\n".join(initial_summary)

    # Create a prompt for the model using the concatenated text and a prompt template
    prompt = prompt_template.format(text=concat_summary)

    # Generate a summary using the model and the prompt
    summary = model_with_limit_and_backoff(prompt=prompt, max_output_tokens=1024).text

    return summary

You are ready to proceed to the next step: combining all the summaries into an even smaller summary using the final prompt template and the function that you created earlier.

In [None]:
# Use defined `reduce` function to summarize the summaries
summary = reduce(initial_summary, final_prompt_template)

print(summary)

#### Recap

You just summarized the whole paper into a few bullet points using the MapReduce method. Here are the pros and cons of using this method:

**Pros:**
- Can summarize a large document
- Can work well with parallel processing, as the pages are summarized independently of each other

**Cons:**
- Multiple calls to the model are needed
- As the pages are summarized individually, the context between the pages could be lost
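
Because each page is summarized independently, the Map step lends itself to concurrent execution. The sketch below illustrates the idea with the standard library's `ThreadPoolExecutor`; `summarize_stub` is a hypothetical stand-in for a model call, and `parallel_map` is an illustrative helper, not something defined elsewhere in this notebook. With a real model, concurrent calls would still be subject to the API's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_stub(text: str) -> str:
    """Stand-in for a model call: return the first five words of the text."""
    return " ".join(text.split()[:5])


def parallel_map(chunks, summarize, max_workers=4):
    """Summarize chunks concurrently; results keep the order of `chunks`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(summarize, chunks))


demo_pages = [f"Page {i} talks about topic {i} in great detail" for i in range(6)]
parallel_summaries = parallel_map(demo_pages, summarize_stub)
print(parallel_summaries)
```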


In the next section, you will try another method which makes use of more than one chunk (page) per prompt to summarize.

## Method 3: MapReduce with Overlapping Chunks

This method is similar to MapReduce, but with one key difference: overlapping chunks. This means that a few pages will be summarized together, rather than each page being summarized separately. This helps to preserve more context or information between chunks, which can improve the accuracy of the results.

It is important to note that combining chunks may sometimes exceed the token limit imposed by the model. If this occurs, you can either implement a chunk splitting method or solve the issue creatively (e.g. by removing a few initial chunks).
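
The overlapping idea can be seen on a toy list before applying it to real pages. In this minimal sketch, `overlapping_windows` is a hypothetical helper that returns consecutive windows shifted by one item, so that neighboring windows share pages.

```python
def overlapping_windows(items, window_size):
    """Return consecutive windows of `window_size` items, each shifted by one item."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if len(items) <= window_size:
        # Everything fits in a single window (or there is nothing to window)
        return [list(items)] if items else []
    return [list(items[i : i + window_size]) for i in range(len(items) - window_size + 1)]


demo_pages = ["p0", "p1", "p2", "p3"]
windows = overlapping_windows(demo_pages, window_size=2)
print(windows)  # [['p0', 'p1'], ['p1', 'p2'], ['p2', 'p3']]
```

Every page except the first and last appears in two windows, which is where the extra context between neighboring pages comes from.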

#### Map step

In this section, you will read the PDF file again and use the model to summarize <b>a few pages</b> together using the initial prompt template that you defined earlier.

In [None]:
# Read the PDF file and create a list of pages
reader = PyPDF2.PdfReader(pdf_file)
pages = reader.pages

# Create an empty list to store the extracted text from the pages
text_from_pages = []

# Iterate over the pages and extract the text from each page
for page in tqdm(pages):
    # Extract the text from the page and remove any leading or trailing whitespace
    text = page.extract_text().strip()

    # Append the extracted text to the list of extracted text
    text_from_pages.append(text)

Here you will define the chunk size (number of pages to combine in this example) and summarize the chunks.

In [None]:
CHUNK_SIZE = 2  # number of consecutive pages to combine into one chunk

# Read the PDF file and create a list of pages
reader = PyPDF2.PdfReader(pdf_file)
pages = reader.pages

# Create an empty list to store the summaries
initial_summary = []

# Iterate over the pages and generate a summary for a few pages as one chunk based on `CHUNK_SIZE`
for i in tqdm(range(len(pages))):
    # Select a list of pages to merge as one chunk
    pages_to_merge = [x for x in range(i, i + CHUNK_SIZE) if x < len(pages)]

    extracted_texts = [text_from_pages[x] for x in pages_to_merge]

    # Concatenate the extracted texts of the selected pages
    text = "\n".join(extracted_texts)

    # Create a prompt for the model using the concatenated text and a prompt template
    prompt = initial_prompt_template.format(text=text)

    # Generate a summary using the model and the prompt
    summary = model_with_limit_and_backoff(prompt=prompt, max_output_tokens=1024).text

    # Append the summary to the list of summaries
    initial_summary.append(summary)

    # If the chunk includes the last page, break the loop
    if pages_to_merge[-1] == len(pages) - 1:
        break

Take a look at the first few summaries from the initial Map phase.

In [None]:
print("\n\n".join(initial_summary[:10]))

#### Reduce step

You are ready to proceed to the next step: combining all the summaries into an even smaller summary using the final prompt template and the function that you created earlier.

In [None]:
# Use defined `reduce` function to summarize the summaries
summary = reduce(initial_summary, final_prompt_template)

print(summary)

#### Recap

The model was able to summarize the whole paper into a few bullet points using the MapReduce with Overlapping Chunks method. Here are the pros and cons of using this method:

**Pros:**
- Can summarize a large document
- As sequential pages are summarized together, the context between the pages is preserved
- Can use parallel processing, as the chunks are summarized independently of each other

**Cons:**
- Multiple calls to the model are needed
- Slightly slower than the pure MapReduce method
- Creates larger input text


In the next section, you will try a different approach that makes use of a summary from the previous page instead of the entire text.

## Method 4: MapReduce with Rolling Summary (Refine)

On some occasions, a few pages combined might be too large to summarize. To resolve that issue, you will now try a different approach that uses the summary from the previous step along with the next page in each prompt. This helps to make the summary more complete and accurate, as it takes into account the context of the previous page.
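
The rolling pattern can be sketched without a model to show how context flows from one page to the next. Here `stub_summarizer` is a hypothetical stand-in for the model call, and `rolling_summaries` is an illustrative helper; neither is part of the notebook.

```python
def rolling_summaries(chunks, summarize_with_context):
    """Summarize chunks in order, passing the previous summary as context."""
    summaries = []
    context = ""  # the first chunk has no previous summary
    for chunk in chunks:
        summary = summarize_with_context(context, chunk)
        summaries.append(summary)
        context = summary
    return summaries


def stub_summarizer(context: str, text: str) -> str:
    """Stand-in for a model call: carry the last context word, then the first word of the text."""
    carried = context.split()[-1:]  # empty list when there is no context
    return " ".join(carried + text.split()[:1])


demo_pages = ["alpha one", "beta two", "gamma three"]
rolled = rolling_summaries(demo_pages, stub_summarizer)
print(rolled)  # ['alpha', 'alpha beta', 'beta gamma']
```

Each summary depends on the one before it, so the chunks have to be processed strictly in order.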

In [None]:
initial_prompt_template = """
    Taking the following context delimited by triple backquotes into consideration:

    ```{context}```

    Write a concise summary of the following text delimited by triple backquotes.

    ```{text}```

    CONCISE SUMMARY:
"""

In [None]:
# Read the PDF file and create a list of pages.
reader = PyPDF2.PdfReader(pdf_file)
pages = reader.pages

# Create an empty list to store the summaries.
initial_summary = []

# Iterate over the pages and generate a summary
for idx, page in enumerate(tqdm(pages)):
    # Extract the text from the page and remove any leading or trailing whitespace.
    text = page.extract_text().strip()

    if idx == 0:  # if current page is the first page, no previous context
        prompt = initial_prompt_template.format(context="", text=text)

    else:  # if current page is not the first page, previous context is the summary of the previous page
        prompt = initial_prompt_template.format(
            context=initial_summary[idx - 1], text=text
        )

    # Generate a summary using the model and the prompt
    summary = model_with_limit_and_backoff(prompt=prompt, max_output_tokens=1024).text

    # Append the summary to the list of summaries
    initial_summary.append(summary)

Here you will list out a few entries from the initial summary list.

In [None]:
initial_summary[:10]

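It is expected that there will be a few duplicate entries in the list, as you are rolling in context from previous pages to the next. You can remove these duplicates while keeping the summaries in their original page order by using `dict.fromkeys`, which (unlike `set`) preserves insertion order.

In [None]:
# dict.fromkeys() drops duplicate items while preserving their original order
initial_summary = list(dict.fromkeys(initial_summary))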

#### Reduce step
You are ready to proceed to the next step: combining all the summaries into an even smaller summary using the final prompt template and the function that you created earlier.

In [None]:
# Use defined `reduce` function to summarize the summaries
summary = reduce(initial_summary, final_prompt_template)

print(summary)

#### Recap

The model was able to summarize the whole paper into a few bullet points using the MapReduce with Rolling Summary method. Here are the pros and cons of using this method:

**Pros:**
- Can summarize a large document
- As sequential pages are summarized using the context from previous pages, the context between the pages is preserved

**Cons:**
- Multiple calls to the model are needed
- Does not work well with parallel processing, as the summary of each page depends on the summary of the previous page

## Conclusion

You have successfully summarized a long document, even though it was initially impossible due to an input prompt limit. You have also learned several methods for summarizing long documents, along with their advantages and disadvantages.

Summarizing a long document can be challenging. It requires you to identify the main points of the document, synthesize the information, and present it in a concise and coherent way. This can be especially difficult if the document is complex or technical. Additionally, summarizing a long document can be time-consuming, as you need to carefully read and analyze the text to ensure that the summary is accurate and complete.

While these methods allow you to interact with LLMs and summarize long documents in a flexible way, you may sometimes want to speed up the process by using bootstrapping or pre-built methods. This is where libraries like LangChain come in. You can read more about LangChain support on Vertex AI [here](https://python.langchain.com/en/latest/modules/models/llms/integrations/google_vertex_ai_palm.html).