# Document Summarization

This notebook demonstrates an application of long document summarization techniques to a work of literature.

## Setting up the environment

### Install Dependencies

Granite Kitchen comes with a bundle of dependencies that are required for notebooks. See the list of packages in its [`setup.py`](https://github.com/ibm-granite-community/granite-kitchen/blob/main/setup.py). 

In [None]:
! pip install git+https://github.com/ibm-granite-community/granite-kitchen \
    transformers \
    torch \
    tiktoken

### Serving the Granite AI model


This notebook requires IBM Granite models to be served by an AI model runtime so that the models can be invoked or called. This notebook can use a locally accessible [Ollama](https://github.com/ollama/ollama) server to serve the models, or the [Replicate](https://replicate.com) cloud service.

During the pre-work, you may have either started a local Ollama server on your computer or set up Replicate access and obtained an [API token](https://replicate.com/account/api-tokens). In either case, you can skip the "Running Ollama in Colab" section below.

#### Running Ollama in Colab

Use this section only if you will neither run the Ollama server locally on your computer nor use the Replicate cloud service. Running the Ollama server in Colab limits the size of the Granite models you can use and makes calls to those models _significantly_ slower.


1. Download and install Ollama in Colab

In [None]:
!curl https://ollama.ai/install.sh | sh

2. Start the Ollama server as a background process in Colab using `nohup` and `&`

In [None]:
import os
os.system("nohup ollama serve &")

3. Pull down the Granite models in Colab that you will use in the workshop. Larger models take more memory to run. The `20b` model is too large for the Colab runtime environment.

In [None]:
!ollama pull granite-code:3b
!ollama pull granite-code:8b
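
Because step 2 starts the server in the background, commands that talk to it may fail if they run before it is ready. The following is a minimal sketch of a readiness check that could be run between steps 2 and 3. The helper names `server_is_up` and `wait_for_server` are hypothetical and not part of Ollama or Granite Kitchen; the sketch only assumes that the server answers a plain HTTP GET at its base URL.

```python
import time
import urllib.request


def server_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, connection refused, and timeouts
        return False


def wait_for_server(url: str, attempts: int = 10, delay: float = 1.0, probe=server_is_up) -> bool:
    """Call `probe(url)` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_for_server("http://127.0.0.1:11434")` polls the default local address used later in this notebook and returns `False` if the server never responds.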

## Select your model


Select a Granite model to use. Here we use a LangChain client to connect to the model. If there is a locally accessible Ollama server, we use an Ollama client to access the model. Otherwise, we use a Replicate client to access the model.

When using Replicate, if the `REPLICATE_API_TOKEN` environment variable is not set, or a `REPLICATE_API_TOKEN` Colab secret is not set, then the notebook will ask for your [Replicate API token](https://replicate.com/account/api-tokens) in a dialog box.

In [None]:
import os
import requests
from langchain_ollama.llms import OllamaLLM
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import set_env_var

try: # Look for a locally accessible Ollama server for the model
    response = requests.get(os.getenv("OLLAMA_HOST", "http://127.0.0.1:11434"))
    model = OllamaLLM(model="granite-code:8b")
except Exception: # Use Replicate for the model
    set_env_var("REPLICATE_API_TOKEN")
    model = Replicate(model="ibm-granite/granite-8b-code-instruct-128k")


## Download a book

Here we fetch H.D. Thoreau's "Walden" from [Project Gutenberg](https://www.gutenberg.org/) for summarization.

We have to trim it down so that it will fit in the context window of the model.

In [None]:
# The following URL contains a text version of H.D. Thoreau's "Walden"
url = "https://www.gutenberg.org/cache/epub/205/pg205.txt"

# Get the contents
response = requests.get(url)
response.raise_for_status()
full_contents = response.text

# Extract the text of the book, leaving out the Project Gutenberg boilerplate.
start_str = "*** START OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***"
start_index = full_contents.index(start_str) + len(start_str)
end_str = "*** END OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***"
end_index = full_contents.index(end_str)
book_contents = full_contents[start_index:end_index]
print(f"Length of book text: {len(book_contents)} chars")

# We limit the text to 10k characters, which is about 2.8k tokens.
char_limit = 10000
contents = book_contents[:char_limit]
print(f"Length of text for summarization: {len(contents)} chars")
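
A hard cut at a fixed character count can end the excerpt in the middle of a sentence. As an optional refinement, the sketch below trims at the last paragraph break before the limit. The function `trim_to_paragraph` is a hypothetical helper, not part of the notebook's dependencies, and it assumes paragraphs are separated by blank lines (it normalizes `\r\n` line endings first, in case the downloaded text uses them).

```python
def trim_to_paragraph(text: str, char_limit: int) -> str:
    """Cut `text` to at most `char_limit` characters, preferring the last paragraph break."""
    # Normalize Windows-style line endings so blank lines appear as "\n\n".
    text = text.replace("\r\n", "\n")
    if len(text) <= char_limit:
        return text
    head = text[:char_limit]
    cut = head.rfind("\n\n")
    # Fall back to the hard cut when no paragraph break exists inside the window.
    return head[:cut] if cut > 0 else head
```

With the variables above, `contents = trim_to_paragraph(book_contents, char_limit)` would replace the plain slice; the result is never longer than `char_limit` characters.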

## Count the tokens

Before sending our text to the AI model, it's crucial to understand how much of the model's capacity we're using. Language models typically have a limit on the number of tokens they can process in a single request.

Key points:
- We're using the [`granite-8b-code-instruct-128k`](https://huggingface.co/ibm-granite/granite-8b-code-instruct-128k) model, which has a context window of 128,000 tokens.
- Tokenization can vary between models, so we use the specific tokenizer for our chosen model.

Understanding token count helps us optimize our prompts and ensure we're using the model efficiently.

In [None]:
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-8b-code-instruct-128k"
tokenizer = AutoTokenizer.from_pretrained(model_path)
print("Your model uses the tokenizer " + type(tokenizer).__name__)

print(f"Your document has {len(tokenizer.tokenize(contents))} tokens. ")
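
Once the token count is known, it can be compared against the context window. The sketch below assumes that the prompt and the generated output share the same window, which is typical for this kind of model; `fits_context` and `remaining_tokens` are hypothetical helpers, and the 128,000 default is the context size stated above.

```python
def remaining_tokens(prompt_tokens: int, max_output_tokens: int, context_window: int = 128_000) -> int:
    """Tokens left in the window after reserving room for the prompt and the output."""
    return context_window - prompt_tokens - max_output_tokens


def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int = 128_000) -> bool:
    """Return True if prompt plus requested output stay within the context window."""
    return remaining_tokens(prompt_tokens, max_output_tokens, context_window) >= 0
```

For example, `fits_context(len(tokenizer.tokenize(contents)), 10_000)` checks the excerpt against the output limit used in the next section.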

## Summarize the text

We construct the prompt and send it to the selected Granite model for processing.

In [None]:
prompt = f"""
Summarize the following text:
{contents}
"""

output = model.invoke(
    prompt,
    model_kwargs={
        "max_tokens": 10000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 200, # Set the minimum number of tokens to generate as output.
        "temperature": 0.75,
        "system_prompt": "You are a helpful assistant.",
        "presence_penalty": 0,
        "frequency_penalty": 0
    }
    )

print(output)

## Summary of Summaries

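When a text is too long for the model's context window, we can summarize it iteratively: split the text into chunks, summarize each chunk separately, and then summarize the combined chunk summaries.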

### Chunk the text

Divide the full text into smaller passages for separate processing.

In [None]:
from langchain.text_splitter import TokenTextSplitter

text = book_contents
print(f"The text is {len(tokenizer.tokenize(text))} tokens.")

# Split the documents into chunks
chunk_token_limit = 3000  # Chunk size in tokens, sized for a 4000-token budget: 3000 message + 512 completion + ~350 padding
text_splitter = TokenTextSplitter.from_huggingface_tokenizer(tokenizer=tokenizer, chunk_size=chunk_token_limit, chunk_overlap=0)
chunks = text_splitter.split_text(text)

print(f"Chunk count: {len(chunks)}")
print(f"Max chunk tokens: {max([len(tokenizer.tokenize(chunk)) for chunk in chunks])}")
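
To illustrate the idea behind token-based chunking, here is a minimal sketch that splits a list of tokens into fixed-size windows with an optional overlap. It is not LangChain's implementation, and details such as how a final partial window is handled may differ; `chunk_tokens` is a hypothetical helper, and any list (for example, the output of `tokenizer.tokenize`) can stand in for the tokens.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int, chunk_overlap: int = 0) -> List[list]:
    """Split `tokens` into windows of at most `chunk_size`, each sharing `chunk_overlap` tokens with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]
```

With `chunk_overlap=0`, as in the cell above, the windows do not share any tokens and together cover the whole text exactly once.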

### Summarize the chunks

Here we create a separate summary of each passage. This can take a few minutes.

In [None]:
summaries = []

max_chunks = min(10, len(chunks)) # adjust to limit the work
for i in range(max_chunks):
    text = chunks[i]
    prompt = f"""
        Summarize the following text using only the information found in the text:
        {text}
        """
    print(f"{i + 1}. Prompt size: {len(tokenizer.tokenize(prompt))} tokens")
    output = model.invoke(
        prompt,
        model_kwargs={
            "max_tokens": 2000, # Set the maximum number of tokens to generate as output.
            "min_tokens": 200, # Set the minimum number of tokens to generate as output.
            "temperature": 0.75,
            "system_prompt": "You are a helpful assistant.",
            "presence_penalty": 0,
            "frequency_penalty": 0
        }
    )
    print(f"{i + 1}. Output size: {len(tokenizer.tokenize(output))} tokens")
    summary = f"Summary {i+1}:\n{output}\n\n"
    summaries.append(summary)
    print(summary)

print(f"Summary count: {len(summaries)}")
summary_contents = "\n\n".join(summaries)
print(f"Total: {len(tokenizer.tokenize(summary_contents))} tokens")

### Summarize the Summaries

We tell the model that it is receiving separate summaries of passages from an original text, and ask it to create a single, unified summary of that text.

In [None]:
prompt = f"""
A text was summarized in separate passages; those passage summaries are provided below.

{summary_contents}

From these summaries alone, compose a single, unified summary of the text.
"""
print(f"Prompt size: {len(tokenizer.tokenize(prompt))} tokens")
print(prompt)
output = model.invoke(
    prompt,
    model_kwargs={
        "max_tokens": 1000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 500, # Set the minimum number of tokens to generate as output.
        "temperature": 0.75,
        "system_prompt": "You are a helpful assistant.",
        "presence_penalty": 0,
        "frequency_penalty": 0
    }
    )

print(output)
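
The two stages above can be condensed into one function. The following is a minimal sketch, assuming a `generate` callable that takes a prompt string and returns the model's text; the name `summary_of_summaries` and the two template constants are hypothetical, with the prompt wording taken from the cells above.

```python
from typing import Callable, List

CHUNK_PROMPT = (
    "Summarize the following text using only the information found in the text:\n"
    "{text}\n"
)
REDUCE_PROMPT = (
    "A text was summarized in separate passages; those passage summaries are provided below.\n\n"
    "{summaries}\n\n"
    "From these summaries alone, compose a single, unified summary of the text.\n"
)


def summary_of_summaries(chunks: List[str], generate: Callable[[str], str]) -> str:
    """Summarize each chunk, then summarize the numbered chunk summaries."""
    # Stage 1: one model call per chunk.
    summaries = [
        f"Summary {i}:\n{generate(CHUNK_PROMPT.format(text=chunk))}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    # Stage 2: one final call over the joined summaries.
    return generate(REDUCE_PROMPT.format(summaries="\n\n".join(summaries)))
```

With the objects defined in this notebook, a call such as `summary_of_summaries(chunks[:10], model.invoke)` would make one model call per chunk plus one final call. Passing the model as a plain callable also makes it easy to substitute a stub function when testing the flow without a model.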