<a href="https://colab.research.google.com/github/Amith71965/Academic-Projects/blob/main/notebooks/Long%20Document%20summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Summarization

This notebook demonstrates an application of long document summarization techniques to a work of literature.

## Setting up the environment

### Install Dependencies

Granite Kitchen comes with a bundle of dependencies that are required for notebooks. See the list of packages in its [`setup.py`](https://github.com/ibm-granite-community/granite-kitchen/blob/main/setup.py).

In [1]:
! pip install git+https://github.com/ibm-granite-community/granite-kitchen \
    'transformers>=4.45.2' \
    torch \
    tiktoken

Collecting git+https://github.com/ibm-granite-community/granite-kitchen
  Cloning https://github.com/ibm-granite-community/granite-kitchen to /tmp/pip-req-build-xzy_yc7y
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/granite-kitchen /tmp/pip-req-build-xzy_yc7y
  Resolved https://github.com/ibm-granite-community/granite-kitchen to commit c56fc793b8ce48a96ae199ed8dd0a5c24cf71a2e
  Preparing metadata (setup.py) ... done
Collecting transformers>=4.45.2
  Downloading transformers-4.45.2-py3-none-any.whl.metadata (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.4/44.4 kB 1.5 MB/s eta 0:00:00
Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting ibm-granite-community@ git+https://github.com/ibm-granite-community/utils (from granite-kitchen==0.1.0)
  Cloning https://github.com/ibm-granite-community/utils 

### Serving the Granite AI model


This notebook requires IBM Granite models to be served by an AI model runtime so that the models can be invoked or called. This notebook can use a locally accessible [Ollama](https://github.com/ollama/ollama) server to serve the models, or the [Replicate](https://replicate.com) cloud service.

During the pre-work, you may have either started a local Ollama server on your computer or set up Replicate access and obtained an [API token](https://replicate.com/account/api-tokens). If so, you can skip the "Running Ollama in Colab" section below.

#### Running Ollama in Colab

This section is required only if you will neither run the Ollama server locally on your computer nor use the Replicate cloud service. Running the Ollama server in Colab limits the size of the Granite models you can use and makes calls to those models _significantly_ slower.

> **Note:** You can modify the notebook's runtime type to select a GPU hardware accelerator. Using the "Runtime->Change runtime type" menu item, select "T4 GPU" instead of "CPU" and save. This will improve the performance of the Ollama server. GPU hardware accelerators are subject to usage limits, especially on the free tier. See the Colab documentation for details.


1. Download and install Ollama in Colab

In [2]:
!curl https://ollama.ai/install.sh | sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13320    0 13320    0     0  59317      0 --:--:-- --:--:-- --:--:-- 59464
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


2. Start the Ollama server as a background process in Colab using `nohup` and `&`

In [3]:
import os
os.system("nohup ollama serve &")

0

3. Pull down the Granite models in Colab that you will use in the workshop. Larger models take more memory to run.

In [4]:
!ollama pull granite3-dense:2b
!ollama pull granite3-dense:8b

pulling manifest
pulling 63dd4fe4571a...   4% ▕▏  64 MB/1.6 GB

## Select your model


Select a Granite model to use. Here we use a LangChain client to connect to the model. If a locally accessible Ollama server responds, we use an Ollama client to access the model. Otherwise, we fall back to a Replicate client.

When using Replicate, if the `REPLICATE_API_TOKEN` environment variable is not set, or a `REPLICATE_API_TOKEN` Colab secret is not set, then the notebook will ask for your [Replicate API token](https://replicate.com/account/api-tokens) in a dialog box.

In [5]:
import os
import requests
from langchain_ollama.llms import OllamaLLM
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import set_env_var

try:  # Look for a locally accessible Ollama server for the model
    # A short timeout keeps the cell from hanging when no server is listening.
    requests.get(os.getenv("OLLAMA_HOST", "http://127.0.0.1:11434"), timeout=5)
    model = OllamaLLM(model="granite3-dense:8b")
except Exception:  # No Ollama server found, so use Replicate for the model
    set_env_var("REPLICATE_API_TOKEN")
    model = Replicate(model="ibm-granite/granite-3.0-8b-instruct")
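
The fallback logic above treats any exception as "no Ollama server". As a minimal sketch of the same reachability probe in isolation, the hypothetical helper `ollama_is_reachable` below (not part of the notebook or of any library) uses only the standard library and returns a boolean instead of relying on an exception. The default URL is the one the Ollama installer reports; the timeout value is an assumption.

```python
import urllib.error
import urllib.request


def ollama_is_reachable(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url within the timeout.

    Hypothetical helper for illustration; the notebook itself uses requests.get.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (OSError, ValueError):
        # Connection refused, DNS failure, timeout, or a malformed URL.
        return False
```

A caller could branch on the result, for example `OllamaLLM(...) if ollama_is_reachable() else Replicate(...)`, which keeps unrelated errors from silently selecting Replicate.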


## Download a book

Here we fetch H.D. Thoreau's "Walden" from [Project Gutenberg](https://www.gutenberg.org/) for summarization.

We have to trim it down so that it will fit in the context window of the model.

In [6]:
# The following URL contains a text version of H.D. Thoreau's "Walden"
url = "https://www.gutenberg.org/cache/epub/205/pg205.txt"

# Get the contents
response = requests.get(url)
response.raise_for_status()
full_contents = response.text

# Extract the text of the book, leaving out the Project Gutenberg boilerplate.
start_str = "*** START OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***"
start_index = full_contents.index(start_str) + len(start_str)
end_str = "*** END OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***"
end_index = full_contents.index(end_str)
book_contents = full_contents[start_index:end_index]
print(f"Length of book text: {len(book_contents)} chars")

# We limit the text to 10k characters, which is about 2.8k tokens.
char_limit = 10000
contents = book_contents[:char_limit]
print(f"Length of text for summarization: {len(contents)} chars")

Length of book text: 644843 chars
Length of text for summarization: 10000 chars
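
The extraction above depends on the exact marker strings for this edition of "Walden". A slightly more tolerant variant, sketched below, matches the markers with regular expressions so the same code could work for other Project Gutenberg books. The helper name `strip_gutenberg_boilerplate` is hypothetical, and the assumption that markers read `*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***` should be checked against the file you download.

```python
import re

# Assumed marker shapes; older files may say "THIS" instead of "THE".
_START = re.compile(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE)
_END = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE)


def strip_gutenberg_boilerplate(full_text: str) -> str:
    """Return the text between the start and end markers, without surrounding whitespace."""
    start = _START.search(full_text)
    end = _END.search(full_text)
    if start is None or end is None or end.start() < start.end():
        raise ValueError("Could not locate the Project Gutenberg start/end markers")
    # Unlike the slicing in the cell above, this also trims leading/trailing whitespace.
    return full_text[start.end():end.start()].strip()
```

Raising `ValueError` when a marker is missing gives a clearer message than the bare `str.index` failure.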


## Count the tokens

Before sending our text to the AI model, it's crucial to understand how much of the model's capacity we're using. Language models typically have a limit on the number of tokens they can process in a single request, and that limit covers both the prompt and the generated output.

Key points:
- We're using the [`granite-3.0-8b-instruct`](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) model, which has a context window of 4,000 tokens.
- Tokenization can vary between models, so we use the specific tokenizer for our chosen model.

Understanding token count helps us optimize our prompts and ensure we're using the model efficiently.

In [7]:
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-3.0-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
print("Your model uses the tokenizer " + type(tokenizer).__name__)

print(f"Your document has {len(tokenizer.tokenize(contents))} tokens. ")

Your model uses the tokenizer GPT2TokenizerFast
Your document has 2863 tokens. 
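
To make the budget concrete, here is a small arithmetic sketch. It assumes the 4,000-token context window stated above and that prompt and completion share that window; the function names are hypothetical. Note that 2863 counts only the document text, so the real prompt is slightly larger once the instruction and role tokens are added.

```python
def fits_in_context(prompt_tokens: int, max_new_tokens: int, context_window: int = 4000) -> bool:
    """True if the prompt plus the requested completion fits in the context window."""
    return prompt_tokens + max_new_tokens <= context_window


def max_completion_budget(prompt_tokens: int, context_window: int = 4000) -> int:
    """Largest completion (in tokens) that still fits after the prompt."""
    return max(0, context_window - prompt_tokens)


print(fits_in_context(2863, 512))    # True: 2863 + 512 = 3375
print(fits_in_context(2863, 2000))   # False: 2863 + 2000 = 4863
print(max_completion_budget(2863))   # 1137
```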


## Summarize the text

Use the question-answer format recommended by the Granite Prompting Guide.

In [8]:
prompt_guide_template = """\
<|start_of_role|>user<|end_of_role|>{prompt}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""
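
As an illustration of how this template behaves with `str.format`, the sketch below fills it in two stages, the same pattern the chunk-summarization step uses later in this notebook. The constant and variable names are chosen for the example only. The second stage shows that braces inside the inserted passage are left alone, because `str.format` does not re-parse argument values; the instruction passed in the first stage, however, must not contain stray braces.

```python
PROMPT_GUIDE_TEMPLATE = (
    "<|start_of_role|>user<|end_of_role|>{prompt}<|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|>"
)

# Stage 1: insert an instruction that itself carries a {text} placeholder.
summary_template = PROMPT_GUIDE_TEMPLATE.format(
    prompt="Summarize the following text:\n{text}\n"
)

# Stage 2: fill {text} with the passage. Braces in the passage are not interpreted.
filled = summary_template.format(text="Passage with {braces} stays intact.")
print(filled)
```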


We construct our final prompt and send it to the AI model being served for processing.

In [9]:
prompt = prompt_guide_template.format(prompt=f"""
Summarize the following text:
{contents}
""")

output = model.invoke(
    prompt,
    model_kwargs={
        "max_tokens": 10000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 200, # Set the minimum number of tokens to generate as output.
        "temperature": 0.75,
        "system_prompt": "You are a helpful assistant.",
        "presence_penalty": 0,
        "frequency_penalty": 0
    }
    )

print(output)

This text appears to be a reflection on the struggles and hardships faced by individuals who are unable to afford basic necessities such as food, clothing, and shelter. The author emphasizes the importance of treating oneself and others with kindness and understanding, as they may be dealing with difficult circumstances that are not immediately apparent.

The text also touches on the idea of self-imposed slavery or servitude, where individuals may feel trapped by their own thoughts and opinions about themselves, leading to a sense of desperation and hopelessness. The author encourages readers to strive for self-emancipation and to break free from these self-imposed limitations.

Overall, the text seems to be a call to empathy and understanding towards those who are struggling, as well as a reminder to strive for personal growth and self-improvement.


## Summary of Summaries

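Here we use an iterative, summary-of-summaries technique to work within the context length of the model: split the book into chunks that fit the context window, summarize each chunk separately, then summarize those summaries.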

### Chunk the text

Divide the full text into smaller passages for separate processing.

In [10]:
from langchain.text_splitter import TokenTextSplitter

text = book_contents
print(f"The text is {len(tokenizer.tokenize(text))} tokens.")

# Split the documents into chunks
chunk_token_limit = 3000  # In tokens: 3000 message + 512 completion + ~350 padding < 4000 context length
text_splitter = TokenTextSplitter.from_huggingface_tokenizer(tokenizer=tokenizer, chunk_size=chunk_token_limit, chunk_overlap=0)
chunks = text_splitter.split_text(text)

print(f"Chunk count: {len(chunks)}")
print(f"Max chunk tokens: {max([len(tokenizer.tokenize(chunk)) for chunk in chunks])}")

The text is 184361 tokens.
Chunk count: 57
Max chunk tokens: 3351
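
The largest chunk above (3351 tokens) exceeds the requested 3000-token limit. A likely cause, which is an assumption to verify against your installed LangChain version, is that `TokenTextSplitter` measures chunks with its own tiktoken encoding rather than with the Hugging Face tokenizer used for counting here. The sketch below shows one way to split on the model's own token IDs instead. The function `split_by_token_ids` is hypothetical, and the toy word-level tokenizer exists only to make the example self-contained.

```python
from typing import Callable, List, Sequence


def split_by_token_ids(
    text: str,
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
    chunk_size: int,
) -> List[str]:
    """Encode once, slice the token IDs into fixed-size windows, decode each window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ids = list(encode(text))
    return [decode(ids[i:i + chunk_size]) for i in range(0, len(ids), chunk_size)]


# Toy word-level tokenizer, for demonstration only.
words = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(words))
toy_encode = lambda s: [vocab.index(w) for w in s.split()]
toy_decode = lambda ids: " ".join(vocab[i] for i in ids)

toy_chunks = split_by_token_ids(" ".join(words), toy_encode, toy_decode, chunk_size=4)
print(toy_chunks)  # ['the quick brown fox', 'jumps over the lazy', 'dog']
```

With the notebook's tokenizer, `encode` could be `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `decode` could be `tokenizer.decode`. Re-tokenizing a decoded chunk may still differ by a few tokens at chunk boundaries, so it is worth keeping some padding in the budget.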


### Summarize the chunks

Here we create a separate summary of each passage. This can take a few minutes.

In [11]:
summaries = []
prompt_summary_template = prompt_guide_template.format(prompt="""\
Summarize the following text using only the information found in the text:
{text}
""")

max_chunks = min(10, len(chunks))  # Adjust to limit the work
for i, chunk in enumerate(chunks[:max_chunks]):
    # Use a dedicated loop variable so the full-book `text` is not overwritten.
    prompt = prompt_summary_template.format(text=chunk)
    print(f"{i + 1}. Prompt size: {len(tokenizer.tokenize(prompt))} tokens")
    output = model.invoke(
        prompt,
        model_kwargs={
            "max_tokens": 512, # Maximum number of output tokens; matches the 512-token completion budget assumed when chunking.
            "min_tokens": 200, # Set the minimum number of tokens to generate as output.
            "temperature": 0.75,
            "system_prompt": "You are a helpful assistant.",
            "presence_penalty": 0,
            "frequency_penalty": 0
        }
    )
    print(f"{i + 1}. Output size: {len(tokenizer.tokenize(output))} tokens")
    summary = f"Summary {i+1}:\n{output}\n\n"
    summaries.append(summary)
    print(summary)

print(f"Summary count: {len(summaries)}")
summary_contents = "\n\n".join(summaries)
print(f"Total: {len(tokenizer.tokenize(summary_contents))} tokens")

1. Prompt size: 3240 tokens
1. Output size: 135 tokens
Summary 1:
The text appears to be a reflection on the nature of work, slavery, and the human condition. It suggests that many people lead lives of quiet desperation, feeling trapped by societal expectations and their own self-imposed limitations. The author argues that true freedom comes from self-emancipation and questioning one's beliefs and values. They also criticize the idea of resignation as a form of despair, suggesting that wisdom lies in not doing desperate things. The text ultimately encourages readers to challenge their preconceived notions and strive for new deeds and ways of thinking.


2. Prompt size: 3295 tokens
2. Output size: 286 tokens
Summary 2:
The text discusses the importance of maintaining vital heat or animal life in humans. It suggests that food, clothing, shelter, and beds serve to retain the heat generated within our bodies. The author argues that many people prioritize their comfort and luxuries over the

### Summarize the Summaries

We tell the model that it is receiving separate summaries of passages from an original text, and we ask it to compose a single, unified summary of that text.

In [None]:
prompt = prompt_guide_template.format(prompt=f"""
A text was summarized in separate passages; those passage summaries are provided below.

{summary_contents}

From these summaries alone, compose a single, unified summary of the text.
""")
print(f"Prompt size: {len(tokenizer.tokenize(prompt))} tokens")
output = model.invoke(
    prompt,
    model_kwargs={
        "max_tokens": 2000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 500, # Set the minimum number of tokens to generate as output.
        "temperature": 0.75,
        "system_prompt": "You are a helpful assistant.",
        "presence_penalty": 0,
        "frequency_penalty": 0
    }
    )

print(output)
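
When all chunks are summarized (not just the first ten), the joined summaries may themselves exceed the context window. One common way to handle that, sketched here under stated assumptions, is to merge summaries in budget-sized batches and repeat until a single summary remains. The function `collapse_summaries` and its parameters are hypothetical; `summarize` and `count_tokens` are callables you supply.

```python
from typing import Callable, List


def collapse_summaries(
    summaries: List[str],
    summarize: Callable[[str], str],
    count_tokens: Callable[[str], int],
    token_budget: int,
    separator: str = "\n\n",
) -> str:
    """Repeatedly merge summaries in budget-sized batches until one remains."""
    if not summaries:
        raise ValueError("summaries must not be empty")
    current = list(summaries)
    while len(current) > 1:
        batches: List[List[str]] = [[]]
        for item in current:
            candidate = separator.join(batches[-1] + [item])
            if batches[-1] and count_tokens(candidate) > token_budget:
                batches.append([item])  # Batch is full; start a new one.
            else:
                batches[-1].append(item)
        merged = [summarize(separator.join(batch)) for batch in batches]
        if len(merged) >= len(current):
            # No batch held more than one item, so another round cannot make progress.
            raise RuntimeError("Summaries are not shrinking; raise token_budget or shorten them")
        current = merged
    return current[0]
```

In this notebook, `summarize` could wrap `model.invoke` with the unified-summary prompt above, and `count_tokens` could be `lambda s: len(tokenizer.tokenize(s))`, with `token_budget` set below the context window minus the completion allowance.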