Step 1: Measure the Document Length
This function tokenizes the input text using NLTK’s word tokenizer and returns the number of tokens in the document.

In [1]:
import nltk
from nltk.tokenize import word_tokenize

def measure_length(text):
    """Measure the length of the text in tokens."""
    tokens = word_tokenize(text)
    return len(tokens)

# Example usage:
sample_text = "This is a sample document."
print("Token count:", measure_length(sample_text))


Token count: 6
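
The count above is 6 rather than 5 because the tokenizer treats the final period as its own token. The sketch below compares plain whitespace splitting with a regular expression used as a rough stand-in for word_tokenize. It is not NLTK's algorithm, and rough_token_count is a hypothetical helper introduced only for illustration.

```python
import re

def rough_token_count(text):
    """Approximate token count: runs of word characters, plus each punctuation mark.

    Assumption: this regex only imitates word_tokenize on simple sentences.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))

sample_text = "This is a sample document."
print("Whitespace words:", len(sample_text.split()))    # 5 ("document." stays one word)
print("Rough tokens:", rough_token_count(sample_text))  # 6 (the period is counted separately)
```

For token limits this difference matters: punctuation-heavy text reaches the limit sooner than its word count suggests.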


Step 2: Compute Target Summary Length
This function computes the target summary length as a proportion of the total token count (e.g., 30% of the original text).

In [2]:
def compute_target_length(total_length, target_ratio=0.3):
    """
    Compute the target summary length based on a ratio.
    For example, a ratio of 0.3 means the summary should have about 30% of the original tokens.
    """
    return int(total_length * target_ratio)

# Example usage:
total_tokens = 1000
print("Target summary length:", compute_target_length(total_tokens))


Target summary length: 300
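
Because int() truncates, very short inputs can produce a target of 0 (for example, 3 tokens at a ratio of 0.3). A possible variant, sketched here under the assumption that a non-empty document should always get a target of at least one token, adds a floor and validates the ratio. The name compute_target_length_safe is hypothetical.

```python
def compute_target_length_safe(total_length, target_ratio=0.3):
    """Like compute_target_length, but never returns 0 for a non-empty document."""
    if not 0 < target_ratio <= 1:
        raise ValueError("target_ratio must be in the range (0, 1]")
    if total_length <= 0:
        return 0
    # Floor of 1 is an assumption of this sketch, not part of the original function.
    return max(1, int(total_length * target_ratio))

print(compute_target_length_safe(1000))  # 300
print(compute_target_length_safe(3))     # 1 (plain truncation would give 0)
```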


Step 3: Slice the Document
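This function slices a document into chunks of at most token_limit tokens each (e.g., 4000 tokens). Chunks are rebuilt by joining tokens with single spaces, so punctuation ends up separated from the preceding word, and a chunk boundary can fall in the middle of a sentence.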

In [3]:
from nltk.tokenize import word_tokenize

def slice_document(text, token_limit=4000):
    """
    Slice the document into chunks where each chunk has at most token_limit tokens.
    """
    words = word_tokenize(text)
    slices = []
    for i in range(0, len(words), token_limit):
        chunk = " ".join(words[i:i+token_limit])
        slices.append(chunk)
    return slices

# Example usage:
sample_text = " ".join(["word"] * 5000)  # A sample text with 5000 words.
chunks = slice_document(sample_text, token_limit=1000)
print("Number of chunks:", len(chunks))


Number of chunks: 5
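
Fixed-size slicing can cut a sentence in two at a chunk boundary. One common mitigation, sketched below, is to let consecutive chunks overlap by a few tokens. This is an optional variation rather than part of the pipeline above: slice_tokens_with_overlap and its overlap parameter are hypothetical, and the function works on an already tokenized list so it does not depend on any tokenizer.

```python
def slice_tokens_with_overlap(tokens, token_limit=4000, overlap=0):
    """Split a token list into chunks of at most token_limit tokens.

    Each chunk after the first repeats the last `overlap` tokens of the previous one.
    """
    if token_limit <= 0:
        raise ValueError("token_limit must be positive")
    if not 0 <= overlap < token_limit:
        raise ValueError("overlap must satisfy 0 <= overlap < token_limit")

    stride = token_limit - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + token_limit])
        if start + token_limit >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

tokens = ["word"] * 5000
print(len(slice_tokens_with_overlap(tokens, token_limit=1000)))               # 5
print(len(slice_tokens_with_overlap(tokens, token_limit=1000, overlap=100)))  # 6
```

With overlap=0 the result matches the chunk count shown above. A larger overlap trades extra chunks for more context at the boundaries.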


Step 4: Summarize a Text Slice
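This function uses a frequency-based approach to summarize a text slice. It tokenizes the text into sentences, calculates word frequencies (ignoring stopwords and punctuation), scores each sentence by summing the frequencies of its words, and then selects the top-scoring sentences according to the specified summary ratio. At least one sentence is always kept, and the selected sentences are returned in their original order.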

In [4]:
import nltk
import string
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

def summarize_slice(text, summary_ratio=0.3):
    """
    Summarize the text slice by:
    - Tokenizing the text into sentences.
    - Computing word frequencies (ignoring stopwords and punctuation).
    - Scoring each sentence based on the frequencies of its words.
    - Selecting the top sentences to form the summary.
    """
    sentences = sent_tokenize(text)
    if not sentences:
        return text

    stop_words = set(stopwords.words("english"))
    punctuations = set(string.punctuation)

    words = word_tokenize(text.lower())
    filtered_words = [word for word in words if word not in stop_words and word not in punctuations]
    freq = nltk.FreqDist(filtered_words)

    sentence_scores = {}
    for sentence in sentences:
        sentence_words = word_tokenize(sentence.lower())
        score = sum(freq[word] for word in sentence_words if word in freq)
        sentence_scores[sentence] = score

    summary_length = max(1, int(len(sentences) * summary_ratio))
    top_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:summary_length]
    top_sentences.sort(key=lambda s: sentences.index(s))
    
    summary = " ".join(top_sentences)
    return summary

# Example usage:
sample_text = ("This is a sentence. " 
               "Here is another sentence. " 
               "And yet another sentence to summarize.")
print("Summary:", summarize_slice(sample_text))


Summary: And yet another sentence to summarize.
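
Note that summarize_slice stores scores in a dictionary keyed by sentence text, so identical sentences collapse into a single entry. If duplicates should count separately, the selection step can be keyed by sentence position instead. The sketch below shows only that selection step, under simplifying assumptions: sentences are already split, words are extracted with a basic regular expression instead of NLTK, and the stopword set is passed in. The name select_top_sentences is hypothetical.

```python
import re
from collections import Counter

def select_top_sentences(sentences, summary_ratio=0.3, stop_words=frozenset()):
    """Pick the highest-scoring sentences by position, preserving original order."""
    if not sentences:
        return []

    def words_of(sentence):
        return [w for w in re.findall(r"\w+", sentence.lower()) if w not in stop_words]

    # Word frequencies over the whole slice.
    freq = Counter(w for s in sentences for w in words_of(s))
    # One score per sentence position, so duplicates are scored independently.
    scores = [sum(freq[w] for w in words_of(s)) for s in sentences]

    keep = max(1, int(len(sentences) * summary_ratio))
    # Stable sort: ties keep the earlier sentence first.
    best = sorted(range(len(sentences)), key=lambda i: -scores[i])[:keep]
    return [sentences[i] for i in sorted(best)]

sample = ["This is a sentence.",
          "Here is another sentence.",
          "And yet another sentence to summarize."]
stops = {"this", "is", "a", "here", "and", "to"}
print(select_top_sentences(sample, 0.3, stops))
# ['And yet another sentence to summarize.']
```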


Step 5: Hierarchical Summarization
This function implements the hierarchical summarization strategy. It first checks whether the text fits within the token limit. If not, it slices the document into chunks, summarizes each chunk, collates these summaries, and then recursively summarizes the collated summary until the final summary meets the token limit.

In [5]:
import nltk
from nltk.tokenize import word_tokenize

# slice_document, summarize_slice, and measure_length are assumed to be
# defined already in the earlier notebook cells.

def hierarchical_summarize(text, token_limit=4000, summary_ratio=0.3):
    """
    Perform hierarchical summarization:
    - If the text is within the token limit, return it as is.
    - Otherwise, slice the text into chunks that fit within the context window.
    - Summarize each slice.
    - Collate the summaries and, if necessary, summarize the collated summary recursively until it fits.
    """
    tokens = word_tokenize(text)
    if len(tokens) <= token_limit:
        return text

    # Slice the document into manageable chunks.
    slices = slice_document(text, token_limit)
    intermediate_summaries = [summarize_slice(chunk, summary_ratio) for chunk in slices]
    
    collated_summary = " ".join(intermediate_summaries)
    
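    # Recursively summarize if needed. Stop if the collated summary did not shrink
    # (e.g., a chunk is one long sentence), since recursing again would never terminate.
    collated_length = measure_length(collated_summary)
    if token_limit < collated_length < len(tokens):
        return hierarchical_summarize(collated_summary, token_limit, summary_ratio)
    return collated_summary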

# Example usage:
sample_text = "This is a long text for summarization. " * 1000
summary = hierarchical_summarize(sample_text, token_limit=200, summary_ratio=0.3)
print("Hierarchical Summary:", summary)


Hierarchical Summary: This is a long text for summarization . This is a long text for summarization .
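
To get a feel for how many summarization passes a document might need, the token count can be shrunk by the summary ratio until it fits. This is only a back-of-the-envelope sketch: it assumes each pass reduces the text to about summary_ratio of its size, whereas summarize_slice selects by sentence count, so real runs can differ (the repeated-sentence example above shrinks much faster because its sentences are identical). The name estimate_passes and the max_passes cap are hypothetical.

```python
def estimate_passes(total_tokens, token_limit, summary_ratio=0.3, max_passes=50):
    """Return (passes, final_tokens) under an idealized constant shrink per pass."""
    if not 0 < summary_ratio < 1:
        raise ValueError("summary_ratio must be strictly between 0 and 1")
    passes = 0
    remaining = total_tokens
    while remaining > token_limit and passes < max_passes:
        remaining = max(1, int(remaining * summary_ratio))
        passes += 1
    return passes, remaining

print(estimate_passes(8000, 200))  # (4, 64): 8000 -> 2400 -> 720 -> 216 -> 64
print(estimate_passes(100, 200))   # (0, 100): already within the limit
```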


Step 6: Save Text to File
A simple utility function to save text to a file. This is used for saving the final summaries and the generated query.

In [6]:
def save_text(filename, text):
    """Save the provided text to a file."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)

# Example usage:
sample_text = "This text will be saved to a file."
save_text("output.txt", sample_text)
print("Text saved to output.txt")


Text saved to output.txt
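
The explicit encoding="utf-8" matters once summaries contain non-ASCII characters, because the platform default encoding may differ. A quick way to check this is a round trip: write the text, read it back, and compare. The sketch below does so with pathlib in a temporary directory. The helper save_and_reload is hypothetical and not part of the pipeline.

```python
import tempfile
from pathlib import Path

def save_and_reload(directory, filename, text):
    """Write text as UTF-8, then read it back and return what was read."""
    path = Path(directory) / filename
    path.write_text(text, encoding="utf-8")
    return path.read_text(encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    original = "Résumé: naïve café summary"
    restored = save_and_reload(tmp, "summary.txt", original)
    print("Round trip OK:", restored == original)  # Round trip OK: True
```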


Step 7: Generate a Query
This function generates a query by combining the summaries from the two documents. It extracts the most frequent keywords (after removing stopwords and punctuation) and joins them to form a simple query.

In [7]:
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def generate_query(summary1, summary2, top_n=5):
    """
    Generate a query by combining both summaries and selecting the top_n most frequent keywords.
    """
    combined_text = summary1 + " " + summary2
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(combined_text.lower())
    punctuations = set(string.punctuation)  # set membership rather than substring matching
    filtered_words = [word for word in words if word not in stop_words and word not in punctuations]
    freq = nltk.FreqDist(filtered_words)
    common_words = [word for word, count in freq.most_common(top_n)]
    query = " ".join(common_words)
    return query

# Example usage:
summary1 = "This is the first summary."
summary2 = "This is the second summary."
print("Generated Query:", generate_query(summary1, summary2))


Generated Query: summary first second
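
When several keywords share the same count, their order in the query depends on which one appeared first in the combined text. If a query that is independent of input order is preferred, ties can be broken alphabetically. The sketch below illustrates this with collections.Counter and a simple regular expression in place of NLTK. The helper top_keywords and the small stopword set are assumptions made for the example.

```python
import re
from collections import Counter

def top_keywords(text, stop_words, top_n=5):
    """Return the top_n most frequent words, breaking ties alphabetically."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in stop_words]
    counts = Counter(words)
    # Sort by descending count, then by the word itself for a deterministic order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]

stop_words = {"this", "is", "the"}
text = "This is the first summary. This is the second summary."
print(" ".join(top_keywords(text, stop_words)))  # summary first second
```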


Step 8: Main Execution
This cell ties all the steps together. It measures the document lengths, computes target summary lengths (for informational purposes), performs hierarchical summarization on two documents, saves the summaries, and finally generates a query based on the summaries.

In [8]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

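# Make sure required NLTK data is downloaded (uncomment if needed).
# These resources are needed before running any of the cells above.
# nltk.download('punkt')
# nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead of 'punkt'
# nltk.download('stopwords')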

# Assume the functions from previous steps are already defined in your notebook:
# - measure_length
# - compute_target_length
# - slice_document
# - summarize_slice
# - hierarchical_summarize
# - save_text
# - generate_query

# Example documents (replace these with your actual texts)
document1 = """Your very long text for document 1 goes here. 
This document could be thousands of tokens long, requiring hierarchical summarization..."""

document2 = """Your reference style text for document 2 goes here. 
This text is used to guide the style of the summary for document 1."""

# Step 1: Measure document lengths.
length1 = measure_length(document1)
length2 = measure_length(document2)
print("Document 1 Length:", length1)
print("Document 2 Length:", length2)

# Step 2: Compute target summary lengths (informational).
target_length1 = compute_target_length(length1)
target_length2 = compute_target_length(length2)
print("Target Summary Length for Document 1 (approx.):", target_length1)
print("Target Summary Length for Document 2 (approx.):", target_length2)

# Steps 3-5: Hierarchically summarize each document.
summary1 = hierarchical_summarize(document1, token_limit=4000, summary_ratio=0.3)
summary2 = hierarchical_summarize(document2, token_limit=4000, summary_ratio=0.3)
print("\nSummary for Document 1:\n", summary1)
print("\nSummary for Document 2:\n", summary2)

# Step 6: Save the final summaries.
save_text("final_summary_doc1.txt", summary1)
save_text("final_summary_doc2.txt", summary2)

# Step 7: Generate a query based on the two summaries.
query = generate_query(summary1, summary2)
save_text("generated_query.txt", query)
print("\nGenerated Query:", query)


Document 1 Length: 23
Document 2 Length: 25
Target Summary Length for Document 1 (approx.): 6
Target Summary Length for Document 2 (approx.): 7

Summary for Document 1:
 Your very long text for document 1 goes here. 
This document could be thousands of tokens long, requiring hierarchical summarization...

Summary for Document 2:
 Your reference style text for document 2 goes here. 
This text is used to guide the style of the summary for document 1.

Generated Query: document text long 1 goes
