I found that the way I preprocessed the markdown had a drastic effect on the output of Hugging Face's summarization pipeline. Specifically, keeping headings and newlines produced the best output. This notebook explores how preprocessing affects my other consolidation methods.

# Imports and Functions

In [1]:
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoConfig, AutoTokenizer, pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity
from transformers import logging

import re
import pathlib
import warnings
import math

logging.set_verbosity_error()
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\lydp7\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [2]:
def extract_content_from_md(file_path: str, keep_headers: bool = True, keep_newlines: bool = False) -> str:
    """
    Clean markdown content for NLP processing.
    
    Removes LaTeX math, markdown formatting, and frontmatter. 
    Preserves Obsidian hashtags but handles markdown headers based on keep_headers.
    
    Args:
        file_path (str): Path to markdown file
        keep_headers (bool): If True, converts headers to plain text. If False, removes them.
        keep_newlines (bool): If True, preserves line structure while trimming excess whitespace. If False, collapses all whitespace to single spaces.
    
    Returns:
        Cleaned text with preserved or collapsed whitespace based on keep_newlines
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Remove frontmatter if present
    if content.startswith('---'):
        # Find the closing ---
        end_frontmatter = content.find('---', 3)
        if end_frontmatter != -1:
            content = content[end_frontmatter + 3:].strip()

    if keep_headers:
        # Convert headers to plain text but preserve them
        content = re.sub(r'^(#+)\s+(.+)$', r'\2', content, flags=re.MULTILINE)
    else:
        # Remove headers entirely        
        content = re.sub(r'^#+\s+.*$', '', content, flags=re.MULTILINE)
    
    # Clean LaTeX: Remove display math ($$...$$)
    content = re.sub(r'\$\$.*?\$\$', '', content, flags=re.DOTALL)
    
    # Clean LaTeX: Remove inline math ($...$)
    content = re.sub(r'\$[^$]*?\$', '', content)
    
    # Clean LaTeX: Remove math environments (equation, align, etc.)
    content = re.sub(r'\\begin\{[^}]+\}.*?\\end\{[^}]+\}', '', content, flags=re.DOTALL)
    
    # Clean markdown formatting
    content = re.sub(r'\*\*(.*?)\*\*', r'\1', content)  # Bold
    content = re.sub(r'\*(.*?)\*', r'\1', content)  # Italic
    content = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', content)  # Links
    content = re.sub(r'```.*?```', '', content, flags=re.DOTALL)  # Code blocks (must run before inline code, which would otherwise consume the fence backticks)
    content = re.sub(r'`(.*?)`', r'\1', content)  # Inline code
    
    # Clean up whitespace
    if keep_newlines:
        # Preserve structure but clean up excessive whitespace
        content = re.sub(r'[ \t]+', ' ', content)        # Multiple spaces/tabs → single space
        content = re.sub(r'\n\s*\n+', '\n', content)  # Multiple newlines → single newline
        content = re.sub(r'^\s+|\s+$', '', content, flags=re.MULTILINE)  # Trim line edges    
    else:
        # Collapse all whitespace to single spaces
        content = re.sub(r'\s+', ' ', content)
    
    return content.strip()
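
To make the four `keep_headers` / `keep_newlines` combinations concrete, here is a minimal sketch that applies only the header and whitespace steps to an in-memory string. The `clean` helper and `SAMPLE` text are hypothetical and exist only for illustration; the frontmatter, LaTeX, and formatting steps of `extract_content_from_md` are left out.

```python
import re

# Hypothetical sample document, used only for this illustration
SAMPLE = "# Title\n\nFirst   paragraph.\n\n\n## Section\n\nSecond paragraph."


def clean(text: str, keep_headers: bool, keep_newlines: bool) -> str:
    """Apply only the header and whitespace steps to an in-memory string."""
    if keep_headers:
        # Drop the leading '#' markers but keep the heading text
        text = re.sub(r'^(#+)\s+(.+)$', r'\2', text, flags=re.MULTILINE)
    else:
        # Drop heading lines entirely
        text = re.sub(r'^#+\s+.*$', '', text, flags=re.MULTILINE)

    if keep_newlines:
        text = re.sub(r'[ \t]+', ' ', text)     # runs of spaces/tabs -> one space
        text = re.sub(r'\n\s*\n+', '\n', text)  # blank lines -> single newline
    else:
        text = re.sub(r'\s+', ' ', text)        # all whitespace -> one space
    return text.strip()


for keep_headers in (True, False):
    for keep_newlines in (True, False):
        result = clean(SAMPLE, keep_headers, keep_newlines)
        print(f"keep_headers={keep_headers}, keep_newlines={keep_newlines}: {result!r}")
```

With `keep_headers=True, keep_newlines=True` each heading stays on its own line, which is the variant that worked best with the summarization pipeline.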

In [3]:
def get_model_max_length(model_name: str) -> tuple[int, str]:
    """
    Get the maximum sequence length for a given model.

    Tries multiple common config attributes in order of preference, falls back to 
    tokenizer.model_max_length, then defaults to 512 if nothing reasonable is found.
    Filters out unreasonably large values (>1M) that are likely placeholder defaults.
    
    Args:
        model_name (str): HuggingFace model identifier
       
    Returns:
        tuple: (max_length, source_attribute)
            - max_length (int): Maximum sequence length
            - source_attribute (str): Which attribute was used to determine the length
    """
    try:
        max_reasonable_length = 1_000_000
        
        config = AutoConfig.from_pretrained(model_name)
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        
        # List of possible attributes that indicate max length
        # Order matters - more specific attributes first
        length_attributes = [
            'max_source_positions',     # BART, Pegasus
            'max_position_embeddings',  # BERT, GPT, many others
            'n_positions',              # GPT-2, GPT-J
            'max_sequence_length',      # Some custom models
            'model_max_length',         # Tokenizer (fallback)
            'max_len',                  # Some older models
        ]
        
        found_length = None
        used_attribute = None
        
        # Check config first; model_max_length is a tokenizer attribute and is handled below
        for attr in length_attributes:
            if attr == 'model_max_length':
                continue
            if hasattr(config, attr):
                value = getattr(config, attr)
                # Filter out unreasonably large values (likely defaults)
                if value and value < max_reasonable_length:
                    found_length = value
                    used_attribute = f"config.{attr}"
                    break
        
        # If no reasonable value found in config, check tokenizer as fallback
        if found_length is None:
            tokenizer_max = tokenizer.model_max_length
            if tokenizer_max < max_reasonable_length:
                found_length = tokenizer_max
                used_attribute = "tokenizer.model_max_length"
            else:
                # Last resort - use a conservative default
                found_length = 512
                used_attribute = "default (no reliable max found)"
                warnings.warn(f"Could not determine max length for {model_name}, using conservative default of 512")
        
        return found_length, used_attribute
        
    except Exception as e:
        warnings.warn(f"Error getting max length for {model_name}: {e}. Using default of 512.")
        return 512, "default (error occurred)"
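
The fallback order in `get_model_max_length` can be exercised without downloading a model. The sketch below is a simplified stand-in rather than the function above: `resolve_max_length`, `CONFIG_ATTRS`, and the `SimpleNamespace` objects are hypothetical, and the very large tokenizer value is an assumed example of the placeholder that the >1M filter is meant to catch.

```python
from types import SimpleNamespace

MAX_REASONABLE = 1_000_000
# Shortened attribute list, for illustration only
CONFIG_ATTRS = ("max_source_positions", "max_position_embeddings", "n_positions")


def resolve_max_length(config, tokenizer_max: int, default: int = 512) -> tuple[int, str]:
    """Return (length, source), preferring config attributes over the tokenizer value."""
    for attr in CONFIG_ATTRS:
        value = getattr(config, attr, None)
        if value and value < MAX_REASONABLE:
            return value, f"config.{attr}"
    # No usable config attribute: fall back to the tokenizer, then to the default
    if tokenizer_max < MAX_REASONABLE:
        return tokenizer_max, "tokenizer.model_max_length"
    return default, "default"


# Hypothetical stand-ins for loaded config objects (nothing is downloaded)
bert_like = SimpleNamespace(max_position_embeddings=512)
bare = SimpleNamespace()

print(resolve_max_length(bert_like, tokenizer_max=int(1e30)))  # config wins
print(resolve_max_length(bare, tokenizer_max=1024))            # tokenizer fallback
print(resolve_max_length(bare, tokenizer_max=int(1e30)))       # conservative default
```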

In [4]:
def mean_pooling(embeddings: np.ndarray) -> np.ndarray:
    """
    Compute the mean pooling of embeddings.
    
    Args:
        embeddings (np.ndarray): Input embeddings array
       
    Returns:
        np.ndarray: Mean-pooled vector
    """
    text_vector = np.mean(embeddings, axis=0)

    return text_vector
    

def fixed_size_chunking(text: str, model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', doc_size: int = None, full_sentences: bool = True) -> list[str]:
    """
    Split text into fixed-size chunks based on token count.

    When full_sentences=True, tries to keep sentences intact but will split oversized
    sentences. When False, splits text purely by token count regardless of boundaries.
    
    Args:
        text (str): Input text to chunk
        model_name (str): HuggingFace model name for tokenizer
        doc_size (int, optional): Maximum tokens per chunk. If None, uses model's max length minus 50
        full_sentences (bool): If True, preserves sentence boundaries when possible
       
    Returns:
        list[str]: List of text chunks, each within the token limit
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if doc_size is None:
        doc_size = get_model_max_length(model_name)[0] - 50  # -50 b/c sometimes I reach token limits without it

    docs = []
    
    if full_sentences:
        sentences = sent_tokenize(text)

        encoded_sentences = [tokenizer.encode(sent) for sent in sentences]
        current_tokens = 0
        current_chunk = []
        for sent, encoded_sent in zip(sentences, encoded_sentences):
            # If adding this sentence would exceed the limit, finalize current chunk and reset everything
            if len(encoded_sent) + current_tokens > doc_size and current_chunk:
                chunk_text = ' '.join(current_chunk)
                docs.append(chunk_text)
                current_chunk = []
                current_tokens = 0

            # If a single sentence exceeds doc_size, split it up
            if len(encoded_sent) > doc_size:
                if current_chunk:
                    # First, add any accumulated sentences
                    chunk_text = ' '.join(current_chunk)
                    docs.append(chunk_text)
                    current_chunk = []
                    current_tokens = 0

                for i in range(0, len(encoded_sent), doc_size):
                    chunk_tokens = encoded_sent[i:i + doc_size]
                    doc_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=True)
                    docs.append(doc_text)
            else:
                # Add sentence to current chunk
                current_chunk.append(sent)
                current_tokens += len(encoded_sent)

        if current_chunk:
            chunk_text = ' '.join(current_chunk)
            docs.append(chunk_text)
            
    else:
        tokens = tokenizer.encode(text)        
        
        for i in range(0, len(tokens), doc_size):
            chunk_tokens = tokens[i:i + doc_size]
            doc_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=True)
            docs.append(doc_text)
    
    return docs
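
The sentence-preserving branch of `fixed_size_chunking` is a greedy packing loop. Here is a minimal sketch of the same idea that counts whitespace-separated words instead of model tokens, so it runs without a tokenizer. `pack_sentences` and the sample sentences are hypothetical; real token counts will differ from word counts.

```python
def pack_sentences(sentences: list[str], budget: int) -> list[str]:
    """Greedily pack sentences into chunks of at most `budget` whitespace tokens."""
    chunks, current, used = [], [], 0
    for sent in sentences:
        words = sent.split()
        # Close the running chunk if this sentence would not fit
        if current and used + len(words) > budget:
            chunks.append(' '.join(current))
            current, used = [], 0
        if len(words) > budget:
            # Oversized sentence: fall back to hard splits of `budget` words
            for i in range(0, len(words), budget):
                chunks.append(' '.join(words[i:i + budget]))
        else:
            current.append(sent)
            used += len(words)
    if current:
        chunks.append(' '.join(current))
    return chunks


sentences = [
    "Qubits hold superpositions.",
    "Gates manipulate them.",
    "This single sentence is deliberately much longer than the budget allows.",
    "Measurement collapses the state.",
]
for chunk in pack_sentences(sentences, budget=6):
    print(repr(chunk))
```

The first two sentences share a chunk, the long sentence is cut into two pieces, and the last sentence starts a new chunk.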

In [5]:
def mmr(doc_embeddings: np.ndarray, query_embedding: np.ndarray, max_tokens: int, chunked_sentences: list[tuple], lambda_param: float = 0.5) -> list:
    """
    Maximal Marginal Relevance with token budget constraint.
    
    Args:
        doc_embeddings (np.ndarray): Document embeddings matrix (n_docs, embedding_dim)
        query_embedding (np.ndarray): Query embedding vector (embedding_dim,)
        max_tokens (int): Maximum token budget for selected sentences
        chunked_sentences (list[tuple]): List of (text, tokens) tuples for each sentence
        lambda_param (float, optional): Relevance vs diversity trade-off (1.0=pure relevance, 0.0=pure diversity)
    
    Returns:
        list[tuple]: Selected sentences as (text, tokens) tuples
    """
    query_similarities = cosine_similarity(doc_embeddings, [query_embedding]).flatten()

    selected = []
    remaining = list(range(len(doc_embeddings)))
    token_count = 0
    
    while remaining and token_count < max_tokens:
        if not selected:
            idx = np.argmax(query_similarities)
        else:
            mmr_scores = []
            for i in remaining:
                max_selected_similarity = np.max(cosine_similarity([doc_embeddings[i]], doc_embeddings[selected]))
                score = lambda_param * query_similarities[i] - (1 - lambda_param) * max_selected_similarity
                mmr_scores.append(score)

            best_idx = np.argmax(mmr_scores)            
            idx = remaining[best_idx]  

        # Check if adding this sentence would exceed token limit
        sentence_tokens = len(chunked_sentences[idx][1])
        if token_count + sentence_tokens > max_tokens:
            break
            
        selected.append(idx)
        remaining.remove(idx)
        token_count += sentence_tokens

    return [chunked_sentences[i] for i in selected]
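
To see how `lambda_param` changes the selection, here is a small pure-Python sketch of MMR on three 2-D vectors. It is a simplification of the function above: it stops after `k` items instead of using a token budget, and `cosine`, `mmr_order`, and the toy vectors are hypothetical names and data.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_order(docs: list[list[float]], query: list[float], k: int, lam: float) -> list[int]:
    """Return the indices of k documents chosen by Maximal Marginal Relevance."""
    relevance = [cosine(d, query) for d in docs]
    selected: list[int] = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy is the similarity to the closest already-selected item
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates
query = [1.0, 0.2]

print(mmr_order(docs, query, k=2, lam=1.0))  # → [1, 0]: pure relevance keeps both near-duplicates
print(mmr_order(docs, query, k=2, lam=0.3))  # → [1, 2]: diversity swaps in the dissimilar doc
```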

In [6]:
def tfidf_consolidation(text: str, model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', output: str = 'text', max_tokens: int = None, mmr_lambda: float = 0.5) -> list:
    """
    Consolidate text by selecting most important sentences using TF-IDF scoring.

    Builds a TF-IDF vector per sentence, uses the mean of those vectors as the document
    centroid, and selects sentences with MMR (relevance to the centroid vs. redundancy) up to the token limit.
    
    Args:
        text (str): Input text to consolidate
        model_name (str): HuggingFace model name for tokenization
        output (str): Output format - 'tokens' or 'text'
        max_tokens (int, optional): Maximum tokens in output. If None, uses model's max length
        mmr_lambda (float, optional): MMR lambda parameter for relevance vs diversity trade-off
       
    Returns:
        list: Selected sentences as tokens (list of lists) or text (list of strings)
    """
    # Initialize tokenizer from model name
    tokenizer = AutoTokenizer.from_pretrained(model_name)
        
    # Auto-detect model's token limit
    if max_tokens is None:
        max_tokens = get_model_max_length(model_name)[0]

    # Split text into sentences (NB: does not split along newlines)
    sentences = sent_tokenize(text)
    if not sentences:
        raise ValueError("No sentences found in text.")
    encoded_sentences = [(sentence, tokenizer.encode(sentence)) for sentence in sentences]

    # Check if consolidation is needed
    text_tokens = sum(len(item[1]) for item in encoded_sentences)
    if text_tokens <= max_tokens: # No consolidation needed
        if output == 'tokens':
            return [item[1] for item in encoded_sentences]
        elif output == 'text':
            return [item[0] for item in encoded_sentences]
        else:
            warnings.warn('Invalid output. Defaulting to text.')
            return [item[0] for item in encoded_sentences]
    
    chunked_sentences = []
    for sentence, encoded in encoded_sentences:
        if len(encoded) <= max_tokens:            
            chunked_sentences.append((sentence, encoded))
        else:
            # Break apart sentences that are longer than max_tokens
            # TODO this approach is inefficient, but works for now
            for i in range(0, len(encoded), max_tokens):
                segment_text = tokenizer.decode(encoded[i:i + max_tokens], skip_special_tokens=True, clean_up_tokenization_spaces=True)
                segment_tokens = encoded[i:i + max_tokens]
                chunked_sentences.append((segment_text, segment_tokens))
    
    # Calculate sentence importance using TF-IDF
    vectorizer = TfidfVectorizer(stop_words='english')
    
    # Create document-term matrix (each sentence is a document)
    try:
        sentence_vectors = vectorizer.fit_transform([item[0] for item in chunked_sentences])
    except ValueError:
        # Handle case with too few sentences or no features, which can occur when document is empty or consists of only stop words
        # Fallback: take first sentences up to token limit
        selected_sentences = []
        total_tokens = 0
        
        for sentence, encoded in chunked_sentences:  # Use chunked_sentences for consistency
            sentence_tokens = len(encoded)
            if total_tokens + sentence_tokens <= max_tokens:
                selected_sentences.append((sentence, encoded))
                total_tokens += sentence_tokens
            else:
                break
        
        if output == 'tokens':
            return [item[1] for item in selected_sentences]
        elif output == 'text':
            return [item[0] for item in selected_sentences]
        else:
            warnings.warn('Invalid output. Defaulting to text.')
            return [item[0] for item in selected_sentences]
    
    doc_centroid = np.mean(sentence_vectors.toarray(), axis=0)    

    mmr_selection = mmr(sentence_vectors.toarray(), doc_centroid, max_tokens=max_tokens, chunked_sentences=chunked_sentences, lambda_param=mmr_lambda)
    
    if output == 'tokens':
        return [selected[1] for selected in mmr_selection]
    elif output == 'text':
        return [selected[0] for selected in mmr_selection]
    else:
        warnings.warn('Invalid output. Defaulting to text.')
        return [selected[0] for selected in mmr_selection]

In [7]:
def transformers_consolidation(text: str,
                               pooling: str = 'mean',
                               model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',
                               output: str = 'text',
                               chunking_method: str = 'sentence',
                               max_tokens: int = None,
                               mmr_lambda: float = 0.5) -> list:
    """
    Consolidate text by selecting sentences most similar to document centroid using embeddings.

    Embeds all sentences for selection, then computes document centroid using specified 
    chunking method and pooling strategy. Selects sentences with MMR, trading off cosine 
    similarity to the centroid against redundancy, and returns them up to the token limit.
    
    Args:
        text (str): Input text to consolidate
        pooling (str): Pooling method for document centroid - 'mean', 'len_weighted', or 'semantic_centrality'
        model_name (str): SentenceTransformer model name for embeddings
        output (str): Output format - 'tokens' or 'text'
        chunking_method (str): Method for computing document centroid:
            - 'sentence': Use sentence embeddings for centroid calculation
            - 'chunks': Use fixed-size chunk embeddings for centroid calculation
        max_tokens (int, optional): Maximum tokens in output. If None, uses model's max length
        mmr_lambda (float, optional): MMR lambda parameter for relevance vs diversity trade-off
       
    Returns:
        list: Selected sentences as tokens (list of lists) or text (list of strings),
              in MMR selection order (or original order if no consolidation was needed)
    """
    # Initialize embedder
    embedder = SentenceTransformer(model_name)
    tokenizer = embedder.tokenizer
        
    # Auto-detect model's token limit
    if max_tokens is None:
        max_tokens = get_model_max_length(model_name)[0]
    
    # Split text into sentences (NB: does not split along newlines)
    sentences = sent_tokenize(text)
    if not sentences:
        raise ValueError("No sentences found in text.")
    encoded_sentences = [(sentence, tokenizer.encode(sentence)) for sentence in sentences]

    # Check if consolidation is needed
    text_tokens = sum(len(item[1]) for item in encoded_sentences)
    if text_tokens <= max_tokens: # No consolidation needed
        if output == 'tokens':
            return [item[1] for item in encoded_sentences]
        elif output == 'text':
            return [item[0] for item in encoded_sentences]
        else:
            warnings.warn('Invalid output. Defaulting to text.')
            return [item[0] for item in encoded_sentences]

    # Break apart sentences that are longer than max_tokens
    chunked_sentences = []
    for sentence, encoded in encoded_sentences:
        if len(encoded) <= max_tokens:            
            chunked_sentences.append((sentence, encoded))
        else:
            # TODO this approach is inefficient, but works for now
            for i in range(0, len(encoded), max_tokens):
                segment_text = tokenizer.decode(encoded[i:i + max_tokens], skip_special_tokens=True, clean_up_tokenization_spaces=True)
                segment_tokens = encoded[i:i + max_tokens]
                chunked_sentences.append((segment_text, segment_tokens))

    # Always embed sentences for selection
    sentence_embeddings = embedder.encode([item[0] for item in chunked_sentences], normalize_embeddings=False)
    
    # Embed and pool using pooling method of choice
    if chunking_method == 'sentence':        
        centroid_embeddings = sentence_embeddings  # Use sentence embeddings for centroid
    elif chunking_method == 'chunks':
        chunks = fixed_size_chunking(text, model_name=model_name)
        centroid_embeddings = embedder.encode(chunks, normalize_embeddings=False)  # Use chunk embeddings for centroid
    else:
        warnings.warn('Invalid chunking method. Defaulting to sentence chunking.')
        centroid_embeddings = sentence_embeddings

    if pooling == 'mean':
        doc_centroid = mean_pooling(centroid_embeddings)
    elif pooling == 'len_weighted':
        doc_centroid = len_weighted_mean_pooling(centroid_embeddings)
    elif pooling == 'semantic_centrality':
        doc_centroid = semantic_centrality_weighting(centroid_embeddings)
    else:
        warnings.warn('Invalid pooling method. Defaulting to mean pooling.')
        doc_centroid = mean_pooling(centroid_embeddings)
    
    mmr_selection = mmr(sentence_embeddings, doc_centroid, max_tokens=max_tokens, chunked_sentences=chunked_sentences, lambda_param=mmr_lambda)
    
    if output == 'tokens':
        return [selected[1] for selected in mmr_selection]
    elif output == 'text':
        return [selected[0] for selected in mmr_selection]
    else:
        warnings.warn('Invalid output. Defaulting to text.')
        return [selected[0] for selected in mmr_selection]

In [16]:
def summarization_pipeline_consolidation(text: str, model: str = 'facebook/bart-large-cnn', chunking_method: str = 'fixed_size', summary_size: int = None) -> str:
    """
    Consolidate text using iterative summarization until target size is reached.

    Iteratively chunks and summarizes text until the result fits within the target size.
    Each iteration reduces text length by summarizing chunks, then re-chunking if needed.
    
    Args:
        text (str): Input text to consolidate
        model (str): HuggingFace summarization model name
        chunking_method (str): Method for chunking - 'fixed_size', 'paragraph', or 'semantic'
        summary_size (int, optional): Target summary size in tokens. If None, uses model's max length
       
    Returns:
        str: Consolidated summary text
    """
    summarizer = pipeline('summarization', model=model)
    tokenizer = AutoTokenizer.from_pretrained(model)
    model_token_limit = get_model_max_length(model)[0] 

    if summary_size is None:
        summary_size = model_token_limit

    # Check if consolidation is needed
    if len(tokenizer.encode(text, truncation=False)) <= summary_size: # No consolidation needed
        return text

    chunk_ratio = 0.5
    doc_size = int(model_token_limit * chunk_ratio)
    
    if chunking_method == 'fixed_size':
        chunker = lambda t: fixed_size_chunking(text=t, model_name=model, doc_size=doc_size)
    elif chunking_method == 'paragraph':
        chunker = lambda t: paragraph_chunking(text=t)
    elif chunking_method == 'semantic':
        chunker = lambda t: semantic_chunking(text=t)
    else:
        warnings.warn('Invalid chunking method. Defaulting to fixed_size.')
        chunker = lambda t: fixed_size_chunking(text=t, model_name=model, doc_size=doc_size)
    
    docs = chunker(text)

    # directly summarize text without iterating, if possible
    if (len(docs) == 1) and (summary_size <= doc_size):
        doc_tokens = len(tokenizer.encode(docs[0]))
        min_length = min(math.ceil(doc_tokens * 0.3), math.ceil(summary_size * 0.6))
        
        return summarizer(
                docs[0],
                max_new_tokens=summary_size,
                max_length=summary_size,
                min_length=min_length,
                do_sample=False,
                num_beams=4,
                early_stopping=True
            )[0]['summary_text']

    while True:
        summaries = ''
        for doc in docs:
            doc_tokens = len(tokenizer.encode(doc))

            max_length = math.ceil(doc_tokens * 0.9)                                    
            min_length = math.ceil(doc_tokens * 0.3)

            # For debugging
            # print(f"About to summarize chunk with {doc_tokens} tokens. Limit: {summary_size}")
            # if doc_tokens > summary_size:
            #     print(f"WARNING: Chunk exceeds model limit!")
            
            summary = summarizer(
                doc,
                max_new_tokens=max_length,
                max_length=max_length,
                min_length=min_length,
                do_sample=False,
                num_beams=4,
                early_stopping=True
            )[0]['summary_text']

            # for debugging
            # print(f"Original chunk:\n{doc}")
            # print(f"---\nGenerated summary:\n'{summary}'")
            # print()
    
            summaries += summary + ' '

        summaries_tokenized = tokenizer.encode(summaries)

        # for debugging
        # print(len(summaries_tokenized))
        # TODO: add a failsafe against infinite loops, e.g. break if len(summaries_tokenized) stops shrinking between iterations
        if len(summaries_tokenized) <= summary_size:
            break

        docs = chunker(summaries)
    
    return summaries
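
One possible shape for the failsafe mentioned in the TODO is sketched below: stop when a round fails to shrink the text, and cap the number of rounds. The names `reduce_until_fits`, `shrink`, `stuck`, and `count_words` are hypothetical stand-ins, and the "summarizers" here just truncate or echo their input, so this only illustrates the loop control, not real summarization.

```python
from typing import Callable


def reduce_until_fits(text: str,
                      summarize: Callable[[str], str],
                      count_tokens: Callable[[str], int],
                      target: int,
                      max_rounds: int = 10) -> str:
    """Repeatedly summarize until the text fits, stopping early if no progress is made."""
    current = text
    for _ in range(max_rounds):
        size = count_tokens(current)
        if size <= target:
            break
        candidate = summarize(current)
        if count_tokens(candidate) >= size:
            # No shrinkage this round: bail out instead of looping forever
            break
        current = candidate
    return current


def count_words(t: str) -> int:
    """Stand-in token counter: whitespace-separated words."""
    return len(t.split())


def shrink(t: str) -> str:
    """Stand-in summarizer that keeps the first 60% of the words."""
    words = t.split()
    return ' '.join(words[:max(1, len(words) * 3 // 5)])


def stuck(t: str) -> str:
    """Stand-in summarizer that never shortens its input."""
    return t


long_text = ' '.join(f"w{i}" for i in range(100))
print(count_words(reduce_until_fits(long_text, shrink, count_words, target=20)))  # → 12
print(count_words(reduce_until_fits(long_text, stuck, count_words, target=20)))   # → 100
```

Returning the best effort so far, rather than raising, keeps the behavior close to the current function, but a warning on early exit may be worth adding.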

# Testing

In [9]:
def test_consolidation(content, consolidation_func, summarization_pipeline=False, model='facebook/bart-large-cnn', test_chunking_methods=False, summary_size=None):
    if summarization_pipeline:
        for c in content:
            consolidate = consolidation_func(c['content'], model=model, summary_size=summary_size)
        
            print(f'keep_header={c["keep_headers"]}, keep_newlines={c["keep_newlines"]}')  # double quotes inside the f-string work before Python 3.12 too
            print('---')
            print(consolidate)
            print()
    elif test_chunking_methods:
        # Test different chunking methods for transformers consolidation
        chunking_methods = ['sentence', 'chunks']
        
        for c in content:
            print(f'keep_header={c["keep_headers"]}, keep_newlines={c["keep_newlines"]}')
            print('=' * 60)
            
            for method in chunking_methods:
                print(f'\nChunking Method: {method}')
                print('-' * 30)
                
                consolidate = consolidation_func(c['content'], output='text', chunking_method=method)
                print(' '.join(consolidate))
                print()
            
            print('=' * 60)
    else:
        for c in content:
            consolidate = consolidation_func(c['content'], output='text')
        
            print(f'keep_header={c["keep_headers"]}, keep_newlines={c["keep_newlines"]}')  # double quotes inside the f-string work before Python 3.12 too
            print('---')
            print(' '.join(consolidate))
            print()

In [10]:
file = pathlib.Path(r"quantum_computing_essay.md")

t_or_f = [(x, y) for x in [True, False] for y in [True, False]]
content = []

for b in t_or_f:
    temp = {'keep_headers': b[0],
           'keep_newlines': b[1],
           'content': extract_content_from_md(file, keep_headers=b[0], keep_newlines=b[1])}
    content.append(temp)

## TF-IDF Consolidation

In [18]:
lambda_values = [i / 10 for i in range(11)]

headers = content[0]
no_headers = content[2]

for value in lambda_values:    
    text1 = tfidf_consolidation(headers['content'], mmr_lambda=value)
    text2 = tfidf_consolidation(no_headers['content'], mmr_lambda=value)
        
    print(f'MMR lambda={value}')
    print('=' * 60)

    print(f'keep_header={headers["keep_headers"]}, keep_newlines={headers["keep_newlines"]}')
    print('-' * 30)
    print(' '.join(text1))
    print()

    print(f'keep_header={no_headers["keep_headers"]}, keep_newlines={no_headers["keep_newlines"]}')
    print('-' * 30)
    print(' '.join(text2))
    print('\n\n')

MMR lambda=0.0
keep_header=True, keep_newlines=True
------------------------------
Quantum Algorithms: Where Quantum Computers Shine
The true power of quantum computing emerges through specialized algorithms that exploit quantum mechanical properties to solve problems more efficiently than classical methods. However, when we zoom into the microscopic realm of atoms and subatomic particles, these certainties begin to dissolve into a realm of probabilities and possibilities that challenges our intuitive understanding of reality. The second crucial principle is entanglement, which Einstein famously called "spooky action at a distance." Measuring one particle instantly affects the state of its entangled partner, regardless of the space between them. Every operation, every calculation, and every piece of data ultimately reduces to manipulations of these binary values. When we measure the qubit, it "collapses" into either 0 or 1, but before measurement, it genuinely exists in both states at 

## Transformers Consolidation

In [19]:
test_consolidation(content, transformers_consolidation, test_chunking_methods=True)

keep_header=True, keep_newlines=True

Chunking Method: sentence
------------------------------
Conclusion: Standing at the Threshold of a New Era
Quantum computing represents more than just a new type of computer—it embodies a fundamental shift in how we process information and solve problems. The third principle, quantum interference, allows quantum states to amplify or cancel each other out, much like waves on water. The art of quantum algorithm design lies in cleverly arranging these gates to manipulate quantum superposition and interference in ways that solve specific problems more efficiently than classical approaches. This requires encoding each logical qubit across many physical qubits and performing complex error correction protocols in real-time. Quantum systems are extraordinarily sensitive to their environment—even the slightest interaction with external factors can destroy the superposition and entanglement that quantum computers depend on. It could also provide unbreakable

## Hugging Face `pipeline('summarization')` Consolidation

### Individual Model Tests

In [34]:
model = 'facebook/bart-large-xsum'

summarizer = pipeline('summarization', model=model)
tokenizer = AutoTokenizer.from_pretrained(model)

summary_size = get_model_max_length(model)[0]
docs = fixed_size_chunking(content[0]['content'], model_name=model, doc_size=summary_size // 2)
doc_tokens = len(tokenizer.encode(docs[0]))

print(f"Original chunk:\n\n{docs[0]}")
print("=" * 60, end="\n\n")

max_length = math.ceil(doc_tokens * 0.9)
min_length = math.ceil(doc_tokens * 0.3)

summary = summarizer(
    docs[0],
    max_length=max_length,
    min_length=min_length,
    num_beams=4,
    early_stopping=True,
    do_sample=False
)[0]['summary_text']

print(f"Generated summary:\n\n'{summary}'")

Original chunk:

The Quantum Revolution: Understanding Quantum Computing and Its Promise
Introduction: A New Paradigm in Computing
Imagine a computer that could solve certain problems exponentially faster than any classical computer ever built. Picture a machine that could break the encryption protecting your bank account in minutes, while simultaneously enabling the creation of unbreakable quantum encryption. This isn't science fiction—it's the emerging reality of quantum computing, a revolutionary technology that harnesses the bizarre and counterintuitive principles of quantum mechanics to process information in fundamentally new ways. To truly appreciate the magnitude of this technological leap, consider this analogy: if classical computers are like skilled accountants working through problems step by step with perfect precision, quantum computers are like having access to parallel universes where every possible solution can be explored simultaneously. While this comparison might so

In [36]:
model = 'pszemraj/led-large-book-summary'

summarizer = pipeline('summarization', model=model)
tokenizer = AutoTokenizer.from_pretrained(model)

summary_size = get_model_max_length(model)[0]
docs = fixed_size_chunking(content[0]['content'], model_name=model, doc_size=summary_size // 2)
doc_tokens = len(tokenizer.encode(docs[0]))

# print(f"Original chunk:\n{docs[0]}")
# print("=" * 60, end="\n\n")

max_length = math.ceil(doc_tokens * 0.9)
min_length = math.ceil(doc_tokens * 0.1)

summary = summarizer(
    f"Summarize:{docs[0]}",
    max_length=max_length,
    min_length=min_length,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    do_sample=False,
    early_stopping=True
)[0]['summary_text']

print(f"Generated summary:\n'{summary}'")

Generated summary:
'In this chapter, the author explains how quantum computing is revolutionizing the way we think about and process information. Imagine a time-lapse video of a superfast, supersecure, super smart computer that can unlock your bank vault in minutes while simultaneously making it impossible to hack into it. That's the reality of advanced quantum computing . The underlying science behind this technology, known as quantum mechanics , is nothing to sneeze at. It's actually pretty cool. Here are some more details: "super" refers to the scientific concept of simultaneity, which means that you can have multiple objects in the same space at the same time. In other words, if you have two separate parts of the same thing, you can use them to interact with each other. You can't just copy and paste something into one part of another piece of the world; instead, you have to put two different pieces of the Whole together. This allows you to do things like send Morse code over wirele

In [37]:
model = 'google/flan-t5-base'

summarizer = pipeline('summarization', model=model)
tokenizer = AutoTokenizer.from_pretrained(model)

summary_size = get_model_max_length(model)[0]
docs = fixed_size_chunking(content[0]['content'], model_name=model, doc_size=summary_size // 2)
doc_tokens = len(tokenizer.encode(docs[0]))

print(f"Original chunk:\n{docs[0]}")
print("=" * 60, end="\n\n")

max_length = math.ceil(doc_tokens * 0.9)
min_length = math.ceil(doc_tokens * 0.3)

summary = summarizer(
    f"Summarize:{docs[0]}",
    max_new_tokens=max_length,
    min_length=min_length,
    do_sample=False,
    num_beams=4,
    early_stopping=True
)[0]['summary_text']

print(f"Generated summary:\n\n'{summary}'")

Original chunk:
The Quantum Revolution: Understanding Quantum Computing and Its Promise
Introduction: A New Paradigm in Computing
Imagine a computer that could solve certain problems exponentially faster than any classical computer ever built. Picture a machine that could break the encryption protecting your bank account in minutes, while simultaneously enabling the creation of unbreakable quantum encryption. This isn't science fiction—it's the emerging reality of quantum computing, a revolutionary technology that harnesses the bizarre and counterintuitive principles of quantum mechanics to process information in fundamentally new ways. To truly appreciate the magnitude of this technological leap, consider this analogy: if classical computers are like skilled accountants working through problems step by step with perfect precision, quantum computers are like having access to parallel universes where every possible solution can be explored simultaneously. While this comparison might sou

In [33]:
model = 'sshleifer/distilbart-cnn-12-6'

summarizer = pipeline('summarization', model=model)
tokenizer = AutoTokenizer.from_pretrained(model)

summary_size = get_model_max_length(model)[0]
docs = fixed_size_chunking(content[0]['content'], model_name=model, doc_size=summary_size // 2)
doc_tokens = len(tokenizer.encode(docs[0]))

print(f"Original chunk:\n{docs[0]}")
print("=" * 60, end="\n\n")

max_length = math.ceil(doc_tokens * 0.9)
min_length = math.ceil(doc_tokens * 0.3)

summary = summarizer(
    f"Summarize:{docs[0]}",
    # max_length=max_length,
    max_new_tokens=max_length,
    min_length=min_length,
    do_sample=False,
    num_beams=4,
    early_stopping=True
)[0]['summary_text']

print(f"Generated summary:\n\n'{summary}'")

Original chunk:
The Quantum Revolution: Understanding Quantum Computing and Its Promise
Introduction: A New Paradigm in Computing
Imagine a computer that could solve certain problems exponentially faster than any classical computer ever built. Picture a machine that could break the encryption protecting your bank account in minutes, while simultaneously enabling the creation of unbreakable quantum encryption. This isn't science fiction—it's the emerging reality of quantum computing, a revolutionary technology that harnesses the bizarre and counterintuitive principles of quantum mechanics to process information in fundamentally new ways. To truly appreciate the magnitude of this technological leap, consider this analogy: if classical computers are like skilled accountants working through problems step by step with perfect precision, quantum computers are like having access to parallel universes where every possible solution can be explored simultaneously. While this comparison might sou

In [38]:
model = 't5-small'

summarizer = pipeline('summarization', model=model)
tokenizer = AutoTokenizer.from_pretrained(model)

summary_size = get_model_max_length(model)[0]
docs = fixed_size_chunking(content[0]['content'], model_name=model, doc_size=summary_size // 2)
doc_tokens = len(tokenizer.encode(docs[0]))

print(f"Original chunk:\n{docs[0]}")
print("=" * 60, end="\n\n")

max_length = math.ceil(doc_tokens * 0.9)
min_length = math.ceil(doc_tokens * 0.3)

summary = summarizer(
    f"Summarize:{docs[0]}",
    # max_length=max_length,
    max_new_tokens=max_length,
    min_length=min_length,
    do_sample=False,
    num_beams=4,
    early_stopping=True
)[0]['summary_text']

print(f"Generated summary:\n\n'{summary}'")

Original chunk:
The Quantum Revolution: Understanding Quantum Computing and Its Promise
Introduction: A New Paradigm in Computing
Imagine a computer that could solve certain problems exponentially faster than any classical computer ever built. Picture a machine that could break the encryption protecting your bank account in minutes, while simultaneously enabling the creation of unbreakable quantum encryption. This isn't science fiction—it's the emerging reality of quantum computing, a revolutionary technology that harnesses the bizarre and counterintuitive principles of quantum mechanics to process information in fundamentally new ways. To truly appreciate the magnitude of this technological leap, consider this analogy: if classical computers are like skilled accountants working through problems step by step with perfect precision, quantum computers are like having access to parallel universes where every possible solution can be explored simultaneously. While this comparison might sou

In [40]:
model = 'sshleifer/distilbart-cnn-12-6'

summarizer = pipeline('summarization', model=model)
tokenizer = AutoTokenizer.from_pretrained(model)

summary_size = get_model_max_length(model)[0]
docs = fixed_size_chunking(content[0]['content'], model_name=model, doc_size=summary_size // 2)

while True:
    summaries = ''
    for doc in docs:
        doc_tokens = len(tokenizer.encode(doc))

        max_length = int(doc_tokens * 0.9)
        min_length = int(doc_tokens * 0.3)

        # For debugging
        # print(f"About to summarize chunk with {doc_tokens} tokens. Limit: {summary_size}")
        # if doc_tokens > summary_size:
        #     print(f"WARNING: Chunk exceeds model limit!")
        
        summary = summarizer(
            f"Summarize:{doc}",
            max_new_tokens=max_length,
            # max_length=max_length,
            min_length=min_length,
            do_sample=False,
            num_beams=4,
            early_stopping=True
        )[0]['summary_text']

        # for debugging
        print(f"Original chunk:\n{doc}")
        print(f"---\nGenerated summary:\n'{summary}'")
        print()

        summaries += summary + ' '

    summaries_tokenized = tokenizer.encode(summaries)

    # TODO: add a failsafe against infinite loops, e.g. break if len(summaries_tokenized) stops shrinking between passes
    if len(summaries_tokenized) <= summary_size:
        break

    docs = fixed_size_chunking(summaries, model_name=model, doc_size=summary_size // 2)

print(summaries)

Original chunk:
The Quantum Revolution: Understanding Quantum Computing and Its Promise
Introduction: A New Paradigm in Computing
Imagine a computer that could solve certain problems exponentially faster than any classical computer ever built. Picture a machine that could break the encryption protecting your bank account in minutes, while simultaneously enabling the creation of unbreakable quantum encryption. This isn't science fiction—it's the emerging reality of quantum computing, a revolutionary technology that harnesses the bizarre and counterintuitive principles of quantum mechanics to process information in fundamentally new ways. To truly appreciate the magnitude of this technological leap, consider this analogy: if classical computers are like skilled accountants working through problems step by step with perfect precision, quantum computers are like having access to parallel universes where every possible solution can be explored simultaneously. While this comparison might sou
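
The loop in the cell above carries a TODO about guarding against infinite loops. A minimal sketch of one way to add that guard, assuming the summarizer, chunker, and token counter are passed in as plain callables (the function name `summarize_until_fits` and its parameters are hypothetical, not part of this notebook): it stops when the text fits, when a pass fails to shrink the text, or after a fixed number of rounds. Unlike the cell above, it checks the size before the first pass, so text that already fits is returned unchanged.

```python
from typing import Callable, List


def summarize_until_fits(
    text: str,
    summarize: Callable[[str], str],
    chunk: Callable[[str], List[str]],
    count_tokens: Callable[[str], int],
    token_limit: int,
    max_rounds: int = 10,
) -> str:
    """Repeatedly chunk and summarize `text` until it fits within `token_limit`.

    Stops early if a round fails to shrink the text or `max_rounds` is reached.
    """
    current = text
    for _ in range(max_rounds):
        if count_tokens(current) <= token_limit:
            break
        # Summarize every chunk and join the results, as in the cell above.
        condensed = " ".join(summarize(doc) for doc in chunk(current))
        if count_tokens(condensed) >= count_tokens(current):
            # Failsafe: no progress this round, so further rounds will not help.
            break
        current = condensed
    return current
```

In this notebook, `summarize` could wrap the `summarizer(...)` call, `chunk` could wrap `fixed_size_chunking`, and `count_tokens` could be something like `lambda s: len(tokenizer.encode(s))`.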

In [44]:
test_consolidation(content, summarization_pipeline_consolidation, summarization_pipeline=True, model='sshleifer/distilbart-cnn-12-6')

keep_header=True, keep_newlines=True
---
 Quantum computing harnesses the bizarre and counterintuitive principles of quantum mechanics to process information in fundamentally new ways . Quantum computers are like having access to parallel universes where every possible solution can be explored simultaneously . The foundation of quantum physics rests on several key principles that seem to defy common sense, including superposition and entanglement, Einstein famously called "spooky action at a distance" Current quantum computers must operate at temperatures close to absolute zero (colder than outer space) and be isolated from external interference as much as possible . Even under these extreme conditions, quantum states typically persist for only microseconds to milliseconds before decoherence sets in . The art of quantum algorithm design lies in cleverly arranging these gates .  Researchers have developed quantum algorithms for machine learning, optimization, and scientific simulation .

In [17]:
test_consolidation(content, summarization_pipeline_consolidation, summarization_pipeline=True, model='google/pegasus-x-large', summary_size=750)

keep_header=True, keep_newlines=True
---
To truly appreciate the magnitude of this technological leap, consider this analogy: if classical computers are like skilled accountants working through problems step by step with perfect precision, quantum computers are like having access to parallel universes where every possible solution can be explored simultaneously. A sufficiently large quantum computer running Shor's algorithm could break much of the encryption that currently protects digital communications, leading to what cryptographers call the "quantum apocalypse." However, Shor's algorithm also illustrates an important limitation: it requires a fault-tolerant quantum computer with thousands of logical qubits, which is far beyond current capabilities. Quantum machine learning algorithms might be able to find patterns in data more efficiently than classical approaches, while quantum optimization algorithms could solve complex scheduling and routing problems. Quantum systems might be ab

In [18]:
test_consolidation(content, summarization_pipeline_consolidation, summarization_pipeline=True, model='pszemraj/led-large-book-summary', summary_size=750)

keep_header=True, keep_newlines=True
---
In this chapter, the author explains how quantum computing is revolutionizing the way we think about computing. Imagine a future in which your bank account could double as secure as a quantum computer while simultaneously enabling you to create unbreakable quantum encryption. This isn't science fiction--it's the emerging reality of quantum computing, a revolutionary technology that harnesses "the bizarre and counterintuitive principles of quantum mechanics to process information in fundamentally new ways." To truly appreciate the magnitude of this technological leap, consider this analogy: if classical computers were like skilled accountants working through problems step by step with perfect precision, quantum computers are like having access to parallel universes where every possible solution can be explored simultaneously. The Quantum Foundation: Understanding the Microscopic World The principle of superposition, which states that quantum part

### Testing Multiple Models

In [19]:
models = [
    'facebook/bart-large-cnn',
    'facebook/bart-large-xsum',
    'facebook/bart-large',    
    'google/flan-t5-base', # this one can work with instructions
    't5-small'
]

for model in models:
    # Double-quoted f-strings keep the nested '*' literal valid on Python versions before 3.12
    print(f"***********{'*' * len(model)}****")
    print(f"*** MODEL: {model} ***")
    print(f"***********{'*' * len(model)}****")
    test_consolidation(content, summarization_pipeline_consolidation, summarization_pipeline=True, model=model)
    print('\n\n')

**************************************
*** MODEL: facebook/bart-large-cnn ***
**************************************
keep_header=True, keep_newlines=True
---
Quantum computing harnesses the bizarre and counterintuitive principles of quantum mechanics to process information in fundamentally new ways. Unlike classical computers, qubits can exist in a superposition of both 0 and 1 states simultaneously. This property enables quantum algorithms to enhance the probability of finding correct answers while diminishing the likelihood of incorrect ones. But quantum computers aren't simply faster versions of classical computers. They're fundamentally different machines that excel at specific types of calculations while potentially performing worse than classical computers on other types of problems. The Quantum Revolution: Understanding Quantum Computing and Its Promise is published by Oxford University Press at £16.99 (U.S. £12.99) and includes a free guide to quantum computing. Quantum compute

# Conclusion

## TF-IDF and Transformers Consolidation

The text resulting from these approaches lacks discourse coherence, so I've taken to thinking of them as text consolidation techniques, not summarization techniques, though maybe the distinction is semantic. I'm primarily interested in using these methods to condense text for the purposes of tagging/categorizing long text files when using model-based approaches requiring shorter documents. Unsurprisingly, just reading through the results doesn't make it obvious to me which set of parameters is best for text categorization, though I have some guesses. I'll have to perform further testing with various categorization processes, which I'll do in another notebook.

### Transformers

The chunk-level consolidations are broader and more high-level, while the sentence-level consolidations are more granular. Removing newlines doesn't change the output at all, so really I'm only comparing two samples here.

## Summarization Pipeline

My first attempts with the summarization pipeline produced text filled with hallucinated author names, websites, book titles, and many other entities with every model I tried. What finally fixed the output was halving the number of tokens fed to the pipeline. Initially, I was using the model's token limit to determine chunk size, but giving the model less to work with actually improves output. My decision to use the model's token limit to determine chunk size was partially informed by a poor understanding of what the `max_length` and `max_new_tokens` parameters actually do. 
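
A minimal sketch of the sizing logic described above, mirroring the ratios used in the cells earlier in this notebook (the helper names `chunk_budget` and `generation_bounds` are hypothetical). As far as I understand the generation settings, `max_length` and `max_new_tokens` both cap the length of the generated summary rather than the input, so the input has to be bounded separately by the chunk size.

```python
import math
from typing import Tuple


def chunk_budget(model_max_tokens: int, fraction: float = 0.5) -> int:
    """Input chunk size in tokens: a fraction of the model's token limit.

    Half the limit is what worked in this notebook; other fractions are untested.
    """
    return max(1, int(model_max_tokens * fraction))


def generation_bounds(
    doc_tokens: int, max_ratio: float = 0.9, min_ratio: float = 0.3
) -> Tuple[int, int]:
    """Return (min_length, max_new_tokens) for a chunk of `doc_tokens` tokens."""
    if doc_tokens <= 0:
        raise ValueError("doc_tokens must be positive")
    max_new_tokens = math.ceil(doc_tokens * max_ratio)
    # Keep the lower bound from exceeding the upper bound on very short chunks.
    min_length = min(math.ceil(doc_tokens * min_ratio), max_new_tokens)
    return min_length, max_new_tokens
```

For a model with a 1024-token limit, `chunk_budget(1024)` gives a 512-token chunk, and `generation_bounds(512)` gives the output bounds to pass to the pipeline for that chunk.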

After making this change, I found that I liked the output of sshleifer/distilbart-cnn-12-6 (which is the pipeline's [default PyTorch model](https://github.com/huggingface/transformers/blob/71688a8889c4df7dd6d90a65d895ccf4e33a1a56/src/transformers/pipelines.py#L2763)) and t5-small the most. I'll have to test further on other texts to decide between the two. Of course, I could probably continually tune parameters to make each model produce satisfactory output, but I'm much more interested in finding a model which requires the least amount of work on my part.

What follows are some quick notes for the output of each model.

### sshleifer/distilbart-cnn-12-6

**keep_header=True, keep_newlines=True** \
Overall a decent narrative flow. However, it has the line, "The art of quantum algorithm design lies in cleverly arranging these gates," while never mentioning anything before or after about "these gates."

**keep_header=True, keep_newlines=False** \
The narrative flow is more disjointed, with more instances of the previous problem. It also hallucinates the word "author."

**keep_header=False, keep_newlines=True** \
Disjointed narrative flow and hallucinates author names.

**keep_header=False, keep_newlines=False** \
Identical output to above.

### google/pegasus-x-large

**keep_header=True, keep_newlines=True** \
Started out very strong, but soon devolved into senseless repetition. The first few sentences of the summary were quite good, though. Tuning the pipeline's parameters may make this the most viable model.

**keep_header=True, keep_newlines=False** \
Identical output to above.

**keep_header=False, keep_newlines=True** \
Same as first output: started out strong and then just started repeating itself. In fact, much of the first half of the output was identical to the first output.

**keep_header=False, keep_newlines=False** \
Identical output to above.

### pszemraj/led-large-book-summary

**keep_header=True, keep_newlines=True** \
A decent summary overall. There were a few hallucinations, such as "In this chapter" and the word "electrochemical." A sentence near the beginning talking about bank accounts was incoherent. The presence of heading text scattered throughout the summary was distracting. Ended mid-sentence.

**keep_header=True, keep_newlines=False** \
More hallucinations and odd spacing errors. Good flow, however. Also had heading text scattered throughout.

**keep_header=False, keep_newlines=True** \
Probably the most hallucinations yet, but I feel like the flow of this summary is the best so far. The hallucinations make its usability questionable, however.

**keep_header=False, keep_newlines=False** \
Identical output to above.

### facebook/bart-large-cnn

Rife with hallucinations. I'm not going to bother commenting on each individual output.

### facebook/bart-large-xsum

Rife with hallucinations. I'm not going to bother commenting on each individual output.

### facebook/bart-large

**keep_header=True, keep_newlines=True** \
Some hallucination (University of California, Berkeley, and the National Institute of Standards and Technology were hallucinated) but overall decent, though the presence of heading text makes it awkward at times. Ends mid-sentence.

**keep_header=True, keep_newlines=False** \
Very incoherent. There are a lot of partial sentences and nonsensical repetition. I caught some hallucinations, but not as many glaringly obvious ones as before. Ends mid-sentence.

**keep_header=False, keep_newlines=True** \
Mostly incoherent. Many partial sentences, and many sentences that just don't make a lot of sense. Perhaps they're missing context. Ends mid-sentence.

**keep_header=False, keep_newlines=False** \
Identical output to above.

### google/flan-t5-base

**keep_header=True, keep_newlines=True** \
So far, this output is the best grammatically and it properly uses punctuation throughout. However, it begins a repetitive series of sentences starting with, "Understand that...," "Read more...," and "Learn more...," as if it's listing topics. I didn't catch any obvious hallucinations.

**keep_header=True, keep_newlines=False** \
Identical output to above. Looks like this model doesn't take into account structure.

**keep_header=False, keep_newlines=True** \
Similar issues to the first output, though the "topic listing" is more prevalent. I also caught one hallucination: "Understand how quantum computers are able to represent exactly one of 2300 possible states at any given moment."

**keep_header=False, keep_newlines=False** \
Identical output to above.

### t5-small

**keep_header=True, keep_newlines=True** \
A well-structured summary overall, though with a couple instances of awkward wording.

**keep_header=True, keep_newlines=False** \
Identical output to above. Looks like this model doesn't take into account structure.

**keep_header=False, keep_newlines=True** \
Not quite as well structured as the first output; there are several more instances of awkward wording, and several uses of pronouns like "this" or "they" absent context. 

**keep_header=False, keep_newlines=False** \
Identical output to above.
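
Several models above gave identical output with and without newlines. One possible explanation (an assumption, not something verified in this notebook) is that a model's tokenizer normalizes whitespace, so both versions of the text reach the model as the same token ids. A minimal sketch of a check for that, assuming the tokenizer is passed in as a plain `encode` callable (the helper name `newlines_change_tokens` is hypothetical):

```python
from typing import Callable, Sequence


def newlines_change_tokens(encode: Callable[[str], Sequence[int]], text: str) -> bool:
    """Return True if collapsing all whitespace in `text` changes its token ids."""
    # Same collapsing as keep_newlines=False: every whitespace run becomes one space.
    collapsed = " ".join(text.split())
    return list(encode(text)) != list(encode(collapsed))
```

With a Hugging Face tokenizer, `encode` could be `AutoTokenizer.from_pretrained(model).encode`. If the function returns `False` for a document, identical summaries for the two newline settings are what you'd expect from that model.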