**Extractive summarization** works by directly selecting and stitching together the most important sentences or phrases from the original text. The summary is a subset of the source document; no new words are generated. This keeps the summary faithful to the source material, but the pieced-together sentences can read less smoothly than freshly written text.

1. **Unsupervised Statistical & Graph-Based Methods**
These methods do not require labeled training data. They work by analyzing the frequency, position, and connections between words and sentences in the input document itself.

**Frequency-Driven (TF-IDF, Topic Words)**: Sentences are scored based on the count or density of high-frequency, non-stop words. The Term Frequency-Inverse Document Frequency (TF-IDF) measure is a common technique used to weight words, favoring those that are important in the document but rare overall.

**Graph-Based Algorithms (TextRank / LexRank)**: The document is modeled as a graph where sentences are nodes and connections (edges) represent similarity between sentences (often using Cosine Similarity). Sentences connected to many other sentences (i.e., highly central nodes) are scored as important, similar to the PageRank algorithm for ranking web pages.


**Location/Linguistic Features**: Simple heuristics are used, such as selecting the first and last sentences of paragraphs or sentences containing cue phrases (e.g., "in conclusion," "significantly").
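The frequency-driven idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the tiny stop-word list, the naive regex sentence splitter, the smoothed IDF formula, and the function names are all assumptions made for the example, and each sentence is treated as its own "document" when computing IDF.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to",
              "and", "for", "that", "it", "this", "with"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def tokenize(sentence):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOP_WORDS]

def tfidf_sentence_scores(text):
    """Score each sentence by the mean TF-IDF weight of its words.

    TF is the word count over the whole text; IDF is computed over sentences
    (a smoothed variant, log(1 + N / sentence_frequency)).
    """
    sentences = split_sentences(text)
    token_lists = [tokenize(s) for s in sentences]
    n = len(sentences)
    term_freq = Counter(w for tokens in token_lists for w in tokens)
    sent_freq = Counter(w for tokens in token_lists for w in set(tokens))
    scores = []
    for tokens in token_lists:
        if not tokens:
            scores.append(0.0)
            continue
        total = sum(term_freq[w] * math.log(1 + n / sent_freq[w]) for w in tokens)
        scores.append(total / len(tokens))
    return sentences, scores

def frequency_summary(text, k=2):
    """Pick the k highest-scoring sentences and return them in document order."""
    sentences, scores = tfidf_sentence_scores(text)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))

example = "Cats chase mice. Cats and mice share the house. The weather is mild."
print(frequency_summary(example, k=2))
```

Averaging over sentence length keeps long sentences from winning purely by word count; summing instead would favor longer sentences.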

2. **Supervised Machine Learning**
This is a more modern approach that requires labeled training data.

**Binary Sentence Classification**: A Transformer encoder model (like BERT or RoBERTa) is fine-tuned to classify each sentence as either 1 (include in summary) or 0 (exclude). The model learns from the training data which features (context, position, keywords) correspond to importance.

**Graph-Based Algorithms (TextRank / LexRank):**

In [43]:
#!pip install sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

# Define the language and size
LANGUAGE = "english"
SENTENCES_COUNT = 3

def summarize_lexrank(text, summary_size):
    # 1. Create a parser and stemmer
    parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    # 2. Initialize the LexRank summarizer
    summarizer = LexRankSummarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE) # Optional: set stop words

    # 3. Run summarization
    # The summarizer returns the selected sentences as Sentence objects
    summary_sentences = summarizer(parser.document, summary_size)

    # 4. Convert sentences objects to strings
    summary = ' '.join(str(sentence) for sentence in summary_sentences)
    return summary

# --- Example Usage ---
document = """
Recent advancements in deep learning have significantly boosted the performance of Natural Language Processing tasks. Transformer models, particularly those with the encoder-decoder architecture, are widely used for tasks like translation and summarization. LexRank is a graph-based method that determines sentence importance based on the concept of centrality in a graph. Sentences that are similar to many other sentences in the text are considered central and, therefore, more important. This technique is often used for unsupervised extractive summarization and delivers high-quality results without requiring any training data. The main challenge remains processing extremely long documents efficiently.
"""

summary = summarize_lexrank(document, summary_size=SENTENCES_COUNT)
print("--- LexRank-Based Summary (Top 3 Sentences) ---")
print("text length : ", len(document), "\nsummary length : ", len(summary))
print("summary\n", summary)

--- LexRank-Based Summary (Top 3 Sentences) ---
text length :  710 
summary length :  372
summary
 Recent advancements in deep learning have significantly boosted the performance of Natural Language Processing tasks. Transformer models, particularly those with the encoder-decoder architecture, are widely used for tasks like translation and summarization. LexRank is a graph-based method that determines sentence importance based on the concept of centrality in a graph.


2. **Supervised Extractive Summarization (Conceptual with Hugging Face)**
This method fine-tunes a pre-trained model such as BERT to perform binary classification on each sentence. Full training code is beyond the scope of this section, so what follows is the conceptual workflow using the Hugging Face transformers library.

Goal: Train a model to predict a label for each sentence: 1 (include) or 0 (exclude).

Conceptual Steps and Code Snippets
A. Prepare Data (Labeling)
For every document, you must have a "gold standard" human-written summary. You then label each sentence in the source text:

Label 1: If the sentence is highly similar (using ROUGE) to the human summary.

Label 0: Otherwise.
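The labeling rule above can be sketched as follows. This is only an illustration under stated assumptions: `rouge1_f1` is a simplified unigram-overlap F1 written from scratch (no stemming, no dedicated ROUGE library), and the `threshold` of 0.2 and the helper names are hypothetical choices, not values taken from this notebook.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1: F1 over overlapping unigrams (counts are clipped)."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def label_sentences(sentences, reference_summary, threshold=0.2):
    """Return (sentence, label) pairs: 1 if similar enough to the reference, else 0.

    The threshold is an illustrative assumption and would need tuning on real data.
    """
    return [(s, int(rouge1_f1(s, reference_summary) >= threshold)) for s in sentences]

source_sentences = [
    "Paris will host the 2024 Olympic Games.",
    "Ticket sales have been strong.",
]
reference = "Paris is hosting the 2024 Olympics."
for sentence, label in label_sentences(source_sentences, reference):
    print(label, sentence)
```

The resulting (sentence, label) pairs are the kind of training examples a sentence classifier would be fine-tuned on.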

In [44]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Using a BERT-like model for sentence classification
MODEL_NAME = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# We use SequenceClassification since we classify each sentence as a separate input
# The number of labels is 2 (0 or 1)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [54]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from nltk.tokenize import sent_tokenize

# Ensure you have NLTK's sentence tokenizer data
# import nltk
# nltk.download('punkt')

# --- 1. Define Document and Reference Summary ---
# NOTE: The reference summary is only for conceptual context; it is not used in the final inference loop.
document = (
    "The 2024 Olympic Games are scheduled to take place in Paris, France. "
    "This marks the third time the city has hosted the Games, having previously done so in 1900 and 1924. "
    "The event is expected to feature 32 sports and 329 events across various venues. "
    "Sustainability is a major focus for the organizers, with plans to minimize the environmental impact. "
    "Preparations are currently underway, involving significant infrastructure upgrades, particularly in the transportation sector. "
    "Ticket sales have been strong, indicating high public enthusiasm for the world's premier sporting event. "
    "A total of 10,500 athletes are expected to compete. The opening ceremony will be held on the River Seine."
)
reference_summary = "Paris is hosting the 2024 Olympics, focusing on sustainability and involving 10,500 athletes in 32 sports." # Conceptual R

# --- 2. Initialize Model and Tokenizer (Conceptual) ---
MODEL_NAME = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Loading the model for structural completeness, although its actual weights are untrained for this task
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)


def predict_sentence_importance(sentence, model, tokenizer, device='cpu'):
    """
    Returns (predicted_class_id, confidence) for a sentence.

    FOR THIS DEMO THE PREDICTION IS STUBBED: the classifier head of the base
    model is untrained, so the score is a simple length heuristic (shorter
    sentences score higher) rather than a real model output.
    """
    # 1. Tokenize (shown for structure only; a fine-tuned model would consume `inputs`)
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, padding=True)

    # 2. Simulate a confidence score: inversely proportional to sentence length in characters
    score_to_include = 1.0 / (len(sentence) + 1)

    # 3. Simulate a binary prediction with a dummy threshold;
    #    the final selection relies on sorting by score, not on this label
    if score_to_include > 0.01:
        predicted_class_id = 1
    else:
        predicted_class_id = 0

    return predicted_class_id, score_to_include

# --- 3. Supervised Extractive Inference Loop ---

SUMMARY_SENTENCE_COUNT = 3 # We only want the top 3 sentences
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

sentences = sent_tokenize(document)
scored_sentences = [] # Stores: (confidence_score, original_index, sentence_text)

print(f"--- Document Sentences Found: {len(sentences)} ---")

for i, sentence in enumerate(sentences):
    # Get the simulated prediction and confidence
    predicted_label, confidence = predict_sentence_importance(sentence, model, tokenizer, device)

    # Store the sentence, its confidence, and its original position
    scored_sentences.append((confidence, i, sentence))

    # Optional: Print the conceptual prediction for transparency
    print(f"[{'Important' if predicted_label == 1 else 'Minor'}] Confidence: {confidence:.4f} | {sentence[:30]}...")


# --- 4. Selection and Final Summary Generation ---

# A. Sort by confidence score (descending)
# The highest scores are the most 'important'
scored_sentences.sort(key=lambda x: x[0], reverse=True)

# B. Select only the top N sentences
top_n_sentences = scored_sentences[:SUMMARY_SENTENCE_COUNT]

# C. Sort the selected sentences back into their original document order (by index x[1])
top_n_sentences.sort(key=lambda x: x[1])

# D. Extract just the text
summary_sentences = [text for score, index, text in top_n_sentences]
final_summary = ' '.join(summary_sentences)

# --- 5. Print Results and Lengths ---
print("\n-------------------------------------------------")
print("✅ Final Supervised Extractive Summary:")
print(final_summary)
print("-------------------------------------------------")
print(f"Original Document Length (Characters): {len(document)}")
print(f"Final Summary Length (Characters): {len(final_summary)}")
print(f"Length Reduction achieved (Summary is {len(summary_sentences)} sentences).")
print("-------------------------------------------------")

--- Document Sentences Found: 8 ---
[Important] Confidence: 0.0145 | The 2024 Olympic Games are sch...
[Minor] Confidence: 0.0099 | This marks the third time the ...
[Important] Confidence: 0.0123 | The event is expected to featu...
[Minor] Confidence: 0.0099 | Sustainability is a major focu...
[Minor] Confidence: 0.0079 | Preparations are currently und...
[Minor] Confidence: 0.0095 | Ticket sales have been strong,...
[Important] Confidence: 0.0192 | A total of 10,500 athletes are...
[Important] Confidence: 0.0185 | The opening ceremony will be h...

-------------------------------------------------
✅ Final Supervised Extractive Summary:
The 2024 Olympic Games are scheduled to take place in Paris, France. A total of 10,500 athletes are expected to compete. The opening ceremony will be held on the River Seine.
-------------------------------------------------
Original Document Length (Characters): 689
Final Summary Length (Characters): 174
Length Reduction achieved (Summary is 3 sentences).
-------------------------------------------------

**Abstractive Summarization**

Abstractive summarization works by understanding the meaning of the source text and then generating new sentences to capture the core concepts. The summary uses its own unique vocabulary and phrasing, often condensing complex ideas and resulting in a summary that is highly fluent and human-like. This method is more complex and carries the risk of hallucination (generating factually incorrect information).

1. **Sequence-to-Sequence (Seq2Seq) Models**
Abstractive summarization is primarily handled by encoder-decoder architectures, which map the input sequence (document) to a completely new output sequence (summary).

**Recurrent Neural Networks (RNNs/LSTMs)**: These were the foundational models for Seq2Seq: an encoder read the entire document, and a decoder generated the summary one word at a time. They have largely been superseded by Transformers.

**Attention Mechanisms**: This crucial addition allows the decoder to focus on the most relevant parts of the input document at each step of summary generation, improving coherence and accuracy.

**Pointer-Generator Networks**: A mechanism added to Seq2Seq models that lets the model either generate a new word from its vocabulary or copy (point to) a word directly from the source text. Copying improves factual accuracy and helps the model reproduce rare or out-of-vocabulary words, such as names and numbers, correctly.
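To make the attention and copy mechanisms more concrete, here is a small numeric sketch of how a pointer-generator's final word distribution can be formed at one decoding step. It assumes the usual formulation, P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on source positions holding w). The function names and the toy numbers are invented for illustration; in a real model the vocabulary probabilities, attention scores, and `p_gen` are all produced by learned layers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_generator_distribution(vocab_probs, attention_scores, source_tokens, p_gen):
    """Mix the generation and copy distributions for one decoding step.

    vocab_probs:      dict word -> probability from the decoder's vocabulary softmax
    attention_scores: raw attention scores, one per source token
    source_tokens:    the source words the attention scores refer to
    p_gen:            probability of generating (vs. copying), between 0 and 1
    """
    attention = softmax(attention_scores)
    # Generation part: scale the vocabulary distribution by p_gen.
    final = {word: p_gen * p for word, p in vocab_probs.items()}
    # Copy part: add (1 - p_gen) * attention weight to each source word,
    # including words that are missing from the vocabulary.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

# Toy step: "perseverance" is not in the vocabulary but receives high attention.
dist = pointer_generator_distribution(
    vocab_probs={"the": 0.5, "rover": 0.3, "landed": 0.2},
    attention_scores=[2.0, 0.0],
    source_tokens=["perseverance", "landed"],
    p_gen=0.6,
)
for word, prob in sorted(dist.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {prob:.3f}")
```

Because both component distributions sum to one, the mixture does too, and an out-of-vocabulary source word can still be emitted through the copy term.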

2. **Transformer-Based Large Language Models (LLMs)**
These models are the current state of the art and have substantially improved abstractive summarization.

**BART (Bidirectional and Auto-Regressive Transformers)**: An encoder-decoder model pre-trained by corrupting text and learning to reconstruct the original. It performs strongly on summarization once fine-tuned.

**T5 / mT5 (Text-to-Text Transfer Transformer)**: A model that frames all NLP tasks (including summarization) as a text-to-text problem. It is highly effective, and the mT5 variant supports many languages.

**PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization)**: A model specifically designed for summarization, pre-trained by masking out entire important sentences and training the model to generate them, resulting in highly abstractive summaries.

**Generative Models (GPT-style)**: Large decoder-only models (like GPT-4) can be prompted to summarize text without task-specific fine-tuning (zero-shot prompting, or in-context learning when examples are included in the prompt), though they are not specialized for the task.

**T5 transformer**

In [18]:
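# import the T5 classes and initialize the pretrained model and tokenizer
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')
device = torch.device('cpu')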

In [19]:
# input text
text = """
Back in the 1950s, the fathers of the field, Minsky and McCarthy, described artificial intelligence as any task performed by a machine that would have previously been considered to require human intelligence.

That's obviously a fairly broad definition, which is why you will sometimes see arguments over whether something is truly AI or not.

Modern definitions of what it means to create intelligence are more specific. Francois Chollet, an AI researcher at Google and creator of the machine-learning software library Keras, has said intelligence is tied to a system's ability to adapt and improvise in a new environment, to generalise its knowledge and apply it to unfamiliar scenarios.

"Intelligence is the efficiency with which you acquire new skills at tasks you didn't previously prepare for," he said.

"Intelligence is not skill itself; it's not what you can do; it's how well and how efficiently you can learn new things."

It's a definition under which modern AI-powered systems, such as virtual assistants, would be characterised as having demonstrated 'narrow AI', the ability to generalise their training when carrying out a limited set of tasks, such as speech recognition or computer vision.

Typically, AI systems demonstrate at least some of the following behaviours associated with human intelligence: planning, learning, reasoning, problem-solving, knowledge representation, perception, motion, and manipulation and, to a lesser extent, social intelligence and creativity.

AlexNet's performance demonstrated the power of learning systems based on neural networks, a model for machine learning that had existed for decades but that was finally realising its potential due to refinements to architecture and leaps in parallel processing power made possible by Moore's Law. The prowess of machine-learning systems at carrying out computer vision also hit the headlines that year, with Google training a system to recognise an internet favorite: pictures of cats.

The next demonstration of the efficacy of machine-learning systems that caught the public's attention was the 2016 triumph of the Google DeepMind AlphaGo AI over a human grandmaster in Go, an ancient Chinese game whose complexity stumped computers for decades. Go has about possible 200 moves per turn compared to about 20 in Chess. Over the course of a game of Go, there are so many possible moves that are searching through each of them in advance to identify the best play is too costly from a computational point of view. Instead, AlphaGo was trained how to play the game by taking moves played by human experts in 30 million Go games and feeding them into deep-learning neural networks.
"""

In [20]:
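## preprocess the input text
# Note: replacing '\n' with '' joins adjacent paragraphs without a space
# (e.g. "intelligence.That's"); replace with ' ' instead to keep the words separate.
preprocessed_text = text.strip().replace('\n','')
t5_input_text = 'summarize: ' + preprocessed_text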


In [21]:

t5_input_text

'summarize: Back in the 1950s, the fathers of the field, Minsky and McCarthy, described artificial intelligence as any task performed by a machine that would have previously been considered to require human intelligence.That\'s obviously a fairly broad definition, which is why you will sometimes see arguments over whether something is truly AI or not.Modern definitions of what it means to create intelligence are more specific. Francois Chollet, an AI researcher at Google and creator of the machine-learning software library Keras, has said intelligence is tied to a system\'s ability to adapt and improvise in a new environment, to generalise its knowledge and apply it to unfamiliar scenarios."Intelligence is the efficiency with which you acquire new skills at tasks you didn\'t previously prepare for," he said."Intelligence is not skill itself; it\'s not what you can do; it\'s how well and how efficiently you can learn new things."It\'s a definition under which modern AI-powered systems, 

In [22]:

len(t5_input_text.split())


410

In [23]:

tokenized_text = tokenizer.encode(t5_input_text, return_tensors='pt', max_length=512, truncation=True).to(device)


**Summarize**

In [24]:
summary_ids = model.generate(tokenized_text, min_length=30, max_length=120)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)


In [25]:
summary

"artificial intelligence is a definition of how it means to create intelligence. it's a definition of how AI-powered systems are characterised as having demonstrated 'narrow AI' the technology is a model for machine learning that had existed for decades."

**The Hybrid Summarization Pipeline** is a two-step process to summarize very long documents efficiently:

1. **Extractive Filtering (Reducer)**: A quick extractive method (like TextRank) first reads the entire document and selects only the most critical sentences (e.g., 10-20). This significantly shrinks the input size.

2. **Abstractive Generation (Polisher)**: The shortened text from Step 1 is then fed into a powerful abstractive model (like BART or T5) to generate the final, fluent, and concise summary.

This method combines the efficiency of extraction with the fluency of abstraction.
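As a rough idea of what the reducer step could look like without any external library, the sketch below ranks sentences by their summed cosine similarity to every other sentence and keeps the top k in document order. It is a simplified stand-in for TextRank/LexRank, not the algorithm itself: it uses plain bag-of-words vectors and degree centrality instead of the PageRank iteration, and the function names and regex sentence splitter are assumptions made for the example.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def bag_of_words(sentence):
    """Word-count vector for a sentence."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centrality_reduce(text, k):
    """Keep the k sentences most similar to the rest of the text, in document order."""
    sentences = split_sentences(text)
    vectors = [bag_of_words(s) for s in sentences]
    # Degree centrality: sum of similarities to all other sentences.
    scores = [
        sum(cosine(v, other) for j, other in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))

sample = (
    "The rover landed on Mars. "
    "The rover collects rock samples on Mars. "
    "Bananas are yellow."
)
print(centrality_reduce(sample, k=2))
```

The reduced text returned by such a function is what would be passed on to the abstractive model in the second step.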

In [55]:
import nltk
from nltk.tokenize import sent_tokenize
from transformers import pipeline

# --- Configuration ---
EXTRACTIVE_SENTENCE_COUNT = 5 # Number of top sentences to extract
MAX_SUMMARY_LENGTH = 80       # Maximum length of the final abstractive summary, in tokens (not characters)

# --- Long Document and Conceptual Summary ---
document = (
    "The Mars Perseverance Rover landed successfully in Jezero Crater on February 18, 2021. "
    "Its primary mission is to seek signs of ancient life and collect samples of rock and regolith for a future return mission. "
    "The rover is equipped with a sophisticated suite of instruments, including the Mastcam-Z camera system and the SuperCam laser. "
    "One of the most exciting tools is MOXIE, which is designed to produce oxygen from the Martian atmosphere, paving the way for human missions. "
    "The atmosphere of Mars is thin and composed primarily of carbon dioxide. "
    "Ingenuity, a small helicopter carried by Perseverance, completed the first powered, controlled flight on another planet, proving aerial exploration is viable. "
    "This successful flight opened a new dimension in planetary science. "
    "The samples collected will be crucial for confirming whether life ever existed on the Red Planet. "
    "The mission has already provided stunning high-definition images and valuable atmospheric data. "
    "Future plans involve the European Space Agency's Rosalind Franklin rover, which will work in tandem with the sample return efforts. "
    "Overall, Perseverance represents a massive leap forward in our exploration capabilities."
)

# --- 1. Extractive Phase (Conceptual TextRank/LexRank) ---

def run_extractive_summarization(text, k):
    """
    Conceptual function to simulate the output of a TextRank/LexRank model.
    In a real scenario, this would use the 'sumy' library to score and select sentences.

    Here, we simply select the first k sentences as a placeholder for the most 'critical' ones.
    """
    sentences = sent_tokenize(text)

    # Select the top k sentences and preserve their original order
    extracted_sentences = sentences[:k]

    # The output is a string of reduced text
    reduced_text = ' '.join(extracted_sentences)

    print(f"   [Step 1] Document reduced to {k} sentences.")
    return reduced_text

# --- 2. Abstractive Phase (Conceptual BART/T5) ---

def run_abstractive_summarization(text, max_len):
    """
    Generates an abstractive summary with a pre-trained model loaded via the
    transformers pipeline. Falls back to a hard-coded conceptual summary if
    the model cannot be loaded or run.
    """
    try:
        # Initializing the summarization pipeline with a DistilBART checkpoint
        summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", framework="pt")

        # Generating the abstractive summary
        summary = summarizer(
            text,
            max_length=max_len,
            min_length=int(max_len * 0.5),
            do_sample=False
        )[0]['summary_text']

        print("   [Step 2] Abstractive model successfully generated summary.")
        return summary

    except Exception as e:
        # Fallback for environments without required models/libraries
        print(f"   [Step 2] Using conceptual fallback due to error: {e}")
        # Conceptual Abstractive Output based on the reduced text
        return "The Perseverance Rover landed in 2021 to look for ancient life and collect samples. It successfully deployed the Ingenuity helicopter, proving aerial exploration on Mars is now possible."

# --- Run the Hybrid Pipeline ---

print("🚀 Starting Extractive-Abstractive Pipeline...")

# Step 1: Extractive Reduction
intermediate_text = run_extractive_summarization(
    document,
    EXTRACTIVE_SENTENCE_COUNT
)
print(f"   [Intermediate Length]: {len(intermediate_text)} characters")
print("-" * 50)


# Step 2: Abstractive Generation on the Reduced Text
final_summary = run_abstractive_summarization(
    intermediate_text,
    MAX_SUMMARY_LENGTH
)
print("-" * 50)


# --- Final Output and Lengths ---
print("\n📝 Original Document:")
print(document)
print("\n✨ Final Abstractive Summary (from reduced text):")
print(final_summary)

print("\n" + "=" * 50)
print(f"📏 Document Length (Characters): {len(document)}")
print(f"📐 Summary Length (Characters): {len(final_summary)}")
print(f"Length Reduction: {len(document) - len(final_summary)} characters saved.")
print("=" * 50)

🚀 Starting Extractive-Abstractive Pipeline...
   [Step 1] Document reduced to 5 sentences.
   [Intermediate Length]: 550 characters
--------------------------------------------------


Device set to use cpu


   [Step 2] Abstractive model successfully generated summary.
--------------------------------------------------

📝 Original Document:
The Mars Perseverance Rover landed successfully in Jezero Crater on February 18, 2021. Its primary mission is to seek signs of ancient life and collect samples of rock and regolith for a future return mission. The rover is equipped with a sophisticated suite of instruments, including the Mastcam-Z camera system and the SuperCam laser. One of the most exciting tools is MOXIE, which is designed to produce oxygen from the Martian atmosphere, paving the way for human missions. The atmosphere of Mars is thin and composed primarily of carbon dioxide. Ingenuity, a small helicopter carried by Perseverance, completed the first powered, controlled flight on another planet, proving aerial exploration is viable. This successful flight opened a new dimension in planetary science. The samples collected will be crucial for confirming whether life ever existed on the R