# Automatic Text Summarization using NLP

## Introduction  
Automatic Text Summarization is a key technique in **Natural Language Processing (NLP)** that utilizes machine learning and linguistic algorithms to condense large bodies of text while preserving essential information. It is widely used in **news aggregation, document summarization, legal text processing, and AI-driven assistants**.  

This notebook explores two primary approaches to text summarization:  
1. **Extractive Summarization** – Selecting key sentences directly from the original text.  
2. **Abstractive Summarization** – Generating new sentences to summarize the text meaningfully.  

We will implement both methods using **Python NLP libraries** like **NLTK, Summa, and Hugging Face Transformers** (whose default summarization pipeline loads a DistilBART model).  

**Objectives of this Notebook:**  
- Preprocess and clean textual data  
- Implement Extractive and Abstractive Summarization  
- Evaluate and compare summarization methods  



In [1]:
!pip install nltk spacy transformers summa newspaper3k


In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import string

nltk.download("punkt_tab")
nltk.download("stopwords")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
STOP_WORDS = set(stopwords.words("english"))  # Build once; set lookups are fast

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    words = word_tokenize(text)  # Tokenize
    words = [word for word in words if word not in STOP_WORDS]  # Remove stopwords
    return " ".join(words)

In [4]:
sample_text = "Text summarization is an NLP task that helps in reducing the length of a document while preserving its meaning."
cleaned_text = preprocess_text(sample_text)
print(cleaned_text)

text summarization nlp task helps reducing length document preserving meaning
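
One side effect of deleting punctuation outright is that hyphenated words are fused into a single token (for example "sequence-to-sequence" becomes "sequencetosequence"). The sketch below contrasts that behaviour with one possible alternative, replacing each punctuation character with a space. The helper name `strip_punctuation_keep_boundaries` is hypothetical and is not part of the notebook.

```python
import string

def strip_punctuation_keep_boundaries(text):
    """Replace each punctuation character with a space, then collapse whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.lower().translate(table).split())

sample = "Sequence-to-sequence models produce human-like summaries."

# Deleting punctuation, as preprocess_text does: hyphenated words are fused.
deleted = sample.lower().translate(str.maketrans("", "", string.punctuation))

# Replacing punctuation with spaces: hyphenated words are split into their parts.
spaced = strip_punctuation_keep_boundaries(sample)

print(deleted)
print(spaced)
```

Which behaviour is preferable depends on whether the parts of a hyphenated term should count as separate words in the frequency table.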


## Extractive Summarization
Extractive summarization algorithms build a summary by selecting and combining key passages from the original text. Unlike human summarizers, they do not write new content; they only pick out the most important sentences. The goal is to condense the text while preserving its meaning.

In [5]:
from summa import summarizer

text = """Text summarization is an NLP technique that generates a concise and meaningful summary
from a larger body of text. It can be either extractive or abstractive. Extractive methods select
important sentences from the original text, while abstractive methods generate new sentences."""

summary = summarizer.summarize(text, ratio=0.3)  # Keep roughly 30% of the sentences
print("Extractive Summary:\n", summary)

Extractive Summary:
 important sentences from the original text, while abstractive methods generate new sentences.
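
The summary above starts in the middle of a sentence, exactly where the input string has a hard line break. This suggests that line breaks are being treated as sentence boundaries (an assumption about the splitter, not something the notebook confirms). A minimal sketch of one way to avoid this is to collapse all whitespace before summarizing; the name `flat_text` is hypothetical.

```python
import re

text = """Text summarization is an NLP technique that generates a concise and meaningful summary
from a larger body of text. It can be either extractive or abstractive. Extractive methods select
important sentences from the original text, while abstractive methods generate new sentences."""

# Collapse every run of whitespace (including the hard line breaks) into one space.
flat_text = re.sub(r"\s+", " ", text).strip()
print(flat_text)
```

Passing `flat_text` instead of `text` to `summarizer.summarize` should then let sentences be split on punctuation only.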


## Abstractive Summarization
Abstractive summarization generates entirely new sentences to convey key ideas from the original text. Unlike extractive summarization, which selects and rearranges sentences from the original content, abstractive methods rephrase information in a more concise and coherent manner, often using new vocabulary that wasn't present in the original.

In [6]:
from transformers import pipeline

# Named distinctly so it does not shadow the `summarizer` module imported from summa above
abstractive_summarizer = pipeline("summarization")

text = """Text summarization is an NLP technique that generates a concise and meaningful summary
from a larger body of text. It can be either extractive or abstractive. Extractive methods select
important sentences from the original text, while abstractive methods generate new sentences."""

summary = abstractive_summarizer(text, max_length=50, min_length=20, do_sample=False)
print("Abstractive Summary:\n", summary[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Abstractive Summary:
  Text summarization is an NLP technique that generates a concise and meaningful summary from a larger body of text . It can be either extractive or abstractive . Extractive methods select important sentences from the original text .


## Text Summarization from Scratch

In [12]:
text="""Text summarization is a Natural Language Processing (NLP) technique that condenses a larger text into a shorter, meaningful version while retaining key information. It helps in quickly understanding lengthy documents, news articles, and research papers.

There are two main types of summarization: extractive and abstractive. Extractive summarization selects important sentences directly from the original text and compiles them into a summary. This method relies on statistical and linguistic features like word frequency and sentence importance.

Abstractive summarization, on the other hand, involves generating entirely new sentences that convey the main ideas of the original text. It uses deep learning models, such as transformers and sequence-to-sequence networks, to generate human-like summaries.

Summarization is widely used in applications like news aggregation, search engines, and content generation. With advancements in NLP, modern summarization models continue to improve, producing more accurate and contextually aware summaries that closely resemble human-written content."""
cleaned_text=preprocess_text(text)
cleaned_text

'text summarization natural language processing nlp technique condenses larger text shorter meaningful version retaining key information helps quickly understanding lengthy documents news articles research papers two main types summarization extractive abstractive extractive summarization selects important sentences directly original text compiles summary method relies statistical linguistic features like word frequency sentence importance abstractive summarization hand involves generating entirely new sentences convey main ideas original text uses deep learning models transformers sequencetosequence networks generate humanlike summaries summarization widely used applications like news aggregation search engines content generation advancements nlp modern summarization models continue improve producing accurate contextually aware summaries closely resemble humanwritten content'

## 1. Creating a Frequency Table of Words
This snippet builds a **word frequency table**, used as a proxy for the importance of each word in the text.  
- It works on `cleaned_text`, which `preprocess_text` has already converted to **lowercase** and stripped of punctuation.  
- **Stopwords** (common words like "is", "the", "and") were removed in that same step, so they are not counted.  
- It counts the **occurrences of each word** and stores them in `freq_table`.

In [13]:
freq_table = {}
words = word_tokenize(cleaned_text)
for word in words:
    # Start unseen words at 0, then count this occurrence
    freq_table[word] = freq_table.get(word, 0) + 1
freq_table

{'text': 4,
 'summarization': 6,
 'natural': 1,
 'language': 1,
 'processing': 1,
 'nlp': 2,
 'technique': 1,
 'condenses': 1,
 'larger': 1,
 'shorter': 1,
 'meaningful': 1,
 'version': 1,
 'retaining': 1,
 'key': 1,
 'information': 1,
 'helps': 1,
 'quickly': 1,
 'understanding': 1,
 'lengthy': 1,
 'documents': 1,
 'news': 2,
 'articles': 1,
 'research': 1,
 'papers': 1,
 'two': 1,
 'main': 2,
 'types': 1,
 'extractive': 2,
 'abstractive': 2,
 'selects': 1,
 'important': 1,
 'sentences': 2,
 'directly': 1,
 'original': 2,
 'compiles': 1,
 'summary': 1,
 'method': 1,
 'relies': 1,
 'statistical': 1,
 'linguistic': 1,
 'features': 1,
 'like': 2,
 'word': 1,
 'frequency': 1,
 'sentence': 1,
 'importance': 1,
 'hand': 1,
 'involves': 1,
 'generating': 1,
 'entirely': 1,
 'new': 1,
 'convey': 1,
 'ideas': 1,
 'uses': 1,
 'deep': 1,
 'learning': 1,
 'models': 2,
 'transformers': 1,
 'sequencetosequence': 1,
 'networks': 1,
 'generate': 1,
 'humanlike': 1,
 'summaries': 2,
 'widely': 1,
 'us
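
The same counting can be sketched with `collections.Counter` from the standard library. The short token list below is a stand-in for `word_tokenize(cleaned_text)` and is only an illustration.

```python
from collections import Counter

# A short, already-cleaned token list standing in for word_tokenize(cleaned_text).
tokens = "text summarization nlp technique condenses larger text shorter".split()

# Counter does the "increment or initialise" bookkeeping internally.
freq_table = Counter(tokens)

# most_common(n) returns the n highest counts as (word, count) pairs.
print(freq_table.most_common(1))
```

A `Counter` behaves like a dictionary, so later steps that read `freq_table[word]` or iterate over `freq_table.items()` would work unchanged.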

## 2. Assigning Scores to Sentences
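This step assigns a **score** to each sentence based on the frequencies of the words it contains.  
- For every entry in `freq_table` that appears in the lowercased sentence, that word's frequency is added to the sentence's **cumulative score**.  
- The check `word in sentence.lower()` is a **substring** test, so "new" also matches "news" and "search" also matches "research"; each table word is counted at most once per sentence.  
- Sentences containing more **high-frequency words** get a **higher score**.  
- These scores help us determine which sentences are **most important** for the summary.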

In [14]:
sentences = sent_tokenize(text)
sentence_value = {}
for sentence in sentences:
    lowered = sentence.lower()
    for word, freq in freq_table.items():
        if word in lowered:  # Substring test, not an exact token match
            sentence_value[sentence] = sentence_value.get(sentence, 0) + freq
sentence_value


{'Text summarization is a Natural Language Processing (NLP) technique that condenses a larger text into a shorter, meaningful version while retaining key information.': 24,
 'It helps in quickly understanding lengthy documents, news articles, and research papers.': 12,
 'There are two main types of summarization: extractive and abstractive.': 14,
 'Extractive summarization selects important sentences directly from the original text and compiles them into a summary.': 22,
 'This method relies on statistical and linguistic features like word frequency and sentence importance.': 11,
 'Abstractive summarization, on the other hand, involves generating entirely new sentences that convey the main ideas of the original text.': 26,
 'It uses deep learning models, such as transformers and sequence-to-sequence networks, to generate human-like summaries.': 13,
 'Summarization is widely used in applications like news aggregation, search engines, and content generation.': 20,
 'With advancements in 
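
Because the scoring cell tests whether each table word occurs anywhere inside the sentence string, a word such as "text" also matches inside "contextually". The sketch below compares that substring test with a token-based variant. The helpers `tokenize` and `score_sentences`, the two sentences, and the small frequency table are all hypothetical and chosen only to show the difference.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def score_sentences(sentences, freq_table):
    """Score each sentence by summing the table frequency of every token in it."""
    scores = {}
    for sentence in sentences:
        scores[sentence] = sum(freq_table.get(tok, 0) for tok in tokenize(sentence))
    return scores

sentences = [
    "Modern models produce contextually aware summaries.",
    "The original text is condensed.",
]
freq_table = Counter({"text": 4, "summaries": 2, "models": 2, "original": 2})

# Substring test: "text" is found inside "contextually" and inflates the score.
substring_score = sum(f for w, f in freq_table.items() if w in sentences[0].lower())

# Token test: only whole words contribute.
token_scores = score_sentences(sentences, freq_table)

print(substring_score, token_scores[sentences[0]])
```

Note that the token-based variant counts a word every time it occurs in a sentence, whereas the substring loop adds each table word at most once, so the two approaches can rank sentences differently.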

## 3. Calculating the Average Sentence Score
To decide the **threshold** for summary selection, we calculate the **average sentence score**.  
- The sum of all sentence scores is computed.  
- The average is determined by dividing the total score by the **number of sentences**.  
- This average score helps in filtering out **less relevant** sentences.  


In [15]:
def get_average_score():
    # Use a name other than `sum` so the built-in is not shadowed
    total = sum(sentence_value.values())
    return int(total / len(sentence_value))  # int() truncates the fractional part

avg = get_average_score()
avg

18

## 4. Generating the Summary
This final step selects the **most important sentences** based on their scores.  
- A sentence is **included in the summary** if its score is greater than **1.2 times the average score**.  
- The selected sentences are concatenated to form the final **summarized text**.
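
As a worked example of this rule, the sketch below applies the threshold to the eight sentence scores that are fully visible in the earlier output, using the reported average of 18. The ninth score is cut off in that output, so it is left out here.

```python
# Scores of the eight sentences that are fully visible in the earlier output.
visible_scores = [24, 12, 14, 22, 11, 26, 13, 20]
avg = 18  # average reported by the notebook

# A sentence is kept only if its score exceeds 1.2 times the average (about 21.6).
threshold = 1.2 * avg
selected = [score for score in visible_scores if score > threshold]

print(selected)  # → [24, 22, 26]
```

A score of 20 is above the average but below the threshold, so that sentence is dropped. Raising or lowering the multiplier makes the summary shorter or longer.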

In [16]:
summary=''
for sentence in sentences:
    if (sentence in sentence_value ) and (sentence_value[sentence]>(1.2*avg)):
        summary+=" "+sentence
print(summary)

 Text summarization is a Natural Language Processing (NLP) technique that condenses a larger text into a shorter, meaningful version while retaining key information. Extractive summarization selects important sentences directly from the original text and compiles them into a summary. Abstractive summarization, on the other hand, involves generating entirely new sentences that convey the main ideas of the original text. With advancements in NLP, modern summarization models continue to improve, producing more accurate and contextually aware summaries that closely resemble human-written content.
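
To compare the summaries produced by the different methods, one common approach is to measure word overlap with a human-written reference summary. The sketch below is a simplified, ROUGE-1-style unigram overlap written with the standard library only; it is not the official ROUGE implementation. The function names and both example strings are hypothetical.

```python
import re
from collections import Counter

def unigrams(text):
    """Return a bag (multiset) of lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def rouge1(candidate, reference):
    """Compute unigram-overlap precision, recall, and F1 (ROUGE-1 style)."""
    cand, ref = unigrams(candidate), unigrams(reference)
    overlap = sum((cand & ref).values())  # shared tokens, clipped per word
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical reference (human-written) and candidate (system) summaries.
reference = "Summarization condenses text while keeping key information."
candidate = "Summarization condenses a larger text."

precision, recall, f1 = rouge1(candidate, reference)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Scoring the extractive, abstractive, and from-scratch summaries against the same reference would give a rough, like-for-like comparison.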
