 ***What Is Text Summarization?***
> Text summarization is the process of distilling the most important information from a source text.



---



 ***The importance of automatic text summarization lies in several key benefits:***



1.   **Reduced Reading Time:** Summaries significantly cut down the   
     time needed to read and understand large documents.

2.   **Easier Document Selection:** When conducting research,
     summaries simplify the process of choosing relevant documents.


3.   **Improved Indexing Effectiveness:** Summaries enhance the
     efficiency of indexing systems.


4.   **Reduced Bias:** Automated summarization applies the same criteria to every document, which can make it less prone to subjective bias than human summarizers.

5.   **Personalized Information:** Customized summaries are valuable in question-answering systems as they provide tailored information.

6.   **Increased Processing Capacity:** Automatic or semi-automatic summarization allows commercial abstract services to handle a larger volume of text documents.




---

*Important Steps in Text Summarization*
  1. Text Cleaning
       * Word Tokenization
       * Sentence Tokenization
  2. Word-Frequency Table
  3. Summarization

In [None]:
# Install spaCy and its small English model, then import what the notebook needs
!pip install -U spacy
!python -m spacy download en_core_web_sm
import spacy
from heapq import nlargest
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# Text And Percentage

In [None]:
text = "On a crisp autumn morning, despite the looming threat of rain and the biting chill in the air, they decided, after much contemplation and a hearty breakfast, to explore the old, abandoned mansion at the edge of town. The mansion, shrouded in mystery and overgrown with ivy, had always piqued their curiosity, with its cracked windows, creaky doors, and tales of ghosts and hidden treasures. Armed with flashlights, notebooks, and an old map they had found in the library, they approached the front gate, which squeaked eerily as they pushed it open. The garden, though once magnificent, was now a tangle of weeds and wildflowers, and as they made their way up the cobblestone path, the air was filled with the scent of damp earth and decaying leaves. Inside, the mansion was a labyrinth of dusty corridors and forgotten rooms, each filled with relics of the past—faded portraits, antique furniture, and bookshelves teeming with old, leather-bound volumes. As they ventured deeper into the mansion, the sense of history and mystery grew stronger, and they couldn't help but wonder about the lives of those who had once called this place home. Despite the occasional creak of floorboards and the unsettling flicker of shadows, their excitement and curiosity propelled them forward, eager to uncover the secrets that lay hidden within the mansion's walls."


In [None]:
percentage = 0.3  # fraction of sentences to keep in the summary

# Removing Stop Words and Processing Text with SpaCy

In [None]:
# Convert spaCy's stop-word set to a list
stopwords = list(STOP_WORDS)
# Load the small English pipeline and parse the text into a "doc" object
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

# Stop Words

In [None]:
print(f"Stop Words: {stopwords}")

Stop Words: ['fifty', 'keep', 'yourself', 'afterwards', 'within', 'thus', 'who', 'others', 'on', 'have', 'beyond', 'her', 'done', 'does', 'anyhow', 'nine', 'beforehand', 'to', 'regarding', 'which', 'never', 'into', 'again', 'become', 'else', 'name', 'whether', 'our', 'someone', 'after', 'hereupon', 'am', 'bottom', 'myself', 'amount', 'because', '‘ll', 'until', 'whoever', 'mine', 'something', 'eight', 'call', 'in', 'well', 'using', 'another', 'above', 'herself', 'has', 'over', 'somewhere', 'nobody', 'with', 'n’t', 'are', 'formerly', 'there', 'did', 'that', 'mostly', 'last', 'already', 'just', 'few', 'ourselves', 'sometime', 'twenty', 'whenever', 'they', 'thereupon', 'below', 'once', 'themselves', 'all', 'indeed', 'full', 'take', 'quite', "'s", 'here', 'neither', 'would', 'and', 'cannot', 'besides', 'elsewhere', 'latter', '‘m', 'via', 'could', 'one', 'namely', 'it', "'ll", 'more', 'back', 'empty', 'had', 'three', '’d', 'seems', 'not', 'everywhere', 'toward', "'d", 'us', 'whither', 'herei

# Tokenization



Tokenization is the process of splitting a string of text into a list of tokens.
A token is a piece of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph.
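
As a rough illustration without spaCy, both levels of tokenization can be sketched with regular expressions. The patterns and the helper names `word_tokenize` and `sentence_tokenize` are assumptions made for this example; spaCy's own tokenizer applies far more elaborate rules.

```python
import re

def word_tokenize(text):
    # A token is either a run of letters/digits/apostrophes
    # or a single non-space punctuation character.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def sentence_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "The gate squeaked. They pushed it open!"
words = word_tokenize(sample)
sentences = sentence_tokenize(sample)
print(words)
print(sentences)
```

This naive sentence splitter would break on abbreviations such as "Dr.", which is one reason a trained pipeline is used in the notebook.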

In [None]:
tokens = [token.text for token in doc]
print(tokens)

['On', 'a', 'crisp', 'autumn', 'morning', ',', 'despite', 'the', 'looming', 'threat', 'of', 'rain', 'and', 'the', 'biting', 'chill', 'in', 'the', 'air', ',', 'they', 'decided', ',', 'after', 'much', 'contemplation', 'and', 'a', 'hearty', 'breakfast', ',', 'to', 'explore', 'the', 'old', ',', 'abandoned', 'mansion', 'at', 'the', 'edge', 'of', 'town', '.', 'The', 'mansion', ',', 'shrouded', 'in', 'mystery', 'and', 'overgrown', 'with', 'ivy', ',', 'had', 'always', 'piqued', 'their', 'curiosity', ',', 'with', 'its', 'cracked', 'windows', ',', 'creaky', 'doors', ',', 'and', 'tales', 'of', 'ghosts', 'and', 'hidden', 'treasures', '.', 'Armed', 'with', 'flashlights', ',', 'notebooks', ',', 'and', 'an', 'old', 'map', 'they', 'had', 'found', 'in', 'the', 'library', ',', 'they', 'approached', 'the', 'front', 'gate', ',', 'which', 'squeaked', 'eerily', 'as', 'they', 'pushed', 'it', 'open', '.', 'The', 'garden', ',', 'though', 'once', 'magnificent', ',', 'was', 'now', 'a', 'tangle', 'of', 'weeds', '

In [None]:
# Show the punctuation characters provided by `from string import punctuation`
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Word Frequency Calculation


This code iterates through each token in the document. It skips tokens that are stop words (common words like "the" and "and") or punctuation marks, comparing them in lowercase. Every remaining word is added to a dictionary called word_frequencies with a count of 1; if the word is already in the dictionary, its count is incremented by 1. Finally, the dictionary of word frequencies is printed.

In [None]:
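word_frequencies = {}
for word in doc:
    token_text = word.text
    # Skip stop words and ASCII punctuation (both compared in lowercase)
    if token_text.lower() in stopwords or token_text.lower() in punctuation:
        continue
    # Keys keep the token's original casing (for example 'Armed'),
    # so a later lowercase lookup will not match capitalized keys
    word_frequencies[token_text] = word_frequencies.get(token_text, 0) + 1
print(word_frequencies)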

{'crisp': 1, 'autumn': 1, 'morning': 1, 'despite': 1, 'looming': 1, 'threat': 1, 'rain': 1, 'biting': 1, 'chill': 1, 'air': 2, 'decided': 1, 'contemplation': 1, 'hearty': 1, 'breakfast': 1, 'explore': 1, 'old': 3, 'abandoned': 1, 'mansion': 5, 'edge': 1, 'town': 1, 'shrouded': 1, 'mystery': 2, 'overgrown': 1, 'ivy': 1, 'piqued': 1, 'curiosity': 2, 'cracked': 1, 'windows': 1, 'creaky': 1, 'doors': 1, 'tales': 1, 'ghosts': 1, 'hidden': 2, 'treasures': 1, 'Armed': 1, 'flashlights': 1, 'notebooks': 1, 'map': 1, 'found': 1, 'library': 1, 'approached': 1, 'gate': 1, 'squeaked': 1, 'eerily': 1, 'pushed': 1, 'open': 1, 'garden': 1, 'magnificent': 1, 'tangle': 1, 'weeds': 1, 'wildflowers': 1, 'way': 1, 'cobblestone': 1, 'path': 1, 'filled': 2, 'scent': 1, 'damp': 1, 'earth': 1, 'decaying': 1, 'leaves': 1, 'Inside': 1, 'labyrinth': 1, 'dusty': 1, 'corridors': 1, 'forgotten': 1, 'rooms': 1, 'relics': 1, 'past': 1, '—': 1, 'faded': 1, 'portraits': 1, 'antique': 1, 'furniture': 1, 'bookshelves': 1,
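
The same counting logic can be written more compactly with `collections.Counter`. This is a minimal sketch on a hand-made token list; `toy_tokens`, `toy_stopwords`, and `count_words` are made-up stand-ins for the spaCy tokens and `STOP_WORDS`, not the notebook's data.

```python
from collections import Counter
from string import punctuation

# Hypothetical inputs standing in for the spaCy tokens and stop-word list
toy_tokens = ["The", "mansion", ",", "the", "old", "mansion", "."]
toy_stopwords = {"the", "a", "of"}

def count_words(tokens, stopwords):
    # Keep tokens that are neither stop words (compared in lowercase)
    # nor ASCII punctuation characters, then count what remains.
    kept = [
        t for t in tokens
        if t.lower() not in stopwords and t not in punctuation
    ]
    return Counter(kept)

toy_frequencies = count_words(toy_tokens, toy_stopwords)
print(toy_frequencies)
```

`Counter` also offers `most_common(n)`, which can be handy for inspecting the top words before normalizing.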

In [None]:
# Determine the maximum word frequency
max_frequency = max(word_frequencies.values())

# Normalizing Word Frequencies


This step normalizes the word frequencies. Dividing each word's frequency by the maximum frequency scales the values into the range (0, 1], with the most frequent word at exactly 1. This normalization helps in comparing word importance uniformly across the document.
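
As a small worked example of this scaling, the sketch below normalizes four of the raw counts reported earlier (mansion 5, old 3, air 2, crisp 1). The helper name `normalize` is an assumption; unlike the notebook's in-place loop, it returns a new dictionary.

```python
def normalize(frequencies):
    # Divide every count by the largest count so the top word scores 1.0
    max_count = max(frequencies.values())
    return {word: count / max_count for word, count in frequencies.items()}

raw_counts = {"mansion": 5, "old": 3, "air": 2, "crisp": 1}
normalized = normalize(raw_counts)
print(normalized)
```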

In [None]:
# Highest raw count of any single word in the document
max_frequency

5

In [None]:
# Normalize the word frequencies
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency

In [None]:
print(word_frequencies)

{'crisp': 0.2, 'autumn': 0.2, 'morning': 0.2, 'despite': 0.2, 'looming': 0.2, 'threat': 0.2, 'rain': 0.2, 'biting': 0.2, 'chill': 0.2, 'air': 0.4, 'decided': 0.2, 'contemplation': 0.2, 'hearty': 0.2, 'breakfast': 0.2, 'explore': 0.2, 'old': 0.6, 'abandoned': 0.2, 'mansion': 1.0, 'edge': 0.2, 'town': 0.2, 'shrouded': 0.2, 'mystery': 0.4, 'overgrown': 0.2, 'ivy': 0.2, 'piqued': 0.2, 'curiosity': 0.4, 'cracked': 0.2, 'windows': 0.2, 'creaky': 0.2, 'doors': 0.2, 'tales': 0.2, 'ghosts': 0.2, 'hidden': 0.4, 'treasures': 0.2, 'Armed': 0.2, 'flashlights': 0.2, 'notebooks': 0.2, 'map': 0.2, 'found': 0.2, 'library': 0.2, 'approached': 0.2, 'gate': 0.2, 'squeaked': 0.2, 'eerily': 0.2, 'pushed': 0.2, 'open': 0.2, 'garden': 0.2, 'magnificent': 0.2, 'tangle': 0.2, 'weeds': 0.2, 'wildflowers': 0.2, 'way': 0.2, 'cobblestone': 0.2, 'path': 0.2, 'filled': 0.4, 'scent': 0.2, 'damp': 0.2, 'earth': 0.2, 'decaying': 0.2, 'leaves': 0.2, 'Inside': 0.2, 'labyrinth': 0.2, 'dusty': 0.2, 'corridors': 0.2, 'forgot

# Sentence Tokenization

In [None]:
sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)

[On a crisp autumn morning, despite the looming threat of rain and the biting chill in the air, they decided, after much contemplation and a hearty breakfast, to explore the old, abandoned mansion at the edge of town., The mansion, shrouded in mystery and overgrown with ivy, had always piqued their curiosity, with its cracked windows, creaky doors, and tales of ghosts and hidden treasures., Armed with flashlights, notebooks, and an old map they had found in the library, they approached the front gate, which squeaked eerily as they pushed it open., The garden, though once magnificent, was now a tangle of weeds and wildflowers, and as they made their way up the cobblestone path, the air was filled with the scent of damp earth and decaying leaves., Inside, the mansion was a labyrinth of dusty corridors and forgotten rooms, each filled with relics of the past—faded portraits, antique furniture, and bookshelves teeming with old, leather-bound volumes., As they ventured deeper into the mansi

# Calculate Sentence Scores
Create a dictionary that maps each sentence to the sum of the normalized frequencies of its words.

In [None]:
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        key = word.text.lower()
        # Capitalized keys in word_frequencies (for example 'Armed')
        # are never matched by this lowercase lookup
        if key in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]
sentence_scores

{On a crisp autumn morning, despite the looming threat of rain and the biting chill in the air, they decided, after much contemplation and a hearty breakfast, to explore the old, abandoned mansion at the edge of town.: 5.400000000000001,
 The mansion, shrouded in mystery and overgrown with ivy, had always piqued their curiosity, with its cracked windows, creaky doors, and tales of ghosts and hidden treasures.: 4.400000000000001,
 Armed with flashlights, notebooks, and an old map they had found in the library, they approached the front gate, which squeaked eerily as they pushed it open.: 2.8000000000000003,
 The garden, though once magnificent, was now a tangle of weeds and wildflowers, and as they made their way up the cobblestone path, the air was filled with the scent of damp earth and decaying leaves.: 3.400000000000001,
 Inside, the mansion was a labyrinth of dusty corridors and forgotten rooms, each filled with relics of the past—faded portraits, antique furniture, and bookshelves
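
One detail worth checking in the scoring step: the frequency table stores each token with its original casing, while the lookup uses the lowercased token, so capitalized keys such as 'Armed' or 'Inside' contribute nothing to their sentence. The sketch below shows the effect on a toy table and one possible remedy (lowercasing the keys when the table is built). The function name and the data are assumptions for illustration; the weights 0.2 and 1.0 echo the normalized values shown earlier.

```python
def score_sentence(words, frequencies):
    # Sum the weights of the words found in the table (lookup is lowercased)
    return sum(frequencies.get(w.lower(), 0.0) for w in words)

sentence = ["Inside", "the", "mansion"]

# Table built with original casing: 'Inside' is stored capitalized
cased_table = {"Inside": 0.2, "mansion": 1.0}
# Table built with lowercased keys
lower_table = {"inside": 0.2, "mansion": 1.0}

cased_score = score_sentence(sentence, cased_table)
lower_score = score_sentence(sentence, lower_table)
print(cased_score, lower_score)
```

With the cased table only "mansion" is counted; with lowercased keys "Inside" is counted as well. Applying such a change to the notebook would alter the scores printed above.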

# Selecting Top Sentences

In [None]:
# Number of sentences to keep: the chosen fraction of all sentences, rounded down
select_length = int(len(sentence_tokens) * percentage)
# Get the sentences with the highest scores (returned in descending score order)
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
summary  # the highest-scoring sentences

[On a crisp autumn morning, despite the looming threat of rain and the biting chill in the air, they decided, after much contemplation and a hearty breakfast, to explore the old, abandoned mansion at the edge of town.,
 Inside, the mansion was a labyrinth of dusty corridors and forgotten rooms, each filled with relics of the past—faded portraits, antique furniture, and bookshelves teeming with old, leather-bound volumes.]
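
Two details of this step are easy to miss: `int()` truncates, so 7 sentences at a fraction of 0.3 yield 2, and `nlargest` returns items ordered by score rather than by position in the text. The sketch below uses hypothetical scores keyed by sentence position (not the notebook's values) and adds an optional re-sort into document order, which the notebook itself does not perform.

```python
from heapq import nlargest

# Hypothetical scores keyed by sentence position (not the notebook's values)
toy_scores = {0: 2.0, 1: 4.5, 2: 1.0, 3: 5.0, 4: 3.0, 5: 0.5, 6: 2.5}
ratio = 0.3

# int() truncates toward zero: 7 * 0.3 is about 2.1, so 2 sentences are kept
select_count = int(len(toy_scores) * ratio)

# nlargest returns the keys ordered by descending score
top_by_score = nlargest(select_count, toy_scores, key=toy_scores.get)

# Optional: restore the original document order for readability
top_in_order = sorted(top_by_score)

print(select_count, top_by_score, top_in_order)
```

Note that for very short texts the truncation can give 0, in which case the summary would be empty.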

# Generating the Final Summary

In [None]:
# Extract the text of each selected sentence and join them into one string
final_summary = [sent.text for sent in summary]
summary_text = ' '.join(final_summary)
print(summary_text)

On a crisp autumn morning, despite the looming threat of rain and the biting chill in the air, they decided, after much contemplation and a hearty breakfast, to explore the old, abandoned mansion at the edge of town. Inside, the mansion was a labyrinth of dusty corridors and forgotten rooms, each filled with relics of the past—faded portraits, antique furniture, and bookshelves teeming with old, leather-bound volumes.


# Testing Models

> Pegasus


> T5


> Bart







In [None]:
import logging
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

logging.getLogger("transformers").setLevel(logging.ERROR)
# Load pre-trained Pegasus model and tokenizer
model_name = "google/pegasus-large"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# Define your input text
input_text = summary_text

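# Tokenize the input text, truncating to at most 512 tokens.
# Note: "paraphrase: " is passed as ordinary text; this checkpoint is not documented as using task prefixes.
input_ids = tokenizer.encode("paraphrase: " + input_text, return_tensors="pt", max_length=512, truncation=True)

# Generate paraphrase with beam search (4 beams, at most 150 tokens)
paraphrase_ids = model.generate(input_ids, max_length=150, num_beams=4, early_stopping=True)

# Decode the generated paraphrase (it is printed in the final results cell)
Pegasus_paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)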


In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Suppress warning messages
logging.getLogger("transformers").setLevel(logging.ERROR)

# Load pre-trained T5 model and tokenizer
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Define your input text
input_text = summary_text

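# Tokenize the input text, truncating to at most 512 tokens.
# Note: the original T5 checkpoints document task prefixes such as "summarize: ";
# "paraphrase: " is not among them, so it is likely treated as ordinary text.
input_ids = tokenizer.encode("paraphrase: " + input_text, return_tensors="pt", max_length=512, truncation=True)

# Generate paraphrase with beam search (4 beams, at most 150 tokens)
paraphrase_ids = model.generate(input_ids, max_length=150, num_beams=4, early_stopping=True)

# Decode the generated paraphrase (it is printed in the final results cell)
T5_paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)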


In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer

# Suppress warning messages
logging.getLogger("transformers").setLevel(logging.ERROR)

# Load pre-trained BART model and tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Define your input text
input_text = summary_text

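# Tokenize the input text, truncating to at most 512 tokens.
# Note: "paraphrase: " is passed as ordinary text; this checkpoint is fine-tuned for
# summarization and is not documented as using task prefixes.
input_ids = tokenizer.encode("paraphrase: " + input_text, return_tensors="pt", max_length=512, truncation=True)

# Generate paraphrase with beam search (4 beams, at most 150 tokens)
paraphrase_ids = model.generate(input_ids, max_length=150, num_beams=4, early_stopping=True)

# Decode the generated paraphrase (it is printed in the final results cell)
bart_paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)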




# Final Results

In [None]:
# Calculate lengths (len() on a string counts characters, not words)
text_length = len(text)
summary_length = len(summary_text)
pegasus_paraphrase_length = len(Pegasus_paraphrase)
t5_paraphrase_length = len(T5_paraphrase)
bart_paraphrase_length = len(bart_paraphrase)

# Print input text
print(f"INPUT (Length: {text_length})\n{text}\n{'-' * 50}\n")

# Print summarized text
print(f"Summarized (Length: {summary_length})\n{summary_text}\n{'-' * 50}\n")

# Print paraphrased text using Pegasus
print(f"Paraphrased using Pegasus (Length: {pegasus_paraphrase_length})\n{Pegasus_paraphrase}\n{'-' * 50}\n")

# Print paraphrased text using T5
print(f"Paraphrased using T5 (Length: {t5_paraphrase_length})\n{T5_paraphrase}\n{'-' * 50}\n")

# Print paraphrased text using Bart
print(f"Paraphrased using Bart (Length: {bart_paraphrase_length})\n{bart_paraphrase}\n{'-' * 50}\n")

# Print statistics
print("Statistics:")
print(f"Original Text Length: {text_length}")
print(f"Summarized Text Length: {summary_length}")
print(f"Pegasus Paraphrase Length: {pegasus_paraphrase_length}")
print(f"T5 Paraphrase Length: {t5_paraphrase_length}")
print(f"Bart Paraphrase Length: {bart_paraphrase_length}")


INPUT (Length: 1352)
On a crisp autumn morning, despite the looming threat of rain and the biting chill in the air, they decided, after much contemplation and a hearty breakfast, to explore the old, abandoned mansion at the edge of town. The mansion, shrouded in mystery and overgrown with ivy, had always piqued their curiosity, with its cracked windows, creaky doors, and tales of ghosts and hidden treasures. Armed with flashlights, notebooks, and an old map they had found in the library, they approached the front gate, which squeaked eerily as they pushed it open. The garden, though once magnificent, was now a tangle of weeds and wildflowers, and as they made their way up the cobblestone path, the air was filled with the scent of damp earth and decaying leaves. Inside, the mansion was a labyrinth of dusty corridors and forgotten rooms, each filled with relics of the past—faded portraits, antique furniture, and bookshelves teeming with old, leather-bound volumes. As they ventured deeper