# Text Summarization

In extractive text summarization, the summary is created by selecting important sentences or phrases directly from the original text, without modifying them.

In [1]:
#Text for summarization
text = "Text summarization is a natural language processing (NLP) technique that aims to generate concise and meaningful summaries from large amounts of text data. This process is essential for extracting key information, reducing redundancy, and enhancing the efficiency of information retrieval. Text summarization can be applied in various domains such as news articles, research papers, and customer feedback analysis, enabling businesses to quickly grasp the most important insights. It can be implemented using techniques like extractive methods, which select significant portions of the text, or abstractive methods, which generate new summaries based on the text's meaning."

In [2]:
# Check the length of the text
len(text)

673

In [3]:
# Import the spaCy library for natural language processing
# Import the stop words from spaCy
# Import punctuation for text processing

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [4]:
# Load the English language model from spaCy

nlp = spacy.load('en_core_web_sm')

In [5]:
# Process the text with the spaCy NLP pipeline

doc = nlp(text)

In [6]:
# Tokenizing the text and removing stopwords and punctuation

tokens = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct and token.text != "\n"]

# This creates a list of tokens in lowercase that are not stopwords or punctuation


In [7]:
#printing the tokens

tokens

['text',
 'summarization',
 'natural',
 'language',
 'processing',
 'nlp',
 'technique',
 'aims',
 'generate',
 'concise',
 'meaningful',
 'summaries',
 'large',
 'amounts',
 'text',
 'data',
 'process',
 'essential',
 'extracting',
 'key',
 'information',
 'reducing',
 'redundancy',
 'enhancing',
 'efficiency',
 'information',
 'retrieval',
 'text',
 'summarization',
 'applied',
 'domains',
 'news',
 'articles',
 'research',
 'papers',
 'customer',
 'feedback',
 'analysis',
 'enabling',
 'businesses',
 'quickly',
 'grasp',
 'important',
 'insights',
 'implemented',
 'techniques',
 'like',
 'extractive',
 'methods',
 'select',
 'significant',
 'portions',
 'text',
 'abstractive',
 'methods',
 'generate',
 'new',
 'summaries',
 'based',
 'text',
 'meaning']

In [8]:
#importing counter

from collections import Counter

In [9]:
# Calculate word frequency using Counter

word_frequency=Counter(tokens)

In [10]:
#printing counter of tokens

Counter(tokens)

Counter({'text': 5,
         'summarization': 2,
         'natural': 1,
         'language': 1,
         'processing': 1,
         'nlp': 1,
         'technique': 1,
         'aims': 1,
         'generate': 2,
         'concise': 1,
         'meaningful': 1,
         'summaries': 2,
         'large': 1,
         'amounts': 1,
         'data': 1,
         'process': 1,
         'essential': 1,
         'extracting': 1,
         'key': 1,
         'information': 2,
         'reducing': 1,
         'redundancy': 1,
         'enhancing': 1,
         'efficiency': 1,
         'retrieval': 1,
         'applied': 1,
         'domains': 1,
         'news': 1,
         'articles': 1,
         'research': 1,
         'papers': 1,
         'customer': 1,
         'feedback': 1,
         'analysis': 1,
         'enabling': 1,
         'businesses': 1,
         'quickly': 1,
         'grasp': 1,
         'important': 1,
         'insights': 1,
         'implemented': 1,
         'techniques': 1,
  

In [11]:
# Get the maximum frequency of any word

max_frequency = max(word_frequency.values())

In [12]:
max_frequency

5

In [13]:
# Normalize word frequencies (optional, but useful for ranking)

for word in word_frequency.keys():
    word_frequency[word] = round(word_frequency[word]/max_frequency,2)
    
# Normalize frequencies to a range from 0 to 1    
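
The loop above overwrites the raw counts in place and rounds them to two decimals. If the raw counts are still needed later, one option is to build a separate dictionary instead. The following is only a minimal sketch on toy data; `toy_tokens`, `counts`, and `normalized` are illustrative names that do not appear in the notebook.

```python
from collections import Counter

# Hypothetical toy tokens, not the notebook's full token list
toy_tokens = ["text", "summary", "text", "data", "summary", "text"]
counts = Counter(toy_tokens)

max_count = max(counts.values())

# Build a separate dict so the raw counts in `counts` stay untouched
normalized = {word: count / max_count for word, count in counts.items()}

print(normalized)
```

The most frequent word always maps to 1.0, and every other word maps to its share of that maximum.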

In [14]:
word_frequency

Counter({'text': 1.0,
         'summarization': 0.4,
         'natural': 0.2,
         'language': 0.2,
         'processing': 0.2,
         'nlp': 0.2,
         'technique': 0.2,
         'aims': 0.2,
         'generate': 0.4,
         'concise': 0.2,
         'meaningful': 0.2,
         'summaries': 0.4,
         'large': 0.2,
         'amounts': 0.2,
         'data': 0.2,
         'process': 0.2,
         'essential': 0.2,
         'extracting': 0.2,
         'key': 0.2,
         'information': 0.4,
         'reducing': 0.2,
         'redundancy': 0.2,
         'enhancing': 0.2,
         'efficiency': 0.2,
         'retrieval': 0.2,
         'applied': 0.2,
         'domains': 0.2,
         'news': 0.2,
         'articles': 0.2,
         'research': 0.2,
         'papers': 0.2,
         'customer': 0.2,
         'feedback': 0.2,
         'analysis': 0.2,
         'enabling': 0.2,
         'businesses': 0.2,
         'quickly': 0.2,
         'grasp': 0.2,
         'important': 0.2,
 

In [15]:
# Sentence tokenization

sent_token = [sent.text for sent in doc.sents]

# This creates a list of sentences from the processed document

In [16]:
sent_token

['Text summarization is a natural language processing (NLP) technique that aims to generate concise and meaningful summaries from large amounts of text data.',
 'This process is essential for extracting key information, reducing redundancy, and enhancing the efficiency of information retrieval.',
 'Text summarization can be applied in various domains such as news articles, research papers, and customer feedback analysis, enabling businesses to quickly grasp the most important insights.',
 "It can be implemented using techniques like extractive methods, which select significant portions of the text, or abstractive methods, which generate new summaries based on the text's meaning."]

In [17]:
# Calculate sentence scores

sent_score = {}  # Dictionary to hold sentence scores
for sent in sent_token:
    for word in sent.split():  # Split the sentence on whitespace (punctuation stays attached to words)
        if word.lower() in word_frequency.keys():  # Check if the lowercased word is in the frequency dictionary
            if sent not in sent_score.keys():  # If the sentence is not already in the score dictionary
                sent_score[sent] = word_frequency[word]  # Initialize its score (the lookup uses the original-case word)
            else:
                sent_score[sent] += word_frequency[word]  # Add the word's frequency to the score
        print(word)  # Debug output: print every word as it is processed

Text
summarization
is
a
natural
language
processing
(NLP)
technique
that
aims
to
generate
concise
and
meaningful
summaries
from
large
amounts
of
text
data.
This
process
is
essential
for
extracting
key
information,
reducing
redundancy,
and
enhancing
the
efficiency
of
information
retrieval.
Text
summarization
can
be
applied
in
various
domains
such
as
news
articles,
research
papers,
and
customer
feedback
analysis,
enabling
businesses
to
quickly
grasp
the
most
important
insights.
It
can
be
implemented
using
techniques
like
extractive
methods,
which
select
significant
portions
of
the
text,
or
abstractive
methods,
which
generate
new
summaries
based
on
the
text's
meaning.


In [18]:
sent_score

{'Text summarization is a natural language processing (NLP) technique that aims to generate concise and meaningful summaries from large amounts of text data.': 4.0,
 'This process is essential for extracting key information, reducing redundancy, and enhancing the efficiency of information retrieval.': 1.7999999999999998,
 'Text summarization can be applied in various domains such as news articles, research papers, and customer feedback analysis, enabling businesses to quickly grasp the most important insights.': 2.6,
 "It can be implemented using techniques like extractive methods, which select significant portions of the text, or abstractive methods, which generate new summaries based on the text's meaning.": 2.8000000000000003}

In [19]:
import pandas as pd

In [20]:
pd.DataFrame(list(sent_score.items()),columns=['Sentence','Score'])

|   | Sentence | Score |
|---|----------|-------|
| 0 | Text summarization is a natural language proce... | 4.0 |
| 1 | This process is essential for extracting key i... | 1.8 |
| 2 | Text summarization can be applied in various d... | 2.6 |
| 3 | It can be implemented using techniques like ex... | 2.8 |
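
Two details of the scoring loop affect these scores. The membership test uses `word.lower()`, but the lookup uses the original `word`; because `word_frequency` is a `Counter`, a missing key such as `'Text'` returns 0, so capitalized words add nothing. In addition, `sent.split()` leaves punctuation attached, so words like `data.` or `(NLP)` never match. The sketch below shows one possible way to handle both cases. The function name `score_sentences` and the toy inputs are assumptions made for illustration, and applying it to the notebook's data would produce different scores from the ones shown above.

```python
from string import punctuation


def score_sentences(sentences, word_frequency):
    """Sum the frequency of every known word in each sentence."""
    scores = {}
    for sent in sentences:
        total = 0.0
        for raw in sent.split():
            # Strip leading/trailing punctuation and lowercase before the lookup
            word = raw.strip(punctuation).lower()
            total += word_frequency.get(word, 0.0)
        scores[sent] = total
    return scores


# Hypothetical toy frequencies, for illustration only
toy_frequency = {"text": 1.0, "data": 0.5}
toy_scores = score_sentences(["Text data.", "No match here."], toy_frequency)
print(toy_scores)
```

This only strips punctuation at the edges of each word, so a form like `text's` still would not match `text`. Scoring the spaCy tokens of each sentence instead of whitespace-split words would be another way to keep the two steps consistent.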


In [21]:
from heapq import nlargest

In [26]:
# Extract top 'n' sentences (the summary)
# Set the number of sentences for the summary
# Get the top 'n' sentences based on their scores

num_sentences = 2
n = nlargest(num_sentences, sent_score, key=sent_score.get)
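
`nlargest` returns the selected sentences ordered from highest to lowest score, not in the order they appear in the text. In this notebook the two happen to coincide, but for other inputs the summary could read out of order. A minimal sketch of restoring document order, using hypothetical toy data in place of `sent_token` and `sent_score`:

```python
from heapq import nlargest

# Hypothetical toy data; in the notebook these would be sent_token and sent_score
toy_sentences = ["First sentence.", "Second sentence.", "Third sentence."]
toy_scores = {
    "First sentence.": 1.0,
    "Second sentence.": 0.5,
    "Third sentence.": 2.0,
}

# Ordered by score, highest first
top = nlargest(2, toy_scores, key=toy_scores.get)

# Re-sort the selected sentences by their position in the original text
ordered = sorted(top, key=toy_sentences.index)

print(top)
print(ordered)
```

Using `list.index` is fine for a handful of sentences; for long documents, and for texts with repeated sentences, a precomputed position for each sentence would be a safer choice.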

In [27]:
# Join the selected sentences into a final summary

summary = " ".join(n)

In [28]:
# Print the summary
# Output the generated summary
print(summary)

Text summarization is a natural language processing (NLP) technique that aims to generate concise and meaningful summaries from large amounts of text data. It can be implemented using techniques like extractive methods, which select significant portions of the text, or abstractive methods, which generate new summaries based on the text's meaning.


In [29]:
len(summary)

348
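
Comparing this length with the original text length gives a rough idea of how much the summary compresses the input. A small sketch using the two lengths reported in this notebook (the variable names are illustrative):

```python
text_length = 673      # len(text) reported earlier in the notebook
summary_length = 348   # len(summary) reported above

ratio = summary_length / text_length
print(f"Summary is {ratio:.0%} of the original length")
```

With `num_sentences = 2` out of four sentences, the summary keeps roughly half of the original characters.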

Explanation of Key Steps:
    
Tokenization: This process breaks down the text into individual words (tokens) while removing irrelevant words (stopwords) and punctuation.

Word Frequency Calculation: Counts how many times each word appears and normalizes these counts to a scale of 0 to 1.

Sentence Scoring: Each sentence is scored based on the sum of the normalized word frequencies of the words it contains.

Summary Extraction: The top sentences with the highest scores are selected to create a concise summary.
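
The frequency and scoring steps can also be written compactly as formulas. The notation below is an assumption introduced here for illustration: $f(w)$ is the raw count of word $w$ among the filtered tokens, $\hat{f}(w)$ is its normalized frequency, and $s$ is a sentence.

$$
\hat{f}(w) = \frac{f(w)}{\max_{w'} f(w')}, \qquad \mathrm{score}(s) = \sum_{w \in s} \hat{f}(w)
$$

Words that are not in the frequency table contribute 0 to the sum. For example, 'summarization' appears 2 times and the most frequent word, 'text', appears 5 times, so $\hat{f}(\text{summarization}) = 2/5 = 0.4$, which matches the normalized value shown earlier.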

This code summarizes a given text using Natural Language Processing (NLP) techniques. It processes the text with spaCy, removes stopwords and punctuation, and calculates the frequency of each remaining word. Each sentence is then scored by summing the normalized frequencies of the words it contains. Finally, the code selects the top-scoring sentences, up to a predefined number (num_sentences), joins them into a summary, and prints it.