
In [4]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

In [5]:
# Load the small English pipeline (install with: python -m spacy download en_core_web_sm).
nlp = spacy.load('en_core_web_sm')

In [6]:
s1="Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."

In [7]:
doc=nlp(s1)

In [8]:
len(list(doc.sents))

7

In [9]:
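# Collect content words: keep proper nouns, adjectives, nouns and verbs,
# skipping stop words and punctuation.
keyword = []
stopword = set(STOP_WORDS)
pos_tag = {"PROPN", "ADJ", "NOUN", "VERB"}
for token in doc:
    # Comparisons use the exact token text, so "Machine" and "machine"
    # are kept as two different keywords.
    if token.text in stopword or token.text in punctuation:
        continue
    if token.pos_ in pos_tag:
        keyword.append(token.text)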

In [10]:
freq_word=Counter(keyword)
freq_word.most_common(10)

[('learning', 8),
 ('Machine', 4),
 ('study', 3),
 ('algorithms', 3),
 ('task', 3),
 ('data', 3),
 ('machine', 3),
 ('computer', 2),
 ('specific', 2),
 ('mathematical', 2)]
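
The counts above list `Machine` (4) and `machine` (3) separately because keywords are compared by their exact text. A minimal sketch of folding case before ranking, using only the ten counts shown above (the names `raw_counts` and `folded` are introduced here for illustration and are not part of the notebook):

```python
from collections import Counter

# Top-10 raw counts copied from the output above.
raw_counts = Counter({
    "learning": 8, "Machine": 4, "study": 3, "algorithms": 3, "task": 3,
    "data": 3, "machine": 3, "computer": 2, "specific": 2, "mathematical": 2,
})

# Merge the counts of words that differ only by case.
folded = Counter()
for word, count in raw_counts.items():
    folded[word.lower()] += count

print(folded.most_common(2))
```

Applying the same idea in the notebook would mean lowercasing in both the keyword loop and the sentence-scoring loop, so that the lookups still match.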

In [11]:
# Normalize every count by the highest count so the scores fall in (0, 1].
max_freq = freq_word.most_common(1)[0][1]
for word in freq_word:
    freq_word[word] = freq_word[word] / max_freq
freq_word.most_common()

[('learning', 1.0),
 ('Machine', 0.5),
 ('study', 0.375),
 ('algorithms', 0.375),
 ('task', 0.375),
 ('data', 0.375),
 ('machine', 0.375),
 ('computer', 0.25),
 ('specific', 0.25),
 ('mathematical', 0.25),
 ('predictions', 0.25),
 ('focuses', 0.25),
 ('application', 0.25),
 ('field', 0.25),
 ('ML', 0.125),
 ('scientific', 0.125),
 ('statistical', 0.125),
 ('models', 0.125),
 ('systems', 0.125),
 ('use', 0.125),
 ('improve', 0.125),
 ('performance', 0.125),
 ('build', 0.125),
 ('model', 0.125),
 ('sample', 0.125),
 ('known', 0.125),
 ('training', 0.125),
 ('order', 0.125),
 ('decisions', 0.125),
 ('programmed', 0.125),
 ('perform', 0.125),
 ('applications', 0.125),
 ('email', 0.125),
 ('filtering', 0.125),
 ('detection', 0.125),
 ('network', 0.125),
 ('intruders', 0.125),
 ('vision', 0.125),
 ('infeasible', 0.125),
 ('develop', 0.125),
 ('algorithm', 0.125),
 ('instructions', 0.125),
 ('performing', 0.125),
 ('related', 0.125),
 ('computational', 0.125),
 ('statistics', 0.125),
 ('makin

In [12]:
# Score each sentence as the sum of the normalized frequencies of its keywords.
sent_strength = {}
for sent in doc.sents:
    for word in sent:
        if word.text in freq_word:
            sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]
print(sent_strength)

{Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.: 4.125, Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task.: 4.625, Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task.: 4.25, Machine learning is closely related to computational statistics, which focuses on making predictions using computers.: 2.625, The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.: 3.125, Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learnin

In [13]:
summarized_sentences=nlargest(3,sent_strength,key=sent_strength.get)
summarized_sentences

[Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task.,
 Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task.,
 Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning.]

In [14]:
print(type(summarized_sentences[0]))

<class 'spacy.tokens.span.Span'>


In [15]:
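# Convert the selected Span objects to strings and concatenate them.
# "".join adds no separator, so the sentences run together in the output.
final_sentences = [w.text for w in summarized_sentences]
summary = "".join(final_sentences)
print(summary)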

Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task.Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task.Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning.
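
`nlargest` returns sentences in descending score order, which need not match their order in the text, and `"".join` leaves no space between them. The following is a sketch of restoring document order before joining. The sentences and scores are hypothetical placeholders, not values from this notebook, and `summary_text` is a name introduced for illustration:

```python
from heapq import nlargest

# Hypothetical sentences and scores, for illustration only.
sentences = ["First point.", "Second point.", "Third point.", "Fourth point."]
scores = {0: 1.5, 1: 3.0, 2: 0.5, 3: 4.0}  # sentence index -> score

# Pick the two highest-scoring sentence indices, then sort them back
# into the order in which they appear in the text.
top = nlargest(2, scores, key=scores.get)
ordered = sorted(top)

# Join with a space so the sentences do not run together.
summary_text = " ".join(sentences[i] for i in ordered)
print(summary_text)
```

With spaCy `Span` objects, the same ordering can be obtained by sorting the selected sentences on `sent.start`.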


In [16]:
import re

In [17]:
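def analyze_text(text):
    """Return (characters, punctuation marks, sentences, spaces) for text."""
    character_count = len(text)
    # Anything that is neither a word character nor whitespace counts as punctuation.
    punctuation_count = len(re.findall(r'[^\w\s]', text))
    # Each run of '.', '!' or '?' is counted as one sentence ending.
    sentence_count = len(re.findall(r'[.!?]+', text))
    space_count = text.count(' ')
    return character_count, punctuation_count, sentence_count, space_count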
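
Because `sentence_count` counts every run of `.`, `!` or `?`, periods inside abbreviations or decimals are counted as well. A small check on a made-up string (the sample text is an assumption for illustration only):

```python
import re

# Two sentences, but with a period in an abbreviation and in a decimal.
sample = "Dr. Lee scored 9.5 points. Was it enough?"

# Same pattern as analyze_text: one match per run of '.', '!' or '?'.
matches = re.findall(r'[.!?]+', sample)
print(len(matches), matches)
```

For the text in this notebook the regex count and `doc.sents` agree (both give 7), since the text contains no such abbreviations or decimals.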

In [18]:
orig_char_count, orig_punct_count, orig_sentence_count, orig_space_count = analyze_text(s1)
sum_char_count, sum_punct_count, sum_sentence_count, sum_space_count = analyze_text(summary)

In [19]:
print(f"\nOriginal Text Analysis:\nCharacters: {orig_char_count}, Punctuation Marks: {orig_punct_count}, Sentences: {orig_sentence_count}, Spaces: {orig_space_count}")
print(f"Summarized Text Analysis:\nCharacters: {sum_char_count}, Punctuation Marks: {sum_punct_count}, Sentences: {sum_sentence_count}, Spaces: {sum_space_count}")


Original Text Analysis:
Characters: 1069, Punctuation Marks: 19, Sentences: 7, Spaces: 152
Summarized Text Analysis:
Characters: 548, Punctuation Marks: 10, Sentences: 3, Spaces: 78
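
The two character counts can be turned into a compression ratio. This is a minimal sketch using the numbers printed above; the variable names are introduced here for illustration:

```python
# Character counts copied from the output above.
orig_chars = 1069
sum_chars = 548

# Fraction of the original length retained by the summary.
ratio = sum_chars / orig_chars
print(f"Summary keeps {ratio:.1%} of the original characters")
```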
