# **Text-Summarization**

Text summarization in NLP covers methods that automatically generate summaries containing the most relevant information from a source text. There are two families of techniques: extractive and abstractive. Extractive techniques select the most important word sequences of the document and assemble them into a summary of the given text. Abstractive techniques generate new text that paraphrases the content of the original document, much like humans do when they write an abstract. In this section, we focus on extractive techniques [[1]](#scrollTo=8Pzkt1Z_M6OH).


This notebook shows an example of unsupervised text summarization with TextRank.

A common unsupervised extractive summarization technique is TextRank. TextRank compares every sentence in the text with every other sentence by calculating a similarity score, for example the cosine similarity, for each sentence pair. The closer the score is to 1, the more similar the two sentences are. A sentence that is similar to many other sentences represents the content of the text well, so the scores are summed up for each sentence to obtain its rank. The higher the rank, the more important the sentence is in the text. Finally, the sentences are sorted by rank, and a summary is built from a defined number of the highest-ranked sentences. TextRank is inspired by PageRank, the algorithm developed by the founders of Google to rank web pages by their importance.
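
As a rough illustration of the scoring idea described above, the following sketch ranks sentences by their summed cosine similarity. It is a simplified, self-contained approximation and not PyTextRank's implementation: the bag-of-words vectors, the regex tokenizer, the sample sentences, and the function names (`bag_of_words`, `cosine_similarity`, `rank_sentences`, `summarize`) are assumptions made for this example. The original TextRank algorithm runs an iterative PageRank-style computation on the similarity graph instead of a plain sum.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(sentences):
    """Score each sentence by its summed similarity to all other sentences."""
    vectors = [bag_of_words(s) for s in sentences]
    scores = []
    for i, vec in enumerate(vectors):
        score = sum(
            cosine_similarity(vec, other)
            for j, other in enumerate(vectors)
            if i != j
        )
        scores.append(score)
    return scores


def summarize(sentences, limit=1):
    """Return the `limit` highest-scoring sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:limit]
    return [sentences[i] for i in sorted(top)]


# Hypothetical sample sentences, chosen only to illustrate the scoring
sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stock prices fell sharply today.",
]
print(rank_sentences(sentences))
print(summarize(sentences, limit=1))
```

The third sentence shares no words with the other two, so its score is 0 and it is not selected for a one-sentence summary.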


A text summarizer produces a short summary of a longer text. <br>
[spaCy](https://spacy.io) together with [PyTextRank](https://github.com/DerwenAI/pytextrank) provides an extractive summarization workflow, which is presented in this notebook.

The following example is based on a code snippet from the PyTextRank documentation: [derwen.ai](https://derwen.ai/docs/ptr/explain_summ/)

## Load resources

### Install PyTextRank

In [None]:
# Install PyTextRank.
# PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension.
!pip install pytextrank==3.0.1

### Load language model
We download the "en_core_web_sm" English language model using the spaCy library.
It is a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax, and entities [[4]](https://spacy.io/models).
It is optimized for CPU and its components are: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer [[5]](https://spacy.io/models/en).

In [None]:
# Download "en_core_web_sm" English language model
!python -m spacy download en_core_web_sm

### Import libraries and warnings

In [None]:
# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Import spaCy and pytextrank libraries
import spacy
import pytextrank

# Load the small English pipeline (tokenizer, tagger, parser, NER).
# Note: the "sm" models do not ship with static word vectors.
sp = spacy.load('en_core_web_sm')

## Prepare pipeline

In [None]:
# Add the TextRank component to the end of the spaCy pipeline
sp.add_pipe('textrank', last=True)

<pytextrank.base.BaseTextRank at 0x7f52462466d0>

## Basic text summarization example

In [None]:
# Process the sample text into a spaCy Doc
doc = sp(
    "Mr. and Mrs. Dursley, of number four, Private Drive, were proud to say \
they were perfectly normal, thank you very much. They were the last people you'd expect \
to be involved in anything strange or mysterious, because they just didn't hold with such nonsense."
)

In [None]:
# Print the noun chunks of the sample text
for chunk in doc.noun_chunks:
    print(chunk)

Mr. and Mrs. Dursley
number
Private Drive
they
you
They
the last people
you
anything
they
such nonsense


In [None]:
# Print entities of the document
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)

Dursley PERSON 3 4
number four CARDINAL 6 8
Private Drive FAC 9 11


In [None]:
# Show the top-ranked phrases.
#
# PyTextRank iterates through each sentence in the doc, constructs a
# lemma graph (https://derwen.ai/docs/ptr/glossary/#lemma-graph),
# and exposes the ranked phrases as doc._.phrases.

for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)

0.2058     1  such nonsense
[such nonsense]
0.1477     2  Private Drive
[Private Drive, Private Drive]
0.1000     1  number
[number]
0.0883     1  Dursley
[Dursley]
0.0697     1  Mr. and Mrs. Dursley
[Mr. and Mrs. Dursley]
0.0520     1  the last people
[the last people]
0.0462     1  number four
[number four]
0.0000     1  They
[They]
0.0000     1  anything
[anything]
0.0000     2  they
[they, they]
0.0000     2  you
[you, you]


In [None]:
# Construct a list of the sentence boundaries, each with a phrase vector
# (initialized to an empty set).
sent_bounds = [[s.start, s.end, set()] for s in doc.sents]
print(sent_bounds)

[[0, 26, set()], [26, 53, set()]]


In [None]:
# Iterate through the top-ranked phrases and add them to the 
# phrase-vector for each sentence.

limit_phrases = 4

phrase_id = 0
unit_vector = []

for p in doc._.phrases:
    print(phrase_id, p.text, p.rank)

    unit_vector.append(p.rank)

    for chunk in p.chunks:
        #print(" ", chunk.start, chunk.end)

        for sent_start, sent_end, sent_vector in sent_bounds:
            if chunk.start >= sent_start and chunk.end <= sent_end:
                #print(" ", sent_start, chunk.start, chunk.end, sent_end)
                sent_vector.add(phrase_id)
                break

    phrase_id += 1

    if phrase_id == limit_phrases:
        break

0 such nonsense 0.20578727368601007
1 Private Drive 0.1476932514643156
2 number 0.10003737208485038
3 Dursley 0.08830619471948406


In [None]:
# Show the results

# Look at the sentence boundaries with their phrase vectors
print(sent_bounds)

[[0, 26, {1, 2, 3}], [26, 53, {0}]]
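
The containment test used above can also be tried without spaCy. The sketch below replays it on plain token offsets. The sentence boundaries are taken from the output above, and the offsets for "Dursley" and "Private Drive" come from the entity output. The offsets for "such nonsense" and "number" were counted by hand and are assumptions, as is the name `top_phrases`.

```python
# Sentence boundaries as [start_token, end_token, phrase_ids]
sent_bounds = [[0, 26, set()], [26, 53, set()]]

# (phrase text, [(start_token, end_token), ...]) for the four top-ranked phrases.
# The offsets for "such nonsense" and "number" are hand-counted assumptions.
top_phrases = [
    ("such nonsense", [(50, 52)]),
    ("Private Drive", [(9, 11)]),
    ("number", [(6, 7)]),
    ("Dursley", [(3, 4)]),
]

for phrase_id, (text, spans) in enumerate(top_phrases):
    for start, end in spans:
        for sent_start, sent_end, phrase_ids in sent_bounds:
            # A phrase belongs to the sentence that fully contains its span
            if start >= sent_start and end <= sent_end:
                phrase_ids.add(phrase_id)
                break

print(sent_bounds)
```

Under these assumed offsets, the first sentence collects phrase ids 1, 2, and 3, and the second sentence collects phrase id 0, which matches the output shown above.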


In [None]:
# We also construct a unit_vector for all of the phrases, up to the limit requested.
print(unit_vector)

[0.20578727368601007, 0.1476932514643156, 0.10003737208485038, 0.08830619471948406]


In [None]:
# Normalize the unit_vector so that its components sum to 1
sum_ranks = sum(unit_vector)
unit_vector = [ rank/sum_ranks for rank in unit_vector ]

unit_vector

[0.37980458370468767,
 0.2725852424381945,
 0.18463071976729584,
 0.1629794540898221]

In [None]:
# Iterate through each sentence and calculate its Euclidean distance from the unit vector.
# Only the phrases that are missing from the sentence contribute to the distance.

from math import sqrt

sent_rank = {}
sent_id = 0

for sent_start, sent_end, sent_vector in sent_bounds:
    #print(sent_vector)
    sum_sq = 0.0

    for phrase_id in range(len(unit_vector)):
        #print(phrase_id, unit_vector[phrase_id])

        if phrase_id not in sent_vector:
            sum_sq += unit_vector[phrase_id]**2.0

    sent_rank[sent_id] = sqrt(sum_sq)
    sent_id += 1

print(sent_rank)

{0: 0.37980458370468767, 1: 0.3673602040672008}
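
Written out, the normalization and distance steps above amount to the following. The symbols $r_i$ (rank of phrase $i$), $u_i$ (its normalized weight), $P_s$ (the set of top-phrase ids found in sentence $s$), and $d_s$ (the distance of sentence $s$) are notation introduced here for illustration and are not taken from PyTextRank.

$$
u_i = \frac{r_i}{\sum_{j} r_j}, \qquad d_s = \sqrt{\sum_{i \notin P_s} u_i^{2}}
$$

For the two sentences of the sample text, with values rounded to four decimals:

$$
d_0 = \sqrt{0.3798^{2}} \approx 0.3798, \qquad d_1 = \sqrt{0.2726^{2} + 0.1846^{2} + 0.1630^{2}} \approx 0.3674
$$

A sentence that contains more of the highly weighted phrases leaves fewer terms in the sum and therefore gets a smaller distance.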


In [None]:
# Sort the sentences by ascending distance (the closest sentence comes first)
from operator import itemgetter

sorted(sent_rank.items(), key=itemgetter(1)) 

[(1, 0.3673602040672008), (0, 0.37980458370468767)]

In [None]:
# Extract the sentences with the lowest distance, up to the limit requested.
limit_sentences = 2

sent_text = {}
sent_id = 0

for sent in doc.sents:
    sent_text[sent_id] = sent.text
    sent_id += 1

num_sent = 0

for sent_id, rank in sorted(sent_rank.items(), key=itemgetter(1)):
    print(sent_id, sent_text[sent_id])
    num_sent += 1

    if num_sent == limit_sentences:
        break

1 They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.
0 Mr. and Mrs. Dursley, of number four, Private Drive, were proud to say they were perfectly normal, thank you very much.
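
The normalization, distance, and sorting steps can be collected into one helper. The following is a minimal sketch that works on plain Python values instead of a spaCy `Doc`. The function name `rank_sentences_by_distance` and its parameters are hypothetical, and the input numbers are copied from the outputs above.

```python
from math import sqrt


def rank_sentences_by_distance(phrase_ranks, sent_phrase_ids):
    """Return (sentence_id, distance) pairs sorted by ascending distance.

    phrase_ranks: ranks of the top phrases, indexed by phrase id
                  (assumed to have a non-zero sum).
    sent_phrase_ids: one set of phrase ids per sentence.
    """
    total = sum(phrase_ranks)
    unit_vector = [rank / total for rank in phrase_ranks]

    distances = []
    for sent_id, phrase_ids in enumerate(sent_phrase_ids):
        # Only phrases missing from the sentence add to its distance
        sum_sq = sum(
            weight ** 2
            for phrase_id, weight in enumerate(unit_vector)
            if phrase_id not in phrase_ids
        )
        distances.append((sent_id, sqrt(sum_sq)))

    return sorted(distances, key=lambda pair: pair[1])


# Values copied from the notebook outputs above
phrase_ranks = [0.20578727368601007, 0.1476932514643156,
                0.10003737208485038, 0.08830619471948406]
sent_phrase_ids = [{1, 2, 3}, {0}]

for sent_id, distance in rank_sentences_by_distance(phrase_ranks, sent_phrase_ids):
    print(sent_id, round(distance, 4))
```

With these inputs, sentence 1 is listed before sentence 0, in line with the summary order printed above.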


# **References**

- [1] NLP and Computer Vision_DLMAINLPCV01 Course Book
- [2] https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html
- https://www.kaggle.com/code/aggarwalrahul/nlp-text-summarization-using-textrank
- https://colab.research.google.com/github/prateekjoshi565/textrank_text_summarization/blob/master/TestRank_Text_Summarization.ipynb#scrollTo=jwxtPBlgO_Gk
- Dataset: https://www.kaggle.com/code/emrahyener/nlp-text-summarization-using-textrank/edit

Copyright © 2022 IU International University of Applied Sciences