# **Text summarization**

A text summarizer produces a short summary of a longer text. <br>
[spaCy](https://spacy.io) together with [pyTextRank](https://github.com/DerwenAI/pytextrank) provides an extractive text summarization approach, which is presented in this notebook.

The following example is based on this pytextrank code snippet: [derwen.ai](https://derwen.ai/docs/ptr/explain_summ/)

#### install additional libraries

In [None]:
!pip install pytextrank==3.0.1

In [None]:
# download language model for spacy
!python -m spacy download en_core_web_sm

#### load resources

In [28]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
# resources
#
import spacy
import pytextrank
import wikipedia
#
# Load the small English pipeline (tokenizer, tagger, parser, NER);
# note that the "sm" models do not ship with word vectors
sp = spacy.load('en_core_web_sm')
#
# core-model with German language:
#sp = spacy.load('de_core_news_sm')
#

In [5]:
# prepare pipeline
#
sp.add_pipe('textrank', last=True)

<pytextrank.base.BaseTextRank at 0x7f55ae84e9d0>

#### basic example

In [13]:
# create sample text
#
doc = sp(
    "Mr. and Mrs. Dursley, of number four, Private Drive, were proud to say \
    that they were perfectly normal, thank you very much. They were the last people you'd expect \
    to be involved in anything strange or mysterious, because they just didn't hold with such nonsense."
    )

In [16]:
# extract the noun chunks of the sample text
#
for chunk in doc.noun_chunks:
    print(chunk)

Mr. and Mrs. Dursley
number
Private Drive
they
you
They
the last people
you
anything
they
such nonsense


In [17]:
# extract entities
#
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)

Dursley PERSON 3 4
number four CARDINAL 6 8
Private Drive FAC 9 11


In [18]:
# show top rated phrases
#
# Iterate through each sentence in the doc, constructing a
# [*lemma graph*](https://derwen.ai/docs/ptr/glossary/#lemma-graph)
# then returning the top-ranked phrases.
#
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)

0.2058     1  such nonsense
[such nonsense]
0.1477     2  Private Drive
[Private Drive, Private Drive]
0.1000     1  number
[number]
0.0883     1  Dursley
[Dursley]
0.0697     1  Mr. and Mrs. Dursley
[Mr. and Mrs. Dursley]
0.0520     1  the last people
[the last people]
0.0462     1  number four
[number four]
0.0000     1  They
[They]
0.0000     1  anything
[anything]
0.0000     2  they
[they, they]
0.0000     2  you
[you, you]


In [19]:
# Construct a list of the sentence boundaries with a phrase vector (initialized to empty set) for each sentence.
#
sent_bounds = [[s.start, s.end, set()] for s in doc.sents]
sent_bounds

[[0, 28, set()], [28, 56, set()]]

In [20]:
# Iterate through the top-ranked phrases,
# adding them to the phrase vector of the sentence that contains them.
#
limit_phrases = 4

phrase_id = 0
unit_vector = []

for p in doc._.phrases:
    print(phrase_id, p.text, p.rank)

    unit_vector.append(p.rank)

    for chunk in p.chunks:
        print(" ", chunk.start, chunk.end)

        for sent_start, sent_end, sent_vector in sent_bounds:
            if chunk.start >= sent_start and chunk.end <= sent_end:
                print(" ", sent_start, chunk.start, chunk.end, sent_end)
                sent_vector.add(phrase_id)
                break

    phrase_id += 1

    if phrase_id == limit_phrases:
        break

0 such nonsense 0.20578727368601007
  53 55
  28 53 55 56
1 Private Drive 0.1476932514643156
  9 11
  0 9 11 28
  9 11
  0 9 11 28
2 number 0.10003737208485038
  6 7
  0 6 7 28
3 Dursley 0.08830619471948406
  3 4
  0 3 4 28


In [21]:
# display the results
#
sent_bounds

[[0, 28, {1, 2, 3}], [28, 56, {0}]]
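
The assignment step above can also be tried without spaCy. The following is a minimal sketch, assuming the token offsets printed in the previous output are passed in as plain tuples; the function name `assign_phrases_to_sentences` and its parameters are hypothetical and not part of pytextrank.

```python
def assign_phrases_to_sentences(sentence_spans, phrase_chunks):
    """Return, per sentence, the set of phrase ids whose chunks fall inside it.

    sentence_spans: list of (start, end) token offsets, end exclusive.
    phrase_chunks:  list indexed by phrase id; each entry is a list of
                    (start, end) token offsets for that phrase's chunks.
    """
    sentence_vectors = [set() for _ in sentence_spans]
    for phrase_id, chunks in enumerate(phrase_chunks):
        for chunk_start, chunk_end in chunks:
            for sent_id, (sent_start, sent_end) in enumerate(sentence_spans):
                if sent_start <= chunk_start and chunk_end <= sent_end:
                    sentence_vectors[sent_id].add(phrase_id)
                    break  # a chunk lies inside at most one sentence
    return sentence_vectors


# Offsets copied from the notebook output above (plain-data stand-in
# for the spaCy objects; this is an illustrative assumption).
sentence_spans = [(0, 28), (28, 56)]
phrase_chunks = [
    [(53, 55)],          # 0: such nonsense
    [(9, 11), (9, 11)],  # 1: Private Drive (chunk reported twice)
    [(6, 7)],            # 2: number
    [(3, 4)],            # 3: Dursley
]

sentence_vectors = assign_phrases_to_sentences(sentence_spans, phrase_chunks)
print(sentence_vectors)  # [{1, 2, 3}, {0}]
```

Because each sentence collects its phrase ids in a set, the duplicated `Private Drive` chunk is counted only once.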

In [22]:
for sent in doc.sents:
    print(sent)

Mr. and Mrs. Dursley, of number four, Private Drive, were proud to say     that they were perfectly normal, thank you very much.
They were the last people you'd expect     to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.


In [23]:
# We also construct a unit_vector for all of the phrases, up to the limit requested.
#
unit_vector

[0.20578727368601007,
 0.1476932514643156,
 0.10003737208485038,
 0.08830619471948406]

In [24]:
# Normalize the ranks so that they sum to 1.
sum_ranks = sum(unit_vector)
unit_vector = [rank / sum_ranks for rank in unit_vector]

unit_vector

[0.37980458370468767,
 0.2725852424381945,
 0.18463071976729584,
 0.1629794540898221]
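
The normalization can be checked in isolation. This is a small sketch, assuming the four rank values shown earlier; the helper name `normalize_ranks` and its zero-sum guard are hypothetical additions. Note that the result sums to 1, so despite its name it is a vector of weights rather than a vector of Euclidean length 1.

```python
from math import sqrt


def normalize_ranks(ranks):
    """Scale non-negative ranks so that they sum to 1."""
    total = sum(ranks)
    if total == 0:
        # All ranks are zero (e.g. only pronouns were found): nothing to scale.
        return [0.0 for _ in ranks]
    return [rank / total for rank in ranks]


# Rank values copied from the notebook output above.
ranks = [
    0.20578727368601007,
    0.1476932514643156,
    0.10003737208485038,
    0.08830619471948406,
]

weights = normalize_ranks(ranks)
print("sum of weights:", sum(weights))
print("Euclidean norm:", sqrt(sum(w * w for w in weights)))
```

The guard for an all-zero input is a design choice of this sketch; the notebook cell would raise a `ZeroDivisionError` in that case.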

In [25]:
# Iterate through each sentence, calculating its Euclidean distance from the unit vector.
#
from math import sqrt

sent_rank = {}
sent_id = 0

for sent_start, sent_end, sent_vector in sent_bounds:
    print(sent_vector)
    sum_sq = 0.0

    for phrase_id in range(len(unit_vector)):
        print(phrase_id, unit_vector[phrase_id])

        if phrase_id not in sent_vector:
            sum_sq += unit_vector[phrase_id]**2.0

    sent_rank[sent_id] = sqrt(sum_sq)
    sent_id += 1

print(sent_rank)

{1, 2, 3}
0 0.37980458370468767
1 0.2725852424381945
2 0.18463071976729584
3 0.1629794540898221
{0}
0 0.37980458370468767
1 0.2725852424381945
2 0.18463071976729584
3 0.1629794540898221
{0: 0.37980458370468767, 1: 0.3673602040672008}
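
The loop above only adds the squared weight of each phrase that a sentence is missing, so a sentence containing all top phrases would get distance 0. As a sketch of the same calculation on plain data (the function name `sentence_distances` is hypothetical, and the inputs are copied from the outputs above):

```python
from math import sqrt


def sentence_distances(weights, sentence_vectors):
    """Distance of each sentence from the weight vector.

    A sentence is treated as matching the weight in every position whose
    phrase it contains and as 0 elsewhere, so only missing phrases
    contribute to the distance.
    """
    distances = {}
    for sent_id, present in enumerate(sentence_vectors):
        missing_sq = sum(
            weight ** 2
            for phrase_id, weight in enumerate(weights)
            if phrase_id not in present
        )
        distances[sent_id] = sqrt(missing_sq)
    return distances


# Values copied from the notebook outputs above.
weights = [
    0.37980458370468767,
    0.2725852424381945,
    0.18463071976729584,
    0.1629794540898221,
]
sentence_vectors = [{1, 2, 3}, {0}]

distances = sentence_distances(weights, sentence_vectors)
print(distances)
```

Sentence 0 is missing only phrase 0, so its distance equals that phrase's weight; sentence 1 is missing phrases 1 to 3, whose combined contribution is slightly smaller.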


In [26]:
# Sort the sentence indexes by ascending distance (closest sentence first)
#
from operator import itemgetter

sorted(sent_rank.items(), key=itemgetter(1))

[(1, 0.3673602040672008), (0, 0.37980458370468767)]

In [27]:
# Extract the sentences with the lowest distance, up to the limit requested.
#
limit_sentences = 2

sent_text = {}
sent_id = 0

for sent in doc.sents:
    sent_text[sent_id] = sent.text
    sent_id += 1

num_sent = 0

for sent_id, rank in sorted(sent_rank.items(), key=itemgetter(1)):
    print(sent_id, sent_text[sent_id])
    num_sent += 1

    if num_sent == limit_sentences:
        break

1 They were the last people you'd expect     to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.
0 Mr. and Mrs. Dursley, of number four, Private Drive, were proud to say     that they were perfectly normal, thank you very much.
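
The output above lists the sentences by distance, so the second sentence of the text is printed first. For a readable summary it can help to pick the closest sentences and then restore their original order. The following is a minimal sketch of that idea on plain data; `select_summary` and its `preserve_order` parameter are hypothetical names for this illustration, and the sentence strings are shortened placeholders.

```python
def select_summary(distances, sentences, limit_sentences, preserve_order=True):
    """Pick the sentences with the lowest distance.

    distances: dict mapping sentence id to distance (lower is better).
    sentences: list of sentence texts, indexed by sentence id.
    """
    # Sentence ids sorted by ascending distance, truncated to the limit.
    chosen = sorted(distances, key=distances.get)[:limit_sentences]
    if preserve_order:
        # Re-sort the chosen ids so the summary follows the source text.
        chosen.sort()
    return [sentences[sent_id] for sent_id in chosen]


# Distances copied from the notebook output; texts are shortened placeholders.
distances = {0: 0.37980458370468767, 1: 0.3673602040672008}
sentences = [
    "Mr. and Mrs. Dursley ... were perfectly normal, thank you very much.",
    "They were the last people you'd expect ... such nonsense.",
]

top_one = select_summary(distances, sentences, limit_sentences=1)
in_order = select_summary(distances, sentences, limit_sentences=2)
by_rank = select_summary(distances, sentences, 2, preserve_order=False)
print(top_one)
print(in_order)
```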


Copyright © 2021 IUBH Internationale Hochschule