# **Text-Summarization**

Text summarization in NLP describes methods that automatically generate summaries containing the most relevant information from source texts. There are two families of techniques: extractive and abstractive. Extractive techniques select the most important word sequences of the document to produce a summary of the given text. Abstractive techniques generate new text that paraphrases the content of the original document, much like humans do when they write an abstract [[1]](#scrollTo=8Pzkt1Z_M6OH).

## Unsupervised extractive text summarization with TextRank
In this section, we focus on examples of unsupervised extractive text summarization with TextRank.

TextRank is a common unsupervised extractive summarization technique. It compares every sentence in the text with every other sentence by calculating a similarity score for each sentence pair, for example the cosine similarity. The closer the score is to 1, the more similar the two sentences are; a sentence that is similar to many other sentences represents the text well. These scores are summed up for each sentence to get a rank. The higher the rank, the more important the sentence is in the text. Finally, the sentences are sorted by rank, and a summary is built from a defined number of the highest-ranked sentences. TextRank is inspired by PageRank, an algorithm developed by Google to rank web pages by their importance [[1]](#scrollTo=8Pzkt1Z_M6OH).
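The ranking idea described above can be sketched in plain Python. This is a minimal illustration of the simplified description (summing pairwise cosine similarities), not the implementation used by ``pytextrank``; the helper names ``cosine_similarity`` and ``rank_sentences`` and the three toy sentences are assumptions made for this example. It represents each sentence as a bag of lowercase words.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(sentences):
    """Score each sentence by its summed similarity to all other sentences."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for i, bag in enumerate(bags):
        total = sum(
            cosine_similarity(bag, other)
            for j, other in enumerate(bags)
            if j != i
        )
        scores.append(total)
    # Highest summed similarity first
    return sorted(zip(scores, sentences), reverse=True)


toy_sentences = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
ranked = rank_sentences(toy_sentences)
for score, sentence in ranked:
    print(f"{score:.3f}  {sentence}")
```

The two sentences about the cat share most of their words and therefore rank above the unrelated third sentence, which shares no words with the others and receives a score of 0.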

Unsupervised text summarization can be tackled with the ``spacy`` library and the TextRank algorithm by using the ``pytextrank`` library. For more details about ``spacy`` and ``pytextrank`` libraries, please refer to [[2]](https://spacy.io/) and [[3]](https://derwen.ai/docs/ptr/).

The following example is based on [[4]](https://derwen.ai/docs/ptr/explain_summ/).

### Install ``pytextrank``

``pytextrank`` is a Python implementation of TextRank for use in ``spacy`` pipelines. It provides fast, effective phrase extraction from texts, along with extractive summarization. The graph algorithm works independently of a specific natural language and does not require domain knowledge [[5]](https://spacy.io/universe/project/spacy-pytextrank).



In [None]:
# Install PyTextRank 
!pip install pytextrank==3.0.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytextrank==3.0.1
  Downloading pytextrank-3.0.1-py3-none-any.whl (19 kB)
Collecting icecream>=2.1
  Downloading icecream-2.1.2-py2.py3-none-any.whl (8.3 kB)
Collecting spacy>=3.0
  Downloading spacy-3.3.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.2 MB)
Collecting graphviz>=0.13
  Downloading graphviz-0.20-py3-none-any.whl (46 kB)
Collecting asttokens>=2.0.1
  Downloading asttokens-2.0.5-py2.py3-none-any.whl (20 kB)
Collecting colorama>=0.3.9
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting executing>=0.3.1
  Downloading executing-0.8.3-py2.py3-none-any.whl (16 kB)
Collecting spacy-legacy<3.1.0,>=3.0.9
  Downloading spacy_legacy-3.0.9-py2.py3-none-any.whl (20 kB)
Collecting typing-extensions<4.0.0.0,>=3.7.4
  D

### Download and install the language model
We download and install the ``en_core_web_sm`` English language model by using the ``spacy`` library.
It is a small English pipeline trained on written web text (blogs, news, comments), that includes vocabulary, syntax and entities [[6]](https://spacy.io/models).
It is optimized for CPU and its components are: ``tok2vec``, ``tagger``, ``parser``, ``senter``, ``ner``, ``attribute_ruler``, ``lemmatizer`` [[7]](https://spacy.io/models/en).

In [None]:
# Download "en_core_web_sm" English language model
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Import libraries

We import the ``spacy`` and ``pytextrank`` libraries.

``spacy`` is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning [[8]](https://spacy.io/usage/spacy-101). For example, it supports the implementation of tasks such as sentiment analysis, chatbots, text summarization, and intent and entity extraction [[1]](#scrollTo=8Pzkt1Z_M6OH). For more information about ``spacy``, please refer to [[2]](https://spacy.io/).

In [None]:
# Import spaCy and pytextrank libraries
import spacy
import pytextrank

### Load the installed language model


In [None]:
# Load the language model with the package name or a path to the data directory
sp = spacy.load('en_core_web_sm')

### Prepare pipeline

We use the ``add_pipe()`` method to add a component to the processing pipeline. Here, we add the ``textrank`` component provided by ``pytextrank`` to the end of the ``spacy`` pipeline.

In [None]:
# Add PyTextRank to the spaCy pipeline
sp.add_pipe('textrank', last=True)

<pytextrank.base.BaseTextRank at 0x7f1782a759d0>

### Basic text-summarization example

#### Create sample text

In [None]:
# Process the sample text with the pipeline to create a spaCy Doc object
doc = sp(
    "Mr. and Mrs. Dursley, of number four, Private Drive, were proud to say \
they were perfectly normal, thank you very much. They were the last people you'd expect \
to be involved in anything strange or mysterious, because they just didn't hold with such nonsense."
)

#### Print the noun chunks

We use the ``noun_chunks`` property to iterate over the base noun phrases in the document.

In [None]:
# Print the noun chunks of the sample text
for chunk in doc.noun_chunks:
    print(chunk)

Mr. and Mrs. Dursley
number
Private Drive
they
you
They
the last people
you
anything
they
such nonsense


#### Print entities of the document

Named entities are available through the ``doc.ents`` property, which is the standard way to access entity annotations. The entity type is accessible either as a hash value or as a string, using the attributes ``ent.label`` and ``ent.label_`` respectively [[9]](https://spacy.io/usage/linguistic-features). The ``ent.start`` and ``ent.end`` attributes printed below are token offsets within the ``doc``.

In [None]:
# Print entities of the document
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)

Dursley PERSON 3 4
number four CARDINAL 6 8
Private Drive FAC 9 11


#### Show top rated phrases
``pytextrank`` iterates through each sentence in the ``doc``, constructs a lemma graph, and returns the top-ranked phrases [[10]](https://derwen.ai/docs/ptr/glossary/#lemma-graph). The phrases are available as ``doc._.phrases``, sorted by rank in descending order.

In [None]:
# Show top rated phrases
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)

0.2058     1  such nonsense
[such nonsense]
0.1477     2  Private Drive
[Private Drive, Private Drive]
0.1000     1  number
[number]
0.0883     1  Dursley
[Dursley]
0.0697     1  Mr. and Mrs. Dursley
[Mr. and Mrs. Dursley]
0.0520     1  the last people
[the last people]
0.0462     1  number four
[number four]
0.0000     1  They
[They]
0.0000     1  anything
[anything]
0.0000     2  they
[they, they]
0.0000     2  you
[you, you]


#### Create a list of the sentence boundaries
Create a list of the sentence boundaries with a phrase vector (initialized to an empty set) for each sentence.

In [None]:
# Create a list of the sentence boundaries
sent_bounds = [[s.start, s.end, set()] for s in doc.sents]
print(sent_bounds)

[[0, 26, set()], [26, 53, set()]]


#### Add top-ranked phrases to the phrase-vector 
Iterate through the top-ranked phrases and add them to the phrase vector for each sentence.

In [None]:
# Set the limit of the iteration
## This defines how many of the top-ranked phrases are used
limit_phrases = 4

# Set "phrase_id" to zero and increase after each iteration
phrase_id = 0

# Create an empty list as "unit_vector" to keep the rank data of the phrases
unit_vector = []

# List the top 4 phrases and append into unit_vector
for p in doc._.phrases:
    print(phrase_id, p.text, p.rank)

    unit_vector.append(p.rank)

    for chunk in p.chunks:
        #print(" ", chunk.start, chunk.end)

        for sent_start, sent_end, sent_vector in sent_bounds:
            if chunk.start >= sent_start and chunk.end <= sent_end:
                #print(" ", sent_start, chunk.start, chunk.end, sent_end)
                sent_vector.add(phrase_id)
                break

    phrase_id += 1

    if phrase_id == limit_phrases:
        break

0 such nonsense 0.20578727368601005
1 Private Drive 0.1476932514643156
2 number 0.10003737208485038
3 Dursley 0.08830619471948406


In [None]:
# Look at the sentence boundaries with its phrase-vector
## The phrases with ids 1, 2 and 3 belong to the first sentence
## The phrase with id 0 belongs to the second sentence
print(sent_bounds)

[[0, 26, {1, 2, 3}], [26, 53, {0}]]
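The assignment step above can also be written as a small helper that works on plain token offsets, which makes the containment test easier to follow. This is a sketch under assumptions: the function name ``assign_phrases`` and the ``demo_`` variables are hypothetical, and the phrase offsets are illustrative values chosen to be consistent with the sample text rather than values printed by ``pytextrank``.

```python
def assign_phrases(bounds, phrase_spans):
    """Return, per sentence, the set of phrase ids whose spans fall inside it.

    bounds:       list of (start, end) token offsets, one per sentence
    phrase_spans: phrase_spans[i] holds the (start, end) token offsets
                  of every occurrence of phrase i
    """
    vectors = [set() for _ in bounds]
    for phrase_id, spans in enumerate(phrase_spans):
        for start, end in spans:
            for sent_id, (sent_start, sent_end) in enumerate(bounds):
                # A span belongs to the sentence that fully contains it
                if start >= sent_start and end <= sent_end:
                    vectors[sent_id].add(phrase_id)
                    break
    return vectors


# Sentence boundaries taken from the output above; phrase offsets are
# illustrative assumptions, not pytextrank output.
demo_bounds = [(0, 26), (26, 53)]
demo_spans = [
    [(50, 52)],  # 0: "such nonsense"
    [(9, 11)],   # 1: "Private Drive"
    [(6, 7)],    # 2: "number"
    [(3, 4)],    # 3: "Dursley"
]
demo_vectors = assign_phrases(demo_bounds, demo_spans)
print(demo_vectors)
```

With these offsets, the helper reproduces the phrase sets shown above: phrases 1, 2 and 3 land in the first sentence and phrase 0 in the second.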


#### Print the unit_vector

In [None]:
# Print the unit vector which contains the rank data of the top 4 phrases
print(unit_vector)

[0.20578727368601005, 0.1476932514643156, 0.10003737208485038, 0.08830619471948406]


#### Normalize the unit_vector

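We normalize the vector by dividing each rank by the sum of all ranks, so that the entries add up to 1. Note that this is a sum (L1) normalization. It differs from scaling a vector to unit Euclidean length, which is the normalization that makes cosine similarity and dot product equivalent [[11]](https://aclanthology.org/Q15-1016/).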



In [None]:
# Normalize the unit_vector
sum_ranks = sum(unit_vector)
unit_vector = [ rank/sum_ranks for rank in unit_vector ]

unit_vector

[0.3798045837046876,
 0.2725852424381945,
 0.18463071976729584,
 0.1629794540898221]
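To make the difference between the two normalizations concrete, the following sketch applies both to the four rank values printed earlier. The variable names are assumptions made for this example; the Euclidean (L2) variant is shown only for comparison and is not used in the rest of this notebook.

```python
from math import sqrt

# Rank values copied from the printed unit_vector before normalization
ranks = [0.20578727368601005, 0.1476932514643156,
         0.10003737208485038, 0.08830619471948406]

# Sum (L1) normalization, as used in this notebook: entries add up to 1
l1_total = sum(ranks)
l1_normalized = [r / l1_total for r in ranks]

# Euclidean (L2) normalization, for comparison: the vector has length 1
l2_length = sqrt(sum(r * r for r in ranks))
l2_normalized = [r / l2_length for r in ranks]

print(sum(l1_normalized))                       # sum of entries
print(sqrt(sum(x * x for x in l2_normalized)))  # Euclidean length
```

Both variants keep the relative order of the ranks; they only differ in the scale of the resulting values.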

#### Calculate euclidean distance for each sentence

In [None]:
# Iterate through each sentence, calculating its euclidean distance from the unit vector

## sqrt function is used to return the square root of x
from math import sqrt

## Create a dictionary to keep rank data of the sentences
sent_rank = {}

## Set "sent_id" to zero and increase after each iteration
sent_id = 0

## Calculate sentence rank data and append into the dictionary "sent_rank"
for sent_start, sent_end, sent_vector in sent_bounds:
    #print(sent_vector)
    sum_sq = 0.0

    for phrase_id in range(len(unit_vector)):
        #print(phrase_id, unit_vector[phrase_id])

        if phrase_id not in sent_vector:
            sum_sq += unit_vector[phrase_id]**2.0

    sent_rank[sent_id] = sqrt(sum_sq)
    sent_id += 1

print(sent_rank)

{0: 0.3798045837046876, 1: 0.3673602040672008}
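The loop above can be read as follows: each sentence is treated as a vector that holds the normalized rank of every top phrase it contains and 0 for every phrase it lacks. Its Euclidean distance from the unit vector is then the square root of the summed squares of the ranks it is missing, so a smaller distance means the sentence covers more of the highly ranked phrases. A compact sketch of the same calculation, with the hypothetical helper name ``sentence_distance`` and the values copied from the outputs above:

```python
from math import sqrt


def sentence_distance(unit_vec, phrase_ids):
    """Distance from the unit vector, based on the phrases a sentence lacks."""
    return sqrt(sum(w ** 2 for i, w in enumerate(unit_vec) if i not in phrase_ids))


# Normalized ranks and phrase sets copied from the outputs above
weights = [0.3798045837046876, 0.2725852424381945,
           0.18463071976729584, 0.1629794540898221]
phrase_sets = [{1, 2, 3}, {0}]

distances = {i: sentence_distance(weights, ids) for i, ids in enumerate(phrase_sets)}
print(distances)
```

The first sentence lacks only phrase 0, so its distance equals that phrase's weight; the second sentence lacks phrases 1, 2 and 3, whose combined contribution is slightly smaller.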


#### Sort the sentence indexes

In [None]:
# Sort the sentence indexes in ascending order of distance (closest sentence first)
from operator import itemgetter

sorted(sent_rank.items(), key=itemgetter(1)) 

[(1, 0.3673602040672008), (0, 0.3798045837046876)]

#### Extract the sentences

In [None]:
# Extract the sentences with the lowest distance, up to the limit requested.
limit_sentences = 1

sent_text = {}
sent_id = 0

for sent in doc.sents:
    sent_text[sent_id] = sent.text
    sent_id += 1

num_sent = 0

for sent_id, rank in sorted(sent_rank.items(), key=itemgetter(1)):
    print(sent_id, sent_text[sent_id])
    num_sent += 1

    if num_sent == limit_sentences:
        break

1 They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.
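When more than one sentence is selected, it can be useful to present the chosen sentences in their original document order rather than in order of distance. The following sketch shows one way to do this; the function name ``pick_summary`` and the ``demo_`` variables are assumptions made for this example, and the sentence texts are shortened placeholders.

```python
def pick_summary(distances, texts, limit):
    """Return the `limit` lowest-distance sentences, in document order."""
    # Sentence ids sorted by distance, closest first, cut at the limit
    closest = sorted(distances, key=distances.get)[:limit]
    # Restore the original order of the selected sentences
    return [texts[sent_id] for sent_id in sorted(closest)]


# Distances copied from the output above; texts are shortened placeholders
demo_rank = {0: 0.3798045837046876, 1: 0.3673602040672008}
demo_text = {0: "Mr. and Mrs. Dursley ...", 1: "They were the last people ..."}

print(pick_summary(demo_rank, demo_text, limit=1))
print(pick_summary(demo_rank, demo_text, limit=2))
```

With a limit of 1, the result matches the sentence extracted above. With a limit of 2, both sentences are returned in the order in which they appear in the text.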


# **References**

- [1] NLP and Computer Vision_DLMAINLPCV01 Course Book
- [2] https://spacy.io/
- [3] https://derwen.ai/docs/ptr/
- [4] https://derwen.ai/docs/ptr/explain_summ/
- [5] https://spacy.io/universe/project/spacy-pytextrank
- [6] https://spacy.io/models
- [7] https://spacy.io/models/en
- [8] https://spacy.io/usage/spacy-101
- [9] https://spacy.io/usage/linguistic-features
- [10] https://derwen.ai/docs/ptr/glossary/#lemma-graph
- [11] https://aclanthology.org/Q15-1016/


Copyright © 2022 IU International University of Applied Sciences