# **Text Summarization**

Text summarization in NLP describes methods to automatically generate summaries that contain the most relevant information from source texts. There are two families of techniques: extractive and abstractive. Extractive techniques select the most important word sequences of the document to produce a summary of the given text. Abstractive techniques generate new text that paraphrases the content of the original document, much like humans do when they write an abstract [[1]](#scrollTo=8Pzkt1Z_M6OH).

This notebook shows an example of unsupervised extractive text summarization with TextRank.

## Unsupervised extractive text summarization with TextRank

TextRank is a common unsupervised extractive summarization technique. It compares every sentence in the text with every other sentence by calculating a similarity score for each sentence pair, for example the cosine similarity. The closer the score is to 1, the more similar the two sentences are. The scores of each sentence are then summed up to get its rank: a sentence that is similar to many other sentences represents the text well, so the higher the rank, the more important the sentence. Finally, the sentences are sorted by rank, and the summary is built from a defined number of the highest-ranked sentences [[1]](#scrollTo=8Pzkt1Z_M6OH).
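
The ranking idea described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the ``pytextrank`` implementation: it assumes whitespace tokenization and bag-of-words count vectors, and the helper names ``cosine_similarity`` and ``rank_sentences`` as well as the example sentences are made up for this sketch.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(sentences):
    """Score each sentence by the sum of its similarities to all other sentences."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scored = []
    for i, bag in enumerate(bags):
        score = sum(cosine_similarity(bag, other)
                    for j, other in enumerate(bags) if j != i)
        scored.append((score, sentences[i]))
    # Highest score first
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical example sentences
example = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]
ranked = rank_sentences(example)

# Build a one-sentence summary from the highest-ranked sentence
summary = [sentence for _, sentence in ranked[:1]]
print(summary)
```

The third sentence shares no words with the other two, so it receives the lowest score and would be the last candidate for the summary.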

Unsupervised text summarization can be performed with the ``spaCy`` library and the TextRank algorithm by using the ``pytextrank`` library. For more details about ``spaCy`` and ``pytextrank`` libraries, please refer to [[2]](https://spacy.io/) and [[3]](https://derwen.ai/docs/ptr/).

The following example is based on [[4]](https://derwen.ai/docs/ptr/explain_summ/).

### Install ``pytextrank`` library

``pytextrank`` is an implementation of TextRank to use in ``spaCy`` pipelines. It provides fast, effective phrase extraction from texts, along with extractive summarization [[5]](https://spacy.io/universe/project/spacy-pytextrank).



In [1]:
# Install the pytextrank library 
!pip install pytextrank==3.0.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytextrank==3.0.1
  Downloading pytextrank-3.0.1-py3-none-any.whl (19 kB)
Collecting icecream>=2.1
  Downloading icecream-2.1.2-py2.py3-none-any.whl (8.3 kB)
Collecting graphviz>=0.13
  Downloading graphviz-0.20-py3-none-any.whl (46 kB)
Collecting asttokens>=2.0.1
  Downloading asttokens-2.0.5-py2.py3-none-any.whl (20 kB)
Collecting colorama>=0.3.9
  Downloading colorama-0.4.5-py2.py3-none-any.whl (16 kB)
Collecting executing>=0.3.1
  Downloading executing-0.8.3-py2.py3-none-any.whl (16 kB)
Installing collected packages: executing, colorama, asttokens, icecream, graphviz, pytextrank
  Attempting uninstall: graphviz
    Found existing installation: graphviz 0.10.1
    Uninstalling graphviz-0.10.1:
      Successfully uninstalled graphviz-0.10.1
Successfully installed asttokens-2.0.5 colorama-0.4.5 executing-0.8.3 graphviz

### Download and install language model
We download and install the ``en_core_web_sm`` English language model with the download command of the ``spaCy`` library.
For more details about ``en_core_web_sm``, please refer to [[6]](https://spacy.io/models).

In [2]:
# Download "en_core_web_sm" English language model
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Import libraries

We import ``spaCy`` and ``pytextrank`` libraries.

``spaCy`` is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning [[7]](https://spacy.io/usage/spacy-101). For example, it supports the implementation of tasks such as sentiment analysis, chatbots, text summarization, and intent and entity extraction [[1]](#scrollTo=8Pzkt1Z_M6OH). For more information about ``spaCy``, please refer to [[2]](https://spacy.io/).

In [3]:
# Import spaCy and pytextrank libraries
import spacy
import pytextrank

### Load installed language model


In [4]:
# Load the language model with the package name
sp = spacy.load('en_core_web_sm')

### Prepare pipeline

We use the ``add_pipe()`` method to add a component to the processing pipeline. We will add ``pytextrank`` to the ``spaCy`` pipeline.

In [5]:
# Add pytextrank to the spaCy pipeline
sp.add_pipe('textrank', last=True)

<pytextrank.base.BaseTextRank at 0x7fd9acc47850>

### Text summarization example

#### Create sample text

In [6]:
# Create sample text as spaCy Doc object
doc = sp(
    "Mr. and Mrs. Dursley, of number four, Private Drive, were proud to say "
    "they were perfectly normal, thank you very much. They were the last people you'd expect "
    "to be involved in anything strange or mysterious, because they just didn't hold with such nonsense."
)

#### Print noun chunks

Usually, TextRank computes the distance of each sentence to every other sentence. Comparing all sentence pairs becomes computationally expensive as the number of sentences grows. To save computational cost, we show an approach in which the 4 top-ranked phrases (noun chunks) are combined into a so-called unit vector that represents the whole sample text. Each sentence is then compared to this unit vector, so it is not necessary to compare every sentence with every other sentence.

We use the ``noun_chunks`` property of the ``Doc`` object to iterate over the base noun phrases in the document [[8]](https://spacy.io/usage/linguistic-features).

In [10]:
# Print the noun chunks of the sample text
for chunk in doc.noun_chunks:
    print(chunk)

Mr. and Mrs. Dursley
number
Private Drive
they
you
They
the last people
you
anything
they
such nonsense


#### Show top rated noun chunks
To extract the 4 most representative noun chunks, which will later be combined in the unit vector, we first rank all noun chunks. ``pytextrank`` stores the ranked phrases in ``doc._.phrases``, sorted from the highest to the lowest rank. The ranking is based on a lemma graph that is built from the document [[9]](https://derwen.ai/docs/ptr/glossary/#lemma-graph). For each phrase, we print its rank, its count, its text, and the noun chunks in which it was found.

In [9]:
# Show top rated phrases
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)

0.2058     1  such nonsense
[such nonsense]
0.1477     2  Private Drive
[Private Drive, Private Drive]
0.1000     1  number
[number]
0.0883     1  Dursley
[Dursley]
0.0697     1  Mr. and Mrs. Dursley
[Mr. and Mrs. Dursley]
0.0520     1  the last people
[the last people]
0.0462     1  number four
[number four]
0.0000     1  They
[They]
0.0000     1  anything
[anything]
0.0000     2  they
[they, they]
0.0000     2  you
[you, you]
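
To give an intuition for how ranks like the ones above can come out of a graph, here is a minimal sketch of PageRank by power iteration on a tiny, hand-made lemma graph. Everything in it is an assumption for illustration: the edges are invented rather than derived from the sample text, the damping factor of 0.85 is the commonly used default, and ``pytextrank`` builds and scores its graph in its own way, so the numbers will not match the output above.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph given as adjacency sets."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor passes on an equal share of its current rank
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Hypothetical lemma graph: an edge links lemmas that occur close together
lemma_graph = {
    "dursley": {"number", "drive"},
    "number": {"dursley", "drive"},
    "drive": {"dursley", "number", "nonsense"},
    "nonsense": {"drive"},
}

scores = pagerank(lemma_graph)
for lemma, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{lemma:10s} {score:.4f}")
```

In this toy graph, the lemma with the most connections ends up with the highest score, which mirrors the idea that well-connected lemmas are central to the text.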


#### Create a list of the sentence boundaries
Create a list of the sentence boundaries with a phrase vector (initialized to an empty set) for each sentence. The set will later record which of the top-ranked phrases occur in that sentence.

In [None]:
# Create a list of the sentence boundaries
sent_bounds = [[s.start, s.end, set()] for s in doc.sents]
print(sent_bounds)

[[0, 26, set()], [26, 53, set()]]


#### Add top-ranked phrases to the phrase-vector 
Iterate through the 4 top-ranked phrases. For each phrase, append its rank to ``unit_vector`` and add its ID to the phrase vector of the sentence that contains it.

In [None]:
# Set the number of top-ranked phrases to use
no_phrases = 4

# Start "phrase_id" at zero; it is increased after each iteration
phrase_id = 0

# Create an empty list "unit_vector" to keep the ranks of the phrases
unit_vector = []

# Iterate over the top 4 phrases and append their ranks to unit_vector
for p in doc._.phrases:
    print(phrase_id, p.text, p.rank)

    unit_vector.append(p.rank)

    for chunk in p.chunks:
        #print(" ", chunk.start, chunk.end)

        for sent_start, sent_end, sent_vector in sent_bounds:
            if chunk.start >= sent_start and chunk.end <= sent_end:
                #print(" ", sent_start, chunk.start, chunk.end, sent_end)
                sent_vector.add(phrase_id)
                break

    phrase_id += 1

    if phrase_id == no_phrases:
        break

0 such nonsense 0.20578727368601005
1 Private Drive 0.1476932514643156
2 number 0.10003737208485038
3 Dursley 0.08830619471948406


In [None]:
# Look at the sentence boundaries with their phrase vectors
## The phrases with IDs 1, 2 and 3 belong to the first sentence
## The phrase with ID 0 belongs to the second sentence
print(sent_bounds)

[[0, 26, {1, 2, 3}], [26, 53, {0}]]


#### Print the unit_vector

In [None]:
# Print the unit vector which contains the rank data of the top 4 phrases
print(unit_vector)

[0.20578727368601005, 0.1476932514643156, 0.10003737208485038, 0.08830619471948406]


#### Normalize the unit_vector

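Before the distance calculation, we normalize the rank vector: each rank is divided by the sum of all ranks, so that the entries add up to 1. Note that this differs from normalizing a vector to unit (Euclidean) length, which is commonly done before similarity calculations because it makes cosine similarity and dot product equivalent [[10]](https://aclanthology.org/Q15-1016/).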



In [None]:
# Normalize the unit_vector
sum_ranks = sum(unit_vector)
unit_vector = [ rank/sum_ranks for rank in unit_vector ]

unit_vector

[0.3798045837046876,
 0.2725852424381945,
 0.18463071976729584,
 0.1629794540898221]
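
The cell above divides each rank by the sum of the ranks. As a small sketch for comparison, the following code applies both this sum normalization and a normalization to Euclidean length 1 to the same ranks. The variable names ``sum_normalized`` and ``l2_normalized`` are introduced only for this illustration.

```python
from math import sqrt

# Ranks of the top 4 phrases, copied from the output above
ranks = [0.20578727368601005, 0.1476932514643156,
         0.10003737208485038, 0.08830619471948406]

# Sum normalization, as in the cell above: the entries add up to 1
sum_normalized = [r / sum(ranks) for r in ranks]

# Euclidean normalization: the vector has length 1
length = sqrt(sum(r * r for r in ranks))
l2_normalized = [r / length for r in ranks]

print(sum(sum_normalized))
print(sqrt(sum(v * v for v in l2_normalized)))
```

Both variants keep the proportions between the ranks, so the order of the sentences in the following distance calculation does not depend on which one is used.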

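#### Calculate Euclidean distance for each sentence

We iterate through each sentence and calculate its Euclidean distance from the unit vector. Only the top-ranked phrases that do not occur in the sentence contribute to the distance, so a sentence that contains the highest-ranked phrases gets a small distance.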

In [None]:
# Import sqrt to compute the square root
from math import sqrt

# Create a dictionary to keep rank data of the sentences
sent_rank = {}

# Set "sent_id" to zero and increase after each iteration
sent_id = 0

# Calculate sentence rank data and append into the dictionary "sent_rank"
for sent_start, sent_end, sent_vector in sent_bounds:
    #print(sent_vector)
    sum_sq = 0.0

    for phrase_id in range(len(unit_vector)):
        #print(phrase_id, unit_vector[phrase_id])

        if phrase_id not in sent_vector:
            sum_sq += unit_vector[phrase_id]**2.0

    sent_rank[sent_id] = sqrt(sum_sq)
    sent_id += 1

print(sent_rank)

{0: 0.3798045837046876, 1: 0.3673602040672008}
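
The same calculation can be written as a small function, which makes it easier to check the two values above by hand. This is a sketch: the function name ``distance_to_unit_vector`` is hypothetical, and the input values are copied from the earlier outputs.

```python
from math import sqrt


def distance_to_unit_vector(sent_vector, unit_vector):
    """Euclidean distance that counts only the phrases missing from the sentence."""
    return sqrt(sum(weight ** 2
                    for phrase_id, weight in enumerate(unit_vector)
                    if phrase_id not in sent_vector))


# Normalized ranks of the top 4 phrases, copied from the output above
unit_vector = [0.3798045837046876, 0.2725852424381945,
               0.18463071976729584, 0.1629794540898221]

# Sentence 0 contains phrases 1, 2 and 3, so only phrase 0 is missing
print(distance_to_unit_vector({1, 2, 3}, unit_vector))

# Sentence 1 contains only phrase 0, so phrases 1, 2 and 3 are missing
print(distance_to_unit_vector({0}, unit_vector))
```

Sentence 1 misses three phrases, but their combined weight is still smaller than the weight of the single top phrase that sentence 0 misses, so sentence 1 is slightly closer to the unit vector.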


#### Extract the sentences

We extract the sentences with the lowest distance, up to the requested sentence limit.

In [None]:
# itemgetter is used to sort the sentences by their distance
from operator import itemgetter

# Set the requested sentence limit
limit_sentences = 1

sent_text = {}
sent_id = 0

for sent in doc.sents:
    sent_text[sent_id] = sent.text
    sent_id += 1

num_sent = 0

# Extract and print the sentence with lowest distance
for sent_id, rank in sorted(sent_rank.items(), key=itemgetter(1)):
    print(sent_id, sent_text[sent_id])
    num_sent += 1

    if num_sent == limit_sentences:
        break

1 They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.
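
When more than one sentence is requested, a summary usually reads better if the selected sentences appear in their original order. The following sketch shows one possible way to do this: it picks the sentence IDs with the lowest distance and then sorts them by position. The variable names ``closest`` and ``selected_ids`` are introduced here, and the distances are copied from the output above.

```python
from operator import itemgetter

# Distances per sentence ID, copied from the output above
sent_rank = {0: 0.3798045837046876, 1: 0.3673602040672008}
limit_sentences = 1

# Take the sentences with the smallest distance, up to the limit
closest = sorted(sent_rank.items(), key=itemgetter(1))[:limit_sentences]

# Restore document order for the selected sentence IDs
selected_ids = sorted(sent_id for sent_id, _ in closest)
print(selected_ids)
```

With a limit of 1, this selects the same sentence as the cell above, the one with ID 1.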


# **References**

- [1] Course Book "NLP and Computer Vision" (DLMAINLPCV01)
- [2] https://spacy.io/
- [3] https://derwen.ai/docs/ptr/
- [4] https://derwen.ai/docs/ptr/explain_summ/
- [5] https://spacy.io/universe/project/spacy-pytextrank
- [6] https://spacy.io/models
- [7] https://spacy.io/usage/spacy-101
- [8] https://spacy.io/usage/linguistic-features
- [9] https://derwen.ai/docs/ptr/glossary/#lemma-graph
- [10] https://aclanthology.org/Q15-1016/


Copyright © 2022 IU International University of Applied Sciences