
# **Text Summarization**

Text summarization in NLP describes methods that automatically generate summaries containing the most relevant information from source texts. There are two families of techniques: extractive and abstractive. Extractive techniques select the most important word sequences of the document to produce a summary of the given text. Abstractive techniques generate new text that paraphrases the content of the original document, much like humans do when they write an abstract [[1]](#scrollTo=8Pzkt1Z_M6OH).

This notebook shows an example of unsupervised extractive text summarization with TextRank.

## Unsupervised extractive text summarization with TextRank

TextRank is a common unsupervised extractive summarization technique. It compares every sentence in the text with every other sentence by calculating a similarity score, for example the cosine similarity, for each sentence pair. The closer the score is to 1, the more similar the two sentences are, and a sentence that is similar to many other sentences represents the rest of the text well. These scores are summed up for each sentence to get a rank. The higher the rank, the more important the sentence is in the text. Finally, the sentences are sorted by rank and a summary is built from a defined number of the highest-ranked sentences [[1]](#scrollTo=8Pzkt1Z_M6OH).
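
The ranking idea described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used by ``pytextrank``: sentences are represented as simple word-count vectors, the helper names (``bag_of_words``, ``rank_sentences``, ``summarize``) and the three example sentences are made up for this sketch, and the scores are plain sums of pairwise cosine similarities rather than a full graph-based ranking.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts for one sentence."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(sentences):
    """Sum each sentence's similarity to all other sentences."""
    vectors = [bag_of_words(s) for s in sentences]
    scores = []
    for i, vec_i in enumerate(vectors):
        score = sum(cosine_similarity(vec_i, vec_j)
                    for j, vec_j in enumerate(vectors) if j != i)
        scores.append(score)
    return scores


def summarize(sentences, limit=2):
    """Keep the `limit` highest-scoring sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:limit]
    return [sentences[i] for i in sorted(top)]


# Hypothetical example sentences (not the notebook's sample text)
sentences = [
    "The Turing test checks whether a machine can imitate a human.",
    "A human interrogator talks to a machine and a human.",
    "Bananas are yellow.",
]
print(summarize(sentences, limit=2))
```

The third sentence shares no words with the other two, so its summed similarity is zero and it is dropped from a two-sentence summary.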

Unsupervised text summarization can be performed with the ``spaCy`` library and the TextRank algorithm by using the ``pytextrank`` library. For more details about ``spaCy`` and ``pytextrank`` libraries, please refer to [[2]](https://spacy.io/) and [[3]](https://derwen.ai/docs/ptr/).

The following example is based on [[4]](https://derwen.ai/docs/ptr/explain_summ/).


For text summarization, we will apply the following steps:
* Install libraries
* Download and install language model
* Create a document
* Perform sentence tokenization
* Score each sentence
* Rank each sentence by those scores
* Select the top-scoring sentences as our summary

### Install libraries

#### Install ``pytextrank`` library

``pytextrank`` is an implementation of TextRank to use in ``spaCy`` pipelines. It provides fast, effective phrase extraction from texts, along with extractive summarization [[5]](https://spacy.io/universe/project/spacy-pytextrank).



In [2]:
# Install the pytextrank library 
!pip install pytextrank==3.0.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytextrank==3.0.1
  Downloading pytextrank-3.0.1-py3-none-any.whl (19 kB)
Collecting graphviz>=0.13
  Downloading graphviz-0.20-py3-none-any.whl (46 kB)
     |████████████████████████████████| 46 kB 3.4 MB/s
Collecting icecream>=2.1
  Downloading icecream-2.1.2-py2.py3-none-any.whl (8.3 kB)
Collecting executing>=0.3.1
  Downloading executing-0.8.3-py2.py3-none-any.whl (16 kB)
Collecting asttokens>=2.0.1
  Downloading asttokens-2.0.5-py2.py3-none-any.whl (20 kB)
Collecting colorama>=0.3.9
  Downloading colorama-0.4.5-py2.py3-none-any.whl (16 kB)
Installing collected packages: executing, colorama, asttokens, icecream, graphviz, pytextrank
  Attempting uninstall: graphviz
    Found existing installation: graphviz 0.10.1
    Uninstalling graphviz-0.10.1:
      Successfully uninstalled graphviz-0.10.1
Successfully installed asttokens-2.0.5 colorama-0.4.5 executing-0.8.3 graphviz
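
After installation, it can be useful to confirm that the pinned version is the one actually available in the runtime. The following is a small sketch using only the standard library; the helper name ``installed_version`` is an assumption made for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# The expected version is the one pinned in the install command above
for name, expected in [("pytextrank", "3.0.1")]:
    found = installed_version(name)
    status = "OK" if found == expected else "differs or missing"
    print(f"{name}: expected {expected}, found {found} ({status})")
```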

#### Import libraries

We import the ``spaCy`` and ``pytextrank`` libraries.

``spaCy`` is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning [[6]](https://spacy.io/usage/spacy-101). For example, it supports the implementation of tasks such as sentiment analysis, chatbots, text summarization, and intent and entity extraction [[1]](#scrollTo=8Pzkt1Z_M6OH). For more information about ``spaCy``, please refer to [[2]](https://spacy.io/).

In [4]:
# Import spaCy and pytextrank libraries
import spacy
import pytextrank

### Download and install language model
We download the ``en_core_web_sm`` English language model by using the ``spaCy`` library.
For more details about ``en_core_web_sm``, please refer to [[7]](https://spacy.io/models).

In [5]:
# Download "en_core_web_sm" English language model
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
     |████████████████████████████████| 12.8 MB 15.0 MB/s
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Load installed language model
We use the ``spacy.load()`` function to load our language model ``en_core_web_sm`` as the ``spaCy`` pipeline ``sp``.


In [6]:
# Load the language model with the package name
sp = spacy.load('en_core_web_sm')

### Prepare pipeline

We use the ``add_pipe()`` method to add ``pytextrank`` to the ``spaCy`` pipeline ``sp``.

In [7]:
# Add pytextrank to the spaCy pipeline
sp.add_pipe('textrank', last=True)

<pytextrank.base.BaseTextRank at 0x7f39737bd190>

Now our ``spaCy`` pipeline is ready for text summarization. For this, we create a ``spaCy`` document in the following step.

### Create ``spaCy`` document with sample text

In this step, we pass a sample text to the ``spaCy`` pipeline and create a ``Doc`` object named ``doc``.

When we create a ``Doc`` object with ``spaCy``, the pipeline automatically performs tokenization, part-of-speech (POS) tagging, and named entity recognition (NER) on the input text. The following figure demonstrates the processing pipeline that turns a given text into a ``Doc`` object [[5]](https://spacy.io/usage/processing-pipelines).

![spaCy](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

In [8]:
# Define a sample text for summarization
text="""Alan Mathison Turing, a British mathematician and computer scientist,\
 was one of the early pioneers of artificial intelligence. Turing (1950) describes \
 the foundation of what was later called the Turing test. The experimental setup of \
 the Turing test is as follows. A human interrogator uses a chat program to talk to \
 two conversation partners: a chatbot and another human being. Both of them try to \
 convince the interrogator that they are the human. If the interrogator is not able to \
 identify the human through intense questioning, the machine is considered to have passed \
 the Turing test. According to Turing, passing the test can lead to the conclusion that \
 the machine’s intellectual power is on a level comparable to the human brain. While the \
 Turing test has often been criticized because of its focus on functionality, the question \
 of whether the machine is conscious about its answers remains open. Several attempts have \
 been made to pass the Turing test, but it still remains an unresolved challenge."""

# Create a spaCy Doc object "doc" with the sample text
doc = sp(text)
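
The sample text is built with backslash line continuations inside a triple-quoted string. Where a line ends with a space before the backslash and the next line starts with a space, both spaces end up in the string. The sketch below uses a shortened stand-in string (an assumption for illustration, not the full sample) to show the effect and one possible way to normalize it.

```python
# A shortened stand-in for the sample text, built the same way
sample = """Turing (1950) describes \
 the foundation of what was later called \
 the Turing test."""

# The space before each backslash and the leading space on the next line both survive
print("  " in sample)

# Collapse every run of whitespace into a single space
normalized = " ".join(sample.split())
print(normalized)
```

This would explain the doubled spaces in the printed summary and why ``the  Turing test`` is listed separately from ``the Turing test`` in the phrase list. Normalizing the text before creating the ``Doc`` object is optional and may change the extracted phrases and their ranks.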

### List top-ranked phrases

Above we have added ``pytextrank`` to the ``spaCy`` pipeline and created a Doc object ``doc`` with a sample text.

Now we can access the ``pytextrank`` component within the ``spaCy`` pipeline, and use it to get more information about the document.

We use the ``_.phrases`` attribute that ``pytextrank`` adds to the ``Doc`` object to print a list of top-ranked phrases in the document. For each phrase, the list contains:
* ``phrase.rank``: TextRank score of the phrase
* ``phrase.count``: Number of occurrences of the phrase in the text
* ``phrase.text``: The phrase itself as a string

The higher the rank of a phrase, the more important it is for text summarization. The summarization step then favors sentences that contain these top-ranked phrases.

In [21]:
# Print the top-ranked phrases
for phrase in doc._.phrases:
  if phrase.rank>0:
    print(f'{phrase.rank:{20}} {phrase.count:{5}} {phrase.text:{5}}')

 0.10726830758300748     1 artificial intelligence
 0.09432284433635442     3 Turing
 0.08117326093202007     1 intense questioning
 0.07092000319088093     4 the Turing test
 0.06616451656224065     1 another human being
 0.06567762671637971     1 the human brain
 0.06239413059510708     1 functionality
 0.06110163052707078     2 Alan Mathison Turing
0.060920568103383546     1 A human interrogator
 0.05768369492512729     1 an unresolved challenge
 0.05229790122459473     1 the machine’s intellectual power
0.051471946777544623     1 the  Turing test
  0.0495936352774933     1 two conversation partners
0.048966965109190665     1 Several attempts
 0.04892833059988179     1 the test
0.047631209375115745     1 British
 0.04757661717719924     2 the machine
 0.04594356957071834     1 a chat program
 0.04380664297520442     1 the early pioneers
0.040621810082179084     1 a British mathematician and computer scientist
0.033507600154270484     1 a chatbot
0.031332584043474714     1 a level
0.
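
The ``pytextrank`` documentation describes building a graph from the lemmas in the text and ranking its nodes with a PageRank-style algorithm. The sketch below illustrates that ranking idea with a simple power iteration on a tiny, hypothetical co-occurrence graph. It is not the ``pytextrank`` implementation: the graph, the function name ``pagerank``, and the parameter values are assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph given as {node: set(neighbors)}."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor shares its current rank equally among its own neighbors
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Hypothetical co-occurrence graph of lemmas (not derived from the notebook's document)
graph = {
    "turing": {"test", "machine", "human"},
    "test": {"turing", "machine"},
    "machine": {"turing", "test", "human"},
    "human": {"turing", "machine", "interrogator"},
    "interrogator": {"human"},
}

ranks = pagerank(graph)
for lemma, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.4f}  {lemma}")
```

Well-connected lemmas receive higher scores than lemmas with a single neighbor, which matches the intuition that central terms in a text rank highest.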

### Perform text summarization

We use the ``summary`` method of ``pytextrank`` to run an extractive summarization. We set the following parameters:

* ``limit_phrases``: It defines the maximum number of top-ranked phrases to use in the distance vectors. In this example, we set ``limit_phrases=3``.

* ``limit_sentences``: It defines the total number of sentences to return for the extractive summarization. In this example, we set ``limit_sentences=3``.

* ``preserve_order``: It preserves the order of sentences as they originally occurred in the source text. In this example, we set ``preserve_order=True``.

In [222]:
# Perform text summarization
summary = list(doc._.textrank.summary(limit_phrases=3, limit_sentences=3, preserve_order=True))
for sent in summary:
  print(sent,"\n")

Alan Mathison Turing, a British mathematician and computer scientist, was one of the early pioneers of artificial intelligence. 

Turing (1950) describes  the foundation of what was later called the Turing test. 

According to Turing, passing the test can lead to the conclusion that  the machine’s intellectual power is on a level comparable to the human brain. 
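
To make the roles of ``limit_phrases``, ``limit_sentences``, and ``preserve_order`` more concrete, the selection step can be sketched as follows, loosely following the approach described in [[4]](https://derwen.ai/docs/ptr/explain_summ/): the ranks of the top phrases are normalized into a vector, each sentence is scored by its distance from that vector depending on which top phrases it contains, and the closest sentences are returned. This is a simplified sketch, not the library's code. The function name ``select_sentences`` and the sentence-to-phrase mapping are assumptions for illustration.

```python
import math


def select_sentences(phrase_ranks, sentence_phrases, limit_sentences=3, preserve_order=True):
    """Pick the sentences that lie closest to the vector of top-ranked phrases.

    phrase_ranks: ranks of the top phrases (already cut to limit_phrases).
    sentence_phrases: for each sentence, the set of phrase indices it contains.
    """
    total = sum(phrase_ranks)
    unit_vector = [rank / total for rank in phrase_ranks]

    distances = []
    for sent_id, present in enumerate(sentence_phrases):
        # Every top phrase missing from the sentence adds to its distance
        sum_sq = sum(w ** 2 for i, w in enumerate(unit_vector) if i not in present)
        distances.append((math.sqrt(sum_sq), sent_id))

    # Smallest distance first; ties fall back to the sentence position
    chosen = [sent_id for _, sent_id in sorted(distances)[:limit_sentences]]
    return sorted(chosen) if preserve_order else chosen


# Rounded ranks of the three top phrases printed earlier; the mapping of
# phrases to sentences below is an illustrative assumption.
phrase_ranks = [0.107, 0.094, 0.081]
sentence_phrases = [{0}, {1}, set(), {2}, {1}, set()]

print(select_sentences(phrase_ranks, sentence_phrases))
```

With ``preserve_order=False``, the sketch returns the selected sentence indices from closest to farthest instead of in document order.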



# **References**

- [1] Course Book "NLP and Computer Vision" (DLMAINLPCV01)
- [2] https://spacy.io/
- [3] https://derwen.ai/docs/ptr/
- [4] https://derwen.ai/docs/ptr/explain_summ/
- [5] https://spacy.io/universe/project/spacy-pytextrank
- [6] https://spacy.io/usage/spacy-101
- [7] https://spacy.io/models
- [8] https://spacy.io/usage/linguistic-features
- [9] https://derwen.ai/docs/ptr/glossary/#lemma-graph
- [10] https://aclanthology.org/Q15-1016/


Copyright © 2022 IU International University of Applied Sciences