# Testing Summarization

This notebook explores summarization of texts and descriptions of business objects. The goal is to suggest a condensed description from an external site when a new term is added.

## Extractive Summarization
Extractive summarization splits the input text into short parts (typically sentences) and scores each part by its relative relevance within the text. The parts that score above a certain threshold are then selected and joined to form the output text.
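The idea can be sketched without any NLP library. The snippet below is a minimal illustration, not the method used in the rest of this notebook: the regex-based sentence splitter, the function name `extractive_summary`, and the `threshold` parameter (a multiple of the mean sentence score) are assumptions made for this example.

```python
import re
from collections import Counter


def extractive_summary(text, threshold=1.0):
    """Keep sentences whose score is at least `threshold` times the mean score."""
    # naive sentence split on ., ! or ? followed by whitespace (assumption)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # word frequencies over the whole text, lower-cased
    freq = Counter(re.findall(r"\w+", text.lower()))
    # score each sentence as the sum of the frequencies of its words
    scores = [sum(freq[w] for w in re.findall(r"\w+", s.lower())) for s in sentences]
    if not scores:
        return ""
    mean_score = sum(scores) / len(scores)
    # select sentences at or above the threshold, preserving the original order
    selected = [s for s, sc in zip(sentences, scores) if sc >= threshold * mean_score]
    return " ".join(selected)


demo = "Cats sleep a lot. Cats and dogs are pets. Birds sing."
print(extractive_summary(demo))  # Cats sleep a lot. Cats and dogs are pets.
```

Raising `threshold` keeps fewer sentences; the sections below replace the naive splitting and counting with spaCy and NLTK.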

### spaCy

spaCy is a free, open-source advanced natural language processing library, written in the programming languages Python and Cython.

In [33]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as STOP_WORDS_EN
from spacy.lang.de.stop_words import STOP_WORDS as STOP_WORDS_DE
from string import punctuation
from collections import Counter
from heapq import nlargest

In [7]:
# if first run, execute the following command in console
#!python -m spacy download en_core_web_lg

# load spaCy model
nlp_en = spacy.load('en_core_web_lg')

In [8]:
# example text
doc_en = "Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."

In [9]:
# applying model to text
doc_en = nlp_en(doc_en)
print('Number of sentences = ',len(list(doc_en.sents)))

Number of sentences =  7


In [10]:
keyword = []
stopwords = list(STOP_WORDS_EN)
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
for token in doc_en:
    if token.text in stopwords or token.text in punctuation:
        continue
    if token.pos_ in pos_tag:
        keyword.append(token.text)

In [11]:
freq_word = Counter(keyword)
freq_word.most_common(5)

[('learning', 8), ('Machine', 4), ('study', 3), ('algorithms', 3), ('task', 3)]

In [13]:
# frequency of the most common token
max_freq = Counter(keyword).most_common(1)[0][1]

# normalize token frequency; rebuild from the raw counts so that re-running
# this cell does not divide already normalized values a second time
freq_word = Counter({word: count / max_freq for word, count in Counter(keyword).items()})

freq_word.most_common(5)

[('learning', 1.0),
 ('Machine', 0.5),
 ('study', 0.375),
 ('algorithms', 0.375),
 ('task', 0.375)]

In [14]:
# calculate sentence strength
sent_strength = {}
for sent in doc_en.sents:
    for word in sent:
        if word.text in freq_word.keys():
            if sent in sent_strength.keys():
                sent_strength[sent] += freq_word[word.text]
            else:
                sent_strength[sent] = freq_word[word.text]

In [34]:
# get n sentences with descending strength
summarized_sentences = nlargest(3, sent_strength, key=sent_strength.get)
print(summarized_sentences)

[Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task., Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task., Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning.]


In [35]:
# convert to string
final_sentences = [w.text for w in summarized_sentences]
summary = ' '.join(final_sentences)
print(summary)

Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning.
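`nlargest` returns the selected sentences ordered by score, which need not match their order in the text. In the run above the two orders happen to coincide. The sketch below shows one way to restore the source order explicitly; the sentences and scores are hypothetical toy values.

```python
from heapq import nlargest

# toy sentences and scores keyed by position in the text (hypothetical values)
sentences = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
scores = {0: 0.4, 1: 2.5, 2: 0.9, 3: 3.1}

# indices of the top-n sentences, ordered by descending score
top_by_score = nlargest(2, scores, key=scores.get)

# re-sort the selected indices so the summary follows the source order
top_in_order = sorted(top_by_score)

summary = " ".join(sentences[i] for i in top_in_order)
print(top_by_score)  # [3, 1]
print(summary)       # Second sentence. Fourth sentence.
```

With spaCy sentence spans, the same effect should be achievable by sorting on the span's start offset, for example `sorted(summarized_sentences, key=lambda s: s.start)`.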


#### Repeat for German:

In [36]:
#!python -m spacy download de_core_news_lg
nlp_de = spacy.load('de_core_news_lg')

In [37]:
doc_de = "Die Städtische Straßenbahn Spandau war ein Straßenbahnbetrieb im Großraum Berlin. Das am 26. April 1892 als Spandauer Straßenbahn Simmel, Matzky & Müller beziehungsweise Spandauer Straßenbahn Simmel, Matzky & Co. gegründete Unternehmen eröffnete am 1. Juni desselben Jahres seine erste Pferdebahnstrecke zwischen der damals selbstständigen Stadt Spandau und der damaligen Landgemeinde Pichelsdorf. Die Betriebsführung oblag ab 1894 der Allgemeinen Deutschen Kleinbahn-Gesellschaft, die am 7. März 1896 die elektrische Traktion einführte. Ab 1899 war die AEG Betriebsführerin der Straßenbahn, ab 1909 die Stadt Spandau. Im gleichen Jahr erwarb die Stadt die Elektrische Straßenbahn Spandau–Nonnendamm von Siemens & Halske, die bis 1914 vollständig in der Spandauer Straßenbahn aufging. Neben den Strecken nach Pichelsdorf und in die Siemensstadt bestanden weitere Äste nach Hakenfelde, Johannesstift sowie zum Spandauer Bock. 1920 ging die Spandauer Straßenbahn im Zuge des Groß-Berlin-Gesetzes in der Berliner Straßenbahn auf. Der letzte Streckenabschnitt nach Hakenfelde wurde am 2. Oktober 1967 stillgelegt, das Datum markiert gleichzeitig das Ende der Straßenbahn in West-Berlin."

In [38]:
# applying model to text
doc_de = nlp_de(doc_de)
print('Number of sentences = ',len(list(doc_de.sents)))

Number of sentences =  9


In [39]:
keyword = []
stopwords = list(STOP_WORDS_DE)
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
for token in doc_de:
    if token.text in stopwords or token.text in punctuation:
        continue
    if token.pos_ in pos_tag:
        keyword.append(token.text)

freq_word = Counter(keyword)

# get most common token
max_freq = Counter(keyword).most_common(1)[0][1]

# normalize token frequency
for word in freq_word.keys():
    freq_word[word] = (freq_word[word]/max_freq)

# calculate sentence strength
sent_strength = {}
for sent in doc_de.sents:
    for word in sent:
        if word.text in freq_word.keys():
            if sent in sent_strength.keys():
                sent_strength[sent] += freq_word[word.text]
            else:
                sent_strength[sent] = freq_word[word.text]

summarized_sentences = nlargest(3, sent_strength, key=sent_strength.get)

# convert to string
final_sentences = [w.text for w in summarized_sentences]
summary = ' '.join(final_sentences)
print(summary)

Das am 26. April 1892 als Spandauer Straßenbahn Simmel, Matzky & Müller beziehungsweise Spandauer Straßenbahn Simmel, Matzky & Co. gegründete Unternehmen eröffnete am 1. Juni desselben Jahres seine erste Pferdebahnstrecke zwischen der damals selbstständigen Stadt Spandau und der damaligen Landgemeinde Pichelsdorf. 1920 ging die Spandauer Straßenbahn im Zuge des Groß-Berlin-Gesetzes in der Berliner Straßenbahn auf. Der letzte Streckenabschnitt nach Hakenfelde wurde am 2. Oktober 1967 stillgelegt, das Datum markiert gleichzeitig das Ende der Straßenbahn in West-Berlin.


### nltk

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

In [40]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

In [41]:
# example text
text_en = "Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."

In [42]:
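import nltk

# resolve the corpus through the package path: the assignment below rebinds the
# imported name `stopwords`, so `stopwords.words(...)` would fail on a re-run
stopwords = set(nltk.corpus.stopwords.words("english"))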
words = word_tokenize(text_en)

In [47]:
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopwords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

In [48]:
sentences = sent_tokenize(text_en)
sentenceValue = dict()

In [49]:
for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

In [50]:
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

average = int(sumValues / len(sentenceValue))

In [51]:
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence
print(summary)

 Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task.
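Note that the scoring loop above tests `word in sentence.lower()`, which is a substring check on a string, so a short entry such as `task` also counts inside `multitasking`, and punctuation tokens left in `freqTable` match as well. The sketch below contrasts this with matching on whole tokens. The regex tokenizer and both helper names are assumptions for illustration, and the token-based variant would yield scores different from those shown above.

```python
import re


def score_substring(sentence, freq_table):
    """Score as in the loop above: each table entry found as a substring counts once."""
    lowered = sentence.lower()
    return sum(freq for word, freq in freq_table.items() if word in lowered)


def score_tokens(sentence, freq_table):
    """Alternative: only whole tokens count, once per distinct token."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    return sum(freq for word, freq in freq_table.items() if word in tokens)


freq_table = {"task": 3, "learning": 8}
sentence = "Multitasking is hard."
print(score_substring(sentence, freq_table))  # 3, "task" matches inside "multitasking"
print(score_tokens(sentence, freq_table))     # 0, no whole-token match
```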


## Abstractive Summarization

Abstractive Text Summarization is the task of generating a short and concise summary that captures the salient ideas of the source text. The generated summaries potentially contain new phrases and sentences that may not appear in the source text.

### Pipeline API

In [1]:
#! pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

from transformers import pipeline

In [2]:
# using pipeline API for summarization
summarization = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [3]:
# example text
original_text_en = "Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."

In [4]:
summary_text = summarization(original_text_en)[0]['summary_text']

In [6]:
print("Summary:\n", summary_text)

Summary:
  Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task . Machine learning algorithms build a mathematical model of sample data in order to make predictions or decisions without being explicitly programmed to perform the task . Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning .


### T5 model

In [1]:
#!pip install sentencepiece
from transformers import T5ForConditionalGeneration, T5Tokenizer

# initialize the model architecture and weights
model = T5ForConditionalGeneration.from_pretrained("t5-base")
# initialize the model tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")


In [2]:
# example text
original_text_en = "Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."

In [3]:
# encode the text into tensor of integers using the appropriate tokenizer
inputs = tokenizer.encode("summarize: " + original_text_en, return_tensors="pt", max_length=512, truncation=True)

In [4]:
# generate the summarization output
outputs = model.generate(
    inputs,
    max_length=150,
    min_length=40,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True)
# just for debugging
print(outputs)
print(tokenizer.decode(outputs[0]))

tensor([[    0,  1437,  1036,    41,  6858,    61,    19,     8,  4290,   810,
            13, 16783,    11, 11775,  2250,    24,  1218,  1002,   169,    12,
             3, 31599,  1172,    70,   821,    30,     3,     9,   806,  2491,
             3,     5,  1437,  1036, 16783,   918,     3,     9, 18913,   825,
            13,  3106,   331,     6,   801,    38,   105, 13023,   331,  1241,
            16,   455,    12,   143, 20099,    42,  3055,   406,   271, 21119,
          2486,    26,     3,     5,     1]])
<pad> machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed.</s>

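The decoded string above still contains the special tokens `<pad>` and `</s>`, and the sentences start in lower case. Passing `skip_special_tokens=True` to `tokenizer.decode` drops the special tokens. The snippet below is a library-free sketch of the same cleanup plus re-capitalization; `clean_summary` is a hypothetical helper, and its simple sentence-start rule will also capitalize after abbreviations such as "e.g.".

```python
import re


def clean_summary(decoded):
    """Strip T5 special tokens, tidy spacing, and capitalize sentence starts."""
    # remove the special tokens seen in the decoded output above
    text = decoded.replace("<pad>", "").replace("</s>", "")
    # collapse whitespace and drop spaces before punctuation
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # upper-case the first letter of the text and of each following sentence
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


raw = "<pad> machine learning is a field of study. it builds models from data .</s>"
print(clean_summary(raw))  # Machine learning is a field of study. It builds models from data.
```

The spacing step also removes the blank before each period that appears in the pipeline output further up.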

## Conclusion

In the end, there are multiple viable approaches to text summarization. Both extractive and abstractive summarization have their advantages and disadvantages.

Extractive summarization, for example, generally produces more readable and syntactically correct output, since it reuses sentences from the source. This, however, depends strongly on the quality of the original text. The approach works best when the input text consists of both general descriptions and specialized, in-depth parts. It condenses well only when the sentences (text splits) are short.

Abstractive summarization offers a more human-like approach to condensing texts. The trained models (in this case transformers) capture the general concepts of the text and try to generate their own summary (see text generation). As a result, these models can substantially change the sentence and text structure compared to the input text. They struggle, however, to produce grammatically and syntactically correct text, especially with specialized vocabulary they were not trained on.

As the first goal is to create a summarization service for Wikipedia texts (and other curated texts), extractive summarization was selected as the better approach.