In [10]:
text1= """
Extractive summarization is a challenging task that has only recently become practical.
Like many things NLP, one reason for this progress is the superior embeddings offered by transformer models like BERT.
This project uses BERT sentence embeddings to build an extractive summarizer taking two supervised approaches.
The first considers only embeddings and their derivatives.
This corresponds to our intuition that a good summarizer can parse meaning and should select sentences based purely on the internal structure of the article. The baseline for this approach is the unsupervised TextRank model. The other approach incorporates sequential information and takes advantage of the well known Lead3 phenomena particular to news corpuses. This is the observation that the first three sentences typically do a good job in summarizing the article. In fact, this strategy is explicitly deployed by many publishers. Lead3 is used as the baseline for this second approach.
In both cases, the supervised models outperform the baselines on the Rouge-1 and Rouge-L F1 metric."""

Sample text used to test the summarizer


In [11]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

Importing libraries

In [12]:
stopwords = list(STOP_WORDS)

In [13]:
nlp = spacy.load('en_core_web_sm')

In [14]:
document = nlp(text1)

In [15]:
tokens = [token.text for token in document]
print(tokens)

['\n', 'Extractive', 'summarization', 'is', 'a', 'challenging', 'task', 'that', 'has', 'only', 'recently', 'become', 'practical', '.', '\n', 'Like', 'many', 'things', 'NLP', ',', 'one', 'reason', 'for', 'this', 'progress', 'is', 'the', 'superior', 'embeddings', 'offered', 'by', 'transformer', 'models', 'like', 'BERT', '.', '\n', 'This', 'project', 'uses', 'BERT', 'sentence', 'embeddings', 'to', 'build', 'an', 'extractive', 'summarizer', 'taking', 'two', 'supervised', 'approaches', '.', '\n', 'The', 'first', 'considers', 'only', 'embeddings', 'and', 'their', 'derivatives', '.', '\n', 'This', 'corresponds', 'to', 'our', 'intuition', 'that', 'a', 'good', 'summarizer', 'can', 'parse', 'meaning', 'and', 'should', 'select', 'sentences', 'based', 'purely', 'on', 'the', 'internal', 'structure', 'of', 'the', 'article', '.', 'The', 'baseline', 'for', 'this', 'approach', 'is', 'the', 'unsupervised', 'TextRank', 'model', '.', 'The', 'other', 'approach', 'incorporates', 'sequential', 'information',

Splitting the text into individual tokens (words, punctuation marks, and newlines)

In [18]:
# Guard so that re-running this cell does not append '\n' a second time
if '\n' not in punctuation:
    punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'

Adding the newline character to the punctuation characters so that newline tokens are filtered out as well

In [19]:
word_freq = {}
for word in document:
    text = word.text.lower()
    # Skip stop words and punctuation/newline tokens
    if text in stopwords or text in punctuation:
        continue
    # Count the token; keys keep the token's original casing
    word_freq[word.text] = word_freq.get(word.text, 0) + 1
                
print(word_freq)

{'Extractive': 1, 'summarization': 1, 'challenging': 1, 'task': 1, 'recently': 1, 'practical': 1, 'Like': 1, 'things': 1, 'NLP': 1, 'reason': 1, 'progress': 1, 'superior': 1, 'embeddings': 3, 'offered': 1, 'transformer': 1, 'models': 2, 'like': 1, 'BERT': 2, 'project': 1, 'uses': 1, 'sentence': 1, 'build': 1, 'extractive': 1, 'summarizer': 2, 'taking': 1, 'supervised': 2, 'approaches': 1, 'considers': 1, 'derivatives': 1, 'corresponds': 1, 'intuition': 1, 'good': 2, 'parse': 1, 'meaning': 1, 'select': 1, 'sentences': 2, 'based': 1, 'purely': 1, 'internal': 1, 'structure': 1, 'article': 2, 'baseline': 2, 'approach': 3, 'unsupervised': 1, 'TextRank': 1, 'model': 1, 'incorporates': 1, 'sequential': 1, 'information': 1, 'takes': 1, 'advantage': 1, 'known': 1, 'Lead3': 2, 'phenomena': 1, 'particular': 1, 'news': 1, 'corpuses': 1, 'observation': 1, 'typically': 1, 'job': 1, 'summarizing': 1, 'fact': 1, 'strategy': 1, 'explicitly': 1, 'deployed': 1, 'publishers': 1, 'second': 1, 'cases': 1, '

Frequency of each word in the text, excluding stop words and punctuation
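The counting step can also be sketched without spaCy. This is a minimal illustration, not the notebook's code: the token list, `stop_words`, `skip_chars`, and `count_words` are hypothetical stand-ins, and unlike the cell above it stores lowercased keys so that "BERT" and "bert" share one count.

```python
from collections import Counter
import string

# Hypothetical toy inputs; the notebook uses spaCy tokens and STOP_WORDS instead.
tokens = ["BERT", "embeddings", "improve", "the", "summarizer", ".",
          "bert", "Embeddings", "\n"]
stop_words = {"the", "a", "is"}
skip_chars = set(string.punctuation) | {"\n"}


def count_words(tokens, stop_words, skip_chars):
    """Count case-folded tokens, skipping stop words and punctuation tokens."""
    counts = Counter()
    for tok in tokens:
        key = tok.lower()
        if key in stop_words or key in skip_chars:
            continue
        counts[key] += 1
    return counts


word_counts = count_words(tokens, stop_words, skip_chars)
print(word_counts)
```

Using a set for `skip_chars` means a token is dropped only when it is exactly one of those characters, rather than a substring of a longer punctuation string.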

In [22]:
max_freq = max(word_freq.values())
max_freq

3

Finding maximum frequency

In [23]:
for word in word_freq.keys():
    word_freq[word] = word_freq[word]/max_freq

print(word_freq)

{'Extractive': 0.3333333333333333, 'summarization': 0.3333333333333333, 'challenging': 0.3333333333333333, 'task': 0.3333333333333333, 'recently': 0.3333333333333333, 'practical': 0.3333333333333333, 'Like': 0.3333333333333333, 'things': 0.3333333333333333, 'NLP': 0.3333333333333333, 'reason': 0.3333333333333333, 'progress': 0.3333333333333333, 'superior': 0.3333333333333333, 'embeddings': 1.0, 'offered': 0.3333333333333333, 'transformer': 0.3333333333333333, 'models': 0.6666666666666666, 'like': 0.3333333333333333, 'BERT': 0.6666666666666666, 'project': 0.3333333333333333, 'uses': 0.3333333333333333, 'sentence': 0.3333333333333333, 'build': 0.3333333333333333, 'extractive': 0.3333333333333333, 'summarizer': 0.6666666666666666, 'taking': 0.3333333333333333, 'supervised': 0.6666666666666666, 'approaches': 0.3333333333333333, 'considers': 0.3333333333333333, 'derivatives': 0.3333333333333333, 'corresponds': 0.3333333333333333, 'intuition': 0.3333333333333333, 'good': 0.6666666666666666, 

Normalized frequency of each word (its count divided by the maximum frequency), not a percentage
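A possible variant of the normalization step, assuming a plain dict of counts: the hypothetical helper `normalize` returns a new dict instead of overwriting the counts in place, so running it twice does not rescale values that are already normalized.

```python
def normalize(counts):
    """Return a new dict with each count divided by the largest count."""
    if not counts:
        return {}
    peak = max(counts.values())
    return {word: n / peak for word, n in counts.items()}


# Hypothetical counts, chosen to mirror the 3 / 2 / 1 pattern seen above
raw = {"embeddings": 3, "models": 2, "task": 1}
scaled = normalize(raw)
print(scaled)
```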

In [24]:
sen_tokens = list(document.sents)
print(sen_tokens)

[
Extractive summarization is a challenging task that has only recently become practical.
, Like many things NLP, one reason for this progress is the superior embeddings offered by transformer models like BERT.
, This project uses BERT sentence embeddings to build an extractive summarizer taking two supervised approaches.
, The first considers only embeddings and their derivatives.
, This corresponds to our intuition that a good summarizer can parse meaning and should select sentences based purely on the internal structure of the article., The baseline for this approach is the unsupervised TextRank model., The other approach incorporates sequential information and takes advantage of the well known Lead3 phenomena particular to news corpuses., This is the observation that the first three sentences typically do a good job in summarizing the article., In fact, this strategy is explicitly deployed by many publishers., Lead3 is used as the baseline for this second approach.
, In both cases,

Splitting the text into sentences

In [25]:
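sen_scores = {}
for sent in sen_tokens:
    for word in sent:
        # The lookup uses the lowercased token, while word_freq keys keep their
        # original casing, so tokens such as 'BERT' or 'NLP' add nothing here.
        key = word.text.lower()
        if key in word_freq:
            # Add the word's normalized frequency to its sentence's score
            sen_scores[sent] = sen_scores.get(sent, 0) + word_freq[key]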
                
sen_scores

{
 Extractive summarization is a challenging task that has only recently become practical.: 1.9999999999999998,
 Like many things NLP, one reason for this progress is the superior embeddings offered by transformer models like BERT.: 4.333333333333333,
 This project uses BERT sentence embeddings to build an extractive summarizer taking two supervised approaches.: 4.666666666666667,
 The first considers only embeddings and their derivatives.: 1.6666666666666665,
 This corresponds to our intuition that a good summarizer can parse meaning and should select sentences based purely on the internal structure of the article.: 5.666666666666666,
 The baseline for this approach is the unsupervised TextRank model.: 2.333333333333333,
 The other approach incorporates sequential information and takes advantage of the well known Lead3 phenomena particular to news corpuses.: 4.333333333333333,
 This is the observation that the first three sentences typically do a good job in summarizing the article.: 

Assigning a score to each sentence: the sum of the normalized frequencies of its words
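The scoring rule can be illustrated on plain strings. This is a simplified sketch under stated assumptions: `score_sentences`, the `weights` table, and the whitespace split are hypothetical and stand in for the spaCy tokens and `word_freq` used above.

```python
def score_sentences(sentences, weights):
    """Sum the weights of each sentence's words; unknown words count as 0."""
    scores = {}
    for sentence in sentences:
        total = sum(weights.get(word.lower(), 0.0) for word in sentence.split())
        if total > 0:
            scores[sentence] = total
    return scores


# Hypothetical normalized word weights and sentences
weights = {"embeddings": 1.0, "bert": 0.5, "summarizer": 0.5}
sentences = [
    "BERT embeddings help",
    "the summarizer works",
    "nothing relevant here",
]
sentence_scores = score_sentences(sentences, weights)
print(sentence_scores)
```

Because the score is a sum rather than an average, longer sentences tend to score higher simply by containing more weighted words.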

In [26]:
from heapq import nlargest

In [27]:
length = int(len(sen_tokens) * 0.3)  # keep about 30% of the sentences
length

3

In [43]:
summary = nlargest(length, sen_scores, key = sen_scores.get)
summary

[This corresponds to our intuition that a good summarizer can parse meaning and should select sentences based purely on the internal structure of the article.,
 This project uses BERT sentence embeddings to build an extractive summarizer taking two supervised approaches.,
 Like many things NLP, one reason for this progress is the superior embeddings offered by transformer models like BERT.]

Selecting the sentences with the highest scores
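As a small illustration of how `nlargest` behaves on a dictionary: iterating a dict yields its keys, and `key=scores.get` ranks those keys by their values. The `scores` dict below is made up for the example.

```python
from heapq import nlargest

# Hypothetical sentence scores
scores = {"sentence A": 1.5, "sentence B": 4.0, "sentence C": 2.5, "sentence D": 0.5}

# Keys are ranked by their associated value, highest first
top_two = nlargest(2, scores, key=scores.get)
print(top_two)  # ['sentence B', 'sentence C']
```

The result is ordered by score, not by position in the original text.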

In [44]:
# Each element of summary is a sentence span, not a single word
final_summary = [sent.text for sent in summary]
summary = ' '.join(final_summary)

Joining the highest-scoring sentences into the summary
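The join above keeps the sentences in score order. One possible alternative, not what this notebook does, is to put the selected sentences back into document order before joining. The sketch below works on plain strings; `join_in_document_order` and its inputs are hypothetical.

```python
def join_in_document_order(selected, all_sentences):
    """Join the selected sentences in the order they appear in the document."""
    position = {sentence: i for i, sentence in enumerate(all_sentences)}
    ordered = sorted(selected, key=position.__getitem__)
    # strip() drops stray newlines left over from the source text
    return " ".join(s.strip() for s in ordered)


# Hypothetical document and selection (selection is in score order)
all_sentences = ["First point.", "Second point.\n", "Third point.", "Fourth point."]
selected = ["Third point.", "First point."]
summary_text = join_in_document_order(selected, all_sentences)
print(summary_text)  # First point. Third point.
```

With spaCy sentence spans, sorting by each span's `start` attribute would presumably give the same ordering.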

In [45]:
print(text1)


Extractive summarization is a challenging task that has only recently become practical.
Like many things NLP, one reason for this progress is the superior embeddings offered by transformer models like BERT.
This project uses BERT sentence embeddings to build an extractive summarizer taking two supervised approaches.
The first considers only embeddings and their derivatives.
This corresponds to our intuition that a good summarizer can parse meaning and should select sentences based purely on the internal structure of the article. The baseline for this approach is the unsupervised TextRank model. The other approach incorporates sequential information and takes advantage of the well known Lead3 phenomena particular to news corpuses. This is the observation that the first three sentences typically do a good job in summarizing the article. In fact, this strategy is explicitly deployed by many publishers. Lead3 is used as the baseline for this second approach.
In both cases, the supervised 

In [46]:
print(summary)

This corresponds to our intuition that a good summarizer can parse meaning and should select sentences based purely on the internal structure of the article. This project uses BERT sentence embeddings to build an extractive summarizer taking two supervised approaches.
 Like many things NLP, one reason for this progress is the superior embeddings offered by transformer models like BERT.



In [47]:
len(summary)

389

In [48]:
len(text1)

1069
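The two lengths above give the compression ratio in characters. A quick check using the values printed by the last two cells:

```python
summary_chars = 389   # len(summary) from the cell above
source_chars = 1069   # len(text1) from the cell above

ratio = summary_chars / source_chars
print(f"{ratio:.1%}")  # 36.4%
```

The summary keeps about 30% of the sentences but roughly 36% of the characters, which is consistent with the sum-based score favoring longer sentences.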