# Text Summarization

*Text Summarization is the process of distilling the most important information from a source text.*

# Why Automatic Summarization



*   Summarization reduces the reading time.
*   When researching documents, summaries make the selection process easier.
*   It improves the effectiveness of indexing.
*   Automatic summarization algorithms can be less biased than human summarizers.
*   Personalized summaries are useful in question-answering systems as they provide personalized information.









# How to do Text Summarization

* Text Cleaning 
* Sentence Tokenization
* Word-Frequency Table 
* Clustering 
* Summarization 
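The steps listed above can be sketched end to end with only the standard library. This is a rough illustration, not the spaCy implementation used in Model 1: the regex sentence splitter, the tiny `STOP_WORDS` set, and the function name `summarize` are assumptions made for this example, and the clustering step is not covered.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (assumption; spaCy ships a much larger one).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "by", "has"}

def summarize(text, ratio=0.3):
    # 1. Text cleaning: collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # 2. Sentence tokenization: naive split after ., ! or ? followed by a space.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # 3. Word-frequency table over lowercased, non-stop-word tokens.
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)
    if not freq:
        return ""
    top = max(freq.values())
    # 4. Score each sentence by summing normalized word frequencies.
    scores = {}
    for i, sent in enumerate(sentences):
        tokens = re.findall(r"[a-z0-9']+", sent.lower())
        scores[i] = sum(freq[t] / top for t in tokens if t in freq)
    # 5. Keep the highest-scoring sentences, emitted in document order.
    k = max(1, int(len(sentences) * ratio))
    chosen = sorted(nlargest(k, scores, key=scores.get))
    return " ".join(sentences[i] for i in chosen)

example = "Cats sleep a lot. Cats chase mice and cats. Dogs bark."
print(summarize(example, ratio=0.34))
```

The naive splitter needs whitespace after each sentence-ending mark, so real text usually calls for a proper tokenizer such as the one spaCy provides.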





# Model 1 using spaCy

In [None]:
text = "Lee Enterprises, the owner of several Montana newspapers, has rejected a bid by Alden Global Capital, LLC, a hedge fund, to purchase the company. Lee is the parent company of the Montana newspapers the Helena Independent Record, the Missoulian, the Billings Gazette, the Montana Standard and the Ravalli Republic.Alden was offering $24 a share and already owned 6% of the issued and outstanding common stock of Lee at the time. The board of Lee Enterprises formally rejected the offer Thursday.Recent Stories from ktvh.com. Lee Board Chairman Mary Junck said in a press release, “The Alden proposal grossly undervalues Lee and fails to recognize the strength of our business today, as the fastest-growing digital subscription platform in local media, and our compelling future prospects. We remain confident in our ability to create significant value as an independent company.”Lee Enterprises is a public traded company that has 75 daily newspaper outlets across the nation.Alden currently owns around 200 publications, making them the second-largest newspaper publisher in the United States. However, the hedge fund has earned a reputation of slashing costs, often through layoffs, among journalists formerly employed at those publications."

In [None]:
len(text)

1242

In [None]:
##Importing Libraries

#!pip install spacy 
#!python -m spacy download en_core_web_sm

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [None]:
#Stopwords
stop_words = list(STOP_WORDS)

In [None]:
#Tokenization
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

In [None]:
#List of tokens

tokens = [token.text for token in doc]
print(tokens)

['Lee', 'Enterprises', ',', 'the', 'owner', 'of', 'several', 'Montana', 'newspapers', ',', 'has', 'rejected', 'a', 'bid', 'by', 'Alden', 'Global', 'Capital', ',', 'LLC', ',', 'a', 'hedge', 'fund', ',', 'to', 'purchase', 'the', 'company', '.', 'Lee', 'is', 'the', 'parent', 'company', 'of', 'the', 'Montana', 'newspapers', 'the', 'Helena', 'Independent', 'Record', ',', 'the', 'Missoulian', ',', 'the', 'Billings', 'Gazette', ',', 'the', 'Montana', 'Standard', 'and', 'the', 'Ravalli', 'Republic', '.', 'Alden', 'was', 'offering', '$', '24', 'a', 'share', 'and', 'already', 'owned', '6', '%', 'of', 'the', 'issued', 'and', 'outstanding', 'common', 'stock', 'of', 'Lee', 'at', 'the', 'time', '.', 'The', 'board', 'of', 'Lee', 'Enterprises', 'formally', 'rejected', 'the', 'offer', 'Thursday', '.', 'Recent', 'Stories', 'from', 'ktvh.com', '.', 'Lee', 'Board', 'Chairman', 'Mary', 'Junck', 'said', 'in', 'a', 'press', 'release', ',', '“', 'The', 'Alden', 'proposal', 'grossly', 'undervalues', 'Lee', 'an

In [None]:
#In the above tokens list, we get both the punctuation marks and also the stopwords. 
#Task is to remove the punctuation marks and stop words.

In [None]:
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'
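Because `punctuation` here is a single string, an expression such as `token not in punctuation` is a substring test rather than a per-character membership test. The sketch below illustrates the difference; the helper `is_punct_token` is a hypothetical name introduced for this example and is not part of spaCy or of this notebook.

```python
from string import punctuation

punct_string = punctuation + "\n"
punct_chars = set(punct_string)

def is_punct_token(token):
    """Return True when every character of the token is punctuation."""
    return bool(token) and all(ch in punct_chars for ch in token)

# Substring semantics: a multi-character token matches only if it appears
# contiguously inside the punctuation string.
print("," in punct_string)         # single character: found
print("..." in punct_string)       # an ellipsis is not a contiguous substring
print(is_punct_token("..."))       # the per-character check treats it as punctuation
print(is_punct_token("ktvh.com"))  # a token that merely contains a dot is kept
```

spaCy tokens also expose an `is_punct` attribute, which could stand in for the string test.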

In [None]:
##Finding weighted frequencies of occurrence

word_frequencies = {}
for word in doc:
  if word.text.lower() not in stop_words:
    if word.text.lower() not in punctuation:
      # First occurrence of a word: start its count at 1.
      # Later occurrences: increment the existing count.
      if word.text not in word_frequencies.keys():
        word_frequencies[word.text] = 1
      else:
        word_frequencies[word.text] += 1

In [None]:
print(word_frequencies)

{'Lee': 6, 'Enterprises': 3, 'owner': 1, 'Montana': 3, 'newspapers': 2, 'rejected': 2, 'bid': 1, 'Alden': 4, 'Global': 1, 'Capital': 1, 'LLC': 1, 'hedge': 2, 'fund': 2, 'purchase': 1, 'company': 4, 'parent': 1, 'Helena': 1, 'Independent': 1, 'Record': 1, 'Missoulian': 1, 'Billings': 1, 'Gazette': 1, 'Standard': 1, 'Ravalli': 1, 'Republic': 1, 'offering': 1, '24': 1, 'share': 1, 'owned': 1, '6': 1, 'issued': 1, 'outstanding': 1, 'common': 1, 'stock': 1, 'time': 1, 'board': 1, 'formally': 1, 'offer': 1, 'Thursday': 1, 'Recent': 1, 'Stories': 1, 'ktvh.com': 1, 'Board': 1, 'Chairman': 1, 'Mary': 1, 'Junck': 1, 'said': 1, 'press': 1, 'release': 1, '“': 1, 'proposal': 1, 'grossly': 1, 'undervalues': 1, 'fails': 1, 'recognize': 1, 'strength': 1, 'business': 1, 'today': 1, 'fastest': 1, 'growing': 1, 'digital': 1, 'subscription': 1, 'platform': 1, 'local': 1, 'media': 1, 'compelling': 1, 'future': 1, 'prospects': 1, 'remain': 1, 'confident': 1, 'ability': 1, 'create': 1, 'significant': 1, 'val

In [None]:
max_frequency = max(word_frequencies.values())
max_frequency

6

In [None]:
##To find the weighted frequency, divide the frequency of the word by the frequency of the most occurring word.

for word in word_frequencies.keys():
  word_frequencies[word] = word_frequencies[word]/max_frequency

In [None]:
print('Normalized frequency of each word in the text data')
print(word_frequencies)


Normalized frequency of each word in the text data
{'Lee': 1.0, 'Enterprises': 0.5, 'owner': 0.16666666666666666, 'Montana': 0.5, 'newspapers': 0.3333333333333333, 'rejected': 0.3333333333333333, 'bid': 0.16666666666666666, 'Alden': 0.6666666666666666, 'Global': 0.16666666666666666, 'Capital': 0.16666666666666666, 'LLC': 0.16666666666666666, 'hedge': 0.3333333333333333, 'fund': 0.3333333333333333, 'purchase': 0.16666666666666666, 'company': 0.6666666666666666, 'parent': 0.16666666666666666, 'Helena': 0.16666666666666666, 'Independent': 0.16666666666666666, 'Record': 0.16666666666666666, 'Missoulian': 0.16666666666666666, 'Billings': 0.16666666666666666, 'Gazette': 0.16666666666666666, 'Standard': 0.16666666666666666, 'Ravalli': 0.16666666666666666, 'Republic': 0.16666666666666666, 'offering': 0.16666666666666666, '24': 0.16666666666666666, 'share': 0.16666666666666666, 'owned': 0.16666666666666666, '6': 0.16666666666666666, 'issued': 0.16666666666666666, 'outstanding': 0.16666666666
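In symbols, the normalization divides each raw count by the largest count. Writing $f(w)$ for the raw frequency of word $w$ (notation introduced here for illustration), with worked values taken from the counts printed earlier:

```latex
\hat{f}(w) = \frac{f(w)}{\max_{w'} f(w')},
\qquad
\hat{f}(\text{Lee}) = \frac{6}{6} = 1.0,
\quad
\hat{f}(\text{Alden}) = \frac{4}{6} \approx 0.667,
\quad
\hat{f}(\text{owner}) = \frac{1}{6} \approx 0.167
```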

In [None]:
##Sentence Tokenization

sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)

[Lee Enterprises, the owner of several Montana newspapers, has rejected a bid by Alden Global Capital, LLC, a hedge fund, to purchase the company., Lee is the parent company of the Montana newspapers the Helena Independent Record, the Missoulian, the Billings Gazette, the Montana Standard and the Ravalli Republic., Alden was offering $24 a share and already owned 6% of the issued and outstanding common stock of Lee at the time., The board of Lee Enterprises formally rejected the offer Thursday., Recent Stories from ktvh.com., Lee Board Chairman Mary Junck said in a press release, “The Alden proposal grossly undervalues Lee and fails to recognize the strength of our business today, as the fastest-growing digital subscription platform in local media, and our compelling future prospects., We remain confident in our ability to create significant value as an independent company., ”Lee, Enterprises is a public traded company that has 75 daily newspaper outlets across the nation., Alden curre

In [None]:
##Calculate Sentence scores
#We have calculated the weighted frequencies. 
#Now scores for each sentence can be calculated by adding weighted frequencies for each word.

sentence_scores = {}
for sent in sentence_tokens:
  for word in sent:
    if word.text.lower() in word_frequencies.keys():
      # Each word carries a normalized frequency. Sum these per sentence;
      # the highest-scoring sentences are treated as the most important.
      # Note: the lookup uses lowercased text, while the keys built earlier keep
      # their original casing, so capitalized words such as 'Lee' are not matched here.
      if sent not in sentence_scores.keys():
        sentence_scores[sent] = word_frequencies[word.text.lower()]
      else:
        # Add this word's weight to the sentence's running score.
        sentence_scores[sent] += word_frequencies[word.text.lower()]

In [None]:
sentence_scores

{Lee Enterprises, the owner of several Montana newspapers, has rejected a bid by Alden Global Capital, LLC, a hedge fund, to purchase the company.: 2.5,
 Lee is the parent company of the Montana newspapers the Helena Independent Record, the Missoulian, the Billings Gazette, the Montana Standard and the Ravalli Republic.: 1.3333333333333333,
 Alden was offering $24 a share and already owned 6% of the issued and outstanding common stock of Lee at the time.: 1.6666666666666667,
 The board of Lee Enterprises formally rejected the offer Thursday.: 0.8333333333333333,
 Recent Stories from ktvh.com.: 0.16666666666666666,
 Lee Board Chairman Mary Junck said in a press release, “The Alden proposal grossly undervalues Lee and fails to recognize the strength of our business today, as the fastest-growing digital subscription platform in local media, and our compelling future prospects.: 3.8333333333333317,
 We remain confident in our ability to create significant value as an independent company.: 
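The scores above are consistent with only lowercase keys contributing. The first sentence's 2.5 is the sum of the weights for owner, newspapers, rejected, bid, hedge, fund, purchase and company, while capitalized keys such as 'Lee' are never matched by the lowercased lookup. The following sketch uses plain Python lists in place of spaCy tokens and a made-up toy input to show how keying the table on lowercased text changes a score. The function names are hypothetical.

```python
from collections import Counter

def build_frequencies(tokens, stop_words, lowercase_keys):
    """Count non-stop-word tokens, optionally keying on lowercased text."""
    counts = Counter()
    for tok in tokens:
        if tok.lower() in stop_words:
            continue
        counts[tok.lower() if lowercase_keys else tok] += 1
    top = max(counts.values())
    return {word: n / top for word, n in counts.items()}

def score(sentence_tokens, frequencies):
    """Sum token weights, looking each token up by its lowercased text."""
    return sum(frequencies.get(tok.lower(), 0.0) for tok in sentence_tokens)

# Toy input (made up for illustration).
stop_words = {"the", "a"}
tokens = ["Lee", "rejected", "the", "bid", "Lee", "owns", "a", "paper"]
first_sentence = tokens[:4]

mixed = build_frequencies(tokens, stop_words, lowercase_keys=False)
lower = build_frequencies(tokens, stop_words, lowercase_keys=True)

print(score(first_sentence, mixed))  # 'Lee' is stored capitalized, so it is skipped
print(score(first_sentence, lower))  # 'lee' now contributes its weight
```

Whether proper nouns should count is a design choice. Lowercasing both sides lets frequent names such as the company's raise a sentence's score, which changes which sentences are selected.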

In [None]:
#Select the top 30% of sentences, ranked by score.

from heapq import nlargest

In [None]:
select_length = int(len(sentence_tokens) * 0.3)
select_length

3

In [None]:
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
summary

[Lee Board Chairman Mary Junck said in a press release, “The Alden proposal grossly undervalues Lee and fails to recognize the strength of our business today, as the fastest-growing digital subscription platform in local media, and our compelling future prospects.,
 Lee Enterprises, the owner of several Montana newspapers, has rejected a bid by Alden Global Capital, LLC, a hedge fund, to purchase the company.,
 However, the hedge fund has earned a reputation of slashing costs, often through layoffs, among journalists formerly employed at those publications.]
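`nlargest` returns the selected sentences ordered by score, which is why the quotation appears before the article's opening sentence in the output above. If reading order matters, one option is to rank by score and then re-sort the winners by position. The sketch below uses plain strings and made-up scores; `top_sentences_in_order` is a hypothetical helper.

```python
from heapq import nlargest

def top_sentences_in_order(sentences, scores, k):
    """Pick the k best-scoring sentences, then restore document order."""
    best = nlargest(k, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(best)]

sentences = ["First.", "Second.", "Third.", "Fourth."]
scores = [2.5, 0.8, 3.8, 1.7]  # made-up scores, one per sentence

# Ranked by score only: the highest-scoring sentence comes first.
print(nlargest(2, sentences, key=lambda s: scores[sentences.index(s)]))
# Ranked by score, then re-sorted by position in the document.
print(top_sentences_in_order(sentences, scores, 2))
```

With spaCy spans, sorting the selected sentences by `sent.start` would serve the same purpose.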

In [None]:
#we got the 3 most important sentences. 
#Now let's combine the 3 sentences together.

In [None]:
final_summary = [sent.text for sent in summary] #take the text of each selected sentence
final_summary

['Lee Board Chairman Mary Junck said in a press release, “The Alden proposal grossly undervalues Lee and fails to recognize the strength of our business today, as the fastest-growing digital subscription platform in local media, and our compelling future prospects.',
 'Lee Enterprises, the owner of several Montana newspapers, has rejected a bid by Alden Global Capital, LLC, a hedge fund, to purchase the company.',
 'However, the hedge fund has earned a reputation of slashing costs, often through layoffs, among journalists formerly employed at those publications.']

In [None]:
summary = ' '.join(final_summary)
summary

'Lee Board Chairman Mary Junck said in a press release, “The Alden proposal grossly undervalues Lee and fails to recognize the strength of our business today, as the fastest-growing digital subscription platform in local media, and our compelling future prospects. Lee Enterprises, the owner of several Montana newspapers, has rejected a bid by Alden Global Capital, LLC, a hedge fund, to purchase the company. However, the hedge fund has earned a reputation of slashing costs, often through layoffs, among journalists formerly employed at those publications.'

In [None]:
len(text)

1242

In [None]:
len(summary)

558
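The two lengths give a rough compression ratio. A small calculation using the character counts reported above:

```python
original_chars = 1242  # len(text) reported above
summary_chars = 558    # len(summary) reported above

ratio = summary_chars / original_chars
print(f"Summary is {ratio:.1%} of the original length")
```

The ratio is well above 30% even though 30% of the sentences were kept. A likely reason is that summing word weights favors longer sentences, so the selected ones are longer than average.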

# Model 2 using Bart Tokenizer

### BART (a Seq2Seq model that reported state-of-the-art summarization performance at its release)

In [3]:
pip install transformers

Collecting transformers
  Downloading transformers-4.13.0-py3-none-any.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attem

# BartTokenizer
### Encodes the string into tokens. We use BartTokenizer to get a proper splitting.

# BartForConditionalGeneration 
### The BART model with a language modeling head. Can be used for summarization. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

# BartConfig

### This is the configuration class to store the configuration of a BartModel. It is used to instantiate a BART model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BART facebook/bart-large architecture.

In [4]:
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

In [5]:
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')


In [6]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')


In [7]:
#ARTICLE_TO_SUMMARIZE = "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."
ARTICLE_TO_SUMMARIZE = "Lee Enterprises, the owner of several Montana newspapers, has rejected a bid by Alden Global Capital, LLC, a hedge fund, to purchase the company. Lee is the parent company of the Montana newspapers the Helena Independent Record, the Missoulian, the Billings Gazette, the Montana Standard and the Ravalli Republic.Alden was offering $24 a share and already owned 6% of the issued and outstanding common stock of Lee at the time. The board of Lee Enterprises formally rejected the offer Thursday.Recent Stories from ktvh.com. Lee Board Chairman Mary Junck said in a press release, “The Alden proposal grossly undervalues Lee and fails to recognize the strength of our business today, as the fastest-growing digital subscription platform in local media, and our compelling future prospects. We remain confident in our ability to create significant value as an independent company.”Lee Enterprises is a public traded company that has 75 daily newspaper outlets across the nation.Alden currently owns around 200 publications, making them the second-largest newspaper publisher in the United States. However, the hedge fund has earned a reputation of slashing costs, often through layoffs, among journalists formerly employed at those publications."

In [8]:
ARTICLE_TO_SUMMARIZE

'Lee Enterprises, the owner of several Montana newspapers, has rejected a bid by Alden Global Capital, LLC, a hedge fund, to purchase the company. Lee is the parent company of the Montana newspapers the Helena Independent Record, the Missoulian, the Billings Gazette, the Montana Standard and the Ravalli Republic.Alden was offering $24 a share and already owned 6% of the issued and outstanding common stock of Lee at the time. The board of Lee Enterprises formally rejected the offer Thursday.Recent Stories from ktvh.com. Lee Board Chairman Mary Junck said in a press release, “The Alden proposal grossly undervalues Lee and fails to recognize the strength of our business today, as the fastest-growing digital subscription platform in local media, and our compelling future prospects. We remain confident in our ability to create significant value as an independent company.”Lee Enterprises is a public traded company that has 75 daily newspaper outlets across the nation.Alden currently owns aro

In [9]:
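#Truncate to the 1024-token input limit of facebook/bart-large-cnn
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, truncation=True, return_tensors='pt')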

In [10]:
summary_ids = model.generate(inputs['input_ids'], max_length=500, early_stopping=False)
summary_ids

tensor([[    2,     0, 24403, 15989,     6,     5,  1945,     9,   484,  8920,
          9911,     6,    34,  3946,    10,  2311,    30,  8019,   225,  1849,
          1867,     6,  2291,     6,    10,  4445,  1391,     4,  2094,    16,
             5,  4095,   138,     9,     5, 25239,  6911, 10788,     6,     5,
          4523,  5156,   811,     6,     5,  1585,  1033, 16865,     6,     5,
          8920,  5787,     8,     5, 19321, 19273,  3497,     4,  8019,   225,
            21,  1839,    68,  1978,    10,   458,     8,   416,  2164,   231,
           207,     9,     5,  1167,     8,  3973,  1537,   388,     9,  2094,
            23,     5,    86,     4,  2094,  1785,  3356,  2708,  6752,  2420,
            26,    11,    10,  1228,   800,     6,    44,    48,   133,  8019,
           225,  2570, 34354,   223, 43994,  2094,     8, 10578,     7,  5281,
             5,  2707,     9,    84,   265,   452,     6,    25,     5,  6273,
            12, 11600,  1778,  6656,  1761,    11,  

In [11]:
print([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids])

['Lee Enterprises, the owner of several Montana newspapers, has rejected a bid by Alden Global Capital, LLC, a hedge fund. Lee is the parent company of the Helena Independent Record, the Missoulian, the Billings Gazette, the Montana Standard and the Ravalli Republic. Alden was offering $24 a share and already owned 6% of the issued and outstanding common stock of Lee at the time. Lee Board Chairman Mary Junck said in a press release, “The Alden proposal grossly undervalues Lee and fails to recognize the strength of our business today, as the fastest-growing digital subscription platform in local media, and our compelling future prospects.” Lee Enterprises is a public traded company that has 75 daily newspaper outlets across the nation.']



# Model 3 using Transformers pipeline

In [12]:
from transformers import pipeline

The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated like any other pipeline but adds convenience: given only a task name such as `"summarization"`, it falls back to a default model when none is supplied.

In [13]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)



In [14]:
summarizer(ARTICLE_TO_SUMMARIZE, min_length=30, do_sample=False)

[{'summary_text': ' Lee Enterprises has rejected a bid by Alden Global Capital, LLC, a hedge fund, to purchase the company . Alden was offering $24 a share and already owned 6% of the issued and outstanding common stock of Lee . Lee is the parent company of the Helena Independent Record, the Missoulian, the Billings Gazette and the Montana Standard .'}]
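Since the call above falls back to a default checkpoint, it can help to name the model explicitly and bound the summary length. The sketch below assumes the `transformers` package is installed and the checkpoint can be downloaded. `length_bounds` and `summarize_with_pipeline` are hypothetical helpers, and the length heuristic is an arbitrary choice for illustration.

```python
def length_bounds(text, min_length=30, fraction=0.5):
    """Heuristic (assumption): cap the summary at a fraction of the input's word count.

    max_length is measured in tokens by the model, so the word count is only a rough proxy.
    """
    n_words = len(text.split())
    max_length = max(min_length + 1, int(n_words * fraction))
    return min_length, max_length

def summarize_with_pipeline(text, model_name="sshleifer/distilbart-cnn-12-6"):
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    summarizer = pipeline("summarization", model=model_name)
    min_length, max_length = length_bounds(text)
    result = summarizer(text, min_length=min_length, max_length=max_length, do_sample=False)
    # The pipeline returns a list with one dict per input, as in the output above.
    return result[0]["summary_text"]

print(length_bounds("word " * 200))
```

Calling `summarize_with_pipeline(ARTICLE_TO_SUMMARIZE)` would then download the named checkpoint on first use. Pinning the model name keeps results from shifting if the library's default changes.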