# Text summarization! 

**References:**

https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70

https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/

https://medium.com/@ondenyi.eric/extractive-text-summarization-techniques-with-sumy-3d3b127a0a32

https://www.mygreatlearning.com/blog/text-summarization-in-python/

https://medium.com/analytics-vidhya/text-summarization-using-bert-gpt2-xlnet-5ee80608e961

### Contents:

1. <a href="#Introduction:">Introduction</a>
2. <a href="#Different-approaches-to-Text-Summarizers!">Different approaches to Text Summarizers!</a>

### Introduction:

Text summarization in NLP is the process of summarizing the information in large texts for quicker consumption.

When you open a news site, do you read every article in full? Probably not. We typically glance at the short summary and read on only if we are interested. Short, informative news summaries are now everywhere: magazines, news aggregator apps, research sites, and so on. These summaries can be created automatically as news comes in from sources around the world. The task of producing such a summary from a long original text without losing vital information is called **Text Summarization**. A good summary should be fluent and continuous, and should capture the significant points of the text. Google News, the Inshorts app, and various other news aggregator apps take advantage of text summarization algorithms.

**Two different approaches are used for Text Summarization**

![image.png](attachment:image.png)

- Extractive Summarization

In Extractive Summarization, we identify the important phrases or sentences in the original text and extract only those. The extracted sentences form the summary.

![image-2.png](attachment:image-2.png)

- Abstractive Summarization

In the Abstractive Summarization approach, we generate new sentences from the original text, in contrast to the extractive approach described above. The generated sentences might not appear in the original text at all.

![image-3.png](attachment:image-3.png)

#### Different approaches to Text Summarizers!

Text summarization methods can be grouped into two main categories: **Extractive** and **Abstractive** methods.

- **Extractive Text Summarization:** This is the traditional method, developed first. The main objective is to identify the significant sentences of the text and add them to the summary. Note that the resulting summary contains exact sentences from the original text.
- **Abstractive Text Summarization:** This is a more advanced method, and improvements appear frequently (some of the best are covered here). The approach is to identify the important sections, interpret the context, and reproduce the content in a new way, so that the core information is conveyed in the shortest text possible. Here the sentences in the summary are generated, not just extracted from the original text.

The next sections discuss different extractive and abstractive methods.

- **Text Summarization using Gensim with TextRank:**
Gensim is a handy Python library for NLP tasks. Its text summarization (the `gensim.summarization` module, available in Gensim releases before 4.0) is based on the **TextRank algorithm**. What is the __TextRank algorithm__? TextRank is an extractive, graph-based summarization technique. Each sentence is a node in a graph, and sentences are linked according to the words they share. A sentence that is similar to many other sentences is considered important. Based on this, the algorithm assigns a score to each sentence in the text, and the top-ranked sentences make it to the summary.

- **Text Summarization with Sumy:**
Along with TextRank, there are various other algorithms for summarizing text. Wouldn't it be convenient to have a library that lets you perform summarization with multiple algorithms? Fortunately, the sumy library does exactly that. sumy provides several algorithms for text summarization: just import the algorithm you want rather than coding it yourself.

- **LexRank:**
First, let me introduce summarization with **LexRank**. **How does LexRank work?** A sentence that is similar to many other sentences in the text has a high probability of being important. LexRank treats each sentence as being recommended by the sentences similar to it, and ranks it higher accordingly. The higher the rank, the higher the priority of being included in the summary.

- **LSA (Latent semantic analysis):**
Latent Semantic Analysis is an unsupervised learning algorithm that can be used for extractive text summarization. It extracts semantically significant sentences by applying singular value decomposition (SVD) to the term-sentence frequency matrix.

- **Luhn:**
Luhn's summarization algorithm is based on word frequency, in the same spirit as TF-IDF (Term Frequency-Inverse Document Frequency). It treats both very rare words and very frequent words (stop words) as not significant. Sentences are scored by the significant words they contain, and the highest-ranking sentences make it to the summary.

- **KL-Sum algorithm:**
Another extractive method is the KL-Sum algorithm. It selects sentences so that the word distribution of the summary is as similar as possible to that of the original text; that is, it aims to minimize the KL divergence between the two. It uses a greedy optimization approach and keeps adding sentences as long as the KL divergence decreases.

- **Abstractive Text Summarization:**
Abstractive summarization is the more recent, state-of-the-art approach: it generates new sentences that best represent the whole text. Unlike extractive methods, which only select sentences from the original text, it can rephrase and condense the content.

- **Summarization with T5 Transformers:**
T5 is an encoder-decoder model that converts all language problems into a text-to-text format. To use it, you first import the tokenizer and the corresponding model. The `T5ForConditionalGeneration` model is preferred when the input and output are both sequences.

#### Let's look at the performance of some of these algorithms using Sumy!

**Sumy offers several algorithms and methods for summarization such as:**
- Luhn – heuristic method
- Latent Semantic Analysis
- Edmundson heuristic method with previous statistic research
- LexRank – Unsupervised approach inspired by algorithms PageRank and HITS
- TextRank
- SumBasic – Method that is often used as a baseline in the literature
- KL-Sum – Method that greedily adds sentences to a summary so long as it decreases the KL Divergence.
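
The greedy selection that KL-Sum performs can be sketched in plain Python. This is a minimal illustration, not sumy's implementation: the function names (`word_distribution`, `kl_divergence`, `kl_sum`), the regex tokenizer, and the `epsilon` smoothing for words missing from the summary are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def word_distribution(words):
    """Return the relative frequency of each word in a list of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_divergence(doc_dist, summary_dist, epsilon=1e-6):
    """KL(document || summary); epsilon stands in for words the summary lacks."""
    return sum(
        p * math.log(p / summary_dist.get(word, epsilon))
        for word, p in doc_dist.items()
    )


def kl_sum(sentences, max_sentences):
    """Greedily add the sentence that lowers the KL divergence the most."""
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    doc_dist = word_distribution([w for words in tokenized for w in words])
    chosen, summary_words = [], []
    best_so_far = float("inf")
    while len(chosen) < max_sentences:
        candidates = [
            (kl_divergence(doc_dist, word_distribution(summary_words + tokenized[i])), i)
            for i in range(len(sentences))
            if i not in chosen and tokenized[i]
        ]
        if not candidates:
            break
        divergence, index = min(candidates)
        if divergence >= best_so_far:
            break  # no remaining sentence brings the summary closer to the document
        best_so_far = divergence
        chosen.append(index)
        summary_words += tokenized[index]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]
```

The loop stops either when the requested number of sentences is reached or when adding any further sentence would no longer decrease the divergence.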

This section runs a text summarization task on the BBC news dataset [http://mlg.ucd.ie/datasets/bbc.html] and compares the output of these extractive algorithms.

In [7]:
#Importing the Libraries!
import sumy
#Plain text parsers since we are parsing through text
from sumy.parsers.plaintext import PlaintextParser

#for tokenization
from sumy.nlp.tokenizers import Tokenizer

In [8]:
#pip install sumy

In [10]:
#name of the plain-text file ~ bbc news dataset

file = "../../Data/bbc/business/001.txt"
parser = PlaintextParser.from_file(file, Tokenizer("english"))

#### Actual Text

Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.

Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.

Time Warner's fourth quarter profits were slightly better than analysts' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. "Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility," chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.

TimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.

In [11]:
#LexRank Summarizer!
#This is a graph-based text summarizer

from sumy.summarizers.lex_rank import LexRankSummarizer
summarizer = LexRankSummarizer()
#Summarize the document with 5 sentences
summary = summarizer(parser.document, 5)
for sentence in summary:
    print(sentence)

TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.
TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.
Time Warner's fourth quarter profits were slightly better than analysts' expectations.
For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn.
It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue.
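
To make the ranking idea behind LexRank more concrete, here is a minimal standard-library sketch. It is not sumy's code: the helper names, the regex tokenizer, the `+ 1.0` IDF smoothing, and the default `threshold` and `damping` values are assumptions chosen for illustration. Sentences become TF-IDF vectors, pairs whose cosine similarity exceeds the threshold are linked, and a PageRank-style iteration scores each sentence by the sentences that "recommend" it.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse word-weight dictionaries."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Return one centrality score per sentence."""
    n = len(sentences)
    if n == 0:
        return []
    tokenized = [tokenize(s) for s in sentences]
    # IDF is computed by treating every sentence as its own document.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    idf = {word: math.log(n / df) + 1.0 for word, df in doc_freq.items()}
    vectors = [
        {word: count * idf[word] for word, count in Counter(words).items()}
        for words in tokenized
    ]
    # Link two different sentences when they are similar enough.
    adjacency = [
        [1.0 if i != j and cosine(vectors[i], vectors[j]) > threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in adjacency]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                scores[j] / degree[j] for j in range(n) if adjacency[j][i]
            )
            for i in range(n)
        ]
    return scores
```

A sentence with no similar neighbours keeps only the baseline score `(1 - damping) / n`, while a sentence linked to many others accumulates score from each of them.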


In [14]:
#Luhn
#It is one of the earliest summarization algorithms, proposed by the IBM researcher Hans Peter Luhn, after whom it is named.
#It scores sentences based on the frequency of the most important words.

from sumy.summarizers.luhn import LuhnSummarizer
summarizer_1 = LuhnSummarizer()
summary_1 = summarizer_1(parser.document, 5)
for sentence in summary_1:
    print(sentence)

The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales.
It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband.
TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.
It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue.
It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.
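
The scoring rule usually attributed to Luhn can be sketched as follows. This is an illustrative approximation under stated assumptions, not sumy's implementation: the tiny `STOP_WORDS` set, the `top_n` cutoff for "significant" words, and the `max_gap` of four non-significant words inside a cluster are hypothetical choices.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on", "it", "for"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def significant_words(sentences, top_n=5):
    """Treat the most frequent non-stop words as the significant ones."""
    counts = Counter(
        word for s in sentences for word in tokenize(s) if word not in STOP_WORDS
    )
    return {word for word, _ in counts.most_common(top_n)}


def luhn_score(sentence, significant, max_gap=4):
    """Score = (significant words in the best cluster)^2 / cluster length."""
    words = tokenize(sentence)
    positions = [i for i, word in enumerate(words) if word in significant]
    if not positions:
        return 0.0
    best = 0.0
    cluster_start, cluster_count = positions[0], 1
    previous = positions[0]
    for position in positions[1:]:
        if position - previous - 1 <= max_gap:
            # Still inside the same cluster of significant words.
            cluster_count += 1
        else:
            # Close the current cluster and start a new one.
            best = max(best, cluster_count ** 2 / (previous - cluster_start + 1))
            cluster_start, cluster_count = position, 1
        previous = position
    return max(best, cluster_count ** 2 / (previous - cluster_start + 1))
```

Sentences would then be ranked by `luhn_score`, and the top few kept in their original order.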


In [15]:
#LSA
#Latent semantic analysis is an unsupervised summarization method that combines term frequency techniques with singular value decomposition.

from sumy.summarizers.lsa import LsaSummarizer
summarizer_2 = LsaSummarizer()
summary_2 = summarizer_2(parser.document, 5)
for sentence in summary_2:
    print(sentence)

Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.
Time Warner said on Friday that it now owns 8% of search-engine Google.
But its own internet business, AOL, had has mixed fortunes.
Time Warner's fourth quarter profits were slightly better than analysts' expectations.
For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn.
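
The role of SVD in LSA can be illustrated with a heavily simplified sketch. A full implementation would compute a complete SVD of the term-sentence matrix (typically with a numerical library); the code below only approximates the first right singular vector with power iteration, using the standard library. The function name `leading_topic_weights` and the regex tokenizer are assumptions for the example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def leading_topic_weights(sentences, iterations=100):
    """Approximate the first right singular vector of the term-sentence matrix.

    Each returned value says how strongly a sentence expresses the dominant
    latent topic of the text.
    """
    counts = [Counter(tokenize(s)) for s in sentences]
    n = len(sentences)
    # gram[i][j] is the dot product of the term-count vectors of sentences
    # i and j, i.e. the matrix A^T A where A has one column per sentence.
    gram = [
        [sum(c * counts[j][w] for w, c in counts[i].items()) for j in range(n)]
        for i in range(n)
    ]
    vector = [1.0] * n
    for _ in range(iterations):
        product = [sum(gram[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            break
        vector = [x / norm for x in product]
    return vector
```

The sentence with the largest weight is the best representative of the dominant topic; repeating the idea for further singular vectors would pick sentences for the remaining topics.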


In [16]:
#Text Rank
#TextRank is a graph-based summarization technique that can also be used for keyword extraction from a document.

from sumy.summarizers.text_rank import TextRankSummarizer
summarizer_3 = TextRankSummarizer()
summary_3 = summarizer_3(parser.document, 5)
for sentence in summary_3:
    print(sentence)

It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband.
But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results.
The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m.
It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue.
It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.
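
For comparison with LexRank, the graph ranking in TextRank can be sketched in a similar way. The version below is a rough illustration, not sumy's or Gensim's code: it weights each edge by word overlap normalised by the log of the sentence lengths, counts unique words only, and uses hypothetical function names and default parameters.

```python
import math
import re


def tokenize(sentence):
    """Return the set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def overlap_similarity(words_a, words_b):
    """Shared words, normalised by the log of each sentence length."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # log(1) == 0 would make the denominator vanish
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Return one weighted-PageRank score per sentence."""
    token_sets = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [overlap_similarity(token_sets[i], token_sets[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives a share of its neighbours' scores,
        # proportional to the weight of the connecting edge.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

The main difference from the LexRank sketch is that edges carry a similarity weight instead of being switched on or off by a threshold.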


### Let's explore a traditional approach to text summarization in Python!

**Steps for Implementation**

- Step 1: The first step is to import the required libraries. Two NLTK modules are necessary for building the text summarizer: the stop-word corpus and the tokenizers.

>**Terms Used:**
<br>
<br>**Corpus**
A collection of text is known as a corpus. This could be a data set such as the body of work of an author, the poems of a particular poet, etc.
<br>
<br>**Tokenizers**
A tokenizer divides a text into a series of tokens. There are three main tokenizers: sentence, word, and regex. We will use only the word and sentence tokenizers.

- Step 2: Load the stop words into a separate collection so that they can be removed from the text.

>**Stop Words**
Words such as *is*, *an*, *a*, *the*, and *for* do not add value to the meaning of a sentence. For example, let us take a look at the following sentence:
<br>
<br>GreatLearning is one of the most useful websites for ArtificialIntelligence aspirants.
<br>
<br>After removing the stop words in the above sentence, we can narrow the number of words and preserve the meaning as follows:
<br>
<br>['GreatLearning', 'one', 'useful', 'websites', 'ArtificialIntelligence', 'aspirants', '.']

- Step 3: We can then create a frequency table of the words.

>A Python dictionary can keep a record of how many times each word appears in the text after the stop words are removed. We can then apply this dictionary to each sentence to find out which sentences carry the most relevant content in the overall text.

- Step 4: Depending on the words it contains and the frequency table, we will assign a score to each sentence.

>Here, we will use the sent_tokenize() method that can be used to create the array of sentences. We will also need a dictionary to keep track of the score of each sentence, and we can later go through the dictionary to create a summary.

- Step 5: To decide which sentences to keep, compare each sentence's score against a threshold.

>One simple approach is to compute the average score across all sentences. This average (or a multiple of it) can serve as a good threshold.
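
Steps 2 and 3 above can be sketched without NLTK to show what happens under the hood. This is a minimal illustration: the `STOP_WORDS` set is a hypothetical, hand-picked subset (NLTK's English list is much longer), and the regex tokenizer and function names are assumptions.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"is", "an", "a", "the", "for", "of", "most"}


def remove_stop_words(text):
    """Split text into word and punctuation tokens and drop the stop words."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [token for token in tokens if token.lower() not in STOP_WORDS]


def frequency_table(tokens):
    """Count how often each (lowercased) token occurs."""
    return Counter(token.lower() for token in tokens)


sentence = ("GreatLearning is one of the most useful websites "
            "for ArtificialIntelligence aspirants.")
filtered = remove_stop_words(sentence)
print(filtered)
print(frequency_table(filtered))
```

Note that punctuation such as the final period survives as a token here, just as it does with `word_tokenize`, so it also ends up in the frequency table unless it is filtered out explicitly.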

In [17]:
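# Importing the libraries!
# The NLTK stop-word corpus and Punkt tokenizer data must be available.
# If they are missing, download them once, e.g. nltk.download('stopwords') and
# nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab' instead).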

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

#Input text
text = "India is a great country where people speak different languages but the national language is Hindi. India is full of different castes, creeds, religion, and cultures but they live together. That’s the reasons India is famous for the common saying of “unity in diversity“. India is the seventh-largest country in the whole world. Geography and CultureIndia has the second-largest population in the world. India is also knowns as Bharat, Hindustan and sometimes Aryavart. It is surrounded by oceans from three sides which are Bay Of Bengal in the east, the Arabian Sea in the west and Indian oceans in the south. Tiger is the national animal of India. Peacock is the national bird of India. Mango is the national fruit of India. “Jana Gana Mana” is the national anthem of India. “Vande Mataram” is the national song of India. Hockey is the national sport of India. People of different religions such as Hinduism, Buddhism, Jainism, Sikhism, Islam, Christianity and Judaism lives together from ancient times. India is also rich in monuments, tombs, churches, historical buildings, temples, museums, scenic beauty, wildlife sanctuaries, places of architecture and many more. The great leaders and freedom fighters are from India."

#### Text:

India is a great country where people speak different languages but the national language is Hindi. India is full of different castes, creeds, religion, and cultures but they live together. That’s the reasons India is famous for the common saying of “unity in diversity“. India is the seventh-largest country in the whole world. Geography and CultureIndia has the second-largest population in the world. India is also knowns as Bharat, Hindustan and sometimes Aryavart. It is surrounded by oceans from three sides which are Bay Of Bengal in the east, the Arabian Sea in the west and Indian oceans in the south. Tiger is the national animal of India. Peacock is the national bird of India. Mango is the national fruit of India. “Jana Gana Mana” is the national anthem of India. “Vande Mataram” is the national song of India. Hockey is the national sport of India. People of different religions such as Hinduism, Buddhism, Jainism, Sikhism, Islam, Christianity and Judaism lives together from ancient times. India is also rich in monuments, tombs, churches, historical buildings, temples, museums, scenic beauty, wildlife sanctuaries, places of architecture and many more. The great leaders and freedom fighters are from India.

In [18]:
# Tokenizing the text!
# Use a new name so the imported `stopwords` module is not overwritten
stop_words = set(stopwords.words("english"))
words = word_tokenize(text)

In [19]:
## Creating a frequency table to keep the count of each word!

freqtable = dict()
for word in words:
    word = word.lower()
    if word in stop_words:
        continue
    if word in freqtable:
        freqtable[word] += 1
    else:
        freqtable[word] = 1
        
#Creating a dictionary to keep the score of each sentence
sentences = sent_tokenize(text)
sentvalue = dict()

In [20]:
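# Score each sentence by adding the frequency of every table word it contains.
# Note: `word in sentence.lower()` is a substring test, so "india" also matches "indian".
for sentence in sentences: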
    for word, freq in freqtable.items():
        if word in sentence.lower():
            if sentence in sentvalue:
                sentvalue[sentence] += freq
            else:
                sentvalue[sentence] = freq
                
sumvalues = 0 
for sentence in sentvalue:
    sumvalues += sentvalue[sentence]
    
#Average value of a sentence from the original text

average = int(sumvalues/len(sentvalue))

# Storing Sentences into our summary.

summary = ' '

for sentence in sentences:
    if (sentence in sentvalue) and (sentvalue[sentence] > (1.2 * average)):
        summary += " " + sentence
print(summary)

  India is full of different castes, creeds, religion, and cultures but they live together. India is also knowns as Bharat, Hindustan and sometimes Aryavart. It is surrounded by oceans from three sides which are Bay Of Bengal in the east, the Arabian Sea in the west and Indian oceans in the south. India is also rich in monuments, tombs, churches, historical buildings, temples, museums, scenic beauty, wildlife sanctuaries, places of architecture and many more.
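
The scoring loop in the notebook uses a substring test, so punctuation tokens and partial word matches contribute to a sentence's score. As a hedged alternative, the sketch below scores sentences on whole-word tokens instead. It uses only the standard library; the `STOP_WORDS` set, the regex-based sentence splitter, and the `summarize` function name are assumptions for illustration, and the `1.2` threshold factor mirrors the one used above.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for", "but"}


def summarize(text, threshold=1.2):
    """Keep sentences whose word-frequency score exceeds threshold * average."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    freq = Counter(
        word for words in tokenized for word in words if word not in STOP_WORDS
    )
    # Each distinct word in a sentence adds its document-wide frequency once;
    # stop words were never counted, so they contribute zero.
    scores = [sum(freq[word] for word in set(words)) for words in tokenized]
    average = sum(scores) / len(scores)
    return " ".join(
        sentence
        for sentence, score in zip(sentences, scores)
        if score > threshold * average
    )
```

Like the original loop, this still favours long sentences, because every additional distinct word adds to the score; dividing by the sentence length would be one way to compensate.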


In [22]:
#from platform import python_version
#print(python_version())


<a href="#Contents:">Back to Content</a>

**THE END**
