## Extractive Summarization

In [1]:
!pip install sumy transformers



In [2]:
from sumy.summarizers.text_rank import TextRankSummarizer

In [3]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

In [4]:
text = '''
The tiger (Panthera tigris) is the largest living cat species and a member of the genus Panthera native to Asia. It has a powerful, muscular body with a large head and paws, a long tail, and orange fur with black, mostly vertical stripes. It is traditionally classified into nine recent subspecies, though some recognise only two subspecies, mainland Asian tigers and island tigers of the Sunda Islands.

Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia. The tiger is an apex predator and preys mainly on ungulates such as deer and wild boar, which it takes by ambush. It lives a mostly solitary life and occupies home ranges, which it defends from individuals of the same sex. The range of a male tiger overlaps with that of multiple females with whom he has reproductive claims. Females give birth to usually two or three cubs that stay with their mother for about two years. When becoming independent, they leave their mother's home range and establish their own.

Since the early 20th century, tiger populations have lost at least 93% of their historic range and are locally extinct in West and Central Asia, in large areas of China, and on the islands of Java and Bali. Today, the tiger's range is severely fragmented. It is listed as Endangered on the IUCN Red List of Threatened Species, as its range is thought to have declined by 53% to 68% since the late 1990s. Major reasons for this decline are habitat destruction and fragmentation due to deforestation, poaching for fur, and the illegal trade of tiger body parts for medicinal purposes. Tigers are also victims of human–wildlife conflict for attacking and preying on livestock in areas, where natural prey is scarce. The species is legally protected in all range countries, which have ratified conservation action plans, established anti-poaching patrols and schemes for monitoring tiger populations.

The tiger is among the most popular of the world's charismatic megafauna. It has been kept in captivity since ancient times and has been trained to perform in circuses and other entertainment shows. The tiger featured prominently in the ancient mythology and folklore of cultures throughout its historic range and has continued to appear in culture worldwide.
'''

In [5]:
# sent_tokenize needs NLTK's Punkt sentence-tokenizer data, downloaded via nltk.download
from nltk.tokenize import sent_tokenize

In [6]:
sents = sent_tokenize(text)

In [7]:
len(sents)

18

In [8]:
sents

['\nThe tiger (Panthera tigris) is the largest living cat species and a member of the genus Panthera native to Asia.',
 'It has a powerful, muscular body with a large head and paws, a long tail, and orange fur with black, mostly vertical stripes.',
 'It is traditionally classified into nine recent subspecies, though some recognise only two subspecies, mainland Asian tigers and island tigers of the Sunda Islands.',
 "Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia.",
 'The tiger is an apex predator and preys mainly on ungulates such as deer and wild boar, which it takes by ambush.',
 'It lives a mostly solitary life and occupies home ranges, which it defends from individuals of the same sex.',
 'The range of a male tiger overlaps with that of multiple females with whom he has repr

In [9]:
my_parser = PlaintextParser.from_string(text, Tokenizer("english"))

### Text Rank Summarization

In [10]:
# TextRank is a graph-based ranking technique, used for keyword extraction as well as
# summarization; here it ranks the sentences of the document.

text_rank_summarizer = TextRankSummarizer()

In [11]:
summary = text_rank_summarizer(my_parser.document, sentences_count=3)

In [12]:
print(summary)

(<Sentence: Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia.>, <Sentence: Since the early 20th century, tiger populations have lost at least 93% of their historic range and are locally extinct in West and Central Asia, in large areas of China, and on the islands of Java and Bali.>, <Sentence: The tiger featured prominently in the ancient mythology and folklore of cultures throughout its historic range and has continued to appear in culture worldwide.>)


In [13]:
for sent in summary:
    print(sent,'\n')

Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia. 

Since the early 20th century, tiger populations have lost at least 93% of their historic range and are locally extinct in West and Central Asia, in large areas of China, and on the islands of Java and Bali. 

The tiger featured prominently in the ancient mythology and folklore of cultures throughout its historic range and has continued to appear in culture worldwide. 
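
To make the graph-based idea concrete, here is a minimal pure-Python sketch of TextRank-style sentence ranking. It is not sumy's implementation: the regex tokenizer, the function names, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for illustration.

```python
import math
import re


def _words(sentence):
    # Assumption: a crude lowercase regex tokenizer, English letters only.
    return set(re.findall(r"[a-z]+", sentence.lower()))


def _overlap(words_a, words_b):
    # Shared words, normalised by the log of both sentence lengths.
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def text_rank_scores(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    words = [_words(s) for s in sentences]
    # Weighted graph: one node per sentence, edge weight = word overlap.
    weights = [
        [_overlap(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # PageRank-style update: each sentence collects score from its neighbours.
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


def text_rank_summary(sentences, count):
    scores = text_rank_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:count]
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(top)]
```

Calling `text_rank_summary(sents, 3)` on the sentence list from above would give a summary in the same spirit, though not necessarily the same three sentences that sumy selects.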



### Lex Rank Summarization

In [14]:
from sumy.summarizers.lex_rank import LexRankSummarizer

In [15]:
lex_rank_summarizer = LexRankSummarizer()

summary = lex_rank_summarizer(my_parser.document, sentences_count=3)
print(summary)

(<Sentence: It has a powerful, muscular body with a large head and paws, a long tail, and orange fur with black, mostly vertical stripes.>, <Sentence: Females give birth to usually two or three cubs that stay with their mother for about two years.>, <Sentence: Since the early 20th century, tiger populations have lost at least 93% of their historic range and are locally extinct in West and Central Asia, in large areas of China, and on the islands of Java and Bali.>)


In [16]:
for sent in summary:
    print(sent,'\n')

It has a powerful, muscular body with a large head and paws, a long tail, and orange fur with black, mostly vertical stripes. 

Females give birth to usually two or three cubs that stay with their mother for about two years. 

Since the early 20th century, tiger populations have lost at least 93% of their historic range and are locally extinct in West and Central Asia, in large areas of China, and on the islands of Java and Bali. 
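
LexRank is also graph-based, but it is usually described with TF-IDF cosine similarity between sentences and a similarity threshold that decides which sentences are connected. The sketch below follows that description under simplifying assumptions (regex tokenizer, smoothed IDF, a threshold of 0.1, no damping); it is not the code that sumy runs.

```python
import math
import re
from collections import Counter


def lex_rank_scores(sentences, threshold=0.1, iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    # Assumption: crude lowercase regex tokenizer, English letters only.
    docs = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    # Document frequency: in how many sentences each word occurs.
    doc_freq = Counter(word for doc in docs for word in doc)
    # Assumption: "+ 1.0" smoothing so that common words keep a small weight.
    idf = {word: math.log(n / doc_freq[word]) + 1.0 for word in doc_freq}
    vectors = [{w: tf * idf[w] for w, tf in doc.items()} for doc in docs]
    norms = [math.sqrt(sum(v * v for v in vec.values())) for vec in vectors]

    def cosine(i, j):
        if norms[i] == 0 or norms[j] == 0:
            return 0.0
        dot = sum(val * vectors[j].get(w, 0.0) for w, val in vectors[i].items())
        return dot / (norms[i] * norms[j])

    # Unweighted graph: connect sentences whose similarity exceeds the threshold.
    adjacency = [
        [1.0 if cosine(i, j) > threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]
    degrees = [sum(row) for row in adjacency]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Power iteration on the row-normalised adjacency matrix.
        scores = [
            sum(
                adjacency[j][i] / degrees[j] * scores[j]
                for j in range(n)
                if degrees[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

Sentences that are similar to many other sentences end up with the highest scores, which is one way to read why LexRank and TextRank can pick different sentences from the same text.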



## Results

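<u>Both methods (TextRank and LexRank) are extractive.</u> Each summary consists of sentences copied verbatim from the source text. The TextRank summary appears first, followed by the LexRank summary.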

Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia. 

Since the early 20th century, tiger populations have lost at least 93% of their historic range and are locally extinct in West and Central Asia, in large areas of China, and on the islands of Java and Bali. 

The tiger featured prominently in the ancient mythology and folklore of cultures throughout its historic range and has continued to appear in culture worldwide. 


#####################################################################################

It has a powerful, muscular body with a large head and paws, a long tail, and orange fur with black, mostly vertical stripes. 

Females give birth to usually two or three cubs that stay with their mother for about two years. 

Since the early 20th century, tiger populations have lost at least 93% of their historic range and are locally extinct in West and Central Asia, in large areas of China, and on the islands of Java and Bali. 



In [17]:
mr_text = '''
'वाघ मार्जार कुळातील प्राणी असून भारताचा राष्ट्रीय प्राणी आहे[२]. मार्जार कुळातील सर्वात मोठा प्राणी म्हणून याची गणना होते व अन्न साखळीतील सर्वोच्च स्थान वाघ भूषवतो. वाघ या नावाची व्युत्पत्ती संस्कृत मधील 'व्याघ्र' या शब्दावरून आली आहे. इंग्रजीत वाघाला 'टायगर असे म्हणतात. मराठीत भल्या मोठ्या वाघाला 'ढाण्या' वाघ म्हणतात. वाघ हा शिकार करण्यात परिपक्व आहे.

इ.स. २०१० पासून जगभरात २९ जुलै हा जागतिक व्याघ्र दिन म्हणून पाळला जातो. भारतात वाघ हा संरक्षित प्राणी असून त्याची शिकार करणे हा दंडनीय अपराध आहे.

एके काळी पश्चिमेस पूर्व अँटोलिया [३]प्रदेश पासून अमूर नदी [४]पात्रात आणि दक्षिणेस हिमालयाच्या पायथ्यापासून सुली बेटांपर्यंतच्या बालीपर्यंत सर्वत्र वाघ पसरले. २० व्या शतकाच्या सुरुवातीस, वाघाची संख्या पूर्वीहून ९३% ने कमी झाली आहे तसेच पश्चिम आणि मध्य आशियात, जावा आणि बाली बेटांमधून आणि आग्नेय, दक्षिण आशिया आणि चीनच्या मोठ्या भागांतून लोप पावली आहे. सध्याची वाघ प्रजाती भारतीय उपखंड आणि सुमात्रावरील सायबेरियन समशीतोष्ण जंगलांपासून ते उप-उष्णकटिबंधीय व उष्णकटिबंधीय जंगलांपर्यंत पसरलेली आहे. १९८६ पासून वाघाला आययूसीएन रेड लिस्टमध्ये लुप्तप्राय प्रजाती म्हणून सूचीबद्ध केले आहे. २०१५ पर्यंत जगातील प्रौढ वाघांची संख्या ३०६२ ते ३९४८ असावी असा अंदाज आहे. भारतात सध्या सर्वात अधिक वाघांची संख्या आहे. वाघांची संख्या कमी होण्याच्या मुख्य कारणांमध्ये त्यांचे प्राकृतिक आवास स्थान नष्ट करणे, खंडित करणे आणि शिकार करणे समाविष्ट आहे. काही देशांमधील अधिक दाट लोकवस्ती चे अतिक्रमण या ठिकाणी कारणीभूत आहे.
'''

In [18]:
mr_sents = sent_tokenize(mr_text)

In [19]:
mr_sents

["\n'वाघ मार्जार कुळातील प्राणी असून भारताचा राष्ट्रीय प्राणी आहे[२].",
 'मार्जार कुळातील सर्वात मोठा प्राणी म्हणून याची गणना होते व अन्न साखळीतील सर्वोच्च स्थान वाघ भूषवतो.',
 "वाघ या नावाची व्युत्पत्ती संस्कृत मधील 'व्याघ्र' या शब्दावरून आली आहे.",
 "इंग्रजीत वाघाला 'टायगर असे म्हणतात.",
 "मराठीत भल्या मोठ्या वाघाला 'ढाण्या' वाघ म्हणतात.",
 'वाघ हा शिकार करण्यात परिपक्व आहे.',
 'इ.स.',
 '२०१० पासून जगभरात २९ जुलै हा जागतिक व्याघ्र दिन म्हणून पाळला जातो.',
 'भारतात वाघ हा संरक्षित प्राणी असून त्याची शिकार करणे हा दंडनीय अपराध आहे.',
 'एके काळी पश्चिमेस पूर्व अँटोलिया [३]प्रदेश पासून अमूर नदी [४]पात्रात आणि दक्षिणेस हिमालयाच्या पायथ्यापासून सुली बेटांपर्यंतच्या बालीपर्यंत सर्वत्र वाघ पसरले.',
 '२० व्या शतकाच्या सुरुवातीस, वाघाची संख्या पूर्वीहून ९३% ने कमी झाली आहे तसेच पश्चिम आणि मध्य आशियात, जावा आणि बाली बेटांमधून आणि आग्नेय, दक्षिण आशिया आणि चीनच्या मोठ्या भागांतून लोप पावली आहे.',
 'सध्याची वाघ प्रजाती भारतीय उपखंड आणि सुमात्रावरील सायबेरियन समशीतोष्ण जंगलांपासून ते उप-उष्णकटिबंधी

In [20]:
len(mr_sents)

17
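
The `mr_sents` output above shows `'इ.स.'` split off as a sentence of its own, because the English-trained sentence splitter treats the final period of the abbreviation as a sentence end. One possible workaround, sketched here, is a small custom tokenizer that protects known abbreviations before splitting. The class name `SimpleMarathiTokenizer` and the abbreviation list are hypothetical. The class exposes the same `to_sentences` and `to_words` method names that sumy's `Tokenizer` has, but whether sumy accepts it in place of `Tokenizer` is an assumption that has not been verified here.

```python
import re

# Hypothetical abbreviation list; extend it for your own corpus.
MARATHI_ABBREVIATIONS = ("इ.स.",)


class SimpleMarathiTokenizer:
    """Illustrative tokenizer exposing to_sentences / to_words."""

    def to_sentences(self, paragraph):
        protected = paragraph.strip()
        # Hide abbreviations behind placeholders so their periods are not split on.
        for index, abbreviation in enumerate(MARATHI_ABBREVIATIONS):
            protected = protected.replace(abbreviation, f"\x00{index}\x00")
        # Split after ., ?, ! or the danda when followed by whitespace.
        parts = re.split(r"(?<=[.?!।])\s+", protected)
        sentences = []
        for part in parts:
            for index, abbreviation in enumerate(MARATHI_ABBREVIATIONS):
                part = part.replace(f"\x00{index}\x00", abbreviation)
            if part:
                sentences.append(part)
        return tuple(sentences)

    def to_words(self, sentence):
        # Whitespace split plus punctuation stripping. A \w+ regex is avoided
        # because Python's re does not treat Devanagari vowel signs as word characters.
        stripped = (word.strip(".,?!।'\"[]()") for word in sentence.split())
        return tuple(word for word in stripped if word)
```

With this splitter, the sentence beginning `इ.स. २०१० पासून` stays in one piece instead of being cut after the abbreviation.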

In [21]:
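# The English tokenizer is applied to the Marathi text here, so sentence and
# word splitting are only approximate.
my_parser = PlaintextParser.from_string(mr_text, Tokenizer("english"))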

In [22]:
text_rank_summarizer = TextRankSummarizer()

summary = text_rank_summarizer(my_parser.document, sentences_count=3)
print(summary)

(<Sentence: 'वाघ मार्जार कुळातील प्राणी असून भारताचा राष्ट्रीय प्राणी आहे[२].>, <Sentence: मार्जार कुळातील सर्वात मोठा प्राणी म्हणून याची गणना होते व अन्न साखळीतील सर्वोच्च स्थान वाघ भूषवतो.>, <Sentence: सध्याची वाघ प्रजाती भारतीय उपखंड आणि सुमात्रावरील सायबेरियन समशीतोष्ण जंगलांपासून ते उप-उष्णकटिबंधीय व उष्णकटिबंधीय जंगलांपर्यंत पसरलेली आहे.>)


In [23]:
for sent in summary:
    print(sent,'\n')

'वाघ मार्जार कुळातील प्राणी असून भारताचा राष्ट्रीय प्राणी आहे[२]. 

मार्जार कुळातील सर्वात मोठा प्राणी म्हणून याची गणना होते व अन्न साखळीतील सर्वोच्च स्थान वाघ भूषवतो. 

सध्याची वाघ प्रजाती भारतीय उपखंड आणि सुमात्रावरील सायबेरियन समशीतोष्ण जंगलांपासून ते उप-उष्णकटिबंधीय व उष्णकटिबंधीय जंगलांपर्यंत पसरलेली आहे. 



### Latent Semantic Analysis (LSA)

LSA is also an extractive summarization method.

In short, ```sumy``` provides extractive summarizers, while ```transformers``` is used here for abstractive summarization.

In [24]:
from sumy.summarizers.lsa import LsaSummarizer 

In [25]:
lsa_summarizer = LsaSummarizer()

my_parser = PlaintextParser.from_string(text, Tokenizer("english"))

summary = lsa_summarizer(my_parser.document, sentences_count=3)
print(summary)

(<Sentence: It is listed as Endangered on the IUCN Red List of Threatened Species, as its range is thought to have declined by 53% to 68% since the late 1990s.>, <Sentence: Tigers are also victims of human–wildlife conflict for attacking and preying on livestock in areas, where natural prey is scarce.>, <Sentence: The species is legally protected in all range countries, which have ratified conservation action plans, established anti-poaching patrols and schemes for monitoring tiger populations.>)


In [26]:
for sent in summary:
    print(sent,'\n')

It is listed as Endangered on the IUCN Red List of Threatened Species, as its range is thought to have declined by 53% to 68% since the late 1990s. 

Tigers are also victims of human–wildlife conflict for attacking and preying on livestock in areas, where natural prey is scarce. 

The species is legally protected in all range countries, which have ratified conservation action plans, established anti-poaching patrols and schemes for monitoring tiger populations. 
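
LSA-based summarizers factor a term-by-sentence matrix and score sentences by how strongly they load on the leading latent topics. The sketch below keeps only the single dominant topic and finds it by power iteration on the Gram matrix rather than with a full SVD. It is a deliberately reduced illustration (the function name, regex tokenizer, and raw word counts are assumptions), not what `LsaSummarizer` computes.

```python
import math
import re


def lsa_rank1_scores(sentences, iterations=100):
    n = len(sentences)
    if n == 0:
        return []
    # Assumption: crude lowercase regex tokenizer, English letters only.
    docs = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocabulary = sorted({word for doc in docs for word in doc})
    # Term-by-sentence count matrix A (rows: terms, columns: sentences).
    matrix = [[doc.count(term) for doc in docs] for term in vocabulary]
    # Gram matrix G = A^T A; its top eigenvector is the first right singular vector of A.
    gram = [
        [sum(row[i] * row[j] for row in matrix) for j in range(n)]
        for i in range(n)
    ]
    vector = [1.0] * n
    for _ in range(iterations):
        vector = [sum(gram[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vector))
        if norm == 0:
            return [0.0] * n
        vector = [v / norm for v in vector]
    # A sentence's score is its loading on the dominant latent topic.
    return [abs(v) for v in vector]
```

Sentences that share vocabulary with the dominant topic score high, while a sentence with unrelated vocabulary scores close to zero.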



In [27]:
help(Tokenizer)

Help on class Tokenizer in module sumy.nlp.tokenizers:

class Tokenizer(builtins.object)
 |  Tokenizer(language)
 |  
 |  Language dependent tokenizer of text document.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, language)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  to_sentences(self, paragraph)
 |  
 |  to_words(self, sentence)
 |  
 |  ----------------------------------------------------------------------
 |  Readonly properties defined here:
 |  
 |  language
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables
 |  
 |  __weakref__
 |      list of weak references to the object
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  LANGUAGE_ALIASES = {'slovak': 'czech'}
 |  
 |  LANGUAGE_EXTRA_ABREVS = {'english': ['e.g', 'al', 'i.e'], 'germ

## Abstractive Summarization

### Using a Pre-Trained Transformer (DistilBART)

In [31]:
from transformers import pipeline

In [32]:
text_summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [40]:
output = text_summarizer(text, max_length=400, min_length=100)

In [41]:
output[0]['summary_text']

' The tiger (Panthera tigris) is the largest living cat species and a member of the genus Panthera native to Asia . It has a powerful, muscular body with a large head and paws, a long tail, and orange fur with black, mostly vertical stripes . The tiger is an apex predator and preys mainly on ungulates such as deer and wild boar . It is listed as Endangered on the IUCN Red List of Threatened Species, as its range is thought to have declined by 53% to 68% since the late 1990s .'
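
The warning printed when the pipeline was created advises against relying on the default model. A hedged sketch of pinning the model name and revision reported in that message follows. `build_summarizer` and `summarize_text` are hypothetical helper names, and the `transformers` import is deferred so that the helpers can be defined without loading the model.

```python
# Model name and revision as reported in the pipeline's log message above.
MODEL_NAME = "sshleifer/distilbart-cnn-12-6"
MODEL_REVISION = "a4f8f3e"


def build_summarizer():
    # Imported lazily: loading transformers and the model weights is slow.
    from transformers import pipeline

    return pipeline("summarization", model=MODEL_NAME, revision=MODEL_REVISION)


def summarize_text(text, summarizer, max_length=400, min_length=100):
    # The pipeline returns a list with one dict per input text.
    output = summarizer(text, max_length=max_length, min_length=min_length)
    return output[0]["summary_text"].strip()
```

Usage would look like `summarize_text(text, build_summarizer())`. The length limits default to the values used in the cell above.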