# Text Summarization - Abstractive summarization

## Feedback given for previous review

- Perform abstractive summarization using Hugging Face
- Semantic preservation - how do you know if the summary is good enough?
- Try new ways to summarize text instead of the method used

## Response to feedback

- Two approaches to abstractive summarization have been implemented.
- Summary quality is measured with ROUGE, using the `rouge_score` package (semantic preservation).
- The summarization pipeline has been implemented with the size of the input text in mind. The techniques implemented here target larger input texts.
    - Text is scraped directly from a webpage and preprocessed, instead of being copied and pasted.

## Installing / importing dependencies

In [1]:
!pip install transformers
!pip install rouge_score
!pip install datasets
!pip install sumy beautifulsoup4 requests nltk



In [2]:
from transformers import pipeline
from bs4 import BeautifulSoup
import requests
from datasets import load_metric
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\arany\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Scraping input text from random webpages

In [10]:
#How to become an ethical hacker
url1 = "https://www.simplilearn.com/tutorials/cyber-security-tutorial/how-to-become-an-ethical-hacker"

#Pizza recipe - step by step guide
url2 = "https://www.indianhealthyrecipes.com/pizza-recipe-make-pizza/"

#Formation of the universe
url3 = "https://www.bbc.com/future/article/20140812-how-was-the-universe-created"

In [11]:
r1 = requests.get(url1)
#r2 = requests.get(url2)
#r3 = requests.get(url3)

In [2]:
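def ScrapeText(raw_text, r):
    # raw_text (the URL) is unused; it is kept so existing calls still work.
    # Pass bytes (r.content) so BeautifulSoup detects the page encoding itself;
    # r.text can garble curly quotes when requests guesses the wrong charset.
    soup = BeautifulSoup(r.content, 'html.parser')
    # Keep only the main heading and paragraph text.
    results = soup.find_all(['h1', 'p'])
    text = [result.text for result in results]
    text = ' '.join(text)
    return text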
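A common scraping pitfall is an encoding mismatch: when UTF-8 bytes are decoded as Latin-1, curly quotes turn into sequences such as `â\x80\x98`. The sketch below reproduces that effect with the standard library only and shows one way to reverse it. The helper name `repair_mojibake` is hypothetical, and the round trip only works under the assumption that the text really was UTF-8 misread as Latin-1.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Latin-1 (hypothetical helper)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not the assumed kind of damage; return the input unchanged.
        return text


original = "The word \u2018hacker\u2019"
# Simulate the bug: UTF-8 bytes interpreted as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
repaired = repair_mojibake(garbled)
```

With `requests`, setting `r.encoding = r.apparent_encoding` before reading `r.text`, or passing `r.content` to BeautifulSoup, usually avoids the problem at the source.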

In [13]:
# for method 1
article1 = ScrapeText(url1,r1)
#article2 = ScrapeText(url2,r2)
#article3 = ScrapeText(url3,r3)

In [67]:
input_text = '''A glimpse of the IoT may be already visible in current
deployments where networks of sensing devices are being
interconnected with the Internet, and IP-based standard technologies will be fundamental in providing a common and well-accepted ground for the development and deployment of new
IoT applications. Considering that security may be an enabling
factor of many of such applications, mechanisms to secure
communications using communication technologies for the IoT
will be fundamental. With such aspects in mind, in the survey
we perform an exhaustive analysis on the security protocols
and mechanisms available to protect communications on the
IoT. We also address existing research proposals and challenges
providing opportunities for future research work in the area.
In Table II we summarize the main characteristics of the
mechanisms and proposals analyzed throughout the survey,
together with its operational layer and the security properties
and functionalities supported. In conclusion, we believe this
survey may provide an important contribution to the research
community, by documenting the current status of this important
and very dynamic area of research, helping readers interested in
developing new solutions to address security in the context of
communication protocols for the IoT.
'''

In [15]:
summary_input_text = '''The German Johannes Gutenberg introduced printing in Europe. 
His invention had a decisive contribution in spread of mass-learning and in building the basis of the modern society.

Gutenberg major invention was a practical system permitting the mass production of printed books. 
The printed books allowed open circulation of information, and prepared the evolution of society from to the contemporary knowledge-based economy.'''

In [25]:
article1

'                                 How to Become an Ethical Hacker in 2022? Lesson 7 of 32By Rahul Venugopal                                  The word ‘hacker\' originally defined a skilled programmer proficient in machine code and computer operating systems. Today, a \'hacker\' is a person who consistently engages in hacking activities, and has accepted hacking as a lifestyle and philosophy of their choice. Hacking is the practice of modifying the features of a system, to accomplish a goal outside of the creator\'s original purpose. Before understanding how to become an ethical hacker, let us understand more about the role. The term ‘hacking’ has very negative connotations, but that\'s only until the role of an ethical hacker is fully understood. Ethical hackers are the good guys of the hacking world, the ones who wear the "white hat". So what does the role of an ethical hacker entail? Instead of using their advanced computer knowledge for nefarious activities, 

In [26]:
article2

"Swasthi's Recipes Indian food blog with easy Indian recipes Pizza recipe | How to make pizza | Homemade pizza recipe By swasthi, November 25, 2020 188 Comments, Jump to Recipe Pizza recipe with video – Learn to make pizza at home like a pro with these simple step by step instructions. This detailed post will help you make the best pizza which I am sure will be your family favorite. This pizza has a crisp, light & chewy base with great flavor (no overpowering smell of yeast!!). The sauce is amazingly delicious & aromatic. So this post covers everything from scratch –\xa0making the pizza dough, making the sauce and assembling the pizza. Making a delicious and perfect cheesy pizza at home is much simpler than we think. There are numerous recipes online to make it but over the years this easy recipe has become our family favorite. Try this and you will never want to order a pizza again. You will be in love with your homemade pizza!! During my initial days of learning I followed a basic re

In [27]:
article3

" What is BBC Future? Future Planet Follow the Food Inner Space A Fair Climate Family Tree Best of BBC Future Food Fictions Towards Net Zero Latest More “You just won't believe how vastly, hugely, mind-bogglingly big space is,”said the author Douglas Adams. \xa0“I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to space.” By our best estimates there are around 100 billion stars in the Milky Way and at least 140 billion galaxies across the Universe. If galaxies were frozen peas, there would be enough to fill an auditorium the size of the Royal Albert Hall. So how was this unimaginably giant Universe created? For centuries scientists thought the Universe always existed in a largely unchanged form, run like clockwork thanks to the laws of physics. But a Belgian priest and scientist called George Lemaitre put forward another idea. In 1927, he proposed that the Universe began as a large, pregnant and primeval atom, exploding and sending out the sm

In [16]:
# For method 2
parser1 = HtmlParser.from_url(url1, Tokenizer("english"))
#parser2 = HtmlParser.from_url(url2, Tokenizer("english"))
#parser3 = HtmlParser.from_url(url3, Tokenizer("english"))

## Method 1 - Performing abstractive summarization by creating chunks of text to pass into the pipeline

### Load the model

In [4]:
summarizer1 = pipeline("summarization")

No model was supplied, defaulted to t5-small (https://huggingface.co/t5-small)
All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


### Creating text chunks

In [68]:
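# The model accepts only a limited number of tokens per input (512 for t5-small, per the warning above), so long text is split into chunks.
# chunk_size counts whitespace-separated words, which only approximates the token count.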

chunk_size = 500

In [49]:
# change article number as per requirement

article1 = article1.replace('.', '.<eos>')
article1 = article1.replace('?', '?<eos>')
article1 = article1.replace('!', '!<eos>')

NameError: name 'article1' is not defined

In [69]:
# For input text instead of website

input_text = input_text.replace('.', '.<eos>')
input_text = input_text.replace('?', '?<eos>')
input_text = input_text.replace('!', '!<eos>')

In [70]:
sentences = input_text.split('<eos>')
current_chunk = 0
chunks = []
for sentence in sentences:
    if len(chunks) == current_chunk + 1:
        # A chunk is open: add the sentence if it still fits within chunk_size words.
        if len(chunks[current_chunk]) + len(sentence.split(' ')) <= chunk_size:
            chunks[current_chunk].extend(sentence.split(' '))
        else:
            # Otherwise start a new chunk with this sentence.
            current_chunk += 1
            chunks.append(sentence.split(' '))
    else:
        # No chunk exists yet: open the first one.
        print(current_chunk)
        chunks.append(sentence.split(' '))

# Join each chunk's word list back into a single string.
for chunk_id in range(len(chunks)):
    chunks[chunk_id] = ' '.join(chunks[chunk_id])

0


In [71]:
# Total chunks created for the given article
len(chunks)

1

In [72]:
len(chunks[0].split(' '))

179
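The chunking steps above can also be wrapped in a single function. The following is a minimal sketch, not the notebook's exact method: the name `chunk_sentences` is hypothetical, and it splits only on sentence punctuation followed by whitespace, which is an assumption made so that strings such as `802.15.4` are not broken apart.

```python
import re
from typing import List


def chunk_sentences(text: str, max_words: int = 500) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    # Split after '.', '?' or '!' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    chunks: List[List[str]] = []
    for sentence in sentences:
        words = sentence.split()
        if chunks and len(chunks[-1]) + len(words) <= max_words:
            chunks[-1].extend(words)
        else:
            # Start a new chunk; a single over-long sentence is kept whole.
            chunks.append(words)
    return [" ".join(chunk) for chunk in chunks]


demo = "One two three. Four five? Six seven eight nine! Ten."
demo_chunks = chunk_sentences(demo, max_words=5)
```

Because the limit is counted in words rather than model tokens, it may be safer to pick `max_words` well below the model's token limit.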

### Generating summary

In [None]:
result = summarizer1(chunks, max_length=200, min_length=70, do_sample=False)

In [55]:
result[0]

{'summary_text': 'the standard distinguishes sensing devices by its capabilities and roles in the network . a full-function device (FFD) is able to coordinate a network of devices . by using RFD and FFD devices, IEEE 802. 15. 4 can support network topologies such as peer-to-peer, star and cluster networks .'}

In [65]:
summary_text = ' '.join([summ['summary_text'] for summ in result])

In [66]:
summary_text

'the current version of the IEEE 802. 15. 4 standard supports multi-hop communications using a technique originally proposed in form of the Time Synchronized Mesh Protocol (TMSP) [21] the TMSP protocol employs time synchronized frequency channel hopping to combat multipath fading and external interference, and is also the foundation of WirelessHART [19].'

### Checking the ROUGE score

In [28]:
rouge_score_method1 = load_metric("rouge")

In [43]:
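# The reference must be a summary of the same text that produced summary_text;
# otherwise the overlap scores are not meaningful.
scores = rouge_score_method1.compute(
    predictions=[summary_text], references=[summary_input_text], use_aggregator=False
)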

In [44]:
scores

{'rouge1': [Score(precision=0.43478260869565216, recall=0.47619047619047616, fmeasure=0.4545454545454545)],
 'rouge2': [Score(precision=0.14705882352941177, recall=0.16129032258064516, fmeasure=0.15384615384615385)],
 'rougeL': [Score(precision=0.2753623188405797, recall=0.30158730158730157, fmeasure=0.2878787878787879)],
 'rougeLsum': [Score(precision=0.3333333333333333, recall=0.36507936507936506, fmeasure=0.3484848484848485)]}
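To make the precision, recall and fmeasure fields easier to interpret, here is a simplified sketch of how a ROUGE-1 score can be computed by hand. It assumes lowercased whitespace tokenization, so its numbers will not match the `rouge_score` package exactly (that package applies its own tokenizer). The function name `rouge1` is hypothetical. ROUGE measures word overlap with the reference, so it is only a proxy for semantic preservation.

```python
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 on lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": f1}


# 4 of the 6 predicted words also occur in the 6-word reference.
example = rouge1("the cat sat on the mat", "the cat lay on a mat")
```

Precision is the share of summary words found in the reference, and recall is the share of reference words found in the summary.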

## Method 2 - Performing extractive summarization, then abstractive summarization on the result

### Load the model

In [8]:
summarizer2 = TextRankSummarizer()

### Generating extractive summary

In [57]:
len(article1.split('.'))

126

In [9]:
extractive_summary_text = ''
# sumy summarizers expect a parsed document rather than a raw string, so pass parser1.document.
for sentence in summarizer2(parser1.document, 50): # Summary will contain at most 50 sentences
    temp = sentence
    extractive_summary_text = extractive_summary_text + str(temp)

In [82]:
len(extractive_summary_text.split(' '))

1162
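For intuition about the extractive step, the sketch below shows a simplified TextRank: sentences become graph nodes, edges are weighted by shared words, and a PageRank-style iteration scores each sentence. This is an illustration under stated assumptions, not sumy's implementation. The function name `textrank_sentences`, the regex sentence splitter, and the default parameters are all hypothetical choices.

```python
import math
import re
from typing import List, Set


def textrank_sentences(text: str, top_n: int = 3, damping: float = 0.85,
                       iterations: int = 50) -> List[str]:
    """Return the top_n sentences by a simplified TextRank, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    bags = [set(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(sentences)
    if n == 0:
        return []

    def similarity(a: Set[str], b: Set[str]) -> float:
        # Shared words, normalised by the log of each sentence's vocabulary size.
        if len(a) < 2 or len(b) < 2:
            return 0.0
        return len(a & b) / (math.log(len(a)) + math.log(len(b)))

    weights = [[similarity(bags[i], bags[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours, split by edge weight.
        scores = [
            (1 - damping) / n
            + damping * sum(weights[j][i] / out_sum[j] * scores[j]
                            for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]

    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


demo_text = (
    "Ethical hackers test systems for weaknesses. "
    "Ethical hackers report weaknesses in systems to the owners. "
    "Pizza dough needs time to rise."
)
top = textrank_sentences(demo_text, top_n=2)
```

In the demo, the two hacking sentences share several words and reinforce each other, so the unrelated pizza sentence is ranked last and dropped.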

### Generating abstractive summary

In [85]:
summary = summarizer1(extractive_summary_text, max_length=500, min_length=30, do_sample=False)

In [86]:
summary

[{'summary_text': 'ethical hackers are the good guys of the hacking world, the ones who wear the "white hat" a career in ethical hacking can be an enticing prospect . you should be well-versed with LINUX as it gives them the power to utilize the open-source operating system Linux the way they desire .'}]