# Beyond Text: LLM Web Summarizer for Quick Content Grasping 

Unlike earlier times, we live in an era where a lot of information is published on the internet. However, it may not be possible for us to go through all of it. So, it would be useful if we get a gist of the content on the websites that we visit by using a website extension. Text summarization is a fundamental task in Natural Language Processing (NLP) with applications spanning various domains, including information retrieval and content creation. Large Language Models (LLMs) have shown significant potential in enhancing summarization techniques. 

FLAN-T5 (Finetuning language models) is a general-purpose encoder-decoder LLM which is fine tuned from the T5 model to perform a wide range of sequence-to-sequence tasks. We will train our FLAN-T5 model using the LoRa technique, which requires only a small memory footprint which reduces the number of parameters to train. This website extension will be using FLAN-T5 to summarize the selected content. The technologies include HTML, CSS, JavaScript and some Natural Language Processing libraries for summarizing the web content. HTML, CSS, and JavaScript will be used for the frontend purposes for displaying the extension content. Whereas in the backend we will perform web scrapping by using python libraries like Requests, BeautifulSoup, Scrapy, and Selenium to get the content of the current URL and perform summarization for that content using LLM.  

Diverse range of LLMs, such as MPT-7b-instruct, falcon-7b-instruct, and OpenAI ChatGPT, text-davinci-003 models are the current technologies employed to perform summarization by adjusting hyperparameters, and the quality of generated summaries are being evaluated using established metrics like BLEU Score, ROUGE Score, and BERT Score. Another interesting model 'EXABSUM,' exists which is an innovative Automatic Text Summarization (ATS) approach capable of generating both extractive and abstractive summaries. It uses two distinct methods: (1) an extractive technique (EXABSUM Extractive), combining statistical and semantic scoring methods to select and extract pertinent, non-repetitive sentences from a text unit, and (2) an abstractive technique (EXABSUM Abstractive), utilizing a word graph approach (including compression and fusion stages) and re-ranking based on key phrases to generate abstractive summaries using the source document as input. In the evaluation of diverse benchmarks, EXABSUM outperformed extractive summarization methods and demonstrated competitiveness against abstractive baselines. As future considerations, this project can be extended for other types of tasks too, this is possible since Flan-T5 is good for different types of text generation tasks. 


## 0. Installing Transformers and Importing Dependencies

In [1]:
!pip install transformers



In [2]:
from transformers import pipeline
from bs4 import BeautifulSoup
import requests

## 1. Load Summarization Pipeline

In [6]:
model_name = "stacked-summaries/flan-t5-large-stacked-samsum-1024"

In [7]:
summarizer = pipeline("summarization", model=model_name)

Downloading config.json:   0%|          | 0.00/819 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

ValueError: Could not load model stacked-summaries/flan-t5-large-stacked-samsum-1024 with any of the following classes: (<class 'transformers.models.auto.modeling_tf_auto.TFAutoModelForSeq2SeqLM'>, <class 'transformers.models.t5.modeling_tf_t5.TFT5ForConditionalGeneration'>).

## 2. Model Architecture

In [None]:
from transformers import AutoModel, AutoConfig

In [None]:
config = AutoConfig.from_pretrained(model_name)
config

In [None]:
model = AutoModel.from_pretrained(model_name)
print(model)

## 2. Get Blog Post from Medium

In [3]:
URL = "https://medium.com/@ai-club/ml101-introduction-to-the-world-of-machine-learning-2-86e32c740b58"

In [4]:
r = requests.get(URL)

In [5]:
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all(['h1', 'p'])
text = [result.text for result in results]
ARTICLE = ' '.join(text)

In [6]:
ARTICLE

'Sign up Sign in Sign up Sign in ML101: Introduction to the world of Machine Learning — 2 PTU AI CLUB Follow -- Listen Share Authors: Harsha, Sakti Priya V, Yuvarani VD and Apurva Introduction Our last article in the ML101 series by PTU AI CLUB introduced you to the world of machine learning and its infinite use cases. In this article we will be diving a little deeper into this world by exploring the types of machine learning and the ethical aspects related to machine learning. We will also be providing you with a roadmap for getting started in this world. Enough talk, let’s ML! Types of Machine Learning Based on the methods and way of learning, machine learning is divided into mainly four types, which are: 1. Supervised Machine Learning2. Unsupervised Machine Learning3. Semi-supervised Machine Learning4. Reinforcement Learning Supervised machine learning is based on supervision. It means in the supervised learning technique, we train the machines using the “labeled” dataset, and based

## 3. Chunk Text

In [7]:
max_chunk = 500

In [8]:
ARTICLE = ARTICLE.replace('.', '.<eos>')
ARTICLE = ARTICLE.replace('?', '?<eos>')
ARTICLE = ARTICLE.replace('!', '!<eos>')

In [9]:
sentences = ARTICLE.split('<eos>')
current_chunk = 0 
chunks = []
for sentence in sentences:
    if len(chunks) == current_chunk + 1: 
        if len(chunks[current_chunk]) + len(sentence.split(' ')) <= max_chunk:
            chunks[current_chunk].extend(sentence.split(' '))
        else:
            current_chunk += 1
            chunks.append(sentence.split(' '))
    else:
        print(current_chunk)
        chunks.append(sentence.split(' '))

for chunk_id in range(len(chunks)):
    chunks[chunk_id] = ' '.join(chunks[chunk_id])

0


In [10]:
len(chunks)

3

## 4. Summarize Text

In [11]:
res = summarizer(chunks, max_length=120, min_length=30, do_sample=False)

Token indices sequence length is longer than the specified maximum sequence length for this model (629 > 512). Running this sequence through the model will result in indexing errors


In [18]:
res[0]

{'summary_text': 'ML101: Introduction to the world of machine learning — 2 PTU AI CLUB Follow -- Listen Share Authors: Harsha, Sakti Priya V, Yuvarani VD and Apurva Introduction Our last article in the ML 101 series by PTUAI CLUB introduced you to the World of Machine Learning and its infinite use cases.'}

In [19]:
' '.join([summ['summary_text'] for summ in res])

'ML101: Introduction to the world of machine learning — 2 PTU AI CLUB Follow -- Listen Share Authors: Harsha, Sakti Priya V, Yuvarani VD and Apurva Introduction Our last article in the ML 101 series by PTUAI CLUB introduced you to the World of Machine Learning and its infinite use cases. Understand the ethical aspects of machine learning. Learn about bias and transparency. Understand how machine learning models arrive at their decisions. Get started with machine learning now. Machine learning has the power to revolutionize the way we live our lives and interact with the world around us. Imagine a world where doctors have access to vast amounts of medical data, allowing them to make more accurate diagnoses and better treatment plans for [...]'

In [20]:
text = ' '.join([summ['summary_text'] for summ in res])

In [None]:
text

## 5. Evaluation

In [None]:
!pip install rouge

In [None]:
rouge = Rouge()
scores = rouge.get_scores(text, ARTICLE)

##  6. Output to Text File

In [22]:
with open('./summarized-results/intro-to-machine-learning/summary-flan-t5-base.txt', 'w') as f:
    f.write(text)