# Text Summarisation

<br><br>

We will summarise text using the [Falconsai/text_summarization](https://huggingface.co/Falconsai/text_summarization) model from Hugging Face.

## Importing the model

In [1]:
from transformers import pipeline

summarizer = pipeline("summarization", model="Falconsai/text_summarization")

<br><br>

# Scraping the web for data

In [2]:
import requests
from bs4 import BeautifulSoup

In [10]:
url = 'https://economictimes.indiatimes.com/wealth/tax/what-budget-2024-means-for-you-positive-takeaways-from-the-interim-budget/articleshow/107377274.cms?from=mdr'
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the request returned an HTTP error

soup = BeautifulSoup(response.text, 'html.parser')
# The article body is taken from the element with class "artText"
target_element = soup.find(class_='artText')
text = target_element.get_text()
text

'No hike in basic exemption, no increase in deductions, not even a change in tax slabs– why should taxpayers feel happy about the interim Budget announced last week? Let us enumerate the reasons. For one, the government has extended an olive branch to taxpayers stuck with petty tax demands of previous years. The Budget has proposed to withdraw all direct tax demands up to Rs.25,000 till the year 2009-10 and up to Rs.10,000 for the years 2010-11 to 2014-15.The move is expected to benefit an estimated 1 crore taxpayers, who are still disputing these tax demands. It will not only free them from the tussle with the taxman, but also pave the way for tax refunds that were held up due to pending tax demands. Rajarshi Dasgupta, Executive Director and National Head of Tax, AQUILAW, observed, “In case of a pending tax demand, any refund in the subsequent year is not processed unless such demand is addressed. By disposing of the demands, many refund claims will be expedited.”The proposal comes at

<br><br>

# Using the summarizer

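The summarizer we are using has an input limit of 512 tokens (not words), so let us write a function that splits the scraped text into smaller chunks. The function below measures each chunk in characters and keeps it to at most 512, which in practice stays comfortably under the token limit.

In [4]:
def split_paragraphs(text):
    # The limit is measured in characters (length of the joined string), not words.
    MAX_CHARS_PER_PARAGRAPH = 512
    paragraphs = []
    words = text.split()
    current_paragraph = []

    for word in words:
        # Keep adding words while the joined chunk still fits within the limit.
        if len(' '.join(current_paragraph + [word])) <= MAX_CHARS_PER_PARAGRAPH:
            current_paragraph.append(word)
        else:
            # The chunk is full: store it (if non-empty) and start a new one.
            if current_paragraph:
                paragraphs.append(' '.join(current_paragraph))
            current_paragraph = [word]

    # Store whatever is left over as the final chunk.
    if current_paragraph:
        paragraphs.append(' '.join(current_paragraph))

    return paragraphs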
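
The splitter above breaks on word boundaries, so a chunk can end in the middle of a sentence. One possible variation is to group whole sentences instead. The sketch below is only an illustration: `split_sentences_into_chunks` and its `max_chars` parameter are hypothetical names, and it assumes that sentences end in `.`, `!` or `?` followed by whitespace, which will not always hold for scraped text.

```python
import re

def split_sentences_into_chunks(text, max_chars=512):
    """Group whole sentences into chunks of at most max_chars characters."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f'{current} {sentence}' if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_chars is kept whole here.
            current = sentence
    if current:
        chunks.append(current)
    return chunks

sample = "First sentence here. Second one follows! Is this the third? Yes."
print(split_sentences_into_chunks(sample, max_chars=40))
```

Because an over-long sentence is kept whole rather than cut, a chunk produced this way can exceed `max_chars`; whether that is acceptable depends on how the summarizer is configured to handle long inputs.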

### Split our web-scraped data

In [5]:
split_para = split_paragraphs(text)

## Apply the summarizer to all the paragraphs

In [6]:
summarized_paragraphs = []

# Summarise each chunk separately; max_length and min_length are counted in tokens.
for paragraph in split_para:
    summary = summarizer(paragraph, max_length=10, min_length=5, do_sample=False)[0]['summary_text']
    summarized_paragraphs.append(summary)

sum_txt = ' '.join(summarized_paragraphs)
sum_txt

'Budget has proposed to withdraw all direct tax demands taxpayers are still disputing tax demands  Income-tax Department has been notified of Tax filing portal TaxSpanner says it is Taxpayers, especially those with pending taxpayers who are yet to pay pending Rapid urbanisation, high property prices, in Grihum Housing Finance (formerly Poon the specifics of the proposed scheme should be government’s existing credit linked subsidy scheme Budget reiterated government’s focus on infrastructure the input prices will further provide support to the allocation for public health insurance scheme increased from Rs development, port connectivity and improvement of tourism infrastructure'

## Store the result in a file 

In [7]:
with open("summary.txt", "w", encoding="utf-8-sig") as file:
    file.write(sum_txt)
    
print("Text data has been stored in summary.txt.")

Text data has been stored in summary.txt.


# Words reduced 

In [8]:
len(text.split())

1078

In [9]:
len(sum_txt.split())

97
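
The two counts above can be combined into a single reduction figure. This is a minimal sketch with the counts copied by hand from the outputs above; in the notebook itself you could use `len(text.split())` and `len(sum_txt.split())` directly. The variable names are illustrative.

```python
original_words = 1078  # word count of the scraped article (output above)
summary_words = 97     # word count of the joined summary (output above)

# Fraction of words removed by summarisation.
reduction = 1 - summary_words / original_words
print(f"Words reduced by {reduction:.1%}")
```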