The following code uses the Google's T5 transformer to do the summarization of the text that is being input to the code through web scrapping from the official budget speech.

In [1]:
import io
import requests
from PyPDF2 import PdfReader
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

url = "https://www.indiabudget.gov.in/doc/budget_speech.pdf"  # Link to the budget speech

# Function to perform text summarization
def summarize_text(input_text):
    # Load the tokenizer and model, I am using the t5-base model from the Google T5 transformer models
    # Load the tokenizer with model_max_length set to 512 to compensate the model's limits
    tokenizer = AutoTokenizer.from_pretrained("t5-base", model_max_length=512)
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

    # Initialize the pipeline
    summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)

    # Generate summary
    summary = summarizer(input_text, max_length=52, min_length=40, do_sample=False)[0]['summary_text']
    return summary

# Download the PDF file and read it with the help of PyPDF2
response = requests.get(url)
with io.BytesIO(response.content) as open_pdf_file:
    read_pdf = PdfReader(open_pdf_file)
    num_pages = len(read_pdf.pages)
    text = ""
    for page_number in range(num_pages):
        text += read_pdf.pages[page_number].extract_text()

# Perform text summarization in smaller chunks to make sure the words are in the capacity of the model
chunk_size = 1024  # Adjust based on model tolerance and text complexity
summaries = []
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    summary = summarize_text(chunk)
    summaries.append(summary)

final_summary = " ".join(summaries)

# Display the summary
print("Generated Summary:\n", final_summary)

Generated Summary:
 NIRMALA SITHARAMAN, minister of finance, presents the interim budget for 2024 -25 . the budget is expected to be released on january 1, 2023 . it will also be available on november 'sabka Saath, Sabka Vikas, and Sabka Vishwas' was the 'mantra' of the government . the people blessed the government wi th a bigger mandate . phy covered all elements of inclusivity, namely,  social inclusivity through coverage of all strata of the society . t he country overcame the challenge of a once -in-a-century pande ugh 'housing for all', 'har ghar jal', electricity for 3 all, cooking gas for all, bank accounts and financial services for all in record time . the worries about food have been eliminated through the saturation approach of covering all eligible people is the true and comprehensive achievement of social justice . this is secularism in action, reduces corruption, and prevents nepo tism (-) the earlier approach of tackling poverty through entitlements had resulted in very

The following cell performs information extraction using the built-in re(Regular Expression) Module and taking the text from a website using web scrapping

In [3]:
import requests
from bs4 import BeautifulSoup
import re

def extract_numerical_information_from_website(url):
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content of the webpage
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Extract text from webpage
        text = soup.get_text()
        
        # Regular expression pattern to match numerical values (both int and float)
        numerical_pattern = r'\d+(\.\d+)?'  
        
        # Finding all numerical values in the text
        numerical_values = re.findall(numerical_pattern, text)
        
        # Filtering out empty strings
        numerical_values = [value for value in numerical_values if value.strip()]
        
        # Converting strings to numerical values
        numerical_values = [float(value) if '.' in value else int(value) for value in numerical_values]
        
        return numerical_values
    else:
        # Printing an error message if the request was not successful
        print("Error: Unable to fetch webpage")

# Official PIB Summary 
url = 'https://pib.gov.in/PressReleaseIframePage.aspx?PRID=2001136'

# Extracting all the numerical information from the website
numerical_values = extract_numerical_information_from_website(url)

# Print the extracted numerical values
if numerical_values:
    print("Numerical values extracted from the website:", numerical_values)

Numerical values extracted from the website: [0.3, 0.1, 0.4, 0.1, 0.3, 0.1, 0.4, 0.3, 0.5, 0.3, 0.1, 0.4, 0.3, 0.1, 0.7, 0.65, 0.6, 0.8, 0.66, 0.02, 0.3, 0.5, 0.1, 0.13, 0.75, 0.56, 0.24, 0.9, 0.03, 0.13, 0.75, 0.5, 0.4, 0.3, 0.4, 0.66, 0.22, 0.3, 0.1, 0.4, 0.1, 0.3, 0.1, 0.4, 0.3, 0.5, 0.3, 0.1, 0.4, 0.3, 0.1, 0.7, 0.65, 0.6, 0.8, 0.66, 0.02, 0.3, 0.5, 0.1, 0.13, 0.75, 0.56, 0.24, 0.9, 0.03, 0.13, 0.75, 0.5, 0.4, 0.3, 0.4, 0.66, 0.22]


In the next cell I have tried Sentiment Analysis in case of polarity 1 means positive text while -1 means a negative statement, the subjectivity is a float value within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective

In [4]:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer,PatternAnalyzer

text = "This is a very negative text about you!"

TextBlob(text, analyzer = PatternAnalyzer()).sentiment

Sentiment(polarity=-0.48750000000000004, subjectivity=0.52)