This project uses Python and the Hugging Face transformers library to summarize online articles. It leverages a pre-trained summarization pipeline and web scraping with BeautifulSoup to extract text from a URL, process it, and generate concise summaries.
Features
Scrapes the text content from an article using BeautifulSoup. Processes and organizes the text into manageable chunks for summarization. Summarizes text efficiently using Hugging Face's summarization pipeline. Outputs the summary to a text file (blogsummry.txt). Requirements Python 3.7+ Required Python Libraries: transformers beautifulsoup4 requests Install dependencies with:
Copy code:
pip install transformers beautifulsoup4 requests
How It Works:
Initialize Summarizer The Hugging Face pipeline for summarization is initialized to handle the summarization tasks.
Web Scraping:
The program fetches content from a specified URL using requests. BeautifulSoup parses the HTML and extracts text from h1 and p tags. Text Processing
The extracted text is processed into chunks of less than 500 words. This ensures compatibility with the summarizer’s input limits. Summarization
Each chunk is summarized using the pre-trained summarization pipeline. All summaries are combined into a single cohesive summary. Output The final summary is saved to a text file named blogsummry.txt.
How to Run:
Replace the URL variable with the link to the article you want to summarize:
python Copy code URL = "https://example.com/article"
Copy code:
python summarizer.py View the summary in the generated blogsummry.txt file.
Customization:
Change Summarization Model: Modify the summarization pipeline to use different models from Hugging Face’s library if needed. Be aware of storage and compatibility.
Adjust Summary Length:
Update the max_length and min_length parameters in the summarizer call to control the length of the summary.
Limitations: Scraper Specificity:
This script targets articles with h1 and p tags. It might need adjustments for differently structured websites.
Text Length:
Articles exceeding the maximum token limit may require further splitting or alternate summarization methods.
Acknowledgments
Built with Hugging Face's transformers library. Uses BeautifulSoup for HTML parsing and requests for fetching web content.