SmartWebScraper

A versatile and intelligent web scraper that extracts relevant content from trusted sources based on user-defined keywords. This tool leverages DuckDuckGo search results and integrates requests, BeautifulSoup, and optionally cloudscraper to handle bot protection.

SmartWebScraper extracts both text and images from trusted websites for a given keyword. It automates the search and collection steps and stores the results as a JSON file and a PDF report.

Features

Keyword-Based Scraping: Extracts text and images related to any given keyword.
Trusted Websites Filtering: Scrapes only from specified trusted domains to ensure data reliability.
Automated Search: Uses DuckDuckGo to find relevant pages.
Efficient Web Scraping: Handles Cloudflare-protected sites when the optional cloudscraper package is installed.
Robust Session Handling: Implements retries and user-agent rotation to avoid request blocks.
Data Storage: Saves extracted data in JSON format and generates a structured PDF report.
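
The retry and user-agent rotation behaviour listed above could look roughly like the following. This is a minimal sketch, not the project's actual code: it uses the standard library's urllib instead of requests so that it runs without extra dependencies, and the names USER_AGENTS, urllib_fetch, and fetch_with_retries, as well as the attempt count and backoff values, are hypothetical.

```python
import random
import time
import urllib.request
from typing import Callable, Sequence

# Hypothetical pool of user-agent strings; a real list would likely be longer.
USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
)


def urllib_fetch(url: str, headers: dict) -> str:
    """Fetch a URL with the given headers and return the decoded body."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retries(
    url: str,
    fetch: Callable[[str, dict], str] = urllib_fetch,
    user_agents: Sequence[str] = USER_AGENTS,
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> str:
    """Try the request up to max_attempts times with a random user agent per attempt."""
    last_error = None
    for attempt in range(max_attempts):
        headers = {"User-Agent": random.choice(user_agents)}
        try:
            return fetch(url, headers)
        except OSError as error:  # urllib.error.URLError is a subclass of OSError
            last_error = error
            if attempt < max_attempts - 1:
                # Exponential backoff: 1x, 2x, 4x ... the base delay.
                time.sleep(backoff_seconds * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts") from last_error
```

Passing the fetch function as a parameter keeps the retry logic independent of the HTTP library, so the same loop could wrap a requests or cloudscraper session.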

Installation

Prerequisites

Ensure you have Python 3 installed, then install the required dependencies:

pip install -r requirements.txt

For better handling of Cloudflare-protected sites, also install the optional cloudscraper package:

pip install cloudscraper
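
Because cloudscraper is optional, the script needs some way to fall back to plain requests when it is missing. One possible approach is sketched below; available_backend and make_session are hypothetical names, and the project's own fallback logic may differ.

```python
import importlib.util


def available_backend(preferred=("cloudscraper", "requests")):
    """Return the first importable package name from preferred, or None."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def make_session():
    """Create an HTTP session, preferring cloudscraper when it is installed."""
    backend = available_backend()
    if backend == "cloudscraper":
        import cloudscraper  # imported lazily so this module loads without it

        # create_scraper() returns a requests.Session subclass.
        return cloudscraper.create_scraper()
    if backend == "requests":
        import requests

        return requests.Session()
    raise RuntimeError("install requests (and optionally cloudscraper) first")
```

Since the cloudscraper object behaves like a requests session, the calling code can use session.get(url) in either case.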

Usage

Edit scraper.py to set your desired keyword and list of trusted websites.
Run the script:

python scraper.py

Extracted data is saved in a folder named after the keyword (for example, Geo-political_Tension_data for the keyword "Geo-political Tension") as:
- scraped_data.json: Contains structured text and image URLs.
- content_document.pdf: A formatted PDF report with text and images.

Output

JSON File: Stores extracted text, image URLs, and their local paths.
PDF Report: Includes structured text with headings and embedded images.
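
As a rough illustration of the JSON output, the sketch below writes one record containing the three kinds of data listed above. The field names (source, text, images, url, local_path), the helper names, and the sample values are assumptions, so the real scraped_data.json may be organised differently. The folder name assumes a "<keyword>_data" pattern with spaces replaced by underscores.

```python
import json
from pathlib import Path


def output_dir_for(keyword: str, base: Path = Path(".")) -> Path:
    """Build the output folder path, assuming a '<keyword>_data' naming pattern."""
    return base / (keyword.replace(" ", "_") + "_data")


def save_results(keyword: str, records: list, base: Path = Path(".")) -> Path:
    """Write the records to scraped_data.json inside the keyword's folder."""
    folder = output_dir_for(keyword, base)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / "scraped_data.json"
    with target.open("w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2, ensure_ascii=False)
    return target


# Hypothetical record shape: extracted text plus image URLs and local copies.
example_record = {
    "source": "https://example.com/article",
    "text": "Extracted paragraph text ...",
    "images": [
        {"url": "https://example.com/chart.png", "local_path": "images/chart.png"},
    ],
}
```

Calling save_results("Geo-political Tension", [example_record]) would create the folder if needed and return the path of the written file.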

Example

If the keyword is "Geo-political Tension", the tool will:

Search for relevant content on trusted sites (e.g., Bloomberg, Forbes).
Scrape and store related text and images.
Generate scraped_data.json and content_document.pdf in the output folder.
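
The trusted-site filtering in the first step can be approximated with a hostname check on each search result. The sketch below assumes the trusted list is a set of bare domains. TRUSTED_DOMAINS, is_trusted, and filter_trusted are hypothetical names, and the two domains simply mirror the Bloomberg and Forbes examples.

```python
from urllib.parse import urlparse

# Hypothetical trusted list; the real script defines its own.
TRUSTED_DOMAINS = {"bloomberg.com", "forbes.com"}


def is_trusted(url: str, trusted=TRUSTED_DOMAINS) -> bool:
    """Return True if the URL's host is a trusted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in trusted)


def filter_trusted(urls, trusted=TRUSTED_DOMAINS):
    """Keep only URLs whose host passes is_trusted, preserving order."""
    return [url for url in urls if is_trusted(url, trusted)]
```

Matching on the hostname suffix rather than on a substring of the URL avoids accepting look-alike hosts such as notforbes.com.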

License

This project is open-source. Feel free to modify and enhance it!

Disclaimer

Use this scraper responsibly. Ensure compliance with website terms and conditions before extracting data.
