This project is a general-purpose web scraping tool that extracts and structures data from websites, adapting to each page's data structure without hardcoded parameters. The scraper uses Selenium for web interaction and BeautifulSoup for HTML parsing, while Groq LLMs handle data alignment and unification. Results are saved in Markdown, JSON, and Excel formats for easy analysis.
- Flexible Scraping: No hardcoded fields. Adapts to any website’s structure.
- Dynamic Column Mapping: Uses an LLM to infer and unify columns across different websites.
- Multi-format Outputs: Saves data in Markdown, JSON, and Excel.
- Scalable Design: Handles large datasets with token management for API efficiency.
- User-Agent Rotation: Mimics human browsing behavior to avoid detection.
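As an illustration of the User-Agent rotation feature, here is a minimal sketch that picks a random User-Agent string and turns it into a Chrome command-line argument. The `USER_AGENTS` pool and both function names are hypothetical; the project's actual list and rotation logic are not shown in this README.

```python
import random

# Hypothetical pool of User-Agent strings (an assumption for illustration only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def pick_user_agent(rng=random):
    """Return one User-Agent string chosen at random from the pool."""
    return rng.choice(USER_AGENTS)


def chrome_arguments(user_agent):
    """Build the Chrome command-line argument that applies the User-Agent."""
    return [f"--user-agent={user_agent}"]
```

With Selenium, each returned argument could be passed to `Options.add_argument()` from `selenium.webdriver.chrome.options` before the driver is created, so that every session starts with a different User-Agent.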
pip install -r requirements.txt
Dependencies include:
- pandas: Data handling.
- BeautifulSoup (bs4): HTML parsing.
- Selenium: Web automation.
- html2text: Converts HTML to Markdown.
- pydantic: Creates dynamic models.
- tiktoken: Manages tokenization for the LLMs.
- ChromeDriver: Required by Selenium. Must be downloaded separately.
- Groq API: For LLM-based data processing. Add your API key to the .env file.
- Clone the Repository
git clone https://github.com/HelixCipher/Dynamic-webscraper-WIP-.git
cd Dynamic-webscraper-WIP-
pip install -r requirements.txt
In the project root, create a .env file:
GROQ_API_KEY=your-groq-api-key
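How the project itself loads the key is not shown here. The following sketch, using only the standard library, illustrates one way a `KEY=value` file like this could be read. The helper names `load_env_file` and `get_groq_api_key` are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines into a dict, skipping blanks and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_groq_api_key(path=".env"):
    """Prefer an exported environment variable, then fall back to the file."""
    key = os.environ.get("GROQ_API_KEY") or load_env_file(path).get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; add it to the .env file")
    return key
```

Failing early with a clear error when the key is missing avoids a less obvious authentication failure later in the run.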
Update data_source.py with the URLs to scrape:
URLS = [ "https://example.com/page1", "https://example.com/page2" ]
Download ChromeDriver and place it in the project root.
Usage
Run the scraper:
python main.py
- Markdown: Cleaned webpage content.
- JSON: Structured data output.
- Excel: Human-readable structured data.
Files are saved in output/, with a timestamped directory for each session:
output/
└── 20241017_123456/
    ├── scraped_data_20241017123456.md
    ├── formatted_data_20241017123456.json
    └── formatted_data_20241017123456.xlsx
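The naming scheme in this layout can be reproduced with a short sketch like the one below. The function name `session_paths` and its parameters are hypothetical; only the directory and file name patterns come from the example listing.

```python
from datetime import datetime
from pathlib import Path


def session_paths(base_dir="output", now=None):
    """Return the three output paths for one scraping session."""
    now = now or datetime.now()
    # Directory name uses an underscore between date and time: 20241017_123456
    folder = Path(base_dir) / now.strftime("%Y%m%d_%H%M%S")
    # File names use the same timestamp without the underscore: 20241017123456
    stamp = now.strftime("%Y%m%d%H%M%S")
    return {
        "markdown": folder / f"scraped_data_{stamp}.md",
        "json": folder / f"formatted_data_{stamp}.json",
        "excel": folder / f"formatted_data_{stamp}.xlsx",
    }
```

Calling `paths["json"].parent.mkdir(parents=True, exist_ok=True)` before writing would create the session directory on demand.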
The scraper dynamically adjusts field names during the scraping process. To extend or modify fields, you can customize this logic within scraper.py.
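To give a sense of what runtime-defined fields can look like, here is a minimal sketch using pydantic's `create_model`. It is not the code in scraper.py; the function `build_record_model`, the model name, and the sample field names are assumptions made for illustration.

```python
from typing import Optional

from pydantic import create_model


def build_record_model(field_names, model_name="ScrapedRecord"):
    """Create a pydantic model whose optional string fields are set at runtime."""
    # Each field is declared as (type, default); None makes the field optional.
    fields = {name: (Optional[str], None) for name in field_names}
    return create_model(model_name, **fields)


# Hypothetical field names, e.g. as inferred from a page by the LLM.
Record = build_record_model(["title", "price", "location"])
row = Record(title="Sample item", price="19.99")
print(row.title, row.price, row.location)
```

Because the field list is plain data, changing which columns are extracted only requires passing a different list of names, with no edits to a fixed schema class.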