Web Scraper Project

A comprehensive and flexible web scraping framework built with Python. This project provides a robust foundation for scraping websites with built-in rate limiting, error handling, data storage, and monitoring capabilities.

Features

  • 🚀 Multiple Scraping Methods: Support for requests, Selenium, and Scrapy
  • 🛡️ Built-in Protection: Rate limiting, retry logic, and respectful scraping (see the sketch after this list)
  • 📊 Flexible Data Storage: Save to JSON, CSV, or databases
  • 🔧 Highly Configurable: Easy configuration through YAML and environment variables
  • 📝 Comprehensive Logging: Detailed logging and progress monitoring
  • 🔄 Async Support: Asynchronous scraping for better performance
  • 🎭 Proxy Support: Built-in proxy rotation capabilities
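
The protection features come down to pacing requests and retrying transient failures. The sketch below illustrates that pattern with plain requests only; the function name and parameters (fetch_politely, delay, max_retries) are illustrative and are not part of this project's API.

import time
import requests

def fetch_politely(url, delay=1.0, max_retries=3, timeout=10):
    """Illustrative rate-limited fetch with a simple retry/backoff loop."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            time.sleep(delay)  # pause between requests to stay respectful
            return response.text
        except requests.RequestException:
            if attempt == max_retries:
                raise
            time.sleep(delay * attempt)  # back off a little more on each retry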

Quick Start

  1. Clone and Setup with Virtual Environment

    Automated Setup (Recommended):

    cd web_scraper
    
    # Windows
    .\setup_venv.ps1
    
    # Linux/Mac
    ./setup_venv.sh

    Manual Setup:

    cd web_scraper
    
    # Create and activate virtual environment
    python -m venv venv
    
    # Windows
    .\venv\Scripts\Activate.ps1
    
    # Linux/Mac
    source venv/bin/activate
    
    # Install dependencies
    pip install -r requirements.txt
    # Copy the environment template
    cp .env.example .env
  2. Basic Usage

    from src.scraper import WebScraper
    
    scraper = WebScraper()
    data = scraper.scrape_url('https://example.com')
    scraper.save_data(data, 'example_data.json')
  3. Run Examples

    python examples/basic_scraper.py
    python examples/ecommerce_scraper.py

Project Structure

web_scraper/
├── src/
│   ├── scraper.py          # Main scraper class
│   ├── storage.py          # Data storage utilities
│   ├── utils.py            # Helper functions
│   └── __init__.py
├── config/
│   ├── scraper_config.yaml # Scraping configuration
│   └── headers.yaml        # HTTP headers
├── examples/
│   ├── basic_scraper.py    # Simple scraping example
│   ├── ecommerce_scraper.py # E-commerce scraping
│   └── news_scraper.py     # News scraping
├── data/                   # Scraped data storage
├── logs/                   # Log files
├── requirements.txt
└── .env.example

Configuration

Edit config/scraper_config.yaml to customize (a loading sketch follows the list):

  • Request delays and timeouts
  • Retry policies
  • User agents and headers
  • Output formats
  • Logging levels
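
The configuration values can then be fed straight into the scraper. The snippet below is a minimal sketch that assumes a request_delay key exists in scraper_config.yaml; check the shipped file for the actual key names.

import yaml
from src.scraper import WebScraper

# Load the YAML configuration (the request_delay key below is an assumption for illustration)
with open('config/scraper_config.yaml') as f:
    config = yaml.safe_load(f)

scraper = WebScraper(delay=config.get('request_delay', 1))  # seconds between requests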

Examples

Basic Web Scraping

from src.scraper import WebScraper

scraper = WebScraper()
data = scraper.scrape_url('https://quotes.toscrape.com/')
print(data)

Bulk URL Scraping

urls = [
    'https://example1.com',
    'https://example2.com',
    'https://example3.com'
]

scraper = WebScraper(delay=2)  # wait 2 seconds between requests
results = scraper.scrape_multiple_urls(urls)
scraper.save_to_csv(results, 'bulk_data.csv')

Using Selenium for Dynamic Content

scraper = WebScraper(use_selenium=True)
data = scraper.scrape_url('https://spa-example.com')

Advanced Features

Custom Data Extraction

def extract_products(soup):
    """Pull product names and prices out of a parsed page."""
    products = []
    for item in soup.find_all('div', class_='product'):
        products.append({
            'name': item.find('h3').get_text(strip=True),
            'price': item.select_one('.price').get_text(strip=True)  # find() does not accept CSS selectors
        })
    return products

scraper = WebScraper()
scraper.set_custom_extractor(extract_products)
data = scraper.scrape_url('https://shop.example.com')

Async Scraping

import asyncio
from src.scraper import AsyncWebScraper

async def main():
    scraper = AsyncWebScraper()
    urls = ['https://example1.com', 'https://example2.com']
    results = await scraper.scrape_multiple_async(urls)
    return results

results = asyncio.run(main())

Best Practices

  1. Respect robots.txt: Always check and respect the website's robots.txt file (see the sketch after this list)
  2. Use delays: Implement appropriate delays between requests
  3. Handle errors gracefully: Implement proper error handling and retries
  4. Monitor your scraping: Use logging to track your scraping activities
  5. Be respectful: Don't overload servers with too many concurrent requests
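
For the robots.txt check in particular, Python's standard library is enough. The snippet below is a standalone example of verifying that a URL may be fetched before scraping it; it does not depend on this project's classes.

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt
robots = RobotFileParser()
robots.set_url('https://quotes.toscrape.com/robots.txt')
robots.read()

url = 'https://quotes.toscrape.com/page/1/'
if robots.can_fetch('*', url):
    print(f'Allowed to scrape {url}')
else:
    print(f'robots.txt disallows {url}')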

Legal Considerations

  • Always check the website's Terms of Service
  • Respect robots.txt files
  • Be mindful of copyright and data protection laws
  • Consider reaching out to website owners for permission when scraping large amounts of data

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For questions and support, please open an issue on the GitHub repository.
