A sophisticated Amazon product scraper built with Selenium, featuring comprehensive anti-bot protection and advanced evasion techniques.
By the way, this is only my second software release, so if you enjoy this tool, please give it a thumbs up, leave a comment, or consider supporting me below:
- 🔎Amazon Product Search: Search for products using keywords with pagination support
- Comprehensive Data Extraction: Extract detailed product information including:
- Product titles, descriptions, and specifications
- Prices, ratings, and review counts
- Product images and availability status
- ASIN and product URLs
- Multi-format Export: Save data in JSON and CSV formats
- 📷Image Download: Automatically download and validate product images
- ☝️Browser Fingerprint Randomization:
- Rotating user agents from a pool of realistic browser signatures
- Randomized viewport sizes to mimic different devices
- Dynamic browser configuration and options
- Request Pattern Management:
- Intelligent random delays between requests
- Exponential backoff on failures with jitter (see the sketch after this feature list)
- Adaptive delay adjustment based on failure rates
- 🖥️Session Management:
- Cookie persistence and intelligent rotation
- Randomized HTTP headers
- Session cleanup and renewal strategies
- ⚠️Advanced Error Handling:
- CAPTCHA detection and automatic response
- ☁️Cloudflare protection detection
- Network error recovery with retry logic
- Page structure change adaptation
- Rate limit detection and intelligent handling
- 🥷Stealth Mode: Hide automation indicators from websites
- Failure Monitoring: Track and respond to scraping failures
- Resource Cleanup: Automatic cleanup of browser sessions and resources
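For the request-pattern item above ("exponential backoff on failures with jitter"), the following is a minimal illustrative sketch of the technique. The function names are hypothetical and this is not the project's actual code, which lives in src/anti_bot.py:

```python
import random
import time

def jittered_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: the upper bound grows with each
    failed attempt, but the actual sleep is randomized so requests never
    settle into a detectable rhythm."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def fetch_with_backoff(fetch, max_attempts: int = 5):
    """Retry a zero-argument callable, sleeping a jittered delay between failures."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(jittered_delay(attempt))
```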
scraper/
├── src/
│   ├── __init__.py
│   ├── scraper.py           # Main Amazon scraper class
│   ├── anti_bot.py          # Anti-bot measures and protection
│   ├── fingerprint.py       # Browser fingerprint randomization
│   ├── config.py            # Configuration management
│   └── utils.py             # Utility functions
├── tests/
│   ├── __init__.py
│   ├── test_scraper.py      # Main scraper tests
│   └── test_anti_bot.py     # Anti-bot measures tests
├── data/
│   ├── images/              # Downloaded images storage
│   ├── scraped_data.json
│   └── scraped_data.csv
├── requirements.txt         # Python dependencies
├── example_usage.py         # Example usage with anti-bot features
├── README.md                # This file
└── .gitignore               # Git ignore rules
- Create virtual environment:
  python -m venv venv

- Activate virtual environment:
  - Windows:
    venv\Scripts\activate
  - macOS/Linux:
    source venv/bin/activate

- Install dependencies:
  pip install -r requirements.txt
Create a .env file in the project root with the following optional settings:
SCRAPER_DELAY=1.0
SCRAPER_TIMEOUT=30
SCRAPER_RETRIES=3
OUTPUT_DIR=data
HEADLESS=True
WINDOW_SIZE=1920,1080
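These variables can be loaded with python-dotenv (one of the project's dependencies). The snippet below is only a sketch of that idea, assuming the Config class reads them this way; the actual parsing lives in src/config.py and may differ:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root into the process environment

delay = float(os.getenv("SCRAPER_DELAY", "1.0"))
timeout = int(os.getenv("SCRAPER_TIMEOUT", "30"))
retries = int(os.getenv("SCRAPER_RETRIES", "3"))
output_dir = os.getenv("OUTPUT_DIR", "data")
headless = os.getenv("HEADLESS", "True").lower() == "true"
width, height = (int(v) for v in os.getenv("WINDOW_SIZE", "1920,1080").split(","))
```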
Basic usage with the built-in anti-bot protection:

from src.scraper import AmazonScraper
from src.utils import setup_logging, RateLimitError
# Setup logging
logger = setup_logging()
try:
    # Initialize scraper with built-in anti-bot measures
    with AmazonScraper() as scraper:
        logger.info("Starting safe scraping with anti-bot protection")

        # Search for products
        products = scraper.search_products("wireless headphones", max_pages=2)

        # Get detailed information
        if products:
            detailed_info = scraper.scrape_product_details(products[0]['url'])
            products[0].update(detailed_info)

        # Download images
        scraper.download_product_images(products[:5])

        # Save results
        scraper.save_results(products, format_type='both')

except RateLimitError:
    logger.warning("Rate limit detected - anti-bot system activated")
except Exception as e:
    logger.error(f"Scraping error: {e}")

Using the anti-bot components directly:

from src.anti_bot import AntiBotMeasures
from src.fingerprint import BrowserFingerprint
# Initialize with custom settings
anti_bot = AntiBotMeasures(min_delay=2.0, max_delay=5.0)
# Setup driver with anti-bot measures
driver = anti_bot.setup_driver_with_anti_bot('chrome')
# Navigate safely with protection
success = anti_bot.navigate_safely(driver, "https://amazon.com/s?k=laptop")
if success:
    # Scraping logic here
    pass
else:
    print("Navigation blocked or failed")

# Cleanup
anti_bot.cleanup()

# After scraping, check anti-bot statistics
request_count = scraper.anti_bot.request_manager.request_count
failed_requests = scraper.anti_bot.request_manager.failed_requests
failure_rate = failed_requests / request_count if request_count > 0 else 0
print(f"Total requests: {request_count}")
print(f"Failed requests: {failed_requests}")
print(f"Failure rate: {failure_rate:.2%}")The project includes a comprehensive test suite with multiple types of tests:
- Unit Tests: Fast tests with no external dependencies
- Integration Tests: Tests with browser automation and network access
- Mock Response Tests: Tests using mocked Amazon responses for reliability
- Anti-Bot Tests: Specific tests for anti-detection measures
- Error Case Tests: Tests for error handling scenarios
# Run all tests
pytest
# Run specific test categories
pytest -m unit # Unit tests only
pytest -m integration # Integration tests
pytest -m mock_responses # Tests with mocked responses
pytest -m anti_bot # Anti-bot measure tests
# Run tests with coverage
pytest --cov=src --cov-report=html
# Run tests in parallel
pytest -n auto
# Run tests with verbose output
pytest -v
# Generate HTML test report
pytest --html=reports/test_report.html

The project uses pytest.ini for test configuration with the following markers:
- @pytest.mark.unit: Unit tests (fast, no external dependencies)
- @pytest.mark.integration: Integration tests (may require browser/network)
- @pytest.mark.slow: Slow tests (may take several minutes)
- @pytest.mark.network: Tests requiring network access
- @pytest.mark.browser: Tests requiring browser automation
- @pytest.mark.anti_bot: Tests for anti-bot measures
- @pytest.mark.mock_responses: Tests using mocked Amazon responses
- @pytest.mark.error_cases: Tests for error handling scenarios
When adding new tests, use appropriate markers:
import pytest

@pytest.mark.unit
def test_utility_function():
    """Unit test for utility function."""
    pass

@pytest.mark.integration
@pytest.mark.browser
def test_scraper_with_real_browser():
    """Integration test requiring browser."""
    pass

@pytest.mark.mock_responses
def test_with_mocked_amazon():
    """Test using mocked Amazon responses."""
    pass

Comprehensive documentation is available in the following files:
- API Documentation: Detailed API reference for all classes and methods
- Troubleshooting Guide: Common issues and solutions
- Rate Limiting Guidelines: Best practices for responsible scraping
- CLI Usage Guide: Command-line interface documentation
- Data Extraction Guide: Advanced data extraction features
from src.config import Config

config = Config(
    delay=1.5,                   # Delay between requests
    timeout=30,                  # Request timeout
    retries=3,                   # Retry attempts
    headless=True,               # Headless browser mode
    enable_images=True,          # Download images
    max_images_per_product=5     # Max images per product
)

from src.anti_bot import AntiBotMeasures
anti_bot = AntiBotMeasures(
    min_delay=1.0,                 # Minimum delay
    max_delay=3.0,                 # Maximum delay
    failure_threshold=0.3,         # Failure rate threshold
    session_rotation_interval=50   # Session rotation
)

from src.utils import RateLimitError, ScrapingError
try:
    products = scraper.search_products("laptop")
except RateLimitError:
    # Handle rate limiting
    pass
except ScrapingError as e:
    # Handle scraping errors
    print(f"Error: {e}")

- selenium: Web browser automation
- webdriver_manager: Automatic ChromeDriver management (see the sketch after these dependency lists)
- pandas: Data manipulation and analysis
- beautifulsoup4: HTML parsing
- fake-useragent: User-Agent rotation
- requests: HTTP library
- python-dotenv: Environment variable management
- Pillow: Image processing
- lxml: XML/HTML parser
- rich: Rich terminal output
- pytest: Testing framework
- pytest-cov: Coverage reporting
- pytest-mock: Mocking utilities
- pytest-asyncio: Async test support
- pytest-xdist: Parallel test execution
- pytest-html: HTML test reports
- coverage: Code coverage analysis
- black: Code formatting
- flake8: Code linting
- isort: Import sorting
- mypy: Type checking
- pre-commit: Git hooks
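As a point of reference for the webdriver_manager entry above, this is the usual way it is wired into Selenium 4. It is illustrative only; the scraper itself builds its driver through AntiBotMeasures.setup_driver_with_anti_bot:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads (and caches) a ChromeDriver matching the
# locally installed Chrome, then hands its path to Selenium.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.quit()
```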
- Fork the repository
- Create a feature branch:
  git checkout -b feature-name
- Make your changes and add tests
- Run the test suite:
  pytest
- Ensure code quality:
  black . && flake8 . && isort .
- Commit your changes:
  git commit -am 'Add feature'
- Push to the branch:
  git push origin feature-name
- Submit a pull request
This project follows PEP 8 style guidelines and uses:
- Black for code formatting
- flake8 for linting
- isort for import sorting
- mypy for type checking
Run the following before committing:
black .
flake8 .
isort .
mypy src/

Important: This scraper is designed for educational and research purposes. Please ensure you:
- 🤖Respect robots.txt: Check and follow the target website's robots.txt file
- 📈Rate limiting: Use appropriate delays and don't overwhelm servers
- 🔞Terms of service: Review and comply with Amazon's terms of service
- 🈶Commercial use: Consider using official APIs for commercial applications
- 🔏Data privacy: Handle any scraped data responsibly and in compliance with applicable laws
- 🎓Academic Research: ✅ Appropriate with proper rate limiting
- 💵Price Monitoring: ✅ Use conservative delays and respect limits
- 🧑💻Personal Use: ✅ Follow best practices and be respectful
- 🕴️Commercial Use: ⚠️ Consider official APIs and legal implications
- 💯High-Volume Scraping: ❌ Not recommended without explicit permission
This project is open source and available under the MIT License.
This software is provided for educational purposes only. Users are responsible for ensuring their use complies with applicable laws, terms of service, and ethical guidelines. The authors are not responsible for any misuse of this software.
