Amazon 📦 Scraper with Anti-Bot Measures 🛡️

A sophisticated Amazon product scraper with comprehensive anti-bot protection mechanisms, built using Selenium and advanced evasion techniques.

By the way, this is only my second software release, so if you enjoy this tool, please give it a thumbs up, leave a comment, or consider supporting me below:

Buy Me A Coffee

Features

Core Scraping Features

  • 🔎Amazon Product Search: Search for products using keywords with pagination support
  • Comprehensive Data Extraction: Extract detailed product information (an example record is sketched after this list), including:
    • Product titles, descriptions, and specifications
    • Prices, ratings, and review counts
    • Product images and availability status
    • ASIN and product URLs
  • Multi-format Export: Save data in JSON and CSV formats
  • 📷Image Download: Automatically download and validate product images
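
For reference, a single extracted product record might look roughly like the dictionary below. The field names are illustrative only; the exact keys depend on the scraper's implementation.

# Hypothetical shape of one scraped product record (illustrative field names and values).
example_product = {
    "title": "Wireless Headphones",
    "asin": "B0EXAMPLE1",                                    # placeholder ASIN
    "url": "https://www.amazon.com/dp/B0EXAMPLE1",
    "price": "$59.99",
    "rating": 4.5,
    "review_count": 1234,
    "availability": "In Stock",
    "description": "Over-ear wireless headphones ...",
    "specifications": {"Connectivity": "Bluetooth 5.0"},
    "images": ["https://m.media-amazon.com/images/I/example.jpg"],
}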

Advanced Anti-Bot Measures 🤖

  • ☝️Browser Fingerprint Randomization:
    • Rotating user agents from a pool of realistic browser signatures
    • Randomized viewport sizes to mimic different devices
    • Dynamic browser configuration and options
  • Request Pattern Management:
    • Intelligent random delays between requests
    • Exponential backoff on failures with jitter (see the sketch after this list)
    • Adaptive delay adjustment based on failure rates
  • 🖥️Session Management:
    • Cookie persistence and intelligent rotation
    • Randomized HTTP headers
    • Session cleanup and renewal strategies
  • ⚠️Advanced Error Handling:
    • CAPTCHA detection and automatic response
    • ☁️Cloudflare protection detection
    • Network error recovery with retry logic
    • Page structure change adaptation
    • Rate limit detection and intelligent handling
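
The request-pattern logic described above can be illustrated with a small standalone sketch of randomized delays, exponential backoff, and jitter. This is not the project's actual implementation (that lives in src/anti_bot.py); it only shows the general technique.

import random
import time

def wait_before_request(failures: int, min_delay: float = 1.0, max_delay: float = 3.0) -> float:
    """Sleep for a randomized delay that grows exponentially with recent failures.

    Illustrative only; the real logic in src/anti_bot.py may differ.
    """
    base = random.uniform(min_delay, max_delay)   # random delay between requests
    backoff = base * (2 ** failures)              # exponential backoff on failures
    jitter = random.uniform(0, backoff * 0.25)    # jitter so retries do not align
    delay = min(backoff + jitter, 60.0)           # cap the wait at a sensible maximum
    time.sleep(delay)
    return delay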

🛡️Protection Features

  • 🥷Stealth Mode: Hide automation indicators from websites (a typical approach is sketched after this list)
  • Failure Monitoring: Track and respond to scraping failures
  • Resource Cleanup: Automatic cleanup of browser sessions and resources
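
As a rough illustration of what hiding automation indicators typically involves with Selenium and Chrome, the snippet below sets a few commonly used options and overrides navigator.webdriver. These are generic techniques, not necessarily what this project does; its actual stealth configuration lives in src/fingerprint.py and src/anti_bot.py.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Commonly used options for reducing obvious automation signals (illustrative).
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

driver = webdriver.Chrome(options=options)
# Hide navigator.webdriver before any page script runs.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)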

📂Project Structure

scraper/
├── src/
│   ├── __init__.py
│   ├── scraper.py       # Main Amazon scraper class
│   ├── anti_bot.py      # Anti-bot measures and protection
│   ├── fingerprint.py   # Browser fingerprint randomization
│   ├── config.py        # Configuration management
│   └── utils.py         # Utility functions
├── tests/
│   ├── __init__.py
│   ├── test_scraper.py  # Main scraper tests
│   └── test_anti_bot.py # Anti-bot measures tests
├── data/
│   ├── images/          # Downloaded images storage
│   ├── scraped_data.json
│   └── scraped_data.csv
├── requirements.txt     # Python dependencies
├── example_usage.py     # Example usage with anti-bot features
├── README.md           # This file
└── .gitignore          # Git ignore rules

💾 Setup

  1. Create virtual environment:

    python -m venv venv
  2. Activate virtual environment:

    • Windows: venv\Scripts\activate
    • macOS/Linux: source venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt

⚙️Configuration

Create a .env file in the project root with the following optional settings:

SCRAPER_DELAY=1.0
SCRAPER_TIMEOUT=30
SCRAPER_RETRIES=3
OUTPUT_DIR=data
HEADLESS=True
WINDOW_SIZE=1920,1080
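
As a minimal sketch of how these optional settings could be read (assuming python-dotenv, which is listed in the dependencies), see below; the project's actual loader is in src/config.py and may differ.

import os

from dotenv import load_dotenv

# Load .env from the project root, then read each setting with a sensible default.
load_dotenv()

delay = float(os.getenv("SCRAPER_DELAY", "1.0"))
timeout = int(os.getenv("SCRAPER_TIMEOUT", "30"))
retries = int(os.getenv("SCRAPER_RETRIES", "3"))
output_dir = os.getenv("OUTPUT_DIR", "data")
headless = os.getenv("HEADLESS", "True").lower() == "true"
width, height = (int(v) for v in os.getenv("WINDOW_SIZE", "1920,1080").split(","))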

Usage

Basic Amazon Scraping with Anti-Bot Protection

from src.scraper import AmazonScraper
from src.utils import setup_logging, RateLimitError

# Setup logging
logger = setup_logging()

try:
    # Initialize scraper with built-in anti-bot measures
    with AmazonScraper() as scraper:
        logger.info("Starting safe scraping with anti-bot protection")
        
        # Search for products
        products = scraper.search_products("wireless headphones", max_pages=2)
        
        # Get detailed information
        if products:
            detailed_info = scraper.scrape_product_details(products[0]['url'])
            products[0].update(detailed_info)
        
        # Download images
        scraper.download_product_images(products[:5])
        
        # Save results
        scraper.save_results(products, format_type='both')
        
except RateLimitError:
    logger.warning("Rate limit detected - anti-bot system activated")
except Exception as e:
    logger.error(f"Scraping error: {e}")

⚙️Advanced Anti-Bot Configuration

from src.anti_bot import AntiBotMeasures
from src.fingerprint import BrowserFingerprint

# Initialize with custom settings
anti_bot = AntiBotMeasures(min_delay=2.0, max_delay=5.0)

# Setup driver with anti-bot measures
driver = anti_bot.setup_driver_with_anti_bot('chrome')

# Navigate safely with protection
success = anti_bot.navigate_safely(driver, "https://amazon.com/s?k=laptop")

if success:
    # Scraping logic here
    pass
else:
    print("Navigation blocked or failed")

# Cleanup
anti_bot.cleanup()

🤖Monitoring Anti-Bot Statistics

# After scraping, check anti-bot statistics
request_count = scraper.anti_bot.request_manager.request_count
failed_requests = scraper.anti_bot.request_manager.failed_requests
failure_rate = failed_requests / request_count if request_count > 0 else 0

print(f"Total requests: {request_count}")
print(f"Failed requests: {failed_requests}")
print(f"Failure rate: {failure_rate:.2%}")

Testing 🧪

The project includes a comprehensive test suite with multiple types of tests:

Test Categories

  • Unit Tests: Fast tests with no external dependencies
  • Integration Tests: Tests with browser automation and network access
  • Mock Response Tests: Tests using mocked Amazon responses for reliability
  • Anti-Bot Tests: Specific tests for anti-detection measures
  • Error Case Tests: Tests for error handling scenarios

🏃‍♀️Running Tests

# Run all tests
pytest

# Run specific test categories
pytest -m unit                    # Unit tests only
pytest -m integration             # Integration tests
pytest -m mock_responses          # Tests with mocked responses
pytest -m anti_bot               # Anti-bot measure tests

# Run tests with coverage
pytest --cov=src --cov-report=html

# Run tests in parallel
pytest -n auto

# Run tests with verbose output
pytest -v

# Generate HTML test report
pytest --html=reports/test_report.html

📝Test Configuration

The project uses pytest.ini for test configuration with the following markers (an illustrative excerpt is shown after the list):

  • @pytest.mark.unit: Unit tests (fast, no external dependencies)
  • @pytest.mark.integration: Integration tests (may require browser/network)
  • @pytest.mark.slow: Slow tests (may take several minutes)
  • @pytest.mark.network: Tests requiring network access
  • @pytest.mark.browser: Tests requiring browser automation
  • @pytest.mark.anti_bot: Tests for anti-bot measures
  • @pytest.mark.mock_responses: Tests using mocked Amazon responses
  • @pytest.mark.error_cases: Tests for error handling scenarios
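
An illustrative pytest.ini excerpt registering these markers might look like the following; the actual file in the repository may differ.

[pytest]
markers =
    unit: Unit tests (fast, no external dependencies)
    integration: Integration tests (may require browser/network)
    slow: Slow tests (may take several minutes)
    network: Tests requiring network access
    browser: Tests requiring browser automation
    anti_bot: Tests for anti-bot measures
    mock_responses: Tests using mocked Amazon responses
    error_cases: Tests for error handling scenarios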

📝Writing Tests

When adding new tests, use appropriate markers:

import pytest

@pytest.mark.unit
def test_utility_function():
    """Unit test for utility function."""
    pass

@pytest.mark.integration
@pytest.mark.browser
def test_scraper_with_real_browser():
    """Integration test requiring browser."""
    pass

@pytest.mark.mock_responses
def test_with_mocked_amazon():
    """Test using mocked Amazon responses."""
    pass

Documentation

Comprehensive documentation is available in the repository's documentation files; the quick reference below covers the most commonly used options.

📑Quick Reference

Configuration Options

from src.config import Config

config = Config(
    delay=1.5,              # Delay between requests
    timeout=30,             # Request timeout
    retries=3,              # Retry attempts
    headless=True,          # Headless browser mode
    enable_images=True,     # Download images
    max_images_per_product=5  # Max images per product
)

🤖Anti-Bot Settings

from src.anti_bot import AntiBotMeasures

anti_bot = AntiBotMeasures(
    min_delay=1.0,          # Minimum delay
    max_delay=3.0,          # Maximum delay
    failure_threshold=0.3,   # Failure rate threshold
    session_rotation_interval=50  # Session rotation
)

⚠️Error Handling

from src.utils import RateLimitError, ScrapingError

try:
    products = scraper.search_products("laptop")
except RateLimitError:
    # Handle rate limiting
    pass
except ScrapingError as e:
    # Handle scraping errors
    print(f"Error: {e}")

💱Dependencies

Core Dependencies

  • selenium: Web browser automation
  • webdriver_manager: Automatic ChromeDriver management
  • pandas: Data manipulation and analysis
  • beautifulsoup4: HTML parsing
  • fake-useragent: User-Agent rotation
  • requests: HTTP library
  • python-dotenv: Environment variable management
  • Pillow: Image processing
  • lxml: XML/HTML parser
  • rich: Rich terminal output
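
For reference, an unpinned requirements.txt covering the core list above might look roughly like this; the repository's actual file and any version pins may differ.

selenium
webdriver-manager
pandas
beautifulsoup4
fake-useragent
requests
python-dotenv
Pillow
lxml
rich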

Testing Dependencies

  • pytest: Testing framework
  • pytest-cov: Coverage reporting
  • pytest-mock: Mocking utilities
  • pytest-asyncio: Async test support
  • pytest-xdist: Parallel test execution
  • pytest-html: HTML test reports
  • coverage: Code coverage analysis

Development Dependencies

  • black: Code formatting
  • flake8: Code linting
  • isort: Import sorting
  • mypy: Type checking
  • pre-commit: Git hooks

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Run the test suite: pytest
  5. Ensure code quality: black . && flake8 . && isort .
  6. Commit your changes: git commit -am 'Add feature'
  7. Push to the branch: git push origin feature-name
  8. Submit a pull request

Code Style

This project follows PEP 8 style guidelines and uses:

  • Black for code formatting
  • flake8 for linting
  • isort for import sorting
  • mypy for type checking

Run the following before committing:

black .
flake8 .
isort .
mypy src/

⚖️Legal and Ethical Considerations

Important: This scraper is designed for educational and research purposes. Please ensure you:

  1. 🤖Respect robots.txt: Check and follow the target website's robots.txt file
  2. 📈Rate limiting: Use appropriate delays and don't overwhelm servers
  3. 🔞Terms of service: Review and comply with Amazon's terms of service
  4. 🈶Commercial use: Consider using official APIs for commercial applications
  5. 🔏Data privacy: Handle any scraped data responsibly and in compliance with applicable laws

Recommended Usage Guidelines

  • 🎓Academic Research: ✅ Appropriate with proper rate limiting
  • 💵Price Monitoring: ✅ Use conservative delays and respect limits
  • 🧑‍💻Personal Use: ✅ Follow best practices and be respectful
  • 🕴️Commercial Use: ⚠️ Consider official APIs and legal implications
  • 💯High-Volume Scraping: ❌ Not recommended without explicit permission

License

This project is open source and available under the MIT License.

Disclaimer

This software is provided for educational purposes only. Users are responsible for ensuring their use complies with applicable laws, terms of service, and ethical guidelines. The authors are not responsible for any misuse of this software.

About

A powerful Amazon data scraper with CLI integration 🖥️, bot avoidance 🛡️, error detection 🚨, multi-image downloads 📸, and queue management 📋. Includes batch scripts for Windows 🪟 and bash scripts for Linux/Mac 🐧🍎. Coming soon: web UI 🌐, database integration 🗄️, .csv/.json exports 📊, and Facebook Marketplace listing converter 📦!
