Amazon 📦 Scraper with Anti-Bot Measures 🛡️

A sophisticated Amazon product scraper with comprehensive anti-bot protection mechanisms, built using Selenium and advanced evasion techniques.

By the way, this is only my second software release, so if you enjoy this tool, please give it a thumbs up, leave a comment, or consider supporting me below:

Buy Me A Coffee

Features

Core Scraping Features

  • 🔎Amazon Product Search: Search for products using keywords with pagination support
  • Comprehensive Data Extraction: Extract detailed product information (an example record is sketched after this list), including:
    • Product titles, descriptions, and specifications
    • Prices, ratings, and review counts
    • Product images and availability status
    • ASIN and product URLs
  • Multi-format Export: Save data in JSON and CSV formats
  • 📷Image Download: Automatically download and validate product images
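
For reference, a single extracted product record might look roughly like the dictionary below. The field names are illustrative only; the exact keys depend on the scraper's implementation.

# Hypothetical shape of one scraped product record (illustrative field names and values).
example_product = {
    "title": "Wireless Headphones",
    "asin": "B0EXAMPLE1",                                    # placeholder ASIN
    "url": "https://www.amazon.com/dp/B0EXAMPLE1",
    "price": "$59.99",
    "rating": 4.5,
    "review_count": 1234,
    "availability": "In Stock",
    "description": "Over-ear wireless headphones ...",
    "specifications": {"Connectivity": "Bluetooth 5.0"},
    "images": ["https://m.media-amazon.com/images/I/example.jpg"],
}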

Advanced Anti-Bot Measures 🤖

  • ☝️Browser Fingerprint Randomization:
    • Rotating user agents from a pool of realistic browser signatures
    • Randomized viewport sizes to mimic different devices
    • Dynamic browser configuration and options
  • Request Pattern Management:
    • Intelligent random delays between requests
    • Exponential backoff on failures with jitter (see the sketch after this list)
    • Adaptive delay adjustment based on failure rates
  • 🖥️Session Management:
    • Cookie persistence and intelligent rotation
    • Randomized HTTP headers
    • Session cleanup and renewal strategies
  • ⚠️Advanced Error Handling:
    • CAPTCHA detection and automatic response
    • ☁️Cloudflare protection detection
    • Network error recovery with retry logic
    • Page structure change adaptation
    • Rate limit detection and intelligent handling
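
The request-pattern logic described above can be illustrated with a small standalone sketch of randomized delays, exponential backoff, and jitter. This is not the project's actual implementation (that lives in src/anti_bot.py); it only shows the general technique.

import random
import time

def wait_before_request(failures: int, min_delay: float = 1.0, max_delay: float = 3.0) -> float:
    """Sleep for a randomized delay that grows exponentially with recent failures.

    Illustrative only; the real logic in src/anti_bot.py may differ.
    """
    base = random.uniform(min_delay, max_delay)   # random delay between requests
    backoff = base * (2 ** failures)              # exponential backoff on failures
    jitter = random.uniform(0, backoff * 0.25)    # jitter so retries do not align
    delay = min(backoff + jitter, 60.0)           # cap the wait at a sensible maximum
    time.sleep(delay)
    return delay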

🛡️Protection Features

  • 🥷Stealth Mode: Hide automation indicators from websites (a typical approach is sketched after this list)
  • Failure Monitoring: Track and respond to scraping failures
  • Resource Cleanup: Automatic cleanup of browser sessions and resources
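
As a rough illustration of what hiding automation indicators typically involves with Selenium and Chrome, the snippet below sets a few commonly used options and overrides navigator.webdriver. These are generic techniques, not necessarily what this project does; its actual stealth configuration lives in src/fingerprint.py and src/anti_bot.py.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Commonly used options for reducing obvious automation signals (illustrative).
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

driver = webdriver.Chrome(options=options)
# Hide navigator.webdriver before any page script runs.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)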

📂Project Structure

scraper/
├── src/
│   ├── __init__.py
│   ├── scraper.py       # Main Amazon scraper class
│   ├── anti_bot.py      # Anti-bot measures and protection
│   ├── fingerprint.py   # Browser fingerprint randomization
│   ├── config.py        # Configuration management
│   └── utils.py         # Utility functions
├── tests/
│   ├── __init__.py
│   ├── test_scraper.py  # Main scraper tests
│   └── test_anti_bot.py # Anti-bot measures tests
├── data/
│   ├── images/          # Downloaded images storage
│   ├── scraped_data.json
│   └── scraped_data.csv
├── requirements.txt     # Python dependencies
├── example_usage.py     # Example usage with anti-bot features
├── README.md           # This file
└── .gitignore          # Git ignore rules

💾 Setup

  1. Create virtual environment:

    python -m venv venv
  2. Activate virtual environment:

    • Windows: venv\Scripts\activate
    • macOS/Linux: source venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt

⚙️Configuration

Create a .env file in the project root with the following optional settings:

SCRAPER_DELAY=1.0
SCRAPER_TIMEOUT=30
SCRAPER_RETRIES=3
OUTPUT_DIR=data
HEADLESS=True
WINDOW_SIZE=1920,1080
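
As a minimal sketch of how these optional settings could be read (assuming python-dotenv, which is listed in the dependencies), see below; the project's actual loader is in src/config.py and may differ.

import os

from dotenv import load_dotenv

# Load .env from the project root, then read each setting with a sensible default.
load_dotenv()

delay = float(os.getenv("SCRAPER_DELAY", "1.0"))
timeout = int(os.getenv("SCRAPER_TIMEOUT", "30"))
retries = int(os.getenv("SCRAPER_RETRIES", "3"))
output_dir = os.getenv("OUTPUT_DIR", "data")
headless = os.getenv("HEADLESS", "True").lower() == "true"
width, height = (int(v) for v in os.getenv("WINDOW_SIZE", "1920,1080").split(","))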

Usage

Basic Amazon Scraping with Anti-Bot Protection

from src.scraper import AmazonScraper
from src.utils import setup_logging, RateLimitError

# Setup logging
logger = setup_logging()

try:
    # Initialize scraper with built-in anti-bot measures
    with AmazonScraper() as scraper:
        logger.info("Starting safe scraping with anti-bot protection")
        
        # Search for products
        products = scraper.search_products("wireless headphones", max_pages=2)
        
        # Get detailed information
        if products:
            detailed_info = scraper.scrape_product_details(products[0]['url'])
            products[0].update(detailed_info)
        
        # Download images
        scraper.download_product_images(products[:5])
        
        # Save results
        scraper.save_results(products, format_type='both')
        
except RateLimitError:
    logger.warning("Rate limit detected - anti-bot system activated")
except Exception as e:
    logger.error(f"Scraping error: {e}")

⚙️Advanced Anti-Bot Configuration

from src.anti_bot import AntiBotMeasures
from src.fingerprint import BrowserFingerprint

# Initialize with custom settings
anti_bot = AntiBotMeasures(min_delay=2.0, max_delay=5.0)

# Setup driver with anti-bot measures
driver = anti_bot.setup_driver_with_anti_bot('chrome')

# Navigate safely with protection
success = anti_bot.navigate_safely(driver, "https://amazon.com/s?k=laptop")

if success:
    # Scraping logic here
    pass
else:
    print("Navigation blocked or failed")

# Cleanup
anti_bot.cleanup()

🤖Monitoring Anti-Bot Statistics

# After scraping, check anti-bot statistics
request_count = scraper.anti_bot.request_manager.request_count
failed_requests = scraper.anti_bot.request_manager.failed_requests
failure_rate = failed_requests / request_count if request_count > 0 else 0

print(f"Total requests: {request_count}")
print(f"Failed requests: {failed_requests}")
print(f"Failure rate: {failure_rate:.2%}")

Testing 🧪

The project includes a comprehensive test suite with multiple types of tests:

Test Categories

  • Unit Tests: Fast tests with no external dependencies
  • Integration Tests: Tests with browser automation and network access
  • Mock Response Tests: Tests using mocked Amazon responses for reliability
  • Anti-Bot Tests: Specific tests for anti-detection measures
  • Error Case Tests: Tests for error handling scenarios

🏃‍♀️Running Tests

# Run all tests
pytest

# Run specific test categories
pytest -m unit                    # Unit tests only
pytest -m integration             # Integration tests
pytest -m mock_responses          # Tests with mocked responses
pytest -m anti_bot               # Anti-bot measure tests

# Run tests with coverage
pytest --cov=src --cov-report=html

# Run tests in parallel
pytest -n auto

# Run tests with verbose output
pytest -v

# Generate HTML test report
pytest --html=reports/test_report.html

📝Test Configuration

The project uses pytest.ini for test configuration with the following markers (an illustrative excerpt is shown after the list):

  • @pytest.mark.unit: Unit tests (fast, no external dependencies)
  • @pytest.mark.integration: Integration tests (may require browser/network)
  • @pytest.mark.slow: Slow tests (may take several minutes)
  • @pytest.mark.network: Tests requiring network access
  • @pytest.mark.browser: Tests requiring browser automation
  • @pytest.mark.anti_bot: Tests for anti-bot measures
  • @pytest.mark.mock_responses: Tests using mocked Amazon responses
  • @pytest.mark.error_cases: Tests for error handling scenarios
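
An illustrative pytest.ini excerpt registering these markers might look like the following; the actual file in the repository may differ.

[pytest]
markers =
    unit: Unit tests (fast, no external dependencies)
    integration: Integration tests (may require browser/network)
    slow: Slow tests (may take several minutes)
    network: Tests requiring network access
    browser: Tests requiring browser automation
    anti_bot: Tests for anti-bot measures
    mock_responses: Tests using mocked Amazon responses
    error_cases: Tests for error handling scenarios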

📝Writing Tests

When adding new tests, use appropriate markers:

import pytest

@pytest.mark.unit
def test_utility_function():
    """Unit test for utility function."""
    pass

@pytest.mark.integration
@pytest.mark.browser
def test_scraper_with_real_browser():
    """Integration test requiring browser."""
    pass

@pytest.mark.mock_responses
def test_with_mocked_amazon():
    """Test using mocked Amazon responses."""
    pass

Documentation

Comprehensive documentation is available in the repository's documentation files; the quick reference below covers the most commonly used options.

📑Quick Reference

Configuration Options

from src.config import Config

config = Config(
    delay=1.5,              # Delay between requests
    timeout=30,             # Request timeout
    retries=3,              # Retry attempts
    headless=True,          # Headless browser mode
    enable_images=True,     # Download images
    max_images_per_product=5  # Max images per product
)

🤖Anti-Bot Settings

from src.anti_bot import AntiBotMeasures

anti_bot = AntiBotMeasures(
    min_delay=1.0,          # Minimum delay
    max_delay=3.0,          # Maximum delay
    failure_threshold=0.3,   # Failure rate threshold
    session_rotation_interval=50  # Session rotation
)

⚠️Error Handling

from src.utils import RateLimitError, ScrapingError

try:
    products = scraper.search_products("laptop")
except RateLimitError:
    # Handle rate limiting
    pass
except ScrapingError as e:
    # Handle scraping errors
    print(f"Error: {e}")

💱Dependencies

Core Dependencies

  • selenium: Web browser automation
  • webdriver_manager: Automatic ChromeDriver management
  • pandas: Data manipulation and analysis
  • beautifulsoup4: HTML parsing
  • fake-useragent: User-Agent rotation
  • requests: HTTP library
  • python-dotenv: Environment variable management
  • Pillow: Image processing
  • lxml: XML/HTML parser
  • rich: Rich terminal output
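
For reference, an unpinned requirements.txt covering the core list above might look roughly like this; the repository's actual file and any version pins may differ.

selenium
webdriver-manager
pandas
beautifulsoup4
fake-useragent
requests
python-dotenv
Pillow
lxml
rich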

Testing Dependencies

  • pytest: Testing framework
  • pytest-cov: Coverage reporting
  • pytest-mock: Mocking utilities
  • pytest-asyncio: Async test support
  • pytest-xdist: Parallel test execution
  • pytest-html: HTML test reports
  • coverage: Code coverage analysis

Development Dependencies

  • black: Code formatting
  • flake8: Code linting
  • isort: Import sorting
  • mypy: Type checking
  • pre-commit: Git hooks

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Run the test suite: pytest
  5. Ensure code quality: black . && flake8 . && isort .
  6. Commit your changes: git commit -am 'Add feature'
  7. Push to the branch: git push origin feature-name
  8. Submit a pull request

Code Style

This project follows PEP 8 style guidelines and uses:

  • Black for code formatting
  • flake8 for linting
  • isort for import sorting
  • mypy for type checking

Run the following before committing:

black .
flake8 .
isort .
mypy src/

⚖️Legal and Ethical Considerations

Important: This scraper is designed for educational and research purposes. Please ensure you:

  1. 🤖Respect robots.txt: Check and follow the target website's robots.txt file
  2. 📈Rate limiting: Use appropriate delays and don't overwhelm servers
  3. 🔞Terms of service: Review and comply with Amazon's terms of service
  4. 🈶Commercial use: Consider using official APIs for commercial applications
  5. 🔏Data privacy: Handle any scraped data responsibly and in compliance with applicable laws

Recommended Usage Guidelines

  • 🎓Academic Research: ✅ Appropriate with proper rate limiting
  • 💵Price Monitoring: ✅ Use conservative delays and respect limits
  • 🧑‍💻Personal Use: ✅ Follow best practices and be respectful
  • 🕴️Commercial Use: ⚠️ Consider official APIs and legal implications
  • 💯High-Volume Scraping: ❌ Not recommended without explicit permission

License

This project is open source and available under the MIT License.

Disclaimer

This software is provided for educational purposes only. Users are responsible for ensuring their use complies with applicable laws, terms of service, and ethical guidelines. The authors are not responsible for any misuse of this software.

About

A powerful Amazon data scraper with CLI integration 🖥️, bot avoidance 🛡️, error detection 🚨, multi-image downloads 📸, and queue management 📋. Includes batch scripts for Windows 🪟 and bash scripts for Linux/Mac 🐧🍎. Coming soon: web UI 🌐, database integration 🗄️, .csv/.json exports 📊, and Facebook Marketplace listing converter 📦!
