BlueGuider/parsing

🌍 International Google Maps Business Scraper

A robust, language-agnostic Google Maps scraper designed to work seamlessly across different locales and VPS environments, including German servers.

✨ Features

  • 🌐 International Support: Works with German, English, and other locales
  • 🔧 VPS-Optimized: Configured for headless operation on virtual private servers
  • 📊 Multiple Export Formats: CSV and Excel download options
  • 🎯 Smart Extraction: Handles different languages and interface layouts
  • ⚡ Auto-Setup: Automatic Chrome driver management with webdriver-manager
  • 🛡️ Robust Error Handling: Graceful fallbacks and detailed error reporting

🚀 Quick Start

1. Environment Setup

Run the automated setup script:

chmod +x setup_environment.sh
./setup_environment.sh

2. Manual Setup (if needed)

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Test the scraper
python test_scraper.py

3. Run the Application

source venv/bin/activate
streamlit run app.py

🔧 Language & Locale Issues Fixed

Problem

The original scraper failed on a German VPS because of:

  • ❌ Language-specific Google Maps interface
  • ❌ Missing locale configuration
  • ❌ Chrome browser setup issues
  • ❌ Encoding problems with international characters

Solution

  • ✅ Multi-language selectors: Handles German, English, and other interfaces
  • ✅ UTF-8 encoding: Proper character encoding for international text
  • ✅ Chrome language forcing: Forces English interface for consistency
  • ✅ Robust browser detection: Multiple Chrome/Chromium binary paths
  • ✅ Webdriver auto-management: Automatic ChromeDriver installation

📋 Requirements

  • Python 3.7+
  • Chrome/Chromium browser
  • Internet connection
  • Virtual environment (recommended)

System Dependencies

# Ubuntu/Debian
sudo apt update
sudo apt install -y python3-venv python3-pip chromium-browser

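# Or install Google Chrome
# (apt-key is deprecated on current Debian/Ubuntu releases, so the key is stored in a keyring file instead)
wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo gpg --dearmor -o /usr/share/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome.gpg] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list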
sudo apt update && sudo apt install -y google-chrome-stable

🎯 Usage

  1. Enter Search Query: Type what you're looking for (e.g., "restaurants in Berlin")
  2. Set Maximum Results: Choose a value between 10 and 100
  3. Start Scraping: Click the button and wait for results
  4. Download Data: Export as CSV or Excel
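
The CSV half of the download step can be sketched with the standard library alone. This is a minimal illustration, not the code in app.py: the column names follow the fields this README lists under Extracted Data, while the function name `to_csv_bytes` and the choice of `utf-8-sig` (UTF-8 with a byte order mark, so spreadsheet tools tend to detect the encoding and show umlauts correctly) are assumptions.

```python
import csv
import io
from typing import Dict, Iterable

# Column order taken from the "Extracted Data" list in this README.
FIELDS = ["Name", "Rating", "Reviews", "Category",
          "Address", "Phone", "Website", "Hours"]


def to_csv_bytes(rows: Iterable[Dict[str, object]]) -> bytes:
    """Serialize scraped rows to CSV bytes (hypothetical helper)."""
    buffer = io.StringIO(newline="")  # keep the csv module's own line endings
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({field: row.get(field, "") for field in FIELDS})
    # utf-8-sig prepends a BOM; this is an assumption, not a project setting.
    return buffer.getvalue().encode("utf-8-sig")


if __name__ == "__main__":
    sample = [{"Name": "Bäckerei Müller", "Rating": 4.5, "Reviews": 120}]
    print(to_csv_bytes(sample).decode("utf-8-sig"))
```

In a Streamlit app, bytes like these could be passed as the `data` argument of `st.download_button`. An Excel export would additionally need a library such as pandas with openpyxl.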

Example Queries

  • restaurants in Berlin
  • hotels in Munich
  • car repair shops in Hamburg
  • bakeries near Frankfurt

📊 Extracted Data

The scraper extracts the following information:

  • Name: Business name
  • Rating: Star rating (e.g., 4.5)
  • Reviews: Number of reviews
  • Category: Business type/category
  • Address: Full address
  • Phone: Phone number (if available)
  • Website: Website URL (if available)
  • Hours: Operating hours (if available)
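
Rating and review counts are the fields most affected by locale, because a German interface writes "4,5" and "(1.234)" where an English one writes "4.5" and "(1,234)". The sketch below shows one way to parse both forms. It assumes the rating and the parenthesized review count appear together in one text snippet, which is an illustration rather than a description of the page layout; the function names are hypothetical.

```python
import re
from typing import Optional

# One digit 0-5, a decimal point or comma, then exactly one more digit.
RATING_RE = re.compile(r"(?<!\d)([0-5])[.,](\d)(?!\d)")
# Digits with optional thousands separators inside parentheses.
REVIEWS_RE = re.compile(r"\(([\d.,\s\u00a0]+)\)")


def parse_rating(text: str) -> Optional[float]:
    """Return the star rating, accepting '4.5' as well as '4,5'."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    return float(f"{match.group(1)}.{match.group(2)}")


def parse_review_count(text: str) -> Optional[int]:
    """Return the review count, ignoring '.' or ',' thousands separators."""
    match = REVIEWS_RE.search(text)
    if match is None:
        return None
    digits = re.sub(r"\D", "", match.group(1))
    return int(digits) if digits else None


if __name__ == "__main__":
    for snippet in ("4,5 (1.234)", "4.5 (1,234)"):
        print(snippet, parse_rating(snippet), parse_review_count(snippet))
```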

🛠️ Technical Details

Chrome Configuration

  • Headless mode for VPS compatibility
  • Language set to English for consistent interface
  • Optimized for server environments
  • Multiple browser binary detection
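
The points above can be sketched without importing Selenium. The candidate binary names and the exact flag list are assumptions chosen for illustration (app.py may use different ones). The flags themselves are standard Chrome command-line switches; `--headless=new` needs a reasonably recent Chrome, and older builds use plain `--headless`.

```python
import shutil
from typing import Callable, List, Optional, Sequence

# Hypothetical search order covering Chrome, Chromium, and snap installs.
CANDIDATE_BINARIES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium-browser",
    "chromium",
    "/snap/bin/chromium",
)


def find_chrome_binary(
    candidates: Sequence[str] = CANDIDATE_BINARIES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first browser binary found, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


def build_chrome_args(lang: str = "en-US") -> List[str]:
    """Return command-line flags suited to a headless server."""
    return [
        "--headless=new",           # run without a display
        "--no-sandbox",             # often required when running as root on a VPS
        "--disable-dev-shm-usage",  # avoid crashes when /dev/shm is small
        f"--lang={lang}",           # request a consistent interface language
        "--window-size=1920,1080",
    ]


if __name__ == "__main__":
    print(find_chrome_binary())
    print(build_chrome_args())
```

With Selenium, each flag would typically be added through `Options.add_argument`, and the detected path assigned to `options.binary_location`.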

Locale Handling

  • UTF-8 encoding enforcement
  • Multiple locale fallbacks
  • International character support
  • German VPS compatibility
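
A locale fallback chain along these lines might look like the following sketch. The candidate list and the function name `set_utf8_locale` are assumptions; which locales are actually available depends on what has been generated on the server.

```python
import locale
from typing import Optional, Sequence

# Hypothetical preference order; adjust to the locales present on the host.
UTF8_CANDIDATES = ("C.UTF-8", "en_US.UTF-8", "de_DE.UTF-8")


def set_utf8_locale(candidates: Sequence[str] = UTF8_CANDIDATES) -> Optional[str]:
    """Activate the first available UTF-8 locale and return its name.

    Returns None when none of the candidates can be set, in which case
    the process keeps whatever locale it already had.
    """
    for name in candidates:
        try:
            locale.setlocale(locale.LC_ALL, name)
            return name
        except locale.Error:
            continue  # locale not installed on this system, try the next one
    return None


if __name__ == "__main__":
    print("Active UTF-8 locale:", set_utf8_locale())
```

Independently of the process locale, opening files with an explicit `encoding="utf-8"` keeps names such as "Bäckerei Müller" intact.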

Extraction Strategy

  • Multiple CSS selector approaches
  • Language-agnostic pattern matching
  • Robust error handling and retries
  • Smart duplicate detection
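
Two of these ideas, trying several selectors in turn and dropping duplicates, can be illustrated in plain Python. The helper names and the selector strings in the demo are hypothetical, and `lookup` stands in for whatever function reads text from the page (for example, a wrapper around a Selenium element lookup).

```python
from typing import Callable, Dict, Iterable, List, Optional


def first_non_empty(
    lookup: Callable[[str], Optional[str]], selectors: Iterable[str]
) -> Optional[str]:
    """Try each selector in order and return the first non-empty text."""
    for selector in selectors:
        try:
            value = lookup(selector)
        except Exception:
            # Broad catch for the sketch; real code would catch the
            # specific "element not found" error of its browser library.
            continue
        if value and value.strip():
            return value.strip()
    return None


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings match."""
    return " ".join(text.casefold().split())


def dedupe(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record for each normalized (Name, Address) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (_normalize(record.get("Name", "")),
               _normalize(record.get("Address", "")))
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


if __name__ == "__main__":
    page = {"h1.fallback": " Café Einstein "}  # made-up selector and text
    print(first_non_empty(page.__getitem__, ["h1.primary", "h1.fallback"]))
```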

🧪 Testing

Run the test suite to verify everything works:

python test_scraper.py

Expected output:

🌍 International Google Maps Scraper - Test Suite
============================================================
🧪 Testing Chrome driver setup...
✅ Chrome driver setup successful!
✅ Successfully navigated to Google

🔍 Testing Google Maps search...
Searching for: restaurants in Berlin
✅ Found 5 results!
🎉 All tests passed! The scraper is working correctly.

🐛 Troubleshooting

Chrome/Chromium Issues

# Check if Chrome is installed
which google-chrome || which chromium-browser

# Install Chromium
sudo apt install chromium-browser

# Test headless mode
chromium-browser --headless --dump-dom https://www.google.com
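
The same check can be scripted from Python, which is handy inside a diagnostic routine. This is a sketch only: the function name `browser_version` is hypothetical, and it relies on Chrome and Chromium printing their version when started with `--version`.

```python
import subprocess
from typing import Optional, Sequence


def browser_version(command: Sequence[str], timeout: float = 10.0) -> Optional[str]:
    """Run a command such as ['chromium-browser', '--version'].

    Returns the first non-empty output stream as text, or None when the
    binary is missing, cannot be started, or does not answer in time.
    """
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    output = (result.stdout or result.stderr).strip()
    return output or None


if __name__ == "__main__":
    for binary in ("google-chrome", "chromium-browser"):
        print(binary, "->", browser_version([binary, "--version"]))
```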

Locale Issues

# Check current locale
locale

# Generate UTF-8 locale
sudo locale-gen en_US.UTF-8

# Set environment variables
export LC_ALL=C.UTF-8
export LANG=C.UTF-8

Permission Issues

# Make scripts executable
chmod +x setup_environment.sh
chmod +x test_scraper.py

Network Issues

  • Ensure internet connectivity
  • Check firewall settings
  • Verify Google Maps is accessible

📁 Project Structure

├── app.py                  # Main Streamlit application
├── test_scraper.py         # Test suite
├── setup_environment.sh    # Environment setup script
├── requirements.txt        # Python dependencies
├── README.md               # This file
└── venv/                   # Virtual environment (created after setup)

🔄 Updates from Original Version

Major Improvements

  1. International Compatibility: Removed US-specific assumptions
  2. Language Agnostic: Works with German and other locales
  3. VPS Optimization: Headless Chrome with proper flags
  4. Auto Driver Management: webdriver-manager integration
  5. Better Error Handling: Comprehensive error messages and fallbacks
  6. UTF-8 Support: Proper encoding for international characters
  7. Multiple Browser Support: Chrome, Chromium, and snap installations

Code Changes

  • Added webdriver-manager for automatic ChromeDriver setup
  • Implemented multiple Chrome binary detection
  • Added UTF-8 encoding configuration
  • Enhanced CSS selectors for different languages
  • Improved error handling and fallback mechanisms

🌟 Success Metrics

After implementing these fixes:

  • ✅ German VPS Compatibility: Works on German servers
  • ✅ Multi-language Support: Handles different Google Maps interfaces
  • ✅ Robust Setup: Automatic environment configuration
  • ✅ Better Data Quality: Improved extraction accuracy
  • ✅ Error Resilience: Graceful handling of failures

📞 Support

If you encounter issues:

  1. Run python test_scraper.py to diagnose problems
  2. Check the troubleshooting section above
  3. Ensure all system dependencies are installed
  4. Verify Chrome/Chromium installation

📄 License

This project is provided as-is for educational purposes. Please respect Google's Terms of Service and rate limits when using this scraper.
