Bangalore Doctors Scraper

A Scrapy-based web scraper for extracting doctor information from Practo.com, specifically focusing on Bangalore-based medical practitioners.

Features

  • Scrapes doctor profiles from Practo.com
  • Extracts comprehensive doctor information including:
    • Name and specialization
    • Experience and education
    • Clinic details and locations
    • Consultation fees
    • Ratings and reviews
    • Available time slots
  • Google Maps integration for location verification
  • Data cleaning and validation utilities
  • Support for multiple output formats (JSON, CSV, MongoDB)
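The Google Maps location-verification step can be sketched along these lines. This is a minimal, illustrative sketch, not the project's actual utils/google_maps.py: the helper names are invented here, and only the Geocoding API endpoint and response shape are standard Google API facts. The API key is the one configured under Installation.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional, Tuple

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address: str, api_key: str) -> str:
    """Build a Google Geocoding API request URL for a clinic address."""
    query = urllib.parse.urlencode({"address": address, "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def geocode(address: str, api_key: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) for an address, or None if not found."""
    with urllib.request.urlopen(build_geocode_url(address, api_key)) as resp:
        data = json.load(resp)
    if data.get("status") != "OK":
        return None
    loc = data["results"][0]["geometry"]["location"]
    return loc["lat"], loc["lng"]
```

The returned coordinates would populate the latitude and longitude fields listed under Data Fields.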

Installation

  1. Clone the repository:
git clone <repository-url>
cd bangalore_doctors_scraper
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up environment variables: create a .env file in the root directory and add:
GOOGLE_MAPS_API_KEY=your_api_key_here
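Python does not read .env files automatically; projects typically use the python-dotenv package for this. As a self-contained sketch of what that loading amounts to (the project itself may well use python-dotenv instead):

```python
import os


def load_dotenv(path: str = ".env") -> None:
    """Minimal .env loader: copy KEY=value lines into os.environ.

    Existing environment variables are not overwritten. Blank lines
    and #-comments are skipped. Illustrative only.
    """
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass


load_dotenv()
API_KEY = os.environ.get("GOOGLE_MAPS_API_KEY")
```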

Usage

Basic Usage

Run the scraper with default settings:

python run_scraper.py

Advanced Usage

Run with custom parameters:

scrapy crawl practo_spider -a specialty="cardiology" -a location="bangalore"
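Scrapy passes each -a key=value pair to the spider's __init__ as a keyword argument, where it typically feeds the start URL. A hedged sketch of that mapping follows; the URL pattern and helper name are assumptions, not taken from the project's practo_spider.py:

```python
def build_start_url(specialty: str = "general-physician",
                    location: str = "bangalore") -> str:
    """Map the -a specialty / -a location arguments to a listing URL.

    The Practo URL pattern below is illustrative; check the pattern
    actually used in doctors_scraper/spiders/practo_spider.py.
    """
    slug = specialty.strip().lower().replace(" ", "-")
    city = location.strip().lower()
    return f"https://www.practo.com/{city}/{slug}"
```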

Output Options

  • JSON: scrapy crawl practo_spider -o doctors.json
  • CSV: scrapy crawl practo_spider -o doctors.csv
  • MongoDB: Configure in pipelines.py
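Scrapy has no built-in MongoDB exporter, which is why that option goes through pipelines.py. A pipeline along these lines could back it; the class, database, and collection names here are illustrative, not the project's actual code, and pymongo is imported lazily so the sketch loads even without it installed:

```python
class MongoPipeline:
    """Store scraped doctor items in a MongoDB collection (sketch)."""

    def __init__(self, uri="mongodb://localhost:27017",
                 database="practo", collection="doctors"):
        self.uri = uri
        self.database = database
        self.collection_name = collection
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        import pymongo  # imported lazily so pymongo stays optional
        self.client = pymongo.MongoClient(self.uri)
        self.collection = self.client[self.database][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item
```

A pipeline like this would be enabled via the standard ITEM_PIPELINES setting in settings.py, e.g. {"doctors_scraper.pipelines.MongoPipeline": 300} (path assumed).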

Project Structure

bangalore_doctors_scraper/
├── scrapy.cfg              # Scrapy configuration
├── requirements.txt        # Python dependencies
├── README.md              # This file
├── setup.py               # Package setup
├── run_scraper.py         # Main execution script
├── doctors_scraper/       # Scrapy project directory
│   ├── settings.py        # Scrapy settings
│   ├── middlewares.py     # Custom middlewares
│   ├── pipelines.py       # Data processing pipelines
│   ├── items.py           # Data models
│   └── spiders/           # Spider implementations
│       └── practo_spider.py
├── utils/                 # Utility modules
│   ├── google_maps.py     # Google Maps integration
│   └── data_cleaner.py    # Data cleaning utilities
└── tests/                 # Test files
    └── test_spider.py

Configuration

Scrapy Settings

Key settings can be modified in doctors_scraper/settings.py:

  • DOWNLOAD_DELAY: Delay between requests (default: 1 second)
  • CONCURRENT_REQUESTS: Number of concurrent requests (default: 16)
  • USER_AGENT: User agent string for requests
  • ROBOTSTXT_OBEY: Whether to obey robots.txt (default: True)

Rate Limiting

To be respectful of the target website:

  • Default delay of 1 second between requests
  • Randomized per-request delay of 0.5-1.5 seconds (Scrapy's RANDOMIZE_DOWNLOAD_DELAY multiplier applied to the 1-second base)
  • Auto-throttling enabled based on response times
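The settings and rate-limiting behaviour above correspond to standard Scrapy setting names; a settings.py excerpt along these lines would produce it. The values shown are the defaults described above, and the USER_AGENT string is a placeholder, not the project's actual value:

```python
# doctors_scraper/settings.py (excerpt, illustrative)
DOWNLOAD_DELAY = 1               # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True  # wait 0.5-1.5x DOWNLOAD_DELAY per request
CONCURRENT_REQUESTS = 16
ROBOTSTXT_OBEY = True
USER_AGENT = "bangalore_doctors_scraper (contact email here)"

# AutoThrottle adapts the delay to observed server response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
```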

Data Fields

The scraper extracts the following information for each doctor:

  • name: Doctor's full name
  • specialty: Medical specialization
  • experience: Years of experience
  • education: Educational qualifications
  • clinic_name: Name of the clinic/hospital
  • clinic_address: Full address
  • location: City/area
  • consultation_fee: Fee for consultation
  • rating: Overall rating
  • review_count: Number of reviews
  • languages: Languages spoken
  • services: Medical services offered
  • availability: Available time slots
  • phone: Contact number
  • latitude: Geographic latitude (from Google Maps)
  • longitude: Geographic longitude (from Google Maps)
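Several of these fields arrive from the page as display strings ("15 years experience overall", "₹500", "4.5") and need normalizing before export. The data-cleaning step is roughly the following; these helper names and input formats are illustrative assumptions, not the project's actual utils/data_cleaner.py:

```python
import re


def parse_experience(text):
    """'15 years experience overall' -> 15, or None if no match."""
    m = re.search(r"(\d+)\s*year", text or "")
    return int(m.group(1)) if m else None


def parse_fee(text):
    """'₹500 Consultation fee at clinic' -> 500, or None if no match."""
    m = re.search(r"(\d[\d,]*)", text or "")
    return int(m.group(1).replace(",", "")) if m else None


def parse_rating(text):
    """'4.5' or '97%' -> float, or None if no match."""
    m = re.search(r"(\d+(?:\.\d+)?)", text or "")
    return float(m.group(1)) if m else None
```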

Legal and Ethical Considerations

  • This scraper is for educational and research purposes only
  • Always respect the website's robots.txt and terms of service
  • Implement appropriate delays to avoid overwhelming the server
  • Do not use scraped data for commercial purposes without permission
  • Ensure compliance with data protection regulations (GDPR, etc.)

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This tool is for educational purposes only. Users are responsible for ensuring their use complies with applicable laws and website terms of service.
