A Scrapy-based web scraper for extracting doctor information from Practo.com, specifically focusing on Bangalore-based medical practitioners.
- Scrapes doctor profiles from Practo.com
- Extracts comprehensive doctor information, including:
  - Name and specialization
  - Experience and education
  - Clinic details and locations
  - Consultation fees
  - Ratings and reviews
  - Available time slots
- Google Maps integration for location verification
- Data cleaning and validation utilities
- Support for multiple output formats (JSON, CSV, MongoDB)
- Clone the repository:
git clone <repository-url>
cd bangalore_doctors_scraper
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
Create a .env file in the root directory and add:
GOOGLE_MAPS_API_KEY=your_api_key_here
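As a minimal sketch of how the key might be picked up at runtime: the helper names below (`load_env_file`, `get_maps_api_key`) are hypothetical and not taken from this project, which may rely on a package such as python-dotenv instead. The parser only handles simple `KEY=value` lines.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines into os.environ (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        # Variables already set in the real environment take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


def get_maps_api_key():
    """Return the Google Maps key, or fail with a clear message."""
    load_env_file()
    key = os.environ.get("GOOGLE_MAPS_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_MAPS_API_KEY is not set; add it to .env")
    return key
```

Failing early with an explicit error makes a missing key easier to diagnose than a rejected geocoding request later in the crawl.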
Run the scraper with default settings:
python run_scraper.py
Run with custom parameters:
scrapy crawl practo_spider -a specialty="cardiology" -a location="bangalore"
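Scrapy passes each `-a name=value` pair to the spider's constructor as a keyword argument. If the crawl needs to be launched from a script, the command line above could be assembled as in the sketch below; `build_crawl_command` is a hypothetical helper, not part of this project.

```python
import shlex


def build_crawl_command(spider="practo_spider", output=None, **spider_args):
    """Build the argv list for `scrapy crawl` with -a and -o options."""
    command = ["scrapy", "crawl", spider]
    # Sort the arguments so the generated command is deterministic.
    for name, value in sorted(spider_args.items()):
        command += ["-a", f"{name}={value}"]
    if output:
        command += ["-o", output]
    return command


cmd = build_crawl_command(specialty="cardiology", location="bangalore")
# Show the shell-quoted form; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when a specialty or location contains spaces.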
- JSON:
scrapy crawl practo_spider -o doctors.json
- CSV:
scrapy crawl practo_spider -o doctors.csv
- MongoDB: Configure in pipelines.py
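For orientation, a MongoDB pipeline in pipelines.py might look roughly like the sketch below, which follows the usual Scrapy item-pipeline shape. The class name, the `MONGO_URI` and `MONGO_DATABASE` setting names, and the `doctors` collection are assumptions rather than this project's actual configuration, and pymongo must be installed for it to connect.

```python
class MongoPipeline:
    """Store each scraped item as one MongoDB document (illustrative sketch)."""

    def __init__(self, mongo_uri, mongo_db, collection_name="doctors"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name  # assumed collection name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed setting names.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "doctors_scraper"),
        )

    def open_spider(self, spider):
        import pymongo  # imported lazily so the module loads without pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # dict() works for plain dicts and scrapy.Item instances.
        self.collection.insert_one(dict(item))
        return item
```

A pipeline like this would be enabled through the `ITEM_PIPELINES` setting, for example `{"doctors_scraper.pipelines.MongoPipeline": 300}`, assuming that module path matches the project layout.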
bangalore_doctors_scraper/
├── scrapy.cfg               # Scrapy configuration
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── setup.py                 # Package setup
├── run_scraper.py           # Main execution script
├── doctors_scraper/         # Scrapy project directory
│   ├── settings.py          # Scrapy settings
│   ├── middlewares.py       # Custom middlewares
│   ├── pipelines.py         # Data processing pipelines
│   ├── items.py             # Data models
│   └── spiders/             # Spider implementations
│       └── practo_spider.py
├── utils/                   # Utility modules
│   ├── google_maps.py       # Google Maps integration
│   └── data_cleaner.py      # Data cleaning utilities
└── tests/                   # Test files
    └── test_spider.py
Key settings can be modified in doctors_scraper/settings.py:
- DOWNLOAD_DELAY: Delay between requests (default: 1 second)
- CONCURRENT_REQUESTS: Number of concurrent requests (default: 16)
- USER_AGENT: User agent string for requests
- ROBOTSTXT_OBEY: Whether to obey robots.txt (default: True)
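With the defaults listed above, the relevant part of settings.py would look something like this excerpt. The setting names are standard Scrapy settings; the `USER_AGENT` value is a placeholder and not the project's actual string.

```python
# doctors_scraper/settings.py (illustrative excerpt)

DOWNLOAD_DELAY = 1         # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 16   # maximum number of requests in flight
ROBOTSTXT_OBEY = True      # skip URLs disallowed by robots.txt

# Placeholder value: identify the scraper and give site owners a contact URL.
USER_AGENT = "bangalore_doctors_scraper (+<contact-url>)"
```

Raising `DOWNLOAD_DELAY` or lowering `CONCURRENT_REQUESTS` is the simplest way to make a crawl gentler on the target site.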
To be respectful to the target website, the scraper rate-limits its requests:
- Default delay of 1 second between requests
- Each delay is randomized to between 0.5 and 1.5 seconds
- Auto-throttling enabled based on response times
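The behavior described above corresponds to Scrapy's `RANDOMIZE_DOWNLOAD_DELAY` and AutoThrottle settings. The excerpt below is a sketch with assumed values (only the 1-second base delay comes from this README), and `sample_delay` is a hypothetical function that merely illustrates the 0.5x to 1.5x range Scrapy applies internally.

```python
import random

# Illustrative settings; all values except DOWNLOAD_DELAY are assumptions.
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True        # wait 0.5x to 1.5x of DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True            # adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0


def sample_delay(base_delay=DOWNLOAD_DELAY):
    """Draw one delay uniformly from 0.5x to 1.5x of the base delay."""
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)


# Print a few sample delays, rounded for readability.
print([round(sample_delay(), 2) for _ in range(5)])
```

Randomizing the delay makes the request pattern less uniform, while AutoThrottle backs off further when the server responds slowly.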
The scraper extracts the following information for each doctor:
- name: Doctor's full name
- specialty: Medical specialization
- experience: Years of experience
- education: Educational qualifications
- clinic_name: Name of the clinic/hospital
- clinic_address: Full address
- location: City/area
- consultation_fee: Fee for consultation
- rating: Overall rating
- review_count: Number of reviews
- languages: Languages spoken
- services: Medical services offered
- availability: Available time slots
- phone: Contact number
- latitude: Geographic latitude (from Google Maps)
- longitude: Geographic longitude (from Google Maps)
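The fields above could be modeled as in the following sketch. It uses a dataclass, which Scrapy accepts as an item type, but the project's items.py may well define a `scrapy.Item` instead; the class name `DoctorItem` and every field type are assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DoctorItem:
    """One scraped doctor profile; field types are assumed, not confirmed."""

    name: str
    specialty: Optional[str] = None
    experience: Optional[int] = None          # years of experience
    education: Optional[str] = None
    clinic_name: Optional[str] = None
    clinic_address: Optional[str] = None
    location: Optional[str] = None
    consultation_fee: Optional[float] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None
    languages: List[str] = field(default_factory=list)
    services: List[str] = field(default_factory=list)
    availability: List[str] = field(default_factory=list)
    phone: Optional[str] = None
    latitude: Optional[float] = None          # filled in by geocoding
    longitude: Optional[float] = None


# Example with made-up values; unset fields keep their defaults.
doctor = DoctorItem(name="Dr. Example", specialty="Cardiology", experience=12)
record = asdict(doctor)
print(sorted(record))
```

Defaulting every field except `name` lets the spider emit partial records when a profile page omits details such as fees or time slots.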
- This scraper is for educational and research purposes only
- Always respect the website's robots.txt and terms of service
- Implement appropriate delays to avoid overwhelming the server
- Do not use scraped data for commercial purposes without permission
- Ensure compliance with data protection regulations (GDPR, etc.)
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
This tool is for educational purposes only. Users are responsible for ensuring their use complies with applicable laws and website terms of service.