A FastAPI server with web scraping capabilities for extracting structured data from veterinary clinic websites using Crawl4AI and Google Gemini AI.
- main.py: Application entry point
- routes/: API route handlers
  - scraper.py: Web scraping and data extraction endpoints
  - hello.py: Simple hello endpoint
- helpers/: Utility functions
  - scraperHelper.py: Crawl4AI integration for web crawling
  - geminiHelper.py: Google Gemini AI integration for data extraction
  - envHelper.py: Environment configuration utilities
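For orientation, a minimal sketch of how main.py might tie these pieces together; the router variable names, import paths, and prefixes are assumptions based on the layout above rather than the project's actual code.

```python
from fastapi import FastAPI
import uvicorn

# Hypothetical import paths mirroring the project layout described above
from routes.hello import router as hello_router
from routes.scraper import router as scraper_router
from helpers.envHelper import settings

app = FastAPI()
app.include_router(hello_router, prefix="/v1")
app.include_router(scraper_router, prefix="/v1/scraper")

@app.get("/")
def root():
    # Basic health check
    return {"message": "Simple API is running"}

if __name__ == "__main__":
    uvicorn.run(app, host=settings.host, port=settings.port)
```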
- Python 3.13 (or compatible version)
- pip
- Google Gemini API key
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment:

  Create a .env file in the root directory (you can use .env.example as a template):

  ```env
  # Server Configuration
  HOST=0.0.0.0
  PORT=8080

  # API Keys
  GEMINI_API_KEY=your_gemini_api_key_here
  ```

  Replace your_gemini_api_key_here with your actual Google Gemini API key.
Note: The application uses Pydantic BaseSettings for type-safe configuration management. All environment variables are automatically loaded from the .env file and validated at startup.
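A sketch of what the settings class in envHelper.py might look like with pydantic-settings; the field names simply mirror the variables in the .env example above, and the actual helper may differ.

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Values are read from .env (and the process environment) and validated at startup
    model_config = SettingsConfigDict(env_file=".env")

    host: str = "0.0.0.0"
    port: int = 8080
    gemini_api_key: str

settings = Settings()
```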
- Run the server:

  ```bash
  python main.py
  ```

GET /: Basic health check

- Returns:

  ```json
  {"message": "Simple API is running"}
  ```

GET /v1/hello: Hello endpoint

- Returns:

  ```json
  {"message": "Hello"}
  ```

- Also prints "Hello endpoint called" to console
POST /v1/scraper/crawl: Crawl a veterinary clinic website and extract structured data

- Request Body:

  ```json
  {
    "url": "https://example-vet-clinic.com",
    "max_depth": 3,
    "max_pages": 50
  }
  ```

- Parameters (mirrored by the request-model sketch below):
  - url (required): Base URL of the website to crawl
  - max_depth (optional): Maximum crawl depth (1-10, default: 3)
  - max_pages (optional): Maximum pages to crawl (1-200, default: 50)

- Response:

  ```json
  {
    "success": true,
    "url": "https://example-vet-clinic.com",
    "pages_crawled": 15,
    "data": {
      "name": "Example Veterinary Clinic",
      "phone": "555-1234",
      "address": "123 Main St, City, ST 12345",
      "email": "info@example-vet.com",
      "business_hours": {
        "monday": "9:00 AM - 5:00 PM",
        "tuesday": "9:00 AM - 5:00 PM",
        "wednesday": "9:00 AM - 5:00 PM",
        "thursday": "9:00 AM - 5:00 PM",
        "friday": "9:00 AM - 5:00 PM",
        "saturday": "10:00 AM - 2:00 PM",
        "sunday": "Closed"
      },
      "services": ["Wellness Exams", "Surgery", "Dental Care"],
      "staff": [
        {
          "name": "Dr. Jane Smith",
          "role": "Veterinarian",
          "specialization": "Surgery",
          "bio": "10 years experience..."
        }
      ],
      "faqs": [
        {
          "question": "What are your payment options?",
          "answer": "We accept cash, credit cards..."
        }
      ],
      "policies": "Appointment cancellation policy...",
      "additional_info": "Free parking available"
    },
    "error": null
  }
  ```
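A minimal sketch of a request model that enforces the parameter ranges documented above; the actual model in routes/scraper.py may use different names or additional validators.

```python
from pydantic import BaseModel, Field, HttpUrl

class CrawlRequest(BaseModel):
    # Constraints mirror the documented ranges and defaults
    url: HttpUrl
    max_depth: int = Field(default=3, ge=1, le=10)
    max_pages: int = Field(default=50, ge=1, le=200)
```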
GET /v1/scraper/health: Scraper service health check

- Returns:

  ```json
  {"status": "healthy", "service": "scraper"}
  ```
Once the server is running, test the endpoints:
- Root endpoint:

  ```bash
  curl http://localhost:8080/
  ```

- Hello endpoint:

  ```bash
  curl http://localhost:8080/v1/hello
  ```

- Scraper endpoint:

  ```bash
  curl -X POST http://localhost:8080/v1/scraper/crawl \
    -H "Content-Type: application/json" \
    -d '{"url": "https://example-vet-clinic.com", "max_depth": 3, "max_pages": 50}'
  ```

Or visit the interactive API documentation:
- Swagger UI: http://localhost:8080/docs
- ReDoc: http://localhost:8080/redoc
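If you prefer calling the API from Python, here is a small example using the requests package (not one of this project's dependencies, so install it separately); the fields accessed at the end assume the response shape documented above.

```python
import requests

payload = {
    "url": "https://example-vet-clinic.com",
    "max_depth": 3,
    "max_pages": 50,
}

# Deep crawls can take minutes, so allow a generous timeout
resp = requests.post(
    "http://localhost:8080/v1/scraper/crawl",
    json=payload,
    timeout=600,
)
resp.raise_for_status()

result = resp.json()
print(result["pages_crawled"], result["data"]["name"])
```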
- Deep crawling using Crawl4AI with a breadth-first search strategy (see the crawl sketch after this feature list)
- Configurable crawl depth and page limits
- Automatic filtering of non-content pages (CSS, JS, images, etc.)
- Stays within the same domain
- Extracts clean markdown content
- Uses Google Gemini 2.0 Flash for intelligent data extraction (see the extraction sketch after this feature list)
- Extracts structured information:
  - Business name, contact info (phone, email, address)
  - Business hours (structured by day)
  - Services offered
  - Staff members with roles and specializations
  - FAQs
  - Policies (payment, appointment, emergency)
  - Additional information
- Markdown format extraction (fewer tokens than HTML)
- Content truncation to stay within token limits
- Automatic skipping of non-content pages
- Low temperature for consistent extraction results
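A minimal sketch of how a breadth-first deep crawl could be set up with Crawl4AI, roughly matching the crawling features above; the import path, strategy options, and result handling follow Crawl4AI's documented deep-crawling API and may differ from what scraperHelper.py actually does.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def crawl_site(url: str, max_depth: int = 3, max_pages: int = 50):
    # Breadth-first deep crawl that stays on the starting domain
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=max_depth,
            max_pages=max_pages,
            include_external=False,
        ),
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(url=url, config=config)
        # Keep the clean markdown of every successfully crawled page
        return [(r.url, r.markdown) for r in results if r.success]

if __name__ == "__main__":
    pages = asyncio.run(crawl_site("https://example-vet-clinic.com"))
    print(f"Crawled {len(pages)} pages")
```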
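And a sketch of how the extraction step might call Gemini with a low temperature, assuming the google-generativeai package; the prompt wording and the truncation cap are illustrative, not the project's actual values.

```python
import google.generativeai as genai

def extract_clinic_data(markdown: str, api_key: str) -> str:
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-2.0-flash")

    # Truncate long markdown to stay within token limits (illustrative cap)
    prompt = (
        "Extract the clinic name, contact info, business hours, services, "
        "staff, FAQs, and policies as JSON from this page:\n\n" + markdown[:30000]
    )

    # Low temperature keeps extraction results consistent across runs
    response = model.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(temperature=0.1),
    )
    return response.text
```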
The server uses:
- FastAPI: Modern web framework
- Uvicorn: ASGI server
- Crawl4AI: Advanced web crawling library
- Google Gemini AI: AI-powered data extraction
- Pydantic: Data validation and settings management with BaseSettings
- pydantic-settings: Type-safe environment variable loading
The API includes comprehensive error handling for:
- Invalid URLs
- Crawling failures (timeouts, connection errors)
- Missing or invalid API keys
- Content extraction failures
- Empty or insufficient website content
All errors return appropriate HTTP status codes and descriptive error messages.
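For illustration, a sketch of how crawl failures could be translated into HTTP errors in a FastAPI route; the specific status codes and messages here are assumptions, not necessarily the ones this API returns.

```python
from fastapi import HTTPException

def ensure_content(pages: list[str]) -> None:
    # Empty or insufficient website content
    if not any(p.strip() for p in pages):
        raise HTTPException(
            status_code=422,
            detail="No extractable content was found on the target website",
        )

def raise_for_crawl_failure(exc: Exception) -> None:
    # Timeouts and connection errors from the crawler
    if isinstance(exc, TimeoutError):
        raise HTTPException(status_code=504, detail="Crawling timed out")
    raise HTTPException(status_code=502, detail=f"Crawling failed: {exc}")
```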