AI-powered feed parser supporting RSS, Atom, JSON feeds, and intelligent HTML parsing using Google Gemini.
- Multi-format Feed Support: RSS, Atom, JSON feeds
- AI-Powered HTML Parsing: Uses ScrapeGraphAI with Google Gemini for intelligent content extraction
- Automatic Feed Discovery: Finds embedded RSS/Atom links in HTML pages
- FastAPI Framework: Modern, fast, with automatic interactive API documentation
- Intelligent Fallback: Gracefully degrades to basic parsing if AI fails
- Python 3.8 or higher
- Google Gemini API key (get one at https://makersuite.google.com/app/apikey)
```bash
pip install -r requirements.txt
playwright install
```

For fish shell:

```fish
set -x GEMINI_API_KEY 'your-api-key-here'
```

For bash/zsh:

```bash
export GEMINI_API_KEY='your-api-key-here'
```

```bash
# Run with uvicorn
python -m parserapi

# Or use uvicorn directly
uvicorn parserapi.api:parserapi --host 0.0.0.0 --port 5000
```

The API will be available at:
- API Endpoint: http://localhost:5000
- Interactive Docs (Swagger UI): http://localhost:5000/docs
Endpoint: GET /parse
Parameters:
- `url` (required): The URL to parse
- `gemini_key` (optional): Gemini API key (overrides the environment variable)
Examples:
```bash
# Parse a blog
curl "http://localhost:5000/parse?url=https://techcrunch.com"

# With API key in request
curl "http://localhost:5000/parse?url=https://example.com&gemini_key=your-key"
```

Response:
```json
{
  "feed": {
    "title": "Site Title",
    "link": "https://example.com",
    "description": "",
    "language": "en",
    "updated": "2025-11-06T10:30:00",
    "version": "html-scrapegraph"
  },
  "items": [
    {
      "title": "Latest Blog Post",
      "link": "https://example.com/post/1",
      "published": "2025-11-06",
      "summary": "Brief summary of the post...",
      "author": "Author Name",
      "categories": ["tech", "ai"],
      "content": "<p>Full HTML content...</p>"
    }
  ],
  "source": "AI HTML parser (ScrapeGraphAI + Gemini)"
}
```

Endpoint: GET /health
```bash
curl "http://localhost:5000/health"
```

Response:
```json
{
  "status": "healthy",
  "version": "2.0.0-fastapi-scrapegraph",
  "gemini_configured": true
}
```

Endpoint: GET /
```bash
curl "http://localhost:5000/"
```

Returns API information and available endpoints.
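Since the target URL is itself a query-string value, it should be percent-encoded when requests are built programmatically rather than typed by hand. A minimal sketch (the helper name and default base URL are illustrative, not part of the project):

```python
from urllib.parse import urlencode

def build_parse_url(target_url, gemini_key=None, base="http://localhost:5000"):
    """Build a /parse request URL with the target URL percent-encoded."""
    params = {"url": target_url}
    if gemini_key:
        params["gemini_key"] = gemini_key  # overrides GEMINI_API_KEY env var
    return f"{base}/parse?{urlencode(params)}"
```

This keeps URLs containing `?`, `&`, or `#` from being misread as extra query parameters by the server.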
```python
from parserapi.htmlparser import parse_html_to_feed
import requests

# Fetch a webpage
response = requests.get('https://techcrunch.com')

# Parse with AI
feed = parse_html_to_feed(
    html_content=response.text,
    base_url='https://techcrunch.com',
    gemini_api_key='your-api-key'  # Optional if set in env
)

# Access the data
print(f"Feed Title: {feed['title']}")
print(f"Found {len(feed['entries'])} articles")

for article in feed['entries']:
    print(f"- {article['title']}")
    print(f"  Link: {article['link']}")
```

- URL Fetch: Fetches the target URL
- Content Type Detection: Identifies if it's HTML, RSS, JSON, etc.
- Feed Discovery: For HTML, first looks for embedded RSS/Atom links
- AI Parsing: If no feed is found, uses ScrapeGraphAI with Gemini to:
  - Analyze the HTML structure
  - Identify article patterns
  - Extract structured data (titles, links, dates, content)
- Fallback: If AI fails, uses basic BeautifulSoup parsing
- Response: Returns structured JSON response
| Metric | Value |
|---|---|
| First Request | 4-6 seconds (AI model loading) |
| Subsequent Requests | 2-4 seconds |
| Success Rate | ~95% |
| Free Tier Limit | 15 requests/minute (Gemini) |
Free tier:

- 15 requests per minute
- ~500 requests per day
- Perfect for testing and small projects

Paid tier:

- $0.00025 per request (approximate)
- 10,000 requests = ~$2.50
- 100,000 requests = ~$25
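To stay under the free tier's 15 requests/minute, a simple client-side limiter can space out calls before they hit the API. A sketch under those assumptions (the class name is illustrative, not part of the project):

```python
import time
from collections import deque

class RateLimiter:
    """Block until a request slot is free within a sliding window."""

    def __init__(self, max_calls=15, period=60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # monotonic timestamps of recent calls

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())
```

Calling `limiter.wait()` before each parse request keeps bursts within the quota without tracking state on the server.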
The parser uses these default settings:
```python
graph_config = {
    "llm": {
        "api_key": "your-gemini-key",
        "model": "gemini-pro",
    },
    "verbose": False,
    "headless": True,
}
```

Customize by modifying parserapi/htmlparser.py.
- Ensure `GEMINI_API_KEY` is set in the environment
- Or pass the `gemini_key` parameter in the API request
```bash
pip install -r requirements.txt
playwright install
```

- First request is slower (AI model loading)
- Subsequent requests are faster
- Consider implementing caching
- Check if Gemini API key is valid
- Verify you have API quota remaining
- The parser will automatically fall back to basic extraction
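The automatic fallback amounts to a try/except around the AI call. A minimal sketch, with the parser callables passed in so the pattern stands alone (the function name is illustrative):

```python
def parse_with_fallback(html, base_url, ai_parse, basic_parse):
    """Try the AI parser first; fall back to basic extraction on any failure."""
    try:
        return ai_parse(html, base_url)       # e.g. ScrapeGraphAI + Gemini
    except Exception:
        return basic_parse(html, base_url)    # e.g. plain BeautifulSoup heuristics
```

Quota errors, invalid keys, and model timeouts all land in the same fallback path, so callers always get a structured result.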
```
parserapi/
├── __main__.py       # Entry point with uvicorn
├── api.py            # FastAPI app with endpoints
├── htmlparser.py     # ScrapeGraphAI HTML parser
├── requirements.txt  # Python dependencies
└── cfg/              # Legacy config directory (can be removed)
```
```bash
# With Gunicorn (for production)
pip install gunicorn
gunicorn parserapi.api:parserapi -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:5000

# Or with uvicorn directly
uvicorn parserapi.api:parserapi --host 0.0.0.0 --port 5000 --workers 4
```

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt && playwright install

COPY parserapi/ ./parserapi/

ENV GEMINI_API_KEY=your-key-here

CMD ["uvicorn", "parserapi.api:parserapi", "--host", "0.0.0.0", "--port", "5000"]
```

FastAPI provides automatic interactive documentation:
- Swagger UI: http://localhost:5000/docs
- Try out the API directly in your browser
- See all parameters and response schemas
You can extend the parser:
```python
from parserapi.htmlparser import ScrapeGraphHTMLParser

class CustomParser(ScrapeGraphHTMLParser):
    def _structure_feed_data(self, raw_result, base_url):
        # Custom processing logic
        return super()._structure_feed_data(raw_result, base_url)
```

Implement caching to reduce API calls:
```python
from functools import lru_cache

import requests

from parserapi.htmlparser import parse_html_to_feed

def fetch_html(url):
    return requests.get(url, timeout=30).text

@lru_cache(maxsize=100)
def cached_parse(url):
    return parse_html_to_feed(fetch_html(url), url)
```

- Fork the repository
- Create your feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
Same as the parent project.
For issues or questions:
- Check the interactive API docs at `/docs`
- Review error messages in the logs
- Verify API key and quota
- Test with simpler websites first
- ScrapeGraphAI: https://scrapegraphai.com/
- Gemini API: https://ai.google.dev/
- FastAPI: https://fastapi.tiangolo.com/
- Get API Key: https://makersuite.google.com/app/apikey
Version: 2.0.0
Framework: FastAPI
AI Engine: ScrapeGraphAI + Google Gemini