Express server that extracts clean article content from HTML for OpenAI summarization. Designed to work with engineering blog RSS feeds.
- Smart Content Extraction: Removes scripts, styles, navigation, ads, and other non-content elements
- Multiple Source Support: Handles HTML from various engineering blogs (GitHub, Slack, AWS, etc.)
- Metadata Extraction: Captures title, author, publish date, and content statistics
- Clean Output: Returns only the article content, optimized for AI processing
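The content statistics mentioned above could be computed along these lines. This is a minimal sketch: `contentStats` is a hypothetical helper, and the counting rules (trimmed character length, whitespace-separated words) are assumptions rather than the server's documented behavior.

```javascript
// Hypothetical helper: the field names mirror the "metadata" block of the
// /extract response, but the counting rules are assumptions.
function contentStats(content) {
  const text = content.trim();
  // Split on runs of whitespace so repeated spaces and newlines count once.
  const words = text === '' ? [] : text.split(/\s+/);
  return { contentLength: text.length, wordCount: words.length };
}

console.log(contentStats('Extracted article content for summarization.'));
```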
npm install
npm start
For development with auto-reload:
npm run dev
The server will start on port 3000 (configurable via .env file).
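One way the port could be resolved is sketched below. It assumes the values from .env have already been loaded into `process.env` (the loader is not named in this README), and `resolvePort` is a hypothetical helper, not a function from the server's source.

```javascript
// Sketch only: resolvePort is a hypothetical helper.
function resolvePort(env = process.env, fallback = 3000) {
  const port = Number.parseInt(env.PORT ?? '', 10);
  // Accept only valid TCP port numbers; otherwise use the documented default.
  return Number.isInteger(port) && port > 0 && port < 65536 ? port : fallback;
}

console.log(resolvePort({ PORT: '8080' })); // 8080
console.log(resolvePort({}));               // 3000
```

Falling back to 3000 on a missing or malformed value keeps the server startable even when the .env file is absent.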
Extract article content from HTML.
Request:
curl -X POST http://localhost:3000/extract \
-H "Content-Type: application/json" \
-d '{
"html": "<html><article><h1>Title</h1><p>Content...</p></article></html>"
}'
Response:
{
  "success": true,
  "data": {
    "title": "Article Title",
    "content": "Extracted article content...",
    "metadata": {
      "author": "Author Name",
      "publishDate": "2024-01-01",
      "contentLength": 1234,
      "wordCount": 200
    }
  }
}
Health check endpoint.
curl http://localhost:3000/health
API documentation endpoint.
curl http://localhost:3000/
The extractor is designed to work with HTML from these engineering blogs:
- Microsoft Developer Blogs
- StackOverflow Blogs
- Slack Engineering
- Twilio Blog
- GitHub Engineering
- Dropbox Tech Blog
- Spotify Engineering
- AWS Architecture Blog
- Google Cloud Blog
- Meta Engineering
- Stripe Engineering
- DigitalOcean Blog
- Cloudflare Blog
- MongoDB Blog
Create a .env file in the root directory:
PORT=3000
- Preprocessing: Removes all non-content elements (scripts, styles, navigation, ads, etc.)
- Content Detection: Uses multiple selectors to find the main article container
- Text Extraction: Extracts text from paragraphs, headings, lists, and blockquotes
- Metadata Extraction: Captures title, author, and publish date from meta tags and HTML structure
- Validation: Ensures extracted content meets minimum length requirements
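The five steps above can be sketched in plain Node.js as follows. This is a deliberately simplified, regex-based illustration, not the server's implementation, which more likely relies on an HTML parser. The function names, the container tags tried, the 50-character minimum, and the shape of the failure object are all assumptions.

```javascript
// Simplified sketch of the five steps. All names and thresholds are hypothetical.
const MIN_CONTENT_LENGTH = 50; // assumed minimum; the real value is not documented

// Remove any remaining tags and collapse whitespace.
function stripTags(fragment) {
  return fragment.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
}

function extractArticle(html) {
  // 1. Preprocessing: drop elements that never hold article text.
  const cleaned = html.replace(
    /<(script|style|nav|header|footer|aside)\b[^>]*>[\s\S]*?<\/\1>/gi,
    ''
  );

  // 2. Content detection: try candidate containers in order of preference,
  //    falling back to the whole cleaned document.
  let container = cleaned;
  for (const tag of ['article', 'main']) {
    const match = cleaned.match(
      new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)</${tag}>`, 'i')
    );
    if (match) {
      container = match[1];
      break;
    }
  }

  // 3. Text extraction: keep paragraphs, headings, list items, blockquotes.
  const blocks = [];
  const blockPattern = /<(p|h[1-6]|li|blockquote)\b[^>]*>([\s\S]*?)<\/\1>/gi;
  for (const m of container.matchAll(blockPattern)) {
    const text = stripTags(m[2]);
    if (text) blocks.push(text);
  }
  const content = blocks.join('\n\n');

  // 4. Metadata extraction: title from <title>, else the first <h1>.
  const titleMatch =
    cleaned.match(/<title\b[^>]*>([\s\S]*?)<\/title>/i) ||
    cleaned.match(/<h1\b[^>]*>([\s\S]*?)<\/h1>/i);
  const title = titleMatch ? stripTags(titleMatch[1]) : '';

  // 5. Validation: reject results that are too short to be an article.
  if (content.length < MIN_CONTENT_LENGTH) {
    return { success: false, error: 'Could not extract meaningful content' };
  }
  return { success: true, data: { title, content } };
}
```

Regular expressions cannot handle arbitrarily nested or malformed markup, so this sketch is only suitable for illustrating the order of the steps.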
const axios = require('axios');
const OpenAI = require('openai');

// The client reads the API key from the OPENAI_API_KEY environment variable.
const openai = new OpenAI();

async function summarizeArticle(html) {
  // Extract clean article content from the raw HTML
  const response = await axios.post('http://localhost:3000/extract', { html });
  const { title, content } = response.data.data;

  // Send the extracted text to OpenAI for summarization
  const summary = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "user",
        content: `Summarize this article:\n\nTitle: ${title}\n\nContent: ${content}`
      }
    ]
  });

  return summary.choices[0].message.content;
}
The API returns appropriate HTTP status codes:
- 200: Success
- 400: Bad request (missing or invalid HTML)
- 422: Extraction failed (could not extract meaningful content)
- 500: Internal server error
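A caller might branch on these status codes as in the sketch below. `extractOrThrow` and its `fetchImpl` parameter are hypothetical names, the default relies on the global `fetch` available in Node.js 18 and later, and error response bodies are not parsed because their shape is not documented here.

```javascript
// Sketch of a client that branches on the documented status codes.
// extractOrThrow and fetchImpl are hypothetical, introduced for illustration.
async function extractOrThrow(
  html,
  baseUrl = 'http://localhost:3000',
  fetchImpl = fetch
) {
  const res = await fetchImpl(`${baseUrl}/extract`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ html }),
  });

  switch (res.status) {
    case 200:
      // Success: return the "data" object (title, content, metadata).
      return (await res.json()).data;
    case 400:
      throw new Error('Bad request: missing or invalid HTML');
    case 422:
      throw new Error('Extraction failed: no meaningful content found');
    default:
      // Covers 500 and any status the list above does not mention.
      throw new Error(`Unexpected response from extractor: HTTP ${res.status}`);
  }
}
```

Treating 422 separately from 400 lets a feed processor skip pages with no usable article text while still surfacing malformed requests as bugs.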
ISC