HTML Article Extractor API

An Express server that extracts clean article content from HTML so it can be summarized with OpenAI. It is designed to work with posts from engineering blog RSS feeds.

Features

Smart Content Extraction: Removes scripts, styles, navigation, ads, and other non-content elements
Multiple Source Support: Handles HTML from various engineering blogs (GitHub, Slack, AWS, etc.)
Metadata Extraction: Captures title, author, publish date, and content statistics
Clean Output: Returns only the article content, optimized for AI processing

Installation

npm install

Usage

Start the Server

npm start

For development with auto-reload:

npm run dev

The server starts on port 3000 by default. To use a different port, set PORT in a .env file.

API Endpoints

POST /extract

Extract article content from HTML.

Request:

curl -X POST http://localhost:3000/extract \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<html><article><h1>Title</h1><p>Content...</p></article></html>"
  }'

Response:

{
  "success": true,
  "data": {
    "title": "Article Title",
    "content": "Extracted article content...",
    "metadata": {
      "author": "Author Name",
      "publishDate": "2024-01-01",
      "contentLength": 1234,
      "wordCount": 200
    }
  }
}
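
The `contentLength` and `wordCount` fields in the response could be derived from the extracted text roughly as follows. This is a minimal sketch, not the server's actual code: the helper name `computeContentStats` is hypothetical, and it assumes `contentLength` counts characters and `wordCount` counts whitespace-separated tokens.

```javascript
// Hypothetical helper: the README does not say how these statistics are computed.
// Assumption: contentLength = character count, wordCount = whitespace-separated tokens.
function computeContentStats(content) {
  const text = content.trim();
  const words = text === '' ? [] : text.split(/\s+/);
  return {
    contentLength: text.length,
    wordCount: words.length,
  };
}

console.log(computeContentStats('Extracted article content...'));
```

Trimming first keeps leading and trailing whitespace from inflating either number, and the empty-string check avoids counting one "word" for blank input.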

GET /health

Health check endpoint.

curl http://localhost:3000/health

GET /

API documentation endpoint.

curl http://localhost:3000/

Supported RSS Sources

The extractor is designed to work with HTML from these engineering blogs:

Microsoft Developer Blogs
Stack Overflow Blog
Slack Engineering
Twilio Blog
GitHub Engineering
Dropbox Tech Blog
Spotify Engineering
AWS Architecture Blog
Google Cloud Blog
Meta Engineering
Stripe Engineering
DigitalOcean Blog
Cloudflare Blog
MongoDB Blog

Configuration

Create a .env file in the root directory:

PORT=3000
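
One way a server could turn that setting into a usable port number is sketched below. It assumes the .env file has already been loaded into `process.env` (for example by a package such as dotenv). The helper name `resolvePort` and the fallback logic are illustrative, and server.js may handle this differently.

```javascript
// Hypothetical helper showing one way to resolve the port; not taken from server.js.
const DEFAULT_PORT = 3000;

function resolvePort(env = process.env) {
  const port = Number.parseInt(env.PORT, 10);
  // Fall back to the default when PORT is unset or is not a valid TCP port.
  return Number.isInteger(port) && port > 0 && port < 65536 ? port : DEFAULT_PORT;
}

console.log(resolvePort({ PORT: '8080' }));
console.log(resolvePort({}));
```

Passing the environment in as a parameter keeps the function easy to test without touching the real process environment.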

How It Works

Preprocessing: Removes all non-content elements (scripts, styles, navigation, ads, etc.)
Content Detection: Uses multiple selectors to find the main article container
Text Extraction: Extracts text from paragraphs, headings, lists, and blockquotes
Metadata Extraction: Captures title, author, and publish date from meta tags and HTML structure
Validation: Ensures extracted content meets minimum length requirements
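
The five steps above can be sketched without any dependencies, using regular expressions only. This is an illustration of the flow, not the project's implementation: a real extractor would more likely use an HTML parser, and the function name `extractArticle`, the tag lists, and the `MIN_CONTENT_LENGTH` threshold are all assumptions.

```javascript
// Illustrative sketch of the five steps. All names and thresholds are hypothetical.
const MIN_CONTENT_LENGTH = 20; // assumed threshold, not taken from the project

function extractArticle(html) {
  // 1. Preprocessing: drop elements that never hold article text.
  const cleaned = html.replace(
    /<(script|style|nav|header|footer|aside)\b[^>]*>[\s\S]*?<\/\1>/gi,
    ''
  );

  // 2. Content detection: prefer <article>, then <main>, then <body>, else the whole input.
  let container = cleaned;
  for (const tag of ['article', 'main', 'body']) {
    const match = cleaned.match(new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)</${tag}>`, 'i'));
    if (match) {
      container = match[1];
      break;
    }
  }

  // 3. Text extraction: collect headings, paragraphs, list items, and blockquotes.
  const blocks = [];
  const blockPattern = /<(h[1-6]|p|li|blockquote)\b[^>]*>([\s\S]*?)<\/\1>/gi;
  for (const [, , inner] of container.matchAll(blockPattern)) {
    // Strip inline tags and collapse whitespace inside each block.
    const text = inner.replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim();
    if (text) blocks.push(text);
  }
  const content = blocks.join('\n\n');

  // 4. Metadata extraction: title from <title>, falling back to the first <h1>.
  const titleMatch =
    cleaned.match(/<title\b[^>]*>([\s\S]*?)<\/title>/i) ||
    cleaned.match(/<h1\b[^>]*>([\s\S]*?)<\/h1>/i);
  const title = titleMatch ? titleMatch[1].replace(/<[^>]+>/g, '').trim() : null;

  // 5. Validation: reject results that are too short to be a real article.
  if (content.length < MIN_CONTENT_LENGTH) {
    return { success: false, error: 'Could not extract meaningful content' };
  }
  return { success: true, data: { title, content } };
}
```

Regular expressions cannot handle nested elements of the same type or malformed markup reliably, which is why this is only a sketch of the control flow.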

Example Integration with OpenAI

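const axios = require('axios');
const OpenAI = require('openai');

// Assumes the official openai package (v4 or later), which reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

async function summarizeArticle(html) {
  // Extract the article content via the local extractor API
  const response = await axios.post('http://localhost:3000/extract', { html });
  const { title, content } = response.data.data;

  // Send the extracted text to OpenAI for summarization
  const summary = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "user",
        content: `Summarize this article:\n\nTitle: ${title}\n\nContent: ${content}`
      }
    ]
  });

  // Return only the generated summary text
  return summary.choices[0].message.content;
}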

Error Handling

The API returns the following HTTP status codes:

200: Success
400: Bad request (missing or invalid HTML)
422: Extraction failed (could not extract meaningful content)
500: Internal server error
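
A client can branch on these status codes when calling POST /extract. The sketch below shows one possible mapping. The function name `classifyExtractStatus` and the action labels are hypothetical, chosen only to illustrate the idea.

```javascript
// Hypothetical client-side helper: decides what to do with a response from POST /extract.
function classifyExtractStatus(status) {
  switch (status) {
    case 200:
      return 'ok'; // use the extracted data
    case 400:
      return 'fix-request'; // the html field was missing or invalid
    case 422:
      return 'skip-article'; // no meaningful content could be extracted
    default:
      // Treat any 5xx as a server-side failure that may succeed later.
      return status >= 500 ? 'retry-later' : 'unexpected';
  }
}

console.log(classifyExtractStatus(422));
```

Separating 422 from 400 lets a feed-processing job skip an article that cannot be extracted without treating it as a bug in the caller.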

License

ISC
