LordNayan/html-extractor

HTML Article Extractor API

Express server that extracts clean article content from HTML for OpenAI summarization. Designed to work with engineering blog RSS feeds.

Features

  • Smart Content Extraction: Removes scripts, styles, navigation, ads, and other non-content elements
  • Multiple Source Support: Handles HTML from various engineering blogs (GitHub, Slack, AWS, etc.)
  • Metadata Extraction: Captures title, author, publish date, and content statistics
  • Clean Output: Returns only the article content, optimized for AI processing

Installation

npm install

Usage

Start the Server

npm start

For development with auto-reload:

npm run dev

The server listens on port 3000 by default. To use a different port, set PORT in a .env file.

API Endpoints

POST /extract

Extract article content from HTML.

Request:

curl -X POST http://localhost:3000/extract \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<html><article><h1>Title</h1><p>Content...</p></article></html>"
  }'

Response:

{
  "success": true,
  "data": {
    "title": "Article Title",
    "content": "Extracted article content...",
    "metadata": {
      "author": "Author Name",
      "publishDate": "2024-01-01",
      "contentLength": 1234,
      "wordCount": 200
    }
  }
}
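The same request can be made from Node.js. The following is a minimal sketch, assuming Node 18 or later (where `fetch` is available globally) and the default base URL from the curl example. The helper names `buildExtractRequest` and `extractArticle` are hypothetical and not part of this project.

```javascript
// Hypothetical helper: builds the fetch arguments for POST /extract.
function buildExtractRequest(html, baseUrl = 'http://localhost:3000') {
  return {
    url: `${baseUrl}/extract`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ html }),
    },
  };
}

// Hypothetical helper: sends the request and returns the `data` object
// (title, content, metadata) from a successful response.
async function extractArticle(html, baseUrl) {
  const { url, options } = buildExtractRequest(html, baseUrl);
  const res = await fetch(url, options);
  if (!res.ok) {
    throw new Error(`Extraction request failed with HTTP ${res.status}`);
  }
  const body = await res.json();
  if (!body.success) {
    throw new Error('Extractor reported success: false');
  }
  return body.data;
}
```

Keeping the request construction separate from the network call makes it easy to check the payload shape without a running server.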

GET /health

Health check endpoint.

curl http://localhost:3000/health

GET /

API documentation endpoint.

curl http://localhost:3000/

Supported RSS Sources

The extractor is designed to work with HTML from these engineering blogs:

  • Microsoft Developer Blogs
  • Stack Overflow Blog
  • Slack Engineering
  • Twilio Blog
  • GitHub Engineering
  • Dropbox Tech Blog
  • Spotify Engineering
  • AWS Architecture Blog
  • Google Cloud Blog
  • Meta Engineering
  • Stripe Engineering
  • DigitalOcean Blog
  • Cloudflare Blog
  • MongoDB Blog

Configuration

Create a .env file in the root directory:

PORT=3000
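How the server reads this value is not shown here. One plausible approach, sketched below, is to read `PORT` from the environment and fall back to 3000 when it is missing or invalid. The function name `resolvePort` is hypothetical, and the sketch assumes the .env file has already been loaded into `process.env` (for example by a loader such as dotenv).

```javascript
// Hypothetical helper: resolves the listening port from an environment object.
function resolvePort(env = process.env, fallback = 3000) {
  const port = Number(env.PORT);
  // Fall back when PORT is unset, not a number, or outside the valid TCP range.
  return Number.isInteger(port) && port > 0 && port < 65536 ? port : fallback;
}
```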

How It Works

  1. Preprocessing: Removes all non-content elements (scripts, styles, navigation, ads, etc.)
  2. Content Detection: Uses multiple selectors to find the main article container
  3. Text Extraction: Extracts text from paragraphs, headings, lists, and blockquotes
  4. Metadata Extraction: Captures title, author, and publish date from meta tags and HTML structure
  5. Validation: Ensures extracted content meets minimum length requirements
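The five steps above can be illustrated with a dependency-free sketch. This is not the project's implementation: it uses regular expressions instead of a real HTML parser, and the tag list, the selector order, the 50-character minimum, the meta tag names, and the `error` field are all assumptions chosen for illustration.

```javascript
// Assumed list of tags whose entire contents are treated as non-content.
const NOISE_TAGS = ['script', 'style', 'nav', 'footer', 'aside'];
// Assumed threshold; the real minimum length is not documented.
const MIN_CONTENT_LENGTH = 50;

// Removes any remaining tags and collapses whitespace.
function stripTags(fragment) {
  return fragment.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Reads a <meta> value; assumes name/property appears before content.
function metaContent(html, name) {
  const re = new RegExp(
    `<meta\\b[^>]*(?:name|property)=["']${name}["'][^>]*content=["']([^"']*)["']`,
    'i'
  );
  const m = html.match(re);
  return m ? m[1] : null;
}

function extractArticleSketch(html) {
  // 1. Preprocessing: drop noise elements together with their contents.
  let cleaned = html;
  for (const tag of NOISE_TAGS) {
    cleaned = cleaned.replace(new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, 'gi'), '');
  }

  // 2. Content detection: prefer <article>, then <main>, then the whole document.
  const containerMatch =
    cleaned.match(/<article\b[^>]*>([\s\S]*?)<\/article>/i) ||
    cleaned.match(/<main\b[^>]*>([\s\S]*?)<\/main>/i);
  const container = containerMatch ? containerMatch[1] : cleaned;

  // 3. Text extraction: paragraphs, headings, list items, and blockquotes.
  const blocks = [];
  const blockRe = /<(p|h[1-6]|li|blockquote)\b[^>]*>([\s\S]*?)<\/\1>/gi;
  for (const m of container.matchAll(blockRe)) {
    const text = stripTags(m[2]);
    if (text) blocks.push(text);
  }
  const content = blocks.join('\n\n');

  // 4. Metadata: title from the first <h1> or <title>, the rest from meta tags.
  const titleMatch =
    cleaned.match(/<h1\b[^>]*>([\s\S]*?)<\/h1>/i) ||
    cleaned.match(/<title\b[^>]*>([\s\S]*?)<\/title>/i);
  const title = titleMatch ? stripTags(titleMatch[1]) : '';

  // 5. Validation: reject results that are too short to be an article.
  if (content.length < MIN_CONTENT_LENGTH) {
    return { success: false, error: 'Could not extract meaningful content' };
  }

  return {
    success: true,
    data: {
      title,
      content,
      metadata: {
        author: metaContent(html, 'author'),
        publishDate: metaContent(html, 'article:published_time'),
        contentLength: content.length,
        wordCount: content.split(/\s+/).filter(Boolean).length,
      },
    },
  };
}
```

Regular expressions are fragile on malformed or deeply nested markup, so a production extractor would normally rely on a proper HTML parser. The sketch is only meant to show the order of the steps and the shape of the result.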

Example Integration with OpenAI

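const axios = require('axios');
const OpenAI = require('openai');

// The client reads the API key from the OPENAI_API_KEY environment variable.
const openai = new OpenAI();

async function summarizeArticle(html) {
  // Extract the article title and body from the raw HTML
  const response = await axios.post('http://localhost:3000/extract', { html });
  const { title, content } = response.data.data;

  // Send the extracted text to OpenAI for summarization
  const summary = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "user",
        content: `Summarize this article:\n\nTitle: ${title}\n\nContent: ${content}`
      }
    ]
  });

  // Return the text of the first completion
  return summary.choices[0].message.content;
}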

Error Handling

The API returns appropriate HTTP status codes:

  • 200: Success
  • 400: Bad request (missing or invalid HTML)
  • 422: Extraction failed (could not extract meaningful content)
  • 500: Internal server error
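A client can branch on these codes when deciding what to do next. The sketch below is one possible mapping. The function name `classifyExtractStatus` and the retry policy (retry only on 500) are assumptions, not behavior defined by this API.

```javascript
// Hypothetical helper: maps the documented status codes to a client-side decision.
function classifyExtractStatus(status) {
  switch (status) {
    case 200:
      return { ok: true, retry: false, reason: 'success' };
    case 400:
      // The request itself is wrong; retrying the same payload will not help.
      return { ok: false, retry: false, reason: 'bad request: missing or invalid HTML' };
    case 422:
      // The HTML was accepted but no meaningful article content was found.
      return { ok: false, retry: false, reason: 'extraction failed' };
    case 500:
      // Assumed policy: server-side errors may be transient, so allow a retry.
      return { ok: false, retry: true, reason: 'internal server error' };
    default:
      return { ok: false, retry: false, reason: `unexpected status ${status}` };
  }
}
```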

License

ISC
