JX-76/web-extract

Web Extract Skill for OpenClaw

License: MIT OpenClaw

A web content extraction skill for OpenClaw. It fetches a web page and converts it to readable Markdown or plain text, trying several extraction services in order until one succeeds.

🌟 Features

  • Multi-Service Fallback: Tries up to 4 extraction services in order of preference
  • No API Keys Required: Uses free, publicly available services
  • Markdown Output: Clean, readable markdown format
  • Easy Integration: Works seamlessly with OpenClaw

🚀 Services (in order of preference)

| Service      | URL Pattern                | Best For                      |
|--------------|----------------------------|-------------------------------|
| markdown.new | https://markdown.new/{url} | General use, Cloudflare sites |
| defuddle.md  | https://defuddle.md/{url}  | Alternative parsing           |
| r.jina.ai    | https://r.jina.ai/{url}    | Article extraction            |
| Scrapling    | Python library             | Complex pages, JavaScript     |
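
As a rough illustration of the URL patterns above, the three hosted services can be addressed by placing the target URL directly after the service prefix. The sketch below assumes the `{url}` placeholder is filled with the raw, unencoded target URL; the names `SERVICE_PREFIXES` and `build_service_url` are hypothetical and are not part of this repository.

```python
# Prefixes copied from the services table. Scrapling is a Python library,
# not a URL-based service, so it is not listed here.
SERVICE_PREFIXES = {
    "markdown.new": "https://markdown.new/",
    "defuddle.md": "https://defuddle.md/",
    "r.jina.ai": "https://r.jina.ai/",
}


def build_service_url(service: str, target_url: str) -> str:
    """Fill the {url} placeholder by appending the target URL to the prefix.

    Assumption: the service accepts the raw target URL without extra encoding.
    """
    return SERVICE_PREFIXES[service] + target_url


if __name__ == "__main__":
    for name in SERVICE_PREFIXES:
        print(build_service_url(name, "https://example.com/article"))
```

Whether a given service expects the target URL to be percent-encoded is not stated here, so check references/services.md before relying on this for URLs with query strings.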

📦 Installation

Method 1: Install from ClawHub (recommended)

clawhub install web-extract

Method 2: Manual Installation

  1. Clone this repository:
git clone https://github.com/JX-76/web-extract.git
  2. Copy to your OpenClaw skills directory:
cp -r web-extract ~/.openclaw/workspace/skills/
  3. Package the skill:
cd ~/.openclaw/workspace/skills/web-extract
clawhub package .

🎯 Usage

Once installed, OpenClaw will automatically use this skill when you provide a URL.

Example Workflow

  1. Try markdown.new first (fastest, best formatting)
  2. Fallback to defuddle.md (alternative parser)
  3. Try r.jina.ai (good for articles)
  4. Use Scrapling (when services fail)
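
The fallback order above could be sketched as a loop that tries each extractor in turn and returns the first non-empty result. This is a minimal illustration, not the skill's actual implementation: `extract_with_fallback` and `http_fetcher` are hypothetical names, the timeout and User-Agent values are arbitrary, and a Scrapling-based extractor would be passed in as one more callable at the end of the list.

```python
import urllib.request
from typing import Callable, List, Tuple

Fetcher = Callable[[str], str]


def http_fetcher(prefix: str, timeout: float = 15.0) -> Fetcher:
    """Build a fetcher for a prefix-style service such as https://r.jina.ai/."""
    def fetch(url: str) -> str:
        request = urllib.request.Request(
            prefix + url, headers={"User-Agent": "web-extract-sketch"}
        )
        # urlopen raises on network errors and on HTTP 4xx/5xx responses.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    return fetch


def extract_with_fallback(
    url: str, fetchers: List[Tuple[str, Fetcher]]
) -> Tuple[str, str]:
    """Return (service_name, content) from the first fetcher that succeeds."""
    errors = []
    for name, fetch in fetchers:
        try:
            text = fetch(url)
        except Exception as exc:  # any failure moves on to the next service
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty response")
    raise RuntimeError("all extraction services failed: " + "; ".join(errors))


if __name__ == "__main__":
    chain = [
        ("markdown.new", http_fetcher("https://markdown.new/")),
        ("defuddle.md", http_fetcher("https://defuddle.md/")),
        ("r.jina.ai", http_fetcher("https://r.jina.ai/")),
    ]
    service, content = extract_with_fallback("https://example.com", chain)
    print(service, len(content))
```

Passing the extractors in as plain callables keeps the order of preference in one list and lets the chain be exercised with stub functions, without network access.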

Manual Usage

# Using the included script
python3 scripts/extract.py "https://example.com/article"

# With specific format
python3 scripts/extract.py "https://example.com/article" --format markdown

# Save to file
python3 scripts/extract.py "https://example.com/article" -o output.md

📁 Project Structure

web-extract/
├── SKILL.md                 # Main skill documentation
├── README.md                # This file
├── LICENSE                  # MIT License
├── scripts/
│   └── extract.py          # Scrapling extraction script
└── references/
    └── services.md         # Service documentation

🔧 Requirements

  • OpenClaw >= 1.0.0
  • Python 3.8+ (for Scrapling fallback)
  • Scrapling library (optional, for fallback):
    pip install scrapling
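
Because Scrapling is optional, a script can check for it before attempting the last fallback step. The snippet below is one possible check using only the standard library; the function names are hypothetical and the included scripts/extract.py may handle this differently.

```python
import importlib.util
import sys


def python_version_ok() -> bool:
    """True when the interpreter meets the Python 3.8+ requirement."""
    return sys.version_info >= (3, 8)


def scrapling_available() -> bool:
    """True when the optional scrapling package can be imported."""
    return importlib.util.find_spec("scrapling") is not None


if __name__ == "__main__":
    if not python_version_ok():
        print("Python 3.8 or newer is required for the Scrapling fallback")
    elif not scrapling_available():
        print("Scrapling not installed; run: pip install scrapling")
    else:
        print("Scrapling fallback is available")
```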

📝 Example

# Extract a blog post
python3 scripts/extract.py "https://example.com/blog/post"

# Output:
# Title: Example Blog Post
# URL: https://example.com/blog/post
# Content: ...

🛠️ Development

Testing

# Test with a simple URL
python3 scripts/extract.py "https://example.com"

# Test markdown output
python3 scripts/extract.py "https://example.com" --format markdown

Adding New Services

To add a new extraction service:

  1. Update SKILL.md with the new service
  2. Add service details to references/services.md
  3. Update the fallback chain in the documentation

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

📞 Support

If you encounter any issues or have questions:

  1. Check references/services.md for troubleshooting
  2. Open an issue on GitHub
  3. Contact the OpenClaw community

Made with ❤️ for OpenClaw
