A Model Context Protocol (MCP) server that extracts document content from Microsoft Learn and GitHub URLs and stores it in PocketBase for easy retrieval and search.
✅ Latest MCP SDK Features (v1.12.0+)
- Modern McpServer architecture with enhanced capabilities
- Multiple transport protocols: STDIO, Streamable HTTP, SSE
- Dynamic tool management with lazy loading
- Session management for stateful connections
- Server-Sent Events support with backwards compatibility
- Real-time server statistics and metrics
✅ Content Extraction
- Microsoft Learn articles with rich metadata
- GitHub files (README, documentation, code files)
- Intelligent content parsing and cleaning
- Duplicate detection and updates
✅ PocketBase Integration
- Persistent document storage
- Full-text search capabilities
- Metadata preservation
- CRUD operations
✅ Advanced Server Features
- Multiple transport modes (STDIO/HTTP)
- Health check and info endpoints
- Read-only mode support
- Enhanced error handling and debugging
- Resource endpoints for server metrics
✅ Rich Metadata
- Word counts and content statistics
- Source attribution and URLs
- Extraction timestamps
- Content headers and descriptions
- Node.js 18+ with ES modules support
- PocketBase server running
- Network access for content extraction
# Navigate to the project directory
cd c:\powershell_scripts\pocketbase_document_mcp\document-extractor-mcp
# Install dependencies
npm install
The MCP server supports both local and remote PocketBase instances. Choose the setup that best fits your needs:
- Download and install PocketBase:
  # Download from https://pocketbase.io/docs/
  # Extract the executable to your preferred directory
- Start local PocketBase server:
  # Run from the directory containing pocketbase.exe
  .\pocketbase.exe serve
  # Or specify custom port and data directory
  .\pocketbase.exe serve --http="127.0.0.1:8090" --dir="./pb_data"
- Set up admin account:
  - Access PocketBase Admin UI at http://127.0.0.1:8090/_/
  - Create your admin account
  - Note the email/password for configuration
- Deploy PocketBase to your preferred hosting:
  - Railway, Fly.io, DigitalOcean, AWS, etc.
  - Follow your hosting provider's deployment guide
  - Ensure HTTPS is enabled for production
- Configure your remote instance:
  - Set up admin account through the web interface
  - Configure CORS settings if needed
  - Note the full URL (e.g., https://your-pb-instance.com)
- Using Docker Compose:
  version: '3.8'
  services:
    pocketbase:
      image: ghcr.io/muchobien/pocketbase:latest
      ports:
        - "8090:8090"
      volumes:
        - ./pb_data:/pb/pb_data
- Collection Management (Automatic for all setups):
  - The server will automatically create the required documents collection on startup
  - If AUTO_CREATE_COLLECTION=true (default), no manual setup needed
  - Use the ensure_collection tool to manually verify/create collections
  - Use the collection_info tool to check collection status
- Manual Collection Setup (if needed):
  - Access PocketBase Admin UI
  - Create a new collection named documents
  - Add these fields:
    title (Text, required)
    content (Text, required)
    metadata (JSON, required)
    created (Date, auto-generated)
    updated (Date, optional)
Create a .env file in the project root. The server supports both local and remote PocketBase instances:
# PocketBase Configuration - Local
POCKETBASE_URL=http://127.0.0.1:8090
POCKETBASE_ADMIN_EMAIL=admin@example.com
POCKETBASE_ADMIN_PASSWORD=your-secure-password
# Collection Settings
DOCUMENTS_COLLECTION=documents
# Transport Configuration
TRANSPORT_MODE=stdio
HTTP_PORT=3000
# Development Settings
DEBUG=true
NODE_ENV=development
READ_ONLY_MODE=false
# Collection Management ✨ New!
AUTO_CREATE_COLLECTION=true
# PocketBase Configuration - Remote
POCKETBASE_URL=https://your-pocketbase-instance.com
POCKETBASE_ADMIN_EMAIL=admin@yourdomain.com
POCKETBASE_ADMIN_PASSWORD=your-secure-password
# Collection Settings
DOCUMENTS_COLLECTION=documents
# Transport Configuration
TRANSPORT_MODE=stdio
HTTP_PORT=3000
# Production Settings
DEBUG=false
NODE_ENV=production
READ_ONLY_MODE=false
# Collection Management
AUTO_CREATE_COLLECTION=true
# PocketBase Configuration - Docker
POCKETBASE_URL=http://pocketbase:8090
POCKETBASE_ADMIN_EMAIL=admin@localhost
POCKETBASE_ADMIN_PASSWORD=admin123
# Collection Settings
DOCUMENTS_COLLECTION=documents
# Transport Configuration
TRANSPORT_MODE=stdio
HTTP_PORT=3000
# Container Settings
DEBUG=false
NODE_ENV=production
READ_ONLY_MODE=false
# Collection Management
AUTO_CREATE_COLLECTION=true
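As a rough illustration of how these variables could be consumed, the sketch below reads them into a single configuration object. The function names (loadConfig, missingCredentials) and the property names are assumptions made for this example and are not taken from the server's source; the defaults mirror the sample .env files above.

```javascript
// Sketch only: loadConfig and its property names are hypothetical.
// Defaults follow the sample .env files in this README.
function loadConfig(env = process.env) {
  // Treat only the literal string "true" (any case) as enabled.
  const asBool = (value, fallback) =>
    value === undefined ? fallback : String(value).toLowerCase() === 'true';

  return {
    pocketbaseUrl: env.POCKETBASE_URL || 'http://127.0.0.1:8090',
    adminEmail: env.POCKETBASE_ADMIN_EMAIL,
    adminPassword: env.POCKETBASE_ADMIN_PASSWORD,
    collection: env.DOCUMENTS_COLLECTION || 'documents',
    transportMode: env.TRANSPORT_MODE || 'stdio',
    httpPort: Number(env.HTTP_PORT || 3000),
    debug: asBool(env.DEBUG, false),
    readOnly: asBool(env.READ_ONLY_MODE, false),
    autoCreateCollection: asBool(env.AUTO_CREATE_COLLECTION, true),
  };
}

// The admin credentials have no default, so report which ones are absent.
function missingCredentials(config) {
  return ['adminEmail', 'adminPassword'].filter((key) => !config[key]);
}
```

Passing a plain object instead of process.env makes it easy to try the three setups (local, remote, Docker) side by side without touching the real environment.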
The server supports multiple transport modes:
# STDIO mode (default) - for Claude Desktop and CLI clients
npm start
# or explicitly
npm run start:stdio
# HTTP mode - for web clients and testing
npm run start:http
# Development modes with debug logging
npm run dev # STDIO mode with debugging
npm run dev:http # HTTP mode with debugging
npm run dev:stdio # STDIO mode with debugging
# Test the setup
npm run test
Perfect for Claude Desktop and command-line MCP clients:
npm start
Enables web-based clients and testing with multiple protocols:
npm run start:http
Available endpoints in HTTP mode:
- POST /mcp - Streamable HTTP transport (modern protocol 2025-03-26)
- GET /sse - Server-Sent Events transport (legacy protocol 2024-11-05)
- POST /messages - SSE message endpoint
- GET /health - Health check endpoint
- GET /info - Server information endpoint
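To make the Streamable HTTP endpoint more concrete, here is a minimal sketch of the first request a client might send to POST /mcp. It assumes the server listens on localhost with HTTP_PORT=3000; the helper name buildInitializeRequest and the clientInfo values are placeholders invented for this example.

```javascript
// Sketch only: builds the arguments for fetch() without sending anything.
// baseUrl assumes HTTP_PORT=3000 on the local machine.
function buildInitializeRequest(baseUrl = 'http://127.0.0.1:3000') {
  // JSON-RPC 2.0 "initialize" message for protocol revision 2025-03-26.
  const body = {
    jsonrpc: '2.0',
    id: 1,
    method: 'initialize',
    params: {
      protocolVersion: '2025-03-26',
      capabilities: {},
      clientInfo: { name: 'example-client', version: '0.0.1' }, // placeholder values
    },
  };

  return {
    url: new URL('/mcp', baseUrl).toString(),
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Streamable HTTP clients accept both plain JSON and an event stream.
        Accept: 'application/json, text/event-stream',
      },
      body: JSON.stringify(body),
    },
  };
}

// Usage, with the server running in HTTP mode:
//   const { url, options } = buildInitializeRequest();
//   const response = await fetch(url, options);
```

The same base URL can be reused for a quick GET /health check before sending MCP traffic.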
Extract and store content from URLs.
Parameters:
- url (string, required): Microsoft Learn or GitHub URL
Example:
{
  "url": "https://learn.microsoft.com/en-us/azure/cognitive-services/openai/"
}
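Because the tool accepts only Microsoft Learn and GitHub URLs, a caller may want to check a URL before submitting it. The sketch below shows one way to do that; classifySource and its return labels are hypothetical, and the host list is inferred from the example URLs in this README rather than from the server's code.

```javascript
// Sketch only: returns 'microsoft-learn', 'github', or null for unsupported input.
function classifySource(rawUrl) {
  let parsed;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return null; // not a valid absolute URL
  }
  if (parsed.protocol !== 'https:' && parsed.protocol !== 'http:') {
    return null;
  }

  const host = parsed.hostname.toLowerCase();
  if (host === 'learn.microsoft.com') return 'microsoft-learn';
  if (host === 'github.com' || host === 'raw.githubusercontent.com') return 'github';
  return null;
}
```

A null result means the URL would most likely be rejected, so it can be filtered out before calling the tool.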
List stored documents with pagination.
Parameters:
- limit (number, optional): Max results per page (1-100, default: 20)
- page (number, optional): Page number (default: 1)
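The documented ranges and defaults can be expressed as a small helper. This is a sketch under stated assumptions: normalizePagination is a hypothetical name, and it clamps out-of-range values, whereas the server itself validates input with Zod (per the changelog) and may reject them instead.

```javascript
// Sketch only: clamps limit to 1-100 (default 20) and page to >= 1 (default 1).
function normalizePagination({ limit, page } = {}) {
  const toInt = (value, fallback) => {
    const n = Number.parseInt(value, 10);
    return Number.isNaN(n) ? fallback : n;
  };

  return {
    limit: Math.min(100, Math.max(1, toInt(limit, 20))),
    page: Math.max(1, toInt(page, 1)),
  };
}
```

Clamping is convenient on the client side because it always yields a request the documented limits allow.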
Search documents by title or content.
Parameters:
- query (string, required): Search query
- limit (number, optional): Max results (1-100, default: 50)
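A search over title or content could plausibly map to a PocketBase filter expression using the ~ (contains) operator. The sketch below builds such a string; buildSearchFilter is a hypothetical helper, and the exact filter the server sends is not shown in this README, so treat this as an illustration only.

```javascript
// Sketch only: builds a PocketBase filter matching the query in title or content.
function buildSearchFilter(query) {
  // Escape backslashes and double quotes so the query stays inside the string literal.
  const escaped = String(query).replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return `title ~ "${escaped}" || content ~ "${escaped}"`;
}
```

In real code, the PocketBase JavaScript SDK's pb.filter() helper for binding parameters is probably a safer choice than manual escaping.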
Retrieve a specific document by ID.
Parameters:
- id (string, required): Document ID
Delete a document by ID.
Parameters:
- id (string, required): Document ID to delete
Check if the documents collection exists and create it if needed.
Parameters: None
Description: Automatically verifies the documents collection exists in PocketBase. If not found, creates the collection with the proper schema including all required fields and indexes.
Get detailed information about the documents collection including statistics.
Parameters: None
Description: Returns comprehensive collection information including schema details, record counts, indexes, and timestamps.
Real-time server statistics and metrics.
Content:
- Total document count
- Server information (name, version, uptime)
- Memory usage statistics
- Environment information
- Read-only mode status
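The listed fields could be assembled from Node.js process information roughly as follows. This is a sketch: buildServerStats, the property names, and the default name and version values are assumptions for illustration, and the document count is passed in rather than queried from PocketBase.

```javascript
// Sketch only: the shape of the returned object is hypothetical.
function buildServerStats({ documentCount, readOnly, name = 'document-extractor', version = '0.0.0' }) {
  const memory = process.memoryUsage();

  return {
    documents: { total: documentCount },
    server: { name, version, uptimeSeconds: Math.round(process.uptime()) },
    memory: { rssBytes: memory.rss, heapUsedBytes: memory.heapUsed },
    environment: {
      nodeVersion: process.version,
      nodeEnv: process.env.NODE_ENV || 'development',
    },
    readOnlyMode: Boolean(readOnly),
  };
}
```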
The server supports dynamic tool management with lazy loading:
// Tools can be dynamically enabled/disabled
if (process.env.READ_ONLY_MODE === 'true') {
  // Write operations are disabled in read-only mode
  deleteDocumentTool.disable();
  extractDocumentTool.disable();
}
// Any registered tool handle (shown here as `tool`) can be re-enabled at runtime
tool.enable();
In HTTP mode, the server supports session management:
- Streamable HTTP: Modern session management with automatic session ID generation
- SSE (Legacy): Backwards compatible session handling
- Session persistence: Sessions are maintained across requests
- Automatic cleanup: Sessions are cleaned up when connections close
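The session lifecycle described above (create on connect, look up on each request, remove on close) can be pictured with a small in-memory store. This is a simplified sketch, not the server's implementation: SessionStore is a hypothetical class, and it uses a counter for IDs where a real server would more likely generate random UUIDs.

```javascript
// Sketch only: in-memory session registry keyed by session ID.
class SessionStore {
  constructor() {
    this.sessions = new Map();
    this.counter = 0;
  }

  // Register a transport and return its new session ID.
  create(transport) {
    this.counter += 1;
    const id = `session-${this.counter}`;
    this.sessions.set(id, { transport, createdAt: Date.now() });
    return id;
  }

  // Look up the session for a follow-up request.
  get(id) {
    return this.sessions.get(id);
  }

  // Remove the session when its connection closes.
  close(id) {
    return this.sessions.delete(id);
  }

  get size() {
    return this.sessions.size;
  }
}
```

Calling close from the transport's close handler is what keeps the map from growing as clients disconnect.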
- Full article extraction
- Metadata preservation (description, keywords, author)
- Section headers extraction
- Content cleaning and formatting
Example URLs:
https://learn.microsoft.com/en-us/azure/cognitive-services/openai/
https://learn.microsoft.com/en-us/dotnet/core/introduction
- File content extraction (README, docs, code)
- Repository metadata
- Branch handling (main/master fallback)
- File type detection
Supported URL formats:
https://github.com/owner/repo (assumes README.md)
https://github.com/owner/repo/blob/main/file.md
https://raw.githubusercontent.com/owner/repo/main/file.md
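One way to handle the three URL formats, including the main/master fallback, is to turn each into a list of raw-content URLs to try in order. The sketch below assumes that approach; githubRawCandidates is a hypothetical helper and the server's actual resolution logic may differ.

```javascript
// Sketch only: maps a supported GitHub URL to raw-content URLs to try in order.
function githubRawCandidates(rawUrl) {
  const url = new URL(rawUrl);
  const rawBase = 'https://raw.githubusercontent.com';

  // Raw URLs are already in their final form.
  if (url.hostname === 'raw.githubusercontent.com') {
    return [url.toString()];
  }

  const parts = url.pathname.split('/').filter(Boolean);
  if (url.hostname !== 'github.com' || parts.length < 2) {
    throw new Error(`Unsupported GitHub URL: ${rawUrl}`);
  }

  const [owner, repo, kind, branch, ...filePath] = parts;

  // https://github.com/owner/repo/blob/<branch>/<path>
  if (kind === 'blob' && branch && filePath.length > 0) {
    return [`${rawBase}/${owner}/${repo}/${branch}/${filePath.join('/')}`];
  }

  // https://github.com/owner/repo -> README.md on main, then master.
  if (parts.length === 2) {
    return ['main', 'master'].map((b) => `${rawBase}/${owner}/${repo}/${b}/README.md`);
  }

  throw new Error(`Unsupported GitHub URL: ${rawUrl}`);
}
```

Branch names that contain a slash are ambiguous in blob URLs, so this sketch treats the first segment after blob as the branch.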
Variable | Description | Default |
---|---|---|
POCKETBASE_URL | PocketBase server URL | http://127.0.0.1:8090 |
POCKETBASE_ADMIN_EMAIL | Admin email for authentication | Required |
POCKETBASE_ADMIN_PASSWORD | Admin password | Required |
DOCUMENTS_COLLECTION | Collection name for documents | documents |
DEBUG | Enable debug logging | false |
NODE_ENV | Environment mode | development |
READ_ONLY_MODE | Disable write operations | false |
AUTO_CREATE_COLLECTION | Auto-create collections on startup | true |
Enable detailed logging:
$env:DEBUG="true"; node server.js
Debug logs include:
- Authentication status
- Content extraction details
- Database operations
- Error context
The server handles the following categories of errors:
- Network errors: Timeout and connection issues
- Authentication errors: PocketBase connection problems
- Validation errors: Invalid input parameters
- Content errors: Extraction failures
- Database errors: Storage and retrieval issues
All errors are returned as structured MCP responses with appropriate error codes.
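A structured error response for an MCP tool call typically sets isError and carries the message as text content. The sketch below maps the categories above onto that shape; toToolError, the category labels, and the message format are assumptions for illustration, not the server's actual error codes.

```javascript
// Sketch only: category labels are hypothetical names for the list above.
const ERROR_CATEGORIES = ['network', 'authentication', 'validation', 'content', 'database'];

// Convert a thrown value into an MCP-style tool result flagged as an error.
function toToolError(category, error) {
  const kind = ERROR_CATEGORIES.includes(category) ? category : 'unknown';
  const message = error instanceof Error ? error.message : String(error);

  return {
    isError: true,
    content: [{ type: 'text', text: `[${kind}] ${message}` }],
  };
}
```

Returning the error as a tool result, rather than throwing, lets the client show the failure to the user and decide whether to retry.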
# Start in development mode
npm run dev
# Start in production mode
npm start
# Install dependencies
npm run install-deps
# Test basic functionality
$env:DEBUG="true"; node server.js
# In another terminal, you can test with MCP tools or:
# Use Claude Desktop with MCP configuration
# Use other MCP-compatible clients
- Authentication Failed
  - Verify PocketBase is running: http://127.0.0.1:8090
  - Check admin credentials in .env
  - Ensure admin user exists in PocketBase
- Content Extraction Errors
  - Check network connectivity
  - Verify URL accessibility
  - Review debug logs for details
- Collection Not Found
  - Use the ensure_collection tool to automatically create the collection
  - Check collection name in environment variables
  - Verify AUTO_CREATE_COLLECTION is enabled
  - Check collection permissions
- Module Import Errors
  - Ensure "type": "module" in package.json
  - Use Node.js 18+ with ES modules support
  - Check all dependencies are installed
Enable debug mode to see detailed logs:
$env:DEBUG="true"; node server.js
If you need to recreate the collection, use this schema:
{
  "name": "documents",
  "type": "base",
  "schema": [
    {
      "name": "title",
      "type": "text",
      "required": true,
      "options": {
        "max": 255
      }
    },
    {
      "name": "content",
      "type": "text",
      "required": true
    },
    {
      "name": "metadata",
      "type": "json",
      "required": true
    },
    {
      "name": "created",
      "type": "date",
      "required": false
    },
    {
      "name": "updated",
      "type": "date",
      "required": false
    }
  ]
}
Add this to your Claude Desktop MCP settings:
{
  "mcpServers": {
    "document-extractor": {
      "command": "node",
      "args": ["c:\\powershell_scripts\\pocketbase_document_mcp\\document-extractor-mcp\\server.js"],
      "env": {
        "POCKETBASE_URL": "http://127.0.0.1:8090",
        "POCKETBASE_ADMIN_EMAIL": "your-admin@example.com",
        "POCKETBASE_ADMIN_PASSWORD": "your-password",
        "DEBUG": "false"
      }
    }
  }
}
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Latest MCP SDK v1.13.1+: Upgraded to the newest Model Context Protocol SDK
- Latest PocketBase SDK v0.26.1+: Updated to the latest PocketBase features
- Collection Management Tools: Added ensure_collection and collection_info tools
- Auto-Collection Creation: Automatic database schema setup on startup
- Enhanced Lazy Loading: Improved dynamic tool management
- Latest SSE Features: Modern Server-Sent Events implementation
- Improved Error Handling: Better collection management error recovery
- Enhanced Documentation: Comprehensive usage examples and troubleshooting
- Updated to latest Anthropic MCP SDK
- Added comprehensive error handling
- Implemented input validation with Zod
- Enhanced metadata extraction
- Added debug logging
- Improved documentation
- Added PocketBase integration
- Support for Microsoft Learn and GitHub
This MCP server supports deployment on Smithery, a platform for hosting MCP servers.
The fastest way to deploy this server on Smithery:
- Fork or Clone this repository to your GitHub account
- Connect GitHub to Smithery (or claim your server if already listed)
- Navigate to the Deployments tab on your server page
- Click Deploy - Smithery will automatically build and host your server
The smithery.yaml file is already configured for TypeScript/Node.js deployment.
Note: Despite being called "TypeScript Deploy", this method also works for Node.js projects that use ES modules.
For advanced deployment with full Docker control:
- Replace smithery.yaml with the container configuration:
cp smithery-container.yaml smithery.yaml
- Push to GitHub with the updated configuration
- Deploy via Smithery's Deployments tab
The Dockerfile is optimized for production deployment with security best practices.
When deploying on Smithery, you'll configure:
- PocketBase URL: Your PocketBase instance URL
- Admin Credentials: Email and password for PocketBase admin
- Collection Settings: Default collection name and auto-creation
- Debug Mode: Enable detailed logging (optional)
- Tool Discovery: All tools are available without authentication for discovery
- Lazy Authentication: API validation occurs only when tools are invoked
- Environment Variables: Configuration is handled via Smithery's config schema
- Health Checks: Built-in health monitoring at the /health endpoint