TL;DR: It's a content ingestion → processing → indexing → discovery platform—automating the entire lifecycle from input to searchable archive.
A personal data archiver: GatherHub works alongside bookmark tools and other bookmark data sources, ingesting their entries and automatically downloading and archiving the referenced internet content.
GatherHub is a versatile tool designed to download and organize web content from URLs, bookmarks, databases, RSS feeds, torrents and more. It features a modern web interface, job tracking, media type detection, and automated scheduling to help you archive and manage your digital content collection. The integrated search system allows for full-text search across all downloaded content, with support for indexing various file types including HTML, PDF, documents, and streaming video metadata.
- Multi-source support: Upload files, input URLs manually, and import from browsers, bookmark tools, databases, and RSS feeds
- Smart content handling: Automatically selects the appropriate download method
- Web interface: Modern dashboard to manage and monitor download jobs
- REST API: Programmatic access for integration with other systems
- Job tracking: Monitor status of all downloads in a centralized database
- Content organization: Tagging system for easy content discovery
- Scheduled operations: Automatic synchronization with sources at configurable intervals
- Event hooks: Run custom scripts when specific events occur
- Flexible media handling: Customize how different content types are handled with the tool of your choice
- Content extraction: Automatically extract readable content from HTML, images, and other documents for indexing
- Firefox
- Chrome
- Chromium
- Brave
- Vivaldi
- Readeck
- Karakeep
- Linkding
- LinkWarden
- Wallabag
- SQLite
- MySQL
- PostgreSQL
- RSS Feeds (with configurable settings for max items and enclosures)
- WROLPI saved links
- Ad hoc direct URL import through the web interface
- File uploads from local disk or network drives
- Streaming videos: Using yt-dlp with extensive features for YouTube, Vimeo, Twitch, and many other platforms including metadata extraction, thumbnail capture, subtitles, and sponsorblock integration
- Git repositories: Clones repositories and creates optional ZIP archives
- Web pages: Full HTML archiving with JavaScript support via monolith or SingleFile
- Web archives: Comprehensive website archives with automatic crawling, link rewriting, and navigation indexes
- Documents: PDF, DOCX, TXT, etc.
- E-books: EPUB, MOBI, AZW, etc.
- Media files: MP3, MP4, images, and other media formats
- Archives: ZIP, RAR, 7z, and other compressed formats
- Maps: PBF files
- ZIM files: Wikipedia and other ZIM formatted content
- Torrents: Torrent and Magnet files
- Custom types: Define your own media types with custom tools and URL patterns
Media types in GatherHub are highly configurable:
- Tool flexibility: Use any command-line tool that can download or process content
- Custom URL patterns: Define specific regex patterns to target exactly the content you want
- Domain-based configuration: Configure media types for specific domains without complex regex
- Pattern priority: More specific patterns take precedence over general ones
- Intelligent file type detection: Automatically identifies appropriate handlers based on file extensions and URL characteristics
- Arguments templating: Customize command arguments with variables like {url}, {output_dir}, {id}
- Add specialized handlers: Create media types for specific websites or content sources
- Override defaults: Replace the default tools with your preferred alternatives
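The pattern-priority and argument-templating rules above can be sketched in Python. Everything here is illustrative: the media type entries, the longest-pattern-wins heuristic, and the function names are assumptions for the sketch, not GatherHub's actual internals.

```python
import re

# Illustrative entries mirroring the [[media_types]] config shape
MEDIA_TYPES = [
    {"name": "html",
     "patterns": [r".*\.html$", r"^https?://[^/]+/?$"],
     "arguments": "{url} -o {output_dir}/{id}.html"},
    {"name": "special-wiki",
     "patterns": [r"^https?://(www\.)?specialwiki\.org/.*"],
     "arguments": "--download {url} --output {output_dir}/{id} --format=full"},
]

def pick_media_type(url):
    """Return the matching media type; assume longer patterns are more
    specific and take precedence (an assumption for this sketch)."""
    matches = []
    for mt in MEDIA_TYPES:
        for pattern in mt.get("patterns", []):
            if re.match(pattern, url):
                matches.append((len(pattern), mt))
    if not matches:
        return None
    return max(matches, key=lambda m: m[0])[1]

def render_arguments(mt, url, output_dir, job_id):
    """Fill the {url}, {output_dir}, and {id} placeholders."""
    return mt["arguments"].format(url=url, output_dir=output_dir, id=job_id)

print(pick_media_type("https://specialwiki.org/article/1")["name"])  # special-wiki
```

A homepage URL such as `https://example.com/` would fall through to the more general `html` type, since no site-specific pattern matches it.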
Example customization for a specialized archiving tool:
[[media_types]]
name = 'special-wiki'
patterns = ['^https?://(www\.)?specialwiki\.org/.*']
tool = 'my-wiki-archiver'
tool_path = '/usr/local/bin/wiki-archiver'
arguments = '--download {url} --output {output_dir}/{id} --format=full'
Example of domain-based media type configuration (simpler than regex patterns):
[[media_types]]
name = 'academic-papers'
domains = ['arxiv.org', 'papers.ssrn.com', 'scholar.google.com']
tool = 'pdf-archiver'
tool_path = '/usr/local/bin/pdf-archiver'
arguments = '--no-js --wait 3 {url} --out {output_dir}/{id}.pdf'
GatherHub provides a comprehensive web interface with:
- Dashboard: Overview of download statistics and quick actions
- Job management: Add, retry, filter, and search download jobs
- File upload: Directly upload files through the web interface with automatic content extraction and indexing
- Bulk operations: Select multiple jobs and perform actions like retry or delete in a single click
- Settings: Configure sources, media types, storage, and more
- Cookie management: Manage browser cookies for authenticated downloads (enabling access to paywalled or private content)
- Personal Google: Your content is indexed and searchable
- Documentation: Built-in documentation for all features with its own search engine
- Dark/light/contrast modes: Customizable theme that follows system preferences
- Reset stuck jobs: Recover from processes that failed to complete properly
- Tour helper: Get an overview of key features
GatherHub includes a comprehensive first-run setup wizard that guides new users through initial configuration:
- Automatic Detection: Detects if this is a first-time installation
- Setup Wizard: Step-by-step configuration process
- Documentation Indexing: Automatically indexes built-in documentation for searchability
- Dependency Checking: Verifies required external tools are installed
- Configuration Validation: Ensures all settings are properly configured
- Progress Tracking: Shows setup progress and completion status
- Skip Options: Allows experienced users to skip certain setup steps
The first-run setup ensures that new installations are properly configured and ready to use immediately.
The Shepherd system provides guided tours and contextual help throughout the application:
- Interactive Tours: Step-by-step walkthroughs of key features
- Contextual Help: Context-sensitive assistance based on current page
- Progress Tracking: Tracks which tours and help sections have been completed
- Customizable Tours: Different tour paths for different user types
- Reset Capability: Ability to reset and replay tours
- API Integration: Programmatic control over tour states and progress
Available shepherd tours include:
- Dashboard overview and navigation
- Adding and managing download jobs
- Configuring sources and media types
- Using the search functionality
- Managing tags and organization
GatherHub provides advanced search capabilities beyond basic full-text search:
- File Type Facets: Filter results by document type (PDF, HTML, video, etc.)
- Source Facets: Filter by content source (browser bookmarks, RSS feeds, etc.)
- Tag Facets: Filter by assigned tags with count information
- Word Count Facets: Filter by content length ranges (small <500 words, medium 500-2000, large 2000-5000, very large >5000)
- Content Highlighting: Highlights matching terms in search results
- Multiple Format Support: Works with HTML, text, and extracted content
- HTML Styling: Uses HTML-based highlighting for web interface
- Context Preservation: Shows surrounding context for matches
- Field-Specific Search: Search within specific fields using parameters like fileType:pdf, tags:research, or source:"Manual Upload"
- Phrase Searching: Exact phrase matching with quotes (in documentation search)
- Parameter Parsing: Google-style search parameters that are extracted from queries
- Multi-value Filters: Support for multiple tags and other filter values
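Google-style parameter extraction of the kind described above could be sketched like this. The regex and the returned structure are assumptions for illustration; GatherHub's actual query syntax and parser may differ.

```python
import re

# key:value or key:"quoted value" tokens embedded in the query
PARAM_RE = re.compile(r'(\w+):("([^"]*)"|\S+)')

def parse_query(query):
    """Split a query into free-text terms and multi-value filters."""
    filters = {}
    for m in PARAM_RE.finditer(query):
        key = m.group(1)
        # group(3) is set for quoted values, group(2) for bare ones
        value = m.group(3) if m.group(3) is not None else m.group(2)
        filters.setdefault(key, []).append(value)
    # remove the matched parameters, normalize remaining whitespace
    text = " ".join(PARAM_RE.sub(" ", query).split())
    return text, filters

text, filters = parse_query('archive notes fileType:pdf source:"Manual Upload"')
print(text)     # archive notes
print(filters)  # {'fileType': ['pdf'], 'source': ['Manual Upload']}
```

Repeated keys accumulate into a list, which is one simple way to support the multi-value tag filters mentioned above.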
- Automatic Indexing: Content is indexed immediately upon download
- Incremental Updates: Only changed content is reindexed
- Batch Processing: Efficient handling of large content collections
- Metadata Extraction: Comprehensive metadata indexing from various file types
- Multi-format Support: Indexes text, HTML, JSON, and binary file metadata
- Index Cleanup: Automatic removal of index entries when jobs are deleted
- Dedicated Documentation Index: Separate search index for built-in documentation
- Structured Results: Organized by documentation sections and topics with title, content, section, and path fields
- Search Highlighting: Highlights matching terms in both title and content
- Optimized Queries: Weighted search across title (higher priority), section, and content fields
The content processing system provides sophisticated handling of downloaded content:
- Download: Content is downloaded using appropriate tools
- Detection: File type and content analysis
- Extraction: Text and metadata extraction using configurable extractors including OCR
- Transformation: Content conversion to multiple formats (text, JSON, HTML)
- Indexing: Full-text and metadata indexing for search
- Tagging: Automatic and manual tag assignment
- Storage: Organized storage with configurable directory structures
- Sequential Processing: Chain multiple extractors for complex workflows
- Conditional Logic: Apply different extractors based on content type
- Fallback Mechanisms: Automatic fallback to alternative extractors
- Custom Pipelines: Define custom processing pipelines for specific content types
- Multiple Formats: Generate text, JSON, and HTML versions of extracted content
- Format-Specific Optimization: Optimized output for different use cases
- Metadata Preservation: Maintains original metadata across format conversions
- Template Support: Customizable output templates for different formats
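The priority-based selection with fallback described above might look roughly like the following. This is a sketch under stated assumptions: the registry entries are invented, and lower priority numbers are assumed to run first (the real ordering may differ).

```python
# Hypothetical extractor registry mirroring the [[extractors]] config shape
EXTRACTORS = [
    {"name": "pdftotext", "extensions": [".pdf"], "priority": 20},
    {"name": "pdf-chain", "extensions": [".pdf"], "priority": 15},
    {"name": "readability", "extensions": [".html", ".htm"], "priority": 10},
]

def extract(path, run):
    """Try each extractor registered for the file's extension in priority
    order, falling back to the next one when a step raises."""
    ext = "." + path.rsplit(".", 1)[-1].lower()
    candidates = sorted((e for e in EXTRACTORS if ext in e["extensions"]),
                        key=lambda e: e["priority"])
    for extractor in candidates:
        try:
            return run(extractor, path)
        except Exception:
            continue  # fall back to the next extractor
    return None  # no extractor registered, or all of them failed
```

Here `run` stands in for whatever actually invokes an internal, external, or chain extractor; the point is only the ordering-plus-fallback shape.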
GatherHub includes robust content management features:
- Upload files directly through the web interface instead of downloading from URLs
- Support for various file types including documents, images, and media files
- Automatic content extraction from uploaded files
- Intelligent file type detection based on file extensions and content
- Batch uploads for processing multiple files at once
- Immediate indexing of uploaded content for searchability
- Creates structured directories for uploaded content organization
- Integrates with the tagging system for categorization
- Select multiple jobs for batch processing with a single action
- Bulk retry for multiple failed downloads at once
- Bulk deletion of completed or failed jobs
- Filter and select jobs by status or media type before bulk actions
- Responsive interface with clear visual feedback during bulk operations
- Job selection count with real-time updates
- Clear confirmation dialog before destructive actions
- Add custom tags to any downloaded content
- Filter and search content by tags
- Tag-based organization works across different media types
- Web interface for managing tags
- Combined tag searches (e.g., find all "important" AND "work" tagged items)
- Tag suggestions based on content type and URL patterns
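URL-pattern-based tag suggestions could work along these lines. The rules table and matching logic are purely illustrative assumptions, not GatherHub's shipped behaviour.

```python
import re
from urllib.parse import urlparse

# Illustrative hostname-pattern -> suggested-tags rules
TAG_RULES = [
    (r"(^|\.)github\.com$", ["code", "git"]),
    (r"(^|\.)arxiv\.org$", ["paper", "research"]),
    (r"(^|\.)youtube\.com$|(^|\.)youtu\.be$", ["video"]),
]

def suggest_tags(url):
    """Return suggested tags for a URL based on its hostname."""
    host = urlparse(url).hostname or ""
    tags = []
    for pattern, suggestions in TAG_RULES:
        if re.search(pattern, host):
            tags.extend(t for t in suggestions if t not in tags)
    return tags

print(suggest_tags("https://github.com/user/repo"))  # ['code', 'git']
```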
- Full-text search: Search across all indexed content including HTML, documents, and video metadata
- Content types: Supports over 20 file types including HTML, PDF, DOC, DOCX, TXT, and many more
- Metadata search: Find content by URL, title, tags, and other metadata fields
- Media-specific search: Video search includes descriptions, titles, and subtitles when available
- Status filtering: Filter by download status (pending, downloading, completed, failed)
- Media type filtering: Narrow results by media type (Streaming Video, Git, HTML, etc.)
- Tag filtering: Search by one or more tags
- Combined filtering: Mix multiple search criteria (e.g., completed downloads with specific tags)
- Search history: Save and reuse search filters
- Automatic indexing: Content is automatically indexed upon download completion
- Manual reindexing: Force reindexing of specific content via the UI or the --reindex-all flag
- Automatically extracts readable content from HTML and other documents
- Removes clutter, ads, and navigation from web content
- Preserves important metadata like title, author, and publish date
- Multiple output formats:
- Plain text for indexing and readability
- JSON for programmatic access
- HTML for clean presentation
- Pluggable extractor architecture:
- Internal extractors: Built-in processing for common formats
- External extractors: Use any command-line tool without code changes
- Chain extractors: Multi-step pipelines for complex processing
- Priority-based selection with automatic fallback mechanisms
- Media type integration for format-specific extractor configuration
- Extensible without code changes via config.toml settings
- Configurable via the extractors section in config.toml
- Integrated with event hooks system for automatic processing
- Command-line tool for manual extraction and batch processing
Example extractor configuration:
# Example of a chain extractor for PDF files
[[extractors]]
name = 'pdf-to-html-to-readability'
extensions = ['.pdf']
type = 'chain'
priority = 10
steps = [
{ command = '/usr/bin/pdf2htmlEX', args = '"{input}" "{output}.html"' },
{ use = 'readability', input = '{output}.html' }
]
- Automatically preserves bookmark metadata (title, last visit date)
- Custom metadata storage in JSON format for extensibility
- Activity logs for tracking changes to downloads
- Storage path information for direct file access
- SQLite3
- External tools:
- yt-dlp (for YouTube and other streaming platforms)
- git
- aria2c (for general downloads)
- monolith or SingleFile (for HTML archiving with JavaScript support)
- FFmpeg (optional but recommended for yt-dlp)
- jq (for JSON parsing in hook scripts)
- zip (for creating ZIP archives)
- tesseract-ocr (for OCR)
# Install SQLite
sudo apt-get update
sudo apt-get install -y sqlite3
# Install yt-dlp
sudo curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -o /usr/local/bin/yt-dlp
sudo chmod a+rx /usr/local/bin/yt-dlp
# Install git
sudo apt-get install -y git
# Install aria2
sudo apt-get install -y aria2
# Install FFmpeg
sudo apt-get install -y ffmpeg
# Install jq (required for hook scripts)
sudo apt-get install -y jq
# Install zip (required for creating ZIP archives)
sudo apt-get install -y zip
# Install Monolith for web page archiving
# For most systems:
sudo snap install monolith
sudo apt-get install -y tesseract-ocr
# For raspberry pi
# wget -O monolith https://github.com/Y2Z/monolith/releases/download/v2.10.1/monolith-gnu-linux-aarch64
# sudo mv monolith /usr/local/bin/
# sudo chmod +x /usr/local/bin/monolith
# Alternatively, install SingleFile
# npm install -g single-file-cli
# Install SQLite
brew install sqlite
# Install yt-dlp
brew install yt-dlp
# Install git
brew install git
# Install aria2
brew install aria2
# Install FFmpeg
brew install ffmpeg
# Install jq (required for hook scripts)
brew install jq
# Install zip (required for creating ZIP archives)
brew install zip
# Install monolith for web page archiving
brew install monolith
brew install tesseract
# Install Node.js (if using SingleFile instead of monolith)
# brew install node
# npm install -g single-file-cli
Usage: gatherhub [OPTIONS]
Options:
--config PATH Path to config file
--scan Scan sources for new content
--download Process pending downloads
--status Show download status
--retry Retry failed downloads
--clean Clean failed downloads
--daemon Run as daemon (for scheduling)
--api Start API server
--web Start web interface
--import-bookmarks NAME Import browser bookmarks with specified name
--browser BROWSER Browser to import from bookmarks (firefox, chrome, etc)
--profile PATH Browser profile path (optional)
--generate-admin-password Generate a new admin password hash
--generate-install-script PATH Generate install script for dependencies (specify output filename)
--reindex-all Reindex all downloaded files
--check-deps Check system dependencies
--help Show this message and exit
Scan sources for new content:
./gatherhub --scan
Import Firefox bookmarks:
./gatherhub --import-bookmarks "firefox-bookmarks" --browser firefox
Import from a specific Firefox profile:
./gatherhub --import-bookmarks "firefox-work" --browser firefox --profile "/path/to/firefox/profile"
Process pending downloads:
./gatherhub --download
Show download status:
./gatherhub --status
Run as a daemon with scheduling:
./gatherhub --daemon
Start the web interface:
./gatherhub --web
Launch both the web interface and API server:
./gatherhub --web --api
Check system dependencies:
./gatherhub --check-deps
Generate installation script for current platform:
./gatherhub --generate-install-script install.sh
Reindex all downloaded content:
./gatherhub --reindex-all
Generate new admin password:
./gatherhub --generate-admin-password
The web interface is available at http://localhost:8060 by default. You can log in with the username and password configured in config.toml (default credentials are admin/admin).
The API is available at http://localhost:5000/api by default (configurable in config.toml). The API provides programmatic access to GatherHub's functionality.
The API supports API Key-based authentication:
[api.auth]
enabled = true
api_secret = "your-secret-key" # Change this to a secure random string
apiKey_expiry_hours = 24
When authentication is enabled, all API requests must include an Authorization header:
Authorization: Bearer your-api-key
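As a sketch, an authenticated request to the jobs endpoint could be built like this with only the standard library. The key value is a placeholder, and the request is constructed but deliberately not sent here.

```python
import json
import urllib.request

API_BASE = "http://localhost:5000/api"  # default from config.toml
API_KEY = "your-api-key"                # placeholder, not a real key

def build_request(method, path, body=None):
    """Build a urllib Request carrying the Bearer token."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(API_BASE + path, data=data, method=method)
    req.add_header("Authorization", "Bearer " + API_KEY)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req

req = build_request("POST", "/jobs", {
    "url": "https://example.com/file.pdf",
    "source_name": "manual",
    "source_id": "user_123",
})
# urllib.request.urlopen(req) would actually submit the job
```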
- Endpoint: GET /api/jobs
- Description: Retrieves a list of all jobs
- Response: Array of job objects
[
{
"id": 1,
"url": "https://example.com/file.pdf",
"source_name": "firefox",
"source_id": "bookmark_123",
"media_type": "pdf",
"status": "completed",
"created_at": "2023-04-15T14:30:45Z",
"updated_at": "2023-04-15T14:35:12Z"
}
]
- Endpoint: POST /api/jobs
- Description: Creates a new download job
- Request Body:
{
"url": "https://example.com/file.pdf",
"source_name": "manual",
"source_id": "user_123",
"media_type": "pdf" // Optional, will be auto-detected if not provided
}
- Response: The created job object
- Endpoint: GET /api/jobs/{id}
- Description: Retrieves details for a specific job
- Parameters:
id: Job ID (integer)
- Response: Complete job object including status, file path, and metadata
- Endpoint: PUT /api/jobs/{id}
- Description: Updates a job's status or error message
- Parameters:
id: Job ID (integer)
- Request Body:
{
"status": "failed",
"error": "Download failed: connection timeout"
}
- Response: Updated job object
- Endpoint: DELETE /api/jobs/{id}
- Description: Deletes a job from the system
- Parameters:
id: Job ID (integer)
- Response: HTTP 204 (No Content) on success
- Endpoint: POST /api/bulk/jobs
- Description: Perform operations on multiple jobs at once
- Request Body:
{
"action": "delete", // or "retry"
"ids": [1, 2, 3, 4] // Array of job IDs
}
- Response:
{
"success": true,
"deleted": 3, // or "retried" for retry operations
"failed": 1,
"message": "Deleted 3 jobs, 1 failed"
}
- Endpoint: POST /api/scan
- Description: Scan all enabled sources for new content
- Response: Success message with scan results
{
"success": true,
"message": "Scan completed successfully"
}
- Endpoint: POST /api/download
- Description: Process all pending download jobs
- Response: HTTP 200 with a success message on completion
- Endpoint: POST /api/retry
- Description: Retry all failed download jobs
- Response: HTTP 200 with a success message on completion
- Endpoint: POST /api/clean
- Description: Removes failed jobs and old completed jobs
- Response: HTTP 200 with a message indicating how many jobs were cleaned (e.g., "Cleaned 5 failed jobs and 10 old jobs")
- Endpoint: GET /api/stats
- Description: Retrieve system statistics
- Response:
{
"total_jobs": 156,
"pending_jobs": 5,
"downloading_jobs": 2,
"completed_jobs": 145,
"failed_jobs": 4,
"sources": 3,
"recent_activity": true
}
- Endpoint: POST /api/upload
- Description: Upload files directly rather than downloading from URLs
- Request: Multipart form with:
file: File to upload (can be multiple files)
media_type: Optional media type, will be auto-detected if not provided
- Response:
{
"success": true,
"message": "2 of 2 files uploaded successfully",
"jobs": [
{
"id": 157,
"url": "file:///path/to/uploaded/file1.pdf",
"source_name": "Manual Upload",
"media_type": "pdf",
"status": "completed"
},
{
"id": 158,
"url": "file:///path/to/uploaded/file2.jpg",
"source_name": "Manual Upload",
"media_type": "jpg",
"status": "completed"
}
]
}
- Endpoint: GET /api/jobs/{id}/tags
- Description: Retrieve all tags for a specific job
- Parameters:
id: Job ID (integer)
- Response: Array of tag strings
- Endpoint: POST /api/jobs/{id}/tags
- Description: Add a tag to a specific job
- Parameters:
id: Job ID (integer)
- Request Body:
{
"tag": "important"
}
- Response: Success message with automatic content reindexing
- Endpoint: DELETE /api/jobs/{id}/tags/{tag}
- Description: Remove a specific tag from a job
- Parameters:
id: Job ID (integer)
tag: Tag name to remove (URL-encoded)
- Response: Success message with automatic content reindexing
- Endpoint: GET /api/tags
- Description: List all available tags in the system
- Response: Array of tag strings
- Endpoint: POST /api/reset-stuck
- Description: Reset jobs that are stuck in the downloading state (downloading for more than 10 minutes)
- Response: Message indicating how many jobs were reset
- Endpoint: POST /api/reindex-job/{id}
- Description: Force reindexing of content for a specific job
- Parameters:
id: Job ID (integer)
- Response: Success message confirming reindexing was scheduled
- Endpoint: GET /api/sources/
- Description: Retrieve all configured sources
- Response: Array of source configuration objects
- Endpoint: GET /api/sources/{index}
- Description: Retrieve configuration for a specific source
- Parameters:
index: Source index (integer)
- Response: Source configuration object
- Endpoint: POST /api/test-source-connection
- Description: Test connection to a data source (database, RSS feed, etc.)
- Request: Form data with source configuration parameters
- Response: Success/failure message with connection status
- Endpoint: GET /api/docs/search
- Description: Search the built-in documentation
- Parameters:
q: Search query (required)
page: Page number (optional, default: 1)
per_page: Results per page (optional, default: 10)
- Response: Search results with documentation matches
- Endpoint: GET /api/search
- Description: Advanced search across all indexed content
- Parameters:
q: Search query
page: Page number
per_page: Results per page
filters: JSON object with search filters
highlight: Enable search highlighting
- Response: Search results with faceted filtering and highlighting
- Endpoint: POST /api/index
- Description: Manually trigger content indexing
- Response: Success message confirming indexing was triggered
GatherHub is configured via data/config/config.toml. Key configuration sections include:
Configure bookmark sources from browsers, databases, and RSS feeds:
[[sources]]
name = 'firefox'
type = 'browser'
browser = 'firefox'
# profile_path = '/path/to/firefox/profile' # Optional
enabled = true
[[sources]]
name = 'readeck'
type = 'sqlite'
path = './readeck.db'
table = 'bookmarks'
id_column = 'id'
url_column = 'url'
enabled = true
[[sources]]
name = 'mysql-bookmarks'
type = 'mysql'
path = 'localhost' # Host
port = '3306'
database = 'bookmarks'
username = 'user'
password = 'password'
table = 'bookmarks'
id_column = 'id'
url_column = 'url'
enabled = true
[[sources]]
name = 'rss-feed'
type = 'rss'
url = 'https://example.com/feed.xml'
max_items = 20
include_enclosures = true
enabled = true
Configure how different content types are handled:
[[media_types]]
name = 'streaming-video'
extensions = []
domains = ['youtube.com', 'youtu.be', 'vimeo.com', 'dailymotion.com', 'twitch.tv', 'facebook.com', 'instagram.com', 'twitter.com', 'tiktok.com', 'reddit.com', 'bilibili.com', 'bitchute.com', 'rumble.com', 'odysee.com', 'peertube.tv', 'nebula.app', 'curiositystream.com', 'patreon.com', 'floatplane.com', 'soundcloud.com', 'mixcloud.com', 'bandcamp.com']
tool = 'yt-dlp'
tool_path = '/usr/local/bin/yt-dlp'
arguments = '--format "bestvideo[height<=720]+bestaudio/best[height<=720]" --merge-output-format mp4 --limit-rate 1M --no-check-certificate --ignore-errors --no-abort-on-error --geo-bypass --cookies cookies.txt --sponsorblock-remove all --write-description --write-info-json --write-thumbnail --write-all-thumbnails --write-auto-subs --sub-langs all,-live_chat --write-subs --embed-metadata --extractor-retries 5 --fragment-retries 5 --retry-sleep 3 --force-ipv4 --output "{output_dir}/%(title)s-%(id)s.%(ext)s" {url}'
[[media_types]]
name = 'youtube'
patterns = ['^https?://(www\.)?(youtube\.com|youtu\.be)/.*']
tool = 'yt-dlp'
tool_path = '/usr/local/bin/yt-dlp'
arguments = '--format "best[height<=720]" --output "{output_dir}/%(title)s-%(id)s.%(ext)s" {url}'
[[media_types]]
name = 'html'
patterns = ['.*\.html$', '^https?://[^/]+/?$']
tool = 'monolith'
tool_path = '/usr/local/bin/monolith'
arguments = '{url} -o {output_dir}/{url-hostname}_{id}.html'
# Example of using SingleFile instead of monolith
[[media_types]]
name = 'singlefile-html'
patterns = ['^https?://docs\.example\.org/.*'] # Special pattern for documentation sites
tool = 'single-file'
tool_path = '/usr/local/bin/single-file'
arguments = '{url} --output-directory={output_dir} --filename-template="{url-hostname}_{id}.html"'
Configure where downloaded content is stored:
[storage]
base_path = './downloads/'
[storage.by_type]
youtube = 'youtube/'
git = 'git/'
html = 'html/'
# Additional media types...
Configure web interface and API settings:
[web_interface]
enabled = true
host = '0.0.0.0'
port = 8060
allow_iframe = true # Set to false if not embedding in another site
session_timeout_minutes = 60
[[web_interface.users]]
username = 'admin'
password_hash = 'scrypt:32768:8:1$4CQiZOt8Pk17kpi4$6094970a974f2f4298da9157c9a9f17b33cb260638906659e565295c65dd841f72115aec72f47fcdf6b2e7b30fc668a45ee37226a871139f40a3c64b31e0337c'
role = 'admin'
[api]
enabled = true
host = '127.0.0.1'
port = 5000
debug = false # Set to true for verbose API logs
[api.auth]
enabled = true
api_secret = 'your-secret-key' # Change this to a secure random string
apiKey_expiry_hours = 24
Configure content extractors for different file types:
[extraction]
enabled = true
output_dir = '' # Use job directory if empty
output_formats = ['text', 'json', 'html']
include_metadata = true
supported_types = ['html', 'books', 'documents']
# Internal extractors (built-in)
[[extractors]]
name = 'readability'
extensions = ['.html', '.htm']
type = 'internal'
priority = 10
[[extractors]]
name = 'spreadsheet'
extensions = ['.xlsx', '.ods']
type = 'internal'
priority = 10
# External extractors (command-line tools)
[[extractors]]
name = 'pdftotext'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/pdftotext'
arguments = '-layout {input} {output}'
priority = 20
[[extractors]]
name = 'pandoc'
extensions = ['.docx', '.doc']
type = 'external'
command = '/usr/bin/pandoc'
arguments = '{input} -o {output}'
priority = 20
# Chain extractors (multi-step processing)
[[extractors]]
name = 'pdf-to-html-to-readability'
extensions = ['.pdf']
type = 'chain'
priority = 15
steps = [
{ command = '/usr/bin/pdf2htmlEX', args = '"{input}" "{output}.html"' },
{ use = 'readability', input = '{output}.html' }
]
Configure search index behavior:
[search]
enabled = true
index_path = './data/search_index'
batch_size = 50
auto_index = true
include_content = true
include_metadata = true
[search.facets]
enabled = true
max_facet_count = 20
include_file_types = true
include_sources = true
include_tags = true
include_word_count_ranges = true
Configure the first-run setup and guided tours:
[first_run]
enabled = true
require_setup = true
auto_index_docs = true
check_dependencies = true
[shepherd]
enabled = true
auto_start_tours = true
track_progress = true
available_tours = [
'dashboard-overview',
'job-management',
'source-configuration',
'search-features',
'tag-management'
]
Configure detailed logging options:
[logging]
app_log_path = './data/logs/app.log'
activity_log_path = './data/logs/activity.log'
error_log_path = './data/logs/error.log'
max_size_bytes = 10485760 # 10MB
backup_count = 5
level = 'INFO' # DEBUG, INFO, WARN, ERROR
console_logging = true
Configure concurrency settings:
[concurrency]
max_workers = 5
timeout_seconds = 3600
Configure automatic cleanup of old jobs and files:
[auto_clean]
enabled = false
retry_failed = true
max_retries = 3
clean_after_days = 7
For more detailed configuration options, see the docs_configuration.html page in the web interface.
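The auto-clean settings above suggest a retention rule like the following. The exact semantics (what counts as "old", how retries interact with cleanup) are assumptions for this sketch, not documented behaviour.

```python
from datetime import datetime, timedelta, timezone

# Values mirroring the [auto_clean] section above
CLEAN_AFTER_DAYS = 7
MAX_RETRIES = 3

def should_clean(job, now=None):
    """Assumed rule: drop failed jobs past their retry budget and
    completed jobs older than the retention window."""
    now = now or datetime.now(timezone.utc)
    if job["status"] == "failed":
        return job.get("retries", 0) >= MAX_RETRIES
    if job["status"] == "completed":
        return now - job["updated_at"] > timedelta(days=CLEAN_AFTER_DAYS)
    return False  # pending/downloading jobs are never auto-cleaned
```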
GatherHub supports running custom scripts when certain events occur:
[event_hooks]
enabled = true
hooks_dir = './data/hooks'
[[event_hooks.hooks]]
event = 'post_download'
script = 'notify.py'
enabled = true
- Rich JSON context: Hooks receive comprehensive data about the event via stdin
- Environment variables: Access to GATHERHUB_EVENT, GATHERHUB_APP_LOG, and other variables
- Multiple hooks per event: Chain several scripts for the same event type
- Conditional execution: Enable/disable hooks via config without removing them
- Global hook directory: Centralized management of all hook scripts
- pre_download: Called before a download starts
- post_download: Called after a download completes
- on_error: Called when a download fails
- on_status_change: Called when a download's status changes
- on_source_scan: Called when a source is scanned
- on_startup: Called when the application starts
- on_shutdown: Called when the application shuts down
#!/usr/bin/env python3
import json
import sys
import requests
# Read JSON data from stdin
data = json.load(sys.stdin)
# Send notification to external service
if data['status'] == 'completed':
requests.post('https://notify.example.com/webhook', json={
'title': 'Download Complete',
'message': f"Downloaded: {data['url']}",
'status': 'success'
})
#!/bin/bash
# Parse JSON input
JSON=$(cat)
URL=$(echo "$JSON" | jq -r '.url')
FILEPATH=$(echo "$JSON" | jq -r '.file_path')
MEDIA_TYPE=$(echo "$JSON" | jq -r '.media_type')
# Process downloaded files based on media type
if [ "$MEDIA_TYPE" = "pdf" ]; then
# Extract text from PDF
pdftotext "$FILEPATH" "${FILEPATH}.txt"
echo "Extracted text from PDF to ${FILEPATH}.txt"
fi
GatherHub can use browser cookies to access authenticated or paywalled content:
- Automatic extraction: Extract cookies from supported browsers
- Website-specific cookies: Apply cookies only to matching domains
- Cookie management: Add, edit, and delete cookies through the web interface
- Secure storage: Cookies are stored securely in the tracking database
- Automatic expiry: Expired cookies are removed during cleanup
To enable cookie integration:
- Go to Settings → Cookie Settings in the web interface
- Import cookies from a browser or manually add them
- Enable cookies for the desired domains
- The cookies will be automatically applied to matching download jobs
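Matching stored cookies to download jobs by domain could look like this. The suffix-matching rule follows standard cookie semantics (a cookie for `example.com` also covers its subdomains), but treating it as GatherHub's exact rule is an assumption.

```python
from urllib.parse import urlparse

def cookie_applies(cookie_domain, url):
    """True if a cookie scoped to cookie_domain should be sent for url."""
    host = urlparse(url).hostname or ""
    domain = cookie_domain.lstrip(".")  # ".example.com" and "example.com" are equivalent
    return host == domain or host.endswith("." + domain)

print(cookie_applies(".example.com", "https://news.example.com/page"))  # True
```

Note that plain suffix matching on the string would wrongly match `notexample.com`; prepending the dot before `endswith` avoids that.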
GatherHub can be run as three different systemd services:
For automatic scheduled operations:
sudo cp deploy/gatherhub.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable gatherhub
sudo systemctl start gatherhub
### Docker
GatherHub can also be run using Docker:
```bash
# Build the Docker image
docker build -t gatherhub -f deploy/Dockerfile .
# Run using docker-compose
docker-compose -f deploy/docker-compose.yml up -d
```
For secure access:
[web_interface]
# ... other settings
ssl_enabled = true
ssl_cert_file = "/path/to/cert.pem"
ssl_key_file = "/path/to/key.pem"
[api]
# ... other settings
ssl_enabled = true
ssl_cert_file = "/path/to/cert.pem"
ssl_key_file = "/path/to/key.pem"
Alternatively, use a reverse proxy like Nginx (configuration included in deploy/gatherhub.nginx.conf).