GatherHub

TL;DR: It's a content ingestion → processing → indexing → discovery platform—automating the entire lifecycle from input to searchable archive.

A personal data archiver. It works alongside bookmark tools and other bookmark data sources, ingesting the links they collect and automatically downloading and archiving the referenced internet content.

Overview

GatherHub is a versatile tool designed to download and organize web content from URLs, bookmarks, databases, RSS feeds, torrents and more. It features a modern web interface, job tracking, media type detection, and automated scheduling to help you archive and manage your digital content collection. The integrated search system allows for full-text search across all downloaded content, with support for indexing various file types including HTML, PDF, documents, and streaming video metadata.

Features

  • Multi-source support: Upload files, manually add URLs, or import from browsers, bookmark tools, databases, and RSS feeds
  • Smart content handling: Automatically selects the appropriate download method
  • Web interface: Modern dashboard to manage and monitor download jobs
  • REST API: Programmatic access for integration with other systems
  • Job tracking: Monitor status of all downloads in a centralized database
  • Content organization: Tagging system for easy content discovery
  • Scheduled operations: Automatic synchronization with sources at configurable intervals
  • Event hooks: Run custom scripts when specific events occur
  • Flexible media handling: Customize how different content types are handled with the tool of your choice
  • Content extraction: Automatically extract readable content from HTML, images, and other documents for indexing

Out of the Box Supported Sources

Browser Bookmarks

  • Firefox
  • Chrome
  • Chromium
  • Brave
  • Vivaldi

Bookmarking Tools

  • Readeck
  • Karakeep
  • Linkding
  • LinkWarden
  • Wallabag

General Databases

  • SQLite
  • MySQL
  • PostgreSQL

Other Sources

  • RSS Feeds (with configurable settings for max items and enclosures)
  • WROLPI saved links
  • Ad hoc direct URL import through the web interface
  • File uploads from local disk or network drives

Supported Media Types

  • Streaming videos: Downloaded with yt-dlp, which supports YouTube, Vimeo, Twitch, and many other platforms, with metadata extraction, thumbnail capture, subtitles, and SponsorBlock integration
  • Git repositories: Clones repositories and creates optional ZIP archives
  • Web pages: Full HTML archiving with JavaScript support via monolith or SingleFile
  • Web archives: Comprehensive website archives with automatic crawling, link rewriting, and navigation indexes
  • Documents: PDF, DOCX, TXT, etc.
  • E-books: EPUB, MOBI, AZW, etc.
  • Media files: MP3, MP4, images, and other media formats
  • Archives: ZIP, RAR, 7z, and other compressed formats
  • Maps: PBF files
  • ZIM files: Wikipedia and other ZIM formatted content
  • Torrents: Torrent and Magnet files
  • Custom types: Define your own media types with custom tools and URL patterns

Media Type Customization

Media types in GatherHub are highly configurable:

  • Tool flexibility: Use any command-line tool that can download or process content
  • Custom URL patterns: Define specific regex patterns to target exactly the content you want
  • Domain-based configuration: Configure media types for specific domains without complex regex
  • Pattern priority: More specific patterns take precedence over general ones
  • Intelligent file type detection: Automatically identifies appropriate handlers based on file extensions and URL characteristics
  • Arguments templating: Customize command arguments with variables like {url}, {output_dir}, {id}
  • Add specialized handlers: Create media types for specific websites or content sources
  • Override defaults: Replace the default tools with your preferred alternatives

Example customization for a specialized archiving tool:

[[media_types]]
name = 'special-wiki'
patterns = ['^https?://(www\.)?specialwiki\.org/.*']
tool = 'my-wiki-archiver'
tool_path = '/usr/local/bin/wiki-archiver'
arguments = '--download {url} --output {output_dir}/{id} --format=full'

Example of domain-based media type configuration (simpler than regex patterns):

[[media_types]]
name = 'academic-papers'
domains = ['arxiv.org', 'papers.ssrn.com', 'scholar.google.com']
tool = 'pdf-archiver'
tool_path = '/usr/local/bin/pdf-archiver'
arguments = '--no-js --wait 3 {url} --out {output_dir}/{id}.pdf'

Web Interface

GatherHub provides a comprehensive web interface with:

  • Dashboard: Overview of download statistics and quick actions
  • Job management: Add, retry, filter, and search download jobs
  • File upload: Directly upload files through the web interface with automatic content extraction and indexing
  • Bulk operations: Select multiple jobs and perform actions like retry or delete in a single click
  • Settings: Configure sources, media types, storage, and more
  • Cookie management: Manage browser cookies for authenticated downloads (enabling access to paywalled or private content)
  • Personal Google: Your content is indexed and searchable
  • Documentation: Built-in documentation for all features with its own search engine
  • Dark/light/contrast modes: Customizable theme based on system preferences
  • Reset stuck jobs: Recover from processes that failed to complete properly
  • Tour helper: Get an overview of key features

Advanced Features

First Run Setup

GatherHub includes a comprehensive first-run setup wizard that guides new users through initial configuration:

  • Automatic Detection: Detects if this is a first-time installation
  • Setup Wizard: Step-by-step configuration process
  • Documentation Indexing: Automatically indexes built-in documentation for searchability
  • Dependency Checking: Verifies required external tools are installed
  • Configuration Validation: Ensures all settings are properly configured
  • Progress Tracking: Shows setup progress and completion status
  • Skip Options: Allows experienced users to skip certain setup steps

The first-run setup ensures that new installations are properly configured and ready to use immediately.

Shepherd Management

The Shepherd system provides guided tours and contextual help throughout the application:

  • Interactive Tours: Step-by-step walkthroughs of key features
  • Contextual Help: Context-sensitive assistance based on current page
  • Progress Tracking: Tracks which tours and help sections have been completed
  • Customizable Tours: Different tour paths for different user types
  • Reset Capability: Ability to reset and replay tours
  • API Integration: Programmatic control over tour states and progress

Available shepherd tours include:

  • Dashboard overview and navigation
  • Adding and managing download jobs
  • Configuring sources and media types
  • Using the search functionality
  • Managing tags and organization

Enhanced Search and Indexing

GatherHub provides advanced search capabilities beyond basic full-text search:

Faceted Search

  • File Type Facets: Filter results by document type (PDF, HTML, video, etc.)
  • Source Facets: Filter by content source (browser bookmarks, RSS feeds, etc.)
  • Tag Facets: Filter by assigned tags with count information
  • Word Count Facets: Filter by content length ranges (small <500 words, medium 500-2000, large 2000-5000, very large >5000)

Search Highlighting

  • Content Highlighting: Highlights matching terms in search results
  • Multiple Format Support: Works with HTML, text, and extracted content
  • HTML Styling: Uses HTML-based highlighting for web interface
  • Context Preservation: Shows surrounding context for matches

Advanced Query Features

  • Field-Specific Search: Search within specific fields using parameters like fileType:pdf, tags:research, source:"Manual Upload"
  • Phrase Searching: Exact phrase matching with quotes (in documentation search)
  • Parameter Parsing: Google-style search parameters that are extracted from queries
  • Multi-value Filters: Support for multiple tags and other filter values

Search Index Management

  • Automatic Indexing: Content is indexed immediately upon download
  • Incremental Updates: Only changed content is reindexed
  • Batch Processing: Efficient handling of large content collections
  • Metadata Extraction: Comprehensive metadata indexing from various file types
  • Multi-format Support: Indexes text, HTML, JSON, and binary file metadata
  • Index Cleanup: Automatic removal of index entries when jobs are deleted

Documentation Search

  • Dedicated Documentation Index: Separate search index for built-in documentation
  • Structured Results: Organized by documentation sections and topics with title, content, section, and path fields
  • Search Highlighting: Highlights matching terms in both title and content
  • Optimized Queries: Weighted search across title (higher priority), section, and content fields

Content Processing Pipeline

The content processing system provides sophisticated handling of downloaded content:

Multi-Stage Processing

  1. Download: Content is downloaded using appropriate tools
  2. Detection: File type and content analysis
  3. Extraction: Text and metadata extraction using configurable extractors including OCR
  4. Transformation: Content conversion to multiple formats (text, JSON, HTML)
  5. Indexing: Full-text and metadata indexing for search
  6. Tagging: Automatic and manual tag assignment
  7. Storage: Organized storage with configurable directory structures

Extractor Chaining

  • Sequential Processing: Chain multiple extractors for complex workflows
  • Conditional Logic: Apply different extractors based on content type
  • Fallback Mechanisms: Automatic fallback to alternative extractors
  • Custom Pipelines: Define custom processing pipelines for specific content types

Output Format Management

  • Multiple Formats: Generate text, JSON, and HTML versions of extracted content
  • Format-Specific Optimization: Optimized output for different use cases
  • Metadata Preservation: Maintains original metadata across format conversions
  • Template Support: Customizable output templates for different formats

Content Management

GatherHub includes robust content management features:

File Upload

  • Upload files directly through the web interface instead of downloading from URLs
  • Support for various file types including documents, images, and media files
  • Automatic content extraction from uploaded files
  • Intelligent file type detection based on file extensions and content
  • Batch uploads for processing multiple files at once
  • Immediate indexing of uploaded content for searchability
  • Creates structured directories for uploaded content organization
  • Integrates with the tagging system for categorization

Bulk Operations

  • Select multiple jobs for batch processing with a single action
  • Bulk retry for multiple failed downloads at once
  • Bulk deletion of completed or failed jobs
  • Filter and select jobs by status or media type before bulk actions
  • Responsive interface with clear visual feedback during bulk operations
  • Job selection count with real-time updates
  • Clear confirmation dialog before destructive actions

Tagging System

  • Add custom tags to any downloaded content
  • Filter and search content by tags
  • Tag-based organization works across different media types
  • Web interface for managing tags
  • Combined tag searches (e.g., find all "important" AND "work" tagged items)
  • Tag suggestions based on content type and URL patterns

Search Capabilities

  • Full-text search: Search across all indexed content including HTML, documents, and video metadata
  • Content types: Supports over 20 file types including HTML, PDF, DOC, DOCX, TXT, and many more
  • Metadata search: Find content by URL, title, tags, and other metadata fields
  • Media-specific search: Video search includes descriptions, titles, and subtitles when available
  • Status filtering: Filter by download status (pending, downloading, completed, failed)
  • Media type filtering: Narrow results by media type (Streaming Video, Git, HTML, etc.)
  • Tag filtering: Search by one or more tags
  • Combined filtering: Mix multiple search criteria (e.g., completed downloads with specific tags)
  • Search history: Save and reuse search filters
  • Automatic indexing: Content is automatically indexed upon download completion
  • Manual reindexing: Force reindexing of specific content via UI or the --reindex-all flag

Content Extraction

  • Automatically extracts readable content from HTML and other documents
  • Removes clutter, ads, and navigation from web content
  • Preserves important metadata like title, author, and publish date
  • Multiple output formats:
    • Plain text for indexing and readability
    • JSON for programmatic access
    • HTML for clean presentation
  • Pluggable extractor architecture:
    • Internal extractors: Built-in processing for common formats
    • External extractors: Use any command-line tool without code changes
    • Chain extractors: Multi-step pipelines for complex processing
  • Priority-based selection with automatic fallback mechanisms
  • Media type integration for format-specific extractor configuration
  • Extensible without code changes via config.toml settings
  • Configurable via extractors section in config.toml
  • Integrated with event hooks system for automatic processing
  • Command-line tool for manual extraction and batch processing

Example extractor configuration:

# Example of a chain extractor for PDF files
[[extractors]]
name = 'pdf-to-html-to-readability'
extensions = ['.pdf']
type = 'chain'
priority = 10
steps = [
  { command = '/usr/bin/pdf2htmlEX', args = '"{input}" "{output}.html"' },
  { use = 'readability', input = '{output}.html' }
]

Metadata

  • Automatically preserves bookmark metadata (title, last visit date)
  • Custom metadata storage in JSON format for extensibility
  • Activity logs for tracking changes to downloads
  • Storage path information for direct file access

Requirements

  • SQLite3
  • External tools:
    • yt-dlp (for streaming video downloads)
    • git
    • aria2c (for general downloads)
    • monolith or SingleFile (for HTML archiving with JavaScript support)
    • FFmpeg (optional but recommended for yt-dlp)
    • jq (for JSON parsing in hook scripts)
    • zip (for creating ZIP archives)
    • tesseract-ocr (for OCR)

External Tools Installation

Ubuntu/Debian

# Install SQLite
sudo apt-get update
sudo apt-get install -y sqlite3

# Install yt-dlp
sudo curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -o /usr/local/bin/yt-dlp
sudo chmod a+rx /usr/local/bin/yt-dlp

# Install git
sudo apt-get install -y git

# Install aria2
sudo apt-get install -y aria2

# Install FFmpeg
sudo apt-get install -y ffmpeg

# Install jq (required for hook scripts)
sudo apt-get install -y jq

# Install zip (required for creating ZIP archives)
sudo apt-get install -y zip

# Install tesseract-ocr (for OCR)
sudo apt-get install -y tesseract-ocr

# Install Monolith for web page archiving
# For most systems:
sudo snap install monolith

# For Raspberry Pi:
# wget -O monolith https://github.com/Y2Z/monolith/releases/download/v2.10.1/monolith-gnu-linux-aarch64
# sudo mv monolith /usr/local/bin/
# sudo chmod +x /usr/local/bin/monolith

# Alternatively, install SingleFile
# npm install -g single-file-cli

macOS

# Install SQLite
brew install sqlite

# Install yt-dlp
brew install yt-dlp

# Install git
brew install git

# Install aria2
brew install aria2

# Install FFmpeg
brew install ffmpeg

# Install jq (required for hook scripts)
brew install jq

# Install zip (required for creating ZIP archives)
brew install zip

# Install monolith for web page archiving
brew install monolith

# Install tesseract (for OCR)
brew install tesseract

# Install Node.js (if using SingleFile instead of monolith)
# brew install node
# npm install -g single-file-cli

Usage

Command Line Interface

Usage: gatherhub [OPTIONS]

Options:
  --config PATH                 Path to config file
  --scan                        Scan sources for new content
  --download                    Process pending downloads
  --status                      Show download status
  --retry                       Retry failed downloads
  --clean                       Clean failed downloads
  --daemon                      Run as daemon (for scheduling)
  --api                         Start API server
  --web                         Start web interface
  --import-bookmarks NAME       Import browser bookmarks with specified name
  --browser BROWSER             Browser to import from bookmarks (firefox, chrome, etc)
  --profile PATH                Browser profile path (optional)
  --generate-admin-password     Generate a new admin password hash
  --generate-install-script PATH Generate install script for dependencies (specify output filename)
  --reindex-all                 Reindex all downloaded files
  --check-deps                  Check system dependencies
  --help                        Show this message and exit

Examples

Scan sources for new content:

./gatherhub --scan

Import Firefox bookmarks:

./gatherhub --import-bookmarks "firefox-bookmarks" --browser firefox

Import from a specific Firefox profile:

./gatherhub --import-bookmarks "firefox-work" --browser firefox --profile "/path/to/firefox/profile"

Process pending downloads:

./gatherhub --download

Show download status:

./gatherhub --status

Run as a daemon with scheduling:

./gatherhub --daemon

Start the web interface:

./gatherhub --web

Launch both the web interface and API server:

./gatherhub --web --api

Check system dependencies:

./gatherhub --check-deps

Generate installation script for current platform:

./gatherhub --generate-install-script install.sh

Reindex all downloaded content:

./gatherhub --reindex-all

Generate new admin password:

./gatherhub --generate-admin-password

Web Interface

The web interface is available at http://localhost:8060 by default. You can log in with the username and password configured in config.toml (default credentials are admin/admin).

API

The API is available at http://localhost:5000/api by default (configurable in config.toml). The API provides programmatic access to GatherHub's functionality.

Authentication

The API supports API Key-based authentication:

[api.auth]
enabled = true
api_secret = "your-secret-key"  # Change this to a secure random string
apiKey_expiry_hours = 24

When authentication is enabled, all API requests must include an Authorization header:

Authorization: Bearer your-api-key
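
For example, a minimal Python sketch of an authenticated API call (this assumes the requests package and a placeholder API key; the endpoint paths and default port come from the sections below):

import requests

API_BASE = "http://localhost:5000/api"  # default API address, configurable in config.toml
API_KEY = "your-api-key"                # placeholder; use your configured key

headers = {"Authorization": f"Bearer {API_KEY}"}

# List all jobs
jobs = requests.get(f"{API_BASE}/jobs", headers=headers)
jobs.raise_for_status()
print(f"{len(jobs.json())} jobs found")

# Create a new download job (media_type is optional and auto-detected if omitted)
new_job = requests.post(
    f"{API_BASE}/jobs",
    headers=headers,
    json={
        "url": "https://example.com/file.pdf",
        "source_name": "manual",
        "source_id": "user_123",
    },
)
print(new_job.json())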

Core API Endpoints

Job Management
Get All Jobs
  • Endpoint: GET /api/jobs
  • Description: Retrieves a list of all jobs
  • Response: Array of job objects
[
  {
    "id": 1,
    "url": "https://example.com/file.pdf",
    "source_name": "firefox",
    "source_id": "bookmark_123",
    "media_type": "pdf",
    "status": "completed",
    "created_at": "2023-04-15T14:30:45Z",
    "updated_at": "2023-04-15T14:35:12Z"
  }
]
Create Job
  • Endpoint: POST /api/jobs
  • Description: Creates a new download job
  • Request Body:
{
  "url": "https://example.com/file.pdf",
  "source_name": "manual",
  "source_id": "user_123",
  "media_type": "pdf" // Optional, will be auto-detected if not provided
}
  • Response: The created job object
Get Job Details
  • Endpoint: GET /api/jobs/{id}
  • Description: Retrieves details for a specific job
  • Parameters:
    • id: Job ID (integer)
  • Response: Complete job object including status, file path, and metadata
Update Job
  • Endpoint: PUT /api/jobs/{id}
  • Description: Updates a job's status or error message
  • Parameters: id: Job ID (integer)
  • Request Body:
{
  "status": "failed",
  "error": "Download failed: connection timeout"
}
  • Response: Updated job object
Delete Job
  • Endpoint: DELETE /api/jobs/{id}
  • Description: Deletes a job from the system
  • Parameters: id: Job ID (integer)
  • Response: HTTP 204 (No Content) on success
Bulk Operations
Perform Bulk Job Operations
  • Endpoint: POST /api/bulk/jobs
  • Description: Perform operations on multiple jobs at once
  • Request Body:
{
  "action": "delete", // or "retry"
  "ids": [1, 2, 3, 4] // Array of job IDs
}
  • Response:
{
  "success": true,
  "deleted": 3, // or "retried" for retry operations
  "failed": 1,
  "message": "Deleted 3 jobs, 1 failed"
}
Source Operations
Scan Sources
  • Endpoint: POST /api/scan
  • Description: Scan all enabled sources for new content
  • Response: Success message with scan results
{
  "success": true,
  "message": "Scan completed successfully"
}
Download Operations
Process Downloads
  • Endpoint: POST /api/download
  • Description: Process all pending download jobs
  • Response: HTTP 200 with a success message on completion
Retry Failed Jobs
  • Endpoint: POST /api/retry
  • Description: Retry all failed download jobs
  • Response: HTTP 200 with a success message on completion
Clean Jobs
  • Endpoint: POST /api/clean
  • Description: Removes failed jobs and old completed jobs
  • Response: HTTP 200 with a message indicating how many jobs were cleaned (e.g., "Cleaned 5 failed jobs and 10 old jobs")
System Information
Get Statistics
  • Endpoint: GET /api/stats
  • Description: Retrieve system statistics
  • Response:
{
  "total_jobs": 156,
  "pending_jobs": 5,
  "downloading_jobs": 2,
  "completed_jobs": 145,
  "failed_jobs": 4,
  "sources": 3,
  "recent_activity": true
}
File Operations
Upload Files
  • Endpoint: POST /api/upload
  • Description: Upload files directly rather than downloading from URLs
  • Request: Multipart form with:
    • file: File to upload (can be multiple files)
    • media_type: Optional media type, will be auto-detected if not provided
  • Response:
{
  "success": true,
  "message": "2 of 2 files uploaded successfully",
  "jobs": [
    {
      "id": 157,
      "url": "file:///path/to/uploaded/file1.pdf",
      "source_name": "Manual Upload",
      "media_type": "pdf",
      "status": "completed"
    },
    {
      "id": 158,
      "url": "file:///path/to/uploaded/file2.jpg",
      "source_name": "Manual Upload",
      "media_type": "jpg",
      "status": "completed"
    }
  ]
}
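
As a hedged illustration, uploading a file to this endpoint from Python (assuming the requests package; the file and media_type form fields follow the request description above):

import requests

API_BASE = "http://localhost:5000/api"
headers = {"Authorization": "Bearer your-api-key"}  # only needed if API auth is enabled

# Upload a single file; media_type is optional and auto-detected when omitted
with open("report.pdf", "rb") as fh:
    resp = requests.post(
        f"{API_BASE}/upload",
        headers=headers,
        files={"file": fh},
        data={"media_type": "pdf"},
    )
resp.raise_for_status()
print(resp.json()["message"])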

Additional API Endpoints

Job Tag Management
Get Job Tags
  • Endpoint: GET /api/jobs/{id}/tags
  • Description: Retrieve all tags for a specific job
  • Parameters: id: Job ID (integer)
  • Response: Array of tag strings
Add Tag to Job
  • Endpoint: POST /api/jobs/{id}/tags
  • Description: Add a tag to a specific job
  • Parameters: id: Job ID (integer)
  • Request Body:
{
  "tag": "important"
}
  • Response: Success message with automatic content reindexing
Remove Tag from Job
  • Endpoint: DELETE /api/jobs/{id}/tags/{tag}
  • Description: Remove a specific tag from a job
  • Parameters:
    • id: Job ID (integer)
    • tag: Tag name to remove (URL-encoded)
  • Response: Success message with automatic content reindexing
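
For illustration, a small Python sketch exercising the tag endpoints above (the job ID and tag values are placeholders):

import requests

API_BASE = "http://localhost:5000/api"
headers = {"Authorization": "Bearer your-api-key"}
job_id = 157  # placeholder job ID

# Add a tag (the job's content is reindexed automatically)
requests.post(f"{API_BASE}/jobs/{job_id}/tags", headers=headers, json={"tag": "important"})

# List the job's tags
print(requests.get(f"{API_BASE}/jobs/{job_id}/tags", headers=headers).json())

# Remove the tag again (the tag name is URL-encoded in the path)
requests.delete(f"{API_BASE}/jobs/{job_id}/tags/important", headers=headers)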
System Management
Get All Tags
  • Endpoint: GET /api/tags
  • Description: List all available tags in the system
  • Response: Array of tag strings
Reset Stuck Jobs
  • Endpoint: POST /api/reset-stuck
  • Description: Reset jobs that are stuck in downloading state (downloading for >10 minutes)
  • Response: Message indicating how many jobs were reset
Force Job Reindexing
  • Endpoint: POST /api/reindex-job/{id}
  • Description: Force reindexing of content for a specific job
  • Parameters: id: Job ID (integer)
  • Response: Success message confirming reindexing was scheduled
Source Management
Get Sources
  • Endpoint: GET /api/sources/
  • Description: Retrieve all configured sources
  • Response: Array of source configuration objects
Get Specific Source
  • Endpoint: GET /api/sources/{index}
  • Description: Retrieve configuration for a specific source
  • Parameters: index: Source index (integer)
  • Response: Source configuration object
Test Source Connection
  • Endpoint: POST /api/test-source-connection
  • Description: Test connection to a data source (database, RSS feed, etc.)
  • Request: Form data with source configuration parameters
  • Response: Success/failure message with connection status
Documentation Search
Search Documentation
  • Endpoint: GET /api/docs/search
  • Description: Search the built-in documentation
  • Parameters:
    • q: Search query (required)
    • page: Page number (optional, default: 1)
    • per_page: Results per page (optional, default: 10)
  • Response: Search results with documentation matches
Advanced Search
Search Content
  • Endpoint: GET /api/search
  • Description: Advanced search across all indexed content
  • Parameters:
    • q: Search query
    • page: Page number
    • per_page: Results per page
    • filters: JSON object with search filters
    • highlight: Enable search highlighting
  • Response: Search results with faceted filtering and highlighting
Index Content
  • Endpoint: POST /api/index
  • Description: Manually trigger content indexing
  • Response: Success message confirming indexing was triggered
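
A sketch of querying the content search endpoint from Python; the field-style query parameters are described under Advanced Query Features, while the exact shape of the filters JSON object is an assumption based on the facet names listed there:

import json
import requests

API_BASE = "http://localhost:5000/api"
headers = {"Authorization": "Bearer your-api-key"}

params = {
    "q": "fileType:pdf machine learning",  # Google-style field parameter plus free text
    "page": 1,
    "per_page": 10,
    "highlight": "true",
    # Assumed filter shape: keys mirror the facets (file types, sources, tags)
    "filters": json.dumps({"tags": ["research"]}),
}

results = requests.get(f"{API_BASE}/search", params=params, headers=headers)
print(results.json())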

Configuration

GatherHub is configured via data/config/config.toml. Key configuration sections include:

Sources

Configure bookmark sources from browsers, databases, and RSS feeds:

[[sources]]
name = 'firefox'
type = 'browser'
browser = 'firefox'
# profile_path = '/path/to/firefox/profile'  # Optional
enabled = true

[[sources]]
name = 'readeck'
type = 'sqlite'
path = './readeck.db'
table = 'bookmarks'
id_column = 'id'
url_column = 'url'
enabled = true

[[sources]]
name = 'mysql-bookmarks'
type = 'mysql'
path = 'localhost'  # Host
port = '3306'
database = 'bookmarks'
username = 'user'
password = 'password'
table = 'bookmarks'
id_column = 'id'
url_column = 'url'
enabled = true

[[sources]]
name = 'rss-feed'
type = 'rss'
url = 'https://example.com/feed.xml'
max_items = 20
include_enclosures = true
enabled = true

Media Types

Configure how different content types are handled:

[[media_types]]
name = 'streaming-video'
extensions = []
domains = ['youtube.com', 'youtu.be', 'vimeo.com', 'dailymotion.com', 'twitch.tv', 'facebook.com', 'instagram.com', 'twitter.com', 'tiktok.com', 'reddit.com', 'bilibili.com', 'bitchute.com', 'rumble.com', 'odysee.com', 'peertube.tv', 'nebula.app', 'curiositystream.com', 'patreon.com', 'floatplane.com', 'soundcloud.com', 'mixcloud.com', 'bandcamp.com']
tool = 'yt-dlp'
tool_path = '/usr/local/bin/yt-dlp'
arguments = '--format "bestvideo[height<=720]+bestaudio/best[height<=720]" --merge-output-format mp4 --limit-rate 1M --no-check-certificate --ignore-errors --no-abort-on-error --geo-bypass --cookies cookies.txt --sponsorblock-remove all --write-description --write-info-json --write-thumbnail --write-all-thumbnails --write-auto-subs --sub-langs all,-live_chat --write-subs --embed-metadata --extractor-retries 5 --fragment-retries 5 --retry-sleep 3 --force-ipv4 --output "{output_dir}/%(title)s-%(id)s.%(ext)s" {url}'

[[media_types]]
name = 'youtube'
patterns = ['^https?://(www\.)?(youtube\.com|youtu\.be)/.*']
tool = 'yt-dlp'
tool_path = '/usr/local/bin/yt-dlp'
arguments = '--format "best[height<=720]" --output "{output_dir}/%(title)s-%(id)s.%(ext)s" {url}'

[[media_types]]
name = 'html'
patterns = ['.*\.html$', '^https?://[^/]+/?$']
tool = 'monolith'
tool_path = '/usr/local/bin/monolith'
arguments = '{url} -o {output_dir}/{url-hostname}_{id}.html'

# Example of using SingleFile instead of monolith
[[media_types]]
name = 'singlefile-html'
patterns = ['^https?://docs\.example\.org/.*']  # Special pattern for documentation sites
tool = 'single-file'
tool_path = '/usr/local/bin/single-file'
arguments = '{url} --output-directory={output_dir} --filename-template="{url-hostname}_{id}.html"'

Storage

Configure where downloaded content is stored:

[storage]
base_path = './downloads/'

[storage.by_type]
youtube = 'youtube/'
git = 'git/'
html = 'html/'
# Additional media types...

Web Interface and API

Configure web interface and API settings:

[web_interface]
enabled = true
host = '0.0.0.0'
port = 8060
allow_iframe = true  # Set to false if not embedding in another site
session_timeout_minutes = 60

[[web_interface.users]]
username = 'admin'
password_hash = 'scrypt:32768:8:1$4CQiZOt8Pk17kpi4$6094970a974f2f4298da9157c9a9f17b33cb260638906659e565295c65dd841f72115aec72f47fcdf6b2e7b30fc668a45ee37226a871139f40a3c64b31e0337c'
role = 'admin'

[api]
enabled = true
host = '127.0.0.1'
port = 5000
debug = false  # Set to true for verbose API logs

[api.auth]
enabled = true
api_secret = 'your-secret-key'  # Change this to a secure random string
apiKey_expiry_hours = 24

Advanced Configuration

Extractor Configuration

Configure content extractors for different file types:

[extraction]
enabled = true
output_dir = ''  # Use job directory if empty
output_formats = ['text', 'json', 'html']
include_metadata = true
supported_types = ['html', 'books', 'documents']

# Internal extractors (built-in)
[[extractors]]
name = 'readability'
extensions = ['.html', '.htm']
type = 'internal'
priority = 10

[[extractors]]
name = 'spreadsheet'
extensions = ['.xlsx', '.ods']
type = 'internal'
priority = 10

# External extractors (command-line tools)
[[extractors]]
name = 'pdftotext'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/pdftotext'
arguments = '-layout {input} {output}'
priority = 20

[[extractors]]
name = 'pandoc'
extensions = ['.docx', '.doc']
type = 'external'
command = '/usr/bin/pandoc'
arguments = '{input} -o {output}'
priority = 20

# Chain extractors (multi-step processing)
[[extractors]]
name = 'pdf-to-html-to-readability'
extensions = ['.pdf']
type = 'chain'
priority = 15
steps = [
  { command = '/usr/bin/pdf2htmlEX', args = '"{input}" "{output}.html"' },
  { use = 'readability', input = '{output}.html' }
]

Search and Indexing Configuration

Configure search index behavior:

[search]
enabled = true
index_path = './data/search_index'
batch_size = 50
auto_index = true
include_content = true
include_metadata = true

[search.facets]
enabled = true
max_facet_count = 20
include_file_types = true
include_sources = true
include_tags = true
include_word_count_ranges = true

First Run and Shepherd Configuration

Configure the first-run setup and guided tours:

[first_run]
enabled = true
require_setup = true
auto_index_docs = true
check_dependencies = true

[shepherd]
enabled = true
auto_start_tours = true
track_progress = true
available_tours = [
  'dashboard-overview',
  'job-management',
  'source-configuration',
  'search-features',
  'tag-management'
]

Logging Configuration

Configure detailed logging options:

[logging]
app_log_path = './data/logs/app.log'
activity_log_path = './data/logs/activity.log'
error_log_path = './data/logs/error.log'
max_size_bytes = 10485760  # 10MB
backup_count = 5
level = 'INFO'  # DEBUG, INFO, WARN, ERROR
console_logging = true

Concurrency Configuration

Configure concurrency settings:

[concurrency]
max_workers = 5
timeout_seconds = 3600

Auto-cleanup Configuration

Configure automatic cleanup of old jobs and files:

[auto_clean]
enabled = false
retry_failed = true
max_retries = 3
clean_after_days = 7

For more detailed configuration options, see the docs_configuration.html page in the web interface.

Event Hooks

GatherHub supports running custom scripts when certain events occur:

[event_hooks]
enabled = true
hooks_dir = './data/hooks'

[[event_hooks.hooks]]
event = 'post_download'
script = 'notify.py'
enabled = true

Event Hook Features

  • Rich JSON context: Hooks receive comprehensive data about the event via stdin
  • Environment variables: Access to GATHERHUB_EVENT, GATHERHUB_APP_LOG, and other variables
  • Multiple hooks per event: Chain several scripts for the same event type
  • Conditional execution: Enable/disable hooks via config without removing them
  • Global hook directory: Centralized management of all hook scripts

Available Events

  • pre_download: Called before a download starts
  • post_download: Called after a download completes
  • on_error: Called when a download fails
  • on_status_change: Called when a download's status changes
  • on_source_scan: Called when a source is scanned
  • on_startup: Called when the application starts
  • on_shutdown: Called when the application shuts down

Hook Script Examples

Python notification hook:

#!/usr/bin/env python3
import json
import sys
import requests

# Read JSON data from stdin
data = json.load(sys.stdin)

# Send notification to external service
if data['status'] == 'completed':
    requests.post('https://notify.example.com/webhook', json={
        'title': 'Download Complete',
        'message': f"Downloaded: {data['url']}",
        'status': 'success'
    })

Bash post-processing hook:

#!/bin/bash
# Parse JSON input
JSON=$(cat)
URL=$(echo "$JSON" | jq -r '.url')
FILEPATH=$(echo "$JSON" | jq -r '.file_path')
MEDIA_TYPE=$(echo "$JSON" | jq -r '.media_type')

# Process downloaded files based on media type
if [ "$MEDIA_TYPE" = "pdf" ]; then
    # Extract text from PDF
    pdftotext "$FILEPATH" "${FILEPATH}.txt"
    echo "Extracted text from PDF to ${FILEPATH}.txt"
fi
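
As mentioned under Event Hook Features, hooks also receive environment variables such as GATHERHUB_EVENT. A minimal Python sketch that dispatches on the event name (the context field names and log path are illustrative assumptions):

#!/usr/bin/env python3
import json
import os
import sys

# The event name is passed via the GATHERHUB_EVENT environment variable
event = os.environ.get("GATHERHUB_EVENT", "unknown")
data = json.load(sys.stdin)  # rich JSON context arrives on stdin

if event == "on_error":
    # Record failed URLs in a simple local log (illustrative path)
    with open("failed_urls.log", "a") as log:
        log.write(f"{data.get('url', '')}\t{data.get('error', '')}\n")
elif event == "post_download":
    print(f"Finished: {data.get('url', '')}")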

Browser Cookie Integration

GatherHub can use browser cookies to access authenticated or paywalled content:

  • Automatic extraction: Extract cookies from supported browsers
  • Website-specific cookies: Apply cookies only to matching domains
  • Cookie management: Add, edit, and delete cookies through the web interface
  • Secure storage: Cookies are stored securely in the tracking database
  • Automatic expiry: Expired cookies are removed during cleanup

To enable cookie integration:

  1. Go to Settings → Cookie Settings in the web interface
  2. Import cookies from a browser or manually add them
  3. Enable cookies for the desired domains
  4. The cookies will be automatically applied to matching download jobs

Deployment

Systemd Services

GatherHub can be run as three different systemd services:

Web, API, and Daemon Mode

For automatic scheduled operations:

sudo cp deploy/gatherhub.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable gatherhub
sudo systemctl start gatherhub

Docker

GatherHub can also be run using Docker:

# Build the Docker image
docker build -t gatherhub -f deploy/Dockerfile .

# Run using docker-compose
docker-compose -f deploy/docker-compose.yml up -d

SSL/TLS Configuration

For secure access:

[web_interface]
# ... other settings
ssl_enabled = true
ssl_cert_file = "/path/to/cert.pem"
ssl_key_file = "/path/to/key.pem"

[api]
# ... other settings
ssl_enabled = true
ssl_cert_file = "/path/to/cert.pem"
ssl_key_file = "/path/to/key.pem"

Alternatively, use a reverse proxy like Nginx (configuration included in deploy/gatherhub.nginx.conf).