GatherHub

TL;DR: It's a content ingestion → processing → indexing → discovery platform—automating the entire lifecycle from input to searchable archive.

A personal data archiver. It works alongside bookmark tools and other bookmark data sources, ingesting the links they collect and automatically downloading and archiving the referenced internet content.

Overview

GatherHub is a versatile tool designed to download and organize web content from URLs, bookmarks, databases, RSS feeds, torrents and more. It features a modern web interface, job tracking, media type detection, and automated scheduling to help you archive and manage your digital content collection. The integrated search system allows for full-text search across all downloaded content, with support for indexing various file types including HTML, PDF, documents, and streaming video metadata.

Features

  • Multi-source support: Upload files, manually add URLs, or import from browsers, bookmark tools, databases, and RSS feeds
  • Smart content handling: Automatically selects the appropriate download method
  • Web interface: Modern dashboard to manage and monitor download jobs
  • REST API: Programmatic access for integration with other systems
  • Job tracking: Monitor status of all downloads in a centralized database
  • Content organization: Tagging system for easy content discovery
  • Scheduled operations: Automatic synchronization with sources at configurable intervals
  • Event hooks: Run custom scripts when specific events occur
  • Flexible media handling: Customize how different content types are handled with the tool of your choice
  • Content extraction: Automatically extract readable content from HTML, images, and other documents for indexing

Out of the Box Supported Sources

Browser Bookmarks

  • Firefox
  • Chrome
  • Chromium
  • Brave
  • Vivaldi

Bookmarking Tools

  • Readeck
  • Karakeep
  • Linkding
  • LinkWarden
  • Wallabag

General Databases

  • SQLite
  • MySQL
  • PostgreSQL

Other Sources

  • RSS Feeds (with configurable settings for max items and enclosures)
  • WROLPI saved links
  • Ad hoc direct URL import through the web interface
  • File uploads from local disk or network drives

Supported Media Types

  • Streaming videos: Downloaded with yt-dlp, which supports YouTube, Vimeo, Twitch, and many other platforms, with metadata extraction, thumbnail capture, subtitles, and SponsorBlock integration
  • Git repositories: Clones repositories and creates optional ZIP archives
  • Web pages: Full HTML archiving with JavaScript support via monolith or SingleFile
  • Web archives: Comprehensive website archives with automatic crawling, link rewriting, and navigation indexes
  • Documents: PDF, DOCX, TXT, etc.
  • E-books: EPUB, MOBI, AZW, etc.
  • Media files: MP3, MP4, images, and other media formats
  • Archives: ZIP, RAR, 7z, and other compressed formats
  • Maps: PBF files
  • ZIM files: Wikipedia and other ZIM formatted content
  • Torrents: Torrent and Magnet files
  • Custom types: Define your own media types with custom tools and URL patterns

Media Type Customization

Media types in GatherHub are highly configurable:

  • Tool flexibility: Use any command-line tool that can download or process content
  • Custom URL patterns: Define specific regex patterns to target exactly the content you want
  • Domain-based configuration: Configure media types for specific domains without complex regex
  • Pattern priority: More specific patterns take precedence over general ones
  • Intelligent file type detection: Automatically identifies appropriate handlers based on file extensions and URL characteristics
  • Arguments templating: Customize command arguments with variables like {url}, {output_dir}, {id}
  • Add specialized handlers: Create media types for specific websites or content sources
  • Override defaults: Replace the default tools with your preferred alternatives

Example customization for a specialized archiving tool:

[[media_types]]
name = 'special-wiki'
patterns = ['^https?://(www\.)?specialwiki\.org/.*']
tool = 'my-wiki-archiver'
tool_path = '/usr/local/bin/wiki-archiver'
arguments = '--download {url} --output {output_dir}/{id} --format=full'

Example of domain-based media type configuration (simpler than regex patterns):

[[media_types]]
name = 'academic-papers'
domains = ['arxiv.org', 'papers.ssrn.com', 'scholar.google.com']
tool = 'pdf-archiver'
tool_path = '/usr/local/bin/pdf-archiver'
arguments = '--no-js --wait 3 {url} --out {output_dir}/{id}.pdf'

Web Interface

GatherHub provides a comprehensive web interface with:

  • Dashboard: Overview of download statistics and quick actions
  • Job management: Add, retry, filter, and search download jobs
  • File upload: Directly upload files through the web interface with automatic content extraction and indexing
  • Bulk operations: Select multiple jobs and perform actions like retry or delete in a single click
  • Settings: Configure sources, media types, storage, and more
  • Cookie management: Manage browser cookies for authenticated downloads (enabling access to paywalled or private content)
  • Personal Google: Your content is indexed and searchable
  • Documentation: Built-in documentation for all features with its own search engine
  • Dark/light/contrast modes: Customizable theme based on system preferences
  • Reset stuck jobs: Recover from processes that failed to complete properly
  • Tour helper: Get an overview of key features

Advanced Features

First Run Setup

GatherHub includes a comprehensive first-run setup wizard that guides new users through initial configuration:

  • Automatic Detection: Detects if this is a first-time installation
  • Setup Wizard: Step-by-step configuration process
  • Documentation Indexing: Automatically indexes built-in documentation for searchability
  • Dependency Checking: Verifies required external tools are installed
  • Configuration Validation: Ensures all settings are properly configured
  • Progress Tracking: Shows setup progress and completion status
  • Skip Options: Allows experienced users to skip certain setup steps

The first-run setup ensures that new installations are properly configured and ready to use immediately.

Shepherd Management

The Shepherd system provides guided tours and contextual help throughout the application:

  • Interactive Tours: Step-by-step walkthroughs of key features
  • Contextual Help: Context-sensitive assistance based on current page
  • Progress Tracking: Tracks which tours and help sections have been completed
  • Customizable Tours: Different tour paths for different user types
  • Reset Capability: Ability to reset and replay tours
  • API Integration: Programmatic control over tour states and progress

Available shepherd tours include:

  • Dashboard overview and navigation
  • Adding and managing download jobs
  • Configuring sources and media types
  • Using the search functionality
  • Managing tags and organization

Enhanced Search and Indexing

GatherHub provides advanced search capabilities beyond basic full-text search:

Faceted Search

  • File Type Facets: Filter results by document type (PDF, HTML, video, etc.)
  • Source Facets: Filter by content source (browser bookmarks, RSS feeds, etc.)
  • Tag Facets: Filter by assigned tags with count information
  • Word Count Facets: Filter by content length ranges (small <500 words, medium 500-2000, large 2000-5000, very large >5000)

Search Highlighting

  • Content Highlighting: Highlights matching terms in search results
  • Multiple Format Support: Works with HTML, text, and extracted content
  • HTML Styling: Uses HTML-based highlighting for web interface
  • Context Preservation: Shows surrounding context for matches

Advanced Query Features

  • Field-Specific Search: Search within specific fields using parameters like fileType:pdf, tags:research, source:"Manual Upload"
  • Phrase Searching: Exact phrase matching with quotes (in documentation search)
  • Parameter Parsing: Google-style search parameters that are extracted from queries
  • Multi-value Filters: Support for multiple tags and other filter values

Search Index Management

  • Automatic Indexing: Content is indexed immediately upon download
  • Incremental Updates: Only changed content is reindexed
  • Batch Processing: Efficient handling of large content collections
  • Metadata Extraction: Comprehensive metadata indexing from various file types
  • Multi-format Support: Indexes text, HTML, JSON, and binary file metadata
  • Index Cleanup: Automatic removal of index entries when jobs are deleted

Documentation Search

  • Dedicated Documentation Index: Separate search index for built-in documentation
  • Structured Results: Organized by documentation sections and topics with title, content, section, and path fields
  • Search Highlighting: Highlights matching terms in both title and content
  • Optimized Queries: Weighted search across title (higher priority), section, and content fields

Content Processing Pipeline

The content processing system provides sophisticated handling of downloaded content:

Multi-Stage Processing

  1. Download: Content is downloaded using appropriate tools
  2. Detection: File type and content analysis
  3. Extraction: Text and metadata extraction using configurable extractors including OCR
  4. Transformation: Content conversion to multiple formats (text, JSON, HTML)
  5. Indexing: Full-text and metadata indexing for search
  6. Tagging: Automatic and manual tag assignment
  7. Storage: Organized storage with configurable directory structures

Extractor Chaining

  • Sequential Processing: Chain multiple extractors for complex workflows
  • Conditional Logic: Apply different extractors based on content type
  • Fallback Mechanisms: Automatic fallback to alternative extractors
  • Custom Pipelines: Define custom processing pipelines for specific content types

Output Format Management

  • Multiple Formats: Generate text, JSON, and HTML versions of extracted content
  • Format-Specific Optimization: Optimized output for different use cases
  • Metadata Preservation: Maintains original metadata across format conversions
  • Template Support: Customizable output templates for different formats

Content Management

GatherHub includes robust content management features:

File Upload

  • Upload files directly through the web interface instead of downloading from URLs
  • Support for various file types including documents, images, and media files
  • Automatic content extraction from uploaded files
  • Intelligent file type detection based on file extensions and content
  • Batch uploads for processing multiple files at once
  • Immediate indexing of uploaded content for searchability
  • Creates structured directories for uploaded content organization
  • Integrates with the tagging system for categorization

Bulk Operations

  • Select multiple jobs for batch processing with a single action
  • Bulk retry for multiple failed downloads at once
  • Bulk deletion of completed or failed jobs
  • Filter and select jobs by status or media type before bulk actions
  • Responsive interface with clear visual feedback during bulk operations
  • Job selection count with real-time updates
  • Clear confirmation dialog before destructive actions

Tagging System

  • Add custom tags to any downloaded content
  • Filter and search content by tags
  • Tag-based organization works across different media types
  • Web interface for managing tags
  • Combined tag searches (e.g., find all "important" AND "work" tagged items)
  • Tag suggestions based on content type and URL patterns

Search Capabilities

  • Full-text search: Search across all indexed content including HTML, documents, and video metadata
  • Content types: Supports over 20 file types including HTML, PDF, DOC, DOCX, TXT, and many more
  • Metadata search: Find content by URL, title, tags, and other metadata fields
  • Media-specific search: Video search includes descriptions, titles, and subtitles when available
  • Status filtering: Filter by download status (pending, downloading, completed, failed)
  • Media type filtering: Narrow results by media type (Streaming Video, Git, HTML, etc.)
  • Tag filtering: Search by one or more tags
  • Combined filtering: Mix multiple search criteria (e.g., completed downloads with specific tags)
  • Search history: Save and reuse search filters
  • Automatic indexing: Content is automatically indexed upon download completion
  • Manual reindexing: Force reindexing of specific content via UI or the --reindex-all flag

Content Extraction

  • Automatically extracts readable content from HTML and other documents
  • Removes clutter, ads, and navigation from web content
  • Preserves important metadata like title, author, and publish date
  • Multiple output formats:
    • Plain text for indexing and readability
    • JSON for programmatic access
    • HTML for clean presentation
  • Pluggable extractor architecture:
    • Internal extractors: Built-in processing for common formats
    • External extractors: Use any command-line tool without code changes
    • Chain extractors: Multi-step pipelines for complex processing
  • Priority-based selection with automatic fallback mechanisms
  • Media type integration for format-specific extractor configuration
  • Extensible without code changes via config.toml settings
  • Configurable via extractors section in config.toml
  • Integrated with event hooks system for automatic processing
  • Command-line tool for manual extraction and batch processing

Example extractor configuration:

# Example of a chain extractor for PDF files
[[extractors]]
name = 'pdf-to-html-to-readability'
extensions = ['.pdf']
type = 'chain'
priority = 10
steps = [
  { command = '/usr/bin/pdf2htmlEX', args = '"{input}" "{output}.html"' },
  { use = 'readability', input = '{output}.html' }
]

Metadata

  • Automatically preserves bookmark metadata (title, last visit date)
  • Custom metadata storage in JSON format for extensibility
  • Activity logs for tracking changes to downloads
  • Storage path information for direct file access

Requirements

  • SQLite3
  • External tools:
    • yt-dlp (for streaming video downloads)
    • git
    • aria2c (for general downloads)
    • monolith or SingleFile (for HTML archiving with JavaScript support)
    • FFmpeg (optional but recommended for yt-dlp)
    • jq (for JSON parsing in hook scripts)
    • zip (for creating ZIP archives)
    • tesseract-ocr (for OCR)

External Tools Installation

Ubuntu/Debian

# Install SQLite
sudo apt-get update
sudo apt-get install -y sqlite3

# Install yt-dlp
sudo curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -o /usr/local/bin/yt-dlp
sudo chmod a+rx /usr/local/bin/yt-dlp

# Install git
sudo apt-get install -y git

# Install aria2
sudo apt-get install -y aria2

# Install FFmpeg
sudo apt-get install -y ffmpeg

# Install jq (required for hook scripts)
sudo apt-get install -y jq

# Install zip (required for creating ZIP archives)
sudo apt-get install -y zip

# Install tesseract-ocr (for OCR)
sudo apt-get install -y tesseract-ocr

# Install Monolith for web page archiving
# For most systems:
sudo snap install monolith

# For Raspberry Pi:
# wget -O monolith https://github.com/Y2Z/monolith/releases/download/v2.10.1/monolith-gnu-linux-aarch64
# sudo mv monolith /usr/local/bin/
# sudo chmod +x /usr/local/bin/monolith

# Alternatively, install SingleFile
# npm install -g single-file-cli

macOS

# Install SQLite
brew install sqlite

# Install yt-dlp
brew install yt-dlp

# Install git
brew install git

# Install aria2
brew install aria2

# Install FFmpeg
brew install ffmpeg

# Install jq (required for hook scripts)
brew install jq

# Install zip (required for creating ZIP archives)
brew install zip

# Install monolith for web page archiving
brew install monolith

# Install tesseract (for OCR)
brew install tesseract

# Install Node.js (if using SingleFile instead of monolith)
# brew install node
# npm install -g single-file-cli

Usage

Command Line Interface

Usage: gatherhub [OPTIONS]

Options:
  --config PATH                 Path to config file
  --scan                        Scan sources for new content
  --download                    Process pending downloads
  --status                      Show download status
  --retry                       Retry failed downloads
  --clean                       Clean failed downloads
  --daemon                      Run as daemon (for scheduling)
  --api                         Start API server
  --web                         Start web interface
  --import-bookmarks NAME       Import browser bookmarks with specified name
  --browser BROWSER             Browser to import from bookmarks (firefox, chrome, etc)
  --profile PATH                Browser profile path (optional)
  --generate-admin-password     Generate a new admin password hash
  --generate-install-script PATH Generate install script for dependencies (specify output filename)
  --reindex-all                 Reindex all downloaded files
  --check-deps                  Check system dependencies
  --help                        Show this message and exit

Examples

Scan sources for new content:

./gatherhub --scan

Import Firefox bookmarks:

./gatherhub --import-bookmarks "firefox-bookmarks" --browser firefox

Import from a specific Firefox profile:

./gatherhub --import-bookmarks "firefox-work" --browser firefox --profile "/path/to/firefox/profile"

Process pending downloads:

./gatherhub --download

Show download status:

./gatherhub --status

Run as a daemon with scheduling:

./gatherhub --daemon

Start the web interface:

./gatherhub --web

Launch both the web interface and API server:

./gatherhub --web --api

Check system dependencies:

./gatherhub --check-deps

Generate installation script for current platform:

./gatherhub --generate-install-script install.sh

Reindex all downloaded content:

./gatherhub --reindex-all

Generate new admin password:

./gatherhub --generate-admin-password

Web Interface

The web interface is available at http://localhost:8060 by default. You can log in with the username and password configured in config.toml (default credentials are admin/admin).

API

The API is available at http://localhost:5000/api by default (configurable in config.toml). The API provides programmatic access to GatherHub's functionality.

Authentication

The API supports API Key-based authentication:

[api.auth]
enabled = true
api_secret = "your-secret-key"  # Change this to a secure random string
apiKey_expiry_hours = 24

When authentication is enabled, all API requests must include an Authorization header:

Authorization: Bearer your-api-key
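
For example, a minimal Python sketch of an authenticated API call (this assumes the requests package and a placeholder API key; the endpoint paths and default port come from the sections below):

import requests

API_BASE = "http://localhost:5000/api"  # default API address, configurable in config.toml
API_KEY = "your-api-key"                # placeholder; use your configured key

headers = {"Authorization": f"Bearer {API_KEY}"}

# List all jobs
jobs = requests.get(f"{API_BASE}/jobs", headers=headers)
jobs.raise_for_status()
print(f"{len(jobs.json())} jobs found")

# Create a new download job (media_type is optional and auto-detected if omitted)
new_job = requests.post(
    f"{API_BASE}/jobs",
    headers=headers,
    json={
        "url": "https://example.com/file.pdf",
        "source_name": "manual",
        "source_id": "user_123",
    },
)
print(new_job.json())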

Core API Endpoints

Job Management
Get All Jobs
  • Endpoint: GET /api/jobs
  • Description: Retrieves a list of all jobs
  • Response: Array of job objects
[
  {
    "id": 1,
    "url": "https://example.com/file.pdf",
    "source_name": "firefox",
    "source_id": "bookmark_123",
    "media_type": "pdf",
    "status": "completed",
    "created_at": "2023-04-15T14:30:45Z",
    "updated_at": "2023-04-15T14:35:12Z"
  }
]
Create Job
  • Endpoint: POST /api/jobs
  • Description: Creates a new download job
  • Request Body:
{
  "url": "https://example.com/file.pdf",
  "source_name": "manual",
  "source_id": "user_123",
  "media_type": "pdf" // Optional, will be auto-detected if not provided
}
  • Response: The created job object
Get Job Details
  • Endpoint: GET /api/jobs/{id}
  • Description: Retrieves details for a specific job
  • Parameters:
    • id: Job ID (integer)
  • Response: Complete job object including status, file path, and metadata
Update Job
  • Endpoint: PUT /api/jobs/{id}
  • Description: Updates a job's status or error message
  • Parameters: id: Job ID (integer)
  • Request Body:
{
  "status": "failed",
  "error": "Download failed: connection timeout"
}
  • Response: Updated job object
Delete Job
  • Endpoint: DELETE /api/jobs/{id}
  • Description: Deletes a job from the system
  • Parameters: id: Job ID (integer)
  • Response: HTTP 204 (No Content) on success
Bulk Operations
Perform Bulk Job Operations
  • Endpoint: POST /api/bulk/jobs
  • Description: Perform operations on multiple jobs at once
  • Request Body:
{
  "action": "delete", // or "retry"
  "ids": [1, 2, 3, 4] // Array of job IDs
}
  • Response:
{
  "success": true,
  "deleted": 3, // or "retried" for retry operations
  "failed": 1,
  "message": "Deleted 3 jobs, 1 failed"
}
Source Operations
Scan Sources
  • Endpoint: POST /api/scan
  • Description: Scan all enabled sources for new content
  • Response: Success message with scan results
{
  "success": true,
  "message": "Scan completed successfully"
}
Download Operations
Process Downloads
  • Endpoint: POST /api/download
  • Description: Process all pending download jobs
  • Response: HTTP 200 with a success message on completion
Retry Failed Jobs
  • Endpoint: POST /api/retry
  • Description: Retry all failed download jobs
  • Response: HTTP 200 with a success message on completion
Clean Jobs
  • Endpoint: POST /api/clean
  • Description: Removes failed jobs and old completed jobs
  • Response: HTTP 200 with a message indicating how many jobs were cleaned (e.g., "Cleaned 5 failed jobs and 10 old jobs")
System Information
Get Statistics
  • Endpoint: GET /api/stats
  • Description: Retrieve system statistics
  • Response:
{
  "total_jobs": 156,
  "pending_jobs": 5,
  "downloading_jobs": 2,
  "completed_jobs": 145,
  "failed_jobs": 4,
  "sources": 3,
  "recent_activity": true
}
File Operations
Upload Files
  • Endpoint: POST /api/upload
  • Description: Upload files directly rather than downloading from URLs
  • Request: Multipart form with:
    • file: File to upload (can be multiple files)
    • media_type: Optional media type, will be auto-detected if not provided
  • Response:
{
  "success": true,
  "message": "2 of 2 files uploaded successfully",
  "jobs": [
    {
      "id": 157,
      "url": "file:///path/to/uploaded/file1.pdf",
      "source_name": "Manual Upload",
      "media_type": "pdf",
      "status": "completed"
    },
    {
      "id": 158,
      "url": "file:///path/to/uploaded/file2.jpg",
      "source_name": "Manual Upload",
      "media_type": "jpg",
      "status": "completed"
    }
  ]
}
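
As a hedged illustration, uploading a file to this endpoint from Python (assuming the requests package; the file and media_type form fields follow the request description above):

import requests

API_BASE = "http://localhost:5000/api"
headers = {"Authorization": "Bearer your-api-key"}  # only needed if API auth is enabled

# Upload a single file; media_type is optional and auto-detected when omitted
with open("report.pdf", "rb") as fh:
    resp = requests.post(
        f"{API_BASE}/upload",
        headers=headers,
        files={"file": fh},
        data={"media_type": "pdf"},
    )
resp.raise_for_status()
print(resp.json()["message"])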

Additional API Endpoints

Job Tag Management
Get Job Tags
  • Endpoint: GET /api/jobs/{id}/tags
  • Description: Retrieve all tags for a specific job
  • Parameters: id: Job ID (integer)
  • Response: Array of tag strings
Add Tag to Job
  • Endpoint: POST /api/jobs/{id}/tags
  • Description: Add a tag to a specific job
  • Parameters: id: Job ID (integer)
  • Request Body:
{
  "tag": "important"
}
  • Response: Success message with automatic content reindexing
Remove Tag from Job
  • Endpoint: DELETE /api/jobs/{id}/tags/{tag}
  • Description: Remove a specific tag from a job
  • Parameters:
    • id: Job ID (integer)
    • tag: Tag name to remove (URL-encoded)
  • Response: Success message with automatic content reindexing
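
For illustration, a small Python sketch exercising the tag endpoints above (the job ID and tag values are placeholders):

import requests

API_BASE = "http://localhost:5000/api"
headers = {"Authorization": "Bearer your-api-key"}
job_id = 157  # placeholder job ID

# Add a tag (the job's content is reindexed automatically)
requests.post(f"{API_BASE}/jobs/{job_id}/tags", headers=headers, json={"tag": "important"})

# List the job's tags
print(requests.get(f"{API_BASE}/jobs/{job_id}/tags", headers=headers).json())

# Remove the tag again (the tag name is URL-encoded in the path)
requests.delete(f"{API_BASE}/jobs/{job_id}/tags/important", headers=headers)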
System Management
Get All Tags
  • Endpoint: GET /api/tags
  • Description: List all available tags in the system
  • Response: Array of tag strings
Reset Stuck Jobs
  • Endpoint: POST /api/reset-stuck
  • Description: Reset jobs that are stuck in downloading state (downloading for >10 minutes)
  • Response: Message indicating how many jobs were reset
Force Job Reindexing
  • Endpoint: POST /api/reindex-job/{id}
  • Description: Force reindexing of content for a specific job
  • Parameters: id: Job ID (integer)
  • Response: Success message confirming reindexing was scheduled
Source Management
Get Sources
  • Endpoint: GET /api/sources/
  • Description: Retrieve all configured sources
  • Response: Array of source configuration objects
Get Specific Source
  • Endpoint: GET /api/sources/{index}
  • Description: Retrieve configuration for a specific source
  • Parameters: index: Source index (integer)
  • Response: Source configuration object
Test Source Connection
  • Endpoint: POST /api/test-source-connection
  • Description: Test connection to a data source (database, RSS feed, etc.)
  • Request: Form data with source configuration parameters
  • Response: Success/failure message with connection status
Documentation Search
Search Documentation
  • Endpoint: GET /api/docs/search
  • Description: Search the built-in documentation
  • Parameters:
    • q: Search query (required)
    • page: Page number (optional, default: 1)
    • per_page: Results per page (optional, default: 10)
  • Response: Search results with documentation matches
Advanced Search
Search Content
  • Endpoint: GET /api/search
  • Description: Advanced search across all indexed content
  • Parameters:
    • q: Search query
    • page: Page number
    • per_page: Results per page
    • filters: JSON object with search filters
    • highlight: Enable search highlighting
  • Response: Search results with faceted filtering and highlighting
Index Content
  • Endpoint: POST /api/index
  • Description: Manually trigger content indexing
  • Response: Success message confirming indexing was triggered
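
A sketch of querying the content search endpoint from Python; the field-style query parameters are described under Advanced Query Features, while the exact shape of the filters JSON object is an assumption based on the facet names listed there:

import json
import requests

API_BASE = "http://localhost:5000/api"
headers = {"Authorization": "Bearer your-api-key"}

params = {
    "q": "fileType:pdf machine learning",  # Google-style field parameter plus free text
    "page": 1,
    "per_page": 10,
    "highlight": "true",
    # Assumed filter shape: keys mirror the facets (file types, sources, tags)
    "filters": json.dumps({"tags": ["research"]}),
}

results = requests.get(f"{API_BASE}/search", params=params, headers=headers)
print(results.json())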

Configuration

GatherHub is configured via data/config/config.toml. Key configuration sections include:

Sources

Configure bookmark sources from browsers, databases, and RSS feeds:

[[sources]]
name = 'firefox'
type = 'browser'
browser = 'firefox'
# profile_path = '/path/to/firefox/profile'  # Optional
enabled = true

[[sources]]
name = 'readeck'
type = 'sqlite'
path = './readeck.db'
table = 'bookmarks'
id_column = 'id'
url_column = 'url'
enabled = true

[[sources]]
name = 'mysql-bookmarks'
type = 'mysql'
path = 'localhost'  # Host
port = '3306'
database = 'bookmarks'
username = 'user'
password = 'password'
table = 'bookmarks'
id_column = 'id'
url_column = 'url'
enabled = true

[[sources]]
name = 'rss-feed'
type = 'rss'
url = 'https://example.com/feed.xml'
max_items = 20
include_enclosures = true
enabled = true

Media Types

Configure how different content types are handled:

[[media_types]]
name = 'streaming-video'
extensions = []
domains = ['youtube.com', 'youtu.be', 'vimeo.com', 'dailymotion.com', 'twitch.tv', 'facebook.com', 'instagram.com', 'twitter.com', 'tiktok.com', 'reddit.com', 'bilibili.com', 'bitchute.com', 'rumble.com', 'odysee.com', 'peertube.tv', 'nebula.app', 'curiositystream.com', 'patreon.com', 'floatplane.com', 'soundcloud.com', 'mixcloud.com', 'bandcamp.com']
tool = 'yt-dlp'
tool_path = '/usr/local/bin/yt-dlp'
arguments = '--format "bestvideo[height<=720]+bestaudio/best[height<=720]" --merge-output-format mp4 --limit-rate 1M --no-check-certificate --ignore-errors --no-abort-on-error --geo-bypass --cookies cookies.txt --sponsorblock-remove all --write-description --write-info-json --write-thumbnail --write-all-thumbnails --write-auto-subs --sub-langs all,-live_chat --write-subs --embed-metadata --extractor-retries 5 --fragment-retries 5 --retry-sleep 3 --force-ipv4 --output "{output_dir}/%(title)s-%(id)s.%(ext)s" {url}'

[[media_types]]
name = 'youtube'
patterns = ['^https?://(www\.)?(youtube\.com|youtu\.be)/.*']
tool = 'yt-dlp'
tool_path = '/usr/local/bin/yt-dlp'
arguments = '--format "best[height<=720]" --output "{output_dir}/%(title)s-%(id)s.%(ext)s" {url}'

[[media_types]]
name = 'html'
patterns = ['.*\.html$', '^https?://[^/]+/?$']
tool = 'monolith'
tool_path = '/usr/local/bin/monolith'
arguments = '{url} -o {output_dir}/{url-hostname}_{id}.html'

# Example of using SingleFile instead of monolith
[[media_types]]
name = 'singlefile-html'
patterns = ['^https?://docs\.example\.org/.*']  # Special pattern for documentation sites
tool = 'single-file'
tool_path = '/usr/local/bin/single-file'
arguments = '{url} --output-directory={output_dir} --filename-template="{url-hostname}_{id}.html"'

Storage

Configure where downloaded content is stored:

[storage]
base_path = './downloads/'

[storage.by_type]
youtube = 'youtube/'
git = 'git/'
html = 'html/'
# Additional media types...

Web Interface and API

Configure web interface and API settings:

[web_interface]
enabled = true
host = '0.0.0.0'
port = 8060
allow_iframe = true  # Set to false if not embedding in another site
session_timeout_minutes = 60

[[web_interface.users]]
username = 'admin'
password_hash = 'scrypt:32768:8:1$4CQiZOt8Pk17kpi4$6094970a974f2f4298da9157c9a9f17b33cb260638906659e565295c65dd841f72115aec72f47fcdf6b2e7b30fc668a45ee37226a871139f40a3c64b31e0337c'
role = 'admin'

[api]
enabled = true
host = '127.0.0.1'
port = 5000
debug = false  # Set to true for verbose API logs

[api.auth]
enabled = true
api_secret = 'your-secret-key'  # Change this to a secure random string
apiKey_expiry_hours = 24

Advanced Configuration

Extractor Configuration

Configure content extractors for different file types:

[extraction]
enabled = true
output_dir = ''  # Use job directory if empty
output_formats = ['text', 'json', 'html']
include_metadata = true
supported_types = ['html', 'books', 'documents']

# Internal extractors (built-in)
[[extractors]]
name = 'readability'
extensions = ['.html', '.htm']
type = 'internal'
priority = 10

[[extractors]]
name = 'spreadsheet'
extensions = ['.xlsx', '.ods']
type = 'internal'
priority = 10

# External extractors (command-line tools)
[[extractors]]
name = 'pdftotext'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/pdftotext'
arguments = '-layout {input} {output}'
priority = 20

[[extractors]]
name = 'pandoc'
extensions = ['.docx', '.doc']
type = 'external'
command = '/usr/bin/pandoc'
arguments = '{input} -o {output}'
priority = 20

# Chain extractors (multi-step processing)
[[extractors]]
name = 'pdf-to-html-to-readability'
extensions = ['.pdf']
type = 'chain'
priority = 15
steps = [
  { command = '/usr/bin/pdf2htmlEX', args = '"{input}" "{output}.html"' },
  { use = 'readability', input = '{output}.html' }
]

Search and Indexing Configuration

Configure search index behavior:

[search]
enabled = true
index_path = './data/search_index'
batch_size = 50
auto_index = true
include_content = true
include_metadata = true

[search.facets]
enabled = true
max_facet_count = 20
include_file_types = true
include_sources = true
include_tags = true
include_word_count_ranges = true

First Run and Shepherd Configuration

Configure the first-run setup and guided tours:

[first_run]
enabled = true
require_setup = true
auto_index_docs = true
check_dependencies = true

[shepherd]
enabled = true
auto_start_tours = true
track_progress = true
available_tours = [
  'dashboard-overview',
  'job-management',
  'source-configuration',
  'search-features',
  'tag-management'
]

Logging Configuration

Configure detailed logging options:

[logging]
app_log_path = './data/logs/app.log'
activity_log_path = './data/logs/activity.log'
error_log_path = './data/logs/error.log'
max_size_bytes = 10485760  # 10MB
backup_count = 5
level = 'INFO'  # DEBUG, INFO, WARN, ERROR
console_logging = true

Concurrency Configuration

Configure concurrency settings:

[concurrency]
max_workers = 5
timeout_seconds = 3600

Auto-cleanup Configuration

Configure automatic cleanup of old jobs and files:

[auto_clean]
enabled = false
retry_failed = true
max_retries = 3
clean_after_days = 7

For more detailed configuration options, see the docs_configuration.html page in the web interface.

Event Hooks

GatherHub supports running custom scripts when certain events occur:

[event_hooks]
enabled = true
hooks_dir = './data/hooks'

[[event_hooks.hooks]]
event = 'post_download'
script = 'notify.py'
enabled = true

Event Hook Features

  • Rich JSON context: Hooks receive comprehensive data about the event via stdin
  • Environment variables: Access to GATHERHUB_EVENT, GATHERHUB_APP_LOG, and other variables
  • Multiple hooks per event: Chain several scripts for the same event type
  • Conditional execution: Enable/disable hooks via config without removing them
  • Global hook directory: Centralized management of all hook scripts

Available Events

  • pre_download: Called before a download starts
  • post_download: Called after a download completes
  • on_error: Called when a download fails
  • on_status_change: Called when a download's status changes
  • on_source_scan: Called when a source is scanned
  • on_startup: Called when the application starts
  • on_shutdown: Called when the application shuts down

Hook Script Examples

Python notification hook:

#!/usr/bin/env python3
import json
import sys
import requests

# Read JSON data from stdin
data = json.load(sys.stdin)

# Send notification to external service
if data['status'] == 'completed':
    requests.post('https://notify.example.com/webhook', json={
        'title': 'Download Complete',
        'message': f"Downloaded: {data['url']}",
        'status': 'success'
    })

Bash post-processing hook:

#!/bin/bash
# Parse JSON input
JSON=$(cat)
URL=$(echo "$JSON" | jq -r '.url')
FILEPATH=$(echo "$JSON" | jq -r '.file_path')
MEDIA_TYPE=$(echo "$JSON" | jq -r '.media_type')

# Process downloaded files based on media type
if [ "$MEDIA_TYPE" = "pdf" ]; then
    # Extract text from PDF
    pdftotext "$FILEPATH" "${FILEPATH}.txt"
    echo "Extracted text from PDF to ${FILEPATH}.txt"
fi
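
As mentioned under Event Hook Features, hooks also receive environment variables such as GATHERHUB_EVENT. A minimal Python sketch that dispatches on the event name (the context field names and log path are illustrative assumptions):

#!/usr/bin/env python3
import json
import os
import sys

# The event name is passed via the GATHERHUB_EVENT environment variable
event = os.environ.get("GATHERHUB_EVENT", "unknown")
data = json.load(sys.stdin)  # rich JSON context arrives on stdin

if event == "on_error":
    # Record failed URLs in a simple local log (illustrative path)
    with open("failed_urls.log", "a") as log:
        log.write(f"{data.get('url', '')}\t{data.get('error', '')}\n")
elif event == "post_download":
    print(f"Finished: {data.get('url', '')}")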

Browser Cookie Integration

GatherHub can use browser cookies to access authenticated or paywalled content:

  • Automatic extraction: Extract cookies from supported browsers
  • Website-specific cookies: Apply cookies only to matching domains
  • Cookie management: Add, edit, and delete cookies through the web interface
  • Secure storage: Cookies are stored securely in the tracking database
  • Automatic expiry: Expired cookies are removed during cleanup

To enable cookie integration:

  1. Go to Settings → Cookie Settings in the web interface
  2. Import cookies from a browser or manually add them
  3. Enable cookies for the desired domains
  4. The cookies will be automatically applied to matching download jobs

Deployment

Systemd Services

GatherHub can be run as three different systemd services:

Web, API, and Daemon Mode

For automatic scheduled operations:

sudo cp deploy/gatherhub.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable gatherhub
sudo systemctl start gatherhub

Docker

GatherHub can also be run using Docker:

# Build the Docker image
docker build -t gatherhub -f deploy/Dockerfile .

# Run using docker-compose
docker-compose -f deploy/docker-compose.yml up -d

SSL/TLS Configuration

For secure access:

[web_interface]
# ... other settings
ssl_enabled = true
ssl_cert_file = "/path/to/cert.pem"
ssl_key_file = "/path/to/key.pem"

[api]
# ... other settings
ssl_enabled = true
ssl_cert_file = "/path/to/cert.pem"
ssl_key_file = "/path/to/key.pem"

Alternatively, use a reverse proxy like Nginx (configuration included in deploy/gatherhub.nginx.conf).