PostgreSQL Schema Crawler

I built this lightweight tool to crawl PostgreSQL database schemas, track changes over time, and provide both a CLI and a web interface for schema management and comparison.

Features

  • Schema Discovery: Automatically discover and catalog all tables, columns, and constraints
  • Change Detection: Compare schema snapshots to detect additions, removals, and modifications
  • Version History: Maintain a complete history of schema changes with timestamps
  • Multiple Output Formats: Export schemas as JSON, CSV, or Markdown
  • Rich CLI Interface: Beautiful terminal output with tables and progress indicators
  • Web Dashboard: Interactive Streamlit web interface for schema exploration
  • Metadata Storage: Store custom descriptions, ownership, and tags
  • Scheduled Crawling: Support for automated schema monitoring

Requirements

To use my tool, you'll need:

  • Python 3.8+
  • PostgreSQL database access
  • Required Python packages (see requirements.txt)

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd postgres-schema-crawler
  2. Install dependencies:

    pip install -r requirements.txt
  3. Configure database connection: Edit config.yaml with your PostgreSQL credentials:

    database:
      host: localhost
      port: 5432
      name: your_database
      user: your_username
      password: your_password
      schema: public
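
If you want to sanity-check these settings before running the tool, the snippet below reads config.yaml and opens a connection with psycopg2. It is a minimal sketch (assuming PyYAML is installed), not the tool's own config loader:

# Minimal sketch: read config.yaml and test the connection (illustrative only).
import yaml
import psycopg2

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)["database"]

conn = psycopg2.connect(
    host=cfg["host"],
    port=cfg["port"],
    dbname=cfg["name"],
    user=cfg["user"],
    password=cfg["password"],
)
print(f"Connected to {cfg['name']}")
conn.close()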

How to Run

Quick Start - Single Command (Recommended)

Windows:

run_tool.bat

Linux/Mac:

./run_tool.sh

Or directly with Python:

python run_tool.py

This single command will:

  1. Check that all dependencies are installed
  2. Test database connection using your config.yaml
  3. Take a new schema snapshot
  4. Launch the Streamlit web interface at http://localhost:8501
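
In other words, the flow amounts to roughly the following (a simplified sketch, not a verbatim copy of run_tool.py):

# Simplified sketch of the run_tool.py flow (illustrative, not the actual script).
import importlib
import subprocess
import sys

# 1. Check that the required packages can be imported
for pkg in ("psycopg2", "sqlalchemy", "click", "rich", "streamlit", "plotly", "pandas"):
    importlib.import_module(pkg)

# 2-3. Test the connection and take a snapshot via the CLI
subprocess.run([sys.executable, "src/schema_crawler.py", "crawl"], check=True)

# 4. Launch the Streamlit web interface (default port 8501)
subprocess.run([sys.executable, "-m", "streamlit", "run", "enhanced_web_ui.py"], check=True)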

Individual Commands

If you want to run specific parts of the tool:

1. Take a Schema Snapshot

python src/schema_crawler.py crawl

2. List All Snapshots

python src/schema_crawler.py list

3. Show Specific Snapshot Details

python src/schema_crawler.py show <snapshot_id>

4. Compare Two Snapshots

python src/schema_crawler.py diff <snapshot1_id> <snapshot2_id>

5. Compare Latest Two Snapshots

python src/schema_crawler.py diff-latest

6. Launch Web Interface Only

streamlit run enhanced_web_ui.py

Prerequisites

Before running, make sure you have:

  1. Python dependencies installed:

    pip install -r requirements.txt
  2. Database configuration set up in config.yaml:

    database:
      host: localhost
      port: 5432
      name: your_database
      user: your_username
      password: your_password
      schema: public
  3. PostgreSQL database accessible with the credentials in your config

Troubleshooting

If you get errors:

  1. Check dependencies:

    python -c "import psycopg2, sqlalchemy, click, rich, streamlit, plotly, pandas"
  2. Test database connection:

    python src/schema_crawler.py crawl
  3. Check config file exists:

    ls config.yaml

Usage

CLI Interface

My tool provides a comprehensive command-line interface with multiple commands:

1. Crawl Schema

# Basic crawl with default settings
python src/schema_crawler.py crawl

# Custom database connection
python src/schema_crawler.py crawl --host localhost --port 5432 --database mydb --user postgres --password mypass --schema public

# Table filtering examples
python src/schema_crawler.py crawl --include-tables employees departments
python src/schema_crawler.py crawl --exclude-tables temp_table backup_table
python src/schema_crawler.py crawl --include-patterns "user*" "*_log"
python src/schema_crawler.py crawl --exclude-patterns "*_backup" "test_*"
python src/schema_crawler.py crawl --include-tables employees --include-patterns "dept*" --exclude-patterns "*_temp"

2. List Snapshots

# View all saved schema snapshots
python src/schema_crawler.py list-snapshots

3. Compare Schemas

# Compare two specific snapshots
python src/schema_crawler.py diff 1 2

# Compare the two most recent snapshots
python src/schema_crawler.py diff-latest

# Generate a diff report
python src/schema_crawler.py diff 1 2 --output changes.md

4. Export Schemas

# Export as JSON
python src/schema_crawler.py export 1 --format json

# Export as CSV
python src/schema_crawler.py export 1 --format csv

# Export as Markdown
python src/schema_crawler.py export 1 --format markdown
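
Since click appears in the dependency list, the commands above are presumably wired as a click command group. The skeleton below is a hypothetical sketch of that pattern, not the tool's actual source:

# Hypothetical click command-group skeleton for a CLI like the one above.
import click

@click.group()
def cli():
    """PostgreSQL schema crawler commands."""

@cli.command()
@click.option("--host", default="localhost")
@click.option("--schema", default="public")
def crawl(host, schema):
    """Take a schema snapshot."""
    click.echo(f"Crawling schema '{schema}' on {host}...")

@cli.command()
@click.argument("snapshot_id", type=int)
@click.option("--format", "fmt", type=click.Choice(["json", "csv", "markdown"]), default="json")
def export(snapshot_id, fmt):
    """Export a snapshot in the chosen format."""
    click.echo(f"Exporting snapshot {snapshot_id} as {fmt}")

if __name__ == "__main__":
    cli()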

Web Interface

Launch my interactive web dashboard:

streamlit run src/web_ui.py

My web interface provides:

  • Dashboard: Overview of schema snapshots and activity
  • Schema Crawler: Interactive database connection and crawling
  • Schema History: Browse and export historical snapshots
  • Schema Comparison: Visual comparison of schema changes
  • Settings: Configure database connections and preferences
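
For a rough idea of how a multi-page Streamlit app like this is structured, here is an illustrative skeleton (not the contents of web_ui.py):

# Illustrative Streamlit skeleton with the pages listed above (not the actual web_ui.py).
import streamlit as st

st.set_page_config(page_title="PostgreSQL Schema Crawler")

page = st.sidebar.radio(
    "Navigate",
    ["Dashboard", "Schema Crawler", "Schema History", "Schema Comparison", "Settings"],
)

if page == "Dashboard":
    st.title("Dashboard")
    st.metric("Snapshots", 0)  # placeholder; the real app reads the snapshot store
elif page == "Schema Crawler":
    st.title("Schema Crawler")
    st.button("Take snapshot")
else:
    st.title(page)
    st.write("Page content goes here.")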

Schema Information Captured

My tool captures comprehensive metadata for each schema object:

Tables

  • Table name and type (BASE TABLE, VIEW, FOREIGN TABLE)
  • Table owner
  • Creation timestamp
  • Custom descriptions and tags

Columns

  • Column name and data type
  • Nullability constraints
  • Default values
  • Ordinal position
  • Character length, numeric precision, scale
  • Custom descriptions

Constraints

  • Primary keys
  • Foreign keys
  • Unique constraints
  • Check constraints
  • Not null constraints
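
All of this metadata comes from PostgreSQL's information_schema views. The query below gathers the column attributes listed above; it is close in spirit to what the crawler collects, though not necessarily its exact SQL:

# Sketch: pull column metadata from information_schema (not necessarily the tool's exact SQL).
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="your_database",
                        user="your_username", password="your_password")
with conn.cursor() as cur:
    cur.execute("""
        SELECT table_name, column_name, data_type, is_nullable,
               column_default, ordinal_position,
               character_maximum_length, numeric_precision, numeric_scale
        FROM information_schema.columns
        WHERE table_schema = %s
        ORDER BY table_name, ordinal_position
    """, ("public",))
    for row in cur.fetchall():
        print(row)
conn.close()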

Change Detection

My schema diffing engine detects:

  • Added Objects: New tables, columns, or constraints
  • Removed Objects: Deleted tables, columns, or constraints
  • Modified Objects: Changes to data types, nullability, defaults, etc.
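
Conceptually, the diff boils down to set operations over two snapshot dictionaries. The sketch below is a simplified version that only compares tables and column data types; the real engine also tracks constraints, nullability, defaults, and other details:

# Simplified diff sketch: snapshots are dicts of {table: {column: data_type}}.
def diff_schemas(old, new):
    added, removed, modified = [], [], []
    for table in new.keys() - old.keys():
        added.append(("Table", table))
    for table in old.keys() - new.keys():
        removed.append(("Table", table))
    for table in old.keys() & new.keys():
        old_cols, new_cols = old[table], new[table]
        added += [("Column", f"{table}.{c}") for c in new_cols.keys() - old_cols.keys()]
        removed += [("Column", f"{table}.{c}") for c in old_cols.keys() - new_cols.keys()]
        modified += [("Column", f"{table}.{c}")
                     for c in old_cols.keys() & new_cols.keys()
                     if old_cols[c] != new_cols[c]]
    return added, removed, modified

old = {"users": {"id": "integer", "name": "text"}}
new = {"users": {"id": "bigint", "name": "text", "status": "text"},
       "new_table": {"id": "integer"}}
print(diff_schemas(old, new))
# -> added: new_table, users.status | removed: none | modified: users.id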

Example Diff Output

Schema Changes Summary:
  Added: 2 | Removed: 1 | Modified: 3

🟢 Added Objects (2):
┌────────┬──────────────┬────────┬─────────────────────────────────────┐
│ Type   │ Name         │ Parent │ Details                             │
├────────┼──────────────┼────────┼─────────────────────────────────────┤
│ Table  │ new_table    │ -      │ Table 'new_table' was added         │
│ Column │ status       │ users  │ Column 'status' was added to table  │
└────────┴──────────────┴────────┴─────────────────────────────────────┘

Project Structure

postgres-schema-crawler/
├── src/
│   ├── schema_crawler.py    # Main crawler and CLI interface
│   ├── schema_diff.py       # Schema comparison engine
│   └── web_ui.py            # Streamlit web interface
├── data/
│   ├── schema_metadata.db   # SQLite database for snapshots
│   └── annotations.yaml     # Custom metadata annotations
├── config.yaml              # Configuration file
├── requirements.txt         # Python dependencies
└── README.md                # This file

Configuration

Database Settings

Configure your PostgreSQL connection in config.yaml:

database:
  host: localhost
  port: 5432
  name: your_database
  user: your_username
  password: your_password
  schema: public

Crawler Settings

crawler:
  include_types:
    - BASE TABLE
    - VIEW
    - FOREIGN TABLE
  max_tables: 0  # 0 = unlimited
  include_constraints: true
  include_indexes: false

Table Filtering

My tool supports powerful table filtering to focus on specific tables:

crawler:
  table_filter:
    # Include only specific tables
    include_tables: 
      - employees
      - departments
      - salaries
    
    # Exclude specific tables
    exclude_tables:
      - temp_table
      - backup_table
    
    # Include tables matching patterns (supports wildcards)
    include_patterns:
      - "user*"        # Tables starting with "user"
      - "*_log"        # Tables ending with "_log"
      - "temp_*"       # Tables starting with "temp_"
    
    # Exclude tables matching patterns
    exclude_patterns:
      - "*_backup"     # Exclude backup tables
      - "test_*"       # Exclude test tables
      - "*_old"        # Exclude old tables
    
    # Case sensitive pattern matching
    case_sensitive: false

My Filtering Options:

  • include_tables: List of specific tables to include (empty = all tables)
  • exclude_tables: List of specific tables to exclude
  • include_patterns: Wildcard patterns for tables to include
  • exclude_patterns: Wildcard patterns for tables to exclude
  • case_sensitive: Whether pattern matching is case sensitive

My Pattern Examples:

  • "user*" - Tables starting with "user"
  • "*_log" - Tables ending with "_log"
  • "temp_*" - Tables starting with "temp_"
  • "*_backup" - Tables ending with "_backup"
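
The wildcards follow ordinary shell-style glob semantics. Here is a small sketch of how such filtering can be implemented with Python's fnmatch module (the tool's own implementation may differ):

# Sketch of shell-style table filtering with fnmatch (the tool's implementation may differ).
from fnmatch import fnmatch

def keep_table(name, include_patterns, exclude_patterns, case_sensitive=False):
    if not case_sensitive:
        name = name.lower()
        include_patterns = [p.lower() for p in include_patterns]
        exclude_patterns = [p.lower() for p in exclude_patterns]
    included = not include_patterns or any(fnmatch(name, p) for p in include_patterns)
    excluded = any(fnmatch(name, p) for p in exclude_patterns)
    return included and not excluded

tables = ["users", "user_roles", "audit_log", "users_backup", "test_orders"]
print([t for t in tables if keep_table(t, ["user*", "*_log"], ["*_backup", "test_*"])])
# -> ['users', 'user_roles', 'audit_log']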

Output Settings

output:
  data_dir: data
  export_format: json
  create_reports: true
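
Exporting a snapshot to the three supported formats only needs the standard library. The sketch below uses a hypothetical row-based snapshot layout and is not the tool's exact exporter:

# Rough export sketch for JSON, CSV, and Markdown (hypothetical snapshot layout).
import csv
import json

snapshot = [
    {"table": "users", "column": "id", "data_type": "integer"},
    {"table": "users", "column": "status", "data_type": "text"},
]

with open("snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)

with open("snapshot.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["table", "column", "data_type"])
    writer.writeheader()
    writer.writerows(snapshot)

with open("snapshot.md", "w") as f:
    f.write("| table | column | data_type |\n|---|---|---|\n")
    for row in snapshot:
        f.write(f"| {row['table']} | {row['column']} | {row['data_type']} |\n")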

Use Cases

1. Database Documentation

  • Generate comprehensive schema documentation
  • Track schema evolution over time
  • Maintain up-to-date data dictionaries

2. Change Management

  • Monitor schema changes in development environments
  • Validate migration scripts
  • Ensure compliance with schema standards

3. Data Governance

  • Track ownership and responsibility
  • Monitor data lineage
  • Maintain audit trails

4. Development Workflow

  • Compare development and production schemas
  • Validate database migrations
  • Document schema changes for releases

Automation

Scheduled Crawling

Set up automated schema monitoring using cron:

# Crawl schema daily at 2 AM
0 2 * * * cd /path/to/postgres-schema-crawler && python src/schema_crawler.py crawl

# Compare with previous day
0 3 * * * cd /path/to/postgres-schema-crawler && python src/schema_crawler.py diff-latest --output daily_changes.md

CI/CD Integration

Integrate my schema validation into your CI/CD pipeline:

# Example GitHub Actions workflow
- name: Validate Schema Changes
  run: |
    python src/schema_crawler.py crawl
    python src/schema_crawler.py diff-latest --output schema_changes.md
    # Fail if breaking changes detected
    if grep -q "removed\|modified" schema_changes.md; then
      echo "Breaking schema changes detected!"
      exit 1
    fi

Security Considerations

  • Store database credentials securely (use environment variables)
  • Limit database user permissions to read-only access
  • Regularly rotate database passwords
  • Use SSL connections for production databases
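
For example, the password can be kept out of config.yaml and read from environment variables instead. The variable names below are illustrative; note that libpq-based clients such as psycopg2 also honor the standard PGHOST, PGUSER, PGPASSWORD, and PGDATABASE variables when connection parameters are omitted:

# Sketch: read credentials from environment variables instead of config.yaml.
# The PG_* names here are illustrative, not variables the tool defines.
import os
import psycopg2

conn = psycopg2.connect(
    host=os.environ.get("PG_HOST", "localhost"),
    dbname=os.environ["PG_DATABASE"],
    user=os.environ["PG_USER"],
    password=os.environ["PG_PASSWORD"],
)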

🤝 Contributing

I welcome contributions! Here's how you can help:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

If you need help with my tool:

  1. Check the documentation
  2. Search existing issues
  3. Create a new issue with detailed information

🗺️ Roadmap

  • Support for other database systems (MySQL, SQL Server)
  • Advanced visualization and reporting
  • Integration with dbt and other data tools
  • API endpoints for programmatic access
  • Advanced change impact analysis
  • Schema validation rules engine
