
You good, PDC? 😎


This project periodically checks the archive and dataset links on CMS's Provider Data Catalog (PDC) to ensure that every link is operational and all data is accessible.

THIS IS NOT AN OFFICIAL GOVERNMENT CODEBASE.

Check the Archives.md file to see the status summary and detailed reports of each data topic archive.

To see the datasets, go to Datasets README.

What it do?

  • Link Validation: We check CMS PDC links to make sure they're not ghosting 👻 us.

  • Categorized Reports: We neatly categorize this info in Archives.md and associated dataset markdown files. I'm really proud of how this turned out.

  • Dataset Analysis: We test the archive files and every dataset on the PDC, running basic integrity checks on each one.

  • Public Data Check: Uses the PDC API to find the archive and dataset files and then checks to make sure they exist.

  • Summarized Status: A quick glance at the top tells you if things are going smoothly or if there's trouble.

  • Detailed Dataset Reports: To view the dataset results, go to datasets/README.md where you can find links to various data topics.

  • GitHub Actions: Automatically keeps things in check every time you push to the main branch. It also runs every three hours and sends notifications if there is a ❌ in the results file.

  • Error and Performance Monitoring: Uses Sentry for error and performance monitoring. If you don't want to use it, just leave the SENTRY_DSN environment variable unset.

  • Logging Levels: Select your log level by setting LOG_LEVEL. Options are error, warn, info, debug, and trace. See the sketch after this list for one way these two variables might be wired up.
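
For context, here's a minimal startup sketch showing how LOG_LEVEL and SENTRY_DSN could be read at launch. The variable names come from this README, but the tracing-subscriber and sentry calls below are an assumed, typical setup rather than a copy of this repo's actual code.

    // Hypothetical startup sketch: LOG_LEVEL picks the log verbosity and
    // SENTRY_DSN, when present, turns on Sentry. The crates and calls used
    // here (tracing-subscriber, sentry) are assumptions, not this repo's code.
    use tracing_subscriber::EnvFilter;

    fn main() {
        // LOG_LEVEL: error, warn, info, debug, or trace.
        tracing_subscriber::fmt()
            .with_env_filter(EnvFilter::from_env("LOG_LEVEL"))
            .init();

        // Only initialize Sentry when SENTRY_DSN is set; leaving it unset
        // disables error and performance monitoring entirely.
        let _sentry_guard = std::env::var("SENTRY_DSN").ok().map(|dsn| {
            sentry::init((dsn, sentry::ClientOptions {
                traces_sample_rate: 1.0,
                ..Default::default()
            }))
        });

        tracing::info!("good-pdc starting up");
        // ... run the link checks ...
    }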

How It Works

Dataset Reports

The dataset reports are generated by a Rust module that performs the following tasks (a rough sketch follows the list):

  1. Fetching Datasets

    • The module fetches a list of datasets from the PDC API.
    • Datasets are deserialized into a Dataset struct.
  2. Processing Datasets

    • Each dataset is processed in parallel to improve efficiency.
    • The module checks the status of the dataset's download URL and landing page.
  3. Generating Reports

    • The module constructs a markdown report for each dataset, including:
      • Dataset details (e.g., ID, title, description, issued date, modified date, release date).
      • File history and overview (e.g., filesize, row count, column count).
      • Data integrity tests (e.g., column count consistency, header validation, encoding validation).
      • Public access tests (e.g., status of PDC page, landing page, and direct download link).
    • Reports are saved to the datasets directory.
  4. Error Handling and Logging

    • The module uses Sentry for error tracking and performance monitoring.
    • Detailed logging is performed using the tracing crate.
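
A hedged sketch of that flow is below. The Dataset fields, the API URL parameter, the check_url helper, and the logging calls are illustrative assumptions, not the module's real definitions.

    // Sketch: fetch the dataset list, deserialize into Dataset structs, and
    // check each dataset's landing page concurrently. Field and helper names
    // are guesses for illustration only.
    use serde::Deserialize;

    #[derive(Debug, Deserialize)]
    struct Dataset {
        identifier: String,
        title: String,
        #[serde(rename = "landingPage")]
        landing_page: Option<String>,
    }

    async fn check_url(client: &reqwest::Client, url: &str) -> bool {
        // A HEAD request is enough to confirm the link is reachable.
        client
            .head(url)
            .send()
            .await
            .map(|resp| resp.status().is_success())
            .unwrap_or(false)
    }

    async fn process_datasets(api_url: &str) -> Result<(), reqwest::Error> {
        let client = reqwest::Client::new();

        // 1. Fetch the dataset list and deserialize it.
        let datasets: Vec<Dataset> = client.get(api_url).send().await?.json().await?;

        // 2. Check every dataset's landing page concurrently.
        let checks = datasets.iter().map(|d| async {
            let ok = match &d.landing_page {
                Some(url) => check_url(&client, url).await,
                None => false,
            };
            // 3. The real module renders a markdown report per dataset;
            //    here we just log the outcome.
            tracing::info!(id = %d.identifier, title = %d.title, ok, "dataset checked");
        });
        futures::future::join_all(checks).await;

        Ok(())
    }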

Archive Reports

The archive reports are generated by another Rust module that performs the following tasks (a matching sketch follows this list as well):

  1. Fetching Archives

    • The module fetches a list of archive topics from the PDC API.
    • Archive topics are deserialized into a Topics struct.
  2. Processing Archives

    • Each archive is processed to check the status of yearly and monthly archive URLs.
    • The module generates a summary of the yearly and monthly archive checks.
  3. Generating Reports

    • The module constructs a markdown report for each archive topic, including:
      • Archive details and statuses.
      • Summary of yearly and monthly archive checks.
    • Reports are saved to the Archives.md file.
  4. Error Handling and Logging

    • The module uses Sentry for error tracking and performance monitoring.
    • Detailed logging is performed using the tracing crate.
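
And a matching sketch for the archive side, again with assumed struct fields, helpers, and markdown layout rather than the module's real ones:

    // Sketch: deserialize the topic list, check each topic's yearly and
    // monthly archive URLs, and build the summary row that would land in
    // Archives.md. Everything here is illustrative.
    use serde::Deserialize;

    #[derive(Debug, Deserialize)]
    struct Topics {
        topics: Vec<String>,
    }

    async fn count_live_links(client: &reqwest::Client, urls: &[String]) -> (usize, Vec<String>) {
        // Returns how many URLs answered successfully, plus the ones that failed.
        let mut ok = 0;
        let mut failed = Vec::new();
        for url in urls {
            let live = client
                .head(url)
                .send()
                .await
                .map(|r| r.status().is_success())
                .unwrap_or(false);
            if live { ok += 1 } else { failed.push(url.clone()) }
        }
        (ok, failed)
    }

    async fn summarize_topic(
        client: &reqwest::Client,
        topic: &str,
        yearly: &[String],
        monthly: &[String],
    ) -> String {
        let (yearly_ok, yearly_failed) = count_live_links(client, yearly).await;
        let (monthly_ok, monthly_failed) = count_live_links(client, monthly).await;
        let status = if yearly_failed.is_empty() && monthly_failed.is_empty() { "✅" } else { "❌" };
        // One markdown table row per topic for the Archives.md summary.
        format!(
            "| {} | {} | {}/{} yearly | {}/{} monthly |",
            topic, status, yearly_ok, yearly.len(), monthly_ok, monthly.len()
        )
    }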

Getting Started

  1. Clone the repo:

    git clone https://github.com/TheBoatyMcBoatFace/good-pdc.git
    cd good-pdc
  2. Run locally:

    cargo run
  3. Vibe Check:

    Open up Archives.md and see if there are any ❌. Hint: those are bad.

  4. Automate with GitHub Actions:

    Push to main to run the bot thing. It also runs every three hours and sends notifications if there is a ❌ in the results file.

Contributing

You're awesome for wanting to help (just saying). Here are some guidelines:

  1. Open issues: If you find bugs or have cool ideas, open an issue. No issue = it doesn't exist.

  2. Don't be a jerk: I am not afraid to use the ban 🔨. GitHub is the best social media platform, don't ruin it.

License

This is aggressively open-source under the AGPL-3.0 license. Details are in the LICENSE file.

Additional Links and Information


This README was generated with AI because I'm tired and don't want to do the documenting part. Bite me
