# HTTP Archive datasets pipeline

This repository handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive crawl and saves them to the `httparchive` dataset in BigQuery.

## Pipelines

The pipelines run in the Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. Each triggered pipeline run uses the code in the `main` branch.

### HTTP Archive Crawl

Tag: `crawl_complete`

### HTTP Archive Technology Report

Tag: `crux_ready`
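
Each tag selects the Dataform actions that run in the corresponding pipeline. As a minimal sketch using Dataform's JavaScript API (the action and table names here are hypothetical, not taken from this repository), an action is attached to a pipeline by listing the tag in its config:

```js
// definitions/example_pages_summary.js - a hypothetical action; anything
// tagged "crawl_complete" is picked up when that workflow runs.
publish('example_pages_summary', {
  type: 'table',
  tags: ['crawl_complete']
}).query(ctx => `
  SELECT
    date,
    COUNT(*) AS pages
  FROM ${ctx.ref('pages')}
  GROUP BY date
`)
```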

## Schedules

1. `crawl-complete` PubSub subscription

   Tags: `["crawl_complete"]`

2. `bq-poller-crux-ready` Scheduler

   Tags: `["crux_ready"]`

## Triggering workflows

To unify the workflow triggering mechanism, we use a Cloud Run function that can be invoked in a number of ways (e.g. by listening to PubSub messages), performs intermediate checks, and triggers the appropriate Dataform workflow execution configuration.
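
A minimal sketch of that pattern, assuming an Express app on Cloud Run and the `@google-cloud/dataform` Node.js client; the request shape, resource names, and tag-to-workflow mapping are illustrative assumptions, not the actual implementation:

```js
const express = require('express')
const { DataformClient } = require('@google-cloud/dataform')

const app = express()
app.use(express.json())

const dataform = new DataformClient()
// Hypothetical repository resource name.
const repo = 'projects/my-project/locations/us-central1/repositories/crawl-data'

// Pub/Sub push subscriptions and Cloud Scheduler jobs both POST here.
app.post('/', async (req, res) => {
  // Assumed payload: the trigger names the tag it corresponds to.
  const tag = req.body.tag
  if (!['crawl_complete', 'crux_ready'].includes(tag)) {
    return res.status(400).send('unknown trigger')
  }

  // Intermediate checks (e.g. "is the CrUX data ready?") would go here.

  // Kick off the workflow execution configuration matching the tag.
  await dataform.createWorkflowInvocation({
    parent: repo,
    workflowInvocation: {
      workflowConfig: `${repo}/workflowConfigs/${tag}`
    }
  })
  res.status(200).send(`triggered ${tag}`)
})

app.listen(process.env.PORT || 8080)
```

Centralising the trigger this way means a new event source only needs to call the same endpoint with a different tag.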

## Cloud resources overview

```mermaid
graph TB;
    subgraph Cloud Run
        dataform-service[dataform-service service]
        bigquery-export[bigquery-export job]
    end

    subgraph PubSub
        crawl-complete[crawl-complete topic]
        dataform-service-crawl-complete[dataform-service-crawl-complete subscription]
        crawl-complete --> dataform-service-crawl-complete
    end

    dataform-service-crawl-complete --> dataform-service

    subgraph Cloud_Scheduler
        bq-poller-crux-ready[bq-poller-crux-ready Poller Scheduler Job]
        bq-poller-crux-ready --> dataform-service
    end

    subgraph Dataform
        dataform[Dataform Repository]
        dataform_release_config[dataform Release Configuration]
        dataform_workflow[dataform Workflow Execution]
    end

    dataform-service --> dataform[Dataform Repository]
    dataform --> dataform_release_config
    dataform_release_config --> dataform_workflow

    subgraph BigQuery
        bq_jobs[BigQuery jobs]
        bq_datasets[BigQuery table updates]
        bq_jobs --> bq_datasets
    end

    dataform_workflow --> bq_jobs

    bq_jobs --> bigquery-export

    subgraph Monitoring
        cloud_run_logs[Cloud Run logs]
        dataform_logs[Dataform logs]
        bq_logs[BigQuery logs]
        alerting_policies[Alerting Policies]
        slack_notifications[Slack notifications]

        cloud_run_logs --> alerting_policies
        dataform_logs --> alerting_policies
        bq_logs --> alerting_policies
        alerting_policies --> slack_notifications
    end

    dataform-service --> cloud_run_logs
    dataform_workflow --> dataform_logs
    bq_jobs --> bq_logs
    bigquery-export --> cloud_run_logs
```

## Development Setup

1. Install dependencies:

        npm install

2. Available scripts:

   - `npm run format` - Format code using Standard.js, fix Markdown issues, and format Terraform files
   - `npm run lint` - Run linting checks on JavaScript and Markdown files, and compile the Dataform configs
   - `make tf_apply` - Apply Terraform configurations

## Code Quality

This repository uses:

- Standard.js for JavaScript code style
- Markdownlint for Markdown file formatting
- Dataform's built-in compiler for SQL validation
