This repository contains the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the httparchive
dataset in BigQuery.
The pipelines run in the Dataform service on Google Cloud Platform (GCP) and are started automatically on crawl completion and other events. Each triggered pipeline run uses the code in the main
branch.
Tag: crawl_complete
- Crawl dataset: httparchive.crawl.*
  Consumers:
  - public dataset and BQ Sharing Listing
- Blink Features Report: httparchive.blink_features.usage
  Consumers:
Tag: crux_ready
- httparchive.reports.cwv_tech_* and httparchive.reports.tech_*
  Consumers:
- crawl-complete PubSub subscription
  Tags: ["crawl_complete"]
- bq-poller-crux-ready Scheduler
  Tags: ["crux_ready"]
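The relationship between tags and tables can be illustrated with a small sketch. The tags and table patterns below are copied from the lists above; the `actions` array, its object shape, and the `selectByTags` helper are hypothetical and only model the idea that a run restricted to a tag covers the tables carrying that tag. This is not how Dataform stores its actions internally.

```javascript
// Hypothetical registry: tags and table patterns come from this README,
// the object shape is an assumption made for illustration only.
const actions = [
  { target: 'httparchive.crawl.*', tags: ['crawl_complete'] },
  { target: 'httparchive.blink_features.usage', tags: ['crawl_complete'] },
  { target: 'httparchive.reports.cwv_tech_*', tags: ['crux_ready'] },
  { target: 'httparchive.reports.tech_*', tags: ['crux_ready'] }
]

// Return the targets that carry at least one of the requested tags.
function selectByTags (includedTags) {
  return actions
    .filter(action => action.tags.some(tag => includedTags.includes(tag)))
    .map(action => action.target)
}

console.log(selectByTags(['crux_ready']))
```

Under these assumptions, a run tagged crux_ready would touch only the two reports table patterns and leave the crawl tables alone.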
To unify workflow triggering, we use a Cloud Run function that can be invoked in several ways (e.g. by PubSub messages), performs intermediate checks, and then triggers the corresponding Dataform workflow execution configuration.
graph TB;
subgraph Cloud Run
dataform-service[dataform-service service]
bigquery-export[bigquery-export job]
end
subgraph PubSub
crawl-complete[crawl-complete topic]
dataform-service-crawl-complete[dataform-service-crawl-complete subscription]
crawl-complete --> dataform-service-crawl-complete
end
dataform-service-crawl-complete --> dataform-service
subgraph Cloud_Scheduler
bq-poller-crux-ready[bq-poller-crux-ready Poller Scheduler Job]
bq-poller-crux-ready --> dataform-service
end
subgraph Dataform
dataform[Dataform Repository]
dataform_release_config[dataform Release Configuration]
dataform_workflow[dataform Workflow Execution]
end
dataform-service --> dataform[Dataform Repository]
dataform --> dataform_release_config
dataform_release_config --> dataform_workflow
subgraph BigQuery
bq_jobs[BigQuery jobs]
bq_datasets[BigQuery table updates]
bq_jobs --> bq_datasets
end
dataform_workflow --> bq_jobs
bq_jobs --> bigquery-export
subgraph Monitoring
cloud_run_logs[Cloud Run logs]
dataform_logs[Dataform logs]
bq_logs[BigQuery logs]
alerting_policies[Alerting Policies]
slack_notifications[Slack notifications]
cloud_run_logs --> alerting_policies
dataform_logs --> alerting_policies
bq_logs --> alerting_policies
alerting_policies --> slack_notifications
end
dataform-service --> cloud_run_logs
dataform_workflow --> dataform_logs
bq_jobs --> bq_logs
bigquery-export --> cloud_run_logs
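The flow in the diagram (receive a trigger, run an intermediate check, start the tagged workflow) might look roughly like the sketch below. The trigger names and tags are the ones listed in this README; `decodePubSubMessage`, `handleTrigger`, and the injected `isReady` and `startWorkflow` callbacks are hypothetical names, and the actual dataform-service code may be organised quite differently. The only external format assumed is the Pub/Sub push envelope, which carries the payload base64-encoded in `message.data`.

```javascript
// Trigger names and tags mirror the trigger list in this README.
// All function names and the injected callbacks are hypothetical.
const TRIGGERS = {
  'crawl-complete': ['crawl_complete'],
  'bq-poller-crux-ready': ['crux_ready']
}

// Decode the base64 JSON payload of a Pub/Sub push envelope.
// Returns null when the envelope carries no data.
function decodePubSubMessage (body) {
  const data = body && body.message && body.message.data
  if (!data) return null
  return JSON.parse(Buffer.from(data, 'base64').toString('utf8'))
}

// Look up the trigger, run the intermediate check, then start the workflow.
// `isReady(name)` and `startWorkflow(tags)` are placeholders for the real
// readiness check and the real Dataform call.
async function handleTrigger (name, isReady, startWorkflow) {
  const tags = TRIGGERS[name]
  if (!tags) return { status: 400, message: `Unknown trigger: ${name}` }
  if (!(await isReady(name))) return { status: 200, message: 'Not ready, skipped' }
  await startWorkflow(tags)
  return { status: 200, message: `Started: ${tags.join(', ')}` }
}
```

Passing the check and the workflow call in as arguments keeps the routing logic testable without GCP credentials: for example, `handleTrigger('crawl-complete', async () => true, async tags => console.log(tags))` exercises the happy path locally.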
- Install dependencies:
  npm install
- Available Scripts:
  - npm run format: Format code using Standard.js, fix Markdown issues, and format Terraform files
  - npm run lint: Run linting checks on JavaScript and Markdown files, and compile Dataform configs
  - make tf_apply: Apply Terraform configurations
This repository uses:
- Standard.js for JavaScript code style
- Markdownlint for Markdown file formatting
- Dataform's built-in compiler for SQL validation