Merged

28 commits
3 changes: 3 additions & 0 deletions .editorconfig
@@ -10,3 +10,6 @@ insert_final_newline = true

[*.{js,json,md,sql,yaml}]
indent_size = 2

[Makefile]
indent_style = tab
1 change: 1 addition & 0 deletions .github/workflows/linter.yaml
@@ -33,3 +33,4 @@ jobs:
VALIDATE_JSCPD: false
VALIDATE_JAVASCRIPT_PRETTIER: false
VALIDATE_MARKDOWN_PRETTIER: false
VALIDATE_GITHUB_ACTIONS: false
8 changes: 2 additions & 6 deletions .gitignore
@@ -2,9 +2,5 @@ node_modules/
.DS_Store

# Terraform
tf/.terraform/
tf/temp

# Dataform
.df-credentials.json
.gitignore
infra/tf/.terraform/
**/*.zip
14 changes: 14 additions & 0 deletions Makefile
@@ -0,0 +1,14 @@
FN_NAME = dataform-trigger

.PHONY: *

start:
npx functions-framework --target=$(FN_NAME) --source=./infra/dataform-trigger/ --signature-type=http --port=8080 --debug

tf_plan:
terraform -chdir=infra/tf init -upgrade && terraform -chdir=infra/tf plan \
-var="FUNCTION_NAME=$(FN_NAME)"

tf_apply:
terraform -chdir=infra/tf init && terraform -chdir=infra/tf apply -auto-approve \
-var="FUNCTION_NAME=$(FN_NAME)"
8 changes: 4 additions & 4 deletions README.md
@@ -1,4 +1,4 @@
# HTTP Archive BigQuery pipeline with Dataform
# HTTP Archive datasets pipeline

This repository handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the `httparchive` dataset in BigQuery.

@@ -62,7 +62,7 @@ Tag: `crawl_results_legacy`

### Triggering workflows

[see here](./src/README.md)
To unify the workflow triggering mechanism, we use [a Cloud Run function](./infra/README.md) that can be invoked in a number of ways (e.g. listening to Pub/Sub messages), performs intermediate checks, and triggers the appropriate Dataform workflow execution configuration.

## Contributing

@@ -85,7 +85,7 @@ Tag: `crawl_results_legacy`

The issues within the pipeline are being tracked using the following alerts:

1. the event trigger processing fails - [Dataform Trigger Function Error](https://console.cloud.google.com/monitoring/alerting/policies/3950167380893746326?authuser=7&project=httparchive)
2. a job in the workflow fails - [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/7137542315653007241?authuser=7&project=httparchive)
1. the event trigger processing fails - [Dataform Trigger Function Error](https://console.cloud.google.com/monitoring/alerting/policies/570799173843203905?authuser=7&project=httparchive)
2. a job in the workflow fails - [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/16526940745374967367?authuser=7&project=httparchive)

Error notifications are sent to [#10x-infra](https://httparchive.slack.com/archives/C030V4WAVL3) Slack channel.
132 changes: 132 additions & 0 deletions docs/infrastructure.md
@@ -0,0 +1,132 @@
# Infrastructure

```mermaid
graph LR;
subgraph Cloud_Run_Functions
dataformTrigger[Dataform Trigger Function]
end

subgraph PubSub
crawl_complete_topic[Crawl Complete Topic]
dataformTrigger_subscription[Dataform Trigger Subscription]
crawl_complete_topic --> dataformTrigger_subscription
end

dataformTrigger_subscription --> dataformTrigger

subgraph Cloud_Scheduler
bq_poller_cwv_tech_report[CWV Report Poller Job]
bq_poller_cwv_tech_report --> dataformTrigger
end

subgraph Dataform
dataform_repo[Dataform Repository]
dataform_repo_release_config[Release Configuration]
dataform_repo_workflow[Workflow Execution]
end

dataformTrigger --> dataform_repo[Dataform Repository]
dataform_repo --> dataform_repo_release_config[Release Configuration]
dataform_repo_release_config --> dataform_repo_workflow[Workflow Execution]

subgraph BigQuery
bq_jobs[BigQuery Jobs]
bq_datasets[BigQuery Dataset Updates]
bq_jobs --> bq_datasets
end
dataform_repo_workflow --> bq_jobs

subgraph Logs_and_Alerts
cloud_run_logs[Cloud Run Logs]
dataform_logs[Dataform Logs]
bq_logs[BigQuery Logs]
alerting_policies[Alerting Policies]
slack_notifications[Slack Notifications]

cloud_run_logs --> alerting_policies
dataform_logs --> alerting_policies
bq_logs --> alerting_policies
alerting_policies --> slack_notifications
end

dataformTrigger --> cloud_run_logs
dataform_repo_workflow --> dataform_logs
bq_jobs --> bq_logs

```

## Triggering pipelines

[Configuration](../infra/tf/functions.tf)

### Cloud Run Function

Triggers the Dataform workflow execution based on events or cron schedules.

- [dataformTrigger](https://console.cloud.google.com/functions/details/us-central1/dataformTrigger?env=gen2&project=httparchive)

[Source](../infra/README.md)

### Cloud Scheduler

- [bq-poller-cwv-tech-report](https://console.cloud.google.com/cloudscheduler/jobs/edit/us-east4/bq-poller-cwv-tech-report?authuser=7&project=httparchive)

### Pub/Sub Subscription

- [dataform-trigger-subscription](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataformTrigger?authuser=7&project=httparchive)

## Dataform

Runs the batch processing workflows. There are two Dataform repositories for [development](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data-test/details/workspaces?authuser=7&project=httparchive) and [production](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data/details/workspaces?authuser=7&project=httparchive).

The test repository is used [for development and testing purposes](https://cloud.google.com/dataform/docs/workspaces) and is not connected to the rest of the pipeline infrastructure.

The pipeline can be [run manually](https://cloud.google.com/dataform/docs/code-lifecycle) from the Dataform UI.

[Configuration](../infra/tf/dataform.tf)

### Dataform Development Workspace

1. [Create new dev workspace](https://cloud.google.com/dataform/docs/quickstart-dev-environments) in test Dataform repository.
2. Make adjustments to the dataform configuration files and manually run a workflow to verify.
3. Push all your changes to a dev branch & open a PR with the link to the BigQuery artifacts generated in the test workflow.

*Some useful hints:*

1. In the workflow settings vars, set `dev_name: dev` to process sampled data in the dev workspace.
2. Change the `current_month` variable to a month in the past. This may be helpful for testing pipelines based on `chrome-ux-report` data.
3. The `definitions/extra/test_env.sqlx` script helps to set up the tables required to run pipelines in a dev workspace. It's disabled by default.
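
For reference, the vars mentioned above might look like the following in the workspace's workflow settings file (the exact file name — `workflow_settings.yaml` vs. `dataform.json` — and the `current_month` value format depend on the Dataform core version; treat this as an illustrative sketch, not the repository's actual config):

```yaml
# Hypothetical excerpt of a dev workspace's workflow settings.
vars:
  dev_name: dev              # process sampled data in the dev workspace
  current_month: "2024-12-01"  # set to a past month when testing chrome-ux-report pipelines
```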

## Monitoring

[Configuration](../infra/tf/monitoring.tf)

### Dataform repository

- [Production Dataform Workflow Execution logs](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data/details/workflows?authuser=7&project=httparchive)
- [Logs Explorer](https://cloudlogging.app.goo.gl/k9qfqCh4RjFwTnQ56)

### Cloud Run logs

- [Trigger function logs](https://console.cloud.google.com/run/detail/us-central1/dataformtrigger/logs?authuser=7&project=httparchive)
- [Logs Explorer](https://cloudlogging.app.goo.gl/6Q879UjnTPDqtVBx5)

### BigQuery logs

- [Logs Explorer](https://cloudlogging.app.goo.gl/rFjRMcvejd1Tyi7KA)

### Alerting policies

- [Dataform Trigger Function Error](https://console.cloud.google.com/monitoring/alerting/policies/3950167380893746326?authuser=7&project=httparchive)
- [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/7137542315653007241?authuser=7&project=httparchive)

## CI/CD pipeline

### Dataform / GitHub connection

The GitHub PAT is saved to a [Secret Manager secret](https://console.cloud.google.com/security/secret-manager/secret/GitHub_max-ostapenko_dataform_PAT/versions?authuser=7&project=httparchive).

- repository: HTTPArchive/dataform
- permissions:
- Commit statuses: read
- Contents: read, write
18 changes: 10 additions & 8 deletions src/README.md → infra/README.md
@@ -1,18 +1,20 @@
# Cloud function for triggering Dataform workflows
# Infrastructure for the HTTP Archive data pipeline

## Cloud function for triggering Dataform workflows

[dataformTrigger](https://console.cloud.google.com/functions/details/us-central1/dataformTrigger?env=gen2&authuser=7&project=httparchive) Cloud Run Function

This function may be triggered by a PubSub message or Cloud Scheduler and triggers a Dataform workflow based on the trigger configuration provided.

## Configuration
### Configuration

Trigger types:

1. `event` - immediately triggers a Dataform workflow using tags provided in configuration.

2. `poller` - first triggers a BigQuery polling query. If the query returns TRUE, the Dataform workflow is triggered using the tags provided in configuration.

See [available trigger configurations](https://github.com/HTTPArchive/dataform/blob/30a3304bf0e903ec0c54ce1318aa4eed8ae828ed/src/index.js#L4).
See [available trigger configurations](https://github.com/HTTPArchive/dataform/blob/main/infra/dataform-trigger/index.js#L4).
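
The trigger map keys a trigger name to its type and action. A minimal sketch of the shape (the trigger names, tags, and query below are illustrative assumptions — the authoritative map is the `TRIGGERS` constant in `index.js`):

```javascript
// Hypothetical trigger map mirroring the two trigger types described above.
// Real trigger names, tags, and queries live in the function's index.js.
const TRIGGERS = {
  // 'event' triggers start a Dataform workflow immediately.
  crawl_complete: {
    type: 'event',
    action: 'runDataformRepo',
    actionArgs: { repoName: 'crawl-data', tags: ['crawl_results_all'] }
  },
  // 'poller' triggers first run a BigQuery check; the workflow only
  // starts if the query returns TRUE.
  cwv_tech_report: {
    type: 'poller',
    query: 'SELECT COUNT(*) > 0 FROM `illustrative.dataset.table`', // placeholder query
    action: 'runDataformRepo',
    actionArgs: { repoName: 'crawl-data', tags: ['cwv_tech_report'] }
  }
}

module.exports = { TRIGGERS }
```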

Request body example with trigger name:

Expand All @@ -22,12 +24,12 @@ Request body example with trigger name:
}
```

## Local testing
### Local testing

Run the following command to test the function locally:

```bash
npm run start
make start
```

Then, in a separate terminal, run the following command to trigger the function:
@@ -42,10 +44,10 @@ curl -X POST http://localhost:8080/ \
}'
```

## Deployment
### Deployment

When you're under `src/` run:
When you're under `infra/` run:

```bash
npm run deploy
make deploy
```
File renamed without changes.
6 changes: 3 additions & 3 deletions src/index.js → infra/dataform-trigger/index.js
@@ -1,4 +1,6 @@
const functions = require('@google-cloud/functions-framework')
const { BigQuery } = require('@google-cloud/bigquery')
const { getCompilationResults, runWorkflow } = require('./dataform')

const TRIGGERS = {
cwv_tech_report: {
@@ -109,7 +111,6 @@ async function messageHandler (req, res) {
* @returns {boolean} Query result.
*/
async function runQuery (query) {
const { BigQuery } = require('@google-cloud/bigquery')
const bigquery = new BigQuery()

const [job] = await bigquery.createQueryJob({ query })
@@ -138,7 +139,6 @@ async function executeAction (actionName, actionArgs) {
* @param {object} args Action arguments.
*/
async function runDataformRepo (args) {
const { getCompilationResults, runWorkflow } = require('./dataform')
const project = 'httparchive'
const location = 'us-central1'
const { repoName, tags } = args
@@ -163,4 +163,4 @@ async function runDataformRepo (args) {
* }
* }
*/
functions.http('dataformTrigger', (req, res) => messageHandler(req, res))
functions.http('dataform-trigger', (req, res) => messageHandler(req, res))
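
The diff above hoists the client imports to module scope; the handler itself reduces to: look up the trigger by name, optionally run the polling query, then execute the configured action. A condensed, dependency-free sketch of that dispatch logic (function and field names are assumptions for illustration; the BigQuery call is stubbed via the injected `runQuery`):

```javascript
// Condensed dispatch sketch. The real handler reads the trigger name from
// the request body and runQuery issues an actual BigQuery job; here both
// collaborators are injected so the control flow is visible on its own.
async function handleTrigger (name, triggers, runQuery) {
  const trigger = triggers[name]
  if (!trigger) {
    return { status: 400, body: `unknown trigger: ${name}` }
  }
  // 'poller' triggers gate the workflow on a boolean query result.
  if (trigger.type === 'poller') {
    const ready = await runQuery(trigger.query)
    if (!ready) {
      return { status: 200, body: 'condition not met, skipping' }
    }
  }
  // 'event' triggers (and satisfied pollers) run the configured action.
  return { status: 200, body: `triggered ${name}` }
}

module.exports = { handleTrigger }
```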
10 changes: 10 additions & 0 deletions infra/dataform-trigger/package.json
@@ -0,0 +1,10 @@
{
"main": "index.js",
"dependencies": {
"@google-cloud/bigquery": "^7.9.1",
"@google-cloud/dataform": "^1.3.0",
"@google-cloud/functions-framework": "^3.4.2"
},
"name": "dataform-trigger",
"version": "1.0.0"
}
61 changes: 61 additions & 0 deletions infra/tf/.terraform.lock.hcl

Some generated files are not rendered by default.