Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: implement queuing mechanism for acceptance tests #1038

Merged
merged 249 commits into from
Oct 26, 2021

Conversation

lionel-nj
Copy link
Contributor

@lionel-nj lionel-nj commented Oct 13, 2021

Summary:

This PR provides support to implement a queueing mechanism dedicated to the acceptance test feature. Said mechanism has to be implemented since running the validator on 1000+ datasets sequentially takes more than 6 hours (on github) which causes the workflow to get cancelled

Expected behavior:

  • CI should be triggered on all commits except if the developer added the keyword "[acceptance test skip]" to its latest commit message
  • Older workflows (only Rule acceptance tests) should be cancelled when a new one is triggered to avoid queueing too many workflows for a same PR
  • Datasets' URL should be fetched from the MobilityDatabase
  • For each dataset (i.e. URL retrieved from the database), the reference version of the validator (from the master branch) should run and produce reference.json and reference_errors.json
  • For each dataset (i.e. URL retrieved from the database), the snapshot version of the validator (from the branch where this CI process is run) should run and produce report.json and system_errors.json

⚙️ The task get-reports (from acceptance_test.yml) works as follows:
Once 1000+ urls are retrieved from the Mobility Database as a list of 256 objects (max), each one of these objects are used to define a matrix. The latest generates jobs that run concurrently. Each one of these jobs run queue_runner.sh which parses the input as an array of json objects:

({"id:"dataset id value-1", "url":"dataset url-1"} {"id:"dataset id value-2", "url":"dataset url-2"} ... {"id:"dataset id value-n", "url":"dataset url-n"})

For each one of these dictionaries both snapshot and reference validator are run.

The reports are all saved in a directory called output - which is persisted at the end of the job, and uploaded to Google Cloud Storage to ease download.

Capture d’écran, le 2021-10-26 à 13 13 54

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with gradle test to make sure you didn't break anything
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification(https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@lionel-nj
Copy link
Contributor Author

lionel-nj commented Oct 26, 2021

@barbeau, as demonstrated by 7fd762c (see https://github.com/MobilityData/gtfs-validator/runs/4010914933?check_suite_focus=true), including "[acceptance test skip]" in the commit message enables to skip the execution of Rule acceptance test workflow. When the keyword is not included in said commit message, the workflow is executed as shown in https://github.com/MobilityData/gtfs-validator/actions/runs/1386081082 from 7fd762c. PTAL.

Note: this mechanism will be documented in #848 where additional documentation for the acceptance test has been created.

Copy link
Member

@barbeau barbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @lionel-nj, LGTM!

- name: Set URL matrix
id: set-matrix
run: |
python3 scripts/mobility-database-harvester/harvest_latest_versions.py -d datasets_metadata -a gtfs_archives_ids.json -l latest_versions.json
Copy link
Member

@barbeau barbeau Oct 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future optimization - I'm not sure what the primary cost is for fetching data from the Mobility Database, but if it's outbound data bandwidth from the Mobility Database you could reduce this by supporting the HTTP header If-Modified-Since and caching a version of the response in the GitHub Action. Then when you send this request to the mobility database with the date of the cached file, you'd only get the full dataset returned to you (and pay for it) if it was modified after the date of the cache file. Both sides (GitHub Action and Mobility Database) would need to support If-Modified-Since for that to work.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Modified-Since

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I will definitely look into that!

@lionel-nj lionel-nj marked this pull request as ready for review October 26, 2021 20:11
@lionel-nj lionel-nj merged commit 8590cd1 into master Oct 26, 2021
@lionel-nj lionel-nj deleted the ci/use-github-matrix branch October 26, 2021 20:11
@isabelle-dr isabelle-dr linked an issue Oct 28, 2021 that may be closed by this pull request
@isabelle-dr isabelle-dr mentioned this pull request Oct 28, 2021
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automated tests to see if a PR will trigger new errors in datasets Write higher level acceptance tests
3 participants