Add Statcan data scraping functionality for issue #47 (#51)
xrendan merged 20 commits into BuildCanada:main
Conversation
@xrendan @verrixkio Hey! Here's an initial incomplete draft for #47. A few points:

- Single cron job for dataset syncing: there's a cron job that enqueues one-off jobs to sync stale datasets. This is a simple, flexible approach, and means scheduling can be changed easily. However, it does introduce some latency to dataset syncs (up to 1 hour as currently set). That feels acceptable to me, but let me know what you think.
- No dataset history: for simplicity, I haven't stored any history for the datasets. If that's a problem, let me know and I can update.
- No API endpoints yet: will add this next. Let me know if you have a preference for a specific path.

Finally, this is the first time I've written any Ruby/Rails code, and some of it is fairly different to my usual setup (Elixir/Phoenix), so if I've made any basic mistakes please say. 😅
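The cron-enqueues-stale-syncs flow described above can be sketched in plain Ruby (a simulation, not the PR's actual code: fixed intervals stand in for cron schedules, and `StatcanDataset`/`StatcanSyncJob` are replaced by plain objects):

```ruby
require "time"

# Stand-in for a StatcanDataset row: an interval in seconds replaces the
# cron expression the real model stores in sync_schedule.
Dataset = Struct.new(:id, :sync_interval, :last_synced_at) do
  # A dataset is stale if it has never synced, or its interval has elapsed.
  def needs_sync?(now = Time.now)
    last_synced_at.nil? || now - last_synced_at >= sync_interval
  end
end

# What the hourly cron job does conceptually: pick out the stale datasets
# (each of which would then get a one-off sync job enqueued).
def filter_stale(datasets, now = Time.now)
  datasets.select { |d| d.needs_sync?(now) }
end

now = Time.now
datasets = [
  Dataset.new(1, 3600, now - 7200), # last synced 2h ago -> stale
  Dataset.new(2, 3600, now - 600),  # last synced 10m ago -> fresh
  Dataset.new(3, 3600, nil)         # never synced -> stale
]
puts filter_stale(datasets, now).map(&:id).inspect # => [1, 3]
```

Because the cron tick is hourly, a dataset that becomes due just after a tick waits up to one full tick for its sync job, which is the "up to 1 hour" latency mentioned above.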
lib/tasks/statcan.rake (Outdated)

```ruby
namespace :statcan do
  desc "Setup Statcan datasets"
  task setup_datasets: :environment do
    statcan_datasets = [
```
Names/urls/schedules taken from the existing workflows https://github.com/BuildCanada/OutcomeTracker/tree/c165db79919c77fc66f0663c5267bd0b0e300337/.github/workflows.
Is there a specific reason for the existing schedules?
Note, we might not need this task now that the Avo resource has been added, as it's easy to add the datasets via the admin panel.
It'd be good to add this to the db seeds instead so people have a good database to start with: `db/seeds/canada.rb`
I originally implemented this as seed data, and then Claude said I should do it as a task instead. 😄
Refactored back to seeds: 32e6671.
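For anyone following along, a minimal idempotent seeds entry might look like this (a sketch only: the URL, name, and schedule values are placeholders, not the PR's actual datasets):

```ruby
# db/seeds/canada.rb (sketch) -- find_or_create_by! keeps seeds re-runnable
StatcanDataset.find_or_create_by!(
  statcan_url: "https://example.test/statcan-table.csv"
) do |dataset|
  dataset.name = "Example dataset"
  dataset.sync_schedule = "0 6 * * *" # daily at 06:00
end
```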
Marked as ready. This should now be feature complete.
I'll do a final pass on the dataset names/links to double-check, but other than that everything is ready for review. 🙂
xrendan
left a comment
Looks good to me; some minor nits to make this more Rails-y, but after that it's good to go.
lib/tasks/statcan.rake (Outdated)

```ruby
namespace :statcan do
  desc "Setup Statcan datasets"
  task setup_datasets: :environment do
    statcan_datasets = [
```
It'd be good to add this to the db seeds instead so people have a good database to start with: `db/seeds/canada.rb`
app/jobs/statcan_sync_job.rb (Outdated)

```ruby
class StatcanSyncJob < ApplicationJob
  queue_as :default

  def perform(statcan_dataset_id)
```
nit: Using an id here is fine (and more space-efficient, especially when using Sidekiq or another Redis-backed queue), but practically it's nicer to let Rails do its magic with GlobalID and just pass in the object.
So instead, when it's in the queue, the argument looks like `gid://OutcomeTrackerAPI/StatcanDataset/<id>` rather than a bare id. This is nice from an operational perspective because you know what you're working with instead of just having an integer.
app/jobs/statcan_sync_job.rb (Outdated)

```ruby
def perform(statcan_dataset_id)
  dataset = StatcanDataset.find(statcan_dataset_id)
  data = StatcanFetcher.fetch(dataset.statcan_url)
```
nit: Rails has a convention of preferring fat models and skinny controllers (and imo that applies to jobs too).
I'd rather have a method on `StatcanDataset` called `sync!` instead of having the logic for refreshing live in the job itself.
Combining this and the above suggestion you get:
```ruby
def perform(statcan_dataset)
  statcan_dataset.sync!
end
```
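A plain-Ruby stand-in for what that `sync!` method could look like (a sketch under the assumption that `StatcanFetcher.fetch` returns parsed CSV data; names mirror the PR, but none of this is the actual app code):

```ruby
# "Fat model" sketch: the refresh logic lives on the dataset object, so the
# job body shrinks to statcan_dataset.sync!. A lambda stands in for
# StatcanFetcher.fetch to keep the example self-contained.
class SyncableDataset
  attr_reader :statcan_url, :current_data, :last_synced_at

  def initialize(statcan_url, fetcher:)
    @statcan_url = statcan_url
    @fetcher = fetcher # stand-in for StatcanFetcher.fetch
  end

  # Fetch fresh data and record when the sync happened.
  def sync!(now = Time.now)
    @current_data = @fetcher.call(statcan_url)
    @last_synced_at = now
    self
  end
end

fetcher = ->(url) { "parsed-csv-for:#{url}" }
dataset = SyncableDataset.new("https://example.test/table.csv", fetcher: fetcher)
dataset.sync!
puts dataset.current_data # => "parsed-csv-for:https://example.test/table.csv"
```

In the real model, `sync!` would also persist `current_data` and `last_synced_at` with an `update!` call, which a lambda-backed sketch can't show.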
```ruby
queue_as :default

def perform(current_time = Time.current)
  datasets = StatcanDataset.select(:id, :sync_schedule, :last_synced_at)
```
nit: You should use a scope for stale datasets instead of querying directly here:

```ruby
datasets = StatcanDataset.stale.select(:id, :sync_schedule, :last_synced_at)
```
I'm not sure this is easily doable. The stale logic involves parsing the cron schedule, and I don't think that's possible within a SQL query. I could refactor to store e.g. a next_sync_at field, but then we're caching state.
Maybe it's OK to leave the method approach for now?
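To illustrate the next_sync_at idea mentioned above (a plain-Ruby sketch: an interval in seconds stands in for the cron expression, which the app would parse with the fugit gem to get the next occurrence):

```ruby
require "time"

# If next_sync_at is computed at sync time and persisted, staleness becomes
# a plain time comparison that SQL can express, at the cost of caching state.
def compute_next_sync_at(interval_seconds, from = Time.now)
  # Real version: Fugit-parsed cron schedule -> next occurrence after `from`.
  from + interval_seconds
end

ScheduledDataset = Struct.new(:id, :next_sync_at)

now = Time.now
datasets = [
  ScheduledDataset.new(1, now - 60),                        # due a minute ago -> stale
  ScheduledDataset.new(2, compute_next_sync_at(3600, now))  # due in an hour -> fresh
]

# With next_sync_at persisted, the Rails scope could be a real composable
# Relation, e.g.: scope :stale, ->(now = Time.current) { where(next_sync_at: ..now) }
stale_ids = datasets.select { |d| d.next_sync_at <= now }.map(&:id)
puts stale_ids.inspect # => [1]
```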
Bit new to this, but for clarity, my understanding is:
- scopes should return an `ActiveRecord::Relation` object (so that scopes are composable)
- we need non-SQL logic for the cron schedule parsing/calculation
- the suggested approach (using a mix of scope + `all.select` + application logic) wouldn't return a Relation object, and also wouldn't use the provided filters for the `all.select` query, i.e. it would fetch all the `current_data` for all datasets, even though that's not required for the cron job

Hope that makes sense. Very possible I'm misunderstanding something!
You're mostly right; you could do it as something like this, but it sucks.

```ruby
scope :stale, ->(current_time = Time.current) {
  where(id: all.select { |dataset| dataset.needs_sync?(current_time) }.pluck(:id))
}
```
```ruby
validates :sync_schedule, presence: true
validate :valid_cron_expression

def self.filter_stale(datasets, current_time = Time.current)
```
```ruby
class StatcanDataset < ApplicationRecord
  validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp }
```
You can add your scope here
Suggested change:

```ruby
validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp }

scope :stale, ->(current_time = Time.current) {
  all.select { |dataset| dataset.needs_sync?(current_time) }
}
```
Amazing. Thank you for all the feedback. ❤️ Will make the changes tomorrow. 🚀
This pull request introduces functionality for managing and syncing Statistics Canada datasets, including new models, jobs, services, migrations, and tests. It also includes updates to dependencies and configurations to support the new features.
See #47 for context.
New functionality for Statistics Canada dataset management:
- `app/models/statcan_dataset.rb`: Added `StatcanDataset` model with validations for `statcan_url`, `name`, and `sync_schedule`, as well as methods for determining stale datasets and validating cron expressions.
- `app/jobs/statcan_cron_job.rb`: Created `StatcanCronJob` to enqueue sync jobs for stale datasets based on their schedules.
- `app/jobs/statcan_sync_job.rb`: Created `StatcanSyncJob` to fetch and update data for individual datasets using `StatcanFetcher`.
- `app/services/statcan_fetcher.rb`: Added `StatcanFetcher` service to fetch and parse CSV data from Statistics Canada URLs.

Database changes:
- `db/migrate/20250707155320_create_statcan_datasets.rb`: Added migration to create the `statcan_datasets` table with fields for dataset metadata, sync schedule, and current data.
- `db/schema.rb`: Updated the schema to include the new `statcan_datasets` table and its indexes.

Configuration and dependencies:
- `Gemfile`: Added the `csv` and `fugit` gems to support CSV parsing and cron expression handling. [1] [2]
- `config/initializers/good_job.rb`: Configured a cron schedule for `StatcanCronJob` to run hourly.
- `config/environments/test.rb`: Changed the Active Job queue adapter to `:test` to use the `assert_enqueued` helper.

Tests:
- `test/models/statcan_dataset_test.rb`: Added unit tests for the `StatcanDataset` model, including validations and sync logic.
- `test/jobs/statcan_cron_job_test.rb`: Added tests for `StatcanCronJob` to verify job enqueueing for stale datasets.
- `test/jobs/statcan_sync_job_test.rb`: Added tests for `StatcanSyncJob` to ensure data fetching and updates work as expected.
- `test/test_helper.rb`: Included `minitest/mock` for mocking in tests.

These changes collectively enable automated syncing of Statistics Canada datasets, ensuring data is regularly updated and accessible for further processing or analysis.
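For reference, the hourly scheduling described above maps onto GoodJob's cron configuration; a hedged sketch (the `statcan_cron` key name is illustrative, not necessarily what the PR uses):

```ruby
# config/initializers/good_job.rb (sketch)
Rails.application.configure do
  config.good_job.enable_cron = true
  config.good_job.cron = {
    statcan_cron: {
      cron: "0 * * * *",       # top of every hour
      class: "StatcanCronJob"  # enqueues one-off StatcanSyncJobs for stale datasets
    }
  }
end
```

GoodJob parses the `cron:` expression with fugit, which is why the gem appears in the Gemfile alongside the model's own cron-expression validation.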