Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make remote_data_updated mechanism usable for connectors #15711

Merged
merged 10 commits into from Jun 30, 2020

Conversation

jgoizueta
Copy link
Member

Connectors were missing access to the modified_at field of synchronizations.

For non-connector runner it worked because the modified_at date was passed through the downloader.

@jgoizueta jgoizueta force-pushed the jgoizueta/ch69791/do-sync-table branch 2 times, most recently from 0bce8d3 to f32f039 Compare June 29, 2020 06:18
@jgoizueta jgoizueta changed the base branch from jgoizueta/ch69791/do-sync-table to jgoizueta/ch69791/integration June 29, 2020 12:12
@jgoizueta jgoizueta force-pushed the jgoizueta/ch69791/remote_data_updated branch from df7966f to fe8bd13 Compare June 29, 2020 15:06
@jgoizueta jgoizueta changed the base branch from jgoizueta/ch69791/integration to master June 29, 2020 15:10
@jgoizueta jgoizueta marked this pull request as draft June 29, 2020 15:43
@jgoizueta jgoizueta force-pushed the jgoizueta/ch69791/remote_data_updated branch from ab07fea to c6d48ab Compare June 29, 2020 15:57
@jgoizueta
Copy link
Member Author

Parameter-naming confusion

I've kept the last_modified name used pervasively through the code as the remote source update time, but I've changed the newly introduced modified_at connectors parameter to previous_last_modified to indicate that that is the external source last-modified date corresponding to the previous update (sync).

So, how was the change detection originally implemented for file/url sources?

The Runner class, used by both imports and syncs, implements a remote_data_updated? method which calls Downloader#modified? which compares the last-modified date of the new data (from HTTP header) to the value from the previous sync, which is stored in synchronizations.modified_at and passed to the Downloader through http_options.last_modified

The synchronizations.modified_at column is set at the end of syncs through set_success_state_from from the importer's (a Synchronization::Adapter for syncs) last_modified method that takes the value from the Runner and this in turn from the Downloader.

DatasourceDownloader has similar a mechanism (modified?) using checksums instead of dates. Note that table synchronizations also has a checksum column to store this too.

Now, I see a problem with this mechanism, which is that the first import doesn't update the synchronization modified_at.
So the first sync will always be executed even if unnecessary. I'll open a ticket to fix this which should be a matter of adding synchronization.modified_at = importer.last_modified to ::DataImport::update_synchronization.

And how has this been extended to db connectors (and fixed)?

The remote_data_updated? has been implemented in ConnectorRunner, where it calls the connector provider's instance of this method. The provider should be able to implement this method because it receives through its constructor a previous_last_import parameter, that is taken from synchronizations.modified_at.

The ConnectorRunner also implements a last_modified modified method, also relayed to the provider, which allows the update of synchronizations.modified_at from the last modified data of the data just copied.

@jgoizueta
Copy link
Member Author

⚠️ This can break existing connectors ⚠️

Since this introduces a new parameter, previous_last_modified unexpected by the connector providers, this must be deployed with the corresponding db-connectors changes. See the clubhouse story for details

@jgoizueta jgoizueta marked this pull request as ready for review June 29, 2020 17:13
@jgoizueta jgoizueta requested a review from rafatower June 29, 2020 17:13
@jgoizueta jgoizueta merged commit ed04e66 into master Jun 30, 2020
@jgoizueta jgoizueta deleted the jgoizueta/ch69791/remote_data_updated branch June 30, 2020 12:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants