Refactor Harvesting Logic repo to incorporate feedback changes #4577
Comments
@jbrown-xentity and I spoke about the following `load` function:

```python
def load(harvest_datasets, operation):
    ckan = harvester.create_ckan_entrypoint(ckan_url, api_key)
    operations = {
        "delete": harvester.purge_ckan_package,
        "create": harvester.create_ckan_package,
        "update": harvester.update_ckan_package,
    }
    for dataset in harvest_datasets:
        operations[operation](ckan, dataset)
```
I like that enhancement. Will harvesting logic still return three separate lists of datasets?
Yeah, that's what I'm thinking. The reason being that we could potentially do:

```python
compare_result = compare(harvest_source, ckan_source)
"""
compare_result = {
    "create": [id1, id2, id3],
    "update": [id5, id6, id7],
    "delete": [id10, id15, id14]
}
"""
load.expand(compare_result["create"], "create")
load.expand(compare_result["delete"], "delete")
load.expand(compare_result["update"], "update")
```

Since we don't have a rollback in the event of a partial completion, this could be fine? Do we need a rollback in the event that, let's say, 2/8 dataset creations fail?
This brings up a good question around error reporting/tracking. I've been thinking about this in regards to the DCAT pipeline. We should discuss as a team how we want to handle things. Take this case:

[screenshot] or [screenshot]

Airflow is happiest when I put a skip exception in at the failure step, which allows it to skip any downstream tasks gracefully, but this also means that the pipeline is "green", so we need a way of recording/handling those exceptions. It's easy enough to log the failures; we just need to know what to do with them.
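For reference, a minimal sketch of that pattern, assuming a task raises Airflow's `AirflowSkipException` on failure so downstream tasks are skipped; the `validate` call and the logging destination are assumptions, not actual DHL code:

```python
import logging

from airflow.decorators import task
from airflow.exceptions import AirflowSkipException

logger = logging.getLogger(__name__)

@task
def validate_dataset(dataset: dict):
    """Hypothetical validation step: skip downstream tasks on failure,
    but log the exception so a 'green' run isn't silently hiding it."""
    try:
        validate(dataset)  # assumed DHL validation call
    except Exception as err:
        logger.error("validation failed for %s: %s", dataset.get("identifier"), err)
        raise AirflowSkipException(f"skipping downstream tasks: {err}")
```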
Pipeline test sketch as a reference. The changes are untested; it's meant to serve as an aggregate of @robert-bryson's work with the classes and to show the order of operations similar to how they could be in Airflow (not literally, just more of a workflow).
This is what we should do with them: #4582
Using a data validator like pydantic against our proposed classes could be valuable. This could add teeth to our type hints as well, @robert-bryson.
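As a sketch of what that could look like (the field names here are hypothetical, not the actual DHL classes), pydantic rejects a record with a missing required field at construction time:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

class HarvestRecord(BaseModel):
    # hypothetical fields for illustration
    identifier: str
    title: str
    modified: Optional[str] = None

try:
    HarvestRecord(title="Some dataset")  # no identifier supplied
except ValidationError as err:
    print(err)  # reports that `identifier` is missing
```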
Using dataclasses could be a nice way to treat our classes. The equality and hash methods that come with them seem relevant.
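A quick illustration of why that's attractive (a hypothetical `Record` class, not our actual design): `__eq__` comes for free, and `frozen=True` adds `__hash__`, so duplicate records collapse in a set:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    identifier: str
    hash: str  # e.g. a content hash of the dataset

# __eq__ compares field values; frozen=True makes instances hashable
assert Record("id1", "abc") == Record("id1", "abc")
assert len({Record("id1", "abc"), Record("id1", "abc")}) == 1
```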
Making use of the property decorator seems like it could give us more control of how we set, get, and/or delete attributes in our classes. The more restrictive we are with setting attributes could mean fewer headaches down the road (or cause more?).
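For example (a hypothetical sketch, not actual DHL code), a setter can reject bad values before they ever land on the instance:

```python
class HarvestSource:
    """Hypothetical class showing property-based attribute control."""

    def __init__(self, url: str):
        self.url = url  # routes through the setter below

    @property
    def url(self) -> str:
        return self._url

    @url.setter
    def url(self, value: str) -> None:
        # reject anything that isn't a usable harvest source URL
        if not value.startswith(("http://", "https://")):
            raise ValueError(f"invalid harvest source url: {value}")
        self._url = value
```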
If we intend to store an instance of a class, we need to make sure it's serializable. dataclasses comes with an `asdict` helper:

```python
import dataclasses
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class A:
    age: int

@dataclass
class B:
    records: Dict = field(default_factory=lambda: {})

a = A(25)
b = B()
b.records["a"] = a
dataclasses.asdict(b)  # >> {'records': {'a': {'age': 25}}}
```

Here's an example of what happens when I try to use a non-serializable object:

```python
import ckanapi

# continuing the session from above
ckan = ckanapi.RemoteCKAN("path/to/ckan/endpoint", apikey="api_key")
b.records["ckan"] = ckan
b.records  # >> {'a': A(age=25), 'ckan': <ckanapi.remoteckan.RemoteCKAN object at 0x1010b4cd0>}
dataclasses.asdict(b)  # >> TypeError: ActionShortcut.__getattr__.<locals>.action() takes 0 positional arguments but 1 was given
```
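One possible way to sidestep this (a sketch, not a settled design) is to serialize only what we can and stringify the rest, keeping handles like the RemoteCKAN entrypoint out of `asdict`:

```python
import dataclasses
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class B:
    records: Dict = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # recurse into dataclass members normally; fall back to repr()
        # for anything asdict() would choke on (e.g. a RemoteCKAN instance)
        return {
            key: dataclasses.asdict(value) if dataclasses.is_dataclass(value) else repr(value)
            for key, value in self.records.items()
        }
```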
For continuity and organization I'm going to convert our tests to classes. I've also added a CKAN extract test class. Some tests could be...
A test that could be useful for the compare is when CKAN returns nothing, suggesting everything needs to be created.
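Something like this, assuming the `compare(harvest_source, ckan_source)` signature from earlier in the thread and a dict-of-id-to-hash shape for both sources (the import path and test class name are hypothetical):

```python
from harvester import compare  # assumed import path

class TestCompare:
    def test_empty_ckan_source_creates_everything(self):
        harvest_source = {"id1": "hash1", "id2": "hash2"}
        ckan_source = {}  # CKAN returns nothing

        result = compare(harvest_source, ckan_source)

        # with no existing records, everything should be queued for creation
        assert sorted(result["create"]) == ["id1", "id2"]
        assert result["update"] == []
        assert result["delete"] == []
```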
Draft PR. This PR may be open for an extended period of time to allow for fixes addressing issues @btylerburton gets as he tests the module out in Airflow. Or maybe that should be a separate ticket?
I'm going to process all our current harvest sources through the harvesting logic code to fix any issues on my end (excluding any real creation, deletion, or updating).
Ran a test load on catalog-dev admin. The number of datasets has increased from 345 to 720. This test was run on my machine and not in Airflow.
Log of last night's load test of dcatus. Processing seems to have been held up at 00:03:16,472.
@btylerburton What happens when you use a […]? Maybe this reference might provide some ideas to the team: https://www.restack.io/docs/airflow-knowledge-airflow-skip-exception-guide (i.e. having a fallback mechanism to do the email notification might meet the needs?)
User Story
In order to incorporate feedback from the most recent design sessions, changes need to be made to the datagov-harvesting-logic (DHL) repo so that a DCAT-US record can be fully tested end-to-end.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
GIVEN DHL has been supplied as a dependency to an Airflow instance
AND an instance of the ETL pipeline has been invoked
WHEN the `extract` method is called with a valid harvest source config
THEN I would like to generate a single Source class which contains incoming source records, the corresponding records for that harvest source in CKAN, and a list of lifecycle methods to perform ETL operations with.
GIVEN the extract process has extracted the incoming records and their counterparts in CKAN
THEN I would like to compare the incoming records with the CKAN records and generate lists of items to: CREATE, UPDATE and DESTROY
AND to return a count, as a dict or a list, of the number of datasets in each.
GIVEN I have called the delete method using a valid harvest source config
THEN I expect the DHL to attempt to delete each item
AND to return the success/failure metrics of that operation
GIVEN I have called the validate method using a valid harvest source config
THEN I expect the DHL to attempt to validate each item against the current DCAT-US 1.1 schema
AND to return the success/failure metrics of that operation to Airflow
GIVEN I have called the load method using a valid harvest source config
THEN I expect the DHL to attempt to load each item to be created or updated into CKAN using `package_create`
AND to return success/failure metrics of that operation to Airflow
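Taken together, a rough sketch of the intended order of operations (method names mirror the ACs above; everything else, including signatures and return shapes, is illustrative rather than the actual DHL API):

```python
def run_pipeline(harvest_source_config: dict) -> dict:
    """Illustrative end-to-end flow under the ACs above, not a spec."""
    source = extract(harvest_source_config)          # Source class: incoming + CKAN records
    compare_result = compare(source.records, source.ckan_records)  # create/update/delete lists
    validate(compare_result["create"] + compare_result["update"])  # against DCAT-US 1.1
    metrics = load(compare_result)                   # apply changes to CKAN
    return metrics                                   # success/failure counts back to Airflow
```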
Background
[Any helpful contextual notes or links to artifacts/evidence, if needed]
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]
Sketch
Reference