This repository has been archived by the owner on Jan 5, 2023. It is now read-only.

Harvest source import/export #302

Closed
13 tasks done
kimwdavidson opened this issue May 13, 2020 · 8 comments

@kimwdavidson
Contributor

kimwdavidson commented May 13, 2020

User Story

As a data.gov operator, I want a script or process that can export harvest sources from catalog and import them into catalog-next so that I can keep catalog-next up to date with catalog even as harvest configuration changes.

Acceptance Criteria

  • WHEN I follow the import/export process
  • THEN catalog-next is configured with the same harvest sources and configuration as catalog. (When I go to /harvest on both, they look the same.)
  • WHEN I follow the process additional times
  • THEN I only see the changes and the up-to-date version, not another full replication of the database (Though it doesn't need to delete what's already there if it's deleted on the classic catalog)

Details/Tasks for simple import

  • Define where these tools should live ---> Agreed in catalog.data.gov repo
  • Reuse code from next gen harvester for importing Harvest sources
  • Test it is working manually for a few sources
  • Create the organization in CKAN needed for the harvester to ingest datasets to the correct organization
  • Check that if you add new harvester sources, the import function only adds the diff.
  • Check that if you amend harvester sources, the import function only applies the diff (including the amended source)
  • Check that if you delete harvester sources, the import function does not delete the previous harvester sources (if they need to be deleted, you do that manually, but this is not recommended)
  • Review the current tests. If we think there should be more then we should add a ticket for this.
  • Aaron approves PR
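
The incremental behaviour described in the tasks above (add new sources, update amended ones, never delete) can be sketched as a pure planning step. This is only an illustration under stated assumptions, not the code from the merged PRs: the function name `plan_import`, matching sources by their `name` field, and comparing whole dicts to detect amendments are all hypothetical choices.

```python
def plan_import(origin_sources, destination_sources):
    """Decide what an incremental import should do.

    Both arguments are lists of dicts with at least a "name" key
    (an assumed identifier; the real scripts may match differently).
    Sources that exist only at the destination are reported but are
    never scheduled for deletion.
    """
    existing = {src["name"]: src for src in destination_sources}
    origin_names = {src["name"] for src in origin_sources}

    plan = {"create": [], "update": [], "unchanged": [], "left_in_place": []}
    for src in origin_sources:
        current = existing.get(src["name"])
        if current is None:
            plan["create"].append(src["name"])      # new at origin
        elif current != src:
            plan["update"].append(src["name"])      # amended at origin
        else:
            plan["unchanged"].append(src["name"])   # nothing to do

    # Deleted at origin: keep them at the destination, only report them.
    plan["left_in_place"] = sorted(set(existing) - origin_names)
    return plan
```

Running the planner twice against an unchanged origin yields empty `create` and `update` lists, which is the "only the diff" property the checklist asks for.
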
@avdata99
Contributor

avdata99 commented May 21, 2020

PRs for this issue:

  • #18 adds import/export script: Approved and merged
  • #28 adds report for harvest sources
    • Merged
  • #30 adds tests to the import script
    • Merged
  • #31 updates harvest sources when they already exist
    • Pending update tests

@avdata99
Contributor

avdata99 commented May 26, 2020

We can close this, but first we should open issues for the new tests.
Required tests:

  • Test if we import the sources correctly (e.g., we detect config from extras)
  • Test if the organizations are imported correctly (e.g., check images)

Added issue here: GSA/catalog.data.gov#42
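
As a rough illustration of what "detect config from extras" could mean in such a test, here is a minimal sketch. It assumes a source shaped like a CKAN package dict whose `extras` is a list of `{"key": ..., "value": ...}` items, with the harvest configuration stored as a JSON string under the key `config`. The helper name `config_from_extras` is hypothetical and is not taken from the import script.

```python
import json


def config_from_extras(source):
    """Return the harvest config dict stored in a source's extras.

    Assumption: the config lives in an extra with key "config" whose
    value is a JSON-encoded object. Returns {} if it is absent,
    malformed, or not a JSON object.
    """
    for extra in source.get("extras", []):
        if extra.get("key") == "config":
            try:
                config = json.loads(extra.get("value") or "{}")
            except (TypeError, ValueError):
                return {}
            return config if isinstance(config, dict) else {}
    return {}
```

A test along these lines would feed the helper a sample source and assert on the returned dict, including the malformed and missing cases.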

@thejuliekramer
Contributor

@kimwdavidson in order to test this we need to run some commands on the sandbox, so I will need the help of @woodt for this

@avdata99
Contributor

To QA scripts @thejuliekramer @woodt:

Install the requirements file at /tools/harvest_source_import

Read sources:

cd tools/harvest_source_import
python list_harvest_sources.py --source_type=csw

To import sources (we need an API key from a user with permission to add harvest sources (packages) and organizations):

cd tools/harvest_source_import
python import_harvest_sources.py \
    --origin_url=https://catalog.data.gov \
    --destination_url=http://ckan:5000 \
    --destination_api_key=xxxxx-xxxxx-xxxx-xxxxxx \
    --source_type=waf \
    --limit=10
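
For readers wondering how sources might be read from the origin, the sketch below builds a CKAN `package_search` URL filtered to harvest sources. It is an assumption about the approach, not a copy of the script: the helper name `harvest_source_search_url` is hypothetical, and the filter relies on harvest sources being indexed with `dataset_type:harvest` and a `source_type` field, which should be verified against the target CKAN instance.

```python
from urllib.parse import urlencode


def harvest_source_search_url(base_url, source_type=None, start=0, rows=100):
    """Build a CKAN package_search URL that lists harvest sources.

    Assumption: harvest sources are searchable with the Solr filter
    "+dataset_type:harvest", optionally narrowed by "+source_type:<type>".
    """
    fq = "+dataset_type:harvest"
    if source_type is not None:
        fq += f" +source_type:{source_type}"
    query = urlencode({"fq": fq, "start": start, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{query}"


# Example: first page of up to 10 WAF sources on the origin catalog.
url = harvest_source_search_url("https://catalog.data.gov", "waf", rows=10)
```

Keeping URL construction separate from the HTTP call makes the paging and filtering logic easy to test without network access.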

@thejuliekramer
Contributor

thejuliekramer commented May 27, 2020

@avdata99 I am able to run this command with the sandbox api key, but I do not see the datasets listed on the dataset page - the datasets are not private, what do you think is the cause here?

@avdata99
Contributor

> @avdata99 I am able to run this command with the sandbox api key, but I do not see the datasets listed on the dataset page - the datasets are not private, what do you think is the cause here?

This command only imports the harvest sources, @thejuliekramer; it doesn't run them. So you will only see the harvest sources at /harvest

@thejuliekramer
Contributor

thejuliekramer commented May 27, 2020

Harvest sources are listed here https://catalog-next.sandbox.datagov.us/harvest, I ran datajson, csw, waf, arcgis, and single doc

@avdata99
Contributor

> Harvest sources are listed here https://catalog-next.sandbox.datagov.us/harvest, I ran datajson, csw, and waf

Also the orgs with images:
image
