Add robots.txt & sitemap to catalog on cloud.gov #3648
Comments
More context: ckan/ckan#5648 (comment)
For now we can just point to the version that we pulled over from the FCS environment, which is substantially up-to-date.
It's been determined that the robots.txt should be handled in a different way: GSA/data.gov#3648
Really, we should just defer this issue to GSA/data.gov#3648; the robots.txt will probably be served by the nginx proxy in this repo.
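If it does land in the proxy, a minimal sketch of what that could look like (the static path and the idea of baking the file into the image are assumptions, not the repo's actual config):

```nginx
# Hypothetical snippet for the catalog nginx proxy; the path below is an
# assumption about where a robots.txt could be baked into the image.
location = /robots.txt {
    root /etc/nginx/static;  # serves /etc/nginx/static/robots.txt
}
```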
This will require fixing the ckanext-geodatagov extension to upgrade the CLI to Python 3 and CKAN 2.9. See the similar work done on ckanext-dcat_usmetadata.
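For reference, CKAN 2.9 replaces the old paster commands with Click commands registered through the IClick plugin interface. A minimal sketch of what the upgraded registration could look like (the command name and body are placeholders, not the real ckanext-geodatagov code):

```python
import click
import ckan.plugins as plugins

@click.group(name="geodatagov")
def geodatagov():
    """geodatagov CLI commands (formerly paster commands)."""

@geodatagov.command(name="sitemap-to-s3")
def sitemap_to_s3():
    """Hypothetical command: build the sitemap and upload it to S3."""
    click.echo("building sitemap...")

class GeodatagovPlugin(plugins.SingletonPlugin):
    plugins.implements(plugins.IClick)

    def get_commands(self):
        # CKAN discovers these and exposes them as `ckan geodatagov ...`
        return [geodatagov]
```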
Yesterday, I spent a lot of time trying to track down issues with the requested change of moving requirements into the …. This morning I had an issue getting local make commands to run.
Thanks to everyone jumping on the huddle yesterday. Especially thanks to @nickumia-reisys for getting this unstuck with your PR. I should be able to do the last couple things on this ticket and move along.
This had been blocked by issues with building catalog that were resolved with this PR. My sitemap code is now available with the …. I am trying to figure out how to debug in this environment.
The fix for the above is simply running the CLI commands as a cloud.gov task:
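Something along these lines, assuming cf CLI v7 and an app named `catalog` (both the app name and the ckan command here are assumptions):

```sh
# Hypothetical one-off task; app name and command are assumptions.
cf run-task catalog --command "ckan geodatagov sitemap-to-s3" --name sitemap
```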
I have been blocked on platform issues. Our own upstream containers aren't building correctly (and are taking 2k+ seconds) on my system architecture. I can change some build parameters, but I think I would need to rebuild the images upstream to support multi-platform. There is a related issue with the pyproj wheel, as the binary fails to build in my local venv. In the absence of a local solve, I am looking into running ….
My blockers yesterday turned out to be related to an issue with the ….
Well, same issue with my refactor. I have another idea: call the actual `aws s3` CLI from Python, but that feels very janky.
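For concreteness, the janky idea would look something like this (the bucket and paths are placeholders, not the project's real values):

```python
import subprocess

def upload_sitemap(local_path: str, bucket: str) -> None:
    # Shell out to the AWS CLI; assumes `aws` is installed and credentialed
    # in the container. This sidesteps the Python S3 client entirely.
    subprocess.run(
        ["aws", "s3", "cp", local_path, f"s3://{bucket}/sitemap.xml"],
        check=True,
    )

upload_sitemap("sitemap.xml", "example-sitemap-bucket")
```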
Thanks to @nickumia-reisys's hard work, I'm unblocked and testing his solve.
With GSA/ckanext-geodatagov#224, the work on ….
Looks like just staging ….
🥳 https://catalog.data.gov/robots.txt now has a good link to the sitemap bucket! And a mostly valid sitemap file (should be fully valid on merge of PR 615)!
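For illustration, the general shape of a robots.txt that points crawlers at an S3-hosted sitemap (the bucket URL below is made up, not the real one):

```
User-agent: *
Allow: /
# The sitemap host below is illustrative only.
Sitemap: https://example-bucket.s3.amazonaws.com/sitemap/sitemap.xml
```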
With GSA/catalog.data.gov#637 I believe all the work on this is done. The robots file correctly points to the sitemap. The sitemaps are being generated nightly by a GitHub Action. Each sitemap record correctly refers to a dataset.
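A sketch of what that nightly workflow could look like; the workflow name, org/space targets, secrets, and task command are all assumptions rather than the repo's actual action, and it assumes the cf CLI is available on the runner:

```yaml
name: nightly-sitemap
on:
  schedule:
    - cron: "0 5 * * *"  # daily at 05:00 UTC, mirroring the old FCS schedule

jobs:
  sitemap:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger sitemap task on cloud.gov
        env:
          CF_USERNAME: ${{ secrets.CF_USERNAME }}
          CF_PASSWORD: ${{ secrets.CF_PASSWORD }}
        run: |
          cf api https://api.fr.cloud.gov
          cf auth "$CF_USERNAME" "$CF_PASSWORD"
          cf target -o gsa-datagov -s prod
          cf run-task catalog --command "ckan geodatagov sitemap-to-s3"
```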
User Story
In order to make sure crawls can occur on our site, data.gov admins want the sitemap to be available in cloud.gov.
Acceptance Criteria
WHEN the s3-to-sitemap function is run as a task on cloud.gov
THEN all datasets are cataloged in the sitemap
Background
The current robots.txt links to filestore.data.gov, which is the FCS S3 bucket. We need to port this information to cloud.gov and set up a recurring job (it currently runs every day at 5 am; see here).
A rough analysis also shows that all datasets are published in the sitemap. This could be optimized to exclude collection-level data and include only the parent record, which might improve search engine optimization (see the sketch below).
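A hedged sketch of that optimization, assuming collection children are marked with a `collection_package_id` extra (an assumption about the metadata, not confirmed here):

```python
def sitemap_entries(packages):
    """Yield sitemap URLs, skipping datasets that belong to a collection."""
    for pkg in packages:
        extras = {e["key"]: e["value"] for e in pkg.get("extras", [])}
        if extras.get("collection_package_id"):
            continue  # child record; the parent collection covers it
        yield f"https://catalog.data.gov/dataset/{pkg['name']}"
```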
Security Considerations (required)
None, all public data
Sketch