Add robots.txt & sitemap to catalog on cloud.gov #3648

Closed
6 tasks done
jbrown-xentity opened this issue Jan 19, 2022 · 20 comments
@jbrown-xentity
Contributor

jbrown-xentity commented Jan 19, 2022

User Story

In order to make sure crawls can occur on our site, data.gov admins want the sitemap to be available in cloud.gov.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN an s3 bucket is provisioned for catalog
    WHEN the s3-to-sitemap function is run as a task on cloud.gov
    THEN all datasets are cataloged in the sitemap

Background

The current robots.txt links to filestore.data.gov, which is the FCS s3 bucket. We need to port this information to cloud.gov and set up a recurring job (it currently runs every day at 5 am, see here).

Also, a rough analysis shows that all datasets are published in the sitemap. This could probably be optimized to exclude collection-level data and include only the parent record, which might set us up to improve search engine optimization.
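As a rough illustration of that optimization, the filter below keeps only records that are not children of a collection. It is a sketch under stated assumptions: datasets are represented as flat dicts, and the `collection_package_id` field marking a child record is a hypothetical name that may not match how the catalog actually stores this.

```python
def sitemap_candidates(datasets, parent_key="collection_package_id"):
    """Return datasets that do not point at a parent collection.

    `parent_key` is an assumed field name: a record that carries a
    non-empty value for it is treated as a child and left out.
    """
    return [dataset for dataset in datasets if not dataset.get(parent_key)]


# Illustrative records only, not real catalog data.
sample = [
    {"name": "parent-collection"},
    {"name": "child-a", "collection_package_id": "parent-collection"},
    {"name": "standalone"},
]
kept_names = [dataset["name"] for dataset in sitemap_candidates(sample)]
```

Whether dropping child records actually helps search indexing would still need to be checked against crawl data.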

Security Considerations (required)

None, all public data

Sketch

  • Create an s3 bucket for catalog in each environment (see inventory for example), and make sure it is for public use.
  • Update code to take the credentials of s3 bucket to use in the ckanext-geodatagov code
  • Validate code can create and push sitemap to configured s3 bucket in dev
  • Set up a GitHub Action to run regularly (probably daily, time doesn't really matter)
  • Add/update robots.txt to point to the new s3 bucket.
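
The generate-and-push steps in the sketch above could look roughly like the following. This is a minimal sketch, not the ckanext-geodatagov implementation: the function names, the `sitemap-N.xml` naming scheme, and the index file name are assumptions made for illustration. The 50,000-URL cap per file comes from the sitemaps.org protocol, and `put_object` is the standard boto3 S3 client call; the client is passed in by the caller so the example carries no credentials.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file URL limit in the sitemaps.org protocol


def build_sitemap(urls):
    """Render one <urlset> document for the given page URLs."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             f'<urlset xmlns="{SITEMAP_NS}">']
    lines += [f"  <url><loc>{escape(url)}</loc></url>" for url in urls]
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def build_sitemap_files(urls, base_url, chunk_size=MAX_URLS_PER_FILE):
    """Return {filename: xml} for numbered sitemap files plus an index."""
    files = {}
    # enumerate() supplies the file number, so it advances with every chunk.
    for number, start in enumerate(range(0, len(urls), chunk_size), start=1):
        chunk = urls[start:start + chunk_size]
        files[f"sitemap-{number}.xml"] = build_sitemap(chunk)

    index = ['<?xml version="1.0" encoding="UTF-8"?>',
             f'<sitemapindex xmlns="{SITEMAP_NS}">']
    index += [f"  <sitemap><loc>{escape(base_url + name)}</loc></sitemap>"
              for name in files]
    index.append("</sitemapindex>")
    files["sitemap.xml"] = "\n".join(index) + "\n"  # assumed index name
    return files


def upload_files(s3_client, bucket, files):
    """Upload each file using a boto3-style S3 client supplied by the caller."""
    for name, body in files.items():
        s3_client.put_object(Bucket=bucket, Key=name,
                             Body=body.encode("utf-8"),
                             ContentType="application/xml")
```

In a scheduled job, the caller would collect the dataset URLs, call `build_sitemap_files`, and hand the result to `upload_files` with a client built from the bucket credentials.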
@jbrown-xentity jbrown-xentity changed the title Add robots.txt & sitemap to cloud.gov Add robots.txt & sitemap to catalog on cloud.gov Jan 19, 2022
@jbrown-xentity
Contributor Author

We will edit this and should probably remove this

@nickumia-reisys
Contributor

More context: ckan/ckan#5648 (comment)

@mogul
Contributor

mogul commented Jan 20, 2022

For now we can just point to the version that we pulled over from the FCS environment, which is substantially up-to-date.

nickumia-reisys added a commit to GSA/ckanext-datagovcatalog that referenced this issue Jul 14, 2022
It's been determined that the robots.txt should be handled in a different way: GSA/data.gov#3648
nickumia-reisys added a commit to GSA/catalog.data.gov that referenced this issue Jul 14, 2022
Really just defer the issue to GSA/data.gov#3648; The robots.txt will probably be part of the nginx proxy in this repo
@robert-bryson robert-bryson self-assigned this Jul 15, 2022
@jbrown-xentity
Contributor Author

jbrown-xentity commented Jul 22, 2022

This will require fixing the ckanext-geodatagov extension to upgrade the cli to py3 and ckan 2.9. See similar work done on ckanext-dcat_usmetadata:

@robert-bryson
Contributor

robert-bryson commented Sep 7, 2022

Yesterday, I spent a lot of time trying to track down issues with the requested change of moving requirements into the setup.py. Something down the chain is still requiring boto, but I am not finding it and have been unable to run tests because of it.

This morning I have had an issue getting local make commands to run: Error response from daemon: invalid mount config for type "volume": invalid mount path: 'docker-entrypoint.d/* /docker-entrypoint.d' mount path must be absolute. I have not made any changes to the entry point... am investigating. This does not appear to happen with GitHub Actions, and it is the same reason I raised this issue last week: a lot of dev time has been spent tracking down these sorts of things.

@robert-bryson
Contributor

Thanks to everyone for jumping on the huddle yesterday. Special thanks to @nickumia-reisys for getting this unstuck with your PR. I should be able to do the last couple of things on this ticket and move along.

@robert-bryson
Contributor

This had been blocked by issues with building catalog that were resolved with this PR. My sitemap code is now available with the ckan geodatagov sitemap-to-s3 cli command. Running it, however:

Image

Am trying to figure out how to debug in this environment.

@robert-bryson
Contributor

The fix for above is simply running the cli commands as a cf run-task instead of sshing into an app. Running the task raises a small bug with the filename_number not getting incremented, but also an exception: raise ValueError(f'Required parameter {identifier} not set') from the s3 uploading code.
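
One way to make a missing-parameter failure like this easier to diagnose is to check the S3 settings before any upload call is made. The sketch below is illustrative only: the setting names are hypothetical, and the real extension may read its configuration from different keys or from the CKAN config rather than the environment.

```python
import os

# Hypothetical setting names, chosen for illustration only.
REQUIRED_SETTINGS = (
    "S3_BUCKET",
    "S3_REGION",
    "S3_ACCESS_KEY_ID",
    "S3_SECRET_ACCESS_KEY",
)


def load_s3_settings(env=None):
    """Return the required settings, or raise naming every missing one."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise ValueError("Missing S3 settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Failing at startup with the full list of missing names avoids a less specific error surfacing later from inside the upload code.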

@robert-bryson
Contributor

I have been blocked on platform issues. Our own upstream containers aren't building correctly (and taking 2k+ seconds) on my system architecture:

Image

I can change some build parameters, but I think I would need to rebuild the images upstream to support multi-platform. There is a related issue with the pyproj wheel as the binary fails to build in my local venv.

In the absence of a local fix, I am looking into running tmate in the test action with the fancy new action.

@robert-bryson
Contributor

robert-bryson commented Sep 29, 2022

My blockers yesterday turned out to be related to an issue with the ckan 2.9.6 release yesterday. Pinning the ckan version to 2.9.5 allowed tests to pass (though it will need to be updated at some point), and PRs in ckanext-geodatagov and catalog bumped the versions to allow testing with a new action: sitemap-to-s3.

@robert-bryson
Contributor

robert-bryson commented Oct 6, 2022

Well, same issue with my refactor:

Image

I have another idea to call the actual aws s3 cli from python, but that feels very janky.

@robert-bryson
Contributor

Thanks to @nickumia-reisys's hard work, I'm unblocked and testing his solve.

@robert-bryson
Contributor

Well, it's not quite right but it's something:

Image

Huzzah!

@robert-bryson
Contributor

The s3 upload test has been failing. Hmm..

Image

It's some sort of special magic that it can upload a file to a bucket while the only message returned says that the bucket doesn't exist.

@robert-bryson
Contributor

With GSA/ckanext-geodatagov#224, the work on ckanext-geodatagov should be done (hopefully). The work on the catalog side should be done with GSA/catalog.data.gov#578.

@robert-bryson
Contributor

Catalog work is merged and building!

@robert-bryson
Contributor

robert-bryson commented Oct 20, 2022

Looks like just staging catalog-proxy has an issue with the sed command: ERR sed: -e expression #1, char 22: unknown option to `s'. It should be the same as the similar command above. Am investigating.

@robert-bryson
Contributor

🥳 https://catalog.data.gov/robots.txt now has a good link to the sitemap bucket! And a mostly valid (should be on merge of PR 615) sitemap file!

@robert-bryson
Contributor

Image

!!!

@robert-bryson
Contributor

With GSA/catalog.data.gov#637 I believe all the work on this is done. The robots file correctly points to the sitemap. The sitemaps are being generated nightly by a GitHub Action. Each sitemap record correctly refers to a dataset:

Image
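
The two checks described above (robots.txt points at a sitemap, and every sitemap record refers to a dataset) can also be scripted. The following is a sketch using only the Python standard library; the `/dataset/` URL prefix is an assumption about how catalog pages are addressed, and fetching the files over HTTP is left to the caller.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(robots_txt):
    """Return the Sitemap: entries declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() (Python 3.8+) returns None when no entry is present.
    return parser.site_maps() or []


def non_dataset_locs(sitemap_xml,
                     prefix="https://catalog.data.gov/dataset/"):
    """Return <loc> values in a sitemap that do not start with `prefix`.

    The default prefix is an assumed URL layout, not a confirmed one.
    """
    root = ET.fromstring(sitemap_xml)
    locs = [(el.text or "").strip()
            for el in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    return [loc for loc in locs if not loc.startswith(prefix)]
```

An empty list from `non_dataset_locs` means every record in that file matched the expected prefix.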
