Develop script to enhance sitemap with data set landing page URLs #378

Closed
jordanpadams opened this issue Mar 17, 2023 · 23 comments · Fixed by #415

Comments

@jordanpadams
Member

jordanpadams commented Mar 17, 2023

💡 Description

Once NASA-PDS/portal-tasks#65 is completed, we should augment the output file with URLs for all PDS data set landing pages

Super simple script to paginate through all the results from https://pds.nasa.gov/services/search/search?wt=json&q=product_class:Product_Collection%20OR%20product_class:Product_Bundle%20OR%20product_class:Product_Data_Set_PDS3%20OR%20product_class:Product_Document&fl=resLocation,modification_date&start=0&rows=1000

  • Query URL
  • Use numFound to keep track of total number of results
  • Increment start and rows to page through the results
  • Use resLocation for URL to data, use the latest modification_date (can be more than 1, and will require conversion to YYYY-MM-DD)
  • Output should build an XML document that looks something like the following, which will be inserted into the sitemap:
  <url>
    <loc>https://pds.nasa.gov/ds-view/pds/viewDocument.jsp?identifier=urn%3Anasa%3Apds%3Asystem_bundle%3Adocument_pds4_standards%3Adph_1.17.0&version=1.0</loc>
    <lastmod>2021-10-14</lastmod>
  </url>
  <url>
    <loc>https://pds.nasa.gov/ds-view/pds/viewDocument.jsp?identifier=urn%3Anasa%3Apds%3Amisc%3Adocument_cassini%3Apds3_titan_ion_ug_itar_mar2012&version=1.0</loc>
    <lastmod>2021-04-20</lastmod>
  </url>
  ...
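
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the script that was eventually deployed. It assumes the search service returns a Solr-style JSON body (`response.numFound`, `response.docs`), that `modification_date` values are ISO 8601 timestamps, and that `resLocation` may be a relative path that needs the `https://pds.nasa.gov` prefix. The function names are illustrative only.

```python
import json
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape

SEARCH_URL = "https://pds.nasa.gov/services/search/search"
QUERY = " OR ".join(
    f"product_class:{name}"
    for name in ("Product_Collection", "Product_Bundle",
                 "Product_Data_Set_PDS3", "Product_Document")
)
BASE_URL = "https://pds.nasa.gov"  # assumption: only needed if resLocation is relative


def fetch_page(start, rows):
    """Fetch one page of results; assumes a Solr-style JSON response."""
    params = urllib.parse.urlencode({
        "wt": "json", "q": QUERY,
        "fl": "resLocation,modification_date",
        "start": start, "rows": rows,
    })
    with urllib.request.urlopen(f"{SEARCH_URL}?{params}") as response:
        return json.load(response)


def as_list(value):
    """Normalize a field that may be missing, a single value, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def latest_date(values):
    """Return the newest date as YYYY-MM-DD, or None if there is none.

    Assumes ISO 8601 timestamps: the first 10 characters are the date,
    and the strings sort chronologically.
    """
    dates = [v[:10] for v in as_list(values) if v]
    return max(dates) if dates else None


def iter_entries(fetch=fetch_page, rows=1000):
    """Page through all results, yielding (loc, lastmod) pairs."""
    start = 0
    while True:
        body = fetch(start, rows)["response"]
        for doc in body["docs"]:
            locations = as_list(doc.get("resLocation"))
            if locations:
                loc = urllib.parse.urljoin(BASE_URL, locations[0])
                yield loc, latest_date(doc.get("modification_date"))
        start += rows
        # Stop once numFound is covered, or if a page comes back empty.
        if start >= body["numFound"] or not body["docs"]:
            break


def render_url(loc, lastmod):
    """Render one <url> element, escaping characters such as '&'."""
    lines = ["  <url>", f"    <loc>{escape(loc)}</loc>"]
    if lastmod:
        lines.append(f"    <lastmod>{lastmod}</lastmod>")
    lines.append("  </url>")
    return "\n".join(lines)
```

Calling `"\n".join(render_url(loc, lastmod) for loc, lastmod in iter_entries())` would produce the partial XML. Note that `render_url` writes `&` as `&amp;`, which the sitemap protocol requires inside `<loc>`.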
@jordanpadams jordanpadams transferred this issue from NASA-PDS/portal-tasks Mar 17, 2023
@jordanpadams jordanpadams added p.should-have enhancement New feature or request labels Mar 17, 2023
@jordanpadams jordanpadams removed the task label Mar 28, 2023
@c-suh
Contributor

c-suh commented Apr 25, 2023

Script is written and working locally. Before it can be tested, need to confirm:

  1. where to put this script
  2. when to run it (frequency and actual date and time)
  3. where to output this partial XML file (e.g. should it be emailed?)

@tloubrieu-jpl
Member

We will have the script deployed on AWS gamma and one of the prodX EC2 instances.

Run it in a cron job; we'll set the frequency, for example weekly. Beware: Catherine experienced weird results when running the script on Friday at 6pm PDT.

The script should update an existing sitemap so that the output is a complete sitemap.

I propose to have the script managed in a specific repository or in a repository containing other scripts applied to the portal.

We will wait for @jordanpadams to validate that option.

@tloubrieu-jpl
Member

tloubrieu-jpl commented May 2, 2023

After breakout discussion today:

The script will be run manually when needed by the operation team, every month.

The script can be archived in the current repository and documented on the internal wiki.

@c-suh
Contributor

c-suh commented May 2, 2023

Run with a cron job.* However, compare the new number of results to the current number of results. If new < current, postpone the job for 24 hours or some such.

Additionally, store a copy of the current sitemap alongside the script and use that as a base; otherwise each run will append its output to the previous run's output, and so on.
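
A rough sketch of those two safeguards, assuming the stored base sitemap is a standard sitemap whose root element closes with `</urlset>`, and that the new `<url>` elements arrive as already-rendered strings. The function names are made up for illustration and are not taken from the deployed script.

```python
def should_update(new_count, last_count):
    """Return False when the search reports fewer results than the last run.

    A drop in numFound is treated as a sign the service is mid-update,
    so the caller would postpone the job instead of shrinking the sitemap.
    """
    return new_count >= last_count


def build_sitemap(base_sitemap, url_elements):
    """Insert rendered <url> elements before the base's closing </urlset>.

    Always starting from the untouched base copy keeps each run from
    stacking its output on top of the previous run's output.
    """
    closing = "</urlset>"
    head, found, tail = base_sitemap.rpartition(closing)
    if not found:
        raise ValueError("base sitemap has no closing </urlset> tag")
    addition = "\n".join(url_elements)
    return f"{head.rstrip()}\n{addition}\n{closing}{tail}"
```

Because `build_sitemap` never modifies the base text it is given, running it twice with the same inputs yields the same complete sitemap.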

@jordanpadams
Member Author

status: integration ongoing

@jordanpadams
Member Author

jordanpadams commented May 9, 2023

status: looking at adding GitHub integration portion and pds-github-util

@c-suh
Contributor

c-suh commented May 11, 2023

have github integration working locally (did not use pds-github-util). setting up virtual environment on gamma to test this.

@tloubrieu-jpl
Member

@c-suh is testing/deploying the script on pds-gamma

@c-suh
Contributor

c-suh commented May 16, 2023

looking into error with gitpython. Hopeful to get this done tomorrow but not optimistic.

@jordanpadams jordanpadams assigned nutjob4life and unassigned c-suh Jun 15, 2023
@nutjob4life
Member

Question for everyone on this ticket: there's mention of "github integration" and in the current code on pdscloud-gamma I see a function do_git_stuff().

What's the goal of this? Do we want to check the sitemap.xml into a repository?

Also question for @c-suh: does this code exist only on pdscloud-gamma right now? Is there a separate repository for it? Should it be part of "operations"?

Thanks in advance!

@nutjob4life
Member

nutjob4life commented Jun 21, 2023

Update: currently blocked by DSIO-4051 (fixed!) DSIO-4059 (fixed!)

@c-suh
Contributor

c-suh commented Jun 21, 2023

@nutjob4life hi! Yes, the "github integration" and placeholder function do_git_stuff() are there to check the updated sitemap.xml into the website repository. And yes, the code is currently only on pdscloud-gamma. There is no separate repository for it, nor any intention to put it into the "operations" repository; rather, it was going to be posted to this section along with any other documentation for this script (which I can do; just let me know).

@nutjob4life
Member

nutjob4life commented Jun 23, 2023

@c-suh okay couple more questions, bear with me! I'm trying to figure out some motivations here.

First: why does the sitemap.xml go into a git repository?

To me, this feels like committing the results of an ephemeral database query into something that's really meant for software or configuration, not data. Code changes deliberately over time, but data changes willy-nilly. I don't see the need to track this in git.

What about a mechanism like XInclude? The sitemap.xml could essentially be generated from a template (which is checked into git) which has a placeholder for "put the product URLs here". (Heck, sitemap.xml could be a named pipe that's generated on the fly! 😄)

Second question: since this script is indeed code, I think it ought to be in git (JPL Enterprise GitHub or public GitHub). What's the motivation for having it as an attachment on the wiki instead?

I apologize if these seem like dumb questions. The "operations" side of PDS-EN is new to me, so I appreciate your helping me figure these out!

Answers posted privately at https://jpl.slack.com/archives/D05E0RKEL8K/p1687799043772159 👍

Further background at: https://jpl.slack.com/archives/C05DH31C95E/p1686874407444169

@nutjob4life
Member

nutjob4life commented Jun 27, 2023

Okay @c-suh, could you (and anyone else who's interested) review the changes I made in do_git_stuff? I don't want to make a pull request since it's potentially sensitive. The file is on gamma in the expected location.

If that all looks okay, we can then attach the code to the wiki page.

@c-suh
Contributor

c-suh commented Jun 28, 2023

@nutjob4life looks great; ty! I edited the existing code to match a couple of your practices (e.g. the logger), and something went wonky with the existing venv, so I recreated it. If you would verify these, please, and then I'll clean up the old one, and we can finish up with documentation and the weekly cron job. If you've no preference, would you handle the latter (cron job), and I will handle the former (documentation)?

@nutjob4life
Member

nutjob4life commented Jun 28, 2023

@c-suh works great!

(base) $ date
Wed Jun 28 13:53:47 PDT 2023
(base) $ cd sitemap-script
(base) $ . bin/activate
(sitemap-script) (base) $ python main.py
INFO:__main__:=== Starting update of sitemap.xml with ds-view pages on Wed Jun 28, 2023 at 13:53:58
INFO:__main__:This run's numFound is 15993
INFO:__main__:Last run's numFound is 15993
INFO:__main__:Repository's sitemap hasn't changed since last run. Continuing from the existing local, partial sitemap.
INFO:__main__:Appending results to make a complete sitemap
INFO:__main__:Moving complete sitemap to repository
INFO:__main__:The sitemap.xml hasn't been updated this time, so no changes to commit
(sitemap-script) (base) $ echo 🎉
🎉
(sitemap-script) (base) $ 

I updated the pds4 crontab to run the script weekly.

@c-suh
Contributor

c-suh commented Jun 28, 2023

@nutjob4life ty! I tweaked the job to source the venv and to save the output. If you wouldn't mind reviewing the documentation once I have that done (likely tomorrow morning), I will let you know once it's ready.

@nutjob4life
Member

Looks fine. You don't need to activate the venv if you're using its python executable directly, but it doesn't hurt.

@c-suh
Contributor

c-suh commented Jul 5, 2023

Note: moved this to the other user account because it created an issue with pulling from the repo (since settled by the SAs here). Otherwise seems to be working well but will wait to see the next weekly run.

@jordanpadams
Member Author

Need to discuss adding this script to the repo with @c-suh.

@nutjob4life
Member

@c-suh @jordanpadams since there's sensitive info in this code, could we put it into a private repo? For more protection, it could be a private repo on Enterprise GitHub, rather than here. Thoughts?

@jordanpadams
Member Author

@nutjob4life @c-suh what kind of private information is this? Also, is there a way to keep some of it in environment variables on the server, so that it doesn't have to be included in the software?

@c-suh
Contributor

c-suh commented Jul 27, 2023

Addressing points from slack conversation and a few more (will also add these to the PR):

  • update to use environment variable
  • get rid of separate virtual environment
  • fold script into this repository
  • update cronjob accordingly
  • change name and location of log
  • change user that runs this script (to same user that runs other scripts from this repository)
  • change time of weekly cron to coincide with other weekly tasks to better ensure timely update (git pull) on other machines
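
For the first item, reading the sensitive value from the environment might look something like the sketch below. The thread does not say what the sensitive value is or what the variable is called, so `SITEMAP_GIT_TOKEN` is a purely hypothetical name, as is the helper itself.

```python
import os


def get_secret(name="SITEMAP_GIT_TOKEN", environ=os.environ):
    """Read a required secret from the environment instead of the source.

    `name` is a hypothetical variable name; failing loudly when it is
    unset avoids a cron run that silently proceeds without credentials.
    """
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value
```

The variable would then be set for the user that runs the cron job, which keeps the value out of the repository.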
