Develop script to enhance sitemap with data set landing page URLs #378
Comments
Script is written and working locally. Need to confirm details of
and then tested. |
We will deploy the script on AWS gamma and on one of the prodX EC2 instances and run it as a cron job; we'll set the frequency later (weekly, for example). Beware: Catherine saw weird results when running the script on a Friday at 6pm PDT. The script should update an existing sitemap so that the output is a complete sitemap. I propose managing the script in its own repository or in a repository that holds other scripts applied to the portal. We will wait for @jordanpadams to validate that option. |
After breakout discussion today: the script will be run manually by the operations team when needed, every month. The script can be archived in the current repository and documented on the internal wiki. |
Run with a cron job. However, compare the new number of results to the current number of results. If new < current, postpone the job for 24 hours or some such. Additionally, store a copy of the current sitemap alongside the script and use that as a base; otherwise we will be appending this output to the previous output, and so on. |
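For illustration, the two safeguards in the comment above (skip the run when the result count drops, and always build from a pristine base copy) could look roughly like this. This is only a sketch, not the script from this ticket: the function names `should_postpone` and `build_sitemap` are hypothetical, and it assumes the base sitemap is a standard sitemaps.org `urlset` document.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace from the sitemaps.org protocol.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def should_postpone(new_count, current_count):
    """Return True when the new result count is lower than the current one."""
    return new_count < current_count


def build_sitemap(base_xml, entries):
    """Return the base sitemap plus one <url> per (loc, lastmod) entry.

    The base text is parsed fresh on every call, so repeated runs never
    append to the output of a previous run.
    """
    ET.register_namespace("", NS)  # serialize without an ns0: prefix
    root = ET.fromstring(base_xml)
    for loc, lastmod in entries:
        url = ET.SubElement(root, f"{{{NS}}}url")
        ET.SubElement(url, f"{{{NS}}}loc").text = loc
        ET.SubElement(url, f"{{{NS}}}lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")
```

A cron wrapper would call `should_postpone` first and exit without writing anything when it returns True, leaving the retry (24 hours later or otherwise) to the scheduler.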
status: integration ongoing |
status: looking at adding GitHub integration portion and pds-github-util |
have github integration working locally (did not use pds-github-util). setting up virtual environment on gamma to test this. |
@c-suh is testing/deploying the script on pds-gamma |
looking into error with gitpython. Hopeful to get this done tomorrow but not optimistic. |
Question for everyone on this ticket: there's mention of "github integration" and in the current code on pdscloud-gamma I see a function What's the goal of this? Do we want to check the Also question for @c-suh: does this code exist only on pdscloud-gamma right now? Is there a separate repository for it? Should it be part of "operations"? Thanks in advance! |
Update: currently blocked by |
@nutjob4life hi! Yes, the "github integration" and placeholder function |
@c-suh okay couple more questions, bear with me! I'm trying to figure out some motivations here. First: why does the To me, this feels like committing the results of an ephemeral database query into something that's really meant for software or configuration, not data. Code changes deliberately over time, but data changes willy-nilly. I don't see the need to track this in git. What about a mechanism like XInclude? The Second question: since this script is indeed code, I think it ought to be in git (JPL Enterprise GitHub or public GitHub). What's the motivation for having it as an attachment on the wiki instead? I apologize if these seem like dumb questions. The "operations" side of PDS-EN is new to me, so I appreciate your helping me figure these out! Answers posted privately at https://jpl.slack.com/archives/D05E0RKEL8K/p1687799043772159 👍 Further background at: https://jpl.slack.com/archives/C05DH31C95E/p1686874407444169 |
Okay @c-suh, could you (and anyone else who's interested) review the changes I made in If that all looks okay, we can then attach the code to the wiki page. |
@nutjob4life looks great; ty! I edited the existing code to match a couple of your practices (e.g. the logger), and something went wonky with the existing venv, so I recreated it. If you would verify these, please, and then I'll clean up the old one, and we can finish up with documentation and the weekly cron job. If you've no preference, would you handle the latter (cron job), and I will handle the former (documentation)? |
@c-suh works great!
I updated the |
@nutjob4life ty! I tweaked the job to source the venv and to save the output. If you wouldn't mind reviewing the documentation once I have that done (likely tomorrow morning), I will let you know once it's ready. |
Looks fine. You don't need to activate the venv if you're using its |
Note: moved this to the other user account because it created an issue with pulling from the repo (since settled by the SAs here). Otherwise seems to be working well but will wait to see the next weekly run. |
to discuss with @c-suh about adding this script to the repo |
@c-suh @jordanpadams since there's sensitive info in this code, could we put it into a private repo? For more protection, it could be a private repo on Enterprise GitHub, rather than here. Thoughts? |
@nutjob4life @c-suh what kind of private information is this? and is there a way for us to use environment variables on the server instead for some of this information to avoid including it in the software? |
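As a rough sketch of the environment-variable idea raised above: the script could read each sensitive value at startup and fail loudly when one is missing. The helper name `get_setting` and the variable name `SITEMAP_GITHUB_TOKEN` are hypothetical; the thread does not say which values are actually sensitive.

```python
import os


def get_setting(name, environ=os.environ):
    """Return a required setting from the environment or raise a clear error."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running")
    return value


# Hypothetical usage inside the script:
# token = get_setting("SITEMAP_GITHUB_TOKEN")
```

With this approach the values would live in the server account's environment (or in the crontab entry) instead of in the committed code.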
Addressing points from the Slack conversation and a few more (will also add these to the PR):
|
💡 Description
Once NASA-PDS/portal-tasks#65 is completed, we should augment the output file with URLs for all PDS data set landing pages
Super simple script to paginate through all the results from https://pds.nasa.gov/services/search/search?wt=json&q=product_class:Product_Collection%20OR%20product_class:Product_Bundle%20OR%20product_class:Product_Data_Set_PDS3%20OR%20product_class:Product_Document&fl=resLocation,modification_date&start=0&rows=1000
- `numFound` to keep track of the total number of results
- `start` and `rows` to page through the results
- `resLocation` for the URL to the data; use the latest `modification_date` (there can be more than one, and it will require conversion to YYYY-MM-DD)
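The paging described above might be sketched as follows. This is not the script from this ticket. It assumes the service returns a Solr-style JSON body (`{"response": {"numFound": ..., "docs": [...]}}`) and that `modification_date` values are ISO 8601 strings; neither is confirmed in this issue. The function names `fetch_page`, `iter_docs`, and `latest_date` are made up for the example.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

SEARCH_URL = "https://pds.nasa.gov/services/search/search"
QUERY = (
    "product_class:Product_Collection OR product_class:Product_Bundle "
    "OR product_class:Product_Data_Set_PDS3 OR product_class:Product_Document"
)


def fetch_page(start, rows):
    """Fetch one page of search results as a parsed JSON dict."""
    params = urlencode(
        {"wt": "json", "q": QUERY, "fl": "resLocation,modification_date",
         "start": start, "rows": rows},
        quote_via=quote,  # encode spaces as %20, as in the URL above
    )
    with urlopen(f"{SEARCH_URL}?{params}") as response:
        return json.load(response)


def iter_docs(fetch=fetch_page, rows=1000):
    """Yield every result document, advancing start by rows until numFound."""
    start = 0
    while True:
        body = fetch(start, rows)["response"]  # assumed Solr-style envelope
        docs = body["docs"]
        yield from docs
        start += rows
        if not docs or start >= body["numFound"]:
            return


def latest_date(values):
    """Return the latest modification date as YYYY-MM-DD, or None if absent."""
    if isinstance(values, str):
        values = [values]
    if not values:
        return None
    # Assumes ISO 8601 strings in one timezone, which sort chronologically as text.
    return max(values)[:10]
```

Passing the fetch function as a parameter keeps the paging loop testable without hitting the live service. A caller would loop over `iter_docs()` and pair each `resLocation` with `latest_date(doc.get("modification_date"))`.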