Create a database cleanup script / set of functions #1630

Open
mccabete opened this Issue Sep 7, 2017 · 10 comments

@mccabete
Contributor

mccabete commented Sep 7, 2017

It would be good to have a script, a set of functions, or both that rounds up records that need to be fixed. This could apply to:

  • files without input records,
  • inputs with no file records,
  • file formats without files,
  • dbfiles records that point to files that don't exist,
  • etc.

This could just be some queries that put records into a generic human-readable file that could be hand-edited, then re-fed into a function that culls the records that remain.

@mccabete

Contributor

mccabete commented Sep 7, 2017

@tonygardella What DB errors do we tend to get?

@ashiklom

Contributor

ashiklom commented Sep 7, 2017

Let's make an editable list! I'll start a checklist here -- feel free to edit this comment (pencil button in the top right of the issue) to add stuff, or check stuff off that has been implemented.

  • Orphan dbfiles records -- i.e. dbfiles records with no input, model, etc. records
  • Childless input records -- i.e. input records that don't have corresponding dbfiles records
  • formats records that don't have any matching dbfiles
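The three checklist items above are all "find rows with no partner" queries, which could look roughly like the sketch below. The table and column names (`dbfiles.container_id` / `container_type`, `inputs.format_id`) follow BETYdb conventions but are assumptions here, and the queries only cover the `Input` container type; SQLite stands in for PostgreSQL purely for illustration.

```python
import sqlite3

# Orphan dbfiles: dbfiles whose container record (here, an input) is gone.
ORPHAN_DBFILES = """
SELECT d.id FROM dbfiles d
LEFT JOIN inputs i
  ON d.container_type = 'Input' AND d.container_id = i.id
WHERE d.container_type = 'Input' AND i.id IS NULL
"""

# Childless inputs: inputs with no dbfiles record pointing at them.
CHILDLESS_INPUTS = """
SELECT i.id FROM inputs i
LEFT JOIN dbfiles d
  ON d.container_type = 'Input' AND d.container_id = i.id
WHERE d.id IS NULL
"""

# Unused formats: formats not referenced by any input.
UNUSED_FORMATS = """
SELECT f.id FROM formats f
LEFT JOIN inputs i ON i.format_id = f.id
WHERE i.id IS NULL
"""

def find_orphans(conn, query):
    """Return the ids flagged by one of the cleanup queries."""
    return [row[0] for row in conn.execute(query)]
```

Each query result could then be fed into the human-readable dump file proposed above rather than deleted directly.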
@dlebauer

Member

dlebauer commented Sep 7, 2017

@ashiklom

Contributor

ashiklom commented Sep 7, 2017

@ankurdesai suggestion -- Function for a given site and met product that would purge everything except original data download.

@mccabete

Contributor

mccabete commented Sep 7, 2017

Mike's suggestion: a function that removes old runs, and especially met generated by old runs

  • include provisions to prevent deleting met that can't be automatically recreated by PEcAn
@tonygardella

Contributor

tonygardella commented Sep 7, 2017

It used to be that failed downloads of raw met data created a DB record even though the actual files are not there. For example, if a site only had 2000-2010 and you ran 1998-2002, it would update the input record to say 1998-2002 was there but only download 2000-2002. Not sure how to fix this mismatch other than to delete old runs and their associated files as much as possible.

@ashiklom

Contributor

ashiklom commented Sep 7, 2017

> failed downloads of raw met data created a db record even though actual files are not there

Input records, especially met records, should have a start date, end date, and format specification, right? Perhaps we can use the format record and load_data functionality to get the actual bounds of the data, compare them to the start and end date, and if there is a mismatch, then flag the file for deletion? Similarly, would it be too aggressive to delete any input file that is missing any of these specifications? I.e. Are there circumstances where it's OK for an input to be missing a start date, end date, or format?
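The mismatch check proposed here could be sketched as below, using tonygardella's 1998-2002 vs. 2000-2010 example. The record layout (a dict with `start_date`/`end_date`) and the idea that the actual bounds come from something like load_data are assumptions for illustration, not the real PEcAn API; the dateless-record case is flagged rather than deleted, since (per dlebauer) soil datasets legitimately lack dates.

```python
import datetime

def check_record_dates(record, actual_start, actual_end):
    """Compare an input record's recorded date range against the bounds
    actually present in the data, and return a list of problems found."""
    problems = []
    if record.get("start_date") is None or record.get("end_date") is None:
        # Not necessarily an error: e.g. soil datasets have no date range.
        problems.append("missing start/end date")
        return problems
    if record["start_date"] < actual_start:
        problems.append("record starts before the data does")
    if record["end_date"] > actual_end:
        problems.append("record ends after the data does")
    return problems
```

A record claiming 1998-2002 over data that actually spans 2000-2010 would be flagged for starting before the data does, and could then go into the hand-editable deletion list.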

@dlebauer

Member

dlebauer commented Sep 7, 2017

@ashiklom soil datasets often do not have a start and end date

@tonygardella

Contributor

tonygardella commented Sep 7, 2017

@ashiklom I think that would work. We could update the records to reflect the data that is actually there or delete them. I think for this it would be best to just use the met format records.

@mccabete

Contributor

mccabete commented Sep 11, 2017

A checklist of functions to create. Also available in a Google Doc.

A quick and dirty synthesis of GitHub comments/issues, turned into a list of functions:

Mass Execution functions

  • write_out_list_of_records(results_from_sql_query, path)
    – Creates a human-readable file with SQL query results
  • cull_entries(human_readable_file, connection)
    – Parses the file and deletes any remaining entries

Checks

  • check_file_exists(input_record, connection)
    – Checks that the file associated with an input record still exists
    – Maybe returns errors from load_data / read.output?
  • check_record_dates(input_record, connection)
    – Checks whether an input record's date range matches that of its associated file

Find DB Entry Functions

  • find_formats_without_inputs(connection, date_range = NULL)
    – PostgreSQL query for all formats without associated inputs
  • find_inputs_without_formats(connection, date_range = NULL)
    – PostgreSQL query for all inputs without formats
  • find_variables_without_formats(connection, date_range = NULL)
    – PostgreSQL query for all variables without associated formats
  • find_files_without_inputs(connection, date_range = NULL)
    – PostgreSQL query for all files without an associated input
  • find_inputs_without_files(connection, date_range = NULL)
    – PostgreSQL query for all inputs without an associated file

Find Run Functions

  • find_runs(connection, date_range, user, stage = NULL)
    – PostgreSQL query that lists all failed runs within a date range and for a given user. The date_range and user arguments are possibly overkill, but would hopefully prevent accidental deletion of production runs
    – Could optionally list all runs that failed at a certain step (e.g., all runs that failed at the model step)
  • find_in_progress_runs(connection, user = NULL)
    – PostgreSQL query that lists all runs still in progress
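The find_runs entry in the checklist above could be sketched like this. The `runs` table layout (`user_id`, `started_at`, `finished_at`, `stage`) and the convention that a NULL `finished_at` marks a run that never completed are assumptions for illustration, not the real BETYdb schema; SQLite again stands in for PostgreSQL.

```python
import sqlite3

def find_runs(conn, start, end, user_id, stage=None):
    """List runs for a user within a date range that never finished,
    optionally restricted to the stage at which they stopped."""
    sql = """
        SELECT id FROM runs
        WHERE user_id = ?
          AND started_at BETWEEN ? AND ?
          AND finished_at IS NULL
    """
    params = [user_id, start, end]
    if stage is not None:
        sql += " AND stage = ?"
        params.append(stage)
    return [row[0] for row in conn.execute(sql, params)]
```

Requiring both a date range and a user makes the query deliberately narrow, matching the concern above about accidentally sweeping up production runs.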

@mccabete referenced this issue Sep 28, 2017

Merged: Database cleanup #1700 (5 of 9 tasks complete)