Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Format Record db.query (Shiny Ingest App) #2012

Merged
merged 36 commits into from
Aug 1, 2018

Conversation

LiamBurke24
Copy link
Contributor

@LiamBurke24 LiamBurke24 commented Jul 24, 2018

This PR adds functionality to insert a format record and an optional format-variables record in the BETY db. The emphasis here is that both of these records can be created simultaneously improving the ease of registering new data with BETY.

Description

I included the insert.format.vars function that allows the user to register a format record in the BETY "formats" and "format_variables" tables. This query includes all the input fields found in the BETY GUI: name, header, skip, mimetype_id, and notes.

Additionally it includes the following inputs for the format-variables record: name, units, storage type, column number, format_id, variable_id.

Motivation and Context

This change adds more information to the formats table than the db.query included in the dbfiles.input.insert function. More importantly, it lays the groundwork for completing the shiny app with a GUI for registering the format-variables record.

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

CHANGELOG.md Outdated
- "Test BETY" button allows users create a record in BETY with `dbfile.input.insert`
- Added separate `db.query` to create format record.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be more helpful if you named the new function in the log, rather than just referencing the functions called within your new function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change this when I rewrite the function db.files.


output$FormatRecordOut <- renderPrint({print(FormatRecordList)})

cmd <- paste0("INSERT INTO formats ",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This needs to be it's own function in the DB package, not here in the shiny app.
  • This function needs to return the new Format ID.
  • I'm fairly confident you don't need to include the NOW() bits -- the existence of that in other functions is a legacy that I thought had been cleaned up, but evidently not as you're emulating it
  • This function also needs to take an (optional) formats-variables table and insert all those relationships.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I can drop the NOW() bits, but there are columns for date created and date updated in the formats table. Just want to confirm that you want to lose that functionality?

Would it be easier/ more efficient to create a separate format-variables table query since I'd be writing to two separate tables? Just double checking before I get started. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOW: I'm 99% sure that those will be updated by postGRES automatically if you don't manually set them. I'd drop and then confirm that they're being set.

Creating Formats and Formats-Variables separately is exactly the pain in arse that we currently have to deal with in the BETY webpage and specifically what we're trying to avoid. Exactly analogous to why we create dbfiles and inputs at the same time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOW: sounds good.

F-V: Ok got it. Just to clarify, by the same time, you mean in the same function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like me to resolve the mimetype_id in this new db.files function? I am currently resolving the ID within the shiny app. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in the same function

mimetype_id would need to be an input to the function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! This revision should have all the changes you asked for!

FormatRecordList$NewFormatName <- "Test_Format_Name_99" #input$NewFormatName
FormatRecordList$HeaderBoolean <- "TRUE" #ifelse((input$HeaderBoolean == "Yes"), "TRUE", "FALSE")
FormatRecordList$SkipLines <- "3" #input$SkipLines #This should appear only if header = TRUE
FormatRecordList$FormatNotes <- "These are some test notes" #input$FormatNotes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like all of these are still set at testing values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Good catch. Fixing this uncovered a bug that took me over an hour to squash!

#' bety <- betyConnect()
#' dbfile.format.variable.insert('Test_Format', header = TRUE, skip = 2, mimetype_id = 1060, con = bety$con, format_variables = FALSE)
#' }
dbfile.format.variable.insert <- function(format_name, header, skip, mimetype_id, format_notes = "", con, format_variables = FALSE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please drop dbfile from the name as this function isn't inserting anything into that table. Given that this function meant to parrallel query.format.vars, why not call it insert.format.vars?

Also, when a value is missing, such as format_variables we typically set it to NULL not to FALSE. For the remaining values (description, etc), pass these as a data frame to format_variables rather than passing them individually. Passing them individually is a pain and will be error prone (misalignment of vectors)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An additional note on argument order: I recommend putting the con argument before format_notes, so that all the arguments without default values come at the beginning of the call and all the arguments with default values come after them.

This is partly because it's kind to the user to put "important" (~= "have to set them every time" ~= "function can't guess it") arguments first, but also so that unnamed values get matched to the required arguments before the optional ones. As currently written, the function requires a minimum of five arguments but dbfile.format.variable.insert('Test_Format', TRUE, 2, 1060, bety$con) will set format_notes to bety$con and then complain that argument "con" is missing, with no default.

(This second point is also a reason to name all arguments when calling any function as a user, but that ship never sailed in the R world)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdietze I included dbfile in the name since it appeared to be a convention for all the query functions in the dbfiles.R document. I can change the name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@infotroph Thanks. Hahaha. Good point.

dbfile.format.variable.insert <- function(format_name, header, skip, mimetype_id, format_notes = "", con, format_variables = FALSE,
description = NA, units = NA, var_notes = NA, var_name = NA, max = NA, min = NA, standard_name = NA, label = NA, type = NA){

formatid <- get.id(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really don't follow what you're doing here. It looks like you're using get.id to create the format record, but with only just the name and mimetype. Next, you use a second query to create a DIFFERENT format record, that has header, skip, mimetype_id, notes, name, but doesn't include the returning id necessary to get the format ID. Therefore, this second record with more information is never returned. You need to make a choice between these two ways of creating records and do just one, not both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdietze My mistake. I thought that get.id resolved the format id but I now see that this was a mistake.

can you clarify what I need to do to return the format id from the second type of query?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind. I see this in the comments below. Thanks.


#### Format Variables Query ####
if(format_variables == TRUE){
cmd_fv <- paste0("INSERT INTO variables ",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query is wrong. First, you need to loop over all the variables (virtually no files will just have one column). Second, you need to be inserting records into formats_variables not variables as the goal is to associate variables with a specific format. As such you've got to revise all the columns being passed. You also need to use the correct format ID when creating these relationships

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Makes sense.

"(header, skip, mimetype_id, notes, name) VALUES (",
header, ", ", skip, ", '", mimetype_id, "', '", format_notes, "','", format_name, "')")

db.query(query = cmd, con = con)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you're trying to parallel existing code here, but there are two very serious problems here that really ought to be fixed in the existing code (#395) and should not be repeated in new functions:

  • db.query is a wrapper around DBI::dbGetQuery, which is documented to be for SELECT-like statements only. We should not be using it for inserts and updates.
  • Manually generating SQL by pasting together strings and variables is extremely likely to produce SQL injection vulnerabilities, and we'll need to be even more cautious of this as we expose more functionality through Shiny and the like.

We don't yet have a fully formed standard method (hence the open status of #395), but in general instead of building an SQL string inside of any PEcAn function we want to assemble the data that we want to insert and then let an existing function to the translation to SQL. When I needed to generate inserts for #1903 I used DBI::dbWriteTable, and I think @ashiklom intended PEcAn.db::db_merge_into to be useful for similar cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@infotroph I see you point and agree that the use of raw SQL is potentially dangerous. DBI::dbWriteTable seems like it is a better solution in most instances. That said, it appears that this function only returns TRUE/FALSE and you can't specify the "RETURNING" argument which I've asked @LiamBurke24 to add. I find this to be an unfortunate and odd design limitation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdietze I think I follow, but to be clear: "this function" being DBI::dbWriteTable, and "RETURNING" being the SQL keyword, as in INSERT INTO ... VALUES ... RETURNING id? If so: Yeah, I've wished for that as well. Should we consider adding that to db_merge_into?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I meant. Would be good if @ashiklom chimed in about db_merge_into since I haven't used that before and I don't completely follow what it's doing internally. But yes, if there's a way to add RETURNING id

In the mean time, what do you want @LiamBurke24 to do immediately? Fields generated via pull-down menus and other queries should be safe (e.g. mimetype ID), but is there a way to properly sanitize anything that's coming in via a text field (e.g. format name)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the point of db_merge_into is to try to solve somewhat elegantly what I felt was the very common problem of updating an SQL table with a local table based on one or more "key" columns. The objective is to have a function that, if one were to run it twice with the same arguments, would maybe do something the first time (if the target SQL table was missing rows compared to the local input, those rows would be added) but would basically always do nothing the second time (because the table has already been updated). I was hoping this would allow us to simplify code for adding various things to the database by eliminating the need for logic like the following (which is, effectively, what db_merge_into does using dplyr filtering joins:

1. Get SQL table IDs
2. Subset local table to IDs not in SQL table
3. If local table has any remaining rows after subset, write them to SQL

For an example of this in action, see the unit test.

The db_merge_into function uses insert_table as a backend, which was designed to be a more robust and user-friendly version of DBI::dbWriteTable that also does things like ensure that column types match and automatically subset and align input data to columns that are in the target SQL table.

I was not aware of RETURNING as an SQL statement when I came up with this -- that's good to know about for the future. We could add it to db_merge_into, but I'm not sure it's necessary. db_merge_into currently returns a dplyr "lazy SQL query" data.frame (which can easily be loaded in memory with just a dplyr::collect) which includes the full target SQL table (including ID columns) inner-joined with the input data. So the IDs and any other columns of interest from the SQL table are already in the output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't yet have a fully formed standard method (hence the open status of #395), but in general instead of building an SQL string inside of any PEcAn function we want to assemble the data that we want to insert and then let an existing function to the translation to SQL.

I strongly agree with this. More generally, creating a well-documented API for inserting data into bety should be a top priority for our next releases, as it could make a lot of tasks much easier and safer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LiamBurke24 you want to try revising to use db_merge_into? Sound like this would also eliminate the need for a loop in the formats_variables insert.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LiamBurke24 I tried to make db_merge_into easy to use, but ping me on Slack if you have questions. The general idea is to first construct a data.frame that looks something like your target SQL table (in this case, formats_variables) in terms of column names, and then specify which columns determine the uniqueness of the join via a character vector by argument. E.g.

input_formats <- data.frame(
  header = header,
  skip = skip,
  mimetype_id = mimetype_id,
  notes = format_notes,
  name = format_name
)
inserted_data <- db_merge_into(input_formats, "formats_variables", con, by = c("format_name", "mimetype_id"))

The by argument is optional -- if omitted, it joins based on all columns the two data frames have in common (i.e. if any of the common columns are different, then it assumes you meant to insert a new record). In practice, you usually only want to check some of the columns -- e.g. above, if something with the same format_name and mimetype_id already exists, then you probably don't want to add a new format with the same name and id but, say, slightly different notes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. Thanks @mdietze @ashiklom and @infotroph! Yes, I will revise to use db_merge_into. If this works, should I delete the dbfiles.format.variable.insert function? I'm not attached to it I just didn't know if it would be of any use.

Or is the idea that I should use db_merge_into instead of the SQL queries to make a function in the dbfiles.R document? Am I making any sense? haha.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind. This question is addressed below. Sorry.

Copy link
Member

@ashiklom ashiklom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor suggestions here and there.

@@ -636,3 +636,63 @@ dbfile.id <- function(type, file, con, hostname=PEcAn.remote::fqdn()) {
invisible(NA)
}
}

#' Function to Insert Records into the Formats and Variables Tables
#' @name dbfile.format.variable.insert
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unnecessary roxygen tag. You should remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we already know it's a function, so you should drop the "Function to" part of the title.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

#' @param type ?
#' @author Liam Burke
#'
#' @return formatid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a (brief) description of this. Is it a vector? List? Scalar integer value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More generally, you should avoid redundancy in your documentation. E.g. TRUE/FALSE Boolean is redundant, as is format_name Format Name. Also, in addition to a general description, I think parameter documentation should generally try to make clear (1) the R type of value (e.g. numeric, factor, character, integer, logical), (2) whether the values have to be a scalar, or can be a vector, and (3) if there are additional format requirements (e.g. units string probably has to be parse-able by udunits2).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks. This is awesome advice.

"(header, skip, mimetype_id, notes, name) VALUES (",
header, ", ", skip, ", '", mimetype_id, "', '", format_notes, "','", format_name, "')")

db.query(query = cmd, con = con)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LiamBurke24 I tried to make db_merge_into easy to use, but ping me on Slack if you have questions. The general idea is to first construct a data.frame that looks something like your target SQL table (in this case, formats_variables) in terms of column names, and then specify which columns determine the uniqueness of the join via a character vector by argument. E.g.

input_formats <- data.frame(
  header = header,
  skip = skip,
  mimetype_id = mimetype_id,
  notes = format_notes,
  name = format_name
)
inserted_data <- db_merge_into(input_formats, "formats_variables", con, by = c("format_name", "mimetype_id"))

The by argument is optional -- if omitted, it joins based on all columns the two data frames have in common (i.e. if any of the common columns are different, then it assumes you meant to insert a new record). In practice, you usually only want to check some of the columns -- e.g. above, if something with the same format_name and mimetype_id already exists, then you probably don't want to add a new format with the same name and id but, say, slightly different notes.


output$FormatRecordOut <- renderPrint({print(FormatRecordList)})

dbfile.format.variable.insert(header = FormatRecordList$HeaderBoolean, skip = FormatRecordList$SkipLines, mimetype_id = FormatRecordList$NewmimetypeID,
format_notes = FormatRecordList$FormatNotes, format_name = FormatRecordList$NewFormatName, con = bety$con, format_variables = FALSE)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using db_merge_into as discussed above, you may not actually need dbfile.format.variable.insert at all. You may just be able to get away with constructing FormatRecordList as a data.frame instead (with as many rows as you want -- it sounds like you could be adding multiple variables at once, which db_merge_into is specifically designed to handle) and then just insert it all at once with a single db_merge_into statement. A nice advantage of db_merge_into in the context of Shiny is that, by design, it will never insert duplicate rows (at least, with a well-specified by argument), so it can be more-or-less safely called over-and-over in a single session (though there is a bit of computational overhead, so I would avoid a design that calls it hundreds of times).

@mdietze Was this more-or-less what you had in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still worth having the function, even if it only ends up being 2-3 lines, so that it doesn't have to be recreated in the future. I had asked @LiamBurke24 to make this a function, rather than just having the insert code directly in the shiny app (original version) precisely because I find going through BETY's current multi-step process of creating the format, and then editing that record to associate variables with the format, to be a pain. The fact that it's taking 4 of us having a long discussion to sort out how to do this correctly implies that not having a function could lead to mistakes by others.

Additional point -- I wonder if this function should return the full format-variable object that @bcow has query.format.vars returns rather than just the format ID? I don't think anyone is going to need to create a format and then immediately use it in load_data, but it's worth considering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdietze @ashiklom So we're all on the same page. I am going to use db_merge_into to rewrite the dbfile.format.variable function (which will shortly be renamed to something with out dbfile in the name).

I would also like to clarify what we would like the function to be returning. SHould I start with the format ID and then proceed to the full format-variable object if need be/ time permitting?

Finally, @ashiklom based on what you're saying, db_merge_into looks like a perfect fit for this application.

P.S. where should I be storing this file? In the dbfiles.R document? Or somewhere else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • yes, insert.format.vars is my vote for name
  • returning format ID is sufficient for now.
  • put it in it's own file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdietze Awesome. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up question: where should I put this new function? Is pecan/base/db/R/ ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, yes it should definitely be in the db package! pecan/base/db/R/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect. Thanks!

@LiamBurke24
Copy link
Contributor Author

LiamBurke24 commented Jul 25, 2018

@mdietze @ashiklom @infotroph Good evening everyone. That took a lot longer than I originally thought. All the changes that I pushed should address your concerns and suggestions. I really appreciate all the fantastic feedback.

#' @param con Bety connection object
#'
#' @param con SQL connection to BETYdb
#' @param format_notes Character string that describes format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these need the format_ part? That seems like a given. Also, it's generally good practice to put the required values before the optional ones, and to put a sensible default on optional variables. So, for example, name and mimetype should come before header, skip, and notes. notes can probably default to NULL, skip to 0, and header to TRUE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have an issue with axing the format_ part. My thought was that the user may need some extra reinforcement that the names and notes correspond to the formats record and not the variables record.

And good point! Just fixed the order of the arguments and added these default conditions.

#' \item name: vector of character strings
#' \item unit: vector of character strings
#' \item storage_type: vector of character strings
#' \item column_number: vector of integers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, this section should

  1. make it clear which variables are required (variable_id) and which are optional (everything else)
  2. Explain what actually goes in each column (not just its type) and when it's OK to leave values blank
  3. if a column only needs a few values, what should the user put in the empty cells? NULL? NA?

In particular, be aware that name is the variable name in the file (e.g. column head) and only needs to be specified if it's different from the BETY variable name. unit needs to be udunit2 parsable and only needs to be specified if the units are different from the BETY standard. column number only needs to be specified for tabular data that doesn't have a header. storage_type serves 2 purposes. First, it indicates columns stored in a format different from expected (e.g. numbers stored as characters). Second, it stores the POSIX codes used to parse any time variables (e.g. a column with a 4-digit year would be %Y). This last bit should include a "see also" to strptime

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clear on all fronts. These changes are in the most recent commit

#' formats_variables_df <- tibble::tibble(
#' variable_id = c(327, 2.98e+02), # integer
#' name = c("New Name", "Other New Name"), #
#' unit = c("Yoonits", "Other Units"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really bad example because:

  1. variable 327 (c2n_root) is unlikely to be in a column named "New Name" and "Yoonits" is most definitely not convertible by udunits2 to the standard units for C:N ratios. Data on C:N ratios are also not going to be used to store the day of year (%j)
  2. variable numbers should not be entered using scientific notation. Given that variable 298 is LE (latent heat flux), the name, units, and storage type entered are just as absurd as the first column.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize for the confusion. This was a placeholder example since I had not made a formal formats-variable record with a real data set at that point. I have now done this and the new example is updated and should address these points.

#' storage_type = c("%j", "%Y"),
#' column_number = c(as.integer(22), as.integer(24)),
#' )
#' insert.format.vars(con = bety$con, format_notes = "Info about format", format_name = "New Format", header = TRUE, skip = 2, mimetype_id = 1090, formats_variables_df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improve this example too -- "New Format" is an awful format name, notes is almost as bad (and should probably go last), and make sure to pass the format variable table by name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also address these suggestions in the new commits.

notes = format_notes,
name = format_name,
stringsAsFactors = FALSE
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted previously, you might want to perform some basic checks on these values before inserting them. For example, if skip is present it needs to be numeric, header needs to be boolean, notes and names need to be character. Mimetype_id needs to be numeric, and if you want to be really good you'll make sure that ID exists before you insert the record. Similarly, you might want to check the format name doesn't already exist. In general, inserting junk into the db creates problems that are remarkably time consuming to clean up, so it's worth the extra effort on inserts to make sure that what you're inserting makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I perform basic checks on all format types of all inputs and check if the format name exists in the new commits. Can you please clarify what you mean by making sure that the ID exists before inserting the record? Originally I thought you meant the mimetype_id, but this should exist already. I would appreciate some clarification. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the variable_id that you need to make sure exists. A test for this is implicit in the unit test above

  u2 <- dplyr::tbl(con, "variables") %>% dplyr::select(id, units) %>% dplyr::filter(id %in% formats_variables[[1, "variable_id"]]) %>% pull(units)

because this will fail if the variable_id is missing, but that implicit test isn't great as the function will just crash with no error message is being returned

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. That was kind of counter intuitive since I keep forgetting that I'm building this function for other applications besides the shiny app. Testing if the variable ID exists is pointless in my app since all the ID's are resolved by the user selecting a variable name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually doesn't make sense in this application since I am not passing the variable name. The user selects BETY variables from a dropdown menu and then the app resolves the variable_id. I included a test but I placed it in a conditional suppress statement so that I can suppress it in my app.

format_id_df <- matrix(data = NA, nrow = n, ncol = 1)
for(i in 1:n){
format_id_df[i,1] <- format_id
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This for loop is unnecessary, just specify data = format_id (instead of NA) when you create the matrix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Good idea!

## Insert format record
inserted_formats <- db_merge_into(formats_df, "formats", con = con, by = c("name", "mimetype_id"))
## Make query data.frame
formats_variables_input <- cbind(format_id_df, formats_variables_df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, should verify the entries in this table before inserting it. Beyond verifying the correct types, you should also make sure that units are both ud.is.parseable and ud.are.convertible (i.e. grab the BETY units associated with that variable ID and verify that you can convert the data units to the BETY units)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these checks are in the new commits!!

…specific example. Test if all inputs are in the correct format. Convert all NA's to before loading to BETY. Existing format name test in progress.
@LiamBurke24
Copy link
Contributor Author

Summary of changes since last Review:

  • rearrange arguments in function in order of importance and update default values
  • Note and which variables are required in both processes
  • Explain what should be entered in each column for format_variables
  • Explain what the user should put in empty cells (NA for sake of clarity and testing. I replace NA's with "" as per the BETY standard.)
  • Updated the example to use plausible values now that I have more experience with format-variables record registration.
  • Test if format name already exists
  • test if units are ud.is.parseable and ud.are.convertable
  • QUESTION: What ID should I check before making the record? I believe db_merge_into already checks the variable_id and only inserts the record if the record does not exist.

#'
#' @param con SQL connection to BETYdb
#' @param format_name The name of the format. Type: character string.
#' @param notes Additional description of the format: character string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted earlier, please put optional variables at the end. This one in particular could be a lot of text, so will probably be best to put last

#' @param notes Additional description of the format: character string.
#' @param header Boolean that indicates the presence of a header in the format. Defaults to "TRUE".
#' @param skip Integer that indicates the number of lines to skip in the header. Defaults to 0.
#' @param mimetype_id The id associated with the mimetype of the format. Type: integer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put required variables (e.g. mimetype_id) before optional variables that have defaults

#' @details The formats_variables argument must be a 'tibble' and be structured in a specific format so that the SQL query functions properly. All arguments should be passed as vectors so that each entry will correspond with a specific row. All empty values should be specified as NA.
#' \describe{
#' \item{variable_id}{(Required) Vector of integers.}
#' \item{name}{(Optional) Vector of character strings. Needs only be specified if it differs from the BETY variable name.}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify what it is

#' \describe{
#' \item{variable_id}{(Required) Vector of integers.}
#' \item{name}{(Optional) Vector of character strings. Needs only be specified if it differs from the BETY variable name.}
#' \item{unit}{(Optional) Vector of type character string. Should be in a format parseable by the udunits library and need only be secified if they differ from the BETY standard.}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be explicit about what "they" refers to here (the units of the data in the file).

#' name = c("NPP", NA, "YEAR"), #
#' unit = c("g C m-2 yr-1", NA, NA),
#' storage_type = c(NA, NA, "%Y"),
#' column_number = c(NA, NA, NA),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this example, column_number is all NA's and thus doesn't need to be specified

)
}

# Test if format name already exists
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same bit of code repeated twice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it


# Test if name is a character string
if(!is.character(name)){
PEcAn.logger::logger.error(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so right now you're sending a message to the log that this is an error, but then you're letting the function proceed on. You either want to exit here (logger.severe) or return with an error code that your Shiny app knows how to handle. The latter is really worth considering because the error message here will not be shown to the user of the app and an exit would just cause your app to crash, rather than giving the user an opportunity to fix bad entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll handle this on the shiny side. Thanks!

notes = format_notes,
name = format_name,
stringsAsFactors = FALSE
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the variable_id that you need to make sure exists. A test for this is implicit in the unit test above

  u2 <- dplyr::tbl(con, "variables") %>% dplyr::select(id, units) %>% dplyr::filter(id %in% formats_variables[[1, "variable_id"]]) %>% pull(units)

because this will fail if the variable_id is missing, but that implicit test isn't great as the function will just crash with no error message is being returned

formats_variables[is.na(formats_variables)] <- ""

### udunit tests ###
for(i in 1:nrow(formats_variables)){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This test should probably be within the if(!is.null(formats_variables)){
  2. it's also better to use seq_along over 1:nrow
    Both these changes protect against the sequence ending up over 1:0 and then iterating over empty tables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I'll move the tests into that if statement. I don't follow on your second point.

FormatRecordList$NewFormatName <- input$NewFormatName
FormatRecordList$HeaderBoolean <- input$HeaderBoolean
FormatRecordList$HeaderBoolean <- ifelse((input$HeaderBoolean == "Yes"), "TRUE", "FALSE")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mentioned this in the other PR, but why not have input$HeaderBoolean just return TRUE/FALSE instead of Yes/No?

@LiamBurke24
Copy link
Contributor Author

Ok. This should be the last push necessary for this PR.

Merge branch 'develop' of github.com:PecanProject/pecan into link_bety

# Conflicts:
#	CHANGELOG.md
@mdietze mdietze merged commit 9db3fc4 into PecanProject:develop Aug 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants