Generate input tables #7

Merged
kdorheim merged 25 commits into master from generate_input_tables on Jun 12, 2020
Conversation

kdorheim
Collaborator

@crvernon and @bpbond this is a larger PR than I would have liked, and I plan on doing smaller PRs in the future. However, I felt a PR of this size was needed to give an idea of the package structure.

The objective here is to make it possible to generate Hector input csv files and ini files for the CMIP6 scenarios with minimal effort from the user. Right now it is quite a hassle and has been a recurring problem for the RCMIP and Hector calibration work. Ideally this should be easy for any user and should make our lives easier when we have to generate new scenarios in the future.

This PR includes helper functions that would be useful for developers and advanced users who are trying to generate their own Hector inputs. But I think the average Hector user would interact with the generate function, which lets users generate the Hector csv inputs and ini files (not implemented yet) for the canonical scenarios, i.e. the RCP, SSP, and DECK scenarios.

If you have comments, concerns, or would like to chat before looking over this PR please let me know. I look forward to working with the both of you on this and hearing your feedback. Thank you very much!

@kdorheim kdorheim requested review from bpbond and crvernon May 28, 2020 17:01
@kdorheim
Collaborator Author

@bpbond and @crvernon the tests are failing on GitHub Actions because I don't have rpackageutils importing properly for the tests. Do you want me to try to work that out before or after you take a look at the PR?

@bpbond
Member

bpbond commented May 29, 2020

@kdorheim Re failing tests, no worries for now, thanks. Will look at this shortly.

Member

@bpbond bpbond left a comment

Lots of goodness here @kdorheim, nice work! That said, I think there are many opportunities to clarify code, improve comments, and improve robustness.

# Make sure data exists for the scenario(s) selected to process.
data_scns <- unique(emiss_data$Scenario, conc_data$Scenario)
missing <- !scenario %in% data_scns
assertthat::assert_that(sum(missing) == 0,
Member

It might be clearer to say

available <- scenario %in% data_scns
assert_that(all(available), ...)

Collaborator Author

ah yes!
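
A minimal sketch of the suggested check, for illustration only. The data frames and scenario names below are hypothetical stand-ins, and base R's stop() is used in place of assertthat so the example has no dependencies. Note that base R's unique() treats a second positional argument as `incomparables` rather than as more data, so the sketch combines the two columns with c() first.

```r
# Hypothetical stand-ins for the emissions and concentration tables.
emiss_data <- data.frame(Scenario = c("ssp119", "ssp126", "ssp126"),
                         stringsAsFactors = FALSE)
conc_data  <- data.frame(Scenario = c("ssp126", "ssp245"),
                         stringsAsFactors = FALSE)
scenario   <- c("ssp119", "ssp245")

# Combine both columns before de-duplicating.
data_scns <- unique(c(emiss_data$Scenario, conc_data$Scenario))

# Positive phrasing: which requested scenarios are available?
available <- scenario %in% data_scns
if (!all(available)) {
  stop("missing scenario(s): ", paste(scenario[!available], collapse = ", "))
}
```

Phrasing the check as all(available) also makes it easy to name the missing scenarios in the error message.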

# unit, ect. These columns will be used to transform the data from being wide to long so that each row
# corresponds to concenration for a specific year.
id_vars <- which(!grepl(pattern = "[[:digit:]]{4}", x = names(conc_data)))
conc_long <- data.table::melt.data.table(data = conc_data, id.vars = id_vars,
Member

Consider putting @importFrom data.table melt.data.table in the header


# Determine the columns that contain identifier information, such as the model, scneairo, region, variable,
# unit, ect. These columns will be used to transform the data from being wide to long so that each row
# corresponds to concenration for a specific year.
Member

"concentration"

# Determine the columns that contain identifier information, such as the model, scneairo, region, variable,
# unit, ect. These columns will be used to transform the data from being wide to long so that each row
# corresponds to concenration for a specific year.
id_vars <- which(!grepl(pattern = "[[:digit:]]{4}", x = names(conc_data)))
Member

Are the columns just years? If yes perhaps make the pattern more specific, i.e. "^[[:digit:]]{4}$"
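
To illustrate the difference the anchors make, here is a small sketch with hypothetical column names (the real RCMIP headers may differ). The unanchored pattern matches any name containing four consecutive digits, while the anchored one matches only names that consist of exactly four digits.

```r
# Hypothetical column names; only "1850" and "2100" are year columns.
cols <- c("Model", "Scenario", "Unit", "1850", "2100", "ID_20200101")

loose  <- grepl(pattern = "[[:digit:]]{4}", x = cols)
strict <- grepl(pattern = "^[[:digit:]]{4}$", x = cols)

year_cols_loose  <- cols[loose]   # also picks up "ID_20200101"
year_cols_strict <- cols[strict]  # only names made of exactly four digits

# Identifier columns are everything that is not a year column.
id_vars <- which(!strict)
```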

variable.factor = FALSE)

# Concatenate the long emissions and concetnration data tables together and subset so that
# only the scenarios of intrest will be converted. Remove the NA entries that arose when converted from
Member

"concentration...interest...converting"

#' @return a formated unit string
#' @author Alexey Shiklomanov
#' @noRd
parse_chem <- function(unit) {
Member

An internal comment or two might be useful...a bit hard to follow this code

R/helper_fxns.R Outdated
}


#' Drop " " from the begning of strings
Member

"beginning"

R/helper_fxns.R Outdated
for(i in cols){

assertthat::assert_that(is.character(df[[i]]) | is.factor(df[[i]]))
df[[i]] <- gsub(pattern = '^ ', replacement = '', x = df[[i]])
Member

Could this be greatly simplified by using base R's trimws function?
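
A rough sketch of what the trimws version might look like, using a hypothetical data frame. One behavioral difference to be aware of: the gsub call above removes a single leading space, whereas trimws(which = "left") removes all leading whitespace (spaces, tabs, and newlines).

```r
# Hypothetical data frame with leading spaces in some character columns.
df <- data.frame(Model = c(" MAGICC", "Hector"),
                 Unit  = c(" Mt CO2/yr", "ppm"),
                 value = c(1.5, 2.5),
                 stringsAsFactors = FALSE)
cols <- c("Model", "Unit")

# Strip leading whitespace from the selected columns only; as.character()
# makes the result a character vector even if a column is a factor.
df[cols] <- lapply(df[cols], function(x) trimws(as.character(x), which = "left"))
```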

R/helper_fxns.R Outdated
# TODO add some sort of method to make sure that the data frame contains all of the required
# emissions or constraints. Otherwise errors will not be triggered until trying to run the
# Hector core.
assertthat::assert_that(sum(emis, conc) == 1, msg = 'input data should include either emissions or constrained data not both.')
Member

@bpbond bpbond May 29, 2020

Definitely time to importFrom assertthat assert_that I'd say.

R/helper_fxns.R Outdated

# Transform the data frame into the wide format that Hector expects.
input_data <- x[ , list(Date = year, variable, value)]
input_data <- dcast(input_data, Date ~ variable, value.var = 'value')
Member

dcast? Where is this coming from?
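
One possible way to address the import comments in this review is with roxygen2 tags, sketched below. Placing them in a package-level documentation block is an assumption, not something this PR shows, and both packages would also need to be listed under Imports in DESCRIPTION.

```r
# Hypothetical package-level documentation block. After re-running roxygen2,
# these tags become importFrom() directives in NAMESPACE, so the functions
# can be called without the pkg:: prefix and their origin is documented.
#' @importFrom assertthat assert_that
#' @importFrom data.table melt.data.table dcast
NULL
```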

@kdorheim
Collaborator Author

kdorheim commented Jun 3, 2020

@bpbond thanks for the suggestions! @crvernon, whenever you have a chance to take a look at this, that would be great!

Member

@crvernon crvernon left a comment

@kdorheim Great work! The changes you made for @bpbond were great. The following are a few high-level comments to go along with what is inline:

  • Some functions duplicate quite a bit of functionality for different constraints. For example, in R/generate_fxns.R the generate_input_tables function could be broken down into (1) a function that processes a single constraint, and (2) a function that uses function (1) to process each constraint and then returns your tables. This reduces the size of your codebase, reduces the possibility for error since you only have to change one block of code when needed, and allows you to write succinct tests that target specific functionality.

  • Remove hard-coded values in functions where possible. These could either be passed in through a YAML config file or set as defaults for arguments. This will prevent folks from having to mess with your code when something like a new year range is needed or a header changes in a file.
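
The two points above could be sketched roughly as follows. This is a loose, base-R illustration only: process_one, process_all, the year_pattern argument, and the example tables are all hypothetical and are not the package's actual functions or data.

```r
# Hypothetical helper (1): process a single wide table into long format.
# The year-column pattern is a default argument rather than a hard-coded value.
process_one <- function(data, scenarios, year_pattern = "^[[:digit:]]{4}$") {
  year_cols <- grep(year_pattern, names(data), value = TRUE)
  keep <- data[data$Scenario %in% scenarios, , drop = FALSE]
  # One block of rows per year column, stacked into a long table.
  long <- do.call(rbind, lapply(year_cols, function(yr) {
    data.frame(Scenario = keep$Scenario,
               Variable = keep$Variable,
               year     = rep(as.integer(yr), nrow(keep)),
               value    = keep[[yr]],
               stringsAsFactors = FALSE)
  }))
  # Drop the NA entries that arise from years with no data.
  long[!is.na(long$value), , drop = FALSE]
}

# Hypothetical wrapper (2): apply helper (1) to every table and combine.
process_all <- function(tables, scenarios, ...) {
  do.call(rbind, lapply(tables, process_one, scenarios = scenarios, ...))
}

# Usage with made-up emissions and concentration tables.
emiss <- data.frame(Scenario = c("ssp119", "ssp245"), Variable = c("CO2", "CO2"),
                    `2015` = c(1, 2), `2016` = c(NA, 3),
                    check.names = FALSE, stringsAsFactors = FALSE)
conc  <- data.frame(Scenario = "ssp119", Variable = "CH4",
                    `2015` = 10, `2016` = 11,
                    check.names = FALSE, stringsAsFactors = FALSE)
res <- process_all(list(emiss, conc), scenarios = "ssp119")
```

With this split, the single-table helper can be tested on its own, and a change such as a new header format only needs to be made in one place.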

@@ -18,4 +18,5 @@ LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.0
Suggests:
testthat
Member

Just curious as to why you are not specifying a version constraint for data.table or zoo?

Collaborator Author

ah! because I forgot 😬 thanks for pointing that out.


# Remove trailing spaces from the RCMIP inputs.
cols_to_modify <- which(names(raw_inputs) %in% c("Model", "Scenario", "Region", "Variable", "Unit", "Mip_Era"))
Member

Will these column names ever change? If so, they should be passed into the function or read in from a YAML file. This comment applies to all hard-coded names thereafter, and to the expected_year value on line 81, which could be a default.

Collaborator Author

They shouldn't ever change, they are really only relevant to the RCMIP files because of some funky formatting.

@kdorheim
Collaborator Author

kdorheim commented Jun 5, 2020

> Some functions duplicate quite a bit of functionality for different constraints. For example, in R/generate_fxns.R the generate_input_tables function could be broken down into (1) a function that processes a single constraint, and (2) a function that uses function (1) to process each constraint and then returns your tables. This reduces the size of your codebase, reduces the possibility for error since you only have to change one block of code when needed, and allows you to write succinct tests that target specific functionality.

Hmmm, this is true and is what I was aiming to do 🙈, which is why convert_rcmipCMIP6_hector is separate from save_hector_table and wrapped inside generate_input_tables. It is a tad confusing though, so I'll go back to the drawing board to try to streamline some of the code. Thanks!

@kdorheim
Collaborator Author

kdorheim commented Jun 5, 2020

@crvernon and @bpbond thanks for taking a look at this! I really appreciate your input. I'm struggling with getting package checks to pass because of a dependency on assertthat, but as soon as that passes I'm going to merge this and start working on the functions that will be used to set up the ini files.

@codecov-commenter

Codecov Report

Merging #7 into master will increase coverage by 38.52%.
The diff coverage is 93.06%.

@@             Coverage Diff             @@
##           master       #7       +/-   ##
===========================================
+ Coverage   54.54%   93.06%   +38.52%     
===========================================
  Files           1        3        +2     
  Lines          11      101       +90     
===========================================
+ Hits            6       94       +88     
- Misses          5        7        +2     
Impacted Files       Coverage Δ
R/generate_fxns.R    92.85% <92.85%> (ø)
R/helper_fxns.R      93.02% <93.02%> (ø)
R/convert_RCMIP.R    93.33% <93.33%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e9b6ea1...b6295ab.

@kdorheim kdorheim merged commit 5757ab1 into master Jun 12, 2020
@kdorheim kdorheim deleted the generate_input_tables branch June 12, 2020 16:33
4 participants