Data storage objects #102

Merged
merged 30 commits into from
Feb 20, 2021

@odunbar odunbar commented Feb 5, 2021

Purpose

Fixes #90
This PR adds DataStorage.jl, which contains objects to store singleton or paired data samples. The idea is to help ensure dimensionally consistent data, which has led to numerous issues in the past: the objects always store a 2D array in which the samples are columns (so if in doubt, the data are columns). They also safeguard the input-output pairs for the emulator (e.g. by checking that the samples have consistent dimensions).

Users will need to specify the orientation when they provide arrays (whether the data are stored as a list of rows or a list of columns); internally the data are then always arranged as columns. Effectively, we are just adding more rigid constructors for data arrays.
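As a rough illustration of the "more rigid constructor" idea, here is a minimal self-contained sketch. It is not the code in DataStorage.jl: the type name `ColumnStore` and its field `data` are hypothetical, and only the `data_are_columns` keyword mirrors the interface described in this PR.

```julia
# Hypothetical stand-in for illustration only; NOT the DataStorage.jl implementation.
struct ColumnStore{FT <: Real}
    data::Matrix{FT}   # always (data dimension) x (number of samples)

    # The inner constructor is the only way to build the object, so every
    # instance has passed through the orientation handling below.
    function ColumnStore(x::AbstractMatrix{FT}; data_are_columns::Bool = true) where {FT <: Real}
        # collect/permutedims both allocate a new array, so later mutation
        # of `x` cannot change the stored samples.
        stored = data_are_columns ? collect(x) : permutedims(x)
        return new{FT}(stored)
    end
end

x = rand(2, 10)
a = ColumnStore(x)                                         # 2D data, 10 samples
b = ColumnStore(x; data_are_columns = false)               # 10D data, 2 samples
c = ColumnStore(permutedims(x); data_are_columns = false)  # same samples as `a`, given as rows
```

Whatever orientation the caller starts from, downstream code can rely on `size(store.data, 2)` being the number of samples.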

Contains:

  • The DataContainer object stores data as columns
  • The PairedDataContainer object stores 2 DataContainer objects, one for input data and one for output data. The number of samples is checked to be the same in the inputs and outputs, while the data dimensions can differ.
  • Unit tests for the object and related functions, also updated unit tests for the other objects to use these
  • Small changes to examples (Lorenz 96/Cloudy/LossMinimization/GaussianProcessEmulator) to work with new data storage
  • To obtain training points from the EKP: in src/Utilities there is a get_training_points function which returns a PairedDataContainer
  • The GaussianProcessEmulator now only takes in a PairedDataContainer

Also...

  • I required some larger changes to the Cloudy example run (even without the new data storage)
  • I removed some confusing vcat([X.first, X.last]...) type lines in the GP setup and replaced them with an interpretable loop
  • Getters and setters for EKP; I really push for us to use getters/setters as an interface
  • MCMC stores posterior array with data-as-columns, and utilities work with this too.

Construction example

DataContainer

To create a DataContainer, simply create an array and specify whether you consider the data within it to be a list of columns or not, e.g. for 2D data and 10 samples

x = rand(2,10)
column_store = DataContainer(x,data_are_columns=true)

or for 10D data and 2 samples

column_store2 = DataContainer(x,data_are_columns=false)

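Remember, the data is always stored as columns, so column_store holds a 2×10 array and column_store2 holds a 10×2 array. Supplying the transposed array as a list of rows gives the same stored data as column_store, so

@assert(get_data(column_store) == get_data(DataContainer(permutedims(x),data_are_columns=false)))

passes.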

PairedDataContainer

Can be constructed from 2 DataContainers of the same sample size.

input_output_pairs = PairedDataContainer(column_store,column_store)

Or from the arrays themselves (so long as they have the same "samples" dimension)

input_output_pairs2 = PairedDataContainer(x,x,data_are_columns=true) 

and,

PairedDataContainer(column_store,column_store2) 

throws a dimension mismatch error.
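The sample-count check described above can be sketched as follows. This is a self-contained illustration, not the PairedDataContainer source: the type `PairedSketch` and its fields are hypothetical, and the use of Base's `DimensionMismatch` exception is an assumption about what "dimension mismatch error" means here.

```julia
# Hypothetical stand-in for illustration only; NOT the DataStorage.jl implementation.
struct PairedSketch{FT <: Real}
    inputs::Matrix{FT}    # (input dimension)  x (number of samples)
    outputs::Matrix{FT}   # (output dimension) x (number of samples)

    function PairedSketch(
        inputs::AbstractMatrix{FT},
        outputs::AbstractMatrix{FT};
        data_are_columns::Bool = true,
    ) where {FT <: Real}
        # Normalise both arrays to samples-as-columns before comparing.
        ins = data_are_columns ? collect(inputs) : permutedims(inputs)
        outs = data_are_columns ? collect(outputs) : permutedims(outputs)
        # Only the sample counts must agree; the data dimensions may differ.
        if size(ins, 2) != size(outs, 2)
            throw(DimensionMismatch(
                "inputs have $(size(ins, 2)) samples but outputs have $(size(outs, 2))",
            ))
        end
        return new{FT}(ins, outs)
    end
end

# 2D inputs and 3D outputs with 10 samples each: accepted.
pairs = PairedSketch(rand(2, 10), rand(3, 10))

# 10 input samples against 2 output samples: rejected.
mismatch_caught = try
    PairedSketch(rand(2, 10), rand(10, 2))
    false
catch err
    err isa DimensionMismatch
end
```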

@odunbar odunbar changed the title [WIP] Data storage objects Data storage objects Feb 12, 2021

odunbar commented Feb 20, 2021

bors r+


bors bot commented Feb 20, 2021

Build succeeded:

@bors bors bot merged commit 468f5a2 into master Feb 20, 2021
@bors bors bot deleted the orad/DataStorage branch February 20, 2021 18:08
Successfully merging this pull request may close these issues.

Consistency in Rows/Columns/1d-arrays