Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardized methods outputs #452

Open
chahank opened this issue Jul 15, 2021 · 10 comments
Open

Standardized methods outputs #452

chahank opened this issue Jul 15, 2021 · 10 comments

Comments

@chahank
Copy link
Contributor

chahank commented Jul 15, 2021

This is an idea/suggestion for current and future development. It would be very helpful that the basic methods sample and analyze always return data of the same type with a similar structure. In this way, it is much easier to embed SAlib into other packages.

For example, now the method SALib.analyze.morris.analyze returns a dictionary SI_morris containing types list, numpy.ndarray, and numpy.ma.core.MaskedArray whereas SALib.analyze.sobol.analyze returns a dictionary Si_sobol containing types numpy.ndarray. As far as I can tell, there is no fundamental reason to have these differences. Thus, a proposal would be to always use numpy.array. Furthermore, the dictionary SI_morris has the key names, which the SI_sobol does not have.

@ConnectedSystems
Copy link
Member

Hi @chahank

If you're doing analyses, try converting to a DataFrame first.

Example:

Si = morris.analyze(...)
Si.to_df()

Otherwise, I agree that standardization is warranted.

@chahank
Copy link
Contributor Author

chahank commented Jul 15, 2021

Thanks! The method to convert to dataframes is indeed quite useful. Unfortunately, I based my embedding of SAlib into CLIMADA on the dictionaries so far. I might change it in the future though.

@chahank
Copy link
Contributor Author

chahank commented Jul 15, 2021

To give a bit more context: one of the reasons that I used the dictionary is that it is easy to differentiate between first and second order indices by checking whether it is a 1D or 2D numpy array. With the dataframe indices, the differentiation is less trivial.

@ConnectedSystems
Copy link
Member

ConnectedSystems commented Jul 18, 2021

it is easy to differentiate between first and second order indices by checking whether it is a 1D or 2D numpy array

I think I'm missing some more context here, but are the dictionary keys or DF column names themselves not usable for this?

>>> Si = sobol.analyze(...)

>>> Si
{'S1': array([ 0.31057564,  0.44365337, -0.01296208]),
 'S1_conf': array([0.05220492, 0.05386508, 0.05120255]),
 'ST': array([0.55794676, 0.4421895 , 0.24140187]),
 'ST_conf': array([0.07801056, 0.04067199, 0.02647518]),
 'S2': array([[        nan, -0.01439748,  0.24623147],
        [        nan,         nan,  0.00053893],
        [        nan,         nan,         nan]]),
 'S2_conf': array([[       nan, 0.07894212, 0.10970485],
        [       nan,        nan, 0.05909114],
        [       nan,        nan,        nan]])}

>>> Si.keys()
dict_keys(['S1', 'S1_conf', 'ST', 'ST_conf', 'S2', 'S2_conf'])

>>> Si["S2"]
array([[        nan, -0.01439748,  0.24623147],
       [        nan,         nan,  0.00053893],
       [        nan,         nan,         nan]])

# Convert to DF, in this case returns list of 3 dataframes 
# in order of Total, First and Second order indices
>>> Si.to_df()  
[          ST   ST_conf
 x1  0.557947  0.078011
 x2  0.442189  0.040672
 x3  0.241402  0.026475,
           S1   S1_conf
 x1  0.310576  0.052205
 x2  0.443653  0.053865
 x3 -0.012962  0.051203,
                 S2   S2_conf
 (x1, x2) -0.014397  0.078942
 (x1, x3)  0.246231  0.109705
 (x2, x3)  0.000539  0.059091]

>>> ST, S1, S2 = Si.to_df()

@chahank
Copy link
Contributor Author

chahank commented Jul 19, 2021

I would like to avoid any hard-coded variable names. This is an effort to keep compatibility with potential future methods in SAlib with minimal effort. As long a the structure of the output data from the .sample and .analyze methods do not change, this should be fine. There is of course no guarantee from you that this will not change it in the future, but I hope the package will remain stable for some time.

Since for the different methods in SAlib the output variables have different names, I differentiate between first order and second order indices by looking whether the array is 1D or 2D.

I could also use the Dataframes (but would have to rewrite my integration of SAlib ^^). How stable is the DataFrame output, however? Will it remain available for all methods?

@ConnectedSystems
Copy link
Member

I would like to avoid any hard-coded variable names
I differentiate between first order and second order indices by looking whether the array is 1D or 2D.

This is not a criticism (I have no context on the integration after all), just wanted to raise that there is a danger to attempting to future-proof against all possible cases. Checking the shape of the returned array is also a little fragile (what happens if we move to providing 3rd order sensitivities as well?). I suppose you could check the length of DataFrame indices (this should increase with number of orders).

Names of methods/results are unlikely to change, however.

How stable is the DataFrame output, however? Will it remain available for all methods?

I intend for this, yes 😄

@ConnectedSystems
Copy link
Member

Incidentally, would you be in a position to submit a PR listing CLIMADA in the readme?

Like this PR #433

@chahank
Copy link
Contributor Author

chahank commented Jul 26, 2021

I would like to avoid any hard-coded variable names
I differentiate between first order and second order indices by looking whether the array is 1D or 2D.

This is not a criticism (I have no context on the integration after all), just wanted to raise that there is a danger to attempting to future-proof against all possible cases.

Sure! Thanks for the feedback. I agree that future-proof for all possible cases is impossible. So far, I tried to make at least somehow flexible, and it mostly worked out well over the last couple of release of SAlib :D.

Checking the shape of the returned array is also a little fragile (what happens if we move to providing 3rd order sensitivities as well?). I suppose you could check the length of DataFrame indices (this should increase with number of orders).

That could be something. For the moment, the code simply ignores not 1-d or 2-d arrays. But of course, this is not foolproof.

Names of methods/results are unlikely to change, however.
That is why I tried to avoid any hard-coded names so far.

How stable is the DataFrame output, however? Will it remain available for all methods?

I intend for this, yes 😄

Awesome :D !

Thanks again for all the feedback and for making this great package!

@ConnectedSystems
Copy link
Member

Thanks for PR #455 and your kind words. For the record though I didn't make this package, I merely help maintain it :)

All credit goes to @jdherman @willu47 and their colleagues.

@chahank
Copy link
Contributor Author

chahank commented Aug 10, 2021

One suggestion about the pandas DataFrame output for the higher-order sensitivity indices: instead of using a single index with a longer string "(x1, x2)", it would be convenient to use a multi-index. Hence, to have two separate index columns for the first and second variables.

@ConnectedSystems ConnectedSystems moved this from 1.4.x series to v1.5 onward in SALib Development Roadmap Sep 4, 2021
@ConnectedSystems ConnectedSystems added this to the v1.5 milestone Sep 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

2 participants