
Arni/extract henry coeff #100

Merged
merged 19 commits into from Apr 11, 2019

Conversation

Surluson
Collaborator

Added two functions (along with docstrings and tests):
extract_henry_coefficient:
Takes a DataFrame, the names of the pressure and adsorption columns, and the number of points to use when extracting the Henry coefficient.
I used MultivariateStats for a linear least-squares regression (llsq). One issue I found is that the llsq function doesn't accept the DataFrame columns directly, because their element type is usually a Union of Floats and Missings.
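The slope extraction described above amounts to a one-parameter least-squares fit through the origin. Here is a minimal sketch of the idea, written in Python/NumPy purely for illustration (the package itself is Julia, and the function name here is hypothetical):

```python
import numpy as np

def extract_henry_coefficient(pressure, loading, n_pts=3):
    """Fit loading ~ KH * pressure through the origin, using the
    first n_pts low-pressure points (the Henry regime)."""
    # coercing to float plays the role of stripping the
    # Union{Float64, Missing} element type mentioned above
    p = np.asarray(pressure, dtype=float)[:n_pts]
    n = np.asarray(loading, dtype=float)[:n_pts]
    # least-squares slope through the origin: KH = (p . n) / (p . p)
    KH = np.dot(p, n) / np.dot(p, p)
    rmse = np.sqrt(np.mean((n - KH * p) ** 2))
    return KH, rmse
```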

fit_langmuir:
Here I linearize N = (M K P) / (1 + K P) and solve it with llsq as well. I check whether the first pressure point is 0, because after linearizing I divide by the pressure (which blows up if P = 0).
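For concreteness, dividing through N = M K P / (1 + K P) gives 1/N = 1/(M K P) + 1/M, so regressing 1/N on 1/P yields slope 1/(M K) and intercept 1/M. A Python/NumPy sketch of this linearized fit (hypothetical names; the actual code is Julia):

```python
import numpy as np

def fit_langmuir_linearized(p, n):
    """Fit N = M K P / (1 + K P) by regressing 1/N on 1/P."""
    p = np.asarray(p, dtype=float)
    n = np.asarray(n, dtype=float)
    if p[0] == 0.0:               # 1/P blows up at P = 0
        p, n = p[1:], n[1:]
    slope, intercept = np.polyfit(1.0 / p, 1.0 / n, 1)
    M = 1.0 / intercept           # intercept = 1 / M
    K = intercept / slope         # slope = 1 / (M K)  =>  K = 1 / (M * slope)
    return M, K
```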

@SimonEnsemble
Owner

very nice and clean code!

  • Could you also return information about the goodness of fit, e.g. the error? This way we can assess the fit systematically. Does this package provide error bounds on the slope? So KH = 12 +/- 3.0? If so, that would be really helpful for assessing goodness of fit and automatically selecting how many points to include.
  • For Langmuir fitting, I typically minimize sum over i pressure points:
    [n_i - M * K * p_i / (1 + K * p_i)]^2
    which is a non-linear optimization problem. You first linearized the Langmuir isotherm, which will not minimize the above. I haven't thought much about which is better (it is probably context dependent). Do you know the difference between the two methods? I imagine one will over-emphasize high or low pressure data points? This well-cited paper suggests that the linearized version (while easier) is not a good way to go: https://pubag.nal.usda.gov/download/5945/PDF The Optim.jl package is good for non-linear optimization.
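The nonlinear objective above can be minimized directly; as an illustration, here is a Python sketch using SciPy's curve_fit (a stand-in for Optim.jl, since the package itself is Julia):

```python
import numpy as np
from scipy.optimize import curve_fit

def langmuir(p, M, K):
    return M * K * p / (1.0 + K * p)

def fit_langmuir_nonlinear(p, n, M0=1.0, K0=1.0):
    """Minimize sum_i [n_i - M K p_i / (1 + K p_i)]^2 over (M, K),
    starting from the guesses (M0, K0)."""
    popt, _ = curve_fit(langmuir, np.asarray(p, dtype=float),
                        np.asarray(n, dtype=float), p0=[M0, K0])
    M, K = popt
    return M, K
```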

@Surluson
Collaborator Author

For now, I added an RMSE between the fitted slope and the points used in the fit. Error bounds on the slope would require us to assume some error in the data points, no?

@SimonEnsemble
Owner

The RMSE is smallest if you include only two points, so based on RMSE you would always choose two points to estimate the Henry coefficient. There is a tradeoff here, though: including more points might mean more confidence in the slope estimate, but also less confidence if you stray outside the Henry regime. So how do we systematically choose the number of points to include in a Henry coefficient calculation?

The error estimate in the slope is based on Gaussian distributed noise, but the variance of that Gaussian is inferred from the data. See here: https://www.chem.utoronto.ca/coursenotes/analsci/stats/ErrRegr.html I'm not 100% sure the estimate of error in slope is the way to go, but certainly RMSE is not suitable on its own since that always leads to the conclusion to use only two points.
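The slope-error estimate from that reference can be sketched as follows (a Python illustration, assuming Gaussian residuals whose variance is inferred from the data; names are hypothetical):

```python
import numpy as np

def slope_with_std_error(x, y):
    """Ordinary least-squares slope b and its standard error,
    s_b = s / sqrt(sum (x_i - xbar)^2), where the residual standard
    deviation s is estimated from the data with k - 2 degrees of freedom."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = len(x)
    b, a = np.polyfit(x, y, 1)                            # y ~ b x + a
    s = np.sqrt(np.sum((y - (b * x + a)) ** 2) / (k - 2))  # residual std dev
    s_b = s / np.sqrt(np.sum((x - x.mean()) ** 2))
    return b, s_b
```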

Code looks good, except in practice you need a more automatic guess for K and M; I suspect it will rarely converge to a global minimum with your default guesses for K and M; it will often get stuck in a local minimum. In pyIAST, I use 1.1 times the last data point as the saturation loading, and use the first data points to get a Henry coefficient, giving K = KH / M. But you already thought of a better way to get a good starting point! You can keep your linearized function for Langmuir fitting and use its results as starting params for the nonlinear fitting routine! 👍 Maybe it would be beautiful to keep one function with a method=:nonlinear default option passed, which calls method=:linear for starting params.

@SimonEnsemble
Owner

@Surluson this might be stale, could you re-make a pull request now that we have Travis working?

Also, it looks like the Langmuir guess is a poor starting point; we can do better than a default guess of M = 1.0, which will probably only rarely converge. If you guess M as, say, 1.1 times the maximum loading in the data, it will be more robust. Then you can estimate the Henry coefficient from the first point and get the Langmuir K from that as a better estimate. (This is simpler, two lines of code, than the linear fitting I suggested above.)
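That two-line guessing strategy might look like this (a Python sketch with hypothetical names; the real code operates on a DataFrame and column names, and uses KH = M K from the Langmuir model to convert the Henry slope into a K guess):

```python
import numpy as np

def guess_langmuir_params(p, n):
    """Starting guesses: M0 from 1.1x the largest observed loading,
    KH from the first nonzero-pressure point, and K0 = KH / M0
    (since KH = M K in the Langmuir model)."""
    p = np.asarray(p, dtype=float)
    n = np.asarray(n, dtype=float)
    i = int(np.argmax(p > 0.0))   # index of first nonzero-pressure point
    KH = n[i] / p[i]              # Henry coefficient from the first point
    M0 = 1.1 * n.max()            # saturation loading guess
    K0 = KH / M0                  # KH = M K  =>  K = KH / M
    return M0, K0
```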

@SimonEnsemble
Owner

@Surluson checks failed.

  • looks like there are commented-out lines there?
  • I just realized we can generalize this, since there is a lot of repeated code. How about model can be :henry or :langmuir? Then we have one function fit_isotherm(df, pressure_col_name, n_col_name, model), and a _guess(df, model) that guesses the params specific to each model. The guess for the Henry coefficient can be the slope based on the first data point; for Langmuir, the slope of the first data point, then 1.1 times the maximum value of adsorbed gas in the data. So instead of using the llsq function, we use the non-linear fitting routine for generality; only the objective function differs between isotherms. This will allow us to incorporate all sorts of isotherm models, and it is as easy as adding if model == :mymodel, then adding a guess and writing the objective function and functional form for that model.
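The generalization sketched in that bullet, with model-specific guesses and one shared nonlinear fitting routine, might look roughly like this (a Python sketch with hypothetical names; the actual package would use Julia symbols like :langmuir):

```python
import numpy as np
from scipy.optimize import curve_fit

# model name -> functional form n(p; params); purely illustrative
MODELS = {
    "henry":    lambda p, KH: KH * p,
    "langmuir": lambda p, M, K: M * K * p / (1.0 + K * p),
}

def _guess(p, n, model):
    """Model-specific starting parameters, as described above."""
    KH = n[0] / p[0] if p[0] > 0.0 else n[1] / p[1]  # slope from first point
    if model == "henry":
        return [KH]
    M0 = 1.1 * n.max()            # 1.1x the max adsorbed amount
    return [M0, KH / M0]          # KH = M K  =>  K0 = KH / M0

def fit_isotherm(p, n, model="langmuir"):
    p = np.asarray(p, dtype=float)
    n = np.asarray(n, dtype=float)
    params, _ = curve_fit(MODELS[model], p, n, p0=_guess(p, n, model))
    return params
```

Adding a new model then only requires a functional form and a guess.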

@Surluson
Collaborator Author

Surluson commented Mar 9, 2019

It seems all the tests failed because MultivariateStats wasn't in the REQUIRE file.
And I really like that idea; I'll make the changes ASAP.

@SimonEnsemble
Owner

@Surluson to get this PR merged how about we simplify it and

  • return the MSE without trying some way of normalizing it
  • assume the user will trim the dataframe themselves if they don't want any points to be used in the fitting (i.e. do not pass n_pts as the number of points to use or, more ambitiously, choose n_pts automatically for Henry fits)

Owner

@SimonEnsemble SimonEnsemble left a comment


  • @test isapprox(henry, 0.9688044280548711): what is the rationale for why it should be 0.96? Can you add a comment in the test to explain?
  • what is the point of the prepend!() of zeros?
  • should we use "H" as the name for the Henry coefficient to distinguish it from "K", since the Langmuir constant is not the Henry coefficient in the Langmuir model?
  • if θ0["K0"] is the guess returned from the guessing function, why are we multiplying it by 1.1 in the optimize function for the Henry coeff? Should we build the 1.1 into the guess function instead?
  • n[2]/p[2] etc. should be n[2] / p[2] for style, so that we can't look at a piece of code and tell who wrote it
  • it seems you can save some lines of code in _guess, since the Henry coefficient needs to be guessed for both models.
  • the Henry constant in the Langmuir model is not equal to K, so the guess for K0 for the Langmuir model is not correct; shouldn't it be divided by M? As stated above, you could get the Henry coefficient guess and then divide it by M for the Langmuir K, since H = K * M in the Langmuir model if you write it out

@Surluson
Collaborator Author

I've made some edits to the code.
As for the first point: this is the value I got from calculating the slope from the first 3 points. At the time, the code was also trying to find the optimal number of data points to use for the fitting, so this test also served to make sure exactly 3 data points were used in the fitting procedure.

Owner

@SimonEnsemble SimonEnsemble left a comment


  • what does min_mse do?
  • what does henry_tol do?

these seem to be artifacts of automatically choosing the number of points?

@Surluson
Collaborator Author

Yeah, sorry, those shouldn't be there. They've been removed.

@SimonEnsemble
Owner

  • fit_isotherm --> fit_adsorption_isotherm
  • simplified the Henry test (I thought we didn't need to load in a .csv and have an additional file lying around just for this simple test, when we could put the data in the code)
  • added comments to explain guesses, plus comment to explain what _guess is doing.

please review my changes and approve, then we can merge the PR; it looks good to me!

@Surluson
Collaborator Author

Everything looks good to me 👍
You'll have to approve the changes though, because I made the pull request.

@SimonEnsemble SimonEnsemble merged commit 6f759f3 into master Apr 11, 2019
@SimonEnsemble SimonEnsemble deleted the arni/extract_henry_coeff branch April 11, 2019 22:59