Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate storing much of the data in a database, to reduce load time and memory use #42

Merged
merged 43 commits into from
Jun 5, 2022

Conversation

CalebBell
Copy link
Owner

@CalebBell CalebBell commented May 22, 2022

This pull request adds an alternative option to looking up constant properties in Pandas dataframes. Essentially, a common path is to use a function like Tb without a method specified; and to not use either the Tb_methods or the Tb_sources function and variable respectively. For this one common application, the only needed information is the Tb value itself.

Therefore, a script dev/generate_sqlite_database.py was created to obtain the default values for each constant property in chemicals. The resulting database is 7.3 MB, and compresses to 3.3 MB in a zip archive.

The sqlite database is always stored on disk, so there is substantial memory savings from this approach. The query time is acceptable in general. When using something like %timeit, one search is 10-15 us. However, in this case the used database pages being cached in memory. On my machine when the disk has to be accessed (as measured by doing a single read), the time is about 800 us. I wrote the database lookup to retrieve all of the pure component properties for a compound at once, and cache them in memory once accessed. The layout of sqlite is such that this is as fast as only retrieving one property.

Loading the sqlite3 module takes ~3 ms. It is delayed until needed.

There is a setting chemicals.data_reader.USE_CONSTANTS_DATABASE which defaults to True. The sqlite database can't be committed to git as it will be updated regularly and therefore bloat up the repository size unreasonably.

Included with this change is further work on the int_CAS setting earlier discussed. The files with CAS numbers using int_CAS are now expected to have the CAS number without the dashes included, i.e. 50000 instead of 50-00-0. Doing the conversion in Python turned out to be quite slow, and this way the Pandas csv reader can read them directly as integers. To make this work, all of the data files had their header corrected to use CAS as the index variable.

There are a few more minor data corrections as well.

This PR does add considerably complexity to the project, but I believe it will be appreciated and will make the project better. In the future, it may be possible to provide something for temperature dependent properties as well.

@CalebBell CalebBell requested a review from yoelcortes May 23, 2022 22:31
@yoelcortes
Copy link
Collaborator

@CalebBell, this is awesome. I don't think there is much complexity added. I made some minor edits here and there under the assumption that the index must be either a string or an integer. The database_constant_lookup fails in my machine every time with the following exception:

File "...\chemicals\data_reader.py", line 247, in cached_constant_lookup
    CONSTANTS_CURSOR.execute("SELECT * FROM constants WHERE `index`=?", (str(CASi),))

OperationalError: no such table: constants

I have not worked with sql before so I am not able to debug this quickly, but maybe you could help me? I added some code that prevent database lookup after the first failure related to sql searches.

Other than getting the search to work. I think this branch can be merged whenever. Feel free to revert any of my changes that aren't helpful.

Thanks!

@CalebBell
Copy link
Owner Author

Hi Yoel!
I hope you are well and thank you for the feedback! It is always nice to not be alone. Thank you for taking the time to take a look! I hope this branch can improve your libraries as well as mine.
I believe the error you are getting is that the database isn't on your machine. It needs to be generated first:
python dev/generate_sqlite_database.py
Please let me know if this doesn't work for you.

The database is easy to distribute as part of the pip release. It will take a little more work to get the conda build to ship with it. This is a big part of my reluctance to the approach; but as more and more data is added to the library this type of approach seems like the only one that doesn't unduly increase load time.

@CalebBell
Copy link
Owner Author

@yoelcortes have you been able to check if the database generation script works for you?

@yoelcortes
Copy link
Collaborator

yoelcortes commented May 30, 2022

Hi @CalebBell, sorry about that, I meant to get back to you sooner but was doing some traveling for memorial weekend. Yes, it works well on my computer. I also fixed some bugs I introduced in my past commits (which required the sql file to run). As always, your work on enhancing your ChemE libraries is much appreciated by me and tons others. I am happy I can help and contribute too :)

By the way, I like the new int_CAS parameter. Would it be a good idea to load other files in chemicals using this? I'm not sure if it would add too much overhead, what do you think? Here are some of the files that whose index are not currently converted to iCASs:

Physical Constants of Inorganic Compounds.csv
Physical Constants of Organic Compounds.csv
IUPACOrganicCriticalProps.tsv
CRCCriticalOrganics.tsv
Mathews1972InorganicCriticalProps.tsv
Appendix to PSRK Revision 4.tsv
PassutDanner1973.tsv
Yaws Collection.tsv
DIPPRPinaMartines.tsv
Antoine Collection Poling.tsv
Table 2-8 Vapor Pressure of Inorganic and Organic Liquids.tsv
Alcock_Itkin_Horrigan_metalic_elements.tsv
Wagner Original McGarry.tsv
Wagner Collection Poling.tsv
Antoine Extended Collection Poling.tsv
VDI PPDS Boiling temperatures at different pressures.tsv
Florian_Liming_RON_experimental.tsv
Florian_Liming_MON_experimental.tsv
Florian_Liming_RON_MON_ANN.tsv
Travis_Kessler_Combustdb_RON.tsv
Travis_Kessler_Combustdb_MON.tsv
Travis_Kessler_Combustdb_predictions.tsv
Dahmen_Marquardt_ignition_delay_IQT.tsv
Poling Dipole.csv
cccbdb.nist.gov Dipoles.csv
Muller Supporting Info Dipoles.csv
psi4_dipoles.tsv
Official Global Warming Potentials 2007.tsv
Official Global Warming Potentials 2014.tsv
Ozone Depletion Potentials.tsv
CRC logP table.tsv
Syrres logP data.csv.gz
PolingDatabank.tsv
TRC Thermodynamics of Organic Compounds in the Gas State.tsv
CRC Standard Thermodynamic Properties of Chemical Substances.tsv
MuleroCachadinaParameters.tsv
Jasper-Lange.tsv
Somayajulu.tsv
SomayajuluRevised.tsv
VDI PPDS surface tensions.tsv
MagalhaesLJ.tsv
PolingLJ.tsv
psi4_radius_of_gyrations.tsv
chemsep_radius_of_gyrations.tsv
psi4_linear.tsv
Permittivity (Dielectric Constant) of Liquids.tsv
Yaws Boiling Points.tsv
OpenNotebook Melting Points.tsv
Ghazerati Appendix Vaporization Enthalpy.tsv
CRC Handbook Heat of Vaporization.tsv
CRC Handbook Heat of Fusion.tsv
Ghazerati Appendix Sublimation Enthalpy.tsv
Table 2-150 Heats of Vaporization of Inorganic and Organic Liquids.tsv
VDI PPDS Enthalpies of vaporization.tsv
Alibakhshi one-coefficient enthalpy of vaporization.tsv
API TDB Albahri Hf (g).tsv
ATcT 1.112 (g).tsv
ATcT 1.112 (l).tsv
Yaws Hf S0 (g).tsv
CRC Handbook Organic RI.csv
NFPA 497 2008.tsv
IS IEC 60079-20-1 2010.tsv
DIPPR T_flash Serat.csv
National Toxicology Program Carcinogens.tsv
IARC Carcinogen Database.tsv
Dutt Prasad 3 term.tsv
Viswanath Natarajan Dynamic 3 term.tsv
Viswanath Natarajan Dynamic 2 term.tsv
Viswanath Natarajan Dynamic 2 term Exponential.tsv
Table 2-313 Viscosity of Inorganic and Organic Liquids.tsv
Table 2-312 Vapor Viscosity of Inorganic and Organic Substances.tsv
VDI PPDS Dynamic viscosity of saturated liquids polynomials.tsv
VDI PPDS Dynamic viscosity of gases polynomials.tsv
Table 2-314 Vapor Thermal Conductivity of Inorganic and Organic Substances.tsv
Table 2-315 Thermal Conductivity of Inorganic and Organic Liquids.tsv
VDI PPDS Thermal conductivity of saturated liquids.tsv
VDI PPDS Thermal conductivity of gases.tsv
Staveley 1981.tsv
COSTALD Parameters.tsv
Mchaweh SN0 deltas.tsv
Perry Parameters 105.tsv
CRC Liquid Inorganic Constant Densities.tsv
CRC Solid Inorganic Constant Densities.tsv
VDI PPDS Density of Saturated Liquids.tsv
CRC Inorganics densties of molten compounds and salts.tsv
CRC Virial polynomials.tsv

Please feel free to merge whenever,
Thanks!

@yoelcortes
Copy link
Collaborator

Forgot to mention, I think adding the sql database to the pip installation would be a good idea. I'm not sure how to add it to the conda installation either.

@CalebBell CalebBell merged commit 15738c5 into master Jun 5, 2022
@CalebBell
Copy link
Owner Author

Hi Yoel!

We are very lucky - the conda package is built right from the package uploaded to PyPi. Because of that, the default.sqlite database needs to be generated on the machine which uploads the package to pypi. I added a script dev/prerelease.py which creates the database and will do other things related to the release as required. I am very pleased with this development; I didn't expect such a large win. The maintenance overhead will very hopefully be minimal.

The new release is version 1.1.1, as this feels like a major change. I have tested and confirmed both releases ship with the sqlite database.

I am not sure the idea of converting older .tsv files to use integer CAS numbers is worthwhile. In particular, I know of some people indexing into those files using CAS numbers as strings, so it would not be good to break that functionality them. I think pursuing expanding the database with temperature dependent property fits is a larger win, and hope to focus potential future memory reduction efforts into that.

Thank you for your excellent review!

Caleb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants