Investigate storing much of the data in a database, to reduce load time and memory use #42

CalebBell · 2022-05-22T21:42:58Z

This pull request adds an alternative option to looking up constant properties in Pandas dataframes. Essentially, a common path is to use a function like Tb without a method specified; and to not use either the Tb_methods or the Tb_sources function and variable respectively. For this one common application, the only needed information is the Tb value itself.

Therefore, a script dev/generate_sqlite_database.py was created to obtain the default values for each constant property in chemicals. The resulting database is 7.3 MB, and compresses to 3.3 MB in a zip archive.

The sqlite database is always stored on disk, so there is substantial memory savings from this approach. The query time is acceptable in general. When using something like %timeit, one search is 10-15 us. However, in this case the used database pages being cached in memory. On my machine when the disk has to be accessed (as measured by doing a single read), the time is about 800 us. I wrote the database lookup to retrieve all of the pure component properties for a compound at once, and cache them in memory once accessed. The layout of sqlite is such that this is as fast as only retrieving one property.

Loading the sqlite3 module takes ~3 ms. It is delayed until needed.

There is a setting chemicals.data_reader.USE_CONSTANTS_DATABASE which defaults to True. The sqlite database can't be committed to git as it will be updated regularly and therefore bloat up the repository size unreasonably.

Included with this change is further work on the int_CAS setting earlier discussed. The files with CAS numbers using int_CAS are now expected to have the CAS number without the dashes included, i.e. 50000 instead of 50-00-0. Doing the conversion in Python turned out to be quite slow, and this way the Pandas csv reader can read them directly as integers. To make this work, all of the data files had their header corrected to use CAS as the index variable.

There are a few more minor data corrections as well.

This PR does add considerably complexity to the project, but I believe it will be appreciated and will make the project better. In the future, it may be possible to provide something for temperature dependent properties as well.

…on and add GWP and ODP to database

…ect CASs fixed

yoelcortes · 2022-05-24T20:49:37Z

@CalebBell, this is awesome. I don't think there is much complexity added. I made some minor edits here and there under the assumption that the index must be either a string or an integer. The database_constant_lookup fails in my machine every time with the following exception:

File "...\chemicals\data_reader.py", line 247, in cached_constant_lookup
    CONSTANTS_CURSOR.execute("SELECT * FROM constants WHERE `index`=?", (str(CASi),))

OperationalError: no such table: constants

I have not worked with sql before so I am not able to debug this quickly, but maybe you could help me? I added some code that prevent database lookup after the first failure related to sql searches.

Other than getting the search to work. I think this branch can be merged whenever. Feel free to revert any of my changes that aren't helpful.

Thanks!

CalebBell · 2022-05-25T00:50:38Z

Hi Yoel!
I hope you are well and thank you for the feedback! It is always nice to not be alone. Thank you for taking the time to take a look! I hope this branch can improve your libraries as well as mine.
I believe the error you are getting is that the database isn't on your machine. It needs to be generated first:
python dev/generate_sqlite_database.py
Please let me know if this doesn't work for you.

The database is easy to distribute as part of the pip release. It will take a little more work to get the conda build to ship with it. This is a big part of my reluctance to the approach; but as more and more data is added to the library this type of approach seems like the only one that doesn't unduly increase load time.

CalebBell · 2022-05-30T00:31:20Z

@yoelcortes have you been able to check if the database generation script works for you?

yoelcortes · 2022-05-30T01:08:06Z

Hi @CalebBell, sorry about that, I meant to get back to you sooner but was doing some traveling for memorial weekend. Yes, it works well on my computer. I also fixed some bugs I introduced in my past commits (which required the sql file to run). As always, your work on enhancing your ChemE libraries is much appreciated by me and tons others. I am happy I can help and contribute too :)

By the way, I like the new int_CAS parameter. Would it be a good idea to load other files in chemicals using this? I'm not sure if it would add too much overhead, what do you think? Here are some of the files that whose index are not currently converted to iCASs:

Physical Constants of Inorganic Compounds.csv
Physical Constants of Organic Compounds.csv
IUPACOrganicCriticalProps.tsv
CRCCriticalOrganics.tsv
Mathews1972InorganicCriticalProps.tsv
Appendix to PSRK Revision 4.tsv
PassutDanner1973.tsv
Yaws Collection.tsv
DIPPRPinaMartines.tsv
Antoine Collection Poling.tsv
Table 2-8 Vapor Pressure of Inorganic and Organic Liquids.tsv
Alcock_Itkin_Horrigan_metalic_elements.tsv
Wagner Original McGarry.tsv
Wagner Collection Poling.tsv
Antoine Extended Collection Poling.tsv
VDI PPDS Boiling temperatures at different pressures.tsv
Florian_Liming_RON_experimental.tsv
Florian_Liming_MON_experimental.tsv
Florian_Liming_RON_MON_ANN.tsv
Travis_Kessler_Combustdb_RON.tsv
Travis_Kessler_Combustdb_MON.tsv
Travis_Kessler_Combustdb_predictions.tsv
Dahmen_Marquardt_ignition_delay_IQT.tsv
Poling Dipole.csv
cccbdb.nist.gov Dipoles.csv
Muller Supporting Info Dipoles.csv
psi4_dipoles.tsv
Official Global Warming Potentials 2007.tsv
Official Global Warming Potentials 2014.tsv
Ozone Depletion Potentials.tsv
CRC logP table.tsv
Syrres logP data.csv.gz
PolingDatabank.tsv
TRC Thermodynamics of Organic Compounds in the Gas State.tsv
CRC Standard Thermodynamic Properties of Chemical Substances.tsv
MuleroCachadinaParameters.tsv
Jasper-Lange.tsv
Somayajulu.tsv
SomayajuluRevised.tsv
VDI PPDS surface tensions.tsv
MagalhaesLJ.tsv
PolingLJ.tsv
psi4_radius_of_gyrations.tsv
chemsep_radius_of_gyrations.tsv
psi4_linear.tsv
Permittivity (Dielectric Constant) of Liquids.tsv
Yaws Boiling Points.tsv
OpenNotebook Melting Points.tsv
Ghazerati Appendix Vaporization Enthalpy.tsv
CRC Handbook Heat of Vaporization.tsv
CRC Handbook Heat of Fusion.tsv
Ghazerati Appendix Sublimation Enthalpy.tsv
Table 2-150 Heats of Vaporization of Inorganic and Organic Liquids.tsv
VDI PPDS Enthalpies of vaporization.tsv
Alibakhshi one-coefficient enthalpy of vaporization.tsv
API TDB Albahri Hf (g).tsv
ATcT 1.112 (g).tsv
ATcT 1.112 (l).tsv
Yaws Hf S0 (g).tsv
CRC Handbook Organic RI.csv
NFPA 497 2008.tsv
IS IEC 60079-20-1 2010.tsv
DIPPR T_flash Serat.csv
National Toxicology Program Carcinogens.tsv
IARC Carcinogen Database.tsv
Dutt Prasad 3 term.tsv
Viswanath Natarajan Dynamic 3 term.tsv
Viswanath Natarajan Dynamic 2 term.tsv
Viswanath Natarajan Dynamic 2 term Exponential.tsv
Table 2-313 Viscosity of Inorganic and Organic Liquids.tsv
Table 2-312 Vapor Viscosity of Inorganic and Organic Substances.tsv
VDI PPDS Dynamic viscosity of saturated liquids polynomials.tsv
VDI PPDS Dynamic viscosity of gases polynomials.tsv
Table 2-314 Vapor Thermal Conductivity of Inorganic and Organic Substances.tsv
Table 2-315 Thermal Conductivity of Inorganic and Organic Liquids.tsv
VDI PPDS Thermal conductivity of saturated liquids.tsv
VDI PPDS Thermal conductivity of gases.tsv
Staveley 1981.tsv
COSTALD Parameters.tsv
Mchaweh SN0 deltas.tsv
Perry Parameters 105.tsv
CRC Liquid Inorganic Constant Densities.tsv
CRC Solid Inorganic Constant Densities.tsv
VDI PPDS Density of Saturated Liquids.tsv
CRC Inorganics densties of molten compounds and salts.tsv
CRC Virial polynomials.tsv

Please feel free to merge whenever,
Thanks!

yoelcortes · 2022-05-30T01:09:26Z

Forgot to mention, I think adding the sql database to the pip installation would be a good idea. I'm not sure how to add it to the conda installation either.

CalebBell · 2022-06-06T00:26:42Z

Hi Yoel!

We are very lucky - the conda package is built right from the package uploaded to PyPi. Because of that, the default.sqlite database needs to be generated on the machine which uploads the package to pypi. I added a script dev/prerelease.py which creates the database and will do other things related to the release as required. I am very pleased with this development; I didn't expect such a large win. The maintenance overhead will very hopefully be minimal.

The new release is version 1.1.1, as this feels like a major change. I have tested and confirmed both releases ship with the sqlite database.

I am not sure the idea of converting older .tsv files to use integer CAS numbers is worthwhile. In particular, I know of some people indexing into those files using CAS numbers as strings, so it would not be good to break that functionality them. I think pursuing expanding the database with temperature dependent property fits is a larger win, and hope to focus potential future memory reduction efforts into that.

Thank you for your excellent review!

Caleb

CalebBell added 26 commits May 22, 2022 14:20

Investigate database lookup

5b9454a

Better performance

17b163a

Allow to not load data by saying if in db or not

85e36a6

There is a bug with critical pressure

5fa5e6d

.

6573f42

Add one more check to CASs

1a1dffc

Correct one CAS

aa1816f

Few more files

cd03848

More data

b65ecde

More of them

7c171ac

Half of safety

3484965

Triple

a151984

Bug fix

40262c9

Update wikidata properties

61b5523

Switch GWP default value to 20 year given the current climate situati…

664b2bd

…on and add GWP and ODP to database

Add refractivity table

7bda40d

Refractivity

db83f1c

Missed some

d873449

Look at getting the CAS as an integer read in by pandas

0fd1463

Fix index naming

96e268b

Start consistenly naming dataframe columns with CAS

7a503d2

Clean up index column name to CAS

86cf853

Go ahead and change the default behavior to be expecting integer CASs

e753a88

Update prediction scripts

3db1e09

Add memory usage debug function

ef1b447

Fix type erro for sqlite MW

0088332

CalebBell requested a review from yoelcortes May 23, 2022 22:31

CalebBell added 3 commits May 23, 2022 18:20

Code cleanup

7b404ec

Two fixes

373542a

Save 0.1 second on test_CAS_numbers_valid_and_unique

18235bb

CalebBell and others added 8 commits May 23, 2022 23:18

Save another few dozen ms

43fcd1f

Find three more values that were not being checked correctly - Incorr…

81acf4a

…ect CASs fixed

3.10 workflow

82e7c6d

Fix

f559f60

tweaks to avoid function call and minor simplifications

76ec826

minor edits

a92e164

add explanatory comment

efbfc44

do not switch off database lookup due to dict errors

654ce22

do not switch off sql lookup if user error

0363d94

fix bugs introduced in PR enhancement

bdb4e45

CalebBell added 4 commits June 1, 2022 21:43

Update chemsep data

7c7b834

Merge branch 'database' of github.com:CalebBell/chemicals into database

94a63cc

Initialize the setting based on whether or not the database file exists

f0d6ef7

Add prerelease script

41cb04d

CalebBell merged commit 15738c5 into master Jun 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate storing much of the data in a database, to reduce load time and memory use #42

Investigate storing much of the data in a database, to reduce load time and memory use #42

CalebBell commented May 22, 2022 •

edited

Loading

yoelcortes commented May 24, 2022

CalebBell commented May 25, 2022

CalebBell commented May 30, 2022

yoelcortes commented May 30, 2022 •

edited

Loading

yoelcortes commented May 30, 2022

CalebBell commented Jun 6, 2022

Investigate storing much of the data in a database, to reduce load time and memory use #42

Investigate storing much of the data in a database, to reduce load time and memory use #42

Conversation

CalebBell commented May 22, 2022 • edited Loading

yoelcortes commented May 24, 2022

CalebBell commented May 25, 2022

CalebBell commented May 30, 2022

yoelcortes commented May 30, 2022 • edited Loading

yoelcortes commented May 30, 2022

CalebBell commented Jun 6, 2022

CalebBell commented May 22, 2022 •

edited

Loading

yoelcortes commented May 30, 2022 •

edited

Loading