-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate storing much of the data in a database, to reduce load time and memory use #42
Conversation
…on and add GWP and ODP to database
@CalebBell, this is awesome. I don't think there is much complexity added. I made some minor edits here and there under the assumption that the index must be either a string or an integer. The File "...\chemicals\data_reader.py", line 247, in cached_constant_lookup
CONSTANTS_CURSOR.execute("SELECT * FROM constants WHERE `index`=?", (str(CASi),))
OperationalError: no such table: constants I have not worked with sql before so I am not able to debug this quickly, but maybe you could help me? I added some code that prevent database lookup after the first failure related to sql searches. Other than getting the search to work. I think this branch can be merged whenever. Feel free to revert any of my changes that aren't helpful. Thanks! |
Hi Yoel! The database is easy to distribute as part of the pip release. It will take a little more work to get the conda build to ship with it. This is a big part of my reluctance to the approach; but as more and more data is added to the library this type of approach seems like the only one that doesn't unduly increase load time. |
@yoelcortes have you been able to check if the database generation script works for you? |
Hi @CalebBell, sorry about that, I meant to get back to you sooner but was doing some traveling for memorial weekend. Yes, it works well on my computer. I also fixed some bugs I introduced in my past commits (which required the sql file to run). As always, your work on enhancing your ChemE libraries is much appreciated by me and tons others. I am happy I can help and contribute too :) By the way, I like the new int_CAS parameter. Would it be a good idea to load other files in chemicals using this? I'm not sure if it would add too much overhead, what do you think? Here are some of the files that whose index are not currently converted to iCASs: Physical Constants of Inorganic Compounds.csv Please feel free to merge whenever, |
Forgot to mention, I think adding the sql database to the pip installation would be a good idea. I'm not sure how to add it to the conda installation either. |
Hi Yoel! We are very lucky - the conda package is built right from the package uploaded to PyPi. Because of that, the default.sqlite database needs to be generated on the machine which uploads the package to pypi. I added a script dev/prerelease.py which creates the database and will do other things related to the release as required. I am very pleased with this development; I didn't expect such a large win. The maintenance overhead will very hopefully be minimal. The new release is version 1.1.1, as this feels like a major change. I have tested and confirmed both releases ship with the sqlite database. I am not sure the idea of converting older .tsv files to use integer CAS numbers is worthwhile. In particular, I know of some people indexing into those files using CAS numbers as strings, so it would not be good to break that functionality them. I think pursuing expanding the database with temperature dependent property fits is a larger win, and hope to focus potential future memory reduction efforts into that. Thank you for your excellent review! Caleb |
This pull request adds an alternative option to looking up constant properties in Pandas dataframes. Essentially, a common path is to use a function like
Tb
without a method specified; and to not use either theTb_methods
or theTb_sources
function and variable respectively. For this one common application, the only needed information is the Tb value itself.Therefore, a script
dev/generate_sqlite_database.py
was created to obtain thedefault
values for each constant property in chemicals. The resulting database is 7.3 MB, and compresses to 3.3 MB in a zip archive.The sqlite database is always stored on disk, so there is substantial memory savings from this approach. The query time is acceptable in general. When using something like %timeit, one search is 10-15 us. However, in this case the used database pages being cached in memory. On my machine when the disk has to be accessed (as measured by doing a single read), the time is about 800 us. I wrote the database lookup to retrieve all of the pure component properties for a compound at once, and cache them in memory once accessed. The layout of sqlite is such that this is as fast as only retrieving one property.
Loading the sqlite3 module takes ~3 ms. It is delayed until needed.
There is a setting
chemicals.data_reader.USE_CONSTANTS_DATABASE
which defaults to True. The sqlite database can't be committed to git as it will be updated regularly and therefore bloat up the repository size unreasonably.Included with this change is further work on the
int_CAS
setting earlier discussed. The files with CAS numbers usingint_CAS
are now expected to have the CAS number without the dashes included, i.e. 50000 instead of 50-00-0. Doing the conversion in Python turned out to be quite slow, and this way the Pandas csv reader can read them directly as integers. To make this work, all of the data files had their header corrected to useCAS
as the index variable.There are a few more minor data corrections as well.
This PR does add considerably complexity to the project, but I believe it will be appreciated and will make the project better. In the future, it may be possible to provide something for temperature dependent properties as well.