HTRC improvements #136

Merged: 8 commits, Apr 19, 2017

organisciak
Member

This bundles several changes that were made around the same time.

  • Fixing the broken logging reported in #133 (manager.py logging broken).
  • Adding --index-only, --no-index, and --no-delete flags to bookworm prep database_wordcounts, resolving #135 (Allow resuming of unigram ingest) and fixing one bug that came up along the way.
  • Two small improvements: db.query() now supports executemany calls, and there is a fallback that writes CSV files to the DB from Python if LOAD DATA INFILE fails. I'm not sure when this would be useful other than on a permission error; I wrote it for some benchmarking and figured it could stay in as a failsafe.
  • Support for ingest from H5 (HDF5) files. This looks for a table called unigrams inside the file, writes a set of temporary CSVs in parallel, then loads them with LOAD DATA INFILE. I opted for H5 because it is well supported in Pandas and supports 'blosc', a fast compression algorithm. I tried to keep this code as simple as possible, since it would have been easy to over-engineer.
  • I started generalizing create_unigram_book_counts, with the eventual goal of a shared create_book_counts_table method that both create_unigram_book_counts and create_bigram_book_counts can use. This relates to the discussion in #134 (Generalize unigram and bigram ingest methods). The updates above are currently specific to unigram tables, which is my use case, so this generalization will let bigram indexes keep pace.
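
For anyone reading along, the LOAD DATA INFILE fallback mentioned in the list can be sketched roughly as follows. This is a minimal illustration of the pattern, not the code in this PR: load_csv and all of its parameters are hypothetical names, it assumes a DB-API 2.0 connection, and it assumes the CSV has no header row and that the table and path come from trusted input.

```python
import csv


def load_csv(conn, path, table, n_cols, placeholder="%s", errors=(Exception,)):
    """Try a bulk LOAD DATA INFILE; fall back to executemany if it fails.

    Hypothetical helper for illustration. `placeholder` is the driver's
    parameter marker ("%s" for the common MySQL drivers). `errors` should
    be narrowed to the driver's own exception class in real use.
    Returns the name of the strategy that succeeded.
    """
    cur = conn.cursor()
    try:
        # Fast path: ask the MySQL server to read the file itself.
        cur.execute(
            f"LOAD DATA INFILE '{path}' INTO TABLE {table} "
            "FIELDS TERMINATED BY ','"
        )
        conn.commit()
        return "load_data"
    except errors:
        # For example a file-permission error on the server side.
        conn.rollback()

    # Slow path: read the CSV in Python and send the rows as parameters.
    with open(path, newline="") as f:
        rows = [tuple(r) for r in csv.reader(f)]
    marks = ", ".join([placeholder] * n_cols)
    cur.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    conn.commit()
    return "executemany"
```

The slow path holds the whole file in memory, which is fine for the small temporary CSVs described above but would want batching for large files.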

@bmschmidt bmschmidt merged commit ee7866a into master Apr 19, 2017
@bmschmidt bmschmidt deleted the htrc_improvements branch March 21, 2019 03:28
2 participants