Add dataloader lib #5
dataloader/dataloader.md
This package is used for automated customisable dataloading and database creation and is a generalisation of http://code.kx.com/wiki/Cookbook/LoadingFromLargeFiles.
Load all delimeted files in a directory into memory in configurable chunk sizes then output the resulting tables to disk in kdb+ partiioned format.
partiioned - typo
I think this is underselling it a bit. It will not load everything into memory and then write it to disk. It will load the data chunk by chunk, so the aim is to minimise memory usage. The memory usage for this should be related to the maximum of
- the space required to load into memory and save one chunk of data
- the memory required to sort the resultant table
So we should be able to load large volumes of on-disk data using a relatively small memory footprint.
A couple of examples would be good:
- loading very large files, but in small chunks
- loading data across partitions from a number of small files, e.g. if we had a month's worth of AAPL data in one file and a month's worth of MSFT data in another, I believe the way this is structured would handle it relatively efficiently.
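A minimal sketch of the chunk-by-chunk pattern described above, not the package's actual code. It assumes a hypothetical `trade.csv` with columns `time,sym,price,size`, a hypothetical database directory `hdb`, and made-up names `hdr`, `typ`, `db` and `writechunk`. Only `.Q.fsn`, `.Q.en`, `.Q.par`, `0:` and `upsert` are standard kdb+.

```q
/ hypothetical schema and paths, for illustration only
hdr:`time`sym`price`size
typ:"PSFJ"
db:`:hdb

/ parse one chunk of lines, enumerate symbols, append rows to each date partition touched
writechunk:{[rows]
  / drop the header line if this chunk starts with it
  rows:$[(`$"," vs first rows)~hdr;1_rows;rows];
  / parse the delimited text into a table
  t:flip hdr!(typ;",")0:rows;
  / append the rows for each date to that date's splayed table
  {[t;d]
    p:.Q.par[db;d;`trade];
    (` sv p,`) upsert .Q.en[db] select from t where d=`date$time
   }[t] each distinct `date$t`time;
  }

/ stream the file in chunks of roughly 1MB instead of reading it whole
.Q.fsn[writechunk;`:trade.csv;1000000]
```

With this shape, only one chunk is held in memory at a time, which is the first term in the maximum above. Sorting each partition and applying attributes afterwards would account for the second term. The same function could be run over several small files in turn, since each chunk is routed to whichever partitions its rows belong to.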
Added documentation highlighting the advantages of chunking
To account for the new import method, removed the top-level (.loader) namespacing. Also removed the second-level .util namespacing, as it seemed superfluous. It is unclear how setting globals within a namespace from within a function will be affected by the new import changes, so changed how globals are set within the init function.
Account for the removal of namespacing in the q script; may need to change this again once the mechanism for package importing becomes clearer
File which creates private namespace of functionality and exposes public interface
dataloader/dataloader.q
flip loadparams[`headers]!(loadparams[`types];loadparams[`separator])0:rawdata]
/ loads data in from delimited file, applies processing function, enumerates and writes to db
/ NOTE: it is not trivial to check user has inputted headers correctly, assume they have
data:$[(`$"," vs rawdata 0)~loadparams`headers; / check if first row matches headers provided
as per https://github.com/DataIntellectTech/kdbx-packages/blob/main/style.md please place comments on preceding line
unresolving this comment as comments are still in-line
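For reference, a sketch of what the requested layout might look like, with each comment on the line before the code it describes. The `hdr` and `rawdata` values and the `hasheader` name are hypothetical and stand in for the package's `loadparams` fields; this is not the actual change.

```q
/ hypothetical inputs for illustration
hdr:`sym`price
rawdata:("sym,price";"AAPL,101.5";"MSFT,99.2")
/ check if first row matches headers provided
hasheader:(`$"," vs rawdata 0)~hdr
/ drop the header row when present, then parse the remaining lines
data:flip hdr!("SF";",")0:$[hasheader;1_rawdata;rawdata]
```

Splitting the header check into its own assignment gives the comment a natural preceding line, which avoids the trailing in-line comment after `$[...;`.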
Same concept as TorQ's/Kx dataloader script.
Tested for kdb+ 4.1 on Linux
Fixes #38