@DHopkinson-DI DHopkinson-DI commented Jul 21, 2025

Same concept as TorQ's/Kx dataloader script.
Tested for kdb+ 4.1 on Linux

Fixes #38

@DHopkinson-DI DHopkinson-DI self-assigned this Jul 21, 2025
@jonathonmcmurray jonathonmcmurray linked an issue Jul 24, 2025 that may be closed by this pull request


This package is used for automated customisable dataloading and database creation and is a generalisation of http://code.kx.com/wiki/Cookbook/LoadingFromLargeFiles.
Load all delimeted files in a directory into memory in configurable chunk sizes then output the resulting tables to disk in kdb+ partiioned format.

partiioned - typo

I think this is underselling it a bit. It will not load it all into memory and then write it to disk. It will load the data chunk by chunk, so the aim is to minimise memory usage. The memory usage for this should be related to the maximum of

  • the space required to load into memory and save one chunk of data
  • the memory required to sort the resultant table

So we should be able to load large volumes of on-disk data using a relatively small memory footprint.

A couple of examples would be good:

  1. loading very large files, but in small chunks
  2. loading data across partitions from a number of small files, e.g. if we had a month's worth of AAPL data in one file and a month's worth of MSFT data in another, I believe the way this is structured it would handle it relatively efficiently.
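
For illustration, the two cases above could be sketched roughly as follows. This is not the package's interface: it is a minimal sketch built on the standard `.Q.fsn`, `.Q.en` and `.Q.par` utilities, and the names `db`, `hdrs`, `types`, `loadchunk`, `loadfile` and `loaddir`, the `trade` table, the file paths and the 25 MB chunk size are all assumptions made up for the example.

```q
/ hypothetical paths, schema and chunk size, for illustration only
db:`:hdb
hdrs:`time`sym`price`size
types:"PSFJ"

/ write a tiny sample file so the sketch is self-contained
`:trades.csv 0: ("time,sym,price,size";"2025.07.01D09:30:00,AAPL,210.5,100";"2025.07.02D09:30:00,MSFT,505.1,200")

/ parse one chunk of raw lines and append its rows to the matching date partitions
loadchunk:{[chunk]
  / drop the header row if this chunk starts with it
  chunk:$[(`$"," vs first chunk)~hdrs;1_chunk;chunk];
  t:flip hdrs!(types;",")0:chunk;
  / enumerate symbols against the db, then append each date's rows to its partition
  {[t;d](` sv .Q.par[db;d;`trade],`) upsert .Q.en[db] select from t where d=`date$time}[t] each exec distinct `date$time from t;
  }

/ case 1: stream one file in roughly 25 MB chunks instead of reading it whole
loadfile:{[file] .Q.fsn[loadchunk;file;25000000]}

/ case 2: apply the same loader to every file in a directory
loaddir:{[dir] loadfile each ` sv' dir,'key dir}

loadfile `:trades.csv
```

Only one chunk is held in memory at a time, so the footprint of the load step depends on the chunk size rather than the file size. The sketch leaves out the final sort and attribute step on each partition, which is the other term in the memory estimate above.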

Added documentation highlighting the advantages of chunking

@jamiechandler99 jamiechandler99 self-assigned this Jul 30, 2025
To account for the new import method, removed the top-level (.loader) namespacing. Also eliminated the second-level .util namespacing, as it seemed superfluous. It is unclear how setting globals within a namespace from inside a function will be affected by the new import changes, so changed how globals are set within the init function.
Account for removal of namespacing in the q script; may need to change this again once the mechanism for package importing becomes clearer
File which creates private namespace of functionality and exposes public interface
flip loadparams[`headers]!(loadparams[`types];loadparams[`separator])0:rawdata]
/ loads data in from delimited file, applies processing function, enumerates and writes to db
/ NOTE: it is not trivial to check user has inputted headers correctly, assume they have
data:$[(`$"," vs rawdata 0)~loadparams`headers; / check if first row matches headers provided

as per https://github.com/DataIntellectTech/kdbx-packages/blob/main/style.md please place comments on preceding line

@jonathonmcmurray jonathonmcmurray Oct 28, 2025

unresolving this comment as comments are still in-line
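
As an illustration of what the requested change might look like, here is a self-contained sketch with the comment moved to the line before the expression. The sample `loadparams` and `rawdata` values are invented for the example, and the header-present branch is an assumption, since that part of the expression is not shown in the quoted diff.

```q
/ hypothetical load parameters and raw lines, for illustration only
loadparams:`headers`types`separator!(`sym`price;"SF";",")
rawdata:("sym,price";"AAPL,189.5";"MSFT,402.1")

/ check if first row matches headers provided; if so, let 0: consume it as the header row
data:$[(`$"," vs rawdata 0)~loadparams`headers;
  (loadparams`types;enlist loadparams`separator)0:rawdata;
  flip loadparams[`headers]!(loadparams`types;loadparams`separator)0:rawdata]
```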

@eliotrobinson eliotrobinson merged commit 93a1f03 into main Oct 29, 2025
@eliotrobinson eliotrobinson deleted the add_dataloader_lib branch October 29, 2025 11:44

Development

Successfully merging this pull request may close these issues.

Package: Dataloader

7 participants