Option to choose smallest possible numeric types #39

molsonkiko · 2022-09-30T16:33:23Z

One feature that would be very nice to include would be to (optionally) automatically calculate the smallest numeric type necessary for a column. Probably all floating point values should be stored as doubles or decimals, to avoid loss of precision, but AFAIK pandas and most DBMS don't automatically determine the smallest integer type that could be used for a column.

For example, with this option active, the Generate metadata form could, when producing a Python script, specify np.int32 for columns with no values outside the inclusive range [-2**31, 2**31 - 1], np.int64 for integers in the range [-2**63, 2**63 - 1], decimal for really huge integers, and so on.
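To make the idea concrete, here is a rough sketch of the selection logic, assuming the column's exact minimum and maximum are already known. The helper name `smallest_int_type` and the `INT_TYPES` table are hypothetical, not part of CSVLint, and the int8/int16 rows are my own extension of the "and so on" above. It returns type names as strings, which is roughly what a generated script would need.

```python
# Hypothetical helper, not CSVLint code: pick the narrowest integer type
# whose inclusive range covers a column's observed min and max.

# Candidate types, narrowest first, with inclusive lower/upper bounds.
INT_TYPES = [
    ("np.int8", -2**7, 2**7 - 1),
    ("np.int16", -2**15, 2**15 - 1),
    ("np.int32", -2**31, 2**31 - 1),
    ("np.int64", -2**63, 2**63 - 1),
]


def smallest_int_type(min_value: int, max_value: int) -> str:
    """Return the name of the narrowest type that fits [min_value, max_value]."""
    for name, lo, hi in INT_TYPES:
        if lo <= min_value and max_value <= hi:
            return name
    # Nothing fixed-width fits, so fall back to an arbitrary-precision type.
    return "decimal"
```

A column spanning 0 to 2**31 would land on np.int64 here, since 2**31 is one past the int32 maximum.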

I can see downsides for this, especially if you don't have any particular reason to believe that the dataset author won't throw some anomalous data with really big/small values at you in the future. I can also see why maybe it doesn't matter that much unless you're using CSVLint to preview a very large dataset.

BdR76 · 2022-10-02T10:23:56Z

Good point. When using the Python scripts or the database export, this could save some memory or database disk space. Python has np.int32 and np.int64, Rscript seems to only have 32-bit integers, and MySQL and MS-SQL each have several integer types of different sizes.

However, in order to determine which TinyInt, SmallInt, etc. to use, the plugin would need the exact min/max values, which aren't stored in the current metadata format or tracked by autodetect. Currently, for integers it only keeps the max width; for example, integer width=3 means the column can hold values -99 through 999. That also doesn't work correctly when there are thousand separators: a width=9 column could have a max of just 1,234,567. And like you said, it could break things when you later load a different dataset with potentially different min/max values.
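As a rough illustration of the width limitation, the sketch below derives the widest possible value range from a character width alone, assuming the minus sign uses one character and there are no thousand separators. The function name `width_bounds` is hypothetical and this is not how the plugin is implemented.

```python
# Hypothetical sketch, not plugin code: the value range implied by an
# integer column's maximum character width, assuming plain digits only.

def width_bounds(width: int) -> tuple[int, int]:
    """Return (lowest, highest) integer that fits in `width` characters."""
    if width < 1:
        raise ValueError("width must be at least 1")
    highest = 10**width - 1            # all characters are digits
    lowest = -(10**(width - 1) - 1)    # one character is spent on the sign
    return lowest, highest
```

For width=3 this gives -99 through 999. With thousand separators the bounds overestimate: a width of 9 allows up to 999,999,999 by this calculation, while a value like 1,234,567 has only seven digits, so a type chosen from the width may be larger than necessary.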

Still, it's a good suggestion as an optional feature and I'll keep it in mind. But as long as the plugin uses the schema.ini as the metadata format I probably won't be able to implement this feature.

BdR76 added the enhancement New feature or request label Oct 2, 2022
