Option to choose smallest possible numeric types #39

Open
molsonkiko opened this issue Sep 30, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@molsonkiko
Contributor

molsonkiko commented Sep 30, 2022

One feature that would be very nice to include would be to (optionally) automatically calculate the smallest numeric type necessary for a column. Probably all floating point values should be stored as doubles or decimals, to avoid loss of precision, but AFAIK pandas and most DBMS don't automatically determine the smallest integer type that could be used for a column.

For example, with this option active, the Generate metadata form could, when generating a Python script, specify np.int32 for columns with no values outside the range [-2**31, 2**31 - 1], np.int64 for integers in the range [-2**63, 2**63 - 1], decimal for really huge integers, and so on.
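A minimal sketch of that selection rule, assuming the exact minimum and maximum of the column are already known. The helper name `smallest_int_type` and the returned type names are made up for illustration and are not part of CSVLint:

```python
# Hypothetical helper, not CSVLint code: pick the narrowest signed integer
# type whose range covers every value seen in a column.
SIGNED_INT_BITS = (8, 16, 32, 64)


def smallest_int_type(min_value, max_value):
    """Return a NumPy-style type name, or "decimal" if 64 bits is not enough."""
    for bits in SIGNED_INT_BITS:
        lo = -2 ** (bits - 1)       # e.g. -2**31 for 32 bits
        hi = 2 ** (bits - 1) - 1    # e.g. 2**31 - 1 for 32 bits
        if lo <= min_value and max_value <= hi:
            return f"np.int{bits}"
    # Assumption: anything wider than 64 bits falls back to a decimal type.
    return "decimal"


column = [12, -40000, 70000]
print(smallest_int_type(min(column), max(column)))  # np.int32
```

The bounds are inclusive, so a column whose largest value is exactly 2**31 - 1 still fits in np.int32.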

I can see downsides for this, especially if you don't have any particular reason to believe that the dataset author won't throw some anomalous data with really big/small values at you in the future. I can also see why maybe it doesn't matter that much unless you're using CSVLint to preview a very large dataset.

@BdR76
Owner

BdR76 commented Oct 2, 2022

Good point, when using the Python scripts or the database export this could technically save some memory or database disk space. Python has np.int32 and np.int64, R seems to have only 32-bit integers, and MySQL and MS-SQL each have several integer types of different sizes.
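To illustrate how the choice differs per export target, here is a rough sketch that maps a known min/max to an integer column type for MySQL and MS-SQL. The table and the function `sql_int_type` are hypothetical and not part of the plugin; only the standard signed MySQL types are listed, and MS-SQL's tinyint is treated as unsigned (0 to 255):

```python
# Hypothetical lookup table, not CSVLint code: (type name, min, max) per target,
# ordered from narrowest to widest.
INT_TYPES = {
    "mysql": [
        ("TINYINT", -2**7, 2**7 - 1),
        ("SMALLINT", -2**15, 2**15 - 1),
        ("MEDIUMINT", -2**23, 2**23 - 1),
        ("INT", -2**31, 2**31 - 1),
        ("BIGINT", -2**63, 2**63 - 1),
    ],
    "mssql": [
        ("tinyint", 0, 2**8 - 1),  # unsigned in MS-SQL
        ("smallint", -2**15, 2**15 - 1),
        ("int", -2**31, 2**31 - 1),
        ("bigint", -2**63, 2**63 - 1),
    ],
}


def sql_int_type(target, min_value, max_value):
    """Return the narrowest integer type for the target, or None if none fits."""
    for name, lo, hi in INT_TYPES[target]:
        if lo <= min_value and max_value <= hi:
            return name
    return None


print(sql_int_type("mysql", -1, 200))  # SMALLINT
print(sql_int_type("mssql", 0, 200))   # tinyint
```

The same column can therefore end up with a different type per target, which is one reason the exact value range matters.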

However, in order to determine which TinyInt, SmallInt etc. to use, the plugin would need the exact min/max values, which aren't stored in the current metadata format or collected by the autodetect. Currently, for integers it only keeps the maximum width. For example, integer width=3 means the column can hold values -99 through 999. That also doesn't work correctly when there are thousand separators: width=9 could mean a maximum such as 1,234,567. And like you said, it could break things when you later load a different data set that has different data and thus potentially different min/max values.
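A small sketch of what can be derived from the width alone, assuming the width counts every character including a leading minus sign and any thousand separators. The function names are hypothetical and this is not how the plugin is implemented:

```python
def max_digits(width, thousand_sep=False):
    """Largest number of digits that fits in `width` characters."""
    if not thousand_sep:
        return max(width, 0)
    digits = 0
    # A number with d digits needs (d - 1) // 3 separator characters,
    # so try adding one more digit while the total still fits.
    while (digits + 1) + digits // 3 <= width:
        digits += 1
    return digits


def bounds_from_width(width, thousand_sep=False):
    """Widest (min, max) integer range a column of this width could hold."""
    pos_digits = max_digits(width, thousand_sep)
    neg_digits = max_digits(width - 1, thousand_sep)  # one character for '-'
    return -(10 ** neg_digits - 1), 10 ** pos_digits - 1


print(bounds_from_width(3))                     # (-99, 999)
print(bounds_from_width(9, thousand_sep=True))  # (-999999, 9999999)
```

These are only outer bounds. A width=3 column might contain nothing larger than 100, so a type chosen from the width would be safe but often wider than the data actually needs.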

Still, it's a good suggestion as an optional feature and I'll keep it in mind. But as long as the plugin uses the schema.ini as the metadata format I probably won't be able to implement this feature.

@BdR76 BdR76 added the enhancement New feature or request label Oct 2, 2022