One feature that would be very nice to include would be to (optionally) automatically calculate the smallest numeric type necessary for a column. Probably all floating point values should be stored as doubles or decimals, to avoid loss of precision, but AFAIK pandas and most DBMS don't automatically determine the smallest integer type that could be used for a column.
For example, with this option active, the Generate metadata form, when producing a Python script, might specify np.int32 for columns with no values outside the range [-2**31, 2**31 - 1], np.int64 for integers in the range [-2**63, 2**63 - 1], decimal for even larger integers, and so on.
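The dtype selection described above could be sketched roughly like this (a minimal illustration, not the plugin's actual code; `smallest_int_dtype` is a hypothetical helper):

```python
import numpy as np

def smallest_int_dtype(col_min, col_max):
    """Pick the narrowest NumPy integer dtype that can hold [col_min, col_max]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= col_min and col_max <= info.max:
            return dtype
    return None  # falls outside int64; caller would fall back to decimal

smallest_int_dtype(-5, 100)   # np.int8
smallest_int_dtype(0, 200)    # np.int16 (200 exceeds int8's max of 127)
smallest_int_dtype(0, 2**40)  # np.int64
```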
I can see downsides to this, especially if you have no particular reason to believe the dataset author won't throw anomalous data with very large or small values at you in the future. I can also see why it may not matter much unless you're using CSVLint to preview a very large dataset.
Good point: when using the Python scripts or database export, this could save some memory or database disk space. Python has np.int32 and np.int64, Rscript seems to only have 32-bit integers, and MySQL and MS-SQL each have several integer types.
However, in order to determine which TINYINT, SMALLINT, etc. to use, the plugin would need the exact min/max values, which aren't stored in the current metadata format or captured by the autodetect. Currently, for integers it only keeps the maximum width; for example, integer width=3 means the column can hold values from -99 through 999. This also doesn't work correctly when there are thousand separators: width=9 could mean a maximum of 1,234,567. And as you said, it could break things when a later, different data set has different data and thus potentially different min/max values.
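If exact min/max values were available, mapping them to the narrowest SQL integer type could look something like this (a hypothetical sketch; `sql_int_type` is not part of the plugin, and MySQL's signed ranges are assumed):

```python
def sql_int_type(col_min, col_max):
    """Map an exact [col_min, col_max] range to the narrowest MySQL integer type."""
    ranges = [
        ("TINYINT",   -2**7,  2**7  - 1),
        ("SMALLINT",  -2**15, 2**15 - 1),
        ("MEDIUMINT", -2**23, 2**23 - 1),
        ("INT",       -2**31, 2**31 - 1),
        ("BIGINT",    -2**63, 2**63 - 1),
    ]
    for name, lo, hi in ranges:
        if lo <= col_min and col_max <= hi:
            return name
    return "DECIMAL(65,0)"  # beyond BIGINT

sql_int_type(0, 100)      # "TINYINT"
sql_int_type(-200, 200)   # "SMALLINT"
sql_int_type(0, 2**31)    # "BIGINT" (just past INT's max)
```

Note that a width alone can't drive this: width=3 only bounds values to somewhere in -99..999, which straddles the TINYINT/SMALLINT boundary.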
Still, it's a good suggestion as an optional feature and I'll keep it in mind. But as long as the plugin uses schema.ini as the metadata format, I probably won't be able to implement it.