Unsupported datatypes should be imported as strings #5390
Comments
Similar to #4838, which proposes adding an option to import all cell contents as strings in the Excel importer.
In this specific case, I would suggest that times should not be cast, regardless of whether the user chooses to cast dates, since a time is not a date. Having an option to disable automatic casting would be useful nonetheless.
Would you be able to provide a sample XLS file with a time value? I would then mark this as a good first issue.
Internally, to Excel, these are all just datetimes with formatting strings. Also, OpenRefine doesn't have a time (or date) data type - only datetime. Converting Excel dates to strings may require a full and faithful implementation of the Excel date formatting (unless there's a cached version of the pre-rendered string available). I'm not sure that would be super easy as a first issue.
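To illustrate the point that Excel keeps a single number and layers a format string on top of it, here is a minimal Python sketch. The function names and the tiny format table are hypothetical (this is not OpenRefine or Apache POI code), and the 1899-12-30 epoch is the usual convention for the 1900 date system, valid only for serials of 61 and above.

```python
from datetime import datetime, timedelta

# Conventional epoch for Excel's 1900 date system. It compensates for
# Excel's phantom 1900-02-29, so it only holds for serials >= 61.
EXCEL_EPOCH = datetime(1899, 12, 30)

# Hypothetical, deliberately tiny mapping from Excel format codes to strftime.
# Real Excel formats are far richer (locales, AM/PM, elapsed time, etc.),
# which is why a faithful renderer is a substantial piece of work.
FORMAT_TO_STRFTIME = {
    "yyyy-mm-dd": "%Y-%m-%d",
    "hh:mm:ss": "%H:%M:%S",
    "yyyy-mm-dd hh:mm": "%Y-%m-%d %H:%M",
}


def serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial (days since the epoch) to a naive datetime."""
    days = int(serial)
    seconds = round((serial - days) * 86400)
    return EXCEL_EPOCH + timedelta(days=days, seconds=seconds)


def render(serial: float, format_code: str) -> str:
    """Render the same stored number under a given (supported) format code."""
    return serial_to_datetime(serial).strftime(FORMAT_TO_STRFTIME[format_code])


# One stored value, three different things a user might see in the cell.
for code in FORMAT_TO_STRFTIME:
    print(code, "->", render(45000.5, code))
```

The same stored value 45000.5 shows up as a date, a time, or a datetime depending only on the format code, so an importer that wants to keep what the user sees has to apply the format itself or find a cached rendering.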
I've updated the description to generalize the issue to all unsupported data formats. The biggest question, in my mind, is dates (without times), which OpenRefine doesn't technically support, but it seems like importing dates as (perhaps locale-specific) strings and then forcing the users to deal with them is (almost?) worse than importing them as datetimes with a time of midnight.
One big issue I have had with importing dates as datetimes with a time of midnight is the timezone sensitivity. If somehow the importer takes it to be midnight in the local time determined by the user's settings, and then the datetime is represented in UTC, then you can get off-by-one errors on the date components. It can be really frustrating. For that reason I think I prefer importing them as strings - in the spirit of "what you see in Excel is what you get in OpenRefine".
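The off-by-one effect described here can be reproduced with a short Python sketch. The function name and the fixed UTC+2 offset are assumptions for illustration only; this is not how the OpenRefine importer is written.

```python
from datetime import date, datetime, timedelta, timezone


def import_as_local_midnight(d: date, utc_offset_hours: int) -> datetime:
    """Hypothetical importer step: read a bare date as midnight in the
    user's local zone, then store the result in UTC."""
    local = timezone(timedelta(hours=utc_offset_hours))
    local_midnight = datetime(d.year, d.month, d.day, tzinfo=local)
    return local_midnight.astimezone(timezone.utc)


spreadsheet_date = date(2023, 3, 15)
stored = import_as_local_midnight(spreadsheet_date, utc_offset_hours=2)

print(stored.isoformat())  # 2023-03-14T22:00:00+00:00
print(stored.date() == spreadsheet_date)  # False: the day component shifted
```

For any zone east of UTC the stored day is the previous one, whereas a string import keeps the cell exactly as displayed.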
👍 agree
For comparison: opening this file in the Mac Numbers app shows the time as 31/12/1899 19:00:00. I'm not an Excel expert, but a quick look suggests that this is, in fact, how Excel stores the time as well (based on https://www.exceltactics.com/definitive-guide-using-dates-times-excel/#How-Excel-Stores-Times), although that page says it uses 1st Jan 1900, so I'm not sure why I see 31/12/1899 instead - possibly timezone conversion?
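As a rough sketch of how a time-only cell ends up with a date attached: Excel stores a time as a fraction of a day, and a converter that treats that fraction as a full datetime has to pick a "day 0". The Python below assumes the 1900 date system, where serial 1 is 1900-01-01 and so day 0 falls on 1899-12-31 for serials below 60; the function names are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: 1900 date system, serial 1 == 1900-01-01, so "day 0" is
# 1899-12-31. This only holds for serials below 60 (before the phantom
# 1900-02-29 that Excel counts).
DAY_ZERO = datetime(1899, 12, 31)


def time_to_serial(hours: int, minutes: int, seconds: int) -> float:
    """A time of day is stored as the elapsed fraction of a 24-hour day."""
    return (hours * 3600 + minutes * 60 + seconds) / 86400


def time_serial_to_datetime(serial: float) -> datetime:
    """Interpret a time-only serial (0 <= serial < 1) as a naive datetime."""
    if not 0 <= serial < 1:
        raise ValueError("expected a time-only serial in [0, 1)")
    return DAY_ZERO + timedelta(seconds=round(serial * 86400))


serial = time_to_serial(9, 46, 0)
print(time_serial_to_datetime(serial).isoformat())  # 1899-12-31T09:46:00
```

Under these assumptions the 09:46:00 AM cell from the issue description maps to 1899-12-31T09:46:00, the same date component as the value reported there.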
Isn't this what #5397 enables?
The change in #5397 is very helpful to work around this problem indeed, but I think it is still worth discussing what the behaviour of the importer should be when the option introduced in #5397 is not enabled, for those unsupported datatypes. Say I have a spreadsheet with numbers and dates without times: I would like to be able to import the numbers directly as numbers (because OpenRefine supports them okay) but the dates as strings (because OpenRefine does not support dates without times, at least not yet).
Timezones are an orthogonal problem. Excel doesn't have the concept of timezones for dates, datetimes, or times, so any timezone imposed by the importer (including UTC) is fictional.
Sure, but the problem is that OpenRefine (and the Java date support in general) does have support for timezones, and that is creating those off-by-one errors. Those errors would not be there if we were not trying to shoehorn dates without timezones into datetimes with timezones. So I do think it is relevant for this issue.
I'm trying to make sense of this - so apologies if any of this is incorrect.
1 is effectively adding information to the data coming in.
I'm not convinced any of these are great as 'default' behaviour tbh, but given we now have options for supporting both 1 and 3, beyond adding support for a date (rather than datetime) data type in OpenRefine it feels like we have at least the options available.
I feel like I'm slightly skating over some detail here because I think elsewhere @tfmorris has stated that ultimately it's just a number - so I'm not sure if even formatting as a date is really just cell formatting over the number?
I'm going to suggest that people defer further discussion until there's a PR to review.
Import cells with unsupported date & number formats as strings
Fixes #5389. Fixes #5390.
- Import floats, integers, percentages, & currency as numbers, preferentially as integers
- Import everything else, including dates, times, scientific notation, SSNs, telephone numbers, numbers with leading 0s, etc. as (formatted) strings
- Excel's conditional formatting isn't supported/implemented, so any rendering changes that it would cause are ignored
NOTE: Apache POI's rendering isn't fully implemented or compatible with Excel's implementation, so there will be edge cases that give bad renderings.
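The import rules in this commit message can be sketched roughly as follows. This is a hypothetical Python heuristic based on the cell's format code, not the actual logic of the change (which is Java and relies on Apache POI); the regular expression, the function names, and the assumption that a pre-rendered string is available are all illustrative.

```python
import re

# Characters ignored when deciding whether a format is "plain numeric":
# quotes, common currency symbols, and whitespace. (Illustrative choice.)
_CURRENCY_AND_QUOTES = re.compile(r'["$€£¥\s]')

# Matches formats such as 0, 0.00, #,##0, #,##0.00, 0%, 0.00%.
# Deliberately rejects 00000 (leading zeros), 000-00-0000 (SSN),
# 0.00E+00 (scientific), and anything with date/time letters.
_PLAIN_NUMBER = re.compile(r'^(#,##)?#*0(\.0+)?%?$')


def is_plain_number_format(format_code: str) -> bool:
    """Return True for float, integer, percentage, and currency formats."""
    if format_code == "General":
        return True
    stripped = _CURRENCY_AND_QUOTES.sub("", format_code)
    return bool(_PLAIN_NUMBER.match(stripped))


def import_cell(raw: float, format_code: str, rendered: str):
    """Import supported formats as numbers (preferring integers) and
    everything else as the formatted string the user sees."""
    if is_plain_number_format(format_code):
        return int(raw) if raw.is_integer() else raw
    return rendered


print(import_cell(42.0, "0", "42"))                          # 42
print(import_cell(0.25, "0%", "25%"))                        # 0.25
print(import_cell(35160 / 86400, "h:mm:ss AM/PM", "09:46:00 AM"))  # 09:46:00 AM
print(import_cell(501.0, "00000", "00501"))                  # 00501
```

Keeping the rendered string for anything outside the small numeric whitelist means leading zeros and time-only values survive the import, at the cost of depending on how faithfully the renderer reproduces Excel's output.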
There are a large number of Excel date, time, duration, post code, etc formats which are not supported by OpenRefine, but are currently imported as either numbers or datetimes. Anything unsupported should be imported as a string instead of a number or datetime.
To Reproduce
Steps to reproduce the behavior:
Import an Excel file with 09:46:00 AM in a column
Current Results
The cell is converted to 1899-12-31T09:46:00Z.
Expected Behavior
The column should be kept as a string.
Additional context
I have not found any way to disable cast on import.