Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsupported datatypes should be imported as strings #5390

Closed
frafra opened this issue Oct 27, 2022 · 15 comments · Fixed by #5468
Closed

Unsupported datatypes should be imported as strings #5390

frafra opened this issue Oct 27, 2022 · 15 comments · Fixed by #5468
Assignees
Labels
import About importers in general - add a label for the data format if available Type: Bug Issues related to software defects or unexpected behavior, which require resolution. XLS(X) About the Excel import / export functionality
Milestone

Comments

@frafra
Copy link
Contributor

frafra commented Oct 27, 2022

There are a large number of Excel date, time, duration, post code, etc formats which are not supported by OpenRefine, but are currently imported as either numbers or datetimes. Anything unsupported should be imported as a string instead of a number or datetime.

To Reproduce

Steps to reproduce the behavior:

  1. Import an Excel worksheet containing dates such as 09:46:00 AM in a column
  2. Open the project

Current Results

The cell is converted to 1899-12-31T09:46:00Z.

Expected Behavior

The column should be kept as a string.

Additional context

I have not found any way to disable cast on import.

@frafra frafra added Type: Bug Issues related to software defects or unexpected behavior, which require resolution. Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators labels Oct 27, 2022
@wetneb wetneb added import About importers in general - add a label for the data format if available XLS(X) About the Excel import / export functionality and removed Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators labels Oct 27, 2022
@wetneb
Copy link
Member

wetneb commented Oct 27, 2022

Similar to #4838 which proposes to add an option to import all cell contents as strings in the excel importer.

@frafra frafra closed this as completed Oct 28, 2022
@frafra frafra reopened this Oct 28, 2022
@frafra frafra changed the title Time is casted to datetime on import Time is cast to datetime on import Oct 28, 2022
@frafra
Copy link
Contributor Author

frafra commented Oct 28, 2022

In this specific case, I would suggest that time should not be cast, no matter if the user selects to cast dates or not, since time is not a date. Having an option to disable automatic cast would be useful, nonetheless.

@wetneb
Copy link
Member

wetneb commented Oct 28, 2022

Would you be able to provide a sample XLS file with a time value? I would then mark this as a good first issue.

@tfmorris
Copy link
Member

Internally these are all just datetimes with formatting strings to Excel. Also, OpenRefine doesn't have a time (or date) data type - only datetime.

Converting Excel dates to strings my require a full and faithful implementation of the Excel date formatting (unless there's a cached version of the pre-rendered string available). I'm not sure that would be super easy as a first issue.

@frafra
Copy link
Contributor Author

frafra commented Oct 31, 2022

Would you be able to provide a sample XLS file with a time value? I would then mark this as a good first issue.

openrefine-5390.xlsx

@tfmorris tfmorris changed the title Time is cast to datetime on import Unsupported datatypes should be imported as strings Nov 22, 2022
@tfmorris
Copy link
Member

I've updated the description to generalize the issue to all unsupported data formats. The biggest question, in my mind, is dates (without times), which OpenRefine doesn't technically support, but it seems like importing dates as (perhaps locale-specific) strings and then forcing the users to deal with them is (almost?) worse than importing them as datetimes with a time of midnight.

@wetneb
Copy link
Member

wetneb commented Nov 22, 2022

One big issue I have had with importing dates as datetimes with a time of midnight is the timezone-sensitivity. If somehow the importer takes it to be midnight in the local time determined by the user's settings, and then the datetime is represented in UTC, then you can get off by one errors on the date components. It can be really frustrating.

For that reason I think I prefer importing them as strings - in the spirit of "what you see in Excel is what you get in OpenRefine".

@thadguidry
Copy link
Member

For that reason I think I prefer importing them as strings - in the spirit of "what you see in Excel is what you get in OpenRefine".

👍 agree

@ostephens
Copy link
Member

For comparison - opening this file in Mac Numbers app interprets the time as: 31/12/1899 19:00:00
However, it only displays "19:00:00" based on a custom format applied to the cell.

I'm not an Excel expert, but a quick look suggests that this is, in fact, how Excel stores the time as well (based on https://www.exceltactics.com/definitive-guide-using-dates-times-excel/#How-Excel-Stores-Times - although this says it uses 1st Jan 1900 - so not sure why I see 31/12/1899 instead - possibly timezone conversion?)

@ostephens
Copy link
Member

For that reason I think I prefer importing them as strings - in the spirit of "what you see in Excel is what you get in OpenRefine".

Isn't this what #5397 enables?

@wetneb
Copy link
Member

wetneb commented Nov 23, 2022

The change in #5397 is very helpful to work around this problem indeed, but I think it is still worth discussing what the behaviour of the importer should be when the option introduced in #5397 is not enabled, for those unsupported datatypes.

Say I have a spreadsheet with numbers and dates without times: I would like to be able to import the numbers directly as numbers (because OpenRefine supports them okay) but the dates as strings (because OpenRefine does not support dates without times, at least not yet).

@tfmorris tfmorris self-assigned this Nov 23, 2022
@tfmorris
Copy link
Member

One big issue I have had with importing dates as datetimes with a time of midnight is the timezone-sensitivity. If somehow the importer takes it to be midnight in the local time determined by the user's settings, and then the datetime is represented in UTC, then you can get off by one errors on the date components. It can be really frustrating.

Timezones are an orthogonal problem. Excel doesn't have the concept of timezones for dates, datetimes, or times, so any timezone imposed by the importer (including UTC) is fictional.

@wetneb
Copy link
Member

wetneb commented Nov 23, 2022

Sure, but the problem is that OpenRefine (and the Java date support in general) does have support for timezones, and that is creating those off-by-one errors. Those errors would not be there if we were not trying to shoehorn dates without timezones into datetimes with timezones. So I do think it is relevant for this issue.

@ostephens
Copy link
Member

ostephens commented Nov 23, 2022

Say I have a spreadsheet with numbers and dates without times: I would like to be able to import the numbers directly as numbers (because OpenRefine supports them okay) but the dates as strings (because OpenRefine does not support dates without times, at least not yet).

I'm trying to make sense of this - so apologies if any of this is incorrect.
It seems like there are three ways we could import a date (not datetime) from Excel:

  1. as currently - convert to a datetime, midnight on the date specified
  2. as the underlying number stored by Excel (which would either be the number of days since 1st Jan 1900 or the number of days since 1st Jan 1904 depending on the version of Excel - brilliant)
  3. as a string representation of the data based on the cell formatting in the Excel file (the solution implemented in Add option to import Excel as text only. Closes #4838 #5397)

1 is effectively adding information to the data coming in
2 is hard to interpret and probably not expected by the user
3 is potentially lossy as the cell formatting can hide any part of the date

I'm not convinced any of these are great as 'default' behaviour tbh, but given we now have options for supporting both 1 and 3, beyond adding support for a date (rather than datetime) data type in OpenRefine it feels like we have at least the options available

I feel like I'm slightly skating over some detail here because I think elsewhere @tfmorris has stated that ultimately its just a number - so I'm not sure if even formatting as a date is really just cell formatting over the number?

@tfmorris
Copy link
Member

I'm going to suggest that people defer further discussion until there's a PR to review

tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Dec 1, 2022
Fixes OpenRefine#5389. Fixes OpenRefine#5390.

- Import floats, integers, percentages, & currency as numbers
  preferentially as integers
- Import everything else, including dates, times, scientific
  notation, SSNs, telephone numbers, numbers with leading 0s, etc
  as (formatted) strings
- Excel's conditional formatting isn't supported/implemented, so
  any rendering changes that it would cause are ignored

NOTE: Apache POIs rendering isn't fully implemented or compatible
with Excel's implementation, so there will be edge cases that give
bad renderings.
tfmorris added a commit that referenced this issue Dec 2, 2022
Import cells with unsupported date & number formats as strings

Fixes #5389. Fixes #5390.

- Import floats, integers, percentages, & currency as numbers,
  preferentially as integers
- Import everything else, including dates, times, scientific
  notation, SSNs, telephone numbers, numbers with leading 0s, etc
  as (formatted) strings
- Excel's conditional formatting isn't supported/implemented, so
  any rendering changes that it would cause are ignored

NOTE: Apache POIs rendering isn't fully implemented or compatible
with Excel's implementation, so there will be edge cases that give
bad renderings.
@tfmorris tfmorris added this to the 3.7 milestone Oct 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
import About importers in general - add a label for the data format if available Type: Bug Issues related to software defects or unexpected behavior, which require resolution. XLS(X) About the Excel import / export functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants