Skip to content

Option to render non-text Excel cells as text when importing #4838

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
annalauraw opened this issue May 10, 2022 · 5 comments · Fixed by #5397
Closed

Option to render non-text Excel cells as text when importing #4838

annalauraw opened this issue May 10, 2022 · 5 comments · Fixed by #5397
Assignees
Labels
import About importers in general - add a label for the data format if available Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements. XLS(X) About the Excel import / export functionality
Milestone

Comments

@annalauraw
Copy link

It would be easier if OpenRefine offered an option to parse all cells as text when importing Excel files, as it does with csv files. When refining e.g. publication dates, the default behaviour of parsing digit characters in Excel files as numbers can lead to unwanted results and is cumbersome to fix.

Proposed solution

Offer the option to leave a checkbox "Parse cell text into numbers, dates, ..." unchecked at import, as with csv files.

Alternatives considered

All Excel files could be converted to csv files before importing them into OpenRefine.

Additional context

image

@annalauraw annalauraw added Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements. Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators labels May 10, 2022
@wetneb wetneb added import About importers in general - add a label for the data format if available XLS(X) About the Excel import / export functionality and removed Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators labels May 10, 2022
@thadguidry
Copy link
Member

Agree, and we have said this often, #1908 (comment) ,

just import these things as plain strings

@tfmorris
Copy link
Member

One or more examples of how the current behavior should be modified would be extremely helpful. Excel has different data types internally, so it's simply not possible "parse all cells as text." If the desire is to take an internal number or date and format it using the Excel format string and then import the resulting string, that might be possible, but it's actually more conversion, not less. Or it's possible that there's just some simple bug lurking, but without examples, it's hard to tell.

Here's the core of the import code:

CellType cellType = cell.getCellType();
if (cellType.equals(CellType.FORMULA)) {
cellType = cell.getCachedFormulaResultType();
}
if (cellType.equals(CellType.ERROR) ||
cellType.equals(CellType.BLANK)) {
return null;
}
Serializable value = null;
if (cellType.equals(CellType.BOOLEAN)) {
value = cell.getBooleanCellValue();
} else if (cellType.equals(CellType.NUMERIC)) {
double d = cell.getNumericCellValue();
if (DateUtil.isCellDateFormatted(cell)) {
value = ParsingUtilities.toDate(DateUtil.getJavaDate(d));
// TODO: If we had a time datatype, we could use something like the following
// to distinguish times from dates (although Excel doesn't really make the distinction)
// Another alternative would be to look for values < 0.60
// String format = cell.getCellStyle().getDataFormatString();
// if (!format.contains("d") && !format.contains("m") && !format.contains("y") ) {
// // It's just a time
// }
} else {
value = d;
}
} else {
String text = cell.getStringCellValue();
if (text.length() > 0) {
value = text;
}

@thadguidry
Copy link
Member

thadguidry commented Oct 18, 2022

@tfmorris can we not retain the shape of the data as it is stored in Excel at least for XLSX and simply output as a text string and let the users clean up and reformat as they wish? In other words, don't think for our users (as that code seems to want to do), give them the power to decide later how to transform?

@wetneb
Copy link
Member

wetneb commented Oct 19, 2022

I guess one way to specify this issue further would be as follows.

When opening an excel file in Excel, all non-empty cells will have a certain string representation in the UI (possibly formatted or aligned differently to mark the non-string datatype). As an OpenRefine user, I would like to have the option of importing that excel file in OpenRefine such that the cells of the resulting OpenRefine project contain the same strings as the ones that Excel displays in its UI. All those cells would therefore have string datatype in OpenRefine.

Perfect faithfulness to Excel is probably not achievable (typically because of date rendering specifics, locale differences, rounding of numbers, and things like that) but probably not needed in most cases.

From an implementation perspective, this task would consist in adapting the code @tfmorris quoted above to convert non-string datatypes to strings on the fly, at importing time, if the option to do so is used.

@tfmorris tfmorris changed the title Possibility to parse all cells as text when importing Excel files Option to render non-text Excel cells as text when importing Oct 19, 2022
@tfmorris tfmorris self-assigned this Nov 1, 2022
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Nov 1, 2022
@tfmorris
Copy link
Member

tfmorris commented Nov 1, 2022

Actually the Apache POI project already has a utility class which knows how to do Excel compatible formatting, so this is straightforward to add.

tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Nov 15, 2022
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Nov 15, 2022
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Nov 15, 2022
wetneb pushed a commit that referenced this issue Nov 21, 2022
* Add option to import as text only. Closes #4838

* Add test for Excel text import
@wetneb wetneb added this to the 3.6 milestone Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
import About importers in general - add a label for the data format if available Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements. XLS(X) About the Excel import / export functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants