Allow XLS and XLSX files without a file extension to be detected as Excel instead of Wikitext #2850
Comments
Apache Tika has content sniffing, although it's a relatively heavyweight dependency.
@tfmorris probably the easiest is to just use Apache POI with a
I want to do that
@thadguidry I am now working through the codebase for this issue.
Please give me some ideas for this topic.
@thadguidry This looks to be a fairly complicated issue. I see a few different problems:
When those things are fixed, we can see if we still need more. I don't think we want to be blindly trying to open files using heavyweight parsers. In the case of XLS and XLSX we would want to, at a minimum, make sure they were of the correct container format.
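A minimal sketch of the kind of container check described above, using only the leading bytes of the file. `ContainerSniffer` and its `Container` enum are hypothetical names for illustration and are not part of OpenRefine; the byte signatures themselves are the well-known ZIP, gzip, and OLE2 compound file headers.

```java
/** Hypothetical helper (not OpenRefine code): classifies a file by its leading bytes. */
final class ContainerSniffer {
    enum Container { ZIP, GZIP, OLE2, UNKNOWN }

    // ZIP local file header "PK\3\4"; also the start of .xlsx, .docx, .jar, ...
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // gzip member header
    private static final byte[] GZIP_MAGIC = {0x1F, (byte) 0x8B};
    // OLE2 compound file header; used by legacy .xls, but also .doc, .ppt, .msi, ...
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    /** Returns the container type suggested by the first bytes of a file. */
    static Container sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Container.ZIP;
        if (startsWith(head, GZIP_MAGIC)) return Container.GZIP;
        return Container.UNKNOWN;
    }

    // True if data begins with every byte of magic, in order.
    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

A caller would pass in the first 8 bytes of the upload. A `ZIP` or `OLE2` result only identifies the container, so it rules out text formats such as Wikitext but does not by itself prove the file is a spreadsheet.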
@tfmorris Sure, it's not a huge priority (medium) but it would be really nice if we can do the XLS and XLSX detection without relying on the file extension. Whatever you think can help with that is fine, even if that means creating our own container detector for those 2 formats specifically. I trust you to make the right decision here.
I'm not saying it shouldn't be fixed, just that I think it's too complex for someone who's new to the project to tackle.
OK, then I'll solve some other problem.
Fixes OpenRefine#2850
- Add simple magic detector for zip & gzip files to keep it from attempting to guess binary files
- Add a counter for C0 controls for the same reason
- Tighten wikitable counters to require marker at beginning of the line, per the specification
- Refactor to use Apache Commons instead of private counting methods

Fixes OpenRefine#2850
- Add simple magic detector for zip & gzip files to keep it from attempting to guess binary files
- Add a counter for C0 controls for the same reason
- Tighten wikitable counters to require marker at beginning of the line, per the specification
- Refactor to use Apache Commons instead of private counting methods
- Add tests for most TextGuesser formats

…ext (#2924)
* Fix text guesser so it doesn't guess wikitext
Fixes #2850
- Add simple magic detector for zip & gzip files to keep it from attempting to guess binary files
- Add a counter for C0 controls for the same reason
- Tighten wikitable counters to require marker at beginning of the line, per the specification
- Refactor to use Apache Commons instead of private counting methods
- Add tests for most TextGuesser formats
* Remove misplaced duplicate test data file
* Fix LGTM warning + minor cleanups
* Use BoundedInputStream to prevent runaway lines
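To illustrate two of the heuristics named in these commit messages, here is a rough sketch using only the JDK. It is not the code from the pull request: `TextHeuristics` and both method names are hypothetical, the choice to ignore tab, LF, and CR when counting C0 controls is an assumption, and only the `{|`, `|-`, and `|}` wikitable markers are counted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch (not the OpenRefine implementation) of two text-guessing counters. */
final class TextHeuristics {
    // Wikitable table start "{|", row separator "|-", and table end "|}",
    // matched only at the beginning of a line (MULTILINE mode via (?m)).
    private static final Pattern WIKITABLE_MARKER =
            Pattern.compile("(?m)^(?:\\{\\||\\|-|\\|\\})");

    /**
     * Counts C0 control characters (U+0000 to U+001F). Tab, LF, and CR are
     * skipped here on the assumption that they are normal in text files.
     * A high count suggests binary content.
     */
    static int countC0Controls(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                count++;
            }
        }
        return count;
    }

    /** Counts wikitable markers that appear at the start of a line. */
    static int countWikitableMarkers(String s) {
        Matcher m = WIKITABLE_MARKER.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

Anchoring the markers to the start of a line keeps stray `{|` or `|-` byte pairs inside binary data from being counted as table syntax, which appears to be the failure mode the issue describes.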
Is your feature request related to a problem or area of OpenRefine? Please describe.
Currently, if an XLSX file has no file extension because of someone's preference or OS setup for saving files, then when importing the file, OpenRefine does not detect it as an Excel file but instead offers the Wikitext importer as the first option.
Describe the solution you'd like
XLS and XLSX files should be detected as Excel for the importer regardless of their file extension.
Describe alternatives you've considered
Ensuring that files include the .xls or .xlsx extension when saved, or renaming lots of XLSX files to add the file extension to their filenames.
Additional context
Windows is an OS that heavily depends on file extensions to navigate and perform actions based on them. However, when a filename has no file extension, OpenRefine should go through its sets of algorithms to detect the most appropriate importer based on the content of the file.
Unfortunately, there is no file signature (magic bytes) unique to .xls and .xlsx files. .xlsx files start with the same magic bytes as .zip files, and .xls files share the OLE2 compound file signature with other legacy Office formats: https://en.wikipedia.org/wiki/List_of_file_signatures
So other detection mechanisms, such as Apache POI, will have to be used.
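As a lighter-weight illustration of looking past the ZIP signature, the sketch below walks the archive's entry names with the JDK's `ZipInputStream` and looks for a workbook part. This is a stand-in for a POI-based check, not a description of how POI or OpenRefine does it. `XlsxProbe`, `looksLikeXlsx`, `zipOf`, and the entry cap are hypothetical, and relying on the conventional `xl/workbook.xml` path is an assumption (it would miss, for example, binary .xlsb workbooks).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch: decides whether ZIP bytes look like an XLSX package. */
final class XlsxProbe {
    // Assumed cap so a huge archive cannot make detection slow.
    private static final int MAX_ENTRIES = 1000;

    /** True if the data is a ZIP archive containing the conventional workbook part. */
    static boolean looksLikeXlsx(byte[] data) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            int seen = 0;
            // getNextEntry() returns null for non-ZIP input and at the end of the archive.
            while ((entry = zip.getNextEntry()) != null && seen++ < MAX_ENTRIES) {
                if (entry.getName().equals("xl/workbook.xml")) {
                    return true;
                }
            }
        } catch (IOException e) {
            return false; // corrupt or truncated archive: treat as "not XLSX"
        }
        return false;
    }

    /** Demo helper: builds an in-memory ZIP with one-byte entries of the given names. */
    static byte[] zipOf(String... entryNames) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write('x');
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

For example, `XlsxProbe.looksLikeXlsx(XlsxProbe.zipOf("[Content_Types].xml", "xl/workbook.xml"))` reports a match, while an ordinary ZIP holding only `readme.txt` does not. Because the check reads content rather than the filename, it gives the same answer for a file saved without an extension.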
Test file
Extract the file and then try to open Colorado-Municipalities-XLSX, as well as rename it to just ColoradoMunicipalities, and notice that OpenRefine will detect it as Wikitext and not Excel.
Colorado-Municipalities-XLSX.zip