Allow XLS and XLSX files without a file extension to be detected as Excel instead of Wikitext #2850

Closed
thadguidry opened this issue Jun 30, 2020 · 10 comments · Fixed by #2924
Labels
import (About importers in general - add a label for the data format if available)
Priority: Medium (Represents important issues that need to be addressed but are not urgent)
Type: Feature Request (Identifies requests for new features or enhancements. These involve proposing new improvements.)
Milestone

Comments

@thadguidry
Member

thadguidry commented Jun 30, 2020

Is your feature request related to a problem or area of OpenRefine? Please describe.
Currently, if an XLSX file has no file extension, because of someone's preference or the OS setup for saving files, OpenRefine does not detect it as an Excel file on import and instead offers the Wikitext importer as the first option.

Describe the solution you'd like
XLS and XLSX files should be detected as Excel for the importer regardless of their file extension.

Describe alternatives you've considered
Ensuring that files are saved with the .xls or .xlsx extension, or renaming lots of existing XLSX files to add the extension to their filenames.

Additional context
Windows is an OS that depends heavily on file extensions to navigate and to decide which actions to perform. However, when a filename has no file extension, OpenRefine should run its detection algorithms to pick the most appropriate importer based on the content of the file.

Unfortunately, there is no file signature (magic bytes) unique to .xls and .xlsx files, although .xlsx files do start with the same magic bytes as .zip files (see https://en.wikipedia.org/wiki/List_of_file_signatures).
So other detection mechanisms will have to be used, such as Apache POI.
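
For context, container-level signatures can still be checked even though they are not specific to Excel: XLSX shares the ZIP signature, and legacy XLS files live inside the OLE2 compound file container, whose 8-byte signature is also used by other legacy Office formats. The following is a minimal sketch of sniffing those two signatures. The class and method names are hypothetical and are not part of OpenRefine or Apache POI.

```java
import java.util.Arrays;

public class ContainerSniffer {
    // ZIP local file header signature "PK\003\004"; XLSX files are ZIP packages.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // OLE2 compound file signature; used by legacy .xls but also by .doc and .ppt.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if data begins with the given magic bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
            && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // Returns "zip", "ole2" or "unknown" for the first bytes of a file.
    // A match only identifies the container, not that it holds a workbook.
    public static String sniff(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) {
            return "zip";
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return "ole2";
        }
        return "unknown";
    }
}
```

A result of "zip" or "ole2" would only narrow the candidates; a second step is still needed to confirm that the container actually holds a spreadsheet.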

Test file
Extract the file, then try to open Colorado-Municipalities-XLSX. You can also rename it to just ColoradoMunicipalities. In both cases, notice that OpenRefine detects it as Wikitext and not Excel.

Colorado-Municipalities-XLSX.zip

@thadguidry thadguidry added the Type: Feature Request and import labels Jun 30, 2020
@tfmorris
Member

Apache Tika has content sniffing, although it's a relatively heavyweight dependency.

@thadguidry
Member Author

thadguidry commented Jun 30, 2020

@tfmorris probably the easiest approach is to use Apache POI with a try/catch (or throws IOException) around each of these constructors to see whether the file really contains a workbook. Then we know we can use the Excel importer.

//xlsx  https://poi.apache.org/apidocs/4.1/org/apache/poi/xssf/usermodel/XSSFWorkbook.html
        workbook = new XSSFWorkbook(inputStream);

// xls  https://poi.apache.org/apidocs/4.1/org/apache/poi/hssf/usermodel/HSSFWorkbook.html
        workbook = new HSSFWorkbook(inputStream);

@chetan-v
Contributor

chetan-v commented Jul 2, 2020

I would like to work on this.

@chetan-v
Contributor

chetan-v commented Jul 3, 2020

@thadguidry I am now studying the codebase for this issue.

@tfmorris probably the easiest is just use Apache POI with a try catch or throws IOException against each of these to see if it indeed has a workbook. Then we know we can use the Excel importer.

//xlsx  https://poi.apache.org/apidocs/4.1/org/apache/poi/xssf/usermodel/XSSFWorkbook.html
        workbook = new XSSFWorkbook(inputStream);

// xls  https://poi.apache.org/apidocs/4.1/org/apache/poi/hssf/usermodel/HSSFWorkbook.html
        workbook = new HSSFWorkbook(inputStream);

Please give me some ideas on how to approach this.

@thadguidry
Member Author

@chetan-v Tom ( @tfmorris ) can give you guidance here.

@tfmorris
Member

tfmorris commented Jul 3, 2020

@thadguidry This looks to be a fairly complicated issue. I see a few different problems:

  • the wikitable format guesser is much too permissive (who the heck keeps their data in a wiki table anyway?), including:
    • it doesn't require the markers to be at the beginning of the line (modulo whitespace) as required by the spec
    • it doesn't check for matching numbers of begin/end table markers
  • there's nothing in the importer framework that keeps it from attempting to parse binary data (e.g. a zip file as in this case) as text
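
The two wikitable points above can be illustrated with a small sketch: count `{|` and `|}` only when they start a line (after optional whitespace) and require the counts to match. This is an assumption-laden illustration, not the actual OpenRefine guesser; the class and method names are hypothetical.

```java
public class WikitableMarkers {
    /**
     * Returns true only if the text has at least one "{|" opener at the start
     * of a line (ignoring leading whitespace) and exactly as many "|}" closers,
     * also at the start of a line.
     */
    public static boolean looksLikeWikitable(String text) {
        int opens = 0;
        int closes = 0;
        // \R matches any line break sequence (Java 8+).
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("{|")) {
                opens++;
            } else if (trimmed.startsWith("|}")) {
                closes++;
            }
        }
        return opens > 0 && opens == closes;
    }
}
```

Markers that appear in the middle of a line, as they easily do in arbitrary binary data decoded as text, are ignored by this check.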

When those things are fixed, we can see if we still need more. I don't think we want to be blindly trying to open files using heavyweight parsers. In the case of XLS and XLSX we would want to, at a minimum, make sure they were of the correct container format.
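
As a rough illustration of checking the container format before handing a file to a heavyweight parser, the sketch below scans the entries of a ZIP stream for `xl/workbook.xml`, the conventional location of the workbook part in XLSX packages. It uses only `java.util.zip`. The class name, method name, and `maxEntries` limit are hypothetical, and the conventional part name is an assumption rather than a guarantee.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class XlsxContainerCheck {
    /**
     * Returns true if the stream is a ZIP archive that contains an entry named
     * "xl/workbook.xml" within the first maxEntries entries.
     * Note: the given stream is closed when this method returns.
     */
    public static boolean looksLikeXlsx(InputStream in, int maxEntries) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            int seen = 0;
            while (seen < maxEntries && (entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("xl/workbook.xml")) {
                    return true;
                }
                seen++;
            }
        } catch (IOException e) {
            // Corrupt or unreadable archive: treat as "not XLSX".
        }
        return false;
    }
}
```

Bounding the number of entries inspected keeps the check cheap on large archives, at the cost of possibly missing a workbook part that appears late in the file.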

@thadguidry
Member Author

thadguidry commented Jul 3, 2020

@tfmorris Sure, it's not a huge priority (medium), but it would be really nice if we could do the XLS and XLSX detection without relying on the file extension. Whatever you think can help with that is fine, even if it means creating our own container detector for those two formats specifically. I trust you to make the right decision here.

@thadguidry thadguidry added the Priority: Medium label Jul 3, 2020
@tfmorris
Member

tfmorris commented Jul 3, 2020

I'm not saying it shouldn't be fixed, just that I think it's too complex for someone who's new to the project to tackle.

@thadguidry
Member Author

@tfmorris Yep, understood. Feel free to reassign. I'll let you and @chetan-v figure out a way forward.

@chetan-v
Contributor

chetan-v commented Jul 4, 2020

OK, then I will work on a different issue.

tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Jul 4, 2020
Fixes OpenRefine#2850
- Add simple magic detector for zip & gzip files to keep
  it from attempting to guess binary files
- Add a counter for C0 controls for the same reason
- Tighten wikitable counters to require marker at
  beginning of the line, per the specification
- Refactor to use Apache Commons instead of private
  counting methods
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Jul 12, 2020
Fixes OpenRefine#2850
- Add simple magic detector for zip & gzip files to keep
  it from attempting to guess binary files
- Add a counter for C0 controls for the same reason
- Tighten wikitable counters to require marker at
  beginning of the line, per the specification
- Refactor to use Apache Commons instead of private
  counting methods
- Add tests for most TextGuesser formats
wetneb pushed a commit that referenced this issue Jul 15, 2020
…ext (#2924)

* Fix text guesser so it doesn't guess wikitext

Fixes #2850
- Add simple magic detector for zip & gzip files to keep
  it from attempting to guess binary files
- Add a counter for C0 controls for the same reason
- Tighten wikitable counters to require marker at
  beginning of the line, per the specification
- Refactor to use Apache Commons instead of private
  counting methods
- Add tests for most TextGuesser formats

* Remove misplaced duplicate test data file

* Fix LGTM warning + minor cleanups

* Use BoundedInputStream to prevent runaway lines
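
The "counter for C0 controls" mentioned in the commit message can be pictured with the sketch below: text formats rarely contain control characters other than tab, line feed and carriage return, so a high share of them suggests binary content. This is a minimal illustration and not the code from the commit; the class name, method names, and threshold parameter are hypothetical.

```java
public class ControlCharCounter {
    /** Counts C0 control characters (U+0000 to U+001F) other than tab, LF and CR. */
    public static int countC0Controls(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns true if the share of such control characters exceeds the given
     * threshold (a fraction between 0 and 1; the value to use is an assumption).
     */
    public static boolean looksBinary(CharSequence text, double threshold) {
        return text.length() > 0
            && countC0Controls(text) > threshold * text.length();
    }
}
```

For example, the first four characters of a ZIP file decoded as text, "PK" followed by U+0003 and U+0004, already contain two such controls.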
@tfmorris tfmorris added this to the 3.5 milestone Jul 17, 2020

3 participants