Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excel importer doesn't work from URL #6418

Open
tfmorris opened this issue Mar 5, 2024 · 6 comments · May be fixed by #6537
Open

Excel importer doesn't work from URL #6418

tfmorris opened this issue Mar 5, 2024 · 6 comments · May be fixed by #6537
Assignees
Labels
Type: Bug Issues related to software defects or unexpected behavior, which require resolution. XLS(X) About the Excel import / export functionality

Comments

@tfmorris
Copy link
Member

tfmorris commented Mar 5, 2024

When the Excel importer is invoked after downloading from a URL, the sheet selection logic doesn't work because it's comparing against the original URL rather than the local filename.

The relevant piece of code is:

if (!fileNameAndSheetIndex[0].equals(fileSource))
continue;

which causes every sheet to be skipped. Unfortunately, we need original URL for provenance if the user has requested it to be saved AND we need the local filename for the sheet matching, but we don't have it available.

To Reproduce

Steps to reproduce the behavior:

  1. Create Project
  2. Web Addresses (URL) - https://usace.contentdm.oclc.org/digital/api/collection/p16021coll2/id/2106/download

Current Results

Both the preview and the actual project import are empty

Expected Behavior

File contents are correctly previewed and imported.

Versions

  • OpenRefine: 3.8-SNAPSHOT

Additional context

This may also be a problem with Excel files which are contained in archives or are compressed, but I haven't checked.

@tfmorris tfmorris added Type: Bug Issues related to software defects or unexpected behavior, which require resolution. Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators XLS(X) About the Excel import / export functionality labels Mar 5, 2024
@wetneb wetneb removed the Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators label Mar 5, 2024
@Kurocifer
Copy link

Hey @tfmorris can I work on this issue ?

@tfmorris
Copy link
Member Author

tfmorris commented Apr 1, 2024

I only took a quick look at this, but I think it may be a little complicated due to not having the necessary information available when needed, as outlined above, but if you want to give it a go, go ahead! Let us know if you have any questions.

@Kurocifer
Copy link

Thanks. It actually works well with Excel files which are contained in archives or are compressed without making any modification to the program.

@Kurocifer
Copy link

Kurocifer commented Apr 3, 2024

This method might be what's causing this problem with Excel importing from URL

excelimp

when the fileRecord contains the URL (which for some reason I don't know, that of the zipped file I was using to test did not have which is why it worked for it), this method will ultimately return the URL rather than the fileName which is why at the if statement highlighted above it was comparing to the file's download URL. I don't yet really understand why this was done so, but modifying that method actually solves the problem.

Please @tfmorris or @wetneb tell me if there was any reason for that method written like that so I could dig deeper or I submit the pull request.

@tfmorris
Copy link
Member Author

tfmorris commented Apr 4, 2024

pro tip you can use Github's permalink feature to create clickable code previews that will take people directly to the code in question

static public String getFileSource(ObjectNode fileRecord) {
return JSONUtilities.getString(
fileRecord,
"url",
JSONUtilities.getString(fileRecord, "fileName", "unknown"));
}

his method will ultimately return the URL rather than the fileName which is why at the if statement highlighted above it was comparing to the file's download URL. I don't yet really understand why this was done so, but modifying that method actually solves the problem.

The original data source is stored as part of the provenance for the file creation (you can see it in the project metadata on the Open Project screen with all the projects listed). There are (potentially) multiple layers to the onion:

  1. The URL that was downloaded
  2. An archive/zip file containing one or more files
  3. An Excel/ODS file containing one or more sheets

My memory (without having had the chance to refresh it, which I'll try to do tomorrow) is that we have one fewer variables than we need to hold all the necessary data in this particular case.

They key thing to double check as you look for a solution is that you've preserved the provenance information in the project metadata.

@Kurocifer
Copy link

Kurocifer commented Apr 4, 2024

Did not know about the tip thanks. For the preservation of the provenance information in the project metadata, I just checked it and after the modifications to that method, the provenance information in the project metadata is as preserved as it was before the modifications, and the Excel importing works right.

They key thing to double check as you look for a solution is that you've preserved the provenance information in the project metadata.

Kurocifer added a commit to Kurocifer/OpenRefine that referenced this issue Apr 15, 2024
@Kurocifer Kurocifer linked a pull request Apr 15, 2024 that will close this issue
Kurocifer added a commit to Kurocifer/OpenRefine that referenced this issue Apr 30, 2024
Add a getFileName method in the ImportingUtilities so it can be used to
get the name of the file from the fileRecord.
Kurocifer added a commit to Kurocifer/OpenRefine that referenced this issue Apr 30, 2024
Add a new variable (fileName) in the ImportingParserBase method,that
will hold the return value of the new getFileName method of the
ImportingUtilities. So it can be parsed to the parseOneFile method
instead of the fileSource.

This is done to keep the provenance information from being lost.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Bug Issues related to software defects or unexpected behavior, which require resolution. XLS(X) About the Excel import / export functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants