-
-
Notifications
You must be signed in to change notification settings - Fork 8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Help importing html #2179
Comments
This export appears to be a complete HTML document including multiple TABLE entries. They cheat by naming the file with a Currently the HTML parser only grabs the first table. Looking at the file, the first table runs from line 45 to 108. There are two TR rows: lines 46-74 and 79-108. Excel is clearly not doing the right thing here, considering A1 in the worksheet is a bunch of JS code. Should each table be presented as a separate worksheet or should they be appended to the same worksheet? All of the changes apply to Separate worksheets: function html_to_book(str/*:string*/, opts)/*:Workbook*/ {
var mtch = str.match(/<table.*?>[\s\S]*?<\/table>/gi);
if(!mtch || mtch.length == 0) throw new Error("Invalid HTML: could not find <table>");
var wb = utils.book_new();
mtch.forEach(function(s, idx) { utils.book_append_sheet(wb, html_to_sheet(str, opts), "Sheet" + (idx+1)); });
return wb;
} Same worksheet: function html_to_book(str/*:string*/, opts)/*:Workbook*/ {
var mtch = str.match(/<table.*?>[\s\S]*?<\/table>/gi);
if(!mtch || mtch.length == 0) throw new Error("Invalid HTML: could not find <table>");
var ws = html_to_sheet(mtch[0], opts);
mtch.forEach(function(s, idx) {
if(idx == 0) return;
utils.sheet_add_aoa(ws, utils.sheet_to_json(html_to_sheet(mtch[0], opts), {header:1}), {origin: -1});
});
return sheet_to_workbook(ws, opts);
} |
Thank you very much for your help. You customized the html parse for my use case, multiple html tables, that is way above & beyond. I can understand that it is reasonable that when importing html this project scans for a table and expects one, and that finding more than one is an ambiguous situation. Yes, it seems the Board attempted to leverage their investment in getting an html presentation of their data right and then offering a data interface by offering said html as a download and relying on aggressive spreadsheet import of said html. The Board data interface really has little to do with spreadsheets. I think it will be cleaner for me to make a DOM out of the html and traverse it with xpath. Thank you again for this project and for your extraordinary assistance. |
I am running into the same issue. Currently dealing with parsing an export from SAP Ariba that is in xml with multiple table entries. What is the best way to add the above code if I am using SheetJS as a depedency via npm? Will I have to clone a local copy and make the change there? Thanks for the great library! |
@HughDevlin @joshkay we'd accept a PR but would need to settle on an approach. Should each table be presented as a separate worksheet or should they be appended to the same worksheet? If each table is a separate worksheet, what sheet names should be assigned or how should the developer specify names? If a single worksheet is created, how many rows should be inserted between tables? How should column widths be handled (since the columns would be shared by all of the tables)? |
Thank you for this project. New user here. Requesting help with a data import of an html data file.
The Chicago Board of Elections data interface is based on xls downloads. Sample url, this is for the 2020 presidential race:
https://chicagoelections.gov/en/data-export.asp?election=251&race=11
Sample data download attached:
b.xlsx
The first character is a carriage return. DOCTYPE and HTML headers are within the 1st 1K characters.
LibreOffice Calc 6.4.6.2 opens this file fine, it prompts with the Import dialog and the defaults work fine, a workbook with one sheet with 2277 rows.
This file opens fine in a browser.
Related import script:
https://observablehq.com/d/7678ec0af15e0ab3
xlsx 0.16.8 seem to parse just 2 rows, looks like the 1st named range, HTML_1?
Goal is to download & convert to text for further javascript processing.
Thanks again.
The text was updated successfully, but these errors were encountered: