Help importing html #2179

HughDevlin · 2020-11-19T19:41:37Z

Thank you for this project. New user here. Requesting help with a data import of an html data file.

The Chicago Board of Elections data interface is based on xls downloads. Sample url, this is for the 2020 presidential race:

https://chicagoelections.gov/en/data-export.asp?election=251&race=11

Sample data download attached:

b.xlsx

The first character is a carriage return. DOCTYPE and HTML headers are within the 1st 1K characters.
LibreOffice Calc 6.4.6.2 opens this file fine, it prompts with the Import dialog and the defaults work fine, a workbook with one sheet with 2277 rows.
This file opens fine in a browser.

Related import script:

https://observablehq.com/d/7678ec0af15e0ab3

response = fetch(proxyUrl + electionsResultUrl).then(response => response)
txt = response.text()
csf = xlsx.read(txt, {type: 'string', WTF: true})
sheet = csf.Sheets[csf.SheetNames[0]];
csv = xlsx.utils.sheet_to_txt(sheet);

xlsx 0.16.8 seem to parse just 2 rows, looks like the 1st named range, HTML_1?

Goal is to download & convert to text for further javascript processing.

Thanks again.

The text was updated successfully, but these errors were encountered:

SheetJSDev · 2020-11-19T21:54:54Z

This export appears to be a complete HTML document including multiple TABLE entries. They cheat by naming the file with a .xls extension.

Currently the HTML parser only grabs the first table. Looking at the file, the first table runs from line 45 to 108. There are two TR rows: lines 46-74 and 79-108.

Excel is clearly not doing the right thing here, considering A1 in the worksheet is a bunch of JS code. Should each table be presented as a separate worksheet or should they be appended to the same worksheet?

All of the changes apply to html_to_book in bits/79_html.js

Separate worksheets:

	function html_to_book(str/*:string*/, opts)/*:Workbook*/ {
		var mtch = str.match(/<table.*?>[\s\S]*?<\/table>/gi);
		if(!mtch || mtch.length == 0) throw new Error("Invalid HTML: could not find <table>");
		var wb = utils.book_new();
		mtch.forEach(function(s, idx) { utils.book_append_sheet(wb, html_to_sheet(str, opts), "Sheet" + (idx+1)); });
		return wb;
	}

Same worksheet:

	function html_to_book(str/*:string*/, opts)/*:Workbook*/ {
		var mtch = str.match(/<table.*?>[\s\S]*?<\/table>/gi);
		if(!mtch || mtch.length == 0) throw new Error("Invalid HTML: could not find <table>");
		var ws = html_to_sheet(mtch[0], opts);
		mtch.forEach(function(s, idx) {
			if(idx == 0) return;
			utils.sheet_add_aoa(ws, utils.sheet_to_json(html_to_sheet(mtch[0], opts), {header:1}), {origin: -1});
		});
		return sheet_to_workbook(ws, opts);
	}

HughDevlin · 2020-11-20T15:32:32Z

Thank you very much for your help. You customized the html parse for my use case, multiple html tables, that is way above & beyond. I can understand that it is reasonable that when importing html this project scans for a table and expects one, and that finding more than one is an ambiguous situation.

Yes, it seems the Board attempted to leverage their investment in getting an html presentation of their data right and then offering a data interface by offering said html as a download and relying on aggressive spreadsheet import of said html. The Board data interface really has little to do with spreadsheets. I think it will be cleaner for me to make a DOM out of the html and traverse it with xpath.

Thank you again for this project and for your extraordinary assistance.

joshkay · 2021-03-09T16:10:55Z

I am running into the same issue. Currently dealing with parsing an export from SAP Ariba that is in xml with multiple table entries.

What is the best way to add the above code if I am using SheetJS as a depedency via npm? Will I have to clone a local copy and make the change there?

Thanks for the great library!

SheetJSDev · 2021-09-10T06:22:37Z

@HughDevlin @joshkay we'd accept a PR but would need to settle on an approach.

Should each table be presented as a separate worksheet or should they be appended to the same worksheet?

If each table is a separate worksheet, what sheet names should be assigned or how should the developer specify names?

If a single worksheet is created, how many rows should be inserted between tables? How should column widths be handled (since the columns would be shared by all of the tables)?

reviewher · 2021-11-06T03:39:42Z

3a26260

reviewher closed this as completed Nov 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Help importing html #2179

Help importing html #2179

HughDevlin commented Nov 19, 2020 •

edited

Loading

SheetJSDev commented Nov 19, 2020

HughDevlin commented Nov 20, 2020 •

edited

Loading

joshkay commented Mar 9, 2021

SheetJSDev commented Sep 10, 2021

reviewher commented Nov 6, 2021

Help importing html #2179

Help importing html #2179

Comments

HughDevlin commented Nov 19, 2020 • edited Loading

SheetJSDev commented Nov 19, 2020

HughDevlin commented Nov 20, 2020 • edited Loading

joshkay commented Mar 9, 2021

SheetJSDev commented Sep 10, 2021

reviewher commented Nov 6, 2021

HughDevlin commented Nov 19, 2020 •

edited

Loading

HughDevlin commented Nov 20, 2020 •

edited

Loading