Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Help importing html #2179

Closed
HughDevlin opened this issue Nov 19, 2020 · 5 comments
Closed

Help importing html #2179

HughDevlin opened this issue Nov 19, 2020 · 5 comments

Comments

@HughDevlin
Copy link

HughDevlin commented Nov 19, 2020

Thank you for this project. New user here. Requesting help with a data import of an html data file.

The Chicago Board of Elections data interface is based on xls downloads. Sample url, this is for the 2020 presidential race:

https://chicagoelections.gov/en/data-export.asp?election=251&race=11

Sample data download attached:

b.xlsx

The first character is a carriage return. DOCTYPE and HTML headers are within the 1st 1K characters.
LibreOffice Calc 6.4.6.2 opens this file fine, it prompts with the Import dialog and the defaults work fine, a workbook with one sheet with 2277 rows.
This file opens fine in a browser.

Related import script:

https://observablehq.com/d/7678ec0af15e0ab3

response = fetch(proxyUrl + electionsResultUrl).then(response => response)
txt = response.text()
csf = xlsx.read(txt, {type: 'string', WTF: true})
sheet = csf.Sheets[csf.SheetNames[0]];
csv = xlsx.utils.sheet_to_txt(sheet);

xlsx 0.16.8 seem to parse just 2 rows, looks like the 1st named range, HTML_1?

Goal is to download & convert to text for further javascript processing.

Thanks again.

@SheetJSDev
Copy link
Contributor

This export appears to be a complete HTML document including multiple TABLE entries. They cheat by naming the file with a .xls extension.

Currently the HTML parser only grabs the first table. Looking at the file, the first table runs from line 45 to 108. There are two TR rows: lines 46-74 and 79-108.

Excel is clearly not doing the right thing here, considering A1 in the worksheet is a bunch of JS code. Should each table be presented as a separate worksheet or should they be appended to the same worksheet?

All of the changes apply to html_to_book in bits/79_html.js

Separate worksheets:

	function html_to_book(str/*:string*/, opts)/*:Workbook*/ {
		var mtch = str.match(/<table.*?>[\s\S]*?<\/table>/gi);
		if(!mtch || mtch.length == 0) throw new Error("Invalid HTML: could not find <table>");
		var wb = utils.book_new();
		mtch.forEach(function(s, idx) { utils.book_append_sheet(wb, html_to_sheet(str, opts), "Sheet" + (idx+1)); });
		return wb;
	}

Same worksheet:

	function html_to_book(str/*:string*/, opts)/*:Workbook*/ {
		var mtch = str.match(/<table.*?>[\s\S]*?<\/table>/gi);
		if(!mtch || mtch.length == 0) throw new Error("Invalid HTML: could not find <table>");
		var ws = html_to_sheet(mtch[0], opts);
		mtch.forEach(function(s, idx) {
			if(idx == 0) return;
			utils.sheet_add_aoa(ws, utils.sheet_to_json(html_to_sheet(mtch[0], opts), {header:1}), {origin: -1});
		});
		return sheet_to_workbook(ws, opts);
	}

@HughDevlin
Copy link
Author

HughDevlin commented Nov 20, 2020

Thank you very much for your help. You customized the html parse for my use case, multiple html tables, that is way above & beyond. I can understand that it is reasonable that when importing html this project scans for a table and expects one, and that finding more than one is an ambiguous situation.

Yes, it seems the Board attempted to leverage their investment in getting an html presentation of their data right and then offering a data interface by offering said html as a download and relying on aggressive spreadsheet import of said html. The Board data interface really has little to do with spreadsheets. I think it will be cleaner for me to make a DOM out of the html and traverse it with xpath.

Thank you again for this project and for your extraordinary assistance.

@joshkay
Copy link

joshkay commented Mar 9, 2021

I am running into the same issue. Currently dealing with parsing an export from SAP Ariba that is in xml with multiple table entries.

What is the best way to add the above code if I am using SheetJS as a depedency via npm? Will I have to clone a local copy and make the change there?

Thanks for the great library!

@SheetJSDev
Copy link
Contributor

@HughDevlin @joshkay we'd accept a PR but would need to settle on an approach.

Should each table be presented as a separate worksheet or should they be appended to the same worksheet?

If each table is a separate worksheet, what sheet names should be assigned or how should the developer specify names?

If a single worksheet is created, how many rows should be inserted between tables? How should column widths be handled (since the columns would be shared by all of the tables)?

@reviewher
Copy link
Contributor

3a26260

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants