UTF-8 column names written as ASCII #63

Open
sehHeiden opened this issue Dec 30, 2021 · 1 comment
sehHeiden commented Dec 30, 2021

I tried to open ESRI shapefiles offered by the open data website of Taiwan (https://data.gov.tw/en/datasets/all), for example a smaller file for Taichung City. Column names that are saved as UTF-8 characters and display correctly in QGIS (my version is 3.20) are shown as what looks like 8-bit ASCII characters when I open the files with GeoDataFrames.jl.
圖名代碼 -> \xb9ϦW\xa5N\xbdX

visr (Member) commented Dec 30, 2021

I downloaded one of the shapefiles from that website, and it looks like the DBF file that holds the column names is encoded in BIG5, not UTF-8. DBFTables.jl assumes UTF-8: it converts the raw bytes to a String, which Julia treats as UTF-8.

https://github.com/JuliaData/DBFTables.jl/blob/6b4ef1ab5843225a0e0fae04abbc3bbb44fcac44/src/DBFTables.jl#L69
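To show the effect in isolation, here is a minimal sketch that uses only Julia's Base. The byte values are taken from the garbled column name in the report; `raw` and `name` are illustrative names, not DBFTables.jl internals.

```julia
# Bytes of the column name as they appear in the report (BIG5-encoded).
raw = UInt8[0xb9, 0xcf, 0xa6, 0x57, 0xa5, 0x4e, 0xbd, 0x58]

# String(...) does not transcode; it reinterprets the bytes as UTF-8.
# copy() is used because String(::Vector{UInt8}) takes ownership of the vector.
name = String(copy(raw))

# The bytes are not valid UTF-8, so the invalid ones show up as \x escapes.
println(isvalid(name))  # false
println(repr(name))
```

The bytes themselves are preserved in the String, so nothing is lost at this point; they are only interpreted under the wrong encoding.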

If the same bytes are decoded as BIG5, however, the result looks fine:

using StringEncodings
bytes = UInt8[0xb9, 0xcf, 0xa6, 0x57, 0xa5, 0x4e, 0xbd, 0x58]
decode(bytes, "BIG5")  # -> "圖名代碼"

This package does not support it, but shapefiles often come with a .cpg file that specifies the encoding:

.cpg—An optional file that can be used to specify the codepage for identifying the characterset to be used.
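As a rough sketch of what .cpg support could look like, the helper below reads the encoding name from a sidecar file next to the DBF file. `cpg_encoding` is a hypothetical name and is not part of Shapefile.jl or DBFTables.jl. It assumes the sidecar shares the DBF file's base name with a lowercase `.cpg` extension and contains only the encoding name.

```julia
# Hypothetical helper, not part of Shapefile.jl or DBFTables.jl:
# return the encoding name stored in the .cpg sidecar next to `dbf_path`,
# falling back to `default` when the file is missing or empty.
function cpg_encoding(dbf_path::AbstractString; default::AbstractString = "UTF-8")
    cpg_path = splitext(dbf_path)[1] * ".cpg"
    isfile(cpg_path) || return String(default)
    name = strip(read(cpg_path, String))
    return isempty(name) ? String(default) : String(name)
end
```

The returned name could then be passed to `decode(bytes, name)` from StringEncodings, provided the underlying iconv recognizes it. Some .cpg files may hold values that need mapping first, such as a bare codepage number.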

I'm impressed that GDAL (used by GeoDataFrames and QGIS) seems to correctly guess the encoding here. If I export the file from QGIS, it encodes it in UTF-8 and writes a .cpg file with UTF-8 inside.

Looking at these links

it seems there is a Language Driver ID in some DBF headers that can be used for this as well. I suppose that is how GDAL figured out the encoding. DBFTables could take a dependency on StringEncodings and add support for both CPG files and Language Driver IDs, though that would require some effort.
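For illustration, a minimal sketch of reading the Language Driver ID might look as follows. It assumes the usual dBASE header layout, where the ID is the single byte at 0-based offset 29. `language_driver_id`, `ldid_codepage` and `LDID_CODEPAGES` are hypothetical names, and the table is deliberately partial, covering only a few commonly listed values.

```julia
# Partial, illustrative mapping from Language Driver ID to a codepage name.
# 0x00 means "not specified" and is intentionally absent.
const LDID_CODEPAGES = Dict{UInt8,String}(
    0x01 => "CP437",   # US MS-DOS
    0x03 => "CP1252",  # Windows ANSI
    0x78 => "CP950",   # Traditional Chinese (Big5)
    0x79 => "CP949",   # Korean
    0x7a => "CP936",   # Simplified Chinese (GBK)
    0x7b => "CP932",   # Japanese (Shift-JIS)
)

# Hypothetical helper: read the Language Driver ID byte from a DBF header.
function language_driver_id(dbf_path::AbstractString)
    open(dbf_path, "r") do io
        seek(io, 29)      # 0-based offset of the Language Driver ID
        read(io, UInt8)
    end
end

# Return a codepage name for the file, or `nothing` if the ID is unset or unknown.
ldid_codepage(dbf_path::AbstractString) =
    get(LDID_CODEPAGES, language_driver_id(dbf_path), nothing)
```

Whether a returned name such as `"CP950"` is accepted by `decode` from StringEncodings depends on the iconv library in use, so a real implementation would need a fallback for unknown or unsupported values.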

Probably your best bet for now is to just use GeoDataFrames.jl. GDAL in general is better at reading a wide variety of shapefiles compared to this package. Another option is to pull this file through GDAL's ogr2ogr utility, and since it will be UTF-8 you can use this package on the result, see the snippet here: #53 (comment).
