BUG: GeoJSON file from URL not recognized as a supported file format #3284
Comments
Could be related to the latest GDAL release from a few days back. |
The CI is using pip and installing wheels, so it should not be related to the GDAL version (it's installing fiona's latest wheel, which is from March) |
Comparing the latest working run (https://github.com/vega/altair/actions/runs/8853323212/job/24313906217) vs the failing run (https://github.com/vega/altair/actions/runs/9005770219/job/24853034437) on the main branch, and comparing the versions of packages installed, the most relevant one is geopandas 0.14.3 -> 0.14.4. So we should check if this is not another regression related to the path changes |
At a quick glance it looks to me like something goes wrong here:
if _is_url(filename):
    # if it is a url that supports random access -> pass through to
    # pyogrio/fiona as is (to support downloading only part of the file)
    # otherwise still download manually because pyogrio/fiona don't support
    # all types of urls (https://github.com/geopandas/geopandas/issues/2908)
    with urllib.request.urlopen(filename) as response:
        if not response.headers.get("Accept-Ranges") == "bytes":
            filename = response.read()
            from_bytes = True
and correspondingly in Not entirely sure how that behaviour could change given that If I update the code snippet to read As an aside, |
@mattijn could you maybe restart one of the failing builds to check if it is still failing, in case this was a temporary glitch with the CDN? I thought that I could actually reproduce the issue this morning with latest main, but can't anymore right now (and no longer have the output of the console session to verify). I can reproduce the error by explicitly trying to read with
Something else I noticed (although this is an issue for pyogrio), when leaving out the
Although that might also be something on the GDAL side, because AFAIK this is just passing that url (prepended with |
I did a re-run, but that still gives the same error (link to run). By comparing the changes between 0.14.3 and 0.14.4, I see some potential related changes that might be introduced by #3232? |
Yes, but so those changes are only behind a geopandas/geopandas/io/file.py Lines 268 to 277 in 5186e69
And so locally I get:
>>> response = urllib.request.urlopen("https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/earthquakes.json")
>>> print(response.headers.get("accept-ranges"))
None
resulting in taking that path to read the response. |
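To make the check discussed above concrete, here is a minimal offline sketch of the same comparison. The helper names `supports_random_access` and `make_headers` are made up for illustration and are not part of geopandas; the header values are taken from the responses quoted in this thread.

```python
from email.message import Message


def supports_random_access(headers) -> bool:
    # Same comparison as the geopandas snippet quoted earlier in the thread:
    # only an explicit "bytes" value counts; a missing header yields None.
    return headers.get("Accept-Ranges") == "bytes"


def make_headers(pairs):
    # urllib responses expose headers as an email.message.Message subclass,
    # whose .get() ignores case. Build one by hand so no network is needed.
    msg = Message()
    for key, value in pairs:
        msg[key] = value
    return msg


with_ranges = make_headers([
    ("accept-ranges", "bytes"),
    ("content-type", "application/json; charset=utf-8"),
])
without_ranges = make_headers([
    ("server", "cloudflare"),
    ("transfer-encoding", "chunked"),
])

print(supports_random_access(with_ranges))     # True
print(supports_random_access(without_ranges))  # False
```

With the second kind of response the check is false, so the body would be downloaded manually instead of the URL being passed through.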
I realized I've access to a linux machine and had the chance to do some tests.
import urllib.request
url = 'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/earthquakes.json'
# Open URL with custom request
with urllib.request.urlopen(url) as response:
    print(response.headers.get("Accept-Ranges"))
    header = response.getheaders()
# lower/higher case doesn't matter for the `headers.get()`, but it does for sorting.
sorted_header = sorted([(key.lower(), value) for key, value in header])
print(sorted_header)
print(len(sorted_header))
consistently returns:
bytes
[('accept-ranges', 'bytes'), ('access-control-allow-origin', '*'), ('access-control-expose-headers', '*'), ('age', '223460'), ('alt-svc', 'h3=":443";ma=86400,h3-29=":443";ma=86400,h3-27=":443";ma=86400'), ('cache-control', 'public, max-age=31536000, s-maxage=31536000, immutable'), ('connection', 'close'), ('content-length', '1219853'), ('content-type', 'application/json; charset=utf-8'), ('cross-origin-resource-policy', 'cross-origin'), ('date', 'Thu, 16 May 2024 21:14:41 GMT'), ('etag', 'W/"129d0d-nk6KiNV9fUTf5O95Ns8JhUD6yxk"'), ('strict-transport-security', 'max-age=31536000; includeSubDomains; preload'), ('timing-allow-origin', '*'), ('vary', 'Accept-Encoding'), ('x-cache', 'HIT, HIT'), ('x-content-type-options', 'nosniff'), ('x-jsd-version', '1.29.0'), ('x-jsd-version-type', 'version'), ('x-served-by', 'cache-fra-eddf8230110-FRA, cache-ams21061-AMS')]
20
On Windows, however, the result differs. Sometimes it returns the same as above and sometimes it returns the following:
None
[('access-control-allow-origin', '*'), ('access-control-expose-headers', '*'), ('age', '268142'), ('alt-svc', 'h3=":443"; ma=86400'), ('cache-control', 'public, max-age=31536000, s-maxage=31536000, immutable'), ('cf-cache-status', 'HIT'), ('cf-ray', '884e612bb9c866ab-AMS'), ('connection', 'close'), ('content-type', 'application/json; charset=utf-8'), ('cross-origin-resource-policy', 'cross-origin'), ('date', 'Thu, 16 May 2024 21:14:49 GMT'), ('etag', 'W/"129d0d-nk6KiNV9fUTf5O95Ns8JhUD6yxk"'), ('nel', '{"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}'), ('report-to', '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=ftXz3PUHqi1D5sRZvNku8s9TgOPc7Uc2BjYMWoyymLptxWZNKCfovtFLWLx3TEjTSzVnYjoqHgnXNJm2Zvb67vbgWyu5BPDWj5oPXh1r29ZJs88D8auMLijhgNn52qMnYI8%3D"}],"group":"cf-nel","max_age":604800}'), ('server', 'cloudflare'), ('strict-transport-security', 'max-age=31536000; includeSubDomains; preload'), ('timing-allow-origin', '*'), ('transfer-encoding', 'chunked'), ('vary', 'Accept-Encoding'), ('x-cache', 'MISS, HIT'), ('x-content-type-options', 'nosniff'), ('x-jsd-version', '1.29.0'), ('x-jsd-version-type', 'version'), ('x-served-by', 'cache-fra-eddf8230110-FRA, cache-lga21924-LGA')]
24
By "sometimes" I mean at intervals of a few minutes between calls. It is not a fixed interval and I cannot force it with a User-Agent request header. Comparing the headers, it seems that the request on Windows is sometimes served by a Cloudflare endpoint. Maybe depending on server load? When it is served by the Cloudflare endpoint the header does not contain an While typing this, I think this might be a side effect introduced by the recent resolution of jsdelivr/jsdelivr#18565 (comment)? But again, still, if it the return
import fiona
with fiona.open("/vsicurl/https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/earthquakes.json") as features:
    columns = list(features.schema["properties"])
    print(len(columns))
Then it returns output (27) and it doesn't error (fiona version 1.9.5). If I do the same on Windows (while the header doesn't return an
One step closer or further... Hope this info helps somebody else finding the culprit. |
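The difference between the two responses above can be spotted mechanically by comparing header names. The following is only a sketch: `header_key_diff` is a hypothetical helper, and the two lists are small subsets of the headers printed above, not the full responses.

```python
def header_key_diff(first, second):
    # Compare header names case-insensitively and report the names that
    # appear in only one of the two responses.
    first_keys = {key.lower() for key, _ in first}
    second_keys = {key.lower() for key, _ in second}
    return sorted(first_keys - second_keys), sorted(second_keys - first_keys)


# Subset of the response that carried Accept-Ranges.
headers_with_ranges = [
    ("accept-ranges", "bytes"),
    ("content-length", "1219853"),
    ("vary", "Accept-Encoding"),
    ("x-cache", "HIT, HIT"),
]
# Subset of the response served via Cloudflare.
headers_via_cloudflare = [
    ("cf-cache-status", "HIT"),
    ("server", "cloudflare"),
    ("transfer-encoding", "chunked"),
    ("vary", "Accept-Encoding"),
    ("x-cache", "MISS, HIT"),
]

only_first, only_second = header_key_diff(headers_with_ranges, headers_via_cloudflare)
print(only_first)   # ['accept-ranges', 'content-length']
print(only_second)  # ['cf-cache-status', 'server', 'transfer-encoding']
```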
So on my Linux machine (Ubuntu 22.04), I consistently get But checking the header items, I am also seeing
Can you reproduce it locally that it fails when reading through Indeed, if it does return "bytes", geopandas will not download it because we then indeed assume fiona/pyogrio (GDAL/OGr in fact) can read it. |
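Since the header only goes missing intermittently, one way to check whether this reproduces locally is to probe the URL several times and tally the `Accept-Ranges` values. This is a rough sketch under assumptions: `probe_accept_ranges` and `tally_accept_ranges` are hypothetical names, and the demo below uses a fake probe with made-up return values so that it runs without network access.

```python
import time
import urllib.request
from collections import Counter

URL = "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/earthquakes.json"


def probe_accept_ranges(url):
    # One real request; returns "bytes", another value, or None if absent.
    with urllib.request.urlopen(url) as response:
        return response.headers.get("Accept-Ranges")


def tally_accept_ranges(url, attempts=5, pause=0.0, probe=probe_accept_ranges):
    # Call the probe `attempts` times, sleeping `pause` seconds in between,
    # and count how often each header value was seen.
    counts = Counter()
    for attempt in range(attempts):
        counts[probe(url)] += 1
        if pause and attempt < attempts - 1:
            time.sleep(pause)
    return counts


# Offline demo: a fake probe stands in for the CDN (illustrative values only).
fake_values = iter(["bytes", None, "bytes"])
counts = tally_accept_ranges(URL, attempts=3, probe=lambda _url: next(fake_values))
print(counts["bytes"], counts[None])  # 2 1
```

Against the real CDN one would drop the `probe` argument and use a `pause` of a minute or more, given that the switch was reported to happen at intervals of a few minutes.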
A way to directly test with GDAL:
This fails for me: the verbose logging shows that it again uses the Cloudflare server, and it basically "hangs" for 6 minutes. The output I see with some extra debugging output enabled:
One potentially interesting bit I see: cc @rouault in case you have any insight here |
We now also get the error for geopandas 0.14.3: https://github.com/vega/altair/actions/runs/9138091817/job/25128832186?pr=3419 -> Seems to not be related to changes between 0.14.3 and 0.14.4. |
That's a pity for you (no easy workaround to have CI green), but at least something that makes sense! ;) (I otherwise really couldn't explain how the changes between 0.14.3 and 0.14.4 would have impacted this) |
:) Good to hear that it's consistent with what you would expect. And we can just disable that test for now to keep going, that's fine. Btw, great Arrow tutorial at PyData Berlin! Really enjoyed it. |
@binste, after emptying the cache of that CI run, it is working with 0.14.3. Unfortunately, the issue is still there with 0.14.4. Will have a look next week to see if I can isolate it further. |
After numerous tests, I think this comment is correct:
Eventually I was able to make most of our CI happy again (from this branch), see https://github.com/vega/altair/actions/runs/9199264680/job/25303690232 with the following changes to geopandas: main...mattijn:geopandas:read_file_adaptation. The failing CI is on Python 3.8. Does geopandas 1.0 drop support for this Python version? Btw, if I understand right, there is not yet partial data access support from URLs using the pyogrio engine? |
Thanks for the investigation @mattijn! Feel free to open a PR and we can have a look. We are switching the default IO engine from fiona to pyogrio, but there aren't any plans to drop fiona as of now.
We actually dropped formal support for Python 3.8 in geopandas 0.14 (on the basis of SPEC 0 timings, rather than us switching to use Python 3.9+ exclusive features).
I'm not sure about this myself |
My understanding is that pyogrio should support that just as much as fiona does, because I thought this support came from passing the URL down to GDAL/OGR. |
We have started receiving errors in the Vega-Altair CI, and we are not sure what is causing them. Xref vega/altair#3418
The full traceback we see is the following:
It basically errors on these lines:
But the issue is, we can't reproduce it yet outside the CI on either Windows or macOS. According to the CI (ubuntu-latest) it should fail using this combination of related packages:
We have seen the same error on Python 3.10, 3.11 and 3.12 so far.
Is this a glitch that automagically will disappear in a few days, or is this something that is reproducible by others?