[BUG] 0.9.2 seems to not quite handle 206's properly #617

palfrey · 2024-02-26T00:04:00Z

Describe the bug

I haven't seen this on 0.9.1, but I'm now seeing it on 0.9.2. For some sites, especially https://www.theguardian.com/, I'm getting logs like: newspaper.network:network.py:192 get_html_status(): bad status code 206 on URL. It looks like a 206 is being returned with a gzip-encoded response, and the library is not going back to pull the rest of the content.

Sometimes I get a newspaper.exceptions.ArticleBinaryDataException when it's clearly not an actual binary page; it's just a partially retrieved page that fails zlib decompression because only half of the page is there.
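To illustrate the failure mode described above, here is a minimal sketch using only the Python standard library. It is not newspaper's code, and the sample page and variable names are made up for the example. It shows that a gzip body cut off part way cannot be decompressed in one shot, while a streaming decompressor still yields a readable prefix.

```python
import gzip
import zlib

# Made-up page content, only for illustration.
html = "".join(f"<p>Paragraph {i} of the article.</p>" for i in range(500)).encode()
full = gzip.compress(html)
half = full[: len(full) // 2]  # simulate a body that stops part way through

# One-shot decompression rejects the truncated stream.
try:
    gzip.decompress(half)
    one_shot_error = None
except (EOFError, OSError, zlib.error) as exc:
    one_shot_error = type(exc).__name__

# A streaming decompressor returns whatever prefix it can decode.
# 16 + MAX_WBITS tells zlib to expect gzip framing.
decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
partial = decoder.decompress(half)

print("one-shot error:", one_shot_error)
print(len(partial), "of", len(html), "bytes recovered, stream finished:", decoder.eof)
```

A client that only sees the truncated bytes has no way to tell this apart from genuinely binary data unless it also looks at the status code and the Content-Range header.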

To Reproduce
Annoyingly, I don't have an easy repro of this. I can trigger it semi-reliably, but only within a large test sequence with pytest and VCR.py, and I'm trying to reduce it to a single file I can provide.

Expected behavior
Download just works with gzip-encoded pages, even if the server responds with a 206 part way through.
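One way the expected behavior could be approached is sketched below. This is not how newspaper works today; parse_content_range and next_range_header are hypothetical helpers, and the sketch assumes the server applies byte ranges to the gzip-encoded bytes. The idea is to read the Content-Range header of a 206 response, work out which bytes are still missing, and only decompress once all pieces have been joined.

```python
import gzip
import re

_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def parse_content_range(value):
    """Return (start, end, total) from a Content-Range value; total is None for '*'."""
    match = _CONTENT_RANGE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    start, end, total = match.groups()
    return int(start), int(end), None if total == "*" else int(total)


def next_range_header(content_range):
    """Return the Range value for the remaining bytes.

    Returns None when the total is unknown or has already been reached.
    """
    _, end, total = parse_content_range(content_range)
    if total is None or end + 1 >= total:
        return None
    return f"bytes={end + 1}-"


# Simulated exchange: the encoded body arrives in two pieces.
html = b"<html>" + b"<p>text</p>" * 200 + b"</html>"
encoded = gzip.compress(html)
cut = len(encoded) // 2
first, second = encoded[:cut], encoded[cut:]

first_header = f"bytes 0-{cut - 1}/{len(encoded)}"
follow_up = next_range_header(first_header)  # what a client could send next
restored = gzip.decompress(first + second)   # decode only after joining

print(follow_up, restored == html)
```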

System information

OS: Linux
Python version: 3.11.5
Library version: 0.9.2


AndyTheFactory · 2024-03-05T12:54:18Z

Hi,
I was not very familiar with how 206 is used and implemented, so I had to research it a little.

Yes, it's hard to find a test case. That is why I quickly created a simple server that delivers 206 responses.

Gzip encoded and Ranged server

I tested it with a CNN article (cnn_article.html from the tests data).

The logged warning should not influence the result. The only problem is that only the partial response gets parsed. If the HTML is cut too aggressively (for instance, if you set the limit to 10000 for the above-mentioned article), you will not get any useful text.
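The server code itself is not reproduced in this thread, so the following is only a rough stand-in built on the standard library, not the original. It assumes a made-up page, a hypothetical LIMIT on bytes per response (1000 here, because the toy page is small), and support for only the simple bytes=N- range form. It serves a gzip-encoded body in 206 slices and includes a small client that keeps requesting until the Content-Range total is reached.

```python
import gzip
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

LIMIT = 1000  # hypothetical cap on bytes sent per response
PAGE = "".join(f"<p>Paragraph {i}</p>" for i in range(2000)).encode()
BODY = gzip.compress(PAGE)  # ranges below apply to these encoded bytes


class PartialGzipHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        start = 0
        range_header = self.headers.get("Range", "")
        if range_header.startswith("bytes="):
            # Only the simple "bytes=N-" form is handled in this sketch.
            start = int(range_header[len("bytes="):].split("-")[0] or 0)
        if start >= len(BODY):
            self.send_error(416)
            return
        chunk = BODY[start:start + LIMIT]
        partial = len(chunk) < len(BODY)
        self.send_response(206 if partial else 200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Encoding", "gzip")
        if partial:
            end = start + len(chunk) - 1
            self.send_header("Content-Range", f"bytes {start}-{end}/{len(BODY)}")
        self.send_header("Content-Length", str(len(chunk)))
        self.end_headers()
        self.wfile.write(chunk)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(url, start=0):
    request = urllib.request.Request(url, headers={"Range": f"bytes={start}-"})
    with urllib.request.urlopen(request) as response:
        return response.status, response.headers.get("Content-Range"), response.read()


def demo():
    """Start the server on a free local port and download the page in slices."""
    server = HTTPServer(("127.0.0.1", 0), PartialGzipHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    received, statuses = b"", []
    try:
        while True:
            status, content_range, data = fetch(url, start=len(received))
            statuses.append(status)
            received += data
            if status != 206:
                break
            total = int(content_range.rsplit("/", 1)[1])
            if len(received) >= total:
                break
    finally:
        server.shutdown()
        server.server_close()
    return statuses, gzip.decompress(received)


if __name__ == "__main__":
    statuses, restored = demo()
    print(len(statuses), "requests, page restored:", restored == PAGE)
```

Stopping the client loop after the first response reproduces the situation described above: only the first LIMIT bytes of the encoded page are available, and any parser working on them sees a truncated document.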

There is the question of why a 206 would appear on non-binary content. As far as I could test, browsers do not request the rest of the partial content if they get a 206 on the main HTML page. They might for streaming resources, but I did not test that.

Regarding gzip encoding, it's weird that they would split the content after gzipping it; I did not find any references to such a practice. What I found was that each chunk is gzipped and then sent. What you encountered seems more like a network error. Could that be?

You can play around with the server, and maybe you can simulate a case similar to what you encountered in the wild.

palfrey · 2024-03-16T22:43:49Z

I haven't been able to fully reproduce this one. What I do have is a VCR.py cassette (https://gist.github.com/palfrey/f8556218fe86e57c1f507b8d65a3e311) that was recorded and then caused issues. Note that it somehow contains both a GET with the 206 response and a partial chunk of data, and another GET for the same URL with partial data. I have no idea what's causing that, but deleting the 206 responses from my stored data seems to solve things. AFAIK this only occurs in the test scenarios, not prod, so it might be a VCR.py issue...
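Deleting the 206 entries by hand could probably be automated. The sketch below assumes VCR.py's before_record_response hook and its dict-shaped responses (status code under response["status"]["code"]); drop_partial_responses and the cassette path are hypothetical names. To my knowledge VCR.py skips a response when the hook returns None, but that is worth checking against the installed version.

```python
def drop_partial_responses(response):
    """Return None for 206 responses so they are left out of the cassette."""
    if response["status"]["code"] == 206:
        return None
    return response


# Hook-up, assuming vcrpy is installed (the cassette path is a placeholder):
#   import vcr
#   my_vcr = vcr.VCR(before_record_response=drop_partial_responses)
#   with my_vcr.use_cassette("cassettes/example.yaml"):
#       ...

# Stand-alone check with two hand-written response dicts.
complete = {"status": {"code": 200, "message": "OK"},
            "headers": {}, "body": {"string": b"<html></html>"}}
partial = {"status": {"code": 206, "message": "Partial Content"},
           "headers": {}, "body": {"string": b"\x1f\x8b"}}
kept = [r for r in (complete, partial) if drop_partial_responses(r) is not None]
print([r["status"]["code"] for r in kept])
```

This only keeps partial responses out of newly recorded cassettes; it does not explain why the duplicate GET was recorded in the first place.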

AndyTheFactory · 2024-03-17T16:10:34Z

OK, let's keep an eye on it. I will release 0.9.3 without any extra changes to address this.
I'm now working on the last touch-ups.

palfrey added the bug label Feb 26, 2024

AndyTheFactory added this to the Release 0.9.3 milestone Feb 26, 2024

AndyTheFactory closed this as completed Mar 17, 2024

AndyTheFactory reopened this Mar 17, 2024

AndyTheFactory removed this from the Release 0.9.3 milestone Mar 17, 2024
