purl.obolibrary.org URLs return 403 when using urllib.request (Python) #923

Closed
nklsbckmnn opened this issue Jun 5, 2023 · 16 comments

Comments

@nklsbckmnn

No description provided.

@jamesaoverton
Member

Please provide example PURLs and code so we can try to replicate.

@nklsbckmnn
Author

nklsbckmnn commented Jun 5, 2023

import urllib.request

url = "http://purl.obolibrary.org/obo/hp.owl" 

response = urllib.request.urlopen(url)

@nklsbckmnn nklsbckmnn changed the title purl.obolibrary.org URLs return 403 when using urlib.request (Python) purl.obolibrary.org URLs return 403 when using urllib.request (Python) Jun 5, 2023
@matentzn
Contributor

matentzn commented Jun 5, 2023

Is this behaviour new?

This works:

from urllib.request import Request, urlopen

url = "https://purl.obolibrary.org/obo/hp.owl"
request_site = Request(url, headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(request_site)

But I wonder whether the recent changes to the PURL system now cause the error described here:

https://www.pythonpool.com/urllib-error-httperror-http-error-403-forbidden

The wheregoes trace works:

https://wheregoes.com/trace/20232599746/

@nklsbckmnn
Author

Yes, it's new. I think it still worked on Friday. Maybe GitHub is blocking user agents it suspects of abuse?

@matentzn
Contributor

matentzn commented Jun 5, 2023

On Friday we changed something in our PURL config (cc @kltm), but it is a bit odd that every tool other than urllib.request still works: wget, curl, and wheregoes.

Thanks for the report!

@jamesaoverton
Member

jamesaoverton commented Jun 5, 2023

EDIT: Posted too soon, the following is incorrect.

I think the problem is GitHub, not the PURL server. The PURL server redirects http://purl.obolibrary.org/obo/hp.owl to https://github.com/obophenotype/human-phenotype-ontology/releases/latest/download/hp.owl (https://github.com/OBOFoundry/purl.obolibrary.org/blob/master/config/hp.yml#LL9C11-L9C99). This code gives me a 403:

import urllib.request
url = "https://github.com/obophenotype/human-phenotype-ontology/releases/latest/download/hp.owl" 
response = urllib.request.urlopen(url)

@matentzn
Contributor

matentzn commented Jun 5, 2023

The plain requests library also works:

import requests
r = requests.get('http://purl.obolibrary.org/obo/omo.owl', allow_redirects=True)
with open('omo.owl', 'wb') as f:  # close the file once the download is written
    f.write(r.content)

@matentzn
Contributor

matentzn commented Jun 5, 2023

Using @eliasweatherfield's code in a try/except also works:

import urllib.request

url = "http://purl.obolibrary.org/obo/omo.owl"

try:
    response = urllib.request.urlopen(url)
except:
    print("Ignore this error")

print(response.read(100))

This suggests that the request is successful, but the error is thrown regardless.

@jamesaoverton
Member

I can confirm the 403 described by @eliasweatherfield in Python 3.9 and 3.11. I think @matentzn is seeing an old response object, because I get a NameError: name 'response' is not defined error from the final line print(response.read(100)).
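
For anyone who wants to reproduce the difference offline, here is a minimal sketch of why a bare except hides the failure. It assumes a hypothetical local stand-in server (AlwaysForbidden) that answers every GET with 403; it is not the PURL server. The helper names fetch_swallowing_errors and fetch_explicit are made up for illustration.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class AlwaysForbidden(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: answers every GET with 403."""

    def do_GET(self):
        body = b"Forbidden"
        self.send_response(403)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_swallowing_errors(opener, url):
    """The pattern from the thread: a bare except hides the 403."""
    try:
        response = opener.open(url, timeout=5)
    except:  # deliberately too broad, to mirror the original snippet
        pass
    try:
        return response.read(100)
    except NameError as error:
        # open() raised before `response` was ever assigned.
        return type(error).__name__


def fetch_explicit(opener, url):
    """Catch HTTPError by name so the status code stays visible."""
    try:
        with opener.open(url, timeout=5) as response:
            return response.status, response.read(100)
    except urllib.error.HTTPError as error:
        return error.code, None


server = HTTPServer(("127.0.0.1", 0), AlwaysForbidden)
threading.Thread(target=server.serve_forever, daemon=True).start()
# An empty ProxyHandler keeps any proxy environment variables out of the demo.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
url = f"http://127.0.0.1:{server.server_port}/obo/omo.owl"
try:
    swallowed = fetch_swallowing_errors(opener, url)
    explicit = fetch_explicit(opener, url)
finally:
    server.shutdown()
    server.server_close()

print(swallowed)
print(explicit)
```

Inside a function the missing name surfaces as UnboundLocalError, which is a subclass of NameError; at module level in a fresh interpreter it is the plain NameError quoted above. In an interactive session that already holds a response from an earlier successful call, the final read would silently reuse that old object.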

@jamesaoverton
Member

Ok, now I think that Cloudflare is rejecting the request, which makes sense given the timing of this issue:

import urllib.request
url = "http://purl.obolibrary.org/obo/hp.owl"
try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    print(e)
    print(e.code)
    print(e.reason)
    print(e.headers)

Output:

HTTP Error 403: Forbidden
403
Forbidden
Date: Mon, 05 Jun 2023 17:48:43 GMT
Content-Type: text/plain; charset=UTF-8
Content-Length: 16
Connection: close
X-Frame-Options: SAMEORIGIN
Referrer-Policy: same-origin
Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Vary: Accept-Encoding
Server: cloudflare
CF-RAY: 7d2a3f81995dcab8-YYZ
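
The headers are what point at Cloudflare rather than GitHub or the PURL server. As a rough sketch, the identifying fields can be pulled out of the exception like this. The helper name summarize_http_error is hypothetical, and the error object is rebuilt by hand from the values printed above so the example runs without network access.

```python
import email.message
import urllib.error


def summarize_http_error(error):
    """Pull out the fields that hint at which layer rejected the request."""
    headers = error.headers
    return {
        "code": error.code,
        "reason": error.reason,
        "server": headers.get("Server"),
        "cf_ray": headers.get("CF-RAY"),
    }


# Rebuild an error resembling the trace above (sample values copied from it).
sample_headers = email.message.Message()
sample_headers["Server"] = "cloudflare"
sample_headers["CF-RAY"] = "7d2a3f81995dcab8-YYZ"
sample_error = urllib.error.HTTPError(
    "http://purl.obolibrary.org/obo/hp.owl", 403, "Forbidden", sample_headers, None
)

summary = summarize_http_error(sample_error)
print(summary)
```

A Server value of cloudflare together with a CF-RAY header suggests the response came from the Cloudflare edge and the request never reached the redirect target.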

@jamesaoverton
Member

I think the cause is Cloudflare's Browser Integrity Check, which is a security setting that can be turned off: https://developers.cloudflare.com/support/firewall/settings/understanding-the-cloudflare-browser-integrity-check/
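
To illustrate the kind of check that would explain the symptoms, here is a toy sketch under a loud assumption: that the block keys on the User-Agent header, which would be consistent with the Mozilla/5.0 workaround earlier in the thread. The UserAgentGate handler is hypothetical and is not Cloudflare's actual logic; it only rejects agents that start with "Python-urllib", which is the default that urllib.request sends.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class UserAgentGate(BaseHTTPRequestHandler):
    """Toy gate (an assumption, not Cloudflare's logic): reject urllib's default agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        blocked = agent.startswith("Python-urllib")
        body = b"blocked" if blocked else b"ok"
        self.send_response(403 if blocked else 200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def status_for(opener, url, headers=None):
    """Return the HTTP status code, whether or not urllib raises."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


server = HTTPServer(("127.0.0.1", 0), UserAgentGate)
threading.Thread(target=server.serve_forever, daemon=True).start()
# An empty ProxyHandler keeps any proxy environment variables out of the demo.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
url = f"http://127.0.0.1:{server.server_port}/obo/hp.owl"
try:
    default_agent = dict(opener.addheaders)["User-agent"]
    default_status = status_for(opener, url)
    browser_status = status_for(opener, url, {"User-Agent": "Mozilla/5.0"})
finally:
    server.shutdown()
    server.server_close()

print(default_agent, default_status, browser_status)
```

Against this toy gate the default request is rejected and the one with a browser-like agent goes through, which matches the pattern reported above. It says nothing about what else the real check inspects.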

@kltm
Contributor

kltm commented Jun 5, 2023

@jamesaoverton I believe that I've turned off BIC for this domain (Cloudflare docs are apparently wildly out of date and not great to begin with).

@jamesaoverton
Member

Thanks @kltm! I'm now getting a 200 response from the first test code posted above -- no more error.

@eliasweatherfield Can you confirm that this is now working for you?

@nklsbckmnn
Author

Yes, it's working again. Thanks everyone.

@jamesaoverton
Member

Thanks for the report!
