URL Encoding Not as Expected #21

Closed
hariravi opened this issue Dec 24, 2021 · 5 comments
Labels
bug Something isn't working


hariravi commented Dec 24, 2021

I have been trying to use this package to scrape Google News. I am using the most recent release (v1.0.10) and have configured the AWS CLI. The exact code sequence that results in the failure is as follows:

  1. Get blocked by Google :) (run this, and you'll likely be blocked after 750 to 1000 requests)
import requests
for i in range(1,10000):
    response = requests.get("http://www.google.com/search?q=barry+bonds after:2021-12-22 before: 2021-12-23&tbm=nws&hl=en&num=10")
    if response.status_code != 200:
        print(i)
        print(response.status_code)
        break
  2. After getting blocked on my IP, I should still be able to access Google through the module (i.e., after running the block above, I should be able to run the block below and get a 200 response).
import requests
from requests_ip_rotator import ApiGateway

with ApiGateway("https://google.com") as g:
    session = requests.Session()
    session.mount("https://google.com", g)
    response = session.get("http://www.google.com/search?q=elon+musk after:2021-12-22 before:    2021-12-23&tbm=nws&hl=en&num=10")
    print(response.status_code)

Unfortunately, I get a 429 response. On the other hand, when I tried a proxy from scrapingbee.com after being blocked by Google (step 1), I did get a 200 response. I configured the AWS CLI, tried passing the keys as arguments, tried creating new users with API Gateway enabled, and tried the root key, but have had no luck.
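
One thing that may be worth ruling out in the snippet above (an observation, not something confirmed in this thread): requests chooses a transport adapter by checking whether the lowercased URL starts with a mounted prefix. The gateway is mounted on https://google.com, but the request goes to http://www.google.com/..., so the prefix does not match and the request may not pass through the gateway at all. The sketch below mimics that prefix rule with a hypothetical helper, is_routed_through_mount, which is not part of requests or requests_ip_rotator.

```python
def is_routed_through_mount(url: str, mount_prefix: str) -> bool:
    """Hypothetical helper: mimic the case-insensitive prefix match
    that requests applies when choosing a mounted adapter."""
    return url.lower().startswith(mount_prefix.lower())


mount_prefix = "https://google.com"
# URL shape used in step 2: different scheme and a "www." host.
blocked_url = "http://www.google.com/search?q=elon+musk"
# URL that shares the mounted prefix exactly.
matching_url = "https://google.com/search?q=elon+musk"

print(is_routed_through_mount(blocked_url, mount_prefix))
print(is_routed_through_mount(matching_url, mount_prefix))
```

On a real session, session.get_adapter(url) returns the adapter that would handle a given URL, which is a quick way to confirm whether the gateway is actually in use.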


Are you able to replicate this issue, i.e., first artificially get yourself blocked by Google and then find that you cannot scrape using this ip-rotator module? Thank you very much for an excellent module, and Merry Christmas and happy holidays!

@hariravi hariravi changed the title Linux vs Mac OS issue? Linux vs Mac OS issue, or different computers issue? Dec 24, 2021
@hariravi hariravi changed the title Linux vs Mac OS issue, or different computers issue? Linux vs Mac OS issue, or machine issue? Dec 24, 2021
@hariravi hariravi changed the title Linux vs Mac OS issue, or machine issue? Issue After ~1000 requests Dec 24, 2021
@hariravi hariravi changed the title Issue After ~1000 requests Issue After Initially Getting Blocked (steps to replicate included) Dec 25, 2021
@hariravi
Author

hariravi commented Dec 26, 2021

Update: I have fixed the issue. Apparently the module was sensitive to the URL formatting (see below). Also, for those of you doing this at scale, please make sure to specify regions: I kept getting 404 responses and realized it was because of the European regions. When I specified only US regions, many of those were resolved. Thanks!

import requests
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS, ALL_REGIONS

gateway = ApiGateway("https://www.google.com")
gateway.start(force=True)
session = requests.Session()
session.mount("https://www.google.com", gateway)

# Attempt 1: does not work; the query string contains raw, unencoded spaces
response = session.get("http://www.google.com/search?q=barry+bonds after:2021-12-22 before: 2021-12-23&tbm=nws&hl=en&num=10")
print(response.status_code)
# Attempt 2: works; spaces are percent-encoded as %20
response = session.get("https://www.google.com/search?q=barry+bonds%20after:2021-12-22%20before:2021-12-23&tbm=nws&hl=en&num=10")
print(response.status_code)
gateway.shutdown()
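
The visible difference in the query strings of the two attempts is the raw spaces. As a minimal sketch using only the standard library (not part of requests_ip_rotator), urllib.parse.quote can produce the percent-encoded form used in Attempt 2 instead of writing it by hand. The build_search_url helper and its parameter names are illustrative assumptions.

```python
from urllib.parse import quote


def build_search_url(base: str, query: str, extra: str) -> str:
    """Illustrative helper: percent-encode the search terms only."""
    # Encode spaces as %20 but keep '+' and ':' literal, so the result
    # has the same shape as the hand-written URL in Attempt 2.
    encoded_query = quote(query, safe="+:")
    return f"{base}/search?q={encoded_query}&{extra}"


url = build_search_url(
    "https://www.google.com",
    "barry+bonds after:2021-12-22 before:2021-12-23",
    "tbm=nws&hl=en&num=10",
)
print(url)
```

The resulting string can be passed to session.get(url) in place of the hand-encoded URL.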

@Ge0rg3
Owner

Ge0rg3 commented Dec 27, 2021

Hi! Thank you for your issue, glad it seems to be resolved. I'm going to keep this open, and will be fixing this URL formatting issue in the next release 😊

@Ge0rg3 Ge0rg3 reopened this Dec 27, 2021
@Ge0rg3 Ge0rg3 changed the title Issue After Initially Getting Blocked (steps to replicate included) URL Encoding Not as Expected Dec 27, 2021
@Ge0rg3 Ge0rg3 added the bug Something isn't working label Dec 27, 2021
@hariravi
Author

Thanks George, I've had it running for 15 hours straight now with no IP blocking. Outstanding module (I am starting a big-data/machine learning company which involves a ton of web scraping)!

And as you know, different URLs work in different regions. With good URL formatting and restricted regions, everything has been working well, and the cost seems far lower than that of many web-scraping/proxy services.

@Ge0rg3
Owner

Ge0rg3 commented Dec 30, 2021

Hi, it looks like AWS messes up the URL encoding on their end... I will take a look at patching, but in the meantime I'd recommend using the standard requests params dict, which prevents these issues:

import requests
from requests_ip_rotator import ApiGateway

SITE = "https://site.com"
gateway = ApiGateway(SITE)
gateway.start()
s = requests.Session()
s.mount(SITE, gateway)
# Raw query string: path reaches target site as /search?hl=en&num=10&q=barry+bonds+after:2021-12-22+before:+2021-12-23&tbm=nws
s.get(SITE + "/search?q=barry+bonds after:2021-12-22 before: 2021-12-23&tbm=nws&hl=en&num=10")
# params dict: path reaches target site as /search?hl=en&num=10&q=barry%2Bbonds+after:2021-12-22+before:+2021-12-23&tbm=nws
s.get(SITE + "/search", params={
    "q": "barry+bonds after:2021-12-22 before: 2021-12-23",
    "tbm": "nws",
    "hl": "en",
    "num": 10
})
# Remove the gateways once finished
gateway.shutdown()
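
The %2B in the second comment above appears because a literal + inside a params value is itself percent-encoded, while spaces become +. The sketch below uses the standard library's urllib.parse.urlencode to illustrate this. It only approximates what requests does with a params dict, and the path that finally reaches the target site through the gateway may differ (for example, in how : is encoded or how keys are ordered).

```python
from urllib.parse import urlencode

# A literal '+' in the value is encoded as %2B; spaces become '+'.
with_plus = urlencode({"q": "barry+bonds after:2021-12-22", "hl": "en"})

# Writing the search terms with plain spaces avoids the %2B.
with_space = urlencode({"q": "barry bonds after:2021-12-22", "hl": "en"})

print(with_plus)
print(with_space)
```

If the intent is a two-word search, passing the value with a space ("barry bonds ...") rather than a + lets the encoder produce the separator itself.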

@Ge0rg3 Ge0rg3 closed this as completed Dec 30, 2021
@Ge0rg3 Ge0rg3 pinned this issue Dec 30, 2021
@hariravi
Author

Awesome, makes sense, thanks!
