URL Encoding Not as Expected #21

hariravi · 2021-12-24T15:24:22Z

I have been trying to use this package to scrape Google News. I am using the most recent release (v1.0.10) and have configured the AWS CLI. The exact code sequence that results in a failure is as follows:

Step 1: Get blocked by Google :) (run this, and you'll likely be blocked after 750 to 1000 requests)

import requests
for i in range(1,10000):
    response = requests.get("http://www.google.com/search?q=barry+bonds after:2021-12-22 before: 2021-12-23&tbm=nws&hl=en&num=10")
    if response.status_code != 200:
        print(i)
        print(response.status_code)
        break

After getting blocked on my IP, I should still be able to access Google through the module (i.e. after running the block above, I should be able to run the block below and get a 200 response).

import requests
from requests_ip_rotator import ApiGateway

with ApiGateway("https://google.com") as g:
    session = requests.Session()
    session.mount("https://google.com", g)
    response = session.get("http://www.google.com/search?q=elon+musk after:2021-12-22 before:    2021-12-23&tbm=nws&hl=en&num=10")
    print(response.status_code)

Unfortunately, the result is a 429 response for me. On the other hand, when I used a proxy from scrapingbee.com after initially getting blocked by Google (performing step 1), I did get a 200 response. I configured the AWS CLI, and I also tried passing the keys as arguments, creating new users with API Gateway enabled, and using the root key, but have had no luck.

Are you able to replicate this issue, i.e. first artificially getting yourself blocked by Google and then being unable to scrape using this ip-rotator module? Thank you very much for an excellent module, and Merry Christmas and happy holidays!


hariravi · 2021-12-26T14:48:09Z

Update: I have fixed the issue. Apparently the module was sensitive to the URL formatting; see below. Also, for those of you doing this at scale, please make sure to specify regions. I kept getting 404 responses and realized it was because of the European regions; when I specified only US regions, many of those were resolved. Thanks!

import requests
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS, ALL_REGIONS

gateway = ApiGateway("https://www.google.com")
gateway.start(force=True)
session = requests.Session()
session.mount("https://www.google.com", gateway)

# Attempt 1: does not work because of the URL formatting (raw spaces in the query)
response = session.get("http://www.google.com/search?q=barry+bonds after:2021-12-22 before: 2021-12-23&tbm=nws&hl=en&num=10")
print(response.status_code)
# Attempt 2: spaces percent-encoded as %20, and this works
response = session.get("https://www.google.com/search?q=barry+bonds%20after:2021-12-22%20before:2021-12-23&tbm=nws&hl=en&num=10")
print(response.status_code)
gateway.shutdown()
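
For anyone building these search URLs programmatically, one way to avoid raw spaces is to percent-encode the query with the standard library before handing the URL to the session. This is only a sketch: build_news_url is a hypothetical helper, not part of requests_ip_rotator, and the after:/before: operators and the other query parameters are simply copied from the URLs in this thread.

```python
from urllib.parse import quote, urlencode


def build_news_url(query, after, before, base="https://www.google.com/search"):
    """Build a search URL whose spaces are encoded as %20 (hypothetical helper)."""
    params = {
        "q": f"{query} after:{after} before:{before}",
        "tbm": "nws",
        "hl": "en",
        "num": 10,
    }
    # quote_via=quote emits %20 for spaces instead of "+";
    # safe=":" leaves the colons in after:/before: unescaped.
    return f"{base}?{urlencode(params, quote_via=quote, safe=':')}"


url = build_news_url("barry bonds", "2021-12-22", "2021-12-23")
print(url)
```

The resulting string contains no raw spaces and can be passed to session.get() in place of the hand-written URL in Attempt 2.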

Ge0rg3 · 2021-12-27T14:56:06Z

Hi! Thank you for your issue, glad it seems to be resolved. I'm going to keep this open, and will be fixing this URL formatting issue in the next release 😊

hariravi · 2021-12-27T15:30:04Z

Thanks George. I've had it running for 15 hours straight now with no IP blocking. Outstanding module! (I am starting a big-data/machine-learning company that involves a ton of web scraping.)

And as you know, different URLs work in different regions. With proper URL formatting and restricted regions, everything has been working well, and the costs seem far lower than those of many web-scraping/proxy services.
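
For reference, restricting regions could look roughly like the sketch below. It assumes ApiGateway accepts a regions list, as the package README describes. The candidate region names, us_only and fetch_via_us_gateway are illustrative and not taken from this thread. Running the second function needs AWS credentials and creates real API Gateway resources.

```python
def us_only(regions):
    """Keep only AWS region names that start with "us-"."""
    return [r for r in regions if r.startswith("us-")]


def fetch_via_us_gateway(site, url):
    """Send one GET through gateways created in US regions only (illustrative)."""
    # Imported lazily so us_only() can be used without these packages installed.
    import requests
    from requests_ip_rotator import ApiGateway

    # Illustrative candidate list; substitute the regions enabled on your account.
    candidates = ["us-east-1", "us-east-2", "us-west-1", "us-west-2", "eu-west-1"]
    gateway = ApiGateway(site, regions=us_only(candidates))
    gateway.start()
    try:
        session = requests.Session()
        session.mount(site, gateway)
        return session.get(url).status_code
    finally:
        # Always tear the gateways down, even if the request raises.
        gateway.shutdown()


print(us_only(["us-east-1", "eu-west-1", "us-west-2"]))
```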

Ge0rg3 · 2021-12-30T18:34:39Z

Hi, it looks like AWS messes up the URL encoding on their end... Will take a look at patching, but in the meantime I'd recommend using the requests standard params dict, which prevents these issues from taking place:

import requests
from requests_ip_rotator import ApiGateway

SITE = "https://site.com"
gateway = ApiGateway(SITE)
gateway.start()
s = requests.Session()
s.mount(SITE, gateway)
# path reaches target site as /search?hl=en&num=10&q=barry+bonds+after:2021-12-22+before:+2021-12-23&tbm=nws
s.get(SITE + "/search?q=barry+bonds after:2021-12-22 before: 2021-12-23&tbm=nws&hl=en&num=10")
# path reaches target site as /search?hl=en&num=10&q=barry%2Bbonds+after:2021-12-22+before:+2021-12-23&tbm=nws
s.get(SITE + "/search", params={
    "q": "barry+bonds after:2021-12-22 before: 2021-12-23",
    "tbm": "nws",
    "hl": "en",
    "num": 10
})
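
One detail in the second comment above is easy to miss: inside a params dict a "+" is treated as a literal plus sign and is sent as %2B, so words in the query should be separated by plain spaces there. The sketch below illustrates this with the standard library's urlencode, which should apply the same form-style rules that requests uses for a params dict. The variable names are only for illustration.

```python
from urllib.parse import parse_qs, urlencode

# "+" typed into the value is kept as a literal plus and escaped as %2B.
literal_plus = urlencode({"q": "barry+bonds after:2021-12-22"})
# A plain space is what actually becomes "+" in the encoded query string.
plain_space = urlencode({"q": "barry bonds after:2021-12-22"})

print(literal_plus)  # q=barry%2Bbonds+after%3A2021-12-22
print(plain_space)   # q=barry+bonds+after%3A2021-12-22

# Decoding shows what the target site would read as the search terms.
print(parse_qs(literal_plus)["q"][0])  # barry+bonds after:2021-12-22
print(parse_qs(plain_space)["q"][0])   # barry bonds after:2021-12-22
```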

hariravi · 2021-12-30T19:07:38Z

Awesome, makes sense, thanks!

hariravi changed the title ~~Linux vs Mac OS issue?~~ Linux vs Mac OS issue, or different computers issue? Dec 24, 2021

hariravi changed the title ~~Linux vs Mac OS issue, or different computers issue?~~ Linux vs Mac OS issue, or machine issue? Dec 24, 2021

hariravi changed the title ~~Linux vs Mac OS issue, or machine issue?~~ Issue After ~1000 requests Dec 24, 2021

hariravi changed the title ~~Issue After ~1000 requests~~ Issue After Initially Getting Blocked (steps to replicate included) Dec 25, 2021

hariravi closed this as completed Dec 26, 2021

Ge0rg3 reopened this Dec 27, 2021

Ge0rg3 changed the title ~~Issue After Initially Getting Blocked (steps to replicate included)~~ URL Encoding Not as Expected Dec 27, 2021

Ge0rg3 added the bug Something isn't working label Dec 27, 2021

Ge0rg3 closed this as completed Dec 30, 2021

Ge0rg3 pinned this issue Dec 30, 2021
