# Web Scraping Deep Dive 

# Errors and Exceptions

Requests raises an exception whenever a network problem occurs. Every exception that Requests explicitly raises inherits from the requests.exceptions.RequestException class.

### Here is a short description of the common errors you may run into:

- ConnectionError is raised on a DNS failure, a refused connection, or any other connection-related problem.
- Timeout is raised if a request times out.
- TooManyRedirects is raised if a request exceeds the configured maximum number of redirections.
- HTTPError is raised for invalid HTTP responses, and by raise_for_status() when the response has a 4xx or 5xx status code.

For a more complete list and description of the exceptions you may run into, check out the requests documentation:
https://buildmedia.readthedocs.org/media/pdf/requests/master/requests.pdf
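
Because every exception in the list above derives from RequestException, a single except clause for that base class catches all of them. The check below is only a sketch to make the hierarchy visible; it inspects the exception classes and makes no network request.

```python
from requests import exceptions as exc

# Each exception described above derives from RequestException.
common = (exc.ConnectionError, exc.Timeout, exc.TooManyRedirects, exc.HTTPError)
for cls in common:
    print(cls.__name__, issubclass(cls, exc.RequestException))

# ConnectTimeout derives from both ConnectionError and Timeout, so the order of
# your except clauses decides which handler sees it first.
connect_timeout_bases = [base.__name__ for base in exc.ConnectTimeout.__bases__]
print(connect_timeout_bases)
```

In practice this means the most specific exceptions should be listed first and RequestException last, as a catch-all.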


# Handling Web Response Status Codes

In [1]:
import requests

In [2]:
r2 = requests.get('https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/forbidden')
print(r2.status_code)


403


In [3]:
r3 = requests.get('http://google.com')
print(r3.history)

print(r3.history[0].status_code)

print(r3.history[0].text)

[<Response [301]>]
301
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>



In [4]:
r = requests.get('http://google.com')  # redirects are followed, so `r` is the final response
if r.status_code < 300:
    print('request was successful')
elif 400 <= r.status_code < 500:
    print('request failed because the resource either does not exist or is forbidden')
elif r.status_code >= 500:
    print('request failed because the server encountered an error')
else:
    print('request ended on a redirect that was not followed')

request was successful
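
The numeric comparisons in the cell above can also be written against named constants. As a small sketch: requests.codes is part of the library, while is_client_error is a hypothetical helper introduced only for this illustration.

```python
import requests

# requests.codes maps readable names to numeric status codes.
print(requests.codes.ok)                 # 200
print(requests.codes.moved_permanently)  # 301
print(requests.codes.forbidden)          # 403
print(requests.codes.not_found)          # 404


def is_client_error(status_code):
    """Hypothetical helper: True for any 4xx status code."""
    return 400 <= status_code < 500


print(is_client_error(requests.codes.not_found))
```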


In [5]:
url = 'http://www.google.com/blahblah'

try:
    r = requests.get(url, timeout=3)
    r.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
except requests.exceptions.HTTPError as errh:
    print("Http Error:", errh)
except requests.exceptions.ConnectionError as errc:
    print("Error Connecting:", errc)
except requests.exceptions.Timeout as errt:
    print("Timeout Error:", errt)
except requests.exceptions.RequestException as err:
    print("Oops: Something Else", err)



Http Error: 404 Client Error: Not Found for url: http://www.google.com/blahblah
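
The HTTPError in the output above comes from raise_for_status(), not from requests.get() itself: a 404 is still a valid response. The sketch below shows this without touching the network. The hand-built Response object and the example.invalid URL are assumptions made purely for illustration; real responses come from calls such as requests.get().

```python
import requests

# Build a bare Response by hand (illustration only) so no network is needed.
fake = requests.Response()
fake.status_code = 404
fake.reason = "Not Found"
fake.url = "http://example.invalid/missing"

print(fake.ok)  # False for any 4xx or 5xx status

try:
    fake.raise_for_status()
    message = None
except requests.exceptions.HTTPError as err:
    message = str(err)
    print("Http Error:", message)
```

If you skip raise_for_status(), no exception is raised for a 4xx or 5xx response, and you have to inspect status_code (or the ok property) yourself.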


# Handling Redirections

Redirection in HTTP means forwarding the network request to a different URL. For example, if we make a request to "http://www.github.com", it will redirect to "https://github.com" using a 301 redirect.

In [34]:
r = requests.get("http://www.github.com")
print(r.url)
print(r.history)
print(r.status_code)

https://github.com/
[<Response [301]>, <Response [301]>]
200


As you can see, the redirection process is handled automatically by requests, so you don't need to deal with it yourself. The history property contains the list of all Response objects created to complete the redirection, ordered from the oldest to the most recent. In our example, two Response objects were created, each with a 301 status code. HTTP 301 and 302 responses are used for permanent and temporary redirection, respectively.

If you don't want the Requests library to automatically follow redirects, then you can disable it by passing the allow_redirects=False parameter along with the request.
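
To see the difference allow_redirects makes, here is a rough sketch that runs entirely on your own machine. The RedirectHandler class, the /old and /new paths, and the throwaway local server are all hypothetical stand-ins for a real site; only the requests calls mirror what you would write in a scraper.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: /old redirects permanently to /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(301)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"arrived"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo

followed = session.get(base + "/old", timeout=5)
not_followed = session.get(base + "/old", allow_redirects=False, timeout=5)

print(followed.status_code, [r.status_code for r in followed.history])
print(not_followed.status_code, not_followed.headers["Location"])

server.shutdown()
server.server_close()
```

With redirects disabled, you get the 301 response itself and can read its Location header to decide whether to follow it.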

# Handling Timeouts

Another important configuration is telling the library how to handle timeouts, that is, requests that take too long to return. We can tell requests to stop waiting for a response using the timeout parameter. By default, requests never times out, so if we don't set this parameter, our program may hang indefinitely, which is not the behavior you'd want in a process that keeps a user waiting.

In [35]:
requests.get('http://www.google.com', timeout=1)

<Response [200]>

Here, an exception is raised if the server does not respond within 1 second (which is still aggressive for a real-world application). To make the request fail more often for the sake of the example, set the timeout to a much smaller value, such as 0.001.

The timeout can be configured for both the "connect" and "read" operations of the request using a tuple, which allows you to specify both values separately:

In [None]:
requests.get('http://www.google.com', timeout=(5, 14))


Here, the "connect" timeout is 5 seconds and the "read" timeout is 14 seconds. The request fails quickly if it can't connect to the resource, and once it does connect, the server gets more time to send the data. Note that the read timeout limits how long the client waits between bytes received from the server, not the total download time.
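
When a timeout is hit, requests raises ConnectTimeout or ReadTimeout (both are subclasses of Timeout), depending on which phase took too long. The following sketch triggers a read timeout locally: a listening socket that never answers stands in for a stalled server, which is an assumption made so the example does not depend on any external site.

```python
import socket

import requests

# A listening socket that never answers stands in for a stalled server.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))  # port 0 picks a free port
silent.listen(1)
url = f"http://127.0.0.1:{silent.getsockname()[1]}/"

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo

try:
    session.get(url, timeout=(3, 0.5))  # (connect timeout, read timeout)
    outcome = "responded"
except requests.exceptions.ConnectTimeout:
    outcome = "connect timeout"
except requests.exceptions.ReadTimeout:
    outcome = "read timeout"
finally:
    silent.close()

print(outcome)
```

Catching requests.exceptions.Timeout instead would handle both cases with a single except clause.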

# SSL Handling
We can also use the Requests library to verify the HTTPS certificate of a website by passing verify=True with the request.

In [None]:
r = requests.get('https://www.github.com', verify=True)

This raises an error (requests.exceptions.SSLError) if there is any problem with the site's SSL certificate. If you don't want to verify, pass False instead of True, keeping in mind that this leaves the connection open to interception. The verify parameter is True by default.
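
The default can be checked directly on a Session object, which is what the module-level functions such as requests.get() use under the hood. This is only a quick sketch that inspects library objects and makes no request.

```python
import requests

session = requests.Session()
print(session.verify)  # certificate verification is on unless you change it

# Certificate failures surface as SSLError, a subclass of ConnectionError,
# so an existing ConnectionError handler also catches them.
ssl_is_connection_error = issubclass(
    requests.exceptions.SSLError, requests.exceptions.ConnectionError
)
print(ssl_is_connection_error)
```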

# Downloading a File
To download a file with requests, we can either stream the contents or download the entire thing at once. The stream flag selects between the two behaviors.

As you probably guessed, if stream is True, requests will stream the content. If stream is False (the default), all content is downloaded into memory before the response is returned to you.

For streamed content, we can iterate chunk by chunk using the iter_content method or line by line using iter_lines. Either way, the file is downloaded part by part.

For example:

In [36]:
r = requests.get('https://cdn.pixabay.com/photo/2018/07/05/02/50/sun-hat-3517443_1280.jpg', stream=True)
# the with block closes the file even if the download fails midway
with open("sun-hat.jpg", "wb") as downloaded_file:
    for chunk in r.iter_content(chunk_size=256):
        if chunk:
            downloaded_file.write(chunk)

The code above downloads an image from the Pixabay server and saves it to a local file, sun-hat.jpg.
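
A slightly more defensive variant of the download above might look like the sketch below. The download helper, its 8192-byte chunk size, and the throwaway local server that stands in for a real file host are all assumptions made for illustration. The context managers ensure that both the connection and the file are closed even if something fails midway.

```python
import functools
import tempfile
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

import requests


class QuietHandler(SimpleHTTPRequestHandler):
    def log_message(self, *args):
        pass  # suppress per-request logging


def download(session, url, destination, chunk_size=8192):
    """Hypothetical helper: stream url into destination, return bytes written."""
    written = 0
    with session.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()  # do not save an error page as the file
        with open(destination, "wb") as out:
            for chunk in response.iter_content(chunk_size=chunk_size):
                written += out.write(chunk)
    return written


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"x" * 100_000)

    # Serve the temporary directory on a free local port.
    handler = functools.partial(QuietHandler, directory=tmp)
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    session = requests.Session()
    session.trust_env = False  # ignore proxy environment variables for this demo
    target = Path(tmp) / "copy.bin"
    size = download(session, f"http://127.0.0.1:{server.server_port}/source.bin", target)
    identical = target.read_bytes() == source.read_bytes()

    server.shutdown()
    server.server_close()

print(size, identical)
```

A larger chunk size than 256 bytes usually means fewer loop iterations for big files; the right value depends on your use case.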


# Making Asynchronous Requests

When you make a large number of requests, waiting for each one to finish before starting the next can be extremely time-consuming. Requests itself has no asynchronous API, but you can finish faster by running the blocking requests.get calls concurrently in worker threads with Python's asyncio module. asyncio is part of the standard library, so there is nothing to install with pip. Here's how:

In [24]:
import asyncio, requests

urls = [
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/breakfast.jpg',
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/forbidden',
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/the-html5-breakfast-site.html'
]

async def main():
    loop = asyncio.get_running_loop()
    # run each blocking requests.get call in the default thread pool
    futures = [loop.run_in_executor(None, requests.get, url) for url in urls]
    for response in await asyncio.gather(*futures):
        print(response.status_code)



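In [15]:
# Jupyter already runs an event loop, so loop.run_until_complete(main()) raises
# "RuntimeError: This event loop is already running" there. Await the coroutine
# directly instead; in a plain script, call asyncio.run(main()).
await main()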

200
403
200
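
To see why this pattern saves time, here is a sketch that replaces the network call with a stand-in. slow_fetch and fetch_all are hypothetical names, and the 0.2-second sleep is an arbitrary assumption that imitates a slow server. The example is written as a plain script, so it uses asyncio.run(); inside Jupyter you would write `await fetch_all(...)` instead.

```python
import asyncio
import time


def slow_fetch(url):
    """Hypothetical stand-in for requests.get: blocks for 0.2 seconds."""
    time.sleep(0.2)
    return f"fetched {url}"


async def fetch_all(fetch, urls):
    """Run the blocking fetch function for every URL in the default thread pool."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(None, fetch, url) for url in urls]
    return await asyncio.gather(*futures)  # results keep the order of urls


start = time.perf_counter()
results = asyncio.run(fetch_all(slow_fetch, ["a", "b", "c"]))
elapsed = time.perf_counter() - start

print(results)
print(f"{elapsed:.2f} seconds")  # noticeably less than the 0.6 a sequential loop needs
```

Because the three calls wait in separate threads, the total time is close to that of the slowest single call rather than the sum of all three.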


# Dealing with Throttling and Rate Limiting

On modern websites, especially those with a large number of users, throttling and/or rate limiting is often enforced so that no single client can make API or web requests too frequently. These measures are not aimed at regular human users but at search engine bots and, especially, hackers. With throttling, the same requester (usually identified by IP address or account) cannot make more than the allowed number of requests within a certain period of time (e.g. 10,000 requests/day). If the limit is exceeded, the requester receives an error or simply no response. With rate limiting, a requester must keep the frequency of requests under a certain threshold (e.g. 10 requests/second). When you test your web scraping scripts, a lot of errors in your responses does not necessarily mean the web resources are invalid. It may be because the websites you are requesting are throttling or rate limiting you.
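
One common way for a server to signal that a limit was exceeded is an HTTP 429 (Too Many Requests) status, optionally with a Retry-After header saying how long to wait. Not every site does this, so treat the following as a sketch under that assumption: retry_delay is a hypothetical helper, the responses are built by hand so nothing touches the network, and only the plain-seconds form of Retry-After is handled.

```python
import requests


def retry_delay(response, default=1.0):
    """Hypothetical helper: seconds to wait before retrying, 0.0 if not limited."""
    if response.status_code != 429:
        return 0.0
    value = response.headers.get("Retry-After", "")
    # Retry-After may also be an HTTP date; this sketch then uses `default`.
    return float(value) if value.isdigit() else default


# Hand-built responses standing in for what a rate-limiting server might send.
limited = requests.Response()
limited.status_code = 429
limited.headers["Retry-After"] = "3"

fine = requests.Response()
fine.status_code = 200

print(retry_delay(limited), retry_delay(fine))
```

In a scraper you would pass the returned value to time.sleep() before repeating the request.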

The throttling thresholds and wait periods differ from website to website. To find out whether you may be throttled for exceeding the limit, check the user agreement of the website you are making requests to.

To control the request rate of your scripts, you can wait a period of time after making each request by calling time.sleep(). For example:

In [26]:
import requests, time

urls = [
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/breakfast.jpg',
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/forbidden',
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/the-html5-breakfast-site.html'
]

for url in urls:
    response = requests.get(url)
    print(response.status_code)
    time.sleep(1)

200
403
200
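
A fixed time.sleep(1) works, but it waits a full second even when the request itself was slow. As an alternative sketch, the hypothetical MinIntervalLimiter below only sleeps for whatever is left of the minimum gap. The class name and the limit of 10 requests per second are assumptions for illustration, and the request itself is left as a comment so the example runs without network access.

```python
import time


class MinIntervalLimiter:
    """Hypothetical helper: enforce a minimum gap between consecutive calls."""

    def __init__(self, max_per_second):
        self.min_interval = 1.0 / max_per_second
        self.last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        if self.last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self.last_call)
            if remaining > 0:
                time.sleep(remaining)
        self.last_call = time.monotonic()


limiter = MinIntervalLimiter(max_per_second=10)
calls = 0
start = time.monotonic()
for _ in range(5):
    limiter.wait()
    calls += 1  # a call such as requests.get(url) would go here
elapsed = time.monotonic() - start

print(calls, f"{elapsed:.2f} seconds")  # five calls need at least four 0.1 s gaps
```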


# Links 

https://nordicapis.com/everything-you-need-to-know-about-api-rate-limiting/

https://realpython.com/python-requests/