## DEMO: Jupyter Notebook Fetching Webpages with the Requests Library  

**Requests** is a good fit when you just need to fetch a webpage and work with the raw HTML.  
It does not offer much beyond that (it does not parse the HTML for you, for example), but it does that one job incredibly well!

The [Requests homepage](https://docs.python-requests.org/en/latest/) has lots of good examples and full documentation.

<br>

---  
History:
+ 04.09.2023 v1 dbe --- initial version for KETE-HS23  


---

## A) Import Library and Request Full Webpage

In [1]:
import requests

In [10]:
url = 'https://www.nau.ch/'
response = requests.get(url)

---  
## B) Check the returned object (i.e. response)

+ Check HTTP Status  
You can make sure the request actually worked (i.e. the [HTTP status code](http://en.wikipedia.org/wiki/List_of_HTTP_status_codes) is 200)

In [11]:
response.status_code

200
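
Checking the status code by hand works well in a notebook, but in a script the check is easy to forget. A minimal sketch of how it could be automated, assuming two helper names, `ensure_success` and `fetch` (both hypothetical, not part of Requests), and an arbitrary 10-second timeout:

```python
import requests


def ensure_success(response: requests.Response) -> requests.Response:
    """Return the response unchanged, or raise requests.HTTPError on a 4xx/5xx status."""
    # raise_for_status() does nothing for successful responses
    response.raise_for_status()
    return response


def fetch(url: str, timeout: float = 10) -> requests.Response:
    """Fetch a URL with a timeout and fail loudly on HTTP error statuses."""
    # The timeout value is an assumption for this sketch; adjust as needed
    return ensure_success(requests.get(url, timeout=timeout))
```

Calling `fetch('https://www.nau.ch/')` would then either return a response or raise `requests.HTTPError`; connection problems and timeouts surface as other subclasses of `requests.RequestException`.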

+ Check Content Type  
You can check what type of content the webpage returned (e.g. text, JSON, CSV)

In [12]:
response.headers['content-type']

'text/html; charset=UTF-8'
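
The header value combines a media type with optional parameters such as the charset. One way to split the two apart, sketched here with the standard library's `email.message.Message` (the helper name `parse_content_type` is made up for this example):

```python
from email.message import Message


def parse_content_type(header_value: str):
    """Split a Content-Type header into (media_type, charset); charset may be None."""
    msg = Message()
    msg['Content-Type'] = header_value
    # Both values are returned in lower case by the email package
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type('text/html; charset=UTF-8'))  # → ('text/html', 'utf-8')
```

In the notebook this could be called as `parse_content_type(response.headers['content-type'])`.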

+ Check Content Encoding  
You can check which character encoding Requests will use to decode the content (sure hope it is UTF-8!)

In [13]:
response.encoding

'UTF-8'
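
Requests derives `response.encoding` from the `Content-Type` header, so it can be `None` for some responses that declare no charset; in that case `response.text` falls back to guessing the encoding from the bytes. A hedged sketch of an explicit fallback instead, where the helper `text_with_fallback` and the UTF-8 default are assumptions for illustration:

```python
import requests


def text_with_fallback(response: requests.Response, fallback: str = 'utf-8') -> str:
    """Decode the body, using `fallback` when no encoding was taken from the headers."""
    if response.encoding is None:
        # Assigning to .encoding changes how .text decodes the raw bytes
        response.encoding = fallback
    return response.text
```

Note that the helper modifies `response.encoding` on the object it is given, so later accesses to `response.text` use the same encoding.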

+ Check Content itself  
Of course, you can get the actual HTML text too!

In [14]:
# show the first 200 characters of the content
response.text[0:200]

'<!DOCTYPE html><html lang="de" data-critters-container><head>\n        <meta charset="utf-8">\n        <title>News für die Schweiz - Nau.ch</title>\n        <base href="/">\n\n        <meta name="viewport"'

+ Save Content locally  
You can save the web content in a local file for further analysis and processing!

In [18]:
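# 'w' overwrites the file on each run ('a' would append another copy every time the cell is re-run)
with open('response.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)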
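
As an alternative to writing the decoded text, the undecoded bytes in `response.content` can be saved, which keeps the file byte-for-byte identical to what the server sent. A small sketch, assuming a hypothetical helper name `save_raw`:

```python
from pathlib import Path

import requests


def save_raw(response: requests.Response, path: str) -> int:
    """Write the undecoded response body to `path` and return the number of bytes written."""
    # response.content is bytes, so no text encoding is involved here
    return Path(path).write_bytes(response.content)
```

In the notebook this could be used as `save_raw(response, 'response.html')`; an existing file at that path is overwritten.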

---  
## C) Request Selected Webpage Parts

Sometimes you only need the response headers, for instance to resolve redirects without downloading the full webpage content. A HEAD request returns the status and headers but no body.

In [19]:
url = 'https://www.nau.ch/'

In [25]:
response = requests.head(url, allow_redirects=True)

In [28]:
# list the attributes stored on the response object
vars(response).keys()

dict_keys(['_content', '_content_consumed', '_next', 'status_code', 'headers', 'raw', 'url', 'encoding', 'history', 'reason', 'cookies', 'elapsed', 'request', 'connection'])

In [38]:
response.headers

{'Date': 'Tue, 05 Sep 2023 11:41:32 GMT', 'Server': 'Apache', 'Cache-Control': 'no-cache, max-age=0', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Expires': 'Tue, 05 Sep 2023 11:41:32 GMT', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Headers': '*', 'Content-Type': 'text/html; charset=UTF-8', 'Referrer-Policy': 'no-referrer-when-downgrade'}
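
With `allow_redirects=True`, the intermediate responses are kept in `response.history` and the final address in `response.url`. A small sketch of how the redirect chain could be listed (the helper name `redirect_chain` is hypothetical):

```python
import requests


def redirect_chain(response: requests.Response) -> list:
    """Return (status_code, url) pairs for each redirect hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    # The response itself is the last stop of the chain
    hops.append((response.status_code, response.url))
    return hops
```

For a response that was not redirected, `redirect_chain(response)` returns a single entry holding the final status code and URL.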