In this notebook, we'll work through some examples to learn how cookies work in HTTP.

In [1]:
import requests
from bs4 import BeautifulSoup

Let us start with a simple example. Navigate to http://www.webscrapingfordatascience.com/cookielogin/. This page asks you to log in (any username and password will do), after which you can access a "secret" page, http://www.webscrapingfordatascience.com/cookielogin/secret.php.

Let's see whether we can access this secret page directly using Requests:

In [2]:
url = 'http://www.webscrapingfordatascience.com/cookielogin/secret.php'

# Without cookies
r = requests.get(url)
print(r.text)

Hmm... it seems you are not logged in


That didn't work. Let's first try to log in ourselves using Requests, with a `POST` request as seen before:

In [3]:
url = 'http://www.webscrapingfordatascience.com/cookielogin/'

r = requests.post(url, data={'username': 'user', 'password': 'pass'})

print(r.text)


<html>

<body>


You are logged in, you can now see the <a href="secret.php">secret page</a>.


</body>

</html>



Let's now try to access the secret page:

In [4]:
url = 'http://www.webscrapingfordatascience.com/cookielogin/secret.php'

r = requests.get(url)
print(r.text)

Hmm... it seems you are not logged in


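What happened here? After we submit the login data, the server includes a `Set-Cookie` header in its HTTP reply and expects us to send that cookie back with subsequent requests, which is how it identifies us. Each call to `requests.get` or `requests.post` starts from scratch and does not remember cookies from earlier replies, so we have to pass the cookie along ourselves. Let's first look at the headers of the login reply: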

In [5]:
url = 'http://www.webscrapingfordatascience.com/cookielogin/'

r = requests.post(url, data={'username': 'user', 'password': 'pass'})

print(r.headers)

{'Date': 'Fri, 31 Jul 2020 10:18:09 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Set-Cookie': 'PHPSESSID=iuh0i1jq783t1ried5i6m0bo85; path=/', 'Expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'Cache-Control': 'no-store, no-cache, must-revalidate', 'Pragma': 'no-cache', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '114', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}
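To make the `Set-Cookie` header above more concrete, the following sketch parses the same header value with Python's standard `http.cookies` module and rebuilds the `Cookie` header a client would send back. It only illustrates the mechanism and is not how Requests handles cookies internally; the variable names are our own.

```python
from http.cookies import SimpleCookie

# Header value copied from the server reply shown above
set_cookie_header = 'PHPSESSID=iuh0i1jq783t1ried5i6m0bo85; path=/'

# Parse the header into name/value pairs plus attributes such as the path
parsed = SimpleCookie()
parsed.load(set_cookie_header)

morsel = parsed['PHPSESSID']
print(morsel.value)    # the session ID chosen by the server
print(morsel['path'])  # the path for which the cookie is valid

# A client sends back only name=value pairs, without the attributes
cookie_header = '; '.join(f'{name}={m.value}' for name, m in parsed.items())
print(cookie_header)
```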


In [6]:
r.cookies

<RequestsCookieJar[Cookie(version=0, name='PHPSESSID', value='iuh0i1jq783t1ried5i6m0bo85', port=None, port_specified=False, domain='www.webscrapingfordatascience.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False)]>

In [7]:
url = 'http://www.webscrapingfordatascience.com/cookielogin/secret.php'

r = requests.get(url, cookies=r.cookies)
print(r.text)

This is a secret code: 1234
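The `cookies` argument does not have to be a cookie jar taken from an earlier reply; Requests also accepts a plain dictionary. As a small sketch that needs no network access, we can prepare (but not send) a request and look at the `Cookie` header Requests would transmit. The session ID below is a placeholder, not a valid login.

```python
import requests

secret_url = 'http://www.webscrapingfordatascience.com/cookielogin/secret.php'

# Placeholder value for illustration; a real ID comes from the login reply
my_cookies = {'PHPSESSID': 'placeholder-session-id'}

# Build the request without sending it, so we can inspect its headers
prepared = requests.Request('GET', secret_url, cookies=my_cookies).prepare()
print(prepared.headers['Cookie'])
```

Passing the same dictionary to `requests.get(secret_url, cookies=my_cookies)` would send this header for real.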


Let us now take a look at a trickier example, using http://www.webscrapingfordatascience.com/redirlogin/.

This page behaves very similarly to the previous example, so let's try the same approach.

In [8]:
url = 'http://www.webscrapingfordatascience.com/redirlogin/'

r = requests.post(url, data={'username': 'user', 'password': 'pass'})

print(r.headers)
r.cookies

{'Date': 'Fri, 31 Jul 2020 10:19:23 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'Cache-Control': 'no-store, no-cache, must-revalidate', 'Pragma': 'no-cache', 'Content-Length': '27', 'Keep-Alive': 'timeout=5, max=99', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}


<RequestsCookieJar[]>

Strange, there are no cookies here. If we inspect the requests using our web browser, we see why: the status code of the HTTP reply to our `POST` is a redirect (`302`), which Requests follows by default. The `Set-Cookie` header is sent with this intermediate reply, whereas `r.headers` and `r.cookies` only reflect the final reply in the redirect chain.

We can turn off this behavior by passing `allow_redirects=False`.

In [9]:
url = 'http://www.webscrapingfordatascience.com/redirlogin/'

r = requests.post(url, data={'username': 'user', 'password': 'pass'}, allow_redirects=False)

print(r.status_code)
print(r.headers)
r.cookies

302
{'Date': 'Fri, 31 Jul 2020 10:21:42 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Set-Cookie': 'PHPSESSID=vsm1065k77t1qjepijmnr9rjg6; path=/', 'Expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'Cache-Control': 'no-store, no-cache, must-revalidate', 'Pragma': 'no-cache', 'Location': 'http://www.webscrapingfordatascience.com/redirlogin/secret.php', 'Content-Length': '114', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}


<RequestsCookieJar[Cookie(version=0, name='PHPSESSID', value='vsm1065k77t1qjepijmnr9rjg6', port=None, port_specified=False, domain='www.webscrapingfordatascience.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False)]>

In [10]:
url = 'http://www.webscrapingfordatascience.com/redirlogin/secret.php'

r = requests.get(url, cookies=r.cookies)
print(r.text)

This is a secret code: 1234
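Disabling redirects is not the only way to get at the cookie. Requests keeps the intermediate replies of a redirect chain in `r.history`, each with its own `headers` and `cookies`. The sketch below demonstrates this against a tiny local server that merely imitates the pattern of setting a cookie on the redirect reply; the handler class, paths, and cookie value are all made up for the demonstration and say nothing about how the real site is implemented.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectLoginHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a login page that redirects after a POST."""

    def do_POST(self):
        # Read and discard the form body, then set a cookie on the 302 reply
        length = int(self.headers.get('Content-Length', 0))
        self.rfile.read(length)
        self.send_response(302)
        self.send_header('Set-Cookie', 'PHPSESSID=demo123; path=/')
        self.send_header('Location', '/secret')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def do_GET(self):
        # The final page sets no cookie of its own
        body = b'final page'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the notebook output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(('127.0.0.1', 0), RedirectLoginHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

reply = requests.post(base + '/login', data={'username': 'user', 'password': 'pass'})

print(reply.status_code)                         # status of the final reply
print(reply.cookies)                             # cookies set by the final reply only
print([h.status_code for h in reply.history])    # the intermediate replies
print(reply.history[0].cookies.get('PHPSESSID')) # cookie set by the redirect reply

server.shutdown()
server.server_close()
```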


Luckily, Requests provides a handy mechanism that saves us from having to manage cookies ourselves: sessions. Once we create a session and call its methods just as we have called the `requests` module so far, Requests keeps track of cookies and sends them along with each subsequent request.

In [15]:
my_session = requests.Session()
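As a rough offline illustration of what a session does for us, we can put a cookie into a session's jar by hand and prepare (without sending) two requests. The cookie value is a placeholder and `demo_session` is our own name; in normal use the jar is filled from `Set-Cookie` headers automatically.

```python
import requests

demo_session = requests.Session()

# Normally the jar is filled from Set-Cookie headers; here we add a placeholder by hand
demo_session.cookies.set('PHPSESSID', 'demo123',
                         domain='www.webscrapingfordatascience.com', path='/')

# Prepare two requests through the session without sending them
same_site = demo_session.prepare_request(
    requests.Request('GET', 'http://www.webscrapingfordatascience.com/redirlogin/secret.php'))
other_site = demo_session.prepare_request(
    requests.Request('GET', 'http://example.com/'))

print(same_site.headers.get('Cookie'))   # the cookie is attached for the matching domain
print(other_site.headers.get('Cookie'))  # nothing is attached for an unrelated domain
```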

In addition, sessions make it easy to set headers that should apply to every request made through the session. This avoids having to provide a `headers` argument for each request, which is particularly helpful for e.g. the `User-Agent` header (you can still use the `headers` argument to set request-specific headers).

In [18]:
my_session.headers.update({'User-Agent': 'Chrome!'})
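To see how session-wide and request-specific headers combine, here is a minimal sketch that prepares a request through a session without sending it. The `Accept-Language` header and the name `demo_session` are only examples chosen for this illustration.

```python
import requests

demo_session = requests.Session()
demo_session.headers.update({'User-Agent': 'Chrome!'})

# A request-specific header, given in addition to the session-wide ones
request = requests.Request(
    'GET',
    'http://www.webscrapingfordatascience.com/trickylogin/',
    headers={'Accept-Language': 'en'},
)
prepared = demo_session.prepare_request(request)

print(prepared.headers['User-Agent'])       # taken from the session
print(prepared.headers['Accept-Language'])  # taken from this request only
```

If the same header is given in both places, the request-specific value is used for that request.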

Let's now try out our session on the URL http://www.webscrapingfordatascience.com/trickylogin/. Use your browser's developer tools to see what is happening there.

Note that we perform an explicit `GET` request below to fetch the login form first. What happens if you skip it? Can you use the developer tools to figure out why?

In [24]:
url = 'http://www.webscrapingfordatascience.com/trickylogin/'

# Perform a GET request
r = my_session.get(url)
print(r.request.headers)
print(r.headers)
print()

# Login using a POST
r = my_session.post(url, params={'p': 'login'}, data={'username': 'dummy', 'password': '1234'}) 
print(r.request.headers)
print(r.headers)
print()

# Get the protected page (note that in this example, a URL parameter is used as well)
r = my_session.get(url, params={'p': 'protected'})
print(r.request.headers)
print(r.headers)
print()

{'User-Agent': 'Chrome!', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
{'Date': 'Fri, 31 Jul 2020 11:11:43 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Set-Cookie': 'PHPSESSID=tp07889unftqnaabp0jl8bkvt3; path=/, PHPSESSID=op9ldo1bjmc1j02noh7ljrkv13; path=/', 'Expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'Cache-Control': 'no-store, no-cache, must-revalidate', 'Pragma': 'no-cache', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '167', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}

{'User-Agent': 'Chrome!', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'PHPSESSID=ha1vbffei4bchg2e5i5k3totk3'}
{'Date': 'Fri, 31 Jul 2020 11:11:43 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'Cache-Control': 'no-store, no-cache, must-revalidate', 'Pragma': 'no-cache', 'Set-Cookie': 'PHPSESSID=tk2ph9bndp4nff

In [25]:
print(r.text)

Here is your secret code: 3838.


You can also manage cookies yourself through the session's cookie jar (`my_session.cookies`), which behaves much like a normal Python dictionary.

In [26]:
my_session.cookies

<RequestsCookieJar[Cookie(version=0, name='PHPSESSID', value='9eecg4bl00onra516qb1sdi4c4', port=None, port_specified=False, domain='www.webscrapingfordatascience.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False)]>

In [27]:
my_session.cookies.get('PHPSESSID')

'9eecg4bl00onra516qb1sdi4c4'

In [28]:
my_session.cookies.clear()

In [29]:
my_session.cookies

<RequestsCookieJar[]>
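Beyond `get` and `clear`, the cookie jar supports most other dictionary-style operations. The sketch below uses a fresh session and made-up cookie names and values, so it runs without contacting any server.

```python
import requests

demo_session = requests.Session()

# Add cookies by hand; the names and values are placeholders
demo_session.cookies.set('PHPSESSID', 'abc123',
                         domain='www.webscrapingfordatascience.com', path='/')
demo_session.cookies.set('theme', 'dark',
                         domain='www.webscrapingfordatascience.com', path='/')

print(demo_session.cookies['PHPSESSID'])     # item access, like a dictionary
print(sorted(demo_session.cookies.keys()))   # names of all stored cookies
print(demo_session.cookies.get_dict())       # plain dictionary copy of the jar

# Remove a single cookie instead of clearing the whole jar
del demo_session.cookies['theme']
print(demo_session.cookies.get('theme'))     # None once the cookie is gone
```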