#### Let’s now look at how to handle this JavaScript use case with requests and Beautiful Soup:

In [4]:
import requests
from bs4 import BeautifulSoup
url = 'http://www.webscrapingfordatascience.com/simplejavascript/'
r = requests.get(url)
html_soup = BeautifulSoup(r.text, 'html.parser')
# No tag will be found here
ul_tag = html_soup.find('ul')
print(ul_tag)
# Show the JavaScript code
script_tag = html_soup.find('script', attrs={'src': None})
print(script_tag)

None
<script>
	$(function() {
	document.cookie = "jsenabled=1";
	$.getJSON("quotes.php", function(data) {
		var items = [];
		$.each(data, function(key, val) {
			items.push("<li id='" + key + "'>" + val + "</li>");
		});
		$("<ul/>", {
			html: items.join("")
			}).appendTo("body");
		});
	});
	</script>


With requests and Beautiful Soup alone, we have no way to parse and query the actual JavaScript code; to them, it is just text.
In simple situations such as this one, this is not necessarily a problem. Reading the script, we can see
that the browser requests a page at “quotes.php”, and that it sets a “jsenabled” cookie
beforehand. We can request that page directly:



In [2]:
import requests
url = 'http://www.webscrapingfordatascience.com/simplejavascript/quotes.php'
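# No cookie is sent with this request, which is likely why the server replies
# with a placeholder message instead of the quotes
# (cookie values, when provided, need to be strings)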
r = requests.get(url)
print(r.json())

['Are you using a web scraper?']
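The response above looks like a refusal rather than quote data, which is consistent with the request going out without the cookie that the page's script sets (`jsenabled=1`). Below is a minimal sketch of sending that cookie along, assuming the server only checks for its presence. The helper names `build_quotes_request` and `fetch_quotes` are made up for illustration. Preparing the request first lets us inspect the `Cookie` header before anything goes over the network:

```python
import requests

QUOTES_URL = 'http://www.webscrapingfordatascience.com/simplejavascript/quotes.php'


def build_quotes_request(url=QUOTES_URL):
    """Prepare (but do not send) a GET request carrying the jsenabled cookie."""
    # Cookie values must be strings, mirroring document.cookie = "jsenabled=1"
    request = requests.Request('GET', url, cookies={'jsenabled': '1'})
    return request.prepare()


def fetch_quotes(url=QUOTES_URL):
    """Send the prepared request and return the decoded JSON body."""
    with requests.Session() as session:
        response = session.send(build_quotes_request(url), timeout=10)
        response.raise_for_status()
        return response.json()


# Inspect the header that would be sent; no network access happens here
prepared = build_quotes_request()
print(prepared.headers['Cookie'])
```

Calling `fetch_quotes()` performs the actual request; what it returns depends on how the live site responds.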


In [3]:
import requests
url = 'http://www.webscrapingfordatascience.com/complexjavascript/'
my_session = requests.Session()
# Get the main page first to obtain the PHPSESSID cookie
r = my_session.get(url)
# Manually set the nonce cookie
my_session.cookies.update({
    'nonce': '2315'
    })
r = my_session.get(url + 'quotes.php', params={'p': '0'})
print(r.text)
# Shows: No quotes for you!

No quotes for you!


Sadly, this doesn’t work. Figuring out why requires some creative thinking, though
we can take a guess at what might be going wrong here. We’re getting a fresh session
identifier by visiting the main page as if we were coming from a new browsing session
to provide the “PHPSESSID” cookie. However, we’re reusing the “nonce” cookie value
that our browser was using. The web page might see that this “nonce” value does not
match the “PHPSESSID” information. As such, we have no choice but to also reuse
the “PHPSESSID” value from the browser. Again, yours might be different (inspect your browser’s network
requests to see which values it is sending for your session):


nonce=1497; _ga=GA1.2.1481335662.1625916386; PHPSESSID=li86h0i1o5igp31sge3ej1338h
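Rather than retyping values by hand, a header string like the one above can be turned into a dictionary with the standard library. This is a small sketch, not part of the original walkthrough: `cookie_header_to_dict` is a hypothetical helper name, and the header value is simply the one shown above, so yours will differ.

```python
from http.cookies import SimpleCookie

# Cookie header copied from the browser's network inspector (session-specific)
raw_cookie_header = (
    'nonce=1497; _ga=GA1.2.1481335662.1625916386; '
    'PHPSESSID=li86h0i1o5igp31sge3ej1338h'
)


def cookie_header_to_dict(header):
    """Parse a raw Cookie header string into a {name: value} dictionary."""
    parsed = SimpleCookie()
    parsed.load(header)
    return {name: morsel.value for name, morsel in parsed.items()}


all_cookies = cookie_header_to_dict(raw_cookie_header)
# Keep only the session-related cookies; _ga is a Google Analytics cookie
session_cookies = {name: all_cookies[name] for name in ('nonce', 'PHPSESSID')}
print(session_cookies)
```

The resulting dictionary has string values, so it can be passed as the `cookies` argument of `requests.get`.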

In [5]:
import requests
url = 'http://www.webscrapingfordatascience.com/complexjavascript/'
my_cookies = {
    'nonce': '1497',
    'PHPSESSID': 'li86h0i1o5igp31sge3ej1338h',
}
r = requests.get(url + 'quotes.php', params={'p': '0'}, cookies=my_cookies)
print(r.text)

<div class="quote decode">TGlmZSBpcyBhYm91dCBtYWtpbmcgYW4gaW1wYWN0LCBub3QgbWFraW5nIGFuIGluY29tZS4gLUtldmluIEtydXNlDQo=</div><div class="quote decode">CVdoYXRldmVyIHRoZSBtaW5kIG9mIG1hbiBjYW4gY29uY2VpdmUgYW5kIGJlbGlldmUsIGl0IGNhbiBhY2hpZXZlLiDigJNOYXBvbGVvbiBIaWxsDQo=</div><div class="quote decode">CVN0cml2ZSBub3QgdG8gYmUgYSBzdWNjZXNzLCBidXQgcmF0aGVyIHRvIGJlIG9mIHZhbHVlLiDigJNBbGJlcnQgRWluc3RlaW4NCg==</div><br><br><br><br><a class="jscroll-next" href="quotes.php?p=3">Load more quotes</a>
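The quote bodies in this response are not readable yet: each `<div class="quote decode">` holds what appears to be Base64-encoded text, which the page's JavaScript presumably decodes in the browser. Here is a minimal sketch of doing the same in Python, using only the first quote from the output above as sample input. `decode_quotes` is a hypothetical helper, and the regular expression assumes the simple, flat markup shown here:

```python
import base64
import re

# First quote copied from the response above (the others work the same way)
sample_html = (
    '<div class="quote decode">'
    'TGlmZSBpcyBhYm91dCBtYWtpbmcgYW4gaW1wYWN0LCBub3QgbWFraW5nIGFuIGluY29tZS4gLUtldmluIEtydXNlDQo='
    '</div>'
)


def decode_quotes(html):
    """Return the decoded text of every <div class="quote decode"> element."""
    encoded = re.findall(r'<div class="quote decode">([^<]*)</div>', html)
    # Decode Base64 to bytes, then to UTF-8 text, and trim surrounding whitespace
    return [base64.b64decode(item).decode('utf-8').strip() for item in encoded]


for quote in decode_quotes(sample_html):
    print(quote)
```

For the sample this prints `Life is about making an impact, not making an income. -Kevin Kruse`. On real pages, selecting the elements with Beautiful Soup instead of a regular expression would be more robust to markup changes.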
