In this notebook, we'll take Beautiful Soup for a spin to parse HTML. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/ for the full docs.

We assume you already have Beautiful Soup installed. If not, run this first (the `-q` flag suppresses most of pip's output to keep the notebook clean; the `-U` flag upgrades an existing Beautiful Soup installation):

In [1]:
!pip install -qU beautifulsoup4

Next, we import Requests and Beautiful Soup.

In [2]:
import requests
from bs4 import BeautifulSoup

# Scraping the BlueCourses web site

Let's start by scraping our course page. We'll try to fetch a list of courses with their respective links and descriptions. Remember to open https://www.bluecourses.com/ in your web browser and follow along with the developer tools.

In [3]:
url = 'https://www.bluecourses.com/'

r = requests.get(url)
html_contents = r.text

# If you don't pass the second argument here, Beautiful Soup will attempt to pick a parser for you
html_soup = BeautifulSoup(html_contents, 'html.parser')

Now look around in your web browser for a bit. Note that we can start from `li` tags with the class `courses-listing-item`, so let's select on this.

In [4]:
course_info_elements = html_soup.find_all(class_='courses-listing-item')
len(course_info_elements)

17
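
To see how `class_` matching behaves without hitting the network, here is a minimal sketch on a made-up HTML fragment (the markup and variable names are illustrative assumptions, not taken from the BlueCourses page). It suggests that `class_` matches an element as long as the given class is one of its classes, and that the CSS selector passed to `select` finds the same elements.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, loosely modeled on a course listing
sample_html = """
<ul>
  <li class="courses-listing-item featured">Course A</li>
  <li class="courses-listing-item">Course B</li>
  <li class="other-item">Not a course</li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ matches elements that have this class among their classes
by_keyword = sample_soup.find_all('li', class_='courses-listing-item')

# The same selection written as a CSS selector
by_selector = sample_soup.select('li.courses-listing-item')

print(len(by_keyword), len(by_selector))
print([li.get_text(strip=True) for li in by_keyword])
```

Passing the tag name as well as the class, as done here, narrows the match in case other elements on a page reuse the same class.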

Looks like we're on the right track. Let's look at the first element in a bit more detail.

In [5]:
# Type of the elements: Tag
type(course_info_elements[0])

bs4.element.Tag

In [6]:
# Tag name of the element
course_info_elements[0].name

'li'

In [7]:
# A list of the tag's children (its direct descendants)
# Note that this can include text nodes, such as the '\n' shown below
course_info_elements[0].contents

['\n',
 <article aria-label="Basic Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS" class="course" id="course-v1:bluecourses+BC1+September2019" role="region">
 <a href="/courses/course-v1:bluecourses+BC1+September2019/about">
 <header class="course-image">
 <div class="cover-image">
 <img alt="Basic Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS BC1" src="/asset-v1:bluecourses+BC1+September2019+type@asset+block@Ocean.jpg"/>
 <div aria-hidden="true" class="learn-more">LEARN MORE</div>
 </div>
 </header>
 <div aria-hidden="true" class="course-info row align-items-stretch mx-0">
 <h2 class="course-name col col-12 px-0">
 <span class="course-title my-1">Basic Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS</span>
 </h2>
 <div class="course-description col col-12 mb-1">In this course, students learn how to develop credit risk models in the context of the Basel and IFRS 9 guidelines</div>
 <div aria-hidden="true" class="course-date localized_datetime col col-12 pb

In [8]:
# Converting the Tag object to a string shows its HTML markup
str(course_info_elements[0])

'<li class="courses-listing-item">\n<article aria-label="Basic Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS" class="course" id="course-v1:bluecourses+BC1+September2019" role="region">\n<a href="/courses/course-v1:bluecourses+BC1+September2019/about">\n<header class="course-image">\n<div class="cover-image">\n<img alt="Basic Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS BC1" src="/asset-v1:bluecourses+BC1+September2019+type@asset+block@Ocean.jpg"/>\n<div aria-hidden="true" class="learn-more">LEARN MORE</div>\n</div>\n</header>\n<div aria-hidden="true" class="course-info row align-items-stretch mx-0">\n<h2 class="course-name col col-12 px-0">\n<span class="course-title my-1">Basic Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS</span>\n</h2>\n<div class="course-description col col-12 mb-1">In this course, students learn how to develop credit risk models in the context of the Basel and IFRS 9 guidelines</div>\n<div aria-hidden="true" class="course-date loca

In [9]:
# Get the textual contents as plain text. Note the difference between text and string:
# string is None here because the tag has more than one child
print(course_info_elements[0].text)
print(course_info_elements[0].string)







LEARN MORE




Basic Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS

In this course, students learn how to develop credit risk models in the context of the Basel and IFRS 9 guidelines
Starts: Aug 18, 2019



BC1
Starts: 





None
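
The `None` above follows from how `.string` works: it only returns a value when a tag has a single child that is a string (or a single child tag that itself has one). A minimal sketch on a made-up fragment, where the markup and names are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# Two hypothetical list items: one with a single child, one with several
sample_soup = BeautifulSoup(
    '<li><span>Only child</span></li><li><b>Two</b> <i>children</i></li>',
    'html.parser',
)
single, multiple = sample_soup.find_all('li')

# One child tag containing one string: .string drills down to it
print(repr(single.string))

# Several children: .string cannot pick one, so it is None
print(repr(multiple.string))

# .text concatenates all strings below the tag regardless
print(repr(multiple.text))
```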


In [10]:
# Even better is get_text, which can join the pieces with a separator and strip each one:
course_info_elements[0].get_text('\n', strip=True)

'LEARN MORE\nBasic Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS\nIn this course, students learn how to develop credit risk models in the context of the Basel and IFRS 9 guidelines\nStarts: Aug 18, 2019\nBC1\nStarts:'

In [11]:
# But note that this is different from stripping only the ends of text:
course_info_elements[0].text.strip()

'LEARN MORE\n\n\n\n\nBasic Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS\n\nIn this course, students learn how to develop credit risk models in the context of the Basel and IFRS 9 guidelines\nStarts: Aug 18, 2019\n\n\n\nBC1\nStarts:'

Based on this, we can call `find` on each tag to extract the details we want:

In [12]:
for course_info_element in course_info_elements:
    course_name = course_info_element.find(class_='course-title').get_text(strip=True)
    course_desc = course_info_element.find(class_='course-description').get_text(strip=True)
    course_link = course_info_element.find('a').get('href')
    print(course_name, course_link)
    print(course_desc)
    print()

Basic Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS /courses/course-v1:bluecourses+BC1+September2019/about
In this course, students learn how to develop credit risk models in the context of the Basel and IFRS 9 guidelines

Advanced Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS /courses/course-v1:bluecourses+BC2+September2019/about
In this course, students learn how to do advanced credit risk modeling.

Machine Learning Essentials /courses/course-v1:bluecourses+BC3+October2019/about
In this course, participants learn the essentials of machine learning.

Fraud Analytics /courses/course-v1:bluecourses+BC4+December2019/about
In this course, participants learn the essentials of fraud analytics.

Social Network Analytics /courses/course-v1:bluecourses+BC5+2020/about
In this course, participants learn the essentials of social network analytics.

Recommender Systems /courses/course-v1:bluecourses+BC7+2020_Q1/about
In this course, you will learn the essentials of recommend
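
The links printed above are relative to the site root. If absolute URLs are needed, one option is `urljoin` from the standard library; this is a small sketch using the first link from the output, with variable names chosen for illustration.

```python
from urllib.parse import urljoin

base_url = 'https://www.bluecourses.com/'
relative_link = '/courses/course-v1:bluecourses+BC1+September2019/about'

# Resolve the relative link against the page it was found on
absolute_link = urljoin(base_url, relative_link)
print(absolute_link)
```

Inside the loop, the same call could be applied to `course_link` before printing it.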

# Scraping Hacker News

For this second example, let's scrape Hacker News (https://news.ycombinator.com/). We'll get the titles, links and points.

In [13]:
url = 'https://news.ycombinator.com/'

r = requests.get(url)
html_contents = r.text

html_soup = BeautifulSoup(html_contents, 'html.parser')

Again, make sure to confirm in your browser how we construct our selections here.

In [14]:
for post in html_soup.find_all('tr', class_='athing'):
    post_title_element = post.find('a', class_='storylink')
    post_title = post_title_element.get_text(strip=True)
    post_link = post_title_element.get('href')
    post_points = post.find_next(class_='score').get_text(strip=True)
    print(post_title, post_link, post_points)
    print()

Monitoring demystified: A guide for logging, tracing, metrics https://techbeacon.com/enterprise-it/monitoring-demystified-guide-logging-tracing-metrics 110 points

Australia to make Facebook, Google pay for news in world first https://www.reuters.com/article/us-australia-media-regulator/australia-to-make-facebook-google-pay-for-news-in-world-first-idUSKCN24V3UP 163 points

Google Earth Timelapse https://earthengine.google.com/timelapse/ 234 points

YouTube: Community contributions will be discontinued across all channels https://support.google.com/youtube/answer/6052538 192 points

Show HN: A bookmarking tool designed to help synthesize your web research https://klobie.com 147 points

Reverse Engineering the PLA Chip in the Commodore 128 https://c128.se/posts/silicon-adventures/ 113 points

Philosophers on GPT-3 http://dailynous.com/2020/07/30/philosophers-gpt-3/ 211 points

Chronic mania and persistent euphoric states https://srconstantin.github.io/2020/07/29/chronic-mania.html 72 poi
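
One thing to keep in mind with `find_next` is that it searches forward through the rest of the document, not just the current post. If a post has no score element, the call would return the score of a later post instead. The sketch below illustrates this on made-up markup; the row layout (each `athing` row followed by a row holding the score) and the helper `score_in_next_row` are assumptions for illustration, not a description of the live site.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the assumed row layout; the middle post has no score
sample_html = """
<table>
<tr class="athing"><td><a class="storylink" href="https://example.com/a">Story A</a></td></tr>
<tr><td class="subtext"><span class="score">10 points</span></td></tr>
<tr class="athing"><td><a class="storylink" href="https://example.com/job">Job ad</a></td></tr>
<tr><td class="subtext">2 hours ago</td></tr>
<tr class="athing"><td><a class="storylink" href="https://example.com/c">Story C</a></td></tr>
<tr><td class="subtext"><span class="score">30 points</span></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')


def score_in_next_row(post):
    """Return the score text from the row right after `post`, or None."""
    next_row = post.find_next_sibling('tr')
    # Stop if there is no following row or it already belongs to the next post
    if next_row is None or 'athing' in next_row.get('class', []):
        return None
    score_element = next_row.find(class_='score')
    return score_element.get_text(strip=True) if score_element is not None else None


naive_scores = []
safe_scores = []
for post in sample_soup.find_all('tr', class_='athing'):
    # Searches the whole remainder of the document
    naive_scores.append(post.find_next(class_='score').get_text(strip=True))
    # Looks only inside the row directly after the post
    safe_scores.append(score_in_next_row(post))

print(naive_scores)
print(safe_scores)
```

In this fragment, the naive lookup gives the middle post the score of the post after it, while the restricted lookup reports `None`.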

# The JavaScript problem

Now take a look at http://www.webscrapingfordatascience.com/simplejavascript/. Based on inspecting this site in your browser, you might try the following:

In [15]:
url = 'http://www.webscrapingfordatascience.com/simplejavascript/'

r = requests.get(url)
html_contents = r.text

html_soup = BeautifulSoup(html_contents, 'html.parser')

for item in html_soup.find_all('li'):
    print(item)

Nothing is printed. To see why, let's print the raw HTML that Requests received.

In [16]:
print(html_contents)

<html>

<head>
	<script src="https://code.jquery.com/jquery-3.2.1.min.js"></script>
	<script>
	$(function() {
	document.cookie = "jsenabled=1";
	$.getJSON("quotes.php", function(data) {
		var items = [];
		$.each(data, function(key, val) {
			items.push("<li id='" + key + "'>" + val + "</li>");
		});
		$("<ul/>", {
			html: items.join("")
			}).appendTo("body");
		});
	});
	</script>
</head>

<body>

<h1>Here are some quotes</h1>

</body>

</html>
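
The source above contains no `li` tags at all: the browser creates them by running the script, which requests `quotes.php` and appends the result to the page. Requests only downloads the HTML and does not execute JavaScript. As a rough sketch of one way forward, the endpoint can be located in the source and resolved to a full URL. The regular expression and variable names below are assumptions for illustration, and the snippet works on an excerpt of the source rather than fetching anything.

```python
import re
from urllib.parse import urljoin

page_url = 'http://www.webscrapingfordatascience.com/simplejavascript/'

# Excerpt of the page source printed above
page_source = '''
<script>
$(function() {
document.cookie = "jsenabled=1";
$.getJSON("quotes.php", function(data) {
'''

# Look for the first argument of $.getJSON("...")
match = re.search(r'\$\.getJSON\("([^"]+)"', page_source)
endpoint = match.group(1) if match else None

# Resolve the relative endpoint against the page URL
endpoint_url = urljoin(page_url, endpoint) if endpoint else None
print(endpoint_url)
```

Whether that URL returns the quotes to a plain Requests call is not verified here. Since the script sets a `jsenabled=1` cookie before making the request, the endpoint may expect that cookie to be present.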

