# Web Scraping in Python

## Setting up

Install the packages we will use today. We'll need:

- `requests`, to download the HTML content of a web page
- `BeautifulSoup` (installed as `beautifulsoup4`), to parse HTML into an object we can work with in Python
- `pandas`, to work with dataframes

In [None]:
!pip install beautifulsoup4 requests pandas


Now let's import them:

In [114]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Scraping a single page

### Downloading HTML

Let's start by scraping some content from the R-bloggers website, which publishes news and tutorials about R.

The first step is to download the HTML for a webpage using `requests.get`.

In [63]:
response = requests.get(
	url='https://www.r-bloggers.com'
)

We can view the HTML code that we get:

In [64]:
print(response.text)

<!DOCTYPE html>
<html class="no-js" dir="ltr" lang="en-US" prefix="og: https://ogp.me/ns#" prefix="og: http://ogp.me/ns#">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>R-bloggers</title>

		<!-- All in One SEO 4.5.9.2 - aioseo.com -->
		<meta name="description" content="R news and tutorials contributed by hundreds of R bloggers" />
		<meta name="robots" content="max-image-preview:large" />
		<link rel="canonical" href="https://www.r-bloggers.com/" />
		<link rel="next" href="https://www.r-bloggers.com/page/2/" />
		<meta name="generator" content="All in One SEO (AIOSEO) 4.5.9.2" />
		<meta property="og:locale" content="en_US" />
		<meta property="og:site_name" content="R-bloggers" />
		<meta property="og:type" content="article" />
		<meta property="og:title" content="R-bloggers" />
		<meta property="og:description" content="R news and tutorials contributed by hundreds of R bloggers" />
		<meta property="og:url" content="htt

What you see above is the actual HTML code underlying the webpage in your browser. However, it is currently just one long string. To make it easier to work with, we will use BeautifulSoup to turn it into an object that lets us extract information easily.

In [17]:
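# Name the parser explicitly so the result does not depend on which parsers are installed
webpage = BeautifulSoup(response.text, 'html.parser')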

### Scraping text

Let's store the text of the page's logo in a new variable, using a CSS selector to identify it. (Hint: SelectorGadget can make finding the selector easier!)

In [23]:
page_logo = webpage.select_one('.logo-wrap').get_text()

Print the variable to see what we've downloaded:

In [54]:
print(page_logo)




R-bloggers
R news and tutorials contributed by hundreds of R bloggers




Chances are you'll want only the text, so you can use `.strip()` to get rid of the surrounding spaces/empty lines.

In [68]:
print(page_logo.strip())

R-bloggers
R news and tutorials contributed by hundreds of R bloggers
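
Note that `.strip()` only trims the two ends of the string. If the scraped text also has blank lines or runs of spaces in the middle, splitting on whitespace and re-joining is one common way to tidy it. Here is a minimal sketch; the sample string is an assumption that stands in for `page_logo`:

```python
# Sample text standing in for the scraped page_logo value (assumed for illustration)
raw_text = "\n\n\nR-bloggers\n\n   R news and tutorials contributed by hundreds of R bloggers\n\n\n"

# .strip() trims only the leading and trailing whitespace
stripped = raw_text.strip()

# Keep one entry per non-empty line, each trimmed on both sides
lines = [line.strip() for line in raw_text.splitlines() if line.strip()]

# Collapse every run of whitespace (including newlines) into a single space
one_line = " ".join(raw_text.split())

print(lines)
print(one_line)
```

Which form you want depends on whether the line breaks carry meaning for your analysis.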


Changing `select_one` to `select` returns all of the elements that match the selector, as a list. Let's do this for the article titles on the page.

In [51]:
title_elements = webpage.select('h3 a')
title_elements

[<a href="https://www.r-bloggers.com/2025/10/tabler-0-1-0-is-here/" rel="bookmark">Tabler 0.1.0 is here!</a>,
 <a href="https://www.r-bloggers.com/2025/10/double-descent-explained/" rel="bookmark">Double Descent Explained</a>,
 <a href="https://www.r-bloggers.com/2025/10/approximating-evidence-via-bounded-harmonic-means-and-hpd-regions-with-known-volumes/" rel="bookmark">Approximating evidence via bounded harmonic means (and HPD regions with known volumes)</a>,
 <a href="https://www.r-bloggers.com/2025/10/mapping-antarctica/" rel="bookmark">Mapping Antarctica</a>,
 <a href="https://www.r-bloggers.com/2025/10/ropensci-news-digest-october-2025/" rel="bookmark">rOpenSci News Digest, October 2025</a>,
 <a href="https://www.r-bloggers.com/2025/10/eurobioc2025-conference-recap/" rel="bookmark">EuroBioC2025 conference recap</a>,
 <a href="https://www.r-bloggers.com/2025/10/change-is-good-so-dont-change-my-change/" rel="bookmark">Change is good, so don’t change my change</a>,
 <a href="https:/

To turn this list of elements into a list of texts, you can use a list comprehension. If you are not familiar with list comprehensions: they work like a for loop, but are more compact (compare the two versions below):

In [None]:
# For loop version:
titles = []
for title_element in title_elements:
	titles.append(title_element.get_text())
	
# List comprehension version (more compact):
titles = [title_element.get_text() for title_element in title_elements]

titles

['Tabler 0.1.0 is here!',
 'Double Descent Explained',
 'Approximating evidence via bounded harmonic means (and HPD regions with known volumes)',
 'Mapping Antarctica',
 'rOpenSci News Digest, October 2025',
 'EuroBioC2025 conference recap',
 'Change is good, so don’t change my change',
 'Be Mindful of the Time',
 'Orchestrating Polyglot, Reproducible Data Science with Nix and {rixpress}',
 'The use of SAT/ACT for college admissions',
 'Go for Launch! Packages Shipped to the R-Multiverse',
 'Rfuzzycoco released on CRAN',
 'Compositional modeling of plant communities with Dirichlet regression',
 'gssrdoc Updates',
 'D3po 1.0.0 is here!',
 'Inequality and homicide, within-country and between country by @ellis2013nz',
 'Tutorial for Developing an Advanced Stock Dashboard for the S&P 500 for the 2025 Posit Table Contest',
 'What’s new for Python in 2025?',
 'Building and Customising Statistical Models with Stan and R: An Introduction to Bayesian Inference workshop',
 'Two New Preprints on 

### Scraping images

We can do the same thing for images by selecting the `img` elements and reading their `src` attribute:

In [47]:
image_elements = webpage.select('img')
images = [image_element.get('src') for image_element in image_elements]

Let's see the first two elements in the list of images:

In [56]:
images[:2]

['https://www.r-bloggers.com/wp-content/uploads/2020/07/R_02.webp',
 'https://pacha.dev/blog/2025/10/27/tabler/NAVBAR-OVERLAP-light.png']

To see the images themselves, you could follow the links. In a scraping scenario, we often want to download the images, so we will do that.

Note that common image file types are `.png`, `.jpg`, `.jpeg`, `.gif` and `.webp`. The last one is a newer format that is becoming more common.

In [None]:
image_binary = requests.get(images[0]).content

# Note the mode 'wb', which tells Python to write the content as binary
# This is needed to save things that are not text
with open('r_bloggers.webp', 'wb') as f:
	f.write(image_binary)
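
When downloading many images, hard-coding each file name quickly becomes impractical. One option is to derive the name from the URL itself. The sketch below is one possible approach using only the standard library; the helper `filename_from_url` and its `default` argument are hypothetical names introduced for illustration:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def filename_from_url(url, default="image"):
    """Return the last path segment of a URL, or `default` if there is none."""
    path = urlparse(url).path          # drops the scheme, host, and any ?query
    name = PurePosixPath(path).name    # last segment, e.g. 'R_02.webp'
    return name or default

# URLs copied from the output above
urls = [
    'https://www.r-bloggers.com/wp-content/uploads/2020/07/R_02.webp',
    'https://pacha.dev/blog/2025/10/27/tabler/NAVBAR-OVERLAP-light.png',
]
filenames = [filename_from_url(u) for u in urls]
print(filenames)
```

Each name could then be passed to `open(name, 'wb')` inside a loop over the image URLs.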

## Exercise 1: basic web scraping

On your own, repeat the steps above on the Wikipedia homepage.

First, download the HTML for <https://en.wikipedia.org/wiki/Main_Page>:

In [62]:
# Code goes here

Next, view the HTML code:

In [65]:
# Code goes here

Scrape the headings for the homepage features ('From today's featured article', 'In the news', etc.):

In [66]:
# Code goes here

Scrape the text of the homepage features:

In [67]:
# Code goes here

## More advanced scraping

Often, when undertaking a web scraping project, we find we'll need to download content from multiple pages or multiple locations.

The Connosr database contains a variety of whisky reviews, ratings, and information. The website is structured with information nested under the main URL, [www.connosr.com](https://www.connosr.com).

Here is the structure of the website:

![StructureWebsite](images/WebsiteStructure.jpg)

We're interested in scraping data about Scottish whisky, located in the [Scotch Whiskys sub-folder](https://www.connosr.com/scotch-whisky). Let's save the URL in a new variable:


In [69]:
whiskypage = 'https://www.connosr.com/scotch-whisky'

## Level 1: Home Page

### Extracting information from links

Let's get the links to all of the Scottish whisky distilleries listed on the page. Once we have the list, we'll be able to use a spider to 'crawl' through our links one at a time, extracting information about each distillery.

We'll use a CSS selector to grab the HTML elements that hold the distillery names, then read each element's `href` attribute:

In [None]:
def get_dist_links(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	name_elements = html_soup.select('.name')
	dist_links = [name_element['href'] for name_element in name_elements]
	return dist_links

dist_links = get_dist_links(whiskypage)

We can take a peek at the first few links to see that nothing is wrong.

In [82]:
dist_links[:10]

['/ardbeg-scotch-whisky',
 '/laphroaig-scotch-whisky',
 '/lagavulin-scotch-whisky',
 '/springbank-scotch-whisky',
 '/glenfarclas-scotch-whisky',
 '/highland-park-scotch-whisky',
 '/talisker-scotch-whisky',
 '/caol-ila-scotch-whisky',
 '/glendronach-scotch-whisky',
 '/old-pulteney-scotch-whisky']

How long is the list of links?

In [76]:
len(dist_links)

193

You may notice that these aren't full links (there is no `http...` at the front). We will need full URLs to work with, so we will add the site's address in front of each link, using either a loop or a list comprehension:

In [79]:
full_links = ['http://www.connosr.com' + dist_link for dist_link in dist_links]
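
String concatenation works here because every link starts with `/`. As a more general alternative, the standard library's `urljoin` resolves a relative link against a base URL and leaves links that are already absolute untouched. A small sketch, using a couple of the links from the output above (the `example.com` link is made up for illustration):

```python
from urllib.parse import urljoin

base_url = 'https://www.connosr.com'

# A few relative links copied from the output above
sample_links = ['/ardbeg-scotch-whisky', '/laphroaig-scotch-whisky']

# urljoin resolves each link against the base URL
joined_links = [urljoin(base_url, link) for link in sample_links]

# A link that is already absolute is returned unchanged
already_full = urljoin(base_url, 'https://example.com/page')

print(joined_links)
print(already_full)
```

The sketch uses `https://` to match `whiskypage`, whereas the cell above uses `http://`.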

We can also extract the names of the distilleries. We can write a function for this, similar to the one above.

In [89]:
def get_dist_names(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	name_elements = html_soup.select('.name')
	dist_names = [name_element.get_text() for name_element in name_elements]
	return dist_names

In [91]:
dist_names = get_dist_names(whiskypage)

Note that in practice, you can save some time and space by doing all of these steps in one loop!
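
To illustrate the point, here is a rough sketch of a single-pass version that parses the page once and collects each link together with its name. The function `get_dist_records` and the `sample_html` snippet are hypothetical; the real page's markup may differ, so treat this as a pattern rather than a drop-in replacement:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the listing page; the real markup may differ
sample_html = """
<ul>
  <li><a class="name" href="/ardbeg-scotch-whisky">Ardbeg</a></li>
  <li><a class="name" href="/laphroaig-scotch-whisky">Laphroaig</a></li>
</ul>
"""

def get_dist_records(html):
    """Parse the page once and return (link, name) pairs for each '.name' element."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        (element.get('href'), element.get_text(strip=True))
        for element in soup.select('.name')
    ]

records = get_dist_records(sample_html)
print(records)
```

Keeping the link and the name in the same tuple also guarantees that they stay aligned, which matters later when we combine them in a dataframe.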

### Cleaning scraped text

Data scraped from the web often needs some cleaning. If you print `dist_names`, you'll see that each distillery name is preceded by an extra letter.

Using a regular expression (regex), we'll go ahead and remove the extra letters:

In [None]:
import re

# Compile the pattern in advance to speed things up
regex_pattern = re.compile(r'^\w+\s')

# Create a new list of dist names without the extra letter (and leading space).
cleaned_names = []
for dist_name in dist_names:
	cleaned_name = regex_pattern.sub('', dist_name)
	cleaned_names.append(cleaned_name)

Let's see how it looks:

In [98]:
cleaned_names[:10]

['Ardbeg',
 'Laphroaig',
 'Lagavulin',
 'Springbank',
 'Glenfarclas',
 'Highland Park',
 'Talisker',
 'Caol Ila',
 'GlenDronach',
 'Old Pulteney']
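
It helps to check what the pattern does on a few strings before trusting it on the whole list. The sketch below assumes the raw names look like `'A Ardbeg'` (a single letter, a space, then the name); that format is inferred from the description above, not copied from the site. It also shows a stricter variant, `strict_pattern`, which is our own suggestion:

```python
import re

regex_pattern = re.compile(r'^\w+\s')

# Hypothetical raw names: the single-letter prefix is assumed from the description above
raw_names = ['A Ardbeg', 'H Highland Park', 'Talisker']
cleaned = [regex_pattern.sub('', name) for name in raw_names]
print(cleaned)

# Caveat: the pattern removes the first word, whatever its length,
# so a name that has no prefix loses its real first word
over_cleaned = regex_pattern.sub('', 'Old Pulteney')
print(over_cleaned)

# Stricter variant: only remove a prefix that is exactly one character long
strict_pattern = re.compile(r'^\w\s')
strict_cleaned = [strict_pattern.sub('', name) for name in ['A Ardbeg', 'Old Pulteney']]
print(strict_cleaned)
```

If some entries might arrive without the extra letter, the stricter pattern is the safer choice.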

Now, we have a list of the Scottish distilleries in the Connosr database. We might also be interested in seeing how community members have rated them.

We can write a function that scrapes the rating of every distillery listed on the page, again using a CSS selector:

In [107]:
def get_rate_dist(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	avg_rating_elements = html_soup.select('.not-small-phone')
	average_rating_texts = [avg_rating_element.get_text() for avg_rating_element in avg_rating_elements]
	return average_rating_texts

rate_dist = get_rate_dist(whiskypage)

We'll also want to remove the 'Average rating: ' label from each distillery's rating and turn the remaining text into an actual rating (i.e. a numeric value).

In [None]:
cleaned_rate_dist = [float(rate_text.replace('Average rating: ', '')) for rate_text in rate_dist]

ValueError: could not convert string to float: '~'

That gives you an error! That's because some ratings are not numerical values (`~`). If you look at the website, these appear to be distilleries that have no ratings yet. We have to decide how to handle these values. A common approach is to assign them some equivalent of "NA"; in this case, we will use `None`.

In [None]:
cleaned_rate_dist = []
for rate_text in rate_dist:
	rating = rate_text.replace('Average rating: ', '')
	if rating == '~':
		cleaned_rate_dist.append(None)
	else:
		cleaned_rate_dist.append(float(rating))
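
The loop above handles `~` specifically. If other placeholders might show up, a `try`/`except` around `float` catches them all. Here is a minimal sketch; `parse_rating` is a hypothetical helper, and the sample strings are made up in the format described above:

```python
def parse_rating(text, prefix='Average rating: '):
    """Convert a rating string to a float, or None if it is not numeric."""
    value = text.replace(prefix, '').strip()
    try:
        return float(value)
    except ValueError:
        # Covers '~' and any other non-numeric placeholder
        return None

# Hypothetical sample strings in the format described above
sample_texts = ['Average rating: 88.2', 'Average rating: ~', 'Average rating: 87']
sample_ratings = [parse_rating(t) for t in sample_texts]
print(sample_ratings)
```

With this helper, the cleaning step could be written as a single list comprehension over `rate_dist`.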


To save the information we've extracted, we can merge it into a dataframe:

In [None]:
distillery_df = pd.DataFrame(
	zip(full_links, cleaned_names, cleaned_rate_dist),
	columns=['full_link', 'cleaned_name', 'rating']
)
distillery_df

| | full_link | cleaned_name | rating |
|---|---|---|---|
| 0 | http://www.connosr.com/ardbeg-scotch-whisky | Ardbeg | 88.2 |
| 1 | http://www.connosr.com/laphroaig-scotch-whisky | Laphroaig | 87.3 |
| 2 | http://www.connosr.com/lagavulin-scotch-whisky | Lagavulin | 89.6 |
| 3 | http://www.connosr.com/springbank-scotch-whisky | Springbank | 86.8 |
| 4 | http://www.connosr.com/glenfarclas-scotch-whisky | Glenfarclas | 86.6 |
| ... | ... | ... | ... |
| 188 | http://www.connosr.com/craiglodge-scotch-whisky | Craiglodge | |
| 189 | http://www.connosr.com/macdonald-martin-scotch... | Macdonald Martin | |
| 190 | http://www.connosr.com/pride-of-strathspey-sco... | Pride of Strathspey | |
| 191 | http://www.connosr.com/ben-wyvis-scotch-whisky | Ben Wyvis | |


## Exercise 2: Level 1 scraping

On your own, use what we've learned to scrape a list of whisky distilleries for another region on Connosr.

First, save the URL for the page.


In [None]:
# Code goes here

Next, get the links by selecting the HTML elements for the distillery names and reading their `href` attributes. (Hint: we've already defined the function `get_dist_links` in the previous step, so you'll just need to call it on the new URL!)

In [None]:
# Code goes here

According to the Connosr database, how many whisky distilleries are in the new region you've explored?

In [None]:
# Code goes here

## Level 2: Distilleries pages

At this point in the lesson, we'll be scraping multiple pages for information, so you may find that the code blocks take longer to run.

We're going to repeat the process of writing a function, this time to download the links to the review pages for specific bottles:

In [119]:
def get_bottle_links(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	name_elements = html_soup.select('.name')
	bottle_links = [name_element['href'] for name_element in name_elements]
	return bottle_links

bottle_links = []
for full_link in full_links:
	bottle_links += get_bottle_links(full_link)
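
When a loop sends hundreds of requests, it is good practice to pause between them and to keep going if a single page fails. The following is a sketch of that pattern, not part of the original lesson: `crawl`, `fake_fetch`, and `delay_seconds` are hypothetical names, and the fetcher is a stand-in so that the example runs without a network connection.

```python
import time

def crawl(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between requests.

    Failed pages are recorded as None instead of stopping the whole run.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be polite: space out the requests
        try:
            results[url] = fetch(url)
        except Exception as error:
            print(f'Skipping {url}: {error}')
            results[url] = None
    return results

# Stand-in fetcher so the sketch runs offline (hypothetical)
def fake_fetch(url):
    if 'broken' in url:
        raise ValueError('simulated download failure')
    return f'<html>{url}</html>'

pages = crawl(
    ['https://example.com/a', 'https://example.com/broken'],
    fake_fetch,
    delay_seconds=0.01,
)
print(pages)
```

In the notebook, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, with a longer delay than the one used here.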

### Completing partial URLs

The links are incomplete, with only part of the URL path. Let's fix this:

In [120]:
full_bottle_links = ['http://www.connosr.com' + bottle_link for bottle_link in bottle_links]

Let's look at some of them to check that we got it right:

In [122]:
full_bottle_links[:10]

['http://www.connosr.com/ardbeg-uigeadail-whisky-reviews-1976',
 'http://www.connosr.com/ardbeg-10-year-old-whisky-reviews-373',
 'http://www.connosr.com/ardbeg-corryvreckan-whisky-reviews-2174',
 'http://www.connosr.com/ardbeg-1990-airigh-nam-beist-whisky-reviews-1248',
 'http://www.connosr.com/ardbeg-alligator-whisky-reviews-2458',
 'http://www.connosr.com/ardbeg-ardbog-whisky-reviews-2631',
 'http://www.connosr.com/ar1-elements-of-islay-whisky-reviews-689',
 'http://www.connosr.com/ar2-elements-of-islay-whisky-reviews-1600',
 'http://www.connosr.com/ardbeg-10-year-old-whisky-reviews-373',
 'http://www.connosr.com/ardbeg-10-year-old-mor-whisky-reviews-944']
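
Notice that the link for Ardbeg 10 Year Old appears twice in the output above. If you only need each review page once, one simple way to drop repeats while keeping the original order is `dict.fromkeys`. A short sketch on a few of the links shown above:

```python
# Sample links copied from the output above; the second entry appears twice
sample_links = [
    'http://www.connosr.com/ardbeg-uigeadail-whisky-reviews-1976',
    'http://www.connosr.com/ardbeg-10-year-old-whisky-reviews-373',
    'http://www.connosr.com/ardbeg-corryvreckan-whisky-reviews-2174',
    'http://www.connosr.com/ardbeg-10-year-old-whisky-reviews-373',
]

# dict keys are unique and keep insertion order, so this drops repeats
# while preserving the order in which links first appeared
unique_links = list(dict.fromkeys(sample_links))
print(len(sample_links), len(unique_links))
```

Be careful if you deduplicate: any list that is later zipped with the links (such as the bottle names) would need the same treatment to stay aligned.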

### Extracting information from links

Now we have the links of each bottle page. We also want to get the name of each bottle:

In [123]:
def get_bottle_names(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	name_elements = html_soup.select('.name')
	bottle_names = [name_element.get_text() for name_element in name_elements]
	return bottle_names

bottle_names = []
for full_link in full_links:
	bottle_names += get_bottle_names(full_link)
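
Note that `get_bottle_links` and `get_bottle_names` each download every distillery page, so each page is fetched twice. One way to avoid the repeated downloads is to cache the HTML by URL. The sketch below uses `functools.lru_cache`; `download` is a hypothetical stand-in for `requests.get(url).text` so the example runs offline, and `get_html` is a name introduced here for illustration:

```python
from functools import lru_cache

download_count = 0

def download(url):
    """Stand-in for requests.get(url).text so the sketch runs offline (hypothetical)."""
    global download_count
    download_count += 1
    return f'<html>page for {url}</html>'

@lru_cache(maxsize=None)
def get_html(url):
    """Download a page the first time it is requested; reuse the result afterwards."""
    return download(url)

first = get_html('https://example.com/distillery')
second = get_html('https://example.com/distillery')  # served from the cache
print(download_count)
```

Both scraping functions could then call `get_html(url)` instead of downloading the page themselves.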

## Level 3: Reviews

### Downloading reviews

The full list of reviewed bottles includes 3,508 observations. To save time, we are going to work on a subset of the list of links. (If you want to work on the full list, please note that each function can take up to 10 minutes to run. You can work with the full list by replacing `test_links` with `full_bottle_links`.)

In [125]:
test_links = full_bottle_links[:100]

In [134]:
def get_bottle_reviews(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	p_elements = html_soup.select('.simple-review-content p')
	bottle_reviews = [p_element.get_text() for p_element in p_elements]
	return bottle_reviews

bottle_review_list = []
for full_bottle_link in test_links:
	bottle_review_list.append(get_bottle_reviews(full_bottle_link))

Now, we have a very long list of reviews! Let's have a look at the reviews for the first bottle:

In [135]:
bottle_review_list[0]

["Not the first I've tried, mind you, but the first I've owned. Not sure why it took me so long.",
 "On the nose, there is warm, thick, sweet, perfumed, enveloping smokiness. A different quality of smoke that some of the other Islays I've tried. It is not sharp, or ashen, or medicinal. There is a tarry quality to it, and a salty note, but also a note of burning logs, interwoven with aromatic incense-like sweetness. As it sits in the glass there develops a molasses and caramel note. Very complex, yet not overly intense. ",
 'At full force, this comes strong on the palate. The tarry smoke really fills the mouth, driven in by the high alcohol and backed up by dark chocolate-covered cherries. The sherry casks are very well-integrated. I find with many smoky sherried whiskies the two styles can clash and become overbearing. Not here. ',
 'The finish is pleasantly smoky, not the usual mouth-full-of-ashes sensation from a Laphroaig or Lagavulin. Some fragrant sweetness as the flavour fades. '

### Merging scraped data

We may want to merge the reviews for each bottle together. We can do this by joining the strings.

In [138]:
reviews_by_bottle = [' '.join(bottle_reviews) for bottle_reviews in bottle_review_list]

We can also create a dataframe for further data manipulation.

In [None]:
with_bottle_names = pd.DataFrame(
	zip(bottle_names[:100], reviews_by_bottle),
	columns=['bottle_name', 'review']
)
with_bottle_names

| | bottle_name | review |
|---|---|---|
| 0 | Ardbeg UigeadailOB | Not the first I've tried, mind you, but the fi... |
| 1 | Ardbeg 10 Year OldOB | My first Connosr review was an Ardbeg 10 over ... |
| 2 | Ardbeg CorryvreckanOB | Last time I sipped Corryvreckan was summer 201... |
| 3 | Ardbeg 1990 Airigh Nam BeistOB | Nose: First up is wet moss with ocean misting ... |
| 4 | Ardbeg AlligatorOB | My taste in Scotch tends to run to the Islays.... |
| ... | ... | ... |
| 95 | Ardbeg 1974 Cask 4989OB | There are no community reviews of Ardbeg 1974 ... |
| 96 | Ardbeg 1974 Cask 5666OB | There are no community reviews of Ardbeg 1974 ... |
| 97 | Ardbeg 1974 Connoisseurs Choice bottled 1993 G... | There are no community reviews of Ardbeg 1974 ... |
| 98 | Ardbeg 1974 Connoisseurs Choice bottled 2003 G... | There are no community reviews of Ardbeg 1974 ... |
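
The last rows show that bottles without reviews come back with a placeholder sentence rather than an empty result. Before analysing the text, you may want to filter those rows out. A minimal sketch on made-up sample rows; only the placeholder wording is taken from the output above:

```python
# Hypothetical sample rows; the placeholder wording is taken from the output above
sample_reviews = [
    ('Bottle A', 'Nose: First up is wet moss with ocean misting'),
    ('Bottle B', 'There are no community reviews of Bottle B'),
]

PLACEHOLDER_PREFIX = 'There are no community reviews'

# Keep only rows whose review text is an actual review
reviewed = [
    (name, review)
    for name, review in sample_reviews
    if not review.startswith(PLACEHOLDER_PREFIX)
]
print(reviewed)
```

On the dataframe itself, the same idea could be expressed with `with_bottle_names['review'].str.startswith(...)` and boolean indexing.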


# The end!