## Web Scraping in Python

It will hardly come as a surprise that the internet is a rich source of data for aspiring data scientists, especially when it comes to textual data. Hence the need for skills to retrieve data from the internet.

The act of using a computer program to retrieve web pages and store and analyse their content is known as 'web scraping'.


#### Law, copyright and ethics
This might be a good time to point out that, while web scraping is not in itself an illegal activity, there are some finer points which you will have to consider before starting to apply your skills.

Firstly, while reading web pages on the internet is fine (it is obviously what they're there for), what you do with the acquired data might be illegal. You will need to consider both privacy laws and any copyright claims that rest upon said data. If, for instance, you're using the output of a famous AI chatbot to train your own chatbot, let's call it SeepDeek, the owner of the famous AI chatbot might want to have a word with you about your disregard of their legal terms for using their service.

Secondly, computer programs are usually far better at consuming enormous amounts of web pages than their human counterparts. Reading a web page and navigating a web site 'by hand' is obviously going to be slow in comparison with the automated version. A computer program downloading web pages might in fact cause congestion on the computer program serving the web pages, the web server. This would result in other users of the same web service experiencing slowdowns. When done with malicious intent this would be regarded as a denial-of-service attack (a DoS attack).  
It is obviously unethical to burden a web server to the point where other users start experiencing such slowdowns. Besides, the proprietor of the web server is likely charged monthly for both CPU usage and bandwidth, both of which you are increasing solely for your own benefit.

That's why you really should take all the necessary care to ensure you are allowed to read a web site using web scraping techniques. To help both you and the proprietor of the web server (and/or the data being served) come to a sensible agreement, you could start by reading up on a standard called '[robots.txt](https://en.wikipedia.org/wiki/Robots.txt)'. This name refers to a file the proprietor can put on the web server. The contents of the file try to explain whether and, if so, how you're allowed to read the contents of the web site.

As always, it is completely up to you to be a nice person. But [you really should](https://en.wikipedia.org/wiki/Karma).
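Python's standard library can read robots.txt rules for you through the `urllib.robotparser` module. The sketch below parses a made-up robots.txt (the rules, the `example.com` addresses and the user agent name `my-course-scraper` are all assumptions for illustration) and asks whether two pages may be fetched:

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = 'my-course-scraper'  # hypothetical name for our own program

# May this user agent fetch these pages?
print(parser.can_fetch(user_agent, 'https://example.com/index.html'))         # True
print(parser.can_fetch(user_agent, 'https://example.com/private/data.html'))  # False

# Number of seconds the site asks us to wait between requests (None if not given)
print(parser.crawl_delay(user_agent))                                         # 5
```

For a real site you would typically call `parser.set_url(...)` followed by `parser.read()` instead of `parse()`, and pause with `time.sleep()` between requests to honour the requested delay.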

#### Web pages - what are they made of?

Before you start downloading all kinds of web pages with Python, it might be worth your while to read up on what web pages are actually made of.

Most web pages are made of [HTML](https://en.wikipedia.org/wiki/HTML), a markup language.  
And as the [web](https://en.wikipedia.org/wiki/History_of_the_World_Wide_Web) grew beyond the ideas of Sir Tim Berners-Lee, people invented technologies like [CSS](https://en.wikipedia.org/wiki/CSS) to style their web pages and [JavaScript](https://en.wikipedia.org/wiki/JavaScript) to allow the pages to become more interactive.  
You can find tutorials about all of these standards on [w3schools.com](https://www.w3schools.com/html/).

When scraping for data you might also encounter [JSON](https://en.wikipedia.org/wiki/JSON), which is a very common data format nowadays. Python even has a [built-in library](https://www.w3schools.com/python/python_json.asp) to convert to and from JSON text.
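As a small illustration of that built-in `json` module, the sketch below converts a piece of JSON text into Python objects and back again. The JSON text itself is made up for the example:

```python
import json

# Hypothetical JSON text, shaped like a typical web API response
json_text = '{"station": "campus", "pm25": 17.4, "tags": ["air", "realtime"]}'

record = json.loads(json_text)   # JSON text -> Python dict
print(record['pm25'])            # a float
print(record['tags'][0])         # JSON arrays become Python lists

record['pm25'] = 18.0
print(json.dumps(record, indent=2))  # Python dict -> nicely indented JSON text
```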

#### Downloading a single web page in Python

As is always the case, a lot of options are available for downloading stuff from the internet using Python.  
Let's start simple: downloading single web pages.

Python has a built-in package for working with web addresses. These addresses, designating a resource located somewhere on the internet, are called [Uniform Resource Locator](https://en.wikipedia.org/wiki/URL)s or URLs for short. The package is aptly named [urllib](https://docs.python.org/3/library/urllib.request.html).

Calling urllib.request.urlopen() returns a file-like object from which you can read the web page.  
Downloading a web page is therefore as simple as:

In [9]:
import urllib.request

url = 'https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen'
with urllib.request.urlopen(url) as f:
    html = f.read()

print(html[:100])  ## byte string cut short for readability

b'<!DOCTYPE html>\n\n<html lang="en" data-content_root="../">\n  <head>\n    <meta charset="utf-8" />\n    '


As you can see in the example above, reading from the object returned by urlopen() gives you a byte string (bytes).  
If you prefer a Python str (string) to work with, you can decode the byte string using the appropriate character encoding:

In [10]:
html = urllib.request.urlopen(url).read().decode('utf-8')
print(html[:100])

<!DOCTYPE html>

<html lang="en" data-content_root="../">
  <head>
    <meta charset="utf-8" />
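Not every page is encoded as UTF-8, so hard-coding `'utf-8'` can fail or produce garbled text. A minimal sketch of a more forgiving approach is shown below; the helper name `decode_body` and the choice of UTF-8 as fallback are assumptions made for this example:

```python
def decode_body(raw, declared_charset=None):
    """Decode the bytes of a web page into a str.

    `declared_charset` is whatever encoding the server announced (may be None);
    UTF-8 is used as an assumed fallback.
    """
    encoding = declared_charset or 'utf-8'
    try:
        # errors='replace' swaps undecodable bytes for U+FFFD instead of raising
        return raw.decode(encoding, errors='replace')
    except LookupError:
        # The server named an encoding Python does not know
        return raw.decode('utf-8', errors='replace')


print(decode_body('café'.encode('latin-1'), 'latin-1'))
print(decode_body('café'.encode('utf-8')))
```

When using urlopen(), the encoding announced by the server is available inside the `with` block as `f.headers.get_content_charset()`, which returns `None` if the server did not declare one.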
    


#### Downloading (all) tables from a web page

The pandas library has a function you will find useful to read all tables from a web page in one go:

In [15]:
from io import StringIO
import pandas as pd

url = 'https://en.wikipedia.org/wiki/2026_Winter_Olympics_medal_table'
# Wikipedia will reject requests from the default urllib user agent,
# so we replace it with an alternative
request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla'})
tables = pd.read_html(
    StringIO(
        urllib.request.urlopen(request).read().decode('utf-8')
    )
)
print('Number of tables read:', len(tables))
tables[2]

Number of tables read: 6


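| | Rank | NOC | Gold | Silver | Bronze | Total |
|---|---|---|---|---|---|---|
| 0 | 1 | Norway | 3 | 1 | 2 | 6 |
| 1 | 2 | United States | 2 | 0 | 0 | 2 |
| 2 | 3 | Italy* | 1 | 2 | 6 | 9 |
| 3 | 4 | Japan | 1 | 2 | 1 | 4 |
| 4 | 5 | Austria | 1 | 2 | 0 | 3 |
| 5 | 6 | Germany | 1 | 1 | 1 | 3 |
| 6 | 7 | Czech Republic | 1 | 1 | 0 | 2 |
| 7 | 7 | France | 1 | 1 | 0 | 2 |
| 8 | 7 | Sweden | 1 | 1 | 0 | 2 |
| 9 | 10 | Switzerland | 1 | 0 | 0 | 1 |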


Keep in mind that it will return all tables in the web page, even the ones that aren't easily or even actually visible.

#### Downloading web pages and more

While the built-in urllib package is useful for occasional single-page downloads, more complicated scenarios require a more capable web client. As you can read on the Python documentation page for the [urllib.request](https://docs.python.org/3/library/urllib.request.html#module-urllib.request) module, the [requests package](https://requests.readthedocs.io/en/latest/), a third-party library, provides a commonly used higher-level interface for interacting with web servers using HTTP.  
It will, for instance, allow you to mimic other browsers, authenticate to web sites or create more complicated HTTP requests like GET and POST requests carrying data.

The requests package is part of the standard Anaconda distribution, so you're likely to have it installed.

If you want to learn more about using requests, you might find some tutorials on YouTube:
- [BeautifulSoup + Requests | Web Scraping in Python](https://www.youtube.com/watch?v=bargNl2WeN4)
- follow-up: [Find and Find_All | Web Scraping in Python](https://www.youtube.com/watch?v=xjA1HjvmoMY)
- follow-up: [Scraping Data from a Real Website | Web Scraping in Python](https://www.youtube.com/watch?v=8dTpNajxaH0)

The rather elaborate example below uses Eindhoven's [Open Data API](https://data.eindhoven.nl/pages/home/) to download and show recent (real-time) measurements of particulate matter for the location of the TU/e campus.

In [16]:
import requests
import json

url = 'https://data.eindhoven.nl/api/explore/v2.1/catalog/datasets/real-time-fijnstof-monitoring/records'
# 'select=pm25&where=&limit=20'
params = {
    'select': 'lossetimestamps_timestamp,pm25',
    'where': 'latitude="51.4454" AND longitude="5.4854"',
    'timezone': 'Europe/Amsterdam',
    'limit': 20,
}
response = requests.get(url, params)
if response.status_code == 200:
    data = json.loads(response.text)
    print(f'Observation count: {data["total_count"]}')
    for i, observation in enumerate(data['results'], 1):
        print(f'obs {i:02} - {observation["lossetimestamps_timestamp"]} - pm25 = {observation["pm25"]}')
else:
    print('Error retrieving specified URL:')
    print(response.text)

Observation count: 2
obs 01 - 2026-02-08T23:40:00+01:00 - pm25 = 17.39
obs 02 - 2026-02-08T23:30:01+01:00 - pm25 = 17.24
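The `params` dictionary in the example above ends up in the query string of the URL, the part after the `?`. To get a feel for what that looks like, the sketch below builds such a URL by hand with the standard library's `urllib.parse.urlencode`. This illustrates the idea only; the exact encoding requests produces may differ in small details.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'https://data.eindhoven.nl/api/explore/v2.1/catalog/datasets/real-time-fijnstof-monitoring/records'
params = {
    'select': 'lossetimestamps_timestamp,pm25',
    'where': 'latitude="51.4454" AND longitude="5.4854"',
    'timezone': 'Europe/Amsterdam',
    'limit': 20,
}

# Percent-encode the parameters and append them after a '?'
full_url = f'{base_url}?{urlencode(params)}'
print(full_url)

# Decoding the query string gives the original values back (as lists of strings)
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```

With requests, the URL that was actually sent can be inspected afterwards through `response.url`.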


#### Navigating a web site and extracting parts from web pages 

While downloading full web pages is a breeze in Python, retrieving useful information from them is a different matter.  
Like natural language, HTML, the markup language in which web pages are written, is subject to change over time. To make matters worse, web developers change the layout and markup of their web sites every so often, which causes your scripts to grind to a halt, spewing errors about missing tags you assumed would always be there.  

You could start by using regular expressions (REs) to extract information from web pages. Web pages are just text, so REs will work fine for simple patterns.

For example, let's retrieve today's featured article from Wikipedia:

In [None]:
import re

url = 'https://en.wikipedia.org/wiki/Main_Page'
response = requests.get(url, headers={'User-Agent': 'Python Requests'})
try:
    html = response.text
    m = re.search(
        # the dots are escaped so they only match literal full stops
        r'<a href="(?P<rel_url>[^"]+)" title="(?P<title>[^"]+)">Full&#160;article\.\.\.</a>',
        html
    )
    assert m is not None, 'Link for Featured Article not found!'
    rel_url = m.group('rel_url')
    title = m.group('title')
    print(f'Featured article: {title}')
    print(f'Hyperlink: https://en.wikipedia.org{rel_url}')
except Exception as ex:
    print('Error retrieving featured article from Wikipedia!')
    print(ex)

Featured article: Master Juba
Hyperlink: https://en.wikipedia.org/wiki/Master_Juba
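Because the live page changes every day, it can help to try a pattern on a fixed piece of HTML first. The sketch below does that with a made-up fragment shaped like the link matched above. It also uses `urllib.parse.urljoin` to turn the relative link into a full URL and `html.unescape` to decode any HTML entities in the title; both are additions for this example, not part of the code above. The dots in the pattern are escaped here so they match literal full stops.

```python
import re
from html import unescape
from urllib.parse import urljoin

# Hypothetical HTML fragment, shaped like the 'Full article...' link on the main page
page = ('<p>Some summary text. (<b><a href="/wiki/Master_Juba" '
        'title="Master Juba">Full&#160;article...</a></b>)</p>')

pattern = re.compile(
    r'<a href="(?P<rel_url>[^"]+)" title="(?P<title>[^"]+)">Full&#160;article\.\.\.</a>'
)

m = pattern.search(page)
if m is None:
    print('Link not found; the markup may have changed.')
else:
    title = unescape(m.group('title'))
    # Resolve the relative link against the address of the page it came from
    link = urljoin('https://en.wikipedia.org/wiki/Main_Page', m.group('rel_url'))
    print(f'Featured article: {title}')
    print(f'Hyperlink: {link}')
```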


For a higher-level look at the contents of the web page, you could use a library like [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/).  
It turns the HTML for the web page into a tree-like structure, allowing you to search for specific tags, or tags with specific attributes or content: 

In [23]:
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Main_Page'
response = requests.get(url, headers={'User-Agent': 'Python Requests'})
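# name the parser explicitly; otherwise BeautifulSoup picks one and issues a warning
html = BeautifulSoup(response.text, 'html.parser')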

tfa_div = html.find('div', {'id': 'mp-tfa'})
tfa_p = tfa_div.find('p')
print(tfa_p.text.strip())

Master Juba (c. 1825 – c. 1854) was an African-American dancer. He was one of the first black performers in the United States to play onstage for white audiences and the only one of the era to tour with a white minstrel group. He began his career in Manhattan's Five Points neighborhood and moved to minstrel shows in the mid-1840s. His act featured a sequence in which he imitated famous dancers, then closed by performing in his style. In 1848, he became a sensation in Britain for his dance style, but writers treated him as an exhibit on display. Juba's popularity faded and he died around 1854, probably of fever. He was largely forgotten by historians until a 1947 article by Marian Hannah Winter popularized his story. Juba's dancing style was percussive, varied in tempo and expressive. It likely incorporated European folk steps and African-derived steps used by plantation slaves. Juba was highly influential in the development of tap, jazz, and step-dancing styles. (Full article...)
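To navigate a web site rather than a single page, a script first needs the links found on each page. The sketch below collects them using the standard library's `html.parser.HTMLParser`, which is a different and more low-level tool than BeautifulSoup. The class name `LinkCollector`, the sample HTML and the `example.org` address are assumptions made for this example:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # Turn relative links into full URLs
                self.links.append(urljoin(self.base_url, href))


# Hypothetical page content and address
sample = ('<div id="nav"><a href="/wiki/A">A</a> '
          '<a name="x">no link</a> <a href="b.html">B</a></div>')

collector = LinkCollector('https://example.org/docs/index.html')
collector.feed(sample)
print(collector.links)
```

Each collected URL could then be downloaded in turn, keeping the earlier remarks about robots.txt and server load in mind.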


#### Automating web scraping and beyond...

Without going into further detail, it is worth mentioning, if only for the sake of completeness, that dedicated frameworks exist to automate advanced web scraping tasks. One such framework for Python is [Scrapy](https://scrapy.org/).