# Web Scraping

Extracting data from websites programmatically.

We will talk about three kinds of scraping:

1. **APIs:** "Application programming interfaces"; some websites and services offer access to data in an already structured format via an API.
2. **Screenscraping:** Scraping from static websites (the information is in the page source code itself).
3. **Dynamic scraping:** Scraping from dynamic websites (information is dynamically loaded e.g. from a database & cannot be found in the source code itself).
    - *Intercept/mimic calls to the backend:* "Trick" the backend of the website into sending data directly to you
    - *Headless Browser*: "Zombie-Browser" that fakes user interaction to retrieve dynamic elements, e.g. using Selenium

In [None]:
import pandas as pd
import requests
import json

from bs4 import BeautifulSoup
from io import StringIO
from functools import reduce

## APIs

If you are lucky, the data is provided via an API that you can access programmatically. For example, the data from [Abgeordnetenwatch](https://www.abgeordnetenwatch.de/) ("monitoring" of German MPs) is provided via an API. Usually, an API has [documentation](https://www.abgeordnetenwatch.de/api) where you can see how to retrieve the data.

A common format to deliver the data is **JSON** (JavaScript Object Notation), which can easily be parsed in Python as a dictionary:

In [None]:
mps = requests.get("https://www.abgeordnetenwatch.de/api/v2/politicians").json()
mps
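To make the "JSON becomes a dictionary" point concrete without hitting the network, here is a minimal sketch. The payload is a made-up, heavily shortened stand-in for the kind of structure the cells here rely on (a `"data"` list of records with a nested `"party"` object); the values are hypothetical.

```python
import json

# Hypothetical, shortened payload: NOT real API output, just the same general shape
raw = '{"data": [{"id": 1, "last_name": "Example", "party": {"id": 7, "label": "Party A"}}]}'

payload = json.loads(raw)       # JSON object -> Python dict
records = payload["data"]       # JSON array  -> Python list
first = records[0]              # each record is again a dict

print(type(payload).__name__, type(records).__name__)
print(first["party"]["label"])  # nested objects are nested dicts
```

`requests` does the same parsing for you when you call `.json()` on a response.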

Data delivered in JSON is usually pretty straightforward to work with & get into a `DataFrame`-format:

In [None]:
mps = pd.DataFrame(mps["data"])[["id", "first_name", "last_name", "year_of_birth", "occupation", "party"]]
mps.head()

...and since the data is delivered pre-structured, it should require only a little extra cleaning:

In [None]:
mps = mps.assign(party=[p["label"] for p in mps["party"]])
mps.head()

Some APIs require paying a one-time or subscription fee and/or require you to use authentication. In these cases, it's best to refer to the specific documentation.

## Screenscraping

![](assets/scraping_meme.png){width="40%"}

If data is not provided via an API, we have to parse it from the page source code ourselves. Websites are usually built from three code components:

* **HTML:** "HyperText Markup Language"; defines structure and content of the website
* **CSS:** "Cascading Style Sheets"; defines presentation and styling
* **JavaScript:** Programming language used to build interactive elements of websites (e.g. what happens when you click a button)

To see the source code of a web page, right-click the page in most browsers and choose an option like "View page source". Alternatively, most browsers also support simply adding `view-source:` in front of the URL. This will present you with the HTML code of a website, e.g. the [Wikipedia page for the ESC 2024](https://en.wikipedia.org/wiki/Eurovision_Song_Contest_2024) without makeup looks something like this:

![](assets/html.png)

This course is not about web development, so we will only talk about the basic structure of HTML that you need to parse information from it. HTML is structured in "tags", like `<p>` (paragraph), `<h1>` (header) or `<img>` (image). Most tags have to be opened and closed & often contain text or other elements, like this:

```
<p>This is a paragraph.</p>
```

Tags can also have attributes, e.g.:

```
<h2 class="vector-pinnable-header-label">Contents</h2>
```

These attributes can for example be used to make elements look a certain way (according to some style defined somewhere else in a CSS stylesheet, but looks are not of interest to us right now).
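As a small preview of how tags and attributes look from Python, the sketch below parses the `<h2>` example from above with `BeautifulSoup` (covered in more detail further down). Nothing here is specific to Wikipedia; it only illustrates the tag / text / attribute split.

```python
from bs4 import BeautifulSoup

snippet = '<h2 class="vector-pinnable-header-label">Contents</h2>'
tag = BeautifulSoup(snippet, "html.parser").find("h2")

print(tag.name)      # the tag name
print(tag.text)      # the text between opening and closing tag
print(tag.attrs)     # all attributes as a dictionary
print(tag["class"])  # "class" can hold several values, so it comes back as a list
```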

We will start by scraping a mock-website we *know* we are allowed to scrape: https://books.toscrape.com/. The easiest way to find something of interest in the page source code is via the element selector. Most browsers come with a right-click option like "Inspect" or "Inspect element", which has some form of selector feature:

![](assets/selector_menu.png)

Hovering over or clicking on elements of the page will now show where they are in the source code (alternatively you can search through the opened page source code using Ctrl+F or Cmd+F on Mac):

![](assets/selector_hover.png)

But how do we actually get & parse the source code? To retrieve the page source of a website, use [requests](https://requests.readthedocs.io/en/latest/):

In [None]:
response = requests.get("https://books.toscrape.com/")

Responses to a GET-request contain a status code that can tell you something about whether your request succeeded, and if it failed it might tell you why:

In [None]:
response.status_code

Brief guide:

* `1XX`: Wait
* `2XX`: Successful (maybe with caveats)
* `3XX`: Redirection ("Go away!", i.e. look somewhere else)
* `4XX`: You f\*cked up
* `5XX`: The site f\*cked up
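If you prefer the official wording, the same grouping can be sketched with the standard library. `status_class` is a hypothetical helper written for this illustration (it is not part of `requests`); `http.HTTPStatus` supplies the standard phrases.

```python
from http import HTTPStatus

def status_class(code: int) -> str:
    """Map a status code to its general class (illustrative helper)."""
    if 100 <= code < 200:
        return "informational"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unknown"

for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

With a real `requests` response you can also call `response.raise_for_status()`, which raises an exception for `4XX` and `5XX` codes.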

A response code of `200` means everything went fine. The actual page source is stored in the content of the response as raw bytes (looking at the first 100 bytes):

In [None]:
response.content[:100]

Instead of working with these raw bytes, we can parse them into a format that is easier to navigate and query using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/):

In [None]:
soup = BeautifulSoup(response.content, "html.parser")

`BeautifulSoup` now supports all kinds of operations on the source code. As we saw, the title of a book is always stored in an `<h3>`-tag. `.find()` always retrieves the first matching element:

In [None]:
soup.find("h3")

`.find_all()` retrieves all matching elements:

In [None]:
soup.find_all("h3")

To get the title from a single tag, you can use the text inside the tag, but this is cut off:

In [None]:
soup.find("h3").text

As we can see above, the title is also an attribute of the `<a>`-tag that is embedded:

In [None]:
soup.find("h3").find("a")["title"]

To get all titles, we can now iterate over all `<h3>`-tags:

In [None]:
[h3.find("a")["title"] for h3 in soup.find_all("h3")]

To find a tag by attribute, you can pass a dictionary to `.find` or `.find_all`:

In [None]:
soup.find("div", {"class": "product_price"})
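There is more than one way to express "tag with this class". The sketch below runs on a small hand-written snippet (loosely modelled on a product entry, not copied from the site) and shows three equivalent lookups: the attribute dictionary, the `class_` keyword, and a CSS selector via `.select_one()`.

```python
from bs4 import BeautifulSoup

# Hand-written demo HTML; the structure is an assumption for illustration
html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <p class="instock availability">In stock</p>
</article>
"""
soup_demo = BeautifulSoup(html, "html.parser")

by_dict = soup_demo.find("p", {"class": "availability"})  # attribute dictionary
by_kwarg = soup_demo.find("p", class_="availability")     # class_ because class is a Python keyword
by_css = soup_demo.select_one("p.availability")           # CSS selector syntax

print(by_dict.text.strip())
print(by_dict == by_kwarg == by_css)
```

Note that a tag with several classes (here `instock availability`) matches if any one of them equals the class you ask for.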

### Exercise

Use this to find

1. The price of the first product
2. The prices of all products

### Data from HTML-tables

Sometimes, data is already stored in tables, which you can parse as a pandas DataFrame. Say we are interested in retrieving the [final results of the ESC](https://en.wikipedia.org/wiki/Eurovision_Song_Contest_2024#Final): there is a table in the Wikipedia article:

![](assets/table.png)

Using the inspector again, we find out that we are looking for a `<table>` of a certain class that has a certain `<caption>` as a child element.

In [None]:
esc = requests.get("https://en.wikipedia.org/wiki/Eurovision_Song_Contest_2024")
soup = BeautifulSoup(esc.content, "html.parser")  # re-parse: `soup` still held the books page

Let's look for our table. We can find the caption first (which requires writing some nastier code to match - welcome to real websites):

In [None]:
soup.find(lambda tag: tag.name == "caption" and "Final of the Eurovision Song Contest 2024" in tag.get_text())

We saw that the caption is inside the `<table>` tag, so we can find the table by getting the parent of that tag:

In [None]:
table = soup.find(lambda tag: tag.name == "caption" and "Final of the Eurovision Song Contest 2024" in tag.get_text()).parent
table

In [None]:
results = pd.read_html(StringIO(str(table)))[0] # [0] because read_html returns a list of DataFrames; StringIO because read_html expects a file-like object rather than a literal HTML string
results.head()
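If `pd.read_html` ever chokes on a table, you can also walk the rows yourself. This is a minimal sketch on a made-up table (country names and points are invented); it assumes a simple table with one header row and no merged cells.

```python
from bs4 import BeautifulSoup

# Invented demo table, not the real ESC results
html = """
<table>
  <caption>Demo results</caption>
  <tr><th>Country</th><th>Points</th></tr>
  <tr><td>A-land</td><td>120</td></tr>
  <tr><td>B-land</td><td>95</td></tr>
</table>
"""
table_demo = BeautifulSoup(html, "html.parser").find("table")
rows = table_demo.find_all("tr")

# First row holds the column names, the remaining rows hold the data
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    dict(zip(header, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]
print(records)
```

The resulting list of dictionaries can go straight into `pd.DataFrame(records)`; note that all values are still strings at this point.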

### Non-tabular data from a real website

At this point a brief general note about scraping: web scraping is *in principle* legal, but you may accidentally cross legal boundaries, e.g. if you scrape restricted content. Many websites provide a `robots.txt` that explicitly states what you are allowed to access and what not, and sometimes also provides easier ways of navigating the website.
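Python's standard library can read `robots.txt` rules for you. The sketch below parses a small hand-written rule set (the rules and `example.com` URLs are hypothetical) and asks whether two paths may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hand-written rules for illustration; real files are usually longer
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("*", "https://example.com/articles/1")
blocked = parser.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)
```

For a real website you would instead call `parser.set_url(".../robots.txt")` followed by `parser.read()` to download the file before asking `can_fetch`.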

As an example of scraping non-tabular data, we will scrape metadata of news articles from the German news outlet [SPIEGEL](https://www.spiegel.de). We are interested in the author, the time of publication, the news keywords & whether it is paywalled. We can start off with [this random article](https://www.spiegel.de/politik/bunker-plan-fuer-deutschland-behoerden-suchen-nach-intakten-schutzraeumen-a-f4745099-c5c1-45aa-b12b-fb0eab0ea7c7) & first inspect the page source code.

* Click on the link & then we will explore the source code together to find what we are looking for!

Let's start by retrieving the page:

In [None]:
test_article = requests.get("https://www.spiegel.de/politik/bunker-plan-fuer-deutschland-behoerden-suchen-nach-intakten-schutzraeumen-a-f4745099-c5c1-45aa-b12b-fb0eab0ea7c7")
soup = BeautifulSoup(test_article.content, "html.parser")

Finding the title: the title is inside a single `<title>`-tag:

In [None]:
soup.find("title").text

The date is inside a `<meta>`-tag:

In [None]:
soup.find("meta", {"name": "date"})

We can see that the actual date is stored inside the `content`-attribute:

In [None]:
soup.find("meta", {"name": "date"})["content"]

Keywords & author are stored in a similar format, so it might be smart to write a function:

In [None]:
def get_meta(soup: BeautifulSoup, name: str) -> str:
    return soup.find("meta", {"name": name})["content"]

print(f"Date: {get_meta(soup, 'date')}")
print(f"Author: {get_meta(soup, 'author')}") # Author = DER SPIEGEL means no dedicated author
print(f"Keywords: {get_meta(soup, 'news_keywords')}")
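One caveat: `get_meta` raises a `TypeError` if a page has no matching `<meta>`-tag, because `.find()` then returns `None`. A more forgiving variant could look like the sketch below; `get_meta_or_none` is a hypothetical name, and the demo HTML is hand-written.

```python
from typing import Optional

from bs4 import BeautifulSoup

def get_meta_or_none(soup: BeautifulSoup, name: str) -> Optional[str]:
    """Return the content of <meta name=...>, or None if the tag is missing."""
    tag = soup.find("meta", {"name": name})
    if tag is None:
        return None
    return tag.get("content")  # .get() also returns None if the attribute is absent

# Hand-written demo page with a date but no author
demo = BeautifulSoup(
    '<head><meta name="date" content="2024-11-25T10:00:00+01:00"></head>',
    "html.parser",
)
print(get_meta_or_none(demo, "date"))
print(get_meta_or_none(demo, "author"))
```

Whether you want a missing value or a loud error depends on the analysis; for scraping many pages in a loop, missing values are often easier to deal with afterwards.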

Sometimes, you need to be a bit clever with how you parse information. E.g. the paywall-attribute is found inside an embedded JSON-string:

![](assets/embedded_json.png)

In [None]:
soup.find("script", {"type": "application/settings+json"}).text

We can now parse this JSON as a dictionary:

In [None]:
app_json = json.loads(soup.find("script", {"type": "application/settings+json"}).text)
app_json["paywall"]["attributes"]["is_active"]
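Chained lookups like the one above fail with a `KeyError` as soon as one level is missing. A small helper can make nested access more tolerant. The sketch below uses a hand-written `<script>` tag that only mimics the relevant keys, and `dig` is a hypothetical helper name.

```python
import json

from bs4 import BeautifulSoup

# Hand-written stand-in for the embedded settings JSON
html = (
    '<script type="application/settings+json">'
    '{"paywall": {"attributes": {"is_active": true}}}'
    "</script>"
)
script = BeautifulSoup(html, "html.parser").find("script", {"type": "application/settings+json"})
settings = json.loads(script.string)

def dig(data, *keys, default=None):
    """Follow keys through nested dictionaries; return default if any level is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data

print(dig(settings, "paywall", "attributes", "is_active"))
print(dig(settings, "paywall", "does_not_exist", "is_active"))
```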

Let's put all our scraping code in a function:

In [None]:
def get_article_data(soup: BeautifulSoup) -> dict:
    result_dict = {
        "title": soup.find("title").text,
        "date": get_meta(soup, 'date'),
        "author": get_meta(soup, 'author'),
        "keywords": get_meta(soup, 'news_keywords'),
        "paywalled": json.loads(soup.find("script", {"type": "application/settings+json"}).text)["paywall"]["attributes"]["is_active"]
    }
    return result_dict

get_article_data(soup)

You can also wrap the retrieval into a function, so that all you need is the article URL:

In [None]:
def scrape(url: str) -> dict:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    return get_article_data(soup)

scrape("https://www.spiegel.de/politik/bunker-plan-fuer-deutschland-behoerden-suchen-nach-intakten-schutzraeumen-a-f4745099-c5c1-45aa-b12b-fb0eab0ea7c7")

Now for the cool part: because most articles on the news website are structured the same, you can use this function to retrieve data from other articles as well:

In [None]:
scrape("https://www.spiegel.de/sport/schach-wm-titelverteidiger-ding-liren-ueberrascht-mit-auftaktsieg-a-9cbc4765-d586-410d-8b97-471ad190f0f0")

Many websites offer sitemaps that make it easier to navigate the content. For news websites, these often hold an archive of article URLs: https://www.spiegel.de/sitemaps/news-de.xml

In [None]:
resp = requests.get("https://www.spiegel.de/sitemaps/news-de.xml")
sitemap = BeautifulSoup(resp.content, "xml")

In [None]:
article_urls = [url.find("loc").text for url in sitemap.find_all("url")]
article_urls[:10]
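The `"xml"` parser of `BeautifulSoup` needs the `lxml` package. If that is not available, the standard library can read sitemaps too. Here is a minimal sketch on a hand-written two-entry sitemap (the `example.com` URLs are placeholders); `extract_urls` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemap protocol; the prefix "sm" is just a local alias
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hand-written demo sitemap, not the real SPIEGEL file
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/article-1</loc></url>
  <url><loc>https://example.com/article-2</loc></url>
</urlset>"""

def extract_urls(xml_text: str) -> list:
    """Collect the text of every <loc> inside a <url> element."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]

demo_urls = extract_urls(sample_sitemap)
print(demo_urls)
```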

We can now scrape the first 100 or so articles to see if it works:

In [None]:
to_scrape = article_urls[:100]
results = []

for i, url in enumerate(to_scrape):
    print(f"Retrieving {i + 1}/{len(to_scrape)}...")
    try:
        results.append(scrape(url))
    except Exception as e:
        print(f"Problem with URL {url}: {e}")
        continue  # skip this article and move on to the next one
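When scraping many pages, it is polite (and safer for you) to pause between requests and to keep track of what failed. The sketch below wraps that into a function; `scrape_all` and `fake_scrape` are hypothetical names, and the one-second default delay is an assumption, not a rule. The scraping function is passed in as an argument so the loop can be tried out without any network access.

```python
import time

def scrape_all(urls, scrape_fn, delay_seconds=1.0):
    """Apply scrape_fn to every URL, pausing between requests and collecting failures."""
    results, failed = [], []
    for i, url in enumerate(urls):
        try:
            results.append(scrape_fn(url))
        except Exception as e:
            failed.append((url, repr(e)))  # remember what went wrong and carry on
        if i < len(urls) - 1:
            time.sleep(delay_seconds)      # no need to wait after the last URL
    return results, failed

# Stand-in for a real scraping function, so this runs offline
def fake_scrape(url):
    if "bad" in url:
        raise ValueError("could not parse")
    return {"url": url}

ok, failed = scrape_all(
    ["https://example.com/a", "https://example.com/bad"],
    fake_scrape,
    delay_seconds=0,
)
print(ok)
print(failed)
```

With the functions from this notebook, the call would look like `results, failed = scrape_all(to_scrape, scrape)`.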

In [None]:
articles = pd.DataFrame(results)
articles.head()

You could now proceed to analyze this data as usual:

In [None]:
articles["paywalled"].value_counts()

In [None]:
pd.Series([kw for keyword in articles["keywords"] for kw in keyword.split(sep=", ")]).value_counts()
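The nested list comprehension works, but pandas can also do the splitting itself. A possible alternative, shown on three invented keyword strings:

```python
import pandas as pd

# Invented keyword strings in the same comma-separated style
demo = pd.DataFrame({"keywords": ["Politik, Deutschland", "Sport, Schach", "Politik, Wahl"]})

keyword_counts = (
    demo["keywords"]
    .str.split(", ")   # each cell becomes a list of keywords
    .explode()         # one row per keyword
    .value_counts()
)
print(keyword_counts)
```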

### Exercise

Find out how to scrape title, date & keywords for articles from the [ZEIT](https://www.zeit.de/) newspaper (bonus for author or paywall). You can use [this article](https://www.zeit.de/politik/deutschland/2024-11/spd-vorstand-nominiert-scholz-offiziell-als-kanzlerkandidaten) to experiment. *Tip*: Look for the `<meta>`-tags again. You may assume that all articles are structured the same.

## Bonus: Dynamic Webpages

This is where the true magic begins. Often, the data you may be interested in is not actually embedded anywhere in the page source code, but is loaded dynamically via a request the website makes to some form of backend/database.

Consider for example the dynamic map on [this website](https://interaktiv.waz.de/bundestagswahl-2021-umfragen-ergebnisse-wahlkarte/gemeinden-ergebnisse-1990-1994-1998-2002-2005-2009-2013-2017-2021.html). If we click on a municipality, we are shown the electoral results, but we can't find them anywhere in the page source code!

We will use a little trick: everything that is displayed on the webpage has to be loaded & sent from and to somewhere. Go to Right click > Inspect again, and go to the network tab:

![](assets/network_tab.png)

Now click on a municipality & look at what is being retrieved:

![](assets/network_traffic.png)

In fact, when a municipality on the map is clicked, the data about that municipality is sent to the site (and thus to you) in JSON format. If you click on the response in the network viewer, the JSON should open in your browser.

You can mimic the request the website makes to its backend to retrieve the data by simply requesting the URL yourself:

In [None]:
test = requests.get("https://interaktiv.morgenpost.de/data/wahl/gemeinden-2021/ags_12073572.json") # ask for some municipality
test_json = json.loads(test.content)
test_json

In fact, here we can find that the base URL https://interaktiv.morgenpost.de/data/wahl/gemeinden-2021/ just yields a list of *all* the municipalities with their JSON URLs:

In [None]:
all_muns_resp = requests.get("https://interaktiv.morgenpost.de/data/wahl/gemeinden-2021/")
all_muns = BeautifulSoup(all_muns_resp.content, "html.parser")

[a["href"] for a in all_muns.find_all("a")][:10]
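Links in such a listing are usually relative, so they need to be combined with the base URL before you can request them. A minimal sketch with `urllib.parse.urljoin`; the `hrefs` list is hand-written for illustration (only the first `.json` name appears above, the others are made up).

```python
from urllib.parse import urljoin

BASE_URL = "https://interaktiv.morgenpost.de/data/wahl/gemeinden-2021/"

# Hand-written hrefs in the style of a directory listing (assumed, not scraped)
hrefs = ["../", "ags_12073572.json", "ags_00000000.json", "readme.txt"]

# Keep only the JSON files and turn them into absolute URLs
json_urls = [urljoin(BASE_URL, href) for href in hrefs if href.endswith(".json")]
print(json_urls)
```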

Knowing that, we could now iteratively request all the data as we did for the news articles. Another approach to get at dynamically generated information is to use [Selenium](https://selenium-python.readthedocs.io/) to fire up a headless browser. There are good guides in text and video form online which you can look at, but in many cases using Selenium is a little like shooting ballistic missiles at sparrows; with a little detective work you can often find a much more efficient approach.

**NOTE:** This kind of mimicking/intercepting backend requests borders on the danger zone: you may (accidentally or not) access data you are not allowed to access.

## Concluding remarks

* Be careful with what you scrape.
* If you scrape the wrong content or too aggressively, you may incur bans (e.g. IP-bans).
* Many websites may actively block you if they think you are scraping. A good idea is to [modify the headers of your request](https://www.zenrows.com/blog/python-requests-user-agent#what-is) to spoof your user agent (for example pretend you are an iPhone, or a Google Chrome browser running on Windows).
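As a sketch of what modifying the headers looks like in `requests`: a `Session` lets you set headers once and reuse them for every request. The user agent string below is a placeholder chosen for this example; nothing is sent over the network here, the request is only prepared so you can inspect it.

```python
import requests

session = requests.Session()

# Placeholder value; replace it with whatever user agent string you want to send
session.headers.update({"User-Agent": "example-scraper/0.1"})

# Build the request without sending it, to see which headers would go out
prepared = session.prepare_request(requests.Request("GET", "https://books.toscrape.com/"))
print(prepared.headers["User-Agent"])
```

To actually send requests with these headers, use `session.get(url)` instead of `requests.get(url)`.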