# Web Scraping

Extracting data from websites programmatically.

We will talk about three kinds of scraping:

1. **APIs:** "Application programming interfaces"; some websites and services offer access to data in an already structured format via an API.
2. **Screenscraping:** Scraping from static websites (the information is in the page source code itself).
3. **Dynamic scraping:** Scraping from dynamic websites (information is dynamically loaded e.g. from a database & cannot be found in the source code itself).
    - *API-interception*: intercept/mimic calls to the backend
    - *Headless Browser*: "Zombie-Browser" that fakes user interaction to retrieve dynamic elements, e.g. using Selenium

In [137]:
import pandas as pd
import requests
import json

from bs4 import BeautifulSoup
from io import StringIO
from functools import reduce

## APIs

If you are lucky, the data is provided via an API that you can access programmatically. For example, the data from [Abgeordnetenwatch](https://www.abgeordnetenwatch.de/) ("monitoring" of German MPs) is provided via an API. Usually, an API has [documentation](https://www.abgeordnetenwatch.de/api) where you can see how to retrieve the data.

A common format to deliver the data is **JSON** (JavaScript Object Notation), which can easily be parsed in Python as a dictionary:

In [6]:
mps = requests.get("https://www.abgeordnetenwatch.de/api/v2/politicians").json()
mps

{'meta': {'abgeordnetenwatch_api': {'version': '2.7',
   'changelog': 'https://www.abgeordnetenwatch.de/api/version-changelog/aktuell',
   'licence': 'CC0 1.0',
   'licence_link': 'https://creativecommons.org/publicdomain/zero/1.0/deed.de',
   'documentation': 'https://www.abgeordnetenwatch.de/api/entitaeten/politician'},
  'status': 'ok',
  'status_message': '',
  'result': {'count': 100,
   'total': 33640,
   'range_start': 0,
   'range_end': 100}},
 'data': [{'id': 182362,
   'entity_type': 'politician',
   'label': 'Christopher Salm',
   'api_url': 'https://www.abgeordnetenwatch.de/api/v2/politicians/182362',
   'abgeordnetenwatch_url': 'https://www.abgeordnetenwatch.de/profile/christopher-salm',
   'first_name': 'Christopher',
   'last_name': 'Salm',
   'birth_name': None,
   'sex': 'm',
   'year_of_birth': None,
   'party': {'id': 2,
    'entity_type': 'party',
    'label': 'CDU',
    'api_url': 'https://www.abgeordnetenwatch.de/api/v2/parties/2'},
   'party_past': None,
   'educ
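The `meta` block above reports `count: 100` out of `total: 33640`, so one request only returns the first page of results. Below is a minimal sketch of how paging through such a response could look. `fetch_all`, `fetch_page` and `fake_page` are hypothetical names; the stand-in page function only imitates the `meta`/`data` layout shown above so the sketch runs offline.

```python
from typing import Callable


def fetch_all(fetch_page: Callable[[int], dict], max_pages: int = 1000) -> list:
    """Collect the `data` lists of consecutive pages until `total` is reached."""
    records = []
    start = 0
    for _ in range(max_pages):
        page = fetch_page(start)
        records.extend(page["data"])
        result = page["meta"]["result"]
        start += result["count"]
        # Stop on an empty page or once everything has been collected
        if result["count"] == 0 or start >= result["total"]:
            break
    return records


def fake_page(start: int, total: int = 250, size: int = 100) -> dict:
    """Offline stand-in that imitates the response layout (hypothetical data)."""
    ids = list(range(start, min(start + size, total)))
    return {
        "meta": {"result": {"count": len(ids), "total": total, "range_start": start}},
        "data": [{"id": i} for i in ids],
    }


all_records = fetch_all(fake_page)
print(len(all_records))
```

With the real API, `fetch_page` would wrap a `requests.get(...)` call that passes the offset as a query parameter. The parameter name is not shown in the output above, so look it up in the API documentation rather than guessing it.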

Data delivered as JSON is usually pretty straightforward to work with & to get into a `DataFrame`:

In [10]:
mps = pd.DataFrame(mps["data"])[["id", "first_name", "last_name", "year_of_birth", "occupation", "party"]]
mps.head()

Unnamed: 0,id,first_name,last_name,year_of_birth,occupation,party
0,182362,Christopher,Salm,,Rechtsanwalt,"{'id': 2, 'entity_type': 'party', 'label': 'CD..."
1,182361,Oliver,Skopec,1985.0,,"{'id': 229, 'entity_type': 'party', 'label': '..."
2,182360,Gunnar,Lehmann,1993.0,,"{'id': 229, 'entity_type': 'party', 'label': '..."
3,182359,Christian,Dorst,1970.0,,"{'id': 229, 'entity_type': 'party', 'label': '..."
4,182358,Simon,Reinhard,1951.0,,"{'id': 229, 'entity_type': 'party', 'label': '..."


...and since the data is delivered pre-structured, it should require only a little extra cleaning:

In [12]:
mps = mps.assign(party=[p["label"] for p in mps["party"]])
mps.head()

Unnamed: 0,id,first_name,last_name,year_of_birth,occupation,party
0,182362,Christopher,Salm,,Rechtsanwalt,CDU
1,182361,Oliver,Skopec,1985.0,,BSW
2,182360,Gunnar,Lehmann,1993.0,,BSW
3,182359,Christian,Dorst,1970.0,,BSW
4,182358,Simon,Reinhard,1951.0,,BSW
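As an alternative to unpacking nested dictionaries by hand, `pd.json_normalize` can flatten them into separate columns. The records below are a shortened stand-in for the API response, not the full data:

```python
import pandas as pd

# Shortened stand-in for mps_json["data"]; only a few fields are kept
records = [
    {"id": 182362, "last_name": "Salm", "party": {"id": 2, "label": "CDU"}},
    {"id": 182361, "last_name": "Skopec", "party": {"id": 229, "label": "BSW"}},
]

# Nested keys become dotted column names such as "party.label"
flat = pd.json_normalize(records)
print(flat[["id", "last_name", "party.label"]])
```

This is handy when several nested fields are needed at once; for a single field the list comprehension above is just as good.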


Some APIs require paying a one-time or subscription fee and/or require you to use authentication. In these cases, it's best to refer to the specific documentation.
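Authentication is often done by sending a key or token in a request header. The sketch below only *builds* such a request without sending it. The endpoint, the key and the `Bearer` scheme are assumptions for illustration; the actual scheme depends on the API in question.

```python
import requests

API_URL = "https://api.example.com/v1/items"  # hypothetical endpoint
API_KEY = "my-secret-key"  # hypothetical; in practice load it from an environment variable

request = requests.Request(
    "GET",
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},  # scheme is an assumption
    params={"limit": 10},  # hypothetical query parameter
)
prepared = request.prepare()

print(prepared.url)
print(prepared.headers["Authorization"])
```

Passing the same `headers` and `params` arguments to `requests.get` would send the request directly.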

## Screenscraping

![](assets/scraping_meme.png){width="40%"}

If data is not provided via an API, we have to parse it from the page source code ourselves. Websites are usually built from three code components:

* **HTML:** "HyperText Markup Language"; defines structure and content of the website
* **CSS:** "Cascading Style Sheets"; defines presentation and styling
* **JavaScript:** Programming language used to build interactive elements of websites (e.g. what happens when you click a button)

To see the source code of a web page, right-click it; most browsers then offer an option like "View page source". Alternatively, most browsers also support adding `view-source:` in front of the URL. Either way you are presented with the HTML code of the website, e.g. the [Wikipedia page for the ESC 2024](https://en.wikipedia.org/wiki/Eurovision_Song_Contest_2024) without makeup looks something like this:

![](assets/html.png)

This course is not about web development, so we will only talk about the absolute basic structure of HTML that you will need to parse information from it. HTML is structured in "tags", like `<p>` (paragraph), `<h1>` to `<h6>` (headings) or `<img>` (image). Most tags have to be opened and closed & often contain text or other elements, like this:

```
<p>This is a paragraph.</p>
```

Tags can also have attributes, e.g.:

```
<h2 class="vector-pinnable-header-label">Contents</h2>
```

These attributes can for example be used to make elements look a certain way (according to some style defined somewhere else in a CSS stylesheet, but looks are not of interest to us right now).
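To preview how tags and attributes look once parsed, here is a small sketch that feeds the two snippets above to BeautifulSoup:

```python
from bs4 import BeautifulSoup

html = '<h2 class="vector-pinnable-header-label">Contents</h2><p>This is a paragraph.</p>'
snippet = BeautifulSoup(html, "html.parser")

heading = snippet.find("h2")
print(heading.name)        # the tag name
print(heading.get_text())  # the text between opening and closing tag
print(heading["class"])    # attributes are accessed like dictionary keys
print(snippet.find("p").get_text())
```

Note that `class` comes back as a list, because an element can carry several classes at once.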

### Data from HTML-tables

Say we are interested in retrieving the [final results of the ESC](https://en.wikipedia.org/wiki/Eurovision_Song_Contest_2024#Final). There is a table with these results in the Wikipedia article:

![](assets/table.png)

The easiest way to find something of interest in the page source code is via a selector. Most browsers come with a right-click option like "Inspect" or "Inspect element", which has some form of selector feature:

![](assets/selector_menu.png)

Hovering over or clicking on elements of the page will now show where they are in the source code:

![](assets/selector_hover.png)

We now know we are looking for a `<table>` of a certain class & with a certain caption as its first "child" element. But how to actually get & parse the source code? To retrieve the page source of a website, use [requests](https://requests.readthedocs.io/en/latest/):

In [31]:
esc = requests.get("https://en.wikipedia.org/wiki/Eurovision_Song_Contest_2024")

Responses to a GET-request contain a status code that can tell you something about whether your request succeeded, and if it failed it might tell you why:

In [32]:
esc.status_code

200

Brief guide:

* `1XX`: Informational ("wait")
* `2XX`: Successful (maybe with caveats)
* `3XX`: Redirection ("go somewhere else")
* `4XX`: Client error (you f\*cked up)
* `5XX`: Server error (the site f\*cked up)
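Instead of inspecting the status code by eye, you can let `requests` raise an error for any `4XX`/`5XX` response via `raise_for_status()`. The sketch below builds two bare `Response` objects by hand so that it runs without network access; `describe` is a hypothetical helper name.

```python
import requests


def describe(response: requests.Response) -> str:
    """Return 'ok' for successful responses and a short message otherwise."""
    try:
        response.raise_for_status()  # raises HTTPError for 4XX and 5XX codes
    except requests.HTTPError as error:
        return f"failed: {error}"
    return "ok"


# Hand-made responses standing in for real ones
ok_response = requests.Response()
ok_response.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(describe(ok_response))
print(describe(missing))
```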

A response code of `200` means everything went fine. The actual page source is stored in the content of the response (looking at the first 100 characters):

In [33]:
esc.content[:100]

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la'

Instead of working with this as a string, we can parse it into a format that is easier to navigate and query using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/):

In [34]:
soup = BeautifulSoup(esc.content, "html.parser")

Let's look for our table. We can find the caption first:

In [69]:
soup.find(lambda tag: tag.name == "caption" and "Final of the Eurovision Song Contest 2024" in tag.get_text())

<caption>Final of the Eurovision Song Contest 2024<sup class="reference" id="cite_ref-Final_results_184-0"><a href="#cite_note-Final_results-184"><span class="cite-bracket">[</span>176<span class="cite-bracket">]</span></a></sup>
</caption>

We saw that the caption is inside the `<table>` tag, so we can find the table by getting the parent of that tag:

In [72]:
table = soup.find(lambda tag: tag.name == "caption" and "Final of the Eurovision Song Contest 2024" in tag.get_text()).parent
table

<table class="sortable wikitable plainrowheaders">
<caption>Final of the Eurovision Song Contest 2024<sup class="reference" id="cite_ref-Final_results_184-0"><a href="#cite_note-Final_results-184"><span class="cite-bracket">[</span>176<span class="cite-bracket">]</span></a></sup>
</caption>
<tbody><tr>
<th scope="col"><abbr title="Running order">R/O</abbr>
</th>
<th scope="col">Country
</th>
<th scope="col">Artist
</th>
<th scope="col">Song
</th>
<th scope="col">Points
</th>
<th scope="col">Place
</th></tr>
<tr>
<th scope="row" style="text-align:center;">1
</th>
<td><span class="nowrap"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="1000" data-file-width="1600" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/en/thumb/4/4c/Flag_of_Sweden.svg/23px-Flag_of_Sweden.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/4/4c/Flag_of_Sweden.svg/35px-Flag_of_Sweden.svg.png 1.5x, //upload.w

In [87]:
results = pd.read_html(StringIO(str(table)))[0] # [0] because read_html returns a list of DataFrames; StringIO because newer pandas versions expect a file-like object instead of a literal HTML string
results.head()

Unnamed: 0,R/O,Country,Artist,Song,Points,Place
0,1,Sweden,Marcus & Martinus,"""Unforgettable""",174,9
1,2,Ukraine,Alyona Alyona and Jerry Heil,"""Teresa & Maria""",453,3
2,3,Germany,Isaak,"""Always on the Run""",117,12
3,4,Luxembourg,Tali,"""Fighter""",103,13
4,5,Netherlands,Joost Klein,"""Europapa""",—,—
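The dash in the Netherlands row means `Points` and `Place` cannot be treated as numbers right away. A minimal sketch of one way to clean this up, using a small hand-made stand-in for the table and assuming the columns hold strings:

```python
import pandas as pd

DASH = "\u2014"  # the dash character used for missing values in the table

# Hand-made stand-in for a few rows of the scraped table
results_demo = pd.DataFrame({
    "Country": ["Sweden", "Ukraine", "Netherlands"],
    "Points": ["174", "453", DASH],
    "Place": ["9", "3", DASH],
})

for column in ["Points", "Place"]:
    # errors="coerce" turns anything that is not a number into NaN
    results_demo[column] = pd.to_numeric(results_demo[column], errors="coerce")

print(results_demo)
```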


### Non-tabular data

At this point a brief general note about scraping: web scraping is *in principle* legal, but you may accidentally cross legal boundaries, e.g. if you scrape restricted content. Many websites provide a `robots.txt` that explicitly states what you are allowed to access and what not, and that sometimes also points to sitemaps which make the website easier to navigate.
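Python's standard library can read `robots.txt` rules for you. The sketch below parses a made-up `robots.txt` (the rules and URLs are hypothetical) and asks whether two URLs may be fetched:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "https://www.example.com/articles/1"))
print(parser.can_fetch("*", "https://www.example.com/private/data"))
print(parser.site_maps())  # sitemap URLs listed in the file, if any
```

For a real website you would call `parser.set_url(".../robots.txt")` followed by `parser.read()` instead of `parse`.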

As an example of scraping non-tabular data, we will scrape metadata of news articles from the German news outlet [SPIEGEL](https://www.spiegel.de/). We are interested in the author, the time of publication, the news keywords & whether the article is paywalled. Let's start off with [this random article](https://www.spiegel.de/politik/bunker-plan-fuer-deutschland-behoerden-suchen-nach-intakten-schutzraeumen-a-f4745099-c5c1-45aa-b12b-fb0eab0ea7c7) & inspect the page source code.

Let's start by retrieving the page:

In [89]:
test_article = requests.get("https://www.spiegel.de/politik/bunker-plan-fuer-deutschland-behoerden-suchen-nach-intakten-schutzraeumen-a-f4745099-c5c1-45aa-b12b-fb0eab0ea7c7")
soup = BeautifulSoup(test_article.content, "html.parser")

Finding the title: the title is inside a single `<title>`-tag:

In [92]:
soup.find("title").text

'Bunker-Plan für Deutschland: Behörden suchen nach intakten Schutzräumen - DER SPIEGEL'

The date is inside a `<meta>`-tag:

In [97]:
soup.find("meta", {"name": "date"})

<meta content="2024-11-25T14:59:00+01:00" name="date"/>

We can see that the actual date is stored inside the `content`-attribute:

In [98]:
soup.find("meta", {"name": "date"})["content"]

'2024-11-25T14:59:00+01:00'

Keywords & author are stored in a similar format, so it might be smart to write a function:

In [None]:
def get_meta(soup: BeautifulSoup, name: str) -> str:
    return soup.find("meta", {"name": name})["content"]

print(f"Date: {get_meta(soup, 'date')}")
print(f"Author: {get_meta(soup, 'author')}") # Author = DER SPIEGEL means no dedicated author
print(f"Keywords: {get_meta(soup, 'news_keywords')}")

Date: 2024-11-25T14:59:00+01:00
Author: DER SPIEGEL
Keywords: Politik, Deutschland
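One caveat: `soup.find` returns `None` if a tag does not exist, so `get_meta` fails with a `TypeError` on pages that lack one of the `<meta>`-tags. A more forgiving variant could look like this (`get_meta_or_none` is a hypothetical name, and the HTML is a small stand-in):

```python
from typing import Optional

from bs4 import BeautifulSoup


def get_meta_or_none(soup: BeautifulSoup, name: str) -> Optional[str]:
    """Return the content of <meta name=...>, or None if the tag is missing."""
    tag = soup.find("meta", {"name": name})
    if tag is None:
        return None
    return tag.get("content")


demo = BeautifulSoup(
    '<head><meta name="date" content="2024-11-25T14:59:00+01:00"/></head>',
    "html.parser",
)
print(get_meta_or_none(demo, "date"))
print(get_meta_or_none(demo, "author"))
```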


Sometimes, you need to be a bit clever with how you parse information. E.g. the paywall-attribute is found inside an embedded JSON-string:

![](assets/embedded_json.png)

In [106]:
soup.find("script", {"type": "application/settings+json"}).text

'{"general":{"breakpoints":{"lg":{"min":1020},"md":{"max":1019,"min":720},"sm":{"max":719}},"cacheControl":{"breakingnews":{"sessionStorageMaxAge":900}},"consent":{"disabled":false,"globallyDisabled":false,"minUpdatedAt":1626213600,"utiqDisabled":false},"cookieDomains":["www.spiegel.de",".www.spiegel.de",".spiegel.de","abo.spiegel.de","sportdaten.spiegel.de","lotto.spiegel.de","akademie.spiegel.de","ed.spiegel.de"],"disableAdobeLaunch":false,"disableBookmarks":false,"disableBreakingnews":false,"disableSourcepoint":false,"domain":"spon","noAds":false,"noContentAds":false,"secondLevelDomain":"spiegel","subscriptions":{"metered":"Spmetered","noads":"Sppur","paid":["Spplus"]},"text":{"authorDetailsSuffix":{"removeWhitespaceBeforeCharacters":[",",".",":",";"]}},"topLevelDomain":"de","urls":{"assetsBasePath":"https://cdn.prod.www.spiegel.de","backofficeBaseUrl":"https://gruppenkonto.spiegel.de","base":"https://www.spiegel.de","offers":{"Spplus":"https://abo.spiegel.de/de/c/microsites/pl/stan

We can now parse this JSON as a dictionary:

In [109]:
app_json = json.loads(soup.find("script", {"type": "application/settings+json"}).text)
app_json["paywall"]["attributes"]["is_active"]

False
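The same idea on a tiny stand-in page, this time guarding against a missing script tag or missing keys. The HTML below is made up and only mirrors the keys used above:

```python
import json

from bs4 import BeautifulSoup

# Made-up page that mirrors the structure of the embedded settings JSON
html = """
<html><head>
<script type="application/settings+json">
{"paywall": {"attributes": {"is_active": true}}}
</script>
</head></html>
"""

demo = BeautifulSoup(html, "html.parser")
script = demo.find("script", {"type": "application/settings+json"})

# Fall back to an empty dict if the script tag is missing
settings = json.loads(script.string) if script is not None else {}

# .get() with defaults avoids a KeyError if the layout ever changes
is_paywalled = settings.get("paywall", {}).get("attributes", {}).get("is_active")
print(is_paywalled)
```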

Let's put all our scraping code in a function:

In [None]:
def get_article_data(soup: BeautifulSoup) -> pd.DataFrame:
    result_dict = {
        "title": soup.find("title").text,
        "date": get_meta(soup, 'date'),
        "author": get_meta(soup, 'author'),
        "keywords": get_meta(soup, 'news_keywords'),
        "paywalled": json.loads(soup.find("script", {"type": "application/settings+json"}).text)["paywall"]["attributes"]["is_active"]
    }
    return pd.DataFrame(result_dict, index=[0]) # index=[0] is needed because all values are scalars

get_article_data(soup)

Unnamed: 0,title,date,author,keywords,paywalled
0,Bunker-Plan für Deutschland: Behörden suchen n...,2024-11-25T14:59:00+01:00,DER SPIEGEL,"Politik, Deutschland",False


In [118]:
def scrape(url: str) -> pd.DataFrame:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    return get_article_data(soup)

scrape("https://www.spiegel.de/politik/bunker-plan-fuer-deutschland-behoerden-suchen-nach-intakten-schutzraeumen-a-f4745099-c5c1-45aa-b12b-fb0eab0ea7c7")

Unnamed: 0,title,date,author,keywords,paywalled
0,Bunker-Plan für Deutschland: Behörden suchen n...,2024-11-25T14:59:00+01:00,DER SPIEGEL,"Politik, Deutschland",False


Now for the cool part: because most articles on the news website are structured the same, you can use this function to retrieve data from other articles as well:

In [119]:
scrape("https://www.spiegel.de/sport/schach-wm-titelverteidiger-ding-liren-ueberrascht-mit-auftaktsieg-a-9cbc4765-d586-410d-8b97-471ad190f0f0")

Unnamed: 0,title,date,author,keywords,paywalled
0,Schach-WM: Titelverteidiger Ding Liren überras...,2024-11-25T14:49:00+01:00,DER SPIEGEL,"Sport, Schach",False


Many websites offer sitemaps that make it easier to navigate the content. For news websites, these often hold an archive of article URLs: https://www.spiegel.de/sitemaps/news-de.xml

In [126]:
resp = requests.get("https://www.spiegel.de/sitemaps/news-de.xml")
sitemap = BeautifulSoup(resp.content, "xml")

In [134]:
article_urls = [url.find("loc").text for url in sitemap.find_all("url")]
article_urls[:10]

['https://www.spiegel.de/panorama/jva-burg-lageplan-von-hochsicherheitsgefaengnis-geraet-an-insassen-a-de821f5c-1ce6-4824-a6f6-4a69c5b8562a',
 'https://www.spiegel.de/sport/fussball/manchester-city-und-pep-guardiola-ein-formtief-oder-der-anfang-vom-ende-der-guardiola-aera-a-0c1c3895-8f1b-456e-bd73-e620a539d0b0',
 'https://www.spiegel.de/sport/fussball/fussball-bundesliga-der-tv-rechtepoker-geht-in-die-naechste-runde-a-c636c6a4-caa5-4b3b-91a5-0a19b349567a',
 'https://www.spiegel.de/panorama/taegliches-quiz-beim-spiegel-7-fragen-zum-allgemeinwissen-pro-tag-a-8a9692b2-4462-4192-942c-fd7809c7519c',
 'https://www.spiegel.de/politik/deutschland/angela-merkel-die-meisterin-der-politischen-zurechtstutzung-kolumne-a-b2ddf658-ccfd-47ad-8413-6960366ea713',
 'https://www.spiegel.de/panorama/tag-gegen-gewalt-an-frauen-sexistische-gewalt-zu-beenden-ist-maenneraufgabe-a-fef80383-e928-4763-928d-3828f3f7014a',
 'https://www.spiegel.de/politik/bunker-plan-fuer-deutschland-behoerden-suchen-nach-intakten-
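BeautifulSoup's `"xml"` parser needs the `lxml` package. If that is not installed, the standard library can do the same job. The sketch below uses a made-up sitemap and assumes the namespace of the standard sitemap protocol:

```python
import xml.etree.ElementTree as ET

# Made-up sitemap for illustration
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/a</loc></url>
  <url><loc>https://www.example.com/b</loc></url>
</urlset>
"""

# ElementTree needs the namespace spelled out when searching for tags
namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml.encode("utf-8"))
demo_urls = [loc.text for loc in root.findall("sm:url/sm:loc", namespace)]
print(demo_urls)
```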

We can now scrape the first 100 or so articles to see if it works:

In [135]:
to_scrape = article_urls[:100]
results = []

for i, url in enumerate(to_scrape):
    print(f"Retrieving {i + 1}/{len(to_scrape)}...")
    try:
        results.append(scrape(url))
    except Exception as e:
        print(f"Problem with URL {url}: {e}")
        continue

Retrieving 1/100...
Retrieving 2/100...
Retrieving 3/100...
Retrieving 4/100...
Retrieving 5/100...
Retrieving 6/100...
Retrieving 7/100...
Retrieving 8/100...
Retrieving 9/100...
Retrieving 10/100...
Retrieving 11/100...
Retrieving 12/100...
Retrieving 13/100...
Retrieving 14/100...
Retrieving 15/100...
Retrieving 16/100...
Retrieving 17/100...
Retrieving 18/100...
Retrieving 19/100...
Retrieving 20/100...
Retrieving 21/100...
Retrieving 22/100...
Retrieving 23/100...
Retrieving 24/100...
Retrieving 25/100...
Retrieving 26/100...
Retrieving 27/100...
Retrieving 28/100...
Retrieving 29/100...
Retrieving 30/100...
Retrieving 31/100...
Retrieving 32/100...
Retrieving 33/100...
Retrieving 34/100...
Retrieving 35/100...
Retrieving 36/100...
Retrieving 37/100...
Retrieving 38/100...
Retrieving 39/100...
Retrieving 40/100...
Retrieving 41/100...
Retrieving 42/100...
Retrieving 43/100...
Retrieving 44/100...
Retrieving 45/100...
Retrieving 46/100...
Retrieving 47/100...
Retrieving 48/100...
R
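When looping over many URLs it is polite to pause between requests, and useful to remember which URLs failed. A minimal sketch of such a loop; `scrape_politely` and `fake_scrape` are hypothetical names, and the fake scraper stands in for `scrape` so the example runs offline:

```python
import time
from typing import Callable, Iterable


def scrape_politely(
    urls: Iterable[str],
    scrape_one: Callable[[str], object],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Scrape URLs one by one, pausing between requests and collecting failures."""
    results, failed = [], []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # wait before every request except the first
        try:
            results.append(scrape_one(url))
        except Exception as error:
            failed.append((url, repr(error)))
    return results, failed


def fake_scrape(url: str) -> dict:
    """Offline stand-in for a real scraping function."""
    if "broken" in url:
        raise ValueError("missing <meta> tag")
    return {"url": url}


demo_urls = [
    "https://www.example.com/1",
    "https://www.example.com/broken",
    "https://www.example.com/2",
]
ok, failed = scrape_politely(demo_urls, fake_scrape, delay_seconds=0.0)
print(len(ok), failed)
```

The one-second default is an arbitrary choice; how long to wait depends on the website.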

In [138]:
articles = pd.concat(results) # concat accepts the whole list at once; every row keeps its original index 0
articles.head()

Unnamed: 0,title,date,author,keywords,paywalled
0,JVA Burg: Lageplan von Hochsicherheitsgefängni...,2024-11-25T15:38:00+01:00,DER SPIEGEL,"Panorama, Justiz, Sachsen-Anhalt",False
0,Manchester City und Pep Guardiola: Ein Formtie...,2024-11-25T15:34:00+01:00,"Danial Montazeri, DER SPIEGEL","Sport, Fußball-News, Manchester City, Premier ...",True
0,Fußball-Bundesliga: Der TV-Rechtepoker geht in...,2024-11-25T15:17:00+01:00,"Peter Ahrens, DER SPIEGEL","Sport, Fußball-News, Fußball-Bundesliga, DFL",False
0,Tägliches Quiz beim SPIEGEL: 7 Fragen zum Allg...,2024-11-25T15:13:00+01:00,DER SPIEGEL,Panorama,False
0,Angela Merkel: Die Meisterin der politischen Z...,2024-11-25T15:13:00+01:00,"Nikolaus Blome, DER SPIEGEL","Politik, Deutschland, Angela Merkel, Jetzt ers...",False


You could now proceed to analyze this data as usual:

In [139]:
articles["paywalled"].value_counts()

paywalled
False    70
True     30
Name: count, dtype: int64

In [149]:
pd.Series([kw for keyword in articles["keywords"] for kw in keyword.split(sep=", ")]).value_counts()

Panorama                        21
Deutschland                     16
Wirtschaft                      14
Politik                         14
Ausland                         14
                                ..
Mercedes-Benz                    1
Volkswagen                       1
Waffenhandel                     1
Frauen in Führungspositionen     1
Matthias Miersch                 1
Name: count, Length: 286, dtype: int64
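The same count can also be written with pandas string methods, which some find easier to read than the nested comprehension. A sketch on a small hand-made stand-in for `articles`:

```python
import pandas as pd

# Hand-made stand-in for the scraped articles
demo_articles = pd.DataFrame({
    "keywords": ["Politik, Deutschland", "Sport, Schach", "Politik, Ausland"],
})

keyword_counts = (
    demo_articles["keywords"]
    .str.split(", ")  # one list of keywords per article
    .explode()        # one row per keyword
    .value_counts()
)
print(keyword_counts)
```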

### Exercise

Find out how to scrape title, date & keywords for articles from the [ZEIT](https://www.zeit.de/) newspaper (bonus for author or paywall). You can use [this article](https://www.zeit.de/politik/deutschland/2024-11/spd-vorstand-nominiert-scholz-offiziell-als-kanzlerkandidaten) to experiment. *Tip*: Look for the `<meta>`-tags again. You may assume that all articles are structured the same.

## Bonus: Dynamic Webpages

This is where the true magic begins. Often, the data you may be interested in is not actually embedded anywhere in the page source code, but is loaded dynamically via a request the website makes to some form of backend/database.

Consider for example the dynamic map on [this website](https://interaktiv.waz.de/bundestagswahl-2021-umfragen-ergebnisse-wahlkarte/gemeinden-ergebnisse-1990-1994-1998-2002-2005-2009-2013-2017-2021.html). If we click on a municipality, we are shown the electoral results, but we can't find them anywhere in the page source code!

We will use a little trick: everything that is displayed on the webpage has to be loaded from somewhere. Right-click > Inspect again, and go to the network tab:

![](assets/network_tab.png)

Now click on a municipality & look at what is being retrieved:

![](assets/network_traffic.png)

In fact, when a municipality on the map is clicked, the data about that municipality is sent to your browser in JSON format. If you click on the response in the network viewer, the JSON should open in your browser.

You can mimic the request the website makes to its backend to retrieve the data by simply requesting the URL yourself:

In [151]:
test = requests.get("https://interaktiv.morgenpost.de/data/wahl/gemeinden-2021/ags_12073572.json") # ask for some municipality
test_json = json.loads(test.content)
test_json

{'wahlkreis': 'Uckermark – Barnim I',
 'meta': [{'key': 'berechtigte', 'year': 1990, 'value': 14007},
  {'key': 'berechtigte', 'year': 1994, 'value': 13814},
  {'key': 'berechtigte', 'year': 1998, 'value': 14403},
  {'key': 'berechtigte', 'year': 2002, 'value': 14751},
  {'key': 'berechtigte', 'year': 2005, 'value': 14540},
  {'key': 'berechtigte', 'year': 2009, 'value': 14212},
  {'key': 'berechtigte', 'year': 2013, 'value': 13799},
  {'key': 'berechtigte', 'year': 2017, 'value': 13505},
  {'key': 'berechtigte', 'year': 2021, 'value': 13241},
  {'key': 'gueltige', 'year': 1990, 'value': 10018},
  {'key': 'gueltige', 'year': 1994, 'value': 9039},
  {'key': 'gueltige', 'year': 1998, 'value': 10218},
  {'key': 'gueltige', 'year': 2002, 'value': 9520},
  {'key': 'gueltige', 'year': 2005, 'value': 10717},
  {'key': 'gueltige', 'year': 2009, 'value': 8957},
  {'key': 'gueltige', 'year': 2013, 'value': 9045},
  {'key': 'gueltige', 'year': 2017, 'value': 9433},
  {'key': 'gueltige', 'year': 2
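The `meta` entries come in "long" format (one row per key and year). A short sketch of reshaping them into one row per year, using a few of the values shown above; `valid_share` is a made-up column name for valid votes divided by eligible voters:

```python
import pandas as pd

# A few of the entries from the response above
meta = [
    {"key": "berechtigte", "year": 1990, "value": 14007},
    {"key": "berechtigte", "year": 1994, "value": 13814},
    {"key": "gueltige", "year": 1990, "value": 10018},
    {"key": "gueltige", "year": 1994, "value": 9039},
]

# One row per year, one column per key
wide = pd.DataFrame(meta).pivot(index="year", columns="key", values="value")
wide["valid_share"] = wide["gueltige"] / wide["berechtigte"]
print(wide)
```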

In fact, here we can find that the base URL https://interaktiv.morgenpost.de/data/wahl/gemeinden-2021/ just yields a list of *all* the municipalities with their JSON URLs:

In [156]:
all_muns_resp = requests.get("https://interaktiv.morgenpost.de/data/wahl/gemeinden-2021/")
all_muns = BeautifulSoup(all_muns_resp.content, "html.parser")

[a["href"] for a in all_muns.find_all("a")][:10]

['../',
 'ags_01001000.json',
 'ags_01002000.json',
 'ags_01003000.json',
 'ags_01004000.json',
 'ags_01051001.json',
 'ags_01051002.json',
 'ags_01051003.json',
 'ags_01051004.json',
 'ags_01051005.json']
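The links in this listing are relative, so they have to be combined with the base URL before they can be requested. A small sketch using `urljoin` on the first few entries from above:

```python
from urllib.parse import urljoin

base_url = "https://interaktiv.morgenpost.de/data/wahl/gemeinden-2021/"
hrefs = ["../", "ags_01001000.json", "ags_01002000.json"]

# Keep only the JSON files and turn them into absolute URLs
json_urls = [urljoin(base_url, href) for href in hrefs if href.endswith(".json")]
print(json_urls)
```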

Knowing that, we could now iteratively request all the data, as we did for the news articles. Another approach to get at dynamically generated information is to use [Selenium](https://selenium-python.readthedocs.io/) to fire up a headless browser. There are good guides in text and video form online, but in many cases using Selenium is a little like shooting ballistic missiles at sparrows; with a little detective work you can often find a much more efficient approach.

**NOTE:** This kind of mimicking/intercepting backend requests borders on the danger zone: you may - accidentally or not - access data you are not allowed to.

## Concluding remarks

* Be careful with what you scrape.
* If you scrape the wrong content or too aggressively, you may incur bans (e.g. IP-bans)
* Many websites may actively block you if they think you are scraping. A good idea is to [modify the headers of your request](https://www.zenrows.com/blog/python-requests-user-agent#what-is) to spoof your user agent (for example pretend you are an iPhone, or a Google Chrome browser running on Windows).
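Mechanically, changing the headers of your requests is simple. The sketch below sets a `User-Agent` on a `requests.Session` so that every request made through it carries that header. The header value and the URL are placeholders; the request is only prepared, not sent.

```python
import requests

session = requests.Session()
# Placeholder value: put whatever user agent string you want to send here
session.headers.update({"User-Agent": "my-course-scraper/0.1 (contact: me@example.com)"})

# Prepare (but do not send) a request to inspect the headers it would carry
prepared = session.prepare_request(requests.Request("GET", "https://www.example.com/"))
print(prepared.headers["User-Agent"])
```

Calling `session.get(url)` afterwards sends requests with the same headers.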