# How to crawl and extract data from Web pages

This notebook is broken into three sections:
- Crawling traditional websites using requests
- Crawling JavaScript websites using Playwright
- Extracting data from web pages using BeautifulSoup

Regardless of how you crawl a website, please remember three things:
- Obey the Robots Exclusion Protocol rules found in the robots.txt file at the root of nearly every domain: https://domain-name/robots.txt
- Put a delay of at least 3 seconds between requests.
- Respect copyrights.
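
The first two rules can be checked in code before any crawling starts. Below is a minimal sketch using the standard library's `urllib.robotparser`. The robots.txt text, the `my-crawler` user agent, and the `example.com` URLs are made up for illustration; a real crawler would download the file from the site's root first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
user_agent = "my-crawler"  # hypothetical user agent name

# Ask whether the rules allow this user agent to fetch each URL
allowed = parser.can_fetch(user_agent, "https://example.com/study/page")
blocked = parser.can_fetch(user_agent, "https://example.com/private/page")

# Use the site's requested delay when there is one, but never go below 3 seconds
delay = max(3, parser.crawl_delay(user_agent) or 0)

print(allowed, blocked, delay)
```

Taking the larger of the site's `Crawl-delay` and our own 3-second minimum keeps the crawler polite even when robots.txt does not specify a delay.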

In [1]:
%load_ext autoreload
%autoreload 2
%load_ext dotenv
%dotenv

## Crawling traditional websites using requests

The Python requests library makes HTTP(S) requests to fetch web pages, just as your browser does.

We'll start at a "root" page and use BeautifulSoup to find links to additional pages to crawl. We'll learn more about BeautifulSoup later.

Let's use requests to crawl general conference talks.

In [2]:
import json
import os
import requests
import time
from typing import Optional
from typing import Tuple
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

In [3]:
year = 2024
month = '04'
host = 'https://www.churchofjesuschrist.org'
base_dir = 'data/raw'
bs_parser = 'html.parser'
delay_seconds = 5

# create the output directory if it doesn't already exist
os.makedirs(base_dir, exist_ok=True)

In [5]:
def _is_talk_url(url):
    """A talk URL has 6 components (first component is empty) and last component does not end in -session."""
    path_components = urlparse(url).path.split('/')
    return len(path_components) == 6 and not path_components[-1].endswith('-session')


def get_talk_urls(base_url, html):
    """Find all talk URLs on the page."""
    soup = BeautifulSoup(html, bs_parser)
    return [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)
            if _is_talk_url(urljoin(base_url, a['href']))]


def get_talk_path(url):
    """Return the file path for saving the talk."""
    path_components = urlparse(url).path.split('/')
    year, month, title = path_components[3:6]
    return os.path.join(base_dir, f"{year}-{month}-{title}.json")

# This function uses type hints (similar to TypeScript annotations) so your IDE can flag incorrect arguments.
def get_page(
    url: str,
    delay_seconds: int = 30,
    headers: Optional[dict[str, str]] = None,
    encoding: str = "utf-8",
    timeout: int = 30,
) -> Tuple[int, str]:
    """Get page from url."""
    if headers is None:
        # make the request look like it comes from Chrome
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",  # noqa: B950
            "Accept-Encoding": "gzip, deflate",  # gzip, deflate, br, zstd
            "Accept-Language": "en-US,en;q=0.9",
            "Cache-Control": "no-cache",
            "Cookie": "TAsessionID=8f51490c-5611-45b6-9847-8585037a0e1b|NEW; notice_behavior=implied|us; gpv_Page=general-conference%7C2024%7C04%7C11oaks; gpv_cURL=www.churchofjesuschrist.org%2Fstudy%2Fgeneral-conference%2F2024%2F04%2F11oaks; s_ips=838; s_tp=1517; s_ppv=general-conference%257C2024%257C04%257C11oaks%2C55%2C55%2C55%2C838%2C1%2C1; AMCVS_66C5485451E56AAE0A490D45%40AdobeOrg=1; AMCV_66C5485451E56AAE0A490D45%40AdobeOrg=179643557%7CMCIDTS%7C19909%7CMCMID%7C88116570082250571280802679967931299750%7CMCAAMLH-1720711596%7C9%7CMCAAMB-1720711596%7C6G1ynYcLPuiQxYZrsz_pkqfLG9yMXBpb2zX5dvJdYQJzPXImdj0y%7CMCOPTOUT-1720113996s%7CNONE%7CvVersion%7C5.5.0; s_cc=true; s_plt=1.01; s_pltp=general-conference%7C2024%7C04%7C11oaks; adcloud={%22_les_v%22:%22c%2Cy%2Cchurchofjesuschrist.org%2C1720108596%22}; at_check=true; mbox=session#6bb5efff4aea494c8e2e9c7d3469ab29#1720108658|PC#6bb5efff4aea494c8e2e9c7d3469ab29.35_0#1783351598",
            "Pragma": "no-cache",
            "Priority": "u=0, i",
            "Sec-Ch-Ua": '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
            "Sec-Ch-Ua-Mobile": "?0",
            "Sec-Ch-Ua-Platform": '"macOS"',
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "same-origin",
            "Sec-Fetch-User": "?1",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",  # noqa: B950
        }
    # make the request
    response = requests.get(url, headers=headers, timeout=timeout)
    # wait before returning so consecutive requests are spaced out
    time.sleep(delay_seconds)
    if encoding:
        response.encoding = encoding
    return response.status_code, response.text


def save_page(path: str, url: str, html: str, encoding: str = "utf-8") -> None:
    """Save page url and html to path."""
    page_info = {
        "url": url,
        "html": html,
    }
    with open(path, "w", encoding=encoding) as f:
        json.dump(page_info, f, ensure_ascii=False, indent=2)
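
To see why `_is_talk_url` looks for six path components, it may help to split a few URLs by hand. The sketch below uses only `urllib.parse`. The talk link matches one crawled in this notebook, while the session link is an illustrative guess at what a session page URL looks like.

```python
from urllib.parse import urljoin, urlparse

base_url = "https://www.churchofjesuschrist.org/study/general-conference/2024/04?lang=eng"

# Relative links as they might appear in href attributes on the root page
talk_href = "/study/general-conference/2024/04/13holland?lang=eng"
session_href = "/study/general-conference/2024/04/saturday-morning-session?lang=eng"  # illustrative

talk_url = urljoin(base_url, talk_href)
session_url = urljoin(base_url, session_href)

# The path starts with "/", so the first component after splitting is empty
talk_components = urlparse(talk_url).path.split("/")
session_components = urlparse(session_url).path.split("/")
root_components = urlparse(base_url).path.split("/")

print(talk_components)
print(len(root_components), len(talk_components), len(session_components))
print(session_components[-1].endswith("-session"))
```

The root page has only five components, so it is skipped. The session page has six but ends in `-session`, so it is skipped too. Components 3 through 5 of the talk URL are the year, month, and title that `get_talk_path` uses to build the file name.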

In [6]:
dir_url = f"{host}/study/general-conference/{year}/{month}?lang=eng"
# get the root page
status_code, dir_html = get_page(dir_url, delay_seconds)
if status_code != 200:
    print(f"Status code={status_code} url={dir_url}")

In [7]:
# get all of the talk URLs from the conference root
talk_urls = get_talk_urls(dir_url, dir_html)
print(dir_url, len(talk_urls))

https://www.churchofjesuschrist.org/study/general-conference/2024/04?lang=eng 68


In [8]:
# fetch each talk
for ix, talk_url in enumerate(talk_urls):
    path = get_talk_path(talk_url)
    # don't re-crawl if you've already crawled
    if os.path.exists(path):
        continue
    print("    ", path)
    status_code, talk_html = get_page(talk_url, delay_seconds)
    if status_code != 200:
        print(f"Status code={status_code} url={talk_url}")
        continue
    save_page(path, talk_url, talk_html)
    # stop early so this demo only fetches the first dozen talks
    if ix > 10:
        break

     data/raw/2024-04-11oaks.json
     data/raw/2024-04-12larson.json
     data/raw/2024-04-13holland.json
     data/raw/2024-04-14dennis.json
     data/raw/2024-04-15dushku.json
     data/raw/2024-04-16soares.json
     data/raw/2024-04-17gerard.json
     data/raw/2024-04-18eyring.json
     data/raw/2024-04-21bednar.json
     data/raw/2024-04-22de-feo.json
     data/raw/2024-04-23nielson.json
     data/raw/2024-04-24alonso.json


## Crawling JavaScript websites using Playwright

For some websites, the page the server returns is just an HTML skeleton plus some JavaScript. The browser has to execute the JavaScript to populate the full HTML.
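
One rough way to spot such a page is to look at the raw HTML the server returns: if it has scripts but no links, a plain HTTP fetch probably isn't enough. Here is a minimal sketch using the standard library's `html.parser`. The skeleton HTML and the "no links but some scripts" rule are assumptions for illustration, not a reliable detector.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Count <a href=...> links and <script> tags in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.links = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.links += 1
        elif tag == "script":
            self.scripts += 1


# Hypothetical raw HTML for a JavaScript-driven page: no content yet,
# just an empty container that a script will fill in.
SKELETON_HTML = """
<html><head><script src="/app.js"></script></head>
<body><div id="root"></div></body></html>
"""

counter = LinkCounter()
counter.feed(SKELETON_HTML)

# Heuristic (an assumption): scripts but no links suggests a skeleton page
looks_like_skeleton = counter.links == 0 and counter.scripts > 0
print(counter.links, counter.scripts, looks_like_skeleton)
```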

The Playwright library lets you control a browser from your python program: https://playwright.dev/python/docs/library

Playwright has its own locators for finding elements on a page, and they accept XPath expressions. So instead of using BeautifulSoup, we'll use Playwright with XPaths to find links to additional pages to crawl. I will show you how to come up with the XPaths by inspecting a web page.

Let's use Playwright to crawl game forums.

In [10]:
import os
import time
from urllib.parse import urljoin, urlparse

from playwright.async_api import async_playwright

In [11]:
url = "https://boardgamegeek.com/boardgame/410201/wyrmspan/forums/66"
base_dir = 'data/raw'
delay_seconds = 5

# create the output directory if it doesn't already exist
os.makedirs(base_dir, exist_ok=True)

In [12]:
def get_post_path(url):
    """Return the file path for saving the forum post."""
    path_components = urlparse(url).path.split('/')
    thread, title = path_components[2:4]
    return os.path.join(base_dir, f"{thread}-{title}.json")

In [13]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

### Get all post links

In [15]:
# Playwright can use XPaths to identify elements on the page.
# To find the XPath of an element in Chrome: right-click the element and
# select Inspect, then right-click the highlighted node in the Elements tab
# and select Copy, then Copy full XPath.
post_xpath = '/html/body/div[2]/main/div[2]/div/div[1]/div[2]/ng-include/div/div/ui-view/ui-view/div/div/div[2]/forums-module/div/div[2]/div[2]/div/forum-threads/ul/li/div[2]/a'
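
A full XPath copied from the browser is tied to the exact page layout, so it can break when the site changes. A shorter relative XPath anchored on a distinctive element is often sturdier. The sketch below shows the idea offline with the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. The HTML fragment, the `threads` class name, and the shorter expression are assumptions for illustration, not the real BoardGameGeek markup.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment standing in for a forum listing page
DOC = """
<html><body>
  <div><main>
    <ul class="threads">
      <li><div>icon</div><div><a href="/thread/1/first-post">First</a></div></li>
      <li><div>icon</div><div><a href="/thread/2/second-post">Second</a></div></li>
    </ul>
  </main></div>
</body></html>
""".strip()

root = ET.fromstring(DOC)  # root is the <html> element

# Layout-dependent path that spells out every ancestor
full_matches = root.findall("./body/div/main/ul/li/div[2]/a")

# Shorter path anchored on the list's class attribute
short_matches = root.findall(".//ul[@class='threads']/li/div[2]/a")

hrefs = [a.get("href") for a in short_matches]
print(hrefs)
```

Both expressions select the same two links here, but only the second keeps working if another wrapper `div` is added around `main`.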


In [16]:
post_links = []
page_id = 1
page_url = page.url
while True:
    page_url = f'{url}?pageid={page_id}'
    print(page_url)
    await page.goto(page_url)
    await page.wait_for_load_state()
    time.sleep(delay_seconds)
    if page.url != page_url:
        break
    for elm in await page.locator("xpath="+post_xpath).element_handles():
        post_url = urljoin(page_url, await elm.get_attribute("href"))
        post_links.append(post_url)
    page_id += 1
        
print(len(post_links), len(set(post_links)))

https://boardgamegeek.com/boardgame/410201/wyrmspan/forums/66?pageid=1
https://boardgamegeek.com/boardgame/410201/wyrmspan/forums/66?pageid=2
https://boardgamegeek.com/boardgame/410201/wyrmspan/forums/66?pageid=3
https://boardgamegeek.com/boardgame/410201/wyrmspan/forums/66?pageid=4
132 132
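
The two counts match here, so every link is unique. If they ever differ, the duplicates can be dropped while keeping the original order. A minimal sketch, using a made-up list with one repeated link:

```python
# Hypothetical list of collected links with one duplicate
links = [
    "https://boardgamegeek.com/thread/3328660/rules-card-117",
    "https://boardgamegeek.com/thread/3324470/pay-with-cache-resources",
    "https://boardgamegeek.com/thread/3328660/rules-card-117",
]

# dict keys are unique and keep insertion order, so this removes
# duplicates without reordering the remaining links
unique_links = list(dict.fromkeys(links))

print(len(links), len(unique_links))
```

Unlike `set(links)`, this preserves the order in which the posts appeared on the forum pages.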


### Get the html for each post

In [17]:
for ix, post_link in enumerate(post_links):
    print(ix, post_link)
    path = get_post_path(post_link)
    await page.goto(post_link)
    await page.wait_for_load_state()
    time.sleep(delay_seconds)
    html = await page.content()
    print(len(html))
    save_page(path, post_link, html)
    # stop early so this demo only fetches the first dozen posts
    if ix > 10:
        break

0 https://boardgamegeek.com/thread/3328660/rules-card-117
352084
1 https://boardgamegeek.com/thread/3326902/if-in-later-rounds-i-still-have-coins-but-no-color
430734
2 https://boardgamegeek.com/thread/3324470/pay-with-cache-resources
478858
3 https://boardgamegeek.com/thread/3323492/caching-cards-scoring
355209
4 https://boardgamegeek.com/thread/3322869/play-cave-and-dragon-cards-only-from-hand
350544
5 https://boardgamegeek.com/thread/3322734/when-is-a-cave-considered-full
356373
6 https://boardgamegeek.com/thread/3320451/placing-dragon-on-another-dragon
354486
7 https://boardgamegeek.com/thread/3319935/vp-for-orthogonally-adjacent-dragons
352058
8 https://boardgamegeek.com/thread/3319882/two-different-benefits
353048
9 https://boardgamegeek.com/thread/3319383/question-about-having-9-coins-at-the-end-of-your-t
476179
10 https://boardgamegeek.com/thread/3311620/dragon-71-multicolor-flyer-how-does-it-score-point
447284
11 https://boardgamegeek.com/thread/3310522/another-timing-question-

In [18]:
await browser.close()
await playwright.stop()

## Extracting data from web pages using BeautifulSoup

We'll use BeautifulSoup to extract data from a conference talk and use markdownify to convert HTML to markdown.

In [19]:
import re
from typing import Any, cast

from bs4 import BeautifulSoup, Tag
from markdownify import MarkdownConverter 

In [20]:
path = 'data/raw/2024-04-13holland.json'
bs_parser = 'html.parser'

In [21]:
def clean(text: Any) -> str:
    """Convert text to a string and clean it."""
    if text is None:
        return ""
    if isinstance(text, Tag):
        text = text.get_text()
    if not isinstance(text, str):
        text = str(text)
    # Replace non-breaking and hair spaces with normal spaces; drop zero-width spaces
    text = text.replace("\u00a0", " ").replace("\u200b", "").replace("\u200a", " ")
    # Collapse runs of blank lines into a single blank line
    text = re.sub(r"(\n\s*)+\n", "\n\n", text)
    # Remove trailing spaces at the end of each line
    text = re.sub(r" +\n", "\n", text)
    # Remove surrounding whitespace
    return cast(str, text.strip())
    
class ConferenceMarkdownConverter(MarkdownConverter):  # type: ignore
    """Create a custom MarkdownConverter."""

    def __init__(self, **kwargs: Any):
        """Initialize custom MarkdownConverter."""
        super().__init__(**kwargs)
        self.base_url = kwargs.get("base_url", "")

    def convert_a(self, el, text, convert_as_inline):  # type: ignore
        """Join hrefs with a base url."""
        if "href" in el.attrs:
            el["href"] = urljoin(self.base_url, el["href"])
        return super().convert_a(el, text, convert_as_inline)

    def convert_p(self, el, text, convert_as_inline):  # type: ignore
        """Add anchor tags to paragraphs with ids."""
        if el.has_attr("id") and len(el["id"]) > 0:
            _id = el["id"]
            text = f'<a name="{_id}"></a>{text}'  # noqa: B907
        return super().convert_p(el, text, convert_as_inline)

# Create shorthand method for custom conversion
def _to_markdown(html: str, **options: Any) -> str:
    """Convert html to markdown."""
    return cast(str, ConferenceMarkdownConverter(**options).convert(html))
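
To see what the whitespace handling in `clean` does, here is a small standalone sketch that applies the same replacements and regular expressions to a made-up string. The function name `normalize_whitespace` and the sample text are hypothetical; it needs only the standard library, so it can be tried without BeautifulSoup.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Apply the same whitespace cleanup steps as clean() to a plain string."""
    # Non-breaking and hair spaces become normal spaces; zero-width spaces vanish
    text = text.replace("\u00a0", " ").replace("\u200b", "").replace("\u200a", " ")
    # Runs of blank (or whitespace-only) lines collapse to one blank line
    text = re.sub(r"(\n\s*)+\n", "\n\n", text)
    # Trailing spaces before a newline are removed
    text = re.sub(r" +\n", "\n", text)
    return text.strip()


sample = "  Title\u00a0One \n\n\n   \nBody\u200b text  \nEnd\n"
cleaned = normalize_whitespace(sample)
print(repr(cleaned))
```

The three blank lines collapse to a single blank line, the non-breaking space becomes a normal space, and the zero-width space and trailing spaces disappear.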

In [22]:
# read conference talk file
with open(path, encoding="utf8") as f:
    data = json.load(f)
print(data['url'], len(data['html']))

https://www.churchofjesuschrist.org/study/general-conference/2024/04/13holland?lang=eng 198757


In [23]:
url = data['url']
html = data['html']
soup = BeautifulSoup(html, bs_parser)
title = clean(soup.select_one("article header h1").get_text())
author = clean(soup.select_one("article p.author-name").get_text())
author_role = clean(soup.select_one("article p.author-role").get_text())
body = soup.select_one("article div.body-block")
markdown = clean(_to_markdown(str(body), base_url=url, heading_style="ATX", strip=["script", "style"]))

print('Title:', title)
print('Author:', author)
print('Author role:', author_role)
print()
print(markdown[:1024])

Title: Motions of a Hidden Fire
Author: By President Jeffrey R. Holland
Author role: Acting President of the Quorum of the Twelve Apostles

<a name="p4"></a>Brothers and sisters, I have learned a painful lesson since I last occupied this pulpit in October of 2022. That lesson is: if you don’t give an acceptable talk, you can be banned for the next several conferences. You can see I am assigned early in the first session of this one. What you can’t see is that I am positioned on a trapdoor with a very delicate latch. If this talk doesn’t go well, I won’t see you for another few conferences.

<a name="p5"></a>In the spirit of that beautiful hymn with this beautiful choir, I *have* learned some lessons recently that, with the Lord’s help, I wish to share with you today. That will make this a very personal talk.

<a name="p6"></a>The most personal and painful of all these recent experiences has been the passing of my beloved wife, Pat. She *was* the greatest woman I have ever known—a perfe