# Web Scraping Links and Retrieving IDs from FineWeb

In this section, we are going to scrape links to pages that are potentially high-quality educational content using the scraping libraries **BeautifulSoup4** and **Selenium**. Both strategies will follow the same basic procedure:

1. Visit the target website. Find the pages where the target content is indexed.
2. Iterate through the index pages, and collect links that match the appropriate schema.
3. Once we have the links, match them with the FineWeb dataframe to find the appropriate IDs.

## Setup

You will need to install these libraries and import these packages:

In [None]:
# pip install --upgrade beautifulsoup4 selenium polars tqdm requests

In [22]:
import bs4
import os
from pathlib import Path
import polars as pl
import re
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from tqdm.notebook import tqdm
from typing import List
from urllib.parse import urlparse

## Scraping with BeautifulSoup4

BeautifulSoup4 is an easy-to-use scraping library that simplifies the process of parsing, navigating, and extracting data from HTML and XML documents.

To use this library effectively, you need to develop a strategy tailored to the website you are trying to scrape. This notebook provides a procedure that generalizes well to simple websites with an index page, but its effectiveness varies from site to site.

Crucially, this strategy assumes that the target website has an **index page**, where the contents of the site are mapped out.

Let's start by visiting the index page of [Oshiete](https://oshiete.goo.ne.jp/watch/pro/?pg=2).

The important thing to note is the structure of the URL. From this website, we only want pages that are flagged as **Expert**. On Oshiete, **Expert** pages are indexed separately, denoted in the path by the term **pro**, and the page number is passed in the **pg** query parameter.

oshiete.goo.ne.jp/watch/**pro**/?pg={page_number}

Scroll to the bottom, and you will see that there are just 22 pages. So, we can make a list of index pages to crawl with a simple list comprehension:

In [5]:
urls = [f'https://oshiete.goo.ne.jp/watch/pro/?pg={i}' for i in range(1, 23)]
urls[:10]

['https://oshiete.goo.ne.jp/watch/pro/?pg=1',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=2',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=3',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=4',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=5',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=6',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=7',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=8',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=9',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=10']

From here, you should click some of the links from the list comprehension, and confirm that they land on the target pages.

The next code cell defines a function that, given a URL, returns all the links on that page.

In [5]:
def scrape_links_with_bs4(
    url : str, 
    pattern: str="") -> List[str]:
    """
    Scrapes links from a target page using the BeautifulSoup library.

    Args:
        url (str): The url from which to scrape links.
        pattern (str, optional): If provided, regular expressions will be used to filter the links collected from the target page.

    Returns:
        links (list): A list of links scraped from the target page.
    """
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise HTTPError for bad responses
        
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        links = []

        # Find all <a> tags and extract their href attributes
        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']
            if pattern:
                #Use regular expressions to filter links with the target pattern.
                if re.search(pattern, href):
                    links.append(href)
            else:
                links.append(href)

        return links
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return []

Using this script, let's scrape one of the URLs and see what we get.

In [17]:
scrape_links_with_bs4(urls[0])[:10]

['//oshiete.goo.ne.jp/',
 'javascript:void 0;',
 'javascript:void 0;',
 '//oshiete.goo.ne.jp/',
 '//oshiete.goo.ne.jp/articles/qa/',
 '//oshiete.goo.ne.jp/category/list/',
 'javascript:void 0;',
 'javascript:void 0;',
 'javascript:void 0;',
 'javascript:void 0;']

The scraper does its job, but most of the links are not useful. Using the **pattern** argument, we can filter the links with a regular expression.

To find a pattern, click on one of the links you are trying to scrape. Let's look at an example:

https://oshiete.goo.ne.jp/watch/entry/89078c9390f15a9fae58d085c8091e8d/

The pattern is in the path:

**watch/entry/{foo}**

We can target that pattern with this regex:

r"watch/entry/.+"

- **r** signals that the contents of the string are to be treated as raw text
- **watch/entry/** matches the target path that follows the domain
- **.+** matches one or more of any character
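
As a quick sanity check, the pattern can be tried on a handful of hrefs before running the scraper. This is a minimal sketch using only the standard library; the sample hrefs are taken from the scraper outputs shown in this notebook.

```python
import re

# Pattern from the text above, written as a raw string.
pattern = r"watch/entry/.+"

# A few hrefs in the forms returned by the scraper.
hrefs = [
    "//oshiete.goo.ne.jp/",
    "javascript:void 0;",
    "//oshiete.goo.ne.jp/category/list/",
    "/watch/entry/c4f4a51672ca7851d0e8dcff780801f8/",
]

# re.search looks for the pattern anywhere in the string.
matches = [h for h in hrefs if re.search(pattern, h)]
print(matches)
```

Because **.+** needs at least one character, the bare path `watch/entry/` with nothing after it would not match.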

Let's try scraping links with that regex:

In [10]:
scrape_links_with_bs4(
    url=urls[0],
    pattern='watch/entry/.+'
)[:10]

['/watch/entry/c4f4a51672ca7851d0e8dcff780801f8/',
 '/watch/entry/c4f4a51672ca7851d0e8dcff780801f8/',
 '/watch/entry/4b1b58c2fce51cc54e9333783ff469dc/',
 '/watch/entry/4b1b58c2fce51cc54e9333783ff469dc/',
 '/watch/entry/7e5c41b9790a3945d253b26cf818e696/',
 '/watch/entry/7e5c41b9790a3945d253b26cf818e696/',
 '/watch/entry/53fa05ab24009c8454469ee8fcf75427/',
 '/watch/entry/53fa05ab24009c8454469ee8fcf75427/',
 '/watch/entry/72412ec44e50a29d28ec7d34a03d70f5/',
 '/watch/entry/72412ec44e50a29d28ec7d34a03d70f5/']

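Much better. If you combine any one of those paths with the base, oshiete.goo.ne.jp, it will take you to an article that is flagged as **Expert**. Note that each link appears twice in the output; the duplicates are removed when the links are saved.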

The manner by which links are encoded varies from site to site, so you need to do some experimentation before you start scraping.
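
One way to turn relative or protocol-relative hrefs into full URLs is `urljoin` from the standard library. The sketch below assumes the hrefs were scraped from the first index page and reuses values from the outputs above; it also drops duplicates while keeping the original order.

```python
from urllib.parse import urljoin

# The index page the links were scraped from.
index_url = "https://oshiete.goo.ne.jp/watch/pro/?pg=1"

# Sample hrefs in the forms seen in the outputs above.
hrefs = [
    "/watch/entry/c4f4a51672ca7851d0e8dcff780801f8/",
    "/watch/entry/c4f4a51672ca7851d0e8dcff780801f8/",
    "//oshiete.goo.ne.jp/articles/qa/",
]

# Resolve each href against the index page, then drop duplicates.
# dict.fromkeys keeps the first occurrence of each key, in order.
absolute = list(dict.fromkeys(urljoin(index_url, h) for h in hrefs))
print(absolute)
```

The crawler in this notebook does not need this step, because the IDs are matched on the path alone, but it is handy when you want to open the scraped pages directly.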

Now, we have a list of URLs to scrape, and we have a method for scraping the exact links that we want. Let's set up a crawler.

In [9]:
def save_list_to_txt(
        _list: List[str],
        path_to_output: str
) -> None:
    """
    Given a list and a file path, this function saves the list to a text file.
    Items in the list are separated by newlines.

    Args:
        _list (list): List of strings.
        path_to_output: Path where the list is to be saved.
    """
    #As the crawl iterates across pages, scraped links are merged into a text file.
    if os.path.exists(path_to_output):
        with open(path_to_output, 'r', encoding='utf-8') as f:

            #Combines new and old contents without modifying the caller's list.
            old_list = f.read().split('\n')
            _list = _list + old_list
    
    #Eliminates duplicates and empty lines.
    _list = list(set(_list) - {''})

    #Writes the list to a text file, separated by newlines.
    with open(path_to_output, 'w', encoding='utf-8') as f:
        f.write('\n'.join(_list))

def crawl(
        scrape_method,
        urls : List[str],
        pattern: str,
        path_to_output: str
) -> None:
    """
    Given a list of URLs, this function will iterate across pages, scrape links, and save the list to a .txt file.

    Args:
        scrape_method (function): Identifies which library to use for scraping.
        urls (List[str]): List of URLs to scrape.
        pattern (str): Regex used to filter URLs.
        path_to_output (str): Path to save the list of links.
    """
    
    #If needed, creates a directory to save links.
    os.makedirs(os.path.dirname(path_to_output), exist_ok=True)
    
    #Iterates across URLs, scrapes links, and saves them.
    for url in tqdm(urls, desc = 'Crawling and scraping links'):
        links = scrape_method(url, pattern)
        if links:
            save_list_to_txt(links, path_to_output)
    
    #Upon completion, states how many links were scraped.
    with open(path_to_output, 'r', encoding ='utf-8') as f:
        links = f.read().split('\n')
        print(f"Crawl complete. {len(links)} links scraped")
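
To see how saving incrementally behaves across several pages, here is a standalone sketch of the same merge-and-deduplicate idea. `merge_lines` is a hypothetical helper written for this illustration, not the notebook's function, and it sorts the result so that the file contents are stable between runs.

```python
import os
import tempfile

def merge_lines(new_items, path):
    """Hypothetical helper: merge new_items with the lines already stored in path."""
    existing = []
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            existing = f.read().splitlines()
    # Drop empty strings and duplicates; sort for a stable file.
    merged = sorted({item for item in existing + list(new_items) if item})
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged))
    return merged

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "links.txt")
    # Two "pages" that share one link.
    first = merge_lines(["/watch/entry/a/", "/watch/entry/b/"], path)
    second = merge_lines(["/watch/entry/b/", "/watch/entry/c/"], path)
    print(second)
```

Writing after every page means an interrupted crawl keeps whatever was collected so far, at the cost of rereading the file each time.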

Now that we have defined methods for crawling and saving links, let's try crawling.

In [10]:
crawl(
    scrape_method= scrape_links_with_bs4,
    urls = [f'https://oshiete.goo.ne.jp/watch/pro/?pg={i}' for i in range(1, 23)],
    pattern = 'watch/entry/.+',
    path_to_output = 'links/oshiete.txt'
)

Crawling and scraping links:   0%|          | 0/22 [00:00<?, ?it/s]

Crawl complete. 769 links scraped


## Retrieve IDs from FineWeb

Now that we have a list of links, it's time to query FineWeb. If a link is present, we will extract the ID of the matching record.

There is a lot of variation in how links are presented on-page, so we need to define a function to normalize the links.

We will normalize the links by extracting the path. So, whether the link is:
- https://www.domain.tld/this/is/the/path
- http://www.domain.tld/this/is/the/path
- www.domain.tld/this/is/the/path/

The function will return:

- **/this/is/the/path**

Anything that does not start with a scheme or **www.**, such as domain.tld/this/is/the/path/ or a relative path like the ones scraped from Oshiete, is returned as-is, minus the terminal '/'.

In [13]:
def extract_path(url: str) -> str:
    """
    Normalizes a given url by extracting the path and removing the terminal '/'.

    Args:
        url (str): The URL or path string.

    Returns:
        str: The extracted path, or '/' if a full URL has no path.
    """

    #Removes the terminal '/'
    if url[-1] == '/':
        url = url[:-1]

    #Checks if the input is a full URL. If so, extracts the path.
    if re.match(r'^(https?://|www\.)', url):
        #urlparse only recognizes the host after a scheme or '//', so '//' is added to bare 'www.' links.
        if url.startswith('www.'):
            url = '//' + url
        parsed_url = urlparse(url)
        return parsed_url.path if parsed_url.path else '/'

    #If not a full URL, simply returns the input.
    return url
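
When normalizing scraped links, it helps to know that `urlparse` only recognizes a host when the string starts with a scheme or with `//`. A quick check, using the placeholder domain from the list above:

```python
from urllib.parse import urlparse

samples = [
    "https://www.domain.tld/this/is/the/path",
    "//www.domain.tld/this/is/the/path",
    "www.domain.tld/this/is/the/path",
]

# Map each sample to the path component that urlparse reports.
paths = {s: urlparse(s).path for s in samples}
for s, p in paths.items():
    print(f"{s!r} -> {p!r}")
```

The first two samples yield the path on its own, while the third is treated entirely as a path because nothing marks `www.domain.tld` as the host.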

Next, we will define a function to combine our list of links into a regex for efficient querying.

In [18]:
def combine_links_into_regex(links: List[str]) -> str:
    """Given a list of strings, this will produce a regex for querying FineWeb."""
    
    return '|'.join(map(re.escape, links))
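
The escaping step matters because characters such as `.` have a special meaning in a regex. The sketch below shows the effect with Python's own `re` module and made-up entry paths; the notebook itself hands the pattern to polars, which uses a different regex engine, so treat this only as an illustration of the idea.

```python
import re

def combine(links):
    # Same idea as combine_links_into_regex: escape each link, then join with '|'.
    return "|".join(map(re.escape, links))

# Hypothetical paths, one of which contains a '.'.
pattern = combine(["/watch/entry/a.b", "/watch/entry/c"])

urls = [
    "https://oshiete.goo.ne.jp/watch/entry/a.b/",
    "https://oshiete.goo.ne.jp/watch/entry/aXb/",
    "https://oshiete.goo.ne.jp/watch/entry/c/",
]

# Only URLs containing one of the literal paths should match.
hits = [u for u in urls if re.search(pattern, u)]
print(hits)
```

Without `re.escape`, the `.` in the first path would match any character, and the `aXb` URL would be picked up as well.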

Now, we will define a function to query a single shard of FineWeb, and then test it out.

In [19]:
def get_ids(
        df: pl.DataFrame,
        domain: str,
        links: List[str]
) -> List[str]:
    """
    Given a polars DataFrame, filter rows that match one of the target links, and return a list of IDs.

    Args:
        df (pl.DataFrame): DataFrame to filter.
        domain (str): Target domain.
        links (List[str]): List of promising links from the target domain.

    Returns:
        List of IDs from FineWeb that align with the target links.
    """

    #Normalize the links, skipping empty strings (an empty alternative in the regex would match every URL).
    links = [extract_path(link) for link in links if link]

    #Combine the links.
    pattern = combine_links_into_regex(links)

    #Filter the DataFrame for rows that contain the target domain.
    filtered = df.filter(df['domain'].str.contains(domain))

    #From the filtered DataFrame, keep rows whose URL matches one of the target links.
    filtered = filtered.filter(filtered['url'].str.contains(pattern, literal=False))

    #Return a list of IDs from the filtered rows.
    return filtered['id'].to_list()


#Now, let's test the function.

domain = 'oshiete'

with open('links/oshiete.txt', 'r', encoding='utf-8') as f:
    links = f.read().split('\n')

data_dir = 'preprocess/stripped'
df = pl.read_parquet(Path(data_dir, os.listdir(data_dir)[0]))

get_ids(
    df=df,
    domain=domain,
    links=links
)

['<urn:uuid:9f04ee84-c012-4947-93a6-7d545ed8ff89>',
 '<urn:uuid:cd26f36c-88e2-434b-a592-7536c1640aea>',
 '<urn:uuid:9263920e-e830-4331-bf21-2b1b465bd4eb>']

Finally, let's define a pipeline that will iterate through FineWeb files, extract IDs, and save them.

In [20]:
def id_retrieval_pipeline(
        data_dir: str,
        domain: str,
        links: List[str],
        path_to_output: str
)-> None:
    """
    Given a domain, a list of links, and a directory where files are saved,
    this pipeline will retrieve IDs from a series of FineWeb files.

    Args:
        data_dir: Path to directory where FineWeb files are saved.
        domain: Target domain for ID extraction.
        links: List of target links for ID extraction.
        path_to_output: Path to save list of IDs.
    """
    #If necessary, makes a directory to save the list of IDs.
    os.makedirs(os.path.dirname(path_to_output), exist_ok=True)

    #Build a list of filepaths.
    paths = [Path(data_dir, file) for file in os.listdir(data_dir)]

    #Iterates through files, extracts IDs, and saves them.
    for path in tqdm(paths, desc = "Finding IDs"):
        df = pl.read_parquet(path)
        ids = get_ids(df, domain, links)
        save_list_to_txt(ids, path_to_output)
    
    #Completes the process by printing the number of IDs matched.
    with open(path_to_output, 'r', encoding='utf-8') as f:
        ids = f.read().split('\n')
        print(f"IDs extracted. {len(ids)} found.")

Let's test it out.

In [21]:
id_retrieval_pipeline(
    data_dir='preprocess/stripped',
    domain='oshiete',
    links=links,
    path_to_output='ids/oshiete.txt'
)

Finding IDs:   0%|          | 0/148 [00:00<?, ?it/s]

IDs extracted. 341 found.


## Scraping with Selenium

Selenium is a more flexible library because, unlike BeautifulSoup4, it drives a real browser and can interact with web pages as a normal user would. Therefore, it can deal with things like lazy loading and JavaScript.

In the case of **Qiita**, the article retrieval system is scripted in JavaScript, so BeautifulSoup4 is ineffective.

Although Selenium is more flexible, it is also much slower. Nevertheless, it is quite easy to use.

Just like before, let's start by visiting the **Qiita** index [page](https://qiita.com/search).

In the search bar, click on the icon on the right, and it provides a list of search parameters. We are going to use these ones:
- **tag**: 初心者 (beginner)
- **created**: <=2024-04-24 (the date of the most recent article in Japanese FineWeb)
- **sort**: by Likes, on the assumption that articles with more Likes are higher quality

With the tag and sort parameters plugged in, the URL looks like this (the **created** filter is added to the query in the crawl code further down):

https://qiita.com/search?q=tag%3A%E5%88%9D%E5%BF%83%E8%80%85&sort=like&stocked=&page=1
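
The percent-encoded part of the URL does not have to be copied from the browser; it can be rebuilt with `urllib.parse.quote`. This sketch assumes the query text below is what is typed into the search bar, with the date taken from the crawl URLs defined later in this notebook.

```python
from urllib.parse import quote

# Search query as it would be typed into the Qiita search bar.
query = "tag:初心者 created:<=2024-04-25"

# Percent-encode everything except unreserved characters (safe="" also encodes '/').
encoded = quote(query, safe="")

url = f"https://qiita.com/search?q={encoded}&sort=like&stocked=&page=1"
print(url)
```

Building the query this way makes it easier to swap in a different tag or date without hand-editing the encoded string.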

Let's start by defining a function to scrape URLs with Selenium:

In [29]:
def scrape_links_with_selenium(
    url: str,
    pattern: str,
    driver: object = None
    )-> List[str]:
    """
    Given a URL, scrape links using Selenium.

    Args:
        url (str): URL from which to scrape links.
        pattern (str): Regex to match target links.
        driver (object): Web browser for scraping links. Defaults to Chrome.
    
    Returns:
        List of links that matches the input regex.
    """
    output = []
    #Only close the browser at the end if this function opened it.
    owns_driver = driver is None
    if owns_driver:
        driver = webdriver.Chrome()

    try:
        driver.get(url)

        # Locate all links on the page using the <a> tag
        links = driver.find_elements(By.TAG_NAME, "a")
        # Extract the href attribute of each link
        for link in links:
            href = link.get_attribute("href")
            if href:  # Ensure href is not None
                if not pattern or re.search(pattern, href):
                    output.append(href)
    finally:
        if owns_driver:
            driver.quit()

    # Return the collected links without duplicates
    return list(set(output))

From here, the strategy is the same. Prepare a list of links to crawl and pick out a pattern for your target pages.

In [28]:
#Prepare a list of URLs to crawl.
urls = [f"https://qiita.com/search?q=tag%3A%E5%88%9D%E5%BF%83%E8%80%85%20created%3A%3C%3D2024-04-25&sort=like&stocked=&page={i}"
        for i in range(1, 101)]

#Define your regex. The dot is escaped so that it only matches a literal '.'.
pattern = r"\.com/.+/items/.+$"

#Test out the function.
scrape_links_with_selenium(
    url=urls[0],
    pattern=pattern
)[:10]

['https://qiita.com/bo_zu_/items/88f45b132c8293dcd9b1',
 'https://qiita.com/shimajiri/items/501828dc8d589e214470',
 'https://qiita.com/JunyaShibato/items/3aa5f7f3fc991de17f3f',
 'https://qiita.com/0xfffffff7/items/028ff8c920a6a8c67dc5',
 'https://qiita.com/jesus_isao/items/63557eba36819faa4ad9',
 'https://qiita.com/zamis/items/703bfcea027a70c1cec6',
 'https://qiita.com/yasuoyasuo/items/c43783316a4d141a140f',
 'https://qiita.com/suzu-4/items/ea5d802cb0ad16682ae2',
 'https://qiita.com/YudaiTsukamoto/items/42a8df22ca4c6b327dfd',
 'https://qiita.com/soyanchu/items/d1cb9785fc211941a009']

Everything works, so let's crawl. This is going to take quite a bit longer.

In [31]:
crawl(
    scrape_method=scrape_links_with_selenium,
    urls = urls,
    pattern = pattern,
    path_to_output='links/qiita.txt'
)

Crawling and scraping links:   0%|          | 0/100 [00:00<?, ?it/s]

Crawl complete. 2007 links scraped


Finally, let's extract IDs.

In [32]:
data_dir='preprocess/stripped'
domain='qiita'
with open('links/qiita.txt', 'r', encoding='utf-8') as f:
    links = f.read().split('\n')
path_to_output = 'ids/qiita.txt'

id_retrieval_pipeline(
    data_dir=data_dir,
    domain=domain,
    links=links,
    path_to_output=path_to_output
)

Finding IDs:   0%|          | 0/148 [00:00<?, ?it/s]

IDs extracted. 585 found.


## Conclusion

With the list of IDs in hand, we can now prepare a sample for annotation that prioritizes documents with higher educational density. While this methodology requires significant effort, it is far less labor-intensive than data annotation itself.

Data annotation is a monotonous task, and the attention of our annotators is a valuable resource. By implementing strategies for pre-screening data, we can optimize this process and achieve greater efficiency in the long run.
