# Web Scraping Links and Retrieving IDs from FineWeb

In this section, we are going to scrape links that are potentially high-quality educational content using the scraping libraries **BeautifulSoup4** and **Selenium**. Both strategies will follow the same basic procedure:

1. Visit the target website. Find the pages where the target content is indexed.
2. Iterate through the index pages, and collect links that match the appropriate schema.
3. Once we have the links, match them with the FineWeb dataframe to find the appropriate IDs.

In the previous notebook, **preprocess.ipynb**, the Japanese segment of FineWeb was grouped by domain and combined in a single DataFrame. From that point, it was very easy to figure out which websites have contributed a lot of data.
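As a reminder of what "grouping by domain" buys us, here is a minimal sketch that counts pages per domain using only the standard library. It assumes the page URLs are available as a plain list of strings; the three sample URLs and the helper name **count_domains** are illustrative and are not part of **fineweb_tools**.

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative sample only; in practice the URLs would come from the FineWeb data.
sample_urls = [
    "https://oshiete.goo.ne.jp/watch/entry/89078c9390f15a9fae58d085c8091e8d/",
    "https://oshiete.goo.ne.jp/watch/entry/4786777415b37917a79acfa64327b629/",
    "https://qiita.com/U-MA/items/896c49d46585e32ff7b1",
]


def count_domains(urls):
    """Count how many URLs belong to each network location (domain)."""
    return Counter(urlparse(u).netloc for u in urls)


domain_counts = count_domains(sample_urls)

# Domains that contribute the most pages come first.
print(domain_counts.most_common(2))
```

Sorting by count is what makes heavy contributors such as Oshiete and Qiita stand out as scraping candidates.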

This notebook will demonstrate how to scrape two websites:

- **Oshiete** is a QA forum. This website stood out because it has articles that are marked as [expert](https://oshiete.goo.ne.jp/watch/entry/32a0ba00f95212ae379a105f4a09ed36/), which are overall higher quality. I think this will be a good source of Minimally and Basic Educational material on some general topics.
- **Qiita** is a popular Japanese developers' [forum](https://qiita.com/U-MA/items/896c49d46585e32ff7b1). This seems to be a good source of Basic and Good Educational material within the domain of programming and web development.

## Setup

If you haven't already, please install the library **fineweb_tools**. All of the necessary dependencies will be installed along with it.

In [None]:
# pip install fineweb_tools

## Scraping with BeautifulSoup4

BeautifulSoup4 is an easy-to-use scraping library that simplifies the process of parsing, navigating, and extracting data from HTML and XML documents.

To use this library effectively, you need to develop a strategy that is specialized for the website you are trying to scrape. This notebook provides a procedure that generalizes well to simple websites with an index page, but its effectiveness varies from case to case.

Crucially, this strategy assumes that the target website has an **index page**, where the contents of the site are mapped out.

Let's start by visiting the index page of [Oshiete](https://oshiete.goo.ne.jp/watch/pro/?pg=2).

The important thing to note is the structure of the URL. From this website, we only want pages that are flagged as **Expert**. On Oshiete, **Expert** pages are indexed separately, denoted in the path by the term **pro**, and the page number is passed in the **pg** query parameter:

oshiete.goo.ne.jp/watch/**pro**/?pg={page_number}

Scroll to the bottom, and you will see that there are just 22 pages. So, we can make a list of index pages to crawl with a simple list comprehension:

In [2]:
urls = [f'https://oshiete.goo.ne.jp/watch/pro/?pg={i}' for i in range(1, 23)]
urls[:5]

['https://oshiete.goo.ne.jp/watch/pro/?pg=1',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=2',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=3',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=4',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=5']

Before going further, open a few of the generated URLs and confirm that they land on the target pages.

Let's try scraping one of these pages with the function **scrape_links**. In this demonstration, we will pass only the required arguments:
- **method**: Which library to use for scraping, either **bs4** or **selenium**.
- **url**: The URL from which to scrape links.

In [5]:
from fineweb_tools.webscrape import scrape_links

scrape_links(
    method='bs4',
    url=urls[0]
)[:5]

['/',
 '/login?callback_action=login_or_signup&redirect_to=%2Fsearch%3Fq%3Dtag%253A%25E5%2588%259D%25E5%25BF%2583%25E8%2580%2585%2520created%253A%253C%253D2024-04-25%26sort%3Dlike%26stocked%3D%26page%3D1&realm=qiita',
 '/signup?callback_action=login_or_signup&redirect_to=%2Fsearch%3Fq%3Dtag%253A%25E5%2588%259D%25E5%25BF%2583%25E8%2580%2585%2520created%253A%253C%253D2024-04-25%26sort%3Dlike%26stocked%3D%26page%3D1&realm=qiita',
 '/',
 '/question-feed']

The scraper does its job, but most of the links are not useful. Using the argument **regex**, we can filter the links with a regular expression.

To formulate a **regex**, click on one of the links you are trying to scrape. Let's look at an example:

https://oshiete.goo.ne.jp/watch/entry/89078c9390f15a9fae58d085c8091e8d/

The pattern is in the path:

**watch/entry/{foo}**

We can target that pattern with this regex:

**r"watch/entry/.+"**

- **r** marks the string as a raw string, so backslashes are not interpreted as escape characters
- **watch/entry/** matches the literal path segment that follows the domain name
- **.+** matches one or more of any character (except a newline)
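To check how the pattern behaves before pointing it at a live page, it can be tried against a handful of hrefs with Python's built-in **re** module. This is a minimal sketch: the hrefs are copied from the outputs in this notebook, and the use of **re.search** is an assumption, since the way **scrape_links** applies the regex internally is not shown here.

```python
import re

pattern = re.compile(r"watch/entry/.+")

hrefs = [
    "/",
    "/question-feed",
    "/watch/entry/4786777415b37917a79acfa64327b629/",
    "/watch/entry/53fa05ab24009c8454469ee8fcf75427/?from=entry_side_new",
]

# re.search scans the whole string, so the pattern may match anywhere in the href.
matches = [h for h in hrefs if pattern.search(h)]
print(matches)
```

Only the last two hrefs survive the filter. Note that the query string on the final href is kept, because **.+** accepts any trailing characters.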

Let's try scraping links with that regex:

In [5]:
scrape_links(
    method='bs4',
    url=urls[0],
    regex=r'watch/entry/.+'
)[:5]

['/watch/entry/53fa05ab24009c8454469ee8fcf75427/?from=entry_side_new',
 '/watch/entry/4786777415b37917a79acfa64327b629/',
 '/watch/entry/997ca7db83e4da7314d3fc491fc3167a/',
 '/watch/entry/5a038db44f5e70d6a23fbf814bace574/?from=entry_side_rank',
 '/watch/entry/72412ec44e50a29d28ec7d34a03d70f5/']

Much better. If you combine any one of those paths with the base, oshiete.goo.ne.jp, it will take you to an article that is flagged as **Expert**.

The manner by which links are encoded varies from site to site, so you need to do some experimentation before you start scraping.
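For relative paths like the ones above, one way to build full URLs is **urllib.parse.urljoin**. The sketch below also drops tracking parameters such as **?from=entry_side_new**, on the assumption that a clean URL is easier to match later. Whether **fineweb_tools** performs this kind of normalization internally is not stated in this notebook, and **to_absolute** is a hypothetical helper.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE = "https://oshiete.goo.ne.jp"


def to_absolute(href, base=BASE):
    """Resolve a relative href against the base and drop query string and fragment."""
    parts = urlsplit(urljoin(base, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


absolute = to_absolute("/watch/entry/53fa05ab24009c8454469ee8fcf75427/?from=entry_side_new")
print(absolute)
```

Links that are already absolute pass through **urljoin** unchanged, so the same helper works for sites that encode full URLs in their hrefs.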

Next, let's try crawling. The function **crawl** takes these arguments:
- **method**: Scraping library, **bs4** or **selenium**.
- **urls**: A list of URLs to scrape.
- **output_path**: The path to save your list of scraped links.
- **regex**: The regular expression used to filter scraped links.

In [None]:
from fineweb_tools.webscrape import crawl

crawl(
    method='bs4',
    urls=urls,
    output_path='links/oshiete.txt',
    regex='watch/entry/.+'
)

Crawling and scraping links:   0%|          | 0/22 [00:00<?, ?it/s]

Crawl complete. 774 links scraped
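Conceptually, a crawl of this kind is a loop over the index pages that collects matching links and removes duplicates. The following is a rough standard-library sketch of that idea, not the implementation of **crawl**: it uses **html.parser** instead of BeautifulSoup4, takes a **fetch** callable in place of real HTTP requests so that it runs offline, and the names **LinkCollector** and **crawl_sketch** are hypothetical.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl_sketch(urls, fetch, regex):
    """Fetch each index page, keep hrefs matching regex, de-duplicate in order."""
    pattern = re.compile(regex)
    seen = {}  # dict keys preserve insertion order and drop duplicates
    for url in urls:
        parser = LinkCollector()
        parser.feed(fetch(url))
        for href in parser.links:
            if pattern.search(href):
                seen.setdefault(href, None)
    return list(seen)


# Stand-in pages so the sketch needs no network access.
fake_pages = {
    "page1": '<a href="/">Home</a><a href="/watch/entry/aaa/">A</a>',
    "page2": '<a href="/watch/entry/aaa/">A</a><a href="/watch/entry/bbb/">B</a>',
}

links = crawl_sketch(["page1", "page2"], fake_pages.get, r"watch/entry/.+")
print(links)
```

De-duplication matters on index pages because sidebars (for example "new" and "ranking" widgets) tend to repeat the same articles on every page.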


## Retrieve IDs from FineWeb

Now that we have a list of links, let's find out if they are contained in FineWeb and extract the corresponding IDs.

The **id_retrieval_pipeline** takes four arguments:
- **data_dir**: Path to FineWeb data. I recommend that you use the **stripped data** from preprocessing because it is more efficient. The pipeline works fine on raw data too.
- **domain**: The domain name from which you scraped links. Used for initial filtering.
- **links_path**: Path to the text file where links are saved.
- **output_path**: Path to where the list of IDs is to be saved.

In [None]:
from fineweb_tools.webscrape import id_retrieval_pipeline

id_retrieval_pipeline(
    data_dir='intermediate/stripped/jpn_Jpan',
    domain='oshiete',
    links_path='links/oshiete.txt',
    output_path='ids/oshiete.txt'
)

Finding IDs:   0%|          | 0/148 [00:00<?, ?it/s]

IDs extracted. 341 found.
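The matching step can be pictured as a set lookup on URLs. The sketch below is an illustration under stated assumptions rather than the code behind **id_retrieval_pipeline**: it assumes each FineWeb record exposes an **id** and a **url** field, that scraped links have already been made absolute, and that ignoring a trailing slash is an acceptable normalization. The records and IDs are made up.

```python
def retrieve_ids(records, links, domain):
    """Return the IDs of records whose URL appears in the scraped link list."""
    wanted = {link.rstrip("/") for link in links}
    ids = []
    for rec in records:
        url = rec["url"]
        if domain not in url:  # cheap pre-filter before the exact comparison
            continue
        if url.rstrip("/") in wanted:
            ids.append(rec["id"])
    return ids


# Made-up records standing in for rows of the FineWeb data.
records = [
    {"id": "doc-001", "url": "https://oshiete.goo.ne.jp/watch/entry/aaa/"},
    {"id": "doc-002", "url": "https://oshiete.goo.ne.jp/watch/entry/zzz/"},
    {"id": "doc-003", "url": "https://example.com/watch/entry/aaa/"},
]
links = ["https://oshiete.goo.ne.jp/watch/entry/aaa"]

found = retrieve_ids(records, links, domain="oshiete")
print(found)
```

Filtering by domain first keeps the exact comparison limited to a small fraction of the data, which is presumably why the pipeline asks for a **domain** argument.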


## Scraping with Selenium  

Selenium is a more flexible library because, unlike BeautifulSoup4, it drives a real browser and can interact with web pages the way a normal user does. Therefore, it can handle things like lazy loading and content rendered by JavaScript.  

In the case of **Qiita**, the article retrieval system is scripted in JavaScript, so BeautifulSoup4 is ineffective.  
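To illustrate why a static parser comes up empty on such pages, here is a small sketch with a made-up page whose links are created by a script at load time. The HTML is a simplified stand-in, not Qiita's actual markup, and the standard-library **html.parser** is used in place of BeautifulSoup4; both only see the markup as served and never execute the script.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag present in the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Simplified stand-in page: the result link only exists after the script runs.
static_html = """
<html><body>
  <div id="results"></div>
  <script>
    const a = document.createElement('a');
    a.href = '/U-MA/items/896c49d46585e32ff7b1';
    document.getElementById('results').appendChild(a);
  </script>
</body></html>
"""

parser = LinkCollector()
parser.feed(static_html)
print(parser.links)  # → []
```

A browser-driven tool such as Selenium executes the script first, so the link is present in the page by the time it is read.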

Although it is more flexible, Selenium is also much slower. Nevertheless, it is quite easy to use.  

Just like before, let's start by visiting the **Qiita** index [page](https://qiita.com/search).  

In the search bar, click on the icon on the right, which provides a list of search parameters. We are going to use the following ones:  
- **tag**: 初心者 (beginner). After URL encoding, the query term tag:初心者 becomes 'tag%3A%E5%88%9D%E5%BF%83%E8%80%85'.
- **created**: <=2024-04-24 (the date of the most recent article in Japanese FineWeb)  
- **sort**: by Likes (we are assuming articles with more Likes are of higher quality)  


Plug in the query parameters, and the URL looks like this.

https://qiita.com/search?q=tag%3A%E5%88%9D%E5%BF%83%E8%80%85%20created%3A%3C%3D2024-04-25&sort=like&stocked=&page=1
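Rather than copying the encoded string from the browser, the query can be percent-encoded with **urllib.parse.quote**. This sketch builds the full query, including the **created** filter with the date used in the crawl cell below; the variable names are illustrative.

```python
from urllib.parse import quote

# Human-readable search query: tag filter plus creation-date filter.
query = "tag:初心者 created:<=2024-04-25"

# safe="" percent-encodes every reserved character, including ':' and '='.
encoded = quote(query, safe="")

search_urls = [
    f"https://qiita.com/search?q={encoded}&sort=like&stocked=&page={i}"
    for i in range(1, 10)
]
print(search_urls[0])
```

Keeping the readable query in one variable makes it easy to change the tag or the cutoff date without editing the encoded string by hand.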

From here, the strategy is the same. Prepare a list of links to crawl and pick out a pattern for your target pages.

Note that, when using Selenium, you need to consider your **driver**, which is the web browser used to interact with pages.

It defaults to Chrome, but you can change it with the argument, **driver**.

In [2]:
from fineweb_tools.webscrape import scrape_links

#Prepare a list of URLs to crawl.
urls = [f"https://qiita.com/search?q=tag%3A%E5%88%9D%E5%BF%83%E8%80%85%20created%3A%3C%3D2024-04-25&sort=like&stocked=&page={i}"
        for i in range(1, 10)]

#Define your regex.
pattern = ".com/.+/items/.+$"

#Test out the function.
scrape_links(
    method='selenium',
    url=urls[0],
    regex=pattern
)[:10]

['https://qiita.com/jesus_isao/items/63557eba36819faa4ad9',
 'https://qiita.com/zamis/items/703bfcea027a70c1cec6',
 'https://qiita.com/kazuo_reve/items/d1a3f0ee48e24bba38f1',
 'https://qiita.com/shimajiri/items/501828dc8d589e214470',
 'https://qiita.com/jnchito/items/dedb3b889ab226933ccf',
 'https://qiita.com/m-yamashita/items/889c116b92dc0bf4ea7d',
 'https://qiita.com/rana_kualu/items/379eefb3a40c6b44cb92',
 'https://qiita.com/nesheep5/items/e7196ba496e59bb2aa28',
 'https://qiita.com/0xfffffff7/items/028ff8c920a6a8c67dc5',
 'https://qiita.com/karamage/items/771b633c3243989418a2']

Everything works, so let's crawl. This is going to take quite a bit longer.

In [None]:
from fineweb_tools.webscrape import crawl

crawl(
    method='selenium',
    urls=urls,
    output_path='links/qiita.txt',
    regex=".com/.+/items/.+$"
)

Crawling and scraping links:   0%|          | 0/9 [00:00<?, ?it/s]

Crawl complete. 178 links scraped


Finally, let's extract IDs.

In [None]:
from fineweb_tools.webscrape import id_retrieval_pipeline

id_retrieval_pipeline(
    data_dir='intermediate/stripped/jpn_Jpan',
    domain='qiita',
    links_path='links/qiita.txt',
    output_path='ids/qiita.txt'
)

Finding IDs:   0%|          | 0/148 [00:00<?, ?it/s]

IDs extracted. 585 found.


## Conclusion

With the list of IDs in hand, we can now prepare a sample for annotation that prioritizes content with higher educational density. While this methodology requires significant effort, it is far less labor-intensive than data annotation itself.

Data annotation is a monotonous task, and the attention of our annotators is a valuable resource. By implementing strategies for pre-screening data, we can optimize this process and achieve greater efficiency in the long run.
