# Collect IDs from FineWeb

I am working on collecting IDs from the Japanese segment of FineWeb that are potentially high-quality educational content. My goal is to isolate the educational material from 50 websites that cover a variety of topics. You can track my progress here:

[Japanese FineWeb IDs](https://huggingface.co/datasets/LoneWolfgang/Japanese-FineWeb-IDs)

This notebook will guide you through my methods.

## Structure:

1. **Selecting Websites**: An overview of the strategies I used to identify websites with high-quality educational content.  
2. **Distilling Educational Material**: How to use path structure and web scraping to isolate educational content from non-educational material.  
3. **Retrieve IDs**: Given a list of links, retrieve corresponding IDs from the FineWeb corpus.  

## Setup:

Before getting started, please read through this [notebook](https://github.com/Lone-Wolfgang/Distill-FineWeb2/blob/main/preprocess.ipynb). Completing the **Quickstart** will ensure that you have installed the necessary dependencies, downloaded your FineWeb language segment, and preprocessed it for efficient browsing.

## Selecting Websites

This is the most labor-intensive part of the task. Educational content is sparse, and a significant portion is not represented in the FineWeb corpus. To pinpoint educational content, I have used four strategies:

1. Solicit recommendations.  
2. Query FineWeb domains using keywords.  
3. Follow external links from educational websites.  
4. Sift through blog ranking websites.

### Solicit Recommendations

This is the simplest strategy and, with sufficient community support, scales the best. Among my Japanese connections, I have contacts with expertise in web development, English education, various STEM topics, and music production. I have asked them to recommend sites with educational content that they use themselves or would recommend to others. For instance, a friend of mine recommended [Zenn](https://zenn.dev/), which hosts an assortment of user-generated articles covering a variety of topics in web development.

### Query FineWeb Domains using Keywords

The next simplest strategy is to search FineWeb domains using educational keywords. This method uses the [preprocessed DataFrame](https://github.com/Lone-Wolfgang/Distill-FineWeb2/blob/main/preprocess.ipynb). To query it, we will use the function **fineweb_tools.analyze.filter_fn**, which takes four arguments:
- **path**: Path to the preprocessed DataFrame. 
- **filter_column**: Column to apply a filter.
- **target**: String to search for in the filter column.
- **retrieve_columns**: Columns to return. The function returns only the columns specified here.

Let's try searching for **rekishi**, which is the Japanese term for history:


In [1]:
from fineweb_tools.analyze import filter_fn
import polars as pl

result = filter_fn(
    path = 'FineWeb/preprocessed/jpn_Jpan.parquet',
    filter_column='domain',
    target='rekishi',
    retrieve_columns=['domain', 'count']
).head(10)

for row in result.iter_rows():
    print(row)

('https://rekishisuki.com', 3138)
('https://www.rekishijin.com', 1963)
('https://rekishijidaisakka.hatenablog.com', 1120)
('http://rekishi-club.com', 1020)
('https://rekishiru.site', 814)
('https://chirirekishizukihisachan.hatenablog.com', 710)
('http://www.rekishinoshinzui.com', 683)
('https://rekishi-shizitsu.jp', 653)
('http://rekishi.info', 591)
('https://www.rekishiwales.com', 570)


From this simple query, we have found three valuable resources:

- [RekishiJin](https://www.rekishijin.com/) & [Rekishi Shizitsu](https://rekishi-shizitsu.jp/), which host niche, Wikipedia-style articles about important people and events in Japanese history
- [RekishiWales](https://www.rekishiwales.com), which focuses on Welsh history
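
To get a feel for what the keyword query does, here is a minimal sketch in plain Python. It assumes that the `domain` column holds the scheme plus host of each URL (as the output above suggests) and that `count` is the number of pages per domain. The sample URLs and the helper name `domains_matching` are made up for illustration and are not part of `fineweb_tools`.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical page URLs standing in for rows of a FineWeb file.
sample_urls = [
    'https://www.rekishijin.com/1001',
    'https://www.rekishijin.com/1002',
    'https://rekishi-shizitsu.jp/oda-nobunaga/',
    'https://example.com/cooking/miso-soup/',
]

def domains_matching(urls, keyword):
    """Count pages per domain, keep domains containing `keyword`, largest first."""
    counts = Counter()
    for url in urls:
        parts = urlparse(url)
        counts[f'{parts.scheme}://{parts.netloc}'] += 1
    hits = [(domain, n) for domain, n in counts.items() if keyword in domain]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

result = domains_matching(sample_urls, 'rekishi')
for row in result:
    print(row)
```

Because the match is a plain substring test on the domain, romanized keywords such as **rekishi** work well, while a keyword that rarely appears in domain names will return little.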

### Follow External Links from Educational Websites

Teachers help teachers.

When querying 'biology', I discovered a blog written by a high school biology teacher. Most of their content discusses their stance on educational methods, which is not useful for this task. However, they also included a section of [links](https://biologymanabiai.jimdoweb.com/%E3%83%AA%E3%83%B3%E3%82%AF%E9%9B%86/%E9%AB%98%E6%A0%A1%E7%94%9F%E7%89%A9%E6%8E%88%E6%A5%AD%E9%96%A2%E9%80%A3%E3%83%AA%E3%83%B3%E3%82%AF%E9%9B%86/) to useful educational resources.

From here, I found:

- [Shinrin Ringyou](https://www.shinrin-ringyou.com/): A fantastic resource about forest ecology.  
- [Statistics Academy](https://www.stat.go.jp/naruhodo/index.html): A collection of basic statistics lessons for elementary and high school students.  

These two sites are among the highest-quality resources that I have found. If you stumble upon a website that is earnestly passionate about education, look for external links before you move on.

### Sift through Blog Ranking Websites

A final resource for educational content is blog aggregators, such as [Blog Mura](https://blogmura.com/). This website is particularly useful because blogs are categorized and ranked based on user activity.

This is a valuable resource for niche information that you would not find in a typical educational curriculum, such as:  
- [Ace Compliance](https://www.ace-compliance.com/blog/): Explores the legal code directing waste management practices and explains it in common-sense terminology.  
- [KamiConsal](https://kamiconsal.jp/): All things paper—manufacturing, use-cases, and trivia.  
- [AgriNavi](https://agri.mynavi.jp): Dedicated to disseminating environmentally friendly agricultural practices.  

Although this strategy is useful for accessing resources you would otherwise overlook, there are significant limitations:

- **Recency**: Blogs are typically ranked by recent activity, but the FineWeb corpus was collected about a year ago. Many of the top-ranked blogs are **not** represented in the FineWeb corpus. If you find a promising resource, start by querying the domain using **fineweb_tools.analyze.filter_fn**.

- **SEO Dominance**: Many of the top-ranking blogs are intended to promote some product or service. Although they present themselves as educational, their content is usually underwhelming.

- **Credibility**: Unlike professional resources, amateur blogs are not obligated to share their sources. It is difficult to vet resources, so please proceed with caution.

## Distilling Educational Content

Regardless of their educational quality, all websites have at least some non-educational content, such as category pages or an 'About Me' section. Many sites have excellent quality material, but only on a minority of pages. To improve the sample quality, it is important to sift educational from non-educational material.

This section of the notebook presents three strategies for distilling links to pages with educational content. After collecting links, the final section of the notebook explains how to retrieve IDs from the FineWeb corpus.

- **Use Path Branches**: When possible, this is the preferred method. Sometimes, the type of content is denoted in the structure of the URL. In this case, it is very easy to target educational content with regular expressions.

- **Scrape Links Using BeautifulSoup4**: When the type of content is not reflected in the link structure, it is better to crawl index pages with educational content and scrape links. BeautifulSoup4 (bs4) is the simplest method for web scraping.

- **Scrape Links Using Selenium**: More advanced websites use JavaScript with query parameters to dynamically retrieve pages based on user requests. BS4 is not effective for this type of website. Selenium is a powerful library that interacts with webpages the way a human user would, but it is much slower.

### Use Path Branches

This is an effective strategy when the majority of website content is educational, or when the site is well-structured and content type is clearly reflected in the URL.

To demonstrate, we are going to use [Mathematica](https://mathematica.site/), a website dedicated to the ancient origins of mathematical theory.

We will start by collecting a list of links from Mathematica that are in the FineWeb corpus. We will use **fineweb_tools.analyze.filter_fineweb**. This function iterates through FineWeb files, applies **fineweb_tools.analyze.filter_fn** to each one, and collects the results into a single DataFrame.

**filter_fineweb** takes two arguments:
- **data_dir**: Where the FineWeb files are saved. I use the stripped files from the intermediate folder, but raw files will work too.
- **filter_fn**: The custom filtering function.

As before, **filter_fn** takes four arguments:
- **path**: Input as a lambda argument.
- **filter_column**: The column to filter.
- **target**: The string to locate in the filter column.
- **retrieve_columns**: The columns returned in the final DataFrame.
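
The pattern here is a per-file filter wrapped in a lambda and applied across a directory. The sketch below illustrates that pattern only. It uses small CSV files and plain Python lists in place of the Parquet files and DataFrames, and the names `filter_file` and `filter_directory` are hypothetical stand-ins, not the `fineweb_tools` implementation.

```python
import csv
import tempfile
from pathlib import Path

def filter_file(path, filter_column, target, retrieve_columns):
    """Return the requested columns of rows whose filter_column contains target."""
    with open(path, newline='', encoding='utf-8') as f:
        return [
            {col: row[col] for col in retrieve_columns}
            for row in csv.DictReader(f)
            if target in row[filter_column]
        ]

def filter_directory(data_dir, filter_fn):
    """Apply filter_fn to every file in data_dir and concatenate the results."""
    rows = []
    for path in sorted(Path(data_dir).glob('*.csv')):
        rows.extend(filter_fn(path))
    return rows

# Build two tiny files so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    for name, url in [('a.csv', 'https://mathematica.site/web-mag/column/'),
                      ('b.csv', 'https://example.com/about/')]:
        with open(Path(tmp) / name, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['domain', 'url'])
            writer.writerow([url.split('/')[2], url])

    # The lambda fixes every argument except the path, which varies per file.
    filtered_rows = filter_directory(
        tmp,
        filter_fn=lambda path: filter_file(path, 'domain', 'mathematica.site', ['url']),
    )

print(filtered_rows)
```

Passing the filter as a function keeps the directory loop generic: the same loop can run any per-file query without being rewritten.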


In [2]:
from fineweb_tools.analyze import filter_fineweb, filter_fn

filtered = filter_fineweb(
    data_dir = 'FineWeb/intermediate/jpn_Jpan/stripped',
    filter_fn = lambda path: filter_fn(
        path=path,
        filter_column='domain',
        target = 'mathematica.site',
        retrieve_columns=['domain', 'url']
    )
)

#Create a list with the URLs
urls = filtered['url'].to_list()
urls[:5]

Filtering FineWeb:   0%|          | 0/148 [00:00<?, ?it/s]

['https://mathematica.site/keyword-term/egyptian-fraction/',
 'https://mathematica.site/web-mag/web-mag-egypt/3-5/',
 'https://mathematica.site/newsall/news/renewal-info/',
 'https://mathematica.site/keyword-term/mastaba/',
 'https://mathematica.site/keyword-term/neftis-2/']

With the initial list of URLs, you can begin to see the link structure.

To get a better overview, use the function, **fineweb_tools.collect_ids.count_path_branches**, which recursively splits paths into branches and counts them.

In [3]:
from fineweb_tools.collect_ids import count_path_branches

count_path_branches(urls).most_common(5)

[('/web-mag', 93),
 ('/keyword-term', 29),
 ('/web-mag/web-mag-egypt', 23),
 ('/web-mag/calendar', 17),
 ('/web-mag/column', 16)]

When viewed this way, you can see that the most common path is **web-mag**, which is where educational content is presented.
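
If you are curious how branch counting can work, here is a minimal sketch. It assumes that every prefix of a URL path counts as one branch, which is consistent with the output above but is my guess at the behavior rather than the actual `count_path_branches` code. The name `count_branches_sketch` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def count_branches_sketch(urls):
    """Count every path prefix ('/a', '/a/b', ...) across a list of URLs."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split('/') if s]
        for depth in range(1, len(segments) + 1):
            counts['/' + '/'.join(segments[:depth])] += 1
    return counts

# The first five URLs returned by the Mathematica query.
sample = [
    'https://mathematica.site/keyword-term/egyptian-fraction/',
    'https://mathematica.site/web-mag/web-mag-egypt/3-5/',
    'https://mathematica.site/newsall/news/renewal-info/',
    'https://mathematica.site/keyword-term/mastaba/',
    'https://mathematica.site/keyword-term/neftis-2/',
]

branch_counts = count_branches_sketch(sample)
print(branch_counts.most_common(1))
```

On this five-link sample, **/keyword-term** is the most common branch. Only the full list reveals that **/web-mag** dominates, so count branches over all of the links rather than a preview.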

Let's use a list comprehension with a regular expression to filter the target links.

In [4]:
from fineweb_tools.collect_ids import save_list_to_txt
import re

#targets strings containing 'web-mag/' followed by at least 1 of any character
regex = 'web-mag/.+'

#filters urls using a list comprehension with re.search
target = [url for url in urls if re.search(regex, url)]

#saves the extracted list to a text file for id extraction
save_list_to_txt(target, output_path='hub_dataset/jpn_Jpan/links/mathematica.txt')

target[:5]

['https://mathematica.site/web-mag/web-mag-egypt/3-5/',
 'https://mathematica.site/web-mag/web-mag-egypt/1-3/',
 'https://mathematica.site/web-mag/web-mag-egypt/1-2/',
 'https://mathematica.site/web-mag/euclid/euclid1-1/',
 'https://mathematica.site/web-mag/web-mag-babylonian/invention-of-numbers-8/']

The links were saved to a text file, and we will retrieve the corresponding IDs in the final section.
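
Before saving a filtered list, it can help to test the regex against a few links whose outcome you already know. The sketch below does that for `web-mag/.+`. The bare `https://mathematica.site/web-mag/` URL is an assumed index page that I added to show what the `.+` excludes.

```python
import re

regex = 'web-mag/.+'

# Links that should survive the filter (taken from the output above).
should_match = [
    'https://mathematica.site/web-mag/web-mag-egypt/3-5/',
    'https://mathematica.site/web-mag/euclid/euclid1-1/',
]

# Links that should be dropped. The last one is an assumed index page:
# nothing follows 'web-mag/', so '.+' has nothing to match.
should_not_match = [
    'https://mathematica.site/keyword-term/mastaba/',
    'https://mathematica.site/newsall/news/renewal-info/',
    'https://mathematica.site/web-mag/',
]

matched = [u for u in should_match + should_not_match if re.search(regex, u)]
print(matched == should_match)
```

Note that `re.search` looks for the pattern anywhere in the string, so the regex does not need to describe the whole URL.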

When website structure is not reflected in the URL, you need to use web scraping to isolate educational content. We will start by using BeautifulSoup4 (BS4).

### Scraping with BeautifulSoup4

BeautifulSoup4 is an easy-to-use scraping library that simplifies the process of parsing, navigating, and extracting data from HTML and XML documents.

To demonstrate, we will use Oshiete, a QA forum similar to Quora that covers a wide range of topics. Most of its content is not useful, but some articles are tagged as Expert, meaning that the author has some authority.

Let's start by visiting the index page of expert articles on [Oshiete](https://oshiete.goo.ne.jp/watch/pro/?pg=2).

The important thing to note is the structure of the URL. **Expert** pages are indexed separately, denoted in the path by the term **pro**, and the page number is passed as the query parameter **pg**:

oshiete.goo.ne.jp/watch/**pro**/?pg={page_number}

Scroll to the bottom, and you will see that there are just 22 pages. So, we can make a list of index pages to crawl with a simple list comprehension.

In [39]:
urls = [f'https://oshiete.goo.ne.jp/watch/pro/?pg={i}' for i in range(1, 23)]
urls[:3]

['https://oshiete.goo.ne.jp/watch/pro/?pg=1',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=2',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=3']

Alternatively, build a list of pages using the function **fineweb_tools.collect_ids.pagerange**:

In [5]:
from fineweb_tools.collect_ids import pagerange

#Handy when combining lists from many different index pages
urls = pagerange('https://oshiete.goo.ne.jp/watch/pro/?pg={}', first = 1, last = 23)
urls[:3]

['https://oshiete.goo.ne.jp/watch/pro/?pg=1',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=2',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=3']

From here, you should click some of the links from the list comprehension, and confirm that they land on the target pages.
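
For reference, a page-range helper can be as small as the sketch below. It assumes that `last` is inclusive, which is what the progress bars in this notebook suggest, but I have not confirmed that against the `pagerange` source. The name `pagerange_sketch` is hypothetical.

```python
def pagerange_sketch(template, first, last):
    """Fill `template` with every page number from first to last, inclusive."""
    return [template.format(page) for page in range(first, last + 1)]

pages = pagerange_sketch('https://oshiete.goo.ne.jp/watch/pro/?pg={}', first=1, last=3)
print(pages)
```

If `last` is in fact inclusive, `last=22` would cover exactly the 22 index pages counted above.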

Let's try scraping one of these pages. The function **fineweb_tools.collect_ids.scrape_links** takes the arguments:
- **method**: Which library to use for scraping, either **bs4** or **selenium**.
- **url**: The URL from which to scrape links.
- **regex**: Optional. If provided, will filter the scraped links.

In [6]:
from fineweb_tools.collect_ids import scrape_links

links = scrape_links(
    method='bs4',
    url=urls[0]
)
links[:5]

['//oshiete.goo.ne.jp/',
 'javascript:void 0;',
 'javascript:void 0;',
 '//oshiete.goo.ne.jp/',
 '//oshiete.goo.ne.jp/articles/qa/']

Next, use **count_path_branches** to tune a filtering regex:

In [7]:
count_path_branches(links).most_common(5)

[('/watch', 167),
 ('/watch/entry', 95),
 ('/watch/category', 48),
 ('/articles', 21),
 ('/articles/qa', 21)]

In this case, the target content is served from URLs under the branch **watch/entry/{. . .}**

The scraper does its job, but most of the links are not useful. Using the argument **regex**, we can filter the links using regular expressions.
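
To make the scrape-then-filter idea concrete, here is a small offline sketch. It swaps in Python's built-in `html.parser` for BeautifulSoup4 so that it runs without extra dependencies or network access, and it parses a hand-written HTML snippet instead of a live page. The class `LinkCollector`, the function `extract_links`, and the category link are all illustrative assumptions.

```python
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

def extract_links(html, regex=None):
    """Return all hrefs, or only those matching `regex` when one is given."""
    collector = LinkCollector()
    collector.feed(html)
    if regex is None:
        return collector.links
    return [link for link in collector.links if re.search(regex, link)]

# Hand-written snippet that mimics the kinds of links on an index page.
html = '''
<a href="//oshiete.goo.ne.jp/">Home</a>
<a href="javascript:void 0;">Menu</a>
<a href="/watch/entry/f79dcd164843bd95d8baff8d21e80120/">Article</a>
<a href="/watch/category/life/">Category</a>
'''

all_links = extract_links(html)
entry_links = extract_links(html, regex='watch/entry/.+')
print(len(all_links), entry_links)
```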

Let's try scraping links again using a regex:

In [8]:
scrape_links(
    method = 'bs4',
    url=urls[0],
    regex='watch/entry/.+'
)[:3]

['/watch/entry/9efe66583ef412bb0c90d4be18e67d99/?from=entry_side_rank',
 '/watch/entry/f79dcd164843bd95d8baff8d21e80120/',
 '/watch/entry/4786777415b37917a79acfa64327b629/']

If you combine any one of those paths with the base, oshiete.goo.ne.jp, it will take you to an article that is flagged as expert.
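
Combining a scraped path with the base can be done with `urllib.parse.urljoin`. The sketch below also drops the query string (such as `?from=entry_side_rank`). Whether that is needed for matching against FineWeb is an assumption on my part, so it is controlled by a flag. The helper name `to_absolute` is hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE = 'https://oshiete.goo.ne.jp/'

def to_absolute(link, base=BASE, drop_query=True):
    """Resolve a scraped link against the base and optionally strip the query."""
    absolute = urljoin(base, link)
    if drop_query:
        scheme, netloc, path, _query, _fragment = urlsplit(absolute)
        absolute = urlunsplit((scheme, netloc, path, '', ''))
    return absolute

scraped = [
    '/watch/entry/9efe66583ef412bb0c90d4be18e67d99/?from=entry_side_rank',
    '/watch/entry/f79dcd164843bd95d8baff8d21e80120/',
    '//oshiete.goo.ne.jp/articles/qa/',  # protocol-relative link
]

absolute_links = [to_absolute(link) for link in scraped]
for link in absolute_links:
    print(link)
```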

URL encoding varies from site to site. It is important that you experiment with your regex to ensure that it is filtering properly.

Next, let's try crawling. The function **fineweb_tools.collect_ids.crawl** takes these arguments:
- **method**: Scraping library, **bs4** or **selenium**.
- **urls**: A list of URLs to scrape.
- **output_path**: The path to save your list of scraped links.
- **regex**: The regular expression used to filter scraped links.

In [53]:
from fineweb_tools.collect_ids import crawl

crawl(
    method='bs4',
    urls=urls,
    output_path='hub_dataset/jpn_Jpan/links/oshiete.txt',
    regex='watch/entry/.+'
)

Crawling and scraping links:   0%|          | 0/23 [00:00<?, ?it/s]

Crawl complete. 779 links scraped
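
Conceptually, a crawl is a loop over index pages that scrapes each one, filters the links, and writes the result to disk. The sketch below shows that loop with a fake scraper so that it runs offline. The deduplication step, the function name `crawl_sketch`, and the fake pages are my assumptions for illustration, not a description of how `crawl` is implemented.

```python
import re
import tempfile
from pathlib import Path

def crawl_sketch(urls, scrape, output_path, regex=None):
    """Scrape each URL, keep matching links once each, and save one per line."""
    seen = set()
    ordered = []
    for url in urls:
        for link in scrape(url):
            if regex and not re.search(regex, link):
                continue
            if link not in seen:  # assumed: duplicates across pages are dropped
                seen.add(link)
                ordered.append(link)
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_path).write_text('\n'.join(ordered) + '\n', encoding='utf-8')
    return ordered

# Fake index pages: a dict lookup stands in for a real scraper.
fake_pages = {
    'page-1': ['/watch/entry/aaa/', '/watch/category/x/'],
    'page-2': ['/watch/entry/aaa/', '/watch/entry/bbb/'],
}

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / 'links' / 'demo.txt'
    crawled = crawl_sketch(list(fake_pages), fake_pages.get, out, regex='watch/entry/.+')
    saved = out.read_text(encoding='utf-8').splitlines()

print(crawled)
```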


Finally, let's try scraping using Selenium.

### Scraping with Selenium

Selenium is a more flexible library because, unlike BeautifulSoup4, it can interact with web pages as a normal user. Therefore, it can handle things like lazy loading and JavaScript.

To demonstrate, we will scrape **Qiita**, which is a developer forum where users write explanatory articles.

In the case of **Qiita**, the article retrieval system is scripted in JavaScript, so BeautifulSoup4 is ineffective.  

Although it is more flexible, Selenium is also much slower. Nevertheless, it is quite easy to use.  

Just like before, let's start by visiting the **Qiita** index [page](https://qiita.com/search).  

In the search bar, click on the icon on the right, which provides a list of search parameters. We are going to use the following ones:
- **created**: <=2024-04-24 (the date of the most recent article in Japanese FineWeb)  
- **sort**: by Likes (we are assuming articles with more Likes are of higher quality)  


Plug in the query parameters, and the URL looks like this.

`https://qiita.com/search?q=created%3C%3D2024-04-24&sort=like&stocked=&page=1`
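
If you would rather build this URL in code than copy it from the browser, `urllib.parse.urlencode` handles the percent-encoding of `<=`. This is a small sketch. The function name `qiita_search_url` is hypothetical, and the parameter names are simply the ones visible in the URL above.

```python
from urllib.parse import urlencode

def qiita_search_url(page, created_before='2024-04-24'):
    """Build a search URL sorted by Likes and limited by creation date."""
    params = {
        'q': f'created<={created_before}',  # '<=' is encoded as %3C%3D
        'sort': 'like',
        'stocked': '',
        'page': page,
    }
    return 'https://qiita.com/search?' + urlencode(params)

search_url = qiita_search_url(1)
print(search_url)
```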


In [12]:
urls = pagerange('https://qiita.com/search?q=created%3C%3D2024-04-24&sort=like&stocked=&page={}', 1, 5)
urls[:3]

['https://qiita.com/search?q=created%3C%3D2024-04-24&sort=like&stocked=&page=1',
 'https://qiita.com/search?q=created%3C%3D2024-04-24&sort=like&stocked=&page=2',
 'https://qiita.com/search?q=created%3C%3D2024-04-24&sort=like&stocked=&page=3']

Scrape links from a single page, and use that information to tune your regex. When using Selenium, it will open up a web browser as it scrapes.

In [13]:
from fineweb_tools.collect_ids import scrape_links, count_path_branches

links = scrape_links(
    method='selenium', #this time, use selenium instead of 'bs4',
    url=urls[0]
)

count_path_branches(links).most_common(10)

[('/tags', 86),
 ('/torifukukaiou', 12),
 ('/organizations', 11),
 ('/', 8),
 ('/shirok', 8),
 ('/torifukukaiou/items', 6),
 ('/tags/python', 5),
 ('/KEINOS', 4),
 ('/ddd_nnuco', 4),
 ('/aguilarklyno', 4)]

This pattern is not as obvious. Content is presented in URLs with the structure:

qiita.com/{user}/items/{title}

You can target this pattern with this regex:

com/.+/items/.+

This matches **com/**, followed by **any sequence of characters**, followed by **/items/**, followed by **any sequence of characters**.
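
Here is a quick offline check of that regex against a few links. Because `.+` also matches slashes, the pattern is fairly permissive. The stricter variant shown next to it is a hypothetical alternative that I have not run against the full crawl, and the `example.com` link is made up to show where the two differ.

```python
import re

loose = 'com/.+/items/.+'
# Hypothetical stricter variant: exactly one path segment for the user.
strict = r'qiita\.com/[^/]+/items/[^/?#]+'

candidates = [
    'https://qiita.com/nucomiya/items/c30c8de57eba7ccdfbe3',  # article
    'https://qiita.com/tags/python',                          # tag page
    'https://qiita.com/torifukukaiou',                        # user page
    'https://example.com/a/b/items/c',                        # made-up off-site link
]

loose_hits = [u for u in candidates if re.search(loose, u)]
strict_hits = [u for u in candidates if re.search(strict, u)]
print(loose_hits)
print(strict_hits)
```

The loose pattern is fine when every scraped link comes from Qiita itself. The stricter one is only worth the extra complexity if off-site links start slipping through.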

Let's try it:

In [14]:
scrape_links(
    method='selenium',
    url=urls[0], #scrape the first index page again, this time with the regex
    regex='com/.+/items/.+'
)[:5]

['https://qiita.com/nucomiya/items/c30c8de57eba7ccdfbe3',
 'https://qiita.com/topi_log/items/b8cc6afaa6e12599ffbb',
 'https://qiita.com/nukipei/items/f096a1df6c8074b16150',
 'https://qiita.com/GOROman/items/769bf17589d5661f7a70',
 'https://qiita.com/free-honda/items/8c5c3ec4cdb6ad6a5107']

Click on one of the links to confirm that it directs you to an article page.

Finally, let's run a crawl. If you have a lot of pages, this will take a while.

In [15]:
from fineweb_tools.collect_ids import crawl

crawl(
    method='selenium',
    urls = urls,
    output_path = 'hub_dataset/jpn_Jpan/links/qiita.txt',
    regex = 'com/.+/items/.+'
)

Crawling and scraping links:   0%|          | 0/5 [00:00<?, ?it/s]

Crawl complete. 2106 links scraped


Now that we have links from a few different pages, let's retrieve IDs from FineWeb.

## Retrieve IDs from FineWeb

This step is straightforward. We will use the function **fineweb_tools.collect_ids.id_retrieval_pipeline**, which iterates through FineWeb files, filters rows with matching links, retrieves IDs, and writes them to a text file.

The **id_retrieval_pipeline** takes four arguments:
- **data_dir**: Path to FineWeb data. I recommend that you use the **stripped data** from preprocessing because it is more efficient. The pipeline works fine on raw data too.
- **domain**: The domain name from which you scraped links. Used for initial filtering.
- **links_path**: Path to the text file where links are saved.
- **output_path**: Path to where the list of IDs is to be saved.
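
The core of the retrieval step is a set lookup: keep the rows from the target domain whose URL appears in the saved links, and collect their IDs. The sketch below shows that idea on in-memory rows. The row IDs are placeholders, the trailing-slash normalization is my assumption, and `retrieve_ids_sketch` is a hypothetical name rather than the pipeline's actual code.

```python
def retrieve_ids_sketch(rows, links, domain):
    """Return the IDs of rows from `domain` whose URL is in `links`."""
    wanted = {link.rstrip('/') for link in links}  # assumed normalization
    ids = []
    for row in rows:
        if domain not in row['url']:
            continue  # cheap first pass, mirroring the `domain` argument
        if row['url'].rstrip('/') in wanted:
            ids.append(row['id'])
    return ids

# Placeholder rows standing in for a FineWeb file with `id` and `url` columns.
rows = [
    {'id': 'id-1', 'url': 'https://mathematica.site/web-mag/web-mag-egypt/3-5/'},
    {'id': 'id-2', 'url': 'https://mathematica.site/keyword-term/mastaba/'},
    {'id': 'id-3', 'url': 'https://example.com/web-mag/web-mag-egypt/3-5/'},
]
links = ['https://mathematica.site/web-mag/web-mag-egypt/3-5']

found_ids = retrieve_ids_sketch(rows, links, domain='mathematica.site')
print(found_ids)
```

Using a set for the links keeps each lookup fast, which matters when the same check runs over every row of every FineWeb file.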

Let's start with **Mathematica**:

In [62]:
from fineweb_tools.collect_ids import id_retrieval_pipeline

id_retrieval_pipeline(
    data_dir='FineWeb/intermediate/jpn_Jpan/stripped',
    domain='mathematica.site',
    links_path='hub_dataset/jpn_Jpan/links/mathematica.txt',
    output_path='hub_dataset/jpn_Jpan/ids/mathematica.txt'
)

Finding IDs:   0%|          | 0/148 [00:00<?, ?it/s]

IDs extracted. 93 found.


Now, let's do Qiita and Oshiete:

In [None]:
id_retrieval_pipeline(
    data_dir='FineWeb/intermediate/jpn_Jpan/stripped',
    domain='qiita',
    links_path='hub_dataset/jpn_Jpan/links/qiita.txt',
    output_path='hub_dataset/jpn_Jpan/ids/qiita.txt'
)

id_retrieval_pipeline(
    data_dir='FineWeb/intermediate/jpn_Jpan/stripped',
    domain='oshiete.goo',
    links_path='hub_dataset/jpn_Jpan/links/oshiete.txt',
    output_path='hub_dataset/jpn_Jpan/ids/oshiete.txt'
)

Finding IDs:   0%|          | 0/148 [00:00<?, ?it/s]

IDs extracted. 586 found.


Finding IDs:   0%|          | 0/148 [00:00<?, ?it/s]

## Conclusion

With the list of IDs in hand, we can now prepare a sample for annotation with a higher density of educational content. While this methodology requires significant effort, it is far less labor-intensive than data annotation itself.

Data annotation is a monotonous task, and the attention of our annotators is a valuable resource. By implementing strategies for pre-screening data, we can optimize this process and achieve greater efficiency in the long run.
