# Group FineWeb Data by Domain to Target URLs

Hello, everyone. I'm working on the Japanese portion of FineWeb. During annotation, I noticed that the density of educational content is quite low. So far, fewer than 1% of examples have scored 3 or higher. To scale up the educational content, we’ll need a significantly larger sample.

Educational content is sparse. Most web material is geared toward entertainment or profit. Furthermore, isolating educational content from high-quality websites is not trivial. For instance, if you were to randomly select a URL from a site like HuggingFace, it’s highly likely that you’d end up with a catalog page.

For languages with a low density of educational material, the leadership of FineWeb has requested a list of high-quality URLs and IDs to prepare an additional sample for annotation.

My primary challenge is limited compute resources. Currently, I’m living abroad with a constrained setup: a 500 GB internal hard drive, a 2 TB external hard drive, and 16 GB of RAM. The Japanese segment of the dataset is quite large, totaling about 500 GB and divided into 146 shards.

In this notebook, I’ll share how I extracted a sample of IDs that are potentially high quality.

## Structure

1. **Setup**: How to prepare your environment for the task.
2. **Preprocessing**: Starting from the raw FineWeb dataset hosted on the HuggingFace Hub, strip away unnecessary data and group by domain for efficient browsing.
3. **Analysis**: Inspect the distribution of the corpus by domain and select domains with high-quality content.
4. **Scraping**: From the target domains, crawl a portion of the sitemap and compile a list of URLs.
5. **ID Extraction**: Using the list of URLs from scraping, compile the corresponding IDs from the FineWeb corpus.


## Setup

To run this code, you will need to set up an environment with a few libraries. After installing them, you can run the import statements.

In [None]:
# %pip install --upgrade polars huggingface_hub tqdm tld beautifulsoup4 requests

In [5]:
import bs4
from math import log10
import os
from pathlib import Path
import polars as pl
import re
import requests
from tld import get_tld
from tqdm.auto import tqdm
from urllib.parse import urlparse

In [3]:
# increase amount of data polars shows
pl.Config.set_tbl_rows(100)

polars.config.Config

## Preprocessing

For my setup, I have the Japanese segment downloaded to my external hard drive. This works well for me, but you might find it easier to load the data directly from the HuggingFace Hub. If you need guidance on how to do this, please refer to the [blog post](https://danielvanstrien.xyz/posts/2024/12/23/fineweb-filter-polars.html).

We will be using the `polars` library, which offers powerful tools for working with large files without overloading your RAM. To begin, I want to inspect the columns in the dataset and keep only the ones I need. The function `pl.read_parquet_schema` checks a file and returns a dictionary of columns with datatypes:

In [3]:
data_dir = r"D:\FineWeb\jpn_Jpan\train"
pl.read_parquet_schema(Path(data_dir, os.listdir(data_dir)[0]))

{'text': String,
 'id': String,
 'dump': String,
 'url': String,
 'date': String,
 'file_path': String,
 'language': String,
 'language_score': Float64,
 'language_script': String,
 'minhash_cluster_size': Int64,
 'top_langs': String}

In this case, we will keep only the columns `'id'`, `'url'`, and `'date'`. Dropping the other columns, especially `'text'`, will significantly reduce the memory load. 

As we continue preprocessing, we will use the `pl.read_parquet` function. This function behaves similarly to its counterpart in pandas, offering a familiar interface while being optimized for performance.

### Step 1: Strip
In this next block of code, we will load the data, remove unnecessary columns, and add the `'domain'` column. The resulting files are saved for the next step of preprocessing.

In [4]:
# Define the function for extracting the domain
def get_domain(url):
    parsed_url = urlparse(url)
    return f"{parsed_url.scheme}://{parsed_url.netloc}"

# Given a dataframe, this function will add the 'domain' column.
def add_domain_column(df):
    df = df.with_columns(
        pl.col("url").map_elements(get_domain, return_dtype=pl.Utf8).alias("domain")
    )
    # String columns cannot hold NaN, so drop rows with missing (null) values instead
    return df.drop_nulls()

# The main function iterates through files, adds the domain column, and saves the dataframes to a new directory.
def strip_and_add_domain_pipeline(
        data_dir = r"D:\FineWeb\jpn_Jpan\train",
        output_dir = "preprocess/stripped",
        columns = ['id', 'date', 'url']
):
    files = os.listdir(data_dir)
    os.makedirs(output_dir, exist_ok=True)

    for file in tqdm(files, desc= "Preprocessing"):
        input_path, output_path = Path(data_dir, file), Path(output_dir, file)
        if os.path.exists(output_path):
            continue
        else:
            df = add_domain_column(pl.read_parquet(input_path, columns=columns))
            df.write_parquet(output_path)
        
strip_and_add_domain_pipeline()

Preprocessing: 100%|██████████| 148/148 [00:00<00:00, 4354.51it/s]
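
As a quick check of what ends up in the `'domain'` column, the sketch below applies the same `urlparse` logic to two illustrative URLs. Because the scheme is part of the result, the `http://` and `https://` versions of one site are counted as separate domains. The helper `get_host` is a hypothetical variant, not used elsewhere in this notebook, that drops the scheme in case you prefer to merge them.

```python
from urllib.parse import urlparse

def get_domain(url):
    # Same logic as the notebook's get_domain: scheme plus network location
    parsed_url = urlparse(url)
    return f"{parsed_url.scheme}://{parsed_url.netloc}"

def get_host(url):
    # Hypothetical variant: keep only the host so http and https collapse together
    return urlparse(url).netloc

# Illustrative URLs, not taken from the dataset
examples = [
    "https://ameblo.jp/some-user/entry-1.html",
    "http://ameblo.jp/some-user/entry-2.html",
]

for url in examples:
    print(get_domain(url), "|", get_host(url))
```

Whether to merge the two schemes is a judgment call. Keeping them separate mirrors the URLs exactly as they appear in the corpus.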


### Step 2: Group

Next, dataframes will be grouped by the domain. Grouped frames will be saved individually before being compiled in the final step.

In [5]:
def group_pipeline(
    input_dir = 'preprocess/stripped',
    output_dir = 'preprocess/grouped'
):
    files = os.listdir(input_dir)
    os.makedirs(output_dir, exist_ok=True)
    for file in tqdm(files, desc= 'Preprocessing'):
        # Read from input_dir (the stripped files), not the raw data_dir
        input_path, output_path = Path(input_dir, file), Path(output_dir, file)
        if os.path.exists(output_path):
            continue
        else:
            df = pl.read_parquet(input_path)
            df = df.group_by('domain').count()
            df.write_parquet(output_path)

group_pipeline()       

Preprocessing: 100%|██████████| 148/148 [00:00<00:00, 7011.75it/s]


### Step 3: Compile

In this final preprocessing step, we will combine all the grouped data. I ran this in batches to avoid running out of memory.

After combining all the data, I add a column with the top-level domain (`tld`), sort by count, and save the preprocessed file for further analysis.

In [8]:
def compile(
        input_dir='preprocess/grouped',
        output_path='preprocessed.parquet',
        batch_size=30
):
    #Initialize the output and the filepaths
    output = None 
    files = [Path(input_dir, file) for file in os.listdir(input_dir)]

    #Compile the data in batches
    for i in tqdm(range(0, len(files), batch_size)):
        batch_files = files[i:i + batch_size]
        batch_data = pl.concat([pl.read_parquet(file) for file in batch_files])
        batch_grouped = batch_data.group_by('domain').agg(pl.col('count').sum())
        if output is None:
            output = batch_grouped
        else:
            output = pl.concat([output, batch_grouped]).group_by('domain').agg(pl.col('count').sum())
    
    #After batching, add a column with the Top Level Domain
    output = output.with_columns(
        pl.col("domain").map_elements(lambda x: get_tld(x, fail_silently=True), return_dtype=pl.Utf8).alias("tld")
        )
    
    #Sort and save
    output = output.sort('count', descending=True)
    output.write_parquet(output_path)

# Call the function
compile()


100%|██████████| 5/5 [00:40<00:00,  8.00s/it]
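
The batching works because per-shard counts can be summed in any grouping and still give the same totals. Here is a minimal standard-library sketch of that idea, using `collections.Counter` in place of polars. The shard contents and the `compile_counts` helper are hypothetical stand-ins for the grouped parquet files and the function above.

```python
from collections import Counter

# Hypothetical per-shard domain counts, standing in for the grouped parquet files
shard_counts = [
    {"https://a.example": 3, "https://b.example": 1},
    {"https://a.example": 2},
    {"https://b.example": 4, "https://c.example": 1},
]

def compile_counts(shards, batch_size=2):
    total = Counter()
    for i in range(0, len(shards), batch_size):
        batch = Counter()
        for shard in shards[i:i + batch_size]:
            batch.update(shard)  # sum counts within the batch
        total.update(batch)      # fold the batch into the running total
    return total

totals = compile_counts(shard_counts)
print(totals.most_common())
```

The batch size only affects peak memory, not the result, so it can be tuned to whatever the machine tolerates.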


## Analysis

Now that our data has been grouped by domain, let's see where our data is coming from. The goal of this analysis is to explore the distribution of the data by domain and by top-level domain.

Then, we will select a sample to search for 'Good URLs' with high educational content.

Let's start by looking at the websites that contributed the most URLs:

In [85]:
path = 'preprocessed.parquet'

df = pl.read_parquet(path)
df[['domain', 'count']].head(10)

domain,count
str,u32
"""http://lineq.jp""",1346922
"""https://ameblo.jp""",908818
"""http://ameblo.jp""",773217
"""https://oshiete.goo.ne.jp""",770550
"""https://www.amazon.co.jp""",695980
"""http://mixi.jp""",471151
"""http://news.livedoor.com""",452297
"""https://detail.chiebukuro.yaho…",408322
"""http://q.hatena.ne.jp""",386786
"""https://qa.mamari.jp""",337377


Most of these high-frequency pages offer little to no value.

- **LineQ**, a payment platform, leads in contributions with over 1 million near-duplicate pages.
- **Ameblo** is a social media platform where fans discuss entertainment topics.
- **Amazon** needs no introduction.
- **Mixi** and **LiveDoor** are light news platforms heavily loaded with ads.

The remaining domains—Oshiete, Yahoo Chiebukuro, Hatena, and Mamari—are Q&A platforms. 

Oshiete is particularly interesting because some of its content is flagged as 'Expert'. In general, the Expert content seems to have at least 'Minimal' to 'Basic' educational value.

Here is an [example](https://oshiete.goo.ne.jp/watch/entry/aec1879357b6efac2e31a0a0a1d41307/).

This kind of content is somewhat useful. It covers a variety of topics, with decent structure and a healthy amount of facts. However, it does not explore any topic in depth. It would be better to have content that is more domain-specific.

In the past, I have used **Qiita**, a forum where developers share educational content with one another. The website is very well organized, with plenty of content that is targeted toward beginners. Here is an [example](https://qiita.com/ddd_nnuco/items/0873a5f286049ba46265).

So, I am going to target content from **Oshiete** and **Qiita**. I already know I have a lot of pages from **Oshiete**, but I need to check whether there is sufficient data for **Qiita**.

The next code block shows how to query the dataframe for domains that contain a given string:

In [90]:
df.filter(df['domain'].str.contains('qiita')).head(5)

domain,count,tld,group
str,u32,str,i32
"""https://qiita.com""",68420,"""com""",4
"""http://qiita.com""",5009,"""com""",3
"""https://qiitadon.com""",927,"""com""",2
"""https://teams.qiita.com""",498,"""com""",2
"""https://jobs.qiita.com""",467,"""com""",2


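Note that the substring filter also matches unrelated hosts such as `qiitadon.com`. A minimal sketch of a stricter check, using only the standard library: the hypothetical helper `is_site` accepts a domain only if its host equals the target site or is a subdomain of it. The domain list is copied from the output above.

```python
from urllib.parse import urlparse

def is_site(domain, site):
    # Hypothetical helper: True for the site itself or any of its subdomains
    host = urlparse(domain).netloc.lower()
    return host == site or host.endswith("." + site)

domains = [
    "https://qiita.com",
    "http://qiita.com",
    "https://qiitadon.com",
    "https://teams.qiita.com",
    "https://jobs.qiita.com",
]

matches = [d for d in domains if is_site(d, "qiita.com")]
print(matches)
```
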
Now that we know FineWeb has plenty of data from both target websites, let's start scraping URLs.

## Scraping

In this section, we are going to scrape links that are potentially high-quality educational content using the scraping library, **BeautifulSoup4**. We will follow this procedure:

1. Visit the target website. Find the pages where the target content is indexed.
2. Iterate through the index pages, and collect links that match the appropriate schema.

Once we have the links, we can match them with the FineWeb dataframe to find the appropriate IDs.

Let's start by visiting the index page of [Oshiete](https://oshiete.goo.ne.jp/watch/pro/?pg=2).

The important thing to note is the structure of the URL. **Expert** pages are denoted in the path by the term **pro**, and the page number is passed in the `pg` query parameter:

oshiete.goo.ne.jp/watch/**pro**/?pg={page_number}

Scroll to the bottom, and you will see that there are just 22 pages. So, we can make a list to crawl with a simple list comprehension:

In [94]:
urls = [f'https://oshiete.goo.ne.jp/watch/pro/?pg={i}' for i in range(1, 23)]
urls

['https://oshiete.goo.ne.jp/watch/pro/?pg=1',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=2',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=3',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=4',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=5',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=6',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=7',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=8',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=9',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=10',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=11',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=12',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=13',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=14',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=15',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=16',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=17',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=18',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=19',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=20',
 'https://oshiete.goo.ne.jp/watch/pro/?pg=21',
 'https://oshiete.goo.

You can click on those links to make sure they direct you to the expected page.

Next, let's scrape some links from one of the target pages:

In [None]:

# Function to scrape links from a specific page
def scrape_links(
    url, 
    pattern=""):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise HTTPError for bad responses
        
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        links = []

        # Find all <a> tags and extract their href attributes
        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']
            if pattern:
                if re.search(pattern, href):
                    links.append(href)
            else:
                links.append(href)

        return links
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return []

scrape_links(urls[2])

You can see that most of those links are ones that we don't want, so we are going to need to filter them. For this, we can use regular expressions.

The target pages have the path structure, '/watch/entry/{foo}/'

We can use this pattern:
r'/watch/entry/.*/$'
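
Before running the pattern against a live page, it can be sanity-checked offline with `re.search`. The hrefs below are illustrative (the first is the example entry linked earlier). The stricter `strict_pattern` is an assumption on my part: it supposes entry IDs are always 32 hexadecimal characters, which holds for the links shown in this notebook but is not something the site guarantees.

```python
import re

pattern = r'/watch/entry/.*/$'
# Assumption: entry IDs are 32 lowercase hex characters
strict_pattern = r'/watch/entry/[0-9a-f]{32}/$'

# Illustrative hrefs to test against
hrefs = [
    '/watch/entry/aec1879357b6efac2e31a0a0a1d41307/',
    '/watch/pro/?pg=3',
    '/watch/entry/',
    '/watch/entry/foo/bar/',
]

kept = [h for h in hrefs if re.search(pattern, h)]
kept_strict = [h for h in hrefs if re.search(strict_pattern, h)]
print(kept)
print(kept_strict)
```

The looser pattern is more forgiving if the ID format changes, while the stricter one rejects nested paths that happen to sit under `/watch/entry/`.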

In [98]:
scrape_links(
    url = urls[2],
    pattern = r'/watch/entry/.*/$')

['/watch/entry/f8ee77f336084031269bf4d0cd018087/',
 '/watch/entry/f8ee77f336084031269bf4d0cd018087/',
 '/watch/entry/b2c901bb206fac02987eee47859cc177/',
 '/watch/entry/b2c901bb206fac02987eee47859cc177/',
 '/watch/entry/1e9525ca7a1e6b6d3d5725cbcbc49982/',
 '/watch/entry/1e9525ca7a1e6b6d3d5725cbcbc49982/',
 '/watch/entry/c734879afb33755571dac088d739abd7/',
 '/watch/entry/c734879afb33755571dac088d739abd7/',
 '/watch/entry/d08390b6e77798226e2f1429729ca08b/',
 '/watch/entry/d08390b6e77798226e2f1429729ca08b/',
 '/watch/entry/9286b260cfd5d0a99403e01581874f7e/',
 '/watch/entry/9286b260cfd5d0a99403e01581874f7e/',
 '/watch/entry/c66962f63d3cc5973bae94ecae28a8b8/',
 '/watch/entry/c66962f63d3cc5973bae94ecae28a8b8/',
 '/watch/entry/76a751a9754ec731a27224cceacae461/',
 '/watch/entry/76a751a9754ec731a27224cceacae461/',
 '/watch/entry/27f3e84ed646bcc13213a8d09a57f453/',
 '/watch/entry/27f3e84ed646bcc13213a8d09a57f453/',
 '/watch/entry/93cf4f7153754065be6ac96b6d93b79a/',
 '/watch/entry/93cf4f7153754065

Much better. Now let's script up a crawler.

In [99]:
def save_list_to_txt(
        _list,
        path_to_output
):
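    # Merge with previously saved entries without mutating the caller's list
    merged = set(_list)
    if os.path.exists(path_to_output):
        with open(path_to_output, 'r', encoding='utf-8') as f:
            merged.update(f.read().split('\n'))
    merged.discard('')  # ignore blank lines so they are not stored as entries
    with open(path_to_output, 'w', encoding='utf-8') as f:
        f.write('\n'.join(sorted(merged)))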

def crawl(
        urls,
        pattern,
        path_to_output
):
    os.makedirs(os.path.dirname(path_to_output), exist_ok=True)
    
    for url in tqdm(urls, desc = 'Crawling and scraping links'):
        links = scrape_links(url, pattern)
        if links:
            save_list_to_txt(links, path_to_output)
    with open(path_to_output, 'r', encoding ='utf-8') as f:
        links = f.read().split('\n')
        print(f"Crawl complete. {len(links)} links scraped")

The crawler iterates across the index pages, gathers the target links, and saves the list to a text file after each page.

The `crawl` function takes the following arguments:

- **urls**, a list of urls to be scraped
- **pattern**, the regex used to filter the target links
- **path_to_output**, where you will save the list of links.

In [100]:
crawl(
    urls=urls,
    pattern= r'/watch/entry/.*/$',
    path_to_output='links/oshiete.txt'
)

Crawling and scraping links: 100%|██████████| 22/22 [00:17<00:00,  1.25it/s]

Crawl complete. 753 links scraped
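
The scraped hrefs are relative paths, and the single-page output earlier shows each one twice. If you want absolute URLs for manual inspection, here is a minimal sketch with `urllib.parse.urljoin`. The `base` value assumes the `https` scheme, and the links are copied from the earlier output.

```python
from urllib.parse import urljoin

# Assumption: the https version of the site is the one we want to point at
base = 'https://oshiete.goo.ne.jp'
links = [
    '/watch/entry/f8ee77f336084031269bf4d0cd018087/',
    '/watch/entry/f8ee77f336084031269bf4d0cd018087/',
    '/watch/entry/b2c901bb206fac02987eee47859cc177/',
]

# dict.fromkeys removes duplicates while keeping first-seen order
unique_links = list(dict.fromkeys(links))
absolute = [urljoin(base, link) for link in unique_links]
print(absolute)
```
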





Qiita is a bit more complicated, and I will need to scrape using Selenium. I'm going to work on a strategy and update this in the future.

## ID Extraction

In [121]:
def get_ids(
        df: pl.DataFrame,
        domain: str,
        links: list
) -> list:
    # Filter rows where the 'domain' column contains the given domain
    filtered = df.filter(df['domain'].str.contains(domain))
    
    # Create a pattern from the list of links
    pattern = '|'.join(map(re.escape, links))
    
    # Further filter rows where the 'url' column matches the pattern
    filtered = filtered.filter(filtered['url'].str.contains(pattern, literal=False))
    
    # Return the 'id' column as a list
    return filtered['id'].to_list()

def process(
        data_dir,
        domain,
        links,
        path_to_output
):
    os.makedirs(os.path.dirname(path_to_output), exist_ok=True)
    paths = [Path(data_dir, file) for file in os.listdir(data_dir)]
    for path in tqdm(paths, desc = "Finding IDs"):
        df = pl.read_parquet(path)
        ids = get_ids(df, domain, links)
        save_list_to_txt(ids, path_to_output)
    with open(path_to_output, 'r', encoding='utf-8') as f:
        ids = f.read().split('\n')
        print(f"IDs extracted. {len(ids)} found.")

In [122]:
data_dir = 'preprocess/stripped'
with open('links/oshiete.txt', 'r', encoding='utf-8') as f:
    links = f.read().split('\n')

process(
    data_dir=data_dir,
    domain='oshiete',  # matched against the 'domain' column, so pass the site name, not a file name
    links=links,
    path_to_output='ids/oshiete.txt'
)

Finding IDs: 100%|██████████| 148/148 [01:31<00:00,  1.61it/s]

IDs extracted. 341 found.
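
`get_ids` joins every link into one large regex alternation. One thing to watch for: an empty entry in `links` (for example from a trailing blank line) becomes an empty alternative, which can match every URL. As a sketch of an alternative, assuming each scraped link is exactly the path of the target page, you can compare parsed paths against a set instead. The `match_ids` helper and the row IDs below are hypothetical.

```python
from urllib.parse import urlparse

def match_ids(rows, link_paths):
    # Hypothetical helper: rows is an iterable of (id, url) pairs
    wanted = {p for p in link_paths if p}  # drop empty strings from blank lines
    return [row_id for row_id, url in rows if urlparse(url).path in wanted]

# Made-up IDs paired with URLs built from links shown earlier
rows = [
    ("id-001", "https://oshiete.goo.ne.jp/watch/entry/f8ee77f336084031269bf4d0cd018087/"),
    ("id-002", "https://oshiete.goo.ne.jp/watch/pro/?pg=2"),
    ("id-003", "http://oshiete.goo.ne.jp/watch/entry/b2c901bb206fac02987eee47859cc177/"),
]
links = [
    '/watch/entry/f8ee77f336084031269bf4d0cd018087/',
    '/watch/entry/b2c901bb206fac02987eee47859cc177/',
    '',
]

ids = match_ids(rows, links)
print(ids)
```

Exact path matching ignores query strings and the scheme, so it may miss URL variants that the substring approach would catch. Which trade-off is better depends on how the URLs appear in the corpus.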





## Conclusion

Now that we have a list of IDs, we can prepare a sample for annotation with better educational density. This methodology is labor-intensive, but so is data annotation. If we can find viable websites, I think it is worth the effort. To wrap things up, let's segment the corpus by its contributing domains, grouped by the number of pages those domains contributed.

In [7]:
# Function to assign group based on the 'count' column
def assign_group(count):
    return int(log10(count))

# Create the DataFrame and add the 'group' column
df = pl.read_parquet('preprocessed.parquet')
df = df.with_columns(
    pl.col("count").map_elements(lambda x: assign_group(x), return_dtype=pl.Int32).alias("group")
)

# First, we calculate the total sum of 'count' for the entire dataset
total_url_sum = df.select(pl.sum("count")).to_numpy()[0][0]

# Grouping and aggregating
result = (
    df.group_by("group")
      .agg([
          pl.min("count").alias("group_min"),   # Min of 'count' within the group
          pl.max("count").alias("group_max"),   # Max of 'count' within the group
          pl.count("domain").alias("domains"),  # Count of domain entries per group
          pl.sum("count").alias("pages")   # Sum of 'count' per group
      ])
      .with_columns(
          # Adding the 'corpus_perc' column as the percentage of the total sum
          ((pl.col("pages") / total_url_sum * 100).round(2)).alias("corpus_perc")
      )

).sort('group')

print ('Total Domains:', f"{result['domains'].sum():,}")
print ('Total URLs:', f"{result['pages'].sum():,}")
print(result)
result.plot.bar(
    x = 'group',
    y = 'pages'
)

Total Domains: 7,486,452
Total URLs: 376,134,745
shape: (7, 6)
┌───────┬───────────┬───────────┬─────────┬───────────┬─────────────┐
│ group ┆ group_min ┆ group_max ┆ domains ┆ pages     ┆ corpus_perc │
│ ---   ┆ ---       ┆ ---       ┆ ---     ┆ ---       ┆ ---         │
│ i32   ┆ u32       ┆ u32       ┆ u32     ┆ u32       ┆ f64         │
╞═══════╪═══════════╪═══════════╪═════════╪═══════════╪═════════════╡
│ 0     ┆ 1         ┆ 9         ┆ 4917634 ┆ 13986881  ┆ 3.72        │
│ 1     ┆ 10        ┆ 99        ┆ 1984799 ┆ 63838337  ┆ 16.97       │
│ 2     ┆ 100       ┆ 999       ┆ 541777  ┆ 145888934 ┆ 38.79       │
│ 3     ┆ 1000      ┆ 9996      ┆ 40324   ┆ 85025677  ┆ 22.61       │
│ 4     ┆ 10001     ┆ 99243     ┆ 1812    ┆ 44257590  ┆ 11.77       │
│ 5     ┆ 100078    ┆ 908818    ┆ 105     ┆ 21790404  ┆ 5.79        │
│ 6     ┆ 1346922   ┆ 1346922   ┆ 1       ┆ 1346922   ┆ 0.36        │
└───────┴───────────┴───────────┴─────────┴───────────┴─────────────┘
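
The `group` column is simply the order of magnitude of each domain's page count. The sketch below reproduces `assign_group` on the group boundaries from the table, alongside a hypothetical integer-only variant, `assign_group_digits`, that counts digits and avoids floating-point math entirely.

```python
from math import log10

def assign_group(count):
    # Same logic as above: order of magnitude via base-10 logarithm
    return int(log10(count))

def assign_group_digits(count):
    # Hypothetical integer-only variant: number of digits minus one
    return len(str(count)) - 1

# group_min and group_max values taken from the table above
counts = [1, 9, 10, 99, 100, 999, 1000, 9996, 10001, 99243, 100078, 908818, 1346922]

for c in counts:
    print(c, assign_group(c), assign_group_digits(c))
```
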


When we group the domains like this, we can see that a large portion of the corpus is contributed by very large websites. About 23 million pages (roughly 6% of the corpus) were contributed by just 106 domains, and about 67 million pages (roughly 18%) by the top ~1,900 domains. Note that the `http` and `https` versions of a site are counted as separate domains here.

I plan to begin by searching through these large websites. My expectation is that most of them will be of no value. However, if we know we can ignore these pages, excluding them from any future random sample could improve its quality. If I discover that a few of these domains are useful, targeting them could be a quick and effective way to improve the representation of educational material.

Meanwhile, I plan to ask my Japanese friends to recommend high-quality educational websites. I don't have a specific idea in mind. I just plan to ask them for educational material that they would either use themselves or recommend to others.

I am going to classify them based on the system that we are already using:

1. **No Educational Value**: Light news, e-commerce, sports, personal blogs, business pages, etc.
2. **Minimal Educational Value**: Mostly amateur content. With a lot of pruning, potentially a good resource for Minimal or Basic educational content. Ex. QA forums, SEO Content Disguised as Educational Blogs.
Examples: Quora, [Oshiete](https://oshiete.goo.ne.jp/watch/pro/), [RareJob](https://www.rarejob.com/englishlab/)
3. **Basic Educational Value**: Mostly amateur content, but overall, education is the priority. Potentially a good resource for Minimal, Basic, or Good educational content.
Examples: StackOverflow, [Qiita](https://qiita.com/)
4. **Good Educational Value**: Education is clearly the priority, but there may be some issues. Maybe it covers a lot of topics, but it doesn't go into any topic in too much depth, such as Wikipedia. Maybe there is a lot of high-quality content, but also a large amount of non-educational content, such as HuggingFace.
5. **Excellent Educational Value**: The entire website is dedicated to providing education on a certain topic, and it explores that topic in great depth. Randomly sample any page on this website, and you will likely draw one that is Good or Excellent educational value.
Examples: [NLTK Book](https://www.nltk.org/book/), [Imabi](https://imabi.org/)