# Group FineWeb Data by Domain  

Working with FineWeb data is challenging due to the size of the corpus. For example, the Japanese segment of FineWeb consists of roughly 376 million examples, split into 148 files and totaling about 500 GB of data. I am using a laptop with 16 GB of RAM, and I imagine many of you are operating under even stricter constraints. To address this, I am developing a library specifically designed to process FineWeb data using strategies that prioritize memory efficiency.  

The leadership of FineWeb has requested that the community select a sample of data that is potentially high-quality educational material.  

To facilitate my search of the Japanese segment, I have grouped the raw data by domain and consolidated it into a single DataFrame. This approach makes it easy to identify the websites that comprise the FineWeb corpus and how many pages each contributed.  

## Setup  

Before we get started, please install **fineweb_tools**, which will automatically install all dependencies required to run this notebook.  

In [None]:
#pip install fineweb-tools

The Python library, **fineweb_tools**, is purpose-built for FineWeb data. It prioritizes minimal dependencies and memory efficiency.  

DataFrame operations are performed with **polars**, which is installed as a dependency of **fineweb_tools**.  

Let's start by looking at the final output of preprocessing:  

In [1]:
import polars as pl

df = pl.read_parquet('preprocessed/jpn_Jpan.parquet')
print('Total Domains:', df.height)
print('Total Pages', df['count'].sum())
print()
print(df.head(10))

Total Domains: 7486452
Total Pages 376134745

shape: (10, 3)
┌─────────────────────────────────┬─────────┬───────┐
│ domain                          ┆ count   ┆ tld   │
│ ---                             ┆ ---     ┆ ---   │
│ str                             ┆ u32     ┆ str   │
╞═════════════════════════════════╪═════════╪═══════╡
│ http://lineq.jp                 ┆ 1346922 ┆ jp    │
│ https://ameblo.jp               ┆ 908818  ┆ jp    │
│ http://ameblo.jp                ┆ 773217  ┆ jp    │
│ https://oshiete.goo.ne.jp       ┆ 770550  ┆ ne.jp │
│ https://www.amazon.co.jp        ┆ 695980  ┆ co.jp │
│ http://mixi.jp                  ┆ 471151  ┆ jp    │
│ http://news.livedoor.com        ┆ 452297  ┆ com   │
│ https://detail.chiebukuro.yaho… ┆ 408322  ┆ co.jp │
│ http://q.hatena.ne.jp           ┆ 386786  ┆ ne.jp │
│ https://qa.mamari.jp            ┆ 337377  ┆ jp    │
└─────────────────────────────────┴─────────┴───────┘


There are over 7 million unique domains, but a significant portion of the data comes from a small number of prolific contributors. I am reviewing the largest contributors individually to assess their educational value.  

Most of these domains lack educational merit, and excluding them from future samples could improve overall quality.  

However, some websites contain useful material. When I find a promising site, I focus on identifying high-quality pages, scraping their links, and matching them to the FineWeb raw data to extract the corresponding IDs.  

This notebook is a guide for using **fineweb_tools** for preprocessing. Web scraping is covered in another notebook.  

## Structure  

1. **Quickstart**: Set up your environment, run a sanity check, and preprocess your language data.  
2. **Pipeline**: Explains each step of the pipeline in more detail.  
3. **Conclusion**: Briefly analyzes the Japanese FineWeb corpus contents and discusses future directions.  

## Quickstart  

This section will guide you in preprocessing your data without getting too deep into the details.  

1. **Find Your Language**: Query the HF hub to determine the code assigned to your target language.  
2. **Run a Sanity Check**: Run the preprocessing pipeline on a sample to ensure everything is functioning correctly.  
3. **Preprocess**: Run the pipeline on your full language dataset.  

### Find Your Language  

The first step is to find your language code. There are over a thousand different languages in the repository. The function **get_fineweb_languages** queries the repository for the languages it contains.  

It's not always obvious how the language is named. Use a **list comprehension** to search for your language.  

Alternatively, you can visit the [FineWeb2 HF Hub](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and find your language code in the dropdown menu.  


In [2]:
import random
from modules.preprocess import get_fineweb_languages

languages = get_fineweb_languages()
print('Total Languages:', len(languages))
print('Sample:', random.sample(languages, 5))
print()
# Use a list comprehension to find your language.
print('Languages with "jp":', [lang for lang in languages if 'jp' in lang])

Total Languages: 1893
Sample: ['tso_Latn', 'kas_Arab', 'sby_Latn', 'avk_Latn', 'doi_Deva']

Languages with "jp": ['jpn_Jpan', 'bjp_Latn', 'cjp_Latn', 'ljp_Latn']


### Run a Sanity Check  

Preprocessing is a time-consuming operation, and it's best to ensure everything is in order before you preprocess the entire language dataset.  

Run the pipeline with the **sample** argument, and it will operate on the specified number of files. The preprocessing functions are configured to resume from previously completed operations, so no time is lost by running a sanity check first.
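To illustrate the resume behavior, here is a minimal standard-library sketch. It assumes that "already done" simply means a file with the same name exists in the output directory; the helper `pending_files` is hypothetical and is not part of **fineweb_tools**, whose actual check may differ.

```python
from pathlib import Path
import tempfile


def pending_files(input_dir: Path, output_dir: Path) -> list[Path]:
    """Return input files that have no same-named file in output_dir yet."""
    done = {p.name for p in output_dir.glob("*.parquet")}
    return sorted(p for p in input_dir.glob("*.parquet") if p.name not in done)


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    stripped = Path(tmp) / "stripped"
    raw.mkdir()
    stripped.mkdir()
    for name in ("000_00000.parquet", "000_00001.parquet", "000_00002.parquet"):
        (raw / name).touch()
    # Pretend the sanity check already processed the first file.
    (stripped / "000_00000.parquet").touch()

    todo = [p.name for p in pending_files(raw, stripped)]
    print(todo)
```

Only the two files that were not covered by the sanity check remain in `todo`, which is why a full run after a sample run does not repeat work.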

Upon completion, the raw data, intermediates, and outputs will be saved to the following directory structure:  


    root/
    ├── FineWeb/data/{language}/train
    │                            ├── 000_00000.parquet
    │                            ├── 000_00001.parquet
    │                            └── 000_00002.parquet
    ├── intermediate/
    │   ├── stripped/{language}
    │   │               ├── 000_00000.parquet
    │   │               ├── 000_00001.parquet
    │   │               └── 000_00002.parquet
    │   └── grouped/{language}
    │                   ├── 000_00000.parquet
    │                   ├── 000_00001.parquet
    │                   └── 000_00002.parquet
    └── preprocessed/
        └── {language}.parquet

Let's run a sanity check:

In [2]:
from fineweb_tools.preprocess import download_and_preprocess_pipeline

#I run the code on 'bjp_Latn', which is a much smaller dataset than Japanese.
download_and_preprocess_pipeline(
    language = 'bjp_Latn',
    sample = 1
)

Checking FineWeb hub for the language code, "bjp_Latn"...
Running sanity check. 1 paths sampled.


Downloading FineWeb data:   0%|          | 0/1 [00:00<?, ?it/s]

000_00000.parquet:   0%|          | 0.00/222k [00:00<?, ?B/s]

Downloading FineWeb data: 100%|██████████| 1/1 [00:01<00:00,  1.30s/it]


Download complete. Saved raw FineWeb data to FineWeb\data\bjp_Latn\train\data\bjp_Latn\train


Stripping dfs and adding domain column: 100%|██████████| 1/1 [00:00<00:00, 925.28it/s]


Preprocessing complete. 1 processed. Saved files to intermediate\stripped\bjp_Latn


Grouping by domain: 100%|██████████| 1/1 [00:00<00:00, 321.13it/s]


Preprocessing complete. 1 processed. Saved output to intermediate\grouped\bjp_Latn


Combining grouped data: 100%|██████████| 1/1 [00:00<00:00, 75.33it/s]


All DataFrames combined. Adding top-level domain column.
Saved combined dataframe to preprocessed\bjp_Latn.parquet
shape: (10, 3)
┌─────────────────────────────────┬───────┬──────────────┐
│ domain                          ┆ count ┆ tld          │
│ ---                             ┆ ---   ┆ ---          │
│ str                             ┆ u32   ┆ str          │
╞═════════════════════════════════╪═══════╪══════════════╡
│ https://png.bible               ┆ 83    ┆ bible        │
│ https://www.stepbible.org       ┆ 24    ┆ org          │
│ http://akustressgiler.blogspot… ┆ 19    ┆ blogspot.com │
│ https://ebible.org              ┆ 12    ┆ org          │
│ https://www.anagrammen.com      ┆ 2     ┆ com          │
│ http://fatlike.worddetector.co… ┆ 1     ┆ com          │
│ https://en.wikisource.org       ┆ 1     ┆ org          │
│ http://www.anagrammen.com       ┆ 1     ┆ com          │
│ http://zlobek.poznan.pl         ┆ 1     ┆ poznan.pl    │
│ http://glaikit.worddetector.co… ┆ 1     ┆ 

### Preprocess  

Now, we will run the code on the full Japanese dataset.  

The Japanese dataset is too large for my internal hard drive, so I save my language files to an external hard drive using the **local_dir** argument.  

The final step of the pipeline is the most memory-intensive. DataFrames are combined in batches, controlled by the **batch_size** argument. The default value is set low (10) to avoid memory issues. You can increase the batch size to speed up the process, depending on your system's capacity.
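The idea behind batched combining can be sketched with the standard library. This is an illustration under assumptions, not the library's implementation: plain dictionaries stand in for the grouped DataFrames, the domains are made-up placeholders, and `combine_in_batches` is a hypothetical name. The point is that only one batch of per-file results is merged at a time before being folded into the running total.

```python
from collections import Counter
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def combine_in_batches(per_file_counts, batch_size=10):
    """Merge per-file {domain: count} mappings one batch at a time."""
    total = Counter()
    for batch in batched(per_file_counts, batch_size):
        partial = Counter()
        for counts in batch:
            partial.update(counts)  # Counter.update adds counts together
        total.update(partial)
    return total


# Made-up per-file counts for three files.
files = [
    {"https://a.example": 80, "https://b.example": 5},
    {"https://a.example": 3, "https://b.example": 7},
    {"https://c.example": 24},
]
combined = combine_in_batches(files, batch_size=2)
print(combined.most_common())
```

A larger batch size means fewer merge passes but more data held in memory at once, which is the trade-off the **batch_size** argument exposes.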

In [None]:
download_and_preprocess_pipeline(
    local_dir='D:/FineWeb',
    language = 'jpn_Jpan',
    batch_size = 30 ## Works well with 16GB of RAM
)

Checking FineWeb hub for the language code, "jpn_Jpan"...


Downloading FineWeb data: 100%|██████████| 148/148 [00:00<00:00, 12487.82it/s]


Download complete. Saved raw FineWeb data to D:\FineWeb\data\jpn_Jpan\train


Stripping dfs and adding domain column: 100%|██████████| 148/148 [00:00<00:00, 14513.50it/s]


Preprocessing complete. 148 processed. Saved files to intermediate\stripped\jpn_Jpan


Grouping by domain: 100%|██████████| 148/148 [00:00<00:00, 14633.25it/s]


Preprocessing complete. 148 processed. Saved output to intermediate\grouped\jpn_Jpan


Combining grouped data: 100%|██████████| 5/5 [00:44<00:00,  8.87s/it]


All DataFrames combined. Adding top-level domain column.
Saved combined dataframe to preprocessed\jpn_Jpan.parquet
shape: (10, 3)
┌─────────────────────────────────┬─────────┬───────┐
│ domain                          ┆ count   ┆ tld   │
│ ---                             ┆ ---     ┆ ---   │
│ str                             ┆ u32     ┆ str   │
╞═════════════════════════════════╪═════════╪═══════╡
│ http://lineq.jp                 ┆ 1346922 ┆ jp    │
│ https://ameblo.jp               ┆ 908818  ┆ jp    │
│ http://ameblo.jp                ┆ 773217  ┆ jp    │
│ https://oshiete.goo.ne.jp       ┆ 770550  ┆ ne.jp │
│ https://www.amazon.co.jp        ┆ 695980  ┆ co.jp │
│ http://mixi.jp                  ┆ 471151  ┆ jp    │
│ http://news.livedoor.com        ┆ 452297  ┆ com   │
│ https://detail.chiebukuro.yaho… ┆ 408322  ┆ co.jp │
│ http://q.hatena.ne.jp           ┆ 386786  ┆ ne.jp │
│ https://qa.mamari.jp            ┆ 337377  ┆ jp    │
└─────────────────────────────────┴─────────┴───────┘


Finished! Your language data is preprocessed.

The next section explains the pipeline in more detail.

## Pipeline

Preprocessing involves four steps:

1. **Download**: Collect the language data from the HF Hub.
2. **Strip**: Remove unnecessary columns and add the domain column.
3. **Group**: Group each DataFrame by domain.
4. **Combine**: Combine the grouped data into a single DataFrame and add the top-level domain in the **tld** column.

For this tutorial, I'm going to use a very small language group, **bjp_Latn**.

### Download Data from HF Hub

To download data from FineWeb, all you need is the code for your target language. Some of the language sets are very large, so I set the function up to raise an error unless the input language code exactly matches one from FineWeb.  

To learn how to find your language code, please check the Quickstart section at the beginning.  

This is a very straightforward function. It takes the following arguments:  
- **language (str)**: The code for your target language.  
- **local_dir (str)**: Where to save your FineWeb data. Defaults to 'FineWeb'.  
- **sample (int)**: Optional. If provided, will sample N paths. Useful for a sanity check.  
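The exact-match check on the language code could look something like the sketch below. The function name `validate_language`, the use of `ValueError`, and the "similar codes" hint are assumptions for illustration; the text above only states that an error is raised when the code does not match exactly.

```python
def validate_language(code: str, available: list[str]) -> str:
    """Return code unchanged if it is an exact match, else raise ValueError."""
    if code not in available:
        # Suggest near misses to help with typos; matching itself stays strict.
        hints = [lang for lang in available if code.lower() in lang.lower()]
        raise ValueError(f"Unknown language code {code!r}. Similar codes: {hints}")
    return code


available = ["jpn_Jpan", "bjp_Latn", "cjp_Latn", "ljp_Latn"]
print(validate_language("bjp_Latn", available))

try:
    validate_language("jpn", available)
except ValueError as err:
    message = str(err)
    print(message)
```

Failing early on a partial code such as `jpn` avoids starting a very large download for the wrong dataset.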

In [6]:
from modules.preprocess import download_fineweb_data

download_fineweb_data(
    language = 'bjp_Latn',
)

Checking FineWeb hub for the language code, "bjp_Latn"...


Downloading FineWeb data:   0%|          | 0/1 [00:00<?, ?it/s]

000_00000.parquet:   0%|          | 0.00/222k [00:00<?, ?B/s]

Downloading FineWeb data: 100%|██████████| 1/1 [00:01<00:00,  1.29s/it]

Download complete. Saved raw FineWeb data to FineWeb\data\bjp_Latn\train





Using the default settings, your directory will look like this:

    root/
    ├── FineWeb/data/{language}/train
                                 └── 000_00000.parquet

### Strip

The first step in preprocessing is to remove unnecessary data in order to lighten the memory load.

Let's start by looking at the data:

In [26]:
import polars as pl

#Use the polars function, read_parquet_schema, to quickly load columns and datatypes
pl.read_parquet_schema('FineWeb/data/bjp_Latn/train/000_00000.parquet')

{'text': String,
 'id': String,
 'dump': String,
 'url': String,
 'date': String,
 'file_path': String,
 'language': String,
 'language_score': Float64,
 'language_script': String,
 'minhash_cluster_size': Int64,
 'top_langs': String}

There are a lot of columns, and most of them are not useful for our purposes.

- **text**: The main data. The text contents of the webpage.
- **id**: Identifies the row in the FineWeb corpus.
- **dump**: Group of files from Common Crawl that the text was extracted from.
- **url**: URL from which the text was extracted.
- **date**: The date the crawl was performed.
- **file_path**: Exact file from which the text was extracted.
- **language**: Most likely language.
- **language_score**: Confidence that the identified language is the true language.
- **language_script**: Character family used to write this language.
- **minhash_cluster_size**: How many other text samples were marked as near-duplicates.
- **top_langs**: Rank-ordered dictionary of languages and their confidence scores.

All we need to group by domain is the **url** column.

Additionally, I set up the default settings to keep the **date** and **id**. This will lighten the load enough so I can store these files on my internal hard drive.  

Now, I have a lightweight option to query IDs and analyze the time distribution.

Next, let's strip the data and add the domain column.

The default settings keep the columns `['url', 'id', 'date']`. You can select the columns you want. However, if you don't select the **url** column, or if you select a column that does not exist, it will raise an error.  
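As a rough sketch of what this step involves, the standard-library code below validates the selected columns and derives a domain from a URL. The helper names are hypothetical, the sample URL is made up, and the `scheme://host` form of the domain is inferred from the output tables in this notebook rather than from the library's source.

```python
from urllib.parse import urlsplit

REQUIRED = "url"


def check_columns(selected, schema):
    """Raise if 'url' is missing or a selected column is not in the schema."""
    if REQUIRED not in selected:
        raise ValueError("The 'url' column is required to derive the domain.")
    unknown = [col for col in selected if col not in schema]
    if unknown:
        raise ValueError(f"Columns not found in the data: {unknown}")


def url_to_domain(url: str) -> str:
    """Keep only the scheme and host of a full URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


schema = ["text", "id", "dump", "url", "date", "file_path"]
check_columns(["url", "id", "date"], schema)  # passes silently

# Made-up URL used only to show the transformation.
domain = url_to_domain("https://www.example.com/lessons/1?page=2")
print(domain)
```

Because the scheme is kept, `http://` and `https://` versions of the same host are treated as separate domains, which matches the separate `http://ameblo.jp` and `https://ameblo.jp` rows in the Japanese output.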

In [35]:
from modules.preprocess import strip_and_add_domain_pipeline

strip_and_add_domain_pipeline(
    local_dir="FineWeb",
    language = 'bjp_Latn'
)

print(pl.read_parquet('intermediate/stripped/bjp_Latn/000_00000.parquet').head(5))

Stripping dfs and adding domain column: 100%|██████████| 1/1 [00:00<00:00, 996.75it/s]

Preprocessing complete. 1 processed. Saved files to intermediate\stripped\bjp_Latn
shape: (5, 4)
┌─────────────────────────┬──────────────────────┬────────────────────────┬────────────────────────┐
│ id                      ┆ date                 ┆ url                    ┆ domain                 │
│ ---                     ┆ ---                  ┆ ---                    ┆ ---                    │
│ str                     ┆ str                  ┆ str                    ┆ str                    │
╞═════════════════════════╪══════════════════════╪════════════════════════╪════════════════════════╡
│ <urn:uuid:e7538e04-663b ┆ 2014-07-24T02:34:00Z ┆ http://akustressgiler. ┆ http://akustressgiler. │
│ -476b-8…                ┆                      ┆ blogspot…              ┆ blogspot…              │
│ <urn:uuid:6ffe8cb5-13a2 ┆ 2014-07-24T22:59:29Z ┆ http://akustressgiler. ┆ http://akustressgiler. │
│ -4497-a…                ┆                      ┆ blogspot…              ┆ blogspot…          




Now your directory should look like this:

    root/
    ├── FineWeb/data/{language}/train
    │                           └── 000_00000.parquet
    └── intermediate/
        └── stripped/{language}
                       └── 000_00000.parquet

Next, let's group the data.

### Group

The **group_pipeline** function iterates through the stripped DataFrames and groups each one by domain:

In [38]:
from modules.preprocess import group_pipeline

group_pipeline(
    language = 'bjp_Latn'
)

print (pl.read_parquet('intermediate/grouped/bjp_Latn'))

Grouping by domain: 100%|██████████| 1/1 [00:00<?, ?it/s]

Preprocessing complete. 1 processed. Saved output to intermediate\grouped\bjp_Latn
shape: (20, 2)
┌─────────────────────────────────┬───────┐
│ domain                          ┆ count │
│ ---                             ┆ ---   │
│ str                             ┆ u32   │
╞═════════════════════════════════╪═══════╡
│ https://www.bible.com           ┆ 1     │
│ https://loginseekgo.com         ┆ 1     │
│ http://fatlike.worddetector.co… ┆ 1     │
│ http://zlobek.poznan.pl         ┆ 1     │
│ http://global-cheap-hotels.com  ┆ 1     │
│ …                               ┆ …     │
│ https://ebible.org              ┆ 12    │
│ http://akustressgiler.blogspot… ┆ 19    │
│ https://png.bible               ┆ 83    │
│ http://talkie.worddetector.com  ┆ 1     │
│ http://www.anagrammen.com       ┆ 1     │
└─────────────────────────────────┴───────┘





This function groups stripped files into dataframes with 2 columns:
- **domain**: The domain from which URLs were extracted.
- **count**: The number of pages that domain contributed.
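Conceptually, the grouping is a count of rows per domain. The sketch below shows that idea with `collections.Counter` on made-up placeholder domains; it is a standard-library illustration, not the polars code the library runs.

```python
from collections import Counter

# One entry per page; the domains are made-up placeholders.
domains = [
    "https://a.example",
    "https://b.example",
    "https://a.example",
    "http://a.example",
    "https://a.example",
]

counts = Counter(domains)

# Sort by page count, largest first, to mirror a domain/count table.
rows = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for domain, count in rows:
    print(domain, count)
```

Each stripped file produces one such small table, and those tables are what the final step merges.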

Now, your directory will look like this:

    root/
    ├── FineWeb/data/{language}/train
    │                           └── 000_00000.parquet
    └── intermediate/
        ├── stripped/{language}
        │               └── 000_00000.parquet
        └── grouped/{language}
                        └── 000_00000.parquet

### Combine

Finally, let's combine our DataFrames:

In [39]:
from modules.preprocess import combine_grouped_dfs

combine_grouped_dfs(
    language = 'bjp_Latn'
)

print(pl.read_parquet('preprocessed/bjp_Latn.parquet').head(10))

Combining grouped data: 100%|██████████| 1/1 [00:00<00:00, 998.41it/s]

All DataFrames combined. Adding top-level domain column.
Saved combined dataframe to preprocessed\bjp_Latn.parquet
shape: (10, 3)
┌─────────────────────────────────┬───────┬──────────────┐
│ domain                          ┆ count ┆ tld          │
│ ---                             ┆ ---   ┆ ---          │
│ str                             ┆ u32   ┆ str          │
╞═════════════════════════════════╪═══════╪══════════════╡
│ https://png.bible               ┆ 83    ┆ bible        │
│ https://www.stepbible.org       ┆ 24    ┆ org          │
│ http://akustressgiler.blogspot… ┆ 19    ┆ blogspot.com │
│ https://ebible.org              ┆ 12    ┆ org          │
│ https://www.anagrammen.com      ┆ 2     ┆ com          │
│ http://tealike.worddetector.co… ┆ 1     ┆ com          │
│ http://zlobek.poznan.pl         ┆ 1     ┆ poznan.pl    │
│ http://faiking.worddetector.co… ┆ 1     ┆ com          │
│ http://fatlike.worddetector.co… ┆ 1     ┆ com          │
│ https://www.bible.com           ┆ 1     ┆ 




In this test case, there is only one data file, so the output is similar to the previous one. This function also sorts the domains by page count and adds the **Top-Level Domain** column, **tld**.

TLD is particularly useful in cases where language confidence is low. Please refer to this [blog post](https://danielvanstrien.xyz/posts/2024/12/23/fineweb-filter-polars.html).
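The **tld** values shown above include multi-label suffixes such as `co.jp` and `blogspot.com`, which suggests a Public Suffix List style of lookup. The sketch below shows that idea with a tiny hand-written suffix set. Both the set and the function `extract_tld` are assumptions for illustration; a real implementation would need the full Public Suffix List, and the library's actual method may differ.

```python
from urllib.parse import urlsplit

# Tiny illustrative subset; a real implementation would need the full
# Public Suffix List rather than this hand-written set.
MULTI_LABEL_SUFFIXES = {"co.jp", "ne.jp", "blogspot.com", "poznan.pl"}


def extract_tld(domain: str) -> str:
    """Return the longest known suffix of the host, else its last label."""
    host = urlsplit(domain).hostname or ""
    labels = host.split(".")
    # Try the longest candidate first by starting from the leftmost label.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in MULTI_LABEL_SUFFIXES:
            return candidate
    return labels[-1]


for d in ["https://www.amazon.co.jp", "http://mixi.jp", "https://ebible.org"]:
    print(d, "->", extract_tld(d))
```

Splitting on the last dot alone would report `jp` for `www.amazon.co.jp`, which is why a suffix lookup is needed to reproduce values like `co.jp`.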

This is the final step in preprocessing, and your directory should look like this:

    root/
    ├── FineWeb/data/{language}/train
    │                           └── 000_00000.parquet
    ├── intermediate/
    │   ├── stripped/{language}
    │   │               └── 000_00000.parquet
    │   └── grouped/{language}
    │                   └── 000_00000.parquet
    └── preprocessed/
        └── {language}.parquet

## Conclusion

To conclude this tutorial, here is an overview of the Japanese segment of FineWeb, with domains grouped by the number of pages they contributed.

In [None]:
from modules.analyze import group_domains_by_count

df = pl.read_parquet('preprocessed/jpn_Jpan.parquet')
result = group_domains_by_count(df)
print(result)
result.plot.bar(
    x='group',
    y='pages'
)

Total Domains: 7,486,452
Total URLs: 376,134,745
shape: (7, 6)
┌───────┬───────────┬───────────┬─────────┬───────────┬─────────────┐
│ group ┆ group_min ┆ group_max ┆ domains ┆ pages     ┆ corpus_perc │
│ ---   ┆ ---       ┆ ---       ┆ ---     ┆ ---       ┆ ---         │
│ i32   ┆ u32       ┆ u32       ┆ u32     ┆ u32       ┆ f64         │
╞═══════╪═══════════╪═══════════╪═════════╪═══════════╪═════════════╡
│ 0     ┆ 1         ┆ 9         ┆ 4917634 ┆ 13986881  ┆ 3.72        │
│ 1     ┆ 10        ┆ 99        ┆ 1984799 ┆ 63838337  ┆ 16.97       │
│ 2     ┆ 100       ┆ 999       ┆ 541777  ┆ 145888934 ┆ 38.79       │
│ 3     ┆ 1000      ┆ 9996      ┆ 40324   ┆ 85025677  ┆ 22.61       │
│ 4     ┆ 10001     ┆ 99243     ┆ 1812    ┆ 44257590  ┆ 11.77       │
│ 5     ┆ 100078    ┆ 908818    ┆ 105     ┆ 21790404  ┆ 5.79        │
│ 6     ┆ 1346922   ┆ 1346922   ┆ 1       ┆ 1346922   ┆ 0.36        │
└───────┴───────────┴───────────┴─────────┴───────────┴─────────────┘


When we group the domains like this, we see that a significant portion of the corpus comes from a small number of very large websites. For example, the 105 domains in group 5 contributed about 22 million pages (5.79% of the corpus), and the 1,812 domains in group 4 contributed about 44 million pages (11.77%). Together with the single largest domain, these roughly 1,900 websites account for about 18% of the corpus.
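The arithmetic behind these shares can be checked directly from the table. The sketch below copies the `domains` and `pages` columns from the output above. The `magnitude_group` helper is an assumption: the `group_min` and `group_max` columns suggest that domains are bucketed by the order of magnitude of their page count, but the actual logic of `group_domains_by_count` is not shown here.

```python
# (domains, pages) per group, copied from the table above.
groups = {
    0: (4917634, 13986881),
    1: (1984799, 63838337),
    2: (541777, 145888934),
    3: (40324, 85025677),
    4: (1812, 44257590),
    5: (105, 21790404),
    6: (1, 1346922),
}

total_pages = sum(pages for _, pages in groups.values())


def magnitude_group(count: int) -> int:
    """Assumed bucketing rule: number of digits minus one (floor of log10)."""
    return len(str(count)) - 1


# Domains with at least 10,000 pages fall in groups 4, 5, and 6.
large = [g for g in groups if g >= 4]
large_domains = sum(groups[g][0] for g in large)
large_pages = sum(groups[g][1] for g in large)
share = round(100 * large_pages / total_pages, 2)

print(total_pages, large_domains, large_pages, share)
```

The total matches the 376,134,745 pages reported for the corpus, and the 1,918 largest domains together hold a little under 18% of it.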

I plan to begin by examining these large websites. My expectation is that most of them will hold little to no value. However, if we can identify and exclude these low-value pages in future random samples, it could improve the overall quality of the corpus. Conversely, if a few of these domains prove to be useful, targeting them could be a quick and effective way to enhance the representation of educational material.

In addition, I plan to ask my Japanese friends to recommend high-quality educational websites. I don't have a specific approach in mind; I will simply ask them to share educational materials they personally use or would recommend to others.

I will classify these websites using the system we have already established:

1. **No Educational Value**: Light news, e-commerce, sports, personal blogs, business pages, etc.
2. **Minimal Educational Value**: Mostly amateur content. With a lot of pruning, potentially a good resource for Minimal or Basic educational content, such as Q&A forums and SEO content disguised as educational blogs.
   Examples: Quora, [Oshiete](https://oshiete.goo.ne.jp/watch/pro/), [RareJob](https://www.rarejob.com/englishlab/)
3. **Basic Educational Value**: Mostly amateur content, but overall, education is the priority. Potentially a good resource for Minimal, Basic, or Good educational content.
   Examples: StackOverflow, [Qiita](https://qiita.com/)
4. **Good Educational Value**: Education is clearly the priority, but there may be some issues. The site may cover many topics without going into any of them in much depth, such as Wikipedia. Or it may have a lot of high-quality content alongside a large amount of non-educational content, such as Hugging Face.
5. **Excellent Educational Value**: The entire website is dedicated to providing education on a certain topic, and it explores that topic in great depth. Randomly sample any page on this website, and you will likely draw one of Good or Excellent educational value.
   Examples: [NLTK Book](https://www.nltk.org/book/), [Imabi](https://imabi.org/)

If this strategy of targeting specific websites and querying the existing FineWeb repository proves effective, I propose soliciting the community for additional high-quality website contributions in their native languages.

Thank you for reading. I will continue to update this notebook, and I hope it was useful for you.