Implement FineWebSecurity module for cybersecurity data filtering by naufalso · Pull Request #14 · RISys-Lab/RedSage

naufalso · 2026-05-12T10:30:16Z

Added core functionalities for filtering web data using BERT/ModernBERT classifiers.
Introduced command line interfaces for checking progress and downloading dataset subsets.
Developed dataset handling capabilities, including loading, iterating, and saving parquet files.
Implemented progress tracking for filtering operations.
Created utility functions for managing file persistence and JSON document handling.
Added comprehensive tests for CLI, dataset operations, filtering logic, and progress management.
Updated README with module details and usage instructions.


Copilot

Pull request overview

Adds a new data/FineWebSecurity module that filters the Hugging Face FineWeb dataset with a cybersecurity BERT/ModernBERT classifier, including CLIs, persistence/progress utilities, and a Docker + tmux-based parallel runner.

Changes:

Introduces fineweb_security package code for dataset access, BERT inference, persistence, progress tracking, and optional Hugging Face Hub upload.
Adds CLI entrypoints for filtering, downloading subsets, and checking progress (plus compatibility wrapper scripts).
Adds a Dockerfile, shell scripts, config lists, and pytest coverage for core behaviors.

Reviewed changes

Copilot reviewed 30 out of 31 changed files in this pull request and generated 10 comments.

File	Description
README.md	Marks the cybersecurity-filtering code as released.
data/README.md	Documents data modules and links FineWebSecurity docs/dataset.
data/FineWebSecurity/README.md	FineWebSecurity installation + usage documentation.
data/FineWebSecurity/README_Docker.md	Docker-specific usage instructions.
data/FineWebSecurity/requirements.txt	Python dependencies for FineWebSecurity tooling.
data/FineWebSecurity/Dockerfile	Container build for running filtering pipeline.
data/FineWebSecurity/.gitignore	Ignores FineWebSecurity local artifacts (outputs/cache/etc.).
data/FineWebSecurity/config/fineweb_config.txt	List of FineWeb subset configs used by tmux runner.
data/FineWebSecurity/accelerate/default_config.yaml	Accelerate config scaffold for local runs.
data/FineWebSecurity/scripts/parallel_lib.sh	tmux-based parallel execution helper library.
data/FineWebSecurity/scripts/filter_fineweb_bert.sh	Multi-GPU/multi-subset tmux launcher for filtering.
data/FineWebSecurity/src/__init__.py	Marks `src/` as a Python package root (empty).
data/FineWebSecurity/src/filter_fineweb_bert_map.py	Compatibility wrapper to invoke filter CLI.
data/FineWebSecurity/src/download_subset.py	Compatibility wrapper to invoke download CLI.
data/FineWebSecurity/src/check_fineweb_progress.py	Compatibility wrapper to invoke progress-check CLI.
data/FineWebSecurity/src/fineweb_security/__init__.py	Package initializer.
data/FineWebSecurity/src/fineweb_security/bert.py	Model loading, warmup, and batch prediction utilities.
data/FineWebSecurity/src/fineweb_security/datasets.py	FineWeb parquet listing, iteration, and downloading helpers.
data/FineWebSecurity/src/fineweb_security/persistence.py	JSON doc persistence and parquet writing utilities.
data/FineWebSecurity/src/fineweb_security/progress.py	Progress file read/write helpers.
data/FineWebSecurity/src/fineweb_security/hub.py	Hugging Face Hub token/branch management + upload helper.
data/FineWebSecurity/src/fineweb_security/cli/__init__.py	CLI package marker.
data/FineWebSecurity/src/fineweb_security/cli/filter_bert.py	Main filtering CLI pipeline implementation.
data/FineWebSecurity/src/fineweb_security/cli/download_subset.py	CLI for downloading FineWeb subset shards.
data/FineWebSecurity/src/fineweb_security/cli/check_progress.py	CLI for reporting filtering progress.
data/FineWebSecurity/tests/conftest.py	Test path setup for `src/` layout.
data/FineWebSecurity/tests/test_cli.py	CLI parser help smoke tests.
data/FineWebSecurity/tests/test_datasets.py	Tests remaining-work detection logic.
data/FineWebSecurity/tests/test_filter_pipeline.py	Tests batch filtering logic at threshold.
data/FineWebSecurity/tests/test_persistence.py	Tests JSON persistence + parquet output behavior.
data/FineWebSecurity/tests/test_progress.py	Tests progress read/write round-trip behavior.


+    return Progress(
+        parquet_idx=int(content.get("parquet_idx", 0) or 0),
+        parquet_sample_idx=int(content.get("parquet_sample_idx", 0) or 0),
+    )
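
The `int(content.get(...) or 0)` pattern in the hunk above still raises when the progress file contains something other than a JSON object, or a value that cannot be converted to an integer. A minimal sketch of more defensive parsing follows. It assumes a `Progress` dataclass with the two fields shown in the hunk; `parse_progress` and `_to_non_negative_int` are hypothetical names used for illustration, not the PR's actual code.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class Progress:
    # Field names mirror the hunk above; the real class may differ.
    parquet_idx: int = 0
    parquet_sample_idx: int = 0


def _to_non_negative_int(value: Any) -> int:
    """Coerce a JSON value to an int >= 0, falling back to 0 on bad input."""
    try:
        number = int(value)
    except (TypeError, ValueError, OverflowError):
        return 0
    return max(number, 0)


def parse_progress(raw_text: str) -> Progress:
    """Hypothetical helper: parse progress JSON, tolerating malformed content."""
    try:
        content = json.loads(raw_text)
    except json.JSONDecodeError:
        return Progress()
    if not isinstance(content, dict):
        # A list, string, or number at the top level carries no usable state.
        return Progress()
    return Progress(
        parquet_idx=_to_non_negative_int(content.get("parquet_idx")),
        parquet_sample_idx=_to_non_negative_int(content.get("parquet_sample_idx")),
    )
```

Falling back to a fresh `Progress()` restarts the subset from the beginning, which trades some recomputation for never crashing on a corrupt file. Whether that trade is acceptable depends on how expensive a restart is.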


+    if compile_model and hasattr(torch, "compile") and torch.__version__ >= "2.0.0" and device == "cuda":
+        try:
+            logger.info("Compiling model with torch.compile().")
+            model = torch.compile(model)
+        except Exception as exc:
+            logger.warning("torch.compile failed (%s). Continuing without compilation.", exc)
+
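
The guard above compares `torch.__version__` as a string, and string comparison is lexicographic: `"10.0.0" >= "2.0.0"` evaluates to false. A sketch of a numeric comparison using only the standard library is shown below. `version_tuple` and `supports_compile` are hypothetical helpers, and the version string is passed in explicitly so the snippet runs without torch installed. `packaging.version.parse` is a more complete alternative if that dependency is available.

```python
import re
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading dotted integers, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def supports_compile(torch_version: str, minimum: str = "2.0.0") -> bool:
    """Hypothetical check: True when torch_version is numerically >= minimum."""
    current = version_tuple(torch_version)
    required = version_tuple(minimum)
    # Pad with zeros so '2.0' and '2.0.0' compare as equal.
    width = max(len(current), len(required))
    current += (0,) * (width - len(current))
    required += (0,) * (width - len(required))
    return current >= required
```

In the loader this could replace the string comparison, for example `supports_compile(torch.__version__)`, with the `hasattr(torch, "compile")` check kept as the primary guard.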


+def warmup_model(model: torch.nn.Module, tokenizer: Any, batch_size: int, max_length: int, device: str) -> None:
+    dummy_input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, max_length))
+    dummy_attention_mask = torch.ones((batch_size, max_length))
+    predict_batch(
+        {
+            "input_ids": dummy_input_ids.tolist(),
+            "attention_mask": dummy_attention_mask.tolist(),
+        },


+                )
+                if offset == 0 and start_sample_idx > 0:
+                    logger.info("Skipping %d examples in %s.", start_sample_idx, parquet_file)
+                    dataset = dataset.skip(start_sample_idx)


+            logger.info("Running inference for %s.", parquet_file)
+            try:
+                tokenized_dataset.map(
+                    lambda batch, batch_indices, rank_idx: process_batch(
+                        batch,
+                        models,
+                        batch_indices,
+                        rank_idx,
+                        args.threshold,
+                        current_output_dir,
+                        save_queue,
+                        args.save_frequency,
+                        progress_file,
+                        parquet_idx,
+                    ),
+                    batched=True,
+                    batch_size=args.batch_size,
+                    with_indices=True,
+                    with_rank=True,
+                    desc=f"Processing {parquet_file}",
+                    num_proc=args.parallel_worker,
+                )


+    pip install torch && \
+    pip install --no-cache-dir -r requirements.txt


+    current_batch_number = last_processed_index // max(len(batch_indices), 1)
+    if current_batch_number % save_frequency == 0:
+        save_progress(progress_file, parquet_idx, last_processed_index)
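
The hunk above derives the batch number by dividing by the length of the current batch. One possible concern is that a final batch shorter than `--batch_size` changes the divisor: an index of 1379 with a 100-row tail gives batch 13, not 2. A sketch of an alternative that divides the first index of the batch by the configured batch size is below. `completed_batches` and `should_save_progress` are hypothetical names, and the "save after every N completed batches" rule is an assumption about the intended behavior.

```python
def completed_batches(first_index: int, batch_size: int) -> int:
    """Batches finished once the batch starting at first_index is done."""
    return first_index // max(batch_size, 1) + 1


def should_save_progress(first_index: int, batch_size: int, save_frequency: int) -> bool:
    """Hypothetical rule: save after every `save_frequency` completed batches."""
    if save_frequency <= 0:
        # Guard against modulo by zero when saving is disabled or misconfigured.
        return False
    return completed_batches(first_index, batch_size) % save_frequency == 0
```

Because the result depends only on the batch's starting index and the configured size, re-running the same batch after a resume yields the same decision.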


+    parser.add_argument("--dataset_subset", type=str, default="CC-MAIN-2024-18")
+    parser.add_argument("--dataset_size", type=int, default=-1)
+    parser.add_argument("--batch_size", type=int, default=640)
+    parser.add_argument("--start_idx", type=int, default=0)
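
Copilot's follow-up in this thread mentions "wiring --start_idx", which suggests the flag was parsed but not used. A minimal sketch of one way to connect it to the resume logic follows. The four arguments are copied from the hunk above; `build_parser` and `resolve_start_sample_idx` are hypothetical, and the "resume from whichever point is further along" rule is an assumption, not the behavior confirmed by the PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser containing only the four arguments shown in the hunk above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset_subset", type=str, default="CC-MAIN-2024-18")
    parser.add_argument("--dataset_size", type=int, default=-1)
    parser.add_argument("--batch_size", type=int, default=640)
    parser.add_argument("--start_idx", type=int, default=0)
    return parser


def resolve_start_sample_idx(saved_sample_idx: int, start_idx: int) -> int:
    """Hypothetical rule: start from the later of saved progress and --start_idx."""
    return max(saved_sample_idx, start_idx, 0)
```

With this rule, `--start_idx 100` on a fresh run skips the first 100 samples, while a run that already saved progress at sample 250 continues from 250.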


+import numpy as np
+import torch
+from multiprocess import Manager, set_start_method
+from tqdm import tqdm


+import json
+


naufalso · 2026-05-12T11:42:21Z

@copilot apply changes based on the comments in this thread

Agent-Logs-Url: https://github.com/RISys-Lab/RedSage/sessions/c31d0317-5451-449f-969e-2afe0b13ae2c Co-authored-by: naufalso <51362638+naufalso@users.noreply.github.com>

Copilot · 2026-05-12T12:03:07Z

@copilot apply changes based on the comments in this thread

Implemented the requested fixes from the review thread in commits 27e0a50 and 8ae3b51. This includes robust progress parsing, safer warmup/compile handling, non-streaming resume fixes, idempotent progress indexing, wiring --start_idx, removing unsafe multiprocessing model sharing during inference, Docker CUDA torch install, and updated tests (all passing).

naufalso requested a review from Copilot May 12, 2026 10:30

Copilot started reviewing on behalf of naufalso May 12, 2026 10:30

Copilot AI reviewed May 12, 2026

Copilot started work on behalf of naufalso May 12, 2026 11:42

Copilot AI and others added 2 commits May 12, 2026 12:00

fix FineWebSecurity review issues from PR thread

27e0a50

refine inference safety updates for clarity

8ae3b51

Copilot finished work on behalf of naufalso May 12, 2026 12:04

naufalso wants to merge 3 commits into main from dev/cybersecurity-data-filtering
