Implement FineWebSecurity module for cybersecurity data filtering #14

Open: naufalso wants to merge 3 commits into main from dev/cybersecurity-data-filtering

@naufalso
Collaborator

  • Added core functionalities for filtering web data using BERT/ModernBERT classifiers.
  • Introduced command line interfaces for checking progress and downloading dataset subsets.
  • Developed dataset handling capabilities, including loading, iterating, and saving parquet files.
  • Implemented progress tracking for filtering operations.
  • Created utility functions for managing file persistence and JSON document handling.
  • Added comprehensive tests for CLI, dataset operations, filtering logic, and progress management.
  • Updated README with module details and usage instructions.

Contributor

Copilot AI left a comment

Pull request overview

Adds a new data/FineWebSecurity module that filters the Hugging Face FineWeb dataset with a cybersecurity BERT/ModernBERT classifier, including CLIs, persistence/progress utilities, and a Docker + tmux-based parallel runner.

Changes:

  • Introduces fineweb_security package code for dataset access, BERT inference, persistence, progress tracking, and optional Hugging Face Hub upload.
  • Adds CLI entrypoints for filtering, downloading subsets, and checking progress (plus compatibility wrapper scripts).
  • Adds a Dockerfile, shell scripts, config lists, and pytest coverage for core behaviors.
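
A minimal sketch of the threshold-based filtering step described in the overview, written without any model dependency. It assumes a two-class classifier whose index 1 is the cybersecurity class; `softmax`, `filter_batch`, the label index, and the default threshold are illustrative names and values, not taken from the PR.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def filter_batch(
    texts: Sequence[str],
    batch_logits: Sequence[Sequence[float]],
    threshold: float = 0.5,
    positive_index: int = 1,  # assumption: index 1 is the cybersecurity class
) -> List[Dict[str, object]]:
    """Keep texts whose positive-class probability is at least `threshold`."""
    kept = []
    for text, logits in zip(texts, batch_logits):
        score = softmax(logits)[positive_index]
        if score >= threshold:
            kept.append({"text": text, "score": score})
    return kept
```

In the real pipeline the logits would come from the BERT/ModernBERT classifier; here they are plain lists so the thresholding logic can be read and tested in isolation.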

Reviewed changes

Copilot reviewed 30 out of 31 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| `README.md` | Marks the cybersecurity-filtering code as released. |
| `data/README.md` | Documents data modules and links FineWebSecurity docs/dataset. |
| `data/FineWebSecurity/README.md` | FineWebSecurity installation + usage documentation. |
| `data/FineWebSecurity/README_Docker.md` | Docker-specific usage instructions. |
| `data/FineWebSecurity/requirements.txt` | Python dependencies for FineWebSecurity tooling. |
| `data/FineWebSecurity/Dockerfile` | Container build for running filtering pipeline. |
| `data/FineWebSecurity/.gitignore` | Ignores FineWebSecurity local artifacts (outputs/cache/etc.). |
| `data/FineWebSecurity/config/fineweb_config.txt` | List of FineWeb subset configs used by tmux runner. |
| `data/FineWebSecurity/accelerate/default_config.yaml` | Accelerate config scaffold for local runs. |
| `data/FineWebSecurity/scripts/parallel_lib.sh` | tmux-based parallel execution helper library. |
| `data/FineWebSecurity/scripts/filter_fineweb_bert.sh` | Multi-GPU/multi-subset tmux launcher for filtering. |
| `data/FineWebSecurity/src/__init__.py` | Marks src/ as a Python package root (empty). |
| `data/FineWebSecurity/src/filter_fineweb_bert_map.py` | Compatibility wrapper to invoke filter CLI. |
| `data/FineWebSecurity/src/download_subset.py` | Compatibility wrapper to invoke download CLI. |
| `data/FineWebSecurity/src/check_fineweb_progress.py` | Compatibility wrapper to invoke progress-check CLI. |
| `data/FineWebSecurity/src/fineweb_security/__init__.py` | Package initializer. |
| `data/FineWebSecurity/src/fineweb_security/bert.py` | Model loading, warmup, and batch prediction utilities. |
| `data/FineWebSecurity/src/fineweb_security/datasets.py` | FineWeb parquet listing, iteration, and downloading helpers. |
| `data/FineWebSecurity/src/fineweb_security/persistence.py` | JSON doc persistence and parquet writing utilities. |
| `data/FineWebSecurity/src/fineweb_security/progress.py` | Progress file read/write helpers. |
| `data/FineWebSecurity/src/fineweb_security/hub.py` | Hugging Face Hub token/branch management + upload helper. |
| `data/FineWebSecurity/src/fineweb_security/cli/__init__.py` | CLI package marker. |
| `data/FineWebSecurity/src/fineweb_security/cli/filter_bert.py` | Main filtering CLI pipeline implementation. |
| `data/FineWebSecurity/src/fineweb_security/cli/download_subset.py` | CLI for downloading FineWeb subset shards. |
| `data/FineWebSecurity/src/fineweb_security/cli/check_progress.py` | CLI for reporting filtering progress. |
| `data/FineWebSecurity/tests/conftest.py` | Test path setup for src/ layout. |
| `data/FineWebSecurity/tests/test_cli.py` | CLI parser help smoke tests. |
| `data/FineWebSecurity/tests/test_datasets.py` | Tests remaining-work detection logic. |
| `data/FineWebSecurity/tests/test_filter_pipeline.py` | Tests batch filtering logic at threshold. |
| `data/FineWebSecurity/tests/test_persistence.py` | Tests JSON persistence + parquet output behavior. |
| `data/FineWebSecurity/tests/test_progress.py` | Tests progress read/write round-trip behavior. |

Comment on lines +35 to +38

    return Progress(
        parquet_idx=int(content.get("parquet_idx", 0) or 0),
        parquet_sample_idx=int(content.get("parquet_sample_idx", 0) or 0),
    )
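
A minimal sketch of more tolerant progress parsing, given that `int(...)` in the snippet above raises on values such as `"abc"` or a nested list. The `Progress` fields match the snippet; `read_progress`, `_to_non_negative_int`, and the fall-back-to-zero policy are assumptions for illustration, not the PR's actual implementation.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class Progress:
    parquet_idx: int = 0
    parquet_sample_idx: int = 0


def _to_non_negative_int(value: Any) -> int:
    """Coerce a JSON value to a non-negative int, falling back to 0."""
    try:
        number = int(value)
    except (TypeError, ValueError, OverflowError):
        # None, non-numeric strings, containers, and infinities all land here.
        return 0
    return max(number, 0)


def read_progress(path: Path) -> Progress:
    """Read a progress file, returning a fresh Progress on any bad input."""
    try:
        content = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file: start from the beginning.
        return Progress()
    if not isinstance(content, dict):
        return Progress()
    return Progress(
        parquet_idx=_to_non_negative_int(content.get("parquet_idx")),
        parquet_sample_idx=_to_non_negative_int(content.get("parquet_sample_idx")),
    )
```

Silently restarting from zero is only one possible policy; logging a warning or refusing to run on a corrupt progress file may be preferable when reprocessing is expensive.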
Comment on lines +27 to +33

    if compile_model and hasattr(torch, "compile") and torch.__version__ >= "2.0.0" and device == "cuda":
        try:
            logger.info("Compiling model with torch.compile().")
            model = torch.compile(model)
        except Exception as exc:
            logger.warning("torch.compile failed (%s). Continuing without compilation.", exc)
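
The version check in the snippet above compares strings, and string comparison is lexicographic: `"10.0.0" >= "2.0.0"` evaluates to `False`. A minimal sketch of a numeric comparison follows; `parse_version` and `supports_compile` are hypothetical helper names, and `packaging.version.parse` would be a common alternative where the `packaging` library is available.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        # Unrecognized version string: treat it as older than any minimum.
        return (0,)
    return tuple(int(part) for part in match.group(0).split("."))


def supports_compile(version: str, minimum: Tuple[int, ...] = (2, 0)) -> bool:
    """Return True when `version` is at least `minimum`, compared numerically."""
    return parse_version(version) >= minimum
```

A call site would pass `torch.__version__` as `version`; the helper itself does not import torch, so it can be tested without a GPU environment.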

Comment on lines +51 to +58

    def warmup_model(model: torch.nn.Module, tokenizer: Any, batch_size: int, max_length: int, device: str) -> None:
        dummy_input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, max_length))
        dummy_attention_mask = torch.ones((batch_size, max_length))
        predict_batch(
            {
                "input_ids": dummy_input_ids.tolist(),
                "attention_mask": dummy_attention_mask.tolist(),
            },

    )
    if offset == 0 and start_sample_idx > 0:
        logger.info("Skipping %d examples in %s.", start_sample_idx, parquet_file)
        dataset = dataset.skip(start_sample_idx)

Comment on lines +261 to +282

    logger.info("Running inference for %s.", parquet_file)
    try:
        tokenized_dataset.map(
            lambda batch, batch_indices, rank_idx: process_batch(
                batch,
                models,
                batch_indices,
                rank_idx,
                args.threshold,
                current_output_dir,
                save_queue,
                args.save_frequency,
                progress_file,
                parquet_idx,
            ),
            batched=True,
            batch_size=args.batch_size,
            with_indices=True,
            with_rank=True,
            desc=f"Processing {parquet_file}",
            num_proc=args.parallel_worker,
        )

Comment thread data/FineWebSecurity/Dockerfile (Outdated)
Comment on lines +49 to +50

    pip install torch && \
    pip install --no-cache-dir -r requirements.txt

Comment on lines +114 to +116

    current_batch_number = last_processed_index // max(len(batch_indices), 1)
    if current_batch_number % save_frequency == 0:
        save_progress(progress_file, parquet_idx, last_processed_index)
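
In the snippet above the batch number depends on `len(batch_indices)`, so a short final batch changes the divisor: with a batch size of 4, a last batch holding indices 8 and 9 gives `9 // 2 = 4` rather than 2. A minimal sketch of one alternative derives the batch number from the first index and the configured batch size, so the same batch always yields the same decision. `should_save` and `next_resume_index` are hypothetical names, and storing the next index to resume from (rather than the last processed index) is an assumption.

```python
from typing import Optional, Sequence


def next_resume_index(batch_indices: Sequence[int]) -> Optional[int]:
    """Index of the first sample after this batch, or None for an empty batch."""
    if not batch_indices:
        return None
    return batch_indices[-1] + 1


def should_save(batch_indices: Sequence[int], batch_size: int, save_frequency: int) -> bool:
    """Decide whether to persist progress after this batch.

    The batch number comes from the first index and the configured batch
    size, so a short final batch does not change the result.
    """
    if not batch_indices or batch_size <= 0 or save_frequency <= 0:
        return False
    batch_number = batch_indices[0] // batch_size
    # Save after every `save_frequency`-th batch (batches counted from 1).
    return (batch_number + 1) % save_frequency == 0
```

This sketch assumes batch indices are contiguous and ascending within a parquet file, which holds for a plain batched map over an unshuffled dataset but would need checking if work is split across processes.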

    parser.add_argument("--dataset_subset", type=str, default="CC-MAIN-2024-18")
    parser.add_argument("--dataset_size", type=int, default=-1)
    parser.add_argument("--batch_size", type=int, default=640)
    parser.add_argument("--start_idx", type=int, default=0)
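
A parsed flag has no effect unless something reads it. The following sketch shows one way `--start_idx` could be consumed, under the assumption that it is an offset into an ordered list of parquet files; the PR may define it differently (for example as a sample offset). `build_parser` and `select_files` are hypothetical names; the four arguments and their defaults are copied from the snippet above.

```python
import argparse
from typing import List, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the four arguments shown in the review."""
    parser = argparse.ArgumentParser(description="Illustrative parser, not the PR's CLI.")
    parser.add_argument("--dataset_subset", type=str, default="CC-MAIN-2024-18")
    parser.add_argument("--dataset_size", type=int, default=-1)
    parser.add_argument("--batch_size", type=int, default=640)
    parser.add_argument("--start_idx", type=int, default=0)
    return parser


def select_files(files: Sequence[str], start_idx: int) -> List[str]:
    """Drop the first `start_idx` files (assumption: start_idx is a file offset)."""
    if start_idx < 0:
        raise ValueError("--start_idx must be non-negative")
    return list(files[start_idx:])


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Placeholder file names for demonstration only.
    print(select_files(["000.parquet", "001.parquet", "002.parquet"], args.start_idx))
```

Rejecting negative values explicitly avoids Python's negative-slice semantics, where `files[-1:]` would silently select only the last file.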

    import numpy as np
    import torch
    from multiprocess import Manager, set_start_method
    from tqdm import tqdm

Comment on lines +1 to +2

    import json

@naufalso
Collaborator Author

@copilot apply changes based on the comments in this thread

Copilot AI and others added 2 commits May 12, 2026 12:00
Agent-Logs-Url: https://github.com/RISys-Lab/RedSage/sessions/c31d0317-5451-449f-969e-2afe0b13ae2c

Co-authored-by: naufalso <51362638+naufalso@users.noreply.github.com>

Copilot AI commented May 12, 2026

> @copilot apply changes based on the comments in this thread

Implemented the requested fixes from the review thread in commits 27e0a50 and 8ae3b51. This includes robust progress parsing, safer warmup/compile handling, non-streaming resume fixes, idempotent progress indexing, wiring --start_idx, removing unsafe multiprocessing model sharing during inference, Docker CUDA torch install, and updated tests (all passing).
