MLFilter is a versatile, lightweight framework for training machine learning-based filters, particularly for identifying and curating high-quality data such as educational content.
Key Features:
- Dataset Generation: A client provides seamless access to hosted large language models (LLMs) that evaluate the quality of documents using custom, user-defined prompts. By leveraging powerful LLMs, MLFilter enables the creation of training datasets for classifiers that filter documents based on their quality.
- Training of Classifiers: MLFilter provides training functionalities that allow users to train and fine-tune classifiers on the generated datasets. This enables the creation of specialized models tailored to specific needs and domains, making the framework useful for a wide range of applications.
We use this repository to filter out low-quality documents from the Common Crawl dataset. The filtered dataset is then used to train the Eurolingua GPT model(s). The following diagram illustrates the workflow, which is closely related to Fineweb-EDU:
- We start with a Common Crawl (CC) subset (e.g., 200,000 documents per language) that we want to score, e.g., with respect to the amount of educational content. We use an LLM to score these documents based on the instructions defined in a prompt.
- The scored documents are then used to train a classifier (e.g., a RoBERTa-based model) that can be used to filter out low-quality / non-educational documents.
- The classifier is used to filter out low-quality documents from the entire CC dataset. The filtered dataset is then used to train the model(s).
- Pipelines: Embedding & Annotation – generate embeddings and run annotation heads at scale.
- Aggregation – how scores are combined (mean, max, min, majority, etc.).
- Data Format – expected JSONL schema & label structure.
- Evaluation – metrics and evaluation utilities.
Please see CONTRIBUTING.md
Once you have set up the TGI container, you can proceed to score the documents and train a classifier.
python cli.py score_documents --config_file_path path/to/your/config.yaml
Generate HDF5 embedding files from raw JSONL (see documentation/pipelines.md for full schema):
python cli.py run_embedding_pipeline --config_file_path configs/embedding_job.yaml

Outputs: one .h5 file per input file (embeddings + optional labels) under the configured embedding directory.
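For a quick sanity check of a generated shard, you can inspect it with h5py. This is only a minimal sketch: the group and dataset names ("train", "embeddings", "labels") follow the defaults described in the training section below and may differ from your configuration, and the file name is hypothetical.

```python
# Minimal sketch for inspecting an embedding shard produced by the pipeline.
# Group/dataset names ("train", "embeddings", "labels") are assumed defaults;
# adjust them to your config.
import h5py

with h5py.File("embeddings/my_shard.h5", "r") as f:  # hypothetical file name
    group = f["train"]
    print("embeddings:", group["embeddings"].shape)   # (num_samples, embedding_dim)
    if "labels" in group:
        print("labels:", group["labels"].shape)       # (num_samples, num_tasks)
```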
If you already have scores (e.g. LLM annotations), you can train a classifier by running
python cli.py train_classifier --config_file_path path/to/your/training_config.yaml
The trained model (and tokenizer) is saved under the final subdirectory of the configured output directory.
Apply one or more trained regression / classification heads to previously generated embeddings:
python cli.py run_annotation_pipeline --config_file_path configs/annotation_job.yaml

Outputs: ${source_filename}.jsonl with predicted scores in annotated_data/.
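The annotated files are plain JSONL, so they can be post-processed with a few lines of Python. The sketch below is only an illustration: the "score" field name, the threshold, and the file name are assumptions, not the pipeline's guaranteed output schema.

```python
# Minimal sketch for filtering annotated documents by predicted score.
# The "score" field name, threshold, and file name are assumptions; adapt them
# to the keys actually written by your annotation head.
import json

threshold = 3
with open("annotated_data/my_shard.jsonl") as f:   # hypothetical file name
    docs = [json.loads(line) for line in f]

kept = [doc for doc in docs if doc.get("score", 0) >= threshold]
print(f"{len(kept)} of {len(docs)} documents passed the threshold")
```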
If you have a dataset with scores annotated by multiple annotators, you can compute metrics to measure the interrater reliability with the command interrater_reliability. If you want to compare the scores in a single file (e.g. the human annotated ground truth data), run:
python cli.py interrater_reliability data_annotated.jsonl --output_file_path output.json
If you want to compare the scores across different models and files (e.g. when comparing LLM annotated data to ground truth), the scores in each file first have to be aggregated. For that, use the --aggregation parameter:
python cli.py interrater_reliability data_annotated_by_model_1.jsonl data_annotated_by_model_2.jsonl --aggregation majority --output_file_path output.json
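For intuition, here is a rough sketch of what majority aggregation does when several scores exist for the same document. This is not the repository's implementation (see the Aggregation documentation page referenced above); it only illustrates the idea.

```python
# Illustrative sketch of majority aggregation over multiple scores per document.
# Not the repository's implementation; it only demonstrates collapsing several
# annotator/model scores into a single value.
from collections import Counter

def majority(scores: list[int]) -> int:
    # Most frequent score; ties are broken in favour of the lower score.
    counts = Counter(scores)
    best_score, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return best_score

print(majority([2, 3, 3, 1]))  # -> 3
```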
You can create plots for the distribution of annotations and the differences between annotators with
python cli.py plot_scores data_annotated_by_model_1.jsonl data_annotated_by_model_2.jsonl --aggregation majority --output_dir outputs
This service relies on TGI containers (Text Generation Inference), which can be downloaded from Hugging Face. Follow the steps below to download and run the TGI container.
First, you'll need to export some environment variables for the model's download path, Hugging Face API key, and the model's full name.
- Set the model cache directory:
  Define the path where the model weights will be downloaded or where they already exist:
  export HUGGINGFACECACHE=/raid/data/checkpoints/data
- Export your Hugging Face API token:
  You need an API token from Hugging Face. Replace ... with your actual token:
  export HF_TOKEN=your_huggingface_api_token_here
- Specify the model name:
  Provide the full name of the model as it appears on Hugging Face (e.g., meta-llama/Llama-3.1-70B-Instruct):
  export MODEL_NAME=meta-llama/Meta-Llama-3.1-70B-Instruct
Use the following command to download the TGI container and run it. If the model weights are already in the specified path, the download step will be skipped.
docker run -d --gpus all --shm-size 1g -p 8090:80 \
-v ${HUGGINGFACECACHE}:/data \
-e HF_TOKEN=$HF_TOKEN \
ghcr.io/huggingface/text-generation-inference:2.2.0 \
--model-id $MODEL_NAME \
--num-shard 8 \
--max-input-length 65535 \
--max-total-tokens 65536 \
--max-batch-prefill-tokens 66536

If you restrict the number of GPUs for your container via --gpus '"device=6"', --num-shard must not be larger than the number of visible GPUs.
By default, the container uses all available GPUs (--gpus all). If you want to limit the number of GPUs, you can define specific devices. For example, to restrict the container to 4 GPUs (e.g., devices 0, 1, 2, 3), use the following:
docker run -d --gpus '"device=0,1,2,3"' --shm-size 1g -p 8090:80 \
-v ${HUGGINGFACECACHE}:/data \
-e HF_TOKEN=$HF_TOKEN \
ghcr.io/huggingface/text-generation-inference:2.2.0 \
--model-id $MODEL_NAME \
--num-shard 4 \
--max-input-length 65535 \
--max-total-tokens 65536 \
--max-batch-prefill-tokens 66536

Make sure to update --num-shard to match the number of GPUs you're using.
Locate your container; its image will be named ghcr.io/huggingface/text-generation-inference:2.2.0:

docker ps

You can now inspect the logs:

docker logs --follow your_container_id

Please note that TGI takes a little while to start.
Once the container has been successfully set up and started, you can test it by running:

curl 127.0.0.1:8090/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'

Instead of TGI, you can also host the model with vLLM:

docker run --runtime nvidia --gpus '"device=5,6"' --name vllm_container -v /raid/s3/opengptx/models/:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$API_TOKEN" -p 9900:8000 --ipc=host vllm/vllm-openai:v0.6.3 --model Qwen/Qwen2.5-72B-Instruct-AWQ --tensor-parallel-size 2

The value of --tensor-parallel-size and the number of GPUs used (--gpus) must match.
Alternatively, just execute bash scripts/host_vllm_model.sh $CONTAINER_NAME $PORT $MODEL_NAME and make sure all required environment variables are in your .env file under the project root, e.g.:

bash scripts/host_vllm_model.sh my_vllm_container 9123 meta-llama/Llama-3.1-8B-Instruct

For Mistral models, make sure to manually set the correct chat template file in the tokenizer_config.json. We tried hosting the model as described by MistralAI,

vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --tensor-parallel-size 4 --port 8003 --tokenizer_mode mistral --config_format mistral --load_format mistral

but still ran into:
ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at

To test the vLLM server, query the OpenAI-compatible completions endpoint:

curl http://localhost:port_number/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'

- Forward port 8000
- Visit http://localhost:8000/metrics to see the tokens/s
Or watch the output e.g. with Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 on two GPUs:
INFO:ml_filter.data_processing.document_processor:Results written final: 511 | Elapsed time: 215.89 seconds | Results per second: 2.22
Request failed with HTTPConnectionPool(host='localhost', port=9900): Read timed out. (read timeout=20), retrying...0
With larger models, increase the llm_rest_client.timeout config parameter.
Also play around with:
llm_rest_client.max_pool_connections: 1000
llm_rest_client.max_pool_maxsize: 1000
If the server dies with the message "[VLLM] is already dead, terminating server process.", apply the solution from vllm-project/vllm#10024:

export VLLM_RPC_TIMEOUT=20000
- Find the required version of vllm on Docker Hub
- Run

singularity build singularity_container_name.sif docker://path/to/vllm/on/docker-hub
- Clone the vllm repo
- Change the requirements files as necessary, e.g., the transformers version
- cd into vllm and run

DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/your-build-name
# optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2
- Export the Docker Container to a Tarball
Once the Docker container is built, save it to a tar file. This tarball will later be used by Singularity to build the Singularity Image File (SIF).
docker save -o vllm-openai-gemma.tar vllm/your-build-name

This command produces a tar file (vllm-openai-gemma.tar) that contains your Docker image.
- Transfer the Tarball to Your Singularity Environment
Copy or transfer the generated tar file (vllm-openai-gemma.tar) to the (virtual) machine or environment where Singularity is installed (Singularity and Docker in the same setup lead to conflicts). This can be done via SCP, rsync, or any other file transfer method appropriate to your setup.
- Convert the Docker Tarball to a Singularity Image
On the target machine with Singularity installed, use the following command to build the Singularity Image File (SIF). Note the corrected and complete command below:
sudo singularity build singularity_container_name.sif docker-archive://path/to/docker/tar/file
Key points in this step:
- singularity build: This is the primary command to create a new SIF file.
- singularity_container_name.sif: Replace this with your desired container name.
- docker-archive://path/to/vllm-openai-gemma.tar: This instructs Singularity to use the Docker tarball as the source. Ensure you provide the full path to your tar file.
- sudo: Some Singularity installations require root privileges for building containers. If your installation permits non-root builds, you may not need sudo.
TGI internally uses a buffer and performs dynamic batching. To maximize the number of documents processed per request, we create batches as a workaround, where each batch is close to the capacity of the buffer, and then run .generate from multiple threads.
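A rough sketch of this pattern is shown below. It is not the repository's actual implementation: the client object, the generate() signature, the batch size, and the worker count are placeholders.

```python
# Illustrative sketch of client-side batching plus threaded .generate calls.
# The client object, batch size, and generate() signature are placeholders and
# not the repository's actual implementation.
from concurrent.futures import ThreadPoolExecutor

def chunk(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def score_documents(client, documents, batch_size=32, max_workers=8):
    batches = list(chunk(documents, batch_size))  # each batch ~ buffer capacity
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One generate() call per batch, issued concurrently so TGI's dynamic
        # batching can keep its buffer filled.
        results = pool.map(lambda batch: client.generate(batch), batches)
    return [score for batch_result in results for score in batch_result]
```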
add_generation_prompt (bool): If this is set, a prompt with the token(s) that indicate the start of an assistant message will be appended to the formatted output. This is useful when you want to generate a response from the model. Note that this argument is passed to the chat template, so it must be supported in the template to have any effect. We expect it to work best if set to true.
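For reference, a minimal sketch of the effect of add_generation_prompt with a Hugging Face tokenizer; the model name and the prompt text are only examples.

```python
# Minimal sketch showing the effect of add_generation_prompt with a Hugging Face
# chat template; the model name and message are examples only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": "Score this document from 0 to 5."}]

# With add_generation_prompt=True the assistant-start token(s) are appended, so
# the model continues with its answer instead of starting a new user turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```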
Example call for running the training pipeline:
ml_filter train_with_embeddings --config_file_path path/to/your/config/file.yaml
Example configs can be found in configs/train_classifier. The most important settings are:
| Setting | Description |
|---|---|
| model.regressor_hidden_dim | Hidden dimension of the MLP regression head that consumes embeddings. |
| model.init_regression_weights | Whether to apply the optional warm-start initialisation for faster convergence. |
| data.train_file_path | Directory containing HDF5 shards with a train group that stores embeddings and labels. |
| data.val_file_path / data.test_file_path | Optional directories with evaluation shards; each shard must expose the same datasets as training. |
| data.embeddings_dataset / data.labels_dataset | Dataset keys inside each HDF5 group that hold the embedding matrix and label matrix. |
| data.task_names | Task names used when logging per-task metrics (e.g. ["edu"]). |
| data.num_targets_per_task | Number of discrete levels per task; used to derive categorical thresholds for metrics. |
| training.batch_size | Per-device batch size used when loading the embedding tensors. |
| training.metric_for_best_model | Full metric name emitted by the Trainer (e.g. eval_validation_edu/spearman_corr). |
Each HDF5 shard should expose a group (default: train) containing two datasets: one with embedding vectors (data.embeddings_dataset) and one with labels (data.labels_dataset). The labels must be stored as arrays shaped (num_samples, num_tasks) (a trailing singleton dimension for single-task setups is fine).
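For reference, a minimal sketch of writing such a shard with h5py. The group name follows the default ("train"), the dataset names stand in for data.embeddings_dataset / data.labels_dataset, and the shapes and file name are examples only.

```python
# Minimal sketch of writing an HDF5 shard in the expected layout.
# Group and dataset names follow the defaults described above; shapes and the
# file name are examples only.
import h5py
import numpy as np

embeddings = np.random.rand(100, 1024).astype(np.float32)  # (num_samples, embedding_dim)
labels = np.random.randint(0, 6, size=(100, 1))            # (num_samples, num_tasks)

with h5py.File("train_shard_000.h5", "w") as f:  # hypothetical file name
    group = f.create_group("train")
    group.create_dataset("embeddings", data=embeddings)
    group.create_dataset("labels", data=labels)
```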
The final head checkpoint will be saved in the final subdirectory of training.output_dir_path. It can be reloaded via EmbeddingRegressionModel.from_pretrained(".../final"). If you initialise the head from an external script, make sure the embedding dimension matches the vectors you used at training time.
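A hedged sketch of reloading the head for inference: the import path and the forward-call signature are assumptions and should be checked against the package; only from_pretrained(".../final") is taken from the description above.

```python
# Sketch of reloading a trained head; the import path and forward signature are
# assumptions, not a confirmed API.
import torch
from ml_filter.models import EmbeddingRegressionModel  # import path is an assumption

model = EmbeddingRegressionModel.from_pretrained("outputs/run_01/final")
model.eval()

embeddings = torch.randn(4, 1024)  # dim must match the embeddings used at training time
with torch.no_grad():
    scores = model(embeddings)     # forward signature is an assumption
print(scores.shape)
```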
