Modalities/ml_filter

MLFilter is a versatile, lightweight framework designed to facilitate the training of machine learning-based filters, particularly for identifying and curating high-quality data such as educational content.

Key Features:

  • Dataset Generation: A client provides seamless access to hosted large language models (LLMs) that evaluate the quality of documents using custom, user-defined prompts. By leveraging powerful LLMs, MLFilter enables the creation of training datasets for classifiers that filter documents based on their quality.

  • Training of Classifiers: MLFilter provides training functionalities allowing users to train and fine-tune classifiers based on the generated datasets. This feature enables the creation of specialized models tailored to specific needs and domains, enhancing the utility of the framework for a wide range of applications.

Usage

We use this repository to filter out low-quality documents from the Common Crawl dataset. The filtered dataset is then used to train the Eurolingua GPT model(s). The workflow, which is closely related to FineWeb-Edu, is as follows:

  1. We start with a Common Crawl (CC) subset (e.g., 200,000 documents per language) that we want to score, e.g., with respect to the amount of educational content. We use an LLM to score these documents based on the instructions defined in a prompt.

  2. The scored documents are then used to train a classifier (e.g., a RoBERTa-based model) that can be used to filter out low-quality / non-educational documents.

  3. The classifier is used to filter out low-quality documents from the entire CC dataset. The filtered dataset is then used to train the model(s).

Documentation Map

Installation and Development

Please see CONTRIBUTING.md

Usage

Once you have set up the TGI container, you can proceed to score the documents and train a classifier.

1. How to Score Documents with LLM

python cli.py score_documents --config_file_path path/to/your/config.yaml

2. Create Embeddings at Scale

Generate HDF5 embedding files from raw JSONL (see documentation/pipelines.md for full schema):

python cli.py run_embedding_pipeline --config_file_path configs/embedding_job.yaml

Outputs: one .h5 per input file (embeddings + optional labels) under the configured embedding directory.
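To inspect a generated shard, you can walk its contents with h5py (a minimal sketch; the file, group, and dataset names depend on your embedding config):

import h5py

# Print every group and dataset (with shape and dtype) in an embedding shard.
# The file name below is only an example.
def describe(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
    else:
        print(f"{name}/")

with h5py.File("embeddings/part_0000.h5", "r") as f:
    f.visititems(describe)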

3. How to Train a Classifier

If you already have scores (e.g. LLM annotations), you can train a classifier by running

python cli.py train_classifier --config_file_path path/to/your/training_config.yaml

The trained model and tokenizer are saved under the final subdirectory of the configured output directory.
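For a quick sanity check of a trained checkpoint, you can load it with the transformers auto classes, assuming it is a standard Hugging Face sequence-classification model (the path and the interpretation of the logits are assumptions; adapt them to your training config):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example path: the "final" subdirectory of the configured output dir.
checkpoint = "outputs/classifier/final"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

inputs = tokenizer("An introductory lesson on photosynthesis for grade 7.",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)  # interpret as quality score(s) according to your training setup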

4. Run Annotation Heads on Embeddings

Apply one or more trained regression / classification heads to previously generated embeddings:

python cli.py run_annotation_pipeline --config_file_path configs/annotation_job.yaml

Outputs: ${source_filename}.jsonl with predicted scores in annotated_data/.
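The resulting JSONL can be inspected with a few lines of Python (the file name is a placeholder; field names depend on the annotation config):

import json

# Placeholder path: one output file per source file in annotated_data/.
with open("annotated_data/example_source.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record)  # each record carries the document and its predicted score(s)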

5. Measure Interrater Reliability

If you have a dataset with scores annotated by multiple annotators, you can compute metrics to measure the interrater reliability with the command interrater_reliability. If you want to compare the scores in a single file (e.g. the human annotated ground truth data), run:

python cli.py interrater_reliability data_annotated.jsonl --output_file_path output.json

If you want to compare the scores across different models and files (e.g. when comparing LLM annotated data to ground truth), the scores in each file first have to be aggregated. For that, use the --aggregation parameter:

python cli.py interrater_reliability data_annotated_by_model_1.jsonl data_annotated_by_model_2.jsonl --aggregation majority --output_file_path output.json
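For intuition, majority aggregation keeps the most frequent score per document before the reliability metrics are computed; a toy illustration (not the tool's implementation):

from collections import Counter

# Toy example: per document, keep the most common score across annotations.
annotations = {
    "doc_1": [3, 3, 2],
    "doc_2": [1, 2, 2],
}
aggregated = {doc: Counter(scores).most_common(1)[0][0]
              for doc, scores in annotations.items()}
print(aggregated)  # {'doc_1': 3, 'doc_2': 2}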

You can create plots for the distribution of annotations and the differences between annotators with

python cli.py plot_scores data_annotated_by_model_1.jsonl data_annotated_by_model_2.jsonl --aggregation majority --output_dir outputs

TGI

This service relies on TGI containers (Text Generation Inference), which can be downloaded from Hugging Face. Follow the steps below to download and run the TGI container.

1. Set Up Environment Variables

First, you'll need to export some environment variables for the model's download path, Hugging Face API key, and the model's full name.

  1. Set the model cache directory:

    Define the path where the model weights will be downloaded or where they already exist:

    export HUGGINGFACECACHE=/raid/data/checkpoints/data
  2. Export your Hugging Face API token:

    You need an API token from Hugging Face. Replace the placeholder below with your actual token:

    export HF_TOKEN=your_huggingface_api_token_here
  3. Specify the model name:

    Provide the full name of the model as it appears on Hugging Face (e.g., meta-llama/Llama-3.1-70B-Instruct):

    export MODEL_NAME=meta-llama/Meta-Llama-3.1-70B-Instruct

2. Download and Run the TGI Container

Use the following command to download the TGI container and run it. If the model weights are already in the specified path, the download step will be skipped.

docker run -d --gpus all --shm-size 1g -p 8090:80 \
-v ${HUGGINGFACECACHE}:/data \
-e HF_TOKEN=$HF_TOKEN \
ghcr.io/huggingface/text-generation-inference:2.2.0 \
--model-id $MODEL_NAME \
--num-shard 8 \
--max-input-length 65535 \
--max-total-tokens 65536 \
--max-batch-prefill-tokens 66536

If you restrict the GPUs available to the container (e.g., --gpus '"device=6"'), --num-shard must not exceed the number of available GPUs.

3. Optional: Restricting GPU Usage

By default, the container uses all available GPUs (--gpus all). If you want to limit the number of GPUs, you can define specific devices. For example, to restrict the container to 4 GPUs (e.g., devices 0, 1, 2, 3), use the following:

docker run -d --gpus '"device=0,1,2,3"' --shm-size 1g -p 8090:80 \
-v ${HUGGINGFACECACHE}:/data \
-e HF_TOKEN=$API_TOKEN \
ghcr.io/huggingface/text-generation-inference:2.2.0 \
--model-id $MODEL_NAME \
--num-shard 4 \
--max-input-length 65535 \
--max-total-tokens 65536 \
--max-batch-prefill-tokens 66536

Make sure to update --num-shard to match the number of GPUs you're using.

4. Testing the docker setup

Locate your container; it runs the image ghcr.io/huggingface/text-generation-inference:2.2.0:

docker ps

You can now inspect the logs:

docker logs --follow your_container_id 

Please note that TGI takes some time to start.

5. Testing TGI service

Once the container has been successfully set up and started, you can test it by running:

curl 127.0.0.1:8090/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

VLLM

Host a Model with VLLM (faster)

docker run --runtime nvidia --gpus '"device=5,6"' --name vllm_container \
-v /raid/s3/opengptx/models/:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$API_TOKEN" \
-p 9900:8000 --ipc=host \
vllm/vllm-openai:v0.6.3 \
--model Qwen/Qwen2.5-72B-Instruct-AWQ \
--tensor-parallel-size 2

The value of --tensor-parallel-size should match the number of GPUs specified via --gpus.

Alternatively, just execute bash scripts/host_vllm_model.sh $CONTAINER_NAME $PORT $MODEL_NAME and make sure all required environment variables are set in your .env file under the project root, e.g.

bash scripts/host_vllm_model.sh my_vllm_container 9123 meta-llama/Llama-3.1-8B-Instruct

Mistral

For Mistral models, make sure to manually set the correct chat template in tokenizer_config.json. We tried hosting the model as described by Mistral AI,

vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --tensor-parallel-size 4 --port 8003 --tokenizer_mode mistral --config_format mistral --load_format mistral

but still ran into:

ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at

Test the hosted model

curl http://localhost:port_number/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'

Look into metrics of the hosted model

  1. Forward port 8000
  2. Visit http://localhost:8000/metrics to see the tokens/s

Or watch the output e.g. with Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 on two GPUs:

INFO:ml_filter.data_processing.document_processor:Results written final: 511 | Elapsed time: 215.89 seconds | Results per second: 2.22

Troubleshooting

Request failed with HTTPConnectionPool(host='localhost', port=9900): Read timed out. (read timeout=20), retrying...0

With larger models, increase the llm_rest_client.timeout config parameter. You can also tune:

llm_rest_client.max_pool_connections: 1000
llm_rest_client.max_pool_maxsize: 1000

[VLLM] is already dead, terminating server process.

Solution as per vllm-project/vllm#10024:

export VLLM_RPC_TIMEOUT=20000

Converting docker containers to singularity containers

Build from docker hub

  1. Find the required version of vllm on docker hub
  2. Run
     singularity build singularity_container_name.sif docker://path/to/vllm/on/docker-hub

Build from source

  1. Clone the vLLM repo
  2. Change the requirements files as necessary (e.g., the transformers version)
  3. cd vllm and run
    # optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2
    DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/your-build-name
  4. Export the Docker Container to a Tarball

Once the Docker container is built, save it to a tar file. This tarball will later be used by Singularity to build the Singularity Image File (SIF).

docker save -o vllm-openai-gemma.tar vllm/your-build-name

This command produces a tar file (vllm-openai-gemma.tar) that contains your Docker image.

  5. Transfer the Tarball to Your Singularity Environment

Copy or transfer the generated tar file (vllm-openai-gemma.tar) to the (virtual) machine or environment where Singularity is installed (Singularity and Docker in the same setup can lead to conflicts). This can be done via SCP, rsync, or any other file transfer method appropriate to your setup.

  6. Convert the Docker Tarball to a Singularity Image

On the target machine with Singularity installed, use the following command to build the Singularity Image File (SIF):

    sudo singularity build singularity_container_name.sif docker-archive://path/to/docker/tar/file

Key points in this step:

  • singularity build: This is the primary command to create a new SIF file.
  • singularity_container_name.sif: Replace this with your desired container name.
  • docker-archive://path/to/vllm-openai-gemma.tar: This instructs Singularity to use the Docker tarball as the source. Ensure you provide the full path to your tar file.
  • sudo: Some Singularity installations require root privileges for building containers. If your installation permits non-root builds, you may not need sudo.

Batching and TGI containers


TGI internally uses a buffer and performs dynamic batching. To maximize the number of documents processed per request, we create batches, where each batch is close to the capacity of the buffer, and then run .generate concurrently via multiple threads, as sketched below.
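A minimal sketch of this pattern with the text_generation Python client (endpoint, batch size, and prompts are placeholders, not the repository's actual implementation):

from concurrent.futures import ThreadPoolExecutor
from text_generation import Client  # client for TGI endpoints

# Send one batch of prompts concurrently so TGI's buffer stays full and
# its dynamic batching is fully utilised.
client = Client("http://127.0.0.1:8090", timeout=60)  # port mapped in the docker run above

def generate(prompt: str) -> str:
    return client.generate(prompt, max_new_tokens=20).generated_text

prompts = [f"Rate the educational value of document {i}." for i in range(32)]  # one batch
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(generate, prompts))
print(len(results), "responses")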

Config Advice

add_generation_prompt (bool): If this is set, a prompt with the token(s) that indicate the start of an assistant message will be appended to the formatted output. This is useful when you want to generate a response from the model. Note that this argument will be passed to the chat template, and so it must be supported in the template for this argument to have any effect. We expect it to work best if set to true.
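For illustration, this is how the flag behaves with a Hugging Face tokenizer's chat template (the model name is only an example):

from transformers import AutoTokenizer

# Show the effect of add_generation_prompt on the rendered prompt.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct")
messages = [{"role": "user", "content": "Score this document from 0 to 5."}]

without = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
with_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# with_prompt additionally ends with the assistant-start tokens, so the model
# begins its answer immediately instead of continuing the user turn.
print(with_prompt[len(without):])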

Training a regression head from pre-computed embeddings

Example call for running the training pipeline:

ml_filter train_with_embeddings --config_file_path path/to/your/config/file.yaml

Example configs can be found in configs/train_classifier. The most important settings are:

  • model.regressor_hidden_dim: Hidden dimension of the MLP regression head that consumes embeddings.
  • model.init_regression_weights: Whether to apply the optional warm-start initialisation for faster convergence.
  • data.train_file_path: Directory containing HDF5 shards with a train group that stores embeddings and labels.
  • data.val_file_path / data.test_file_path: Optional directories with evaluation shards; each shard must expose the same datasets as training.
  • data.embeddings_dataset / data.labels_dataset: Dataset keys inside each HDF5 group that hold the embedding matrix and label matrix.
  • data.task_names: Task names used when logging per-task metrics (e.g. ["edu"]).
  • data.num_targets_per_task: Number of discrete levels per task; used to derive categorical thresholds for metrics.
  • training.batch_size: Per-device batch size used when loading the embedding tensors.
  • training.metric_for_best_model: Full metric name emitted by the Trainer (e.g. eval_validation_edu/spearman_corr).

Each HDF5 shard should expose a group (default: train) containing two datasets: one with embedding vectors (data.embeddings_dataset) and one with labels (data.labels_dataset). The labels must be stored as arrays shaped (num_samples, num_tasks) (a trailing singleton dimension for single-task setups is fine).
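A minimal sketch of writing a compatible shard with h5py (the dataset names "embeddings" and "labels" are examples; use whatever your data.embeddings_dataset and data.labels_dataset settings expect):

import h5py
import numpy as np

num_samples, embedding_dim, num_tasks = 1000, 1024, 1  # example sizes

embeddings = np.random.rand(num_samples, embedding_dim).astype("float32")
labels = np.random.randint(0, 6, size=(num_samples, num_tasks)).astype("float32")

# One shard: a "train" group holding the embedding matrix and the label matrix.
with h5py.File("train_shard_000.h5", "w") as f:
    group = f.create_group("train")
    group.create_dataset("embeddings", data=embeddings)  # key set by data.embeddings_dataset
    group.create_dataset("labels", data=labels)          # key set by data.labels_dataset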

The final head checkpoint will be saved in the final subdirectory of training.output_dir_path. It can be reloaded via EmbeddingRegressionModel.from_pretrained(".../final"). If you initialise the head from an external script, make sure the embedding dimension matches the vectors you used at training time.
