92 commits
ef3536f
Dockerization
stepdi May 14, 2025
0212ddb
Added excels and JSONL to the output zip file
stepdi May 14, 2025
87f430c
Fix pipeline integration tests with proper mocking
sumukshashidhar May 15, 2025
1cfec04
remove unsupported
sumukshashidhar May 15, 2025
5f1915c
Fix CI workflow by adding virtual environment creation step
sumukshashidhar May 15, 2025
769fb9e
Fix CI workflow: add permissions and activate virtual environment
sumukshashidhar May 15, 2025
24a1a3d
CQ
sumukshashidhar May 15, 2025
d394a8d
feat: add cost tracking to inference engine
sumukshashidhar May 15, 2025
8652c4e
Update inference_engine.py
sumukshashidhar May 15, 2025
a820209
fix push to hub
sumukshashidhar May 15, 2025
6ec0b1e
Update pyproject.toml
sumukshashidhar May 16, 2025
cb7a819
Merge pull request #91 from huggingface/bugfix/dataset-push
sumukshashidhar May 16, 2025
c35cae3
Update inference_engine.py
sumukshashidhar May 16, 2025
b06025c
Merge pull request #90 from huggingface/fix/clean-cost-tracking
sumukshashidhar May 16, 2025
80f01df
Fix Dockerfile
m-peko May 16, 2025
feb3948
Stability fixes [WIP]
alozowski May 16, 2025
d26927f
add offline mode
sumukshashidhar May 15, 2025
40f06e5
add cq
sumukshashidhar May 15, 2025
ca7a3b0
Update dataset_engine.py
sumukshashidhar May 16, 2025
141a696
Update dataset_engine.py
sumukshashidhar May 16, 2025
2278ead
Merge pull request #93 from huggingface/cherry-pick-offline-mode
sumukshashidhar May 16, 2025
dd62da9
Merge pull request #92 from huggingface/release-v0.3.1
sumukshashidhar May 16, 2025
942de83
add new readme, remove legacy figure
sumukshashidhar May 20, 2025
e0f335e
move video and highlights to bottom
sumukshashidhar May 20, 2025
64ca98c
Merge pull request #99 from huggingface/update-readme
sumukshashidhar May 20, 2025
951ceb0
Merge branch 'main' into tests/integration-test
sumukshashidhar May 20, 2025
61f7387
Merge pull request #88 from huggingface/tests/integration-test
sumukshashidhar May 20, 2025
98d98e4
Update README.md
sumukshashidhar May 20, 2025
1572760
Update README.md
sumukshashidhar May 20, 2025
ae76f9d
Merge pull request #100 from huggingface/fix-readme-merge-conflict
sumukshashidhar May 21, 2025
28d27f2
Delete yourbench/utils/load_task_config.py
sumukshashidhar May 21, 2025
b0e473f
refactor loading engine
sumukshashidhar May 21, 2025
80f9f49
fix summarization and refactor
sumukshashidhar May 21, 2025
99924d3
Merge pull request #103 from huggingface/remove-empty-file
sumukshashidhar May 22, 2025
e67a11c
add sample question viewer to analyze
sumukshashidhar May 22, 2025
1c6239f
add docs
sumukshashidhar May 22, 2025
264bf64
Merge pull request #107 from huggingface/analyze-sample-questions
sumukshashidhar May 22, 2025
1acef89
Change output format of generated benchmark
m-peko May 23, 2025
47357ad
Improve lighteval.py for MCQ and long task
alozowski May 23, 2025
03a3036
Merge remote-tracking branch 'origin/main' into long-task-stability
alozowski May 23, 2025
78c59a5
Apply Ruff
alozowski May 23, 2025
364d215
Update quickstart with correct run command (#108)
patrickfleith May 23, 2025
f162207
Merge pull request #106 from huggingface/improve-summarization
sumukshashidhar May 23, 2025
fb58390
Merge pull request #104 from huggingface/refactor-loading-engine
sumukshashidhar May 24, 2025
a930329
remove main, unnecessary
sumukshashidhar May 24, 2025
93e5715
remove plotting code
sumukshashidhar May 24, 2025
fe76858
remove info density metrics
sumukshashidhar May 24, 2025
004fba7
refactor chunking and heavily reduce LoC
sumukshashidhar May 24, 2025
1971b20
fix cq
sumukshashidhar May 24, 2025
f58c00d
update testcase
sumukshashidhar May 24, 2025
2801943
add cq for tests
sumukshashidhar May 24, 2025
dc827a3
remove unnecessary dependencies based on semantic deduplications
sumukshashidhar May 24, 2025
99bbee7
Merge remote-tracking branch 'origin/main' into long-task-stability
alozowski May 26, 2025
3910a6f
Pull summarization.py from main
alozowski May 26, 2025
3aaae2d
Update citation_score_filtering.py
sumukshashidhar May 27, 2025
feb96ac
remove semantic chunking reference and add warning
sumukshashidhar May 27, 2025
0fbbab5
Update ingestion.py
sumukshashidhar May 27, 2025
65a3cd5
Update pyproject.toml
sumukshashidhar May 27, 2025
c20a224
Merge pull request #112 from huggingface/refactor-chunking
sumukshashidhar May 28, 2025
d2b9ab5
Merge branch 'main' of github.com:huggingface/yourbench into long-tas…
alozowski May 28, 2025
42cf7fe
Restore summarization.py from main
alozowski May 28, 2025
e008dde
Merge pull request #111 from huggingface/long-task-stability
alozowski May 28, 2025
4057d64
Introduce BENCHMARK_SYSTEM_PROMPT environment variable
m-peko May 28, 2025
951d257
use latest gemini flash model
Jun 1, 2025
c841506
use latest gemini flash model
Jun 1, 2025
68cdb5a
use correct model format for private evaluation
Jun 1, 2025
52efc59
use gemini flash for llm as a judge
Jun 1, 2025
2c4de98
Ensure summarization uses correct model by aligning step name
alozowski Jun 3, 2025
162ac00
Apply Ruff
alozowski Jun 3, 2025
33b830d
Merge pull request #117 from huggingface/fix-summarization-stepname-m…
alozowski Jun 3, 2025
da48707
remove import error and refactor block
sumukshashidhar Jun 4, 2025
07497eb
add helper
sumukshashidhar Jun 4, 2025
3801f63
fix cq
sumukshashidhar Jun 4, 2025
f698308
Merge pull request #116 from huggingface/improve-ingestion-markitdown
sumukshashidhar Jun 4, 2025
69be970
Merge pull request #115 from huggingface/hotfix-citation-score-filtering
sumukshashidhar Jun 4, 2025
5f56c05
Refactor CLI and pipeline init logic
alozowski Jun 4, 2025
1fed2fa
Refactor ingestion and QA stages
alozowski Jun 4, 2025
0e8d389
Split inference logic into modular files
alozowski Jun 4, 2025
c1f1446
Update parsing and QA model logic
alozowski Jun 4, 2025
fb87276
Refine question generation prompts
alozowski Jun 4, 2025
074c0bb
Add chunk sampling logic
alozowski Jun 4, 2025
c98cb05
Update config and integration test for QA pipeline
alozowski Jun 4, 2025
783a553
Refactor test pipeline to support unified question_generation
alozowski Jun 5, 2025
ed83d39
Merge pull request #121 from huggingface/qg-inference-clarity
alozowski Jun 5, 2025
b4883f8
Potential fix for code scanning alert no. 1: Workflow does not contai…
sumukshashidhar Jun 6, 2025
ee24b9f
Merge branch 'docker'
stepdi Jun 6, 2025
f5035f0
Added `include_docment_text` option to `lighteval` step to skip addin…
stepdi Jun 13, 2025
50f8c23
Turned off inclusion of doc contents in `lighteval` step
stepdi Jun 13, 2025
b0af320
Added missing HF_HUB_ONLINE=1 to .env.template
stepdi Jun 13, 2025
0cf0717
limit LLM query count to 50 for single-shot and 50 for multi-hop ques…
stepdi Jul 17, 2025
2b86b62
Update llm judge model for yourbench to 2.5-flash
Robert-H-Leonard Sep 2, 2025
c2282f1
Merge pull request #2 from LayerLens/update-llm-judge-model
Robert-H-Leonard Sep 2, 2025
4 changes: 4 additions & 0 deletions .dockerignore
@@ -0,0 +1,4 @@
.venv
.git
__pycache__/
datasets/
15 changes: 13 additions & 2 deletions .env.template
@@ -1,2 +1,13 @@
HF_TOKEN=
HF_ORGANIZATION=
OPENROUTER_API_KEY=

BENCHMARK_NAME="test"
BENCHMARK_SYSTEM_PROMPT="test prompt"
INPUT_S3_BUCKET="layerlens-private-test-organization"
INPUT_S3_KEY="benchmarks/test-project/benchmark-name/data.zip"
OUTPUT_S3_BUCKET="layerlens-private-test-organization"
OUTPUT_S3_KEY="benchmarks/test-project/benchmark-name/"

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

HF_HUB_OFFLINE=1
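
For local runs, a template like this is usually copied to a `.env` file and loaded into the environment. The sketch below shows one way that loading could work, assuming only simple `KEY=VALUE` lines with optional double quotes, as in the template. `parse_env_lines` is a hypothetical helper and not part of this change; a dedicated library such as python-dotenv covers more edge cases.

```python
from typing import Dict, Iterable


def parse_env_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Strip one pair of surrounding double quotes, as used in the template.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        values[key.strip()] = value
    return values


# A few lines taken from the template above.
sample = [
    "HF_TOKEN=",
    "",
    'BENCHMARK_NAME="test"',
    'BENCHMARK_SYSTEM_PROMPT="test prompt"',
    "HF_HUB_OFFLINE=1",
]
parsed = parse_env_lines(sample)
```

Empty values such as `HF_TOKEN=` parse to an empty string, so a caller still has to decide whether an empty value counts as "set".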
47 changes: 47 additions & 0 deletions .github/workflows/ci.yaml
@@ -0,0 +1,47 @@
name: YourBench CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

permissions:
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.12]

    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install uv
        run: pip install uv

      - name: Create virtual environment
        run: uv venv

      - name: Install dependencies
        run: |
          . .venv/bin/activate
          uv pip install -e .
          uv pip install pytest pytest-cov

      - name: Run tests
        run: |
          . .venv/bin/activate
          python -m pytest tests/ --cov=yourbench --cov-report=xml

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml
          fail_ci_if_error: false
3 changes: 3 additions & 0 deletions .github/workflows/quality.yaml
@@ -1,5 +1,8 @@
name: Quality

permissions:
  contents: read

on:
  push:
    branches:
43 changes: 43 additions & 0 deletions Dockerfile
@@ -0,0 +1,43 @@
FROM python:3.12-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    curl \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Copy all yourbench files
COPY . .

# Install dependencies and yourbench in editable mode
RUN pip install --upgrade pip && \
    pip install boto3 pyyaml awscli && \
    pip install -e .

# Verify installation
RUN yourbench --version || echo "Yourbench installation verification failed but continuing build"

# Environment variables (will be overridden at runtime)
ENV BENCHMARK_NAME=""
ENV BENCHMARK_SYSTEM_PROMPT=""
ENV INPUT_S3_BUCKET=""
ENV INPUT_S3_KEY=""
ENV OUTPUT_S3_BUCKET=""
ENV OUTPUT_S3_KEY=""
ENV OPENROUTER_API_KEY=""
ENV AWS_ACCESS_KEY_ID=""
ENV AWS_SECRET_ACCESS_KEY=""
ENV AWS_DEFAULT_REGION="us-east-1"
ENV WORKDIR="/app"

# Create a startup script to run the processing workflow
RUN printf '#!/bin/bash\n\
echo "Running yourbench workflow..."\n\
exec python run_yourbench.py\n' > /app/entrypoint.sh && \
chmod +x /app/entrypoint.sh

# Use the startup script as entry point
ENTRYPOINT ["/app/entrypoint.sh"]
67 changes: 67 additions & 0 deletions README.docker.md
@@ -0,0 +1,67 @@
# YourbenchProcessor Docker Container

This Docker container automates the process of:
1. Downloading data from AWS S3
2. Processing with yourbench
3. Uploading results back to AWS S3

## Required Environment Variables

The container requires the following environment variables:

- `INPUT_S3_BUCKET`: S3 bucket name for input data
Review comment: I think we are missing some of the variables here, and below in the example `docker run` command.

- `INPUT_S3_KEY`: S3 object key for input data (ZIP file)
- `OUTPUT_S3_BUCKET`: S3 bucket name for output results
- `OUTPUT_S3_KEY`: S3 object key for output results
- `OPENROUTER_API_KEY`: API key for OpenRouter
- `AWS_ACCESS_KEY_ID`: AWS access key with S3 permissions
- `AWS_SECRET_ACCESS_KEY`: AWS secret key with S3 permissions
- `AWS_DEFAULT_REGION`: AWS region (default: us-east-1)
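
A startup check for these variables could look like the sketch below. It is illustrative only: `REQUIRED_VARS`, `DEFAULTS`, and `load_settings` are hypothetical names, and the actual handling in `run_yourbench.py` is not shown in this diff. Because the Dockerfile declares these variables with empty-string defaults, the sketch treats an empty value as missing.

```python
from typing import Dict, List, Mapping

# Variables listed as required in this README.
REQUIRED_VARS: List[str] = [
    "INPUT_S3_BUCKET",
    "INPUT_S3_KEY",
    "OUTPUT_S3_BUCKET",
    "OUTPUT_S3_KEY",
    "OPENROUTER_API_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]

# Variables with a documented default.
DEFAULTS: Dict[str, str] = {"AWS_DEFAULT_REGION": "us-east-1"}


def load_settings(env: Mapping[str, str]) -> Dict[str, str]:
    """Return validated settings, or raise ValueError naming every missing variable."""
    # An empty string counts as missing, since the image sets empty defaults.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    settings = {name: env[name] for name in REQUIRED_VARS}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name) or default
    return settings
```

Calling `load_settings(os.environ)` at the start of a run would fail fast with one message listing all missing variables, rather than failing later in the middle of an S3 call.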

## Building the Docker Image

```bash
docker build -t yourbench-processor .
```

## Running the Container

```bash
docker run -e INPUT_S3_BUCKET=your-input-bucket \
    -e INPUT_S3_KEY=input/data.zip \
    -e OUTPUT_S3_BUCKET=your-output-bucket \
    -e OUTPUT_S3_KEY=output/results.zip \
    -e OPENROUTER_API_KEY=your-openrouter-key \
    -e AWS_ACCESS_KEY_ID=your-aws-key-id \
    -e AWS_SECRET_ACCESS_KEY=your-aws-secret \
    -e AWS_DEFAULT_REGION=us-east-1 \
    yourbench-processor
```
```

## Process Flow

1. Downloads the specified zip file from S3
2. Extracts contents to `task/data/raw` directory
3. Creates a `config.yaml` file in `task/dataset` directory
4. Runs yourbench with the created config
5. Zips the `task/dataset` directory
6. Uploads the zipped results back to S3
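
The six steps above can be sketched roughly as follows. This is not the contents of `run_yourbench.py`, which this diff does not show. `run_workflow` and its parameters are hypothetical; `s3` is assumed to be a boto3 S3 client (or any object with the same `download_file` and `upload_file` methods); and both the `config.yaml` text and the way yourbench is invoked are left to the caller, since neither is specified here.

```python
import shutil
import zipfile
from pathlib import Path
from typing import Callable, Mapping


def run_workflow(
    s3,
    settings: Mapping[str, str],
    workdir: Path,
    config_text: str,
    run_yourbench: Callable[[Path], None],
) -> Path:
    """Download, extract, configure, run, zip, and upload, in that order."""
    raw_dir = workdir / "task" / "data" / "raw"
    dataset_dir = workdir / "task" / "dataset"
    raw_dir.mkdir(parents=True, exist_ok=True)
    dataset_dir.mkdir(parents=True, exist_ok=True)

    # 1. Download the input archive from S3.
    input_zip = workdir / "input.zip"
    s3.download_file(
        settings["INPUT_S3_BUCKET"], settings["INPUT_S3_KEY"], str(input_zip)
    )

    # 2. Extract its contents into task/data/raw.
    with zipfile.ZipFile(input_zip) as archive:
        archive.extractall(raw_dir)

    # 3. Write config.yaml into task/dataset (content supplied by the caller).
    config_path = dataset_dir / "config.yaml"
    config_path.write_text(config_text, encoding="utf-8")

    # 4. Run yourbench with that config (invocation supplied by the caller).
    run_yourbench(config_path)

    # 5. Zip task/dataset; the archive is written outside the zipped directory.
    output_zip = Path(
        shutil.make_archive(str(workdir / "results"), "zip", root_dir=dataset_dir)
    )

    # 6. Upload the archive back to S3.
    s3.upload_file(
        str(output_zip), settings["OUTPUT_S3_BUCKET"], settings["OUTPUT_S3_KEY"]
    )
    return output_zip
```

Passing the S3 client and the yourbench runner in as arguments keeps the flow testable without network access: a stub client and a no-op runner are enough to exercise the directory layout and archive handling.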

## Local Testing

For local testing without Docker:

```bash
# Set environment variables
export INPUT_S3_BUCKET=your-input-bucket
export INPUT_S3_KEY=input/data.zip
export OUTPUT_S3_BUCKET=your-output-bucket
export OUTPUT_S3_KEY=output/results.zip
export OPENROUTER_API_KEY=your-openrouter-key
export AWS_ACCESS_KEY_ID=your-aws-key-id
export AWS_SECRET_ACCESS_KEY=your-aws-secret
export AWS_DEFAULT_REGION=us-east-1

# Run the script
python run_yourbench.py
```