cs348k

Please see MILESTONE_2.md for the Milestone 2 write-up and results!

Farmshare

https://docs.farmshare.stanford.edu/

ssh jdarve@login.farmshare.stanford.edu

srun -c 8 --mem=16G --time=12:00:00 --pty bash

Quickstart

cd cs348k

source $(poetry env info --path)/bin/activate

conda activate cs348k

git pull

Repository structure

Root files

links.md: Collects reference links for VLM papers, project pages, Hugging Face code, and model cards.
MILESTONE_1.md: Report for Milestone 1.
MILESTONE_2.md: Report for Milestone 2.
pyproject.toml, poetry.lock: Poetry dependency management.

`benchmarks/`

Benchmarks existing implementations (Milestone 1)

Core files

bench_kernels.py: Benchmarks the Milestone 2 Qwen kernel implementations against HF legacy, HF fast, and HF bilinear baselines.
full_benchmark.py: Runs the full benchmark with a configurable number of threads.
full_memory_benchmark.py: Runs the full memory benchmark for configurable thread counts in a separate clean memory-measurement process.

Infrastructure

data.py: Contains load_images function. All image loading code should be here.
measurement.py: Timing and profiling harness using cProfile and RSS. All profiling code should be here.
models.py: Loads the relevant model preprocessing code from Hugging Face. All Hugging Face model and preprocessing code should be fetched here.
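
As an illustration of the kind of harness measurement.py describes, here is a minimal sketch that uses only the standard library. The function names (`peak_rss_mib`, `measure`, `profile`) and the returned dictionary keys are hypothetical and are not taken from the repository; the real harness may differ.

```python
import cProfile
import io
import pstats
import resource  # Unix only
import sys
import time


def peak_rss_mib():
    """Peak resident set size of this process in MiB.

    ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure(fn, *args, repeats=5, **kwargs):
    """Time fn over several runs (after one warm-up call) and report peak RSS."""
    fn(*args, **kwargs)  # warm-up run, excluded from the timings
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        times.append(time.perf_counter() - start)
    return {
        "best_s": min(times),
        "mean_s": sum(times) / len(times),
        "peak_rss_mib": peak_rss_mib(),
    }


def profile(fn, *args, top=10, **kwargs):
    """Run fn once under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(fn, *args, **kwargs)
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()
```

Because `ru_maxrss` is a high-water mark for the whole process lifetime, a sketch like this can only compare the memory of several implementations fairly if each one runs in its own fresh process, which is consistent with full_memory_benchmark.py using a separate clean memory-measurement process.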

Testing and development

phase0.py: Sanity-check benchmark for Qwen2.5-VL legacy vs fast preprocessing on W2.
phase1.py: Adds cProfile and memory profiling for the Qwen2.5-VL W2 benchmark.
phase2.py: Benchmarks InternVL2.5 HF and manual preprocessing to test whether the bottleneck generalizes.
phase3.py: Benchmarks Qwen2.5-VL and InternVL2.5 preprocessing on mixed-size W3 and large-image W4 workloads.
full_benchmark_single_thread.py: Runs the full benchmark on a single thread.
full_memory_benchmark_single_thread.py: Runs the full memory benchmark on its own. A bug in full_benchmark.py makes its memory results invalid, so this file re-runs the memory measurements separately and correctly.
test_llava.py: Quick LLaVA-NeXT W3 timing, memory, and output-shape test for legacy and fast processors.
test_models.py: Verifies that AutoImageProcessor and AutoProcessor.image_processor return identical pixel values for each benchmarked model, and that InternVL3.5's crop_to_patches=True flag is load-bearing.
test_measurement.py: Quick Qwen2.5-VL W3 test for the timing and RSS measurement helpers.

`kernels/`

Hand-fused Qwen2.5-VL preprocessor kernels (Milestone 2).

bilinear.py: Shared @njit bilinear sampling and resize primitives.
patch_coords.py: Shared @njit patch addressing utilities (flat index and column offset).
qwen_v1_naive.py: v1 naive correctness baseline — each stage (smart_resize_dims, bilinear_resize, rescale, normalize, patchify) is its own function with an intermediate buffer.
qwen_v2_fused.py: v2 pointwise fusion — rescale and normalize inlined into the bilinear_resize pixel loop (compute_at = inline), eliminating two intermediate buffers.
qwen_v3_storage.py: v3 full fusion with pre-allocated output — bilinear resize, rescale, normalize, and patchify fused into a single parallel loop that writes directly to a pre-allocated output tensor (store_at = root), eliminating all intermediate buffers.
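
The difference between the staged v1 structure and the fully fused v3 structure can be sketched as follows. This is a simplified illustration, not the repository's code: it uses plain Python loops over NumPy arrays instead of @njit, placeholder mean and standard deviation values, and a simple row-major patch layout that is an assumption rather than Qwen2.5-VL's actual patch ordering. The names `staged` and `fused` are hypothetical.

```python
import numpy as np

# Placeholder normalization statistics (assumed, not the model's real values).
MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)


def bilinear_sample(img, y, x, c, scale_y, scale_x):
    """Sample channel c at output pixel (y, x), using half-pixel centers."""
    h, w = img.shape[:2]
    sy = min(max((y + 0.5) * scale_y - 0.5, 0.0), h - 1.0)
    sx = min(max((x + 0.5) * scale_x - 0.5, 0.0), w - 1.0)
    y0, x0 = int(sy), int(sx)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = sy - y0, sx - x0
    top = (1 - fx) * float(img[y0, x0, c]) + fx * float(img[y0, x1, c])
    bot = (1 - fx) * float(img[y1, x0, c]) + fx * float(img[y1, x1, c])
    return (1 - fy) * top + fy * bot


def staged(img, out_h, out_w, patch):
    """v1-style: every stage materializes its own intermediate buffer."""
    assert out_h % patch == 0 and out_w % patch == 0
    h, w, ch = img.shape
    sy, sx = h / out_h, w / out_w
    resized = np.empty((out_h, out_w, ch), dtype=np.float32)  # buffer 1
    for y in range(out_h):
        for x in range(out_w):
            for c in range(ch):
                resized[y, x, c] = bilinear_sample(img, y, x, c, sy, sx)
    rescaled = resized / np.float32(255.0)  # buffer 2
    normalized = (rescaled - MEAN) / STD  # buffer 3
    gh, gw = out_h // patch, out_w // patch
    grid = normalized.reshape(gh, patch, gw, patch, ch).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch * patch * ch)  # buffer 4 (copy)


def fused(img, out_h, out_w, patch):
    """v3-style: one loop nest writes straight into a pre-allocated output."""
    assert out_h % patch == 0 and out_w % patch == 0
    h, w, ch = img.shape
    sy, sx = h / out_h, w / out_w
    gw = out_w // patch
    out = np.empty(((out_h // patch) * gw, patch * patch * ch), dtype=np.float32)
    for y in range(out_h):
        for x in range(out_w):
            row = (y // patch) * gw + (x // patch)  # which patch
            col = ((y % patch) * patch + (x % patch)) * ch  # offset inside it
            for c in range(ch):
                v = bilinear_sample(img, y, x, c, sy, sx)
                out[row, col + c] = (v / 255.0 - MEAN[c]) / STD[c]
    return out
```

Both functions perform the same arithmetic per output value; the fused version differs only in where results are stored, which is why it needs no intermediate buffers. On a small random image, `np.allclose(staged(img, 8, 12, 4), fused(img, 8, 12, 4), atol=1e-5)` is a quick equivalence check.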

`visualizations/`

Contains Python scripts for creating visualizations from the output data.

milestone_1_benchmark_charts.py: Creates the Milestone 1 runtime and peak RSS benchmark charts.
milestone_1_profiling_charts.py: Creates the Milestone 1 profiling breakdown and memory movement share charts.
milestone_2_ablation_speedups.py: Creates a normalized Qwen kernel ablation speedup chart for Milestone 2.
milestone_2_runtime_memory_pareto.py: Creates a runtime versus peak/output memory Pareto chart for the Milestone 2 Qwen kernels.

`results/`

Notes about intermediate results

bench_kernels_results.md: Summarizes the Milestone 2 bench_kernels.py kernel timing and memory results.
bench_kernels_results_multi_thread.md: Summarizes the Milestone 2 bench_kernels.py --num-threads 8 timing and memory results without torch compile rows.
full_benchmarks_multi_thread_results.md: Summarizes Milestone 1 timing and memory results for the multi-thread benchmark runs.
full_benchmarks_single_thread_results.md: Summarizes Milestone 1 timing and memory results for the single-thread benchmark runs.
full_profiling_output_multi_thread.md: Contains the detailed cProfile output from the multi-thread profiling runs.
full_profiling_output_single_thread.md: Contains the detailed cProfile output from the single-thread profiling runs.

`tests/`

test_libs.py: Tests that the required libraries load.
test_correctness.py: Verifies v1/v2/v3 output shape, image_grid_thw, and pixel values against HF fast.

Project Proposal

Historically, image preprocessing for machine learning models has been simple: models were hardcoded to a fixed image size, so inputs often needed only resizing and normalization. Modern vision language model (VLM) pipelines instead work at the native pixel resolution of the image and may perform more sophisticated preprocessing, such as dynamic tiling, patching, and multiple types of padding or normalization. In most open-source implementations used for training VLMs, such as Hugging Face or PyTorch Transforms, these pipelines are not implemented efficiently and convert back and forth between several image formats (NumPy arrays, tensors, Pillow/PIL images) for different transformations. I seek to develop a DSL inspired by Halide for high-performance VLM image preprocessing, separating algorithm from schedule and providing primitives such as tiling, fusion (compute_at), parallelism, and vectorization.

There are two claims I need to verify: that existing implementations are in fact inefficient, and that image preprocessing is a significant bottleneck in the pipeline. I suspect both hold, especially for video or high-quality image processing. Concretely, I want a bar chart of runtime and memory usage showing that existing implementations are slow and that my method is significantly faster (2-5x). I also want a separate chart covering the entire image processing pipeline, including the VLM, showing that faster preprocessing significantly increases overall speed.

Dependency notes

The current lockfile uses transformers==4.57.6 (pyproject.toml allows >=4.49.0,<5.0.0). This version supports selecting Hugging Face image processor implementations with use_fast=False for the legacy Python processor and use_fast=True for the fast processor, which is how the benchmark code compares the two preprocessing paths.
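
As a hedged sketch of how the two paths can be selected, the helper below loads both variants through `AutoImageProcessor.from_pretrained` with the `use_fast` flag. The helper name `load_processor_pair`, the `loader` parameter (added so the function can be exercised without downloading a model), and the example model ID are assumptions for illustration and are not taken from models.py.

```python
from typing import Any, Callable, Dict, Optional


def load_processor_pair(
    model_id: str, loader: Optional[Callable[..., Any]] = None
) -> Dict[str, Any]:
    """Load the legacy and fast image processors for one model.

    `loader` defaults to transformers' AutoImageProcessor.from_pretrained;
    it is injectable only so the helper can be tested offline.
    """
    if loader is None:
        # Imported lazily so the module imports without transformers installed.
        from transformers import AutoImageProcessor

        loader = AutoImageProcessor.from_pretrained
    return {
        "legacy": loader(model_id, use_fast=False),  # legacy Python processor
        "fast": loader(model_id, use_fast=True),  # fast processor
    }


# Example usage (assumed model ID; requires network access and transformers):
# pair = load_processor_pair("Qwen/Qwen2.5-VL-7B-Instruct")
```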
