[Ported][CI/Build] Share HuggingFace downloads between test runs #4874
Conversation
Using eager mode doesn't seem to lead to significant improvement. It seems that the bottleneck is in downloading the models, so we should parallelize this process.
Tbh it is probably better if we have a way to avoid re-downloading the models each time. Any thoughts?
I'm not that experienced in Kubernetes, but from my understanding, placing the HuggingFace cache inside a Volume should avoid the need to re-download the models when tests are run again in the same Pod. @rkooo567 is it possible on your end to force the CI to run on the same Pod so we can test whether the cache actually works in this way?
Hmm, I am not super familiar with how CI works actually (idk if we even use k8s under the hood). cc @simon-mo for thoughts.
@khluu since you're involved with CI, can you help out with this? Particularly the part concerning Kubernetes.
I believe we should download the model each time. @robertgshaw2-neuralmagic mentioned that putting them on NFS is a bit tricky because it might hit rate limits.
hostPath is a possible workaround.
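The shared-cache idea above can be sketched in Python: before any HuggingFace library is imported, the cache root can be redirected onto the mounted volume via the `HF_HOME` environment variable. This is a minimal sketch, assuming a mount path supplied by the CI setup; the helper name and the use of a temp directory as a stand-in mount are hypothetical.

```python
import os
import tempfile

def configure_shared_hf_cache(mount_point: str) -> str:
    """Point the HuggingFace cache at a directory on a mounted volume
    (e.g. a Kubernetes hostPath mount) so repeated test runs on the same
    node reuse already-downloaded model files. Hypothetical helper.
    """
    cache_dir = os.path.join(mount_point, "huggingface")
    os.makedirs(cache_dir, exist_ok=True)
    # HF_HOME is the environment variable huggingface_hub/transformers
    # consult for their cache root; it must be set before those libraries
    # resolve their cache paths.
    os.environ["HF_HOME"] = cache_dir
    return cache_dir

mount = tempfile.mkdtemp()  # stand-in for the volume mount path in CI
cache = configure_shared_hf_cache(mount)
print(os.environ["HF_HOME"] == cache)  # → True
```

In an actual CI setup the same effect is usually achieved by exporting `HF_HOME` in the job environment rather than in Python code.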
Hmm, I am against this. Imo we should test the default config for those tests (especially the test_model).
I have updated this PR to work on the AWS pipeline. Looks like this shaved around 10 minutes off the duration of the model tests. Going to rerun the test just to be sure.
This doesn't seem to be the case anymore. It's hard to determine the real effect since the test runs aren't necessarily performed on the same machine (from my understanding).
Due to #5757, I have moved this PR to vllm-project/ci-infra#8. |
The model tests keep getting interrupted (presumably due to running too long). This PR attempts to reduce the running time by:
- Sharing the HuggingFace cache between Kubernetes containers during CI by storing it in a `hostPath` volume.
- Disabling graph construction (considering that the vLLM model is only run once per test, not including the profile run).
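A minimal sketch of the second point: vLLM's `LLM` entrypoint accepts an `enforce_eager` flag that skips CUDA graph capture. Since engine construction requires a GPU, this sketch only assembles the keyword arguments a test harness might pass; the helper function is hypothetical.

```python
def eager_llm_kwargs(model: str) -> dict:
    """Build kwargs for vllm.LLM(**kwargs). enforce_eager=True makes vLLM
    run the model in PyTorch eager mode and skip CUDA graph construction,
    which is wasted work when each test runs the model only once
    (aside from the profile run). Hypothetical helper.
    """
    return {"model": model, "enforce_eager": True}

kwargs = eager_llm_kwargs("facebook/opt-125m")
print(kwargs)  # → {'model': 'facebook/opt-125m', 'enforce_eager': True}
```

The trade-off, as noted in the discussion above, is that eager mode is not the default configuration, so tests no longer exercise the CUDA-graph path.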