Add a simple TimesSlicing sharing example to quickstart (WIP) #113

Closed
wants to merge 1 commit into from

Conversation

yuanchen8911
Collaborator

@yuanchen8911 yuanchen8911 commented May 7, 2024

This PR adds an example for TimeSlicing sharing and updates the README description.

@yuanchen8911
Collaborator Author

/cc @klueska

@yuanchen8911
Collaborator Author

Closing this for now; I may add it to a separate folder instead.

@yuanchen8911 yuanchen8911 reopened this May 7, 2024
@yuanchen8911
Collaborator Author

yuanchen8911 commented May 7, 2024

Since there's already an MPS example in the quickstart folder, an additional example showing a different sharing strategy, TimeSlicing, would be helpful. WDYT, @klueska? We can put them in a separate folder if that works better.

@yuanchen8911 yuanchen8911 changed the title Add a TimesSlicing sharing example to quickstart Add a simple TimesSlicing sharing example to quickstart May 7, 2024
@klueska
Collaborator

klueska commented May 8, 2024

Yes, now that I realize this is not the top-level README, I agree this makes sense. Will review in more detail tomorrow.

@yuanchen8911 yuanchen8911 requested a review from klueska May 13, 2024 22:30
@klueska
Collaborator

klueska commented May 13, 2024

Would this now be superseded by #118 if we moved those to this level?

Signed-off-by: Yuan Chen <yuanc@nvidia.com>

Update namespace

Signed-off-by: Yuan Chen <yuanc@nvidia.com>

Update the timeslicing example

Signed-off-by: Yuan Chen <yuanc@nvidia.com>
@yuanchen8911 yuanchen8911 changed the title Add a simple TimesSlicing sharing example to quickstart Add a simple TimesSlicing sharing example to quickstart (WIP) May 15, 2024
@yuanchen8911
Collaborator Author

yuanchen8911 commented May 15, 2024

Would this now be superseded by #118 if we moved those to this level?

Let's hold off on the TimeSlicing example. It's not working as expected on my Linux workstation. The test deployed two pods configured to share a GPU via TimeSlicing. However, they ran sequentially rather than in parallel. The pending pod was unschedulable due to insufficient resources and didn't start until the first one completed. Did I misconfigure something, or does GeForce not support TimeSlicing?

$ k get pods -n mpsc-timeslicing-gpu-test
NAME        READY   STATUS    RESTARTS   AGE
gpu-pod-1   1/1     Running   0          5s
gpu-pod-2   0/1     Pending   0          5s

$ k get pods -n timeslicing-gpu-test
NAME        READY   STATUS      RESTARTS   AGE
gpu-pod-1   0/1     Completed   0          92s
gpu-pod-2   1/1     Running     0          92s

$ k get pods -n timeslicing-gpu-test
NAME        READY   STATUS      RESTARTS   AGE
gpu-pod-1   0/1     Completed   0          7m14s
gpu-pod-2   0/1     Completed   0          7m14s

@yuanchen8911
Collaborator Author

Would this now be superseded by #118 if we moved those to this level?

Yes, we won't need this if that PR is merged. That folder contains two examples for TimeSlicing.

@yuanchen8911
Collaborator Author

As @klueska suggested, we should use a ResourceClaim (not a ResourceClaimTemplate). That resolved the problem.
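
For reference, a minimal sketch of the shape that fix takes. It assumes the v1alpha2 DRA API and the driver's GpuClaimParameters type as they existed around the time of this PR. The namespace matches the output above; the claim name, parameters name, image, and command are hypothetical, and every field name should be checked against the CRDs installed in the cluster. The idea is that both pods reference the same pre-created ResourceClaim. A ResourceClaimTemplate generates a separate claim per pod, so each pod asks for its own GPU, and on a single-GPU machine the second pod stays Pending until the first finishes.

```yaml
# Sketch only: API versions and field names are assumptions based on the
# v1alpha2 DRA API; verify against your cluster before applying.
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: timeslicing-gpu-test
  name: timeslicing-params          # hypothetical name
spec:
  sharing:
    strategy: TimeSlicing
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  namespace: timeslicing-gpu-test
  name: shared-gpu                  # hypothetical name
spec:
  resourceClassName: gpu.nvidia.com
  parametersRef:
    apiGroup: gpu.resource.nvidia.com
    kind: GpuClaimParameters
    name: timeslicing-params
---
apiVersion: v1
kind: Pod
metadata:
  namespace: timeslicing-gpu-test
  name: gpu-pod-1
spec:
  restartPolicy: Never
  containers:
  - name: ctr
    image: ubuntu:22.04             # hypothetical image
    command: ["nvidia-smi", "-L"]   # hypothetical workload
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimName: shared-gpu # the same claim name goes in gpu-pod-2
```

gpu-pod-2 would be identical apart from metadata.name. With the template variant, the pod's source would instead use resourceClaimTemplateName, which is what produces one claim per pod.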
