doc: update minikube gpu support on windows wsl2 #18557

Open
wants to merge 1 commit into
base: master
196 changes: 182 additions & 14 deletions site/content/en/docs/tutorials/nvidia.md
@@ -7,7 +7,7 @@ date: 2018-01-02

## Prerequisites

- Linux
- Linux or Windows with WSL2 installed
- Latest NVIDIA GPU drivers
- minikube v1.32.0-beta.0 or later (docker driver only)

@@ -31,6 +31,8 @@ date: 2018-01-02

- Install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) on your host machine



- Configure Docker:
```shell
sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker
@@ -40,6 +42,41 @@ date: 2018-01-02
minikube start --driver docker --container-runtime docker --gpus all
```
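
To confirm that the Docker configuration step took effect, one option is to inspect the daemon configuration file. The sketch below assumes `nvidia-ctk` wrote its runtime entry to the default `/etc/docker/daemon.json` under a `runtimes` key named `nvidia`; the helper name `nvidia_runtime_registered` is hypothetical, and a custom Docker setup may keep its configuration elsewhere.

```python
import json
from pathlib import Path


def nvidia_runtime_registered(daemon_json="/etc/docker/daemon.json"):
    """Return True if the Docker daemon config lists a runtime named 'nvidia'.

    Assumption: the config lives at the default path and uses the
    standard top-level "runtimes" mapping.
    """
    try:
        config = json.loads(Path(daemon_json).read_text())
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON.
        return False
    if not isinstance(config, dict):
        return False
    runtimes = config.get("runtimes", {})
    return isinstance(runtimes, dict) and "nvidia" in runtimes


if __name__ == "__main__":
    print("nvidia runtime registered:", nvidia_runtime_registered())
```

If this prints `False`, re-running the `nvidia-ctk runtime configure` step and restarting Docker is a reasonable first thing to try.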
{{% /tab %}}
{{% tab Windows-WSL %}}
## Using the Windows-WSL2 driver

- Ensure WSL2 is enabled and Docker Desktop for Windows is installed.

- Ensure you have an NVIDIA driver installed on the Windows host. If one is not installed, follow the [NVIDIA Driver Installation Guide](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html).

**Note: Install the driver on Windows only. DO NOT install any Linux NVIDIA driver inside WSL2.**

- After installing the Windows driver, check that it is visible inside WSL2 by running `nvidia-smi`. If `nvidia-smi` is not found in `PATH`, you may also need to run `cp /usr/lib/wsl/lib/nvidia-smi /usr/bin/nvidia-smi` and `chmod ogu+x /usr/bin/nvidia-smi` inside WSL2. Writing to `/usr/bin` typically requires root privileges.

- Install the [CUDA Toolkit for WSL2](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local) inside WSL2. On the download page, select Linux as the target OS and WSL-Ubuntu as the distribution.

- Check if `bpf_jit_harden` is set to `0` inside WSL2
```shell
sudo sysctl net.core.bpf_jit_harden
```
- If it's not `0`, run:
```shell
echo "net.core.bpf_jit_harden=0" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```

- Install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) inside WSL2

- Configure Docker inside WSL2:
```shell
sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker
```
- Start minikube inside WSL2:
```shell
minikube start --driver docker --container-runtime docker --gpus all
```
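
The checks in the steps above can also be scripted. The following is a minimal sketch, not part of minikube: the function names `read_bpf_jit_harden` and `preflight` are hypothetical, and it assumes the sysctl is exposed at `/proc/sys/net/core/bpf_jit_harden`, which may only be readable as root.

```python
import shutil
from pathlib import Path

SYSCTL_PATH = "/proc/sys/net/core/bpf_jit_harden"


def read_bpf_jit_harden(path=SYSCTL_PATH):
    """Return the sysctl value as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Missing file, permission denied, or unexpected content.
        return None


def preflight(sysctl_path=SYSCTL_PATH, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if which("nvidia-smi") is None:
        problems.append("nvidia-smi not found in PATH")
    value = read_bpf_jit_harden(sysctl_path)
    if value is None:
        problems.append("could not read net.core.bpf_jit_harden (try sudo)")
    elif value != 0:
        problems.append(f"net.core.bpf_jit_harden is {value}, expected 0")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Prerequisite checks passed")
```

Run it inside WSL2 before `minikube start`; it only reports problems and changes nothing on the system.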
{{% /tab %}}

{{% tab none %}}
## Using the 'none' driver

@@ -145,19 +182,150 @@ Also:
- nvidia-docker [doesn't support
macOS](https://github.com/NVIDIA/nvidia-docker/issues/101) either.

## Why does minikube not support NVIDIA GPUs on Windows?

minikube supports Windows host through Hyper-V or VirtualBox.

- VirtualBox doesn't support PCI passthrough for [Windows
host](https://www.virtualbox.org/manual/ch09.html#pcipassthrough).

- Hyper-V supports DDA (discrete device assignment) but [only for Windows Server
2016](https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment)
## Hands-on: training an ML model in a Pod on a minikube cluster

The following example, taken from the [PyTorch quickstart tutorial](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html), trains a model on the Fashion-MNIST dataset. Run it to confirm that minikube GPU support actually works.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# Download training data from open datasets.
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)

# Download test data from open datasets.
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)

batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break

# Get cpu, gpu or mps device for training.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")

def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)
print("Done!")
torch.save(model.state_dict(), "model.pth")
print("Saved PyTorch Model State to model.pth")
```

Start minikube with GPU support:
```shell
minikube start --driver docker --container-runtime docker --gpus all
```

Create a pod using the `pytorch/pytorch` image, which has all the necessary libraries installed, and open a shell in it:
```shell
kubectl run torch --image=pytorch/pytorch -it -- /bin/bash
```
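
Before running a full training job, a quick check from inside the pod can confirm whether PyTorch sees the GPU at all. This is a minimal sketch; the helper name `describe_gpu` is hypothetical, and it only uses `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
def describe_gpu():
    """Return a one-line summary of what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch cannot see a CUDA device"
    # Index 0 is the first visible CUDA device.
    return f"CUDA device 0: {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(describe_gpu())
```

If the summary reports that no CUDA device is visible, the training script would silently fall back to the CPU, so it is worth resolving that first.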

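Now copy the script into the pod (for example with `kubectl cp` from another terminal) and run it with `python3`. The output should show the model training on the NVIDIA GPU with CUDA acceleration; look for the `Using cuda device` line: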

```
... ...
Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64
Using cuda device
NeuralNetwork(
(flatten): Flatten(start_dim=1, end_dim=-1)
(linear_relu_stack): Sequential(
(0): Linear(in_features=784, out_features=512, bias=True)
(1): ReLU()
(2): Linear(in_features=512, out_features=512, bias=True)
(3): ReLU()
(4): Linear(in_features=512, out_features=10, bias=True)
)
)
Epoch 1
... ...
```

Since the only possibility of supporting GPUs on minikube on Windows is on a
server OS where users don't usually run minikube, we haven't invested time in
trying to support NVIDIA GPUs on minikube on Windows.

Also, nvidia-docker [doesn't support
Windows](https://github.com/NVIDIA/nvidia-docker/issues/197) either.