Add support for IPv6 #178

Open. johnsushant wants to merge 1 commit into base: master

johnsushant commented Jul 14, 2024

Fixes #150

I've added IPv6 support to Hostengine and the dcgmi CLI without changing or breaking any existing functionality.

Hostengine now supports binding to an IPv6 address

Start hostengine:

$ nv-hostengine -b [::1] --log-level debug
Started host engine version 3.3.6 using port number: 5555

Confirm using lsof:

$ sudo lsof -i :5555
COMMAND       PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
nv-hosten 2004760 root   47u  IPv6 3165553609      0t0  TCP localhost:personal-agent (LISTEN)

Connect using dcgmi without port:

$ dcgmi discovery -l --host [::1]
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA PG509-210                                               |
|        | PCI Bus ID: 00000000:04:00.0                                         |
|        | Device UUID: GPU-de6a7a6a-776e-e6e2-bd3d-d8114ccf6db2                |

Connect using dcgmi with port:

$ dcgmi discovery -l --host [::1]:5555
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA PG509-210                                               |
|        | PCI Bus ID: 00000000:04:00.0                                         |
|        | Device UUID: GPU-de6a7a6a-776e-e6e2-bd3d-d8114ccf6db2                |
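The two invocations above pass the host as a bracketed IPv6 literal, with and without a port. The following is a minimal sketch of how such a value could be split into host and port. It is not DCGM's actual parser: `split_host_port` and `DEFAULT_PORT` are hypothetical names, and the default of 5555 is taken from the hostengine startup message shown earlier.

```python
from typing import Tuple

DEFAULT_PORT = 5555  # port reported by nv-hostengine in the output above


def split_host_port(value: str, default_port: int = DEFAULT_PORT) -> Tuple[str, int]:
    """Split a --host style value into (host, port).

    Hypothetical helper for illustration only; not DCGM's real parser.
    """
    if value.startswith("["):
        # Bracketed IPv6 literal: "[addr]" or "[addr]:port".
        end = value.find("]")
        if end == -1:
            raise ValueError(f"missing closing bracket in {value!r}")
        host, rest = value[1:end], value[end + 1:]
        if not rest:
            return host, default_port
        if not rest.startswith(":"):
            raise ValueError(f"unexpected text after ']' in {value!r}")
        return host, int(rest[1:])
    if value.count(":") == 1:
        # "hostname:port" or "ipv4:port".
        host, port = value.split(":")
        return host, int(port)
    # Bare hostname, IPv4 address, or unbracketed IPv6 literal without a port.
    return value, default_port


print(split_host_port("[::1]"))
print(split_host_port("[::1]:5555"))
```

The brackets are what make the port separator unambiguous, since an IPv6 literal itself contains colons.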

Hostengine now supports both IPv4 and IPv6 connections

Start hostengine:

$ nv-hostengine -b ALL --log-level debug
Started host engine version 3.3.6 using port number: 5555

Confirm using lsof:

$ sudo lsof -i :5555
COMMAND       PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
nv-hosten 2478925 root   47u  IPv6 3173352315      0t0  TCP *:personal-agent (LISTEN)

Connect using dcgmi on IPv4:

$ dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA PG509-210                                               |
|        | PCI Bus ID: 00000000:04:00.0                                         |
|        | Device UUID: GPU-de6a7a6a-776e-e6e2-bd3d-d8114ccf6db2                |

Connect using dcgmi on IPv6:

$ dcgmi discovery -l --host [::1]
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA PG509-210                                               |
|        | PCI Bus ID: 00000000:04:00.0                                         |
|        | Device UUID: GPU-de6a7a6a-776e-e6e2-bd3d-d8114ccf6db2                |
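The lsof output above shows a single IPv6 listener on `*`, which is consistent with a dual-stack socket. Whether `-b ALL` is implemented that way in this PR is an assumption, since the description does not say. As a rough sketch of the general technique (the function names are hypothetical), one `AF_INET6` socket with `IPV6_V6ONLY` cleared accepts both IPv4 and IPv6 clients:

```python
import socket


def open_dual_stack_listener(port: int = 0) -> socket.socket:
    """Listen on all IPv4 and IPv6 addresses with one AF_INET6 socket."""
    srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    try:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Clearing IPV6_V6ONLY lets the socket also accept IPv4 clients,
        # which then appear as IPv4-mapped addresses (::ffff:a.b.c.d).
        srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
        srv.bind(("::", port))  # port 0 asks the OS for a free port
        srv.listen()
    except OSError:
        srv.close()
        raise
    return srv


def can_connect(host: str, port: int) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False
```

On a host with IPv6 enabled, a listener opened this way should be reachable through both `127.0.0.1` and `::1` on the port returned by `getsockname()`.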

Hostengine default IPv4 functionality is not broken

Start hostengine:

$ nv-hostengine --log-level debug
Started host engine version 3.3.6 using port number: 5555

Confirm using lsof:

$ sudo lsof -i :5555
COMMAND       PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
nv-hosten 2476116 root   47u  IPv4 3173251779      0t0  TCP localhost:personal-agent (LISTEN)

Connect using dcgmi on IPv4:

$ dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA PG509-210                                               |
|        | PCI Bus ID: 00000000:04:00.0                                         |
|        | Device UUID: GPU-de6a7a6a-776e-e6e2-bd3d-d8114ccf6db2                |

Connect using dcgmi on IPv6 (expected failure):

$ dcgmi discovery -l --host [::1]
Error: unable to establish a connection to the specified host: [::1]
Error: Unable to connect to host engine. Host engine connection invalid/disconnected.
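The failure above is expected because the default listener is an IPv4 socket on localhost (TYPE `IPv4` in the lsof output), so a client dialing `::1` has no matching listener. The sketch below illustrates that reasoning with the standard `ipaddress` module. The helper names are hypothetical, `127.0.0.1` as the default bind address is an assumption, and the check deliberately ignores the dual-stack wildcard case.

```python
import ipaddress
import socket


def address_family(address: str) -> socket.AddressFamily:
    """Map an IP literal (optionally bracketed) to its socket family."""
    addr = ipaddress.ip_address(address.strip("[]"))
    return socket.AF_INET6 if addr.version == 6 else socket.AF_INET


def families_match(listen_address: str, client_target: str) -> bool:
    """Simplified check: a single-family listener only serves its own family."""
    return address_family(listen_address) == address_family(client_target)


# Assumed default bind address; the source only shows "localhost" over IPv4.
print(families_match("127.0.0.1", "127.0.0.1"))
print(families_match("127.0.0.1", "[::1]"))
```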

Signed-off-by: John Sushant Sundharam <johnsushant@outlook.com>

johnsushant (Author) commented Jul 16, 2024

@glowkey @nikkon-dev Can I please get a review on this?

glowkey (Collaborator) commented Jul 16, 2024

First pass of the code looks good. Can you run your changes through the testing framework and paste the final results (pass/fail counts)?
https://github.com/NVIDIA/DCGM?tab=readme-ov-file#running-the-test-framework

johnsushant (Author) commented Jul 17, 2024

Thanks a lot for taking a look! Here are the final results I got:

========== TEST SUMMARY ==========
Passed: 458
Failed: 7
Waived: 68
Total:  465
Score:  99.24
==================================

These are the failed tests:

&&&& FAILED test_dcgm_action_stats_basics_pcie_standalone_with_service_account - 319
&&&& FAILED test_dcgm_action_stats_basics_targeted_power_standalone_with_service_account - 325
&&&& FAILED test_dcgm_action_stats_basics_targeted_stress_standalone_with_service_account - 331
&&&& FAILED test_dcgm_action_stats_file_present_standalone_with_service_account - 337
&&&& FAILED test_dcgm_action_string_stats_file_present_standalone_with_service_account - 343
&&&& FAILED test_diag_stats_bad_statspath_standalone_with_service_account - 349
&&&& FAILED test_dcgm_library_existence - 597
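To make the failure list easier to triage, the lines above can be grouped by test-name suffix. This is a small sketch, not part of the DCGM test framework: the regular expression assumes the `&&&& FAILED <name> - <number>` layout shown above, and the trailing number is kept uninterpreted.

```python
import re
from collections import Counter

FAILED_LINE = re.compile(r"^&&&& FAILED (\S+) - (\d+)$")
SERVICE_ACCOUNT_SUFFIX = "_standalone_with_service_account"

# Failure lines copied from the test output above.
FAILED_LOG = """\
&&&& FAILED test_dcgm_action_stats_basics_pcie_standalone_with_service_account - 319
&&&& FAILED test_dcgm_action_stats_basics_targeted_power_standalone_with_service_account - 325
&&&& FAILED test_dcgm_action_stats_basics_targeted_stress_standalone_with_service_account - 331
&&&& FAILED test_dcgm_action_stats_file_present_standalone_with_service_account - 337
&&&& FAILED test_dcgm_action_string_stats_file_present_standalone_with_service_account - 343
&&&& FAILED test_diag_stats_bad_statspath_standalone_with_service_account - 349
&&&& FAILED test_dcgm_library_existence - 597
"""


def group_failures(log_text: str) -> Counter:
    """Count failed tests by whether they carry the service-account suffix."""
    groups = Counter()
    for line in log_text.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a FAILED line
        name = match.group(1)
        key = "service_account" if name.endswith(SERVICE_ACCOUNT_SUFFIX) else "other"
        groups[key] += 1
    return groups


print(group_failures(FAILED_LOG))
```

Grouping this way shows that six of the seven failures share one suffix; whether they share a cause is not established here.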

Successfully merging this pull request may close these issues.

DCGM cannot listen on ipv6 address