Hi @PalNilsson
Running jobs on Kubernetes, we face the issue that processes running in pods see the pod name, not the name of the node, as the host name. For example, in a pod named grid-job-16703192-mp47k (which is effectively the batch job ID):
bash-4.2$ hostname
grid-job-16703192-mp47k
This makes sense for k8s, and is appropriate because each pod does have its own unique IP address, but it doesn't fit well with Panda; the result is: https://bigpanda.cern.ch/wns/CA-VICTORIA-K8S-T2/?hours=12
Every "node" is a random unique ID, so it is very difficult to correlate jobs with real nodes and identify problematic nodes.
We can easily expose the real node name as an env var, for example through the Kubernetes downward API (a fieldRef to spec.nodeName).
So I would like to propose that the pilot look for a specific env var (maybe PANDA_NODE_NAME or PILOT_NODE_NAME?) and, if it exists, use its value instead of the result of hostname when reporting details to the Panda server. Would that be reasonable?
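To make the proposal concrete, here is a minimal sketch of the lookup in Python. It is only an illustration of the idea, not the pilot's actual code: the function name `get_node_name` is hypothetical, and the two variable names are the candidates suggested above, neither of which is an agreed convention yet.

```python
import os
import socket

# Candidate variable names from the proposal; these are suggestions,
# not names the pilot is known to read today.
NODE_NAME_ENV_VARS = ("PANDA_NODE_NAME", "PILOT_NODE_NAME")


def get_node_name(environ=None):
    """Return the node name to report, preferring an env var override.

    Falls back to the system host name (the pod name inside a
    Kubernetes pod) when no override is set or it is blank.
    """
    environ = os.environ if environ is None else environ
    for var in NODE_NAME_ENV_VARS:
        value = environ.get(var, "").strip()
        if value:
            return value
    return socket.gethostname()
```

With no override set, the behavior is unchanged from today, so sites that do not run on Kubernetes would not be affected.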