Skip to content
This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

add refresh for {node,job}-exporter, prometheus and watchdog #1865

Merged
merged 1 commit into from Dec 12, 2018

Conversation

xudifsd
Copy link
Member

@xudifsd xudifsd commented Dec 11, 2018

fix #1748 , I also fixed a bug when calling paictl.py service refresh. Not sure if this is correct. @ydye please have a check.

@xudifsd xudifsd requested review from hao1939 and ydye December 11, 2018 05:35
@coveralls
Copy link

coveralls commented Dec 11, 2018

Coverage Status

Coverage increased (+18.4%) to 70.146% when pulling 1d98665 on dixu/add-refresh into 2c08713 on master.

@@ -20,9 +20,7 @@

pushd $(dirname "$0") > /dev/null


echo "refresh prometheus configuration"
kubectl apply -f prometheus-configmap.yaml || exit $?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wandering why it doesn't work. Do you have any idea?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this only works if someone changed some parameters that are allowed to change in description of configmap. It will not restart prometheus service, which might be expected behavior by refresh user.

@@ -67,7 +67,7 @@ def get_service_list(self):

def refresh_all_label(self):
self.logger.info("Begin to refresh all the nodes' labels")
machinelist = self.cluster_object_model['machinelist']
machinelist = self.cluster_object_model['machine']['machine-list']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are going to decouple the statically binding of machine list.
Could you query the machine list from k8s api?
Or via kube get pods?

Copy link
Member Author

@xudifsd xudifsd Dec 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should ask @ydye about this. Anyway, this PR is focused on the refresh of these services not the way of how paictl.py implement it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hao1939, maybe machinelist should be kept for paiservice. it is just for paiservice, machinelist should be retrieved from k8s api, instead of clusterconfig. of course, before deploying paiservice, you will need to validate the k8s cluster state is healthy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd better remove statically machinelist, since we are working on a scalable system.
We should base on assume that system is on changing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My currently assume is that:

  1. k8s manage the nodes, and label the nodes.
  2. PAI service bind to label, for scalability, it should decouple from specific nodes when possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hao1939 , that makes sense. please present your design once ready.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, please approve this PR if appropriate. Changing to dynamic machine list can be addressed by another PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

@xudifsd xudifsd merged commit 8b922c0 into master Dec 12, 2018
@xudifsd xudifsd deleted the dixu/add-refresh branch December 12, 2018 02:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Webportal display the wrong GPU usage ratio: 179% allocated GPUs
4 participants