Pserver Save state #2716

Merged
merged 12 commits into from
Jul 11, 2017

Conversation

dzhwinter
Contributor

fix #2566

@@ -79,6 +79,8 @@ func TestServiceFull(t *testing.T) {
if !reflect.DeepEqual(param1, p) {
t.FailNow()
}
var dummy int
s.Save("", &dummy)
Contributor

Passing in nil is fine: s.Save("", nil). I used s.Save("", &dummy) before, but later realized that passing nil works :)

Contributor Author

Done.

@@ -166,3 +168,7 @@ func TestBlockUntilInitialized(t *testing.T) {

wg.Wait()
}

func TestCheckpointSpeed(t *testing.T) {
Contributor

Speed can be tested with a benchmark. Here is an example: https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go
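
For illustration, a benchmark along those lines might look like the sketch below. The `paramCheckpoint` type and `encodeCheckpoint` helper are hypothetical stand-ins, not code from this PR, and gob encoding is only an assumed workload. In a `_test.go` file the `BenchmarkEncodeCheckpoint` function would be picked up by `go test -bench`; here `testing.Benchmark` is used so the sketch runs on its own.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"testing"
)

// paramCheckpoint is a hypothetical stand-in for the data a pserver checkpoints.
type paramCheckpoint struct {
	Name    string
	Content []byte
	State   []byte
}

// encodeCheckpoint serializes one checkpoint record with gob.
func encodeCheckpoint(cp paramCheckpoint) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(cp); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// BenchmarkEncodeCheckpoint has the signature that `go test -bench` expects.
func BenchmarkEncodeCheckpoint(b *testing.B) {
	cp := paramCheckpoint{Name: "w", Content: make([]byte, 64<<10), State: make([]byte, 1<<10)}
	b.ResetTimer() // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		if _, err := encodeCheckpoint(cp); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside of `go test`.
	res := testing.Benchmark(BenchmarkEncodeCheckpoint)
	fmt.Println(res.String())
}
```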

Contributor Author

Left a TODO here. It will be tested after reaching an agreement on @Yancey1989's recovery logic.

@@ -38,6 +52,7 @@ type Parameter struct {
type ParameterWithConfig struct {
Param Parameter
Config []byte // parameter configuration in Proto Buffer format
State []byte // parameter training state
Contributor

ParameterWithConfig is the data sent from the trainer to the pserver. State, however, is saved and loaded by the pserver and has nothing to do with the trainer, so it should not be part of this struct.

Maybe:

type checkpoint struct {
  Uuid      string
  Md5sum    string
  Timestamp string
  ParameterWithConfig // this is called embedded field
  State  []byte
}

embedded field: https://golang.org/ref/spec#Struct_types
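
A minimal sketch of how the embedded field behaves, assuming simplified field types for `Parameter` (the real definition is not shown in this hunk) and a hypothetical `newCheckpoint` helper:

```go
package main

import "fmt"

// Parameter and ParameterWithConfig mirror the shapes quoted in the review;
// the fields of Parameter are assumptions made for this sketch.
type Parameter struct {
	Name    string
	Content []byte
}

type ParameterWithConfig struct {
	Param  Parameter
	Config []byte
}

// checkpoint embeds ParameterWithConfig and adds pserver-only state.
type checkpoint struct {
	Uuid                string
	Md5sum              string
	Timestamp           string
	ParameterWithConfig // embedded field: its fields are promoted
	State               []byte
}

// newCheckpoint is a hypothetical constructor used only for illustration.
func newCheckpoint(name string, state []byte) checkpoint {
	return checkpoint{
		ParameterWithConfig: ParameterWithConfig{Param: Parameter{Name: name}},
		State:               state,
	}
}

func main() {
	cp := newCheckpoint("w1", []byte{1, 2})
	// cp.Param is shorthand for cp.ParameterWithConfig.Param.
	fmt.Println(cp.Param.Name, len(cp.State))
}
```

With this layout, State lives next to the trainer-supplied data without being part of what the trainer sends.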

Contributor Author

Split into info and data parts. Fixed.

@@ -142,8 +177,51 @@ func (s *Service) GetParam(name string, parameter *Parameter) error {

// Save tells the parameter server to save parameters.
func (s *Service) Save(path string, dummy *int) error {
Contributor

Save was intended for saving the model, but we no longer use pservers to save the model. Can you rename Save to checkpoint? Also, I think that at least for the first implementation, checkpoint should not be exposed as an RPC method to the trainer; instead, pservers checkpoint periodically. So can you make this a private function: func (s *Service) checkpoint(path string) error? (Note that we don't need the dummy *int parameter anymore if it's not used for RPC.)

Contributor Author

fix Done.

log.Infof("parameter checkpoint %s", ckbytes)

if _, err = os.Stat(ck.Uuid); os.IsNotExist(err) {
log.Info("checkpoint not exists.")
Contributor

checkpoint not exists. -> checkpoint does not exist.

Contributor Author

fix done.

log.Info("checkpoint not exists.")
} else {
err = os.Remove(ck.Uuid)
log.Infof("remove %s", ck.Uuid)
Contributor

remove %s -> checkpoint %s already exists, removing.

Contributor Author

fix done.

log.Infof("remove %s", ck.Uuid)
}
f, err := os.Create(ck.Uuid)
defer f.Close()
Contributor

defer f.Close() runs when this function returns, not when the for loop moves to the next iteration, and the for loop may run for a long time. So perhaps call f.Close() at the end of each loop iteration.

Contributor Author

fix done.

@@ -14,6 +24,10 @@ const (
Uninitialized = "pserver not fully initialized"
)

const (
checkpoint_path = "./checkpoints/"
Contributor

Go's naming convention is camelCase, not snake_case.

checkpointPath needs to be an argument (flag.String) passed to the go/cmd/pserver program, since k8s will configure the path.

Contributor Author

Done.

log.Errorln(err)
}
// TODO: according design doc, need to save Uuid to etcd in json format
// {\"Uuid\": [UUID], \"md5\", \"MD5 sum\", \"Timestamp\": xxxx}
Contributor

I think the design doc mentioned using etcd to save checkpoint information as well. Maybe add a TODO?

Contributor Author

Added etcd saving logic. Fixed.

}

//serialize ParameterWithConfig to byte stream
func GetBytes(content ...interface{}) ([]byte, error) {
Contributor

Making the argument content ...interface{} adds complexity to the code: since we are encoding an interface type, we have to call gob.Register. That makes the code harder to understand (people need to look up what gob.Register does) and harder to maintain (whenever a new type is passed to GetBytes, the maintainer needs to add a gob.Register call as well, which is hard to track).

Since we only need to call GetBytes twice here, and the function does not contain much code, maybe just inline it (and remove gob.Register)?
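
As a rough sketch of the inlined approach, encoding a concrete struct directly needs no gob.Register call. The `checkpointMeta` type and `roundTrip` helper below are hypothetical and exist only to show the round trip.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// checkpointMeta is a hypothetical concrete type used only for this sketch.
type checkpointMeta struct {
	Uuid      string
	Md5sum    string
	Timestamp string
}

// roundTrip encodes a concrete struct and decodes it again. The encoder is
// given a concrete type rather than values wrapped in interface{}, so no
// gob.Register call is required.
func roundTrip(in checkpointMeta) (checkpointMeta, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		return checkpointMeta{}, err
	}
	var out checkpointMeta
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	out, err := roundTrip(checkpointMeta{Uuid: "abc", Md5sum: "d41d", Timestamp: "2017-07-11"})
	fmt.Println(out.Uuid, err)
}
```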

Contributor Author

fix Done.

@@ -52,14 +67,34 @@ type Service struct {
optMap map[string]*optimizer
}

type checkpoint struct {
Uuid string
Contributor

Add the struct tag `json:"uuid"` at the end of the line, so we can use json.Marshal to produce the JSON format.
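
A small sketch of what the tag does with json.Marshal. Only the "uuid" key comes from the comment above; the other key names, the `checkpointMeta` type, and the `marshalMeta` helper are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpointMeta is a hypothetical metadata record; only the "uuid" key is
// taken from the review, the other key names are assumed.
type checkpointMeta struct {
	Uuid      string `json:"uuid"`
	Md5sum    string `json:"md5"`
	Timestamp string `json:"timestamp"`
}

// marshalMeta renders the record as a JSON string using the tag names.
func marshalMeta(m checkpointMeta) (string, error) {
	b, err := json.Marshal(m)
	return string(b), err
}

func main() {
	s, err := marshalMeta(checkpointMeta{Uuid: "abc", Md5sum: "d41d", Timestamp: "2017-07-11"})
	fmt.Println(s, err)
}
```

Without the tags, json.Marshal would emit the Go field names (Uuid, Md5sum, Timestamp) as keys.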

Contributor Author

agree with another PR. fix Done.

err = os.Remove(ck.Uuid)
log.Infof("remove %s", ck.Uuid)
}
f, err := os.Create(ck.Uuid)
Contributor

This will create a separate file for each parameter. Following the design doc, shouldn't we have only one checkpoint file, named by the UUID?

Contributor Author

fix done.

Contributor

@typhoonzero left a comment

LGTM++ except for small comments.

@@ -20,6 +20,8 @@ func main() {
"comma separated endpoint string for pserver to connect to etcd")
etcdTimeout := flag.Int("etcd-timeout", 5, "timeout for etcd calls")
numPservers := flag.Int("num-pservers", 1, "total pserver count in a training job")
checkpointPath := flag.String("checkpoint-path", "/checkpoints/", "save checkpoint path")
checkpointInterval := flag.Int("checkpoint-interval", 10, "save checkpoint per interval seconds")
Contributor

A default of 10 seconds may be too quick?

Contributor Author

Yeah, a fixed interval is not appropriate for every training job, since the time consumed is determined by the amount of training data. A round count may be better here.
Changed it to 10 min (600 seconds).

}

// Checkpoint is the pserver shard persist in file
type Checkpoint []parameterCheckPoint
Contributor

The exported type is a slice of an unexported type, which may be inconvenient to use.

Contributor Author

Done.

@dzhwinter dzhwinter merged commit 15f021a into PaddlePaddle:develop Jul 11, 2017
Development

Successfully merging this pull request may close these issues.

Pserver save state.
4 participants