Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PServer recovery from checkpoint #2741

Merged
merged 15 commits into from
Jul 13, 2017

Conversation

Yancey1989
Copy link
Contributor

@Yancey1989 Yancey1989 commented Jul 4, 2017

Fixed #2671

Related: Pserver save state #2716

// GetCheckpoints gets the checkpoint information by the specified pserver id
func (e *EtcdClient) GetCheckpointInfo(string idx) (string, error) {
key := PsCheckpoint + idx
ctx, cancel := context.WithTimeout(context.Background(), time.Second*time.Duration(timeout))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is timeout defined, does this code compiles? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry :), fixed it.

@@ -186,3 +188,20 @@ func (e *EtcdClient) registerPserverEtcd(ctx context.Context) (int, error) {

return idx, nil
}

// GetCheckpoints gets the checkpoint information by the specified pserver id
func (e *EtcdClient) GetCheckpointInfo(string idx) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename this function to GetKey, caller can compute and pass in the key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

UUID string `json:"uuid"`
MD5 string `json:"md5"`
Timestamp string `json:"timestamp"`
State []byte
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious why State and ParameterWithConfig does not have json tag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checkpoint will be initialized from the JSON data in etcd, which only the the key uuid, md5 and timestamp, but according with your other comment, separate Checkpoint and CheckpointInfo is better way, thanks for @helinwang .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

MD5 string `json:"md5"`
Timestamp string `json:"timestamp"`
State []byte
ParameterWithConfig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with your comment on the other PR, Checkpoint need to contain checkpoints for all parameters on the pserver.
I was confused before. I think Checkpoint should be renamed to CheckpointInfo, it's actually a reference to the checkpoint file on distributed filesystem. The type of Timestamp should be changed to int64.
Perhaps something like:

type CheckpointInfo struct {
  UUID string
  Path string
  MD5 string
  Timestamp int64
}

Maybe we can have another type like:

type ParameterCheckpoint struct {
  ParameterWithConfig
  State []byte
}

type Checkpoint []ParameterCheckpoint

How do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I think separate Checkpoint and CheckpointInfo is a good idea.I will do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename ParameterCheckpoint to CheckpointData should be more clear. Then these two struct should be:

type CheckpointMeta struct {}
type CheckpointData struct {}
type Checkpoint struct {
  meta CheckpointMeta
  data CheckpointData
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, CheckpointMeta looks like a appropriate name.
Maybe we don't need the struct Checkpoint including meta and data? We can

  • Json.Umarshal from etcd to initialize CheckpointMeta.
  • gob.Decode(...) from the checkpoint file to initialize CheckpointData

I looks like more simple?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@typhoonzero How do you think about this way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed with @Yancey1989 . Do we need to separate meta and data in checkpoint ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree!

Do we need to separate meta and data in checkpoint ?

Maybe not, It's simple enouph to load meta and data into the struct separately.

MD5 string `json:"md5"`
Timestamp string `json:"timestamp"`
State []byte
ParameterWithConfig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename ParameterCheckpoint to CheckpointData should be more clear. Then these two struct should be:

type CheckpointMeta struct {}
type CheckpointData struct {}
type Checkpoint struct {
  meta CheckpointMeta
  data CheckpointData
}

}

// NewCheckpoint creates a new checkpoint.
func NewCheckpoint(idx int, cpPath string, e *EtcdClient) (*Checkpoint, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to NewCheckpointFromFile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the offline discuss, I will add Load and Save method for the struct Checkponit:)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if err != nil {
return nil, err
}
// TODO: create checkpoint from file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finish this TODO and then review.

@helinwang helinwang dismissed their stale review July 5, 2017 03:39

I may not have internet connection in the following few days. Don't want to block this PR from being merged.

@Yancey1989 Yancey1989 changed the title [WIP] PServer recovery from checkpoint PServer recovery from checkpoint Jul 7, 2017
Copy link
Contributor

@dzhwinter dzhwinter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some part of the design is quite different.
Please review this PR when you in spare time.
https://github.com/PaddlePaddle/Paddle/pull/2716/files

@@ -49,13 +54,95 @@ type Service struct {
optMap map[string]*optimizer
}

// CheckpointMeta saves the checkpoint information
type CheckpointMeta struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkpointMeta is an inner type which only used by pserver, it should not be a exported type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

s, err = pserver.NewService(idx)
candy.Must(err)
} else {
s, err = pserver.NewServiceFromCheckpoint(idx, cp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unify these two interface may be better, we can pass an empty checkpoint in default. then we will have only one create entry and leave out the duplicate code.

s := &Service{
	idx: idx,
}
s.optMap = make(map[string]*optimizer)
s.initialized = make(chan struct{})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thx!

}

// ParameterCheckpoint saves parameter checkpoint
type ParameterCheckpoint struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as Meta type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


// ParameterCheckpoint saves parameter checkpoint
type ParameterCheckpoint struct {
ParameterWithConfig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there in only type here without variable name.

Copy link
Contributor Author

@Yancey1989 Yancey1989 Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According with: https://golang.org/ref/spec#Struct_types, a field without a explicit field name is an anonymous field, so I can initialize the Param filed of ParameterWithConfig as ParameterCheckpoint.Param = xxx

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get it. a brilliant feature of Go.


dec := gob.NewDecoder(f)

if err = dec.Decode(&cp.data); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to check md5 value here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

// Checkpoint saves all parameters' checkpoint
type Checkpoint struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe these things should not be the member of Checkpoint. They do not persist in permanent storage. I move them into Server's argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@dzhwinter dzhwinter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Timestamp string `json:"timestamp"`
}

// Checkpoint is the pserver shard persist in file
type Checkpoint []ParameterCheckpoint
type Checkpoint []parameterCheckpoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am following @typhoonzero 's comment here for possible usage.
#2716 (comment)
But it is true, single parameterCheckpoint is useless to the user. It should be use immediately out of pserver's package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can fix this if something error for the later.

}
md5 := hex.EncodeToString(h.Sum(nil))
if md5 != cpMeta.MD5 {
return nil, errors.New("md5 does match, load checkpoint failed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to put the const error type together? see the line of L28.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -189,6 +189,22 @@ func (e *EtcdClient) registerPserverEtcd(ctx context.Context) (int, error) {
return idx, nil
}

// GetKey gets the value by the specified key
func (e *EtcdClient) GetKey(key string, timeout int) ([]byte, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is more clear if the type of timeout is time.Duration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -26,6 +27,8 @@ const (
AlreadyInitialized = "pserver already initialized"
// Uninitialized is true if pserver not fully initialized
Uninitialized = "pserver not fully initialized"
// CheckpointUnmatched is true if the md5 of checkpoint file does matched the older
CheckpointUnmatched = "checkpoint file does not matched the older"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe CheckpointMD5Failed = "checkpoint file MD5 validation failed" would better express the error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -26,6 +27,8 @@ const (
AlreadyInitialized = "pserver already initialized"
// Uninitialized is true if pserver not fully initialized
Uninitialized = "pserver not fully initialized"
// CheckpointUnmatched is true if the md5 of checkpoint file does matched the older
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "is true" is not very appropriate in English grammar, it's string type, "is true" is for bool type. Since the content of these constants already express the meaning, maybe we don't need the comment?
Perhaps we can put a comment on the top of const keyword:

// RPC error messages.
const (
  AlreadyInitialized = "pserver already initialized"
  Uninitialized = "pserver not fully initialized"
  CheckpointUnmatched = "checkpoint file does not matched the older"
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

type checkpointMeta struct {
UUID string `json:"uuid"`
Md5sum string `json:"md5sum"`
MD5 string `json:"md5"`
Timestamp string `json:"timestamp"`
Copy link
Contributor

@helinwang helinwang Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this is not related to this PR.
Timestamp should have type int64, it's value should be from func (t Time) UnixNano() int64.
Having string here makes it not precise enough (精度不够). time.Now().String() is only accurate to second: 2009-11-10 23:00:00 +0000 UTC.
Can you add a TODO: change type of timestamp to int64 here? (or would be great if you could fix it 👍 ).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return nil, errors.New(CheckpointUnmatched)
}

dec := gob.NewDecoder(f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since io.Copy already read through f once, gob.NewDecoder(f) requires to read from the beginning of the file, but f have not been seek to the beginning of the file. Does this actually work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it's a bug...
Done.

// endpoints specified.
func NewService(idx int, seconds int, path string, client *EtcdClient, cp Checkpoint) (*Service, error) {
// endpoints specified. It will recovery from checkpoint file if a exists a specified checkpoint.
func NewService(idx int, interval int, path string, client *EtcdClient, cp *Checkpoint) (*Service, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it's not related to this PR. I know etcd uses int to express the type of timeout, we should really use type time.Duration for interval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -186,13 +231,13 @@ func (s *Service) doCheckpoint() error {
s.mu.Lock()
defer s.mu.Unlock()

cp := make([]ParameterCheckpoint, 0, len(s.optMap))
cp := make([]parameterCheckpoint, 0, len(s.optMap))
Copy link
Contributor

@helinwang helinwang Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this is not related to this PR.
This does not work. It creates a slice with length set to 0 and capacity set to len(s.optMap). access cp[0] will crash the program since it's length is 0. Should use cp := make([]parameterCheckpoint, len(s.optMap)) here.

For example:

package main

func main() {
	a := make([]int, 0, 10)
	for i := 0; i < 10; i++ {
		a[i] = i
	}
}

Output:

panic: runtime error: index out of range

goroutine 1 [running]:
main.main()
	/tmp/sandbox064034874/main.go:6 +0x20

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @helinwang , done.

if err != nil {
panic(err)
log.Infof("Fetch checkpoint failed, %s\n", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe log.Errorf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if err != nil {
panic(err)
}
s, err := pserver.NewService(idx, time.Duration(*checkpointInterval)*time.Second, *checkpointPath, e, cp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can change flag.Int("checkpoint-interval", 600, "save checkpoint per interval seconds") to flag.Duration("checkpoint-interval", 600*time.Second, "save checkpoint interval"), and not do type cast here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


cpMetajson, _ := json.Marshal(cpMeta)
err = s.client.PutKey(filepath.Join(PsCheckpoint, strconv.Itoa(s.idx)), cpMetajson, 3)
err = s.client.PutKey(filepath.Join(PsCheckpoint, strconv.Itoa(s.idx)), cpMetajson, time.Duration(3)*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change to 3*time.Second. Constant in Go is well designed: https://blog.golang.org/constants

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}
return s, nil
}

// InitParam initializes a parameter.
func (s *Service) InitParam(paramWithConfigs ParameterWithConfig, dummy *int) error {
func (s *Service) InitParam(paramWithConfigs *ParameterWithConfig, dummy *int) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is passing pointer * ParameterWithConfig necessary? Please see: https://github.com/golang/go/wiki/CodeReviewComments#pass-values

Copy link
Contributor Author

@Yancey1989 Yancey1989 Jul 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -35,7 +35,7 @@ func cArrayToSlice(p unsafe.Pointer, len int) []byte {
return (*[1 << 30]byte)(p)[:len:len]
}

func newOptimizer(paramWithConfigs ParameterWithConfig, State []byte) *optimizer {
func newOptimizer(paramWithConfigs *ParameterWithConfig, State []byte) *optimizer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is passing pointer * ParameterWithConfig necessary? Please see: https://github.com/golang/go/wiki/CodeReviewComments#pass-values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@helinwang helinwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM++ Thanks!!!

@Yancey1989 Yancey1989 merged commit 19bfb8a into PaddlePaddle:develop Jul 13, 2017
@Yancey1989 Yancey1989 deleted the pserver_from_checkpoint branch July 13, 2017 01:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants