A new design of our Python save / load interfaces #8521

Closed

JiayiFeng opened this issue Feb 23, 2018 · 4 comments

@JiayiFeng (Collaborator) commented Feb 23, 2018

Related issue: #7163

Issues

Currently, there are a few obvious issues in our program saving and loading interfaces:

  1. save_params() and load_params() are useless and misleading. Some variables required for making checkpoints are not Parameter, so a model cannot continue its training or run inference after a save_params() / load_params() round trip. The correct way of making a checkpoint is to use save_persistables() and load_persistables().

  2. save_vars() and load_vars() take an executor, build a temporary program, and then execute that program immediately (see the sketch after this list). This means variable saving and loading can only be triggered by Python code; we cannot make checkpoints or save parameters in an environment without Python.
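
For reference, the current eager behavior looks roughly like this (a simplified sketch of what save_vars() does today, not its exact code; op names and attributes follow the framework's conventions):

# builds a throwaway program full of save_ops and runs it on the
# spot, so saving can only ever be driven from Python
def save_vars(executor, dir, var_list):
    prog = Program()
    block = prog.global_block()
    for var in var_list:
        block.append_op(
            type='save',
            inputs={'X': [var]},
            outputs={},
            attrs={'file_path': os.path.join(dir, var.name)})
    executor.run(prog)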

Proposed Solution

Base functions

To fix the existing issues, we redesign our Python io module. The proposed new io module mainly consists of the following base functions:

# serializes the given program and saves it in dir
def save_program(program, dir):
    ...

# loads and deserializes a program from the given dir
def load_program(dir):
    ...
    return res_prog

# appends save_ops to the given program;
# the appended save_ops save the variables in var_list to dir
def save(var_list, dir, program):
    ...

# appends load_ops to the given program;
# the appended load_ops load the variables named in var_name_list from dir
def load(var_name_list, dir, program):
    ...

save() and load() can be considered the layer wrappers of save_op and load_op. Unlike the current save_vars() and load_vars(), they do not execute immediately; they just append save_ops or load_ops to the given program and leave the execution to the runtime.

By using these base functions, we can save our program at any stage of model configuration, and save or load any specific variable values at any phase of program execution.
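
For illustration, load() might be implemented roughly as follows (a minimal sketch; append_op and the load_op's output/attribute names follow current framework conventions but are assumptions here, not the final API):

def load(var_name_list, dir, program):
    # appends one load_op per variable; nothing runs here -- the
    # ops execute whenever the given program itself is run
    block = program.global_block()
    for name in var_name_list:
        # reuse the variable if the program already declares it,
        # otherwise declare it in this program
        if block.has_var(name):
            var = block.var(name)
        else:
            var = block.create_var(name=name, persistable=True)
        block.append_op(
            type='load',
            inputs={},
            outputs={'Out': [var]},
            attrs={'file_path': os.path.join(dir, name)})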

Checkpoints

To make it more user-friendly, we can add some high-level wrappers for checkpoint-related functions:

"""
checkpoint loader is a special startup program. 
A regular startup holds only initializer ops, 
while a checkpoint loader may hold some load_op.
In other words, a load_program initialize persistable variables from existing files.
"""
def build_checkpoint_loader(load_var_list, startup_program):
    loader = Program()
    load_var_name_set = set([var.name for var in load_var_list])
    load(var_name_list=load_var_name_set,
         program = loader)
    if startup_program:
        for op in startup_program:
            if not (op.out in load_var_name_set):
                loader.append(deep_copy(op))
    return loader

def make_checkpoint(dir, predicate, startup_program, main_program):
    persistable_var_list = filter(main_program.all_vars(), is_persistable())
    if predicate:
        persistable_var_list = filter(persistable_var_list, predicate)
    save(persistable_var_list, dir, main_program)
    loader = build_checkpoint_loader(persistable_var_list, startup_program)
    save_program(loader, "./loader")

def get_checkpoint_loader(...):
    loader = load_program(loader_dir)
    return loader

A checkpoint consists of two parts: the saved variables and a loader. A loader is a program. It acts like a startup program; the only difference between a loader and a regular startup program is that in a loader, some variables may be initialized from existing files instead of by initializer ops.

We can use the checkpoint as follows:

"""
Save checkpoints:
"""
x = layers.data(...)
var1 = layers.fc(...)
# some other model configurations

make_checkpoint(dir="./", 
                predicate=None, 
                startup_program=default_startup_program(),
                main_program=default_main_program())
exe = Executor()
exe.run(default_startup_program())
while(...):
    exe.run(default_main_program())


"""
Load a checkpoint and continue training:
"""
x = layers.data(...)
var1 = layers.fc(...)
# the same model configurations as above

make_checkpoint(dir="./", 
                predicate=None, 
                startup_program=default_startup_program(),
                main_program=default_main_program())
loader = get_checkpoint_loader("./")
exe = Executor()
exe.run(loader)
while(...):
    exe.run(default_main_program())

Inference Model Saving and Loading

Currently, we use Program.prune to cut the main program down to an inference model. However, the prune algorithm is complex and error-prone. In recent discussions, we have tended to leave the building of the inference model to users:

"""
Saves the inference model:
"""
x = layers.data(...)
var1 = layers.fc(...)
# some other model configurations
cost = layers.mean(...)
save_program(default_main_program(), "./main_prog")
make_checkpoint(dir="./", 
                predicate=None, 
                startup_program=default_startup_program(),
                main_program=default_main_program())
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(cost)
exe = Executor()
exe.run(default_startup_program())
while(...):
    exe.run(default_main_program())

"""
Loads and uses the inference model:
"""
main_program = load_program("./main_prog")
loader = get_checkpoint_loader("./")
exe = Executor()
exe.run(loader)
while(...):
    exe.run(main_program)

The key to getting an inference model is to save the main program and make the checkpoint precisely before appending the optimizer: the ops added by the optimizer belong to training only, so the saved program contains just the forward (inference) part.

@QiJune (Member) commented Feb 24, 2018

  • For the save API, we provide two basic interfaces: save_checkpoint and save_params.
  • For the load API, we provide one interface: load_all.
  • The load operators will be inserted into another startup program.
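
A rough sketch of what these signatures could look like (hypothetical, written only to pin down the shape of the proposal):

# saves every persistable variable plus a loader program
def save_checkpoint(dir, main_program):
    ...

# saves Parameter variables only, e.g. for inference
def save_params(dir, main_program):
    ...

# inserts load_ops for everything found in dir into a startup program
def load_all(dir, startup_program):
    ...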

@reyoung (Collaborator) commented Feb 24, 2018

Related issue #7931

@Xreki (Contributor) commented Feb 26, 2018

Just to supply some requirements for inference:

  • We need to insert feed_op and fetch_op into the inference_program (see the sketch below), so that users do not need to specify the feed and fetch variable names manually in the C++ inference interface.
  • We need to support saving all parameters into a single file, because there are requirements to initialize the inference program and its parameters from a pre-loaded buffer.
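
For concreteness, pre-inserting the feed/fetch ops could look roughly like this (a minimal sketch; the holder variables and the 'col' attribute follow the current feed_op/fetch_op convention, but the helper itself is hypothetical):

def insert_feed_fetch_ops(program, feed_names, fetch_names):
    block = program.global_block()
    feed_holder = block.create_var(name='feed', persistable=True)
    fetch_holder = block.create_var(name='fetch', persistable=True)
    for i, name in enumerate(feed_names):
        # copy column i of the feed holder into the input variable
        block.prepend_op(type='feed',
                         inputs={'X': [feed_holder]},
                         outputs={'Out': [block.var(name)]},
                         attrs={'col': i})
    for i, name in enumerate(fetch_names):
        # copy the result variable into column i of the fetch holder
        block.append_op(type='fetch',
                        inputs={'X': [block.var(name)]},
                        outputs={'Out': [fetch_holder]},
                        attrs={'col': i})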

For training, there may be some other requirements, like initializing part of the parameters from a pre-trained model and randomizing the rest.

@JiayiFeng (Collaborator, Author) commented Mar 1, 2018

After a discussion with @QiJune and @reyoung, we all agree that this design can be approximated via a few simple modifications to the current code:

  1. Some ops may hold persistable variables to represent their own status. We need to change all such variables to Parameter.
  2. Merge load_persistable() and load_persistable_if_exsist() into one API, and remove load_parameter().
  3. Add layer wrappers for save_op and load_op, so that users can insert them into their own programs (see the sketch below).
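
For example, once the wrappers from point 3 exist, a user could write something like this (hypothetical names and signatures, shown only to illustrate the intended usage):

# append a save_op to the main program: it runs on every iteration
layers.save(var=my_var, file_path='./my_var',
            program=default_main_program())
# append a load_op to the startup program: it runs once at startup
layers.load(var=my_var, file_path='./my_var',
            program=default_startup_program())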

These modifications will hardly change the code structure.
