
Fetched loss values are abnormal when training with multiple losses after enabling memory optimization #11320

Closed
dyning opened this issue Jun 8, 2018 · 15 comments · Fixed by #11372 or #11462

@dyning
Contributor

dyning commented Jun 8, 2018

My program has multiple losses. After calling fluid.memory_optimize(fluid.default_main_program()) for memory optimization, the fetched loss values are displayed abnormally. Also, in the same environment, the loss values differ between runs with fluid.memory_optimize enabled and disabled.
[screenshot of the abnormal loss output]
Some key parts of the code I call:
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_loss.name)
fetch_list_var = []
results = train_exe.run(fetch_list_var, feed=feed_dict)

I suggest trying the googlenet training in models to see whether this can be reproduced.

@panyx0718

@panyx0718
Contributor

@dyning Can you give a screenshot of your code? The part that uses memory_optimizer and ParallelExecutor.run?

@dzhwinter @reyoung @chengduoZH
I quickly looked at @dyning 's code. He fetched several intermediate losses. When mem_opt is enabled, the fetched values are wrong; when mem_opt is disabled, they are correct.

I suspect the memory_optimizer somehow renamed the variables, therefore ParallelExecutor fetched
the wrong thing.
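
For readers without access to the original program, a hypothetical minimal sketch of the failing pattern might look like the following. The network, layer sizes, variable names, and fake data here are illustration-only assumptions, not the reporter's model; the point is that intermediate losses are fetched by name from a ParallelExecutor after fluid.memory_optimize has rewritten the program.

    # Hypothetical minimal repro sketch (all names, sizes, and data are assumptions).
    import numpy as np
    import paddle.fluid as fluid

    image = fluid.layers.data(name='image', shape=[784], dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    hidden = fluid.layers.fc(input=image, size=128, act='relu')
    pred_a = fluid.layers.fc(input=hidden, size=10, act='softmax')
    pred_b = fluid.layers.fc(input=hidden, size=10, act='softmax')

    # Two intermediate losses plus their sum as the training loss.
    loss_a = fluid.layers.mean(x=fluid.layers.cross_entropy(input=pred_a, label=label))
    loss_b = fluid.layers.mean(x=fluid.layers.cross_entropy(input=pred_b, label=label))
    avg_loss = fluid.layers.elementwise_add(loss_a, loss_b)

    fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_loss)

    # Enabling this without protecting the fetched variables is what triggers the issue.
    fluid.memory_optimize(fluid.default_main_program())

    place = fluid.CUDAPlace(0)
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

    train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_loss.name)
    feeder = fluid.DataFeeder(place=place, feed_list=[image, label])
    batch = [(np.random.rand(784).astype('float32'), np.random.randint(10))
             for _ in range(32)]
    # With memory_optimize enabled, loss_a and loss_b come back wrong here.
    la, lb, total = train_exe.run(
        [loss_a.name, loss_b.name, avg_loss.name], feed=feeder.feed(batch))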

@dyning
Contributor Author

dyning commented Jun 8, 2018

Screenshot of the code is as follows:

        optimizer = fluid.optimizer.Momentum(
            learning_rate=fluid.layers.piecewise_decay(
                boundaries=self.propsalparam['bd'], values=self.propsalparam['lr']), momentum=0.9, 
            regularization=fluid.regularizer.L2Decay(1e-4))
        opts = optimizer.minimize(avg_loss)
        #fluid.memory_optimize(fluid.default_main_program())

        #place = fluid.CPUPlace()
        place = fluid.CUDAPlace(0)
        exe = fluid.Executor(place)
        exe.run(fluid.default_startup_program())
        pretrain_models_path = self.propsalparam['pretrain_models_path']
        save_interval = self.propsalparam['save_interval']
        if pretrain_models_path is not None:
            def if_exist(var):
                """if_exist"""
                return os.path.exists(os.path.join(pretrain_models_path, 
                    var.name))
            fluid.io.load_vars(exe, pretrain_models_path, predicate=if_exist)
        train_reader = paddle.batch(logodet_reader.reader_creator_logodet(
            self.configfile, self.sectionname), 
            batch_size=self.propsalparam['batch_size'])

        train_exe = fluid.ParallelExecutor(use_cuda=True, 
            loss_name=avg_loss.name)
        fetch_list_var = [avg_rpn_loss_cls.name, avg_rpn_loss_loc.name, \
            avg_loss_cls.name, avg_loss_loc.name, avg_loss.name, \
            rpn_acc_top1.name, acc_top1.name, "learning_rate"]
        fetch_list_name = ["rpn_loss_cls", "rpn_loss_loc", "loss_cls",\
            "loss_loc", "loss", "rpn_acc_top1", "acc_top1", "lr"]
        with open("./output/train_log.txt", "wb") as fout_log:
            for pass_id in range(self.propsalparam['epoch_num']):
                begtime = time.time()
                for batch_id, blobs in enumerate(train_reader()):
                    feed_dict = self.convert_blobs_to_feed_dict(blobs, place)
                    results = train_exe.run(fetch_list_var, 
                        feed=feed_dict)

@QiJune
Member

QiJune commented Jun 11, 2018

@dyning Could you try the patch in #11372 and confirm whether it works?

First list the values you want to fetch, then pass them to the memory_optimize interface. Those values will then be skipped during memory optimization:

fetch_list_var = [avg_rpn_loss_cls.name, avg_rpn_loss_loc.name, \
        avg_loss_cls.name, avg_loss_loc.name, avg_loss.name, \
        rpn_acc_top1.name, acc_top1.name, "learning_rate"]
fluid.memory_optimize(fluid.default_main_program(), fetch_list_var)

@panyx0718
Contributor

@dyning Has this problem been resolved?

@dyning
Contributor Author

dyning commented Jun 14, 2018

@QiJune I verified #11372; the problem still exists.

@dzhwinter
Contributor

dzhwinter commented Jun 14, 2018

Both @QiJune's PR and mine can fix this issue. With memory_optimize enabled and disabled, the results align exactly. The verification results are below, using googlenet from models; the fetch-related snippet is as follows:

    train_reader = paddle.batch(reader.fake_reader(), batch_size=train_batch_size)
    fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name, avg_cost0.name, avg_cost1.name, avg_cost2.name]

    if with_memory_optimization:
        fluid.memory_optimize(fluid.default_main_program(), skip_opt_set=set(fetch_list))

    for pass_id in range(params["num_epochs"]):
        train_info = [[], [], []]
        test_info = [[], [], []]
        for batch_id, data in enumerate(train_reader()):
            t1 = time.time()
            loss, acc1, acc5, cost0, cost1, cost2 = train_exe.run(fetch_list, feed=feeder.feed(data))
            t2 = time.time()
            period = t2 - t1
            loss = np.mean(np.array(loss))
            acc1 = np.mean(np.array(acc1))
            acc5 = np.mean(np.array(acc5))
            train_info[0].append(loss)
            train_info[1].append(acc1)
            train_info[2].append(acc5)
            if batch_id % 1 == 0:
                print("Pass {0}, trainbatch {1}, loss {2}, \
                       acc1 {3}, acc5 {4}, {5}, {6}, {7}, time {8}"
                                                   .format(pass_id, \
                                                           batch_id, loss, acc1, acc5, np.array(cost0), np.array(cost1), np.array(cost2), \
                       "%2.2f sec" % period))
                sys.stdout.flush()

The reader part is as follows:

shape = [3, 224, 224]
label = range(102)
np.random.seed(100)
def fake_reader():
    def reader():
        while True:
            yield np.random.uniform(size=shape), np.random.choice(label)
    return reader

With memory_optimize disabled:

Pass 0, trainbatch 0, loss 7.40484714508,                        acc1 0.0, acc5 0.0625, [4.618476], [4.6716123], [4.616293], time 0.11 sec
Pass 0, trainbatch 1, loss 7.40602397919,                        acc1 0.03125, acc5 0.0625, [4.629113], [4.6268578], [4.629512], time 0.09 sec
Pass 0, trainbatch 2, loss 7.42118453979,                        acc1 0.0, acc5 0.0, [4.647031], [4.606136], [4.641044], time 0.11 sec
Pass 0, trainbatch 3, loss 7.39293956757,                        acc1 0.03125, acc5 0.0625, [4.620225], [4.600328], [4.6420546], time 0.10 sec
Pass 0, trainbatch 4, loss 7.45125865936,                        acc1 0.0, acc5 0.0, [4.6322255], [4.7699347], [4.6268435], time 0.10 sec
Pass 0, trainbatch 5, loss 7.39793109894,                        acc1 0.0, acc5 0.0625, [4.6226783], [4.621888], [4.628953], time 0.09 sec
Pass 0, trainbatch 6, loss 7.39033412933,                        acc1 0.0, acc5 0.03125, [4.6119056], [4.6370993], [4.6243286], time 0.10 sec
Pass 0, trainbatch 7, loss 7.37090873718,                        acc1 0.03125, acc5 0.0625, [4.6033278], [4.618459], [4.606809], time 0.09 sec
Pass 0, trainbatch 8, loss 7.45732116699,                        acc1 0.0, acc5 0.03125, [4.650306], [4.720114], [4.636603], time 0.10 sec
Pass 0, trainbatch 9, loss 7.50588703156,                        acc1 0.03125, acc5 0.09375, [4.711724], [4.682214], [4.6316633], time 0.09 sec
Pass 0, trainbatch 10, loss 7.42192602158,                        acc1 0.0, acc5 0.0, [4.638735], [4.6542664], [4.623038], time 0.10 sec

With memory_optimize enabled:

Pass 0, trainbatch 0, loss 7.40484714508,                        acc1 0.0, acc5 0.0625, [4.618476], [4.6716123], [4.616293], time 0.11 sec
Pass 0, trainbatch 1, loss 7.40602397919,                        acc1 0.03125, acc5 0.0625, [4.629113], [4.6268578], [4.629512], time 0.09 sec
Pass 0, trainbatch 2, loss 7.42118453979,                        acc1 0.0, acc5 0.0, [4.647031], [4.6061325], [4.641046], time 0.09 sec
Pass 0, trainbatch 3, loss 7.39287471771,                        acc1 0.03125, acc5 0.0625, [4.6201744], [4.600312], [4.642022], time 0.08 sec
Pass 0, trainbatch 4, loss 7.45143222809,                        acc1 0.0, acc5 0.0, [4.6320643], [4.7702603], [4.627632], time 0.10 sec
Pass 0, trainbatch 5, loss 7.39721155167,                        acc1 0.0, acc5 0.0625, [4.622399], [4.6212826], [4.6280923], time 0.09 sec
Pass 0, trainbatch 6, loss 7.39040565491,                        acc1 0.0, acc5 0.03125, [4.611952], [4.636217], [4.625294], time 0.09 sec
Pass 0, trainbatch 7, loss 7.37325382233,                        acc1 0.03125, acc5 0.0625, [4.604336], [4.6227126], [4.6070127], time 0.09 sec
Pass 0, trainbatch 8, loss 7.4660615921,                        acc1 0.0, acc5 0.03125, [4.657354], [4.7219033], [4.6404552], time 0.10 sec
Pass 0, trainbatch 9, loss 7.50238180161,                        acc1 0.03125, acc5 0.09375, [4.711801], [4.669479], [4.6324573], time 0.10 sec
Pass 0, trainbatch 10, loss 7.42162179947,                        acc1 0.0, acc5 0.0, [4.640833], [4.6434155], [4.625881], time 0.10 sec

@dzhwinter
Contributor

Note that the order matters here: the fetched variables must be set to persistable, or added to the skip set, before memory_optimize is called.
The reason is that memory_optimize by default reuses the memory of non-persistable variables (non-parameter, temporary variables) to save memory, and the variables a user wants to fetch may be among them, so this information must be passed in before memory optimization runs. A sketch of both options is shown below.
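
For concreteness, here is a short sketch of the two options; it assumes the variables from the googlenet snippet above (avg_cost, acc_top1, acc_top5, avg_cost0, avg_cost1, avg_cost2) are in scope, and the exact fetch set is an assumption for illustration.

    # Sketch of the two workarounds; run either one before memory_optimize.
    fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name,
                  avg_cost0.name, avg_cost1.name, avg_cost2.name]

    # Option 1: mark the fetched variables as persistable so the pass never
    # reuses their memory.
    for var in [avg_cost, acc_top1, acc_top5, avg_cost0, avg_cost1, avg_cost2]:
        var.persistable = True

    # Option 2: pass their names via skip_opt_set so they are excluded from reuse.
    fluid.memory_optimize(fluid.default_main_program(), skip_opt_set=set(fetch_list))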

@dyning
Contributor Author

dyning commented Jun 14, 2018

Great, after the fix the loss values display correctly. Thanks.

@dyning dyning closed this as completed Jun 14, 2018
@dyning dyning reopened this Jun 18, 2018
@dyning
Contributor Author

dyning commented Jun 18, 2018

In actual runs, I found that memory usage keeps growing during training. Any idea what is going on?

@panyx0718
Contributor

panyx0718 commented Jun 18, 2018

Let's follow up on this with high priority. Let's see if we can reproduce it together first.

@panyx0718 panyx0718 added the label Jun 18, 2018
@dzhwinter
Contributor

Is it host memory or GPU memory that is growing? Which version of Paddle?

@dyning
Contributor Author

dyning commented Jun 19, 2018

Host memory is growing, with the latest version of Paddle.

@dzhwinter
Contributor

Update: the peak memory reading rises during training, but logging into the machine shows that actual memory usage does not fluctuate. I cannot reproduce this by launching the job locally; following up with the cloud team.

@shanyi15
Collaborator

Hello, this issue has not been updated in the past month, so we will close it today. If you still need to follow up after it is closed, feel free to reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
