Reset progress percentage to last checkpoint when task is initialized #4911

Merged
merged 3 commits into BOINC:master from Vulpine05-3430
Sep 15, 2022

Vulpine05
Contributor

Fixes #3430

Description of the Change
When a task is initialized (or resumes computation), its progress percentage is reset to the value recorded at the last checkpoint. That value is 0 if the task has not yet reached a checkpoint.
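
The reset described above can be sketched as follows. This is only an illustration under stated assumptions, not the BOINC client code (which is C++): the `Task` class, the `init_task` function and the `fraction_done` attribute are hypothetical names; only `checkpoint_fraction_done` matches a field that appears in `client_state.xml`.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a client-side task record."""
    fraction_done: float = 0.0             # progress currently shown to the user
    checkpoint_fraction_done: float = 0.0  # progress saved at the last checkpoint


def init_task(task: Task) -> None:
    """On (re)initialization, fall back to the last checkpointed progress."""
    # Work done since the last checkpoint is lost once the app has been
    # removed from memory, so the displayed value must not exceed it.
    task.fraction_done = task.checkpoint_fraction_done


# A task that never checkpointed goes back to 0.
never_checkpointed = Task(fraction_done=0.37)
init_task(never_checkpointed)

# A task with a checkpoint goes back to the checkpointed value.
resumed = Task(fraction_done=0.62, checkpoint_fraction_done=0.5)
init_task(resumed)

print(never_checkpointed.fraction_done, resumed.fraction_done)
```

The point of the sketch is that the displayed value is derived from saved state only, so progress accumulated before the restart cannot leak into the display.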

Alternate Designs
None.

Release Notes
Progress percentage is reset to 0 for tasks restarting with pseudo-progress.

@Vulpine05
Contributor Author

@RichardHaselgrove, could you please test? I think this fixes your bug, but I have not had a task large enough to reproduce it. An extra set of hands would help. Once one of us has verified that this works, I will mark it as ready for review.

@RichardHaselgrove
Contributor

I'd forgotten about that one! Unfortunately, my screenshot doesn't identify the Einstein task in question - it may not be active at the moment. But I don't have many Windows 10 machines. I'll search through them in the morning, but it may take a little time.

@Vulpine05
Contributor Author

No rush, I understand this is a rare case. This can just sit here until someone can confirm it solves the problem.

@RichardHaselgrove
Contributor

Finally got a chance to reproduce the original problem. Starting a brand-new task, with a high memory footprint and a known lengthy delay between checkpoints (WCG's ARP project - checkpoints 8 times per run):
[Screenshot: WCG memory 1]
Allowing it to use more memory, under v7.20.2, showed this in the first minute:
[Screenshot: WCG memory 2]
Repeating the whole exercise with the artifact from this PR gave:
[Screenshot: WCG memory 3]
So, it seems to be re-starting from 0% progress, but not following the normal rules for pseudo-progress. I'm not quite sure where that leaves us.

@Vulpine05
Contributor Author

Thanks for the report. The pseudo progress is strange. Let me dig into that more and see what I come up with. To be continued...

@Vulpine05
Contributor Author

@RichardHaselgrove, can you clarify two things:

  1. For the task that causes this, is there a boinc_task_state file in the slot?
  2. Does the original problem occur with BOINC running the whole time? Or do you quit the client/manager, start it up again, and then see the pseudo-progress reported incorrectly?

@RichardHaselgrove
Contributor

Sorry, had to wait for new work to become available after the weekend - WCG is still having problems.

Question 1 - no. I started a task at 17:30 local
[screenshot]

Directory of d:\boincdata\slots\0

12/09/2022 17:39 65,185 stdout.txt
12/09/2022 17:31 97,115 namelist.output
12/09/2022 17:31 93 stderr.txt
12/09/2022 17:31 87,327,860 wrfrst_d03
12/09/2022 17:31 <DIR> ..
12/09/2022 17:31 <DIR> .
12/09/2022 17:30 87,327,860 wrfrst_d02
12/09/2022 17:30 92,575,412 wrfrst_d01
12/09/2022 17:30 0 boinc_lockfile
12/09/2022 17:30 9,091 init_data.xml

There is an active_task record in client_state.xml:

<active_task>
    <project_master_url>http://www.worldcommunitygrid.org/</project_master_url>
    <result_name>ARP1_0024635_128_0</result_name>
    <active_task_state>1</active_task_state>
    <app_version_num>732</app_version_num>
    <slot>0</slot>
    <checkpoint_cpu_time>0.000000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>0.000000</checkpoint_elapsed_time>
    <checkpoint_fraction_done>0.000000</checkpoint_fraction_done>
    <checkpoint_fraction_done_elapsed_time>0.000000</checkpoint_fraction_done_elapsed_time>
    <current_cpu_time>519.968800</current_cpu_time>
    <once_ran_edf>0</once_ran_edf>
    <swap_size>779431936.000000</swap_size>
    <working_set_size>822038528.000000</working_set_size>
    <working_set_size_smoothed>822038530.995117</working_set_size_smoothed>
    <page_fault_rate>0.000000</page_fault_rate>
    <bytes_sent>0.000000</bytes_sent>
    <bytes_received>0.000000</bytes_received>
</active_task>
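
A record like this can also be checked programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed copy of the record above; the helper name `summarize_active_task` is hypothetical and the snippet is not part of BOINC.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <active_task> record quoted above.
ACTIVE_TASK_XML = """
<active_task>
    <result_name>ARP1_0024635_128_0</result_name>
    <checkpoint_cpu_time>0.000000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>0.000000</checkpoint_elapsed_time>
    <checkpoint_fraction_done>0.000000</checkpoint_fraction_done>
    <current_cpu_time>519.968800</current_cpu_time>
</active_task>
"""


def summarize_active_task(xml_text: str) -> dict:
    """Return the checkpoint-related fields of one <active_task> record."""
    node = ET.fromstring(xml_text.strip())
    return {
        "result_name": node.findtext("result_name"),
        "checkpoint_fraction_done": float(
            node.findtext("checkpoint_fraction_done", "0")
        ),
        "current_cpu_time": float(node.findtext("current_cpu_time", "0")),
        # True only if the record carries an <elapsed_time> element.
        "has_elapsed_time": node.find("elapsed_time") is not None,
    }


summary = summarize_active_task(ACTIVE_TASK_XML)
print(summary)
```

For this record the summary shows CPU time already accumulated but a checkpointed fraction of zero, and no `<elapsed_time>` element.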

Normally, my machines run continually: I'll do a reboot and come back to answer question 2.

@RichardHaselgrove
Contributor

After rebooting with the task suspended, the active task record contained
<current_cpu_time>0.000000</current_cpu_time>
and Manager showed
[screenshot]

I didn't reboot last time, which probably explains the pseudo-progress.

@Vulpine05
Contributor Author

I think this helps, thank you. I'll start digesting this.

I discovered last night that the tasks are NOT suspended; they are told to quit instead. I think assuming they were suspended was the red herring. A more accurate description for these tasks would be "Waiting to run - not enough memory".

@Vulpine05
Contributor Author

@RichardHaselgrove, is there an <elapsed_time> in active task?

@RichardHaselgrove
Contributor

> @RichardHaselgrove, is there an <elapsed_time> in active task?

I gave you the whole thing, so no.

I'm using manual suspend/resume for these tests, because it's quicker and easier than finessing the memory allocation. The status messages, including in the Event Log, are appropriate for my actions, including whether or not the app was removed from memory.

@Vulpine05
Contributor Author

> > @RichardHaselgrove, is there an <elapsed_time> in active task?
>
> I gave you the whole thing, so no.
>
> I'm using manual suspend/resume for these tests, because it's quicker and easier than finessing the memory allocation. The status messages, including in the Event Log, are appropriate for my actions, including whether or not the app was removed from memory.

If you feel I was questioning whether you were taking appropriate actions, that was not my intent. I'm just trying to get a feel for what is happening. All I was trying to say earlier is that, when I walk the code, a task is told to quit when there is not enough memory to run it, which is different (I think) from the amount of memory BOINC can use when the PC is idle/active.

This does help, let me see what I can dig up now. Thanks!

@Vulpine05
Contributor Author

@RichardHaselgrove, try it now. I tested this and it appears to work, but I don't think I have the same conditions as you do. I changed my memory preferences to a low value to force my tasks to wait for memory, then I would update the preferences again to my original memory settings. The percentage completed appeared to reset back to zero, but progress did start increasing around the 40 second mark. I think this is because the task is not as large as the ones you are testing with, though. Let me know what you find and thank you for helping.

@RichardHaselgrove
Contributor

OK, I've got the new artifact, but no tasks at the moment. As you say, I'm trying it out on bigger tasks, and one machine is still plodding its way through one from two days ago:

12-Sep-2022 23:19:56 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
13-Sep-2022 06:24:30 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
13-Sep-2022 12:58:03 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
13-Sep-2022 17:30:41 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
13-Sep-2022 22:55:18 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
14-Sep-2022 05:59:24 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
14-Sep-2022 12:14:45 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed

Seven out of eight completed - should be ready for a new-start test in about five hours!
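
The spacing between these checkpoints can be worked out from the timestamps. The following is a rough sketch using only the Python standard library; the function name `checkpoint_times` is hypothetical, and parsing the month abbreviation with `%b` assumes an English (or default C) locale.

```python
from datetime import datetime

# The [checkpoint] Event Log lines quoted above.
LOG_LINES = """\
12-Sep-2022 23:19:56 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
13-Sep-2022 06:24:30 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
13-Sep-2022 12:58:03 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
13-Sep-2022 17:30:41 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
13-Sep-2022 22:55:18 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
14-Sep-2022 05:59:24 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
14-Sep-2022 12:14:45 [World Community Grid] [checkpoint] result ARP1_0024635_128_0 checkpointed
""".splitlines()


def checkpoint_times(lines):
    """Extract the timestamp of every [checkpoint] line."""
    times = []
    for line in lines:
        if "[checkpoint]" not in line:
            continue
        # The first two whitespace-separated fields are date and time.
        stamp = " ".join(line.split()[:2])
        times.append(datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S"))
    return times


times = checkpoint_times(LOG_LINES)
# Hours between each pair of consecutive checkpoints.
gaps_hours = [
    (later - earlier).total_seconds() / 3600
    for earlier, later in zip(times, times[1:])
]
for gap in gaps_hours:
    print(f"{gap:.2f} h")
```

For this log the gaps fall between roughly four and a half and seven hours, so a task removed from memory just before a checkpoint can lose several hours of work.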

@RichardHaselgrove
Contributor

OK, finally got another of these long tasks.
Initial run:
[screenshot]
Threw it out of memory by changing memory use preference (to 10%):
15/09/2022 12:48:54 | World Community Grid | [cpu_sched] Preempting ARP1_0011950_130_2 (removed from memory)
and allowed it back in:
[screenshot]

I think that's good enough for this one. Thanks for your perseverance - I approve.

Contributor

@RichardHaselgrove RichardHaselgrove left a comment

Avoids presenting inaccurate information to users.

Vulpine05 marked this pull request as ready for review on September 15, 2022 12:37
Vulpine05 changed the title from "[WIP] Reset progress percentage to last checkpoint when task is initialized" to "Reset progress percentage to last checkpoint when task is initialized" on Sep 15, 2022
@AenBleidd
Member

Looks good to me. Thank you for the fix.

AenBleidd merged commit 10b6bcd into BOINC:master on Sep 15, 2022
Vulpine05 deleted the Vulpine05-3430 branch on September 15, 2022 16:20
Successfully merging this pull request may close these issues.

Client GUI: misleading display of pseudo-progress