Redo Map Tasks when Workers are unable to get Intermediate Data #404

Merged
merged 1 commit into develop from feature/redo-map-tasks on Apr 1, 2018

Conversation

@RyanConnell (Member) commented Mar 14, 2018

@RyanConnell changed the title from "[WIP] Redo Map Tasks when Workers are unable to get Intermediate Data" to "Redo Map Tasks when Workers are unable to get Intermediate Data" on Mar 30, 2018
@@ -50,6 +52,20 @@ impl State {
Ok(())
}

pub fn get_tasks_for_job(&self, job_id: &str) -> Result<Vec<String>> {

This function doesn't need to be public. Its name should also specify that these are in-progress tasks.
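
For example, something like:

fn get_in_progress_tasks_for_job(&self, job_id: &str) -> Result<Vec<String>>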

@@ -287,6 +308,19 @@ impl State {
Ok(())
}

pub fn set_worker_operation_cancelled(&mut self, worker_id: &str) -> Result<()> {

What is the logic behind this? It seems like changing the worker's status only on the master will lead to trouble when the master tries to assign a busy worker a task and then kicks it from the cluster.

@RyanConnell (Member Author) replied:

The state is also changed in the worker (when we call set_cancelled_state) but this function works similarly to the 'set_worker_operation_failed' and 'set_worker_operation_completed' functions in the same file, which are both called after we process worker map/reduce results.

Ah, okay sounds good.

}

pub fn handle_worker_report(&mut self, request: &pb::ReportWorkerRequest) -> Result<()> {
    let worker_id = request.worker_id.clone();

I don't think the worker id is necessary for this. Instead, a worker should send its current task id and we should simply check whether that task still needs a response. The reason is the same as for the other places where workers send their current task id: the data the master has might be behind if it crashed and then reloaded an old version.
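
As a rough sketch (this just reuses the request and task fields that appear elsewhere in this diff), the check could look like:

if !self.tasks.contains_key(&request.task_id) {
    // The reported task is no longer outstanding (e.g. the master reloaded newer state), so ignore the report.
    return Ok(());
}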

@RyanConnell (Member Author) replied:

I kept the worker_id as it's used in other places (such as setting the status to cancelled), but I have also added the task_id.

self.tasks.contains_key(&worker.current_task_id)
};

let mut reschedule_tasks: Vec<String> = Vec::new();

Can this actually ever be more than a single map task?

@RyanConnell (Member Author) replied:

Nope

for partition_key in map_task.map_output_files.keys() {
    let partition = map_task.map_output_files.get(&partition_key).unwrap();
    if partition.ends_with(&path) {
        return true;

Can just iterate by value:

for partition in map_task.map_output_files.values() {
    if partition.ends_with(&path) {
        return true;
    }
}

// If the task was cancelled, we shouldn't return an error here.
if status == pb::OperationStatus::CANCELLED {
    return Ok(());
}

This seems like a weird special case caused by setting cancelled unnecessarily in the worker interface. Seems better to just leave the error chaining like normal.

format!("Unable to get task {} from task queue", queued_task_id)
})?;

if self.reduce_from_map_task(&map_task, &queued_task) {

Should check the task type before calling this, since the function only applies to reduce tasks.
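
Rough sketch of the guard (the task_type field and Reduce variant names here are illustrative, not necessarily the project's actual identifiers):

if queued_task.task_type == TaskType::Reduce && self.reduce_from_map_task(&map_task, &queued_task) {
    // ... existing handling for tasks that depend on the failed map output ...
}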

pub fn remove_tasks_from_queue(&mut self, task_ids: Vec<String>) -> Result<()> {
    let mut new_priority_queue: BinaryHeap<PriorityTask> = BinaryHeap::new();

    for task_id in task_ids.clone() {

Seems like this clone is unnecessary.
Can just do:

for task_id in &task_ids {
    self.tasks.remove(task_id);
}

if task.id == task_id {
    task.status = TaskStatus::Cancelled;
    can_push = false;
    break;

Is this ever possible? If a task is in task_ids, it will not be in self.tasks anymore.

let mut new_new_priority_task_queue: BinaryHeap<PriorityTask> = BinaryHeap::new();
for t in new_priority_queue.drain() {
    new_new_priority_task_queue.push(t);
}

Why is this necessary? Can't the first new_priority_queue be used?

@RyanConnell (Member Author) replied:

It's not; it's part of some leftover debug code. Removed.

fn get_in_progress_tasks_for_job(&self, job_id: &str) -> Result<Vec<String>> {
    let mut tasks: Vec<String> = Vec::new();

    for task_id in self.tasks.keys() {

Can just iterate over values here.
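
For example (a sketch; the job_id/id fields and an InProgress status variant are assumed names here):

for task in self.tasks.values() {
    // Assumed field and variant names, for illustration only.
    if task.job_id == job_id && task.status == TaskStatus::InProgress {
        tasks.push(task.id.clone());
    }
}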

new_map_task.job_priority * FAILED_TASK_PRIORITY,
));

println!("Rescheduled map task with ID {}", new_map_task.id);

I think info! would be better here.
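
For example, assuming the log crate's info! macro is already available in this module:

info!("Rescheduled map task with ID {}", new_map_task.id);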

if self.tasks.contains_key(&request.task_id) {
    let task_ids = self.completed_tasks.keys().clone();

    for task_id in task_ids {

Can just iterate over values of completed_tasks here.
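
Sketch (the loop body is elided in this diff, so this only shows the iteration change):

for task in self.completed_tasks.values() {
    // Use `task` directly instead of looking it up by id.
}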

@RyanConnell merged commit 5016adf into develop on Apr 1, 2018
@RyanConnell deleted the feature/redo-map-tasks branch on April 1, 2018 at 23:57