Redo Map Tasks when Workers are unable to get Intermediate Data #404
Conversation
8490ffd to 9e4f3e8 (compare)
@@ -50,6 +52,20 @@ impl State {
        Ok(())
    }

    pub fn get_tasks_for_job(&self, job_id: &str) -> Result<Vec<String>> {
This function doesn't need to be public. Its name should also make clear that it returns in-progress tasks.
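A minimal sketch of the suggested change, using hypothetical simplified stand-ins for the real `State` and task types (the actual structs live in the master's state module):

```rust
use std::collections::HashMap;

#[derive(PartialEq)]
enum TaskStatus { InProgress, Complete }

// Hypothetical simplified task: just a job id and a status.
struct Task { job_id: String, status: TaskStatus }

struct State { tasks: HashMap<String, Task> }

impl State {
    // Private, and named so it is obvious only in-progress tasks come back.
    fn get_in_progress_tasks_for_job(&self, job_id: &str) -> Vec<String> {
        self.tasks
            .iter()
            .filter(|(_, t)| t.job_id == job_id && t.status == TaskStatus::InProgress)
            .map(|(id, _)| id.clone())
            .collect()
    }
}

fn main() {
    let mut tasks = HashMap::new();
    tasks.insert("t1".to_owned(), Task { job_id: "j1".to_owned(), status: TaskStatus::InProgress });
    tasks.insert("t2".to_owned(), Task { job_id: "j1".to_owned(), status: TaskStatus::Complete });
    let state = State { tasks };
    assert_eq!(state.get_in_progress_tasks_for_job("j1"), vec!["t1".to_owned()]);
}
```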
@@ -287,6 +308,19 @@ impl State {
        Ok(())
    }

    pub fn set_worker_operation_cancelled(&mut self, worker_id: &str) -> Result<()> {
What is the logic behind this? Changing the worker's status only on the master seems likely to cause trouble when it assigns a busy worker a task and then kicks it from the cluster.
The state is also changed in the worker (when we call set_cancelled_state). This function works similarly to the 'set_worker_operation_failed' and 'set_worker_operation_completed' functions in the same file, both of which are called after we process worker map/reduce results.
Ah, okay sounds good.
    }

    pub fn handle_worker_report(&mut self, request: &pb::ReportWorkerRequest) -> Result<()> {
        let worker_id = request.worker_id.clone();
I don't think the worker id is necessary here. Instead, a worker should send its current task id, and we should simply check whether that task still needs a response. This is the same reason workers send their current task id in other places: the data the master has might be behind if it crashed and then reloaded an old version.
I kept the worker_id, as it's used in other places (such as setting the status to cancelled), but I have also added the task_id.
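A hedged sketch of the task-id check being discussed, with hypothetical simplified types (the real request is the protobuf `ReportWorkerRequest`): the master only acts on a report whose task id it still tracks, so stale reports are ignored.

```rust
use std::collections::HashSet;

// Hypothetical simplified report carrying both ids; only task_id is
// used to decide whether the report is still relevant.
#[allow(dead_code)]
struct ReportWorkerRequest { worker_id: String, task_id: String }

struct Master { live_task_ids: HashSet<String> }

impl Master {
    fn should_act_on_report(&self, request: &ReportWorkerRequest) -> bool {
        // Only act if the reported task still needs a response; a report
        // about a task the master no longer tracks is dropped.
        self.live_task_ids.contains(&request.task_id)
    }
}

fn main() {
    let mut live = HashSet::new();
    live.insert("task-1".to_owned());
    let master = Master { live_task_ids: live };
    let fresh = ReportWorkerRequest { worker_id: "w1".to_owned(), task_id: "task-1".to_owned() };
    let stale = ReportWorkerRequest { worker_id: "w1".to_owned(), task_id: "task-9".to_owned() };
    assert!(master.should_act_on_report(&fresh));
    assert!(!master.should_act_on_report(&stale));
}
```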
            self.tasks.contains_key(&worker.current_task_id)
        };

        let mut reschedule_tasks: Vec<String> = Vec::new();
Can this actually ever be more than a single map task?
Nope
for partition_key in map_task.map_output_files.keys() {
    let partition = map_task.map_output_files.get(&partition_key).unwrap();
    if partition.ends_with(&path) {
        return true;
Can just iterate by value:

for partition in map_task.map_output_files.values() {
    if partition.ends_with(&path) {
        etc.
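A self-contained version of this suggestion, assuming the partition map is a plain `HashMap<String, String>` (a simplification of the real `map_output_files` field):

```rust
use std::collections::HashMap;

// Iterate over values() directly instead of fetching each value by key
// and unwrapping; behavior is the same, without the extra lookup.
fn map_output_lost(map_output_files: &HashMap<String, String>, path: &str) -> bool {
    for partition in map_output_files.values() {
        if partition.ends_with(path) {
            return true;
        }
    }
    false
}

fn main() {
    let mut files = HashMap::new();
    files.insert("0".to_owned(), "/tmp/intermediate/part-0".to_owned());
    assert!(map_output_lost(&files, "part-0"));
    assert!(!map_output_lost(&files, "part-9"));
}
```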
worker/src/operations/reduce.rs
Outdated
// If the task was cancelled, we shouldn't return an error here.
if status == pb::OperationStatus::CANCELLED {
    return Ok(());
}
This seems like a weird special case caused by setting cancelled unnecessarily in the worker interface. It seems better to just leave the error chaining as normal.
    format!("Unable to get task {} from task queue", queued_task_id)
})?;

if self.reduce_from_map_task(&map_task, &queued_task) {
Should check the task type before calling this, since the function only applies to reduce tasks.
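A minimal sketch of the guard being suggested, with hypothetical simplified types and a stubbed-in check (the real `reduce_from_map_task` takes the map task as well):

```rust
#[derive(PartialEq)]
enum TaskType { Map, Reduce }

struct Task { task_type: TaskType }

// Stand-in for the real check; assumed to only be meaningful for
// reduce tasks, which is why the caller must guard on the type.
fn reduce_from_map_task(_queued_task: &Task) -> bool {
    true
}

fn should_reschedule(queued_task: &Task) -> bool {
    // Guard on the task type first, so the reduce-only check is never
    // applied to a map task. Short-circuiting skips the call entirely.
    queued_task.task_type == TaskType::Reduce && reduce_from_map_task(queued_task)
}

fn main() {
    assert!(!should_reschedule(&Task { task_type: TaskType::Map }));
    assert!(should_reschedule(&Task { task_type: TaskType::Reduce }));
}
```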
pub fn remove_tasks_from_queue(&mut self, task_ids: Vec<String>) -> Result<()> {
    let mut new_priority_queue: BinaryHeap<PriorityTask> = BinaryHeap::new();

    for task_id in task_ids.clone() {
Seems like this clone is unnecessary.
Can just do:
for task_id in &task_ids {
self.tasks.remove(task_id);
}
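A runnable sketch of the borrow-instead-of-clone pattern, using a hypothetical simplified task map (`HashMap<String, u32>` stands in for the real task storage):

```rust
use std::collections::HashMap;

// Taking &[String] borrows the ids: the caller keeps ownership and no
// clone of the whole Vec is needed just to iterate it.
fn remove_tasks(tasks: &mut HashMap<String, u32>, task_ids: &[String]) {
    for task_id in task_ids {
        tasks.remove(task_id);
    }
}

fn main() {
    let mut tasks = HashMap::new();
    tasks.insert("a".to_owned(), 1);
    tasks.insert("b".to_owned(), 2);
    let ids = vec!["a".to_owned()];
    remove_tasks(&mut tasks, &ids);
    assert!(!tasks.contains_key("a"));
    assert!(tasks.contains_key("b"));
    // ids is still usable here because it was only borrowed.
    assert_eq!(ids.len(), 1);
}
```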
if task.id == task_id {
    task.status = TaskStatus::Cancelled;
    can_push = false;
    break;
Is this ever possible? If a task is in task_ids, it will not be in self.tasks anymore.
let mut new_new_priority_task_queue: BinaryHeap<PriorityTask> = BinaryHeap::new();
for t in new_priority_queue.drain() {
    new_new_priority_task_queue.push(t);
}
Why is this necessary? Can't the first new_priority_queue be used?
It's not necessary; it was part of some leftover debug code. Removed.
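For illustration, a single rebuilt heap is enough here; draining it into a second heap would only copy the same elements again. A minimal sketch with `i32` standing in for `PriorityTask`:

```rust
use std::collections::BinaryHeap;

// Filter the old queue straight into one new heap; BinaryHeap
// implements FromIterator, so collect() rebuilds the heap directly.
fn remove_cancelled(queue: BinaryHeap<i32>, cancelled: &[i32]) -> BinaryHeap<i32> {
    queue.into_iter().filter(|t| !cancelled.contains(t)).collect()
}

fn main() {
    let queue: BinaryHeap<i32> = vec![5, 3, 8].into_iter().collect();
    let kept = remove_cancelled(queue, &[3]);
    // into_sorted_vec() yields ascending order, handy for checking.
    assert_eq!(kept.into_sorted_vec(), vec![5, 8]);
}
```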
ac15d14 to dd5db14 (compare)
fn get_in_progress_tasks_for_job(&self, job_id: &str) -> Result<Vec<String>> {
    let mut tasks: Vec<String> = Vec::new();

    for task_id in self.tasks.keys() {
Can just iterate over values here.
    new_map_task.job_priority * FAILED_TASK_PRIORITY,
));

println!("Rescheduled map task with ID {}", new_map_task.id);
I think info! would be better here.
if self.tasks.contains_key(&request.task_id) {
    let task_ids = self.completed_tasks.keys().clone();

    for task_id in task_ids {
Can just iterate over values of completed_tasks here.
03eb406 to b52f9ee (compare)