### Task: Find Missing Subtasks using Pandas
In many real-world applications, a task is broken down into several subtasks that all need to be completed. Keeping track of which subtasks have run, and spotting the ones that have not, can be a challenge. Imagine you are managing a workflow where each task is divided into smaller subtasks, and you need a report of the subtasks that were never executed. How would you retrieve this data efficiently using Pandas?

In this blog post, we'll explore a practical solution for finding the missing subtasks of each task. We'll break down the problem statement, look at the two input tables (`Tasks` and `Executed`), and build a short Pandas pipeline that produces the desired output. Along the way you'll see how to combine list explosion, joins, and filters to derive meaningful insights from structured data.

Let's dive in and solve this problem step by step!

#### Table: `Tasks`
| Column Name    | Type    |
|---------------|---------|
| task_id       | int     |
| subtasks_count | int    |

- `task_id` is the column with unique values for this table.
- Each row in this table indicates that `task_id` was divided into `subtasks_count` subtasks labeled from 1 to `subtasks_count`.
- It is guaranteed that `2 <= subtasks_count <= 20`.

#### Table: `Executed`
| Column Name  | Type |
|-------------|------|
| task_id     | int  |
| subtask_id  | int  |

- (`task_id`, `subtask_id`) is the combination of columns with unique values for this table.
- Each row in this table indicates that for the task `task_id`, the subtask with ID `subtask_id` was executed successfully.
- It is guaranteed that `subtask_id <= subtasks_count` for each `task_id`.

---

### Problem Statement:
Write a solution to report the IDs of the missing subtasks for each `task_id`.

The output should be returned in any order.

---

### Example:

#### **Input:**

**Tasks Table:**
| task_id | subtasks_count |
|---------|---------------|
| 1       | 3             |
| 2       | 2             |
| 3       | 4             |

**Executed Table:**
| task_id | subtask_id |
|---------|------------|
| 1       | 2          |
| 3       | 1          |
| 3       | 2          |
| 3       | 3          |
| 3       | 4          |

---

#### **Output:**
| task_id | subtask_id |
|---------|------------|
| 1       | 1          |
| 1       | 3          |
| 2       | 1          |
| 2       | 2          |

---

#### **Explanation:**
- **Task 1** was divided into 3 subtasks **(1, 2, 3)**. Only **subtask 2** was executed successfully, so we include **(1, 1)** and **(1, 3)** in the output.
- **Task 2** was divided into 2 subtasks **(1, 2)**. No subtask was executed successfully, so we include **(2, 1)** and **(2, 2)** in the output.
- **Task 3** was divided into 4 subtasks **(1, 2, 3, 4)**. All subtasks were executed successfully, so **no missing subtasks** for task 3.
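
Before turning to Pandas, the reasoning in the explanation above can be checked with a few lines of plain Python. This is only a sketch for the example data: the `tasks` dictionary and `executed` set are illustrative stand-ins for the two tables, not part of the original solution.

```python
# Example data from the tables above.
tasks = {1: 3, 2: 2, 3: 4}                           # task_id -> subtasks_count
executed = {(1, 2), (3, 1), (3, 2), (3, 3), (3, 4)}  # (task_id, subtask_id) pairs

# Enumerate every (task_id, subtask_id) pair that should exist...
expected = {(t, s) for t, n in tasks.items() for s in range(1, n + 1)}

# ...and keep the pairs that were never executed.
missing = sorted(expected - executed)
print(missing)  # [(1, 1), (1, 3), (2, 1), (2, 2)]
```

The Pandas steps below follow the same idea: build the full set of expected pairs, then remove the executed ones.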

---


In [7]:
import pandas as pd

data = [[1, 3], [2, 2], [3, 4]]
tasks = pd.DataFrame(data, columns=['task_id', 'subtasks_count']).astype({'task_id':'Int64', 'subtasks_count':'Int64'})
display(tasks)
data = [[1, 2], [3, 1], [3, 2], [3, 3], [3, 4]]
executed = pd.DataFrame(data, columns=['task_id', 'subtask_id']).astype({'task_id':'Int64', 'subtask_id':'Int64'})
display(executed)

|   | task_id | subtasks_count |
|---|---------|----------------|
| 0 | 1       | 3              |
| 1 | 2       | 2              |
| 2 | 3       | 4              |


|   | task_id | subtask_id |
|---|---------|------------|
| 0 | 1       | 2          |
| 1 | 3       | 1          |
| 2 | 3       | 2          |
| 3 | 3       | 3          |
| 4 | 3       | 4          |


#### Step 1: Generate All Possible Subtask IDs for Each Task
- Creates a new column, `subtask_id`, in the `tasks` DataFrame.
- Uses a list comprehension to build, for each task, the list of subtask IDs from 1 to `subtasks_count`.

In [8]:
# Build the list [1, 2, ..., subtasks_count] for every task.
tasks["subtask_id"] = [list(range(1, n + 1)) for n in tasks["subtasks_count"]]
display(tasks)

|   | task_id | subtasks_count | subtask_id   |
|---|---------|----------------|--------------|
| 0 | 1       | 3              | [1, 2, 3]    |
| 1 | 2       | 2              | [1, 2]       |
| 2 | 3       | 4              | [1, 2, 3, 4] |
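
As a possible alternative to building Python lists, the same set of subtask IDs can be produced by repeating each row `subtasks_count` times and numbering the copies within each task. The sketch below is not the approach used in this post; it assumes plain `int64` columns and uses a hypothetical `expanded` variable.

```python
import pandas as pd

tasks = pd.DataFrame({"task_id": [1, 2, 3], "subtasks_count": [3, 2, 4]})

# Repeat every row subtasks_count times, keeping only task_id.
expanded = tasks.loc[
    tasks.index.repeat(tasks["subtasks_count"]), ["task_id"]
].reset_index(drop=True)

# Number the copies 1..n inside each task to get the subtask IDs.
expanded["subtask_id"] = expanded.groupby("task_id").cumcount() + 1
print(expanded["subtask_id"].tolist())  # [1, 2, 3, 1, 2, 1, 2, 3, 4]
```

This yields one row per expected subtask directly, so the explode and drop steps would not be needed with this variant.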


#### Step 2: Explode the List of Subtask IDs into Individual Rows
- The `explode()` method expands each list in the `subtask_id` column into one row per element, repeating the other column values and the original index label.

In [9]:
tasks = tasks.explode("subtask_id")
display(tasks)

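|   | task_id | subtasks_count | subtask_id |
|---|---------|----------------|------------|
| 0 | 1       | 3              | 1          |
| 0 | 1       | 3              | 2          |
| 0 | 1       | 3              | 3          |
| 1 | 2       | 2              | 1          |
| 1 | 2       | 2              | 2          |
| 2 | 3       | 4              | 1          |
| 2 | 3       | 4              | 2          |
| 2 | 3       | 4              | 3          |
| 2 | 3       | 4              | 4          |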
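
Two side effects of `explode()` are visible in the output above: the index labels repeat, and (not visible in the table) the exploded column comes back with `object` dtype. The sketch below, on a smaller illustrative frame, shows one way to tidy both; whether this is needed depends on what you do with the result afterwards.

```python
import pandas as pd

tasks = pd.DataFrame({"task_id": [1, 2], "subtasks_count": [3, 2]})
tasks["subtask_id"] = [list(range(1, n + 1)) for n in tasks["subtasks_count"]]

exploded = tasks.explode("subtask_id")
print(exploded.index.tolist())       # [0, 0, 0, 1, 1]: labels of the source rows
print(exploded["subtask_id"].dtype)  # object, because the column held lists

# ignore_index=True renumbers the rows; astype restores an integer dtype.
clean = tasks.explode("subtask_id", ignore_index=True).astype({"subtask_id": "Int64"})
print(clean.index.tolist())          # [0, 1, 2, 3, 4]
print(clean["subtask_id"].dtype)     # Int64
```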


#### Step 3: Remove the `subtasks_count` Column
- Drops the `subtasks_count` column since it's no longer needed.

In [10]:
tasks = tasks.drop(columns=["subtasks_count"])
display(tasks)

|   | task_id | subtask_id |
|---|---------|------------|
| 0 | 1       | 1          |
| 0 | 1       | 2          |
| 0 | 1       | 3          |
| 1 | 2       | 1          |
| 1 | 2       | 2          |
| 2 | 3       | 1          |
| 2 | 3       | 2          |
| 2 | 3       | 3          |
| 2 | 3       | 4          |


#### Step 4: Perform a Right Join to Identify Missing Subtasks
- Joins the `executed` DataFrame with the `tasks` DataFrame using a right join on `task_id` and `subtask_id`, so every row of `tasks` is kept.
- The `indicator=True` option adds a new column, `_merge`, that indicates where each row came from:
  - `"both"` → The subtask exists in both `executed` and `tasks` (executed successfully).
  - `"right_only"` → The subtask exists in `tasks` but not in `executed` (missing execution).
  - `"left_only"` → Never appears here, because a right join drops rows found only in `executed`.

In [11]:
df = executed.merge(tasks, on=["task_id", "subtask_id"], how="right", indicator=True)
display(df)

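|   | task_id | subtask_id | _merge     |
|---|---------|------------|------------|
| 0 | 1       | 1          | right_only |
| 1 | 1       | 2          | both       |
| 2 | 1       | 3          | right_only |
| 3 | 2       | 1          | right_only |
| 4 | 2       | 2          | right_only |
| 5 | 3       | 1          | both       |
| 6 | 3       | 2          | both       |
| 7 | 3       | 3          | both       |
| 8 | 3       | 4          | both       |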
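
The direction of the join is a matter of taste. As a sketch, the same anti-join can be written as a left join starting from the full list of expected subtasks, in which case the missing rows are the ones marked `"left_only"`. The `all_subtasks` frame below is an illustrative stand-in for the exploded `tasks` DataFrame.

```python
import pandas as pd

# Every subtask that should exist (tasks 1 and 2 from the example).
all_subtasks = pd.DataFrame(
    {"task_id": [1, 1, 1, 2, 2], "subtask_id": [1, 2, 3, 1, 2]}
)
executed = pd.DataFrame({"task_id": [1], "subtask_id": [2]})

# Left join from the expected side: unmatched rows are flagged "left_only".
merged = all_subtasks.merge(
    executed, on=["task_id", "subtask_id"], how="left", indicator=True
)
missing = merged.loc[merged["_merge"] == "left_only", ["task_id", "subtask_id"]]
print(missing.values.tolist())  # [[1, 1], [1, 3], [2, 1], [2, 2]]
```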


#### Step 5: Filter for Missing Subtasks
- Filters rows where `_merge != "both"`, keeping only `"right_only"` (i.e., missing subtasks).
- Drops the `_merge` column since it is no longer needed.

In [12]:
df = df[df["_merge"] != "both"].drop(columns=["_merge"])
display(df)

|   | task_id | subtask_id |
|---|---------|------------|
| 0 | 1       | 1          |
| 2 | 1       | 3          |
| 3 | 2       | 1          |
| 4 | 2       | 2          |
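
For reuse, Steps 1 to 5 can be bundled into a single function. The following is a minimal sketch rather than a definitive implementation: the name `find_missing_subtasks` is hypothetical, the inputs are assumed to contain no missing values, and the exploded column is cast to the dtype of `executed["subtask_id"]` so that both join keys share a type.

```python
import pandas as pd


def find_missing_subtasks(tasks: pd.DataFrame, executed: pd.DataFrame) -> pd.DataFrame:
    # Steps 1-3: one row per (task_id, subtask_id) pair that should exist.
    id_lists = pd.Series(
        [list(range(1, n + 1)) for n in tasks["subtasks_count"]],
        index=tasks.index,
    )
    expected = (
        tasks.assign(subtask_id=id_lists)
        .explode("subtask_id", ignore_index=True)
        .drop(columns="subtasks_count")
        .astype({"subtask_id": executed["subtask_id"].dtype})
    )

    # Step 4: right join so that every expected pair is kept.
    merged = executed.merge(
        expected, on=["task_id", "subtask_id"], how="right", indicator=True
    )

    # Step 5: keep only the pairs that never appear in `executed`.
    missing = merged.loc[merged["_merge"] == "right_only", ["task_id", "subtask_id"]]
    return missing.reset_index(drop=True)


tasks = pd.DataFrame({"task_id": [1, 2, 3], "subtasks_count": [3, 2, 4]})
executed = pd.DataFrame(
    {"task_id": [1, 3, 3, 3, 3], "subtask_id": [2, 1, 2, 3, 4]}
)
result = find_missing_subtasks(tasks, executed)
print(result.values.tolist())  # [[1, 1], [1, 3], [2, 1], [2, 2]]
```

Unlike the cell-by-cell version, this function leaves the caller's `tasks` DataFrame unmodified and resets the index of the result.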


Reference: [1] https://leetcode.com/problems/find-the-subtasks-that-did-not-execute/description/