# What Are Jobs, Stages, and Tasks in Spark?

In Apache Spark, a "job" is the entire computation triggered by a single action in a Spark application. It covers every transformation needed to produce that action's result, executed in a distributed, parallel manner across the nodes of a Spark cluster. A job consists of one or more stages; each stage is a group of transformations that can be pipelined together without shuffling data, and it runs as a set of parallel tasks.

Here are key concepts related to Spark jobs:

1. **Action Trigger:**
   - Spark jobs are triggered by actions. Actions are operations on RDDs or DataFrames that trigger the execution of the entire computation plan built by transformations.

   ```python
   # Example of an action triggering a job
   result = rdd.reduce(lambda x, y: x + y)
   ```

   In this example, the `reduce` action triggers the execution of the transformations defined on the `rdd`.

2. **Stages:**
   - A job is divided into one or more stages. Each stage consists of a set of transformations that can be executed in parallel. Stages are determined based on the presence of wide transformations that require shuffling of data between partitions.

3. **Tasks:**
   - A stage is further divided into tasks, where each task represents the smallest unit of parallel execution. Tasks are assigned to individual partitions of data, and they execute on the worker nodes of the Spark cluster.

4. **Shuffle:**
   - A Spark job may involve shuffling of data, which is the process of redistributing data across the partitions. This occurs when wide transformations like `groupByKey` or `reduceByKey` are used, requiring data to be exchanged between partitions.

5. **Job Execution Plan:**
   - Spark builds a directed acyclic graph (DAG) representing the execution plan for the entire job. The DAG includes transformations and dependencies between them. The execution plan is optimized to minimize data movement and improve performance.

6. **Fault Tolerance:**
   - Spark provides fault tolerance for jobs through lineage information. Each RDD or DataFrame keeps track of its lineage, which is the sequence of transformations that led to its creation. If a partition is lost due to a node failure, Spark can recompute the lost partition using the lineage information.

7. **Job Monitoring:**
   - Spark provides monitoring tools, such as the Spark UI and Spark History Server, to monitor the progress and performance of jobs. These tools display information about completed and ongoing jobs, stages, and tasks.
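
The relationship between wide transformations and stage boundaries can be sketched in plain Python. This is a simplified model, not Spark's scheduler: the function name `split_into_stages` and the `WIDE_OPS` set are illustrative assumptions, and the model ignores details such as map-side combining and co-partitioned joins.

```python
# Simplified model of stage splitting; names here are illustrative, not Spark APIs.
# Assumption: these operations always shuffle (real Spark can sometimes avoid it).
WIDE_OPS = {"groupByKey", "reduceByKey", "join", "distinct", "repartition", "sortByKey"}

def split_into_stages(ops):
    """Group a linear chain of transformation names into stages.

    The first stage always exists (it reads the source data). Every wide
    transformation needs shuffled input, so it starts a new stage; narrow
    transformations are pipelined into the current stage.
    """
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS:
            stages.append([])  # shuffle boundary: start a new stage
        stages[-1].append(op)
    return stages

narrow_only = split_into_stages(["map", "filter"])
with_shuffle = split_into_stages(["map", "reduceByKey", "filter"])
print(narrow_only)   # [['map', 'filter']]
print(with_shuffle)  # [['map'], ['reduceByKey', 'filter']]
```

A chain of narrow transformations stays in one stage, while each shuffle adds one more.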

![Job](/home/blackheart/Documents/Data/Apache-Spark/Images/Job.jpg)

Let's break down the concepts of jobs, stages, and tasks in Apache Spark with an example:

### Example:

Consider a simple Spark application that performs a few transformations and an action on an RDD.

```python
from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "SparkExample")

# Create an RDD
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Transformation 1: Map to double each element
mapped_rdd = rdd.map(lambda x: x * 2)

# Transformation 2: Filter to keep only even numbers
filtered_rdd = mapped_rdd.filter(lambda x: x % 2 == 0)

# Action: Compute the sum
result = filtered_rdd.reduce(lambda x, y: x + y)

# Stop the SparkContext
sc.stop()
```

### Concepts:

1. **Job:**
   - In this example, the entire computation triggered by the `reduce` action is a Spark job.
   - The job includes all the transformations (in this case, `map` and `filter`) and the final action (`reduce`).

2. **Stages:**
   - The job is divided into stages at wide transformations, which require shuffling data between partitions.
   - In this example, there is a single stage:
     - `map` and `filter` are both narrow transformations, so Spark pipelines them together and applies them to each partition in one pass.
     - `reduce` is an action, not a wide transformation. Each task reduces its own partition and the driver merges the partial results, so no shuffle and no second stage is needed.

3. **Tasks:**
   - Each stage is further divided into tasks, representing the smallest unit of parallel execution.
   - Tasks are assigned to partitions of data and executed on worker nodes.
   - The stage runs one task per partition of `rdd`. Because `parallelize` is called without an explicit partition count, that number comes from the default parallelism (with the `local` master used here, typically a single partition).
   - Narrow transformations such as `filter` do not change the number of partitions, so they do not change the task count.



### Explanation:

1. **Job:**
   - The entire execution triggered by the `reduce` action is a single Spark job.

2. **Stages:**
   - `map` and `filter` are narrow transformations, so they are pipelined into a single stage.
   - `reduce` is an action that combines per-partition results on the driver. It involves no shuffle, so it does not add a stage.

3. **Tasks:**
   - Each task in the stage applies `map`, then `filter`, and then a partial `reduce` to one partition.
   - The driver merges the partial results from all tasks into the final value.

This example illustrates how a Spark job is divided into stages, and each stage consists of tasks that can be executed in parallel. The concept of stages allows Spark to optimize the execution plan and efficiently distribute computation across the cluster.
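
To make the task-level view concrete, here is a minimal pure-Python simulation of the example's computation. It is a sketch under stated assumptions: the data is split by hand into two partitions, and `run_single_stage` is a hypothetical helper, not part of PySpark.

```python
from functools import reduce

def run_single_stage(partitions, add):
    """Simulate map -> filter -> reduce over pre-split partitions.

    Each loop iteration plays the role of one task working on one partition.
    """
    partials = []
    for part in partitions:
        doubled = (x * 2 for x in part)              # map
        evens = [x for x in doubled if x % 2 == 0]   # filter
        if evens:  # an empty partition contributes no partial result
            partials.append(reduce(add, evens))      # per-partition reduce
    # The "driver" merges one partial result per task; no shuffle is involved.
    return len(partitions), reduce(add, partials)

# Assumption: the five elements happen to be split into two partitions.
num_tasks, total = run_single_stage([[1, 2], [3, 4, 5]], lambda a, b: a + b)
print(num_tasks, total)  # 2 30
```

Every element passes through `map`, `filter`, and the partial `reduce` inside the same task, which is why the whole chain fits in one stage.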

# How to Count Jobs
Calculating the number of jobs created by a Spark program involves understanding the program's structure, specifically the actions and transformations used. Each action typically triggers the execution of a job, and transformations within the program may cause the creation of multiple stages within a job.

Here are some guidelines to help you identify the number of jobs in a Spark program:

1. **Actions Trigger Jobs:**
   - Look for actions in your Spark program. Actions, such as `count`, `collect`, `saveAsTextFile`, or any operation that requires materializing the data, generally trigger the execution of a job.

2. **Stages Within a Job:**
   - Examine the transformations in your program. If there are wide transformations (e.g., `groupByKey`, `reduceByKey`) or actions that involve shuffling, these may introduce multiple stages within a job.

3. **Spark UI:**
   - Use the Spark UI to monitor job execution. When you run your Spark application, the Spark UI provides detailed information about completed jobs, stages, and tasks. You can access the Spark UI at `http://<driver-node>:4040` by default.

4. **Program Structure:**
   - Programs with multiple actions or multiple distinct computations may create more than one job.

Here's a simple example:

```python
from pyspark import SparkContext

sc = SparkContext("local", "JobCountExample")

# Create an RDD
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Transformation 1: Map to double each element
mapped_rdd = rdd.map(lambda x: x * 2)

# Action 1: Count the number of elements
count1 = mapped_rdd.count()

# Transformation 2: Filter to keep only even numbers
filtered_rdd = mapped_rdd.filter(lambda x: x % 2 == 0)

# Action 2: Collect the results
result = filtered_rdd.collect()

# Stop the SparkContext
sc.stop()
```

In this example:
- There are two actions (`count` and `collect`), so exactly two jobs are triggered.
- The first job (`count`) has a single stage, because the only transformation preceding it, `map`, is narrow.
- The second job (`collect`) is always a separate job, also with a single stage. Because `mapped_rdd` is not cached, this job recomputes the `map` step before applying `filter`.

By examining the structure of your Spark program and understanding the impact of actions and transformations, you can estimate the number of jobs created during its execution. Monitoring the Spark UI during the execution of your program provides detailed insights into job and stage information.
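
The rule that each action triggers its own job can be illustrated with a toy lazy collection. This is only a sketch: `ToyRDD` and its `jobs_run` counter are hypothetical stand-ins that mimic lazy evaluation, not the PySpark API.

```python
class ToyRDD:
    """Toy lazy collection for illustration; not the PySpark RDD class."""
    jobs_run = 0  # shared counter of triggered "jobs"

    def __init__(self, compute):
        self._compute = compute  # zero-argument function that produces the data

    # Transformations: build a new ToyRDD lazily, run nothing yet.
    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    # Actions: each call evaluates the whole lineage and counts as one job.
    def _run_job(self):
        ToyRDD.jobs_run += 1
        return self._compute()

    def count(self):
        return len(self._run_job())

    def collect(self):
        return self._run_job()

map_calls = []  # records every element the map function sees

def double(x):
    map_calls.append(x)
    return x * 2

rdd = ToyRDD(lambda: [1, 2, 3, 4, 5])
mapped = rdd.map(double)
filtered = mapped.filter(lambda x: x % 2 == 0)
jobs_before_actions = ToyRDD.jobs_run  # still 0: transformations are lazy

count1 = mapped.count()      # job 1
result = filtered.collect()  # job 2, re-runs the map step
print(ToyRDD.jobs_run, count1, result, len(map_calls))
# 2 5 [2, 4, 6, 8, 10] 10
```

The map function is called ten times for five elements, once per element in each job. In real Spark, caching the mapped RDD is the usual way to avoid that recomputation.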

## List of Actions

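Actions in Spark are operations that trigger the execution of the computation plan built by transformations. Each action typically results in the creation of one or more jobs. Here is a list of common actions. Most of them belong to the RDD API; only a few (such as `collect`, `count`, `first`, `take`, and `foreach`) are also available on DataFrames under the same name: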

1. **`collect()`**
   - Returns all the elements of the RDD or DataFrame as an array to the driver program.

2. **`count()`**
   - Returns the number of elements in the RDD or DataFrame.

3. **`first()`**
   - Returns the first element of the RDD or DataFrame.

4. **`take(n)`**
   - Returns the first `n` elements of the RDD or DataFrame.

5. **`reduce(func)`**
   - Aggregates the elements of the RDD using the given commutative and associative function.

6. **`foreach(func)`**
   - Applies a function to each element of the RDD or each row of the DataFrame. Commonly used for side effects, such as writing to an external system.

7. **`saveAsTextFile(path)`**
   - Writes the elements of the RDD as text files under the specified path. DataFrames are written through `df.write` instead.

8. **`saveAsSequenceFile(path)`**
   - Writes the elements of a key-value RDD as Hadoop SequenceFiles under the specified path.

9. **`saveAsObjectFile(path)`**
   - Writes the elements of the RDD as serialized Java objects under the specified path. This method belongs to the Scala and Java APIs; the PySpark counterpart is `saveAsPickleFile(path)`.

10. **`countByKey()`**
    - Only applicable to RDDs of key-value pairs. Returns a map of each unique key and its count.

11. **`collectAsMap()`**
    - Only applicable to RDDs of key-value pairs. Returns the elements of the RDD as a map.

12. **`lookup(key)`**
    - Only applicable to RDDs of key-value pairs. Returns all values associated with the specified key.

13. **`takeOrdered(n, key=None)`**
    - Returns the `n` smallest elements of the RDD in ascending order, based on their natural order or a custom key.

14. **`top(n, key=None)`**
    - Returns the `n` largest elements of the RDD in descending order, based on their natural order or a custom key.

15. **`countByValue()`**
    - Returns the count of each unique element in the RDD as a map.

16. **`max()`**
    - Returns the maximum element of the RDD.

17. **`min()`**
    - Returns the minimum element of the RDD.

18. **`sum()`**
    - Returns the sum of the elements in a numeric RDD.

19. **`mean()`**
    - Returns the mean (average) of the elements in a numeric RDD.

20. **`stats()`**
    - Returns a `StatCounter` object that provides statistics (count, mean, variance, etc.) about the elements in a numeric RDD.

These actions are used to trigger the execution of Spark computations and obtain results. Keep in mind that some actions, especially those involving large datasets, can be resource-intensive, so it's essential to use them judiciously.

# Calculating the Number of Stages and Tasks

Calculating the number of stages and tasks in Spark involves understanding the transformations in your Spark program and how they impact the execution plan. Here are guidelines to help you estimate the number of stages and tasks:

### Stages:

1. **Wide Transformations:**
   - Look for wide transformations like `groupByKey`, `reduceByKey`, or any operation that requires shuffling of data. These transformations typically introduce a new stage in the execution plan.

2. **Actions:**
   - Actions delineate jobs, not stages. Each action triggers the execution of the entire lineage leading up to it, and within that job each transformation that requires shuffling introduces a new stage. For a linear chain of transformations, a job with `k` shuffles runs as `k + 1` stages.

3. **Spark UI:**
   - Utilize the Spark UI (`http://<driver-node>:4040` by default) during the execution of your Spark program. The UI provides a detailed breakdown of stages, their status, and the number of tasks.

### Tasks:

1. **Partitions:**
   - The number of tasks within a stage is often equal to the number of partitions of the RDD or DataFrame involved in that stage.

2. **Parallelism:**
   - Parallelism is bounded by the number of cores available to your executors. Spark creates one task per partition, and each core runs one task at a time by default, so a stage with more partitions than cores runs its tasks in several waves.

3. **Configuration:**
   - The `spark.default.parallelism` configuration parameter sets the default number of partitions for `parallelize` and for shuffle operations such as `reduceByKey` and `join` when no partition count is given. Narrow transformations such as `map` and `filter` keep the partition count of their parent RDD. DataFrame shuffles use `spark.sql.shuffle.partitions` instead.
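
As a rough illustration of how partitions and cores interact, the sketch below estimates how many "waves" of tasks a stage needs. It is an idealized model: the helper `task_waves` is hypothetical, the `cpus_per_task` parameter is assumed to correspond to the `spark.task.cpus` setting, and real tasks do not finish in lockstep.

```python
import math

def task_waves(num_partitions, total_cores, cpus_per_task=1):
    """Estimate how many rounds of tasks a stage needs.

    One task is created per partition; at most total_cores // cpus_per_task
    tasks can run at the same time.
    """
    slots = total_cores // cpus_per_task
    if slots < 1:
        raise ValueError("not enough cores to run a single task")
    return math.ceil(num_partitions / slots)

print(task_waves(200, 16))  # 13: 200 tasks on 16 cores
print(task_waves(8, 16))    # 1: fewer tasks than cores, half the cores sit idle
```

The second call shows why too few partitions leaves cores unused, while the first shows why a very large partition count only adds more waves.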

### Example:

Consider the following Spark program:

```python
from pyspark import SparkContext

sc = SparkContext("local", "StagesTasksExample")

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

mapped_rdd = rdd.map(lambda x: x * 2)
filtered_rdd = mapped_rdd.filter(lambda x: x % 2 == 0)
result = filtered_rdd.collect()

sc.stop()
```

In this example:
- There is one action (`collect`), which triggers the execution of the entire computation plan.
- There are two transformations (`map` and `filter`).
- Both transformations are narrow and `collect` does not shuffle, so the job runs as a single stage regardless of the parallelism settings.
- The number of tasks in each stage is determined by the number of partitions in the RDD involved in that stage.

To get a detailed breakdown, use the Spark UI during the execution of your program. It provides information on stages, tasks, and their progress.
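
Before opening the Spark UI, a back-of-the-envelope estimate is often enough. The sketch below models a job as a linear chain of operations and derives stages and task counts from it. The helper `plan_job`, the 4-partition input, and the `reduceByKey` variant are all assumptions for illustration; they are not taken from the program above and are not Spark APIs.

```python
def plan_job(initial_partitions, ops):
    """Estimate stages and tasks for a linear chain of operations.

    ops is a list of (name, shuffle_partitions) pairs. A value of None marks a
    narrow operation; an integer marks a wide operation whose shuffle produces
    that many partitions.
    """
    stages = [{"ops": [], "tasks": initial_partitions}]
    for name, shuffle_partitions in ops:
        if shuffle_partitions is not None:
            # Shuffle boundary: new stage with one task per shuffle partition.
            stages.append({"ops": [], "tasks": shuffle_partitions})
        stages[-1]["ops"].append(name)
    return stages

# map -> filter -> collect on an RDD assumed to have 4 partitions.
example = plan_job(4, [("map", None), ("filter", None)])
# Hypothetical variant with a shuffle into 2 partitions.
shuffled = plan_job(4, [("map", None), ("reduceByKey", 2), ("filter", None)])

print(len(example), sum(s["tasks"] for s in example))    # 1 4
print(len(shuffled), sum(s["tasks"] for s in shuffled))  # 2 6
```

Under these assumptions the `map`/`filter` job is one stage with four tasks, and adding a shuffle yields a second stage whose task count equals the number of shuffle partitions.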