# SPARK UI TUTORIAL

What are we going to talk about?

This notebook will help you understand Spark's Web UI - basically your dashboard for seeing what's happening in your Spark applications.
All information in this notebook is based on the Apache Spark 3.5 Web UI guide, available at:
https://spark.apache.org/docs/latest/web-ui.html





## Table of contents:

1. What is Spark - a reminder
2. Jobs Tab
3. DAG Visualizations & Event Timeline
4. Stages Tab
5. Storage Tab 
6. SQL Tab 
7. Environment Tab
8. Executor Tab
   


## 1. What's Apache Spark? - reminder
Think of Spark as a super-powered data processing engine. You know how your laptop struggles when you try to work with huge Excel files? 
Spark is designed to handle data that's WAY bigger than that by spreading the work across multiple computers.

Cool things about Spark:
- It's super fast (it keeps data in memory where it can instead of always going to the hard drive)
- Works with Python and other languages (Scala, Java, R, and SQL)
- Can handle regular data processing as well as other workloads like machine learning
- Can recover lost work if a machine crashes

## The UI part 

The Spark UI is like the dashboard of your car - it shows you what's happening under the hood.
- See if your code is actually running or stuck
- Find out why your program is taking forever
- Figure out if you're about to crash your computer's memory 😅
  
It's a webpage with the following tabs:
Jobs Tab,
Stages Tab,
Storage Tab, 
SQL Tab,
Environment Tab,
and Executor Tab.

In addition to that, there are DAG visualizations (more on that later) and event timelines, which show when jobs, stages, and tasks ran.



Let's create our first Spark program and check out the UI:


In [6]:
!pip install pyspark
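
Optional setup: on some machines (Windows in particular), Spark actions can fail with `Python worker failed to connect back`. A commonly suggested remedy is to point Spark at the same Python interpreter the notebook is using. Treat this as an assumption about your environment; this notebook has not verified it. It has to run before the Spark session is created:

```python
import os
import sys

# Assumption: Spark's Python workers should use the same interpreter as this notebook.
# These are standard PySpark environment variables; set them BEFORE building the SparkSession.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print("Spark will launch Python workers with:", os.environ["PYSPARK_PYTHON"])
```

If a session is already running, restart the kernel and run this cell first.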



In [7]:
from pyspark.sql import SparkSession

# Start up Spark (like turning on your car's engine)
spark = SparkSession.builder \
    .appName("MyFirstSparkApp") \
    .master("local[*]") \
    .getOrCreate()

print("Hey! Your Spark UI should be available at:", spark.sparkContext.uiWebUrl)



Hey! Your Spark UI should be available at: http://host.docker.internal:4040



# 2. The Spark Jobs Tab:
The Jobs tab in Apache Spark's UI is one of the most important interfaces for understanding what's happening in your Spark application.

## What is a Spark Job?
A "job" in Spark is created whenever you call an **action** on your data. Actions are operations that trigger actual computation and produce results, unlike transformations which just build up a plan. Common actions include:
- `collect()` - Brings data back to the driver program
- `count()` - Counts the number of elements
- `write.save()` - Writes data to storage
- `show()` - Displays data
- `take(n)` - Returns the first n elements

## What the Jobs Tab Shows You

### Job Listing and Status
- **Job ID**: A unique identifier for each job (starting at 0)
- **Status indicators**: 
  - RUNNING - Currently executing
  - SUCCEEDED - Completed successfully
  - FAILED - Terminated with errors
  - PENDING - Waiting to be scheduled

### Detailed Metrics
- **Submission Time**: When the job was sent to the Spark engine
- **Duration**: How long the job took/is taking to complete
- **Stages**: The job broken down into its component processing steps
- **Tasks**: Individual units of work within each stage
- **Input/Output**: Data processed and generated

### Visual Elements
- **Progress bars**: Show percentage completion of each job
- **Stage visualization**: See which parts of your job are complete or in progress
- **Timeline view**: A chronological representation of when jobs executed
- **Dependency charts**: Show how the stages of a job depend on each other

In [8]:

# Create sample data
students = [(1, "Alice", 85), (2, "Bob", 92), (3, "Charlie", 78)]


df = spark.createDataFrame(students, ["id", "name", "grade"])

# Calculate average grade by name
avg_grade = df.groupBy("name").avg("grade").collect()



Py4JJavaError: An error occurred while calling o52.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5) (host.docker.internal executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:203)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:174)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.base/java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
	at java.base/java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
	at java.base/java.net.AbstractPlainSocketImpl.accept(Unknown Source)
	at java.base/java.net.PlainSocketImpl.accept(Unknown Source)
	at java.base/java.net.ServerSocket.implAccept(Unknown Source)
	at java.base/java.net.ServerSocket.accept(Unknown Source)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:190)
	... 34 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)


### 🎯 Your First Exercise: Making Spark Work
Let's create something bigger so we can actually see what's happening in the UI.


In [None]:
import random

# Create pretend student data
student_data = []
for i in range(10000):  # 10,000 students
    student_data.append((
        i,  # student ID
        random.randint(60, 100),  # random grade
        random.choice(['Math', 'Science', 'History', 'English'])  # random subject
    ))

# Turn it into a Spark DataFrame
students_df = spark.createDataFrame(student_data, ["id", "grade", "subject"])

# Find average grades by subject
result = students_df.groupBy("subject") \
    .agg({"grade": "avg"}) \
    .collect()


## While this runs, quickly:
 1. Open the Spark UI (the URL printed earlier, usually http://localhost:4040)
 2. Click on the Jobs tab
 3. Look at what's happening

## Things to spot in the UI:
 - How many stages did your job have?
 - How long did it take?
 - Are there any cool graphs showing up?

# 3. Understanding Spark Event Timeline and DAG Visualization

# Event Timeline in Spark

The Event Timeline in Apache Spark provides a chronological view of your application's execution. It's an essential diagnostic tool that helps beginners understand the flow and timing of operations.

### What the Event Timeline Shows

1. **Time-based visualization**: A horizontal chart showing when each component of your Spark application was active
   
2. **Component types displayed**:
   - **Jobs**: The highest-level units of work (triggered by actions)
   - **Stages**: Sub-divisions of jobs that represent distinct processing steps
   - **Tasks**: The smallest units of work, distributed across executors
   - **Executors**: The processes actually running your code

3. **Color coding**:
   - Different colors typically represent different states (running, completed, failed)
   - The length of each bar represents duration
   - Gaps may indicate idle time or scheduling delays

### How to Use the Event Timeline

- **Identify bottlenecks**: Look for unusually long stages or tasks
- **Spot parallelism issues**: See if your tasks are running concurrently or serially
- **Detect skew**: Find tasks that take much longer than others in the same stage
- **Understand dependencies**: See which stages must complete before others can begin
- **Track resource usage**: Monitor how many executors are active at any time
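
The "spot parallelism issues" idea can be sketched in plain Python. The task start and end times below are made-up numbers (in seconds) standing in for the bars you would read off the timeline, and `peak_concurrency` is a hypothetical helper, not part of Spark.

```python
def peak_concurrency(intervals):
    """Return the largest number of tasks running at the same moment.

    intervals: list of (start, end) times, one pair per task.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))    # a task starts
        events.append((end, -1))     # a task finishes
    # Sort by time; at equal times, finishes (-1) are processed before starts (+1)
    events.sort()
    running = peak = 0
    for _, change in events:
        running += change
        peak = max(peak, running)
    return peak

# Made-up timelines: four tasks that overlap vs. four that run back to back
parallel_tasks = [(0, 4), (0, 5), (1, 4), (1, 6)]
serial_tasks = [(0, 2), (2, 4), (4, 6), (6, 8)]

print("parallel:", peak_concurrency(parallel_tasks))
print("serial:", peak_concurrency(serial_tasks))
```

A peak of 1 means the tasks ran one after another, which is what a staircase pattern in the timeline looks like.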



# DAG Visualization in Spark

The DAG (Directed Acyclic Graph) Visualization shows the logical and physical plan of your Spark operations.

### Key Elements of the DAG Visualization

1. **Nodes**: 
   - Each box represents an operation (like `map`, `filter`, `join`, etc.)
   - Labels show the specific transformation being applied
   - May include additional details about partitioning, size, or format

2. **Edges (arrows)**:
   - Show the flow of data between operations
   - Indicate dependencies between steps
   - Direction always flows from input to output (hence "directed")

3. **Stages**:
   - Groups of operations that can be executed together without shuffling data
   - Usually separated by shuffle boundaries (operations that require data redistribution)

### What Makes it "Acyclic"

The "acyclic" part means there are no loops or cycles in the graph - data always flows forward, never back to an earlier operation. This is fundamental to how Spark processes data in a deterministic way.
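
Why the no-cycles rule matters can be shown with a small sketch using Python's standard `graphlib` module (Python 3.9+). The stage names are invented for illustration, and this is not how Spark's scheduler is implemented. It only shows that a graph without cycles always has a valid run order, while a graph with a cycle has none.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical stages: each key lists the stages it depends on
stage_deps = {
    "read_a": set(),
    "read_b": set(),
    "join": {"read_a", "read_b"},
    "aggregate": {"join"},
}

# An acyclic graph can always be put into a valid execution order
order = list(TopologicalSorter(stage_deps).static_order())
print("one valid order:", order)

# Add a dependency that points backwards, creating a loop
cyclic_deps = dict(stage_deps, read_a={"aggregate"})
try:
    list(TopologicalSorter(cyclic_deps).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```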

### How Beginners Can Use the DAG Visualization

1. **Understanding transformations**: See how operations like `map`, `filter`, and `join` connect together
   
2. **Identifying shuffles**: Spot where data is being redistributed across the cluster (often a performance bottleneck)
   
3. **Recognizing optimizations**: See how Spark has optimized your code (some operations might be combined or reordered)
   
4. **Debugging**: Trace the flow of data to understand where problems might occur
   
5. **Learning Spark's execution model**: Visualize how Spark translates high-level code into concrete execution steps




## How Timeline and DAG Visualizations Work Together

These two visualizations complement each other:
- **Timeline**: Shows *when* things happened and how long they took
- **DAG**: Shows *what* happened and how operations connected

Together, they give beginners a complete picture of both the logical structure and temporal execution of their Spark applications, which is invaluable for learning, optimization, and debugging.


# 4. The Stages Tab 

## What Are Stages in Spark?

In Apache Spark, a "stage" is a set of tasks that can be executed in parallel without requiring data to be shuffled across the cluster. Stages are created when:

1. A **shuffle operation** is required (like `groupByKey`, `reduceByKey`, `join`, etc.)
2. An **action** is called that triggers computation (like `collect`, `count`, `save`)

Understanding stages is crucial because they represent the actual chunks of work that Spark performs.

## The Stages Tab Interface

The Stages tab in Spark's UI provides detailed information about each stage in your application:

### Key Components of the Stages Tab

1. **Stages List**:
   - **Stage ID**: A unique identifier for each stage
   - **Description**: What operations this stage performs
   - **Submitted Time**: When this stage was submitted to the scheduler
   - **Duration**: How long the stage took to execute
   - **Tasks Progress**: Visual representation of task completion
   - **Input/Output**: Data sizes processed and produced

2. **Stage Details** (when clicking on a specific stage):
   - **Task metrics**: Statistics about task execution
   - **Executor information**: Which executors ran tasks
   - **Shuffle read/write**: Data movement between stages
   - **Task distribution**: How evenly work was distributed

3. **Aggregated Metrics**:
   - **Summary statistics**: Averages, minimums, maximums
   - **Percentile information**: Distribution of task durations
   - **Skew detection**: Identifying imbalanced work distribution
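
The summary statistics above are easy to reason about with a few numbers. The sketch below uses made-up task durations (in seconds) and a hypothetical `skew_ratio` helper. Comparing the slowest task to the median is a rule of thumb for illustration, not a metric that Spark itself reports.

```python
import statistics

def skew_ratio(durations):
    """Return max duration divided by median duration (1.0 means perfectly even)."""
    return max(durations) / statistics.median(durations)

# Made-up task durations in seconds for two stages
balanced_stage = [2.0, 2.1, 1.9, 2.0, 2.2]
skewed_stage = [2.0, 2.1, 1.9, 2.0, 20.0]

for name, durations in [("balanced", balanced_stage), ("skewed", skewed_stage)]:
    print(name,
          "min:", min(durations),
          "median:", statistics.median(durations),
          "max:", max(durations),
          "max/median:", round(skew_ratio(durations), 1))
```

When one task's duration is many times the median, that stage is a good candidate for a closer look at how the data is partitioned.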

## How to Interpret the Stages Tab

### Stage Status Indicators

- **Active**: Currently running stages
- **Pending**: Stages waiting to be scheduled
- **Completed**: Successfully finished stages
- **Failed**: Stages that encountered errors
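- **Skipped**: Stages that did not need to run because their output was already available (for example, from an earlier shuffle or from cached data)

**Compare similar jobs**: See how changes to your code affect stage performance.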

In [None]:
# Let's do something that creates multiple stages
# First, create two sets of student data

class_a = spark.createDataFrame([
    (i, f"Student_{i}", random.randint(60, 100))
    for i in range(1000)
], ["id", "name", "grade"])

class_b = spark.createDataFrame([
    (i, random.choice(['A', 'B', 'C', 'D']))
    for i in range(1000)
], ["id", "letter_grade"])

# Now let's combine them!
combined_classes = class_a.join(class_b, "id") \
    .where("grade >= 70") \
    .groupBy("letter_grade") \
    .count() \
    .collect()





### Check the Stages tab
 You should see multiple stages because Spark had to:
 1. Process class_a data
 2. Process class_b data
 3. Join them together
 4. Group and count

### 🎯 Exercise
 Your mission: Figure out which operation is slowest!


In [None]:

def investigate_performance():
    # Create some data
    data1 = spark.createDataFrame([
        (i, random.randint(1, 50), random.choice(['X', 'Y', 'Z']))
        for i in range(10000)
    ], ["id", "score", "group"])
    
    data2 = spark.createDataFrame([
        (i, random.choice(['Red', 'Blue', 'Green']))
        for i in range(10000)
    ], ["id", "team"])
    
    # Do a bunch of operations
    result = data1.join(data2, "id") \
        .groupBy("team", "group") \
        .avg("score") \
        .orderBy("team") \
        .collect()
    
    return result

# Call the function so the jobs actually show up in the UI
result = investigate_performance()




### Run it and check the UI!
 Things to look for in the Stages tab:
 - Which stage took the longest?
 - How much data was shuffled around?
 - Did all tasks take about the same time?


# 5. The Storage Tab

### What is the Storage Tab in Spark?

The Storage tab in Apache Spark's UI provides visibility into your application's data caching behavior. Caching (or "persisting") data is one of the most powerful optimization techniques in Spark, allowing frequently used datasets to be stored in memory or disk for faster access.

## Key Components of the Storage Tab

### RDD/DataFrame/Dataset Listing

1. **Cached Data Entries**:
   - **Name**: Identifier for each cached dataset
   - **Storage Level**: How and where data is stored
   - **Cache Size**: Memory/disk space utilized
   - **Fraction Cached**: Percentage of dataset actually stored
   - **Number of Partitions**: How the data is divided

2. **Storage Level Details**:
   - **MEMORY_ONLY**: Data stored in memory as deserialized Java objects
   - **MEMORY_AND_DISK**: Data stored in memory first, overflow goes to disk
   - **MEMORY_ONLY_SER**: Data stored in memory in serialized format (more compact)
   - **DISK_ONLY**: Data stored only on disk
   - **OFF_HEAP**: Data stored in off-heap memory (outside JVM)

3. **Additional Information**:
   - **Creation Time**: When the data was cached
   - **Data Distribution**: How data is spread across executors
   - **Last Access Time**: When cached data was last used

## How Caching Works in Spark

When you call `.cache()` or `.persist()` on an RDD, DataFrame, or Dataset:

1. The first time an action is called on the dataset, Spark computes it and stores the result
2. Subsequent operations using this data will read from cache instead of recomputing
3. Cached data remains until it is explicitly unpersisted or the application ends

## Why the Storage Tab Matters

### Verifying Caching

- Check if datasets are actually cached and how much was cached (vs. spilled to disk).
- Monitor memory usage to avoid exceeding limits or wasting space.

### Performance Optimization

- Identify which cached datasets are used frequently.
- Check partition distribution for skew or imbalance.
- Evaluate storage level (e.g., memory vs. disk) to improve your application's performance and resource utilization.

In [30]:
# Let's make Spark remember something
big_df = spark.createDataFrame([
    (i, f"Data_{i}")
    for i in range(10000)
], ["id", "data"])

# Tell Spark to keep this in memory
big_df.cache()

# Force it to actually load into memory
big_df.count()


Py4JJavaError: An error occurred while calling o73.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 1.0 failed 1 times, most recent failure: Lost task 6.0 in stage 1.0 (TID 14) (host.docker.internal executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:203)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:174)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:381)
	at org.apache.spark.storage.BlockManager.$anonfun$getOrElseUpdate$1(BlockManager.scala:1372)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1597)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1524)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1588)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1389)
	at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:379)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.base/java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
	at java.base/java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
	at java.base/java.net.AbstractPlainSocketImpl.accept(Unknown Source)
	at java.base/java.net.PlainSocketImpl.accept(Unknown Source)
	at java.base/java.net.ServerSocket.implAccept(Unknown Source)
	at java.base/java.net.ServerSocket.accept(Unknown Source)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:190)
	... 43 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)


### Now check the Storage tab!
How much memory is it using?

Is everything in memory or did some spill to disk?

# 6. The SQL Tab:

## What it Shows
The SQL Tab displays all Spark SQL queries and DataFrame operations executed in your application, helping you understand query performance and execution details. The SQL Tab is your window into how Spark transforms your high-level code into execution steps, which is critical for optimizing query performance.

## Key Components

### Query List View
- **Query ID**: Unique identifier for each query
- **Description**: The SQL statement or DataFrame operation
- **Duration**: Execution time (most important metric!)
- **Status**: Running, completed, or failed

### Query Details View (when clicking a query)
- **Logical Plan**: The abstract representation of what you want to compute
- **Physical Plan**: The actual execution strategy Spark uses
  - Shows operations like joins, filters, aggregations
  - Includes important execution details (join types, estimated sizes)
- **Visualization**: Interactive graph showing query execution flow

## How to Use It

### Performance Analysis
- Identify slowest queries by sorting by duration
- Look for expensive operations (large shuffles, cartesian products)
- Check if filters are being pushed down early in execution
- Spot stage boundaries (shuffles)

### Debugging
- Find failed queries
- Trace complex query execution paths
- Connect each query to the jobs it triggered
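
The query list can also be read outside the browser. Here is a minimal sketch, assuming the monitoring REST API's `/api/v1/applications/[app-id]/sql` endpoint (available in recent Spark 3.x releases) and its `id`, `status`, and `duration` fields. `fetch_sql_executions` and `slowest_queries` are hypothetical helper names, and the sample records are made up for illustration:

```python
import json
from urllib.request import urlopen


def fetch_sql_executions(ui_url, app_id):
    """Fetch the SQL tab's query list as JSON from a running Spark UI."""
    url = f"{ui_url}/api/v1/applications/{app_id}/sql"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def slowest_queries(executions, top=3):
    """Return (id, status, duration_ms) for the longest-running queries."""
    ranked = sorted(executions, key=lambda q: q.get("duration", 0), reverse=True)
    return [(q["id"], q["status"], q.get("duration", 0)) for q in ranked[:top]]


# Hypothetical records shaped like the REST response (values are invented)
sample = [
    {"id": 0, "status": "COMPLETED", "duration": 1200},
    {"id": 1, "status": "FAILED", "duration": 4500},
    {"id": 2, "status": "COMPLETED", "duration": 300},
]
print(slowest_queries(sample, top=2))
```

With a live session, the same ranking could be obtained with `slowest_queries(fetch_sql_executions(spark.sparkContext.uiWebUrl, spark.sparkContext.applicationId))`.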

In [33]:

# First, let's make our DataFrame available for SQL
big_df.createOrReplaceTempView("my_table")

# Now run a SQL query
spark.sql("""
    SELECT 
        SUBSTRING(data, 1, 5) as short_data,
        COUNT(*) as count
    FROM my_table
    GROUP BY SUBSTRING(data, 1, 5)
""").show()




Py4JJavaError: An error occurred while calling o80.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 2.0 failed 1 times, most recent failure: Lost task 6.0 in stage 2.0 (TID 22) (host.docker.internal executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:203)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:174)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:381)
	at org.apache.spark.storage.BlockManager.$anonfun$getOrElseUpdate$1(BlockManager.scala:1372)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1597)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1524)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1588)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1389)
	at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:379)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.base/java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
	at java.base/java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
	at java.base/java.net.AbstractPlainSocketImpl.accept(Unknown Source)
	at java.base/java.net.PlainSocketImpl.accept(Unknown Source)
	at java.base/java.net.ServerSocket.implAccept(Unknown Source)
	at java.base/java.net.ServerSocket.accept(Unknown Source)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:190)
	... 43 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:203)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:174)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:381)
	at org.apache.spark.storage.BlockManager.$anonfun$getOrElseUpdate$1(BlockManager.scala:1372)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1597)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1524)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1588)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1389)
	at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:379)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.base/java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
	at java.base/java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
	at java.base/java.net.AbstractPlainSocketImpl.accept(Unknown Source)
	at java.base/java.net.PlainSocketImpl.accept(Unknown Source)
	at java.base/java.net.ServerSocket.implAccept(Unknown Source)
	at java.base/java.net.ServerSocket.accept(Unknown Source)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:190)
	... 43 more
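
The traceback above ends in `Python worker failed to connect back`, which means Spark could not start its Python worker process. One common cause is Spark launching a different Python interpreter than the one running the notebook. A minimal sketch, assuming that is the cause here (it may not be in your setup):

```python
import os
import sys

# Assumption: the worker fails because Spark picks up a different (or missing) Python.
# Point both the workers and the driver at the interpreter running this notebook.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print("Spark will launch Python workers with:", os.environ["PYSPARK_PYTHON"])
```

These variables are read when the session starts, so run this before creating the `SparkSession` (stop and recreate the session if it is already running).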


### Check out the SQL tab to see:
 What steps Spark took to run your query?
 
 How long each part took?
 
 If there are any parts that could be faster?
 


# 7. The Environment Tab

## What it Shows
The Environment Tab provides a comprehensive view of the configuration settings and runtime environment of your Spark application.
It is your reference for understanding exactly how your Spark application is configured, which is critical for debugging configuration issues and ensuring optimal performance settings.

In [None]:
# We can change some settings
spark.conf.set("spark.sql.shuffle.partitions", "10")  # Default is 200!



### Check the Environment tab to see:
 What settings are active
 
 How much memory Spark can use
 
 What version of Spark you're running
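
The same properties can be checked in code. A minimal sketch, assuming the monitoring REST API's `/api/v1/applications/[app-id]/environment` endpoint, whose `sparkProperties` field is a list of `[key, value]` pairs. `lookup_properties` is a hypothetical helper, and the sample payload only mirrors the app name and master set at the start of this notebook:

```python
def lookup_properties(environment, keys):
    """Map each requested key to its value in sparkProperties, or None if absent."""
    props = dict(environment.get("sparkProperties", []))
    return {key: props.get(key) for key in keys}


# Hypothetical payload shaped like the environment endpoint's response
sample_env = {
    "sparkProperties": [
        ["spark.app.name", "MyFirstSparkApp"],
        ["spark.master", "local[*]"],
    ],
}

found = lookup_properties(sample_env, ["spark.master", "spark.executor.memory"])
print(found)
```

A `None` value means the property was never set explicitly, so Spark is using its built-in default for it.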


# 8. The Executors Tab

## What it Shows
The Executors Tab provides information about the worker processes (executors) running your Spark application tasks across the cluster. It gives you visibility into the physical resources processing your data, which is essential for diagnosing performance issues and optimizing resource allocation.

## Key Components

### Summary Statistics
- **Active Executors**: Total number of running executors
- **Total Cores**: Available CPU cores across all executors
- **Total Memory**: Aggregate memory allocation

### Per-Executor Information
- **Executor ID**: Unique identifier for each executor
- **Storage Memory**: Memory used for caching data
- **Task Metrics**:
  - **Completed/Failed Tasks**: Task execution status
  - **Task Time**: Processing duration
  - **GC Time**: Time spent on garbage collection
  - **Shuffle Read/Write**: Data transferred between executors

## How to Use It

### Resource Monitoring
- Verify executor count matches your configuration
- Monitor memory usage and check that the distribution is balanced
- Track GC time relative to task time (a high ratio suggests memory pressure)

### Troubleshooting
- Look for skewed workloads
- Detect memory-related issues
- Find executors with failed tasks in the task metrics

### Performance Analysis
- Check for data skew in shuffle metrics
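
The GC and balance checks can be turned into numbers. A minimal sketch, assuming executor summaries shaped like the monitoring REST API's `/api/v1/applications/[app-id]/executors` response (`totalDuration` and `totalGCTime` in milliseconds, plus `completedTasks`). The helper names, the 10% GC threshold, and the sample values are assumptions for illustration:

```python
def executor_health(executors, gc_threshold=0.10):
    """Compute the GC-to-task-time ratio per executor and flag high values."""
    report = []
    for ex in executors:
        task_ms = ex["totalDuration"]
        gc_ratio = ex["totalGCTime"] / task_ms if task_ms else 0.0
        report.append({
            "id": ex["id"],
            "gc_ratio": round(gc_ratio, 3),
            "gc_warning": gc_ratio > gc_threshold,  # rule-of-thumb threshold
        })
    return report


def task_imbalance(executors):
    """Ratio of the busiest executor's completed tasks to the least busy one's."""
    counts = [ex["completedTasks"] for ex in executors]
    return max(counts) / max(min(counts), 1)


# Hypothetical executor summaries (values are invented)
sample_executors = [
    {"id": "1", "totalDuration": 60000, "totalGCTime": 9000, "completedTasks": 120},
    {"id": "2", "totalDuration": 50000, "totalGCTime": 1000, "completedTasks": 40},
]
print(executor_health(sample_executors))
print(task_imbalance(sample_executors))
```

An imbalance well above 1 hints at skew: one executor is doing most of the work while the others wait.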

In [None]:
# Let's create some work for our executors
# (This will be more interesting on a real cluster, but we can still see it in local mode)
import random

from pyspark.sql import functions as F

# Create a bigger dataset to process
big_data = spark.createDataFrame([
    (i, random.randint(1, 100), random.choice(['A', 'B', 'C', 'D', 'E']))
    for i in range(50000)  # 50,000 rows should be enough to see some action
], ["id", "value", "category"])

# Do some processing that will use our executors
# Explicit column functions give all three aggregates (a dict spec can hold only one per column)
processed = big_data.repartition(8) \
    .groupBy("category") \
    .agg(F.sum("value"), F.avg("value"), F.max("value")) \
    .cache() \
    .collect()

# Processing another dataset to give executors more work
another_data = spark.createDataFrame([
    (i, f"item_{i}", random.randint(1, 1000))
    for i in range(20000)
], ["id", "name", "price"])

another_result = another_data.join(big_data, "id", "left") \
    .where("price > value") \
    .groupBy("category") \
    .count() \
    .collect()
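
A note on the dictionary form of `.agg(...)` used in this notebook: it maps each column name to one aggregate, and a Python dict keeps only the last value for a repeated key. This small sketch needs no Spark and shows why several aggregates on the same column cannot be expressed that way:

```python
# Dictionary-style spec with the same column listed three times
spec = {"value": "sum", "value": "avg", "value": "max"}

# Python silently keeps only the last entry for a repeated key
print(spec)       # {'value': 'max'}
print(len(spec))  # 1

# Distinct columns are unaffected
spec_ok = {"price": "sum", "quantity": "sum"}
print(len(spec_ok))  # 2
```

For several aggregates over one column, pass explicit column expressions instead, for example `F.sum("value"), F.avg("value"), F.max("value")` with `from pyspark.sql import functions as F`.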


#### 🎯 Exercise: Executor Detective Work!


In [None]:

# Let's make our executors work harder and see what happens
def investigate_executors():
    # Create 3 datasets to process
    data1 = spark.createDataFrame([
        (i, random.randint(1, 100), f"product_{i}")
        for i in range(20000)
    ], ["id", "quantity", "product"])
    
    data2 = spark.createDataFrame([
        (i, random.choice(['Region1', 'Region2', 'Region3', 'Region4']))
        for i in range(20000)
    ], ["id", "region"])
    
    data3 = spark.createDataFrame([
        (i, random.randint(10, 1000), random.choice([True, False]))
        for i in range(20000)
    ], ["id", "price", "in_stock"])
    
    # Cache the first dataset
    data1.cache()
    data1.count()  # Materialize the cache
    
    # Run a complex job
    result = data1.join(data2, "id") \
        .join(data3, "id") \
        .where("in_stock = true") \
        .groupBy("region") \
        .agg({"price": "sum", "quantity": "sum"}) \
        .orderBy("region") \
        .collect()
    
    return result

# Run the investigation
result = investigate_executors()




### Now check the Executors tab and answer these questions:
 1. How many executors do you see?
 2. How much memory is each executor using?
 3. How many tasks has each executor completed?
 4. Is there a big difference in task runtime between executors?
 5. Do you see any "spill" (data written to disk instead of kept in memory)?


### 🎯 Final Exercise: Put It All Together


In [None]:
# Let's create a mini school database and analyze it
def analyze_school_data():
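    # First, let's set some configuration to make things interesting
    # Values changed at runtime with spark.conf.set are session-level SQL settings;
    # they may not be listed in the Environment tab, so read them back with spark.conf.get(...)
    spark.conf.set("spark.sql.shuffle.partitions", "8")  # Default is 200
    # spark.executor.memory is fixed when the application launches, and Spark 3.x rejects
    # changing it through spark.conf.set. Set it when building the session instead:
    #   SparkSession.builder.config("spark.executor.memory", "1g")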
    
    # Create students table
    students = spark.createDataFrame([
        (i, f"Student_{i}", random.randint(14, 18))
        for i in range(1000)
    ], ["id", "name", "age"])
    
    # Create grades table
    grades = spark.createDataFrame([
        (random.randint(1, 1000), 
         random.choice(['Math', 'Science', 'History']),
         random.randint(60, 100))
        for _ in range(5000)
    ], ["student_id", "subject", "grade"])
    
    # Create a third table for more complex joins
    attendance = spark.createDataFrame([
        (random.randint(1, 1000),
         random.choice(['Math', 'Science', 'History']),
         random.randint(70, 100))
        for _ in range(7000)
    ], ["student_id", "subject", "attendance_pct"])
    
    # Cache one of our tables to see memory usage in the Storage and Executors tabs
    grades.cache()
    grades.count()  # Materialize the cache
    
    # Make them available for SQL
    students.createOrReplaceTempView("students")
    grades.createOrReplaceTempView("grades")
    attendance.createOrReplaceTempView("attendance")
    
    # Run a more complex query with multiple joins
    result = spark.sql("""
        SELECT 
            s.name,
            s.age,
            AVG(g.grade) as avg_grade,
            AVG(a.attendance_pct) as avg_attendance,
            COUNT(DISTINCT g.subject) as subjects_taken
        FROM students s
        JOIN grades g ON s.id = g.student_id
        JOIN attendance a ON s.id = a.student_id AND g.subject = a.subject
        GROUP BY s.name, s.age
        HAVING AVG(g.grade) > 80 AND AVG(a.attendance_pct) > 85
        ORDER BY avg_grade DESC, avg_attendance DESC
    """)
    
    # Create a second query to generate more executor work
    spark.sql("""
        SELECT 
            g.subject,
            COUNT(*) as student_count,
            AVG(g.grade) as avg_subject_grade,
            AVG(a.attendance_pct) as avg_subject_attendance
        FROM grades g
        JOIN attendance a ON g.student_id = a.student_id AND g.subject = a.subject
        GROUP BY g.subject
    """).show()
    
    # Show the main result
    result.show()
    
    return result

# Run it!
school_analysis = analyze_school_data()



# Your mission - investigate ALL the tabs:
1. Check the Jobs tab - how many jobs were created?
2. Look at the Stages tab - which stage took longest?
3. Check the SQL tab - can you understand the query plan?
4. Look at the Executors tab:
    - How many tasks did each executor process?
    - Is memory usage balanced across executors?
    - Do you see any data being spilled to disk?
5. Check the Environment tab:
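    - Look for the custom configuration we set (hint: shuffle partitions); runtime changes may not be listed here, so compare with spark.conf.get("spark.sql.shuffle.partitions")
    - Is spark.executor.memory listed? (It appears only if it was set when the application started)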
    - How many cores is Spark using?

##  Cleaning Up 🧹


In [None]:
# Stop Spark (like turning off your car)
spark.stop()