# SparkContext: Core Interface to Spark Execution

## 1. Introduction to SparkContext

`SparkContext` is the original entry point into Apache Spark's execution engine, present since the earliest versions of Spark. It connects the application's driver program to the cluster manager (such as YARN, Mesos, or Kubernetes). Since Spark 2.0, `SparkSession` is the unified entry point and wraps a `SparkContext` internally, but `SparkContext` remains the way to work directly with RDDs, broadcast variables, and accumulators.

In [29]:
from pyspark import SparkContext
sc = SparkContext(appName="BasicContext", master="local[*]")

## 2. SparkContext Architecture Overview

```
SparkContext
├── DAG Scheduler
│   └── Splits each job into stages of tasks
├── Task Scheduler
│   └── Schedules tasks on executors
├── Cluster Manager
│   └── Launches executors, manages resources
└── Executors
    └── Execute tasks and return results
```

### 2.1 Key Responsibilities

* Establishes connection to a cluster manager
* Requests executor resources (CPU and memory) from the cluster manager
* Sends application code to executors
* Coordinates distributed task execution

### 2.2 Core Components

* **SchedulerBackend** – communicates with cluster manager
* **DAGScheduler** – splits each job into stages at shuffle boundaries and tracks the dependencies between stages
* **TaskScheduler** – submits the tasks of each stage to executors and retries failed tasks
* **HeartbeatReceiver** – tracks executor health
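
To make the idea of "building stages" concrete, here is a toy model in plain Python. It is not Spark's implementation: the `split_into_stages` function, the `(name, is_wide)` tuples, and the example lineage are all hypothetical, and the only rule it encodes is that a transformation requiring a shuffle starts a new stage.

```python
from typing import List, Tuple

# Each step is (name, is_wide). A "wide" step needs a shuffle, which is
# where this toy model opens a new stage.
Lineage = List[Tuple[str, bool]]


def split_into_stages(lineage: Lineage) -> List[List[str]]:
    """Group consecutive narrow steps; every wide step opens a new stage."""
    stages: List[List[str]] = [[]]
    for name, is_wide in lineage:
        if is_wide and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return [stage for stage in stages if stage]


# Hypothetical lineage of a word-count style job
lineage = [
    ("textFile", False),
    ("flatMap", False),
    ("map", False),
    ("reduceByKey", True),   # shuffle boundary
    ("filter", False),
    ("repartition", True),   # another shuffle boundary
    ("map", False),
]

stages = split_into_stages(lineage)
for index, stage in enumerate(stages):
    print(f"Stage {index}: {' -> '.join(stage)}")
```

In this simplified view the lineage above ends up in three stages. The real scheduler works on a graph of RDD dependencies rather than a flat list, so treat the sketch only as an aid to reading the stage view in the Spark UI.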

---


## 3. Creating SparkContext

### 3.1 Basic Initialization

In [30]:
sc.stop()  # stop the context from section 1; only one SparkContext can be active at a time

In [31]:
sc = SparkContext(appName="SimpleApp", master="local[*]")

### 3.2 Configuration Object


In [32]:
sc.stop()  # stop the previous context before creating one with a custom configuration

In [33]:
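from pyspark import SparkConf

# Build the configuration first; it is read when the context starts
conf = SparkConf()
conf.set("spark.executor.memory", "2g")  # heap size per executor
conf.set("spark.executor.cores", "2")    # CPU cores per executor
sc = SparkContext(conf=conf)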

### 3.3 Reusing in SparkSession


In [34]:
from pyspark.sql import SparkSession

# getOrCreate() reuses the SparkContext that is already running, if there is one
spark = SparkSession.builder.appName("WithSC").getOrCreate()
sc = spark.sparkContext

## 4. SparkContext Operations

### 4.1 Parallelize Collections

In [35]:
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.collect())

[1, 2, 3, 4, 5]


### 4.2 Reading Files

In [36]:
text_rdd = sc.textFile("sample.txt")
print(text_rdd.take(5))

Py4JJavaError: An error occurred while calling o197.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/user/PySpark/notebooks/architectures/sample.txt
	... (intermediate stack frames omitted)
Caused by: java.io.IOException: Input path does not exist: file:/home/user/PySpark/notebooks/architectures/sample.txt
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
	... 25 more


### 4.3 Broadcast Variables

In [11]:
broadcastVar = sc.broadcast([1, 2, 3])
print(broadcastVar.value)

[1, 2, 3]


### 4.4 Accumulators

In [12]:
acc = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))
print(acc.value)

10



## 5. Configuration & Resource Management 

### 5.1 View Cluster Info


In [15]:
print("Master:", sc.master)
print("App Name:", sc.appName)
print("Spark Version:", sc.version)

Master: local[*]
App Name: pyspark-shell
Spark Version: 3.5.5


### 5.2 Executor Info

In [21]:
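# _jsc is a private handle to the JVM context; the call returns a Scala Map,
# so it is iterated through Py4J rather than as a Python dict
status = sc._jsc.sc().getExecutorMemoryStatus()
hosts_iter = status.keySet().iterator()

while hosts_iter.hasNext():
    host = hosts_iter.next()
    mem = status.get(host)  # Scala Option wrapping a (max, remaining) pair in bytes
    print(f"Executor: {host}, Memory: {mem}")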

Executor: idx-pyspark-1746386305122:44591, Memory: Some((455501414,455495773))
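
The pair printed above is in bytes; according to Spark's API documentation it holds the maximum memory available for caching and the amount still remaining. A small helper, hypothetical and written in plain Python, makes the figures easier to read:

```python
def describe_storage_memory(max_bytes: int, remaining_bytes: int) -> dict:
    """Convert a (max, remaining) byte pair into MiB figures."""
    mib = 1024 * 1024
    return {
        "max_mib": round(max_bytes / mib, 2),
        "remaining_mib": round(remaining_bytes / mib, 2),
        "used_bytes": max_bytes - remaining_bytes,
    }


# Values copied from the output above
summary = describe_storage_memory(455501414, 455495773)
print(summary)
```

For the run shown here this works out to roughly 434.4 MiB available, of which only a few kilobytes were in use.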


### 5.3 CPU Parallelism

In [22]:
print("Default Parallelism:", sc.defaultParallelism)

Default Parallelism: 2


## 6. Advanced Tuning

### 6.1 Memory
* `spark.driver.memory` – heap size of the driver JVM (default `1g`)
* `spark.executor.memory` – heap size of each executor JVM (default `1g`)
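
Only part of the heap is available for execution and caching. Spark's tuning guide describes this unified region as a fraction (`spark.memory.fraction`, default 0.6) of the heap minus a 300 MiB reservation. The helper below is a back-of-the-envelope estimate under that description, not a call into Spark; the function and constant names are made up.

```python
RESERVED_MIB = 300               # reservation described in Spark's tuning guide
DEFAULT_MEMORY_FRACTION = 0.6    # default value of spark.memory.fraction


def estimate_unified_memory_mib(
    heap_mib: float, fraction: float = DEFAULT_MEMORY_FRACTION
) -> float:
    """Estimate the execution-plus-storage region for a given heap size."""
    return round((heap_mib - RESERVED_MIB) * fraction, 1)


one_gb = estimate_unified_memory_mib(1024)   # 1g heap
two_gb = estimate_unified_memory_mib(2048)   # 2g heap
print(one_gb, two_gb)
```

The estimate for a 1g heap, about 434.4 MiB, matches the 455501414 bytes reported in section 5.2. That suggests the local-mode run there was bounded by the driver's default heap rather than by the 2g `spark.executor.memory` setting, although the actual figure depends on the heap size the JVM reports.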

### 6.2 Serialization

In [23]:
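# Set this before creating the SparkContext; changing conf afterwards does not affect a running context
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")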

<pyspark.conf.SparkConf at 0x78fdf14d3b50>

### 6.3 Hadoop Configuration

In [24]:
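# _jsc is a private handle to the JVM JavaSparkContext and may change between releases
hadoop_conf = sc._jsc.hadoopConfiguration()
# Placeholder values: load real credentials from a secure source instead of hard-coding them
hadoop_conf.set("fs.s3a.access.key", "AKIA...")
hadoop_conf.set("fs.s3a.secret.key", "...")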
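
One way to keep credentials out of the notebook is to read them from environment variables. The helper below is a sketch: `apply_s3a_credentials` is a hypothetical function, and it assumes the conventional `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` variable names. It only requires an object with a `set(key, value)` method, so it can be exercised without a running context.

```python
import os
from typing import Mapping, Optional


def apply_s3a_credentials(hadoop_conf, env: Optional[Mapping[str, str]] = None) -> None:
    """Copy S3A credentials from environment variables into a Hadoop configuration.

    hadoop_conf is any object with a set(key, value) method, such as the one
    returned by sc._jsc.hadoopConfiguration().
    """
    env = os.environ if env is None else env
    try:
        access_key = env["AWS_ACCESS_KEY_ID"]
        secret_key = env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError(
            f"Missing environment variable: {missing.args[0]}"
        ) from None
    hadoop_conf.set("fs.s3a.access.key", access_key)
    hadoop_conf.set("fs.s3a.secret.key", secret_key)


# With a live context, the call would look like:
# apply_s3a_credentials(sc._jsc.hadoopConfiguration())
```

Failing early with a clear message is a deliberate choice here: a missing variable would otherwise surface later as an authentication error deep inside a job.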

## 7. Monitoring and Debugging

### 7.1 UI and Logs
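* Spark Web UI (by default at [http://localhost:4040](http://localhost:4040) while the application is running; `sc.uiWebUrl` returns the actual address)
* DAG visualizations for each job and stage in the Web UI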

### 7.2 Application ID


In [26]:
print("Application ID:", sc.applicationId)

Application ID: local-1746651389420


### 7.3 Stopping Context

In [27]:
sc.stop()

## 8. Summary

* `SparkContext` is the foundation of Spark execution.
* It handles job scheduling, resource management, and communication with the cluster.
* Most PySpark users now interact via `SparkSession`, but `SparkContext` remains critical for low-level operations and advanced tuning.

> Best practice: always access `SparkContext` via `spark.sparkContext` to avoid multiple active contexts.