## Building the Spark Session

- A Spark Session is the entry point of a Spark application; it lets our Python code communicate with the Driver (JVM)
- Python Code -> Spark Session -> Driver
- Creating the Spark Session starts the driver process, sets up the SparkContext, and establishes the bridge to the JVM


In [1]:
# Importing the Spark Session module
from pyspark.sql import SparkSession

# Creating an application using the SparkSession

spark = SparkSession.builder \
        .appName("Spark Basics") \
        .getOrCreate()

# Check that the application was created successfully by printing the Spark version

print('Spark Version : ',spark.version)


Spark Version :  3.5.5


___
## Locally Hosted Spark UI
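- When we create a Spark Session, it hosts a web UI locally, by default at http://localhost:4040/ (if port 4040 is already in use, Spark tries 4041, 4042, and so on; `spark.sparkContext.uiWebUrl` returns the actual address)
- We can check job statuses, DAGs, and query plans here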

## General Deployment methods of Spark Applications

|Mode|Use Case|Your Computer|Data Center / Cloud|
|---|---|---|---|
|Local|Testing|Driver and Executor||
|Client|Development|Driver|Executor|
|Cluster|Production||Driver and Executor|
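
As a rough illustration of the table above, the three modes map onto the `--master` and `--deploy-mode` options of `spark-submit`. The sketch below only builds the command line and does not launch anything. The script name `my_job.py`, the helper name `build_submit_command`, and the choice of YARN as the cluster manager are assumptions made for this example.

```python
# Sketch: map each row of the deployment table to spark-submit options.
# Assumptions: YARN is the cluster manager; "my_job.py" is a hypothetical script.

SUBMIT_OPTIONS = {
    "local": ["--master", "local[*]"],
    "client": ["--master", "yarn", "--deploy-mode", "client"],
    "cluster": ["--master", "yarn", "--deploy-mode", "cluster"],
}


def build_submit_command(mode, script="my_job.py"):
    """Return the spark-submit argument list for a deployment mode."""
    key = mode.lower()
    if key not in SUBMIT_OPTIONS:
        raise ValueError(f"Unknown deployment mode: {mode!r}")
    return ["spark-submit", *SUBMIT_OPTIONS[key], script]


for mode in ("Local", "Client", "Cluster"):
    print(f"{mode:8}", " ".join(build_submit_command(mode)))
```

In client mode the driver stays on the machine that runs `spark-submit`, while in cluster mode the driver is started inside the cluster, which is why the table places it in a different column.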


___
## Understanding Flow of Python Code to Spark Execution

```mermaid

flowchart LR

t1[Python Code]
t2["Py4J (Bridge between Python and JVM)"]
t3["JVM
(-Runs actual computations
-Manages Clusters
-Does parallel processing)"]

t1-->t2-->t3
```

### Drivers and Executors
- The Driver and the Executors are JVM processes
- When running on a cluster, the Driver runs in its own JVM and each Executor runs in a separate JVM, usually on different machines (in local mode they share a single JVM)

### Driver
- The Driver acts as the project manager: it plans the work, splits it into tasks, and hands those tasks to the Executors
- As mentioned above, we create the "Spark Session" to set up this "Driver" and the JVM communication channel
- Each Spark application has exactly one "Driver"



___
## Spark Context

- Consider the Spark Context as the brain of your Spark application
- It gives you information about everything (all the Spark settings that can affect your job):
    1. The execution environment
    2. Where the application is running (locally or on a cluster)
    3. How many cores and how much memory are available
    4. Driver and Executor info

In [2]:
# We have named our session as "spark" above 

# Get SparkContext
sc = spark.sparkContext


In [17]:
# Let's try to answer a few questions

# 1. Where are we running Spark, and how many cores are available to us?
print("Master Information : ", sc.master)

'''
If sc.master returns
1. local --> Spark runs locally with a single worker thread (1 core)
2. local[n] --> Spark runs locally with n worker threads
3. local[*] --> Spark runs locally with as many worker threads as there are logical cores (read the physical vs logical core page for more info - link)
4. yarn / mesos://... / k8s://... / spark://... --> Spark runs on a cluster manager (YARN, Mesos, Kubernetes, or Spark standalone)
'''

# 2. The app name we set when building the session, and the application ID Spark assigned to it
print(f'App Name : {sc.appName} \nApplication Id : {sc.applicationId}')

# 3. How many cores are available for Spark to execute its tasks (note that these are logical cores)
print("Default Parallelism (Cores Used):", sc.defaultParallelism)

# 4. Application start time details
print(f'Application Start Time (ms) : {sc.startTime}' ) 

'''
startTime returns the start time in milliseconds since the Unix epoch (January 1, 1970 UTC),
so we need to convert it to a readable date
'''
import datetime as dt
print("Application Start Time : ", dt.datetime.fromtimestamp(sc.startTime/1000))



Master Information :  local[*]
App Name : Spark Basics 
Application Id : local-1745675883616
Default Parallelism (Cores Used): 16
Application Start Time (ms) : 1745675882092
Application Start Time :  2025-04-26 19:28:02.092000
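
The values printed above can also be interpreted in plain Python. The sketch below is not part of Spark: `describe_master` and `start_time_utc` are hypothetical helper names, and the master patterns it recognizes are limited to the ones discussed in this section. The time conversion is timezone-aware (UTC), so its result does not depend on the machine's local timezone, unlike `datetime.fromtimestamp` without a `tz` argument.

```python
import re
from datetime import datetime, timedelta, timezone


def describe_master(master):
    """Explain a master string such as the one returned by sc.master."""
    if master == "local":
        return "local mode, 1 worker thread"
    if master == "local[*]":
        return "local mode, one worker thread per logical core"
    match = re.fullmatch(r"local\[(\d+)\]", master)
    if match:
        return f"local mode, {int(match.group(1))} worker thread(s)"
    if master == "yarn" or master.startswith(("spark://", "mesos://", "k8s://")):
        return "external cluster manager"
    return "unrecognized master string"


def start_time_utc(start_time_ms):
    """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(milliseconds=start_time_ms)


# Values taken from the output above
print(describe_master("local[*]"))
print(start_time_utc(1745675882092))
```

For the start time shown above this gives 13:58:02.092 UTC on 2025-04-26, which would match the printed local time if the notebook ran in a UTC+05:30 timezone.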
