<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/UNAL_Logosimbolo.svg/583px-UNAL_Logosimbolo.svg.png" alt="" width="1280" height="300" /></p>


# SPARK SESSION

`SparkSession` is the entry point to programming with Spark using the DataFrame and Dataset API. It allows you to read data, execute SQL queries, manage configurations, and interact with Spark clusters.

## AUTOMATIC
In data platforms like **Databricks**, a SparkSession is created automatically and is available through the `spark` variable, so you don’t need to initialize it manually.

## MANUAL

In other tools or environments like **Cloudera**, you need to manually create the SparkSession before using it. Here's a basic example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()
```

## SPARK METHODS

- `spark.version` : returns the current Spark version  
- `spark.sparkContext` : access to the underlying SparkContext  
- `spark.udf` : register and use user-defined functions (UDFs)  
- `spark.conf` : get or set Spark configurations  
- `spark.catalog` : interact with Spark catalog (databases, tables, functions)  
- `spark.read` : read data from CSV, JSON, Parquet, etc.  
- `spark.readStream` : read streaming data sources  
- `spark.range()` : create a DataFrame with a range of numbers  
- `spark.sql()` : run SQL queries using Spark SQL  
- `spark.createDataFrame()` : create a DataFrame from a local object or RDD  
- `spark.stop()` : stop the active Spark session

Note: We will use almost all of them.

### SPARK INFO

In [0]:
spark

### SPARK VERSION

In [0]:
spark.version

### SPARK CONFIG
You can **get** or **set** Spark configuration parameters using `spark.conf`.

**SET**
```python
spark.conf.set("spark.sql.shuffle.partitions", "100")
```

**GET**
```python
spark.conf.get("spark.sql.shuffle.partitions")
```

[SPARK CONFIGURATION](https://spark.apache.org/docs/latest/configuration.html)

In [0]:
print(spark.conf.get("spark.app.name"))                          # name of the Spark application
print(spark.conf.get("spark.master"))                            # cluster manager to connect to (local, yarn, etc.)
print(spark.conf.get("spark.executor.memory"))                   # memory per executor (e.g., 2g)
#print(spark.conf.get("spark.executor.cores"))                    # number of cores per executor
#print(spark.conf.get("spark.driver.memory"))                     # memory available to the driver
print(spark.conf.get("spark.sql.shuffle.partitions"))            # number of partitions used during shuffle operations
#print(spark.conf.get("spark.default.parallelism"))               # default number of tasks for parallel operations
print(spark.conf.get("spark.sql.warehouse.dir"))                 # directory for Spark SQL managed tables
#print(spark.conf.get("spark.serializer"))                        # serialization class used
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))    # threshold for automatic broadcast join (in bytes)
print(spark.conf.get("spark.hadoop.fs.s3a.block.size"))          # block size used when reading from S3 (Hadoop connector)
# Assumes the value is a plain byte count; int() fails on suffixed values such as "32M"
size = int(spark.conf.get("spark.hadoop.fs.s3a.block.size"))
print(f"block size: {size / (1024 * 1024):.2f} MB")
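
Hadoop size settings such as `fs.s3a.block.size` may also be written with a unit suffix (for example `32M`), in which case a bare `int()` conversion raises a `ValueError`. The helper below is a hypothetical sketch, not part of Spark or Hadoop, that accepts both forms. The sample values are illustrative and are not read from a cluster.

```python
import re

# Binary multipliers for the unit suffixes handled by this sketch
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

def to_bytes(value: str) -> int:
    """Convert a size setting such as "33554432" or "32M" to a byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMGTP]?)\s*", value.upper())
    if match is None:
        raise ValueError(f"unrecognized size value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]

# Illustrative inputs; in the notebook the string would come from spark.conf.get(...)
for raw in ["33554432", "32M", "64m"]:
    size = to_bytes(raw)
    print(f"{raw}: {size / (1024 * 1024):.2f} MB")
```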