# PySpark 01 — Spark Basics

This notebook introduces the basics of working with Apache Spark using PySpark. Topics covered:

1. Creating a SparkSession
2. Constructing a DataFrame from Python data
3. Exploring DataFrame structure
4. Viewing data

# 1. Create a Spark Session
The entry point to Spark is the SparkSession. It allows access to all Spark features from PySpark.

In PySpark, the **`SparkSession`** is part of the **Structured API** (which includes DataFrames and SQL-like operations). That API lives in the `pyspark.sql` module — even though you're not necessarily writing SQL.

**Modules**
```
pyspark/
├── sql/
│   ├── session.py          <-- defines the SparkSession class
│   ├── dataframe.py        <-- defines the DataFrame class
│   └── functions.py        <-- column functions: col, lit, when, etc.
└── rdd.py                  <-- lower-level RDD API
```

* `pyspark.sql`: includes all higher-level DataFrame and SQL APIs
* `pyspark.rdd`: older, low-level RDD API


### 1.1 What’s Inside a SparkSession?
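Since Spark 2.0, SparkSession unifies the older entry points:
* SparkContext (the original core interface to the cluster), available as `spark.sparkContext`
* SQLContext (for DataFrames and SQL)
* HiveContext (if working with Hive), enabled with `enableHiveSupport()` on the builder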



In [16]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark Basics") \
    .getOrCreate()

In [17]:
spark.sparkContext

### 1.2 Check the Cluster Info (Number of Nodes/Executors/Cores)

In [18]:
# Get the JVM-side set of block-manager addresses (host:port) from the executor memory status map
java_set = spark.sparkContext._jsc.sc().getExecutorMemoryStatus().keySet()

# Use Java iterator to safely loop through elements
hosts_iter = java_set.iterator()
hosts = []

while hosts_iter.hasNext():
    hosts.append(str(hosts_iter.next()))

cores = spark.sparkContext.defaultParallelism

# Display results
print("Executors:", len(hosts))
print("Executor Hosts:", hosts)
print("Default Parallelism (cores):", cores)

Executors: 1
Executor Hosts: ['idx-pyspark-1746386305122:38439']
Default Parallelism (cores): 2


In [19]:
spark.stop()

**Increasing CPU Parallelism in Spark**

Pros
- **Faster execution** — more tasks run simultaneously
- **Better hardware use** — maximizes CPU utilization
- **Handles larger data** — improved performance on big workloads
- **Closer to real clusters** — mimics distributed behavior in dev

Cons
- **Higher memory use** — more threads = more memory pressure
- **Wasteful for small jobs** — extra overhead with little gain
- **Diminishing returns** — beyond a point, no added benefit
- **Can affect other apps** — may slow down your system

Use `local[*]` to automatically match the number of available cores.


In [20]:
# Use 16 worker threads (local[*] would match the available cores instead)
spark = SparkSession.builder \
    .appName("ParallelTest") \
    .master("local[16]") \
    .getOrCreate()

In [21]:
# Get the JVM-side set of block-manager addresses (host:port) from the executor memory status map
java_set = spark.sparkContext._jsc.sc().getExecutorMemoryStatus().keySet()

# Use Java iterator to safely loop through elements
hosts_iter = java_set.iterator()
hosts = []

while hosts_iter.hasNext():
    hosts.append(str(hosts_iter.next()))

cores = spark.sparkContext.defaultParallelism

# Display results
print("Executors:", len(hosts))
print("Executor Hosts:", hosts)
print("Default Parallelism (cores):", cores)

Executors: 1
Executor Hosts: ['idx-pyspark-1746386305122:37453']
Default Parallelism (cores): 16


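Because Firebase Studio is a single-node environment (like your laptop), Spark runs in local mode here rather than on a real cluster. The driver and the executor share one JVM, and `local[16]` only sets the number of worker threads. That is why the executor count stays at 1 while the default parallelism changes to 16.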

# 2. Constructing a DataFrame from Python data
You can create a DataFrame directly from a list of tuples and specify column names.

In [22]:
# Sample Python data (list of tuples)
data = [
    ("Alice", 25),
    ("Bob", 32),
    ("Cathy", 19)
]

# Define column names
columns = ["name", "age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show the result
df.show()



+-----+---+
| name|age|
+-----+---+
|Alice| 25|
|  Bob| 32|
|Cathy| 19|
+-----+---+

Even for small tasks, Spark runs a whole execution engine under the hood. Getting it going involves:
* Starting the JVM (Java Virtual Machine)
* Initializing SparkContext
* Creating a thread pool
* Allocating memory for executors
* Logging setup and environment scanning

This startup cost can easily take 10–20 seconds.

In [23]:
df2 = spark.createDataFrame([("Dan", 40)], ["name", "age"])
df2.show()

+----+---+
|name|age|
+----+---+
| Dan| 40|
+----+---+

# 3. Exploring DataFrame structure
After creating a DataFrame, you can inspect its schema, column names, data types, and sample rows using these methods:

In [24]:
# Show the schema (column names and data types)
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)



In [25]:
# Get list of column names
df.columns

['name', 'age']

In [26]:
# Get (column name, data type) pairs
df.dtypes

[('name', 'string'), ('age', 'bigint')]

# 4. Viewing data

In [27]:
# Display rows in a tabular format (show() prints up to 20 rows by default)
df.show()

+-----+---+
| name|age|
+-----+---+
|Alice| 25|
|  Bob| 32|
|Cathy| 19|
+-----+---+

In [28]:
# View first 3 rows as Row objects
df.head(3)

[Row(name='Alice', age=25), Row(name='Bob', age=32), Row(name='Cathy', age=19)]

In [29]:
# Stop Spark
spark.stop()