# Module 01: PySpark Setup and SparkSession

**Difficulty**: ⭐

**Estimated Time**: 45-60 minutes

**Prerequisites**: 
- Module 00: Introduction to Big Data and Spark Ecosystem
- Python 3.7+ installed
- Basic command line knowledge

## Learning Objectives

By the end of this notebook, you will be able to:
1. Install and configure PySpark in a local environment
2. Create and configure a SparkSession for local development
3. Understand the relationship between SparkSession and SparkContext
4. Navigate and interpret the Spark UI for monitoring applications
5. Configure Spark settings for optimal local performance

## 1. Installing PySpark

### Installation Methods

There are several ways to install PySpark:

#### Method 1: pip (Recommended for this course)
```bash
pip install pyspark
```

#### Method 2: conda
```bash
conda install -c conda-forge pyspark
```

#### Method 3: Install full Apache Spark distribution
- Download from https://spark.apache.org/downloads.html
- Set SPARK_HOME environment variable
- More complex but gives you access to all Spark tools

### Requirements

PySpark requires:
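- **Python**: 3.7 or higher (newer Spark releases raise this minimum; check the documentation for the version you install)
- **Java**: JDK 8 or 11 (Spark runs on the JVM; newer Spark releases also support Java 17, so check the documentation for your version)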
- **Memory**: At least 4GB RAM (8GB+ recommended)
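
As a quick sanity check of these requirements, the following sketch uses only the Python standard library to report the interpreter version and whether a `java` executable is visible. The function name `check_environment` is made up for this example, and finding `java` on the PATH does not confirm that its version is one your Spark release supports.

```python
import os
import shutil
import sys


def check_environment(min_python=(3, 7)):
    """Return a dict describing whether basic PySpark prerequisites look satisfied."""
    java_path = shutil.which("java")  # None when no java executable is on PATH
    return {
        "python_version": sys.version_info[:3],
        "python_ok": sys.version_info[:2] >= min_python,
        "java_on_path": java_path,
        "java_home": os.environ.get("JAVA_HOME"),  # None when the variable is unset
    }


report = check_environment()
for key, value in report.items():
    print(f"{key:15s} = {value}")
```

If `java_on_path` is `None`, see the "Java not found" entry under Common Issues and Solutions later in this notebook.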

### Verify Installation

Run the cell below to verify PySpark is installed correctly:

In [None]:
# Verify PySpark installation
import pyspark

print(f"PySpark version: {pyspark.__version__}")
print(f"Installation path: {pyspark.__file__}")

# Check if we can import key modules
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

print("\n✓ PySpark is installed correctly!")

## 2. Creating Your First SparkSession

### What is SparkSession?

**SparkSession** is the entry point for Spark functionality:
- Unified entry point (since Spark 2.0)
- Replaces older SQLContext and HiveContext
- Provides access to DataFrames, SQL, and Spark configuration

### SparkSession vs SparkContext

```
SparkSession (High-level API)
    ↓ contains
SparkContext (Low-level API)
    ↓ manages
Cluster Connection and RDDs
```

- **SparkSession**: Modern, high-level API (use this in new code)
- **SparkContext**: Low-level API (needed for RDD operations)

### Creating a SparkSession

In [None]:
from pyspark.sql import SparkSession

# Create SparkSession
# Builder pattern allows us to configure before creating
spark = SparkSession.builder \
    .appName("Module 01: PySpark Setup") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

print(f"SparkSession created: {spark}")
print(f"App Name: {spark.sparkContext.appName}")
print(f"Master: {spark.sparkContext.master}")
print(f"Spark Version: {spark.version}")

### Understanding the Configuration

Let's break down the SparkSession creation:

```python
.appName("Module 01: PySpark Setup")
```
- Sets a human-readable name for your application
- Visible in Spark UI for monitoring
- Helps identify your job in logs

```python
.master("local[*]")
```
- **"local"**: Run Spark locally with 1 thread
- **"local[4]"**: Run locally with 4 threads
- **"local[*]"**: Use all available CPU cores (recommended for local dev)
- **"spark://host:port"**: Connect to Spark cluster (production)
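
To make the master URL forms above concrete, here is a small sketch that maps each local form to the number of threads it implies. It is plain Python, not part of the Spark API: `local_thread_count` is a hypothetical helper, it handles only the forms listed above, and `spark://host:7077` is a placeholder address. Spark itself asks the JVM for the core count, which can differ from `os.cpu_count()` in containers.

```python
import os
import re


def local_thread_count(master):
    """Return the thread count implied by a local master URL, or None for non-local masters."""
    match = re.fullmatch(r"local(?:\[(\*|\d+)\])?", master)
    if match is None:
        return None  # e.g. "spark://host:port": parallelism is decided by the cluster
    spec = match.group(1)
    if spec is None:
        return 1  # plain "local" means a single thread
    if spec == "*":
        return os.cpu_count() or 1  # cpu_count() may return None on unusual platforms
    return int(spec)


for url in ["local", "local[4]", "local[*]", "spark://host:7077"]:
    print(f"{url:20s} -> {local_thread_count(url)}")
```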

```python
.config("spark.driver.memory", "2g")
```
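- Allocates 2GB of memory to the driver program
- Increase if working with large datasets locally
- Typically takes effect only if set before the driver JVM starts; changing it on a later `getOrCreate()` in the same Python process updates the reported setting but not the actual heap size, so restart the kernel to change it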

```python
.getOrCreate()
```
- Returns the active SparkSession if one exists; otherwise creates a new one
- Avoids accidentally starting a second SparkContext in the same process, which Spark does not allow
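
The reuse behavior of `getOrCreate()` can be observed directly. The sketch below wraps the check in a function (the name `demo_get_or_create` is made up here) and imports PySpark lazily, so defining it has no side effects. If `getOrCreate()` behaves as described, both names should refer to the same session, and any session already active in the notebook will be the one returned.

```python
def demo_get_or_create():
    """Request a session twice and report whether the same object comes back."""
    from pyspark.sql import SparkSession  # lazy import; requires PySpark and Java

    first = SparkSession.builder.master("local[2]").appName("first").getOrCreate()
    second = SparkSession.builder.appName("second").getOrCreate()

    print(f"Same session object: {first is second}")
    print(f"App name seen through the second handle: {second.sparkContext.appName}")
    return first, second


# In a notebook cell, run: demo_get_or_create()
```

The function deliberately does not call `stop()`, so it will not shut down a session you are still using.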

## 3. Accessing SparkContext

SparkContext is available through SparkSession for low-level operations:

In [None]:
# Access SparkContext from SparkSession
sc = spark.sparkContext

print("SparkContext Information:")
print(f"  Application ID: {sc.applicationId}")
print(f"  Application Name: {sc.appName}")
print(f"  Master URL: {sc.master}")
print(f"  Default Parallelism: {sc.defaultParallelism}")
print(f"  Python Version: {sc.pythonVer}")

# In local mode, default parallelism equals the thread count from the master URL (all cores for local[*])
print(f"\nSpark is using {sc.defaultParallelism} CPU cores for parallel processing")

## 4. Exploring the Spark UI

### Accessing the UI

The Spark UI is a web interface for monitoring your Spark applications:

1. **Default URL**: http://localhost:4040
2. If port 4040 is busy, Spark will try 4041, 4042, etc.
3. The UI is only available while your SparkSession is active
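
Because the UI port can shift when 4040 is taken, it can help to see which of the candidate ports currently has something listening. The following is a minimal sketch using only the standard library; `listening_ports` is a hypothetical helper, and a listener on one of these ports is not necessarily a Spark UI.

```python
import socket


def listening_ports(host="127.0.0.1", start=4040, count=5, timeout=0.2):
    """Return the ports in [start, start + count) that accept a TCP connection."""
    found = []
    for port in range(start, start + count):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found


open_ports = listening_ports()
print(f"Ports with a listener in 4040-4044: {open_ports}")
```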

Run the cell below to get the UI URL:

In [None]:
# Get Spark UI URL
ui_url = spark.sparkContext.uiWebUrl
print(f"Spark UI is available at: {ui_url}")
print("\nOpen this URL in your browser to monitor your Spark application")
print("\nIMPORTANT: The UI will only work while this notebook is running!")

### Spark UI Tabs

The Spark UI has several important tabs:

1. **Jobs**: Shows all Spark jobs (actions) executed
   - View completed and running jobs
   - See execution time and stages

2. **Stages**: Breakdown of each job into stages
   - Understand shuffle operations
   - Identify bottlenecks

3. **Storage**: Shows cached/persisted RDDs and DataFrames
   - Memory usage
   - Partition information

4. **Environment**: Configuration and system properties
   - Spark settings
   - JVM properties
   - Classpath

5. **Executors**: Information about executors (workers)
   - Memory usage
   - CPU cores
   - Task metrics

6. **SQL**: SQL query execution plans
   - Physical and logical plans
   - Query duration

### Let's Generate Some Activity for the UI

In [None]:
# Create a simple DataFrame to generate activity in Spark UI
data = [("Alice", 34), ("Bob", 45), ("Charlie", 28), ("Diana", 31)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Trigger an action (this will create a job in the UI)
print("Sample DataFrame:")
df.show()

# Count operation (another action)
count = df.count()
print(f"\nTotal rows: {count}")

print("\n👉 Check the Spark UI now! You should see at least 2 jobs (from show and count)")

## 5. Spark Configuration

### Viewing Current Configuration

In [None]:
# View all Spark configurations
conf = spark.sparkContext.getConf()

print("Current Spark Configuration:")
print("=" * 50)
for key, value in sorted(conf.getAll()):
    print(f"{key:40s} = {value}")

### Important Configuration Options

#### Memory Configuration

```python
# Driver memory (machine running your notebook)
.config("spark.driver.memory", "4g")

# Executor memory (workers in cluster)
.config("spark.executor.memory", "4g")
```

#### CPU Configuration

```python
# Cores per executor
.config("spark.executor.cores", "4")

# Number of executors
.config("spark.executor.instances", "2")
```

#### Shuffle Configuration

```python
# Shuffle partitions (default: 200)
.config("spark.sql.shuffle.partitions", "50")
```

#### UI Configuration

```python
# Change UI port
.config("spark.ui.port", "4050")

# Disable UI (for production)
.config("spark.ui.enabled", "false")
```
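
Options like these can also be collected in a dictionary and applied in a loop, which keeps a project's settings in one place. This is only a sketch: `apply_settings`, `build_session`, and `LOCAL_SETTINGS` are names invented for this example, and the values are copied from the snippets above rather than tuned recommendations.

```python
def apply_settings(builder, settings):
    """Apply each key/value pair to a builder-like object via its .config() method."""
    for key, value in settings.items():
        builder = builder.config(key, str(value))  # pass every value as a string
    return builder


# Example values taken from the snippets above; adjust them for your machine.
LOCAL_SETTINGS = {
    "spark.driver.memory": "4g",
    "spark.sql.shuffle.partitions": 50,
    "spark.ui.port": 4050,
}


def build_session(app_name, settings):
    """Create a local SparkSession from a settings dict (requires PySpark and Java)."""
    from pyspark.sql import SparkSession  # lazy import keeps the helpers importable

    builder = SparkSession.builder.appName(app_name).master("local[*]")
    return apply_settings(builder, settings).getOrCreate()


# In a notebook cell: spark = build_session("Dict-configured app", LOCAL_SETTINGS)
```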

### Creating a Configured SparkSession

In [None]:
# Stop the existing session first so the new settings apply to a fresh SparkContext
spark.stop()

# Create new session with custom configuration
spark = SparkSession.builder \
    .appName("Configured Spark Session") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .config("spark.sql.shuffle.partitions", "4") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

print("✓ Reconfigured SparkSession created")
print(f"  App Name: {spark.sparkContext.appName}")
print(f"  Shuffle Partitions: {spark.conf.get('spark.sql.shuffle.partitions')}")
print(f"  Adaptive Query Execution: {spark.conf.get('spark.sql.adaptive.enabled')}")

## 6. Best Practices for Local Development

### Recommended Configuration for Local Laptop

```python
spark = SparkSession.builder \
    .appName("My Local App") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .config("spark.sql.shuffle.partitions", "4") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()
```

**Explanation**:
- `local[*]`: Use all CPU cores
- `driver.memory`: 2GB for small datasets (increase if needed)
- `shuffle.partitions`: 4 instead of default 200 (faster for small data)
- `adaptive.enabled`: Let Spark optimize query execution
- `adaptive.coalescePartitions.enabled`: Let adaptive execution merge small shuffle partitions

### Common Issues and Solutions

#### Issue 1: "Java not found"
```bash
# Install Java 8 or 11
# Ubuntu/Debian:
sudo apt-get install openjdk-11-jdk

# macOS (with Homebrew):
brew install openjdk@11
```

#### Issue 2: "Port 4040 already in use"
- Spark will automatically try 4041, 4042, etc.
- Or specify a port: `.config("spark.ui.port", "4050")`

#### Issue 3: "Out of memory"
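- Increase driver memory: `.config("spark.driver.memory", "4g")` (restart the kernel first so the setting applies to a new JVM)
- Increase shuffle partitions so each partition holds less data, for example: `.config("spark.sql.shuffle.partitions", "16")`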
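
One rough way to choose a shuffle partition count is to aim for a target amount of data per partition and round up to a multiple of the core count. The sketch below is a rule of thumb, not Spark's own logic: `suggest_shuffle_partitions` is a hypothetical helper, and the 128 MB target is an assumed default you may want to change.

```python
import math


def suggest_shuffle_partitions(data_size_mb, cores, target_partition_mb=128):
    """Suggest a shuffle partition count, rounded up to a multiple of cores."""
    by_size = math.ceil(data_size_mb / target_partition_mb)
    partitions = max(cores, by_size)  # never fewer partitions than cores
    return math.ceil(partitions / cores) * cores  # whole waves of tasks across cores


print(suggest_shuffle_partitions(data_size_mb=10, cores=4))        # 4
print(suggest_shuffle_partitions(data_size_mb=5 * 1024, cores=8))  # 40
```

Small datasets end up with one partition per core, while larger ones get more partitions so that each stays near the target size.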

### Helper Function for Easy Setup

In [None]:
def create_spark_session(app_name="PySpark App", memory="2g", shuffle_partitions=4):
    """
    Create a SparkSession with sensible defaults for local development.
    
    Parameters:
    -----------
    app_name : str
        Name of your Spark application
    memory : str
        Driver memory (e.g., "2g", "4g")
    shuffle_partitions : int
        Number of partitions for shuffle operations
    
    Returns:
    --------
    SparkSession
    """
    spark = SparkSession.builder \
        .appName(app_name) \
        .master("local[*]") \
        .config("spark.driver.memory", memory) \
        .config("spark.sql.shuffle.partitions", str(shuffle_partitions)) \
        .config("spark.sql.adaptive.enabled", "true") \
        .getOrCreate()
    
    print(f"✓ SparkSession '{app_name}' created")
    print(f"  Spark UI: {spark.sparkContext.uiWebUrl}")
    print(f"  Using {spark.sparkContext.defaultParallelism} CPU cores")
    
    return spark

# Test the helper function
spark.stop()  # Stop current session
spark = create_spark_session("Helper Function Test")

## Exercises

### Exercise 1: Create a Custom SparkSession

Create a SparkSession with the following requirements:
- App name: "My First Spark App"
- Driver memory: 3GB
- Shuffle partitions: 8
- Enable adaptive query execution

In [None]:
# Exercise 1: Your code here
spark.stop()  # Stop current session first

# Create your custom SparkSession
my_spark = None  # Replace with your code

# Verify your configuration
# Uncomment these lines after creating your SparkSession:
# print(f"App Name: {my_spark.sparkContext.appName}")
# print(f"Driver Memory: {my_spark.conf.get('spark.driver.memory')}")
# print(f"Shuffle Partitions: {my_spark.conf.get('spark.sql.shuffle.partitions')}")

### Exercise 2: Explore SparkContext Properties

Using the SparkContext, find and print:
1. The application ID
2. The master URL
3. The default parallelism (number of cores)
4. The Python version being used

In [None]:
# Exercise 2: Your code here
# Access SparkContext from your SparkSession
# sc = my_spark.sparkContext

# Print the required properties
# Your code here

### Exercise 3: Monitoring in Spark UI

1. Create a DataFrame with at least 100 rows of sample data
2. Perform 3 different actions (e.g., count, show, collect)
3. Open the Spark UI and identify the 3 jobs
4. Answer the questions below

In [None]:
# Exercise 3: Your code here

# 1. Create DataFrame with 100+ rows
# Hint: Use range() or list comprehension

# 2. Perform 3 actions

# 3. Check Spark UI at the URL printed below
print(f"Spark UI: {my_spark.sparkContext.uiWebUrl}")

**Questions** (Answer after checking Spark UI):

1. How many jobs were created? ___
2. Which action took the longest time? ___
3. How many stages did each job have? ___
4. What was the total number of tasks executed? ___

### Exercise 4: Configuration Optimization

You're working on a laptop with:
- 8 CPU cores
- 16GB RAM
- A dataset that's 5GB in size

Create an optimized SparkSession configuration for this scenario. Explain your choices.

In [None]:
# Exercise 4: Your optimized configuration
my_spark.stop()

optimized_spark = None  # Create your optimized SparkSession here

# Explain your configuration choices:
"""
Driver memory: ___ GB because:

Shuffle partitions: ___ because:

Other configurations:
"""

## Solutions

### Exercise 1 Solution

In [None]:
# Solution 1
spark.stop()

my_spark = SparkSession.builder \
    .appName("My First Spark App") \
    .master("local[*]") \
    .config("spark.driver.memory", "3g") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

print(f"App Name: {my_spark.sparkContext.appName}")
print(f"Driver Memory: {my_spark.conf.get('spark.driver.memory')}")
print(f"Shuffle Partitions: {my_spark.conf.get('spark.sql.shuffle.partitions')}")
print(f"Adaptive Query: {my_spark.conf.get('spark.sql.adaptive.enabled')}")

### Exercise 2 Solution

In [None]:
# Solution 2
sc = my_spark.sparkContext

print(f"1. Application ID: {sc.applicationId}")
print(f"2. Master URL: {sc.master}")
print(f"3. Default Parallelism: {sc.defaultParallelism}")
print(f"4. Python Version: {sc.pythonVer}")

### Exercise 3 Solution

In [None]:
# Solution 3

# Create DataFrame with 100 rows
data = [(i, f"Name_{i}", i * 10) for i in range(100)]
df = my_spark.createDataFrame(data, ["id", "name", "value"])

# Action 1: count
print(f"Count: {df.count()}")

# Action 2: show
df.show(5)

# Action 3: collect (be careful with large datasets!)
first_10 = df.limit(10).collect()
print(f"Collected {len(first_10)} rows")

print(f"\nCheck Spark UI: {my_spark.sparkContext.uiWebUrl}")

### Exercise 4 Solution

In [None]:
# Solution 4
my_spark.stop()

optimized_spark = SparkSession.builder \
    .appName("Optimized Local Spark") \
    .master("local[*]") \
    .config("spark.driver.memory", "8g") \
    .config("spark.sql.shuffle.partitions", "40") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

print("Optimized Configuration:")
print("  Driver Memory: 8g")
print("  Reason: Allocate ~50% of RAM (16GB / 2) for Spark, leave room for OS")
print("")
print("  Shuffle Partitions: 40")
print("  Reason: Roughly 128MB per partition for 5GB of data, and a multiple of the 8 cores")
print("")
print("  Adaptive Query Execution: true")
print("  Reason: Let Spark optimize query plans and coalesce partitions automatically")

## Summary

### Key Concepts Covered

✅ **Installing PySpark**: Using pip for easy installation

✅ **SparkSession**: Entry point for Spark functionality (modern API)

✅ **SparkContext**: Low-level API accessible through SparkSession

✅ **Spark UI**: Web interface for monitoring jobs, stages, and executors

✅ **Configuration**: Memory, CPU, and shuffle partition settings

### Key Takeaways

1. **Always use SparkSession** in new code (not SparkContext directly)
2. **local[*]** is perfect for development (uses all CPU cores)
3. **Spark UI** is essential for debugging and optimization
4. **getOrCreate()** reuses the active SparkSession instead of trying to start a second one
5. **Configure wisely**: Balance memory allocation with system resources

### Essential Code Pattern

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

# Your Spark code here

spark.stop()  # Clean shutdown
```
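
A variation on this pattern guarantees the shutdown even when the Spark code raises an exception. The sketch below uses `contextlib` from the standard library; `managed_session` and `create_local_session` are names invented for this example, and PySpark is imported lazily so the helper can be defined without starting Spark.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(create_session):
    """Yield a session from create_session() and always stop it afterwards."""
    session = create_session()
    try:
        yield session
    finally:
        session.stop()  # runs even if the body of the with-block raises


def create_local_session():
    """Build the session from the pattern above (requires PySpark and Java)."""
    from pyspark.sql import SparkSession  # lazy import

    return (
        SparkSession.builder
        .appName("MyApp")
        .master("local[*]")
        .config("spark.driver.memory", "2g")
        .getOrCreate()
    )


# In a notebook cell:
# with managed_session(create_local_session) as spark:
#     spark.range(5).show()
```

Passing the factory function as an argument keeps the cleanup logic separate from how the session is configured.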

### What's Next?

In **Module 02: RDD Basics**, you will:
- Understand RDDs (Resilient Distributed Datasets)
- Learn about transformations vs actions
- Explore lazy evaluation and DAG visualization
- Work with key-value pair RDDs

### Additional Resources

- [SparkSession Documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html)
- [Spark Configuration Guide](https://spark.apache.org/docs/latest/configuration.html)
- [Spark UI Guide](https://spark.apache.org/docs/latest/web-ui.html)

In [None]:
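# Clean up - stop whichever SparkSession is still active (getActiveSession needs Spark 3.0+)
active = SparkSession.getActiveSession()
if active is not None:
    active.stop()
print("SparkSession stopped. ✓")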