# Comprehensive PySpark Guide: Installation and Cluster Configuration

## Part 1: Installing PySpark in Ubuntu

### Prerequisites Installation

First, install the prerequisites. The commands below use `python3` and `pip3`; if your system maps `python` to Python 3, use `python` and `pip` instead:

1. **Update your system packages**:
   ```bash
   sudo apt update
   sudo apt upgrade
   ```

2. **Install Java** (Spark 3.5 supports Java 8, 11, and 17; on Ubuntu releases where `default-jdk` is newer than 17, install `openjdk-17-jdk` instead):
   ```bash
   sudo apt install default-jdk
   ```

3. **Verify Java installation**:
   ```bash
   java -version
   ```

4. **Install Python and pip** (if not already installed):
   ```bash
   sudo apt install python3 python3-pip
   ```
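The Java check in step 3 can also be scripted. The following is a minimal sketch, not part of Spark: `parse_java_major`, `installed_java_major`, and `SUPPORTED_MAJORS` are hypothetical names, and the supported set is an assumption for Spark 3.5.x that should be checked against the documentation of the release you install.

```python
import re
import subprocess

# Assumption: Java versions supported by Spark 3.5.x; adjust for your release.
SUPPORTED_MAJORS = (8, 11, 17)


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a version string in the output")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x, for example "1.8.0_392".
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major() -> int:
    """Run `java -version` and return the major version of the default JVM."""
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_major(result.stderr)


sample = 'openjdk version "17.0.9" 2023-10-17'
major = parse_java_major(sample)
print(major, major in SUPPORTED_MAJORS)  # 17 True
```

Calling `installed_java_major()` on a machine with Java on the `PATH` returns the major version of whichever JVM `update-alternatives` currently selects.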

### Installing PySpark

There are two main approaches to install PySpark:

#### Method 1: Using pip (Recommended for beginners)

1. **Install PySpark package**:
   ```bash
   pip3 install pyspark
   ```

2. **Verify installation**:
   ```bash
   python3 -c "import pyspark; print(pyspark.__version__)"
   ```

#### Method 2: Manual installation (More control)

1. **Download Apache Spark**:
   ```bash
   wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
   ```

2. **Extract the archive**:
   ```bash
   tar -xzf spark-3.5.5-bin-hadoop3.tgz
   ```

3. **Move to /opt directory** (recommended location):
   ```bash
   sudo mv spark-3.5.5-bin-hadoop3 /opt/spark
   ```

4. **Install the matching PySpark package with pip** (the pip package version should match the Spark release you downloaded):
   ```bash
   pip3 install pyspark==3.5.5
   ```

### Setting Up Environment Variables

If you installed Spark manually (Method 2), add these lines to your `~/.bashrc` or `~/.profile` file:

```bash
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=python3
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
```

After adding, apply the changes:
```bash
source ~/.bashrc  # or source ~/.profile
```
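After sourcing the file, a quick sanity check can catch a wrong path early. The sketch below is an illustration only: `check_spark_env` is a hypothetical helper, and it assumes the manual layout from Method 2, where `spark-submit` lives under `$SPARK_HOME/bin`.

```python
import os
from pathlib import Path


def check_spark_env(env: dict) -> list:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    else:
        if not Path(spark_home, "bin", "spark-submit").exists():
            problems.append(f"no bin/spark-submit under {spark_home}")
        # Exact string comparison: a PATH entry with a trailing slash would not match.
        path_entries = env.get("PATH", "").split(os.pathsep)
        if str(Path(spark_home, "bin")) not in path_entries:
            problems.append("$SPARK_HOME/bin is not on PATH")
    if not env.get("PYSPARK_PYTHON"):
        problems.append("PYSPARK_PYTHON is not set")
    return problems


# Check the environment of the current process and report anything missing.
for problem in check_spark_env(dict(os.environ)):
    print("warning:", problem)
```

Run it from a new terminal so that it sees the variables exactly as your shell exports them.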


## Part 2: Configuring a Spark Cluster

### Types of Spark Cluster Deployment

You can configure Spark in several modes:

1. **Standalone Mode** - Spark's built-in cluster manager
2. **YARN Mode** - Using Hadoop's resource manager
3. **Mesos Mode** - Using Apache Mesos (deprecated since Spark 3.2)
4. **Kubernetes Mode** - Deploying on Kubernetes

We'll focus primarily on Standalone mode as it's the simplest to set up.

### Prerequisites for Cluster Setup

Before configuring your cluster, ensure you have:

- A Java version supported by your Spark release (8, 11, or 17 for Spark 3.5) installed on all machines
- SSH access between all nodes
- Same version of Spark installed in the same directory path on all machines
- Consistent user accounts across machines
- Open ports for Spark communication (default: 7077 for master, 8080 for web UI)
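The port requirement in the list above can be checked from any node once the master is running. This is a small sketch using only the standard library; `port_is_open` and `check_ports` are hypothetical helper names, and `master-hostname` is the placeholder used throughout this guide.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


def check_ports(host: str, ports=(7077, 8080)) -> dict:
    """Map each port to whether it accepted a connection."""
    return {port: port_is_open(host, port) for port in ports}
```

For example, `check_ports("master-hostname")` returns a dictionary such as `{7077: True, 8080: False}`; a `False` entry points to a firewall rule or to a service that is not listening yet.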

### Step 1: Install Spark on All Machines

Follow these steps on each machine in your cluster:

```bash
# Download and install Spark (same version on all machines)
wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
tar -xzf spark-3.5.5-bin-hadoop3.tgz
sudo mv spark-3.5.5-bin-hadoop3 /opt/spark

# Set environment variables on all machines
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
source ~/.bashrc
```

### Step 2: Configure SSH Key-based Authentication

For the master to communicate with workers without password prompts:

```bash
# On master node
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Copy the public key to each worker node
ssh-copy-id username@worker1
ssh-copy-id username@worker2
# Repeat for all worker nodes
```

### Step 3: Configure Spark Environment

#### On Master Node

1. Create the configuration files by copying the templates:

```bash
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
```

2. Edit the `spark-env.sh` file:

```bash
# Add these lines to spark-env.sh
export SPARK_MASTER_HOST=master-hostname  # Use the actual hostname or IP
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=4  # Set to number of cores you want to allocate
export SPARK_WORKER_MEMORY=8g  # Set to amount of memory you want to allocate
export SPARK_WORKER_INSTANCES=1
export SPARK_PUBLIC_DNS=master-hostname  # Use your actual public DNS if available
```

3. Create a workers file:

```bash
cp workers.template workers
```

4. Edit the `workers` file to include all worker nodes:

```
worker1
worker2
worker3
# Add more worker hostnames or IPs as needed
```
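To see which hosts the launch scripts will contact, the `workers` file can be read the same way the scripts treat it: comments and blank lines are ignored. The sketch below is a simplified illustration; `parse_workers` is a hypothetical helper, not a Spark API.

```python
def parse_workers(text: str) -> list:
    """Return the host names listed in the text of a Spark `workers` file."""
    hosts = []
    for line in text.splitlines():
        # Drop anything after a '#' and strip surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if entry:
            hosts.append(entry)
    return hosts


sample = """worker1
worker2
worker3
# Add more worker hostnames or IPs as needed
"""
print(parse_workers(sample))  # ['worker1', 'worker2', 'worker3']
```

Comparing `len(hosts)` with `len(set(hosts))` is a cheap way to spot a worker that was listed twice by mistake.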

#### On Worker Nodes

1. Create `spark-env.sh` from the template, as on the master:

```bash
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
```

2. Edit the `spark-env.sh` file with worker-specific settings:

```bash
# Add these lines to spark-env.sh
export SPARK_MASTER_HOST=master-hostname  # Use the actual master hostname or IP
export SPARK_WORKER_CORES=4  # Adjust based on server capacity
export SPARK_WORKER_MEMORY=8g  # Adjust based on server capacity
export SPARK_WORKER_INSTANCES=1
export SPARK_PUBLIC_DNS=worker-hostname  # Use this worker's hostname
```

### Step 4: Configure Additional Spark Settings

Edit `spark-defaults.conf` on all machines for cluster-wide settings:

```
spark.master                     spark://master-hostname:7077
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              4g
spark.executor.memory            4g
spark.executor.cores             2
spark.default.parallelism        8
spark.sql.shuffle.partitions     200
spark.local.dir                  /tmp/spark-temp
```
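The same keys can also be set per application instead of cluster-wide. The following sketch parses text in the whitespace-separated form shown above and applies it through `SparkSession.builder.config`. `parse_spark_defaults` and `build_session` are hypothetical helpers, and the parser is deliberately simplified: it handles only `key value` lines, not every syntax Spark accepts in this file.

```python
def parse_spark_defaults(text: str) -> dict:
    """Parse `key value` lines; blank lines and lines starting with '#' are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def build_session(app_name: str, settings: dict):
    """Create a SparkSession with the given settings (requires pyspark)."""
    # Imported here so the parser above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


sample = """
# Per-application overrides
spark.master                     spark://master-hostname:7077
spark.executor.memory            4g
spark.executor.cores             2
"""
print(parse_spark_defaults(sample))
```

Values set on the builder apply to that application only, which is convenient for experimenting before changing `spark-defaults.conf` on every machine.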

### Step 5: Start the Cluster

#### Start the Master Node

On the master machine:

```bash
$SPARK_HOME/sbin/start-master.sh
```

Check the master web UI at `http://master-hostname:8080` to confirm it's running.

#### Start the Worker Nodes

You can start workers in two ways:

1. **From the master node** (if you've configured SSH properly):

```bash
$SPARK_HOME/sbin/start-workers.sh
```

2. **On each worker individually**:

```bash
$SPARK_HOME/sbin/start-worker.sh spark://master-hostname:7077
```

### Step 6: Submitting Applications to the Cluster

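You can now submit applications to your cluster using `spark-submit`. Standalone clusters run Python applications in client mode only, and the core budget is set with `--total-executor-cores` (`--num-executors` is a YARN/Kubernetes option):

```bash
$SPARK_HOME/bin/spark-submit \
  --master spark://master-hostname:7077 \
  --deploy-mode client \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 1 \
  --total-executor-cores 3 \
  your_application.py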
```
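When submissions are launched from a scheduler or another script, it can help to build the command as a list rather than a shell string. The sketch below assumes a Python application on a standalone master, which runs in client mode and limits cores with `--total-executor-cores`. `build_submit_command` is a hypothetical helper, and the default values are illustrative.

```python
import shlex


def build_submit_command(
    app: str,
    master: str,
    executor_memory: str = "2g",
    executor_cores: int = 1,
    total_executor_cores: int = 3,
) -> list:
    """Build the argument list for a client-mode spark-submit on a standalone master."""
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", "client",
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--total-executor-cores", str(total_executor_cores),
        app,
    ]


cmd = build_submit_command("your_application.py", "spark://master-hostname:7077")
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` launches the job; a list avoids quoting problems when the application path contains spaces.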

### Step 7: Monitoring and Managing Your Cluster

1. **Web UI**: Access the master web UI at `http://master-hostname:8080`
2. **Spark History Server**: To track completed applications:

```bash
# First, configure the event log directory
echo 'spark.eventLog.enabled true' >> $SPARK_HOME/conf/spark-defaults.conf
echo 'spark.eventLog.dir file:/tmp/spark-events' >> $SPARK_HOME/conf/spark-defaults.conf
echo 'spark.history.fs.logDirectory file:/tmp/spark-events' >> $SPARK_HOME/conf/spark-defaults.conf

# Create the event log directory
mkdir -p /tmp/spark-events

# Start the history server
$SPARK_HOME/sbin/start-history-server.sh
```

Access the history server at `http://master-hostname:18080`
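The history server also exposes a REST API under `/api/v1`, which is handy for scripted checks. The sketch below lists applications whose attempts have all completed. `fetch_applications` and `completed_app_names` are hypothetical helpers, the sample data is made up, and only the `id`, `name`, and `attempts[].completed` fields of the response are relied on.

```python
import json
from urllib.request import urlopen


def fetch_applications(base_url: str) -> list:
    """Fetch the application list from a history server, e.g. http://master-hostname:18080."""
    with urlopen(f"{base_url}/api/v1/applications", timeout=10) as response:
        return json.load(response)


def completed_app_names(applications: list) -> list:
    """Return (id, name) pairs for applications whose every attempt has completed."""
    result = []
    for app in applications:
        attempts = app.get("attempts", [])
        if attempts and all(attempt.get("completed") for attempt in attempts):
            result.append((app.get("id"), app.get("name")))
    return result


# Hand-written sample in the shape of the API response; the values are invented.
sample = [
    {"id": "app-0001", "name": "CSV Processing Example", "attempts": [{"completed": True}]},
    {"id": "app-0002", "name": "Still running", "attempts": [{"completed": False}]},
]
print(completed_app_names(sample))
```

Against a live server, `completed_app_names(fetch_applications("http://master-hostname:18080"))` produces the same kind of list from real data.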

### Step 8: Advanced Configuration Options

#### High Availability Setup

For production environments, configure Spark with ZooKeeper for high availability:

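1. Install ZooKeeper, ideally on a separate set of machines, and start a master on more than one node
2. Set these properties for every master, either in `spark-defaults.conf` or as `-D` options in `SPARK_DAEMON_JAVA_OPTS` in `spark-env.sh` (the form used in the Spark documentation):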

```
spark.deploy.recoveryMode ZOOKEEPER
spark.deploy.zookeeper.url zk1:2181,zk2:2181,zk3:2181
spark.deploy.zookeeper.dir /spark
```
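With several masters behind ZooKeeper, applications and workers are pointed at all of them by listing every master in one comma-separated `spark://` URL. A minimal sketch, where `ha_master_url` is a hypothetical helper and `master1`/`master2` are placeholder host names:

```python
def ha_master_url(masters: list, port: int = 7077) -> str:
    """Build a master URL that lists every master, for use with --master or spark.master."""
    if not masters:
        raise ValueError("at least one master host is required")
    return "spark://" + ",".join(f"{host}:{port}" for host in masters)


print(ha_master_url(["master1", "master2"]))  # spark://master1:7077,master2:7077
```

The client tries the listed masters and registers with whichever one is currently the leader.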

#### Resource Management

Tune these parameters based on your workload:

```
# Memory management
spark.memory.fraction 0.6
spark.memory.storageFraction 0.5

# Execution
spark.speculation true
spark.speculation.multiplier 3
```
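To see what the two memory settings mean in practice: Spark's tuning guide describes `spark.memory.fraction` as a share of the JVM heap minus a 300 MiB reserve, and `spark.memory.storageFraction` as the part of that region in which cached blocks are protected from eviction. The sketch below does the arithmetic; `unified_memory_mib` is a hypothetical helper, and the figures are approximate because the heap the JVM actually reports is slightly smaller than the configured value.

```python
RESERVED_MIB = 300  # reserve Spark subtracts before applying spark.memory.fraction


def unified_memory_mib(heap_mib: float, memory_fraction: float = 0.6,
                       storage_fraction: float = 0.5) -> dict:
    """Estimate the unified memory region and its eviction-protected storage part."""
    unified = (heap_mib - RESERVED_MIB) * memory_fraction
    storage_protected = unified * storage_fraction
    return {"unified_mib": unified, "storage_protected_mib": storage_protected}


# A 4g executor heap with the values from the configuration above.
estimate = unified_memory_mib(4096)
print(f"unified: {estimate['unified_mib']:.1f} MiB, "
      f"storage: {estimate['storage_protected_mib']:.1f} MiB")
# unified: 2277.6 MiB, storage: 1138.8 MiB
```

The remainder of the heap, roughly 40% with the default fraction, is left for user data structures and Spark's internal metadata.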

#### Network Configuration

To tolerate slow networks and large RPC messages, raise the network timeout and the maximum RPC message size (given in MiB):

```
spark.network.timeout 120s
spark.rpc.message.maxSize 256
```

### Step 9: Stopping Your Cluster

When you're done, stop the cluster:

```bash
# From the master node (if SSH is configured)
$SPARK_HOME/sbin/stop-all.sh

# Or stop individually
$SPARK_HOME/sbin/stop-master.sh
$SPARK_HOME/sbin/stop-workers.sh
```


## Part 3: Basic Usage of PySpark

### Starting the PySpark Shell & Creating a simple script

1. **Launch the interactive PySpark shell**:
   ```bash
   pyspark
   ```

2. **You should see a Spark context (sc) and Spark session (spark) automatically created**

#### Creating a Simple PySpark Script

Create a file named `simple_pyspark.py`:

```python
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
   .appName("Simple PySpark Example") \
   .getOrCreate()

# Create a simple DataFrame
data = [("Alice", 34), ("Bob", 45), ("Charlie", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("DataFrame contents:")
df.show()

# Perform some operations
print("Average age:")
df.agg({"Age": "avg"}).show()

# Stop the Spark session
spark.stop()
```

Run the script:
```bash
python3 simple_pyspark.py
```

### Running PySpark in Jupyter Notebook

1. **Install Jupyter**:
   ```bash
   pip3 install jupyter
   ```

2. **Configure PySpark to work with Jupyter**:
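   Add the following to your `~/.bashrc`. With these set, the `pyspark` command opens Jupyter instead of the interactive shell, so unset them when you want the plain shell back: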
   ```bash
   export PYSPARK_DRIVER_PYTHON=jupyter
   export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
   ```

3. **Start Jupyter with PySpark**:
   
   Install the `notebook` package if it is not already present:
   ```bash
   pip3 install notebook
   ```

   Start Jupyter through the `pyspark` command (this needs the variables above in `~/.bashrc` or exported for the current session):
   ```bash
   pyspark
   ```
   
   Or start a regular notebook and import PySpark inside it:
   ```bash
   jupyter notebook
   ```

### Using PySpark with Larger Datasets

For a more realistic example, create another script that processes CSV data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("CSV Processing Example") \
    .getOrCreate()

# Read CSV file (replace with your actual file path)
df = spark.read.option("header", "true").option("inferSchema", "true").csv("your_data.csv")

# Show schema and sample data
print("DataFrame Schema:")
df.printSchema()
print("Sample Data:")
df.show(5)

# Perform aggregations (example assuming columns 'age' and 'salary' exist)
print("Summary Statistics:")
df.select(
    avg("salary").alias("avg_salary"),
    count("*").alias("count")
).show()

# Filter data
filtered_df = df.filter(col("age") > 30)
print("Filtered data (age > 30):")
filtered_df.show(5)

# Save results
filtered_df.write.mode("overwrite").parquet("filtered_results")

# Stop Spark session
spark.stop()
```
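To try the script without real data, a small CSV with the columns it expects can be generated first. This sketch uses only the standard library; `write_sample_csv` is a hypothetical helper, and the `name`, `age`, and `salary` columns and their value ranges are assumptions chosen to match the example above.

```python
import csv
import random


def write_sample_csv(path: str, rows: int = 100, seed: int = 42) -> str:
    """Write a CSV with a header and `rows` random records; return the path."""
    rng = random.Random(seed)  # fixed seed so repeated runs produce the same file
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "age", "salary"])
        for i in range(rows):
            writer.writerow([f"person_{i}", rng.randint(18, 65), rng.randint(30000, 120000)])
    return path
```

Calling `write_sample_csv("your_data.csv")` in the directory where the script runs creates a file that the CSV example can read as is.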


## Troubleshooting Common Issues

### For PySpark Installation

1. **Java issues**: If you encounter Java-related errors, ensure you have the correct Java version:
   ```bash
   sudo update-alternatives --config java
   ```

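2. **Memory issues**: Give the driver and executors more memory. When starting PySpark from plain Python, one option is `PYSPARK_SUBMIT_ARGS` (the trailing `pyspark-shell` is required):
   ```bash
   export PYSPARK_SUBMIT_ARGS="--driver-memory 4g --executor-memory 4g pyspark-shell"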
   ```

3. **Python version conflicts**: Set the Python version explicitly:
   ```bash
   export PYSPARK_PYTHON=python3.8  # or your specific version
   ```

4. **Connection refused errors**: Check if ports are being blocked by a firewall:
   ```bash
   sudo ufw status
   ```

### For Cluster Configuration

1. **Nodes not connecting**: Check firewall settings and ensure ports 7077, 8080 are open
2. **Memory issues**: Adjust SPARK_WORKER_MEMORY and make sure physical RAM is sufficient
3. **SSH problems**: Verify SSH key configuration with `ssh worker1 date` to test
4. **Application failures**: Check logs in `$SPARK_HOME/logs` directory


## Next Steps

Once you're comfortable with basic PySpark usage and cluster configuration, you might want to explore:

1. Spark SQL for structured data processing
2. Spark MLlib for machine learning
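3. Structured Streaming for real-time data processing (the older DStream-based Spark Streaming API is considered legacy)
4. GraphX for graph processing (Scala/Java only; the separate GraphFrames package is the usual route from Python)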
5. Integration with data sources like HDFS, Hive, or cloud storage