# Setting up PySpark on a Local Ubuntu System?

### ✅ Stage 2: Setting Up PySpark in a Local Ubuntu System (Step-by-Step)

This guide walks through setting up PySpark on Ubuntu (22.04 or similar) for local development.



### 🔧 Prerequisites

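* ✅ Python 3 (3.7+ for older PySpark releases; Spark 4.x requires 3.9 or newer)
* ✅ Java (Java 8 or 11 for Spark 3.x; Spark 4.x requires Java 17 or newer)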
* ✅ pip (Python package manager)
* ✅ Ubuntu terminal access



### 🚀 Step-by-Step Installation Guide

#### ✅ Step 1: Install Java (if not already installed)

```bash
sudo apt update
sudo apt install openjdk-11-jdk -y
```

✅ **Check version**:

```bash
java -version
```
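
The same check can be scripted, which is handy before starting Spark from Python. The sketch below runs `java -version` and extracts the major version number. The helper names `parse_java_major` and `java_major_version` are made up for this example and are not part of any library.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the Java major version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as "1.8", "1.7", and so on.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version():
    """Run `java -version` and return the major version, or None if Java is missing."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except OSError:
        # Java is not on PATH (or cannot be executed).
        return None
    # `java -version` usually writes to stderr, so inspect both streams.
    return parse_java_major(result.stderr + result.stdout)


print("Java major version:", java_major_version())
```

A result of `None` means Java was not found on `PATH`; any other value can be compared against the Java versions supported by your Spark release.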



#### ✅ Step 2: Install Python & pip (if not already installed)

```bash
sudo apt install python3 python3-pip -y
```

✅ **Check version**:

```bash
python3 --version
pip3 --version
```



#### ✅ Step 3: Install Apache Spark via pip

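Install `pyspark` and `findspark` (pip picks the latest PySpark release unless you pin a version, so check that it supports the Java version from Step 1):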

```bash
pip3 install pyspark findspark
```

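* **pyspark** → Spark itself plus its Python bindings (the pip package bundles the Spark JARs)
* **findspark** → Helps Jupyter or Python scripts locate a Spark installation; optional when PySpark was installed with pip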
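
To confirm that both packages landed in the interpreter you plan to use, a short check like the one below can help. It is a minimal sketch that relies on `importlib.metadata` (Python 3.8 or newer); the helper name `installed_version` is just an illustrative choice.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the two packages installed in this step.
for name in ("pyspark", "findspark"):
    print(name, "->", installed_version(name) or "not installed")
```

Run it with the same `python3` that `pip3` belongs to; a "not installed" line usually means the packages went into a different Python environment.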



#### ✅ Step 4: Set Environment Variables (optional but recommended)

Edit your `.bashrc` file:

```bash
nano ~/.bashrc
```

Add these lines at the end:

```bash
export SPARK_HOME="$(pip3 show pyspark | grep '^Location:' | cut -d' ' -f2)/pyspark"
export PATH="$SPARK_HOME/bin:$PATH"
```

Then run:

```bash
source ~/.bashrc
```
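
After reloading the shell, it can be worth checking that `SPARK_HOME` really points at a Spark directory. The following sketch looks for the `bin/pyspark` launcher under `SPARK_HOME`; the function name `check_spark_home` is hypothetical.

```python
import os


def check_spark_home(env=None):
    """Return (spark_home, ok); ok is True when SPARK_HOME contains bin/pyspark."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return None, False
    launcher = os.path.join(spark_home, "bin", "pyspark")
    return spark_home, os.path.isfile(launcher)


home, ok = check_spark_home()
print("SPARK_HOME:", home)
print("bin/pyspark found:", ok)
```

If the second line prints `False`, open a new terminal (or re-run `source ~/.bashrc`) and check that `pip3 show pyspark` reports a location.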

---

#### ✅ Step 5: Test PySpark Shell

```bash
pyspark
```

If everything works, you'll see:

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version X.X.X
      /_/
```

Type `exit()` or press `Ctrl + D` to exit.



#### ✅ Step 6: Test PySpark in Python

Create a file `test_spark.py`:

```python
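import findspark
findspark.init()  # Locate Spark and add PySpark to sys.path

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("TestApp").getOrCreate()

# Build a two-row DataFrame and print it
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()

spark.stop()  # Release the session's resources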
```

Run it:

```bash
python3 test_spark.py
```

You should see the DataFrame printed in your terminal.
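
To go one step beyond `df.show()`, the sketch below filters and aggregates a small DataFrame. The names `ROWS`, `COLUMNS`, `run_demo`, and the app name are illustrative, and `local[*]` is an assumed master setting that uses all local cores. PySpark is imported inside the function so the file can be loaded even before PySpark is installed.

```python
ROWS = [("Alice", 25), ("Bob", 30), ("Carol", 35)]
COLUMNS = ["Name", "Age"]


def run_demo():
    """Start a local session, run a filter and an aggregation, then stop."""
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master("local[*]")  # assumption: run on all local cores
        .appName("TestAppExtended")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(ROWS, COLUMNS)
        # Keep only the rows where Age is greater than 26.
        df.filter(F.col("Age") > 26).show()
        # Compute the average age over all rows.
        df.agg(F.avg("Age").alias("AvgAge")).show()
    finally:
        spark.stop()
```

Call `run_demo()` at the bottom of the file or from a Python shell; it prints the filtered rows followed by a one-row table with the average age.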

---

#### ✅ Step 7 (Optional): Jupyter Notebook + PySpark

Install Jupyter:

```bash
pip3 install notebook
```

Then install `ipykernel` and register a kernel named `pyspark_env`:

```bash
pip3 install ipykernel
python3 -m ipykernel install --user --name=pyspark_env
```

Launch Jupyter:

```bash
jupyter notebook
```

Use this code to initialize Spark in your notebook:

```python
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NotebookApp").getOrCreate()
```



# Setting up PySpark in Google Colab?

### ✅ Setting Up PySpark in Google Colab (2025 Guide)

Google Colab is an excellent platform for running **PySpark** without local installation. Follow these simple steps to configure and run Spark on Colab.



### 🔹 Step-by-Step Setup Guide

#### ✅ Step 1: Install PySpark

Run the following in a Colab cell:

```python
!apt-get install openjdk-11-jdk -y
!pip install pyspark
```



#### ✅ Step 2: Set Environment Variables

```python
import os
import pyspark

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
# Derive SPARK_HOME from the installed package instead of hard-coding the Python version
os.environ["SPARK_HOME"] = os.path.dirname(pyspark.__file__)
```

✅ `JAVA_HOME` tells Spark where Java is installed; Spark runs on the JVM, so it cannot start without it.
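
If the JDK directory name differs on your machine (for example on a non-amd64 image), the path can be discovered rather than typed by hand. This is a minimal sketch; `find_java_home` is a hypothetical helper, and the `/usr/lib/jvm` root and `java-11-openjdk-*` pattern are assumptions based on where Ubuntu's OpenJDK 11 package normally installs.

```python
import glob
import os


def find_java_home(jvm_root="/usr/lib/jvm", pattern="java-11-openjdk-*"):
    """Return the first JDK directory under jvm_root matching pattern, or None."""
    candidates = sorted(glob.glob(os.path.join(jvm_root, pattern)))
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


java_home = find_java_home()
if java_home is not None:
    # Only set JAVA_HOME when a matching JDK directory was actually found.
    os.environ["JAVA_HOME"] = java_home
print("JAVA_HOME:", os.environ.get("JAVA_HOME"))
```

If it prints `None`, the JDK from Step 1 was not installed under the assumed root, and `JAVA_HOME` has to be set manually.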



#### ✅ Step 3: Start Spark Session

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Colab PySpark") \
    .getOrCreate()
```



#### ✅ Step 4: Test Spark

```python
data = [("Ahmad", 22), ("Raza", 25)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
```



### ✅ Optional: Check Spark Version

```python
print(spark.version)
```



### 📌 Notes

* No need for `findspark` in Colab: `pip install pyspark` already puts PySpark on the Python path.
* Spark runs **locally** on Colab’s virtual machine (not distributed).
* You can upload files to Colab or use `gdown`, `wget`, or mount Google Drive for data input.
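
Once a file is on the Colab VM, reading it with Spark looks roughly like the sketch below. It writes a tiny CSV to the temp directory so there is something to load; `write_sample_csv`, `load_people`, and the file name are hypothetical, and `load_people` expects the `spark` session created in Step 3.

```python
import csv
import os
import tempfile

SAMPLE_ROWS = [("Ahmad", 22), ("Raza", 25)]


def write_sample_csv(path, rows=SAMPLE_ROWS):
    """Write a small Name/Age CSV file with a header row and return its path."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Name", "Age"])
        writer.writerows(rows)
    return path


def load_people(spark, path):
    """Read the CSV into a DataFrame using an existing SparkSession."""
    # header=True takes column names from the first row;
    # inferSchema=True asks Spark to detect the column types.
    return spark.read.csv(path, header=True, inferSchema=True)


csv_path = write_sample_csv(os.path.join(tempfile.gettempdir(), "people_sample.csv"))
print("Sample file written to", csv_path)
```

In a notebook where `spark` already exists, `load_people(spark, csv_path).show()` displays the two sample rows. The same call works for an uploaded file or a path under a mounted Google Drive.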



# Setting up PySpark in Databricks?

### ✅ Setting Up PySpark in **Databricks**

Databricks is a powerful cloud-based platform built on Apache Spark. It offers an easy-to-use environment for running **PySpark** code without manual installation or setup.



### 🔹 Step-by-Step Guide to Set Up PySpark in Databricks

#### ✅ Step 1: Create a Databricks Account

1. Go to: [https://community.cloud.databricks.com](https://community.cloud.databricks.com)
2. Sign up for a **free Community Edition** account (sufficient for learning).



#### ✅ Step 2: Open the Workspace

Once logged in:

1. Go to the **Workspace** tab.
2. Click on `Create > Notebook`.



#### ✅ Step 3: Create a New Notebook

1. **Name** your notebook (e.g., `My PySpark Demo`).
2. **Default language**: Select `Python`.
3. **Cluster**: You’ll be prompted to attach a cluster.



#### ✅ Step 4: Create and Start a Cluster

1. Go to `Compute > Create Cluster`.
2. Set:

   * Cluster name: `my-cluster`
   * Runtime: Choose default (`10.x` or higher is fine)
   * Cluster mode: `Single Node`
3. Click **Create Cluster**.

📌 Wait 2–3 minutes for the cluster to initialize.



#### ✅ Step 5: Run PySpark Code in Notebook

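Databricks notebooks already provide a ready-made `spark` session, so `getOrCreate()` simply returns it. You can now run PySpark like this: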

```python
# Create a SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DatabricksApp").getOrCreate()

# Sample data
data = [("Ahmad", 22), ("Raza", 25)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
```
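
The same data can also be queried with Spark SQL by registering it as a temporary view. The sketch below assumes it runs in a notebook where `spark` is available; `ROWS`, `QUERY`, `query_people`, and the view name `people` are illustrative.

```python
ROWS = [("Ahmad", 22), ("Raza", 25)]
QUERY = "SELECT Name, Age FROM people WHERE Age > 23"


def query_people(spark):
    """Register the sample rows as a temp view and query it with Spark SQL."""
    df = spark.createDataFrame(ROWS, ["Name", "Age"])
    # A temporary view lives only as long as the current Spark session.
    df.createOrReplaceTempView("people")
    return spark.sql(QUERY)
```

Calling `query_people(spark).show()` in a notebook cell prints the matching rows. After that, the `people` view can also be queried from a `%sql` cell in the same notebook.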



### 🔄 Bonus Features in Databricks

| Feature                    | Use                                   |
| -------------------------- | ------------------------------------- |
| 📊 Built-in visualizations | Click on result ➝ Visualize           |
| 📁 File system (DBFS)      | `/dbfs/` path for storing files       |
| 📚 Markdown Support        | `%md` cells for documentation         |
| 📦 Libraries               | Install via `Libraries > Install New` |



### ✅ Databricks Magic Commands

| Magic Command | Description                          |
| ------------- | ------------------------------------ |
| `%fs`         | Access Databricks File System (DBFS) |
| `%run`        | Run another notebook                 |
| `%sql`        | Run SQL queries                      |
| `%python`     | Run Python code (default)            |
| `%sh`         | Run shell commands                   |
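
Some magic commands have direct Python counterparts, which is useful when the result is needed as a value instead of cell output. The sketch below wraps two of them; `list_dbfs` and `run_sql` are hypothetical helper names, while `dbutils` and `spark` are the objects Databricks predefines in every notebook and are passed in explicitly here.

```python
def list_dbfs(dbutils, path="dbfs:/"):
    """Return the paths under `path`, the Python counterpart of `%fs ls`."""
    # dbutils.fs.ls returns file entries that expose a `path` attribute.
    return [entry.path for entry in dbutils.fs.ls(path)]


def run_sql(spark, statement):
    """Run a SQL statement from Python, the counterpart of a `%sql` cell."""
    return spark.sql(statement)
```

In a notebook, `list_dbfs(dbutils)` returns the top-level DBFS entries as a list of strings, and `run_sql(spark, "SELECT 1").show()` runs a query without switching the cell language.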

