
**Why Vertex AI Workbench Doesn't Find PySpark by Default**

  * **JupyterLab Environment:** A Vertex AI Workbench instance provides a JupyterLab environment running on a Compute Engine VM. While it comes with many data science libraries (TensorFlow, PyTorch, scikit-learn), it **doesn't inherently include a Spark distribution or PySpark configured to run locally or connect to a Spark cluster.**
  * **Spark is a Distributed System:** PySpark is the Python API for Apache Spark, which is a *distributed computing framework*. It's designed to run across a cluster of machines. A single notebook instance, by itself, isn't a Spark cluster.
  * **Missing Dependencies:** PySpark needs a Java runtime on the VM. Depending on how Spark is installed, it may also need environment variables such as `SPARK_HOME` and `JAVA_HOME` to locate the underlying Spark binaries.
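
The points above can be checked from a notebook cell before installing anything. This is a minimal diagnostic sketch that uses only the Python standard library; the function name `check_pyspark_prereqs` is hypothetical, and it only reports what the current Python process can see.

```python
import importlib.util
import os
import shutil


def check_pyspark_prereqs() -> dict:
    """Report which local PySpark prerequisites are visible to this Python process."""
    return {
        # True if the `pyspark` package can be imported by this interpreter.
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
        # Full path of the `java` executable on PATH, or None if it is not found.
        "java_on_path": shutil.which("java"),
        # Values of the environment variables, or None if they are unset.
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "SPARK_HOME": os.environ.get("SPARK_HOME"),
    }


for name, value in check_pyspark_prereqs().items():
    print(f"{name}: {value}")
```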

**How to Enable PySpark in Vertex AI Workbench**

There are two primary ways to get PySpark working in a Vertex AI Workbench notebook:

**Method 1: Integrate with a Dataproc Cluster (Recommended for Big Data)**

This is the most robust and scalable solution for real Big Data workloads. You connect your Workbench notebook to a managed Spark cluster (Dataproc).

1.  **Enable Dataproc API:** Ensure the Dataproc API is enabled in your Google Cloud project.
    ```bash
    gcloud services enable dataproc.googleapis.com
    ```
2.  **Create a Dataproc Cluster:** Create a Dataproc cluster in the same region as your Workbench instance. You can do this via the GCP Console or `gcloud` commands.
    ```bash
    gcloud dataproc clusters create example-cluster\
      --enable-component-gateway\
      --bucket=example-dataproc-workshop\
      --region=europe-west1\
      --no-address\
      --master-machine-type=n4-standard-2\
      --master-boot-disk-type=hyperdisk-balanced\
      --master-boot-disk-size=100\
      --num-workers=2\
      --worker-machine-type=n4-standard-2\
      --worker-boot-disk-size=200\
      --image-version=2.2-debian12\
      --optional-components JUPYTER\
      --max-age=3600s\
      --labels=mode=workshop,user=zelda\
      --project=$PROJECT_ID
    ```
3.  **Enable Dataproc Integration in Workbench:**
      * When creating a new Vertex AI Workbench instance, ensure "Enable Dataproc Serverless Interactive Sessions" (or "Enable Dataproc" for older versions) is selected under "Advanced options" -\> "Environment" or "Integrations."
      * If your instance already exists, you might need to stop it, edit it, enable the Dataproc integration, and then restart it.
4.  **Open JupyterLab and Select Dataproc Kernel:**
      * Once your Workbench instance is running and has Dataproc integration enabled, open JupyterLab.
      * When creating a new notebook, you should see kernels like "PySpark" or "Spark (with Python 3)" that allow you to connect to your Dataproc cluster.
5.  **Connect to Spark Session in Notebook:** Your notebook code will then automatically connect to the Dataproc cluster when you create a SparkSession.
    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("MyPySparkApp").getOrCreate()
    print("SparkSession created successfully!")
    ```
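
If you prefer to launch the cluster from Python rather than typing the long `gcloud` command, the argument list can be assembled from a dictionary. The sketch below only builds and prints the command; the helper name `build_cluster_create_command` and the project ID `my-project-id` are placeholders, and only a subset of the flags from step 2 is shown.

```python
import shlex


def build_cluster_create_command(name: str, region: str, project: str, options: dict) -> list:
    """Return the argv list for `gcloud dataproc clusters create`."""
    argv = ["gcloud", "dataproc", "clusters", "create", name,
            f"--region={region}", f"--project={project}"]
    for flag, value in options.items():
        if value is True:
            # Boolean flags such as --no-address take no value.
            argv.append(f"--{flag}")
        else:
            argv.append(f"--{flag}={value}")
    return argv


options = {
    "enable-component-gateway": True,
    "no-address": True,
    "num-workers": 2,
    "image-version": "2.2-debian12",
    "optional-components": "JUPYTER",
    "max-age": "3600s",
}
# "my-project-id" is a placeholder; use your own project ID.
argv = build_cluster_create_command("example-cluster", "europe-west1", "my-project-id", options)
print(shlex.join(argv))
# To actually create the cluster, pass the list to subprocess.run(argv, check=True).
```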

**Method 2: Install PySpark Locally on the Workbench Instance (for Smaller Scale/Testing)**

This method installs PySpark directly on your Workbench VM. It's suitable for smaller datasets that fit within the VM's memory and for local development/testing, but it won't leverage distributed computing.

1.  **Open a Terminal in JupyterLab:** From your JupyterLab interface on the Workbench instance, go to `File` -\> `New` -\> `Terminal`.
2.  **Install Java (if not present):** PySpark requires Java. Many default images have it, but if not:
    ```bash
    sudo apt-get update
    sudo apt-get install default-jdk
    ```
3.  **Install PySpark via pip:**
    ```bash
    pip install pyspark
    ```
4.  **Set Environment Variables (Optional, but good practice):**
    ```bash
    # Add these to your ~/.bashrc. An `export` run from a notebook cell does not persist;
    # inside a notebook, use `%env` or `os.environ` instead.
    export JAVA_HOME="/usr/lib/jvm/default-java" # Adjust if your JDK path is different
    export SPARK_HOME="/opt/conda/lib/python3.11/site-packages/pyspark" # Adjust to your Python version and pip install location
    export PATH="$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin"
    ```
      * You might need to restart the kernel or JupyterLab after setting these.
5.  **Create SparkSession in Notebook:**
    ```python
    from pyspark.sql import SparkSession

    # local[*] runs Spark inside this VM, using all available cores.
    spark = SparkSession.builder.master("local[*]").appName("LocalPySpark").getOrCreate()
    print("Local SparkSession created successfully!")
    ```
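
The `SPARK_HOME` path in step 4 depends on the Python version and on where `pip` installed the package. As a hedged alternative, the location can be looked up from Python and set for the current kernel only. The helper name `find_pyspark_home` is hypothetical, and the sketch assumes a pip-installed `pyspark` whose package directory doubles as the Spark home.

```python
import importlib.util
import os
from typing import Optional


def find_pyspark_home() -> Optional[str]:
    """Return the directory of the installed pyspark package, or None if it is not installed."""
    spec = importlib.util.find_spec("pyspark")
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at pyspark/__init__.py; its parent is the package directory.
    return os.path.dirname(spec.origin)


pyspark_home = find_pyspark_home()
if pyspark_home is None:
    print("pyspark is not installed in this environment; run `pip install pyspark` first.")
else:
    # setdefault keeps any SPARK_HOME value that is already configured.
    os.environ.setdefault("SPARK_HOME", pyspark_home)
    print(f"SPARK_HOME={os.environ['SPARK_HOME']}")
```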

**Troubleshooting Tips:**

  * **Check Java Installation:** Run `java -version` in the terminal to confirm Java is installed.
  * **Environment Variables:** Ensure `JAVA_HOME` and `SPARK_HOME` are correctly set.
  * **Kernel Selection:** Always choose the correct PySpark kernel in your JupyterLab notebook.
  * **Restart Kernel:** After installing packages or changing environment variables, always restart the notebook kernel.
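
For the Java check, it can help to turn the `java -version` text into a number you can compare. The sketch below parses the major version from that text; the function name `parse_java_major` is hypothetical. Spark 3.5.x documents support for Java 8, 11, and 17, so confirm the result against the documentation for your Spark version.

```python
import re
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy numbering: "1.8.0_392" means Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


# `java -version` writes to stderr, so a real check could capture it like this:
#   import subprocess
#   text = subprocess.run(["java", "-version"], capture_output=True, text=True).stderr
sample = 'openjdk version "17.0.9" 2023-10-17'
print(parse_java_major(sample))
```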


# Demo

Start by creating the cluster. You can run this command here (using the Python 3 kernel) or in a regular shell.

In [None]:
!gcloud dataproc clusters create example-cluster\
      --enable-component-gateway\
      --bucket=example-dataproc-workshop\
      --region=europe-west1\
      --no-address\
      --master-machine-type=n4-standard-2\
      --master-boot-disk-type=hyperdisk-balanced\
      --master-boot-disk-size=100\
      --num-workers=2\
      --worker-machine-type=n4-standard-2\
      --worker-boot-disk-size=200\
      --image-version=2.2-debian12\
      --optional-components JUPYTER\
      --max-age=3600s\
      --labels=mode=workshop,user=zelda\
      --project=poc-example-ds

Waiting on operation [projects/poc-example-ds/regions/europe-west1/operations/ac6f4b90-b421-3662-9386-5cf4e96453c6].
Waiting for cluster creation operation...

Now select the kernel again (upper right corner) and switch it to the PySpark kernel for your cluster. The kernel is shown with the cluster name, in our case `example-cluster`.

In [1]:
BQ_INPUT_TABLE_ID = "poc-example-ds.example_composer_workshop.sales_data"  # Replace with your input table
BQ_OUTPUT_TABLE_ID = "poc-example-ds.example_composer_workshop.filtered_sales"  # Replace with your output table
INPUT_DATE = "2025-06-12"
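
A typo in a table ID only surfaces once the Spark job tries to read or write, so a quick format check up front can save a round trip. This is a small sketch with a hypothetical helper, `split_table_id`; it assumes the plain `project.dataset.table` form used above and does not handle domain-scoped project IDs.

```python
def split_table_id(table_id: str) -> tuple:
    """Split 'project.dataset.table' into its three parts, raising ValueError otherwise."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected 'project.dataset.table', got {table_id!r}")
    return tuple(parts)


project, dataset, table = split_table_id("poc-example-ds.example_composer_workshop.sales_data")
print(project, dataset, table)
```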

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

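# Get or create a SparkSession. In a Dataproc kernel a session usually exists already;
# in that case launch-time settings such as spark.jars.packages are ignored
# (see the WARN line in the output below).
spark = SparkSession.builder \
    .appName("BasicPySparkOperations") \
    .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.38.0") \
    .getOrCreate()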

print("SparkSession created successfully!")
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")
print(f"Spark Version: {spark.version}")

SparkSession created successfully!
Spark UI: http://zelda-cluster-m.europe-west1-b.c.poc-example-ds.internal:36369
Spark Version: 3.5.3


25/07/31 10:12:27 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [3]:
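# Reuse the existing session and set the GCS bucket that the BigQuery connector
# uses to stage data when writing. The running session keeps its original app name.
spark = SparkSession.builder \
    .appName("EBClientZeldaProcessor") \
    .config("temporaryGcsBucket", "example-dataproc-workshop") \
    .getOrCreate()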


25/07/31 10:12:27 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [4]:
# Read the input table through the BigQuery connector.
df = spark.read.format("bigquery") \
    .option("table", BQ_INPUT_TABLE_ID) \
    .load()


In [5]:
df.show()

+--------------+----------+------+----------------+
|transaction_id|product_id|amount|transaction_date|
+--------------+----------+------+----------------+
|        TID003|    PROD_A|  75.2|      2025-06-11|
|        TID001|    PROD_A|150.75|      2025-06-12|
|        TID006|    PROD_A| 99.99|      2025-06-12|
|        TID002|    PROD_B| 200.0|      2025-06-12|
|        TID005|    PROD_B|120.99|      2025-06-13|
|        TID004|    PROD_C| 300.5|      2025-06-12|
|        TID008|    PROD_C| 450.0|      2025-06-13|
|        TID007|    PROD_D|  50.0|      2025-06-11|
+--------------+----------+------+----------------+



In [12]:
processing_date = INPUT_DATE  # "2025-06-12", defined in the first cell

# Example processing: Filter by date and calculate total sales
processed_df = df.filter(F.col("transaction_date") == F.lit(processing_date)) \
                 .groupBy("product_id") \
                 .agg(F.sum("amount").alias("total_sales")) \
                 .withColumn("processing_date", F.lit(processing_date))
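
To see what the filter and aggregation above should produce, the same logic can be replayed in plain Python on the eight rows printed by `df.show()`. This is only an illustrative sketch with a hypothetical helper, `total_sales_by_product`; it is not how Spark executes the query. Only PROD_A, PROD_B, and PROD_C have transactions on 2025-06-12, so PROD_D should drop out.

```python
from collections import defaultdict

# Rows copied from the df.show() output above:
# (transaction_id, product_id, amount, transaction_date)
rows = [
    ("TID003", "PROD_A", 75.2, "2025-06-11"),
    ("TID001", "PROD_A", 150.75, "2025-06-12"),
    ("TID006", "PROD_A", 99.99, "2025-06-12"),
    ("TID002", "PROD_B", 200.0, "2025-06-12"),
    ("TID005", "PROD_B", 120.99, "2025-06-13"),
    ("TID004", "PROD_C", 300.5, "2025-06-12"),
    ("TID008", "PROD_C", 450.0, "2025-06-13"),
    ("TID007", "PROD_D", 50.0, "2025-06-11"),
]


def total_sales_by_product(rows, processing_date):
    """Keep rows for one date and sum the amount per product_id."""
    totals = defaultdict(float)
    for _transaction_id, product_id, amount, transaction_date in rows:
        if transaction_date == processing_date:
            totals[product_id] += amount
    return dict(totals)


totals = total_sales_by_product(rows, "2025-06-12")
for product_id, total in sorted(totals.items()):
    print(product_id, round(total, 2))
```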




In [13]:
# Log the destination table before writing.
print(f"Writing processed data to BigQuery table: {BQ_OUTPUT_TABLE_ID}")

# Write to BigQuery. "overwrite" replaces any existing contents of the output table.
processed_df.write \
    .format("bigquery") \
    .option("table", BQ_OUTPUT_TABLE_ID) \
    .mode("overwrite").save()

Writing processed data to BigQuery table: poc-example-ds.example_composer_workshop.filtered_sales