<a href="https://colab.research.google.com/github/Swelihlelwazi/us-ie-big-data-technologies/blob/master/postblock3/q4..ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose

Explore PySpark and the JDBC connection functionality to read from operational databases.

In this notebook we will set up a PostgreSQL instance and populate it with the Pagila dataset. We will then connect to the database from PySpark via a JDBC connector.

# Setup

## PostgreSQL

First, let's install PostgreSQL in this Colab instance.

In [None]:
!sudo apt install postgresql postgresql-contrib

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libcommon-sense-perl libjson-perl libjson-xs-perl libtypes-serialiser-perl
  logrotate netbase postgresql-14 postgresql-client-14
  postgresql-client-common postgresql-common ssl-cert sysstat
Suggested packages:
  bsd-mailx | mailx postgresql-doc postgresql-doc-14 isag
The following NEW packages will be installed:
  libcommon-sense-perl libjson-perl libjson-xs-perl libtypes-serialiser-perl
  logrotate netbase postgresql postgresql-14 postgresql-client-14
  postgresql-client-common postgresql-common postgresql-contrib ssl-cert
  sysstat
0 upgraded, 14 newly installed, 0 to remove and 49 not upgraded.
Need to get 18.4 MB of archives.
After this operation, 51.7 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 logrotate amd64 3.19.0-1ubuntu1.1 [54.3 kB]
Get:2 http://archive.ubuntu.com

In [None]:
!service postgresql start

 * Starting PostgreSQL 14 database server
   ...done.


Set a password for the `postgres` user in PostgreSQL ([stackoverflow](https://stackoverflow.com/questions/12720967/how-to-change-postgresql-user-password/12721020#12721020))


In [None]:
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'test';"

ALTER ROLE


Store your database password in an environment variable so that we do not need to type it in every time (not advisable in general).

We'll use the notebook magic `%env`.

In [None]:
%env PGPASSWORD=test

env: PGPASSWORD=test
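
Since exporting the password for the whole notebook is not generally advisable, one alternative is to hand it only to the `psql` child process. The sketch below is a minimal illustration of that idea and is not part of the original setup; `psql_env` and `run_psql` are hypothetical helper names.

```python
import os
import subprocess


def psql_env(password):
    """Return a copy of the current environment with PGPASSWORD added."""
    env = dict(os.environ)          # copy, so os.environ itself stays untouched
    env["PGPASSWORD"] = password
    return env


def run_psql(args, password):
    """Run psql with the password visible only to that child process."""
    return subprocess.run(["psql", *args], env=psql_env(password), check=True)


# Example call (assumes the local server and the 'test' password set up above):
# run_psql(["-h", "localhost", "-U", "postgres", "-c", "select 1"], "test")
```

The rest of this notebook keeps using `%env`, which is simpler for a throwaway Colab instance.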


## Pagila

Now, let's populate the PostgreSQL database with the Pagila data from the tutorial.

In [None]:
!git clone https://github.com/spatialedge-ai/pagila.git

Cloning into 'pagila'...
remote: Enumerating objects: 94, done.
remote: Counting objects: 100% (94/94), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 94 (delta 47), reused 85 (delta 42), pack-reused 0 (from 0)
Receiving objects: 100% (94/94), 2.91 MiB | 15.58 MiB/s, done.
Resolving deltas: 100% (47/47), done.


In [None]:
!psql -h localhost -U postgres -c "create database pagila"

CREATE DATABASE


In [None]:
!psql -h localhost -U postgres -d pagila -f "pagila/pagila-schema.sql"

SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
CREATE TYPE
ALTER TYPE
CREATE DOMAIN
ALTER DOMAIN
CREATE FUNCTION
ALTER FUNCTION
CREATE FUNCTION
ALTER FUNCTION
CREATE FUNCTION
ALTER FUNCTION
CREATE FUNCTION
ALTER FUNCTION
CREATE FUNCTION
ALTER FUNCTION
CREATE FUNCTION
ALTER FUNCTION
CREATE FUNCTION
ALTER FUNCTION
CREATE FUNCTION
ALTER FUNCTION
CREATE SEQUENCE
ALTER TABLE
SET
SET
CREATE TABLE
ALTER TABLE
CREATE FUNCTION
ALTER FUNCTION
CREATE AGGREGATE
ALTER AGGREGATE
CREATE SEQUENCE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE VIEW
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE VIEW
ALTER TABLE
CREATE VIEW
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE SEQU

In [None]:
!psql -h localhost -U postgres -d pagila -f "pagila/pagila-data.sql"

SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
COPY 200
COPY 109
COPY 600
COPY 603
COPY 16
COPY 2
COPY 599
COPY 6
COPY 1000
COPY 5462
COPY 1000
COPY 4581
COPY 2
COPY 16044
COPY 1157
COPY 2312
COPY 5644
COPY 6754
COPY 182
COPY 0
 setval 
--------
    200
(1 row)

 setval 
--------
    605
(1 row)

 setval 
--------
     16
(1 row)

 setval 
--------
    600
(1 row)

 setval 
--------
    109
(1 row)

 setval 
--------
    599
(1 row)

 setval 
--------
   1000
(1 row)

 setval 
--------
   4581
(1 row)

 setval 
--------
      6
(1 row)

 setval 
--------
  32098
(1 row)

 setval 
--------
  16049
(1 row)

 setval 
--------
      2
(1 row)

 setval 
--------
      2
(1 row)



## PySpark Setup

Now, let's download what is needed to initiate JDBC connections, as well as what is required to run PySpark itself.

In [None]:
# https://stackoverflow.com/questions/34948296/using-pyspark-to-connect-to-postgresql
!wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar

--2024-11-03 09:52:20--  https://jdbc.postgresql.org/download/postgresql-42.5.0.jar
Resolving jdbc.postgresql.org (jdbc.postgresql.org)... 72.32.157.228, 2001:4800:3e1:1::228
Connecting to jdbc.postgresql.org (jdbc.postgresql.org)|72.32.157.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1046274 (1022K) [application/java-archive]
Saving to: ‘postgresql-42.5.0.jar’


2024-11-03 09:52:20 (5.69 MB/s) - ‘postgresql-42.5.0.jar’ saved [1046274/1046274]



In [None]:
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np

%config Completer.use_jedi = False

SPARKVERSION='3.2.1'
HADOOPVERSION='3.2'
pwd=os.getcwd()

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}"

# print(os.environ['SPARK_HOME'])


In [None]:
!sudo apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://archive.apache.org/dist/spark/spark-{SPARKVERSION}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}.tgz
!tar xf spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}.tgz

debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 3.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
--2024-11-03 09:52:44--  https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
Resolving archive.apache.org (archive.apache.org)... 65.108.204.189, 2a01:4f9:1a:a084::2
Connecting to archive.apache.org (archive.apache.org)|65.108.204.189|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 300971569 (287M) [application/x-gzip]
Saving to: ‘spark-3.2.1-bin-hadoop3.2.tgz’


2024-11-03 09:52:56 (26.4 MB/s) - ‘spark-3.2.1-bin-hadoop3.2.tgz’ saved [300971569/300971569]



In [None]:
!cp postgresql-42.5.0.jar spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}/jars

In [None]:
!pip install findspark

Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl.metadata (352 bytes)
Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [None]:
import findspark
findspark.init()
findspark.find()

# get a spark session
from pyspark.sql import SparkSession

# The driver jar name must match the file downloaded above (postgresql-42.5.0.jar)
spark = (
    SparkSession.builder
    .config("spark.jars", "postgresql-42.5.0.jar")
    .config("spark.driver.extraClassPath",
            f"spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}/jars")
    .getOrCreate()
)
print(spark.conf.get('spark.jars'))

%env PYARROW_IGNORE_TIMEZONE=1

postgresql-42.5.0.jar
env: PYARROW_IGNORE_TIMEZONE=1
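
The question cells below each repeat the same JDBC options. As a sketch only, those options could be collected once in a plain dictionary and passed to the reader with `.options(**opts)`. The helper name `pagila_jdbc_options` is hypothetical; the host, port, database, user and password defaults are the values used in this notebook.

```python
def pagila_jdbc_options(table, host="localhost", port=5432, database="pagila",
                        user="postgres", password="test"):
    """Build the option dictionary for a Spark JDBC read of one table."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


opts = pagila_jdbc_options("customer")
print(opts["url"])

# With a running SparkSession named `spark`, the table could then be read as:
# customer_df = spark.read.format("jdbc").options(**opts).load()
```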


# Questions

### Question 1

Using a PySpark DataFrame, print the schema of the customer table in the pagila PostgreSQL database by utilising a JDBC connection.

In [None]:
# 1. Install PostgreSQL
!sudo apt install postgresql postgresql-contrib
!service postgresql start

# 2. Set up PostgreSQL credentials
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'test';"
%env PGPASSWORD=test

# 3. Clone and setup Pagila database
!git clone https://github.com/spatialedge-ai/pagila.git
!psql -h localhost -U postgres -c "create database pagila"
!psql -h localhost -U postgres -d pagila -f "pagila/pagila-schema.sql"
!psql -h localhost -U postgres -d pagila -f "pagila/pagila-data.sql"

# 4. Set up PySpark with JDBC
!wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar
!sudo apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!cp postgresql-42.5.0.jar spark-3.2.1-bin-hadoop3.2/jars
!pip install findspark

# 5. Initialize Spark
import os
import findspark

SPARKVERSION='3.2.1'
HADOOPVERSION='3.2'
pwd=os.getcwd()

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}"

findspark.init()

# 6. Create Spark Session and read customer table
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars", "postgresql-42.5.0.jar")
    .config("spark.driver.extraClassPath", f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}/jars")
    .getOrCreate()
)

# 7. Read customer table and print schema
customer_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/pagila")
    .option("dbtable", "customer")
    .option("user", "postgres")
    .option("password", "test")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Print the schema
customer_df.printSchema()

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
postgresql is already the newest version (14+238).
postgresql-contrib is already the newest version (14+238).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
 * Starting PostgreSQL 14 database server
   ...done.
ALTER ROLE
env: PGPASSWORD=test
fatal: destination path 'pagila' already exists and is not an empty directory.
ERROR:  database "pagila" already exists
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
psql:pagila/pagila-schema.sql:29: ERROR:  type "mpaa_rating" already exists
ALTER TYPE
psql:pagila/pagila-schema.sql:39: ERROR:  type "year" already exists
ALTER DOMAIN
psql:pagila/pagila-schema.sql:56: ERROR:  function "_group_concat" already exists with same argument types
ALTER FUNCTION
psql:pagila/pagila-schema.sql:73: ERROR:  function "film_in_stock" already exists with same argument types
ALTER FUNCTION
psql:pagila/pagila-schema.sql:90: ERR

### Question 2

Use the Spark SQL API to query the customer table, compute the number of unique email addresses in that table and print the result in the notebook.

In [None]:
# Create Spark Session with PostgreSQL JDBC configuration
from pyspark.sql import SparkSession
import os

SPARKVERSION='3.2.1'
HADOOPVERSION='3.2'
pwd=os.getcwd()

# Create Spark session with PostgreSQL JDBC driver
spark = (
    SparkSession.builder
    .config("spark.jars", "postgresql-42.5.0.jar")
    .config("spark.driver.extraClassPath", f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}/jars")
    .getOrCreate()
)

# Read customer table from PostgreSQL
customer_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/pagila")
    .option("dbtable", "customer")
    .option("user", "postgres")
    .option("password", "test")
    .option("driver", "org.postgresql.Driver")
    .load()
)

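# Register the DataFrame as a temporary view so it can be queried with Spark SQL
customer_df.createOrReplaceTempView("customer")

# Count unique email addresses with the Spark SQL API
# (COUNT(DISTINCT ...) ignores NULL emails)
unique_email_count = spark.sql(
    "SELECT COUNT(DISTINCT email) AS unique_emails FROM customer"
).first()["unique_emails"]

print(f"Number of unique email addresses: {unique_email_count}")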

Number of unique email addresses: 599


### Question 3

Repeat this calculation using only the DataFrame API and print the result.

In [None]:
# Create Spark Session with PostgreSQL JDBC configuration
from pyspark.sql import SparkSession
import os

SPARKVERSION='3.2.1'
HADOOPVERSION='3.2'
pwd=os.getcwd()

# Create Spark session with PostgreSQL JDBC driver
spark = (
    SparkSession.builder
    .config("spark.jars", "postgresql-42.5.0.jar")
    .config("spark.driver.extraClassPath", f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}/jars")
    .getOrCreate()
)

# Read customer table from PostgreSQL
customer_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/pagila")
    .option("dbtable", "customer")
    .option("user", "postgres")
    .option("password", "test")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Using DataFrame API to count unique emails
unique_email_count = customer_df.dropDuplicates(["email"]).count()

print(f"Number of unique email addresses using DataFrame API: {unique_email_count}")

Number of unique email addresses using DataFrame API: 599


### Question 4

How many partitions are present in the DataFrame resulting from Question 3? Additionally, provide the code necessary to determine that.

In [None]:
# Create Spark Session with PostgreSQL JDBC configuration
from pyspark.sql import SparkSession
import os

SPARKVERSION='3.2.1'
HADOOPVERSION='3.2'
pwd=os.getcwd()

# Create Spark session with PostgreSQL JDBC driver
spark = (
    SparkSession.builder
    .config("spark.jars", "postgresql-42.5.0.jar")
    .config("spark.driver.extraClassPath", f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}/jars")
    .getOrCreate()
)

# Read customer table from PostgreSQL
customer_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/pagila")
    .option("dbtable", "customer")
    .option("user", "postgres")
    .option("password", "test")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Get unique emails DataFrame using DataFrame API
unique_emails_df = customer_df.dropDuplicates(["email"])

# Get number of partitions
num_partitions = unique_emails_df.rdd.getNumPartitions()

print(f"Number of partitions in the unique emails DataFrame: {num_partitions}")

# Alternatively, we can see how many rows each partition holds
print("\nPartition distribution:")
print(unique_emails_df.rdd.glom().map(len).collect())

Number of partitions in the unique emails DataFrame: 1

Partition distribution:
[599]
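
Without partitioning options, Spark reads a JDBC table through a single partition. The sketch below collects Spark's JDBC partitioning options (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) in a dictionary. The helper name `partitioned_read_options` is hypothetical, and the choice of `customer_id` with bounds 1 and 599 is an assumption made for illustration.

```python
def partitioned_read_options(column, lower, upper, num_partitions):
    """Options asking Spark to split a JDBC read into ranges of `column`."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower > upper:
        raise ValueError("lower must not be larger than upper")
    return {
        "partitionColumn": column,   # must be a numeric, date or timestamp column
        "lowerBound": str(lower),    # bounds only shape the ranges; rows are not filtered
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


part_opts = partitioned_read_options("customer_id", 1, 599, 4)
print(part_opts)

# Combined with the usual connection options of this notebook:
# df = (spark.read.format("jdbc")
#       .option("url", "jdbc:postgresql://localhost:5432/pagila")
#       .option("dbtable", "customer")
#       .option("user", "postgres").option("password", "test")
#       .option("driver", "org.postgresql.Driver")
#       .options(**part_opts)
#       .load())
# print(df.rdd.getNumPartitions())
```

Checking `getNumPartitions()` on the freshly loaded DataFrame, before any shuffle such as `dropDuplicates`, shows how the read itself was split.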


### Question 5

Compute the min and max of customer.create_date and print the result (once more using the Spark DataFrame API and not the Spark SQL API).

In [None]:
# Create Spark Session with PostgreSQL JDBC configuration
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # avoids shadowing the built-in min and max
import os

SPARKVERSION='3.2.1'
HADOOPVERSION='3.2'
pwd=os.getcwd()

# Create Spark session with PostgreSQL JDBC driver
spark = (
    SparkSession.builder
    .config("spark.jars", "postgresql-42.5.0.jar")
    .config("spark.driver.extraClassPath", f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}/jars")
    .getOrCreate()
)

# Read customer table from PostgreSQL
customer_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/pagila")
    .option("dbtable", "customer")
    .option("user", "postgres")
    .option("password", "test")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Calculate min and max dates using DataFrame API
date_stats = customer_df.select(
    F.min("create_date").alias("earliest_date"),
    F.max("create_date").alias("latest_date"),
).collect()[0]

print(f"Earliest customer create date: {date_stats['earliest_date']}")
print(f"Latest customer create date: {date_stats['latest_date']}")

Earliest customer create date: 2020-02-14
Latest customer create date: 2020-02-14


### Question 6.1

Determine which first names occur more than once:

1. using the Spark SQL API (printing the result)

In [None]:
# Create Spark Session with PostgreSQL JDBC configuration
from pyspark.sql import SparkSession
import os

SPARKVERSION='3.2.1'
HADOOPVERSION='3.2'
pwd=os.getcwd()

# Create Spark session with PostgreSQL JDBC driver
spark = (
    SparkSession.builder
    .config("spark.jars", "postgresql-42.5.0.jar")
    .config("spark.driver.extraClassPath", f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}/jars")
    .getOrCreate()
)

# Read customer table from PostgreSQL
customer_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/pagila")
    .option("dbtable", "customer")
    .option("user", "postgres")
    .option("password", "test")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Register the DataFrame as a temporary view
customer_df.createOrReplaceTempView("customer")

# SQL query to find duplicate first names
sql_result = spark.sql("""
    SELECT first_name, COUNT(*) AS name_count
    FROM customer
    GROUP BY first_name
    HAVING COUNT(*) > 1
    ORDER BY name_count DESC, first_name
""")

# Show the results
sql_result.show()

+----------+----------+
|first_name|name_count|
+----------+----------+
|     JAMIE|         2|
|    JESSIE|         2|
|     KELLY|         2|
|    LESLIE|         2|
|    MARION|         2|
|     TERRY|         2|
|     TRACY|         2|
|    WILLIE|         2|
+----------+----------+



### Question 6.2

  2. using the Spark Dataframe API (printing the result once more).

In [None]:
# Create Spark Session with PostgreSQL JDBC configuration
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, col
import os

SPARKVERSION='3.2.1'
HADOOPVERSION='3.2'
pwd=os.getcwd()

# Create Spark session with PostgreSQL JDBC driver
spark = (
    SparkSession.builder
    .config("spark.jars", "postgresql-42.5.0.jar")
    .config("spark.driver.extraClassPath", f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}/jars")
    .getOrCreate()
)

# Read customer table from PostgreSQL
customer_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/pagila")
    .option("dbtable", "customer")
    .option("user", "postgres")
    .option("password", "test")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# DataFrame operations to find duplicate first names
df_result = (
    customer_df.groupBy("first_name")
    .agg(count("*").alias("name_count"))
    .filter(col("name_count") > 1)
    .orderBy(col("name_count").desc(), col("first_name"))
)

# Show the results
df_result.show()

+----------+----------+
|first_name|name_count|
+----------+----------+
|     JAMIE|         2|
|    JESSIE|         2|
|     KELLY|         2|
|    LESLIE|         2|
|    MARION|         2|
|     TERRY|         2|
|     TRACY|         2|
|    WILLIE|         2|
+----------+----------+



### Question 7

Port the PostgreSQL query below to the PySpark DataFrame API and execute the query within Spark (not directly on PostgreSQL):

```
SELECT
   staff.first_name
   ,staff.last_name
   ,SUM(payment.amount)
 FROM payment
   INNER JOIN staff ON payment.staff_id = staff.staff_id
 WHERE payment.payment_date BETWEEN '2007-01-01' AND '2007-02-01'
 GROUP BY
   staff.last_name
   ,staff.first_name
 ORDER BY SUM(payment.amount)
 ;
```

In [None]:
# Create Spark Session with PostgreSQL JDBC configuration
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum, col
import os

SPARKVERSION='3.2.1'
HADOOPVERSION='3.2'
pwd=os.getcwd()

# Create Spark session with PostgreSQL JDBC driver
spark = (
    SparkSession.builder
    .config("spark.jars", "postgresql-42.5.0.jar")
    .config("spark.driver.extraClassPath", f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}/jars")
    .getOrCreate()
)

# Read payment table
payment_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/pagila")
    .option("dbtable", "payment")
    .option("user", "postgres")
    .option("password", "test")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Read staff table
staff_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/pagila")
    .option("dbtable", "staff")
    .option("user", "postgres")
    .option("password", "test")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Port the SQL query to DataFrame operations.
# The SQL uses ORDER BY SUM(payment.amount) with no direction, so the sort is ascending.
result_df = (
    payment_df
    .join(staff_df, payment_df.staff_id == staff_df.staff_id, "inner")
    .filter(col("payment_date").between("2007-01-01", "2007-02-01"))
    .groupBy(col("first_name"), col("last_name"))
    .agg(spark_sum("amount").alias("total_amount"))
    .orderBy(col("total_amount"))
)

# Show results
print("Staff Payment Summary (Jan 1 - Feb 1, 2007):")
print("===========================================")
result_df.show()

Staff Payment Summary (Jan 1 - Feb 1, 2007):
+----------+---------+------------+
|first_name|last_name|total_amount|
+----------+---------+------------+
+----------+---------+------------+
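
When porting `BETWEEN '2007-01-01' AND '2007-02-01'`, it can help to remember that both bounds are inclusive and that, if `payment_date` is a timestamp, the upper bound stands for midnight at the start of 1 February. The plain-Python sketch below mirrors that comparison. The sample timestamps are made up for illustration and do not come from the Pagila data.

```python
from datetime import datetime

lower = datetime(2007, 1, 1)
upper = datetime(2007, 2, 1)   # midnight at the start of 1 February


def in_between(ts):
    """Inclusive range check, like SQL BETWEEN or Column.between in PySpark."""
    return lower <= ts <= upper


samples = [
    datetime(2007, 1, 15, 12, 0),   # inside the range
    datetime(2007, 2, 1, 0, 0),     # exactly on the upper bound: included
    datetime(2007, 2, 1, 9, 30),    # later on 1 February: excluded
    datetime(2020, 2, 14, 8, 0),    # far outside the range
]
for ts in samples:
    print(ts.isoformat(), in_between(ts))
```

A filter like this returns no rows whenever every timestamp in the table falls outside the two bounds.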



### Question 8

Are you currently executing commands on a driver node, or a worker? Provide the code you ran to determine that.

In [None]:
from pyspark.sql import SparkSession
import os

SPARKVERSION='3.2.1'
HADOOPVERSION='3.2'
pwd=os.getcwd()

# Create Spark session
spark = (
    SparkSession.builder
    .config("spark.jars", "postgresql-42.5.0.jar")
    .config("spark.driver.extraClassPath", f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}/jars")
    .getOrCreate()
)

# Get Spark context
sc = spark.sparkContext

# Top-level notebook code executes in the driver process.
# isLocal() reports whether the master is local, in which case the executors
# run inside the same JVM as the driver and there are no separate worker nodes.
is_local = sc._jsc.sc().isLocal()

# Get deployment mode
deploy_mode = spark.conf.get("spark.submit.deployMode", "client")

# Get master URL
master_url = sc.master

print(f"Is running locally (driver): {is_local}")
print(f"Deployment mode: {deploy_mode}")
print(f"Master URL: {master_url}")

# Additional configuration details
print("\nSpark Configuration Details:")
print("============================")
print(f"Driver Host: {spark.conf.get('spark.driver.host', 'Not Set')}")
print(f"Driver Port: {spark.conf.get('spark.driver.port', 'Not Set')}")
print(f"App Name: {sc.appName}")

Is running locally (driver): True
Deployment mode: client
Master URL: local[*]

Spark Configuration Details:
Driver Host: 5f82e67c07fa
Driver Port: 33647
App Name: pyspark-shell
