<h1><center>Connecting PySpark with SQL Server</center></h1>
<hr><hr><hr>Connector and driver used: Apache Spark connector for SQL Server and Azure SQL <br>Microsoft documentation: https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16 <br>Reference video: https://www.youtube.com/watch?v=YPe_jmV3pzk

In [1]:
import findspark
findspark.init()

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

True

In [3]:
import ipynbname
notebook_name = ipynbname.name()

print(notebook_name)

004-sql_server_pyspark_connection


In [4]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName( notebook_name )
    .getOrCreate()
)

spark

- To connect to SQL Server, a port must be exposed for it, both in 'SQL Server (2022) Configuration Manager' and in the firewall. Most local SQL Server installations use the default port, `1433`.
- We need the `SQL_SERVER_NAME` (the IP address or hostname) of the SQL Server we want to connect to.
- The SQL Server path is then built as: `"jdbc:sqlserver://{SQL_SERVER_NAME}:{SQL_SERVER_EXPOSED_PORT}"`
- Combining this path with the name of the database we want to connect to gives the connection URL: `{SQL_SERVER_PATH};databaseName={DATABASE_NAME};`. This URL must be passed as the `url` option when reading from or writing to SQL Server.
- The format to use when reading from or writing to SQL Server is `jdbc`.
- `.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")` must also be passed, to specify the driver used to fetch data from or load data into SQL Server.
- The other options to specify are `dbtable`, `user` and `password`.
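
The steps above can be collected into one small helper. This is only a sketch: `build_jdbc_options` is a hypothetical name (not part of PySpark), and the server, database, and credential values in the example call are placeholders. The resulting dictionary could be passed to `spark.read.format("jdbc").options(**opts)`.

```python
def build_jdbc_options(server, port, database, table, user, password):
    """Hypothetical helper: assemble the JDBC options described in the list above."""
    # Server path: jdbc:sqlserver://<host>:<port>
    server_path = f"jdbc:sqlserver://{server}:{port}"
    return {
        # Connection URL: server path plus the target database
        "url": f"{server_path};databaseName={database};",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


# Placeholder values, for illustration only
opts = build_jdbc_options("localhost", 1433, "prac", "crj_orders", "some_user", "some_password")
print(opts["url"])  # jdbc:sqlserver://localhost:1433;databaseName=prac;
```

Keeping the options in one dictionary avoids repeating the same five `.option(...)` calls in every read and write.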

In [5]:
# An IP address can be used instead of a hostname, e.g. SQL_SERVER_NAME = "10.0.2.5"
SQL_SERVER_NAME = "DEBANJAN"
SQL_SERVER_EXPOSED_PORT = "1433"

SQL_SERVER_PATH = f"jdbc:sqlserver://{SQL_SERVER_NAME}:{SQL_SERVER_EXPOSED_PORT}"

DATABASE_NAME = "prac"
CONNECTION_URL = f"{SQL_SERVER_PATH};databaseName={DATABASE_NAME};"

TABLE_NAME = "crj_orders"

# Create a ".env" file and, in it, set environment variables named "SQL_SERVER_USERNAME" and "SQL_SERVER_PASSWORD" to the correct username and password for the SQL Server we want to connect to.

USERNAME = os.getenv("SQL_SERVER_USERNAME")
PASSWORD = os.getenv("SQL_SERVER_PASSWORD")
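
`os.getenv` returns `None` when a variable is missing, in which case the problem would only surface later as a login failure. A minimal sketch of failing early instead is shown below; `require_env` is a hypothetical helper, not part of `os` or `python-dotenv`.

```python
import os


def require_env(name):
    """Hypothetical helper: return an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string
        raise RuntimeError(f"Environment variable {name!r} is not set; check the .env file.")
    return value
```

It would be used in place of the plain lookups, for example `USERNAME = require_env("SQL_SERVER_USERNAME")`.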

### Reading from SQL Server:
----------------------------------

In [6]:
sql_server_df = (
    spark.read
    .format("jdbc")
    .option("url", CONNECTION_URL)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", TABLE_NAME)
    .option("user", USERNAME)
    .option("password", PASSWORD)
    .load()
)

In [7]:
sql_server_df.show()

+---+-------+--------------+-------+
| id|cust_id|       product|  price|
+---+-------+--------------+-------+
|  1|      2|        Laptop|  35000|
|  2|      3|        Scooty|  80000|
|  3|      1|         Phone|  15000|
|  4|      3|        Laptop|  45000|
|  5|      4|           Car|1000000|
|  6|      3|Dressing Table|  15000|
|  7|      1|        IPhone|  69000|
+---+-------+--------------+-------+
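
The read above loads the whole table. Spark's JDBC source also accepts, as the `dbtable` value, anything that is valid in a SQL `FROM` clause, including a subquery in parentheses with an alias, so filtering can be done by SQL Server before the data reaches Spark. The sketch below only builds that string; `as_subquery` and the alias `src` are hypothetical names, and the query is just an example against the `crj_orders` table.

```python
def as_subquery(sql, alias="src"):
    """Hypothetical helper: wrap a SELECT so it can be used as the `dbtable` option."""
    # Remove surrounding whitespace and a trailing semicolon, then parenthesize and alias
    return f"({sql.strip().rstrip(';')}) AS {alias}"


dbtable = as_subquery("SELECT id, product, price FROM crj_orders WHERE price > 20000;")
print(dbtable)  # (SELECT id, product, price FROM crj_orders WHERE price > 20000) AS src
```

The result would be passed as `.option("dbtable", dbtable)` in place of the plain table name.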



In [22]:
customers_df = (
    spark.read
    .format("jdbc")
    .option("url", CONNECTION_URL)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", "crj_customers")
    .option("user", USERNAME)
    .option("password", PASSWORD)
    .load()
)

customers_df = customers_df.withColumnRenamed("id", "cust_id")

customers_df.show()

+-------+--------+--------+
|cust_id|   fname|   lname|
+-------+--------+--------+
|      1|  Projna| Kabiraj|
|      2|Debanjan|  Sarkar|
|      3|  Nitika|  Sarkar|
|      4|    Atul|   Kumar|
|      5|   Tuhin|  Sarkar|
|      7|  Sagnik|Bairagya|
+-------+--------+--------+



In [16]:
orders_df = (
    spark.read
    .format("jdbc")
    .option("url", CONNECTION_URL)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", "crj_orders")
    .option("user", USERNAME)
    .option("password", PASSWORD)
    .load()
)

orders_df.show()

+---+-------+--------------+-------+
| id|cust_id|       product|  price|
+---+-------+--------------+-------+
|  1|      2|        Laptop|  35000|
|  2|      3|        Scooty|  80000|
|  3|      1|         Phone|  15000|
|  4|      3|        Laptop|  45000|
|  5|      4|           Car|1000000|
|  6|      3|Dressing Table|  15000|
|  7|      1|        IPhone|  69000|
+---+-------+--------------+-------+



In [24]:
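join_df = (
    orders_df
    .join(
        customers_df,
        orders_df.cust_id == customers_df.cust_id,
        "right"  # keep every customer, including those without any order
    )
    # Both inputs contribute a "cust_id" column; dropping by name removes both
    .drop("cust_id")
)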

join_df.show()

+----+--------------+-------+--------+--------+
|  id|       product|  price|   fname|   lname|
+----+--------------+-------+--------+--------+
|   7|        IPhone|  69000|  Projna| Kabiraj|
|   3|         Phone|  15000|  Projna| Kabiraj|
|   6|Dressing Table|  15000|  Nitika|  Sarkar|
|   4|        Laptop|  45000|  Nitika|  Sarkar|
|   2|        Scooty|  80000|  Nitika|  Sarkar|
|null|          null|   null|   Tuhin|  Sarkar|
|   5|           Car|1000000|    Atul|   Kumar|
|null|          null|   null|  Sagnik|Bairagya|
|   1|        Laptop|  35000|Debanjan|  Sarkar|
+----+--------------+-------+--------+--------+



### Writing to SQL Server:
---------------------------------

In [25]:
(
    join_df.write
    .format("jdbc")
    .option("url", CONNECTION_URL)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", "pyspark_tbl")
    .option("user", USERNAME)
    .option("password", PASSWORD)
    .save()
)
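
The write above does not set a save mode, so Spark uses its default (`errorifexists`), and running the cell a second time fails once `pyspark_tbl` exists. A minimal sketch of making the mode explicit follows; `write_table` is a hypothetical wrapper, and it assumes `options` is a dictionary holding the same `url`, `driver`, `dbtable`, `user` and `password` values used above.

```python
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_table(df, options, mode="append"):
    """Hypothetical wrapper: write a DataFrame over JDBC with an explicit save mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported save mode: {mode!r}")
    # Same chain as the cell above, plus .mode(...)
    df.write.format("jdbc").options(**options).mode(mode).save()
```

With `mode="append"` the rows are added to the existing table. With `mode="overwrite"` Spark by default drops and recreates the table, so that mode should be chosen deliberately.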