# Create DataFrame from Data source (RDBMS)

## Introduction
We can create a PySpark SQL DataFrame from different relational databases, such as MySQL, Oracle, and PostgreSQL. In this lab, we will explore how to read data from a MySQL-compatible database (MariaDB) using a JDBC connection. We will use the “classicmodels” database in this example.

## Setup Session

In [1]:
from pyspark.sql import SparkSession

In [2]:
session = SparkSession.builder.master('local[*]').appName('Test SQL app').getOrCreate()
# NOTE
# .master() sets what Spark connects to: a local mode string or a cluster manager URL
# 'local[*]' ==> run locally with as many worker threads as there are logical cores
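
The session above assumes the MySQL JDBC driver is already on Spark's classpath. If it is not, one possible approach is to let Spark fetch the driver through the `spark.jars.packages` setting. This is a sketch, not the notebook's own setup: `build_session` and `MYSQL_CONNECTOR` are hypothetical names, and the Maven coordinate and version are assumptions to adjust for your environment.

```python
# Assumed Maven coordinate for MySQL Connector/J; change the version as needed.
MYSQL_CONNECTOR = "com.mysql:mysql-connector-j:8.0.33"


def build_session(app_name="Test SQL app", master="local[*]",
                  packages=MYSQL_CONNECTOR):
    """Create (or reuse) a SparkSession that asks Spark to download the JDBC driver."""
    # Imported here so the helper can be defined before pyspark is needed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master)
        .appName(app_name)
        # spark.jars.packages takes comma-separated Maven coordinates
        .config("spark.jars.packages", packages)
        .getOrCreate()
    )
```

Because `getOrCreate()` reuses a running session, this setting only takes effect when the session is first created, for example `session = build_session()` in a fresh kernel.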


## Reading from SQLDB
### read.format().options() with the JDBC API

In [8]:
from super_secret_password import PASSWORD
sql_db_df = session.read.format("jdbc").options(
    driver="com.mysql.cj.jdbc.Driver",
    user="root",
    password=PASSWORD,  # tfw you forgot your password 👀
    url="jdbc:mysql://localhost:3306/classicmodels",
    dbtable="classicmodels.orders").load()  
    # Note that we only loaded the orders table to the df ⭐

# NOTE - JDBC
# JDBC stands for the 'Java Database Connectivity' API
# format("jdbc") selects Spark's JDBC data source, and the keyword arguments
# passed to options() supply the connection details (driver class, credentials, URL, table)
# .load() returns the table named in dbtable as a DataFrame
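
Every read in this lab passes the same five options, so one way to cut the repetition is a small helper. The sketch below is not part of the original notebook: `jdbc_options` and `read_jdbc` are hypothetical names, and the defaults simply mirror the values used in the cell above.

```python
def jdbc_options(dbtable, password, user="root",
                 url="jdbc:mysql://localhost:3306/classicmodels"):
    """Return the keyword options used by the JDBC reads in this lab."""
    return {
        "driver": "com.mysql.cj.jdbc.Driver",
        "user": user,
        "password": password,
        "url": url,
        "dbtable": dbtable,  # a table name or an aliased subquery
    }


def read_jdbc(session, dbtable, password):
    """Load a table (or aliased subquery) into a DataFrame over JDBC."""
    options = jdbc_options(dbtable, password)
    return session.read.format("jdbc").options(**options).load()
```

With this in place, the read above could be written as `sql_db_df = read_jdbc(session, "classicmodels.orders", PASSWORD)`.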





In [9]:
sql_db_df.show(2)

+-----------+----------+------------+-----------+-------+--------------------+--------------+
|orderNumber| orderDate|requiredDate|shippedDate| status|            comments|customerNumber|
+-----------+----------+------------+-----------+-------+--------------------+--------------+
|      10100|2003-01-06|  2003-01-13| 2003-01-10|Shipped|                null|           363|
|      10101|2003-01-09|  2003-01-18| 2003-01-11|Shipped|Check on availabi...|           128|
+-----------+----------+------------+-----------+-------+--------------------+--------------+
only showing top 2 rows



### Queries as methods: An example with count()

In [10]:
sql_db_df.count()  # number of rows

326
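
`count()` is one of several DataFrame methods that stand in for SQL. As a further illustration, the sketch below filters orders by customer with `filter()` and `Column.isin()`, roughly the equivalent of `WHERE customerNumber IN (...)`. The function name `orders_for_customers` is hypothetical; it assumes a DataFrame with a `customerNumber` column, such as `sql_db_df`.

```python
def orders_for_customers(df, customer_numbers):
    """Keep only the rows whose customerNumber is in customer_numbers."""
    # Column.isin builds the equivalent of SQL's `customerNumber IN (...)`
    return df.filter(df["customerNumber"].isin(*customer_numbers))
```

For example, `orders_for_customers(sql_db_df, [144, 128]).show()` would display the orders for those two customers without writing a separate SQL query.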

### Queries Built into the DataFrame

In [12]:
query = """
    (SELECT *
    FROM orders
    WHERE customerNumber = 144) as Customer
"""

sql_db_df_custom = session.read.format("jdbc").options(
    driver="com.mysql.cj.jdbc.Driver",
    user="root",
    password=PASSWORD,  
    url="jdbc:mysql://localhost:3306/classicmodels",
    dbtable=query).load()
# the query must be wrapped in () and given an alias: Spark places the dbtable value
# in the FROM clause of the SELECT it generates, and MySQL requires an alias on a derived table
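
Since `dbtable` needs a parenthesized, aliased subquery, a tiny helper can do the wrapping so the SQL itself stays a plain SELECT. This is only a sketch: `as_dbtable` is a hypothetical name, and it does no validation of the SQL it is given.

```python
def as_dbtable(select_sql, alias="subquery"):
    """Wrap a SELECT statement so it can be passed as the JDBC dbtable option."""
    # Drop surrounding whitespace and any trailing semicolon before wrapping
    inner = select_sql.strip().rstrip(";").strip()
    return f"({inner}) AS {alias}"


query = as_dbtable(
    "SELECT * FROM orders WHERE customerNumber = 144;", alias="Customer"
)
```

The resulting string can be passed as `dbtable=query`, exactly as in the cell above. In Spark 2.4 and later, the JDBC source also accepts a `query` option in place of `dbtable`, which takes a bare SELECT and adds the parentheses and alias itself.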

In [13]:
sql_db_df_custom.show()
# this read is independent of the earlier one: sql_db_df is still available and both DataFrames can be shown

+-----------+----------+------------+-----------+-------+--------------------+--------------+
|orderNumber| orderDate|requiredDate|shippedDate| status|            comments|customerNumber|
+-----------+----------+------------+-----------+-------+--------------------+--------------+
|      10112|2003-03-24|  2003-04-03| 2003-03-29|Shipped|Customer requeste...|           144|
|      10320|2004-11-03|  2004-11-13| 2004-11-07|Shipped|                null|           144|
|      10326|2004-11-09|  2004-11-16| 2004-11-10|Shipped|                null|           144|
|      10334|2004-11-19|  2004-11-28|       null|On Hold|The outstaniding ...|           144|
+-----------+----------+------------+-----------+-------+--------------------+--------------+



In [20]:
query_2 = """
    (SELECT *
    FROM orders
    WHERE customerNumber = 144 OR customerNumber = 128) as Customer
"""

sql_db_df_custom_2 = session.read.format("jdbc").options(
    driver="com.mysql.cj.jdbc.Driver",
    user="root",
    password=PASSWORD,  
    url="jdbc:mysql://localhost:3306/classicmodels",
    dbtable=query_2).load()  

In [21]:
sql_db_df_custom_2.show()

+-----------+----------+------------+-----------+-------+--------------------+--------------+
|orderNumber| orderDate|requiredDate|shippedDate| status|            comments|customerNumber|
+-----------+----------+------------+-----------+-------+--------------------+--------------+
|      10101|2003-01-09|  2003-01-18| 2003-01-11|Shipped|Check on availabi...|           128|
|      10230|2004-03-15|  2004-03-24| 2004-03-20|Shipped|Customer very con...|           128|
|      10300|2003-10-04|  2003-10-13| 2003-10-09|Shipped|                null|           128|
|      10323|2004-11-05|  2004-11-12| 2004-11-09|Shipped|                null|           128|
|      10112|2003-03-24|  2003-04-03| 2003-03-29|Shipped|Customer requeste...|           144|
|      10320|2004-11-03|  2004-11-13| 2004-11-07|Shipped|                null|           144|
|      10326|2004-11-09|  2004-11-16| 2004-11-10|Shipped|                null|           144|
|      10334|2004-11-19|  2004-11-28|       null|On Hold|The