**More About the DataFrame Reader API:**

In [1]:
import findspark
findspark.init()
import getpass
from pyspark.sql import SparkSession

username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config("spark.sql.catalogImplementation", "hive"). \
    config("spark.sql.warehouse.dir",f"/Users/{username}/Documents/data/warehouse"). \
    enableHiveSupport(). \
    master("local"). \
    getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/25 13:36:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [22]:
orders_df = spark.read \
.format("csv") \
.option("header", "true") \
.option("inferSchema", "true") \
.load("/Users/sugumarsrinivasan/Documents/data/orders.csv")

In [23]:
orders_df.show(5)

+--------+-------------------+-----------+------------+
|order_id|         order_date|customer_id|order_status|
+--------+-------------------+-----------+------------+
|       1|2013-07-27 00:00:00|      30265|      CLOSED|
|       2|2013-11-25 00:00:00|      20386|      CLOSED|
|       3|2014-01-21 00:00:00|      15768|    COMPLETE|
|       4|2014-07-04 00:00:00|      27181|  PROCESSING|
|       5|2014-03-08 00:00:00|      12448|    COMPLETE|
+--------+-------------------+-----------+------------+
only showing top 5 rows



PySpark's `DataFrameReader` (available as `spark.read`) provides shortcut methods for reading common sources such as CSV, JSON, ORC, Parquet, JDBC, and catalog tables. Each method loads the data into a Spark DataFrame for further processing.

Let's go through each of these formats and the methods to read them:

**1. CSV (Comma Separated Values)**

The csv() method is used to read CSV files. You can provide options such as specifying delimiters, header row, schema, and more.

**Syntax:**

```
df = spark.read.csv("path_to_file.csv", header=True, inferSchema=True)
```

**Key Options:**
-   **header=True:** Tells Spark to use the first row as column names.
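-   **inferSchema=True:** Infers column data types from the data, which requires an extra pass over the file. Without it, every column is read as a string.
-   **sep=<delimiter>:** Specifies a custom delimiter (default is comma).
-   **quote="<quote character>":** Sets the character used to quote values that contain the delimiter (default is `"`).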

**Example:**

```
df = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)
df.show()
```

**2. JSON (JavaScript Object Notation)**

The json() method is used to read JSON files, which are commonly used for data exchange.

**Syntax:**

```
df = spark.read.json("path_to_file.json")
```

**Key Options:**

-   **multiLine=True:** Allows a single JSON record to span multiple lines (useful for pretty-printed JSON). By default, Spark expects one JSON object per line.
-   **primitivesAsString=True:** Infers all primitive values (numbers, booleans) as strings.

**Example:**

```
df = spark.read.json("/path/to/file.json")
df.show()
```

**3. ORC (Optimized Row Columnar)**

ORC is a columnar storage format that provides high compression and high performance.

**Syntax:**

```
df = spark.read.orc("path_to_file.orc")
```

**Key Options:**
-   Usually no options are needed, because ORC files store their own schema. ORC is efficient for both reading and writing, especially in a Hive or Spark SQL environment.

**Example:**

```
df = spark.read.orc("/path/to/file.orc")
df.show()
```

**4. Parquet**

Parquet is a popular columnar storage format that is optimized for large-scale data processing with Spark. It's very efficient for storage and query performance.

**Syntax:**

```
df = spark.read.parquet("path_to_file.parquet")
```

**Key Options:**
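-   **mergeSchema=True:** Merges the schemas of all Parquet files being read. It is off by default because schema merging is relatively expensive.
-   **datetimeRebaseMode="CORRECTED":** Controls how DATE and TIMESTAMP values written with the legacy hybrid calendar are rebased on read (`EXCEPTION`, `CORRECTED`, or `LEGACY`).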

**Example:**

```
df = spark.read.parquet("/path/to/file.parquet")
df.show()
```

**5. Tables (Spark SQL / Hive tables)**

You can read data directly from Spark SQL tables (either from the default Spark catalog or Hive).

**Syntax:**

```
df = spark.read.table("table_name")
```

-   This reads a table as a DataFrame. If you're working with Hive, Spark will connect to the Hive Metastore to retrieve the table schema and data.

**Example:**

```
df = spark.read.table("my_table")
df.show()
```

**6. JDBC (Java Database Connectivity)**

To read data from relational databases such as MySQL, PostgreSQL, SQL Server, etc., you can use the jdbc() method.

**Syntax:**

```
df = spark.read.jdbc(url="jdbc:<database_url>", table="<table_name>", properties={"user": "<username>", "password": "<password>"})
```

**Key Options:**

-   **url:** JDBC URL to connect to the database.
-   **table:** The name of the table you want to read from.
-   **properties:** Properties like username and password for connecting to the database.

**Example (Reading from MySQL):**

```
jdbc_url = "jdbc:mysql://localhost:3306/mydatabase"
properties = {"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"}

df = spark.read.jdbc(url=jdbc_url, table="my_table", properties=properties)
df.show()
```

**Summary Table:**

| **Format** | **Method**             | **Use Case**                                                     | **Key Options**                         |
|------------|------------------------|------------------------------------------------------------------|-----------------------------------------|
| CSV        | `spark.read.csv()`     | For reading CSV files                                            | `header`, `inferSchema`, `sep`, `quote` |
| JSON       | `spark.read.json()`    | For reading JSON files                                           | `multiLine`, `primitivesAsString`       |
| ORC        | `spark.read.orc()`     | Efficient columnar format, often used with Hive                  | -                                       |
| Parquet    | `spark.read.parquet()` | Optimized columnar storage format, fast for queries              | `mergeSchema`, `datetimeRebaseMode`     |
| Tables     | `spark.read.table()`   | Read from Spark SQL or Hive tables                               | -                                       |
| JDBC       | `spark.read.jdbc()`    | For connecting to relational databases (e.g., MySQL, PostgreSQL) | `url`, `table`, `properties`            |


In [None]:
orders_df = spark.read.csv("/Users/sugumarsrinivasan/Documents/data/orders.csv", header="true", inferSchema="true")

In [17]:
orders_df.show(5)

+-----+-----------------------------+---+---------------+
|40322|2014-04-08T00:00:00.000+05:30|  1|PENDING_PAYMENT|
+-----+-----------------------------+---+---------------+
|35390|          2014-05-20 00:00:00|  2|       COMPLETE|
|29108|          2013-10-14 00:00:00|  3|PENDING_PAYMENT|
|34419|          2014-01-04 00:00:00|  4|         CLOSED|
| 9936|          2014-02-16 00:00:00|  5|PENDING_PAYMENT|
|41598|          2014-05-02 00:00:00|  6|       COMPLETE|
+-----+-----------------------------+---+---------------+
only showing top 5 rows



In [None]:
orders_df = spark.read.json("/Users/sugumarsrinivasan/Documents/data/orders.json")

In [9]:
orders_df.show(5)

+-----------+--------------------+--------+---------------+
|customer_id|          order_date|order_id|   order_status|
+-----------+--------------------+--------+---------------+
|      40322|2014-04-08T00:00:...|       1|PENDING_PAYMENT|
|      35390|2014-05-20T00:00:...|       2|       COMPLETE|
|      29108|2013-10-14T00:00:...|       3|PENDING_PAYMENT|
|      34419|2014-01-04T00:00:...|       4|         CLOSED|
|       9936|2014-02-16T00:00:...|       5|PENDING_PAYMENT|
+-----------+--------------------+--------+---------------+
only showing top 5 rows



In [10]:
orders_df = spark.read.orc("/Users/sugumarsrinivasan/Documents/data/orders.orc")

In [11]:
orders_df.show(5)

+-----------+--------------------+--------+---------------+
|customer_id|          order_date|order_id|   order_status|
+-----------+--------------------+--------+---------------+
|      40322|2014-04-08T00:00:...|       1|PENDING_PAYMENT|
|      35390|2014-05-20T00:00:...|       2|       COMPLETE|
|      29108|2013-10-14T00:00:...|       3|PENDING_PAYMENT|
|      34419|2014-01-04T00:00:...|       4|         CLOSED|
|       9936|2014-02-16T00:00:...|       5|PENDING_PAYMENT|
+-----------+--------------------+--------+---------------+
only showing top 5 rows



In [12]:
orders_df = spark.read.parquet("/Users/sugumarsrinivasan/Documents/data/orders.parquet")

In [13]:
orders_df.show(5)

+-----------+--------------------+--------+---------------+
|customer_id|          order_date|order_id|   order_status|
+-----------+--------------------+--------+---------------+
|      40322|2014-04-08T00:00:...|       1|PENDING_PAYMENT|
|      35390|2014-05-20T00:00:...|       2|       COMPLETE|
|      29108|2013-10-14T00:00:...|       3|PENDING_PAYMENT|
|      34419|2014-01-04T00:00:...|       4|         CLOSED|
|       9936|2014-02-16T00:00:...|       5|PENDING_PAYMENT|
+-----------+--------------------+--------+---------------+
only showing top 5 rows



In [3]:
filtered_df = orders_df.where("customer_id = 35390")

In [7]:
filtered_df.show(5, truncate = False)

+--------+-------------------+-----------+---------------+
|order_id|order_date         |customer_id|order_status   |
+--------+-------------------+-----------+---------------+
|2       |2014-05-20 00:00:00|35390      |COMPLETE       |
|20043   |2013-12-30 00:00:00|35390      |CLOSED         |
|70178   |2013-10-22 00:00:00|35390      |CLOSED         |
|107696  |2013-09-07 00:00:00|35390      |COMPLETE       |
|282162  |2014-07-07 00:00:00|35390      |PENDING_PAYMENT|
+--------+-------------------+-----------+---------------+
only showing top 5 rows



In [8]:
filtered_df = orders_df.filter("customer_id = 40322")

In [10]:
filtered_df.show(5,truncate = False)

+--------+-------------------+-----------+---------------+
|order_id|order_date         |customer_id|order_status   |
+--------+-------------------+-----------+---------------+
|1       |2014-04-08 00:00:00|40322      |PENDING_PAYMENT|
|48390   |2013-09-17 00:00:00|40322      |PENDING_PAYMENT|
|87380   |2014-01-08 00:00:00|40322      |PENDING_PAYMENT|
|90486   |2013-09-12 00:00:00|40322      |PROCESSING     |
|92377   |2013-09-26 00:00:00|40322      |CLOSED         |
+--------+-------------------+-----------+---------------+
only showing top 5 rows



In PySpark, the `createOrReplaceTempView()` method registers a DataFrame as a temporary SQL view so that you can run SQL queries against it. The view exists only for the lifetime of the Spark session that created it: it disappears when the session ends, or earlier if you drop it explicitly with `spark.catalog.dropTempView()`.

**Key Points:**

-   **Temporary View:** The view is temporary in the sense that it is only available during the current Spark session.
-   **SQL Queries:** Once the DataFrame is registered as a temporary view, you can use spark.sql() to run SQL queries on it, just like any regular SQL table.
-   **Replacement:** If a view with the same name already exists, calling createOrReplaceTempView() will replace the old view with the new DataFrame.
-   **Global vs. Local:** This creates a local temporary view that is visible only within the current Spark session. If you need a view that other sessions of the same Spark application can see, use createOrReplaceGlobalTempView() and query it through the `global_temp` database.

**When to Use createOrReplaceTempView:**

-   **SQL Queries:** If you prefer to use SQL syntax to query your DataFrame instead of using PySpark DataFrame operations, createOrReplaceTempView() allows you to do that.

-   **Intermediate Views:** It's useful for registering intermediate views in a pipeline where multiple transformations happen, and you need to query specific stages using SQL.

-   **Replacing Views:** If you want to refresh or modify the structure of the view dynamically, you can replace the previous view with a new one using createOrReplaceTempView().

**Important Considerations:**

-   **Scope:** The view is only accessible within the current session. A view created with createOrReplaceGlobalTempView() is shared across sessions, but it still ends with the Spark application; to keep data beyond that, write it to a table.

-   **Performance:** A temporary view is only a name for the DataFrame's query plan. It does not copy or materialize the data, so each query against it re-runs that plan, including reading from the source, unless the DataFrame or view has been cached. Complex SQL queries over large DataFrames can therefore still be slow.

**Conclusion:**
createOrReplaceTempView() is a powerful method that helps bridge the gap between PySpark DataFrame operations and SQL operations. It makes it easy to query DataFrames using SQL, leveraging the power of Spark SQL.

In [24]:
orders_df.createOrReplaceTempView("orders")

In [13]:
filtered_df = spark.sql("select * from orders where order_status = 'CLOSED'")

In [14]:
filtered_df.show()

+--------+-------------------+-----------+------------+
|order_id|         order_date|customer_id|order_status|
+--------+-------------------+-----------+------------+
|       4|2014-01-04 00:00:00|      34419|      CLOSED|
|       7|2014-04-23 00:00:00|       4914|      CLOSED|
|       8|2013-10-02 00:00:00|      36928|      CLOSED|
|      18|2013-10-06 00:00:00|       1921|      CLOSED|
|      25|2014-05-08 00:00:00|      34902|      CLOSED|
|      35|2014-03-18 00:00:00|      15715|      CLOSED|
|      36|2013-08-16 00:00:00|      10644|      CLOSED|
|      45|2013-09-06 00:00:00|      28985|      CLOSED|
|      46|2013-10-03 00:00:00|      45110|      CLOSED|
|      50|2014-06-05 00:00:00|      35405|      CLOSED|
|      55|2014-01-01 00:00:00|      21949|      CLOSED|
|      65|2013-07-29 00:00:00|      49936|      CLOSED|
|      66|2014-04-22 00:00:00|      19425|      CLOSED|
|      73|2013-10-01 00:00:00|      41501|      CLOSED|
|      74|2014-05-19 00:00:00|      21753|      

In [15]:
orders_df = spark.read.table("orders")

In [16]:
orders_df.show(5)

+--------+-------------------+-----------+---------------+
|order_id|         order_date|customer_id|   order_status|
+--------+-------------------+-----------+---------------+
|       1|2014-04-08 00:00:00|      40322|PENDING_PAYMENT|
|       2|2014-05-20 00:00:00|      35390|       COMPLETE|
|       3|2013-10-14 00:00:00|      29108|PENDING_PAYMENT|
|       4|2014-01-04 00:00:00|      34419|         CLOSED|
|       5|2014-02-16 00:00:00|       9936|PENDING_PAYMENT|
+--------+-------------------+-----------+---------------+
only showing top 5 rows



In [7]:
spark.sql("create database if not exists retail")

24/12/25 13:37:56 WARN ObjectStore: Failed to get database retail, returning NoSuchObjectException
24/12/25 13:37:56 WARN ObjectStore: Failed to get database retail, returning NoSuchObjectException
24/12/25 13:37:56 WARN ObjectStore: Failed to get database retail, returning NoSuchObjectException


DataFrame[]

In [8]:
spark.sql("show databases").show()

+---------+
|namespace|
+---------+
|  default|
|   retail|
+---------+



In [9]:
spark.sql("show databases").filter("namespace = 'retail'").show()

+---------+
|namespace|
+---------+
|   retail|
+---------+



In [10]:
spark.sql("show databases").filter("namespace like 'retail%'").show()

+---------+
|namespace|
+---------+
|   retail|
+---------+



In [18]:
spark.sql("show tables").show()

+---------+----------+-----------+
|namespace| tableName|isTemporary|
+---------+----------+-----------+
|   retail|orders_tbl|      false|
+---------+----------+-----------+



In [12]:
spark.sql("use retail")

DataFrame[]

In [25]:
spark.sql("show tables").show()

+---------+----------+-----------+
|namespace| tableName|isTemporary|
+---------+----------+-----------+
|   retail|orders_tbl|      false|
|         |    orders|       true|
+---------+----------+-----------+



In [16]:
spark.sql("create table if not exists orders_tbl (order_id integer, order_date string, customer_id integer, order_status string)")

24/12/25 13:39:28 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


DataFrame[]

In [27]:
spark.sql("insert into orders_tbl select * from orders")

DataFrame[]

In [28]:
spark.sql("select * from orders_tbl limit 10").show()

+--------+-------------------+-----------+---------------+
|order_id|         order_date|customer_id|   order_status|
+--------+-------------------+-----------+---------------+
|       1|2013-07-27 00:00:00|      30265|         CLOSED|
|       2|2013-11-25 00:00:00|      20386|         CLOSED|
|       3|2014-01-21 00:00:00|      15768|       COMPLETE|
|       4|2014-07-04 00:00:00|      27181|     PROCESSING|
|       5|2014-03-08 00:00:00|      12448|       COMPLETE|
|       6|2014-07-20 00:00:00|      49340|         CLOSED|
|       7|2013-12-14 00:00:00|      13801|     PROCESSING|
|       8|2014-04-23 00:00:00|      28523|PENDING_PAYMENT|
|       9|2014-01-07 00:00:00|      26329|         CLOSED|
|      10|2013-07-29 00:00:00|      38797|       COMPLETE|
+--------+-------------------+-----------+---------------+



In [29]:
spark.sql("describe table orders_tbl").show()

+------------+---------+-------+
|    col_name|data_type|comment|
+------------+---------+-------+
|    order_id|      int|   NULL|
|  order_date|   string|   NULL|
| customer_id|      int|   NULL|
|order_status|   string|   NULL|
+------------+---------+-------+



In [31]:
spark.sql("describe extended orders_tbl").show(truncate=False)

+----------------------------+---------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                  |comment|
+----------------------------+---------------------------------------------------------------------------+-------+
|order_id                    |int                                                                        |NULL   |
|order_date                  |string                                                                     |NULL   |
|customer_id                 |int                                                                        |NULL   |
|order_status                |string                                                                     |NULL   |
|                            |                                                                           |       |
|# Detailed Table Information|                                                  

In [32]:
spark.sql("describe formatted orders_tbl").show(truncate=False)

+----------------------------+---------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                  |comment|
+----------------------------+---------------------------------------------------------------------------+-------+
|order_id                    |int                                                                        |NULL   |
|order_date                  |string                                                                     |NULL   |
|customer_id                 |int                                                                        |NULL   |
|order_status                |string                                                                     |NULL   |
|                            |                                                                           |       |
|# Detailed Table Information|                                                  

In [34]:
spark.sql("drop table orders_tbl")

DataFrame[]

In [None]:
spark.sql("describe extended orders_tbl").show()