**DataFrame in PySpark:**

A DataFrame in PySpark is a distributed collection of data organized into named columns, and it's similar to a table in a relational database or a data frame in Pandas. The core idea behind a DataFrame is to provide a higher-level abstraction over RDDs (Resilient Distributed Datasets), which allows for more expressive operations (like SQL-like queries) while maintaining scalability and performance.

Here are some key features of a PySpark DataFrame:

-   **Schema-based:** DataFrames have a schema, which means the data is structured with columns that have types, similar to a relational database table.

-   **Distributed:** DataFrames are distributed across a cluster, and Spark will automatically manage parallelization and fault tolerance.

-   **Optimized Execution:** PySpark DataFrames use Spark’s Catalyst query optimizer to optimize query execution plans.

-   **Interoperability:** DataFrames can be created from various sources such as CSV files, Parquet files, and databases, including files stored on distributed storage (HDFS, S3, etc.).

-   **High-level API:** The API is more user-friendly and higher-level than working directly with RDDs.

**Example of creating a DataFrame:**

```
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("Example").getOrCreate()

# Create a DataFrame from a list of tuples
data = [("Alice", 29), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

df.show()
```

**Example Output:**
```
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 29|
|    Bob| 30|
|Charlie| 35|
+-------+---+
```

**Spark SQL:**

Spark SQL is a Spark module for structured data processing. It allows you to run SQL queries on data stored in a variety of formats, including DataFrames, tables, and external sources like Hive or relational databases.

Spark SQL integrates relational data processing with Spark’s functional programming API, allowing users to execute SQL queries along with transformations and actions on DataFrames and Datasets.

Key features of Spark SQL:

-   **SQL Queries:** You can run standard SQL queries on data in DataFrames or external sources (like Hive).

-   **Unified Data Access:** You can use the same DataFrame API or Spark SQL to query different data sources (e.g., HDFS, S3, Parquet, JDBC).

-   **Catalyst Optimizer:** Spark SQL uses the Catalyst optimizer to optimize query execution for better performance.

**Example of running SQL queries with Spark SQL:**

**1. Register a DataFrame as a temporary view:**

```
# Create a DataFrame
data = [("Alice", 29), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Register the DataFrame as a temporary SQL view
df.createOrReplaceTempView("people")
```

**2. Run a SQL query:**

```
# Execute SQL query
sql_df = spark.sql("SELECT * FROM people WHERE Age > 30")

# Show results
sql_df.show()
```

**Example Output:**

```
+-------+---+
|   Name|Age|
+-------+---+
|Charlie| 35|
+-------+---+
```

**Key Differences:**

-   **DataFrame API:** Provides a programmatic way to work with data using functions like filter(), select(), groupBy(), etc.

-   **Spark SQL:** Allows you to express queries in SQL syntax and can be used in combination with the DataFrame API.



| Feature                  | Spark SQL (SQL API)                                    | PySpark (Python API)                                    |
|--------------------------|--------------------------------------------------------|---------------------------------------------------------|
| **Definition**            | A DataFrame in Spark SQL is a distributed collection of data organized into named columns, similar to a table in a relational database. | A DataFrame in PySpark is a distributed collection of data organized into named columns, similar to a table in a relational database, but accessed via Python code. |
| **Language**              | SQL-based interface for querying data.                | Python-based interface for querying data using PySpark functions. |
| **Creation**              | Typically created by registering a view (`CREATE TEMPORARY VIEW` in SQL, or `df.createOrReplaceTempView()` from Python) and querying it; `spark.sql()` returns a DataFrame. | Created with `spark.read`, `spark.createDataFrame()`, or by loading external data, and then manipulated with PySpark’s Python functions. |
| **Querying**              | Queries are written in SQL and run against tables or views, such as a DataFrame registered as a temporary view. | Queries are executed using PySpark’s DataFrame API (e.g., `filter()`, `select()`, `groupBy()`) or SQL queries via `spark.sql()`. |
| **Operations**            | Operations are written as SQL queries, like `SELECT`, `JOIN`, `GROUP BY`, `ORDER BY`, etc. | Operations are written using PySpark DataFrame API methods, like `.select()`, `.filter()`, `.join()`, `.agg()`, etc. |
| **Interactivity**         | You interact with DataFrames in Spark SQL by running SQL queries through the `spark.sql()` interface. | You interact with DataFrames in PySpark using Python code and PySpark API functions. |
| **Performance**           | Spark SQL queries are optimized by the Catalyst optimizer. | DataFrame API calls are optimized by the same Catalyst optimizer. Built-in operations run in the JVM, so extra overhead mainly arises when Python UDFs are used. |
| **Syntax**                | Uses SQL syntax, e.g., `SELECT * FROM df WHERE age > 30`. | Uses Python syntax with PySpark functions, e.g., `df.filter(df.age > 30)`. |
| **Integration with Other Libraries** | The result of `spark.sql()` is a DataFrame, so it can be passed on to other Spark components such as MLlib. | PySpark DataFrame API integrates easily with other Python libraries like Pandas, NumPy, and SciPy for further analysis. |
| **Compatibility**         | Best suited for users familiar with SQL or databases.   | Ideal for users comfortable with Python and the PySpark API. |
| **Execution Environment** | Runs inside a Spark session, typically via `spark.sql()`. | Runs inside a PySpark session using Python's interactive shell or scripts. |


**When to use which?**

-   **Use DataFrame API** when you want to build queries programmatically in Python, compose transformations step by step, and reuse them as ordinary functions.

-   **Use Spark SQL** if you're more comfortable with SQL or if you're working in a scenario where SQL queries are a more natural fit, especially when querying data sources like Hive or JDBC.

Both DataFrame API and Spark SQL are optimized by the Catalyst Optimizer, so performance-wise, there is little difference between the two. It largely depends on the preference of the user (SQL vs. programmatic interface).

In [1]:
import findspark
findspark.init()
import getpass
from pyspark.sql import SparkSession

username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config("spark.sql.warehouse.dir",f"/Users/{username}/Documents/data/warehouse"). \
    master("local"). \
    getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/25 09:07:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/12/25 09:07:41 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


**read:** `spark.read` (a property, not a method call) returns a `DataFrameReader`, which is used to read data from files and load it into a DataFrame.

In [None]:
orders_df = spark.read \
.format("csv") \
.option("header", "true") \
.option("inferSchema", "true") \
.load("/Users/sugumarsrinivasan/Documents/data/sample_orders_1GB.csv")

**show():**

-   It prints the contents of a DataFrame.
-   show(5) --> It prints the first 5 rows of the DataFrame.
-   show() --> It prints the first 20 rows if you don't pass a number.

In [3]:
orders_df.show()

+--------+-------------------+-----------+---------------+
|order_id|         order_date|customer_id|   order_status|
+--------+-------------------+-----------+---------------+
|       1|2014-04-08 00:00:00|      40322|PENDING_PAYMENT|
|       2|2014-05-20 00:00:00|      35390|       COMPLETE|
|       3|2013-10-14 00:00:00|      29108|PENDING_PAYMENT|
|       4|2014-01-04 00:00:00|      34419|         CLOSED|
|       5|2014-02-16 00:00:00|       9936|PENDING_PAYMENT|
|       6|2014-05-02 00:00:00|      41598|       COMPLETE|
|       7|2014-04-23 00:00:00|       4914|         CLOSED|
|       8|2013-10-02 00:00:00|      36928|         CLOSED|
|       9|2013-11-29 00:00:00|      17318|     PROCESSING|
|      10|2013-10-09 00:00:00|      46757|     PROCESSING|
|      11|2013-11-16 00:00:00|      34795|PENDING_PAYMENT|
|      12|2013-12-18 00:00:00|      34931|PENDING_PAYMENT|
|      13|2014-02-01 00:00:00|      32229|       COMPLETE|
|      14|2014-05-10 00:00:00|       1438|     PROCESSIN

In [4]:
orders_df.show(5)

+--------+-------------------+-----------+---------------+
|order_id|         order_date|customer_id|   order_status|
+--------+-------------------+-----------+---------------+
|       1|2014-04-08 00:00:00|      40322|PENDING_PAYMENT|
|       2|2014-05-20 00:00:00|      35390|       COMPLETE|
|       3|2013-10-14 00:00:00|      29108|PENDING_PAYMENT|
|       4|2014-01-04 00:00:00|      34419|         CLOSED|
|       5|2014-02-16 00:00:00|       9936|PENDING_PAYMENT|
+--------+-------------------+-----------+---------------+
only showing top 5 rows



**printSchema():** Used to print the schema of a dataframe

In [5]:
orders_df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: timestamp (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



**withColumnRenamed():** Used to rename an existing column in the DataFrame. It returns a new DataFrame; the original is left unchanged.

In [6]:
transformed_df1 = orders_df.withColumnRenamed("order_status", "status")

In [7]:
transformed_df1.show()

+--------+-------------------+-----------+---------------+
|order_id|         order_date|customer_id|         status|
+--------+-------------------+-----------+---------------+
|       1|2014-04-08 00:00:00|      40322|PENDING_PAYMENT|
|       2|2014-05-20 00:00:00|      35390|       COMPLETE|
|       3|2013-10-14 00:00:00|      29108|PENDING_PAYMENT|
|       4|2014-01-04 00:00:00|      34419|         CLOSED|
|       5|2014-02-16 00:00:00|       9936|PENDING_PAYMENT|
|       6|2014-05-02 00:00:00|      41598|       COMPLETE|
|       7|2014-04-23 00:00:00|       4914|         CLOSED|
|       8|2013-10-02 00:00:00|      36928|         CLOSED|
|       9|2013-11-29 00:00:00|      17318|     PROCESSING|
|      10|2013-10-09 00:00:00|      46757|     PROCESSING|
|      11|2013-11-16 00:00:00|      34795|PENDING_PAYMENT|
|      12|2013-12-18 00:00:00|      34931|PENDING_PAYMENT|
|      13|2014-02-01 00:00:00|      32229|       COMPLETE|
|      14|2014-05-10 00:00:00|       1438|     PROCESSIN

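-   **withColumn():** Used to add a new column to the DataFrame, or to replace an existing column when the given name already exists.
-   **to_timestamp():** Used to convert a column (typically a string) to the timestamp type. Here `order_date` was already inferred as a timestamp, so the new column has the same type.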

In [8]:
from pyspark.sql.functions import to_timestamp
transformed_df2 = transformed_df1.withColumn("orders_date_new", to_timestamp("order_date"))

In [9]:
transformed_df2.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: timestamp (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- status: string (nullable = true)
 |-- orders_date_new: timestamp (nullable = true)

