# Running SQL Queries - Practice Notebook

This notebook covers **Running SQL Queries Programmatically** and **Global Temporary Views** from the [Spark SQL Getting Started Guide](https://spark.apache.org/docs/latest/sql-getting-started.html).

## Learning Objectives
- Register DataFrames as temporary views
- Execute SQL queries using `spark.sql()`
- Understand the difference between temporary and global temporary views
- Compare DataFrame API vs SQL syntax
- Practice complex SQL queries

## Sections
0. **Setup and Data Preparation**
1. **Creating Temporary Views**
2. **Running Basic SQL Queries**
3. **Global Temporary Views**
4. **DataFrame API vs SQL Comparison**
5. **Complex SQL Queries**
6. **Practice Exercises**

---


In [1]:
# Setup
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create SparkSession
spark = SparkSession.builder.appName("SQL Queries Practice").getOrCreate()

# Create sample datasets
employees_data = [
    (1, "Alice", "Engineering", 75000, "2020-01-15"),
    (2, "Bob", "Sales", 65000, "2019-03-20"),
    (3, "Charlie", "Engineering", 80000, "2018-06-10"),
    (4, "Diana", "Marketing", 70000, "2021-02-28"),
    (5, "Eve", "Sales", 68000, "2017-11-05"),
    (6, "Frank", "Engineering", 82000, "2020-09-12"),
]

departments_data = [
    ("Engineering", "Tech", "Alice Johnson"),
    ("Sales", "Business", "Bob Smith"),
    ("Marketing", "Business", "Charlie Brown"),
    ("HR", "Support", "Diana Prince"),
]

employees_df = spark.createDataFrame(
    employees_data, ["id", "name", "department", "salary", "hire_date"]
)
departments_df = spark.createDataFrame(
    departments_data, ["dept_name", "division", "manager"]
)

print("Employees DataFrame:")
employees_df.show()

print("Departments DataFrame:")
departments_df.show()

Employees DataFrame:
+---+-------+-----------+------+----------+
| id|   name| department|salary| hire_date|
+---+-------+-----------+------+----------+
|  1|  Alice|Engineering| 75000|2020-01-15|
|  2|    Bob|      Sales| 65000|2019-03-20|
|  3|Charlie|Engineering| 80000|2018-06-10|
|  4|  Diana|  Marketing| 70000|2021-02-28|
|  5|    Eve|      Sales| 68000|2017-11-05|
|  6|  Frank|Engineering| 82000|2020-09-12|
+---+-------+-----------+------+----------+

Departments DataFrame:
+-----------+--------+-------------+
|  dept_name|division|      manager|
+-----------+--------+-------------+
|Engineering|    Tech|Alice Johnson|
|      Sales|Business|    Bob Smith|
|  Marketing|Business|Charlie Brown|
|         HR| Support| Diana Prince|
+-----------+--------+-------------+



## 1. Creating Temporary Views

To run SQL queries on DataFrames, we first need to register them as temporary views.


In [5]:
# Register DataFrames as temporary views
employees_df.createOrReplaceTempView("employees")
departments_df.createOrReplaceTempView("departments")

print("Temporary views created successfully!")

# List all temporary views
print("\nCurrent temporary views:")
spark.catalog.listTables()

Temporary views created successfully!

Current temporary views:


[Table(name='departments', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='employees', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

In [6]:
departments_df.show()

+-----------+--------+-------------+
|  dept_name|division|      manager|
+-----------+--------+-------------+
|Engineering|    Tech|Alice Johnson|
|      Sales|Business|    Bob Smith|
|  Marketing|Business|Charlie Brown|
|         HR| Support| Diana Prince|
+-----------+--------+-------------+



In [7]:
employees_df.show()

+---+-------+-----------+------+----------+
| id|   name| department|salary| hire_date|
+---+-------+-----------+------+----------+
|  1|  Alice|Engineering| 75000|2020-01-15|
|  2|    Bob|      Sales| 65000|2019-03-20|
|  3|Charlie|Engineering| 80000|2018-06-10|
|  4|  Diana|  Marketing| 70000|2021-02-28|
|  5|    Eve|      Sales| 68000|2017-11-05|
|  6|  Frank|Engineering| 82000|2020-09-12|
+---+-------+-----------+------+----------+



## 2. Running Basic SQL Queries

Now we can run SQL queries using the `spark.sql()` method.


In [11]:
# Basic SELECT queries
print("1. Select all employees:")
result1 = spark.sql("SELECT * FROM employees")
result1.show()

print("\n2. Select specific columns:")
result2 = spark.sql("SELECT name, department, salary FROM employees")
result2.show()

print("\n3. Filter with WHERE clause:")
result3 = spark.sql("SELECT name, salary FROM employees WHERE salary > 70000")
result3.show()

print("\n4. Order by salary:")
result4 = spark.sql("SELECT name, salary FROM employees ORDER BY salary DESC")
result4.show()

print("\n5. Count employees by department:")
result5 = spark.sql(
    """
    SELECT department, COUNT(*) as employee_count
    FROM employees
    GROUP BY department
"""
)
result5.show()

1. Select all employees:
+---+-------+-----------+------+----------+
| id|   name| department|salary| hire_date|
+---+-------+-----------+------+----------+
|  1|  Alice|Engineering| 75000|2020-01-15|
|  2|    Bob|      Sales| 65000|2019-03-20|
|  3|Charlie|Engineering| 80000|2018-06-10|
|  4|  Diana|  Marketing| 70000|2021-02-28|
|  5|    Eve|      Sales| 68000|2017-11-05|
|  6|  Frank|Engineering| 82000|2020-09-12|
+---+-------+-----------+------+----------+


2. Select specific columns:
+-------+-----------+------+
|   name| department|salary|
+-------+-----------+------+
|  Alice|Engineering| 75000|
|    Bob|      Sales| 65000|
|Charlie|Engineering| 80000|
|  Diana|  Marketing| 70000|
|    Eve|      Sales| 68000|
|  Frank|Engineering| 82000|
+-------+-----------+------+


3. Filter with WHERE clause:
+-------+------+
|   name|salary|
+-------+------+
|  Alice| 75000|
|Charlie| 80000|
|  Frank| 82000|
+-------+------+


4. Order by salary:
+-------+------+
|   name|salary|
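+-------+------+
|  Frank| 82000|
|Charlie| 80000|
|  Alice| 75000|
|  Diana| 70000|
|    Eve| 68000|
|    Bob| 65000|
+-------+------+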

## 3. Global Temporary Views

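Global temporary views are shared across all SparkSessions in the same Spark application and are kept alive until that application terminates. They are registered in the system-preserved database `global_temp`, so queries must refer to them by the qualified name `global_temp.<view_name>`.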


In [15]:
# Create or replace global temporary view
employees_df.createOrReplaceGlobalTempView("global_employees")

print("Global temporary view created!")

# Access global temporary view (note the global_temp prefix)
print("\nAccess global temporary view:")
result_global = spark.sql(
    "SELECT * FROM global_temp.global_employees WHERE department = 'Engineering'"
)
result_global.show()

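# You can also access it from a different SparkSession in the same application.
# Note: SparkSession.builder.getOrCreate() returns the already-running session,
# so use spark.newSession() to get a genuinely separate one.
# new_spark = spark.newSession()
# result_from_new_session = new_spark.sql("SELECT COUNT(*) FROM global_temp.global_employees")
# result_from_new_session.show()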

print("\nDifference between temporary and global temporary views:")
print("- Temporary views: Session-scoped, accessed directly by name")
print(
    "- Global temporary views: Application-scoped, accessed via global_temp.view_name"
)

Global temporary view created!

Access global temporary view:
+---+-------+-----------+------+----------+
| id|   name| department|salary| hire_date|
+---+-------+-----------+------+----------+
|  1|  Alice|Engineering| 75000|2020-01-15|
|  3|Charlie|Engineering| 80000|2018-06-10|
|  6|  Frank|Engineering| 82000|2020-09-12|
+---+-------+-----------+------+----------+


Difference between temporary and global temporary views:
- Temporary views: Session-scoped, accessed directly by name
- Global temporary views: Application-scoped, accessed via global_temp.view_name


## 4. DataFrame API vs SQL Comparison

Let's compare the same operations using DataFrame API and SQL syntax.


In [None]:
# Example 1: Filter and select
print("=== EXAMPLE 1: Filter and Select ===")


=== EXAMPLE 1: Filter and Select ===
DataFrame API:
+-------+-----------+------+
|   name| department|salary|
+-------+-----------+------+
|  Alice|Engineering| 75000|
|Charlie|Engineering| 80000|
|  Frank|Engineering| 82000|
+-------+-----------+------+

SQL:
+-------+-----------+------+
|   name| department|salary|
+-------+-----------+------+
|  Alice|Engineering| 75000|
|Charlie|Engineering| 80000|
|  Frank|Engineering| 82000|
+-------+-----------+------+


=== EXAMPLE 2: Group By and Aggregate ===
DataFrame API:
+-----------+-----+----------+----------+
| department|count|avg_salary|max_salary|
+-----------+-----+----------+----------+
|Engineering|    3|   79000.0|     82000|
|      Sales|    2|   66500.0|     68000|
|  Marketing|    1|   70000.0|     70000|
+-----------+-----+----------+----------+

SQL:
+-----------+-----+----------+----------+
| department|count|avg_salary|max_salary|
+-----------+-----+----------+----------+
|Engineering|    3|   79000.0|     82000|
|      Sales|    2|   66500.0|     68000|
|  Marketing|    1|   70000.0|     70000|
+-----------+-----+----------+----------+

In [None]:


# Example 2: Group by and aggregate
print("\n=== EXAMPLE 2: Group By and Aggregate ===")


## 5. Complex SQL Queries

Practice more advanced SQL operations including joins, subqueries, and window functions.


## 6. Practice Exercises

Complete these SQL exercises to test your understanding.
