# Running SQL Queries - Practice Notebook

This notebook covers **Running SQL Queries Programmatically** and **Global Temporary Views** from the [Spark SQL Getting Started Guide](https://spark.apache.org/docs/latest/sql-getting-started.html).

## Learning Objectives
- Register DataFrames as temporary views
- Execute SQL queries using `spark.sql()`
- Understand the difference between temporary and global temporary views
- Compare DataFrame API vs SQL syntax
- Practice complex SQL queries

## Sections
0. **Setup and Data Preparation**
1. **Creating Temporary Views**
2. **Running Basic SQL Queries**
3. **Global Temporary Views**
4. **DataFrame API vs SQL Comparison**
5. **Complex SQL Queries**
6. **Practice Exercises**

---


## Setup and Data Preparation

In [40]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SQL Queries Practise").getOrCreate()

In [41]:
# Create sample datasets
employees_data = [
    (1, "Alice", "Engineering", 75000, "2020-01-15"),
    (2, "Bob", "Sales", 65000, "2019-03-20"),
    (3, "Charlie", "Engineering", 80000, "2018-06-10"),
    (4, "Diana", "Marketing", 70000, "2021-02-28"),
    (5, "Eve", "Sales", 68000, "2017-11-05"),
    (6, "Frank", "Engineering", 82000, "2020-09-12")
]

departments_data = [
    ("Engineering", "Tech", "Alice Johnson"),
    ("Sales", "Business", "Bob Smith"),
    ("Marketing", "Business", "Charlie Brown"),
    ("HR", "Support", "Diana Prince")
]

In [42]:
employees_df = spark.createDataFrame(employees_data, ["id", "name", "department", "salary", "hire_date"])
departments_df = spark.createDataFrame(departments_data, ["dept_name", "division", "manager"])

print("Employees DataFrame:")
employees_df.show()

print("Departments DataFrame:")
departments_df.show()

Employees DataFrame:
+---+-------+-----------+------+----------+
| id|   name| department|salary| hire_date|
+---+-------+-----------+------+----------+
|  1|  Alice|Engineering| 75000|2020-01-15|
|  2|    Bob|      Sales| 65000|2019-03-20|
|  3|Charlie|Engineering| 80000|2018-06-10|
|  4|  Diana|  Marketing| 70000|2021-02-28|
|  5|    Eve|      Sales| 68000|2017-11-05|
|  6|  Frank|Engineering| 82000|2020-09-12|
+---+-------+-----------+------+----------+

Departments DataFrame:
+-----------+--------+-------------+
|  dept_name|division|      manager|
+-----------+--------+-------------+
|Engineering|    Tech|Alice Johnson|
|      Sales|Business|    Bob Smith|
|  Marketing|Business|Charlie Brown|
|         HR| Support| Diana Prince|
+-----------+--------+-------------+



## 1. Creating Temporary Views

To run SQL queries against a DataFrame, first register it as a temporary view. A temporary view is session-scoped: it disappears when the SparkSession that created it terminates.


In [43]:
employees_df.createOrReplaceTempView("employees")
departments_df.createOrReplaceTempView("departments")

print("Temp views created successfully!")
print("\nCurrent temporary views:")
spark.catalog.listTables()

Temp views created successfully!

Current temporary views:


[Table(name='departments', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='employees', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

## 2. Running Basic SQL Queries

Now we can run SQL queries using the `spark.sql()` method.


In [44]:
employees_df.show()

+---+-------+-----------+------+----------+
| id|   name| department|salary| hire_date|
+---+-------+-----------+------+----------+
|  1|  Alice|Engineering| 75000|2020-01-15|
|  2|    Bob|      Sales| 65000|2019-03-20|
|  3|Charlie|Engineering| 80000|2018-06-10|
|  4|  Diana|  Marketing| 70000|2021-02-28|
|  5|    Eve|      Sales| 68000|2017-11-05|
|  6|  Frank|Engineering| 82000|2020-09-12|
+---+-------+-----------+------+----------+



In [45]:
spark.catalog.listTables()

[Table(name='departments', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='employees', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

In [46]:
print("1. Select all employees:")
spark.sql("SELECT * FROM employees").show()

1. Select all employees:
+---+-------+-----------+------+----------+
| id|   name| department|salary| hire_date|
+---+-------+-----------+------+----------+
|  1|  Alice|Engineering| 75000|2020-01-15|
|  2|    Bob|      Sales| 65000|2019-03-20|
|  3|Charlie|Engineering| 80000|2018-06-10|
|  4|  Diana|  Marketing| 70000|2021-02-28|
|  5|    Eve|      Sales| 68000|2017-11-05|
|  6|  Frank|Engineering| 82000|2020-09-12|
+---+-------+-----------+------+----------+



In [47]:
print("\n2. Select specific columns:")
spark.sql("SELECT dept_name, manager FROM departments").show()


2. Select specific columns:
+-----------+-------------+
|  dept_name|      manager|
+-----------+-------------+
|Engineering|Alice Johnson|
|      Sales|    Bob Smith|
|  Marketing|Charlie Brown|
|         HR| Diana Prince|
+-----------+-------------+



In [48]:
departments_df.show()

+-----------+--------+-------------+
|  dept_name|division|      manager|
+-----------+--------+-------------+
|Engineering|    Tech|Alice Johnson|
|      Sales|Business|    Bob Smith|
|  Marketing|Business|Charlie Brown|
|         HR| Support| Diana Prince|
+-----------+--------+-------------+



In [49]:
print("\n3. Filter with WHERE clause:")
spark.sql("SELECT * FROM employees WHERE salary > 70000").show()


3. Filter with WHERE clause:
+---+-------+-----------+------+----------+
| id|   name| department|salary| hire_date|
+---+-------+-----------+------+----------+
|  1|  Alice|Engineering| 75000|2020-01-15|
|  3|Charlie|Engineering| 80000|2018-06-10|
|  6|  Frank|Engineering| 82000|2020-09-12|
+---+-------+-----------+------+----------+



In [50]:
employees_df.filter(employees_df["salary"] > 70000).show()

+---+-------+-----------+------+----------+
| id|   name| department|salary| hire_date|
+---+-------+-----------+------+----------+
|  1|  Alice|Engineering| 75000|2020-01-15|
|  3|Charlie|Engineering| 80000|2018-06-10|
|  6|  Frank|Engineering| 82000|2020-09-12|
+---+-------+-----------+------+----------+



In [51]:
print("\n4. Order by salary:")
spark.sql("SELECT * FROM employees ORDER BY salary").show()


4. Order by salary:
+---+-------+-----------+------+----------+
| id|   name| department|salary| hire_date|
+---+-------+-----------+------+----------+
|  2|    Bob|      Sales| 65000|2019-03-20|
|  5|    Eve|      Sales| 68000|2017-11-05|
|  4|  Diana|  Marketing| 70000|2021-02-28|
|  1|  Alice|Engineering| 75000|2020-01-15|
|  3|Charlie|Engineering| 80000|2018-06-10|
|  6|  Frank|Engineering| 82000|2020-09-12|
+---+-------+-----------+------+----------+



In [52]:
print("\n5. Count employees by department:")
spark.sql("""
SELECT department, count(*) AS employee_count
FROM employees
GROUP BY department
""").show()


5. Count employees by department:
+-----------+--------------+
| department|employee_count|
+-----------+--------------+
|Engineering|             3|
|      Sales|             2|
|  Marketing|             1|
+-----------+--------------+



In [53]:
employees_df.groupBy("department").agg(
    F.count("*").alias("employee_count")
).show()

+-----------+--------------+
| department|employee_count|
+-----------+--------------+
|Engineering|             3|
|      Sales|             2|
|  Marketing|             1|
+-----------+--------------+



## 3. Global Temporary Views

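Global temporary views are shared across all SparkSessions in the same Spark application and stay alive until that application terminates. They are registered in the system-preserved database `global_temp`, so queries must use the qualified name, for example `global_temp.global_employees`.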


In [None]:
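# createOrReplaceGlobalTempView keeps this cell re-runnable;
# createGlobalTempView raises an error if the view already exists.
employees_df.createOrReplaceGlobalTempView("global_employees")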

spark.sql("""
SELECT * FROM global_temp.global_employees WHERE department = 'Engineering'
""").show()

## 4. DataFrame API vs SQL Comparison

Let's compare the same operations using DataFrame API and SQL syntax.


In [57]:
print("=== EXAMPLE 1: Filter and Select ===")

# DataFrame API
print("DataFrame API:")

=== EXAMPLE 1: Filter and Select ===
DataFrame API:


In [58]:
print("SQL:")
sql_result = spark.sql("""
    SELECT name, department, salary
    FROM employees
    WHERE salary > 70000
""")
sql_result.show()

SQL:
+-------+-----------+------+
|   name| department|salary|
+-------+-----------+------+
|  Alice|Engineering| 75000|
|Charlie|Engineering| 80000|
|  Frank|Engineering| 82000|
+-------+-----------+------+



In [61]:
employees_df.groupBy('department').agg(
    F.count("*").alias("count")
).show()

+-----------+-----+
| department|count|
+-----------+-----+
|Engineering|    3|
|      Sales|    2|
|  Marketing|    1|
+-----------+-----+



## 5. Complex SQL Queries

Practice more advanced SQL operations including joins, subqueries, and window functions.


In [62]:
print("=== JOIN OPERATIONS ===")

print("1. Inner join employees with departments:")

=== JOIN OPERATIONS ===
1. Inner join employees with departments:


In [68]:
spark.sql("""
SELECT e.name, e.salary, e.department, d.division
FROM employees e
INNER JOIN departments d ON e.department = d.dept_name
""").show()

+-------+------+-----------+--------+
|   name|salary| department|division|
+-------+------+-----------+--------+
|  Alice| 75000|Engineering|    Tech|
|Charlie| 80000|Engineering|    Tech|
|  Frank| 82000|Engineering|    Tech|
|  Diana| 70000|  Marketing|Business|
|    Bob| 65000|      Sales|Business|
|    Eve| 68000|      Sales|Business|
+-------+------+-----------+--------+



In [70]:
spark.sql("""
    SELECT e.name, e.department, e.salary, d.division
    FROM employees e
    LEFT JOIN departments d ON e.department = d.dept_name
""").show()

+-------+-----------+------+--------+
|   name| department|salary|division|
+-------+-----------+------+--------+
|  Alice|Engineering| 75000|    Tech|
|    Bob|      Sales| 65000|Business|
|Charlie|Engineering| 80000|    Tech|
|  Diana|  Marketing| 70000|Business|
|    Eve|      Sales| 68000|Business|
|  Frank|Engineering| 82000|    Tech|
+-------+-----------+------+--------+



In [72]:
spark.sql("""
SELECT AVG(salary) FROM employees
""").show()

+-----------------+
|      avg(salary)|
+-----------------+
|73333.33333333333|
+-----------------+



In [74]:
spark.sql("""
    SELECT name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY salary DESC
""").show()

+-------+------+
|   name|salary|
+-------+------+
|  Frank| 82000|
|Charlie| 80000|
|  Alice| 75000|
+-------+------+



In [76]:
spark.sql("""
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
""").show()

+-----------+----------+
| department|avg_salary|
+-----------+----------+
|Engineering|   79000.0|
|      Sales|   66500.0|
|  Marketing|   70000.0|
+-----------+----------+



In [78]:
spark.sql("""
    SELECT department, avg_salary
    FROM (
        SELECT department, AVG(salary) as avg_salary
        FROM employees
        GROUP BY department
    ) dept_avg
    ORDER BY avg_salary DESC
    LIMIT 1
""").show()

+-----------+----------+
| department|avg_salary|
+-----------+----------+
|Engineering|   79000.0|
+-----------+----------+



In [83]:
spark.sql("""
    SELECT name, department, salary
    FROM employees
""").show()

+-------+-----------+------+
|   name| department|salary|
+-------+-----------+------+
|  Alice|Engineering| 75000|
|    Bob|      Sales| 65000|
|Charlie|Engineering| 80000|
|  Diana|  Marketing| 70000|
|    Eve|      Sales| 68000|
|  Frank|Engineering| 82000|
+-------+-----------+------+



In [84]:
spark.sql("""
    SELECT name, department, salary,
           SUM(salary) OVER (ORDER BY salary ROWS UNBOUNDED PRECEDING) as running_total
    FROM employees
    ORDER BY salary
""").show()

25/07/13 18:29:38 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+-------+-----------+------+-------------+
|   name| department|salary|running_total|
+-------+-----------+------+-------------+
|    Bob|      Sales| 65000|        65000|
|    Eve|      Sales| 68000|       133000|
|  Diana|  Marketing| 70000|       203000|
|  Alice|Engineering| 75000|       278000|
|Charlie|Engineering| 80000|       358000|
|  Frank|Engineering| 82000|       440000|
+-------+-----------+------+-------------+





In [86]:
spark.sql("""
        SELECT department, AVG(salary) as avg_salary
        FROM employees
        GROUP BY department
""").show()

+-----------+----------+
| department|avg_salary|
+-----------+----------+
|Engineering|   79000.0|
|      Sales|   66500.0|
|  Marketing|   70000.0|
+-----------+----------+



In [89]:
spark.sql("""
    WITH dept_stats AS (
        SELECT department, AVG(salary) as avg_salary
        FROM employees
        GROUP BY department
    ),
    top_performers AS (
        SELECT e.name, e.department, e.salary
        FROM employees e
        JOIN dept_stats d ON e.department = d.department
        WHERE e.salary > d.avg_salary
    )
    SELECT * FROM top_performers
    ORDER BY salary DESC
""").show()

+-------+-----------+------+
|   name| department|salary|
+-------+-----------+------+
|  Frank|Engineering| 82000|
|Charlie|Engineering| 80000|
|    Eve|      Sales| 68000|
+-------+-----------+------+



## 6. Practice Exercises

Complete these SQL exercises to test your understanding.
