# ⚡ Spark SQL: SELECT, JOIN, and GROUP BY Queries

Using a **temporary view** created from a structured dataset (such as `employees`), this guide walks through four Spark SQL tasks:

- ✅ **SELECT**
- ✅ **JOIN**
- ✅ **GROUP BY**
- ✅ **Save results**

---

## ✅ Dataset Views Used

### `employees` (already created)
| id | name | dept | salary |
|----|------|------|--------|
| 1  | Alice| HR   | 45000  |
| 2  | Bob  | IT   | 60000  |
| 3  | Charlie | IT | 70000 |
| 4  | David | Finance | 80000 |
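
The queries below assume the `employees` view is already registered. If it is not, a minimal sketch for creating it from the rows in the table above might look like this. The app name and the `local[*]` master are assumptions for a local run; in `spark-shell`, `getOrCreate()` simply returns the existing session.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local run. In spark-shell a session already exists and is reused.
val spark = SparkSession.builder()
  .appName("employees-view-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // enables .toDF on a Seq of tuples

// Rows copied from the `employees` table above
val employeesDF = Seq(
  (1, "Alice", "HR", 45000),
  (2, "Bob", "IT", 60000),
  (3, "Charlie", "IT", 70000),
  (4, "David", "Finance", 80000)
).toDF("id", "name", "dept", "salary")

// Register the DataFrame so spark.sql(...) can refer to it by name
employeesDF.createOrReplaceTempView("employees")
```

A temporary view lives only as long as the session that created it, so this step has to be repeated in every new session.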

---

# 1️⃣ SELECT Query (Retrieve Specific Columns)

```scala
val selectDF = spark.sql("SELECT name, dept, salary FROM employees")
selectDF.show()
```

---

# 2️⃣ JOIN Operation (Using Another View)

### Create another view `dept_info`

```scala
val deptData = Seq(
  (1, "HR"),
  (2, "IT"),
  (3, "Finance")
).toDF("dept_id", "dept_name")

deptData.createOrReplaceTempView("dept_info")
```

### Perform JOIN between `employees` and `dept_info`

```scala
val joinDF = spark.sql("""
  SELECT e.name, e.salary, d.dept_name
  FROM employees e
  JOIN dept_info d
  ON e.dept = d.dept_name
""")

joinDF.show()
```

---

# 3️⃣ GROUP BY Query (Aggregation)

### Example: Average Salary by Department

```scala
val groupDF = spark.sql("""
  SELECT dept, AVG(salary) AS avg_salary
  FROM employees
  GROUP BY dept
""")

groupDF.show()
```
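
A single `GROUP BY` can compute several aggregates at once, and adding `ORDER BY` makes the row order predictable. The sketch below is one possible variation rather than part of the original walkthrough. It recreates the `employees` rows so it runs on its own, and the names `statsDF` and `headcount` are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local run. In spark-shell the existing session is reused.
val spark = SparkSession.builder()
  .appName("group-by-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Same rows as the `employees` table in this guide
Seq(
  (1, "Alice", "HR", 45000),
  (2, "Bob", "IT", 60000),
  (3, "Charlie", "IT", 70000),
  (4, "David", "Finance", 80000)
).toDF("id", "name", "dept", "salary").createOrReplaceTempView("employees")

// Several aggregates per department, sorted by department name
val statsDF = spark.sql("""
  SELECT dept,
         COUNT(*)    AS headcount,
         SUM(salary) AS total_salary,
         AVG(salary) AS avg_salary
  FROM employees
  GROUP BY dept
  ORDER BY dept
""")

statsDF.show()
```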

---

# 4️⃣ Save Query Results

### Save as CSV

```scala
groupDF.write.mode("overwrite").csv("output/group_by_salary")
```
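
Spark writes CSV output as a directory of part files, and by default it omits the header row. The sketch below shows one way to include a header, produce a single part file, and read the result back to check it. The output path is hypothetical, and `coalesce(1)` is only sensible for small results like this one.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local run. In spark-shell the existing session is reused.
val spark = SparkSession.builder()
  .appName("save-results-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Same rows as the `employees` table in this guide, recreated so the sketch runs on its own
Seq(
  (1, "Alice", "HR", 45000),
  (2, "Bob", "IT", 60000),
  (3, "Charlie", "IT", 70000),
  (4, "David", "Finance", 80000)
).toDF("id", "name", "dept", "salary").createOrReplaceTempView("employees")

val groupDF = spark.sql(
  "SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept")

val outPath = "output/group_by_salary_with_header" // hypothetical path

// Write one part file with a header row, replacing any previous output
groupDF.coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv(outPath)

// Read the directory back; inferSchema restores the numeric type of avg_salary
val savedDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(outPath)

savedDF.orderBy("dept").show()
```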

---

## ✅ Sample Output (Group By)

`AVG` returns a double, and without an `ORDER BY` the row order may vary:

```text
+-------+----------+
|   dept|avg_salary|
+-------+----------+
|     HR|   45000.0|
|     IT|   65000.0|
|Finance|   80000.0|
+-------+----------+
```


## 📌 PySpark Program

```python
# Import required libraries
from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession.builder \
    .appName("Spark SQL - Select, Join, GroupBy Example") \
    .getOrCreate()

# -----------------------------
# Create Sample Structured Data
# -----------------------------

# Employee Data
employee_data = [
    (1, "Alice", "HR", 50000),
    (2, "Bob", "IT", 60000),
    (3, "Charlie", "IT", 70000),
    (4, "David", "Finance", 65000),
    (5, "Eve", "HR", 55000)
]

employee_columns = ["emp_id", "emp_name", "department", "salary"]
employees_df = spark.createDataFrame(employee_data, employee_columns)

# Department Data
department_data = [
    ("HR", "Human Resources"),
    ("IT", "Information Technology"),
    ("Finance", "Finance Department")
]

department_columns = ["department", "dept_full_name"]
departments_df = spark.createDataFrame(department_data, department_columns)

# Create Temporary Views
employees_df.createOrReplaceTempView("employees")
departments_df.createOrReplaceTempView("departments")

# 1. SELECT Query
print("=== SELECT Query Result ===")
select_query = spark.sql("""
    SELECT emp_id, emp_name, salary
    FROM employees
""")
select_query.show()

# 2. JOIN Query
print("=== JOIN Query Result ===")
join_query = spark.sql("""
    SELECT e.emp_id, e.emp_name, d.dept_full_name, e.salary
    FROM employees e
    INNER JOIN departments d
    ON e.department = d.department
""")
join_query.show()

# 3. GROUP BY Query
print("=== GROUP BY Query Result ===")
group_query = spark.sql("""
    SELECT department,
           SUM(salary) AS total_salary,
           AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
""")
group_query.show()

# Stop Spark Session
spark.stop()
```

Output (row order in the JOIN and GROUP BY results may vary):

```text
=== SELECT Query Result ===
+------+--------+------+
|emp_id|emp_name|salary|
+------+--------+------+
|     1|   Alice| 50000|
|     2|     Bob| 60000|
|     3| Charlie| 70000|
|     4|   David| 65000|
|     5|     Eve| 55000|
+------+--------+------+

=== JOIN Query Result ===
+------+--------+--------------------+------+
|emp_id|emp_name|      dept_full_name|salary|
+------+--------+--------------------+------+
|     4|   David|  Finance Department| 65000|
|     1|   Alice|     Human Resources| 50000|
|     5|     Eve|     Human Resources| 55000|
|     2|     Bob|Information Techn...| 60000|
|     3| Charlie|Information Techn...| 70000|
+------+--------+--------------------+------+

=== GROUP BY Query Result ===
+----------+------------+----------+
|department|total_salary|avg_salary|
+----------+------------+----------+
|        HR|      105000|   52500.0|
|        IT|      130000|   65000.0|
|   Finance|       65000|   65000.0|
+----------+------------+----------+
```




