# Run this file in Databricks

Q2. Consider a Student schema in which many students learn many subjects and a teacher teaches many subjects. Prepare a database and SQL tables.
Perform SQL queries using basic features, aggregation, and nested queries.


Step 1: Import Libraries & Create Spark Session

Create a Spark session. Databricks already provides a spark object, so getOrCreate() simply returns the existing session; we call it explicitly for clarity.

In [0]:
# Import PySpark libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg

# Create Spark session (if not already available)
spark = SparkSession.builder.appName("StudentSubjectTeacherDB").getOrCreate()


Step 2: Create DataFrames for Students, Subjects, Teachers

We will define sample data as Python lists and create DataFrames.

In [0]:
# Sample Data

# Students
students_data = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
students_columns = ["student_id", "student_name"]
students_df = spark.createDataFrame(students_data, students_columns)

# Subjects
subjects_data = [(101, "Math"), (102, "Physics"), (103, "Chemistry")]
subjects_columns = ["subject_id", "subject_name"]
subjects_df = spark.createDataFrame(subjects_data, subjects_columns)

# Teachers
teachers_data = [(201, "Mr. Smith"), (202, "Mrs. Johnson")]
teachers_columns = ["teacher_id", "teacher_name"]
teachers_df = spark.createDataFrame(teachers_data, teachers_columns)

# Show DataFrames
students_df.show()
subjects_df.show()
teachers_df.show()


+----------+------------+
|student_id|student_name|
+----------+------------+
|         1|       Alice|
|         2|         Bob|
|         3|     Charlie|
+----------+------------+

+----------+------------+
|subject_id|subject_name|
+----------+------------+
|       101|        Math|
|       102|     Physics|
|       103|   Chemistry|
+----------+------------+

+----------+------------+
|teacher_id|teacher_name|
+----------+------------+
|       201|   Mr. Smith|
|       202|Mrs. Johnson|
+----------+------------+



Step 3: Create Many-to-Many Relationships

Student_Subject (which student takes which subject)

Teacher_Subject (which teacher teaches which subject)

In [0]:
# Many-to-many relationship

# Student-Subject mapping
student_subject_data = [(1, 101), (1, 102), (2, 101), (2, 103), (3, 102)]
student_subject_columns = ["student_id", "subject_id"]
student_subject_df = spark.createDataFrame(student_subject_data, student_subject_columns)

# Teacher-Subject mapping
teacher_subject_data = [(201, 101), (201, 102), (202, 103)]
teacher_subject_columns = ["teacher_id", "subject_id"]
teacher_subject_df = spark.createDataFrame(teacher_subject_data, teacher_subject_columns)

# Show DataFrames
student_subject_df.show()
teacher_subject_df.show()


+----------+----------+
|student_id|subject_id|
+----------+----------+
|         1|       101|
|         1|       102|
|         2|       101|
|         2|       103|
|         3|       102|
+----------+----------+

+----------+----------+
|teacher_id|subject_id|
+----------+----------+
|       201|       101|
|       201|       102|
|       202|       103|
+----------+----------+
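spark.createDataFrame does not check that the ids in a mapping table exist in the parent tables, so a mistyped id would silently drop out of every join. The sketch below is one way to sanity-check the sample lists in plain Python before building the DataFrames. The helper name orphan_rows is hypothetical, and the data is repeated only so that the block runs on its own.

```python
# Same sample data as the cells above, repeated so the check runs on its own.
students_data = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
subjects_data = [(101, "Math"), (102, "Physics"), (103, "Chemistry")]
teachers_data = [(201, "Mr. Smith"), (202, "Mrs. Johnson")]
student_subject_data = [(1, 101), (1, 102), (2, 101), (2, 103), (3, 102)]
teacher_subject_data = [(201, 101), (201, 102), (202, 103)]


def orphan_rows(mapping, left_ids, right_ids):
    """Return mapping rows whose ids are missing from either parent set."""
    return [(a, b) for a, b in mapping if a not in left_ids or b not in right_ids]


# Collect the valid ids from each parent list.
student_ids = {sid for sid, _ in students_data}
subject_ids = {sid for sid, _ in subjects_data}
teacher_ids = {tid for tid, _ in teachers_data}

bad_enrollments = orphan_rows(student_subject_data, student_ids, subject_ids)
bad_assignments = orphan_rows(teacher_subject_data, teacher_ids, subject_ids)

# Both lists are empty for the sample data, so every mapping row is valid.
print("orphan student_subject rows:", bad_enrollments)
print("orphan teacher_subject rows:", bad_assignments)
```

An empty list for both mappings means every row will survive the joins used in the later steps.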



Step 4: Create Temporary Views for SQL Queries

Registering each DataFrame as a temporary view lets spark.sql query it by name. Temporary views exist only for the current Spark session.

In [0]:
# Create SQL temporary views
students_df.createOrReplaceTempView("students")
subjects_df.createOrReplaceTempView("subjects")
teachers_df.createOrReplaceTempView("teachers")
student_subject_df.createOrReplaceTempView("student_subject")
teacher_subject_df.createOrReplaceTempView("teacher_subject")


Step 5: Basic SQL Queries

In [0]:
# Basic Query - List all students with their subjects
query1 = """
SELECT s.student_name, sub.subject_name
FROM students s
JOIN student_subject ss ON s.student_id = ss.student_id
JOIN subjects sub ON ss.subject_id = sub.subject_id
"""
spark.sql(query1).show()

# List all teachers with the subjects they teach
query2 = """
SELECT t.teacher_name, sub.subject_name
FROM teachers t
JOIN teacher_subject ts ON t.teacher_id = ts.teacher_id
JOIN subjects sub ON ts.subject_id = sub.subject_id
"""
spark.sql(query2).show()


+------------+------------+
|student_name|subject_name|
+------------+------------+
|       Alice|        Math|
|       Alice|     Physics|
|         Bob|        Math|
|         Bob|   Chemistry|
|     Charlie|     Physics|
+------------+------------+

+------------+------------+
|teacher_name|subject_name|
+------------+------------+
|   Mr. Smith|        Math|
|   Mr. Smith|     Physics|
|Mrs. Johnson|   Chemistry|
+------------+------------+



Step 6: Aggregation Queries

Examples: count the students in each subject, and count the subjects each teacher teaches.

In [0]:
# Count of students in each subject
query3 = """
SELECT sub.subject_name, COUNT(ss.student_id) AS num_students
FROM subjects sub
JOIN student_subject ss ON sub.subject_id = ss.subject_id
GROUP BY sub.subject_name
"""
spark.sql(query3).show()

# Count of subjects each teacher teaches
query4 = """
SELECT t.teacher_name, COUNT(ts.subject_id) AS num_subjects
FROM teachers t
JOIN teacher_subject ts ON t.teacher_id = ts.teacher_id
GROUP BY t.teacher_name
"""
spark.sql(query4).show()


+------------+------------+
|subject_name|num_students|
+------------+------------+
|        Math|           2|
|     Physics|           2|
|   Chemistry|           1|
+------------+------------+

+------------+------------+
|teacher_name|num_subjects|
+------------+------------+
|   Mr. Smith|           2|
|Mrs. Johnson|           1|
+------------+------------+
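Because the queries above use an inner JOIN, a subject with no enrolled students would not appear in the result at all. A LEFT JOIN from subjects keeps such rows, and COUNT(ss.student_id) reports 0 for them because COUNT ignores NULLs. The sketch below illustrates this with Python's built-in sqlite3 as a stand-in engine so that it runs without a cluster; the subject "History" (104) is a hypothetical row added only for the demonstration. In the notebook, the same query text could be passed to spark.sql.

```python
import sqlite3

# In-memory stand-in for the subjects and student_subject views.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subjects (subject_id INTEGER PRIMARY KEY, subject_name TEXT);
CREATE TABLE student_subject (student_id INTEGER, subject_id INTEGER);
INSERT INTO subjects VALUES
    (101, 'Math'), (102, 'Physics'), (103, 'Chemistry'),
    (104, 'History');  -- hypothetical subject with no students
INSERT INTO student_subject VALUES
    (1, 101), (1, 102), (2, 101), (2, 103), (3, 102);
""")

# LEFT JOIN keeps every subject; unmatched subjects get a count of 0.
left_join_query = """
SELECT sub.subject_name, COUNT(ss.student_id) AS num_students
FROM subjects sub
LEFT JOIN student_subject ss ON sub.subject_id = ss.subject_id
GROUP BY sub.subject_name
ORDER BY sub.subject_name
"""
rows = conn.execute(left_join_query).fetchall()
for subject_name, num_students in rows:
    print(subject_name, num_students)
# Chemistry 1
# History 0
# Math 2
# Physics 2
```

Note that COUNT(*) would report 1 for History, since the LEFT JOIN still produces one row for it; counting the nullable ss.student_id column is what yields 0.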



Step 7: Nested Queries

Example: Find students who are learning subjects taught by "Mr. Smith"

In [0]:
# Nested query: students enrolled in any subject taught by Mr. Smith
query5 = """
SELECT student_name
FROM students
WHERE student_id IN (
    SELECT student_id
    FROM student_subject
    WHERE subject_id IN (
        SELECT subject_id
        FROM teacher_subject
        WHERE teacher_id = (SELECT teacher_id FROM teachers WHERE teacher_name='Mr. Smith')
    )
)
"""
spark.sql(query5).show()


+------------+
|student_name|
+------------+
|       Alice|
|         Bob|
|     Charlie|
+------------+
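All three students appear because Mr. Smith teaches both Math and Physics, and every student takes at least one of them. The nested query can be read inside-out as three filtering steps, which the plain-Python sketch below mirrors on the same sample lists as a cross-check. The function name students_of is hypothetical.

```python
# Same sample data as the cells above, repeated so the check runs on its own.
students_data = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
teachers_data = [(201, "Mr. Smith"), (202, "Mrs. Johnson")]
student_subject_data = [(1, 101), (1, 102), (2, 101), (2, 103), (3, 102)]
teacher_subject_data = [(201, 101), (201, 102), (202, 103)]


def students_of(teacher_name):
    """Names of students enrolled in any subject taught by teacher_name."""
    # Innermost subquery: teacher_id for the given name.
    teacher_ids = {tid for tid, name in teachers_data if name == teacher_name}
    # Middle subquery: subjects taught by that teacher.
    subject_ids = {sub for tid, sub in teacher_subject_data if tid in teacher_ids}
    # Outer subquery: students enrolled in any of those subjects.
    student_ids = {sid for sid, sub in student_subject_data if sub in subject_ids}
    # Outer query: map the ids back to names (sorted for stable output).
    return sorted(name for sid, name in students_data if sid in student_ids)


print(students_of("Mr. Smith"))     # ['Alice', 'Bob', 'Charlie']
print(students_of("Mrs. Johnson"))  # ['Bob']
```

Using sets at each step plays the role of IN in the SQL version: a student enrolled in two of the teacher's subjects, such as Alice, is still listed once.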



Step 8: Save DataFrames as Tables

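saveAsTable writes each DataFrame as a persistent table that outlives the Spark session, unlike the temporary views above. mode("overwrite") replaces any existing table with the same name.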

In [0]:
# Save DataFrames as tables
students_df.write.mode("overwrite").saveAsTable("students_table")
subjects_df.write.mode("overwrite").saveAsTable("subjects_table")
teachers_df.write.mode("overwrite").saveAsTable("teachers_table")
student_subject_df.write.mode("overwrite").saveAsTable("student_subject_table")
teacher_subject_df.write.mode("overwrite").saveAsTable("teacher_subject_table")
