<pre>
Table: Students

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| student_id    | int     |
| student_name  | varchar |
+---------------+---------+
student_id is the primary key (column with unique values) for this table.
Each row of this table contains the ID and the name of one student in the school.
 

Table: Subjects

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| subject_name | varchar |
+--------------+---------+
subject_name is the primary key (column with unique values) for this table.
Each row of this table contains the name of one subject in the school.
 

Table: Examinations

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| student_id   | int     |
| subject_name | varchar |
+--------------+---------+
There is no primary key (column with unique values) for this table. It may contain duplicates.
Each student from the Students table takes every course from the Subjects table.
Each row of this table indicates that a student with ID student_id attended the exam of subject_name.
 

Write a solution to find the number of times each student attended each exam.

Return the result table ordered by student_id and subject_name.

The result format is in the following example.

 

Example 1:

Input: 
Students table:
+------------+--------------+
| student_id | student_name |
+------------+--------------+
| 1          | Alice        |
| 2          | Bob          |
| 13         | John         |
| 6          | Alex         |
+------------+--------------+
Subjects table:
+--------------+
| subject_name |
+--------------+
| Math         |
| Physics      |
| Programming  |
+--------------+
Examinations table:
+------------+--------------+
| student_id | subject_name |
+------------+--------------+
| 1          | Math         |
| 1          | Physics      |
| 1          | Programming  |
| 2          | Programming  |
| 1          | Physics      |
| 1          | Math         |
| 13         | Math         |
| 13         | Programming  |
| 13         | Physics      |
| 2          | Math         |
| 1          | Math         |
+------------+--------------+
Output: 
+------------+--------------+--------------+----------------+
| student_id | student_name | subject_name | attended_exams |
+------------+--------------+--------------+----------------+
| 1          | Alice        | Math         | 3              |
| 1          | Alice        | Physics      | 2              |
| 1          | Alice        | Programming  | 1              |
| 2          | Bob          | Math         | 1              |
| 2          | Bob          | Physics      | 0              |
| 2          | Bob          | Programming  | 1              |
| 6          | Alex         | Math         | 0              |
| 6          | Alex         | Physics      | 0              |
| 6          | Alex         | Programming  | 0              |
| 13         | John         | Math         | 1              |
| 13         | John         | Physics      | 1              |
| 13         | John         | Programming  | 1              |
+------------+--------------+--------------+----------------+
Explanation: 
The result table should contain all students and all subjects.
Alice attended the Math exam 3 times, the Physics exam 2 times, and the Programming exam 1 time.
Bob attended the Math exam 1 time, the Programming exam 1 time, and did not attend the Physics exam.
Alex did not attend any exams.
John attended the Math exam 1 time, the Physics exam 1 time, and the Programming exam 1 time.
</pre>
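
Before turning to Spark, the required logic can be sketched in plain Python: pair every student with every subject, then look up how often that pair occurs in Examinations, treating a missing pair as 0. The function name `count_attended_exams` is hypothetical and used only for illustration; the data is copied from Example 1. This is a minimal reference sketch for checking results, not the notebook's solution.

```python
from collections import Counter
from itertools import product

# Sample data copied from Example 1
students = [(1, "Alice"), (2, "Bob"), (13, "John"), (6, "Alex")]
subjects = ["Math", "Physics", "Programming"]
examinations = [
    (1, "Math"), (1, "Physics"), (1, "Programming"), (2, "Programming"),
    (1, "Physics"), (1, "Math"), (13, "Math"), (13, "Programming"),
    (13, "Physics"), (2, "Math"), (1, "Math"),
]


def count_attended_exams(students, subjects, examinations):
    """Return (student_id, student_name, subject_name, attended_exams) rows."""
    # Count how many times each (student_id, subject_name) pair occurs
    attendance = Counter(examinations)
    # Pair every student with every subject; a pair with no exam record counts as 0
    rows = [
        (student_id, student_name, subject_name,
         attendance[(student_id, subject_name)])
        for (student_id, student_name), subject_name in product(students, subjects)
    ]
    # Order by student_id, then subject_name
    return sorted(rows, key=lambda row: (row[0], row[2]))


for row in count_attended_exams(students, subjects, examinations):
    print(row)
```

The cross product guarantees that students such as Alex, who have no exam records at all, still appear once per subject.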

In [0]:
spark

In [0]:
# Import only the PySpark SQL functions used in this notebook
# (note: this `sum` shadows Python's built-in `sum`)
from pyspark.sql.functions import col, sum, when

# Import the SQL types needed for the schemas
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Import SparkSession
from pyspark.sql import SparkSession


In [0]:
# creating spark session and providing app name
spark = SparkSession.builder.appName("leetcode-top-50-sql-solution-with-pyspark").getOrCreate()

In [0]:
# creating Schema
# Define the schema for the Students table
students_schema = StructType([
    StructField("student_id", IntegerType(), False),
    StructField("student_name", StringType(), True)
])


# Define the schema for the Subjects table
subjects_schema = StructType([
    StructField("subject_name", StringType(), False)
])

# Define the schema for the Examinations table
examinations_schema = StructType([
    StructField("student_id", IntegerType(), False),
    StructField("subject_name", StringType(), False)
])





In [0]:

student_df = spark.createDataFrame([
    (1, "Alice"),
    (2, "Bob"),
    (13, "John"),
    (6, "Alex")
], schema=students_schema)


subject_df = spark.createDataFrame([
    ("Math",),
    ("Physics",),
    ("Programming",)
], schema=subjects_schema)


examination_df = spark.createDataFrame([
    (1, "Math"),
    (1, "Physics"),
    (1, "Programming"),
    (2, "Programming"),
    (1, "Physics"),
    (1, "Math"),
    (13, "Math"),
    (13, "Programming"),
    (13, "Physics"),
    (2, "Math"),
    (1, "Math")
], schema=examinations_schema)





In [0]:
student_df.display()

student_id,student_name
1,Alice
2,Bob
13,John
6,Alex


In [0]:
subject_df.display()

subject_name
Math
Physics
Programming


In [0]:
examination_df.display()

student_id,subject_name
1,Math
1,Physics
1,Programming
2,Programming
1,Physics
1,Math
13,Math
13,Programming
13,Physics
2,Math
1,Math


In [0]:
# Leetcode Solution in Spark SQL
# Creating temporary views for the students, subjects, and examinations DataFrames for SQL queries
student_df.createOrReplaceTempView('students')
subject_df.createOrReplaceTempView('subjects')
examination_df.createOrReplaceTempView('examinations')



sql_result = spark.sql(
    '''
    WITH cartesianProductTable AS (
        SELECT *
        FROM students
        CROSS JOIN subjects
    )
    SELECT
        main_tab.student_id,
        main_tab.student_name,
        main_tab.subject_name,
        SUM(
            CASE WHEN exam_tab.student_id IS NULL THEN 0 ELSE 1 END
        ) AS attended_exams
    FROM cartesianProductTable AS main_tab
    LEFT JOIN examinations AS exam_tab
        ON main_tab.student_id = exam_tab.student_id
        AND main_tab.subject_name = exam_tab.subject_name
    GROUP BY main_tab.student_id, main_tab.student_name, main_tab.subject_name
    ORDER BY main_tab.student_id, main_tab.subject_name
    '''
)

# Displaying Result
sql_result.display()

student_id,student_name,subject_name,attended_exams
1,Alice,Math,3
1,Alice,Physics,2
1,Alice,Programming,1
2,Bob,Math,1
2,Bob,Physics,0
2,Bob,Programming,1
6,Alex,Math,0
6,Alex,Physics,0
6,Alex,Programming,0
13,John,Math,1
13,John,Physics,1
13,John,Programming,1
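
The query above turns unmatched rows into 0 by summing a CASE expression. An equivalent, shorter form is `COUNT(exam_tab.student_id)`, because `COUNT` over a column skips NULLs. The sketch below shows that variant using Python's built-in `sqlite3` module rather than Spark, purely so it can run without a cluster; the table aliases `s`, `sub`, and `e` are illustrative, and SQLite is an assumption of this sketch, not part of the notebook.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp views (assumption)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, student_name TEXT);
    CREATE TABLE subjects (subject_name TEXT PRIMARY KEY);
    CREATE TABLE examinations (student_id INTEGER, subject_name TEXT);
""")

# Sample data copied from Example 1
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (13, "John"), (6, "Alex")],
)
conn.executemany(
    "INSERT INTO subjects VALUES (?)",
    [("Math",), ("Physics",), ("Programming",)],
)
conn.executemany(
    "INSERT INTO examinations VALUES (?, ?)",
    [(1, "Math"), (1, "Physics"), (1, "Programming"), (2, "Programming"),
     (1, "Physics"), (1, "Math"), (13, "Math"), (13, "Programming"),
     (13, "Physics"), (2, "Math"), (1, "Math")],
)

# COUNT(e.student_id) ignores the NULLs produced by the LEFT JOIN,
# so student/subject pairs without an exam record get 0
query = """
    SELECT s.student_id, s.student_name, sub.subject_name,
           COUNT(e.student_id) AS attended_exams
    FROM students AS s
    CROSS JOIN subjects AS sub
    LEFT JOIN examinations AS e
        ON s.student_id = e.student_id
        AND sub.subject_name = e.subject_name
    GROUP BY s.student_id, s.student_name, sub.subject_name
    ORDER BY s.student_id, sub.subject_name
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

Note that `COUNT(*)` would not work here: it counts the joined row itself, so unmatched pairs would report 1 instead of 0.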


In [0]:
# Leetcode Solution in PySpark
# Same approach as the SQL solution above

# Create the Cartesian product of Students and Subjects
cartesian_product_df = student_df.crossJoin(subject_df)

# Left join the Cartesian product with Examinations.
# Aliases keep the examinations columns addressable after the join.
# Joining on a list of column names would keep only the left-side key columns,
# which are never NULL, so every student/subject pair would be counted at least once.
joined_df = cartesian_product_df.alias("main_tab").join(
    examination_df.alias("exam_tab"),
    on=(
        (col("main_tab.student_id") == col("exam_tab.student_id"))
        & (col("main_tab.subject_name") == col("exam_tab.subject_name"))
    ),
    how="left"
)

# Group by student_id, student_name, and subject_name, counting only the rows
# that matched an examination record
result_df = joined_df.groupBy(
    "main_tab.student_id", "main_tab.student_name", "main_tab.subject_name"
).agg(
    sum(
        when(col("exam_tab.student_id").isNotNull(), 1).otherwise(0)
    ).alias("attended_exams")
)

# Order by student_id and subject_name
final_df = result_df.orderBy("student_id", "subject_name")

final_df.display()

student_id,student_name,subject_name,attended_exams
1,Alice,Math,3
1,Alice,Physics,2
1,Alice,Programming,1
2,Bob,Math,1
2,Bob,Physics,0
2,Bob,Programming,1
6,Alex,Math,0
6,Alex,Physics,0
6,Alex,Programming,0
13,John,Math,1
13,John,Physics,1
13,John,Programming,1
