<pre>
Table: Signups

+----------------+----------+
| Column Name    | Type     |
+----------------+----------+
| user_id        | int      |
| time_stamp     | datetime |
+----------------+----------+
user_id is the column of unique values for this table.
Each row contains information about the signup time for the user with ID user_id.
 

Table: Confirmations

+----------------+----------+
| Column Name    | Type     |
+----------------+----------+
| user_id        | int      |
| time_stamp     | datetime |
| action         | ENUM     |
+----------------+----------+
(user_id, time_stamp) is the primary key (combination of columns with unique values) for this table.
user_id is a foreign key (reference column) to the Signups table.
action is an ENUM (category) of the type ('confirmed', 'timeout')
Each row of this table indicates that the user with ID user_id requested a confirmation message at time_stamp and that confirmation message was either confirmed ('confirmed') or expired without confirming ('timeout').
 

The confirmation rate of a user is the number of 'confirmed' messages divided by the total number of requested confirmation messages. The confirmation rate of a user that did not request any confirmation messages is 0. Round the confirmation rate to two decimal places.

Write a solution to find the confirmation rate of each user.

Return the result table in any order.

The result format is in the following example.

 

Example 1:

Input: 
Signups table:
+---------+---------------------+
| user_id | time_stamp          |
+---------+---------------------+
| 3       | 2020-03-21 10:16:13 |
| 7       | 2020-01-04 13:57:59 |
| 2       | 2020-07-29 23:09:44 |
| 6       | 2020-12-09 10:39:37 |
+---------+---------------------+
Confirmations table:
+---------+---------------------+-----------+
| user_id | time_stamp          | action    |
+---------+---------------------+-----------+
| 3       | 2021-01-06 03:30:46 | timeout   |
| 3       | 2021-07-14 14:00:00 | timeout   |
| 7       | 2021-06-12 11:57:29 | confirmed |
| 7       | 2021-06-13 12:58:28 | confirmed |
| 7       | 2021-06-14 13:59:27 | confirmed |
| 2       | 2021-01-22 00:00:00 | confirmed |
| 2       | 2021-02-28 23:59:59 | timeout   |
+---------+---------------------+-----------+
Output: 
+---------+-------------------+
| user_id | confirmation_rate |
+---------+-------------------+
| 6       | 0.00              |
| 3       | 0.00              |
| 7       | 1.00              |
| 2       | 0.50              |
+---------+-------------------+
Explanation: 
User 6 did not request any confirmation messages. The confirmation rate is 0.
User 3 made 2 requests and both timed out. The confirmation rate is 0.
User 7 made 3 requests and all were confirmed. The confirmation rate is 1.
User 2 made 2 requests where one was confirmed and the other timed out. The confirmation rate is 1 / 2 = 0.5.
</pre>
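
As a quick sanity check of the definition above, the rate can be computed by hand in plain Python before involving Spark. This is only a minimal sketch: the function name `confirmation_rates` and the `(user_id, action)` tuple layout are assumptions made for illustration and are not part of the problem statement.

```python
from collections import Counter

def confirmation_rates(signup_ids, confirmations):
    """Return {user_id: rate} rounded to two decimals.

    signup_ids: iterable of user ids from Signups.
    confirmations: iterable of (user_id, action) pairs from Confirmations.
    """
    total = Counter()
    confirmed = Counter()
    for user_id, action in confirmations:
        total[user_id] += 1
        if action == "confirmed":
            confirmed[user_id] += 1
    # Users with no requests get a rate of 0 by definition.
    return {
        uid: round(confirmed[uid] / total[uid], 2) if total[uid] else 0.0
        for uid in signup_ids
    }

# Example 1 data, reduced to the columns the rate depends on.
example_signups = [3, 7, 2, 6]
example_confirmations = [
    (3, "timeout"), (3, "timeout"),
    (7, "confirmed"), (7, "confirmed"), (7, "confirmed"),
    (2, "confirmed"), (2, "timeout"),
]
rates = confirmation_rates(example_signups, example_confirmations)
print(rates)
```

The resulting values match the Explanation section of Example 1: users 6 and 3 get 0, user 7 gets 1, and user 2 gets 0.5.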

In [0]:
spark

In [0]:
# Import only the PySpark SQL functions this notebook uses.
# Note: `round` here shadows Python's built-in round for the rest of the notebook.
from pyspark.sql.functions import avg, col, round, to_timestamp, when

# Import the SQL types used in the schema definitions
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType

# Import SparkSession
from pyspark.sql import SparkSession


In [0]:
# creating spark session and providing app name
spark = SparkSession.builder.appName("leetcode-top-50-sql-solution-with-pyspark").getOrCreate()

In [0]:
# creating Schema
# Define the schema for the Signups table
signups_schema = StructType([
    StructField("user_id", IntegerType(), False),
    StructField("time_stamp", TimestampType(), True)
])

# Define the schema for the Confirmations table
confirmations_schema = StructType([
    StructField("user_id", IntegerType(), False),
    StructField("time_stamp", TimestampType(), True),
    StructField("action", StringType(), True)
])


In [0]:
# Create the Signups DataFrame
signups_data = [
    (3, "2020-03-21 10:16:13"),
    (7, "2020-01-04 13:57:59"),
    (2, "2020-07-29 23:09:44"),
    (6, "2020-12-09 10:39:37")
]

signups_df = spark.createDataFrame(signups_data, schema=["user_id", "time_stamp"])

# Convert the `time_stamp` column to TimestampType
signups_df = signups_df.withColumn("time_stamp", to_timestamp(col("time_stamp")))

# Create the Confirmations DataFrame
confirmations_data = [
    (3, "2021-01-06 03:30:46", "timeout"),
    (3, "2021-07-14 14:00:00", "timeout"),
    (7, "2021-06-12 11:57:29", "confirmed"),
    (7, "2021-06-13 12:58:28", "confirmed"),
    (7, "2021-06-14 13:59:27", "confirmed"),
    (2, "2021-01-22 00:00:00", "confirmed"),
    (2, "2021-02-28 23:59:59", "timeout")
]

confirmations_df = spark.createDataFrame(confirmations_data, schema=["user_id", "time_stamp", "action"])

# Convert the `time_stamp` column to TimestampType
confirmations_df = confirmations_df.withColumn("time_stamp", to_timestamp(col("time_stamp")))
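
Note that `signups_schema` and `confirmations_schema` defined earlier are not passed to `createDataFrame` in this cell; the column types are inferred from the Python values and `time_stamp` is cast afterwards. If the explicit schemas were to be used, the timestamp strings would first have to become `datetime` objects, because `TimestampType` does not accept plain strings when the schema is verified. A minimal sketch of that parsing step follows; the helper name `parse_ts_rows` and the constant `TS_FORMAT` are hypothetical.

```python
from datetime import datetime

# Format of the timestamp strings used in the example data.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_ts_rows(rows, ts_index=1):
    """Return new tuples with the string at ts_index parsed into a datetime."""
    parsed = []
    for row in rows:
        values = list(row)
        values[ts_index] = datetime.strptime(values[ts_index], TS_FORMAT)
        parsed.append(tuple(values))
    return parsed

sample_rows = [
    (3, "2020-03-21 10:16:13"),
    (6, "2020-12-09 10:39:37"),
]
typed_rows = parse_ts_rows(sample_rows)
print(typed_rows[0])

# With a live SparkSession, the typed rows could then be combined with the
# explicit schema (not executed in this sketch):
# signups_df = spark.createDataFrame(parse_ts_rows(signups_data), schema=signups_schema)
```

With this approach the later `to_timestamp` cast would no longer be needed, since the column is already a timestamp when the DataFrame is created.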


In [0]:
signups_df.display()

user_id,time_stamp
3,2020-03-21T10:16:13.000+0000
7,2020-01-04T13:57:59.000+0000
2,2020-07-29T23:09:44.000+0000
6,2020-12-09T10:39:37.000+0000


In [0]:
confirmations_df.display()


user_id,time_stamp,action
3,2021-01-06T03:30:46.000+0000,timeout
3,2021-07-14T14:00:00.000+0000,timeout
7,2021-06-12T11:57:29.000+0000,confirmed
7,2021-06-13T12:58:28.000+0000,confirmed
7,2021-06-14T13:59:27.000+0000,confirmed
2,2021-01-22T00:00:00.000+0000,confirmed
2,2021-02-28T23:59:59.000+0000,timeout


In [0]:
# LeetCode solution in Spark SQL
# Create temporary views of the signups and confirmations DataFrames for SQL queries
signups_df.createOrReplaceTempView('signups')
confirmations_df.createOrReplaceTempView('confirmations')



sql_result = spark.sql(
    '''
    SELECT
        su.user_id AS user_id,
        ROUND(
            AVG(CASE WHEN conf.action = 'confirmed' THEN 1 ELSE 0 END),
            2
        ) AS confirmation_rate
    FROM signups AS su
    LEFT JOIN confirmations AS conf
        ON su.user_id = conf.user_id
    GROUP BY su.user_id
    '''
)

# Displaying Result
sql_result.display()

user_id,confirmation_rate
3,0.0
7,1.0
2,0.5
6,0.0
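
The query relies on two facts: the LEFT JOIN keeps user 6 with a NULL `action`, and the CASE expression sends that NULL to its ELSE branch, so `AVG` sees a single 0 rather than an empty group. The pure-Python sketch below mimics that behaviour on the example data. The helper name `left_join_actions` is hypothetical and `None` stands in for SQL NULL. It calls `builtins.round` explicitly because the notebook's import from `pyspark.sql.functions` shadows the built-in `round`.

```python
import builtins
import statistics

def left_join_actions(signup_ids, confirmations):
    """Mimic signups LEFT JOIN confirmations, yielding (user_id, action) rows."""
    rows = []
    for uid in signup_ids:
        matches = [action for cid, action in confirmations if cid == uid]
        # An unmatched left row survives with a NULL (None) on the right side.
        rows.extend((uid, action) for action in (matches or [None]))
    return rows

signup_ids = [3, 7, 2, 6]
confirmations = [
    (3, "timeout"), (3, "timeout"),
    (7, "confirmed"), (7, "confirmed"), (7, "confirmed"),
    (2, "confirmed"), (2, "timeout"),
]
joined = left_join_actions(signup_ids, confirmations)

avg_rates = {}
for uid in signup_ids:
    # CASE WHEN action = 'confirmed' THEN 1 ELSE 0 END: None falls into ELSE.
    flags = [1.0 if action == "confirmed" else 0.0 for u, action in joined if u == uid]
    avg_rates[uid] = builtins.round(statistics.mean(flags), 2)

print(avg_rates)
```

Averaging a 1/0 flag is the same as dividing the number of confirmed messages by the total number of requests, which is why no separate COUNT or division is needed in the query.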


In [0]:
# LeetCode solution in PySpark

# Left-join so that users without any confirmation requests are kept
joined_df = signups_df.join(confirmations_df, on=['user_id'], how='left').select('user_id', 'action')

# Average a 1/0 "confirmed" flag per user; `round` is pyspark.sql.functions.round
agg_df = joined_df.groupBy('user_id').agg(
    round(
        avg(when(col('action') == 'confirmed', 1).otherwise(0)),
        2
    ).alias('confirmation_rate')
)

agg_df.display()


user_id,confirmation_rate
3,0.0
7,1.0
2,0.5
6,0.0
