<pre>Problem 1: PySpark – Calculate Session Gaps for Users
Problem Statement

You have a PySpark DataFrame with user session information. Each row represents a session start date for a user. Write a PySpark program to calculate the gap in days between consecutive sessions for each user.
Sample Input (user_sessions)
user_id 	session_date
U1 	2025-01-01
U1 	2025-01-05
U1 	2025-01-10
U2 	2025-01-02
U2 	2025-01-04
Expected Output
user_id 	session_date 	days_since_last_session
U1 	2025-01-01 	NULL
U1 	2025-01-05 	4
U1 	2025-01-10 	5
U2 	2025-01-02 	NULL
U2 	2025-01-04 	2
</pre>

In [16]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

In [17]:
data = [
    ("U1", "2025-01-01"),
    ("U1", "2025-01-05"),
    ("U1", "2025-01-10"),
    ("U2", "2025-01-02"),
    ("U2", "2025-01-04"),
]
schema = ["user_id", "session_date"]
spark = SparkSession.builder.appName("Daily-Day7").getOrCreate()
user_sessions = spark.createDataFrame(data, schema)

In [18]:
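# Cast the string column to a real date before doing date arithmetic.
user_sessions = user_sessions.withColumn("session_date", F.to_date(F.col("session_date"), "yyyy-MM-dd"))
# Order each user's sessions chronologically so lag() returns the previous session.
windowSpec = Window.partitionBy("user_id").orderBy("session_date")
user_sessions = user_sessions.withColumn("prev_session_date", F.lag("session_date").over(windowSpec))
# datediff() already yields NULL when prev_session_date is NULL (the first session per user),
# so no explicit when/otherwise branch is needed.
user_sessions = user_sessions.withColumn("days_since_last_session",
    F.datediff(F.col("session_date"), F.col("prev_session_date")))
result = user_sessions.select("user_id", "session_date", "days_since_last_session")
result.show()

+-------+------------+-----------------------+
|user_id|session_date|days_since_last_session|
+-------+------------+-----------------------+
|     U1|  2025-01-01|                   NULL|
|     U1|  2025-01-05|                      4|
|     U1|  2025-01-10|                      5|
|     U2|  2025-01-02|                   NULL|
|     U2|  2025-01-04|                      2|
+-------+------------+-----------------------+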
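
As a sanity check on the expected output, the same per-user gap logic can be sketched in plain Python without Spark. This is only an illustration under the assumption that dates arrive as ISO `yyyy-MM-dd` strings; the function name `session_gaps` is hypothetical and not part of the notebook.

```python
from datetime import date
from itertools import groupby


def session_gaps(rows):
    """Return (user_id, session_date, gap_in_days) tuples.

    `rows` is an iterable of (user_id, ISO date string) pairs.
    The gap is None for each user's first session.
    """
    # Sort by user, then by date, mirroring partitionBy(...).orderBy(...).
    ordered = sorted((user, date.fromisoformat(day)) for user, day in rows)
    result = []
    for user, group in groupby(ordered, key=lambda r: r[0]):
        prev = None  # plays the role of lag(): no previous session yet
        for _, day in group:
            gap = (day - prev).days if prev is not None else None
            result.append((user, day.isoformat(), gap))
            prev = day
    return result


rows = [
    ("U1", "2025-01-01"),
    ("U1", "2025-01-05"),
    ("U1", "2025-01-10"),
    ("U2", "2025-01-02"),
    ("U2", "2025-01-04"),
]
for row in session_gaps(rows):
    print(row)
```

Resetting `prev` at the start of each user's group is what the window partition does in the Spark version: a user's first session never sees another user's last session.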



<pre>
Problem 2: SQL – Find Customers with Only One Purchase
Problem Statement

You have a SQL table purchases(customer_id, purchase_date, amount) representing customer purchases. Write a SQL query to find all customers who made exactly one purchase.
Sample Input (purchases)
customer_id 	purchase_date 	amount
C1 	2025-01-01 	200
C1 	2025-01-10 	300
C2 	2025-01-05 	150
C3 	2025-01-02 	100
C3 	2025-01-04 	200
Expected Output
customer_id
C2
</pre>

In [19]:
data = [
    ("C1", "2025-01-01", 200),
    ("C1", "2025-01-10", 300),
    ("C2", "2025-01-05", 150),
    ("C3", "2025-01-02", 100),
    ("C3", "2025-01-04", 200),
]
schema = ["customer_id", "purchase_date","amount"]
purchases = spark.createDataFrame(data, schema)
purchases = purchases.withColumn("purchase_date", F.to_date(F.col("purchase_date"), "yyyy-MM-dd"))
purchases.createOrReplaceTempView("purchases")


In [20]:
result = spark.sql("""select customer_id
from purchases 
group by customer_id
having count(*) = 1
""")
result.show()

+-----------+
|customer_id|
+-----------+
|         C2|
+-----------+
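
The query uses only standard `GROUP BY` and `HAVING`, so it can also be checked outside Spark. The sketch below runs it against an in-memory SQLite database using Python's built-in `sqlite3` module. The column types and the variable name `single_purchase_customers` are assumptions made for this illustration.

```python
import sqlite3

# In-memory database; the column types are an assumption for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer_id TEXT, purchase_date TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("C1", "2025-01-01", 200),
        ("C1", "2025-01-10", 300),
        ("C2", "2025-01-05", 150),
        ("C3", "2025-01-02", 100),
        ("C3", "2025-01-04", 200),
    ],
)

# Same logic as the Spark SQL query: keep customers whose group has exactly one row.
query = """
SELECT customer_id
FROM purchases
GROUP BY customer_id
HAVING COUNT(*) = 1
"""
single_purchase_customers = [row[0] for row in conn.execute(query)]
print(single_purchase_customers)
conn.close()
```

`COUNT(*)` counts rows, so a customer with two purchases on the same date is still treated as having two purchases.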

