# 📝 Problem 1: PySpark – Calculate Session Gaps for Users

### **Problem Statement**

You have a PySpark DataFrame with user session information. Each row holds one session start date for a user. Write a PySpark program to **calculate the gap in days between consecutive sessions** for each user. A user's first session has no predecessor, so its gap is `NULL`.

### **Sample Input** (`user_sessions`)

| user\_id | session\_date |
| -------- | ------------- |
| U1       | 2025-01-01    |
| U1       | 2025-01-05    |
| U1       | 2025-01-10    |
| U2       | 2025-01-02    |
| U2       | 2025-01-04    |

### **Expected Output**

| user\_id | session\_date | days\_since\_last\_session |
| -------- | ------------- | -------------------------- |
| U1       | 2025-01-01    | NULL                       |
| U1       | 2025-01-05    | 4                          |
| U1       | 2025-01-10    | 5                          |
| U2       | 2025-01-02    | NULL                       |
| U2       | 2025-01-04    | 2                          |
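Before turning to Spark, the expected output can be cross-checked with a small plain-Python sketch. It is not part of the PySpark solution: `session_gaps` is a hypothetical helper name, and the sketch assumes the dates arrive as ISO `YYYY-MM-DD` strings, as in the sample input.

```python
from datetime import date
from itertools import groupby


def session_gaps(rows):
    """Return (user_id, session_date, days_since_last_session) tuples.

    Hypothetical helper for illustration only; assumes ISO date strings.
    """
    # Sort by user, then by parsed date, so each user's sessions are
    # contiguous and in chronological order.
    parsed = sorted((user, date.fromisoformat(day)) for user, day in rows)

    result = []
    for user, group in groupby(parsed, key=lambda row: row[0]):
        previous = None  # no earlier session yet for this user
        for _, current in group:
            gap = None if previous is None else (current - previous).days
            result.append((user, current.isoformat(), gap))
            previous = current
    return result


rows = [
    ("U1", "2025-01-01"),
    ("U1", "2025-01-05"),
    ("U1", "2025-01-10"),
    ("U2", "2025-01-02"),
    ("U2", "2025-01-04"),
]

gaps = session_gaps(rows)
for row in gaps:
    print(row)
```

Sorting and then tracking the previous date per user is the same idea as ordering a per-user window and looking one row back; `None` plays the role of `NULL` for each user's first session.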

---

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F, Window as W
# Both columns are loaded as strings, so only StringType is needed here.
from pyspark.sql.types import StructType, StructField, StringType

In [2]:
spark = SparkSession.builder.appName("DailyCodingProblem-25-08-2025").getOrCreate()

In [4]:
# Sample sessions: (user_id, session_date as an ISO date string).
data = [
    ("U1", "2025-01-01"),
    ("U1", "2025-01-05"),
    ("U1", "2025-01-10"),
    ("U2", "2025-01-02"),
    ("U2", "2025-01-04")
]

# session_date is loaded as a string and converted to a date later.
schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("session_date", StringType(), True)
])

df = spark.createDataFrame(data, schema)

In [6]:
df.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- session_date: string (nullable = true)



In [21]:
# One window per user, with that user's sessions in chronological order.
w = W.partitionBy("user_id").orderBy("session_date")

In [29]:
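# Parse the string column into a date, then subtract the previous session
# date (lag of 1 within the user's window) from the current one.
# F.date_diff is available in PySpark 3.5+; on older versions F.datediff
# takes the same (end, start) arguments.
df = df.withColumn(
    "session_date",
    F.to_date(F.col("session_date"))
).withColumn(
    "days_since_last_session",
    F.date_diff(
        F.col("session_date"), F.lag(F.col("session_date"), 1).over(w)
    )
)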

In [30]:
df.show()

+-------+------------+-----------------------+
|user_id|session_date|days_since_last_session|
+-------+------------+-----------------------+
|     U1|  2025-01-01|                   NULL|
|     U1|  2025-01-05|                      4|
|     U1|  2025-01-10|                      5|
|     U2|  2025-01-02|                   NULL|
|     U2|  2025-01-04|                      2|
+-------+------------+-----------------------+

