## **Problem: Advanced User Session Analytics**

A login-based website tracks detailed user activity across sessions. For each user and each day, calculate:

1. **Number of Logins** – Count of login events per day.
2. **Session Times** – Duration of each session (between login and logout) per day.
3. **Total Usage Time** – Sum of all session durations per day.
4. **Average Usage Time** – Average session duration per day.
5. **Most Used Device** – Device used most frequently per day.
6. **Top Location** – Location from which the user generated the most events per day.
7. **Unique Browsers** – Count of distinct browsers used per day.
8. **Event Counts** – Count of each event type per day (reported in a separate table).

### **Additional Notes**

* Events belong to different sessions identified by `session_id`.
* A session starts at `login` and ends at `logout`. If a session has no `logout`, assume it ends at **23:59:59** on that day.
* Event types include: `login`, `logout`, `click`, `view`, `purchase`.
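
The session rule above can be sketched in plain Python before moving to Spark. This is a minimal illustration, not the notebook's implementation: the helper name `session_minutes` is hypothetical, and it assumes the fallback end time is 23:59:59 on the login's calendar day.

```python
from datetime import datetime, time
from typing import Optional

def session_minutes(login: datetime, logout: Optional[datetime]) -> float:
    """Session length in minutes; a missing logout ends at 23:59:59 on the login day."""
    if logout is None:
        # Assumption: "that day" means the calendar day of the login event.
        logout = datetime.combine(login.date(), time(23, 59, 59))
    return (logout - login).total_seconds() / 60

fmt = "%Y-%m-%d %H:%M:%S"
u1_login = datetime.strptime("2025-08-27 09:00:00", fmt)
u1_logout = datetime.strptime("2025-08-27 09:45:00", fmt)

print(session_minutes(u1_login, u1_logout))  # 45.0
print(session_minutes(u1_login, None))       # minutes from 09:00:00 to 23:59:59
```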

---

## **Sample Input from CSV file (Simplified)**

| user\_id | event\_time         | event\_type | device  | location | browser | session\_id |
| -------- | ------------------- | ----------- | ------- | -------- | ------- | ----------- |
| U1       | 2025-08-27 09:00:00 | login       | mobile  | India    | Chrome  | S101        |
| U1       | 2025-08-27 09:15:00 | click       | mobile  | India    | Chrome  | S101        |
| U1       | 2025-08-27 09:45:00 | logout      | mobile  | India    | Chrome  | S101        |
| U2       | 2025-08-27 10:00:00 | login       | desktop | USA      | Firefox | S202        |
| U2       | 2025-08-27 10:30:00 | view        | desktop | USA      | Firefox | S202        |
| U2       | 2025-08-27 11:00:00 | logout      | desktop | USA      | Firefox | S202        |

---

## **Expected Output 1: Main Metrics**

| user\_id | date       | num\_logins | session\_times(mins) | total\_usage(mins) | avg\_usage(mins) | most\_device | top\_location | unique\_browsers |
| -------- | ---------- | ----------- | -------------------- | ------------------ | ---------------- | ------------ | ------------- | ---------------- |
| U1       | 2025-08-27 | 1           | \[45]                | 45                 | 45.0             | mobile       | India         | 1                |
| U2       | 2025-08-27 | 1           | \[60]                | 60                 | 60.0             | desktop      | USA           | 1                |
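
As a sanity check, the non-session columns of the U1 row can be reproduced from the three sample events with the standard library. This is only a sketch: the helper `most_common_value` is hypothetical, and breaking ties by first occurrence is an assumption, since the problem statement does not say how ties should be resolved.

```python
from collections import Counter

# U1's sample events for 2025-08-27: (event_type, device, location, browser)
u1_events = [
    ("login", "mobile", "India", "Chrome"),
    ("click", "mobile", "India", "Chrome"),
    ("logout", "mobile", "India", "Chrome"),
]

def most_common_value(values):
    """Most frequent value; ties go to the value seen first (an assumption)."""
    return Counter(values).most_common(1)[0][0]

num_logins = sum(1 for e in u1_events if e[0] == "login")
most_device = most_common_value(e[1] for e in u1_events)
top_location = most_common_value(e[2] for e in u1_events)
unique_browsers = len({e[3] for e in u1_events})

print(num_logins, most_device, top_location, unique_browsers)  # 1 mobile India 1
```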

---

## **Expected Output 2: Event Counts (Separate Table)**

| user\_id | date       | login\_count | logout\_count | click\_count | view\_count | purchase\_count |
| -------- | ---------- | ------------ | ------------- | ------------ | ----------- | --------------- |
| U1       | 2025-08-27 | 1            | 1             | 1            | 0           | 0               |
| U2       | 2025-08-27 | 1            | 1             | 0            | 1           | 0               |
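
The table above is a pivot of event types into columns, with 0 for types that never occur on a given user-day. A minimal plain-Python sketch of that pivot over the six sample rows follows; the names `EVENT_TYPES`, `sample_events`, and `event_table` are illustrative only.

```python
from collections import Counter

EVENT_TYPES = ["login", "logout", "click", "view", "purchase"]

# (user_id, date, event_type) for the six sample rows
sample_events = [
    ("U1", "2025-08-27", "login"),
    ("U1", "2025-08-27", "click"),
    ("U1", "2025-08-27", "logout"),
    ("U2", "2025-08-27", "login"),
    ("U2", "2025-08-27", "view"),
    ("U2", "2025-08-27", "logout"),
]

# Count each (user, date, event_type) triple; missing triples read as 0.
counts = Counter(sample_events)
keys = sorted({(user_id, date) for user_id, date, _ in sample_events})
event_table = {
    key: {f"{event}_count": counts[(*key, event)] for event in EVENT_TYPES}
    for key in keys
}

for key, row in event_table.items():
    print(key, row)
```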

---

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

In [4]:
spark = SparkSession.builder.appName("Daily-Day9").getOrCreate()


In [5]:
df = spark.read.option("header","true").csv("user_data.csv")

In [6]:
df = df.withColumn("event_time", F.col("event_time").cast("timestamp"))

In [None]:
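# Derive the calendar date so every metric can be grouped per user per day.
df = df.withColumn("date", F.to_date("event_time"))

# The earliest login per session and day marks the session start.
login_times = df.filter(F.col("event_type") == "login") \
    .groupBy("user_id", "session_id", "date") \
    .agg(F.min("event_time").alias("login_time"))

# The latest logout per session and day marks the session end.
logout_times = df.filter(F.col("event_type") == "logout") \
    .groupBy("user_id", "session_id", "date") \
    .agg(F.max("event_time").alias("logout_time"))

# Left join keeps sessions that have a login but no logout on that day.
sessions = login_times.join(logout_times, ["user_id", "session_id", "date"], "left")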

# If a session has no logout, assume it ended at 23:59:59 on the same day.
sessions = sessions.withColumn(
    "logout_time",
    F.coalesce(
        F.col("logout_time"),
        F.concat_ws(" ", F.col("date").cast("string"), F.lit("23:59:59")).cast("timestamp"),
    ),
)

# Session duration in minutes.
sessions = sessions.withColumn(
    "session_duration",
    (F.unix_timestamp("logout_time") - F.unix_timestamp("login_time")) / 60
)

# Per-user, per-day metrics computed over all events.
# Note: the SQL `mode` aggregate requires Spark 3.4 or later.
main_metrics = df.groupBy("user_id", "date").agg(
    F.count(F.when(F.col("event_type") == "login", 1)).alias("num_logins"),
    F.expr("mode(device)").alias("most_device"),
    F.expr("mode(location)").alias("top_location"),
    F.countDistinct("browser").alias("unique_browsers")
)

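# Per-user, per-day session aggregates (durations in minutes).
session_aggs = sessions.groupBy("user_id", "date").agg(
    F.collect_list("session_duration").alias("session_times"),
    F.sum("session_duration").alias("total_usage"),
    F.avg("session_duration").alias("avg_usage")
)

# Left join: user-days without a login have no session rows, so their session columns are NULL.
main_metrics = main_metrics.join(session_aggs, ["user_id", "date"], "left")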

# Pivot event types into columns; listing the values fixes the column set.
# User-days with no events of a given type get 0.
event_types = ["login", "logout", "click", "view", "purchase"]
event_counts = df.groupBy("user_id", "date").pivot("event_type", event_types).count().fillna(0)

# Rename the pivoted columns to <event>_count to match the expected output.
for event in event_types:
    event_counts = event_counts.withColumnRenamed(event, f"{event}_count")
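
Because the per-day metrics are left-joined to the session aggregates, a user-day with events but no `login` has no session row, and its session columns come out empty. The plain-Python sketch below illustrates that behaviour under stated assumptions: the durations for U1 and U2 are taken from the sample output, while `U3` and the helper `session_summary` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Session durations in minutes, keyed by (user_id, date)
session_rows = [
    ("U1", "2025-08-27", 45.0),
    ("U2", "2025-08-27", 60.0),
]
# User-days that have events; hypothetical U3 has events but no login, hence no session
user_days = [("U1", "2025-08-27"), ("U2", "2025-08-27"), ("U3", "2025-08-27")]

durations = defaultdict(list)
for user_id, date, minutes in session_rows:
    durations[(user_id, date)].append(minutes)

def session_summary(key):
    """Return (session_times, total_usage, avg_usage), or Nones when no session exists."""
    times = durations.get(key)
    if not times:
        return (None, None, None)
    return (times, sum(times), mean(times))

for key in user_days:
    print(key, session_summary(key))
```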


In [14]:
main_metrics.show()

+-------+----------+----------+-----------+------------+---------------+--------------------+------------------+------------------+
|user_id|      date|num_logins|most_device|top_location|unique_browsers|       session_times|       total_usage|         avg_usage|
+-------+----------+----------+-----------+------------+---------------+--------------------+------------------+------------------+
|     U1|2025-08-01|         0|     tablet|          UK|              2|                NULL|              NULL|              NULL|
|     U1|2025-08-02|         1|     mobile|   Australia|              2|             [304.4]|             304.4|             304.4|
|     U1|2025-08-04|         0|     tablet|   Australia|              1|                NULL|              NULL|              NULL|
|     U1|2025-08-05|         0|     tablet|         USA|              1|                NULL|              NULL|              NULL|
|     U1|2025-08-07|         0|     tablet|         USA|              1|    

In [15]:
event_counts.show()

+-------+----------+-----------+------------+-----------+----------+--------------+
|user_id|      date|login_count|logout_count|click_count|view_count|purchase_count|
+-------+----------+-----------+------------+-----------+----------+--------------+
|    U43|2025-08-06|          2|           3|          0|         0|             0|
|    U31|2025-08-16|          1|           0|          0|         1|             1|
|    U17|2025-08-14|          0|           0|          0|         0|             1|
|     U8|2025-08-25|          0|           1|          0|         0|             1|
|     U9|2025-08-22|          1|           0|          0|         0|             0|
|    U39|2025-08-28|          1|           0|          0|         0|             0|
|    U29|2025-08-04|          0|           0|          0|         0|             2|
|    U38|2025-08-03|          0|           1|          0|         0|             0|
|    U49|2025-08-03|          1|           0|          0|         0|        