## **Problem: Advanced User Session Analytics**

A website with user logins tracks detailed user activity across sessions. For each user and each day, calculate:

1. **Number of Logins** – Count of login events per day.
2. **Session Times** – Duration of each session (between login and logout) per day.
3. **Total Usage Time** – Sum of all session durations per day.
4. **Average Usage Time** – Average session duration per day.
5. **Most Used Device** – Device used most frequently per day.
6. **Top Location** – Location from which the user generated the most events per day.
7. **Unique Browsers** – Count of distinct browsers used per day.
8. **Event Counts** – Count of each event type per day (stored separately).

### **Additional Notes**

* Events belong to different sessions identified by `session_id`.
* A session starts at `login` and ends at `logout`. If a session has no `logout`, assume it ends at **23:59:59** on that day.
* Event types include: `login`, `logout`, `click`, `view`, `purchase`.
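
The session rule above can be sketched in plain Python as a small reference check, independent of Spark. The function name `session_minutes` and its input shape (a list of `(event_time, event_type)` pairs for one session on one day) are assumptions made for this illustration, not part of the original notebook. A session with no `login` is not covered by the notes, so the sketch returns `None` for it.

```python
from datetime import datetime, time

FMT = "%Y-%m-%d %H:%M:%S"

def session_minutes(events):
    """Duration in minutes of ONE session on ONE day.

    events: list of (event_time_string, event_type) pairs (hypothetical shape).
    """
    login = logout = None
    for ts, etype in events:
        t = datetime.strptime(ts, FMT)
        if etype == "login" and (login is None or t < login):
            login = t  # earliest login marks the session start
        elif etype == "logout" and (logout is None or t > logout):
            logout = t  # latest logout marks the session end
    if login is None:
        return None  # case not defined by the problem statement
    if logout is None:
        # No logout: assume the session ends at 23:59:59 on the login day
        logout = datetime.combine(login.date(), time(23, 59, 59))
    return (logout - login).total_seconds() / 60
```

For session `S101` in the sample input (login at 09:00:00, logout at 09:45:00) this gives 45 minutes, which agrees with the expected output table.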

---

## **Sample Input from CSV file (Simplified)**

| user\_id | event\_time         | event\_type | device  | location | browser | session\_id |
| -------- | ------------------- | ----------- | ------- | -------- | ------- | ----------- |
| U1       | 2025-08-27 09:00:00 | login       | mobile  | India    | Chrome  | S101        |
| U1       | 2025-08-27 09:15:00 | click       | mobile  | India    | Chrome  | S101        |
| U1       | 2025-08-27 09:45:00 | logout      | mobile  | India    | Chrome  | S101        |
| U2       | 2025-08-27 10:00:00 | login       | desktop | USA      | Firefox | S202        |
| U2       | 2025-08-27 10:30:00 | view        | desktop | USA      | Firefox | S202        |
| U2       | 2025-08-27 11:00:00 | logout      | desktop | USA      | Firefox | S202        |

---

## **Expected Output 1: Main Metrics**

| user\_id | date       | num\_logins | session\_times(mins) | total\_usage(mins) | avg\_usage(mins) | most\_device | top\_location | unique\_browsers |
| -------- | ---------- | ----------- | -------------------- | ------------------ | ---------------- | ------------ | ------------- | ---------------- |
| U1       | 2025-08-27 | 1           | \[45]                | 45                 | 45.0             | mobile       | India         | 1                |
| U2       | 2025-08-27 | 1           | \[60]                | 60                 | 60.0             | desktop      | USA           | 1                |
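
As a cross-check on the table above, the main metrics can be recomputed from the six sample rows in plain Python. This is a minimal sketch, not the notebook's Spark solution: `ROWS`, `main_metrics`, and the output dictionary keys are names introduced here for illustration. Two behaviours are assumptions because the problem statement does not define them: ties for most-used device or top location go to whichever value `Counter.most_common` returns first, and `avg_usage` is `None` when a user-day has no login.

```python
from collections import Counter, defaultdict
from datetime import datetime, time

FMT = "%Y-%m-%d %H:%M:%S"

# Sample input rows: user_id, event_time, event_type, device, location, browser, session_id
ROWS = [
    ("U1", "2025-08-27 09:00:00", "login", "mobile", "India", "Chrome", "S101"),
    ("U1", "2025-08-27 09:15:00", "click", "mobile", "India", "Chrome", "S101"),
    ("U1", "2025-08-27 09:45:00", "logout", "mobile", "India", "Chrome", "S101"),
    ("U2", "2025-08-27 10:00:00", "login", "desktop", "USA", "Firefox", "S202"),
    ("U2", "2025-08-27 10:30:00", "view", "desktop", "USA", "Firefox", "S202"),
    ("U2", "2025-08-27 11:00:00", "logout", "desktop", "USA", "Firefox", "S202"),
]

def main_metrics(rows):
    # Group events by (user_id, date)
    groups = defaultdict(list)
    for user, ts, etype, device, location, browser, sid in rows:
        t = datetime.strptime(ts, FMT)
        groups[(user, t.date().isoformat())].append((t, etype, device, location, browser, sid))

    result = {}
    for key, evs in groups.items():
        logins, logouts = {}, {}
        for t, etype, device, location, browser, sid in evs:
            if etype == "login":
                logins[sid] = min(t, logins.get(sid, t))
            elif etype == "logout":
                logouts[sid] = max(t, logouts.get(sid, t))

        durations = []
        for sid, start in sorted(logins.items(), key=lambda kv: kv[1]):
            end = logouts.get(sid)
            if end is None:
                # Missing logout: session ends at 23:59:59 that day
                end = datetime.combine(start.date(), time(23, 59, 59))
            durations.append((end - start).total_seconds() / 60)

        total = sum(durations)
        result[key] = {
            "num_logins": sum(1 for e in evs if e[1] == "login"),
            "session_times": durations,
            "total_usage": total,
            "avg_usage": total / len(durations) if durations else None,
            "most_device": Counter(e[2] for e in evs).most_common(1)[0][0],
            "top_location": Counter(e[3] for e in evs).most_common(1)[0][0],
            "unique_browsers": len({e[4] for e in evs}),
        }
    return result
```

Calling `main_metrics(ROWS)` yields one entry per user-day, keyed by `(user_id, date)`, whose values can be compared with the rows of the table.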

---

## **Expected Output 2: Event Counts (Separate Table)**

| user\_id | date       | login\_count | logout\_count | click\_count | view\_count | purchase\_count |
| -------- | ---------- | ------------ | ------------- | ------------ | ----------- | --------------- |
| U1       | 2025-08-27 | 1            | 1             | 1            | 0           | 0               |
| U2       | 2025-08-27 | 1            | 1             | 0            | 1           | 0               |
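
The event-count table is effectively a pivot of `event_type` per user and day, with zeros for event types that never occur. A minimal plain-Python sketch of that pivot follows; `event_counts` and `EVENT_TYPES` are hypothetical names chosen for this example, and the date is taken from the first ten characters of `event_time`, assuming the `yyyy-MM-dd HH:mm:ss` layout shown in the sample input.

```python
from collections import Counter, defaultdict

# Fixed column order, taken from the event types listed in the notes
EVENT_TYPES = ["login", "logout", "click", "view", "purchase"]

def event_counts(rows):
    """rows: iterable of (user_id, event_time_string, event_type) tuples."""
    counts = defaultdict(Counter)
    for user, ts, etype in rows:
        counts[(user, ts[:10])][etype] += 1  # ts[:10] is the date part
    # A Counter returns 0 for missing keys, so absent event types become 0
    return {
        key: {f"{e}_count": c[e] for e in EVENT_TYPES}
        for key, c in counts.items()
    }

sample = [
    ("U1", "2025-08-27 09:00:00", "login"),
    ("U1", "2025-08-27 09:15:00", "click"),
    ("U1", "2025-08-27 09:45:00", "logout"),
    ("U2", "2025-08-27 10:00:00", "login"),
    ("U2", "2025-08-27 10:30:00", "view"),
    ("U2", "2025-08-27 11:00:00", "logout"),
]
table = event_counts(sample)
```

Listing the event types up front keeps the output columns stable even when a given day contains no `purchase` or `view` events.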

---

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

In [2]:
spark = SparkSession.builder.appName("Daily-Day9").getOrCreate()


In [3]:
df = spark.read.option("header","true").csv("user_data.csv")

In [5]:
df = df.withColumn("event_time", F.col("event_time").cast("timestamp"))

In [8]:
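# Derive the calendar date; a timestamp column needs no format string
transform = df.withColumn('date', F.to_date(F.col('event_time')))
# Count login events per user per day. The window keeps every login row,
# so the same count repeats on each login row of a given user-day.
login_wind = Window.partitionBy('user_id', 'date')
num_logins = transform.filter(F.col('event_type') == 'login')
num_logins = num_logins.withColumn('num_logins', F.count('event_type').over(login_wind))
num_logins.show()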

+-------+-------------------+----------+-------+---------+-------+----------+----------+----------+
|user_id|         event_time|event_type| device| location|browser|session_id|      date|num_logins|
+-------+-------------------+----------+-------+---------+-------+----------+----------+----------+
|     U1|2025-08-02 18:55:35|     login| mobile|Australia|   Edge|     S8333|2025-08-02|         1|
|     U1|2025-08-08 15:47:01|     login|desktop|Australia| Safari|     S9952|2025-08-08|         1|
|     U1|2025-08-22 19:59:16|     login|desktop|       UK|Firefox|     S1298|2025-08-22|         1|
|     U1|2025-08-25 20:04:30|     login|desktop|       UK|   Edge|     S4541|2025-08-25|         1|
|     U1|2025-08-26 18:34:38|     login| tablet|Australia|   Edge|     S4014|2025-08-26|         1|
|    U10|2025-08-07 10:22:07|     login|desktop|      USA|Firefox|     S4146|2025-08-07|         1|
|    U10|2025-08-10 18:21:04|     login| mobile|      USA| Chrome|     S3050|2025-08-10|         1|


In [13]:
transform.show()

+-------+-------------------+----------+-------+---------+-------+----------+----------+
|user_id|         event_time|event_type| device| location|browser|session_id|      date|
+-------+-------------------+----------+-------+---------+-------+----------+----------+
|    U21|2025-08-12 16:17:06|  purchase|desktop|Australia| Chrome|     S9440|2025-08-12|
|     U2|2025-08-12 04:31:21|  purchase| tablet|Australia|Firefox|     S9876|2025-08-12|
|    U18|2025-08-17 14:03:12|  purchase|desktop|    India|Firefox|     S7075|2025-08-17|
|    U45|2025-08-20 01:56:42|      view|desktop|Australia| Safari|     S7144|2025-08-20|
|    U21|2025-08-25 19:31:51|     login| mobile|   Canada| Safari|     S3814|2025-08-25|
|     U3|2025-08-13 12:16:10|     click|desktop|Australia|Firefox|     S6097|2025-08-13|
|     U5|2025-08-02 05:58:30|    logout| tablet|    India|  Opera|     S2232|2025-08-02|
|    U19|2025-08-21 04:21:33|  purchase| tablet|      USA| Safari|     S7053|2025-08-21|
|     U5|2025-08-21 1

In [27]:
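# Count events per user, day, and device (the alias holds the event count, not a device name)
most_used = transform.groupBy('user_id', 'date', 'device').agg(F.count('*').alias('most_used_device'))
# Rank devices within each user-day by event count; row_number() picks one row arbitrarily on ties
rank_wind = Window.partitionBy('user_id', 'date').orderBy(F.desc('most_used_device'))
most_used = most_used.withColumn('rank', F.row_number().over(rank_wind))
# Keep only the top-ranked device per user-day
most_used = most_used.filter(F.col('rank') == 1)
most_used.show()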

+-------+----------+-------+----------------+----+
|user_id|      date| device|most_used_device|rank|
+-------+----------+-------+----------------+----+
|     U1|2025-08-01| mobile|               1|   1|
|     U1|2025-08-02| mobile|               2|   1|
|     U1|2025-08-04| tablet|               1|   1|
|     U1|2025-08-05| tablet|               1|   1|
|     U1|2025-08-07| tablet|               1|   1|
|     U1|2025-08-08|desktop|               1|   1|
|     U1|2025-08-10| mobile|               1|   1|
|     U1|2025-08-11| mobile|               1|   1|
|     U1|2025-08-12|desktop|               1|   1|
|     U1|2025-08-15| mobile|               1|   1|
|     U1|2025-08-22|desktop|               1|   1|
|     U1|2025-08-24| mobile|               1|   1|
|     U1|2025-08-25|desktop|               1|   1|
|     U1|2025-08-26| tablet|               1|   1|
|    U10|2025-08-02| tablet|               1|   1|
|    U10|2025-08-03| mobile|               2|   1|
|    U10|2025-08-04| mobile|   