# 🎯 Interview Coding Challenge 1: User Session Analysis

**Difficulty:** Medium  
**Time:** 30 minutes  
**Skills:** Window functions, aggregations, complex transformations

## Problem Statement

Given user activity logs, calculate comprehensive session metrics including:
- Session duration and boundaries
- Page views per session
- Bounce rate analysis
- User-level engagement metrics

**Session Definition:** A session ends after 30 minutes of inactivity; the user's next event starts a new session.
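
To make the rule concrete, here is a minimal plain-Python sketch of the gap check (an illustration only, not the PySpark solution). It assumes a strict comparison, so a gap of more than 30 minutes starts a new session, which matches the hint given later in this challenge. The helper name `starts_new_session` is hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: a gap strictly greater than 30 minutes opens a new session.
SESSION_TIMEOUT = timedelta(minutes=30)

def starts_new_session(previous, current):
    """Return True if `current` begins a new session.

    `previous` is the user's prior event time, or None for their first event.
    """
    if previous is None:
        return True
    return current - previous > SESSION_TIMEOUT

t1 = datetime(2023, 1, 1, 10, 10)
t2 = datetime(2023, 1, 1, 11, 0)
print(starts_new_session(None, t1))                        # True: first event
print(starts_new_session(t1, t2))                          # True: 50-minute gap
print(starts_new_session(t1, t1 + timedelta(minutes=5)))   # False: 5-minute gap
```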

## Input Schema
```python
activity_schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("page_url", StringType(), True),
    StructField("event_type", StringType(), True),
    StructField("session_id", StringType(), True)
])
```

## Expected Output
- User-level metrics (total sessions, bounce rate, avg session duration)
- Session-level details (start/end times, page views, unique pages)

## Evaluation Criteria
- Correct session boundary detection
- Proper window function usage
- Efficient aggregation logic
- Clean, readable code

---

**Start coding your solution below!** 🚀

In [None]:
# Initialize Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("Interview Challenge 1") \
    .getOrCreate()

print("Spark initialized successfully!")

In [None]:
# Create sample data for testing
sample_data = [
    ("user1", "2023-01-01 10:00:00", "/home", "page_view", "s1"),
    ("user1", "2023-01-01 10:05:00", "/products", "page_view", "s1"),
    ("user1", "2023-01-01 10:10:00", "/products", "click", "s1"),
    ("user1", "2023-01-01 11:00:00", "/checkout", "page_view", "s1"),  # New session (50-minute gap), despite the unchanged session_id
    ("user2", "2023-01-01 10:15:00", "/home", "page_view", "s2"),
    ("user2", "2023-01-01 10:20:00", "/home", "scroll", "s2"),
    ("user3", "2023-01-01 14:00:00", "/home", "page_view", "s3"),
    ("user3", "2023-01-01 14:05:00", "/about", "page_view", "s3"),
    ("user3", "2023-01-01 14:10:00", "/contact", "page_view", "s3"),
]

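# Define schema (for reference only: the sample timestamps are strings,
# so the DataFrame below is built from column names and cast afterwards)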
activity_schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("page_url", StringType(), True),
    StructField("event_type", StringType(), True),
    StructField("session_id", StringType(), True)
])

# Create DataFrame
activities_df = spark.createDataFrame(sample_data, 
    ["user_id", "timestamp", "page_url", "event_type", "session_id"])
activities_df = activities_df.withColumn("timestamp", to_timestamp("timestamp"))

print("Sample data created:")
activities_df.show()

## Your Solution

**Implement the session analysis logic below:**

1. Calculate time differences between consecutive events per user
2. Identify session boundaries (30-minute inactivity rule)
3. Assign session numbers within each user
4. Calculate per-session metrics
5. Compute user-level aggregations and bounce rate

**Hints:**
- Use `Window.partitionBy().orderBy()` for user-level operations
- Use `lag()` function to get previous timestamps
- Use `sum()` window function to create session numbers
- A bounce is a session with exactly 1 page view; bounce rate = bounce sessions / total sessions
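
The `lag()` and running `sum()` hints can be illustrated without Spark. The sketch below applies the same two steps in plain Python to user1's sample timestamps: flag each event whose gap from the previous one exceeds the timeout, then take a running total of the flags to get session numbers. The function name `session_numbers` is hypothetical, and a real solution would do this per user inside a window rather than on a single list.

```python
from datetime import datetime, timedelta
from itertools import accumulate

SESSION_TIMEOUT = timedelta(minutes=30)

def session_numbers(timestamps):
    """Assign 1-based session numbers to one user's event times (in sorted order)."""
    ordered = sorted(timestamps)
    # Equivalent of lag(): pair each event with the one before it (None for the first).
    previous = [None] + ordered[:-1]
    # Flag is 1 when the event opens a new session, else 0.
    flags = [
        1 if prev is None or curr - prev > SESSION_TIMEOUT else 0
        for prev, curr in zip(previous, ordered)
    ]
    # Equivalent of a running sum() over the window: cumulative count of flags.
    return list(accumulate(flags))

user1_events = [
    datetime(2023, 1, 1, 10, 0),
    datetime(2023, 1, 1, 10, 5),
    datetime(2023, 1, 1, 10, 10),
    datetime(2023, 1, 1, 11, 0),
]
print(session_numbers(user1_events))  # [1, 1, 1, 2]
```

The running total works because each flag marks the first event of a session, so the count of flags seen so far is the session the current event belongs to.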

In [None]:
# YOUR SOLUTION HERE
# Implement the analyze_user_sessions function

def analyze_user_sessions(activities_df):
    """
    Calculate comprehensive user session metrics
    
    Args:
        activities_df: DataFrame with user activity data
    
    Returns:
        tuple: (user_metrics_df, session_details_df)
    """
    
    # Define session timeout (30 minutes of inactivity)
    SESSION_TIMEOUT_MINUTES = 30
    
    # Step 1: Calculate time differences between consecutive events per user
    # Hint: Use Window.partitionBy("user_id").orderBy("timestamp")
    # Hint: Use lag() function and unix_timestamp for time calculations
    
    # YOUR CODE HERE
    
    # Step 2: Identify session boundaries
    # Hint: New session when time_diff (in seconds) > SESSION_TIMEOUT_MINUTES * 60,
    #       or when time_diff is null (the user's first event)
    
    # YOUR CODE HERE
    
    # Step 3: Assign session numbers
    # Hint: Use sum() window function on new_session flag
    
    # YOUR CODE HERE
    
    # Step 4: Calculate per-session metrics
    # Hint: Group by user_id, session_number and aggregate
    
    # YOUR CODE HERE
    
    # Step 5: Calculate bounce rate and user-level metrics
    # Hint: Bounce = sessions with page_views == 1
    
    # YOUR CODE HERE
    
    return user_metrics, session_details

# Test your solution
try:
    user_metrics, session_details = analyze_user_sessions(activities_df)
    
    print("✅ User-level metrics:")
    user_metrics.show()
    
    print("\n✅ Session-level details:")
    session_details.show()
    
except Exception as e:
    print(f"❌ Error in your solution: {e}")
    print("Keep working on your implementation!")

## Expected Solution Output

**User-level metrics:**
```
+-------+--------------+------------+----------------+---------------+--------------------+--------------------------+-----------+-------------------+
|user_id|total_sessions|total_events|total_page_views|bounce_sessions|avg_session_duration|avg_page_views_per_session|bounce_rate|last_activity      |
+-------+--------------+------------+----------------+---------------+--------------------+--------------------------+-----------+-------------------+
|user1  |2             |4           |3               |1              |300.0               |1.5                       |0.5        |2023-01-01 11:00:00|
|user2  |1             |2           |1               |1              |300.0               |1.0                       |1.0        |2023-01-01 10:20:00|
|user3  |1             |3           |3               |0              |600.0               |3.0                       |0.0        |2023-01-01 14:10:00|
+-------+--------------+------------+----------------+---------------+--------------------+--------------------------+-----------+-------------------+
```

**Session-level details:**
```
+-------+--------------+-------------------+-------------------+------------+----------+------------+------------------------+
|user_id|session_number|session_start      |session_end        |total_events|page_views|unique_pages|session_duration_seconds|
+-------+--------------+-------------------+-------------------+------------+----------+------------+------------------------+
|user1  |1             |2023-01-01 10:00:00|2023-01-01 10:10:00|3           |2         |2           |600                     |
|user1  |2             |2023-01-01 11:00:00|2023-01-01 11:00:00|1           |1         |1           |0                       |
|user2  |1             |2023-01-01 10:15:00|2023-01-01 10:20:00|2           |1         |1           |300                     |
|user3  |1             |2023-01-01 14:00:00|2023-01-01 14:10:00|3           |3         |3           |600                     |
+-------+--------------+-------------------+-------------------+------------+----------+------------+------------------------+
```
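
As a cross-check, the user-level bounce and duration figures can be derived from the session-level rows. The sketch below does this in plain Python, with the rows copied from the session-level table as `(user_id, page_views, session_duration_seconds)` tuples. The function name `summarize_users` is hypothetical, and the sketch assumes the average duration is a simple mean over each user's sessions.

```python
from collections import defaultdict

# (user_id, page_views, session_duration_seconds), copied from the session-level table
sessions = [
    ("user1", 2, 600),
    ("user1", 1, 0),
    ("user2", 1, 300),
    ("user3", 3, 600),
]

def summarize_users(session_rows):
    """Roll session rows up to per-user totals, bounce rate, and average duration."""
    grouped = defaultdict(list)
    for user_id, page_views, duration in session_rows:
        grouped[user_id].append((page_views, duration))

    summary = {}
    for user_id, rows in grouped.items():
        total = len(rows)
        # A bounce is a session with exactly one page view.
        bounces = sum(1 for page_views, _ in rows if page_views == 1)
        summary[user_id] = {
            "total_sessions": total,
            "bounce_sessions": bounces,
            "bounce_rate": bounces / total,
            "avg_session_duration": sum(d for _, d in rows) / total,
        }
    return summary

for user_id, metrics in sorted(summarize_users(sessions).items()):
    print(user_id, metrics)
```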

---

**Great job completing this interview challenge!** 🎉

**Key concepts tested:**
- Window functions and analytical operations
- Complex aggregations and grouping
- Time-based calculations and session logic
- Bounce rate and engagement metrics calculation

**Ready for the next challenge?** 🚀