# Interview Challenge 9: RDD Operations & Custom Transformations

## Problem Statement

You need to process a large log file dataset using PySpark RDD operations. The logs contain user activity data that requires complex transformations and aggregations better suited to the RDD API than to the DataFrame API.

## Dataset Description

**Log Format:**
```
timestamp|user_id|action|page|session_id|ip_address|user_agent|response_time|status_code
```

Example log line:
```
2023-01-15 10:30:45|user123|VIEW|home|sess_abc123|192.168.1.100|Mozilla/5.0|250|200
```

## Tasks

1. **RDD Creation & Parsing**
   - Read log files into an RDD
   - Parse and validate log lines
   - Handle malformed records
   - Convert to structured format

2. **Custom Transformations**
   - Implement custom `map` functions for data extraction
   - Create `filter` functions for data validation
   - Build `flatMap` operations for complex parsing
   - Develop custom partitioning logic

3. **Advanced RDD Operations**
   - Use `groupByKey` vs `reduceByKey` appropriately
   - Implement custom combiners for efficiency
   - Handle data skew with custom partitioning
   - Use `aggregateByKey` for complex aggregations

4. **Performance Optimization**
   - Implement proper caching strategies
   - Use appropriate persistence levels
   - Optimize shuffle operations
   - Minimize data serialization

5. **Complex Analytics**
   - Calculate user session metrics
   - Identify bot traffic patterns
   - Detect anomalous response times
   - Generate time-based aggregations
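
The combiner-based aggregations in Task 3 can be reasoned about without a cluster. The following is a minimal plain-Python sketch, not Spark code: `simulate_aggregate_by_key` is a hypothetical helper that mimics how `aggregateByKey` folds values within each partition and then merges the partial results across partitions. The partition layout and sample values are invented for illustration.

```python
def seq_op(acc, value):
    """Fold one response time into a (sum, count, max) accumulator."""
    total, count, largest = acc
    return (total + value, count + 1, max(largest, value))

def comb_op(a, b):
    """Merge two partial accumulators produced on different partitions."""
    return (a[0] + b[0], a[1] + b[1], max(a[2], b[2]))

def simulate_aggregate_by_key(partitions, zero, seq, comb):
    """Hypothetical stand-in for aggregateByKey: fold per partition, then merge."""
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = seq(local.get(key, zero), value)
        partials.append(local)
    merged = {}
    for local in partials:
        for key, acc in local.items():
            merged[key] = comb(merged[key], acc) if key in merged else acc
    return merged

# Invented sample: (page, response_time) pairs spread over two partitions
partitions = [
    [("home", 250), ("products", 180), ("home", 50)],
    [("home", 120), ("products", 320)],
]
page_stats = simulate_aggregate_by_key(partitions, (0, 0, 0), seq_op, comb_op)
page_avg = {page: total / count for page, (total, count, _) in page_stats.items()}
```

On a real pair RDD the same two functions would be passed as `rdd.aggregateByKey((0, 0, 0), seq_op, comb_op)`. Only the partial accumulators cross the shuffle, whereas `groupByKey` ships every individual value.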

## Technical Requirements
- Use RDD API extensively (not DataFrame)
- Implement custom functions and transformations
- Handle edge cases and malformed data
- Optimize for performance and memory usage
- Include proper error handling
- Demonstrate understanding of lazy evaluation
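
One way to approach the malformed-data and error-handling requirements is a parser that returns a list, so it can be used with `flatMap` and bad lines simply drop out of the result. This is a minimal sketch, assuming the nine-field pipe-delimited format described above; `parse_or_skip` and `FIELDS` are illustrative names, and the third sample line is invented to show a non-numeric field.

```python
from datetime import datetime

FIELDS = ("timestamp", "user_id", "action", "page", "session_id",
          "ip_address", "user_agent", "response_time", "status_code")

def parse_or_skip(line):
    """Return [record] for a valid line and [] for a malformed one."""
    parts = line.split("|")
    if len(parts) != len(FIELDS):
        return []
    record = dict(zip(FIELDS, parts))
    try:
        # Validate the timestamp and convert the numeric fields
        record["timestamp"] = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
        record["response_time"] = int(record["response_time"])
        record["status_code"] = int(record["status_code"])
    except ValueError:
        return []
    return [record]

lines = [
    "2023-01-15 10:30:45|user123|VIEW|home|sess_abc123|192.168.1.100|Mozilla/5.0|250|200",
    "invalid log line without proper format",
    # Invented example: right field count, but response_time is not a number
    "2023-01-15 10:31:12|user123|CLICK|products|sess_abc123|192.168.1.100|Mozilla/5.0|fast|200",
]
# Plain-Python equivalent of flatMap: concatenate the per-line lists
records = [rec for line in lines for rec in parse_or_skip(line)]
```

With Spark this would be applied as `logs_rdd.flatMap(parse_or_skip)`, which does the work of a `map` followed by a `filter` on `None` in a single pass.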

## Key Concepts to Demonstrate
- RDD lineage and DAG optimization
- Shuffle operations and their impact
- Memory vs disk persistence
- Custom serialization
- Fault tolerance mechanisms
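
To make the impact of shuffles concrete, here is a hedged sketch of key salting, one common way to spread a hot key over several reducers before a shuffle. It is plain Python rather than Spark code, and `salt_key`, `unsalt_key`, and `NUM_SALTS` are hypothetical names. Deriving the salt from the session ID is an assumption made to keep the example deterministic; a random salt is another option.

```python
import zlib
from collections import Counter

NUM_SALTS = 4  # assumed fan-out for hot keys; tune to the observed skew

def salt_key(user_id, session_id):
    """Spread one user's records over NUM_SALTS sub-keys, chosen by session."""
    # crc32 is stable across processes, unlike Python's built-in str hash,
    # which is randomized per process by default
    salt = zlib.crc32(session_id.encode("utf-8")) % NUM_SALTS
    return (user_id, salt)

def unsalt_key(salted):
    """Drop the salt so partial results can be merged per user."""
    user_id, _salt = salted
    return user_id

# Invented events: one "hot" user with many sessions, one ordinary user
events = [("bot_crawler", f"sess_{i}") for i in range(100)] + [("user123", "sess_abc123")]

# Stage 1: count per salted key (in Spark: map + reduceByKey on the salted key)
stage1 = Counter(salt_key(user, sess) for user, sess in events)

# Stage 2: strip the salt and sum the partial counts (a second, much smaller reduceByKey)
totals = Counter()
for salted, count in stage1.items():
    totals[unsalt_key(salted)] += count
```

The trade-off is a second aggregation step in exchange for no single task having to process every record of the hot key.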

In [None]:
# Import libraries
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import re
from collections import defaultdict
from datetime import datetime

# Create Spark context
spark = SparkSession.builder \
    .appName("RDDChallenge") \
    .getOrCreate()

sc = spark.sparkContext

# Sample log data (in production, this would be read from files)
sample_logs = [
    "2023-01-15 10:30:45|user123|VIEW|home|sess_abc123|192.168.1.100|Mozilla/5.0|250|200",
    "2023-01-15 10:31:12|user123|CLICK|products|sess_abc123|192.168.1.100|Mozilla/5.0|180|200",
    "2023-01-15 10:32:01|user456|PURCHASE|checkout|sess_def456|192.168.1.101|Chrome/90|450|200",
    "2023-01-15 10:32:30|user123|VIEW|profile|sess_abc123|192.168.1.100|Mozilla/5.0|300|200",
    "2023-01-15 10:33:15|bot_crawler|VIEW|home|sess_bot001|10.0.0.1|Bot/1.0|50|200",
    "2023-01-15 10:34:22|user789|SEARCH|products|sess_ghi789|192.168.1.102|Safari/14|320|200",
    "invalid log line without proper format",
    "2023-01-15 10:35:01|user123|LOGOUT|home|sess_abc123|192.168.1.100|Mozilla/5.0|120|200",
    "2023-01-15 10:35:45|user456|VIEW|orders|sess_def456|192.168.1.101|Chrome/90|280|500",  # Error response
    "2023-01-15 10:36:12|user999|LOGIN|login|sess_jkl999|192.168.1.103|Firefox/88|800|200"  # Slow response
]

# Create RDD from sample data
logs_rdd = sc.parallelize(sample_logs)

print(f"Created RDD with {logs_rdd.count()} log entries")

# === YOUR SOLUTION GOES HERE ===
# Implement RDD-based log processing

# Task 1: Parse and validate log entries
def parse_log_line(line):
    """Parse a single log line into structured format"""
    try:
        # Expected format: timestamp|user_id|action|page|session_id|ip|user_agent|response_time|status_code
        parts = line.split('|')
        if len(parts) != 9:
            return None  # Malformed line
        
        timestamp, user_id, action, page, session_id, ip, user_agent, resp_time, status = parts
        
        # Basic validation
        if not timestamp or not user_id or not action:
            return None
        
        return {
            'timestamp': timestamp,
            'user_id': user_id,
            'action': action,
            'page': page,
            'session_id': session_id,
            'ip_address': ip,
            'user_agent': user_agent,
            'response_time': int(resp_time),
            'status_code': int(status)
        }
    except (ValueError, IndexError):
        return None

# Parse logs; cache the result because several later actions reuse it
parsed_logs = logs_rdd.map(parse_log_line).filter(lambda x: x is not None).cache()
print(f"Successfully parsed {parsed_logs.count()} valid log entries")

# Task 2: Custom transformations
# Extract user actions
user_actions = parsed_logs.map(lambda log: (log['user_id'], log['action']))

# Filter successful requests
successful_requests = parsed_logs.filter(lambda log: log['status_code'] < 400)

# Extract session information
session_info = parsed_logs.map(lambda log: (log['session_id'], {
    'user_id': log['user_id'],
    'start_time': log['timestamp'],
    'page_count': 1
}))

# Task 3: Advanced RDD operations
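# Count actions per user: reduce on the composite (user, action) key so counts
# are combined map-side, then regroup the small per-user results into a dict
user_action_counts = user_actions \
    .map(lambda user_action: (user_action, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))) \
    .groupByKey() \
    .mapValues(dict)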

# Calculate average response time per page
page_response_times = successful_requests \
    .map(lambda log: (log['page'], log['response_time'])) \
    .aggregateByKey(
        (0, 0),  # (sum, count)
        lambda acc, value: (acc[0] + value, acc[1] + 1),
        lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])
    ) \
    .mapValues(lambda sums: sums[0] / sums[1] if sums[1] > 0 else 0)

# Task 4: Session analysis
TS_FORMAT = '%Y-%m-%d %H:%M:%S'

def summarize_session(events):
    """Summarize (timestamp, user_id, action, page) events for one session"""
    times = [datetime.strptime(e[0], TS_FORMAT) for e in events]
    return {
        'user_id': events[0][1],  # Assumes one user per session
        'duration_minutes': (max(times) - min(times)).total_seconds() / 60,
        'page_views': sum(1 for e in events if e[2] == 'VIEW'),
        'actions': len(events),
        'converted': any(e[2] == 'PURCHASE' for e in events)
    }

session_metrics = parsed_logs \
    .map(lambda log: (log['session_id'], (log['timestamp'], log['user_id'], log['action'], log['page']))) \
    .groupByKey() \
    .mapValues(list) \
    .mapValues(summarize_session)

# Task 5: Anomaly detection
response_times = successful_requests.map(lambda log: log['response_time'])
response_stats = response_times.stats()

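# Flag slow requests (above 95th percentile)
# Count on the driver first: an RDD cannot be referenced inside another RDD's transformation
response_count = response_times.count()
p95_index = int(response_count * 0.95)
percentile_95 = response_times.sortBy(lambda x: x).zipWithIndex() \
    .filter(lambda x: x[1] >= p95_index) \
    .map(lambda x: x[0]).first()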

slow_requests = successful_requests.filter(lambda log: log['response_time'] > percentile_95)

# Task 6: Bot detection
bot_indicators = ['bot', 'crawler', 'spider', 'scraper']
bot_traffic = parsed_logs.filter(
    lambda log: any(indicator.lower() in log['user_agent'].lower() for indicator in bot_indicators)
)

# Display results
print("\n=== RESULTS ===")
print(f"\nTotal valid logs: {parsed_logs.count()}")
print(f"Successful requests: {successful_requests.count()}")
print(f"Bot traffic detected: {bot_traffic.count()}")
print(f"Slow requests: {slow_requests.count()}")

print("\nTop 5 pages by average response time:")
for page, avg_time in page_response_times.sortBy(lambda x: x[1], ascending=False).take(5):
    print(f"{page}: {avg_time:.1f}ms")

print("\nUser action summary:")
for user, actions in user_action_counts.take(5):
    print(f"{user}: {actions}")

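# Mark parsed_logs for caching. cache() is lazy, so the count() below materializes it;
# for a real benefit, call cache() before the first action that reuses the RDD
parsed_logs.cache()
print(f"\nCached {parsed_logs.count()} parsed log entries")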

print("\n✅ RDD processing completed!")
print("Key Learnings:")
print("- RDD transformations are lazy until actions are called")
print("- groupByKey shuffles every value; prefer reduceByKey, which combines map-side first")
print("- aggregateByKey lets the accumulator type differ from the value type")
print("- Caching prevents recomputation of expensive operations")
print("- Custom functions enable complex business logic")
