# Additional question from tutor

## "Given a map-reduce sequence of tasks, what would be the algorithm to convert it into Spark, and can one improve it in speed?"

### **1. How to convert a MapReduce sequence of tasks into Spark**

1) **Understanding the MapReduce workflow:**

* **Map phase:** Processes input data to produce intermediate key-value pairs.
* **Reduce phase:** Aggregates the intermediate data by key to produce the final output.
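To make the two phases concrete, here is a minimal pure-Python sketch of map, shuffle, and reduce on a few in-memory records. The sample records and the helper names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative assumptions and are not part of the clickstream job.

```python
from collections import defaultdict

# Hypothetical sample input: (user_id, page) records.
records = [(1, "main"), (2, "main"), (1, "news"), (3, "main")]

def map_phase(records):
    """Emit one (key, 1) pair per record, keyed by page."""
    return [(page, 1) for _user, page in records]

def shuffle(pairs):
    """Group values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups):
    """Aggregate the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

page_counts = reduce_phase(shuffle(map_phase(records)))
print(page_counts)  # {'main': 3, 'news': 1}
```
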

2) **Converting MapReduce tasks into Spark**

**A. Map phase conversion:**

- **In MapReduce:** Each mapper processes a subset of the data, emitting key-value pairs.
- **In Spark:** The `map()` and `mapValues()` transformations are used to process and transform the data.
- **In the code:**
  - **Parsing and initial mapping:** Each line is parsed into a structured record.
    ```python
    parsed_rdd = clickstream_rdd.map(parse_csv_line).filter(lambda x: x is not None)
    ```
  - **Key-Value mapping:** Maps each record to a key-value pair where the key is `(user_id, session_id)`.
    ```python
    grouped_rdd = parsed_rdd.map(lambda row: ((row['user_id'], row['session_id']), row))
    ```


**B. Shuffle and grouping:**

- **In MapReduce:** Intermediate key-value pairs are shuffled and sorted to group values by key.
- **In Spark:** The `groupByKey()` transformation groups all the values with the same key.
- **In the code:** Groups events by `(user_id, session_id)`.
  ```python
  grouped_rdd = grouped_rdd.groupByKey()
  ```

**C. Reduce phase conversion:**

- **In MapReduce:** Reducers aggregate the values for each key to produce the final output.
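- **In Spark:** Use transformations like `reduceByKey()` or `aggregateByKey()` for aggregation. When the reducer needs all values of a key at once, as route construction does here, apply `mapValues()` to the output of `groupByKey()`.
- **In the code:**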
  - **Constructing routes (reduction within each group):** Constructs the route for each user session.
    ```python
    routes_rdd = grouped_rdd.mapValues(construct_route)
    ```
  - **Mapping routes to counts:** Prepares data for counting occurrences of each route.
    ```python
    route_counts_rdd = routes_rdd.map(lambda x: (x[1], 1))
    ```
  - **Reducing by route key:** Aggregates the counts for each route.
    ```python
    final_counts_rdd = route_counts_rdd.reduceByKey(lambda a, b: a + b)
    ```
    

**D. Collecting and sorting results:**

- **In MapReduce:** Final results are written to HDFS or collected for further processing.
- **In Spark:** Actions like `collect()`, `take()`, or `takeOrdered()` are used to retrieve results.
- **In the code:** Retrieves the top 10 routes by count.
  ```python
  top_routes = final_counts_rdd.takeOrdered(10, key=lambda x: -x[1])
  ```
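`takeOrdered(n, key)` returns the `n` elements with the smallest key, which is why the count is negated above. The selection can be pictured in plain Python as a per-partition top-n followed by a merge. The partition contents below are made-up numbers, and `take_ordered` is a hypothetical helper that sketches the idea; it is not Spark's own code.

```python
import heapq

# Hypothetical (route, count) pairs, split across two partitions.
partitions = [
    [("main", 50), ("main-news", 7), ("main-bonus", 12)],
    [("main-archive", 20), ("main-online", 3)],
]

def take_ordered(partitions, n, key):
    """Keep the n smallest elements per partition, then merge the candidates."""
    candidates = []
    for part in partitions:
        candidates.extend(heapq.nsmallest(n, part, key=key))
    return heapq.nsmallest(n, candidates, key=key)

top_2 = take_ordered(partitions, 2, key=lambda x: -x[1])
print(top_2)  # [('main', 50), ('main-archive', 20)]
```

Because only `n` candidates per partition reach the driver, this avoids a full sort of the dataset.
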
 

**E. Key differences and advantages:**

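- **In-memory computation:** Spark can keep intermediate data in memory between stages, whereas a chain of MapReduce jobs writes each job's output to HDFS and reads it back. This reduces disk I/O.
- **Lazy evaluation and DAG optimization:** Spark records transformations in a Directed Acyclic Graph (DAG) and runs nothing until an action is called, which lets it pipeline narrow transformations such as `map()` and `filter()` into a single stage.
- **Higher-level APIs:** Beyond RDDs, Spark offers DataFrames (and typed Datasets in Scala and Java), which are more concise and can be optimized by the Catalyst query optimizer.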


### **2. Can one improve it in speed?**

Yes, the speed of the Spark job can be improved. Here are several ways to enhance performance:

**A. Use DataFrames instead of RDDs**

- **Why:** DataFrames are optimized through Spark's Catalyst optimizer and can offer significant performance improvements over RDDs.
- **How to implement:** Convert data processing logic to use DataFrames and built-in functions.
- **Benefits:**
  - Better performance due to optimized execution plans.
  - Easier to write and maintain code.
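As a rough illustration, the counting step could look as follows with the DataFrame API. This is a sketch under assumptions: `routes_df` is taken to be a Spark DataFrame with one row per session and a string column named `route`; both the column name and the helper `top_routes_df` are hypothetical.

```python
def top_routes_df(routes_df, n=10):
    """Count sessions per route and return the n most frequent routes.

    Assumes `routes_df` is a Spark DataFrame with a string column "route"
    (a hypothetical name chosen for this sketch).
    """
    return (
        routes_df.groupBy("route")
        .count()                            # adds a "count" column
        .orderBy("count", ascending=False)  # most frequent routes first
        .limit(n)
    )
```

One possible way to obtain such a DataFrame from the RDD pipeline is `spark.createDataFrame(routes_rdd.map(lambda x: (x[1],)), ["route"])`; calling `.collect()` on the result of `top_routes_df` then triggers execution.
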


**B. Avoid using UDFs when possible**

- **Issue:** User-Defined Functions (UDFs) are opaque to the Catalyst optimizer, and Python UDFs additionally require serializing rows between the JVM and the Python worker processes.
- **Solution:** Use Spark SQL built-in functions if possible.
- **Applicability:** In this case, since route construction involves custom logic, using a UDF is acceptable. However, ensure the UDF is efficient.

**C. Reduce data shuffling**

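- **Optimization:** Prefer `reduceByKey()` over `groupByKey()` whenever the aggregation is associative and commutative, because it combines values within each partition before shuffling data across the network. In this job the route counting already uses `reduceByKey()`; route construction still needs `groupByKey()` because it requires every event of a session, ordered by timestamp.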
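The difference can be illustrated without a cluster. The sketch below simulates two partitions of `(route, 1)` pairs in plain Python and counts how many records would cross the network with and without a map-side combine. The partition contents are invented for illustration, and the functions only model the idea; they are not Spark APIs.

```python
from collections import Counter

# Hypothetical partitions of (route, 1) pairs before the shuffle.
partitions = [
    [("main", 1), ("main", 1), ("main-news", 1), ("main", 1)],
    [("main", 1), ("main-news", 1), ("main-news", 1)],
]

def shuffled_without_combine(partitions):
    """groupByKey-style: every record is sent across the network."""
    return sum(len(part) for part in partitions)

def shuffled_with_combine(partitions):
    """reduceByKey-style: each partition sends one pre-aggregated record per key."""
    return sum(len(Counter(key for key, _ in part)) for part in partitions)

def reduce_by_key(partitions):
    """Combine locally per partition, then merge the partial sums."""
    total = Counter()
    for part in partitions:
        local = Counter()
        for key, value in part:
            local[key] += value
        total.update(local)  # Counter.update adds counts
    return dict(total)

print(shuffled_without_combine(partitions))  # 7
print(shuffled_with_combine(partitions))     # 4
print(reduce_by_key(partitions))             # {'main': 4, 'main-news': 3}
```
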


**D. Cache intermediate RDDs**

- **Benefit:** If an RDD or DataFrame is used by more than one action, caching it prevents recomputation. Caching data that is used only once adds overhead without any benefit.
- **Implementation:**
  ```python
  routes_rdd.persist()
  ```

**E. Optimize resource allocation**

- **Adjust parallelism:** Set the number of partitions appropriately to utilize all available cores.
  - Example:
    ```python
    num_partitions = sc.defaultParallelism * 2
    clickstream_rdd = clickstream_rdd.repartition(num_partitions)
    ```
- **Executor memory and cores:** Configure executor memory and cores based on the cluster's resources.

**F. Data serialization**

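- **Use an efficient serialization format:** Configure Spark to use Kryo for faster JVM-side serialization. `spark.serializer` must be set before the SparkContext starts; it cannot be changed on a running session. In PySpark, Python objects inside RDDs are still pickled, so the gain for this job may be small.
  - Example:
    ```python
    builder = SparkSession.builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    ```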

**G. Filter data early**

- **Optimization:** Filter out unnecessary data as early as possible to reduce the amount of data processed downstream.
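In the route job, for example, `construct_route` only looks at `page` events and at events whose type contains `error`, so every other event could be dropped before the shuffle. A minimal sketch, assuming records shaped like the dictionaries returned by `parse_csv_line`; `is_relevant` is a hypothetical helper name and the sample rows are invented.

```python
def is_relevant(row):
    """Keep only events that can influence the route: page views and errors."""
    event_type = row["event_type"].lower()
    return event_type == "page" or "error" in event_type

# Hypothetical parsed records for one session.
sample = [
    {"user_id": 1, "session_id": 1, "event_type": "page", "event_page": "main", "timestamp": 10},
    {"user_id": 1, "session_id": 1, "event_type": "event", "event_page": "main", "timestamp": 11},
    {"user_id": 1, "session_id": 1, "event_type": "SomeError", "event_page": "news", "timestamp": 12},
]

kept = [row for row in sample if is_relevant(row)]
print(len(kept))  # 2

# With Spark, the same predicate would go in front of the grouping step:
# grouped_rdd = (parsed_rdd.filter(is_relevant)
#                .map(lambda row: ((row["user_id"], row["session_id"]), row))
#                .groupByKey())
```

Sessions that contain no relevant events simply disappear, which matches the later step that filters out empty routes.
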

**H. Monitor and tune performance**

- **Use Spark UI:** Monitor job execution to identify bottlenecks.
- **Adjust configurations:** Tune Spark settings such as `spark.default.parallelism` (RDD shuffles) and `spark.sql.shuffle.partitions` (DataFrame and SQL shuffles) for better performance.


### Applying the algorithm to the code


```python
from pyspark.sql import SparkSession

# 1) Initialize the Spark session and get its context
spark = SparkSession.builder.appName("ClickstreamAnalysisRDD").getOrCreate()
sc = spark.sparkContext

# 2) Define a function to parse input lines
def parse_csv_line(line):
    """
    Parses a single tab-separated line into a structured dictionary.

    Args:
        line (str): A line from the input file.

    Returns:
        dict or None: A dictionary keyed by field name, or None if the line is malformed.
    """
    fields = line.split('\t')
    if len(fields) != 5:
        # Malformed line: wrong number of fields
        return None
    try:
        return {
            "user_id": int(fields[0]),
            "session_id": int(fields[1]),
            "event_type": fields[2],
            "event_page": fields[3],
            "timestamp": int(fields[4])
        }
    except ValueError:
        # Conversion error: a numeric field is not an integer
        return None

# 3) Load the clickstream data from the HDFS path into an RDD
clickstream_rdd = sc.textFile("hdfs:/data/clickstream.csv")

# 4) Remove the header row
header = clickstream_rdd.first()
clickstream_rdd = clickstream_rdd.filter(lambda line: line != header)

# 5) Parse lines into structured records, dropping malformed ones
parsed_rdd = clickstream_rdd.map(parse_csv_line).filter(lambda x: x is not None)

# 6) Map to key-value pairs and group by (user_id, session_id)
grouped_rdd = parsed_rdd.map(lambda row: ((row['user_id'], row['session_id']), row)) \
                        .groupByKey()

# 7) Define a function to construct user routes
def construct_route(events):
    """
    Constructs a navigation route from a sequence of events.

    Args:
        events (Iterable[dict]): An iterable of dictionaries containing event data.

    Returns:
        str: A hyphen-separated string representing the user's navigation route.
    """
    # Sort events by timestamp
    sorted_events = sorted(events, key=lambda event: event['timestamp'])

    navigation_path = []
    for event in sorted_events:
        event_type = event['event_type'].lower()
        event_page = event['event_page']
        if "error" in event_type:
            # The route ends at the first error
            break
        if event_type == "page":
            # Skip consecutive visits to the same page
            if not navigation_path or navigation_path[-1] != event_page:
                navigation_path.append(event_page)

    return "-".join(navigation_path)

# 8) Construct routes for each session
routes_rdd = grouped_rdd.mapValues(construct_route)

# 9) Filter out empty routes
routes_rdd = routes_rdd.filter(lambda x: x[1] != "")

# 10) Map routes to (route, 1)
route_counts_rdd = routes_rdd.map(lambda x: (x[1], 1))

# 11) Reduce by key to count occurrences
final_counts_rdd = route_counts_rdd.reduceByKey(lambda a, b: a + b)

# 12) Take the top 10 routes by count
top_routes = final_counts_rdd.takeOrdered(10, key=lambda x: -x[1])
top_routes_dict = {route: count for route, count in top_routes}
print("Top routes from RDD approach using spark:")
print(top_routes_dict)
```


```
2024-11-07 11:31:09,564 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

Top routes from RDD approach using spark:
{'main': 8185, 'main-archive': 1113, 'main-rabota': 1047, 'main-internet': 897, 'main-bonus': 870, 'main-news': 769, 'main-tariffs': 677, 'main-online': 587, 'main-vklad': 518, 'main-rabota-archive': 170}
```