# Fault Tolerance

This section aims to explore the following questions:
- How does the system behave under Node/CPU/Memory/Hardware/... errors and failures?
- What happens during network interruptions and partitioning?
- How do error handling mechanisms affect efficiency/scale/latency/throughput/... etc.?

To explore these questions, we observe Spark's fault tolerance mechanism by forcibly stopping an executor's Java process.
 
We set up a Spark Cluster in Standalone mode, consisting of one master and three worker nodes, and start a Spark Application.


In [1]:
import findspark 
findspark.init()

In [2]:
# Spark SQL imports. Aggregations are accessed through the F alias so that
# Python's built-in sum/max are not shadowed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col


In [3]:
import pyspark 
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://192.168.0.5:7077").appName("fault-tolerance4").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/31 12:25:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
root = '../../../Data/eCommerce-behavior-data/2019-Oct.csv'
# root = '../../data/only_purchases_1day.csv'
ecommerce = spark.read\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .csv(root)

                                                                                

In [5]:
ecommerce.createOrReplaceTempView('ecommerce_2019_oct')

## Executors
![Image](https://i.imgur.com/TEtFIbe.png)
You can observe through the Spark UI that there are 4 Executors, namely Master, Worker1, Worker2, and Worker3. 

Each Executor has two Java processes, resulting in a total of 8 Java processes in execution.

In [8]:
sc = spark.sparkContext
# Get the number of registered executors (this memory-status map also counts the driver)
num_executors = sc._jsc.sc().getExecutorMemoryStatus().size()
print("Number of executors:", num_executors)


Number of executors: 4


## Kill Executor
During the execution of a job, if an Executor encounters an exception, Spark takes the following actions:
- When an exception occurs in the Executor, the external wrapper class ExecutorRunner *sends the exception message to the Worker*.
- Subsequently, the Worker *sends a message to the Master*.
- Upon receiving the Executor status change message, if the Master detects an abnormal exit of the Executor, it invokes the Master.schedule method to *attempt to obtain an available Worker node* and restart the Executor.
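
The message flow above can be sketched as a small simulation. This is a toy model only: the `Worker` and `Master` classes, their method names, and the "pick the first available worker" rule are assumptions made for illustration and do not mirror Spark's actual classes or scheduling policy.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Toy stand-in for the standalone Master (hypothetical, not Spark's class)."""
    alive_workers: list
    log: list = field(default_factory=list)

    def executor_state_changed(self, exec_id, worker, exit_code):
        self.log.append(f"master: executor {exec_id} on {worker} exited with code {exit_code}")
        if exit_code != 0:  # abnormal exit: try to place the executor again
            return self.schedule(exec_id)
        return None

    def schedule(self, exec_id):
        # Assumption: simply take the first worker that is still alive.
        if not self.alive_workers:
            self.log.append("master: no worker available")
            return None
        target = self.alive_workers[0]
        self.log.append(f"master: relaunching executor {exec_id} on {target}")
        return target


@dataclass
class Worker:
    """Toy stand-in for a Worker that forwards executor exits to the Master."""
    name: str
    master: Master

    def executor_exited(self, exec_id, exit_code):
        self.master.log.append(f"{self.name}: executor {exec_id} exited ({exit_code})")
        return self.master.executor_state_changed(exec_id, self.name, exit_code)


master = Master(alive_workers=["worker2", "worker3"])
worker1 = Worker("worker1", master)
new_home = worker1.executor_exited(exec_id=2, exit_code=143)

for line in master.log:
    print(line)
```

Each log line corresponds to one hop in the chain: the executor exit reported to the Worker, the Worker's message to the Master, and the Master's scheduling decision.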

### Narrow Dependency
In the case of *narrow dependency*, each child RDD partition depends on a specific parent RDD partition, so *a lost child partition can be recomputed directly from that parent partition alone*, avoiding redundant computation.
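
A minimal pure-Python sketch of this idea, assuming a one-to-one mapping between parent and child partitions as in `filter`. The partition contents and the `filter_partition` helper are made up for illustration and are not Spark APIs.

```python
# Toy lineage: parent partitions -> filter -> child partitions (one-to-one).
parent_partitions = {
    0: ["view", "purchase", "cart"],
    1: ["purchase", "view"],
    2: ["cart", "purchase", "purchase"],
}


def filter_partition(rows):
    """Narrow transformation: uses only the rows of a single parent partition."""
    return [r for r in rows if r == "purchase"]


child_partitions = {pid: filter_partition(rows) for pid, rows in parent_partitions.items()}

# Simulate losing child partition 1 when its executor dies.
lost = 1
del child_partitions[lost]

# Recovery reads only parent partition 1; partitions 0 and 2 are left untouched.
parents_read = [lost]
child_partitions[lost] = filter_partition(parent_partitions[lost])

total_purchases = sum(len(rows) for rows in child_partitions.values())
print(total_purchases)
```

The count after recovery matches the count before the loss, and only one parent partition had to be read again.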

We conduct an experiment using the `filter` operation followed by `count`.

From the Spark UI, we can observe that this triggers two jobs:
1. The first job executes the `filter`.
2. The second job executes the `count`.

Since the `filter` operation is executed first and its result is kept, the second job skips reading the CSV and counts directly from the stored data. You can see in the image that part of Stage 5 is skipped. A skipped stage typically means that its output was already available (for example, from the cache or from earlier shuffle output), so there was no need to re-execute it.

![Image](https://i.imgur.com/dyulBi4.png)


Without killing an executor

In [5]:
only_purchases = ecommerce.filter(col("event_type") == 'purchase')
print("How many purchase session in one month:", only_purchases.count())



How many purchase session in one month: 742849


                                                                                

> **After examining the execution under the condition where no errors occurred, we repeat the same steps and intentionally force-stop one Executor's Java process.**
> 
> **We observe the behavior of Spark in the case of narrow dependency execution (Filter):**

![Image](https://i.imgur.com/o8xf2hx.png)

While the following code snippet runs, I forcefully stop one of the Java processes of an Executor using the Activity Monitor.
As mentioned in the previous step, each executor has two Java processes. Stopping one of them halts the entire executor.
This action results in the following:
1. The affected Executor is removed.
2. The pending tasks of the stopped Executor are redistributed to other Executors.
3. The final result remains the same as the original, 742849, because the lost tasks are rerun on the other Executors.

However, the worker node will be marked as dead due to the forced termination of the Java process.
It is necessary to restart the worker node.
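
The redistribution of lost tasks can be illustrated with a toy scheduler. This is a sketch under simplifying assumptions: tasks are assigned round-robin, and the executor names and the `assign` and `redistribute` helpers are hypothetical. Spark's real scheduler also takes data locality and available cores into account.

```python
def assign(tasks, executors):
    """Round-robin assignment: task id -> executor name."""
    return {t: executors[i % len(executors)] for i, t in enumerate(tasks)}


def redistribute(assignment, lost_executor, survivors):
    """Reassign the tasks of a lost executor to the surviving executors."""
    lost_tasks = sorted(t for t, e in assignment.items() if e == lost_executor)
    new_assignment = dict(assignment)
    new_assignment.update(assign(lost_tasks, survivors))
    return new_assignment, lost_tasks


executors = ["exec0", "exec1", "exec2"]
tasks = list(range(6))

before = assign(tasks, executors)
after, lost_tasks = redistribute(before, "exec2", ["exec0", "exec1"])

print("lost tasks:", lost_tasks)
print("after redistribution:", after)
```

Every task still has an owner after the redistribution, which is why the final count is unchanged even though one executor disappeared mid-stage.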


In [7]:
only_purchases = ecommerce.filter(col("event_type") == 'purchase')
print("How many purchase session in one month:", only_purchases.count())

24/01/31 00:58:00 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.0.5: Command exited with code 143
24/01/31 00:58:00 WARN TaskSetManager: Lost task 31.0 in stage 5.0 (TID 119) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:58:00 WARN TaskSetManager: Lost task 39.0 in stage 5.0 (TID 127) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:58:00 WARN TaskSetManager: Lost task 41.0 in stage 5.0 (TID 129) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:58:00 WARN TaskSetManager: Lost task 38.0 in stage 5.0 (TID 126) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
[Stage 5:>                          

How many purchase session in one month: 742849


                                                                                

### Wide Dependency
- In the case of **wide dependency**, when a child RDD partition is lost, Spark must **recompute all the parent RDD partitions that feed it**, because **the child partition depends on multiple parent RDD partitions**.
- In scenarios with a long compute chain and wide dependencies, it is recommended to checkpoint or cache intermediate results, reducing recomputation overhead.
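
To contrast with the narrow case, here is a pure-Python sketch of rebuilding one lost child partition of a `groupBy`-style aggregation. The data, the `child_of` partitioner, and the `build_child` helper are assumptions for illustration, not Spark internals.

```python
# Toy parent partitions holding (user_session, price) rows.
parent_partitions = {
    0: [("s1", 10.0), ("s2", 5.0)],
    1: [("s1", 2.5), ("s3", 7.0)],
    2: [("s2", 1.0), ("s3", 3.0)],
}
NUM_CHILD_PARTITIONS = 2


def child_of(key):
    """Toy partitioner: route a session key such as 's3' to a child partition."""
    return int(key[1:]) % NUM_CHILD_PARTITIONS


def build_child(child_id, parents):
    """Wide transformation: every parent partition may contribute rows."""
    totals = {}
    parents_read = []
    for pid, rows in parents.items():
        parents_read.append(pid)
        for key, price in rows:
            if child_of(key) == child_id:
                totals[key] = totals.get(key, 0.0) + price
    return totals, parents_read


# Rebuilding lost child partition 1 requires scanning all three parents.
rebuilt, parents_read = build_child(1, parent_partitions)
print(rebuilt, parents_read)
```

Because all parents must be revisited, keeping them cached or checkpointed avoids re-running whatever long chain produced them.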

Without killing an executor

In [10]:
aggregated_data = only_purchases.groupBy("user_session") \
    .agg(
        F.max("event_time").alias("Date_order"),
        F.collect_set("user_id").alias("user_id"),  # Unique user_ids
        F.count("user_session").alias("Quantity"),
        F.sum("price").alias("money_spent")
    )
aggregated_data.show()



+--------------------+-------------------+-----------+--------+------------------+
|        user_session|         Date_order|    user_id|Quantity|       money_spent|
+--------------------+-------------------+-----------+--------+------------------+
|000081ea-9376-4eb...|2019-10-24 11:08:58|[513622224]|       1|            131.51|
|000723e7-1ff9-484...|2019-10-05 15:21:09|[543470009]|       1|             49.36|
|000941cc-a55d-4a5...|2019-10-24 22:20:26|[563830578]|       1|              40.9|
|00095607-9518-42c...|2019-10-05 19:05:28|[531516671]|       1|            386.08|
|000a2754-1167-47c...|2019-10-28 12:56:13|[554129220]|       1|             39.68|
|0010e63b-0333-4f6...|2019-10-16 14:57:29|[525771398]|       1|             31.64|
|00149062-a045-4a1...|2019-10-26 22:11:53|[558054947]|       2|113.50999999999999|
|00167766-6565-4b6...|2019-10-30 09:50:10|[565693206]|       1|            385.83|
|0016bf0d-cdc0-4d6...|2019-10-17 13:17:20|[550091025]|       1|            242.07|
|001

                                                                                

Killing an executor


![Image](https://i.imgur.com/7Tec3Hp.png)

In [11]:
aggregated_data = only_purchases.groupBy("user_session") \
    .agg(
        F.max("event_time").alias("Date_order"),
        F.collect_set("user_id").alias("user_id"),  # Unique user_ids
        F.count("user_session").alias("Quantity"),
        F.sum("price").alias("money_spent")
    )
aggregated_data.show()

24/01/31 00:44:52 ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.0.5: Command exited with code 143
24/01/31 00:44:52 WARN TaskSetManager: Lost task 30.0 in stage 10.0 (TID 164) (192.168.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:44:52 WARN TaskSetManager: Lost task 42.0 in stage 10.0 (TID 176) (192.168.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:44:52 WARN TaskSetManager: Lost task 33.0 in stage 10.0 (TID 167) (192.168.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:44:52 WARN TaskSetManager: Lost task 35.0 in stage 10.0 (TID 169) (192.168.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:44:52 WARN TaskSetMa

+--------------------+-------------------+-----------+--------+------------------+
|        user_session|         Date_order|    user_id|Quantity|       money_spent|
+--------------------+-------------------+-----------+--------+------------------+
|000081ea-9376-4eb...|2019-10-24 11:08:58|[513622224]|       1|            131.51|
|000174ac-0ea3-402...|2019-10-18 12:46:20|[548449052]|       2|            499.72|
|0004400f-dc39-410...|2019-10-16 07:24:33|[550005829]|       1|            143.63|
|0004c309-ff34-44b...|2019-10-13 13:59:14|[547022478]|       2|             281.2|
|000723e7-1ff9-484...|2019-10-05 15:21:09|[543470009]|       1|             49.36|
|000941cc-a55d-4a5...|2019-10-24 22:20:26|[563830578]|       1|              40.9|
|00095607-9518-42c...|2019-10-05 19:05:28|[531516671]|       1|            386.08|
|000a2754-1167-47c...|2019-10-28 12:56:13|[554129220]|       1|             39.68|
|000a9525-b9a4-4cf...|2019-10-07 18:54:17|[557779190]|       1|            102.71|
|000

                                                                                

## Cache Before Killing Executor
We now cache the data before killing an executor and observe how Spark's behavior differs:

**Error Message:**
```python
Lost executor 1 on 192.168.0.5: Command exited with code 143
```

- This indicates that Spark lost an executor: Executor 1 on IP address 192.168.0.5.
- Exit code 143 indicates that the process was terminated by a SIGTERM signal (128 + 15).
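
By convention, an exit code above 128 encodes 128 plus the number of the signal that terminated the process. The helper below (a hypothetical name, using only Python's standard `signal` module on a POSIX system) decodes such a code:

```python
import signal


def describe_exit_code(code):
    """Translate a shell-style exit code into a signal name when code > 128."""
    if code > 128:
        return signal.Signals(code - 128).name
    return f"exit status {code}"


print(describe_exit_code(143))  # SIGTERM
print(describe_exit_code(0))    # exit status 0
```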

**Warning Message:**
```python
Lost task 30.0 in stage 6.0 (TID 161) (192.168.0.5 executor 1): 
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Command exited with code 143
```

- This is a warning message about the failure of task 30.0 in Stage 6.0.
- It indicates that the task was running on Executor 1, and due to the exit of that executor, the task also failed.
- The reason is that the command execution returned exit code 143.

```python
No more replicas available for rdd_21_30!
```
- This is a warning related to a cached RDD block (rdd_21_30), indicating that no more replicas are available. The executor that held the block was lost, so the cached copy of that partition is gone.
- Compared with killing an executor without caching, an additional warning appears: `No more replicas available for rdd...`
  - With `cache()` or `persist()`, Spark stores each cached partition as a block on an executor. Only storage levels with a replication factor greater than 1 keep copies on multiple nodes, which enhances data redundancy and reliability.
  - With replicas, even if an Executor stops, copies are still available on other nodes.
  - However, the default replication factor is 1, so when an executor is killed, this warning occurs because no other replica is available.
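
The effect of the replication factor can be sketched with a toy block registry. The block placement below and the `lose_executor` helper are assumptions for illustration. The sketch models only the bookkeeping idea of which executors hold a block, not Spark's block manager itself.

```python
def lose_executor(block_locations, executor):
    """Remove an executor from every block's holder set.

    Returns the sorted ids of blocks left without any replica.
    """
    orphaned = []
    for block_id, holders in block_locations.items():
        holders.discard(executor)
        if not holders:
            orphaned.append(block_id)
    return sorted(orphaned)


# Replication factor 1: each cached block lives on exactly one executor.
single = {
    "rdd_21_5": {"exec1"},
    "rdd_21_6": {"exec2"},
    "rdd_21_12": {"exec2"},
}
# Replication factor 2: each cached block lives on two executors.
double = {
    "rdd_21_5": {"exec1", "exec0"},
    "rdd_21_6": {"exec2", "exec0"},
    "rdd_21_12": {"exec2", "exec1"},
}

orphaned_single = lose_executor(single, "exec2")
orphaned_double = lose_executor(double, "exec2")

print("no replicas left (factor 1):", orphaned_single)
print("no replicas left (factor 2):", orphaned_double)
```

With a single copy per block, every block held by the lost executor ends up without a replica. With two copies, none do.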


In [7]:
only_purchases.cache()

DataFrame[event_time: timestamp, event_type: string, product_id: int, category_id: bigint, category_code: string, brand: string, price: double, user_id: int, user_session: string]

In [8]:
aggregated_data = only_purchases.groupBy("user_session") \
    .agg(
        F.max("event_time").alias("Date_order"),
        F.collect_set("user_id").alias("user_id"),  # Unique user_ids
        F.count("user_session").alias("Quantity"),
        F.sum("price").alias("money_spent")
    )
aggregated_data.show()

24/01/31 01:16:15 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.0.5: Command exited with code 143
24/01/31 01:16:15 WARN TaskSetManager: Lost task 18.0 in stage 6.0 (TID 151) (192.168.0.5 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 01:16:15 WARN TaskSetManager: Lost task 12.0 in stage 6.0 (TID 145) (192.168.0.5 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 01:16:15 WARN TaskSetManager: Lost task 21.0 in stage 6.0 (TID 154) (192.168.0.5 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 01:16:15 WARN TaskSetManager: Lost task 6.0 in stage 6.0 (TID 139) (192.168.0.5 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 01:16:15 WARN TaskSetManager

+--------------------+-------------------+-----------+--------+------------------+
|        user_session|         Date_order|    user_id|Quantity|       money_spent|
+--------------------+-------------------+-----------+--------+------------------+
|000081ea-9376-4eb...|2019-10-24 11:08:58|[513622224]|       1|            131.51|
|000174ac-0ea3-402...|2019-10-18 12:46:20|[548449052]|       2|            499.72|
|0004400f-dc39-410...|2019-10-16 07:24:33|[550005829]|       1|            143.63|
|0004c309-ff34-44b...|2019-10-13 13:59:14|[547022478]|       2|             281.2|
|000723e7-1ff9-484...|2019-10-05 15:21:09|[543470009]|       1|             49.36|
|000941cc-a55d-4a5...|2019-10-24 22:20:26|[563830578]|       1|              40.9|
|00095607-9518-42c...|2019-10-05 19:05:28|[531516671]|       1|            386.08|
|000a2754-1167-47c...|2019-10-28 12:56:13|[554129220]|       1|             39.68|
|000a9525-b9a4-4cf...|2019-10-07 18:54:17|[557779190]|       1|            102.71|
|000

                                                                                

24/01/31 01:18:13 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.0.5: Worker shutting down
24/01/31 01:18:13 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_30 !
24/01/31 01:18:13 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_24 !
24/01/31 01:18:13 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_42 !
24/01/31 01:18:13 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_26 !
24/01/31 01:18:13 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_6 !
24/01/31 01:18:13 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_5 !
24/01/31 01:18:13 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_29 !
24/01/31 01:18:13 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_37 !
24/01/31 01:18:13 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_12 !
24/01/31 01:18:13 WARN BlockManagerMasterEndpoint: No more r

## Persist Before Killing Executor

Here, we compare the `persist()` and `cache()` methods, then repeat the experiment using the `persist()` method.

### `persist()` vs `cache()`

**persist()**

1. `persist()` is a more general method and allows specifying more options.
2. The `persist()` method sets the storage level through the `StorageLevel` parameter. For example, you can choose from options like `MEMORY_ONLY`, `MEMORY_ONLY_SER`, `DISK_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK_2`, etc. The `_SER` levels belong to the Scala and Java APIs; in Python, stored objects are always serialized.
   1. `MEMORY_ONLY_SER`: Stores the data in memory in serialized form, reducing memory usage. If memory is not sufficient to hold all RDD blocks, Spark does not cache the partitions that do not fit and does not throw an error. Those partitions are recomputed when the RDD is needed again, which can be expensive.
   2. `MEMORY_AND_DISK_SER`: Stores the data in serialized form in memory and spills partitions that do not fit to disk. In industry setups, `persist(StorageLevel.MEMORY_AND_DISK)` is commonly used because it combines the benefits of caching in memory and spilling to disk when memory is limited.
   3. `MEMORY_ONLY_2`: Sets the replication factor to 2, replicating each partition on two nodes in the cluster.
   4. `MEMORY_AND_DISK_2`: Same as `MEMORY_AND_DISK`, with a replication factor of 2, meaning each block is replicated on two nodes.

**cache()**

1. `cache()` is a specific case of `persist()` that uses the default storage level: `MEMORY_ONLY` for RDDs and a memory-and-disk level for DataFrames.
2. The `cache()` method does not provide detailed parameter options like `persist()`; it always uses that default level, with a replication factor of 1.


In [10]:
from pyspark.storagelevel import StorageLevel
only_purchases.persist(StorageLevel.MEMORY_AND_DISK_2)

DataFrame[event_time: timestamp, event_type: string, product_id: int, category_id: bigint, category_code: string, brand: string, price: double, user_id: int, user_session: string]

You will notice that by creating replicas on two nodes, the warning message `WARN BlockManagerMasterEndpoint: No more replicas available for rdd_...` does not occur.

So, when a node fails, a replacement executor is launched on an available node and receives the data partitions it is supposed to work on, along with the associated Directed Acyclic Graph (DAG) in a closure. With this information, it can recompute the data and materialize it.

In the meantime, the cached Resilient Distributed Dataset (RDD) no longer has all of its data in memory. The partitions of the lost node need to be fetched from a surviving replica or from disk, which takes a little more time.


In [11]:
aggregated_data = only_purchases.groupBy("user_session") \
    .agg(
        F.max("event_time").alias("Date_order"),
        F.collect_set("user_id").alias("user_id"),  # Unique user_ids
        F.count("user_session").alias("Quantity"),
        F.sum("price").alias("money_spent")
    )
aggregated_data.show()

24/01/31 12:18:26 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.0.5: Command exited with code 143
24/01/31 12:18:26 WARN TaskSetManager: Lost task 26.0 in stage 6.0 (TID 156) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 12:18:26 WARN TaskSetManager: Lost task 16.0 in stage 6.0 (TID 147) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 12:18:26 WARN TaskSetManager: Lost task 7.0 in stage 6.0 (TID 138) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 12:18:26 WARN TaskSetManager: Lost task 28.0 in stage 6.0 (TID 159) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 12:18:26 WARN TaskSetManager

+--------------------+-------------------+-----------+--------+------------------+
|        user_session|         Date_order|    user_id|Quantity|       money_spent|
+--------------------+-------------------+-----------+--------+------------------+
|000081ea-9376-4eb...|2019-10-24 11:08:58|[513622224]|       1|            131.51|
|000174ac-0ea3-402...|2019-10-18 12:46:20|[548449052]|       2|            499.72|
|0004400f-dc39-410...|2019-10-16 07:24:33|[550005829]|       1|            143.63|
|0004c309-ff34-44b...|2019-10-13 13:59:14|[547022478]|       2|             281.2|
|000723e7-1ff9-484...|2019-10-05 15:21:09|[543470009]|       1|             49.36|
|000941cc-a55d-4a5...|2019-10-24 22:20:26|[563830578]|       1|              40.9|
|00095607-9518-42c...|2019-10-05 19:05:28|[531516671]|       1|            386.08|
|000a2754-1167-47c...|2019-10-28 12:56:13|[554129220]|       1|             39.68|
|000a9525-b9a4-4cf...|2019-10-07 18:54:17|[557779190]|       1|            102.71|
|000

                                                                                