# Fault Tolerance
Try to answer the following questions:
- How does the system behave under Node/CPU/Memory/Hardware/... errors and failures?
- What happens during network interruptions and partitioning?
- How do error handling mechanisms affect efficiency/scale/latency/throughput/... etc.? 
  - Are there any worst/best case considerations?

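What do we want to test here?
- What happens when a Java thread is killed?
  - While executing a narrow dependency
  - While executing a wide dependency
  - When using checkpoint
- What happens when a node is killed?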

In [1]:
import findspark 
findspark.init()

In [2]:
# for sql
from pyspark.sql import SparkSession 
from pyspark.sql.functions import col
from pyspark.sql.functions import sum,avg,max,count
from pyspark.sql import functions as F


In [3]:
import pyspark 
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://192.168.0.5:7077").appName("fault-tolerance2").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/31 01:13:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
root = '../../../Data/eCommerce-behavior-data/2019-Oct.csv'
# root = '../../data/only_purchases_1day.csv'
ecommerce = spark.read\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .csv(root)

                                                                                

In [5]:
ecommerce.createOrReplaceTempView('ecommerce_2019_oct')

## Executors
![Image](https://i.imgur.com/SMo8Il7.png)
There are 4 executors: Master, Worker1, Worker2, and Worker3.
Each executor runs two Java processes, so 8 Java processes are running in total.

In [8]:
sc = spark.sparkContext
# get the number of executors (getExecutorMemoryStatus also counts the driver's block manager)
num_executors = sc._jsc.sc().getExecutorMemoryStatus().size()
print("Number of executors:", num_executors)


Number of executors: 4


## Partition 
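Because the machine has 10 cores, the default number of partitions is 10, but when the data is large the number of partitions grows to roughly 3 times that.
The factor of 3 follows the rule of thumb, attributed here to the Spark documentation and Databricks, that the number of partitions should be about 3 to 4 times the number of cores available in the cluster.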
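
As a quick check of the arithmetic above, here is a minimal sketch of that rule of thumb. The helper name `suggested_partitions` and its `factor` parameter are hypothetical and not part of any Spark API; the 10 cores and the 3 to 4 multiplier are taken from the text.

```python
def suggested_partitions(total_cores: int, factor: int = 3) -> int:
    """Rule-of-thumb partition count: a small multiple of the available cores.

    Hypothetical helper for illustration only; Spark does not expose this function.
    """
    if total_cores <= 0 or factor <= 0:
        raise ValueError("total_cores and factor must be positive")
    return total_cores * factor


cores = 10  # core count stated in the text
for factor in (3, 4):
    print(f"{cores} cores x {factor} -> {suggested_partitions(cores, factor)} partitions")
```

With 10 cores and a factor of 3 this gives 30, which matches the value printed by `sc.defaultParallelism` in the next cell.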

In [9]:

# get the number of partitions
num_partitions = sc.defaultParallelism
print("Number of partitions:", num_partitions)

Number of partitions: 30


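## Kill Executor
We first run a filter transformation; because the data is large, it runs across multiple partitions.
- When an executor fails, its wrapper class ExecutorRunner *sends the failure message to the Worker*.
- The Worker then *forwards the message to the Master*.
- After the Master receives the executor state change message, if it finds that the executor exited abnormally, it calls the Master.schedule method, *tries to obtain an available Worker node*, and restarts the executor.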
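
The message flow in the list above can be sketched as a toy model in plain Python. This is not Spark's implementation: the `Master` and `Worker` classes, their method names, and the "fewest executors first" placement rule are all assumptions made for illustration. The exit code 143 is the one that appears in the logs further down.

```python
class Master:
    """Toy stand-in for the standalone Master (hypothetical, not Spark code)."""

    def __init__(self):
        self.workers = []
        self.events = []

    def register(self, worker):
        self.workers.append(worker)

    def on_executor_state_changed(self, executor_id, worker_name, abnormal_exit):
        self.events.append(f"Master: executor {executor_id} on {worker_name} changed state")
        # Only an abnormal exit triggers rescheduling.
        return self.schedule(executor_id) if abnormal_exit else None

    def schedule(self, executor_id):
        candidates = [w for w in self.workers if w.alive]
        if not candidates:
            self.events.append("Master: no available worker")
            return None
        # Assumed placement rule: pick the live worker with the fewest executors.
        target = min(candidates, key=lambda w: len(w.executors))
        target.launch(executor_id)
        self.events.append(f"Master: relaunched executor {executor_id} on {target.name}")
        return target.name


class Worker:
    """Toy stand-in for a Worker that hosts executors (hypothetical)."""

    def __init__(self, name, master):
        self.name = name
        self.master = master
        self.alive = True
        self.executors = set()
        master.register(self)

    def launch(self, executor_id):
        self.executors.add(executor_id)

    def on_executor_exit(self, executor_id, exit_code):
        # In the description above, this report comes from the executor's runner wrapper.
        self.executors.discard(executor_id)
        self.master.events.append(
            f"{self.name}: executor {executor_id} exited with code {exit_code}"
        )
        return self.master.on_executor_state_changed(
            executor_id, self.name, abnormal_exit=(exit_code != 0)
        )


master = Master()
worker1, worker2 = Worker("worker1", master), Worker("worker2", master)
worker1.launch(0)
worker2.launch(1)
worker2.launch(2)

relaunched_on = worker2.on_executor_exit(2, exit_code=143)
print("\n".join(master.events))
```

The point of the sketch is only the order of the hand-offs: runner to Worker, Worker to Master, then Master choosing an available Worker and relaunching the executor.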

### Narrow Dependency
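With a *narrow dependency*, each child RDD partition depends on one specific parent RDD partition, so *recomputing a lost partition only needs the data of that parent partition*, and there is no redundant computation.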
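
A minimal sketch of this property in plain Python, assuming a toy dataset of three partitions (the data and the `filter_partition` helper are made up for illustration and do not use Spark): after one filtered partition is lost, rebuilding it reads only the matching parent partition.

```python
# Toy parent "RDD": three partitions of event types (made-up data).
parent_partitions = [
    ["view", "purchase", "cart"],
    ["purchase", "view"],
    ["cart", "purchase", "purchase"],
]
reads = []  # records which parent partitions were read


def filter_partition(index):
    """Narrow dependency: child partition `index` is built from parent partition `index` only."""
    reads.append(index)
    return [event for event in parent_partitions[index] if event == "purchase"]


# Initial computation of every child partition.
child_partitions = [filter_partition(i) for i in range(len(parent_partitions))]

# Simulate losing child partition 1 and recomputing it.
reads.clear()
lost = 1
child_partitions[lost] = None
child_partitions[lost] = filter_partition(lost)

total_purchases = sum(len(p) for p in child_partitions)
print("parent partitions re-read:", reads)
print("total purchases:", total_purchases)
```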

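We use filter for the experiment.
However, the Spark UI shows that the code below triggers two jobs: the first job runs the filter and the second job runs the count.
Because the filter runs first and its result is stored in the cache, the second job skips the step that reads the CSV, reads the data directly from the cache, and computes the count.
The image shows that stage 5 was skipped. Typically it means that data has been fetched from cache and there was no need to re-execute the given stage.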

![Image](https://i.imgur.com/dyulBi4.png)

In [6]:
only_purchases = ecommerce.filter(col("event_type") == 'purchase')
print("How many purchase session in one month:", only_purchases.count())



How many purchase session in one month: 742849


                                                                                

> While a narrow dependency (filter) was running, I deliberately force-stopped the Java process of one executor and observed how Spark behaved:

![Image](https://i.imgur.com/o8xf2hx.png)

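While the code below was running, I force-stopped one of the Java processes from Activity Monitor.
From the previous step we know that each executor has two Java processes, so I stopped one of them, which stops all work on that executor.
Spark then does the following:
1. Removes the affected executor.
2. Reassigns the unfinished tasks of that executor to the other executors.
3. Produces the same final result as before: the count is 742849 in both runs.

However, force-stopping the Java process leaves the whole worker node dead.
The worker node has to be restarted.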
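
The three steps above can be sketched with a toy scheduler in plain Python. This is only a model under stated assumptions: `run_count`, its `kill` and `kill_after` parameters, the round-robin assignment, and the sample data are all hypothetical, and results of tasks that finished before the failure are assumed to be kept.

```python
from collections import deque


def run_count(partitions, executors, kill=None, kill_after=1):
    """Count 'purchase' events, optionally losing executor `kill`
    once it has finished `kill_after` tasks (toy model, not Spark)."""
    pending = deque(range(len(partitions)))
    alive = list(executors)
    finished = {executor: 0 for executor in executors}
    results = {}
    while pending:
        if not alive:
            raise RuntimeError("no executors left to run the remaining tasks")
        for executor in list(alive):
            if not pending:
                break
            task = pending.popleft()
            if executor == kill and finished[executor] >= kill_after:
                alive.remove(executor)  # step 1: remove the lost executor
                pending.append(task)    # step 2: its unfinished task goes back to the queue
                continue
            results[task] = sum(1 for event in partitions[task] if event == "purchase")
            finished[executor] += 1
    return sum(results.values()), alive


partitions = [
    ["purchase", "view"],
    ["view"],
    ["purchase", "purchase"],
    ["cart", "purchase"],
    ["view", "purchase"],
    ["purchase"],
]
executors = ["executor-0", "executor-1", "executor-2"]

baseline, _ = run_count(partitions, executors)
with_failure, survivors = run_count(partitions, executors, kill="executor-2", kill_after=1)
print("count without failure:", baseline)
print("count with executor-2 killed:", with_failure, "survivors:", survivors)
```

Both runs return the same count (step 3), because each partition is still processed exactly once by some surviving executor.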

In [7]:
only_purchases = ecommerce.filter(col("event_type") == 'purchase')
print("How many purchase session in one month:", only_purchases.count())

24/01/31 00:58:00 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.0.5: Command exited with code 143
24/01/31 00:58:00 WARN TaskSetManager: Lost task 31.0 in stage 5.0 (TID 119) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:58:00 WARN TaskSetManager: Lost task 39.0 in stage 5.0 (TID 127) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:58:00 WARN TaskSetManager: Lost task 41.0 in stage 5.0 (TID 129) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:58:00 WARN TaskSetManager: Lost task 38.0 in stage 5.0 (TID 126) (192.168.0.5 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Command exited with code 143
[Stage 5:>                          

How many purchase session in one month: 742849


                                                                                

### Wide Dependency
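- With a *wide dependency*, when an entire child RDD partition is lost, *that child partition may depend on multiple parent RDD partitions, so Spark must recompute all of the parent partitions it depends on*.
  - Therefore, when a long compute chain contains wide dependencies, it is advisable to checkpoint or cache the intermediate data first to reduce the execution cost.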
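
A toy model of the point above, in plain Python without Spark. The data, the `child_index` partitioner, and the dictionary used as a cache are all assumptions for illustration; in particular, the toy cache is assumed to survive the failure, which is not guaranteed for cached blocks that lived on a killed executor.

```python
from collections import defaultdict

# Toy parent partitions of (user_session, price) rows (made-up data).
parent_source = [
    [("s1", 10.0), ("s2", 5.0)],
    [("s1", 2.5), ("s3", 7.0)],
    [("s2", 1.0), ("s3", 3.0)],
]
NUM_CHILD = 2
compute_calls = []  # parent partitions that had to be recomputed
cache = {}          # stands in for cached or checkpointed parent data


def child_index(key):
    # Deterministic toy partitioner (the built-in hash() of str varies between runs).
    return sum(key.encode()) % NUM_CHILD


def compute_parent(i, use_cache):
    if use_cache and i in cache:
        return cache[i]
    compute_calls.append(i)
    rows = list(parent_source[i])  # stands in for "read CSV + filter"
    if use_cache:
        cache[i] = rows
    return rows


def build_child(j, use_cache):
    """Wide dependency: child partition j needs rows from every parent partition."""
    totals = defaultdict(float)
    for i in range(len(parent_source)):
        for key, price in compute_parent(i, use_cache):
            if child_index(key) == j:
                totals[key] += price
    return dict(totals)


def parents_recomputed_after_loss(use_cache):
    cache.clear()
    for j in range(NUM_CHILD):
        build_child(j, use_cache)  # initial run
    compute_calls.clear()
    build_child(0, use_cache)      # child partition 0 was lost and is rebuilt
    return sorted(compute_calls)


print("without cache:", parents_recomputed_after_loss(False))
print("with cache:   ", parents_recomputed_after_loss(True))
```

Without the cache, rebuilding a single lost child partition recomputes every parent partition; with the cache, none of them is recomputed.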

Without killing an executor

In [10]:
aggregated_data = only_purchases.groupBy("user_session") \
    .agg(
        F.max("event_time").alias("Date_order"),
        F.collect_set("user_id").alias("user_id"),  # Unique user_ids
        F.count("user_session").alias("Quantity"),
        F.sum("price").alias("money_spent")
    )
aggregated_data.show()



+--------------------+-------------------+-----------+--------+------------------+
|        user_session|         Date_order|    user_id|Quantity|       money_spent|
+--------------------+-------------------+-----------+--------+------------------+
|000081ea-9376-4eb...|2019-10-24 11:08:58|[513622224]|       1|            131.51|
|000723e7-1ff9-484...|2019-10-05 15:21:09|[543470009]|       1|             49.36|
|000941cc-a55d-4a5...|2019-10-24 22:20:26|[563830578]|       1|              40.9|
|00095607-9518-42c...|2019-10-05 19:05:28|[531516671]|       1|            386.08|
|000a2754-1167-47c...|2019-10-28 12:56:13|[554129220]|       1|             39.68|
|0010e63b-0333-4f6...|2019-10-16 14:57:29|[525771398]|       1|             31.64|
|00149062-a045-4a1...|2019-10-26 22:11:53|[558054947]|       2|113.50999999999999|
|00167766-6565-4b6...|2019-10-30 09:50:10|[565693206]|       1|            385.83|
|0016bf0d-cdc0-4d6...|2019-10-17 13:17:20|[550091025]|       1|            242.07|
|001

                                                                                

Killing an executor


![Image](https://i.imgur.com/7Tec3Hp.png)

In [11]:
aggregated_data = only_purchases.groupBy("user_session") \
    .agg(
        F.max("event_time").alias("Date_order"),
        F.collect_set("user_id").alias("user_id"),  # Unique user_ids
        F.count("user_session").alias("Quantity"),
        F.sum("price").alias("money_spent")
    )
aggregated_data.show()

24/01/31 00:44:52 ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.0.5: Command exited with code 143
24/01/31 00:44:52 WARN TaskSetManager: Lost task 30.0 in stage 10.0 (TID 164) (192.168.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:44:52 WARN TaskSetManager: Lost task 42.0 in stage 10.0 (TID 176) (192.168.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:44:52 WARN TaskSetManager: Lost task 33.0 in stage 10.0 (TID 167) (192.168.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:44:52 WARN TaskSetManager: Lost task 35.0 in stage 10.0 (TID 169) (192.168.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 00:44:52 WARN TaskSetMa

+--------------------+-------------------+-----------+--------+------------------+
|        user_session|         Date_order|    user_id|Quantity|       money_spent|
+--------------------+-------------------+-----------+--------+------------------+
|000081ea-9376-4eb...|2019-10-24 11:08:58|[513622224]|       1|            131.51|
|000174ac-0ea3-402...|2019-10-18 12:46:20|[548449052]|       2|            499.72|
|0004400f-dc39-410...|2019-10-16 07:24:33|[550005829]|       1|            143.63|
|0004c309-ff34-44b...|2019-10-13 13:59:14|[547022478]|       2|             281.2|
|000723e7-1ff9-484...|2019-10-05 15:21:09|[543470009]|       1|             49.36|
|000941cc-a55d-4a5...|2019-10-24 22:20:26|[563830578]|       1|              40.9|
|00095607-9518-42c...|2019-10-05 19:05:28|[531516671]|       1|            386.08|
|000a2754-1167-47c...|2019-10-28 12:56:13|[554129220]|       1|             39.68|
|000a9525-b9a4-4cf...|2019-10-07 18:54:17|[557779190]|       1|            102.71|
|000

                                                                                

Using cache to observe the difference when an executor is killed

In [7]:
only_purchases.cache()

DataFrame[event_time: timestamp, event_type: string, product_id: int, category_id: bigint, category_code: string, brand: string, price: double, user_id: int, user_session: string]

In [8]:
aggregated_data = only_purchases.groupBy("user_session") \
    .agg(
        F.max("event_time").alias("Date_order"),
        F.collect_set("user_id").alias("user_id"),  # Unique user_ids
        F.count("user_session").alias("Quantity"),
        F.sum("price").alias("money_spent")
    )
aggregated_data.show()

24/01/31 01:16:15 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.0.5: Command exited with code 143
24/01/31 01:16:15 WARN TaskSetManager: Lost task 18.0 in stage 6.0 (TID 151) (192.168.0.5 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 01:16:15 WARN TaskSetManager: Lost task 12.0 in stage 6.0 (TID 145) (192.168.0.5 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 01:16:15 WARN TaskSetManager: Lost task 21.0 in stage 6.0 (TID 154) (192.168.0.5 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 01:16:15 WARN TaskSetManager: Lost task 6.0 in stage 6.0 (TID 139) (192.168.0.5 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Command exited with code 143
24/01/31 01:16:15 WARN TaskSetManager

+--------------------+-------------------+-----------+--------+------------------+
|        user_session|         Date_order|    user_id|Quantity|       money_spent|
+--------------------+-------------------+-----------+--------+------------------+
|000081ea-9376-4eb...|2019-10-24 11:08:58|[513622224]|       1|            131.51|
|000174ac-0ea3-402...|2019-10-18 12:46:20|[548449052]|       2|            499.72|
|0004400f-dc39-410...|2019-10-16 07:24:33|[550005829]|       1|            143.63|
|0004c309-ff34-44b...|2019-10-13 13:59:14|[547022478]|       2|             281.2|
|000723e7-1ff9-484...|2019-10-05 15:21:09|[543470009]|       1|             49.36|
|000941cc-a55d-4a5...|2019-10-24 22:20:26|[563830578]|       1|              40.9|
|00095607-9518-42c...|2019-10-05 19:05:28|[531516671]|       1|            386.08|
|000a2754-1167-47c...|2019-10-28 12:56:13|[554129220]|       1|             39.68|
|000a9525-b9a4-4cf...|2019-10-07 18:54:17|[557779190]|       1|            102.71|
|000

                                                                                