# OpenDC Demo 4
### Failures

In [experiment 1](1.first_experiment.ipynb), we learned how to run OpenDC experiments and how to analyze and visualize the results to learn about the behavior of a data center.

In this demo, we will explore how machine failures affect the performance of a data center. We run the same workload on data centers that experience different levels of machine failure.

# Failures

Failures cause hosts to become temporarily unavailable. In OpenDC, failures are simulated by providing a failure trace.
This trace describes when failures occur, how long they last, and how severe they are (i.e., the fraction of hosts affected).

In this demo, we will investigate the effect of failures.

#### Let's start by looking at one of the failure traces.

```python
import pandas as pd

df_failure = pd.read_parquet("failure_traces/Facebook_user_reported.parquet")

df_failure
```

| | failure_interval | failure_duration | failure_intensity |
|---|---|---|---|
| 0 | 0 | 14400000 | 1.000000 |
| 1 | 1200000 | 19200000 | 1.000000 |
| 2 | 1200000 | 6000000 | 0.666667 |
| 3 | 1200000 | 6000000 | 1.000000 |
| 4 | 1200000 | 1200000 | 0.833333 |
| ... | ... | ... | ... |
| 4064 | 13200000 | 1200000 | 0.500000 |
| 4065 | 194400000 | 2400000 | 1.000000 |
| 4066 | 280800000 | 1200000 | 0.666667 |
| 4067 | 332400000 | 2400000 | 0.833333 |


- *failure_interval* is the time between consecutive failures.
- *failure_duration* is how long the affected machines cannot be used.
- *failure_intensity* is the fraction of machines affected by the failure.
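
To make the three columns more concrete, the sketch below translates the first five rows of the preview into readable quantities. It rests on two assumptions that the trace itself does not confirm: that the time columns are in milliseconds, and that the number of affected hosts is the intensity multiplied by the cluster size, rounded to the nearest integer. `NUM_HOSTS` and `describe` are hypothetical names used only for illustration.

```python
from datetime import timedelta

# First five rows of the trace preview:
# (failure_interval, failure_duration, failure_intensity)
rows = [
    (0, 14_400_000, 1.0),
    (1_200_000, 19_200_000, 1.0),
    (1_200_000, 6_000_000, 0.666667),
    (1_200_000, 6_000_000, 1.0),
    (1_200_000, 1_200_000, 0.833333),
]

NUM_HOSTS = 6  # hypothetical cluster size, chosen only for illustration


def describe(interval, duration, intensity, num_hosts=NUM_HOSTS):
    """Translate one trace row into readable quantities.

    Assumes the time columns are in milliseconds and that the number of
    affected hosts is intensity * num_hosts, rounded to the nearest integer.
    """
    return {
        "wait": timedelta(milliseconds=interval),
        "downtime": timedelta(milliseconds=duration),
        "hosts_affected": round(intensity * num_hosts),
    }


described = [describe(*row) for row in rows]
for d in described:
    print(
        f"wait {d['wait']}, downtime {d['downtime']}, "
        f"hosts down {d['hosts_affected']}/{NUM_HOSTS}"
    )
```

How OpenDC itself rounds the number of affected hosts may differ; the point here is only how the three columns relate to each other.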

## Experiment

Failures are enabled by adding one or more failure models to the experiment file, as shown below:

```json
{
    "outputFolder": "output/4.failures",
    "topologies": [
        {
            "pathToFile": "topologies/4.failures/surfsara_small.json"
        }
    ],
    "workloads": [
        {
            "pathToFile": "workload_traces/surf_week",
            "type": "ComputeWorkload"
        }
    ],
    "failureModels": [
        {
            "type": "no"
        },
        {
            "type": "trace-based",
            "pathToFile": "failure_traces/Facebook_user_reported.parquet"
        }
    ],
    "exportModels": [
        {
            "exportInterval": 3600,
            "printFrequency": 24,
            "filesToExport": [
                "host",
                "powerSource",
                "service",
                "task"
            ]
        }
    ]
}
```

Failures are added using the "failureModels" parameter. In this experiment, we run two simulations: one without any failures, and one in which failures are injected into the data center based on the Facebook_user_reported failure trace.
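
The "failureModels" list can also be extended programmatically. The sketch below works on a trimmed in-memory copy of the experiment rather than on the real file, and appends one more trace-based model. The file name `another_trace.parquet` is a placeholder, not an actual trace; replace it with a file from the failure_traces folder.

```python
import json

# Trimmed copy of the experiment file; only the keys needed here.
experiment = json.loads("""
{
    "outputFolder": "output/4.failures",
    "failureModels": [
        {"type": "no"},
        {"type": "trace-based",
         "pathToFile": "failure_traces/Facebook_user_reported.parquet"}
    ]
}
""")

# Placeholder file name: replace it with a trace that exists in failure_traces/.
extra_traces = ["failure_traces/another_trace.parquet"]

# Each additional trace becomes one more "trace-based" failure model.
for path in extra_traces:
    experiment["failureModels"].append({"type": "trace-based", "pathToFile": path})

print(json.dumps(experiment["failureModels"], indent=4))
```

Given that the two models in the example produce two simulations, each added entry presumably results in one more simulation run.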

#### Exercise 1: 
Extend the experiment file located [here](experiments/4.failures/failure_experiment.json) with more failure traces. See the [failure_traces](failure_traces) folder for all available traces.

```python
import subprocess

pathToScenario = "experiments/4.failures/failure_experiment.json"
subprocess.run(["OpenDCExperimentRunner/bin/OpenDCExperimentRunner", "--experiment-path", pathToScenario])
```

#### *Note*: 
Because of the failures, not all tasks can be completed.
When a task fails too many times, it is terminated and removed from the system.
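
The rule in this note can be pictured with a toy model. The sketch below is not OpenDC's implementation: the function `run_task`, the threshold `max_failures`, and the state names are all hypothetical and only illustrate the idea that a task is retried after each interruption until it either finishes or exceeds its failure limit.

```python
def run_task(attempt_outcomes, max_failures=3):
    """Toy model of a task that is retried after host failures.

    attempt_outcomes: one boolean per attempt, True if a host failure
    interrupted that attempt. max_failures is a hypothetical threshold.
    """
    failures = 0
    for interrupted in attempt_outcomes:
        if not interrupted:
            return "COMPLETED"  # this attempt ran to the end
        failures += 1
        if failures >= max_failures:
            return "TERMINATED"  # failed too many times, give up
    return "PENDING"  # the toy input ran out of attempts


print(run_task([True, False]))              # one interruption, then success
print(run_task([True, True, True, False]))  # limit reached before success
```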

# Output

#### Exercise 2: 
Load the results into pandas DataFrames.


```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Put your code here
```

# Visualization

#### Exercise 3: 
Plot the number of tasks for each of the failure models.

```python
# Put your code here
```

#### Exercise 4: 
Plot the energy usage of the data center for each of the failure models.

```python
# Put your code here
```

#### Let's compare the runtimes

#### Exercise 5: 
Print the runtime of the workload for each failure model.

```python
# Put your code here
```

```
The workload took 6 days 23:00:00 without failures
The workload took 27 days 04:49:30 with facebook failures
The workload took 31 days 11:02:30 with instagram failures
The workload took 44 days 04:20:00 with netflix failures
```
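
To put these runtimes in perspective, each one can be expressed as a multiple of the failure-free runtime. The sketch below copies the four values from the output above by hand; the dictionary keys are just labels chosen for illustration.

```python
from datetime import timedelta

# Runtimes copied from the printed output.
runtimes = {
    "no failures": timedelta(days=6, hours=23),
    "facebook": timedelta(days=27, hours=4, minutes=49, seconds=30),
    "instagram": timedelta(days=31, hours=11, minutes=2, seconds=30),
    "netflix": timedelta(days=44, hours=4, minutes=20),
}

baseline = runtimes["no failures"]

# Dividing one timedelta by another yields a plain float ratio.
slowdown = {name: runtime / baseline for name, runtime in runtimes.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.2f}x the failure-free runtime")
```

With these values, every failure trace stretches the one-week workload to several times its failure-free runtime.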


#### Let's compare the number of terminated tasks

#### Exercise 6: 
Print the number of tasks that were terminated for each of the failure models.

```python
# Put your code here
```