# Project Overview

This project aims to explore and benchmark various machine learning models to detect disks at high risk of experiencing fail-slow anomalies. Below are the steps to reproduce the experiments on the Trovi platform.

---

### First Part: Preparing Our Chameleon Server

This step includes:

1. **Create Lease**  
   Reserve resources on the Chameleon cloud.

2. **Launch the Server**  
   Start the server instance using the reserved resources.

3. **Associate Floating IP**  
   Assign a public IP address to the server to enable external access.

4. **Connect to the Instance**  
   Use SSH to access the server.

---

## Configuration

In [None]:
import chi

chi.use_site("CHI@UC")

# Change to your project (CHI-XXXXXX)
chi.set("project_name", "CHI-210850")

print(f'Using Project {chi.get("project_name")}')

Now using CHI@UC:
URL: https://chi.uc.chameleoncloud.org
Location: Argonne National Laboratory, Lemont, Illinois, USA
Support contact: help@chameleoncloud.org
Using Project CHI-210850


## Create Lease

In [None]:
import os
import keystoneauth1, blazarclient
from chi import lease

reservations = []
lease_node_type = "compute_cascadelake_r"

try:
    print("Creating lease...")
    lease.add_fip_reservation(reservations, count=1)
    lease.add_node_reservation(reservations, node_type=lease_node_type, count=1)

    start_date, end_date = lease.lease_duration(hours=3)

    l = lease.create_lease(
        f"{os.getenv('USER')}-benchmark", 
        reservations, 
        start_date=start_date, 
        end_date=end_date
    )
    lease_id = l["id"]

    print("Waiting for lease to start ...")
    lease.wait_for_active(lease_id)
    print("Lease is now active!")
except keystoneauth1.exceptions.http.Unauthorized as e:
    print("Unauthorized.\nDid you set your project name and run the code in the first cell?")
except blazarclient.exception.BlazarClientException as e:
    print(f"There is an issue making the reservation. Check the calendar to make sure a {lease_node_type} node is available.")
    print("https://chi.uc.chameleoncloud.org/project/leases/calendar/host/")
    print(e)
except Exception as e:
    print("An unexpected error happened.")
    print(e)

Creating lease...
Waiting for lease to start ...
Lease is now active!


## Provision Node

In [None]:
from chi import server

image = "CC-Ubuntu20.04"

s = server.create_server(
    f"{os.getenv('USER')}-benchmark", 
    image_name=image,
    reservation_id=lease.get_node_reservation(lease_id)
)

print("Waiting for server to start ...")
server.wait_for_active(s.id)
print("Done")

Waiting for server to start ...
Done


## Associate Floating-IP

In [None]:
floating_ip = lease.get_reserved_floating_ips(lease_id)[0]
server.associate_floating_ip(s.id, floating_ip_address=floating_ip)

print(f"Waiting for SSH connectivity on {floating_ip} ...")
timeout = 60 * 2  # overall time budget in seconds
import socket
import time
# Repeatedly try to open a TCP connection to the SSH port (22).
start_time = time.perf_counter()
while True:
    try:
        with socket.create_connection((floating_ip, 22), timeout=timeout):
            print("Connection successful")
            break
    except OSError as ex:
        # Stop retrying once the time budget is used up; otherwise wait and retry.
        if time.perf_counter() - start_time >= timeout:
            print(f"After {timeout} seconds, could not connect via SSH. Please try again.")
            break
        time.sleep(10)


Waiting for SSH connectivity on 192.5.86.200 ...
Connection successful
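
The wait loop above can also be packaged as a small reusable helper. The following is a minimal sketch, not part of python-chi: the name `wait_for_port` and its parameters are assumptions made for illustration. It polls a TCP port until a connection succeeds or a deadline passes, and returns a boolean so the caller decides what to do on failure.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=120.0, interval=10.0):
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper for illustration; not part of python-chi.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        try:
            # Cap each attempt so a hung connect cannot run past the deadline.
            with socket.create_connection((host, port), timeout=min(interval, remaining)):
                return True
        except OSError:
            # Port not reachable yet: pause, but never sleep past the deadline.
            time.sleep(max(0.0, min(interval, deadline - time.monotonic())))
```

In this notebook it could be called as `wait_for_port(floating_ip)`, printing a message or raising an error when it returns `False`.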


## Configure Instance

In [20]:
from chi import ssh

with ssh.Remote(floating_ip) as conn:
    # Quick check: list the home directory to confirm that SSH access works
    conn.run("ls")

README
data
index
my_mounting_point
openrc
output
output.tar.gz
requirements.txt
run_experiments.sh
scripts


---

### Second Part: Preparing the Environment and Data for the Experiments

This step includes:

1. **Download Data from the Repository**  
   Clone the data repository, which contains two clusters from the 25-cluster Perseus dataset.

2. **Upload All Necessary Files to the Server**  
   Transfer the experiment scripts, datasets, and any other required files to the server.

3. **Uncompress the Necessary Datasets**  
   Extract the datasets to the appropriate directories for use in the experiments.

4. **Install the Dependencies**  
   Install all required libraries and tools, typically using package managers like `pip` for Python libraries.

---
### Note

Because of the memory limitations of Trovi, we provide data for only two clusters from the Perseus dataset. For results on all clusters, see the provided repository, which includes comprehensive test results and heatmaps.

- The **scripts** directory contains all the source code for the algorithms.
- The **index** directory contains index files that map each script to its respective cluster data.
- The **requirements.txt** file lists all the dependencies needed for the project.
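
Before uploading, it can help to confirm that the entries listed above exist in the local working directory. The sketch below is a convenience check, not part of the project: `REQUIRED_PATHS` and `missing_paths` are hypothetical names, and the list covers only the two directories and one file described in this note.

```python
from pathlib import Path

# Hypothetical list: only the entries described in the note above.
REQUIRED_PATHS = ["scripts", "index", "requirements.txt"]


def missing_paths(root=".", required=REQUIRED_PATHS):
    """Return the required entries that do not exist under root, in order."""
    base = Path(root)
    return [name for name in required if not (base / name).exists()]
```

An empty list from `missing_paths()` means every expected entry was found; otherwise the returned names show what still needs to be cloned or copied.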

---

## Prepare the Experiment

In [9]:
!git clone https://github.com/songxikang/data.git

Cloning into 'data'...
remote: Enumerating objects: 2797, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 2797 (delta 2), reused 15 (delta 2), pack-reused 2782
Receiving objects: 100% (2797/2797), 468.43 MiB | 7.35 MiB/s, done.
Resolving deltas: 100% (7/7), done.


In [35]:
with ssh.Remote(floating_ip) as conn:
    # Create data, index, and scripts directories
    conn.run("mkdir -p data")
    conn.run("mkdir -p index")
    conn.run("mkdir -p scripts")
    
    # Upload the Perseus cluster archives and the slow-drive info to the data directory
    conn.put("data/cluster_A.tar.gz", "data/cluster_A.tar.gz")
    conn.put("data/cluster_B.tar.gz", "data/cluster_B.tar.gz")
    conn.put("index/slow_drive_info.csv", "data/slow_drive_info.csv")
    
    # Suppress the output of the following commands
    conn.run("tar -xvzf data/cluster_A.tar.gz -C data > /dev/null 2>&1 && rm data/cluster_A.tar.gz > /dev/null 2>&1")
    conn.run("tar -xvzf data/cluster_B.tar.gz -C data > /dev/null 2>&1 && rm data/cluster_B.tar.gz > /dev/null 2>&1")
    
    # Upload the FSA (fail-slow anomaly) detection scripts
    conn.put("scripts/csr.py", "scripts/csr.py")
    conn.put("scripts/multi_pred.py", "scripts/multi_pred.py")
    conn.put("scripts/lstm.py", "scripts/lstm.py")
    conn.put("scripts/patchTST.py", "scripts/patchTST.py")
    conn.put("scripts/GPT-4.py", "scripts/GPT-4.py")
    
    # Upload index files
    conn.put("index/A_index.csv", "index/A_index.csv")
    conn.put("index/B_index.csv", "index/B_index.csv")
    conn.put("index/all_drive_info.csv", "index/all_drive_info.csv")
    
    # Install dependencies
    conn.put("requirements.txt")
    conn.sudo("apt-get install -y python3-pip")
    conn.run("pip install -r requirements.txt")


Reading package lists...
Building dependency tree...
Reading state information...
python3-pip is already the newest version (20.0.2-5ubuntu1.10).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
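
The repeated `conn.put` calls for scripts and index files in the cell above can also be driven from a list. This is a minimal sketch under stated assumptions: `SCRIPTS`, `INDEX_FILES`, and `upload_plan` are hypothetical names, the file names are copied from the cell above, and each file is assumed to keep the same relative path on the server.

```python
# File names copied from the upload cell; the constant names are hypothetical.
SCRIPTS = ["csr.py", "multi_pred.py", "lstm.py", "patchTST.py", "GPT-4.py"]
INDEX_FILES = ["A_index.csv", "B_index.csv", "all_drive_info.csv"]


def upload_plan(scripts=SCRIPTS, index_files=INDEX_FILES):
    """Return (local_path, remote_path) pairs, keeping each relative path."""
    plan = [(f"scripts/{name}", f"scripts/{name}") for name in scripts]
    plan += [(f"index/{name}", f"index/{name}") for name in index_files]
    return plan
```

Inside the `with ssh.Remote(floating_ip) as conn:` block, the pairs could then be uploaded with `for local, remote in upload_plan(): conn.put(local, remote)`.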


---

### Third Part: Running the Experiments

This step includes:

1. **Upload the `run_experiments.sh` Script**
   - Transfer the `run_experiments.sh` script to the server.

2. **Run the `run_experiments.sh` Script**
   - Execute the script to run all the fail-slow anomaly (FSA) detection algorithms.
   - The script will generate the prediction results.

3. **Compress the Output**
   - The script will compress the output directory into `output.tar.gz`.

4. **Download the Results**
   - Download `output.tar.gz` to the local directory to obtain the prediction results.

---

## Run the experiment

In [30]:
with ssh.Remote(floating_ip) as conn:
    # Upload the script
    conn.put("run_experiments.sh")
    # Run the script 
    conn.run("bash run_experiments.sh")

Running csr.py for index A
Running csr.py for index B
Running multi_pred.py for index A
Running multi_pred.py for index B


KeyboardInterrupt: 

In [34]:
with ssh.Remote(floating_ip) as conn:
    # Re-upload the LSTM script into the remote scripts directory
    conn.put("scripts/lstm.py", "scripts/lstm.py")
    # Run the LSTM script on its own
    conn.run("python3 scripts/lstm.py -p data -i index/B_index.csv -t cluster_A")

Traceback (most recent call last):
  File "scripts/lstm.py", line 199, in <module>
    main(args.perseus_dir, args.input_file, args.train_cluster)
  File "scripts/lstm.py", line 190, in main
    model, scaler_X, scaler_y = train(perseus_dir, cluster_host_mapping, train_cluster)
  File "scripts/lstm.py", line 99, in train
    TRAIN_HOSTS = cluster_host_mapping[train_cluster]
KeyError: 'cluster_A'


UnexpectedExit: Encountered a bad command exit code!

Command: 'python3 scripts/lstm.py -p data -i index/B_index.csv -t cluster_A'

Exit code: 1

Stdout: already printed

Stderr: already printed
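
The `KeyError` above is raised at `cluster_host_mapping[train_cluster]`, which suggests that the mapping built from `index/B_index.csv` has no `cluster_A` entry. A guard along the following lines would report that mismatch with a clearer message. This is a sketch only: `require_cluster` is a hypothetical helper, it assumes the mapping is a dict keyed by cluster name, and the example data is invented because the structure of the index files is not shown here.

```python
def require_cluster(cluster_host_mapping, train_cluster):
    """Return the hosts for train_cluster, or raise a descriptive error.

    Hypothetical guard; assumes a dict keyed by cluster name.
    """
    if train_cluster not in cluster_host_mapping:
        available = ", ".join(sorted(cluster_host_mapping)) or "(none)"
        raise ValueError(
            f"Training cluster {train_cluster!r} is not in the index file; "
            f"available clusters: {available}"
        )
    return cluster_host_mapping[train_cluster]


# Invented example data, for illustration only.
example_mapping = {"cluster_B": ["host_1", "host_2"]}
hosts = require_cluster(example_mapping, "cluster_B")
```

With such a check, a mismatched `-i` and `-t` pair would fail with the list of available clusters instead of a bare `KeyError`.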



In [13]:
import tarfile

with ssh.Remote(floating_ip) as conn:
    # Download the output
    conn.get("output.tar.gz")
with tarfile.open("output.tar.gz") as tar:
    # Extract the results to our notebook
    tar.extractall()
print("done")

done
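
`tar.extractall()` trusts every path stored in the archive. When the archive comes from a remote machine, one cautious option is to check each member's destination before extracting. The sketch below shows one way to do that under stated assumptions: `safe_extract` is a hypothetical helper, and it rejects link entries as well as members that would land outside the target directory.

```python
import os
import tarfile


def safe_extract(archive_path, dest="."):
    """Extract archive_path into dest, refusing entries that escape dest.

    Hypothetical helper; returns the names of the extracted members.
    """
    dest_root = os.path.realpath(dest)
    with tarfile.open(archive_path) as tar:
        members = tar.getmembers()
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # The resolved target must be dest itself or live beneath it.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
            # Links can point outside dest, so this sketch rejects them.
            if member.issym() or member.islnk():
                raise ValueError(f"Link entry not allowed: {member.name}")
        tar.extractall(dest_root, members=members)
    return [m.name for m in members]
```

In the cell above, `safe_extract("output.tar.gz")` could stand in for the `tarfile.open(...)` block.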


### Fourth Part: Parsing the Results

This step includes:

1. **Parse the Results in the Output Directory**
   - Uncompress the `output.tar.gz` file to access the results. (This was already done in the previous step.)

2. **Analyze the Results**
   - Open the `result_parser.ipynb` notebook to see all the analysis.
   - The notebook contains detailed analysis and visualizations of the prediction results.
