# Composing & Modeling Parallel Sorting Performance Data
## Part A: Composing Parallel Sorting Data

The parallel sorting dataset contains over 10,000 performance profiles of MPI sorting implementations covering 5 different algorithms, collected by about 100 users. We use this data to show how to train models that identify the algorithm from its performance data.

## 1. Import Necessary Packages

Import packages and point to the dataset.

In [1]:
from glob import glob

import thicket as th

DATA_DIR = "../data/parallel-sorting"



## 2. Read files into Thicket

- `glob()` recursively grabs all Caliper files (`.cali`) in the data directory.
- `from_caliperreader()` reads the Caliper files into Thicket. Passing `fill_perfdata=False` saves memory, which matters with this many files.
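
As a minimal sketch of the recursive pattern, assuming a throwaway directory tree with made-up file names (not the real dataset), the following shows how `**` combined with `recursive=True` picks up `.cali` files at any depth:

```python
import os
import tempfile
from glob import glob

# Build a throwaway directory tree (hypothetical file names, not the real dataset)
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "group1", "runs"))
for rel_path in ["top.cali", "group1/a.cali", "group1/runs/b.cali", "group1/notes.txt"]:
    with open(os.path.join(root, *rel_path.split("/")), "w") as f:
        f.write("")

# With recursive=True, "**" matches zero or more directory levels
found = sorted(glob(f"{root}/**/*.cali", recursive=True))

# Without it, "**" behaves like a single "*" and matches exactly one level
shallow = glob(f"{root}/**/*.cali")

print(len(found), len(shallow))
```

Only the recursive call sees the files at the top level and in the nested `runs` directory, which is why the notebook passes `recursive=True`.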

In [2]:
data = glob(f"{DATA_DIR}/**/*.cali", recursive=True)
print(f"Total files: {len(data)}")

# Read Caliper files without filling the profile index, as it is expensive and unnecessary in our case
tk = th.Thicket.from_caliperreader(
    data,
    fill_perfdata=False
)
print(f"DataFrame shape {tk.dataframe.shape}")
print(f"Metadata shape: {tk.metadata.shape}")

Total files: 12916


(1/2) Reading Files: 100%|██████████| 12916/12916 [01:34<00:00, 136.30it/s]
(2/2) Creating Thicket: 100%|██████████| 12915/12915 [02:06<00:00, 102.34it/s]

DataFrame shape (128716, 16)
Metadata shape: (12916, 62)





## 3. Modify and Filter Metadata Values

Because the dataset is a compilation from many different users, the metadata annotations contain various errors and inconsistencies, which we can fix using Thicket. From manual analysis of the data, we defined two dictionaries to do this:

- `META_FIX_DICT` is used to enforce consistency in the metadata by replacing values.
- `META_WHITELIST_DICT` is used to select the metadata parameters we are looking for from the experiments.

The metadata we reference are the experiment parameters and important identifying metadata:

- `InputType` - The type of sortedness of the input array.
- `Datatype` - The datatype of the values in the input array.
- `num_procs` - Number of parallel processes.
- `InputSize` - Size of the input array.
- `Algorithm` - The name of the parallel sorting algorithm.
- `group_num` - Unique identifier for different implementations.

In [3]:
META_FIX_DICT = {
    "Algorithm": {
        "bitonic_sort": "BitonicSort",
        "merge_sort": "MergeSort",
        "Merge Sort": "MergeSort",
        "odd_even_sort": "OddEvenSort",
        "Merge sort": "MergeSort",
        "Sample Sort": "SampleSort",
        "Bitonic_Sort": "BitonicSort",
        "Merge_Sort": "MergeSort",
        "OddEvenTranspositionSort": "OddEvenSort",
        "Bitonic Sort": "BitonicSort",
        "Mergesort": "MergeSort",
        "mergesort": "MergeSort",
        "oddEven": "OddEvenSort",
        "Odd Even Transposition Sort": "OddEvenSort",
        "RadixSort Sort": "RadixSort",
        "Odd Even Sort": "OddEvenSort",
        "Odd-Even Sort": "OddEvenSort",
        "OddevenSort": "OddEvenSort",
        "oddeven_sort": "OddEvenSort",
        "Radix Sort": "RadixSort",
        "Odd-Even Bubble Sort": "OddEvenSort",
        "Bubble_Sort": "OddEvenSort",
        "Bubblesort": "OddEvenSort",
        "Bubble Sort(Odd/Even)": "OddEvenSort",
        "Bubble/Odd-Even Sort": "OddEvenSort",
        "Parallel Bubble Sort": "OddEvenSort",
        "BubbleSort": "OddEvenSort",
        "Radix": "RadixSort",
        "Bitonic": "BitonicSort",
    },
    "InputType": {
        "perturbed_array": "1%perturbed",
        "sorted_array": "Sorted",
        "random_array": "Random",
        "ascending_array": "Sorted",
        "descending_array": "Reverse",
        "reversed_array": "Reverse",
        "reversedSort": "Reverse",
        "1% Perturbed": "1%perturbed",
        "reverse_sorted": "Reverse",
        "1perturbed": "1%perturbed",
        r"1%%perturbed": "1%perturbed",
        "1 Perturbed": "1%perturbed",
        "1 perturbed": "1%perturbed",
        "Reverse Sorted": "Reverse",
        "1%Perturbed": "1%perturbed",
        "1% perturbation": "1%perturbed",
        "1percentperturbed": "1%perturbed",
        "1 percent noise": "1%perturbed",
        "reverse sorted": "Reverse",
        "sorted_1%_perturbed": "1%perturbed",
        "Reversesorted": "Reverse",
        "ReverseSorted": "Reverse",
        "Reverse_Sorted": "Reverse",
        "ReversedSort": "Reverse",
        "Sorted_1%_perturbed": "1%perturbed",
        "Randomized": "Random",
        "Reversed": "Reverse",
        "reversed": "Reverse",
        "sorted": "Sorted",
        "random": "Random",
        "nearly": "Nearly",
        "reverse": "Reverse",
        " Reverse sorted": "Reverse",
        "Perturbed": "1%perturbed",
        "perturbed": "1%perturbed",
    },
    "Datatype": {
        "integer": "int",
        "Int": "int",
        "Integer": "int",
        "Double": "double",
    },
}

META_WHITELIST_DICT = {
    "InputType": ["Random", "Sorted", "Reverse", "1%perturbed", "Nearly"],
    "Algorithm": [
        "BitonicSort",
        "MergeSort",
        "OddEvenSort",
        "RadixSort",
        "SampleSort",
    ],
    "Datatype": ["int", "float", "double"],
    "num_procs": [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024],
    "InputSize": [65536, 262144, 1048576, 4194304, 16777216, 67108864, 268435456],
}

### 3A. Modify Metadata Values to Match Grammar

We call the pandas `replace()` function on each metadata column, mapping every inconsistent value to its canonical form.
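
A minimal sketch of the replacement step on a standalone column, assuming a small made-up series and a three-entry subset of the fix dictionary:

```python
import pandas as pd

# Hypothetical metadata column with inconsistent spellings
algorithms = pd.Series(["merge_sort", "Merge Sort", "BitonicSort", "Radix"])

# Subset of a fix dictionary: keys are raw values, values are canonical names
fixes = {"merge_sort": "MergeSort", "Merge Sort": "MergeSort", "Radix": "RadixSort"}

# Values that are not keys of the dictionary pass through unchanged
cleaned = algorithms.replace(fixes)
print(cleaned.tolist())
```

Matching is on the whole value, so an entry such as `"Radix"` does not alter a value that is already `"RadixSort"`.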

In [4]:
for meta_col, values in META_FIX_DICT.items():
    tk.metadata[meta_col] = tk.metadata[meta_col].replace(values)

### 3B. Filter Metadata Values from Whitelist

We use the `Thicket.filter_metadata()` function to keep only the profiles whose metadata values all appear in our whitelist.
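
The whitelist predicate can be sketched in plain Python. The rows and the two-key whitelist below are made up for illustration, and `is_whitelisted` is a hypothetical helper, not part of Thicket:

```python
# Hypothetical whitelist and metadata rows, reduced to two parameters
whitelist = {"Algorithm": ["MergeSort", "RadixSort"], "num_procs": [2, 4, 8]}
rows = [
    {"Algorithm": "MergeSort", "num_procs": 4},
    {"Algorithm": "MergeSort", "num_procs": 3},  # num_procs not whitelisted
    {"Algorithm": "QuickSort", "num_procs": 8},  # Algorithm not whitelisted
]


def is_whitelisted(meta):
    """Return True only if every whitelisted key holds an allowed value."""
    return all(meta[key] in allowed for key, allowed in whitelist.items())


kept = [row for row in rows if is_whitelisted(row)]
print(kept)
```

A row is dropped as soon as any single parameter falls outside its allowed list.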

In [5]:
print(f"Total profiles before: {len(tk.profile)}")
tk = tk.filter_metadata(
    lambda meta: all(meta[key] in allowed for key, allowed in META_WHITELIST_DICT.items())
)
print(f"Total profiles after: {len(tk.profile)}")

Total profiles before: 12916
Total profiles after: 10624


### 3C. Filter Duplicate Metadata Values

Duplicate values across all of our experiment parameters indicate that at least one profile has incorrect metadata, since every profile is a single trial. This is typically user error (metadata is manually annotated with Adiak).

We can find duplicates by using `Thicket.groupby()` on all of our experiment parameters except "num_procs", and then checking each group for repeated "num_procs" values with the pandas `duplicated()` function. We then remove every profile in the affected groups using `Thicket.filter_profile()`.
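
The same check can be sketched with plain pandas on a made-up metadata table. This uses `DataFrame.groupby()` as a stand-in for `Thicket.groupby()`, and the column subset is an assumption chosen to keep the example small:

```python
import pandas as pd

# Hypothetical metadata: one row per profile
meta = pd.DataFrame(
    {
        "Algorithm": ["MergeSort", "MergeSort", "RadixSort", "RadixSort", "RadixSort"],
        "InputSize": [65536, 65536, 65536, 65536, 65536],
        "num_procs": [2, 4, 2, 2, 4],
    },
    index=pd.Index(["p0", "p1", "p2", "p3", "p4"], name="profile"),
)

# Within each parameter group, every num_procs value should appear at most once
bad_profiles = []
for key, group in meta.groupby(["Algorithm", "InputSize"]):
    if group["num_procs"].duplicated().any():
        # We cannot tell which duplicate is mislabeled, so drop the whole group
        bad_profiles.extend(group.index)

clean = meta.drop(index=bad_profiles)
print(sorted(bad_profiles), clean.index.tolist())
```

Dropping the entire group is conservative: it also discards the correctly labeled profiles in that group, which matches the behavior of the cell below.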

In [6]:
gb = tk.groupby(["Algorithm", "InputType", "Datatype", "group_num", "InputSize"])
rm_profs = []
for key, ttk in gb.items():
    if ttk.metadata["num_procs"].duplicated().any():
        print(f"Skipping {key} ({len(ttk.profile)} profiles) because it has duplicate num_procs")
        rm_profs += ttk.profile
# Build the set once so each membership test is a constant-time lookup
rm_profs = set(rm_profs)
tk = tk.filter_profile([p for p in tk.profile if p not in rm_profs])
print(f"Total profiles after removing duplicates: {len(tk.profile)}")

Skipping ('RadixSort', 'Random', 'double', 2, 65536) (27 profiles) because it has duplicate num_procs
Skipping ('RadixSort', 'Random', 'double', 2, 262144) (26 profiles) because it has duplicate num_procs
Total profiles after removing duplicates: 10571


## 4. Create Features

In this section, we structure the performance data so that each column is a feature and each row is the feature vector for one performance profile. This layout is necessary for modeling.
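
To illustrate the target layout, here is one way to reach it in plain pandas. This is an assumption for illustration, not necessarily how the later notebook builds its feature matrix: a long table indexed by (node, profile) is pivoted so that each profile becomes a row.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["comm_large", "comp_large"], ["p0", "p1"]], names=["node", "profile"]
)
# Hypothetical long-format data: one row per (node, profile)
long_df = pd.DataFrame({"Total time": [1.0, 2.0, 3.0, 4.0]}, index=index)

# Move the node level into the columns: one row per profile, one column per node
features = long_df["Total time"].unstack("node")
print(features)
```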

### 4A. Query the Call Tree and Re-compose the Thicket

Here we select only the nodes we want to compare for our modeling using `Thicket.query()` and `Thicket.concat_thickets()`.

*Note: Unlike when we read the files, `fill_perfdata` is `True` here. This is so we can compute node presence using the `None` values in the "name" column.*

In [7]:
# Perform query
nodes = [
    "comp",
    "comp_large",
    "comm",
    "comm_large",
    "comp_small",
    "comm_small"
]
ntk_dict = {n: tk.query(
    th.query.Query().match(
        "*",
        lambda row: row["name"].apply(
            lambda tn: tn == n
        ).all()
    )
) for n in nodes}

# Re-compose queried Thickets
tk = th.Thicket.concat_thickets(
    thickets=list(ntk_dict.values()),
    fill_perfdata=True,
)
# Drop duplicate profiles in the metadata from concat_thickets.
# These columns are unhashable and would raise an error in the check;
# leaving them out does not change its outcome.
unhashable_cols = ["libraries", "cmdline"]
tk.metadata = tk.metadata.drop_duplicates(
    subset=[col for col in tk.metadata.columns if col not in unhashable_cols]
)

### 4B. Remove Profiles not Containing All Nodes

Because our features use all of the nodes, we remove profiles that do not have data for every required node using `Thicket.filter_profile()`.
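
The completeness check reduces to a set intersection. A minimal sketch with made-up profile ids:

```python
# Hypothetical profile ids observed for each required node
profiles_per_node = {
    "comp": {1, 2, 3, 4},
    "comm": {2, 3, 4},
    "comp_large": {1, 2, 4},
    "comm_large": {2, 4, 5},
}

# A profile is complete only if it appears in every node's set
complete = set.intersection(*profiles_per_node.values())
print(sorted(complete))
```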

In [8]:
# Nodes excluded from the check; only their presence (True/False) is used as a feature
not_considered = ["comp_small", "comm_small"]
profiles_per_node = [
    set(ntk_dict[n].dataframe.index.get_level_values("profile"))
    for n in ntk_dict
    if n not in not_considered
]
# Intersection of the profiles
profile_truth = list(profiles_per_node[0].intersection(*profiles_per_node[1:]))
# Filter the Thicket to only contain these profiles
tk = tk.filter_profile(profile_truth)
print(f"Total profiles that contain all data: {len(tk.profile)}")

Total profiles that contain all data: 9406


### 4C. Compute Features from Performance Data

We compute the "node presence" feature and the derived "comp/comm" features using a mixture of `pandas` functions. The "comp/comm" features are the metrics of the "comp" node divided by those of the "comm" node, and `add_root_node()` adds a new node to the performance data to hold them.
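
A minimal sketch of the ratio computation with plain pandas, assuming two small made-up tables indexed by profile in place of the Thicket node slices:

```python
import pandas as pd

metric_cols = ["Avg time/rank", "Total time"]
profiles = pd.Index(["p0", "p1"], name="profile")

# Hypothetical per-profile metrics for the two nodes
comp = pd.DataFrame({"Avg time/rank": [2.0, 9.0], "Total time": [8.0, 30.0]}, index=profiles)
comm = pd.DataFrame({"Avg time/rank": [1.0, 3.0], "Total time": [4.0, 10.0]}, index=profiles)

# div() aligns on index and column labels before dividing element-wise
ratio = comp[metric_cols].div(comm[metric_cols])
print(ratio)
```

Because the division is label-aligned, a profile missing from either table would produce `NaN` in the result rather than being divided against the wrong row.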

In [9]:
metric_cols = [
    "Variance time/rank",
    "Min time/rank",
    "Max time/rank",
    "Avg time/rank",
    "Total time",
]

# Compute node presence feature
tk.dataframe["Node presence"] = tk.dataframe["name"].apply(lambda name: name is not None)

# Compute comp/comm feature
tk.add_root_node(attrs={"name": "comp/comm", "type": "derived"})
tdf = tk.dataframe.loc[tk.get_node("comp"), metric_cols].div(tk.dataframe.loc[tk.get_node("comm"), metric_cols])
for prof in tdf.index:
    tk.dataframe.loc[(tk.get_node("comp/comm"), prof), metric_cols] = tdf.loc[prof]

### 4D. Define Our Features Using Pandas Slices

To subselect the performance data we care about, we use a slice generated by either `perf_idx()` or `presence_idx()`. These are functions rather than fixed values because the node objects can change their `id`s after certain Thicket operations, so the slice must be rebuilt when it is needed. We use the `Thicket.get_node()` function to select node objects.

We can index the performance data with these slices using `Thicket.dataframe.loc[perf_idx()]` or `Thicket.dataframe.loc[presence_idx()]`.
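
A minimal sketch of this indexing pattern in plain pandas, assuming a made-up table indexed by (node, profile) in which plain strings stand in for Thicket node objects:

```python
import pandas as pd

# Hypothetical performance data indexed by (node, profile)
index = pd.MultiIndex.from_product(
    [["comm_large", "comp_large", "main"], ["p0", "p1"]], names=["node", "profile"]
)
df = pd.DataFrame(
    {"Total time": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], "name": list("aabbcc")},
    index=index,
)


def perf_idx():
    # Built on each call, so the row labels are looked up fresh every time
    return (["comp_large", "comm_large"], ["Total time"])


# df.loc[(rows, cols)] is the same as df.loc[rows, cols]
subset = df.loc[perf_idx()]
print(subset)
```

The list of row labels selects on the first index level, so every profile of the chosen nodes is returned, restricted to the requested columns.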

In [10]:
def perf_idx():
    return (
        (
            [
                tk.get_node("comp/comm"), 
                tk.get_node("comp_large"),
                tk.get_node("comm_large")
            ]
        ), metric_cols
    )

def presence_idx():
    return (
        (
            [
                tk.get_node("comp_small"),
                tk.get_node("comm_small"),
            ]
        ), [
            "Node presence"
        ]
    )

### 4E. Filter Features with NaN Values

Here we check one last time for missing data points in each of the slices we just defined. `any_nan_rows_series` is a boolean series with one value per row of the slice, `True` where that row has any missing data point. We collect the profiles those rows belong to and use the `Thicket.filter_profile()` function once again to filter them out.
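
The NaN check can be sketched with plain pandas on a made-up table, assuming the same (node, profile) index layout:

```python
import pandas as pd

nan = float("nan")
index = pd.MultiIndex.from_product(
    [["comm_large", "comp_large"], ["p0", "p1", "p2"]], names=["node", "profile"]
)
# Hypothetical metrics; profile p1 is missing one value
df = pd.DataFrame(
    {
        "Min time/rank": [1.0, nan, 3.0, 4.0, 5.0, 6.0],
        "Max time/rank": [2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    },
    index=index,
)

# True for every row that has at least one missing metric
row_has_nan = df.isna().any(axis=1)
nan_profiles = set(df[row_has_nan].index.get_level_values("profile"))

# Drop every row of an affected profile, not only the row holding the NaN
clean = df[~df.index.get_level_values("profile").isin(nan_profiles)]
print(sorted(nan_profiles), len(clean))
```

Removing the whole profile keeps every remaining feature vector complete.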

In [11]:
print(f"Total profiles before dropping NaNs: {len(tk.profile)}")
nan_profs = []
for idx in [perf_idx(), presence_idx()]:
    any_nan_rows_series = tk.dataframe.loc[idx].isna().any(axis=1)
    nan_profs.extend(tk.dataframe.loc[idx][any_nan_rows_series].index.get_level_values("profile").unique())
tk = tk.filter_profile([p for p in tk.profile if p not in nan_profs])
print(f"Total profiles after dropping NaNs: {len(tk.profile)}")

Total profiles before dropping NaNs: 9406
Total profiles after dropping NaNs: 9037


## 5. Remove Anomalies 

In [12]:
# Omitting

## 6. Write Model Data

Lastly, we shuffle the data using `pandas.DataFrame.sample()` and pickle the Thicket object. The next notebook, Part B, picks up from this file to create classification models using the performance data.
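
A minimal sketch of the shuffle and pickle round trip on a plain DataFrame. The table, the file name, and the `random_state` argument are assumptions added here for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical feature table standing in for the performance data
df = pd.DataFrame({"feature": range(6)}, index=[f"p{i}" for i in range(6)])

# frac=1.0 returns every row in random order; random_state makes the order repeatable
shuffled = df.sample(frac=1.0, random_state=0)

# Round-trip through a pickle file to confirm nothing is lost
path = os.path.join(tempfile.mkdtemp(), "modeldata.pkl")
shuffled.to_pickle(path)
restored = pd.read_pickle(path)
print(restored.equals(shuffled))
```

The cell below shuffles without a seed, so the row order differs between runs. Passing a fixed `random_state` is one option if a reproducible order is wanted.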

In [13]:
# Print how many profiles for each sorting algorithm
algs = tk.metadata.reset_index().groupby("Algorithm")
for name, data in algs:
    print(f"Algorithm: {name} has {len(data)} data points")

# Shuffle the data
tk.dataframe = tk.dataframe.sample(frac=1.0)
# Set useful attributes
tk.perf_idx = perf_idx()
tk.presence_idx = presence_idx()
# Write thicket to file
tk.to_pickle("thicket-modeldata.pkl")

Algorithm: BitonicSort has 1761 data points
Algorithm: MergeSort has 2322 data points
Algorithm: OddEvenSort has 2078 data points
Algorithm: RadixSort has 591 data points
Algorithm: SampleSort has 2285 data points
