# RQ2: Feature Engineering

---

This notebook investigates which **combination of presortedness metrics** provides the most efficient prediction. **Sampled metrics** are employed rather than full-sequence metrics.

- Specifically, a **sample size of 10** is used, with values drawn at intervals of 20 positions in the sequence.  
- For the evaluation of presortedness metric combinations, the **number of comparisons required** to compute the metrics is recorded, as this reflects the computational overhead of feature extraction.  
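
The sampling described above can be sketched as follows. This is a minimal illustration, assuming that "intervals of 20 positions" means taking every 20th element starting at index 0. The helper name `sample_evenly` is hypothetical and is not the `sampling_strategy_evenly` function used later in this notebook, whose exact behavior may differ.

```python
from typing import List, Sequence


def sample_evenly(sequence: Sequence[int], sample_size: int) -> List[int]:
    """Return up to `sample_size` values taken at a fixed stride (hypothetical helper)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    # Stride between sampled positions, e.g. 200 // 10 == 20.
    step = max(len(sequence) // sample_size, 1)
    return list(sequence[::step][:sample_size])


sequence = list(range(200, 0, -1))  # a descending sequence of length 200
sample = sample_evenly(sequence, sample_size=10)
print(len(sample))  # 10
print(sample[:3])   # [200, 180, 160]
```

With a sequence of length 200 and a sample size of 10, the stride is 20, so the sample covers the whole sequence rather than only a prefix.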

The **input file** is:  
- the `dataset_sequences/dataset_sequences_200.pkl` containing the raw sequences.

The **output files** computed by this notebook are:  
- the `rq2_dataset/dataset_training_rq2_200.csv` containing presortedness features and labels.
- the `rq2_model/model_rq2...` files containing the best model for each presortedness combination.

---

## Requirements

Installs the exact versions of the packages required for this notebook into the current Jupyter kernel.

In [None]:
%pip install scikit-learn==1.7.0
%pip install pandas==2.2.3
%pip install numpy==1.26.0
%pip install tensorflow==2.15.0
%pip install matplotlib==3.10.1

## Create Dataset for Training, Validation, and Test

In order to train the model, it is necessary to determine the **most comparison-efficient sorting algorithm** (*Introsort*, *Insertionsort*, *Mergesort*, *Timsort*, *Quicksort*) for each sequence, which serves as the target label. As input features, we employ the **sampled presortedness metrics** (*Runs*, *Deletions*, *Inversions*, and *Inversion Distance*).  
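
To illustrate how such a target label could be derived, the sketch below picks, for each sequence, the algorithm with the fewest comparisons. The comparison counts are toy values chosen for the example, and the function name `most_efficient_algorithm` is hypothetical. In this notebook the labelling is presumably handled inside `compute_training_data_extended`.

```python
from typing import Dict, List

# Toy comparison counts for three sequences (illustrative values only).
comparisons_per_sequence: List[Dict[str, int]] = [
    {"Insertionsort": 199, "Mergesort": 800, "Quicksort": 1500},
    {"Insertionsort": 9950, "Mergesort": 1290, "Quicksort": 1480},
    {"Insertionsort": 4100, "Mergesort": 1250, "Quicksort": 1190},
]


def most_efficient_algorithm(comparisons: Dict[str, int]) -> str:
    """Return the algorithm with the fewest comparisons (first one wins on ties)."""
    return min(comparisons, key=comparisons.get)


labels = [most_efficient_algorithm(c) for c in comparisons_per_sequence]
print(labels)  # ['Insertionsort', 'Mergesort', 'Quicksort']
```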

For the evaluation of the model, we additionally record the **number of comparisons required** to compute the presortedness metrics, as this reflects the computational overhead associated with feature extraction.  
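
The following sketch shows how a presortedness metric can be computed on a sample while counting the comparisons it needs. It is an illustration under stated assumptions, not the implementation in `intellisorts_training_set`: Runs is taken here as the number of descents (some definitions add one), Inversions is counted by brute force, and both function names are hypothetical.

```python
from typing import List, Tuple


def runs_with_comparisons(values: List[int]) -> Tuple[int, int]:
    """Count descents (boundaries between ascending runs) and the comparisons used."""
    comparisons = 0
    descents = 0
    for left, right in zip(values, values[1:]):
        comparisons += 1
        if left > right:
            descents += 1
    return descents, comparisons


def inversions_with_comparisons(values: List[int]) -> Tuple[int, int]:
    """Count pairs (i < j) with values[i] > values[j] by brute force."""
    comparisons = 0
    inversions = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            comparisons += 1
            if values[i] > values[j]:
                inversions += 1
    return inversions, comparisons


sample = [3, 1, 2, 5, 4, 6, 9, 7, 8, 10]  # a sample of size 10
print(runs_with_comparisons(sample))        # (3, 9)
print(inversions_with_comparisons(sample))  # (5, 45)
```

For a sample of size 10, this variant of Runs needs 9 comparisons while brute-force Inversions needs 45, which is why the feature-extraction cost depends on the chosen metric combination.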

In [None]:
import os
import pickle
import pandas as pd

from intellisorts_training_set import *

dataset_path = 'dataset_sequences/dataset_sequences_200.pkl'

print(f"Loading sequences {dataset_path} ...")

with open(dataset_path, 'rb') as f:
    dataset_200_dfs = pickle.load(f)

print("Sequences loaded")

In [None]:
# RQ2 distributed sampling strategy for presortedness combination assessment
df_results = compute_training_data_extended(
    dataset_dfs = dataset_200_dfs,
    min_length = 200,
    max_length = 200,
    sampling_strategy = sampling_strategy_evenly,
    sample_size = 10
)

print("Dataset D200:", len(df_results))
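# make sure the output directory exists before writing the CSV
os.makedirs('rq2_dataset', exist_ok=True)
df_results.to_csv('rq2_dataset/dataset_training_rq2_200.csv')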

## Train and Evaluate Neural Network Model

This section trains the model, with hyperparameter optimization performed in a grid search. Finally, an algorithm prediction summary is shown that compares the **actual vs. predicted counts** for each sorting algorithm in the test set and reports the number of **true positives** per class.
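
A prediction summary of this kind can be sketched as below. This is a minimal stand-alone illustration with toy labels. The function `prediction_summary` is hypothetical, and the summary actually printed by `grid_search` may be formatted differently.

```python
from collections import Counter
from typing import Dict, List


def prediction_summary(actual: List[str], predicted: List[str]) -> Dict[str, Dict[str, int]]:
    """Per class: actual count, predicted count, and number of true positives."""
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    true_positives = Counter(a for a, p in zip(actual, predicted) if a == p)
    classes = sorted(set(actual) | set(predicted))
    return {
        name: {
            "actual": actual_counts[name],
            "predicted": predicted_counts[name],
            "true_positives": true_positives[name],
        }
        for name in classes
    }


# Toy labels (illustrative only)
actual = ["Timsort", "Timsort", "Quicksort", "Mergesort", "Quicksort"]
predicted = ["Timsort", "Quicksort", "Quicksort", "Mergesort", "Timsort"]

summary = prediction_summary(actual, predicted)
for name, row in summary.items():
    print(name, row)

# Accuracy is the share of true positives over all test samples.
accuracy = sum(row["true_positives"] for row in summary.values()) / len(actual)
print(accuracy)  # 0.6
```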

In [None]:
import pandas as pd

from intellisorts_model_training import grid_search

# load training dataset
df_results = pd.read_csv('rq2_dataset/dataset_training_rq2_200.csv')

# combinations (Rem = Deletions, Inv = Inversions, Dis = InversionDistance)
runs_rem = ['Runs', 'Deletions']
inv_dis = ['Inversions', 'InversionDistance']
inv_dis_runs = ['Inversions', 'InversionDistance', 'Runs']
inv_dis_rem = ['Inversions', 'InversionDistance', 'Deletions']
inv_dis_runs_rem = ['Inversions', 'InversionDistance', 'Runs', 'Deletions']

# features
train_input_runs_rem = df_results[runs_rem]
train_input_inv_dis = df_results[inv_dis]
train_input_inv_dis_runs = df_results[inv_dis_runs]
train_input_inv_dis_rem = df_results[inv_dis_rem]
train_input_inv_dis_runs_rem = df_results[inv_dis_runs_rem]

train_output = df_results['Algorithm']

# hyperparameter grid used by the grid searches below
param_grid = {
    'batch_size': [512],
    'epochs': [500],
    'layers': [8],     # other candidates: [4, 5, 6, 7, 8]
    'layersize': [8],  # other candidates: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}

### Train Model for Combination: Runs, Rem

In [None]:
# combination: Runs, Rem
(
    best_model_runs_rem,
    scaler_runs_rem,
    label_encoder_runs_rem,
    test_accuracy_runs_rem,
    test_indices_runs_rem,
    test_true_algorithms_runs_rem,
    test_predicted_algorithms_runs_rem
) = grid_search(
    'rq2_model/model_rq2_runs_rem',
    train_input_runs_rem,
    train_output,
    param_grid
)

### Train Model for Combination: Inv, Dis

In [None]:
# combination: Inv, Dis
(
    best_model_inv_dis,
    scaler_inv_dis,
    label_encoder_inv_dis,
    test_accuracy_inv_dis,
    test_indices_inv_dis,
    test_true_algorithms_inv_dis,
    test_predicted_algorithms_inv_dis
) = grid_search(
    'rq2_model/model_rq2_inv_dis',
    train_input_inv_dis,
    train_output,
    param_grid
)

### Train Model for Combination: Inv, Dis, Runs

In [None]:
# combination: Inv, Dis, Runs
(
    best_model_inv_dis_runs,
    scaler_inv_dis_runs,
    label_encoder_inv_dis_runs,
    test_accuracy_inv_dis_runs,
    test_indices_inv_dis_runs,
    test_true_algorithms_inv_dis_runs,
    test_predicted_algorithms_inv_dis_runs
) = grid_search(
    'rq2_model/model_rq2_inv_dis_runs',
    train_input_inv_dis_runs,
    train_output,
    param_grid
)

### Train Model for Combination: Inv, Dis, Rem

In [None]:
# combination: Inv, Dis, Rem
(
    best_model_inv_dis_rem,
    scaler_inv_dis_rem,
    label_encoder_inv_dis_rem,
    test_accuracy_inv_dis_rem,
    test_indices_inv_dis_rem,
    test_true_algorithms_inv_dis_rem,
    test_predicted_algorithms_inv_dis_rem
) = grid_search(
    'rq2_model/model_rq2_inv_dis_rem',
    train_input_inv_dis_rem,
    train_output,
    param_grid
)

### Train Model for Combination: Inv, Dis, Runs, Rem

In [None]:
# combination: Inv, Dis, Runs, Rem
(
    best_model_inv_dis_runs_rem,
    scaler_inv_dis_runs_rem,
    label_encoder_inv_dis_runs_rem,
    test_accuracy_inv_dis_runs_rem,
    test_indices_inv_dis_runs_rem,
    test_true_algorithms_inv_dis_runs_rem,
    test_predicted_algorithms_inv_dis_runs_rem
) = grid_search(
    'rq2_model/model_rq2_inv_dis_runs_rem',
    train_input_inv_dis_runs_rem,
    train_output,
    param_grid
)

## Results

This section reports, for each metric combination, the average number of comparisons needed for the sorting algorithm prediction alongside the test accuracy of the best model after hyperparameter tuning.

In [None]:
import pandas as pd
import numpy as np


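def avg_comparisons_of_predicted_algorithms(df_results, test_indices, test_predicted_algorithms):
    """Average number of comparisons the predicted algorithms need on the test set."""
    # reset_index returns a new frame, so no in-place change is made on a slice of df_results
    test_set_df = df_results.iloc[test_indices].reset_index(drop=True)

    # for each test sequence, look up the comparison count of its predicted algorithm
    comparisons = [
        row[test_predicted_algorithms[index] + "_Comparisons"]
        for index, row in test_set_df.iterrows()
    ]

    return round(np.sum(comparisons) / len(test_set_df), 1)


def avg_comparisons_of_presortednes_computation(df_results, combinations):
    """Mean of the per-metric comparison counts over all sequences and all metrics in the combination."""
    return df_results[[item + '_Comparisons' for item in combinations]].stack().mean()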

    
# Store results in a list of dicts
results = []

# Feature list and grid search results for each combination
combinations = [
    ("runs_rem", runs_rem, test_accuracy_runs_rem, test_indices_runs_rem, test_predicted_algorithms_runs_rem),
    ("inv_dis", inv_dis, test_accuracy_inv_dis, test_indices_inv_dis, test_predicted_algorithms_inv_dis),
    ("inv_dis_runs", inv_dis_runs, test_accuracy_inv_dis_runs, test_indices_inv_dis_runs, test_predicted_algorithms_inv_dis_runs),
    ("inv_dis_rem", inv_dis_rem, test_accuracy_inv_dis_rem, test_indices_inv_dis_rem, test_predicted_algorithms_inv_dis_rem),
    ("inv_dis_runs_rem", inv_dis_runs_rem, test_accuracy_inv_dis_runs_rem, test_indices_inv_dis_runs_rem, test_predicted_algorithms_inv_dis_runs_rem),
]

# Loop and compute metrics for each combination (feature lists are passed directly instead of via eval)
for name, features, test_acc, test_idx, test_pred in combinations:
    avg_presorted = avg_comparisons_of_presortednes_computation(df_results, features)
    avg_predicted = avg_comparisons_of_predicted_algorithms(df_results, test_idx, test_pred)
    total_comparisons = avg_presorted + avg_predicted  # sum of both
    
    results.append({
        "Combination": name,
        "Test Accuracy": test_acc,
        "Avg Comparisons (Presortedness)": round(avg_presorted, 2),
        "Avg Comparisons (Predicted Algo)": avg_predicted,
        "Total Avg Comparisons": round(total_comparisons, 2)
    })

# Create the summary table
df_summary = pd.DataFrame(results)

# Print it nicely
print(df_summary.to_string(index=False))
