# RQ1: Feasibility Study

---

This notebook trains, tunes (via hyperparameter optimization), and evaluates the feasibility neural network model, which predicts the most comparison-efficient sorting algorithm for a sequence.

- The **most comparison-efficient sorting algorithm** for each sequence is determined, serving as the target label.  
- As input features, the **presortedness metrics** are computed over the entire sequence for each metric under consideration.
- The **number of comparisons required** is not taken into account in this setting, as the primary objective is to demonstrate that sorting algorithms can be predicted from presortedness metrics alone.  
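
To make the input features concrete, the following is a minimal sketch of how such presortedness metrics could be computed on a full sequence. The function names and the exact definitions (for example, counting runs as one plus the number of descents) are assumptions for illustration; the definitions used by `intellisorts_training_set` may differ.

```python
from bisect import bisect_right


def runs(seq):
    """Number of maximal non-decreasing runs (1 for a sorted, non-empty sequence)."""
    if not seq:
        return 0
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a > b)


def inversions(seq):
    """Number of pairs (i, j) with i < j and seq[i] > seq[j]. O(n^2), fine for n = 200."""
    n = len(seq)
    return sum(1 for i in range(n) for j in range(i + 1, n) if seq[i] > seq[j])


def deletions(seq):
    """Minimum number of elements to remove so that the remainder is non-decreasing."""
    tails = []  # tails[k] = smallest tail of a non-decreasing subsequence of length k + 1
    for x in seq:
        pos = bisect_right(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(seq) - len(tails)


def inversion_distance(seq):
    """Largest index distance j - i over all inversions (0 if there are none)."""
    n = len(seq)
    return max(
        (j - i for i in range(n) for j in range(i + 1, n) if seq[i] > seq[j]),
        default=0,
    )


example = [1, 3, 2, 4, 5]  # hypothetical, nearly sorted sequence
features = {
    "Runs": runs(example),
    "Inversions": inversions(example),
    "Deletions": deletions(example),
    "InversionDistance": inversion_distance(example),
}
print(features)
```

All four values are zero-based measures of disorder except `Runs`, which is 1 for a sorted sequence under the definition assumed here.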

The **input file** is:  
- `dataset_sequences/dataset_sequences_200.pkl`, containing the raw sequences.

The **output files** computed by this notebook are:  
- `rq1_dataset/dataset_training_rq1_200.csv`, containing presortedness features and labels.
- `rq1_model/model_rq1...`, the best model.

---

## Requirements

Installs the exact package versions required for this notebook into the current Jupyter kernel.

In [None]:
%pip install scikit-learn==1.7.0
%pip install pandas==2.2.3
%pip install numpy==1.26.0
%pip install tensorflow==2.15.0
%pip install matplotlib==3.10.1

## Create Dataset for Training, Validation, and Test

To train the model, we determine the **most comparison-efficient sorting algorithm** (*Introsort*, *Insertionsort*, *Mergesort*, *Timsort*, *Quicksort*) for each sequence, which serves as the target label. As input features, we use the **presortedness metrics** (*Runs*, *Deletions*, *Inversions*, and *Inversion Distance*), computed over the full sequence: the identity sampling strategy is used with a sample size equal to the sequence length.  

For the evaluation of the model, we additionally record the **number of comparisons required** to compute the presortedness metrics, as this reflects the computational overhead associated with feature extraction.  

In [None]:
import os
import pickle
import pandas as pd

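# Import only the names used in this notebook instead of a wildcard import
from intellisorts_training_set import (
    compute_training_data_extended,
    sampling_strategy_identity,
)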

dataset_path = 'dataset_sequences/dataset_sequences_200.pkl'

print(f"Loading sequences {dataset_path} ...")

with open(dataset_path, 'rb') as f:
    dataset_200_dfs = pickle.load(f)

print("Sequences loaded")

In [None]:
# RQ1: presortedness computed on the full array
# (identity sampling, sample size equal to the sequence length)
df_results = compute_training_data_extended(
    dataset_dfs=dataset_200_dfs,
    min_length=200,
    max_length=200,
    sampling_strategy=sampling_strategy_identity,
    sample_size=200,
)

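print("Dataset D200:", len(df_results))

# Create the output directory if it does not exist yet, then store the dataset
os.makedirs('rq1_dataset', exist_ok=True)
df_results.to_csv('rq1_dataset/dataset_training_rq1_200.csv')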

## Train and Evaluate Neural Network Model

This step trains the model and optimizes its hyperparameters in a grid search. It then shows an algorithm prediction summary that compares the **actual vs. predicted counts** for each sorting algorithm in the test set and reports the number of **true positives** per class.
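
As an illustration of what such a prediction summary contains, the sketch below tallies actual counts, predicted counts, and true positives per class. The function `prediction_summary` and the example labels are hypothetical; the summary produced by `grid_search` may be formatted differently.

```python
from collections import Counter


def prediction_summary(true_labels, predicted_labels):
    """Per-class actual count, predicted count, and number of true positives."""
    actual = Counter(true_labels)
    predicted = Counter(predicted_labels)
    true_positives = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    classes = sorted(set(actual) | set(predicted))
    return {
        c: {
            "actual": actual[c],
            "predicted": predicted[c],
            "true_positives": true_positives[c],
        }
        for c in classes
    }


# Hypothetical labels, for illustration only
true_labels = ["Timsort", "Timsort", "Mergesort", "Quicksort"]
predicted_labels = ["Timsort", "Mergesort", "Mergesort", "Quicksort"]

for algorithm, row in prediction_summary(true_labels, predicted_labels).items():
    print(algorithm, row)
```

A class whose predicted count is much higher than its true positives is being over-predicted, which the overall accuracy alone does not reveal.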

In [None]:
import pandas as pd

from intellisorts_model_training import grid_search

# load training dataset
df_results = pd.read_csv('rq1_dataset/dataset_training_rq1_200.csv')

# features
train_input = df_results[['Inversions', 'Deletions', 'Runs', 'InversionDistance']]
train_output = df_results['Algorithm']

# perform grid search
param_grid = {
    'batch_size': [512],
    'epochs': [500],
    'layers': [0,1,2,3,4,5,6,7,8,9],
    'layersize': [1,2,3,4,5,6,7,8,9,10]
}

(
    best_model,
    scaler,
    label_encoder,
    test_accuracy,
    test_indices,
    test_true_algorithms,
    test_predicted_algorithms
) = grid_search(
    'rq1_model/model_rq1',
    train_input,
    train_output,
    param_grid
)


## Average Comparisons Analysis

This section evaluates the **average number of comparisons** required by:

- The **minimum possible**: the lower bound among the considered algorithms, obtained by always choosing the most comparison-efficient one for each sequence
- The **prediction model**
- The classical sorting algorithms (Timsort, Mergesort, Insertionsort, Introsort, Quicksort)

The results are visualized on a logarithmic scale to account for the large differences in magnitude.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

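# Select the test rows and renumber them 0..n-1 so that row positions
# line up with the prediction arrays returned by grid_search
test_set_df = df_results.iloc[test_indices].reset_index(drop=True)

# Comparisons of the predicted and of the truly best algorithm per test sequence
predicted_comparisons = []
optimal_comparisons = []

for index, row in test_set_df.iterrows():
    predicted_comparisons.append(row[test_predicted_algorithms[index] + "_Comparisons"])
    optimal_comparisons.append(row[test_true_algorithms[index] + "_Comparisons"])

minimum_possible = round(np.mean(optimal_comparisons), 1)
algorithm_prediction_model = round(np.mean(predicted_comparisons), 1)

def calculate_average(column_name):
    """Average comparison count of one algorithm over the test set."""
    return round(test_set_df[column_name].mean(), 1)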

merge_sort_avg = calculate_average('Mergesort_Comparisons')
timsort_avg = calculate_average('Timsort_Comparisons')
introsort_avg = calculate_average('Introsort_Comparisons')
quick_sort_avg = calculate_average('Quicksort_Comparisons')
insertion_sort_avg = calculate_average('Insertionsort_Comparisons')

data = {
    'Algorithm': ['Optimal', 'Predicted','Timsort', 'Mergesort', 'Insertionsort', 'Introsort', 'Quicksort'],
    'Average Value': [minimum_possible, algorithm_prediction_model, timsort_avg, merge_sort_avg, insertion_sort_avg, introsort_avg, quick_sort_avg]
}
df_average_comp = pd.DataFrame(data)

# Plot
plt.figure(figsize=(8, 5))
plt.bar(df_average_comp['Algorithm'], df_average_comp['Average Value'], color='skyblue')
plt.title('Average Comparisons of Sorting Algorithms')

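# Place each label slightly above its bar; a multiplicative offset keeps the
# gap visually constant on the logarithmic axis
for i, val in enumerate(df_average_comp['Average Value']):
    plt.text(i, val * 1.05, str(val), ha='center', va='bottom')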

plt.ylabel('Average Comparisons (Logarithmic Scale)')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--')
plt.yscale('log')
plt.tight_layout()
plt.savefig("figure_rq1_avg_comparisons.svg", format='svg')
plt.show()

print('Optimal: ', minimum_possible)
print('Predicted: ', algorithm_prediction_model)
print()
print('Mergesort: ', merge_sort_avg)
print('Timsort: ', timsort_avg)
print('Introsort: ', introsort_avg)
print('Quicksort: ', quick_sort_avg)
print('Insertionsort: ', insertion_sort_avg)