# Final Model Training and Evaluation

---

This notebook performs the training, hyperparameter optimization, and evaluation of the final neural network model for predicting the most comparison-efficient sorting algorithm.

The **input file** is:  
- `dataset_sequences/dataset_sequences_10k.pkl`, which contains the raw sequences.

The **output files** produced by this notebook are:  
- `rq4_dataset/dataset_training_rq4_400p.csv`, which contains the presortedness features and labels.
- `rq4_model/model_rq4...`, the **final model**.

---

## Requirements

Installs the exact versions of the packages required by this notebook into the current Jupyter kernel.
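
After installation, the pinned versions can be checked from Python. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `find_version_mismatches` is hypothetical and not part of this notebook's modules.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install cell of this notebook.
PINNED = {
    "scikit-learn": "1.7.0",
    "pandas": "2.2.3",
    "numpy": "1.26.0",
    "tensorflow": "2.15.0",
    "matplotlib": "3.10.1",
    "seaborn": "0.13.2",
}


def find_version_mismatches(pinned):
    """Return {package: (expected, installed_or_None)} for every deviating package."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in find_version_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty output means that every package matches its pin.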

In [None]:
%pip install scikit-learn==1.7.0
%pip install pandas==2.2.3
%pip install numpy==1.26.0
%pip install tensorflow==2.15.0
%pip install matplotlib==3.10.1
%pip install seaborn==0.13.2

## Create Datasets for Training, Validation, and Test

To train the model, we first determine the **most comparison-efficient sorting algorithm** (*Insertionsort*, *Mergesort*, *Timsort*) for each sequence; this algorithm serves as the target label. As input features, we use the **sampled presortedness metrics** (*Runs* and *Deletions*) together with the sequence length.  
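
The labelling idea can be illustrated with a small, self-contained sketch: each algorithm is run with a comparator that counts its calls, and the algorithm with the fewest comparisons becomes the label. This is not the implementation in `intellisorts_training_set`. The helper names are hypothetical, the insertion sort and merge sort are plain textbook variants, and CPython's built-in sort stands in for Timsort here as an assumption.

```python
from functools import cmp_to_key


class ComparisonCounter:
    """Three-way comparator that counts how often it is called."""

    def __init__(self):
        self.count = 0

    def __call__(self, a, b):
        self.count += 1
        return (a > b) - (a < b)


def insertion_sort_comparisons(seq):
    cmp, data = ComparisonCounter(), list(seq)
    for i in range(1, len(data)):
        item, j = data[i], i - 1
        # Shift larger elements right until the slot for `item` is found.
        while j >= 0 and cmp(data[j], item) > 0:
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = item
    return cmp.count


def merge_sort_comparisons(seq):
    cmp = ComparisonCounter()

    def sort(data):
        if len(data) <= 1:
            return data
        mid = len(data) // 2
        left, right = sort(data[:mid]), sort(data[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if cmp(left[i], right[j]) <= 0:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        return merged + left[i:] + right[j:]

    sort(list(seq))
    return cmp.count


def builtin_sort_comparisons(seq):
    # Assumption: CPython's built-in sort is used as a stand-in for Timsort.
    cmp = ComparisonCounter()
    sorted(seq, key=cmp_to_key(cmp))
    return cmp.count


def label_sequence(seq):
    """Return (label, counts); ties go to the first entry in `counts`."""
    counts = {
        "Insertionsort": insertion_sort_comparisons(seq),
        "Mergesort": merge_sort_comparisons(seq),
        "Timsort": builtin_sort_comparisons(seq),
    }
    return min(counts, key=counts.get), counts
```

Calling `label_sequence(list(range(50)))` returns the label together with the per-algorithm counts, which is convenient when the counts are stored next to the label. The tie-breaking rule is an arbitrary choice of this sketch.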

For the evaluation of the model, we additionally record the **number of comparisons required** to compute the presortedness metrics, as this reflects the computational overhead associated with feature extraction.  
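
To make the overhead concrete, the sketch below computes both metrics on a sample and returns the number of element comparisons spent. It assumes the textbook definitions: *Runs* as the number of adjacent pairs that are out of order, and *Deletions* as the sample length minus the length of a longest non-decreasing subsequence. The definitions and the sampling used by `compute_training_data` may differ, and the function names are hypothetical.

```python
def runs_with_cost(sample):
    """Return (number of descents, comparisons used)."""
    comparisons = 0
    descents = 0
    for left, right in zip(sample, sample[1:]):
        comparisons += 1
        if left > right:
            descents += 1
    return descents, comparisons


def deletions_with_cost(sample):
    """Return (len(sample) - longest non-decreasing subsequence, comparisons used)."""
    comparisons = 0
    # tails[k] is the smallest possible last element of a
    # non-decreasing subsequence of length k + 1 seen so far.
    tails = []
    for value in sample:
        lo, hi = 0, len(tails)
        # Binary search for the first tail strictly greater than `value`.
        while lo < hi:
            mid = (lo + hi) // 2
            comparisons += 1
            if tails[mid] <= value:
                lo = mid + 1
            else:
                hi = mid
        if lo == len(tails):
            tails.append(value)
        else:
            tails[lo] = value
    return len(sample) - len(tails), comparisons


sample = [3, 1, 2, 0]  # made-up sample for illustration
runs, runs_cost = runs_with_cost(sample)
deletions, deletions_cost = deletions_with_cost(sample)
print(runs, deletions, runs_cost + deletions_cost)
```

The sum of both costs is the feature-extraction overhead that would be charged to the model in addition to the comparisons of the chosen sorting algorithm.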

In [None]:
import os
import pickle
import pandas as pd

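# Import only the names used in this notebook instead of a wildcard import
from intellisorts_training_set import compute_training_data, sampling_strategy_hybrid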

dataset_path = 'dataset_sequences/dataset_sequences_10k.pkl'

print(f"Loading sequences {dataset_path} ...")

with open(dataset_path, 'rb') as f:
    dataset_10k_dfs = pickle.load(f)

print("Sequences loaded")

In [None]:
# RQ4 sampling for final model
df_results = compute_training_data(
    dataset_dfs = dataset_10k_dfs,
    min_length = 400,
    max_length = 10000,
    sampling_strategy = sampling_strategy_hybrid,
    sample_size = 40
)

print()
print("Dataset D400+:", len(df_results))
os.makedirs('rq4_dataset', exist_ok=True)  # make sure the output directory exists
df_results.to_csv('rq4_dataset/dataset_training_rq4_400p.csv')

## Train and Evaluate Neural Network 

This section trains the model and optimizes its hyperparameters with a grid search. It then shows an algorithm prediction summary that compares the **actual vs. predicted counts** for each sorting algorithm in the test set and reports the number of **true positives** per class.
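
How `grid_search` walks through the grid internally is not shown in this notebook. Assuming an exhaustive search, the candidate configurations are the Cartesian product of the parameter lists, which can be sketched with the standard library as follows (the helper `iter_configurations` is hypothetical; the grid values are the ones used in the cell below).

```python
from itertools import product

param_grid = {
    "batch_size": [512],
    "epochs": [500],
    "layers": list(range(0, 11)),     # 0 to 10
    "layersize": list(range(1, 11)),  # 1 to 10
}


def iter_configurations(grid):
    """Yield one dict per point of the Cartesian product of the grid values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


configurations = list(iter_configurations(param_grid))
print(len(configurations))  # 1 * 1 * 11 * 10 = 110
```

With a single batch size and epoch count, the search cost is driven entirely by the 110 combinations of `layers` and `layersize`.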

In [None]:
import pandas as pd

from intellisorts_model_training import grid_search

# load training dataset
df_results = pd.read_csv('rq4_dataset/dataset_training_rq4_400p.csv')

# input features and target label
train_input = df_results[['Deletions', 'Runs', 'SequenceLength']]
train_output = df_results['Algorithm']

# perform grid search
param_grid = {
    'batch_size': [512],
    'epochs': [500],
    'layers': [0,1,2,3,4,5,6,7,8,9,10],
    'layersize': [1,2,3,4,5,6,7,8,9,10]
}

(
    best_model,
    scaler,
    label_encoder,
    test_accuracy,
    test_indices,
    test_true_algorithms,
    test_predicted_algorithms
) = grid_search(
    'rq4_model/model_rq4',
    train_input,
    train_output,
    param_grid
)

## Analysis of Misclassifications

This section examines the model’s errors by comparing the **true labels** with the **predicted labels** for the test set.

The figure visualizes, for each misclassified sequence, the number of comparisons associated with the true algorithm (left bar) and the predicted algorithm (right bar).
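
The cost of a misclassification is the number of extra comparisons incurred by following the prediction instead of the true label. A minimal sketch of this computation on plain dictionaries is given below. The `<Algorithm>_Comparisons` column naming follows the pattern used in this notebook, the rows are made up for illustration, and the helper name is hypothetical.

```python
def excess_comparisons(row):
    """Extra comparisons paid for using the predicted instead of the true algorithm."""
    predicted_cost = row[row["y_pred"] + "_Comparisons"]
    optimal_cost = row[row["Algorithm"] + "_Comparisons"]
    return predicted_cost - optimal_cost


# Made-up rows for illustration only; they are not taken from the dataset.
rows = [
    {"Algorithm": "Insertionsort", "y_pred": "Timsort",
     "Insertionsort_Comparisons": 90, "Mergesort_Comparisons": 140, "Timsort_Comparisons": 100},
    {"Algorithm": "Mergesort", "y_pred": "Insertionsort",
     "Insertionsort_Comparisons": 400, "Mergesort_Comparisons": 250, "Timsort_Comparisons": 260},
    {"Algorithm": "Timsort", "y_pred": "Timsort",
     "Insertionsort_Comparisons": 300, "Mergesort_Comparisons": 200, "Timsort_Comparisons": 180},
]

misclassified = [row for row in rows if row["Algorithm"] != row["y_pred"]]
ranked = sorted(misclassified, key=excess_comparisons, reverse=True)

for row in ranked:
    print(row["Algorithm"], "->", row["y_pred"], excess_comparisons(row))
```

Because the true label is the cheapest algorithm for its sequence, the excess is never negative for a correctly labelled row.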

In [None]:
import time
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib import rcParams, font_manager
import seaborn as sns

np.random.seed(42)

start_time = time.time()

X_scaled = scaler.transform(train_input)
y_pred_all = best_model.predict(X_scaled)

end_time = time.time()
elapsed_time = end_time - start_time

# The prediction above runs on all rows of train_input, not only on the test set
print("Time to predict the full dataset (seconds): " + str(elapsed_time))

y_pred_all_classes = np.argmax(y_pred_all, axis=1)
predicted_algorithms_all = label_encoder.inverse_transform(y_pred_all_classes)

df_results["y_pred"] = predicted_algorithms_all

# Copy the selection so that later column assignments do not act on a slice of df_results
test_set_df = df_results.iloc[test_indices].copy()
test_set_df.reset_index(drop=True, inplace=True)

print("size of test set: " + str(len(test_set_df)))

missclassified_df = test_set_df[test_set_df["Algorithm"] != test_set_df["y_pred"]].copy()

missclassified_df.reset_index(drop=True, inplace=True)

print("number of misclassifications: " + str(len(missclassified_df)))

missclassified_df["f"] = missclassified_df.apply(lambda row: row[row["y_pred"] + "_Comparisons"], axis=1)
missclassified_df["t"] = missclassified_df.apply(lambda row: row[row["Algorithm"] + "_Comparisons"], axis=1)

# "t" belongs to the true (most comparison-efficient) algorithm, so this difference is non-negative
missclassified_df["abs_diff"] = missclassified_df["f"] - missclassified_df["t"]

# Sort by the difference in descending order
missclassified_df = missclassified_df.sort_values(by="abs_diff", ascending=False).reset_index(drop=True)
print(missclassified_df)

unique_algorithms = list(set(missclassified_df["Algorithm"].unique()) | set(missclassified_df["y_pred"].unique()))
palette = sns.color_palette("tab10", len(unique_algorithms))
color_map = dict(zip(unique_algorithms, palette))

# Set the font size before creating the figure so that it already applies on the first run
plt.rc('font', size=24)

fig, ax = plt.subplots(figsize=(12, 6))

# Left bar: comparisons of the true algorithm
ax.bar(missclassified_df.index - 0.2, missclassified_df["t"], width=0.4, 
       color=[color_map[alg] for alg in missclassified_df["Algorithm"]])

# Right bar: comparisons of the predicted algorithm
ax.bar(missclassified_df.index + 0.2, missclassified_df["f"], width=0.4, 
       color=[color_map[alg] for alg in missclassified_df["y_pred"]])

ax.set_xticks(missclassified_df.index[::2])

legend_labels = {
    "merge_sort": "Mergesort",
    "insertion_sort": "Insertionsort",
    "timsort": "Timsort"
}

legend_patches = [
    mpatches.Patch(color=color_map[alg], label=legend_labels.get(alg, alg))
    for alg in unique_algorithms
]

ax.legend(handles=legend_patches, loc="upper right")

ax.set_xlabel("Index")
ax.set_ylabel("Comparisons")

plt.show()

## Best Model Predictions Compared to Timsort

This section highlights the **largest improvements** achieved by the model relative to Timsort.
The analysis computes, for each test instance, the absolute difference in the number of comparisons between **Timsort** and the algorithm predicted by the model. The 15 cases with the greatest improvements are selected for visualization.

The figure presents a side-by-side comparison for each selected sequence:  
- The **left bar** (blue) shows the number of comparisons required by Timsort.  
- The **right bar** (colored by algorithm type) shows the number of comparisons required by the model’s predicted algorithm.
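
An absolute difference ranks cases by magnitude only, so a prediction that is much worse than Timsort would rank as high as one that is much better. The sketch below shows a signed variant that keeps only genuine improvements. It is an alternative for illustration, not the selection used in this notebook; the rows are made up and the helper names are hypothetical.

```python
import heapq


def signed_saving(row):
    """Positive when the predicted algorithm needs fewer comparisons than Timsort."""
    return row["Timsort_Comparisons"] - row[row["y_pred"] + "_Comparisons"]


def top_improvements(rows, k=15):
    """Return the k rows with the largest positive saving over Timsort."""
    improved = [row for row in rows if signed_saving(row) > 0]
    return heapq.nlargest(k, improved, key=signed_saving)


# Made-up rows for illustration only; they are not taken from the dataset.
rows = [
    {"y_pred": "Insertionsort", "Timsort_Comparisons": 500,
     "Insertionsort_Comparisons": 120, "Mergesort_Comparisons": 450},
    {"y_pred": "Mergesort", "Timsort_Comparisons": 300,
     "Insertionsort_Comparisons": 900, "Mergesort_Comparisons": 280},
    {"y_pred": "Insertionsort", "Timsort_Comparisons": 200,
     "Insertionsort_Comparisons": 650, "Mergesort_Comparisons": 210},
    {"y_pred": "Timsort", "Timsort_Comparisons": 150,
     "Insertionsort_Comparisons": 400, "Mergesort_Comparisons": 160},
]

for row in top_improvements(rows, k=2):
    print(row["y_pred"], signed_saving(row))
```

In this toy data the third row has the largest absolute difference but is a regression, so the signed variant excludes it.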

In [None]:
# Absolute difference in comparisons between Timsort and the predicted algorithm
test_set_df["difference"] = np.abs(test_set_df["Timsort_Comparisons"] - test_set_df.apply(lambda row: row[row["y_pred"] + "_Comparisons"], axis=1))

# Select the 15 rows with the largest difference
test_df_15 = test_set_df.nlargest(15, "difference").copy()
test_df_15.reset_index(drop=True, inplace=True)

test_df_15.drop(columns=["difference"], inplace=True)

test_df_15["f"] = test_df_15.apply(lambda row: row[row["y_pred"] + "_Comparisons"], axis=1)

# Set the font size before creating the figure so that it already applies on the first run
plt.rc('font', size=24)

fig, ax = plt.subplots(figsize=(12, 6))

# Left bar: comparisons required by Timsort
ax.bar(test_df_15.index - 0.2, test_df_15["Timsort_Comparisons"], width=0.4, color=[color_map["Timsort"]])

# Right bar: comparisons required by the predicted algorithm
ax.bar(test_df_15.index + 0.2, test_df_15["f"], width=0.4, 
       color=[color_map[alg] for alg in test_df_15["y_pred"]])

legend_patches = [mpatches.Patch(color=color_map[alg], label=legend_labels.get(alg, alg))
    for alg in unique_algorithms]

ax.legend(handles=legend_patches, loc="upper left")
ax.set_xlabel("Index")
ax.set_ylabel("Comparisons")

plt.show()

## Algorithm Prediction Summary

This section compares the **actual vs. predicted counts** for each sorting algorithm in the test set and reports the number of **true positives** per class.
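
The three quantities per class (actual count, predicted count, true positives) can be derived from the two label columns alone. The following standard-library sketch illustrates this with made-up labels; the function name `summarize_predictions` is hypothetical.

```python
from collections import Counter


def summarize_predictions(true_labels, predicted_labels):
    """Return per-class actual count, predicted count, and true positives."""
    actual = Counter(true_labels)
    predicted = Counter(predicted_labels)
    true_positives = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    classes = sorted(set(actual) | set(predicted))
    return {
        cls: {
            "actual": actual[cls],
            "predicted": predicted[cls],
            "true_positives": true_positives[cls],
        }
        for cls in classes
    }


# Made-up labels for illustration only.
y_true = ["Timsort", "Mergesort", "Mergesort", "Insertionsort", "Timsort"]
y_pred = ["Timsort", "Mergesort", "Timsort", "Insertionsort", "Mergesort"]

for cls, stats in summarize_predictions(y_true, y_pred).items():
    print(cls, stats)
```

A class that is never predicted still appears in the result with a predicted count of zero, since `Counter` returns 0 for missing keys.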

In [None]:
actual_counts = test_set_df["Algorithm"].value_counts()
predicted_counts = test_set_df["y_pred"].value_counts()

correct_merge_count = ((test_set_df["y_pred"] == "Mergesort") & (test_set_df["Algorithm"] == "Mergesort")).sum()
correct_insertion_count = ((test_set_df["y_pred"] == "Insertionsort") & (test_set_df["Algorithm"] == "Insertionsort")).sum()
correct_timsort_count = ((test_set_df["y_pred"] == "Timsort") & (test_set_df["Algorithm"] == "Timsort")).sum()

summary_table = pd.DataFrame({
    "Actual Count": actual_counts,
    "Predicted Count": predicted_counts
})

print(summary_table)
print("true positives Insertionsort: " + str(correct_insertion_count))
print("true positives Mergesort: " + str(correct_merge_count))
print("true positives Timsort: " + str(correct_timsort_count))

## Overall Model Performance Relative to Timsort

This figure places the model’s performance into context by comparing it with baseline and reference values:  

- **Timsort** serves as the baseline and is normalized to 100%.  
- **Best possible** represents the theoretical lower bound obtained by always choosing the most comparison-efficient sorting algorithm.  
- **Model** reflects the performance of the final neural network model.
- The **overhead of presortedness calculation** is displayed separately as an orange bar drawn over the base of the model’s bar.  
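
The normalization behind these bars is a simple ratio to the Timsort baseline. As a small worked example, the sketch below applies it to the baseline, best-possible, and overhead values set in the cell that follows; the helper `relative_to_baseline` is hypothetical, and the unit of the raw values is not stated in this notebook.

```python
def relative_to_baseline(value, baseline):
    """Express `value` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return value / baseline * 100


timsort_value = 457            # baseline
minimum_possible_value = 390   # best possible choice per sequence
presortedness_value = 15       # overhead of computing the features

print(f"{relative_to_baseline(minimum_possible_value, timsort_value):.1f}%")  # 85.3%
print(f"{relative_to_baseline(presortedness_value, timsort_value):.2f}%")     # 3.28%
```

Under these numbers, always picking the best algorithm would need about 85.3% of Timsort's comparisons, which leaves a margin of roughly 14.7 percentage points for the model and its feature-extraction overhead.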

In [None]:
timsort_value = 457
prediction_model_value = 39
presortedness_value = 15
minimum_possible_value = 390

# Compute relative percentages
timsort_percentage = 100 
prediction_percentage = (prediction_model_value / timsort_value) * 100
presortedness_percentage = (presortedness_value / timsort_value) * 100
minimum_percentage = (minimum_possible_value / timsort_value) * 100

bar_labels = ["Timsort", "Best Possible", "Model"]
main_bar_values = [timsort_percentage, minimum_percentage, prediction_percentage]
presorted_bar_value = [0, 0, presortedness_percentage]

colors_bottom = ["none", "none", "orange"]

# Set the font size before creating the figure so that it already applies on the first run
plt.rc('font', size=20)

fig, ax = plt.subplots(figsize=(8, 6))

bars_main = ax.bar(bar_labels, main_bar_values)

# Orange overhead bar, drawn from zero over the base of the "Model" bar
bars_bottom = ax.bar(bar_labels, presorted_bar_value, color=colors_bottom)

ax.set_ylabel("Comparisons Relative to Timsort")
# ax.set_title("Sorting Algorithm Comparisons (Relative to Timsort = 100%)")

# Percentage label above each main bar
for bar, value in zip(bars_main, main_bar_values):
    ax.text(bar.get_x() + bar.get_width() / 2, value + 2, f"{value:.1f}%", ha="center", fontsize=24)

# Percentage label for the presortedness overhead
ax.text(bars_bottom[2].get_x() + bars_bottom[2].get_width() / 2, presortedness_percentage + 1,  
        f"{presortedness_percentage:.2f}%", ha="center", fontsize=12, color="black")

legend_patch = mpatches.Patch(color="orange", label="Presortedness Calculation")
ax.legend(handles=[legend_patch], loc="upper right")
ax.set_ylim(0, 130)
ax.set_yticks(np.arange(0, 101, 20))

plt.show()